Machine Learning Nomenclature
About
This post is a collection of commonly used machine learning (ML) terminologies.
Dataset
In ML, the data we work with is usually called a dataset: a collection of data. A dataset contains the features and the target to predict.
It has other names
* data
* input data
* train and test data
Instance
An instance is a row in the dataset.
It has other names
* row
* observation
* sample
* (data) point
Feature
A feature is a column in the dataset. It is used as an input for prediction or classification. Features are commonly represented by the variable x.
It has other names
* column
* attribute
* (input) variable
Features are of two types
* Categorical or qualitative
* Numerical or quantitative
Target
It is the information a machine learning algorithm learns to predict. The target is commonly represented by the variable y.
It has other names
* label
* output
Labeled Data
Data that has both the features and the target attribute defined.
Unlabeled Data
Data that has the features defined but no target attribute.
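The difference can be sketched in Python; the feature names and values below are made up purely for illustration.

```python
# Illustrative only: a labeled instance pairs features with a target,
# while an unlabeled instance carries only the features.
labeled_row = {
    "features": {"sqft": 1400, "rooms": 3},  # inputs (x)
    "target": 250000,                        # label (y)
}
unlabeled_row = {
    "features": {"sqft": 1100, "rooms": 2},  # no target attribute
}
```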
Categorical Feature
A feature that is not measurable and has a discrete set of values, like gender, family relationships, movie categories, etc. We commonly use bar charts and pie charts for categorical features.
It has other names
* qualitative feature
Categorical features are of two types
* Nominal
* Ordinal
Nominal feature
A nominal (categorical) feature is one that cannot be measured and has no order assigned to it, e.g. eye color, gender.
Ordinal feature
An ordinal (categorical) feature is one that cannot be measured but has some order assigned to it, like movie ratings, military ranks, etc.
Numerical feature
Numerical features are those that can be measured or counted and have some ascending or descending order assigned to them.
It has other names
* Continuous feature
* Quantitative feature
Numerical features can be of two types
* Discrete
* Continuous
Discrete feature
A discrete (numerical) feature is one that takes specific values and is usually counted, e.g. number of Facebook likes, number of tickets sold.
Continous feature
A continuous (numerical) feature is one that can take any value and is usually measured, e.g. temperature, wind speed.
Data
├── Categorical / Qualitative
│   ├── Nominal
│   └── Ordinal
└── Numerical / Quantitative
    ├── Discrete
    └── Continuous
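The taxonomy above can be sketched as a small lookup; the example feature names here are hypothetical, not from a real dataset.

```python
# Hypothetical examples mapped to their place in the taxonomy above.
feature_kinds = {
    "eye_color": "nominal",       # categorical, no order
    "movie_rating": "ordinal",    # categorical, ordered
    "tickets_sold": "discrete",   # numerical, counted
    "temperature": "continuous",  # numerical, measured
}

def is_categorical(kind):
    # Nominal and ordinal features are categorical / qualitative.
    return kind in {"nominal", "ordinal"}

def is_numerical(kind):
    # Discrete and continuous features are numerical / quantitative.
    return kind in {"discrete", "continuous"}
```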
Classification
If the target is categorical, the ML task is called classification.
Regression
If the target is numerical, the ML task is called regression.
Positive class
In binary classification the output class is usually labelled positive or negative. The positive class is the thing we are testing for. For example, the positive class for an email classifier is ‘spam’, and the positive class for a medical test might be ‘tumor’.
Negative class
The negative class is the opposite of the positive class. For example, the negative class for an email classifier is ‘not spam’, and the negative class for a medical test might be ‘not tumor’.
True positive (TP)
Model correctly predicted the positive class.
True negative (TN)
Model correctly predicted the negative class
False positive (FP)
Model incorrectly predicted the positive class; the actual class is negative.
It has other names
* Type I error
False negative (FN)
Model incorrectly predicted the negative class; the actual class is positive.
It has other names
* Type II error
Accuracy
It tells what fraction of all predictions were correct.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
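The formula translates directly to code; the confusion-matrix counts below are made-up numbers for illustration.

```python
def accuracy(tp, tn, fp, fn):
    # Fraction of all predictions that were correct.
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts: 40 TP, 45 TN, 5 FP, 10 FN out of 100 predictions.
print(accuracy(40, 45, 5, 10))  # 0.85
```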
Precision
It tells how accurate the positive predictions are.
Precision = TP / (TP + FP)
It is a good metric when cost of false positives is high.
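A minimal sketch of the formula, again with hypothetical counts:

```python
def precision(tp, fp):
    # Of everything the model predicted positive, the fraction
    # that actually was positive.
    return tp / (tp + fp)

# Hypothetical counts: 40 true positives, 10 false positives.
print(precision(40, 10))  # 0.8
```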
True Positive Rate (TPR)
TPR = TP / (TP + FN)
It is the probability that an actual positive class will test positive.
It has other names
* Recall
* Sensitivity
TPR is the y-axis of an ROC curve. It is a good metric when the cost of false negatives is high.
False Positive Rate (FPR)
FPR = FP / (FP + TN)
It has other names
* 1 - specificity
It is the x-axis of an ROC curve.
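And the corresponding sketch for FPR, with hypothetical counts:

```python
def false_positive_rate(fp, tn):
    # Of all actual negatives, the fraction wrongly flagged positive.
    return fp / (fp + tn)

# Hypothetical counts: 5 false positives, 45 true negatives.
print(false_positive_rate(5, 45))  # 0.1
```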
ROC Curve
Receiver Operating Characteristic (ROC) is a curve of TPR vs FPR at different classification thresholds.
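The idea of sweeping a threshold can be sketched as follows; the scores and labels are invented for illustration.

```python
# Hypothetical model scores and actual classes, for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]

def roc_point(threshold):
    # Predict positive when the score reaches the threshold,
    # then compute (FPR, TPR) from the resulting confusion counts.
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    return fp / (fp + tn), tp / (tp + fn)

# Sweeping the threshold traces the curve from (0, 0) to (1, 1).
curve = [roc_point(t) for t in (0.95, 0.75, 0.35, 0.0)]
```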
True Negative Rate (TNR)
TNR = TN / (TN + FP)
It is the probability that an actual negative class will test negative.
It has other names
* Specificity
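One more sketch with hypothetical counts; note that TNR = 1 - FPR, which is why FPR is also called "1 - specificity" above.

```python
def true_negative_rate(tn, fp):
    # Specificity: of all actual negatives, the fraction that
    # tested negative. Note TNR = 1 - FPR.
    return tn / (tn + fp)

# Hypothetical counts: 45 true negatives, 5 false positives.
print(true_negative_rate(45, 5))  # 0.9
```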