Machine Learning Nomenclature
About
This post is a collection of commonly used machine learning (ML) terminologies.
Dataset
In ML, the data we work with is usually called a dataset: a collection of data. A dataset contains the features and the target to predict.
It has other names
* data
* input data
* train and test data
Instance
An instance is a row in the dataset.
It has other names
* row
* observation
* sample
* (data) point
Feature
A feature is a column in the dataset. It is used as an input for prediction or classification. Features are commonly represented by the variable x.
It has other names
* column
* attribute
* (input) variable
Features are of two types
* Categorical or qualitative
* Numerical or quantitative
Target
It is the information a machine learning algorithm learns to predict. The target is commonly represented by the variable y.
It has other names
* label
* output
Labeled Data
Data that has both the features and the target attribute defined.
Unlabeled Data
Data that has the features defined but no target attribute.
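The difference can be sketched in Python; the feature names and values below are made up purely for illustration.

```python
# Illustrative only: a labeled instance pairs features with a target,
# while an unlabeled instance carries only the features.
labeled_row = {
    "features": {"sqft": 1400, "rooms": 3},  # inputs (x)
    "target": 250000,                        # label (y)
}
unlabeled_row = {
    "features": {"sqft": 1100, "rooms": 2},  # no target attribute
}
```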
Categorical Feature
A feature that is not measurable and has a discrete set of values, like gender, family relationships, movie categories, etc. We commonly use bar charts and pie charts for categorical features.
It has other names
* qualitative feature
Categorical features are of two types
* Nominal
* Ordinal
Nominal feature
A nominal (categorical) feature is one that cannot be measured and has no order assigned to it, e.g. eye color, gender.
Ordinal feature
An ordinal (categorical) feature is one that cannot be measured but has some order assigned to it, like movie ratings, military ranks, etc.
Numerical feature
Numerical features are those that can be measured or counted and have some ascending or descending order assigned to them.
It has other names
* Continuous feature
* Quantitative feature
Numerical features can be of two types
* Discrete
* Continuous
Discrete feature
A discrete (numerical) feature is one that takes specific values and is usually counted, e.g. number of Facebook likes, number of tickets sold.
Continous feature
A continuous (numerical) feature is one that can take any value and is usually measured, e.g. temperature, wind speed.
Data
├── Categorical / Qualitative
│   ├── Nominal
│   └── Ordinal
└── Numerical / Quantitative
    ├── Discrete
    └── Continuous
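The taxonomy above can be sketched as a small lookup; the example feature names here are hypothetical, not from a real dataset.

```python
# Hypothetical examples mapped to their place in the taxonomy above.
feature_kinds = {
    "eye_color": "nominal",       # categorical, no order
    "movie_rating": "ordinal",    # categorical, ordered
    "tickets_sold": "discrete",   # numerical, counted
    "temperature": "continuous",  # numerical, measured
}

def is_categorical(kind):
    # Nominal and ordinal features are categorical / qualitative.
    return kind in {"nominal", "ordinal"}

def is_numerical(kind):
    # Discrete and continuous features are numerical / quantitative.
    return kind in {"discrete", "continuous"}
```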
Classification
If the target is categorical, the ML task is called classification.
Regression
If the target is numerical, the ML task is called regression.
Positive class
In binary classification the output class is usually labelled positive or negative. The positive class is the thing we are testing for. For example, the positive class for an email classifier is ‘spam’, and the positive class for a medical test might be ‘tumor’.
Negative class
The negative class is the opposite of the positive class. For example, the negative class for an email classifier is ‘not spam’, and the negative class for a medical test might be ‘not tumor’.
True positive (TP)
Model correctly predicted the positive class.
True negative (TN)
Model correctly predicted the negative class
False positive (FP)
Model incorrectly predicted the positive class; the actual class is negative.
It has other names
* Type I error
False negative (FN)
Model incorrectly predicted the negative class; the actual class is positive.
It has other names
* Type II error
Accuracy
It tells what fraction of all predictions were correct.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
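The formula translates directly to code; the confusion-matrix counts below are made-up numbers for illustration.

```python
def accuracy(tp, tn, fp, fn):
    # Fraction of all predictions that were correct.
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts: 40 TP, 45 TN, 5 FP, 10 FN out of 100 predictions.
print(accuracy(40, 45, 5, 10))  # 0.85
```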
Precision
It tells how accurate the positive predictions are.
Precision = TP / (TP + FP)
It is a good metric when cost of false positives is high.
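A minimal sketch of the formula, again with hypothetical counts:

```python
def precision(tp, fp):
    # Of everything the model predicted positive, the fraction
    # that actually was positive.
    return tp / (tp + fp)

# Hypothetical counts: 40 true positives, 10 false positives.
print(precision(40, 10))  # 0.8
```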
True Positive Rate (TPR)
TPR = TP / (TP + FN)
It is the probability that an actual positive class will test positive.
It has other names
* Recall
* Sensitivity
TPR is the y-axis of an ROC curve. It is a good metric when the cost of false negatives is high.
False Positive Rate (FPR)
FPR = FP / (FP + TN)
It has other names
* 1 - specificity
It is the x-axis of an ROC curve.
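And the corresponding sketch for FPR, with hypothetical counts:

```python
def false_positive_rate(fp, tn):
    # Of all actual negatives, the fraction wrongly flagged positive.
    return fp / (fp + tn)

# Hypothetical counts: 5 false positives, 45 true negatives.
print(false_positive_rate(5, 45))  # 0.1
```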
ROC Curve
Receiver Operating Characteristic (ROC) is a curve of TPR vs FPR at different classification thresholds.
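The idea of sweeping a threshold can be sketched as follows; the scores and labels are invented for illustration.

```python
# Hypothetical model scores and actual classes, for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]

def roc_point(threshold):
    # Predict positive when the score reaches the threshold,
    # then compute (FPR, TPR) from the resulting confusion counts.
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    return fp / (fp + tn), tp / (tp + fn)

# Sweeping the threshold traces the curve from (0, 0) to (1, 1).
curve = [roc_point(t) for t in (0.95, 0.75, 0.35, 0.0)]
```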
True Negative Rate (TNR)
TNR = TN / (TN + FP)
It is the probability that an actual negative class will test negative.
It has other names
* Specificity
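One more sketch with hypothetical counts; note that TNR = 1 - FPR, which is why FPR is also called "1 - specificity" above.

```python
def true_negative_rate(tn, fp):
    # Specificity: of all actual negatives, the fraction that
    # tested negative. Note TNR = 1 - FPR.
    return tn / (tn + fp)

# Hypothetical counts: 45 true negatives, 5 false positives.
print(true_negative_rate(45, 5))  # 0.9
```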