Classification is a data mining technique for predicting which class (or category) a data observation belongs to, the simplest case being a Yes / No question. Whereas regression analysis predicts a numeric value, classification predicts a category. A decision tree is one widely used classification method, and classification in general is a particularly useful technique for breaking down very large datasets for analysing and making predictions.
In its simplest form, you could use classification to break down your results into just two categories: true or false. The model won’t necessarily output exactly 1 (for true) or 0 (for false); it typically outputs a value somewhere in between, and a threshold defines which values are classified true and which are classified false.
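As a minimal sketch of how a threshold turns in-between values into true / false labels, the snippet below uses a hypothetical `classify` helper and made-up scores. The default of 0.5 and the choice to treat a score equal to the threshold as true are assumptions for illustration only.

```python
def classify(scores, threshold=0.5):
    """Label each score 1 (true) if it meets the threshold, else 0 (false)."""
    return [1 if score >= threshold else 0 for score in scores]


# Hypothetical model outputs between 0 and 1.
scores = [0.10, 0.45, 0.50, 0.80]

labels = classify(scores)                        # default threshold of 0.5
strict_labels = classify(scores, threshold=0.7)  # a stricter threshold
```

Raising the threshold means fewer observations are classified true.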
In classification experiments, you have a training set of labelled observations, which you feed to the machine, and a test set of observations, which you use only for evaluation. You can also withhold some of the training data to validate the model while it is being built.
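One possible way to arrange such a split is sketched below. The `split_dataset` function, the 20% fractions and the fixed seed are all assumptions chosen for illustration; the right proportions depend on the experiment.

```python
import random


def split_dataset(rows, test_fraction=0.2, validation_fraction=0.2, seed=0):
    """Shuffle the rows and split them into train, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable

    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)

    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


rows = list(range(10))  # stand-in for ten labelled observations
train, validation, test = split_dataset(rows)
```

Each observation ends up in exactly one of the three sets, so the test set is never seen during training.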
Handwriting recognition is one example where classification could be used. The training set consists of images labelled with the letters they show; the machine learns from these and is then evaluated on how accurately it identifies the letters in the test set.
The test data cases can then be divided into groups:
- Cases where the model predicts a 1 which are actually true are “true positives”
- Cases where the model predicts a 0 which are actually false are “true negatives”
- Cases where the model predicts a 1 which are actually false are “false positives”, Type I errors
- Cases where the model predicts a 0 which are actually true are “false negatives”, Type II errors
Based on the test results you may choose to move the threshold to change how the predicted values are classified.
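The grouping above, and the effect of moving the threshold, can be sketched as follows. The `count_outcomes` helper, the labels and the scores are made up for illustration.

```python
def count_outcomes(actual, scores, threshold):
    """Count true/false positives and negatives at a given threshold."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, score in zip(actual, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            counts["TP"] += 1
        elif predicted == 0 and truth == 0:
            counts["TN"] += 1
        elif predicted == 1 and truth == 0:
            counts["FP"] += 1  # Type I error
        else:
            counts["FN"] += 1  # Type II error
    return counts


# Hypothetical test data: the true labels and the model's scores.
actual = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]

at_half = count_outcomes(actual, scores, threshold=0.5)
at_quarter = count_outcomes(actual, scores, threshold=0.25)
```

With this data, lowering the threshold from 0.5 to 0.25 removes the one false negative but adds a false positive, which is the trade-off to weigh when moving the threshold.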
The performance of a classification model can be measured in several ways. Some examples are below, where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative:
- Accuracy = (TP + TN) / (TP + FP + TN + FN)
- Precision = TP / (TP + FP) which returns the fraction of cases classified as positives that are actually true and not false positives
- Recall / True Positive Rate = TP / (TP + FN) which provides the fraction of positive cases correctly identified
- False Positive Rate = FP / (FP + TN) which gives the fraction of actual negative cases incorrectly classified as positive
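The formulas above translate directly into code. This is a minimal sketch with a hypothetical `classification_metrics` function and invented counts; it assumes none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four measures from the outcome counts.

    Assumes every denominator is non-zero; a real implementation
    would need to decide how to handle empty groups.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # also the True Positive Rate
        "false_positive_rate": fp / (fp + tn),
    }


# Hypothetical counts from 100 test cases.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

For these counts the accuracy is 85 / 100, the precision is 40 / 45, the recall is 40 / 50 and the False Positive Rate is 5 / 50.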
An ROC curve (receiver operating characteristic curve) plots the TPR (True Positive Rate) against the FPR (False Positive Rate) at each threshold, showing the performance of a classification model across all thresholds. The AUC (area under the curve) of that plot summarises how well the model separates the two categories: the larger the AUC, the better the model predicts. Simple guessing in an experiment with two categories would have an AUC of 0.5, while a perfect model would have an AUC of 1.
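To make the ROC curve and AUC concrete, here is a small sketch that computes the curve's points and the area beneath them. The `roc_points` and `auc` helpers and the data are illustrative assumptions, the area is approximated with the trapezoid rule, and the data is assumed to contain at least one positive and one negative case.

```python
def roc_points(actual, scores):
    """Return (FPR, TPR) points, one per distinct score used as a threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted true
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Approximate the area under the curve with the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical test data: the true labels and the model's scores.
actual = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]

curve = roc_points(actual, scores)
area = auc(curve)
```

For this data the area works out to 7/9 (about 0.78), which is better than the 0.5 expected from guessing but short of a perfect 1.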
A confusion matrix displays the number of true positives, true negatives, false positives and false negatives from your testing, to help measure the effectiveness of your classification model.
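Such a matrix, commonly called a confusion matrix, can be built with a few lines of code. In this sketch the `confusion_matrix` helper and the data are hypothetical, and the layout (rows for actual classes, columns for predicted classes) is one common convention rather than the only one.

```python
def confusion_matrix(actual, predicted, classes):
    """Build a matrix with one row per actual class and one column per predicted class."""
    index = {label: position for position, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(actual, predicted):
        matrix[index[truth]][index[guess]] += 1
    return matrix


classes = [0, 1]
actual = [1, 1, 1, 0, 0, 0]     # hypothetical true labels
predicted = [1, 1, 0, 1, 0, 0]  # hypothetical model predictions

matrix = confusion_matrix(actual, predicted, classes)
# With classes [0, 1]: matrix[0] is [TN, FP] and matrix[1] is [FN, TP].
```

The same function works for more than two classes, such as the letters in the handwriting example, by passing a longer `classes` list.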