The following post discusses the ‘percentage correct’ prediction method and explains why it may not always be the most precise way to measure performance. I also examine analytic measurement techniques in general and recommend suitable alternative prediction methods for situations where ‘percentage correct’ is not an appropriate performance measure.
The ‘Percentage Correct’ Prediction Method
Before we dive into performance measurement and prediction methods, I first want to point out that it is very common to confuse the term ‘percentage correct’ with similar terms such as ‘percentage’ or ‘percentile.’ The following short glossary summarizes the differences between the three terms as commonly used in statistics:
- Percentage – indicates parts per hundred
- Percentile – value below which a given proportion of observations in a group of observations fall
- Percentage Correct – in analytics, refers to a way of measuring the performance of an analytics system
So what is ‘percentage correct’ again? It is a classification performance measure used to assess the statistical performance of an analytics system.
Let’s illustrate this with an example: assume we need to build an analytics system that first processes the Training Data to create a model. The model is then applied to a Real Data Set that we want to test. Naturally, at the end of the analysis we need to measure the performance of our statistical model, and that is precisely the point where we typically resort to the ‘percentage correct’ method, mainly because it condenses the model’s performance into a single percentage number, which gives us an overall performance measurement.
Let’s illustrate the above example in RapidMiner. Figure 1 outlines the scenario, in which the last step is a Performance operator that delivers a list of criteria values for the classification task, one of which is the ‘Percentage Correct’ option, otherwise also called Accuracy.
Figure 1
The RapidMiner Accuracy result (Figure 2) shows how RapidMiner calculates ‘percentage correct’ (the proportion of correct predictions). In RapidMiner, this is described as the relative number of correctly classified examples.
Figure 2
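The same calculation is easy to reproduce outside RapidMiner. Below is a minimal sketch in plain Python; the predictions and labels are made-up illustrative values, not taken from the RapidMiner example:

```python
# Minimal sketch: 'percentage correct' (accuracy) as the share of
# predictions that match the actual labels. The two lists below are
# made-up illustrative values.
predicted = ["YES", "NO", "YES", "YES", "NO", "YES", "NO", "NO"]
actual    = ["YES", "NO", "NO",  "YES", "NO", "YES", "YES", "NO"]

correct = sum(1 for p, a in zip(predicted, actual) if p == a)
percentage_correct = correct / len(actual) * 100

print(f"Percentage correct: {percentage_correct:.2f}%")  # 75.00%
```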
Decision Threshold, Confusion Matrix, and Measuring Error
During performance classification, every measurement method eventually comes to a point where it needs to decide whether the output of the model is right or not.
To illustrate this with an example, let’s assume we are predicting the outcome of a situation with only two possible results, such as YES and NO. To decide, we need a defined decision threshold that tells us when to mark an output as YES and when to mark it as NO. A typical solution for a scenario where values fall between 0 and 100 is to decide that any value of 50 or higher is marked YES and any value below 50 is marked NO. That is the basic idea of the decision threshold, as the sketch below shows.
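A minimal sketch of applying such a threshold to raw model scores on the hypothetical 0–100 scale (the scores are made-up illustrative values):

```python
# Minimal sketch: applying a decision threshold of 50 to raw scores
# on a 0-100 scale (the scores are made-up illustrative values).
THRESHOLD = 50

scores = [12.4, 50.0, 73.9, 49.9, 88.1]
labels = ["YES" if s >= THRESHOLD else "NO" for s in scores]

print(labels)  # ['NO', 'YES', 'YES', 'NO', 'YES']
```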
The following are the relevant terms used in the Confusion Matrix, illustrated with the earlier YES/NO scenario. Please note that the Confusion Matrix works with counts of cases, i.e. whole numbers. A small sketch of how the four counts are tallied follows the definitions.
True Positives (TP): all cases where we have predicted YES and the actual result was YES
True Negatives (TN): all cases where we have predicted NO and the actual result was NO
False Positives (FP): all cases where prediction was YES, but the actual result was NO (‘Type I error’)
False Negatives (FN): all cases where prediction was NO, but the actual result was YES (‘Type II error’)
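Here is a rough sketch of how these four counts can be tallied from paired predicted/actual outcomes; the data is made up purely for illustration:

```python
# Minimal sketch: tallying the four confusion-matrix cells from
# paired predicted/actual outcomes (made-up illustrative data).
from collections import Counter

predicted = ["YES", "YES", "NO", "NO", "YES", "NO"]
actual    = ["YES", "NO",  "NO", "YES", "YES", "NO"]

cells = Counter()
for p, a in zip(predicted, actual):
    if   p == "YES" and a == "YES": cells["TP"] += 1
    elif p == "NO"  and a == "NO":  cells["TN"] += 1
    elif p == "YES" and a == "NO":  cells["FP"] += 1  # Type I error
    else:                           cells["FN"] += 1  # Type II error

print(dict(cells))  # {'TP': 2, 'FP': 1, 'TN': 2, 'FN': 1}
```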
Example Scenario
Let’s assume there are 1,850 cases we need to predict, and our model predicted YES in 1,100 cases and NO in 750 cases. In reality, there were 800 cases with a NO result and 1,050 cases with a YES result. In other words, our prediction model is quite accurate: out of all 1,850 cases it was off by only a small margin, predicting YES 50 times more, and NO 50 times less, than it should have. This is shown easily by using the Confusion Matrix (Figure 3):
Figure 3
| n = 1850 | Predicted: NO | Predicted: YES | Total |
|---|---|---|---|
| Actual: NO | TRUE NEGATIVES = 700 | FALSE POSITIVES = 100 | 800 |
| Actual: YES | FALSE NEGATIVES = 50 | TRUE POSITIVES = 1000 | 1050 |
| Total | 750 | 1100 | 1850 |
Accuracy: But how would we go about calculating the accuracy, or in other words, the overall rate of how often the classifier was correct? That can be calculated from the above Confusion Matrix using the following formula: (TRUE POSITIVES + TRUE NEGATIVES) / Total = (1000 + 700) / 1850 = 0.9189. So, our model has an overall accuracy of 91.89%. (A short code sketch reproducing these numbers follows this list.)
Error Rate: To calculate the error rate, we simply use (FALSE POSITIVES + FALSE NEGATIVES) / Total = (100 + 50) / 1850 = 0.0811. In other words, our model has an overall misclassification rate of 8.11%.
True Positive Rate: For actual YES, how often did we predict YES? TP / Actual YES = 1000/1050 = 0.9524
False Positive Rate: For actual NO, how often did we predict YES? FP / Actual NO = 100/800 = 0.125
Specificity: For actual NO, how often did we predict NO? TN / Actual NO = 700/800 = 0.875
Precision: When we predict YES, how often are we correct? TP / Predicted YES = 1000/1100 = 0.9091
Prevalence: How often does the YES condition occur in our sample? Actual YES / Total = 1050/1850 = 0.5676
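As promised above, here is a short Python sketch that reproduces all of these figures directly from the four cells of the example confusion matrix:

```python
# Metrics for the example confusion matrix (n = 1850).
TP, TN, FP, FN = 1000, 700, 100, 50
total      = TP + TN + FP + FN           # 1850
actual_yes = TP + FN                     # 1050
actual_no  = TN + FP                     # 800
pred_yes   = TP + FP                     # 1100

accuracy    = (TP + TN) / total          # 0.9189
error_rate  = (FP + FN) / total          # 0.0811
tpr         = TP / actual_yes            # 0.9524 (sensitivity / recall)
fpr         = FP / actual_no             # 0.1250
specificity = TN / actual_no             # 0.8750
precision   = TP / pred_yes              # 0.9091
prevalence  = actual_yes / total         # 0.5676

for name, value in [("accuracy", accuracy), ("error rate", error_rate),
                    ("TPR", tpr), ("FPR", fpr), ("specificity", specificity),
                    ("precision", precision), ("prevalence", prevalence)]:
    print(f"{name:12s}{value:.4f}")
```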
—
So, to recap, ‘decision thresholds’ are simply the threshold limits, either derived automatically (from the test data) or defined manually (e.g. in RapidMiner with the Apply Threshold operator), that assist in classifying the output.
And the ‘confusion matrix’ is essentially a table that is often used to describe the performance of a classification model on test data for which the true values are known (as explained in my example).
—
When we look at the calculations for some of the measures, for example:
- Specificity
- Sensitivity
- PPV (positive predictive value)
- NPV (negative predictive value)
we can almost immediately spot the inefficiency of measuring rows and columns separately. The issue is obvious: each of these measures is calculated using only half of the information in the probability table and thus cannot effectively represent all aspects of the performance.
Overall accuracy is definitely a better measure.
But I like the MCC (Matthews correlation coefficient) even better, because it takes advantage of all four numbers in a more balanced way and is thus more comprehensive than the row-wise or column-wise measures.
The MCC is calculated as follows: MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).
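For the example confusion matrix above (TP = 1000, TN = 700, FP = 100, FN = 50), a quick sketch of the calculation:

```python
# MCC for the example confusion matrix (TP=1000, TN=700, FP=100, FN=50).
from math import sqrt

TP, TN, FP, FN = 1000, 700, 100, 50

numerator   = TP * TN - FP * FN
denominator = sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
mcc = numerator / denominator

print(f"MCC = {mcc:.4f}")  # ~0.8349
```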
Now assume we want to conduct the performance evaluation of a classification task. We could employ the aforementioned accuracy or MCC measure, but there are many other performance assessment methods available. To illustrate the point, allow me to list some of them along with short descriptions (a small sketch of a few of the numeric error measures follows the list):
Absolute Error – average absolute deviation of the prediction from the actual value
Classification Error – percentage of incorrect predictions
Correlation – returns the correlation coefficient between the label and prediction attributes
Cross Entropy – the sum over the logarithms of the true label’s confidences divided by the number of examples
Kappa – accuracy adjusted for correct predictions occurring by chance
Kendall Tau – the strength of the relationship between the actual and predicted labels
Logistic Loss – the average of ln(1+exp(-[conf(CC)])) where ‘conf(CC)’ is the confidence of the correct class.
Margin – the minimal confidence for the correct label
Mean Squared Error – the averaged squared error
Normalized Absolute Error – the absolute error divided by the error made if the average had been predicted
Relative Error – the average of the absolute deviation of the prediction from the actual value divided by the actual value
Relative Error Lenient – the average of the absolute deviation of the prediction from the actual value divided by the maximum of the actual value and the prediction
Relative Error Strict – the average of the absolute deviation of the prediction from the actual value divided by the minimum of the actual value and the prediction
Root Mean Squared Error – the averaged root mean squared error
Root Relative Squared Error – the averaged root relative squared error
Soft Margin Loss – the average of all 1 – confidences for the correct label
Spearman Rho – correlation between the actual and predicted labels
Squared Correlation – the squared correlation coefficient between the label and prediction attributes
Weighted Mean Precision – calculated through class precisions for individual classes
Weighted Mean Recall – calculated through class recalls for individual classes
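As mentioned before the list, here is a small sketch of three of the numeric (regression-style) measures, absolute error, relative error, and root mean squared error, computed on made-up predictions:

```python
# Sketch of three numeric error measures on made-up values.
from math import sqrt

actual    = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

n = len(actual)
abs_err = sum(abs(p - a) for p, a in zip(predicted, actual)) / n
rel_err = sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / n
rmse    = sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

print(f"absolute error: {abs_err:.3f}")  # 2.000
print(f"relative error: {rel_err:.3f}")  # 0.106
print(f"RMSE:           {rmse:.3f}")     # 2.121
```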
I’ll cover some of these in the next section in more detail.
The ‘Percentage Correct’ and other Performance Prediction Methods
As we can see from the above example, the ‘percentage correct’ method is a highly practical way of measuring performance, especially in a situation where we need to compare a variety of different statistical models. However, it’s not a solution that fits all data sets perfectly.
For example, “when predicting time series like currency exchange rate variation over time or risk of being specific cyber criminals’ target, percentage correct prediction is not applicable” (Herron, 1999). The reason it is not perfectly applicable to all types of data sets is due to its most praised quality: summarizing performance into a single percentage number, which isn’t always the best approach in situations where we encounter, and want to report on, several different factors. In such scenarios, ‘percentage correct prediction’ (also called PCP) often neglects relevant data and overstates the precision of reported results.
Percentage correct is not the only available method to measure performance; there are other methods that are considered more robust for specific types of data sets.
The following is a list of various other methods that can be used to measure the performance of an analytics system:
Cohen’s Kappa – The Cohen’s Kappa statistic is a more rigorous measure than the earlier explained ‘percentage correct prediction’ calculation, mainly because Kappa takes into account the correct predictions that occur by chance. “This is essentially a measure of how well the classifier performed as compared to how well it would have performed simply by chance. In other words, a model has a high Kappa score if there is a big difference between the accuracy and the null error rate.” (Markham, K., 2014)
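Using the example confusion matrix again, a rough sketch of the kappa calculation (observed accuracy compared against the accuracy expected by chance):

```python
# Cohen's kappa for the example confusion matrix.
TP, TN, FP, FN = 1000, 700, 100, 50
total = TP + TN + FP + FN

observed = (TP + TN) / total  # observed accuracy, 0.9189

# Expected accuracy if predictions were made at random with the
# same marginal totals as the actual and predicted classes.
expected = ((TP + FN) * (TP + FP) + (TN + FP) * (TN + FN)) / total ** 2

kappa = (observed - expected) / (1 - expected)
print(f"kappa = {kappa:.4f}")  # ~0.8336
```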
Receiver Operating Characteristic Curves (ROC curves) – a commonly used graph that summarizes the overall classifier performance over all thresholds, created by plotting the TP rate on the y-axis and the FP rate on the x-axis, as shown in Figure 4. The ROC curve is the most commonly accepted option for visualizing binary classifier performance.
Figure 4
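As a rough sketch of how such a curve could be produced outside RapidMiner, assuming scikit-learn and matplotlib are available (the labels and confidence scores below are made up):

```python
# Sketch: plotting a ROC curve from made-up labels and scores,
# assuming scikit-learn and matplotlib are installed.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]                         # actual classes
y_score = [0.9, 0.3, 0.8, 0.65, 0.4, 0.2, 0.55, 0.7, 0.85, 0.1]  # model confidences

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```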
F-Score – another way to measure test accuracy, in this case by considering a weighted average of precision and recall (true positive rate). The F-measure (F1 score) is the harmonic mean of recall and precision: F1 = 2 × (precision × recall) / (precision + recall).
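Plugging in the precision (0.9091) and recall (0.9524) from the earlier example, a final short sketch:

```python
# F1 score for the example confusion matrix.
TP, FP, FN = 1000, 100, 50

precision = TP / (TP + FP)                          # 0.9091
recall    = TP / (TP + FN)                          # 0.9524
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.4f}")  # ~0.9302
```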
References
RapidMiner GmbH (2017) Performance (Classification) – RapidMiner Documentation. Available at: http://docs.rapidminer.com/studio/operators/validation/performance/predictive/performance_classification.html (Accessed: 18 February 2017).
Herron, M. C. (1999) ‘Postestimation Uncertainty in Limited Dependent Variable Models’, Political Analysis, 8(1), pp. 83–98. Available at: http://pan.oxfordjournals.org/content/8/1/83.abstract (Accessed: 18 February 2017).
About Percentiles (2008). Available at: https://atrium.lib.uoguelph.ca/xmlui/bitstream/handle/10214/1843/A_About_Percentages_and_Percentiles.pdf?sequence=7 (Accessed: 18 February 2017).
Markham, K. (2014) Simple guide to confusion matrix terminology. Available at: http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/ (Accessed: 18 February 2017).
F1 score (2016) in Wikipedia. Available at: https://en.wikipedia.org/wiki/F1_score (Accessed: 19 February 2017).