The ‘Percentage Correct’ and other Performance Prediction Methods

The following post discusses the ‘percentage correct’ prediction method and explains why it may not always be the most precise way to measure performance. I also examine analytic measurement techniques in general and recommend suitable substitute prediction methods for situations where ‘percentage correct’ is not an appropriate performance measure.

The ‘Percentage Correct’ Prediction Method

Before we dive into performance measurement and prediction methods, I first want to point out that it is very common to confuse the term ‘percentage correct’ with similar terms such as ‘percentage’ or ‘percentile.’ The following short glossary highlights the differences between these three terms as commonly used in statistics:

  • Percentage – indicates parts per hundred
  • Percentile – value below which a given proportion of observations in a group of observations fall
  • Percentage Correct – in analytics, refers to a way of measuring the performance of an analytics system

So what is ‘percentage correct’ again? It is a classification performance measure used to assess the statistical performance of an analytics system.

Let’s illustrate this with an example: assume we need to build an analytics system that first processes the Training Data to create a model. The model is then applied to a Real Data Set that we want to test. Naturally, at the end of the analysis we need to measure the performance of our statistical model, and that is precisely the point where we typically resort to the ‘percentage correct’ method, mainly because it condenses the whole model into a single percentage number that gives us an overall performance measurement.
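Outside of RapidMiner, the same train–apply–measure workflow can be sketched in a few lines of Python. This is only an illustrative sketch, assuming scikit-learn and its bundled iris toy dataset rather than any particular real data set:

# Minimal sketch of the train -> apply -> measure workflow,
# assuming scikit-learn and its bundled iris toy dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# "Training Data" used to create the model, and held-out data we test it on
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
predictions = model.predict(X_test)

# 'Percentage correct' = proportion of correct predictions on the test data
print(f"Accuracy (percentage correct): {accuracy_score(y_test, predictions):.2%}")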

Let’s illustrate the above example in RapidMiner. Figure 1 outlines the scenario, in which the last step is a Performance operator that delivers a list of criteria values for the classification task, one of which can be the ‘Percentage Correct’ option, otherwise also called Accuracy.

Figure 1


The RapidMiner Accuracy result (Figure 2) shows how RapidMiner calculates ‘percentage correct’ (the proportion of correct predictions). In RapidMiner, this is described as the relative number of correctly classified examples.

Figure 2

Decision Threshold, Confusion Matrix, and Measuring Error

During performance classification, every measurement method eventually comes to a point where it needs to decide whether the output of the model is right or not.

To illustrate this with an example, let’s assume we are predicting the outcome of a situation with only two possible results, such as YES and NO. To be able to decide, we need a defined decision threshold that tells us when to mark an output as YES and when to mark it as NO. A typical solution for a scenario where values fall between 0 and 100 is to decide that any value equal to or higher than 50 should be marked as YES and any value below 50 should be marked as NO. That is the basic idea of the decision threshold.
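As a quick illustration of the threshold idea (a sketch with made-up scores on a 0–100 scale, not tied to any particular tool):

# Turning raw model scores (0-100) into YES/NO labels with a threshold of 50.
# The scores below are made up purely for illustration.
scores = [12, 49, 50, 73, 88, 31]
THRESHOLD = 50

labels = ["YES" if score >= THRESHOLD else "NO" for score in scores]
print(labels)  # ['NO', 'NO', 'YES', 'YES', 'YES', 'NO']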

The following are the relevant terms used in the Confusion Matrix, illustrated with the earlier YES/NO scenario. Please note that the Confusion Matrix works with counts of cases, i.e. whole numbers.

True Positives (TP): all cases where we have predicted YES and the actual result was YES

True Negatives (TN): all cases where we have predicted NO and the actual result was NO

False Positives (FP): all cases where prediction was YES, but the actual result was NO (‘Type I error’)

False Negatives (FN): all cases where prediction was NO, but the actual result was YES (‘Type II error’)

 

Example Scenario

Let’s assume there are 1,850 cases we need to predict, and our model has predicted YES in 1,100 cases and NO in 750 cases. However, there were actually 800 cases with NO results and 1,050 cases with YES results. In other words, our prediction model is very accurate: out of all 1,850 cases it was off by only a small margin, predicting YES 50 more times, and NO 50 fewer times, than it should have. This is shown easily by using the Confusion Matrix (Figure 3):

Figure 3

n = 1,850       Predicted: NO             Predicted: YES            Total
Actual: NO      TRUE NEGATIVES = 700      FALSE POSITIVES = 100     800
Actual: YES     FALSE NEGATIVES = 50      TRUE POSITIVES = 1000     1,050
Total           750                       1,100                     1,850

Accuracy: But how would we go about calculating Accuracy, or in other words the overall rate of how often the classifier was correct? That can be calculated from the above Confusion Matrix using the following formula: (TRUE POSITIVES + TRUE NEGATIVES) / Total = (1000 + 700) / 1850 = 0.9189. So, our model has an overall accuracy of 91.89%.

Error Rate: To calculate the Error Rate, we simply use (FALSE POSITIVES + FALSE NEGATIVES) / Total = (100 + 50) / 1850 = 0.0811. In other words, our model has an overall misclassification rate of 8.11%.

True Positive Rate: For actual YES, how often did we predict YES? TP / Actual YES = 1000/1050 = 0.9524

False Positive Rate: For actual NO, how often did we predict YES? FP / Actual NO = 100/800 = 0.125

Specificity: For actual NO, how often did we predict NO? TN / Actual NO = 700/800 = 0.875

Precision: When we predict YES, how often are we correct? TP / Predicted YES = 1000/1100 = 0.9091

Prevalence: How often does the YES condition occur in our sample? Actual YES / Total = 1050/1850 = 0.5676
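All of the calculations above can be reproduced directly from the four confusion-matrix counts; here is a small sketch in plain Python using the numbers from Figure 3:

# Recomputing the example metrics from the confusion-matrix counts (Figure 3).
TP, TN, FP, FN = 1000, 700, 100, 50
total = TP + TN + FP + FN                 # 1850

accuracy    = (TP + TN) / total           # 0.9189
error_rate  = (FP + FN) / total           # 0.0811
tpr         = TP / (TP + FN)              # true positive rate (sensitivity): 0.9524
fpr         = FP / (FP + TN)              # false positive rate: 0.125
specificity = TN / (TN + FP)              # 0.875
precision   = TP / (TP + FP)              # 0.9091
prevalence  = (TP + FN) / total           # 0.5676

print(accuracy, error_rate, tpr, fpr, specificity, precision, prevalence)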

 

So, to recap, ‘decision thresholds’ are simply the threshold limits that can be either derived automatically (from the test data) or defined manually (e.g. in RapidMiner via the Apply Threshold operator), and which assist during the classification of the output.

And the ‘confusion matrix’ is essentially a table that is often used to describe the performance of a classification model on test data for which the true values are known (as explained in my example).

When we look at the calculation of some of the measures, for example:

  • Specificity
  • Sensitivity
  • PPV (positive predictive value)
  • NPV (negative predictive value)

we can almost immediately spot the inefficiency of measuring rows and columns separately. The issue is obvious: each of these measures is calculated using only half of the information in the confusion matrix and thus cannot effectively represent all aspects of the performance.

Overall Accuracy is definitely a better measure, because it is calculated from all four cells of the confusion matrix: Accuracy = (TP + TN) / (TP + TN + FP + FN).

But I like the MCC (Matthews correlation coefficient) even better, because it takes advantage of all four numbers in a more balanced way and is thus more comprehensive than the row-wise or column-wise measures.

The MCC is calculated as follows:

\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

Nowadays, MCC is a recommended approach for measuring performance in machine learning, because it takes into account true and false positives and negatives and is regarded as an even-handed measure even when the classes are of very different sizes. As a bit of trivia, the Matthews correlation coefficient was introduced by the biochemist Brian W. Matthews back in 1975.
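Applied to the example confusion matrix from Figure 3, the MCC works out to roughly 0.83; a short sketch in plain Python:

import math

# MCC for the example confusion matrix (Figure 3).
TP, TN, FP, FN = 1000, 700, 100, 50

numerator   = TP * TN - FP * FN
denominator = math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

mcc = numerator / denominator
print(round(mcc, 4))  # approximately 0.8349; 1.0 would be a perfect prediction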

Now assume we want to conduct the performance evaluation of a classification task. We could employ the aforementioned accuracy or MCC measure, but there are many other performance assessment methods available. To illustrate the point, allow me to list some of them along with short descriptions:
Absolute Error – average absolute deviation of the prediction from the actual value

\Delta x = x_0 - x

Classification Error – percentage of incorrect predictions

Correlation – returns the correlation coefficient between the label and prediction attributes

Cross Entropy – the sum over the logarithms of the true label’s confidences divided by the number of examples

H(p, q) = \operatorname{E}_p[-\log q] = H(p) + D_{\mathrm{KL}}(p \| q)

Kappa – accuracy adjusted for correct predictions occurring by chance

Kendall Tau – the strength of the relationship between the actual and predicted labels

\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{n(n-1)/2}

Logistic Loss –  the average of ln(1+exp(-[conf(CC)])) where ‘conf(CC)’ is the confidence of the correct class.

Margin – the minimal confidence for the correct label

Mean Squared Error – the averaged squared error

\operatorname{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2

Normalized Absolute Error – the absolute error divided by the error that would have been made if the average had been predicted

Relative Error – the average of the absolute deviation of the prediction from the actual value divided by the actual value

Relative Error Lenient – the average of the absolute deviation of the prediction from the actual value divided by the maximum of the actual value and the prediction

Relative Error Strict – the average of the absolute deviation of the prediction from the actual value divided by the minimum of the actual value and the prediction

Root Mean Squared Error – the square root of the averaged squared error


Root Relative Squared Error – the square root of the total squared error divided by the total squared deviation of the actual values from their mean

Soft Margin Loss – the average of all (1 − confidence) values for the correct label

Spearman Rho – correlation between the actual and predicted labels

Squared Correlation – the squared correlation coefficient between the label and prediction attributes

R^2 \equiv 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}

Weighted Mean Precision – calculated through class precisions for individual classes

F_1 = 2 \cdot \frac{1}{\frac{1}{\mathrm{recall}} + \frac{1}{\mathrm{precision}}} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

Weighted Mean Recall – calculated through class recalls for individual classes

E = 1 - \left(\frac{\alpha}{P} + \frac{1 - \alpha}{R}\right)^{-1}
I’ll cover some of these in the next section in more detail.
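Before that, as a sketch of how a few of the simpler measures from the list above look in code (plain Python with numpy and made-up prediction and label vectors, not RapidMiner’s implementation):

import numpy as np

# Made-up actual values and predictions, purely to illustrate a few measures.
actual    = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
predicted = np.array([2.8, 5.4, 2.0, 6.5, 5.0])

absolute_error = np.mean(np.abs(predicted - actual))                    # Absolute Error
relative_error = np.mean(np.abs(predicted - actual) / np.abs(actual))   # Relative Error
mse            = np.mean((predicted - actual) ** 2)                     # Mean Squared Error
rmse           = np.sqrt(mse)                                           # Root Mean Squared Error

print(absolute_error, relative_error, mse, rmse)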

 

The ‘Percentage Correct’ and other Performance Prediction Methods

As we can see from the above example, the ‘percentage correct’ method is a highly practical way of measuring performance, especially in a situation where we need to compare a variety of different statistical models. However, it’s not a solution that fits all data sets perfectly.

For example, “when predicting time series like currency exchange rate variation over time or risk of being specific cyber criminals’ target, percentage correct prediction is not applicable” (Herron, 1999). The reason it is not perfectly applicable to all types of data sets is its most praised quality of summarizing performance into a single percentage number, which isn’t always the best approach in situations where we encounter, and want to report on, several different factors. In such scenarios, ‘percentage correct prediction’ (also called PCP) often neglects relevant data and overstates the precision of the reported results.

Percentage correct is not the only available method to measure performance; there are other methods that are considered more robust for specific types of data sets.

The following is a list of other methods that can be used to measure the performance of an analytic system:

Cohen’s Kappa – Cohen’s Kappa statistic is a more rigorous measure than the earlier explained ‘percentage correct prediction’ calculation, mainly because Kappa accounts for correct predictions occurring by chance. “This is essentially a measure of how well the classifier performed as compared to how well it would have performed simply by chance. In other words, a model has a high Kappa score if there is a big difference between the accuracy and the null error rate.” (Markham, K., 2014)
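Using the numbers from Figure 3, Kappa can be sketched as follows (plain Python; the expected accuracy is the agreement the row and column marginals would produce by chance):

# Cohen's Kappa for the example confusion matrix (Figure 3).
TP, TN, FP, FN = 1000, 700, 100, 50
total = TP + TN + FP + FN

observed_accuracy = (TP + TN) / total

# Chance agreement derived from the marginal totals
p_yes = ((TP + FP) / total) * ((TP + FN) / total)
p_no  = ((TN + FN) / total) * ((TN + FP) / total)
expected_accuracy = p_yes + p_no

kappa = (observed_accuracy - expected_accuracy) / (1 - expected_accuracy)
print(round(kappa, 4))  # approximately 0.8336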

Receiver Operating Characteristic Curves (ROC curves) – a commonly used graph that summarizes the classifier’s performance over all possible thresholds. It is created by plotting the True Positive Rate on the y-axis against the False Positive Rate on the x-axis, as shown in Figure 4. A ROC curve is the most commonly accepted option for visualizing the performance of a binary classifier.

Figure 4
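A ROC curve can also be produced outside RapidMiner in a few lines; the following is a minimal sketch assuming scikit-learn and matplotlib, with made-up labels and prediction scores:

# Minimal ROC-curve sketch, assuming scikit-learn and matplotlib.
# The true labels and predicted scores below are made up for illustration.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.90, 0.55, 0.65, 0.30]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # TP/FP rates over all thresholds

plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")           # performance of random guessing
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()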

F-Score – another way to measure test accuracy, in this case by considering a weighted average of precision and recall (true positive rate). The F-measure (F1 score) is the harmonic mean of precision and recall:

F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
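For the example confusion matrix from Figure 3, the F1 score works out as follows (a small sketch in plain Python):

# F1 score for the example confusion matrix (Figure 3).
TP, FP, FN = 1000, 100, 50

precision = TP / (TP + FP)                           # 0.9091
recall    = TP / (TP + FN)                           # 0.9524
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
print(round(f1, 4))                                  # approximately 0.9302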

References

GmbH, R. (2017) Performance (classification) – RapidMiner documentation. Available at: http://docs.rapidminer.com/studio/operators/validation/performance/predictive/performance_classification.html (Accessed: 18 February 2017).

Herron, M. C. (1999) ‘Postestimation Uncertainty in Limited Dependent Variable Models’, Political Analysis, 8(1), pp. 83–98. Available at: http://pan.oxfordjournals.org/content/8/1/83.abstract (Accessed: 18 February 2017).

About Percentiles (2008) Available at: https://atrium.lib.uoguelph.ca/xmlui/bitstream/handle/10214/1843/A_About_Percentages_and_Percentiles.pdf?sequence=7 (Accessed: 18 February 2017).

Markham, K. (2014) Simple guide to confusion matrix terminology. Available at: http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/ (Accessed: 18 February 2017).

F1 score (2016) in Wikipedia. Available at: https://en.wikipedia.org/wiki/F1_score (Accessed: 19 February 2017).

Baldi, P., Brunak, S., Chauvin, Y., Andersen, C. A. and Nielsen, H. (2000) ‘Assessing the accuracy of prediction algorithms for classification: an overview’, Bioinformatics, 16(5), pp. 412–424.

Vihinen, M. (2012) How to evaluate the performance of prediction methods? Measures and their interpretation in variation effect analysis. Available at: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-S4-S2 (Accessed: 21 February 2017).

 
