What are the disadvantages of accuracy?

I have been reading about evaluating a model with accuracy alone and I have found some disadvantages. Among them, I read that it treats all errors as equal. How could this problem be solved? Maybe by assigning a cost to each type of error? Thank you very much for your help.

Topic: machine-learning-model, evaluation, accuracy, machine-learning

Category: Data Science


Whether accuracy is a good metric depends on the needs of the application. For instance, here is an example by @benoit_sanchez:

You own an egg shop and each egg you sell generates a net revenue of 2 dollars. Each customer who enters the shop may either buy an egg or leave without buying any. For some customers you can decide to offer a discount; you will then only get 1 dollar of revenue, but the customer will always buy.

You plug in a webcam that analyses the customer's behaviour with features such as "sniffs the eggs" or "holds a book of omelette recipes", and classifies each customer as "wants to buy at 2 dollars" (positive) or "wants to buy only at 1 dollar" (negative) before they leave.

If your classifier makes no mistake, then you get the maximum revenue you can expect. If it's not perfect, then:

  • for every false positive you lose 1 dollar, because the customer leaves and you didn't try the discount that would have worked

  • for every false negative you lose 1 dollar, because you make a useless discount

Then the accuracy of your classifier is exactly how close you are to the maximum revenue. It is the perfect measure.
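As a quick check of the arithmetic, here is a minimal sketch (the customer counts are made up for illustration): the revenue lost relative to a perfect classifier is exactly the number of errors, so revenue is an affine function of accuracy.

```python
# Egg-shop example: revenue lost vs. a perfect classifier equals the
# number of misclassified customers (illustrative counts, not real data).

TP = 70   # predicted "buys at 2 dollars", and would indeed pay 2
TN = 20   # predicted "needs a discount", and indeed only buys at 1
FP = 6    # no discount offered to a discount-only customer, who leaves
FN = 4    # useless discount given to a customer who would have paid 2

n = TP + TN + FP + FN
max_revenue = 2 * (TP + FN) + 1 * (TN + FP)        # perfect classifier
revenue     = 2 * TP + 1 * TN + 0 * FP + 1 * FN    # actual classifier
accuracy    = (TP + TN) / n

assert max_revenue - revenue == FP + FN            # each error costs exactly 1 dollar
print(f"accuracy = {accuracy:.2f}, revenue shortfall = {FP + FN} dollars")
```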

See also my answer to a related question on the statistics Stack Exchange.

I disagree with the other answers about accuracy and imbalance. If imbalance is a problem, just look at the improvement in accuracy over always guessing the most common class; it really isn't a big deal. Something like:

$$\Delta_\mathrm{acc} = \frac{\mathrm{acc} - \pi}{1 - \pi}$$

where $\pi$ is the proportion of the data belonging to the majority class. This is the proportion of the "residual" accuracy $1 - \pi$ that is captured by the model beyond what is obtained from the class labels alone. Note this is an affine transformation of accuracy, so it is still measuring the same thing, just in a slightly more interpretable manner. In @Dave's example, $\Delta_\mathrm{acc}$ would be negative, which would be a clear indication that the model was worse than useless.
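As a quick illustration, here is a small sketch of this rescaling (the numbers are made up):

```python
# Rescaled accuracy: the fraction of the residual accuracy (1 - pi)
# captured by the model, where pi is the majority-class proportion.

def delta_acc(acc: float, pi: float) -> float:
    return (acc - pi) / (1 - pi)

print(delta_acc(0.98, 0.99))    # -1.0: worse than always guessing the majority class
print(delta_acc(0.995, 0.99))   #  0.5: captures half of the remaining headroom
```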

If different misclassifications have different costs, then you can use the expected loss, i.e. a weighted accuracy where each error is weighted by its cost. Investigate "cost-sensitive learning" or "Bayesian decision theory" for more information.
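For example, here is a minimal sketch of an expected-loss calculation with an illustrative cost matrix (the costs and labels below are made up, not from any particular application):

```python
import numpy as np

# cost[true_class, predicted_class]: correct decisions cost nothing;
# here a false negative is assumed to be ten times worse than a false positive.
cost = np.array([[0.0,  1.0],
                 [10.0, 0.0]])

y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0])

expected_loss = cost[y_true, y_pred].mean()   # average cost per instance
print(f"expected loss per instance: {expected_loss:.2f}")
```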

Basically, to decide what metric to use, you need to be clear about exactly what is important for your application, and choose the metric (or metrics) accordingly. For "normal" applications, I tend to use accuracy (or expected loss in a cost-sensitive setting) to directly measure the quality of the decisions, the area under the ROC curve (which measures the quality of the ranking of patterns), and cross-entropy/predictive information or the Brier score as measures of the general calibration of the probability estimates. The mistake is to think there is a "one-size-fits-all" solution to the problem; it is best to think hard about which metrics are appropriate for each application and why.
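As a rough illustration, here is a sketch of reporting those metrics side by side, assuming scikit-learn and a synthetic imbalanced binary problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss, brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

print("accuracy :", accuracy_score(y_te, proba >= 0.5))  # quality of hard decisions
print("ROC AUC  :", roc_auc_score(y_te, proba))          # quality of the ranking
print("log loss :", log_loss(y_te, proba))               # quality of the probabilities
print("Brier    :", brier_score_loss(y_te, proba))       # quality of the probabilities
```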


A common complaint about accuracy is that it fails when the classes are imbalanced. For instance, if you get an accuracy of $98\%$, that sounds like a high $\text{A}$ in school, so you might be pretty happy with your performance. However, if the class ratio is $99:1$, then you’re doing worse than you would by always guessing the majority class.

However, accuracy has issues even when the classes are naturally balanced. In many applications, different mistakes carry different costs. Accuracy also takes away your ability to play the odds. The typical threshold for a (binary) model that outputs probability values (logistic regression, neural nets, and others) is $0.5$. Accuracy makes $0.49$ and $0.51$ look like different categories, while $0.51$ and $0.99$ look the same. I'd be a lot more comfortable making a huge decision based on a probability of $0.99$ than on $0.51$! Accuracy masks this. In fact, any threshold-based metric, such as sensitivity, specificity, $F_1$, positive predictive value, or negative predictive value, masks the difference between $0.51$ and $0.99$.

Consequently, statisticians advocate for direct evaluation of the probability outputs of models, using metrics such as log loss (often called crossentropy in machine learning circles and sometimes negative log likelihood) and Brier score (pretty much mean squared error, with an unsurprising generalization in the multiclass setting).
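Here is a small sketch of the $0.51$ versus $0.99$ point: accuracy treats the two identically, while log loss and the Brier score do not (the labels and predicted probabilities are made up):

```python
import numpy as np

y_true = np.ones(4)   # four instances whose true class is 1

for name, p in [("confident", np.full(4, 0.99)), ("hesitant", np.full(4, 0.51))]:
    acc   = np.mean((p >= 0.5) == y_true)                                 # 1.0 in both cases
    nll   = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))   # log loss
    brier = np.mean((p - y_true) ** 2)                                    # Brier score
    print(f"{name}: accuracy={acc:.2f}, log loss={nll:.3f}, Brier={brier:.4f}")
```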

Vanderbilt’s Frank Harrell, the founder and former head of the Department of Biostatistics at their medical school, as well as a frequent user of the statistics Stack, has two good blog posts about the idea of predicting tendencies instead of categories and measuring success by evaluating the probability outputs of models.

  • Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules

  • Classification vs. Prediction


In general, the main disadvantage of accuracy is that it masks the issue of class imbalance. For example, if the data contains only 10% positive instances, a majority-baseline classifier which always assigns the negative label reaches 90% accuracy, since it correctly predicts 90% of the instances. But of course such a classifier is useless: it never identifies a single positive instance.

In realistic cases, a classifier tends to predict the majority class more often when it is not sure. Since accuracy just counts correct cases and most instances belong to the majority class, the accuracy score may be high even though the classifier doesn't distinguish the classes very well. Precision, recall and F-score offer a clearer and more complete picture of performance.
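As an illustration, here is a minimal sketch of the majority-baseline scenario above, assuming scikit-learn: accuracy comes out around 90%, while precision, recall and F-score for the positive class are all zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.10).astype(int)   # ~10% positive instances
y_pred = np.zeros_like(y_true)                   # majority baseline: always predict negative

print("accuracy :", accuracy_score(y_true, y_pred))
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1, zero_division=0)
print("precision:", prec, "recall:", rec, "F1:", f1)
```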
