Foster Provost and I discussed the merits of ROC curves vs. accuracy estimation. Here is a quick summary of our discussion.

The “Receiver Operating Characteristic” (ROC) curve is an alternative to accuracy for the evaluation of learning algorithms on natural datasets. The ROC curve is a *curve* and not a single number statistic. In particular, this means that the comparison of two algorithms on a dataset does not always produce an obvious order.

Accuracy (= 1 – error rate) is a standard method used to evaluate learning algorithms. It is a single-number summary of performance.

AROC is the area under the ROC curve. It is a single number summary of performance.

The comparison of these metrics is a subtle affair, because in machine learning, they are compared on different natural datasets. This makes some sense if we accept the hypothesis “Performance on past learning problems (roughly) predicts performance on future learning problems.”

The ROC vs. accuracy discussion is often conflated with “is the goal classification or ranking?” because ROC curve construction requires a ranking be produced. Here, we assume the goal is classification rather than ranking. (There are several natural problems where ranking of instances is much preferred to classification. In addition, there are several natural problems where classification is the goal.)

Arguments for ROC | Explanation |

Ill-specification | The costs of choices are not well specified. The training examples are often not drawn from the same marginal distribution as the test examples. ROC curves allow for an effective comparison over a range of different choice costs and marginal distributions. |

Ill-dominance | Standard classification algorithms do not have a dominance structure as the costs vary. We should not say “algorithm A is better than algorithm B” when you don’t know the choice costs well enough to be sure. |

Just-in-Time use | Any system with a good ROC curve can easily be designed with a ‘knob’ that controls the rate of false positives vs. false negatives. |

AROC inherits the arguments of ROC except for Ill-dominance.

Arguments for AROC | Explanation |

Summarization | Humans don’t have the time to understand the complexities of a conditional comparison, so having a single number instead of a curve is valuable. |

Robustness | Algorithms with a large AROC are robust against a variation in costs. |

Accuracy is the traditional approach.

Arguments for Accuracy | Explanation |

Summarization | As for AROC. |

Intuitiveness | People understand immediately what accuracy means. Unlike (A)ROC, it’s obvious what happens when one additional example is classified wrong. |

Statistical Stability | The basic test set bound shows that accuracy is stable subject to only the IID assumption. For AROC (and ROC) this is only true when the number in each class is not near zero. |

Minimality | In the end, a classifier makes classification decisions. Accuracy directly measures this while (A)ROC comprimises this measure with hypothetical alternate choice costs. For the same reason, computing (A)ROC may require significantly more work than solving the problem. |

Generality | Accuracy generalizes easily to multiclass accuracy, importance weighted accuracy, and general (per-example) cost sensitive classification. ROC curves become problematic when there are just 3 classes. |

The Just-in-Time argument seems to be the strongest for (A)ROC. One way to rephrase this argument is “Lack of knowledge of relative costs means that classifiers should be rankers so false positive to false negative ratios can be easily altered.” In other words, this is an argument for “ranking instead of classification” rather than “(A)ROC instead of Accuracy”.

Although the area under the ROC curve (AROC) is not an intuitive quantity in itself, I find that its interpretation as a Wilcoxon-Mann-Whitney statistic, which effectively measures the fraction of positive-negative instance pairs that are ranked correctly (discussed, for example, in Corinna Cortes and Mehryar Mohri’s paper), makes the quantity easier to understand. This interpretation also has other benefits; while generalizing ROC curves to more than two classes is not at all straightforward, the above interpretation facilitates graceful generalizations of the AROC statistic to multi-category ranking (forthcoming work with Shyamsundar Rajaram, Thore Graepel and Ralf Herbrich).

One important method not yet mentioned in the present discussion is the elegant work by Provost and Fawcett on the ROC Convex Hull as an alternative to both “vanilla” ROC curves and the Area Under Curve summary. Within the ROCCH framework classifers with highest expected utility have curves sitting on the convex hull of all the candidate classifers’ curves. Expected-cost-optimal regions of the hull’s upper boundary (parametrized by gradient) are related to the practioner’s belief about utility and class priors.

Although I am no longer up-to-date with what is going on in the ROC world, I do believe that there are researchers looking at issues such as putting confidence bands on these curves. This seems especially important for the ROCCH method, as forming the convex hull is equivalent to a kind of pointwise maximum, which would seem to be an unstable thing to do with respect to changes in the set defining “truth” (ie. the set used to form the ROC curves in the first place).

Some additional notes, more or less related to the thread:

a) A subtle and interesting difference between AROC evaluations and

evaluations based on most “standard” loss functions (including 0/1

loss, squared-error, “cost-sensitive classification”, etc.) is that

all the standard loss functions can be evaluated for each

(example) independently of the others. AROC is defined only for

a set of examples.

b) One neat use of AROC is as a base-rate-independent version of the

Bayes rate. Specifically, data sets cannot be compared directly in

terms of their Bayes rates when their base rates differ (by base rate

I mean the typical notion of the marginal/unconditional probability of

the most probable class). However, their “optimal” AROCs (or

estimates thereof) could be compared directly, as notions of how

separable the classes are.

Here’s an example, showing (it seems) how the data sets on which two

different standard learners excel can be separated pretty cleanly by

such a measure.

c) Hand and Till have extended AROC to multiclass problems:

Hand D.J. and Till R.J. (2001) A simple generalisation of the area

under the ROC curve for multiple class classification

problems. Machine Learning, 45, 171-186.

d)

Sofus Macskassy has been studying confidence bounds for ROC

curves, and has some technical reports on the subject.

e) Webb

and Ting have an interesting criticism of the oft-cited notion

that “ROC curves are invariant to changes in the marginal/prior

class distribution”:

Webb, G. I. and K.M. Ting (2005). On the Application of ROC Analysis

to Predict Classification Performance Under Varying Class

Distributions. Machine Learning, 58(1):25-32.