ROC vs. Accuracy vs. AROC

Foster Provost and I discussed the merits of ROC curves vs. accuracy estimation. Here is a quick summary of our discussion.

The “Receiver Operating Characteristic” (ROC) curve is an alternative to accuracy for evaluating learning algorithms on natural datasets. The ROC curve is a curve rather than a single-number statistic. In particular, this means that comparing two algorithms on a dataset does not always produce an obvious ordering.

Accuracy (= 1 – error rate) is a standard method used to evaluate learning algorithms. It is a single-number summary of performance.

AROC is the area under the ROC curve. It is a single-number summary of performance.
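
As a quick illustration (my own sketch, not part of the original discussion): AROC equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative one, which gives a direct way to compute it.

```python
def aroc(scores, labels):
    """Estimate AROC as the fraction of (positive, negative) pairs in which
    the positive example receives the higher score; ties count as 1/2.
    This equals the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A ranker that orders all positives above all negatives gets AROC = 1.
print(aroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0
```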

The comparison of these metrics is a subtle affair, because in machine learning, they are compared on different natural datasets. This makes some sense if we accept the hypothesis “Performance on past learning problems (roughly) predicts performance on future learning problems.”

The ROC vs. accuracy discussion is often conflated with “is the goal classification or ranking?” because ROC curve construction requires that a ranking be produced. Here, we assume the goal is classification rather than ranking. (There are several natural problems where ranking instances is much preferred to classification, and several natural problems where classification is the goal.)

Arguments for ROC:

  1. Ill-specification: The costs of choices are not well specified. The training examples are often not drawn from the same marginal distribution as the test examples. ROC curves allow for an effective comparison over a range of different choice costs and marginal distributions.
  2. Ill-dominance: Standard classification algorithms do not have a dominance structure as the costs vary. We should not say “algorithm A is better than algorithm B” when the choice costs are not known well enough to be sure.
  3. Just-in-Time use: Any system with a good ROC curve can easily be designed with a ‘knob’ that controls the rate of false positives vs. false negatives (a sketch of such a knob appears after this list).
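
Here is a minimal sketch of such a knob (a hypothetical function, not from the discussion): sweep the threshold on a ranker's scores until a desired false positive rate is met.

```python
def threshold_for_fp_rate(scores, labels, max_fp_rate):
    """Pick a score threshold whose false positive rate on held-out data is
    at most max_fp_rate; classify as positive when score >= threshold."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0),
                       reverse=True)
    allowed = int(max_fp_rate * len(negatives))  # negatives we may misclassify
    if allowed >= len(negatives):
        return float("-inf")  # every example may be called positive
    return negatives[allowed] + 1e-9  # just above the first disallowed negative

# Turning the knob: a stricter false positive budget yields a higher threshold.
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(threshold_for_fp_rate(scores, labels, 1 / 3))  # ~0.3: one FP allowed
print(threshold_for_fp_rate(scores, labels, 0.0))    # ~0.6: no FPs allowed
```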

AROC inherits the arguments of ROC except for Ill-dominance.

Arguments for AROC:

  1. Summarization: Humans don’t have the time to understand the complexities of a conditional comparison, so having a single number instead of a curve is valuable.
  2. Robustness: Algorithms with a large AROC are robust against variation in costs.

Accuracy is the traditional approach.

Arguments for Accuracy:

  1. Summarization: As for AROC.
  2. Intuitiveness: People understand immediately what accuracy means. Unlike (A)ROC, it’s obvious what happens when one additional example is classified incorrectly.
  3. Statistical Stability: The basic test set bound shows that accuracy is stable subject only to the IID assumption (one standard form of the bound is given after this list). For AROC (and ROC), this is only true when the number of examples in each class is not near zero.
  4. Minimality: In the end, a classifier makes classification decisions. Accuracy directly measures this, while (A)ROC compromises this measure with hypothetical alternate choice costs. For the same reason, computing (A)ROC may require significantly more work than solving the problem.
  5. Generality: Accuracy generalizes easily to multiclass accuracy, importance weighted accuracy, and general (per-example) cost sensitive classification. ROC curves become problematic with even 3 classes.
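
For concreteness, here is one standard form of the test set bound (via Hoeffding's inequality; this is a standard fact rather than anything specific to the discussion above):

```latex
% With probability at least 1 - \delta over an IID test set of m examples,
% the true error rate of classifier h is bounded by its observed error rate:
\[
\mathrm{err}(h) \;\le\; \widehat{\mathrm{err}}(h) \;+\; \sqrt{\frac{\ln(1/\delta)}{2m}}
\]
```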

The Just-in-Time argument seems to be the strongest for (A)ROC. One way to rephrase this argument is “Lack of knowledge of relative costs means that classifiers should be rankers so false positive to false negative ratios can be easily altered.” In other words, this is an argument for “ranking instead of classification” rather than “(A)ROC instead of Accuracy”.

Conferences, Dates, Locations

Conference Location Date
COLT Bertinoro, Italy June 27-30
AAAI Pittsburgh, PA, USA July 9-13
UAI Edinburgh, Scotland July 26-29
IJCAI Edinburgh, Scotland July 30 – August 5
ICML Bonn, Germany August 7-11
KDD Chicago, IL, USA August 21-24

The big winner this year is Europe. This is partly a coincidence, and partly due to the general internationalization of science over the last few years. With cuts to basic science in the US and increased hassle for visitors, conferences outside the US become more attractive. Europe and Australia/New Zealand are the immediate winners because they have the science, infrastructure, and English in place. China and India are possible future winners.

Intuitions from applied learning

Since learning is far from an exact science, it’s good to pay attention to basic intuitions of applied learning. Here are a few I’ve collected.

  1. Integration: In Bayesian learning, the posterior is computed by an integral, and the optimal thing to do is to predict according to this integral. This phenomenon seems far more general: you can average over many different classification predictors to improve performance (a toy sketch appears after this list). Bagging, Boosting, SVMs, and Neural Networks all take advantage of this idea to some extent. Sources: Zoubin, Caruana
  2. Differentiation: The different pieces of an average should differentiate from each other, by whatever method, to achieve good performance. This is known as the ‘symmetry breaking’ problem for neural networks, and it’s why weights are initialized randomly. Boosting explicitly attempts to achieve good differentiation by creating new, different learning problems. Sources: Yann LeCun, Phil Long
  3. Deep Representation: Having a deep representation is necessary for having a good general learner. Decision Trees and Convolutional neural networks take advantage of this. SVMs get around it by allowing the user to engineer knowledge into the kernel. Boosting and Bagging rely on another algorithm for this. Sources: Yann LeCun
  4. Fine Representation of Bias: Many learning theory applications use only a coarse representation of bias, such as “function in the hypothesis class or not”. In practice, significantly better performance is achieved from a more finely tuned bias. Bayesian learning has this built in with a prior. Other techniques can take advantage of it as well. Sources: Zoubin, personal experience.
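
A toy sketch of the integration intuition (my illustration, not from any of the cited sources): averaging many imperfect, differing predictors can beat each individual one.

```python
import random

random.seed(0)

def noisy_predictor(x, noise=0.3):
    """A weak predictor of sign(x): flips its vote with probability `noise`."""
    truth = 1 if x > 0 else -1
    return -truth if random.random() < noise else truth

def averaged_prediction(x, n_predictors=25):
    """Average (here: majority vote) over many independent weak predictors."""
    votes = sum(noisy_predictor(x) for _ in range(n_predictors))
    return 1 if votes > 0 else -1

xs = [random.uniform(-1, 1) for _ in range(1000)]
single_acc = sum(noisy_predictor(x) == (1 if x > 0 else -1) for x in xs) / len(xs)
avg_acc = sum(averaged_prediction(x) == (1 if x > 0 else -1) for x in xs) / len(xs)
print(single_acc, avg_acc)  # averaging sharply reduces the error rate
```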

If you have others, please comment on them.

The State of the Reduction

What? Reductions are machines which turn solvers for one problem into solvers for another problem.
Why? Reductions are useful for several reasons.

  1. Laziness. Reducing a problem to classification makes at least 10 learning algorithms available to solve it. Inventing 10 learning algorithms is quite a bit of work. Similarly, programming a reduction is often trivial, while programming a learning algorithm is a great deal of work.
  2. Crystallization. The problems we often want to solve in learning are worst-case impossible but average-case feasible. By reducing all problems onto one or a few primitives, we can fine-tune these primitives to perform well on real-world problems, with greater precision due to the greater number of problems to validate on.
  3. Theoretical Organization. By studying what reductions are easy vs. hard vs. impossible, we can learn which problems are roughly equivalent in difficulty and which are much harder.

What we know now.

Typesafe reductions. In the beginning, there was the observation that every complex object on a computer can be written as a sequence of bits. This observation leads to the notion that a classifier (which predicts a single bit) can be used to predict any complex object (a minimal sketch appears after the list below). Using this observation, we can make the following statements:

  1. Any prediction problem which can be broken into examples can be solved with a classifier.
  2. In particular, reinforcement learning can be decomposed into examples given a generative model (see Lagoudakis & Parr and Fern, Yoon, & Givan).
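
A minimal sketch of this bit-by-bit construction (hypothetical helper names; any binary learner can be plugged in): predict a label in {0, ..., 2^n - 1} by predicting each bit of its binary encoding with a separate classifier.

```python
def train_bitwise(binary_learner, examples, n_bits):
    """Reduce a complex prediction problem to n_bits binary problems:
    classifier i is trained to predict bit i of the label's encoding.
    `binary_learner` maps a list of (x, bit) pairs to a function x -> 0/1."""
    classifiers = []
    for i in range(n_bits):
        bit_examples = [(x, (y >> i) & 1) for x, y in examples]
        classifiers.append(binary_learner(bit_examples))
    return classifiers

def predict_bitwise(classifiers, x):
    """Reassemble the predicted complex object from its predicted bits."""
    return sum(f(x) << i for i, f in enumerate(classifiers))
```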

This observation also often doesn’t work well in practice, because each classifier is sometimes wrong, and so with many classifiers, at least one is often wrong.

Error Transform Reductions. Worrying about errors leads to the notion of robust reductions (= ways of using simple predictors such as classifiers to make complex predictions). Error correcting output codes were proposed in analogy with coding theory. These were analyzed in terms of error rates on training sets and general losses on training sets. The robustness can be (more generally) analyzed with respect to arbitrary test distributions, and algorithms optimized with respect to this notion are often very simple and yield good performance. Solving the created classification problems up to error rate e implies:

  1. Solving importance weighted classification up to error rate eN, where N is the expected importance. (Costing; a sketch of the underlying rejection sampling appears after this list)
  2. Solving multiclass classification up to error rate 4e using ECOC. (Error limiting reductions paper)
  3. Solving cost sensitive classification up to loss 2eZ, where Z is the sum of costs. (Weighted All Pairs algorithm)
  4. Finding a policy within expected reward (T+1)e/2 of the optimal policy for T-step reinforcement learning with a generative model. (RLgen paper)
  5. The same statement holds much more efficiently when the distribution of states of a near optimal policy is also known. (PSDP paper)
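
As a sketch of the rejection sampling idea behind item 1 (my paraphrase of the Costing idea; the names are illustrative): keep each example with probability proportional to its importance, then train an ordinary classifier on the accepted set.

```python
import random

def rejection_sample(weighted_examples):
    """Costing-style reduction: given (x, y, importance) triples, accept each
    example with probability importance / max_importance. An ordinary
    classifier trained on the accepted examples optimizes importance
    weighted accuracy in expectation."""
    w_max = max(w for _, _, w in weighted_examples)
    return [(x, y) for x, y, w in weighted_examples
            if random.random() < w / w_max]
```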

A new problem arises: sometimes the created subproblems are inherently hard, for example when estimating class probabilities with a classifier. In this situation, a guarantee of the form “good performance on the subproblems implies good performance on the original problem” is vacuous, because good performance on the subproblems is unachievable.

Regret Transform Reductions. To cope with this, we can analyze how performance minus the best possible performance (called “regret”; defined precisely after the list below) is transformed under a reduction. Solving the created binary classification problems up to regret r implies:

  1. Solving importance weighted classification up to regret rN, using the same algorithm as for errors. (Costing)
  2. Estimating class membership probabilities up to l_2 regret 2r. (Probing paper)
  3. Solving multiclass classification up to regret 4√r. (SECOC paper)
  4. Predicting costs in cost sensitive classification up to l_2 regret 4r. (SECOC again)
  5. Solving cost sensitive classification up to regret 4√(rZ), where Z is the sum of the costs of each choice. (SECOC again)
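
For reference, the regret used above is just achieved performance minus the best achievable performance; for binary error rate:

```latex
% Regret of classifier h: its error rate minus the best error rate
% achievable by any classifier.
\[
\mathrm{reg}(h) \;=\; \mathrm{err}(h) \;-\; \min_{h'} \mathrm{err}(h')
\]
```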

There are several reduction-related problems currently being worked on which I’ll discuss in the future.