For learning reductions we have been concentrating on reducing various complex learning problems to binary classification. This choice needs to be actively questioned, because it was not carefully considered.
Binary classification is learning a classifier c:X -> {0,1} so as to minimize the probability of being wrong, Pr_{(x,y)~D}(c(x) ≠ y).
The primary alternative candidate seems to be squared error regression. In squared error regression, you learn a regressor s:X -> [0,1] so as to minimize the squared error, E_{(x,y)~D}(s(x)-y)^2.
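To make the two objectives concrete, here is a minimal sketch (the toy data and the predictors `c` and `s` are hypothetical stand-ins for learned models, not part of the original discussion) that computes the empirical version of each loss:

```python
import numpy as np

def zero_one_loss(c, X, y):
    """Empirical estimate of Pr_{(x,y)~D}(c(x) != y) for a classifier c: X -> {0,1}."""
    return np.mean(c(X) != y)

def squared_error(s, X, y):
    """Empirical estimate of E_{(x,y)~D}(s(x) - y)^2 for a regressor s: X -> [0,1]."""
    return np.mean((s(X) - y) ** 2)

# Toy data and toy predictors (stand-ins for learned models).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=1000) > 0).astype(int)

c = lambda X: (X[:, 0] > 0).astype(int)              # a classifier X -> {0,1}
s = lambda X: 1.0 / (1.0 + np.exp(-X[:, 0] / 0.3))   # a regressor X -> [0,1]

print(zero_one_loss(c, X, y), squared_error(s, X, y))
```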
It is difficult to judge one primitive against another. The judgement must at least partially be made on nontheoretical grounds because (essentially) we are evaluating a choice between two axioms/assumptions.
These two primitives are significantly related. Classification can be reduced to regression in the obvious way: you use the regressor to predict D(y=1|x), then threshold at 0.5. For this simple reduction, a squared error regret of r implies a classification regret of at most r^0.5. Regression can also be reduced to classification using the Probing algorithm. (This is much more obvious when you look at an updated proof.) Under this reduction, a classification regret of r implies a squared error regression regret of at most r.
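The classification-via-regression direction is simple enough to sketch directly. Assuming a regressor `s` estimating D(y=1|x) (as in the sketch above), the induced classifier is just a threshold at 0.5:

```python
import numpy as np

def classify_via_regression(s, X):
    """Reduce classification to regression: threshold the estimated
    conditional probability s(x) ~ D(y=1|x) at 0.5. If s has squared
    error regret r, this classifier has classification regret at most r^0.5."""
    return (np.asarray(s(X)) >= 0.5).astype(int)
```

The other direction (Probing) is not a single call like this; as noted below, it must query the classifier several times to produce one high-precision regression prediction.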
Both primitives enjoy a significant amount of prior work, with (perhaps) classification receiving more attention in the machine learning community and regression more emphasis in the statistics community.
The (nonhistoric) reasons for preferring classification seem to be:
- Aesthetically, what could be a more natural primitive than predicting a single bit?
- Classification is (potentially) inherently more representationally concise. When translated into transistors, a classification algorithm might use fewer transistors than a regressor, simply because the natural representation is bits rather than reals (~= floats).
There are several reasons to consider regression over classification:
- More uniform convergence. For a single squared error regressor, the rate of convergence of the estimate of squared error to the expected squared error goes as 1/m, assuming IID samples. For a single binary classifier, the rate of convergence of the estimate of binary loss to the expected binary loss varies between 1/m and 1/m^0.5.
- There is a computational inequivalence between the primitives, as far as we know. In particular, the Probing algorithm must call a classifier several times in several ways to make a high precision regression prediction. On the other hand, classification via regression requires one call to the underlying regressor.
- Squared error regression learning is often easier than 0/1 classification learning, because the squared error loss is convex while the 0/1 loss is not (a small numeric check of this follows the list). Note: this does not imply that squared error regression learning is a convex optimization problem (it isn't for general regression algorithms). Instead, it just means that nonconvexity is not enforced by the loss function.
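Here is the numeric check of the convexity point (a sketch; the grid of candidate predictions is arbitrary): as a function of the real-valued prediction p, the squared loss never violates midpoint convexity, while a thresholded 0/1 loss does.

```python
import numpy as np

def squared_loss(p, y):
    return (p - y) ** 2

def thresholded_01_loss(p, y):
    return float((p >= 0.5) != y)

def midpoint_convexity_violations(loss, y, grid):
    """Count prediction pairs (p1, p2) where the loss at the midpoint
    exceeds the average of the losses -- a violation of convexity in p."""
    violations = 0
    for p1 in grid:
        for p2 in grid:
            mid = 0.5 * (p1 + p2)
            if loss(mid, y) > 0.5 * (loss(p1, y) + loss(p2, y)) + 1e-12:
                violations += 1
    return violations

grid = np.linspace(0.0, 1.0, 101)
print(midpoint_convexity_violations(squared_loss, 1, grid))        # 0: convex in p
print(midpoint_convexity_violations(thresholded_01_loss, 1, grid)) # > 0: not convex
```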
The mathematical evidence points toward squared error regression as a better primitive, although doubts still seem reasonable to entertain.
Ah. How far we’ve come since the days when my NIPS paper on Regularized Least-Squares Classification was summarily rejected because classifying using the square loss is a ridiculous idea that no sane person would consider.
Have you seen this paper by Buja, Stuetzle and Shen?
No, I hadn’t (thanks). The discussion of the relationship to importance weighted misclassification particularly interests me because it sounds like a generalization of the Probing reduction/algorithm/analysis for squared loss.
All of the other proper scoring rules I know are less desirable than squared error because the convergence rate of their deviations is worse (or in the case of log loss, nonexistent).
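The log loss caveat is simply unboundedness (a standard observation rather than anything specific to this thread):

-log s(x) → ∞ as s(x) → 0 (when y = 1),

so Hoeffding-style deviation bounds do not apply unless the predictions are truncated away from 0 and 1.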
An old reduction result is called "plug-in estimation". This is a reduction from classification to regression: the classifier is obtained by regressing to estimate the posterior and thresholding the estimate. The result upper bounds the excess loss of the resulting classifier in terms of the loss of the regression estimate.
It can be found e.g. in Section 6.7 of the book by Devroye et al.
The section is titled "Classification is easier than regression function estimation", which, in my opinion, is somewhat misleading, since what it really tells me is that the rates available to your regression method can be achieved for free in classification. If you are lucky, you can even get smaller errors in classification than in regression, so why not pursue this approach?
My guess is that classification is harder, as was noted here as well. However, this does not mean that regression-based approaches won't work there. Classification and regression are, however, fundamentally different (they optimize different, though related, criteria). In particular, in regression you get fast rates if e.g. you assume extra smoothness of the regression function. In classification, fast rates come when the decision boundary is smooth, or under the margin assumption (the probability that the posterior function, evaluated at the independent variable, takes a value closer to 1/2 than some value t should decay like a large power of t).
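For reference, the margin assumption mentioned above is usually written in the following form (the standard statement, with η(x) = Pr(y=1|x) the posterior, C a constant, and α > 0 the exponent):

Pr_x( |η(x) - 1/2| ≤ t ) ≤ C t^α for all sufficiently small t > 0,

with larger α corresponding to the "large power of t" and to faster rates.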
Actually, for some time it was believed (?) that the direct approach to classification is better, since there were results showing that fast rates are possible for certain direct algorithms under e.g. the margin assumption, and these fast rates were not available for plug-in rules. However, since then we have learned that plug-in estimation is actually not inferior; see the paper by Audibert and Tsybakov, downloadable here!