For learning reductions we have been concentrating on reducing various complex learning problems to binary classification. This choice needs to be actively questioned, because it was not carefully considered.
Binary classification is learning a classifier c: X -> {0,1} so as to minimize the probability of being wrong, Pr_{x,y~D}(c(x) != y).
The primary alternative candidate seems to be squared error regression. In squared error regression, you learn a regressor s: X -> [0,1] so as to minimize squared error, E_{x,y~D}(s(x)-y)^2.
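To make the two objectives concrete, here is a minimal sketch of the two empirical losses; the data and the classifier/regressor `c` and `s` below are stand-ins of my own, not anything from a particular learning algorithm.

```python
import numpy as np

# Hypothetical IID sample from D: features X and binary labels y (stand-in data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=1000) > 0).astype(float)

def c(X):
    # Stand-in classifier: outputs a bit per example.
    return (X[:, 0] > 0).astype(float)

def s(X):
    # Stand-in regressor: outputs a value in [0, 1] per example.
    return 1.0 / (1.0 + np.exp(-3.0 * X[:, 0]))

# Classification objective: empirical estimate of Pr_{x,y~D}(c(x) != y).
zero_one_loss = np.mean(c(X) != y)

# Regression objective: empirical estimate of E_{x,y~D}(s(x) - y)^2.
squared_error = np.mean((s(X) - y) ** 2)

print(zero_one_loss, squared_error)
```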
It is difficult to judge one primitive against another. The judgement must at least partially be made on nontheoretical grounds because (essentially) we are evaluating a choice between two axioms/assumptions.
These two primitives are closely related. Classification can be reduced to regression in the obvious way: use the regressor to predict D(y=1|x), then threshold at 0.5. For this simple reduction, a squared error regret of r implies a classification regret of at most r^{0.5}. Regression can also be reduced to classification using the Probing algorithm. (This is much more obvious when you look at an updated proof.) Under this reduction, a classification regret of r implies a squared error regression regret of at most r.
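For the easy direction, here is a minimal sketch of classification via regression; the regressor `s` below is a stand-in for one learned to estimate D(y=1|x). The reverse reduction via Probing is not sketched, since it requires multiple calls to an underlying classifier.

```python
import numpy as np

def classify_via_regression(s, X):
    # Reduce classification to regression: predict 1 exactly when the
    # estimated conditional probability D(y=1|x) is at least 0.5.
    return (s(X) >= 0.5).astype(float)

# Stand-in regressor; in practice s would be learned from (x, y) samples.
def s(X):
    return 1.0 / (1.0 + np.exp(-X[:, 0]))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
print(classify_via_regression(s, X))
```

Note that this direction makes a single call to the regressor per prediction, which is the asymmetry referred to below when discussing computational inequivalence.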
Both primitives have a significant amount of prior work, with (perhaps) classification receiving more attention in the machine learning community and regression receiving more emphasis in the statistics community.
The (nonhistoric) reasons for preferring classification seem to be:
- Aesthetically, what could be a more natural primitive than predicting a single bit?
- Classification is (potentially) inherently more representationally concise. When translated into transistors, a classification algorithm might use fewer transistors than a regressor, simply because the natural representation is bits rather than reals (~= floats).
There are several reasons to consider regression over classification:
- Faster convergence. For a single squared error regressor, the rate of convergence of the estimate of squared error to the expected squared error goes as 1/m, assuming IID samples. For a single binary classifier, the rate of convergence of the estimate of binary loss to the expected binary loss varies between 1/m and 1/m^{0.5}, depending on the true error rate (see the Bernstein bound sketched after this list).
- There is a computational inequivalence between the primitives, as far as we know. In particular, the Probing algorithm must call a classifier several times in several ways to make a high precision regression prediction. On the other hand, classification via regression requires one call to the underlying regressor.
- Squared error regression learning is often easier than 0/1 classification learning. This is because the squared error loss is convex, while 0/1 loss is not. Note: this does not imply that the squared error regression learning problem is convex (it isn't for general regression algorithms); it just means that nonconvexity is not enforced by the loss function (see the numeric sketch after this list).
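One standard way to see the range quoted in the convergence bullet, for a single fixed classifier with true error rate p estimated by the empirical error on m IID samples, is Bernstein's inequality (a sketch of a known bound, stated up to exact constants, not taken from the argument above):

```latex
% With probability at least 1 - \delta, the empirical error \hat{p} of a fixed
% classifier with true error p on m IID samples satisfies (up to constants):
\[
  |\hat{p} - p| \;\le\; \sqrt{\frac{2\,p(1-p)\,\ln(2/\delta)}{m}} \;+\; \frac{2\ln(2/\delta)}{3m}.
\]
% When p is bounded away from 0 and 1, the first term dominates and the rate is
% 1/m^{0.5}; when p is near 0 or 1, the second term dominates and the rate is 1/m.
```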
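To illustrate the convexity bullet, here is a minimal numeric sketch; the particular predictions and label are arbitrary, chosen only to exhibit the midpoint inequality.

```python
# Squared error is convex in the prediction, while 0/1 loss of the thresholded
# prediction is a step function and hence is not.
y = 1.0  # a fixed label

def squared_loss(p):
    return (p - y) ** 2

def zero_one_loss(p):
    # Threshold the prediction at 0.5, then check correctness.
    return float((p >= 0.5) != bool(y))

p0, p1 = 0.1, 0.6
mid = 0.5 * (p0 + p1)  # 0.35

# Squared loss satisfies the midpoint (Jensen) inequality:
assert squared_loss(mid) <= 0.5 * (squared_loss(p0) + squared_loss(p1))

# 0/1 loss violates it for this pair (1.0 > 0.5), so it cannot be convex:
assert zero_one_loss(mid) > 0.5 * (zero_one_loss(p0) + zero_one_loss(p1))
```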
The mathematical evidence points toward squared error regression as a better primitive, although doubts still seem reasonable to entertain.