In the regression vs classification debate, I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in several situations.
- In active learning it is sometimes possible to find an e error classifier with just log(e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter spaces of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points.
- In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints while a squared loss regressor could not. For example, if there is one feature which determines whether a binary label has probability less than or greater than 0.5, a great classifier exists using just one feature. Because squared loss is sensitive to the exact probability, many more features may be required to learn well with respect to squared loss.