Computational Consequences of Classification

In the regression vs. classification debate, I’m adding a new “pro” to classification. There appear to be computational shortcuts available for classification which simply aren’t available for regression. This arises in at least two situations.

  1. In active learning it is sometimes possible to find an e-error classifier with only about log(1/e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that, on many examples, the loss of a learned classifier is flat with respect to large variations in its parameter space, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points. (See the first sketch after this list.)
  2. In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints while a squared loss regressor cannot. For example, if there is one feature which determines whether a binary label has probability less than or greater than 0.5, a great classifier exists using just that one feature. Because squared loss is sensitive to the exact probability, many more features may be required to learn well with respect to squared loss. (See the second sketch after this list.)
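
Here is a minimal sketch of the active learning point, assuming a one-dimensional threshold classifier, a noiseless labeling oracle, and a pool of unlabeled points (the function name and all numbers below are illustrative, not from the post): binary search over the pool halves the interval of uncertainty with each label query, so roughly log(1/e) labels suffice for error e, where passive labeling would need on the order of 1/e.

```python
import numpy as np

def active_learn_threshold(xs, oracle, eps):
    """Binary-search active learning for a 1-D threshold classifier.

    xs: sorted unlabeled pool in [0, 1]; oracle(x) returns the 0/1 label.
    Each label query halves the interval of uncertainty, so about
    log2(1/eps) labels suffice for error eps in the noiseless case.
    """
    lo, hi = 0, len(xs) - 1
    queries = 0
    while hi - lo > 1 and xs[hi] - xs[lo] > eps:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]) == 0:   # label 0: the true threshold lies to the right
            lo = mid
        else:                      # label 1: the true threshold lies to the left
            hi = mid
    return (xs[lo] + xs[hi]) / 2, queries

# Hypothetical noiseless problem: the label is 1 exactly when x > 0.37.
rng = np.random.default_rng(0)
pool = np.sort(rng.uniform(0, 1, 100_000))
threshold_hat, n_labels = active_learn_threshold(pool, lambda x: int(x > 0.37), eps=1e-3)
print(n_labels, threshold_hat)  # roughly log2(1/eps) labels queried, vs roughly 1/eps passively
```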

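Here is a small synthetic sketch of the budgeted learning point, assuming a hypothetical distribution in which feature 0 alone determines which side of 0.5 the label probability falls on while 19 other features only shift it within that side: thresholding feature 0 is already a Bayes-optimal classifier, but a squared loss regressor restricted to that single cheap feature keeps an excess squared loss that only the remaining, costlier features could remove.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200_000, 20
X = rng.uniform(-1, 1, size=(n, d))

# Hypothetical setup: feature 0 alone decides whether the label probability is
# above or below 0.5; the other 19 features only shift it within that side.
p = np.where(X[:, 0] > 0, 0.75, 0.25) + 0.2 * np.tanh(X[:, 1:].sum(axis=1))
y = rng.binomial(1, p)

# Classification with the single cheap feature: predict 1 iff feature 0 > 0.
# This is already the Bayes-optimal classifier.
clf_pred = (X[:, 0] > 0).astype(int)
print("one-feature classifier error:", np.mean(clf_pred != y))
print("Bayes error:                 ", np.mean(np.minimum(p, 1 - p)))

# Squared loss regression restricted to feature 0: the best it can do is
# predict 0.75 or 0.25, so the variation driven by the other 19 features
# remains as excess squared loss on top of the irreducible label noise.
reg_pred = np.where(X[:, 0] > 0, 0.75, 0.25)
print("one-feature regressor excess squared loss:", np.mean((reg_pred - p) ** 2))
```
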
5 Replies to “Computational Consequences of Classification”

  1. Actually, in regression it is possible to get an arbitrarily fast polynomial rate under certain conditions. Basically, if the regression function is infinitely differentiable and there is no noise, one can use polynomial approximations of increasing degree to get arbitrarily fast convergence. In classification, even without noise, this is usually impossible.

    misha b

  2. Can you define this a bit more? What is polynomial in what? And under what conditions?

  3. Suppose we have an infinitely differentiable function (without noise) on, say, an interval. Then given its values at points x_1, …, x_n, the function can be reconstructed up to squared loss error C(t)(1/n)^t, where t > 0 is any number and the constant C depends on t. One can choose t as a slowly increasing function of n, thus getting a rate better than any fixed polynomial.

    misha b
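
A quick numerical sketch of this rate, assuming Chebyshev interpolation as the polynomial scheme and the analytic, noiseless target exp on [-1, 1] (both are illustrative choices): the squared loss of a degree n-1 fit through n samples drops faster than any fixed power of 1/n.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

f = np.exp                       # infinitely differentiable target, no noise
grid = np.linspace(-1, 1, 5000)  # grid on which squared loss is measured

for n in [4, 8, 16, 32]:
    # n noiseless samples at Chebyshev nodes, fit with a degree n-1 polynomial
    nodes = np.cos((2 * np.arange(n) + 1) * np.pi / (2 * n))
    coef = cheb.chebfit(nodes, f(nodes), deg=n - 1)
    mse = np.mean((cheb.chebval(grid, coef) - f(grid)) ** 2)
    print(f"n = {n:2d}   squared loss = {mse:.2e}")  # decays faster than any (1/n)^t
```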

  4. I see. I’m not sure how interesting the noiseless case is.

    In general, there are a number of examples where squared loss minimization is computationally cheaper than 0/1 loss minimization. What I’m trying to say above is that there are some examples where the tables turn dramatically.

  5. My impression (correct me if I am mistaken) is that an exponential rate for active learning is also only achieved in the noiseless case.

    I think it is quite interesting to compare regression and classification.
    As you point out, there are a number of scenarios where one seems to work much better than the other.
