Optimal Proxy Loss for Classification

Many people in machine learning take advantage of the notion of a proxy loss: a loss function which is much easier to optimize computationally than the loss function imposed by the world. A canonical example is when we want to learn a weight vector w and predict according to a dot product f_w(x) = sum_i w_i x_i, where optimizing squared loss (y - f_w(x))^2 over many samples is much more tractable than optimizing 0-1 loss I(y ≠ Threshold(f_w(x) - 0.5)).

While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? The answer of course depends on what the real loss imposed by the world is. For 0-1 loss classification, there are adherents to many choices:

  1. Log loss. If we confine the prediction to [0,1], we can treat it as a predicted probability that the label is 1, and measure loss according to log(1/p'(y|x)), where p'(y|x) is the predicted probability of the observed label. A standard method for confining the prediction to [0,1] is logistic regression, which exponentiates the dot product and normalizes.
  2. Squared loss. The squared loss approach (discussed above) is also quite common. It shares the same “proper scoring rule” semantics as log loss: the optimal representation-independent predictor is the conditional probability of the label y given the features x.
  3. Hinge loss. For hinge loss, you optimize max(0, 1 - 4(y - 0.5)(f_w(x) - 0.5)). The form of hinge loss is slightly unfamiliar because the label is {0,1} rather than {-1,1}. The optimal prediction for hinge loss is not the probability of y given x, but rather some value which is at least 1 if the most likely label is 1 and at most 0 if the most likely label is 0. Hinge loss was popularized with support vector machines. Hinge loss is not a proper scoring rule for the mean, but since it does get the sign right, using it for classification is reasonable. All three losses are sketched in code below.
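As a concrete reference, here is a minimal sketch of the three proxy losses for labels in {0,1} (the function names are mine, not anything standard from the discussion above):

```python
import numpy as np

def log_loss(y, p):
    # p is the predicted probability that the label is 1; loss is log(1/p'(y|x))
    return -np.log(p) if y == 1 else -np.log(1.0 - p)

def squared_loss(y, f):
    # proper scoring rule: minimized in expectation at f = p(y=1|x)
    return (y - f) ** 2

def hinge_loss(y, f):
    # {0,1}-label form; equivalent to max(0, 1 - y'f') after mapping y and f to the {-1,1} scale
    return max(0.0, 1.0 - 4.0 * (y - 0.5) * (f - 0.5))
```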

Many people have made qualitative arguments about why one loss is better than another. For example see Yaroslav’s old post for an argument about the comparison of log loss and hinge loss and why hinge loss might be better. In the following, I make an elementary quantitative argument.

Log loss is qualitatively dissimilar from the other two, because it is unbounded on the range of interest. Restated, there is no reason other than representational convenience that f_w(x) needs to take a value outside of the interval [0,1] for squared loss or hinge loss. In fact, we can freely reduce these losses by considering instead the function f_w′(x) = max(0, min(1, f_w(x))). The implication is that optimization of log loss can be unstable in ways that optimization of these other losses is not. This can be stated precisely by noting that sample complexity bounds (simple ones here) for 0-1 loss hold for f_w′(x) under squared or hinge loss, but the same theorem statement does not hold for log loss without additional assumptions. Since stability and convergence are of substantial interest in machine learning, this suggests not using log loss.
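A quick numerical check of the clipping claim, reusing the loss functions sketched above (the test values are arbitrary):

```python
def clip01(f):
    # project a raw prediction onto [0,1]
    return max(0.0, min(1.0, f))

# clipping never increases squared or hinge loss when labels lie in {0,1}
for y in (0, 1):
    for f in (-0.7, 0.2, 0.5, 1.4):
        assert squared_loss(y, clip01(f)) <= squared_loss(y, f)
        assert hinge_loss(y, clip01(f)) <= hinge_loss(y, f)
```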

For further analysis, we must first define some function converting f_w′(x) into label predictions. The only reasonable approach is to threshold at 0.5. For log loss and squared loss, any other threshold is inconsistent. Since the optimal predictor for hinge loss always takes value 0 or 1, there is some freedom in how we convert, but a reasonable approach is to also threshold at 0.5.

Now, we want to analyze the stability of predictions. In other words, if an adversary picks the true conditional probability distribution p(y|x) and the prediction f_w′(x), how does the proxy loss of f_w′(x) bound the 0-1 loss? Since we imagine that the conditional distribution is noisy, it’s important to actually consider a regret: the loss our predictor incurs minus the loss of the best possible predictor.

For each of these losses, an optimal strategy of the adversary is to have p(y|x) take value 0.5 - eps and f_w′(x) = 0.5. The 0-1 regret induced is simply 2 eps, since the best possible predictor has error rate 0.5 - eps while the actual predictor has error rate 0.5 + eps. For the hinge loss as defined above, the regret is 2 eps, and for squared loss the regret is eps^2. Doing some algebra, this implies that hinge_regret bounds 0-1 regret while 2 squared_regret^0.5 bounds 0-1 regret. Since we are only interested in regrets less than 1, the square root is undesirable, and hinge loss is preferred, because a stronger convergence of squared loss is needed to achieve the same guarantee on 0-1 loss.
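A brute-force check of these regrets, reusing the loss functions from the sketch above (eps = 0.1 and the grid resolution are arbitrary choices; prediction-to-label conversion is the 0.5 threshold):

```python
import numpy as np

eps = 0.1
q = 0.5 - eps  # p(y=1|x): the best label prediction is 0

def expected_loss(loss, f):
    return q * loss(1, f) + (1 - q) * loss(0, f)

for loss in (squared_loss, hinge_loss):
    best = min(expected_loss(loss, f) for f in np.linspace(0.0, 1.0, 10001))
    regret = expected_loss(loss, 0.5) - best
    print(loss.__name__, round(regret, 4))
# squared_loss 0.01   (= eps^2)
# hinge_loss   0.2    (= 2*eps, matching the 0-1 regret of 2*eps)
```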

Can we improve on hinge loss? I don’t know any proxy loss which is quantitatively better, but generalizations exist. The regret of hinge loss is, up to a factor of 2, the same as for absolute value loss |y - f_w′(x)|, since for labels in {0,1} and predictions in [0,1] the hinge loss above is exactly twice the absolute value loss. One advantage of absolute value loss is that it has a known and sometimes useful semantics for values between 0 and 1: the optimal prediction is the median. This makes the work on quantile regression (Two Three) seem particularly relevant for machine learning.
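To illustrate the quantile semantics: absolute value loss is (up to scaling) the τ = 0.5 case of the pinball loss used in quantile regression, whose minimizer in expectation is the τ-th quantile. A small sketch, with an arbitrary distribution and search grid:

```python
import numpy as np

def pinball_loss(y, f, tau):
    # quantile regression loss; tau = 0.5 gives 0.5*|y - f|, minimized at the median
    return tau * np.maximum(0.0, y - f) + (1 - tau) * np.maximum(0.0, f - y)

rng = np.random.default_rng(0)
ys = rng.exponential(size=100_000)
fs = np.linspace(0.0, 5.0, 501)
best_f = fs[int(np.argmin([pinball_loss(ys, f, 0.9).mean() for f in fs]))]
print(best_f, np.quantile(ys, 0.9))  # both close to -ln(0.1) ~ 2.30
```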

Key Scientific Challenges

Yahoo announced the Key Scientific Challenges program. There is a Machine Learning list I worked on and a Statistics list which Deepak worked on.

I’m hoping this is taken quite seriously by graduate students. The primary value is that it gave us a chance to sit down and publicly specify directions of research which would be valuable to make progress on. A good strategy for a beginning graduate student is to pick one of these directions, pursue it, and make substantial advances for a PhD. The directions are sufficiently general that I’m sure any serious advance has applications well beyond Yahoo.

A secondary point (which I’m sure is primary for many 🙂) is that there is money for graduate students here. It’s unrestricted, so you can use it for any reasonable travel, supplies, etc.

Nearly all natural problems require nonlinearity

One conventional wisdom is that learning algorithms with linear representations are sufficient to solve natural learning problems. This conventional wisdom appears unsupported by empirical evidence as far as I can tell. In nearly all vision, language, robotics, and speech applications I know where machine learning is effectively applied, the approach involves either a linear representation on hand crafted features capturing substantial nonlinearities or learning directly on nonlinear representations.

There are a few exceptions to this—for example, if the problem of interest to you is predicting the next word given previous words, n-gram methods have been shown effective. Viewed the right way, n-gram methods are essentially linear predictors on an enormous sparse feature space, learned from an enormous number of examples. Hal’s post here describes some of this in more detail.
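To make the “linear predictor on an enormous sparse feature space” view concrete, here is a minimal sketch (the feature encoding and function names are illustrative, not anything from Hal’s post):

```python
from collections import defaultdict

def ngram_features(tokens, n=3):
    # one sparse feature per n-gram; the feature space is enormous,
    # but any single example activates only a handful of coordinates
    feats = defaultdict(float)
    for i in range(len(tokens) - n + 1):
        feats[" ".join(tokens[i:i + n])] += 1.0
    return feats

def linear_score(weights, feats):
    # a plain linear predictor: score = sum_i w_i x_i over the active features
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

print(ngram_features("the cat sat on the mat".split()))
```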

In contrast, if you go to a machine learning conference, a large number of the new algorithms are variations of learning on a linear representation. This claim should be understood broadly to include (for example) kernel methods, random projection methods, and more traditionally linear representations such as the perceptron. A basic question is: Why is the study of linear representations so prevalent?

There are several reasons for investigating the linear viewpoint.

  1. Linear learning is sufficient. As discussed above, this is really only true in practice if you have sufficiently capable humans hand-engineering features. On one hand, there is a compelling directness to that approach, but on the other it’s not the kind of approach which transfers well to new problems.
  2. Linear learning is a compelling primitive. Many of the effective approaches for nonlinear learning use some combination of linear primitives connected by nonlinearities to make a final prediction (see the sketch after this list). As such, there is a plausible hope that improvements in linear learning can be applied repeatedly in these more complex structures.
  3. Linear learning is the only thing tractable, empirically. This has a grain of truth to it, but it appears uncompelling when you get down to the nitty-gritty details. On a dataset large enough to require efficient algorithms, you often want to use online learning. And when you use online learning with a pure linear representation, the limiting factor is the speed at which data can be pulled into the CPU from the network or the disk. If you aren’t doing something more interesting than plain vanilla linear prediction, you are wasting most of your CPU cycles.
  4. Linear learning is the only thing tractable, theoretically. There are certainly many statements and guarantees that we only know how to make with linear representations and (typically) convex losses. However, there are fundamental limits to the extent that a well understood tool can be misused, and it’s important to understand that these theorems do not (and cannot) say that learning on a linear representation will solve some concrete problem like (say) face recognition from 10000 labeled examples. In addition, there are some analysis methods which apply to nonlinear learning systems—my favorite example is learning reductions, but there are others also.
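A minimal sketch of the “linear primitives connected by nonlinearities” idea from point 2, using a toy two-layer predictor (all sizes and initializations here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)  # 16 linear primitives over 8 inputs
w2, b2 = rng.normal(size=16), 0.0                # one more linear primitive on top

def predict(x):
    h = np.maximum(0.0, W1 @ x + b1)  # linear maps composed with a simple nonlinearity
    return float(w2 @ h + b2)

print(predict(rng.normal(size=8)))
```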

Some of the reasons for linear investigations appear sound, while others are simply variants of “looking where the light is”, which comes from an often retold story:
At night you see someone searching the ground under a streetlight.
You ask, “What happened?”
They say, “I’m looking for the keys I dropped in the bushes.”
“But there aren’t any bushes where you are searching.”
“Yes, but I can’t see over there.”

Netflix prize within epsilon

The competitors for the Netflix Prize are tantalizingly close to winning the million dollar prize. This year, BellKor and Commendo Research sent a combined solution that won the progress prize. Reading the writeups is instructive. Several aspects of the solutions are taken for granted, including stochastic gradient descent, ensemble prediction, and targeting residuals (a form of boosting). Relative to last year, it appears that many approaches have added parameterizations, especially for the purpose of modeling through time.

The big question is: will they make the big prize? At this point, the level of complexity involved in entering the competition is prohibitive, so perhaps only the existing competitors will continue to try. (This equation might change drastically if the teams open source their existing solutions, including parameter settings.) One fear is that the progress is asymptoting on the wrong side of the 10% threshold. In the first year, the teams progressed through 84.3% of the 10% gap, and in the second year, they progressed through just 64.4% of the remaining gap. While these numbers suggest an asymptote on the wrong side, in the month since the progress prize another 34.0% of the remaining gap has been closed. It’s remarkable that it’s too close to call, with just a 0.0035 RMSE gap to win the big prize. Clever people finding just the right parameterization might very well succeed.
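To make the arithmetic concrete, here is a quick check of those percentages, assuming Cinematch’s published baseline RMSE of roughly 0.9514 (a figure not stated in the post), so the full 10% gap is about 0.0951 RMSE:

```python
gap = 0.0951                        # assumed: 10% of Cinematch's ~0.9514 RMSE
after_y1 = gap * (1 - 0.843)        # 84.3% of the gap closed in year one
after_y2 = after_y1 * (1 - 0.644)   # 64.4% of the remainder closed in year two
now = after_y2 * (1 - 0.340)        # another 34.0% of the remainder since the progress prize
print(round(after_y1, 4), round(after_y2, 4), round(now, 4))
# ~0.0149, ~0.0053, ~0.0035 -- matching the 0.0035 RMSE gap quoted above
```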