Optimal Proxy Loss for Classification

Many people in machine learning take advantage of the notion of a proxy loss: a loss function which is much easier to optimize computationally than the loss function imposed by the world. A canonical example is when we want to learn a weight vector w and predict according to a dot product f_w(x) = sum_i w_i x_i, where optimizing the squared loss (y - f_w(x))^2 over many samples is much more tractable than optimizing the 0-1 loss I(y != Threshold(f_w(x) - 0.5)).
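
As a quick illustration of the computational gap, here is a minimal sketch (my own, not from the post; the synthetic data, numpy usage, and step size are all arbitrary assumptions): the squared proxy loss admits gradient descent, while the 0-1 loss is piecewise constant in w.

```python
import numpy as np

# Illustrative synthetic data: 100 samples, 5 features, labels in {0, 1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = (X @ w_true > 0).astype(float)

def zero_one_loss(w):
    # I(y != Threshold(f_w(x) - 0.5)): piecewise constant in w, so gradients are useless.
    preds = (X @ w > 0.5).astype(float)
    return float(np.mean(preds != y))

def squared_loss(w):
    # (y - f_w(x))^2 averaged over the sample: smooth and convex in w.
    return float(np.mean((y - X @ w) ** 2))

# Plain gradient descent on the proxy loss (the step size here is an arbitrary choice).
w = np.zeros(5)
for _ in range(200):
    grad = -2 * X.T @ (y - X @ w) / len(y)
    w -= 0.1 * grad

print(squared_loss(w), zero_one_loss(w))
```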

While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? The answer of course depends on what the real loss imposed by the world is. For 0-1 loss classification, there are adherents to many choices:

  1. Log loss. If we confine the prediction to [0,1], we can treat it as a predicted probability that the label is 1, and measure loss according to log(1/p'(y|x)), where p'(y|x) is the predicted probability of the observed label. A standard method for confining the prediction to [0,1] is logistic regression, which exponentiates the dot product and normalizes.
  2. Squared loss. The squared loss approach (discussed above) is also quite common. It shares the same “proper scoring rule” semantics as log loss: the optimal representation-independent predictor is the conditional probability of the label y given the features x.
  3. Hinge loss. For hinge loss, you optimize max(0, 1 - 4(y - 0.5)(f_w(x) - 0.5)). The form of the hinge loss is slightly unfamiliar because the label is {0,1} rather than {-1,1}. The optimal prediction for hinge loss is not the probability of y given x but rather some value which is at least 1 if the most likely label is 1 and 0 or smaller if the most likely label is 0. Hinge loss was popularized with support vector machines. Hinge loss is not a proper scoring rule for the mean, but since it does get the sign right, using it for classification is reasonable. (All three losses are sketched in code below.)
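
For concreteness, here is a minimal sketch of the three proxy losses for a single {0,1}-labeled example; the function names are my own, and only the formulas come from the list above.

```python
import math

def log_loss(y, score):
    # Confine the prediction to [0, 1] with the logistic link
    # p = 1 / (1 + exp(-score)), then charge log(1 / p'(y|x)).
    p = 1.0 / (1.0 + math.exp(-score))
    p_observed = p if y == 1 else 1.0 - p
    return math.log(1.0 / p_observed)

def squared_loss(y, f):
    # Proper scoring rule: minimized in expectation at f = p(y=1|x).
    return (y - f) ** 2

def hinge_loss(y, f):
    # max(0, 1 - 4(y - 0.5)(f - 0.5)) with labels in {0, 1}; equivalent to the
    # usual hinge after mapping y -> 2y - 1 and f -> 2f - 1.
    return max(0.0, 1.0 - 4.0 * (y - 0.5) * (f - 0.5))
```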

Many people have made qualitative arguments about why one loss is better than another. For example, see Yaroslav’s old post comparing log loss and hinge loss and arguing why hinge loss might be better. In the following, I make an elementary quantitative argument.

Log loss is qualitatively different from the other two, because it is unbounded on the range of interest. Restated, there is no reason other than representational convenience that f_w(x) needs to take a value outside of the interval [0,1] for squared loss or hinge loss. In fact, we can freely reduce these losses by instead considering the clipped function f_w'(x) = max(0, min(1, f_w(x))). The implication is that optimization of log loss can be unstable in ways that optimization of these other losses is not. This can be stated precisely by noting that sample complexity bounds (simple ones here) for 0-1 loss hold for f_w'(x) under squared or hinge loss, but the same theorem statement does not hold for log loss without additional assumptions. Since stability and convergence are of substantial interest in machine learning, this suggests not using log loss.
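
Here is a small numeric check of the clipping claim (my own sketch; the random test harness is an assumption, while the clipping rule and loss formulas come from the post): for {0,1} labels, clipping the prediction to [0,1] never increases squared or hinge loss.

```python
import random

def clip01(f):
    # f_w'(x) = max(0, min(1, f_w(x)))
    return max(0.0, min(1.0, f))

def squared_loss(y, f):
    return (y - f) ** 2

def hinge_loss(y, f):
    return max(0.0, 1.0 - 4.0 * (y - 0.5) * (f - 0.5))

random.seed(1)
for _ in range(10000):
    y = random.randint(0, 1)
    f = random.gauss(0.0, 3.0)  # raw dot products can land far outside [0, 1]
    assert squared_loss(y, clip01(f)) <= squared_loss(y, f) + 1e-12
    assert hinge_loss(y, clip01(f)) <= hinge_loss(y, f) + 1e-12
```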

For further analysis, we must first define some function converting f_w(x) into label predictions. The only reasonable approach is to threshold at 0.5. For log loss and squared loss, any other threshold is inconsistent. Since the optimal predictor for hinge loss always takes value 0 or 1, there is some freedom in how we convert, but a reasonable approach is to also threshold at 0.5.

Now, we want to analyze the stability of predictions. In other words, if an adversary picks the true conditional probability distribution p(y|x) and the prediction f_w'(x), how does the proxy loss of f_w'(x) bound the 0-1 loss? Since we imagine that the conditional distribution is noisy, it’s important to actually consider a regret: the loss we suffer minus the loss of the best possible predictor.

For each of these losses, an optimal strategy of the adversary is to have p(y|x) take value 0.5 - eps and f_w'(x) = 0.5. The 0-1 regret induced is simply 2 eps, since the best possible predictor has error rate 0.5 - eps while the actual predictor has error rate 0.5 + eps. For hinge loss as defined above (with the factor of 4), the induced regret is also 2 eps, while for squared loss the regret is eps^2. Doing some algebra, this implies that hinge_regret bounds 0-1 regret while 2 squared_regret^0.5 bounds 0-1 regret. Since we are only interested in regrets less than 1, the square root is undesirable, and hinge loss is preferred, because a stronger convergence of squared loss is needed to achieve the same guarantee on 0-1 loss.
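
As a sanity check, here is a small numeric verification of this adversarial example (my own sketch; the grid search over constant predictions and the tie-breaking of the threshold toward 1 are assumptions).

```python
import math

eps = 0.01
p1 = 0.5 - eps  # adversary's choice of p(y=1|x)

def zero_one(y, f):
    return float(y != (f >= 0.5))  # threshold at 0.5, ties broken toward 1

def squared(y, f):
    return (y - f) ** 2

def hinge(y, f):
    return max(0.0, 1.0 - 4.0 * (y - 0.5) * (f - 0.5))

def expected(loss, f):
    return p1 * loss(1, f) + (1 - p1) * loss(0, f)

def regret(loss, f=0.5):
    # Expected loss of the adversarial prediction minus that of the best constant prediction.
    best = min(expected(loss, g / 1000) for g in range(1001))
    return expected(loss, f) - best

print(regret(zero_one))                # ~2 * eps
print(regret(hinge))                   # ~2 * eps as well
print(2 * math.sqrt(regret(squared)))  # ~2 * eps (the squared regret itself is ~eps^2)
```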

Can we improve on hinge loss? I don’t know any proxy loss which is quantitatively better, but generalizations exist. The regret analysis for hinge loss carries over to absolute value loss |y - f_w'(x)|, since for {0,1} labels and predictions in [0,1] the hinge loss above is just a rescaling of the absolute value loss. One advantage of absolute value loss is that it has a known and sometimes useful semantics for values between 0 and 1: the optimal prediction is the median. This makes the work on quantile regression (Two Three) seem particularly relevant for machine learning.
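
For reference, here is a sketch of the standard quantile ("pinball") loss that generalizes absolute value loss; this generalization comes from the quantile regression literature rather than anything defined in this post.

```python
def pinball_loss(y, f, tau=0.5):
    # tau = 0.5 recovers half the absolute value loss |y - f|, whose optimal
    # prediction in expectation is the median of y; other values of tau
    # target the tau-th quantile instead.
    diff = y - f
    return tau * diff if diff >= 0 else (tau - 1.0) * diff

assert pinball_loss(1.0, 0.25) == 0.5 * abs(1.0 - 0.25)
```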

9 Replies to “Optimal Proxy Loss for Classification”

  1. Has anyone tried to define an optimal loss: what loss function minimizes divergence from 0-1 loss while still being convex? In other words, solve for L that minimizes \int (L(x)-01(x))^2 dx or \int |L(x)-01(x)| dx subject to L being convex. Intuitively, I would expect that the hinge loss is indeed optimal for some formulation, since there’s no way to erode a chunk out of it and still be convex (whereas log-loss can be eroded). (There is some flexibility in the hinge loss: you can adjust the slope and intercept of its left side.)

    Or maybe you could optimize the proxy loss to minimize regret in the adversarial case (subject to convexity).

  2. This is a good question, and I agree with your intuition, but I haven’t seen a definitive formalization & proof yet.

  3. All of the discussions above implicitly assume that the proxy loss should be an upper bound on the 0-1 loss (as the hinge loss and log loss are).

    If we relax this assumption, then the optimal convex loss might be a straight line, like L(x) = 0.5 - ax for some small positive value a. Note that this loss function can go negative. One can clamp it, in which case it yields a hinge loss that doesn’t bound the 0-1 loss.

  4. Another thought along this line is: if the loss function is only quasi-convex, then for a single sample the resulting objective would also be quasi-convex and its minimum unique. When two or more samples are used, the sum of quasi-convex functions is not quasi-convex in general. Then, what is the best quasi-convex approximation to the _joint_ sum of quasi-convex loss functions?

    There also seems to be some recent activity on non-convex loss functions that are tractable; for example, Hamed Masnadi-Shirazi published a paper on “SavageBoost” at NIPS that claims to be able to use non-convex loss functions. Unfortunately, the paper is not yet online on either the nips.cc page or Hamed’s homepage.

  5. Any loss function can be transformed according to L’ = a L + b (with a > 0) without altering the optimization problem. So hinge loss with a clamp at a negative value is functionally equivalent (as far as optimization goes) to the standard hinge loss.

  6. The SavageBoost paper is now available at http://books.nips.cc/nips21.html; see the paper titled:
    On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
    by Hamed Masnadi-Shirazi and Nuno Vasconcelos

  7. Why did you use max(0, 1 - 4(y - 0.5)(f_w(x) - 0.5)) rather than max(0, 1 - 2(y - 0.5)(f_w(x) - 0.5))?
    It seems to me that with the former, you end up in [-2, 2] rather than [-1, 1]…

    1. We want proxy loss >= 0/1 loss at every point. There are three interesting points to consider:

      y = 1 and f_w(x) = 1. Here 1 - 4(1 - 0.5)(1 - 0.5) = 0

      y = 1 and f_w(x) = 0.5. Here 1 - 0 = 1

      y = 1 and f_w(x) = 0. Here 1 - 4(1 - 0.5)(0 - 0.5) = 2

      Since the step function for 0/1 loss is at 0.5 with {0,1} labels, this is the best possible formula. (It is more natural with {-1,1} labels.)
