Watchword: Loss

A loss function is a function which, for any single example, takes a prediction and the correct outcome and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example, such as “area under the ROC curve” or the “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss.
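
For concreteness, here is a minimal sketch (in Python, with made-up numbers) of two per-example losses and the usual recipe of averaging loss over examples:

```python
def zero_one_loss(prediction, label):
    # Binary classification: a mistake costs 1, a correct prediction costs 0.
    return 0.0 if prediction == label else 1.0

def squared_loss(prediction, target):
    # l2 regression: penalize the squared difference.
    return (prediction - target) ** 2

# A predictor is typically judged by its average loss over examples.
examples = [(0.9, 1.0), (0.2, 0.0), (0.6, 1.0)]  # hypothetical (prediction, target) pairs
print(sum(squared_loss(p, y) for p, y in examples) / len(examples))  # 0.07
```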

There seems to be a strong dichotomy between two views of what “loss” means in learning.

  1. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer.
  2. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SVMs), “log loss” (common in Bayesian Learning), and “exponential loss” (one incomplete explanation of boosting). One advantage of this viewpoint is that an appropriate choice of loss function (such as any of the above) results in a (relatively tractable) convex optimization problem.
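
To make the contrast concrete, here is a small sketch of the three surrogate losses named above, written as functions of the margin m = y·f(x) with y ∈ {−1, +1}; the printed values are only illustrative:

```python
import math

def zero_one(m):
    # The problem-defined loss: is the sign of the score correct?
    return 0.0 if m > 0 else 1.0

def hinge(m):
    # SVM surrogate: linear penalty for any margin below 1.
    return max(0.0, 1.0 - m)

def log_loss(m):
    # Logistic ("log loss") surrogate, in nats.
    return math.log(1.0 + math.exp(-m))

def exp_loss(m):
    # Exponential surrogate, as in boosting.
    return math.exp(-m)

# All three surrogates are convex in m; hinge and exponential upper bound
# the 0/1 loss, and log loss does so after rescaling by 1/ln 2.
for m in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"m={m:+.1f}  0/1={zero_one(m):.0f}  hinge={hinge(m):.2f}  "
          f"log={log_loss(m):.2f}  exp={exp_loss(m):.2f}")
```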

I don’t fully understand the second viewpoint. It seems (to some extent) like looking where the light is rather than where your keys fell on the ground. Many of these losses-of-convenience also behave unlike real-world problems. For example, in this contest somebody would have been the winner except that, on one example, they assigned very low probability to the correct outcome. Under log loss, their loss on that single example became very high. This does not seem to correspond to the intuitive notion of what the loss should be on the problem.
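
A hypothetical numeric sketch of that failure mode: under log loss, one confident mistake can outweigh many near-perfect predictions.

```python
import math

def log_loss(p_true):
    # Per-example log loss: minus the log probability assigned to the true outcome.
    return -math.log(p_true)

# 999 near-perfect predictions versus one confident mistake (made-up numbers):
print(999 * log_loss(0.99))  # ~10.04 total loss for 999 good predictions
print(log_loss(1e-6))        # ~13.82 loss from the single bad one
```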

6 Replies to “Watchword: Loss”

  1. You are missing one subtle point here: sometimes optimizing a loss on the training data does not mean that you are optimizing the same loss for unseen examples. This is where “regularization” enters the picture. For example, the principle of finding the largest margin hyperplane, as opposed to any other separating hyperplane (as done by SVMs), does not affect classification error on the training data, but it does affect the error on new examples.

  2. It’s true that optimizing some loss on the training data doesn’t necessarily optimize the same loss on test data. This describes overfitting. Regularization is a method to avoid overfitting by limiting the complexity of a learned function so that the optimization on the training data is also effective on similar test data.

    However, this issue seems orthogonal in principle to the choice of what loss is optimized. For example, we could optimize the “hinge loss” of a margin or the 0/1 loss of binary classification. As long as the function class is sufficiently limited relative to the size of the data, the training set loss will predict the test set loss when the data is i.i.d.

    For the example of large margin hyperplanes you mention, in practice SVMs _do_ trade off accuracy on the training data for the size of the margin via the ‘C’ parameter. Setting ‘C’ to either 0 or infinity often results in rather poor performance.
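
    For concreteness, a minimal sketch of this trade-off (assuming scikit-learn; the synthetic data and the grid of ‘C’ values are made up for illustration):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Small C favors a large margin at the expense of training accuracy;
    # large C favors training accuracy at the expense of the margin.
    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    for C in (1e-3, 1e-1, 1.0, 1e1, 1e3):
        acc = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5).mean()
        print(f"C={C:g}  cv accuracy={acc:.3f}")
    ```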

  3. l_2 regression is natural (or problem driven), but log-loss isn’t?! What if my job (of which there are entire industries) is to drive bits across a wire? That’s one LONG code-word to have to send.

    Being from Pittsburgh in a previous life, John, you’re probably also familiar with weatherman “probabilities” here. A 0% chance of snow ought to mean there is *absolutely* no way it’s going to snow. It seems here to mean about a 1 in 10 chance during winter. Weatherman probabilities correlate better with your notion of how to measure loss, but I sure would like 0% to mean no chance.
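
    (To spell out the coding connection invoked above: under an optimal code for a predicted distribution, the outcome that actually occurs costs about −log2 p bits, so log loss is literally code length. A short sketch:)

    ```python
    import math

    # Shannon code length: an outcome predicted with probability p costs
    # about -log2(p) bits, which is the base-2 log loss.
    for p in (0.5, 0.1, 0.01, 1e-6):
        print(f"p={p:g}  codeword ~ {-math.log2(p):.1f} bits")
    # p = 0 would demand an infinitely long codeword, which is why a
    # forecast of "0% chance" is so dangerous under log loss.
    ```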

  4. I am certainly not claiming that any particular loss is always inappropriate.

    The paper seems to roughly state that there are fundamental limits on how useful “looking where the light is” can be.
