Watchword: Loss

A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss.

There seems to be a strong dichotomy between two views of what “loss” means in learning.

  1. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer.
  2. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SVMs), “log loss” (common in Bayesian Learning), and “exponential loss” (one incomplete explanation of boosting). One advantage of this viewpoint is that an appropriate choice of loss function (such as any of the above) results in a (relatively tractable) convex optimization problem.

I don’t fully understand the second viewpoint. It seems (to some extent) like looking where the light is rather than where your keys fell on the ground. Many of these losses-of-convenience also seem to have behavior unlike real world problems. For example in this contest somebody would have been the winner except they happened to predict one example incorrectly with very low probability. Under log loss, their loss became very high. This does not seem to correspond to the intuitive notion of what the loss should be on the problem.