Hal asks a very good question: “When is the right time to insert the loss function?” In particular, should it be used at testing time or at training time?
When the world imposes a loss on us, the standard Bayesian recipe is to predict the (conditional) probability of each possibility and then choose the possibility which minimizes the expected loss. In contrast, as the confusion over “loss = money lost” or “loss = the thing you optimize” might indicate, many people ignore the Bayesian approach and simply optimize their loss (or a close proxy for their loss) over the representation on the training set.
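To make the two recipes concrete, here is a minimal sketch on a toy binary problem (the data, the cost matrix, and the choice of logistic regression are all my own illustrative assumptions, not anything from the discussion):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)  # labels in {0, 1}

# cost[a, y]: cost of predicting a when the truth is y (hypothetical numbers)
cost = np.array([[0.0, 2.0],   # predict 0: free if y = 0, costs 2 if y = 1
                 [1.0, 0.0]])  # predict 1: costs 1 if y = 0, free if y = 1

# Recipe 1 ("insert during test"): model conditional probabilities,
# then pick the action with smallest expected cost for each example.
clf = LogisticRegression().fit(X, y)
p = clf.predict_proba(X)                       # columns: P(y=0|x), P(y=1|x)
expected_cost = p @ cost.T                     # [n, 2]: expected cost of each action
pred_test_time = expected_cost.argmin(axis=1)

# Recipe 2 ("insert during training"): fold the costs into the training
# objective, here crudely via importance weights on the examples.
w = np.where(y == 1, cost[0, 1], cost[1, 0])   # weight = cost of misclassifying that example
clf_w = LogisticRegression().fit(X, y, sample_weight=w)
pred_train_time = clf_w.predict(X)

print("test-time total cost: ", cost[pred_test_time, y].sum())
print("train-time total cost:", cost[pred_train_time, y].sum())
```

The first version can be re-aimed at a new cost matrix without touching the fitted model; the second has to be retrained.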
The best answer I can give is “it’s unclear, but I prefer optimizing the loss at training time”. My experience is that optimizing the loss in the most direct manner possible typically yields the best performance. This question is related to a basic principle which both Yann LeCun (applied) and Vladimir Vapnik (theoretical) advocate: “solve the simplest prediction problem that solves the problem”. (One difficulty with this principle is that ‘simplest’ is difficult to define in a satisfying way.)
One reason why it’s unclear is that optimizing an arbitrary loss is not an easy thing for a learning algorithm to cope with. Learning reductions (which I am a big fan of) give a mechanism for doing this, but they are new and relatively untried.
Drew Bagnell adds: Another approach to integrating loss functions into learning is to try to re-derive ideas about probability theory appropriate for other loss functions. For instance, Peter Grunwald and A.P. Dawid present a variant on maximum entropy learning. Unfortunately, it’s even less clear how often these approaches lead to efficient algorithms.
My experience (applied) agrees with John, with one caveat: if we’re dealing with *generative* models, then we’re pretty much stuck with the MBR (minimum Bayes risk) scenario; in some applications (I’m thinking of speech recognition right now), this can do just as well as explicitly minimizing the Hamming loss in the “loss while training” setup. One point in favor of the “insert during test” phase is that if your loss changes (loss = money lost), it is much easier to adjust the prediction to account for the new loss than to retrain an entire system for the new loss under the “insert during training” framework.
Learning theory people tend to focus (as far as I can tell) on the “insert during training” framework. Have there been any results published on generalization bounds (or the like) for the “insert during test” framework? I’m thinking it might be enlightening to compare bounds in the simple case of, say, a simple linear binary classifier where you pay nothing for a correct output, pay $1 for saying +1 when it’s really -1, and pay $2 for saying -1 when it’s really +1. Can we say anything theoretically about how the two approaches might compare on this problem?
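For that specific cost structure, the “insert during test” half at least has a clean closed form; a sketch of the arithmetic (mine, not from the comment):

```python
# With p = P(y = +1 | x), saying +1 risks $1 * (1 - p) and saying -1 risks
# $2 * p, so the expected-cost-minimizing rule says +1 exactly when p > 1/3.
def decide(p_plus: float) -> int:
    cost_of_saying_plus = 1.0 * (1.0 - p_plus)   # pay $1 if the truth is -1
    cost_of_saying_minus = 2.0 * p_plus          # pay $2 if the truth is +1
    return +1 if cost_of_saying_plus < cost_of_saying_minus else -1
```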
I believe the “loss changes” scenario is part of why Foster Provost favors ROC.
For the problem you mention, you can show that optimizing at training time with the distribution of +1’s twice as common is equivalent to optimizing the importance weighted loss (described here, but this is really ‘folk knowledge’).
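A tiny sanity check of that folk equivalence, with my own made-up labels and predictions:

```python
import numpy as np

y = np.array([+1, -1, -1, +1])          # made-up labels
pred = np.array([-1, -1, +1, +1])       # made-up predictions

# (a) make +1's twice as common by duplicating them, then count plain 0/1 errors
y_dup = np.concatenate([y, y[y == +1]])
pred_dup = np.concatenate([pred, pred[y == +1]])
loss_duplicated = np.sum(y_dup != pred_dup)

# (b) keep the sample as-is, but weight each error on a +1 example by 2
weights = np.where(y == +1, 2.0, 1.0)
loss_weighted = np.sum(weights * (y != pred))

assert loss_duplicated == loss_weighted  # both equal 3 here
```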
I don’t know any theorem with the outline you mention. Andrew Ng once mentioned this was interesting for logistic regression to me, so you might look there.
A few comments on two points raised in favor of “insert during test”:
The problem with point 1 is that things like ROC analysis and reductions only work in some cases. For instance, I doubt whether something like this could be used in the structured prediction problem discussed earlier. Oftentimes reductions are not quite ‘tight’ enough, for instance from log-loss to 0/1-loss (or are they?).
Regarding point 2, Bayesian methods are optimal under very strong assumptions about the model and prior. Here I’m talking about optimality in the strong subjective Bayesian sense (posterior = beliefs about future). Generative learning in general is known to be problematic unless these assumptions hold, especially when models are heavily misspecified.
Regarding Drew’s point on re-deriving probabilistic ideas for given loss functions, I’d like to add Vovk’s aggregating algorithm (AA). Quoting Vovk, “The simplest version of the AA was proposed in [Vovk, 1990] as a common generalization of the Bayesian merging scheme.” In prediction, Yamanishi’s (1998) extended stochastic complexity, mentioned by Aleks, is equivalent to the AA. This kind of approach has the potential to give the best of both worlds (Bayesian and learning theory), but after only 15 or so years (compared to decades or centuries) it is probably too early to say how good it will be.
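For readers who haven’t seen it, here is a minimal sketch of the exponential-weights mixture underlying the AA; for log-loss with learning rate 1 it coincides with Bayesian mixing, and the experts, data format, and learning rate below are illustrative assumptions on my part:

```python
import numpy as np

def aggregate(expert_probs: np.ndarray, outcomes: np.ndarray, eta: float = 1.0) -> np.ndarray:
    """expert_probs[t, k]: expert k's probability for outcome 1 at round t;
    outcomes[t] in {0, 1}.  Returns the mixture's probability for outcome 1."""
    T, K = expert_probs.shape
    log_w = np.zeros(K)                        # uniform prior over experts
    preds = np.empty(T)
    for t in range(T):
        w = np.exp(log_w - log_w.max())        # normalize in log space for stability
        w /= w.sum()
        preds[t] = w @ expert_probs[t]         # weighted-average prediction
        p_true = np.where(outcomes[t] == 1, expert_probs[t], 1 - expert_probs[t])
        log_w -= eta * (-np.log(p_true))       # multiply each weight by exp(-eta * log-loss)
    return preds
```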
Before John reminds me of the ‘probing’ talk he gave at AISTATS, I correct myself: I meant to suggest that it is not possible to reduce 0/1-loss to log-loss in a way that would be tight enough to be useful. (Good log-loss prediction is sufficient but not necessary for good 0/1 classification; base-2 log-loss is an upper bound on 0/1 loss, but there is no lower bound.)
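A quick numeric illustration of that one-sided relationship (my own numbers; ties at p = 1/2 are counted as errors so the inequality is clean):

```python
import math

# For the probability p assigned to the true label, 1[p <= 1/2] <= -log2(p):
# small log-loss forces small 0/1 loss, but p can sink toward 0 on a few
# examples, blowing up log-loss while 0/1 loss never exceeds 1.
for p in [0.9, 0.6, 0.5, 0.4, 0.1, 1e-6]:
    zero_one = 1 if p <= 0.5 else 0
    log_loss = -math.log2(p)
    print(f"p(true label) = {p:>8}: 0/1 loss = {zero_one}, log-loss = {log_loss:.2f}")
```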
I consider this (partially) a condemnation of log loss. It seems unreasonable to use any loss function which can be infinite in practice. I believe that behavior is never encountered when we think about “money loss”.
The Evaluating Predictive Uncertainty challenge was particularly telling in this respect because participants were tested on several metrics and then ranked only according to log loss. When Joaquin presented the results, he also had a long, detailed discussion of how exactly log loss was an unreasonable loss function.
A loss function should be an input to a decision-theoretic learning algorithm. I find it pointless to use a different loss in training than in evaluation. Specify a loss function in advance, and *then* train. The loss function is not a good proxy for dealing with overfitting and “noise”.
Log-loss is very reasonable in some applications, such as compression and investment. Not always, of course. But a learning method that suffers infinite loss when it was known in advance that log-loss was the criterion should also be seen as unreasonable 😉 Imagine a compression algorithm that fills your hard drive with zeros…
Well, talking about money: in Kelly gambling, log-loss gives the growth rate of capital. It is true that infinite loss doesn’t mean you lost infinitely much; you’re just bust.
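A toy version of the Kelly connection (the numbers and even-odds setup are my own assumptions): betting fraction b(y) of current capital on outcome y at odds of 2 multiplies capital by 2·b(y_t) each round, so the average log growth per round is log 2 minus the log-loss of b.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 10_000
y = (rng.random(T) < 0.7).astype(int)          # outcome 1 occurs with probability 0.7

def avg_log_growth(b1: float) -> float:
    """Average per-round log growth when betting fraction b1 of capital on
    outcome 1 (and 1 - b1 on outcome 0) at even odds of 2 on each side."""
    b = np.where(y == 1, b1, 1 - b1)
    return float(np.mean(np.log(2 * b)))       # capital multiplies by 2 * b(y_t) each round

print(avg_log_growth(0.7))   # Kelly: bet the true probabilities
print(avg_log_growth(0.5))   # uninformed split: zero growth at these odds
print(avg_log_growth(0.99))  # overconfident: occasionally near-bust, negative growth
```

Betting everything on one side (b1 = 1.0) hits log(0) the first time the other outcome occurs: infinite log-loss, i.e. bust.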
More importantly, I see log-loss as a proxy for 0/1 loss, just like quadratic, exponential, hinge loss, and what have you. (I don’t quite agree with Aleks that using a proxy for the actual loss is always bad. For one thing, optimizing 0/1 loss directly is usually hopelessly intractable.) Most probabilistic methods are inherently log-loss oriented, so it is relevant in that sense too.
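A small illustration of the proxy view (mine): written as functions of the margin m = y·f(x), the usual convex surrogates all upper-bound the 0/1 loss 1[m ≤ 0], which is part of what makes minimizing them a tractable stand-in.

```python
import numpy as np

def zero_one(m):    return (m <= 0).astype(float)
def logistic(m):    return np.log2(1 + np.exp(-m))   # base 2 so it equals 1 at m = 0
def hinge(m):       return np.maximum(0.0, 1 - m)
def squared(m):     return (1 - m) ** 2
def exponential(m): return np.exp(-m)

margins = np.linspace(-2, 2, 9)
for name, f in [("0/1", zero_one), ("log", logistic), ("hinge", hinge),
                ("squared", squared), ("exp", exponential)]:
    print(f"{name:>8}:", np.round(f(margins), 2))
```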
Somewhat off-topic, I have been wondering if there are any results about consistency of convex loss minimization without assuming that the Bayes act is in the hypothesis class (or arbitrarily close). For instance, Steinwart, Lugosi and others seem to always use this assumption. Any suggestions?