Some loss functions have a meaning, which can be understood in a manner independent of the loss function itself.
- Optimizing squared loss lsq(y,y’)=(y-y’)2 means predicting the (conditional) mean of y.
- Optimizing absolute value loss lav(y,y’)=|y-y’| means predicting the (conditional) median of y. Variants can handle other quantiles. 0/1 loss for classification is a special case.
- Optimizing log loss llog(y,y’)=log (1/Prz~y’(z=y)) means minimizing the description length of y.
The semantics (= meaning) of the loss are made explicit by a theorem in each case. For squared loss, we can prove a theorem of the form:
For all distributions D over Y, if
Similar theorems hold for the other examples above, and they can all be extended to predictors of y’ for distributions D over a context X and a value Y.
There are 3 points to this post.
- Everyone doing general machine learning should be aware of the laundry list above. They form a handy toolkit which can match many of the problems naturally encountered.
- People also try to optimize a variety of other loss functions. Some of these are (effectively) a special case of the above. For example, “hinge loss” is absolute value loss when the hinge point is at the upper range. Some of the other losses do not have any known semantics. In this case, discovering a semantics could be quite valuable.
- The natural direction when thinking about how to solve a problem is to start with the semantics you want and then derive a loss. I don’t know of any general way to do this other than simply applying the laundry list above. As one example, what is a loss function for estimating the mean of a random variable y over the 5th to 95th quantile? (How do we do squared error regression which is insensitive to outliers?) Which semantics are satisfiable with a loss?
Another commonly-used loss is a delta-function (deltay=y’), which corresponds to MAP.
I would think that the logic goes the opposite direction: one first a loss (i.e., the negative of user’s utility function) and then attempts to optimize the loss. The decision theory point of view is that we’re trying to maximize expected reward (=minimize expected loss), so the reward/loss function comes first.
When would you start with a “semantics” and then derive a loss for it? (Perhaps I am misunderstanding.) I can see how it provides some intuition — the delta-function loss interpretation makes MAP look silly. (Although MAP is preferable to the conditional mean in cases where we want the estimate to be a “plausible” interpretation — it’s possible for the mean to be a very unlikely interpretation).
Hmm, the subscript tag didn’t work. That should say: \delta_{y,y’}
Can reduction be applied here? that is can we fatasize about a loss function and then reduce the problem, perhaps via some transformation of input/output space, to another problem that involves a well-understood loss function, and still be able to provide bounds , or at least say something interesting,or at least understand the semantic of the less-understood loss function?
Aaron, there is a close connection between Bayesian error models and loss functions. See my (somewhat cryptic) writeup at http://www.stat.columbia.edu/~cook/movabletype/archives/2006/11/bayesian_infere_3.html
Another way to look at loss (or distance) function is to interpret them (in some cases) as maximum-likelihood parameter finding under a particular distribution assumption. For example, min sum-squared loss is finding the mean paramater under Gaussian distribution assumption (just take the log of Gaussian likelihood), while absolute (L1) loss corresponds to Laplace distribution, etc. There is a well-known relation between Bregman divergence minimization and finding max-likelihood parameters of the corresponding exponential-family distribution. A particular example is Euclidean distance/sum-square loss – Bregman divergence corresponding to Gaussian distribution, etc.
Just to add to my previous comment: “a general way to do this” could be to use appropriate divergence/loss based on statistical properties of the data (e.g., for binary data you may use the loss corresponding to logistic regression rahter than linear regression, since this will exactly correspond to Bernoulli assumption on the data). You can look at work by Collins, Dasgupta, Schapire on ePCA, recent work on Bregman clustering by Banerjee et al, etc http://jmlr.csail.mit.edu/papers/volume6/banerjee05b/banerjee05b.pdf
So, how about the semantics of 1,2, or infinitive norm of the regularization in the objective function to be optimized?
I believe the answer is “yes”. These sorts of statements compose the the learning reduction statements I’m familiar with.
For this post, I was thinking about the situation which (essentially) corresponds to the infinite data limit, in which case MAP is maximum likelihood. For that case, I think ‘log loss’ above captures this.
You are right that we often simply start from a loss function. When people inexperienced to ML try to tackle things, it can be difficult to specify their loss, so it becomes helpful to have these “semantic losses” available.
There is similar semantics, at least in the probabilistic setting. 2-norm is a Gaussian prior, 1-norm is a Laplace prior (cf. LASSO). I don’t know about the oo-norm …
Here are some examples for binary prediction — minimizing conditional log loss recovers conditional probability of y|x (also the conditional mean of y since y is Bernoulli). Minimizing logistic loss log(1+exp(-y g(x)) recovers log-odds. Log-loss and squared loss are similar in that with enough modeling freedom you will recover the true values of the underlying probability density, and there’s a simple way of characterizing all loss functions that have that property. Hinge/linear loss is interesting in that the minimizer only recovers the regions where p(y=1|x)>0.5
It may be impossible to implement the function you give (robust mean) through loss minimization — here’s an outline of proof that it’s impossible to have a loss function that recovers the mean of smallest n-1 datapoints in a sample of size n, and is differentiable on a dense subset of it’s domain. Same technique could show that you can’t recover the mean of middle n-2 points, and I think this could be extended to continuous loss functions as well, although I haven’t found a way yet