Some loss functions have a meaning, which can be understood in a manner independent of the loss function itself.
- Optimizing squared loss l_sq(y,y') = (y-y')^2 means predicting the (conditional) mean of y.
- Optimizing absolute value loss l_av(y,y') = |y-y'| means predicting the (conditional) median of y. Variants can handle other quantiles. 0/1 loss for classification is a special case.
- Optimizing log loss l_log(y,y') = log(1/Pr_{z ~ y'}(z = y)) means minimizing the description length of y. (A small numerical check of all three appears after this list.)
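To make these semantics concrete, here is a hedged numerical sketch (my own illustration, not from the post): over a fixed sample, the constant prediction that minimizes the average of each loss should land near the mean, the median, a chosen quantile, and (for log loss on a binary variable) the empirical frequency. The helper `argmin_loss` and the pinball form of the quantile variant are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

def argmin_loss(sample, loss, lo, hi):
    """Constant prediction minimizing the average loss over the sample."""
    return minimize_scalar(lambda p: loss(sample, p).mean(),
                           bounds=(lo, hi), method="bounded").x

# Real-valued y with a skewed distribution, so mean and median differ.
y = rng.exponential(scale=2.0, size=10_000)

sq = lambda y, yp: (y - yp) ** 2                   # squared loss -> mean
av = lambda y, yp: np.abs(y - yp)                  # absolute value loss -> median
pin = lambda q: (lambda y, yp:                     # quantile ("pinball") variant -> q-quantile
                 np.maximum(q * (y - yp), (q - 1) * (y - yp)))

print(argmin_loss(y, sq, y.min(), y.max()), y.mean())
print(argmin_loss(y, av, y.min(), y.max()), np.median(y))
print(argmin_loss(y, pin(0.9), y.min(), y.max()), np.quantile(y, 0.9))

# Binary y with log loss: the minimizing predicted probability is the
# empirical frequency, and the minimum is the average description length (in nats).
b = rng.binomial(1, 0.3, size=10_000)
log_loss = lambda y, p: -np.log(np.where(y == 1, p, 1.0 - p))
print(argmin_loss(b, log_loss, 1e-6, 1 - 1e-6), b.mean())
```

The skewed sample is only there to make the mean/median gap visible; any distribution would do.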
The semantics (= meaning) of the loss are made explicit by a theorem in each case. For squared loss, we can prove a theorem of the form:
For all distributions D over Y, if
y' = argmin_{y''} E_{y ~ D} l_sq(y, y''),
then y' = E_{y ~ D} y.
Similar theorems hold for the other examples above, and they can all be extended to predictors of y’ for distributions D over a context X and a value Y.
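As a hedged sketch of what the extension to contexts looks like for squared loss (my notation, not a quote from the post), the predictor minimizing expected squared loss pointwise is the conditional mean:

```latex
% Sketch (my notation): conditional version of the squared-loss semantics.
% For a distribution D over X x Y, the best predictor under squared loss
% is, at every context x, the conditional mean of y:
\[
  f^*(x) \;=\; \arg\min_{y'} \; \mathbb{E}_{y \sim D(\cdot \mid x)}\!\big[(y - y')^2\big]
         \;=\; \mathbb{E}\big[\, y \mid x \,\big].
\]
% Analogously, absolute value loss gives the conditional median, and log loss
% gives the conditional distribution (minimal expected description length).
```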
There are 3 points to this post.
- Everyone doing general machine learning should be aware of the laundry list above. These losses form a handy toolkit which can match many of the problems naturally encountered.
- People also try to optimize a variety of other loss functions. Some of these are (effectively) special cases of the above. For example, “hinge loss” is absolute value loss when the hinge point is at the upper end of the range. Some of the other losses do not have any known semantics, and in that case discovering a semantics could be quite valuable.
- The natural direction when thinking about how to solve a problem is to start with the semantics you want and then derive a loss. I don’t know of any general way to do this other than simply applying the laundry list above. As one example, what is a loss function for estimating the mean of a random variable y over the 5th to 95th quantile? (How do we do squared error regression which is insensitive to outliers?) Which semantics are satisfiable with a loss?
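To make the example concrete, here is a hedged sketch (my own, not from the post) that computes the target quantity directly: the mean of y restricted to its 5th-to-95th-quantile range, i.e. a trimmed mean. Finding a loss function whose minimizer is this quantity is the open question; the snippet only shows that the plain squared-loss answer (the unrestricted mean) is pulled around by outliers while the trimmed target is not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavy-tailed sample: mostly Gaussian with a few large outliers.
y = np.concatenate([rng.normal(loc=2.0, scale=1.0, size=980),
                    rng.normal(loc=50.0, scale=5.0, size=20)])

# The target quantity from the question: the mean of y restricted to
# the 5th-95th quantile range (a trimmed mean).
lo, hi = np.quantile(y, [0.05, 0.95])
trimmed_mean = y[(y >= lo) & (y <= hi)].mean()

print(f"plain mean   : {y.mean():.3f}")      # pulled up by the outliers
print(f"trimmed mean : {trimmed_mean:.3f}")  # close to 2.0
```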