This is a paper by Yann LeCun and Fu Jie Huang published at AISTATS 2005. I found this paper very difficult to read, but it does make a point about a computational shortcut.
This paper takes for granted that the method of solving a problem is gradient descent on parameters. Given this assumption, the question arises: Do you want to do gradient descent on a probabilistic model or something else?
All (conditional) probabilistic models have the form p(y|x) = f(x,y)/Z(x), where Z(x) = sum_y f(x,y) (the paper calls -log f(x,y) an “energy”). If f is parameterized by some w, the gradient has a term for Z(x), and hence for every value of y. The paper claims that such models can be optimized for classification purposes using only the correct y and the single incorrect y' ≠ y which maximizes f(x,y'). This can even be done on unnormalizable models. The paper further claims that this can be done with an approximate maximum. These claims are plausible based on experimental results and intuition.
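To make the shortcut concrete, here is a minimal sketch under my reading of the claim, assuming a linear energy E(x,y) = -w[y] . x over a small discrete label set (the function names and step size are illustrative, not from the paper):

```python
import numpy as np

def full_logloss_grad(w, x, y):
    """Gradient of -log p(y|x) with p(y|x) = f(x,y)/Z(x) and
    f(x,y) = exp(w[y] . x). Z(x) forces a term for every label."""
    scores = w @ x                       # w[y'] . x for every label y'
    p = np.exp(scores - scores.max())
    p /= p.sum()                         # softmax: the Z(x) normalization
    grad = np.outer(p, x)                # Z(x) contributes for all labels
    grad[y] -= x                         # the correct label's term
    return grad

def shortcut_grad(w, x, y):
    """Perceptron-style update using only the correct y and the most
    offending incorrect y'; Z(x) is never computed."""
    scores = w @ x
    scores[y] = -np.inf                  # exclude the correct label
    y_prime = int(np.argmax(scores))     # y' != y maximizing f(x, y')
    grad = np.zeros_like(w)
    grad[y] -= x                         # lower the energy of the correct y
    grad[y_prime] += x                   # raise the energy of y'
    return grad

# Usage: a single gradient step with either rule.
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 5))              # 3 labels, 5 features
x, y = rng.normal(size=5), 1
w -= 0.1 * shortcut_grad(w, x, y)
```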
It wouldn’t surprise me to learn that ignoring Z(x) (and renormalizing later) is common in fast implementations of some probabilistic model fitting algorithms, but I haven’t seen this previously discussed. The ability to use an approximate maximum y' seems potentially very useful.
With that encouragement, I had significant difficulties with the paper, including the following:
- Lack of a theorem. A good theorem proving these things work would be quite helpful. It isn’t clear whether the claims are always true, just true on the examples encountered, or true with some small modification.
- Definition of Loss. For better or worse, the paper uses the second definition of loss, “Loss is part of the solution”, which I find unnatural.
- Claims I don’t understand or which aren’t technically true. None of these seem related to the main point of the paper, but they are very distracting. For example, there is a claim that log-loss is the “only well-justified loss function”. The meaning of “well-justified” is unclear, and I can think of several meanings under which other losses (such as squared error) are well-justified; one small example follows this list.
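For instance, squared error is a proper scoring rule: for a binary event Y with true probability p, the expected squared error is minimized exactly at the prediction q = p (a standard fact, worked here for illustration):

```latex
\arg\min_{q \in [0,1]} \mathbb{E}\!\left[(q - Y)^2\right]
  = \arg\min_{q \in [0,1]} \left[\, p\,(q-1)^2 + (1-p)\, q^2 \,\right]
  = p
```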
With the above difficulties, this paper seems lucky to have been accepted. This isn’t a criticism of AISTATS because it also seems plausible that this computational shortcut may eventually help many optimizations.
You raise an important issue. There are different kinds of machine learning researchers out there. Some have a more mathematical bent, others a more engineering one. You are asking people to be mathematical, to come up with theorems. I prefer good engineering papers, ones that result in efficient implementations with wide applicability. A more philosophically-minded person would want papers that contribute to understanding of the underlying issues. A statistically-minded person would be turned off by the lack of understanding of the underlying statistical theory. We then have various culture clashes and random rejections of good stuff. When reviewing a paper, try to wear the same hat as the author, and provide feedback on what should be done to make it more understandable to readers with a different hat. Your advice is very useful for someone who wants to be liked by mathematically-minded readers.
I have seen this paper, and liked it a lot, wearing an engineer’s hat. I found it interesting how a loss function (“something one tries to minimize”) can be smooth and pliable for gradient-based optimization, yet directed towards ranking or classification performance.
I think you are right about (often severe) culture clashes in learning.
I don’t believe it’s unreasonable to want a theorem for this particular paper, even from an engineering point of view. Essentially, a theorem has the property that it can verify a technique will be widely applicable. The degree to which this technique is widely applicable (and where it will break) is still unclear to me.
Could the ‘log-loss is the right loss’ notion relate to the differential geometry view of distributions, where there is a notion of a “natural” distance (or via the axiomatic definitions of distance functions for probabilities)? There is at least a circumstantial case for arguing that for distributions (and only distributions) there are better measures to use than l2-type sums of squares.
I also wonder why log loss is the only justified loss function. From an information-geometric standpoint, Cencov’s characterization theorem shows that the Fisher information metric is the only metric over probability distributions with certain desired properties. Integrating over the geodesic is hard, so there are various ways to approximate the distance induced by this metric. One way is to use the KL-divergence, but one would expect other approximations to exist. What makes log loss so unique?
To my way of thinking, this is putting the cart before the horse. The Fisher metric is a local approximation to the KL-divergence. The reason Cencov’s theorem holds is that KL-divergence is manifestly invariant to parameter changes and congruent embeddings. This does imply a geodesic distance on the probability manifold, but generally we’re more interested in the “straight-line distance” that isn’t restricted to the manifold. (And for that matter isn’t a “distance” at all…)
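Concretely, the expansion I have in mind is the standard second-order one (the first-order term vanishes because the expected score is zero):

```latex
D_{\mathrm{KL}}\!\left(p_\theta \,\middle\|\, p_{\theta + d\theta}\right)
  = \tfrac{1}{2}\, d\theta^{\top} F(\theta)\, d\theta + O(\|d\theta\|^{3}),
\qquad
F_{ij}(\theta) = \mathbb{E}_{p_\theta}\!\left[
    \partial_{\theta_i} \log p_\theta \;\, \partial_{\theta_j} \log p_\theta
  \right]
```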
All of this is somewhat removed from the comment about log-loss, except of course that the expected log-loss and KL are so closely related. I guess ultimately log-loss is meaningful because uncertainty and description lengths in a language are connected this way. I think Shore and Johnson is the classic axiomatic treatment of relative entropy in the style of Cencov. (Although infinitely more readable…)
This paper reminds me of something perhaps related: Guy Lebanon once described to me an understanding of AdaBoost as logistic regression ignoring the normalization constraint. From his NIPS paper:
We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood.
… so, I guess the thought is that AdaBoost is getting at a similar computational shortcut in a setting which is implicitly (but not explicitly) gradient descent.
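For reference, the two losses in question, written for binary labels y in {-1,+1} and a real-valued score f(x) (my notation; normalizing exp(y f(x)) into a conditional distribution is what turns maximum likelihood into the logistic loss):

```latex
L_{\mathrm{exp}}(y, f(x)) = e^{-y f(x)},
\qquad
L_{\mathrm{log}}(y, f(x)) = -\log p(y \mid x)
  = \log\!\left(1 + e^{-2\, y f(x)}\right),
\quad
p(y \mid x) = \frac{e^{y f(x)}}{e^{f(x)} + e^{-f(x)}}
```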
I’m not sure I’d agree with the notion that the Fisher metric is a local approximation to the KL-divergence: if you look either at Cencov or at Amari’s work, it seems that the Fisher metric is the “prime notion”. In fact the KL-divergence is only one among a large number of “natural” divergences that have invariance properties (the so-called alpha-divergences, which also include the Hellinger distance). Moreover, the Hellinger distance is the “natural” geodesic associated with the Fisher information.
I am not sure if log-loss is the same as log-likelihood, but for example if you look at Csiszar’s characterization of “natural” distances, log likelihood (or functions of it) appears to have a special stature. It seems to me that the “embedding” of probabilities on the manifold using p -> log p is what causes a lot of this to appear. Amari has a discussion of the p -> log p mapping vs. p -> p or even p -> sqrt(p). Ultimately, it does seem like the relationship to information is what makes the log mapping natural, but it is quite fascinating that this also makes the math work out.