This is a paper by Yann LeCun and Fu Jie Huang published at AISTATS 2005. I found this paper very difficult to read, but it does make an interesting point about a computational shortcut.
This paper takes for granted that the method of solving a problem is gradient descent on parameters. Given this assumption, the question arises: Do you want to do gradient descent on a probabilistic model or something else?
All (conditional) probabilistic models have the form p(y|x) = f(x,y)/Z(x), where Z(x) = sum_y f(x,y) (the paper calls -log f(x,y) an “energy”). If f is parameterized by some w, the gradient has a term for Z(x), and hence a term for every value of y. The paper claims that such models can be optimized for classification purposes using only the correct y and the incorrect y’ ≠ y which maximizes f(x,y’). This can even be done on unnormalizable models. The paper further claims that this works with an approximate maximum. These claims are plausible based on experimental results and intuition.
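To make the shortcut concrete, here is a minimal sketch (my own illustration, not code from the paper) for a linear energy model with f(x,y) = exp(w[y]·x) over k discrete classes; the names w, x, y are assumptions. The first gradient pays for Z(x) with one term per class, while the second touches only the correct label and the single most offending incorrect one.

```python
import numpy as np

def full_logloss_grad(w, x, y):
    """Gradient of -log p(y|x) with p(y|x) = exp(w[y].x) / Z(x).
    The Z(x) term forces a sum over *every* class."""
    scores = w @ x                          # one score per class
    p = np.exp(scores - scores.max())
    p /= p.sum()                            # softmax, i.e. f / Z
    grad = np.outer(p, x)                   # Z(x) contributes a term per class
    grad[y] -= x                            # term for the correct label
    return grad

def shortcut_grad(w, x, y):
    """Perceptron-style update using only the correct y and the
    incorrect y' that maximizes the score."""
    scores = w @ x
    scores_wrong = scores.copy()
    scores_wrong[y] = -np.inf               # exclude the correct label
    y_star = int(np.argmax(scores_wrong))   # most offending wrong answer
    grad = np.zeros_like(w)
    grad[y] -= x                            # push the correct score up
    grad[y_star] += x                       # push the best wrong score down
    return grad
```

Note that shortcut_grad never computes Z(x), which is why the same update still makes sense when the sum defining Z(x) is intractable or fails to converge.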
It wouldn’t surprise me to learn that ignoring Z(x) (and renormalizing later) is common in fast implementations of some probabilistic model fitting algorithms, but I haven’t seen this discussed previously. The ability to use an approximate maximum y’ seems potentially very useful.
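On the approximate-maximum point, here is a hedged variant of the sketch above: rather than scoring every class to find y’, score only a random candidate subset, so the per-step cost no longer scales with the number of classes. The candidate-set size is my own knob, not something from the paper.

```python
import numpy as np

def approx_shortcut_grad(w, x, y, n_candidates=8, rng=np.random):
    """Shortcut update where y' is only the best of a random candidate
    set, i.e. an approximate rather than exact maximum."""
    k = w.shape[0]
    candidates = rng.choice(k, size=min(n_candidates, k), replace=False)
    candidates = candidates[candidates != y]   # drop the correct label
    if candidates.size == 0:
        return np.zeros_like(w)
    y_star = int(candidates[np.argmax(w[candidates] @ x)])
    grad = np.zeros_like(w)
    grad[y] -= x
    grad[y_star] += x
    return grad
```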
With that encouragement, I had significant difficulties with the paper, including the following:
- Lack of a theorem. A good theorem proving that these things work would be quite helpful. It isn’t clear whether the claims are always true, true only on the examples encountered, or true with some small modification.
- Definition of Loss. For better or worse, the paper uses the second definition of loss, “Loss is part of the solution”, which I find unnatural.
- Claims I don’t understand or which aren’t technically true. None of these seem related to the main point of the paper, but they are very distracting. For example, there is a claim that log-loss is the “only well-justified loss function”. The meaning of “well-justified” is unclear, and I can think of several meanings under which other losses (such as squared error) are well-justified.
Given the above difficulties, this paper seems lucky to have been accepted. This isn’t a criticism of AISTATS, because it also seems plausible that this computational shortcut may eventually speed up many optimization problems.