I’m skipping NIPS this year in favor of Ada, but I wanted to point out this paper by Andriy Mnih and Geoff Hinton. The basic claim of the paper is that by carefully but automatically constructing a binary tree over words, it’s possible to predict words well with huge computational resource savings over unstructured approaches.
I’m interested in this beyond the application to word prediction because it is relevant to the general normalization problem: If you want to predict the probability of one of a large number of events, often you must compute a predicted score for all the events and then normalize, a computationally inefficient operation. The problem comes up in many places using probabilistic models, but I’ve run into it with high-dimensional regression.
There are a couple workarounds for this computational bug:
- Approximate. There are many ways. Often the approximations are uncontrolled (i.e. can be arbitrarily bad), and hence finicky in application.
- Avoid. You don’t really want a probability, you want the most probable choice which can be found more directly. Energy based model update rules are an example of that approach and there are many other direct methods from supervised learning. This is great when it applies, but sometimes a probability is actually needed.
This paper points out that a third approach can be viable empirically: use a self-normalizing structure. It seems highly likely that this is true in other applications as well.