A NIPS paper

I’m skipping NIPS this year in favor of Ada, but I wanted to point out this paper by Andriy Mnih and Geoff Hinton. The basic claim of the paper is that by carefully but automatically constructing a binary tree over words, it’s possible to predict words well with huge computational resource savings over unstructured approaches.

I’m interested in this beyond the application to word prediction because it is relevant to the general normalization problem: if you want to predict the probability of one of a large number of events, you must often compute a predicted score for every event and then normalize, an operation whose cost grows linearly with the number of events. The problem comes up in many places when using probabilistic models, but I’ve run into it with high-dimensional regression.
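
To make the cost concrete, here is a minimal sketch (my own illustration, not from the paper) of the flat approach: every event’s score must be touched just to read off a single normalized probability. The event count and the random scores are placeholders.

```python
# A minimal sketch of the normalization problem (illustration only, not from the paper).
# With K possible events, turning scores into probabilities requires touching all K scores.
import numpy as np

def flat_softmax(scores):
    """Normalize a score for every event: O(K) work even to read off one probability."""
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()

K = 100_000                  # e.g. a large vocabulary (placeholder size)
scores = np.random.randn(K)  # stand-in for model scores
p = flat_softmax(scores)
print(p[42], p.sum())        # even one probability required a pass over all K events
```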

There are a couple workarounds for this computational bug:

  1. Approximate. There are many ways. Often the approximations are uncontrolled (i.e. can be arbitrarily bad), and hence finicky in application.
  2. Avoid. You don’t really want a probability; you want the most probable choice, which can be found more directly. Energy-based model update rules are an example of this approach, and there are many other direct methods from supervised learning. This is great when it applies, but sometimes a probability is actually needed.

This paper points out that a third approach can be viable empirically: use a self-normalizing structure, such as the binary tree over words constructed here. It seems highly likely that this is true in other applications as well.
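
As a rough illustration of why a tree self-normalizes, here is a sketch (my own, with made-up weights and a toy tree; the paper’s actual model is more sophisticated): each word is a leaf, its probability is the product of binary decisions along its root-to-leaf path, and the leaf probabilities sum to one without any pass over the vocabulary. The work per word drops from O(K) to roughly O(log K), which is where the computational savings come from.

```python
# A toy sketch of a self-normalizing tree (illustration only; the paper builds its trees
# automatically and uses a different parameterization).
# Each word is a leaf of a binary tree; P(word|x) is the product of per-node binary
# decisions along the root-to-leaf path, so one word costs O(depth) instead of O(K).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def path_probability(x, path, node_weights):
    """P(word|x) as a product of binary decisions along the word's root-to-leaf path.

    path: list of (node_id, bit) pairs; bit says which child the path takes.
    node_weights: dict of hypothetical per-node weight vectors.
    """
    p = 1.0
    for node_id, bit in path:
        q = sigmoid(node_weights[node_id] @ x)  # probability of taking child 0 at this node
        p *= q if bit == 0 else (1.0 - q)
    return p

# Toy usage: a depth-2 tree over 4 "words".
d = 5
x = np.random.randn(d)
weights = {"root": np.random.randn(d), "L": np.random.randn(d), "R": np.random.randn(d)}
paths = {
    "w0": [("root", 0), ("L", 0)], "w1": [("root", 0), ("L", 1)],
    "w2": [("root", 1), ("R", 0)], "w3": [("root", 1), ("R", 1)],
}
probs = {w: path_probability(x, pth, weights) for w, pth in paths.items()}
print(sum(probs.values()))   # ~1.0: normalized without ever summing over the whole vocabulary
```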

A Bumper Crop of Machine Learning Graduates

My impression is that this is a particularly strong year for machine learning graduates. Here’s my short list of the strong graduates I know. Analpha (for perversity’s sake) by last name:

  1. Jenn Wortman. When Jenn visited us for the summer, she had one, two, three, four papers. That is typical: she’s smart, capable, and follows up many directions of research. I believe approximately all of her many papers are on different subjects.
  2. Ruslan Salakhutdinov. He has a Science paper on bijective dimensionality reduction, and he mastered and improved on deep belief nets, which seem like an important flavor of nonlinear learning. In my experience he’s very fast, capable, and creative at problem solving.
  3. Marc’Aurelio Ranzato. I haven’t spoken with Marc very much, but he had a great visit at Yahoo! this summer, and has an impressive portfolio of applications and improvements on convolutional neural networks and other deep learning algorithms.
  4. Lihong Li. Lihong developed the KWIK (“Knows What It Knows”) learning framework for analyzing and creating uncertainty-aware learning algorithms. New mathematical models of learning are rare, and the topic is of substantial interest, so this is pretty cool. He’s also worked on a wide variety of other subjects and in my experience is broadly capable.
  5. Steve Hanneke. When the chapter on active learning is written in a machine learning textbook, I expect the disagreement coefficient to be in it. Steve’s work is strongly distinguished from his adviser’s, so he is clearly capable of independent research.

There are a couple of others, such as Daniel and Jake, whose graduation plans I’m unsure of, although they have already done good work. In addition, I’m sure there are several others I don’t know about; feel free to mention them in the comments.

It’s traditional to imagine that one candidate is best overall for hiring purposes, but I have substantial difficulty with that: the field of ML is simply too broad. Instead, if you are interested in hiring, each candidate should be considered in your own context.

Observations on Linearity for Reductions to Regression

Dean Foster and Daniel Hsu had a couple observations about reductions to regression that I wanted to share. This will make the most sense for people familiar with error correcting output codes (see the tutorial, page 11).

Many people are comfortable using linear regression in a one-against-all style, where you try to predict the probability of choice i vs. the other classes, yet they are not comfortable with more complex error correcting codes because they fear that such codes create harder problems. This fear turns out to be mathematically incoherent under a linear representation: comfort in the linear case should imply comfort with more complex codes.

In particular, if there exists a set of weight vectors w_i such that P(i|x) = <w_i, x>, then for any invertible error correcting output code C, there exist weight vectors w_c which decode to perfectly predict the probability of each class. The proof is simple and constructive: each weight vector w_c is the linear superposition of the w_i implied by the code, and invertibility means that a correct encoding implies a correct decoding.
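
Here is a small numerical sketch of that construction (my own check, not from the post): the dimensions and random matrices are placeholders, and the “probabilities” are simply assumed to be exactly linear in x.

```python
# A small numerical check of the constructive argument (illustration only).
# Assume the class probabilities are exactly linear: P(i|x) = <w_i, x>.
# For an invertible code matrix C, the coded weight vectors are the rows of C @ W,
# and decoding with C^{-1} recovers exactly the original per-class predictions.
import numpy as np

rng = np.random.default_rng(0)
k, d = 5, 8                            # number of classes, input dimension (placeholders)

W = rng.normal(size=(k, d))            # hypothetical true weight vectors w_i (rows)
C = rng.normal(size=(k, k))            # a random (almost surely invertible) code matrix

W_code = C @ W                         # coded weight vectors w_c: linear superpositions of the w_i

x = rng.normal(size=d)
p_direct = W @ x                             # assumed class probabilities P(i|x) = <w_i, x>
p_decoded = np.linalg.solve(C, W_code @ x)   # predict with the coded regressors, decode via C^{-1}

print(np.allclose(p_direct, p_decoded))      # True: decoding perfectly recovers the predictions
```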

This observation extends to all-pairs-like codes, which compare subsets of choices to subsets of choices using “don’t cares”.

Using this observation, Daniel created a very short proof of the PECOC regret transform theorem (here, and Daniel’s updated version).

One further observation is that under ridge regression (a regularized variant of linear regression), for any code there exists a setting of the parameters such that you might as well use one-against-all instead, because you get the same answer numerically. The implication is that any advantage of codes more complex than one-against-all is confined to other prediction methods.
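
A quick numerical sketch of that equivalence (my own, with placeholder data and a hypothetical shared ridge parameter): because the ridge solution is linear in the targets, fitting coded targets and then decoding reproduces the one-against-all predictions exactly.

```python
# A small numerical check of the ridge observation (illustration only).
# Because the ridge solution is linear in the targets, ridge on coded targets,
# followed by decoding, gives exactly the one-against-all ridge predictions.
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 50, 8, 5                 # examples, features, classes (placeholders)
lam = 0.1                          # shared ridge parameter (hypothetical value)

X = rng.normal(size=(n, d))
Y = np.eye(k)[rng.integers(0, k, size=n)]   # one-hot targets for one-against-all
C = rng.normal(size=(k, k))                 # a random (almost surely invertible) code matrix

def ridge(X, Y, lam):
    """Ridge regression weights: (X'X + lam*I)^{-1} X'Y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

B_ova = ridge(X, Y, lam)                    # one-against-all ridge
B_code = ridge(X, Y @ C.T, lam)             # ridge on coded targets

pred_ova = X @ B_ova
pred_decoded = X @ B_code @ np.linalg.inv(C).T   # decode the coded predictions

print(np.allclose(pred_ova, pred_decoded))  # True: the code changes nothing numerically
```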