A Bumper Crop of Machine Learning Graduates

My impression is that this is a particularly strong year for machine learning graduates. Here’s my short list of the strong graduates I know, anti-alphabetically (for perversity’s sake) by last name:

  1. Jenn Wortman. When Jenn visited us for the summer, she had one, two, three, four papers. That is typical—she’s smart, capable, and follows up many directions of research. I believe approximately all of her many papers are on different subjects.
  2. Ruslan Salakhutdinov. A Science paper on bijective dimensionality reduction, mastery of and improvements on deep belief nets, which seem like an important flavor of nonlinear learning, and in my experience he’s very fast, capable, and creative at problem solving.
  3. Marc’Aurelio Ranzato. I haven’t spoken with Marc very much, but he had a great visit at Yahoo! this summer, and has an impressive portfolio of applications and improvements on convolutional neural networks and other deep learning algorithms.
  4. Lihong Li. Lihong developed the KWIK (“Knows What It Knows”) learning framework for analyzing and creating uncertainty-aware learning algorithms. New mathematical models of learning are rare, and the topic is of substantial interest, so this is pretty cool. He’s also worked on a wide variety of other subjects and in my experience is broadly capable.
  5. Steve Hanneke: When the chapter on active learning is written in a machine learning textbook, I expect the disagreement coefficient to be in it. Steve’s work is strongly distinguished from his adviser’s, so he is clearly capable of independent research.

There are a couple others such as Daniel and Jake for whom I’m unsure of their graduation plans, although they have already done good work. In addition, I’m sure there are several others that I don’t know—feel free to mention others I don’t know in comments.

It’s traditional to imagine that one candidate is best overall for hiring purposes, but I have substantial difficulty with that—the field of ML is simply too broad. Instead, if you are interested in hiring, each should be considered in your own context.

Observations on Linearity for Reductions to Regression

Dean Foster and Daniel Hsu had a couple observations about reductions to regression that I wanted to share. This will make the most sense for people familiar with error correcting output codes (see the tutorial, page 11).

Many people are comfortable using linear regression in a one-against-all style, where you try to predict the probability of choice i vs. the other classes, yet they are not comfortable with more complex error correcting codes because they fear that such codes create harder problems. This fear turns out to be mathematically incoherent under a linear representation: comfort in the linear case should imply comfort with more complex codes.

In particular, if there exists a set of weight vectors w_i such that P(i|x) = <w_i, x>, then for any invertible error correcting output code C, there exist weight vectors w_c which decode to perfectly predict the probability of each class. The proof is simple and constructive: each w_c can be constructed as the linear superposition of the w_i implied by the code, and invertibility implies that a correct encoding yields a correct decoding.
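For concreteness, here is a minimal numpy sketch of the construction. The dimensions, the random code matrix C, and the weight matrix W are all made up for illustration; the point is only that the superposition weights and the inverse of C recover the original linear predictions exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 4, 6  # number of classes, feature dimension (arbitrary)

# Assumed setup: class probabilities are exactly linear, P(i|x) = <w_i, x>.
# Row i of W is the one-against-all weight vector w_i.
W = rng.normal(size=(k, d))

# Any invertible k x k error correcting output code (random, hence
# invertible with probability 1).
C = rng.normal(size=(k, k))

# Construct the code-space weight vectors by linear superposition:
# w_c = sum_i C[c, i] * w_i, i.e. W_code = C @ W.
W_code = C @ W

x = rng.normal(size=d)
p = W @ x             # the linear class predictions P(i|x)
encoded = W_code @ x  # predictions of the code-space regressors, C @ p

# Invertibility gives exact decoding: C^{-1} (C p) = p.
decoded = np.linalg.solve(C, encoded)
assert np.allclose(decoded, p)
```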

This observation extends to all-pairs-like codes which compare subsets of choices to subsets of choices using “don’t cares”.

Using this observation, Daniel created a very short proof of the PECOC regret transform theorem (here, and Daniel’s updated version).

One further observation is that under ridge regression (a special case of linear regression), for any code, there exists a setting of parameters such that you might as well use one-against-all instead, because you get the same answer numerically. The implication is that any advantage of codes more complex than one-against-all is confined to other prediction methods.
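This equivalence is easy to check numerically. Below is a minimal numpy sketch; the data, the code matrix, and the regularization strength are all made up. Because the ridge solution is linear in the regression targets, encoding the targets with C and then decoding the predictions with C^{-1} reproduces the one-against-all predictions exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 100, 5, 3  # examples, features, classes (arbitrary)
lam = 0.5            # ridge regularization strength (arbitrary)

X = rng.normal(size=(n, d))
Y = np.eye(k)[rng.integers(k, size=n)]  # one-hot labels, n x k

# One-against-all ridge regression: one weight column per class.
A = X.T @ X + lam * np.eye(d)
W_oaa = np.linalg.solve(A, X.T @ Y)  # d x k

# Ridge regression on targets encoded by an invertible code matrix C.
C = rng.normal(size=(k, k))
W_code = np.linalg.solve(A, X.T @ (Y @ C))  # d x k

# Linearity in the targets gives W_code = W_oaa @ C exactly, so decoding
# with C^{-1} recovers the one-against-all predictions.
pred_oaa = X @ W_oaa
pred_decoded = X @ W_code @ np.linalg.inv(C)
assert np.allclose(pred_oaa, pred_decoded)
```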

A Healthy COLT

A while ago, we discussed the health of COLT. COLT 2008 substantially addressed my concerns. The papers were diverse and several were interesting. Attendance was up, which is particularly notable for a conference held in Europe. In my opinion, the colocation with UAI and ICML was the best colocation since 1998.

And, perhaps best of all, registration ended up being free for all students due to various grants from the Academy of Finland, Google, IBM, and Yahoo.

A basic question is: what went right? There seem to be several answers.

  1. Cost-wise, COLT had sufficient grants to alleviate the high cost of the Euro, and holding it at a university substantially reduced the cost compared to a hotel.
  2. Organization-wise, the Finns were great with hordes of volunteers helping set everything up. Having too many volunteers is a good failure mode.
  3. Organization-wise, it was clear that all 3 program chairs were cooperating in designing the program.
  4. Facilities-wise, proximity in time and space made the colocation much more real than many others have been in the past.
  5. Program-wise, COLT notably had two younger program chairs, Tong and Rocco, which seemed to work well.

New York’s ML Day

I’m not as naturally exuberant as Muthu or David about CS/Econ day, but I believe it and ML day were certainly successful.

At the CS/Econ day, I particularly enjoyed Tuomas Sandholm’s talk, which showed a commanding depth of understanding and application in automated auctions.

For the machine learning day, I enjoyed several talks and posters (I had better, since I helped pick them). What stood out to me was the number of people attending: 158 registered, a level qualifying as “scramble to find seats”. My rule of thumb for workshops/conferences is that the number of attendees is often something like the number of submissions. That isn’t the case here, where there were just 4 invited speakers and 30-or-so posters. Presumably, the difference is due to a critical mass of people interested in machine learning in the area and the ease of their attendance.

Are there other areas where a local Machine Learning day would fly? It’s easy to imagine something working out in the San Francisco bay area and possibly Germany or England.

The basic formula for the ML day is a committee picks a few people to give talks, and posters are invited, with some of them providing short presentations. The CS/Econ day was similar, except they managed to let every submitter do a presentation. Are there tweaks to the format which would improve things?

NIPS 2008 workshop on Kernel Learning

We’d like to invite hunch.net readers to participate in the NIPS 2008 workshop on kernel learning. While the main focus is on automatically learning kernels from data, we are also looking at the broader questions of feature selection, multi-task learning, and multi-view learning. There are no restrictions on the learning problem being addressed (regression, classification, etc.), and both theoretical and applied work will be considered. The deadline for submissions is October 24.

More detail can be found here.

Corinna Cortes, Arthur Gretton, Gert Lanckriet, Mehryar Mohri, Afshin Rostamizadeh