Key Scientific Challenges

Yahoo released the Key Scientific Challenges program. There is a Machine Learning list I worked on and a Statistics list which Deepak worked on.

I’m hoping this is taken quite seriously by graduate students. The primary value, is that it gave us a chance to sit down and publicly specify directions of research which would be valuable to make progress on. A good strategy for a beginning graduate student is to pick one of these directions, pursue it, and make substantial advances for a PhD. The directions are sufficiently general that I’m sure any serious advance has applications well beyond Yahoo.

A secondary point, (which I’m sure is primary for many 🙂 ) is that there is money for graduate students here. It’s unrestricted, so you can use it for any reasonable travel, supplies, etc…

Nearly all natural problems require nonlinearity

One conventional wisdom is that learning algorithms with linear representations are sufficient to solve natural learning problems. This conventional wisdom appears unsupported by empirical evidence as far as I can tell. In nearly all vision, language, robotics, and speech applications I know where machine learning is effectively applied, the approach involves either a linear representation on hand crafted features capturing substantial nonlinearities or learning directly on nonlinear representations.

There are a few exceptions to this—for example, if the problem of interest to you is predicting the next word given previous words, n-gram methods have been shown effective. Viewed the right way, n-gram methods are essentially linear predictors on an enormous sparse feature space, learned from an enormous number of examples. Hal’s post here describes some of this in more detail.

In contrast, if you go to a machine learning conference, a large number of the new algorithms are variations of learning on a linear representation. This claim should be understood broadly to include (for example) kernel methods, random projection methods, and more traditionally linear representations such as the perceptron. A basic question is: Why is the study of linear representations so prevalent?

There are several reasons for investigating the linear viewpoint.

  1. Linear learning is sufficient. As discussed above, this is really only true in practice if you have sufficiently capable humans hand-engineering features. On one hand, there is a compelling directness to that approach, but on the other it’s not the kind of approach which transfers well to new problems.
  2. Linear learning is a compelling primitive. Many of the effective approaches for nonlinear learning use some combination of linear primitives connected by nonlinearities to make a final prediction. As such, there is a plausible hope that improvements in linear learning can be applied repeatedly in these more complex structures.
  3. Linear learning is the only thing tractable, empirically. This has a grain of truth to it, but it appears to be uncompelling when you get down to the nitty-gritty details. On a dataset large enough to require efficient algorithms, you often want to use online learning. And, when you use online learning with a pure linear representation, the limiting factor is the speed that data can be sucked into the CPU from the network or the disk. If you aren’t doing something more interesting than plain vanilla linear prediction, you are wasting most of your CPU cycles.
  4. Linear learning is the only thing tractable, theoretically. There are certainly many statements and guarantees that we only know how to make with linear representations and (typically) convex losses. However, there are fundamental limits to the extent that a well understood tool can be misused, and it’s important to understand that these theorems do not (and cannot) say that learning on a linear representation will solve some concrete problem like (say) face recognition from 10000 labeled examples. In addition, there are some analysis methods which apply to nonlinear learning systems—my favorite example is learning reductions, but there are others also.

Some of the reasons for linear investigations appear sound, while others are simply variants of “looking where the light is”, which comes from an often retold story:
At night you see someone searching the ground under a streetlight.
You ask, “What happened?”
They say, “I’m looking for the keys I dropped in the bushes.”
“But there aren’t any bushes where you are searching.”
“Yes, but I can’t see over there.”

Netflix prize within epsilon

The competitors for the Netflix Prize are tantalizingly close winning the million dollar prize. This year, BellKor and Commendo Research sent a combined solution that won the progress prize. Reading the writeups 2 is instructive. Several aspects of solutions are taken for granted including stochastic gradient descent, ensemble prediction, and targeting residuals (a form of boosting). Relatively to last year, it appears that many approaches have added parameterizations, especially for the purpose of modeling through time.

The big question is: will they make the big prize? At this point, the level of complexity in entering the competition is prohibitive, so perhaps only the existing competitors will continue to try. (This equation might change drastically if the teams open source their existing solutions, including parameter settings.) One fear is that the progress is asymptoting on the wrong side of the 10% threshold. In the first year, the teams progressed through 84.3% of the 10% gap, and in the second year, they progressed through just 64.4% of the remaining gap. While these numbers suggest an asymptote on the wrong side, in the month since the progress prize another 34.0% improvement of the remainder has been achieved. It’s remarkable that it’s too close to call, with just a 0.0035 RMSE gap to win the big prize. Clever people finding just the right parameterization might very well succeed.