Netflix finishes (and starts)

I attended the Netflix prize ceremony this morning. The press conference part is covered well elsewhere; the basic outcome is that BellKor’s Pragmatic Chaos won over The Ensemble by a 15-20 minute margin in submission time, because the two teams were tied in performance on the ultimate holdout set. I’m sure the individual participants will have many chances to speak about the solution. One of these is Bell at the NYAS ML symposium on Nov. 6.

Several additional details may interest ML people.

  1. The degree of overfitting exhibited by the difference in performance on the leaderboard test set and the ultimate holdout set was small but decisive, at 0.02 to 0.03%.
  2. A tie was possible because the rules cut off measurements below the fourth digit based on significance concerns. In actuality, of course, the scores differ before rounding, but everyone I spoke to claimed not to know by how much. The complete dataset has been released on UCI, so each team could compute their own score to whatever accuracy desired.
  3. I was impressed by the slick systematic uses of SVD mentioned in the technical presentation, as implied by the first comment here.
  4. The amount of programming and time which went into this contest was pretty shocking. I was particularly impressed with the amount of effort that went into various techniques for blending results from different systems. In this respect, the lack of release of the source code is a little bit disappointing.
  5. I forgot to ask explicitly, but no one mentioned joining the data with external databases. That’s somewhat surprising if you think about it, given how much other information about movies is available.
  6. I hadn’t previously seen convexity functioning as a tool for social engineering so explicitly. Because squared loss is convex, any two different solutions of similar performance can be linearly blended to yield a mixed solution of superior performance (a small sketch follows this list). The implications of this observation were on display.

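To make the convexity point in item 6 concrete, here is a minimal sketch using synthetic ratings and two made-up predictors (not Netflix data or any team’s actual method): a 50/50 linear blend of two similar-quality predictors has lower squared loss than either one whenever they disagree.

```python
# Minimal sketch of the convexity/blending point, on synthetic data
# (random "ratings" and two noisy predictors), not the actual Netflix data.
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(1, 5, size=10000)             # true ratings
pred_a = y + rng.normal(0, 0.9, size=y.size)  # predictor A: noisy estimate
pred_b = y + rng.normal(0, 0.9, size=y.size)  # predictor B: similar error, different noise

def rmse(pred, truth):
    return np.sqrt(np.mean((pred - truth) ** 2))

blend = 0.5 * pred_a + 0.5 * pred_b           # simple 50/50 linear blend

print("RMSE A    :", rmse(pred_a, y))
print("RMSE B    :", rmse(pred_b, y))
print("RMSE blend:", rmse(blend, y))          # beats both, by convexity of squared loss
```
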
Netflix also announced a plan for a new contest, which will focus on using features of users, and predicting well for the (presumably large number of) users who rate very few movies. I hope they get the anonymization on this data right, as it’s obviously important.

This brings up a basic issue: How should a contest be designed? In the main, the finished Netflix contest seems to have been well designed. For example, the double holdout set approach nicely prevents overfitting, which has been a bugaboo of some previous contests. One improvement they are already implementing is asymptopia removal: the contest will award $0.5M after 6 months, and $0.5M more after 18 months. Nevertheless, we might imagine better contests, and perhaps it’s worthwhile to do so given the amount of attention devoted to them.

  1. Metric. One criticism is that squared loss does not very directly reflect the actual value to Netflix of a particular set of recommendations. This seems like a fair criticism, although if you believe that ranking according to the optimal expected conditional ratings is the best possible recommendation strategy, the metric is at least consistent with it. The degree to which suboptimality in squared loss controls suboptimality in recommendation loss is weak, but that control should kick in when squared loss is deeply optimized, as happened in this contest.

    What you really want is something like “Did the user pick the recommended movie?” This would provide a qualitative leap in the fidelity of the metric to the true underlying problem. Unfortunately, doing this properly is difficult, because you need to cope with exploration issues, and that must be done at the time of data collection. So my basic take is that the squared loss metric seems “ok”, with the proviso that it could be done better if the data collection starts with some amount of random exploration (see the evaluation sketch after this list).

  2. Prize distribution. In a race as tight as this one, it must feel pretty rough for the members of The Ensemble to put in so much effort and then win nothing. A good case can be made that this isn’t optimal design for a contest where we are trying to learn new things. For example, it seems quite plausible that some interesting technique was used in The Ensemble yet not used by the winner. A case can also be made based on the theory of online learning with experts, which generally says that the right way to reward a stable of experts is via an exponential weighting scheme. This essentially corresponds to a “softmax” prize distribution, where the share awarded to a participant p is proportional to e^{-C(loss(p) - loss(winner))} for a problem-dependent constant C (a small sketch of such a split follows this list). This introduces the possibility of a sybil attack, but that appears acceptably controllable, especially if the prize distribution is limited to the top few participants.
  3. Source Code. After the Netflix prize had been running for some time, the programming-time complexity of entering the contest became very formidable. The use of a convex loss function and the requirement that participants publish their methods helped somewhat, but entry remained quite difficult. If the contest required the release of source code as well, I could imagine both lowering the barrier to late entry and helping advance the field a bit more. Of course, it’s hard to go halfway with this: if you really want to guarantee that the source code works, you need to make the information exchange interface be the source code itself (which is then compiled and run in a sandbox), rather than labels.
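
On the metric point (item 1): one standard way to evaluate “did the user pick the recommended movie?” offline is to randomize a small fraction of recommendations at data-collection time and reweight by the probability of what was shown. The sketch below is a hypothetical inverse propensity estimator on entirely made-up logged data; it is not anything Netflix actually did.

```python
# Hypothetical sketch: evaluating a new recommendation policy offline from
# randomized exploration data via inverse propensity scoring. All data are made up.
import random

random.seed(0)
movies = ["m%d" % i for i in range(20)]

def simulate_log(n_users=10000):
    """Logged exploration data: (user, movie shown, probability shown, picked?)."""
    log = []
    for u in range(n_users):
        shown = random.choice(movies)          # uniform exploration
        prob = 1.0 / len(movies)
        # made-up "ground truth": users tend to pick low-numbered movies
        picked = random.random() < 0.5 / (1 + int(shown[1:]))
        log.append((u, shown, prob, picked))
    return log

def ips_value(policy, log):
    """Estimate P(user picks the movie recommended by `policy`) from the log."""
    total = 0.0
    for user, shown, prob, picked in log:
        if policy(user) == shown:              # only matching log entries contribute
            total += picked / prob             # reweight by inverse propensity
    return total / len(log)

log = simulate_log()
print("always m0 :", ips_value(lambda u: "m0", log))   # a sensible policy
print("always m19:", ips_value(lambda u: "m19", log))  # a poor policy
```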

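And on the prize split (item 2), here is a tiny sketch of the exponential weighting scheme with made-up final scores; the constant C, the team names, and the RMSE values are purely illustrative.

```python
# Sketch of a "softmax" prize split over the top participants, using made-up
# RMSE scores. Weight of participant p is exp(-C * (loss(p) - loss(winner))).
import math

prize_pool = 1_000_000.0
C = 2000.0  # problem-dependent constant; larger C concentrates the prize on the winner

# hypothetical final scores (lower RMSE is better), limited to the top few teams
scores = {"winner": 0.8567, "runner-up": 0.8567, "third": 0.8600, "fourth": 0.8650}

best = min(scores.values())
weights = {team: math.exp(-C * (rmse - best)) for team, rmse in scores.items()}
total = sum(weights.values())

for team, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{team:12s} ${prize_pool * w / total:,.0f}")
```
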
One last question to consider is: Is it good for the research community to have contests? My general belief on this is a definite “yes”, as it gives people who know how to do things a chance to distinguish themselves. For the Netflix contest in particular, the contest has educated me a bit about ensemble and SVD-style techniques, and I’m sure it’s generally helped crystallize out a set of applicable ML technologies for many people, which I expect to see widely used elsewhere in the future.

Necessary and Sufficient Research

Researchers are typically confronted with big problems that they have no idea how to solve. In trying to come up with a solution, a natural approach is to decompose the big problem into a set of subproblems whose solution yields a solution to the larger problem. This approach can go wrong in several ways.

  1. Decomposition failure. The solution to the decomposition does not in fact yield a solution to the overall problem.
  2. Artificial hardness. The subproblems, if solved, are sufficient to solve the overall problem, but they are harder than necessary.

Computational intractability provides a relatively new (in the history of research) razor for spotting the second failure: an approach whose subproblems are sufficient but harder than necessary.

In my experience, the artificial hardness problem is very common. Many researchers abdicate the responsibility of choosing a problem to work on to other people. This process starts very naturally in graduate school, when an incoming student might have relatively little idea of how to do research, so they naturally abdicate the problem choice to an advisor. As an inexperienced graduate student it’s difficult to avoid this, because an advisor often really does know far better what is easy, what is hard, and how to decompose a complex problem into solvable subproblems. Nevertheless, if your plan in life is to do substantial research, it’s essential even then to question research directions carefully.

In contrast to sufficient subgoals of a greater goal, there are also necessary subgoals. A necessary subgoal is one which must be solved to solve the greater goal. One of the reasons why the artificial hardness problem is so common is that the sufficient subgoals are commonly confused with necessary subgoals. The essential test for a necessary subgoal is whether or not a solution to the global problem can be used as a solution to the subgoal.

My personal greater goal is creating a master machine learning algorithm that can solve any reasonable learning problem where “reasonable” includes at least the set that humans can solve. Relative to this greater goal, many existing research programs do not appear necessary.

  1. The short form of my second comment on Marcus’s post is that I see the sufficiency but not the necessity of competing with all Turing machines.
  2. The necessity of several statistical modeling approaches appears unclear to me, because they often encounter severe computational problems. Perhaps this is an example of creating an artificially hard problem, as empirical experiences with Searn suggest.

What is necessary?

  1. Large data. It is clear that humans solving learning problems have access to large amounts of information which they employ to solve those problems. While we don’t stick children into sensory deprivation tanks to see how much it retards their ability to solve problems when grown, some experiments along these lines have been done with animals, yielding obvious deficits in ability.
  2. Online learning. The ability to learn in an online fashion, with relatively little processing per bit of input, is clearly sufficient to solve many problems. We can also argue that it is a computational necessity, as retraining on all past information appears computationally infeasible, or at least deeply wasteful (a minimal sketch of constant-work-per-example learning follows this list).
  3. Interactive learning. It’s clear that many animals use an interactive process to learn about the world. This form of learning is also necessary, because it provides the ability to answer unanticipated questions. We can further argue necessity by pointing out that interactive proofs appear much more powerful than noninteractive proofs in computational complexity theory. Viewed from a learning perspective, much of science is interactive learning.
  4. Compositional design of a learning system. The necessity of compositional design in machine learning is not entirely clear. For example, we could imagine that it’s possible to design good learning systems using an evolutionary approach. Nevertheless, since our basic goal in research is a design process much more efficient and faster than evolution, it seems that the decision to take a research-based approach implies that compositional design is necessary. Restated: we should be able to design the system to learn in components which compose to form an overall solution.
  5. Large contexts. It’s necessary that a learning algorithm be able to use a relatively large number of bits when making a decision. For example, people working on vision have cool examples where humans manage to use many different cues to predict what an object is.
  6. Nonlinearity. People can clearly solve learning problems for which no linear representation (of input information) is capable of achieving good performance.

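As a toy illustration of the online learning point (item 2), the sketch below does a constant amount of work per example and never revisits past data; the data stream here is synthetic.

```python
# Toy illustration of online learning: constant work per example, no storage
# of or retraining on past examples. The data stream is synthetic.
import numpy as np

rng = np.random.default_rng(0)
d = 10
true_w = rng.normal(size=d)
w = np.zeros(d)                     # the learner's entire state: d numbers
lr = 0.05

running_loss, seen = 0.0, 0
for t in range(100000):
    x = rng.normal(size=d)          # next example arrives...
    y = true_w @ x + rng.normal(0, 0.1)
    pred = w @ x                    # ...predict,
    err = pred - y
    w -= lr * err * x               # ...update (least mean squares step),
    running_loss += err * err       # ...and discard the example.
    seen += 1
    if (t + 1) % 20000 == 0:
        print(f"examples {t+1:6d}  avg squared error {running_loss / seen:.4f}")
        running_loss, seen = 0.0, 0
```
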
Some of these are criticizable as perhaps unnecessary, and I can easily imagine missing others. If you have other arguments for what is or is not necessary for this greater goal, please speak up.

There are two other categories of subgoal research we could consider. There are subgoals which are necessary and sufficient (in combination) to solve the greater goal. I don’t know of any such arguments for my greater goal.

The fourth category is subgoals which are neither necessary nor sufficient for a greater goal. In my experience such papers are quite common at conferences; typical examples include:

  1. Work on speeding up a slow algorithm that nevertheless leaves it slower than the state of the art,
  2. Otherwise improving a suboptimal algorithm while leaving it suboptimal.

The nitty-gritty of these questions comes at review time. Which papers should be accepted? In general, decision making here is pretty terrible because greater goals are rarely stated, perhaps as a form of strategic ambiguity. After all, the people attending a conference have multiple greater goals. Nevertheless, my personal ordering is:

  1. Necessary and sufficient research directions. An empty set in my experience.
  2. Necessary research directions. This is typically a small fraction.
  3. Sufficient research directions. This is a larger fraction.
  4. Neither. This is often the majority.
  5. Wrong. Must be rejected.

So the question to periodically ask yourself as a researcher is: What is the greater goal? Is this subgoal necessary? Sufficient? Something else?

New York Area Machine Learning Events

Several events are happening in the NY area.

  1. Barriers in Computational Learning Theory Workshop, Aug 28. That’s tomorrow near Princeton. I’m looking forward to speaking at this one on “Getting around Barriers in Learning Theory”, but several other talks are of interest, particularly to the CS theory inclined.
  2. Claudia Perlich is running the INFORMS Data Mining Contest with a deadline of Sept. 25. This is a contest using real health record data (they partnered with HealthCare Intelligence) to predict transfers and mortality. Given the current US health care reform debate, the case studies of high costs we keep hearing strongly suggest that machine learning & statistics could save many billions.
  3. The Singularity Summit October 3&4. This is for the AIists out there. Several of the talks look interesting, although unfortunately I’ll miss it for ALT.
  4. Predictive Analytics World, Oct 20-21. This is stretching the definition of “New York Area” a bit, but the train to DC is reasonable. This is a conference of case studies of applications of ML to real-world problems.
  5. Machine Learning Symposium, Friday Nov. 6. I’m on the committee again this year. The abstract deadline is Sept. 30, and we already have several speakers lined up.

Another 10-year paper in Machine Learning

When I was thinking about the best “10 year paper” for ICML, I also took a look at a few other conferences. Here is one from 10 years ago that interested me:

David McAllester, PAC-Bayesian Model Averaging, COLT 1999 (2001 journal draft).

Prior to this paper, the only mechanism known for controlling or estimating the necessary sample complexity for learning over continuously parameterized predictors was VC theory and its variants, all of which suffered from a basic problem: they were incredibly pessimistic in practice. This meant that only very gross guidance could be provided for learning algorithm design. The PAC-Bayes bound provided an alternative approach to sample complexity bounds which was quantitatively much tighter. It also imported and explained many of the motivations for Bayesian learning in a way that learning theory and perhaps optimization people might appreciate. Since this paper came out, there have been a number of moderately successful attempts to drive algorithms directly by the PAC-Bayes bound. We’ve gone from thinking that a bound-driven algorithm is completely useless to thinking it is merely a bit more pessimistic and computationally intensive than might be necessary.

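For reference, a commonly quoted form of the bound (constants differ slightly between the conference and journal versions): for any prior P over predictors fixed before seeing the m training examples, with probability at least 1-δ over the draw of the sample, simultaneously for all posteriors Q,

```latex
% A commonly quoted form of the PAC-Bayes bound; err is true error rate,
% \widehat{err} is empirical error rate on the m samples.
\mathbb{E}_{h \sim Q}\left[\mathrm{err}(h)\right]
\;\le\;
\mathbb{E}_{h \sim Q}\left[\widehat{\mathrm{err}}(h)\right]
+ \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{m}{\delta} + 2}{2m - 1}}
```

The KL(Q‖P) term is what imports the Bayesian prior/posterior language into a frequentist sample complexity statement.
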
The PAC-Bayes bound is related to the “bits-back” argument that Geoff Hinton and Drew van Camp made at COLT 6 years earlier.

What other machine learning or learning theory papers from 10 years ago have had a substantial impact?