Netflix finishes (and starts)

I attended the Netflix prize ceremony this morning. The press conference part is covered fine elsewhere; the basic outcome is that BellKor’s Pragmatic Chaos won over The Ensemble by virtue of submitting 15-20 minutes earlier, because the two teams were tied in performance on the ultimate holdout set. I’m sure the individual participants will have many chances to speak about the solution. One of these is Bell at the NYAS ML symposium on Nov. 6.

Several additional details may interest ML people.

  1. The degree of overfitting, exhibited by the difference in performance between the leaderboard test set and the ultimate holdout set, was small but decisive, at about .02 to .03%.
  2. A tie was possible because the rules cut off measurements below the fourth digit, based on significance concerns. In actuality, of course, the scores do differ before rounding, but everyone I spoke to claimed not to know by how much. The complete dataset has been released on UCI, so each team could compute its own score to whatever accuracy it desired.
  3. I was impressed by the slick systematic uses of SVD mentioned in the technical presentation, as implied by the first comment here.
  4. The amount of programming and time which went into this contest was pretty shocking. I was particularly impressed with the amount of effort that went into various techniques for blending results from different systems. In this respect, the lack of release of the source code is a little bit disappointing.
  5. I forgot to ask explicitly, but no one mentioned using any joins of the data to external databases. That’s somewhat surprising, given how much other information about movies is readily available.
  6. I hadn’t previously seen convexity functioning so explicitly as a tool for social engineering. Because squared loss is convex, any two different solutions of similar performance can be linearly blended to yield a mixed solution of superior performance. The implications of this observation were on display (see the sketch after this list).

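To make the convexity point in item 6 concrete, here is a minimal sketch in Python with synthetic ratings and made-up noise levels: two predictors of similar accuracy with independent errors are averaged, and the squared error of the blend comes out lower than either input.

```python
# Blending under a convex loss: for squared error, a linear blend of two
# distinct predictors is never worse than the average of their losses,
# and is strictly better whenever their predictions differ.
# Ratings and noise levels below are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(1, 5, size=10_000)        # hypothetical true ratings
f = y + rng.normal(0, 0.9, size=y.size)   # predictor A
g = y + rng.normal(0, 0.9, size=y.size)   # predictor B, similar accuracy

def rmse(pred):
    return np.sqrt(np.mean((pred - y) ** 2))

blend = 0.5 * f + 0.5 * g                 # convex combination of the two
print(rmse(f), rmse(g), rmse(blend))      # the blend's RMSE is the lowest
```

For strictly convex losses like squared error, the blend improves on the average of the two losses by an amount that grows with how much the predictions disagree, which is part of what made blending many distinct systems so attractive in this contest.
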
Netflix also announced a plan for a new contest, which will focus on using features of users, and predicting well for the (presumably large number of) users who rate very few movies. I hope they get the anonymization on this data right, as it’s obviously important.

This brings up a basic issue: How should a contest be designed? In the main, the finished Netflix contest seems to have been well designed. For example, the double holdout set approach nicely prevents overfitting, which has been a bugaboo of some previous contests. One improvement they are already implementing is asymptopia removal: the new contest will award $0.5M at 6 months and $0.5M more at 18 months. Nevertheless, we might imagine better contests, and perhaps it’s worthwhile to do so given the amount of attention devoted to them.

  1. Metric: One criticism is that squared loss does not very directly reflect the actual value to Netflix of a particular set of recommendations. This seems like a fair criticism, although if you believe ranking according to the optimal expected conditional ratings is the best possible, it is at least consistent. The degree to which squared-loss suboptimality controls the suboptimality of a recommendation loss is weak, but it should kick in when squared loss is as deeply optimized as it was in this contest.

    What you really want is something like “Did the user pick the recommended movie?” This would provide a qualitative leap in the fidelity of the metric to the true underlying problem. Unfortunately, doing this properly is difficult, as you need to cope with exploration issues, which must be done at the time of data collection. So my basic take is that the squared loss metric seems “ok”, with the proviso that it could be done better if you start the data collection with some amount of random exploration.

  2. Prize distribution: In a race as tight as this one, it must feel pretty rough for the members of The Ensemble to put in so much effort and then win nothing. A good case can be made that this isn’t optimal design for a contest where we are trying to learn new things. For example, it seems quite plausible that The Ensemble used some interesting technique that the winner did not. A case can also be made based on the theory of online learning with experts, which generally says that the right way to reward a stable of experts is via an exponential weighting scheme. This essentially corresponds to a “softmax” prize distribution, where the share awarded to a participant p is proportional to e^(−C(winner − p)), with winner and p denoting the corresponding performances and C a problem-dependent constant (see the sketch after this list). This introduces the possibility of a sybil attack, but that appears acceptably controllable, especially if the prize distribution is limited to the top few participants.
  3. Source Code: After the Netflix prize had been running for some time, the programming-time cost of entering the contest became very formidable. The use of a convex loss function and the requirement that participants publish their methods helped somewhat, but entering remained quite difficult. If the contest had also required the release of source code, I could imagine both a lower barrier to late entry and a bit more advancement of the field. Of course, it’s hard to go halfway with this: if you really want to guarantee that the source code works, you need to make the information-exchange interface the source code itself (which is then compiled and run in a sandbox), rather than labels.
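
As a rough illustration of how such a softmax split could work, here is a minimal sketch in Python; the team names, RMSE values, and the constant C are entirely hypothetical, and shares are proportional to exp(−C · gap), where gap is the distance of each team’s score from the best one.

```python
# Hypothetical "softmax" prize split: each team's share is proportional to
# exp(-C * gap), where gap is how far its RMSE sits above the best RMSE.
# All names, scores, and the constant C are made up for illustration.
import math

prize = 1_000_000.0
C = 2_000.0                      # problem-dependent constant, chosen for illustration
rmse = {"team_A": 0.8567, "team_B": 0.8568, "team_C": 0.8590, "team_D": 0.8612}

best = min(rmse.values())
weights = {team: math.exp(-C * (score - best)) for team, score in rmse.items()}
total = sum(weights.values())
for team, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{team}: ${prize * w / total:,.0f}")
```

With these made-up numbers, a near-tie between the top two teams splits the prize almost evenly, while teams far behind receive almost nothing; a larger C concentrates the money on the winner, and restricting payouts to the top few entries keeps sybil strategies unattractive.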

One last question to consider is: Is it good for the research community to have contests? My general belief on this is a definite “yes”, as it gives people who know how to do things a chance to distinguish themselves. For the Netflix contest in particular, the contest has educated me a bit about ensemble and SVD-style techniques, and I’m sure it’s generally helped crystallize out a set of applicable ML technologies for many people, which I expect to see widely used elsewhere in the future.

4 Replies to “Netflix finishes (and starts)”

  1. At ICANN’09, they had a workshop on machine learning in immunology. The idea was to get machine learners into immunology. It basically came down to several guys talking about medical topics that the machine learners in the room (about six or so) did not really care about.

    My bet is, if they had started a contest there, even with limited prize money, the room would have been packed with attendees and people would have started programming right away.

    If you want machine learners to solve your problem, just give them the data.

  2. This competition was really enjoyable to follow and has created nice improvements in the field. But along with that I can see a small downside, caused not by the contest alone but by the flaws of peer reviewing. My guess is that it is going to create a demand to compare new techniques directly against the winning algorithm, using the Netflix dataset.

    In my view this may be bad for the scientific community, since it will create an incentive for a flood of papers with small adjustments to the state-of-the-art algorithm and some forged results, which I occasionally stumble upon. Newborn algorithms will not stand a chance of becoming known by the community and thus will not be able to mature into competitive recommender systems.

  3. John, we calculated the Test RMSE to more than 4-digit precision. The difference is ~.00001. You can see the results on our site: http://www.the-ensemble.com/ . The winning team will be happy to note that they did edge us out, but the difference is remarkably slim. Our day-to-day improvement pace was 10 times this final difference. But even had we won, the lesson is the same: it was essentially a tie, and what Netflix would use in production and what the contest was producing had become two entirely different things.
