Netflix finishes (and starts)

I attended the Netflix prize ceremony this morning. The press conference part is covered fine elsewhere, with the basic outcome being that BellKor’s Pragmatic Chaos won over The Ensemble by 15-20 minutes, because they were tied in performance on the ultimate holdout set. I’m sure the individual participants will have many chances to speak about the solution. One of these is Bell at the NYAS ML symposium on Nov. 6.

Several additional details may interest ML people.

  1. The degree of overfitting exhibited by the difference in performance on the leaderboard test set and the ultimate hold out set was small, but determining at .02 to .03%.
  2. A tie was possible, because the rules cut off measurements below the fourth digit based on significance concerns. In actuality, of course, the scores do differ before rounding, but everyone I spoke to claimed not to know how. The complete dataset has been released on UCI, so each team could compute their own score to whatever accuracy desired.
  3. I was impressed by the slick systematic uses of SVD mentioned in the technical presentation, as implied by the first comment here.
  4. The amount of programming and time which went into this contest was pretty shocking. I was particularly impressed with the amount of effort that went into various techniques for blending results from different systems. In this respect, the lack of release of the source code is a little bit disappointing.
  5. I forgot to ask explicitly, but no one mentioned using any joins of the data to external databases. That’s somewhat surprising if you think about it given how much other information is available about movies.
  6. I hadn’t previously convexity functioning as a tool for social engineering so explicitly. Because squared loss is convex, any two different solutions of similar performance can be linearly blended to yield a mixed solution of superior performance. The implications of this observation were on display.

Netflix also announced a plan for a new contest, which will focus on using features of users, and predicting well for the (presumably large number of) users who rate very few movies. I hope they get the anonymization on this data right, as it’s obviously important.

This brings up a basic issue: How should a contest be designed? In the main, the finished Netflix contest seems to have been well designed. For example, the double holdout set approach nicely prevents overfitting, which has been a bugaboo of some previous contests. One improvement they are already implementing is asymptopia removal—the contest will award $0.5M in 6 months, and $0.5M more in 18 months. Nevertheless, we might imagine better contests, and perhaps it’s worthwhile to do so given the amount of attention devoted.

  1. Metric One criticism is that squared loss does not very directly reflect the actual value to Netflix of a particular set of recommendations. This seems like a fair criticism, although if you believe ranking according to the optimal expected conditional ratings is the best possible, it is at least consistent. The degree to which suboptimal squared loss prediction controls suboptimality of a recommendation loss is weak, but it should kick in when squared loss is deeply optimized as happened in this contest.

    What you really want is something like “Did the user pick the recommended movie?” This would provide a qualitative leap in the fidelity of the metric to the true underlying problem. Unfortunately, doing this properly is difficult, as you need to cope with exploration issues, which must be done at the time of data collection. So my basic take is that the squared loss metric seems “ok”, with the proviso that it could be done better if you start the data collection with some amount of random exploration.

  2. Prize distribution In a race as tight as this one, it must feel pretty rough for the members of The Ensemble to put so much effort in and then win nothing. A good case can be made that this isn’t optimal design for a contest where we are trying to learn new things. For example, it seems quite plausible that there was some interesting technique used in The Ensemble yet not used by the winner. A case can also be made based on online learning with experts theory, which generally says that the right way to reward a stable of experts is via an exponential weighting scheme. This essentially corresponds to having a “softmax” prize distribution where the distribution to a participant p is according to e-C(winner – p) where C is a problem dependent constant. This introduces the possibility of a sybil attack, but that appears acceptably controllable, especially if the prize distribution is limited to the top few participants.
  3. Source Code After the Netflix prize was going for some time, the programming-time complexity of entering the contest became very formidable. The use of convex loss function and requiring participants to publish helped some with this, but it remained quite difficult. If the contest required the release of source code as well, I could imagine both lowering the barrier to late entry, and helping advance the field a bit more. Of course, it’s hard to go halfway with this—if you really want to guarantee that the source code works, you need to make the information exchange interface be the source code itself (which is then compiled and run in a sandbox), rather than labels.

One last question to consider is: Is it good for the research community to have contests? My general belief on this is a definite “yes”, as it gives people who know how to do things a chance to distinguish themselves. For the Netflix contest in particular, the contest has educated me a bit about ensemble and SVD-style techniques, and I’m sure it’s generally helped crystallize out a set of applicable ML technologies for many people, which I expect to see widely used elsewhere in the future.

Netflix nearly done

A $1M qualifying result was achieved on the public Netflix test set by a 3-way ensemble team. This is just in time for Yehuda‘s presentation at KDD, which I’m sure will be one of the best attended ever.

This isn’t quite over—there are a few days for another super-conglomerate team to come together and there is some small chance that the performance is nonrepresentative of the final test set, but I expect not.

Regardless of the final outcome, the biggest lesson for ML from the Netflix contest has been the formidable performance edge of ensemble methods.

Netflix prize within epsilon

The competitors for the Netflix Prize are tantalizingly close winning the million dollar prize. This year, BellKor and Commendo Research sent a combined solution that won the progress prize. Reading the writeups 2 is instructive. Several aspects of solutions are taken for granted including stochastic gradient descent, ensemble prediction, and targeting residuals (a form of boosting). Relatively to last year, it appears that many approaches have added parameterizations, especially for the purpose of modeling through time.

The big question is: will they make the big prize? At this point, the level of complexity in entering the competition is prohibitive, so perhaps only the existing competitors will continue to try. (This equation might change drastically if the teams open source their existing solutions, including parameter settings.) One fear is that the progress is asymptoting on the wrong side of the 10% threshold. In the first year, the teams progressed through 84.3% of the 10% gap, and in the second year, they progressed through just 64.4% of the remaining gap. While these numbers suggest an asymptote on the wrong side, in the month since the progress prize another 34.0% improvement of the remainder has been achieved. It’s remarkable that it’s too close to call, with just a 0.0035 RMSE gap to win the big prize. Clever people finding just the right parameterization might very well succeed.

Three levels of addressing the Netflix Prize

In October 2006, the online movie renter, Netflix, announced the Netflix Prize contest. They published a comprehensive dataset including more than 100 million movie ratings, which were performed by about 480,000 real customers on 17,770 movies.  Competitors in the challenge are required to estimate a few million ratings.  To win the “grand prize,” they need to deliver a 10% improvement in the prediction error compared with the results of Cinematch, Netflix’s proprietary recommender system. Best current results deliver 9.12% improvement, which is quite close to the 10% goal, yet painfully distant.

 The Netflix Prize breathed new life and excitement into recommender systems research. The competition allowed the wide research community to access a large scale, real life dataset. Beyond this, the competition changed the rules of the game. Claiming that your nice idea could outperform some mediocre algorithms on some toy dataset is no longer acceptable. Now researchers should face a new golden standard, and check how their seemingly elegant ideas are measured against best known results on an objective yardstick. I believe that this is a blessed change, which can help in shifting the focus to the few really useful ideas, rather than flooding us with a myriad of papers with questionable practical contributions. Well, days will tell…

 So where to start a truly meaningful research? What can really make a difference in perfecting a recommender system? I do not pretend have a real answer, but I will try to give some personal impressions. While working on the Netflix Prize, sifting through many ideas, implementing maybe a hundred different algorithms, we have come to recognize the few things that really matter. I will concentrate here on high level lessons that will hopefully help other practitioners in coming up with developments of a true practical value.

 I would like to characterize algorithms at three different levels. The first level answers the “what?” question – What do we want to model? Here we decide which features of the data to address. Do we want to model the numerical value of ratings, or maybe which movies people rate (regardless of rating value)? Do we want to address the date-dependent dynamics of users’ behavior? Some will want to model certain pieces of metadata associated with the movies, such as interactions with actors, directors, etc. Or, maybe, we would like to analyze the demographics of the users?

 The next level, the second one, answers the “which?” question – Which model are we going to pick? Will we model ratings through a neighborhood model or through a latent factor model? Within a neighborhood model, should we look at relationships between users, between movies, or maybe both? Within latent factor models we also have plenty of further choices – should we stick with the good old SVD, or move to fancier probabilistic models (e.g., pLSA, LDA)? Or maybe, we should jump to neural networks such as RBMs?

 Finally the last level answers the “how?” question – How are we going to implement the chosen model? Even after choosing a model, we have much flexibility in deciding how to optimize it. For example, nearest neighbor models can vary from quite simplistic correlation based models, to more sophisticated models that try to derive parameters directly from the data. Likewise, there are many ways to fit an SVD model, ranging from gradient descent and alternating least squares to deeper formulations such as EM, MAP, MCMC, Gibbs sampling and more.

 When designing an algorithm, one should go through the three levels, likely, but not necessarily, in the order I listed them. A major question is where most efforts should be invested? Which level has most influence on the quality of the outcome?

 My impression is that quite often most effort is allocated in the wrong direction. Most papers appear to concentrate on the third level, designing the best techniques for optimizing a single model or a particular cost function on which they are fixated. This is not very surprising, because the third level is the most technical one and offers the most flexibility. In particular, it allows researchers to express their prowess. Here, we can find papers with mathematical breakthroughs allowing squeezing some extra points from a model, getting us closer to the optimum, in a shorter time and with less overfitting. Well, no doubt, that’s wonderful… However, the practical value of these developments is quite limited, especially when using an ensemble of various models, where squeezing the best out of a single model is not really delivered to the bottom line.

 Concentrating efforts on the second level is more fruitful. Not all models are built equal for the task at hand. For example, user-based neighborhood models were found to be vastly inferior to item (movie) -based ones. Moreover, latent factor models were proven to be more accurate than the neighborhood ones (considering that you use the right latent factor model, which happens to be SVD). Most importantly, the design of a good ensemble blending complementing predictors should be mostly done at this level. It is very beneficial to blend SVD with a neighborhood technique and with an RBM. A simple mixture like this, involving quick and straightforward implementations, would probably vastly outperform some very well tuned and elaborated individual models. So this level is certainly important, receives quite a bit of attention at the literature, but not nearly as important as the first level.

 The first level, which decides the aspects of the data to be modeled, is where most pivotal choices are taken. Selecting the right features will make a huge impact on the quality of the results. For example, going beyond the numerical values of the ratings to analyzing which movies are chosen to be rated has a tremendous effect on prediction accuracy. On the other hand, modeling metadata associated with movies, such as identity of actors, or associated keywords, is not a prudent choice regarding the Netflix data. Similarly, modeling the date-dependent dynamics of users’ behavior is very useful. This first level receives less attention in the literature. Perhaps, because it is somewhat application dependant and harder to generalize. However, I can’t emphasize enough its importance.

 In practice, the borders between the three levels that I describe may be quite fuzzy. Moreover, these three levels can be sometimes strongly interlaced with each other, as at the end, a single implementation should fulfill all three levels. However, these days, whatever I think or hear about the Netflix data, I immediately try to relate to those three levels. The more it relates to the first level, the more interested I become, whereas I tend to almost completely ignore improvements related to the third level (well, that’s after exploring that level enough in the past). Just my 2 cents…

Concerns about the Large Scale Learning Challenge

The large scale learning challenge for ICML interests me a great deal, although I have concerns about the way it is structured.

From the instructions page, several issues come up:

  1. Large Definition My personal definition of dataset size is:
    1. small A dataset is small if a human could look at the dataset and plausibly find a good solution.
    2. medium A dataset is mediumsize if it fits in the RAM of a reasonably priced computer.
    3. large A large dataset does not fit in the RAM of a reasonably priced computer.

    By this definition, all of the datasets are medium sized. This might sound like a pissing match over dataset size, but I believe it is more than that.

    The fundamental reason for these definitions is that they correspond to transitions in the sorts of approaches which are feasible. From small to medium, the ability to use a human as the learning algorithm degrades. From medium to large, it becomes essential to have learning algorithms that don’t require random access to examples.

  2. No Loading Time The medium scale nature of the datasets is tacitly acknowledged in the rules which exclude data loading time. My experience is that parsing and loading large datasets is often the computational bottleneck. For example when comparing Vowpal Wabbit to SGD I used wall-clock time which makes SGD look a factor of 40 or so worse than Leon’s numbers only using training time after loading. This timing difference is entirely due to the overhead of parsing, even though the format parsed is a carefully optimized binary language. (No ‘excluding loading time’ number can be found for VW, of course, because loading and learning are intertwined.)
  3. Optimal Parameter Time The rules specify that the algorithm should be timed with optimal parameters. It’s very common for learning algorithms to have a few parameters controlling learning rate or regularization. However, no constraints are placed on the number or meaning of these parameters. As an extreme form of abuse, for example, your initial classifier could be declared a parameter. With an appropriate choice of this initial parameter (which you can freely optimize on the data), training time is zero.
  4. Parallelism One approach to dealing with large amounts of data is to add computers that operate in parallel. This is very natural (the brain is vastly parallel at the neuron level), and there are substantial research questions in parallel machine learning. Nevertheless it doesn’t appear to be supported by the contest. There are good reasons for this: parallel architectures aren’t very standard yet, and buying multiple computers is still substantially more expensive than buying the RAM to fit the dataset sizes. Nevertheless, it’s disappointing to exclude such a natural avenue. The rules even appear unclear on whether or not the final test run is on an SMP machine.

As a consequence of this design, the contest prefers algorithms that load all data into memory then operate on it. It also essentially excludes parallel algorithms. These design decisions discourage large scale algorithms (where large is as defined above) in favor of medium scale learning algorithms. The design also favors highly parameterized learning algorithms over less parameterized algorithms, which is the opposite of my personal preference for research direction.

Many of these issues are eliminatable or at least partially addressable. Limiting the parameter size to ’20 characters on the commandline’ or in some other reasonable way seems essential. It’s probably too late to get large datasets, but using wall-clock time would at least avoid bias against large scale algorithms. If the final evaluation is going to take place on an SMP machine, at least detailing that would be helpful.

Despite these concerns, it’s important to be clear that this is an interesting contest. Even without any rule changes, it’s outcome tells us something about which sorts of algorithms work at a medium scale. That’s good information to know if you are interested in tackling larger scale algorithms. The datasets are also large enough to break every Theta(m2) algorithm. We should also respect the organizers: setting up any contest of this sort is quite a bit of work that’s difficult to nail down perfectly in advance.

update: Soeren has helped setup an SMP parallel track which address some of the concerns above. See the site for details, and see you there.