Three levels of addressing the Netflix Prize

In October 2006, the online movie rental service Netflix announced the Netflix Prize contest. It published a comprehensive dataset of more than 100 million movie ratings, given by about 480,000 real customers to 17,770 movies. Competitors in the challenge are required to estimate a few million ratings. To win the “grand prize,” they need to deliver a 10% improvement in prediction error over the results of Cinematch, Netflix’s proprietary recommender system. The best current results deliver a 9.12% improvement, which is quite close to the 10% goal, yet painfully distant.

The Netflix Prize breathed new life and excitement into recommender systems research. The competition gave the wide research community access to a large-scale, real-life dataset. Beyond this, the competition changed the rules of the game. Claiming that your nice idea outperforms some mediocre algorithms on some toy dataset is no longer acceptable. Now researchers must face a new gold standard, and check how their seemingly elegant ideas measure up against the best known results on an objective yardstick. I believe that this is a blessed change, which can help shift the focus to the few really useful ideas, rather than flooding us with a myriad of papers of questionable practical contribution. Well, time will tell…

So where should one start truly meaningful research? What can really make a difference in perfecting a recommender system? I do not pretend to have a real answer, but I will try to give some personal impressions. While working on the Netflix Prize, sifting through many ideas and implementing maybe a hundred different algorithms, we have come to recognize the few things that really matter. I will concentrate here on high-level lessons that will hopefully help other practitioners come up with developments of true practical value.

I would like to characterize algorithms at three different levels. The first level answers the “what?” question – What do we want to model? Here we decide which features of the data to address. Do we want to model the numerical values of the ratings, or maybe which movies people rate (regardless of rating value)? Do we want to address the date-dependent dynamics of users’ behavior? Some will want to model certain pieces of metadata associated with the movies, such as interactions with actors, directors, etc. Or maybe we would like to analyze the demographics of the users?
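
To make the first-level choice concrete, here is a minimal sketch (on hypothetical toy data, not the actual Netflix files) contrasting two of these options: modeling the numerical rating values versus modeling merely which movies each user chose to rate.

```python
# (user, movie, rating, date) tuples, in the shape of the Netflix data
ratings = [
    ("u1", "m1", 5, "2005-01-03"),
    ("u1", "m2", 3, "2005-02-11"),
    ("u2", "m1", 4, "2005-01-20"),
]

# View 1: the numerical rating values -- the signal RMSE directly targets
values = {(u, m): r for (u, m, r, _) in ratings}

# View 2: the binary "who rated what" relation -- rating values discarded
rated = {(u, m) for (u, m, _, _) in ratings}

print(values[("u1", "m2")])   # 3
print(("u2", "m1") in rated)  # True
```

The same raw file yields two very different signals, and deciding which of them (or both) to model is exactly a first-level decision.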

The next level, the second one, answers the “which?” question – Which model are we going to pick? Will we model ratings through a neighborhood model or through a latent factor model? Within a neighborhood model, should we look at relationships between users, between movies, or maybe both? Within latent factor models we also have plenty of further choices – should we stick with the good old SVD, or move to fancier probabilistic models (e.g., pLSA, LDA)? Or maybe we should jump to neural networks such as RBMs?
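
As a rough illustration of what this choice means in practice, here is a sketch of the two prediction rules being compared; the function and variable names are mine, for illustration only.

```python
import numpy as np

def svd_predict(mu, b_u, b_m, p_u, q_m):
    # Latent factor model: global mean and baseline terms, plus an
    # inner product of user and movie factor vectors.
    return mu + b_u + b_m + np.dot(p_u, q_m)

def item_neighborhood_predict(user_ratings, sims):
    # Item-based neighborhood model: a similarity-weighted average of the
    # user's own ratings on movies similar to the target movie.
    # user_ratings: {movie: rating}; sims: {movie: similarity to target}
    common = [m for m in user_ratings if m in sims]
    if not common:
        return None
    num = sum(sims[m] * user_ratings[m] for m in common)
    den = sum(abs(sims[m]) for m in common)
    return num / den
```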

Finally, the last level answers the “how?” question – How are we going to implement the chosen model? Even after choosing a model, we have much flexibility in deciding how to optimize it. For example, nearest neighbor models can vary from quite simplistic correlation-based models to more sophisticated models that try to derive parameters directly from the data. Likewise, there are many ways to fit an SVD model, ranging from gradient descent and alternating least squares to deeper formulations such as EM, MAP, MCMC, Gibbs sampling and more.
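
For instance, here is a minimal sketch of one “how?” answer: fitting the latent factor model by stochastic gradient descent, in the spirit of Simon Funk’s widely used approach. The hyperparameter values below are illustrative, not tuned.

```python
import numpy as np

def sgd_fit(ratings, n_users, n_movies, k=20, lr=0.005, reg=0.02, epochs=30):
    # ratings: list of (user_index, movie_index, rating) triples
    rng = np.random.default_rng(0)
    P = rng.normal(0.0, 0.1, (n_users, k))   # user factor vectors
    Q = rng.normal(0.0, 0.1, (n_movies, k))  # movie factor vectors
    for _ in range(epochs):
        for u, m, r in ratings:
            pu, qm = P[u].copy(), Q[m].copy()
            err = r - pu @ qm                    # error on this rating
            P[u] += lr * (err * qm - reg * pu)   # regularized gradient steps
            Q[m] += lr * (err * pu - reg * qm)
    return P, Q
```

Alternating least squares would fit the very same model by solving for P and Q in turn; the model (level two) stays fixed while the optimization (level three) changes.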

When designing an algorithm, one should go through the three levels, likely, but not necessarily, in the order I listed them. A major question is where most effort should be invested. Which level has the most influence on the quality of the outcome?

My impression is that quite often most effort is allocated in the wrong direction. Most papers appear to concentrate on the third level, designing the best techniques for optimizing a single model or a particular cost function on which they are fixated. This is not very surprising, because the third level is the most technical one and offers the most flexibility. In particular, it allows researchers to express their prowess. Here we can find papers with mathematical breakthroughs that squeeze some extra points from a model, getting us closer to the optimum in a shorter time and with less overfitting. Well, no doubt, that’s wonderful… However, the practical value of these developments is quite limited, especially when using an ensemble of various models, where squeezing the best out of a single model does not really carry through to the bottom line.

Concentrating efforts on the second level is more fruitful. Not all models are created equal for the task at hand. For example, user-based neighborhood models were found to be vastly inferior to item-based (movie-based) ones. Moreover, latent factor models proved more accurate than neighborhood ones (provided that you use the right latent factor model, which happens to be SVD). Most importantly, the design of a good ensemble blending complementary predictors should mostly be done at this level. It is very beneficial to blend SVD with a neighborhood technique and with an RBM, as sketched below. A simple mixture like this, involving quick and straightforward implementations, would probably vastly outperform some very well tuned and elaborate individual models. So this level is certainly important, and receives quite a bit of attention in the literature, but it is not nearly as important as the first level.
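
One common way to realize such a blend (a sketch under my own assumptions, not a description of any team’s exact method) is to learn linear combination weights for the individual predictors on a held-out probe set via least squares:

```python
import numpy as np

def blend_weights(X, y):
    # X: (n_probe_ratings, n_models), one column of predictions per model
    # y: (n_probe_ratings,), the true probe ratings
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Toy example: columns could be an SVD, a neighborhood model, and an RBM
X = np.array([[3.8, 4.1, 3.9],
              [2.2, 2.0, 2.5],
              [4.6, 4.4, 4.8]])
y = np.array([4.0, 2.0, 5.0])
w = blend_weights(X, y)
blended = X @ w  # the ensemble's predictions
```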

The first level, which decides which aspects of the data to model, is where the most pivotal choices are made. Selecting the right features has a huge impact on the quality of the results. For example, going beyond the numerical values of the ratings to analyzing which movies users choose to rate has a tremendous effect on prediction accuracy. On the other hand, modeling metadata associated with movies, such as the identity of actors or associated keywords, is not a prudent choice for the Netflix data. Similarly, modeling the date-dependent dynamics of users’ behavior is very useful. This first level receives less attention in the literature, perhaps because it is somewhat application-dependent and harder to generalize. However, I can’t emphasize its importance enough.
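
To show how a first-level decision feeds back into the model, here is a minimal sketch, in the spirit of SVD++, of exploiting which movies a user rated: the prediction augments the user factors with a normalized sum of implicit-feedback factors over all movies the user chose to rate, regardless of the rating values. The names are illustrative.

```python
import numpy as np

def svdpp_predict(mu, b_u, b_m, p_u, q_m, Y, rated_movies):
    # Y: implicit-feedback factor vectors, one row per movie
    # rated_movies: indices of all movies this user rated (values ignored)
    implicit = Y[rated_movies].sum(axis=0) / np.sqrt(len(rated_movies))
    return mu + b_u + b_m + q_m @ (p_u + implicit)
```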

In practice, the borders between the three levels that I describe may be quite fuzzy. Moreover, these three levels can sometimes be strongly interlaced with each other, as in the end a single implementation should fulfill all three levels. However, these days, whatever I think or hear about the Netflix data, I immediately try to relate it to those three levels. The more it relates to the first level, the more interested I become, whereas I tend to almost completely ignore improvements related to the third level (well, that’s after having explored that level enough in the past). Just my 2 cents…