In October 2006, the online movie renter, Netflix, announced the Netflix Prize contest. They published a comprehensive dataset including more than 100 million movie ratings, which were performed by about 480,000 real customers on 17,770 movies.Ã‚Â Competitors in the challenge are required to estimate a few million ratings.Ã‚Â To win the “grand prize,” they need to deliver a 10% improvement in the prediction error compared with the results of Cinematch, Netflix’s proprietary recommender system. Best current results deliver 9.12% improvement, which is quite close to the 10% goal, yet painfully distant.

Ã‚Â The Netflix Prize breathed new life and excitement into recommender systems research. The competition allowed the wide research community to access a large scale, real life dataset. Beyond this, the competition changed the rules of the game. Claiming that your nice idea could outperform some mediocre algorithms on some toy dataset is no longer acceptable. Now researchers should face a new golden standard, and check how their seemingly elegant ideas are measured against best known results on an objective yardstick. I believe that this is a blessed change, which can help in shifting the focus to the few really useful ideas, rather than flooding us with a myriad of papers with questionable practical contributions. Well, days will tell…

Ã‚Â So where to start a truly meaningful research? What can really make a difference in perfecting a recommender system? I do not pretend have a real answer, but I will try to give some personal impressions. While working on the Netflix Prize, sifting through many ideas, implementing maybe a hundred different algorithms, we have come to recognize the few things that really matter. I will concentrate here on high level lessons that will hopefully help other practitioners in coming up with developments of a true practical value.

Ã‚Â I would like to characterize algorithms at three different levels. The first level answers the “what?” question – What do we want to model? Here we decide which features of the data to address. Do we want to model the numerical value of ratings, or maybe which movies people rate (regardless of rating value)? Do we want to address the date-dependent dynamics of users’ behavior? Some will want to model certain pieces of metadata associated with the movies, such as interactions with actors, directors, etc. Or, maybe, we would like to analyze the demographics of the users?

Ã‚Â The next level, the second one, answers the “which?” question – Which model are we going to pick? Will we model ratings through a neighborhood model or through a latent factor model? Within a neighborhood model, should we look at relationships between users, between movies, or maybe both? Within latent factor models we also have plenty of further choices – should we stick with the good old SVD, or move to fancier probabilistic models (e.g., pLSA, LDA)? Or maybe, we should jump to neural networks such as RBMs?

Ã‚Â Finally the last level answers the “how?” question – How are we going to implement the chosen model? Even after choosing a model, we have much flexibility in deciding how to optimize it. For example, nearest neighbor models can vary from quite simplistic correlation based models, to more sophisticated models that try to derive parameters directly from the data. Likewise, there are many ways to fit an SVD model, ranging from gradient descent and alternating least squares to deeper formulations such as EM, MAP, MCMC, Gibbs sampling and more.

Ã‚Â When designing an algorithm, one should go through the three levels, likely, but not necessarily, in the order I listed them. A major question is where most efforts should be invested? Which level has most influence on the quality of the outcome?

Ã‚Â My impression is that quite often most effort is allocated in the wrong direction. Most papers appear to concentrate on the third level, designing the best techniques for optimizing a single model or a particular cost function on which they are fixated. This is not very surprising, because the third level is the most technical one and offers the most flexibility. In particular, it allows researchers to express their prowess. Here, we can find papers with mathematical breakthroughs allowing squeezing some extra points from a model, getting us closer to the optimum, in a shorter time and with less overfitting. Well, no doubt, that’s wonderful… However, the practical value of these developments is quite limited, especially when using an ensemble of various models, where squeezing the best out of a single model is not really delivered to the bottom line.

Ã‚Â Concentrating efforts on the second level is more fruitful. Not all models are built equal for the task at hand. For example, user-based neighborhood models were found to be vastly inferior to item (movie) -based ones. Moreover, latent factor models were proven to be more accurate than the neighborhood ones (considering that you use the right latent factor model, which happens to be SVD). Most importantly, the design of a good ensemble blending complementing predictors should be mostly done at this level. It is very beneficial to blend SVD with a neighborhood technique and with an RBM. A simple mixture like this, involving quick and straightforward implementations, would probably vastly outperform some very well tuned and elaborated individual models. So this level is certainly important, receives quite a bit of attention at the literature, but not nearly as important as the first level.

Ã‚Â The first level, which decides the aspects of the data to be modeled, is where most pivotal choices are taken. Selecting the right features will make a huge impact on the quality of the results. For example, going beyond the numerical values of the ratings to analyzing which movies are chosen to be rated has a tremendous effect on prediction accuracy. On the other hand, modeling metadata associated with movies, such as identity of actors, or associated keywords, is not a prudent choice regarding the Netflix data. Similarly, modeling the date-dependent dynamics of users’ behavior is very useful. This first level receives less attention in the literature. Perhaps, because it is somewhat application dependant and harder to generalize. However, I can’t emphasize enough its importance.

Ã‚Â In practice, the borders between the three levels that I describe may be quite fuzzy. Moreover, these three levels can be sometimes strongly interlaced with each other, as at the end, a single implementation should fulfill all three levels. However, these days, whatever I think or hear about the Netflix data, I immediately try to relate to those three levels. The more it relates to the first level, the more interested I become, whereas I tend to almost completely ignore improvements related to the third level (well, that’s after exploring that level enough in the past). Just my 2 cents…

Yehuda, while I’m impressed with what you and others have done to improve on Cinematch, I think there’s a level even before the three that you describe–namely, what are the real problems we’re trying to help users address through recommendations. I know that, as a Netflix customer, I don’t get much out of the numerical ratings (or out of sorting by them). When a friend recommends a movie to me, he explains why he thinks I’ll like it. Not only is that far more informative, but it also allows me to interact with the process, since I can correct errors in his explanation (e.g., I like British comedy but not slapstick, and hence hate Montu Python) and thus arrive at different recommendations.

I understand that a simple model of assigning a number to each movie is very easy to measure and lends itself to a competition like the Netflix Prize. But, as an end-user, I question the practical utility of the approach, both in the specific context of movie recommendations and more generally to the problem of helping users find what they want / need.

I’ve been posting a fair amount about IR evaluation (a topic closely related to recommendation) at The Noisy Channel.

Is the Netflix dataset really that large? In the last blog post before this one, Concerns about the Large Scale Learning Challenge, John Langford classifies datasets like Netflix as medium sized, because they fit into RAM on commodity computers. This is just a semantics issue; clearly Netflix is big enough to knock anything quadratic or even anything linear with a really high constant out of the water. Maybe we need an “x-large” or “jumbo” category?

Regarding Dan Tunkelang’s comment, I wasn’t clear about whether your first level meant “what do we model to lower RMSE on the Netflix prize data?” or “what do we model to help Netflix rent more movies?”. The former’s obviously key for the challenge, the latter for building something useful.

Excellent post — thanks for sharing your insight. As you say, the issue of choosing an appropriate representation and what to model shouldn’t be overlooked.

Daniel — you are to some degree right. As many have stated, information system research (whether its recommendation, retrieval, filtering, or something else) needs to be cognizant of the impact on the end user. But, finding relevant or interesting items is

necessaryfunction in all these systems. RMSE is one way to measure how well a system predicts user interest.Maybe I’m just being obtuse, but it seems to me that, as with many recommendation scenarios, there is very little value in boiling down the recommendation to a single scalar (let alone the problem with modeling a Likert scale as a continuous variable). What sense does it make for me to put Schindler’s List and There’s Something About Mary on the same one-dimensional scale?

I realize that modeling recommendations as a Likert scale and evaluating performance using RMSE makes for a tractable contest. I’m just not convinced it’s the main problem that needs to be solved in order to improve a recommendation system. But, as I said, I’m glad that the contest is inspiring interesting work.

At the first level, even with a one-dimensional ratings scale, adding a variance estimate would be helpful.

It’s one thing to rate a movie mean of 3.5 with a deviation of 0.2 and quite another to rate it mean=4.0 and deviation=1.0. There’s actually a higher chance I won’t like the second movie even though it has a higher mean rating estimate (two deviations bring me to 2.0 on the second estimate, and only 3.1 on the first).