The competitors for the Netflix Prize are tantalizingly close to winning the million dollar prize. This year, BellKor and Commendo Research submitted a combined solution that won the progress prize. Reading the writeups is instructive. Several aspects of the solutions are taken for granted, including stochastic gradient descent, ensemble prediction, and targeting residuals (a form of boosting). Relative to last year, many approaches appear to have added parameterizations, especially for the purpose of modeling through time.
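For readers who haven't seen the residual-targeting idea before, here is a minimal sketch of the general pattern, not any team's actual system: fit a base predictor by stochastic gradient descent, fit a second model on the base model's residuals, and add the two predictions. The data, dimensions, and learning rates below are made up for illustration.

```python
# Sketch of "targeting residuals": a second model is trained on what the first
# model gets wrong, and the two are summed. Data, sizes, and hyperparameters
# are made up for illustration; this is not any Netflix Prize team's code.
import numpy as np

rng = np.random.default_rng(0)

# Toy (user, item, rating) triples -- hypothetical data.
n_users, n_items, n_obs = 100, 200, 5000
users = rng.integers(0, n_users, n_obs)
items = rng.integers(0, n_items, n_obs)
ratings = rng.uniform(1, 5, n_obs)

# Stage 1: matrix factorization fit by stochastic gradient descent.
k, lr, reg = 10, 0.01, 0.02
P = 0.1 * rng.standard_normal((n_users, k))   # user factors
Q = 0.1 * rng.standard_normal((n_items, k))   # item factors
for _ in range(20):
    for u, i, r in zip(users, items, ratings):
        pu = P[u].copy()
        err = r - pu @ Q[i]
        P[u] += lr * (err * Q[i] - reg * pu)
        Q[i] += lr * (err * pu - reg * Q[i])

# Stage 2: a simple per-item bias model trained on the stage-1 residuals.
base_pred = np.einsum('ij,ij->i', P[users], Q[items])
residuals = ratings - base_pred
item_bias = np.zeros(n_items)
counts = np.zeros(n_items)
np.add.at(item_bias, items, residuals)
np.add.at(counts, items, 1)
item_bias /= np.maximum(counts, 1)

# Ensemble prediction: base model plus the residual model.
pred = base_pred + item_bias[items]
print("train RMSE:", np.sqrt(np.mean((ratings - pred) ** 2)))
```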
The big question is: will they make the big prize? At this point, the level of complexity required to enter the competition is prohibitive, so perhaps only the existing competitors will keep trying. (This equation might change drastically if the teams open source their existing solutions, including parameter settings.) One fear is that the progress is asymptoting on the wrong side of the 10% threshold. In the first year, the teams closed 84.3% of the 10% gap, and in the second year they closed just 64.4% of the remaining gap. While those numbers suggest an asymptote on the wrong side, in the month since the progress prize another 34.0% of the remainder has been closed. It's genuinely too close to call, with just a 0.0035 RMSE gap left to win the big prize. Clever people finding just the right parameterization might very well succeed.
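The fractions above can be turned into a quick RMSE back-of-envelope. The baseline value below is a Cinematch-like placeholder rather than an official figure, but the arithmetic reproduces the roughly 0.0035 remaining gap from the stated percentages alone.

```python
# Reproducing the "fraction of the gap" arithmetic from the stated percentages.
# The baseline RMSE is a Cinematch-like placeholder, not an official figure.
def close_gap(current, target, fraction):
    """Move `fraction` of the way from `current` down to `target`."""
    return current - fraction * (current - target)

baseline = 0.9514                            # placeholder starting RMSE
target = 0.9 * baseline                      # the 10% improvement threshold
year1 = close_gap(baseline, target, 0.843)   # 84.3% of the gap closed
year2 = close_gap(year1, target, 0.644)      # 64.4% of what remained
month = close_gap(year2, target, 0.340)      # another 34.0% since the prize

print("remaining RMSE gap:", round(month - target, 4))   # ~0.0035
```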
Your post is interesting, but I’ve not seen mention of a concern: are people just overfitting at this point?
If they introduce all sorts of knobs and switches, and tune them just right, does that represent true progress, given that Netflix is testing on the same hidden set over and over again?
If you read the rules carefully, you see that the amount of feedback from the most held-out set (i.e., the “test” set rather than the “quiz” set) is extremely low, on the order of a few bits per year. Given this and the size of the set, I don't think overfitting is an issue.
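To make the "few bits per year" point concrete, here is a standard Occam-style back-of-envelope (a rough argument, not something stated in the rules): k bits of feedback can distinguish at most 2^k models, so a union bound over a large holdout keeps the possible overfitting tiny. RMSE is not a bounded 0/1 loss and the holdout size below is approximate, so treat this strictly as an order-of-magnitude sketch.

```python
# Order-of-magnitude Occam bound: k bits of feedback distinguish at most 2**k
# models, so for a loss scaled to [0, 1] on n holdout examples the worst-case
# overfitting is about sqrt((k*ln 2 + ln(1/delta)) / (2n)). RMSE is not a
# bounded 0/1 loss, so this is only a rough argument; n is approximate.
import math

def occam_deviation(bits_of_feedback, n_holdout, delta=0.05):
    return math.sqrt((bits_of_feedback * math.log(2) + math.log(1 / delta))
                     / (2 * n_holdout))

# With ~1.4 million held-out ratings and ~10 bits of feedback per year:
print(occam_deviation(10, 1_400_000))   # roughly 0.002
```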
It sucks that models have gotten so complex as to bar newcomers from entering the contest; it's such a cool contest for those interested in machine learning. Hopefully some other tech-savvy corporation will follow Netflix's lead and introduce another machine learning contest.
A question we should ask is about the team rankings’ stability: if we had a second holdout sample, what are the odds the team rankings would be preserved? Given that, what might be alternative ways to judge the winner?
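One concrete way to probe that stability, sketched below with simulated per-rating errors rather than real submissions: bootstrap-resample the holdout and count how often the ordering of two closely matched teams flips.

```python
# Probing ranking stability with a bootstrap: resample the holdout and count
# how often the ordering of two closely matched teams flips. The per-rating
# squared errors below are simulated, purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000                                     # smaller than the real holdout
err_a = rng.uniform(0, 4, n)                    # fake squared errors, team A
err_b = err_a + rng.normal(0.0001, 0.05, n)     # team B: barely worse on average

def rmse(sq_errors):
    return np.sqrt(sq_errors.mean())

flips = 0
for _ in range(200):                            # 200 bootstrap replicates
    idx = rng.integers(0, n, n)
    if rmse(err_a[idx]) > rmse(err_b[idx]):     # does the ranking reverse?
        flips += 1
print("fraction of resamples where the ranking flips:", flips / 200)
```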
That’s a good question—stability is unclear.
Were I designing the contest, I'd use PAC-Bayes theory and weight exponentially according to the gap in performance between competitors. See the “Second Main Result” here.
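One simple reading of that suggestion (an illustration only, not the scheme in the linked result): split the prize in proportion to exp(-gap/scale), where gap is each team's RMSE minus the best RMSE. The RMSE values and the scale below are hypothetical.

```python
# One simple instantiation of "exponential weight according to the gap":
# split the prize in proportion to exp(-gap / scale), gap being each team's
# RMSE minus the best RMSE. The RMSEs and the scale are hypothetical, and this
# is only an illustration, not the scheme in the linked result.
import math

rmses = {"team_a": 0.8612, "team_b": 0.8616, "team_c": 0.8700}
best = min(rmses.values())
scale = 0.001                                   # hypothetical "temperature"
weights = {t: math.exp(-(r - best) / scale) for t, r in rmses.items()}
total = sum(weights.values())
for team, w in sorted(weights.items()):
    print(team, round(1_000_000 * w / total))   # share of the million dollars
```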
Doesn’t recent machine learning research attach confidence intervals to performance indices? I guess one algorithm would then only beat another if its performance index were higher and the two were significantly different in the statistical sense.
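A minimal sketch of what such a comparison could look like on a shared holdout, using simulated squared errors and a paired normal-approximation interval on the mean-squared-error difference (the data and the 1.96 multiplier are assumptions for illustration):

```python
# A paired comparison on a shared holdout: put a confidence interval on the
# difference in mean squared error between two models. The squared errors are
# simulated; the 1.96 multiplier assumes a normal approximation.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
sq_a = rng.uniform(0, 4, n)                 # fake squared errors, model A
sq_b = sq_a + rng.normal(0.0005, 0.05, n)   # model B, very slightly worse

diff = sq_b - sq_a                          # paired per-example differences
mean = diff.mean()
half_width = 1.96 * diff.std(ddof=1) / np.sqrt(n)
print(f"MSE(B) - MSE(A): {mean:.5f} +/- {half_width:.5f}")
# Only call B worse than A if the whole interval lies above zero.
```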