The Netflix Crack

A couple of security researchers claim to have cracked the Netflix dataset. The claims of success appear somewhat overstated to me, but the method of attack is valid and could plausibly be substantially improved, enough to reveal the movie preferences of a small fraction of Netflix users.

The basic idea is to use a heuristic similarity function between ratings in a public database (from IMDB) and an anonymized database (Netflix) to link ratings in the private database to public identities (in IMDB). They claim to have linked two of a few dozen IMDB users to anonymized Netflix users.
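To make the linkage idea concrete, here is a minimal sketch in Python of this style of attack. The rarity weighting and the best-versus-runner-up margin test follow the general recipe, but the function names, the exact score formula, and the threshold are illustrative assumptions, not the researchers' actual code.

```python
import math
from statistics import pstdev

def match_score(aux_ratings, candidate_ratings, popularity):
    """Score how well one anonymized record matches a user's public ratings.
    Agreement on a rare movie is much stronger evidence than on a popular one.
    popularity[movie] is the movie's rating count (assumed >= 1)."""
    score = 0.0
    for movie, rating in aux_ratings.items():
        if movie in candidate_ratings:
            weight = 1.0 / math.log(1 + popularity[movie])  # rarity weighting
            # partial credit for close-but-not-exact ratings (1-5 star scale)
            score += weight * max(0.0, 1.0 - abs(rating - candidate_ratings[movie]) / 4.0)
    return score

def link(aux_ratings, anon_db, popularity, margin=1.5):
    """Return the anonymized record whose score stands out from the runner-up
    by `margin` standard deviations of the score distribution, else None."""
    scores = {rid: match_score(aux_ratings, recs, popularity)
              for rid, recs in anon_db.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    spread = pstdev(scores.values())
    if spread > 0 and (scores[ranked[0]] - scores[ranked[1]]) / spread >= margin:
        return ranked[0]   # confident link to a public identity
    return None            # ambiguous: no link claimed
```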

The claims seem a bit inflated to me, because (a) knowing the IMDB identity isn’t equivalent to knowing the person and (b) the claims of statistical significance are with respect to a model of the world they created (rather than the real world).

Overall, this is another example showing that complete privacy is hard. It may be worth remembering that there are some substantial benefits from the Netflix challenge as well—we (as a society) have learned something about how to do collaborative filtering which is useful beyond just recommending movies.

Slashdot has some discussion.

Computational Consequences of Classification

In the regression vs classification debate, I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in at least two situations.

  1. In active learning it is sometimes possible to find an e error classifier with just log(1/e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that, on many examples, the loss function is flat with respect to large variations in the parameter space of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points. (See the binary search sketch after this list.)
  2. In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints while a squared loss regressor cannot. For example, if one feature determines whether a binary label has probability less than or greater than 0.5, a great classifier exists using just that feature. Because squared loss is sensitive to the exact probability, many more features may be required to learn well with respect to squared loss. (The simulation after this list illustrates the point.)
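The 1-dimensional threshold class is the standard illustration of the first point: with noiseless labels, binary search over the interval finds an e-accurate threshold using about log(1/e) label queries, whereas passive learning needs on the order of 1/e labeled samples. A minimal sketch (the names and setup are mine, not from any particular paper):

```python
def learn_threshold(label_of, eps):
    """Actively learn a threshold classifier on [0, 1] from noiseless labels.

    Each query halves the interval known to contain the true threshold,
    so ceil(log2(1/eps)) queries suffice for error eps -- versus roughly
    1/eps random labeled examples for passive learning.
    """
    lo, hi, queries = 0.0, 1.0, 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if label_of(mid) == 1:   # true threshold is at or left of mid
            hi = mid
        else:                    # true threshold is right of mid
            lo = mid
        queries += 1
    return (lo + hi) / 2, queries

# hypothetical usage: true threshold at 0.3137, target error 0.001
est, n = learn_threshold(lambda x: int(x >= 0.3137), 1e-3)
# n == 10, since 2**-10 < 1e-3; est is within 1e-3 of the true threshold
```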
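The second point can be checked with a small simulation. Below, one binary feature decides which side of 0.5 the label probability falls on, while nine other features only nudge the exact probability; the probabilities and nudge sizes are invented for illustration. The one-feature rule is already Bayes-optimal for 0/1 loss, but a one-feature squared-loss regressor carries irreducible extra loss:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.integers(0, 2, n)             # the single decisive feature
x_rest = rng.integers(0, 2, (n, 9))    # nine features that refine the probability
# P(y=1) is 0.6 or 0.4 depending on x1, nudged by +/-0.02 per remaining feature;
# the total nudge stays within +/-0.09, so the probability never crosses 0.5
p = np.where(x1 == 1, 0.6, 0.4) + 0.02 * (x_rest.sum(axis=1) - 4.5)
y = rng.random(n) < p

err_one = np.mean((x1 == 1) != y)       # predicting y = x1: Bayes-optimal 0/1 loss
pred_one = np.where(x1 == 1, 0.6, 0.4)  # best squared-loss predictor given x1 alone
sq_one = np.mean((pred_one - y) ** 2)
sq_all = np.mean((p - y) ** 2)          # regressor using all ten features
print(err_one, sq_one, sq_all)          # sq_one > sq_all, yet 0/1 loss can't improve
```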

BellKor wins Netflix

… but only the little prize. The BellKor team focused on integrating predictions from many different methods. The base methods consist of:

  1. Nearest Neighbor Methods
  2. Matrix Factorization Methods (asymmetric and symmetric)
  3. Linear Regression on various feature spaces
  4. Restricted Boltzmann Machines

The final predictor was an ensemble (as was reasonable to expect), although it’s a little more complicated than just a weighted average: it’s essentially a customized learning algorithm. Base approaches (1)-(3) are relatively well known (although I haven’t seen the asymmetric factorization variant before); RBMs are the new ingredient.
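As a rough illustration of what even the simpler weighted-average style of combination looks like, here is a generic stacking sketch that fits blend weights on held-out data by ridge regression. This is a simplification for illustration, not BellKor’s actual combiner:

```python
import numpy as np

def fit_blend_weights(base_preds, ratings, l2=1.0):
    """Fit linear blend weights over base predictors by ridge regression.

    base_preds: (n_holdout, n_models) predictions of each base method on
    a held-out set; ratings: the true held-out ratings.
    """
    X = np.column_stack([np.ones(len(ratings)), base_preds])  # intercept + models
    A = X.T @ X + l2 * np.eye(X.shape[1])   # closed-form ridge: (X'X + l2*I) w = X'y
    return np.linalg.solve(A, X.T @ ratings)

def blend(base_preds, w):
    """Combine fresh base predictions using the learned weights."""
    return np.column_stack([np.ones(base_preds.shape[0]), base_preds]) @ w
```

Fitting the weights on data the base methods never trained on is the important part; fitting them on the training set would simply reward the most overfit base method.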

The writeup is pretty clear if you want more details.

The contestants are close to reaching the big prize, but the last 1.5% is probably at least as hard as what’s been done. A few new structurally different methods for making predictions may need to be discovered and added into the mixture. In other words, research may be required.