A couple of security researchers claim to have cracked the Netflix dataset. The claims of success appear somewhat overstated to me, but the method of attack is valid and could plausibly be substantially improved so as to reveal the movie preferences of a small fraction of Netflix users.
The basic idea is to use a heuristic similarity function between ratings in a public database (from IMDB) and an anonymized database (Netflix) to link ratings in the private database to public identities (in IMDB). They claim to have linked two of a few dozen IMDB users to anonymized Netflix users.
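To make the linking idea concrete, here is a toy sketch in Python. It is not the researchers' actual scoring function. The names `similarity` and `link`, the one-star `tolerance`, the rarity weighting, and the `margin` test against the runner-up are all assumptions of mine, and the data is made up. The intuition it illustrates is that agreeing on an obscure movie is much stronger evidence than agreeing on a blockbuster.

```python
from math import log

def similarity(public, private, popularity, tolerance=1):
    """Score how well a public rating profile matches an anonymized one.

    public, private: dicts mapping movie title -> star rating.
    popularity: dict mapping movie title -> number of raters in the private data.
    A near-matching rating on a rarely rated movie adds more to the score.
    """
    score = 0.0
    for movie, stars in public.items():
        if movie in private and abs(private[movie] - stars) <= tolerance:
            # Hypothetical weighting: rarer movies count for more.
            score += 1.0 / log(2 + popularity.get(movie, 0))
    return score

def link(public, private_db, popularity, margin=1.5):
    """Return the best-matching anonymized id, or None when the best
    match is not clearly ahead of the runner-up."""
    scored = sorted(
        ((similarity(public, ratings, popularity), uid)
         for uid, ratings in private_db.items()),
        reverse=True,
    )
    if not scored or scored[0][0] == 0:
        return None
    if len(scored) > 1 and scored[0][0] < margin * scored[1][0]:
        return None
    return scored[0][1]

# Made-up anonymized database: id -> {movie: stars}.
private_db = {
    "u1": {"Blockbuster": 5, "Obscure Film A": 2, "Obscure Film B": 4},
    "u2": {"Blockbuster": 5, "Obscure Film C": 3},
    "u3": {"Blockbuster": 4},
}

# Count how many anonymized users rated each movie.
popularity = {}
for ratings in private_db.values():
    for movie in ratings:
        popularity[movie] = popularity.get(movie, 0) + 1

# Made-up public profile, as might be scraped from a review site.
public_profile = {"Blockbuster": 5, "Obscure Film A": 2, "Obscure Film B": 5}

match = link(public_profile, private_db, popularity)
ambiguous = link({"Blockbuster": 5}, private_db, popularity)
```

On this toy data, `match` is `"u1"` because two obscure movies line up, while `ambiguous` is `None` because a single rating of a popular movie fits every user equally well. Refusing to link without a clear gap over the runner-up is one simple way to limit false matches.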
The claims seem a bit inflated to me, because (a) knowing the IMDB identity isn’t equivalent to knowing the person and (b) the claims of statistical significance are with respect to a model of the world they created (rather than the real world).
Overall, this is another example showing that complete privacy is hard. It may be worth remembering that there are some substantial benefits from the Netflix challenge as well—we (as a society) have learned something about how to do collaborative filtering which is useful beyond just recommending movies.
Very interesting. I’m not sure about your point (a), that the IMDB identity isn’t equivalent to knowing the person, since the user IDs can be searched.
I went to IMDB, plucked a user name from the first movie link I could find, and searched Google for that user ID (AlmostFamous7), which returned 114 results:
google: AlmostFamous7
A quick scan of the search result snippets turned up information about the location (and lifestyle choices?) of a user with the same ID (though possibly not the same person, of course):
TS/TV/TG Seeking Women in New Zealand
almostfamous7 50 T. Wellington, New Zealand. Wink, Email. Hotlist, Invite as friend. Ask me for a photo. kiwiwood 41 T. Palm Nth, Wairarapa, New Zealand ..
… please feel free to edit or delete this comment 🙂
If they were Facebook, they would start responding to comments for user A on IMDB with something like: user A also likes .
Privacy violation lies in the gap between what was intentionally revealed by a person (presumably as above) and what was unintentionally revealed via a partial join with anonymized data. Given that so much is known about this person online already, not much more information may be revealed.