BellKor wins Netflix

… but only the little prize. The BellKor team focused on integrating predictions from many different methods. The base methods consist of:

  1. Nearest Neighbor Methods
  2. Matrix Factorization Methods (asymmetric and symmetric)
  3. Linear Regression on various feature spaces
  4. Restricted Boltzmann Machines

The final predictor was an ensemble (as was reasonable to expect), although it’s a little bit more complicated than just a weighted average—it’s essentially a customized learning algorithm. Base approaches (1)-(3) seem like relatively well-known approaches (although I haven’t seen the asymmetric factorization variant before). RBMs are the new approach.
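
To make the blending idea concrete, here is a minimal sketch of the simplest version: fitting linear weights over the base methods’ predictions on a held-out set. The BellKor blend is considerably more sophisticated (again, essentially a customized learning algorithm), so the synthetic data, the ridge penalty, and all names below are assumptions for illustration only.

```python
import numpy as np

# Hypothetical held-out data: base_preds[i, j] is base method j's predicted
# rating for held-out pair i; true_ratings[i] is the actual rating.
rng = np.random.default_rng(0)
true_ratings = rng.uniform(1, 5, size=1000)
base_preds = np.column_stack([
    true_ratings + rng.normal(0, s, size=1000)   # noisy stand-in predictors
    for s in (0.9, 1.0, 1.1, 1.2)
])

# Ridge-regression blend: w = argmin_w ||Xw - y||^2 + lam ||w||^2
lam = 1.0
X, y = base_preds, true_ratings
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

blended = X @ w
print("blend RMSE:      ", np.sqrt(np.mean((blended - y) ** 2)))
print("best single RMSE:", min(np.sqrt(np.mean((X[:, j] - y) ** 2))
                               for j in range(X.shape[1])))
```

Even this naive blend typically beats the best single predictor on the synthetic data, which is part of why an ensemble was the reasonable expectation here.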

The writeup is pretty clear; see it for more details.

The contestants are close to reaching the big prize, but the last 1.5% is probably at least as hard as what’s been done. A few new structurally different methods for making predictions may need to be discovered and added into the mixture. In other words, research may be required.

The Machine Learning Award goes to …

Perhaps the biggest CS prize for research is the Turing Award, which has a $0.25M cash prize associated with it. It appears none of the prizes so far have been for anything like machine learning (the closest are perhaps database awards).

In CS theory, there is the Gödel Prize, which is smaller and newer, offering a $5K prize along with (perhaps more importantly) recognition. One such award has been given for Machine Learning, to Robert Schapire and Yoav Freund for AdaBoost.

In Machine Learning, there seems to be no equivalent of these sorts of prizes. There are several plausible reasons for this:

  1. There is no coherent community. People drift in and out of the central conferences all the time, and most of the author names from 10 years ago do not appear in the conferences of today. In addition, the entire subject area is fairly new. Still, there is at least a core group of people who have stayed around.
  2. Machine Learning work doesn’t last. Almost every paper is forgotten, because {the goals change, there isn’t any real progress, there are no teachable foundations}. This is true of all fields of research.
  3. There is no consensus (or clarity) on what’s important. The field is fractured between many very different viewpoints—statistical, empirical, AI, and theoretical. Prioritizing results across these very different viewpoints is hard. We can hope that people who have been around for a while learn to appreciate the other viewpoints. The Turing Award covers a much more diverse set of interests.

I think there are some sound reasons why we-as-a-field-of-research should want a field-wide award.

  1. Aspiration. Perhaps the most valuable aspect of an award is that it gives people an incentive to aim for something in the long term. The closest approximation we have right now is “best paper” awards at individual conferences. Best paper awards have a role, but it’s not the same. 10 years from now, when we look back 10 years, which papers will seem most significant? Across many different conferences?
  2. Representation. One function of an award is that it tells other people what we consider good work. In an academic reference frame, it gives information of the form “this person deserves tenure”. From a less cynical point of view, having a short list of “what’s important” that outsiders can peruse might be quite helpful as a starting point in understanding what’s going on.
  3. Crystallization. Research is a process of discovering information and placing it carefully in context. An award has some role in furthering that process.

The worst part of any award is administering it. How do you avoid wasting time and playing favorites while keeping the higher level vision of what might be useful in the long term? I don’t have any great insight, except that it would need to be done.

Contextual Bandits

One of the fundamental underpinnings of the internet is advertising-based content. This has become much more effective due to targeted advertising, where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines, and all search engines try to make money this way.

The problem of matching ads to interests is a natural machine learning problem in some ways, since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular, a click-or-not on one ad doesn’t generally tell you whether a different ad would have been clicked on. This implies we have a fundamental exploration problem.

A standard mathematical setting for this situation is “k-Armed Bandits”, often with various relevant embellishments. The k-Armed Bandit setting works on a round-by-round basis. On each round:

  1. A policy chooses arm a from 1 of k arms (i.e. 1 of k ads).
  2. The world reveals the reward r_a of the chosen arm (i.e. whether the ad is clicked on).

As information is accumulated over multiple rounds, a good policy might converge on a good choice of arm (i.e. ad).
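
For concreteness, here is a minimal simulation of this setting with an epsilon-greedy policy standing in for “a good policy”. The click probabilities, epsilon, and horizon are made-up values, not anything from the text.

```python
import random

k = 5
click_prob = [0.02, 0.03, 0.05, 0.04, 0.01]   # hypothetical per-ad click rates
epsilon, T = 0.1, 100_000

counts = [0] * k       # pulls per arm
totals = [0.0] * k     # total reward per arm

for t in range(T):
    # Policy: with probability epsilon explore a random arm; otherwise pick the
    # arm with the best empirical mean, treating unpulled arms optimistically.
    if random.random() < epsilon:
        a = random.randrange(k)
    else:
        a = max(range(k), key=lambda i: totals[i] / counts[i] if counts[i] else 1.0)
    # World: reveal only the reward r_a of the chosen arm.
    r_a = 1.0 if random.random() < click_prob[a] else 0.0
    counts[a] += 1
    totals[a] += r_a

print("empirical means:", [round(totals[i] / counts[i], 4) for i in range(k)])
```

After enough rounds the empirical means concentrate and the policy settles on the best arm, which is all “converging on a good choice of arm” asks for here.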

This setting (and its variants) fails to capture a critical phenomenon: each of these ads is displayed in the context of a search or other webpage. To model this, we might think of a different setting where on each round (a code sketch follows the list):

  1. The world announces some context information x (think of this as a high dimensional bit vector if that helps).
  2. A policy chooses arm a from 1 of k arms (i.e. 1 of k ads).
  3. The world reveals the reward r_a of the chosen arm (i.e. whether the ad is clicked on).
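
The same loop written out with a context, to emphasize that the policy is now a function of x. The 4-bit context, the particular policy, and the click model below are placeholders for illustration.

```python
import random

k = 3  # three candidate ads

def draw_context():
    # 1. The world announces context x: a made-up 4-bit feature vector, where
    #    bit 0 stands in for "the search contains the word flowers".
    return tuple(random.randint(0, 1) for _ in range(4))

def policy(x):
    # 2. A policy maps the context to one of the k arms: show the flower ad
    #    (arm 0) when the flowers bit is set, otherwise a default ad (arm 2).
    return 0 if x[0] == 1 else 2

def reveal_reward(x, a):
    # 3. The world reveals r_a for the chosen arm only (hypothetical click
    #    model in which the flower ad does better on flower-related searches).
    p = 0.3 if (a == 0 and x[0] == 1) else 0.05
    return 1.0 if random.random() < p else 0.0

for t in range(5):
    x = draw_context()
    a = policy(x)
    r_a = reveal_reward(x, a)
    print(t, x, a, r_a)
```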

We can check that this is a critical distinction in 2 ways. First, note that policies using x can encode much richer decisions than a policy not using x. Just think about: "if a search has the word flowers, display a flower advertisement". Second, we can try to reduce this setting to the k-Armed Bandit setting and note that it cannot be done well. There are two methods that I know of:

  1. Run a different k-Armed Bandit for every value of x (sketched in code after this list). The amount of information required to do well scales linearly in the number of contexts. In contrast, good supervised learning algorithms often require an amount of information which is (essentially) independent of the number of contexts.
  2. Take some set of policies and treat every policy h(x) as a different arm. This removes an explicit dependence on the number of contexts, but it creates a linear dependence on the number of policies. Via Occam’s razor/VC dimension/Margin bounds, we already know that supervised learning requires experience much smaller than the number of policies.
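
To make the scaling problem in the first reduction concrete, here is a sketch of it: an entirely separate epsilon-greedy bandit for every distinct context value (all names and constants below are placeholders).

```python
import random
from collections import defaultdict

k = 3
epsilon = 0.1

# Reduction 1: map each context value to its own independent bandit state,
# i.e. per-arm pull counts and reward totals for that context alone.
state = defaultdict(lambda: ([0] * k, [0.0] * k))

def choose_arm(x):
    counts, totals = state[x]
    if random.random() < epsilon:
        return random.randrange(k)
    # Arms never tried under this particular context are treated optimistically.
    return max(range(k), key=lambda i: totals[i] / counts[i] if counts[i] else 1.0)

def update(x, a, r):
    counts, totals = state[x]
    counts[a] += 1
    totals[a] += r

# Nothing learned under one context transfers to any other, so with d binary
# features the experience needed can scale with up to 2**d contexts rather
# than with the complexity of a policy class.
```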

We know these are bad reductions by contrast to direct methods for solving the problem. The first algorithm for solving this problem is EXP4 (page 19 = 66) which has a regret with respect to the best policy in a set of O(T^{0.5} (ln |H|)^{0.5}) where T is the number of rounds and |H| is the number of policies. (Dividing by T gives error-rate like quantities.) This result is independent of the number of contexts x and only weakly dependent (similar to supervised learning) on the number of policies.

EXP4 has a number of drawbacks—it has severe computational requirements and doesn’t work for continuously parameterized policies (*). Tong and I worked out a reasonably simple meta-algorithm Epoch-Greedy which addresses these drawbacks (**), at the cost of sometimes worsening the regret bound to O(T^{2/3} S^{1/3}) where S is related to the representational complexity of supervised learning on the set of policies.
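
Very roughly, the Epoch-Greedy idea is to alternate a small amount of uniform exploration with exploitation of a policy fit to the exploration data collected so far. The sketch below is a loose reading of that structure, not the paper’s algorithm: the fixed number of exploitation steps per epoch and the black-box learn routine (standing in for ERM, or an efficient approximate ERM) are simplifying assumptions.

```python
import random

def epoch_greedy(draw_context, get_reward, learn, k, num_epochs, exploit_steps):
    """Loose Epoch-Greedy-style loop: one uniform exploration step per epoch,
    then exploitation of a policy fit to importance-weighted exploration data."""
    data = []                                # exploration samples (x, a, weighted r)
    policy = lambda x: random.randrange(k)   # act uniformly before any data exists
    for epoch in range(num_epochs):
        # Exploration step: uniform random arm; weight the observed reward by
        # k = 1 / (probability of choosing that arm) to keep estimates unbiased.
        x = draw_context()
        a = random.randrange(k)
        data.append((x, a, k * get_reward(x, a)))
        # Fit a policy to all exploration data collected so far.
        policy = learn(data)
        # Exploitation steps: act greedily with the current policy.
        for _ in range(exploit_steps):
            x = draw_context()
            get_reward(x, policy(x))
    return policy
```

In the paper, the number of exploitation steps per epoch is chosen from the sample-complexity behavior of the policy class, which is where the S term in the regret bound comes from.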

This T dependence is of great concern to people who have worked on bandit problems in the past (where, basically, only the dependence on T could be optimized). In many applications, the S dependence is more important. However, this does leave an important open question: Is it possible to get the best properties of EXP4 and Epoch-Greedy?

Reasonable people could argue about which setting is more important: k-Armed Bandits or Contextual Bandits. I favor Contextual Bandits, even though there has been far more work in the k-Armed Bandit setting. There are several reasons:

  1. I’m having difficulty finding interesting real-world k-Armed Bandit settings which aren’t better thought of as Contextual Bandits in practice. For myself, bandit algorithms are (at best) motivational, because they cannot be applied to real-world problems without altering them to take context into account.
  2. Doing things in context is one of the underlying (and very successful) tenets of machine learning. Applying this tenet here seems wise.
  3. If we want to eventually solve big problems, we must have composable subelements. Composition doesn’t work without context, because there is no “input” for an I/O diagram.

Any insights into the open question above or Contextual Bandits in general are of great interest to me.

(*) There are some simple modifications to deal with the second issue but not the first.
(**) You have to read between the lines a little bit to see this in the paper. The ERM-style algorithm in the paper could be replaced with an efficient approximate ERM algorithm which is often possible in practice.

Second Annual Reinforcement Learning Competition

The Second Annual Reinforcement Learning Competition is about to get started. The aim of the competition is to facilitate direct comparisons between various learning methods on important and realistic domains. This year’s event will feature well-known benchmark domains as well as more challenging problems of real-world complexity, such as helicopter control and robot soccer keepaway.

The competition begins on November 1st, 2007 when training software is released. Results must be submitted by July 1st, 2008. The competition will culminate in an event at ICML-08 in Helsinki, Finland, at which the winners will be announced.

For more information, visit the competition website.