Contextual Bandits

One of the fundamental underpinnings of the internet is advertising based content. This has become much more effective due to targeted advertising where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way.

The problem of matching ads to interests is a natural machine learning problem in some ways since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. This implies we have a fundamental exploration problem.

A standard mathematical setting for this situation is “k-Armed Bandits”, often with various relevant embellishments. The k-Armed Bandit setting works on a round-by-round basis. On each round:

  1. A policy chooses arm a from 1 of k arms (i.e. 1 of k ads).
  2. The world reveals the reward ra of the chosen arm (i.e. whether the ad is clicked on).

As information is accumulated over multiple rounds, a good policy might converge on a good choice of arm (i.e. ad).

This setting (and its variants) fails to capture a critical phenomenon: each of these displayed ads are done in the context of a search or other webpage. To model this, we might think of a different setting where on each round:

  1. The world announces some context information x (think of this as a high dimensional bit vector if that helps).
  2. A policy chooses arm a from 1 of k arms (i.e. 1 of k ads).
  3. The world reveals the reward ra of the chosen arm (i.e. whether the ad is clicked on).

We can check that this is a critical distinction in 2 ways. First, note that policies using x can encode much more rich decisions than a policy not using x. Just think about: “if a search has the word flowers display a flower advertisement”. Second, we can try to reduce this setting to the k-Armed Bandit setting, and note that it can not be done well. There are two methods that I know of:

  1. Run a different k-Armed Bandit for every value of x. The amount of information required to do well scales linearly in the number of contexts. In contrast, good supervised learning algorithms often require information which is (essentially) independent of the number of contexts.
  2. Take some set of policies and treat every policy h(x) as a different arm. This removes an explicit dependence on the number of contexts, but it creates a linear dependence on the number of policies. Via Occam’s razor/VC dimension/Margin bounds, we already know that supervised learning requires experience much smaller than the number of policies.

We know these are bad reductions by contrast to direct methods for solving the problem. The first algorithm for solving this problem is EXP4 (page 19 = 66) which has a regret with respect to the best policy in a set of O( T0.5 (ln |H|)0.5) where T is the number of rounds and |H| is the number of policies. (Dividing by T gives error-rate like quantities.) This result is independent of the number of contexts x and only weakly dependent (similar to supervised learning) on the number of policies.

EXP4 has a number of drawbacks—it has severe computational requirements and doesn’t work for continuously parameterized policies (*). Tong and I worked out a reasonably simple meta-algorithm Epoch-Greedy which addresses these drawbacks (**), at the cost of sometimes worsening the regret bound to O(T2/3S1/3) where S is related to the representational complexity of supervised learning on the set of policies.

This T dependence is of great concern to people who have worked on bandit problems in the past (where, basically, only the dependence on T could be optimized). In many applications, the S dependence is more important. However, this does leave an important open question: Is it possible to get the best properties of EXP4 and Epoch-Greedy?

Reasonable people could argue about which setting is more important: k-Armed Bandits or Contextual Bandits. I favor Contextual Bandits, even though there has been far more work in the k-Armed Bandit setting. There are several reasons:

  1. I’m having difficulty finding interesting real-world k-Armed Bandit settings which aren’t better thought of as Contextual Bandits in practice. For myself, bandit algorithms are (at best) motivational because they can not be applied to real-world problems without altering them to take context into account.
  2. Doing things in context is one of the underlying (and very successful) tenets of machine learning. Applying this tenet here seems wise.
  3. If we want to eventually solve big problems, we must have composable subelements. Composition doesn’t work without context, because there is no “input” for an I/O diagram.

Any insights into the open question above or Contextual Bandits in general are of great interest to me.

(*) There are some simple modifications to deal with the second issue but not the first.
(**) You have to read between the lines a little bit to see this in the paper. The ERM-style algorithm in the paper could be replaced with an efficient approximate ERM algorithm which is often possible in practice.

Online as the new adjective

Online learning is in vogue, which means we should expect to see in the near future:

  1. Online boosting.
  2. Online decision trees.
  3. Online SVMs. (actually, we’ve already seen)
  4. Online deep learning.
  5. Online parallel learning.
  6. etc…

There are three fundamental drivers of this trend.

  1. Increasing size of datasets makes online algorithms attractive.
  2. Online learning can simply be more efficient than batch learning. Here is a picture from a class on online learning:
    online optimization picture
    The point of this picture is that even in 3 dimensions and even with linear constraints, finding the minima of a set in an online fashion can be typically faster than finding the minima in a batch fashion. To see this, note that there is a minimal number of gradient updates (i.e. 2) required in order to reach the minima in the typical case. Given this, it’s best to do these updates as quickly as possible, which implies doing the first update online (i.e. before seeing all the examples) is preferred. Note that this is the simplest possible setting—making the constraints nonlinear, increasing the dimensionality, introducing noise, etc… all make online learning preferred.
  3. Online learning is conceptually harder, and hence more researchable. The essential difficulty is that there is no well defined static global optima, as exists for many other optimization-based learning algorithms. Grappling with this lack is of critical importance.

NIPS workshops extended to 3 days

(Unofficially, at least.) The Deep Learning Workshop is being held the afternoon before the rest of the workshops in Vancouver, BC. Separate registration is needed, and open.

What’s happening fundamentally here is that there are too many interesting workshops to fit into 2 days. Perhaps we can get it officially expanded to 3 days next year.