One of the fundamental underpinnings of the internet is advertising-based content. Targeted advertising, where ads are specifically matched to a user's interests, has made this much more effective. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way.

The problem of matching ads to interests is in some ways a natural machine learning problem, since there is much information in who clicks on what. A fundamental difficulty with this information is that it is not supervised: in particular, a click-or-not on one ad doesn't generally tell you whether a different ad would have been clicked on. This implies we have a fundamental exploration problem.

A standard mathematical setting for this situation is “*k*-Armed Bandits”, often with various relevant embellishments. The *k*-Armed Bandit setting works on a round-by-round basis. On each round:

- A policy chooses arm *a* from 1 of *k* arms (i.e. 1 of k ads).
- The world reveals the reward *r_a* of the chosen arm (i.e. whether the ad is clicked on).

As information is accumulated over multiple rounds, a good policy might converge on a good choice of arm (i.e. ad).
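To make the round-by-round protocol concrete, here is a minimal sketch of an epsilon-greedy player for this setting. The click probabilities and helper names are invented for illustration; epsilon-greedy is just one simple policy, not the only (or best) choice.

```python
import random

def epsilon_greedy(pull, k, rounds, epsilon=0.1, seed=0):
    """Play a k-armed bandit: with probability epsilon explore a random
    arm, otherwise exploit the best empirical mean so far."""
    rng = random.Random(seed)
    counts = [0] * k
    means = [0.0] * k
    total = 0.0
    for _ in range(rounds):
        if rng.random() < epsilon:
            a = rng.randrange(k)                       # explore
        else:
            a = max(range(k), key=lambda i: means[i])  # exploit
        r = pull(a)                                # only this arm's reward is seen
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]     # running mean update
        total += r
    return means, total

# Hypothetical ads with fixed click probabilities (arm 2 is best).
probs = [0.02, 0.05, 0.10]
env = random.Random(1)
means, total = epsilon_greedy(lambda a: float(env.random() < probs[a]),
                              k=3, rounds=20000)
```

With enough rounds, the running means separate and the policy converges on the best arm, which is exactly the "good choice of arm" behavior described above.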

This setting (and its variants) fails to capture a critical phenomenon: each of these ads is displayed in the context of a search or other webpage. To model this, we might think of a different setting where on each round:

- The world announces some context information *x* (think of this as a high-dimensional bit vector if that helps).
- A policy chooses arm *a* from 1 of *k* arms (i.e. 1 of k ads).
- The world reveals the reward *r_a* of the chosen arm (i.e. whether the ad is clicked on).
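The contextual protocol can be sketched the same way. In the toy world below (the query strings, click probabilities, and helper names are all invented), a policy that reads *x* earns far more reward than one that ignores it:

```python
import random

def run_contextual_bandit(policy, draw_context, reward, rounds, seed=0):
    """Contextual bandit interaction loop: observe context x, choose
    arm a = policy(x), and see only the reward of that arm."""
    rng = random.Random(seed)
    history = []
    for _ in range(rounds):
        x = draw_context(rng)   # world announces context
        a = policy(x)           # policy maps context -> arm
        r = reward(x, a, rng)   # world reveals reward of the chosen arm only
        history.append((x, a, r))
    return history

# Toy world: context is a query string; arm 0 = flower ad, arm 1 = shoe ad.
contexts = ["buy flowers", "running shoes"]
def reward(x, a, rng):
    best = 0 if "flowers" in x else 1
    return float(rng.random() < (0.3 if a == best else 0.05))

context_policy = lambda x: 0 if "flowers" in x else 1   # uses x
constant_policy = lambda x: 0                           # ignores x

h1 = run_contextual_bandit(context_policy, lambda rng: rng.choice(contexts),
                           reward, rounds=5000)
h2 = run_contextual_bandit(constant_policy, lambda rng: rng.choice(contexts),
                           reward, rounds=5000)
```

The context-using policy matches the best ad on every round, while any constant policy can be right at most half the time here.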

We can check that this is a critical distinction in two ways. First, note that policies using *x* can encode much richer decisions than policies not using *x*. Just think about: "if a search has the word flowers, display a flower advertisement". Second, we can try to reduce this setting to the *k*-Armed Bandit setting, and note that it cannot be done well. There are two methods that I know of:

- Run a different *k*-Armed Bandit for every value of *x*. The amount of information required to do well scales linearly in the number of contexts. In contrast, good supervised learning algorithms often require an amount of information which is (essentially) independent of the number of contexts.
- Take some set of policies and treat every policy *h(x)* as a different arm. This removes an explicit dependence on the number of contexts, but it creates a linear dependence on the number of policies. Via Occam's razor/VC dimension/margin bounds, we already know that supervised learning requires experience much smaller than the number of policies.
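A quick back-of-the-envelope count makes the scaling of both reductions concrete. The sizes below are assumptions chosen for illustration, not numbers from any real system:

```python
import math

# Illustrative sizes (assumptions, not from the post).
k = 10                   # ads (arms)
num_contexts = 2 ** 20   # distinct contexts x, e.g. all 20-bit feature vectors

# Reduction 1: one k-armed bandit per context means exploring k arms in
# each of |X| contexts, so the information needed scales like k * |X|.
arms_per_context_reduction = k * num_contexts      # 10485760

# Reduction 2: every deterministic policy h: X -> {1..k} becomes an arm,
# giving k^{|X|} arms -- so huge we can only write down its logarithm.
log10_policies = num_contexts * math.log10(k)      # the class has ~10^1048576 members
```

Either way the reduction pays a price (linear in contexts, or linear in an astronomically large policy count) that supervised learning avoids.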

We know these are bad reductions by contrast to direct methods for solving the problem. The first algorithm for solving this problem is EXP4 (page 19 = 66), which has regret with respect to the best policy in a set of *O(T^{0.5} (ln |H|)^{0.5})*, where *T* is the number of rounds and *|H|* is the number of policies. (Dividing by *T* gives error-rate-like quantities.) This result is independent of the number of contexts *x* and only weakly dependent (similar to supervised learning) on the number of policies.
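For intuition, here is a toy sketch of the exponential-weights idea behind EXP4 over a finite policy set. The exploration parameter, renormalization, and toy world are simplifications for illustration, not the paper's exact pseudocode:

```python
import math, random

def exp4(policies, draw_context, reward, k, rounds, gamma=0.1, seed=0):
    """Sketch of EXP4: keep a weight per policy, mix their arm
    recommendations with uniform exploration, and update weights with
    importance-weighted reward estimates."""
    rng = random.Random(seed)
    w = [1.0] * len(policies)
    total = 0.0
    for _ in range(rounds):
        x = draw_context(rng)
        advice = [h(x) for h in policies]       # each policy recommends an arm
        W = sum(w)
        p = [gamma / k] * k                     # uniform exploration floor
        for wi, a in zip(w, advice):
            p[a] += (1 - gamma) * wi / W        # mix in the weighted policies
        a = rng.choices(range(k), weights=p)[0]
        r = reward(x, a, rng)
        total += r
        rhat = r / p[a]                         # importance-weighted estimate
        for i, ai in enumerate(advice):
            if ai == a:                         # credit policies that agreed
                w[i] *= math.exp(gamma * rhat / k)
        m = max(w)
        w = [wi / m for wi in w]                # renormalize (avoids overflow)
    return w, total

# Toy world (invented numbers): arm 0 is best for "flowers", arm 1 for "shoes".
def reward(x, a, rng):
    best = 0 if x == "flowers" else 1
    return float(rng.random() < (0.4 if a == best else 0.05))

policies = [lambda x: 0,                           # always arm 0
            lambda x: 1,                           # always arm 1
            lambda x: 0 if x == "flowers" else 1]  # uses the context
w, total = exp4(policies, lambda rng: rng.choice(["flowers", "shoes"]),
                reward, k=2, rounds=5000)
```

The update touches every policy every round, which hints at the computational drawback discussed next: the work per round scales with *|H|*.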

EXP4 has a number of drawbacks: it has severe computational requirements and doesn't work for continuously parameterized policies (*). Tong and I worked out a reasonably simple meta-algorithm, Epoch-Greedy, which addresses these drawbacks (**), at the cost of sometimes worsening the regret bound to *O(T^{2/3} S^{1/3})*, where *S* is related to the representational complexity of supervised learning on the set of policies.
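A minimal sketch of the Epoch-Greedy idea, assuming a finite policy set, a simple growing exploitation schedule, and brute-force ERM over importance-weighted estimates (all illustrative choices standing in for the paper's exact algorithm and schedule):

```python
import math, random

def epoch_greedy(policies, draw_context, reward, k, epochs, seed=0):
    """Sketch of Epoch-Greedy: each epoch does one uniformly-random
    exploration step, then exploits the ERM policy for a growing run."""
    rng = random.Random(seed)
    samples = []                                 # exploration data: (x, a, r)
    total = 0.0

    def erm():                                   # best policy under the
        def value(h):                            # importance-weighted estimate
            return sum(r * k for (x, a, r) in samples if h(x) == a)
        return max(policies, key=value)

    for l in range(1, epochs + 1):
        x = draw_context(rng)                    # exploration: uniform arm
        a = rng.randrange(k)
        r = reward(x, a, rng)
        samples.append((x, a, r))
        total += r
        h = erm()                                # fit on exploration data only
        for _ in range(math.isqrt(l)):           # exploit for ~sqrt(l) steps
            x = draw_context(rng)
            total += reward(x, h(x), rng)
    return erm(), total

# Toy ad world (invented numbers): arm 0 best for "flowers", arm 1 for "shoes".
def reward(x, a, rng):
    best = 0 if x == "flowers" else 1
    return float(rng.random() < (0.4 if a == best else 0.05))

policies = [lambda x: 0, lambda x: 1,
            lambda x: 0 if x == "flowers" else 1]
best, total = epoch_greedy(policies, lambda rng: rng.choice(["flowers", "shoes"]),
                           reward, k=2, epochs=400)
```

The point of the meta-algorithm shape: only the ERM step touches the policy set, so swapping in any efficient (possibly approximate) supervised learner removes EXP4's per-round dependence on *|H|*.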

This *T* dependence is of great concern to people who have worked on bandit problems in the past (where, basically, only the dependence on *T* could be optimized). In many applications, the *S* dependence is more important. However, this does leave an important open question: Is it possible to get the best properties of EXP4 and Epoch-Greedy?

Reasonable people could argue about which setting is more important: *k*-Armed Bandits or Contextual Bandits. I favor Contextual Bandits, even though there has been far more work in the *k*-Armed Bandit setting. There are several reasons:

- I'm having difficulty finding interesting real-world *k*-Armed Bandit settings which aren't better thought of as Contextual Bandits in practice. For myself, bandit algorithms are (at best) motivational, because they cannot be applied to real-world problems without altering them to take context into account.
- Doing things in context is one of the underlying (and very successful) tenets of machine learning. Applying this tenet here seems wise.
- If we want to eventually solve big problems, we must have composable subelements. Composition doesn't work without context, because there is no "input" for an I/O diagram.

Any insights into the open question above or Contextual Bandits in general are of great interest to me.

(*) There are some simple modifications to deal with the second issue but not the first.

(**) You have to read between the lines a little bit to see this in the paper. The ERM-style algorithm in the paper could be replaced with an efficient approximate ERM algorithm which is often possible in practice.