One of the fundamental underpinnings of the internet is advertising based content. This has become much more effective due to targeted advertising where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way.

The problem of matching ads to interests is a natural machine learning problem in some ways since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. This implies we have a fundamental exploration problem.

A standard mathematical setting for this situation is “*k*-Armed Bandits”, often with various relevant embellishments. The *k*-Armed Bandit setting works on a round-by-round basis. On each round:

- A policy chooses arm *a* from 1 of *k* arms (i.e. 1 of k ads).
- The world reveals the reward *r*_{a} of the chosen arm (i.e. whether the ad is clicked on).

As information is accumulated over multiple rounds, a good policy might converge on a good choice of arm (i.e. ad).
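
To make the round-by-round protocol concrete, here is a minimal simulated sketch. The epsilon-greedy rule, the function name `epsilon_greedy`, and the click probabilities are illustrative assumptions, not something the setting prescribes; any policy that balances exploration and exploitation would fit the same loop.

```python
import random


def epsilon_greedy(click_probs, num_rounds, epsilon=0.1, seed=0):
    """Play a simulated k-armed bandit; return how often each arm was chosen.

    click_probs[a] is the hidden click probability of ad a. It is used only
    by the simulated world, never by the policy.
    """
    rng = random.Random(seed)
    k = len(click_probs)
    pulls = [0] * k    # times each arm was chosen
    clicks = [0] * k   # clicks observed for each arm

    for _ in range(num_rounds):
        untried = [a for a in range(k) if pulls[a] == 0]
        if untried:
            a = untried[0]                 # try every arm once first
        elif rng.random() < epsilon:
            a = rng.randrange(k)           # explore: pick an arm at random
        else:
            # exploit: pick the arm with the best observed click rate
            a = max(range(k), key=lambda i: clicks[i] / pulls[i])
        # The world reveals the reward of the chosen arm only.
        r = 1 if rng.random() < click_probs[a] else 0
        pulls[a] += 1
        clicks[a] += r
    return pulls


pulls = epsilon_greedy([0.1, 0.3, 0.8], num_rounds=5000)
print(pulls)  # most pulls should go to the last (best) arm
```

Note that the policy never sees what the unchosen ads would have earned, which is exactly why some exploration is needed.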

This setting (and its variants) fails to capture a critical phenomenon: each of these ads is displayed in the context of a search or other webpage. To model this, we might think of a different setting where on each round:

- The world announces some context information *x* (think of this as a high dimensional bit vector if that helps).
- A policy chooses arm *a* from 1 of *k* arms (i.e. 1 of k ads).
- The world reveals the reward *r*_{a} of the chosen arm (i.e. whether the ad is clicked on).
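
The contextual protocol can be sketched as a small loop. Everything here is a toy assumption (the queries, the two ads, the deterministic clicks, and the names `run_rounds`, `keyword_policy`, `always`); the point is only that the policy sees *x* before choosing and still observes the reward of the chosen arm alone.

```python
def run_rounds(policy, rounds):
    """Play the contextual protocol and return the total reward collected."""
    total = 0
    for x, rewards in rounds:
        a = policy(x)        # the policy sees the context, then picks an arm
        total += rewards[a]  # the world reveals only the chosen arm's reward
    return total


# Arm 0 is a flower ad, arm 1 is a flight ad (hypothetical data).
# Each round is (context x, reward of each arm); the policy never sees
# the reward vector directly.
ROUNDS = [
    ("cheap flowers delivery", [1, 0]),
    ("flights to paris", [0, 1]),
    ("flowers for mom", [1, 0]),
    ("last minute flights", [0, 1]),
]


def keyword_policy(x):
    """Use the context: show the flower ad if the query mentions flowers."""
    return 0 if "flowers" in x else 1


def always(arm):
    """A context-free policy, as in the k-armed bandit setting."""
    return lambda x: arm


print(run_rounds(keyword_policy, ROUNDS))  # 4
print(run_rounds(always(0), ROUNDS))       # 2
print(run_rounds(always(1), ROUNDS))       # 2
```

On this toy data no fixed arm can do better than 2 out of 4, while the context-aware rule gets every round right.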

We can check that this is a critical distinction in 2 ways. First, note that policies using *x* can encode much richer decisions than a policy not using *x*. Just think about: “if a search has the word flowers display a flower advertisement”. Second, we can try to reduce this setting to the *k*-Armed Bandit setting, and note that it cannot be done well. There are two methods that I know of:

- Run a different *k*-Armed Bandit for every value of *x*. The amount of information required to do well scales linearly in the number of contexts. In contrast, good supervised learning algorithms often require information which is (essentially) independent of the number of contexts.
- Take some set of policies and treat every policy *h(x)* as a different arm. This removes an explicit dependence on the number of contexts, but it creates a linear dependence on the number of policies. Via Occam’s razor/VC dimension/Margin bounds, we already know that supervised learning requires experience much smaller than the number of policies.
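
A rough sense of the gap between these quantities, under a purely hypothetical setup: contexts are *d*-bit vectors, and the policy set consists of rules that look at one bit and pick one arm for each value of that bit (so there are *d·k²* policies). The function name `reduction_scales` and the policy class are assumptions made for illustration only.

```python
import math


def reduction_scales(d, k):
    """Quantities that drive the information needed under each approach."""
    num_contexts = 2 ** d        # one bandit per context (first reduction)
    num_policies = d * k * k     # one arm per policy (second reduction)
    return {
        "per_context_bandits": num_contexts,
        "policies_as_arms": num_policies,
        "log_policies": math.log(num_policies),  # Occam-style dependence
    }


print(reduction_scales(d=20, k=2))
```

With 20 bits and 2 arms this gives about a million contexts and 80 policies, while the logarithm of the number of policies is below 5.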

We know these are bad reductions by contrast to direct methods for solving the problem. The first algorithm for solving this problem is EXP4 (page 19 = 66) which has a regret with respect to the best policy in a set of *O(T^{0.5} (ln |H|)^{0.5})* where *T* is the number of rounds and *|H|* is the number of policies. (Dividing by *T* gives error-rate like quantities.) This result is independent of the number of contexts *x* and only weakly dependent (similar to supervised learning) on the number of policies.
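
For readers who have not seen it, here is a minimal sketch of an EXP4-style update for deterministic policies, following the standard textbook presentation rather than any particular implementation. The function signature, the exploration rate `gamma`, and the toy problem are assumptions. Note the loop over all policies on every round, which is the computational cost discussed next.

```python
import math
import random


def exp4(policies, num_arms, draw_context, reward_fn, num_rounds,
         gamma=0.1, seed=0):
    """Exponential weights over policies with importance-weighted rewards."""
    rng = random.Random(seed)
    weights = [1.0] * len(policies)
    total_reward = 0.0
    for _ in range(num_rounds):
        x = draw_context(rng)
        advice = [h(x) for h in policies]   # arm recommended by each policy
        total_w = sum(weights)
        # Mix the policies' weighted votes with uniform exploration.
        probs = [gamma / num_arms] * num_arms
        for w, a in zip(weights, advice):
            probs[a] += (1.0 - gamma) * w / total_w
        arm = rng.choices(range(num_arms), weights=probs)[0]
        r = reward_fn(x, arm)               # only this reward is revealed
        total_reward += r
        r_hat = r / probs[arm]              # importance-weighted estimate
        # Policies that recommended the played arm get credit, others none.
        weights = [
            w * math.exp(gamma * r_hat / num_arms) if a == arm else w
            for w, a in zip(weights, advice)
        ]
        # Rescale to avoid overflow; the sampling distribution is unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
    return weights, total_reward


# Toy problem (hypothetical): one-bit context, reward 1 iff the arm equals it.
toy_policies = [lambda x: 0, lambda x: 1, lambda x: x, lambda x: 1 - x]
weights, total = exp4(
    toy_policies, 2,
    draw_context=lambda rng: rng.randrange(2),
    reward_fn=lambda x, a: 1.0 if a == x else 0.0,
    num_rounds=2000,
)
print(weights.index(max(weights)), total)
```

On the toy problem the weight should concentrate on the third policy, the only one that uses the context correctly.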

EXP4 has a number of drawbacks: it has severe computational requirements and doesn’t work for continuously parameterized policies (*). Tong and I worked out a reasonably simple meta-algorithm Epoch-Greedy which addresses these drawbacks (**), at the cost of sometimes worsening the regret bound to *O(T^{2/3} S^{1/3})* where *S* is related to the representational complexity of supervised learning on the set of policies.
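
The overall shape of Epoch-Greedy can be sketched as follows. This is a simplification under stated assumptions: the policy set is finite and the ERM step is an exhaustive search (in practice it would be a supervised learner), and the number of exploitation steps per epoch is a fixed hypothetical parameter `exploit_steps`, whereas the paper chooses it from the sample complexity.

```python
import random


def epoch_greedy(policies, num_arms, draw_context, reward_fn,
                 num_epochs, exploit_steps=5, seed=0):
    """Alternate one uniform exploration step with a run of exploitation."""
    rng = random.Random(seed)
    # Importance-weighted reward each policy has earned on exploration data.
    est_value = [0.0] * len(policies)
    best = 0
    total_reward = 0.0
    for _ in range(num_epochs):
        # Exploration: choose an arm uniformly at random and record the result.
        x = draw_context(rng)
        arm = rng.randrange(num_arms)
        r = reward_fn(x, arm)
        total_reward += r
        for i, h in enumerate(policies):
            if h(x) == arm:
                est_value[i] += r * num_arms  # r divided by probability 1/k
        # ERM step: the policy with the best estimated value so far.
        best = max(range(len(policies)), key=lambda i: est_value[i])
        # Exploitation: follow that policy for a few rounds.
        for _ in range(exploit_steps):
            x = draw_context(rng)
            total_reward += reward_fn(x, policies[best](x))
    return best, total_reward


# Toy problem (hypothetical): one-bit context, reward 1 iff the arm equals it.
toy_policies = [lambda x: 0, lambda x: 1, lambda x: x, lambda x: 1 - x]
best, total = epoch_greedy(
    toy_policies, 2,
    draw_context=lambda rng: rng.randrange(2),
    reward_fn=lambda x, a: 1.0 if a == x else 0.0,
    num_epochs=200,
)
print(best, total)
```

Only the exploration samples are used for learning, so the learner behind the ERM step sees unbiased data regardless of which policy was being exploited.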

This *T* dependence is of great concern to people who have worked on bandit problems in the past (where, basically, only the dependence on *T* could be optimized). In many applications, the *S* dependence is more important. However, this does leave an important open question: Is it possible to get the best properties of EXP4 and Epoch-Greedy?
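
To compare the two bounds as error-rate like quantities, divide each by *T*: the EXP4 bound becomes *(ln |H| / T)^{0.5}* and the Epoch-Greedy bound becomes *(S / T)^{1/3}*. The sketch below evaluates both with all constants dropped; setting *S* equal to *ln |H|* is an assumption made purely so the two numbers can be put side by side.

```python
import math


def per_round_regret_exp4(T, num_policies):
    """T^0.5 (ln|H|)^0.5 divided by T, constants ignored."""
    return math.sqrt(math.log(num_policies) / T)


def per_round_regret_epoch_greedy(T, S):
    """T^(2/3) S^(1/3) divided by T, constants ignored."""
    return (S / T) ** (1.0 / 3.0)


T = 10 ** 6
H = 80
S = math.log(H)  # illustrative assumption only
print(per_round_regret_exp4(T, H))
print(per_round_regret_epoch_greedy(T, S))
```

Whenever *S / T* is below 1, the cube root is the larger of the two, which is the worsening in *T* dependence mentioned above.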

Reasonable people could argue about which setting is more important: *k*-Armed Bandits or Contextual Bandits. I favor Contextual Bandits, even though there has been far more work in the *k*-Armed Bandit setting. There are several reasons:

- I’m having difficulty finding interesting real-world *k*-Armed Bandit settings which aren’t better thought of as Contextual Bandits in practice. For myself, bandit algorithms are (at best) motivational because they can not be applied to real-world problems without altering them to take context into account.
- Doing things in context is one of the underlying (and very successful) tenets of machine learning. Applying this tenet here seems wise.
- If we want to eventually solve big problems, we must have composable subelements. Composition doesn’t work without context, because there is no “input” for an I/O diagram.

Any insights into the open question above or Contextual Bandits in general are of great interest to me.

(*) There are some simple modifications to deal with the second issue but not the first.

(**) You have to read between the lines a little bit to see this in the paper. The ERM-style algorithm in the paper could be replaced with an efficient approximate ERM algorithm which is often possible in practice.

Are you assuming that all ads that are being considered for a single context have the same bid value and have been chosen by the ad-auction?

Not really, but you may think so if that’s convenient. These complexities don’t appear essential.

An analysis of the parametric setting (the uncertainty is parametric, i.e., the distributions of the rewards have a known parametric form and just the parameters are unknown) has appeared in the following paper that you might find interesting:

Chih-Chun Wang, Sanjeev R. Kulkarni and H. Vincent Poor: Bandit Problems with Side Observations, IEEE TAC, Vol. 50, 2005.

There are some interesting cases when the interaction between the arms is non-trivial. I am not sure if the ideas in the paper would generalize to the non-parametric case!

With rich contextual information my first instinct is to restructure the presentation strategy to get back into a supervised learning framework. For example, in the ad problem, I can present randomly shuffled top K (K>1) results, collect 0/1 feedback, and control for the presentation rank by introducing a nuisance parameter for it. For example, this seems to be the approach taken for search engine ranking by Radlinski and Joachims in http://www.cs.cornell.edu/People/tj/publications/radlinski_joachims_06a.pdf

Does this approach have any other drawbacks than the cost of altering the presentation strategy (which, to be fair, might be considerable and/or hard to measure with certainty)?

I think the closest that you will get to the k-armed bandit with Internet advertising will be when trying to “intelligently” place display ads on a page which you have very little contextual information about. Here at DoubleClick we collect a few pieces of information about the page, and general area within the website, some geo-location and time of day/day of week information…and little else (legally we can’t).

The above does give contextual info, but it’s very little. Furthermore, unlike a search engine, we don’t display the top 10 matching ads which give some room for error, we display 1 ad per slot on the page constrained by size and serving limitations.

Perhaps this is as close to a real k-armed bandit as you’ll get with a practical application.

Let me first admit I have little knowledge of state-of-the-art literature in reinforcement learning, so be warned that I might say fairly obvious things.

To give an idea of my ignorance, I don’t even know what is the big deal about contextual information (theoretically speaking).

I’ve always seen RL as a particular task of causal inference: there is an effect you want to achieve (RL word: reward), there are different causes (those that you can control are the bandit arms), and other covariates (I guess that’s what you call context). Standard reinforcement learning (afaik) is active learning of fairly simple causal models with very little information: exploration-exploitation is just another name for how to spread variance of your estimates of causal effects by choosing which interventions to perform (i.e., which arms to pull) under a limited budget for interventions.

One way to deal with a problem with a high dimensionality is to decompose it into direct and indirect effects. Suppose today being Valentine’s day is the cause of me wanting to buy flowers and book a dinner, and my desire to buy flowers pushes me to type the query “flowers”. Having a death in the family somewhere in the world is also a cause of me wanting (and then querying for) flowers on-line, but knowing that today is Valentine’s day explains away that. Hence, I might want the policy of displaying an ad for a fine restaurant, instead of one for cheap flights. The problem is how to learn these causal relations without an insanely large combination of interventions.

One trick is to tie all variables in a global causal model with some assumptions on how to link their data under a few interventional regimes in a way that predicts outcomes of *unseen* (or barely seen) interventions. There is some painfully slow (but hopefully steady) progress in this area. This might work in some scientific problems such as achieving some desirable expression level for some gene by hunting around for combinations of interventions on other genes (bandit arms to pull). The contextual information would be whatever genes/molecules are measured in the cell. One example is the following paper

1. Sachs K, Perez O, Pe’er D, Lauffenburger DA, Nolan GP. Causal protein-signaling networks derived from multiparameter single-cell data. Science. 2005 Apr 22;308(5721):523-9.

which doesn’t really deal with the causal active learning (a.k.a. reinforcement learning) problem, but at least shows which kind of models one can use to combine data from different policies to predict the effect (a.k.a reward) of never-seen-before policies.

For a problem as massive as understanding the goals of a user, this strategy is overkill. In practice, some massive dimensionality reduction procedure might create some “template causes” that would account for the observed queries. I don’t think it is easy at all to adapt such techniques for scientific discovery into search engine problems. By no means am I asserting that what I said is useful in this problem. But at least this is some food for thought.

When you compare contextual bandit to RL in general, contextual bandit is a special case of the most general RL formulations (as essentially everything is). When you compare it to RL as commonly practiced, you trade off caring about a (discrete or effective) time horizon with caring about generalization.

The recent preprint “Performance limitations in bandit problems with side observations.” by A. Zeevi and myself derives some lower bounds for the setup discussed in Wang, Kulkarni and Poor. The relationship between distribution of the side covariates and performance is indeed non-trivial. Remarks and comments are welcome.
