This post is about contextual bandit problems where, repeatedly:

- The world chooses features
*x*and rewards for each action*r*then announces the features_{1},…,r_{k}*x*(but not the rewards). - A policy chooses an action
*a*. - The world announces the reward
*r*_{a}

The goal in these situations is to learn a policy which maximizes *r _{a}* in expectation efficiently. I’m thinking about all situations which fit the above setting, whether they are drawn IID or adversarially from round to round and whether they involve past logged data or rapidly learning via interaction.

One common drawback of all algorithms for solving this setting, is that they have a poor dependence on the number of actions. For example if *k* is the number of actions, EXP4 (page 66) has a dependence on *k ^{0.5}*, epoch-greedy (and the simpler epsilon greedy) have a dependence on

*k*, and the offset tree has a dependence on

^{1/3}*k-1*. These results aren’t directly comparable because different things are being analyzed. The fact that

*all*analyses have poor dependence on

*k*is troublesome. The lower bounds in the EXP4 paper and the Offset Tree paper demonstrate that this isn’t a matter of lazy proof writing or a poor choice of algorithms: it’s essential to the nature of the problem.

In supervised learning, it’s typical to get no dependence or very weak dependence on the number of actions/choices/labels. For example, if we do empirical risk minimization over a finite hypothesis space *H*, the dependence is at most *ln |H|* using an Occam’s Razor bound. Similarly, the PECOC algorithm (page 12) has dependence bounded by a constant. This kind of dependence is great for the feasibility of machine learning: it means that we can hope to tackle seemingly difficult problems.

Why is there such a large contrast between these settings? At the level of this discussion, they differ only in step 3, where for supervised learning, all of the rewards are revealed instead of just one.

One of the intuitions you develop after working with supervised learning is that holistic information is often better. As an example, given a choice between labeling the same point multiple times (perhaps revealing and correcting noise) or labeling other points once, an algorithm with labels other points typically exists and typically yields as good or better performance in theory and in practice. This appears untrue when we have only partial observations.

For example, consider the following problem(*): “Find an action with average reward greater than 0.5 with probability at least 0.99” and consider two algorithms:

- Sample actions at random until we can prove (via Hoeffding bounds) that one of them has large reward.
- Pick an action at random, sample it 100 times, and if we can prove (via a Hoeffding bound) that it has large average reward return it, otherwise pick another action randomly and repeat.

When there are *10 ^{10}* actions and

*10*of them have average reward 0.6, it’s easy to prove that algorithm 2 is much better than algorithm 1.

^{9}Lower bounds for the partial observation settings imply that more tractable algorithms only exist under additional assumptions. Two papers which do this without context features are:

- Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandit problems in metric spaces, STOC 2008. Here the idea is that you have access to a covering oracle on the actions where actions with similar average rewards cover each other.
- Deepak Agarwal, , and Deepayan Chakrabati, Multi-armed Bandit Problems with Dependent Arms, ICML 2007. Here the idea is that the values of actions are generated recursively, preserving structure through the recursion.

**Basic questions**: Are there other kinds of natural structure which allows a good dependence on the total number of actions? Can these kinds of structures be extended to the setting with features? (Which seems essential for real applications.)

(*) Developed in discussion with Yisong Yue and Bobby Kleinberg.

Very interesting post, John. On the subject of problem (*), and the general idea of obtaining stronger regret bounds when good actions are plentiful, it may be worth mentioning a paper I wrote for SODA 2006, “Anytime algorithms for multi-armed bandit problems.” That paper considered the objective of designing an algorithm with regret at most δ relative to at least one of the actions in the top ε percentile. If T is the smallest time at which this bound is guaranteed to hold, we call T the (ε,δ)-convergence time of the algorithm. An anytime bandit algorithm is one whose (ε,δ)-convergence time is finite, for all positive ε and δ.

The paper assumes no structure on the action space, apart from a probability measure, which is required in order to talk the “top ε percentile of the actions.” The algorithms designed in the paper are similar in spirit to your algorithm (2) above: randomly sample a small number of actions and try to do nearly as well as the best action in the sample; enlarge the sample over time so that your average payoff can eventually converge to the supremum of all the actions’ payoffs. Using this simple idea, and tinkering with the question of how rapidly to enlarge the sample over time, you can get (1) algorithms whose convergence time is polynomial in 1/ε and 1/δ, and (2) algorithms whose convergence time is quasilinear in 1/ε. The paper also contains a theorem proving that no algorithm can achieve (1) and (2) simultaneously.

I really like this post and I think it’s a valuable collection of ideas and research for looking at the impact of large action sets (one of biggest practical constraints). I’ve found that one key source of regret in these systems is when new actions are added to the set. I recognize that the novelty of the new action can be introduced into it’s features (side information) but often it’s hard to compare a novel action to older ones because they can live in very different parts of the feature space. For example, suppose side information includes the number of times this action has been taken; or, more subtly, when side information contains statistics with confidence intervals that depend on sample size, etc.

Are there conditions on the reward process assumed in this framework that bound the regret from any new action not being picked up fast enough? Is there research that grapples with this issue?