This tutorial is about learning through exploration. The
goal is to learn how to make decisions when the payoff of only a chosen
action is observed rather than all choices. The setting we address
is simpler than general reinforcement learning, because we consider
situations where future rewards are not affected by past decisions,
although the algorithms we discuss do have applications in this more
general setting.

A canonical example of this problem is the ad display problem, where
based on a query and user history, an ad is chosen, and then the ad
is either clicked on or not. In general, feedback for one ad provides
no information about the quality of a different ad. This problem is
much more general, of course, because it is really about the process
of learning from interactive feedback where there is no "reset"
button on history. We will cover algorithms that can learn from logged
data where exploration decisions have already been made.
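As a concrete illustration of learning from logged exploration data, a standard approach is the inverse-propensity-score (IPS) estimator, which reweights logged rewards by the probability with which the logging policy chose each action. The function name and toy data below are ours, not from the tutorial; this is a minimal sketch, not the specific method covered:

```python
def ips_value(logged, policy):
    """Inverse-propensity-score estimate of a policy's expected reward.

    logged: list of (context, action, reward, prob) tuples, where prob is
    the probability with which the logging policy chose `action`.
    policy: a function from context to action, to be evaluated offline.
    """
    total = 0.0
    for context, action, reward, prob in logged:
        # Only events where the new policy agrees with the logged action
        # contribute; dividing by prob corrects for the logging bias.
        if policy(context) == action:
            total += reward / prob
    return total / len(logged)

# Toy log: two ads, logging policy chose uniformly at random (prob 0.5).
logged = [
    ("sports query", 0, 1.0, 0.5),   # ad 0 shown, clicked
    ("sports query", 1, 0.0, 0.5),   # ad 1 shown, not clicked
    ("news query", 1, 1.0, 0.5),
    ("news query", 0, 0.0, 0.5),
]

# A deterministic policy to evaluate: ad 0 on sports queries, ad 1 otherwise.
policy = lambda ctx: 0 if "sports" in ctx else 1
print(ips_value(logged, policy))  # → 1.0 (estimated click rate)
```

The key point is that the estimate is unbiased even though each logged event reveals the reward of only the one action that was actually shown.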
This tutorial draws on material from several different communities,
so we expect it to be of broad interest, as this material has not
previously been brought together coherently. The particular communities are:
- Reinforcement learning. The critical ideas of a policy and policy
evaluation come from reinforcement learning. Relative to MDP-style
reinforcement learning, which has been extensively studied, we address
an incomparable problem: it is simplified by removing time dependence,
but made more complex by the need to generalize.
- Bandit online learning. The critical idea of optimally balancing exploration
and exploitation comes from bandit online learning. Traditional bandit
algorithms, however, cannot efficiently compete with a large set
of policies and do not learn from offline data.
- Supervised learning. Supervised learning is all about generalization,
typically in an offline setting. For people coming from supervised
learning, this tutorial provides an introduction to exploration in
a manner that extends their understanding of supervised learning
to new classes of problems.
- Active learning. Active learning deals with generalization and exploration
in a related setting.
Slides: to be posted before the tutorial.
We plan to cover the following topics:
- A mathematical definition of the setting.
- Offline policy evaluation of both static and dynamic policies.
- Using exploration information in offline learning.
- Optimally efficient online learning with exploration. We mean this
in both a computational and a sample-complexity sense.
- Learning from nonrandom exploration.
- Special cases where improvements are possible. We plan to cover various
special cases as time permits, including the special case of linear
reward structure, and zooming dimensions with a nearest-neighbor
similarity structure.
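The explore/exploit tradeoff at the heart of online learning with exploration can be illustrated with a minimal epsilon-greedy loop: explore uniformly at random a small fraction of the time, and otherwise exploit the empirically best action for the current context. This is a simplified sketch of the tradeoff (not the epoch-greedy algorithm itself); all names and the toy reward function are ours:

```python
import random

def epsilon_greedy(contexts, num_actions, reward_fn, epsilon=0.1, seed=0):
    """Minimal epsilon-greedy contextual bandit loop.

    With probability epsilon, explore a uniformly random action; otherwise
    exploit the action with the best empirical mean reward for this context.
    Only the chosen action's reward is ever observed.
    """
    rng = random.Random(seed)
    stats = {}  # (context, action) -> (total reward, count)
    total_reward = 0.0
    for ctx in contexts:
        if rng.random() < epsilon:
            action = rng.randrange(num_actions)       # explore
        else:
            def mean(a):
                s, n = stats.get((ctx, a), (0.0, 0))
                return s / n if n else 0.0
            action = max(range(num_actions), key=mean)  # exploit
        r = reward_fn(ctx, action)   # feedback only for the chosen action
        s, n = stats.get((ctx, action), (0.0, 0))
        stats[(ctx, action)] = (s + r, n + 1)
        total_reward += r
    return total_reward

# Toy problem: the action matching the context always pays 1, else 0.
contexts = [0, 1] * 500
payoff = lambda ctx, a: 1.0 if a == ctx else 0.0
print(epsilon_greedy(contexts, num_actions=2, reward_fn=payoff))
```

With 1000 rounds the learner quickly discovers the right action per context and accumulates close to the maximum reward, at the cost of a small, controlled amount of exploration.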
- Alina Beygelzimer and John Langford. The Offset Tree
for learning with partial labels, KDD 2009.
- Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and
Robert E. Schapire. The non-stochastic multi-armed bandit problem,
SIAM Journal on Computing, 32(1):48-77, 2002.
- Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial
bandits, COLT 2009.
- Michael Kearns, Yishay Mansour and Andrew Y. Ng. Approximate
planning in large POMDPs via reusable trajectories, NIPS 2000.
- John Langford and Tong Zhang.
The epoch-greedy algorithm
for contextual multi-armed bandits, NIPS 2007.
- John Langford, Alexander Strehl, and Jennifer Wortman. Exploration scavenging, ICML 2008.
- Lihong Li, Wei Chu, John Langford, and Robert E. Schapire.
A contextual bandit approach to personalized news article recommendation,
WWW 2010.
- Alex Slivkins.
Contextual bandits with similarity information, 2009.
Alina Beygelzimer is a research
staff member at IBM Research, working on machine learning and learning theory.
John Langford is a senior research scientist at Yahoo! Research.
His work includes research in machine learning, learning theory,
algorithms, game theory, steganography, and Captchas.
- ICML 2010
- Where: Alon, Haifa Conference Center, Haifa, Israel
- When: 1pm, Monday, June 21, 2010
- KDD 2010
- Where: Potomac 6, KDD venue
- When: 9am, Sunday, July 25, 2010