ICML and KDD 2010 Tutorial on Learning Through Exploration

Alina Beygelzimer and John Langford

Tutorial Description

This tutorial is about learning through exploration. The goal is to learn how to make good decisions when only the payoff of the chosen action is observed, rather than the payoffs of all possible choices. The setting we address is simpler than general reinforcement learning, because we consider situations where future rewards are not affected by past decisions, although the algorithms we discuss do have applications in that more general setting.
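
Concretely, one standard formalization of this setting is the contextual bandit problem (a sketch, stated here for orientation; the tutorial's exact definition may differ in details). On each round t = 1, ..., T:

  1. The world produces a context x_t (for example, a query and user history) together with a reward vector r_t in [0,1]^K, and reveals only x_t.
  2. The learner chooses an action a_t from {1, ..., K}.
  3. The learner observes only the reward r_t(a_t) of the chosen action.

The goal is to compete with the best policy in a class \Pi of policies mapping contexts to actions, i.e., to keep the regret

    \mathrm{Regret}_T \;=\; \max_{\pi \in \Pi} \sum_{t=1}^{T} r_t(\pi(x_t)) \;-\; \sum_{t=1}^{T} r_t(a_t)

small.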

A canonical example is the ad display problem: based on a query and user history, an ad is chosen, and the ad is then either clicked on or not. In general, feedback for one ad provides no information about the quality of a different ad. The problem is much more general than ad display, of course, because it is really about learning from interactive feedback where there is no "reset" button on history. We will cover algorithms that can learn from logged data where exploration decisions have already been made.
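
To preview how such logged data can be reused, here is a minimal sketch of the standard importance-weighting estimator for offline policy evaluation. It is illustrative only; the function name and log format are our assumptions, not the tutorial's code. Each logged record stores the context, the chosen action, the observed reward, and the probability with which the exploration policy chose that action:

    # Minimal sketch of importance-weighted (inverse propensity score)
    # offline policy evaluation. Illustrative; names and the log format
    # are assumptions, not part of the tutorial materials.
    def evaluate_policy(policy, logged_data):
        """Estimate the average reward `policy` would have obtained.

        Each record is (context, action, reward, prob), where `prob` is
        the probability with which the logging policy chose `action`.
        The estimate is unbiased when `prob` is recorded correctly and
        every action has a positive probability of being explored.
        """
        total = 0.0
        for context, action, reward, prob in logged_data:
            if policy(context) == action:
                # Reweight matching records by 1/prob so they stand in
                # for the rounds where other actions were logged.
                total += reward / prob
        return total / len(logged_data)

    # Example: logs gathered with uniform exploration over 2 ads (prob = 0.5).
    logs = [("query A", 0, 1.0, 0.5), ("query A", 1, 0.0, 0.5),
            ("query B", 0, 0.0, 0.5), ("query B", 1, 1.0, 0.5)]
    new_policy = lambda x: 0 if x == "query A" else 1
    print(evaluate_policy(new_policy, logs))  # estimates ~1.0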

This tutorial draws on material from several different communities, so we expect it to be of general interest, as this material has never before been brought together coherently. The particular communities are:

  1. Reinforcement learning. The critical ideas of a policy and of policy evaluation come from reinforcement learning. Relative to extensively studied MDP-style reinforcement learning, we address an incomparable problem: it is simplified by removing time dependence, but made more complex by the need for generalization.

  2. Bandit online learning. The critical idea of optimally balancing exploration and exploitation comes from bandit online learning. Traditional bandit algorithms, however, cannot efficiently compete with a large set of policies and do not learn from offline data.

  3. Supervised learning. Supervised learning is all about generalization, typically in an offline setting. For people coming from supervised learning, this tutorial provides an introduction to exploration that extends their understanding to new classes of problems.

  4. Active learning. Active learning deals with generalization and exploration in a closely related setting, where the learner chooses which examples to have labeled.

Slides

To be posted before the tutorial. We plan to cover the following topics:
  1. A mathematical definition of the setting.
  2. Offline policy evaluation of both static [4] and dynamic [7] policies.
  3. Using exploration information in offline learning [1].
  4. Optimally efficient online learning with exploration. We mean this in both a computational [5] and a sample complexity [2] sense; a simple exploration sketch follows this list.
  5. Learning from nonrandom exploration [6].
  6. Special cases where improvements are possible. As time permits, we plan to cover the special case of linear reward structure [3] and zooming dimensions with nearest-neighbor structure [8].
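
As a taste of item 4, here is a minimal epsilon-greedy sketch of exploration. This is a simplification for intuition only: epoch-greedy [5] organizes exploration into epochs, and EXP4 [2] maintains weights over policies; neither is reproduced here, and all names below are illustrative.

    import random

    def epsilon_greedy_round(context, value_estimate, num_actions, epsilon=0.1):
        """Choose an action and return it with the probability it was
        chosen, so the resulting log supports offline evaluation."""
        # Exploit: the action our current reward estimates consider best.
        greedy = max(range(num_actions), key=lambda a: value_estimate(context, a))
        if random.random() < epsilon:
            action = random.randrange(num_actions)  # explore uniformly
        else:
            action = greedy
        # Every action is chosen with probability at least epsilon / num_actions.
        prob = epsilon / num_actions + (1.0 - epsilon) * (action == greedy)
        return action, prob

Recording prob alongside the chosen action and its observed reward is exactly what makes the resulting logs usable by the offline evaluation and learning methods of items 2 and 3.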

References

[1] Alina Beygelzimer and John Langford. The Offset Tree for learning with partial labels. KDD 2009.
[2] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.
[3] Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. COLT 2009.
[4] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. Approximate planning in large POMDPs via reusable trajectories. NIPS 2000.
[5] John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. NIPS 2007.
[6] John Langford, Alexander Strehl, and Jennifer Wortman. Exploration scavenging. ICML 2008.
[7] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. WWW 2010.
[8] Aleksandrs Slivkins. Contextual bandits with similarity information. 2009.

Presenters

Alina Beygelzimer is a research staff member at IBM Research, working on machine learning, learning theory, and algorithms.

John Langford is a senior research scientist at Yahoo! Research. His work includes research in machine learning, learning theory, algorithms, game theory, steganography, and CAPTCHAs.

Logistics