Claude Sammut is attempting to put together an Encyclopedia of Machine Learning. I volunteered to write one article on Efficient RL in MDPs, which I would like to invite comment on. Is something critical missing?

## 11/26/2008

## 3/23/2008

### Interactive Machine Learning

A new direction of research seems to be arising in machine learning: Interactive Machine Learning. This isn’t a familiar term, although it does include some familiar subjects.

**What is Interactive Machine Learning?** The fundamental requirement is (a) learning algorithms which interact with the world and (b) learn.

For our purposes, let’s define learning as efficiently competing with a large set of possible predictors. Examples include:

- Online learning against an adversary (Avrim’s Notes). The interaction is almost trivial: the learning algorithm makes a prediction and then receives feedback. The learning is choosing based upon the advice of many experts.
**Active Learning**. In active learning, the interaction is choosing which examples to label, and the learning is choosing from amongst a large set of hypotheses.**Contextual Bandits**. The interaction is choosing one of several actions and learning only the value of the chosen action (weaker than active learning feedback).

More forms of interaction will doubtless be noted and tackled as time progresses. I created a webpage for my own research on interactive learning which helps define the above subjects a bit more.

**What isn’t Interactive Machine Learning?**

There are several learning settings which fail either the interaction or the learning test.

- Supervised Learning doesn’t fit. The basic paradigm in supervised learning is that you ask experts to label examples, and then you learn a predictor based upon the predictions of these experts. This approach has essentially no interaction.
- Semisupervised Learning doesn’t fit. Semisupervised learning is almost the same as supervised learning, except that you also throw in many unlabeled examples.
- Bandit algorithms don’t fit. They have the interaction, but not much learning happens because the sample complexity results only allow you to choose from amongst a small set of strategies. (One exception is EXP4 (page 66), which can operate in the contextual bandit setting.)
- MDP learning doesn’t fit. The interaction is there, but the set of policies learned over is still too limited—essentially the policies just memorize what to do in each state.
- Reinforcement learning may or may not fit, depending on whether you think of it as MDP learning or in a much broader sense.

All of these not-quite-interactive-learning topics are of course very useful background information for interactive machine learning.

**Why now?** Because it’s time, of course.

- We know from other fields and various examples that interaction is very powerful.
- From online learning against an adversary, we know that independence of samples is unnecessary in an interactive setting—in fact you can even function against an adversary.
- From active learning, we know that interaction sometimes allows us to use exponentially fewer labeled samples than in supervised learning.
- From context bandits, we gain the ability to learn in settings where traditional supervised learning just doesn’t apply.
- From complexity theory we have “IP=PSPACE” roughly: interactive proofs are as powerful as polynomial space algorithms, which is a strong statement about the power of interaction.

- We know that this analysis is often tractable. For example, since Sanjoy‘s post on Active Learning, much progress has been made. Several other variations of interactive settings have been proposed and analyzed. The older online learning against an adversary work is essentially completely worked out for the simpler cases (except for computational issues).
- Real world problems are driving it. One of my favorite problems at the moment is the ad display problem—How do you learn which ad is most likely to be of interest? The contextual bandit problem is a big piece of this problem.
- It’s more fun. Interactive learning is essentially a wide-open area of research. There are plenty of kinds of natural interaction which haven’t been formalized or analyzed. This is great for beginnners, because it means the problems are simple, and their solution does not require huge prerequisites.
- It’s a step closer to AI. Many people doing machine learning want to reach AI, and it seems clear that any AI must engage in interactive learning. Mastering this problem is a next step.

**Basic Questions**

- For natural interaction form [insert yours here], how do you learn? Some of the techniques for other methods of interactive learning may be helpful.
- How do we blend interactive and noninteractive learning? In many applications, there is already a pool of supervised examples around.
- Are there general methods for reducing interactive learning problems to supervised learning problems (which we know better)?

## 12/10/2007

### Learning Track of International Planning Competition

The International Planning Competition (IPC) is a biennial event organized in the context of the International Conference on Automated Planning and Scheduling (ICAPS). This year, for the first time, there will a learning track of the competition. For more information you can go to the competition web-site.

The competitions are typically organized around a number of planning domains that can vary from year to year, where a planning domain is simply a class of problems that share a common action schema—e.g. Blocksworld is a well-known planning domain that contains a problem instance each possible initial tower configuration and goal configuration. Some other domains have included Logistics, Airport, Freecell, PipesWorld, and many others. For each domain the competition includes a number of problems (say 40-50) and the planners are run on each problem with a time limit for each problem (around 30 minutes). The problems are hard enough that many problems are not solved within the time limit.

Given that the planners are asked to solve many problems from individual domains, and that problems within a domain generally have common solution structures, it makes sense to consider learning from previous problem-solving experience in a domain to better solve future problems in the same domain. Here “better solve” could mean either solve the problems significantly more quickly or finding better quality plans in a similar time frame. However, no planner in any of the competitions has included a learning component. Rather, to quote Foreigner, for these planners each problem “feels like the first time”.

Perhaps one reason that planners have not incorporated learning into the competition setting is that there has not been much overlap between the ICML and ICAPS communities, although that is changing. Another reason is perhaps that the structure of the competition would deduct any “learning time” from a planner’s 30mins per problem, which could reduce the potential benefits.

The learning track for the 2008 competition is being designed so that learning time is not counted against planners. Rather, there will be a distinct learning phase and a distinct evaluation phase. During the learning phase the planners will be provided with the set of domains to be used in the competition and example problems from each. The evaluation phase will be conducted like the current competition, with the exception that the learning-based planners will be allowed to read in a learned domain-specific “knowledge file” when solving the problems. This structure is designed to help answer the following question:

*Do we have techniques that can leverage a learning period to outperform state-of-the-art non-learning techniques across a wide range of domains? *

My current belief is that the answer is “no”. I certainly have never seen any such demonstration. This is not because of lack of work in the area of “learning to plan” as there has been a long history dating back to some of the early planners (see my horribly outdated resource page for a taste). While many of the learning approaches have shown some degree of success, the evaluations have typically been very narrow, focusing on only 2 to 3 domains and often only a few problems. My intuition, grounded in personal experience, is that most (all) of these systems would be quite brittle when taken to new domains. The hope is that the learning track of the competition will force us to take the issue of robustness seriously and soon lead to learning systems that convincingly outperform non-learning planners across a wide range of domains given proper training experience.

I hope to see a wide range of approaches entered into the competition. I’ll mention two styles of approaches that might be particular interesting to readers of this blog.

First, one might consider applying reinforcement learning to learn “generalized policies” that can be applied to any problem from a domain. Recall that here the domain model is provided to us, so applying RL would mean that the domain model is used as a sort of simulator in which an RL algorithm is run. RL is particularly difficult in these domains due to the challenges in developing an appropriate representation for learning value functions and/or policies—the states can be viewed as sets of ground relational atoms, rather than the more typical n-dimensional vectors common in RL. Another challenge is the extremely sparse reward, which is obtained only at goal states. There has been some work on applying RL to IPC-style domains (e.g. relational reinforcement learning, approximate policy iteration, policy gradient) but much improvement is needed to compete with non-learning planners.

Second, one might consider structured-classification techniques for this problem. Here one might view the planning problem as an input X and the plan as the structured output Y. Training data can be generated by solving example planning problems using state-of-the-art planners perhaps using significant resources. This approach has been studied under the name max-margin planning, but applied to a very different class of planning problems. Other work has considered applying the Learning as Search Optimization (LaSO) framework to IPC-style domains with some success. Some of the challenges here are to automatically produce an appropriate feature set given a planning domain and ambiguity in the training data. Ambiguity here refers to the fact that there are often a huge number of equally good plans for a given problem and the training data has only one or a small number of them, making the training data incomplete.

## 10/24/2007

### Contextual Bandits

One of the fundamental underpinnings of the internet is advertising based content. This has become much more effective due to targeted advertising where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way.

The problem of matching ads to interests is a natural machine learning problem in some ways since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. This implies we have a fundamental exploration problem.

A standard mathematical setting for this situation is “*k*-Armed Bandits”, often with various relevant embellishments. The *k*-Armed Bandit setting works on a round-by-round basis. On each round:

- A policy chooses arm
*a*from*1*of*k*arms (i.e. 1 of k ads). - The world reveals the reward
*r*of the chosen arm (i.e. whether the ad is clicked on)._{a}

As information is accumulated over multiple rounds, a good policy might converge on a good choice of arm (i.e. ad).

This setting (and its variants) fails to capture a critical phenomenon: each of these displayed ads are done in the context of a search or other webpage. To model this, we might think of a different setting where on each round:

- The world announces some context information
*x*(think of this as a high dimensional bit vector if that helps). - A policy chooses arm
*a*from*1*of*k*arms (i.e. 1 of k ads). - The world reveals the reward
*r*of the chosen arm (i.e. whether the ad is clicked on)._{a}

We can check that this is a critical distinction in 2 ways. First, note that policies using *x* can encode much more rich decisions than a policy not using *x*. Just think about: “if a search has the word flowers display a flower advertisement”. Second, we can try to reduce this setting to the *k*-Armed Bandit setting, and note that it can not be done well. There are two methods that I know of:

- Run a different
*k*-Armed Bandit for every value of*x*. The amount of information required to do well scales linearly in the number of contexts. In contrast, good supervised learning algorithms often require information which is (essentially) independent of the number of contexts. - Take some set of policies and treat every policy
*h(x)*as a different arm. This removes an explicit dependence on the number of contexts, but it creates a linear dependence on the number of policies. Via Occam’s razor/VC dimension/Margin bounds, we already know that supervised learning requires experience much smaller than the number of policies.

We know these are bad reductions by contrast to direct methods for solving the problem. The first algorithm for solving this problem is EXP4 (page 19 = 66) which has a regret with respect to the best policy in a set of *O( T ^{0.5} (ln |H|)^{0.5})* where

*T*is the number of rounds and

*|H|*is the number of policies. (Dividing by

*T*gives error-rate like quantities.) This result is independent of the number of contexts

*x*and only weakly dependent (similar to supervised learning) on the number of policies.

EXP4 has a number of drawbacks—it has severe computational requirements and doesn’t work for continuously parameterized policies (*). Tong and I worked out a reasonably simple meta-algorithm Epoch-Greedy which addresses these drawbacks (**), at the cost of sometimes worsening the regret bound to *O(T ^{2/3}S^{1/3})* where

*S*is related to the representational complexity of supervised learning on the set of policies.

This *T* dependence is of great concern to people who have worked on bandit problems in the past (where, basically, only the dependence on *T* could be optimized). In many applications, the *S* dependence is more important. However, this does leave an important open question: Is it possible to get the best properties of EXP4 and Epoch-Greedy?

Reasonable people could argue about which setting is more important: *k*-Armed Bandits or Contextual Bandits. I favor Contextual Bandits, even though there has been far more work in the *k*-Armed Bandit setting. There are several reasons:

- I’m having difficulty finding interesting real-world
*k*-Armed Bandit settings which aren’t better thought of as Contextual Bandits in practice. For myself, bandit algorithms are (at best) motivational because they can not be applied to real-world problems without altering them to take context into account. - Doing things in context is one of the underlying (and very successful) tenets of machine learning. Applying this tenet here seems wise.
- If we want to eventually solve big problems, we must have composable subelements. Composition doesn’t work without context, because there is no “input” for an I/O diagram.

Any insights into the open question above or Contextual Bandits in general are of great interest to me.

(*) There are some simple modifications to deal with the second issue but not the first.

(**) You have to read between the lines a little bit to see this in the paper. The ERM-style algorithm in the paper could be replaced with an efficient approximate ERM algorithm which is often possible in practice.

## 10/19/2007

### Second Annual Reinforcement Learning Competition

The Second Annual Reinforcement Learning Competition is about to get started. The aim of the competition is to facilitate direct comparisons between various learning methods on important and realistic domains. This year’s event will feature well-known benchmark domains as well as more challenging problems of real-world complexity, such as helicopter control and robot soccer keepaway.

The competition begins on November 1st, 2007 when training software is released. Results must be submitted by July 1st, 2008. The competition will culminate in an event at ICML-08 in Helsinki, Finland, at which the winners will be announced.

For more information, visit the competition website.