Benchmarks for RL

A couple years ago, Drew Bagnell and I started the RLBench project to setup a suite of reinforcement learning benchmark problems. We haven’t been able to touch it (due to lack of time) for a year so the project is on hold. Luckily, there are several other projects such as CLSquare and RL-Glue with a similar goal, and we strongly endorse their continued development.

I would like to explain why, especially in the context of criticism of other learning benchmarks. For example, sometimes the UCI Machine Learning Repository is criticized. There are two criticisms I know of:

  1. Learning algorithms have overfit to the problems in the repository. It is easy to imagine a mechanism for this happening unintentionally. Strong evidence of this would be provided by learning algorithms which perform great on the UCI machine learning repository but very badly (relative to other learning algorithms) on non-UCI learning problems. I have seen little evidence of this but it remains a point of concern. There is a natural order to how much we trust performance results:
    1. Competitions. A well run prediction competition provides some of the best evidence available about which learning algorithms work well. There are still some mechanisms for being misled (too many entries, not enough testing data, unlucky choice of learning problem that your algorithm just has the wrong bias for), but they are more controlled than anywhere else.
    2. Benchmarks. Benchmarks provide a “reasonable sanity check” that the algorithm will do something well in practice. Compared to competitions, there are additional ways to be misled and (unfortunately) to mislead. Overfitting becomes a serious issue.
    3. Simulated data Simulated data can be very misleading. Typically, there is little reason to believe that simulated data reflects the structure of real-world problems.

    This ordering should always be kept in mind when considering empirical results.

  2. The repository enforced a certain interface which many natural learning problems did not fit into. Some interface must be enforced in order for a benchmark/repository to be useful. This implies that problems fitting the interface are preferred both in submission of new problems and in choice of algorithms to solve. This is a reasonable complaint, but it’s basically a complaint that the UCI machine learning repository was too succesful. The good news here is that the internet has made it so anyone can setup a repository. Just put up a webserver and announce the dataset.

It is important to consider the benefits of a benchmark suite as well. The standard in reinforcement learning for experimental studies in reinforcement learning algorithms has been showing that some new algorithm does something reasonable on one or two reinforcement learning problems. Naturally, what this implies in practice is that the algorithms were hand tuned to work on these one-or-two problems, and there was relatively little emphasis on building a general purpose reinforcement learning algorithm.

This is strongly at odds with the end goal of the reinforcement learning community! The way to avoid this is by setting up a benchmark suite (the more problems, the better). With a large set of problems that people can easily test on, the natural emphasis in empirical algorithm development shifts towards general purpose algorithms.

Another drawback of the “one or two problems” approach is incomparability of results. When people reimplement simulators for reinforcement learning problems, it turns out these reimplementations typically differ in the details, and these details can radically alter the difficulty of the problem. For example, there are many ways to implement the “car on the hill” problem, and some of them are much more difficult to solve than others. Again, a large set of problems people can easily test on solves this problem because the precise definition of the problem is shared.

One sometimes reasonable view of the world is that if you define a problem publicly, it is as good as solved since there are plenty of smart people out there eager to prove themselves. According to this view of the world, setting up a benchmark is a vital task.

Why Reinforcement Learning is Important

One prescription for solving a problem well is:

  1. State the problem, in the simplest way possible. In particular, this statement should involve no contamination with or anticipation of the solution.
  2. Think about solutions to the stated problem.

Stating a problem in a succinct and crisp manner tends to invite a simple elegant solution. When a problem can not be stated succinctly, we wonder if the problem is even understood. (And when a problem is not understood, we wonder if a solution can be meaningful.)

Reinforcement learning does step (1) well. It provides a clean simple language to state general AI problems. In reinforcement learning there is a set of actions A, a set of observations O, and a reward r. The reinforcement learning problem, in general, is defined by a conditional measure D( o, r | (o,r,a)*) which produces an observation o and a reward r given a history (o,r,a)*. The goal in reinforcement learning is to find a policy pi:(o,r,a)* -> a mapping histories to actions so as to maximize (or approximately maximize) the expected sum of observed rewards.

This formulation is capable of capturing almost any (all?) AI problems. (Are there any other formulations capable of capturing a similar generality?) I don’t believe we yet have good RL solutions from step (2), but that is unsurprising given the generality of the problem.

Note that solving RL in this generality is impossible (for example, it can encode classification). The two approaches that can be taken are:

  1. Simplify the problem. It is very common to consider the restricted problem where the history is summarized by the previous observation. (aka a “Markov Decision Process”). In many cases, other restrictions are added.
  2. Think about relativized solutions (such as reductions).

Both methods are options are under active investigation.

Reopening RL->Classification

In research, it’s often the case that solving a problem helps you realize that it wasn’t the right problem to solve. This is the case for the “reduce RL to classification” problem with the solution hinted at here and turned into a paper here.

The essential difficulty is that the method of stating and analyzing reductions ends up being nonalgorithmic (unlike previous reductions) unless you work with learning from teleoperated robots as Greg Grudic does. The difficulty here is due to the reduction being dependent on the optimal policy (which a human teleoperator might simulate, but which is otherwise unavailable).

So, this problem is “open” again with the caveat that this time we want a more algorithmic solution.

Whether or not this is feasible at all is still unclear and evidence in either direction would greatly interest me. A positive answer might have many practical implications in the long run.

Solution: Reinforcement Learning with Classification

I realized that the tools needed to solve the problem just posted were just created. I tried to sketch out the solution here (also in .lyx and .tex). It is still quite sketchy (and probably only the few people who understand reductions well can follow).

One of the reasons why I started this weblog was to experiment with “research in the open”, and this is an opportunity to do so. Over the next few days, I’ll be filling in details and trying to get things to make sense. If you have additions or ideas, please propose them.

Problem: Reinforcement Learning with Classification

At an intuitive level, the question here is “Can reinforcement learning be solved with classification?”

Problem Construct a reinforcement learning algorithm with near-optimal expected sum of rewards in the direct experience model given access to a classifier learning algorithm which has a small error rate or regret on all posed classification problems. The definition of “posed” here is slightly murky. I consider a problem “posed” if there is an algorithm for constructing labeled classification examples.

Past Work

  1. There exists a reduction of reinforcement learning to classification given a generative model. A generative model is an inherently stronger assumption than the direct experience model.
  2. Other work on learning reductions may be important.
  3. Several algorithms for solving reinforcement learning in the direct experience model exist. Most, such as E3, Factored-E3, and metric-E3 and Rmax require that the observation be the state. Recent work extends this approach to POMDPs.
  4. This problem is related to predictive state representations, because we are trying to solve reinforcement learning with prediction ability.

Difficulty It is not clear whether this problem is solvable or not. A proof that it is not solvable would be extremely interesting, and even partial success one way or another could be important.

Impact At the theoretical level, it would be very nice to know if the ability to generalize implies the ability to solve reinforcement learning because (in a general sense) all problems can be cast as reinforcement learning. Even if the solution is exponential in the horizon time it can only motivate relaxations of the core algorithm like RLgen.