Apprenticeship Reinforcement Learning for Control

Pieter Abbeel presented a paper, coauthored with Andrew Ng, at ICML on Exploration and Apprenticeship Learning in Reinforcement Learning. The basic idea of this algorithm (sketched in code below) is:

  1. Collect data from a human controlling a machine.
  2. Build a transition model based upon the experience.
  3. Build a policy which optimizes the transition model.
  4. Evaluate the policy. If it works well, halt; otherwise, add the experience into the pool and go to (2).
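To make the loop concrete, here is a minimal Python sketch. The four callables (collect_human_trajectories, fit_transition_model, optimize_policy, evaluate_policy) are assumptions of the sketch, supplied by the caller to stand in for the steps above; this is not Abbeel and Ng's actual algorithm.

    # A hedged sketch of the apprenticeship loop above. The four callables are
    # user-supplied placeholders for the paper's steps, not code from the paper.
    def apprenticeship_rl(collect_human_trajectories, fit_transition_model,
                          optimize_policy, evaluate_policy,
                          performance_threshold, max_iters=20):
        # (1) Collect trajectories from a human controlling the machine.
        data = collect_human_trajectories()
        policy = None
        for _ in range(max_iters):
            # (2) Build a transition model from the experience gathered so far.
            model = fit_transition_model(data)
            # (3) Build a policy which optimizes the (approximate) model.
            policy = optimize_policy(model)
            # (4) Evaluate the policy in the real world; halt if it works well,
            #     otherwise add the new experience to the pool and repeat.
            performance, new_trajectories = evaluate_policy(policy)
            if performance >= performance_threshold:
                break
            data.extend(new_trajectories)
        return policy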

The paper proves that this technique will converge to some policy with expected performance near the human's expected performance, provided the world satisfies certain assumptions (MDP or linear dynamics).

This general idea of apprenticeship learning (i.e. incorporating data from an expert) seems very compelling because (a) humans often learn this way and (b) much harder problems can be solved. For (a), the notion of teaching is about transferring knowledge from an expert to novices, often via demonstration. To see (b), note that we can create intricate reinforcement learning problems where a particular sequence of actions must be taken to achieve a goal. A novice might be able to memorize this sequence given just one demonstration even though it would require experience exponential in the length of the sequence to discover the key sequence accidentally.
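A toy worked example of (b), in Python: a "combination lock" problem where reward is obtained only by executing one secret sequence of binary actions. The environment here is made up purely for illustration; the point is that undirected exploration needs on the order of 2^n episodes, while a single demonstration suffices.

    import random

    # Toy "combination lock": reward 1 only for the exact secret action sequence.
    SECRET = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # n = 10 binary actions

    def run_episode(actions):
        return 1.0 if actions == SECRET else 0.0

    # Undirected exploration: each random episode succeeds with probability
    # 2**-n, so on average about 2**n = 1024 episodes are needed here.
    episodes = 0
    while True:
        episodes += 1
        if run_episode([random.randint(0, 1) for _ in SECRET]) > 0:
            break
    print("random exploration took", episodes, "episodes")

    # Apprenticeship: one demonstration reveals the sequence, and the novice
    # simply memorizes and replays it.
    demonstration = list(SECRET)
    print("replaying the demonstration yields reward", run_episode(demonstration))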

Andrew Ng’s group has exploited this to make this very fun picture.
(Yeah, that’s a helicopter flying upside down, under computer control.)

Regarding this particular paper, one question occurs to me. There is a general principle of learning which says we should avoid “double approximation”, such as occurs in step (3), where we build an approximate policy on an approximate model. Is there a way to fuse steps (2) and (3) to achieve faster or better learning?

Why Reinforcement Learning is Important

One prescription for solving a problem well is:

  1. State the problem, in the simplest way possible. In particular, this statement should involve no contamination with or anticipation of the solution.
  2. Think about solutions to the stated problem.

Stating a problem in a succinct and crisp manner tends to invite a simple, elegant solution. When a problem cannot be stated succinctly, we wonder if the problem is even understood. (And when a problem is not understood, we wonder if a solution can be meaningful.)

Reinforcement learning does step (1) well. It provides a clean, simple language for stating general AI problems. In reinforcement learning there is a set of actions A, a set of observations O, and a reward r. The reinforcement learning problem, in general, is defined by a conditional measure D(o, r | (o,r,a)*) which produces an observation o and a reward r given a history (o,r,a)*. The goal in reinforcement learning is to find a policy pi: (o,r,a)* -> A mapping histories to actions so as to maximize (or approximately maximize) the expected sum of observed rewards.
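To spell the formulation out, here is a minimal Python sketch of the interaction protocol. The world and the policy are caller-supplied callables (assumed interfaces, not a solution); the sketch only makes explicit what it means for a policy to map histories to actions and for the goal to be the expected sum of observed rewards.

    # Sketch of the general RL interaction. `world` plays the role of
    # D(o, r | (o,r,a)*) and `policy` the role of pi: (o,r,a)* -> A.
    # Both are assumed interfaces supplied by the caller.
    def interact(world, policy, horizon):
        history = []          # the history (o, r, a)* as a list of triples
        total_reward = 0.0
        for _ in range(horizon):
            o, r = world(history)      # world: D(o, r | history)
            a = policy(history, o, r)  # policy maps the history (and new o, r) to an action in A
            history.append((o, r, a))
            total_reward += r
        return total_reward   # the quantity whose expectation we want to maximize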

This formulation is capable of capturing almost any (all?) AI problem. (Are there any other formulations capable of capturing a similar generality?) I don’t believe we yet have good RL solutions from step (2), but that is unsurprising given the generality of the problem.

Note that solving RL in this generality is impossible (for example, it can encode classification). The two approaches that can be taken are:

  1. Simplify the problem. It is very common to consider the restricted problem where the history is summarized by the previous observation (aka a “Markov Decision Process”), as sketched below. In many cases, other restrictions are added.
  2. Think about relativized solutions (such as reductions).

Both approaches are under active investigation.
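For contrast with the general interaction loop sketched earlier, here is the same loop under the Markov restriction: the world and the policy see only the most recent observation instead of the whole history. Again the callables are assumed interfaces, not real code.

    # Under the MDP restriction, the previous observation summarizes the history,
    # so both the world and the policy become memoryless.
    def interact_mdp(world, policy, initial_observation, horizon):
        o, total_reward = initial_observation, 0.0
        for _ in range(horizon):
            a = policy(o)         # policy: O -> A, no history needed
            o, r = world(o, a)    # world: D(o', r | o, a), no history needed
            total_reward += r
        return total_reward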

Peekaboom

Luis has released Peekaboom, a successor to ESPgame (game site). The purpose of the game is similar: using the actions of people playing a game to gather data helpful in solving AI.

Peekaboom gathers more detailed, and perhaps more useful, data about vision. For ESPgame, the byproduct of the game was mutually agreed upon labels for common images. For Peekaboom, the location of the subimage generating the label is revealed by the game as well. Given knowledge about which portion of the image is related to a label, it may be more feasible to learn to recognize the appropriate parts.

There isn’t a dataset yet available for this game as there is for ESPgame, but hopefully a significant number of people will play and we’ll have one to work with soon.

Not goal metrics

One of the confusing things about research is that progress is very hard to measure. One of the consequences of being in a hard-to-measure environment is that the wrong things are often measured.

  1. Lines of Code The classical example of this phenomenon is the old lines-of-code-produced metric for programming. It is easy to imagine systems for producing many lines of code with very little work that accomplish very little.
  2. Paper count In academia, a “paper count” is an analog of “lines of code”, and it suffers from the same failure modes. The obvious failure mode here is that we end up with a large number of uninteresting papers because people spend a lot of time optimizing this metric.
  3. Complexity Another metric is the “complexity” (in the eye of a reviewer) of a paper. There is a common temptation to make a method appear more complex than it is in order for reviewers to judge it worthy of publication. The failure mode here is unclean thinking. Simple, effective methods are often overlooked in favor of complex, relatively ineffective methods. This is simply wrong for any field. (Discussion at Lance’s blog.)
  4. Acceptance Rate “Acceptance rate” is the number of papers accepted/number of papers submitted. A low acceptance rate is often considered desirable for a conference. But:
    1. It’s easy to skew an acceptance rate by adding (or inviting) many weak or bogus papers.
    2. It’s very difficult to judge what, exactly, is good work in the long term. Consequently, a low acceptance rate can retard progress by simply raising the bar too high for what turns out to be a good idea when it is more fully developed. (Consider the limit where only one paper is accepted per year…)
    3. Accept/reject decisions can become more “political” and less about judging the merits of a paper/idea. With a low acceptance ratio, a strong objection by any one of several reviewers might torpedo a paper. The consequence of this is that papers become noncontroversial with a tendency towards incremental improvements.
    4. A low acceptance rate tends to spawn a multiplicity of conferences in one area. There is a strong multiplicity of learning-related conferences.

    (see also How to increase the acceptance ratios at top conferences?)

  5. Citation count Counting citations is somewhat better than counting papers because it is some evidence that an idea is actually useful. This has been particularly aided by automated citation counting systems like scholar.google.com and http://citeseer.ist.psu.edu/. However, there are difficulties: citation counts can be optimized using self-citation and “societies of mutual admiration” (groups of people who agree implicitly or explicitly to cite each other). Citations are also sometimes negative, of the form “here we fix bad idea X”.
  6. See also the Academic Mechanism Design post for other ideas.

These metrics do have some meaning. A programmer who writes no lines of code isn’t very good. An academic who produces no papers isn’t very good. A conference that doesn’t aid information filtration isn’t helpful. Hard problems often require complex solutions. Important papers are often cited.

Nevertheless, optimizing these metrics is not beneficial for a field of research. In thinking about this, we must clearly differentiate 1) what is good for a field of research (solving important problems) and 2) what is good for individual researchers (getting jobs). The essential point here is that there is a disparity.

No individual in academia can avoid being judged by these metrics. Attempts by an individual or a small group of individuals to ignore these metrics are unlikely to change the system (and likely to result in the individual or small group being judged badly).

I don’t believe there is an easy fix to this problem. The best we can hope for is incremental progress which takes the form of the leadership in the academic community introducing new, saner metrics. This is a difficult thing, particularly because any academic leader must have succeeded in the old system. Nevertheless, it must happen if academic-style research is to flourish.

In the spirit of being constructive, I’ll make one proposal which may address the “complexity” problem: judge the importance of a piece of work independently of the method. For a conference paper this might be done by changing the review process to have one “technical reviewer” and several “importance reviewers” rather than 3 or 4 reviewers. The job of an “importance reviewer” is easier than the current standard: they must simply understand the problem being solved and rate how important this problem is. The technical reviewer’s job is harder than the current standard: they must verify that all claims of solution to the problem are met. Overall, the amount of work by reviewers would stay constant, and perhaps we would avoid the preference for complex solutions.

Interesting papers at ACL

A recent discussion indicated that one goal of this blog might be to allow people to post comments about recent papers that they liked. I think this could potentially be very useful, especially for those with diverse interests but only finite time to read through conference proceedings. ACL 2005 recently completed, and here are four papers from that conference that I thought were either good or perhaps of interest to a machine learning audience.

David Chiang, A Hierarchical Phrase-Based Model for Statistical Machine Translation. (Best paper award.) This paper takes the standard phrase-based MT model that is popular in our field (basically, translate a sentence by individually translating phrases and reordering them according to a complicated statistical model) and extends it to take into account hierarchy in phrases, so that you can learn things like “X ‘s Y” -> “Y de X” in Chinese, where X and Y are arbitrary phrases. This takes a step toward linguistic syntax for MT, which our group is working hard on, but doesn’t require any linguists to sit down and write out grammars or parse sentences.

Rie Kubota Ando and Tong Zhang, A High-Performance Semi-Supervised Learning Method for Text Chunking. This is more of a machine learning style paper, where they improve a sequence labeling task by augmenting it with models from related tasks for which data is free. For instance, I might train a model that, given a context with a missing word, predicts the word (e.g., “The ____ gave a speech” might want you to insert “president”). By doing so, you can use these other models to give additional useful information to your main task.
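As a toy sketch of the auxiliary-task idea (my own illustration, not the paper's method, which learns shared structure across many such auxiliary problems): unlabeled text can be turned into free supervised examples for a "predict the missing word" task.

    # Toy sketch: manufacture free auxiliary training data ("predict the missing
    # word from its context") out of unlabeled text. This only illustrates the
    # data-generation idea, not Ando and Zhang's actual learning method.
    def auxiliary_examples(sentence, window=2):
        words = sentence.split()
        examples = []
        for i, target in enumerate(words):
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            context = " ".join(left + ["____"] + right)
            examples.append((context, target))   # (context, missing word)
        return examples

    for context, word in auxiliary_examples("The president gave a speech"):
        print(f"{context!r:35} -> {word}")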

Noah A. Smith and Jason Eisner, Contrastive Estimation: Training Log-Linear Models on Unlabeled Data. This paper talks about training sequence labeling models in an unsupervised fashion, basically by contrasting what the model does on the correct string with what the model does on a corrupted version of the string. They get significantly better results than just by using EM in an HMM, and the idea is pretty nice.
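Here is a toy sketch of the contrastive objective for a simple log-linear model, using bigram features, made-up weights, and a neighborhood of adjacent-word swaps; the paper's actual models, features, and neighborhoods are richer, so treat this only as an illustration of the idea.

    import math
    from collections import Counter

    # Toy contrastive estimation: score the observed sentence against a small
    # neighborhood of corrupted versions (adjacent-word swaps). The features,
    # weights, and neighborhood are illustrative assumptions, not the paper's.
    def features(words):
        return Counter(zip(words, words[1:]))       # bigram indicator counts

    def score(words, weights):
        return sum(weights.get(f, 0.0) * c for f, c in features(words).items())

    def neighborhood(words):
        variants = [list(words)]                    # include the observed sentence
        for i in range(len(words) - 1):
            swapped = list(words)
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            variants.append(swapped)
        return variants

    def contrastive_log_likelihood(words, weights):
        # log of: exp(score of observed) / sum of exp(score) over the neighborhood
        scores = [score(v, weights) for v in neighborhood(words)]
        log_z = math.log(sum(math.exp(s) for s in scores))
        return score(list(words), weights) - log_z

    weights = {("the", "dog"): 1.0, ("dog", "barked"): 1.0}   # made-up weights
    print(contrastive_log_likelihood("the dog barked".split(), weights))

Training would adjust the weights to maximize this quantity summed over unlabeled sentences, pushing probability mass from the corrupted neighbors onto the observed string.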

Patrick Pantel, Inducing Ontological Co-occurrence Vectors. This is a pretty neat idea (though I’m biased — Patrick is a friend) where one attempts to come up with feature vectors that describe nodes in a semantic hierarchy (ontology) that could enable you to figure out where to insert new words that are not in your ontology. The results are pretty good, and the method is fairly simple; I’d imagine that a more complex model/learning framework could improve the model even further.