Apprenticeship Reinforcement Learning for Control

Pieter Abbeel presented a paper with Andrew Ng at ICML on Exploration and Apprenticeship Learning in Reinforcement Learning. The basic idea of this algorithm is:

Collect data from a human controlling a machine.
Build a transition model based upon the experience.
Build a policy which optimizes the transition model.
Evaluate the policy. If it works well, halt, otherwise add the experience into the pool and go to (2).

The paper proves that this technique will converge to some policy with expected performance near human expected performance assuming the world fits certain assumptions (MDP or linear dynamics).

This general idea of apprenticeship learning (i.e. incorporating data from an expert) seems very compelling because (a) humans often learn this way and (b) much harder problems can be solved. For (a), the notion of teaching is about transferring knowledge from an expert to novices, often via demonstration. To see (b), note that we can create intricate reinforcement learning problems where a particular sequence of actions must be taken to achieve a goal. A novice might be able to memorize this sequence given just one demonstration even though it would require experience exponential in the length of the sequence to discover the key sequence accidentally.

Andrew Ng’s group has exploited this to make this very fun picture.
(Yeah, that’s a helicopter flying upside down, under computer control.)

As far as this particular paper, one question occurs to me. There is a general principle of learning which says we should avoid “double approximation”, such as occurs in step (3) where we build an approximate policy on an approximate model. Is there a way to fuse steps (2) and (3) to achieve faster or better learning?