The surge of interest in reinforcement learning is great fun, but I often see confused choices in applying RL algorithms to solve problems. There are two purposes for which you might use a world simulator in reinforcement learning:

1. **Reinforcement Learning Research**: You might be interested in creating reinforcement learning algorithms for the real world and use the simulator as a cheap alternative to actual real-world application.
2. **Problem Solving**: You want to find a good policy solving a problem for which you have a good simulator.

In the first instance I have no problem, but in the second instance, I’m seeing many head-scratcher choices.

A reinforcement learning algorithm engaging in policy improvement from a continuous stream of experience needs to solve an opportunity-cost problem. (The RL lingo for opportunity-cost is “advantage”.) Thinking about this in the context of a 2-person game, at a given state, with your existing rollout policy, is taking the first action leading to a win 1/2 the time good or bad? It could be good since the player is well behind and every other action is worse. Or it could be bad since the player is well ahead and every other action is better. Understanding one action’s long term value relative to another’s is the essence of the opportunity cost trade-off at the core of many reinforcement learning algorithms.

If you have a choice between an algorithm that *estimates* the opportunity cost and one which *observes* the opportunity cost, which works better? Using observed opportunity-cost is an almost pure winner because it cuts out the effect of estimation error. In the real world you can’t observe the opportunity cost directly, Groundhog Day style. How many times have you left a conversation and thought to yourself: I wish I had said something else? A simulator is different though: you *can* reset a simulator. And when you do reset a simulator, you can directly observe the opportunity-cost of an action, which can then directly drive learning updates.
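Here is a minimal Python sketch of what observing the opportunity cost through resets can look like. The toy game, the `step` function, and the uniformly random rollout policy are all hypothetical stand-ins chosen for illustration. The opportunity cost of an action is taken here to be the gap between its observed average return and that of the best action.

```python
import random

GOAL = 6  # hypothetical toy game: walk on 0..GOAL; reaching GOAL wins, reaching 0 loses


def step(state, action, rng):
    """One simulator transition (assumed dynamics, for illustration only).

    Action 1 moves right with probability 0.9, action 0 with probability 0.1.
    """
    p_right = 0.9 if action == 1 else 0.1
    nxt = state + 1 if rng.random() < p_right else state - 1
    done = nxt <= 0 or nxt >= GOAL
    reward = 1.0 if nxt >= GOAL else 0.0
    return nxt, reward, done


def random_policy(state, rng):
    """Rollout policy: pick either action uniformly at random."""
    return rng.choice([0, 1])


def rollout(state, first_action, policy, rng, max_steps=1000):
    """Reset to `state`, take `first_action`, then follow `policy` to the end."""
    action, total = first_action, 0.0
    for _ in range(max_steps):
        state, reward, done = step(state, action, rng)
        total += reward
        if done:
            break
        action = policy(state, rng)
    return total


def observed_opportunity_cost(state, actions, policy, rng, n_rollouts=2000):
    """Average return of each first action from the same reset point,
    reported as the shortfall relative to the best observed action."""
    q = {
        a: sum(rollout(state, a, policy, rng) for _ in range(n_rollouts)) / n_rollouts
        for a in actions
    }
    best = max(q.values())
    return {a: best - q[a] for a in actions}
```

Calling `observed_opportunity_cost(3, [0, 1], random_policy, random.Random(0))` should report a cost of zero for action 1 and a clearly positive cost for action 0, without any learned value estimate in the loop. Because `step` takes the state as an argument, "resetting" is just calling it again from the same state; a stateful simulator would need some save and restore mechanism instead.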

If you are coming from viewpoint 1, using a “reset cheat” is unappealing since it doesn’t work in the real world and the goal is making algorithms which work in the real world. On the other hand, if you are operating from viewpoint 2, the “reset cheat” is a gigantic opportunity to dramatically improve learning algorithms. So, why are many people with goal 2 using goal 1 designed algorithms? I don’t know, but here are some hypotheses.

- Maybe people just aren’t aware that goal 2 style algorithms exist? They are out there. The most prominent examples of goal 2 style algorithms are from Learning to search and AlphaGo Zero.
- Maybe people are worried about the additional sample complexity of doing multiple rollouts from reset points? But these algorithms typically require little additional sample complexity in the worst case and can provide gigantic wins. People commonly use a discount factor *d* which values future rewards *t* timesteps ahead with a discount of *d^t*. Alternatively, you can terminate rollouts with probability *1 – d* and value future rewards with no discount while preserving the expected value. Using this approach a rollout terminates after an expected *1/(1-d)* timesteps, bounding the cost of a reset and rollout. Since it is common to use very heavy discounting (e.g. *d=0.9*), the worst case additional sample complexity is only a small factor larger. On the upside, eliminating estimation error can radically reduce sample complexity in theory and practice.
- Maybe the implementation overhead for a second family of algorithms is too difficult? But the choice of whether or not you use resets is far more important than “oh, we’ll just run things for 10x longer”. It can easily make or break the outcome.
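The equivalence between discounting and random termination mentioned in the list above can be checked numerically. The following is a small sketch under assumed names (`reward_fn`, `terminated_rollout`, and `compare` are hypothetical helpers, not from any library): it compares the discounted sum of a reward sequence against the average undiscounted return of rollouts that stop with probability *1 – d* after each step.

```python
import random


def discounted_value(reward_fn, d, horizon=2000):
    """Reference value: sum over t of d**t * r_t, truncated at a long horizon."""
    return sum((d ** t) * reward_fn(t) for t in range(horizon))


def terminated_rollout(reward_fn, d, rng):
    """Undiscounted return of one rollout that stops with probability 1 - d
    after each step. Returns (total reward, number of steps taken)."""
    total, t = 0.0, 0
    while True:
        total += reward_fn(t)  # the reward at step t is reached with probability d**t
        t += 1
        if rng.random() >= d:  # terminate with probability 1 - d
            return total, t


def compare(reward_fn, d=0.9, n_rollouts=20000, seed=0):
    """Return (discounted value, mean terminated return, mean rollout length)."""
    rng = random.Random(seed)
    results = [terminated_rollout(reward_fn, d, rng) for _ in range(n_rollouts)]
    mean_return = sum(r for r, _ in results) / n_rollouts
    mean_length = sum(t for _, t in results) / n_rollouts
    return discounted_value(reward_fn, d), mean_return, mean_length
```

With a constant reward of 1 and *d=0.9*, the first two numbers should agree up to sampling noise, and the mean rollout length should be close to *1/(1-d)* = 10 steps, which is the bound on the cost of one reset and rollout.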

Maybe there is some other reason? As I said above, this is a head-scratcher that I find myself trying to address regularly.

I think most researchers are interested in the first paradigm: develop RL algorithms that work in the real world, and use simulators as a proxy for this. This is because for the second paradigm, which has explicit access to a simulator (or, even better, an understanding of the “principles” of the simulator), there are vastly superior algorithms to what is typically thought about in RL. For instance, when using a physics simulator for robotics style problems, the most impressive results have been obtained with trajectory optimization methods such as CIO (https://youtu.be/mhr_jtQrhVA).

Making more explicit use of the simulator with typically studied RL algorithms seems like a half measure. Firstly, this makes the pretty strong assumption that solving the problem in the simulator solves it in the real world, or that the end goal is just to solve it in the simulator. There is some inclination to avoid this strong assumption. Secondly, if the assumption is indeed satisfied, as in computer animation, then there exist vastly more efficient methods that make even more explicit use of the simulator (e.g. making physics soft constraints rather than hard constraints to smooth things).

I should also note that this is an extremely important question and answering this convincingly before proposing a solution paradigm is critical! This is something Sham Kakade and I have discussed at length, and is something we are very interested in understanding better. So far, at least in the robotics context, we haven’t been able to find a convincing answer to why it is OK to use resets but not make more integral use of the simulator leading to an entirely different class of solution methods. Maybe it makes sense for certain types of simulators like Go?

I agree that researchers are primarily interested in the first setting.

W.r.t. alternatives to reset, I think reset is relatively unique in that it is a universal technique applying to all simulators. Making stronger use of any given simulator is of course possible, but you are necessarily specializing to that simulator or class of simulators.

My favorite example algorithms for Goal 2 are:

(1) PEGASUS, which asks for control over pseudo-random numbers.

(2) The “2-sample” trick in direct Bellman error minimization, where you can avoid the conditional variance in (f – r – f’)^2 by generating two i.i.d. x’ from exactly the same (x, a).
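To make the “2-sample” trick concrete, here is a toy Python sketch. Everything in it is an assumption for illustration: `sample_target` stands in for a simulator call that resets to the same (x, a) and returns r + f’(x’), and the target is a fair coin flip between 0 and 1. The single-sample loss (f – r – f’)^2 is inflated by the conditional variance of the target, while the product of two independent residuals is not.

```python
import random


def sample_target(rng):
    """Hypothetical simulator call: reset to the same (x, a), sample x',
    and return r + f'(x'). Here: 1 or 0 with equal probability (mean 0.5)."""
    return 1.0 if rng.random() < 0.5 else 0.0


def single_sample_loss(f, rng):
    """(f - r - f')^2 from one sample: biased upward by Var(r + f'(x'))."""
    y = sample_target(rng)
    return (f - y) ** 2


def double_sample_loss(f, rng):
    """Product of two residuals from i.i.d. samples at the same (x, a):
    its expectation is (f - E[r + f'(x')])^2, with no variance term."""
    y1 = sample_target(rng)
    y2 = sample_target(rng)
    return (f - y1) * (f - y2)


def average(loss, f, n=20000, seed=0):
    """Monte Carlo average of a loss at a fixed prediction f."""
    rng = random.Random(seed)
    return sum(loss(f, rng) for _ in range(n)) / n
```

At f = 0.5 the true Bellman error is zero in this toy setup, yet `average(single_sample_loss, 0.5)` stays at the target variance of 0.25, whereas `average(double_sample_loss, 0.5)` should be close to zero. The second sample is only available because the simulator can be reset to exactly the same (x, a).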

I totally agree that we should make it clear whether a paper is about Goal 1 or 2. Honestly I have always had a hard time guessing whether someone is a “Goal 1” or “Goal 2” person, or whether they even have a clear idea themselves.

Regarding why people use Goal 1 algorithms for Goal 2, I would guess that most classical RL algorithms are *actually* designed for Goal 2, although many of them don’t make full use of the simulator and fit Goal 1 well. By “actually” I mean the subjective intention of the designer, how they were initially advertised, etc.

What made me think so is a post by Rich from a long time ago, which you can still find at http://incompleteideas.net/RL-FAQ.html.

“In fact, neither [NDP nor RL] is very descriptive of the subject… The problem with [RL] is that it is dated. Much of the field does not concern learning at all, but just planning… Almost the same methods are used for planning and learning, and it seems off-point to emphasize learning in the name of the field.”

I agree with your question and your analysis. My students and I have worked on problems where the simulator is very expensive. Being able to reset to the start state, or even better, to query arbitrary states, is critical to minimizing the number of calls to the simulator. Monte Carlo methods (including rollouts) are not practical when the simulator is expensive.