We are hiring for reinforcement learning related research at all levels and all MSR labs. If you are interested, apply, talk to me at COLT or ICML, or email me.

More generally though, I wanted to lay out a philosophy of research which differs from (and plausibly improves on) the current prevailing mode.

DeepMind and OpenAI have popularized an empirical approach where researchers modify algorithms and test them against simulated environments, including in self-play. They’ve achieved significant success in these simulated environments, greatly expanding the repertoire of ‘games solved by reinforcement learning’ which consisted of the singleton backgammon when I was a graduate student. Given the ambitious goals of these organizations, the more general plan seems to be “first solve games, then solve real problems”. There are some weaknesses to this approach, which I want to lay out next.

**Broken API** One issue with this is that multi-step reinforcement learning is a broken API, in the sense that it creates an interface for problem definitions that is unsolvable via currently popular algorithm families. In particular, you can create problems which are either ‘antishaped’, so that local rewards mislead w.r.t. long term rewards, or keylock problems, as are common in Markov Decision Process lower bounds. I coded up simple versions of these problems a couple of years ago and stuck them on GitHub now to be extra crisp. If you try to apply policy gradient or Q-learning style algorithms to these problems, they commonly run into exponential (in the number of states) sample complexity. As a general principle, APIs which create exponential sample complexity are bad: they imply that individual applications require taking advantage of special structure in order to succeed.

**Transference** Another significant issue is the degree of transference between solutions in simulation and the real world. “Transference” here potentially happens at several levels.

- Do the algorithms carry over? One of the persistent issues with simulation-based approaches is that you don’t care about sample complexity that much: optimal performance at acceptable computational complexities is the typical goal. In real world applications, this is somewhat absurd: you really care about immediately doing something reasonable and optimizing from there.
- Do the simulators carry over? For every simulator, there is a fidelity question which comes into play when you want to transfer a policy learned in the simulator into action in the real world. Real-time ray tracing and simulator quality more generally are advancing, but I’m not ready yet to trust a self-driving car trained in a simulated reality. Whether the physics can be simulated accurately is unclear: friction, for example, is known to be difficult, and more generally the representative variety of exogenous events in an open world seems quite difficult to implement.
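To make the ‘antishaped’ and keylock problems from the Broken API paragraph concrete, here is a minimal sketch of a toy chain that combines both. It is not the code referred to above: the class `AntishapedKeylock`, the helper names, and the reward values (0.01 for a wrong action, 1.0 for opening the lock) are all assumptions made for illustration, and it only simulates uniformly random exploration rather than policy gradient or Q-learning.

```python
import random


class AntishapedKeylock:
    """Hypothetical toy chain: a hidden correct action at each of n states."""

    def __init__(self, n_states, seed=0):
        rng = random.Random(seed)
        self.n_states = n_states
        # The secret "key": one correct action (0 or 1) per state.
        self.key = [rng.randrange(2) for _ in range(n_states)]

    def run_episode(self, policy):
        """Run one episode; policy maps a state index to an action in {0, 1}."""
        for state in range(self.n_states):
            if policy(state) != self.key[state]:
                # Wrong action: a small immediate payoff, then the episode ends.
                # This is the antishaped part: the local reward points away
                # from the long-term goal.
                return 0.01
            # Correct action: no immediate reward, move on to the next state.
        # Only an unbroken run of correct actions opens the lock.
        return 1.0


def uniform_success_probability(n_states):
    """Chance that uniformly random actions open the lock in one episode."""
    return 0.5 ** n_states


def episodes_until_success(env, rng, max_episodes):
    """Count random-exploration episodes until the first success, or None."""
    for episode in range(1, max_episodes + 1):
        if env.run_episode(lambda state: rng.randrange(2)) == 1.0:
            return episode
    return None


if __name__ == "__main__":
    env = AntishapedKeylock(n_states=10)
    print(uniform_success_probability(10))
    print(episodes_until_success(env, random.Random(1), max_episodes=100_000))
```

In this sketch a uniformly random explorer opens the lock with probability 2^-n per episode, so it needs 2^n episodes in expectation, and a learner that greedily follows the immediate 0.01 payoff is steered away from the goal at every state.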

**Solution generality** When you test and discover that an algorithm works in a simulated world, you know that it works in the simulated world. If you try it in 30 simulated worlds and it works in all of them, it can still easily be the case that an algorithm fails on the 31st simulated world. How can you achieve confidence beyond the number of simulated worlds that you try and succeed on? There is some sense by which you can imagine generalization over an underlying process generating problems, but this seems like a shaky justification in practice, since the nature of the problems encountered seems to be a nonstationary development of an unknown future.

**Value creation** Solutions of a ‘first A, then B’ flavor naturally take time to get to the end state where most of the real value is set to be realized. In the years before reaching applications in the real world, does the funding run out? We certainly hope not for the field of research but a danger does exist. Some discussion here including the comments is relevant.

What’s an alternative?

Each of the issues above is addressable.

- Build fundamental theories of what are statistically and computationally tractable sub-problems of Reinforcement Learning. These tractable sub-problems form the ‘APIs’ of systems for solving these problems. Examples of this include simpler (Contextual Bandits), intermediate (learning to search), and more advanced (Contextual Decision Processes).
- Work on real-world problems. The obvious antidote to simulation is reality, driving both the need to create systems that work in reality and a research agenda around reality-centered issues like performance at low sample complexity. There are some significant difficulties with this: reinforcement style algorithms require interactive access to learn, which often drives research towards companies with an infrastructure. Nevertheless, offline evaluation on real-world data does exist and the choice of emphasis in research directions is available to everyone.
- The combination of fundamental theories and a platform which distills learnings so they are not forgotten and always improved upon provides a stronger basis for expectation of generalization into the next problem.
- The shortest path to creating valuable applications in the real world is to simply work on creating valuable applications in the real world. Doing this in a manner guided by other elements of the research program is just good sense.
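As a small illustration of the remark that offline evaluation on real-world data exists, here is a sketch of one standard estimator for the contextual bandit setting, inverse propensity scoring (IPS). The text above does not say which estimator is meant, so this choice is an assumption, as are the function name `ips_estimate`, the tuple layout of the logged data, and the toy log itself.

```python
def ips_estimate(logged, target_policy):
    """Estimate the average reward of target_policy from logged interactions.

    logged: list of (context, action, reward, probability) tuples, where
    probability is the chance the logging policy had of choosing that action.
    target_policy: function mapping a context to the action it would take.
    """
    total = 0.0
    for context, action, reward, probability in logged:
        # Only events where the target policy agrees with the logged action
        # contribute, reweighted by how unlikely the logged action was.
        if target_policy(context) == action:
            total += reward / probability
    return total / len(logged)


if __name__ == "__main__":
    # Hypothetical log: two contexts, two actions, known logging probabilities.
    log = [
        ("a", 0, 1.0, 0.5),
        ("a", 1, 0.0, 0.5),
        ("b", 1, 1.0, 0.25),
        ("b", 0, 0.0, 0.75),
    ]
    print(ips_estimate(log, lambda context: 1))
```

The estimate is unbiased only if the logged probabilities are correct and every action the target policy might take had a nonzero chance of being logged, which is why recording those probabilities at decision time matters for this kind of evaluation.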

The above must be applied in moderation—some emphasis on theory, some emphasis on real world applications, some emphasis on platforms, and some emphasis on empirics. This has been my research approach for a little over 10 years, ever since I started working on contextual bandits.

Let’s call the first research program ‘empirical simulation’ and the second research program ‘real fundamentals’. The empirical simulation approach has a clear strong advantage in that it creates impressive demos, which creates funding, which creates more research. The threshold for contribution to the empirical simulation approach may also be lower simply because it requires mastery of fewer elements, implying people can more easily participate in it. At the same time, the real fundamentals approach has clear advantages in addressing the weaknesses of the empirical simulation approach. At a concrete level, this means we have managed to define and create fundamentals through research while creating real-world applications and value radically more efficiently than the empirical simulation approach has achieved.

The ‘real fundamentals’ concept is behind the open positions above. These positions have been designed to come with both the colleagues and mandate to address the most difficult research problems along with the organizational leverage to change the world. For people interested in fundamentals and making things happen in the real world these are prime positions—please consider joining us.

+1

+1

Hi John, very interesting program.

Our recent result seems very relevant to your proposal, with a real application and generalization (semantic parsing), sample efficiency based importance sampling, and systematic exploration:

http://arxiv.org/abs/1807.02322

BTW the first author is on the job market.

Please encourage him to apply.

I am sympathetic to your goals, and you have certainly demonstrated great success pursuing your methodology. But I’m not convinced that it is the best approach to sequential decision making applications.

Consider SAT solvers, for example. The SAT API is intractable and yet tremendous progress has been made on methods that work in practice for that API. Progress (in terms of the size of problem that can be solved in practical time) has been exponential for many years. Yes, one can create difficult instances, but these are rare in practice, and people even demonstrated substantial speedups on them.

The robotics folks (I’m thinking of Pieter Abbeel and Sergey Levine, for example) have been demonstrating great success on sim-to-real transfer. I believe this is a very practical methodology in domains where good simulators can be constructed. This goes far beyond flashy demos. A strong argument can be made that we are also learning useful things (e.g., distributional RL) from pursuing the DeepMind gaming agenda despite the artificial and “useless” nature of the games themselves.

Thanks Tom, a few comments:

1) To be clear, I’m not advocating this research program to the exclusion of all others—I’m happy to see success by others.

2) Hard instances are actually reasonably common in practice, in my experience. For example, it’s fairly common for natural short term rewards to be opposed (i.e. antishaped) w.r.t. long term rewards, which breaks standard RL algorithms. The essential difference between these sorts of problems and SAT is that information-theoretic hardness is much stronger (and hence far less ignorable) than computational hardness.