Pervasive Simulator Misuse with Reinforcement Learning

The surge of interest in reinforcement learning is great fun, but I often see confused choices in applying RL algorithms to solve problems. There are two purposes for which you might use a world simulator in reinforcement learning:

  1. Reinforcement Learning Research: You might be interested in creating reinforcement learning algorithms for the real world and use the simulator as a cheap alternative to actual real-world application.
  2. Problem Solving: You want to find a good policy solving a problem for which you have a good simulator.

In the first instance I have no problem, but in the second instance, I’m seeing many head-scratcher choices.

A reinforcement learning algorithm engaging in policy improvement from a continuous stream of experience needs to solve an opportunity-cost problem. (The RL lingo for opportunity-cost is “advantage”.) Thinking about this in the context of a 2-person game, at a given state, with your existing rollout policy, is taking the first action leading to a win 1/2 the time good or bad? It could be good since the player is well behind and every other action is worse. Or it could be bad since the player is well ahead and every other action is better. Understanding one action’s long term value relative to another’s is the essence of the opportunity cost trade-off at the core of many reinforcement learning algorithms.

If you have a choice between an algorithm that estimates the opportunity cost and one which observes the opportunity cost, which works better? Using observed opportunity-cost is an almost pure winner because it cuts out the effect of estimation error. In the real world you can’t observe the opportunity cost directly Groundhog day style. How many times have you left a conversation and thought to yourself: I wish I had said something else? A simulator is different though—you can reset a simulator. And when you do reset a simulator, you can directly observe the opportunity-cost of an action which can then directly drive learning updates.

If you are coming from viewpoint 1, using a “reset cheat” is unappealing since it doesn’t work in the real world and the goal is making algorithms which work in the real world. On the other hand, if you are operating from viewpoint 2, the “reset cheat” is a gigantic opportunity to dramatically improve learning algorithms. So, why are many people with goal 2 using goal 1 designed algorithms? I don’t know, but here are some hypotheses.

  1. Maybe people just aren’t aware that goal 2 style algorithms exist? They are out there. The most prominent examples of goal 2 style algorithms are from Learning to search and AlphaGo Zero.
  2. Maybe people are worried about the additional sample complexity of doing multiple rollouts from reset points? But these algorithm typically require little additional sample complexity in the worst case and can provide gigantic wins. People commonly use a discount factor d values future rewards t timesteps ahead with a discount of dt. Alternatively, you can terminate rollouts with probability 1 – d and value future rewards with no discount while preserving the expected value. Using this approach a rollout terminates after an expected 1/(1-d) timesteps bounding the cost of a reset and rollout. Since it is common to use very heavy discounting (e.g. d=0.9), the worst case additional sample complexity is only a small factor larger. On the upside, eliminating estimation error is can radically reduce sample complexity in theory and practice.
  3. Maybe the implementation overhead for a second family of algorithms is to difficult? But the choice of whether or not you use resets is far more important than “oh, we’ll just run things for 10x longer”. It can easily make or break the outcome.

Maybe there is some other reason? As I said above, this is head-scratcher that I find myself trying to address regularly.

Vowpal Wabbit 8.5.0 & NIPS tutorial

Yesterday, I tagged VW version 8.5.0 which has many interactive learning improvements (both contextual bandit and active learning), better support for sparse models, and a new baseline reduction which I’m considering making a part of the default update rule.

If you want to know the details, we’ll be doing a mini-tutorial during the Friday lunch break at the Extreme Classification workshop at NIPS. Please join us if interested.

Edit: also announced at the Learning Systems workshop

The Real World Interactive Learning Tutorial

Alekh and I have been polishin the Real World Interactive Learning tutorial for ICML 2017 on Sunday.

This tutorial should be of pretty wide interest. For data scientists, we are crossing a threshold into easy use of interactive learning while for researchers interactive learning is plausibly the most important frontier of understanding. Great progress on both the theory and especially on practical systems has been made since an earlier NIPS 2013 tutorial.

Please join us if you are interested 🙂

ICML is changing its constitution

Andrew McCallum has been leading an initiative to update the bylaws of IMLS, the organization which runs ICML. I expect most people aren’t interested in such details. However, the bylaws change rarely and can have an impact over a long period of time so they do have some real importance. I’d like to hear comment from anyone with a particular interest before this year’s ICML.

In my opinion, the most important aspect of the bylaws is the at-large election of members of the board which is preserved. Most of the changes between the old and new versions are aimed at better defining roles, committees, etc… to leave IMLS/ICML better organized.

Anyways, please comment if you have a concern or thoughts.

Machine Learning the Future Class

This spring, I taught a class on Machine Learning the Future at Cornell Tech covering a number of advanced topics in machine learning including online learning, joint (structured) prediction, active learning, contextual bandit learning, logarithmic time prediction, and parallel learning. Each of these classes was recorded from the laptop via Zoom and I just uploaded the recordings to Youtube.

In some ways, this class is a followup to the large scale learning class I taught with Yann LeCun 4 years ago. The videos for that class were taken down(*) so these lectures both update and replace shared subjects as well as having some new subjects.

Much of this material is fairly close to research so to assist other machine learning lecturers around the world in digesting the material, I’ve made all the source available as well. Feel free to use and improve.

(*) The NYU policy changed so that students could not be shown in classroom videos.