EWRL and NIPS 2016

I went to the European Workshop on Reinforcement Learning and NIPS last month and saw several interesting things.

At EWRL, I particularly liked the talks from:

  1. Remi Munos on off-policy evaluation
  2. Mohammad Ghavamzadeh on learning safe policies
  3. Emma Brunskill on optimizing biased-but-safe estimators (sense a theme?)
  4. Sergey Levine on low sample complexity applications of RL in robotics.

My talk is here. Overall, this was a well-organized workshop with diverse and interesting subjects, the only caveat being that they had to limit registration 🙂

At NIPS itself, I found the poster sessions fairly interesting.

  1. Allen-Zhu and Hazan had a new notion of a reduction (video).
  2. Zhao, Poupart, and Gordon had a new way to learn Sum-Product Networks
  3. Ho, Littman, MacGlashan, Cushman, and Austerweil had a paper on how “Showing” is different from “Doing”.
  4. Toulis and Parkes had a paper on estimation of long term causal effects.
  5. Rae, Hunt, Danihelka, Harley, Senior, Wayne, Graves, and Lillicrap had a paper on large memories with neural networks.
  6. Hardt, Price, and Srebro had a paper on Equal Opportunity in ML.

Format-wise, I thought 2 poster sessions were better than 1, but I really would have preferred more. The recorded spotlights are also pretty cool.

The NIPS workshops were great, although I was somewhat reminded of kindergarten soccer in terms of lopsided attendance. This may be inevitable given how hot the field is, but I think it’s important for individual researchers to remember that:

  1. There are many important directions of research.
  2. You personally have a much higher chance of doing something interesting if everyone else is not doing it also.

During the workshops, I learned about ADAM (a momentum form of Adagrad), testing ML systems, and that even TensorFlow is finally looking into synchronous updates for parallel learning (allreduce is the way).

(edit: added one)

ICML 2016 videos and statistics

The ICML 2016 videos are out.

I also wanted to share some statistics from registration that might be of general interest.

The total number of people attending: 3103.

Industry: 47% University: 46%

Male: 83% Female: 14%

Local (NY, NJ, or CT): 27%

North America: 70% Europe: 18% Asia: 9% Middle East: 2% Remainder: <1% including 2 from Antarctica 🙂

The Multiworld Testing Decision Service

We made a tool that you can use. It is the first general purpose reinforcement-based learning system 🙂

Reinforcement learning is much discussed these days with successes like AlphaGo. Wouldn’t it be great if Reinforcement Learning algorithms could easily be used to solve all reinforcement learning problems? But there is a well-known problem: It’s very easy to create natural RL problems for which all standard RL algorithms (epsilon-greedy Q-learning, SARSA, etc…) fail catastrophically. That’s a serious limitation which both inspires research and which I suspect many people need to learn the hard way.

Removing the credit assignment problem from reinforcement learning yields the Contextual Bandit setting which we know is generically solvable in the same manner as common supervised learning problems. I know of about a half-dozen real-world successful contextual bandit applications typically requiring the cooperation of engineers and deeply knowledgeable data scientists.

Can we make this dramatically easier? We need a system that explores over appropriate choices with logging of features, actions, probabilities of actions, and outcomes. These must then be fed into an appropriate learning algorithm which trains a policy and then deploys the policy at the point of decision. Naturally, this is what we’ve done and now it can be used by anyone. This drops the barrier to use down to: “Do you have permissions? And do you have a reasonable idea of what a good feature is?”
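The exploration-and-logging step described above can be sketched in a few lines. This is a minimal illustration, assuming simple epsilon-greedy exploration; the function names and the record fields (`features`, `action`, `prob`, `reward`) are hypothetical and are not the Decision Service's actual API.

```python
import random


def epsilon_greedy_choose(policy, features, actions, epsilon, rng):
    """Choose an action and return it with the probability it was chosen."""
    greedy = policy(features)
    if rng.random() < epsilon:
        action = rng.choice(actions)  # explore uniformly over all actions
    else:
        action = greedy               # exploit the current policy
    # Probability of the chosen action under this exploration scheme:
    # every action gets epsilon / K, and the greedy one gets 1 - epsilon more.
    prob = epsilon / len(actions)
    if action == greedy:
        prob += 1.0 - epsilon
    return action, prob


def log_interaction(log, features, action, prob, reward):
    """Record the four things a learner needs later (hypothetical format)."""
    log.append({"features": features, "action": action,
                "prob": prob, "reward": reward})


# Usage: one decision with a placeholder policy and a made-up reward.
rng = random.Random(0)
actions = ["sports", "news", "weather"]
always_sports = lambda features: "sports"
log = []
context = {"hour": 9}
action, prob = epsilon_greedy_choose(always_sports, context, actions, 0.3, rng)
log_interaction(log, context, action, prob, reward=1.0)
```

Recording the probability at decision time matters: it is what lets a learner later correct for the fact that the logged actions were chosen by the system rather than uniformly at random.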

A key foundational idea is Multiworld Testing: the capability to evaluate large numbers of policies mapping features to actions in a manner exponentially more efficient than standard A/B testing. This is used pervasively in the Contextual Bandit literature and you can see it in action for the system we’ve made at Microsoft Research. The key design principles are:

  1. Contextual Bandits. Many people have tried to create online learning systems that do not take into account the biasing effects of decisions. These fail near-universally. For example, they might be very good at predicting what was shown (and hence clicked on) rather than what should be shown to generate the most interest.
  2. Data Lifecycle support. This system supports the entire process of data collection, joining, learning, and deployment. Doing this eliminates many stupid-but-killer bugs that I’ve seen in practice.
  3. Modularity. The system decomposes into pieces: exploration library, client library, online learner, join server, etc… because I’ve seen too many cases where the pieces are useful but the system is not.
  4. Reproducibility. Everything is logged in a fashion which makes online behavior offline reproducible. Consequently, the system is debuggable and hence improvable.
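To make the Multiworld Testing idea concrete, here is a minimal sketch of one standard way to evaluate many policies from a single exploration log: the inverse propensity score (IPS) estimator. The log format and the policies are made up for illustration and are not the system's actual data format.

```python
def ips_value(log, policy):
    """Inverse propensity score estimate of a policy's average reward.

    Each record holds the features seen, the action taken, the probability
    with which that action was taken, and the observed reward.
    """
    total = 0.0
    for record in log:
        # Only records where the policy agrees with the logged action count,
        # reweighted by how likely that action was to be logged.
        if policy(record["features"]) == record["action"]:
            total += record["reward"] / record["prob"]
    return total / len(log)


# A toy log gathered by choosing between 2 actions uniformly at random.
log = [
    {"features": {"hour": 9},  "action": "sports", "prob": 0.5, "reward": 1.0},
    {"features": {"hour": 9},  "action": "news",   "prob": 0.5, "reward": 0.0},
    {"features": {"hour": 21}, "action": "sports", "prob": 0.5, "reward": 0.0},
    {"features": {"hour": 21}, "action": "news",   "prob": 0.5, "reward": 1.0},
]


def always_sports(features):
    return "sports"


def by_hour(features):
    return "sports" if features["hour"] < 12 else "news"


# Both policies are scored from the same log, with no new experiment.
print(ips_value(log, always_sports))  # 0.5
print(ips_value(log, by_hour))        # 1.0
```

The same log can score any number of candidate policies, which is where the efficiency gain over running a separate A/B test per policy comes from.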

The system we’ve created is open source with system components in mwt-ds and the core learning algorithms in Vowpal Wabbit. If you use everything, it enables a fully automatic causally sound learning loop for contextual control of a small number of actions. This is strongly scalable: for example, a version of this is in use for personalized news on MSN. It can be either low-latency (with a client-side library) or cross-platform (with a JSON REST web interface). Advanced exploration algorithms are available to enable better exploration strategies than simple epsilon-greedy baselines. The system autodeploys into a chosen Azure account with a baseline cost of about $0.20/hour. The autodeployment takes a few minutes, after which you can test or use the system as desired.

This system is open source and there are many ways for people to help if they are interested. For example, support for the client-side library in more languages, support of other learning algorithms & systems, better documentation, etc… are all obviously useful.

Have fun.

An ICML unworkshop

Following up on an interesting suggestion, we are creating a “Birds of a Feather Unworkshop” with a leftover room (Duffy/Columbia) on Thursday and Friday during the workshops. People interested in ad-hoc topics can post a time and place to meet and discuss. Details are here a little ways down.