I went to the European Workshop on Reinforcement Learning and NIPS last month and saw several interesting things.
At EWRL, I particularly liked the talks from:
- Remi Munos on off-policy evaluation
- Mohammad Ghavamzadeh on learning safe policies
- Emma Brunskill on optimizing biased-but safe estimators (sense a theme?)
- Sergey Levine on low sample complexity applications of RL in robotics.
My talk is here. Overall, this was a well organized workshop with diverse and interesting subjects, with the only caveat being that they had to limit registration
At NIPS itself, I found the poster sessions fairly interesting.
- Allen-Zhu and Hazan had a new notion of a reduction (video).
- Zhao, Poupart, and Gordon had a new way to learn Sum-Product Networks
- Ho, Littman, MacGlashan, Cushman, and Austerwell, had a paper on how “Showing” is different from “Doing”.
- Toulis and Parkes had a paper on estimation of long term causal effects.
- Rae, Hunt, Danihelka, Harley, Senior, Wayne, Graves, and Lillicrap had a paper on large memories with neural networks.
- Hardt, Price, and Srebro, had a paper on Equal Opportunity in ML.
Format-wise, I thought the 2 sessions was better than 1, but I really would have preferred more. The recorded spotlights are also pretty cool.
The NIPS workshops were great, although I was somewhat reminded of kindergarten soccer in terms of lopsided attendance. This may be inevitable given how hot the field is, but I think it’s important for individual researchers to remember that:
- There are many important directions of research.
- You personally have a much higher chance of doing something interesting if everyone else is not doing it also.
During the workshops, I learned about ADAM (a momentum form of Adagrad), testing ML systems, and that even TenserFlow is finally looking into synchronous updates for parallel learning (allreduce is the way).
(edit: added one)
I just released Vowpal Wabbit 8.3 and we are planning a tutorial at NIPS Saturday over the lunch break in the ML systems workshop. Please join us if interested.
8.3 should be backwards compatible with all 8.x series. There have been big changes since the last version related to
- Contextual bandits, particularly w.r.t. the decision service.
- Learning to search for which we have a paper at NIPS.
- Logarithmic time multiclass classification.
The ICML 2016 videos are out.
I also wanted to share some statistics from registration that might be of general interest.
The total number of people attending: 3103.
Industry: 47% University: 46%
Male: 83% Female: 14%
Local (NY, NJ, or CT): 27%
North America: 70% Europe: 18% Asia: 9% Middle East: 2% Remainder: <1% including 2 from Antarctica
I had a fantastic time at ICML 2016— I learned a great deal. There was far more good stuff than I could see, and it was exciting to catch up on recent advances.
David Silver gave one of the best tutorials I’ve seen on his group’s recent work in “deep” reinforcement learning. I learned about a few new techniques, including the benefits of asychrononous updates in distributed Q-learning https://arxiv.org/abs/1602.01783
, which was presented in more detail at the main conference. The new domains being explored were exciting, as were the improvements made on the computational side. I would love to seen more pointers to some of the related work from the tutorial, particularly given there was such an exciting mix of new techniques and old staples (e.g. experience replay http://www.dtic.mil/dtic/tr/fulltext/u2/a261434.pdf
), but the talk was so information packed it would have been difficult.
It was rumored that Aviv Tamar gave an exciting talk (I believe on this http://arxiv.org/abs/1602.02867
) , but I was forced to miss it to see Rong Ge’s https://users.cs.duke.edu/~rongge/
outstanding talk on a new-ish geometric tool for understanding non-convex optimization, the strict saddle.
I first read about the approach here http://arxiv.org/abs/1503.02101
, but at ICML he and other authors have demonstrated a remarkable number of problems that have this property that enables efficient optimization via an stochastic gradient descent (and other) procedures.
This was a theme of ICML— an incredible amount of good material, so much that I barely saw the posters at all because there was nearly always a talk I wanted to see!
Rocky Duan surveyed some benchmark RL continuous control problems http://jmlr.org/proceedings/papers/v48/duan16.pdf
An interesting theme of the conference— and came up in conversation with John Schulman and Yann LeCun– was really old methods working well. In fact, this group demonstrated that variants of the natural/covariant policy gradient proposed originally by Sham Kakade (with a derivation here: http://repository.cmu.edu/cgi/viewcontent.cgi?article=1080&context=robotics
) are largely at the state-of-the-art on many benchmark problems. There are some clever tricks necessary for large policy classes like neural networks (like using a partial-least squares-style truncated conjugate gradient to solve for the change in policy in the usual F \delta = \nabla one solves in the natural gradient procedure) that dramatically improve performance (https://arxiv.org/abs/1502.05477
). I had begun to view these methods as doing little better (or worse) then black-box search, so it’s exciting to see them make a comeback.
Chelsea Finn http://people.eecs.berkeley.edu/~cbfinn/
gave an outstanding talk on this work https://arxiv.org/abs/1603.00448
. She and co-authors (Sergey Levine and Pieter) effectively came up with a technique that lets one apply Maximum Entropy Inverse Optimal Control without the double-loop procedure and using policy gradient techniques. Jonathan Ho described a related algorithm http://jmlr.org/proceedings/papers/v48/ho16.pdf
that also appeared to mix policy gradient and an optimization over cost functions. Both are definitely on my reading list, and I want to understand the trade-offs of the techniques.
Both presentations were informative, and both made the interesting connection to Generative Adversarial Nets (GANS) http://arxiv.org/abs/1406.2661
. These were also a theme of the conference in both talks and during discussions. A very cool idea getting more traction, and being embraced by the neural net pioneers.
David Belanger https://people.cs.umass.edu/~belanger/belanger_spen_icml.pdf
gave a interesting talk on using backprop to optimize a structured output relative to a a learned cost function. I left thinking the technique was closely related to inverse optimal control methods and the GANs, and wanting understand how implicit differentiation wasn’t being used to optimize the energy function parameters.
Speaking of neural net pioneers— there was lots of good talks during both the main conference and workshops on what’s new — and what’s old https://sites.google.com/site/nnb2tf/
— in neural network architectures and algorithms.
Ian Osband gave an amazing talk on another topic that previously made me despair: exploration in RL http://jmlr.org/proceedings/papers/v48/osband16.pdf
. This is one of few approaches that combines the ability to function approximation with rigorous exploration guarantees/sample complexity in the tabular case (and amazingly *better* sample complexity then previous papers that work only in the tabular case). Super cool and also very high on my reading list.
Boaz Barak http://www.boazbarak.org/
gave a truly inspired talk that mixed a kind of coherent computationally-bounded Bayesian-ism (Slogan: ”Compute like a frequentist, think like a Bayesian.”) with demonstrating a lower bound for SoS procedures. Well outside of my expertise, but delivered in a way that made you feel like you understood all of it.
Honglak Lee gave an exciting talk on the benefits of semi-supervision in CNNs http://web.eecs.umich.edu/~honglak/icml2016-CNNdec.pdf
. The authors demonstrated that a remarkable amount of information needed to reproduce an input image was preserved quite deep in CNNs, and further that encouraging the ability to reconstruct could significantly enhance discriminative performance on real benchmarks.
The problem with this ICML is that I think it would take literally weeks of reading/watching talks to really absorb the high quality work that was presented. I’m *very* grateful to the organizing committee http://icml.cc/2016/?page_id=39
for making it so valuable.
We made a tool that you can use. It is the first general purpose reinforcement-based learning system
Reinforcement learning is much discussed these days with successes like AlphaGo. Wouldn’t it be great if Reinforcement Learning algorithms could easily be used to solve all reinforcement learning problems? But there is a well-known problem: It’s very easy to create natural RL problems for which all standard RL algorithms (epsilon-greedy Q-learning, SARSA, etc…) fail catastrophically. That’s a serious limitation which both inspires research and which I suspect many people need to learn the hard way.
Removing the credit assignment problem from reinforcement learning yields the Contextual Bandit setting which we know is generically solvable in the same manner as common supervised learning problems. I know of about a half-dozen real-world successful contextual bandit applications typically requiring the cooperation of engineers and deeply knowledgeable data scientists.
Can we make this dramatically easier? We need a system that explores over appropriate choices with logging of features, actions, probabilities of actions, and outcomes. These must then be fed into an appropriate learning algorithm which trains a policy and then deploys the policy at the point of decision. Naturally, this is what we’ve done and now it can be used by anyone. This drops the barrier to use down to: “Do you have permissions? And do you have a reasonable idea of what a good feature is?”
A key foundational idea is Multiworld Testing: the capability to evaluate large numbers of policies mapping features to action in a manner exponentially more efficient than standard A/B testing. This is used pervasively in the Contextual Bandit literature and you can see it in action for the system we’ve made at Microsoft Research. The key design principles are:
- Contextual Bandits. Many people have tried to create online learning system that do not take into account the biasing effects of decisions. These fail near-universally. For example they might be very good at predicting what was shown (and hence clicked on) rather that what should be shown to generate the most interest.
- Data Lifecycle support. This system supports the entire process of data collection, joining, learning, and deployment. Doing this eliminates many stupid-but-killer bugs that I’ve seen in practice.
- Modularity. The system decomposes into pieces: exploration library, client library, online learner, join server, etc… because I’ve seen to many cases where the pieces are useful but the system is not.
- Reproducibility. Everything is logged in a fashion which makes online behavior offline reproducible. Consequently, the system is debuggable and hence improvable.
The system we’ve created is open source with system components in mwt-ds and the core learning algorithms in Vowpal Wabbit. If you use everything it enables a fully automatic causally sound learning loop for contextual control of a small number of actions. This is strongly scalable, for example a version of this is in use for personalized news on MSN. It can be either low-latency (with a client side library) or cross platform (with a JSON REST web interface). Advanced exploration algorithms are available to enable better exploration strategies than simple epsilon-greedy baselines. The system autodeploys into a chosen Azure account with a baseline cost of about $0.20/hour. The autodeployment takes a few minutes after which you can test or use the system as desired.
This system is open source and there are many ways for people to help if they are interested. For example, support for the client-side library in more languages, support of other learning algorithms & systems, better documentation, etc… are all obviously useful.