We made a tool that you can use. It is the first general purpose reinforcement-based learning system
Reinforcement learning is much discussed these days with successes like AlphaGo. Wouldn’t it be great if Reinforcement Learning algorithms could easily be used to solve all reinforcement learning problems? But there is a well-known problem: It’s very easy to create natural RL problems for which all standard RL algorithms (epsilon-greedy Q-learning, SARSA, etc…) fail catastrophically. That’s a serious limitation which both inspires research and which I suspect many people need to learn the hard way.
Removing the credit assignment problem from reinforcement learning yields the Contextual Bandit setting which we know is generically solvable in the same manner as common supervised learning problems. I know of about a half-dozen real-world successful contextual bandit applications typically requiring the cooperation of engineers and deeply knowledgeable data scientists.
Can we make this dramatically easier? We need a system that explores over appropriate choices with logging of features, actions, probabilities of actions, and outcomes. These must then be fed into an appropriate learning algorithm which trains a policy and then deploys the policy at the point of decision. Naturally, this is what we’ve done and now it can be used by anyone. This drops the barrier to use down to: “Do you have permissions? And do you have a reasonable idea of what a good feature is?”
A key foundational idea is Multiworld Testing: the capability to evaluate large numbers of policies mapping features to action in a manner exponentially more efficient than standard A/B testing. This is used pervasively in the Contextual Bandit literature and you can see it in action for the system we’ve made at Microsoft Research. The key design principles are:
- Contextual Bandits. Many people have tried to create online learning system that do not take into account the biasing effects of decisions. These fail near-universally. For example they might be very good at predicting what was shown (and hence clicked on) rather that what should be shown to generate the most interest.
- Data Lifecycle support. This system supports the entire process of data collection, joining, learning, and deployment. Doing this eliminates many stupid-but-killer bugs that I’ve seen in practice.
- Modularity. The system decomposes into pieces: exploration library, client library, online learner, join server, etc… because I’ve seen to many cases where the pieces are useful but the system is not.
- Reproducibility. Everything is logged in a fashion which makes online behavior offline reproducible. Consequently, the system is debuggable and hence improvable.
The system we’ve created is open source with system components in mwt-ds and the core learning algorithms in Vowpal Wabbit. If you use everything it enables a fully automatic causally sound learning loop for contextual control of a small number of actions. This is strongly scalable, for example a version of this is in use for personalized news on MSN. It can be either low-latency (with a client side library) or cross platform (with a JSON REST web interface). Advanced exploration algorithms are available to enable better exploration strategies than simple epsilon-greedy baselines. The system autodeploys into a chosen Azure account with a baseline cost of about $0.20/hour. The autodeployment takes a few minutes after which you can test or use the system as desired.
This system is open source and there are many ways for people to help if they are interested. For example, support for the client-side library in more languages, support of other learning algorithms & systems, better documentation, etc… are all obviously useful.
Both CNTK and Vowpal Wabbit have pirate tutorials at NIPS. The CNTK tutorial is 1 hour during the lunch break of the Optimization workshop while the VW tutorial is 1 hour during the lunch break of the Extreme Multiclass workshop. Consider dropping by either if interested.
CNTK is a deep learning system started by the speech people who started the deep learning craze and grown into a more general platform-independent deep learning system. It has various useful features, the most interesting of which is perhaps efficient scalable training. Using GPUs with allreduce and one-bit sgd it achieves both high efficiency and scalability over many more GPUs than could ever fit into a single machine. This capability is unique amongst all open deep learning codebases so everything else looks nerfed in comparison. CNTK was released in April so this is the first chance for many people to learn about it. See here for more details.
The Vowpal Wabbit tutorial just focuses on what is new this year.
- The learning to search framework has greatly matured and is now easily used to solve ad-hoc joint(structured) prediction problems. The ICML tutorial covers algorithms/analysis so this is about using the system.
- VW has also become the modeling element of a larger system (called the decision service) which gathers data and uses it as per Contextual Bandit learning. This is now generally usable, and is the first general purpose system of this sort.
I just created Vowpal Wabbit 7.8, and we are planning to have an increasingly less heretical followup tutorial during the non-“ski break” at the NIPS Optimization workshop. Please join us if interested.
I always feel like things are going slow, but in the last year, but there have been many changes overall. Notes for 7.7 are here. Since then, there are several areas of improvement as well as generalized bug fixes and refactoring.
- Learning to Search: Hal completely rewrote the learning to search system, enough that the numbers here are looking obsolete. Kai-Wei has also created several advanced applications for entity-relation and dependency parsing which are promising.
- Languages Hal also created a good python library, which includes call-backs for learning to search. You can now develop advanced structured prediction solutions in a nice language. Jonathan Morra also contributed an initial Java interface.
- Exploration The contextual bandit subsystem now allows evaluation of an arbitrary policy, and an exploration library is now factored out into an independent library (principally by Luong with help from Sid and Sarah). This is critical for real applications because randomization must happen at the point of decision.
- Reductions The learning reductions subsystem has continued to mature, although the perfectionist in me is still dissatisfied. As a consequence, it’s now pretty easy to program new reductions, and the efficiency of these reductions has generally improved. Several news ones are cooking.
- Online Learning Alekh added an online SVM implementation based on LaSVM. This is known to parallelize well via the para-active approach.
This project has grown quite a bit—there are about 30 different people contributing to VW since the last release, and there is now a VW meetup (December 8th!) in the bay area that I wish I could attend.
Comments Off on Vowpal Wabbit 7.8 at NIPS
At NIPS I’m giving a tutorial on Learning to Interact. In essence this is about dealing with causality in a contextual bandit framework. Relative to previous tutorials, I’ll be covering several new results that changed my understanding of the nature of the problem. Note that Judea Pearl and Elias Bareinboim have a tutorial on causality. This might appear similar, but is quite different in practice. Pearl and Bareinboim’s tutorial will be about the general concepts while mine will be about total mastery of the simplest nontrivial case, including code. Luckily, they have the right order. I recommend going to both
I also just released version 7.4 of Vowpal Wabbit. When I was a frustrated learning theorist, I did not understand why people were not using learning reductions to solve problems. I’ve been slowly discovering why with VW, and addressing the issues. One of the issues is that machine learning itself was not automatic enough, while another is that creating a very low overhead process for doing learning reductions is vitally important. These have been addressed well enough that we are starting to see compelling results. Various changes:
- The internal learning reduction interface has been substantially improved. It’s now pretty easy to write new learning reduction. binary.cc provides a good example. This is a very simple reduction which just binarizes the prediction. More improvements are coming, but this is good enough that other people have started contributing reductions.
- Zhen Qin had a very productive internship with Vaclav Petricek at eharmony resulting in several systemic modifications and some new reductions, including:
- A direct hash inversion implementation for use in debugging.
- A holdout system which takes over for progressive validation when multiple passes over data are used. This keeps the printouts ‘honest’.
- An online bootstrap mechanism system which efficiently provides some understanding of prediction variations and which can sometimes effectively trade computational time for increased accuracy via ensembling. This will be discussed at the biglearn workshop at NIPS.
- A top-k reduction which chooses the top-k of any set of base instances.
- Hal Daume has a new implementation of Searn (and Dagger, the codes are unified) which makes structured prediction solutions far more natural. He has optimized this quite thoroughly (exercising the reduction stack in the process), resulting in this pretty graph.
Here, CRF++ is commonly used conditional random field code, SVMstruct is an SVM-style approach to classification, and CRF SGD is an online learning CRF approach. All of these methods use the same features. Fully optimized code is typically rough, but this one is less than 100 lines.
I’m trying to put together a tutorial on these things at NIPS during the workshop break on the 9th and will add details as that resolves for those interested enough to skip out on skiing
Edit: The VW tutorial will take place during the break at the big learning workshop from 1:30pm – 3pm at Harveys Emerald Bay B.
A new version of VW is out. The primary changes are:
- Learning Reductions: I’ve wanted to get learning reductions working and we’ve finally done it. Not everything is implemented yet, but VW now supports direct:
- Multiclass Classification –oaa or –ect.
- Cost Sensitive Multiclass Classification –csoaa or –wap.
- Contextual Bandit Classification –cb.
- Sequential Structured Prediction –searn or –dagger
In addition, it is now easy to build your own custom learning reductions for various plausible uses: feature diddling, custom structured prediction problems, or alternate learning reductions. This effort is far from done, but it is now in a generally useful state. Note that all learning reductions inherit the ability to do cluster parallel learning.
- Library interface: VW now has a basic library interface. The library provides most of the functionality of VW, with the limitation that it is monolithic and nonreentrant. These will be improved over time.
- Windows port: The priority of a windows port jumped way up once we moved to Microsoft. The only feature which we know doesn’t work at present is automatic backgrounding when in daemon mode.
- New update rule: Stephane visited us this summer, and we fixed the default online update rule so that it is unit invariant.
There are also many other small updates including some contributed utilities that aid the process of applying and using VW.
Plans for the near future involve improving the quality of various items above, and of course better documentation: several of the reductions are not yet well documented.