Both CNTK and Vowpal Wabbit have pirate tutorials at NIPS. The CNTK tutorial is 1 hour during the lunch break of the Optimization workshop while the VW tutorial is 1 hour during the lunch break of the Extreme Multiclass workshop. Consider dropping by either if interested.
CNTK is a deep learning system started by the speech people who started the deep learning craze, and it has grown into a more general platform-independent deep learning system. It has various useful features, the most interesting of which is perhaps efficient scalable training. Using GPUs with allreduce and one-bit SGD, it achieves both high efficiency and scalability over many more GPUs than could ever fit into a single machine. This capability is unique amongst all open deep learning codebases, so everything else looks nerfed in comparison. CNTK was released in April so this is the first chance for many people to learn about it. See here for more details.
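For intuition, here is a minimal conceptual sketch of the one-bit SGD idea (quantize each gradient coordinate to a single bit and carry the quantization error forward to the next step). This illustrates the published technique rather than CNTK's actual code, and every name in it is made up.

```python
import numpy as np

def one_bit_quantize(grad, residual):
    """Conceptual 1-bit quantization with error feedback (illustration only).

    A worker sends only the sign of (gradient + carried error), scaled by a
    single shared magnitude; the quantization error is kept locally and added
    back in on the next step, so nothing is permanently lost.
    """
    adjusted = grad + residual                # fold in error carried from the last step
    scale = np.mean(np.abs(adjusted))         # one magnitude for the whole vector
    quantized = np.sign(adjusted) * scale     # 1 bit of direction per coordinate
    new_residual = adjusted - quantized       # remember what the 1-bit code missed
    return quantized, new_residual

# Toy usage: two "workers" contribute quantized gradients that are then averaged,
# standing in for the allreduce step across GPUs/machines.
rng = np.random.default_rng(0)
residuals = [np.zeros(4), np.zeros(4)]
for step in range(3):
    sent = []
    for i in range(2):
        grad = rng.normal(size=4)             # stand-in for a local minibatch gradient
        q, residuals[i] = one_bit_quantize(grad, residuals[i])
        sent.append(q)
    aggregated = np.mean(sent, axis=0)
    print("step", step, "aggregated update:", aggregated)
```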
The Vowpal Wabbit tutorial just focuses on what is new this year.
- The learning to search framework has greatly matured and is now easily used to solve ad-hoc joint (structured) prediction problems. The ICML tutorial covers algorithms/analysis, so this is about using the system.
- VW has also become the modeling element of a larger system (called the decision service) which gathers data and uses it for contextual bandit learning. This is now generally usable, and is the first general purpose system of this sort.
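To make the contextual bandit piece concrete, here is a hedged sketch of feeding logged bandit data to VW with the --cb reduction. The "action:cost:probability | features" label format and the --cb flag are standard VW; the file names and features below are invented for illustration, and the sketch assumes the vw binary is on your PATH.

```python
import subprocess

# Each line: <chosen action>:<observed cost>:<probability the logger chose that action> | features
# (made-up logged interactions for a 4-action problem)
logged = """\
1:2.0:0.5 | user_young article_sports
3:0.0:0.25 | user_young article_music
2:1.0:0.25 | user_old article_politics
4:0.0:0.5 | user_old article_music
"""

with open("cb_train.dat", "w") as f:
    f.write(logged)

# Train a contextual bandit policy over 4 actions and save the model.
subprocess.run(["vw", "--cb", "4", "cb_train.dat", "-f", "cb.model"], check=True)
```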
I just created Vowpal Wabbit 7.8, and we are planning to have an increasingly less heretical followup tutorial during the non-“ski break” at the NIPS Optimization workshop. Please join us if interested.
I always feel like things are going slowly, but looking back over the last year there have been many changes overall. Notes for 7.7 are here. Since then, there are several areas of improvement as well as generalized bug fixes and refactoring.
- Learning to Search: Hal completely rewrote the learning to search system, enough that the numbers here are looking obsolete. Kai-Wei has also created several advanced applications for entity-relation extraction and dependency parsing which are promising.
- Languages: Hal also created a good Python library, which includes callbacks for learning to search (a sketch appears after this list). You can now develop advanced structured prediction solutions in a nice language. Jonathan Morra also contributed an initial Java interface.
- Exploration: The contextual bandit subsystem now allows evaluation of an arbitrary policy, and exploration is now factored out into an independent library (principally by Luong with help from Sid and Sarah). This is critical for real applications because randomization must happen at the point of decision.
- Reductions: The learning reductions subsystem has continued to mature, although the perfectionist in me is still dissatisfied. As a consequence, it's now pretty easy to program new reductions, and the efficiency of these reductions has generally improved. Several new ones are cooking.
- Online Learning: Alekh added an online SVM implementation based on LaSVM. This is known to parallelize well via the para-active approach.
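As a taste of the Python learning-to-search hooks mentioned above, here is a sketch of a tiny sequence labeler adapted from my recollection of the pyvw examples; treat the module path, class names, and argument names as assumptions that may differ across versions.

```python
from vowpalwabbit import pyvw  # module path may vary across VW versions

class SequenceLabeler(pyvw.SearchTask):
    def __init__(self, vw, sch, num_actions):
        pyvw.SearchTask.__init__(self, vw, sch, num_actions)
        # let the framework compute Hamming loss and condition on prior predictions
        sch.set_options(sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES)

    def _run(self, sentence):
        output = []
        for n, (tag, word) in enumerate(sentence):
            ex = self.vw.example({'w': [word]})            # one namespace of word features
            pred = self.sch.predict(examples=ex,
                                    my_tag=n + 1,
                                    oracle=tag,            # gold label used as the oracle
                                    condition=[(n, 'p')])  # condition on previous prediction
            output.append(pred)
        return output

# Toy data: lists of (tag, word) pairs; tags are 1-based action ids.
data = [[(1, 'the'), (2, 'cat'), (3, 'sat')],
        [(1, 'a'),   (2, 'dog'), (3, 'ran')]]
vw = pyvw.vw("--search 3 --search_task hook --quiet")
task = vw.init_search_task(SequenceLabeler)
for _ in range(5):          # a few passes over the tiny dataset
    task.learn(data)
print(task.predict([(1, 'the'), (1, 'dog'), (1, 'sat')]))
```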
This project has grown quite a bit: about 30 different people have contributed to VW since the last release, and there is now a VW meetup (December 8th!) in the bay area that I wish I could attend.
At NIPS I'm giving a tutorial on Learning to Interact. In essence this is about dealing with causality in a contextual bandit framework. Relative to previous tutorials, I'll be covering several new results that changed my understanding of the nature of the problem. Note that Judea Pearl and Elias Bareinboim have a tutorial on causality. This might appear similar, but is quite different in practice. Pearl and Bareinboim's tutorial will be about the general concepts while mine will be about total mastery of the simplest nontrivial case, including code. Luckily, they are in the right order. I recommend going to both.
I also just released version 7.4 of Vowpal Wabbit. When I was a frustrated learning theorist, I did not understand why people were not using learning reductions to solve problems. I’ve been slowly discovering why with VW, and addressing the issues. One of the issues is that machine learning itself was not automatic enough, while another is that creating a very low overhead process for doing learning reductions is vitally important. These have been addressed well enough that we are starting to see compelling results. Various changes:
- The internal learning reduction interface has been substantially improved. It's now pretty easy to write a new learning reduction; binary.cc provides a good example. This is a very simple reduction which just binarizes the prediction. More improvements are coming, but this is good enough that other people have started contributing reductions.
- Zhen Qin had a very productive internship with Vaclav Petricek at eharmony resulting in several systemic modifications and some new reductions, including:
- A direct hash inversion implementation for use in debugging.
- A holdout system which takes over for progressive validation when multiple passes over data are used. This keeps the printouts ‘honest’.
- An online bootstrap mechanism which efficiently provides some understanding of prediction variation and which can sometimes effectively trade computation time for increased accuracy via ensembling (a conceptual sketch appears below). This will be discussed at the BigLearn workshop at NIPS.
- A top-k reduction which chooses the top-k of any set of base instances.
- Hal Daume has a new implementation of Searn (and DAgger; the two are unified in the code) which makes structured prediction solutions far more natural. He has optimized this quite thoroughly (exercising the reduction stack in the process), resulting in this pretty graph.
Here, CRF++ is commonly used conditional random field code, SVMstruct is an SVM-style approach to structured prediction, and CRF SGD is an online learning CRF approach. All of these methods use the same features. Fully optimized code is typically rough, but this implementation is less than 100 lines.
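On the online bootstrap item above: the usual online trick, which I believe is the spirit of what is implemented here (the details are VW-internal), is to give each ensemble member a Poisson(1)-distributed weight for every example, approximating bootstrap resampling in a single pass. A conceptual sketch with a made-up base learner interface:

```python
import numpy as np

rng = np.random.default_rng(1)

class OnlineBootstrap:
    """Online bootstrap via Poisson(1) example weights (conceptual, not VW's code)."""

    def __init__(self, make_learner, k=10):
        # k independent copies of a base online learner (interface is hypothetical)
        self.members = [make_learner() for _ in range(k)]

    def learn(self, x, y):
        for member in self.members:
            weight = rng.poisson(1.0)        # how many times this member "sees" the example
            if weight > 0:
                member.learn(x, y, weight)   # assumes the base learner accepts a weight

    def predict(self, x):
        preds = np.array([m.predict(x) for m in self.members])
        return preds.mean(), preds.std()     # ensemble average plus a variability estimate
```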
I'm trying to put together a tutorial on these things at NIPS during the workshop break on the 9th, and will add details as that resolves, for those interested enough to skip out on skiing.
Edit: The VW tutorial will take place during the break at the big learning workshop from 1:30pm – 3pm at Harveys Emerald Bay B.
A new version of VW is out. The primary changes are:
- Learning Reductions: I've wanted to get learning reductions working and we've finally done it. Not everything is implemented yet, but VW now directly supports (a usage sketch follows this list):
- Multiclass Classification: --oaa or --ect.
- Cost Sensitive Multiclass Classification: --csoaa or --wap.
- Contextual Bandit Classification: --cb.
- Sequential Structured Prediction: --searn or --dagger.
In addition, it is now easy to build your own custom learning reductions for various plausible uses: feature diddling, custom structured prediction problems, or alternate learning reductions. This effort is far from done, but it is now in a generally useful state. Note that all learning reductions inherit the ability to do cluster parallel learning.
- Library interface: VW now has a basic library interface. The library provides most of the functionality of VW, with the limitation that it is monolithic and nonreentrant. These will be improved over time.
- Windows port: The priority of a Windows port jumped way up once we moved to Microsoft. The only feature which we know doesn't work at present is automatic backgrounding when in daemon mode.
- New update rule: Stephane visited us this summer, and we fixed the default online update rule so that it is unit invariant.
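As a quick illustration of the multiclass reductions listed above, here is a hedged command-line sketch (driven from Python for consistency with the other examples). The --oaa/--ect flags and the "label | features" input format are standard VW; the file names and features are made up, and the vw binary is assumed to be on your PATH.

```python
import subprocess

# Multiclass training data: an integer label in {1..3}, then "| features".
train = """\
1 | color_red shape_round
2 | color_yellow shape_long
3 | color_green shape_round
"""
with open("mc_train.dat", "w") as f:
    f.write(train)

# One-against-all over 3 classes; swap --oaa for --ect to use the error-correcting tournament.
subprocess.run(["vw", "--oaa", "3", "mc_train.dat", "-f", "mc.model"], check=True)

# Predict on an unlabeled example with the saved model (-t: test only, -p: write predictions).
with open("mc_test.dat", "w") as f:
    f.write("| color_yellow shape_long\n")
subprocess.run(["vw", "-t", "-i", "mc.model", "mc_test.dat", "-p", "preds.txt"], check=True)
```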
There are also many other small updates including some contributed utilities that aid the process of applying and using VW.
Plans for the near future involve improving the quality of various items above, and of course better documentation: several of the reductions are not yet well documented.
There are a handful of basic code patterns that I wish I was more aware of when I started research in machine learning. Each on its own may seem pointless, but collectively they go a long way towards making the typical research workflow more efficient. Here they are:
- Separate code from data.
- Separate input data, working data and output data.
- Save everything to disk frequently.
- Separate options from parameters.
- Do not use global variables.
- Record the options used to generate each run of the algorithm.
- Make it easy to sweep options.
- Make it easy to execute only portions of the code.
- Use checkpointing.
- Write demos and tests.
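A few of these patterns (separating options from parameters, recording the options for each run, and checkpointing) can be illustrated in a few lines of Python; the directory layout and names below are just one possible convention.

```python
import argparse
import json
import os
import pickle

# Options are what you set from the command line; parameters are what the algorithm learns.
parser = argparse.ArgumentParser()
parser.add_argument("--learning-rate", type=float, default=0.1)
parser.add_argument("--passes", type=int, default=10)
parser.add_argument("--workdir", default="runs/exp001")
opts = parser.parse_args()

os.makedirs(opts.workdir, exist_ok=True)

# Record the exact options used to generate this run.
with open(os.path.join(opts.workdir, "options.json"), "w") as f:
    json.dump(vars(opts), f, indent=2)

# Checkpoint the learned parameters frequently so long runs can be resumed or inspected.
params = {"weights": [0.0] * 5}
for epoch in range(opts.passes):
    # ... update params on your data here ...
    with open(os.path.join(opts.workdir, f"checkpoint_{epoch:03d}.pkl"), "wb") as f:
        pickle.dump(params, f)
```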
Click here for discussion and examples for each item. Also see Charles Sutton’s and HackerNews’ thoughts on the same topic.
My guess is that these patterns will not only be useful for machine learning, but also for any other computational work that involves either a) processing large amounts of data, or b) running algorithms that take a significant amount of time to execute. Share this list with your students and colleagues. Trust me, they'll appreciate it.