Today brings a new release of the Vowpal Wabbit fast online learning software. This time, unlike the previous release, the project itself is going open source, with development happening on GitHub. For example, the latest and greatest can be downloaded via:
git clone git://github.com/JohnLangford/vowpal_wabbit.git
If you aren’t familiar with git, it’s a distributed version control system which supports quick and easy branching, as well as reconciliation.
This version of the code is confirmed to compile without complaint on at least some flavors of OSX as well as Linux boxes.
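For anyone who wants to try it out, building is intended to be simple. The following is only a sketch: it assumes g++ and the Boost program_options library are installed, and that the included Makefile works on your platform.

cd vowpal_wabbit
make   # builds the vw executable in the source directory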
As much of the point of this project is pushing the limits of fast and effective machine learning, let me mention a few datapoints from my experience.
- The program can effectively scale up to batch-style training on sparse terafeature (i.e. 10^12 sparse feature) size datasets. The limiting factor is typically i/o.
- I started using the real datasets from the large-scale learning workshop as a convenient benchmark. The largest dataset takes about 10 minutes. (This is using the native features that the organizers intended as a starting point, which all contestants nevertheless used. In some cases, that admittedly gives performance nowhere near optimal.)
- After using this program for a while, it has substantially altered my perception of what a large-scale learning problem is. This causes confusion when people brag about computational performance on tiny datasets with only 10^5 examples 🙂
I would also like to emphasize that this is intended as an open source project rather than merely a code drop, as occurred last time. What I think this project has to offer researchers is an infrastructure for implementing fast online algorithms. It’s reasonably straightforward to implement your own tweaked algorithm, automatically gaining the substantial benefits of the surrounding code that deals with file formats, file caching, buffering, etc. In a very real sense, most of the code is this surrounding stuff, which you don’t have to rewrite to benefit from. For people applying machine learning, there is some obvious value in getting very fast feedback in a batch setting, as well as having an algorithm that actually works in a real online setting.
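As a rough illustration of the workflow this infrastructure enables, here is a hedged sketch of a typical invocation. The flag names follow common vw usage and may differ slightly in this release, and the file names (train.dat, model.reg, test.dat) are just placeholders; check the built-in help for the exact options.

# The first pass over the text format writes a binary cache; later passes
# reuse it, which matters because the limiting factor is typically i/o.
./vw -d train.dat --cache_file train.cache --passes 10 -f model.reg
# Apply the learned regressor to new data as it streams by (-t means test only).
./vw -d test.dat -t -i model.reg -p predictions.txt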
As one example of the ability to reuse the code for other purposes, an effective general purpose online implementation of the Offset Tree is included. I haven’t seen any other implementation of an algorithm for learning in the agnostic partial label setting (where, for each example, you only observe feedback for the action actually taken rather than the correct label), so this code may be of substantial interest to people encountering these sorts of problems.
The difference between this version and the previous is a nearly total rewrite. Some bigger changes are:
- We dropped SEG for now, for code-complexity reasons.
- Multicore parallelization proceeds in a different fashion—parallelization over features instead of examples. This works better with caches. Note that all parallelization of the core algorithm is meaningless unless you use the -q flag, because otherwise you are i/o bound.
- The code is more deeply threaded, with a separate thread for parsing.
- There is support for several different loss functions, and it’s easy to add your own (see the example after this list).
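To make the last two points concrete, here is a hedged example of how these pieces might be combined on the command line; again, the flag names follow common vw conventions and may differ slightly in this release.

# -q ab crosses features between namespaces a and b (quadratic features),
# which is the case where parallelization over features pays off;
# --loss_function selects one of the built-in losses (e.g. squared or logistic).
./vw -d train.dat --cache_file train.cache --passes 5 -q ab --loss_function logistic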
I’m interested in any bug reports or suggestions for the code. I have substantial confidence that this code can do interesting and useful things, but improving it is a constant and ongoing process.