Vowpal Wabbit Open Source Project

Today brings a new release of the Vowpal Wabbit fast online learning software. This time, unlike the previous release, the project itself is going open source, with development happening on GitHub. For example, the latest and greatest can be downloaded via:

git clone git://github.com/JohnLangford/vowpal_wabbit.git

If you aren’t familiar with git, it’s a distributed version control system which supports quick and easy branching, as well as reconciliation.

This version of the code is confirmed to compile without complaint on at least some flavors of OSX as well as Linux boxes.

As much of the point of this project is pushing the limits of fast and effective machine learning, let me mention a few datapoints from my experience.

The program can effectively scale up to batch-style training on sparse terafeature (i.e. 10¹² sparse feature) size datasets. The limiting factor is typically i/o.
I started using the real datasets from the large-scale learning workshop as a convenient benchmark. The largest dataset takes about 10 minutes. (This is using the native features, which the organizers intended as a starting point but which all contestants ended up using. In some cases, that admittedly gives you performance nowhere near optimal.)
Using this program for a while has substantially altered my perception of what counts as a large-scale learning problem. This causes confusion when people brag about computational performance on tiny datasets with only 10⁵ examples 🙂

I would also like to emphasize that this is intended as an open source project rather than merely a code drop, as occurred last time. What I think this project has to offer researchers is an infrastructure for implementing fast online algorithms. It’s reasonably straightforward to implant your own tweaked algorithm, automatically gaining the substantial benefits of the surrounding code that deals with file formats, file caching, buffering, etc… In a very real sense, most of the code is this surrounding stuff, which you don’t have to rewrite to benefit from. For people applying machine learning, there is some obvious value in getting very fast feedback in a batch setting, as well as having an algorithm that actually works in a real online setting.
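To make that division of labor concrete, here is a minimal sketch in Python (chosen for brevity; the project itself is C++) of a shared online-learning driver with a pluggable update rule. The names train_online, predict, and gradient_descent_update are hypothetical and do not correspond to functions in the Vowpal Wabbit source. The real code also handles parsing, caching, and buffering, all of which are omitted here.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Illustrative types only: (label, {feature index: feature value}).
SparseExample = Tuple[float, Dict[int, float]]
UpdateRule = Callable[[List[float], Dict[int, float], float, float], None]


def predict(weights: List[float], features: Dict[int, float]) -> float:
    """Sparse dot product: only the weights of present features are touched."""
    return sum(weights[i] * v for i, v in features.items())


def gradient_descent_update(weights: List[float], features: Dict[int, float],
                            label: float, learning_rate: float) -> None:
    """Squared-loss gradient step, applied in place to the touched weights."""
    error = predict(weights, features) - label
    for i, v in features.items():
        weights[i] -= learning_rate * error * v


def train_online(examples: Iterable[SparseExample], num_weights: int,
                 update: UpdateRule, learning_rate: float = 0.1):
    """Shared driver: stream the examples once, calling the pluggable rule."""
    weights = [0.0] * num_weights
    total_loss = 0.0
    count = 0
    for label, features in examples:
        # Score each example before learning from it.
        total_loss += (predict(weights, features) - label) ** 2
        count += 1
        update(weights, features, label, learning_rate)
    return weights, total_loss / max(count, 1)
```

Swapping in a tweaked algorithm then means passing a different update function to train_online while the driver stays unchanged.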

As one example of the ability to reuse the code for other purposes, an effective general purpose online implementation of the Offset Tree is included. I haven’t seen any other implementation of an algorithm for learning in the agnostic partial label setting, so this code may be of substantial interest for people encountering these sorts of problems.

This version is a nearly total rewrite of the previous one. Some of the bigger changes are:

We dropped SEG for now, for reasons of code complexity.
Multicore parallelization proceeds in a different fashion: it is now over features instead of examples. This works better with caches. Note that parallelizing the core algorithm is pointless unless you use the -q flag, because otherwise you are i/o bound.
The code is more deeply threaded, with a separate thread for parsing.
There is support for several different loss functions, and it’s easy to add your own.
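As an illustration of the last point, here is a minimal sketch in Python (not the project's C++) of how a learner can treat the loss as a pluggable pair of functions. The Loss class, the sgd_step helper, and the choice of squared and logistic loss are assumptions made for the example; they are not taken from the Vowpal Wabbit source and say nothing about which losses the release actually ships.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Loss:
    """A loss is its value plus its derivative with respect to the prediction."""
    value: Callable[[float, float], float]
    derivative: Callable[[float, float], float]


# Squared loss: 0.5 * (p - y)^2, with derivative p - y.
squared = Loss(
    value=lambda p, y: 0.5 * (p - y) ** 2,
    derivative=lambda p, y: p - y,
)

# Logistic loss for labels in {-1, +1}: log(1 + exp(-y * p)).
# Very large |y * p| would overflow math.exp; this sketch ignores that.
logistic = Loss(
    value=lambda p, y: math.log1p(math.exp(-y * p)),
    derivative=lambda p, y: -y / (1.0 + math.exp(y * p)),
)


def sgd_step(weights: Dict[int, float], features: Dict[int, float],
             label: float, loss: Loss, learning_rate: float) -> float:
    """One gradient step on a sparse example; returns the pre-update prediction."""
    prediction = sum(weights.get(i, 0.0) * v for i, v in features.items())
    slope = loss.derivative(prediction, label)
    for i, v in features.items():
        # Chain rule: d loss / d w_i = (d loss / d prediction) * x_i.
        weights[i] = weights.get(i, 0.0) - learning_rate * slope * v
    return prediction
```

Under this design, adding a new loss means supplying one more value/derivative pair; the update loop itself does not change.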

I’m interested in any bug reports or suggestions for the code. I have substantial confidence that this code can do interesting and useful things, but improving it is a constant and ongoing process.

9 Replies to “Vowpal Wabbit Open Source Project”

Drew Bagnell says:

8/1/2009 at 2:56 pm

John,

What’s the difficulty faced with SEG?
1. jl says:
  
  8/1/2009 at 3:47 pm
  
  The computational difficulty is due to the need to normalize. This arises in two ways—in the actual act of computing the normalization and in the requirement that each feature be distinct, which requires substantial computation.
  
  I’m not opposed to SEG. An alternate story is that it’s almost as fast as GD when you are i/o bound, which is common. Perhaps we’ll reimplement it at some point.
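The normalization cost can be seen in a small sketch, written in Python for brevity. It assumes SEG behaves like a standard exponentiated gradient update with weights kept on the probability simplex, which may differ from what the earlier release did; gd_step and eg_step are hypothetical names. Using a dict for the features makes each feature index distinct by construction, so the sketch sidesteps the second cost mentioned above rather than modeling it.

```python
import math
from typing import Dict, List


def gd_step(weights: List[float], features: Dict[int, float],
            label: float, learning_rate: float) -> None:
    """Gradient descent on squared loss: touches only the present features."""
    error = sum(weights[i] * v for i, v in features.items()) - label
    for i, v in features.items():
        weights[i] -= learning_rate * error * v


def eg_step(weights: List[float], features: Dict[int, float],
            label: float, learning_rate: float) -> None:
    """Exponentiated gradient on squared loss, followed by renormalization."""
    error = sum(weights[i] * v for i, v in features.items()) - label
    for i, v in features.items():
        # Multiplicative update on the present features only.
        weights[i] *= math.exp(-learning_rate * error * v)
    # Renormalizing to sum 1 rescales every weight, present or not, so this
    # naive version costs time proportional to the full weight vector.
    total = sum(weights)
    for i in range(len(weights)):
        weights[i] /= total
```

In this naive form, the gradient descent step costs time proportional to the number of features in the example, while the exponentiated gradient step also pays for a pass over all the weights.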
Ryan says:

2/9/2010 at 12:16 pm

I’m having problems compiling on Snow Leopard. Any pointers on how to fix this?

Error:

g++ -Wall -march=nocona -ffast-math -fno-strict-aliasing -D_FILE_OFFSET_BITS=64 -I /usr/local/boost_1_34_1 -O3 -c multisource.cc -o multisource.o
In file included from multisource.h:4,
from multisource.cc:1:
example.h:45: error: ‘pthread_mutex_t’ does not name a type
make: *** [multisource.o] Error 1

———————–
Environment

os: Darwin RNz.local 10.2.0 Darwin Kernel Version 10.2.0: Tue Nov 3 10:37:10 PST 2009; root:xnu-1486.2.11~1/RELEASE_I386 i386 i386
g++: i686-apple-darwin10-g++-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5646) (dot 1)
boost: 1.34.1
vw: latest from git
1. jl says:
  
  2/10/2010 at 4:00 pm
  
  It looks like an include bug, in particular perhaps a failure to include pthreads. Maybe try:
  
  #include <pthread.h>
Ryan says:

2/11/2010 at 7:37 am

Thanks for the response. Adding the pthread include fixed the problem. Also, I was able to compile with Boost 1.42.0.

If you need someone to test future releases on OS X, ping me.
Thomas says:

7/16/2010 at 8:58 am

Had the same problem as Ryan on Snow Leopard. I added #include <pthread.h> to example.h and now it works.
Probably a stupid question, but: is there any advantage to running it as a daemon?
Albert says:

3/23/2011 at 1:08 am

Hi, I got a different error when I compiled the latest code under Mac Snow Leopard

g++ -Wall -march=native -O3 -fomit-frame-pointer -ffast-math -fno-strict-aliasing -D_FILE_OFFSET_BITS=64 -I /usr/local/include -c hash.cc -o hash.o
hash.cc:1: error: bad value (native) for -march= switch
hash.cc:1: error: bad value (native) for -mtune= switch
make: *** [hash.o] Error 1
1. jl says:
  
  3/23/2011 at 6:43 am
  
  This is a limitation of the compiler on that version of Mac OS X. A later version fixes this, but in the meantime you can work around it by removing -march=native from the Makefile.
Sheng says:

6/29/2011 at 9:07 pm

I got a compile error on snow leopard. How can I fix this?

g++ -march=nocona -Wall -Werror -O3 -fomit-frame-pointer -ffast-math -fno-strict-aliasing -D_FILE_OFFSET_BITS=64 -I /usr/local/include -c global_data.cc -o global_data.o
In file included from global_data.h:11,
from global_data.cc:4:
parse_regressor.h:10:37: error: boost/program_options.hpp: No such file or directory
In file included from multisource.h:5,
from global_data.cc:5:
parser.h:55:37: error: boost/program_options.hpp: No such file or directory
In file included from global_data.h:11,
from global_data.cc:4:
parse_regressor.h:12: error: ‘boost’ has not been declared
parse_regressor.h:12: error: ‘program_options’ is not a namespace-name
parse_regressor.h:12: error: expected namespace-name before ‘;’ token
parse_regressor.h:22: error: variable or field ‘parse_regressor_args’ declared void
parse_regressor.h:22: error: ‘po’ has not been declared
parse_regressor.h:22: error: ‘vm’ was not declared in this scope
parse_regressor.h:22: error: expected primary-expression before ‘&’ token
parse_regressor.h:22: error: ‘r’ was not declared in this scope
parse_regressor.h:22: error: expected primary-expression before ‘&’ token
parse_regressor.h:22: error: ‘final_regressor_name’ was not declared in this scope
parse_regressor.h:22: error: expected primary-expression before ‘bool’
In file included from multisource.h:5,
from global_data.cc:5:
parser.h:56: error: ‘boost’ has not been declared
parser.h:56: error: ‘program_options’ is not a namespace-name
parser.h:56: error: expected namespace-name before ‘;’ token
parser.h:57: error: variable or field ‘parse_source_args’ declared void
parser.h:57: error: ‘po’ has not been declared
parser.h:57: error: ‘vm’ was not declared in this scope
parser.h:57: error: expected primary-expression before ‘*’ token
parser.h:57: error: ‘par’ was not declared in this scope
parser.h:57: error: expected primary-expression before ‘bool’
parser.h:57: error: expected primary-expression before ‘passes’
make: *** [global_data.o] Error 1
