Vowpal Wabbit Open Source Project

Today brings a new release of the Vowpal Wabbit fast online learning software. This time, unlike the previous release, the project itself is going open source, with development happening on GitHub. For example, the latest and greatest can be downloaded via:

git clone git://github.com/JohnLangford/vowpal_wabbit.git
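
To build it, a plain make in the checkout should suffice (this assumes Boost is already installed; the -I path in the Makefile may need adjusting to point at your Boost installation):

cd vowpal_wabbit
make    # assumes Boost is installed; adjust the Makefile's -I path if it lives elsewhere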

If you aren’t familiar with git, it’s a distributed version control system which supports quick and easy branching, as well as reconciliation.

This version of the code is confirmed to compile without complaint on at least some flavors of OSX as well as Linux boxes.

As much of the point of this project is pushing the limits of fast and effective machine learning, let me mention a few datapoints from my experience.

  1. The program can effectively scale up to batch-style training on sparse terafeature-sized (i.e. 10^12 sparse features) datasets. The limiting factor is typically i/o.
  2. I started using the real datasets from the large-scale learning workshop as a convenient benchmark. The largest dataset takes about 10 minutes. (This uses the native features, which the organizers intended only as a starting point but which all contestants used. In some cases, that admittedly gives performance nowhere near optimal.)
  3. After using this program for a while, I find it has substantially altered my perception of what counts as a large-scale learning problem. This causes confusion when people brag about computational performance on tiny datasets with only 10^5 examples 🙂

I would also like to emphasize that this is intended as an open source project rather than merely a code drop, as occurred last time. What I think this project has to offer researchers is an infrastructure for implementing fast online algorithms. It’s reasonably straightforward to implant your own tweaked algorithm, automatically gaining the substantial benefits of the surrounding code that deals with file formats, file caching, buffering, etc… In a very real sense, most of the code is this surrounding stuff, which you don’t have to rewrite to benefit from. For people applying machine learning, there is some obvious value in getting very fast feedback in a batch setting, as well as having an algorithm that actually works in a real online setting.

As one example of the ability to reuse the code for other purposes, an effective general purpose online implementation of the Offset Tree is included. I haven’t seen any other implementation of an algorithm for learning in the agnostic partial label setting, so this code may be of substantial interest for people encountering these sorts of problems.

The difference between this version and the previous one is a nearly total rewrite. Some of the bigger changes are:

  1. We dropped SEG for now, for code-complexity reasons.
  2. Multicore parallelization proceeds in a different fashion: parallelization over features instead of examples. This works better with caches. Note that all parallelization of the core algorithm is meaningless unless you use the -q flag, because otherwise you are i/o bound (see the example invocation just after this list).
  3. The code is more deeply threaded, with a separate thread for parsing.
  4. There is support for several different loss functions, and it’s easy to add your own.
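
For concreteness, a typical invocation with quadratic features looks something like the following; the file name and namespace letters are placeholders, and this assumes the usual pattern of piping examples in on stdin:

cat train.dat | ./vw -q ab    # "train.dat" and namespaces a,b are illustrative placeholders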

I’m interested in any bug reports or suggestions for the code. I have substantial confidence that this code can do interesting and useful things, but improving it is a constant and ongoing process.

9 Replies to “Vowpal Wabbit Open Source Project”

    1. The computational difficulty is due to the need to normalize. This arises in two ways—in the actual act of computing the normalization and in the requirement that each feature be distinct, which requires substantial computation.

      I’m not opposed to SEG. An alternate story is that it’s almost as fast as GD when you are i/o bound, which is common. Perhaps we’ll reimplement it at some point.

  1. I’m having problems compiling on Snow Leopard. Any pointers on how to fix this?

    Error:

    g++ -Wall -march=nocona -ffast-math -fno-strict-aliasing -D_FILE_OFFSET_BITS=64 -I /usr/local/boost_1_34_1 -O3 -c multisource.cc -o multisource.o
    In file included from multisource.h:4,
    from multisource.cc:1:
    example.h:45: error: ‘pthread_mutex_t’ does not name a type
    make: *** [multisource.o] Error 1

    -----------------------
    Environment

    os: Darwin RNz.local 10.2.0 Darwin Kernel Version 10.2.0: Tue Nov 3 10:37:10 PST 2009; root:xnu-1486.2.11~1/RELEASE_I386 i386 i386
    g++: i686-apple-darwin10-g++-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5646) (dot 1)
    boost: 1.34.1
    vw: latest from git

    1. It looks like an include bug, in particular perhaps a failure to include pthreads. Maybe try:

      #include <pthread.h>

  2. Thanks for the response. Adding the pthread include fixed the problem. Also, I was able to compile with Boost 1.42.0.

    If you need someone to test future releases on OS X, ping me.

  3. Had the same problem as Ryan on Snow Leopard. I added #include <pthread.h> to example.h and now it works.
    Probably a stupid question, but is there any advantage to running it as a daemon?

  4. Hi, I got a different error when I compiled the latest code on Snow Leopard:

    g++ -Wall -march=native -O3 -fomit-frame-pointer -ffast-math -fno-strict-aliasing -D_FILE_OFFSET_BITS=64 -I /usr/local/include -c hash.cc -o hash.o
    hash.cc:1: error: bad value (native) for -march= switch
    hash.cc:1: error: bad value (native) for -mtune= switch
    make: *** [hash.o] Error 1

    1. This is a limitation of the compiler on that version of OS X. A later compiler version fixes this, but you can work around it by removing -march=native from the Makefile.
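
      For instance, if the flag appears literally in the Makefile, something like the following (using the BSD sed that ships with OS X) should strip it out:

      sed -i '' 's/ -march=native//' Makefile    # assumes " -march=native" appears verbatim in the Makefile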

  5. I got a compile error on Snow Leopard. How can I fix this?

    g++ -march=nocona -Wall -Werror -O3 -fomit-frame-pointer -ffast-math -fno-strict-aliasing -D_FILE_OFFSET_BITS=64 -I /usr/local/include -c global_data.cc -o global_data.o
    In file included from global_data.h:11,
    from global_data.cc:4:
    parse_regressor.h:10:37: error: boost/program_options.hpp: No such file or directory
    In file included from multisource.h:5,
    from global_data.cc:5:
    parser.h:55:37: error: boost/program_options.hpp: No such file or directory
    In file included from global_data.h:11,
    from global_data.cc:4:
    parse_regressor.h:12: error: ‘boost’ has not been declared
    parse_regressor.h:12: error: ‘program_options’ is not a namespace-name
    parse_regressor.h:12: error: expected namespace-name before ‘;’ token
    parse_regressor.h:22: error: variable or field ‘parse_regressor_args’ declared void
    parse_regressor.h:22: error: ‘po’ has not been declared
    parse_regressor.h:22: error: ‘vm’ was not declared in this scope
    parse_regressor.h:22: error: expected primary-expression before ‘&’ token
    parse_regressor.h:22: error: ‘r’ was not declared in this scope
    parse_regressor.h:22: error: expected primary-expression before ‘&’ token
    parse_regressor.h:22: error: ‘final_regressor_name’ was not declared in this scope
    parse_regressor.h:22: error: expected primary-expression before ‘bool’
    In file included from multisource.h:5,
    from global_data.cc:5:
    parser.h:56: error: ‘boost’ has not been declared
    parser.h:56: error: ‘program_options’ is not a namespace-name
    parser.h:56: error: expected namespace-name before ‘;’ token
    parser.h:57: error: variable or field ‘parse_source_args’ declared void
    parser.h:57: error: ‘po’ has not been declared
    parser.h:57: error: ‘vm’ was not declared in this scope
    parser.h:57: error: expected primary-expression before ‘*’ token
    parser.h:57: error: ‘par’ was not declared in this scope
    parser.h:57: error: expected primary-expression before ‘bool’
    parser.h:57: error: expected primary-expression before ‘passes’
    make: *** [global_data.o] Error 1
