Vowpal Wabbit (Fast Online Learning)

Code
Usage
Examples
Discussion
This is a project at Yahoo! Research to design a fast, scalable, useful learning algorithm.

There are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. This project is about approach (b), and it's reached a state where it may be useful to others as a platform for research and experimentation.

There are two algorithms, one implementing specialist gradient descent (GD) on squared loss and the other implementing specialist exponentiated gradient descent (SEG) on squared loss. The code should be easily usable. Its only external dependence is on the boost program_options library, which is often installed by default.

Features

There are several features that (in combination) are fairly interesting.
  1. Raw data. The input format for the learning algorithm is substantially more flexible than might be expected. Examples can have features consisting of free form text, which is interpreted in a bag-of-words way. There can even be multiple sets of free form text in different namespaces.
  2. Speed. The learning algorithm is pretty fast---similar to the few other online algorithm implementations out there. As an example, it's about a factor of 3 faster than Leon Bottou's svmsgd on the RCV1-V2 example in wall clock execution time.
  3. Scalability. This is not the same as fast. Instead, the important characterictic here is that the memory footprint of the program is bounded independent of data. This means the training set is not loaded into main memory before learning starts. In addition, the size of the set of features is bounded independent of the amount of training data.
  4. Feature Pairing. Subsets of features can be internally paired so that the algorithm is linear in the cross-product of the subsets. This is useful for ranking problems. David Grangier seems to have a similar trick in the PAMIR code. The alternative of explicitly expanding the features before feeding them into the learning algorithm can be both computation and space intensive, depending on how it's handled.

Learning Rate

The code implements several methods for adjusting the learning rates. The default is a fixed learning rate which decays by a factor of 20.5 if multiple epochs are used. This seems to be a fairly stable default. For some datasets, having a learning rate which decays as 1/(number of examples) or 1/(C + number of examples) in stochastic gradient descent style can work better. Choosing C and the learning rate well appear to be substantially more problem dependent so this is not the default.

The Future

This project is "live" and ongoing. We are interested in incorporating any significant improvements from other people, and I believe any such are of substantial research interest.

Authors

John Langford, Lihong Li, and Alex Strehl have all worked on VW, while at Yahoo! Research.