Vowpal Wabbit (Fast Online Learning)

Code
Usage
Tutorial
Examples
Todo
VW4.0 post
VW3.10 post
VW2.3 post
This is a project at Yahoo! Research to design a fast, scalable, useful learning algorithm.

There are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. This project is about approach (b), and it's reached a state where it may be useful to others as a platform for research and experimentation.

The core algorithm is specialist gradient descent (GD) on a loss function (several are available), The code should be easily usable. Its only external dependence is on the boost program_options library, which is often installed by default.

Features

There are several features that (in combination) are fairly interesting.
  1. Raw data. The input format for the learning algorithm is substantially more flexible than might be expected. Examples can have features consisting of free form text, which is interpreted in a bag-of-words way. There can even be multiple sets of free form text in different namespaces.
  2. Speed. The learning algorithm is pretty fast---similar to the few other online algorithm implementations out there. As one datapoint, it can be effectively applied on learning problems with a sparse terafeature (i.e. 1012 sparse features). As another example, it's about a factor of 3 faster than Leon Bottou's svmsgd on the RCV1-V2 example in wall clock execution time.
  3. Scalability. This is not the same as fast. Instead, the important characterictic here is that the memory footprint of the program is bounded independent of data. This means the training set is not loaded into main memory before learning starts. In addition, the size of the set of features is bounded independent of the amount of training data using the hashing trick.
  4. Feature Pairing. Subsets of features can be internally paired so that the algorithm is linear in the cross-product of the subsets. This is useful for ranking problems. David Grangier seems to have a similar trick in the PAMIR code. The alternative of explicitly expanding the features before feeding them into the learning algorithm can be both computation and space intensive, depending on how it's handled.

Learning Rate

The code implements several methods for adjusting the learning rates. The default is a fixed learning rate which decays by a factor of 20.5 if multiple epochs are used. This seems to be a fairly stable default. For some datasets, having a learning rate which decays as 1/(number of examples)p or 1/(C + number of examples)p in stochastic gradient descent style can work better. Choosing C and the learning rate well appear to be substantially more problem dependent so this is not the default. Known good values of p are in the range [0,1], with 1 representing very aggressive decay, 0.5 representing a minimax optimal choice in an adversarial setting and smaller values implying forgetting, which can be important in time-varying settings.

The Future

This project is "live" and ongoing at github. Several people have contributed to the project, and we welcome further contributions.

Authors

Shubham Chopra, Ariel Faigon, Daniel Hsu, John Langford, Lihong Li, Gordon Rios, and Alex Strehl have all worked on VW. Many others have contributed via feature requests or bug reports.