Vowpal Wabbit, version 5.0, and the second heresy

I’ve released version 5.0 of the Vowpal Wabbit online learning software. The major number has changed since the last release because I regard all earlier versions as obsolete; there are several new algorithms and features, including substantial changes and upgrades to the default learning algorithm.

The biggest changes are new algorithms:

  1. Nikos and I improved the default algorithm. The basic update rule still uses gradient descent, but the size of the update is carefully controlled so that it’s impossible to overrun the label. In addition, the normalization has changed. Computationally, these changes are virtually free and yield better results, sometimes much better. Less careful updates can be reenabled with --loss_function classic, although results are still not identical to previous versions due to the normalization changes.
  2. Nikos also implemented per-feature learning rates as per these two papers. Often, this works better than the default algorithm. It isn’t the default because it isn’t (yet) as adaptable in terms of learning rate decay. This is enabled with --adaptive, and learned regressors are compatible with the default. Computationally, you might see a factor of 4 slowdown if using ‘-q’. Nikos noticed that the phenomenal quake inverse square root hack applies, making this substantially faster than a naive implementation; a sketch of this style of update appears after this list.
  3. Nikos and Daniel also implemented active learning derived from this paper, usable via --active_simulation (to test parameters on an existing supervised dataset) or --active_learning (to do the real thing). This runs at full speed, which is much faster than is reasonable in any real active learning scenario. We see this approach dominating supervised learning on all classification datasets so far, often with far fewer labeled examples required, as the theory predicts. The learned predictor is compatible with the default.
  4. Olivier helped me implement preconditioned conjugate gradient based on Jonathan Shewchuk’s tutorial. This is a batch algorithm and hence requires multiple passes over any dataset to do something useful; each step of conjugate gradient requires 2 passes. The advantage of conjugate gradient is that it converges relatively quickly via the use of second derivative information, which can be particularly helpful if your features are of widely differing scales. The use of --regularization 0.001 (or smaller) is almost required with --conjugate_gradient as it will otherwise overfit hard. This implementation has two advantages over the basic approach: it implicitly computes a Hessian in O(n) time, where n is the number of features, and it operates out of core, making it applicable to datasets that don’t conveniently fit in RAM. The learned predictor is compatible with the default, although you’ll notice that a factor of 8 more RAM is required when learning. A sketch of the one-pass Hessian-vector product idea behind the first advantage appears after this list.
  5. Matt Hoffman and I implemented Online Latent Dirichlet Allocation. This code is still experimental and likely to change over the next week. It really does a minibatch update under the hood. The code appears to be substantially faster than Matt’s earlier Python implementation, making this probably the most efficient LDA implementation anywhere. LDA is still much slower than online linear learning as it is quite computationally heavy in comparison; it is perhaps a good candidate for GPU optimization.
  6. Nikos, Daniel, and I have been experimenting with more online cluster parallel learning algorithms (--corrective, --backprop, --delayed_global). We aren’t yet satisfied with these, although they are improving. Details are at the LCCC workshop.
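
To make the per-feature learning rates in item 2 concrete, here is a minimal sketch of that style of update, assuming plain squared loss and an accumulator of squared gradients in the spirit of the adaptive-gradient papers; the names and structure are mine rather than VW’s actual code, and the quake-style inverse square root is used only to approximate 1/sqrt of the accumulator.

    // Illustrative sketch only (not VW's code): per-feature adaptive learning
    // rates, with the quake-style fast inverse square root approximating
    // 1/sqrt(G[i]), where G[i] accumulates squared gradients for feature i.
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Fast approximate 1/sqrt(x): the well-known quake hack plus one Newton step.
    inline float inv_sqrt(float x) {
      float half = 0.5f * x;
      std::uint32_t i;
      std::memcpy(&i, &x, sizeof(i));
      i = 0x5f3759df - (i >> 1);
      std::memcpy(&x, &i, sizeof(x));
      return x * (1.5f - half * x * x);
    }

    struct feature { std::size_t index; float value; };  // one nonzero feature

    // w: weights, G: per-feature sums of squared gradients, eta: base rate.
    // For squared loss, the gradient w.r.t. w[i] is (prediction - label) * x_i.
    void adaptive_update(float* w, float* G, const feature* x, std::size_t len,
                         float prediction, float label, float eta) {
      float err = prediction - label;
      for (std::size_t j = 0; j < len; ++j) {
        float grad = err * x[j].value;
        G[x[j].index] += grad * grad;
        w[x[j].index] -= eta * grad * inv_sqrt(G[x[j].index]);
      }
    }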
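
The O(n) Hessian claim in item 4 rests on the standard observation that conjugate gradient only ever needs Hessian-vector products, never the Hessian itself. As a hedged illustration, assuming a linear model with squared loss and L2 regularization (again not VW’s actual code), a product H*v can be accumulated in a single streaming pass over the data:

    // Illustrative sketch only: for squared loss the Hessian of a linear model
    // is H = sum_e x_e x_e^T + lambda*I, so H*v = sum_e (x_e . v) x_e + lambda*v,
    // which needs one pass and work proportional to the nonzeros per example.
    #include <cstddef>

    struct sparse_feature { std::size_t index; float value; };
    struct sparse_example { const sparse_feature* x; std::size_t len; };

    void hessian_vector_product(const sparse_example* data, std::size_t num_examples,
                                const float* v, float* hv, std::size_t num_weights,
                                float lambda) {
      for (std::size_t i = 0; i < num_weights; ++i)
        hv[i] = lambda * v[i];                                // regularization term
      for (std::size_t e = 0; e < num_examples; ++e) {
        float dot = 0.f;                                      // x_e . v
        for (std::size_t j = 0; j < data[e].len; ++j)
          dot += data[e].x[j].value * v[data[e].x[j].index];
        for (std::size_t j = 0; j < data[e].len; ++j)
          hv[data[e].x[j].index] += dot * data[e].x[j].value; // add (x_e . v) x_e
      }
    }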

In addition, Ariel added a test suite, Shravan helped with ngrams, and there are several other minor new features and bug fixes including a very subtle one caught by Vaclav.

The documentation on the website hasn’t kept up with the code. I’m planning to rectify that over the next week, and have a new tutorial starting at 2pm in the LCCC room for those interested. Yes, I’ll not be skiing 🙂

9 Replies to “Vowpal Wabbit, version 5.0, and the second heresy”

  1. I would like to add that the sole blame for this heresy rests with John and John alone. The other organizers of the workshop are not complicit in any way 🙂

  2. Hi,

    Does VW implement/plan to implement the L1 truncated gradient method?

    Is there a way to extract the feature weights from the final regressor, in ASCII format?

    1. Lihong is actually looking into reimplementing truncated gradient right now.

      There isn’t a way to extract feature weights in ASCII format, although implementing that is easy; it’s just an alternate output function compared to the one in parse_regressor.cc.
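
      As a rough, hypothetical illustration (not the actual VW internals), if the weights are ultimately just a float array indexed by the hashed feature id, the output function could be as simple as:

        // Hypothetical sketch: write nonzero weights as "index:weight" lines.
        #include <cstddef>
        #include <cstdio>

        void dump_weights_ascii(const char* filename, const float* weights,
                                std::size_t num_weights) {
          FILE* f = std::fopen(filename, "w");
          if (f == nullptr) return;
          for (std::size_t i = 0; i < num_weights; ++i)
            if (weights[i] != 0.f)
              std::fprintf(f, "%zu:%f\n", i, weights[i]);
          std::fclose(f);
        }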

  3. Dear John, it’s a great tool! But it would be very nice if the documentation were more complete 🙂 Say, I’m trying to use the online LDA, but right now I have to guess which command line switch is which.

    1. I’ve worked out most of the parameters related to LDA, but I still don’t know how to dump the topics to disk… What should I do to save the topics? Thank you!
