Vowpal Wabbit, v5.1

I just created version 5.1 of vowpal wabbit. This almost entirely a bugfix release, so it’s an easy upgrade from v5.0.

In addition:

  1. There is now a mailing list, which I and several other developers are subscribed to.
  2. The main website has shifted to the wiki on github. This means that anyone with a github account can now edit it.
  3. I’m planning to give a tutorial tomorrow on it at eHarmony/the LA machine learning meetup at 10am. Drop by if you’re interested.

The status of VW amongst other open source projects has changed. When VW first came out, it was relatively unique amongst existing projects in terms of features. At this point, many other projects have started to appreciate the value of the design choices here. This includes:

  1. Mahout, which now has an SGD implementation.
  2. Shogun, where Soeren is keen on incorporating features.
  3. LibLinear, where they won the KDD best paper award for out-of-core learning.

This is expected—any open source approach which works well should be widely adopted. None of these other projects yet have the full combination of features, so VW still offers something unique. There are also more tricks that I haven’t yet had time to implement, and I look forward to discovering even more.

I’m indebted to many people at this point who have helped with this project. I particularly point out Daniel and Nikos, who have spent quite a bit of time over the last few months working on things.

The Ideal Large Scale Learning Class

At NIPS, Andrew Ng asked me what should be in a large scale learning class. After some discussion with him and Nando and mulling it over a bit, these are the topics that I think should be covered.

There are many different kinds of scaling.

  1. Scaling in examples This is the most basic kind of scaling.
    1. Online Gradient Descent This is an old algorithm—I’m not sure if anyone can be credited with it in particular. Perhaps the Perceptron is a good precursor, but substantial improvements come from the notion of a loss function of which squared loss, logistic loss, Hinge Loss, and Quantile Loss are all worth covering. It’s important to cover the semantics of these loss functions as well. Vowpal Wabbit is a reasonably fast codebase implementing these.
    2. Second Order Gradient Descent methods For some problems, methods taking into account second derivative information can be more effective. I’ve seen preconditioned conjugate gradient work well, for which Jonathan Shewchuck‘s writeup is reasonable. Nando likes L-BFGS which I don’t have much experience with.
    3. Map-Reduce I have a love-hate relationship with the Map-Reduce framework. In my experience, it’s an excellent filesystem, but it’s quite frustrating to do machine learning with, since it encourages the parallelization of slow learning algorithms. I liked what Markus said at the LCCC workshop: nobody wants to give up on the idea of distributed fault tolerant storage and moving small amounts of code to large amounts of data rather than vice-versa. The best way to use this for Machine Learning isn’t yet clear to me. Hadoop is probably the most commonly used open source implementation of Map-Reduce.
  2. Feature Scaling—what do you do when you have very many features?
    1. Hashing approaches are surprisingly effective in my experience. It’s a good idea to also present Bloom Filters, as they help with the intuition of where this works substantially.
    2. Online l1 regularization is via truncated gradient. See Bob Carpenter’s discussion. John Duchi’s composite mirror descent generalization is also a useful general treatment.
    3. Boosting based approaches can also be effective, although training time can become problematic. This is partially mitigated by parallelization algorithms as discussed at the LCCC workshop See Jerry Ye’s talk and Krysta’s talk.. A really good public implementation of this is so far missing, as far as I know.
  3. Test-time Evaluation Ultrafast and efficient test-time evaluation seems to be a goal independent of training.
    1. Indicies One way to speed things up is with inverted indicies. Aside from the basic datastructure, I’d cover WAND and predictive indexing.
    2. GPU The use of GPU’s to make evaluation both more efficient and fast seems to make sense in many applications.
  4. Labels
    1. Sourcing Just acquiring sufficient label information can be problematic.
      1. Mechanical Turk can be an efficient approach. The basic approach can be improved with some work.
      2. When you are paying directly for labels, active learning approaches can substantially cut your costs. Burr Settles active learning survey is pretty comprehensive, although if I was to cover one algorithm, it would be this one which enjoys a compelling combination of strong theoretical guarantees, computational tractability, empirical performance, and generality.
      3. The other common approach is user-feedback information where bias and exploration effects becomes a critical concern. The tutorial Alina and I did on learning and exploration is critical here.
    2. Many Labels It’s common to need to make a complex prediction.
      1. Label Tree based approaches are a good starting point. I’d discuss the inconsistency of the naive approach and the Filter Tree, discussed here. Online tree building and conditional probability trees are also potentially extremely useful. Building smarter trees can help, such as with a confusion matrix or in an iterative fashion.
      2. Label Tree approaches breakdown when the number of labels becomes so large that filtering eliminates too many examples. Here Structured Prediction techniques become particularly important. I’d cover Searn as well as some of Drew Bagnell‘s work such as this one. Many other people are still interested in CRFs or Max-Margin Markov Networks which I find somewhat less compelling for computational reasons.
      3. Cascade Learning is also a compelling approach. The canonical paper on this is the Viola-Jones Face Detector. I’m sure there’s much other vision-related work on cascades that I’m unfamiliar. A more recent instance is the structured prediction cascade.

What else is essential and missing?