Vowpal Wabbit, v5.1

I just created version 5.1 of vowpal wabbit. This almost entirely a bugfix release, so it’s an easy upgrade from v5.0.

In addition:

  1. There is now a mailing list, which I and several other developers are subscribed to.
  2. The main website has shifted to the wiki on github. This means that anyone with a github account can now edit it.
  3. I’m planning to give a tutorial tomorrow on it at eHarmony/the LA machine learning meetup at 10am. Drop by if you’re interested.

The status of VW amongst other open source projects has changed. When VW first came out, it was relatively unique amongst existing projects in terms of features. At this point, many other projects have started to appreciate the value of the design choices here. This includes:

  1. Mahout, which now has an SGD implementation.
  2. Shogun, where Soeren is keen on incorporating features.
  3. LibLinear, where they won the KDD best paper award for out-of-core learning.

This is expected—any open source approach which works well should be widely adopted. None of these other projects yet have the full combination of features, so VW still offers something unique. There are also more tricks that I haven’t yet had time to implement, and I look forward to discovering even more.

I’m indebted to many people at this point who have helped with this project. I particularly point out Daniel and Nikos, who have spent quite a bit of time over the last few months working on things.

The Ideal Large Scale Learning Class

At NIPS, Andrew Ng asked me what should be in a large scale learning class. After some discussion with him and Nando and mulling it over a bit, these are the topics that I think should be covered.

There are many different kinds of scaling.

  1. Scaling in examples This is the most basic kind of scaling.
    1. Online Gradient Descent This is an old algorithm—I’m not sure if anyone can be credited with it in particular. Perhaps the Perceptron is a good precursor, but substantial improvements come from the notion of a loss function of which squared loss, logistic loss, Hinge Loss, and Quantile Loss are all worth covering. It’s important to cover the semantics of these loss functions as well. Vowpal Wabbit is a reasonably fast codebase implementing these.
    2. Second Order Gradient Descent methods For some problems, methods taking into account second derivative information can be more effective. I’ve seen preconditioned conjugate gradient work well, for which Jonathan Shewchuck‘s writeup is reasonable. Nando likes L-BFGS which I don’t have much experience with.
    3. Map-Reduce I have a love-hate relationship with the Map-Reduce framework. In my experience, it’s an excellent filesystem, but it’s quite frustrating to do machine learning with, since it encourages the parallelization of slow learning algorithms. I liked what Markus said at the LCCC workshop: nobody wants to give up on the idea of distributed fault tolerant storage and moving small amounts of code to large amounts of data rather than vice-versa. The best way to use this for Machine Learning isn’t yet clear to me. Hadoop is probably the most commonly used open source implementation of Map-Reduce.
  2. Feature Scaling—what do you do when you have very many features?
    1. Hashing approaches are surprisingly effective in my experience. It’s a good idea to also present Bloom Filters, as they help with the intuition of where this works substantially.
    2. Online l1 regularization is via truncated gradient. See Bob Carpenter’s discussion. John Duchi’s composite mirror descent generalization is also a useful general treatment.
    3. Boosting based approaches can also be effective, although training time can become problematic. This is partially mitigated by parallelization algorithms as discussed at the LCCC workshop See Jerry Ye’s talk and Krysta’s talk.. A really good public implementation of this is so far missing, as far as I know.
  3. Test-time Evaluation Ultrafast and efficient test-time evaluation seems to be a goal independent of training.
    1. Indicies One way to speed things up is with inverted indicies. Aside from the basic datastructure, I’d cover WAND and predictive indexing.
    2. GPU The use of GPU’s to make evaluation both more efficient and fast seems to make sense in many applications.
  4. Labels
    1. Sourcing Just acquiring sufficient label information can be problematic.
      1. Mechanical Turk can be an efficient approach. The basic approach can be improved with some work.
      2. When you are paying directly for labels, active learning approaches can substantially cut your costs. Burr Settles active learning survey is pretty comprehensive, although if I was to cover one algorithm, it would be this one which enjoys a compelling combination of strong theoretical guarantees, computational tractability, empirical performance, and generality.
      3. The other common approach is user-feedback information where bias and exploration effects becomes a critical concern. The tutorial Alina and I did on learning and exploration is critical here.
    2. Many Labels It’s common to need to make a complex prediction.
      1. Label Tree based approaches are a good starting point. I’d discuss the inconsistency of the naive approach and the Filter Tree, discussed here. Online tree building and conditional probability trees are also potentially extremely useful. Building smarter trees can help, such as with a confusion matrix or in an iterative fashion.
      2. Label Tree approaches breakdown when the number of labels becomes so large that filtering eliminates too many examples. Here Structured Prediction techniques become particularly important. I’d cover Searn as well as some of Drew Bagnell‘s work such as this one. Many other people are still interested in CRFs or Max-Margin Markov Networks which I find somewhat less compelling for computational reasons.
      3. Cascade Learning is also a compelling approach. The canonical paper on this is the Viola-Jones Face Detector. I’m sure there’s much other vision-related work on cascades that I’m unfamiliar. A more recent instance is the structured prediction cascade.

What else is essential and missing?

What does Watson mean?

Watson convincingly beat the best champion Jeopardy! players. The apparent significance of this varies hugely, depending on your background knowledge about the related machine learning, NLP, and search technology. For a random person, this might seem evidence of serious machine intelligence, while for people working on the system itself, it probably seems like a reasonably good assemblage of existing technologies with several twists to make the entire system work.

Above all, I think we should congratulate the people who managed to put together and execute this project—many years of effort by a diverse set of highly skilled people were needed to make this happen. In academia, it’s pretty difficult for one professor to assemble that quantity of talent, and in industry it’s rarely the case that such a capable group has both a worthwhile project and the support needed to pursue something like this for several years before success.

Alina invited me to the Jeopardy watching party at IBM, which was pretty fun, and it gave me a chance to talk to several people, principally Gerry Tesauro (2nd from the right). It’s cool to see people asking for autographs 🙂

I wasn’t surprised to see Watson win. Partly, this is simply because when a big company does a publicity stunt like this, it’s with a pretty solid expectation of victory. Partly, this is because I already knew that computers could answer trivia questions moderately well(*), so the question was just how far this could be improved. Gerry tells me that although Watson’s error rate is still significant, one key element is the ability to estimate with high accuracy when they can answer with high accuracy. Gerry also tells me the Watson papers will be coming out later this summer, with many more details.

What happens next? I don’t expect the project to be shelved like deep blue was, for two reasons. The first is that there is clearly very substantial room for improvement, and the second is that having a natural language question/answering device that can quickly search and respond from large sets of text is obviously valuable. The first means that researchers are interested, and the second that the money to support them can probably be found. The history of textual entailment challenges is another less centralized effort in about the same direction.

In the immediate future (next few years), applications in semi-open domains may become viable, particularly when a question/answer device knows when to answer “I don’t know”. Fully conversational speech recognition working in an open domain should take somewhat longer, because speech recognition software has additional error points, conversational systems aren’t so easy to come by, and in a fully open domain the error rates will be higher. Getting the error rate on questions down to the level that a human with access to the internet has difficulty beating is the tricky challenge which has not yet been addressed. It’s a worthy goal to work towards.

Many people believe in human exceptionalism, so when seeing a computer beat Jeopardy, they are surprised that humans aren’t exceptional there. We should understand that this has happened many times before, with chess and mathematical calculation being two areas where computers now dominate, but which were once thought to be the essence of intelligence by some. Similarly, it is not difficult to imagine automated driving (after all, animals can do it), gross object recognition, etc…

To avert surprise in the future, human exceptionalists should understand what the really hard things for an AI to do are. It’s important to understand that there are various levels of I in AI. A few I think about are:

  1. Animal Intelligence. The ability to understand your place in the world, navigate the world, and accomplish something. Some of these tasks are solved, but many others are not yet. This level implies that routine tasks can be automated. Automated driving, farming, factories, etc…
  2. Turing Test Intelligence. The ability to mimic a typical human well-enough to fool a typical human in open conversation. Watson doesn’t achieve this, but the thrust of the research is in this direction as open domain question answering is probably necessary for this. Nonroutine noncreative tasks might be accomplished by the computer. Think of an automated secretary.
  3. Pandora’s box Intelligence. The ability to efficiently self-program in an open domain so as to continuously improve. At this level human exceptionalism fails, and it is difficult to predict what happens next.

So, serious evidence of (2) or (3) is what I watch for.

(*) About 10 years ago, I had a friend2 on WWTBAM who called the friend for help on a question, who typed the question and multiple choice answers into CMU‘s Zephyr system, where a bot I made queried (question,answer) pairs on Google to discover which had the most web pages. It worked.