Vowpal Wabbit Code Release

We are releasing the Vowpal Wabbit (Fast Online Learning) code as open source under a BSD (revised) license. This is a project at Yahoo! Research that Lihong Li, Alex Strehl, and I have been working on to build a useful large scale learning algorithm.

To appreciate the meaning of “large”, it’s useful to define “small” and “medium”. A “small” supervised learning problem is one where a human could use a labeled dataset and come up with a reasonable predictor. A “medium” supervised learning problem is one whose dataset fits into the RAM of a modern desktop computer. A “large” supervised learning problem is one whose dataset does not fit into the RAM of a normal machine. VW tackles large scale learning problems by this definition of large. I’m not aware of any other open source Machine Learning tools which can handle this scale (although they may exist). A few close ones are:

  1. IBM’s Parallel Machine Learning Toolbox isn’t quite open source. The approach used by this toolbox is essentially map-reduce style computation, which doesn’t seem amenable to online learning approaches. This is significant, because the fastest learning algorithms without parallelization tend to be online learning algorithms.
  2. Leon Bottou’s sgd implementation first loads data into RAM, then learns. Leon’s code is a great demonstrator of how fast and effective online learning approaches (specifically stochastic gradient descent) can be. VW is about a factor of 3 faster on my desktop, and yields a lower error rate solution.
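
To make the online learning point concrete, here is a minimal Python sketch of streaming stochastic gradient descent on squared loss with a sparse linear predictor. It is only an illustration of the general approach, not VW's implementation: the `label index:value ...` line format and the fixed learning rate are assumptions chosen for the sketch.

```python
# Minimal streaming SGD on squared loss with a sparse linear predictor.
# Illustration only: the "label index:value ..." line format and the fixed
# learning rate are assumptions for this sketch, not VW's format or defaults.
from collections import defaultdict

def streaming_sgd(path, learning_rate=0.1, passes=1):
    weights = defaultdict(float)          # sparse weight vector
    for _ in range(passes):
        with open(path) as examples:
            for line in examples:         # one example at a time: RAM use is
                tokens = line.split()     # independent of the dataset size
                label = float(tokens[0])
                features = [(name, float(value)) for name, value in
                            (token.split(":") for token in tokens[1:])]
                prediction = sum(weights[name] * value for name, value in features)
                error = prediction - label   # gradient of 0.5 * (prediction - label)^2
                for name, value in features:
                    weights[name] -= learning_rate * error * value
    return weights
```

Because each example is read, used, and then discarded, the dataset never needs to fit in RAM; this is the property that makes online approaches attractive at the “large” scale defined above.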

There are several other features such as feature pairing, sparse features, and namespacing that are often handy in practice.
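
For intuition about what sparse features, namespaces, and feature pairing mean in practice, here is a rough sketch of hashing two namespaces of sparse features, plus all their pairs, into a fixed-size sparse vector. The use of Python's hash() and the 2^18 table size are illustrative assumptions, not a description of VW's internals.

```python
# Rough sketch of sparse features, namespaces, and feature pairing via the
# hashing trick. Python's hash() and the 2**18 table size are illustrative
# assumptions, not a description of VW's internals.
def hashed_features(ns_a, ns_b, num_bits=18):
    size = 1 << num_bits
    vec = {}                                   # sparse vector: index -> value
    def add(key, value):
        index = hash(key) % size               # feature name -> array index
        vec[index] = vec.get(index, 0.0) + value
    for name, value in ns_a.items():           # original features, kept sparse
        add(("a", name), value)
    for name, value in ns_b.items():
        add(("b", name), value)
    for name_a, value_a in ns_a.items():       # feature pairing: the cross
        for name_b, value_b in ns_b.items():   # product of the two namespaces
            add(("a", name_a, "b", name_b), value_a * value_b)
    return vec

# e.g. hashed_features({"word=cat": 1.0}, {"user=7": 1.0}) has three nonzero
# entries: the two original features and their pair.
```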

At present, VW optimizes squared loss via gradient descent or exponentiated gradient descent over a linear representation.
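
As a sketch of the two update rules on a single example (again an illustration, not VW's code), the additive and multiplicative updates for squared loss with a linear predictor look like this; the learning rate and the positive initialization required by exponentiated gradient are assumptions of the sketch.

```python
import math

# Sketch of the two updates for squared loss with a linear predictor w.x on a
# single sparse example x (a dict index -> value). The learning rate eta is an
# arbitrary choice for illustration, not a VW default.

def gradient_descent_update(w, x, label, eta=0.1):
    error = sum(w.get(i, 0.0) * v for i, v in x.items()) - label
    for i, v in x.items():                      # additive update
        w[i] = w.get(i, 0.0) - eta * error * v
    return w

def exponentiated_gradient_update(w, x, label, eta=0.1):
    # multiplicative update; assumes w was initialized to positive values
    # (e.g. uniform) and keeps the weights normalized to sum to one
    error = sum(w.get(i, 0.0) * v for i, v in x.items()) - label
    for i, v in x.items():
        w[i] = w.get(i, 0.0) * math.exp(-eta * error * v)
    total = sum(w.values())
    return {i: wi / total for i, wi in w.items()}
```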

This code is free to use, incorporate, and modify as per the BSD (revised) license. The project is ongoing inside of Yahoo. We will gladly incorporate significant improvements from other people, and I believe any significant improvements are of substantial research interest.

Cool and Interesting things at NIPS, take three

Following up on Hal Daume’s post and John’s post on cool and interesting things seen at NIPS, I’ll post my own little list of neat papers here as well. Of course it’s going to be biased towards what I think is interesting. Also, I have to say that I wasn’t able to see many papers this year at NIPS because I was too busy, so please feel free to contribute the papers that you liked 🙂

1. P. Mudigonda, V. Kolmogorov, P. Torr. An Analysis of Convex Relaxations for MAP Estimation. A surprising paper which shows that many of the more sophisticated convex relaxations proposed recently turn out to be subsumed by the simplest LP relaxation. Be careful next time you try a cool new convex relaxation!

2. D. Sontag, T. Jaakkola. New Outer Bounds on the Marginal Polytope. The title says it all. The marginal polytope is the set of local marginal distributions over subsets of variables that are globally consistent, in the sense that there is at least one distribution over all the variables consistent with all the local marginal distributions. It is an interesting mathematical object to study, and this work builds on Martin Wainwright’s paper on upper bounding the log partition function, proposing improved outer bounds on the marginal polytope.
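
For concreteness, in the standard notation (following Wainwright and Jordan rather than anything specific to this paper), the marginal polytope of a discrete model with sufficient statistics φ is

```latex
\mathbb{M} \;=\; \left\{\, \mu \in \mathbb{R}^{d} \;:\; \mu = \mathbb{E}_{p}\!\left[\phi(X)\right] \ \text{for some distribution } p \,\right\}
```

i.e. exactly those vectors of local marginals that arise from a single joint distribution; an outer bound is any relaxation containing this set, and tighter outer bounds give better approximate inference.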

I think there is a little theme going on this year relating approximate inference to convex optimization. Besides the above two papers there were some other papers as well.

3. A. Sanborn, T. Griffiths. Markov Chain Monte Carlo with People. A cute idea of how you can construct an experimental set-up such that people act as accept/reject modules in a Metropolis-Hastings framework, so that we can probe the prior distributions encoded in people’s brains.
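
To make the mechanism concrete, here is a generic Metropolis-Hastings-style loop in which the usual acceptance test is replaced by a stand-in for a person's choice between two stimuli. The function names and overall setup are hypothetical illustrations of the idea, not the authors' actual experimental design.

```python
def mcmc_with_people(initial, propose, person_prefers, num_steps=100):
    """Generic Metropolis-Hastings-style chain where the accept/reject step is
    outsourced to a person. `propose` and `person_prefers` are hypothetical
    stand-ins for the stimulus generator and the human choice, not the
    authors' actual experimental setup."""
    current = initial
    samples = []
    for _ in range(num_steps):
        candidate = propose(current)        # e.g. a perturbed stimulus
        # The person is shown (candidate, current) and picks one; that choice
        # plays the role of the Metropolis-Hastings accept/reject decision.
        if person_prefers(candidate, current):
            current = candidate
        samples.append(current)
    return samples  # in aggregate, probes the distribution "in the person's head"
```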

4. E. Sudderth, M. Wainwright, A. Willsky. Loop Series and Bethe Variational Bounds in Attractive Graphical Models. Another surprising result: in attractive networks, if loopy belief propagation converges, then the Bethe free energy is actually a LOWER bound on the log partition function.

5. M. Welling, I. Porteous, E. Bart. Infinite State Bayes-Nets for Structured Domains. An interesting idea to construct Bayesian networks with an infinite number of states, using a pretty complex set-up involving hierarchical Dirichlet processes. I am not sure if the software is out, but I think building such general frameworks for nonparametric models is quite useful for many people who want to use such models but don’t want to spend too much time coding up the sometimes involved MCMC samplers.

I also liked Luis von Ahn’s invited talk on Human Computation. It’s always good to see that machine learning has quite a ways to go 🙂

ps: apologies, I stopped maintaining my own blog and ended up losing the domain name. So I’m guest posting here instead.

Cool and interesting things seen at NIPS

I learned a number of things at NIPS.

  1. The financial people were there in greater force than previously. Two Sigma sponsored NIPS while DRW Trading had a booth.
  2. The adversarial machine learning workshop had a number of talks about interesting applications where an adversary really is out to try and mess up your learning algorithm. This is very different from the situation we often think of where the world is oblivious to our learning. This may present new and convincing applications for the learning-against-an-adversary work common at COLT.
  3. There were several interesting papers.
    1. Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni had a paper on General Agnostic Active Learning. The basic idea is that active learning can be done via reduction to a form of supervised learning problem. This is great, because we have many supervised learning algorithms from which the benefits of active learning may be derived.
    2. Joseph Bradley and Robert Schapire had a paper on FilterBoost. FilterBoost is an online boosting algorithm which I think of as the boost-by-filtration approach from the first boosting paper updated with an AdaBoost-like structure. These kinds of approaches are doubtless helpful for large scale learning problems, which are becoming more common.
    3. Peter Bartlett, Elad Hazan, and Sasha Rakhlin had a paper on Adaptive Online Learning. This paper refines earlier results for online learning against an adversary via gradient descent, which is plausibly of great use in practice.
  4. MLOSS was giving out free T-shirts which were cool. I missed the workshop starting this effort at last year’s NIPS due to workshop overload, but open source machine learning is definitely of great and sound interest to the community.

Workshop Summary—Principles of Learning Problem Design

This is a summary of the workshop on Learning Problem Design which Alina and I ran at NIPS this year.

The first question many people have is “What is learning problem design?” This workshop is about admitting that solving learning problems does not start with labeled data, but rather somewhere before. When humans are hired to produce labels, this is usually not a serious problem because you can tell them precisely what semantics you want the labels to have, and you can fix some set of features in advance. However, when other methods are used, this becomes more problematic. This focus is important for Machine Learning because there are very large quantities of data which are not labeled by a hired human.

The title of the workshop was a bit ambitious, because a workshop is not long enough to synthesize a diversity of approaches into a coherent set of principles. For me, the posters at the end of the workshop were quite helpful in getting approaches to gel.

Here are some answers to “where do the labels come from?”:

  1. Simulation: Use a simulator (which need not be that good) to predict the cost of various choices and turn that into label information. Ashutosh had some cool demos showing the power of this approach. Gregory also presented a poster which might be viewed this way.
  2. Agreement: A label is a point of agreement. Luis often used an agreement mechanism to induce labels with games. Sham discussed the power of agreement to constrain learning algorithms. Huzefa’s work on bioprediction can be thought of as partly using agreement with previous structures to simulate the label of a new structure.
  3. Compilation: Labels can be found by compiling one learning problem into another. Mark and I both talked about reductions a bit, which come with some nice formal guarantees; a small sketch of one standard reduction appears after this list.
  4. Backprop: Labels are the signals in generalized backpropagation (David Bradley’s talk).
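
As one concrete instance of the compilation idea mentioned in item 3, here is a sketch of the standard one-against-all reduction from multiclass to binary classification. It is chosen for familiarity rather than taken from any particular talk, and `train_binary` is a placeholder for whatever base learner you prefer.

```python
def one_against_all_train(examples, classes, train_binary):
    """Compile a multiclass problem into one binary problem per class.
    `examples` is a list of (x, y) pairs and `train_binary` is a placeholder
    for any binary learner that returns a scoring function."""
    scorers = {}
    for k in classes:
        # binary label: "is this example of class k?"
        binary_examples = [(x, 1 if y == k else 0) for x, y in examples]
        scorers[k] = train_binary(binary_examples)
    return scorers

def one_against_all_predict(scorers, x):
    # predict the class whose binary scorer is most confident
    return max(scorers, key=lambda k: scorers[k](x))
```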

Some answers to “where do the data come from” are:

  1. Everywhere: The essential idea is to integrate as many data sources as possible. Rakesh had several algorithms which (in combination) allowed him to use a large number of diverse data sources in a text domain.
  2. Sparsity: A representation is formed by finding a sparse set of basis functions on otherwise totally unlabeled data. Rajat discussed self-taught learning algorithms which achieve this.
  3. Self-prediction: A representation is formed by learning to self-predict a set of raw features. Hal’s talk covered this idea.
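
As a generic illustration of the self-prediction idea (a sketch of the general recipe, not Hal's specific construction), one can train a predictor of each raw feature from the remaining features and use those predictions as a learned representation; `train_regressor` here is a placeholder for any regression learner that returns a callable model.

```python
def self_prediction_representation(rows, train_regressor):
    """For each raw feature j, fit a predictor of feature j from the other
    features of each row; the vector of those predictions is the learned
    representation. `train_regressor(inputs, targets)` is a placeholder for
    any regression learner that returns a callable model."""
    num_features = len(rows[0])
    predictors = []
    for j in range(num_features):
        inputs = [[v for i, v in enumerate(row) if i != j] for row in rows]
        targets = [row[j] for row in rows]
        predictors.append(train_regressor(inputs, targets))

    def represent(row):
        return [predictors[j]([v for i, v in enumerate(row) if i != j])
                for j in range(num_features)]

    return represent
```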

A workshop like this is successful if it informs the questions we ask (and answer) in the future. Some natural questions (some of which were discussed) are:

  1. What is a natural, sufficient language for adding prior information into a learning system? Which languages are insufficient? Shai described a sense in which kernels are insufficient as a language for prior information. Bayesian analysis emphasizes reasoning about the parameters of the model, but the language of examples or maybe label expectations may be more natural.
  2. What is missing from the above lists? And are the elements of the lists actually distinct?
  3. How do we modularize? Many of the approaches use problem-specific tricks. That’s to be expected for a direction of research which is just starting, but it’s important to modularize these techniques so they can be repeatedly and easily applied. Achieving modularity in a manner which supports prior information properly seems tricky.
  4. How do we formalize and analyze? Of the items listed above, I feel like we only have some reasonable understanding of the compilation approach. The other approaches and questions are essentially unexplored territory where some serious thinking may be helpful.