Vowpal Wabbit Open Source Project

Today brings a new release of the Vowpal Wabbit fast online learning software. This time, unlike the previous release, the project itself is going open source, developing via GitHub. The latest and greatest can be downloaded via:

git clone git://github.com/JohnLangford/vowpal_wabbit.git

If you aren’t familiar with git, it’s a distributed version control system which supports quick and easy branching, as well as reconciliation.

This version of the code is confirmed to compile without complaint on at least some flavors of OSX as well as Linux boxes.

As much of the point of this project is pushing the limits of fast and effective machine learning, let me mention a few datapoints from my experience.

  1. The program can effectively scale up to batch-style training on sparse terafeature (i.e. 10^12 sparse features) size datasets. The limiting factor is typically i/o.
  2. I started using the real datasets from the large-scale learning workshop as a convenient benchmark. The largest dataset takes about 10 minutes. (This is using the native features that the organizers intended as a starting point, yet which all contestants used. In some cases, that admittedly gives you performance nowhere near optimal.)
  3. After using this program for a while, it's substantially altered my perception of what a large-scale learning problem is. This causes confusion when people brag about computational performance on tiny datasets with only 10^5 examples 🙂

I would also like to emphasize that this is intended as an open source project rather than merely a code drop, as occurred last time. What I think this project has to offer researchers is an infrastructure for implementing fast online algorithms. It's reasonably straightforward to implement your own tweaked algorithm, automatically gaining the substantial benefits of the surrounding code that deals with file formats, file caching, buffering, etc. In a very real sense, most of the code is this surrounding stuff, which you don't have to rewrite to benefit from. For people applying machine learning, there is some obvious value in getting very fast feedback in a batch setting, as well as having an algorithm that actually works in a real online setting.
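
To make the "implement your own tweaked algorithm" claim concrete, here is a minimal sketch of the kind of core loop involved: online gradient descent with squared loss over sparse features. The names here are mine and this is a generic illustration, not VW's actual internals, which wrap hashing, caching, and threading around a loop like this.

#include <cstdio>
#include <unordered_map>
#include <utility>
#include <vector>

// A sparse example: (feature id, feature value) pairs.
using Example = std::vector<std::pair<int, float>>;

struct OnlineLearner {
  std::unordered_map<int, float> weights;
  float eta = 0.1f;  // learning rate

  float predict(const Example& x) const {
    float dot = 0;
    for (const auto& [id, v] : x) {
      auto it = weights.find(id);
      if (it != weights.end()) dot += it->second * v;
    }
    return dot;
  }

  // The update rule is the piece a researcher would swap out or tweak.
  void update(const Example& x, float label) {
    float err = predict(x) - label;  // derivative of squared loss (up to a factor of 2)
    for (const auto& [id, v] : x) weights[id] -= eta * err * v;
  }
};

int main() {
  OnlineLearner learner;
  Example x = {{1, 1.0f}, {7, 0.5f}};
  for (int i = 0; i < 100; ++i) learner.update(x, 2.0f);
  std::printf("prediction: %f\n", learner.predict(x));  // approaches 2.0
}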

As one example of the ability to reuse the code for other purposes, an effective general purpose online implementation of the Offset Tree is included. I haven’t seen any other implementation of an algorithm for learning in the agnostic partial label setting, so this code may be of substantial interest for people encountering these sorts of problems.
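
For readers unfamiliar with the setting: in the partial label problem you only observe the reward of the action you actually took. Below is a hedged sketch of the two-action offset trick at the heart of the Offset Tree, as I understand the paper rather than as shipped in the code; the full algorithm composes this construction over a binary tree on the K actions.

#include <cstdio>

// My reading of the binary offset construction: one observed
// (action, reward, probability) triple becomes one importance-weighted
// binary classification example, using 1/2 as the offset.
struct BinaryExample {
  int label;      // the action the binary classifier should prefer
  double weight;  // importance weight for the underlying learner
};

BinaryExample offset_example(int action, double reward, double prob) {
  // Rewards assumed in [0,1]; prob is the probability the exploration
  // policy assigned to the observed action.
  if (reward >= 0.5)
    return {action, (reward - 0.5) / prob};    // reinforce the observed action
  return {1 - action, (0.5 - reward) / prob};  // push toward the other action
}

int main() {
  // Action 1 was chosen with probability 0.3 and earned reward 0.9.
  BinaryExample ex = offset_example(1, 0.9, 0.3);
  std::printf("label=%d weight=%.3f\n", ex.label, ex.weight);
}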

The difference between this version and the previous one is a nearly total rewrite. Some bigger changes are:

  1. We dropped SEG for now, due to code complexity.
  2. Multicore parallelization proceeds in a different fashion—parallelization over features instead of examples. This works better with caches. Note that all parallelization of the core algorithm is meaningless unless you use the -q flag, because otherwise you are i/o bound.
  3. The code is more deeply threaded, with a separate thread for parsing.
  4. There is support for several different loss functions, and it's easy to add your own; a sketch of what that involves appears just after this list.
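
To illustrate point 4, here is one plausible shape for such an interface, a sketch rather than VW's exact API: supply the loss and its derivative with respect to the prediction, and the gradient descent core stays unchanged.

#include <cmath>

// Sketch of a pluggable loss: the core update only needs the derivative
// of the loss with respect to the prediction. Names here are illustrative,
// not VW's actual class names.
struct loss_function {
  virtual double loss(double prediction, double label) const = 0;
  virtual double first_derivative(double prediction, double label) const = 0;
  virtual ~loss_function() = default;
};

struct squared_loss : loss_function {
  double loss(double p, double y) const override { return (p - y) * (p - y); }
  double first_derivative(double p, double y) const override { return 2 * (p - y); }
};

struct logistic_loss : loss_function {  // labels in {-1, +1}
  double loss(double p, double y) const override {
    return std::log1p(std::exp(-y * p));
  }
  double first_derivative(double p, double y) const override {
    return -y / (1.0 + std::exp(y * p));
  }
};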

I’m interested in any bug reports or suggestions for the code. I have substantial confidence that this code can do interesting and useful things, but improving it is a constant and ongoing process.

Interesting papers at KDD

I attended KDD this year. The conference has always had a strong grounding in what works based on the KDDcup, but it has developed a halo of workshops on various subjects. It seems that KDD has become a place where the economy meets machine learning in a stronger sense than at many other conferences.

There were several papers that other people might like to take a look at.

  1. Yehuda Koren, Collaborative Filtering with Temporal Dynamics. This paper describes how to incorporate temporal dynamics into a couple of collaborative filtering approaches. It also won a best paper award.
  2. D. Sculley, Robert Malkin, Sugato Basu, Roberto J. Bayardo, Predicting Bounce Rates in Sponsored Search Advertisements. The basic claim of this paper is that the probability people immediately leave (“bounce”) after clicking on an advertisement is predictable.
  3. Frank McSherry and Ilya Mironov, Differentially Private Recommender Systems: Building Privacy into the Netflix Prize Contenders. The basic claim here is that it is possible to beat the baseline system in Netflix and preserve a nontrivial amount of user privacy. It's the first demonstration I've seen of this sort, and it's particularly impressive that they used a strong algorithm-independent definition of privacy first stated by Cynthia Dwork (recalled just after this list).
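
For reference, Dwork's definition (restated here from memory) says a randomized algorithm M is ε-differentially private if for every pair of datasets D and D' differing in one user's data, and every set S of possible outputs,

\[
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S].
\]

The guarantee constrains the algorithm's output distribution directly, with no reference to any particular attack or adversary, which is what makes the definition algorithm-independent.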

KDD also experimented this year with crowdvine which was interesting. Compared to Mark Reid‘s efforts with ICML, they managed to get substantially more participation. There seemed to be two reasons: the conference organizers more deeply integrated and encouraged the use of crowdvine, and crowdvine has certain handy additional uses—you can create your own personal schedule for instance, which incidentally provides some vague global notion of the popularity of various papers. The biggest drawback I found was that the papers themselves were not integrated into the website.

The Machine Learning Forum

Dear Fellow Machine Learners,

For the past year or so I have become increasingly frustrated with the peer review system in our field. I constantly get asked to review papers in which I have no interest. At the same time, as an action editor at JMLR, I constantly have to harass people to review papers. When I send papers to conferences and to journals I often get rejected with reviews that, at least in my mind, make no sense. Finally, I have a very hard time keeping up with the best new work, because I don't know where to look for it…

I decided to try and do something to improve the situation. I started a new web site, which I decided to call "The machine learning forum"; the URL is http://themachinelearningforum.org

The main idea behind this web site is to remove anonymity from the review process. On this site, all opinions are attributed to the actual person who expressed them. I expect that this will improve the quality of the reviews. Another obvious effect is that there will be fewer negative reviews; weak papers will tend not to get reviewed at all, but then again, is that such a bad thing?

If you have any interest in this endeavor, please register on the web site and please submit a photo of yourself. Based on the information on your web site I will decide whether to grant you "author" privileges that would allow you to write reviews and overviews. Anybody can submit pointers to publications that they would like somebody to review. Anybody can participate in the discussion forum, which is a fancy message board with threads etc.

Right now the main contributions I am looking for are "overviews".

Overviews are pages written by somebody who is an authority in some area (for example, Kamalika Chaudhuri is an authority on mixture models) in which they list the main papers in the area and give a high-level description of how the papers relate. These overviews are intended to serve as an entry point for somebody who wants to learn about that subfield. Overviews *can* reference the work of the author of the overview. This is unlike reviews, in which the reviewer cannot be the author of the reviewed paper.

I hope you are interested enough to give this a try!

Comments are very welcome.

Cheers

Yoav Freund (yfreund@ucsd.edu)

Netflix nearly done

A $1M qualifying result was achieved on the public Netflix test set by a 3-way ensemble team. This is just in time for Yehuda‘s presentation at KDD, which I’m sure will be one of the best attended ever.

This isn’t quite over—there are a few days for another super-conglomerate team to come together and there is some small chance that the performance is nonrepresentative of the final test set, but I expect not.

Regardless of the final outcome, the biggest lesson for ML from the Netflix contest has been the formidable performance edge of ensemble methods.
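
To make that lesson concrete: the leading teams combined the predictions of many models, often with weights fit on a holdout set. Here is a minimal generic sketch of least-squares blending for two models; the names and toy data are mine, and the actual teams' pipelines were vastly more elaborate.

#include <array>
#include <cstddef>
#include <cstdio>
#include <vector>

// Fit blend weights w1, w2 minimizing sum_i (w1*p1[i] + w2*p2[i] - y[i])^2
// by solving the 2x2 normal equations with Cramer's rule.
std::array<double, 2> blend_weights(const std::vector<double>& p1,
                                    const std::vector<double>& p2,
                                    const std::vector<double>& y) {
  double a11 = 0, a12 = 0, a22 = 0, b1 = 0, b2 = 0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    a11 += p1[i] * p1[i];
    a12 += p1[i] * p2[i];
    a22 += p2[i] * p2[i];
    b1 += p1[i] * y[i];
    b2 += p2[i] * y[i];
  }
  double det = a11 * a22 - a12 * a12;
  return {(b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det};
}

int main() {
  // Toy holdout: two models' predicted ratings and the true ratings.
  std::vector<double> p1 = {3.1, 4.2, 2.0, 4.8};
  std::vector<double> p2 = {3.4, 3.9, 2.5, 4.6};
  std::vector<double> y  = {3.0, 4.0, 2.0, 5.0};
  std::array<double, 2> w = blend_weights(p1, p2, y);
  std::printf("blend = %.3f * model1 + %.3f * model2\n", w[0], w[1]);
}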

Interesting papers at UAICMOLT 2009

Here’s a list of papers that I found interesting at ICML/COLT/UAI in 2009.

  1. Elad Hazan and Comandur Seshadhri, Efficient learning algorithms for changing environments at ICML. This paper shows how to adapt learning algorithms that compete with fixed predictors to compete with changing policies. The definition of regret they deal with seems particularly useful in many situations.
  2. Hal Daume, Unsupervised Search-based Structured Prediction at ICML. This paper shows a technique for reducing unsupervised learning to supervised learning which (a) makes a fast unsupervised learning algorithm and (b) makes semisupervised learning both easy and highly effective.
  3. There were two papers with similar results on active learning in the KWIK framework for linear regression, both reducing the sample complexity to . One was Nicolo Cesa-Bianchi, Claudio Gentile, and Francesco Orabona Robust Bounds for Classification via Selective Sampling at ICML and the other was Thomas Walsh, Istvan Szita, Carlos Diuk, Michael Littman Exploring compact reinforcement-learning representations with linear regression at UAI. The UAI paper covers application to RL as well.
  4. Ping Li, Improving Compressed Counting at UAI. This paper talks about how to keep track of the moments in a datastream with very little space and computation. I’m not sure I have a use for it yet, but it seems like a cool piece of basic technology.
  5. Mark Reid and Bob Williamson Surrogate Regret Bounds for Proper Losses at ICML. This paper points out that via the integral characterization of proper losses, proper scoring rules can be reduced to binary classification. The results unify and generalize the Probing and Quanting reductions we worked on previously. This paper is also related to Nicolas Lambert‘s work, which is quite thought provoking in terms of specifying what is learnable.
  6. Daniel Hsu, Sham M. Kakade and Tong Zhang, A Spectral Algorithm for Learning Hidden Markov Models at COLT. This paper shows that a subset of HMMs can be learned using an SVD-based algorithm.
  7. Samory Kpotufe, Escaping the curse of dimensionality with a tree-based regressor at COLT. This paper shows how to directly apply regression in high-dimensional vector spaces and have it succeed anyway because the data is naturally low-dimensional.
  8. Shai Ben-David, David Pal and Shai Shalev-Shwartz. Agnostic Online Learning at COLT. This paper characterizes the ability to learn when an adversary is choosing features in the online setting as the “Littlestone dimension”.
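
For readers who haven't seen the definition (restated here from memory): the Littlestone dimension of a class H is the largest depth d of a complete binary "mistake tree" whose internal nodes are labeled by instances, such that every root-to-leaf path, reading the branch taken at each node as a label in {0,1}, is consistent with some h in H. Littlestone's classical result is that in the realizable online model this quantity is exactly the optimal mistake bound:

\[
\mathrm{Ldim}(H) \;=\; \min_{\text{algorithms } A} \;\max_{\text{sequences realizable by } H} \#\{\text{mistakes made by } A\}.
\]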