2011 Summer Conference Deadline Season

Machine learning always welcomes the new year with paper deadlines for summer conferences. This year, we have:

Conference | Paper Deadline | When/Where | Double blind? | Author Feedback? | Notes
---------- | -------------- | ---------- | ------------- | ---------------- | -----
ICML | February 1 | June 28-July 2, Bellevue, Washington, USA | Y | Y | Weak colocation with ACL
COLT | February 11 | July 9-July 11, Budapest, Hungary | N | N | Colocated with FOCM
KDD | February 11/18 | August 21-24, San Diego, California, USA | N | N |
UAI | March 18 | July 14-17, Barcelona, Spain | Y | N |

The larger conferences are on the west coast in the United States, while the smaller ones are in Europe.

Herman Goldstine 2011

Vikas points out the Herman Goldstine Fellowship at IBM. I was a Herman Goldstine Fellow, and benefited from the experience a great deal—that’s where work on learning reductions started. If you can do research independently, it’s recommended. Applications are due January 6.

Vowpal Wabbit, version 5.0, and the second heresy

I’ve released version 5.0 of the Vowpal Wabbit online learning software. The major number has changed since the last release because I regard all earlier versions as obsolete: there are several new algorithms and features, including substantial changes and upgrades to the default learning algorithm.

The biggest changes are new algorithms:

  1. Nikos and I improved the default algorithm. The basic update rule still uses gradient descent, but the size of the update is carefully controlled so that it’s impossible to overrun the label. In addition, the normalization has changed. Computationally, these changes are virtually free and yield better results, sometimes much better. Less careful updates can be reenabled with --loss_function classic, although results are still not identical to previous versions due to the normalization changes.
  2. Nikos also implemented per-feature learning rates as per these two papers. Often, this works better than the default algorithm. It isn’t the default because it isn’t (yet) as adaptable in terms of learning rate decay. This is enabled with --adaptive, and learned regressors are compatible with the default. Computationally, you might see a factor of 4 slowdown if using ‘-q’. Nikos noticed that the phenomenal Quake inverse square root hack applies, making this substantially faster than a naive implementation.
  3. Nikos and Daniel also implemented active learning derived from this paper, usable via --active_simulation (to test parameters on an existing supervised dataset) or --active_learning (to do the real thing). This runs at full speed, which is much faster than is reasonable in any active learning scenario. We see this approach dominating supervised learning on all classification datasets so far, often with far fewer labeled examples required, as the theory predicts. The learned predictor is compatible with the default.
  4. Olivier helped me implement preconditioned conjugate gradient based on Jonathan Shewchuk’s tutorial. This is a batch algorithm and hence requires multiple passes over any dataset to do something useful. Each step of conjugate gradient requires 2 passes. The advantage of CG is that it converges relatively quickly via the use of second derivative information. This can be particularly helpful if your features are of widely differing scales. The use of --regularization 0.001 (or smaller) is almost required with --conjugate_gradient, as it will otherwise overfit hard. This implementation has two advantages over the basic approach: it implicitly computes a Hessian in O(n) time, where n is the number of features, and it operates out of core, making it applicable to datasets that don’t conveniently fit in RAM. The learned predictor is compatible with the default, although you’ll notice that a factor of 8 more RAM is required when learning.
  5. Matt Hoffman and I implemented Online Latent Dirichlet Allocation. This code is still experimental and likely to change over the next week. It really does a minibatch update under the hood. The code appears to be substantially faster than Matt’s earlier Python implementation, making this probably the most efficient LDA anywhere. LDA is still much slower than online linear learning, as it is quite computationally heavy in comparison. It is perhaps a good candidate for GPU optimization.
  6. Nikos, Daniel, and I have been experimenting with more online cluster parallel learning algorithms (--corrective, --backprop, --delayed_global). We aren’t yet satisfied with these, although they are improving. Details are at the LCCC workshop.
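
To make the "impossible to overrun the label" property from item 1 concrete, here is a minimal sketch for squared loss on a linear predictor. It is an illustration only and not the rule Vowpal Wabbit actually uses: it simply caps the step size so the new prediction on the current example cannot pass the label. The function names (`predict`, `plain_update`, `safe_update`) and the cap itself are assumptions made for this example.

```python
def predict(w, x):
    """Linear prediction w . x for dense feature lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def plain_update(w, x, y, eta):
    """Ordinary gradient step on squared loss; can overshoot the label."""
    p = predict(w, x)
    return [wi + eta * (y - p) * xi for wi, xi in zip(w, x)]


def safe_update(w, x, y, eta):
    """Gradient step with a capped step size (illustrative assumption).

    After a step of size s the prediction on x becomes
    p + s * (y - p) * (x . x), so keeping s <= 1 / (x . x)
    means the prediction moves toward y but never past it.
    """
    xx = sum(xi * xi for xi in x)
    if xx == 0.0:
        return list(w)  # no active features, nothing to update
    p = predict(w, x)
    step = min(eta, 1.0 / xx)
    return [wi + step * (y - p) * xi for wi, xi in zip(w, x)]


if __name__ == "__main__":
    w0, x, y, eta = [0.0, 0.0], [3.0, 4.0], 1.0, 0.5
    print("plain:", predict(plain_update(w0, x, y, eta), x))
    print("safe: ", predict(safe_update(w0, x, y, eta), x))
```

With these inputs the plain step moves the prediction from 0 to 12.5, far past the label of 1, while the capped step lands on the label (up to rounding). Features with a large scale are exactly where an uncapped step misbehaves, which is why controlling the update size matters.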

In addition, Ariel added a test suite, Shravan helped with ngrams, and there are several other minor new features and bug fixes including a very subtle one caught by Vaclav.

The documentation on the website hasn’t kept up with the code. I’m planning to rectify that over the next week, and have a new tutorial starting at 2pm in the LCCC room for those interested. Yes, I’ll not be skiing 🙂