The Large Scale Learning Survey Tutorial

Ron Bekkerman initiated an effort to create an edited book on parallel machine learning that Misha and I have been helping with. The breadth of efforts to parallelize machine learning surprised me: I was only aware of a small fraction initially.

This put us in a unique position, with knowledge of a wide array of different efforts, so it is natural to put together a survey tutorial on the subject of parallel learning for KDD, tomorrow. This tutorial is not limited to the book itself however, as several interesting new algorithms have come out since we started inviting chapters.

This tutorial should interest anyone trying to use machine learning on significant quantities of data, anyone interested in developing algorithms for such, and of course who has bragging rights to the fastest learning algorithm on planet earth 🙂

(Also note the Modeling with Hadoop tutorial just before ours which deals with one way of trying to speed up learning algorithms. We have almost no overlap.)

Vowpal Wabbit 6.0

I just released Vowpal Wabbit 6.0. Since the last version:

  1. VW is now 2-3 orders of magnitude faster at linear learning, primarily thanks to Alekh. Given the baseline, this is loads of fun, allowing us to easily deal with terafeature datasets, and dwarfing the scale of any other open source projects. The core improvement here comes from effective parallelization over kilonode clusters (either Hadoop or not). This code is highly scalable, so it even helps with clusters of size 2 (and doesn’t hurt for clusters of size 1). The core allreduce technique appears widely and easily reused—we’ve already used it to parallelize Conjugate Gradient, LBFGS, and two variants of online learning. We’ll be documenting how to do this more thoroughly, but for now “README_cluster” and associated scripts should provide a good starting point.
  2. The new LBFGS code from Miro seems to commonly dominate the existing conjugate gradient code in time/quality tradeoffs.
  3. The new matrix factorization code from Jake adds a core algorithm.
  4. We finally have basic persistent daemon support, again with Jake’s help.
  5. Adaptive gradient calculations can now be made dimensionally correct, following up on Paul’s post, yielding a better algorithm. And Nikos sped it up further with SSE native inverse square root.
  6. The LDA core is perhaps twice as fast after Paul educated us about SSE and representational gymnastics.

All of the above was done without adding significant new dependencies, so the code should compile easily.

The VW mailing list has been slowly growing, and is a good place to ask questions.

Enjoy.

Interesting thing at UAI 2011

I had a chance to attend UAI this year, where several papers interested me, including:

  1. Hoifung Poon and Pedro Domingos Sum-Product Networks: A New Deep Architecture. We’ve already discussed this one, but in a nutshell, they identify a large class of efficiently normalizable distributions and do learning with it.
  2. Yao-Liang Yu and Dale Schuurmans, Rank/norm regularization with closed-form solutions: Application to subspace clustering. This paper is about matrices, and in particular they prove that certain matrices are the solution of matrix optimizations. I’m not matrix inclined enough to fully appreciate this one, but I believe many others may be, and anytime closed form solutions come into play, you get 2 order of magnitude speedups, as they show experimentally.
  3. Laurent Charlin, Richard Zemel and Craig Boutilier, A Framework for Optimizing Paper Matching. This is about what works in matching papers to reviewers, as has been tested at several previous NIPS. We are looking into using this system for ICML 2012.

In addition I wanted to comment on Karl Friston‘s invited talk. At the outset, he made a claim that seems outlandish to me: The way the brain works is to minimize surprise as measured by a probabilistic model. The majority of the talk was not actually about this—instead it was about how probabilistic models can plausibly do things that you might not have thought possible, such as birdsong. Nevertheless, I think several of us in the room ended up stuck on the claim in questions afterward.

My personal belief is that world modeling (probabilistic or not) is a useful subroutine for intelligence, but it could not possibly be the entirety of intelligence. A key reason for this is the bandwidth of our senses—we simply take in too much information to model everything with equal attention. It seems critical for the efficient functioning of intelligence that only things which might plausibly matter are modeled, and only to the degree that matters. In other words, I do not model the precise placement of items on my desk, or even the precise content of my desk, because these details simply do not matter.

This argument can be made in another way. Suppose for the moment that all the brain does is probabilistic modeling. Then, the primary notion of failure to model is “surprise”, which is low probability events occurring. Surprises (stumbles, car wrecks, and other accidents) certainly can be unpleasant, but this could be correct if modeling is a subroutine as well. The clincher is that there are many unpleasant things which are not surprises, including keeping your head under water, fasting, and self-inflicted wounds.

Accounting for the unpleasantness of these events requires more than probabilistic modeling. In other words, it requires rewards, which is why reinforcement learning is important. As a byproduct, rewards also naturally create a focus of attention, addressing the computational efficiency issue. Believing that intelligence is just probabilistic modeling is another example of simple wrong answer.

Interesting papers at COLT 2011

Since John did not attend COLT this year, I have been volunteered to report back on the hot stuff at this year’s meeting. The conference seemed to have pretty high quality stuff this year, and I found plenty of interesting papers on all the three days. I’m gonna pick some of my favorites going through the program in a chronological order.

The first session on matrices seemed interesting for two reasons. First, the papers were quite nice. But more interestingly, this is a topic that has had a lot of presence in Statistics and Compressed sensing literature recently. So it was good to see high-dimensional matrices finally make their entry at COLT. The paper of Ohad and Shai on Collaborative Filtering with the Trace Norm: Learning, Bounding, and Transducing provides non-trivial guarantees on trace norm regularization in an agnostic setup, while Rina and Nati show how Rademacher averages can be used to get sharper results for matrix completion problems in their paper Concentration-Based Guarantees for Low-Rank Matrix Reconstruction. Both the papers seemed to share the flavor of a learning theorists’ take at compressed sensing that I enjoyed seeing.

The best student paper by Amit, Sivan and Shai2 on Multiclass Learnability and the ERM principle showed a crucial distinction between binary and multiclass classification. Every ERM procedure is not equally good for multiclass classification. In particular, there are multiclass problems where some ERM learners succeed while others are inconsistent, in sharp contrast to the binary case. They also present some intuition on what characterizes a good ERM procedure for the multiclass setting.

I enjoyed all the three papers in the online learning session quite a bit. Jake, Elad and Peter show the equivalence of Blackwell approachability and low regret in their paper Blackwell Approachability and No-Regret Learning are Equivalent, with applications to efficient algorithms for calibration. Sasha, Karthik and Ambuj won the best paper award for their paper Online Learning: Beyond Regret which shows how the tools like sequential Rademacher averages and sequential covering numbers can be used to capture the minimax value of a large class of games, beyond just external regret settings. Their paper with Dean on Complexity-Based Approach to Calibration with Checking Rules showed a nice application of these techniques to the calibration problem.

The impromptu session was quite a hit this year with ~15 talks. I was quite disappointed to see none of them turn up on my NIPS review stack 🙂

Rob managed to save some money by solving his open problem from last COLT, together with Indraneel and Cynthia. Their paper The Rate of Convergence of AdaBoost was interesting as it made me realize how much difference the boundedness of rates can make to the theoretical properties of an algorithm. Adaboost is greedy coordinate descent, for which convergence over a compact domain is well-studied, but what makes this challenging here is that Adaboost doesn’t impose any bound on the weights. The way this paper gets around and pays a penalty for these issues seemed quite interesting.

I also liked the paper by Gabor, David and Csaba on Minimax Regret of Finite Partial-Monitoring Games in Stochastic Environments. This paper shows a tight characterization of partial regret games that have an optimal regret of 0, T1/2, T2/3 or T. While some sufficient conditions one way or the other were known before, theirs is a first complete characterization in my knowledge.

Overall, this was a thoroughly enjoyable COLT, both in the technical content and in the choice of venue.

Interesting Neural Network Papers at ICML 2011

Maybe it’s too early to call, but with four separate Neural Network sessions at this year’s ICML, it looks like Neural Networks are making a comeback. Here are my highlights of these sessions. In general, my feeling is that these papers both demystify deep learning and show its broader applicability.

The first observation I made is that the once disreputable “Neural” nomenclature is being used again in lieu of “deep learning”. Maybe it’s because Adam Coates et al. showed that single layer networks can work surprisingly well.

Another surprising result out of Andrew Ng’s group comes from Andrew Saxe et al. who show that certain convolutional pooling architectures can obtain close to state-of-the-art performance with random weights (that is, without actually learning).

Of course, in most cases we do want to train these models eventually. There were two interesting papers on the topic of training neural networks. In the first, Quoc Le et al. show that a simple, off-the-shelf L-BFGS optimizer is often preferable to stochastic gradient descent.

Secondly, Martens and Sutskever from Geoff Hinton’s group show how to train recurrent neural networks for sequence tasks that exhibit very long range dependencies:

It will be interesting to see whether this type of training will allow recurrent neural networks to outperform CRFs on some standard sequence tasks and data sets. It certainly seems possible since even with standard L-BFGS our recursive neural network (see previous post) can outperform CRF-type models on several challenging computer vision tasks such as semantic segmentation of scene images. This common vision task of labeling each pixel with an object class has not received much attention from the deep learning community.
Apart from the vision experiments, this paper further solidifies the trend that neural networks are being used more and more in natural language processing. In our case, the RNN-based model was used for structure prediction. Another neat example of this trend comes from Yann Dauphin et al. in Yoshua Bengio’s group. They present an interesting solution for learning with sparse bag-of-word representations.

Such sparse representations had previously been problematic for neural architectures.

In summary, these papers have helped us understand a bit better which “deep” or “neural” architectures work, why they work and how we should train them. Furthermore, the scope of problems that these architectures can handle has been widened to harder and more real-life problems.

Of the non-neural papers, these two papers stood out for me: