Fall Machine Learning Events

Many machine learning events are coming up this fall.

  1. September 9, abstracts for the New York Machine Learning Symposium are due. If interested, send a 2-page PDF, and note that we:
    1. widened submissions to be from anybody, not just students.
    2. set aside a larger fraction of time for contributed submissions.
  2. September 15, there is a machine learning meetup, where I’ll be discussing terascale learning at AOL.
  3. September 16, there is a CS&Econ day at New York Academy of Sciences. This is not ML focused, but it’s easy to imagine interest.
  4. Starting September 23, NIPS workshop submissions come due. As usual, there are too many good ones, so I won’t be able to attend all those that interest me. I do hope some workshop makers consider ICML this coming summer, as we are increasing to a 2-day format for you. Here are a few that interest me:
    1. Big Learning is about dealing with lots of data. Abstracts are due September 30.
    2. The Bayes Bandits workshop. Abstracts are due September 23.
    3. The Personalized Medicine workshop
    4. The Learning Semantics workshop. Abstracts are due September 26.
    5. The ML Relations workshop. Abstracts are due September 30.
    6. The Hierarchical Learning workshop. Challenge submissions are due October 17, and abstracts are due October 21.
    7. The Computational Tradeoffs workshop. Abstracts are due October 17.
    8. The Model Selection workshop. Abstracts are due September 24.
  5. October 16-17 is the Singularity Summit in New York. This is for the AIists and only peripherally about ML.
  6. October 16-21 is a Predictive Analytics World in New York. As machine learning goes industrial, we see industrial-style conferences rapidly developing.
  7. October 21, there is the New York ML Symposium. In addition to what’s there, Chris Wiggins is looking into setting up a session for startups and those interested in them to get to know each other, as last year.
  8. December 16-17, NIPS workshops in Granada, Spain.

The Large Scale Learning Survey Tutorial

Ron Bekkerman initiated an effort to create an edited book on parallel machine learning that Misha and I have been helping with. The breadth of efforts to parallelize machine learning surprised me: I was only aware of a small fraction initially.

This put us in a unique position, with knowledge of a wide array of different efforts, so it was natural to put together a survey tutorial on parallel learning for KDD, tomorrow. The tutorial is not limited to the book itself, however, as several interesting new algorithms have come out since we started inviting chapters.

This tutorial should interest anyone trying to use machine learning on significant quantities of data, anyone interested in developing algorithms for such, and of course anyone curious about who has bragging rights to the fastest learning algorithm on planet earth 🙂

(Also note the Modeling with Hadoop tutorial just before ours, which deals with one way of trying to speed up learning algorithms. We have almost no overlap.)

Vowpal Wabbit 6.0

I just released Vowpal Wabbit 6.0. Since the last version:

  1. VW is now 2-3 orders of magnitude faster at linear learning, primarily thanks to Alekh. Given the baseline, this is loads of fun, allowing us to easily deal with terafeature datasets, and dwarfing the scale of any other open source project. The core improvement here comes from effective parallelization over kilonode clusters (either Hadoop or not). This code is highly scalable, so it even helps with clusters of size 2 (and doesn’t hurt for clusters of size 1). The core allreduce technique appears to be widely and easily reusable: we’ve already used it to parallelize Conjugate Gradient, LBFGS, and two variants of online learning. We’ll be documenting how to do this more thoroughly, but for now “README_cluster” and associated scripts should provide a good starting point.
  2. The new LBFGS code from Miro seems to commonly dominate the existing conjugate gradient code in time/quality tradeoffs.
  3. The new matrix factorization code from Jake adds a core algorithm.
  4. We finally have basic persistent daemon support, again with Jake’s help.
  5. Adaptive gradient calculations can now be made dimensionally correct, following up on Paul’s post, yielding a better algorithm. Nikos sped it up further with the SSE native inverse square root.
  6. The LDA core is perhaps twice as fast after Paul educated us about SSE and representational gymnastics.
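
To give a feel for the allreduce idea in item 1, here is a minimal single-process sketch in Python. It is not the code in the release: the function names (allreduce_sum, shard_gradient, parallel_gradient_descent), the implicit binary-tree layout, and the squared-loss linear model are all assumptions made for illustration. Each simulated node computes a gradient on its own shard of the data, the gradients are summed up a tree, and the total is handed back to every node so that all nodes can apply the same update.

```python
from typing import List, Tuple

Vector = List[float]
Example = Tuple[Vector, float]


def allreduce_sum(local: List[Vector]) -> List[Vector]:
    """Simulate a tree allreduce: every node ends up with the elementwise sum.

    local[i] is the vector held by (hypothetical) node i. Nodes form an
    implicit binary tree in which the parent of node i is (i - 1) // 2.
    """
    n = len(local)
    partial = [list(v) for v in local]  # copy so the inputs are not mutated

    # Reduce phase: visit nodes from the highest index down, so each node has
    # already absorbed its children before it is added into its parent.
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // 2
        partial[parent] = [a + b for a, b in zip(partial[parent], partial[i])]

    # Broadcast phase: the root holds the total; hand a copy to every node.
    total = partial[0]
    return [list(total) for _ in range(n)]


def shard_gradient(w: Vector, shard: List[Example]) -> Vector:
    """Summed squared-loss gradient of a linear model on one node's shard."""
    g = [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += err * xj
    return g


def parallel_gradient_descent(shards: List[List[Example]], dim: int,
                              steps: int = 200, lr: float = 0.5) -> Vector:
    """Batch gradient descent where the only communication is one allreduce
    per step. All nodes would hold identical weights, so one copy is kept."""
    n_examples = sum(len(s) for s in shards)
    w = [0.0] * dim
    for _ in range(steps):
        local_grads = [shard_gradient(w, shard) for shard in shards]
        summed = allreduce_sum(local_grads)
        # Every node receives the same sum; node 0's copy is used here.
        w = [wi - lr * gi / n_examples for wi, gi in zip(w, summed[0])]
    return w
```

The learning algorithm itself never needs to know how many nodes there are: swapping a local sum for allreduce_sum is the only change, which is one way to read the claim that the technique is easy to reuse.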

All of the above was done without adding significant new dependencies, so the code should compile easily.

The VW mailing list has been slowly growing, and is a good place to ask questions.

Enjoy.

Interesting papers at COLT 2011

Since John did not attend COLT this year, I have been volunteered to report back on the hot stuff at this year’s meeting. The conference seemed to have pretty high quality stuff this year, and I found plenty of interesting papers on all three days. I’m gonna pick some of my favorites, going through the program in chronological order.

The first session on matrices seemed interesting for two reasons. First, the papers were quite nice. But more interestingly, this is a topic that has had a lot of presence in the statistics and compressed sensing literature recently. So it was good to see high-dimensional matrices finally make their entry at COLT. The paper of Ohad and Shai on Collaborative Filtering with the Trace Norm: Learning, Bounding, and Transducing provides non-trivial guarantees on trace norm regularization in an agnostic setup, while Rina and Nati show how Rademacher averages can be used to get sharper results for matrix completion problems in their paper Concentration-Based Guarantees for Low-Rank Matrix Reconstruction. Both papers seemed to share the flavor of a learning theorist’s take on compressed sensing that I enjoyed seeing.

The best student paper by Amit, Sivan and Shai² on Multiclass Learnability and the ERM principle showed a crucial distinction between binary and multiclass classification. Not every ERM procedure is equally good for multiclass classification. In particular, there are multiclass problems where some ERM learners succeed while others are inconsistent, in sharp contrast to the binary case. They also present some intuition on what characterizes a good ERM procedure for the multiclass setting.

I enjoyed all three papers in the online learning session quite a bit. Jake, Elad and Peter show the equivalence of Blackwell approachability and low regret in their paper Blackwell Approachability and No-Regret Learning are Equivalent, with applications to efficient algorithms for calibration. Sasha, Karthik and Ambuj won the best paper award for their paper Online Learning: Beyond Regret, which shows how tools like sequential Rademacher averages and sequential covering numbers can be used to capture the minimax value of a large class of games, beyond just external regret settings. Their paper with Dean on Complexity-Based Approach to Calibration with Checking Rules showed a nice application of these techniques to the calibration problem.

The impromptu session was quite a hit this year with ~15 talks. I was quite disappointed to see none of them turn up on my NIPS review stack 🙂

Rob managed to save some money by solving his open problem from last COLT, together with Indraneel and Cynthia. Their paper The Rate of Convergence of AdaBoost was interesting as it made me realize how much difference the boundedness of rates can make to the theoretical properties of an algorithm. AdaBoost is greedy coordinate descent, for which convergence over a compact domain is well-studied, but what makes this challenging here is that AdaBoost doesn’t impose any bound on the weights. The way this paper gets around these issues, and the penalty it pays for doing so, seemed quite interesting.
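
The coordinate descent view can be made concrete with a small sketch. The code below is an illustration, not the paper's analysis: it assumes a fixed, finite pool of weak hypotheses given as a table of ±1 predictions (preds, a hypothetical name), picks the coordinate with the largest absolute edge each round, and takes the exact line-search step on the exponential loss. Nothing constrains alphas, which is the unboundedness mentioned above.

```python
import math
from typing import List, Tuple


def exp_loss(margins: List[float]) -> float:
    """Average exponential loss given per-example margins y_i * F(x_i)."""
    return sum(math.exp(-m) for m in margins) / len(margins)


def adaboost(preds: List[List[int]], labels: List[int],
             rounds: int) -> Tuple[List[float], List[float]]:
    """AdaBoost over a fixed pool of weak hypotheses, as coordinate descent.

    preds[j][i] is the +/-1 prediction of weak hypothesis j on example i and
    labels[i] is the +/-1 label. Returns the combination weights and the
    exponential loss after each completed round.
    """
    alphas = [0.0] * len(preds)
    margins = [0.0] * len(labels)
    losses: List[float] = []
    for _ in range(rounds):
        # Example weights are proportional to exp(-margin).
        w = [math.exp(-m) for m in margins]
        z = sum(w)
        # edges[j] is minus the partial derivative of the loss along
        # coordinate j, divided by the current loss.
        edges = [sum(wi * y * h for wi, y, h in zip(w, labels, hj)) / z
                 for hj in preds]
        j = max(range(len(preds)), key=lambda k: abs(edges[k]))
        edge = edges[j]
        if edge == 0.0 or abs(edge) >= 1.0:
            # No descent direction, or a perfect hypothesis for which the
            # exact line search has no finite minimizer.
            break
        # Exact line search along coordinate j. The step may be negative,
        # and no bound is imposed on the accumulated weight.
        step = 0.5 * math.log((1.0 + edge) / (1.0 - edge))
        alphas[j] += step
        margins = [m + step * y * h
                   for m, y, h in zip(margins, labels, preds[j])]
        losses.append(exp_loss(margins))
    return alphas, losses
```

On a toy pool where no single hypothesis is perfect but a vote of several is, the loss keeps shrinking toward zero without ever reaching it, and the weights have to keep growing to make that happen. That is the situation where a compact-domain argument does not apply directly.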

I also liked the paper by Gabor, David and Csaba on Minimax Regret of Finite Partial-Monitoring Games in Stochastic Environments. This paper shows a tight characterization of partial-monitoring games that have an optimal regret of 0, T^(1/2), T^(2/3) or T. While some sufficient conditions one way or the other were known before, theirs is the first complete characterization to my knowledge.

Overall, this was a thoroughly enjoyable COLT, both in the technical content and in the choice of venue.

Interesting Neural Network Papers at ICML 2011

Maybe it’s too early to call, but with four separate Neural Network sessions at this year’s ICML, it looks like Neural Networks are making a comeback. Here are my highlights of these sessions. In general, my feeling is that these papers both demystify deep learning and show its broader applicability.

The first observation I made is that the once disreputable “Neural” nomenclature is being used again in lieu of “deep learning”. Maybe it’s because Adam Coates et al. showed that single layer networks can work surprisingly well.

Another surprising result out of Andrew Ng’s group comes from Andrew Saxe et al. who show that certain convolutional pooling architectures can obtain close to state-of-the-art performance with random weights (that is, without actually learning).

Of course, in most cases we do want to train these models eventually. There were two interesting papers on the topic of training neural networks. In the first, Quoc Le et al. show that a simple, off-the-shelf L-BFGS optimizer is often preferable to stochastic gradient descent.

Secondly, Martens and Sutskever from Geoff Hinton’s group show how to train recurrent neural networks for sequence tasks that exhibit very long range dependencies.

It will be interesting to see whether this type of training will allow recurrent neural networks to outperform CRFs on some standard sequence tasks and data sets. It certainly seems possible since even with standard L-BFGS our recursive neural network (see previous post) can outperform CRF-type models on several challenging computer vision tasks such as semantic segmentation of scene images. This common vision task of labeling each pixel with an object class has not received much attention from the deep learning community.

Apart from the vision experiments, this paper further solidifies the trend that neural networks are being used more and more in natural language processing. In our case, the RNN-based model was used for structure prediction. Another neat example of this trend comes from Yann Dauphin et al. in Yoshua Bengio’s group. They present an interesting solution for learning with sparse bag-of-words representations.

Such sparse representations had previously been problematic for neural architectures.

In summary, these papers have helped us understand a bit better which “deep” or “neural” architectures work, why they work and how we should train them. Furthermore, the scope of problems that these architectures can handle has been widened to harder and more real-life problems.

Of the non-neural papers, these two papers stood out for me: