Interesting Neural Network Papers at ICML 2011

Maybe it’s too early to call, but with four separate Neural Network sessions at this year’s ICML, it looks like Neural Networks are making a comeback. Here are my highlights of these sessions. In general, my feeling is that these papers both demystify deep learning and show its broader applicability.

The first observation I made is that the once disreputable “Neural” nomenclature is being used again in lieu of “deep learning”. Maybe it’s because Adam Coates et al. showed that single layer networks can work surprisingly well.

Another surprising result out of Andrew Ng’s group comes from Andrew Saxe et al., who show that certain convolutional pooling architectures can obtain close to state-of-the-art performance with random weights (that is, without actually learning).
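
To make the idea concrete, here is a minimal numpy sketch of that kind of feature extractor: random convolutional filters followed by rectification and coarse max-pooling. The filter count, filter size, and image size are made up for illustration and this is not the exact architecture from the paper; the only thing one would ever train is a simple classifier on top of the resulting features.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.RandomState(0)
num_filters, filter_size, pool_grid = 16, 7, 4
filters = rng.randn(num_filters, filter_size, filter_size)  # random, never trained

def random_conv_pool_features(image):
    """Map a 2-D grayscale image to a fixed-length feature vector."""
    features = []
    for f in filters:
        # convolve with a random filter, rectify, then max-pool over a coarse grid
        response = np.maximum(correlate2d(image, f, mode="valid"), 0.0)
        for rows in np.array_split(response, pool_grid, axis=0):
            for cell in np.array_split(rows, pool_grid, axis=1):
                features.append(cell.max())
    return np.array(features)  # only a simple classifier on these would be trained

feats = random_conv_pool_features(rng.rand(32, 32))
print(feats.shape)  # (num_filters * pool_grid**2,) = (256,)
```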

Of course, in most cases we do want to train these models eventually. There were two interesting papers on the topic of training neural networks. In the first, Quoc Le et al. show that a simple, off-the-shelf L-BFGS optimizer is often preferable to stochastic gradient descent.
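
As a rough illustration of what “off-the-shelf” means here, the sketch below trains a tiny one-hidden-layer network on hypothetical data (not the models or datasets from the paper) by packing all parameters into one flat vector and handing a loss-and-gradient function to scipy’s L-BFGS routine, instead of hand-tuning an SGD loop.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)
X = rng.randn(200, 10)                      # hypothetical inputs
y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)  # hypothetical regression targets
d, h = X.shape[1], 20                       # input dimension, hidden units

def unpack(theta):
    W1 = theta[:h * d].reshape(h, d)
    b1 = theta[h * d:h * d + h]
    W2 = theta[h * d + h:h * d + 2 * h]
    b2 = theta[-1]
    return W1, b1, W2, b2

def loss_and_grad(theta):
    W1, b1, W2, b2 = unpack(theta)
    z = np.tanh(X.dot(W1.T) + b1)           # hidden activations
    pred = z.dot(W2) + b2
    err = pred - y
    loss = 0.5 * np.mean(err ** 2)
    # backpropagate the squared error by hand
    d_pred = err / len(y)
    gW2 = z.T.dot(d_pred)
    gb2 = d_pred.sum()
    dz = np.outer(d_pred, W2) * (1 - z ** 2)
    gW1 = dz.T.dot(X)
    gb1 = dz.sum(axis=0)
    return loss, np.concatenate([gW1.ravel(), gb1, gW2, [gb2]])

theta0 = 0.1 * rng.randn(h * d + 2 * h + 1)
result = minimize(loss_and_grad, theta0, jac=True, method="L-BFGS-B")
print("final training loss:", result.fun)
```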

In the second, Martens and Sutskever from Geoff Hinton’s group show how to train recurrent neural networks on sequence tasks that exhibit very long-range dependencies.
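
For intuition on why such long-range dependencies are hard, here is a small numpy sketch (it does not implement the second-order optimizer from the paper) that rolls a vanilla RNN forward and tracks how the Jacobian of the hidden state with respect to the initial state shrinks as the sequence gets longer; this vanishing-gradient effect is what makes plain gradient methods struggle on these tasks.

```python
import numpy as np

rng = np.random.RandomState(0)
hidden_dim, input_dim, T = 50, 10, 200
W_hh = 0.5 * rng.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)
W_xh = rng.randn(hidden_dim, input_dim) / np.sqrt(input_dim)

h = np.zeros(hidden_dim)
jac = np.eye(hidden_dim)  # d h_t / d h_0, accumulated over time steps
for t in range(T):
    x_t = rng.randn(input_dim)
    h = np.tanh(W_hh.dot(h) + W_xh.dot(x_t))
    # Jacobian of this step is diag(1 - h^2) W_hh; chain it onto the running product
    jac = np.diag(1 - h ** 2).dot(W_hh).dot(jac)
    if (t + 1) % 50 == 0:
        print("step %3d  |d h_t / d h_0| = %.2e" % (t + 1, np.linalg.norm(jac)))
```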

It will be interesting to see whether their training method will allow recurrent neural networks to outperform CRFs on some standard sequence tasks and data sets. It certainly seems possible, since even with standard L-BFGS our recursive neural network (see previous post) can outperform CRF-type models on several challenging computer vision tasks, such as semantic segmentation of scene images. This common vision task of labeling each pixel with an object class has not received much attention from the deep learning community.
Apart from the vision experiments, this paper further solidifies the trend that neural networks are being used more and more in natural language processing. In our case, the recursive neural network was used for structure prediction. Another neat example of this trend comes from Yann Dauphin et al. in Yoshua Bengio’s group. They present an interesting solution for learning with sparse bag-of-words representations.

Such sparse representations had previously been problematic for neural architectures.
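
The basic trick for coping with such sparsity can be sketched in a few lines. This is a generic illustration with made-up sizes, not Dauphin et al.’s actual algorithm: since only a handful of words occur in any one document, the first layer of the network only needs the weight rows of those words, so the vocabulary-sized dense matrix-vector product never has to be formed.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, hidden_dim = 50000, 100
W = 0.01 * rng.randn(vocab_size, hidden_dim)  # first-layer weights / word embeddings
b = np.zeros(hidden_dim)

def hidden_from_sparse_bow(word_ids, counts):
    """Hidden layer for a document given only its nonzero word indices and counts."""
    # equivalent to np.tanh(x_dense.dot(W) + b), but touches O(nonzeros) rows, not all 50k
    return np.tanh(counts.dot(W[word_ids]) + b)

# hypothetical document containing three distinct words
word_ids = np.array([12, 4031, 49999])
counts = np.array([2.0, 1.0, 3.0])
print(hidden_from_sparse_bow(word_ids, counts).shape)  # (100,)
```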

In summary, these papers have helped us understand a bit better which “deep” or “neural” architectures work, why they work, and how we should train them. Furthermore, the scope of problems that these architectures can handle has been widened to harder, more realistic problems.

Of the non-neural papers, these two stood out for me: