Interesting Neural Network Papers at ICML 2011

Maybe it’s too early to call, but with four separate Neural Network sessions at this year’s ICML, it looks like Neural Networks are making a comeback. Here are my highlights of these sessions. In general, my feeling is that these papers both demystify deep learning and show its broader applicability.

The first observation I made is that the once disreputable “Neural” nomenclature is being used again in lieu of “deep learning”. Maybe it’s because Adam Coates et al. showed that single layer networks can work surprisingly well.

Another surprising result out of Andrew Ng’s group comes from Andrew Saxe et al. who show that certain convolutional pooling architectures can obtain close to state-of-the-art performance with random weights (that is, without actually learning).

Of course, in most cases we do want to train these models eventually. There were two interesting papers on the topic of training neural networks. In the first, Quoc Le et al. show that a simple, off-the-shelf L-BFGS optimizer is often preferable to stochastic gradient descent.

Secondly, Martens and Sutskever from Geoff Hinton’s group show how to train recurrent neural networks for sequence tasks that exhibit very long range dependencies:

It will be interesting to see whether this type of training will allow recurrent neural networks to outperform CRFs on some standard sequence tasks and data sets. It certainly seems possible since even with standard L-BFGS our recursive neural network (see previous post) can outperform CRF-type models on several challenging computer vision tasks such as semantic segmentation of scene images. This common vision task of labeling each pixel with an object class has not received much attention from the deep learning community.
Apart from the vision experiments, this paper further solidifies the trend that neural networks are being used more and more in natural language processing. In our case, the RNN-based model was used for structure prediction. Another neat example of this trend comes from Yann Dauphin et al. in Yoshua Bengio’s group. They present an interesting solution for learning with sparse bag-of-word representations.

Such sparse representations had previously been problematic for neural architectures.

In summary, these papers have helped us understand a bit better which “deep” or “neural” architectures work, why they work and how we should train them. Furthermore, the scope of problems that these architectures can handle has been widened to harder and more real-life problems.

Of the non-neural papers, these two papers stood out for me:

Asymmophobia

One striking feature of many machine learning algorithms is the gymnastics that designers go through to avoid symmetry breaking. In the most basic form of machine learning, there are labeled examples composed of features. Each of these can be treated symmetrically or asymmetrically by algorithms.

  1. feature symmetry Every feature is treated the same. In gradient update rules, the same update is applied whether the feature is first or last. In metric-based predictions, every feature is just as important in computing the distance.
  2. example symmetry Every example is treated the same. Batch learning algorithms are great exemplars of this approach.
  3. label symmetry Every label is treated the same. This is particularly noticeable in multiclass classification systems which predict according to arg maxl wl x but it occurs in many other places as well.

Empirically, breaking symmetry well seems to yield great algorithms.

  1. feature asymmetry For those who like the “boosting is stepwise additive regression on exponential loss” viewpoint (I don’t entirely), boosting is an example of symmetry breaking on features.
  2. example asymmetry Online learning introduces an example asymmetry. Aside from providing a mechanism for large scale learning, it also enables learning in entirely new (online) settings.
  3. label asymmetry Tree structured algorithms are good instances of example asymmetry. This includes both the older decision tree approaches like C4.5 and some newer ones we’ve worked on. These approaches are exponentially faster in the number of labels than more symmetric approaches.

The examples above are notably important, with good symmetry breaking approaches yielding substantially improved prediction or computational performance. Given such strong evidence that symmetry breaking is a desirable property, a basic question is: Why isn’t it more prevalent, and more thoroughly studied? One reasonable answer is that doing symmetry breaking well requires more serious thought about learning algorithm design, so researchers simply haven’t gotten to it. This answer appears incomplete.

A more complete answer is that many researchers seem to reflexively avoid symmetry breaking. A simple reason for this is the now pervasive use of Matlab in machine learning. Matlab is a handy tool for fast prototyping of learning algorithms, but it has an intrinsic language-level bias towards symmetric approaches since there are builtin primitives for matrix operations. A more complex reason is a pervasive reflex belief in fairness. While this is admirable when reviewing papers, it seems less so when designing learning algorithms. A third related reason seems to be a fear of doing unmotivated things. Anytime symmetry breaking is undertaken, the method for symmetry breaking is in question, and many people feel uncomfortable without a theorem suggesting the method is the right one. Since there are few theorems motivating symmetry breaking methods, it is often avoided.

What methods for symmetry breaking exist?

  1. Randomization. Neural Network learning algorithms which initialize the weights randomly exemplify this. I consider the randomization approach particularly weak. It makes experiments non-repeatable, and it seems like the sort of solution that someone with asymmophobia would come up with if they were forced to do something asymmetric.
  2. Arbitrary. Arbitrary symmetry breaking is something like random, except there is no randomness—you simply declare this feature/label/example comes first and that one second. This seems mildly better than the randomized approach, but still not inspiring.
  3. Data-driven. Boosting is a good example where a data-driven approach drives symmetry breaking (over features). Data-driven approaches for symmetry breaking seem the most sound, as they can result in improved performance.

While there are examples of learning algorithms doing symmetry breaking for features, labels, and examples individually, there aren’t any I know which do all three, well. What would such an algorithm look like?