Interesting Neural Network Papers at ICML 2011

Maybe it’s too early to call, but with four separate Neural Network sessions at this year’s ICML, it looks like Neural Networks are making a comeback. Here are my highlights of these sessions. In general, my feeling is that these papers both demystify deep learning and show its broader applicability.

The first observation I made is that the once disreputable “Neural” nomenclature is being used again in lieu of “deep learning”. Maybe it’s because Adam Coates et al. showed that single layer networks can work surprisingly well.

Another surprising result out of Andrew Ng’s group comes from Andrew Saxe et al. who show that certain convolutional pooling architectures can obtain close to state-of-the-art performance with random weights (that is, without actually learning).

Of course, in most cases we do want to train these models eventually. There were two interesting papers on the topic of training neural networks. In the first, Quoc Le et al. show that a simple, off-the-shelf L-BFGS optimizer is often preferable to stochastic gradient descent.
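For a sense of what that looks like in code, here is a stand-in comparison on a convex logistic-regression objective (not the deep networks or datasets of the paper): scipy’s off-the-shelf L-BFGS needs no step sizes or schedules, while the SGD loop requires a hand-picked learning rate. All data and settings below are invented for illustration.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))
    y = (X @ rng.normal(size=20) + 0.3 * rng.normal(size=500) > 0).astype(float)

    def loss_and_grad(w):
        """Logistic loss and its gradient, standing in for a network objective."""
        p = 1.0 / (1.0 + np.exp(-X @ w))
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        grad = X.T @ (p - y) / len(y)
        return loss, grad

    # Off-the-shelf batch L-BFGS: nothing to tune.
    res = minimize(loss_and_grad, np.zeros(20), jac=True, method="L-BFGS-B")
    print("L-BFGS loss:", res.fun)

    # The SGD alternative needs a step size (and usually a schedule) chosen by hand.
    w, lr = np.zeros(20), 0.1
    for epoch in range(50):
        for i in rng.permutation(len(y)):
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w -= lr * (p - y[i]) * X[i]
    print("SGD loss:", loss_and_grad(w)[0])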

Secondly, Martens and Sutskever from Geoff Hinton’s group show how to train recurrent neural networks for sequence tasks that exhibit very long-range dependencies.

It will be interesting to see whether this type of training will allow recurrent neural networks to outperform CRFs on some standard sequence tasks and data sets. It certainly seems possible since even with standard L-BFGS our recursive neural network (see previous post) can outperform CRF-type models on several challenging computer vision tasks such as semantic segmentation of scene images. This common vision task of labeling each pixel with an object class has not received much attention from the deep learning community.
Apart from the vision experiments, this paper further solidifies the trend that neural networks are being used more and more in natural language processing. In our case, the RNN-based model was used for structured prediction. Another neat example of this trend comes from Yann Dauphin et al. in Yoshua Bengio’s group. They present an interesting solution for learning with sparse bag-of-words representations.

Such sparse representations had previously been problematic for neural architectures.

In summary, these papers have helped us understand a bit better which “deep” or “neural” architectures work, why they work and how we should train them. Furthermore, the scope of problems that these architectures can handle has been widened to harder and more real-life problems.

Of the non-neural papers, these two papers stood out for me:

Learning Track of International Planning Competition

The International Planning Competition (IPC) is a biennial event organized in the context of the International Conference on Automated Planning and Scheduling (ICAPS). This year, for the first time, there will be a learning track of the competition. For more information you can go to the competition website.

The competitions are typically organized around a number of planning domains that can vary from year to year, where a planning domain is simply a class of problems that share a common action schema. For example, Blocksworld is a well-known planning domain that contains a problem instance for each possible initial tower configuration and goal configuration. Other domains have included Logistics, Airport, Freecell, PipesWorld, and many others. For each domain the competition includes a number of problems (say 40-50), and the planners are run on each problem with a time limit (around 30 minutes). The problems are hard enough that many are not solved within the time limit.
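To make “action schema” concrete: IPC domains are written in PDDL, but purely as an illustration, here is a Python rendering of a Blocksworld-style unstack schema operating on states represented as sets of ground atoms. All names below are invented for the sketch.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Atom:
        pred: str
        args: tuple

    def unstack(state, x, y):
        """Blocksworld action schema: pick block x up off block y."""
        pre = {Atom("on", (x, y)), Atom("clear", (x,)), Atom("armempty", ())}
        if not pre <= state:
            return None  # preconditions unsatisfied in this state
        return (state - pre) | {Atom("holding", (x,)), Atom("clear", (y,))}

    # Two problem instances of the same domain: one schema, different atoms.
    init1 = {Atom("on", ("A", "B")), Atom("clear", ("A",)), Atom("armempty", ())}
    init2 = {Atom("on", ("C", "A")), Atom("on", ("A", "B")),
             Atom("clear", ("C",)), Atom("armempty", ())}
    print(unstack(init1, "A", "B"))  # applicable: A is clear
    print(unstack(init2, "A", "B"))  # None: C sits on A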

Given that the planners are asked to solve many problems from individual domains, and that problems within a domain generally share common solution structure, it makes sense to consider learning from previous problem-solving experience in a domain to better solve future problems in the same domain. Here “better solve” could mean either solving the problems significantly more quickly or finding better-quality plans in a similar time frame. However, no planner in any of the competitions has included a learning component. Rather, to quote Foreigner, for these planners each problem “feels like the first time”.

Perhaps one reason that planners have not incorporated learning into the competition setting is that there has not been much overlap between the ICML and ICAPS communities, although that is changing. Another reason is perhaps that the structure of the competition would deduct any “learning time” from a planner’s 30 minutes per problem, which could reduce the potential benefits.

The learning track for the 2008 competition is being designed so that learning time is not counted against planners. Rather, there will be a distinct learning phase and a distinct evaluation phase. During the learning phase the planners will be provided with the set of domains to be used in the competition and example problems from each. The evaluation phase will be conducted like the current competition, with the exception that the learning-based planners will be allowed to read in a learned domain-specific “knowledge file” when solving the problems. This structure is designed to help answer the following question:

Do we have techniques that can leverage a learning period to outperform state-of-the-art non-learning techniques across a wide range of domains?

My current belief is that the answer is “no”. I certainly have never seen any such demonstration. This is not because of a lack of work in the area of “learning to plan”, as there has been a long history dating back to some of the early planners (see my horribly outdated resource page for a taste). While many of the learning approaches have shown some degree of success, the evaluations have typically been very narrow, focusing on only 2 to 3 domains and often only a few problems. My intuition, grounded in personal experience, is that most (all) of these systems would be quite brittle when taken to new domains. The hope is that the learning track of the competition will force us to take the issue of robustness seriously and soon lead to learning systems that convincingly outperform non-learning planners across a wide range of domains given proper training experience.

I hope to see a wide range of approaches entered into the competition. I’ll mention two styles of approaches that might be particularly interesting to readers of this blog.

First, one might consider applying reinforcement learning to learn “generalized policies” that can be applied to any problem from a domain. Recall that here the domain model is provided to us, so applying RL would mean that the domain model is used as a sort of simulator in which an RL algorithm is run. RL is particularly difficult in these domains due to the challenges in developing an appropriate representation for learning value functions and/or policies—the states can be viewed as sets of ground relational atoms, rather than the more typical n-dimensional vectors common in RL. Another challenge is the extremely sparse reward, which is obtained only at goal states. There has been some work on applying RL to IPC-style domains (e.g. relational reinforcement learning, approximate policy iteration, policy gradient) but much improvement is needed to compete with non-learning planners.
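As a minimal rendering of these challenges, here is a toy tabular Q-learning sketch on a three-block world: states are frozensets of ground relational atoms, the domain model serves as the simulator, and the only reward arrives at goal states. A table over ground states cannot generalize across problem instances, which is exactly the representation problem mentioned above. Everything here is a stand-in, not any of the cited systems.

    import random

    random.seed(0)

    # Mini Blocksworld used as its own simulator: a state is a frozenset of
    # ground atoms ("on", x, y); an action moves a clear block x from old to new.
    BLOCKS = ("A", "B", "C")

    def clear(state, x):
        return all(("on", b, x) not in state for b in BLOCKS)

    def actions(state):
        acts = []
        for x in BLOCKS:
            if not clear(state, x):
                continue
            below = next(y for y in BLOCKS + ("table",) if ("on", x, y) in state)
            for y in BLOCKS + ("table",):
                if y not in (x, below) and (y == "table" or clear(state, y)):
                    acts.append((x, below, y))
        return acts

    def step(state, act):
        x, old, new = act
        return frozenset(state - {("on", x, old)} | {("on", x, new)})

    init = frozenset({("on", "C", "A"), ("on", "A", "table"), ("on", "B", "table")})
    goal = {("on", "A", "B"), ("on", "B", "C")}

    # Tabular Q-learning; reward appears only when the goal atoms hold.
    Q, eps, alpha, gamma = {}, 0.2, 0.5, 0.95
    for episode in range(2000):
        s = init
        for _ in range(25):
            acts = actions(s)
            a = (random.choice(acts) if random.random() < eps
                 else max(acts, key=lambda a: Q.get((s, a), 0.0)))
            s2 = step(s, a)
            done = goal <= s2
            target = 1.0 if done else gamma * max(Q.get((s2, b), 0.0) for b in actions(s2))
            Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
            s = s2
            if done:
                break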

Second, one might consider structured-classification techniques for this problem. Here one might view the planning problem as an input X and the plan as the structured output Y. Training data can be generated by solving example planning problems using state-of-the-art planners, perhaps with significant resources. This approach has been studied under the name max-margin planning, but applied to a very different class of planning problems. Other work has considered applying the Learning as Search Optimization (LaSO) framework to IPC-style domains with some success. Some of the challenges here are automatically producing an appropriate feature set for a given planning domain, and coping with ambiguity in the training data. Ambiguity here refers to the fact that there are often a huge number of equally good plans for a given problem, while the training data contains only one or a small number of them, making it incomplete.
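To make the structured-classification view concrete, here is a drastically simplified sketch in the spirit of LaSO, shrunk to a beam of size one on an invented grid “domain”. The features and teacher plans are stand-ins; a real entry would operate on IPC domain descriptions and planner-generated training plans.

    import numpy as np

    # Toy deterministic "domain": states are grid cells, actions move right or
    # up, the goal is the far corner. Teacher plans stand in for plans produced
    # offline by a state-of-the-art planner.
    N = 4
    ACTIONS = [(1, 0), (0, 1)]

    def step(s, a):
        return (min(s[0] + a[0], N), min(s[1] + a[1], N))

    def feats(s, a):
        # Hand-built features of a state-action pair; producing such features
        # automatically from a domain description is one of the open challenges.
        nxt = step(s, a)
        return np.array([1.0, a[0], a[1], N - nxt[0], N - nxt[1]])

    def teacher_plan(start):
        # Training plan: move right until the wall, then up. Note the ambiguity:
        # many other paths are equally short, but the data contains only this one.
        plan, s = [], start
        while s != (N, N):
            a = (1, 0) if s[0] < N else (0, 1)
            plan.append((s, a))
            s = step(s, a)
        return plan

    # Structured perceptron over search decisions (a beam-of-one LaSO variant):
    # update whenever the weights prefer a different action than the teacher.
    w = np.zeros(5)
    for epoch in range(10):
        for start in [(0, 0), (1, 0), (0, 2)]:
            for s, a_star in teacher_plan(start):
                a_hat = max(ACTIONS, key=lambda a: float(w @ feats(s, a)))
                if a_hat != a_star:
                    w += feats(s, a_star) - feats(s, a_hat)
    print("learned weights:", w)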

An ICML reject

Hal, Daniel, and I have been working on the algorithm Searn for structured prediction. This was just conditionally accepted and then rejected from ICML, and we were quite surprised. By any reasonable criterion, it seems this is an interesting algorithm.

  1. Prediction performance: Searn performed better than any other algorithm on all the problems we tested against, using the same feature set. This is true even using the numbers reported by the authors in their own papers.
  2. Theoretical underpinning: Searn is a reduction which comes with a reduction guarantee: good performance of the base classifier implies good performance for the overall system. As far as we know, no theorem of this type exists for any other structured prediction algorithm.
  3. Speed: Searn has no problem handling much larger datasets than the other algorithms we tested against.
  4. Simplicity: Given code for a binary classifier and a problem-specific search algorithm, only a few tens of lines are necessary to implement Searn (a toy sketch in this spirit appears after this list).
  5. Generality: Searn applies in a superset of the situations where other algorithms apply. It can use (and cope with) arbitrary loss functions over the data. It can also solve new problems not previously thought of as learning problems (and we do so!).
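To back up the simplicity claim, here is a heavily simplified Searn-style sketch for a toy sequence-labeling task under Hamming loss: start from the optimal (oracle) policy, generate cost-sensitive examples by rolling out each action with the current policy, retrain, and interpolate away from the oracle. The toy data, the least-squares cost regressor standing in for the cost-sensitive base learner, and the decay schedule are all stand-ins for illustration rather than the exact construction in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy task with sequential dependence: the correct label is the current
    # observation XORed with the previous correct label.
    def make_seq(T=10):
        x = rng.integers(0, 2, size=T)
        y = np.zeros(T, dtype=int)
        for t in range(T):
            y[t] = int(x[t]) ^ (int(y[t - 1]) if t > 0 else 0)
        return x, y

    train = [make_seq() for _ in range(200)]
    K, D = 2, 4  # number of labels, feature dimension

    def feats(x, t, prev):
        return np.array([1.0, x[t], prev, x[t] * prev])

    W = np.zeros((K, D))  # per-label linear cost estimates (the learned classifier)

    def policy(x, t, prev, y, p_oracle):
        """Stochastic mixture of the oracle (true label) and the learned classifier."""
        if rng.random() < p_oracle:
            return int(y[t])
        return int(np.argmin(W @ feats(x, t, prev)))

    p_oracle = 1.0  # begin with the optimal policy, as Searn prescribes
    for it in range(6):
        Phi, C = [], []
        for x, y in train:
            prev = 0
            for t in range(len(x)):
                costs = []
                for a in range(K):  # roll out each action; record total Hamming loss
                    loss, p = int(a != y[t]), a
                    for s in range(t + 1, len(x)):
                        b = policy(x, s, p, y, p_oracle)
                        loss += int(b != y[s])
                        p = b
                    costs.append(loss)
                Phi.append(feats(x, t, prev))
                C.append(costs)
                prev = policy(x, t, prev, y, p_oracle)  # roll in with current policy
        # Cost-sensitive "base learner": least squares onto the rollout costs.
        W = np.linalg.lstsq(np.asarray(Phi), np.asarray(C, float), rcond=None)[0].T
        p_oracle *= 0.7  # interpolate away from the oracle across iterations

    errs = total = 0  # evaluate the purely learned policy on fresh sequences
    for x, y in [make_seq() for _ in range(100)]:
        prev = 0
        for t in range(len(x)):
            a = int(np.argmin(W @ feats(x, t, prev)))
            errs += int(a != y[t])
            total += 1
            prev = a
    print("test Hamming error:", errs / total)

On this toy task the learned policy should reach low Hamming error, since the features can express the oracle’s decision rule; the point is only to show the shape of the algorithm.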

Much of the time, papers are about tradeoffs. A very typical (although often unstated) tradeoff is expending extra computation to gain better predictive performance in practice. In Searn, it seems there is no tradeoff compared to other approaches: you can have your lunch and eat it too.

In addition, this solves an open problem: yes, any classifier can be effectively and efficiently applied to complex structured prediction problems via Searn.

Why reject?
We, as respectfully as possible, simply disagree with the SPC about the stated grounds for rejection.

  1. The paper is difficult to read. This is true to some extent, but it doesn’t seem a reasonable basis for rejection. Fundamentally, there is a lot going on in terms of new ways of thinking about the problem, and that simply requires some patience to read. One endorsement of this comes from a reviewer who said

    In the end, I do think there is something there, but I think its introduction should have been like that for CRF. Present the idea, with an understanding of why it should work, with one or two easily replicable examples. Then over a few papers reapplying it to new settings, people would get it.

    It isn’t often that a reviewer complains that there is too much content so it should be split across multiple papers.

  2. The SPC stated:

    The results, though, which essentially show that good local classifiers imply good global performance, are not that significant, and hold for other approaches that use local classifiers as building blocks. After all, perfect local classifiers provide perfect local accuracy, and therefore provide perfect global accuracy, and again provide perfect global loss of any kind.

    Both sentences are simply false in the setting we consider. In particular, no other algorithms appear to have a good local performance to global performance guarantee for general global loss functions. Furthermore, it is not the case that perfect local performance implies perfect global performance except (perhaps) in a noise-free world. Most of us believe that the problems we address typically contain fundamental ambiguities and noise (that was certainly our mathematical model). It’s easy to set up a (noisy) distribution over inputs+loss such that best-possible-up-to-the-noise-limit local predictors are globally suboptimal. For example, over two binary outputs with P(0,1)=0.4, P(1,0)=0.35, and P(1,1)=0.25, the best per-coordinate predictions are y1=1 and y2=1, yet (1,1) is the least likely joint output.

  3. The SPC wanted us to contrast with Michael Collins, Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, EMNLP02. We instead contrasted with Michael Collins and Brian Roark, Incremental Parsing with the Perceptron Algorithm, ACL04. I believe any reasonable reading of these and the Searn paper will find the ACL04 paper supersedes the EMNLP02 paper in relevance.
  4. The SPC wanted us to contrast with IBT in V. Punyakanok, D. Roth, W. Yih, and D. Zimak, Learning and Inference over Constrained Output. IJCAI 2005. Here we messed up and confused it with V. Punyakanok and D. Roth, The Use of Classifiers in Sequential Inference. NIPS 2001 (which we did compare with). IBT is a modification of the earlier algorithm which integrates global information (in the form of problem specific constraints) into the local training process yielding performance gains. This modification addresses an orthogonal issue: Searn has not been used in any experiments yet where this side information is available. Searn is made to cope with global loss functions rather than global constraints.

These reasons for rejection seem (1) weak, (2) false, (3) very weak, and (4) weak.

Why post this? There are several reasons.

  1. The biggest personal reason is that we want to build on Searn. For whatever reason, it is psychologically difficult to build on rejected work. Posting it here gives us some sense that it is “out there” which may make it easier to build on it. We are trying to do research and simply don’t want to be delayed for 6 months or a year.
  2. One of the purposes of publishing a paper is to acquire feedback about the research, and there is some opportunity here.
  3. This is a nice algorithm for which we can easily imagine immediate use and general interest. Hal will be releasing the source code shortly which should make it very easy to use.
  4. Communal mental hygiene. It should not be the case, under any circumstances, that an (S)PC/reviewer makes false statements. If you think something is true but aren’t sure, it is appropriate to say “I think …” rather than simply asserting it as a fact. Asserting nonfact as fact is injurious to the process of research because it distorts the understanding of other people.

We aren’t looking for a flamefest. If there are no comments or simply comments about Searn, that’s fine. We aren’t looking for a reversal of the decision. A final decision has to be made for any conference and for better or worse this is it. The PC chairs should not be pestered—they are very busy people. Running ICML is a very difficult task, and we understand that some mistakes are inevitable when there are hundreds of papers involved.

This post violates a standard: avoiding talking about specific papers the poster has been working on. To remedy this, I will consider posts on other rejected papers. Any such post must have a very strong case.

Gradients everywhere

One of the basic observations from the atomic learning workshop is that gradient-based optimization is pervasive. For example, at least 7 (of 12) speakers used the word ‘gradient’ in their talks, and several others may have been approximating a gradient. The essential useful quality of a gradient is that it decouples local updates from global optimization. Restated: given a gradient, we can determine how to change individual parameters of the system so as to improve overall performance.
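A toy rendering of that decoupling: on a simple quadratic, each parameter follows its own coordinate of the gradient, with no coordination beyond the gradient itself, and the global objective improves anyway.

    import numpy as np

    # Global objective over several parameters; the gradient prescribes a purely
    # local update for each coordinate.
    A = np.diag(np.arange(1.0, 6.0))

    def f(w):
        return 0.5 * w @ A @ w

    w = np.ones(5)
    for _ in range(100):
        w -= 0.1 * (A @ w)  # per-parameter moves derived from the gradient of f
    print("objective after descent:", f(w))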

It’s easy to feel depressed about this and think “nothing has happened”, but that appears untrue. Many of the talks were about clever techniques for computing gradients where your calculus textbook breaks down.

  1. Sometimes there are clever approximations of the gradient. (Simon Osindero)
  2. Sometimes we can compute constrained gradients via iterated gradient/project steps. (Ben Taskar)
  3. Sometimes we can compute gradients of mildly nondifferentiable functions anyway. (Drew Bagnell)
  4. Even given a gradient, the choice of update is unclear and might be cleverly made. (Nic Schraudolph)

Perhaps a more extreme example of this is AdaBoost, which repeatedly reuses a classifier learner to implicitly optimize a gradient. Viewed as a gradient optimization algorithm, AdaBoost is a sublinear algorithm (in the number of implicit parameters) when applied to decision trees.
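To see the gradient view concretely: AdaBoost’s example weights are exactly the magnitudes of the gradient of the exponential loss with respect to the current ensemble scores. Here is a small self-contained sketch with hand-rolled decision stumps (an illustration, not any particular paper’s code).

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # labels in {-1, +1}

    def fit_stump(X, y, w):
        """Return the weighted-error-minimizing (error, feature, threshold, sign)."""
        best = None
        for j in range(X.shape[1]):
            for thr in X[:, j]:
                for s in (1, -1):
                    pred = s * np.where(X[:, j] > thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, s)
        return best

    F = np.zeros(len(y))  # current ensemble scores
    for t in range(20):
        # The reweighting IS the gradient: w_i = exp(-y_i F_i) is the magnitude
        # of the derivative of sum_i exp(-y_i F_i) with respect to F_i.
        w = np.exp(-y * F)
        w /= w.sum()
        err, j, thr, s = fit_stump(X, y, w)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        F += alpha * s * np.where(X[:, j] > thr, 1, -1)

    print("training error:", float(np.mean(np.sign(F) != y)))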

SVM Adaptability

Several recent papers have shown that SVM-like optimizations can be used to handle several large families of loss functions.

This is a good thing because it is implausible that the loss function imposed by the world cannot be taken into account in the process of solving a prediction problem. Even people used to the hard-core Bayesian approach to learning often note that some approximations are almost inevitable in specifying a prior and/or integrating to achieve a posterior. Taking into account how the system will be evaluated can allow both computational effort and design effort to be focused so as to improve performance.

A current laundry list of capabilities includes (a toy rendition of the first item is sketched after the list):

  1. 2002 multiclass SVM including arbitrary cost matrices
  2. ICML 2003 Hidden Markov Models
  3. NIPS 2003 Markov Networks (see some discussion)
  4. EMNLP 2004 Context free grammars
  5. ICML 2004 Any loss (with much computation)
  6. ICML 2005 Any constrained linear prediction model (that’s my own name for it).
  7. ICML 2005 Any loss dependent on a contingency table
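As a toy rendition of the first item, here is a subgradient implementation of a cost-sensitive multiclass SVM with margin rescaling on synthetic data; the cost matrix, data, and hyperparameters are invented, and the cited papers’ exact formulations differ.

    import numpy as np

    rng = np.random.default_rng(0)
    K, D = 3, 5
    means = 2.0 * rng.normal(size=(K, D))
    X = np.vstack([rng.normal(m, 1.0, size=(100, D)) for m in means])
    y = np.repeat(np.arange(K), 100)

    # Arbitrary cost matrix: cost[i, j] is the price of predicting j when i is true.
    cost = np.array([[0.0, 1.0, 4.0],
                     [1.0, 0.0, 1.0],
                     [4.0, 1.0, 0.0]])

    # Subgradient descent on the cost-augmented multiclass hinge: costlier
    # mistakes are required to be separated by larger margins.
    W = np.zeros((K, D))
    lam, lr = 1e-3, 0.1
    for epoch in range(20):
        for i in rng.permutation(len(y)):
            scores = W @ X[i]
            viol = scores + cost[y[i]] - scores[y[i]]
            j = int(np.argmax(viol))
            W *= 1.0 - lr * lam  # shrinkage from the L2 regularizer
            if j != y[i]:
                W[y[i]] += lr * X[i]
                W[j] -= lr * X[i]

    pred = np.argmax(X @ W.T, axis=1)
    print("average cost on training data:", cost[y, pred].mean())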

I am personally interested in how this relates to the learning reductions work, which has similar goals but works at a different abstraction level (the learning problem rather than the algorithmic mechanism). The difference in abstraction implies that anything solvable by reduction should be solvable by a direct algorithmic mechanism. However, comparing and contrasting the results I know of, it seems that what is solvable via reduction to classification and what is solvable via direct SVM-like methods are currently incomparable.

  1. Can SVMs be tuned to directly solve (example-dependent) cost-sensitive classification? Obviously, they can be tuned indirectly via reduction, but it is easy to imagine more tractable direct optimizations.
  2. How efficiently can learning reductions be used to solve structured prediction problems? Structured prediction problems are instances of cost-sensitive classification, but the regret transform produced by this embedding is too weak to be of interest.
  3. Are there any problems efficiently solvable by SVM-like algorithms which are not efficiently solvable via learning reductions?