Manik and I are organizing the extreme classification workshop at NIPS this year. We have a number of good speakers lined up, but I would further encourage anyone working in the area to submit an abstract by October 9. I believe this is an idea whose time has now come.
The NIPS website doesn’t have other workshops listed yet, but I expect several others to be of significant interest.
Maybe it’s too early to call, but with four separate Neural Network sessions at this year’s ICML, it looks like Neural Networks are making a comeback. Here are my highlights of these sessions. In general, my feeling is that these papers both demystify deep learning and show its broader applicability.
The first observation I made is that the once disreputable “Neural” nomenclature is being used again in lieu of “deep learning”. Maybe it’s because Adam Coates et al. showed that single layer networks can work surprisingly well.
Another surprising result out of Andrew Ng’s group comes from Andrew Saxe et al. who show that certain convolutional pooling architectures can obtain close to state-of-the-art performance with random weights (that is, without actually learning).
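To make "random weights plus pooling" concrete, here is a minimal sketch of a feature extractor whose filters are drawn at random and never trained. It is a 1-D toy under my own assumptions (the filter size, the absolute-value rectification, the non-overlapping max pooling and the function name are all hypothetical choices), not the architecture from the paper. In a full pipeline only a simple classifier on top of these features would be learned.

```python
import random

def random_conv_pool_features(signal, filters, pool_size):
    """Convolve with fixed random filters, rectify, then max-pool each response map."""
    features = []
    for weights in filters:
        k = len(weights)
        # "Valid" 1-D convolution (really cross-correlation), followed by abs() rectification.
        responses = [
            abs(sum(weights[j] * signal[i + j] for j in range(k)))
            for i in range(len(signal) - k + 1)
        ]
        # Non-overlapping max pooling over windows of pool_size responses.
        for start in range(0, len(responses) - pool_size + 1, pool_size):
            features.append(max(responses[start:start + pool_size]))
    return features

rng = random.Random(0)  # fixed seed so the random filters are reproducible
# Four filters of width 3, sampled once and never updated (assumed sizes).
filters = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(4)]

signal = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # made-up input
features = random_conv_pool_features(signal, filters, pool_size=4)
```

Each filter yields 8 responses on this 10-sample signal, which pool down to 2 values, so the feature vector has 4 × 2 = 8 entries.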
Of course, in most cases we do want to train these models eventually. There were two interesting papers on the topic of training neural networks. In the first, Quoc Le et al. show that a simple, off-the-shelf L-BFGS optimizer is often preferable to stochastic gradient descent.
Secondly, Martens and Sutskever from Geoff Hinton’s group show how to train recurrent neural networks for sequence tasks that exhibit very long range dependencies.
It will be interesting to see whether this type of training will allow recurrent neural networks to outperform CRFs on some standard sequence tasks and data sets. It certainly seems possible since even with standard L-BFGS our recursive neural network (see previous post) can outperform CRF-type models on several challenging computer vision tasks such as semantic segmentation of scene images. This common vision task of labeling each pixel with an object class has not received much attention from the deep learning community.
Apart from the vision experiments, this paper further solidifies the trend that neural networks are being used more and more in natural language processing. In our case, the RNN-based model was used for structure prediction. Another neat example of this trend comes from Yann Dauphin et al. in Yoshua Bengio’s group. They present an interesting solution for learning with sparse bag-of-word representations.
Such sparse representations had previously been problematic for neural architectures.
In summary, these papers have helped us understand a bit better which “deep” or “neural” architectures work, why they work and how we should train them. Furthermore, the scope of problems that these architectures can handle has been widened to harder and more real-life problems.
Of the non-neural papers, these two papers stood out for me:
About 4 years ago, I speculated that decision trees qualify as a deep learning algorithm because they can make decisions which are substantially nonlinear in the input representation. Ping Li has proved this correct empirically at UAI, by showing that boosted decision trees can beat deep belief networks on versions of MNIST which are artificially hardened so as to make them solvable only by deep learning algorithms.
This is an important point, because the ability to solve these sorts of problems is probably the best objective definition of a deep learning algorithm we have. I’m not that surprised. In my experience, if you can accept the computational drawbacks of a boosted decision tree, they can achieve pretty good performance.
Geoff Hinton once told me that the great thing about deep belief networks is that they work. I understand that Ping had very substantial difficulty in getting this published, so I hope some reviewers step up to the standard of valuing what works.
A $1M qualifying result was achieved on the public Netflix test set by a 3-way ensemble team. This is just in time for Yehuda’s presentation at KDD, which I’m sure will be one of the best attended ever.
This isn’t quite over—there are a few days for another super-conglomerate team to come together and there is some small chance that the performance is nonrepresentative of the final test set, but I expect not.
Regardless of the final outcome, the biggest lesson for ML from the Netflix contest has been the formidable performance edge of ensemble methods.
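The simplest form of an ensemble is just averaging the predictions of several models. The sketch below uses made-up ratings and made-up model outputs (not Netflix data, and not the blending scheme any team actually used) to show how averaging three predictors with partly uncorrelated errors can give a lower RMSE than any one of them. The names `rmse` and `blend` are my own.

```python
import math

def rmse(predictions, targets):
    """Root mean squared error between two equal-length lists."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(targets))

def blend(prediction_lists):
    """Uniformly average the predictions of several models, item by item."""
    n_models = len(prediction_lists)
    n_items = len(prediction_lists[0])
    return [sum(preds[i] for preds in prediction_lists) / n_models
            for i in range(n_items)]

# Hypothetical true ratings and three hypothetical models' predictions.
targets = [4.0, 3.0, 5.0, 2.0, 1.0]
model_a = [4.6, 2.8, 5.4, 1.5, 1.3]
model_b = [3.6, 3.5, 5.3, 1.9, 0.4]
model_c = [4.2, 3.3, 4.4, 2.4, 0.9]

blended = blend([model_a, model_b, model_c])
individual_errors = [rmse(m, targets) for m in (model_a, model_b, model_c)]
blended_error = rmse(blended, targets)
```

On these numbers the blended error comes out well below the best individual error, because the models' mistakes partly cancel. Real contest ensembles typically learn the blending weights rather than using a uniform average.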
Suppose we have a set of observations over time x_1, x_2, …, x_t and want to predict some future event y_{t+1}. An inevitable problem arises, because learning a predictor h(x_1, …, x_t) of y_{t+1} is generically intractable due to the size of the input. To make this problem tractable, what’s necessary is a method for summarizing the relevant information in past observations for the purpose of prediction in the future. In other words, state is required.
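As a toy illustration of what "state" buys you, the sketch below folds each new observation into a fixed-size summary, so the predictor's input never grows with the length of the history. The exponential moving average used as the update rule is a hand-picked assumption purely for illustration, as are the function names. The interesting problem is learning a good summary rather than choosing one by hand.

```python
def update_state(state, observation, decay=0.5):
    """Fold one observation into a fixed-size summary of the past (assumed EMA rule)."""
    return decay * state + (1.0 - decay) * observation

def run_filter(observations, decay=0.5, initial_state=0.0):
    """Return the state after each observation; memory use is constant per step."""
    state = initial_state
    states = []
    for x in observations:
        state = update_state(state, x, decay)
        states.append(state)
    return states

def predict_next(state):
    """Hypothetical predictor that only sees the state, never the raw history."""
    return state

observations = [1.0, 1.0, 1.0, 0.0]  # made-up sequence
states = run_filter(observations)    # [0.5, 0.75, 0.875, 0.4375]
prediction = predict_next(states[-1])
```

The predictor takes a single number as input no matter how long the sequence is, in contrast to a predictor over the whole history, whose input grows with every time step.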
Existing approaches for deriving state have some limitations.
- Hidden Markov models learned with EM suffer from local minima, use tabular learning approaches which provide dubious generalization ability, and often require substantial a priori specification of the observations.
- Kalman Filters and Particle Filters are very parametric in the sense that substantial information must be specified up front.
- Dynamic Bayesian Networks (graphical models through time) require substantial a priori specification and often require the solution of difficult computational problems to use. Some of these difficulties are representational rather than computational.
- The Subspace-ID approach from control theory uses a linear representation, with the basic claim that it works well when all transformations are linear, but not so well when things are nonlinear. (Thanks to Drew for pointing it out.) In making this post, I ran across this two-day tutorial which discusses extensions of this idea to nonlinear systems. Unfortunately, I’ll miss the tutorial, and I haven’t found the related paper.
The point of this paper at ICML is that some dynamic systems (those which are “invertible”) can be decomposed into separate bounded resource prediction problems which, when solved, create an implicit definition of state. This allows us to use any general purpose supervised learning algorithm to solve the state formation problem without requiring linearity or any specific representation. When writing papers you don’t generally gush too hard, but it’s fair to say that I’m excited by this approach.
- It’s not a known dead end.
- It doesn’t require lots of prior specification & information when you have lots of data.
- It leverages the huge amount of work that has gone into supervised learning algorithm design.
- It works in controlled systems also, where the control is simply another observation.
- It works with generalization from the start, rather than requiring the (often awkward) addition of generalization later.
- It doesn’t require predicting everything in order to predict what you want.
- It can work with very large observation spaces, and can even work better the larger the observation space, because larger observations imply more invertibility.
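Here is one simple way to read "prediction problems define state", as a toy sketch rather than the algorithm from the paper. Each future step gets its own ordinary supervised regression problem whose input is a bounded window of recent observations, and the vector of resulting predictions is treated as the state. Everything specific here is my assumption: the sine-wave data, the window of two observations, the two-step horizon, the least-squares learner, and all function names. Any other supervised learner could be swapped in for the least-squares fit.

```python
import math

def fit_two_feature_least_squares(rows, targets):
    """Fit target ~ a*u + b*v (no intercept) by solving the 2x2 normal equations."""
    suu = sum(u * u for u, v in rows)
    svv = sum(v * v for u, v in rows)
    suv = sum(u * v for u, v in rows)
    suy = sum(u * y for (u, v), y in zip(rows, targets))
    svy = sum(v * y for (u, v), y in zip(rows, targets))
    det = suu * svv - suv * suv
    a = (suy * svv - svy * suv) / det
    b = (svy * suu - suy * suv) / det
    return a, b

def learn_predictive_state(xs, horizon=2):
    """Train one regressor per future step; each sees only the last two observations."""
    models = []
    for k in range(1, horizon + 1):
        rows = [(xs[t - 1], xs[t]) for t in range(1, len(xs) - k)]
        targets = [xs[t + k] for t in range(1, len(xs) - k)]
        models.append(fit_two_feature_least_squares(rows, targets))
    return models

def state_at(xs, t, models):
    """State at time t = predicted next observations, computed from a bounded window."""
    return [a * xs[t - 1] + b * xs[t] for a, b in models]

# Made-up observation sequence from a simple oscillating system.
xs = [math.sin(0.3 * t) for t in range(200)]
models = learn_predictive_state(xs)
state = state_at(xs, 50, models)  # predictions of xs[51] and xs[52]
```

On this noiseless linear toy the learned predictions match the true future observations almost exactly, which is the easy case. The appeal of the approach is that nothing in the structure above depends on the learner being linear.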
I expect some people reading this paper will be disappointed that it doesn’t solve all problems. That’s good news for anyone interested in research. For those who aren’t, note that this is (in some sense) a generalization of subspace ID, and hence that there are other applications of the approach known to work in practice. Furthermore, we have some sample complexity analysis in the linear case.
It’s relatively rare to have a paper about a new approach to solving a problem as intractable as nonlinear dynamics has proved to be, so if you see a flaw please speak up.