The Call of the Deep

Many learning algorithms used in practice are fairly simple. Viewed representationally, many prediction algorithms compute either a linear separator over basic features (perceptron, winnow, weighted majority, SVM) or a linear separator over slightly more complex features (2-layer neural networks or kernelized SVMs). Should we go beyond this and start using “deep” representations?

What is deep learning?
Intuitively, deep learning is about learning to predict in ways that can involve complex dependencies between the input (observed) features.

Specifying this more rigorously turns out to be rather difficult. Consider the following cases:

  1. SVM with Gaussian Kernel. This is not considered deep learning, because an SVM with a Gaussian kernel can’t succinctly represent certain decision surfaces. One of Yann LeCun’s examples is recognizing objects based on pixel values. An SVM will need a new support vector for each significantly different background. Since the number of distinct backgrounds is large, this isn’t easy.
  2. K-nearest neighbor. This is not considered deep learning for essentially the same reason as the Gaussian SVM: the number of representative points required to recognize an image against any background is very large.
  3. Decision Tree. A decision tree might be considered a deep learning system. However, simple learning problems can defeat decision trees that use axis-aligned splits: rotating a linear separator through many dimensions yields a problem such trees cannot represent succinctly (see the sketch after this list).
  4. 2-layer neural networks. A two-layer neural network isn’t considered deep learning because it isn’t a deep architecture. Perhaps more importantly, the object-recognition-with-occluding-background problem above implies that the hidden layer must be very large to do general-purpose detection.
  5. Deep neural networks (for example, convolutional neural networks). A neural network with several layers might be considered deep.
  6. Deep Belief networks are “deep”.
  7. Automated feature generation and selection systems might be considered deep since they can certainly develop deep dependencies between the input and the output.

One test for a deep learning system is: are there well-defined learning problems that the system cannot solve but a human easily could? If the answer is ‘yes’, then it’s perhaps not a deep learning system.

Where might deep learning be useful?
There are several theorems of the form: “nearest neighbor can learn any measurable function”, “2-layer neural networks can represent any function”, “a support vector machine with a Gaussian kernel can learn any function”. These theorems imply that deep learning is only interesting when data or computation is bounded.

And yet, in the small-data situation (think “30 examples”), overfitting becomes so severe that it’s difficult to imagine using learning algorithms any more complex than the shallow systems commonly in use.

So the domain where a deep learning system might be most useful involves large quantities of data with computational constraints.

What are the principles of design for deep learning systems?
The real answer here is “we don’t know”, and this is an interesting but difficult direction of research.

  1. Is (approximate) gradient descent the only efficient training algorithm?
  2. Can we learn an architecture on the fly or must it be prespecified?
  3. What are the limits of what can be learned?

AOL’s data drop

AOL has released several large search-engine-related datasets. This looks like a pretty impressive data release, and it is a big opportunity for people everywhere to worry about search-engine-related learning problems, if they want.

Two more UAI papers of interest

In addition to Ed Snelson’s paper, there were (at least) two other papers that caught my eye at UAI.

One was this paper by Sanjoy Dasgupta, Daniel Hsu, and Nakul Verma at UCSD, which shows in a surprisingly general and strong way that almost all linear projections of any jointly distributed vector random variable with finite first and second moments look spherical and unimodal (in fact, look like a scale mixture of Gaussians). Great result, as you’d expect from Sanjoy.
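Here is a minimal numerical illustration of the flavor of that result, not the paper’s construction (the exponential coordinate distribution, dimension, and sample size are arbitrary assumptions): a 1-D projection of a decidedly non-Gaussian distribution along a random direction comes out with skewness and excess kurtosis near zero, as a Gaussian would.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 500
X = rng.exponential(size=(n, d))     # each coordinate is heavily skewed (skewness 2)

u = rng.normal(size=d)
u /= np.linalg.norm(u)               # a uniformly random unit direction
proj = (X - X.mean(axis=0)) @ u      # centred 1-D projection of the data

z = (proj - proj.mean()) / proj.std()
print("skewness of the projection:      ", np.mean(z**3))        # close to 0
print("excess kurtosis of the projection:", np.mean(z**4) - 3.0)  # close to 0
```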

The other paper, which I found intriguing but which I just haven’t grokked yet, is this beast by Manfred and Dima Kuzmin. You can check out the (beautiful) slides if that helps. I feel like there is something deep here, but my brain is too small to understand it. The COLT and last NIPS papers/slides are also on Manfred’s page. Hopefully someone here can illuminate.

Upcoming conference

The Workshop for Women in Machine Learning will be held in San Diego on October 4, 2006.

For details see the workshop website:
http://www.seas.upenn.edu/~wiml/

A Winner

Ed Snelson won the Predictive Uncertainty in Environmental Modelling Competition in the temp(erature) category using this algorithm (a generic sketch of the recipe appears below). Some characteristics of the algorithm are:

  1. Gradient descent
  2. … on about 600 parameters
  3. … with local minima
  4. … to solve regression.

This bears a strong resemblance to a neural network. The two main differences seem to be:

  1. The system has a probabilistic interpretation (which may aid design).
  2. There are (perhaps) fewer parameters than a typical neural network might have for the same problem (aiding speed).
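For concreteness, here is a generic sketch of the recipe in the first list above, not Snelson’s actual model: plain gradient descent on a few hundred parameters of a nonlinear regression model whose objective has local minima. The toy data, the radial-basis form, and all settings below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D regression problem (purely illustrative).
X = rng.uniform(-3, 3, size=500)
y = np.sin(2 * X) + 0.1 * rng.normal(size=500)

K = 300                                   # ~600 free parameters: K centres + K weights
centers = rng.uniform(-3, 3, size=K)      # learned (makes the objective non-convex)
weights = np.zeros(K)                     # learned
width = 0.5                               # held fixed for brevity

def predict(x):
    # Radial-basis features evaluated at the current centres.
    phi = np.exp(-((x[:, None] - centers) ** 2) / (2 * width ** 2))
    return phi @ weights, phi

lr = 0.05
for step in range(2000):
    pred, phi = predict(X)
    err = pred - y                                        # squared-error residuals
    grad_w = phi.T @ err / len(y)                         # gradient w.r.t. weights
    grad_c = (err[:, None] * phi * weights *
              (X[:, None] - centers) / width ** 2).mean(axis=0)  # w.r.t. centres
    weights -= lr * grad_w
    centers -= lr * grad_c

print("training RMSE:", np.sqrt(np.mean((predict(X)[0] - y) ** 2)))
```

With the centres held fixed this would be ordinary linear regression in a fixed basis; making them learnable is what introduces the local minima mentioned in item 3, and also what gives the procedure its neural-network flavour.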