Many learning algorithms used in practice are fairly simple. Viewed representationally, many prediction algorithms either compute a linear separator of basic features (perceptron, winnow, weighted majority, SVM) or perhaps a linear separator of slightly more complex features (2-layer neural networks or kernelized SVMs). Should we go beyond this, and start using “deep” representations?
What is deep learning?
Intuitively, deep learning is about learning to predict in ways which can involve complex dependencies between the input (observed) features.
Specifying this more rigorously turns out to be rather difficult. Consider the following cases:
- SVM with Gaussian Kernel. This is not considered deep learning, because an SVM with a Gaussian kernel can’t succinctly represent certain decision surfaces. One of Yann LeCun’s examples is recognizing objects based on pixel values. An SVM will need a new support vector for each significantly different background. Since the number of distinct backgrounds is large, this isn’t easy.
- K-Nearest neighbor. This is not considered deep learning for essentially the same reason as the Gaussian SVM. The number of representative points required to recognize an image in any background is very large.
- Decision Tree. A decision tree might be considered a deep learning system. However, there exist simple learning problems that defeat decision trees using axis-aligned splits. It’s easy to find problems that defeat such decision trees by rotating a linear separator through many dimensions.
- 2-layer neural networks. A two-layer neural network isn’t considered deep learning because it isn’t a deep architecture. More importantly, perhaps, the object recognition with occluding background problem implies that the hidden layer must be very large to do general purpose detection.
- Deep neural networks. (for example, convolutional neural networks) A neural network with several layers might be considered deep.
- Deep Belief networks are “deep”.
- Automated feature generation and selection systems might be considered deep since they can certainly develop deep dependencies between the input and the output.
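The nearest neighbor point in the list above can be made concrete with a toy experiment. This is only a sketch under invented assumptions: inputs are bit tuples, bit 0 stands in for the "object", the remaining bits stand in for the background, and the label is simply bit 0. None of these names or choices come from the post itself.

```python
import itertools
import random


def hamming(a, b):
    """Number of positions where two bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_1nn(train, x):
    """Label of the stored example closest to x (ties go to the earliest)."""
    _, label = min(train, key=lambda example: hamming(example[0], x))
    return label


def accuracy(train, test):
    """Fraction of test examples that 1-nearest-neighbor labels correctly."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)


def all_examples(n_background):
    """Every possible input: bit 0 is the 'object', the rest are background."""
    return [(bits, bits[0])
            for bits in itertools.product((0, 1), repeat=n_background + 1)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    everything = all_examples(n_background=8)
    # Store more and more representative points and test on every input.
    for n_stored in (4, 32, 256, len(everything)):
        stored = rng.sample(everything, n_stored)
        print(n_stored, accuracy(stored, everything))
```

Accuracy is only guaranteed to be perfect once every background has been stored, even though a rule that reads bit 0 alone solves the task. That gap between the size of the stored representation and the simplicity of the target is the sense of "succinct" used in the list.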
One test for a deep learning system is: are there well-defined learning problems which the system cannot solve but a human easily could? If the answer is ‘yes’, then it’s perhaps not a deep learning system.
Where might deep learning be useful?
There are several theorems of the form: “nearest neighbor can learn any measurable function”, “2-layer neural networks can represent any function”, “a support vector machine with a Gaussian kernel can learn any function”. With unlimited data and computation, these shallow methods already suffice, so the theorems imply that deep learning is only interesting when data or computation is bounded.
And yet, for the small data situation (think “30 examples”), problems with overfitting become so severe that it’s difficult to imagine using more complex learning algorithms than the shallow systems commonly in use.
So the domain where a deep learning system might be most useful involves large quantities of data with computational constraints.
What are the principles of design for deep learning systems?
The real answer here is “we don’t know”, and this is an interesting but difficult direction of research.
- Is (approximate) gradient descent the only efficient training algorithm?
- Can we learn an architecture on the fly or must it be prespecified?
- What are the limits of what can be learned?
Do you have a particular dataset or application in mind that would benefit from the concept of deep learning?
Are comments being moderated or is my browser broken?
I think there is more to deep learning than large data sets and
computational constraints. Every time I explain machine learning
methods to outsiders, they are disappointed that not much is really
happening behind the scenes in terms of “understanding” (whatever
that means).
Thus, if one interprets deep learning as forming an “understanding”
about the objects involved, deep learning may be somewhat more
closely related to modelling human intelligence than the usual machine
learning methods like support vector machines (which have nevertheless
been very successful).
I think part of the reason why we developed “shallow” methods has
something to do with the way in which we have formalized the
supervised learning problem, which is completely black-box and makes
no requirements on the internal structure of the predictor (which is
certainly a good thing).
One could argue that the current formal definition of the supervised
learning task does not capture everything which signifies
“learning”. In humans, learning some task not only means that one can
solve the prediction task well, but one usually also expects some kind
of internal change. For example, if you have learned to classify
hand-written digits (let’s say of some exotic script), then you should
also be able to say what is characteristic of some digit. And, most
importantly, you have built an internal representation of the classes
which can be used as input for further processing.
I think one way to approach understanding deep learning would be to
think about what an alternative supervised learning scenario would
have to look like for deep learning to be the answer.
Yep, there is definite moderation (and moderation lag), in order to prevent spam.
You ask if we can learn the architecture on the fly – I think we can. A neural net can be viewed as a tree instead of a net. Each initial feature is a parent, so you start with only the initial features. For each iteration, you could experiment with several children at several different locations in the tree. The neural net could be trained for each of the different configurations, and the performance would be compared. The best configuration, either the existing configuration or one of the configurations with an added child, could be used for the next iteration, and the process would repeat forever. What do you think?
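A rough sketch of this greedy growth loop, under several assumptions that are mine rather than the commenter's: each child node computes the AND of two existing nodes, "training" is a plain perceptron over all node outputs, performance is training accuracy on a tiny XOR task (a held-out set would be the sensible choice in practice), and growth stops when no added child improves the score instead of running forever.

```python
import itertools


def node_outputs(x, children):
    """Evaluate the raw features, then each child as the AND of two earlier nodes."""
    values = list(x)
    for a, b in children:
        values.append(values[a] & values[b])
    return values


def train_and_score(data, children, epochs=500):
    """Train a perceptron on all node outputs and return its training accuracy."""
    rows = [(node_outputs(x, children) + [1], 1 if y else -1)  # trailing 1 is a bias
            for x, y in data]
    w = [0.0] * len(rows[0][0])
    for _ in range(epochs):
        mistakes = 0
        for feats, y in rows:
            if y * sum(wi * fi for wi, fi in zip(w, feats)) <= 0:
                w = [wi + y * fi for wi, fi in zip(w, feats)]
                mistakes += 1
        if mistakes == 0:  # converged: every example is classified correctly
            break
    hits = sum((sum(wi * fi for wi, fi in zip(w, feats)) > 0) == (y > 0)
               for feats, y in rows)
    return hits / len(rows)


def grow(data, max_steps=3):
    """Greedily add the single child that most improves the score, if any does."""
    children = []
    best = train_and_score(data, children)
    for _ in range(max_steps):
        n_nodes = len(data[0][0]) + len(children)
        candidates = [children + [pair]
                      for pair in itertools.combinations(range(n_nodes), 2)
                      if pair not in children]
        if not candidates:
            break
        scored = [(train_and_score(data, c), c) for c in candidates]
        top_score, top = max(scored, key=lambda sc: sc[0])
        if top_score <= best:  # keep the existing configuration
            break
        children, best = top, top_score
    return children, best


# Toy task: the label is x0 XOR x1; x2 is an irrelevant distractor.
DATA = [((x0, x1, x2), x0 ^ x1)
        for x0, x1, x2 in itertools.product((0, 1), repeat=3)]

if __name__ == "__main__":
    print(grow(DATA))
```

On this task the raw features are not linearly separable, and adding the child AND(x0, x1) makes them so. The cost is visible in the loop: every candidate configuration needs its own training run, which is the main practical objection to this kind of search.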
Might it be the case that the important feature of deep learning algorithms is that they are what Watanabe calls ‘singular learning machines’, i.e., where the set of parameters corresponding to a particular distribution is an analytic set with singularities?
Recent and relevant:
Hinton, G. E. and Salakhutdinov, R. R.
Reducing the dimensionality of data with neural networks.
Science, Vol. 313, no. 5786, pp. 504-507, 28 July 2006.
“One test for a deep learning system is: are there well-defined learning problems which the system can not solve but a human easily could?”
There definitely must be such problems; otherwise, human brains would just evolve into SVMs (linear most of the time) 🙂
Of course, another hypothesis would be that the evolution that resulted in quite a “deep” neural net was just plain stupid 🙂
But I think this is less likely than assuming that “natural deep learners” were developed for good reasons. As mentioned in one of the previous comments, current formulations of learning problems are far from completely covering what we mean by human learning – and “understanding” is just one example. We need a mathematical theory of “understanding”.
At least some of this might be answered by syntactic pattern recognition, which often involves hierarchical or grammatical models of patterns. As an example, consider the problem of recognizing tables in images. The conventional (and much more common) approach would be: 1. Collect images of “tables” and “non-tables”, 2. Extract relevant, meaningful features (possibly after centering, isolation, etc.), 3. Train a statistical model (neural network, LDA, SVM, etc.), 4. Collect fee. The obvious problem with this is that, at the pixel level, real tables can look very different from one another. Anyone who’s done machine learning work with images knows that some recognition tasks are very difficult because items in images which we consider “similar” are in fact very far apart in bitmap space.
A syntactic solution might involve: 1. Define or discover what components tables are made of (table = tabletop + 3 or 4 legs), 2. Define or discover what the components of tables are made of, and so on until “primitives” are reached, 3. Apply the defined or discovered grammar to real images for recall, 4. Collect fee. Obviously, syntactic pattern recognition is still a challenge, but it tries to take into account the hierarchies which exist in the world.
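To make the grammar idea concrete, here is a minimal sketch of step 3 under strong simplifying assumptions: a separate detector has already turned the image into a bag of primitives, spatial arrangement is ignored, and the grammar is non-recursive. The primitive names `flat_surface` and `vertical_bar` are hypothetical stand-ins; only the rule "table = tabletop + 3 or 4 legs" comes from the comment.

```python
from collections import Counter
from itertools import product

# Each symbol maps to its alternatives; an alternative is a bag of parts.
# Symbols without a rule (flat_surface, vertical_bar) are primitives.
GRAMMAR = {
    "table": [{"tabletop": 1, "leg": 3}, {"tabletop": 1, "leg": 4}],
    "tabletop": [{"flat_surface": 1}],
    "leg": [{"vertical_bar": 1}],
}


def expansions(symbol):
    """All bags of primitives that a symbol can derive (non-recursive grammars only)."""
    if symbol not in GRAMMAR:
        return [Counter({symbol: 1})]
    results = []
    for alternative in GRAMMAR[symbol]:
        # One list of possible expansions per required part.
        parts = []
        for child, count in alternative.items():
            parts.extend([expansions(child)] * count)
        # Pick one expansion for every part and merge them into a single bag.
        for choice in product(*parts):
            total = Counter()
            for bag in choice:
                total += bag
            results.append(total)
    return results


def recognizes(symbol, primitives):
    """True if the detected primitives form exactly one derivation of symbol."""
    return Counter(primitives) in expansions(symbol)


if __name__ == "__main__":
    detected = ["flat_surface"] + ["vertical_bar"] * 4
    print(recognizes("table", detected))
```

A real system would also have to check geometry (legs below the tabletop, roughly parallel) and tolerate missed or spurious detections, which is where most of the difficulty lies.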