About 4 years ago, I speculated that decision trees qualify as a deep learning algorithm because they can make decisions which are substantially nonlinear in the input representation. Ping Li has proved this point empirically at UAI by showing that boosted decision trees can beat deep belief networks on versions of MNIST which are artificially hardened so as to make them solvable only by deep learning algorithms.

This is an important point, because the ability to solve these sorts of problems is probably the best objective definition of a deep learning algorithm we have. I’m not that surprised. In my experience, if you can accept the computational drawbacks of a boosted decision tree, they can achieve pretty good performance.

Geoff Hinton once told me that the great thing about deep belief networks is that they work. I understand that Ping had very substantial difficulty in getting this published, so I hope some reviewers step up to the standard of valuing what works.

After seeing Ping’s talk about this work in February, I spent some time reproducing his results. Getting the same results as Ping turned out to be fairly easy. It certainly required a lot less tweaking than reproducing the results of DBNs and the like on the same data sets.
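For anyone who wants to poke at this themselves, here is a minimal sketch of the kind of experiment involved, using scikit-learn's GradientBoostingClassifier on the small built-in digits data. This is not Ping's code or data; the data set and hyperparameters here are stand-ins just to show how little tuning boosted trees typically need:

```python
# Minimal sketch: gradient-boosted decision trees on a small digit task.
# NOT Ping Li's setup -- load_digits is an 8x8 stand-in for MNIST, and the
# hyperparameters below are illustrative defaults, not tuned values.
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                 learning_rate=0.1, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```

With essentially no tweaking this already performs well, which matches the experience described above.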

Having said that, I am not convinced that deep learning does a good job on all hardened MNIST data sets. For instance, on the data set that has natural images as background for the digits, today’s deep learners almost have to do a bad job: the unsupervised pre-training algorithms are wasting a lot of modeling power on trying to model the background images, the structure of which is completely irrelevant for the classification task. The subsequent discriminative fine-tuning then has to fix this, but can only do so much. So on some of the data sets considered in the paper, I don’t think it is too surprising that powerful discriminative learners such as boosted regression trees can do a better job than deep learning.

Quite interesting. I know that the Russian search engine Yandex uses boosted decision trees (similar to TreeRank) to implement a learning to rank approach.

Decision trees actually are shallow learners (2 levels, like SVMs and ordinary MLPs), as discussed in a paper soon to appear (but already available online here: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/436). Basically a decision tree can be written as a sum of products, where each product (over the arcs from the root to a leaf) selects a leaf. The paper discusses how they can only generalize locally (the training examples within each region defined by a leaf only provide information about test examples in the same region). On the other hand, boosted decision trees add one level of depth to the weak learners, which makes them reasonably deep (considering that many deep learning results are obtained with just 3 or 4 levels).
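The sum-of-products view can be made concrete with a toy sketch (the tree and leaf values below are made up for illustration): each root-to-leaf path becomes a product of indicator functions, and the tree output is the sum of leaf values weighted by those products.

```python
# A tiny decision tree written two equivalent ways.
# Tree: if x0 < 0.5 -> leaf A; elif x1 < 0.5 -> leaf B; else -> leaf C.
# Leaf values vA, vB, vC are arbitrary illustrative constants.

def tree_as_sum_of_products(x0, x1, vA=1.0, vB=2.0, vC=3.0):
    # Level 1: one product of indicator functions per root-to-leaf path.
    pathA = (x0 < 0.5)
    pathB = (x0 >= 0.5) * (x1 < 0.5)
    pathC = (x0 >= 0.5) * (x1 >= 0.5)
    # Level 2: a sum; exactly one product equals 1, selecting one leaf.
    return vA * pathA + vB * pathB + vC * pathC

def tree_as_nested_ifs(x0, x1, vA=1.0, vB=2.0, vC=3.0):
    # The usual recursive-partitioning view of the same tree.
    if x0 < 0.5:
        return vA
    return vB if x1 < 0.5 else vC
```

Both functions compute the same value everywhere, which is the sense in which a tree is an (indicator-)products layer followed by a sum layer.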

@Yoshua: I’m afraid I don’t quite follow something. You’re saying that trees are sums of products (eq 1 in the paper), which I totally buy. So those are two layers, and when you add boosting, you get three. This is stated in the paper as:

“What is the architectural depth of decision trees and decision forests? It depends on what elementary units of computation are allowed on each level. By analogy with the disjunctive normal form (which is usually assigned an architectural depth of two) one would assign an architectural depth of two to a decision tree, and of three to decision forests or boosted trees.” (page 10)

I guess I don’t completely understand the reasoning behind saying that DNF has depth 2. Is there a citation I could look at that argues this point?

To me, it seems like you could make this claim about almost any learning algorithm. The argument for trees reads to me like “trees are just sums of [things]”, where [things] = a bunch of products over features. But couldn’t I say this about other algorithms? For instance, take a depth 1000 neural network. Why couldn’t I say that this is just two layers because the final layer is a sum of [things], where in this case [things] is another non-linear function over the features? Why do products not get to count as layers, but thresholded sums do get to count as layers?

(I’m not trying to be cantankerous — I’m really trying to understand!)

Hal,

Yann LeCun and I wrote a paper on depth (Scaling learning algorithms toward AI: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/4) in which we discuss and define depth and give many examples. Defining the depth of an architecture also requires defining the elementary computational units of the graph (whose depth is the longest path from an input node to an output node). If we change the definition of what these units can do, we get a different value for depth, but such redefinitions typically only change depth by a multiplicative constant (e.g., we can replace a formal neuron by the composition of three kinds of units, namely sums, products and thresholds, i.e., multiply the depth of a neural net by 3 by changing what we consider to be the elementary units of computation).

In computational complexity papers, the typical elementary units that have been considered are summing units, product units, linear threshold units (formal neurons), and logic gates. Hence in the above paper, a linear classifier is typically considered to have depth 1, whereas a finite polynomial (sum of products) is considered to have depth 2, a traditional MLP (with one hidden layer) is considered to have depth 2, and an SVM or RBF network or kernel-based predictor is considered to have depth 2 (either the feature function phi(x) or the kernel function K(u,v)=phi(u).phi(v) can be considered part of the set of computational elements).

Nicely, we find that with all these kinds of computational element sets, depth 1 is insufficient to represent most functions, whereas depth 2 provides a universal approximator. Note how that works also for decision trees (which can basically be decomposed as a sum of products); according to this cultural convention they also have depth 2 and are universal approximators. Boosting on top of decision trees adds a level of depth (it is a weighted sum of the hard decisions produced by the decision trees).
That is why I wrote that boosted decision trees have depth 3.
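The three levels can be laid out explicitly in a toy sketch (the stumps and weights below are invented for illustration, not learned by any boosting algorithm): each weak learner is itself a depth-2 sum of products, and the ensemble adds a weighted sum over the weak learners' hard decisions.

```python
# Sketch of the depth-3 structure of a boosted-tree ensemble.
# Level 1: products of indicators inside each tree (here, trivial stumps).
# Level 2: per-tree sum selecting a leaf value (the stump's hard decision).
# Level 3: weighted sum across trees, then a final sign.
# Thresholds, leaf values, and weights are made-up illustrative numbers.

def stump(x, threshold, left, right):
    # A depth-2 unit: (x < t)*left + (x >= t)*right, a sum of two products.
    return left if x < threshold else right

def boosted_prediction(x):
    # Hypothetical 3-stump ensemble with hand-picked weights.
    stumps = [(0.3, -1.0, 1.0), (0.6, -1.0, 1.0), (0.9, 1.0, -1.0)]
    weights = [0.5, 0.3, 0.2]
    # The third level: a weighted sum over the stumps' hard decisions.
    score = sum(w * stump(x, t, l, r) for w, (t, l, r) in zip(weights, stumps))
    return 1 if score >= 0 else -1
```

Under the convention above, removing the final weighted sum would leave a depth-2 architecture; adding it is exactly the extra level that makes boosted trees depth 3.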

In the paper on Scaling learning algorithms toward AI, we also introduce the notion that for learned architectures, we typically would like each level in the architecture to be adapted. Hence a depth 1000 neural net is a weighted sum over the outputs of the 999th layer, but all 1000 layers are adaptive and composed of the same kind of computational units, so it makes more sense to talk about a depth 1000 architecture in that case. Note how when we learn a polynomial, we can typically adapt the coefficients (2nd-layer parameters) and choose which products enter the polynomial (1st-layer parameters), although of course one can consider all the possible products, in which case the 1st layer is no longer adaptive, only the second. But practically, we generally can't afford the coefficients of all possible products to be non-zero, so we need to do some kind of 'feature selection', which amounts to learning the first layer.

Let me know if my explanations answer your questions and satisfy you. In conclusion, there is a formal definition of depth which can give different answers in practice because it depends on the choice of computational units, and there is a softer, fuzzier definition of depth that represents a kind of cultural convention (about computational elements) under which depth-2 architectures are typically universal approximators (with enough units in the first layer). It is under this convention that decision trees would reasonably be seen as depth-2 and boosted decision trees as depth-3 architectures.

Maybe it would be nice to include the data sets that require deep learning on the MLcomp site? I didn't find them there.

The data sets are available at http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007

Are the boosted decision trees learned in an unsupervised way?

One of the appeals of "deep learning" is that the training is often done in an unsupervised way, the argument being that to achieve AI or something like it, machines will have to learn without supervision. Typically the unsupervised training is based on stochastic maximum likelihood (SML) or contrastive divergence (CD). Yoshua and Yann have proposed good algorithms using autoencoders trained to reconstruct the inputs. Some students of mine tried an L2-boosting version of this style of training deep architectures in http://www.cs.ubc.ca/~nando/papers/bip.pdf but I have to admit that we weren't able to show that it does much better than naive SML or CD. L2 boosting allows one to obtain very sparse bases and to estimate the number of coefficients, but it can be computationally expensive compared to, say, CD (at least in our naive implementation).

Nando is right that one of the learning principles at the core of most current algorithms for deep architectures is unsupervised learning (or semi-supervised learning). In addition to the reason he gave (that to achieve AI we will probably need to exploit the huge quantity of data that are not human-labeled), there is another one. It looks like unsupervised learning (and especially the kind that allows individual layers in a deep architecture to be adapted) is also important in order to find good solutions to the optimization problem involved in learning. This is because the overall training criterion (whether supervised or unsupervised) for a deep architecture is typically non-convex and full of local minima. So a deep Boltzmann machine trained from random weights by maximum likelihood just falls into poor local minima, and the same is true (to a lesser extent) of many supervised deep neural networks.

In the book / review paper Learning Deep Architectures for AI (http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/239), I discuss this question (in particular in section 4.2). Having such a local training criterion allows one to somewhat disentangle the optimization problem and turn it into a series of easier ones. Whereas one can imagine many unsupervised learning algorithms that can train individual layers to do something useful (basically, try to explain what the layers below are doing), doing something similar with supervised learners is less obvious (and attempts such as the one discussed in our NIPS'2006 paper on Greedy Layer-wise Training of Deep Networks [http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/190] fared much worse than their unsupervised counterparts).

Finally, there is a third reason why unsupervised learning can be very useful for deep learners: it provides a very powerful prior that can act as a regularizer. This is discussed at length in this JMLR paper: Why Does Unsupervised Pre-training Help Deep Learning? (http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/439). Basically the prior is the following, and it is very similar to the one exploited in semi-supervised learning. Let X be the input variable and Y the output variable to predict given X. The prior says that the structure of P(X=x), as a function of x, is strongly related to the structure of P(Y|X=x), as a function of x. More precisely (and this is specific to deep learners, and not always the case for semi-supervised learners), the assumption is that representations of x that are useful for capturing P(X) are often also useful for capturing P(Y|X). Hence deep learning is really about learning representations, and multiple levels of representation.

Thank you Yoshua for all of the insightful comments.

I’ve often felt that the deep learning literature glosses over “the assumption is that representations of x that are useful at capturing P(X) are often also useful at capturing P(Y|X).” Has anyone characterized what kinds of learning problems have this property? This strikes me as a very important question. (I apologize if this is well known in the deep learning literature; I’ve only read a few papers in this area.)