“Deep learning” is used to describe learning architectures which have significant depth (as a circuit).

One claim is that shallow architectures (one or two layers) cannot concisely represent some functions while a circuit with more depth can concisely represent these same functions. Proving lower bounds on the size of a circuit is substantially harder than proving upper bounds (which are constructive), but some results are known. Luca Trevisan’s class notes detail how XOR is not concisely representable by “AC0” (= constant depth unbounded fan-in AND, OR, NOT gates). This doesn’t quite prove that depth is necessary for the representations commonly used in learning (such as a thresholded weighted sum), but it is strongly suggestive that this is so.

Examples like this are a bit disheartening because existing algorithms for deep learning (deep belief nets, gradient descent on deep neural networks, and perhaps decision trees, depending on who you ask) can’t learn XOR very easily. Evidence so far suggests learning a noisy version of XOR is hard. In fact, crypto systems have been proposed based upon this hardness. The evidence so far suggests that XOR based deep learning problems have no algorithm much better than “guess and check”.

It turns out that we *can* define deep learning problems which are solvable by deep belief net style algorithms. Some definitions:

**Learning Problem** A learning problem is defined by a probability distribution *D(x,y)* over features *x*, which are a vector of bits, and a label *y*, which is either *0* or *1*.

**Shallow Learning Problem** A shallow learning problem is a learning problem where the label *y* can be predicted with error rate at most *e < 0.5* by a weighted linear combination of features, $\mathrm{sign}(\sum_i w_i x_i)$.

**Deep Learning Problem** A deep learning problem is a learning problem with a solution representable by a circuit of weighted linear sums with O(number of input features) gates.

These definitions are not necessarily the correct ones (and I’d like to hear from anyone that disagrees with the definition, and why), but they seem to capture the intuitions I know. Note that the definition of “deep learning problem” contains the definition of “shallow learning problem” and the XOR example. With high probability, it does not contain a random function. This definition is not captured by any existing complexity theory classes I know, although some are close (TC0, for example).
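As a small illustration of why XOR sits inside the “deep” definition but outside the “shallow” one, here is a sketch in Python. The gate form (a bias term and a strict `> 0` threshold), the weight grid, and the function names are assumptions made for this example and are not part of the definitions above. It searches a finite grid of weights for a single threshold gate computing two-bit XOR, then builds n-bit parity from n + 1 threshold gates arranged in two layers.

```python
from itertools import product

def gate(weights, bias, inputs):
    """One thresholded weighted sum: 1 if w.x + bias > 0, else 0."""
    return 1 if sum(w * v for w, v in zip(weights, inputs)) + bias > 0 else 0

def single_gate_computes_xor(grid):
    """Search a finite weight grid for one gate that equals XOR on two bits."""
    for w1, w2, b in product(grid, repeat=3):
        if all(gate((w1, w2), b, (a, c)) == (a ^ c)
               for a, c in product((0, 1), repeat=2)):
            return True
    return False

def parity_circuit(x):
    """Two-layer threshold circuit for parity using len(x) + 1 gates.

    Layer 1: gate t fires when at least t inputs are 1 (t = 1..n).
    Layer 2: alternating +1/-1 weights sum to 1 exactly when the count is odd.
    """
    n = len(x)
    layer1 = [gate([1] * n, 0.5 - t, x) for t in range(1, n + 1)]
    alternating = [1 if t % 2 == 1 else -1 for t in range(1, n + 1)]
    return gate(alternating, -0.5, layer1)
```

The gate count grows linearly in the number of input bits, which is what the O(number of input features) condition asks for. The grid search is only a finite check, but it agrees with the standard fact that XOR is not linearly separable.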

**Theorem** There exists a deep learning problem for which:

- A deep belief net (like) learning algorithm can achieve error rate *0* with probability *1 - d* for any *d > 0* in the limit as the number of IID samples goes to infinity.
- The learning problem is not shallow. In particular, for all *e > 0*, all weighted predictors have error rate at least *1/2 - e*.

The proof is actually a little bit stronger than the theorem statement. The definition of a ‘shallow learning problem’ can be broadened in several ways to include the representations used by many common learning algorithms. Also, instead of an asymptotic analysis, a finite sample analysis could be made.

This theorem (roughly) says that “deep learning could be useful in practice”. This is a fairly weak statement. However, a stronger PAC-learning statement appears implausible because deep belief net (like) algorithms actively use the structure in *x* while PAC analysis holds for all distributions over *x*. Given the weakness of the theorem statement, empirical evidence for the effectiveness (or not) of deep learning is important.

**Proof** (This is a sketch only.) The first part of the proof is constructive. We simply specify a learning problem, and then show that a deep belief net-like algorithm can solve it. The second part involves some probabilistic analysis.

The learning problem is essentially a ‘hidden bits problem’ which is best specified by defining an algorithm for drawing an example. The problem is parameterized by an integer $k$, where larger $k$ problems hold for smaller choices of $e$. An example is drawn by first picking a uniform random bit $y$ from $\{0,1\}$. After that, $k$ hidden bits $h_1,\ldots,h_k$ are set so that a random subset of $(k + y)/2$ of them are $1$ and the rest $0$. For each hidden bit $h_i$, we have $4$ output bits $x_{i1},x_{i2},x_{i3},x_{i4}$ (implying a total of $4k$ output bits). If $h_i = 0$, with $0.5$ probability we set one of the output bits to $1$ and the rest to $0$, and with $0.5$ probability we set all output bits to $0$. If $h_i = 1$, with $0.5$ probability we set one of the output bits to $0$ and the rest to $1$, and with $0.5$ probability we set all output bits to $1$.
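The drawing procedure can be sketched in Python as follows. One detail is an assumption of this sketch rather than something stated in the text: $(k + y)/2$ is only an integer for one value of $y$, so the code assumes odd $k$ and rounds down, giving $(k-1)/2$ ones when $y = 0$ and $(k+1)/2$ ones when $y = 1$. The function name and return layout are also illustrative.

```python
import random

def draw_example(k, rng=None):
    """Draw one example (x, h, y) from the hidden bits problem.

    Assumption of this sketch: k is odd and (k + y) / 2 is rounded down.
    x is a list of k blocks, each holding the 4 output bits of one hidden bit.
    """
    if k % 2 == 0:
        raise ValueError("this sketch assumes odd k")
    rng = rng or random.Random()
    y = rng.randrange(2)                             # uniform random label
    ones = set(rng.sample(range(k), (k + y) // 2))   # which hidden bits are 1
    h = [1 if i in ones else 0 for i in range(k)]
    x = []
    for h_i in h:
        block = [h_i] * 4                  # with probability 0.5, all four copy h_i
        if rng.random() < 0.5:             # otherwise exactly one bit is flipped
            block[rng.randrange(4)] = 1 - h_i
        x.append(block)
    return x, h, y
```

Calling `draw_example(5, random.Random(0))` yields five blocks of four bits; within each block at most one bit differs from the hidden bit that generated it.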

This learning problem is solved by a two-level prediction process. Variations using recursive composition (redefine each “output bit” to be a hidden bit in a new layer, each of which has its own output bits) can make the “right” number of levels be larger than 2.

The deep belief net like algorithm we consider is the algorithm which:

- Builds a threshold weighted sum predictor for every feature $x_{ij}$ using weights = the probability of agreement between the features minus 0.5.
- Builds a threshold weighted sum predictor for the label given the predicted values from the first step with weights as before.

(The real algorithm uses something similar to gradient descent which is more powerful, but this is all we need.)
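A minimal Python sketch of the two steps above, under several assumptions that are mine rather than the post’s: bits are mapped to -1/+1 before the weighted sum, a feature never votes for itself, ties go to 0, probabilities of agreement are estimated empirically, and the sampler assumes odd $k$ with $(k + y)/2$ rounded down. All function names are hypothetical.

```python
import random

def draw_example(k, rng):
    """Hidden bits sampler. Assumes odd k, with (k + y) / 2 rounded down."""
    y = rng.randrange(2)
    ones = set(rng.sample(range(k), (k + y) // 2))
    x = []
    for i in range(k):
        h_i = 1 if i in ones else 0
        block = [h_i] * 4
        if rng.random() < 0.5:               # flip one of the four output bits
            block[rng.randrange(4)] = 1 - h_i
        x.extend(block)
    return x, y

def agreement_weights(rows, targets):
    """Per-column weight: empirical Pr(column == target) - 0.5."""
    counts = [0] * len(rows[0])
    for row, t in zip(rows, targets):
        for f, v in enumerate(row):
            if v == t:
                counts[f] += 1
    return [c / len(rows) - 0.5 for c in counts]

def threshold(weights, values):
    """Thresholded weighted sum over bits mapped to -1/+1; returns 0 or 1."""
    total = sum(w * (2 * v - 1) for w, v in zip(weights, values))
    return 1 if total > 0 else 0

def fit(xs, ys):
    # Step 1: one predictor per feature, built from all the other features.
    first = []
    for j in range(len(xs[0])):
        w = agreement_weights(xs, [x[j] for x in xs])
        w[j] = 0.0                           # a feature does not vote for itself
        first.append(w)
    # Step 2: predict the label from the step-1 predictions, same weighting rule.
    predicted = [[threshold(w, x) for w in first] for x in xs]
    second = agreement_weights(predicted, ys)
    return first, second

def predict(model, x):
    first, second = model
    return threshold(second, [threshold(w, x) for w in first])

def label_error(k, n_train, n_test, seed=0):
    """Train on n_train samples and return the label error rate on n_test."""
    rng = random.Random(seed)
    train = [draw_example(k, rng) for _ in range(n_train)]
    test = [draw_example(k, rng) for _ in range(n_test)]
    model = fit([x for x, _ in train], [y for _, y in train])
    return sum(predict(model, x) != y for x, y in test) / n_test
```

Something like `label_error(3, 20000, 1000)` exercises both steps on a small instance. The weights are fit by counting rather than by gradient descent, matching the simplified algorithm rather than a real deep belief net.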

For each output feature $x_{ij}$, the values of output features corresponding to other hidden bits are uncorrelated since by construction $\Pr(h_i = h_{i'}) = 0.5$ for $i \neq i'$. For output features which share a hidden bit, the probability of agreement in value between two bits $j, j'$ is $0.75$. If we have $n$ IID samples from the learning problem, then Chernoff bounds imply that empirical expectations deviate from expectations by more than $\sqrt{\log((4k)^2/d)/(2n)}$ with probability $d$ or less, for all pairs of features simultaneously. For the prediction of each feature, when $n = 512 k^4 \log((4k)^2/d)$, the sum of the weights on the $4(k-1)$ features corresponding to other hidden bits is bounded by $4(k-1) \cdot 1/(32k^2) \le 1/(8k)$. On the other hand, the weights on the 3 other features sharing the same hidden bit are each $0.25 \pm 1/(32k^2)$, which are individually larger than the sum of all other weights. Consequently, the predicted value is the majority of the 3 other features, which is always the value of the hidden bit.

The above analysis (sketchily) shows that the predicted value for each output bit is the hidden bit used to generate it. The same style of analysis shows that, given the hidden bits, the label can be predicted perfectly. In this case, the value of each hidden bit provides a slight consistent edge in predicting the value of the label, implying that the learning algorithm converges to uniform weighting over the predicted hidden bit values.
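The arithmetic behind the sample size can be checked numerically. The sketch below plugs the stated $n$ into the deviation bound and compares the resulting weight margins; the function names and the choice of natural logarithm are assumptions of this example.

```python
import math

def deviation_bound(n, k, d):
    """sqrt(log((4k)^2 / d) / (2n)): deviation allowed for every feature pair."""
    return math.sqrt(math.log((4 * k) ** 2 / d) / (2 * n))

def sample_size(k, d):
    """n = 512 k^4 log((4k)^2 / d), the sample size used in the argument."""
    return 512 * k ** 4 * math.log((4 * k) ** 2 / d)

def check_margin(k, d):
    """Return (eps, bound on total cross-block weight, smallest same-block weight)."""
    eps = deviation_bound(sample_size(k, d), k, d)
    cross_block_total = 4 * (k - 1) * eps   # weights on features of other hidden bits
    same_block_min = 0.25 - eps             # lower end of 0.25 +/- eps
    return eps, cross_block_total, same_block_min
```

With this $n$ the logarithms cancel, so `eps` equals $1/(32k^2)$ regardless of $d$, and each same-block weight stays above the total weight the other blocks can contribute.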

To prove the second part of the theorem, we can first show that a uniform weight over all features is the optimal predictor, and then show that the error rate of this predictor converges to *1/2* as *k -> infinity*. The optimality of uniform weighting is a little bit tricky to prove, but it is obvious at a high level because of (1) symmetry in the definition of the problem and (2) the fact that a nonuniform weighting increases the noise. The error rate convergence to *1/2* is a statement about Binomial probability distributions. Essentially, the noise in the observed bits given the hidden bits kills prediction performance.
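To see the convergence concretely, the error of the uniform-weight predictor can be computed exactly for a given $k$ by convolving the per-block sums. This is a sketch under assumptions that go beyond the text: odd $k$ with $(k + y)/2$ rounded down, and a threshold placed at $2k$ (the midpoint suggested by the symmetry of the problem). It evaluates this one predictor and does not prove that uniform weighting is optimal.

```python
def sum_distribution(k, y):
    """Exact distribution of the sum of all 4k output bits given the label y.

    Assumes odd k with (k + y) // 2 hidden bits set to 1.
    """
    ones = (k + y) // 2
    dist = {0: 1.0}
    for i in range(k):
        # A block sums to 3 or 4 when its hidden bit is 1, and to 0 or 1 when it is 0.
        base = 3 if i < ones else 0
        new = {}
        for s, p in dist.items():
            for extra in (0, 1):
                key = s + base + extra
                new[key] = new.get(key, 0.0) + 0.5 * p
        dist = new
    return dist

def uniform_weight_error(k):
    """Error of 'predict 1 iff the bit sum exceeds 2k', with y uniform."""
    cut = 2 * k
    err_given_1 = sum(p for s, p in sum_distribution(k, 1).items() if s <= cut)
    err_given_0 = sum(p for s, p in sum_distribution(k, 0).items() if s > cut)
    return 0.5 * (err_given_0 + err_given_1)
```

Evaluating `uniform_weight_error` at increasing odd values of `k` (say 3, 31, 301) shows the error climbing toward *1/2*, driven by the Binomial noise in the number of flipped output bits.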

Very interesting… can you point to any natural problems that exhibit XOR-like properties?

Well, there is cracking crypto systems.

More generally, systems involving competitive behavior can exhibit xor-like behavior.

Count me as an ignorant outsider, but it seems to me like you’re hung up with the wrong field (namely real numbers). XOR *is* sum, but in a different field (the “field of bits” {0,1}). If you’re going to work with bits all the time, why are you not using the field of bits? You pose a problem involving distributions over vectors of bits, then you try to solve it with real numbers… Why not use the right tools, which even the formulation of the problem suggests? Or is there a deeper reason why algebra must always be done over the field of reals in machine learning?

I think the fundamental problem is that the field of bits doesn’t form natural biases. In particular, in the real world 1 + 1 does not equal 0 very often.

A more practical reason is that the algorithms often run with respect to “real” (== floating point) features which are provided by the world. This abstraction doesn’t show that, but it’s something that is kept in mind when making the abstraction.

Many of the algorithms used (like gradient descent) work better with continuous valued parameters. It’s another example of continuizing.

Point taken. There are many ways of “continuizing” bits. It looks like in examples that remind one of XOR there needs to be a “wrap around” effect. For this purpose you could try to continuize bits by representing them as -1 and 1 on the unit circle in the complex plane. XOR becomes complex multiplication. To do this without complex numbers and the circle, you could ask how to “learn” XOR if you can form linear combinations, but then instead of using sgn(x) or a similar threshold function, you use a “periodic” threshold, such as sgn(sin(x)). And it doesn’t really have to be sign, it could be mod, for example. I suspect the math quickly gets out of hand.

I guess I have a more general question. I understand that for the purposes of proving lower bounds in computational complexity it’s reasonable to show theorems like “can’t learn XOR with NOT, AND, OR”. But I always thought that machine learning was more “positive”, i.e., you really want to learn whenever possible. So, instead of despairing one should try to use a different bag of tricks.

Artificial intelligence has made its way deep into algorithms, and one potential mathematical tool in this connection is Bayes’ theorem.
