Shravan and Alex’s LDA code is released. On a single machine, I’m not sure how it currently compares to the online LDA in VW, but the ability to effectively scale across very many machines is surely interesting.
A Deep Belief Net Learning Problem
“Deep learning” is used to describe learning architectures which have significant depth (as a circuit).
One claim is that shallow architectures (one or two layers) cannot concisely represent some functions while a circuit with more depth can concisely represent these same functions. Proving lower bounds on the size of a circuit is substantially harder than upper bounds (which are constructive), but some results are known. Luca Trevisan’s class notes detail how XOR is not concisely representable by “AC0” (= constant depth unbounded fan-in AND, OR, NOT gates). This doesn’t quite prove that depth is necessary for the representations commonly used in learning (such as a thresholded weighted sum), but it is strongly suggestive that this is so.
Examples like this are a bit disheartening because existing algorithms for deep learning (deep belief nets, gradient descent on deep neural networks, and perhaps decision trees, depending on who you ask) can’t learn XOR very easily. Evidence so far suggests learning a noisy version of XOR is hard. In fact, cryptosystems have been proposed based upon this hardness. The evidence so far suggests that XOR-based deep learning problems have no algorithm much better than “guess and check”.
It turns out that we can define deep learning problems which are solvable by deep belief net style algorithms. Some definitions:
- Learning Problem: A learning problem is defined by a probability distribution D(x,y) over features x, which form a vector of bits, and a label y, which is either 0 or 1.
- Shallow Learning Problem: A shallow learning problem is a learning problem where the label y can be predicted with error rate at most e < 0.5 by a weighted linear combination of features, sign(sum_i w_i x_i).
- Deep Learning Problem: A deep learning problem is a learning problem with a solution representable by a circuit of weighted linear sums with O(number of input features) gates.
These definitions are not necessarily the correct ones (and I’d like to hear from anyone who disagrees with the definitions, and why), but they seem to capture the intuitions I know. Note that the definition of “deep learning problem” contains the definition of “shallow learning problem” and the XOR example. With high probability, it does not contain a random function. This definition is not captured by any existing complexity class I know, although some are close (TC0, for example).
Theorem There exists a deep learning problem for which:
- A deep belief net (like) learning algorithm can achieve error rate 0 with probability 1 - d for any d > 0 in the limit as the number of IID samples goes to infinity.
- The learning problem is not shallow. In particular, for all e > 0, all weighted linear predictors have error rate at least 1/2 - e.
The proof is actually a little bit stronger than the theorem statement. The definition of a ‘shallow learning problem’ can be broadened in several ways to include solution by representation of many common learning algorithms. Also, instead of an asymptotic analysis, a finite sample analysis could be made.
This theorem (roughly) says that “deep learning could be useful in practice”. This is a fairly weak statement. However, a stronger PAC-learning statement appears implausible because deep belief net (like) algorithms actively use the structure in x while PAC analysis holds for all distributions over x. Given the weakness of the theorem statement, empirical evidence for the effectiveness (or not) of deep learning is important.
Proof (This is a sketch only.) The first part of the proof is constructive. We simply specify a learning problem, and then show that a deep belief net-like algorithm can solve it. The second part involves some probabilistic analysis.
The learning problem is essentially a ‘hidden bits problem’ which is best specified by defining an algorithm for drawing an example. The problem is parameterized by an integer k, where larger values of k make the result hold for smaller choices of e. An example is drawn by first picking a uniform random bit y from {0,1}. After that, k hidden bits h_1,…,h_k are set so that a random subset of (k + y)/2 of them are 1 and the rest 0. For each hidden bit h_i, we have 4 output bits x_i1, x_i2, x_i3, x_i4 (implying a total of 4k output bits). If h_i = 0, with 0.5 probability we set one of the output bits to 1 and the rest to 0, and with 0.5 probability we set all output bits to 0. If h_i = 1, with 0.5 probability we set one of the output bits to 0 and the rest to 1, and with 0.5 probability we set all output bits to 1.
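The drawing procedure above can be sketched in a few lines of Python. This is only an illustration: the function name `draw_example` is made up here, and it assumes that (k + y)/2 is read as an integer floor with k odd, so that y = 0 gives (k - 1)/2 ones and y = 1 gives (k + 1)/2 ones. The post does not state how a non-integer value should be handled.

```python
import random

def draw_example(k, rng):
    """Draw one (x, h, y) triple from the hidden bits problem with k hidden bits."""
    y = rng.randrange(2)                      # uniform random label
    ones = (k + y) // 2                       # ASSUMPTION: integer floor, k odd
    h = [0] * k
    for i in rng.sample(range(k), ones):      # random subset of hidden bits set to 1
        h[i] = 1
    x = []
    for hi in h:
        group = [hi] * 4                      # the 4 output bits start equal to the hidden bit
        if rng.random() < 0.5:                # with probability 0.5, exactly one bit differs
            group[rng.randrange(4)] = 1 - hi
        x.extend(group)
    return x, h, y

# Example usage: one draw with k = 5 hidden bits, hence 20 output bits.
x, h, y = draw_example(5, random.Random(0))
```

Only x and y would be visible to a learner; h is returned here purely so the construction can be inspected.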
This learning problem is solved by a two-level prediction process. Variations using recursive composition (redefine each “output bit” to be a hidden bit in a new layer, each of which has its own output bits) can make the “right” number of levels be larger than 2.
The deep belief net-like algorithm we consider is the algorithm which:
- Builds a threshold weighted sum predictor for every feature x_ij from the other features, with each weight equal to the probability that the two features agree, minus 0.5.
- Builds a threshold weighted sum predictor for the label given the predicted values from the first step with weights as before.
(The real algorithm uses something similar to gradient descent which is more powerful, but this is all we need.)
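A minimal sketch of the first step for a single target feature may help make the weights concrete. Everything named here (`draw_x_h`, `agreement_weights`, `predict_bit`) is hypothetical, the sampler assumes (k + y)/2 is an integer floor with k odd, and mapping bits to -1/+1 with a threshold at zero is one assumed way of thresholding that the post does not spell out.

```python
import random

def draw_x_h(k, rng):
    """Sample output bits x and hidden bits h (ASSUMPTION: (k + y) // 2 ones, k odd)."""
    y = rng.randrange(2)
    h = [0] * k
    for i in rng.sample(range(k), (k + y) // 2):
        h[i] = 1
    x = []
    for hi in h:
        group = [hi] * 4
        if rng.random() < 0.5:                 # flip exactly one of the 4 bits
            group[rng.randrange(4)] = 1 - hi
        x.extend(group)
    return x, h

def agreement_weights(samples, target):
    """Weight for each other feature f: empirical Pr(x_f == x_target) - 0.5."""
    n = len(samples)
    weights = {}
    for f in range(len(samples[0])):
        if f == target:
            continue
        agree = sum(1 for x in samples if x[f] == x[target])
        weights[f] = agree / n - 0.5
    return weights

def predict_bit(weights, x):
    """Thresholded weighted sum over the other features, with bits mapped to -1/+1."""
    score = sum(w * (2 * x[f] - 1) for f, w in weights.items())
    return 1 if score > 0 else 0

rng = random.Random(0)
k = 9
train = [draw_x_h(k, rng)[0] for _ in range(4000)]
w = agreement_weights(train, target=0)
same_group = [w[f] for f in (1, 2, 3)]          # features sharing the first hidden bit
other_groups = [w[f] for f in range(4, 4 * k)]  # features generated by other hidden bits
```

With a few thousand samples, the three weights in `same_group` should land near 0.25 and those in `other_groups` near zero, which is the separation the argument relies on.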
For each output feature x_ij, the values of output features corresponding to other hidden bits are uncorrelated since by construction Pr(h_i = h_i’) = 0.5 for i != i’. For output features which share a hidden bit, the probability of agreement in value between two bits j, j’ is 0.75. If we have n IID samples from the learning problem, then Chernoff bounds imply that empirical expectations deviate from expectations by more than sqrt(log((4k)^2/d) / (2n)) with probability d or less, for all pairs of features simultaneously. For the prediction of each feature, when n = 512 k^4 log((4k)^2/d), each deviation is at most 1/(32 k^2), so the sum of the weights on the 4(k-1) features corresponding to other hidden bits is bounded by 4(k-1) * 1/(32 k^2) <= 1/(8k). On the other hand, the weights on the 3 other features sharing the same hidden bit are each within 1/(32 k^2) of 0.25, so each is individually larger than the sum of all the other weights. Consequently, the predicted value is the majority of the 3 other features, which is always the value of the hidden bit.
The above analysis (sketchily) shows that the predicted value for each output bit is the hidden bit used to generate it. The same style of analysis shows that given the hidden bits, the label can be predicted perfectly. In this case, the value of each hidden bit provides a slight consistent edge in predicting the value of the label, implying that the learning algorithm converges to uniform weighting over the predicted hidden bit values.
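For readers who want the arithmetic spelled out, the three numerical facts used in this paragraph can be checked as follows. This is a worked restatement of the quantities already given, with no new assumptions: the 3/4 comes from the fact that two bits of a group disagree only when one of them is the single altered bit.

```latex
\begin{align*}
\Pr[x_{ij} = x_{ij'}]
  &= \tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot\left(1 - \tfrac{2}{4}\right) = \tfrac{3}{4}, \\
\sqrt{\frac{\log\left((4k)^2/d\right)}{2n}}
  &= \sqrt{\frac{1}{1024\,k^4}} = \frac{1}{32\,k^2}
  \qquad\text{when } n = 512\,k^4 \log\left((4k)^2/d\right), \\
4(k-1)\cdot\frac{1}{32\,k^2}
  &= \frac{k-1}{8k^2} \;\le\; \frac{1}{8k} \;<\; \frac{1}{4} - \frac{1}{32\,k^2}
  \qquad\text{for all } k \ge 1.
\end{align*}
```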
To prove the second part of the theorem, we can first show that a uniform weight over all features is the optimal predictor, and then show that the error rate of this predictor converges to 1/2 as k -> infinity. The optimality of uniform weighting is a little bit tricky to prove, but it is obvious at a high level because (1) of symmetry in the definition of the problem and (2) a nonuniform weighting increases the noise. The error rate convergence to 0.5 is a statement about Binomial probability distributions. Essentially, the noise in the observed bits given the hidden bits kills prediction performance.
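The Binomial statement can be made concrete with an exact calculation. The sketch below is an illustration under stated assumptions, not part of the proof: it assumes k is odd and (k + y)/2 is an integer floor, and it only evaluates the uniformly weighted sum of all 4k bits with the best possible threshold (it says nothing about whether uniform weighting is optimal). Under those assumptions each group with hidden bit 1 contributes 3 plus a fair coin and each group with hidden bit 0 contributes a fair coin, so given y the total is 3 * ((k + y) // 2) + B with B ~ Binomial(k, 1/2). The function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def uniform_weight_error(k):
    """Exact error of the best threshold on the sum of all 4k output bits (k odd)."""
    if k % 2 == 0:
        raise ValueError("this sketch assumes odd k")
    a0, a1 = 3 * (k // 2), 3 * ((k + 1) // 2)   # offsets of the sum for y = 0 and y = 1
    # cum[j] = number of outcomes with B < j, out of 2**k equally likely outcomes
    cum = [0]
    for b in range(k + 1):
        cum.append(cum[-1] + comb(k, b))
    total = cum[-1]

    def below(j):
        """Count of outcomes with B < j, clamped to the valid range."""
        return cum[min(max(j, 0), k + 1)]

    best = None
    for t in range(4 * k + 2):                   # rule: predict y = 1 iff sum >= t
        false_pos = total - below(t - a0)        # y = 0 but sum >= t
        false_neg = below(t - a1)                # y = 1 but sum < t
        err = Fraction(false_pos + false_neg, 2 * total)
        if best is None or err < best:
            best = err
    return best
```

For example, `uniform_weight_error(3)` is exactly 1/16, and the value climbs toward 1/2 as k grows because the gap of 3 between the two class means is swamped by Binomial noise of standard deviation sqrt(k)/2.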
Interesting Papers at NIPS 2006
Here are some papers that I found surprisingly interesting.
- Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, Greedy Layer-wise Training of Deep Networks. Empirically investigates some of the design choices behind deep belief networks.
- Long Zhu, Yuanhao Chen, Alan Yuille, Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing. An unsupervised method for detecting objects using simple feature filters that works remarkably well on the (supervised) Caltech-101 dataset.
- Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira, Analysis of Representations for Domain Adaptation. This is the first analysis I’ve seen of learning with respect to samples drawn differently from the evaluation distribution which depends on reasonable measurable quantities.
All of these papers turn out to have a common theme—the power of unlabeled data to do generically useful things.
Automated Labeling
One of the common trends in machine learning has been an emphasis on the use of unlabeled data. The argument goes something like “there aren’t many labeled web pages out there, but there are a huge number of web pages, so we must find a way to take advantage of them.” There are several standard approaches for doing this:
- Unsupervised Learning. You use only unlabeled data. In a typical application, you cluster the data and hope that the clusters somehow correspond to what you care about.
- Semisupervised Learning. You use both unlabeled and labeled data to build a predictor. The unlabeled data influences the learned predictor in some way.
- Active Learning. You have unlabeled data and access to a labeling oracle. You interactively choose which examples to label so as to optimize prediction accuracy.
It seems there is a fourth approach worth serious investigation—automated labeling. The approach goes as follows:
- Identify some subset of observed values to predict from the others.
- Build a predictor.
- Use the output of the predictor to define a new prediction problem.
- Repeat…
Examples of this sort seem to come up in robotics very naturally. An extreme version of this is:
- Predict nearby things given touch sensor output.
- Predict medium distance things given the nearby predictor.
- Predict far distance things given the medium distance predictor.
Some of the participants in the LAGR project are using this approach.
A less extreme version was the DARPA Grand Challenge winner, where the output of a laser range finder was used to form a road-or-not predictor for a camera image.
These automated labeling techniques transform an unsupervised learning problem into a supervised learning problem, which has huge implications: we understand supervised learning much better and can bring to bear a host of techniques.
The set of work on automated labeling is sketchy—right now it is mostly just an observed-as-useful technique for which we have no general understanding. Some relevant bits of algorithm and theory are:
- Reinforcement learning to classification reductions which convert rewards into labels.
- Cotraining, which considers a setting containing multiple data sources. When predictors using different data sources agree on unlabeled data, an inferred label is automatically created.
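The agreement step described in the cotraining item can be sketched very simply. This is a literal reading of that one sentence, not a full co-training algorithm (which would also retrain the predictors on the inferred labels); the function and parameter names are hypothetical.

```python
def infer_labels(predict_a, predict_b, unlabeled):
    """Return (example, label) pairs for unlabeled examples on which both predictors agree.

    `unlabeled` holds (view_a, view_b) pairs: two data sources describing the same item.
    """
    inferred = []
    for view_a, view_b in unlabeled:
        label_a = predict_a(view_a)
        label_b = predict_b(view_b)
        if label_a == label_b:                 # agreement creates an automatic label
            inferred.append(((view_a, view_b), label_a))
    return inferred
```

The inferred pairs can then be treated as ordinary supervised training data, which is the sense in which the labeling is automated.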
It’s easy to imagine that undiscovered algorithms and theory exist to guide and use this empirically useful technique.
Why Manifold-Based Dimension Reduction Techniques?
Manifold-based dimension-reduction algorithms share the following general outline.
Given: a metric d() and a set of points S
- Construct a graph with one node per point and an edge from each node to each of its k nearest neighbors. Associate with each edge a weight which is the distance between the points in the connected nodes.
- Digest the graph. This might include computing the shortest path between all points or figuring out how to linearly interpolate the point from its neighbors.
- Find a set of points in a low dimensional space which preserve the digested properties.
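The first two steps of this outline can be sketched with the standard library alone. This is an illustration rather than any particular published algorithm: it uses all-pairs shortest paths (via Floyd-Warshall) as the digest, treats neighbor edges as undirected (an assumption), and stops before the third step, which needs an embedding method. The function names are hypothetical.

```python
import math

def knn_graph(points, k, d):
    """Step 1: weighted graph linking each point to its k nearest neighbors under metric d."""
    n = len(points)
    graph = [dict() for _ in range(n)]
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: d(points[i], points[j]))[:k]
        for j in nearest:
            w = d(points[i], points[j])
            graph[i][j] = w
            graph[j][i] = w                    # ASSUMPTION: treat edges as undirected
    return graph

def all_pairs_shortest_paths(graph):
    """Step 2 (one possible digest): shortest path lengths via Floyd-Warshall."""
    n = len(graph)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        for j, w in graph[i].items():
            dist[i][j] = w
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    return dist

# Example usage: points along an L-shaped path, with Euclidean distance as d().
points = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
geodesic = all_pairs_shortest_paths(knn_graph(points, 1, math.dist))
```

On the L-shaped example, the graph distance between the two endpoints follows the path through the corner, so it is longer than the straight-line distance between them. Recovering that kind of along-the-manifold distance is the point of the digest step.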
Examples include LLE, Isomap (which I worked on), Hessian-LLE, SDE, and many others. The hope with these algorithms is that they can recover the low dimensional structure of point sets in high dimensional spaces. Many of them can be shown to work in interesting ways producing various compelling pictures.
Despite doing some early work in this direction, I suffer from a motivational problem: Why do we want to recover the low dimensional structure? One answer is “for better data visualization”. This is compelling if you have data visualization problems. However, I don’t; I want to make machines that can better predict the future, which generally appears to be a sound goal of learning. Reducing the dimensionality of a dataset is not obviously helpful in accomplishing this. In fact, doing so violates one of the basic intuitions of applied learning algorithms: “avoid double approximation”. (One approximation = the projection into the low dimensional space, another approximation = the classifier learned on that space.)
Another answer is “for robots”. Several people have experimented with using a vision sensor and a dimension reduction technique in an attempt to extract the manifold of pose space. These attempts have not generally worked well, basically because the Euclidean distance on pixels is not particularly good at predicting which things are “nearby”. However, we might be able to do considerably better if we learn the distance. At the 1-bit level, we might learn a predictor from image pairs to “nearby” or “far”. Any stream S of images i_1, i_2, i_3, …, i_n can be transformed into a binary problem according to:
{((i_j, i_k), 1 - I(j = k+1 or k = j+1)) : i_j, i_k in S}
In unmath: “the binary problem formed by predicting whether images are adjacent in the chain of experience”. (*) A good solution to this binary problem would give us an interesting 1-bit metric. Using regression and counting numbers of transitions might provide a more conventional multibit metric.
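The transformation from a stream of images into the binary problem can be sketched directly from the set definition. The function name is hypothetical, and images are treated as opaque values; the code follows the formula literally, so an image paired with itself is not "adjacent" and receives label 1.

```python
def adjacency_problem(stream):
    """Build ((i_j, i_k), label) examples from a stream of images.

    The label is 1 - I(j = k+1 or k = j+1): 0 for adjacent frames, 1 otherwise.
    """
    examples = []
    n = len(stream)
    for j in range(n):
        for k in range(n):
            adjacent = (j == k + 1) or (k == j + 1)
            # Literal reading of the formula: (i_j, i_j) is not adjacent, so it gets label 1.
            examples.append(((stream[j], stream[k]), 0 if adjacent else 1))
    return examples
```

A stream of n images yields n^2 pairs, of which only 2(n - 1) are labeled 0, so the resulting problem is heavily imbalanced unless the pairs are subsampled or reweighted.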
This metric, if well solved, has a concrete meaning: the minimum distance in terms of actuator transitions between positions. A shortest path in this space is a sequence of actuator movements leading from a position A to a position B. A projection of this space into low dimensions provides some common format which both the human and the robot can understand. Commanding the robot to go to some location is just a matter of pointing out that location in the low dimensional projection.
This is a possible use for manifold-based dimension reduction techniques which I find compelling, if it works out. (Anyone interested in playing with this should talk to Dana Wilkinson, who is considering experimenting with this approach.)
(*) We probably would want to tweak the positive/negative ratio to reflect the pattern encountered in usage.
(**) Post tweaked to fix an oversight.