The Machine Learning Department

The Carnegie Mellon School of Computer Science now has the first academic Machine Learning department. This department already existed as the Center for Automated Learning and Discovery, but recently changed its name.

The reason for changing the name is obvious: very few people think of themselves as “Automated Learners and Discoverers”, but there are a number of people who think of themselves as “Machine Learners”. Machine learning is both more succinct and more recognizable, which are good properties for a name.

A more interesting question is “Should there be a Machine Learning Department?”. Tom Mitchell has a relevant whitepaper claiming that machine learning is answering a different question than other fields or departments. The fundamental debate here is “Is machine learning different from statistics?”

At a cultural level, there is no real debate: they are different. Machine learning is characterized by several very active, large, peer-reviewed conferences, operating in a computer science mode. Statistics tends to place a greater emphasis on journals and a lesser emphasis on conferences, which often implies a much longer publishing cycle.

In terms of the basic questions driving the field, the answer seems less clear. It is true that the core problems of statistics in the past have typically differed from the core problems of machine learning today. Yet there has been substantial overlap, and there are a number of statisticians nowadays who are actively doing machine learning. It’s reasonably plausible that in the long term statistics departments will adopt the core problems of machine learning, removing the reasons for a separate machine learning department.

The parallel question for computer science comes up less often perhaps because computer science is a notoriously broad field.

The practical implication of a new department is the ability to create a more specific curriculum, admit more specific students, and hire faculty based upon more specific interests. Compared to a computer science program, classes on programming languages, computer architecture, or graphics might be dropped in favor of classes on learning theory, statistics, etc… Compared to a statistics program, classes on advanced parameter estimation and measure theory might be dropped in favor of algorithms and programming experience.

An alternative solution like “learn everything from computer science and statistics” is personally appealing to me, and I have benefited from and recommend a broad education. However, this is not practical for everyone. In my experience, a machine learning skill set is an effective specialization with which people can do important things in the world. Given this, having a department with a machine learning centered curriculum seems like a good idea. At Carnegie Mellon, this is the Machine Learning department. In the future and elsewhere it may have a different name, but the value of the machine learning skill set should grow with research, improving computers, and improving data sources.

A Deep Belief Net Learning Problem

“Deep learning” is used to describe learning architectures which have significant depth (as a circuit).

One claim is that shallow architectures (one or two layers) cannot concisely represent some functions, while a circuit with more depth can concisely represent these same functions. Proving lower bounds on the size of a circuit is substantially harder than proving upper bounds (which are constructive), but some results are known. Luca Trevisan’s class notes detail how XOR is not concisely representable in “AC0” (constant-depth circuits of unbounded fan-in AND, OR, and NOT gates). This doesn’t quite prove that depth is necessary for the representations commonly used in learning (such as a thresholded weighted sum), but it is strongly suggestive that this is so.
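
As a concrete (and standard, not from this post) instance of the representation issue in the two-bit case: no single thresholded weighted sum computes XOR.

```latex
% Suppose \operatorname{sign}(w_1 x_1 + w_2 x_2 + b) computed XOR on \{0,1\}^2. Then:
\begin{align*}
(0,0) \mapsto 0 &\;\Rightarrow\; b < 0 \\
(1,0) \mapsto 1 &\;\Rightarrow\; w_1 + b > 0 \\
(0,1) \mapsto 1 &\;\Rightarrow\; w_2 + b > 0 \\
(1,1) \mapsto 0 &\;\Rightarrow\; w_1 + w_2 + b < 0
\end{align*}
% Adding the two middle inequalities gives w_1 + w_2 + 2b > 0, hence
% w_1 + w_2 + b > -b > 0, contradicting the last line.  A two-layer circuit of
% threshold gates handles XOR easily.
```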

Examples like this are a bit disheartening because existing algorithms for deep learning (deep belief nets, gradient descent on deep neural networks, and perhaps decision trees, depending on who you ask) can’t learn XOR very easily. The evidence so far suggests that learning a noisy version of XOR is hard; in fact, cryptosystems have been proposed based upon this hardness. It also suggests that XOR-based deep learning problems admit no algorithm much better than “guess and check”.

It turns out that we can define deep learning problems which are solvable by deep belief net style algorithms. Some definitions:

  1. Learning Problem A learning problem is defined by a probability distribution D(x,y) over features x, which are a vector of bits, and a label y, which is either 0 or 1.
  2. Shallow Learning Problem A shallow learning problem is a learning problem where the label y can be predicted with error rate at most e < 0.5 by a weighted linear combination of features, sign(sum_i w_i x_i).
  3. Deep Learning Problem A deep learning problem is a learning problem with a solution representable by a circuit of weighted linear sums with O(number of input features) gates. (The last two definitions are restated a little more formally just below.)
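
Restating the shallow/deep distinction a bit more formally (my notation; note the definitions above leave the target error rate for the deep case implicit):

```latex
% A learning problem: a distribution D over (x, y), with x \in \{0,1\}^n and y \in \{0,1\}.
% Shallow: some single thresholded weighted sum predicts y with error at most e < 0.5
% (identifying sign = +1 with y = 1 and sign = -1 with y = 0):
\[
\exists w \in \mathbb{R}^n : \quad
\Pr_{(x,y) \sim D}\Big[ \operatorname{sign}\Big( \sum_i w_i x_i \Big) \ne y \Big] \;\le\; e \;<\; 0.5 .
\]
% Deep: the same holds for some circuit of weighted linear sum gates with O(n) gates.
```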

These definitions are not necessarily the correct ones (and I’d like to hear from anyone who disagrees with the definition, and why), but they seem to capture the intuitions I know. Note that the definition of “deep learning problem” contains the definition of “shallow learning problem” and the XOR example. With high probability, it does not contain a random function. This definition is not captured by any existing complexity class I know of, although some are close (TC0, for example).

Theorem There exists a deep learning problem for which:

  1. A deep belief net (like) learning algorithm can achieve error rate 0 with probability 1 - d, for any d > 0, in the limit as the number of IID samples goes to infinity.
  2. The learning problem is not shallow. In particular, for all e > 0, all weighted predictors have error rate at least 1/2 - e.

The proof is actually a little bit stronger than the theorem statement. The definition of a ‘shallow learning problem’ can be broadened in several ways to include solution by representation of many common learning algorithms. Also, instead of an asymptotic analysis, a finite sample analysis could be made.

This theorem (roughly) says that “deep learning could be useful in practice”. This is a fairly weak statement. However, a stronger PAC-learning statement appears implausible because deep belief net (like) algorithms actively use the structure in x while PAC analysis holds for all distributions over x. Given the weakness of the theorem statement, empirical evidence for the effectiveness (or not) of deep learning is important.

Proof (This is a sketch only.) The first part of the proof is constructive: we simply specify a learning problem, and then show that a deep belief net-like algorithm can solve it. The second part involves some probabilistic analysis.

The learning problem is essentially a ‘hidden bits problem’ which is best specified by defining an algorithm for drawing an example. The problem is parameterized by an integer k, where larger k problems hold for smaller choices of e. An example is drawn by first picking a uniform random bit y from {0,1}. After that, k hidden bits h_1, …, h_k are set so that a random subset of (k + y)/2 of them are 1 and the rest 0. For each hidden bit h_i, we have 4 output bits x_{i1}, x_{i2}, x_{i3}, x_{i4} (implying a total of 4k output bits). If h_i = 0, with 0.5 probability we set one of the output bits to 1 and the rest to 0, and with 0.5 probability we set all output bits to 0. If h_i = 1, with 0.5 probability we set one of the output bits to 0 and the rest to 1, and with 0.5 probability we set all output bits to 1.
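
To make the sampling process concrete, here is a minimal sketch of the example-drawing algorithm (my own illustrative code; the function name draw_example and the handling of the hidden-bit count are assumptions, as noted in the comments):

```python
import random

def draw_example(k, rng=random):
    """Draw one (x, y) pair from the 'hidden bits' problem with parameter k.

    Note: the post sets (k + y)/2 hidden bits to 1.  For that count to be an
    integer for both labels, I assume k is odd and read it as (k - 1)/2 + y,
    i.e. just under half the hidden bits for y = 0, just over half for y = 1.
    """
    y = rng.randint(0, 1)                        # uniform random label
    num_ones = (k - 1) // 2 + y
    hidden = [1] * num_ones + [0] * (k - num_ones)
    rng.shuffle(hidden)                          # a random subset of that size is 1

    x = []
    for h in hidden:                             # 4 output bits per hidden bit
        block = [h, h, h, h]
        if rng.random() < 0.5:                   # with probability 0.5, flip exactly
            block[rng.randrange(4)] = 1 - h      # one of the 4 output bits
        x.extend(block)
    return x, y
```

For example, draw_example(11) yields a 44-bit feature vector together with its label.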

This learning problem is solved by a two-level prediction process. Variations using recursive composition (redefine each “output bit” to be a hidden bit in a new layer, each of which has its own output bits) can make the “right” number of levels be larger than 2.

The deep belief net-like algorithm we consider is the algorithm which:

  1. Builds a threshold weighted sum predictor for every feature x_{ij}, using weights = the probability of agreement between the features minus 0.5.
  2. Builds a threshold weighted sum predictor for the label given the predicted values from the first step with weights as before.

(The real algorithm uses something similar to gradient descent which is more powerful, but this is all we need.)
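
Here is a rough sketch of the two-step predictor above (my own code, not the post’s; it reuses the hypothetical draw_example from the earlier sketch, and it maps bits to +/-1 so that “threshold” simply means taking the sign, which is one concrete reading of the description):

```python
import numpy as np

def train_two_step(examples):
    """Sketch of the two-step 'deep belief net (like)' algorithm described above.

    Step 1: for each output feature, weight every other feature by
            (empirical probability of agreement) - 0.5 and threshold the
            weighted sum to predict the hidden bit behind that feature.
    Step 2: weight each step-1 prediction by (empirical agreement with the
            label) - 0.5 and threshold again to predict y.
    """
    X = 2 * np.array([x for x, _ in examples]) - 1   # n x d matrix of +/-1 features
    Y = 2 * np.array([y for _, y in examples]) - 1   # labels as +/-1
    n = X.shape[0]

    # Pairwise agreement rates: for +/-1 bits, agreement = (correlation + 1)/2.
    agree = (X.T @ X / n + 1.0) / 2.0
    W = agree - 0.5                                   # first-layer weights
    np.fill_diagonal(W, 0.0)                          # no self-edges

    H = np.where(X @ W >= 0, 1, -1)                   # predicted hidden bits, n x d
    v = (H == Y[:, None]).mean(axis=0) - 0.5          # second-layer weights
    return W, v

def predict(W, v, x):
    s = 2 * np.asarray(x) - 1
    h = np.where(s @ W >= 0, 1, -1)
    return int(h @ v >= 0)                            # back to a {0,1} label
```

Training on a few tens of thousands of draws from draw_example(11) and measuring the error on fresh draws is an easy way to sanity-check the construction.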

For each output feature x_{ij}, the values of output features corresponding to other hidden bits are uncorrelated, since by construction Pr(h_i = h_{i'}) = 0.5 for i != i'. For output features which share a hidden bit, the probability of agreement in value between two bits j, j' is 0.75. If we have n IID samples from the learning problem, then Chernoff bounds imply that empirical expectations deviate from true expectations by at most (log((4k)^2/d) / (2n))^{0.5} for all pairs of features simultaneously, except with probability at most d. For the prediction of each feature, when n = 512 k^4 log((4k)^2/d), the sum of the weights on the 4(k-1) features corresponding to other hidden bits is bounded by 4(k-1) * 1/(32k^2) <= 1/(8k). On the other hand, the weights on the 3 other features sharing the same hidden bit are each at least 0.25 - 1/(32k^2), which is individually larger than the sum of all the other weights. Consequently, the predicted value is the majority of the 3 other features, which is always the value of the hidden bit.
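
Filling in the arithmetic behind the choice of n (my own verification of the constants quoted in the sketch):

```latex
% With n = 512\,k^4 \log((4k)^2/d), the uniform deviation bound becomes
\[
\left( \frac{\log\!\big((4k)^2/d\big)}{2n} \right)^{1/2}
  = \left( \frac{1}{1024\,k^4} \right)^{1/2}
  = \frac{1}{32k^2},
\]
% so the total weight on the 4(k-1) features generated by other hidden bits is at most
\[
4(k-1) \cdot \frac{1}{32k^2} \;\le\; \frac{4k}{32k^2} \;=\; \frac{1}{8k},
\]
% while each weight on the 3 features sharing the same hidden bit is at least
\[
0.25 - \frac{1}{32k^2} \;>\; \frac{1}{8k} \qquad \text{for every } k \ge 1 .
\]
```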

The above analysis (sketchily) shows that the predicted value for each output bit is the hidden bit used to generate it. The same style of analysis shows that, given the hidden bits, the label can be predicted perfectly. In this case, the value of each hidden bit provides a slight consistent edge in predicting the label, implying that the learning algorithm converges to uniform weighting over the predicted hidden bit values.

To prove the second part of the theorem, we can first show that a uniform weight over all features is the optimal predictor, and then show that the error rate of this predictor converges to 1/2 as k -> infinity. The optimality of uniform weighting is a little bit tricky to prove, but it is obvious at a high level because of (1) the symmetry in the definition of the problem and (2) the fact that a nonuniform weighting increases the noise. The error rate convergence to 0.5 is a statement about binomial probability distributions. Essentially, the noise in the observed bits given the hidden bits kills prediction performance.

2007 Summer Machine Learning Conferences

It’s conference season once again.

| Conference | Due? | When? | Where? | Double blind? | Author feedback? | Workshops? |
|---|---|---|---|---|---|---|
| AAAI | February 1/6 (and 27) | July 22-26 | Vancouver, British Columbia | Yes | Yes | Done |
| UAI | February 28/March 2 | July 19-22 | Vancouver, British Columbia | No | No | No |
| COLT | January 16 | June 13-15 | San Diego, California (with FCRC) | No | No | No |
| ICML | February 7/9 | June 20-24 | Corvallis, Oregon | Yes | Yes | February 16 |
| KDD | February 23/28 | August 12-15 | San Jose, California | Yes | No? | February 28 |

The geowinner this year is the west coast of North America. Last year’s geowinner was the Northeastern US, and the year before it was mostly Europe. It’s notable how tightly the conferences cluster, even when they don’t colocate.

Retrospective

It’s been almost two years since this blog began. In that time, I’ve learned enough to shift my expectations in several ways.

  1. Initially, the idea was for a general-purpose ML blog where different people could contribute posts. What has actually happened is that most posts come from me, with a few guest posts that I greatly value. There are a few reasons I see for this.
    1. Overload. A couple years ago, I had not fully appreciated just how busy life gets for a researcher. Making a post is not simply a matter of getting to it, but rather of prioritizing between {writing a grant, finishing an overdue review, writing a paper, teaching a class, writing a program, etc…}. This is a substantial transition away from what life as a graduate student is like. At some point the question is not “when will I get to it?” but rather “will I get to it?” and the answer starts to become “no” most of the time.
    2. Feedback failure. This blog currently receives about 3K unique visitors per day from about 13K unique sites per month. This number of visitors is large enough that it scares me somewhat—having several thousand people read a post is more attention than almost all papers published in academia get. But the nature of things is that only a small fraction of people leave comments, and the rest are essentially invisible. Adding a few counters to the site may help with this.
    3. Content Control. The internet has a huge untapped capacity to support content, so one of the traditional reasons for editorial control (limited space) simply no longer exists. Nevertheless, the time of readers is important and there is a focus-of-attention issue since one blog with all posts on all topics would be virtually useless. In an ideal world, the need for explicit content control would disappear and be replaced by a massive cooperative collaborative filtering process. This shift is already well underway since anyone can start their own blog and read anything they choose. Tighter integration of collaborative filtering into the overall process will surely be useful. I’ve reorganized my links to other blogs to make this a little bit easier. In the last couple years, many new machine learning related blogs have started (just recently: Yee Whye’s), which is great in several ways.
    4. Difficulty. Talking clearly about things you barely understand (and no one else does) is simply very difficult. Expending the effort to write clearly about them in a post is not too different from expending the effort to write clearly about them in a paper, which is the traditional mechanism of publishing. There is no simple way around this problem, although changing people’s expectations may be helpful. Right now, the expectation in academia is (partially) set by the academic paper. A different expectation, more akin to the way we discuss problems with each other in person, may be helpful.

    For the record, I’m always happy to consider posts by others. If you are considering your own blog, trying a guest post or two is a great way to experiment. Many people don’t have the time or inclination to run their own blog, so guest posts are essential.

  2. What is a good post? A good post is fundamentally an interesting post, but “interesting” can be broken down further.
    1. Speak plainly. The review process in academia can sometimes favor the convoluted over the plain. A blog strongly encourages otherwise since the backgrounds of readers are very diverse. If you aren’t self-editing for simplicity, you aren’t being simple enough.
    2. Believe in it.
    3. Lack of comments is not always lack of interest. As an example, posts of the form “interesting papers at <conference>” tend to get very few comments, but they are some of the most viewed.
    4. Avoid duplication. The most obvious way to use a blog is as a mechanism for posting finished research. It’s ok for this, but the most interesting uses of a blog are for topics which could not be stated as a research paper.

Interesting Papers at NIPS 2006

Here are some papers that I found surprisingly interesting.

  1. Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, Greedy Layer-wise Training of Deep Networks. Empirically investigates some of the design choices behind deep belief networks.
  2. Long Zhu, Yuanhao Chen, Alan Yuille Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing. An unsupervised method for detecting objects using simple feature filters that works remarkably well on the (supervised) Caltech-101 dataset.
  3. Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira, Analysis of Representations for Domain Adaptation. This is the first analysis I’ve seen of learning when the samples are drawn from a distribution different from the evaluation distribution, stated in terms of reasonable, measurable quantities.

All of these papers turn out to have a common theme—the power of unlabeled data to do generically useful things.