August 2006 – Machine Learning (Theory)

8/28/20068/28/2006

Learning Theory standards for NIPS 2006

Bob Williamson and I are the learning theory PC members at NIPS this year. This is some attempt to state the standards and tests I applied to the papers. I think it is a good idea to talk about this for two reasons:

Making community standards a matter of public record seems healthy. It give us a chance to debate what is and is not the right standard. It might even give us a bit more consistency across the years.
It may save us all time. There are a number of papers submitted which just aren’t there yet. Avoiding submitting is the right decision in this case.

There are several criteria for judging a paper. All of these were active this year. Some criteria are uncontroversial while others may be so.

The paper must have a theorem establishing something new for which it is possible to derive high confidence in the correctness of the results. A surprising number of papers fail this test. This criteria seems essential to the definition of “theory”.
1. Missing theorem statement
2. Missing proof This isn’t an automatic fail, because sometimes reviewers can be expected to fill in the proof from discussion. (Not all theorems are hard.) Similarly, sometimes a proof sketch is adequate. Providing the right amount of detail to give confidence in the results is tricky, but general advice is: err on the side of being explicit.
3. Imprecise theorem statement A number of theorems are simply too imprecise to verify or imagine verifying. Typically they are written in english or mixed math/english and have words like “small”, “very small”, or “itsy bitsy”.
4. Typos and thinkos Often a theorem statement or proof is “right” when expressed correctly, but it isn’t expressed correctly: typos and thinkos (little correctable bugs in how you were thinking) confuse the reader.
5. Not new This may be controversial, because the test of ‘new’ is stronger than some people might expect. A theorem of the form “algorithm A can do B” is not new when we already know “algorithm C can do B”.
Some of these problems are sometimes fixed by smart reviewers. Where that happens, it’s fine. Sometimes a paper has a reasonable chance of passing evaluation as an algorithms paper (which has experimental requirements). Where that happens, it’s fine.
The paper should plausibly lead to algorithmic implications. This test is applied in a varying strength. For an older mathematical model of learning, we tried to apply at the level of “I see how an algorithm might be developed from this insight”. For a new model of learning, this test was applied only weakly.
We did not require that the paper be about machine learning. For non-learning papers, we decided to defer to the judgement of referees on whether or not the results were relevant to NIPS. It seemed more natural that authors/reviewers be setting the agenda here.
I had a preference for papers presenting new mathematical models. I liked Neil Lawrence‘s comment: “If we started rejecting learning theory papers for having the wrong model, where would we stop?” There is a natural tendency to forget the drawbacks of the accepted models in machine learning when evaluating new models, so it seems appropriate to provide some encouragement towards exploration.
Papers were not penalized for having experiments. Sometimes experiments helped (especially when the theory was weak), and sometimes they had no effect.

Reviewing is a difficult process—it’s very difficult to get 82 (the number this time) important decisions right. It’s my hope that more such decisions can be made right in the future, so I’d like to invite comments on what the right criteria are and why. This year’s decisions are made now (and will be released soon), so any suggestions will just influence the future.

8/18/2006

Report of MLSS 2006 Taipei

The 2006 Machine Learning Summer School in Taipei, Taiwan ended on August 4, 2006. It has been a very exciting two weeks for a record crowd of 245 participants (including speakers and organizers) from 18 countries. We had a lineup of speakers that is hard to match up for other similar events (see our WIKI for more information). With this lineup, it is difficult for us as organizers to screw it up too bad. Also, since we have pretty good infrastructure for international meetings and experienced staff at NTUST and Academia Sinica, plus the reputation established by previous MLSS series, it was relatively easy for us to attract registrations and simply enjoyed this two-week long party of machine learning.

In the end of MLSS we distributed a survey form for participants to fill in. I will report what we found from this survey, together with the registration data and word-of-mouth from participants.

The first question is designed to find out how our participants learned about MLSS 2006 Taipei. Unfortunately, most of the participants learned about MLSS from their advisors and it is difficult for us to track how their advisors learned about MLSS. But it appears that posters at ICASSP (related but not directly related to ML) work better than at NIPS (directly related to ML).

Question 2 to 6 ask participants their demographical background. They are mostly graduate students working on one of the application fields of machine learning (e.g., bioinformatics, multimedia, NLP processing). Asked about why they attended MLSS, as expected, about 2/3 replied that they wanted to use ML and 1/3 replied that they wanted to do ML research. Most of participants attended all talks, which is consistent with our record. The attendance rate was kept at about 80 percent every day, a remarkable record for both speakers and participants. Asked about what makes it difficult for them to understand the talks, about half replied mathematics, about a quarter replied “no examples” and less than a quarter replied English. Finally, all talk topics were mentioned as being helpful by our participants, especially those talks that are of more introductory nature, such as graphical models by Sam Roweis, SVM by Chih-Jen Lin, and Boosting by Gunnar Ratsch, while talks with many theorems and proofs are less popular.

We let participants provide their suggestions to us. An issue that was brought up very often is that many wanted lecture notes to be distributed in advance, maybe a day before the talks, maybe it would be better if before MLSS. One of them suggested we put prerequisite math background on the Web so that they would have been better prepared (but that may scare away many people, not good for organizers…). A quick fix for this problem is to provide Web pointers to previous MLSS slides and video and urge registered participants to take a look at them in advance to prepare themselves before attending MLSS.

Another frequently brought-up issue is that they would like to hear more concrete examples, have a chance to do some exercises, and learn more about the applications of the algorithms and analysis results given in the talks. This is reasonable given that 2/3 of our participants’ goal was learning how to use ML. So Manfred decided to adapt to their needs “online” by modifying his talk slides over night (thanks!).

Then our participants would like organizers to design more activities to encourage interaction with speakers and among participants. I think we could have done a better job here. We could have let our speakers “expose” to participants more often than staying in a cozy VIP lounge. We could have also provided online and physical chat board for participants to expose their contact IDs.

We scheduled our talks mostly based on the time constraints given by the speakers. Roughly, it was like we made graphical models/learning reduction first, SVM and kernels next, then online/boosting/bounds, and finally clustering. It turned out that our speakers were so good that they covered and adapted to others related talks and made the entire program appear like a carefully designed coherent one. So most participants liked the program and only one complaint was about this part.

Putting all of the comments together, I think we have two clusters of participants that are hard to please at the same time. One cluster of participants is looking for new research topics on ML or trying to enhance their understanding of some advanced topics on ML. If MLSS is designed for them, speakers can present their latest or even ongoing research results. Speakers can take the chance to spread their idea in the hope that the idea can be further explored by more researchers. The other cluster of participants is new to ML. They might be trying to learn how to use ML to solve their problem, or just trying to have a better idea of what ML is all about. To them, speakers need to present more examples, show them applications, and present mature results. It is still possible to please both of them. We tried to balance their needs by plugging in research talks every afternoon. Again our speakers did a great job here and it seemed work pretty well. We also designed a graduate credit program to give registered students a preview and prerequisite math background. Unfortunately, due to long beauraucratic process, we could not announce it early enough to accommodate more students and the program did not accept international students. I think we could have done a better job helping our participants understand the nature of the summer school and be prepared.

Finally, on behalf of the steering committee, I would like to take this chance to thank Alex Smola, Bernhard Scholkoph and John Langford for their help to put together this excellent lineup of speakers and the positive examples they established in previous MLSS series for us to learn from.

8/10/2006

Precision is not accuracy

In my experience, there are two different groups of people who believe the same thing: the mathematics encountered in typical machine learning conference papers is often of questionable value.
The two groups who agree on this are applied machine learning people who have given up on math, and mature theoreticians who understand the limits of theory.

Partly, this is just a statement about where we are with respect to machine learning. In particular, we have no mechanism capable of generating a prescription for how to solve all learning problems. In the absence of such certainty, people try to come up with formalisms that partially describe and motivate how and why they do things. This is natural and healthy—we might hope that it will eventually lead to just such a mechanism.

But, part of this is simply an emphasis on complexity over clarity. A very natural and simple theoretical statement is often obscured by complexifications. Common sources of complexification include:

Generalization By trying to make a statement that applies in the most general possible setting, your theorem becomes excessively hard to read.
Specialization Your theorem relies upon so many assumptions that it is hard for a simple reader to hold them all in their head.
Obscuration Your theorem relies upon cumbersome notation full of subsubsuperscripts, badly named variables, etc…

There are several reasons why complexification occurs.

Excessive generalization often happens when authors have an idea and want to completely exploit it. So, various bells and whistles are added until the core idea is obscured.
Excessive specialization often happens when authors have some algorithm they really want to prove works. So, they start making one assumption after another until the proof goes through.
Obscuration is far more subtle than it sounds. Some of the worst obscurations come from using an old standard notation which has simply been pushed to far.

After doing research for awhile, you realize that these complexifications are counterproductive. Type (1) complexifications make it double hard for others to do follow-on work: your paper is hard to read and you have eliminated the possibility. Type (2) complexifications look like “the tail wags the dog”—the math isn’t really working until it guides the algorithm design. Figuring out how to remove the assumptions often results in the better algoritihm. Type (3) complexifications are an error. Fooling around to find a better notation is one of the ways that we sometimes make new discoveries.

The worst reason, I’ve saved for last: it’s that the reviewing process emphasizes precision over accuracy. Imagine shooting a math gun at a machine learning target. A high precision math gun will very carefully guide the bullets to strike a fixed location—even though the location may have little to do with the target. An accurate math gun will point at the correct target. A precision/accuracy tradeoff is often encountered: we don’t know how to think about the actual machine learning problem, so instead we very precisely think about another not-quite-right problem. A reviewer almost invariably prefers the more precise (but less accurate) paper because precision is the easy thing to check and think about.

There seems to be no easy fix for this—people naturally prefer to value the things they can measure. The hard fix for this is more time spent by everyone thinking about what the real machine learning problems are.

8/7/20068/7/2006

The Call of the Deep

Many learning algorithms used in practice are fairly simple. Viewed representationally, many prediction algorithms either compute a linear separator of basic features (perceptron, winnow, weighted majority, SVM) or perhaps a linear separator of slightly more complex features (2-layer neural networks or kernelized SVMs). Should we go beyond this, and start using “deep” representations?

What is deep learning?
Intuitively, deep learning is about learning to predict in ways which can involve complex dependencies between the input (observed) features.

Specifying this more rigorously turns out to be rather difficult. Consider the following cases:

SVM with Gaussian Kernel. This is not considered deep learning, because an SVM with a gaussian kernel can’t succinctly represent certain decision surfaces. One of Yann LeCun‘s examples is recognizing objects based on pixel values. An SVM will need a new support vector for each significantly different background. Since the number of distinct backgrounds is large, this isn’t easy.
K-Nearest neighbor. This is not considered deep learning for essentially the same reason as the gaussian SVM. The number of representative points required to recognize an image in any background is very large.
Decision Tree. A decision tree might be considered a deep learning system. However, there exist simple learning problems that defeat decision trees using axis aligned splits. It’s easy to find problems that defeat such decision trees by rotating a linear separator through many dimensions.
2-layer neural networks. A two layer neural network isn’t considered deep learning because it isnt a deep architecture. More importantly, perhaps, the object recognition with occluding background problem implies that the hidden layer must be very large to do general purpose detection.
Deep neural networks. (for example, convolutional neural networks) A neural network with several layers might be considered deep.
Deep Belief networks are “deep”.
Automated feature generation and selection systems might be considered deep since they can certainly develop deep dependencies between the input and the output.

One test for a deep learning system is: are there well-defined learning problems which the system can not solve but a human easily could? If the answer is ‘yes’, then it’s perhaps not a deep learning system.

Where might deep learning be useful?
There are several theorems of the form: “nearest neighbor can learn any measurable function”, “2 layer neural networks can represent any function”, “a support vector machine with a gaussian kernel can learn any function”. These theorems imply that deep learning is only interesting in the bounded data or computation case.

And yet, for the small data situation (think “30 examples”), problems with overfitting become so severe it’s difficult to imagine using more complex learning algorithms than the shallow systems comonly in use.

So the domain where a deep learning system might be most useful involves large quantities of data with computational constraints.

What are the principles of design for deep learning systems?
The real answer here is “we don’t know”, and this is an interesting but difficult direction of research.

Is (approximate) gradient descent the only efficient training algorithm?
Can we learn an architecture on the fly or must it be prespecified?
What are the limits of what can be learned?

8/3/2006

AOL’s data drop

AOL has released several large search engine related datasets. This looks like a pretty impressive data release, and it is a big opportunity for people everywhere to worry about search engine related learning problems, if they want.