How to solve an NP-hard problem in quadratic time

This title is a lie, but it is a special lie which has a bit of truth.

If n players each play each other, you have a tournament. How do you order the players from weakest to strongest?

The standard first attempt is “find the ordering which agrees with the tournament on as many player pairs as possible”. This is called the “minimum feedback arc set” problem in the CS theory literature, and it is a well-known NP-hard problem. A basic guarantee holds for the solution to this problem: if there is some “true” intrinsic ordering, and the outcome of the tournament disagrees with it k times (due to noise, for instance), then the output ordering will disagree with the original ordering on at most 2k edges (and no solution can do better).

One standard approach to tractably solving an NP-hard problem is to find another algorithm with an approximation guarantee. For example, Don Coppersmith, Lisa Fleischer and Atri Rudra proved that ordering players according to the number of wins is a 5-approximation to the NP-hard problem.
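The win-counting heuristic is simple enough to state in a few lines of code. Below is a minimal sketch (the toy 4-player tournament and function names are made up for illustration) of ranking players by wins and counting how many pairs the resulting order gets backwards relative to the tournament:

```python
import itertools

def rank_by_wins(beats):
    """Order players by number of wins, most wins first."""
    n = len(beats)
    wins = [sum(beats[i]) for i in range(n)]
    # Python's sort is stable, so ties are broken by player index.
    return sorted(range(n), key=lambda i: -wins[i])

def disagreements(order, beats):
    """Count pairs where the ordering contradicts the tournament outcome."""
    pos = {p: k for k, p in enumerate(order)}
    return sum(1 for i, j in itertools.permutations(range(len(beats)), 2)
               if beats[i][j] and pos[i] > pos[j])

# A 4-player round-robin with one upset: player 3 beats player 0.
beats = [
    [False, True,  True,  False],  # 0 beats 1, 2
    [False, False, True,  True ],  # 1 beats 2, 3
    [False, False, False, True ],  # 2 beats 3
    [True,  False, False, False],  # 3 beats 0 (the upset)
]
order = rank_by_wins(beats)
print(order, disagreements(order, beats))  # [0, 1, 2, 3] with 1 disagreement
```

Sorting by wins is O(n log n) after the O(n²) win count, which is where the “quadratic time” of the title comes from.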

An even better approach is to realize that the NP-hard problem may not be the real problem. The real problem may be finding a good approximation to the “true” intrinsic ordering given noisy tournament information.

In a learning setting, the simplest form of ranking problem is “bipartite ranking” where every element has a value of either 0 or 1 and we want to order 0s before 1s. A common way to measure the performance of bipartite ranking is according to “area under the ROC curve” (AUC) = 1 – the fraction of out-of-order pairs. Nina, Alina, Greg and I proved that if we learn a comparison function which errs on k dissimilar pairs, then ordering according to the number of wins yields an order within 4k edge reversals of the original ordering. As a reduction statement(*), this shows that an error rate of e for a learned pairwise binary classifier produces an ordering with an expected AUC of 1 – 4e. The same inequality even holds for a (stronger) regret transform. If r = e – emin is the regret of the binary pairwise classifier, then the AUC regret is bounded by 4r. (Here emin is the error rate of the best possible classifier which predicts knowing the most likely outcome.) The regret result extends to more general measures of ordering than simply AUC.
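The AUC definition used above is easy to compute directly from pairs. Here is a small sketch (scores and labels below are made-up illustration values) that evaluates AUC as the fraction of correctly ordered dissimilar pairs, counting ties as half:

```python
def pairwise_auc(scores, labels):
    """AUC = fraction of (label-1, label-0) pairs ordered correctly,
    with ties counted as half-correct."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return correct / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.3, 0.1]
labels = [1,   0,   1,   0]
print(pairwise_auc(scores, labels))  # 0.75: one of four pairs is out of order
```

This is exactly “1 – the fraction of out-of-order pairs”, so a bound of 4k edge reversals translates directly into the stated AUC bound.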

We were unable to find any examples where ordering according to the degree produced more than a 2r AUC regret. Nikhil Bansal, Don, and Greg have worked out a tightened proof which gives exactly this upper bound. At the end of the day, we have an algorithm which satisfies precisely the same guarantee as the NP-hard solution.

There are two lessons here. The learning lesson is that a good pairwise comparator implies the ability to rank well according to AUC. The general research lesson is that the NP-hardness of one formulation is not an intrinsic obstacle: sometimes there exist simple tractable algorithms which satisfy the same guarantees as the NP-hard solution.

(*) To prove the reduction, you must make sure that you form pairwise examples in the right way. Your source of pairwise ordering examples must be uniform over the dissimilar pairs containing one example with label 1 and one example with label 0.
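The sampling requirement in the footnote can be sketched concretely. Assuming a dataset of (feature, label) pairs (names and data here are hypothetical), drawing one positive and one negative uniformly and independently makes every dissimilar pair equally likely:

```python
import random

def sample_pairwise_example(examples, rng=random):
    """Draw uniformly over dissimilar pairs: one label-1 example
    and one label-0 example."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    # Independent uniform draws from each side make every
    # (positive, negative) pair equally likely.
    return rng.choice(pos), rng.choice(neg)

data = [("a", 1), ("b", 0), ("c", 1), ("d", 0)]
xp, xn = sample_pairwise_example(data)
```

Sampling pairs any other way (e.g. uniformly over all pairs, including same-label ones) would break the correspondence between pairwise classification error and AUC.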

Learning Theory standards for NIPS 2006

Bob Williamson and I are the learning theory PC members at NIPS this year. This is some attempt to state the standards and tests I applied to the papers. I think it is a good idea to talk about this for two reasons:

  1. Making community standards a matter of public record seems healthy. It gives us a chance to debate what is and is not the right standard. It might even give us a bit more consistency across the years.
  2. It may save us all time. There are a number of papers submitted which just aren’t there yet. Avoiding submitting is the right decision in this case.

There are several criteria for judging a paper. All of these were active this year. Some criteria are uncontroversial while others may not be.

  1. The paper must have a theorem establishing something new for which it is possible to derive high confidence in the correctness of the results. A surprising number of papers fail this test. This criterion seems essential to the definition of “theory”. Typical failure modes include:
    1. Missing theorem statement
    2. Missing proof. This isn’t an automatic fail, because sometimes reviewers can be expected to fill in the proof from discussion. (Not all theorems are hard.) Similarly, sometimes a proof sketch is adequate. Providing the right amount of detail to give confidence in the results is tricky, but the general advice is: err on the side of being explicit.
    3. Imprecise theorem statement. A number of theorems are simply too imprecise to verify or imagine verifying. Typically they are written in English or mixed math/English and have words like “small”, “very small”, or “itsy bitsy”.
    4. Typos and thinkos. Often a theorem statement or proof is “right” when expressed correctly, but it isn’t expressed correctly: typos and thinkos (little correctable bugs in how you were thinking) confuse the reader.
    5. Not new. This may be controversial, because the test of ‘new’ is stronger than some people might expect. A theorem of the form “algorithm A can do B” is not new when we already know “algorithm C can do B”.

    Some of these problems are sometimes fixed by smart reviewers. Where that happens, it’s fine. Sometimes a paper has a reasonable chance of passing evaluation as an algorithms paper (which has experimental requirements). Where that happens, it’s fine.

  2. The paper should plausibly lead to algorithmic implications. This test was applied with varying strength. For an older mathematical model of learning, we tried to apply it at the level of “I see how an algorithm might be developed from this insight”. For a new model of learning, this test was applied only weakly.
  3. We did not require that the paper be about machine learning. For non-learning papers, we decided to defer to the judgement of referees on whether or not the results were relevant to NIPS. It seemed more natural that authors/reviewers be setting the agenda here.
  4. I had a preference for papers presenting new mathematical models. I liked Neil Lawrence‘s comment: “If we started rejecting learning theory papers for having the wrong model, where would we stop?” There is a natural tendency to forget the drawbacks of the accepted models in machine learning when evaluating new models, so it seems appropriate to provide some encouragement towards exploration.
  5. Papers were not penalized for having experiments. Sometimes experiments helped (especially when the theory was weak), and sometimes they had no effect.

Reviewing is a difficult process—it’s very difficult to get 82 (the number this time) important decisions right. It’s my hope that more such decisions can be made right in the future, so I’d like to invite comments on what the right criteria are and why. This year’s decisions are made now (and will be released soon), so any suggestions will just influence the future.

Precision is not accuracy

In my experience, there are two different groups of people who believe the same thing: the mathematics encountered in typical machine learning conference papers is often of questionable value.
The two groups who agree on this are applied machine learning people who have given up on math, and mature theoreticians who understand the limits of theory.

Partly, this is just a statement about where we are with respect to machine learning. In particular, we have no mechanism capable of generating a prescription for how to solve all learning problems. In the absence of such certainty, people try to come up with formalisms that partially describe and motivate how and why they do things. This is natural and healthy—we might hope that it will eventually lead to just such a mechanism.

But, part of this is simply an emphasis on complexity over clarity. A very natural and simple theoretical statement is often obscured by complexifications. Common sources of complexification include:

  1. Generalization By trying to make a statement that applies in the most general possible setting, your theorem becomes excessively hard to read.
  2. Specialization Your theorem relies upon so many assumptions that it is hard for a reader to hold them all in their head.
  3. Obscuration Your theorem relies upon cumbersome notation full of subsubsuperscripts, badly named variables, etc…

There are several reasons why complexification occurs.

  1. Excessive generalization often happens when authors have an idea and want to completely exploit it. So, various bells and whistles are added until the core idea is obscured.
  2. Excessive specialization often happens when authors have some algorithm they really want to prove works. So, they start making one assumption after another until the proof goes through.
  3. Obscuration is far more subtle than it sounds. Some of the worst obscurations come from using an old standard notation which has simply been pushed too far.

After doing research for a while, you realize that these complexifications are counterproductive. Type (1) complexifications make it doubly hard for others to do follow-on work: your paper is hard to read, and you have eliminated the possibility of natural follow-on generalizations. Type (2) complexifications look like “the tail wags the dog”—the math isn’t really working until it guides the algorithm design. Figuring out how to remove the assumptions often results in a better algorithm. Type (3) complexifications are an error. Fooling around to find a better notation is one of the ways that we sometimes make new discoveries.

The worst reason, I’ve saved for last: it’s that the reviewing process emphasizes precision over accuracy. Imagine shooting a math gun at a machine learning target. A high precision math gun will very carefully guide the bullets to strike a fixed location—even though the location may have little to do with the target. An accurate math gun will point at the correct target. A precision/accuracy tradeoff is often encountered: we don’t know how to think about the actual machine learning problem, so instead we very precisely think about another not-quite-right problem. A reviewer almost invariably prefers the more precise (but less accurate) paper because precision is the easy thing to check and think about.

There seems to be no easy fix for this—people naturally prefer to value the things they can measure. The hard fix for this is more time spent by everyone thinking about what the real machine learning problems are.

The Call of the Deep

Many learning algorithms used in practice are fairly simple. Viewed representationally, many prediction algorithms either compute a linear separator of basic features (perceptron, winnow, weighted majority, SVM) or perhaps a linear separator of slightly more complex features (2-layer neural networks or kernelized SVMs). Should we go beyond this, and start using “deep” representations?

What is deep learning?
Intuitively, deep learning is about learning to predict in ways which can involve complex dependencies between the input (observed) features.

Specifying this more rigorously turns out to be rather difficult. Consider the following cases:

  1. SVM with Gaussian Kernel. This is not considered deep learning, because an SVM with a gaussian kernel can’t succinctly represent certain decision surfaces. One of Yann LeCun‘s examples is recognizing objects based on pixel values. An SVM will need a new support vector for each significantly different background. Since the number of distinct backgrounds is large, this isn’t easy.
  2. K-Nearest neighbor. This is not considered deep learning for essentially the same reason as the gaussian SVM. The number of representative points required to recognize an image in any background is very large.
  3. Decision Tree. A decision tree might be considered a deep learning system. However, there exist simple learning problems that defeat decision trees using axis-aligned splits: rotating a linear separator through many dimensions yields problems such trees cannot represent succinctly.
  4. 2-layer neural networks. A two-layer neural network isn’t considered deep learning because it isn’t a deep architecture. More importantly, perhaps, the object recognition with occluding background problem implies that the hidden layer must be very large to do general purpose detection.
  5. Deep neural networks. (for example, convolutional neural networks) A neural network with several layers might be considered deep.
  6. Deep Belief networks are “deep”.
  7. Automated feature generation and selection systems might be considered deep since they can certainly develop deep dependencies between the input and the output.

One test for a deep learning system is: are there well-defined learning problems which the system cannot solve but a human easily could? If the answer is ‘yes’, then it’s perhaps not a deep learning system.

Where might deep learning be useful?
There are several theorems of the form: “nearest neighbor can learn any measurable function”, “2 layer neural networks can represent any function”, “a support vector machine with a gaussian kernel can learn any function”. These theorems imply that deep learning is only interesting in the bounded data or computation case.

And yet, for the small data situation (think “30 examples”), problems with overfitting become so severe that it’s difficult to imagine using more complex learning algorithms than the shallow systems commonly in use.

So the domain where a deep learning system might be most useful involves large quantities of data with computational constraints.

What are the principles of design for deep learning systems?
The real answer here is “we don’t know”, and this is an interesting but difficult direction of research.

  1. Is (approximate) gradient descent the only efficient training algorithm?
  2. Can we learn an architecture on the fly or must it be prespecified?
  3. What are the limits of what can be learned?

AOL’s data drop

AOL has released several large search engine related datasets. This looks like a pretty impressive data release, and it is a big opportunity for people everywhere to worry about search engine related learning problems, if they want.