How to solve an NP hard problem in quadratic time

This title is a lie, but it is a special lie which has a bit of truth.

If n players each play each other, you have a tournament. How do you order the players from weakest to strongest?

The standard first attempt is “find the ordering which agrees with the tournament on as many player pairs as possible”. This is called the “minimum feedback arc set” problem in the CS theory literature, and it is a well-known NP-hard problem. A basic guarantee holds for the solution to this problem: if there is some “true” intrinsic ordering, and the outcome of the tournament disagrees with it k times (due to noise, for instance), then the output ordering will disagree with the original ordering on at most 2k edges (and no solution can do better).
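
To make the objective concrete, here is a minimal sketch (mine, not from any of the work mentioned) of the quantity that minimum feedback arc set minimizes: the number of player pairs on which a candidate ordering contradicts the tournament. The `beats` table and the weakest-first ordering convention are illustrative assumptions.

```python
def count_disagreements(ordering, beats):
    """Count pairs where a weakest-to-strongest ordering contradicts the tournament.

    `beats[i][j]` is a hypothetical boolean table meaning "player i beat player j"
    in the round-robin tournament.
    """
    disagreements = 0
    for i in range(len(ordering)):
        for j in range(i + 1, len(ordering)):
            weaker, stronger = ordering[i], ordering[j]
            # The ordering claims `stronger` is the better player, so this pair
            # is a disagreement (a feedback arc) whenever the weaker player won.
            if beats[weaker][stronger]:
                disagreements += 1
    return disagreements
```

The NP-hard problem is to minimize this count over all possible orderings.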

One standard approach to tractably solving an NP-hard problem is to find another algorithm with an approximation guarantee. For example, Don Coppersmith, Lisa Fleischer and Atri Rudra proved that ordering players according to the number of wins is a 5-approximation to the NP-hard problem.
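
Here is a sketch of how simple that approximation is, reusing the hypothetical `beats` table from the sketch above. Tallying wins over all pairs costs O(n^2) comparisons, which is where the title’s “quadratic time” comes from.

```python
def rank_by_wins(players, beats):
    """Order players from weakest to strongest by their number of wins.

    `players` is a list of player ids and `beats` the same hypothetical win
    table as above; ties in win count are broken arbitrarily here.
    """
    wins = {p: sum(1 for q in players if q != p and beats[p][q]) for p in players}
    return sorted(players, key=lambda p: wins[p])  # fewest wins first
```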

An even better approach is to realize that the NP hard problem may not be the real problem. The real problem may be finding a good approximation to the “true” intrinsic ordering given noisy tournament information.

In a learning setting, the simplest form of ranking problem is “bipartite ranking” where every element has a value of either 0 or 1 and we want to order 0s before 1s. A common way to measure the performance of bipartite ranking is according to “area under the ROC curve” (AUC) = 1 – the fraction of out-of-order pairs. Nina, Alina, Greg and I proved that if we learn a comparison function which errs on k dissimilar pairs, then ordering according to the number of wins yields an order within 4k edge reversals of the original ordering. As a reduction statement(*), this shows that an error rate of e for a learned pairwise binary classifier produces an ordering with an expected AUC of 1 – 4e. The same inequality even holds for a (stronger) regret transform. If r = e – emin is the regret of the binary pairwise classifier, then the AUC regret is bounded by 4r. (Here emin is the error rate of the best possible classifier which predicts knowing the most likely outcome.) The regret result extends to more general measures of ordering than simply AUC.
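
For concreteness, here is a small sketch of the AUC measure as defined above, computed directly as one minus the fraction of out-of-order dissimilar pairs. Giving ties half credit is a common convention I am assuming, not something stated in the post; the function and variable names are illustrative.

```python
def auc(scores, labels):
    """AUC = 1 - fraction of (label-1, label-0) pairs ordered incorrectly.

    `scores` and `labels` are parallel lists; labels are in {0, 1} and both
    classes are assumed to be present. Ties get half credit.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    out_of_order = sum(1 for p in positives for n in negatives if p < n)
    ties = sum(1 for p in positives for n in negatives if p == n)
    total = len(positives) * len(negatives)
    return 1.0 - (out_of_order + 0.5 * ties) / total
```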

We were unable to find any examples where ordering according to the degree produced more than a 2r AUC regret. Nikhil Bansal, Don, and Greg have worked out a tightened proof which gives exactly this upper bound. At the end of the day, we have an algorithm which satisfies precisely the same guarantee as the NP-hard solution.

There are two lessons here. The learning lesson is that a good pairwise comparator implies the ability to rank well according to AUC. The general research lesson is that an NP hard problem for an approximate solution is not an intrinsic obstacle. Sometimes there exist simple tractable algorithms which satisfy the same guarantees as the NP hard solution.

(*) To prove the reduction, you must make sure that you form pairwise examples in the right way. Your source of pairwise ordering examples must be uniform over the dissimilar pairs containing one example with label 1 and one example with label 0.
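
A minimal sketch of that sampling scheme, assuming the training data arrives as (features, label) pairs with labels in {0, 1}; the function name and the coin flip over presentation order are my own illustrative choices.

```python
import random

def sample_pairwise_example(examples):
    """Draw one pairwise ordering example uniformly over the dissimilar pairs.

    `examples` is a list of (features, label) pairs with labels in {0, 1};
    both classes are assumed to be present.
    """
    positives = [x for x, y in examples if y == 1]
    negatives = [x for x, y in examples if y == 0]
    # Uniform over dissimilar pairs = one independent uniform draw from each class.
    pos, neg = random.choice(positives), random.choice(negatives)
    # Randomize presentation order so the pairwise classifier learns to compare
    # in both directions.
    if random.random() < 0.5:
        return (pos, neg), 1  # 1: the first element of the pair has label 1
    return (neg, pos), 0      # 0: the first element of the pair has label 0
```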

Objective and subjective interpretations of probability

An amusing tidbit (reproduced without permission) from Herman Chernoff’s delightful monograph, “Sequential analysis and optimal design”:

The use of randomization raises a philosophical question which is articulated by the following probably apocryphal anecdote.

The metallurgist told his friend the statistician how he planned to test the effect of heat on the strength of a metal bar by sawing the bar into six pieces. The first two would go into the hot oven, the next two into the medium oven, and the last two into the cool oven. The statistician, horrified, explained how he should randomize to avoid the effect of a possible gradient of strength in the metal bar. The method of randomization was applied, and it turned out that the randomized experiment called for putting the first two pieces into the hot oven, the next two into the medium oven, and the last two into the cool oven. “Obviously, we can’t do that,” said the metallurgist. “On the contrary, you have to do that,” said the statistician.

What are arguments for and against this design? In a “larger” design or sample, the effect of a reasonable randomization scheme could be such that this obvious difficulty would almost certainly not happen. Assuming that the original strength of the bar and the heat treatment did not “interact” in a complicated nonlinear way, the randomization would virtually cancel out any effect due to a strength gradient or other erratic phenomena, and computing estimates as though these did not exist would lead to no real error. In this small problem, the effect may not be cancelled out, but the statistician still has a right to close his eyes to the design actually selected if he is satisfied with “playing fair”. That is, if he instructs an agent to select the design and he analyzes the results, assuming there are no gradients, his conclusions will be unbiased in the sense that a tendency to overestimate is balanced on the average by a tendency to underestimate the desired quantities. However, this tendency may be substantial as measured by the variability of the estimates which will be affected by substantial gradients. On the other hand, following the natural inclination to reject an obviously unsatisfactory design resulting from randomization puts the statistician in the position of not “playing fair”. What is worse for an objective statistician, he has no way of evaluating in advance how good his procedure is if he can change the rules in the middle of the experiment.

The Bayesian statistician, who uses subjective probability and must consider all information, is unsatisfied to simply play fair. When randomization leads to the original unsatisfactory design, he is aware of this information and unwilling to accept the design. In general, the religious Bayesian states that no good and only harm can come from randomized experiments. In principle, he is opposed even to random sampling in opinion polling. However, this principle puts him in untenable computational positions, and a pragmatic Bayesian will often ignore what seems useless design information if there are no obvious quirks in a randomly selected sample.

Learning Theory standards for NIPS 2006

Bob Williamson and I are the learning theory PC members at NIPS this year. This is some attempt to state the standards and tests I applied to the papers. I think it is a good idea to talk about this for two reasons:

  1. Making community standards a matter of public record seems healthy. It gives us a chance to debate what is and is not the right standard. It might even give us a bit more consistency across the years.
  2. It may save us all time. A number of the submitted papers just aren’t there yet, and not submitting is the right decision in that case.

There are several criteria for judging a paper. All of these were active this year. Some criteria are uncontroversial, while others may not be.

  1. The paper must have a theorem establishing something new for which it is possible to derive high confidence in the correctness of the results. A surprising number of papers fail this test. This criterion seems essential to the definition of “theory”. Common ways to fail it include:
    1. Missing theorem statement
    2. Missing proof. This isn’t an automatic fail, because sometimes reviewers can be expected to fill in the proof from discussion. (Not all theorems are hard.) Similarly, sometimes a proof sketch is adequate. Providing the right amount of detail to give confidence in the results is tricky, but the general advice is: err on the side of being explicit.
    3. Imprecise theorem statement. A number of theorems are simply too imprecise to verify or imagine verifying. Typically they are written in English or mixed math/English and have words like “small”, “very small”, or “itsy bitsy”.
    4. Typos and thinkos. Often a theorem statement or proof is “right” when expressed correctly, but it isn’t expressed correctly: typos and thinkos (little correctable bugs in how you were thinking) confuse the reader.
    5. Not new. This may be controversial, because the test of ‘new’ is stronger than some people might expect. A theorem of the form “algorithm A can do B” is not new when we already know “algorithm C can do B”.

    Some of these problems are sometimes fixed by smart reviewers. Where that happens, it’s fine. Sometimes a paper has a reasonable chance of passing evaluation as an algorithms paper (which has experimental requirements). Where that happens, it’s fine.

  2. The paper should plausibly lead to algorithmic implications. This test was applied with varying strength. For an older mathematical model of learning, we tried to apply it at the level of “I see how an algorithm might be developed from this insight”. For a new model of learning, this test was applied only weakly.
  3. We did not require that the paper be about machine learning. For non-learning papers, we decided to defer to the judgement of referees on whether or not the results were relevant to NIPS. It seemed more natural that authors/reviewers be setting the agenda here.
  4. I had a preference for papers presenting new mathematical models. I liked Neil Lawrence’s comment: “If we started rejecting learning theory papers for having the wrong model, where would we stop?” There is a natural tendency to forget the drawbacks of the accepted models in machine learning when evaluating new models, so it seems appropriate to provide some encouragement towards exploration.
  5. Papers were not penalized for having experiments. Sometimes experiments helped (especially when the theory was weak), and sometimes they had no effect.

Reviewing is a difficult process—it’s very difficult to get 82 (the number this time) important decisions right. It’s my hope that more such decisions can be made right in the future, so I’d like to invite comments on what the right criteria are and why. This year’s decisions are made now (and will be released soon), so any suggestions will just influence the future.

Report of MLSS 2006 Taipei

The 2006 Machine Learning Summer School in Taipei, Taiwan ended on August 4, 2006. It was a very exciting two weeks for a record crowd of 245 participants (including speakers and organizers) from 18 countries. We had a lineup of speakers that is hard to match for other similar events (see our WIKI for more information). With this lineup, it was difficult for us as organizers to screw it up too badly. Also, since we have pretty good infrastructure for international meetings and experienced staff at NTUST and Academia Sinica, plus the reputation established by previous MLSS series, it was relatively easy for us to attract registrations and simply enjoy this two-week-long party of machine learning.

At the end of MLSS we distributed a survey form for participants to fill in. I will report what we found from this survey, together with the registration data and word of mouth from participants.

The first question was designed to find out how our participants learned about MLSS 2006 Taipei. Unfortunately, most of the participants learned about MLSS from their advisors, and it is difficult for us to track how their advisors learned about it. But it appears that posters at ICASSP (related to ML, though not directly) worked better than at NIPS (directly related to ML).

Questions 2 to 6 asked participants about their demographic background. They are mostly graduate students working in one of the application fields of machine learning (e.g., bioinformatics, multimedia, natural language processing). Asked why they attended MLSS, as expected, about 2/3 replied that they wanted to use ML and 1/3 replied that they wanted to do ML research. Most participants attended all talks, which is consistent with our records. The attendance rate was kept at about 80 percent every day, a remarkable record for both speakers and participants. Asked what made it difficult for them to understand the talks, about half replied mathematics, about a quarter replied “no examples”, and less than a quarter replied English. Finally, all talk topics were mentioned as being helpful by our participants, especially those of a more introductory nature, such as graphical models by Sam Roweis, SVMs by Chih-Jen Lin, and boosting by Gunnar Ratsch, while talks with many theorems and proofs were less popular.

We let participants provide their suggestions to us. An issue that was brought up very often is that many wanted lecture notes to be distributed in advance, maybe a day before the talks, or better yet, before MLSS started. One participant suggested we put the prerequisite math background on the Web so that they could have come better prepared (but that may scare away many people, which is not good for organizers…). A quick fix for this problem is to provide Web pointers to previous MLSS slides and videos and urge registered participants to take a look at them in advance to prepare themselves before attending MLSS.

Another frequently brought-up issue is that they would like to hear more concrete examples, have a chance to do some exercises, and learn more about the applications of the algorithms and analysis results given in the talks. This is reasonable given that 2/3 of our participants’ goal was learning how to use ML. So Manfred decided to adapt to their needs “online” by modifying his talk slides overnight (thanks!).

Our participants would also have liked the organizers to design more activities to encourage interaction with speakers and among participants. I think we could have done a better job here. We could have had our speakers mingle with participants more often rather than staying in a cozy VIP lounge. We could also have provided online and physical chat boards for participants to share their contact information.

We scheduled our talks mostly based on the time constraints given by the speakers. Roughly, graphical models/learning reductions came first, SVMs and kernels next, then online/boosting/bounds, and finally clustering. It turned out that our speakers were so good that they covered and adapted to each other’s related talks and made the entire program appear like a carefully designed, coherent one. So most participants liked the program, and only one complaint was about this part.

Putting all of the comments together, I think we have two clusters of participants that are hard to please at the same time. One cluster is looking for new research topics in ML or trying to deepen their understanding of some advanced topics in ML. If MLSS is designed for them, speakers can present their latest or even ongoing research results, taking the chance to spread their ideas in the hope that they will be further explored by more researchers. The other cluster is new to ML. They might be trying to learn how to use ML to solve their problem, or just trying to get a better idea of what ML is all about. For them, speakers need to present more examples, show applications, and present mature results. It is still possible to please both. We tried to balance their needs by plugging in research talks every afternoon. Again our speakers did a great job here and it seemed to work pretty well. We also designed a graduate credit program to give registered students a preview and the prerequisite math background. Unfortunately, due to a long bureaucratic process, we could not announce it early enough to accommodate more students, and the program did not accept international students. I think we could have done a better job helping our participants understand the nature of the summer school and come prepared.

Finally, on behalf of the steering committee, I would like to take this chance to thank Alex Smola, Bernhard Scholkopf and John Langford for their help in putting together this excellent lineup of speakers and for the positive examples they established in previous MLSS series for us to learn from.

Precision is not accuracy

In my experience, there are two different groups of people who believe the same thing: the mathematics encountered in typical machine learning conference papers is often of questionable value.
The two groups who agree on this are applied machine learning people who have given up on math, and mature theoreticians who understand the limits of theory.

Partly, this is just a statement about where we are with respect to machine learning. In particular, we have no mechanism capable of generating a prescription for how to solve all learning problems. In the absence of such certainty, people try to come up with formalisms that partially describe and motivate how and why they do things. This is natural and healthy—we might hope that it will eventually lead to just such a mechanism.

But, part of this is simply an emphasis on complexity over clarity. A very natural and simple theoretical statement is often obscured by complexifications. Common sources of complexification include:

  1. Generalization By trying to make a statement that applies in the most general possible setting, your theorem becomes excessively hard to read.
  2. Specialization Your theorem relies upon so many assumptions that it is hard for a simple reader to hold them all in their head.
  3. Obscuration Your theorem relies upon cumbersome notation full of subsubsuperscripts, badly named variables, etc…

There are several reasons why complexification occurs.

  1. Excessive generalization often happens when authors have an idea and want to completely exploit it. So, various bells and whistles are added until the core idea is obscured.
  2. Excessive specialization often happens when authors have some algorithm they really want to prove works. So, they start making one assumption after another until the proof goes through.
  3. Obscuration is far more subtle than it sounds. Some of the worst obscurations come from using an old standard notation which has simply been pushed too far.

After doing research for a while, you realize that these complexifications are counterproductive. Type (1) complexifications make it doubly hard for others to do follow-on work: your paper is hard to read and you have eliminated the possibility of easy extensions. Type (2) complexifications look like “the tail wags the dog”—the math isn’t really working until it guides the algorithm design. Figuring out how to remove the assumptions often results in a better algorithm. Type (3) complexifications are an error. Fooling around to find a better notation is one of the ways that we sometimes make new discoveries.

The worst reason, I’ve saved for last: it’s that the reviewing process emphasizes precision over accuracy. Imagine shooting a math gun at a machine learning target. A high precision math gun will very carefully guide the bullets to strike a fixed location—even though the location may have little to do with the target. An accurate math gun will point at the correct target. A precision/accuracy tradeoff is often encountered: we don’t know how to think about the actual machine learning problem, so instead we very precisely think about another not-quite-right problem. A reviewer almost invariably prefers the more precise (but less accurate) paper because precision is the easy thing to check and think about.

There seems to be no easy fix for this—people naturally prefer to value the things they can measure. The hard fix for this is more time spent by everyone thinking about what the real machine learning problems are.