How to solve an NP hard problem in quadratic time

This title is a lie, but it is a special lie which has a bit of truth.

If n players each play each other, you have a tournament. How do you order the players from weakest to strongest?

The standard first attempt is “find the ordering which agrees with the tournament on as many player pairs as possible”. This is called the “minimum feedback arc set” problem in the CS theory literature, and it is a well-known NP-hard problem. A basic guarantee holds for the solution to this problem: if there is some “true” intrinsic ordering, and the outcome of the tournament disagrees with it k times (due to noise, for instance), then the output ordering will disagree with the original ordering on at most 2k edges (and no solution can do better).
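
To make the objective concrete, here is a minimal sketch (mine, not from any of the work mentioned) of the quantity that minimum feedback arc set minimizes: the number of player pairs on which a candidate ordering contradicts the tournament. The `beats` table and the weakest-first ordering convention are illustrative assumptions.

```python
def count_disagreements(ordering, beats):
    """Count pairs where a weakest-to-strongest ordering contradicts the tournament.

    `beats[i][j]` is a hypothetical boolean table meaning "player i beat player j"
    in the round-robin tournament.
    """
    disagreements = 0
    for i in range(len(ordering)):
        for j in range(i + 1, len(ordering)):
            weaker, stronger = ordering[i], ordering[j]
            # The ordering claims `stronger` is the better player, so this pair
            # is a disagreement (a feedback arc) whenever the weaker player won.
            if beats[weaker][stronger]:
                disagreements += 1
    return disagreements
```

The NP-hard problem is to minimize this count over all possible orderings.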

One standard approach to tractably solving an NP-hard problem is to find another algorithm with an approximation guarantee. For example, Don Coppersmith, Lisa Fleischer and Atri Rudra proved that ordering players according to the number of wins is a 5-approximation to the NP-hard problem.
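
Here is a sketch of how simple that approximation is, reusing the hypothetical `beats` table from the sketch above. Tallying wins over all pairs costs O(n^2) comparisons, which is where the title’s “quadratic time” comes from.

```python
def rank_by_wins(players, beats):
    """Order players from weakest to strongest by their number of wins.

    `players` is a list of player ids and `beats` the same hypothetical win
    table as above; ties in win count are broken arbitrarily here.
    """
    wins = {p: sum(1 for q in players if q != p and beats[p][q]) for p in players}
    return sorted(players, key=lambda p: wins[p])  # fewest wins first
```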

An even better approach is to realize that the NP hard problem may not be the real problem. The real problem may be finding a good approximation to the “true” intrinsic ordering given noisy tournament information.

In a learning setting, the simplest form of ranking problem is “bipartite ranking” where every element has a value of either 0 or 1 and we want to order 0s before 1s. A common way to measure the performance of bipartite ranking is according to “area under the ROC curve” (AUC) = 1 – the fraction of out-of-order pairs. Nina, Alina, Greg and I proved that if we learn a comparison function which errs on k dissimilar pairs, then ordering according to the number of wins yields an order within 4k edge reversals of the original ordering. As a reduction statement(*), this shows that an error rate of e for a learned pairwise binary classifier produces an ordering with an expected AUC of 1 – 4e. The same inequality even holds for a (stronger) regret transform. If r = e – emin is the regret of the binary pairwise classifier, then the AUC regret is bounded by 4r. (Here emin is the error rate of the best possible classifier which predicts knowing the most likely outcome.) The regret result extends to more general measures of ordering than simply AUC.
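
For concreteness, here is a small sketch of the AUC measure as defined above, computed directly as one minus the fraction of out-of-order dissimilar pairs. Giving ties half credit is a common convention I am assuming, not something stated in the post; the function and variable names are illustrative.

```python
def auc(scores, labels):
    """AUC = 1 - fraction of (label-1, label-0) pairs ordered incorrectly.

    `scores` and `labels` are parallel lists; labels are in {0, 1} and both
    classes are assumed to be present. Ties get half credit.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    out_of_order = sum(1 for p in positives for n in negatives if p < n)
    ties = sum(1 for p in positives for n in negatives if p == n)
    total = len(positives) * len(negatives)
    return 1.0 - (out_of_order + 0.5 * ties) / total
```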

We were unable to find any examples where ordering according to the degree produced more than a 2r AUC regret. Nikhil Bansal, Don, and Greg have worked out a tightened proof which gives exactly this upper bound. At the end of the day, we have an algorithm which satisfies precisely the same guarantee as the NP-hard solution.

There are two lessons here. The learning lesson is that a good pairwise comparator implies the ability to rank well according to AUC. The general research lesson is that an NP hard problem for an approximate solution is not an intrinsic obstacle. Sometimes there exist simple tractable algorithms which satisfy the same guarantees as the NP hard solution.

(*) To prove the reduction, you must make sure that you form pairwise examples in the right way. Your source of pairwise ordering examples must be uniform over the dissimilar pairs containing one example with label 1 and one example with label 0.
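
A minimal sketch of that sampling scheme, assuming the training data arrives as (features, label) pairs with labels in {0, 1}; the function name and the coin flip over presentation order are my own illustrative choices.

```python
import random

def sample_pairwise_example(examples):
    """Draw one pairwise ordering example uniformly over the dissimilar pairs.

    `examples` is a list of (features, label) pairs with labels in {0, 1};
    both classes are assumed to be present.
    """
    positives = [x for x, y in examples if y == 1]
    negatives = [x for x, y in examples if y == 0]
    # Uniform over dissimilar pairs = one independent uniform draw from each class.
    pos, neg = random.choice(positives), random.choice(negatives)
    # Randomize presentation order so the pairwise classifier learns to compare
    # in both directions.
    if random.random() < 0.5:
        return (pos, neg), 1  # 1: the first element of the pair has label 1
    return (neg, pos), 0      # 0: the first element of the pair has label 0
```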

Objective and subjective interpretations of probability

An amusing tidbit (reproduced without permission) from Herman Chernoff’s delightful monograph, “Sequential analysis and optimal design”:

The use of randomization raises a philosophical question which is articulated by the following probably apocryphal anecdote.

The metallurgist told his friend the statistician how he planned to test the effect of heat on the strength of a metal bar by sawing the bar into six pieces. The first two would go into the hot oven, the next two into the medium oven, and the last two into the cool oven. The statistician, horrified, explained how he should randomize to avoid the effect of a possible gradient of strength in the metal bar. The method of randomization was applied, and it turned out that the randomized experiment called for putting the first two pieces into the hot oven, the next two into the medium oven, and the last two into the cool oven. “Obviously, we can’t do that,” said the metallurgist. “On the contrary, you have to do that,” said the statistician.

What are arguments for and against this design? In a “larger” design or sample, the effect of a reasonable randomization scheme could be such that this obvious difficulty would almost certainly not happen. Assuming that the original strength of the bar and the heat treatment did not “interact” in a complicated nonlinear way, the randomization would virtually cancel out any effect due to a strength gradient or other erratic phenomena, and computing estimates as though these did not exist would lead to no real error. In this small problem, the effect may not be cancelled out, but the statistician still has a right to close his eyes to the design actually selected if he is satisfied with “playing fair”. That is, if he instructs an agent to select the design and he analyzes the results, assuming there are no gradients, his conclusions will be unbiased in the sense that a tendency to overestimate is balanced on the average by a tendency to underestimate the desired quantities. However, this tendency may be substantial as measured by the variability of the estimates which will be affected by substantial gradients. On the other hand, following the natural inclination to reject an obviously unsatisfactory design resulting from randomization puts the statistician in the position of not “playing fair”. What is worse for an objective statistician, he has no way of evaluating in advance how good his procedure is if he can change the rules in the middle of the experiment.

The Bayesian statistician, who uses subjective probability and must consider all information, is unsatisfied to simply play fair. When randomization leads to the original unsatisfactory design, he is aware of this information and unwilling to accept the design. In general, the religious Bayesian states that no good and only harm can come from randomized experiments. In principle, he is opposed even to random sampling in opinion polling. However, this principle puts him in untenable computational positions, and a pragmatic Bayesian will often ignore what seems useless design information if there are no obvious quirks in a randomly selected sample.

Learning Theory standards for NIPS 2006

Bob Williamson and I are the learning theory PC members at NIPS this year. This is some attempt to state the standards and tests I applied to the papers. I think it is a good idea to talk about this for two reasons:

  1. Making community standards a matter of public record seems healthy. It gives us a chance to debate what is and is not the right standard. It might even give us a bit more consistency across the years.
  2. It may save us all time. A number of the submitted papers just aren’t there yet, and not submitting is the right decision in that case.

There are several criteria for judging a paper. All of these were active this year. Some criteria are uncontroversial, while others may not be.

  1. The paper must have a theorem establishing something new for which it is possible to derive high confidence in the correctness of the results. A surprising number of papers fail this test. This criterion seems essential to the definition of “theory”. Common ways to fail it include:
    1. Missing theorem statement
    2. Missing proof. This isn’t an automatic fail, because sometimes reviewers can be expected to fill in the proof from discussion. (Not all theorems are hard.) Similarly, sometimes a proof sketch is adequate. Providing the right amount of detail to give confidence in the results is tricky, but the general advice is: err on the side of being explicit.
    3. Imprecise theorem statement. A number of theorems are simply too imprecise to verify or imagine verifying. Typically they are written in English or mixed math/English and have words like “small”, “very small”, or “itsy bitsy”.
    4. Typos and thinkos. Often a theorem statement or proof is “right” when expressed correctly, but it isn’t expressed correctly: typos and thinkos (little correctable bugs in how you were thinking) confuse the reader.
    5. Not new. This may be controversial, because the test of ‘new’ is stronger than some people might expect. A theorem of the form “algorithm A can do B” is not new when we already know “algorithm C can do B”.

    Some of these problems are sometimes fixed by smart reviewers. Where that happens, it’s fine. Sometimes a paper has a reasonable chance of passing evaluation as an algorithms paper (which has experimental requirements). Where that happens, it’s fine.

  2. The paper should plausibly lead to algorithmic implications. This test was applied with varying strength. For an older mathematical model of learning, we tried to apply it at the level of “I see how an algorithm might be developed from this insight”. For a new model of learning, this test was applied only weakly.
  3. We did not require that the paper be about machine learning. For non-learning papers, we decided to defer to the judgement of referees on whether or not the results were relevant to NIPS. It seemed more natural that authors/reviewers be setting the agenda here.
  4. I had a preference for papers presenting new mathematical models. I liked Neil Lawrence’s comment: “If we started rejecting learning theory papers for having the wrong model, where would we stop?” There is a natural tendency to forget the drawbacks of the accepted models in machine learning when evaluating new models, so it seems appropriate to provide some encouragement towards exploration.
  5. Papers were not penalized for having experiments. Sometimes experiments helped (especially when the theory was weak), and sometimes they had no effect.

Reviewing is a difficult process—it’s very difficult to get 82 (the number this time) important decisions right. It’s my hope that more such decisions can be made right in the future, so I’d like to invite comments on what the right criteria are and why. This year’s decisions are made now (and will be released soon), so any suggestions will just influence the future.

Report of MLSS 2006 Taipei

The 2006 Machine Learning Summer School in Taipei, Taiwan ended on August 4, 2006. It was a very exciting two weeks for a record crowd of 245 participants (including speakers and organizers) from 18 countries. We had a lineup of speakers that is hard to match for other similar events (see our WIKI for more information). With this lineup, it was difficult for us as organizers to screw it up too badly. Also, since we have pretty good infrastructure for international meetings and experienced staff at NTUST and Academia Sinica, plus the reputation established by previous MLSS series, it was relatively easy for us to attract registrations and simply enjoy this two-week-long party of machine learning.

At the end of MLSS we distributed a survey form for participants to fill in. I will report what we found from this survey, together with the registration data and word of mouth from participants.

The first question was designed to find out how our participants learned about MLSS 2006 Taipei. Unfortunately, most of the participants learned about MLSS from their advisors, and it is difficult for us to track how their advisors learned about it. But it appears that posters at ICASSP (related to ML, though not directly) worked better than at NIPS (directly related to ML).

Questions 2 to 6 asked participants about their demographic background. They are mostly graduate students working in one of the application fields of machine learning (e.g., bioinformatics, multimedia, natural language processing). Asked why they attended MLSS, as expected, about 2/3 replied that they wanted to use ML and 1/3 replied that they wanted to do ML research. Most participants attended all talks, which is consistent with our records. The attendance rate was kept at about 80 percent every day, a remarkable record for both speakers and participants. Asked what made it difficult for them to understand the talks, about half replied mathematics, about a quarter replied “no examples”, and less than a quarter replied English. Finally, all talk topics were mentioned as being helpful by our participants, especially those of a more introductory nature, such as graphical models by Sam Roweis, SVMs by Chih-Jen Lin, and boosting by Gunnar Ratsch, while talks with many theorems and proofs were less popular.

We let participants provide their suggestions to us. An issue that was brought up very often is that many wanted lecture notes to be distributed in advance, maybe a day before the talks, or better yet, before MLSS started. One participant suggested we put the prerequisite math background on the Web so that they could have come better prepared (but that may scare away many people, which is not good for organizers…). A quick fix for this problem is to provide Web pointers to previous MLSS slides and videos and urge registered participants to take a look at them in advance to prepare themselves before attending MLSS.

Another frequently brought-up issue is that they would like to hear more concrete examples, have a chance to do some exercises, and learn more about the applications of the algorithms and analysis results given in the talks. This is reasonable given that 2/3 of our participants’ goal was learning how to use ML. So Manfred decided to adapt to their needs “online” by modifying his talk slides overnight (thanks!).

Our participants would also have liked the organizers to design more activities to encourage interaction with speakers and among participants. I think we could have done a better job here. We could have had our speakers mingle with participants more often rather than staying in a cozy VIP lounge. We could also have provided online and physical chat boards for participants to share their contact information.

We scheduled our talks mostly based on the time constraints given by the speakers. Roughly, graphical models/learning reductions came first, SVMs and kernels next, then online/boosting/bounds, and finally clustering. It turned out that our speakers were so good that they covered and adapted to each other’s related talks and made the entire program appear like a carefully designed, coherent one. So most participants liked the program, and only one complaint was about this part.

Putting all of the comments together, I think we have two clusters of participants that are hard to please at the same time. One cluster is looking for new research topics in ML or trying to deepen their understanding of some advanced topics in ML. If MLSS is designed for them, speakers can present their latest or even ongoing research results, taking the chance to spread their ideas in the hope that they will be further explored by more researchers. The other cluster is new to ML. They might be trying to learn how to use ML to solve their problem, or just trying to get a better idea of what ML is all about. For them, speakers need to present more examples, show applications, and present mature results. It is still possible to please both. We tried to balance their needs by plugging in research talks every afternoon. Again our speakers did a great job here and it seemed to work pretty well. We also designed a graduate credit program to give registered students a preview and the prerequisite math background. Unfortunately, due to a long bureaucratic process, we could not announce it early enough to accommodate more students, and the program did not accept international students. I think we could have done a better job helping our participants understand the nature of the summer school and come prepared.

Finally, on behalf of the steering committee, I would like to take this chance to thank Alex Smola, Bernhard Scholkopf and John Langford for their help in putting together this excellent lineup of speakers and for the positive examples they established in previous MLSS series for us to learn from.

Precision is not accuracy

In my experience, there are two different groups of people who believe the same thing: the mathematics encountered in typical machine learning conference papers is often of questionable value.
The two groups who agree on this are applied machine learning people who have given up on math, and mature theoreticians who understand the limits of theory.

Partly, this is just a statement about where we are with respect to machine learning. In particular, we have no mechanism capable of generating a prescription for how to solve all learning problems. In the absence of such certainty, people try to come up with formalisms that partially describe and motivate how and why they do things. This is natural and healthy—we might hope that it will eventually lead to just such a mechanism.

But, part of this is simply an emphasis on complexity over clarity. A very natural and simple theoretical statement is often obscured by complexifications. Common sources of complexification include:

  1. Generalization By trying to make a statement that applies in the most general possible setting, your theorem becomes excessively hard to read.
  2. Specialization Your theorem relies upon so many assumptions that it is hard for a simple reader to hold them all in their head.
  3. Obscuration Your theorem relies upon cumbersome notation full of subsubsuperscripts, badly named variables, etc…

There are several reasons why complexification occurs.

  1. Excessive generalization often happens when authors have an idea and want to completely exploit it. So, various bells and whistles are added until the core idea is obscured.
  2. Excessive specialization often happens when authors have some algorithm they really want to prove works. So, they start making one assumption after another until the proof goes through.
  3. Obscuration is far more subtle than it sounds. Some of the worst obscurations come from using an old standard notation which has simply been pushed too far.

After doing research for a while, you realize that these complexifications are counterproductive. Type (1) complexifications make it doubly hard for others to do follow-on work: your paper is hard to read and you have eliminated the possibility of easy extensions. Type (2) complexifications look like “the tail wags the dog”—the math isn’t really working until it guides the algorithm design. Figuring out how to remove the assumptions often results in a better algorithm. Type (3) complexifications are an error. Fooling around to find a better notation is one of the ways that we sometimes make new discoveries.

The worst reason, I’ve saved for last: it’s that the reviewing process emphasizes precision over accuracy. Imagine shooting a math gun at a machine learning target. A high precision math gun will very carefully guide the bullets to strike a fixed location—even though the location may have little to do with the target. An accurate math gun will point at the correct target. A precision/accuracy tradeoff is often encountered: we don’t know how to think about the actual machine learning problem, so instead we very precisely think about another not-quite-right problem. A reviewer almost invariably prefers the more precise (but less accurate) paper because precision is the easy thing to check and think about.

There seems to be no easy fix for this—people naturally prefer to value the things they can measure. The hard fix for this is more time spent by everyone thinking about what the real machine learning problems are.