Incentive Compatible Reviewing

Reviewing is a fairly formal process which is integral to the way academia is run. Given this integral nature, the quality of reviewing is often frustrating. I’ve seen plenty of examples of false statements, misbeliefs, reading what isn’t written, etc…, and I’m sure many other people have as well.

Recently, mechanisms like double blind review and author feedback have been introduced to try to make the process more fair and accurate in many machine learning (and related) conferences. My personal experience is that these mechanisms help, especially the author feedback. Nevertheless, some problems remain.

The game theory take on reviewing is that the incentive for truthful reviewing isn’t there. Since reviewers are also authors, there are sometimes perverse incentives created and acted upon. (Incidentially, these incentives can be both positive and negative.)

Setting up a truthful reviewing system is tricky because their is no final reference truth available in any acceptable (say: subyear) timespan. There are several ways we could try to get around this.

  1. We could try to engineer new mechanisms for finding a reference truth into a conference and then use a ‘proper scoring rule’ which is incentive compatible. For example, we could have a survey where conference participants short list the papers which interested them. There are significant problems here:
    1. Conference presentations mostly function as announcements of results. Consequently, the understanding of the paper at the conference is not nearly as deep as, say, after reading through it carefully in a reading group.
    2. This is inherently useless for judging reviews of rejected papers and it is highly biased for judging reviews of papers presented in two different formats (say, a poster versus an oral presentation).
  2. We could ignore the time issue and try to measure reviewer performance based upon (say) long term citation count. Aside from the bias problems above, there is also a huge problem associated with turnover. Who the reviewers are and how an individual reviewer reviews may change drastically in just a 5 year timespan. A system which can provide track records for only a small subset of current reviewers isn’t very capable.
  3. We could try to manufacture an incentive compatible system even when the truth is never known. This paper by Nolan Miller, Paul Resnick, and Richard Zeckhauser discusses the feasibility of this approach. Essentially, the scheme works by rewarding reviewer i according to a proper scoring rule applied to P(reviewer j’s score | reviewer i’s score). (A simple example of a proper scoring rule is log[P()].) This is approach is pretty fresh, so there are lots of problems, some of which may or may not be fundamental difficulties for application in practice. The significant problem I see is that this mechanism may reward joint agreement instead of a good contribution towards good joint decision making.

None of these mechanisms are perfect, but they may each yield a little bit of information about what was or was not a good decision over time. Combining these sources of information to create some reviewer judgement system may yield another small improvement in the reviewing process.

The important thing to remember is that we are the reviewers as well as the authors. Are we interested in tracking our reviewing performance over time in order to make better judgements? Such tracking often happens on an anecdotal or personal basis, but shifting to an automated incentive compatible system would be a big change in scope.

How to solve an NP hard problem in quadratic time

This title is a lie, but it is a special lie which has a bit of truth.

If n players each play each other, you have a tournament. How do you order the players from weakest to strongest?

The standard first attempt is “find the ordering which agrees with the tournament on as many player pairs as possible”. This is called the “minimum feedback arcset” problem in the CS theory literature and it is a well known NP-hard problem. A basic guarantee holds for the solution to this problem: if there is some “true” intrinsic ordering, and the outcome of the tournament disagrees k times (due to noise for instance), then the output ordering will disagree with the original ordering on at most 2k edges (and no solution can be better).

One standard approach to tractably solving an NP-hard problem is to find another algorithm with an approximation guarantee. For example, Don Coppersmith, Lisa Fleischer and Atri Rudra proved that ordering players according to the number of wins is a 5-approximation to the NP-hard problem.

An even better approach is to realize that the NP hard problem may not be the real problem. The real problem may be finding a good approximation to the “true” intrinsic ordering given noisy tournament information.

In a learning setting, the simplest form of ranking problem is “bipartite ranking” where every element has a value of either 0 or 1 and we want to order 0s before 1s. A common way to measure the performance of bipartite ranking is according to “area under the ROC curve” (AUC) = 1 – the fraction of out-of-order pairs. Nina, Alina, Greg and I proved that if we learn a comparison function which errs on k dissimilar pairs, then ordering according to the number of wins yields an order within 4k edge reversals of the original ordering. As a reduction statement(*), this shows that an error rate of e for a learned pairwise binary classifier produces an ordering with an expected AUC of 1 – 4e. The same inequality even holds for a (stronger) regret transform. If r = e – emin is the regret of the binary pairwise classifier, then the AUC regret is bounded by 4r. (Here emin is the error rate of the best possible classifier which predicts knowing the most likely outcome.) The regret result extends to more general measures of ordering than simply AUC.

We were unable to find any examples where ordering according to the degree produced more than a 2r AUC regret. Nikhil Bansal, Don, and Greg have worked out a tightened proof which gives exactly this upper bound. At the end of the day, we have an algorithm with satisfies precisely the same guarantee as the NP hard solution.

There are two lessons here. The learning lesson is that a good pairwise comparator implies the ability to rank well according to AUC. The general research lesson is that an NP hard problem for an approximate solution is not an intrinsic obstacle. Sometimes there exist simple tractable algorithms which satisfy the same guarantees as the NP hard solution.

(*) To prove the reduction, you must make sure that your form pairwise examples in the right way. Your source of pairwise ordering examples must be uniform over the dissimilar pairs containing one example with label 1 and one example with label 0.

Precision is not accuracy

In my experience, there are two different groups of people who believe the same thing: the mathematics encountered in typical machine learning conference papers is often of questionable value.
The two groups who agree on this are applied machine learning people who have given up on math, and mature theoreticians who understand the limits of theory.

Partly, this is just a statement about where we are with respect to machine learning. In particular, we have no mechanism capable of generating a prescription for how to solve all learning problems. In the absence of such certainty, people try to come up with formalisms that partially describe and motivate how and why they do things. This is natural and healthy—we might hope that it will eventually lead to just such a mechanism.

But, part of this is simply an emphasis on complexity over clarity. A very natural and simple theoretical statement is often obscured by complexifications. Common sources of complexification include:

  1. Generalization By trying to make a statement that applies in the most general possible setting, your theorem becomes excessively hard to read.
  2. Specialization Your theorem relies upon so many assumptions that it is hard for a simple reader to hold them all in their head.
  3. Obscuration Your theorem relies upon cumbersome notation full of subsubsuperscripts, badly named variables, etc…

There are several reasons why complexification occurs.

  1. Excessive generalization often happens when authors have an idea and want to completely exploit it. So, various bells and whistles are added until the core idea is obscured.
  2. Excessive specialization often happens when authors have some algorithm they really want to prove works. So, they start making one assumption after another until the proof goes through.
  3. Obscuration is far more subtle than it sounds. Some of the worst obscurations come from using an old standard notation which has simply been pushed to far.

After doing research for awhile, you realize that these complexifications are counterproductive. Type (1) complexifications make it double hard for others to do follow-on work: your paper is hard to read and you have eliminated the possibility. Type (2) complexifications look like “the tail wags the dog”—the math isn’t really working until it guides the algorithm design. Figuring out how to remove the assumptions often results in the better algoritihm. Type (3) complexifications are an error. Fooling around to find a better notation is one of the ways that we sometimes make new discoveries.

The worst reason, I’ve saved for last: it’s that the reviewing process emphasizes precision over accuracy. Imagine shooting a math gun at a machine learning target. A high precision math gun will very carefully guide the bullets to strike a fixed location—even though the location may have little to do with the target. An accurate math gun will point at the correct target. A precision/accuracy tradeoff is often encountered: we don’t know how to think about the actual machine learning problem, so instead we very precisely think about another not-quite-right problem. A reviewer almost invariably prefers the more precise (but less accurate) paper because precision is the easy thing to check and think about.

There seems to be no easy fix for this—people naturally prefer to value the things they can measure. The hard fix for this is more time spent by everyone thinking about what the real machine learning problems are.

Who is having visa problems reaching US conferences?

Many of the large machine learning conferences were in the US this summer. A common problem which students from abroad encounter is visa issues.
Just getting a visa to visit can be pretty rough: you stand around in lines, sometimes for days. Even worse is the timing with respect to ticket buying. Airplane tickets typically need to be bought well in advance on nonrefundable terms to secure a reasonable rate for air travel. When a visa is denied, as happens reasonably often, a very expensive ticket is burnt.

A serious effort is under way to raise this as in issue in need of fixing. Over the long term, effectively driving research conferences to locate outside of the US seems an unwise policy. Robert Schapire is planning to talk to a congressman. Sally Goldman suggested putting together a list of problem cases, and Phil Long setup an email address immigration.and.confs@gmail.com to collect them.

If you (or someone you know) has had insurmountable difficulties reaching a conference in the US, please send an email with:

Name:
Email address:
Conference:
Difficulty:
Details: (be brief please)

We expect most of the problem cases are students, so don’t be shy.

New Models

How should we, as researchers in machine learning, organize ourselves?

The most immediate measurable objective of computer science research is publishing a paper. The most difficult aspect of publishing a paper is having reviewers accept and recommend it for publication. The simplest mechanism for doing this is to show theoretical progress on some standard, well-known easily understood problem.

In doing this, we often fall into a local minima of the research process. The basic problem in machine learning is that it is very unclear that the mathematical model is the right one for the (or some) real problem. A good mathematical model in machine learning should have one fundamental trait: it should aid the design of effective learning algorithms. To date, our ability to solve interesting learning problems (speech recognition, machine translation, object recognition, etc…) remains limited (although improving), so the “rightness” of our models is in doubt.

If our mathematical models are bad, the simple mechanism of research above can not yield the end goal. (This should be agreed on even by people who disagree about what the end goal of machine learning is!) Instead, research which proposes and investigates new mathematical models for machine learning might yield the end goal. Doing this is hard.

  1. Coming up with a new mathematical model is just plain not easy. Some sources of inspiration include:
    1. Watching carefully: what happens succesfully in practice, can often be abstracted into a mathematical model.
    2. Swapping fields: In other fields (for example crypto), other methods of analysis have been developed. Sometimes, these methods can be transferred.
    3. Model repair: Existing mathematical models often have easily comprehendable failure modes. By thinking about how to avoid such failure modes, we can sometimes produce a new mathematical model.
  2. Speaking about a new model is hard. The difficulty starts with you in explaining it. Often, when trying to converge on a new model, we think of it in terms of the difference with respect to an older model, leading to a tangled explanation. The difficulty continues with other people (in particular: reviewers) reading it. For a reviewer with limited time, it is very tempting to assume that any particular paper is operating in some familiar model and fail out. The best approach here seems to be super explicitness. You can’t be too blunt about saying “this isn’t the model you are thinking about”.
  3. Succeeding with new models is also hard. When people don’t have a reference frame to understand the new model, they are unlikely to follow up, as is necessary for success in academia.

The good news here is that a succesful new model can be a big win. I wish it was an easier win: the barriers to success are formidably high, and it seems we should do everything possible to lower the barriers to success for the sake of improving research.