Reviewing is a fairly formal process which is integral to the way academia is run. Given how central it is, the frequently poor quality of reviewing is frustrating. I’ve seen plenty of examples of false statements, misbeliefs, reading what isn’t written, etc., and I’m sure many other people have as well.
Recently, mechanisms like double blind review and author feedback have been introduced to try to make the process more fair and accurate in many machine learning (and related) conferences. My personal experience is that these mechanisms help, especially the author feedback. Nevertheless, some problems remain.
The game-theoretic take on reviewing is that the incentive for truthful reviewing isn’t there. Since reviewers are also authors, perverse incentives are sometimes created and acted upon. (Incidentally, these incentives can be both positive and negative.)
Setting up a truthful reviewing system is tricky because there is no final reference truth available in any acceptable (say: sub-year) timespan. There are several ways we could try to get around this.
- We could try to engineer new mechanisms for finding a reference truth into a conference and then use a ‘proper scoring rule’ which is incentive compatible. For example, we could have a survey where conference participants short list the papers which interested them. There are significant problems here:
- Conference presentations mostly function as announcements of results. Consequently, the understanding of the paper at the conference is not nearly as deep as, say, after reading through it carefully in a reading group.
- This is inherently useless for judging reviews of rejected papers and it is highly biased for judging reviews of papers presented in two different formats (say, a poster versus an oral presentation).
- We could ignore the time issue and try to measure reviewer performance based upon (say) long term citation count. Aside from the bias problems above, there is also a huge problem associated with turnover. Who the reviewers are and how an individual reviewer reviews may change drastically in just a 5 year timespan. A system which can provide track records for only a small subset of current reviewers isn’t very capable.
- We could try to manufacture an incentive compatible system even when the truth is never known. This paper by Nolan Miller, Paul Resnick, and Richard Zeckhauser discusses the feasibility of this approach. Essentially, the scheme works by rewarding reviewer i according to a proper scoring rule applied to P(reviewer j’s score | reviewer i’s score). (A simple example of a proper scoring rule is log P(outcome); a toy sketch follows this list.) This approach is pretty fresh, so there are lots of problems, some of which may or may not be fundamental difficulties for application in practice. The significant problem I see is that this mechanism may reward joint agreement instead of a good contribution towards good joint decision making.
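To make the peer-prediction idea concrete, here is a minimal toy sketch in Python. The conditional table P(reviewer j’s score | reviewer i’s score), the 1–3 score scale, and the reward function are illustrative assumptions, not details from the Miller, Resnick, and Zeckhauser paper.

```python
import numpy as np

# Toy peer-prediction sketch: reviewer i is rewarded with a proper scoring
# rule (here, the log rule) applied to a predictive distribution over
# reviewer j's score, conditioned on reviewer i's reported score.
# The table below is made up (e.g. imagine it was estimated from past reviews).
# Rows: i's reported score (1..3); columns: j's score (1..3).
P_j_given_i = np.array([
    [0.6, 0.3, 0.1],   # if i reports 1
    [0.2, 0.5, 0.3],   # if i reports 2
    [0.1, 0.3, 0.6],   # if i reports 3
])

def log_score_reward(i_report, j_report):
    """Reward for reviewer i: log P(reviewer j's observed score | i's report)."""
    return np.log(P_j_given_i[i_report - 1, j_report - 1])

# If i honestly expects a positive paper, reporting 3 pays well when j also
# scores it highly, and poorly when j pans it.
print(log_score_reward(3, 3))  # log(0.6), about -0.51
print(log_score_reward(3, 1))  # log(0.1), about -2.30
```

The worry raised above is visible even in this sketch: the reward flows from predicting (and hence matching) the other reviewer’s score, not from contributing an independent judgement.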
None of these mechanisms are perfect, but they may each yield a little bit of information about what was or was not a good decision over time. Combining these sources of information to create some reviewer judgement system may yield another small improvement in the reviewing process.
The important thing to remember is that we are the reviewers as well as the authors. Are we interested in tracking our reviewing performance over time in order to make better judgements? Such tracking often happens on an anecdotal or personal basis, but shifting to an automated incentive compatible system would be a big change in scope.
I like to think about this in terms of costs and benefits. As a reviewer, the cost in time is considerable, perhaps several working days to review a few papers if they are complex. What is the benefit to me? Not much, as far as I’ve seen.
How about a system like this: the paper’s authors pick which of the reviews was the most valuable. These responses are collected and statistically corrected for how positive the review was relative to the average review score for that paper, to remove the effect that authors will naturally prefer favorable reviews. Clearly, if a reviewer gets voted the best reviewer on many papers, including for relatively negative reviews, they must be a good reviewer. The top 10 reviewers could then be given an award for “high quality reviewing” and a free conference registration.
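A rough sketch of how that correction might be computed, assuming hypothetical review records; the down-weighting used here is only one illustrative choice, not a worked-out statistical correction:

```python
from collections import defaultdict

# Each record: (paper, reviewer, score, voted_most_valuable_by_authors).
# All names and numbers are made up for illustration.
reviews = [
    ("paper1", "revA", 8, True),
    ("paper1", "revB", 4, False),
    ("paper2", "revA", 3, False),
    ("paper2", "revC", 5, True),
]

# Average review score per paper.
scores_by_paper = defaultdict(list)
for paper, _, score, _ in reviews:
    scores_by_paper[paper].append(score)
avg_score = {p: sum(s) / len(s) for p, s in scores_by_paper.items()}

# Credit each "most valuable" vote, down-weighting votes for reviews that
# were more positive than the paper's average (so generosity alone doesn't win).
credit = defaultdict(float)
for paper, reviewer, score, voted in reviews:
    if voted:
        positivity = max(score - avg_score[paper], 0.0)
        credit[reviewer] += 1.0 / (1.0 + positivity)

# The top reviewers by corrected credit could receive the award.
for reviewer, c in sorted(credit.items(), key=lambda kv: -kv[1]):
    print(reviewer, round(c, 2))
```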
One very simple idea is to have the reviewer’s name published on the paper, not as a co-author, but still very prominently displayed. A person could then gain a reputation as a recommender of highly-cited papers or upcoming talent. At review time, the reviewers would be anonymous, and could remain (somewhat) anonymous if a reviewer doesn’t want to be associated with the paper. This will certainly have some obvious negative effects, but I think it captures the key essence of the problem of incentives for reviewing: it makes the reviewer a stakeholder in the decision that this paper is worthwhile to present to the community.
The organizers implemented a pretty cool “solution” at ACL this year. At the end-of-the-conference meeting (where the best paper was announced), they announced a list of 10-15 “best reviewers.” These were nominated by the SPC, and my understanding is that it did not require very much additional effort. There wasn’t any prize or anything, but I know that having your name up on that slide was very cool. A lot of people I ran into afterward seemed to think it would make them work harder on reviews if they knew such recognition was possible. (I’m not sure if anyone will do this, but it is something that could presumably make it onto a CV as well.)
I’ve also thought about your option 1… but I think the converse would work better: have people vote on the worst papers and announce who the worst reviewers are on that basis. Sad as it may be, picking out terrible papers is probably more independent of presentation than picking out the “best.” I know there are lots of problems with this, but it’s a thought. Probably not a thought that should actually be implemented, though.
Forgot to mention a funny anecdote wrt “option 3.” At ETS (who do scoring for standardized tests like the SAT), the typical (old) procedure for scoring essays was to have two people read and score each essay from 1-5. If they disagreed by more than 1, a third reader came in to settle it. Because this is incredibly costly, an anti-incentive system was set up: if you disagree with the other reviewer on too many essays, you get some sort of punishment. The result? People started just giving 2, 3, or 4, because then the chances of disagreement drop significantly and they are less likely to be punished.
(Incidentally, now, one of these people is replaced by a machine!)
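A back-of-the-envelope calculation of why hedging toward the middle pays off under that rule; the distribution of the other reader’s scores is an assumption for illustration, not ETS data:

```python
# Assume the other reader's score on a random essay is roughly bell-shaped
# over 1..5 (made-up numbers). A "punishable" disagreement is a gap > 1.
other_reader = {1: 0.05, 2: 0.25, 3: 0.40, 4: 0.25, 5: 0.05}

for my_score in range(1, 6):
    p_punish = sum(p for s, p in other_reader.items() if abs(s - my_score) > 1)
    print(my_score, round(p_punish, 2))

# Prints: 1 -> 0.70, 2 -> 0.30, 3 -> 0.10, 4 -> 0.30, 5 -> 0.70,
# so always reporting 2, 3, or 4 sharply reduces the chance of punishment.
```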
I have often thought about making reviews public (though still anonymous). This would be very helpful to people trying to find good papers outside their area of expertise. Reviewers might also feel pressure to improve the quality of their writing. Are there good reasons for reviews to remain private?
If reviews were public, other meta-review options would become available (e.g., Amazon-like “Was this review helpful?”).
This was a nice topic. Very useful too…
This is an interesting issue. If it is not too impolite, may I mention a paper of my own:
S. Mizzaro. Quality Control in Scholarly Publishing: A New Proposal. Journal of the American Society for Information Science and Technology, 54(11):989–1005, 2003.
which discusses scholarly publishing and peer review, and proposes an alternative approach?
A PDF of the final draft can be downloaded at
This is not a bad idea, but one problem that I see is that the distribution of papers to reviewers is not IID. That is, the tough, big-name papers often are assigned to top people in the field, while borderline papers will typically get relegated to new PC members (or graduate students). The problem is that an excited graduate student who reviews many borderline papers, many of which barely make it into the conference, will have his name plastered all over marginal work. That would be a disincentive to the graduate student (or anyone, for that matter).
He would want to review good papers, bad papers (since they’ll be rejected), or not at all.
… downloaded at http://www.dimi.uniud.it/mizzaro/research/papers/EJ-JASIST.pdf
Indeed, it may unnecessarily restrain the overly cautious, and unnecessarily discredit the overly enthusiastic. On the other hand, that sounds like the side to err on. Furthermore, what’s wrong with wanting to review good papers?
I hear second-hand that Joe Marks is conducting a survey across all of ACM to find out how different conferences address the problems of reviewing and committees. This should be very interesting once the results are in.
Robin Hanson proposes a laboratory economics experiment setup to compare traditional peer review against “information prizes” to see which is better at advancing science. (He proposes using a murder mystery as a proxy for scientific discovery in the experiment.)
One could think that there already is some kind of incentive, as reviewing for good conferences/journals earns academic brownie points. Young researchers usually get the “less interesting” papers, but have a higher motivation to turn in a good review, as they need the brownie points to get positions/promotions. More established researchers probably have lower motivation — I wouldn’t know 🙂 — but get more interesting papers anyway. Ideally this should somehow regulate the reviewing process (and its quality).
I would think that a good part of the responsibility lies with the area/conference chairs: If you are in a position to choose reviewers, do you take into account past reviewing history, or just the perceived prestige/clout in the field?
It is also true that this model would require a greater number of reviewers seeing a greater number of papers. This could alleviate the pressure on getting a good distribution, but of course it has its own problems.
Measuring the performance of a reviewer directly and only through his or her ability to predict the “true” interest the community will have in a paper in the future seems like a dead end to me. First, it is ill-defined, subject to many fallacies, and totally impractical to implement for a number of reasons, as is clear from points 1 and 2 of the post. Secondly, it only focuses on the score, and not on the comments of the reviewer (as is clear from point 3). I think a reviewer does a good job if (s)he:
* significantly helps the decision process for the PC
* provides helpful feedback to the authors.
For both of these points it seems that a reviewer’s carefully written report, with sound and insightful arguments, is actually more valuable than the score per se. At the end of the day, the most sensible road is for there to be reviewer feedback/scoring from the PC. As mentioned above, it does not need to require much additional effort. As for the incentive, various forms of social reward, such as making part of the review/reviewer identity information public as suggested above, could be good. Last year as a reviewer I found an important bug in a NIPS paper (which was accepted after correction) and the author put me in the acknowledgements of the journal version. This made me feel cool (although this might not be enough for everyone 🙂 )
I don’t like picking out the worst papers, for two reasons. The first is that no competent conference-goer wants to spend time looking for the worst paper, so the meaningfulness of the vote is in doubt. The second is that I much prefer to “max the max” rather than “max the min” with respect to papers at a conference. If there are bad papers, that’s ok. What’s really bad is losing the best papers. In my experience, losing the best papers is entirely possible due to controversy or simply the difficulty of explaining a big jump.
Readers may be interested in the report Jon Udell prepared for Los Alamos National Laboratory titled “Internet Groupware for Scientific Collaboration”. The URL is http://207.22.26.166/GroupwareReport.html (or you can Google for the title terms).