When thinking about how best to review papers, it seems helpful to have some conception of what good reviewing is. As far as I can tell, this is almost always only discussed in the specific context of a paper (i.e. your rejected paper), or at most an area (i.e. what a “good paper” looks like for that area) rather than general principles. Neither individual papers or areas are sufficiently general for a large conference—every paper differs in the details, and what if you want to build a new area and/or cross areas?
An unavoidable reason for reviewing is that the community of research is too large. In particular, it is not possible for a researcher to read every paper which someone thinks might be of interest. This reason for reviewing exists independent of constraints on rooms or scheduling formats of individual conferences. Indeed, history suggests that physical constraints are relatively meaningless over the long term — growing conferences simply use more rooms and/or change formats to accommodate the growth.
This suggests that a generic test for paper acceptance should be “Are there a significant number of people who will be interested?” This question could theoretically be answered by sending the paper to every person who might be interested and simply asking them. In practice, this would be an intractable use of people’s time: We must query far fewer people and achieve an approximate answer to this question. Our goal then should be minimizing the approximation error for some fixed amount of reviewing work.
Viewed from this perspective, the first way that things can go wrong is by misassignment of reviewers to papers, for which there are two
easy failure modes available.
- When reviewer/paper assignment is automated based on an affinity graph, the affinity graph may be low quality or the constraint on the maximum number of papers per reviewer can easily leave some papers with low affinity to all reviewers orphaned.
- When reviewer/paper assignments are done by one person, that person may choose reviewers who are all like-minded, simply because
this is the crowd that they know. I’ve seen this happen at the beginning of the reviewing process, but the more insidious case is when it happens at the end, where people are pressed for time and low quality judgements can become common.
An interesting approach for addressing the constraint objective would be optimizing a different objective, such as the product of affinities
rather than the sum. I’ve seen no experimentation of this sort.
For ICML, there are about 3 levels of “reviewer”: the program chair who is responsible for all papers, the area chair who is responsible for organizing reviewing on a subset of papers, and the program committee member/reviewer who has primary responsibility for reviewing. In 2012 tried to avoid these failure modes in a least-system effort way using a blended approach. We used bidding to get a higher quality affinity matrix. We used a constraint system to assign the first reviewer to each paper and two area chairs to each paper. Then, we asked each area chair to find one reviewer for each paper. This obviously dealt with the one-area-chair failure mode. It also helps substantially with low quality assignments from the constrained system since (a) the first reviewer chosen is typically higher quality than the last due to it being the least constrained (b) misassignments to area chairs are diagnosed at the beginning of the process by ACs trying to find reviewers (c) ACs can reach outside of the initial program committee to find reviewers, which existing automated systems can not do.
The next way that reviewing can go wrong is via biased reviewing.
- Author name bias is a famous one. In my experience it is real: well known authors automatically have their paper taken seriously, which particularly matters when time is short. Furthermore, I’ve seen instances where well-known authors can slide by with proof sketches that no one fully understands.
- Review anchoring is a very significant problem if it occurs. This does not happen in the standard review process, because the reviews of others are not visible to other reviewers until they are complete.
- A more subtle form of bias is when one reviewer is simply much louder or charismatic than others. Reviewing without an in-person meeting is actually helpful here, as it reduces this problem substantially.
Reviewing can also be low quality. A primary issue here is time: most reviewers will submit a review within a time constraint, but it may not be high quality due to limits on time. Minimizing average reviewer load is quite important here. Staggered deadlines for reviews are almost certainly also helpful. A more subtle thing is discouraging low quality submissions. My favored approach here is to publish all submissions nonanonymously after some initial period of time.
Another significant issue in reviewer quality is motivation. Making reviewers not anonymous to each other helps with motivation as poor reviews will at least be known to some. Author feedback also helps with motivation, as reviewers know that authors will be able to point out poor reviewing. It is easy to imagine that further improvements in reviewer motivation would be helpful.
A third form of low quality review is based on miscommunication. Maybe there is silly typo in a paper? Maybe something was confusing? Being able to communicate with the author can greatly reduce ambiguities.
The last problem is dictatorship at decision time for which I’ve seen several variants. Sometimes this comes in the form of giving each area chair a budget of papers to “champion”. Sometimes this comes in the form of an area chair deciding to override all reviews and either accept or more likely reject a paper. Sometimes this comes in the form of a program chair doing this as well. The power of dictatorship is often available, but it should not be used: the wiser course is keeping things representative.
At ICML 2012, we tried to deal with this via a defined power approach. When reviewers agreed on the accept/reject decision, that was the decision. If the reviewers disgreed, we asked the two area chairs to make decisions and if they agreed, that was the decision. It was only when the ACs disagreed that the program chairs would become involved in the decision.
The above provides an understanding of how to create a good reviewing process for a large conference. With this in mind, we can consider various proposals at the peer review workshop and elsewhere.
- Double Blind Review. This reduces bias, at the cost of decreasing reviewer motivation. Overall, I think it’s a significant long term positive for a conference as “insiders” naturally become more concerned with review quality and “outsiders” are more prone to submit.
- Better paper/reviewer matching. A pure win, with the only caveat that you should be familiar with failure modes and watch out for them.
- Author feedback. This improves review quality by placing a check on unfair reviews and reducing miscommunication at some cost in time.
- Allowing an appendix or ancillary materials. This allows authors to better communicate complex ideas, at the potential cost of reviewer time. A standard compromise is to make reading an appendix optional for reviewers.
- Open reviews. Open reviews means that people can learn from other reviews, and that authors can respond more naturally than in single round author feedback.
It’s important to note that none of the above are inherently contradictory. This is not necessarily obvious as proponents of open review and double blind review have found themselves in opposition at times. These approaches can be accommodated by simply hiding authors names for a fixed period of 2 months while the initial review process is ongoing.
Representative reviewing seems like the real difficult goal. If a paper is rejected in a representative reviewing process, then perhaps it is just not of sufficient interest. Similarly, if a paper is accepted, then perhaps it is of real and meaningful interest. And if the reviewing process is not representative, then perhaps we should fix the failure modes.
Edit: Crossposted on CACM.