When thinking about how best to review papers, it seems helpful to have some conception of what good reviewing is. As far as I can tell, this is almost always discussed only in the specific context of a paper (e.g. your rejected paper), or at most an area (e.g. what a “good paper” looks like for that area), rather than in terms of general principles. Neither individual papers nor areas are sufficiently general for a large conference: every paper differs in the details, and what if you want to build a new area and/or cross areas?
An unavoidable reason for reviewing is that the community of research is too large. In particular, it is not possible for a researcher to read every paper which someone thinks might be of interest. This reason for reviewing exists independent of constraints on rooms or scheduling formats of individual conferences. Indeed, history suggests that physical constraints are relatively meaningless over the long term — growing conferences simply use more rooms and/or change formats to accommodate the growth.
This suggests that a generic test for paper acceptance should be “Are there a significant number of people who will be interested?” This question could theoretically be answered by sending the paper to every person who might be interested and simply asking them. In practice, this would be an intractable use of people’s time: We must query far fewer people and achieve an approximate answer to this question. Our goal then should be minimizing the approximation error for some fixed amount of reviewing work.
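To get a rough feel for the approximation error, here is a minimal sketch. It assumes, purely for illustration, that each queried reviewer is an independent random draw from the population of potentially interested readers, which the failure modes discussed in this post routinely violate. The function name and parameters are hypothetical.

```python
from math import sqrt


def standard_error(interested_fraction, reviewers_polled):
    """Standard error of a fraction estimated from independent yes/no answers.

    Assumption: reviewers are independent random draws from the audience.
    """
    p = interested_fraction
    return sqrt(p * (1.0 - p) / reviewers_polled)


# With a truly split audience, 3 reviewers give a very noisy estimate,
# and quadrupling the reviewing work only halves the error.
error_with_3 = standard_error(0.5, 3)
error_with_12 = standard_error(0.5, 12)
```

Under this idealized model the error shrinks only with the square root of the reviewing work, so avoiding systematic errors such as misassignment and bias may matter more than adding reviewers.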
Viewed from this perspective, the first way that things can go wrong is by misassignment of reviewers to papers, for which there are two easy failure modes available.
- When reviewer/paper assignment is automated based on an affinity graph, the affinity graph may be low quality or the constraint on the maximum number of papers per reviewer can easily leave some papers with low affinity to all reviewers orphaned.
- When reviewer/paper assignments are done by one person, that person may choose reviewers who are all like-minded, simply because this is the crowd that they know. I’ve seen this happen at the beginning of the reviewing process, but the more insidious case is when it happens at the end, where people are pressed for time and low quality judgements can become common.
An interesting approach for addressing the constraint objective would be optimizing a different objective, such as the product of affinities rather than the sum. I’ve seen no experimentation of this sort.
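As a toy illustration of why the objective matters, the sketch below compares the two objectives by brute force on a made-up 2x2 affinity matrix. The numbers, the function name, and the one-paper-per-reviewer constraint are all assumptions for the example, not a description of any real assignment system.

```python
from itertools import permutations
from math import prod


def best_assignment(affinity, score):
    """Exhaustively search assignments of one paper to each reviewer.

    affinity[r][p] is reviewer r's affinity for paper p (hypothetical values).
    score maps the list of chosen affinities to the number being maximized.
    Returns a list where entry r is the paper given to reviewer r.
    """
    n = len(affinity)
    best = max(
        permutations(range(n)),
        key=lambda perm: score([affinity[r][perm[r]] for r in range(n)]),
    )
    return list(best)


# Reviewer 0 likes both papers; reviewer 1 has almost no affinity for paper 1.
affinity = [
    [1.0, 0.4],
    [0.5, 0.05],
]

by_sum = best_assignment(affinity, sum)       # [0, 1]: paper 1 is left with affinity 0.05
by_product = best_assignment(affinity, prod)  # [1, 0]: no paper is orphaned
```

Maximizing the sum happily trades one orphaned paper for one excellent match, while the product is dragged down sharply by any near-zero affinity. Exhaustive search only works at toy sizes; a real system would need a proper optimization method.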
For ICML, there are about 3 levels of “reviewer”: the program chair who is responsible for all papers, the area chair who is responsible for organizing reviewing on a subset of papers, and the program committee member/reviewer who has primary responsibility for reviewing. In 2012 we tried to avoid these failure modes in a least-system-effort way using a blended approach. We used bidding to get a higher quality affinity matrix. We used a constraint system to assign the first reviewer to each paper and two area chairs to each paper. Then, we asked each area chair to find one reviewer for each paper. This obviously dealt with the one-area-chair failure mode. It also helps substantially with low quality assignments from the constrained system since (a) the first reviewer chosen is typically higher quality than the last due to it being the least constrained, (b) misassignments to area chairs are diagnosed at the beginning of the process by ACs trying to find reviewers, and (c) ACs can reach outside of the initial program committee to find reviewers, which existing automated systems cannot do.
The next way that reviewing can go wrong is via biased reviewing.
- Author name bias is a famous one. In my experience it is real: well known authors automatically have their paper taken seriously, which particularly matters when time is short. Furthermore, I’ve seen instances where well-known authors can slide by with proof sketches that no one fully understands.
- Review anchoring is a very significant problem if it occurs. This does not happen in the standard review process, because the reviews of others are not visible to other reviewers until they are complete.
- A more subtle form of bias is when one reviewer is simply much louder or charismatic than others. Reviewing without an in-person meeting is actually helpful here, as it reduces this problem substantially.
Reviewing can also be low quality. A primary issue here is time: most reviewers will submit a review within a time constraint, but it may not be high quality due to limits on time. Minimizing average reviewer load is quite important here. Staggered deadlines for reviews are almost certainly also helpful. A more subtle thing is discouraging low quality submissions. My favored approach here is to publish all submissions nonanonymously after some initial period of time.
Another significant issue in reviewer quality is motivation. Making reviewers not anonymous to each other helps with motivation as poor reviews will at least be known to some. Author feedback also helps with motivation, as reviewers know that authors will be able to point out poor reviewing. It is easy to imagine that further improvements in reviewer motivation would be helpful.
A third form of low quality review is based on miscommunication. Maybe there is a silly typo in a paper? Maybe something was confusing? Being able to communicate with the author can greatly reduce ambiguities.
The last problem is dictatorship at decision time for which I’ve seen several variants. Sometimes this comes in the form of giving each area chair a budget of papers to “champion”. Sometimes this comes in the form of an area chair deciding to override all reviews and either accept or more likely reject a paper. Sometimes this comes in the form of a program chair doing this as well. The power of dictatorship is often available, but it should not be used: the wiser course is keeping things representative.
At ICML 2012, we tried to deal with this via a defined power approach. When reviewers agreed on the accept/reject decision, that was the decision. If the reviewers disagreed, we asked the two area chairs to make decisions and if they agreed, that was the decision. It was only when the ACs disagreed that the program chairs would become involved in the decision.
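The escalation rule just described can be written down as a short sketch. The function and argument names are hypothetical, votes are simplified to booleans (True meaning accept), and each vote list is assumed to be non-empty.

```python
def decide(reviewer_votes, area_chair_votes, ask_program_chairs):
    """Return (accept, decided_by) following the escalation order above.

    reviewer_votes and area_chair_votes are lists of booleans.
    ask_program_chairs is a callable, so the program chairs are only
    consulted when both lower levels disagree.
    """
    if len(set(reviewer_votes)) == 1:
        return reviewer_votes[0], "reviewers"
    if len(set(area_chair_votes)) == 1:
        return area_chair_votes[0], "area chairs"
    return ask_program_chairs(), "program chairs"


unanimous = decide([True, True, True], [False, False], lambda: False)
split_reviewers = decide([True, False, True], [False, False], lambda: True)
split_everywhere = decide([True, False], [True, False], lambda: True)
```

Note that in the first case the area chairs cannot override unanimous reviewers, which is the point of defining the power at each level.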
The above provides an understanding of how to create a good reviewing process for a large conference. With this in mind, we can consider various proposals at the peer review workshop and elsewhere.
- Double Blind Review. This reduces bias, at the cost of decreasing reviewer motivation. Overall, I think it’s a significant long term positive for a conference as “insiders” naturally become more concerned with review quality and “outsiders” are more prone to submit.
- Better paper/reviewer matching. A pure win, with the only caveat that you should be familiar with failure modes and watch out for them.
- Author feedback. This improves review quality by placing a check on unfair reviews and reducing miscommunication at some cost in time.
- Allowing an appendix or ancillary materials. This allows authors to better communicate complex ideas, at the potential cost of reviewer time. A standard compromise is to make reading an appendix optional for reviewers.
- Open reviews. Open reviews means that people can learn from other reviews, and that authors can respond more naturally than in single round author feedback.
It’s important to note that none of the above are inherently contradictory. This is not necessarily obvious as proponents of open review and double blind review have found themselves in opposition at times. These approaches can be accommodated by simply hiding authors’ names for a fixed period of 2 months while the initial review process is ongoing.
Representative reviewing seems like the real difficult goal. If a paper is rejected in a representative reviewing process, then perhaps it is just not of sufficient interest. Similarly, if a paper is accepted, then perhaps it is of real and meaningful interest. And if the reviewing process is not representative, then perhaps we should fix the failure modes.
Edit: Crossposted on CACM.
GitHub repositories don’t have reviewers assigned. Anyone can open issues, star a repository, contribute, fork it or use it as a library (this is especially easy with SBT).
If papers were stored in GitHub repositories, as LaTeX, Markdown, or whatever, then other papers could cite them (while actually linking to them, the equivalent of using a library), they could be forked and pull requests could be made, contributing to the area, comments could be included in the issues, etc.
So, what is the best way to review papers? I would suggest something similar to GitHub, and if that is not considered a review process, then the best review process is no-review process.
In that case the question would not be what is the best way to review papers, but what is the best way to credit authors, assuming society or a part of it needs crediting authors in a similar way to the h-index, a simple to read number.
I’d say society needs *not* to do that, though. Simplification in this case means information loss, and even misinformation, as well as defining an objective function for researchers that drives the whole research system to maxima according to wrong parameters.
But I guess you know more about machine learning than me, so I hope this could be analysed from that perspective and that I could read about it here soon.
Thank you for the nice blog and keep it up! 😉
What if there is an outsider to the field who works out something good? In a system with review, that good thing may be presented to a large number of interested people. In a system without review, it could easily be lost in the noise of excess information.
Coming from another angle, if you look around the internet we now have reviews for just about any consumer purchasable item on multiple sites so my expectation is that the future involves quite a bit of reviewing.
The core value that reviewing provides independent of publishing and credit, is finding plausibly interesting things. In academia, traditionally publishing, credit, and reviewing were completely intertwined. Publishing is now disentangled—anyone can publish on arxiv. Credit and reviewing are still entangled, but only partially so—there are many ways to gain credit after the fact with reviewing being a gateway to attention.
That’s a more general concept of “reviews” than what we are used to seeing in the reviews for papers.
But in any case, if the purpose of reviews is managing what could otherwise be “excessive information”, why not ranking instead of filtering papers?
That seems to work well for systems with lots of information, like the Internet. The PageRank algorithm in Google is based not on manual reviews, but on the “review” (or recognition) that incoming links to a URL represent. This fits perfectly the systems used to credit authors that are based on citations (like the h-index) and not on publications (like the number of papers in journals with a JCR greater than _x_). In that case, the reviewers for a paper would be anyone considering whether the paper should be cited in their paper or not.
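To make the analogy concrete, here is a minimal sketch of PageRank-style power iteration over a tiny, made-up citation graph. The paper names, the damping value of 0.85, and the fixed iteration count are assumptions for illustration; this is not how any search engine or citation index is actually implemented.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Rank papers by incoming citations using simple power iteration.

    citations maps each paper to the list of papers it cites.
    """
    papers = sorted(set(citations) | {q for cited in citations.values() for q in cited})
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            cited = citations.get(p, [])
            if cited:
                # A paper splits its rank evenly among the papers it cites.
                share = damping * rank[p] / len(cited)
                for q in cited:
                    new_rank[q] += share
            else:
                # A paper that cites nothing spreads its rank over all papers.
                for q in papers:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical graph: A and B cite C, and D cites A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": [], "D": ["A"]})
top_paper = max(ranks, key=ranks.get)
```

Here paper C comes out on top because two papers cite it, one of which is itself cited. Nobody was assigned to review anything.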
This would work better for outsiders as well. Getting a paper accepted is harder for an outsider, maybe not because the paper is “bad” in any sense; it may simply be using some terminology borrowed from _outside_ some particular area. A paper is expected to conform to some degree to the state of the art and the zeitgeist. Certainly there should be some innovation and pioneering, but sometimes this has to be handled very carefully.
The advantages are many, but all of them stem from what I consider the main difference between ranking and filtering: easing access to some information instead of concealing other information.
The reason to limit publishing (filter) has traditionally been the cost of publishing (especially when a printing press was involved). Now everything can be digital, and I (personally, IMHO) find no reason to keep reviewing as a means for that obsolete purpose, and therefore I find no reason to keep reviewing as it is now.
For example, it could be done after publishing, for ranking purposes (as with the reviews you mention on the Internet), and in this context this means for citation purposes. People would not be assigned papers to review based on an affinity graph; they would simply search for whether someone has already done something with a quality that suits what they expect. Based on that, they would continue from that state of the art citing that paper, repeat that work with higher (or simply different) quality standards, or do the work because it’s new.
This would imply, however, not only a change in the review system, but in the whole publication system, as well as in the credit systems used to rate people (for instance for official positions), which in turn relate to systems in which money is involved, like access to grants, projects, and funding, and which would have many effects on the economy, like the money that publishers make. In short, it’s quite unlikely, no matter how good or bad it may be.
I think we are in agreement about a view of promoting good things vs. filtering bad things.
Although changing the review system significantly in the short term is hard, it seems unavoidable in the longer term.
When I review papers for math journals, the hardest part is catching all the errors. It usually takes me 1 to 2 hours per page to carefully check every proof for typos or misconceptions. When I read machine learning papers there is usually much less to check and in reality, I am much less concerned about the correctness of proofs in ML papers. I focus much more on figuring out how the ideas in the paper can be applied. Are the results relevant? Powerful?
It might be interesting if reviewers could post to the internet a summary of the paper they are reviewing, trying to limit the time used to write the summary to two hours. The summaries could be read by other interested parties as a quick way to get an overview of the ideas in the paper. Then acceptance could be based on a minimum number of votes by probable conference attendees. If you get enough votes, then there would be enough people interested enough to visit the poster or attend the associated lecture. You could limit the number of votes by each attendee.
I wonder if you could create a comment ranking system this way.
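A rough sketch of how such a vote count might work is below. The names, the rule that a ballot is simply truncated at the per-attendee limit, and the example numbers are all hypothetical choices made for illustration.

```python
from collections import Counter


def accepted_papers(ballots, votes_per_attendee, min_votes):
    """Return the papers that reach min_votes, in sorted order.

    ballots maps each attendee to the papers they voted for, in preference order.
    """
    tally = Counter()
    for papers in ballots.values():
        # Count each paper once per attendee, then enforce the vote limit
        # by keeping only the first votes_per_attendee choices.
        unique_choices = list(dict.fromkeys(papers))
        tally.update(unique_choices[:votes_per_attendee])
    return sorted(paper for paper, votes in tally.items() if votes >= min_votes)


ballots = {
    "ann": ["p1", "p2"],
    "bob": ["p1", "p3", "p2"],  # third vote exceeds the limit and is dropped
    "cat": ["p1", "p3"],
}
accepted = accepted_papers(ballots, votes_per_attendee=2, min_votes=2)
```

With a limit of two votes each and a threshold of two, p1 and p3 are accepted while p2 falls one vote short.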
If you de-anonymize reviewers, could that actually reduce incentive to review? I agree that it encourages those who do review to do it better. But it could also decrease (or perhaps not?) the number of people who review.
At least within a machine learning context, I believe the suggestion is mostly about opening up review contents rather than reviewer IDs.