If you are interested, please email msrnycrsvp at microsoft.com and say “I want to come” so we can get a count of attendees for refreshments.
Added: Videos are now online.
If you are interested, please email msrnycrsvp at microsoft.com and say “I want to come” so we can get a count of attendees for refreshments.
Added: Videos are now online.
We’ve been somewhat disorganized in advertising this. As a consequence, anyone who has not submitted an abstract but would like to do so may send one directly to me (firstname.lastname@example.org title NYASMLS) by Friday March 14. I will forward them to the rest of the committee for consideration.
Manik and I are organizing the extreme classification workshop at NIPS this year. We have a number of good speakers lined up, but I would further encourage anyone working in the area to submit an abstract by October 9. I believe this is an idea whose time has now come.
The NIPS website doesn’t have other workshops listed yet, but I expect several others to be of significant interest.
This post is a (near) transcript of a talk that I gave at the ICML 2013 Workshop on Peer Review and Publishing Models. Although there’s a PDF available on my website, I’ve chosen to post a slightly modified version here as well in order to better facilitate discussion.
Disclaimers and Context
I want to start with a couple of disclaimers and some context.
First, I want to point out that although I’ve read a lot about double-blind review, this isn’t my research area and the research discussed in this post is not my own. As a result, I probably can’t answer super detailed questions about these studies.
I also want to note that I’m not opposed to open peer review — I was a free and open source software developer for over ten years and I care a great deal about openness and transparency. Rather, my motivation in writing this post is simply to create awareness of and to initiate discussion about the benefits of double-blind review.
Lastly, and most importantly, I think it’s essential to acknowledge that there’s a lot of research on double-blind review out there. Not all of this research is in agreement, in part because it’s hard to control for all the variables involved and in part because most studies involve a single journal or discipline. And, because these studies arise from different disciplines, they can be difficult to
track down — to my knowledge at least, there’s no “Journal of Double-Blind Review Research.” These factors make for a hard landscape to navigate. My goal here is therefore to draw your attention to some of the key benefits of double-blind review so that we don’t lose sight of them when considering alternative reviewing models.
How Blind Is It?
The primary motivation behind double-blind peer review — in which the identities of a paper’s authors and reviewers are concealed from each other — is to eliminate bias in the reviewing process by preventing factors other than scientific quality from influencing the perceived merit of the work under review. At this point in time, double-blind review is the de facto standard for machine learning conferences.
Before I discuss the benefits of double-blind review, however, I’d like to address one of its most commonly heard criticisms: “But it’s possible to infer author identity from content!” — i.e., that double-blind review isn’t really blind, so therefore there’s no point in implementing it. It turns out that there’s some truth to this statement, but there’s also a lot of untruth too. There are several studies that directly test this assertion by asking reviewers whether authors or institutions are identifiable and, if so, to record their identities and describe the clues that led to their identification.
The results are pretty interesting: when asked to guess the identities of authors or institutions, reviewers are correct only 25–42% of the time . The most common identification clues are self-referencing and authors’ initials or institution identities in the manuscript, followed by reviewers’ personal knowledge [2, 3]. Furthermore, higher identification percentages correspond to journals in which papers are required to explicitly state the source of the data being studied . This indicates that journals, not just authors, bear some responsibility for the degree of identification clues present and can therefore influence the extent to which review is truly double-blind.
Is It Necessary?
Another commonly heard criticism of double-blind review is “But I’m not biased!” — i.e., that double-blind review isn’t needed because factors other than scientific quality do not affect reviewers’ opinions anyway. It’s this statement that I’ll mostly be focusing on here. There are many studies that address this assertion by testing the extent to which peer review can be biased against new ideas, women, junior researchers, and researchers from less prestigious universities or countries other than the US. In the remainder of this post, I’m therefore going give a brief overview of these studies’ findings. But before I do that, I want to talk a bit more about bias.
I think it’s important to talk about bias because I want to make it very clear that the kind of bias I’m talking about is NOT necessarily ill-intentioned, explicit, or even conscious. To quote the AAUW’s report  on the under-representation of women in science, “Even individuals who consciously refute gender and science stereotypes can still hold that belief at an unconscious level. These unconscious beliefs or implicit biases may be more powerful than explicitly held beliefs and values simply because we are not aware of them.” Chapters 8 and 9 of this report provide a really great overview of recent research on implicit bias and negative stereotypes in the workplace. I highly recommend reading them — and the rest of the report for that matter — but for the purpose of this post, it’s sufficient to remember that “Less-conscious beliefs underlying negative stereotypes continue to influence assumptions about people and behavior. [Even] good people end up unintentionally making decisions that violate [...] their own sense of what’s correct [and] what’s good.”
Prestige and Familiarity
Perhaps the most well studied form of bias is the “Matthew effect,” originally introduced by Robert Merton in 1968 . This term refers to the “rich-get-richer” phenomenon whereby well known, eminent researchers get more credit for their contributions than unknown researchers. Since 1968, there’s been a considerable amount of follow-on research investigating the extent to which the Matthew effect exists in science. In the context of peer review, reviewers may be more likely to recommend acceptance of incomplete or inferior papers if they are authored by more prestigious researchers.
Country of Origin
It’s also important to consider country of origin and international bias. There’s research  showing that reviewers from within the United States and reviewers from outside the United States evaluate US papers more favorably, with US reviewers showing a stronger preference for US papers than non-US reviewers. In contrast, US and non-US reviewers behaved near identically for non-US papers.
One of the most widely discussed pieces of recent work on double-blind review and gender is that of Budden et al. , whose research demonstrated that following the introduction of double-blind review by the journal Behavioral Ecology, there was a significant increase in papers authored by women. This pattern was not observed in a similar journal that instead reveals author information to reviewers. Although there’s been some controversy surrounding this work , mostly questioning whether the observed increase was indeed to do with the policy change or a more widely observed phenomenon, the original authors reanalyzed their data and again found that double-blind review favors increased representation of female authors .
Race has also been demonstrated to influence reviewers’ recommendations, albeit in the context of grant funding rather than publications. Even after controlling for factors such as educational background, country of origin, training, previous research awards, publication record, and employer characteristics, African-American applicants for National Institutes of Health R01 grants are 10% less likely than white applicants to be awarded research funding .
I also want to talk briefly about stereotype threat. Stereotype threat is a phenomenon in which performance in academic contexts can be harmed by the awareness that one’s behavior might be viewed through the lens of a negative stereotype about one’s social group . For example, studies have demonstrated that African-American students enrolled in college and female students enrolled in math and science courses score much lower on tests when they are reminded beforehand of their race or gender [10, 11]. In the case of female science students, simply having a larger ratio of men to women present in the testing situation can lower women’s test scores . Several factors may contribute to this decreased performance, including the anxiety, reduced attention, and self-consciousness associated with worrying about whether or not one is confirming the stereotype. One idea that that hasn’t yet been explored in the context of peer review, but might be worth investigating, is whether requiring authors to reveal their identities during peer review induces a stereotype threat scenario.
Lastly, I want to mention the identification of reviewers. Although there’s much less research on this side of the equation, it’s definitely worth considering the effects of revealing reviewer identities as well — especially for more junior reviewers. To quote Mainguy et al.’s article  in PLoS Biology, “Reviewers, and especially newcomers, may feel pressured into accepting a mediocre paper from a more established lab in fear of future reprisals.”
I want to conclude by reminding you that my goal in writing this post was to create awareness about the benefits of double-blind review. There’s a great deal of research on double-blind review and although it can be a hard landscape to navigate — in part because there are many factors involved, not all of which can be trivially controlled in experimental conditions — there are studies out there that demonstrate concrete benefits of double-blind review. Perhaps more importantly though, double-blind review promotes the PERCEPTION of fairness. To again quote Mainguy et al., “[Double-blind review] bears symbolic power that will go a long way to quell fears and frustrations, thereby generating a better perception of fairness and equality in global scientific funding and publishing.”
 Budden, Tregenza, Aarssen, Koricheva, Leimu, Lortie. “Double-blind review favours increased representation of female authors.” 2008.
 Yankauer. “How blind is blind review?” 1991.
 Katz, Proto, Olmsted. “Incidence and nature of unblinding by authors: our experience at two radiology journals with double-blinded peer review policies.” 2002.
 Hill, Corbett, St, Rose. “Why so few? Women in science, technology, engineering, and mathematics.” 2010.
 Merton. “The Matthew effect in science.” 1968.
 Link. “US and non-US submissions: an analysis of reviewer bias.” 1998.
 Webb, O’Hara, Freckleton. “Does double-blind review benefit female authors?” 2008.
 Budden, Lortie, Tregenza, Aarssen, Koricheva, Leimu. “Response to Webb et al.: Double-blind review: accept with minor revisions.” 2008.
 Ginther, Schaffer, Schnell, Masimore, Liu, Haak, Kington. “Race, ethnicity, and NIH research awards.” 2011.
 Steele, Aronson. “Stereotype threat and the intellectual test performance of African Americans.” 1995.
 Dar-Nimrod, Heine. “Exposure to scientific theories affects women’s math performance.” 2006,
 Mainguy, Motamedi, Mietchen. “Peer review—the newcomers’ perspective.” 2005.
When thinking about how best to review papers, it seems helpful to have some conception of what good reviewing is. As far as I can tell, this is almost always only discussed in the specific context of a paper (i.e. your rejected paper), or at most an area (i.e. what a “good paper” looks like for that area) rather than general principles. Neither individual papers or areas are sufficiently general for a large conference—every paper differs in the details, and what if you want to build a new area and/or cross areas?
An unavoidable reason for reviewing is that the community of research is too large. In particular, it is not possible for a researcher to read every paper which someone thinks might be of interest. This reason for reviewing exists independent of constraints on rooms or scheduling formats of individual conferences. Indeed, history suggests that physical constraints are relatively meaningless over the long term — growing conferences simply use more rooms and/or change formats to accommodate the growth.
This suggests that a generic test for paper acceptance should be “Are there a significant number of people who will be interested?” This question could theoretically be answered by sending the paper to every person who might be interested and simply asking them. In practice, this would be an intractable use of people’s time: We must query far fewer people and achieve an approximate answer to this question. Our goal then should be minimizing the approximation error for some fixed amount of reviewing work.
Viewed from this perspective, the first way that things can go wrong is by misassignment of reviewers to papers, for which there are two
easy failure modes available.
An interesting approach for addressing the constraint objective would be optimizing a different objective, such as the product of affinities
rather than the sum. I’ve seen no experimentation of this sort.
For ICML, there are about 3 levels of “reviewer”: the program chair who is responsible for all papers, the area chair who is responsible for organizing reviewing on a subset of papers, and the program committee member/reviewer who has primary responsibility for reviewing. In 2012 tried to avoid these failure modes in a least-system effort way using a blended approach. We used bidding to get a higher quality affinity matrix. We used a constraint system to assign the first reviewer to each paper and two area chairs to each paper. Then, we asked each area chair to find one reviewer for each paper. This obviously dealt with the one-area-chair failure mode. It also helps substantially with low quality assignments from the constrained system since (a) the first reviewer chosen is typically higher quality than the last due to it being the least constrained (b) misassignments to area chairs are diagnosed at the beginning of the process by ACs trying to find reviewers (c) ACs can reach outside of the initial program committee to find reviewers, which existing automated systems can not do.
The next way that reviewing can go wrong is via biased reviewing.
Reviewing can also be low quality. A primary issue here is time: most reviewers will submit a review within a time constraint, but it may not be high quality due to limits on time. Minimizing average reviewer load is quite important here. Staggered deadlines for reviews are almost certainly also helpful. A more subtle thing is discouraging low quality submissions. My favored approach here is to publish all submissions nonanonymously after some initial period of time.
Another significant issue in reviewer quality is motivation. Making reviewers not anonymous to each other helps with motivation as poor reviews will at least be known to some. Author feedback also helps with motivation, as reviewers know that authors will be able to point out poor reviewing. It is easy to imagine that further improvements in reviewer motivation would be helpful.
A third form of low quality review is based on miscommunication. Maybe there is silly typo in a paper? Maybe something was confusing? Being able to communicate with the author can greatly reduce ambiguities.
The last problem is dictatorship at decision time for which I’ve seen several variants. Sometimes this comes in the form of giving each area chair a budget of papers to “champion”. Sometimes this comes in the form of an area chair deciding to override all reviews and either accept or more likely reject a paper. Sometimes this comes in the form of a program chair doing this as well. The power of dictatorship is often available, but it should not be used: the wiser course is keeping things representative.
At ICML 2012, we tried to deal with this via a defined power approach. When reviewers agreed on the accept/reject decision, that was the decision. If the reviewers disgreed, we asked the two area chairs to make decisions and if they agreed, that was the decision. It was only when the ACs disagreed that the program chairs would become involved in the decision.
The above provides an understanding of how to create a good reviewing process for a large conference. With this in mind, we can consider various proposals at the peer review workshop and elsewhere.
It’s important to note that none of the above are inherently contradictory. This is not necessarily obvious as proponents of open review and double blind review have found themselves in opposition at times. These approaches can be accommodated by simply hiding authors names for a fixed period of 2 months while the initial review process is ongoing.
Representative reviewing seems like the real difficult goal. If a paper is rejected in a representative reviewing process, then perhaps it is just not of sufficient interest. Similarly, if a paper is accepted, then perhaps it is of real and meaningful interest. And if the reviewing process is not representative, then perhaps we should fix the failure modes.
Edit: Crossposted on CACM.
Powered by WordPress