Machine Learning (Theory)

6/29/2013

The Benefits of Double-Blind Review

This post is a (near) transcript of a talk that I gave at the ICML 2013 Workshop on Peer Review and Publishing Models. Although there’s a PDF available on my website, I’ve chosen to post a slightly modified version here as well in order to better facilitate discussion.

Disclaimers and Context

I want to start with a couple of disclaimers and some context.

First, I want to point out that although I’ve read a lot about double-blind review, this isn’t my research area and the research discussed in this post is not my own. As a result, I probably can’t answer super detailed questions about these studies.

I also want to note that I’m not opposed to open peer review — I was a free and open source software developer for over ten years and I care a great deal about openness and transparency. Rather, my motivation in writing this post is simply to create awareness of and to initiate discussion about the benefits of double-blind review.

Lastly, and most importantly, I think it’s essential to acknowledge that there’s a lot of research on double-blind review out there. Not all of this research is in agreement, in part because it’s hard to control for all the variables involved and in part because most studies involve a single journal or discipline. And, because these studies arise from different disciplines, they can be difficult to track down; to my knowledge at least, there’s no “Journal of Double-Blind Review Research.” These factors make for a hard landscape to navigate. My goal here is therefore to draw your attention to some of the key benefits of double-blind review so that we don’t lose sight of them when considering alternative reviewing models.

How Blind Is It?

The primary motivation behind double-blind peer review — in which the identities of a paper’s authors and reviewers are concealed from each other — is to eliminate bias in the reviewing process by preventing factors other than scientific quality from influencing the perceived merit of the work under review. At this point in time, double-blind review is the de facto standard for machine learning conferences.

Before I discuss the benefits of double-blind review, however, I’d like to address one of its most commonly heard criticisms: “But it’s possible to infer author identity from content!” That is, double-blind review isn’t really blind, so there’s no point in implementing it. It turns out that there’s some truth to this statement, but there’s a lot of untruth too. Several studies directly test this assertion by asking reviewers whether authors or institutions are identifiable and, if so, to record their identities and describe the clues that led to their identification.

The results are pretty interesting: when asked to guess the identities of authors or institutions, reviewers are correct only 25–42% of the time [1]. The most common identification clues are self-referencing and authors’ initials or institution identities in the manuscript, followed by reviewers’ personal knowledge [2, 3]. Furthermore, higher identification percentages correspond to journals in which papers are required to explicitly state the source of the data being studied [2]. This indicates that journals, not just authors, bear some responsibility for the degree of identification clues present and can therefore influence the extent to which review is truly double-blind.

Is It Necessary?

Another commonly heard criticism of double-blind review is “But I’m not biased!” That is, double-blind review isn’t needed because factors other than scientific quality do not affect reviewers’ opinions anyway. It’s this statement that I’ll mostly be focusing on here. There are many studies that address this assertion by testing the extent to which peer review can be biased against new ideas, women, junior researchers, and researchers from less prestigious universities or countries other than the US. In the remainder of this post, I’m therefore going to give a brief overview of these studies’ findings. But before I do that, I want to talk a bit more about bias.

Implicit Bias

I think it’s important to talk about bias because I want to make it very clear that the kind of bias I’m talking about is NOT necessarily ill-intentioned, explicit, or even conscious. To quote the AAUW’s report [4] on the under-representation of women in science, “Even individuals who consciously refute gender and science stereotypes can still hold that belief at an unconscious level. These unconscious beliefs or implicit biases may be more powerful than explicitly held beliefs and values simply because we are not aware of them.” Chapters 8 and 9 of this report provide a really great overview of recent research on implicit bias and negative stereotypes in the workplace. I highly recommend reading them — and the rest of the report for that matter — but for the purpose of this post, it’s sufficient to remember that “Less-conscious beliefs underlying negative stereotypes continue to influence assumptions about people and behavior. [Even] good people end up unintentionally making decisions that violate [...] their own sense of what’s correct [and] what’s good.”

Prestige and Familiarity

Perhaps the most well studied form of bias is the “Matthew effect,” originally introduced by Robert Merton in 1968 [5]. This term refers to the “rich-get-richer” phenomenon whereby well known, eminent researchers get more credit for their contributions than unknown researchers. Since 1968, there’s been a considerable amount of follow-on research investigating the extent to which the Matthew effect exists in science. In the context of peer review, reviewers may be more likely to recommend acceptance of incomplete or inferior papers if they are authored by more prestigious researchers.

Country of Origin

It’s also important to consider country of origin and international bias. There’s research [6] showing that reviewers from within the United States and reviewers from outside the United States evaluate US papers more favorably, with US reviewers showing a stronger preference for US papers than non-US reviewers. In contrast, US and non-US reviewers behaved near identically for non-US papers.

Gender

One of the most widely discussed pieces of recent work on double-blind review and gender is that of Budden et al. [1], whose research demonstrated that following the introduction of double-blind review by the journal Behavioral Ecology, there was a significant increase in papers authored by women. This pattern was not observed in a similar journal that instead reveals author information to reviewers. Although there’s been some controversy surrounding this work [7], mostly questioning whether the observed increase was indeed due to the policy change or to a more widely observed phenomenon, the original authors reanalyzed their data and again found that double-blind review favors increased representation of female authors [8].

Race

Race has also been demonstrated to influence reviewers’ recommendations, albeit in the context of grant funding rather than publications. Even after controlling for factors such as educational background, country of origin, training, previous research awards, publication record, and employer characteristics, African-American applicants for National Institutes of Health R01 grants are 10% less likely than white applicants to be awarded research funding [9].

Stereotype Threat

I also want to talk briefly about stereotype threat. Stereotype threat is a phenomenon in which performance in academic contexts can be harmed by the awareness that one’s behavior might be viewed through the lens of a negative stereotype about one’s social group [10]. For example, studies have demonstrated that African-American students enrolled in college and female students enrolled in math and science courses score much lower on tests when they are reminded beforehand of their race or gender [10, 11]. In the case of female science students, simply having a larger ratio of men to women present in the testing situation can lower women’s test scores [4]. Several factors may contribute to this decreased performance, including the anxiety, reduced attention, and self-consciousness associated with worrying about whether or not one is confirming the stereotype. One idea that hasn’t yet been explored in the context of peer review, but might be worth investigating, is whether requiring authors to reveal their identities during peer review induces a stereotype threat scenario.

Reviewers’ Identities

Lastly, I want to mention the identification of reviewers. Although there’s much less research on this side of the equation, it’s definitely worth considering the effects of revealing reviewer identities as well — especially for more junior reviewers. To quote Mainguy et al.’s article [12] in PLoS Biology, “Reviewers, and especially newcomers, may feel pressured into accepting a mediocre paper from a more established lab in fear of future reprisals.”

Summary

I want to conclude by reminding you that my goal in writing this post was to create awareness about the benefits of double-blind review. There’s a great deal of research on double-blind review and although it can be a hard landscape to navigate — in part because there are many factors involved, not all of which can be trivially controlled in experimental conditions — there are studies out there that demonstrate concrete benefits of double-blind review. Perhaps more importantly though, double-blind review promotes the PERCEPTION of fairness. To again quote Mainguy et al., “[Double-blind review] bears symbolic power that will go a long way to quell fears and frustrations, thereby generating a better perception of fairness and equality in global scientific funding and publishing.”

References

[1] Budden, Tregenza, Aarssen, Koricheva, Leimu, Lortie. “Double-blind review favours increased representation of female authors.” 2008.

[2] Yankauer. “How blind is blind review?” 1991.

[3] Katz, Proto, Olmsted. “Incidence and nature of unblinding by authors: our experience at two radiology journals with double-blinded peer review policies.” 2002.

[4] Hill, Corbett, St. Rose. “Why so few? Women in science, technology, engineering, and mathematics.” 2010.

[5] Merton. “The Matthew effect in science.” 1968.

[6] Link. “US and non-US submissions: an analysis of reviewer bias.” 1998.

[7] Webb, O’Hara, Freckleton. “Does double-blind review benefit female authors?” 2008.

[8] Budden, Lortie, Tregenza, Aarssen, Koricheva, Leimu. “Response to Webb et al.: Double-blind review: accept with minor revisions.” 2008.

[9] Ginther, Schaffer, Schnell, Masimore, Liu, Haak, Kington. “Race, ethnicity, and NIH research awards.” 2011.

[10] Steele, Aronson. “Stereotype threat and the intellectual test performance of African Americans.” 1995.

[11] Dar-Nimrod, Heine. “Exposure to scientific theories affects women’s math performance.” 2006.

[12] Mainguy, Motamedi, Mietchen. “Peer review—the newcomers’ perspective.” 2005.

6/16/2013

Representative Reviewing

Tags: Conferences, Reviewing, Workshop | jl @ 10:09 am

When thinking about how best to review papers, it seems helpful to have some conception of what good reviewing is. As far as I can tell, this is almost always only discussed in the specific context of a paper (i.e. your rejected paper), or at most an area (i.e. what a “good paper” looks like for that area) rather than general principles. Neither individual papers nor areas are sufficiently general for a large conference: every paper differs in the details, and what if you want to build a new area and/or cross areas?

An unavoidable reason for reviewing is that the community of research is too large. In particular, it is not possible for a researcher to read every paper which someone thinks might be of interest. This reason for reviewing exists independent of constraints on rooms or scheduling formats of individual conferences. Indeed, history suggests that physical constraints are relatively meaningless over the long term — growing conferences simply use more rooms and/or change formats to accommodate the growth.

This suggests that a generic test for paper acceptance should be “Are there a significant number of people who will be interested?” This question could theoretically be answered by sending the paper to every person who might be interested and simply asking them. In practice, this would be an intractable use of people’s time: We must query far fewer people and achieve an approximate answer to this question. Our goal then should be minimizing the approximation error for some fixed amount of reviewing work.

Viewed from this perspective, the first way that things can go wrong is by misassignment of reviewers to papers, for which there are two easy failure modes available.

  1. When reviewer/paper assignment is automated based on an affinity graph, the affinity graph may be low quality or the constraint on the maximum number of papers per reviewer can easily leave some papers with low affinity to all reviewers orphaned.
  2. When reviewer/paper assignments are done by one person, that person may choose reviewers who are all like-minded, simply because this is the crowd that they know. I’ve seen this happen at the beginning of the reviewing process, but the more insidious case is when it happens at the end, where people are pressed for time and low quality judgements can become common.

An interesting approach for addressing the constraint objective would be optimizing a different objective, such as the product of affinities rather than the sum. I’ve seen no experimentation of this sort.
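To make the sum-versus-product idea concrete, here is a minimal sketch on a toy problem. The affinity values, the one-paper-per-reviewer constraint, and the function name `best_assignment` are all assumptions made for illustration; a real conference would use a proper constrained solver rather than brute force.

```python
from itertools import permutations
from math import prod

# Hypothetical affinities: AFFINITY[paper][reviewer], each in (0, 1].
AFFINITY = [
    [1.0, 0.5],  # paper 0
    [0.5, 0.1],  # paper 1: weak match with reviewer 1
]

def best_assignment(affinity, objective):
    """Brute-force the one-to-one assignment that maximizes `objective`.

    Returns a tuple where entry i is the reviewer assigned to paper i.
    Only sensible for tiny examples, since it enumerates all permutations.
    """
    n = len(affinity)
    return max(
        permutations(range(n)),
        key=lambda assign: objective(affinity[p][r] for p, r in enumerate(assign)),
    )

by_sum = best_assignment(AFFINITY, sum)       # maximize total affinity
by_product = best_assignment(AFFINITY, prod)  # maximize product of affinities
print("sum objective:    ", by_sum)
print("product objective:", by_product)
```

In this toy case the sum objective is content to give paper 1 its weakest reviewer because paper 0 gets a perfect match, while the product objective penalizes any single very poor match and so swaps the reviewers.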

For ICML, there are about 3 levels of “reviewer”: the program chair who is responsible for all papers, the area chair who is responsible for organizing reviewing on a subset of papers, and the program committee member/reviewer who has primary responsibility for reviewing. In 2012, we tried to avoid these failure modes in a way that required the least system effort, using a blended approach. We used bidding to get a higher quality affinity matrix. We used a constraint system to assign the first reviewer to each paper and two area chairs to each paper. Then, we asked each area chair to find one reviewer for each paper. This obviously dealt with the one-area-chair failure mode. It also helps substantially with low quality assignments from the constrained system since (a) the first reviewer chosen is typically higher quality than the last due to it being the least constrained, (b) misassignments to area chairs are diagnosed at the beginning of the process by ACs trying to find reviewers, and (c) ACs can reach outside of the initial program committee to find reviewers, which existing automated systems cannot do.

The next way that reviewing can go wrong is via biased reviewing.

  1. Author name bias is a famous one. In my experience it is real: well known authors automatically have their paper taken seriously, which particularly matters when time is short. Furthermore, I’ve seen instances where well-known authors can slide by with proof sketches that no one fully understands.
  2. Review anchoring is a very significant problem if it occurs. This does not happen in the standard review process, because the reviews of others are not visible to other reviewers until they are complete.
  3. A more subtle form of bias is when one reviewer is simply much louder or more charismatic than others. Reviewing without an in-person meeting is actually helpful here, as it reduces this problem substantially.

Reviewing can also be low quality. A primary issue here is time: most reviewers will submit a review within a time constraint, but it may not be high quality due to limits on time. Minimizing average reviewer load is quite important here. Staggered deadlines for reviews are almost certainly also helpful. A more subtle thing is discouraging low quality submissions. My favored approach here is to publish all submissions nonanonymously after some initial period of time.

Another significant issue in reviewer quality is motivation. Making reviewers not anonymous to each other helps with motivation as poor reviews will at least be known to some. Author feedback also helps with motivation, as reviewers know that authors will be able to point out poor reviewing. It is easy to imagine that further improvements in reviewer motivation would be helpful.

A third form of low quality review is based on miscommunication. Maybe there is a silly typo in a paper? Maybe something was confusing? Being able to communicate with the author can greatly reduce ambiguities.

The last problem is dictatorship at decision time for which I’ve seen several variants. Sometimes this comes in the form of giving each area chair a budget of papers to “champion”. Sometimes this comes in the form of an area chair deciding to override all reviews and either accept or more likely reject a paper. Sometimes this comes in the form of a program chair doing this as well. The power of dictatorship is often available, but it should not be used: the wiser course is keeping things representative.

At ICML 2012, we tried to deal with this via a defined power approach. When reviewers agreed on the accept/reject decision, that was the decision. If the reviewers disagreed, we asked the two area chairs to make decisions and if they agreed, that was the decision. It was only when the ACs disagreed that the program chairs would become involved in the decision.
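The escalation rule just described can be written down as a short sketch. The function name `decide`, the boolean vote encoding, and the `program_chairs` callback are assumptions for illustration, not a description of any actual ICML tooling.

```python
from typing import Callable, Sequence, Tuple

def decide(
    reviewer_votes: Sequence[bool],
    ac_votes: Sequence[bool],
    program_chairs: Callable[[], bool],
) -> Tuple[bool, str]:
    """Return (accept?, who decided) under the defined power rule.

    True means accept and False means reject. The program chairs are
    consulted only if both earlier levels are split.
    """
    if len(set(reviewer_votes)) == 1:   # reviewers unanimous
        return reviewer_votes[0], "reviewers"
    if len(set(ac_votes)) == 1:         # area chairs unanimous
        return ac_votes[0], "area chairs"
    return program_chairs(), "program chairs"

# Reviewers split, the two area chairs agree on accept.
print(decide([True, False, True], [True, True], lambda: False))
```

Passing the program chairs as a callback keeps their involvement lazy: they are never asked unless the reviewers and the area chairs both fail to agree.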

The above provides an understanding of how to create a good reviewing process for a large conference. With this in mind, we can consider various proposals at the peer review workshop and elsewhere.

  1. Double Blind Review. This reduces bias, at the cost of decreasing reviewer motivation. Overall, I think it’s a significant long term positive for a conference as “insiders” naturally become more concerned with review quality and “outsiders” are more prone to submit.
  2. Better paper/reviewer matching. A pure win, with the only caveat that you should be familiar with failure modes and watch out for them.
  3. Author feedback. This improves review quality by placing a check on unfair reviews and reducing miscommunication at some cost in time.
  4. Allowing an appendix or ancillary materials. This allows authors to better communicate complex ideas, at the potential cost of reviewer time. A standard compromise is to make reading an appendix optional for reviewers.
  5. Open reviews. Open reviews means that people can learn from other reviews, and that authors can respond more naturally than in single round author feedback.

It’s important to note that none of the above are inherently contradictory. This is not necessarily obvious as proponents of open review and double blind review have found themselves in opposition at times. These approaches can be accommodated by simply hiding authors’ names for a fixed period of 2 months while the initial review process is ongoing.

Representative reviewing seems like the real difficult goal. If a paper is rejected in a representative reviewing process, then perhaps it is just not of sufficient interest. Similarly, if a paper is accepted, then perhaps it is of real and meaningful interest. And if the reviewing process is not representative, then perhaps we should fix the failure modes.

Edit: Crossposted on CACM.

6/29/2012

ICML survey and comments

Just about nothing could keep me from attending ICML, except for Dora who arrived on Monday. Consequently, I have only secondhand reports that the conference is going well.

For those who are remote (like me) or after the conference (like everyone), Mark Reid has set up the ICML discussion site where you can comment on any paper or subscribe to papers. Authors are automatically subscribed to their own papers, so it should be possible to have a discussion significantly after the fact, as people desire.

We also conducted a survey before the conference and have the survey results now. This can be compared with the ICML 2010 survey results. Looking at the comparable questions, we can sometimes order the answers to have scores ranging from 0 to 3 or 0 to 4 with 3 or 4 being best and 0 worst, then compute the average difference between 2012 and 2010.
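As a rough sketch of that comparison: order the answers from worst to best, score them 0, 1, 2, ..., and subtract the mean scores of the two years. The response counts below are made up purely for illustration and are not the survey data.

```python
def mean_score(counts):
    """Mean score when counts[i] respondents chose the answer scored i."""
    total = sum(counts)
    return sum(score * n for score, n in enumerate(counts)) / total

def year_over_year_shift(counts_new, counts_old):
    """Difference in mean score between two survey years."""
    return mean_score(counts_new) - mean_score(counts_old)

# Hypothetical counts for one question, answers ordered worst (0) to best (3).
counts_2010 = [5, 20, 50, 25]
counts_2012 = [4, 18, 50, 28]
print(round(year_over_year_shift(counts_2012, counts_2010), 3))
```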

Glancing through them, I see:

  1. Most people found the papers they reviewed a good fit for their expertise (-.037 w.r.t. 2010). Achieving this was one of our subgoals in the pursuit of high quality decisions.
  2. Most people had sufficient time for doing reviews. This was something that we worried about significantly in shifting the paper deadline and otherwise massaging the schedule. Most people also thought the review period was sufficiently long and most reviews were high quality (+.023 w.r.t. 2010).
  3. About 1/4 of reviewers say that author response changed their mind on a paper and 2/3 of reviewers say discussion changed their mind on a paper. The expectation of decision impact from author response is reduced from 2010 (-.135). The existence of author response is overwhelmingly preferred.
  4. People generally found ICML reviewing the same or better than previous ICMLs (+.35 w.r.t. 2010) and other similar conferences (+.198 w.r.t. 2010) at the cost of being somewhat more work. A substantial bump in reviewing quality was a primary goal.
  5. The ACs spent substantially more time (43 hours on average) than PC members (28 hours on average). This agrees with our expectation—the set of ACs didn’t change even after we had a 50% increase in submissions. The AC load we had this year was probably too high and will need to be reduced somewhat for next year.
  6. 2/3 of authors prefer the option to revise a paper during author response.
  7. The choice of how to deal with increased submissions is deeply undecided, with a slight preference for short talk+poster as we did.
  8. Most people like having two workshop days or don’t care.
  9. There is a strong preference for COLT and UAI colocation with the next tier of preference for IJCAI, KDD, AAAI, and CVPR.

6/5/2012

ICML acceptance statistics

Tags: Conferences, Reviewing | jl @ 8:24 pm

People are naturally interested in slicing the ICML acceptance statistics in various ways. Here’s a rundown for the top categories.

18/66 = 0.27 in (0.18,0.36) Reinforcement Learning
10/52 = 0.19 in (0.17,0.37) Supervised Learning
9/51 = 0.18 not in (0.18, 0.37) Clustering
12/46 = 0.26 in (0.17, 0.37) Kernel Methods
11/40 = 0.28 in (0.15, 0.4) Optimization Algorithms
8/33 = 0.24 in (0.15, 0.39) Learning Theory
14/33 = 0.42 not in (0.15, 0.39) Graphical Models
10/32 = 0.31 in (0.15, 0.41) Applications (+5 invited)
8/29 = 0.28 in (0.14, 0.41) Probabilistic Models
13/29 = 0.45 not in (0.14, 0.41) NN & Deep Learning
8/26 = 0.31 in (0.12, 0.42) Transfer and Multi-Task Learning
13/25 = 0.52 not in (0.12, 0.44) Online Learning
5/25 = 0.20 in (0.12, 0.44) Active Learning
6/22 = 0.27 in (0.14, 0.41) Semi-Supervised Learning
7/20 = 0.35 in (0.1, 0.45) Statistical Methods
4/20 = 0.20 in (0.1, 0.45) Sparsity and Compressed Sensing
1/19 = 0.05 not in (0.11, 0.42) Ensemble Methods
5/18 = 0.28 in (0.11, 0.44) Structured Output Prediction
4/18 = 0.22 in (0.11, 0.44) Recommendation and Matrix Factorization
7/18 = 0.39 in (0.11, 0.44) Latent-Variable Models and Topic Models
1/17 = 0.06 not in (0.12, 0.47) Graph-Based Learning Methods
5/16 = 0.31 in (0.13, 0.44) Nonparametric Bayesian Inference
3/15 = 0.20 in (0.07, 0.47) Unsupervised Learning and Outlier Detection
7/12 = 0.58 not in (0.08, 0.50) Gaussian Processes
5/11 = 0.45 not in (0.09, 0.45) Ranking and Preference Learning
2/11 = 0.18 in (0.09, 0.45) Large-Scale Learning
0/9 = 0.00 in [0, 0.56) Vision
3/9 = 0.33 in [0, 0.56) Social Network Analysis
0/9 = 0.00 in [0, 0.56) Multi-agent & Cooperative Learning
2/9 = 0.22 in [0, 0.56) Manifold Learning
4/8 = 0.50 not in [0, 0.5) Time-Series Analysis
2/8 = 0.25 in [0, 0.5] Large-Margin Methods
2/8 = 0.25 in [0, 0.5] Cost Sensitive Learning
2/7 = 0.29 in [0, 0.57) Recommender Systems
3/7 = 0.43 in [0, 0.57) Privacy, Anonymity, and Security
0/7 = 0.00 in [0, 0.57) Neural Networks
0/7 = 0.00 in [0, 0.57) Empirical Insights
0/7 = 0.00 in [0, 0.57) Bioinformatics
1/6 = 0.17 in [0, 0.5) Information Retrieval
2/6 = 0.33 in [0, 0.5) Evaluation Methodology

Update: See Brendan’s graph for a visualization.

I usually find these numbers hard to interpret. At the grossest level, all areas have significant selection. At a finer level, one way to add further interpretation is to pretend that the acceptance rate of all papers is 0.27, then compute a 5% lower tail and a 5% upper tail. With 40 categories, we expect to have about 4 violations of tail inequalities. Instead, we have 9, so there is some evidence that individual areas are particularly hot or cold. In particular, the hot topics are Graphical models, Neural Networks and Deep Learning, Online Learning, Gaussian Processes, Ranking and Preference Learning, and Time Series Analysis. The cold topics are Clustering, Ensemble Methods, and Graph-Based Learning Methods.
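Here is a minimal sketch of one way to compute such tails, assuming each paper in a category is accepted independently with probability 0.27. The function names and the exact tail convention (strict versus non-strict inequalities) are assumptions; the intervals in the list above may have been computed with a slightly different convention.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def tail_bounds(n, p=0.27, alpha=0.05):
    """Return counts (lo, hi) with P(X < lo) <= alpha and P(X > hi) <= alpha.

    A category with n submissions whose accepted count falls outside
    [lo, hi] would be flagged as unusually cold or hot.
    """
    pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
    lo, below = 0, 0.0
    while below + pmf[lo] <= alpha:   # grow the lower tail while it fits
        below += pmf[lo]
        lo += 1
    hi, above = n, 0.0
    while above + pmf[hi] <= alpha:   # grow the upper tail while it fits
        above += pmf[hi]
        hi -= 1
    return lo, hi

for accepted, submitted in [(18, 66), (13, 25), (1, 19)]:
    lo, hi = tail_bounds(submitted)
    inside = lo <= accepted <= hi
    print(f"{accepted}/{submitted}: interval [{lo}, {hi}] ->", "in" if inside else "not in")
```

With a 5% tail on each side, a category lands outside its interval at most about 10% of the time under this null model, which is where the expectation of roughly 4 violations among 40 categories comes from.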

We also experimented with AIStats resubmits (3/4 accepted) and NFP papers (4/7 accepted) but the numbers were too small to read anything significant.

One thing that surprised me was how uniform decisions were as a function of average score in reviews. All reviews included a decision from {Strong Reject, Weak Reject, Weak Accept, Strong Accept}. These were mapped to numbers in the range {1,2,3,4}. In essence, average review score < 2.2 meant 0% chance of acceptance, and average review score > 3.1 meant acceptance. Due to discretization in the number of reviewers and review scores there were only 3 typical uncertain outcomes:

  1. 2.33. This was either 2 Weak Rejects+Weak Accept or Strong Reject+2 Weak Accepts or (rarely) Strong Reject+Weak Reject+Strong Accept. About 8% of these papers were accepted.
  2. 2.67. This was either Weak Reject+2 Weak Accepts or Strong Accept+2 Weak Rejects or (rarely) Strong Reject+Weak Accept+Strong Accept. About 48% of these papers were accepted.
  3. 3.0. This was commonly 3 Weak Accepts or Strong Accept+Weak Accept+Weak Reject or (rarely) 2 Strong Accepts + Strong Reject. About 90% of these papers were accepted.
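The three uncertain averages can be checked by enumerating every combination of three reviews. This sketch assumes exactly three reviews per paper and uses the 1 to 4 mapping described above; the names `SCORES` and `combos_by_total` are illustrative.

```python
from collections import defaultdict
from itertools import combinations_with_replacement

SCORES = {"Strong Reject": 1, "Weak Reject": 2, "Weak Accept": 3, "Strong Accept": 4}

def combos_by_total(n_reviews=3):
    """Group unordered review combinations by their summed score."""
    groups = defaultdict(list)
    for combo in combinations_with_replacement(SCORES, n_reviews):
        groups[sum(SCORES[label] for label in combo)].append(combo)
    return dict(groups)

groups = combos_by_total()
for total in (7, 8, 9):  # averages 2.33, 2.67 and 3.0
    print(f"average {total / 3:.2f}:")
    for combo in groups[total]:
        print("   ", " + ".join(combo))
```

Each of the three averages arises from exactly three distinct combinations, matching the cases listed above.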

One question I’ve always wondered is: How much variance is there in the accept/reject decision? In general, correlated assignment of reviewers can greatly increase the amount of variance, so one of our goals this year was doing as independent an assignment as possible. If you accept that as independence, we essentially get 3 samples for each paper where the average standard deviation of reviewer scores before author feedback and discussion is 0.64. After author feedback and discussion the standard deviation drops to 0.51. If we pretend that papers have an intrinsic value between 1 and 4 then think of reviews as discretized gaussian measurements fed through the above decision criteria, we get the following:

There are great caveats to this picture. For example, treating the AC’s decision as random conditioned on the reviewer average is a worst-case analysis. The reality is that ACs are removing noise from the few events that I monitored carefully, although it is difficult to quantify this. Similarly, treating the reviews observed after discussion as independent is clearly flawed. A reasonable way to look at it is: author feedback and discussion get us about 1/3 or 1/4 of the way to the final decision from the initial reviews.
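For readers who want to reproduce the thought experiment, here is a minimal sketch under stated assumptions: exactly three independent reviews per paper, Gaussian noise with the post-discussion standard deviation of 0.51, scores rounded and clipped to 1 through 4, and the acceptance rates by average score quoted earlier. The function names are hypothetical, and this is not the code behind the original picture.

```python
from itertools import product
from math import erf, sqrt

# Acceptance rate by summed score of three reviews, taken from the text:
# average 2.33 -> 8%, 2.67 -> 48%, 3.0 -> 90%; below 2.2 -> 0%, above 3.1 -> 100%.
ACCEPT_RATE_BY_SUM = {7: 0.08, 8: 0.48, 9: 0.90}

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def score_distribution(value, sigma):
    """P(score = s), s in 1..4, for value + Gaussian noise rounded and clipped."""
    upper_cdf = [normal_cdf((cut - value) / sigma) for cut in (1.5, 2.5, 3.5)] + [1.0]
    probs, prev = {}, 0.0
    for score, cdf in zip((1, 2, 3, 4), upper_cdf):
        probs[score] = cdf - prev
        prev = cdf
    return probs

def accept_probability(value, sigma=0.51):
    """Probability of acceptance for a paper with the given intrinsic value."""
    dist = score_distribution(value, sigma)
    total = 0.0
    for scores in product(dist, repeat=3):  # all ordered triples of scores
        p = dist[scores[0]] * dist[scores[1]] * dist[scores[2]]
        s = sum(scores)
        rate = 1.0 if s >= 10 else ACCEPT_RATE_BY_SUM.get(s, 0.0)
        total += p * rate
    return total

for v in (2.0, 2.5, 3.0, 3.5):
    print(v, round(accept_probability(v), 3))
```

As noted above, treating the final decision as random given the average score is a worst-case simplification, so this sketch overstates the noise in the real process.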

Conditioned on the papers, discussion, author feedback and reviews, ACs are pretty uniform in their decisions with ~30 papers where ACs disagreed on the accept/reject decision. For half of those, the ACs discussed further and agreed, leaving Joelle and me a feasible quantity of cases to look at (plus several other exceptions).

At the outset, we promised a zero-SPOF (no single point of failure) reviewing process. We actually aimed higher: at least 3 people needed to make a wrong decision for the ICML 2012 reviewing process to kick out a wrong decision. I expect this happened a few times given the overall level of quality disagreement and quantities involved, but hopefully we managed to reduce the noise appreciably.

5/2/2012

ICML: Behind the Scenes

This is a rather long post, detailing the ICML 2012 review process. The goal is to make the process more transparent, help authors understand how we came to a decision, and discuss the strengths and weaknesses of this process for future conference organizers.

Microsoft’s Conference Management Toolkit (CMT)
We chose to use CMT over other conference management software mainly because of its rich toolkit. The interface is sub-optimal (to say the least!) but it has extensive capabilities (to handle bids, author response, resubmissions, etc.), good import/export mechanisms (to process the data elsewhere), excellent technical support (to answer late night emails, add new functionalities). Overall, it was the right choice, although we hope a designer will look at that interface sometime soon!

Toronto Matching System (TMS)
TMS is now being used by many major conferences in our field (including NIPS and UAI). It is an automated system (developed by Laurent Charlin and Rich Zemel at U. Toronto) to match reviewers to papers, based on an analysis of each reviewer’s publications. TMS collects publications from reviewers, parses them into features and applies unsupervised or supervised learning techniques to predict the relevance of any target paper for any reviewer. We convinced TMS to integrate with CMT and funded Laurent’s work for that. Reviewers were asked to put in a publication list for TMS to parse. For those who failed to do so (after many reminders!), we manually added that information from public sources.

The Program Committee
Recruiting a program committee that is both large and highly qualified is difficult these days. We sent out 69 area chair invitations; 50 (highly qualified!) people accepted. Each of these area chairs was asked to nominate a list of potential reviewers. We sent out approximately 700 invitations for program committee members; 389 accepted. A number of additional PC members were recruited during the review process (most of them for 1-2 papers), for a total of 470 active PC members. In terms of seniority, the final PC contains about 15% students, 80% researchers, 5% other.

The Surge (ICML + 50%)
The first big challenge came on the submission deadline. In the past few years, ICML had consistently received ~550-600 submissions. This year, we had a 50% increase, to 890 submissions. We had recruited a PC that could comfortably handle 700 papers. Dealing with an extra 200 papers was not an easy task.

About 10 submissions were rejected without review for various reasons (severe formatting issues, extra pages, non-anonymization).

Bidding
An unsupervised version of TMS was used to generate a list of candidate papers for each reviewer and area chair. This was done working closely with Laurent Charlin of TMS using validation on previous NIPS data. CMT did not have the functionality to show a good list of candidate papers to reviewers, so we crafted an interface to show this list and let reviewers use that in conjunction with CMT. Ideally, this will be better incorporated in CMT in the future.

When you ask a group of scientists to run a conference, you must expect a few experiments will take place…. And so we decided to assess the usefulness of TMS scoring for generating lists of papers to bid on. To do this, we (randomly) assigned PC members to 1 of 3 groups. One group saw a list purely based on TMS scores. Another group received a list based on the matching between their subject area and that of the paper (referred to as the “relevance” score in CMT). The third group received a list based on a mix of both TMS and relevance. Reviewers were allowed to bid on any paper (excluding those with which they had a conflict); the lists were provided to help them efficiently sort through the large number of papers. We then compared the set of bids for a reviewer, with the list of suggestions, and measured the correspondence.

The following is the Discounted Cumulative Gain (DCG) of each list with respect to the bidding scores, averaged separately for each group. Note that each group was only presented with their corresponding list and not the others.

| | Group: CMT | Group: TMS | Group: CMT+TMS |
|---|---|---|---|
| Sorting by CMT scores | 6.11 out of 12.64 (48%) | 4.98 out of 13.63 (36%) | 4.87 out of 13.55 (35%) |
| Sorting by TMS score | 4.06 out of 12.64 (32%) | 6.43 out of 13.63 (47%) | 5.72 out of 13.55 (42%) |
| Sorting by TMS+CMT | 4.77 out of 12.64 (37%) | 6.11 out of 13.63 (44%) | 6.71 out of 13.55 (49%) |
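
For readers unfamiliar with DCG, here is a minimal sketch of how such a score could be computed for one reviewer. The post does not say which DCG variant was used or how the "out of" figure was obtained, so the log2 discount, the use of the best possible ordering as the maximum, and all names and data below are assumptions for illustration only.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: the gain at 1-based rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def dcg_of_list(ranked_papers, bids):
    """DCG of a suggested ranking, scoring each paper by the reviewer's bid (0 if no bid)."""
    return dcg([bids.get(paper, 0) for paper in ranked_papers])


def ideal_dcg(bids, k):
    """Assumed maximum: DCG of a length-k list that places the highest bids first."""
    best = sorted(bids.values(), reverse=True)[:k]
    return dcg(best)


# Hypothetical reviewer: bid values and a suggested list are made up.
bids = {"p1": 3, "p2": 2, "p3": 1}
suggested = ["p2", "p4", "p1", "p3"]

score = dcg_of_list(suggested, bids)
best = ideal_dcg(bids, len(suggested))
print(f"{score:.2f} out of {best:.2f} ({score / best:.0%})")
```

Under these assumptions, a list scores higher when the papers a reviewer bid on most strongly appear near the top, which is the sense in which each group's own list fits its bids best in the table.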

A micro-survey was also run to collect further information on how users liked their short list. 85% of the participants indicated that they had used the list interface provided to them. The following is the preference indicated by each group (~75 reviewers in each group, ~2% error):

| | CMT | TMS | CMT+TMS |
|---|---|---|---|
| Preferred CMT over list | 15% | 12% | 8% |
| Preferred list+CMT | 81% | 83% | 83% |
| Preferred list over CMT | 4% | 5% | 9% |

It is obvious from the above that most participants found the list useful in conjunction with CMT (suggesting that the list should be integrated inside CMT). We can also see that those who were presented with a list based on TMS scores were more likely to find the list useful.

Note that all of the above was done in a long, hectic, but fun weekend.

Imputing Missing Bids
CMT assumes that reviewers are not willing to review a paper unless stated otherwise. It does not differentiate between an unseen (but potentially relevant) paper and a paper that has been seen and ignored. This is a real shortcoming when it comes to matching papers to reviewers, especially for reviewers who did not bid often. To mitigate this problem, we used the click information on the shortlist presented to the reviewers to find out which papers had been observed and ignored. We then imputed these cases as real non-willing bids.
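
The idea of separating "seen and ignored" from "never seen" can be sketched as follows. The data layout (bids keyed by reviewer and paper, a set of clicked papers per reviewer) and the value 0 for a non-willing bid are assumptions made for this illustration, not a description of the CMT data or of the actual script used.

```python
def impute_seen_but_ignored(bids, clicked, not_willing=0):
    """Return a copy of `bids` where clicked-but-unbid papers get an explicit non-willing bid.

    bids:    dict mapping (reviewer, paper) -> bid value entered by the reviewer
    clicked: dict mapping reviewer -> set of papers the reviewer opened on the shortlist
    Pairs that were neither bid on nor clicked stay absent, i.e. remain "unseen".
    """
    imputed = dict(bids)
    for reviewer, papers in clicked.items():
        for paper in papers:
            # setdefault keeps a real bid if one exists and only fills the gaps.
            imputed.setdefault((reviewer, paper), not_willing)
    return imputed


# Hypothetical example: r1 clicked p1 and p2 but only bid on p1.
bids = {("r1", "p1"): 3}
clicked = {"r1": {"p1", "p2"}}
imputed = impute_seen_but_ignored(bids, clicked)
```

Here `("r1", "p2")` becomes an explicit non-willing bid, while a paper such as `"p3"` that r1 never opened remains missing rather than being treated as a refusal.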

Around 30 reviewers did not provide any bids (and many had only a few). This is problematic because the tools used to do the actual reviewer-paper matching tend to assign the papers without any bids to the reviewers who did not bid, regardless of the match in expertise.

Once the bidding information was in and imputation was done, we had to fill in the rest of the paper-reviewer bidding matrix to mitigate the problem with sparse bidders. This was done, once again, through TMS, but this time using a supervised learning approach.

Using supervised learning was more delicate than expected. To deal with the wildly varying number of bids per person, we imputed zero bids, first from papers that were plausibly skipped over, and if necessary at random from papers not bid on, such that each person had the same expected bid in the dataset. From this dataset, we held out a random bid per person, and then trained to predict the held-out bid well. Most optimization approaches performed poorly because the number of features greatly exceeded the number of labels. The best approach we found used the online algorithms in Vowpal Wabbit with a mass personalized training method similar to the one discussed here. This trained predictor was used to predict bid values for the full paper-reviewer bid matrix.
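
The dataset preparation described above might look roughly like the sketch below. It reads "the same expected bid" as "the same mean bid per person", which is an interpretation rather than something the post states, and the function names, the target mean, and the data are all hypothetical. The Vowpal Wabbit training step is not shown.

```python
import random


def zeros_needed(bid_values, target_mean):
    """How many zero bids to add so the mean bid drops to roughly target_mean."""
    total = sum(bid_values)
    if total <= 0:
        return 0
    return max(0, round(total / target_mean) - len(bid_values))


def balance_reviewer(bids, skipped, unbid, target_mean, rng):
    """Add zero bids, first from plausibly skipped papers, then at random from other unbid papers."""
    need = zeros_needed(list(bids.values()), target_mean)
    from_skipped = [p for p in skipped if p not in bids][:need]
    pool = [p for p in unbid if p not in bids and p not in from_skipped]
    from_random = rng.sample(pool, min(need - len(from_skipped), len(pool)))
    balanced = dict(bids)
    for paper in from_skipped + from_random:
        balanced[paper] = 0
    return balanced


def hold_out_one(bids, rng):
    """Split one reviewer's bids into a training dict and a single held-out (paper, bid) pair."""
    paper = rng.choice(sorted(bids))
    train = {p: b for p, b in bids.items() if p != paper}
    return train, (paper, bids[paper])


# Hypothetical reviewer with two real bids; target mean of 0.5 is made up.
rng = random.Random(0)
balanced = balance_reviewer(
    bids={"a": 3, "b": 1},
    skipped=["c", "d"],
    unbid=["e", "f", "g", "h", "i"],
    target_mean=0.5,
    rng=rng,
)
train, held_out = hold_out_one(balanced, rng)
```

With these inputs, six zero bids are added (two from the skipped papers and four at random), so the reviewer ends up with eight entries whose mean is the target.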

Automated Area Chair and First Reviewer Assignment
Once we had the imputed paper-reviewer bidding matrix, CMT was used to generate the actual match between papers and area chairs, and (separately) between papers and reviewers. Each paper had two area chairs (sometimes called “meta-reviewers” in CMT) assigned to it, one primary, one secondary, by running two rounds of assignments (so that the primary was usually the “better” match). One reviewer per paper was also assigned automatically by CMT in a similar fashion. CMT provides proper load balancing, so that all area chairs and reviewers had similar loads.
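
The post does not describe CMT's matching algorithm, so the following is only a toy stand-in: a greedy two-round assignment with a per-person load cap, written to illustrate why the primary area chair usually ends up being the better match. The scores, names, and cap are invented for the example.

```python
def assign_round(scores, papers, people, max_load, load, exclude):
    """Greedy round: walk (paper, person) pairs from best score down, one person per paper.

    `load` is updated in place so a later round respects earlier assignments.
    `exclude` maps paper -> set of people who may not be assigned to it.
    """
    pairs = sorted(
        ((scores.get((paper, person), 0.0), paper, person)
         for paper in papers for person in people),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    assigned = {}
    for _score, paper, person in pairs:
        if paper in assigned or load[person] >= max_load:
            continue
        if person in exclude.get(paper, ()):
            continue
        assigned[paper] = person
        load[person] += 1
    return assigned


def assign_primary_and_secondary(scores, papers, chairs, max_load):
    """Run two rounds; the second may not reuse a paper's primary chair."""
    load = {chair: 0 for chair in chairs}
    primary = assign_round(scores, papers, chairs, max_load, load, exclude={})
    taken = {paper: {chair} for paper, chair in primary.items()}
    secondary = assign_round(scores, papers, chairs, max_load, load, exclude=taken)
    return primary, secondary


# Hypothetical predicted bids for (paper, area chair) pairs.
scores = {
    ("p1", "a"): 0.9, ("p1", "b"): 0.5, ("p1", "c"): 0.2,
    ("p2", "a"): 0.8, ("p2", "b"): 0.1, ("p2", "c"): 0.7,
}
primary, secondary = assign_primary_and_secondary(
    scores, ["p1", "p2"], ["a", "b", "c"], max_load=2
)
```

Because the first round picks from the full pool and the second round only from whoever still has capacity, the first-round match for a paper tends to have the higher score. A real system would likely solve the assignment jointly rather than greedily.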

Manual Checks of the Automated Assignments
Before finalizing the automated assignment, we manually looked through the list of papers to fix any potential problems that were not handled by the automated process. The two major cases were papers that did not go through the TMS system (authors did not agree to do so), and cases of poor primary-secondary meta-reviewer pairs (when the two area chairs are judged to be too close to offer independent assessment, e.g. working at the same institution, previous supervisor-student relationship).

Second and Third Reviewer Assignment
Once the initial assignments were announced, we asked the two area chairs for a given paper to each manually assign another reviewer from the PC. To help area chairs with this, we generated a shortlist of 10 recommended reviewers for each paper (using the estimated bid matrix and TMS score, with the CMT matching algorithm for load balancing of reviewer suggestions). Area chairs were free to use this list, to select from the complete program committee, or to seek an outside reviewer, who was then added to the PC, an option used 80 times. The load for each reviewer was restricted to at most 7 papers, with exceptions when they explicitly agreed to more.

The second and third uses of TMS, including the new supervised learning system, led to another long hectic weekend with Laurent, Mahdi, Joelle, and John all deeply involved.

Reviews
Most papers received at least 3 full reviews in the first round. Reviewers could not see each other’s reviews until they submitted their own. ML-Journaled submissions (see double submission guide) were reviewed only by two area chairs. In a small number of regular submissions (less than 10), we received 2 very negative reviews and notified the third reviewer (who was usually late by this point!) that we would not need their review.

Authors’ Response
Authors were given a chance to respond to the reviews during a short feedback period. This is becoming a standard practice in machine learning conferences. Authors were also allowed to upload a new version of the paper. The motivation here is that in some cases, it is easier to show the changes directly in the paper, rather than discuss them separately.

Our analysis shows that authors’ responses and subsequent discussions by reviewers made significant changes to the scoring of papers. A total of ~35% of the papers had some change in their scores after the author feedback. The average score for ~50% of the papers went down, stayed the same for ~10%, and went up for the other ~40%. The variance on the scores decreased by ~20%, indicating some convergence in the decisions.

Final Decisions
To help us better decide on the quality of the papers, we asked the primary area chairs to provide a meta-review for each of their papers. For papers without unanimous review decisions (i.e. some reviewers wanted to accept and some wanted to reject), we asked the secondary area chair to (independently) fill in a meta-review, recommending whether to accept or reject the paper. A total of 1214 meta-reviews were provided. There were also 20 papers for which a 4th review was added in this period.

In all cases where the primary and secondary area chairs disagreed on the decision, the program chairs were directly involved, reviewing all the evidence (reviews, rebuttal, discussion, often the paper itself), and entering into a discussion (usually via email) with the area chairs, until a unanimous decision was achieved.

A total of 243 papers (27% of submissions) were accepted. Author notifications were sent out on April 30.
