Experiments with the ICML 2020 Peer-Review Process

This post is cross-listed on the CMU ML blog.

The International Conference on Machine Learning (ICML) is a flagship machine learning conference that in 2020 received 4,990 submissions and managed a pool of 3,931 reviewers and area chairs. Given that the stakes in the review process are high — the careers of researchers are often significantly affected by the publications in top venues — we decided to scrutinize several components of the peer-review process in a series of experiments. Specifically, in conjunction with the ICML 2020 conference, we performed three experiments that target: resubmission policies, management of reviewer discussions, and reviewer recruiting. In this post, we summarize the results of these studies.

Resubmission Bias

Motivation. Several leading ML and AI conferences have recently started requiring authors to declare previous submission history of their papers. In part, such measures are taken to reduce the load on reviewers by discouraging resubmissions without substantial changes. However, this requirement poses a risk of bias in reviewers’ evaluations.

Research question. Do reviewers get biased when they know that the paper they are reviewing was previously rejected from a similar venue?

Procedure. We organized an auxiliary conference review process with 134 junior reviewers from 5 top US schools and 19 papers from various areas of ML. We assigned each participant one paper and asked them to review it as if it had been submitted to ICML. Unbeknownst to participants, we allocated them to a test or control condition uniformly at random:

Control. Participants review the papers as usual.

Test. Before reading the paper, participants are told that the paper they review is a resubmission.

Hypothesis. We expect that if the bias is present, reviewers in the test condition should be harsher than in the control. 

Key findings. Reviewers give a score almost one point lower (95% confidence interval: [0.24, 1.30]) on a 10-point Likert item for the overall evaluation of a paper when they are told that it is a resubmission. Among the narrower review criteria, reviewers underrate “Paper Quality” the most.
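To make the comparison concrete, here is a minimal sketch of a two-sample estimate with a normal-approximation confidence interval, in Python. The scores below are hypothetical placeholders, and the paper itself uses a more careful analysis.

import numpy as np

# Hypothetical overall scores on the 10-point item; the real data and analysis are in the paper.
control = np.array([6, 7, 5, 8, 6, 7, 6, 5, 7, 6], dtype=float)
test = np.array([5, 6, 4, 7, 5, 6, 5, 4, 6, 5], dtype=float)

# Estimated bias: the average drop in score under the "resubmission" condition.
gap = control.mean() - test.mean()
se = np.sqrt(control.var(ddof=1) / len(control) + test.var(ddof=1) / len(test))
low, high = gap - 1.96 * se, gap + 1.96 * se  # normal-approximation 95% CI
print(f"estimated gap: {gap:.2f} points, 95% CI: [{low:.2f}, {high:.2f}]")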

Implications. Conference organizers need to weigh the envisaged benefits, such as a hypothetical reduction in the number of submissions, against the unfairness that resubmission bias introduces into the process. One option for reducing the bias is to reveal the resubmission signal only after the initial reviews are submitted. This finding should also be taken into account when deciding whether reviews of rejected papers should be made publicly available on systems such as openreview.net.

Details. http://arxiv.org/abs/2011.14646

Herding Effects in Discussions

Motivation. Past research on human decision making shows that group discussion is susceptible to various biases related to social influence. For instance, it is documented that the decision of a group may be biased towards the opinion of the group member who proposes the solution first. We call this effect herding and note that, in peer review, herding (if present) may result in undesirable artifacts in decisions as different area chairs use different strategies to select the discussion initiator.

Research question. Conditioned on a set of reviewers who actively participate in a discussion of a paper, does the final decision of the paper depend on the order in which reviewers join the discussion?

Procedure. We performed a randomized controlled trial on herding in ICML 2020 discussions that involved about 1,500 papers and 2,000 reviewers. In peer review, the discussion takes place after the reviewers submit their initial reviews, so we know prior opinions of reviewers about the papers. With this information, we split a subset of ICML papers into two groups uniformly at random and applied different discussion-management strategies to them: 

Positive Group. First ask the most positive reviewer to start the discussion, then later ask the most negative reviewer to contribute to the discussion.

Negative Group. First ask the most negative reviewer to start the discussion, then later ask the most positive reviewer to contribute to the discussion.

Hypothesis. The only difference between the strategies is the order in which reviewers are supposed to join the discussion. Hence, if herding is absent, the strategies will not impact submissions from the two groups disproportionately. However, if herding is present, we expect the difference in order to produce a difference in acceptance rates across the two groups of papers.

Key findings. The analysis of outcomes of approximately 1,500 papers does not reveal a statistically significant difference in acceptance rates between the two groups of papers. Hence, we find no evidence of herding in the discussion phase of peer review.
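As a rough illustration of the kind of comparison involved, here is a sketch of a two-proportion z-test on acceptance rates. The counts are hypothetical placeholders, and the paper's analysis accounts for the actual experimental design.

import numpy as np
from scipy.stats import norm

# Hypothetical (accepted, total) counts for the two discussion-order groups.
acc_pos, n_pos = 160, 750
acc_neg, n_neg = 150, 750

p_pos, p_neg = acc_pos / n_pos, acc_neg / n_neg
p_pool = (acc_pos + acc_neg) / (n_pos + n_neg)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_pos + 1 / n_neg))
z = (p_pos - p_neg) / se
p_value = 2 * norm.sf(abs(z))  # two-sided p-value
print(f"acceptance rates: {p_pos:.3f} vs {p_neg:.3f}, z = {z:.2f}, p = {p_value:.3f}")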

Implications. Although herding has been found to occur in other settings involving group decision making, discussion in peer review does not appear to be susceptible to this effect, so no specific measures to counteract herding in peer-review discussions seem necessary.

Details. https://arxiv.org/abs/2011.15083

Novice Reviewer Recruiting

Motivation.  A surge in the number of submissions received by leading ML and  AI conferences has challenged the sustainability of the review process by increasing the burden on the pool of qualified reviewers. Leading conferences have been addressing the issue by relaxing the seniority bar for reviewers and inviting very junior researchers with limited or no publication history, but there is mixed evidence regarding the impact of such interventions on the quality of reviews. 

Research question. Can very junior reviewers be recruited and guided such that they enlarge the reviewer pool of leading ML and AI conferences without compromising the quality of the process?

Procedure. We implemented a twofold approach towards managing novice reviewers:

Selection. We evaluated the reviews written in the aforementioned auxiliary conference review process involving 134 junior reviewers, and invited the 52 reviewers who produced the strongest reviews to join the reviewer pool of ICML 2020. Most of these 52 “experimental” reviewers came from a population that would not have been considered by the conventional reviewer-recruiting process used in ICML 2020.

Mentoring. In the actual conference, we provided these experimental reviewers with a senior researcher as a point of contact who offered additional mentoring.

Hypothesis. If our approach succeeds in bringing strong reviewers into the pool, we expect experimental reviewers to perform at least as well as reviewers from the main pool on various metrics, including the quality of reviews as rated by area chairs.

Key findings. The combination of the selection and mentoring mechanisms yields reviews of quality at least comparable to, and on some metrics higher-rated than, the conventional pool of reviews: 30% of reviews written by the experimental reviewers exceeded the expectations of area chairs, compared to only 14% for the main pool.

Implications. The experiment received positive feedback from participants who appreciated the opportunity to become a reviewer in ICML 2020 and from authors of papers used in the auxiliary review process who received a set of useful reviews without submitting to a real conference. Hence, we believe that a promising direction is to replicate the experiment at a larger scale and evaluate the benefits of each component of our approach.

Details. http://arxiv.org/abs/2011.15050

Conclusion

All in all, the experiments we conducted in ICML 2020 reveal some useful and actionable insights about the peer-review process. We hope that some of these ideas will help to design a better peer-review pipeline in future conferences.

We thank ICML area chairs, reviewers, and authors for their tremendous efforts. We would also like to thank the Microsoft Conference Management Toolkit (CMT) team for their continuous support and implementation of features necessary to run these experiments, the authors of papers contributed to the auxiliary review process for their responsiveness, and participants of the resubmission bias experiment for their enthusiasm. Finally, we thank Ed Kennedy and Devendra Chaplot for their help with designing and executing the experiments.

The post is based on the works by Ivan Stelmakh, Nihar B. Shah, Aarti Singh, Hal Daumé III, and Charvi Rastogi.

Please vote

This is not at all related to Machine Learning.

I lived in Squirrel Hill as a graduate student at Carnegie Mellon, so the massacre there feels particularly immediate. While the person who did it is obviously culpable, the pattern of events makes it clear that others bear responsibility as well. This pattern includes an attempted bombing of Democrats and Trump critics by a Trump fanboy. It also includes a broader cross section of Republicans and their leaders pushing anti-semitism and more general xenophobia about migrants.

I don’t believe that stochastic terrorism is the goal here. Instead, I have a rather pessimal view of politics in which politicians do pretty much anything to get re-elected, at least in aggregate. Donald Trump’s presidential campaign showed how to do this with a platform of populism, nostalgia, xenophobia, and anti-abortion voters.

The populist angle is looking fairly broken now between anti-populist tax cuts and widely publicized efforts to allow preexisting condition discrimination by insurance companies via Obamacare repeal. About the only populist angle which works is the economy, which is doing fine. On the other hand, there is no obvious change in employment trends since 2011 and no change in wage trends since 2014 so the case for responsibility is clearly tenuous.

Alliances in a two-party system tend to be fragile since winning with a smaller constituency enables better serving that constituency. Losing the populist angle leaves a double-down on the remaining agenda as the most plausible choice. Xenophobia is much older than democracy and psychologically potent so it has obvious value. It’s historically used by leaders who pick some characteristic to divide people and position themselves to thrive on the conflict or distraction that creates. Almost anything will do—if you take away religion, birthplace, skin color, and ethnicity, it would just change to hair color, nose size, or left-handedness. In a democracy, the goal with this approach is simply convincing people to vote according to their activated xenophobia.

For people embracing xenophobia to retain power, stochastic terrorism is just an unfortunate side effect. In this sense, inciting xenophobia about a caravan of refugee Guatemalans at the other end of Mexico is rather clever, since most of them won’t even make it to the US border until months after the election, plausibly leaving only electoral consequences. Yet xenophobia is known to be hard to control. Given this, it’s difficult to imagine stochastic terrorism as anything other than an observed consequence of this behavior that the Republican party leadership has deliberately accepted. The Squirrel Hill massacre and the attempted bombing campaigns are precisely the sort of thing that can happen when you dial up the rhetoric just before an election.

This is part of a pattern of moral collapse across the Republican party. By any reasonable measure Donald Trump is a serial liar with Republican politicians now mimicking this behavior. A remarkable set of people around the Trump campaign are confessed or convicted criminals with members of the Republican party variously tolerating, condoning, and perhaps mimicking.

In this context, the upcoming midterm election seems particularly important. If politicians in aggregate behave as if they will do anything to get reelected, then voters must vote for the behavior they want at the ballot box rather than relying on or appealing to it at a later date. In most situations, this is about picking and choosing the better candidate. I’ve been registered as an independent for this reason—I want to decide for myself.

This is not most situations. Do voters rebuke the Republican party or not? If the answer is not (a 37% chance according to bettors at present) then the slide into corruption likely accelerates as confirmed control of the government erodes the remaining institutional checks on corruption. We are several steps away from a state of deep corruption and it takes time for the consequences of corruption to really seep into society. But every step on the path makes the situation worse and we are on the wrong path now as evidenced by bombing attempts, a xenophobic massacre, and the wider context creating them.

I want to particularly encourage those who are eligible to vote in the United States midterms November 6th.

ICML 2016 was awesome

I had a fantastic time at ICML 2016— I learned a great deal. There was far more good stuff than I could see, and it was exciting to catch up on recent advances.

David Silver gave one of the best tutorials I’ve seen on his group’s recent work in “deep” reinforcement learning. I learned about a few new techniques, including the benefits of asynchronous updates in distributed Q-learning https://arxiv.org/abs/1602.01783, which was presented in more detail at the main conference. The new domains being explored were exciting, as were the improvements made on the computational side. I would have loved more pointers to some of the related work from the tutorial, particularly given there was such an exciting mix of new techniques and old staples (e.g. experience replay http://www.dtic.mil/dtic/tr/fulltext/u2/a261434.pdf ), but the talk was so information-packed it would have been difficult.

Pieter Abbeel gave an outstanding talk in the Abstraction in RL workshop http://rlabstraction2016.wix.com/icml#!schedule/bx34m, and (I heard) another excellent one during the deep learning workshop.
It was rumored that Aviv Tamar gave an exciting talk (I believe on this http://arxiv.org/abs/1602.02867), but I was forced to miss it to see Rong Ge’s https://users.cs.duke.edu/~rongge/ outstanding talk on a new-ish geometric tool for understanding non-convex optimization, the strict saddle. I first read about the approach here http://arxiv.org/abs/1503.02101, but at ICML he and other authors demonstrated a remarkable number of problems that have this property, which enables efficient optimization via stochastic gradient descent (and other) procedures.

This was a theme of ICML— an incredible amount of good material, so much that I barely saw the posters at all because there was nearly always a talk I wanted to see!

Rocky Duan surveyed some benchmark RL continuous control problems http://jmlr.org/proceedings/papers/v48/duan16.pdf. An interesting theme of the conference, which also came up in conversation with John Schulman and Yann LeCun, was really old methods working well. In fact, this group demonstrated that variants of the natural/covariant policy gradient proposed originally by Sham Kakade (with a derivation here: http://repository.cmu.edu/cgi/viewcontent.cgi?article=1080&context=robotics) are largely at the state of the art on many benchmark problems. There are some clever tricks necessary for large policy classes like neural networks (like using a partial-least-squares-style truncated conjugate gradient to solve for the change in policy in the usual F \delta = \nabla equation one solves in the natural gradient procedure) that dramatically improve performance (https://arxiv.org/abs/1502.05477). I had begun to view these methods as doing little better (or worse) than black-box search, so it’s exciting to see them make a comeback.
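For concreteness, here is a minimal sketch of that step under the usual setup: solve F \delta = g approximately with conjugate gradient, touching the Fisher matrix F only through Fisher-vector products. The explicit matrix in the toy usage below is just a stand-in for the Fisher-vector product a real policy class would provide.

import numpy as np

def conjugate_gradient(fisher_vector_product, g, iters=10, tol=1e-10):
    # Approximately solve F x = g using only the map x -> F x.
    x = np.zeros_like(g)
    r = g.copy()              # residual g - F x (x starts at 0)
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Fp = fisher_vector_product(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy usage: in a real implementation the Fisher-vector product comes from the
# policy network; here a small explicit matrix stands in for F.
F = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -0.5])
delta = conjugate_gradient(lambda v: F @ v, g)
print(delta, np.linalg.solve(F, g))  # the two should roughly agree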

Chelsea Finn http://people.eecs.berkeley.edu/~cbfinn/ gave an outstanding talk on this work https://arxiv.org/abs/1603.00448. She and co-authors (Sergey Levine and Pieter) effectively came up with a technique that lets one apply Maximum Entropy Inverse Optimal Control without the double-loop procedure and using policy gradient techniques.  Jonathan Ho described a related algorithm http://jmlr.org/proceedings/papers/v48/ho16.pdf that also appeared to mix policy gradient and an optimization over cost functions. Both are definitely on my reading list, and I want to understand the trade-offs of the techniques.

Both presentations were informative, and both made the interesting connection to Generative Adversarial Nets (GANs) http://arxiv.org/abs/1406.2661. These were also a theme of the conference, both in talks and during discussions. It’s a very cool idea that is getting more traction and being embraced by the neural net pioneers.

David Belanger https://people.cs.umass.edu/~belanger/belanger_spen_icml.pdf gave an interesting talk on using backprop to optimize a structured output relative to a learned cost function. I left thinking the technique was closely related to inverse optimal control methods and GANs, and wanting to understand why implicit differentiation wasn’t being used to optimize the energy function parameters.

Speaking of neural net pioneers: there were lots of good talks during both the main conference and the workshops on what’s new (and what’s old https://sites.google.com/site/nnb2tf/) in neural network architectures and algorithms.

I was intrigued by http://jmlr.org/proceedings/papers/v48/balduzzi16.pdf and particularly by the well-written blog post it mentions, http://colah.github.io/posts/2015-09-NN-Types-FP/ by Christopher Olah. The notion that we need language tools to structure the design of learning programs (e.g. http://www.umiacs.umd.edu/~hal/docs/daume14lts.pdf) and tools to reason about them seems to be gaining currency. After reading these, I began to view some of the recent work of Wen, Arun, Byron, and myself (including at ICML http://jmlr.org/proceedings/papers/v48/sun16.pdf) in this light: generative RNNs “should” have a well-defined hidden state whose “type” is effectively (moments of) future observations. I wonder now if there is a larger lesson here in the design of learning programs.

Nando de Freitas and colleagues’ approach of separating value and advantage function predictions in one network http://jmlr.org/proceedings/papers/v48/wangf16.pdf was quite interesting and had a lot of buzz.
Ian Osband gave an amazing talk on another topic that previously made me despair: exploration in RL http://jmlr.org/proceedings/papers/v48/osband16.pdf. This is one of the few approaches that combines the ability to use function approximation with rigorous exploration guarantees/sample complexity in the tabular case (and, amazingly, *better* sample complexity than previous papers that work only in the tabular case). Super cool and also very high on my reading list.

Boaz Barak http://www.boazbarak.org/ gave a truly inspired talk that mixed a kind of coherent computationally-bounded Bayesian-ism (Slogan: “Compute like a frequentist, think like a Bayesian.”) with demonstrating a lower bound for SoS procedures. Well outside of my expertise, but delivered in a way that made you feel like you understood all of it.

Honglak Lee gave an exciting talk on the benefits of semi-supervision in CNNs http://web.eecs.umich.edu/~honglak/icml2016-CNNdec.pdf. The authors demonstrated that a remarkable amount of information needed to reproduce an input image was preserved quite deep in CNNs, and further that encouraging the ability to reconstruct could significantly enhance discriminative performance on real benchmarks.

The problem with this ICML is that I think it would take literally weeks of reading/watching talks to really absorb the high quality work that was presented. I’m *very* grateful to the organizing committee http://icml.cc/2016/?page_id=39 for making it so valuable.

Research Political Issues

I’ve avoided discussing politics here, although not for lack of interest. The problem with discussing politics is that it’s customary for people to say much based upon little information. Nevertheless, politics can have a substantial impact on science (and we might hope for the vice-versa). It’s primary election time in the United States, so the topic is timely, although the issues are not.

There are several policy decisions which substantially affect the development of science and technology in the US.

  1. Education The US has great contrasts in education. The top universities are very good places, yet the grade school education system produces mediocre results. For me, the contrast between a public education and Caltech was bracing. For many others attending Caltech, it clearly was not. Upgrading the k-12 education system in the US is a long-standing chronic problem which I know relatively little about. My own experience is that a basic attitude of “no child unrealized” is better than “no child left behind”. A fair claim can also be made that the US just doesn’t invest enough.
  2. Respect Lack of respect for science and technology is routinely expressed in many ways in the US.
    1. The most bald form of lack of respect is scientific censorship. This may be easily understood as a generality: you choose to spend a large fraction of your life learning to interpret some part of the world. After years, you come to some conclusion about the nature of the world. Then, someone with no particular experience or expertise tells you to alter it.
    2. A more refined form of lack of respect is simply lack of presence in decision making. This isn’t necessarily intentional: many people simply make decisions from the gut, and then come up with reasons to justify their decision. This style explicitly cuts out the deep thinking of science. Many policies could have been better informed by a serious consideration of even basic science:
      1. The oil of Iraq is fundamentally less valuable if we are going to tackle global warming.
      2. Swapping gasoline for a hydrogen-based transportable energy source is dubious because it introduces another energy storage conversion in which to lose energy. The same goes for swapping bioethanol for gasoline. In contrast, hybrid and electric vehicles actually recover substantial energy from regenerative braking, and a plug-in hybrid could run off electricity in typical commuter usage.
      3. The Space Shuttle is a boondoggle design. The rocket equation implies that the ratio of initial to final mass for vehicles reaching earth orbit must be at least a factor of e^{2.5} (it’s actually e^{2.93} for the Space Shuttle). Making the system reusable implies that most of this mass returns to earth, so the payload deliverable into space is only 1.2% of the liftoff mass (see the quick check of these numbers after this list). A better designed system might deliver payloads a factor of 4 larger or be much smaller.
      4. Passenger inspection at airports is another poor policy from the perspective of science. It isn’t effective, and there is no cost-efficient way to make it effective against a motivated opponent. Solid evidence for this is the continued use of mules to smuggle drugs. The basic problem, from a chemistry point of view, is that too much can be done with a small amount of mass. Deterrence and limitation (armored cockpits and active resistance, for example) are fine policies.
    3. Lack of support. The simplest form of lack of respect is simply lack of support. The case for federal vs corporate funding of basic science and technology development is very simple: the benefit to society of conducting such work dramatically exceeds the benefit any one agent within society (such as a company) could gain from it. Of late, investment in core science has been an anemic 0.0005 of GDP, and visa issues hamstring broader technology development.
  3. Confidence This is primarily related to the technology side of science and technology. Many policy decisions are made without confidence in the ability of technologists to adapt. This comes in at least two flavors.
    1. The foreordained solution. Policy often comes in the form “we use approach X to solve problem Y” (some examples are above). This demonstrates an overconfidence by policy makers in their ability to pick the winner, and a lack of confidence in the ability of technologists to solve problems. It also represents an opportunity for large established industries to get huge payoffs at taxpayer expense. The X-prize represents the opposite of this approach, and it has been radically more effective by any reasonable standard.
    2. Confusion about the meaning of wealth. Some people believe that wealth is about what you have. However, for a society it seems much better to measure wealth in terms of what the society can do. Policy makers often forget that science and technology is a capability when it comes time to think of a solution. For example, someone with no confidence in the ability to create and make affordable plug-in electric hybrids might think it necessary to engage in conquest for oil.
  4. Stability People can’t program, do science, or invent new things when they are worried about more immediate events. There are several destabilizing trends going on in the US right now which either now or in the future may make it hard to focus away from immediate concerns.
    1. Debt and money supply. The federal debt for the US government is about 3.5 times the federal budget. This is bad for the simple reason that investors buying US treasury bonds aren’t investing in new technology. However, the destabilizing concern is more subtle. Since World War II, the US dollar has become the standard currency for exchange around the world. Since government debt creates a temptation to (effectively) print money, the number of dollars in circulation has been rapidly growing. But a growing number of dollars means that the currency is devaluing, which makes owning dollars undesirable. I don’t know an example of a previous world currency that has ceased to be such, but basic economics says that bad things happen to dollar-based savings if all the dollars flow back into the US. So far, the decline of the dollar has been relatively gradual, but a very disruptive cliff might exist out there somewhere. Policies which increase debt (like cutting taxes and increasing spending) exacerbate this problem. There is no fix once the dollar loses world currency status because confidence can be lost quickly, but not regained.
    2. Health Care. The US is running an experiment to determine how large a fraction of GDP can be devoted to health care. Currently it’s over 15%, in first place, and growing. This is even worse than it sounds, because many comparable countries in Europe (or Japan) have older populations which should generally be more expensive to take care of. In the present situation, because health care is incredibly expensive, losing health insurance (which is typically tied to a job) is potentially catastrophic for any individual.
    3. Wealth Asymmetry. The US has shifted towards a substantially more asymmetric division of wealth since the 1970s. An asymmetric division of wealth is not fundamentally bad—there needs to be room for great success to imply great rewards. However, a casual comparison of science and technology development with the Gini coefficient map reveals that a large Gini coefficient and substantial science and technology development do not coincide. The problem is that wealth becomes inheritable, and it’s very unlikely that the wealth is inherited by someone interested in science and technology. Wealth is now scheduled to become perfectly inheritable in 2010 in the US.
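As a quick check of the Space Shuttle numbers quoted in the list above (a sketch that uses only the factor e^{2.93} stated there): the rocket equation

\Delta v = v_e \ln(m_0 / m_f), i.e. m_0 / m_f = e^{\Delta v / v_e},

with m_0 / m_f = e^{2.93} gives a fraction of liftoff mass reaching orbit of m_f / m_0 = e^{-2.93} \approx 0.053, about 5.3%. Since the reusable orbiter itself accounts for most of that returning mass, the net payload delivered to orbit falls to roughly the 1.2% of liftoff mass cited above.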

I’m sure some of these issues are endemic to many other parts of the world as well, because there are fundamental conceptual difficulties with investing in the unknown instead of the known.

Regularization = Robustness

The Gibbs-Jaynes theorem is a classical result that tells us that the highest entropy distribution (most uncertain, least committed, etc.) subject to expectation constraints on a set of features is an exponential family distribution with the features as sufficient statistics. In math,

argmax_p H(p)
s.t. E_p[f_i] = c_i

is given by e^{\sum \lambda_i f_i}/Z. (Z here is the necessary normalization constant, and the lambdas are free parameters we set to meet the expectation constraints).
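A sketch of the standard derivation for the discrete case: form the Lagrangian of the constrained problem (including the normalization constraint) and set its derivative with respect to each p(x) to zero.

L(p, \lambda, \mu) = -\sum_x p(x) \log p(x) + \sum_i \lambda_i (\sum_x p(x) f_i(x) - c_i) + \mu (\sum_x p(x) - 1)

\partial L / \partial p(x) = -\log p(x) - 1 + \sum_i \lambda_i f_i(x) + \mu = 0
=> p(x) = e^{\sum_i \lambda_i f_i(x)} / Z,

where Z absorbs the constants and the \lambda_i are chosen to meet the expectation constraints.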

A great deal of statistical mechanics flows from this result, and it has proven very fruitful in learning as well. (It motivated work on maximum entropy models in text learning and Conditional Random Fields, for instance.) The result has been demonstrated in a number of ways. One of the most elegant is the “geometric” version here.

In the case when the expectation constraints come from data, this tells us that the maximum entropy distribution is exactly the maximum likelihood distribution in the exponential family. It’s a surprising connection and the duality it flows from appears in a wide variety of work. (For instance, Martin Wainwright’s approximate inference techniques rely (in essence) on this result.)

In practice, we know that Maximum Likelihood with a lot of features is bound to overfit. The traditional trick is to pull a sleight of hand in the derivation. We start with the primal entropy problem, move to the dual, and in the dual add a “prior” that penalizes the lambdas. (Typically an l_1 or l_2 penalty or constraint.) This game is played in a variety of papers, and it’s a sleight of hand because the penalties don’t come from the motivating problem (the primal) but rather get tacked on at the end. In short: it’s a hack.

So I realized a few months back that the primal (entropy) problem that regularization relates to is remarkably natural. Basically, it tells us that regularization in the dual corresponds directly to uncertainty (mini-max) about the constraints in the primal. What we end up with is a distribution p that is robust in the sense that it maximizes the entropy subject to a large set of potential constraints. More recently, I realized that I’m not even close to being the first to figure that out. Miroslav Dudík, Steven J. Phillips, and Robert E. Schapire have a paper that derives this relation and then goes a step further to show what performance guarantees the method provides. It’s a great paper and I hope you get a chance to check it out:

Performance guarantees for regularized maximum entropy density estimation.

(Even better: if you’re attending ICML this year, I believe you will see Rob Schapire talk about some of this and related material as an invited speaker.)
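To make the correspondence concrete, here is a sketch of the statement I have in mind (see the paper above for the precise theorem and conditions): relax each expectation constraint to an interval of width \beta_i,

max_p H(p)   s.t.   |E_p[f_i] - c_i| \le \beta_i   for all i.

The dual of this relaxed problem is (up to constants) the penalized maximum likelihood problem

min_\lambda  \log Z(\lambda) - \sum_i \lambda_i c_i + \sum_i \beta_i |\lambda_i|,

so the uncertainty widths \beta_i in the primal appear exactly as l_1 regularization weights in the dual, and constraining the vector of expectation violations to an l_2 ball similarly produces an l_2-norm penalty on \lambda. When the c_i are empirical feature averages, the first two terms are the negative average log-likelihood of the exponential family model, so this is exactly regularized maximum likelihood.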

It turns out the idea generalizes quite a bit. In Robust design of biological experiments, P. Flaherty, M. I. Jordan, and A. P. Arkin show a related result where regularization follows directly from a robustness or uncertainty guarantee. And if you want the whole, beautiful framework, you’re in luck. Yasemin Altun and Alex Smola have a paper (which I haven’t yet finished, but which at least begins very well) that generalizes the regularized maximum entropy duality to a whole class of statistical inference procedures. If you’re at COLT, you can check this out as well.

Unifying Divergence Minimization and Statistical Inference via Convex Duality

The deep, unifying result seems to be what the title of the post says: robustness = regularization. This viewpoint makes regularization seem like much less of a hack, and goes further in suggesting just what range of constants might be reasonable. The work is very relevant to learning, but the general idea goes beyond to various problems where we only approximately know constraints.