What is the Right Response to Employer Misbehavior in Research?

I enjoyed my conversations with Timnit when she was in the MSR-NYC lab, so her situation has been on my mind throughout NeurIPS.

Piecing together what happened second-hand is always tricky, but Jeff Dean’s account and Timnit’s agree on a basic outline. Timnit and others wrote a paper for FAccT which was approved for submission by the normal internal review process, then later unapproved. Timnit threatened to leave unless various details about this unapproval were clarified. Google then declared her resigned.

The definition of resign makes it clear an employee does it, not an employer. Since that apparently never happened, this is a mischaracterized firing. It also seems quite credible that the unapproval process was highly unusual based on various reactions I’ve seen and my personal expectations of what researchers would typically tolerate.

This frankly looks bad to me and quite a number of other people. Aside from the plain facts, this is also consistent with racism and/or sexism given the roles of those involved. Google itself now faces a substantial rebellion amongst employees.

However, I worry about consequences to some of these reactions.

  1. Some people suggest not reviewing papers from Google-based researchers. As a personal decision, this is making a program chair’s difficult job harder. As a communal decision, this would devastate the community since a substantial fraction are employed at Google. These people did not make this decision and many actively support Timnit there (at some risk to their job) so a mass-punishment approach seems deeply counterproductive.
  2. Others have suggested that Google should not be a sponsor at major machine learning conferences. Since all of these are run as nonprofits, the lost grants will either be made up by increasing costs for everyone or reducing grants to students and diversity sponsorship. Reduced grants in particular seem deeply counterproductive.
  3. Some have suggested that all industry research in general is bad. Industrial research varies substantially from place to place, perhaps much more so than in academia. As an example, Microsoft Research has no similar internal review process for publications. Overall, the stereotyping inherent in this view makes me uncomfortable and there are some real advantages to working in industry in terms of ability to concentrate on research or effecting real change.

It’s critical to understand that the strength of the research community is incredibly valuable to the community. It’s not hard to imagine a different arrangement where all industrial research is proprietary, with only a few major companies operating competitive internal research teams. This sort of structure exists in some other fields, often to the detriment of anyone other than a major company. Researchers at those companies can’t as easily switch jobs and researchers outside of those companies may lack the context to even contribute to the state of the art. The field itself progresses slower and in a more secretive way due to lack of sharing. Anticommunal acts based on mass ostracization or abandonment could shift our structure from the current relatively happy equilibrium where people from all over can participate, learn, and contribute towards a much worse situation.

This is not to say that there are no consequences. The substantial natural consequences of a significant moral-impacting event will play out regardless of anything else. The marketplace for top researchers is quite competitive so for many of them uncertainty about the feasibility of publication, the disposition and competence of senior leadership, or constraints on topics tips the balance towards other offers. That may be severe this year, since this all blew up as the recruiting season was launching and I expect it to last over many years unless some significant action is taken. In this sense, I expect all the competitors may be looking forward to recruiting more than they were previously and the cost of not resolving the conflict here in a better way may be much, much higher than just about any other course of action. This is not particularly hypothetical—I saw it play out over the years after the silicon valley lab was cut as the brain drain of other great researchers in competitive areas was severe for several years afterwards.

I don’t think a general answer to the starting question is possible, since it will always depend on circumstances. Even this instance is complex with actions that could cause unintuitive adverse impacts on unanticipated parts of our community or damage the community as a whole. I personally hope that the considerable natural consequences here form a substantial deterrent to misbehavior in the long term. Please think this through when considering your actions here.

Edits: tweaked conclusion wording a bit with advice from reshamas.

Experiments with the ICML 2020 Peer-Review Process

This post is cross-listed on the CMU ML blog.

The International Conference on Machine Learning (ICML) is a flagship machine learning conference that in 2020 received 4,990 submissions and managed a pool of 3,931 reviewers and area chairs. Given that the stakes in the review process are high — the careers of researchers are often significantly affected by the publications in top venues — we decided to scrutinize several components of the peer-review process in a series of experiments. Specifically, in conjunction with the ICML 2020 conference, we performed three experiments that target: resubmission policies, management of reviewer discussions, and reviewer recruiting. In this post, we summarize the results of these studies.

Resubmission Bias

Motivation. Several leading ML and AI conferences have recently started requiring authors to declare previous submission history of their papers. In part, such measures are taken to reduce the load on reviewers by discouraging resubmissions without substantial changes. However, this requirement poses a risk of bias in reviewers’ evaluations.

Research question. Do reviewers get biased when they know that the paper they are reviewing was previously rejected from a similar venue?

Procedure. We organized an auxiliary conference review process with 134 junior reviewers from 5 top US schools and 19 papers from various areas of ML. We assigned participants 1 paper each and asked them to review the paper as if it was submitted to ICML. Unbeknown to participants, we allocated them to a test or control condition uniformly at random:

Control. Participants review the papers as usual.

Test. Before reading the paper, participants are told that the paper they review is a resubmission.

Hypothesis. We expect that if the bias is present, reviewers in the test condition should be harsher than in the control. 

Key findings. Reviewers give almost one point lower score (95% Confidence Interval: [0.24, 1.30]) on a 10-point Likert item for the overall evaluation of a paper when they are told that a paper is a resubmission. In terms of narrower review criteria, reviewers tend to underrate “Paper Quality” the most.

Implications. Conference organizers need to evaluate a trade-off between envisaged benefits such as the hypothetical reduction in the number of submissions and the potential unfairness introduced to the process by the resubmission bias. One option to reduce the bias is to postpone the moment in which the resubmission signal is revealed until after the initial reviews are submitted. This finding must also be accounted for when deciding whether the reviews of rejected papers should be publicly available on systems like openreview.net and others. 

Details. http://arxiv.org/abs/2011.14646

Herding Effects in Discussions

Motivation. Past research on human decision making shows that group discussion is susceptible to various biases related to social influence. For instance, it is documented that the decision of a group may be biased towards the opinion of the group member who proposes the solution first. We call this effect herding and note that, in peer review, herding (if present) may result in undesirable artifacts in decisions as different area chairs use different strategies to select the discussion initiator.

Research question. Conditioned on a set of reviewers who actively participate in a discussion of a paper, does the final decision of the paper depend on the order in which reviewers join the discussion?

Procedure. We performed a randomized controlled trial on herding in ICML 2020 discussions that involved about 1,500 papers and 2,000 reviewers. In peer review, the discussion takes place after the reviewers submit their initial reviews, so we know prior opinions of reviewers about the papers. With this information, we split a subset of ICML papers into two groups uniformly at random and applied different discussion-management strategies to them: 

Positive Group. First ask the most positive reviewer to start the discussion, then later ask the most negative reviewer to contribute to the discussion.

Negative Group. First ask the most negative reviewer to start the discussion, then later ask the most positive reviewer to contribute to the discussion.

Hypothesis. The only difference between the strategies is the order in which reviewers are supposed to join the discussion. Hence, if the herding is absent, the strategies will not impact submissions from the two groups disproportionately. However, if the herding is present, we expect that the difference in the order will introduce a difference in the acceptance rates across the two groups of papers.

Key findings. The analysis of outcomes of approximately 1,500 papers does not reveal a statistically significant difference in acceptance rates between the two groups of papers. Hence, we find no evidence of herding in the discussion phase of peer review.

Implications. Regarding the concern of herding which is found to occur in other applications involving people, discussion in peer review does not seem to be susceptible to this effect and hence no specific measures to counteract herding in peer-review discussions are needed.

Details. https://arxiv.org/abs/2011.15083

Novice Reviewer Recruiting

Motivation.  A surge in the number of submissions received by leading ML and  AI conferences has challenged the sustainability of the review process by increasing the burden on the pool of qualified reviewers. Leading conferences have been addressing the issue by relaxing the seniority bar for reviewers and inviting very junior researchers with limited or no publication history, but there is mixed evidence regarding the impact of such interventions on the quality of reviews. 

Research question. Can very junior reviewers be recruited and guided such that they enlarge the reviewer pool of leading ML and AI conferences without compromising the quality of the process?

Procedure. We implemented a twofold approach towards managing novice reviewers:

Selection. We evaluated reviews written in the aforementioned auxiliary conference review process involving 134 junior reviewers, and invited 52 of these reviewers who produced the strongest reviews to join the reviewer pool of ICML 2020. Most of these 52 “experimental” reviewers come from the population not considered by the conventional way of reviewer recruiting used in ICML 2020.

Mentoring. In the actual conference, we provided these experimental reviewers with a senior researcher as a point of contact who offered additional mentoring.

Hypothesis. If our approach allows to bring strong reviewers to the pool, we expect experimental reviewers to perform at least as good as reviewers from the main pool on various metrics, including the quality of reviews as rated by area chairs.

Key findings. A combination of the selection and mentoring mechanisms results in reviews of at least comparable and on some metrics even higher-rated quality as compared to the conventional pool of reviews: 30% of reviews written by the experimental reviewers exceeded the expectations of area chairs (compared to only 14% for the main pool).

Implications. The experiment received positive feedback from participants who appreciated the opportunity to become a reviewer in ICML 2020 and from authors of papers used in the auxiliary review process who received a set of useful reviews without submitting to a real conference. Hence, we believe that a promising direction is to replicate the experiment at a larger scale and evaluate the benefits of each component of our approach.

Details. http://arxiv.org/abs/2011.15050

Conclusion

All in all, the experiments we conducted in ICML 2020 reveal some useful and actionable insights about the peer-review process. We hope that some of these ideas will help to design a better peer-review pipeline in future conferences.

We thank ICML area chairs, reviewers, and authors for their tremendous efforts. We would also like to thank the Microsoft Conference Management Toolkit (CMT) team for their continuous support and implementation of features necessary to run these experiments, the authors of papers contributed to the auxiliary review process for their responsiveness, and participants of the resubmission bias experiment for their enthusiasm. Finally, we thank Ed Kennedy and Devendra Chaplot for their help with designing and executing the experiments.

The post is based on the works by Ivan Stelmakh, Nihar B. Shah, Aarti Singh, Hal Daumé III, and Charvi Rastogi.

A Real World Reinforcement Learning Research Program

We are hiring for reinforcement learning related research at all levels and all MSR labs. If you are interested, apply, talk to me at COLT or ICML, or email me.

More generally though, I wanted to lay out a philosophy of research which differs from (and plausibly improves on) the current prevailing mode.

Deepmind and OpenAI have popularized an empirical approach where researchers modify algorithms and test them against simulated environments, including in self-play. They’ve achieved significant success in these simulated environments, greatly expanding the reportoire of ‘games solved by reinforcement learning’ which consisted of the singleton backgammon when I was a graduate student. Given the ambitious goals of these organizations, the more general plan seems to be “first solve games, then solve real problems”. There are some weaknesses to this approach, which I want to lay out next.

  • Broken API One issue with this is that multi-step reinforcement learning is a broken API in the sense that it creates an interface for problem definitions that is unsolvable via currently popular algorithm families. In particular, you can create problems which are either ‘antishaped’ so local rewards mislead w.r.t. long term rewards or keylock problems, as are common in Markov Decision Process lower bounds. I coded up simple versions of these problems a couple years ago and stuck them on github now to be extra crisp. If you try to apply policy gradient or Q-learning style algorithms on these problems they commonly run into exponential (in the number of states) sample complexity. As a general principle, APIs which create exponential sample complexity are bad—they imply that individual applications require taking advantage of special structure in order to succeed.
  • Transference Another significant issue is the degree of transference between solutions in simulation and the real world. “Transference” here potentially happens at several levels.
    • Do the algorithms carry over? One of the persistent issues with simulation-based approaches is that you don’t care about sample complexity that much—optimal performance at acceptable computational complexities is the typical goal. In real world applications, this is somewhat absurd—you really care about immediately doing something reasonable and optimizing from there.
    • Do the simulators carry over? For every simulator, there is a fidelity question which comes into play when you want to transfer a policy learned in the simulator into action in the real world. Real-time ray tracing and simulator quality more generally are advancing, but I’m not ready yet to trust a self-driving care trained in a simulated reality. An accurate simulation of the physics is unclear—friction for example is known-difficult, and more generally the representative variety of exogenous events in an open world seems quite difficult to implement.
  • Solution generality When you test and discover that an algorithm works in a simulated world, you know that it works in the simulated world. If you try it in 30 simulated worlds and it works in all of them, it can still easily be the case that an algorithm fails on the 31st simulated world. How can you achieve confidence beyond the number of simulated worlds that you try and succeed on? There is some sense by which you can imagine generalization over an underlying process generating problems, but this seems like a shaky justification in practice, since the nature of the problems encountered seems to be a nonstationary development of an unknown future.
  • Value creation Solutions of a ‘first A, then B’ flavor naturally take time to get to the end state where most of the real value is set to be realized. In the years before reaching applications in the real world, does the funding run out? We certainly hope not for the field of research but a danger does exist. Some discussion here including the comments is relevant.

What’s an alternative?

Each of the issues above is addressable.

  • Build fundamental theories of what are statistically and computationally tractable sub-problems of Reinforcement Learning. These tractable sub-problems form the ‘APIs’ of systems for solving these problems. Examples of this include simpler (Contextual Bandits), intermediate (learning to search, and move advanced (Contextual Decision Process).
  • Work on real-world problems. The obvious antidote to simulation is reality, driving both the need to create systems that work in reality as well as a research agenda around reality-centered issues like performance at low sample complexity. There are some significant difficulties with this—reinforcement style algorithms require interactive access to learn which often drives research towards companies with an infrastructure. Nevertheless, offline evaluation on real-world data does exist and the choice of emphasis in research directions is universal.
  • The combination of fundamental theories and a platform which distills learnings so they are not forgotten and always improved upon provides a stronger basis for expectation of generalization into the next problem.
  • The shortest path to creating valuable applications in the real world is to simply work on creating valuable applications in the real world. Doing this in a manner guided by other elements of the research program is just good sense.

The above must be applied in moderation—some emphasis on theory, some emphasis on real world applications, some emphasis on platforms, and some emphasis on empirics. This has been my research approach for a little over 10 years, ever since I started working on contextual bandits.

Let’s call the first research program ’empirical simulation’ and the second research program ‘real fundamentals’. The empirical simulation approach has a clear strong advantage in that it creates impressive demos, which creates funding, which creates more research. The threshold for contribution to the empirical simulation approach may also be lower simply because it requires mastery of fewer elements, implying people can more easily participate in it. At the same time, the real fundamentals approach has clear advantages in addressing the weaknesses of the empirical simulation approach. At a concrete level, this means we have managed to define and create fundamentals through research while creating real-world applications and value radically more efficiently than the empirical simulation approach has achieved.

The ‘real fundamentals’ concept is behind the open positions above. These positions have been designed to come with both the colleagues and mandate to address the most difficult research problems along with the organizational leverage to change the world. For people interested in fundamentals and making things happen in the real world these are prime positions—please consider joining us.

Fact over Fiction

Politics is a distracting affair which I generally believe it’s best to stay out of if you want to be able to concentrate on research. Nevertheless, the US presidential election looks like something that directly politicizes the idea and process of research by damaging the association of scientists & students, funding for basic research, and creating political censorship.

A core question here is: What to do? Today’s March for Science is a good step, but I’m not sure it will change many minds. Unlike most scientists, I grew up in a a county (Linn) which voted overwhelmingly for Trump. As a consequence, I feel like I must translate the mindset a bit. For the median household left behind over my lifetime a march by relatively affluent people protesting the government cutting expenses will not elicit much sympathy. Discussion about the overwhelming value of science may also fall on deaf ears simply because they have not seen the economic value personally. On the contrary, they have seen their economic situation flat or worsening for 4 decades with little prospect for things getting better. Similarly, I don’t expect history lessons on anti-intellectualism to make much of a dent. Fundamentally, scientists and science fans are a small fraction of the population.

What’s needed is a campaign that achieves broad agreement across the population and which will help. One of the roots of the March for Science is a belief in facts over fiction which may have the requisite scope. In particular, there seems to be a good case that the right to engage in mass disinformation has been enormously costly to the United States and is now a significant threat to civil stability. Internally, disinformation is a preferred tool for starting wars or for wealthy companies to push a deadly business model. Externally, disinformation is now being actively used to sway elections and is self-funding.

The election outcome is actually less important than the endemic disagreement that disinformation creates. When people simply believe in different facts about the world how can you expect them to agree? There probably are some good uses of mass disinformation somewhere, but I’m extremely skeptical the value exceeds the cost.

Is opposition to mass disinformation broad enough that it makes a good organizing principle? If mass disinformation was eliminated or greatly reduced it would be an enormous value to society, particularly to the disinformed. It would not address the fundamental economic stagnation of the median household in the United States, but it would remove a significant threat to civil society which may be necessary for such progress. Given a choice between the right to mass disinform and democracy, I choose democracy.

A real question is “how”? We are discussing an abridgment of freedom of speech so from a legal perspective the basis must rest on the balance between freedom of speech and other constitutional rights. Many abridgements exist like censuring a yell of “fire” in a crowded theater unnecessarily.

Voluntary efforts (as Facebook and Twitter have undertaken) are a start, but it seems unlikely to go far enough as many other “news” organizations have made no such commitments. A system where companies commit to informing over disinforming and in return become both more trusted and simultaneously liable for disinformation damages (due to the disinformed) as assessed by civil law may make sense. Right now organizations are mostly free to engage in disinformation as long as it is not directed at an individual where libel laws apply. Penalizing an organization for individual mistakes seems absurd, but a pattern of errors backed by scientific surveys verifying an anomalously misinformed status of viewers/readers/listeners is cause for action. Getting this right is obviously a tricky thing—we want a solution that a real news organization with an existing mimetic immune system prefers to the status quo because it curbs competitors that disinform. At the same time, there must be enough teeth to make disinformation uneconomical or the problem only grows.

Should disinformation have criminal penalties? One existing approach here uses RICO laws to counter disinformation from Tobacco companies. Reading the history, this took an amazing amount of time—enough that it was ineffective for a generation. It seems plausible that an act directly addressing disinformation may be helpful.

What about technical solutions? Technical solutions seem necessary for success, perhaps with changes to law incentivizing this. It’s important to understand that something going before the courts is inherently slow, particularly because courts tend to be deeply overloaded. A significant increase in the number of cases going before courts makes an approach nonviable in practice.

Would we regret this? There is a long history of governments abusing laws to censor inconvenient news sources so caution is warranted. Structuring new laws in a manner such that they cannot be abused is an important consideration. It is obviously important to leave satire fully intact which seems entirely possibly by making the fact that it is satire unmistakable. This entire discussion is also not relevant to individuals speaking to other individuals—that is not what creates a problem.

Is this possible? It might seem obvious that mass disinformation should be curbed but there should be no doubt that powerful forces will work to preserve mass disinformation by subtle and unethical means.

Overall, I fundamentally believe that people in a position to inform or disinform have a responsibility to inform. If they don’t want that responsibility, then they should abdicate the position to someone who does, similar in effect to the proposed fiduciary rule for investments. I’m open to any approach towards achieving this.

Edit: also at CACM.

No more MSR Silicon Valley

This news report is correct, the Microsoft Research Silicon Valley center has been cut. The New York lab has not been directly affected although obviously cross-lab collaborations are impacted, and we sympathize deeply with those involved. Most of the rest of MSR is not directly affected.

I’m not privy to the rationale behind the decision, but in my opinion there are some very strong people in the various groups (Algorithms, Architecture, Distributed Systems, Privacy, Software tools, Web Search), and I expect offers have started raining on them. In my experience, this is harrowing in the short term, yet I know that most of my previous colleagues ended up happier after the troubles hit Yahoo! Research 2 1/2 years ago.