Structural Problems in NIPS Decision Making

This is a very difficult post to write, because it is about a perennially touchy subject. Nevertheless, it is an important one which needs to be thought about carefully.

There are a few things which should be understood:

  1. The system is changing and responsive. We-the-authors are we-the-reviewers, we-the-PC, and even we-the-NIPS-board. NIPS has implemented ‘secondary program chairs’, ‘author response’, and ‘double blind reviewing’ in the last few years to help with the decision process, and more changes may happen in the future.
  2. Agreement creates a perception of correctness. When any PC meets and makes a group decision about a paper, there is a strong tendency for the reinforcement inherent in a group decision to create the perception of correctness. For the many people who have been on the NIPS PC it’s reasonable to entertain a healthy skepticism in the face of this reinforcing certainty.
  3. This post is about structural problems. What problems arise because of the structure of the process? The post is not about individual people, because this is unlikely to be fruitful.

Although the subject is nominally about NIPS (which I have experience with as an author, reviewer, and PC member), the points may apply elsewhere.

For those that don’t know, it’s worth reviewing how the NIPS process currently works. Temporally, it looks like the following:

  1. PC chair is appointed.
  2. PC chair picks PC committee to cover many different areas. NIPS is notably diverse.
  3. PC committee members pick reviewers for their areas.
  4. Authors submit blinded papers.
  5. Papers are assigned to two PC committee members, the “primary” and the “secondary”.
  6. Reviewers bid for papers within their areas which they want and don’t want to review.
  7. Reviewers are assigned papers based on bid plus coverage.
  8. Reviewers review papers.
  9. Authors respond to blinded reviews.
  10. Reviewers discuss and rate papers.
  11. PC members digest author/reviewer interaction (and sometimes the paper) into an impression.
  12. PC members meet physically at the PC meeting.
  13. PC members present all papers that they believe are worth considering to other PC members and a decision is made.

Naturally, there are many details left out of this long list.

Here is my attempt to describe the problems I’ve seen:

  1. Attention deficit disorder. The attention paid to individual accept/reject decisions is (and structurally must be) small. There are several effects which drive this:
    1. The people on the NIPS PC are typically busy and time constrained.
    2. The number of papers assigned to individual PC members is large—perhaps 40 to 80, plus a similar number assigned as a secondary.
    3. Many of the people have traveled a very long way to reach the PC meeting. Jetlag is common, and it often significantly affects your ability to think carefully.
    4. The meeting itself is 2 days long. The average time spent on any decision must be less than 5 minutes, and everyone knows this. The implicit encouragement to digest a paper down to its simplest description is significant. No one on the PC has seen the paper except for the primary and the secondary (if you are lucky) PC members, so decisions are made quickly based upon relatively little information. (This is better than it sounds in most cases because effectively the decision was made by the primary PC member beforehand.)
  2. Artificial scarcity. NIPS is a single track conference with 3 levels of acceptance: “Accept for an oral presentation”, “Accept for a poster with a spotlight”, and “Accept as a poster only”. It’s fairly difficult to justify a paper as “of broad interest”, which is ideal for an oral presentation. Will a neuroscientist really pay attention to this learning theory paper? Is this dimensionality reduction algorithm going to interest someone in learning theory? It’s substantially easier to justify a paper as “possibly of interest to a number of people”, which is about right for a poster spotlight. Since the number of spotlights and the number of orals are similar, two effects occur: papers which are about right for spotlights become orals, and many reasonable spotlights aren’t spotlights because they don’t fit.
  3. The Veto Effect. If someone on the PC has a strong dislike for your paper, there is a very good chance of rejection. This is true even when attention is explicitly paid by the PC chair to avoiding the veto problem. It’s even true when your paper has the strongest reviews in the area (no joke!). There are several fundamental problems here:
    1. People, especially in person, do not generally want to be confrontational. Consequently, if someone who is rarely confrontational speaks strongly against a paper, it’s rare for an alternate voice to be heard.
    2. It is easy to instill “fear, uncertainty, and doubt” in people. Was this paper covering the same material as some other paper no one knows? Are the assumptions criticizable? This problem is greatly exacerbated by attention deficit disorder.

It is easy to complain about these problems and substantially harder to fix them. (There is previous discussion on this.) Here is my best attempt to imagine fixes.

  1. Attention Deficit Disorder. The fundamental problem here is that papers aren’t getting the attention that they deserve by the final decision maker. Several changes might help, but nothing is going to be a silver bullet here.
    1. Author responsibility. Unfortunately, some authors abuse the system by submitting papers which should not be submitted. Much of this has to do with inexperience: many authors are first-time paper writers. For these authors, a better effort to educate people about what makes an appropriate paper would help. This year, an effort was made to do this, and followups may be helpful. For a small fraction of papers, authors intentionally skate the edge of what is reasonable. Should an ICML paper with 30% different content be submitted to NIPS? This small fraction takes more time than its size indicates and (frankly) isn’t always caught. Some form of “shame list” may be an appropriate way to deal with this, although much caution would have to be exercised.
    2. Many of the problems here are unremovable artifacts of a physically present PC meeting. Going to a virtualized process would eliminate these problems (and introduce others). Any such decision would have to be carefully considered, but it is not impossible; there are plenty of successful conference committees which never meet physically.
    3. The PC meeting can be run a bit differently.
      1. Bob Williamson and I managed to go through our secondary assignments and make independent decisions, then reconcile. In contrast, for most papers, the secondary PC member was inoperative at the PC meeting. This made some difference, and it’s easy to imagine that systematically having this reconciliation be a part of the PC meeting is helpful. The reconciliation step does not take very long and is parallelizable.
      2. Not making a decision at the PC meeting could be a real option for a small number of troublesome papers. There is a gap of perhaps a week between the PC meeting and the release of the decisions during which decisions could be double-checked. This option must only be used rarely, and never as a means for excluding interested PC members from the decision.
      3. Information can be more widely shared. I don’t see any real advantage to limiting the knowledge of papers not in your area to “title+authors”. At the PC meeting itself, it would be helpful to have all of the papers available to all of the members.
  2. Artificial Scarcity. My understanding is that the makers of NIPS purposefully preferred a single track conference, and it’s hard to argue with the success NIPS has enjoyed. Nevertheless, it seems notable that the NIPS workshops (which are excessively multitracked) are more successful than the NIPS conference by some measures. Going to a two-track or partially two-track format would ease some of the decision making.

    Even working within the single track format, it’s not clear that the ratio between orals and spotlights is right. Spotlights take about 1/10th the time that an oral presentation takes, and yet only 1/10th or so of the overall time is allocated to spotlight presentations. Losing one oral presentation (out of about 20) would yield a significant increase in the number of spotlights, and it’s easy to imagine this would be beneficial to attendees while easing decision making (a back-of-the-envelope sketch appears after this list).

  3. The Veto Effect. The veto effect is hard to deal with, and it’s only relevant to a small number of decisions. Nevertheless it’s important because some of the best papers are controversial at the time they are published. There are two ways I can imagine for dealing with the veto effect: (1) allowing author feedback, and (2) devolving power from the PC to the reviewers. Allowing author feedback would have to be coupled with delayed decision making. Eliminating the power of the PC to reject very highly rated papers is also controversial, but may be worth considering.
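
To make the oral/spotlight tradeoff in point 2 concrete, here is a rough back-of-the-envelope sketch. The specific durations (about 20 minutes per oral and 2 minutes per spotlight) and the spotlight count are assumptions consistent with the 1/10th ratio mentioned above, not official NIPS numbers.

```python
# Back-of-the-envelope schedule arithmetic. All durations and counts are
# assumptions for illustration, not official NIPS figures.
ORAL_MINUTES = 20       # assumed length of one oral presentation
SPOTLIGHT_MINUTES = 2   # assumed length of one spotlight (about 1/10th of an oral)
NUM_ORALS = 20          # "out of about 20" orals, per the text
NUM_SPOTLIGHTS = 20     # assumed roughly similar to the number of orals

# Converting one oral slot into spotlight slots:
extra_spotlights = ORAL_MINUTES // SPOTLIGHT_MINUTES
relative_increase = extra_spotlights / NUM_SPOTLIGHTS

print(f"Dropping 1 of {NUM_ORALS} orals frees room for about {extra_spotlights} "
      f"more spotlights (roughly a {relative_increase:.0%} increase).")
```

Under these assumptions, giving up a single oral slot buys about ten additional spotlights, which is why the change looks attractive despite costing only one long talk.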

Incentive Compatible Reviewing

Reviewing is a fairly formal process which is integral to the way academia is run. Given this integral nature, the quality of reviewing is often frustrating. I’ve seen plenty of examples of false statements, misbeliefs, reading what isn’t written, etc…, and I’m sure many other people have as well.

Recently, mechanisms like double blind review and author feedback have been introduced to try to make the process more fair and accurate in many machine learning (and related) conferences. My personal experience is that these mechanisms help, especially the author feedback. Nevertheless, some problems remain.

The game theory take on reviewing is that the incentive for truthful reviewing isn’t there. Since reviewers are also authors, there are sometimes perverse incentives created and acted upon. (Incidentally, these incentives can be both positive and negative.)

Setting up a truthful reviewing system is tricky because there is no final reference truth available in any acceptable (say, sub-year) timespan. There are several ways we could try to get around this.

  1. We could try to engineer new mechanisms for finding a reference truth into a conference and then use a ‘proper scoring rule’ which is incentive compatible. For example, we could have a survey where conference participants short list the papers which interested them. There are significant problems here:
    1. Conference presentations mostly function as announcements of results. Consequently, the understanding of the paper at the conference is not nearly as deep as, say, after reading through it carefully in a reading group.
    2. This is inherently useless for judging reviews of rejected papers and it is highly biased for judging reviews of papers presented in two different formats (say, a poster versus an oral presentation).
  2. We could ignore the time issue and try to measure reviewer performance based upon (say) long term citation count. Aside from the bias problems above, there is also a huge problem associated with turnover. Who the reviewers are and how an individual reviewer reviews may change drastically in just a 5 year timespan. A system which can provide track records for only a small subset of current reviewers isn’t very capable.
  3. We could try to manufacture an incentive compatible system even when the truth is never known. This paper by Nolan Miller, Paul Resnick, and Richard Zeckhauser discusses the feasibility of this approach. Essentially, the scheme works by rewarding reviewer i according to a proper scoring rule applied to P(reviewer j’s score | reviewer i’s score). (A simple example of a proper scoring rule is log[P(outcome)].) This approach is pretty fresh, so there are lots of problems, some of which may or may not be fundamental difficulties for application in practice. The significant problem I see is that this mechanism may reward joint agreement instead of a genuine contribution towards good joint decision making.
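
Here is a minimal sketch of the peer-prediction idea in item 3, assuming a hypothetical 3-point score scale and a made-up conditional probability table; neither comes from the Miller, Resnick, and Zeckhauser paper nor from any real conference data.

```python
import math

# Sketch of peer prediction with a logarithmic proper scoring rule:
# reviewer i is rewarded by scoring P(reviewer j's score | reviewer i's score)
# against reviewer j's actual report. The score scale and probability table
# below are illustrative assumptions only.

SCORES = [1, 2, 3]  # hypothetical 3-point review scale

# Assumed model of how one reviewer's score predicts another's:
# P_J_GIVEN_I[i_score][j_score] = P(j reports j_score | i reports i_score)
P_J_GIVEN_I = {
    1: {1: 0.6, 2: 0.3, 3: 0.1},
    2: {1: 0.2, 2: 0.6, 3: 0.2},
    3: {1: 0.1, 2: 0.3, 3: 0.6},
}

def log_score(predicted, outcome):
    """Logarithmic proper scoring rule: reward = log P(observed outcome)."""
    return math.log(predicted[outcome])

def reward_reviewer_i(i_score, j_score):
    """Pay reviewer i by scoring the prediction of j's report induced by i's report."""
    return log_score(P_J_GIVEN_I[i_score], j_score)

# Example: reviewer i reports 3; the payment depends on what j later reports.
for j_score in SCORES:
    print(f"j reports {j_score}: reward to i = {reward_reviewer_i(3, j_score):.3f}")
```

Note that the reward is largest when j’s report matches the most likely score under the assumed model, which is exactly the concern raised above: such a scheme can end up rewarding agreement rather than an independent contribution to a good joint decision.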

None of these mechanisms are perfect, but they may each yield a little bit of information about what was or was not a good decision over time. Combining these sources of information to create some reviewer judgement system may yield another small improvement in the reviewing process.

The important thing to remember is that we are the reviewers as well as the authors. Are we interested in tracking our reviewing performance over time in order to make better judgements? Such tracking often happens on an anecdotal or personal basis, but shifting to an automated incentive compatible system would be a big change in scope.

NIPS paper evaluation criteria

John Platt, who is PC-chair for NIPS 2006, has organized a NIPS paper evaluation criteria document with input from the program committee and others.

The document contains specific advice about what is appropriate for the various subareas within NIPS. It may be very helpful, because the standards of evaluation for papers vary significantly.

This is a bit of an experiment: the hope is that by carefully thinking about and stating what is important, authors can better understand whether and where their work fits.

Update: The general submission page and author instructions, including how to submit an appendix, are now available.

Reviewing techniques for conferences

The many reviews following the many paper deadlines are just about over. AAAI and ICML in particular were experimenting with several reviewing techniques.

  1. Double Blind: AAAI and ICML were both double blind this year. It seemed (overall) beneficial, but two problems arose.
    1. For theoretical papers with a lot to say, authors often leave out the proofs. This is very hard to cope with under double blind review because (1) you cannot trust that the authors got the proof right, but (2) a blanket “reject” hits many probably-good papers. Perhaps authors should more strongly favor sending proof-complete papers to double blind conferences.
    2. On the author side, double blind reviewing is actually somewhat disruptive to research. In particular, it discourages the author from talking about the subject, which is one of the mechanisms of research. This is not a great drawback, but it is one not previously appreciated.
  2. Author feedback: AAAI and ICML did author feedback this year. It seemed helpful for several papers. The ICML-style author feedback (more space, no requirement of attacking the review to respond) appeared somewhat more helpful and natural. It seems ok to pass a compliment from author to reviewer.
  3. Discussion Periods: AAAI seemed more natural than ICML with respect to discussion periods. For ICML, there were “dead times” when reviews were submitted but discussions amongst reviewers were not encouraged. This has the drawback of letting people forget their review before discussing it.