ICML: Behind the Scenes

This is a rather long post, detailing the ICML 2012 review process. The goal is to make the process more transparent, help authors understand how we came to a decision, and discuss the strengths and weaknesses of this process for future conference organizers.

Microsoft’s Conference Management Toolkit (CMT)
We chose to use CMT over other conference management software mainly because of its rich toolkit. The interface is sub-optimal (to say the least!) but it has extensive capabilities (to handle bids, author response, resubmissions, etc.), good import/export mechanisms (to process the data elsewhere), excellent technical support (to answer late night emails, add new functionalities). Overall, it was the right choice, although we hope a designer will look at that interface sometime soon!

Toronto Matching System (TMS)
TMS is now being used by many major conferences in our field (including NIPS and UAI). It is an automated system (developed by Laurent Charlin and Rich Zemel at U. Toronto) for matching reviewers to papers, based on an analysis of each reviewer’s publications. TMS collects publications from reviewers, parses them into features, and applies unsupervised or supervised learning techniques to predict the relevance of any target paper for any reviewer. We arranged for TMS to be integrated with CMT and funded Laurent’s work on that integration. Reviewers were asked to provide a publication list for TMS to parse. For those who failed to do so (after many reminders!), we manually added that information from public sources.
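
The post does not describe the TMS model itself. As a rough, hedged illustration of what unsupervised relevance prediction from parsed publications can look like, here is a minimal sketch that scores paper-reviewer relevance by TF-IDF cosine similarity between each reviewer’s concatenated publication text and each submission’s abstract; the function name and inputs are hypothetical, and the real TMS uses a more sophisticated model.

```python
# Minimal sketch of unsupervised paper-reviewer relevance scoring (not the
# actual TMS model): compare text features of reviewer publications against
# submission abstracts using TF-IDF and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relevance_scores(reviewer_pubs, submission_abstracts):
    """reviewer_pubs: one string per reviewer (their publications concatenated).
    submission_abstracts: one string per submitted paper.
    Returns a (num_reviewers x num_papers) matrix of relevance scores."""
    vectorizer = TfidfVectorizer(stop_words="english", max_features=50000)
    # Fit a shared vocabulary over reviewer profiles and submissions together.
    tfidf = vectorizer.fit_transform(reviewer_pubs + submission_abstracts)
    reviewer_vecs = tfidf[:len(reviewer_pubs)]
    paper_vecs = tfidf[len(reviewer_pubs):]
    return cosine_similarity(reviewer_vecs, paper_vecs)
```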

The Program Committee
Recruiting a program committee that is both large and highly qualified is difficult these days. We sent out 69 area chair invitations; 50 (highly qualified!) people accepted. Each of these area chairs was asked to nominate a list of potential reviewers. We sent out approximately 700 invitations for program committee members; 389 accepted. A number of additional PC members were recruited during the review process (most of them for 1-2 papers), for a total of 470 active PC members. In terms of seniority, the final PC contained roughly 15% students, 80% researchers, and 5% other.

The Surge (ICML + 50%)
The first big challenge came at the submission deadline. In the past few years, ICML had consistently received ~550-600 submissions. This year, we had a 50% increase, to 890 submissions. We had recruited a PC that could comfortably handle 700 papers, so dealing with an extra 200 papers was not an easy task.

About 10 submissions were rejected without review for various reasons (severe formatting issues, extra pages, failure to anonymize).

Bidding
An unsupervised version of TMS was used to generate a list of candidate papers for each reviewer and area chair. This was done in close collaboration with Laurent Charlin of TMS, using validation on previous NIPS data. CMT did not have the functionality to show a good list of candidate papers to reviewers, so we crafted an interface to show this list and let reviewers use it in conjunction with CMT. Ideally, this will be better incorporated into CMT in the future.

When you ask a group of scientists to run a conference, you must expect a few experiments will take place… And so we decided to assess the usefulness of TMS scoring for generating lists of papers to bid on. To do this, we randomly assigned PC members to one of three groups. One group saw a list ranked purely by TMS scores. Another group received a list based on the match between their declared subject areas and those of the papers (referred to as the “relevance” score in CMT). The third group received a list based on a mix of both TMS and relevance. Reviewers were allowed to bid on any paper (excluding those with which they had a conflict); the lists were provided to help them efficiently sort through the large number of papers. We then compared each reviewer’s bids with the list of suggestions and measured the correspondence.

The following is the Discounted Cumulative Gain (DCG) of each list with respect to the bidding scores, averaged separately for each group. Note that each group was only presented with their corresponding list and not the others.

                        Group: CMT                Group: TMS                Group: CMT+TMS
Sorting by CMT scores   6.11 out of 12.64 (48%)   4.98 out of 13.63 (36%)   4.87 out of 13.55 (35%)
Sorting by TMS scores   4.06 out of 12.64 (32%)   6.43 out of 13.63 (47%)   5.72 out of 13.55 (42%)
Sorting by TMS+CMT      4.77 out of 12.64 (37%)   6.11 out of 13.63 (44%)   6.71 out of 13.55 (49%)
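
For concreteness, here is a minimal sketch of the DCG computation reported in the table above, assuming each bid is a small non-negative relevance value and that the “out of” denominators are the ideal DCG (the DCG of the reviewer’s own bids sorted from best to worst); the exact bid scale and list length are not stated in the post.

```python
import math

def dcg(suggested_papers, bids, k=None):
    """suggested_papers: paper ids in the order they were shown to the reviewer.
    bids: dict mapping paper id -> bid value (used as the gain).
    Returns the discounted cumulative gain of the list with respect to the bids."""
    if k is not None:
        suggested_papers = suggested_papers[:k]
    return sum(bids.get(p, 0) / math.log2(rank + 2)  # rank is 0-based, hence +2
               for rank, p in enumerate(suggested_papers))

def ideal_dcg(bids, k=None):
    """DCG of the best possible ordering (presumably the 'out of' numbers above)."""
    best_order = sorted(bids, key=bids.get, reverse=True)
    return dcg(best_order, bids, k)
```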

A micro-survey was also run to collect further information on how users liked their short list. 85% of the participants indicated that they had used the list interface provided to them. The following is the preference indicated by each group (~75 reviewers in each group, ~2% error):

                          Group: CMT   Group: TMS   Group: CMT+TMS
Preferred CMT over list   15%          12%          8%
Preferred list + CMT      81%          83%          83%
Preferred list over CMT   4%           5%           9%

It is obvious from the above that most participants found the list useful in conjunction with CMT (suggesting that the list should be integrated inside CMT). We can also see that those who were presented with a list based on TMS scores were more likely to find the list useful.

Note that all of the above was done over one long, hectic, but fun weekend.

Imputing Missing Bids
CMT assumes that a reviewer is not willing to review a paper unless stated otherwise. It does not differentiate between an unseen (but potentially relevant) paper and a paper that has been seen and ignored. This is a real shortcoming when it comes to matching papers to reviewers, especially for reviewers who did not bid often. To mitigate this problem, we used the click information on the shortlists presented to reviewers to determine which papers had been seen and ignored, and imputed these cases as genuine “not willing” bids.

Around 30 reviewers did not provide any bids (and many had only a few). This is problematic because the tools used to do the actual reviewer-paper matching tend to assign the papers without any bids to the reviewers who did not bid, regardless of the match in expertise.

Once the bidding information was in and the imputation was done, we had to fill in the rest of the paper-reviewer bidding matrix to mitigate the problem of sparse bidders. This was done, once again, through TMS, but this time using a supervised learning approach.

Using supervised learning was more delicate than expected. To deal with the wildly varying number of bids per person, we imputed zero bids, first from papers that were plausibly skipped over, and if necessary at random from papers not bid on, such that each person had the same expected number of bids in the dataset. From this dataset, we held out a random bid per person and trained to predict the held-out bid well. Most optimization approaches performed poorly because the number of features greatly exceeded the number of labels. The best approach we found used the online algorithms in Vowpal Wabbit with a mass personalized training method similar to the one discussed here. This trained predictor was used to predict bid values for the full paper-reviewer bid matrix.
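
As a rough sketch of the dataset construction described above (not the actual pipeline, which used Vowpal Wabbit with personalized per-reviewer features), the following imputes zero bids first from plausibly skipped papers and then at random, and holds out one bid per reviewer for validating the learned predictor. All names and the target number of examples per reviewer are illustrative.

```python
import random

def build_training_data(bids, skipped, all_papers, target_bids=30, seed=0):
    """bids: dict reviewer -> dict paper -> bid value (explicit bids only).
    skipped: dict reviewer -> papers shown on the shortlist but not bid on,
             imputed here as zero ("not willing") bids.
    all_papers: list of all paper ids.
    Returns (train, heldout), each mapping reviewer -> dict paper -> bid value."""
    rng = random.Random(seed)
    train, heldout = {}, {}
    for reviewer, rbids in bids.items():
        examples = dict(rbids)
        # Impute zeros first from papers that were plausibly seen and skipped...
        for paper in skipped.get(reviewer, ()):
            examples.setdefault(paper, 0)
        # ...then at random from unbid papers, so every reviewer contributes
        # roughly the same number of examples to the dataset.
        unbid = [p for p in all_papers if p not in examples]
        rng.shuffle(unbid)
        for paper in unbid[:max(0, target_bids - len(examples))]:
            examples[paper] = 0
        # Hold out one random bid per reviewer to validate the trained predictor.
        held_paper = rng.choice(list(examples))
        heldout[reviewer] = {held_paper: examples.pop(held_paper)}
        train[reviewer] = examples
    return train, heldout
```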

Automated Area Chair and First Reviewer Assignment
Once we had the imputed paper-reviewer bidding matrix, CMT was used to generate the actual match between papers and area chairs, and (separately) between papers and reviewers. Each paper had two area chairs (sometimes called “meta-reviewers” in CMT) assigned to it, one primary, one secondary, by running two rounds of assignments (so that the primary was usually the “better” match). One reviewer per paper was also assigned automatically by CMT in a similar fashion. CMT provides proper load balancing, so that all area chairs and reviewers had similar loads.
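
CMT’s matching algorithm is not described in the post. As a hedged sketch of one standard way to obtain a load-balanced assignment from a score matrix, the following treats it as a min-cost bipartite matching in which each reviewer (or area chair) is replicated once per available slot, so no one exceeds their load; the names and load limit are illustrative, not CMT’s actual method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_reviewers(score, max_load=7):
    """score: (num_papers x num_reviewers) array of predicted bid/affinity scores.
    Returns a list giving the reviewer index assigned to each paper, with no
    reviewer receiving more than max_load papers."""
    num_papers, num_reviewers = score.shape
    assert num_papers <= num_reviewers * max_load, "not enough reviewer capacity"
    # Replicate each reviewer column once per slot, then solve a standard
    # assignment problem that maximizes total score (minimizes negated score).
    cost = -np.repeat(score, max_load, axis=1)
    rows, cols = linear_sum_assignment(cost)
    assignment = [None] * num_papers
    for paper, slot in zip(rows, cols):
        assignment[paper] = slot // max_load  # map slot back to reviewer index
    return assignment
```

Running a second round on the remaining capacity (with already-assigned pairs excluded) would give the primary/secondary area chair split described above.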

Manual Checks of the Automated Assignments
Before finalizing the automated assignment, we manually looked through the list of papers to fix any potential problems that were not handled by the automated process. The two major cases were papers that did not go through the TMS system (because the authors did not agree to this), and poor primary-secondary meta-reviewer pairs (where the two area chairs were judged too close to offer independent assessments, e.g. working at the same institution or having a previous supervisor-student relationship).

Second and Third Reviewer Assignment
Once the initial assignments were announced, we asked the two area chairs for a given paper to each manually assign another reviewer from the PC. To help area chairs with this, we generated a shortlist of 10 recommended reviewers for each paper (using the estimated bid matrix and TMS scores, with the CMT matching algorithm providing load balancing of the reviewer suggestions). Area chairs were free to use this list, to select from the complete program committee, or to seek an outside reviewer who was then added to the PC (an option used 80 times). The load for each reviewer was restricted to at most 7 papers, with exceptions when they explicitly agreed to more.
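
The post does not give the exact recipe for these shortlists. As a minimal sketch under the assumption that the estimated bid matrix and TMS scores are simply blended and ranked, the following returns the top-k reviewers per paper while excluding conflicts; the blending weight alpha is made up, and the load balancing of suggestions handled by CMT is omitted for simplicity.

```python
import numpy as np

def reviewer_shortlists(est_bids, tms_scores, conflicts, k=10, alpha=0.5):
    """est_bids, tms_scores: (num_papers x num_reviewers) float arrays.
    conflicts: boolean (num_papers x num_reviewers) array, True where a
    reviewer is conflicted with a paper.
    Returns a (num_papers x k) array of recommended reviewer indices."""
    combined = alpha * est_bids + (1.0 - alpha) * tms_scores
    combined = np.where(conflicts, -np.inf, combined)  # never suggest conflicts
    order = np.argsort(-combined, axis=1)              # highest combined score first
    return order[:, :k]
```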

The second and third uses of TMS, including the new supervised learning system, led to another long, hectic weekend with Laurent, Mahdi, Joelle, and John all deeply involved.

Reviews
Most papers received at least 3 full reviews in the first round. Reviewers could not see each other’s reviews until they submitted their own. ML-Journaled submissions (see the double submission guide) were reviewed only by two area chairs. For a small number of regular submissions (fewer than 10), we received 2 very negative reviews and notified the third reviewer (who was usually late by this point!) that we would not need their review.

Authors’ Response
Authors were given a chance to respond to the reviews during a short feedback period. This is becoming standard practice in machine learning conferences. Authors were also allowed to upload a new version of the paper. The motivation here is that in some cases it is easier to show the changes directly in the paper than to discuss them separately.

Our analysis shows that authors’ responses and the subsequent discussions by reviewers led to significant changes in the scoring of papers. A total of ~35% of the papers had some change in their scores after the author feedback. Of these, the average score went down for ~50% of the papers, stayed the same for ~10%, and went up for the other ~40%. The variance of the scores decreased by ~20%, indicating some convergence in the decisions.

Final Decisions
To help us better decide on the quality of the papers, we asked the primary area chairs to provide a meta-review for each of their papers. For papers without a unanimous review decision (i.e. some reviewers recommended accepting and some rejecting), we asked the secondary area chair to (independently) fill in a meta-review, recommending whether to accept or reject the paper. A total of 1214 meta-reviews were provided. There were also 20 papers for which a fourth review was added during this period.

In all cases where the primary and secondary area chairs disagreed on the decision, the program chairs were directly involved, reviewing all the evidence (reviews, rebuttal, discussion, and often the paper itself) and entering into a discussion (usually via email) with the area chairs until a unanimous decision was reached.
A total of 243 papers (27% of submissions) were accepted. Author notifications were sent out on April 30.

Compassionate Reviewing

Most long conversations between academics seem to converge on the topic of reviewing, where almost no one is happy. A basic question is: Should most people be happy?

The case against is straightforward. Anyone who watches the flow of papers realizes that most papers amount to little in the longer term. By its nature, research is brutal: the second-best method is worthless, and the second person to discover something typically gets no credit. If you think about this for a moment, it’s very different from most other human endeavors. The second-best migrant laborer, construction worker, manager, conductor, quarterback, etc. can all manage quite well. If a reviewer has even a vaguely predictive sense of what’s important in the longer term, then most people submitting papers will be unhappy.

But this argument unravels, in my experience. Perhaps half of reviews are thoughtless or simply wrong, with a small fraction being outright malicious. And yet, I’m sure that most reviewers genuinely believe they can predict what will and will not be useful in the longer term. This disparity reflects a lack of communication. When academics have conversations about reviewing, the presumption of the participants is that they all share about the same beliefs about what will be useful and what will take off. Such conversations rarely go into specifics, because the specifics are boring, particular, and technical, and because there is a real chance of disagreement on the specifics themselves.

When double-blind reviewing was first being considered for ICML, I remember speaking about the experience in the Crypto community, where, in my estimate, the reviewing was both fairer and less happy. Many conferences in machine learning have since shifted to double-blind reviewing, and I think we have seen the same come to pass here. Without double-blind reviewing, it is common to have an “in” crowd whom everyone respects and whose papers are virtually always accepted. These people are happy, and the rest have little voice. With double-blind reviewing, everyone suffers substantial rejections.

We might say “fine, at least it’s fair”, but in my experience there is a real problem. From a viewpoint external to the community, when the reviewing is poor and the viewpoints of people in the community are highly contradictory, nothing good happens. Outsiders (i.e. most people) viewing the acrimony choose some other way to solve their problems, proposals don’t get funded, and the community itself tends to fracture. For example, in cryptography, TCC (not double-blind) has started, presumably because the top theory people got tired of having their papers rejected at Crypto (double-blind). From a process-of-research standpoint, this seems suboptimal, as different groups using different methods to solve similar problems are precisely the people you would most want talking to each other.

What seems to be lost with double-blind reviewing is some amount of compassion, however unfairly it was allocated. In a double-blind system, any given paper is plausibly from someone you don’t know, and since most papers go nowhere, plausibly not going anywhere. Consequently, the bias starts “against” for all work, a disadvantage which can be quite difficult to overcome. Some time ago, I discussed how I thought motivation should be the responsibility of the reviewer. Aaron Hertzmann strongly disagreed on the grounds that this belief could dead-end your career as an author. I’ve come to appreciate his viewpoint to an extent. But it misses the point slightly: the question “What is good for the community?” differs from “What is good for the author?” In a healthy community, reviewers will actively work to understand why a piece of work is or is not important, filling in and extending the motivation as they consider the problem.

So, a question is: How can we get compassionate reviewing? (And in a fair way?) It might help somewhat for reviewers to actively consider, as part of their review, the level and mechanism of impact that a paper may have. Reducing reviewing load is certainly helpful, but not sufficient on its own, because many people naturally interpret a reduced reviewing load as time to work on other things. And some mechanisms seem to actively harm. For example, the two-phase reviewing process that ICML currently uses might save 0.5 reviews per paper, while guaranteeing that for half of the papers the deciding review is done hastily with no author feedback, a recipe for mistakes.

What creates a great deal of compassion? Public responsibility helps (witness workshops being more interesting than conferences). A natural conversation helps (the current method of a single-round response tends to be very stilted). And time, of course, helps. What else?

Future Publication Models @ NIPS

Yesterday, there was a discussion about future publication models at NIPS. Yann and Zoubin have specific detailed proposals which I’ll add links to when I get them (Yann’s proposal and Zoubin’s proposal).

What struck me about the discussion is that there are many simultaneous concerns as well as many simultaneous proposals, which makes it difficult to keep all the distinctions straight in a verbal conversation. It also seemed like people were serious enough about this that we may see some real movement. Certainly, my personal experience motivates that, as I’ve posted many times about the substantial flaws in our review process, including some very poor personal experiences.

Concerns include the following:

  1. (Several) Reviewers are overloaded, boosting the noise in decision making.
  2. (Yann) A new system should run with as little built-in delay and friction to the process of research as possible.
  3. (Hanna Wallach) Double-blind review is particularly important for people who are unknown or from an unknown institution.
  4. (Several) But, it’s bad to take double blind so seriously as to disallow publishing on arxiv or personal webpages.
  5. (Yann) And double-blind is bad when it prevents publishing for substantial periods of time. Apparently, this comes up in CVPR.
  6. (Zoubin) Any new system should appear to outsiders as if it’s the old system, or a journal, because it’s already hard enough to justify CS tenure cases to other disciplines.
  7. (Fernando) There shouldn’t be a big change with a complex bureaucracy, but rather smaller changes which are obviously useful or at least worth experimenting with.

There were other concerns as well, but these are the ones that I remember.

Elements of proposals include:

  1. (Yann) Everything should go to Arxiv or an Arxiv-like system first, as in physics or mathematics. This addresses (1), because it delinks dissemination from review, relieving some of the burden of reviewing. It also addresses (2), since authors can immediately begin building on each other’s work. It conflicts with (3), because Arxiv does not support double-blind submission; it does not conflict if we build our own system.
  2. (Fernando) Create a conference-coincident journal in which people can publish at any time. VLDB has apparently done this. It can be done smoothly by allowing submission in either conference-deadline mode or journal mode. This proposal addresses (1) by reducing peak demand on reviewing. It also addresses (6) above.
  3. (Daphne) Perhaps we should have a system which only reviews papers for correctness, which is not nearly as subjective as reviewing for novelty or interestingness. This addresses (1) by eliminating some concerns for the reviewer. It is orthogonal to the double-blind debate. In biology, such a journal exists, because delays were becoming absurd and intolerable.
  4. (Yann) There should be multiple publishing entities (people or groups of people) that can bless a paper as interesting. This addresses (1).

There are many other proposal elements (too many for my memory), which hopefully we’ll see in particular proposals. If other people have concrete proposals, now is probably the right time to formalize them.

Decision by Vetocracy

Few would mistake the process of academic paper review for a fair process, but sometimes the unfairness seems particularly striking. This is most easily seen by comparison:

Problem Scope
  Banditron: Multiclass problems where only the loss of one choice can be probed.
  Offset Tree: Strictly greater: cost-sensitive multiclass problems where only the loss of one choice can be probed.
  Notes: Often generalizations don’t matter. That’s not the case here, since every plausible application I’ve thought of involves loss functions substantially different from 0/1.

What’s New
  Banditron: Analysis and experiments.
  Offset Tree: Algorithm, analysis, and experiments.
  Notes: As far as I know, the essence of the more general problem was first stated and analyzed with the EXP4 algorithm (page 16) (1998). It’s also the time-horizon-1 simplification of the Reinforcement Learning setting for the random trajectory method (page 15) (2002). The Banditron algorithm itself is functionally identical to One-Step RL with Traces (page 122) (2003) in Bianca‘s thesis, with the epsilon-greedy strategy and a multiclass perceptron whose update is scaled by the importance weight.

Computational Time
  Banditron: O(k) per example, where k is the number of choices.
  Offset Tree: O(log k) per example.
  Notes: Lower bounds on the sample complexity of learning in this setting are a factor of k worse than for supervised learning, implying that many more examples may be needed in practice. Consequently, learning algorithm speed is more important than in standard supervised learning.

Analysis
  Banditron: Incomparable. An online regret analysis showing that if a small hinge loss predictor exists, a bounded number of mistakes occur. Also, an algorithm-independent analysis of the fully realizable case.
  Offset Tree: Incomparable. A learning reduction analysis showing how the regret of any base classifier bounds policy regret. Also contains a lower bound and a comparable analysis of all plausible alternative reductions.

Experiments
  Banditron: 1 dataset, comparing with no other approaches to solving the problem.
  Offset Tree: 13 datasets, comparing with 2 other approaches to solving the problem.

Outcome
  Banditron: Accepted at ICML.
  Offset Tree: Rejected at ICML, NIPS, UAI, and NIPS (a second time).

The reviewers of the Banditron paper made the right call. The subject is interesting, and analysis of a new learning domain is of substantial interest. Real advances in machine learning often come as new domains of application. The talk was well attended and generated substantial interest. It’s also important to remember the reviewers of the two papers probably did not overlap, so there was no explicit preference for A over B.

Why was the Offset Tree rejected? One of these rejections is easily explained as a fluke—we ran into a reviewer at UAI who believes that learning by memorization is the way to go. I, and virtually all machine learning people, disagree, but some reviewers at UAI aren’t interested or expert in machine learning.

The striking thing about the other 3 rejections is that they all involve a reviewer who didn’t read the paper. Instead, the reviewer asserts that learning reductions are bogus because, for an alternative notion of learning reduction made up by the reviewer, an obviously useless approach yields a factor-of-2 regret bound. I believe this is the same reviewer each time, because the alternative theorem statement drifted across the reviews, fixing bugs we pointed out in the author responses.

The first time we encountered this review, we assumed the reviewer was just cranky that day—maybe we weren’t quite clear enough in explaining everything, as it’s always difficult to get every detail clear in new subject matter. I have sometimes had a very strong negative impression of a paper which later turned out to be unjustified upon further consideration. Sometimes when a reviewer is cranky, they change their mind after the authors respond, or perhaps later, or perhaps never, but you get a new set of reviewers the next time.

The second time the review came up, we knew there was a problem. If we are generous to the reviewer, and take into account that learning reduction analysis is a relatively new form of analysis, the fear that our notion of reduction might be vacuous because an alternative notion of reduction is vacuous isn’t too outlandish. Fortunately, there is a way to completely address that—we added an algorithm-independent lower bound to the draft (the only significant change in content across the submissions). This lower bound conclusively proves that our notion of learning reduction is not vacuous, unlike the reviewer’s notion of learning reduction.

The review came up a third time. Despite our pointing out the lower bound quite explicitly, the reviewer simply ignored it. This more or less confirms our worst fears: some reviewer is bidding on the paper with the intent to torpedo-review it, uninterested in and unwilling to read the content itself.

Shouldn’t author feedback address this? Not if the reviewer ignores it.

Shouldn’t double-blind reviewing help? Not if the paper has only one plausible source. The general problem area and method of analysis were freely discussed on hunch.net. We withheld public discussion of the algorithm itself for much of the time (except for a talk at CMU) out of respect for the review process.

Why doesn’t the area chair/program chair catch it? It took us 3 interactions to get it, so it seems unrealistic to expect someone else to get it in one interaction. In general, these people are strongly overloaded, and the reviewer wasn’t kind enough to boil down the essence of the stated objection as I’ve done above. Instead, they phrased it as an example and did not clearly state the theorem they had in mind, or distinguish the fact that the quantification of that theorem differs from the quantification of our theorems. More generally, my observation is that area chairs rarely override negative reviews because:

  1. It risks their reputation since defending a criticized work requires the kind of confidence that can only be inspired by a thorough personal review they don’t have time for.
  2. They may offend the reviewer they invited to review and personally know.
  3. They figure that the average review is similar to the average perception/popularity in the community anyway.
  4. Even if they don’t agree with the reviewer, it’s hard to fully discount the review in their consideration.

I’ve seen these effects create substantial mental gymnastics elsewhere.

Maybe you just ran into a cranky reviewer 3 times randomly? Maybe so. However, the odds seem low enough, and the half-year cost of getting another sample high enough, that going with the working hypothesis seems indicated.

Maybe the writing needs improving? Often that’s a reasonable explanation for a rejection, but in this case I believe not. We’ve run the paper by several people, who did not have substantial difficulties understanding it. They even understood the draft well enough to make a suggestion or two. More generally, no paper is harder to read than the one you have picked because you want to reject it.

What happens next? With respect to the Offset Tree, I’m hopeful that we eventually find reviewers who appreciate an exponentially faster algorithm, good empirical results, the very tight and elegant analysis, or even all three. For the record, I consider the Offset Tree a great paper. It remains a substantial advance on the state of the art, even 2 years later, and as far as I know the Offset Tree (or the Realizable Offset Tree) consistently beats all reasonable contenders in both prediction and computational performance. This is rare and precious, as many papers trade off one for the other. It yields a practical algorithm applicable to real problems. It substantially addresses the RL-to-classification reduction problem. It also has the first nonconstant, algorithm-independent lower bound for learning reductions.

With respect to the reviewer, I expect remarkably little. The system is designed to protect reviewers, so they have virtually no responsibility for their decisions. This reviewer has a demonstrated capability to sabotage the review process at ICML and NIPS and a demonstrated willingness to continue doing so indefinitely. The process of bidding for papers and making up reasons to reject them seems tedious, but there is no fundamental reason why they can’t continue doing so for several decades if they remain active in academia.

This experience has substantially altered my understanding and appreciation of the review process at conferences. The bidding mechanism commonly used, coupled with responsibility-free reviewing, is an invitation to abuse. A clever abusive reviewer can sabotage perhaps 5 papers per conference (out of 8 reviewed) while maintaining a typical average score. While I don’t believe most people choose papers with intent to sabotage, the capability is there, and it is used by at least one person and possibly others. If, for example, 5% of reviewers are willing to abuse the process this way and there are 100 reviewers, every paper must survive 5 vetoes. If there are 200 reviewers, every paper must survive 10 vetoes, and with 400 reviewers, 20 vetoes. This makes publishing any paper that offends someone difficult. The surviving papers are typically inoffensive or part of a fad strong enough that vetoes are held back. Neither category is representative of high-quality decision making. These observations suggest that the conferences with the most reviewers tend strongly toward faddish and inoffensive papers, both of which often lack impact in the long term. Perhaps this partly explains why NIPS is so weak when people start citation counting. Conversely, this would suggest that smaller conferences and workshops have a natural advantage. Similarly, the reviewing style in theory conferences seems better—the set of bidders for any paper is substantially smaller, implying papers must survive fewer vetoes.

This decision-making process can be modeled as a group of n decision makers, each of which has the opportunity to veto any action. When n is relatively small, this process might work OK, depending on the decision makers, but as n grows larger, it’s difficult to imagine a worse decision-making process. The closest representatives outside of academia I know of are deeply bureaucratic governments and other large organizations where many people must sign off on something before it takes place. These vetocracies are universally frustrating to interact with. A reasonable conjecture is that any decision-making process with a large veto number has poor characteristics.

A basic question is: Is a vetocracy inevitable for large organizations? I believe the answer is no. The basic observation is that the value of n can be logarithmic in the number of participants in an organization rather than linear, as it is for reviewing under a bidding process. An essential force driving vetocracy creation is a desire to offload responsibility for decisions, so there is no clear decision maker. A large organization not deciding by vetocracy must have a very different structure, with clearly delineated responsibility.

NIPS provides an almost perfect natural experiment in its workshop organization, which involves the very same community of people and subject matter, yet works in a very different manner. There are one or two workshop chairs who are responsible for selecting amongst workshop proposals, after which the content of the workshop is entirely up to the workshop organizers. If a workshop is rejected, it’s clear who is at fault, and if a workshop presentation is rejected, it is often clear by whom. Some workshop chairs use a small set of reviewers, but even then the effective veto number remains small. Similarly, if a workshop ends up a flop, it’s relatively easy to see who to blame—either the workshop chair for not predicting it, or the organizers for failing to organize. I can’t think of a single time when I attended both the workshops and the conference that the workshops were less interesting than the conference. My understanding is that this observation is common. Given this discussion, it will be particularly interesting to see how the review process Michael and Leon set up for ICML this year pans out, as it is a system with notably more responsibility assignment than in previous years.

Journals end up looking relatively good with respect to vetocracy avoidance. The ones I’m familiar with have a chief editor who bears responsibility for routing papers to an action editor, who bears responsibility for choosing good reviewers. Every agent except the reviewers is often known by the authors, and the reviewers don’t act as additional vetoers in nearly as strong a manner as reviewers with the opportunity to bid.

This experience has also altered my view of blogging and research. On one hand, I’m very enthusiastic about research in general, and my research in particular, where we are regularly cracking conventionally impossible problems. On the other hand, it seems that some small number of people viewing a discussion silently decide they don’t like it, and veto it given the opportunity. It only takes one to turn a strong paper into a years-long odyssey, so public discussion of research directions and topics in a vetocracy is akin to voluntarily wearing a “kick me” sign. While this is a problem for me, I expect it to be even worse for the members of a vetocracy in the long term.

It’s hard to imagine any research community surviving without a serious online presence. When a prospective new researcher looks around at existing research and doesn’t find serious online discussion, they’ll assume it doesn’t exist, under the “not on the internet so it doesn’t exist” principle. This will starve a field of new people. More generally, there is an opportunity to get feedback about research directions and problems much more rapidly than is otherwise possible, allowing us to avoid research on dead-end topics, which are pervasive. At some point, it may even come to seem that people unwilling to discuss their research avoid doing so because it is critically lacking in one way or another. Since a vetocracy creates a substantial disincentive to discuss research directions online, we can expect communities sticking with decision by vetocracy to be at a substantial disadvantage.

Adversarial Academia

One viewpoint on academia is that it is inherently adversarial: there are finite research dollars, positions, and students to work with, implying a zero-sum game between different participants. This is not a viewpoint that I want to promote, as I consider it flawed. However, I know several people believe strongly in this viewpoint, and I have found it to have substantial explanatory power.

For example:

  1. It explains why your paper was rejected based on poor logic. The reviewer wasn’t concerned with research quality, but rather with rejecting a competitor.
  2. It explains why professors rarely work together. The goal of a non-tenured professor (at least) is to get tenure, and a case for tenure comes from a portfolio of work that is indisputably yours.
  3. It explains why new research programs are not quickly adopted. Adopting a competitor’s program is impossible, if your career is based on the competitor being wrong.

Different academic groups subscribe to the adversarial viewpoint in different degrees. In my experience, NIPS is the worst. It is bad enough that the probability of a paper being accepted at NIPS is monotonically decreasing in its quality. This is more than just my personal experience over a number of years, as it’s corroborated by others who have told me the same. ICML (run by IMLS) used to have less of a problem, but as it has become more like NIPS over time, it has inherited this problem. COLT has not suffered from this problem as much in my experience, although it has had other problems related to its focus being defined too narrowly. I do not have enough experience with UAI or KDD to comment there.

There are substantial flaws in the adversarial viewpoint.

  1. The adversarial viewpoint makes you stupid. When viewed adversarially, any idea has crippling disadvantages and no advantages. Contorting your viewpoint enough to make this true damages your ability to conduct research. In short, it promotes poor mental hygiene.
  2. Many activities become impossible. Doing research is in general extremely hard, so there are many instances where working with other people can allow you to do things which are otherwise impossible.
  3. The previous two disadvantages apply even more strongly for a community—good ideas are more likely to be missed, change comes slowly, and often with steps backward.
  4. At its most basic level, the assumption that research is zero-sum is flawed, because the process of research is not done in a closed system. If society at large discovers that research is valuable, then the budget increases.

Despite these disadvantages, there is a substantial advantage as well: you can materially protect and aid your career by rejecting papers, preventing grants, and generally discriminating against key people doing interesting but competitive work.

The adversarial viewpoint has validity in proportion to the number of people subscribing to it. For those of us who would like to deemphasize the adversarial viewpoint, what’s unclear is: how?

One concrete thing is: use Arxiv. For a long time, physicists have adopted an Arxiv-first philosophy, which I’ve come to respect. Arxiv functions as a universal timestamp which decreases the power of an adversarial reviewer. Essentially, you avoid giving away the power to muddy the track of invention. I’m expecting to use Arxiv for essentially all my past-but-unpublished and future papers.

It is plausible that limiting the scope of bidding, as Andrew McCallum suggested at the last ICML, and as is effectively implemented at this ICML, will help. The system of review at journals might also help for the same reason. In my experience as an author, if an anonymous reviewer wants to kill a paper they usually succeed. Most area chairs or program chairs are more interested in avoiding conflict with the reviewer (who they picked and may consider a friend) than reading the paper to determine the illogic of the review (which is a difficult task that simply cannot be done for all papers). NIPS experimented with a reputation system for reviewers last year, but I’m unclear on how well it worked, as an author’s score for a review and a reviewer’s score for the paper may be deeply correlated, revealing little additional information.

Public discussion of research can help with this, because very poor logic simply doesn’t stand up under public scrutiny. While I hope to nudge people in this direction, it’s clear that most people aren’t yet comfortable with public discussion.