ICML: Behind the Scenes

This is a rather long post, detailing the ICML 2012 review process. The goal is to make the process more transparent, help authors understand how we came to a decision, and discuss the strengths and weaknesses of this process for future conference organizers.

Microsoft’s Conference Management Toolkit (CMT)
We chose to use CMT over other conference management software mainly because of its rich toolkit. The interface is sub-optimal (to say the least!) but it has extensive capabilities (to handle bids, author response, resubmissions, etc.), good import/export mechanisms (to process the data elsewhere), and excellent technical support (answering late-night emails, adding new functionality). Overall, it was the right choice, although we hope a designer will look at that interface sometime soon!

Toronto Matching System (TMS)
TMS is now being used by many major conferences in our field (including NIPS and UAI). It is an automated system (developed by Laurent Charlin and Rich Zemel at U. Toronto) that matches reviewers to papers based on an analysis of each reviewer’s publications. TMS collects publications from reviewers, parses them into features, and applies unsupervised or supervised learning techniques to predict the relevance of any target paper for any reviewer. We arranged for TMS to be integrated with CMT and funded Laurent’s work on the integration. Reviewers were asked to enter a publication list for TMS to parse. For those who failed to do so (after many reminders!), we manually added that information from public sources.
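
As a rough illustration of the unsupervised variant (this is not the actual TMS pipeline, just a minimal sketch assuming plain-text publications and abstracts are available), reviewer-paper relevance can be estimated by comparing a reviewer’s publication text with each submission:

```python
# Minimal sketch of TMS-style unsupervised reviewer-paper matching.
# NOT the actual TMS code; it only illustrates the general idea of
# turning publications into features and scoring relevance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relevance_scores(reviewer_pubs, submission_abstracts):
    """reviewer_pubs: one concatenated publication record (string) per reviewer.
    submission_abstracts: one string per submitted paper.
    Returns a (num_reviewers x num_papers) matrix of relevance scores."""
    vectorizer = TfidfVectorizer(stop_words="english", max_features=50000)
    # Fit a shared vocabulary over reviewer publications and submissions.
    X = vectorizer.fit_transform(reviewer_pubs + submission_abstracts)
    R = X[:len(reviewer_pubs)]      # reviewer profiles
    P = X[len(reviewer_pubs):]      # paper representations
    return cosine_similarity(R, P)  # higher score = more relevant
```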

The Program Committee
Recruiting a program committee that is both large and highly qualified is difficult these days. We sent out 69 area chair invitations; 50 (highly qualified!) people accepted. Each of these area chairs was asked to nominate a list of potential reviewers. We sent out approximately 700 invitations for program committee members; 389 accepted. A number of additional PC members were recruited during the review process (most of them for 1-2 papers), for a total of 470 active PC members. In terms of seniority, the final PC was roughly 15% students, 80% researchers, and 5% other.

The Surge (ICML + 50%)
The first big challenge came at the submission deadline. In the past few years, ICML had consistently received ~550-600 submissions. This year, we had a 50% increase, to 890 submissions. We had recruited a PC that could comfortably handle 700 papers; dealing with an extra 200 papers was not an easy task.

About 10 submissions were rejected without review for various reasons (severe formatting issues, extra pages, failure to anonymize).

Bidding
An unsupervised version of TMS was used to generate a list of candidate papers for each reviewer and area chair. This was done working closely with Laurent Charlin of TMS, using validation on previous NIPS data. CMT did not have the functionality to show a good list of candidate papers to reviewers, so we crafted an interface to show this list and let reviewers use it in conjunction with CMT. Ideally, this will be better incorporated into CMT in the future.

When you ask a group of scientists to run a conference, you should expect a few experiments to take place. And so we decided to assess the usefulness of TMS scoring for generating lists of papers to bid on. To do this, we randomly assigned PC members to one of three groups. One group saw a list based purely on TMS scores. Another group received a list based on the match between their subject areas and those of the papers (referred to as the “relevance” score in CMT). The third group received a list based on a mix of both TMS and relevance. Reviewers were allowed to bid on any paper (excluding those with which they had a conflict); the lists were provided to help them sort efficiently through the large number of papers. We then compared each reviewer’s set of bids with the list of suggestions and measured the correspondence.

The following is the Discounted Cumulative Gain (DCG) of each list with respect to the bidding scores, averaged separately over each group. Note that each group was only presented with its corresponding list, not the others.

                        Group: CMT               Group: TMS               Group: CMT+TMS
Sorting by CMT scores   6.11 out of 12.64 (48%)  4.98 out of 13.63 (36%)  4.87 out of 13.55 (35%)
Sorting by TMS score    4.06 out of 12.64 (32%)  6.43 out of 13.63 (47%)  5.72 out of 13.55 (42%)
Sorting by TMS+CMT      4.77 out of 12.64 (37%)  6.11 out of 13.63 (44%)  6.71 out of 13.55 (49%)
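
For reference, DCG rewards placing papers with high bid values near the top of a list, discounting by rank; the “out of” values are presumably the best achievable DCG for each group. The exact gain and discount used for the numbers above are not spelled out here; a standard formulation, as a minimal sketch with hypothetical bid values, is:

```python
import math

def dcg(relevances):
    """Discounted Cumulative Gain of a ranked list.
    relevances: bid scores of the suggested papers, in the order shown to the reviewer."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

# Hypothetical bid values for the top-5 papers on one reviewer's suggestion list.
suggested_bids = [3, 0, 2, 1, 0]
ideal = sorted(suggested_bids, reverse=True)
print(dcg(suggested_bids), dcg(ideal))  # achieved DCG vs. best possible ("out of") DCG
```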

A micro-survey was also run to collect further information on how users liked their shortlist. 85% of the participants indicated that they had used the list interface provided to them. The following is the preference indicated by each group (~75 reviewers per group, ~2% error):

                           Group: CMT   Group: TMS   Group: CMT+TMS
Preferred CMT over list    15%          12%          8%
Preferred list+CMT         81%          83%          83%
Preferred list over CMT    4%           5%           9%

It is obvious from the above that most participants found the list useful in conjunction with CMT (suggesting that the list should be integrated inside CMT). We can also see that those who were presented with a list based on TMS scores were more likely to find the list useful.

Note that all of the above was done in one long, hectic, but fun weekend.

Imputing Missing Bids
CMT assumes that a reviewer is not willing to review a paper unless stated otherwise. It does not differentiate between an unseen (but potentially relevant) paper and a paper that has been seen and ignored. This is a real shortcoming when it comes to matching papers to reviewers, especially for those reviewers who did not bid often. To mitigate this problem, we used the click information from the shortlist presented to reviewers to find out which papers had been viewed and ignored. We then imputed these cases as explicit not-willing bids.
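
A minimal sketch of this imputation, assuming we have per-reviewer click logs from the shortlist interface (the data structures and the not-willing encoding below are hypothetical stand-ins):

```python
# Turn "seen but not bid on" shortlist entries into explicit not-willing bids.
# The dictionaries are hypothetical stand-ins for the real bid and click logs.
NOT_WILLING = 0  # assumed encoding for a "not willing" bid

def impute_from_clicks(bids, clicked):
    """bids: dict mapping (reviewer, paper) -> bid value actually entered.
    clicked: dict mapping reviewer -> set of paper ids viewed on the shortlist.
    Returns a copy of bids with viewed-but-ignored papers imputed as not-willing."""
    imputed = dict(bids)
    for reviewer, papers_seen in clicked.items():
        for paper in papers_seen:
            if (reviewer, paper) not in imputed:
                imputed[(reviewer, paper)] = NOT_WILLING
    return imputed
```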

Around 30 reviewers did not provide any bids (and many had only a few). This is problematic because the tools used to do the actual reviewer-paper matching tend to assign the papers without any bids to the reviewers who did not bid, regardless of the match in expertise.

Once the bidding information was in and the imputation was done, we then had to fill in the rest of the paper-reviewer bidding matrix to mitigate the problem of sparse bidders. This was done, once again, through TMS, but this time using a supervised learning approach.

Using supervised learning was more delicate than expected. To deal with the wildly varying number of bids per person, we imputed zero bids, first from papers that were plausibly skipped over, and if necessary at random from papers not bid on, such that each person had the same expected bid in the dataset. From this dataset, we held out a random bid per person, and then trained to predict the held-out bid well. Most optimization approaches performed poorly because the number of features greatly exceeded the number of labels. The best approach we found used the online algorithms in Vowpal Wabbit with a mass personalized training method similar to the one discussed here. This trained predictor was used to predict bid values for the full paper-reviewer bid matrix.
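
To make the setup a bit more concrete, here is a hedged sketch of how such training data might be written in Vowpal Wabbit’s input format, with reviewer features in one namespace and paper features in another so that VW’s quadratic interactions can supply personalized cross terms. The feature names, files, and command-line options are illustrative, not the exact ones used:

```python
# Hypothetical sketch of preparing Vowpal Wabbit training data for bid prediction.
# Reviewer features go in namespace 'u', paper features in namespace 'p';
# quadratic interactions between the two namespaces give personalized cross terms.

def vw_example(bid, reviewer_feats, paper_feats):
    return f"{bid} |u {' '.join(reviewer_feats)} |p {' '.join(paper_feats)}"

# Toy stand-ins for the real data (the actual features came from TMS-parsed publications).
reviewer_features = {"r1": ["bandits", "exploration"], "r2": ["kernels", "svm"]}
paper_features = {"p7": ["contextual", "bandits"], "p9": ["kernel", "regression"]}
training_bids = {("r1", "p7"): 3, ("r1", "p9"): 0, ("r2", "p9"): 2}

with open("bids.vw", "w") as f:
    for (reviewer, paper), bid in training_bids.items():
        f.write(vw_example(bid, reviewer_features[reviewer], paper_features[paper]) + "\n")

# Training and prediction might then look like (flags illustrative, not the exact ones used):
#   vw bids.vw -q up --passes 5 --cache_file bids.cache -f bid_model.vw
#   vw -t -i bid_model.vw full_matrix.vw -p predicted_bids.txt
```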

Automated Area Chair and First Reviewer Assignment
Once we had the imputed paper-reviewer bidding matrix, CMT was used to generate the actual match between papers and area chairs, and (separately) between papers and reviewers. Each paper had two area chairs (sometimes called “meta-reviewers” in CMT) assigned to it, one primary, one secondary, by running two rounds of assignments (so that the primary was usually the “better” match). One reviewer per paper was also assigned automatically by CMT in a similar fashion. CMT provides proper load balancing, so that all area chairs and reviewers had similar loads.
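
CMT’s matching algorithm is not public; conceptually, this step chooses assignments that favor high (predicted) bids while keeping loads balanced. The toy greedy sketch below illustrates the idea only (a real system would use a proper optimization such as min-cost matching), with hypothetical scores:

```python
# Toy sketch of load-balanced assignment (NOT CMT's actual algorithm): greedily give
# each paper its highest-scoring available reviewers subject to a per-reviewer cap.
def assign(scores, reviewers_per_paper=1, max_load=7):
    """scores: dict mapping (paper, reviewer) -> predicted bid/relevance score.
    Returns a dict mapping each paper to its list of assigned reviewers."""
    load, assignment = {}, {}
    papers = {p for p, _ in scores}
    for paper in sorted(papers):
        ranked = sorted(((s, r) for (p, r), s in scores.items() if p == paper), reverse=True)
        chosen = []
        for _, reviewer in ranked:
            if load.get(reviewer, 0) < max_load:
                chosen.append(reviewer)
                load[reviewer] = load.get(reviewer, 0) + 1
            if len(chosen) == reviewers_per_paper:
                break
        assignment[paper] = chosen
    return assignment

# Example with toy scores: revA is capped at one paper, so paper2 falls to revB.
scores = {("paper1", "revA"): 0.9, ("paper1", "revB"): 0.4,
          ("paper2", "revA"): 0.8, ("paper2", "revB"): 0.7}
print(assign(scores, reviewers_per_paper=1, max_load=1))
# -> {'paper1': ['revA'], 'paper2': ['revB']}
```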

Manual Checks of the Automated Assignments
Before finalizing the automated assignment, we manually looked through the list of papers to fix any potential problems that were not handled by the automated process. The two major cases were papers that did not go through the TMS system (because the authors did not agree to it), and poor primary-secondary meta-reviewer pairs (where the two area chairs were judged to be too close to offer independent assessments, e.g., working at the same institution, or a previous supervisor-student relationship).

Second and Third Reviewer Assignment
Once the initial assignments were announced, we asked the two area chairs for a given paper to each manually assign another reviewer from the PC. To help area chairs with this, we generated a shortlist of 10 recommended reviewers for each paper (using the estimated bid matrix and TMS scores, with the CMT matching algorithm for load balancing of reviewer suggestions). Area chairs were free to use this list, to select from the complete program committee, or to seek an outside reviewer, who was then added to the PC (an option used 80 times). The load for each reviewer was restricted to at most 7 papers, with exceptions when they explicitly agreed to more.

The second and third uses of TMS, including the new supervised learning system, led to another long, hectic weekend with Laurent, Mahdi, Joelle, and John all deeply involved.

Reviews
Most papers received at least 3 full reviews in the first round. Reviewers could not see each other’s reviews until they had submitted their own. ML-Journaled submissions (see the double submission guide) were reviewed only by two area chairs. For a small number of regular submissions (fewer than 10), we received two very negative reviews and notified the third reviewer (who was usually late by this point!) that we would not need their review.

Authors’ Response
Authors were given a chance to respond to the reviews during a short feedback period. This is becoming a standard practice in machine learning conferences. Authors were also allowed to upload a new version of the paper. The motivation here is that in some cases, it is easier to show the changes directly in the paper, rather than discuss them separately.

Our analysis shows that authors’ responses and the subsequent reviewer discussions significantly changed the scoring of papers. A total of ~35% of the papers had some change in their scores after the author feedback. The average score went down for ~50% of the papers, stayed the same for ~10%, and went up for the other ~40%. The variance of the scores decreased by ~20%, indicating some convergence in the decisions.

Final Decisions
To help us better decide on the quality of the papers, we asked the primary area chairs to provide a meta-review for each of their papers. For papers without a unanimous review decision (i.e., some reviewers wanted to accept and some wanted to reject), we asked the secondary area chair to (independently) fill in a meta-review, recommending whether to accept or reject the paper. A total of 1214 meta-reviews were provided. There were also 20 papers for which a 4th review was added during this period.

In all cases where the primary and secondary area chairs disagreed on the decision, the program chairs were directly involved, reviewing all the evidence (reviews, rebuttal, discussion, often the paper itself) and entering into a discussion (usually via email) with the area chairs until a unanimous decision was reached.
A total of 243 papers (27% of submissions) were accepted. Author notifications were sent out on April 30.

ICML author feedback is open

as of last night, late.

When the reviewing deadline passed Wednesday night, 15% of reviews were still missing, much higher than I expected. Between late reviews coming in, ACs working overtime through the weekend, and people willing to help in a pinch, another ~390 reviews came in, reducing the missing mass to 0.2%. Nailing that last bit, plus a similar quantity of papers with uniformly low-confidence reviews, is what remains to be done in terms of basic reviews. We are trying to make all of those happen this week so authors have some chance to respond.

I was surprised by the quantity of late reviews, and I think that's an area where ICML needs to improve in future years. Good reviews are not done in a rush; they are done by setting aside time (like an afternoon) and carefully reading the paper while thinking about implications. Many reviewers do this well, but a significant minority aren't good at scheduling their personal time. In this situation there are several ways to fail:

  1. Give early warning and bail.
  2. Give no warning and finish not-too-late.
  3. Give no warning and don’t finish.

For Program Chairs and Area Chairs, the worst failure mode by far is the last one, because they must catch and fix all of the failures at the last minute. I expect the second failure mode also hurts the quality of reviews, because high-speed reviewing of a deep paper often doesn't work. This issue is one of community norms, which can only be adjusted slowly. To help with this, we're going to pass a flake list for failure mode 3 to future program chairs, who will hopefully further encourage people to schedule time well and review carefully.

If my experience is any guide, plenty of authors will feel disappointed by the reviews. Part of this is simply because it's the first time the authors have had contact with people not biased towards agreeing with them, as almost all friends are. Part of it is the significant hurdle of communicating new technical ideas well. Part may be too-hasty reviews, as discussed above. And part of it may be that the authors are simply far more expert in their subject than the reviewers.

In author responses, my personal tendency is to be blunter than most people when reviewers make errors. Perhaps "kind but clear" is a good viewpoint. You should be sympathetic to reviewers who have voluntarily put significant time into reviewing your paper, but you should also use the channel to communicate real information. Remotivating your paper almost never works, so concentrate on correcting reviewers' errors in understanding or answering their direct questions.

We did not include reviewer scores in author feedback, although we do plan to include them when the decision is made. Scores should not be regarded as final by any party, since author feedback and discussion can significantly alter a reviewer’s understanding of the paper. Encouraging reviewers to incorporate this additional information well before settling on a final score is one of my goals.

We did allow resubmission of the paper with the author response, similar to what Geoff Gordon did as program chair for AIStat. This solves two problems: it helps authors create a more polished draft, and it avoids forcing the communication through an overly constrained channel. If an equation has a bug, you can write it out bug-free in mathematical notation rather than trying to describe by reference how to alter the equation in the author response.

Please comment if you have further thoughts.

COLT/ICML Open Questions and ICML Instructions

Sasha is the open problems chair for both COLT and ICML. Open problems will be presented in a joint session on the evening of the COLT/ICML overlap day. COLT has a history of open problem sessions, but this is new for ICML. If you have a difficult, theoretically definable problem in machine learning, consider submitting it for review, due March 16. You'll benefit in three ways:

  1. The effort of writing down a precise formulation of what you want often helps you understand the nature of the problem.
  2. Your problem will be officially published and citable.
  3. You might have it solved by some very intelligent bored people.

The general idea could easily be applied to any problem which can be crisply stated with an easily verifiable solution, and we may consider expanding this in later years, but for this year all problems need to be of a theoretical variety.

Joelle and I (and Mahdi, and Laurent) finished an initial assignment of Program Committee and Area Chairs to papers. We’ll be updating instructions for the PC and ACs as we field questions. Feel free to comment here on things of plausible general interest, but email us directly with specific concerns.

ICML+50%

The ICML paper deadline has passed. Joelle and I were surprised to see the number of submissions jump by about 50% from last year, to around 900. Only a tiny portion of these are immediate rejects (*), so this is a much larger set of papers than expected. The number of workshop submissions also doubled compared to last year, so ICML may grow significantly this year, if we can manage to handle the load well. The prospect of making 900 good decisions is fundamentally daunting, and success at this point will rely heavily on the program committee and area chairs.

For those who want to rubberneck a bit more, here's a breakdown of submissions by primary topic:

66 Reinforcement Learning
52 Supervised Learning
51 Clustering
46 Kernel Methods
40 Optimization Algorithms
39 Feature Selection and Dimensionality Reduction
33 Learning Theory
33 Graphical Models
33 Applications
29 Probabilistic Models
29 NN & Deep Learning
26 Transfer and Multi-Task Learning
25 Online Learning
25 Active Learning
22 Semi-Supervised Learning
20 Statistical Methods
20 Sparsity and Compressed Sensing
19 Ensemble Methods
18 Structured Output Prediction
18 Recommendation and Matrix Factorization
18 Latent-Variable Models and Topic Models
17 Graph-Based Learning Methods
16 Nonparametric Bayesian Inference
15 Unsupervised Learning and Outlier Detection
12 Gaussian Processes
11 Ranking and Preference Learning
11 Large-Scale Learning
9 Vision
9 Social Network Analysis
9 Multi-agent & Cooperative Learning
9 Manifold Learning
8 Time-Series Analysis
8 Large-Margin Methods
8 Cost Sensitive Learning
7 Recommender Systems
7 Privacy, Anonymity, and Security
7 Neural Networks
7 Empirical Insights into ML
7 Bioinformatics
6 Information Retrieval
6 Evaluation Methodology
<5 each Text Mining, Rule and Decision Tree Learning, Graph Mining, 
    Planning & Control, Monte Carlo Methods, Inductive Logic Programming & Relational Learning, 
    Causal Inference, Statistical and Relational Learning, NLP, Hidden Markov Models, 
    Game Theory, Robotics, POMDPs, Geometric Approaches, Game Playing, Data Streams, 
    Pattern Mining & Inductive Querying, Meta-Learning, Evolutionary Computation

(*) Deadlines are magical, because they galvanize groups of people into concentrated action. But they have to be real deadlines to achieve this, which leads us to reject late submissions and format failures in order to keep the deadline real for future ICMLs. This is uncomfortably rough at times.

ICML Posters and Scope

Normally, I don’t indulge in posters for ICML, but this year is naturally an exception for me. If you want one, there are a small number left here, if you sign up before February.

It also seems worthwhile to give some sense of the scope and reviewing criteria for ICML for authors considering submitting papers. At ICML, the (very large) program committee does the reviewing which informs final decisions by area chairs on most papers. Program chairs set up the process, deal with exceptions or disagreements, and provide advice for the reviewing process. Providing advice is tricky (and easily misleading) because a conference is a community, and in the end the aggregate interests of the community determine the conference. Nevertheless, as a program chair this year it seems worthwhile to state the overall philosophy I have and what I plan to encourage (and occasionally discourage).

At the highest level, I believe ICML exists to further research into machine learning, which I generally think of as turning observations into useful predictions. Research is greatly varied in general, but in all cases it involves answering an interesting question for which the answer was not previously known. Interesting questions are generally natural: they can be stated easily and other people plausibly encounter them. Interesting questions are generally also ones for which there are multiple plausible wrong answers. The definition of "interesting" is otherwise hard to pin down, because it does and must change over time.

ICML is a broad conference which incorporates the interests of many different groups of people with different tastes in the research they prefer. It's broad enough that most people don't appreciate all the papers. That's OK as long as there is some higher-level appreciation of which directions of research benefit the community. Some common flavors are:

  1. ML for X. In general, Machine Learning is a core field of study with many applications. Often, it's a good idea to publish within a conference focused on that application area, but particularly when no such conference exists, ICML is a solid choice for a place to publish. One example of this kind of thing is Machine Learning for Sustainability, where the CCC will be giving a few travel grants. Here the core question is typically "How?" Exhibiting new things that you can do with ML provides good reference points for what is possible and a sense of what works, and compelling new ideas about what to work on can be valuable to the community.

    There are several ways that papers of this sort can bounce. Perhaps X is insufficiently interesting, the results are unconvincing, or the method of solution is considered too straightforward. I consider the first and second criteria sound, but am inclined toward leniency on the third, since there is often quite a bit of work in figuring out how to frame the problem so that the solution happens to be easy.

  2. New Algorithms. Often, authors find that existing learning algorithms for solving some problem are lacking in some way, so they propose new, better algorithms. This is plausibly the most common category of paper at ICML, so there is quite a bit of variety. The most straightforward version proposes a new algorithm for a well-studied problem. For these papers it's important to have an empirical comparison to existing baselines.

    It's easy for an empirical comparison to go wrong. Some authors use synthetic datasets, which do not seem significant to me, because good results on such datasets may not transfer well to real-world problems, as the real world tends to be quite a bit more complex than the synthetic processes which are natural to program. Instead, it's important to show good results on real datasets. One problem with relying on real datasets is dataset selection: choosing the dataset on which your algorithm seems to perform best. You can avoid this by choosing datasets in some clearly unbiased manner and by evaluating on many standard datasets. Another way to fail is with a poor choice of baseline. This is tricky, because three reviewers might consider three different baselines the most natural one. Asking around a bit when developing the paper might help here, but in the end this can be a tough judgment call: is the paper convincing enough that people interested in solving the problem should use this algorithm?

    Another class of new-algorithms papers is new algorithms for new areas of machine learning, blending into the previous category. Here, there are typically relatively few datasets available (perhaps just one), and there may be no baselines (or only implausibly bad ones). For papers like this, one way I've seen difficulties arise is when authors are very invested in a particular approach to solving the problem. If you have defined the problem too narrowly, broadening the definition of the problem can help you see appropriate baselines. Another difficulty I've observed is that reviewers used to well-studied problems reject an interesting paper because (essentially) they assume the authors left out a good baseline which does not actually exist. To prevent the first, authors who ask around might get some valuable early feedback. For the second, it's a difficulty we are aware of and will consider asking reviewers to judge such papers on the merits of ML for X.

  3. Algorithmic studies. A relatively rare but potentially valuable form of paper is an algorithmic study. Here, the authors do not propose a new algorithm, but instead do a comprehensive empirical comparison of different algorithms. The standards here are quite high: the empirical comparison needs to be first-class to convince people, so the comments on empirical comparisons under new algorithms apply strongly.
  4. New Theory. Good theory can enlighten us about what is (or might be) possible. It can also help us build robust learning algorithms, where we design learning algorithms so that they provably solve some large class of problems. I am personally most interested in theory that helps us design new learning algorithms, but broadly interested in what is possible. I'm most interested in the question answered, while the means (and language) should only be as complex as necessary so that the theory can be understood as widely as possible.

    In many areas of CS theory, double-blind reviewing is rare, so theory-oriented people may be unfamiliar with it. An important consequence is that complete proofs must be included, either in the paper or in supplemental material, so that proof checking is fully feasible.

    Another way that I've seen theory papers run into trouble is when the theory is a post-hoc justification for an algorithm. In essence, authors who choose to analyze an existing algorithm are sometimes forced to make many unnatural assumptions for the theory to be correct. There generally isn't an easy fix if you arrive at this point.

  5. n of the above. It is common for ICML papers to span multiple categories. At the extreme, you might have a new algorithm which solves a new X well, both empirically and theoretically. Reviewers can fall into a trap where they are most interested in one of the four questions above, and find the quarter of the paper devoted to their question relatively weak compared to a paper that devotes all of its pages to that same question.

    We are aware of this, and will encourage it to be taken into account.

  6. The exception. The set of papers I expect to see at ICML is more diverse than the above; there are often exceptions of one sort or another. For these exceptions, it often becomes a judgment call: does this paper significantly further research into machine learning? Papers with little potential audience probably don't, while fun/interesting/useful things that we didn't think of do.

Further comments or questions are welcome.