ICML Reviewing Criteria

Michael Littman and Leon Bottou have decided to use a franchise program chair approach to reviewing at ICML this year. I’ll be one of the area chairs, so I wanted to mention a few things if you are thinking about naming me.

  1. I take reviewing seriously. That means papers to be reviewed are read, the implications are considered, and decisions are only made after that. I do my best to be fair, and there are zero subjects that I consider categorical rejects. I also don’t consider the various common arguments for rejecting a paper on grounds other than its merits to be reasonable.
  2. I am generally interested in papers that (a) analyze new models of machine learning, (b) provide new algorithms, and (c) show that they work empirically on plausibly real problems. If a paper has the trifecta, I’m particularly interested. With 2 out of 3, I might be interested. I often find papers with only one element harder to accept, including papers with just (a).
  3. I’m a bit tough. I rarely jump-up-and-down about a paper, because I believe that great progress is rarely made. I’m not very interested in new algorithms with the same theorems as older algorithms. I’m also cautious about new analysis for older algorithms, since I like to see analysis driving the algorithm rather than vice versa. I prioritize a proof-of-possibility over a quantitative improvement. I consider quantitative improvements of small constant factors in sample complexity significant. For computational complexity, I generally want to see at least an order of magnitude improvement. I generally disregard any experiments on toy data, because I’ve found that toy data and real data can too easily differ in their behavior.
  4. My personal interests are pretty well covered by existing papers, but this is perhaps not too important a criterion compared to the above, as I easily believe other subjects are interesting.

A Healthy COLT

A while ago, we discussed the health of COLT. COLT 2008 substantially addressed my concerns. The papers were diverse and several were interesting. Attendance was up, which is particularly notable in Europe. In my opinion, the colocation with UAI and ICML was the best colocation since 1998.

And, perhaps best of all, registration ended up being free for all students due to various grants from the Academy of Finland, Google, IBM, and Yahoo.

A basic question is: what went right? There seem to be several answers.

  1. Cost-wise, COLT had sufficient grants to alleviate the high cost of the Euro, and the location at a university substantially reduced the cost compared to a hotel.
  2. Organization-wise, the Finns were great with hordes of volunteers helping set everything up. Having too many volunteers is a good failure mode.
  3. Organization-wise, it was clear that all 3 program chairs were cooperating in designing the program.
  4. Facilities-wise, proximity in time and space made the colocation much more real than many others have been in the past.
  5. Program-wise, COLT notably had two younger program chairs, Tong and Rocco, which seemed to work well.

New York’s ML Day

I’m not as naturally exuberant as Muthu or David about CS/Econ day, but I believe it and ML day were certainly successful.

At the CS/Econ day, I particularly enjoyed Tuomas Sandholm’s talk, which showed a commanding depth of understanding and application in automated auctions.

For the machine learning day, I enjoyed several talks and posters (I’d better have, since I helped pick them). What stood out to me was the number of people attending: 158 registered, a level qualifying as “scramble to find seats”. My rule of thumb for workshops/conferences is that the number of attendees is often something like the number of submissions. That isn’t the case here, where there were just 4 invited speakers and 30-or-so posters. Presumably, the difference is due to a critical mass of people interested in Machine Learning in the area and the ease of their attendance.

Are there other areas where a local Machine Learning day would fly? It’s easy to imagine something working out in the San Francisco bay area and possibly Germany or England.

The basic formula for the ML day is that a committee picks a few people to give talks, and posters are invited, with some poster presenters giving short presentations. The CS/Econ day was similar, except they managed to let every submitter do a presentation. Are there tweaks to the format which would improve things?

Who is Responsible for a Bad Review?

Although I’m greatly interested in machine learning, I think it must be admitted that there is a large amount of low quality logic being used in reviews. The problem is bad enough that sometimes I wonder if the Byzantine generals limit has been exceeded. For example, I’ve seen recent reviews where the given reasons for rejecting are:

  1. [NIPS] Theorem A is uninteresting because Theorem B is uninteresting.
  2. [UAI] When you learn by memorization, the problem addressed is trivial.
  3. [NIPS] The proof is in the appendix.
  4. [NIPS] This has been done before. (… but not giving any relevant citations)

Just for the record, I want to point out what’s wrong with these reviews. A future world in which such reasons never come up again would be great, but I’m sure these errors will be committed many times more in the future.

  1. This is nonsense. A theorem should be evaluated based on its merits, rather than the merits of another theorem.
  2. Learning by memorization requires an exponentially larger sample complexity than many other common approaches that often work well (see the sketch after this list). Consequently, what is possible under memorization does not have any substantial bearing on common practice or what might be useful in the future.
  3. Huh? To other authors: thank you for putting the proof in the appendix, so the paper reads better. It seems absurd to base a decision on the placement of the content rather than the content.
  4. This is a red flag for a bogus review. Every time I’ve seen a review (as an author or a fellow reviewer) where such claims are made without a concrete citation, they are false. Often they are false even when concrete citations are given.
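
For concreteness, here is a back-of-the-envelope version of the sample complexity gap in point 2. This is my own illustrative sketch of the standard textbook-style comparison in the realizable PAC setting with a finite hypothesis class, not something taken from the reviews or papers in question.

    % Sample complexity, realizable PAC setting, finite hypothesis class H,
    % target accuracy \epsilon, confidence 1 - \delta (standard Occam bound):
    m_{\text{Occam}} = O\!\left( \frac{1}{\epsilon} \left( \ln|H| + \ln\frac{1}{\delta} \right) \right)
    % Memorization over n boolean inputs must observe essentially all 2^n
    % distinct inputs when the distribution spreads over them:
    m_{\text{memorize}} = \Omega\!\left( 2^{n} \right)
    % The gap: \ln|H| = O(n) for many natural classes (e.g. conjunctions),
    % versus 2^n for memorization, an exponential difference.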

A softer version of (4) is when someone is cranky because their own paper wasn’t cited. This is understandable, but a more appropriate response seems to be pointing things out and reviewing anyway. This avoids creating the extra work (for authors and reviewers) of yet another paper resubmission, and reasonable authors do take such suggestions into account.

NIPS figures fairly prominently here. While these are all instances in the last year, my experience after interacting with NIPS for almost a decade is that the average quality of reviews is particularly low there—in many instances reviewers clearly don’t read the papers before writing the review. Furthermore, such low quality reviews are often the deciding factor for the paper decision. Blaming the reviewer seems to be the easy solution for a bad review, but a bit more thought suggests other possibilities:

  1. Area Chair: In some conferences, an “area chair” or “senior PC” member makes or effectively makes the decision on a paper. In general, I’m not a fan of activist area chairs, but when a reviewer isn’t thinking well, I think it is appropriate to step in. This rarely happens, because the easy choice is to simply accept the negative review. In my experience, many Area Chairs are eager to avoid any substantial controversy, and there is a general tendency to believe that something must be wrong with a paper that has a negative review, even if it isn’t what was actually pointed out.
  2. Program Chair: In smaller conferences, Program Chairs play the same role as the area chair, so all of the above applies, except now you explicitly know the person’s name, making them easier to blame. This is a little bit too tempting, I think. For example, I know David McAllester understands that learning by memorization is a bogus reference point, and probably he was just too busy to really digest the reviews. However, a Program Chair is responsible for finding appropriate reviewers for papers, and doing so (or not) has a huge impact on whether a paper is accepted. Not surprisingly, if a paper about the sample complexity of learning is routed to people who have never seen a proof involving sample complexity before, the reviews tend to be spuriously negative (and the paper unread).
  3. Author: A reviewer might blame an author if it turns out later that the reasons given in the review for rejection were bogus. This isn’t absurd—writing a paper well is hard and it’s easy for small mistakes to be drastically misleading in technical content.
  4. Culture: A conference has a culture associated with it that is driven by the people who keep coming back. If that culture considers it ok to do all the reviews on the last day, it’s unsurprising to see reviews that lack critical thought and could have been written without reading the paper. Similarly, it’s unsurprising to see little critical thought at the area chair level, or in the routing of papers to reviewers. This answer is pretty convincing: it explains why low quality reviews keep happening year after year at a conference.

If you believe the Culture reason, then what’s needed is a change in the culture. The good news is that this is both possible and effective. There are other conferences where reviewers expect to spend several hours reviewing a paper. In my experience this year, this was true of COLT and of my corner of SODA. Effecting the change is simply a matter of community standards, and that is just a matter of leaders in the community leading.

The SODA Program Committee

Claire asked me to be on the SODA program committee this year, which was quite a bit of work.

I had a relatively light load—merely 49 theory papers. Many of these papers were not on subjects that I was expert about, so (as is common for theory conferences) I found various reviewers that I trusted to help review the papers. I ended up reviewing about 1/3 personally. There were a couple instances where I ended up overruling a subreviewer whose logic seemed off, but otherwise I generally let their reviews stand.

There are some differences in standards for paper reviews between the machine learning and theory communities. In machine learning it is expected that a review be detailed, while in the theory community this is often not the case. Every paper given to me ended up with a review varying between somewhat and very detailed.

I’m sure not every author was happy with the outcome. While we did our best to make good decisions, they were difficult decisions to make. For example, if there is a well written paper on an interesting topic which analyzes a flawed abstraction of the topic, should it get in? I would rate this a ‘weak accept’.

Here are some observations/thoughts about the process (several also appear in Claire’s report).

  1. Better feedback isn’t too hard. The real time sink in reviewing a theory paper is reading it. Leaving a few comments, even if just “I don’t like the model analyzed because it misses important feature X,” is relatively easy. My impression of the last COLT was that COLT had entirely switched from minimal author feedback to substantial author feedback. This year’s SODA was somewhere in between, depending on the PC member involved, which is a definite trend towards stronger comments for SODA.
  2. Normalization: There were very substantial differences amongst the PC members in what fraction of papers they wanted to accept, and this leaked into the final decisions. Normalizing reviewer ratings is standard operating procedure at some machine learning conferences, so I helped with that (a minimal sketch appears after this list). Even with that help, further efforts at normalization in the future seem like they could help, for example in getting the decision on the paper above right.
  3. Ordering: There were various areas where we tried to order all the reasonable papers and make a decision based on the ordering. Where the papers are sufficiently related, I think this is very helpful, and the act even changed my opinion on some papers a bit by putting them in better context. Not everyone imposed the same ordering, because there are somewhat different tastes: do you care about the techniques used (a traditional theory concern) or about the quality of the result (my focus)? Nevertheless, it helped reduce the noise. Incidentally, there is substantial theoretical evidence that decisions by ordering are more robust than decisions by absolute score producing an ordering.
  4. Writing quality: I was surprised by the poor writing quality of some SODA papers—several were basically not readable without a thorough understanding of referenced papers, and a substantial ability to infer what was meant rather than what was said. Some of these papers were accepted, which would have been impossible in a conference with double-blind reviewing.
  5. PC size: The tradition in theory conferences is to have a relatively small program committee. I don’t see much advantage to this for SODA. The program committee is small enough and SODA is broad enough that it seems dubious to claim that every PC member is an expert on the subject of all of their papers. Also, frankly, the highest quality reviews from my batch of papers weren’t written by me, but rather by reviewers that I picked who had the time to really grind through all the nitty-gritty of the paper. It’s easy to imagine that a larger PC would improve reviewing quality by avoiding overload.
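
Item 2 above mentions normalizing reviewer ratings; here is a minimal sketch of one simple way to do it, assuming per-reviewer z-scoring. The data layout and function name are hypothetical, and actual conference tooling differs.

    from collections import defaultdict
    from statistics import mean, pstdev

    def normalize_scores(scores_by_reviewer):
        """Z-score each reviewer's raw scores, then regroup them by paper.

        scores_by_reviewer: dict mapping reviewer -> list of (paper_id, raw_score).
        Returns a dict mapping paper_id -> list of normalized scores.
        """
        by_paper = defaultdict(list)
        for reviewer, pairs in scores_by_reviewer.items():
            raw = [score for _, score in pairs]
            mu = mean(raw)
            sigma = pstdev(raw) or 1.0  # guard against a reviewer who gives identical scores
            for paper, score in pairs:
                by_paper[paper].append((score - mu) / sigma)
        return dict(by_paper)

    # Example: reviewer "A" scores harshly, reviewer "B" generously; after
    # normalization their scores for the same papers become directly comparable.
    print(normalize_scores({
        "A": [("p1", 3), ("p2", 5), ("p3", 4)],
        "B": [("p1", 7), ("p2", 9), ("p3", 8)],
    }))

Z-scoring is only one choice; rank-based normalization, which fits the ordering discussion in item 3, is a common alternative.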