Claire asked me to be on the SODA program committee this year, which was quite a bit of work.
I had a relatively light load of merely 49 theory papers. Many of these papers were not on subjects in which I was expert, so (as is common for theory conferences) I found various reviewers I trusted to help review them. I ended up reviewing about 1/3 personally. There were a couple of instances where I ended up overruling a subreviewer whose logic seemed off, but otherwise I generally let their reviews stand.
There are some differences in reviewing standards between the machine learning and theory communities. In machine learning a detailed review is expected, while in the theory community this is often not the case. Every paper assigned to me ended up with a review ranging from somewhat to very detailed.
I’m sure not every author was happy with the outcome. We did our best, but some decisions were genuinely difficult. For example, if there is a well-written paper on an interesting topic which analyzes a flawed abstraction of that topic, should it get in? I would rate this a ‘weak accept’.
Here are some observations and thoughts about the process (several also appear in Claire’s report).
- Better feedback isn’t too hard. The real time sink in reviewing a theory paper is reading it. Leaving a few comments, even if just “I don’t like the model analyzed because it misses important feature X,” is relatively easy. My impression is that the last COLT had entirely switched from minimal author feedback to substantial author feedback. This year’s SODA was somewhere in between, depending on the PC member involved, which marks a definite trend towards stronger comments for SODA.
- Normalization. There were very substantial differences amongst the PC members in what fraction of papers they wanted to accept, and this leaked into the final decisions. Normalizing reviewer ratings is standard operating procedure at some machine learning conferences, so I helped with that (a toy sketch of this kind of per-reviewer normalization appears after this list). Even with that help, further normalization efforts in the future could be useful, for example in getting the decision on the paper above right.
- Ordering. There were various areas where we tried to order all the reasonable papers and make decisions based on the ordering. Where the papers are sufficiently related, I think this is very helpful, and the act even changed my opinion on some papers a bit by putting them in better context. Not everyone imposed the same ordering, because there are somewhat different tastes: do you care about the techniques used (a traditional theory concern) or about the quality of the result (my focus)? Nevertheless, it helped reduce the noise. Incidentally, there is substantial theoretical evidence that decisions made by ordering are more robust than decisions made by absolute scores which then induce an ordering.
- Writing quality. I was surprised by the poor writing quality of some SODA papers: several were basically unreadable without a thorough understanding of the referenced papers and a substantial ability to infer what was meant rather than what was said. Some of these papers were accepted, which would have been impossible in a conference with double-blind reviewing.
- PC size. The tradition in theory conferences is to have a relatively small program committee. I don’t see much advantage to this for SODA. The program committee is small enough, and SODA is broad enough, that it seems dubious to claim that every PC member is an expert on the subject of all of their papers. Also, (frankly) the highest quality reviews from my batch of papers weren’t written by me, but rather by reviewers I picked who had the time to really grind through all the nitty-gritty of the paper. It’s easy to imagine that a larger PC would improve reviewing quality by avoiding overload.
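On the normalization point above, here is a minimal sketch (in Python, with hypothetical reviewer and paper names) of the kind of per-reviewer score normalization some machine learning conferences use: each reviewer’s raw scores are rescaled to zero mean and unit variance before being pooled, so a systematically harsh or generous reviewer doesn’t skew the overall ordering. This is just an illustration of the general idea, not the actual SODA or ML-conference tooling.

```python
# Per-reviewer z-score normalization of paper ratings (illustrative only).
from statistics import mean, pstdev

# scores[reviewer] = {paper_id: raw score on that reviewer's personal scale}
# All names and numbers below are made up for the example.
scores = {
    "reviewer_A": {"paper_1": 6, "paper_2": 4, "paper_3": 5},  # harsh scorer
    "reviewer_B": {"paper_2": 9, "paper_4": 7, "paper_5": 8},  # generous scorer
}

def normalize(per_reviewer):
    """Map one reviewer's raw scores to z-scores (mean 0, std 1)."""
    vals = list(per_reviewer.values())
    mu, sigma = mean(vals), pstdev(vals) or 1.0  # guard against zero spread
    return {paper: (s - mu) / sigma for paper, s in per_reviewer.items()}

# Pool normalized scores per paper and rank by the average.
pooled = {}
for reviewer, raw in scores.items():
    for paper, z in normalize(raw).items():
        pooled.setdefault(paper, []).append(z)

ranking = sorted(pooled, key=lambda p: -mean(pooled[p]))
print(ranking)  # papers ordered by average normalized score
```

Real committee tooling would also have to cope with reviewers who handle only a few papers and papers with only a couple of reviews, where these per-reviewer statistics are noisy.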
Interesting post. I think that almost all changes have to be incremental: we make a small change, check that the effect is desirable, and then the community is ready for a bigger change.
Regarding PC size, I thought 27 was quite large… would you advocate having a PC consisting of 50 members, each with a batch of fewer than 30 papers?
I understand about big changes.
I would certainly be comfortable with a larger PC. As one data point, the COLT PC this year had 28 members. Since COLT gets around 1/4 as many submissions as SODA but had a similar-sized PC, this meant perhaps 12 papers per member.
The tradeoff seems to be: do you want a coherent global viewpoint (small PC) or do you want experts reviewing the papers (large PC)? My impression is that SODA is still on the too-small end of this tradeoff; the right load per PC member should be something like the number of papers that member is genuinely expert on.
You have to be careful though: PC members don’t submit papers, so we can’t go to the really large PCs that the database community has. Even doubling the size of the committee might reduce the number of possible submissions. In fact, the reduction is most likely at the “junior end,” where people have to weigh the prestige of being on a PC against the prestige of getting papers in.
I agree that large PCs don’t work with a no-PC-submissions rule. I think many conferences deal with this by allowing PC submissions and having strong conflict-of-interest determination and enforcement. My impression is that this works, although I haven’t seen a serious debate about it. When everything takes place electronically, conflict-of-interest determination and enforcement seem to work better than in face-to-face meetings.
The conflict-of-interest rules help increase the PC size (for example, Crypto and Eurocrypt have done this for years). My experience on larger PCs indicates that the real problem isn’t PC subs, but consistency: the larger the PC, the less likely it is that each PC member will have an accurate view of the submission pool as a whole, hence it is harder to score accurately (as John says, relative scores are more meaningful than absolute ones).
The DB community seems to address these difficulties partially with a two-level PC: the large bottom layer reads papers thoroughly, on topics they know well, while the smaller top layer is responsible for trying to achieve consistency and making final decisions. For example, the top committee might have a physical PC meeting, or divide papers into groups to be considered by sub-committees, etc. The top-layer members are not allowed submissions, but lower-layer members are.
It seems that SODA’s size would make this type of 2-layer PC appropriate. There is another significant advantage to this structure: because the lower layer acts sort of like sub-reviewers, the “sub-reviewers” actually get credit for their work.
John, you seem to have experience on both types of PCs. How do you feel about their (relative) effectiveness?
I’ve been on DB-style committees, and functionally I feel the same way as I do as a subreviewer for SODA. Which is to say, I read some papers in isolation, and give my feedback, without having any sense (or even having access to any sense) of the bigger picture. In this model, the “senior reviewers” essentially act as the PC: the difference is that each paper is reviewed by exactly one senior reviewer (and 3 second-tier reviewers).
In the conferences I’ve been on PCs for, the senior PC members could ALSO submit papers. In one instance, the PC chair voluntarily decided not to submit, but I’m not sure this is the standard, or is even expected.
I can’t say I like the 2-level process: first, there’s the above-mentioned problem of not having the bigger picture. Secondly (and this is possibly an unrelated quirk), even at the lower level we’re asked to make accept/reject decisions rather than merely giving some kind of score, which in my mind is a rather dangerous thing to do.
John, you mentioned theoretical results suggesting that orderings are “more robust” than absolute scores that then produce an ordering. What are you referring to?
I’m referring to this paper and this paper.
I believe a 2-layer system can be more effective if the failure modes are addressed. Whether (and to what degree) they are addressed varies substantially with the program chair. My estimate is that where this system is standard, there are both good and bad years. Logistics-wise, it is _much_ easier on the first layer if the second layer is built into the system rather than handled as an ad hoc sequence of secondary reviews. The failure mode to watch out for is that only one person ends up making the decision on a paper, and that decision can easily be wrong. When I was on the NIPS PC with Bob Williamson, we addressed that failure mode explicitly amongst the learning theory papers by arriving at decisions independently on each paper and then reconciling. However, most other papers were not handled in that fashion.