Reviewing Horror Stories – Machine Learning (Theory)

Essentially everyone who writes research papers suffers rejections. They always sting immediately, but upon further reflection many of these rejections come to seem reasonable. Maybe the equations had too many typos or maybe the topic just isn’t as important as was originally thought. A few rejections do not come to seem acceptable, and these form the basis of reviewing horror stories, which make great material for conversation. I’ve decided to share three of mine, now all safely a bit distant in the past.

Prediction Theory for Classification Tutorial. This is a tutorial about tight sample complexity bounds for classification that I submitted to JMLR. The first decision I heard was a reject which appeared quite unjust to me—for example one of the reviewers appeared to claim that all the content was in standard statistics books. Upon further inquiry, several citations were given, none of which actually covered the content. Later, I was shocked to hear the paper was accepted. Apparently, the paper accidentally went to two different action editors, who each chose distinct reviewers.
Cover Tree. This paper was the first one to give a data structure for nearest neighbor search in an arbitrary metric which both (a) took logarithmic time under a dimensionality constraint and (b) always required space competitive with brute force nearest neighbor search. Previous papers had done (a) or (b), but not both, and achieving both appears key to a practical algorithm, which we backed up with experiments and code.
The cover tree paper suffered a triple rejection, the last one of which seems particularly poor to me. We submitted the draft to SODA, and got back 3 reviews. The first was blank. The second was a paragraph of positive but otherwise uninformative text. The third was blank. The decision was reject. We were rather confused, so we emailed the program chair asking if the decision was right and if so whether there was any more information we could get. We got back only a form letter providing no further information. Since then, the paper was accepted at ICML.
Ranking Reduction. This paper shows that learning how to predict which of a pair of items is better strongly transfers to optimizing a ranking loss, in contrast to (for example) simply predicting a score and ordering according to predicted score.
We submitted this paper to NIPS and it had the highest average review of any learning theory paper. The decision was to reject. Based upon what we could make out from a statement by the program committee, the logic of this decision is most kindly describable as badly flawed: somehow they confused the algorithm, the problem, and the analysis into a mess. Later it was accepted at COLT. (A bit of disclosure: I was on the program committee at NIPS that year, although obviously not involved in the decision on this paper.)

In all cases where a rejection occurs, the default presumption is that the correct decision was made, because most of the time a good (or at least reasonable) decision is made. Consequently, it seems important to point out that there are some objective signs that each of the above cases involved a poor decision.

The tutorial paper is fairly widely cited (Google scholar places it 8th amongst my papers), and I continue to find it useful material for a lecture when teaching a class.
The cover tree is also fairly widely cited, and I know from various emails and download counts that it is used by several people. It also won an award from IBM. To this day, it seems odd that an algorithms paper was only publishable at a machine learning conference.
It’s really too soon to tell with the ranking paper, but it was one of the few COLT papers invited to a journal special issue, and there has since been substantial additional work by Mehryar Mohri and Nir Ailon which broadens the claim to other ranking metrics and makes it more computationally tractable.

One of the reasons you hear for why a paper was rejected and then accepted is that the paper improved in the meantime. That’s often true, but in each of the above cases I don’t believe there were any substantial changes between submissions (and for the tutorial it was a perfect accidental experiment).

Normally reviewing horror stories are the academic equivalent of war stories, but these have slightly more of a point. They have each informed my thinking about how reviewing should be done. Relating these stories might make this thinking a bit more understandable.

Reviewer Choice. The tutorial case brings home the impact of how reviewers are chosen. If a paper is to have 3 reviews, it seems like a good idea to choose the reviewers in diverse ways, rather than one way. For example, at a conference, one reviewer by bidding preference, one reviewer by area chair, and one reviewer by another area chair or the program chair’s choice might reduce variance.
Uniform Author feedback. The standard at NIPS was to have author feedback when the ranking paper was submitted. In effect, the standard was not followed for the ranking paper, and it’s easy to imagine this making a substantial difference given how badly flawed the basis of rejection was. It is also easy to imagine that author feedback might have made a difference in the tutorial rejection, as the reviewer was wrong (author feedback was not the standard then).
Decision Basis. It’s helpful to relate the basis of decision by the program committee, especially when it is not summarized in the reviews. The cover tree case was one of the things which led me to add summaries to some of the NIPS papers when I was on the program committee, and I am committed to doing the same for SODA papers I’m reviewing this year. Not having a summary saves the program committee the embarrassment of accidentally admitting mistakes, but it is badly disrespectful of the authors and generally promotes misunderstanding.
Fast decisions are bad. It’s not possible to reliably make good decisions about technical matters quickly. I suspect that the time crunch of the NIPS program committee meeting was a contributing factor in the ranking paper case.

As anyone educated in machine learning or statistics understands, drawing 4 conclusions from 3 datapoints is problematic, so the above should be understood as suggestions subject to further evidence.

3 Replies to “Reviewing Horror Stories”

Carlos says:

6/27/2008 at 2:31 pm

Here’s an idea: publish reviews and reviewer names along with the paper, upon paper acceptance. (I argue the point here) This change makes the role of the reviewer more prominent, introduces a real incentive to write better reviews, and naturally creates a public database of good reviews we can learn from. It’s a simple symmetry argument: if the paper makes it, all information becomes public. If the paper doesn’t, no information exchange happens. With electronic proceedings, this is essentially free.
anonymous says:

6/27/2008 at 9:13 pm

About your incident with the cover-tree paper, this kind of reviewing is actually quite common at theory conferences. In the past, I have gotten blank reviews, completely uninformative reviews, and even downright stupid reviews, and even all together for the same paper, from SODA itself. Compared to reviewer feedback from theory conferences, reviewer feedback from machine learning conferences is top-notch.
Bob Carpenter says:

6/30/2008 at 12:05 pm

The takeaway message is that the inter-annotator agreement on paper reviewing has too much variance to be reliable anywhere near the margin, which is where many submissions lie. Not only that, reviewer biases are systematic as discussed in “reviewer choice” above and war stories below, so it doesn’t even make sense to talk about the “true score” of a paper.

Yes, we can mitigate some of the problems by being more careful in reviewer selection and balance, but we’ll never get tight confidence intervals with a sample of 3 reviewers, no matter how carefully we select them, how much we make them say about their decision-making process, or how much time we give them.
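
To put a rough number on the confidence interval point, here is a minimal sketch. It assumes reviewer scores for a paper are independent draws with a common standard deviation (a hypothetical `score_sd` of 1.0 rating points, not a figure from any venue) and computes the half-width of a 95% t-interval for the mean score at a few panel sizes. The systematic biases mentioned above would only make things worse than this idealized picture.

```python
import math

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom
# (standard table values, hard-coded to avoid a SciPy dependency).
T_CRIT_95 = {2: 4.303, 4: 2.776, 9: 2.262}

def ci_half_width(score_sd: float, n_reviewers: int) -> float:
    """Half-width of a 95% t-interval for the mean of n_reviewers scores."""
    t_crit = T_CRIT_95[n_reviewers - 1]
    return t_crit * score_sd / math.sqrt(n_reviewers)

# score_sd = 1.0 is an assumed, illustrative spread between reviewers.
for n in (3, 5, 10):
    print(n, round(ci_half_width(1.0, n), 2))
```

Under these assumptions, three reviewers give an interval of roughly plus or minus 2.5 rating points, and even ten reviewers leave it at around 0.7.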

I’d like to see a simple bootstrap analysis of the variance in average ratings and whether they’re above or below a cutoff point. It’d be easier if we had more than three reviews/paper from some venue. (Yes, I know more subjective decisions are actually made in the program committee’s smoke-filled rooms; see war stories below.)
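
Something like the following is what I have in mind, as a minimal sketch rather than a real analysis. The scores and the cutoff are made up for illustration, and the function name `bootstrap_accept_rate` is hypothetical. It resamples one paper's reviews with replacement and reports how often the resampled average clears the cutoff, along with the spread of the resampled averages.

```python
import random
import statistics

def bootstrap_accept_rate(scores, cutoff, n_resamples=10_000, seed=0):
    """Bootstrap the mean review score of one paper.

    Returns (fraction of resampled means >= cutoff,
             standard deviation of the resampled means).
    """
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Draw n reviews with replacement from the observed reviews.
        resample = [rng.choice(scores) for _ in range(n)]
        means.append(statistics.fmean(resample))
    above = sum(m >= cutoff for m in means) / n_resamples
    return above, statistics.pstdev(means)

# Hypothetical paper near the margin: three reviews on a 1-10 scale.
scores = [7, 5, 4]
cutoff = 5.5  # assumed acceptance threshold on the average score
rate, spread = bootstrap_accept_rate(scores, cutoff)
print(f"above cutoff in {rate:.0%} of resamples; sd of mean = {spread:.2f}")
```

With only three reviews there are just 27 equally likely resamples, so the estimate is coarse (for these made-up scores, 10 of the 27 clear the cutoff), which is exactly why more reviews per paper would make the analysis more informative.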

Now some war stories…

One of my favorites was at ACL when I submitted a joint-authored paper while on the program committee. The reviewers concluded I’d proved the result in my feature structure book, and voted to reject. In fact, I’d only conjectured it in the book and it took the help of Gerald Penn to prove it. I still don’t know if it was good anonymization or that they thought I was recycling results.

I had an NSF grant rejected with one review saying the work was “too European” (I was fresh out of grad school in Edinburgh and clearly hadn’t cottoned on to American grantspersonship). How’s that for a decision explanation? The other reviews were more concrete, saying things like “only tree-adjoining grammar is computationally tractable so you should do that”, or “you should build soft connectionist models instead of hard logical ones” or “you should do psychology experiments instead of studying formal grammar”. The clear message was to change fields to anything but logical grammar (I finally got the message).

I was on an NSF panel where all the reviewers thought a proposal was mediocre, but it was funded because the program director knew “the lab did good work”. I had a different NSF proposal rejected in a similar vein: the reviewers liked it, but the program director vetoed it because of its basic theoretical approach (i.e. it was about language, but wasn’t Chomskyan linguistics).
