Although I’m greatly interested in machine learning, I think it must be admitted that there is a large amount of low-quality logic being used in reviews. The problem is bad enough that sometimes I wonder if the Byzantine generals limit has been exceeded. For example, I’ve seen recent reviews where the stated reasons for rejection were:
1. [NIPS] Theorem A is uninteresting because Theorem B is uninteresting.
2. [UAI] When you learn by memorization, the problem addressed is trivial.
3. [NIPS] The proof is in the appendix.
4. [NIPS] This has been done before. (… but without giving any relevant citations)
Just for the record, I want to point out what’s wrong with these reviews. A future world in which such reasons never come up again would be great, but I’m sure these errors will be committed many more times in the future.
1. This is nonsense. A theorem should be evaluated on its own merits, rather than the merits of another theorem.
2. Learning by memorization requires exponentially larger sample complexity than many other common approaches that often work well. Consequently, what is possible under memorization has no substantial bearing on common practice or on what might be useful in the future. (A back-of-the-envelope comparison follows after this list.)
3. Huh? To other authors: thank you for putting the proof in the appendix so the paper reads better. It seems absurd to base a decision on the placement of the content rather than the content itself.
4. This is a red flag for a bogus review. Every time I’ve seen a review (as an author or a fellow reviewer) where such a claim is made without a concrete citation, it has been false. Often the claim is false even when concrete citations are given.
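To make point 2 concrete, here is a rough back-of-the-envelope comparison using standard PAC-style sample complexity bounds; the constants and the choice of conjunctions as the contrasting hypothesis class are purely illustrative and not taken from the reviewed paper.

```latex
% Back-of-the-envelope comparison (standard PAC bounds; constants omitted).
% Memorization over a boolean domain X = {0,1}^d is table lookup, so to
% predict well on unseen inputs it must see essentially every input:
\[
  m_{\mathrm{memorize}} = \Omega(|X|) = \Omega(2^d).
\]
% Empirical risk minimization over a finite hypothesis class H
% (realizable setting) needs only
\[
  m_{\mathrm{ERM}} = O\!\left(\frac{\ln|H| + \ln(1/\delta)}{\epsilon}\right),
\]
% e.g. conjunctions over d variables have |H| = 3^d, so O(d/\epsilon) samples
% suffice: polynomial in d, versus exponential in d for memorization.
```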
A softer version of (4) is when someone is cranky because their own paper wasn’t cited. This is understandable, but a more appropriate response seems to be pointing things out and reviewing anyway. This avoids creating the extra work (for authors and reviewers) of yet another paper resubmission, and reasonable authors do take such suggestions into account.
NIPS figures fairly prominently here. While these are all instances in the last year, my experience after interacting with NIPS for almost a decade is that the average quality of reviews is particularly low there—in many instances reviewers clearly don’t read the papers before writing the review. Furthermore, such low quality reviews are often the deciding factor for the paper decision. Blaming the reviewer seems to be the easy solution for a bad review, but a bit more thought suggests other possibilities:
- Area Chair: In some conferences an “area chair” or “senior PC” makes, or effectively makes, the decision on a paper. In general, I’m not a fan of activist area chairs, but when a reviewer isn’t thinking well, I think it is appropriate to step in. This rarely happens, because the easy choice is to simply accept the negative review. In my experience, many Area Chairs are eager to avoid any substantial controversy, and there is a general tendency to believe that something must be wrong with a paper that has a negative review, even if it isn’t what was actually pointed out.
- Program Chair: In smaller conferences, Program Chairs play the same role as area chairs, so all of the above applies, except now you explicitly know the person’s name, making them easier to blame. This is a little too tempting, I think. For example, I know David McAllester understands that learning by memorization is a bogus reference point, and probably he was just too busy to really digest the reviews. However, a Program Chair is responsible for finding appropriate reviewers for papers, and doing so (or not) has a huge impact on whether a paper is accepted. Not surprisingly, if a paper about the sample complexity of learning is routed to people who have never seen a proof involving sample complexity, the reviews tend to be spuriously negative (and the paper unread).
- Author: A reviewer might blame an author if it turns out later that the reasons given in the review for rejection were bogus. This isn’t absurd: writing a paper well is hard, and it’s easy for small mistakes to be drastically misleading in technical content.
- Culture: A conference has a culture associated with it that is driven by the people who keep coming back. If in this culture it is considered ok to do all the reviews on the last day, it’s unsurprising to see reviews lacking critical thought that could have been written without reading the paper. Similarly, it’s unsurprising to see little critical thought at the area chair level, or in the routing of papers to reviewers. This answer is pretty convincing: it explains why low-quality reviews keep happening year after year at a conference.
If you believe the Culture reason, then what’s needed is a change in the culture. The good news is that this is both possible and effective. There are other conferences where reviewers expect to spend several hours reviewing a paper. In my experience this year, that was true of COLT and of my corner of SODA. Effecting the change is simply a matter of community standards, and that is just a matter of leaders in the community leading.
Reviewing is often done in groups: typically a professor or group leader delegates the reviewing load to her graduate students. It’s ultimately the group leader’s responsibility to maintain the standards.
Some ideas:
* Publishing the average scores given to papers by a particular group of reviewers (and potentially controlling for that).
* Some sort of reviewing standard: is a review 30 minutes or 3 hours? For example, each reviewer could take 2 in-depth reviews and 4 quick reviews, with each paper getting one in-depth and 2 quick reviews (a quick load calculation follows after this list).
* An appeal system and complaint statistics would help identify low-quality reviews.
* Not submitting papers to and not attending conferences that harbor insular or political reviewing practices.
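As a quick sanity check on the load arithmetic behind the reviewing-standard idea above, here is a minimal sketch; the function and the example numbers are hypothetical and not drawn from any real conference.

```python
# Hypothetical sketch of the proposed standard: each reviewer does
# 2 in-depth + 4 quick reviews; each paper receives 1 in-depth + 2 quick reviews.
def reviewers_needed(num_papers: int) -> int:
    """Smallest reviewer pool satisfying both coverage constraints."""
    in_depth_needed = num_papers           # one in-depth review per paper
    quick_needed = 2 * num_papers          # two quick reviews per paper
    per_reviewer_in_depth = 2
    per_reviewer_quick = 4
    # Ceiling division for each constraint; the larger requirement binds.
    need_in_depth = -(-in_depth_needed // per_reviewer_in_depth)
    need_quick = -(-quick_needed // per_reviewer_quick)
    return max(need_in_depth, need_quick)

print(reviewers_needed(1000))  # -> 500: both constraints bind at papers / 2
```

Under these particular numbers the in-depth and quick constraints coincide, so the pool needs roughly half as many reviewers as papers.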
One problem possibly amenable to a technical fix: what happens when a reviewer gets a paper they aren’t qualified to review? In conferences with a large scope, it just isn’t true that every reviewer is qualified to review every paper. In my opinion, there should be some way for reviewers to punt on a paper if, when reading it, they realize it is outside their area of expertise. I know reviewers are hesitant to send back a blank review, because that seems like more work for the chair. However, I can’t see how sending back a totally uninformed review is more helpful!
The strange thing is that it is common wisdom that the reviews at NIPS are bad; everyone seems to have nightmare stories like those described in the post. Yet NIPS continues to be such a desirable place to publish…
For similar reasons, many conferences, such as ICML and NIPS, lost my respect a long time ago. I see many average papers appear there which would surely fail even under a less strict review. It seems a certain clique enjoys the privilege of producing trash, not to mention some plagiarism after reviewing.
Hi, John. I can certainly understand your criticisms of the reviewers (the NIPS reviews I got this year were, beyond any doubt, the most clueless reviews I’ve seen in 10 years). But I was curious about your third case. Was it the case that the reviewer refused to acknowledge the proofs because they were not in the main paper but given as supplementary material? I think reviewers might have the right to refuse to accept a paper if they think there is no fair way to assess it within the 8-page format.
I’m making this comment because I participated in the organization of ICML a couple of times, and on one occasion or another an author asked me about the relevance of supplementary material. My reply was that supplementary material is welcome, but reviewers have no obligation to take it into account. I’m actually curious about your opinion on this.
Now, if it wasn’t the case that the appendix was a supplement, then I have to say you got the most unbelievably clueless reviewer ever!
The appendix was indeed part of the supplementary material.
I understand supplementary material is to be treated as optional by reviewers at NIPS. However, in the case of a proof, its semantics and impact on the paper are fully summarized by the theorem statement, implying that reading the proof only matters to reviewers if they have doubts about the correctness of the theorem statement. I don’t believe that was the case here.
My general treatment of proofs is this: if I have a doubt about a theorem statement, I read and verify the proof when it’s available. When it isn’t available, I substantially downweight the paper’s score. When I don’t have a doubt (which is fairly common, since many theorems are easy variations on existing theorems), I generally assume the proof is correct and read on.
My experience with COLT this year was pretty bad, so I don’t think this is a conference A vs. conference B problem. The fundamental problem is that conference reviewing is collapsing under the imbalance between demand (fast-increasing submission numbers) and supply (experienced, thoughtful reviewers with enough free time to do a good job). Selective conferences make sense for relatively small, homogeneous, slow-growing fields (like theory), but they are ultimately a failure in fast-growing fields (besides machine learning, my colleagues in databases, systems, and computer architecture confirm this). The reason is that the availability of experienced reviewers (who were part of a much smaller cohort of newcomers N years ago; cf. the baby boom) increasingly lags the wavefront of submissions from newcomers to the field.

The solution is to move away from a few reviewing frenzies (NIPS, ICML, and a few others) that overwhelm peak reviewing capacity to year-round reviewing for electronic journals with an associated conference for selected papers, as VLDB appears to be moving to (there’s a document circulating around on their plans but I can’t find it at the moment). Year-round reviewing with journal-like standards avoids overwhelming demand peaks, allows better reviewer calibration and quality control, and has better memory of submissions across their lifecycle. All of these debates about how to improve conference reviewing are good for letting off steam, but they are in practice useless, since this is not something to be solved by conference-internal mechanism redesign; it’s a macroscopic matter of demand and supply.
Is there evidence that (change in number of submissions)/(last year’s number of submissions) is growing? I haven’t seen a case made for this.
Perhaps a simpler explanation is the “large organization implies lack of responsibility” problem. At a large conference, the PC chair might not feel responsible for individual decisions because there are too many. An area chair might not feel responsible because they just went with the reviewer, and the reviewers might not feel responsible because their decision affects only a very small piece of the conference content.
Hi Fernando,
VLDB is going ahead with the revolving model (http://www.jdmr.org/). This year you can submit to the “journal” and can expect a verdict within a month or so. You can also submit through the regular conference submission process with a deadline in Feb, but if you were rejected by the journal, you cannot resubmit to the conference in the same cycle. I think that in future years the conference submission will go away (it’s a transitional effect), and only rolling submissions to the journal will be reviewed. The cutoff date for consideration for the actual conference is May 20.
Case 2 (learning by memorization) seems a fair enough reason for rejection. For example, if the main result of the paper is a theorem that states “Algorithm A solves problem B”, whereas problem B can be solved trivially (e.g., by memorization), I would clearly reject the paper. An interesting theoretical result in this case should rather look like “Algorithm A solves problem B, and has [sample/resource] complexity C”, where C is much smaller than that of trivial (and other known) methods.
Of course, if the main result of the paper is not theoretical, then case 2 is not a valid reason for rejection.
The paper (for reference) is the offset tree paper available here (third entry). There are a couple reasons why the logic above does not apply.
(1) The main theorem has the form “A solves B subject to resource complexity C”. The notion of resource is the average performance of a base classifier. This notion of resource is most similar to boosting (where the resource is the maximum error rate of a base classifier) and bears a passing resemblance to online learning with experts (where the resource is the error rate of the best expert). A sketch of this form of statement follows below.
(2) The paper doesn’t pigeonhole easily as “theoretical” or “not theoretical”. This paper has both a new algorithm and a theorem analyzing it. To a theory person it might be theoretical, but to a person trying to solve a problem it’s an algorithm paper.
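For readers who haven’t seen the paper, here is roughly the shape such a statement takes when the resource is the average performance of a base classifier; this is an illustrative paraphrase, not the paper’s exact theorem or constants.

```latex
% Rough shape of a statement of the form "A solves B subject to resource
% complexity C" when the resource is average base-classifier performance
% (illustrative paraphrase; see the paper for the actual bound):
\[
  \mathrm{regret}_{\mathrm{policy}}(A)
  \;\le\;
  c(k)\cdot \overline{\mathrm{regret}}_{\mathrm{binary}}(h),
\]
% where h is the base classifier, k is the number of choices, c(k) is a
% problem-dependent factor, and the right-hand side averages h's regret over
% the induced binary subproblems. Contrast: boosting constrains the maximum
% base error rate, and learning with experts constrains the error rate of the
% best expert.
```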
Yes, if it’s clear from the bounds that they are not satisfied by trivial algorithms, then surely such a comparison is not a valid reason for rejection.