Bad Reviewing

This is a difficult subject to talk about for many reasons, but a discussion may be helpful.

Bad reviewing is a problem in academia. The first step in understanding this is admitting to the problem, so here is a short list of examples of bad reviewing.

  1. Reviewer disbelieves a theorem's proof (ICML), or disbelieves a theorem with a trivially false counterexample (COLT).
  2. Reviewer internally swaps quantifiers in a theorem and concludes it has been done before and is trivial (NIPS).
  3. Reviewer believes a technique will not work despite experimental validation (COLT).
  4. Reviewers fail to notice a flaw in a theorem statement (CRYPTO).
  5. Reviewer erroneously claims the work has been done before, complete with references! (NIPS, SODA, JMLR)
  6. Reviewer inverts the message of a paper and concludes it says nothing important (NIPS*2).
  7. Reviewer fails to distinguish between a DAG and a tree (SODA).
  8. Reviewer is enthusiastic about a paper but clearly does not understand it (ICML).
  9. Reviewer erroneously believes that the “birthday paradox” is relevant (CCS).

The above list covers only cases where the reviewer comments were sufficient to actually understand the failure mode. Many reviewers fail to leave sufficient comments, and it’s easy to imagine they make similar mistakes.

Bad reviewing should be clearly distinguished from rejections—note that some of the above examples are actually accepts.

The standard psychological reaction to any rejected paper is to try to find fault with the reviewers. You, as a paper writer, have invested significant work (weeks? months? years?) in creating the paper, so it is extremely difficult to step back and read the reviews objectively. One characteristic distinguishing a bad review from a mere rejection is that it still bothers you years later.

If we accept that bad reviewing happens and want to address the issue, we are left with a very difficult problem. Many smart people have thought about improving this process, yielding the system we observe now. There are many subtle issues here and several solutions that (naively) appear obvious don’t work.

13 Replies to “Bad Reviewing”

  1. Maybe the grass is greener on the other side of the fence.

    Of the 50-100 MSS I have been given in the last ten years to review for summer conferences, about 1/3 have failed to acknowledge any work done since the “canonical text” was written 10+ years ago (I do GAs principally, and many folks seem to think Goldberg 1989 is all they need as an overview of current work); about 1/3 present research that is indeed trivial, or which completely duplicates work presented by a different author at the same meeting one or two years ago; and about 1/3 are infinitesimal modifications of papers the authors have already published elsewhere.

    Luckily for science and engineering, these are not mutually exclusive categories.

    Lately I think the role of reviewer is becoming less one of gatekeeping and more one of pedagogy: the submissions I’ve been given have been singularly unscholarly, so much so that I find I end up re-planning the authors’ future (or supposed past) work and telling them how to get on with their lives and show what they thought they already had.

    That said, nobody ever said that reviewers are generally good at it. We’re peers, not betters. Mistakes seem to be made consistently on both sides of the fence.

    That’s how it gets done.

  2. Notice that all the examples cited are from conferences. Conference review processes are bad for a variety of reasons, and reviewer competence is only one. Others include reviewer load, time for review, brevity of submissions, and the ‘do-or-die’ nature of conference acceptance vs. journal acceptance.

    This would not be a problem except for the preeminence given to conferences in the CS community. The problem is tricky…

  3. From my own experience reviewing for conferences, I have to say that I agree that the quality of reviews is not very good in general. In many cases, I would submit reasonably detailed (positive or negative) reviews and the other two reviewers would submit very short reviews that went contrary to my opinion. That was frustrating because it was two against one, but they did not have any good arguments for their ratings. In some cases, the area chair corrected this by initiating a discussion so that eventually some people would change their ratings given my arguments.

    Of course, there were other cases when I saw very good reviews and I actually changed my mind about the paper after seeing the other reviews. For this reason, I believe that the discussion step is very important.

    It is very easy to tell whether a reviewer has actually read the paper carefully and devoted a reasonable amount of time to write the review. But I’ve tried to think of a system for rating reviewers and could not come up with anything. Any ideas?

  4. Suresh, note that JMLR is a journal.

    My impression is that discussion can be helpful, but it needs to be conducted well. There is an unfortunate tendency for the loudest person to win, which means that the standard three reviewers can be transformed into an (effective) one reviewer.

    A good system for rating reviewers seems like it should start with a good specification of the reviewing problem. What is the definition of a good paper? And how can “good paper” be measured?

  5. I think this is impossible to answer before first clarifying who your target audience is.

    Once you have identified a target audience and the distribution of topics that interest them (in the opinion of the conference and chairs), then I would define a “good paper” worth publishing as one which:
    (a) explains something the target audience *does not know already*;
    (b) is technically/scientifically correct (so it does not mislead the audience with erroneous conclusions/wrong theorems etc.);
    and (c) will be *useful* to the target audience in the sense that it helps them do their job better.

    Unfortunately, reviewers differ on these criteria, and this subjectivity is sometimes not properly explained in the review itself (and also ignored by the area chair). For example, while I may disagree with frequentist philosophy, I do not believe that I should reject papers on those grounds. Similarly, what I perceive as “new/useful to the target audience” is also a matter of some debate. The discussion phase of reviewing is a good way to ensure that these kinds of subjectivity are addressed. Another good mechanism is to ensure that the questions asked of the reviewers are objective and as precise as possible. It is important to clarify to both the reviewers and the authors what exactly is of interest to the expected audience.

    We need to ensure that reviewers don’t just provide numerical scores rating the paper, but also sufficient text clarifying why exactly they think some aspect of it is good or bad. This can sometimes be systematically addressed by requiring that reviewers write a paragraph explaining what they think are the main contributions, and another about what they think the paper does wrongly or how it can be improved (and why the current version is inadequate). This is desirable, but sometimes not required, in order to avoid overloading a reviewer who has signed up to review 10 papers (as an example) in a short time for a conference. On the other hand, my opinion is that if they don’t want to invest the time to do it properly, they should not sign up for it in the first place.

    A last aspect is that reviewers are sometimes not familiar with the topic under discussion. Often, this problem can be reduced by a process of reviewers bidding for papers. Assigning reviews to your grad students is, in my opinion, unethical; after all, you signed up to review the paper, so you have to do the task. If the grad student signed up for it (upon request from the area chair), that is a different deal.

    PS: It is very unfortunate that some conferences (even among the best ones) do not follow the double-blind review process even now. Every year NIPS discusses this and promptly forgets about it, but in my opinion the authors’ names should neither jeopardize their chances nor improve them.

    Two arguments raised in favor of knowing the names of the authors are:
    (a) it allows the reviewer to make sure the work has not already been published before in a different conference;
    (b) if a long and complicated proof (or math in some other flavor) is very difficult to understand, the reviewer finds it easier to trust the author if he is familiar with some of the author’s excellent past work than if he thought it was a random person who never published a paper in the area.

    In my opinion both of these are rather crude excuses, either for the reviewer not investing enough time and effort to understand and verify the submission in its entirety, or for not acknowledging his lack of qualification for reviewing the paper (because he does not understand or know the current literature well enough). If the reviewer is not confident in his analysis/understanding, he should openly acknowledge that; after all, it is part of the very structure of academics that we should all be open and honest about our ignorance so that we can then proceed to learn. However, if all three reviewers feel that way, it indicates that the paper is probably badly written or that it is a breakthrough. Analyzing this is then the task of the area chair, and if necessary more reviews may be solicited.

    If double-blind reviews are not a viable solution, then another solution is to make the entire discussion open: both the authors and the reviewers should have their names published on the paper. In the stats community it is common to have long discussions in journals about other papers, and the authors openly provide their names in this discussion. If as a reviewer I am not confident in making my name known, that is often indicative of a bad review or other systemic problems that need to be clearly understood. It is perfectly ok to disagree in the scientific community, so long as we state our opinions respectfully and with clear supporting arguments; it is not ok to use the cloak of anonymity to be dismissive/abusive (even about papers which appear to you, as a reviewer, to be trivial or wrong), or to present an opinion as dogma without critical reasoning supported by facts.

  6. I agree with much of what Balaji is saying; however, is it *always* unethical to ask one’s grad student to help with refereeing? What if the advisor reviews the student’s report before sending it up to the area chair?

    In such cases, provided the advisor is attentive, the authors still get a thorough/fair review and the student reviewer gets taught the *right* way to referee.

    As a current grad student I have experienced both “good” and “bad” situations where my reviewing has/has not been scrutinized properly. I certainly feel that the “good” sub-refereeing assignments have helped me, and hopefully will help the community once I am “on my own”.

  7. Double-blind review is not the solution that one might think it is. In algorithms, we have discussed this issue often (algorithms conferences are not double-blind in general). The real problem is trying to optimize too much when reviewing papers for conferences (as opposed to journals). The conference review process as structured cannot be expected to provide a careful evaluation of each paper, especially with respect to proofs of correctness when page limits are observed. Conferences really do have to be “services to the community”, accepting all reasonable papers within capacity, rather than trying to judge what is “good”. After all, the real value of a paper sometimes emerges many years later.

    Oded Goldreich’s note on this is worth reading.

  8. I don’t think grad students need to wait until they are “out on their own” to review papers. In particular, if a paper under review cites one of my papers in an essential way, I appreciate the opportunity to look at it. (This has happened; I was surprised.) Yes, there is potential for abuse if an advisor weighs down students with reviews or requires the student to review. I think it’s perfectly appropriate, however, for an advisor to offer a student the opportunity to review, say, one or two papers per conference.

    What I would like to see more of, however, is feedback to reviewers on what constitutes a good review and what the scale numbers “mean” on a review sheet, both for the score and the confidence. I did a review of a paper I did not think was very strong and assigned it a 6 out of 10. The person reviewing came back to me and explained that a 6 rating really meant, given the context of that conference, a pretty good chance of acceptance — which was not my intent at all! Oded Goldreich’s note on reviewing is helpful here, but program committees can help as well by giving clear instructions to the reviewer as to what they would like to see.

    As for blind reviewing, well, in crypto and security it is what we use. Works so far for me. Systems (e.g. OSDI) does not, and as noted neither does algorithms. I’ve read Oded’s article arguing against blind reviewing, but I can’t say I am so sanguine about our ability to detect and punish people rating papers for non-scientific reasons. Then again, I’ve never been on a PC, so what do I know? 🙂

  9. Well, there is an easily corrected problem with reviewers: they have no tangible incentive to do a good job. Before you say “professional pride”, let’s ask how many tenured academics actually devote their time towards ends which do not serve to further a very focused sense of perceived utility in some calculated way. As mentioned previously, this is a teaching aspect of the job, but one where there is no feedback.

    So, why not introduce feedback into the reviewer’s loop? Certainly the editors and conference organizers could function as a critic of the reviewer’s primary control function. If they omit this step, the quality of the publication will suffer, and after all, the would-be author can always go elsewhere.

  10. I suspect any system for improving reviewing which adds to the load of conference organizers is an inherent nonstarter. Running a conference is already an immense task.

  11. Sorry, I should be more clear – when I wrote “feedback” I should have written something like “clear instructions.” It’s great to get actual feedback if you can, but given the scale of conference reviewing it doesn’t seem like it will happen often.

  12. John,

    Of course, that depends on the goals of the conference organizers. If they want to increase the prestige of the conference, they will invest the additional time here rather than in picking color schemes and designing brochures. This is just the most logical place to start.

    Why complain if there is no will to fix the problem?

    Dave,

    Feedback of some sort is necessary. Otherwise we have the same problem as with grade inflation. Why should the reviewer suffer increased risk? Without feedback there will be a greater bias against making tough calls.
