Machine Learning (Theory)

4/26/2010

Compassionate Reviewing

Most long conversations between academics seem to converge on the topic of reviewing where almost no one is happy. A basic question is: Should most people be happy?

The case against is straightforward. Anyone who watches the flow of papers realizes that most papers amount to little in the longer term. By it’s nature research is brutal, where the second-best method is worthless, and the second person to discover things typically gets no credit. If you think about this for a moment, it’s very different from most other human endeavors. The second best migrant laborer, construction worker, manager, conductor, quarterback, etc… all can manage quite well. If a reviewer has even a vaguely predictive sense of what’s important in the longer term, then most people submitting papers will be unhappy.

But this argument unravels, in my experience. Perhaps half of reviews are thoughtless or simply wrong with a small part being simply malicious. And yet, I’m sure that most reviewers genuinely believe they can predict what will and will not be useful in the longer term. This disparity is a lack of communication. When academics have conversations about reviewing, the presumption of participants in each conversation is that they all share about the same beliefs about what will be useful, and what will take off. Such conversations rarely go into specifics, because the specifics are boring in particular, technical, and because their is a real chance of disagreement on the specifics themselves.

When double blind reviewing was first being considered for ICML, I remember speaking about the experience in the Crypto community, where in my estimate the reviewing was both fairer and less happy. Many conferences in machine learning have shifted to doubleblind reviewing, and I think we have seen this come to pass here as well. Without double blind reviewing, it is common to have an “in” crowd who everyone respects and whose papers are virtually always accepted. These people are happy, and the rest have little voice. With double blind reviewing, everyone suffers substantial rejections.

We might say “fine, at least it’s fair”, but in my experience there is a real problem. From a viewpoint external to the community, when the reviewing is poor and the viewpoint of people in the community highly contradictory, nothing good happens. Outsiders (i.e. most people) viewing the acrimony choose some other way to solve problems, proposals don’t get funded, and the community itself tends to fracture. For example, in cryptography, TCC (not double blind) has started, presumably because the top theory people got tired of having their papers rejected at Crypto (double blind). From a process-of-research standpoint, this seems suboptimal, as different groups using different methods to solve similar problems are particularly the people who you would prefer talking to each other.

What seems to be lost with double blind reviewing is some amount of compassion, unfairly allocated. In a double blind system, any given paper is plausibly from someone you don’t know, and since most papers go nowhere, plausibly not going anywhere. Consequently, the bias starts “against” for all work, a disadvantage which can be quite difficult to overcome. Some time ago, I discussed how I thought motivation should be the responsibility of the reviewer. Aaron Hertzman strongly disagreed on the grounds that this belief could dead end your career as an author. I’ve come to appreciate his viewpoint to an extent. But, it misses the point slightly—the question of “What is good for the community?” differs from “What is good for the author?” In a healthy community, reviewers will actively understand why a piece of work is or is not important, filling in and extending the motivation as they consider the problem.

So, a question is: How can we get compassionate reviewing? (And in a fair way?) It might help somewhat for reviewers to actively consider, as part of their review, the level and mechanism of impact that a paper may have. Reducing reviewing load is certainly helpful, but it is not sufficient alone, because many people naturally interpret a reduced reviewing load as time to work on other things. And, some mechanisms seem to even harm. For example, the two-phase reviewing process that ICML currently uses might save 0.5 reviews/paper, while guaranteeing that for half of the papers, the deciding review is done hastily with no author feedback, a recipe for mistakes.

What creates a great deal of compassion? Public responsibility helps (witness workshops more interesting than conferences). A natural conversation helps (the current method of single round response tends to be very stilted). And time, of course, helps. What else?

12/9/2009

Future Publication Models @ NIPS

Yesterday, there was a discussion about future publication models at NIPS. Yann and Zoubin have specific detailed proposals which I’ll add links to when I get them (Yann’s proposal and Zoubin’s proposal).

What struck me about the discussion is that there are many simultaneous concerns as well as many simultaneous proposals, which makes it difficult to keep all the distinctions straight in a verbal conversation. It also seemed like people were serious enough about this that we may see some real movement. Certainly, my personal experience motivates that as I’ve posted many times about the substantial flaws in our review process, including some very poor personal experiences.

Concerns include the following:

  1. (Several) Reviewers are overloaded, boosting the noise in decision making.
  2. (Yann) A new system should run with as little built-in delay and friction to the process of research as possible.
  3. (Hanna Wallach(updated)) Double-blind review is particularly important for people who are unknown or from an unknown institution.
  4. (Several) But, it’s bad to take double blind so seriously as to disallow publishing on arxiv or personal webpages.
  5. (Yann) And double-blind is bad when it prevents publishing for substantial periods of time. Apparently, this comes up in CVPR.
  6. (Zoubin) Any new system should appear to outsiders as if it’s the old system, or a journal, because it’s already hard enough to justify CS tenure cases to other disciplines.
  7. (Fernando) There shouldn’t be a big change with a complex bureaucracy, but rather a smaller changes which are obviously useful or at least worth experimenting with.

There were other concerns as well, but these are the ones that I remember.

Elements of proposals include:

  1. (Yann) Everything should go to Arxiv or an arxiv-like system first, as per physics or mathematics. This addresses (1), because it delinks dissemination from review, relieving some of the burden of reviewing. It also addresses (2) since with good authors they can immediately begin building on each other’s work. It conflicts with (3), because Arxiv does not support double-blind submission. It does not conflict if we build our own system.
  2. (Fernando) Create a conference coincident journal in which people can publish at any time. VLDB has apparently done this. It can be done smoothly by allowing submission in either conference deadline mode or journal mode. This proposal addresses (1) by reducing peak demand on reviewing. It also addresses (6) above.
  3. (Daphne) Perhaps we should have a system which only reviews papers for correctness, which is not nearly as subjective as for novelty or interestingness. This addresses (1), by eliminating some concerns for the reviewer. It is orthogonal to the double blind debate. In biology, such a journal exists (pointer updated), because delays were becoming absurd and intolerable.
  4. (Yann) There should be multiple publishing entities (people or groups of people) that can bless a paper as interesting. This addresses (1).

There are many other proposal elements (too many for my memory), which hopefully we’ll see in particular proposals. If other people have concrete proposals, now is probably the right time to formalize them.

2/18/2009

Decision by Vetocracy

Few would mistake the process of academic paper review for a fair process, but sometimes the unfairness seems particularly striking. This is most easily seen by comparison:

Paper Banditron Offset Tree Notes
Problem Scope Multiclass problems where only the loss of one choice can be probed. Strictly greater: Cost sensitive multiclass problems where only the loss of one choice can be probed. Often generalizations don’t matter. That’s not the case here, since every plausible application I’ve thought of involves loss functions substantially different from 0/1.
What’s new Analysis and Experiments Algorithm, Analysis, and Experiments As far as I know, the essence of the more general problem was first stated and analyzed with the EXP4 algorithm (page 16) (1998). It’s also the time horizon 1 simplification of the Reinforcement Learning setting for the random trajectory method (page 15) (2002). The Banditron algorithm itself is functionally identical to One-Step RL with Traces (page 122) (2003) in Bianca‘s thesis with the epsilon greedy strategy and a multiclass perceptron with update scaled by the importance weight.
Computational Time O(k) per example where k is the number of choices O(log k) per example Lower bounds on the sample complexity of learning in this setting are a factor of k worse than for supervised learning, implying that many more examples may be needed in practice. Consequently, learning algorithm speed is more important than in standard supervised learning.
Analysis Incomparable. An online regret analysis showing that if a small hinge loss predictor exists, a bounded number of mistakes occur. Also, an algorithm independent analysis of the fully realizable case. Incomparable. A learning reduction analysis showing how the regret of any base classifier bounds policy regret. Also contains a lower bound and comparable analysis of all plausible alternative reductions.
Experiments 1 dataset, comparing with no other approaches to solving the problem. 13 datasets, comparing with 2 other approaches to solve the problem.
Outcome Accepted at ICML Rejected at ICML, NIPS, UAI, and NIPS.

The reviewers of the Banditron paper made the right call. The subject is interesting, and analysis of a new learning domain is of substantial interest. Real advances in machine learning often come as new domains of application. The talk was well attended and generated substantial interest. It’s also important to remember the reviewers of the two papers probably did not overlap, so there was no explicit preference for A over B.

Why was the Offset Tree rejected? One of these rejections is easily explained as a fluke—we ran into a reviewer at UAI who believes that learning by memorization is the way to go. I, and virtually all machine learning people, disagree but some reviewers at UAI aren’t interested or expert in machine learning.

The striking thing about the other 3 rejects is that they all contain a reviewer who doesn’t read the paper. Instead, the reviewer asserts that learning reductions are bogus because for an alternative notion of learning reduction, made up by the reviewer, an obviously useless approach yields a factor of 2 regret bound. I believe this is the same reviewer each time, because the alternative theorem statement drifted over the reviews fixing bugs we pointed out in the author response.

The first time we encountered this review, we assumed the reviewer was just cranky that day—maybe we weren’t quite clear enough in explaining everything as it’s always difficult to get every detail clear in new subject matter. I have sometimes had a very strong negative impression of a paper which later turned out to be unjustified upon further consideration. Sometimes when a reviewer is cranky, they change their mind after the authors respond, or perhaps later, or perhaps never but you get a new set of reviewers the next time.

The second time the review came up, we knew there was a problem. If we are generous to the reviewer, and taking into account the fact that learning reduction analysis is a relatively new form of analysis, the fear that because an alternative notion of reduction is vacuous our notion of reduction might also be vacuous isn’t too outlandish. Fortunately, there is a way to completely address that—we added an algorithm independent lower bound to the draft (which was the only significant change in content over the submissions). This lower bound conclusively proves that our notion of learning reduction is not vacuous as is the reviewer’s notion of learning reduction.

The review came up a third time. Despite pointing out the lower bound quite explicitly, the reviewer simply ignored it. This more-or-less confirms our worst fears. Some reviewer is bidding for the paper with the intent to torpedo review it. They are uninterested in and unwiling to read the content itself.

Shouldn’t author feedback address this? Not if the reviewer ignores it.

Shouldn’t Double Blind reviewing help? Not if the paper only has one plausible source. The general problem area and method of analysis were freely discussed on hunch.net. We withheld public discussion of the algorithm itself for much of the time (except for a talk at CMU) out of respect for the review process.

Why doesn’t the area chair/program chair catch it? It took us 3 interactions to get it, so it seems unrealistic to expect someone else to get it in one interaction. In general, these people are strongly overloaded and the reviewer wasn’t kind enough to boil down the essence of the stated objection as I’ve done above. Instead, they phrase it as an example and do not clearly state the theorem they have in mind or distinguish the fact that the quantification of that theorem differs from the quantification of our theorems. More generally, my observation is that area chairs rarely override negative reviews because:

  1. It risks their reputation since defending a criticized work requires the kind of confidence that can only be inspired by a thorough personal review they don’t have time for.
  2. They may offend the reviewer they invited to review and personally know.
  3. They figure that the average review is similar to the average perception/popularity by the community anyways.
  4. Even if they don’t agree with the reviewer, it’s hard to fully discount the review in their consideration.

I’ve seen these effects create substantial mental gymnastics elsewhere.

Maybe you just ran into a cranky reviewer 3 times randomly Maybe so. However, the odds seem low enough and the 1/2 year cost of getting another sample high enough, that going with the working hypothesis seems indicated.

Maybe the writing needs improving. Often that’s a reasonable answer for a rejection, but in this case I believe not. We’ve run the paper by several people, who did not have substantial difficulties understanding it. They even understand the draft well enough to make a suggestion or two. More generally, no paper is harder to read than the one you picked because you want to reject it.

What happens next? With respect to the Offset Tree, I’m hopeful that we eventually find reviewers who appreciate an exponentially faster algorithm, good empirical results, or the very tight and elegant analysis, or even all three. For the record, I consider the Offset Tree a great paper. It remains a substantial advance on the state of the art, even 2 years later, and as far as I know the Offset Tree (or the Realizable Offset Tree) consistently beat all reasonable contenders both in prediction and computational performance. This is rare and precious, as many papers tradeoff one for the other. It yields a practical algorithm applicable to real problems. It substantially addresses the RL to classification reduction problem. It also has the first nonconstant algorithm independent lower bound for learning reductions.

With respect to the reviewer, I expect remarkably little. The system is designed to protect reviewers, so they have virtually no responsibility for their decisions. This reviewer has a demonstrated capability to sabotage the review process at ICML and NIPS and a demonstrated willingness to continue doing so indefinitely. The process of bidding for papers and making up reasons to reject them seems tedious, but there is no fundamental reason why they can’t continue doing so for several decades if they remain active in academia.

This experience has substantially altered my understanding and appreciation of the review process at conferences. The bidding mechanism commonly used, coupled with responsibility-free reviewing is an invitation to abuse. A clever abusive reviewer can sabotage perhaps 5 papers per conference (out of 8 reviewed), while maintaining a typical average score. While I don’t believe most people choose papers with intent to sabotage, the capability is there and used by at least one person and possibly others. If, for example, 5% of reviewers are willing to abuse the process this way and there are 100 reviewers, every paper must survive 5 vetoes. If there are 200 reviewers, every paper must survive 10 vetoes. And if there are 400 reviewers, every paper must survive 20 vetoes. This makes publishing any paper that offends someone difficult. The surviving papers are typically inoffensive or part of a fad strong enough that vetoes are held back. Neither category is representative of high quality decision making. These observations suggest that the conference with the most reviewers tend strongly toward faddy and inoffensive papers, both of which often lack impact in the long term. Perhaps this partly explains why NIPS is so weak when people start citation counting. Conversely, this would suggest that smaller conferences and workshops have a natural advantage. Similarly, the reviewing style in theory conferences seems better—the set of bidders for any paper is substantially smaller, implying papers must survive fewer vetos.

This decision making process can be modeled as a group of n decision makers, each of which has the opportunity to veto any action. When n is relatively small, this decision making process might work ok, depending on the decision makers, but as n grows larger, it’s difficult to imagine a worse decision making process. The closest representatives outside of academia I know are deeply bureacratic governments and other large organizations where many people must sign off on something before it takes place. These vetocracies are universally frustrating to interact with. A reasonable conjecture is that any decision making process with a large veto number has poor characteristics.

A basic question is: Is a vetocracy inevitable for large organizations? I believe the answer is no. The basic observation is that the value of n can be logarithmic in the number of participants in an organization rather than linear, as per reviewing under a bidding process. An essential force driving vetocracy creation is a desire to offload responsibility for decisions, so there is no clear decision maker. A large organization not deciding by vetocracy must have a very different structure, with clearly dilineated responsibility.

NIPS provides an almost perfect natural experiment in it’s workshop organization, which involves the very same community of people and subject matter, yet works in a very different manner. There are one or two workshop chairs who are responsible for selecting amongst workshop proposals, after which the content of the workshop is entirely up to the workshop organizers. If a workshop is rejected, it’s clear who is at fault, and if a workshop presentation is rejected, it is often clear by who. Some workshop chairs use a small set of reviewers, but even then the effective veto number remains small. Similarly, if a workshop ends up a flop, it’s relatively easy to see who to blame—either the workshop chair for not predicting it, or the organizers for failing to organize. I can’t think of a single time when I attended both the workshops and the conference that the workshops were less interesting than the conference. My understanding is that this observation is common. Given this discussion, it will be particularly interesting to see how the review process Michael and Leon setup for ICML this year pans out, as it is a system with notably more responsibility assignment than in previous years.

Journals end up looking relatively good with respect to vetocracy avoidance. The ones I’m familiar with have a chief editor who bears responsibility for routing papers to an action editor, who bears responsibility for choosing good reviewers. Every agent except the reviewers is often known by the authors, and the reviewers don’t act as additional vetoers in nearly as strong a manner as reviewers with the opportunity to bid.

This experience has also altered my view of blogging and research. On one hand, I’m very enthusiastic about research in general, and my research in particular, where we are regularly cracking conventionally impossible problems. On the other hand, it seems that some small number of people viewing a discussion silently decide they don’t like it, and veto it given the opportunity. It only takes one to turn strong paper into a years-long odyssey, so public discussion of research directions and topics in a vetocracy is akin to voluntarily wearing a “kick me” sign. While this a problem for me, I expect it to be even worse for the members of a vetocracy in the long term.

It’s hard to imagine any research community surviving without a serious online presence. When a prospective new researcher looks around at existing research, if they don’t find serious online discussion, they’ll assume it doesn’t exist under the “not on the internet so it doesn’t exist” principle. This will starve a field of new people. More generally, there is an opportunity to get feedback about research directions and problems much more rapidly than is otherwise possible, allowing us to avoid research on dead end topics which are pervasive. At some point, it may even seem that people not willing to discuss their research simply avoid doing so because it is critically lacking in one way or another. Since a vetocracy creates a substantial disincentive to discuss research directions online, we can expect that communities sticking with decision by vetocracy to be at a substantial disadvantage.

12/27/2008

Adversarial Academia

One viewpoint on academia is that it is inherently adversarial: there are finite research dollars, positions, and students to work with, implying a zero-sum game between different participants. This is not a viewpoint that I want to promote, as I consider it flawed. However, I know several people believe strongly in this viewpoint, and I have found it to have substantial explanatory power.

For example:

  1. It explains why your paper was rejected based on poor logic. The reviewer wasn’t concerned with research quality, but rather with rejecting a competitor.
  2. It explains why professors rarely work together. The goal of a non-tenured professor (at least) is to get tenure, and a case for tenure comes from a portfolio of work that is undisputably yours.
  3. It explains why new research programs are not quickly adopted. Adopting a competitor’s program is impossible, if your career is based on the competitor being wrong.

Different academic groups subscribe to the adversarial viewpoint in different degrees. In my experience, NIPS is the worst. It is bad enough that the probability of a paper being accepted at NIPS is monotonically decreasing in it’s quality. This is more than just my personal experience over a number of years, as it’s corroborated by others who have told me the same. ICML (run by IMLS) used to have less of a problem, but since it has become more like NIPS over time, it has inherited this problem. COLT has not suffered from this problem as much in my experience, although it had other problems related to the focus being defined too narrowly. I do not have enough experience with UAI or KDD to comment there.

There are substantial flaws in the adversarial viewpoint.

  1. The adversarial viewpoint makes you stupid. When viewed adversarially, any idea has crippling disadvantages and no advantages. Contorting your viewpoint enough to make this true damages your ability to conduct research. In short, it promotes poor mental hygiene.
  2. Many activities become impossible. Doing research is in general extremely hard, so there are many instances where working with other people can allow you to do things which are otherwise impossible.
  3. The previous two disadvantages apply even more strongly for a community—good ideas are more likely to be missed, change comes slowly, and often with steps backward.
  4. At it’s most basic level, the assumption that research is zero-sum is flawed, because the process of research is not done in a closed system. If the rest of society at large discovers that research is valuable, then the budget increases.

Despite these disadvantages, there is a substantial advantage as well: you can materially protect and aid your career by rejecting papers, preventing grants, and generally discriminating against key people doing interesting but competitive work.

The adversarial viewpoint has a validity in proportion to the number of people subscribing to it. For those of us who would like to deemphasize the adversarial viewpoint, what’s unclear is: how?

One concrete thing is: use Arxiv. For a long time, physicists have adopted an Arxiv-first philosophy, which I’ve come to respect. Arxiv functions as a universal timestamp which decreases the power of an adversarial reviewer. Essentially, you avoid giving away the power to muddy the track of invention. I’m expecting to use Arxiv for essentially all my past-but-unpublished and future papers.

It is plausible that limiting the scope of bidding, as Andrew McCallum suggested at the last ICML, and as is effectively implemented at this ICML, will help. The system of review at journals might also help for the same reason. In my experience as an author, if an anonymous reviewer wants to kill a paper they usually succeed. Most area chairs or program chairs are more interested in avoiding conflict with the reviewer (who they picked and may consider a friend) than reading the paper to determine the illogic of the review (which is a difficult task that simply cannot be done for all papers). NIPS experimented with a reputation system for reviewers last year, but I’m unclear on how well it worked, as an author’s score for a review and a reviewer’s score for the paper may be deeply correlated, revealing little additional information.

Public discussion of research can help with this, because very poor logic simply doesn’t stand up under public scrutiny. While I hope to nudge people in this direction, it’s clear that most people aren’t yet comfortable with public discussion.

10/14/2008

Who is Responsible for a Bad Review?

Although I’m greatly interested in machine learning, I think it must be admitted that there is a large amount of low quality logic being used in reviews. The problem is bad enough that sometimes I wonder if the Byzantine generals limit has been exceeded. For example, I’ve seen recent reviews where the given reasons for rejecting are:

  1. [NIPS] Theorem A is uninteresting because Theorem B is uninteresting.
  2. [UAI] When you learn by memorization, the problem addressed is trivial.
  3. [NIPS] The proof is in the appendix.
  4. [NIPS] This has been done before. (… but not giving any relevant citations)

Just for the record I want to point out what’s wrong with these reviews. A future world in which such reasons never come up again would be great, but I’m sure these errors will be committed many times more in the future.

  1. This is nonsense. A theorem should be evaluated based on it’s merits, rather than the merits of another theorem.
  2. Learning by memorization requires an exponentially larger sample complexity than many other common approaches that often work well. Consequently, what is possible under memorization does not have any substantial bearing on common practice or what might be useful in the future.
  3. Huh? Other people, thank you for putting the proof in the appendix, so the paper reads better. It seems absurd to base a decision on the placement of the content rather than the content.
  4. This is a red flag for a bogus review. Every time I’ve seen a review (as an author or a fellow reviewer) where such claims are made without a concrete citation, they are false. Often they are false even when concrete citations are given.

A softer version of (4) is when someone is cranky because their own paper wasn’t cited. This is understandable, but a more appropriate response seems to be pointing things out, and reviewing anyways. This avoids creating the extra work (for authors and reviewers) of yet another paper resubmission, and reasonable authors do take such suggestions into account.

NIPS figures fairly prominently here. While these are all instances in the last year, my experience after interacting with NIPS for almost a decade is that the average quality of reviews is particularly low there—in many instances reviewers clearly don’t read the papers before writing the review. Furthermore, such low quality reviews are often the deciding factor for the paper decision. Blaming the reviewer seems to be the easy solution for a bad review, but a bit more thought suggests other possibilities:

  1. Area Chair In some conferences an “area chair” or “senior PC” makes or effectively makes the decision on a paper. In general, I’m not a fan of activist area chairs, but when a reviewer isn’t thinking well, I think it is appropriate to step in. This rarely happens, because the easy choice is to simply accept the negative review. In my experience, many Area Chairs are eager to avoid any substantial controversy, and there is a general tendency to believe that something must be wrong with a paper that has a negative review, even if it isn’t what was actually pointed out.
  2. Program Chair In smaller conferences, Program Chairs play the same role as the area chair, so all of the above applies, except now you know the persons name explicitly making them easier to blame. This is a little bit too tempting, I think. For example, I know David McAllester understands that learning by memorization is a bogus reference point, and probably he was just too busy to really digest the reviews. However, a Program Chair is responsible for finding appropriate reviewers for papers, and doing so (or not) has a huge impact on whether a paper is accepted. Not surprisingly, if a paper about the sample complexity of learning is routed to people who have never seen a proof involving sample complexity before, the reviews tend to be spuriously negative (and the paper unread).
  3. Author A reviewer might blame an author, if it turns out later that the reasons given in the review for rejection were bogus. This isn’t absurd—writing a paper well is hard and it’s easy for small mistakes to be drastically misleading in technical content.
  4. Culture A conference has a culture associated with it that is driven by the people who keep coming back. If in this culture it is considered ok to do all the reviews on the last day, it’s unsurprising to see reviews lacking critical thought that could be written without reading the paper. Similarly, it’s unsurprising to see little critical thought at the area chair level, or in the routing of papers to reviewers. This answer is pretty convincing: it explains why low quality reviews keep happening year after year at a conference.

If you believe the Culture reason, then what’s needed is a change in the culture. The good news is that this is both possible and effective. There are other conferences where reviewers expect to spend several hours reviewing a paper. In my experience this year, it was true of COLT and for my corner of SODA. Effecting the change is simply a matter of community standards, and that is just a matter of leaders in the community leading.

« Newer PostsOlder Posts »

Powered by WordPress