Machine Learning (Theory)

6/16/2013

Representative Reviewing

Tags: Conferences,Reviewing,Workshop jl@ 10:09 am

When thinking about how best to review papers, it seems helpful to have some conception of what good reviewing is. As far as I can tell, this is almost always only discussed in the specific context of a paper (i.e. your rejected paper), or at most an area (i.e. what a “good paper” looks like for that area) rather than general principles. Neither individual papers nor areas are sufficiently general for a large conference: every paper differs in the details, and what if you want to build a new area and/or cross areas?

An unavoidable reason for reviewing is that the community of research is too large. In particular, it is not possible for a researcher to read every paper which someone thinks might be of interest. This reason for reviewing exists independent of constraints on rooms or scheduling formats of individual conferences. Indeed, history suggests that physical constraints are relatively meaningless over the long term — growing conferences simply use more rooms and/or change formats to accommodate the growth.

This suggests that a generic test for paper acceptance should be “Are there a significant number of people who will be interested?” This question could theoretically be answered by sending the paper to every person who might be interested and simply asking them. In practice, this would be an intractable use of people’s time: We must query far fewer people and achieve an approximate answer to this question. Our goal then should be minimizing the approximation error for some fixed amount of reviewing work.
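To get a rough feel for the approximation error, here is a minimal sketch. It assumes, purely for illustration, that each reviewer is an independent random draw from the pool of potentially interested readers and answers a yes/no question; real reviewer assignment is not random, so treat this as an idealized picture. The function name `standard_error` is made up for this example.

```python
from math import sqrt

def standard_error(p, k):
    """Standard error of a sample proportion estimated from k independent
    yes/no answers, when the true fraction of interested readers is p."""
    return sqrt(p * (1 - p) / k)

# Worst case is p = 0.5. The error shrinks only as 1/sqrt(k), so going from
# 3 reviewers to 12 halves the error at four times the reviewing work.
for k in (3, 12, 48):
    print(k, round(standard_error(0.5, k), 3))
```

Under these assumptions, a handful of reviewers gives only a coarse estimate, which is why assignment quality and bias matter so much for the few queries we can afford.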

Viewed from this perspective, the first way that things can go wrong is by misassignment of reviewers to papers, for which there are two easy failure modes available.

  1. When reviewer/paper assignment is automated based on an affinity graph, the affinity graph may be low quality or the constraint on the maximum number of papers per reviewer can easily leave some papers with low affinity to all reviewers orphaned.
  2. When reviewer/paper assignments are done by one person, that person may choose reviewers who are all like-minded, simply because this is the crowd that they know. I’ve seen this happen at the beginning of the reviewing process, but the more insidious case is when it happens at the end, where people are pressed for time and low quality judgements can become common.

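An interesting approach for addressing the constraint problem would be to optimize a different objective, such as the product of affinities rather than the sum. A single near-zero affinity drags a product toward zero, so a product objective penalizes orphaned papers far more heavily than a sum does. I’ve seen no experimentation of this sort.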
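A toy sketch of the difference between the two objectives, using brute-force search over a tiny made-up affinity matrix. The function `best_assignment`, the numbers, and the one-reviewer-per-paper setup are all assumptions for illustration; real assignment systems use proper optimization solvers and more reviewers per paper.

```python
from collections import Counter
from itertools import product
from math import prod

def best_assignment(affinity, max_load, score):
    """Try every way of giving each paper one reviewer, skip assignments
    where any reviewer exceeds max_load, and return the best under `score`.
    affinity[p][r] is the affinity of paper p to reviewer r."""
    n_reviewers = len(affinity[0])
    best, best_value = None, None
    for choice in product(range(n_reviewers), repeat=len(affinity)):
        if max(Counter(choice).values()) > max_load:
            continue  # violates the papers-per-reviewer constraint
        value = score(affinity[p][r] for p, r in enumerate(choice))
        if best_value is None or value > best_value:
            best, best_value = choice, value
    return best

# Hypothetical affinities: paper 1 has no affinity at all to reviewer 1.
affinity = [[1.0, 0.5],
            [0.4, 0.0]]

by_sum = best_assignment(affinity, max_load=1, score=sum)
by_product = best_assignment(affinity, max_load=1, score=prod)
print(by_sum, by_product)
```

In this example the sum objective gives paper 0 its favorite reviewer and leaves paper 1 orphaned with a zero-affinity reviewer, while the product objective accepts a slightly lower total to avoid the orphan.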

For ICML, there are about 3 levels of “reviewer”: the program chair who is responsible for all papers, the area chair who is responsible for organizing reviewing on a subset of papers, and the program committee member/reviewer who has primary responsibility for reviewing. In 2012, we tried to avoid these failure modes in a least-system-effort way using a blended approach. We used bidding to get a higher quality affinity matrix. We used a constraint system to assign the first reviewer to each paper and two area chairs to each paper. Then, we asked each area chair to find one reviewer for each paper. This obviously dealt with the one-area-chair failure mode. It also helped substantially with low quality assignments from the constrained system since (a) the first reviewer chosen is typically higher quality than the last due to it being the least constrained, (b) misassignments to area chairs are diagnosed at the beginning of the process by ACs trying to find reviewers, and (c) ACs can reach outside of the initial program committee to find reviewers, which existing automated systems cannot do.

The next way that reviewing can go wrong is via biased reviewing.

  1. Author name bias is a famous one. In my experience it is real: well known authors automatically have their paper taken seriously, which particularly matters when time is short. Furthermore, I’ve seen instances where well-known authors can slide by with proof sketches that no one fully understands.
  2. Review anchoring is a very significant problem if it occurs. This does not happen in the standard review process, because the reviews of others are not visible to other reviewers until they are complete.
  3. A more subtle form of bias is when one reviewer is simply much louder or more charismatic than others. Reviewing without an in-person meeting is actually helpful here, as it reduces this problem substantially.

Reviewing can also be low quality. A primary issue here is time: most reviewers will submit a review within a time constraint, but it may not be high quality due to limits on time. Minimizing average reviewer load is quite important here. Staggered deadlines for reviews are almost certainly also helpful. A more subtle thing is discouraging low quality submissions. My favored approach here is to publish all submissions nonanonymously after some initial period of time.

Another significant issue in reviewer quality is motivation. Making reviewers not anonymous to each other helps with motivation as poor reviews will at least be known to some. Author feedback also helps with motivation, as reviewers know that authors will be able to point out poor reviewing. It is easy to imagine that further improvements in reviewer motivation would be helpful.

A third form of low quality review is based on miscommunication. Maybe there is a silly typo in a paper? Maybe something was confusing? Being able to communicate with the author can greatly reduce ambiguities.

The last problem is dictatorship at decision time for which I’ve seen several variants. Sometimes this comes in the form of giving each area chair a budget of papers to “champion”. Sometimes this comes in the form of an area chair deciding to override all reviews and either accept or more likely reject a paper. Sometimes this comes in the form of a program chair doing this as well. The power of dictatorship is often available, but it should not be used: the wiser course is keeping things representative.

At ICML 2012, we tried to deal with this via a defined power approach. When reviewers agreed on the accept/reject decision, that was the decision. If the reviewers disagreed, we asked the two area chairs to make decisions and if they agreed, that was the decision. It was only when the ACs disagreed that the program chairs would become involved in the decision.
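The escalation rule just described can be written down as a short sketch. The function name `decide`, the string labels, and the callable used for the program chairs are illustrative choices, not part of any actual ICML system.

```python
def decide(reviewer_votes, ac_votes, pc_decision):
    """Return (decision, decided_by) under the defined power approach.

    reviewer_votes and ac_votes are lists of "accept"/"reject" strings.
    pc_decision is a callable, so the program chairs are consulted only
    when both lower levels disagree among themselves.
    """
    if len(set(reviewer_votes)) == 1:   # reviewers unanimous
        return reviewer_votes[0], "reviewers"
    if len(set(ac_votes)) == 1:         # the area chairs agree
        return ac_votes[0], "area chairs"
    return pc_decision(), "program chairs"

print(decide(["accept", "reject", "accept"], ["reject", "reject"],
             lambda: "accept"))
```

Note that unanimous reviewers cannot be overridden in this sketch, which is the point: each level only gains power when the level below it is split.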

The above provides an understanding of how to create a good reviewing process for a large conference. With this in mind, we can consider various proposals at the peer review workshop and elsewhere.

  1. Double Blind Review. This reduces bias, at the cost of decreasing reviewer motivation. Overall, I think it’s a significant long term positive for a conference as “insiders” naturally become more concerned with review quality and “outsiders” are more prone to submit.
  2. Better paper/reviewer matching. A pure win, with the only caveat that you should be familiar with failure modes and watch out for them.
  3. Author feedback. This improves review quality by placing a check on unfair reviews and reducing miscommunication at some cost in time.
  4. Allowing an appendix or ancillary materials. This allows authors to better communicate complex ideas, at the potential cost of reviewer time. A standard compromise is to make reading an appendix optional for reviewers.
  5. Open reviews. Open reviews means that people can learn from other reviews, and that authors can respond more naturally than in single round author feedback.

It’s important to note that none of the above are inherently contradictory. This is not necessarily obvious as proponents of open review and double blind review have found themselves in opposition at times. These approaches can be accommodated by simply hiding authors’ names for a fixed period of 2 months while the initial review process is ongoing.

Representative reviewing seems like the real difficult goal. If a paper is rejected in a representative reviewing process, then perhaps it is just not of sufficient interest. Similarly, if a paper is accepted, then perhaps it is of real and meaningful interest. And if the reviewing process is not representative, then perhaps we should fix the failure modes.

Edit: Crossposted on CACM.

5/4/2013

COLT and ICML registration

Sebastien Bubeck points out COLT registration with a May 13 early registration deadline. The local organizers have done an admirable job of containing costs with a $300 registration fee.

ICML registration is also available, at roughly 3x the cost. My understanding is that this is partly due to the costs of a larger conference being harder to contain, partly due to ICML lasting twice as long with tutorials and workshops, and partly because the conference organizers were a bit over-conservative in various ways.

10/26/2012

ML Symposium and Strata/Hadoop World

Tags: Conferences,Workshop jl@ 11:40 am

The New York ML symposium was last Friday. There were 303 registrations, up a bit from last year. I particularly enjoyed talks by Bill Freeman on vision and ML, Jon Lenchner on strategy in Jeopardy, and Tara N. Sainath and Brian Kingsbury on deep learning for speech recognition. If anyone has suggestions or thoughts for next year, please speak up.

I also attended Strata + Hadoop World for the first time. This is primarily a trade conference rather than an academic conference, but I found it pretty interesting as a first time attendee. This is ground zero for the Big data buzzword, and I see now why. It’s about data, and the word “big” is so ambiguous that everyone can lay claim to it. There were essentially zero academic talks. Instead, the focus was on war stories, product announcements, and education. The general level of education is much lower—explaining Machine Learning to the SQL educated is the primary operating point. Nevertheless that’s happening, and the fact that machine learning is considered a necessary technology for industry is a giant step for the field. Over time, I expect the industrial side of Machine Learning to grow, and perhaps surpass the academic side, in the same sense as has already occurred for chip design. Amongst the talks I could catch, I particularly liked the Github, Zillow, and Pandas talks. Ted Dunning also gave a particularly masterful talk, although I have doubts about the core Bayesian Bandit approach(*). The streaming k-means algorithm they implemented does look quite handy.

(*) The doubt is the following: prior elicitation is generally hard, and Bayesian techniques are not robust to misspecification. This matters in standard supervised settings, but it may matter more in exploration settings where misspecification can imply data starvation.
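A small simulation sketch of the data starvation concern. It assumes Thompson sampling with Beta priors on Bernoulli rewards as a stand-in for a Bayesian bandit (the details of the approach in the talk are not given here), and the arm rates and priors are made up.

```python
import random

def thompson_pulls(true_rates, priors, rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling and count pulls per arm.
    priors is a list of (alpha, beta) pairs, one per arm."""
    rng = random.Random(seed)
    params = [list(p) for p in priors]
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Sample a plausible success rate per arm and play the largest.
        samples = [rng.betavariate(a, b) for a, b in params]
        arm = max(range(len(samples)), key=samples.__getitem__)
        reward = rng.random() < true_rates[arm]
        params[arm][0 if reward else 1] += 1  # posterior update
        pulls[arm] += 1
    return pulls

true_rates = [0.9, 0.5]  # arm 0 is actually the better arm

flat = thompson_pulls(true_rates, [(1, 1), (1, 1)], rounds=1000)
# Misspecified: the prior is confidently (and wrongly) pessimistic about arm 0.
wrong = thompson_pulls(true_rates, [(1, 1000), (1, 1)], rounds=1000)
print("flat priors:", flat, "misspecified prior:", wrong)
```

With the confidently wrong prior, the better arm is almost never played, so the data that would correct the prior is never collected. In a supervised setting the same bad prior would at least be confronted with all of the data.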

8/27/2012

NYAS ML 2012 and ICML 2013

The New York Machine Learning Symposium is October 19. The deadline for 2-page abstracts is September 13, submitted via email with subject “Machine Learning Poster Submission” sent to physicalscience@nyas.org. Everyone is welcome to submit. Last year’s attendance was 246 and I expect more this year.

The primary experiment for ICML 2013 is multiple paper submission deadlines with rolling review cycles. The key dates are October 1, December 15, and February 15. This is an attempt to shift ICML further towards a journal style review process and reduce peak load. The “not for proceedings” experiment from this year’s ICML is not continuing.

Edit: Fixed second ICML deadline.

6/29/2012

ICML survey and comments

Just about nothing could keep me from attending ICML, except for Dora who arrived on Monday. Consequently, I have only secondhand reports that the conference is going well.

For those who are remote (like me) or after the conference (like everyone), Mark Reid has set up the ICML discussion site where you can comment on any paper or subscribe to papers. Authors are automatically subscribed to their own papers, so it should be possible to have a discussion significantly after the fact, as people desire.

We also conducted a survey before the conference and have the survey results now. This can be compared with the ICML 2010 survey results. Looking at the comparable questions, we can sometimes order the answers to have scores ranging from 0 to 3 or 0 to 4 with 3 or 4 being best and 0 worst, then compute the average difference between 2012 and 2010.

Glancing through them, I see:

  1. Most people found the papers they reviewed a good fit for their expertise (-.037 w.r.t. 2010). Achieving this was one of our subgoals in the pursuit of high quality decisions.
  2. Most people had sufficient time for doing reviews. This was something that we worried about significantly in shifting the paper deadline and otherwise massaging the schedule. Most people also thought the review period was sufficiently long and most reviews were high quality (+.023 w.r.t. 2010).
  3. About 1/4 of reviewers say that author response changed their mind on a paper and 2/3 of reviewers say discussion changed their mind on a paper. The expectation of decision impact from author response is reduced from 2010 (-.135). The existence of author response is overwhelmingly preferred.
  4. People generally found ICML reviewing the same or better than previous ICMLs (+.35 w.r.t. 2010) and other similar conferences (+.198 w.r.t. 2010) at the cost of being somewhat more work. A substantial bump in reviewing quality was a primary goal.
  5. The ACs spent substantially more time (43 hours on average) than PC members (28 hours on average). This agrees with our expectation—the set of ACs didn’t change even after we had a 50% increase in submissions. The AC load we had this year was probably too high and will need to be reduced somewhat for next year.
  6. 2/3 of authors prefer the option to revise a paper during author response.
  7. The choice of how to deal with increased submissions is deeply undecided, with a slight preference for short talk+poster as we did.
  8. Most people like having two workshop days or don’t care.
  9. There is a strong preference for COLT and UAI colocation with the next tier of preference for IJCAI, KDD, AAAI, and CVPR.
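For concreteness, here is one reading of how the score differences quoted in the list above can be computed: give each ordered answer a score starting at 0, take the mean score per year, and subtract. The answer counts below are invented for illustration and are not the actual survey data.

```python
def mean_score(counts):
    """counts[i] is the number of respondents choosing the answer scored i,
    with 0 the worst answer and len(counts) - 1 the best."""
    total = sum(counts)
    return sum(score * n for score, n in enumerate(counts)) / total

# Hypothetical answer counts for one question on a 0-to-3 scale.
counts_2010 = [10, 20, 40, 30]
counts_2012 = [5, 20, 45, 30]

diff = mean_score(counts_2012) - mean_score(counts_2010)
print(f"{diff:+.3f} w.r.t. 2010")
```

A positive difference means answers shifted toward the better end of the scale in 2012.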