If you are interested, please email msrnycrsvp at microsoft.com and say “I want to come” so we can get a count of attendees for refreshments.
Added: Videos are now online.
We’ve been somewhat disorganized in advertising this. As a consequence, anyone who has not submitted an abstract but would like to do so may send one directly to me (email@example.com title NYASMLS) by Friday March 14. I will forward them to the rest of the committee for consideration.
Manik and I are organizing the extreme classification workshop at NIPS this year. We have a number of good speakers lined up, but I would further encourage anyone working in the area to submit an abstract by October 9. I believe this is an idea whose time has now come.
The NIPS website doesn’t have other workshops listed yet, but I expect several others to be of significant interest.
This post is a (near) transcript of a talk that I gave at the ICML 2013 Workshop on Peer Review and Publishing Models. Although there’s a PDF available on my website, I’ve chosen to post a slightly modified version here as well in order to better facilitate discussion.
Disclaimers and Context
I want to start with a couple of disclaimers and some context.
First, I want to point out that although I’ve read a lot about double-blind review, this isn’t my research area and the research discussed in this post is not my own. As a result, I probably can’t answer super detailed questions about these studies.
I also want to note that I’m not opposed to open peer review — I was a free and open source software developer for over ten years and I care a great deal about openness and transparency. Rather, my motivation in writing this post is simply to create awareness of and to initiate discussion about the benefits of double-blind review.
Lastly, and most importantly, I think it’s essential to acknowledge that there’s a lot of research on double-blind review out there. Not all of this research is in agreement, in part because it’s hard to control for all the variables involved and in part because most studies involve a single journal or discipline. And, because these studies arise from different disciplines, they can be difficult to track down; to my knowledge at least, there’s no “Journal of Double-Blind Review Research.” These factors make for a hard landscape to navigate. My goal here is therefore to draw your attention to some of the key benefits of double-blind review so that we don’t lose sight of them when considering alternative reviewing models.
How Blind Is It?
The primary motivation behind double-blind peer review — in which the identities of a paper’s authors and reviewers are concealed from each other — is to eliminate bias in the reviewing process by preventing factors other than scientific quality from influencing the perceived merit of the work under review. At this point in time, double-blind review is the de facto standard for machine learning conferences.
Before I discuss the benefits of double-blind review, however, I’d like to address one of its most commonly heard criticisms: “But it’s possible to infer author identity from content!”, i.e., that double-blind review isn’t really blind, so there’s no point in implementing it. It turns out that there’s some truth to this statement, but there’s also a lot of untruth. There are several studies that directly test this assertion by asking reviewers whether authors or institutions are identifiable and, if so, to record their identities and describe the clues that led to their identification.
The results are pretty interesting: when asked to guess the identities of authors or institutions, reviewers are correct only 25–42% of the time. The most common identification clues are self-referencing and authors’ initials or institution identities in the manuscript, followed by reviewers’ personal knowledge [2, 3]. Furthermore, higher identification percentages correspond to journals in which papers are required to explicitly state the source of the data being studied. This indicates that journals, not just authors, bear some responsibility for the degree of identification clues present and can therefore influence the extent to which review is truly double-blind.
Is It Necessary?
Another commonly heard criticism of double-blind review is “But I’m not biased!”, i.e., that double-blind review isn’t needed because factors other than scientific quality do not affect reviewers’ opinions anyway. It’s this statement that I’ll mostly be focusing on here. There are many studies that address this assertion by testing the extent to which peer review can be biased against new ideas, women, junior researchers, and researchers from less prestigious universities or countries other than the US. In the remainder of this post, I’m therefore going to give a brief overview of these studies’ findings. But before I do that, I want to talk a bit more about bias.
I think it’s important to talk about bias because I want to make it very clear that the kind of bias I’m talking about is NOT necessarily ill-intentioned, explicit, or even conscious. To quote the AAUW’s report on the under-representation of women in science, “Even individuals who consciously refute gender and science stereotypes can still hold that belief at an unconscious level. These unconscious beliefs or implicit biases may be more powerful than explicitly held beliefs and values simply because we are not aware of them.” Chapters 8 and 9 of this report provide a really great overview of recent research on implicit bias and negative stereotypes in the workplace. I highly recommend reading them (and the rest of the report, for that matter), but for the purpose of this post, it’s sufficient to remember that “Less-conscious beliefs underlying negative stereotypes continue to influence assumptions about people and behavior. [Even] good people end up unintentionally making decisions that violate [...] their own sense of what’s correct [and] what’s good.”
Prestige and Familiarity
Perhaps the most well studied form of bias is the “Matthew effect,” originally introduced by Robert Merton in 1968. This term refers to the “rich-get-richer” phenomenon whereby well known, eminent researchers get more credit for their contributions than unknown researchers. Since 1968, there’s been a considerable amount of follow-on research investigating the extent to which the Matthew effect exists in science. In the context of peer review, reviewers may be more likely to recommend acceptance of incomplete or inferior papers if they are authored by more prestigious researchers.
Country of Origin
It’s also important to consider country of origin and international bias. There’s research showing that reviewers from within the United States and reviewers from outside the United States evaluate US papers more favorably, with US reviewers showing a stronger preference for US papers than non-US reviewers. In contrast, US and non-US reviewers behaved near identically for non-US papers.
One of the most widely discussed pieces of recent work on double-blind review and gender is that of Budden et al., whose research demonstrated that following the introduction of double-blind review by the journal Behavioral Ecology, there was a significant increase in papers authored by women. This pattern was not observed in a similar journal that instead reveals author information to reviewers. Although there’s been some controversy surrounding this work, mostly questioning whether the observed increase was indeed due to the policy change or to a more widely observed phenomenon, the original authors reanalyzed their data and again found that double-blind review favors increased representation of female authors.
Race has also been demonstrated to influence reviewers’ recommendations, albeit in the context of grant funding rather than publications. Even after controlling for factors such as educational background, country of origin, training, previous research awards, publication record, and employer characteristics, African-American applicants for National Institutes of Health R01 grants are 10% less likely than white applicants to be awarded research funding.
I also want to talk briefly about stereotype threat. Stereotype threat is a phenomenon in which performance in academic contexts can be harmed by the awareness that one’s behavior might be viewed through the lens of a negative stereotype about one’s social group. For example, studies have demonstrated that African-American students enrolled in college and female students enrolled in math and science courses score much lower on tests when they are reminded beforehand of their race or gender [10, 11]. In the case of female science students, simply having a larger ratio of men to women present in the testing situation can lower women’s test scores. Several factors may contribute to this decreased performance, including the anxiety, reduced attention, and self-consciousness associated with worrying about whether or not one is confirming the stereotype. One idea that hasn’t yet been explored in the context of peer review, but might be worth investigating, is whether requiring authors to reveal their identities during peer review induces a stereotype threat scenario.
Lastly, I want to mention the identification of reviewers. Although there’s much less research on this side of the equation, it’s definitely worth considering the effects of revealing reviewer identities as well, especially for more junior reviewers. To quote Mainguy et al.’s article in PLoS Biology, “Reviewers, and especially newcomers, may feel pressured into accepting a mediocre paper from a more established lab in fear of future reprisals.”
I want to conclude by reminding you that my goal in writing this post was to create awareness about the benefits of double-blind review. There’s a great deal of research on double-blind review and although it can be a hard landscape to navigate — in part because there are many factors involved, not all of which can be trivially controlled in experimental conditions — there are studies out there that demonstrate concrete benefits of double-blind review. Perhaps more importantly though, double-blind review promotes the PERCEPTION of fairness. To again quote Mainguy et al., “[Double-blind review] bears symbolic power that will go a long way to quell fears and frustrations, thereby generating a better perception of fairness and equality in global scientific funding and publishing.”
References
[1] Budden, Tregenza, Aarssen, Koricheva, Leimu, Lortie. “Double-blind review favours increased representation of female authors.” 2008.
[2] Yankauer. “How blind is blind review?” 1991.
[3] Katz, Proto, Olmsted. “Incidence and nature of unblinding by authors: our experience at two radiology journals with double-blinded peer review policies.” 2002.
[4] Hill, Corbett, St. Rose. “Why so few? Women in science, technology, engineering, and mathematics.” 2010.
[5] Merton. “The Matthew effect in science.” 1968.
[6] Link. “US and non-US submissions: an analysis of reviewer bias.” 1998.
[7] Webb, O’Hara, Freckleton. “Does double-blind review benefit female authors?” 2008.
[8] Budden, Lortie, Tregenza, Aarssen, Koricheva, Leimu. “Response to Webb et al.: Double-blind review: accept with minor revisions.” 2008.
[9] Ginther, Schaffer, Schnell, Masimore, Liu, Haak, Kington. “Race, ethnicity, and NIH research awards.” 2011.
[10] Steele, Aronson. “Stereotype threat and the intellectual test performance of African Americans.” 1995.
[11] Dar-Nimrod, Heine. “Exposure to scientific theories affects women’s math performance.” 2006.
[12] Mainguy, Motamedi, Mietchen. “Peer review—the newcomers’ perspective.” 2005.
When thinking about how best to review papers, it seems helpful to have some conception of what good reviewing is. As far as I can tell, this is almost always only discussed in the specific context of a paper (i.e. your rejected paper), or at most an area (i.e. what a “good paper” looks like for that area) rather than general principles. Neither individual papers nor areas are sufficiently general for a large conference: every paper differs in the details, and what if you want to build a new area and/or cross areas?
An unavoidable reason for reviewing is that the community of research is too large. In particular, it is not possible for a researcher to read every paper which someone thinks might be of interest. This reason for reviewing exists independent of constraints on rooms or scheduling formats of individual conferences. Indeed, history suggests that physical constraints are relatively meaningless over the long term — growing conferences simply use more rooms and/or change formats to accommodate the growth.
This suggests that a generic test for paper acceptance should be “Are there a significant number of people who will be interested?” This question could theoretically be answered by sending the paper to every person who might be interested and simply asking them. In practice, this would be an intractable use of people’s time: We must query far fewer people and achieve an approximate answer to this question. Our goal then should be minimizing the approximation error for some fixed amount of reviewing work.
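To make the approximation error concrete, here is a minimal sketch under a strong simplifying assumption that is mine rather than the post’s: the k reviewers are drawn uniformly at random from the people who might be interested, and each independently reports interest (x_i = 1) or not (x_i = 0). The symbols p (the true fraction interested), k, x_i, and \hat{p} are introduced only for illustration.

```latex
\hat{p} = \frac{1}{k} \sum_{i=1}^{k} x_i,
\qquad
\mathbb{E}[\hat{p}] = p,
\qquad
\sqrt{\operatorname{Var}(\hat{p})} = \sqrt{\frac{p(1-p)}{k}} \le \frac{1}{2\sqrt{k}}
```

Under this idealization, three reviewers give a worst-case standard error of roughly 0.29, and halving that error takes four times the reviewing work. Real reviewer assignments are not random samples, so this is at best a rough guide to how little information a handful of reviews carries.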
Viewed from this perspective, the first way that things can go wrong is by misassignment of reviewers to papers, for which there are two easy failure modes available.
An interesting approach for addressing the constraint objective would be optimizing a different objective, such as the product of affinities rather than the sum. I’ve seen no experimentation of this sort.
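As a rough illustration of why the two objectives can disagree, here is a small sketch. It assumes a toy setting that is not from any real conference system: a square affinity matrix with values in (0, 1], exactly one reviewer per paper, and brute-force search over all assignments. The function name `best_assignment` and the numbers are hypothetical.

```python
from itertools import permutations
from math import prod


def best_assignment(affinity, score):
    """Return the reviewer -> paper assignment that maximizes `score`.

    `affinity[r][p]` is the (hypothetical) affinity of reviewer r for paper p.
    `score` combines the affinities of the chosen pairs, e.g. `sum` or `prod`.
    Brute force over permutations, so only suitable for tiny examples.
    """
    n = len(affinity)
    return max(
        permutations(range(n)),
        key=lambda perm: score(affinity[r][perm[r]] for r in range(n)),
    )


# Toy affinities: reviewer 0 is a great match for paper 0,
# but reviewer 1 is a very poor match for paper 1.
affinity = [
    [1.0, 0.5],
    [0.5, 0.05],
]

by_sum = best_assignment(affinity, sum)    # maximizes total affinity
by_prod = best_assignment(affinity, prod)  # penalizes any single bad match

print("sum objective:    ", by_sum)
print("product objective:", by_prod)
```

In this toy case the sum objective accepts one very poor match in exchange for one excellent match, while the product objective prefers two moderate matches. Whether that tradeoff helps in practice is exactly the untested question.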
For ICML, there are about 3 levels of “reviewer”: the program chair who is responsible for all papers, the area chair who is responsible for organizing reviewing on a subset of papers, and the program committee member/reviewer who has primary responsibility for reviewing. In 2012, we tried to avoid these failure modes in a least-system effort way using a blended approach. We used bidding to get a higher quality affinity matrix. We used a constraint system to assign the first reviewer to each paper and two area chairs to each paper. Then, we asked each area chair to find one reviewer for each paper. This obviously dealt with the one-area-chair failure mode. It also helps substantially with low quality assignments from the constrained system since (a) the first reviewer chosen is typically higher quality than the last due to it being the least constrained, (b) misassignments to area chairs are diagnosed at the beginning of the process by ACs trying to find reviewers, and (c) ACs can reach outside of the initial program committee to find reviewers, which existing automated systems cannot do.
The next way that reviewing can go wrong is via biased reviewing.
Reviewing can also be low quality. A primary issue here is time: most reviewers will submit a review within a time constraint, but it may not be high quality due to limits on time. Minimizing average reviewer load is quite important here. Staggered deadlines for reviews are almost certainly also helpful. A more subtle thing is discouraging low quality submissions. My favored approach here is to publish all submissions nonanonymously after some initial period of time.
Another significant issue in reviewer quality is motivation. Making reviewers not anonymous to each other helps with motivation as poor reviews will at least be known to some. Author feedback also helps with motivation, as reviewers know that authors will be able to point out poor reviewing. It is easy to imagine that further improvements in reviewer motivation would be helpful.
A third form of low quality review is based on miscommunication. Maybe there is a silly typo in a paper? Maybe something was confusing? Being able to communicate with the author can greatly reduce ambiguities.
The last problem is dictatorship at decision time for which I’ve seen several variants. Sometimes this comes in the form of giving each area chair a budget of papers to “champion”. Sometimes this comes in the form of an area chair deciding to override all reviews and either accept or more likely reject a paper. Sometimes this comes in the form of a program chair doing this as well. The power of dictatorship is often available, but it should not be used: the wiser course is keeping things representative.
At ICML 2012, we tried to deal with this via a defined power approach. When reviewers agreed on the accept/reject decision, that was the decision. If the reviewers disagreed, we asked the two area chairs to make decisions and if they agreed, that was the decision. It was only when the ACs disagreed that the program chairs would become involved in the decision.
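The escalation rule described above can be sketched as a small function. This is only an illustration of the logic as stated in the paragraph, not the actual ICML 2012 tooling; the function name `decide`, the string votes, and the returned label are all hypothetical.

```python
def decide(reviewer_votes, area_chair_votes, program_chair_vote):
    """Return (decision, decided_by) under the defined power approach.

    Votes are the strings "accept" or "reject". Both vote lists are assumed
    to be non-empty. The program chair vote is only consulted when both the
    reviewers and the area chairs are split.
    """
    if len(set(reviewer_votes)) == 1:
        # Unanimous reviewers: their decision stands.
        return reviewer_votes[0], "reviewers"
    if len(set(area_chair_votes)) == 1:
        # Reviewers split, but the area chairs agree.
        return area_chair_votes[0], "area chairs"
    # Only now do the program chairs become involved.
    return program_chair_vote, "program chairs"


print(decide(["accept", "accept", "accept"], ["reject", "reject"], "reject"))
print(decide(["accept", "reject", "accept"], ["reject", "reject"], "accept"))
print(decide(["accept", "reject", "accept"], ["accept", "reject"], "reject"))
```

The point of the ordering is that higher levels cannot override agreement at a lower level, which is what keeps the process representative rather than dictatorial.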
The above provides an understanding of how to create a good reviewing process for a large conference. With this in mind, we can consider various proposals at the peer review workshop and elsewhere.
It’s important to note that none of the above are inherently contradictory. This is not necessarily obvious as proponents of open review and double blind review have found themselves in opposition at times. These approaches can be accommodated by simply hiding authors’ names for a fixed period of 2 months while the initial review process is ongoing.
Representative reviewing seems like the really difficult goal. If a paper is rejected in a representative reviewing process, then perhaps it is just not of sufficient interest. Similarly, if a paper is accepted, then perhaps it is of real and meaningful interest. And if the reviewing process is not representative, then perhaps we should fix the failure modes.
Edit: Crossposted on CACM.
Adam Kalai points out the New England Machine Learning Day May 1 at MSR New England. There is a poster session with abstracts due April 19. I understand last year’s NEML went well and it’s great to meet your neighbors at regional workshops like this.
Michael Jordan sends the below:
The new Simons Institute for the Theory of Computing
will begin organizing semester-long programs starting in 2013.
One of our first programs, set for Fall 2013, will be on the “Theoretical Foundations
of Big Data Analysis”. The organizers of this program are Michael Jordan (chair),
Stephen Boyd, Peter Buehlmann, Ravi Kannan, Michael Mahoney, and Muthu Muthukrishnan.
See http://simons.berkeley.edu/program_bigdata2013.html for more information on
The Simons Institute has created a number of “Research Fellowships” for young
researchers (within at most six years of the award of their PhD) who wish to
participate in Institute programs, including the Big Data program. Individuals
who already hold postdoctoral positions or who are junior faculty are welcome
to apply, as are finishing PhDs.
Please note that the application deadline is January 15, 2013. Further details
are available at http://simons.berkeley.edu/fellows.html .
The New York ML symposium was last Friday. There were 303 registrations, up a bit from last year. I particularly enjoyed talks by Bill Freeman on vision and ML, Jon Lenchner on strategy in Jeopardy, and Tara N. Sainath and Brian Kingsbury on deep learning for speech recognition. If anyone has suggestions or thoughts for next year, please speak up.
I also attended Strata + Hadoop World for the first time. This is primarily a trade conference rather than an academic conference, but I found it pretty interesting as a first time attendee. This is ground zero for the Big data buzzword, and I see now why. It’s about data, and the word “big” is so ambiguous that everyone can lay claim to it. There were essentially zero academic talks. Instead, the focus was on war stories, product announcements, and education. The general level of education is much lower—explaining Machine Learning to the SQL educated is the primary operating point. Nevertheless that’s happening, and the fact that machine learning is considered a necessary technology for industry is a giant step for the field. Over time, I expect the industrial side of Machine Learning to grow, and perhaps surpass the academic side, in the same sense as has already occurred for chip design. Amongst the talks I could catch, I particularly liked the Github, Zillow, and Pandas talks. Ted Dunning also gave a particularly masterful talk, although I have doubts about the core Bayesian Bandit approach(*). The streaming k-means algorithm they implemented does look quite handy.
(*) The doubt is the following: prior elicitation is generally hard, and Bayesian techniques are not robust to misspecification. This matters in standard supervised settings, but it may matter more in exploration settings where misspecification can imply data starvation.
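To illustrate the data starvation concern, here is a minimal sketch. It assumes the “Bayesian Bandit” approach behaves like Beta-Bernoulli Thompson sampling, which is my stand-in and not necessarily what was presented in the talk. The function name `thompson_pulls`, the reward rates, and the priors are all made up for illustration.

```python
import random


def thompson_pulls(true_rates, priors, rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling and count pulls per arm.

    `true_rates[i]` is the real success probability of arm i.
    `priors[i]` is the (alpha, beta) pair of the Beta prior placed on arm i.
    """
    rng = random.Random(seed)
    params = [list(p) for p in priors]  # posterior Beta parameters per arm
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Sample a plausible success rate for each arm from its posterior
        # and play the arm whose sample is largest.
        samples = [rng.betavariate(a, b) for a, b in params]
        arm = max(range(len(samples)), key=samples.__getitem__)
        reward = rng.random() < true_rates[arm]
        params[arm][0 if reward else 1] += 1  # update alpha or beta
        pulls[arm] += 1
    return pulls


true_rates = [0.5, 0.9]  # arm 1 is truly better

# Uninformative priors on both arms.
flat = thompson_pulls(true_rates, [(1, 1), (1, 1)], rounds=1000)

# Misspecified prior: confidently (and wrongly) believes arm 1 is terrible.
skewed = thompson_pulls(true_rates, [(1, 1), (1, 1000)], rounds=1000)

print("pulls with flat priors:        ", flat)
print("pulls with misspecified prior: ", skewed)
```

With the misspecified prior, the better arm is almost never sampled, so the data needed to correct the prior never arrives. In a supervised setting the same bad prior would eventually be overwhelmed by data that arrives regardless of the learner’s beliefs.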
The main program will feature invited talks from Peter Bartlett, William Freeman, and Vladimir Vapnik, along with numerous spotlight talks and a poster session. Following the main program, hackNY and Microsoft Research are sponsoring a networking hour with talks from machine learning practitioners at NYC startups (specifically bit.ly, Buzzfeed, Chartbeat, Sense Networks, and Visual Revenue). This should be of great interest to everyone considering working in machine learning.
The New York Machine Learning Symposium is October 19, with 2 page abstracts due September 13 via email with subject “Machine Learning Poster Submission” sent to firstname.lastname@example.org. Everyone is welcome to submit. Last year’s attendance was 246 and I expect more this year.
The primary experiment for ICML 2013 is multiple paper submission deadlines with rolling review cycles. The key dates are October 1, December 15, and February 15. This is an attempt to shift ICML further towards a journal style review process and reduce peak load. The “not for proceedings” experiment from this year’s ICML is not continuing.
Edit: Fixed second ICML deadline.
The workshop on the Meaningful Use of Complex Medical Data is happening again, August 9-12 in LA, near UAI on Catalina Island August 15-17. I enjoyed my visit last year, and expect this year to be interesting also.
May 16 in Cambridge is the New England Machine Learning Day, a first regional workshop/symposium on machine learning. To present a poster, submit an abstract by May 5.
For graduate students, the Yahoo! Key Scientific Challenges program, which includes machine learning, is on again, with applications due March 9. The application is easy and the $5K award is high quality “no strings attached” funding. Consider submitting.
The From Data to Knowledge workshop May 7-11 at Berkeley should be of interest to the many people encountering streaming data in different disciplines. It’s run by a group of astronomers who encounter streaming data all the time. I met Josh Bloom recently and he is broadly interested in a workshop covering all aspects of Machine Learning on streaming data. The hope here is that techniques developed in one area turn out useful in another, which seems quite plausible. Particularly if you are in the bay area, consider checking it out.
The New York ML symposium was last Friday. Attendance was 268, significantly larger than last year. My impression was that the event mostly still fit the space, although it was crowded. If anyone has suggestions for next year, speak up.
The best student paper award went to Sergiu Goschin for a cool video of how his system learned to play video games (I can’t find the paper online yet). Choosing amongst the submitted talks was pretty difficult this year, as there were many similarly good ones.
By coincidence all the invited talks were (at least potentially) about faster learning algorithms. Stephen Boyd talked about ADMM. Leon Bottou spoke on single pass online learning via averaged SGD. Yoav Freund talked about parameter-free hedging. In Yoav’s case the talk was mostly about a better theoretical learning algorithm, but it has the potential to unlock an exponential computational complexity improvement via oraclization of experts algorithms… but some serious thought needs to go in this direction.
Unrelated, I found quite a bit of truth in Paul’s talking bears and Xtranormal always adds a dash of funny. My impression is that the ML job market has only become hotter since 4 years ago. Anyone who is well trained can find work, with the key limiting factor being “well trained”. In this environment, efforts to make ML more automatic and more easily applied are greatly appreciated. And yes, Yahoo! is still hiring too.
Everyone should have received notice for NY ML Symposium abstracts. Check carefully, as one was lost by our system.
The event itself is October 21, next week. Leon Bottou, Stephen Boyd, and Yoav Freund are giving the invited talks this year, and there are many spotlights on local work spread throughout the day. Chris Wiggins has set up 6(!) ML-interested startups to follow the symposium, which should be of substantial interest to the employment interested.
I also wanted to give an update on ICML 2012. Unlike last year, our deadline is coordinated with AIStat (which is due this Friday). The paper deadline for ICML has been pushed back to February 24 which should allow significant time for finishing up papers after the winter break. Other details may interest people as well:
At KDD I enjoyed Stephen Boyd’s invited talk about optimization quite a bit. However, the most interesting talk for me was David Haussler’s. His talk started out with a formidable load of biological complexity. About half-way through you start wondering, “can this be used to help with cancer?” And at the end he connects it directly to use with a call to arms for the audience: cure cancer. The core thesis here is that cancer is a complex set of diseases which can be disentangled via genetic assays, making it possible to attack the specific signature of individual cancers. However, the data quantity and complex dependencies within the data require systematic and relatively automatic prediction and analysis algorithms of the kind that we are best familiar with.
Some of the papers which interested me are:
There were also three papers that were about creating (or perhaps composing) learning systems to do something cool.
I also attended MUCMD, a workshop on the Meaningful Use of Complex Medical Data shortly afterwards. This workshop is about the emergent area of using data to improve medicine. The combination of electronic health records, the economic importance of getting medicine right, and the relatively weak use of existing data implies there is much good work to do.
This finally gave us a chance to discuss radically superior medical trial designs based on work in exploration and learning.
Jeff Hammerbacher’s talk was a hilariously blunt and well stated monologue about the need to gather data in a usable way and how to do so.
Amongst the talks on using medical data, Suchi Saria’s seemed the most mature. They’ve constructed a noninvasive test for problem infants which is radically superior to the existing Apgar score according to leave-one-out cross validation.
From the doctor’s side, there was discussion of the deep balkanization of data systems within hospitals, efforts to overcome that, and the (un)trustworthiness of data. Many issues clearly remain here, but it also looks like serious progress is being made.
Overall, the workshop went well, with the broad cross-section of talks providing quite a bit of extra context you don’t normally see. It left me believing that a community centered on MUCMD is rising now, with attendant workshops, conferences, etc… to be expected.
Many Machine Learning related events are coming up this fall.
I enjoyed attending NIPS this year, with several things interesting me. For the conference itself:
I also attended two workshops, Coarse-To-Fine and LCCC, which were a fine combination. The first was about more efficient (and sometimes more effective) methods for learning which start with coarse information and refine, while the second was about parallelization and distribution of learning algorithms. Together, they were about how to learn fast and effective solutions.
The CtF workshop could have been named “Integrating breadth first search and learning”. I was somewhat (I hope not too) pesky, discussing Searn repeatedly during questions, since it seems quite plausible that a good application of Searn would compete with and plausibly improve on results from several of the talks. Eventually, I hope the conventional wisdom shifts to a belief that search and learning must be integrated for efficiency and robustness reasons. The talks in this workshop were uniformly strong in making that case. I was particularly interested in Drew’s talk on a plausible improvement on Searn.
The level of agreement in approaches at the LCCC workshop was much lower, with people discussing many radically different approaches.
I hope we’ll discover convincing answers to these questions in the near future.