Yaser points out some nicely videotaped machine learning lectures at Caltech. Yaser taught me machine learning, and I always found the lectures clear and interesting, so I expect many people can benefit from watching. Relative to Andrew Ng’s ML class there are somewhat different areas of emphasis but the topic is the same, so picking and choosing the union may be helpful.
For those who are remote (like me) or after the conference (like everyone), Mark Reid has set up the ICML discussion site where you can comment on any paper or subscribe to papers. Authors are automatically subscribed to their own papers, so it should be possible to have a discussion significantly after the fact, as people desire.
We also conducted a survey before the conference and have the survey results now. This can be compared with the ICML 2010 survey results. Looking at the comparable questions, we can sometimes order the answers to have scores ranging from 0 to 3 or 0 to 4 with 3 or 4 being best and 0 worst, then compute the average difference between 2012 and 2010.
Glancing through them, I see:
- Most people found the papers they reviewed a good fit for their expertise (-.037 w.r.t. 2010). Achieving this was one of our subgoals in the pursuit of high quality decisions.
- Most people had sufficient time for doing reviews. This was something that we worried about significantly in shifting the paper deadline and otherwise massaging the schedule. Most people also thought the review period was sufficiently long and most reviews were high quality (+.023 w.r.t. 2010).
- About 1/4 of reviewers say that author response changed their mind on a paper and 2/3 of reviewers say discussion changed their mind on a paper. The expectation of decision impact from author response is reduced from 2010 (-.135). The existence of author response is overwhelmingly preferred.
- People generally found ICML reviewing the same or better than previous ICMLs (+.35 w.r.t. 2010) and other similar conferences (+.198 w.r.t. 2010) at the cost of being somewhat more work. A substantial bump in reviewing quality was a primary goal.
- The ACs spent substantially more time (43 hours on average) than PC members (28 hours on average). This agrees with our expectation—the set of ACs didn’t change even after we had a 50% increase in submissions. The AC load we had this year was probably too high and will need to be reduced somewhat for next year.
- 2/3 of authors prefer the option to revise a paper during author response.
- The choice of how to deal with increased submissions is deeply undecided, with a slight preference for short talk+poster as we did.
- Most people like having two workshop days or don’t care.
- There is a strong preference for COLT and UAI colocation with the next tier of preference for IJCAI, KDD, AAAI, and CVPR.
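The score differences quoted in the list above come from mapping ordered answers to integer scores and differencing the yearly averages. A minimal sketch of that arithmetic follows; the answer labels and response counts are made up for illustration and are not the actual survey data.

```python
def mean_score(counts, ordered_answers):
    """Average score when ordered_answers[i] is worth i points (0 = worst)."""
    total = sum(counts.get(a, 0) for a in ordered_answers)
    weighted = sum(i * counts.get(a, 0) for i, a in enumerate(ordered_answers))
    return weighted / total

# Hypothetical answer scale and response counts, for illustration only.
answers = ["poor", "fair", "good", "excellent"]  # scores 0..3
counts_2010 = {"poor": 10, "fair": 30, "good": 40, "excellent": 20}
counts_2012 = {"poor": 5, "fair": 25, "good": 45, "excellent": 25}

# Positive means 2012 scored better than 2010 on this question.
difference = mean_score(counts_2012, answers) - mean_score(counts_2010, answers)
print(round(difference, 3))
```

A difference computed this way is only meaningful for questions whose answer scale was the same in both years.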
We had advance warning from Prabhakar through the simple act of leaving. Yahoo! Research was a world class organization that Prabhakar recruited much of personally, so it is deeply implausible that he would spontaneously decide to leave. My first thought when I saw the news was “Uhoh, Rob said that he knew it was serious when the head of ATnT Research left.” In this case it was even more significant, because Prabhakar recruited me on the premise that Y!R was an experiment in how research should be done: via a combination of high quality people and high engagement with the company. Prabhakar’s departure is a clear end to that experiment.
The result is ambiguous from a business perspective. Y!R clearly was not capable of saving the company from its illnesses. I’m not privy to the internal accounting of impact and this is the kind of subject where there can easily be great disagreement. Even so, there were several strong direct impacts coming from the machine learning, economics, and algorithms groups.
Y!R clearly was excellent from an academic research perspective. On a per person basis in relevant subjects, it was outstanding. One way to measure this is by noticing that both ICML and KDD had (co)program chairs from Y!R. It turns out that talking to the rest of the organization, doing consulting, architecting, and prototyping on a minority basis, helps research by sharpening the questions you ask more than it hinders by taking up time. The decision to participate in this experiment was a good one for me personally.
It has been clear in Silicon Valley, academia, and pretty much everywhere else that people at Yahoo!, including Yahoo! Research, have been looking around for new positions. Maintaining the excellence of Y!R in a company that has been under prolonged stress was challenging leadership-wise. Consequently, the abrupt departure of Prabhakar and an apparent lack of appreciation by the new CEO created a crisis of confidence. Many people who were sitting on strong offers quickly left, and everyone else started looking around.
In this situation, my first concern was for colleagues, both in Machine Learning across the company and the Yahoo! Research New York office.
Machine Learning turns out to be a very hot technology. Every company and government in the world is drowning in data, and Machine Learning is the prime tool for actually using it to do interesting things. More generally, the demand for high quality seasoned machine learning researchers across startups, mature companies, government labs, and academia has been astonishing, and I expect the outcome to reflect that. This is remarkably different from the cuts that hit ATnT research in late 2001 and early 2002 where the famous machine learning group there took many months to disperse to new positions.
In the New York office, we investigated many possibilities hard enough that it became a news story. While that article is wrong in specifics (we ended up not fired for example, although it is difficult to discern cause and effect), we certainly shook the job tree very hard to see what would fall out. To my surprise, amongst all the companies we investigated, Microsoft had a uniquely sufficient agility, breadth of interest, and technical culture, enabling them to make offers that I and a significant fraction of the Y!R-NY lab could not resist. My belief is that the new Microsoft Research New York City lab will become an even greater techhouse than Y!R-NY. At a personal level, it is deeply flattering that they have chosen to create a lab for us on short notice. I will certainly do my part chasing the greatest learning algorithms not yet invented.
In light of this, I would encourage people in academia to consider Yahoo! in as fair a light as possible in the current circumstances. There are and will be some serious hard feelings about the outcome as various top researchers elsewhere in the organization feel compelled to look for jobs and leave. However, Yahoo! took a real gamble supporting a research organization about 7 years ago, and many positive things have come of this gamble from all perspectives. I expect almost all of the people leaving to eventually do quite well, and often even better.
What about ICML? My second thought on hearing about Prabhakar’s departure was “I really need to finish up initial paper/reviewer assignments today before dealing with this”. During the reviewing period where the program chair load is relatively light, Joelle handled nearly everything. My great distraction ended neatly in time to help with decisions at ICML. I considered all possibilities in accepting the job and was prepared to simply put aside a job search for some time if necessary, but the timing was surreally perfect. All signs so far point towards this ICML being an exceptional ICML, and I plan to do everything that I can to make that happen. The early registration deadline is May 13.
What about Vowpal Wabbit? Amongst other things, VW is the ultrascale learning algorithm, not the kind of thing that you would want to put aside lightly. I negotiated to continue the project and succeeded. This surprised me greatly—Microsoft has made serious commitments to supporting open source in various ways and that commitment is what sealed the deal for me. In return, I would like to see Microsoft always at or beyond the cutting edge in machine learning technology.
This is a rather long post, detailing the ICML 2012 review process. The goal is to make the process more transparent, help authors understand how we came to a decision, and discuss the strengths and weaknesses of this process for future conference organizers.
Microsoft’s Conference Management Toolkit (CMT)
We chose to use CMT over other conference management software mainly because of its rich toolkit. The interface is sub-optimal (to say the least!) but it has extensive capabilities (to handle bids, author response, resubmissions, etc.), good import/export mechanisms (to process the data elsewhere), and excellent technical support (to answer late night emails and add new functionality). Overall, it was the right choice, although we hope a designer will look at that interface sometime soon!
Toronto Matching System (TMS)
TMS is now being used by many major conferences in our field (including NIPS and UAI). It is an automated system (developed by Laurent Charlin and Rich Zemel at U. Toronto) to match reviewers to papers, based on an analysis of each reviewer’s publications. TMS collects publications from reviewers, parses them into features and applies unsupervised or supervised learning techniques to predict the relevance of any target paper for any reviewer. We convinced TMS to integrate with CMT and funded Laurent’s work for that. Reviewers were asked to put in a publication list for TMS to parse. For those who failed to do so (after many reminders!), we manually added that information from public sources.
The Program Committee
Recruiting a program committee that is both large and highly qualified is difficult these days. We sent out 69 area chair invitations; 50 (highly qualified!) people accepted. Each of these area chairs was asked to nominate a list of potential reviewers. We sent out approximately 700 invitations for program committee members; 389 accepted. A number of additional PC members were recruited during the review process (most of them for 1-2 papers), for a total of 470 active PC members. In terms of seniority, the final PC contains about 15% students, 80% researchers, and 5% other.
The Surge (ICML + 50%)
The first big challenge came on the submission deadline. In the past few years, ICML had consistently received ~550-600 submissions. This year, we had a 50% increase, to 890 submissions. We had recruited a PC that could comfortably handle 700 papers. Dealing with an extra 200 papers was not an easy task.
About 10 submissions were rejected without review for various reasons (severe formatting issues, extra pages, non-anonymization).
An unsupervised version of TMS was used to generate a list of candidate papers for each reviewer and area chair. This was done working closely with Laurent Charlin of TMS, using validation on previous NIPS data. CMT did not have the functionality to show a good list of candidate papers to reviewers, so we crafted an interface to show this list and let reviewers use that in conjunction with CMT. Ideally, this will be better incorporated in CMT in the future.
When you ask a group of scientists to run a conference, you must expect a few experiments will take place. And so we decided to assess the usefulness of TMS scoring for generating lists of papers to bid on. To do this, we (randomly) assigned PC members to 1 of 3 groups. One group saw a list purely based on TMS scores. Another group received a list based on the matching between their subject area and that of the paper (referred to as the “relevance” score in CMT). The third group received a list based on a mix of both TMS and relevance. Reviewers were allowed to bid on any paper (excluding those with which they had a conflict); the lists were provided to help them efficiently sort through the large number of papers. We then compared each reviewer’s set of bids with the list of suggestions and measured the correspondence.
The following is the Discounted Cumulative Gain (DCG) of each list with respect to the bidding scores, averaged separately for each group. Note that each group was only presented with their corresponding list and not the others.
| Ranking | Group: CMT | Group: TMS | Group: CMT+TMS |
|---|---|---|---|
| Sorting by CMT scores | 6.11 out of 12.64 (48%) | 4.98 out of 13.63 (36%) | 4.87 out of 13.55 (35%) |
| Sorting by TMS score | 4.06 out of 12.64 (32%) | 6.43 out of 13.63 (47%) | 5.72 out of 13.55 (42%) |
| Sorting by TMS+CMT | 4.77 out of 12.64 (37%) | 6.11 out of 13.63 (44%) | 6.71 out of 13.55 (49%) |
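The exact DCG variant behind these numbers is not stated, so the following is only a sketch under the common convention of a log2 rank discount. The paper ids and bid values are hypothetical; the "out of" figure is taken here to be the best DCG any ordering could reach for that reviewer's bids.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with the usual log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def dcg_of_list(suggested, bids):
    """DCG of a suggested paper list, scoring each paper by the reviewer's bid
    (papers the reviewer did not bid on count as 0)."""
    return dcg([bids.get(paper, 0) for paper in suggested])

def ideal_dcg(bids, k):
    """Best achievable DCG for a list of length k: highest bids first."""
    return dcg(sorted(bids.values(), reverse=True)[:k])

# Hypothetical reviewer: bid values and a suggested list of three papers.
bids = {"p1": 3, "p2": 2, "p3": 1}
suggested = ["p4", "p1", "p3"]

achieved = dcg_of_list(suggested, bids)
best = ideal_dcg(bids, k=len(suggested))
print(f"{achieved:.2f} out of {best:.2f}")
```

Averaging the achieved and ideal values over the reviewers in a group would give one cell of the table.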
A micro-survey was also run to collect further information on how users liked their short list. 85% of the participants indicated that they had used the list interface provided to them. The following is the preference indicated by each group (~75 reviewers in each group, ~2% error):
| Preferred CMT over list | 15% | 12% | 8% |
| Preferred list over CMT | 4% | 5% | 9% |
It is obvious from the above that most participants found the list useful in conjunction with CMT (suggesting that the list should be integrated inside CMT). We can also see that those who were presented with a list based on TMS scores were more likely to find the list useful.
Note that all of the above was done in a long hectic but fun weekend.
Imputing Missing Bids
CMT assumes that the reviewers are not willing to review a paper unless stated otherwise. It does not differentiate between an unseen (but potentially relevant) paper and a paper that has been seen and ignored. This is a real shortcoming when it comes to matching papers to reviewers, especially for those reviewers that did not bid often. To mitigate this problem, we used the click information on the shortlist presented to the reviewers to find out which papers had been observed and ignored. We then imputed these cases as real non-willing bids.
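The click-based imputation can be pictured as follows. This is a minimal sketch: the data layout (a dict of explicit bids and a per-reviewer set of viewed papers) is an assumption made for illustration, not the format CMT or our shortlist interface actually used.

```python
def impute_seen_but_ignored(bids, viewed):
    """Turn papers that were viewed but not bid on into explicit zero bids.

    bids:   {(reviewer, paper): bid value} for explicit bids only
    viewed: {reviewer: set of papers the reviewer clicked on in the shortlist}
    """
    imputed = dict(bids)
    for reviewer, papers in viewed.items():
        for paper in papers:
            # Keep any explicit bid; only fill in the missing entries.
            imputed.setdefault((reviewer, paper), 0)
    return imputed

# Hypothetical data: r1 bid on p1, and also looked at p2 without bidding.
bids = {("r1", "p1"): 3}
viewed = {"r1": {"p1", "p2"}}
imputed = impute_seen_but_ignored(bids, viewed)
print(imputed)
```

Papers the reviewer never saw stay absent from the result, which keeps "unseen" distinct from "seen and ignored".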
Around 30 reviewers did not provide any bids (and many had only a few). This is problematic because the tools used to do the actual reviewer-paper matching tend to assign the papers without any bids to the reviewers who did not bid, regardless of the match in expertise.
Once the bidding information was in and imputation was done, we now had to fill in the rest of the paper-reviewer bidding matrix to mitigate the problem with sparse bidders. This was done, once again, through TMS, but this time using a supervised learning approach.
Using supervised learning was more delicate than expected. To deal with the wildly varying number of bids per person, we imputed zero bids, first from papers that were plausibly skipped over, and if necessary at random from papers not bid on, such that each person had the same expected bid in the dataset. From this dataset, we held out a random bid per person, and then trained to predict the held-out bid well. Most optimization approaches performed poorly due to the number of features greatly exceeding the number of labels. The best approach we found used the online algorithms in Vowpal Wabbit with a mass personalized training method similar to the one discussed here. This trained predictor was used to predict bid values for the full paper-reviewer bid matrix.
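The dataset preparation described above might look roughly like this. It is one reading of the procedure, with assumed data structures and a hypothetical target mean bid; the Vowpal Wabbit training step itself is not shown.

```python
import math
import random

def zeros_needed(bid_values, target_mean):
    """How many zero bids to add so the mean bid drops to at most target_mean."""
    return max(0, math.ceil(sum(bid_values) / target_mean) - len(bid_values))

def pad_with_zeros(bids, skipped, unbid, target_mean, rng):
    """Pad one reviewer's bids ({paper: value}) with zero bids.

    Zeros come first from papers plausibly skipped over, then at random
    from the remaining papers the reviewer did not bid on.
    """
    need = zeros_needed(list(bids.values()), target_mean)
    chosen = list(skipped)[:need]
    remaining = need - len(chosen)
    if remaining > 0:
        pool = [p for p in unbid if p not in bids and p not in chosen]
        chosen += rng.sample(pool, min(remaining, len(pool)))
    padded = dict(bids)
    for paper in chosen:
        padded[paper] = 0
    return padded

def hold_out_one(bids, rng):
    """Split one reviewer's bids into training bids and one held-out bid."""
    paper = rng.choice(sorted(bids))
    train = {p: v for p, v in bids.items() if p != paper}
    return train, (paper, bids[paper])

rng = random.Random(0)
# Hypothetical reviewer with two real bids and a target mean bid of 1.0.
padded = pad_with_zeros({"a": 3, "b": 2}, skipped=["c"],
                        unbid=["d", "e", "f"], target_mean=1.0, rng=rng)
train, held_out = hold_out_one(padded, rng)
print(len(padded), len(train), held_out)
```

Equalizing the mean bid per person keeps prolific bidders from dominating what the predictor learns.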
Automated Area Chair and First Reviewer Assignment
Once we had the imputed paper-reviewer bidding matrix, CMT was used to generate the actual match between papers and area chairs, and (separately) between papers and reviewers. Each paper had two area chairs (sometimes called “meta-reviewers” in CMT) assigned to it, one primary, one secondary, by running two rounds of assignments (so that the primary was usually the “better” match). One reviewer per paper was also assigned automatically by CMT in a similar fashion. CMT provides proper load balancing, so that all area chairs and reviewers had similar loads.
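CMT's matching algorithm is not described here, so the sketch below uses a simple greedy stand-in only to illustrate the two-round idea: the first round picks the primary area chair, and the second round picks a different secondary chair under the same load cap. All names, scores, and the cap are hypothetical.

```python
def assign_round(scores, papers, chairs, max_load, load, taken):
    """Greedy stand-in for one assignment round (not CMT's actual algorithm).

    For each paper, pick the highest-scoring chair who still has capacity
    and is not already assigned to that paper.
    """
    assignment = {}
    for paper in papers:
        eligible = [c for c in chairs
                    if load[c] < max_load and c not in taken.get(paper, set())]
        if not eligible:
            raise ValueError(f"no area chair with spare capacity for {paper}")
        best = max(eligible, key=lambda c: scores.get((c, paper), 0.0))
        assignment[paper] = best
        load[best] += 1
        taken.setdefault(paper, set()).add(best)
    return assignment

# Hypothetical bid-matrix scores for three chairs and two papers.
scores = {("A", "p1"): 0.9, ("B", "p1"): 0.5, ("C", "p1"): 0.1,
          ("A", "p2"): 0.8, ("B", "p2"): 0.7, ("C", "p2"): 0.6}
papers, chairs = ["p1", "p2"], ["A", "B", "C"]
load = {c: 0 for c in chairs}
taken = {}

primary = assign_round(scores, papers, chairs, 2, load, taken)
secondary = assign_round(scores, papers, chairs, 2, load, taken)
print(primary, secondary)
```

Because the first round runs before anyone is loaded, the primary chair is usually the better match, which is the effect described above.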
Manual Checks of the Automated Assignments
Before finalizing the automated assignment, we manually looked through the list of papers to fix any potential problems that were not handled by the automated process. The two major cases were papers that did not go through the TMS system (authors did not agree to do so), and cases of poor primary-secondary meta-reviewer pairs (when the two area chairs are judged to be too close to offer independent assessment, e.g. working at the same institution, previous supervisor-student relationship).
Second and Third Reviewer Assignment
Once the initial assignments were announced, we asked the two area chairs for a given paper to each manually assign another reviewer from the PC. To help area chairs with this, we generated a shortlist of 10 recommended reviewers for each paper (using the estimated bid matrix and TMS score, with the CMT matching algorithm for load balancing of reviewer suggestions). Area chairs were free to use this list, select from the complete program committee, or seek an outside reviewer who was then added to the PC, an option used 80 times. The load for each reviewer was restricted to at most 7 papers, with exceptions when they agreed explicitly to more.
Most papers received at least 3 full reviews in the first round. Reviewers could not see each other’s reviews until they submitted their own. ML-Journaled submissions (see double submission guide) were reviewed only by two area chairs. In a small number of regular submissions (less than 10), we received 2 very negative reviews and notified the third reviewer (who was usually late by this point!) that we would not need their review.
Authors were given a chance to respond to the reviews during a short feedback period. This is becoming a standard practice in machine learning conferences. Authors were also allowed to upload a new version of the paper. The motivation here is that in some cases, it is easier to show the changes directly in the paper, rather than discuss them separately.
Our analysis shows that authors’ responses and subsequent discussions by reviewers made significant changes to the scoring of papers. A total of ~35% of the papers had some change in their scores after the author feedback. The average score for ~50% of the papers went down, stayed the same for ~10%, and went up for the other ~40%. The variance on the scores decreased by ~20%, indicating some convergence in the decisions.
To help us better decide on the quality of the papers, we asked the primary area chairs to provide a meta-review for each of their papers. For papers without unanimous review decisions (i.e. some reviews wanted to accept and some wanted to reject), we asked the secondary area chair to (independently) fill in a meta-review, recommending whether to accept or reject the paper. A total of 1214 meta-reviews were provided. There were also 20 papers for which a 4th review was added in this period.
In all cases where the primary and secondary area chairs disagreed on the decision, the program chairs were directly involved, reviewing all the evidence (reviews, rebuttal, discussion, often the paper itself), and entering into a discussion (usually via email) with the area chairs, until a unanimous decision was achieved.
A total of 243 papers (27% of submissions) were accepted. Author notifications were sent out on April 30.
May 16 in Cambridge is the New England Machine Learning Day, a first regional workshop/symposium on machine learning. To present a poster, submit an abstract by May 5.
as of last night, late.
When the reviewing deadline passed Wednesday night, 15% of reviews were still missing, much higher than I expected. Between late reviews coming in, ACs working overtime through the weekend, and people willing to help in a pinch, another ~390 reviews came in, reducing the missing mass to 0.2%. Nailing that last bit and a similar quantity of papers with uniformly low confidence reviews is what remains to be done in terms of basic reviews. We are trying to make all of those happen this week so authors have some chance to respond.
I was surprised by the quantity of late reviews, and I think that’s an area where ICML needs to improve in future years. Good reviews are not done in a rush—they are done by setting aside time (like an afternoon), and carefully reading the paper while thinking about implications. Many reviewers do this well but a significant minority aren’t good at scheduling their personal time. In this situation there are several ways to fail:
- Give early warning and bail.
- Give no warning and finish not-too-late.
- Give no warning and don’t finish.
The worst failure mode by far is the last one for Program Chairs and Area Chairs, because they must catch and fix all the failures at the last minute. I expect the second failure mode also impacts the quality of reviews because high speed reviewing of a deep paper often doesn’t work. This issue is one of community norms which can only be adjusted slowly. To do this, we’re going to pass a flake list for failure mode 3 to future program chairs who will hopefully further encourage people to schedule time well and review carefully.
If my experience is any guide, plenty of authors will feel disappointed by the reviews. Part of this is simply because it’s the first time the authors have had contact with people not biased towards agreeing with them, as almost all friends are. Part of this is the significant hurdle of communicating technical new things well. Part may be too-hasty reviews, as discussed above. And part of it may be that the authors simply are far more expert in their subject than reviewers.
In author responses, my personal tendency is to be blunter than most people when reviewers make errors. Perhaps “kind but clear” is a good viewpoint. You should be sympathetic to reviewers who have voluntarily put significant time into reviewing your paper, but you should also use the channel to communicate real information. Remotivating your paper almost never works, so concentrate on getting across errors in understanding by reviewers or answering their direct questions.
We did not include reviewer scores in author feedback, although we do plan to include them when the decision is made. Scores should not be regarded as final by any party, since author feedback and discussion can significantly alter a reviewer’s understanding of the paper. Encouraging reviewers to incorporate this additional information well before settling on a final score is one of my goals.
We did allow resubmission of the paper with the author response, similar to what Geoff Gordon did as program chair for AIStat. This solves two problems: It helps authors create a more polished draft, and it avoids forcing an overly constrained channel in the communication. If an equation has a bug, you can write it out bug free in mathematical notation rather than trying to describe by reference how to alter the equation in author response.
Please comment if you have further thoughts.
has died. He lived a full life. I know him personally as a founder of the Center for Computational Learning Systems and the New York Machine Learning Symposium, both of which have sheltered and promoted the advancement of machine learning. I expect much of the New York area machine learning community will miss him, as well as many others around the world.
Sasha is the open problems chair for both COLT and ICML. Open problems will be presented in a joint session in the evening of the COLT/ICML overlap day. COLT has a history of open sessions, but this is new for ICML. If you have a difficult theoretically definable problem in machine learning, consider submitting it for review, due March 16. You’ll benefit three ways:
- The effort of writing down a precise formulation of what you want often helps you understand the nature of the problem.
- Your problem will be officially published and citable.
- You might have it solved by some very intelligent bored people.
The general idea could easily be applied to any problem which can be crisply stated with an easily verifiable solution, and we may consider expanding this in later years, but for this year all problems need to be of a theoretical variety.
Joelle and I (and Mahdi, and Laurent) finished an initial assignment of Program Committee and Area Chairs to papers. We’ll be updating instructions for the PC and ACs as we field questions. Feel free to comment here on things of plausible general interest, but email us directly with specific concerns.
For graduate students, the Yahoo! Key Scientific Challenges program, which includes machine learning, is on again, with applications due March 9. The application is easy and the $5K award is high quality “no strings attached” funding. Consider submitting.
The ICML paper deadline has passed. Joelle and I were surprised to see the number of submissions jump from last year by about 50% to around 900 submissions. A tiny portion of these are immediate rejects(*), so this is a much larger set of papers than expected. The number of workshop submissions also doubled compared to last year, so ICML may grow significantly this year, if we can manage to handle the load well. The prospect of making 900 good decisions is fundamentally daunting, and success will rely heavily on the program committee and area chairs at this point.
For those who want to rubberneck a bit more, here’s a breakdown of submissions by primary topic of submitted papers:
- 66 Reinforcement Learning
- 52 Supervised Learning
- 51 Clustering
- 46 Kernel Methods
- 40 Optimization Algorithms
- 39 Feature Selection and Dimensionality Reduction
- 33 Learning Theory
- 33 Graphical Models
- 33 Applications
- 29 Probabilistic Models
- 29 NN & Deep Learning
- 26 Transfer and Multi-Task Learning
- 25 Online Learning
- 25 Active Learning
- 22 Semi-Supervised Learning
- 20 Statistical Methods
- 20 Sparsity and Compressed Sensing
- 19 Ensemble Methods
- 18 Structured Output Prediction
- 18 Recommendation and Matrix Factorization
- 18 Latent-Variable Models and Topic Models
- 17 Graph-Based Learning Methods
- 16 Nonparametric Bayesian Inference
- 15 Unsupervised Learning and Outlier Detection
- 12 Gaussian Processes
- 11 Ranking and Preference Learning
- 11 Large-Scale Learning
- 9 Vision
- 9 Social Network Analysis
- 9 Multi-agent & Cooperative Learning
- 9 Manifold Learning
- 8 Time-Series Analysis
- 8 Large-Margin Methods
- 8 Cost Sensitive Learning
- 7 Recommender Systems
- 7 Privacy, Anonymity, and Security
- 7 Neural Networks
- 7 Empirical Insights into ML
- 7 Bioinformatics
- 6 Information Retrieval
- 6 Evaluation Methodology
- <5 each: Text Mining, Rule and Decision Tree Learning, Graph Mining, Planning & Control, Monte Carlo Methods, Inductive Logic Programming & Relational Learning, Causal Inference, Statistical and Relational Learning, NLP, Hidden Markov Models, Game Theory, Robotics, POMDPs, Geometric Approaches, Game Playing, Data Streams, Pattern Mining & Inductive Querying, Meta-Learning, Evolutionary Computation
(*) Deadlines are magical, because they galvanize groups of people to concentrated action. But, they have to be real deadlines to achieve this, which leads us to reject late submissions & format failures to keep the deadline real for future ICMLs. This is uncomfortably rough at times.
The From Data to Knowledge workshop May 7-11 at Berkeley should be of interest to the many people encountering streaming data in different disciplines. It’s run by a group of astronomers who encounter streaming data all the time. I met Josh Bloom recently and he is broadly interested in a workshop covering all aspects of Machine Learning on streaming data. The hope here is that techniques developed in one area turn out useful in another, which seems quite plausible. Particularly if you are in the Bay Area, consider checking it out.
It also seems worthwhile to give some sense of the scope and reviewing criteria for ICML for authors considering submitting papers. At ICML, the (very large) program committee does the reviewing which informs final decisions by area chairs on most papers. Program chairs set up the process, deal with exceptions or disagreements, and provide advice for the reviewing process. Providing advice is tricky (and easily misleading) because a conference is a community, and in the end the aggregate interests of the community determine the conference. Nevertheless, as a program chair this year it seems worthwhile to state the overall philosophy I have and what I plan to encourage (and occasionally discourage).
At the highest level, I believe ICML exists to further research into machine learning, which I generally think of as turning observations into useful predictions. Research is greatly varied in general, but in all cases it involves answering an interesting question for which the answer was not previously known. Interesting questions are generally natural: they can be stated easily and other people plausibly encounter them. Interesting questions are generally also ones for which there are multiple plausible wrong answers. The definition of “interesting” is otherwise hard to pin down, because it does and must change over time.
ICML is a broad conference which incorporates the interests of many different groups of people with different tastes in the research they prefer. It’s broad enough that most people don’t appreciate all the papers. That’s ok as long as there is some higher level appreciation for which directions of research benefit the community. Some common flavors are:
- ML for X: In general, Machine Learning is a core field of study with many applications. Often, it’s a good idea to publish within a conference focused on that area, but particularly when no such conference exists, ICML is a solid choice for a place to publish. One example of this kind of thing is Machine Learning for Sustainability, where the CCC will be giving a few travel grants. Here the core question is typically “How?” Exhibiting new things that you can do with ML provides good reference points for what is possible and a sense of what works, and compelling new ideas about what to work on can be valuable to the community.
There are several ways that papers of this sort can bounce. Perhaps X is insufficiently interesting, the results are unconvincing, or the method of solution is considered too straight-forward. I consider the first and second criteria sound, but am inclined toward leniency on the third, since there is often quite a bit of work in figuring out how to frame the problem so that the solution happens to be easy.
- New Algorithms: Often, authors find that existing learning algorithms for solving some problem are lacking in some way, so they propose new better algorithms. This is plausibly the most common category of paper at ICML, so there is quite a bit of variety. The most straight-forward version proposes a new algorithm for a well-studied problem. For these papers it’s important to have an empirical comparison to existing baselines.
It’s easy for an empirical comparison to go wrong. Some authors use synthetic datasets which do not seem significant to me, because good results on such datasets may not transfer to real-world problems well as the real world tends to be quite a bit more complex than the synthetic processes which are natural to program. Instead, it’s important to show good results on real datasets. One problem with relying on real datasets is dataset selection—choosing the dataset for which your algorithm seems to perform best. You can avoid this by choosing datasets in some clearly unbiased manner and by evaluating on many standard datasets. Another way to fail is with a poor choice of baseline. This is tricky, because three reviewers might consider three different baselines the most natural one. Asking around a bit when developing the paper might help here, but in the end this can be a tough judgement call: Is the paper convincing enough that people interested in solving the problem should use this algorithm?
Another class of new algorithms papers is new algorithms for new areas of machine learning, blending into the previous category. Here, there are typically few datasets available (perhaps just one) and there may be no (or only implausibly bad) baselines. For papers like this, one way I’ve seen difficulties is when authors are very invested in a particular approach to solving the problem. If you have defined the problem too narrowly, broadening the definition of the problem can help you see appropriate baselines. Another difficulty I’ve observed is that reviewers used to the well-studied problems reject an interesting paper because (essentially) they assume that the authors left out a good baseline which does not exist. To prevent the first, authors who ask around might get some valuable early feedback. For the second, it’s a difficulty we are aware of and will consider asking reviewers to judge on the merits of ML for X.
- Algorithmic studies: A relatively rare but potentially valuable form of paper is an algorithmic study. Here, the authors do not propose a new algorithm, but instead do a comprehensive empirical comparison of different algorithms. The standards here are quite high: the empirical comparison needs to be first-class to convince people, so the empirical comparison comments under new algorithms apply strongly.
- New Theory: Good theory can enlighten us about what is (or might be) possible. It can also help us build robust learning algorithms, where we design learning algorithms so that they provably solve some large class of problems. I am personally most interested in theory that helps us design new learning algorithms, but broadly interested in what is possible. I’m most interested in the question answered, while the means (and language) should only be as complex as necessary so the theory can be understood as widely as possible.
In many areas of CS theory, double blind reviewing is rare, so theory-oriented people may be unfamiliar with it. An important consequence is that complete proofs must be included either in the paper or supplemental material so that proof checking is fully feasible.
Another way that I’ve seen theory papers run into trouble is when it is a post-hoc justification for an algorithm. In essence, authors who choose to analyze an existing algorithm are sometimes forced to make many unnatural assumptions for the theory to be correct. There generally isn’t an easy fix if you arrive at this point.
- n of the above It is common for ICML papers to be multicategory. At the extreme, you might have a new algorithm which solves a new X well, empirically and theoretically. Reviewers can fall into a trap where they are most interested in 1 of the 4 questions answered above, and find 1/4 of the paper devoted to their question relatively weak compared to the paper that devotes all the pages to the same question.
We are aware of this, and will encourage it to be taken into account.
- The exception The set of papers I expect to see at ICML is more diverse than the above—there are often exceptions of one sort or another. For these exceptions, it often becomes a judgment call: Does this paper significantly further research into machine learning? Papers with little potential audience probably don’t while fun/interesting/useful things that we didn’t think of do.
Further comments or questions are welcome.
Following John’s advertisement for submitting to ICML, we thought it appropriate to highlight the advantages of COLT, and the reasons it is often the best place for theory papers. We would like to emphasize that we both respect ICML, and are active in ICML, both as authors and as area chairs, and certainly are not arguing that ICML is a bad place for your papers. For many papers, ICML is the best venue. But for many theory papers, COLT is a better and more appropriate place.
Why should you submit to COLT?
By-and-large, theory papers go to COLT. This is the tradition of the field and most theory papers are sent to COLT. This is the place to present your ground-breaking theorems and new models that will shape the theory of machine learning. COLT is more focused than ICML with a single track session. Unlike ICML, the norm in COLT is for people to sit through most sessions, and hear most of the talks presented. There is also often a lively discussion following paper presentations. If you want theory people to know of your work, you should submit to COLT.
Additionally, this year COLT and ICML are tightly co-located, with joint plenary sessions (i.e. some COLT papers will be presented in a plenary session to the entire combined COLT/ICML audience, as will some ICML papers), and many other opportunities for exposure to the wider ICML audience. And so, by submitting to COLT, you have the potential of reaching both the captive theory audience at COLT and the wider ML audience at ICML.
The advantages of sending to COLT:
- Rigorous review process.
The COLT program committee is comprised entirely of established, mostly fairly senior, researchers. Program committee members read and review papers themselves, or potentially use a sub-reviewer that they know personally and carefully select for the paper, but still check and maintain responsibility for the review. Your paper will get reviewed by at least three program committee members, who will likely be experts on the topics covered by the paper. This is in contrast to ICML (and most other ML conferences) where area chairs (of similar seniority to the COLT program committee) only manage the review process, but reviewers are assigned based on load-balancing considerations and the primary reviewing is done by a very wide set of reviewers, frequently students, who are often not the most relevant experts.
COLT reviews are typically detailed and technical details are checked. The reviewing process is less rushed and program committee members (and sub-reviewers where appropriate) are expected to do a careful job on each and every paper.
All papers are then discussed by the program committee, and there are generally significant and meaningful discussions on papers. This also means the COLT reviewing process is far from having a “single point of failure”, as the paper will be carefully considered and argued for by multiple (senior) program committee members. We believe this yields a more consistently high quality program, with much less randomness in the paper selection process, which in turn translates to high respect for accepted COLT papers.
- COLT is not double blind, but also not exactly single blind. Program committee members have access to the author identities (as do area chairs in ICML), as this is essential in order to select sub-reviewers. However, the author names do not appear on the papers, both in order to reduce the effect of first impressions, and to allow program committee members to utilize reviewers who are truly blind to the authors’ identities.
It should be noted that the COLT anonymization guidelines are a bit more relaxed, which we hope makes it easier to create an anonymized version for conference submission (authors are still allowed, and even encouraged, to post their papers online, with their names on them of course).
- COLT does not have a dedicated rebuttal phase. Frankly, with the higher quality, less random, reviews, we feel it is not needed, and the hassle to authors and program committee members is not worth it. However, the tradition in COLT, which we plan to follow, is to contact authors as needed during the review and discussion process to ask for clarification on issues that came up during review. In particular, if a concern is raised on the soundness or other technical aspect of a paper, the authors will be contacted to give them a chance to set things straight. But no, there is no generic author response where authors can argue and plead for acceptance.
Here’s a quick reference for summer ML-related conferences sorted by due date:
| Conference | Due date | Dates and location | Reviewing |
|---|---|---|---|
| KDD | Feb 10 | August 12-16, Beijing, China | Single Blind |
| COLT | Feb 14 | June 25-June 27, Edinburgh, Scotland | Single Blind? (historically) |
| ICML | Feb 24 | June 26-July 1, Edinburgh, Scotland | Double Blind, author response, zero SPOF |
| UAI | March 30 | August 15-17, Catalina Islands, California | Double Blind, author response |
Geographically, this is greatly dispersed and the UAI/KDD conflict is unfortunate.
Machine Learning conferences are triannual now, between NIPS, AIStat, and ICML. This has not always been the case: the academic default is annual summer conferences, then NIPS started with a December conference, and now AIStat has grown into an April conference.
However, the first claim is not quite correct. NIPS and AIStat have few competing venues while ICML implicitly competes with many other conferences accepting machine learning related papers. Since Joelle and I are taking a turn as program chairs this year, I want to make explicit the case for ICML.
- COLT was historically a conference for learning-interested Computer Science theory people. Every COLT paper has a theorem, and few have experimental results. A significant subset of COLT papers could easily be published at ICML instead. ICML now has a significant theory community, including many pure theory papers and significant overlap with COLT attendees. Good candidates for an ICML submission are learning theory papers motivated by real machine learning problems (example: the agnostic active learning paper) or which propose and analyze new plausibly useful algorithms (example: the adaptive gradient papers). If you find yourself tempted to add empirical experiments to prove the point that your theory really works, ICML sounds like an excellent fit. Not everything is a good fit though—papers motivated by definitional aesthetics or tradition (Valiant style PAC learning comes to mind) may not be appreciated.
There are two significant advantages to ICML over COLT. One is that ICML provides a potentially much larger audience which appreciates and uses your work. That’s substantially less relevant this year, because ICML and COLT are colocating and we are carefully designing joint sessions for the overlap day.
The other is that ICML is committed to fair reviewing—papers are double blind so reviewers are not forced to take into account the author identity. Plenty of people will argue that author names don’t matter to them, but I’ve personally seen several cases as a reviewer where author identity affected the decision, typically towards favoring insiders or bigwigs at theory conferences as common sense would suggest. The double blind aspect of ICML reviewing is an open invitation to outsiders to submit to ICML.
- Many UAI papers could easily go to ICML because they are explicitly about machine learning or connections with machine learning. For example, pure prediction markets are a stretch for ICML, but connections between machine learning and prediction markets, which seem to come up in multiple ways, are a good fit. Bernhard’s lab has done quite a bit of work on extracting causality from prediction complexity which could easily interest people at ICML. I’ve personally found some work on representations for learning algorithms, such as sum-product networks, of first class interest. UAI has a definite subcommunity of hardcore Bayesians which is less evident at ICML. ICML as a community seems more pragmatist w.r.t. Bayesian methods: if they work well, that’s good. Of the comparators here, UAI seems the most similar in orientation to ICML to me.
ICML provides a significantly larger potential audience and, due to its size, tends to be more diverse.
- KDD is a large conference (a bit larger than ICML by attendance) which, as I understand it, initially started from the viewpoint of database people trying to do interesting things with the data they had. The conference is generally one step more commercial/industrial than ICML. Significant parts of the academic track are about machine learning technology and could have been submitted to ICML instead. I was impressed by the double robust sampling work and the out of core learning paper is cool. And, I often enjoy the differential privacy in learning work. KDD attendees tend to be very pragmatic about what works, which is reinforced by yearly prediction challenges. I appreciate this viewpoint quite a bit.
KDD doesn’t do double blind review, which was discussed above. To me, a more significant drawback of KDD is the ACM paywall. I was burned by this last summer. We decided to do a large scale learning survey based on the SUML compendium at KDD, but discovered too late that the video would be stuck behind the paywall, unlike our learning with exploration tutorial the year before. As I understand it, the year before ACM made them pay twice: once to videolectures and once to ACM, which was understandably judged unsustainable. The paywall is particularly rough for students who are not well-established, because it substantially limits their potential audience.
This is not a problem at ICML 2012. Every prepared presentation will be videotaped and we will have every paper easily and publicly accessible along with it. The effort you put into the presentation will pay off over hundreds or thousands of additional online views.
- Area conferences. There are many other conferences which I think of as adjacent area conferences, including AAAI, ACL, SIGIR, CVPR and WWW which I have not attended enough or recently enough to make a real comparison with. Nevertheless, in each of these conferences, machine learning is a common technology. And sometimes new forms of machine learning technology are developed. Depending on many circumstances, ICML might be a good candidate for a place to send a paper on a new empirically useful piece of machine learning technology. Or not—the circumstances matter hugely.
Machine Learning has grown radically and gone industrial over the last decade, providing plenty of motivation for a conference on developing new core machine learning technology. Indeed, it is because of the power of ML that so much overlap exists. In most cases, the best place to send a paper is to the conference where it will be most appreciated. But, there is a real sense in which you create the community by participating in it. So, when the choice is unclear, sending the paper to a conference designed simultaneously for fair high quality reviewing and broad distribution of your work is a good call as it provides the most meaningful acceptance. For machine learning, that conference is ICML. Details of the ICML plan this year are here. We are on track.
As always, comments are welcome.
- The cluster parallel learning code better supports multiple simultaneous runs, and other forms of parallelism have been mostly removed. This incidentally significantly simplifies the learning core.
- The online learning algorithms are more general, with support for l1 (via a truncated gradient variant) and l2 regularization, and a generalized form of variable metric learning.
- There is a solid persistent server mode which can train online, as well as serve answers to many simultaneous queries, either in text or binary.
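For readers wondering what l1 regularization via truncated gradient looks like in code, here is a minimal Python sketch of the truncation operator from the truncated gradient paper, applied after a squared-loss update. This is not VW's code: the function names, the dense truncation of every weight at every step, and the default parameters are all assumptions made for illustration.

```python
def truncate(v, alpha, theta):
    """Shrink v toward zero by alpha, but only while |v| <= theta."""
    if 0.0 <= v <= theta:
        return max(0.0, v - alpha)
    if -theta <= v < 0.0:
        return min(0.0, v + alpha)
    return v


def truncated_gradient_step(weights, features, label,
                            lr=0.1, l1=0.01, theta=float("inf")):
    """One squared-loss gradient step, then truncation of every weight.

    weights:  dict feature index -> weight (hypothetical sparse layout)
    features: dict feature index -> value for one example
    """
    prediction = sum(weights.get(i, 0.0) * x for i, x in features.items())
    error = prediction - label
    # Ordinary online gradient step on the features present in the example.
    for i, x in features.items():
        weights[i] = weights.get(i, 0.0) - lr * error * x
    # Truncation pulls small weights to exactly zero, which gives sparsity.
    for i in list(weights):
        weights[i] = truncate(weights[i], lr * l1, theta)
        if weights[i] == 0.0:
            del weights[i]
    return weights
```

A real implementation would likely apply the truncation lazily, only to the features seen in each example, rather than sweeping all weights.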
This should be a very good release if you are just getting started, as we’ve made it compile more automatically out of the box, have several new examples and updated documentation.
- Miro will cover the L-BFGS implementation, which he created from scratch. We have found this works quite well amongst batch learning algorithms.
- Alekh will cover how to do cluster parallel learning. If you have access to a large cluster, VW is orders of magnitude faster than any other public learning system accomplishing linear prediction. And if you are as impatient as I am, it is a real pleasure when the computers can keep up with you.
This will be recorded, so it will hopefully be available for viewing online before too long.
I hope to see you soon

Suppose you have a dataset with 2 terafeatures (we only count nonzero entries in a datamatrix), and want to learn a good linear predictor in a reasonable amount of time. How do you do it? As a learning theorist, the first thing you do is pray that this is too much data for the number of parameters—but that’s not the case, there are around 16 billion examples, 16 million parameters, and people really care about a high quality predictor, so subsampling is not a good strategy.
Alekh visited us last summer, and we had a breakthrough (see here for details), coming up with the first learning algorithm I’ve seen that is provably faster than any future single machine learning algorithm. The proof of this is simple: We can output an optimal-up-to-precision linear predictor faster than the data can be streamed through the network interface of any single machine involved in the computation.
It is necessary but not sufficient to have an effective communication infrastructure. It is necessary but not sufficient to have a decent programming language, because parallel programming is hard. It is necessary but not sufficient to have a good optimization approach. The combination says “yikes”, because you need to know many things to design an effective new system.
- MPI suffers because it has no fault tolerance by default and because it has a poor understanding of where data is, implying that data must be either manually placed on local nodes, or the first step in every computation is “partition the data across the cluster” which is very undesirable from a communication complexity and programming complexity standpoint. These significantly limit the scale that you can work at to ~100 nodes in practice, because the economics of clusters make sharing unavoidable at larger scales. When the cluster is shared, preshuffling the data is awkward to impossible and you must expect that some nodes will run slower than others because they will be executing other jobs. This limitation on reliability kicks in much sooner than disk read failures or node failures.
- MapReduce suffers because the setup and teardown costs are significant. Measured directly, this is often on the order of a minute, associated with interacting with the scheduler and communicating the program to a large number of nodes. But indirectly, this can be radically worse, as any map-reduce job can be held in limbo while waiting for free nodes to work on. And commonly we need to execute many MapReduce iterations to achieve high quality prediction.
MapReduce has another more subtle flaw: using it requires refactoring your code into a sequence of map and reduce operations. This is significantly annoying, because writing good learning algorithms is pretty difficult in the first place. MapReduce has a third flaw: it encourages an inefficient optimization paradigm. In particular, while you can phrase many learning algorithms as statistical query learning algorithms, doing so is energy inefficient, up to O(examples) in extreme cases.
Since the drawbacks of MPI and MapReduce differ, we can try to create a solution which eliminates all of the drawbacks, which a Hadoop-compatible AllReduce does. Cherry picking from each we get:
- MPI: The Allreduce function. The starting state for AllReduce is n nodes each with a number, and the end state is all nodes having the sum of all numbers.
- MapReduce: Conceptual simplicity. One easy to understand function is enough.
- MPI: No need to refactor code. You just sprinkle allreduce in a few locations in your single machine code.
- MapReduce: Data locality. We just hijack the MapReduce infrastructure to execute a map-only job where each process executes on the node with the data.
- MPI: Ability to use local storage (or RAM). Hadoop itself gobbles large amounts of RAM by default because it uses Java. And, in any case, you don’t have an effective large scale learning algorithm if it dies every time the data on a single node exceeds available RAM. Instead, you want to create a temporary file on the local disk and allow it to be cached in RAM by the OS, if that’s possible.
- MapReduce: Automatic cleanup of local resources. Temporary files are automatically nuked.
- MPI: Fast optimization approaches remain within the conceptual scope. Allreduce, because it’s a function call, does not conceptually limit online learning approaches as discussed below. MapReduce conceptually forces statistical query style algorithms. In practice, this can be walked around, but that’s annoying.
- MapReduce: Robustness. We don’t capture all the robustness of MapReduce, which can succeed even during a gunfight in the datacenter. But we don’t generally need that: it’s easy to use Hadoop’s speculative execution approach to deal with the slow node problem and use delayed initialization to get around all startup failures, giving you something with >99% success rate on a running time reliable to within a factor of 2.
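To make the start and end states of AllReduce concrete, here is a small single-process simulation in Python. The binary tree layout (reduce up to a root, then broadcast back down) and the name `tree_allreduce` are assumptions of this sketch; the text above only specifies what each node starts and ends with.

```python
def tree_allreduce(values):
    """Simulate AllReduce over len(values) nodes arranged as a binary tree.

    Node i has children 2*i + 1 and 2*i + 2. Returns the value each node
    holds at the end, which should be the sum of all starting values.
    """
    n = len(values)
    if n == 0:
        return []
    partial = list(values)
    # Reduce phase: walking from the leaves upward, each node adds its
    # partial sum into its parent.
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // 2
        partial[parent] += partial[i]
    # Broadcast phase: the root's total is passed back down the tree.
    result = [None] * n
    result[0] = partial[0]
    for i in range(1, n):
        result[i] = result[(i - 1) // 2]
    return result


print(tree_allreduce([1, 2, 3, 4, 5]))
```

Every node ends up holding the same total, which is all a caller needs to know.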
One function (all_reduce) is not a programming language. But since it’s written in C, it is easily encapsulated and added to any existing programming language giving you a complete language. To test this hypothesis, I visited Clement for a day, where we connected things to make Allreduce work in Lua twice—once with an online approach and once with an LBFGS optimization approach for convolutional neural networks. As a parallel programming paradigm, it’s amazingly easier than many other approaches, because you take your existing code and figure out which pieces of state to synchronize. It’s superior enough that I’ve now eliminated the multithreaded and parallel online learning approaches within Vowpal Wabbit. This approach is also great in terms of the amount of incremental learning required—you just need to learn one function to be able to create useful parallel machine learning algorithms. The only thing easier than learning one function is learning none, which you can do for linear prediction by just using VW. Incidentally, we designed the AllReduce code so that Hadoop is not a requirement—you just need to do a bit of extra scripting and lose some of the benefits discussed above when running this on a workstation cluster or a single machine.
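As a rough illustration of "take your existing code and figure out which pieces of state to synchronize", here is a Python sketch of batch gradient descent for a linear predictor where the only parallel step is one allreduce on the gradient. The nodes are simulated in one process, and `allreduce_sum`, `local_gradient` and `train` are hypothetical names, not the VW or Lua code.

```python
def allreduce_sum(per_node_vectors):
    """Stand-in for a real allreduce: every node gets the elementwise sum."""
    total = [sum(column) for column in zip(*per_node_vectors)]
    return [list(total) for _ in per_node_vectors]


def local_gradient(weights, shard):
    """Squared-loss gradient over one node's shard of (features, label)."""
    grad = [0.0] * len(weights)
    for features, label in shard:
        error = sum(w * x for w, x in zip(weights, features)) - label
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad


def train(shards, dim, steps=200, lr=0.1):
    """Gradient descent where each simulated node reads only its own shard."""
    n_examples = sum(len(shard) for shard in shards)
    weights = [[0.0] * dim for _ in shards]  # one model copy per node
    for _ in range(steps):
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        grads = allreduce_sum(grads)  # the single synchronization point
        for w, g in zip(weights, grads):
            for j in range(dim):
                w[j] -= lr * g[j] / n_examples
    return weights


# Two simulated nodes, labels generated by y = 2*x1 - x2.
shards = [
    [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)],
    [([1.0, 1.0], 1.0), ([1.0, -1.0], 3.0)],
]
models = train(shards, dim=2)
print(models[0])
```

Removing the `allreduce_sum` line gives back ordinary single machine code run on each shard, which is the sense in which nothing needs refactoring.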
You also need to get optimization approaches right. Two canonical but very different optimization algorithms are stochastic gradient descent and LBFGS. Understanding the weaknesses of these algorithms is critical even though often not discussed by their proponents. SGD approaches tend to have two drawbacks. The first is that the right choice of various hyperparameters can be annoying. We’ve mostly eliminated this drawback in VW using a learning rate that is tuned to automatically work in various ways. The other drawback is that they generally aren’t great at dealing with noise. This is tricky to deal with in general, because the algorithms only see one example at a time. Leon Bottou is working to eliminate this last drawback, but my impression is that we’re not quite there yet. LBFGS on the other hand is great at dealing with noise but suffers significantly in its early convergence rate where SGD is extremely effective. Again, we can combine these approaches in an obvious way: use online learning at the beginning to warmstart LBFGS to integrate out the noise. In practice, the online learning gets you 95%-99% of the way there and then LBFGS nails the last bit of performance.
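The warmstart idea can be sketched in a few dozen lines of Python: one online pass of SGD, then a textbook L-BFGS (two-loop recursion with a backtracking line search) started from the SGD weights. This is a toy under stated assumptions, not the VW implementation: squared loss, dense lists, a fixed SGD learning rate, and synthetic data from a hypothetical `make_data` helper.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_data(n, seed=0):
    """Noisy linear data; the true weights are an arbitrary choice."""
    rng = random.Random(seed)
    true_w = [2.0, -1.0, 0.5]
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in true_w]
        data.append((x, dot(true_w, x) + rng.gauss(0.0, 0.1)))
    return data


def loss_and_grad(w, data):
    """Mean squared loss (with a 1/2 factor) and its gradient."""
    loss, grad = 0.0, [0.0] * len(w)
    for x, y in data:
        err = dot(w, x) - y
        loss += 0.5 * err * err
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return loss / len(data), [g / len(data) for g in grad]


def sgd_pass(w, data, lr=0.05):
    """One online pass: update after every example."""
    w = list(w)
    for x, y in data:
        err = dot(w, x) - y
        for j, xj in enumerate(x):
            w[j] -= lr * err * xj
    return w


def lbfgs(w, data, iters=20, memory=5):
    """Textbook L-BFGS started from w."""
    w, history = list(w), []  # history holds (s, y) pairs, oldest first
    loss, grad = loss_and_grad(w, data)
    for _ in range(iters):
        if dot(grad, grad) < 1e-20:
            break
        # Two-loop recursion: q approximates (inverse Hessian) * gradient.
        q, alphas = list(grad), []
        for s, y in reversed(history):
            a = dot(s, q) / dot(y, s)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if history:
            s, y = history[-1]
            scale = dot(s, y) / dot(y, y)
            q = [scale * qi for qi in q]
        for (s, y), a in zip(history, reversed(alphas)):
            b = dot(y, q) / dot(y, s)
            q = [qi + (a - b) * si for qi, si in zip(q, s)]
        direction = [-qi for qi in q]
        # Backtracking line search with an Armijo sufficient-decrease test.
        step, slope = 1.0, dot(grad, direction)
        while True:
            new_w = [wi + step * di for wi, di in zip(w, direction)]
            new_loss, new_grad = loss_and_grad(new_w, data)
            if new_loss <= loss + 1e-4 * step * slope or step < 1e-10:
                break
            step *= 0.5
        s = [a - b for a, b in zip(new_w, w)]
        y = [a - b for a, b in zip(new_grad, grad)]
        if dot(s, y) > 1e-12:  # keep only pairs with positive curvature
            history = (history + [(s, y)])[-memory:]
        w, loss, grad = new_w, new_loss, new_grad
    return w


data = make_data(200)
w_online = sgd_pass([0.0, 0.0, 0.0], data)  # cheap pass gets most of the way
w_final = lbfgs(w_online, data)             # batch phase polishes the result
print(loss_and_grad(w_online, data)[0], loss_and_grad(w_final, data)[0])
```

On a toy problem this small the batch phase has little left to do, which is roughly the point: the online pass does the bulk of the work and the batch method cleans up.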
For the problem I mentioned at the beginning, we can learn in about an hour using a kilonode, implying an overall throughput of 500 megafeatures/s, which is about a factor of 5 faster than any single network interface (1 gigabit/s). This is substantially greater scaling than any of the other algorithms in the Scaling up Machine Learning book (see here for a comparison).
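The arithmetic behind these numbers is easy to check. The Python below only restates figures from the post; the last two quantities are implied by the stated factor of 5 rather than given directly, so treat them as my reading rather than a measurement.

```python
def features_per_second(total_features, seconds):
    """Effective throughput if every feature is accounted for once."""
    return total_features / seconds


TOTAL_FEATURES = 2e12  # 2 terafeatures (nonzero entries)
EXAMPLES = 16e9        # around 16 billion examples
NODES = 1000           # a kilonode
SECONDS = 3600.0       # about an hour

cluster_rate = features_per_second(TOTAL_FEATURES, SECONDS)
per_node_rate = cluster_rate / NODES
features_per_example = TOTAL_FEATURES / EXAMPLES

# Implied, not stated: the single-interface rate that makes
# 500 megafeatures/s "a factor of 5 faster", and the encoding size
# per feature that a 1 gigabit/s interface would then correspond to.
single_interface_rate = 500e6 / 5
implied_bits_per_feature = 1e9 / single_interface_rate

print(f"cluster:  {cluster_rate / 1e6:.0f} megafeatures/s")
print(f"per node: {per_node_rate / 1e3:.0f} kilofeatures/s")
print(f"average nonzero features per example: {features_per_example:.0f}")
print(f"implied bits per feature on the wire: {implied_bits_per_feature:.0f}")
```

An hour at 2 terafeatures works out to somewhat above the rounded 500 megafeatures/s quoted above.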
The general area of parallel learning has grown significantly, as indicated by the Big Learning workshop at NIPS, and there are a number of very different approaches people are taking. From what I understand of all other approaches, this approach is a significant step up within its scope of applicability. Let’s define that scope as learning (= tuning large numbers of parameters to be simultaneously optimal on test data) from a large dataset on a cluster or datacenter. At the borders:
- For counting based learning algorithms such as the NLP folks sometimes use, a MapReduce approach appears superior as MapReduce is straightforwardly excellent for counting.
- For smaller datasets with computationally intense models, GPU approaches seem very compelling.
- For broadly distributed datasets (not all in one cluster), asynchronous approaches become unavoidably necessary. That’s scary in practice, because you lose the ability to debug.
- The model needs to fit into memory. If that’s not the case, then other approaches are required.
I also expect Hadoop Allreduce is useful across many more tasks than just machine learning. Optimization problems are an easy example, but I suspect there are a number of iterative computation problems where allreduce can be very effective. While it might appear a limited operation, you can easily do average, weighted average, max, etc… And, the scope of allreduce is also easily broadened with an arbitrary reduce function, as per MPI’s version. The Allreduce code itself is not yet native in Hadoop, so you’ll need to grab it from the VW source code which has a BSD license. I’ve been encouraged by discussions with Milind suggesting it may become native soon.
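Here is a hedged Python sketch of how average, weighted average and max fall out of an allreduce that accepts an arbitrary reduce function. The nodes are simulated in one process and all function names are hypothetical; the reduce function is assumed to be associative and commutative so that the order of combination does not matter.

```python
from functools import reduce


def allreduce(values, op):
    """Every simulated node ends with op folded over all nodes' values."""
    combined = reduce(op, values)
    return [combined for _ in values]


def add_pairs(a, b):
    """Reduce function for (numerator, denominator) pairs."""
    return (a[0] + b[0], a[1] + b[1])


def allreduce_average(values):
    # Each node contributes (value, 1); the sums give total and count.
    pairs = allreduce([(v, 1) for v in values], add_pairs)
    return [total / count for total, count in pairs]


def allreduce_weighted_average(values, weights):
    # Each node contributes (weight * value, weight).
    pairs = allreduce([(v * w, w) for v, w in zip(values, weights)], add_pairs)
    return [total / weight for total, weight in pairs]


def allreduce_max(values):
    return allreduce(values, max)


print(allreduce_average([1.0, 2.0, 6.0]))
print(allreduce_weighted_average([1.0, 3.0], [1.0, 3.0]))
print(allreduce_max([3, 7, 5]))
```

The average and weighted average need nothing beyond the plain sum, applied to a pair of numbers instead of one.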
Update: CACM Crosspost.
The New York ML symposium was last Friday. Attendance was 268, significantly larger than last year. My impression was that the event mostly still fit the space, although it was crowded. If anyone has suggestions for next year, speak up.
The best student paper award went to Sergiu Goschin for a cool video of how his system learned to play video games (I can’t find the paper online yet). Choosing amongst the submitted talks was pretty difficult this year, as there were many similarly good ones.
By coincidence all the invited talks were (at least potentially) about faster learning algorithms. Stephen Boyd talked about ADMM. Leon Bottou spoke on single pass online learning via averaged SGD. Yoav Freund talked about parameter-free hedging. In Yoav’s case the talk was mostly about a better theoretical learning algorithm, but it has the potential to unlock an exponential computational complexity improvement via oraclization of experts algorithms… but some serious thought needs to go in this direction.
Unrelated, I found quite a bit of truth in Paul’s talking bears and Xtranormal always adds a dash of funny. My impression is that the ML job market has only become hotter since 4 years ago. Anyone who is well trained can find work, with the key limiting factor being “well trained”. In this environment, efforts to make ML more automatic and more easily applied are greatly appreciated. And yes, Yahoo! is still hiring too.