For graduate students, the Yahoo! Key Scientific Challenges program including in machine learning is on again, due March 9. The application is easy and the $5K award is high quality “no strings attached” funding. Consider submitting.
The ICML paper deadline has passed. Joelle and I were surprised to see the number of submissions jump from last year by about 50% to around 900 submissions. A tiny portion of these are immediate rejects(*), so this is a much larger set of papers than expected. The number of workshop submissions also doubled compared to last year, so ICML may grow significantly this year, if we can manage to handle the load well. The prospect of making 900 good decisions is fundamentally daunting, and success will rely heavily on the program committee and area chairs at this point.
For those who want to rubberneck a bit more, here’s a breakdown of submissions by primary topic of submitted papers:
66 Reinforcement Learning 52 Supervised Learning 51 Clustering 46 Kernel Methods 40 Optimization Algorithms 39 Feature Selection and Dimensionality Reduction 33 Learning Theory 33 Graphical Models 33 Applications 29 Probabilistic Models 29 NN & Deep Learning 26 Transfer and Multi-Task Learning 25 Online Learning 25 Active Learning 22 Semi-Supervised Learning 20 Statistical Methods 20 Sparsity and Compressed Sensing 19 Ensemble Methods 18 Structured Output Prediction 18 Recommendation and Matrix Factorization 18 Latent-Variable Models and Topic Models 17 Graph-Based Learning Methods 16 Nonparametric Bayesian Inference 15 Unsupervised Learning and Outlier Detection 12 Gaussian Processes 11 Ranking and Preference Learning 11 Large-Scale Learning 9 Vision 9 Social Network Analysis 9 Multi-agent & Cooperative Learning 9 Manifold Learning 8 Time-Series Analysis 8 Large-Margin Methods 8 Cost Sensitive Learning 7 Recommender Systems 7 Privacy, Anonymity, and Security 7 Neural Networks 7 Empirical Insights into ML 7 Bioinformatics 6 Information Retrieval 6 Evaluation Methodology <5 each Text Mining, Rule and Decision Tree Learning, Graph Mining, Planning & Control, Monte Carlo Methods, Inductive Logic Programming & Relational Learning, Causal Inference, Statistical and Relational Learning, NLP, Hidden Markov Models, Game Theory, Robotics, POMDPs, Geometric Approaches, Game Playing, Data Streams, Pattern Mining & Inductive Querying, Meta-Learning, Evolutionary Computation
(*) Deadlines are magical, because they galvanize groups of people to concentrated action. But, they have to be real deadlines to achieve this, which leads us to reject late submissions & format failures to keep the deadline real for future ICMLs. This is uncomfortably rough at times.
The From Data to Knowledge workshop May 7-11 at Berkeley should be of interest to the many people encountering streaming data in different disciplines. It’s run by a group of astronomers who encounter streaming data all the time. I met Josh Bloom recently and he is broadly interested in a workshop covering all aspects of Machine Learning on streaming data. The hope here is that techniques developed in one area turn out useful in another which seems quite plausible. Particularly if you are in the bay area, consider checking it out.
It also seems worthwhile to give some sense of the scope and reviewing criteria for ICML for authors considering submitting papers. At ICML, the (very large) program committee does the reviewing which informs final decisions by area chairs on most papers. Program chairs setup the process, deal with exceptions or disagreements, and provide advice for the reviewing process. Providing advice is tricky (and easily misleading) because a conference is a community, and in the end the aggregate interests of the community determine the conference. Nevertheless, as a program chair this year it seems worthwhile to state the overall philosophy I have and what I plan to encourage (and occasionally discourage).
At the highest level, I believe ICML exists to further research into machine learning, which I generally think of as turning observations into useful predictions. Research is greatly varied in general, but in all cases it involves answering an interesting question for which the answer was not previously known. Interesting questions are generally natural: they can be stated easily and other people plausibly encounter them. Interesting questions are generally also ones for which there are multiple plausible wrong answers. The definition of “interesting” is otherwise hard to pin down, because it is does and must change over time.
ICML is a broad conference which incorporates the interests of many different groups of people with different tastes in the research they prefer. It’s broad enough that most people don’t appreciate all the papers. That’s ok as long as there is some higher level appreciation for which directions of research benefit the community. Some common flavors are:
- ML for X In general, Machine Learning is a core field of study with many applications. Often, it’s a good idea to publish within a conference focused on that area, but particularly when no such conference exists, ICML is a solid choice for a place to publish. One example of this kind of thing is Machine Learning for Sustainability, where the CCC will be giving a few travel grants. Here the core question is typically “How?” Exhibiting new things that you can do with ML provides good reference points for what is possible, provides a sense of what works, and compelling new ideas about what to work on can be valuable to the community.
There are several ways that papers of this sort can bounce. Perhaps X is insufficiently interesting, the results are unconvincing, or the method of solution is considered too straight-forward. I consider the first and second criteria sound, but am inclined toward leniency on the third, since there is often quite a bit of work in figuring out how to frame the problem so that the solution happens to be easy.
- New Algorithms Often, authors find that existing learning algorithms for solving some problem are lacking in some way, so they propose new better algorithms. This is plausibly the most common category of paper at ICML, so there is quite a bit of variety. The most straight-forward version proposes a new algorithm for a well-studied problem. For these papers it’s important to have an empirical comparison to existing baselines.
It’s easy for an empirical comparison to go wrong. Some authors use synthetic datasets which do not seem significant to me, because good results on such datasets may not transfer to real-world problems well as the real world tends to be quite a bit more complex than the synthetic processes which are natural to program. Instead, it’s important to show good results on real datasets. One problem with relying on real datasets is dataset selection—choosing the dataset for which your algorithm seems to perform best. You can avoid this by choosing datasets in some clearly unbiased manner and by evaluating on many standard datasets. Another way to fail is with a poor choice of baseline. This is tricky, because three reviewers might consider three different baselines the most natural one. Asking around a bit when developing the paper might help here, but in the end this can be a tough judgement call: Is the paper convincing enough that people interested in solving the problem should use this algorithm?
Another class of new algorithms papers is new algorithms for new areas of machine learning, blending into the previous category. Here, there typically are relatively few (perhaps just one) dataset available and there may be no (or only implausibly bad) baselines. For papers like this, one way I’ve seen difficulties is when authors are very invested in a particular approach to solving the problem. If you have defined the problem too narrowly, broadening the definition of the problem can help you see appropriate baselines. Another difficulty I’ve observed is reviewers used to the well-studied problems reject an interesting paper because (essentially) they assume that the authors left out a good baseline which does not exist. To prevent the first, authors who ask around might get some valuable early feedback. For the second, it’s a difficulty we are aware of and will consider asking reviewers to judge on the merits of ML for X.
- Algorithmic studies A relatively rare but potentially valuable form of paper is an algorithmic study. Here, the authors do not propose a new algorithm, but instead do a comprehensive empirical comparison of different algorithms. The standards here are quite high—the empirical comparison needs to be first-class to convince people, so the empirical comparison comments under new algorithms apply strongly.
- New Theory Good theory can enlighten us about what is (or might be) possible. It can also help us build robust learning algorithms, where we design learning algorithms so that they provably solve some large class of problems. I am personally most interested in theory that helps us design new learning algorithms, but broadly interested in what is possible. I’m most interested in the question answered, while the means (and language) should only be as complex as necessary so the theory can be understood as widely as possible.
In many areas of CS theory, double blind reviewing is rare, so theory-oriented people may be unfamiliar with it. An important consequence is that complete proofs must be included either in the paper or supplemental material so that proof checking is fully feasible.
Another way that I’ve seen theory papers run into trouble is when it is a post-hoc justification for an algorithm. In essence, authors who choose to analyze an existing algorithm are sometimes forced to make many unnatural assumptions for the theory to be correct. There generally isn’t an easy fix if you arrive at this point.
- n of the above It is common for ICML papers to be multicategory. At the extreme, you might have a new algorithm which solves a new X well, empirically and theoretically. Reviewers can fall into a trap where they are most interested in 1 of the 4 questions answered above, and find 1/4 of the paper devoted to their question relatively weak compared to the paper that devotes all the pages to the same question.
We are aware of this, and will encourage it to be taken into account.
- The exception The set of papers I expect to see at ICML is more diverse than the above—there are often exceptions of one sort or another. For these exceptions, it often becomes a judgment call: Does this paper significantly further research into machine learning? Papers with little potential audience probably don’t while fun/interesting/useful things that we didn’t think of do.
Further comments or questions are welcome.
Following John’s advertisement for submitting to ICML, we thought it appropriate to highlight the advantages of COLT, and the reasons it is often the best place for theory papers. We would like to emphasize that we both respect ICML, and are active in ICML, both as authors and as area chairs, and certainly are not arguing that ICML is a bad place for your papers. For many papers, ICML is the best venue. But for many theory papers, COLT is a better and more appropriate place.
Why should you submit to COLT?
By-and-large, theory papers go to COLT. This is the tradition of the field and most theory papers are sent to COLT. This is the place to present your ground-breaking theorems and new models that will shape the theory of machine learning. COLT is more focused then ICML with a single track session. Unlike ICML, the norm in COLT is for people to sit through most sessions, and hear most of the talks presented. There is also often a lively discussion following paper presentations. If you want theory people to know of your work, you should submit to COLT.
Additionally, this year COLT and ICML are tightly co-located, with joint plenary sessions (i.e. some COLT papers will be presented in a plenary session to the entire combined COLT/ICML audience, as will some ICML papers), and many other opportunities for exposure to the wider ICML audience. And so, by submitting to COLT, you have the potential of reaching both the captive theory audience at COLT and the wider ML audience at ICML.
The advantages of sending to COLT:
- Rigorous review process.
The COLT program committee is comprised entirely of established, mostly fairly senior, researchers. Program committee members read and review papers themselves, or potentially use a sub-reviewer that they know personally and carefully select for the paper, but still check and maintain responsibility for the review. Your paper will get reviewed by at least three program committee members, who will likely be experts on the topics covered by the paper. This is in contrast to ICML (and most other ML conferences) were area chairs (of similar seniority to the COLT program committee) only manage the review process, but reviewers are assigned based on load-balancing considerations and the primary reviewing is done by a very wide set of reviewers, frequently students, who are often not the most relevant experts.
COLT reviews are typically detailed and technical details are checked. The reviewing process is less rushed and program committee members (and sub-reviewers were appropriate) are expected to do a careful job on each and every paper.
All papers are then discussed by the program committee, and there is generally significant and meaningful discussions on papers. This also means the COLT reviewing process is far from having a “single point of failure”, as the paper will be carefully considered and argued for by multiple (senior) program committee members. We believe this yields a more consistently high quality program, with much less randomness in the paper selection process, which in turn translates to high respect for accepted COLT papers.
- COLT is not double blind, but also not exactly single blind. Program committee members have access to the author identities (as do area chairs in ICML), as this is essential in order to select sub-reviewers. However, the author names do not appear on the papers, both in order to reduce the effect of first impressions, and to allow program committee members to utilize reviewers who are truly blind to the author’s identities.
It should be noted that the COLT anonimization guidelines are a bit more relaxed, which we hope makes it easier to create an anonimized version for conference submission (authors are still allowed to, and even encouraged, to post their papers online, with their names on them of course).
- COLT does not have a dedicated rebuttal phase. Frankly, with the higher quality, less random, reviews, we feel it is not needed, and the hassle to authors and program committee members is not worth it. However, the tradition in COLT, which we plan to follow, is to contact authors as needed during the review and discussion process to ask for clarification on issues that came up during review. In particular, if a concern is raised on the soundness or other technical aspect of a paper, the authors will be contacted to give them a chance to set things straight. But no, there is no generic author response where authors can argue and plead for acceptance.
Here’s a quick reference for summer ML-related conferences sorted by due date:
|KDD||Feb 10||August 12-16, Beijing, China||Single Blind|
|COLT||Feb 14||June 25-June 27, Edinburgh, Scotland||Single Blind? (historically)|
|ICML||Feb 24||June 26-July 1, Edinburgh, Scotland||Double Blind, author response, zero SPOF|
|UAI||March 30||August 15-17, Catalina Islands, California||Double Blind, author response|
Geographically, this is greatly dispersed and the UAI/KDD conflict is unfortunate.
Machine Learning conferences are triannual now, between NIPS, AIStat, and ICML. This has not always been the case: the academic default is annual summer conferences, then NIPS started with a December conference, and now AIStat has grown into an April conference.
However, the first claim is not quite correct. NIPS and AIStat have few competing venues while ICML implicitly competes with many other conferences accepting machine learning related papers. Since Joelle and I are taking a turn as program chairs this year, I want to make explicit the case for ICML.
- COLT was historically a conference for learning-interested Computer Science theory people. Every COLT paper has a theorem, and few have experimental results. A significant subset of COLT papers could easily be published at ICML instead. ICML now has a significant theory community, including many pure theory papers and significant overlap with COLT attendees. Good candidates for an ICML submission are learning theory papers motivated by real machine learning problems (example: the agnostic active learning paper) or which propose and analyze new plausibly useful algorithms (example: the adaptive gradient papers). If you find yourself tempted to add empirical experiments to prove the point that your theory really works, ICML sounds like an excellent fit. Not everything is a good fit though—papers motivated by definitional aesthetics or tradition (Valiant style PAC learning comes to mind) may not be appreciated.
There are two significant advantages to ICML over COLT. One is that ICML provides a potentially much larger audience which appreciates and uses your work. That’s substantially less relevant this year, because ICML and COLT are colocating and we are carefully designing joint sessions for the overlap day.
The other is that ICML is committed to fair reviewing—papers are double blind so reviewers are not forced to take into account the author identity. Plenty of people will argue that author names don’t matter to them, but I’ve personally seen several cases as a reviewer where author identity affected the decision, typically towards favoring insiders or bigwigs at theory conferences as common sense would suggest. The double blind aspect of ICML reviewing is an open invitation to outsiders to submit to ICML.
- Many UAI papers could easily go to ICML because they are explicitly about machine learning or connections with machine learning. For example, pure prediction markets are a stretch for ICML, but connections between machine learning and prediction markets, which seem to come up in multiple ways, are a good fit. Bernhard‘s lab has done quite a bit of work on extracting causality from prediction complexity which could easily interest people at ICML. I’ve personally found some work on representations for learning algorithms, such as sum-product networks of first class interest. UAI has a definite subcommunity of hardcore Bayesians which is less evident at ICML. ICML as a community seems more pragmatist w.r.t. Bayesian methods: if they work well, that’s good. Of the comparators here, UAI seems the most similar in orientation to ICML to me.
ICML provides a significantly larger potential audience and, due to it’s size, tends to be more diverse.
- KDD is a large conference (a bit larger than ICML by attendance) which, as I understand it, initially started from the viewpoint of database people trying to do interesting things with the data they had. The conference is generally one step more commercial/industrial than ICML. Significant parts of the academic track are about machine learning technology and could have been submitted to ICML instead. I was impressed by the double robust sampling work and the out of core learning paper is cool. And, I often enjoy the differential privacy in learning work. KDD attendees tends to be very pragmatic about what works, which is reinforced by yearly prediction challenges. I appreciate this viewpoint quite a bit.
KDD doesn’t do double blind review, which was discussed above. To me, a more significant drawback of KDD is the ACM paywall. I was burned by this last summer. We decided to do a large scale learning survey based on the SUML compendium at KDD, but discovered too late that the video would be stuck behind the paywall, unlike our learning with exploration tutorial the year before. As I understand it, the year before ACM made them pay twice: once to videolectures and once to ACM, which was understandably judged unsustainable. The paywall is particularly rough for students who are not well-established, because it substantially limits their potential audience.
This is not a problem at ICML 2012. Every prepared presentation will be videotaped and we will have every paper easily and publicly accessible along with it. The effort you put into the presentation will payoff over hundreds or thousands of additional online views.
- Area conferences. There are many other conferences which I think of as adjacent area conferences, including AAAI, ACL, SIGIR, CVPR and WWW which I have not attended enough or recently enough to make a real comparison with. Nevertheless, in each of these conferences, machine learning is a common technology. And sometimes new forms of machine learning technology are developed. Depending on many circumstances, ICML might be a good candidate for a place to send a paper on a new empirically useful piece of machine learning technology. Or not—the circumstances matter hugely.
Machine Learning has grown radically and gone industrial over the last decade, providing plenty of motivation for a conference on developing new core machine learning technology. Indeed, it is because of the power of ML that so much overlap exists. In most cases, the best place to send a paper is to the conference where it will be most appreciated. But, there is a real sense in which you create the community by participating in it. So, when the choice is unclear, sending the paper to a conference designed simultaneously for fair high quality reviewing and broad distribution of your work is a good call as it provides the most meaningful acceptance. For machine learning, that conference is ICML. Details of the ICML plan this year are here. We are on track.
As always, comments are welcome.
- The cluster parallel learning code better supports multiple simultaneous runs, and other forms of parallelism have been mostly removed. This incidentally significantly simplifies the learning core.
- The online learning algorithms are more general, with support for l1 (via a truncated gradient variant) and l2 regularization, and a generalized form of variable metric learning.
- There is a solid persistent server mode which can train online, as well as serve answers to many simultaneous queries, either in text or binary.
This should be a very good release if you are just getting started, as we’ve made it compile more automatically out of the box, have several new examples and updated documentation.
- Miro will cover the L-BFGS implementation, which he created from scratch. We have found this works quite well amongst batch learning algorithms.
- Alekh will cover how to do cluster parallel learning. If you have access to a large cluster, VW is orders of magnitude faster than any other public learning system accomplishing linear prediction. And if you are as impatient as I am, it is a real pleasure when the computers can keep up with you.
This will be recorded, so it will hopefully be available for viewing online before too long.
I hope to see you soon
Suppose you have a dataset with 2 terafeatures (we only count nonzero entries in a datamatrix), and want to learn a good linear predictor in a reasonable amount of time. How do you do it? As a learning theorist, the first thing you do is pray that this is too much data for the number of parameters—but that’s not the case, there are around 16 billion examples, 16 million parameters, and people really care about a high quality predictor, so subsampling is not a good strategy.
Alekh visited us last summer, and we had a breakthrough (see here for details), coming up with the first learning algorithm I’ve seen that is provably faster than any future single machine learning algorithm. The proof of this is simple: We can output a optimal-up-to-precision linear predictor faster than the data can be streamed through the network interface of any single machine involved in the computation.
It is necessary but not sufficient to have an effective communication infrastructure. It is necessary but not sufficient to have a decent programming language, because parallel programming is hard. It is necessary but not sufficient to have a good optimization approach. The combination says “yikes”, because you need to know many things to design an effective new system.
- MPI suffers because it has no fault tolerance by default and because it has a poor understanding of where data is, implying that data must be either manually placed on local nodes, or the first step in every computation is “partition the data across the cluster” which is very undesirable from a communication complexity and programming complexity standpoint. These significantly limit the scale that you can work at to ~100 nodes in practice, because the economics of clusters make sharing unavoidable at larger scales. When the cluster is shared, preshuffling the data is awkward to impossible and you must expect that some nodes will run slower than others because they will be executing other jobs. This limitation on reliability kicks in much sooner than disk read failures or node failures.
- MapReduce suffers because the setup and teardown costs are significant. Measured directly, this is often on the order of a minute, associated with interacting with the scheduler and communicating the program to a large number of nodes. But indirectly, this can be radically worse, as any map-reduce job can be held in limbo while waiting for free nodes to work on. And commonly we need to execute many MapReduce iterations to achieve high quality prediction.
MapReduce has another more subtle flaw: using it requires refactoring your code into a sequence of map and reduce operations. This is significantly annoying, because right good learning algorithms is pretty difficult in the first place. MapReduce has a third flaw: it encourages inefficient optimization paradigm. In particular, while you can phrase many learning algorithms as statistical query learning algorithms, doing so is energy inefficient, up to O(examples) in extreme cases.
Since the drawbacks of MPI and MapReduce differ, we can try to create a solution which eliminates all of drawbacks, which a Hadoop-compatible AllReduce does. Cherry picking from each we get:
- MPI: The Allreduce function. The starting state for AllReduce is n nodes each with a number, and the end state is all nodes having the sum of all numbers.
- MapReduce: Conceptual simplicity. One easy to understand function is enough.
- MPI: No need to refactor code. You just sprinkle allreduce in a few locations in your single machine code.
- MapReduce: Data locality. We just hijack the MapReduce infrastructure to execute a map-only job where each process executes on the node with the data.
- MPI: Ability to use local storage (or RAM). Hadoop itself gobbles large amounts of RAM by default because it uses Java. And, in any case, you don’t have an effective large scale learning algorithm if it dies every time the data on a single node exceeds available RAM. Instead, you want to create a temporary file on the local disk and allow it to be cached in RAM by the OS, if that’s possible.
- MapReduce: Automatic cleanup of local resources. Temporary files are automatically nuked.
- MPI: Fast optimization approaches remain within the conceptual scope. Allreduce, because it’s a function call, does not conceptually limit online learning approaches as discussed below. MapReduce conceptually forces statistical query style algorithms. In practice, this can be walked around, but that’s annoying.
- MapReduce: Robustness. We don’t captures all the robustness of MapReduce which can succeed even during a gunfight in the datacenter. But we don’t generally need that: it’s easy to use Hadoop’s speculative execution approach to deal with the slow node problem and use delayed initialization to get around all startup failures giving you something with >99% success rate on a running time reliable to within a factor of 2.
One function (all_reduce) is not a programming language. But since it’s written in C, it is easily encapsulated and added to any existing programming language giving you a complete language. To test this hypothesis, I visited Clement for a day, where we connected things to make Allreduce work in Lua twice—once with an online approach and once with an LBFGS optimization approach for convolutional neural networks. As a parallel programming paradigm, it’s amazingly easier than many other approaches, because you take your existing code and figure out which pieces of state to synchronize. It’s superior enough that I’ve now eliminated the multithreaded and parallel online learning approaches within Vowpal Wabbit. This approach is also great in terms of the amount of incremental learning required—you just need to learn one function to be able to create useful parallel machine learning algorithms. The only thing easier than learning one function is learning none, which you can do for linear prediction by just using VW. Incidentally, we designed the AllReduce code so that Hadoop is not a requirement—you just need to do a bit of extra scripting and lose some of the benefits discussed above when running this on a workstation cluster or a single machine.
You also need to get optimization approaches right. Two canonical but very different optimization algorithms are stochastic gradient descent and LBFGS. Understanding the weaknesses of these algorithm is critical even though often not discussed by their proponents. SGD approaches tend to have two drawbacks: the right choice of various hyperparameters can be annoying. We’ve mostly eliminated this drawback in VW using a learning rate that is tuned to automatically work in various ways. The other drawback is that they generally aren’t great at dealing with noise. This is tricky to deal with in general, because the algorithms only see one example at a time. Leon Bottou is working to eliminate this last drawback, but my impression is that we’re not quite there yet. LBFGS on the other hand is great at dealing with noise but suffers significantly in it’s early convergence rate where SGD is extremely effective. Again, we can combine these approaches in an obvious way: use online learning at the beginning to warmstart LBFGS to integrate out the noise. In practice, the online learning gets you 95%-99% of the way there and then LBFGS nails the last bit of performance.
For the problem I mentioned at the beginning, we can learn in about an hour using a kilonode, implying an overall throughput of 500 megafeatures/s, which is about a factor of 5 faster than any single network interface (1 gigabit/s). This is substantially greater scaling than any of the other algorithms in the Scaling up Machine Learning book (see here for a comparison).
The general area of parallel learning has grown significantly, as indicated by the Big Learning workshop at NIPS, and there are a number of very different approaches people are taking. From what I understand of all other approaches, this approach is a significant step up within it’s scope of applicability. Let’s define that scope as learning (= tuning large numbers of parameters to be simultaneously optimal on test data) from a large dataset on a cluster or datacenter. At the borders:
- For counting based learning algorithms such as the NLP folks sometimes use, a MapReduce approach appears superior as MapReduce is straightforwardly excellent for counting.
- For smaller datasets with computationally intense models, GPU approaches seem very compelling.
- For broadly distributed datasets (not all in one cluster), asynchronous approaches become unavoidably necessary. That’s scary in practice, because you lose the ability to debug.
- The model needs to fit into memory. If that’s not the case, then other approaches are required.
I also expect Hadoop Allreduce is useful across many more tasks than just machine learning. Optimization problems are an easy example, but I suspect there are a number of iterative computation problems where allreduce can be very effective. While it might appear a limited operation, you can easily do average, weighted average, max, etc… And, the scope of allreduce is also easily broadened with an arbitrary reduce function, as per MPI’s version. The Allreduce code itself is not yet native in Hadoop, so you’ll need to grab it from the VW source code which has a BSD license. I’ve been encouraged by discussions with Milind suggesting it may become native soon.
Update: CACM Crosspost.
Thanksgiving is perhaps my favorite holiday, because pausing your life and giving thanks provides a needed moment of perspective.
As a researcher, I am most thankful for my education, without which I could not function. I want to share this, because it provides some sense of how a researcher starts.
- My long term memory seems to function particularly well, which makes any education I get is particularly useful.
- I am naturally obsessive, which makes me chase down details until I fully understand things. Natural obsessiveness can go wrong, of course, but it’s a great ally when you absolutely must get things right.
- My childhood was all in one hometown, which was a conscious sacrifice on the part of my father, implying disruptions from moving around were eliminated. I’m not sure how important this was since travel has it’s own benefits, but it bears thought.
- I had several great teachers in grade school, and naturally gravitated towards teachers over classmates, as they seemed more interesting. I particularly remember Mr. Cox, who read Watership Down 10 minutes a day. The frustration of not getting to the ending drove me into reading books on my own, including just about every science fiction book in Lebanon Oregon.
- I spent a few summers picking strawberries and blueberries. It’s great motivation to not do that sort of thing for a living.
- Lebanon school district was willing to bend the rules for me, so I could skip unnecessary math classes. I ended up a year advanced, taking math from our local community college during senior year in high school.
- College applications was a very nervous time, because high quality colleges cost much more than we could reasonably expect to pay. I was very lucky to get into Caltech here. Caltech should not be thought of as a university—instead, it’s a research lab which happens to have a few undergraduate students running around. I understand from Preston that the operating budget is about 4% tuition these days. This showed in the financial aid package, where they basically let me attend for the cost of room&board. Between a few scholarships and plentiful summer research opportunities, I managed to graduate debt free. Caltech was also an exceptional place to study, because rules like “no taking two classes at the same time” were never enforced them. The only limits on what you could learn were your own.
- Graduate school was another big step. Here, I think Avrim must have picked out my application to Carnegie Mellon, which was a good fit for me. At the time, I knew I wanted to do research in some sort of ML/AI subject area, but not really what, so the breadth of possibilities at CMU was excellent. In graduate school, your advisor is much more important, and between Avrim and Sebastian, I learned quite a bit. The funding which made this all work out was mostly hidden from me at CMU, but there was surely a strong dependence on NSF and DARPA. Tom Siebel also directly covered my final year as as a Siebel Scholar.
- Figuring out what to do next was again a nervous time, but it did work out, first in a summer postdoc with Michael Kearns, then at IBM research as a Herman Goldstine Fellow, then at TTI-Chicago, and now at Yahoo! Research for the last 5 years.
My life is just one anecdote, from which it’s easy to be misled. But trying to abstract the details, it seems like the critical elements for success are a good memory, an interest in getting the details right, motivation, and huge amounts of time to learn and then to do research. Given that many of the steps in this process winnow out large fractions of people, some amount of determination and sheer luck is involved. Does the right person manage to see you as a good possibility?
But mostly I’d like to give thanks for the “huge amounts of time” which in practical terms translates into access to other smart people and funding. In education and research funding is something like oxygen—you really miss it when it’s not there, so Thanksgiving is a good time to remember it.
The New York ML symposium was last Friday. Attendance was 268, significantly larger than last year. My impression was that the event mostly still fit the space, although it was crowded. If anyone has suggestions for next year, speak up.
The best student paper award went to Sergiu Goschin for a cool video of how his system learned to play video games (I can’t find the paper online yet). Choosing amongst the submitted talks was pretty difficult this year, as there were many similarly good ones.
By coincidence all the invited talks were (at least potentially) about faster learning algorithms. Stephen Boyd talked about ADMM. Leon Bottou spoke on single pass online learning via averaged SGD. Yoav Freund talked about parameter-free hedging. In Yoav’s case the talk was mostly about a better theoretical learning algorithm, but it has the potential to unlock an exponential computational complexity improvement via oraclization of experts algorithms… but some serious thought needs to go in this direction.
Unrelated, I found quite a bit of truth in Paul’s talking bears and Xtranormal always adds a dash of funny. My impression is that the ML job market has only become hotter since 4 years ago. Anyone who is well trained can find work, with the key limiting factor being “well trained”. In this environment, efforts to make ML more automatic and more easily applied are greatly appreciated. And yes, Yahoo! is still hiring too
Everyone should have received notice for NY ML Symposium abstracts. Check carefully, as one was lost by our system.
The event itself is October 21, next week. Leon Bottou, Stephen Boyd, and Yoav Freund are giving the invited talks this year, and there are many spotlights on local work spread throughout the day. Chris Wiggins has setup 6(!) ML-interested startups to follow the symposium, which should be of substantial interest to the employment interested.
I also wanted to give an update on ICML 2012. Unlike last year, our deadline is coordinated with AIStat (which is due this Friday). The paper deadline for ICML has been pushed back to February 24 which should allow significant time for finishing up papers after the winter break. Other details may interest people as well:
- We settled on using CMT after checking out the possibilities. I wasn’t looking for this, because I’ve often found CMT clunky in terms of easy access to the right information. Nevertheless, the breadth of features and willingness to support new/better approaches to reviewing was unrivaled. We are also coordinating with Laurent, Rich, and CMT to enable their paper/reviewer recommendation system. The outcome should be a standardized interface in CMT for any recommendation system, which others can then code to if interested.
- Area chairs have been picked. The list isn’t sacred, so if we discover significant holes in expertise we’ll deal with it. We expect to start inviting PC members in a little while. Right now, we’re looking into invited talks. If you have any really good suggestions, they could be considered.
- CCC is interested in sponsoring travel costs for any climate/environment related ML papers, which seems great to us. In general, this seems like an area of growing interest.
- We now have a permanent server and the beginnings of the permanent website setup. Much more work needs to be done here.
- We haven’t settled yet on how videos will work. Last year, ICML experimented with Weyond with results here. Previously, ICML had used videolectures, which is significantly more expensive. If you have an opinion about cost/quality tradeoffs or other options, speak up.
- Plans for COLT have shifted slightly—COLT will start a day early, overlap with tutorials, then overlap with a coordinated first day of ICML conference papers.
Various people want to use hunch.net to announce things. I’ve generally resisted this because I feared hunch becoming a pure announcement zone while I am much more interested contentful posts and discussion personally. Nevertheless there is clearly some value and announcements are easy, so I’m planning to summarize announcements on Mondays.
- D. Sculley points out an interesting Semisupervised feature learning competition, with a deadline of October 17.
- Lihong Li points out the webscope user interaction dataset which is the first high quality exploration dataset I’m aware of that is publicly available.
- Seth Rogers points out CrossValidated which looks similar in conception to metaoptimize, but directly using the stackoverflow interface and with a bit more of a statistics twist.
Since we last discussed the other online learning, Stanford has very visibly started pushing mass teaching in AI, Machine Learning, and Databases. In retrospect, it’s not too surprising that the next step up in serious online teaching experiments are occurring at the computer science department of a university embedded in the land of startups. Numbers on the order of 100000 are quite significant—similar in scale to the number of computer science undergraduate students/year in the US. Although these populations surely differ, the fact that they could overlap is worth considering for the future.
It’s too soon to say how successful these classes will be and there are many easy criticisms to make:
- Registration != Learning … but if only 1/10th complete these classes, the scale of teaching still surpasses the scale of any traditional process.
- 1st year excitement != nth year routine … but if only 1/10th take future classes, the scale of teaching still surpasses the scale of any traditional process.
- Hello, cheating … but teaching is much harder than testing in general, and we already have recognized systems for mass testing.
- Online misses out … sure, but for students not enrolled in a high quality university program, this is simply not a relevant comparison. There are also benefits to being online as well, as your time might be better focused. Anecdotally, at Caltech, they let us take two classes at the same time, which I did a few times. Typically, I had a better grade in the class that I skipped as the instructor had to go through things rather slowly.
- Where’s the beef? The hard nosed will want to know how to make money, which is always a concern. But, a decent expectation is that if you first figure out how to create value, you’ll find some way to make money. And, if you first wait until it’s clear how to make money, you won’t make any.
My belief is that this project will pan out, with allowances for the expected inevitable adjustments that you learn to make from experience. I think the critics miss an understanding of what’s possible and what motivates people.
The prospect of teaching 1 student means you might review some notes. The prospect of teaching ~10 students means you prepare some slides. The prospect of teaching ~100 students means you polish your slides well, trying to anticipate questions, and hopefully drawing on experience from previous presentations. I’ve never directly taught ~1000 students, but at that scale you must try very hard to make the presentation perfect, including serious testing with dry runs. 105 students must make getting out of bed in the morning quite easy.
Stanford has a significant first-mover advantage amongst top research universities, but it’s easy to imagine a few other (but not many) universities operating at a similar scale. Those that have the foresight to start a serious online teaching program soon will have a chance of being among the few. For other research universities, we can expect boutique traditional classes to continue for some time. These boutique classes may have some significant social value, because it’s easy to imagine that the few megaclasses miss important things in developing research areas. And for everyone working at teaching universities, someone is eating your lunch.
At KDD I enjoyed Stephen Boyd‘s invited talk about optimization quite a bit. However, the most interesting talk for me was David Haussler‘s. His talk started out with a formidable load of biological complexity. About half-way through you start wondering, “can this be used to help with cancer?” And at the end he connects it directly to use with a call to arms for the audience: cure cancer. The core thesis here is that cancer is a complex set of diseases which can be distentangled via genetic assays, allowing attacking the specific signature of individual cancers. However, the data quantity and complex dependencies within the data require systematic and relatively automatic prediction and analysis algorithms of the kind that we are best familiar with.
Some of the papers which interested me are:
- Kai-Wei Chang and Dan Roth, Selective Block Minimization for Faster Convergence of Limited Memory Large-Scale Linear Models, which is about effectively using a hard-example cache to speedup learning.
- Leland Wilkinson, Anushka Anand, and Dang Nhon Tuan, CHIRP: A New Classifier Based on Composite Hypercubes on Iterated Random Projections. The bar on creating new classifiers is pretty high. The approach here uses a combination of random projection and partition which appears to be compelling for some nonlinear and relatively high computation settings. They do a more thorough empirical evaluation than most papers.
- Zhuang Wang, Nemanja Djuric, Koby Crammer, and Slobodan Vucetic Trading Representability for Scalability: Adaptive Multi-Hyperplane Machine for Nonlinear Classification. The paper explores an interesting idea: having lots of weight vectors (effectively infinity) associated with a particular label, showing that algorithms on this representation can deal with lots of data as per linear predictors, but with superior-to-linear performance. The authors don’t use the hashing trick, but their representation is begging for it.
- Michael Bruckner and Tobias Scheffer, Stackelberg Games for Adversarial Prediction Problem. This is about email spam filtering, where the authors use a theory of adversarial equilibria to construct a more robust filter, at least in some cases. Demonstrating this on noninteractive data is inherently difficult.
There were also three papers that were about creating (or perhaps composing) learning systems to do something cool.
- Gideon Dror, Yehuda Koren, Yoelle Maarek, and Idan Szpektor, I Want to Answer, Who Has a Question? Yahoo! Answers Recommender System. This is about how to learn to route a question to the appropriate answerer automatically.
- Yehuda Koren, Edo Liberty, Yoelle Maarek, and Roman Sandler, Automatically Tagging Email by Leveraging Other Users’ Folders. This is about helping people organize their email with machine learning.
- D. Sculley, Matthew Eric Otey, Michael Pohl, Bridget Spitznagel, John Hainsworth, Yunkai Zhou, Detecting Adversarial Advertisements in the Wild. The title is an excellent abstract here, and there are quite a few details about the implementation.
I also attended MUCMD, a workshop on the Meaningful Use of Complex Medical Data shortly afterwards. This workshop is about the emergent area of using data to improve medicine. The combination of electronic health records, the economic importance of getting medicine right, and the relatively weak use of existing data implies there is much good work to do.
This finally gave us a chance to discuss radically superior medical trial designs based on work in exploration and learning
Jeff Hammerbacher‘s talk was a hilarilously blunt and well stated monologue about the need and how to gather data in a usable way.
Amongst the talks on using medical data, Suchi Saria‘s seemed the most mature. They’ve constructed a noninvasive test for problem infants which is radically superior to the existing Apgar score according to leave-one-out cross validation.
From the doctor’s side, there was discussion of the deep balkanization of data systems within hospitals, efforts to overcome that, and the (un)trustworthiness of data. Many issues clearly remain here, but it also looks like serious progress is being made.
Overall, the workshop went well, with the broad cross-section of talks providing quite a bit of extra context you don’t normally see. It left me believing that a community centered on MUCMD is rising now, with attendant workshops, conferences, etc… to be expected.
Many Machine Learning related events are coming up this fall.
- September 9, abstracts for the New York Machine Learning Symposium are due. Send a 2 page pdf, if interested, and note that we:
- widened submissions to be from anybody rather than students.
- set aside a larger fraction of time for contributed submissions.
- September 15, there is a machine learning meetup, where I’ll be discussing terascale learning at AOL.
- September 16, there is a CS&Econ day at New York Academy of Sciences. This is not ML focused, but it’s easy to imagine interest.
- September 23 and later NIPS workshop submissions start coming due. As usual, there are too many good ones, so I won’t be able to attend all those that interest me. I do hope some workshop makers consider ICML this coming summer, as we are increasing to a 2 day format for you. Here are a few that interest me:
- Big Learning is about dealing with lots of data. Abstracts are due September 30.
- The Bayes Bandits workshop. Abstracts are due September 23.
- The Personalized Medicine workshop
- The Learning Semantics workshop. Abstracts are due September 26.
- The ML Relations workshop. Abstracts are due September 30.
- The Hierarchical Learning workshop. Challenge submissions are due October 17, and abstracts are due October 21.
- The Computational Tradeoffs workshop. Abstracts are due October 17.
- The Model Selection workshop. Abstracts are due September 24.
- October 16-17 is the Singularity Summit in New York. This is for the AIists and only peripherally about ML.
- October 16-21 is a Predictive Analytics World in New York. As machine learning goes industrial, we see industrial-style conferences rapidly developing.
- October 21, there is the New York ML Symposium. In addition to what’s there, Chris Wiggins is looking into setting up a session for startups and those interested in them to get to know each other, as last year.
- Decembr 16-17 NIPS workshops in Granada, Spain.
Ron Bekkerman initiated an effort to create an edited book on parallel machine learning that Misha and I have been helping with. The breadth of efforts to parallelize machine learning surprised me: I was only aware of a small fraction initially.
This put us in a unique position, with knowledge of a wide array of different efforts, so it is natural to put together a survey tutorial on the subject of parallel learning for KDD, tomorrow. This tutorial is not limited to the book itself however, as several interesting new algorithms have come out since we started inviting chapters.
This tutorial should interest anyone trying to use machine learning on significant quantities of data, anyone interested in developing algorithms for such, and of course who has bragging rights to the fastest learning algorithm on planet earth
(Also note the Modeling with Hadoop tutorial just before ours which deals with one way of trying to speed up learning algorithms. We have almost no overlap.)
I just released Vowpal Wabbit 6.0. Since the last version:
- VW is now 2-3 orders of magnitude faster at linear learning, primarily thanks to Alekh. Given the baseline, this is loads of fun, allowing us to easily deal with terafeature datasets, and dwarfing the scale of any other open source projects. The core improvement here comes from effective parallelization over kilonode clusters (either Hadoop or not). This code is highly scalable, so it even helps with clusters of size 2 (and doesn’t hurt for clusters of size 1). The core allreduce technique appears widely and easily reused—we’ve already used it to parallelize Conjugate Gradient, LBFGS, and two variants of online learning. We’ll be documenting how to do this more thoroughly, but for now “README_cluster” and associated scripts should provide a good starting point.
- The new LBFGS code from Miro seems to commonly dominate the existing conjugate gradient code in time/quality tradeoffs.
- The new matrix factorization code from Jake adds a core algorithm.
- We finally have basic persistent daemon support, again with Jake’s help.
- Adaptive gradient calculations can now be made dimensionally correct, following up on Paul’s post, yielding a better algorithm. And Nikos sped it up further with SSE native inverse square root.
- The LDA core is perhaps twice as fast after Paul educated us about SSE and representational gymnastics.
All of the above was done without adding significant new dependencies, so the code should compile easily.
The VW mailing list has been slowly growing, and is a good place to ask questions.
I had a chance to attend UAI this year, where several papers interested me, including:
- Hoifung Poon and Pedro Domingos Sum-Product Networks: A New Deep Architecture. We’ve already discussed this one, but in a nutshell, they identify a large class of efficiently normalizable distributions and do learning with it.
- Yao-Liang Yu and Dale Schuurmans, Rank/norm regularization with closed-form solutions: Application to subspace clustering. This paper is about matrices, and in particular they prove that certain matrices are the solution of matrix optimizations. I’m not matrix inclined enough to fully appreciate this one, but I believe many others may be, and anytime closed form solutions come into play, you get 2 order of magnitude speedups, as they show experimentally.
- Laurent Charlin, Richard Zemel and Craig Boutilier, A Framework for Optimizing Paper Matching. This is about what works in matching papers to reviewers, as has been tested at several previous NIPS. We are looking into using this system for ICML 2012.
In addition I wanted to comment on Karl Friston‘s invited talk. At the outset, he made a claim that seems outlandish to me: The way the brain works is to minimize surprise as measured by a probabilistic model. The majority of the talk was not actually about this—instead it was about how probabilistic models can plausibly do things that you might not have thought possible, such as birdsong. Nevertheless, I think several of us in the room ended up stuck on the claim in questions afterward.
My personal belief is that world modeling (probabilistic or not) is a useful subroutine for intelligence, but it could not possibly be the entirety of intelligence. A key reason for this is the bandwidth of our senses—we simply take in too much information to model everything with equal attention. It seems critical for the efficient functioning of intelligence that only things which might plausibly matter are modeled, and only to the degree that matters. In other words, I do not model the precise placement of items on my desk, or even the precise content of my desk, because these details simply do not matter.
This argument can be made in another way. Suppose for the moment that all the brain does is probabilistic modeling. Then, the primary notion of failure to model is “surprise”, which is low probability events occurring. Surprises (stumbles, car wrecks, and other accidents) certainly can be unpleasant, but this could be correct if modeling is a subroutine as well. The clincher is that there are many unpleasant things which are not surprises, including keeping your head under water, fasting, and self-inflicted wounds.
Accounting for the unpleasantness of these events requires more than probabilistic modeling. In other words, it requires rewards, which is why reinforcement learning is important. As a byproduct, rewards also naturally create a focus of attention, addressing the computational efficiency issue. Believing that intelligence is just probabilistic modeling is another example of simple wrong answer.
Since John did not attend COLT this year, I have been volunteered to report back on the hot stuff at this year’s meeting. The conference seemed to have pretty high quality stuff this year, and I found plenty of interesting papers on all the three days. I’m gonna pick some of my favorites going through the program in a chronological order.
The first session on matrices seemed interesting for two reasons. First, the papers were quite nice. But more interestingly, this is a topic that has had a lot of presence in Statistics and Compressed sensing literature recently. So it was good to see high-dimensional matrices finally make their entry at COLT. The paper of Ohad and Shai on Collaborative Filtering with the Trace Norm: Learning, Bounding, and Transducing provides non-trivial guarantees on trace norm regularization in an agnostic setup, while Rina and Nati show how Rademacher averages can be used to get sharper results for matrix completion problems in their paper Concentration-Based Guarantees for Low-Rank Matrix Reconstruction. Both the papers seemed to share the flavor of a learning theorists’ take at compressed sensing that I enjoyed seeing.
The best student paper by Amit, Sivan and Shai2 on Multiclass Learnability and the ERM principle showed a crucial distinction between binary and multiclass classification. Every ERM procedure is not equally good for multiclass classification. In particular, there are multiclass problems where some ERM learners succeed while others are inconsistent, in sharp contrast to the binary case. They also present some intuition on what characterizes a good ERM procedure for the multiclass setting.
I enjoyed all the three papers in the online learning session quite a bit. Jake, Elad and Peter show the equivalence of Blackwell approachability and low regret in their paper Blackwell Approachability and No-Regret Learning are Equivalent, with applications to efficient algorithms for calibration. Sasha, Karthik and Ambuj won the best paper award for their paper Online Learning: Beyond Regret which shows how the tools like sequential Rademacher averages and sequential covering numbers can be used to capture the minimax value of a large class of games, beyond just external regret settings. Their paper with Dean on Complexity-Based Approach to Calibration with Checking Rules showed a nice application of these techniques to the calibration problem.
The impromptu session was quite a hit this year with ~15 talks. I was quite disappointed to see none of them turn up on my NIPS review stack
Rob managed to save some money by solving his open problem from last COLT, together with Indraneel and Cynthia. Their paper The Rate of Convergence of AdaBoost was interesting as it made me realize how much difference the boundedness of rates can make to the theoretical properties of an algorithm. Adaboost is greedy coordinate descent, for which convergence over a compact domain is well-studied, but what makes this challenging here is that Adaboost doesn’t impose any bound on the weights. The way this paper gets around and pays a penalty for these issues seemed quite interesting.
I also liked the paper by Gabor, David and Csaba on Minimax Regret of Finite Partial-Monitoring Games in Stochastic Environments. This paper shows a tight characterization of partial regret games that have an optimal regret of 0, T1/2, T2/3 or T. While some sufficient conditions one way or the other were known before, theirs is a first complete characterization in my knowledge.
Overall, this was a thoroughly enjoyable COLT, both in the technical content and in the choice of venue.
Maybe it’s too early to call, but with four separate Neural Network sessions at this year’s ICML, it looks like Neural Networks are making a comeback. Here are my highlights of these sessions. In general, my feeling is that these papers both demystify deep learning and show its broader applicability.
The first observation I made is that the once disreputable “Neural” nomenclature is being used again in lieu of “deep learning”. Maybe it’s because Adam Coates et al. showed that single layer networks can work surprisingly well.
- An Analysis of Single-Layer Networks in Unsupervised Feature Learning, Adam Coates, Honglak Lee, Andrew Y. Ng (AISTATS 2011)
- The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization, Adam Coates, Andrew Y. Ng (ICML 2011)
Another surprising result out of Andrew Ng’s group comes from Andrew Saxe et al. who show that certain convolutional pooling architectures can obtain close to state-of-the-art performance with random weights (that is, without actually learning).
- On Random Weights and Unsupervised Feature Learning, Andrew M. Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, Andrew Y. Ng
Of course, in most cases we do want to train these models eventually. There were two interesting papers on the topic of training neural networks. In the first, Quoc Le et al. show that a simple, off-the-shelf L-BFGS optimizer is often preferable to stochastic gradient descent.
- On optimization methods for deep learning, Quoc V. Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, Andrew Y. Ng
Secondly, Martens and Sutskever from Geoff Hinton’s group show how to train recurrent neural networks for sequence tasks that exhibit very long range dependencies:
It will be interesting to see whether this type of training will allow recurrent neural networks to outperform CRFs on some standard sequence tasks and data sets. It certainly seems possible since even with standard L-BFGS our recursive neural network (see previous post) can outperform CRF-type models on several challenging computer vision tasks such as semantic segmentation of scene images. This common vision task of labeling each pixel with an object class has not received much attention from the deep learning community.
Apart from the vision experiments, this paper further solidifies the trend that neural networks are being used more and more in natural language processing. In our case, the RNN-based model was used for structure prediction. Another neat example of this trend comes from Yann Dauphin et al. in Yoshua Bengio’s group. They present an interesting solution for learning with sparse bag-of-word representations.
- Large-Scale Learning of Embeddings with Reconstruction Sampling, Yann Dauphin, Xavier Glorot, Yoshua Bengio
Such sparse representations had previously been problematic for neural architectures.
In summary, these papers have helped us understand a bit better which “deep” or “neural” architectures work, why they work and how we should train them. Furthermore, the scope of problems that these architectures can handle has been widened to harder and more real-life problems.
Of the non-neural papers, these two papers stood out for me:
- Sparse Additive Generative Models of Text, Jacob Eisenstein, Amr Ahmed, Eric Xing – the idea is to model each topic only in terms of how it differs from a background distribution.
- Message Passing Algorithms for Dirichlet Diffusion Trees, David A. Knowles, Jurgen Van Gael, Zoubin Ghahramani – I’ve been intrigued by Dirichlet Diffusion Trees but inference always seemed too slow. This paper might solve that problem.