ICML 2011 and the future

Unfortunately, I ended up sick for much of this ICML. I did manage to catch one interesting paper:

Richard Socher, Cliff Lin, Andrew Y. Ng, and Christopher D. Manning Parsing Natural Scenes and Natural Language with Recursive Neural Networks.

I invited Richard to share his list of interesting papers, so hopefully we’ll hear from him soon. In the meantime, Paul and Hal have posted some lists.

the future

Joelle and I are program chairs for ICML 2012 in Edinburgh, which I previously enjoyed visiting in 2005. This is a huge responsibility, that we hope to accomplish well. A part of this (perhaps the most fun part), is imagining how we can make ICML better. A key and critical constraint is choosing things that can be accomplished. So far we have:

  1. Colocation. The first thing we looked into was potential colocations. We quickly discovered that many other conferences precomitted their location. For the future, getting a colocation with ACL or SIGIR, seems to require more advanced planning. If that can be done, I believe there is substantial interest—I understand there was substantial interest in the joint symposium this year. What we did manage was achieving a colocation with COLT and there is an outside chance that a machine learning summer school will precede the main conference. The colocation with COLT is in both time and space, with COLT organized as (essentially) a separate track in a nearby building. We look forward to organizing a joint invited session or two with the COLT program chairs.
  2. Tutorials. We don’t have anything imaginative here, except for pushing for quality tutorials, probably through a mixture of invitations and a call. There is a small chance we’ll be able to organize a machine learning summer school as a prequel, which would be quite cool, but several things have to break right for this to occur.
  3. Conference. We are considering a few tinkerings with the conference format.
    1. Shifting a conference banquet to be during the workshops, more tightly integrating the workshops.
    2. Having 3 nights of posters (1 per day) rather than 2 nights. This provides more time/poster, and avoids halving talks and posters appear on different days.
    3. Having impromptu sessions in the evening. Two possibilities here are impromptu talks and perhaps a joint open problems session with COLT. I’ve made sure we have rooms available so others can organize other things.
    4. We may go for short presentations (+ a poster) for some papers, depending on how things work out schedulewise. My opinions on this are complex. ICML is traditionally multitrack with all papers having a 25 minute-ish presentation. As a mechanism for research, I believe this is superior to a single track conference of a similar size because:
      1. Typically some talk of potential interest can always be found by participants avoiding the boredom problem which comes up at a single track conference
      2. My experience is that program organizers have a limited ability to foresee which talks are of most interest, commonly creating a misallocation of attention.

      On the other hand, there are clearly limits to the number of tracks that are reasonable, and I feel like ICML (especially with COLT cotimed) is near the upper limit. There are also some papers which have a limited scope of interest, for which a shorter presentation is reasonable.

  4. Workshops. A big change here—we want to experiment with 2 days of workshops rather than 1. There seems to be demand for it, as the number of workshops historically is about 10, enough that it’s easy to imagine people commonly interested in 2 workshops. It’s also the case that NIPS has had to start rejecting a substantial fraction of workshop submissions for space reasons. I am personally a big believer in workshops as a mechanism for further research, so I hope this works out well.
  5. Journal integration. I tend to believe that we should be shifting to a journal format for ICML papers, as per many past discussions. After thinking about this the easiest way seems to be simply piggybacking on existing journals such as JMLR and MLJ by essentially declaring that people could submit there first, and if accepted, and not otherwise presented at a conference, present at ICML. This was considered too large a change, so it is not happening. Nevertheless, it is a possible tweak that I believe should be considered for the future. My best guess is that this would never displace the baseline conference review process, but it would help some papers that don’t naturally fit into a conference format while keeping quality high.
  6. Reviewing. Drawing on plentiful experience with what goes wrong, I think we can create the best reviewing system for conferences. We are still debating exact details here while working through what is possible in different conference systems. Nevertheless, some basic goals are:
    1. Double Blind [routine now] Two identical papers with different authors should have the same chance of success. In terms of reviewing quality, I think double blind makes little difference in the short term, but the public commitment to fair reviewing makes a real difference in the long term.
    2. Author Feedback [routine now] Author feedback makes a difference in only a small minority of decisions, but I believe its effect is larger as (a) reviewer quality improves and (b) reviewer understanding improves. Both of these are silent improvers of quality. Somewhat less routine, we are seeking a mechanism for authors to be able to provide feedback if additional reviews are requested, as I’ve become cautious of the late-breaking highly negative review.
    3. Paper Editing. Geoff Gordon tweaked AIStats this year to allow authors to revise papers during feedback. I think this is helpful, because it encourages authors to fix clarity issues immediately, rather than waiting longer. This helps with some things, but it is not a panacea—authors still have to convince reviewers their paper is worthwhile, and given the way people are first impressions are lasting impressions.
    4. Multisource reviewing. We want all of the initial reviews to be assigned by good yet different mechanisms. In the past, I’ve observed that the source of reviewer assignments can greatly bias the decision outcome, all the way from “accept with minor revisions” to “reject” in the case of a JMLR submission that I had. Our plan at the moment is that one review will be assigned by bidding, one by a primary area chair, and one by a secondary area chair.
    5. No single points of failure. When Bob Williamson and I were PC members for learning theory at NIPS, we each came to a decisions given reviews and then reconciled differences. This made a difference on about 5-10% of decisions, and (I believe) improved overall quality a bit. More generally, I’ve seen instances where an area chair has an unjustifiable dislike for a paper and kills it off, which this mechanism avoids.
    6. Speed. In general, I believe speed and good decision making are antagonistic. Nevertheless, we believe it is important to try to do the reviewing both quickly and well. Doing things quickly implies that we can push the submission deadline back later, providing authors more time to make quality papers. Key elements of doing things well fast are: good organization (that’s all on us), light loads for everyone involved (i.e. not too many papers), crowd sourcing (i.e. most decisions made by area chairs), and some amount of asynchrony. Altogether, we believe at the moment that two weeks can be shaved from our reviewing process.
  7. Website. Traditionally at ICML, every new local organizer was responsible for creating a website. This doesn’t make sense anymore, because substantial work is required there, which can and should be amortized across the years so that the website can evolve to do more for the community. We plant to create a permanent website, based around some combination of icml.cc and machinelearning.org. I think this just makes sense.
  8. Publishing. We are thinking about strongly encouraging authors to use arxiv for final submissions. This provides a lasting backing store for ICML papers, as well as a mechanism for revisions. The reality here is that some mistakes get into even final drafts, so a way to revise for the long term is helpful. We are also planning to videotape and make available all talks, although a decision between videolectures and Weyond has not yet been made.

Implementing all the changes above is ambitious, but I believe feasible and that each is individually beneficial and to some extent individually evaluatable. I’d like to hear any thoughts you have on this. It’s also not too late if you have further suggestions of your own.

Ultra LDA

Shravan and Alex‘s LDA code is released. On a single machine, I’m not sure how it currently compares to the online LDA in VW, but the ability to effectively scale across very many machines is surely interesting.

Research Directions for Machine Learning and Algorithms

Muthu invited me to the workshop on algorithms in the field, with the goal of providing a sense of where near-term research should go. When the time came though, I bargained for a post instead, which provides a chance for many other people to comment.

There are several things I didn’t fully understand when I went to Yahoo! about 5 years ago. I’d like to repeat them as people in academia may not yet understand them intuitively.

  1. Almost all the big impact algorithms operate in pseudo-linear or better time. Think about caching, hashing, sorting, filtering, etc… and you have a sense of what some of the most heavily used algorithms are. This matters quite a bit to Machine Learning research, because people often work with superlinear time algorithms and languages. Two very common examples of this are graphical models, where inference is often a superlinear operation—think about the n2 dependence on the number of states in a Hidden Markov Model and Kernelized Support Vector Machines where optimization is typically quadratic or worse. There are two basic reasons for this. The most obvious is that linear time allows you to deal with large datasets. A less obvious but critical point is that a superlinear time algorithm is inherently buggy—it has an unreliable running time that can easily explode if you accidentally give it too much input.
  2. Almost anything worth doing requires many people working together. This happens for many reasons. One is the time-critical aspect of development—in many places it really is worthwhile to pay more to have something developed faster. Another is that real projects are simply much bigger than you might otherwise expect. A third is that real organizations have people coming and going, and any project that is just by one person whithers when they leave. This observation is what means that development of systems with clean abstractions can be extraordinarily helpful, as it allows people to work independently. This observation also means that simple widely applicable tricks (for example the hashing trick) can be broadly helpful.

A good way to phrase research directions is with questions. Here are a few of my natural questions.

  1. How do we efficiently learn in settings where exploration is required? These are settings where the choice of action you take influences the observed reward—ad display and medical testing are two good scenarios. This is deeply critical to many applications, because the learning with exploration setting is inherently more natural than the standard supervised learning setting. The tutorial we did details much of the state of the art here, but very significant questions remain. How can we do efficient offline evaluation of algorithms (see here for a first attempt)? How can we be both efficient in sample complexity and computational complexity? Several groups are interested in sampling from a Bayesian posterior to solve these sorts of problems. Where and when can this be proved to work? (There is essentially no analysis.) What is a maximally distributed and incentive compatible algorithm that remains efficient? The last question is very natural for market place design. How can we best construct reward functions operating on different time scales? What is the relationship between the realizable and agnostic versions of this setting, and how can we construct an algorithm which smoothly interpolates between the two?
  2. How can we learn from lots of data? We’ll be presenting a KDD survey tutorial about what’s been done. Some of the larger scale learning problems have been addressed effectively using MapReduce. The best example here I know is known as Ozgur Cetin’s algorithm at Y!—It’s preconditioned conjugate gradient with a Newton stepsize using two passes over examples per step. (A nonHadoop version is implemented in VW for reference.) But linear predictors are not enough—we would like learning algorithms that can for example learn from all the images in the world. Doing this well plausibly requires a new approach and new learning algorithms. A key observation here is that the bandwidth required by the learning algorithm can not be too great.
  3. How can we learn to index efficiently? The standard solution in information retrieval is to evaluate (or approximately evaluate) all objects in a database returning the elements with the largest score according to some learned or constructed scoring function. This is an inherently O(n) operation, which is frustrating when it’s plausible that an exponentially faster O(log(n)) solution exists. A good solution involves both theory and empirical work here, as we need to think about how to think about how to solve the problem, and of course we need to solve it.
  4. What is a flexible inherently efficient language for architecting representations for learning algorithms? Right now, graphical models often get (mis)used for this purpose. It’s easy and natural to pose a computationally intractable graphical model, implying many real applications involve approximations. A better solution would be to use a different representation language which was always computationally tractable yet flexible enough to solve real-world problems. One starting point for this is Searn. Another general approach was the topic of the Coarse to fine workshop. These are inherently related as Coarse to fine is a pruned breadth first search. Restated, it is not enough to have a language for specifying your prior structural beliefs—instead we must have a language for such which results in computationally tractable solutions.
  5. The Deep Learning problem remains interesting. How do you effectively learn complex nonlinearities capable of better performance than a basic linear predictor? An effective solution avoids feature engineering. Right now, this is almost entirely dealt with empirically, but theory could easily have a role to play in phrasing appropriate optimization algorithms, for example.

Good solutions to each of the research directions above would result in revolutions in their area, and everyone of them would plausibly see wide applicability.

What’s missing?

Cross-posted at CACM.

The End of the Beginning of Active Learning

This post is by Daniel Hsu and John Langford.

In selective sampling style active learning, a learning algorithm chooses which examples to label. We now have an active learning algorithm that is:

  1. Efficient in label complexity, unlabeled complexity, and computational complexity.
  2. Competitive with supervised learning anywhere that supervised learning works.
  3. Compatible with online learning, with any optimization-based learning algorithm, with any loss function, with offline testing, and even with changing learning algorithms.
  4. Empirically effective.

The basic idea is to combine disagreement region-based sampling with importance weighting: an example is selected to be labeled with probability proportional to how useful it is for distinguishing among near-optimal classifiers, and labeled examples are importance-weighted by the inverse of these probabilities. The combination of these simple ideas removes the sampling bias problem that has plagued many previous heuristics for active learning, and yet leads to a general and flexible method that enjoys the desirable traits described above. None of these desirable criteria are individually sufficient, but the simultaneous satisfaction of all criteria by one algorithm is compelling.

This combination of traits implies that active learning is a viable choice in most places where supervised learning is a viable choice. 6 years ago, we didn’t know how to deal with adversarial label noise, and didn’t know how to characterize where active learning helped over supervised learning. 5.5 years ago we had the first breakthroughs in characterization and learning with adversarial label noise. Several more substantial improvements occurred, leading to a tutorial 2 years ago and discussion about what’s next. Since then, we cracked question (2) here and applied it to get an effective absurdly efficient active learning algorithm. Directions for experimenting with it are here, page 47.

As research programs go, we’d like to declare victory, but can’t. Victory in research is when ideas get used, becoming part of the standard repertoire of tools and ways people think. Instead, it seems fair to declare “theory victory” which is more like a milestone in the grand scheme. We’ve hit a point where anyone versed in these results can comfortably and effectively apply active learning instead of supervised learning (caveats below). Whether or not this leads to a real victory depends a great deal on how this gets used. In achieving “theory victory”, the key people were:

  1. Nina Balcan*
  2. Alina Beygelzimer
  3. Sanjoy Dasgupta
  4. Steve Hanneke*
  5. Daniel Hsu*
  6. Matti Kääriäinen
  7. Nikos Karampatziakis
  8. [Added from Daniel] Vladimir Koltchinskii
  9. John Langford
  10. Claire Monteleoni*
  11. Tong Zhang

(*)=thesis work.

Naturally, there are many caveats to the above. We moved as fast as we could towards an effective sound useful algorithm according to the standard IID-only assumption criteria of supervised learning. This means that our understanding is not particularly broad, and a number of questions remain including 1,3,4 here.

  1. The existing solution is a simple algorithm with a complex analysis. Simplifying and tightening this analysis would be quite helpful—it’s not even clear we have the best-possible functional form yet. In addition, we rely on the abstraction of a learning algorithm as an effective ERM algorithm when applying the algorithm in practice. That’s plausibly reasonable, but it may also breakdown, implying that experiments beyond linear and decision tree architectures could be helpful.
  2. The extreme efficiency of active learning achieved opens up the possibility of using it for efficient parallel learning and other information-distillation purposes.
  3. Our understanding of the limits of active learning is not complete: various lower bounds have been identified, but a gap remains relative to the known upper bounds. Are the label complexity upper bounds achieved by the general scheme tight for a general class of active learning algorithms, or can the algorithms be improved using new techniques without sacrificing consistency?
  4. Can active learning succeed under different generative / statistical assumptions (e.g., fully-adversarial data, alternative labeling oracles)? Some recent progress on fully-adversarial active learning has been made by Nicolo Cesa-Bianchi, Claudio Gentile, and Francesco Orabona and Ofer Dekel, Claudio Gentile, and Karthik Sridharan, with the latter also providing a solution for learning with multiple heterogeneous labeling oracles (e.g. Mechanical Turk). At the other end of the spectrum, the work of Andrew Guillory and Jeff Bilmes and Daniel Golovin and Andreas Krause (building on some earlier findings of Sanjoy Dasgupta) looks at average-case / Bayesian analyses of greedy algorithms. The resulting approximation factor-type guarantees can be reassuring to the practitioner when justifying a particular choice of sampling strategy; the trade-off here is that the guarantee holds in a somewhat limited sense.
  5. Can active learning be effectively combined with semi-supervised and unsupervised learning? It is known from the rich area of semi-supervised learning that unlabeled data can suggest learning biases (e.g., large margin separators, low dimensional structure) that may improve performance over supervised learning, especially when labeled data are few. When these biases are not aligned with reality, however, performance can be significantly degraded; this is a common but serious criticism of semi-supervised learning. A basic observation is that active learning provides the opportunity to validate or refute these biases using label queries, and also to subsequently revise them. Thus, it seems that active learners ought to be able to pursue learning biases much more aggressively than passive learners. A few works on cluster-based sampling and multi-view active learning have appeared, but much remains to be discovered.