April 2011 – Machine Learning (Theory)

4/23/2011

ICML workshops due

Lihong points out that ICML workshop submissions are due April 29.

4/20/20114/20/2011

The End of the Beginning of Active Learning

This post is by Daniel Hsu and John Langford.

In selective sampling style active learning, a learning algorithm chooses which examples to label. We now have an active learning algorithm that is:

Efficient in label complexity, unlabeled complexity, and computational complexity.
Competitive with supervised learning anywhere that supervised learning works.
Compatible with online learning, with any optimization-based learning algorithm, with any loss function, with offline testing, and even with changing learning algorithms.
Empirically effective.

The basic idea is to combine disagreement region-based sampling with importance weighting: an example is selected to be labeled with probability proportional to how useful it is for distinguishing among near-optimal classifiers, and labeled examples are importance-weighted by the inverse of these probabilities. The combination of these simple ideas removes the sampling bias problem that has plagued many previous heuristics for active learning, and yet leads to a general and flexible method that enjoys the desirable traits described above. None of these desirable criteria are individually sufficient, but the simultaneous satisfaction of all criteria by one algorithm is compelling.

This combination of traits implies that active learning is a viable choice in most places where supervised learning is a viable choice. 6 years ago, we didn’t know how to deal with adversarial label noise, and didn’t know how to characterize where active learning helped over supervised learning. 5.5 years ago we had the first breakthroughs in characterization and learning with adversarial label noise. Several more substantial improvements occurred, leading to a tutorial 2 years ago and discussion about what’s next. Since then, we cracked question (2) here and applied it to get an effective absurdly efficient active learning algorithm. Directions for experimenting with it are here, page 47.

As research programs go, we’d like to declare victory, but can’t. Victory in research is when ideas get used, becoming part of the standard repertoire of tools and ways people think. Instead, it seems fair to declare “theory victory” which is more like a milestone in the grand scheme. We’ve hit a point where anyone versed in these results can comfortably and effectively apply active learning instead of supervised learning (caveats below). Whether or not this leads to a real victory depends a great deal on how this gets used. In achieving “theory victory”, the key people were:

Nina Balcan^*
Alina Beygelzimer
Sanjoy Dasgupta
Steve Hanneke^*
Daniel Hsu^*
Matti Kääriäinen
Nikos Karampatziakis
[Added from Daniel] Vladimir Koltchinskii
John Langford
Claire Monteleoni^*
Tong Zhang

(*)=thesis work.

Naturally, there are many caveats to the above. We moved as fast as we could towards an effective sound useful algorithm according to the standard IID-only assumption criteria of supervised learning. This means that our understanding is not particularly broad, and a number of questions remain including 1,3,4 here.

The existing solution is a simple algorithm with a complex analysis. Simplifying and tightening this analysis would be quite helpful—it’s not even clear we have the best-possible functional form yet. In addition, we rely on the abstraction of a learning algorithm as an effective ERM algorithm when applying the algorithm in practice. That’s plausibly reasonable, but it may also breakdown, implying that experiments beyond linear and decision tree architectures could be helpful.
The extreme efficiency of active learning achieved opens up the possibility of using it for efficient parallel learning and other information-distillation purposes.
Our understanding of the limits of active learning is not complete: various lower bounds have been identified, but a gap remains relative to the known upper bounds. Are the label complexity upper bounds achieved by the general scheme tight for a general class of active learning algorithms, or can the algorithms be improved using new techniques without sacrificing consistency?
Can active learning succeed under different generative / statistical assumptions (e.g., fully-adversarial data, alternative labeling oracles)? Some recent progress on fully-adversarial active learning has been made by Nicolo Cesa-Bianchi, Claudio Gentile, and Francesco Orabona and Ofer Dekel, Claudio Gentile, and Karthik Sridharan, with the latter also providing a solution for learning with multiple heterogeneous labeling oracles (e.g. Mechanical Turk). At the other end of the spectrum, the work of Andrew Guillory and Jeff Bilmes and Daniel Golovin and Andreas Krause (building on some earlier findings of Sanjoy Dasgupta) looks at average-case / Bayesian analyses of greedy algorithms. The resulting approximation factor-type guarantees can be reassuring to the practitioner when justifying a particular choice of sampling strategy; the trade-off here is that the guarantee holds in a somewhat limited sense.
Can active learning be effectively combined with semi-supervised and unsupervised learning? It is known from the rich area of semi-supervised learning that unlabeled data can suggest learning biases (e.g., large margin separators, low dimensional structure) that may improve performance over supervised learning, especially when labeled data are few. When these biases are not aligned with reality, however, performance can be significantly degraded; this is a common but serious criticism of semi-supervised learning. A basic observation is that active learning provides the opportunity to validate or refute these biases using label queries, and also to subsequently revise them. Thus, it seems that active learners ought to be able to pursue learning biases much more aggressively than passive learners. A few works on cluster-based sampling and multi-view active learning have appeared, but much remains to be discovered.

4/18/20114/18/2011

A paper not at Snowbird

Unfortunately, a scheduling failure meant I missed all of AIStat and most of the learning workshop, otherwise known as Snowbird, when it’s at Snowbird.

At snowbird, the talk on Sum-Product networks by Hoifung Poon stood out to me (Pedro Domingos is a coauthor.). The basic point was that by appropriately constructing networks based on sums and products, the normalization problem in probabilistic models is eliminated, yielding a highly tractable yet flexible representation+learning algorithm. As an algorithm, this is noticeably cleaner than deep belief networks with a claim to being an order of magnitude faster and working better on an image completion task.

Snowbird doesn’t have real papers—just the abstract above. I look forward to seeing the paper. (added: Rodrigo points out the deep learning workshop draft.)

4/11/2011

The Heritage Health Prize

The Heritage Health Prize is potentially the largest prediction prize yet at $3M, which is sure to get many people interested. Several elements of the competition may be worth discussing.

The most straightforward way for HPN to deploy this predictor is in determining who to cover with insurance. This might easily cover the costs of running the contest itself, but the value to the health system of a whole is minimal, as people not covered still exist. While HPN itself is a provider network, they have active relationships with a number of insurance companies, and the right to resell any entrant. It’s worth keeping in mind that the research and development may nevertheless end up being useful in the longer term, especially as entrants also keep the right to their code.
The judging metric is something I haven’t seen previously. If a patient has probability 0.5 of being in the hospital 0 days and probability 0.5 of being in the hospital ~53.6 days, the optimal prediction in expectation is ~6.4 days. This is evidence against point (1) above, since cost is probably closer to linear in the number of hospital days. As a starting point, I suspect many people will simply optimize conditional squared loss and then back out an inferred prediction according to p=e^x-1, with clipping. The standard approach of ensembling should be effective.
The team structure seems a bit strange to me. I’m not sure there is a good reason for it from a prediction point of view and 8 may be too hard a limit on team size, imposing bin packing problems on the entrants.
Privacy is clearly a huge concern. They anonymized the data, require entrants to protect the data, and admonish people to not try to break privacy. Despite that, the data will be released to large numbers of people, so I wouldn’t be surprised if someone attempts a join attack of some sort. Whether or not a join attack succeeds could make a huge difference in how this contest is viewed in the long term.
The Accuracy Threshold is a big deal. If they set it at an out-of-reach point (which they could easily do), the size of the prize becomes 0.5M. This part of the contest is supposed to be determined next month.

This contest is not a slam-dunk, but is has the potential to become one, and I’ll be interested to see how it turns out.

4/6/20114/6/2011

COLT open questions

Alina and Jake point out the COLT Call for Open Questions due May 11. In general, this is cool, and worth doing if you can come up with a crisp question. In my case, I particularly enjoyed crafting an open question with precisely a form such that a critic targeting my papers would be forced to confront their fallacy or make a case for the reward. But less esoterically, this is a way to get the attention of some very smart people focused on a problem that really matters, which is the real value.