The End of the Beginning of Active Learning

This post is by Daniel Hsu and John Langford.

In selective sampling style active learning, a learning algorithm chooses which examples to label. We now have an active learning algorithm that is:

  1. Efficient in label complexity, unlabeled complexity, and computational complexity.
  2. Competitive with supervised learning anywhere that supervised learning works.
  3. Compatible with online learning, with any optimization-based learning algorithm, with any loss function, with offline testing, and even with changing learning algorithms.
  4. Empirically effective.

The basic idea is to combine disagreement region-based sampling with importance weighting: an example is selected to be labeled with probability proportional to how useful it is for distinguishing among near-optimal classifiers, and labeled examples are importance-weighted by the inverse of these probabilities. The combination of these simple ideas removes the sampling bias problem that has plagued many previous heuristics for active learning, and yet leads to a general and flexible method that enjoys the desirable traits described above. None of these desirable criteria are individually sufficient, but the simultaneous satisfaction of all criteria by one algorithm is compelling.
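
For intuition, here is a minimal sketch of how the two ideas fit together, assuming a hypothetical query_prob function standing in for the disagreement computation (this illustrates the principle, not the actual algorithm or its thresholds):

```python
import random

def selective_sampler(stream, query_prob, train, rng=None):
    """Minimal selective-sampling loop (an illustration of the principle,
    not the actual algorithm).

    query_prob(x, labeled) is assumed to return a probability in (0, 1] that
    is larger when near-optimal classifiers still disagree about x, and
    train(labeled) is any learner that accepts importance-weighted examples.
    """
    rng = rng or random.Random(0)
    labeled = []                          # (x, y, importance_weight) triples
    for x, oracle in stream:              # oracle() returns the true label on request
        p = query_prob(x, labeled)        # disagreement-based query probability
        if rng.random() < p:              # ask for the label with probability p
            y = oracle()
            labeled.append((x, y, 1.0 / p))  # inverse-probability weight removes sampling bias
    return train(labeled)
```

Because each label is weighted by the inverse of the probability with which it was requested, the importance-weighted loss of any fixed classifier has the same expectation as its loss on fully-labeled data, no matter how aggressively the query probabilities shrink.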

This combination of traits implies that active learning is a viable choice in most places where supervised learning is a viable choice. 6 years ago, we didn’t know how to deal with adversarial label noise and didn’t know how to characterize where active learning helped over supervised learning. 5.5 years ago, we had the first breakthroughs in characterization and learning with adversarial label noise. Several more substantial improvements occurred, leading to a tutorial 2 years ago and discussion about what’s next. Since then, we cracked question (2) here and applied it to get an effective, absurdly efficient active learning algorithm. Directions for experimenting with it are here, page 47.

As research programs go, we’d like to declare victory, but can’t. Victory in research is when ideas get used, becoming part of the standard repertoire of tools and ways people think. Instead, it seems fair to declare “theory victory” which is more like a milestone in the grand scheme. We’ve hit a point where anyone versed in these results can comfortably and effectively apply active learning instead of supervised learning (caveats below). Whether or not this leads to a real victory depends a great deal on how this gets used. In achieving “theory victory”, the key people were:

  1. Nina Balcan*
  2. Alina Beygelzimer
  3. Sanjoy Dasgupta
  4. Steve Hanneke*
  5. Daniel Hsu*
  6. Matti Kääriäinen
  7. Nikos Karampatziakis
  8. [Added from Daniel] Vladimir Koltchinskii
  9. John Langford
  10. Claire Monteleoni*
  11. Tong Zhang

(*)=thesis work.

Naturally, there are many caveats to the above. We moved as fast as we could towards an effective, sound, and useful algorithm under the standard IID-only assumptions of supervised learning. This means that our understanding is not particularly broad, and a number of questions remain, including 1, 3, and 4 here.

  1. The existing solution is a simple algorithm with a complex analysis. Simplifying and tightening this analysis would be quite helpful; it’s not even clear we have the best-possible functional form yet. In addition, we rely on the abstraction of a learning algorithm as an effective ERM algorithm when applying the algorithm in practice. That’s plausibly reasonable, but it may also break down, implying that experiments beyond linear and decision tree architectures could be helpful.
  2. The extreme efficiency of active learning achieved opens up the possibility of using it for efficient parallel learning and other information-distillation purposes.
  3. Our understanding of the limits of active learning is not complete: various lower bounds have been identified, but a gap remains relative to the known upper bounds. Are the label complexity upper bounds achieved by the general scheme tight for a general class of active learning algorithms, or can the algorithms be improved using new techniques without sacrificing consistency?
  4. Can active learning succeed under different generative / statistical assumptions (e.g., fully-adversarial data, alternative labeling oracles)? Some recent progress on fully-adversarial active learning has been made by Nicolo Cesa-Bianchi, Claudio Gentile, and Francesco Orabona and Ofer Dekel, Claudio Gentile, and Karthik Sridharan, with the latter also providing a solution for learning with multiple heterogeneous labeling oracles (e.g. Mechanical Turk). At the other end of the spectrum, the work of Andrew Guillory and Jeff Bilmes and Daniel Golovin and Andreas Krause (building on some earlier findings of Sanjoy Dasgupta) looks at average-case / Bayesian analyses of greedy algorithms. The resulting approximation factor-type guarantees can be reassuring to the practitioner when justifying a particular choice of sampling strategy; the trade-off here is that the guarantee holds in a somewhat limited sense.
  5. Can active learning be effectively combined with semi-supervised and unsupervised learning? It is known from the rich area of semi-supervised learning that unlabeled data can suggest learning biases (e.g., large margin separators, low dimensional structure) that may improve performance over supervised learning, especially when labeled data are few. When these biases are not aligned with reality, however, performance can be significantly degraded; this is a common but serious criticism of semi-supervised learning. A basic observation is that active learning provides the opportunity to validate or refute these biases using label queries, and also to subsequently revise them. Thus, it seems that active learners ought to be able to pursue learning biases much more aggressively than passive learners. A few works on cluster-based sampling and multi-view active learning have appeared, but much remains to be discovered.

22 Replies to “The End of the Beginning of Active Learning”

  1. I wonder if/how this can be generalized to regression–it seems as if it depends essentially on things like bounded losses and knowing how things would change if the prediction turns out to be wrong, and this is not easy to see in the regression setting. I ask because then it would be easier to generalize, for example, to structured settings.

    1. Regression was addressed in the ICML 2009 paper, as long as the loss function is effectively bounded. Even in the best case we don’t know how to show more than a geometric factor improvement for regression.

      For structured prediction, my baseline approach would be to reduce it to binary classification and then use active learning for the binary classifiers.

  2. As someone on the applications end of active learning research, the advances in theory are quite encouraging. Two of the remaining major issues that I’ve run into from the applications point of view and would certainly like to see more theoretical work on are the IID assumption, and the fact that the cost of human labeling often significantly differs from example to example.

    For the IID assumption, in NLP for example, we often can’t assume that words within a sentence are IID, but we would like to avoid labeling the words in the sentence that the classifier is already sure about, since labeling them wouldn’t provide much useful information. There has been good empirical work on combining active learning with partial self-training by Katrin Tomanek (http://aclweb.org/anthology-new/P/P09/P09-1117.pdf) and I’ve also done some work on essentially choosing non-structured classifiers for active learning to maintain the IID assumption (http://www.phontron.com/paper/neubig11aclshort.pdf), but a more theoretical treatment is necessary.

    Also, another interesting thread of research is that examples that are hard for the classifier are often hard for humans to annotate, which means that examples chosen by active learning take a lot more time to annotate. Burr Settles has some interesting work on this (http://www.cs.cmu.edu/~bsettles/pub/settles.nips08ws.pdf), and I think a lot of applications-oriented people would love to see discussions of annotation costs take a more central role in active learning research.

  3. For the structured prediction problem, I’d expect that independence between sentences is vaguely reasonable, and hence you can use the same sorts of strategies, except dividing the number of examples by the dependence range to get an effective number of examples.

    For the other problem, I don’t quite understand how to capture the problem. It makes sense that labeling uncertain things is more difficult, but most unlabeled examples don’t come with a known annotation cost; instead, we would have to predict it. Do you think we can predict costs in a manner refined enough to guide the labeling? Have you tried that empirically?

    1. Assuming IID between sentences is reasonable, and as a result most work on active learning in NLP does this (for structured prediction problems anyway). However, out of the 30-50 words in a typical sentence, only 1 or 2 actually need to be annotated and the rest are quite easy to classify, so if you annotate entire sentences you’re spending 15-50 times more effort than actually needed. If you want a more fine-grained focus you need to deal with annotating only parts of the structure, and here the IID assumption clearly doesn’t apply.

      About predicting costs, the paper that I linked actually talks about that a little bit in section 3. It’d be nice to have a publicly available data set that includes annotation costs to facilitate this kind of research. Unfortunately, creating such data costs money, so it may end at “it’d be nice…”, although it might be reasonable to do it with MTurk.

      1. Can’t you fix this “which word to annotate” problem by giving the annotator the current best estimate of the labels for the words, and letting them fix it? They should probably see the rest of the sentence anyway, and the only problem with this is a possible bias.

        1. Actually this is what we always do for our baseline, but annotating only parts of the sentence is still significantly faster. Especially for tagging parts of speech, or word boundaries in languages like Japanese or Chinese, there’s really no need to read the whole sentence.

      2. A quantity of independent data implicitly provides a measure of how certain an algorithm should be about its prediction. So, as long as the number of ‘truly independent’ draws is known or can be reasonably estimated, I think the basic method is applicable.

        I discussed structured active learning with Sanjoy a bit—he thought that labeling everything was about as hard as labeling anything. I expect the truth lies somewhere in between this and linear.

        For a starter theory incorporating costs perhaps it’s easiest to imagine that labeling costs truly are known, figure out what to do, then relax that assumption to predicted costs.

        1. Our research results actually match your intuition. When we had a worker annotate both single words and whole sentences (correcting automatic tags in both cases), we found that it took approximately 6 times more time per word to annotate single words, but this still produced better results because there are so many useless words that you end up annotating when you go through the full sentence. Others have reported similar results.

  4. Just like unlabeled data can provide some learning bias to an active learner, I believe active learning can even benefit from related learning problems (e.g., multitask and transfer learning), because multiple related tasks can share their inductive biases, which can be used to guide the active learner.

  5. The first place I saw something like what you’re calling importance weighting is in:

    James L. Carroll, Robbie Haertel, Peter McClanahan, and Eric Ringger. 2007. Modeling the annotation process for ancient corpus creation.

    http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.158.1648

    They cite a statistical notion called “expected value of sample information” EVSI, which is defined just like its name implies.

    I have a question about whether you can apply your approach in a situation with different objective functions. Many of our customers want very high (90-95%) precision and as high a recall as possible at that precision level. This is typically in an IR-type setting where there are two categories, relevant and irrelevant, and the relevant ones are only 1% or less of the overall data.

    The documents being classified can be anywhere in length from Tweets in a sentiment context to clinical reports in an epidemiology context to PDFs in an e-discovery context. For a baseline, we usually start with character n-grams (in the 3-8 range for English), so the data vectors can be pretty sparse at the low end.

    What we’ve been doing is just a naive active learning approach where we bootstrap a classifier by finding some positive examples by hand using search or clustering or whatever. Then, build a classifier using a sample of other docs as negatives. Then, run the classifier on the held-out docs and re-annotate everything showing up with very high posterior probability of being positive. This is very different than annotating the data about which the classifier is most uncertain — we’re annotating the data about which the classifier is most certain that it’s a positive instance. This tends to work pretty well in practice, but we have no idea how it behaves theoretically or if we could do a lot better with more clever sampling. In most of these cases, we’re dealing with 0.1% prevalence or less and actually have no idea what the true prevalence of docs is (we’d need to label 1000s of docs to get a handle on that when you only get one hit per 1000 docs).

    1. I don’t quite understand the analogy to EVSI. There, it seems they have a probabilistic model and use it to estimate directly a value of information. This will work fine, of course, if their model is accurate, and possibly be deeply inconsistent if not.

      If I understand right, your problem can be rephrased as importance weighted classification—you care much more about one failure mode than the other. In particular, the cost of mislabeling a ‘1’ as a ‘0’ is small and a ‘0’ as a ‘1’ is large, so long as there are a reasonable number of 1s. If you use IWAL strategies, it will ask you to mostly label examples where it’s uncertain about the label to give after taking into account importance weights. If it was the case that every 1-prediction could easily be converted into a 0-prediction (plausible given the importance weighting), then it would imply asking for labels on all label-1s. But, it would differ by also asking for labels on the 0-predictions that might easily convert into 1-predictions.

      The “finding by hand” step is sometimes called “active hitting”, a problem that is closer to experimental design than selective sampling, particularly if there are gradations of relevance.
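
      As a rough sketch of that reduction (the cost values and the disagreement test here are illustrative stand-ins, not the actual IWAL computation): each labeled example gets an importance weight equal to the cost of misclassifying it, and a label is worth querying whenever classifiers we can’t yet rule out disagree on the cost-sensitive decision.

      ```python
      def example_weight(y, c_fp, c_fn):
          """Cost-sensitive to importance-weighted reduction: weight each labeled
          example by the cost of getting it wrong (c_fp for a true negative
          predicted positive, c_fn for a true positive predicted negative)."""
          return c_fp if y == 0 else c_fn

      def worth_querying(p_lo, p_hi, c_fp, c_fn):
          """Illustrative disagreement test: p_lo and p_hi bound P(y=1|x) over the
          classifiers not yet ruled out.  Their cost-sensitive predictions differ
          exactly when the decision threshold c_fp / (c_fp + c_fn) falls between
          the two bounds, so only then is a label worth paying for."""
          threshold = c_fp / (c_fp + c_fn)
          return p_lo <= threshold <= p_hi
      ```

      With c_fp much larger than c_fn the threshold sits close to 1, so the queries concentrate on examples that look strongly positive or could plausibly become so, which matches the behavior described above.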

      1. Thanks! That’s a much better way to look at the problem decision theoretically. If I put a weight of -9 on false positives and +1 on true positives, I’ll break even on positive responses with a 90% precision. I’m not too worried about false negatives or true negatives, as they’re both essentially weight 0 in the kinds of apps we’re looking at. The goal as stated by the customers is to get maximum recall for a fixed minimum precision.
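
        Spelling out the break-even arithmetic behind those weights: if p is the chance that a flagged item is truly positive, then

        ```latex
        \underbrace{p \cdot (+1)}_{\text{true positive}} \;+\; \underbrace{(1 - p) \cdot (-9)}_{\text{false positive}} \;=\; 10p - 9 \;\ge\; 0
        \quad\Longleftrightarrow\quad p \ge 0.9,
        ```

        so a positive call pays off exactly at the 90% precision target.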

        This also accounts for another problem we often ran into, namely that we prefer higher precision to whatever the customer’s minimum specification was. We often find our classifiers ranking the true positives highest, then all the negatives, and there’s no reason to set the threshold to get the minimum precision when there’s a dominating solution.

        I got the scope wrong in the first comment — I meant that EVSI seems like your combined strategy of “disagreement region sampling” and “importance weighting”, because it’ll weight the next sample by its expected contribution to classifier training.

  6. Several more practical problems that I’ve run into when performing active learning with actual editors involved for text categorization problems:

    1. Coming out with well calibrated and well refined classifiers (DeGroot, “Comparison and Evaluation of Forecasters”) at the end of the AL process is essential.
    – Reason: As Bob points out above, business requirements vary from day to day and from category to category – I’ve been asked to produce “adult” classifiers with very high recall and reasonable precision and “technology” classifiers with high precision and reasonable recall. This means that I need to have a good range of scores (refinement) and be confident that my classifier is calibrated at each score, so I can flexibly operate at it without running AL again.
    – My solution: I’ve found that following elaborate strategies helps, but throwing a number of random samples across different classifier scores into the active set for labeling always leads to good insights and tends to give better calibrated and refined models.

    2. Settings from iteration to iteration usually change (this is true especially in the early iterations)
    – Reason: When there are actual editors involved, 10 AL iterations may translate to 10 work days for 10 paid employees. I would never set up a purely automatic procedure, usually based on multiple assumptions, only to find out on day 10 that the assumptions were not quite right.
    – My solution: After each iteration, when I get the results from the current batch labeled by the editors, I look at them carefully and make the necessary adjustments. Two common adjustments are: 1) Feature space changes – maybe I had never seen the feature “ipad” to consider it when I first trained my “technology” category, but now the editors indicate that it is quite descriptive of the positive class. Therefore, I augment (I’ve also pruned noisy features) my feature space. 2) Sampling strategy changes – maybe during the first iteration a “least confident” strategy (choosing examples around the decision boundary) was a reasonable choice, but after seeing some feedback (and thanks again to the random examples that I always add during each iteration) I may find that “most confident” or some previously unconsidered strategy makes more sense for the next iteration. Most of the literature that I’ve seen implicitly assumes a fixed feature space and fixed strategies throughout all AL iterations.

    3. Multi-class classification AL is much harder than binary classification, yet just as frequently demanded by customers in practice.
    – Reason: Some of the industrial taxonomies have thousands of nodes. If we need on average 100 examples labeled per category, we may end up requiring 1M examples to be labeled (using binary labeling). An additional problem is that different categories converge differently during AL and often have different business importance. These factors need to be reflected somehow in the AL procedure.
    – My solution: I still don’t have a good one, so I would appreciate any feedback.

    1. These are important concerns, and I don’t know what the best solutions are. Nevertheless, here are a few ideas:

      1. For purposes of calibration (and evaluation, and other auxiliary tasks, …), maintain a separate set of random (labeled) examples, independent of the examples used for training. This can be a much smaller set than what is used for training (basic statistical learning theory supports this), but it would allow you to do post-training calibration (e.g., with isotonic regression), compute test errors, etc.; a rough sketch appears after this list. Building the calibration into the learning seems trickier, although using logistic regression with IWAL may be one way to do it.

      2. A changing feature space is related to the issue of changing predictor class. With some active learning algorithms, the resulting labeled data set is heavily biased — it no longer accurately reflects the original distribution the way a random sample would. The IWAL schemes of Beygelzimer et al. (2009, 2010) correct the bias using importance weights so that the resulting labeled data set can be re-used when starting learning anew with a different predictor class.

      3. This is a special case of structured prediction, so the previous comments on this post are relevant. But for the special case of multi-class, some of the algorithms mentioned in the post (e.g., IWAL with decision trees, cluster-based sampling) can be readily applied.
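
      To make the first suggestion concrete, here is a rough sketch of post-training calibration on the separate random labeled set, assuming scikit-learn and invented variable names (any monotone calibration method fit on held-out scores would do):

      ```python
      from sklearn.isotonic import IsotonicRegression

      def calibrate_scores(holdout_scores, holdout_labels):
          """Fit a monotone map from raw classifier scores to probabilities
          using a small random labeled holdout set, kept separate from the
          (actively sampled, hence biased) training data."""
          iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
          iso.fit(holdout_scores, holdout_labels)
          return iso

      # hypothetical usage:
      #   iso = calibrate_scores(clf.decision_function(X_random), y_random)
      #   calibrated = iso.predict(clf.decision_function(X_new))  # values in [0, 1]
      ```

      The same random holdout set can double as a test set for the error estimates mentioned above.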

  7. Dear John Langford!
    I read your paper “Online Importance Weight Aware Updates” and find some points in it useful for my graduation thesis. However, I suspect your results are restricted to the linear model p = w^T x. If p = 1/(1+exp(-w^T x - b)), as in logistic regression or neural networks, I cannot find a general formula for s(h) as you did in Section 3, “Framework for Deriving Step Sizes”. My question is how to deal with this problem.
