# Machine Learning (Theory)

## 3/22/2005

### Active learning

Tags: Active,Prediction Theory sanjoy@ 9:29 pm

Often, unlabeled data is easy to come by but labels are expensive. For instance, if you’re building a speech recognizer, it’s easy enough to get raw speech samples — just walk around with a microphone — but labeling even one of these samples is a tedious process in which a human must examine the speech signal and carefully segment it into phonemes. In the field of active learning, the goal is as usual to construct an accurate classifier, but the labels of the data points are initially hidden and there is a charge for each label you want revealed. The hope is that by intelligent adaptive querying, you can get away with significantly fewer labels than you would need in a regular supervised learning framework.

Here’s an example. Suppose the data lie on the real line, and the classifiers are simple thresholding functions, H = {hw}:

hw(x) = 1 if x > w, and 0 otherwise.

VC theory tells us that if the underlying distribution P can be classified perfectly by some hypothesis in H (called the realizable case), then in order to get a classifier with error rate at most e, it is enough to draw m = O(1/e) random labeled examples from P, and to return any classifier consistent with them. But suppose we instead draw m unlabeled samples from P. If we lay these points down on the line, their hidden labels are a sequence of 0’s followed by a sequence of 1’s, and the goal is to discover the point w at which the transition occurs. This can be accomplished with a simple binary search which asks for just log m labels. Thus active learning gives us an exponential improvement in the number of labels needed: by adaptively querying log m labels, we can automatically infer the rest of them.

Unfortunately, not all that much is known beyond this toy example. To date, the single main theoretical result in the field is [FSST97]‘s analysis of the query-by-committee (QBC) learning algorithm. In their model, the learner observes a stream of unlabeled data and makes spot decisions about whether or not to ask for a point’s label. They show that if the data is drawn uniformly from the surface of the d-dimensional unit sphere, and the hidden labels correspond perfectly to a homogeneous (i.e., through the origin) linear separator from this same distribution, then it is possible to achieve generalization error e after seeing O(d/e) points and requesting just O(d log 1/e) labels: an exponential improvement over the usual O(d/e) sample complexity of learning linear separators in a supervised setting. This remarkable result is tempered somewhat by the complexity of the QBC algorithm, which involves computing volumes of intermediate version spaces.

Some recent progress on active learning: [DKM05] show how a simple variant of the perceptron update can be used to achieve these same sample complexity bounds, in the same model. [D04] shows a variety of upper and lower bounds for active learning — for instance, if you allow linear separators which are non-homogeneous then in the above model the sample complexity necessarily shoots up to 1/e.

The theoretical terrain of active learning remains something of an unexplored wilderness. There has, however, been a lot of beautiful theory work (see [A02] for a roundup) on a related model in which the learner is allowed to synthesize query points, rather than simply choosing them from the pool of unlabeled data. This ran into some practical problems: [BL92] found that the resulting synthetic instances were often very difficult for a human to classify!

[A02] D. Angluin. Queries revisited.
[BL92] E. Baum and K. Lang. Query learning can work poorly when a human oracle is used.
[D04] S. Dasgupta. Analysis of a greedy active learning strategy.
[DKM05] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning.
[FSST97] Y. Freund, S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query-by-committee algorithm.

###### 9 Comments to “Active learning”
1. I find active learning very compelling because of the toy example.

One thing that I do not understand is “what are the minimal assumptions required for active learning?”

For example, independence of samples is sufficient to allow us to predict our future performance. This is not the minimal assumption, but we know that something relating past to future is required.

In standard supervised learning, we know that independent samples plus an assumption that the problem has certain forms (like “decision list”) implies learnability. Again, these aren’t minimal, but we understand that some assumption relating past to future and some constraint on the form of the learning problem are necessary.

Does active learning always require stronger constraints on the learning problem then passive learning?

2. Onur Cobanoglu says:

I think this result cannot be generalized. Baum & Lang worked on OCR data. Since active learning tried to find most informative examples, it synthesized examples located at class borders in sample space; hence they appeared as superposition of characters, looking like neither of the characters. Problem originates from both creating samples near borders and working on continuous signal-like data. There are ways to avoid from sampling near borders (e.g. see [MacKay91]) and not all domains contain this type of data. For example, assume you are working on grammar induction task in which you are working with sentences. When learner synthesizes a sequence of words in a language and human oracle tries to classify the sequence as a valid sentence or a meaningless utterance, ambiguity and needed effort arise in too few occasions.
Traditionally, active learning is selecting or synthesizing samples for cost reduction; but there are other usages too. For example, in [Liu04], sample selection is used for eliminating irrelevant features. In [V2006], features are actively selected and measured for existing samples, again in order to minimize the feature set.

[MacKay91] D.J.C. MacKay. Bayesien Methods for Adaptive Models.
[Liu04] H. Liu and H. Motoda and L. Yu. A Selective Sampling Approach to Active Feature Selection
[V2006] S. Veeramachaneni and E. Olivetti and P. Avesani. Active Sampling for Detecting Irrelevant Features

3. When building a speech recognizer, it is usually sufficient to label mostly or entirely at the word level, rather than the phoneme level. And the start and end times of individual words usually do not need to be labeled, as long as the words are grouped into reasonably short blocks and those blocks have a labeled start and end time. (Typically I see a block per sentence, but I\’m not sure what the limits on block size are.) The learning algorithms are powerful enough to learn subword states given such word-level labels.

However, labeling can still get quite expensive! And indeed a Google search for

\”active learning\” \”automatic speech recognition\”

turns up a number of papers.

4. […] We know that this analysis is often tractable. For example, since Sanjoy’s post on Active Learning, much progress has been made. Several other variations of interactive settings have been proposed and analyzed. The older online learning against an adversary work is essentially completely worked out for the simpler cases (except for computational issues). […]

5. […] Active Learning (part of Machine Learning): http://hunch.net/?p=49 […]

6. kk says:

hi everybody i am doing my project in data mining… so i select one paper which consist some concepts such as active learning,svm… so i need some example for these concepts… so anybody having much more knowledge in that pls give me the example…