Machine Learning (Theory)

4/20/2011

The End of the Beginning of Active Learning

Tags: Active,Machine Learning Daniel@ 12:18 pm

This post is by Daniel Hsu and John Langford.

In selective sampling style active learning, a learning algorithm chooses which examples to label. We now have an active learning algorithm that is:

  1. Efficient in label complexity, unlabeled complexity, and computational complexity.
  2. Competitive with supervised learning anywhere that supervised learning works.
  3. Compatible with online learning, with any optimization-based learning algorithm, with any loss function, with offline testing, and even with changing learning algorithms.
  4. Empirically effective.

The basic idea is to combine disagreement region-based sampling with importance weighting: an example is selected to be labeled with probability proportional to how useful it is for distinguishing among near-optimal classifiers, and labeled examples are importance-weighted by the inverse of these probabilities. The combination of these simple ideas removes the sampling bias problem that has plagued many previous heuristics for active learning, and yet leads to a general and flexible method that enjoys the desirable traits described above. None of these desirable criteria are individually sufficient, but the simultaneous satisfaction of all criteria by one algorithm is compelling.
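To make this concrete, here is a minimal sketch of the sampling-plus-reweighting loop in Python. The disagreement test, the floor probability p_min, and the oracle interface are illustrative assumptions rather than the exact algorithm from the paper.

```python
import random

def selective_sampling_step(x, near_optimal, oracle, labeled, p_min=0.1):
    # Query probability is high when near-optimal classifiers disagree on x
    # (the label is informative) and small, but never zero, when they agree.
    disagree = len({h(x) for h in near_optimal}) > 1
    p = 1.0 if disagree else p_min  # p_min > 0 keeps estimates unbiased
    if random.random() < p:
        y = oracle(x)  # pay for a label
        # Importance weight 1/p corrects the sampling bias, so the weighted
        # empirical error remains an unbiased estimate of true error.
        labeled.append((x, y, 1.0 / p))
    return labeled
```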

This combination of traits implies that active learning is a viable choice in most places where supervised learning is a viable choice. 6 years ago, we didn’t know how to deal with adversarial label noise, and didn’t know how to characterize where active learning helped over supervised learning. 5.5 years ago we had the first breakthroughs in characterization and learning with adversarial label noise. Several more substantial improvements occurred, leading to a tutorial 2 years ago and discussion about what’s next. Since then, we cracked question (2) here and applied it to get an effective, absurdly efficient active learning algorithm. Directions for experimenting with it are here, page 47.

As research programs go, we’d like to declare victory, but can’t. Victory in research is when ideas get used, becoming part of the standard repertoire of tools and ways people think. Instead, it seems fair to declare “theory victory” which is more like a milestone in the grand scheme. We’ve hit a point where anyone versed in these results can comfortably and effectively apply active learning instead of supervised learning (caveats below). Whether or not this leads to a real victory depends a great deal on how this gets used. In achieving “theory victory”, the key people were:

  1. Nina Balcan*
  2. Alina Beygelzimer
  3. Sanjoy Dasgupta
  4. Steve Hanneke*
  5. Daniel Hsu*
  6. Matti Kääriäinen
  7. Nikos Karampatziakis
  8. [Added from Daniel] Vladimir Koltchinskii
  9. John Langford
  10. Claire Monteleoni*
  11. Tong Zhang

(*)=thesis work.

Naturally, there are many caveats to the above. We moved as fast as we could towards an effective, sound, and useful algorithm under the standard IID-only assumptions of supervised learning. This means that our understanding is not particularly broad, and a number of questions remain, including 1, 3, and 4 here.

  1. The existing solution is a simple algorithm with a complex analysis. Simplifying and tightening this analysis would be quite helpful—it’s not even clear we have the best-possible functional form yet. In addition, we rely on the abstraction of a learning algorithm as an effective ERM algorithm when applying the algorithm in practice. That’s plausibly reasonable, but it may also break down, implying that experiments beyond linear and decision tree architectures could be helpful.
  2. The extreme efficiency of active learning achieved opens up the possibility of using it for efficient parallel learning and other information-distillation purposes.
  3. Our understanding of the limits of active learning is not complete: various lower bounds have been identified, but a gap remains relative to the known upper bounds. Are the label complexity upper bounds achieved by the general scheme tight for a general class of active learning algorithms, or can the algorithms be improved using new techniques without sacrificing consistency?
  4. Can active learning succeed under different generative / statistical assumptions (e.g., fully-adversarial data, alternative labeling oracles)? Some recent progress on fully-adversarial active learning has been made by Nicolo Cesa-Bianchi, Claudio Gentile, and Francesco Orabona and Ofer Dekel, Claudio Gentile, and Karthik Sridharan, with the latter also providing a solution for learning with multiple heterogeneous labeling oracles (e.g. Mechanical Turk). At the other end of the spectrum, the work of Andrew Guillory and Jeff Bilmes and Daniel Golovin and Andreas Krause (building on some earlier findings of Sanjoy Dasgupta) looks at average-case / Bayesian analyses of greedy algorithms. The resulting approximation factor-type guarantees can be reassuring to the practitioner when justifying a particular choice of sampling strategy; the trade-off here is that the guarantee holds in a somewhat limited sense.
  5. Can active learning be effectively combined with semi-supervised and unsupervised learning? It is known from the rich area of semi-supervised learning that unlabeled data can suggest learning biases (e.g., large margin separators, low dimensional structure) that may improve performance over supervised learning, especially when labeled data are few. When these biases are not aligned with reality, however, performance can be significantly degraded; this is a common but serious criticism of semi-supervised learning. A basic observation is that active learning provides the opportunity to validate or refute these biases using label queries, and also to subsequently revise them. Thus, it seems that active learners ought to be able to pursue learning biases much more aggressively than passive learners. A few works on cluster-based sampling and multi-view active learning have appeared, but much remains to be discovered.

6/15/2009

In Active Learning, the question changes

Tags: Active,Questions jl@ 10:54 am

A little over 4 years ago, Sanjoy made a post saying roughly “we should study active learning theoretically, because not much is understood”.

At the time, we did not understand basic things such as whether or not it was possible to PAC-learn with an active algorithm without making strong assumptions about the noise rate. In other words, the fundamental question was “can we do it?”

The nature of the question has fundamentally changed in my mind. The answer to the previous question is “yes”, both information theoretically and computationally, in most places where supervised learning could be applied.

In many situations, the question has now changed to: “is it worth it?” Is the programming and computational overhead low enough to make the label cost savings of active learning worthwhile? Currently, there are situations where this question could go either way. Much of the challenge for the future is in figuring out how to make active learning easier or more worthwhile.

At the active learning tutorial, I stated a set of somewhat more precise research questions that I don’t yet have answers to, and which I believe are worth answering. Here is a bit of an expansion on those questions for those interested.

  1. Is active learning possible in a fully adversarial setting? By fully adversarial, I mean a setting where an adversary controls all of the algorithm’s observations. Some work by Claudio and Nicolo has moved in this direction, but there is not yet a solid answer.
  2. Is there an efficient and effective reduction of active learning to supervised learning? The bootstrap IWAL approach is efficient but not effective in some situations where other approaches can succeed. The algorithm here is a reduction to a special kind of supervised learning where you can specify both examples and constraints. For many supervised learning algorithms, adding constraints seems problematic.
  3. Can active learning succeed with alternate labeling oracles? The ones I see people trying to use in practice often differ because they can provide answers of varying specificity and cost, or because some oracles are good for some questions, but not good for others.
  4. At this point, there have been several successful applications of active learning, but that’s not the same thing as succeeding with more robust algorithms. Can we succeed empirically with more robust algorithms? And is the empirical cost of additional robustness worth the empirical peace-of-mind that your learning algorithm won’t go astray where other more aggressive approaches may do so?

3/8/2009

Prediction Science

One view of machine learning is that it’s about how to program computers to predict well. This suggests a broader research program centered around the more pervasive goal of simply predicting well.
There are many distinct strands of this broader research program which are only partially unified. Here are the ones that I know of:

  1. Learning Theory. Learning theory focuses on several topics related to the dynamics and process of prediction. Convergence bounds like the VC bound give an intellectual foundation to many learning algorithms. Online learning algorithms like Weighted Majority provide an alternate, purely game theoretic foundation for learning. Boosting algorithms yield methods for purifying prediction ability. Reduction algorithms provide means for changing esoteric problems into well known ones.
  2. Machine Learning. A great deal of experience has accumulated in practical algorithm design from a mixture of paradigms, including Bayesian, biological, optimization, and theoretical.
  3. Mechanism Design. The core focus in game theory is on equilibria, most typically Nash equilibria, but also many other kinds of equilibria. The point of equilibria, to a large extent, is predicting how agents will behave. When this is employed well, principally in mechanism design for auctions, it can be a very powerful concept.
  4. Prediction Markets. The basic idea in a prediction market is that commodities can be designed so that their buy/sell price reflects a form of wealth-weighted consensus estimate of the probability of some event. This is not simply mechanism design, because (a) the thin market problem must be dealt with and (b) the structure of plausible guarantees is limited. (A concrete pricing sketch follows this list.)
  5. Predictive Statistics. Part of statistics focuses on prediction, essentially becoming indistinguishable from machine learning. The canonical examples of this are tree building algorithms such as CART, random forests, and some varieties of boosting. Similarly, the notions of probability, counting, and estimation are all handy.
  6. Robust Search. I have yet to find an example of robust search which isn’t useful—and there are several varieties. This includes active learning, robust min finding, and (more generally) compressed sensing and error correcting codes.
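The prediction market pricing idea in item 4 can be made concrete. Below is a sketch of Hanson’s logarithmic market scoring rule, a standard such design (not something described in this post); the liquidity parameter b is an assumption.

```python
import math

def lmsr_prices(q, b=100.0):
    """Instantaneous prices for outstanding share quantities q.
    Prices are positive and sum to 1, so they can be read as a
    consensus probability estimate over the outcomes."""
    m = max(qi / b for qi in q)  # subtract the max for numerical stability
    exps = [math.exp(qi / b - m) for qi in q]
    z = sum(exps)
    return [e / z for e in exps]

def lmsr_trade_cost(q, delta, b=100.0):
    """Cost of buying delta[i] shares of each outcome: C(q + delta) - C(q),
    where the cost function is C(q) = b * log(sum_i exp(q_i / b))."""
    def cost(qs):
        m = max(qi / b for qi in qs)
        return b * (m + math.log(sum(math.exp(qi / b - m) for qi in qs)))
    return cost([qi + di for qi, di in zip(q, delta)]) - cost(q)
```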

The lack of unification is fertile territory for new research, so perhaps it’s worthwhile to think about how these different research programs might benefit from each other.

  1. Learning Theory. The concept of mechanism design is mostly missing from learning theory, but it is sure to be essential when interactive agents are learning. We’ve found several applications for robust search, as well as new settings for robust search such as active learning and error correcting tournaments, but there are surely others.
  2. Machine Learning and Predictive Statistics. Machine learning has been applied to auction design. There is a strong relationship between incentive compatibility and choice of loss functions, both for choosing proxy losses and approximating the real loss function imposed by the world. It’s easy to imagine designer loss functions from the study of incentive compatible mechanisms giving learning algorithms an edge. I found this paper thought provoking that way. Since machine learning and information markets share a design goal, are there hybrid approaches which can outperform either?
  3. Mechanism Design. There are some notable similarities between papers in ML and mechanism design. For example, there are papers about learning on permutations and pricing in combinatorial markets. I haven’t yet taken the time to study these carefully, but I could imagine that one suggests advances for the other, and perhaps vice versa. In general, the idea of using mechanism design with context information (as is done in machine learning) could also be extremely powerful.
  4. Prediction Markets. Prediction markets are partly an empirical field and partly a mechanism design field. There seems to be relatively little understanding about how well, and how exactly, information from multiple agents is supposed to interact to derive a good probability estimate. For example, the current global recession reminds us that excess leverage is a very bad idea. The same problem comes up in machine learning and is solved by the weighted majority algorithm (and even more thoroughly by the hedge algorithm; a sketch of this update follows the list). Can an information market be designed with the guarantee that an imperfect but best player decides the vote after not-too-many rounds? How would this scale as a function of the ratio of a participant’s initial wealth to the total wealth?
  5. Robust Search. Investigations into robust search are extremely diverse, unified essentially only by a mathematically based analysis. For people interested in robust search, machine learning and information markets provide a fertile ground for empirical application and new settings. Can all mechanisms for robust search be done with context information, as is common in learning? Do these approaches work empirically in machine learning or information markets?
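The weighted majority / hedge update mentioned in the prediction markets item above is simple enough to sketch. A minimal version, assuming losses in [0, 1] and a fixed learning rate eta:

```python
import math

def hedge(loss_sequence, n, eta=0.5):
    """Multiplicative weights over n experts. loss_sequence yields a
    length-n vector of losses in [0, 1] each round. No expert can hold
    'excess leverage' because weights decay exponentially in loss."""
    w = [1.0] * n
    predictions = []
    for losses in loss_sequence:
        total = sum(w)
        predictions.append([wi / total for wi in w])  # consensus each round
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
    return predictions, [wi / sum(w) for wi in w]
```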

There are almost surely many other interesting research topics and borrowable techniques here, and probably even other communities oriented around prediction. While the synthesis of these fields is almost sure to eventually happen, I’d like to encourage it sooner rather than later. For someone working on one of these branches, attending a conference on one of the others might be a good start. At a lesser time investment, reading Oddhead is another.

1/23/2009

An Active Learning Survey

Tags: Active,Machine Learning jl@ 5:11 am

Burr Settles wrote a fairly comprehensive survey of active learning. He intends to maintain and update the survey, so send him any suggestions you have.

7/26/2008

Compositional Machine Learning Algorithm Design

There were two papers at ICML presenting learning algorithms for a contextual bandit-style setting, where the loss for all labels is not known, but the loss for one label is known. (The first might require an exploration scavenging viewpoint to understand if the experimental assignment was nonrandom.) I strongly approve of these papers and further work in this setting and its variants, because I expect it to become more important than supervised learning. As a quick review, we are thinking about situations where repeatedly:

  1. The world reveals feature values (aka context information).
  2. A policy chooses an action.
  3. The world provides a reward.
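A minimal sketch of this protocol (the world and policy interfaces below are assumed for illustration):

```python
def contextual_bandit_run(world, policy, T):
    """The interaction loop: only the chosen action's reward is observed."""
    history = []
    for t in range(T):
        x = world.context(t)       # 1. world reveals feature values
        a = policy(x)              # 2. policy chooses an action
        r = world.reward(t, a)     # 3. reward for the chosen action only
        history.append((x, a, r))  # rewards of other actions stay hidden
    return history
```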

Sometimes this is done in an online fashion where the policy can change based on immediate feedback and sometimes it’s done in a batch setting where many samples are collected before the policy can change. If you haven’t spent time thinking about the setting, you might want to because there are many natural applications.

I’m going to pick on the Banditron paper (second one), which attacks the special case of the contextual bandit setting where exactly one of the rewards is 1 and all other actions result in reward 0, and show that (a) similar performance is achievable via a simple combination of existing modular technologies and (b) superior performance is achievable by optimizing some existing modular technologies for the realizable case. This algorithm is the harder of the two to compete with, because it explicitly deals with the explore/exploit tradeoff. Note that I’m definitely not trying to minimize the paper—there is analysis in that paper which remains interesting to me and isn’t covered by what follows. I am happy that it was published. The point of this post is showing that a modular approach to building learning algorithms is a strong contender when we encounter new learning problems.

Given the problem statement, my approach to solving the problem would be to compose some modular technologies I know.

  1. Perceptron learning algorithm. We chose a Binary Perceptron as a classification algorithm. This choice is intrinsically motivated by the great computational performance of a Perceptron. It’s also the closest binary supervised learning algorithm to the Banditron, which eliminates a source of variation in comparison. We could have easily chosen a different base learning algorithm, and in many applications this is highly desirable.
  2. Offset Tree Reduction. The offset tree is a newer machine learning reduction from the contextual bandit setting to the standard supervised learning setting. It more robustly transforms a supervised learner’s performance into good policy performance than any other reduction. The offset tree also has good computational properties, since it produces at most log_2(k) binary examples per train or test event, where k is the number of actions. In some sense it’s unfair to include the offset tree because it hasn’t yet been formally published. In another sense, that’s what this post is about.
  3. Epoch-Greedy exploration. The Epoch-Greedy approach shows how to handle the explore/exploit tradeoff for learning in a contextual bandit setting as a function of a sample complexity bound. For common sample complexity bounds, we get an O(T^{2/3}) online regret where T is the total number of timesteps.
  4. Occam’s Razor Bound. The Occam’s Razor bound limits the regret of an empirical error minimizing perceptron as a function of the number of examples. The bound (and its many cousins) is often loose, so the only thing we’ll really use is the denominator, which says that regret scales as 1/sqrt(number of training examples) in the worst case. Applying Epoch-Greedy to the Occam’s Razor bound gives an exploration probability of about C/t^{1/3} where t is the round number.
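Here is a minimal sketch of that exploration schedule (the constant c is an assumption; the action probability p_a is returned for use by an importance weighting reduction):

```python
import random

def epoch_greedy_action(t, k, policy, x, c=1.0):
    """Choose among k actions at round t >= 1: explore uniformly with
    probability about c / t^(1/3), otherwise exploit the current policy."""
    p_explore = min(1.0, c / t ** (1.0 / 3.0))
    greedy = policy(x)
    if random.random() < p_explore:
        a = random.randrange(k)  # uniform exploration
    else:
        a = greedy               # exploitation
    # Marginal probability of the action actually taken, for reweighting.
    p_a = p_explore / k + (1.0 - p_explore) * (1.0 if a == greedy else 0.0)
    return a, p_a
```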

Each component above has been analyzed in isolation and is at least a reasonable approach (some are the best possible). Each of these components is also composable. Fitting these pieces together, we get an online learning algorithm (agnostic offset-tree perceptron) that chooses to explore uniformly at random amongst the actions with probability about 1/t^{1/3}. How well does it perform? On the 4-class Reuters-based dataset used in the Banditron paper, we get the following accumulated average error rates with some code:

The right plot is from the Banditron paper. The Perceptron line in both plots is for an algorithm which learns knowing the full loss function of each example, so it represents an ideal we don’t expect to achieve here. There are three results in the left plot:

  1. The blue line is a version of the component set where the Occam’s Razor bound and the Offset Tree reduction have been optimized for the realizable case. This was the first thing we tested (and it’s the result I mentioned at Shai’s ICML talk). It turns out this approach works substantially better than the Banditron, achieving an error rate about halfway between the Banditron and the Multiclass Perceptron. The two components that we tweaked are:
    1. Realizable case Bound. It’s well known that in the realizable case the regret of a chosen classifier should scale as 1/t rather than 1/t^{0.5}. Plugging this into Epoch-Greedy, we get that the probability of exploration should be about 1/t^{0.5}.
    2. Realizable case Offset Tree. A basic observation is that in the realizable setting, every observation should create an example to tune the learning algorithm. In the context of the perceptron, this implies every error creates an example. The offset tree reduction can be altered to take this into account by eliminating the importance weight from all updates, and updating even for exploitation examples which are not drawn uniformly at random.
  2. The red line is what you get with exactly the component set stated above. We were curious about the degree to which a general purpose algorithm can perform well on this application, as the realizable case algorithm is definitely broken when the problem is inherently noisy. The performance is somewhat worse than the Banditron’s. I believe this is because it explores only about 1% of the time while the Banditron plot comes from exploring about 5% of the time.
  3. The green line is from a component set where Epoch-Greedy and the offset tree have been tweaked to keep track of and use the distribution over actions at every timestep. This tweak allows the amount of exploration, as measured by the sum of importance weights of training examples, to almost double. As we see, this approach improves performance, almost as if we doubled the number of examples, giving it similar performance to the Banditron. The tweaks used for the component set are:
    1. Stochastic Epoch-Greedy. Instead of deterministically exploring every 1/(bound_gap) times, choose to explore with probability 1/(bound_gap), and pass this probability to the offset tree reduction.
    2. Nonuniform Offset Tree. Tweak the Offset Tree in the obvious way to take into account nonuniform exploration. In particular, 1/2 is replaced with K/p(a) where p(a) is the probability of the action taken conditioned on one of two actions being taken. We set K so that this value is 1 when a nonexploit action is taken, which implies the importance weight is p(a)/(2-p(a)) when the exploit action is taken.
    3. Importance Weighted Perceptron. We dealt with importance weights generated by the offset tree reduction by scaling any update by the importance weight.
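A minimal sketch of that importance weighted update (feature vectors as plain lists, labels in {-1, +1}; the interface is illustrative):

```python
def iw_perceptron_update(w, x, y, importance):
    """Perceptron update scaled by an importance weight.
    w, x: equal-length lists of floats; y in {-1, +1}; importance >= 0."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin <= 0:  # mistake: apply the update, scaled by the weight
        w = [wi + importance * y * xi for wi, xi in zip(w, x)]
    return w
```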

People may be dissatisfied with the component assembly approach to learning algorithm design, because all of the pieces are not analyzed together. For example, the Banditron paper essentially proves that certain conditions are sufficient for the Banditron algorithm to perform well. This is a more complete guarantee than “all the pieces we stuck together are known to work well when analyzed in isolation”, but this style of guarantee has limitations which are both obvious and often overlooked. These guarantees do not show necessity of these conditions, and hence characterize only a subset of settings where the Banditron works. Unless you are lucky enough to know that your application satisfies the sufficient conditions, you’ll basically have to try the Banditron and see if it works for any particular application. Another drawback is that the complexity of analyzing all the different pieces simultaneously means that it’s difficult to use the best elements together. This last point is what leaves me dissatisfied with the complete analysis approach—it produces algorithms which are simple to analyze rather than optimized for performance.

If you are thinking “I have a better algorithm for solving one problem”, then you’ve missed the point of this post. The point of this post is that compositional design is good for solving many learning problems. This post is about one example of that approach in action. We start by assembling a set of components which we know work well from individual component analysis. Then, we try to optimize the performance of the assembly by swapping components or improving individual components to address known properties of the problem or observed deficiencies. In this particular case, we end up with a better performing algorithm and better components (such as stochastic epoch greedy) which are directly reusable in other settings. The essence of this approach is understanding that there is a real vocabulary of interchangeable components and actively using it in designing learning algorithms.

I would like to thank Alina who helped substantially with this post. I would also like to thank Shai for providing data and helping setup a clean comparison and Sham for helpful discussion.

3/23/2008

Interactive Machine Learning

A new direction of research seems to be arising in machine learning: Interactive Machine Learning. This isn’t a familiar term, although it does include some familiar subjects.

What is Interactive Machine Learning? The fundamental requirement is learning algorithms which (a) interact with the world and (b) learn.

For our purposes, let’s define learning as efficiently competing with a large set of possible predictors. Examples include:

  1. Online learning against an adversary (Avrim’s Notes). The interaction is almost trivial: the learning algorithm makes a prediction and then receives feedback. The learning is choosing based upon the advice of many experts.
  2. Active Learning. In active learning, the interaction is choosing which examples to label, and the learning is choosing from amongst a large set of hypotheses.
  3. Contextual Bandits. The interaction is choosing one of several actions and learning only the value of the chosen action (weaker than active learning feedback).

More forms of interaction will doubtless be noted and tackled as time progresses. I created a webpage for my own research on interactive learning which helps define the above subjects a bit more.

What isn’t Interactive Machine Learning?
There are several learning settings which fail either the interaction or the learning test.

  1. Supervised Learning doesn’t fit. The basic paradigm in supervised learning is that you ask experts to label examples, and then you learn a predictor based upon the predictions of these experts. This approach has essentially no interaction.
  2. Semisupervised Learning doesn’t fit. Semisupervised learning is almost the same as supervised learning, except that you also throw in many unlabeled examples.
  3. Bandit algorithms don’t fit. They have the interaction, but not much learning happens because the sample complexity results only allow you to choose from amongst a small set of strategies. (One exception is EXP4 (page 66), which can operate in the contextual bandit setting.)
  4. MDP learning doesn’t fit. The interaction is there, but the set of policies learned over is still too limited—essentially the policies just memorize what to do in each state.
  5. Reinforcement learning may or may not fit, depending on whether you think of it as MDP learning or in a much broader sense.

All of these not-quite-interactive-learning topics are of course very useful background information for interactive machine learning.

Why now? Because it’s time, of course.

  1. We know from other fields and various examples that interaction is very powerful.
    1. From online learning against an adversary, we know that independence of samples is unnecessary in an interactive setting—in fact you can even learn against an adversary.
    2. From active learning, we know that interaction sometimes allows us to use exponentially fewer labeled samples than in supervised learning.
    3. From contextual bandits, we gain the ability to learn in settings where traditional supervised learning just doesn’t apply.
    4. From complexity theory we have “IP=PSPACE” roughly: interactive proofs are as powerful as polynomial space algorithms, which is a strong statement about the power of interaction.
  2. We know that this analysis is often tractable. For example, since Sanjoy’s post on Active Learning, much progress has been made. Several other variations of interactive settings have been proposed and analyzed. The older online learning against an adversary work is essentially completely worked out for the simpler cases (except for computational issues).
  3. Real world problems are driving it. One of my favorite problems at the moment is the ad display problem—how do you learn which ad is most likely to be of interest? The contextual bandit problem is a big piece of this problem.
  4. It’s more fun. Interactive learning is essentially a wide-open area of research. There are plenty of kinds of natural interaction which haven’t been formalized or analyzed. This is great for beginners, because it means the problems are simple, and their solution does not require huge prerequisites.
  5. It’s a step closer to AI. Many people doing machine learning want to reach AI, and it seems clear that any AI must engage in interactive learning. Mastering this problem is a next step.

Basic Questions

  1. For natural interaction form [insert yours here], how do you learn? Some of the techniques for other methods of interactive learning may be helpful.
  2. How do we blend interactive and noninteractive learning? In many applications, there is already a pool of supervised examples around.
  3. Are there general methods for reducing interactive learning problems to supervised learning problems (which we know better)?

11/28/2007

Computational Consequences of Classification

In the regression vs classification debate, I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in several situations.

  1. In active learning it is sometimes possible to find an e error classifier with just log(1/e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter space of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points.
  2. In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints while a squared loss regressor could not. For example, if there is one feature which determines whether a binary label has probability less than or greater than 0.5, a great classifier exists using just one feature. Because squared loss is sensitive to the exact probability, many more features may be required to learn well with respect to squared loss.

6/14/2006

Explorations of Exploration

Exploration is one of the big unsolved problems in machine learning. This isn’t for lack of trying—there are many models of exploration which have been analyzed in many different ways by many different groups of people. At some point, it is worthwhile to sit back and see what has been done across these many models.

  • Reinforcement Learning (1). Reinforcement learning has traditionally focused on Markov Decision Processes where the next state s’ is given by a conditional distribution P(s’|s,a) given the current state s and action a. The typical result here is that certain specific algorithms controlling an agent can behave within e of optimal for horizon T except for poly(1/e,T,S,A) “wasted” experiences (with high probability). This started with E^3 by Satinder Singh and Michael Kearns. Sham Kakade’s thesis has significant discussion. Extensions have typically been of the form “under extra assumptions, we can prove more”, for example Factored-E^3 and Metric-E^3. (It turns out that the number of wasted samples can be less than the number of bits required to describe an MDP.) A weakness of all these results is that (a) they rely upon assumptions which are often false for real applications, (b) the state spaces are too large, and (c) the guarantee provided is rather weak. Good performance is only guaranteed after suffering the possibly catastrophic consequences of exploration.
  • Reinforcement Learning (2). Several recent papers have been attempting to analyze reinforcement learning via reduction. To date, all results are either nonconstructive or involve the use of various hints (oracle access to an optimal policy, the distribution over states of an optimal policy etc…) which short-circuit the need to explore. Obviously, these hints are not always available for real-world problems.
  • Reinforcement Learning (3). Much of the rest of reinforcement learning has something to do with exploration, but it’s difficult to summarize succinctly.
  • Online Learning. The n-armed bandit setting can be thought of as an MDP with one state and many actions. In some variants, there is even an adversary who chooses the payoffs of the arms in a non-stochastic manner. The typical result here says that you can compete well with the best constant action after some wasted actions. The exact number of wasted actions varies with the precise setting, but it is typically linear in the number of actions. This work can be traced back to (at least) Gittins indices, which (unfortunately) don’t seem to have a good description available on the internet. (A sketch of the EXP3 algorithm for the adversarial variant follows this list.)
  • Active Learning (1). The common current use of this term has to do with “selective sampling”=choosing unlabeled samples to label so as to minimize the number of labels required to learn a good predictor (typically a classifier). A typical result has the form: given that your classifier comes from restricted class C and the labeled data distribution obeys some constraint, the number of adaptively labeled samples required is O(log(1/e)) where e is the error rate. (It turns out that even noisy distributions are allowed.) The constraints on distributions and hypothesis spaces required to achieve these speedups are often severe.
  • Active Learning (2). Membership query learning is another example. The distinguishing difference with respect to selective sampling is that a label can be requested for any unlabeled point (not just those drawn according to some natural distribution). Several relatively strong results hold for membership query learning, but there is a significant drawback: it seems that the ability to query for a label on an arbitrary point is not very natural. For example, imagine querying whether a text document is about sports or politics when the text is generated at random.
  • Active Learning (3). Experimental design (which is mostly based in statistics) is often about finding the extrema of some function rather than generalization. Often, the data generating distribution is assumed to come from some specific parametric family. Unfortunately, my knowledge is sketchy here.
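For the adversarial bandit variant mentioned in the online learning item, here is a minimal sketch of the EXP3 algorithm (the reward interface and the parameter gamma are illustrative assumptions):

```python
import math
import random

def exp3(reward_of, k, T, gamma=0.1):
    """EXP3 sketch for the adversarial n-armed bandit: compete with the
    best constant arm while observing only the chosen arm's reward in [0, 1]."""
    w = [1.0] * k
    for t in range(T):
        total = sum(w)
        p = [(1 - gamma) * wi / total + gamma / k for wi in w]
        # Sample an arm from the distribution p.
        r, acc, a = random.random(), 0.0, k - 1
        for i, pi in enumerate(p):
            acc += pi
            if r < acc:
                a = i
                break
        x = reward_of(t, a)                 # only this reward is observed
        xhat = x / p[a]                     # unbiased estimate of arm a's reward
        w[a] *= math.exp(gamma * xhat / k)  # multiplicative update
    return w
```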

The striking thing about all of these models is that they fail to apply to typical real world problems. This failure is either by design (making assumptions which are simply rarely met), by failure to prove interesting results, or both.

And yet, many of the pieces are here. Active learning deals with generalization, online learning can deal with adversarial situations, and reinforcement learning deals with the situation where your choices influence what you can later learn. At a high level, there is much room for research here by design of a new model of exploration, new theoretical statements, or both.

I’ve been told “exploration is too hard”, and that’s a good warning to bear in mind, but I’m still hopeful for progress.

12/27/2005

Automated Labeling

One of the common trends in machine learning has been an emphasis on the use of unlabeled data. The argument goes something like “there aren’t many labeled web pages out there, but there are a huge number of web pages, so we must find a way to take advantage of them.” There are several standard approaches for doing this:

  1. Unsupervised Learning. You use only unlabeled data. In a typical application, you cluster the data and hope that the clusters somehow correspond to what you care about.
  2. Semisupervised Learning. You use both unlabeled and labeled data to build a predictor. The unlabeled data influences the learned predictor in some way.
  3. Active Learning. You have unlabeled data and access to a labeling oracle. You interactively choose which examples to label so as to optimize prediction accuracy.

It seems there is a fourth approach worth serious investigation—automated labeling. The approach goes as follows:

  1. Identify some subset of observed values to predict from the others.
  2. Build a predictor.
  3. Use the output of the predictor to define a new prediction problem.
  4. Repeat…
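In skeletal form, the loop might look like this (a hedged sketch: train stands in for any supervised learner, and each stage supplies its own feature map; both are assumptions):

```python
def automated_labeling(unlabeled, stages, initial_labeler, train):
    """Each stage learns to predict the previous stage's outputs from a
    new set of features, so predictor outputs become the next labels."""
    labeler = initial_labeler  # e.g., a sensor that directly observes labels
    predictors = []
    for features_of in stages:
        data = [(features_of(x), labeler(x)) for x in unlabeled]
        model = train(data)    # ordinary supervised learning
        # The new predictor, applied to raw observations, defines the
        # labels for the next stage (default args avoid late binding).
        labeler = lambda x, f=features_of, m=model: m(f(x))
        predictors.append(labeler)
    return predictors
```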

Examples of this sort seem to come up in robotics very naturally. An extreme version of this is:

  1. Predict nearby things given touch sensor output.
  2. Predict medium distance things given the nearby predictor.
  3. Predict far distance things given the medium distance predictor.

Some of the participants in the LAGR project are using this approach.

A less extreme version was the DARPA grand challenge winner where the output of a laser range finder was used to form a road-or-not predictor for a camera image.

These automated labeling techniques transform an unsupervised learning problem into a supervised learning problem, which has huge implications: we understand supervised learning much better and can bring to bear a host of techniques.

The set of work on automated labeling is sketchy—right now it is mostly just an observed-as-useful technique for which we have no general understanding. Some relevant bits of algorithm and theory are:

  1. Reinforcement learning to classification reductions which convert rewards into labels.
  2. Cotraining which considers a setting containing multiple data sources. When predictors using different data sources agree on unlabeled data, an inferred label is automatically created.
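A minimal sketch of the cotraining labeling rule (assuming two views per example and a caller-supplied confidence test; both are illustrative):

```python
def cotrain_new_labels(h1, h2, unlabeled_pairs, confident):
    """Cotraining-style automated labeling: when predictors trained on
    different data sources agree confidently, their shared prediction
    becomes an inferred label."""
    inferred = []
    for x1, x2 in unlabeled_pairs:  # two data sources for one example
        y1, y2 = h1(x1), h2(x2)
        if y1 == y2 and confident(h1, x1) and confident(h2, x2):
            inferred.append(((x1, x2), y1))
    return inferred
```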

It’s easy to imagine that undiscovered algorithms and theory exist to guide and use this empirically useful technique.

11/2/2005

Progress in Active Learning

Tags: Active,Solutions jl@ 11:08 pm

Several bits of progress have been made since Sanjoy pointed out the significant lack of theoretical understanding of active learning. This is an update on the progress I know of. As a refresher, active learning as meant here is:

  1. There is a source of unlabeled data.
  2. There is an oracle from which labels can be requested for unlabeled data produced by the source.
  3. The goal is to perform well with minimal use of the oracle.

Here is what I’ve learned:

  1. Sanjoy has developed sufficient and semi-necessary conditions for active learning given the assumptions of IID data and “realizability” (that one of the classifiers is a correct classifier).
  2. Nina, Alina, and I developed an algorithm for active learning relying on only the assumption of IID data. A draft is here.
  3. Nicolo, Claudio, and Luca showed that it is possible to do active learning in an entirely adversarial setting for linear threshold classifiers here. This was published a year or two ago and I recently learned about it.

All of these results are relatively ‘rough’: they don’t necessarily make good algorithms as stated (although the last one has a few experiments). None of these results are directly comparable because the assumptions vary. Comparing the assumptions and the results leads to a number of remaining questions:

  1. Do the sufficient and semi-necessary conditions apply to the IID-only case? The adversarial case?
  2. Is there a generic algorithm for any hypothesis space that works in the fully adversarial setting?
  3. What are special cases of these algorithms which are computationally tractable and useful?

The Foundations of Active Learning workshop at NIPS should be a good place to discuss these questions.

3/22/2005

Active learning

Tags: Active,Prediction Theory sanjoy@ 9:29 pm

Often, unlabeled data is easy to come by but labels are expensive. For instance, if you’re building a speech recognizer, it’s easy enough to get raw speech samples — just walk around with a microphone — but labeling even one of these samples is a tedious process in which a human must examine the speech signal and carefully segment it into phonemes. In the field of active learning, the goal is as usual to construct an accurate classifier, but the labels of the data points are initially hidden and there is a charge for each label you want revealed. The hope is that by intelligent adaptive querying, you can get away with significantly fewer labels than you would need in a regular supervised learning framework.

Here’s an example. Suppose the data lie on the real line, and the classifiers are simple thresholding functions, H = {h_w}:

h_w(x) = 1 if x > w, and 0 otherwise.

VC theory tells us that if the underlying distribution P can be classified perfectly by some hypothesis in H (called the realizable case), then in order to get a classifier with error rate at most e, it is enough to draw m = O(1/e) random labeled examples from P, and to return any classifier consistent with them. But suppose we instead draw m unlabeled samples from P. If we lay these points down on the line, their hidden labels are a sequence of 0’s followed by a sequence of 1’s, and the goal is to discover the point w at which the transition occurs. This can be accomplished with a simple binary search which asks for just log m labels. Thus active learning gives us an exponential improvement in the number of labels needed: by adaptively querying log m labels, we can automatically infer the rest of them.
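The binary search is simple enough to write down (a sketch; oracle charges one label per call and xs is sorted):

```python
def learn_threshold(xs, oracle):
    """Active learning of a 1-d threshold classifier by binary search.
    xs: unlabeled points sorted in increasing order; oracle(x) returns the
    hidden label (a run of 0's followed by a run of 1's). Uses O(log m)
    label queries instead of m."""
    lo, hi = 0, len(xs)  # invariant: first 1-labeled index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if oracle(xs[mid]) == 0:
            lo = mid + 1  # transition is strictly to the right of mid
        else:
            hi = mid      # transition is at mid or to its left
    # Any threshold between xs[lo-1] and xs[lo] is consistent with all labels.
    return xs[lo - 1] if lo > 0 else float("-inf")
```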

Unfortunately, not all that much is known beyond this toy example. To date, the single main theoretical result in the field is [FSST97]’s analysis of the query-by-committee (QBC) learning algorithm. In their model, the learner observes a stream of unlabeled data and makes spot decisions about whether or not to ask for a point’s label. They show that if the data is drawn uniformly from the surface of the d-dimensional unit sphere, and the hidden labels correspond perfectly to a homogeneous (i.e., through the origin) linear separator from this same distribution, then it is possible to achieve generalization error e after seeing O(d/e) points and requesting just O(d log 1/e) labels: an exponential improvement over the usual O(d/e) sample complexity of learning linear separators in a supervised setting. This remarkable result is tempered somewhat by the complexity of the QBC algorithm, which involves computing volumes of intermediate version spaces.

Some recent progress on active learning: [DKM05] show how a simple variant of the perceptron update can be used to achieve these same sample complexity bounds, in the same model. [D04] shows a variety of upper and lower bounds for active learning — for instance, if you allow linear separators which are non-homogeneous then in the above model the sample complexity necessarily shoots up to 1/e.

The theoretical terrain of active learning remains something of an unexplored wilderness. There has, however, been a lot of beautiful theory work (see [A02] for a roundup) on a related model in which the learner is allowed to synthesize query points, rather than simply choosing them from the pool of unlabeled data. This ran into some practical problems: [BL92] found that the resulting synthetic instances were often very difficult for a human to classify!

[A02] D. Angluin. Queries revisited.
[BL92] E. Baum and K. Lang. Query learning can work poorly when a human oracle is used.
[D04] S. Dasgupta. Analysis of a greedy active learning strategy.
[DKM05] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning.
[FSST97] Y. Freund, S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query-by-committee algorithm.
