A new direction of research seems to be arising in machine learning: Interactive Machine Learning. This isn’t a familiar term, although it does include some familiar subjects.
What is Interactive Machine Learning? The fundamental requirement is (a) learning algorithms which interact with the world and (b) learn.
For our purposes, let’s define learning as efficiently competing with a large set of possible predictors. Examples include:
- Online learning against an adversary (Avrim’s Notes). The interaction is almost trivial: the learning algorithm makes a prediction and then receives feedback. The learning is choosing based upon the advice of many experts.
- Active Learning. In active learning, the interaction is choosing which examples to label, and the learning is choosing from amongst a large set of hypotheses.
- Contextual Bandits. The interaction is choosing one of several actions and learning only the value of the chosen action (weaker than active learning feedback).
More forms of interaction will doubtless be noted and tackled as time progresses. I created a webpage for my own research on interactive learning which helps define the above subjects a bit more.
What isn’t Interactive Machine Learning?
There are several learning settings which fail either the interaction or the learning test.
- Supervised Learning doesn’t fit. The basic paradigm in supervised learning is that you ask experts to label examples, and then you learn a predictor based upon the predictions of these experts. This approach has essentially no interaction.
- Semisupervised Learning doesn’t fit. Semisupervised learning is almost the same as supervised learning, except that you also throw in many unlabeled examples.
- Bandit algorithms don’t fit. They have the interaction, but not much learning happens because the sample complexity results only allow you to choose from amongst a small set of strategies. (One exception is EXP4 (page 66), which can operate in the contextual bandit setting.)
- MDP learning doesn’t fit. The interaction is there, but the set of policies learned over is still too limited—essentially the policies just memorize what to do in each state.
- Reinforcement learning may or may not fit, depending on whether you think of it as MDP learning or in a much broader sense.
All of these not-quite-interactive-learning topics are of course very useful background information for interactive machine learning.
Why now? Because it’s time, of course.
- We know from other fields and various examples that interaction is very powerful.
- From online learning against an adversary, we know that independence of samples is unnecessary in an interactive setting—in fact you can even function against an adversary.
- From active learning, we know that interaction sometimes allows us to use exponentially fewer labeled samples than in supervised learning.
- From context bandits, we gain the ability to learn in settings where traditional supervised learning just doesn’t apply.
- From complexity theory we have “IP=PSPACE” roughly: interactive proofs are as powerful as polynomial space algorithms, which is a strong statement about the power of interaction.
- We know that this analysis is often tractable. For example, since Sanjoy‘s post on Active Learning, much progress has been made. Several other variations of interactive settings have been proposed and analyzed. The older online learning against an adversary work is essentially completely worked out for the simpler cases (except for computational issues).
- Real world problems are driving it. One of my favorite problems at the moment is the ad display problem—How do you learn which ad is most likely to be of interest? The contextual bandit problem is a big piece of this problem.
- It’s more fun. Interactive learning is essentially a wide-open area of research. There are plenty of kinds of natural interaction which haven’t been formalized or analyzed. This is great for beginnners, because it means the problems are simple, and their solution does not require huge prerequisites.
- It’s a step closer to AI. Many people doing machine learning want to reach AI, and it seems clear that any AI must engage in interactive learning. Mastering this problem is a next step.
- For natural interaction form [insert yours here], how do you learn? Some of the techniques for other methods of interactive learning may be helpful.
- How do we blend interactive and noninteractive learning? In many applications, there is already a pool of supervised examples around.
- Are there general methods for reducing interactive learning problems to supervised learning problems (which we know better)?
In the regression vs classification debate, I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in several situations.
- In active learning it is sometimes possible to find an e error classifier with just log(e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter spaces of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points.
- In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints while a squared loss regressor could not. For example, if there is one feature which determines whether a binary label has probability less than or greater than 0.5, a great classifier exists using just one feature. Because squared loss is sensitive to the exact probability, many more features may be required to learn well with respect to squared loss.
Exploration is one of the big unsolved problems in machine learning. This isn’t for lack of trying—there are many models of exploration which have been analyzed in many different ways by many different groups of people. At some point, it is worthwhile to sit back and see what has been done across these many models.
- Reinforcement Learning (1). Reinforcement learning has traditionally focused on Markov Decision Processes where the next state s’ is given by a conditional distribution P(s’|s,a) given the current state s and action a. The typical result here is that certain specific algorithms controlling an agent can behave within e of optimal for horizon T except for poly(1/e,T,S,A) “wasted” experiences (with high probability). This started with E3 by Satinder Singh and Michael Kearns. Sham Kakade’s thesis has significant discussion. Extensions have typically been of the form “under extra assumptions, we can prove more”, for example Factored-E3 and Metric-E3. (It turns out that the number of wasted samples can be less than the number of bits required to describe an MDP.) A weakness of all these results is that they rely upon (a) assumptions which are often false for real applications, (b) state spaces are too large, and (c) make a gurantee that is rather weak. Good performance is only guaranteed after suffering the possibly catastrophic consequences of exploration.
- Reinforcement Learning (2). Several recent papers have been attempting to analyze reinforcement learning via reduction. To date, all results are either nonconstructive or involve the use of various hints (oracle access to an optimal policy, the distribution over states of an optimal policy etc…) which short-circuit the need to explore. Obviously, these hints are not always available for real-world problems.
- Reinforcement Learning (3). Much of the rest of reinforcement learning has something to do with exploration, but it’s difficult to summarize succinctly.
- Online Learning. The n-armed bandit setting can be thought of as an MDP with one state and many actions. In some variants, there is even an adversary who chooses the payoffs of the arms in a non-stochastic manner. The typical result here says that you can compete well with the best constant action after some wasted actions. The exact number of wasted actions varies with the precise setting, but it is typically linear in the number of actions. This work can be traced back to (at least) Gittins indices which (unfortunately) don’t seem to have a good description available on the internet.
- Active Learning(1) The common current use of this term has to do with “selective sampling”=choosing unlabeled samples to label so as to minimize the number of labels required to learn a good predictor (typically a classifier). A typical result has the form: Given that your classifier comes from restricted class C and the labeled data distribution obeys some constraint, the number of adaptively labeled samples required O(log (1/e)) where e is the error rate. (It turns out that the even noisy distributions are allowed.) The constraints on distributions and hypothesis spaces required to achieve these speedups are often severe.
- Active Learning(2) Membership query learning is another example. The distinguishing difference with respect to selective sampling is that the a labeled can be requested for any unlabeled point (not just those drawn according to some natural distribution). Several relatively strong results hold for membership query learning, but there is a significant drawback: it seems that the ability to query for a label on an arbitrary point is not very natural. For example, imagine query whether a text document is about sports or politics when the text is generated at random.
- Active Learning(3) Experimental design (which is mostly based in statistics), is often about finding the extrema of some function rather than generalization. Often, the data generating distribution is assumed to come from some specific parametric family. Unfortunately, my knowledge is sketchy here.
The striking thing about all of these models is that they fail to apply to typical real world problems. This failure is either by design (making assumptions which are simply rarely met), by failure to prove interesting results, or both.
And yet, many of the pieces are here. Active learning deals with generalization, online learning can deal with adversarial situations, and reinforcement learning deals with the situation where your choices influence what you can later learn. At a high level, there is much room for research here by design of a new model of exploration, new theoretical statements, or both.
I’ve been told “exploration is too hard”, and that’s a good warning to bear in mind, but I’m still hopeful for progress.
One of the common trends in machine learning has been an emphasis on the use of unlabeled data. The argument goes something like “there aren’t many labeled web pages out there, but there are a huge number of web pages, so we must find a way to take advantage of them.” There are several standard approaches for doing this:
- Unsupervised Learning. You use only unlabeled data. In a typical application, you cluster the data and hope that the clusters somehow correspond to what you care about.
- Semisupervised Learning. You use both unlabeled and labeled data to build a predictor. The unlabeled data influences the learned predictor in some way.
- Active Learning. You have unlabeled data and access to a labeling oracle. You interactively choose which examples to label so as to optimize prediction accuracy.
It seems there is a fourth approach worth serious investigation—automated labeling. The approach goes as follows:
- Identify some subset of observed values to predict from the others.
- Build a predictor.
- Use the output of the predictor to define a new prediction problem.
Examples of this sort seem to come up in robotics very naturally. An extreme version of this is:
- Predict nearby things given touch sensor output.
- Predict medium distance things given the nearby predictor.
- Predict far distance things given the medium distance predictor.
Some of the participants in the LAGR project are using this approach.
A less extreme version was the DARPA grand challenge winner where the output of a laser range finder was used to form a road-or-not predictor for a camera image.
These automated labeling techniques transform an unsupervised learning problem into a supervised learning problem, which has huge implications: we understand supervised learning much better and can bring to bear a host of techniques.
The set of work on automated labeling is sketchy—right now it is mostly just an observed-as-useful technique for which we have no general understanding. Some relevant bits of algorithm and theory are:
- Reinforcement learning to classification reductions which convert rewards into labels.
- Cotraining which considers a setting containing multiple data sources. When predictors using different data sources agree on unlabeled data, an inferred label is automatically created.
It’s easy to imagine that undiscovered algorithms and theory exist to guide and use this empirically useful technique.
Several bits of progress have been made since Sanjoy pointed out the significant lack of theoretical understanding of active learning. This is an update on the progress I know of. As a refresher, active learning as meant here is:
- There is a source of unlabeled data.
- There is an oracle from which labels can be requested for unlabeled data produced by the source.
- The goal is to perform well with minimal use of the oracle.
Here is what I’ve learned:
- Sanjoy has developed sufficient and semi-necessary conditions for active learning given the assumptions of IID data and “realizability” (that one of the classifiers is a correct classifier).
- Nina, Alina, and I developed an algorithm for active learning relying on only the assumption of IID data. A draft is here.
- Nicolo, Claudio, and Luca showed that it is possible to do active learning in an entirely adversarial setting for linear threshold classifiers here. This was published a year or two ago and I recently learned about it.
All of these results are relatively ‘rough’: they don’t necessarily make good algorithms as stated (although the last one has a few experiments). None of these results are directly comparable because the assumptions vary. Comparing the assumptions and the results leads to a number of remaining questions:
- Do the sufficient and seminecessary conditions apply to the IID only case? The adversarial case?
- Is there a generic algorithm for any hypothesis space that works in the fully adversarial setting?
- What are special cases of these algorithms which are computationally tractable and useful?
The Foundations of Active Learning workshop at NIPS should be a good place to discuss these questions.