A new direction of research seems to be arising in machine learning: Interactive Machine Learning. This isn’t a familiar term, although it does include some familiar subjects.

**What is Interactive Machine Learning?** The fundamental requirement is (a) learning algorithms which interact with the world and (b) learn.

For our purposes, let’s define learning as efficiently competing with a large set of possible predictors. Examples include:

- Online learning against an adversary (Avrim’s Notes). The interaction is almost trivial: the learning algorithm makes a prediction and then receives feedback. The learning is choosing based upon the advice of many experts.
**Active Learning**. In active learning, the interaction is choosing which examples to label, and the learning is choosing from amongst a large set of hypotheses.**Contextual Bandits**. The interaction is choosing one of several actions and learning only the value of the chosen action (weaker than active learning feedback).

More forms of interaction will doubtless be noted and tackled as time progresses. I created a webpage for my own research on interactive learning which helps define the above subjects a bit more.

**What isn’t Interactive Machine Learning?**

There are several learning settings which fail either the interaction or the learning test.

- Supervised Learning doesn’t fit. The basic paradigm in supervised learning is that you ask experts to label examples, and then you learn a predictor based upon the predictions of these experts. This approach has essentially no interaction.
- Semisupervised Learning doesn’t fit. Semisupervised learning is almost the same as supervised learning, except that you also throw in many unlabeled examples.
- Bandit algorithms don’t fit. They have the interaction, but not much learning happens because the sample complexity results only allow you to choose from amongst a small set of strategies. (One exception is EXP4 (page 66), which can operate in the contextual bandit setting.)
- MDP learning doesn’t fit. The interaction is there, but the set of policies learned over is still too limited—essentially the policies just memorize what to do in each state.
- Reinforcement learning may or may not fit, depending on whether you think of it as MDP learning or in a much broader sense.

All of these not-quite-interactive-learning topics are of course very useful background information for interactive machine learning.

**Why now?** Because it’s time, of course.

- We know from other fields and various examples that interaction is very powerful.
- From online learning against an adversary, we know that independence of samples is unnecessary in an interactive setting—in fact you can even function against an adversary.
- From active learning, we know that interaction sometimes allows us to use exponentially fewer labeled samples than in supervised learning.
- From context bandits, we gain the ability to learn in settings where traditional supervised learning just doesn’t apply.
- From complexity theory we have “IP=PSPACE” roughly: interactive proofs are as powerful as polynomial space algorithms, which is a strong statement about the power of interaction.

- We know that this analysis is often tractable. For example, since Sanjoy‘s post on Active Learning, much progress has been made. Several other variations of interactive settings have been proposed and analyzed. The older online learning against an adversary work is essentially completely worked out for the simpler cases (except for computational issues).
- Real world problems are driving it. One of my favorite problems at the moment is the ad display problem—How do you learn which ad is most likely to be of interest? The contextual bandit problem is a big piece of this problem.
- It’s more fun. Interactive learning is essentially a wide-open area of research. There are plenty of kinds of natural interaction which haven’t been formalized or analyzed. This is great for beginnners, because it means the problems are simple, and their solution does not require huge prerequisites.
- It’s a step closer to AI. Many people doing machine learning want to reach AI, and it seems clear that any AI must engage in interactive learning. Mastering this problem is a next step.

**Basic Questions**

- For natural interaction form [insert yours here], how do you learn? Some of the techniques for other methods of interactive learning may be helpful.
- How do we blend interactive and noninteractive learning? In many applications, there is already a pool of supervised examples around.
- Are there general methods for reducing interactive learning problems to supervised learning problems (which we know better)?

Regarding the MDP, I have a question: Do you just consider finite state – finite action MDPs as MDP?!

If not, I cannot understand why you say the set of policies are too limited.

Yes, I was thinking about finite state MDPs.

One difficulty with MDP learning for infinite states is that Sham and I worked out a lower bound: you must have O(SA) experience where S is the number of states and A is the number of actions per state. This lower bound can be avoided by introducing extra assumptions, but it’s not yet clear to me how best to do that.

Would http://www.toppersearch.com be an example of interactive learning? In this demo we give users the ability to learn a model during the interaction with a search engine. In our case the learning algorithm is also simple but works quite well with small amounts of text. We use it in the context of search where users can label a few results of a search and then reorder to see the effect of their model on the results. Users can then continue to search, label and reorder as they refine both the model and what they are searching for. Try it out and let me know what you think. We are also applying this idea of interacting with learning tools in other contexts as well – news, blogs, social network data. I welcome your comments or suggestions.

Looks like a fun example to me. It seems to fit the contextual bandit setting, except that you might get feedback about multiple possibilities rather than just 1.

There are many details which could be of interest to the technically minded—how it scales, what the underlying predictor is, etc…

I am not sure why MDP’s dont dall in this category. By doing Q-learning we can lern the “goodness” of every policy. So there is interaction and learning, and hence it is a form of interactive learning. Pardon me in case I am wrong

The difficulty with Q learning is that it isn’t robust to the quantity of information. For example, if the state space is a 1000 dimensional bit vector (which is only moderate-sized by supervised learning standards), then the very best Q-learning algorithms (check out delayed Q-learning) might require a mere O(2

^{1000}) experience. This is too large to be practicable.Can your main difficulties with classifying reinforcement learning as interactive learning be resolved by changing the representation of the state space? For example, Least Squares Policy Iteration learns the Q-value function as a linear combination of some set of basis functions on state-action pairs, and in practice it learns reasonably well in continuous state spaces. The lower bounds on experience you’ve stated don’t apply then, yes?

Also, I was kicking around a problem last year where data was arriving bit by bit into a system, and it was the system’s job to classify the entire sequence. At each time step, the system could either choose to classify, or wait for the next piece of data, with the catch being that there was some cost associated with waiting. The obvious tradeoff is between the certainty of classification and the cost of acquiring new information. It seems like humans come up against this situation a lot in the real world. Does anyone know of anyone working on a problem of this sort?

One more basic question I would add to the mix is: How do you evaluate an interactive machine learner?

With standard classification-oriented machine learning, this evaluation is straightforward. You just compare two systems based on how many right/wrong, with whatever penalties or rewards are associated with the right/wrong classifications.

However, with interactive learning, you have a multi-stage process in which the machine’s decision at one point affects.. and (key issue here)

changes.. the state of the outside, non-machine entity with which the algorithm is interacting. That changed external state then feeds back a different set of interactions to the machine learning algorithm.It therefore becomes very difficult to separate out the effects of a changing and unpredictable external entity from the effects of the machine learning. And it therefore becomes very difficult to compare one interactive machine learning algorithm against another interactive machine learning algorithm, because so much depends on that feedback loop.

I suppose one thing you could do is just compare the final outcome between two different interactions, like with A/B testing on the web.

But that has always remained unsatisfying to me, because no two tests are exactly the same, because no two interaction sequences are the same. You cannot control for enough important variables.

I don’t know theorem statements for LSPI which are strong enough to make me think the problem is solved. Your empirical experience and mine may differ—what is the most convincing application of LSPI you have seen?

All of the tractable analysis for interactive learning I know doesn’t deal with the “you change the world by your interaction” problem. That seems to be the essence of the hardness of the full scale reinforcement learning problem.

The collection of settings we’ve developed (active learning, contextual bandits, etc…) which avoid this problem seem helpful, but ultimately this problem must be dealt with.