Exploration is one of the big unsolved problems in machine learning. This isn’t for lack of trying—there are many models of exploration which have been analyzed in many different ways by many different groups of people. At some point, it is worthwhile to sit back and see what has been done across these many models.

**Reinforcement Learning (1)**. Reinforcement learning has traditionally focused on Markov Decision Processes where the next state*s’**P(s’|s,a)*given the current state*s*and action*a*. The typical result here is that certain specific algorithms controlling an agent can behave within*e*of optimal for horizon*T*except for*poly(1/e,T,S,A)*“wasted” experiences (with high probability). This started with E^{3}by Satinder Singh and Michael Kearns. Sham Kakade’s thesis has significant discussion. Extensions have typically been of the form “under extra assumptions, we can prove more”, for example Factored-E^{3}and Metric-E^{3}. (It turns out that the number of wasted samples can be less than the number of bits required to describe an MDP.) A weakness of all these results is that they rely upon (a) assumptions which are often false for real applications, (b) state spaces are too large, and (c) make a gurantee that is rather weak. Good performance is only guaranteed after suffering the possibly catastrophic consequences of exploration.**Reinforcement Learning (2)**. Several recent papers have been attempting to analyze reinforcement learning via reduction. To date, all results are either nonconstructive or involve the use of various hints (oracle access to an optimal policy, the distribution over states of an optimal policy etc…) which short-circuit the need to explore. Obviously, these hints are not always available for real-world problems.**Reinforcement Learning (3)**. Much of the rest of reinforcement learning has something to do with exploration, but it’s difficult to summarize succinctly.

**Online Learning**. The n-armed bandit setting can be thought of as an MDP with one state and many actions. In some variants, there is even an adversary who chooses the payoffs of the arms in a non-stochastic manner. The typical result here says that you can compete well with the best constant action after some wasted actions. The exact number of wasted actions varies with the precise setting, but it is typically linear in the number of actions. This work can be traced back to (at least) Gittins indices which (unfortunately) don’t seem to have a good description available on the internet.**Active Learning(1)**The common current use of this term has to do with “selective sampling”=choosing unlabeled samples to label so as to minimize the number of labels required to learn a good predictor (typically a classifier). A typical result has the form: Given that your classifier comes from restricted class*C*and the labeled data distribution obeys some constraint, the number of adaptively labeled samples required*O(log (1/e))*where*e*is the error rate. (It turns out that the even noisy distributions are allowed.) The constraints on distributions and hypothesis spaces required to achieve these speedups are often severe.**Active Learning(2)**Membership query learning is another example. The distinguishing difference with respect to selective sampling is that the a labeled can be requested for any unlabeled point (not just those drawn according to some natural distribution). Several relatively strong results hold for membership query learning, but there is a significant drawback: it seems that the ability to query for a label on an arbitrary point is not very natural. For example, imagine query whether a text document is about sports or politics when the text is generated at random.**Active Learning(3)**Experimental design (which is mostly based in statistics), is often about finding the extrema of some function rather than generalization. Often, the data generating distribution is assumed to come from some specific parametric family. Unfortunately, my knowledge is sketchy here.

The striking thing about all of these models is that they fail to apply to typical real world problems. This failure is either by design (making assumptions which are simply rarely met), by failure to prove interesting results, or both.

And yet, many of the pieces are here. Active learning deals with generalization, online learning can deal with adversarial situations, and reinforcement learning deals with the situation where your choices influence what you can later learn. At a high level, there is much room for research here by design of a new model of exploration, new theoretical statements, or both.

I’ve been told “exploration is too hard”, and that’s a good warning to bear in mind, but I’m still hopeful for progress.