One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as:

- The world announces an observation
*x*. - The agent makes a choice
*a*. - The world announces a reward
*r*.

The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating *x* to *a* or *a* to *r* is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of the master problem general enough to provide an effective solution in real problems?

The process of specializing is a tricky business, as you want to simultaneously achieve tractable analysis, sufficient generality to be useful, and yet capture a new aspect of the master problem not otherwise addressed. Consider: How is it even possible to choose a setting where analysis is tractable before you even try to analyze it? What follows is my mental map of different specializations.

### Online Learning

The online learning setting is perhaps the most satisfying specialization more general than standard batch learning at present, because it turns out to additionally provide tractable algorithms for many batch learning settings.

Standard online learning models specialize in two ways: You assume that the choice of action in step 2 does not influence future observations and rewards, and you assume additional information is available in step 3, a retrospectively available reward for each action. The algorithm for an agent in this setting typically has a given name—gradient descent, weighted majority, Winnow, etc…

The general algorithm here is a more refined version of follow-the-leader than in batch learning, with online update rules. An awesome discovery about this setting is that it’s possible to compete with a set of predictors even when the world is totally adversarial, substantially strengthening our understanding of what learning is and where it might be useful. For this adversarial setting, the algorithm alters into a form of follow-the-perturbed leader, where the learning algorithm randomizes it’s action amongst the set of plausible alternatives in order to defeat an adversary.

The standard form of argument in this setting is a potential argument, where at each step you show that if the learning algorithm performs badly, there is some finite budget from which an adversary deducts it’s ability. The form of the final theorem is that you compete with the accumulated reward of a set any one-step policies *h:X – > A*, with a dependence log(#policies) or weaker in regret, a measure of failure to compete.

A good basic paper to read here is:

Nick Littlestone and Manfred Warmuth, The Weighted Majority Algorithm, which shows the basic information-theoretic claim clearly. Vovk‘s page on aggregating algorithms is also relevant, although somewhat harder to read.

Provably computationally tractable special cases all have linear structure, either on rewards or policies. Good results are often observed empirically by applying backpropagation for nonlinear architectures, with the danger of local minima understood.

### Bandit Analysis

In the bandit setting, step 1 is omitted, and the difficulty of the problem is weakened by assuming that action in step (2) don’t alter future rewards. The goal is generally to compete with all constant arm strategies.

Analysis in this basic setting started very specialized with Gittin’s Indicies and gradually generalized over time to include IID and fully adversarial settings, with EXP3 a canonical algorithm. If there are *k* strategies available, the standard theorem states that you can compete with the set of all constant strategies up to regret *k*. The most impressive theoretical discovery in this setting is that the dependence on *T*, the number of timesteps, is not substantially worse than supervised learning despite the need to explore.

Given the dependence on *k* all of these algorithms are computationally tractable.

However, the setting is flawed, because the set of constant strategies is inevitably too weak in practice—it’s an example of optimal decision making given that you ignore almost all information. Adding back the observation in step 1 allows competing with a large set of policies, while the regret grows only as log(#policies) or weaker. Canonical algorithms here are EXP4 (computationally intractable, but information theoretically near-optimal), Epoch-Greedy (computationally tractable given an oracle optimizer), and the Offset Tree providing a reduction to supervised binary classification.

### MDP analysis

A substantial fraction of reinforcement learning has specialized on the Markov Decision Process setting, where the observation *x* is a state *s*, which is a sufficient statistic for predicting all future observations. Compared to the previous settings, dealing with time dependence is explicitly required, but learning typically exists in only primitive forms.

The first work here was in the 1950’s where the actual MDP was assumed known and the problem was simply computing a good policy, typically via dynamic programming style solutions. More recently, principally in the 1990’s, the setting where the MDP was not assumed known was analyzed. A very substantial theoretical advancement was the *E ^{3}* algorithm which requires only

*O(S*experience to learn a near-optimal policy where the world is an MDP with

^{2}A)*S*state and

*A*actions per state. A further improvement on this is Delayed Q-Learning, where only

*O(SA)*experience is required. There are many variants on the model-based approach and not much for the model-free approach. Lihong Li‘s thesis probably has the best detailed discussion at present.

There are some unsatisfactory elements of the analysis here. First, I’ve suppressed the dependence on the definition of “approximate” and the typical time horizon, for which the dependence is often bad and the optimality is unclear. The second is the dependence on *S*, which is intuitively unremovable, with this observation formalized in the lower bound Sham and I worked on (section 8.6 of Sham’s thesis). Empirically, these and related algorithms are often finicky, because in practice the observation isn’t a sufficient statistic and the number of states isn’t small, so approximating things as such is often troublesome.

A very different variant of this setting is given by Control theory, which I know less about than I should. The canonical setting for control theory is with a known MDP having linear transition dynamics. More exciting are the system identification problems where the system must be first identified. I don’t know any good relatively assumption free results for this setting.

### Oracle Advice Shortcuts

Techniques here specialize the setting to situations in which some form of oracle advice is available when a policy is being learned. A good example of this is an oracle which provides samples from the distribution of observations visited by a good policy. Using this oracle, conservative policy iteration is guaranteed to perform well, so long as a base learning algorithm can predict well. This algorithm was refined and improved a bit by PSDP, which works via dynamic programming, improving guarantees to work with regret rather than errors.

An alternative form of oracle is provide by access to a good policy at training time. In this setting, Searn has similar provable guarantees with a similar analysis.

The oracle based algorithms appear to work well anywhere these oracles are available.

### Uncontrolled Delay

In the uncontrolled delay setting, step (2) is removed, and typically steps (1) and (3) are collapsed into one observation, where the goal becomes state tracking. Most of the algorithms for state tracking are heavily model dependent, implying good success within particular domains. Examples include Kalman filters, hidden markov models, and particle filters which typical operate according to an explicit probabilistic model of world dynamics.

Relatively little is known for a nonparametric version of this problem. One observation is that the process of predicting adjacent observations well forms states as a byproduct when the observations are sufficiently rich as detailed here.

A basic question is: What’s missing from the above? A good answer is worth a career.

Suppose Step 3 was rephrased as as “the world announces its response” from which we must derive or infer a notion of reward. Another way to think about this is to simply remove Step 3 all together. The next observation announced by the world would then be dependent upon the previous actions taken. I think in many cases this can be reduced to some kind oracle/expert advice setting, but in general it seems unrealistic to assume that we have direct, unbiased access to the “true” notion of reward.

I think you’ve touched earlier on the case where revealing the label of an example carries a specific cost and the learning algorithm must make decisions about which labels to reveal. (I’ve forgotten the name of this problem) Does this task fit into a specific specialization?

Or more generally, is there a mapping between this mental map and the Yahoo machine learning challenges that you presented a while ago?

You are thinking of active learning. The setting for active learning isn’t a perfect fit as a specialization, for two reasons:

(1) There are two criteria being optimized (minimizing error rate and the number of labels). This is cosmetic, as we could certainly argue for and introduce an optimization on both criteria.

(2) Individual rewards aren’t revealed after each example, but rather after those queried. This is an issue similar to the one that Yisong points out.

Many of the Y! problems such as contextual bandit learning do seem very related to this paradigm.

I think Yisong’s comment is right on the money. Games often have only a single bit of explicit reward data available (you won, you lost) at the end. Removing the explicit reward step also allows generalizations to problems where too much data is available: Computing over the full observation x may not be possible in time-limited problems, without the explicit reward step, choice a could be a decision about what data to request. It doesn’t make sense for that decision to be followed by a reward, rather it results in a particular observation x’ within the full world of observations x. On the other hand, without at least a single bit of reward data there’s really nothing to maximize is there? So perhaps it is best understood with step 3 returning 0 at most steps.

Step 2 also has an interesting specialization is that the domain of choices in not necessarily finite or explicitly defined. Some problems are trivial as multiple choice, but difficult when you need to “think outside the box” and use a non-standard action. On one hand these infinite choice sets can (usually?) be represented as a series of finite, primitive actions, but in some problems you may have to contend with never being able to explicitly list the possible actions at any step. So although we usually think about learning over the 1-2-3 series, each may require its own learning process separate from the others

I certainly concur about the reward often being ambiguous. One way to think of this is as a reward shaping problem where the reward is typically clear in aggregate over long time scales, and you want to decompose it over smaller time scales.

I agree with the basic message of your post, especially your last sentence (in my own humble way, I am motivated by essentially these questions in different application domains and am trying to make a career of it).

However, by focusing attention exclusively on interesting algorithmic paradigms, you have failed to mention one of the potential ways in which such problems may be solved in the real world by real (biological) agents. As you point out, ‘essentially all forms of life are agents operating in this setting’. However, it is not clear that any such natural agent does a particularly good job of solving *all (or even a suitably large fraction of) instances* of generic algorithmic problems (e.g., online learning being one such general problem).

Instead, natural agents seem to carefully exploit ‘ecological’ regularities to restrict the space of problems they solve, and to fine tune the representation within which problems are posed to them. So, despite not having generic powerful algorithms, they demonstrate practically powerful and robust behaviors.

Even if one were not particularly interested in the cognitive science aspects of the above, it is still of interest to ask whether learning algorithms can demonstrate a similarly selective property of focusing on the problems that ‘actually’ matter. One key feature in this setting would be to understand learning over very long periods, in a lifelong sense, where the problems will necessarily be varied over time.

In my opinion, some of the remaining work to do lies into the definition and the restrictions of the relationships between observations, available actions and rewards. Because if the Master Problem can represent *any* possible problem, it can also represent unsolvable problems as well as unrealistic ones.

An interesting area of research would then focus on defining new restricting assumptions of existing models (like MDPs, HMMs, POMDPs …) to balance their generality and their tractability.

As an example, we recently study quasi deterministic POMDPs where we assume that the observation perceived would be the underlying state but with an additive noise while transitions are deterministic. Some theoretical results show that this “restricted” model is more tractable and still captures interesting and challenging realistic problems.

This work may be inserted into the MDP analysis section but we would rather talk about “Constrained Design Analysis” where we try to design “realistic” but tractable decision processes.

The steps of master problem could be analysed from a slightly different point of view.

In general the observation depends on agent, and process models of ‘world’ and ‘agent’ are asynchronous. During the observation process the world can change. The same applies to decision making process.

Reward also is perceived by an agent, yet usually this step is simplified. Furthermore we can think of external and internal rewards as separate action triggers, and also of reward prediction.

The problem could be specialized a bit by saying something about the structure of observations and actions. They don’t have to be just numbers.

Some possible representations of an observation:

a) A vector of features

b) A 2D image

c) A graph of relations

d) ???