Specializations of the Master Problem

One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as:

  1. The world announces an observation x.
  2. The agent makes a choice a.
  3. The world announces a reward r.

The goal here is to maximize the sum of rewards over the agent's lifetime. No particular structure relating x to a, or a to r, is implied by this setting, so we do not know effective general algorithms for the agent. It's very easy to prove lower bounds showing that an agent cannot hope to succeed here: just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research: how can we find tractable specializations of the master problem general enough to provide effective solutions to real problems?
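As a purely illustrative sketch, the interaction protocol can be written as a simple loop; the World and Agent interfaces below are made up for the example, not taken from any particular library.

```python
# A minimal sketch of the master interaction loop (illustrative interfaces only).

class World:
    def observe(self):          # step 1: the world announces an observation x
        raise NotImplementedError
    def reward(self, action):   # step 3: the world announces a reward r
        raise NotImplementedError

class Agent:
    def act(self, x):                  # step 2: the agent makes a choice a
        raise NotImplementedError
    def learn(self, x, action, r):     # incorporate the observed reward
        raise NotImplementedError

def run(world, agent, T):
    total = 0.0
    for _ in range(T):
        x = world.observe()
        a = agent.act(x)
        r = world.reward(a)
        agent.learn(x, a, r)
        total += r
    return total  # the agent's goal is to maximize this sum
```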

The process of specializing is a tricky business, as you want to simultaneously achieve tractable analysis, sufficient generality to be useful, and yet capture a new aspect of the master problem not otherwise addressed. Consider: How is it even possible to choose a setting where analysis is tractable before you even try to analyze it? What follows is my mental map of different specializations.

Online Learning

The online learning setting is perhaps the most satisfying specialization that is more general than standard batch learning at present, because it turns out to additionally provide tractable algorithms for many batch learning settings.

Standard online learning models specialize in two ways: you assume that the choice of action in step 2 does not influence future observations and rewards, and you assume additional information is available in step 3, namely a retrospectively available reward for each action. The algorithm for an agent in this setting typically has a given name, such as gradient descent, weighted majority, or Winnow.

The general algorithm here is a more refined version of follow-the-leader than in batch learning, with online update rules. An awesome discovery about this setting is that it's possible to compete with a set of predictors even when the world is totally adversarial, substantially strengthening our understanding of what learning is and where it might be useful. For this adversarial setting, the algorithm alters into a form of follow-the-perturbed-leader, where the learning algorithm randomizes its action amongst the set of plausible alternatives in order to defeat an adversary.

The standard form of argument in this setting is a potential argument, where at each step you show that if the learning algorithm performs badly, the adversary must expend some of a finite budget of adversarial power. The form of the final theorem is that you compete with the accumulated reward of a set of one-step policies h: X -> A, with a dependence of log(#policies) or weaker in the regret, a measure of the failure to compete.

A good basic paper to read here is:
Nick Littlestone and Manfred Warmuth, The Weighted Majority Algorithm, which shows the basic information-theoretic claim clearly. Vovk's page on aggregating algorithms is also relevant, although somewhat harder to read.
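For concreteness, here is a minimal sketch of the multiplicative-weights idea behind weighted majority; the binary-prediction setup, penalty factor, and array shapes are assumptions made for the example, not details from the paper.

```python
import numpy as np

def weighted_majority(expert_preds, labels, beta=0.5):
    """Multiplicative-weights sketch of weighted majority (illustrative).

    expert_preds: array of shape (T, n) with each expert's 0/1 prediction per round.
    labels:       array of shape (T,) with the true 0/1 outcomes.
    beta:         penalty factor applied to experts that err.
    """
    T, n = expert_preds.shape
    w = np.ones(n)
    mistakes = 0
    for t in range(T):
        # Predict with the weighted majority vote of the experts.
        vote = np.dot(w, expert_preds[t]) >= w.sum() / 2.0
        mistakes += int(vote != labels[t])
        # Multiplicatively penalize every expert that was wrong this round.
        w *= np.where(expert_preds[t] != labels[t], beta, 1.0)
    return mistakes
```

The potential argument above acts on the total weight: every mistake of the algorithm forces a constant fraction of the total weight to be discounted, which is where the log(#experts) dependence comes from.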

Provably computationally tractable special cases all have linear structure, either in the rewards or in the policies. Good results are often observed empirically by applying backpropagation to nonlinear architectures, with the danger of local minima understood.

Bandit Analysis

In the bandit setting, step 1 is omitted, and the problem is made easier by assuming that the action in step (2) doesn't alter future rewards. The goal is generally to compete with all constant arm strategies.

Analysis in this basic setting started very specialized with Gittins indices and gradually generalized over time to include IID and fully adversarial settings, with EXP3 a canonical algorithm. If there are k strategies available, the standard theorem states that you can compete with the set of all constant strategies with regret that grows with k (for EXP3, on the order of sqrt(T k log k)). The most impressive theoretical discovery in this setting is that the dependence on T, the number of timesteps, is not substantially worse than in supervised learning despite the need to explore.
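As an illustration of the canonical algorithm, here is a minimal EXP3 sketch; the pull(arm) interface, the fixed exploration rate gamma, and rewards in [0,1] are assumptions made for the example.

```python
import numpy as np

def exp3(pull, k, T, gamma=0.1):
    """Sketch of EXP3 for k arms over T rounds (illustrative, untuned).

    pull(arm) -> reward in [0,1] is an assumed interface to the bandit.
    """
    w = np.ones(k)
    rng = np.random.default_rng(0)
    total = 0.0
    for _ in range(T):
        # Mix the exponential-weights distribution with uniform exploration.
        p = (1.0 - gamma) * w / w.sum() + gamma / k
        arm = rng.choice(k, p=p)
        r = pull(arm)
        total += r
        # Importance-weighted reward estimate: unbiased for the pulled arm.
        r_hat = r / p[arm]
        w[arm] *= np.exp(gamma * r_hat / k)
    return total
```

The importance-weighted estimate is what makes the update unbiased despite only observing the pulled arm, and the uniform exploration term keeps those estimates from blowing up.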

Given the dependence on k, all of these algorithms are computationally tractable.

However, the setting is flawed, because the set of constant strategies is inevitably too weak in practice: it's an example of optimal decision making given that you ignore almost all information. Adding back the observation in step 1 allows competing with a large set of policies, while the regret grows only as log(#policies) or weaker. Canonical algorithms here are EXP4 (computationally intractable, but information-theoretically near-optimal), Epoch-Greedy (computationally tractable given an oracle optimizer), and the Offset Tree, which provides a reduction to supervised binary classification.

MDP analysis

A substantial fraction of reinforcement learning has specialized on the Markov Decision Process setting, where the observation x is a state s, which is a sufficient statistic for predicting all future observations. Compared to the previous settings, dealing with time dependence is explicitly required, but learning typically exists in only primitive forms.

The first work here was in the 1950s, where the actual MDP was assumed known and the problem was simply computing a good policy, typically via dynamic-programming-style solutions. More recently, principally in the 1990s, the setting where the MDP is not assumed known was analyzed. A very substantial theoretical advancement was the E^3 algorithm, which requires only O(S^2 A) experience to learn a near-optimal policy when the world is an MDP with S states and A actions per state. A further improvement on this is Delayed Q-Learning, where only O(SA) experience is required. There are many variants on the model-based approach and not much for the model-free approach. Lihong Li's thesis probably has the best detailed discussion at present.
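For the known-MDP case, the dynamic-programming solution is standard; here is a minimal tabular value iteration sketch, where the array shapes and discount factor are assumptions made for the example.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration for a known MDP (illustrative sketch).

    P: array of shape (S, A, S), transition probabilities P[s, a, s'].
    R: array of shape (S, A), expected immediate rewards.
    Returns the optimal value function and a greedy policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=1)
```

When the MDP is not known, model-based algorithms in the E^3 family essentially have to estimate P and R from experience while exploring deliberately, which matches the intuition behind the roughly S^2 A experience requirement (one parameter per state-action-next-state triple).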

There are some unsatisfactory elements of the analysis here. First, I've suppressed the dependence on the definition of "approximate" and on the typical time horizon, for which the dependence is often bad and the optimality is unclear. The second is the dependence on S, which is intuitively unremovable, an observation formalized in the lower bound Sham and I worked on (section 8.6 of Sham's thesis). Empirically, these and related algorithms are often finicky, because in practice the observation isn't a sufficient statistic and the number of states isn't small, so approximating things this way is often troublesome.

A very different variant of this setting is given by control theory, which I know less about than I should. The canonical setting in control theory is a known MDP with linear transition dynamics. More exciting are the system identification problems, where the system must first be identified. I don't know any good, relatively assumption-free results for this setting.

Oracle Advice Shortcuts

Techniques here specialize the setting to situations in which some form of oracle advice is available when a policy is being learned. A good example of this is an oracle which provides samples from the distribution of observations visited by a good policy. Using this oracle, conservative policy iteration is guaranteed to perform well, so long as a base learning algorithm can predict well. This algorithm was refined and improved a bit by PSDP, which works via dynamic programming, improving guarantees to work with regret rather than errors.

An alternative form of oracle is provided by access to a good policy at training time. In this setting, Searn has similar provable guarantees with a similar analysis.

The oracle based algorithms appear to work well anywhere these oracles are available.

Uncontrolled Delay

In the uncontrolled delay setting, step (2) is removed, and typically steps (1) and (3) are collapsed into one observation, where the goal becomes state tracking. Most of the algorithms for state tracking are heavily model dependent, implying good success within particular domains. Examples include Kalman filters, hidden Markov models, and particle filters, which typically operate according to an explicit probabilistic model of world dynamics.
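As an example of what "model dependent" means here, below is a minimal sketch of one Kalman filter predict/update step; the matrices A, C, Q, R are precisely the explicit model of dynamics and noise that must be specified up front.

```python
import numpy as np

def kalman_step(x, P, z, A, C, Q, R):
    """One predict/update step of a linear-Gaussian Kalman filter (illustrative).

    x, P: current state estimate and its covariance.
    z:    new observation.
    A, Q: assumed linear dynamics and process-noise covariance.
    C, R: assumed observation model and observation-noise covariance.
    """
    # Predict: push the estimate through the assumed dynamics.
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update: correct the prediction with the new observation.
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - C @ x_pred)
    P_new = (np.eye(len(x)) - K @ C) @ P_pred
    return x_new, P_new
```

A particle filter replaces the Gaussian posterior with weighted samples, but it still requires an explicit transition and observation model in the same way.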

Relatively little is known for a nonparametric version of this problem. One observation is that the process of predicting adjacent observations well forms states as a byproduct when the observations are sufficiently rich, as detailed here.

A basic question is: What’s missing from the above? A good answer is worth a career.

Functionally defined Nonlinear Dynamic Models

Suppose we have a set of observations over time x1,x2,…,xt and want to predict some future event yt+1. An inevitable problem arises, because learning a predictor h(x1,…,xt) of yt+1 is generically intractable due to the size of the input. To make this problem tractable, what’s necessary is a method for summarizing the relevant information in past observations for the purpose of prediction in the future. In other words, state is required.
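To make the point concrete, here is an illustrative sketch in which the update and predict functions are hypothetical learned maps, not any particular algorithm: the intractable form consumes the whole history, while the tractable form maintains a fixed-size state that is updated recursively.

```python
# Illustrative sketch only: 'h', 'update', and 'predict' are hypothetical learned maps.

def predict_from_history(xs, h):
    # Generically intractable: the input to h grows with t.
    return h(xs)

def predict_with_state(xs, update, predict, s0):
    # Tractable form: a fixed-size state s summarizes the past.
    s = s0
    for x in xs:
        s = update(s, x)   # s_t = update(s_{t-1}, x_t)
    return predict(s)      # y_{t+1} is predicted from s_t alone
```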

Existing approaches for deriving state have some limitations.

  1. Hidden Markov models learned with EM suffer from local minima, use tabular learning approaches which provide dubious generalization ability, and often require substantial a priori specification of the observations.
  2. Kalman Filters and Particle Filters are very parametric in the sense that substantial information must be specified up front.
  3. Dynamic Bayesian Networks (graphical models through time) require substantial a priori specification and often require the solution of difficult computational problems to use. Some of these difficulties are representational rather than computational.
  4. The Subspace-ID approach from control theory uses a linear representation, with the basic claim that it works well when all transformations are linear, but not so well when things are nonlinear. (Thanks to Drew for pointing it out.) In making this post, I ran across this two-day tutorial which discusses extensions of this idea to nonlinear systems. Unfortunately, I'll miss the tutorial, and I haven't found the related paper.

The point of this paper at ICML is that some dynamic systems (those which are "invertible") can be decomposed into separate bounded-resource prediction problems which, when solved, create an implicit definition of state. This allows us to use any general-purpose supervised learning algorithm to solve the state formation problem without requiring linearity or any specific representation. When writing papers you don't generally gush too hard, but it's fair to say that I'm excited by this approach.

  1. It’s not a known dead end.
  2. It doesn’t require lots of prior specification & information when you have lots of data.
  3. It leverages the huge amount of work that has gone into supervised learning algorithm design.
  4. It works in controlled systems also, where the control is simply another observation.
  5. It works with generalization from the start, rather than requiring the (often awkward) addition of generalization later.
  6. It doesn’t require predicting everything in order to predict what you want.
  7. It can work with very large observation spaces, and can even work better the larger the observation space, because larger observations imply more invertibility.

I expect some people reading this paper will be disappointed that it doesn't solve all problems. That's good news for anyone interested in research. For those who aren't, note that this is (in some sense) a generalization of subspace ID, and hence that there are other applications of the approach known to work in practice. Furthermore, we have some sample complexity analysis in the linear case.

It’s relatively rare to have a paper about a new approach to solving a problem as intractable as nonlinear dynamics has proved to be, so if you see a flaw please speak up.

Computability in Artificial Intelligence

Normally I do not blog, but John kindly invited me to do so. Since computability issues play a major role in Artificial Intelligence and Machine Learning, I would like to take the opportunity to comment on that and raise some questions.

The general attitude is that AI is about finding efficient smart algorithms. For large parts of machine learning, the same attitude is not too dangerous. If you want to concentrate on conceptual problems, simply become a statistician. There is no analogous escape for modern research on AI (as opposed to GOFAI rooted in logic).

Let me show by analogy why limiting research to computational questions is bad for any field.

Except in computer science, computational aspects play little role in the development of fundamental theories: Consider e.g. set theory with axiom of choice, foundations of logic, exact/full minimax for zero-sum games, quantum (field) theory, string theory, … Indeed, at least in physics, every new fundamental theory seems to be less computable than previous ones. Of course, once a subject has been formalized, further research (a) analyzes the structure of the theory and (b) tries to compute efficient approximations. Only in (b) do computational aspects play a role.

So my question is: Why are computational questions so prevalent in AI research? Here are some unconvincing arguments I’ve heard:

A) Because AI is a subfield of computer science, and the task of computer scientists is to find (efficient) algorithms for well-defined problems?

I think it does not do any (real-world) problem any good to confine it to computer science. Of course, philosophers and cognitive scientists also care about AI, but where are the mathematicians?

B) Because formalizing AI and finding efficient smart programs go hand-in-hand? Separating these two issues would lead to no results, or at best to results which are misleading or useless in the construction of intelligent machines?

I am not aware of any convincing argument that separating the issues of “axiomatizing a field” and “finding efficient solutions” will (likely) fail for AI. The examples above of other fields actually indicate the opposite. Of course, interaction is important to avoid both sides running wild. For instance, von Neumann’s minimax solution for games, albeit infeasible for most games, is the cornerstone of most practical approximations.

C) Because there is some deep connection between intelligence and computation which can not be disentangled?

Sure, you could say that intelligence is by definition about computationally efficient decision making. This is as unconvincing as argument (A). Pointing out that the human brain is a computational device is quite useful in many ways, but doesn't prove (C) either. Of course, ultimately we want a "fast" smart algorithm. How is AI different from wanting a fast algorithm for computing primes, which you derive from a non-algorithmic definition of primes, or for drawing fractals?

D) Because AI is trivial if computational issues are ignored? All conceptual problems have already been solved?

Many have expressed the idea that some form of exhaustive search over all possible solutions, picking the "best" one, does the job. This works essentially for exactly those problems that are well-defined. For instance, optimal minimax play of a zero-sum game or solving NP-complete problems are conceptually trivial, i.e., trivial once computation time is ignored. But in general AI and machine learning, there is no universally agreed-upon objective function. The Turing test is informal (it involves a human judge in the loop), maximizing expected reward is underspecified (the true distribution is not known, so expectation w.r.t. what?), etc. The AIXI model, briefly discussed at this blog, is the first complete and formal such criterion, for which, let me phrase it that way, no flaw has yet been identified. Shane Legg's award-winning thesis gives an informal introduction and contains lots of discussion.

Conceptual and computational problems in AI should be studied jointly as well as separately, but the latter is not (yet) fashionable. When AI was more logic oriented, some good logicians helped develop the foundations of “deductive” AI. Where are the researchers giving modern “inductive” AI its foundation? I am talking about generic learning agents, not classifying i.i.d. data. Reinforcement learners? Well, most of the hard results are from adaptive control theorists, but it’s reassuring to see parts of these communities merging. It’s a pity that so few mathematicians are interested in AI. A field “mathematical AI” with the prestige of “mathematical physics” would be exciting. As a start: 40% of the COLT & ALT papers on generic learning agents, 30% induction, 20% time-series forecasting, 10% i.i.d. Currently it’s reversed.

Machine Learning to AI

I recently had fun discussions with both Vikash Mansinghka and Thomas Breuel about approaching AI with machine learning. The general interest in taking a crack at AI with machine learning seems to be rising on many fronts including DARPA.

As a matter of history, there was a great deal of interest in AI which died down before I began research. There remain many projects and conferences spawned in this earlier AI wave, as well as a good bit of experience about what did not work, or at least did not work yet. Here are a few examples of failure modes that people seem to run into:

  1. Supply/Product confusion. Sometimes we think "Intelligences use X, so I'll create X and have an Intelligence." An example of this is the Cyc Project, which inspires some people to reason "intelligences use ontologies, so I'll create an ontology and a system using it to have an Intelligence." The flaw here is that Intelligences create ontologies, which they use, and without the ability to create ontologies you don't have an Intelligence. If we are lucky, the substantial effort invested in Cyc won't be wasted, as it has a large quantity of information stored in a plausibly useful format. If we are unlucky, it fails to even be partially useful, because the format is unnatural for the internal representations of an Intelligence.
  2. Uncertainty second. Many of the older AI programs had no role for uncertainty. If you asked the people working on them, they might agree that uncertainty was an important but secondary concern to be solved after the main problem. Unfortunately, it seems that uncertainty is a primary concern in practice. One example of this is blocks world, where a system for planning how to rearrange blocks on a table might easily fail in practice because the robot fails to grab a block properly. Many people think of uncertainty as a second-order concern, because they don't experience uncertainty in their daily lives. I believe this is incorrect: it is a mental illusion due to the fact that focusing attention on a specific subject implies reducing uncertainty on that subject. More generally, because any Intelligence is a small part of the world, the ability of any intelligence to perceive, understand, and manipulate the world is inherently limited, requiring the ability to deal with uncertainty. For statistics & ML people, it's important to not breathe a sigh of relief too easily, as the problem is pernicious. For example, many ML techniques based around conditional independence routinely suffer from excess certainty.
  3. Computation second. Some people try to create an intelligence without reference to efficient computation. AIXI is an extreme example of this sort. The algorithm is very difficult to deploy in practice because no computational constraints other than computability were designed into its creation. It's important to understand that computational constraints and uncertainty go together: because there are computational constraints, an intelligence is forced to deal with uncertainty, since not everything which might follow at a mathematical level can be inferred within the available computational budget.
  4. AI-Hard problems. There was a time when some people thought, "If we could just get a program that mastered chess so well it could beat the best humans, we will learn enough about AI to create an AI." Deep Blue put that theory to rest. Current efforts on Poker and Go seem more promising, but no one believes they are "AI-Hard" for good reason. It's not even clear that the Turing Test is a reliable indicator, because (for example) we might imagine that there is an Intelligence which cannot imitate a human, or that there are programs that can imitate humans well enough to fool humans without being able to achieve everything that an Intelligence could. Perhaps the best evidence is something singularity-style: AI exists when it can substantially improve its own abilities.
  5. Asymptopia. In machine learning there are many theorems of the form "learning algorithm A can solve any learning problem in the limit of infinite data". Here A might be nearest neighbors, decision trees, two-layer neural networks, support vector machines, nonparametric statistics, nonparametric Bayes, or something else. These theorems are OK, but insufficient. Often the algorithms are not computationally acceptable, and even when they are, they are not sufficiently efficient with respect to the amount of experience required to learn.

Solving AI is undeniably hard, as evidenced by the amount of time spent on it and the set of approaches which haven't succeeded. There are a couple of reasons for hope this time. The first is that there is, or soon will be, sufficient computation available, unlike the last time. The second is that the machine learning approach fails well, because there are industrial uses for machine learning. Consequently, even work that falls short of AI can still see substantial use in practice. This might sound like "a good downside", but it's actually an upside, because it implies that incremental progress has the potential for ultimate success.

Restated at an abstract level: a hard problem can generally be decomposed in many ways into subproblems. Amongst all such decompositions, a good decomposition is one with the property that solutions to the subproblems are immediately useful. The machine learning approach to AI has this goodness property, unlike many other approaches, which partially explains why the ML approach is successful despite “failing” so far to achieve AI.

One reason why AI is hard is that tackling general problems in the world undeniably requires a substantial number of different strategies, including learning, searching, and chunking (= constructing macros), all while respecting constraints of computation and robustness to uncertainty. Given this, a fair strategy seems to be to first master one strategy, and then incorporate others, always checking that the incorporation properly addresses real-world problems. In doing this, considering the constraint-ignoring approaches as limiting cases of the real system may be helpful.