Reinforcement – Page 4 – Machine Learning (Theory)

3/23/20084/12/2008

Interactive Machine Learning

A new direction of research seems to be arising in machine learning: Interactive Machine Learning. This isn’t a familiar term, although it does include some familiar subjects.

What is Interactive Machine Learning? The fundamental requirement is (a) learning algorithms which interact with the world and (b) learn.

For our purposes, let’s define learning as efficiently competing with a large set of possible predictors. Examples include:

Online learning against an adversary (Avrim’s Notes). The interaction is almost trivial: the learning algorithm makes a prediction and then receives feedback. The learning is choosing based upon the advice of many experts.
Active Learning. In active learning, the interaction is choosing which examples to label, and the learning is choosing from amongst a large set of hypotheses.
Contextual Bandits. The interaction is choosing one of several actions and learning only the value of the chosen action (weaker than active learning feedback).

More forms of interaction will doubtless be noted and tackled as time progresses. I created a webpage for my own research on interactive learning which helps define the above subjects a bit more.

What isn’t Interactive Machine Learning?
There are several learning settings which fail either the interaction or the learning test.

Supervised Learning doesn’t fit. The basic paradigm in supervised learning is that you ask experts to label examples, and then you learn a predictor based upon the predictions of these experts. This approach has essentially no interaction.
Semisupervised Learning doesn’t fit. Semisupervised learning is almost the same as supervised learning, except that you also throw in many unlabeled examples.
Bandit algorithms don’t fit. They have the interaction, but not much learning happens because the sample complexity results only allow you to choose from amongst a small set of strategies. (One exception is EXP4 (page 66), which can operate in the contextual bandit setting.)
MDP learning doesn’t fit. The interaction is there, but the set of policies learned over is still too limited—essentially the policies just memorize what to do in each state.
Reinforcement learning may or may not fit, depending on whether you think of it as MDP learning or in a much broader sense.

All of these not-quite-interactive-learning topics are of course very useful background information for interactive machine learning.

Why now? Because it’s time, of course.

We know from other fields and various examples that interaction is very powerful.
1. From online learning against an adversary, we know that independence of samples is unnecessary in an interactive setting—in fact you can even function against an adversary.
2. From active learning, we know that interaction sometimes allows us to use exponentially fewer labeled samples than in supervised learning.
3. From context bandits, we gain the ability to learn in settings where traditional supervised learning just doesn’t apply.
4. From complexity theory we have “IP=PSPACE” roughly: interactive proofs are as powerful as polynomial space algorithms, which is a strong statement about the power of interaction.
We know that this analysis is often tractable. For example, since Sanjoy‘s post on Active Learning, much progress has been made. Several other variations of interactive settings have been proposed and analyzed. The older online learning against an adversary work is essentially completely worked out for the simpler cases (except for computational issues).
Real world problems are driving it. One of my favorite problems at the moment is the ad display problem—How do you learn which ad is most likely to be of interest? The contextual bandit problem is a big piece of this problem.
It’s more fun. Interactive learning is essentially a wide-open area of research. There are plenty of kinds of natural interaction which haven’t been formalized or analyzed. This is great for beginnners, because it means the problems are simple, and their solution does not require huge prerequisites.
It’s a step closer to AI. Many people doing machine learning want to reach AI, and it seems clear that any AI must engage in interactive learning. Mastering this problem is a next step.

Basic Questions

For natural interaction form [insert yours here], how do you learn? Some of the techniques for other methods of interactive learning may be helpful.
How do we blend interactive and noninteractive learning? In many applications, there is already a pool of supervised examples around.
Are there general methods for reducing interactive learning problems to supervised learning problems (which we know better)?

12/10/200712/10/2007

Learning Track of International Planning Competition

The International Planning Competition (IPC) is a biennial event organized in the context of the International Conference on Automated Planning and Scheduling (ICAPS). This year, for the first time, there will a learning track of the competition. For more information you can go to the competition web-site.

The competitions are typically organized around a number of planning domains that can vary from year to year, where a planning domain is simply a class of problems that share a common action schema—e.g. Blocksworld is a well-known planning domain that contains a problem instance each possible initial tower configuration and goal configuration. Some other domains have included Logistics, Airport, Freecell, PipesWorld, and many others. For each domain the competition includes a number of problems (say 40-50) and the planners are run on each problem with a time limit for each problem (around 30 minutes). The problems are hard enough that many problems are not solved within the time limit.

Given that the planners are asked to solve many problems from individual domains, and that problems within a domain generally have common solution structures, it makes sense to consider learning from previous problem-solving experience in a domain to better solve future problems in the same domain. Here “better solve” could mean either solve the problems significantly more quickly or finding better quality plans in a similar time frame. However, no planner in any of the competitions has included a learning component. Rather, to quote Foreigner, for these planners each problem “feels like the first time”.

Perhaps one reason that planners have not incorporated learning into the competition setting is that there has not been much overlap between the ICML and ICAPS communities, although that is changing. Another reason is perhaps that the structure of the competition would deduct any “learning time” from a planner’s 30mins per problem, which could reduce the potential benefits.

The learning track for the 2008 competition is being designed so that learning time is not counted against planners. Rather, there will be a distinct learning phase and a distinct evaluation phase. During the learning phase the planners will be provided with the set of domains to be used in the competition and example problems from each. The evaluation phase will be conducted like the current competition, with the exception that the learning-based planners will be allowed to read in a learned domain-specific “knowledge file” when solving the problems. This structure is designed to help answer the following question:

Do we have techniques that can leverage a learning period to outperform state-of-the-art non-learning techniques across a wide range of domains?

My current belief is that the answer is “no”. I certainly have never seen any such demonstration. This is not because of lack of work in the area of “learning to plan” as there has been a long history dating back to some of the early planners (see my horribly outdated resource page for a taste). While many of the learning approaches have shown some degree of success, the evaluations have typically been very narrow, focusing on only 2 to 3 domains and often only a few problems. My intuition, grounded in personal experience, is that most (all) of these systems would be quite brittle when taken to new domains. The hope is that the learning track of the competition will force us to take the issue of robustness seriously and soon lead to learning systems that convincingly outperform non-learning planners across a wide range of domains given proper training experience.

I hope to see a wide range of approaches entered into the competition. I’ll mention two styles of approaches that might be particular interesting to readers of this blog.

First, one might consider applying reinforcement learning to learn “generalized policies” that can be applied to any problem from a domain. Recall that here the domain model is provided to us, so applying RL would mean that the domain model is used as a sort of simulator in which an RL algorithm is run. RL is particularly difficult in these domains due to the challenges in developing an appropriate representation for learning value functions and/or policies—the states can be viewed as sets of ground relational atoms, rather than the more typical n-dimensional vectors common in RL. Another challenge is the extremely sparse reward, which is obtained only at goal states. There has been some work on applying RL to IPC-style domains (e.g. relational reinforcement learning, approximate policy iteration, policy gradient) but much improvement is needed to compete with non-learning planners.

Second, one might consider structured-classification techniques for this problem. Here one might view the planning problem as an input X and the plan as the structured output Y. Training data can be generated by solving example planning problems using state-of-the-art planners perhaps using significant resources. This approach has been studied under the name max-margin planning, but applied to a very different class of planning problems. Other work has considered applying the Learning as Search Optimization (LaSO) framework to IPC-style domains with some success. Some of the challenges here are to automatically produce an appropriate feature set given a planning domain and ambiguity in the training data. Ambiguity here refers to the fact that there are often a huge number of equally good plans for a given problem and the training data has only one or a small number of them, making the training data incomplete.

10/24/2007

Contextual Bandits

One of the fundamental underpinnings of the internet is advertising based content. This has become much more effective due to targeted advertising where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way.

The problem of matching ads to interests is a natural machine learning problem in some ways since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. This implies we have a fundamental exploration problem.

A standard mathematical setting for this situation is “k-Armed Bandits”, often with various relevant embellishments. The k-Armed Bandit setting works on a round-by-round basis. On each round:

A policy chooses arm a from 1 of k arms (i.e. 1 of k ads).
The world reveals the reward r_a of the chosen arm (i.e. whether the ad is clicked on).

As information is accumulated over multiple rounds, a good policy might converge on a good choice of arm (i.e. ad).

This setting (and its variants) fails to capture a critical phenomenon: each of these displayed ads are done in the context of a search or other webpage. To model this, we might think of a different setting where on each round:

The world announces some context information x (think of this as a high dimensional bit vector if that helps).
A policy chooses arm a from 1 of k arms (i.e. 1 of k ads).
The world reveals the reward r_a of the chosen arm (i.e. whether the ad is clicked on).

We can check that this is a critical distinction in 2 ways. First, note that policies using x can encode much more rich decisions than a policy not using x. Just think about: “if a search has the word flowers display a flower advertisement”. Second, we can try to reduce this setting to the k-Armed Bandit setting, and note that it can not be done well. There are two methods that I know of:

Run a different k-Armed Bandit for every value of x. The amount of information required to do well scales linearly in the number of contexts. In contrast, good supervised learning algorithms often require information which is (essentially) independent of the number of contexts.
Take some set of policies and treat every policy h(x) as a different arm. This removes an explicit dependence on the number of contexts, but it creates a linear dependence on the number of policies. Via Occam’s razor/VC dimension/Margin bounds, we already know that supervised learning requires experience much smaller than the number of policies.

We know these are bad reductions by contrast to direct methods for solving the problem. The first algorithm for solving this problem is EXP4 (page 19 = 66) which has a regret with respect to the best policy in a set of O( T^0.5 (ln |H|)^0.5) where T is the number of rounds and |H| is the number of policies. (Dividing by T gives error-rate like quantities.) This result is independent of the number of contexts x and only weakly dependent (similar to supervised learning) on the number of policies.

EXP4 has a number of drawbacks—it has severe computational requirements and doesn’t work for continuously parameterized policies (*). Tong and I worked out a reasonably simple meta-algorithm Epoch-Greedy which addresses these drawbacks (**), at the cost of sometimes worsening the regret bound to O(T^2/3S^1/3) where S is related to the representational complexity of supervised learning on the set of policies.

This T dependence is of great concern to people who have worked on bandit problems in the past (where, basically, only the dependence on T could be optimized). In many applications, the S dependence is more important. However, this does leave an important open question: Is it possible to get the best properties of EXP4 and Epoch-Greedy?

Reasonable people could argue about which setting is more important: k-Armed Bandits or Contextual Bandits. I favor Contextual Bandits, even though there has been far more work in the k-Armed Bandit setting. There are several reasons:

I’m having difficulty finding interesting real-world k-Armed Bandit settings which aren’t better thought of as Contextual Bandits in practice. For myself, bandit algorithms are (at best) motivational because they can not be applied to real-world problems without altering them to take context into account.
Doing things in context is one of the underlying (and very successful) tenets of machine learning. Applying this tenet here seems wise.
If we want to eventually solve big problems, we must have composable subelements. Composition doesn’t work without context, because there is no “input” for an I/O diagram.

Any insights into the open question above or Contextual Bandits in general are of great interest to me.

(*) There are some simple modifications to deal with the second issue but not the first.
(**) You have to read between the lines a little bit to see this in the paper. The ERM-style algorithm in the paper could be replaced with an efficient approximate ERM algorithm which is often possible in practice.

10/19/2007

Second Annual Reinforcement Learning Competition

The Second Annual Reinforcement Learning Competition is about to get started. The aim of the competition is to facilitate direct comparisons between various learning methods on important and realistic domains. This year’s event will feature well-known benchmark domains as well as more challenging problems of real-world complexity, such as helicopter control and robot soccer keepaway.

The competition begins on November 1st, 2007 when training software is released. Results must be submitted by July 1st, 2008. The competition will culminate in an event at ICML-08 in Helsinki, Finland, at which the winners will be announced.

For more information, visit the competition website.

3/3/2007

All Models of Learning have Flaws

Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning.

The point here is not simply “woe unto us”. There are several implications which seem important.

The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes there “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students.
Algorithms which conform to multiple approaches can have substantial value. “I don’t really understand it yet, because I only understand it one way”. Reinterpretation alone is not the goal—we want algorithmic guidance.
We need to remain constantly open to new mathematical models of machine learning. It’s common to forget the flaws of the model that you are most familiar with in evaluating other models while the flaws of new models get exaggerated. The best way to avoid this is simply education.
The value of theory alone is more limited than many theoreticians may be aware. Theories need to be tested to see if they correctly predict the underlying phenomena.

Here is a summary what is wrong with various frameworks for learning. To avoid being entirely negative, I added a column about what’s right as well.

Name	Methodology	What’s right	What’s wrong
Bayesian Learning	You specify a prior probability distribution over data-makers, P(datamaker) then use Bayes law to find a posterior P(datamaker\|x). True Bayesians integrate over the posterior to make predictions while many simply use the world with largest posterior directly.	Handles the small data limit. Very flexible. Interpolates to engineering.	Information theoretically problematic. Explicitly specifying a reasonable prior is often hard. Computationally difficult problems are commonly encountered. Human intensive. Partly due to the difficulties above and partly because “first specify a prior” is built into framework this approach is not very automatable.
Graphical/generative Models	Sometimes Bayesian and sometimes not. Data-makers are typically assumed to be IID samples of fixed or varying length data. Data-makers are represented graphically with conditional independencies encoded in the graph. For some graphs, fast algorithms for making (or approximately making) predictions exist.	Relative to pure Bayesian systems, this approach is sometimes computationally tractable. More importantly, the graph language is natural, which aids prior elicitation.	Often (still) fails to fix problems with the Bayesian approach. In real world applications, true conditional independence is rare, and results degrade rapidly with systematic misspecification of conditional independence.
Convex Loss Optimization	Specify a loss function related to the world-imposed loss fucntion which is convex on some parametric predictive system. Optimize the parametric predictive system to find the global optima.	Mathematically clean solutions where computational tractability is partly taken into account. Relatively automatable.	The temptation to forget that the world imposes nonconvex loss functions is sometimes overwhelming, and the mismatch is always dangerous. Limited models. Although switching to a convex loss means that some optimizations become convex, optimization on representations which aren’t single layer linear combinations is often difficult.
Gradient Descent	Specify an architecture with free parameters and use gradient descent with respect to data to tune the parameters.	Relatively computationally tractable due to (a) modularity of gradient descent (b) directly optimizing the quantity you want to predict.	Finicky. There are issues with paremeter initialization, step size, and representation. It helps a great deal to have accumulated experience using this sort of system and there is little theoretical guidance. Overfitting is a significant issue.
Kernel-based learning	You chose a kernel K(x,x’) between datapoints that satisfies certain conditions, and then use it as a measure of similarity when learning.	People often find the specification of a similarity function between objects a natural way to incorporate prior information for machine learning problems. Algorithms (like SVMs) for training are reasonably practical—O(n²) for instance.	Specification of the kernel is not easy for some applications (this is another example of prior elicitation). O(n²) is not efficient enough when there is much data.
Boosting	You create a learning algorithm that may be imperfect but which has some predictive edge, then apply it repeatedly in various ways to make a final predictor.	A focus on getting something that works quickly is natural. This approach is relatively automated and (hence) easy to apply for beginners.	The boosting framework tells you nothing about how to build that initial algorithm. The weak learning assumption becomes violated at some point in the iterative process.
Online Learning with Experts	You make many base predictors and then a master algorithm automatically switches between the use of these predictors so as to minimize regret.	This is an effective automated method to extract performance from a pool of predictors.	Computational intractability can be a problem. This approach lives and dies on the effectiveness of the experts, but it provides little or no guidance in their construction.
Learning Reductions	You solve complex machine learning problems by reducing them to well-studied base problems in a robust manner.	The reductions approach can yield highly automated learning algorithms.	The existence of an algorithm satisfying reduction guarantees is not sufficient to guarantee success. Reductions tell you little or nothing about the design of the base learning algorithm.
PAC Learning	You assume that samples are drawn IID from an unknown distribution D. You think of learning as finding a near-best hypothesis amongst a given set of hypotheses in a computationally tractable manner.	The focus on computation is pretty right-headed, because we are ultimately limited by what we can compute.	There are not many substantial positive results, particularly when D is noisy. Data isn’t IID in practice anyways.
Statistical Learning Theory	You assume that samples are drawn IID from an unknown distribution D. You think of learning as figuring out the number of samples required to distinguish a near-best hypothesis from a set of hypotheses.	There are substantially more positive results than for PAC Learning, and there are a few examples of practical algorithms directly motivated by this analysis.	The data is not IID. Ignorance of computational difficulties often results in difficulty of application. More importantly, the bounds are often loose (sometimes to the point of vacuousness).
Decision tree learning	Learning is a process of cutting up the input space and assigning predictions to pieces of the space.	Decision tree algorithms are well automated and can be quite fast.	There are learning problems which can not be solved by decision trees, but which are solvable. It’s common to find that other approaches give you a bit more performance. A theoretical grounding for many choices in these algorithms is lacking.
Algorithmic complexity	Learning is about finding a program which correctly predicts the outputs given the inputs.	Any reasonable problem is learnable with a number of samples related to the description length of the program.	The theory literally suggests solving halting problems to solve machine learning.
RL, MDP learning	Learning is about finding and acting according to a near optimal policy in an unknown Markov Decision Process.	We can learn and act with an amount of summed regret related to O(SA) where S is the number of states and A is the number of actions per state.	Has anyone counted the number of states in real world problems? We can’t afford to wait that long. Discretizing the states creates a POMDP (see below). In the real world, we often have to deal with a POMDP anyways.
RL, POMDP learning	Learning is about finding and acting according to a near optimaly policy in a Partially Observed Markov Decision Process	In a sense, we’ve made no assumptions, so algorithms have wide applicability.	All known algorithms scale badly with the number of hidden states.

This set is incomplete of course, but it forms a starting point for understanding what’s out there. (Please fill in the what/pro/con of anything I missed.)