#LyX 1.3 created this file. For more info see http://www.lyx.org/
\lyxformat 221
\textclass amsart
\language english
\inputencoding auto
\fontscheme default
\graphics default
\paperfontsize default
\spacing single 
\papersize Default
\paperpackage a4
\use_geometry 0
\use_amsmath 0
\use_natbib 0
\use_numerical_citations 0
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\quotes_times 2
\papercolumns 1
\papersides 1
\paperpagestyle default

\layout Title


\begin_inset Quotes eld
\end_inset 

Pure
\begin_inset Quotes erd
\end_inset 

 Reinforcement Learning with Classification
\layout Author

John Langford
\layout Section

Introduction
\layout Standard


\noun on 
Problem
\noun default 
 This paper addresses the purest reinforcement learning problem, which intuitive
ly is 
\begin_inset Quotes eld
\end_inset 

choose actions in the world so as to optimize rewards given observations
\begin_inset Quotes erd
\end_inset 

.
 We make 
\emph on 
no 
\emph default 
assumptions about the world other than the ability to approximately solve
 a certain collection of classification problems.
 In particular, we rely on no ability to 
\begin_inset Quotes eld
\end_inset 

reset
\begin_inset Quotes erd
\end_inset 

 to a start state, no generative model to generate outcomes given states
 and actions 
\begin_inset LatexCommand \cite{Generative Model}

\end_inset 

, and no 
\begin_inset Quotes eld
\end_inset 

mixing time
\begin_inset Quotes erd
\end_inset 

 
\begin_inset LatexCommand \cite{e3}

\end_inset 

.
 A 
\begin_inset Quotes eld
\end_inset 

homing policy
\begin_inset Quotes erd
\end_inset 

 
\begin_inset LatexCommand \cite{homing}

\end_inset 

 is (in some sense) not an assumption, but we do not need it in order to
 state the basic result (it may be useful in the process of acquiring good
 classifications).
 
\layout Standard


\noun on 
Solution Motivation 
\noun default 
The method of reduction in learning allows us to ask questions of the sort
 
\begin_inset Quotes eld
\end_inset 

How does the ability to (approximately) solve simple prediction problems
 allow us to (approximately) solve complex prediction problems?
\begin_inset Quotes erd
\end_inset 

 Aside from the theoretical motivations of this question, this style of
 analysis often results in practical new algorithms (such as 
\begin_inset LatexCommand \cite{example 1}

\end_inset 

, 
\begin_inset LatexCommand \cite{example 2}

\end_inset 

) for solving learning problems in diverse domains.
 Here, we show that, for the right definition of 
\begin_inset Quotes eld
\end_inset 

learning
\begin_inset Quotes erd
\end_inset 

, 
\emph on 
all
\emph default 
 learning problems can be (approximately) solved by (approximately) solving
 certain classification problems.
 Here the 
\begin_inset Quotes eld
\end_inset 

all
\begin_inset Quotes erd
\end_inset 

 is due to the ability to embed a very large set of learning problems into
 reinforcement learning.
\layout Standard


\noun on 
Caveats 
\noun default 
The method we present is the first connection of this sort, but it may not
 be the last because the reduction is 
\begin_inset Quotes eld
\end_inset 

loose
\begin_inset Quotes erd
\end_inset 

---small error rates must be achieved in order to guarantee that the created
 policy is anywhere near optimal.
 Consequently, very significant tightenings of these results may be possible.
 It is also important to understand that the algorithm presented here is
 quite computationally expensive, so again room for improvement exists.
\layout Section

Solution Sketch
\layout Subsection

Definitions
\layout Standard

It is possible to think of the general reinforcement learning problem as
 learning in a POMDP (Partially Observable Markov Decision Process) which
 is defined by a set of states, actions, observations, and rewards together
 with conditional probability tables on these sets.
 However, this is unnecessarily complicated because we need never use (or
 explicitly consider) states.
 Instead, consider the following definition:
\layout Definition

(Reinforcement Learning Problem) A reinforcement learning problem 
\begin_inset Formula $\textrm{RL}=\{ A,O,D\}$
\end_inset 

 is defined by a set of actions 
\begin_inset Formula $A$
\end_inset 

, a set of observations 
\begin_inset Formula $O$
\end_inset 

, and conditional probability tables 
\begin_inset Formula $D(o',r|(a,o,r)^{*})$
\end_inset 

 where 
\begin_inset Formula $(a,o,r)^{*}$
\end_inset 

 is any history of past actions, observations, and rewards.
 Here 
\begin_inset Formula $r\in[0,1]$
\end_inset 

 is some reward.
 
\layout Standard

There are a variety of goals considered in reinforcement learning.
 We consider the following one:
\layout Definition

(Reinforcement Learning Goal) Given some horizon 
\begin_inset Formula $T$
\end_inset 

, find a policy 
\begin_inset Formula $\pi:(o,a)^{*}\rightarrow a$
\end_inset 

, optimizing the expected sum of rewards: 
\begin_inset Formula \[
\max_{\pi}E_{(a,o,r)^{T}\sim\pi,D}\sum_{t=1}^{T}r_{t}\]

\end_inset 

where 
\begin_inset Formula $r_{t}$
\end_inset 

 is the 
\begin_inset Formula $t$
\end_inset 

th observed reward.
 Here, the expectation is over the process which generates a history using
 
\begin_inset Formula $D$
\end_inset 

 and choosing actions from 
\begin_inset Formula $\pi$
\end_inset 

.
\layout Standard

This definition is essentially equivalent to several other definitions.
 For example, the commonly used 
\begin_inset Formula $\gamma$
\end_inset 

 discounting definition which optimizes the 
\begin_inset Formula $\gamma$
\end_inset 

-discounted sum of rewards can be simulated by modifying the reinforcement
 learning problem so that at each timestep with probability 
\begin_inset Formula $1-\gamma$
\end_inset 

, a 
\begin_inset Quotes eld
\end_inset 

termination
\begin_inset Quotes erd
\end_inset 

 observation is made and no future rewards are observed.
 The truncation of the infinite sum at 
\begin_inset Formula $T$
\end_inset 

 introduces an error which decreases exponentially in 
\begin_inset Formula $T(1-\gamma)$
\end_inset 

.
 
\layout Subsection

The Algorithm
\layout Standard

The basic idea of reductions in learning is to reduce more complicated learning
 problems to simpler ones.
 One way to think about classification problems in particular is in the
 form of a decision question with a yes/no answer (which is sometimes answered
 incorrectly).
 For example 
\begin_inset Quotes eld
\end_inset 

If I take action 
\begin_inset Formula $a_{1}$
\end_inset 

 is the reward greater than 
\begin_inset Formula $0.5$
\end_inset 

?
\begin_inset Quotes erd
\end_inset 

 Not all binary questions are allowed---only those for which we can derive
 a coherent classification problem (and hence talk about classificaton error
 rates) make sense.
 
\layout Standard

The set of questions used here have the form 
\begin_inset Quotes eld
\end_inset 

What is the expected reward of doing action 
\begin_inset Formula $a_{t}$
\end_inset 

 after some prefix 
\begin_inset Formula $a_{1},...,a_{t-1}$
\end_inset 

?
\begin_inset Quotes erd
\end_inset 

 This is a reduction to a form of regression rather than classification.
 However, we can further reduce the regression problem to binary classification
 using a simple reduction from cost estimation (which is another term for
 regression) to classification 
\begin_inset LatexCommand \cite{Regret Reductions}

\end_inset 

.
 The method chooses the action 
\begin_inset Formula $a_{t}$
\end_inset 

 at timestep 
\begin_inset Formula $t$
\end_inset 

 leading to the largest sum of expected future rewards.
 
\layout Section

Analysis
\layout Standard

Before stating the analysis, we need to make a basic reductions-related
 sanity check.
 Any reinforcement learning problem 
\begin_inset Formula $D$
\end_inset 

 and prefix of actions 
\begin_inset Formula $a_{1},....,a_{t}$
\end_inset 

 creates a measure 
\begin_inset Formula $D(o,r|a_{1},...,a_{t})=E_{o_{1},r_{1},...,o_{t},r_{t}\sim a_{1},...,a_{t},D}D(o,r|(a,o,r)^{t})$
\end_inset 

.
 Consequently, there exists a well defined distribution over immediate rewards
 given any prefix of actions.
 This implies that the regression problems defined each correspond to an
 induced measure 
\begin_inset Formula $D_{a_{1},...,a_{t}}$
\end_inset 

 over immediate rewards.
 
\layout Standard

A basic theorem can be proved about this algorithm.
 For this theorem, let the performance of any policy 
\begin_inset Formula $\pi$
\end_inset 

be defined by:
\begin_inset Formula \[
\eta(\pi)=E_{(a,o,r)^{T}\sim\pi,D}\sum_{t=1}^{T}r_{t}\]

\end_inset 


\layout Lemma

(Reduction Statement) Let 
\begin_inset Formula $\pi_{T}$
\end_inset 

 be the policy created by the regressors above.
 Let the predictions (given any past observations) be given by 
\begin_inset Formula $\hat{r}_{a_{1},...,a_{t}}$
\end_inset 

.
 If, for all action prefixes 
\begin_inset Formula $a_{1},...,a_{t}$
\end_inset 

 with 
\begin_inset Formula $t\in\{1,...,T\}$
\end_inset 

:
\begin_inset Formula \[
E_{(o,r)\sim D_{a_{1},...,a_{t}}}|r-\hat{r}_{a_{1},...,a_{t}}|\leq\epsilon\]

\end_inset 

then
\begin_inset Formula \[
\eta(\pi^{*})-\eta(\pi_{T})\leq f(\epsilon)\]

\end_inset 


\layout Proof

(very sketchy) First note that if the predictions are perfect, then 
\begin_inset Formula $\eta(\pi^{*})=\eta(\pi_{T})$
\end_inset 

.
 Then, note that in the worst case (with adversarial regressors constrained
 by the predictability assumption), the sum of rewards for any 
\begin_inset Formula $T$
\end_inset 

-length prefix is altered by at most 
\begin_inset Formula $\epsilon T$
\end_inset 

.
 (details still to be filled in.)
\layout Bibliography
\bibitem {e3}

Michael Kearns and Satinder Singh, Near-Optimal Reinforcement Learning in
 Polynomial Time.
 Proceedings of the 15th International Conference on Machine Learning, pp.
 260-268, 1998, Morgan Kaufmann.
\layout Bibliography
\bibitem {homing}

Sham Kakade (and others?) 
\begin_inset Quotes eld
\end_inset 

Reinforcement Learning in POMDPs Without Resets
\begin_inset Quotes erd
\end_inset 

 http://www.cis.upenn.edu/~skakade/papers/rl/learn_pomdp.ps
\layout Bibliography
\bibitem {Generative Model}

Michael Kearns, Yishay Mansour, and Andrew Ng, 
\begin_inset Quotes eld
\end_inset 

A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision
 Processes
\begin_inset Quotes erd
\end_inset 

 Proceedings of the Sixteenth International Joint Conference on Artificial
 Intelligence Morgan Kaufmann, 1999, pages 1324--1331.
\layout Bibliography
\bibitem {example 1}

example
\layout Bibliography
\bibitem {example 2}

algorithms
\layout Bibliography
\bibitem {Regret Reductions}

John Langford and Alina Beygelzimer, 
\begin_inset Quotes eld
\end_inset 

Sensitive Error Correcting Output Codes
\begin_inset Quotes erd
\end_inset 

, http://hunch.net/~jl/projects/reductions/pecoc/v7/pecoc.ps
\the_end