#LyX 1.3 created this file. For more info see http://www.lyx.org/
\lyxformat 221
\textclass amsart
\language english
\inputencoding auto
\fontscheme default
\graphics default
\paperfontsize default
\spacing single
\papersize Default
\paperpackage a4
\use_geometry 0
\use_amsmath 0
\use_natbib 0
\use_numerical_citations 0
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\quotes_times 2
\papercolumns 1
\papersides 1
\paperpagestyle default
\layout Title
\begin_inset Quotes eld
\end_inset
Pure
\begin_inset Quotes erd
\end_inset
Reinforcement Learning with Classification
\layout Author
John Langford
\layout Section
Introduction
\layout Standard
\noun on
Problem
\noun default
This paper addresses the purest reinforcement learning problem, which intuitively is
\begin_inset Quotes eld
\end_inset
choose actions in the world so as to optimize rewards given observations
\begin_inset Quotes erd
\end_inset
.
We make
\emph on
no
\emph default
assumptions about the world other than the ability to approximately solve
a certain collection of classification problems.
In particular, we rely on no ability to
\begin_inset Quotes eld
\end_inset
reset
\begin_inset Quotes erd
\end_inset
to a start state, no generative model to generate outcomes given states
and actions
\begin_inset LatexCommand \cite{Generative Model}
\end_inset
, and no
\begin_inset Quotes eld
\end_inset
mixing time
\begin_inset Quotes erd
\end_inset
\begin_inset LatexCommand \cite{e3}
\end_inset
.
A
\begin_inset Quotes eld
\end_inset
homing policy
\begin_inset Quotes erd
\end_inset
\begin_inset LatexCommand \cite{homing}
\end_inset
is (in some sense) not an assumption, but we do not need it in order to
state the basic result (it may be useful in the process of acquiring good
classifications).
\layout Standard
\noun on
Solution Motivation
\noun default
The method of reduction in learning allows us to ask questions of the sort
\begin_inset Quotes eld
\end_inset
How does the ability to (approximately) solve simple prediction problems
allow us to (approximately) solve complex prediction problems?
\begin_inset Quotes erd
\end_inset
Aside from the theoretical motivations of this question, this style of
analysis often results in practical new algorithms (such as
\begin_inset LatexCommand \cite{example 1}
\end_inset
,
\begin_inset LatexCommand \cite{example 2}
\end_inset
) for solving learning problems in diverse domains.
Here, we show that, for the right definition of
\begin_inset Quotes eld
\end_inset
learning
\begin_inset Quotes erd
\end_inset
,
\emph on
all
\emph default
learning problems can be (approximately) solved by (approximately) solving
certain classification problems.
Here the
\begin_inset Quotes eld
\end_inset
all
\begin_inset Quotes erd
\end_inset
is due to the ability to embed a very large set of learning problems into
reinforcement learning.
\layout Standard
\noun on
Caveats
\noun default
The method we present is the first connection of this sort, but it may not
be the last because the reduction is
\begin_inset Quotes eld
\end_inset
loose
\begin_inset Quotes erd
\end_inset
---small error rates must be achieved in order to guarantee that the created
policy is anywhere near optimal.
Consequently, very significant tightenings of these results may be possible.
It is also important to understand that the algorithm presented here is
quite computationally expensive, so again room for improvement exists.
\layout Section
Solution Sketch
\layout Subsection
Definitions
\layout Standard
It is possible to think of the general reinforcement learning problem as
learning in a POMDP (Partially Observable Markov Decision Process) which
is defined by a set of states, actions, observations, and rewards together
with conditional probability tables on these sets.
However, this is unnecessarily complicated because we need never use (or
explicitly consider) states.
Instead, consider the following definition:
\layout Definition
(Reinforcement Learning Problem) A reinforcement learning problem
\begin_inset Formula $\textrm{RL}=(A,O,D)$
\end_inset
is defined by a set of actions
\begin_inset Formula $A$
\end_inset
, a set of observations
\begin_inset Formula $O$
\end_inset
, and conditional probability tables
\begin_inset Formula $D(o',r'|(a,o,r)^{*})$
\end_inset
where
\begin_inset Formula $(a,o,r)^{*}$
\end_inset
is any history of past actions, observations, and rewards.
Here
\begin_inset Formula $r\in[0,1]$
\end_inset
is some reward.
\layout Standard
There are a variety of goals considered in reinforcement learning.
We consider the following one:
\layout Definition
(Reinforcement Learning Goal) Given some horizon
\begin_inset Formula $T$
\end_inset
, find a policy
\begin_inset Formula $\pi:(a,o,r)^{*}\rightarrow a$
\end_inset
, optimizing the expected sum of rewards:
\begin_inset Formula \[
\max_{\pi}E_{(a,o,r)^{T}\sim\pi,D}\sum_{t=1}^{T}r_{t}\]
\end_inset
where
\begin_inset Formula $r_{t}$
\end_inset
is the
\begin_inset Formula $t$
\end_inset
th observed reward.
Here, the expectation is over the process which generates a history using
\begin_inset Formula $D$
\end_inset
and choosing actions from
\begin_inset Formula $\pi$
\end_inset
.
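\layout Standard

To make this objective concrete, one can roll out a single T-step history and sum its rewards; averaging many such rollouts gives a Monte Carlo estimate of the expectation above. The following sketch assumes a hypothetical sampler standing in for the conditional probability tables D and a policy function; neither interface is part of the formal definition.

```python
def episode_return(step, policy, T):
    """Sum of rewards along one T-step history.

    `step(history, a)` is a hypothetical sampler standing in for the
    conditional probability tables D: given the history so far (a list
    of (action, observation, reward) triples) and a chosen action, it
    returns the next (observation, reward).  `policy(history)` chooses
    the next action from the history.
    """
    history, total = [], 0.0
    for _ in range(T):
        a = policy(history)
        o, r = step(history, a)
        history.append((a, o, r))
        total += r
    return total
```

Averaging episode_return over independent rollouts estimates the objective for a fixed policy; maximizing over policies is the (hard) reinforcement learning goal.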
\layout Standard
This definition is essentially equivalent to several other definitions.
For example, the commonly used
\begin_inset Formula $\gamma$
\end_inset
discounting definition which optimizes the
\begin_inset Formula $\gamma$
\end_inset
-discounted sum of rewards can be simulated by modifying the reinforcement
learning problem so that at each timestep with probability
\begin_inset Formula $1-\gamma$
\end_inset
, a
\begin_inset Quotes eld
\end_inset
termination
\begin_inset Quotes erd
\end_inset
observation is made and no future rewards are observed.
The truncation of the infinite sum at
\begin_inset Formula $T$
\end_inset
introduces an error which decreases exponentially in
\begin_inset Formula $T(1-\gamma)$
\end_inset
.
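\layout Standard

The exponential decay of the truncation error can be checked directly: with rewards in [0,1], cutting the discounted sum at horizon T drops at most gamma^T/(1-gamma) reward, and the inequality gamma <= exp(-(1-gamma)) yields the stated dependence on T(1-gamma). A minimal numeric sketch:

```python
import math

def truncation_error_bound(gamma, T):
    """Maximum reward lost by truncating the gamma-discounted sum at
    horizon T, for rewards in [0, 1]:
    sum_{t > T} gamma^(t-1) = gamma^T / (1 - gamma)."""
    return gamma ** T / (1.0 - gamma)

# Since gamma <= exp(-(1 - gamma)) for gamma in (0, 1), the bound is at
# most exp(-T * (1 - gamma)) / (1 - gamma), i.e. it decreases
# exponentially in T * (1 - gamma).
for gamma, T in [(0.9, 100), (0.99, 1000)]:
    assert truncation_error_bound(gamma, T) <= math.exp(-T * (1 - gamma)) / (1 - gamma)
```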
\layout Subsection
The Algorithm
\layout Standard
The basic idea of reductions in learning is to reduce more complicated learning
problems to simpler ones.
One way to think about classification problems in particular is in the
form of a decision question with a yes/no answer (which is sometimes answered
incorrectly).
For example
\begin_inset Quotes eld
\end_inset
If I take action
\begin_inset Formula $a_{1}$
\end_inset
is the reward greater than
\begin_inset Formula $0.5$
\end_inset
?
\begin_inset Quotes erd
\end_inset
Not all binary questions are allowed---only those for which we can derive a coherent classification problem (and hence talk about classification error rates) make sense.
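\layout Standard

To see why threshold questions of this kind carry enough information, note that for a reward r in [0,1] the expectation satisfies E[r] = integral from 0 to 1 of Pr[r > theta] d theta, so averaging the answers to "is the reward greater than theta?" over a grid of thresholds recovers the expected reward. The sketch below assumes a hypothetical answer oracle standing in for a learned classifier; it is an illustration, not the specific reduction analyzed here.

```python
def expected_reward_from_questions(answer, n_thresholds=1000):
    """Recover an expected reward from yes/no threshold questions.

    `answer(theta)` is a hypothetical classifier oracle returning
    Pr[r > theta] for a reward r in [0, 1].  Since
    E[r] = integral_0^1 Pr[r > theta] d theta, averaging the answers
    over a grid of thresholds approximates the expected reward.
    """
    thetas = [(i + 0.5) / n_thresholds for i in range(n_thresholds)]
    return sum(answer(theta) for theta in thetas) / n_thresholds

# Example: r is 0.25 or 0.75 with equal probability, so E[r] = 0.5.
answer = lambda theta: 1.0 if theta < 0.25 else (0.5 if theta < 0.75 else 0.0)
assert abs(expected_reward_from_questions(answer) - 0.5) < 1e-9
```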
\layout Standard
The set of questions used here has the form
\begin_inset Quotes eld
\end_inset
What is the expected reward of doing action
\begin_inset Formula $a_{t}$
\end_inset
after some prefix
\begin_inset Formula $a_{1},...,a_{t-1}$
\end_inset
?
\begin_inset Quotes erd
\end_inset
This is a reduction to a form of regression rather than classification.
However, we can further reduce the regression problem to binary classification
using a simple reduction from cost estimation (which is another term for
regression) to classification
\begin_inset LatexCommand \cite{Regret Reductions}
\end_inset
.
The method chooses the action
\begin_inset Formula $a_{t}$
\end_inset
at timestep
\begin_inset Formula $t$
\end_inset
leading to the largest sum of expected future rewards.
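\layout Standard

This action choice can be sketched by brute force: score every length-T action sequence by its predicted sum of rewards and execute the first action of the best sequence. The regressor r_hat below is a hypothetical learned predictor of expected immediate reward given an action prefix; the exponential enumeration makes concrete the computational expense noted earlier.

```python
from itertools import product

def choose_action(r_hat, actions, T):
    """First action of the sequence maximizing predicted reward.

    `r_hat(prefix)` is a hypothetical learned regressor: the predicted
    expected immediate reward obtained after executing the action
    prefix.  Every length-T action sequence is scored by its predicted
    sum of rewards; this brute-force search is exponential in T.
    """
    def predicted_sum(seq):
        return sum(r_hat(seq[:t + 1]) for t in range(T))
    best = max(product(actions, repeat=T), key=predicted_sum)
    return best[0]

# Lookahead matters: reward only arrives after the prefix ("a", "b"),
# so the method correctly begins with "a".
r_hat = lambda prefix: 1.0 if tuple(prefix) == ("a", "b") else 0.0
assert choose_action(r_hat, ["a", "b"], 2) == "a"
```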
\layout Section
Analysis
\layout Standard
Before stating the analysis, we need to make a basic reductions-related
sanity check.
Any reinforcement learning problem
\begin_inset Formula $D$
\end_inset
and prefix of actions
\begin_inset Formula $a_{1},...,a_{t}$
\end_inset
creates a measure
\begin_inset Formula $D(o,r|a_{1},...,a_{t})=E_{o_{1},r_{1},...,o_{t},r_{t}\sim a_{1},...,a_{t},D}D(o,r|(a,o,r)^{t})$
\end_inset
.
Consequently, there exists a well-defined distribution over immediate rewards given any prefix of actions.
This implies that each of the regression problems defined above corresponds to an induced measure
\begin_inset Formula $D_{a_{1},...,a_{t}}$
\end_inset
over immediate rewards.
\layout Standard
A basic lemma can be proved about this algorithm.
For this theorem, let the performance of any policy
\begin_inset Formula $\pi$
\end_inset
be defined by:
\begin_inset Formula \[
\eta(\pi)=E_{(a,o,r)^{T}\sim\pi,D}\sum_{t=1}^{T}r_{t}\]
\end_inset
\layout Lemma
(Reduction Statement) Let
\begin_inset Formula $\pi_{T}$
\end_inset
be the policy created by the regressors above.
Let the predictions (given any past observations) be given by
\begin_inset Formula $\hat{r}_{a_{1},...,a_{t}}$
\end_inset
.
If, for all action prefixes
\begin_inset Formula $a_{1},...,a_{t}$
\end_inset
with
\begin_inset Formula $t\in\{1,...,T\}$
\end_inset
:
\begin_inset Formula \[
E_{(o,r)\sim D_{a_{1},...,a_{t}}}|r-\hat{r}_{a_{1},...,a_{t}}|\leq\epsilon\]
\end_inset
then
\begin_inset Formula \[
\eta(\pi^{*})-\eta(\pi_{T})\leq f(\epsilon)\]
\end_inset
\layout Proof
(very sketchy) First note that if the predictions are perfect, then
\begin_inset Formula $\eta(\pi^{*})=\eta(\pi_{T})$
\end_inset
.
Then, note that in the worst case (with adversarial regressors constrained
by the predictability assumption), the sum of rewards for any
\begin_inset Formula $T$
\end_inset
-length prefix is altered by at most
\begin_inset Formula $\epsilon T$
\end_inset
.
(Details still to be filled in.)
\layout Bibliography
\bibitem {e3}
Michael Kearns and Satinder Singh, 
\begin_inset Quotes eld
\end_inset

Near-Optimal Reinforcement Learning in Polynomial Time
\begin_inset Quotes erd
\end_inset

, Proceedings of the 15th International Conference on Machine Learning, pp. 260-268, 1998, Morgan Kaufmann.
\layout Bibliography
\bibitem {homing}
Sham Kakade (and others?)
\begin_inset Quotes eld
\end_inset
Reinforcement Learning in POMDPs Without Resets
\begin_inset Quotes erd
\end_inset
, http://www.cis.upenn.edu/~skakade/papers/rl/learn_pomdp.ps
\layout Bibliography
\bibitem {Generative Model}
Michael Kearns, Yishay Mansour, and Andrew Ng,
\begin_inset Quotes eld
\end_inset
A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision
Processes
\begin_inset Quotes erd
\end_inset
, Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, 1999, pp. 1324--1331.
\layout Bibliography
\bibitem {example 1}
example
\layout Bibliography
\bibitem {example 2}
algorithms
\layout Bibliography
\bibitem {Regret Reductions}
John Langford and Alina Beygelzimer,
\begin_inset Quotes eld
\end_inset
Sensitive Error Correcting Output Codes
\begin_inset Quotes erd
\end_inset
, http://hunch.net/~jl/projects/reductions/pecoc/v7/pecoc.ps
\the_end