#LyX 1.3 created this file. For more info see http://www.lyx.org/ \lyxformat 221 \textclass amsart \language english \inputencoding auto \fontscheme default \graphics default \paperfontsize default \spacing single \papersize Default \paperpackage a4 \use_geometry 0 \use_amsmath 0 \use_natbib 0 \use_numerical_citations 0 \paperorientation portrait \secnumdepth 3 \tocdepth 3 \paragraph_separation indent \defskip medskip \quotes_language english \quotes_times 2 \papercolumns 1 \papersides 1 \paperpagestyle default \layout Title \begin_inset Quotes eld \end_inset Pure \begin_inset Quotes erd \end_inset Reinforcement Learning with Classification \layout Author John Langford \layout Section Introduction \layout Standard \noun on Problem \noun default This paper addresses the purest reinforcement learning problem, which intuitive ly is \begin_inset Quotes eld \end_inset choose actions in the world so as to optimize rewards given observations \begin_inset Quotes erd \end_inset . We make \emph on no \emph default assumptions about the world other than the ability to approximately solve a certain collection of classification problems. In particular, we rely on no ability to \begin_inset Quotes eld \end_inset reset \begin_inset Quotes erd \end_inset to a start state, no generative model to generate outcomes given states and actions \begin_inset LatexCommand \cite{Generative Model} \end_inset , and no \begin_inset Quotes eld \end_inset mixing time \begin_inset Quotes erd \end_inset \begin_inset LatexCommand \cite{e3} \end_inset . A \begin_inset Quotes eld \end_inset homing policy \begin_inset Quotes erd \end_inset \begin_inset LatexCommand \cite{homing} \end_inset is (in some sense) not an assumption, but we do not need it in order to state the basic result (it may be useful in the process of acquiring good classifications). \layout Standard \noun on Solution Motivation \noun default The method of reduction in learning allows us to ask questions of the sort \begin_inset Quotes eld \end_inset How does the ability to (approximately) solve simple prediction problems allow us to (approximately) solve complex prediction problems? \begin_inset Quotes erd \end_inset Aside from the theoretical motivations of this question, this style of analysis often results in practical new algorithms (such as \begin_inset LatexCommand \cite{example 1} \end_inset , \begin_inset LatexCommand \cite{example 2} \end_inset ) for solving learning problems in diverse domains. Here, we show that, for the right definition of \begin_inset Quotes eld \end_inset learning \begin_inset Quotes erd \end_inset , \emph on all \emph default learning problems can be (approximately) solved by (approximately) solving certain classification problems. Here the \begin_inset Quotes eld \end_inset all \begin_inset Quotes erd \end_inset is due to the ability to embed a very large set of learning problems into reinforcement learning. \layout Standard \noun on Caveats \noun default The method we present is the first connection of this sort, but it may not be the last because the reduction is \begin_inset Quotes eld \end_inset loose \begin_inset Quotes erd \end_inset ---small error rates must be achieved in order to guarantee that the created policy is anywhere near optimal. Consequently, very significant tightenings of these results may be possible. It is also important to understand that the algorithm presented here is quite computationally expensive, so again room for improvement exists. \layout Section Solution Sketch \layout Subsection Definitions \layout Standard It is possible to think of the general reinforcement learning problem as learning in a POMDP (Partially Observable Markov Decision Process) which is defined by a set of states, actions, observations, and rewards together with conditional probability tables on these sets. However, this is unnecessarily complicated because we need never use (or explicitly consider) states. Instead, consider the following definition: \layout Definition (Reinforcement Learning Problem) A reinforcement learning problem \begin_inset Formula $\textrm{RL}=\{ A,O,D\}$ \end_inset is defined by a set of actions \begin_inset Formula $A$ \end_inset , a set of observations \begin_inset Formula $O$ \end_inset , and conditional probability tables \begin_inset Formula $D(o',r|(a,o,r)^{*})$ \end_inset where \begin_inset Formula $(a,o,r)^{*}$ \end_inset is any history of past actions, observations, and rewards. Here \begin_inset Formula $r\in[0,1]$ \end_inset is some reward. \layout Standard There are a variety of goals considered in reinforcement learning. We consider the following one: \layout Definition (Reinforcement Learning Goal) Given some horizon \begin_inset Formula $T$ \end_inset , find a policy \begin_inset Formula $\pi:(o,a)^{*}\rightarrow a$ \end_inset , optimizing the expected sum of rewards: \begin_inset Formula \[ \max_{\pi}E_{(a,o,r)^{T}\sim\pi,D}\sum_{t=1}^{T}r_{t}\] \end_inset where \begin_inset Formula $r_{t}$ \end_inset is the \begin_inset Formula $t$ \end_inset th observed reward. Here, the expectation is over the process which generates a history using \begin_inset Formula $D$ \end_inset and choosing actions from \begin_inset Formula $\pi$ \end_inset . \layout Standard This definition is essentially equivalent to several other definitions. For example, the commonly used \begin_inset Formula $\gamma$ \end_inset discounting definition which optimizes the \begin_inset Formula $\gamma$ \end_inset -discounted sum of rewards can be simulated by modifying the reinforcement learning problem so that at each timestep with probability \begin_inset Formula $1-\gamma$ \end_inset , a \begin_inset Quotes eld \end_inset termination \begin_inset Quotes erd \end_inset observation is made and no future rewards are observed. The truncation of the infinite sum at \begin_inset Formula $T$ \end_inset introduces an error which decreases exponentially in \begin_inset Formula $T(1-\gamma)$ \end_inset . \layout Subsection The Algorithm \layout Standard The basic idea of reductions in learning is to reduce more complicated learning problems to simpler ones. One way to think about classification problems in particular is in the form of a decision question with a yes/no answer (which is sometimes answered incorrectly). For example \begin_inset Quotes eld \end_inset If I take action \begin_inset Formula $a_{1}$ \end_inset is the reward greater than \begin_inset Formula $0.5$ \end_inset ? \begin_inset Quotes erd \end_inset Not all binary questions are allowed---only those for which we can derive a coherent classification problem (and hence talk about classificaton error rates) make sense. \layout Standard The set of questions used here have the form \begin_inset Quotes eld \end_inset What is the expected reward of doing action \begin_inset Formula $a_{t}$ \end_inset after some prefix \begin_inset Formula $a_{1},...,a_{t-1}$ \end_inset ? \begin_inset Quotes erd \end_inset This is a reduction to a form of regression rather than classification. However, we can further reduce the regression problem to binary classification using a simple reduction from cost estimation (which is another term for regression) to classification \begin_inset LatexCommand \cite{Regret Reductions} \end_inset . The method chooses the action \begin_inset Formula $a_{t}$ \end_inset at timestep \begin_inset Formula $t$ \end_inset leading to the largest sum of expected future rewards. \layout Section Analysis \layout Standard Before stating the analysis, we need to make a basic reductions-related sanity check. Any reinforcement learning problem \begin_inset Formula $D$ \end_inset and prefix of actions \begin_inset Formula $a_{1},....,a_{t}$ \end_inset creates a measure \begin_inset Formula $D(o,r|a_{1},...,a_{t})=E_{o_{1},r_{1},...,o_{t},r_{t}\sim a_{1},...,a_{t},D}D(o,r|(a,o,r)^{t})$ \end_inset . Consequently, there exists a well defined distribution over immediate rewards given any prefix of actions. This implies that the regression problems defined each correspond to an induced measure \begin_inset Formula $D_{a_{1},...,a_{t}}$ \end_inset over immediate rewards. \layout Standard A basic theorem can be proved about this algorithm. For this theorem, let the performance of any policy \begin_inset Formula $\pi$ \end_inset be defined by: \begin_inset Formula \[ \eta(\pi)=E_{(a,o,r)^{T}\sim\pi,D}\sum_{t=1}^{T}r_{t}\] \end_inset \layout Lemma (Reduction Statement) Let \begin_inset Formula $\pi_{T}$ \end_inset be the policy created by the regressors above. Let the predictions (given any past observations) be given by \begin_inset Formula $\hat{r}_{a_{1},...,a_{t}}$ \end_inset . If, for all action prefixes \begin_inset Formula $a_{1},...,a_{t}$ \end_inset with \begin_inset Formula $t\in\{1,...,T\}$ \end_inset : \begin_inset Formula \[ E_{(o,r)\sim D_{a_{1},...,a_{t}}}|r-\hat{r}_{a_{1},...,a_{t}}|\leq\epsilon\] \end_inset then \begin_inset Formula \[ \eta(\pi^{*})-\eta(\pi_{T})\leq f(\epsilon)\] \end_inset \layout Proof (very sketchy) First note that if the predictions are perfect, then \begin_inset Formula $\eta(\pi^{*})=\eta(\pi_{T})$ \end_inset . Then, note that in the worst case (with adversarial regressors constrained by the predictability assumption), the sum of rewards for any \begin_inset Formula $T$ \end_inset -length prefix is altered by at most \begin_inset Formula $\epsilon T$ \end_inset . (details still to be filled in.) \layout Bibliography \bibitem {e3} Michael Kearns and Satinder Singh, Near-Optimal Reinforcement Learning in Polynomial Time. Proceedings of the 15th International Conference on Machine Learning, pp. 260-268, 1998, Morgan Kaufmann. \layout Bibliography \bibitem {homing} Sham Kakade (and others?) \begin_inset Quotes eld \end_inset Reinforcement Learning in POMDPs Without Resets \begin_inset Quotes erd \end_inset http://www.cis.upenn.edu/~skakade/papers/rl/learn_pomdp.ps \layout Bibliography \bibitem {Generative Model} Michael Kearns, Yishay Mansour, and Andrew Ng, \begin_inset Quotes eld \end_inset A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes \begin_inset Quotes erd \end_inset Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence Morgan Kaufmann, 1999, pages 1324--1331. \layout Bibliography \bibitem {example 1} example \layout Bibliography \bibitem {example 2} algorithms \layout Bibliography \bibitem {Regret Reductions} John Langford and Alina Beygelzimer, \begin_inset Quotes eld \end_inset Sensitive Error Correcting Output Codes \begin_inset Quotes erd \end_inset , http://hunch.net/~jl/projects/reductions/pecoc/v7/pecoc.ps \the_end