A tight analysis of exploration in deterministic MDPs | Sven Koenig and Reid Simmons, Complexity Analysis of Real-Time Reinforcement Learning, AAAI 1993 |
Exploration in MDPs with a mixing time has poly sample complexity. | Michael Kearns and Satinder Singh, Near-Optimal Reinforcement Learning in Polynomial Time, ICML 1998 |
Optimizing against a model that uses the observed rewards for well-explored state-actions and the maximal possible reward for uncertain ones gives the same result (a sketch of this optimism-under-uncertainty idea appears in the first code block below). | Ronen Brafman and Moshe Tennenholtz, R-max - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, IJCAI 2001 |
Exploration in factored MDPs is possible given the factored structure and the ability to solve a computationally hard planning problem. | Michael Kearns and Daphne Koller, Efficient Reinforcement Learning in Factored MDPs, IJCAI 1999 (There is a fundamentally hard planning problem here. It's really hard.) |
Removal of the mixing time assumption, an O(S²A) sample complexity analysis, and a lower bound. | Sham Kakade, On the Sample Complexity of Reinforcement Learning, PhD Thesis, Gatsby Unit, UCL, 2003 |
Exploration is possible in metric MDPs with an infinite state space. | Sham Kakade, Michael Kearns, and John Langford, Exploration in Metric State Spaces, ICML 2003 |
The sample complexity can be reduced by maintaining precise confidence intervals on the learned model (the second code block below sketches this idea). | Alexander L. Strehl and Michael L. Littman, A Theoretical Analysis of Model-Based Interval Estimation, ICML 2005 |
Poly(A, O, K_H)T sample complexity is possible for a POMDP, where K_H is the homing time and O is the number of observations. | Eyal Even-Dar, Sham Kakade, and Yishay Mansour, Reinforcement Learning in POMDPs Without Resets, IJCAI 2005 |
Exploration requires only O(SA) experience, which is less than the description complexity of an MDP. | Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman, PAC Model-Free Reinforcement Learning, ICML 2006 |
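For the optimism-under-uncertainty row above (Brafman & Tennenholtz), here is a minimal sketch of the R-max idea: state-action pairs visited fewer than m times are modeled as maximally rewarding, so a greedy planner keeps getting pulled toward them until they become known. The class name, the threshold m, and the tabular value-iteration planner are illustrative choices, not the paper's pseudocode.

```python
# Minimal R-max-style sketch (illustrative, not the paper's pseudocode).
# Unknown (s, a) pairs keep the optimistic value r_max / (1 - gamma), so a
# greedy policy is driven to try them until they have been seen m times.
import numpy as np


class RMaxAgent:
    def __init__(self, n_states, n_actions, r_max=1.0, m=5, gamma=0.95):
        self.nS, self.nA = n_states, n_actions
        self.r_max, self.m, self.gamma = r_max, m, gamma
        self.counts = np.zeros((n_states, n_actions), dtype=int)
        self.trans_counts = np.zeros((n_states, n_actions, n_states), dtype=int)
        self.reward_sums = np.zeros((n_states, n_actions))
        # Optimistic initialization: every (s, a) starts at the maximal value.
        self.Q = np.full((n_states, n_actions), r_max / (1 - gamma))

    def act(self, s):
        return int(np.argmax(self.Q[s]))

    def update(self, s, a, r, s_next):
        if self.counts[s, a] < self.m:          # still "unknown": keep building its model
            self.counts[s, a] += 1
            self.trans_counts[s, a, s_next] += 1
            self.reward_sums[s, a] += r
            if self.counts[s, a] == self.m:     # just became "known": re-plan
                self._plan()

    def _plan(self, sweeps=200):
        # Value iteration on the optimistic model: known (s, a) use empirical
        # estimates, unknown (s, a) retain the optimistic value r_max / (1 - gamma).
        for _ in range(sweeps):
            V = self.Q.max(axis=1)
            for s in range(self.nS):
                for a in range(self.nA):
                    if self.counts[s, a] >= self.m:
                        p = self.trans_counts[s, a] / self.counts[s, a]
                        r = self.reward_sums[s, a] / self.counts[s, a]
                        self.Q[s, a] = r + self.gamma * p @ V
```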
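For the interval-estimation row (Strehl & Littman), the full algorithm plans against the most optimistic model consistent with confidence intervals on the transitions and rewards; the sketch below instead uses the simpler exploration-bonus form of the same idea, inflating empirical rewards by a term that shrinks like 1/sqrt(n(s, a)). The constant beta, the function name, and the assumption that every (s, a) has been tried at least once are illustrative, not taken from the paper.

```python
# Confidence-interval-flavored exploration sketch (bonus form, illustrative).
# Rarely tried actions get a larger bonus, so they look better until their
# intervals tighten; counts are assumed to be >= 1 everywhere.
import numpy as np


def optimistic_q_values(trans_counts, reward_sums, counts, gamma=0.95,
                        beta=0.5, sweeps=200):
    """Value iteration on empirical estimates plus a confidence-interval bonus.

    trans_counts : (S, A, S) counts of observed transitions
    reward_sums  : (S, A) summed observed rewards
    counts       : (S, A) visit counts, assumed >= 1 for this sketch
    """
    nS, nA, _ = trans_counts.shape
    p_hat = trans_counts / counts[..., None]      # empirical transition model
    r_hat = reward_sums / counts                  # empirical mean reward
    bonus = beta / np.sqrt(counts)                # wider interval => bigger bonus
    Q = np.zeros((nS, nA))
    for _ in range(sweeps):
        V = Q.max(axis=1)
        Q = r_hat + bonus + gamma * np.einsum("san,n->sa", p_hat, V)
    return Q
```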