FOCS 2020 tutorial on the Theoretical Foundations of Reinforcement Learning
Alekh Agarwal, Akshay Krishnamurthy, and John Langford
Overview
This is a tutorial on the theoretical foundations of reinforcement learning, covering many developments from the last half-decade that substantially deepen our understanding of what is possible and why. We also cover several important open problems.
The tutorial has three key parts: the information theory of reinforcement learning, optimization/gradient descent in reinforcement learning, and latent state discovery.
Primary references
- Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation. COLT, 2020.
- Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. NeurIPS, 2013.
- Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC learnable. ICML, 2017.
- Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. COLT, 2019.
- Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift. COLT, 2020.
- Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning. NeurIPS, 2020.
- Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic State Abstraction and Provably Efficient Rich-Observation Reinforcement Learning. ICML, 2020.
- Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs. NeurIPS, 2020.
- Zakaria Mhammedi, Dylan J. Foster, Max Simchowitz, Dipendra Misra, Wen Sun, Akshay Krishnamurthy, Alexander Rakhlin, and John Langford. Learning the Linear Quadratic Regulator from Nonlinear Observations. NeurIPS, 2020.
Other references
Lower bounds
- Michael Kearns, Yishay Mansour, and Andrew Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 2002.
- Akshay Krishnamurthy, Alekh Agarwal, and John Langford. PAC reinforcement learning with rich observations. NeurIPS, 2016.
- Simon S. Du, Sham M. Kakade, Ruosong Wang, Lin F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? ICLR, 2020.
- Gellert Weisz, Philip Amortila, and Csaba Szepesvári. Exponential lower bounds for planning in MDPs with linearly-realizable optimal action-value functions. arXiv:2010.01374, 2020.
- Ruosong Wang, Dean P. Foster, and Sham M. Kakade. What are the statistical limits of offline RL with linear function approximation? arXiv:2010.11895, 2020.
The three challenges
Contextual bandits
- William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.
- Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 2002. Preliminary version in FOCS, 1995.
- John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. NeurIPS, 2007.
- Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. WWW, 2010.
- Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Robert E. Schapire. Efficient optimal learning for contextual bandits. UAI, 2011.
- Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. ICML, 2014.
- Dylan J. Foster and Alexander Rakhlin. Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles. ICML, 2020.
- David Simchi-Levi and Yunzong Xu. Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits under Realizability. arXiv:2003.12699, 2020.
Tabular Markov decision process
- Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.
- Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. JMLR, 2002.
- Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. ICML, 2006.
- Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. JMLR, 2010.
- Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. ICML, 2017.
- Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. NeurIPS, 2017.
- Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? NeurIPS, 2018.
- Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. ICML, 2019.
- Max Simchowitz and Kevin G. Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. NeurIPS, 2019.
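A common mechanism running through the tabular results above (e.g., the model-based algorithms of Azar et al., 2017 and the optimistic Q-learning of Jin et al., 2018) is value backup on empirical estimates plus a count-based exploration bonus. The sketch below is a minimal illustration of that idea for an episodic tabular MDP, in the spirit of Jin et al. (2018) but not a faithful reproduction of it: the environment interface (`reset()` returning a state, `step(a)` returning `(next_state, reward, done)`), the bonus constant `c`, and all names are assumptions made for illustration.

```python
import numpy as np

def optimistic_q_learning(env, S, A, H, num_episodes, c=1.0, delta=0.05):
    """Minimal episodic Q-learning with a count-based optimism bonus.

    Illustrative sketch only; not the exact algorithm of any paper above.
    """
    iota = np.log(S * A * H * num_episodes / delta)   # log factor inside the bonus
    Q = np.full((H, S, A), float(H))                  # optimistic initialization at H
    N = np.zeros((H, S, A), dtype=int)                # visit counts per (step, state, action)

    for _ in range(num_episodes):
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))               # act greedily w.r.t. optimistic Q
            s_next, r, done = env.step(a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)                 # step size decaying with the count
            bonus = c * np.sqrt(H**3 * iota / t)      # exploration bonus shrinking with t
            v_next = 0.0 if (h == H - 1 or done) else min(H, Q[h + 1, s_next].max())
            Q[h, s, a] += alpha * (r + v_next + bonus - Q[h, s, a])
            if done:
                break
            s = s_next
    return Q
```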
Linear bandits
- Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. JCSS, 2008. Preliminary version in STOC, 2004.
- Elad Hazan, Zohar Karnin, and Raghu Meka. Volumetric spanners: An efficient exploration basis for learning. COLT, 2014.
- Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. COLT, 2008.
- Yasin Abbasi-Yadkori, Csaba Szepesvari, and David Pal. Improved algorithms for linear stochastic bandits. NeurIPS, 2011.
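As a rough illustration of the optimism principle underlying the stochastic linear bandit papers above (Dani et al., 2008; Abbasi-Yadkori et al., 2011), the sketch below keeps a ridge-regression estimate of the unknown parameter and picks the candidate action maximizing an upper confidence bound. The confidence width `beta`, the candidate set, and the update pattern are illustrative assumptions, not the calibrated constants or exact procedure from those papers.

```python
import numpy as np

def linucb_step(A_mat, b_vec, actions, beta=1.0):
    """Optimistic action selection for a stochastic linear bandit (sketch).

    A_mat: d x d design matrix (lam * I + sum of x x^T), b_vec: sum of reward * x,
    actions: k x d array of candidate feature vectors. Returns the chosen index.
    """
    A_inv = np.linalg.inv(A_mat)
    theta_hat = A_inv @ b_vec                                      # ridge-regression estimate
    widths = np.sqrt(np.einsum("ki,ij,kj->k", actions, A_inv, actions))
    ucb = actions @ theta_hat + beta * widths                      # mean plus exploration bonus
    return int(np.argmax(ucb))

# Usage sketch: initialize with lam * I, then update after observing a reward.
d, lam = 5, 1.0
A_mat, b_vec = lam * np.eye(d), np.zeros(d)
actions = np.random.randn(10, d)         # hypothetical candidate actions
idx = linucb_step(A_mat, b_vec, actions)
x, reward = actions[idx], 0.3            # `reward` would come from the environment
A_mat += np.outer(x, x)
b_vec += reward * x
```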
Extrapolation methods for RL
- Lin F. Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. ICML, 2020.
- Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. ICML, 2020.
- Ruosong Wang, Ruslan Salakhutdinov, and Lin F. Yang. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. NeurIPS, 2020.
Bellman and witness rank
Factored MDPs
- Michael Kearns and Daphne Koller. Efficient reinforcement learning in factored MDPs. IJCAI, 1999.
- Martin Mundhenk, Judy Goldsmith, Christopher Lusena, and Eric Allender. Complexity of finite-horizon Markov decision process problems. Journal of the ACM, 2000.
- Eric Allender, Sanjeev Arora, Michael Kearns, Cristopher Moore, and Alexander Russell. A note on the representational incompatibility of function approximation and factored dynamics. NeurIPS, 2002.
- Paolo Liberatore. The size of MDP factored policies. AAAI, 2002.
- Ian Osband and Benjamin Van Roy. Near-optimal reinforcement learning in factored MDPs. NeurIPS, 2014.
- Aviv Rosenberg and Yishay Mansour. Oracle-efficient reinforcement learning in factored MDPs with unknown structure. arXiv:2009.05986, 2020.
Policy optimization algorithms
- Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
- Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. NeurIPS, 2000.
- Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. NeurIPS, 2000.
- Sham M. Kakade. A natural policy gradient. NeurIPS, 2001.
- Sham M. Kakade and John Langford. Approximately Optimal Approximate Reinforcement Learning. ICML, 2002.
- J. Andrew Bagnell and Jeff Schneider. Covariant policy search. IJCAI, 2003.
- Jan Peters, Sethu Vijayakumar, and Stefan Schaal. Natural actor-critic. Neurocomputing, 2008.
- John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. ICML, 2015.
- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
- Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML, 2018.
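Most of the methods above build on the score-function gradient estimator of Williams (1992): the policy gradient is E[sum_t grad log pi_theta(a_t|s_t) * G_t], where G_t is the return from step t. The sketch below is a minimal tabular-softmax illustration of that estimator only; the environment interface, step size, and names are assumptions, and it does not reflect the specific algorithms (NPG, CPI, TRPO, PPO, SAC) cited above.

```python
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities of a tabular softmax policy pi_theta(.|s)."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def reinforce_episode(env, theta, gamma=0.99, lr=0.05):
    """One REINFORCE update (Williams, 1992) for a tabular softmax policy.

    Illustrative sketch: `env` is assumed to expose reset() -> state and
    step(a) -> (next_state, reward, done), with integer states and actions.
    """
    # Roll out one episode with the current policy.
    traj, s, done = [], env.reset(), False
    while not done:
        p = softmax_policy(theta, s)
        a = int(np.random.choice(len(p), p=p))
        s_next, r, done = env.step(a)
        traj.append((s, a, r))
        s = s_next

    # Accumulate sum_t grad log pi(a_t|s_t) * G_t using returns-to-go G_t.
    grad, G = np.zeros_like(theta), 0.0
    for s, a, r in reversed(traj):
        G = r + gamma * G
        p = softmax_policy(theta, s)
        grad[s] -= G * p                  # d/dtheta[s] log pi(a|s) = e_a - pi(.|s)
        grad[s, a] += G

    theta += lr * grad                    # stochastic gradient ascent on the return
    return theta
```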
Policy optimization analyses
- Eyal Even-Dar, Sham. M. Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 2009.
- Bruno Scherrer. Approximate policy iteration schemes: A comparison. ICML, 2014.
- Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized Markov decision processes. arXiv:1705.07798, 2017.
- Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. ICML, 2019.
- Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellert Weisz. POLITEX: Regret bounds for policy iteration using expert prediction. ICML, 2019.
- Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. arXiv:1906.01786, 2019.
- Lior Shani, Yonathan Efroni, and Shie Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. AAAI, 2020.
- Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the global convergence rates of softmax policy gradient methods. NeurIPS, 2020.
Exploration in policy optimization
Latent state discovery
- Robert Givan, Thomas Dean, and Matthew Greig. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 2003.
- Lihong Li, Thomas J. Walsh, and Michael L. Littman. Towards a unified theory of state abstraction for MDPs. International Symposium on Artificial Intelligence and Mathematics, 2006.
- Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. ICML, 2017.
- Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. NeurIPS, 2017.
- Aditya Modi, Nan Jiang, Ambuj Tewari, and Satinder Singh. Sample complexity of reinforcement learning using linearly combined model ensembles. AISTATS, 2020.