FOCS 2020 tutorial on the Theoretical Foundations of Reinforcement Learning
Alekh Agarwal, Akshay Krishnamurthy, and John Langford
This is a tutorial on the theoretical foundations of reinforcement learning, covering many developments from the last half-decade that substantially deepen our understanding of what is possible and why. In addition, we cover several important open problems.
The tutorial has three key parts: the information theory of reinforcement learning, optimization/gradient descent in reinforcement learning, and latent state discovery.
- Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation. COLT, 2020.
- Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. NeurIPS, 2013.
- Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC learnable. ICML, 2017.
- Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. COLT, 2019.
- Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. COLT, 2020.
- Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. PC-PG: Policy cover directed exploration for provable policy gradient learning. NeurIPS, 2020.
- Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. ICML, 2020.
- Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. FLAMBE: Structural complexity and representation learning of low rank MDPs. NeurIPS, 2020.
- Zakaria Mhammedi, Dylan J. Foster, Max Simchowitz, Dipendra Misra, Wen Sun, Akshay Krishnamurthy, Alexander Rakhlin, and John Langford. Learning the linear quadratic regulator from nonlinear observations. NeurIPS, 2020.
- Michael Kearns, Yishay Mansour, and Andrew Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 2002.
- Akshay Krishnamurthy, Alekh Agarwal, and John Langford. PAC reinforcement learning with rich observations. NeurIPS, 2016.
- Simon S. Du, Sham M. Kakade, Ruosong Wang, and Lin F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? ICLR, 2020.
- Gellert Weisz, Philip Amortila, and Csaba Szepesvári. Exponential lower bounds for planning in MDPs with linearly-realizable optimal action-value functions. arXiv:2010.01374, 2020.
- Ruosong Wang, Dean P. Foster, and Sham M. Kakade. What are the statistical limits of offline RL with linear function approximation? arXiv:2010.11895, 2020.
The three challenges
- William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.
- Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 2002. Preliminary version in FOCS, 1995.
- John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. NeurIPS, 2007.
- Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. WWW, 2010.
- Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Robert E. Schapire. Efficient optimal learning for contextual bandits. UAI, 2011.
- Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. ICML, 2014.
- Dylan J. Foster and Alexander Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. ICML, 2020.
- David Simchi-Levi and Yunzong Xu. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. arXiv:2003.12699, 2020.
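To make the problem setup concrete: below is a minimal epsilon-greedy contextual bandit sketch, with per-action linear reward models fit by ridge regression. This is the simplest baseline for the explore/exploit tradeoff the papers above study (epoch-greedy and its oracle-based successors are more sample-efficient); the environment, features, and constants are illustrative, not from any cited paper.

```python
import numpy as np

# Minimal epsilon-greedy contextual bandit sketch: explore uniformly with
# small probability, otherwise act greedily under per-action linear reward
# models fit by ridge regression. Everything here is illustrative.

rng = np.random.default_rng(0)
n_actions, dim, T, eps = 5, 10, 5000, 0.1

theta_true = rng.normal(size=(n_actions, dim))    # unknown reward means
A = [np.eye(dim) for _ in range(n_actions)]       # per-action Gram matrices
b = [np.zeros(dim) for _ in range(n_actions)]     # per-action regression targets

total_reward = 0.0
for t in range(T):
    x = rng.normal(size=dim)                      # observe context
    theta_hat = np.array([np.linalg.solve(A[a], b[a]) for a in range(n_actions)])
    if rng.random() < eps:                        # explore uniformly
        a = int(rng.integers(n_actions))
    else:                                         # exploit current estimates
        a = int(np.argmax(theta_hat @ x))
    r = theta_true[a] @ x + rng.normal(scale=0.1) # bandit feedback: chosen arm only
    A[a] += np.outer(x, x)                        # ridge-regression update
    b[a] += r * x
    total_reward += r

print(f"average reward: {total_reward / T:.3f}")
```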
Tabular Markov decision process
- Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.
- Ronen I. Brafman and Moshe Tennenholtz. R-max: A general polynomial time algorithm for near-optimal reinforcement learning. JMLR, 2002.
- Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. ICML, 2006.
- Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. JMLR, 2010.
- Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. ICML, 2017.
- Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. NeurIPS, 2017.
- Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? NeurIPS, 2018.
- Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. ICML, 2019.
- Max Simchowitz and Kevin G. Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. NeurIPS, 2019.
- Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. JCSS, 2008. Preliminary version in STOC, 2004.
- Elad Hazan, Zohar Karnin, and Raghu Meka. Volumetric spanners: An efficient exploration basis for learning. COLT, 2014.
- Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. COLT, 2008.
- Yasin Abbasi-Yadkori, Csaba Szepesvári, and Dávid Pál. Improved algorithms for linear stochastic bandits. NeurIPS, 2011.
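The tabular results above largely rest on optimism under uncertainty: inflate value estimates by a count-based bonus and act greedily. Below is a simplified sketch loosely following Q-learning with UCB-style bonuses (Jin et al., 2018); the environment, bonus constant, and learning-rate schedule are illustrative simplifications, not the paper's exact construction.

```python
import numpy as np

# Simplified optimism-under-uncertainty sketch for episodic tabular RL,
# loosely in the style of Q-learning with UCB bonuses: value estimates are
# inflated by a bonus that shrinks with visit counts. Illustrative only.

rng = np.random.default_rng(1)
S, A, H, K, c = 6, 3, 5, 2000, 1.0
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a] is a distribution over s'
R = rng.random((S, A))                         # mean rewards in [0, 1]

Q = np.full((H, S, A), float(H))               # optimistic initialization
N = np.zeros((H, S, A))

for k in range(K):
    s = 0
    for h in range(H):
        a = int(np.argmax(Q[h, s]))                    # greedy w.r.t. optimistic Q
        r = float(rng.random() < R[s, a])              # Bernoulli reward
        s_next = int(rng.choice(S, p=P[s, a]))
        N[h, s, a] += 1
        t = N[h, s, a]
        alpha = (H + 1) / (H + t)                      # learning-rate schedule
        bonus = c * np.sqrt(H**3 * np.log(S * A * H * K) / t)
        v_next = Q[h + 1, s_next].max() if h + 1 < H else 0.0
        Q[h, s, a] = min(H, (1 - alpha) * Q[h, s, a] + alpha * (r + v_next + bonus))
        s = s_next
```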
Extrapolation methods for RL
- Lin F. Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. ICML, 2020.
- Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. ICML, 2020.
- Ruosong Wang, Ruslan Salakhutdinov, and Lin F. Yang. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. NeurIPS, 2020.
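The computational core in this line of work is least-squares value iteration (LSVI) over a feature map: regress a one-step Bellman backup target onto features at each stage. A bare-bones backup is sketched below on synthetic logged data; LSVI-UCB (Jin et al., 2020, cited at the top of this page) additionally adds an elliptical exploration bonus beta * sqrt(phi' Lambda^{-1} phi). All data, features, and dimensions are placeholders.

```python
import numpy as np

# Bare-bones least-squares value iteration (LSVI) with linear features:
# at each stage h, regress the target r + max_a' Q_hat_{h+1}(phi(s', a'))
# onto phi(s, a) by ridge regression. Dataset and features are synthetic
# placeholders standing in for logged trajectories.

rng = np.random.default_rng(2)
d, n_actions, H, n, lam = 8, 4, 5, 2000, 1.0

# Synthetic logged transitions: features of (s_h, a_h), rewards, and
# features of the next state under every action (for the max in the backup).
Phi = rng.normal(size=(H, n, d)) / np.sqrt(d)
Rew = rng.random((H, n))
PhiNext = rng.normal(size=(H, n, n_actions, d)) / np.sqrt(d)

w = np.zeros((H + 1, d))                               # Q_hat_h(s, a) = phi(s, a) @ w[h]
for h in reversed(range(H)):
    q_next = PhiNext[h] @ w[h + 1]                     # shape (n, n_actions)
    target = Rew[h] + np.clip(q_next.max(axis=1), 0.0, H - h - 1)
    Lambda = lam * np.eye(d) + Phi[h].T @ Phi[h]       # regularized Gram matrix
    w[h] = np.linalg.solve(Lambda, Phi[h].T @ target)  # ridge-regression backup
```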
Bellman and witness rank
- Michael Kearns and Daphne Koller. Efficient reinforcement learning in factored MDPs. IJCAI, 1999.
- Martin Mundhenk, Judy Goldsmith, Christopher Lusena, and Eric Allender. Complexity of finite-horizon Markov decision process problems. Journal of the ACM, 2000.
- Eric Allender, Sanjeev Arora, Michael Kearns, Cristopher Moore, and Alexander Russell. A note on the representational incompatibility of function approximation and factored dynamics. NeurIPS, 2002.
- Paolo Liberatore. The size of MDP factored policies. AAAI, 2002.
- Ian Osband and Benjamin Van Roy. Near-optimal reinforcement learning in factored MDPs. NeurIPS, 2014.
- Aviv Rosenberg and Yishay Mansour. Oracle-efficient reinforcement learning in factored MDPs with unknown structure. arXiv:2009.05986, 2020.
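For orientation, here is the central definition behind this section, stated informally (a paraphrase of Jiang et al., 2017, cited at the top of this page, not their exact notation). Factored MDPs, as in the references above, are a motivating model class: Sun et al. (2019) show they have low witness rank.

```latex
% Average Bellman error of candidate f' at level h when rolling in with the
% greedy policy \pi_f of candidate f (paraphrasing Jiang et al., 2017):
\mathcal{E}_h(f, f') = \mathbb{E}\Big[\, f'(x_h, a_h) - r_h - \max_{a} f'(x_{h+1}, a)
  \,\Big|\, a_{1:h-1} \sim \pi_f,\ a_h \sim \pi_{f'} \Big]
% The Bellman rank is the largest rank, over levels h, of the matrix
% [\mathcal{E}_h(f, f')]_{f, f' \in \mathcal{F}}; witness rank (Sun et al., 2019)
% replaces this Bellman error with a model-based analogue.
```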
Policy optimization algorithms
- Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
- Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. NeurIPS, 2000.
- Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. NeurIPS, 2000.
- Sham M. Kakade. A natural policy gradient. NeurIPS, 2001.
- Sham M. Kakade and John Langford. Approximately Optimal Approximate Reinforcement Learning. ICML, 2002.
- J. Andrew Bagnell and Jeff Schneider. Covariant policy search. IJCAI, 2003.
- Jan Peters, Sethu Vijayakumar, and Stefan Schaal. Natural actor-critic. Neurocomputing, 2008.
- John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. ICML, 2015.
- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
- Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML, 2018.
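The oldest entry above, REINFORCE (Williams, 1992), fits in a few lines. Here is a minimal sketch on a one-state (bandit) problem with a softmax policy, ascending the gradient estimate grad log pi(a) * R; the problem instance, step size, and absence of a baseline are illustrative simplifications.

```python
import numpy as np

# Minimal REINFORCE sketch with a softmax policy on a one-state bandit:
# sample an action, observe a reward, and ascend r * grad log pi(a).
# Problem instance and step size are illustrative.

rng = np.random.default_rng(3)
mu = np.array([0.2, 0.5, 0.9])        # unknown mean rewards; arm 2 is best
theta = np.zeros(3)                   # softmax logits
lr = 0.1

for t in range(5000):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    a = int(rng.choice(3, p=pi))
    r = float(rng.random() < mu[a])   # Bernoulli reward
    grad_logpi = -pi                  # d/d theta of log softmax: e_a - pi
    grad_logpi[a] += 1.0
    theta += lr * r * grad_logpi      # REINFORCE update (no baseline)

print("learned policy:", np.round(pi, 3))  # should concentrate on arm 2
```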
Convergence of policy optimization
- Eyal Even-Dar, Sham. M. Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 2009.
- Bruno Scherrer. Approximate policy iteration schemes: A comparison. ICML, 2014.
- Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized Markov decision processes. arXiv:1705.07798, 2017.
- Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. ICML, 2019.
- Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvári, and Gellert Weisz. POLITEX: Regret bounds for policy iteration using expert prediction. ICML, 2019.
- Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. arXiv:1906.01786, 2019.
- Lior Shani, Yonathan Efroni, and Shie Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. AAAI, 2020.
- Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the global convergence rates of softmax policy gradient methods. NeurIPS, 2020.
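Several of these analyses center on the natural policy gradient, which for tabular softmax policies reduces to the multiplicative-weights update pi(a|s) proportional to pi(a|s) * exp(eta * A^pi(s, a)) (see, e.g., Agarwal et al., 2020, cited at the top of this page). Below is a sketch with exact policy evaluation on a randomly generated, fully known MDP; the environment and step size are illustrative.

```python
import numpy as np

# Natural policy gradient for tabular softmax policies: evaluate the
# current policy exactly, then take the multiplicative-weights step
# pi(a|s) <- pi(a|s) * exp(eta * A^pi(s, a)), renormalized. Illustrative
# random MDP with known dynamics.

rng = np.random.default_rng(4)
S, A, gamma, eta = 5, 3, 0.9, 0.5
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a] is a distribution over s'
R = rng.random((S, A))                         # rewards
pi = np.full((S, A), 1.0 / A)                  # uniform initial policy

for it in range(200):
    # Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi.
    P_pi = np.einsum('sa,sat->st', pi, P)
    R_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
    Q = R + gamma * P @ V                      # state-action values
    Adv = Q - V[:, None]
    pi = pi * np.exp(eta * Adv)                # NPG / mirror-descent step
    pi /= pi.sum(axis=1, keepdims=True)

print("greedy actions:", pi.argmax(axis=1))
```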
Exploration in policy optimization
Latent state discovery
- Robert Givan, Thomas Dean, and Matthew Greig. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 2003.
- Lihong Li, Thomas J. Walsh, and Michael L. Littman. Towards a unified theory of state abstraction for MDPs. International Symposium on Artificial Intelligence and Mathematics, 2006.
- Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. ICML, 2017.
- Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. NeurIPS, 2017.
- Aditya Modi, Nan Jiang, Ambuj Tewari, and Satinder Singh. Sample complexity of reinforcement learning using linearly combined model ensembles. AISTATS, 2020.
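To ground the count-based exploration entries above: the simplest instantiation maps observations to discrete codes and adds a bonus proportional to 1/sqrt(count), in the spirit of #Exploration (Tang et al., 2017). A toy sketch follows, with the discretizer standing in for a learned state abstraction of the kind the other papers in this section construct; all names and constants are illustrative.

```python
from collections import defaultdict
import numpy as np

# Toy count-based exploration bonus: hash observations into discrete
# codes and reward novelty with beta / sqrt(count). The coarse rounding
# below stands in for a learned abstraction; constants are illustrative.

counts = defaultdict(int)
beta = 0.1

def discretize(obs, bins=10):
    """Illustrative stand-in for a learned abstraction: coarse rounding."""
    return tuple(np.floor(np.asarray(obs) * bins).astype(int))

def shaped_reward(obs, env_reward):
    """Return the environment reward plus a count-based novelty bonus."""
    code = discretize(obs)
    counts[code] += 1
    return env_reward + beta / np.sqrt(counts[code])

# Usage inside any RL loop: r_shaped = shaped_reward(observation, r)
```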