Near-optimal Regret Bounds for Reinforcement Learning

Authors: Ronald Ortner, Peter Auer, Thomas Jaksch

DOI:

Keywords:

Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T. Finally, we also consider a setting where the MDP is allowed to change a fixed number of l times. We present a modification of our algorithm that is able to deal with this setting and show a regret bound of Õ(l^(1/3) T^(2/3) DS√A).
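The diameter D in the abstract is, in general, the minimum over policies of the maximum expected travel time between any two states. In the special case of a deterministic MDP (each action leads to a single successor state), the fastest policy between two states is a shortest path, so D reduces to the graph diameter and can be computed by breadth-first search. The sketch below illustrates that special case only; it is not the paper's algorithm, and the `transitions` adjacency encoding is a hypothetical convention chosen for the example.

```python
from collections import deque

def diameter(num_states, transitions):
    """Diameter D of a *deterministic* MDP: the maximum over ordered state
    pairs (s, s') of the minimum number of steps needed to move from s to s'.
    `transitions[s]` lists the successor states reachable from s in one step
    (one entry per action). Assumes the MDP is communicating, so every state
    can reach every other state; otherwise D would be infinite."""
    D = 0
    for s in range(num_states):
        # BFS from s gives the shortest travel time to every other state.
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in transitions[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        D = max(D, max(dist.values()))
    return D

# A 3-state one-way cycle: the farthest state is 2 steps away, so D = 2.
ring = {0: [1], 1: [2], 2: [0]}
print(diameter(3, ring))  # → 2
```

For stochastic MDPs the same idea applies with expected hitting times in place of path lengths, which requires solving a shortest-path MDP rather than running BFS.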

References (26)
Gadiel Seroussi, Erik Ordentlich, Sergio Verdu, Tsachy Weissman, Marcelo J. Weinberger, Inequalities for the L1 Deviation of the Empirical Distribution (2003)
Sham Machandranath Kakade, On the Sample Complexity of Reinforcement Learning. Doctoral thesis, UCL (University College London) (2003)
Damien Ernst, Arthur Louette, Introduction to Reinforcement Learning. MIT Press (1998)
Peter L. Bartlett, Ambuj Tewari, REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. Uncertainty in Artificial Intelligence, pp. 35-42 (2009)
Aurélien Garivier, Eric Moulines, On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems. arXiv: Statistics Theory (2008)
Alexander L. Strehl, Michael L. Littman, An analysis of model-based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences, vol. 74, pp. 1309-1331 (2008), 10.1016/J.JCSS.2007.08.009
Apostolos N. Burnetas, Michael N. Katehakis, Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, vol. 22, pp. 222-255 (1997), 10.1287/MOOR.22.1.222
Eyal Even-Dar, Sham M. Kakade, Yishay Mansour, Online Markov Decision Processes. Mathematics of Operations Research, vol. 34, pp. 726-736 (2009), 10.1287/MOOR.1090.0396
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, Robert E. Schapire, The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, vol. 32, pp. 48-77 (2003), 10.1137/S0097539701398375