Authors: Ronald Ortner, Peter Auer, Thomas Jaksch
DOI:
Keywords:
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: an MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps (on average). We present an algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T. Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. A modification of our algorithm is able to deal with this setting and shows a regret bound of Õ(ℓ^(1/3) T^(2/3) DS√A).
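The diameter D used in these bounds is the maximum, over ordered pairs of states (s, s'), of the minimal expected number of steps needed to reach s' from s under some policy. Below is a minimal, illustrative sketch (not from the paper) of how one might compute D for a small tabular MDP by running value iteration on expected hitting times for each target state; the function name, the NumPy representation of the transition kernel, and the toy MDP are all assumptions made for the example.

```python
import numpy as np

def mdp_diameter(P, tol=1e-8, max_iter=10_000):
    """Estimate the diameter D of a tabular MDP (illustrative sketch).

    P has shape (S, A, S): P[s, a, s2] is the probability of moving from
    state s to state s2 under action a. For each target state s', solve
        T(s) = 1 + min_a sum_{s2} P[s, a, s2] * T(s2),   T(s') = 0,
    by value iteration; the diameter is the largest expected hitting time
    over all ordered pairs (s, s').
    """
    S, A, _ = P.shape
    diameter = 0.0
    for target in range(S):
        T = np.zeros(S)
        for _ in range(max_iter):
            # Bellman backup on expected steps-to-target; P @ T has shape (S, A).
            T_new = 1.0 + np.min(P @ T, axis=1)
            T_new[target] = 0.0
            if np.max(np.abs(T_new - T)) < tol:
                T = T_new
                break
            T = T_new
        diameter = max(diameter, T.max())
    return diameter

# Toy 2-state, 2-action MDP: action 0 stays put, action 1 moves to the
# other state with probability 0.5, so the expected hitting time is 2.
P = np.zeros((2, 2, 2))
P[:, 0, :] = np.eye(2)      # "stay" action
P[0, 1] = [0.5, 0.5]        # "try to switch" from state 0
P[1, 1] = [0.5, 0.5]        # "try to switch" from state 1
print(mdp_diameter(P))      # prints 2.0 (up to the tolerance)
```

In this sketch the inner loop is a stochastic-shortest-path value iteration, which converges whenever every state can reach the target under some policy; this is exactly the condition under which the diameter D in the abstract is finite.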