Model-based reinforcement learning with nearly tight exploration complexity bounds

Authors: István Szita, Csaba Szepesvári

DOI:

Keywords: Markov decision process, Reinforcement learning, Upper and lower bounds, Mathematical optimization, Time complexity, Logarithm, Discrete mathematics, Mathematics

Abstract: One might believe that model-based algorithms of reinforcement learning can propagate the obtained experience more quickly, and are able to direct exploration better. As a consequence, fewer exploratory actions should be enough to learn a good policy. Strangely enough, current theoretical results for model-based algorithms do not support this claim: in a finite Markov decision process with N states, the best bounds on the number of exploratory steps necessary are of order O(N^2 log N), in contrast to the O(N log N) bound available for the model-free delayed Q-learning algorithm. In this paper we show that Mormax, a modified version of the Rmax algorithm, needs to make at most O(N log N) exploratory steps. This matches the lower bound up to logarithmic factors, as well as the upper bound of the state-of-the-art model-free algorithm, while our new bound improves the dependence on the other problem parameters.
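The abstract names Rmax and its modification Mormax but does not describe the algorithm itself. As a rough illustration of the kind of model-based bookkeeping such methods rely on, here is a minimal sketch of a generic Rmax-style learner in Python. This is not the paper's Mormax; the class, method, and parameter names (RmaxAgent, m_known, etc.) are hypothetical, and the sketch only shows the standard "known state-action" mechanism that Mormax builds on.

```python
# Hypothetical sketch of a generic Rmax-style model-based learner on a finite MDP.
# NOT the paper's Mormax algorithm; names and parameters are illustrative only.
import numpy as np

class RmaxAgent:
    def __init__(self, n_states, n_actions, gamma=0.95, r_max=1.0, m_known=10):
        self.nS, self.nA = n_states, n_actions
        self.gamma, self.r_max, self.m = gamma, r_max, m_known
        # Visit counts and sufficient statistics of the empirical model.
        self.count = np.zeros((n_states, n_actions), dtype=int)
        self.trans_count = np.zeros((n_states, n_actions, n_states), dtype=int)
        self.reward_sum = np.zeros((n_states, n_actions))
        # Optimistic initialization: every pair starts at the maximum possible value.
        self.Q = np.full((n_states, n_actions), r_max / (1.0 - gamma))

    def update(self, s, a, r, s_next):
        """Record one transition; replan once (s, a) becomes 'known'."""
        if self.count[s, a] < self.m:
            self.count[s, a] += 1
            self.trans_count[s, a, s_next] += 1
            self.reward_sum[s, a] += r
            if self.count[s, a] == self.m:
                self._plan()

    def _plan(self, n_iters=200):
        """Value iteration on the optimistic model built from 'known' pairs."""
        known = self.count >= self.m
        P = np.where(known[..., None],
                     self.trans_count / np.maximum(self.count[..., None], 1), 0.0)
        R = np.where(known, self.reward_sum / np.maximum(self.count, 1), self.r_max)
        V = np.max(self.Q, axis=1)
        for _ in range(n_iters):
            Q = R + self.gamma * (P @ V)
            # Unknown pairs keep the optimistic value, which drives exploration toward them.
            Q[~known] = self.r_max / (1.0 - self.gamma)
            V = Q.max(axis=1)
        self.Q = Q

    def act(self, s):
        return int(np.argmax(self.Q[s]))
```

The essential design choice shown here is optimism for under-visited state-action pairs: as long as a pair has been seen fewer than m_known times, it is assigned the maximal value R_max/(1 - gamma), so the greedy policy is pushed to visit it. The paper's contribution concerns how a modified version of this scheme achieves the improved O(N log N) dependence on the number of states stated in the abstract; that modification is not reproduced in this sketch.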
