Eligibility Traces for Off-Policy Policy Evaluation

Authors: Doina Precup, Satinder P. Singh, Richard S. Sutton

Abstract: Eligibility traces have been shown to speed reinforcement learning, to make it more robust to hidden states, and to provide a link between Monte Carlo and temporal-difference methods. Here we generalize eligibility traces to off-policy learning, in which one learns about a policy different from the one that generates the data. Off-policy methods can greatly multiply learning, as many policies can be learned about from the same data stream, and they have been identified as particularly useful for learning about subgoals and temporally extended macro-actions. In this paper we consider the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method. We analyze and compare this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling. Our main results are 1) to establish the consistency and bias properties of the new methods, and 2) to empirically rank the new methods, showing improvement over one-step and Monte Carlo methods. Our results are restricted to model-free, table-lookup methods with offline updating (at the end of each episode), although several of the algorithms could be applied more generally.
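
The setting the abstract describes (model-free, table-lookup, offline updating at the end of each episode) can be illustrated with ordinary per-episode importance sampling, the classical technique the paper's algorithms build on. The Python sketch below is not one of the paper's five algorithms, only a minimal baseline under assumed conventions: the two-state chain, the policies, and every name in it are illustrative.

import random
from collections import defaultdict

# Minimal sketch of off-policy Monte Carlo policy evaluation with ordinary
# (per-episode) importance sampling, in the setting the abstract describes:
# model-free, table-lookup, offline updates at the end of each episode.
# The two-state chain and all names here are illustrative, not from the paper.

GAMMA = 0.9
ACTIONS = ["left", "right"]

def behavior_policy(s):
    # Behavior policy b: chooses actions uniformly at random.
    return {"left": 0.5, "right": 0.5}

def target_policy(s):
    # Target policy pi (the policy we want to evaluate): prefers "right".
    return {"left": 0.1, "right": 0.9}

def step(s, a):
    # Toy dynamics: "right" terminates with reward +1, "left" stays in s=0.
    if a == "right":
        return None, 1.0, True
    return 0, 0.0, False

def generate_episode():
    # Sample one trajectory of (state, action, reward) tuples under b.
    s, traj, done = 0, [], False
    while not done:
        probs = behavior_policy(s)
        a = random.choices(ACTIONS, weights=[probs[x] for x in ACTIONS])[0]
        s2, r, done = step(s, a)
        traj.append((s, a, r))
        s = s2
    return traj

def evaluate(num_episodes=20000):
    # First-visit Monte Carlo estimate of v_pi from data generated by b.
    returns_sum = defaultdict(float)
    visits = defaultdict(int)
    for _ in range(num_episodes):
        traj = generate_episode()
        T = len(traj)
        # Suffix importance-sampling ratios
        #   rho[t] = prod_{k=t}^{T-1} pi(a_k|s_k) / b(a_k|s_k)
        # and suffix returns G[t] = r_t + GAMMA * G[t+1].
        rho = [1.0] * (T + 1)
        G = [0.0] * (T + 1)
        for t in reversed(range(T)):
            s, a, r = traj[t]
            rho[t] = rho[t + 1] * target_policy(s)[a] / behavior_policy(s)[a]
            G[t] = r + GAMMA * G[t + 1]
        # Offline update at the end of the episode (first visits only).
        seen = set()
        for t, (s, a, r) in enumerate(traj):
            if s in seen:
                continue
            seen.add(s)
            returns_sum[s] += rho[t] * G[t]
            visits[s] += 1
    return {s: returns_sum[s] / visits[s] for s in visits}

if __name__ == "__main__":
    # Under pi, v(0) solves v = 0.9 + 0.1 * GAMMA * v, i.e. about 0.989.
    print(evaluate())

Because the correction rho multiplies one ratio per remaining step, its variance grows quickly with episode length; the eligibility-trace algorithms analyzed in the paper target exactly this bias-variance tradeoff.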
