Authors: Doina Precup, Satinder P. Singh, Richard S. Sutton
DOI:
Keywords:
Abstract: Eligibility traces have been shown to speed reinforcement learning, to make it more robust to hidden states, and to provide a link between Monte Carlo and temporal-difference methods. Here we generalize eligibility traces to off-policy learning, in which one learns about a policy different from the policy that generates the data. Off-policy methods can greatly multiply learning, as many policies can be learned about from the same data stream, and they have been identified as particularly useful for learning about subgoals and temporally extended macro-actions. In this paper we consider the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method. We analyze and compare this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling. Our main results are 1) to establish the consistency and bias properties of the new methods and 2) to empirically rank the new methods, showing improvement over the one-step and Monte Carlo methods. Our results are restricted to model-free, table-lookup methods with offline updating (at the end of each episode), although several of the algorithms could be applied more generally.
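As context for the abstract, the sketch below illustrates the Monte Carlo baseline it mentions: off-policy policy evaluation with ordinary importance sampling, table-lookup value estimates, and offline updates at the end of each episode. It is not the paper's new eligibility trace algorithms; the environment interface (`env.reset()`, `env.step()`) and the two policies (`behavior_policy`, `target_policy`, each mapping a state to a dict of action probabilities) are assumed for illustration.

```python
import random
from collections import defaultdict


def sample_action(action_probs):
    """Draw an action from a dict mapping actions to probabilities (assumed interface)."""
    actions = list(action_probs)
    weights = [action_probs[a] for a in actions]
    return random.choices(actions, weights=weights)[0]


def off_policy_mc_evaluation(env, behavior_policy, target_policy,
                             num_episodes=1000, gamma=0.9):
    """Estimate the target policy's state values from data generated by the behavior policy."""
    values = defaultdict(float)   # table-lookup state-value estimates
    counts = defaultdict(int)     # number of weighted returns averaged per state

    for _ in range(num_episodes):
        # Generate one episode by following the behavior policy.
        episode = []  # list of (state, reward, importance-sampling ratio)
        state, done = env.reset(), False
        while not done:
            probs = behavior_policy(state)
            action = sample_action(probs)
            next_state, reward, done = env.step(action)
            rho = target_policy(state)[action] / probs[action]
            episode.append((state, reward, rho))
            state = next_state

        # Offline update at the end of the episode: walk backward, accumulating
        # the discounted return G and the product W of importance-sampling ratios
        # from each time step to the episode's end.
        G, W = 0.0, 1.0
        for state, reward, rho in reversed(episode):
            G = reward + gamma * G
            W *= rho
            counts[state] += 1
            # Incremental (ordinary importance sampling) average of W * G.
            values[state] += (W * G - values[state]) / counts[state]

    return values
```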