Emphatic Temporal-Difference Learning.

Authors: Ashique Rupam Mahmood, Richard S. Sutton, Martha White, Huizhen Yu

DOI:

Keywords: Function approximation, Temporal difference learning, Artificial intelligence, Flexibility, Computer science, Discounting, Linear function, Bootstrapping

Abstract: Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps. Recent works by Sutton, Mahmood, and White (2015) and by Yu (2015) show that, by varying the emphasis in a particular way, these algorithms become stable and convergent under off-policy training with linear function approximation. This paper serves as a unified summary of the available results from both works. In addition, we demonstrate the empirical benefits of the flexibility of emphatic algorithms, including state-dependent discounting, state-dependent bootstrapping, and user-specified allocation of function approximation resources.
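To make the abstract's description concrete, here is a minimal sketch of off-policy emphatic TD(lambda) with linear function approximation, following the update rules in Sutton, Mahmood, and White (2015). The trajectory format, function name, and default step size below are our own illustrative assumptions, not the paper's code or API.

```python
import numpy as np

def emphatic_td_lambda(trajectory, n_features, alpha=0.01):
    """Off-policy emphatic TD(lambda) with linear function approximation.

    A minimal sketch; the per-step tuple format is an assumption:
        (phi, gamma, lam, interest, rho, reward, phi_next, gamma_next)
    where phi is the current state's feature vector; gamma, lam, and
    interest are the (possibly state-dependent) discount, bootstrapping,
    and interest parameters; and rho = pi(A|S) / mu(A|S) is the
    importance-sampling ratio for the action actually taken.
    """
    theta = np.zeros(n_features)  # weights of the linear value estimate
    e = np.zeros(n_features)      # eligibility trace
    F = 0.0                       # followon trace
    rho_prev = 0.0                # so that F_0 = I_0 on the first step

    for phi, gamma, lam, interest, rho, reward, phi_next, gamma_next in trajectory:
        # Followon trace: discounted, importance-weighted accumulated interest.
        F = rho_prev * gamma * F + interest
        # Emphasis: how strongly this time step's update is weighted.
        M = lam * interest + (1.0 - lam) * F
        # Eligibility trace with the emphasis folded in.
        e = rho * (gamma * lam * e + M * phi)
        # Standard TD error under the linear value estimate.
        delta = reward + gamma_next * theta @ phi_next - theta @ phi
        theta = theta + alpha * delta * e
        rho_prev = rho
    return theta
```

Note how the state-dependent gamma, lam, and interest inputs reflect the flexibility discussed in the abstract: discounting, bootstrapping, and the allocation of approximation resources can all vary per state rather than being fixed constants.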

References (3)
Doina Precup, Satinder P. Singh, Richard S. Sutton. Eligibility Traces for Off-Policy Policy Evaluation. International Conference on Machine Learning, pp. 759-766 (2000)
John N. Tsitsiklis, Dimitri P. Bertsekas. Neuro-Dynamic Programming (1996)
Richard S. Sutton, Andrew G. Barto. Reinforcement Learning: An Introduction (1998)