An emphatic approach to the problem of off-policy temporal-difference learning

Authors: Richard S. Sutton, A. Rupam Mahmood, Martha White

DOI: 10.5555/2946645.3007026

Abstract: In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD(λ)'s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per-step computation linear in the number of function approximation parameters are the gradient-TD family of methods including TDC, GTD(λ), and GQ(λ). Compared to these methods, our emphatic TD(λ) is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states.
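The abstract's central object is the emphasis-weighted update. Below is a minimal Python sketch of one emphatic TD(λ) step, following the paper's recursions for the followon trace F, the emphasis M, and the eligibility trace e; the function and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def emphatic_td_step(theta, e, F, phi, phi_next, reward, rho, rho_prev,
                     gamma, gamma_next, lam, interest, alpha):
    """One emphatic TD(lambda) update with state-dependent discounting
    (gamma), bootstrapping (lam), and interest, per the paper's recursions."""
    # TD error under the current linear value estimates.
    delta = reward + gamma_next * (theta @ phi_next) - theta @ phi
    # Followon trace: discounted, importance-weighted accumulation of interest.
    F = rho_prev * gamma * F + interest
    # Emphasis: blends the immediate interest with the followon trace.
    M = lam * interest + (1.0 - lam) * F
    # Emphasis-weighted eligibility trace, importance-corrected by rho.
    e = rho * (gamma * lam * e + M * phi)
    # A single learned parameter vector and a single step-size parameter.
    theta = theta + alpha * delta * e
    return theta, e, F
```

A driver loop would initialize e and F to zero before the first call and recompute rho at each step as the ratio of target to behavior policy probabilities; with rho ≡ 1, interest ≡ 1, and lam = 1 this reduces to the conventional accumulating-trace TD(1) update.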
