Gradient Temporal-Difference Learning with Regularized Corrections

Authors: Adam White, Martha White, Sina Ghiassian, Andrew Patterson, Shivam Garg

Abstract: It is still common to use Q-learning and temporal difference (TD) learning, even though they have divergence issues and sound Gradient TD alternatives exist, because divergence seems rare and they typically perform well. However, recent work with large neural network learning systems reveals that instability is more common than previously thought. Practitioners face a difficult dilemma: choose an easy, performant method, or a more complex algorithm that is sound but harder to tune and largely unexplored for non-linear function approximation and control. In this paper, we introduce a new method called TD with Regularized Corrections (TDRC), which attempts to balance ease of use, soundness, and performance. It behaves as well as TD when TD performs well, and remains sound in cases where TD diverges. We empirically investigate TDRC across a range of problems, for both prediction and control with linear function approximation, and show, potentially for the first time, that gradient methods could be a better alternative to Q-learning.
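The abstract names the TDRC update but does not state it. The snippet below is a minimal, illustrative sketch of what a linear TDRC-style prediction step could look like, assuming TDC's two-timescale update with an added L2 regularizer pulling the secondary weights toward zero. The function name, step sizes, and the regularization strength beta are assumptions made for illustration, not specifications taken from this excerpt.

```python
import numpy as np

def tdrc_update(theta, h, x, x_next, reward, gamma=0.99, alpha=0.005, beta=1.0):
    """One hypothetical TDRC-style update for linear value prediction.

    theta  -- primary weight vector (value-function parameters)
    h      -- secondary weight vector (estimate of the expected TD error)
    x      -- feature vector of the current state
    x_next -- feature vector of the next state
    """
    # TD error for the linear value estimate.
    delta = reward + gamma * (theta @ x_next) - (theta @ x)
    # Primary weights: TD step plus the gradient-correction term (as in TDC).
    theta = theta + alpha * (delta * x - gamma * (h @ x) * x_next)
    # Secondary weights: TDC's update plus an L2 pull toward zero
    # (the "regularized correction"); beta = 1.0 is an assumed default.
    h = h + alpha * ((delta - h @ x) * x - beta * h)
    return theta, h

# Illustrative usage on random features (not a real environment).
rng = np.random.default_rng(0)
theta, h = np.zeros(8), np.zeros(8)
for _ in range(100):
    x, x_next = rng.normal(size=8), rng.normal(size=8)
    theta, h = tdrc_update(theta, h, x, x_next, reward=rng.normal())
```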
