Addressing Function Approximation Error in Actor-Critic Methods

Authors: Scott Fujimoto, Herke van Hoof, David Meger

Abstract: In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic. Our algorithm builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation. We draw the connection between target networks and overestimation bias, and suggest delaying policy updates to reduce per-update error and further improve performance. We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested.
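As a rough illustration of the two mechanisms the abstract describes (taking the minimum over a pair of critics, and delaying policy updates), below is a minimal Python sketch of the resulting Bellman target. All names here (clipped_double_q_target, actor_target, q1_target, q2_target) are hypothetical stand-ins, not the authors' reference implementation.

```python
def clipped_double_q_target(reward, next_state, done,
                            actor_target, q1_target, q2_target,
                            gamma=0.99):
    """Bellman target using the minimum over a pair of target critics
    (hypothetical sketch of the mechanism described in the abstract)."""
    # Action proposed by the target actor; in the full algorithm the actor
    # and target networks are refreshed only every d critic updates
    # (the delayed policy updates mentioned in the abstract).
    next_action = actor_target(next_state)
    # Take the minimum of the two target critics to limit overestimation.
    target_q = min(q1_target(next_state, next_action),
                   q2_target(next_state, next_action))
    # Standard discounted backup; `done` masks bootstrapping at terminals.
    return reward + gamma * (1.0 - done) * target_q


# Toy usage with scalar stand-in networks: the second critic is
# systematically higher, and the minimum discards its overestimate.
actor = lambda s: 0.5 * s
q1 = lambda s, a: s + a
q2 = lambda s, a: s + a + 0.3
y = clipped_double_q_target(reward=1.0, next_state=2.0, done=0.0,
                            actor_target=actor, q1_target=q1, q2_target=q2)
print(y)  # 1.0 + 0.99 * min(3.0, 3.3) = 3.97
```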
