Natural actor-critic algorithms

Authors: Shalabh Bhatnagar, Richard S. Sutton, Mohammad Ghavamzadeh, Mark Lee

DOI: 10.1016/J.AUTOMATICA.2009.07.008

Keywords: Mathematics, Algorithm, Temporal difference learning, Mathematical optimization, Reinforcement learning, Bellman equation, Gradient method, Gradient descent, Function approximation, Stochastic gradient descent, Stochastic approximation

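Abstract: We present four new reinforcement learning algorithms based on actor-critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function-approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of special interest in many applications because it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.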
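The division of labor described in the abstract (a critic trained by temporal difference learning, an actor updated by stochastic gradient steps) can be illustrated with a minimal sketch. This is not one of the paper's four algorithms: it assumes a tabular state space, a softmax policy, a discounted return with factor `gamma`, and a plain (not natural) gradient. The names `softmax`, `actor_critic_step`, `run_toy`, and the step sizes `alpha` and `beta` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, v, s, a, r, s_next,
                      alpha=0.1, beta=0.01, gamma=0.9):
    """One fully incremental update (illustrative assumptions throughout).

    theta[s][a]: actor's action preferences, v[s]: critic's state values.
    The critic step size alpha is larger than the actor step size beta,
    loosely mirroring a two-timescale arrangement.
    """
    # TD error computed from the critic's current estimates.
    delta = r + gamma * v[s_next] - v[s]
    # Critic: TD(0) update of the value estimate.
    v[s] += alpha * delta
    # Actor: stochastic gradient step using delta * grad log pi(a|s).
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += beta * delta * grad_log
    return delta


def run_toy(steps=2000, seed=0):
    """Hypothetical one-state task: action 1 pays reward 1, action 0 pays 0."""
    rng = random.Random(seed)
    theta = [[0.0, 0.0]]
    v = [0.0]
    for _ in range(steps):
        probs = softmax(theta[0])
        a = 0 if rng.random() < probs[0] else 1
        r = 1.0 if a == 1 else 0.0
        actor_critic_step(theta, v, 0, a, r, 0)
    return softmax(theta[0])
```

Calling `run_toy()` returns the learned action probabilities for the single state; the rewarded action should end up favored. A natural-gradient variant would additionally precondition the actor step with an estimate of the inverse Fisher information matrix, which this sketch omits.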

