Authors: Shalabh Bhatnagar, Richard S. Sutton, Mohammad Ghavamzadeh, Mark Lee
DOI: 10.1016/J.AUTOMATICA.2009.07.008
Keywords: Mathematics, Algorithm, Temporal difference learning, Mathematical optimization, Reinforcement learning, Bellman equation, Gradient method, Gradient descent, Function approximation, Stochastic gradient descent, Stochastic approximation
Abstract: We present four new reinforcement learning algorithms based on actor-critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor-critic methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function-approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of special interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.
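To make the actor-critic structure described in the abstract concrete, the following is a minimal sketch of a one-step actor-critic update: a tabular critic trained by TD(0) and a softmax actor updated by stochastic gradient ascent using the TD error as the advantage signal. This is a plain (vanilla-gradient) illustration, not the paper's natural-gradient algorithms; the two-state MDP, its dynamics, and all variable names are made up for the example.

```python
# Hedged sketch of a generic one-step actor-critic loop (not the paper's algorithms).
# Critic: TD(0) value estimates. Actor: softmax policy, policy-gradient step on the TD error.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 2, 2
gamma = 0.95
alpha_v, alpha_pi = 0.1, 0.01              # critic / actor step sizes (critic on the faster timescale)

# Made-up dynamics: in state 0, action 1 tends to reach the rewarding state 1.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[0.0, 0.0], [1.0, 1.0]])     # R[s, a] expected reward

v = np.zeros(n_states)                     # critic: tabular value estimates
theta = np.zeros((n_states, n_actions))    # actor: softmax policy parameters

def policy(s):
    prefs = theta[s] - theta[s].max()      # shift for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

s = 0
for step in range(20000):
    p = policy(s)
    a = rng.choice(n_actions, p=p)
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]

    # TD error drives both the critic update and the actor's gradient step.
    delta = r + gamma * v[s_next] - v[s]
    v[s] += alpha_v * delta                     # critic: TD(0) update
    grad_log = -p
    grad_log[a] += 1.0                          # gradient of log pi(a|s) w.r.t. theta[s]
    theta[s] += alpha_pi * delta * grad_log     # actor: stochastic gradient ascent
    s = s_next

print("learned values:", np.round(v, 2))
print("policy in state 0:", np.round(policy(0), 2))
```

The separate step sizes alpha_v > alpha_pi reflect the two-timescale idea the abstract refers to: the critic adapts quickly relative to the slowly changing actor, which is what the paper's convergence analysis formalizes.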