TD(λ) Converges with Probability 1

Authors: Peter Dayan, Terrence J. Sejnowski

DOI: 10.1023/A:1022657612745

Abstract: The methods of temporal differences (Samuel, 1959; Sutton, 1984, 1988) allow an agent to learn accurate predictions of stationary stochastic future outcomes. The learning is effectively stochastic approximation based on samples extracted from the process generating the agent's future. Sutton (1988) proved that for a special case of temporal differences, the expected values of the predictions converge to their correct values as larger samples are taken, and Dayan (1992) extended his proof to the general case. This article proves the stronger result that for a slightly modified form of temporal difference learning the predictions converge with probability one, and shows how to quantify the rate of convergence.
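Since the abstract frames TD(λ) as stochastic approximation in the Robbins-Monro sense, a small simulation helps make the claim concrete. The sketch below runs tabular TD(λ) with accumulating eligibility traces on Sutton's (1988) five-state random walk. It is an illustrative sketch, not the paper's construction: the decaying step-size schedule is an assumed choice that satisfies the usual Robbins-Monro conditions, and the fully online update shown here is the standard textbook form rather than the slightly modified variant the paper analyses.

import numpy as np

rng = np.random.default_rng(0)

N_STATES = 5                                   # non-terminal states of the random walk
TRUE_VALUES = np.arange(1, N_STATES + 1) / (N_STATES + 1)   # analytic solution
LAM, GAMMA = 0.8, 1.0                          # trace decay and (no) discounting

V = np.full(N_STATES, 0.5)                     # prediction vector, initialised at 0.5
t = 0                                          # global update counter for the step size

for episode in range(5000):
    e = np.zeros(N_STATES)                     # eligibility traces, cleared each episode
    s = N_STATES // 2                          # every episode starts in the centre state
    while True:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        if s_next < 0:                         # left terminal: outcome 0
            reward, v_next, done = 0.0, 0.0, True
        elif s_next >= N_STATES:               # right terminal: outcome 1
            reward, v_next, done = 1.0, 0.0, True
        else:
            reward, v_next, done = 0.0, V[s_next], False

        delta = reward + GAMMA * v_next - V[s] # temporal-difference error
        e[s] += 1.0                            # accumulating trace for the visited state
        t += 1
        alpha = 100.0 / (100.0 + t)            # assumed Robbins-Monro schedule:
                                               # sum(alpha) diverges, sum(alpha^2) converges
        V += alpha * delta * e                 # propagate the error along the traces
        e *= GAMMA * LAM                       # exponentially decay the traces
        if done:
            break
        s = s_next

print("learned predictions:", np.round(V, 3))
print("true values        :", np.round(TRUE_VALUES, 3))

Under conditions like these the predictions approach the true values 1/6, ..., 5/6 with probability one; the constant in the step-size schedule affects only the rate of convergence, which the article shows how to quantify.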

References (15)
Christopher J.C.H. Watkins, Peter Dayan, "Technical Note: Q-Learning", Machine Learning, vol. 8, pp. 279-292 (1992). DOI: 10.1023/A:1022676722315
Albert Benveniste, Michel Métivier, Pierre Priouret, "Adaptive Algorithms and Stochastic Approximations" (1990).
Richard S. Sutton, "Temporal Credit Assignment in Reinforcement Learning", PhD thesis, University of Massachusetts Amherst (1984).
Herbert Robbins, Sutton Monro, "A Stochastic Approximation Method", Annals of Mathematical Statistics, vol. 22, pp. 400-407 (1951). DOI: 10.1214/AOMS/1177729586
Stuart Geman, Elie Bienenstock, René Doursat, "Neural Networks and the Bias/Variance Dilemma", Neural Computation, vol. 4, pp. 1-58 (1992). DOI: 10.1162/NECO.1992.4.1.1
Richard S. Sutton, "Learning to Predict by the Methods of Temporal Differences", Machine Learning, vol. 3, pp. 9-44 (1988). DOI: 10.1023/A:1022633531479