TD_gamma: Re-evaluating Complex Backups in Temporal Difference Learning

Authors: George Konidaris, Scott Niekum, Philip S. Thomas

DOI:

Keywords: Estimator, Maximum likelihood, Algorithm, Benchmark (computing), Mathematical optimization, Mathematics, Variance (accounting), Spacetime, Specific model, Temporal difference learning

Abstract: We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an n-step return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. We derive TDγ, the γ-return equivalent of the original TD(λ) algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TDγ(C), with a capacity parameter C. TDγ(C) requires C times more time and memory than TD(λ) and is incremental and online. TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and TDγ(C) performs as well as or better than TDγ for intermediate settings of C.
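For reference, the λ-return discussed in the abstract is the standard exponentially weighted average of n-step returns; the definition below follows the usual TD(λ) formulation (e.g., Sutton and Barto) and is not taken from this record:

\[
G_t^{(n)} = \sum_{k=1}^{n} \gamma^{k-1} r_{t+k} + \gamma^{n} V(s_{t+n}), \qquad
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}.
\]

A minimal offline sketch of computing λ-return targets for a completed episode, using the standard backward recursion G^λ_t = r_{t+1} + γ[(1−λ)V(s_{t+1}) + λG^λ_{t+1}]; the function name and array layout are illustrative assumptions, not the paper's code:

    import numpy as np

    def lambda_return_targets(rewards, values, gamma, lam):
        # Illustrative sketch only; names and layout are assumptions.
        # rewards[t] is r_{t+1}; values[t] is V(s_t), with values[T] the
        # terminal state's value (usually 0 for episodic tasks).
        T = len(rewards)
        targets = np.empty(T)
        g = values[T]  # G^lambda_T reduces to V(s_T) at episode end
        for t in reversed(range(T)):
            # G^lambda_t = r_{t+1} + gamma * [(1-lam)*V(s_{t+1}) + lam*G^lambda_{t+1}]
            g = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * g)
            targets[t] = g
        return targets

The paper's contribution, per the abstract, is to replace the geometric weights (1−λ)λ^{n−1} with weights derived from a more accurate model of how n-step return variance grows with n; the resulting γ-return weighting is given in the paper itself.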

References (2)
Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, Eric Wiewiora, Fast gradient-descent methods for temporal-difference learning with linear function approximation, Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), pp. 993–1000 (2009), DOI: 10.1145/1553374.1553501
Richard S. Sutton, Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press (1998)