Authors: George Konidaris, Scott Niekum, Philip S. Thomas
DOI:
Keywords: Estimator, Maximum likelihood, Algorithm, Benchmark (computing), Mathematical optimization, Mathematics, Variance (accounting), Spacetime, Specific model, Temporal difference learning
Abstract: We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an n-step return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. We derive TDγ, the γ-return equivalent of the original algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TDγ(C), with a capacity parameter C. TDγ(C) requires C times more memory than TD(λ) but can be implemented incrementally and online. TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and TDγ(C) performs as well as or better than TDγ for intermediate settings of C.
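For background (a standard definition from the temporal difference learning literature, not quoted from this record): for an episode terminating at time T, the λ-return from time t is a geometrically weighted mixture of n-step returns,

\[
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1}\, G_t^{(n)} + \lambda^{T-t-1}\, G_t,
\qquad
G_t^{(n)} = \sum_{i=1}^{n} \gamma^{i-1} r_{t+i} + \gamma^{n}\, \hat{V}(s_{t+n}).
\]

Since the maximum likelihood estimate of a common mean from independent Gaussian estimates is their inverse-variance weighted average, the abstract's claim can be read as follows: the geometric weights (1-λ)λ^{n-1} are the normalized inverse variances under one particular model of how the variance of G_t^{(n)} grows with n, and the γ-return reweights the same n-step returns using a more accurate variance model.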