Authors: Doina Precup, Richard S. Sutton, Sanjoy Dasgupta
Abstract: We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis of popular reinforcement learning methods such as Q-learning, which has been known to diverge with function approximation, and because it is critical to the practical utility of multi-scale, multi-goal learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(λ) over state–action pairs with importance sampling ideas from our previous work. We prove that, given training under any ε-soft policy, the algorithm converges w.p.1 to a close approximation (as in Tsitsiklis & Van Roy, 1997; Tadic, 2001) to the action-value function of an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We illustrate the method empirically on a small policy evaluation problem. Our current results are limited to episodic tasks with episodes of bounded length.

(Footnote: This is a re-typeset version of an article published in the Proceedings of the 18th International Conference on Machine Learning (2001). It differs from the original in line and page breaks, is crisper for electronic viewing, and has this funny footnote, but is otherwise identical to the original article.)

Although Q-learning remains the most popular of all reinforcement learning algorithms, it has been known since about 1996 to be unsound with linear function approximation (see Gordon, 1995; Bertsekas & Tsitsiklis, 1996). The most telling counterexample, due to Baird (1995), is a seven-state Markov decision process with linearly independent feature vectors, for which an exact solution exists, yet for which the approximate values found by Q-learning diverge to infinity. This problem prompted the development of residual-gradient methods (Baird, 1995), which are much slower, and of fitted value iteration (Gordon, 1995, 1999), which is restricted to weaker-than-linear approximators. Of course, Q-learning has been used with linear function approximation since its invention (Watkins, 1989), often with good results, but the soundness of the approach is no longer an open question: there exist non-pathological Markov decision processes for which it diverges, so it is absolutely unsound in this sense. A sensible response is to turn to some of the other reinforcement learning methods, such as Sarsa, as an efficient possibility.

An important distinction here is between methods that must follow the policy they are learning about, called on-policy methods, and those that can learn from behavior generated by a different policy, called off-policy methods. Q-learning is an off-policy method: it learns the optimal policy even when actions are selected according to a more exploratory or random policy, and it requires only that all actions be tried in all states, whereas on-policy methods like Sarsa require that actions be selected with specific probabilities. Although the off-policy capability is appealing, it is also the source of at least part of the instability problems. For example, consider the one-step linear TD algorithm that underlies both Q-learning and Sarsa, applied in Baird's counterexample to approximating Qπ. Operating in on-policy mode, updating state–action pairs according to the same distribution that would be experienced under π, the method is convergent to near the best possible approximation (Tsitsiklis & Van Roy, 1997; Tadic, 2001). However, if state–action pairs are updated according to a different distribution, say the one generated by following the greedy policy, then the estimated values again diverge to infinity. This and related counterexamples suggest that off-policy updating is the reason for the instability of the method; they also make clear that the problem can be studied in a purely policy-evaluation context.

Despite these problems, there is substantial interest in off-policy learning. Several researchers have argued that it is key to ambitious extensions of reinforcement learning into modular, hierarchical architectures (Sutton, Precup & Singh, 1999; Parr, 1998; Parr & Russell, 1998; Dietterich, 2000). These architectures rely on learning about multiple subgoals and multiple ways of behaving from a single stream of experience. For such approaches to be feasible, a way of reliably combining off-policy learning and function approximation must be found. Because the problems with off-policy learning already become apparent in a policy-evaluation setting, we focus on that setting in this paper.

In previous work we considered multi-step off-policy learning in the tabular case. This paper gives a consistent mathematical treatment that focuses on the linear function approximation case and, in fact, on a single episode. Given the starting state and action, we show that the expected off-policy update is the same as that of conventional TD(λ). This, together with additional conditions, allows us to prove convergence and to bound the error of the asymptotic solution, as obtained by Tsitsiklis and Van Roy (1997) and Tadic (2001).
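To make the on-policy/off-policy distinction above concrete, here is a minimal, illustrative sketch (not taken from the paper) of off-policy estimation by ordinary importance sampling in a one-step, two-action setting: samples generated under a behavior policy b are re-weighted by the ratio π(a)/b(a) to estimate an expectation under the target policy π. The specific policies, rewards, and variable names are invented for illustration.

import numpy as np

# Hypothetical two-action, one-step illustration: estimate the expected
# reward under a target policy pi while actions are chosen by a behavior
# policy b, by re-weighting each sample with the ratio pi(a)/b(a).
rng = np.random.default_rng(0)

pi = np.array([0.9, 0.1])            # target policy (action probabilities)
b = np.array([0.5, 0.5])             # behavior policy used to generate data
true_reward = np.array([1.0, 0.0])   # mean reward of each action (made up)

n = 10_000
actions = rng.choice(2, size=n, p=b)
rewards = true_reward[actions] + rng.normal(0.0, 0.1, size=n)

# Ordinary importance-sampling estimate of E_pi[r]
rho = pi[actions] / b[actions]
estimate = np.mean(rho * rewards)
print(estimate)                      # close to 0.9, despite data coming from b

The same per-step ratio π/b reappears in the eligibility-trace sketch given after the notation of Section 1.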
1. Notation and Main Result

We consider the standard reinforcement learning framework (see, e.g., Sutton & Barto, 1998) in which a learning agent interacts with a Markov decision process (MDP). Our notation for an episode of T time steps is s0, a0, r1, s1, a1, r2, ..., rT, sT, where the states st ∈ S, the actions at ∈ A, and the rewards rt ∈ ℝ. We take the initial state and action, s0 and a0, to be chosen arbitrarily. Given st and at, the next reward, rt+1, is a random variable whose mean depends on st and at, and the next state, st+1, is chosen with probabilities p^{at}_{st,st+1}. The final state, sT, is a special terminal state that may not occur on any preceding step. The actions at, 0 < t < T, are selected with probability π(st, at) or b(st, at), depending on whether policy π or policy b is in force. We always use π to denote the target policy, the policy we are learning about; episodes are instead generated by the behavior policy b. In either case, we seek the action-value function Qπ : S × A → ℝ of the target policy π:

    Qπ(s, a) = Eπ{ rt+1 + γ rt+2 + ··· + γ^{T−t−1} rT | st = s, at = a },

where 0 ≤ γ ≤ 1 is the discount-rate parameter. The approximations we consider are linear in a set of feature vectors {φsa}, one for each pair s ∈ S, a ∈ A:

    Qπ(s, a) ≈ θ⊤φsa = Σ_{i=1}^{n} θ(i) φsa(i),

where θ ∈ ℝⁿ is the vector of weights to be learned.
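As a rough illustration of how the linear approximation above might be combined with TD(λ) over state–action pairs and importance sampling (the combination described in the abstract), the following Python sketch weights an accumulating eligibility trace by the per-decision ratio π(s, a)/b(s, a). The function signature, the trace-weighting scheme, and all names here are assumptions made for illustration; this is not the paper's algorithm or its convergence-preserving weighting.

import numpy as np

def offpolicy_td_lambda_episode(episode, phi, theta, pi, b,
                                gamma=0.9, lam=0.8, alpha=0.01):
    """One-episode sketch: TD(lambda) over state-action pairs with
    per-decision importance-sampling ratios pi/b.

    episode: list of (s, a, r_next, s_next, a_next); s_next is None at the end
    phi:     feature function, phi(s, a) -> np.ndarray of length n
    theta:   weight vector; Q(s, a) is approximated by theta @ phi(s, a)
    pi, b:   target and behavior policies, pi(s, a) -> probability
    """
    e = np.zeros_like(theta)                     # eligibility trace
    for (s, a, r, s_next, a_next) in episode:
        rho = pi(s, a) / b(s, a)                 # importance-sampling ratio
        q = theta @ phi(s, a)
        q_next = 0.0 if s_next is None else theta @ phi(s_next, a_next)
        delta = r + gamma * q_next - q           # TD error
        e = rho * (gamma * lam * e + phi(s, a))  # weighted accumulating trace
        theta = theta + alpha * delta * e
    return theta

With pi = b (so that every ratio is 1), this update reduces to ordinary on-policy linear Sarsa(λ) with accumulating traces, which makes the off-policy correction easy to isolate when experimenting with the sketch.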