Authors: Doina Precup, Richard S. Sutton, Sanjoy Dasgupta
Abstract: We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis of popular reinforcement learning methods such as Q-learning, which has been known to diverge with function approximation, and because it is critical to the practical utility of multi-scale, multi-goal learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(λ) over state–action pairs with importance sampling ideas from our previous work. We prove that, given training under any ε-soft policy, the algorithm converges w.p.1 to a close approximation (as in Tsitsiklis & Van Roy, 1997; Tadic, 2001) to the action-value function of an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We illustrate the method empirically on a small policy evaluation problem. Our current results are limited to episodic tasks with episodes of bounded length.

(Footnote: This is a re-typeset version of an article published in the Proceedings of the 18th International Conference on Machine Learning (2001). It differs from the original in line and page breaks, is crisper for electronic viewing, and has this funny footnote, but is otherwise identical to the original article.)

Although Q-learning remains the most popular of all reinforcement learning algorithms, it has been known since about 1996 to be unsound with linear function approximation (see Gordon, 1995; Bertsekas & Tsitsiklis, 1996). The most telling counterexample, due to Baird (1995), is a seven-state Markov decision process with linearly independent feature vectors, for which an exact solution exists, yet for which the approximate values found by Q-learning diverge to infinity. This problem prompted the development of residual-gradient methods (Baird, 1995), which are much slower, and of fitted value iteration (Gordon, 1995, 1999), which is restricted to weaker-than-linear approximators. Of course, Q-learning has been used with linear function approximation since its invention (Watkins, 1989), often with good results, but the soundness of the approach is no longer an open question: there exist non-pathological Markov decision processes for which it diverges, so it is absolutely unsound in this sense. A sensible response is to turn to some of the other reinforcement learning methods, such as Sarsa, as an efficient possibility.

An important distinction here is between methods that must follow the policy they are learning about, called on-policy methods, and those that can learn from behavior generated by a different policy, called off-policy methods. Q-learning is an off-policy method: it learns the optimal policy even when actions are selected according to a more exploratory or random policy, and it requires only that all actions be tried in all states, whereas on-policy methods like Sarsa require that actions be selected with specific probabilities. Although the off-policy capability is appealing, it is also the source of at least part of the instability problems. For example, consider the one-step linear TD algorithm that underlies both Q-learning and Sarsa, applied in Baird's counterexample to approximating Qπ. Operating in on-policy mode, updating state–action pairs according to the same distribution that would be experienced under π, the method is convergent to near the best possible approximation (Tsitsiklis & Van Roy, 1997; Tadic, 2001). However, if state–action pairs are updated according to a different distribution, say the one generated by following the greedy policy, then the estimated values again diverge to infinity. This and related counterexamples suggest that off-policy updating is the reason for the instability of the method; they also make clear that the problem can be studied in a purely policy-evaluation context.

Despite these problems, there is substantial interest in off-policy learning. Several researchers have argued that it is key to ambitious extensions of reinforcement learning into modular, hierarchical architectures (Sutton, Precup & Singh, 1999; Parr, 1998; Parr & Russell, 1998; Dietterich, 2000). These architectures rely on learning about multiple subgoals and multiple ways of behaving from a single stream of experience. For such approaches to be feasible, a way of reliably combining off-policy learning and function approximation must be found. Because the problems with off-policy learning already become apparent in a policy-evaluation setting, we focus on that setting in this paper.

In previous work we considered multi-step off-policy learning in the tabular case. This paper gives a consistent mathematical treatment that focuses on the linear function approximation case and, in fact, on a single episode. Given the starting state and action, we show that the expected off-policy update is the same as that of conventional TD(λ). This, together with additional conditions, allows us to prove convergence and to bound the error of the asymptotic solution, as obtained by Tsitsiklis and Van Roy (1997) and Tadic (2001).
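To make the on-policy/off-policy distinction above concrete, here is a minimal, illustrative sketch (not taken from the paper) of off-policy estimation by ordinary importance sampling in a one-step, two-action setting: samples generated under a behavior policy b are re-weighted by the ratio π(a)/b(a) to estimate an expectation under the target policy π. The specific policies, rewards, and variable names are invented for illustration.

import numpy as np

# Hypothetical two-action, one-step illustration: estimate the expected
# reward under a target policy pi while actions are chosen by a behavior
# policy b, by re-weighting each sample with the ratio pi(a)/b(a).
rng = np.random.default_rng(0)

pi = np.array([0.9, 0.1])            # target policy (action probabilities)
b = np.array([0.5, 0.5])             # behavior policy used to generate data
true_reward = np.array([1.0, 0.0])   # mean reward of each action (made up)

n = 10_000
actions = rng.choice(2, size=n, p=b)
rewards = true_reward[actions] + rng.normal(0.0, 0.1, size=n)

# Ordinary importance-sampling estimate of E_pi[r]
rho = pi[actions] / b[actions]
estimate = np.mean(rho * rewards)
print(estimate)                      # close to 0.9, despite data coming from b

The same per-step ratio π/b reappears in the eligibility-trace sketch given after the notation of Section 1.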
1. Notation and Main Result

We consider the standard reinforcement learning framework (see, e.g., Sutton & Barto, 1998) in which a learning agent interacts with a Markov decision process (MDP). Our notation for an episode of T time steps is s0, a0, r1, s1, a1, r2, ..., rT, sT, where the states st ∈ S, the actions at ∈ A, and the rewards rt ∈ ℝ. We take the initial state and action, s0 and a0, to be chosen arbitrarily. Given st and at, the next reward, rt+1, is a random variable whose mean depends on st and at, and the next state, st+1, is chosen with probabilities p^{at}_{st,st+1}. The final state, sT, is a special terminal state that may not occur on any preceding step. The actions at, 0 < t < T, are selected with probability π(st, at) or b(st, at), depending on whether policy π or policy b is in force. We always use π to denote the target policy, the policy we are learning about; episodes are instead generated by the behavior policy b. In either case, we seek the action-value function Qπ : S × A → ℝ of the target policy π:

    Qπ(s, a) = Eπ{ rt+1 + γ rt+2 + ··· + γ^{T−t−1} rT | st = s, at = a },

where 0 ≤ γ ≤ 1 is the discount-rate parameter. The approximations we consider are linear in a set of feature vectors {φsa}, one for each pair s ∈ S, a ∈ A:

    Qπ(s, a) ≈ θ⊤φsa = Σ_{i=1}^{n} θ(i) φsa(i),

where θ ∈ ℝⁿ is the vector of weights to be learned.
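As a rough illustration of how the linear approximation above might be combined with TD(λ) over state–action pairs and importance sampling (the combination described in the abstract), the following Python sketch weights an accumulating eligibility trace by the per-decision ratio π(s, a)/b(s, a). The function signature, the trace-weighting scheme, and all names here are assumptions made for illustration; this is not the paper's algorithm or its convergence-preserving weighting.

import numpy as np

def offpolicy_td_lambda_episode(episode, phi, theta, pi, b,
                                gamma=0.9, lam=0.8, alpha=0.01):
    """One-episode sketch: TD(lambda) over state-action pairs with
    per-decision importance-sampling ratios pi/b.

    episode: list of (s, a, r_next, s_next, a_next); s_next is None at the end
    phi:     feature function, phi(s, a) -> np.ndarray of length n
    theta:   weight vector; Q(s, a) is approximated by theta @ phi(s, a)
    pi, b:   target and behavior policies, pi(s, a) -> probability
    """
    e = np.zeros_like(theta)                     # eligibility trace
    for (s, a, r, s_next, a_next) in episode:
        rho = pi(s, a) / b(s, a)                 # importance-sampling ratio
        q = theta @ phi(s, a)
        q_next = 0.0 if s_next is None else theta @ phi(s_next, a_next)
        delta = r + gamma * q_next - q           # TD error
        e = rho * (gamma * lam * e + phi(s, a))  # weighted accumulating trace
        theta = theta + alpha * delta * e
    return theta

With pi = b (so that every ratio is 1), this update reduces to ordinary on-policy linear Sarsa(λ) with accumulating traces, which makes the off-policy correction easy to isolate when experimenting with the sketch.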