Authors: Richard S. Sutton, A. Rupam Mahmood, Martha White
Keywords:
Abstract: In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD(λ)'s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per-step computation linear in the number of function approximation parameters are the gradient-TD family of methods including TDC, GTD(λ), and GQ(λ). Compared to these methods, our emphatic TD(λ) is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states.
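To make the abstract's description concrete, the following is a minimal Python sketch of one emphatic TD(λ) step with linear function approximation, assuming the usual follow-on-trace/emphasis formulation; the function name, argument layout, and treatment of the state-dependent discount, bootstrapping, and interest as scalar inputs are illustrative choices, not taken from the paper's text.

```python
import numpy as np

def emphatic_td_step(theta, e, F, phi, phi_next, reward,
                     rho, rho_prev, gamma, gamma_next,
                     lam, interest, alpha):
    """Sketch of one emphatic TD(lambda) update (illustrative naming).

    rho      -- importance sampling ratio pi(A_t|S_t) / mu(A_t|S_t)
    gamma    -- state-dependent discount at the current state
    lam      -- state-dependent bootstrapping parameter
    interest -- user-specified interest in accurately valuing this state
    """
    # Follow-on trace: discounted, importance-weighted accumulation of interest
    F = rho_prev * gamma * F + interest
    # Emphasis: blend of immediate interest and the follow-on trace
    M = lam * interest + (1.0 - lam) * F
    # Eligibility trace, scaled by emphasis and the importance ratio
    e = rho * (gamma * lam * e + M * phi)
    # TD error for the linear value estimate
    delta = reward + gamma_next * theta @ phi_next - theta @ phi
    # Single learned parameter vector, single step-size parameter
    theta = theta + alpha * delta * e
    return theta, e, F
```

Note how the sketch reflects the abstract's claim of simplicity relative to the gradient-TD family: there is one learned parameter vector (theta) and one step size (alpha), with the traces e and F maintained as auxiliary scalars/vectors rather than a second learned weight vector.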