Authors: Tommi Jaakkola, Michael Jordan, Satinder Singh
DOI: 10.1162/NECO.1994.6.6.1185
Keywords: Computer programming, Stochastic process, Convergence (routing), Class (set theory), Dynamic programming, Reinforcement learning, Mathematics, Markov process, Algorithm, Stochastic approximation
Abstract: Recent developments in the area of reinforcement learning have yielded a number of new algorithms for prediction and control in Markovian environments. These algorithms, including the TD(λ) algorithm of Sutton (1988) and Q-learning of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(λ) and Q-learning belong.
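To make the abstract concrete, the following is a minimal sketch of tabular Q-learning (Watkins 1989), one of the two algorithms whose convergence the paper establishes. The two-state MDP, reward values, and constants here are illustrative assumptions, not taken from the paper; also note the paper's stochastic-approximation argument requires suitably decaying step sizes, whereas this short demo uses a small constant step size for simplicity.

```python
import random

# Hypothetical two-state, two-action MDP (illustrative only):
# (state, action) -> (reward, next_state)
transitions = {
    (0, 0): (0.0, 0), (0, 1): (1.0, 1),
    (1, 0): (0.0, 0), (1, 1): (2.0, 1),
}
n_states, n_actions = 2, 2
gamma = 0.9   # discount factor
alpha = 0.1   # constant step size (the theory calls for a decaying one)
Q = [[0.0] * n_actions for _ in range(n_states)]

random.seed(0)
state = 0
for _ in range(5000):
    action = random.randrange(n_actions)  # exploratory (random) behavior policy
    reward, next_state = transitions[(state, action)]
    # Sampled Bellman-optimality backup: the DP operator that
    # Q-learning approximates stochastically, one transition at a time
    target = reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])
    state = next_state

print([[round(q, 1) for q in row] for row in Q])
```

For this toy chain the optimal action values can be computed by hand (Q*(1,1) = 20, Q*(0,1) = 19, Q*(0,0) = Q*(1,0) = 17.1), and the learned table approaches them, illustrating the convergence behavior the paper proves in general.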