Technical Note: Q-Learning

Authors: Christopher J. C. H. Watkins, Peter Dayan

DOI: 10.1007/BF00992698


Abstract: Q-learning (Watkins, 1989) is a simple way for agents to learn how to act optimally in controlled Markovian domains. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states. This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely. We also sketch extensions to the cases of non-discounted, but absorbing, Markov environments, and where many Q values can be changed each iteration, rather than just one.
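
The incremental update the abstract describes can be sketched as a tabular backup toward the one-step target r + γ max_a' Q(s', a'). The sketch below is illustrative only: the state/action sizes, reward handling, and the 1/n learning-rate schedule are assumptions chosen to satisfy the usual convergence conditions, not details taken from the paper.

```python
import numpy as np

# Illustrative tabular Q-learning sketch (sizes, rewards, and the
# learning-rate schedule are assumptions for demonstration).
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))       # action-value estimates Q(s, a)
counts = np.zeros((n_states, n_actions))  # visit counts per (state, action)
gamma = 0.9                               # discount factor

def update(s, a, r, s_next):
    """One Q-learning backup: move Q[s, a] toward r + gamma * max_a' Q[s_next, a']."""
    counts[s, a] += 1
    alpha = 1.0 / counts[s, a]            # decaying step size (sum alpha = inf, sum alpha^2 < inf)
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
```

With repeated sampling of every state-action pair and a step size decaying in this way, the estimates converge to the optimal action-values with probability 1, which is the content of the convergence theorem the paper proves.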

References (4)
M. Sato, K. Abe, H. Takeda, "Learning control of finite Markov chains with an explicit trade-off between estimation and control," IEEE Transactions on Systems, Man, and Cybernetics, vol. 18, pp. 677–684 (1988). DOI: 10.1109/21.21595
Richard E. Bellman, Stuart E. Dreyfus, Applied Dynamic Programming, Princeton University Press (1962). DOI: 10.1515/9781400874651
Harold J. Kushner, Dean S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, Applied Mathematical Sciences (1978). DOI: 10.1007/978-1-4684-9352-8