Authors: Christopher J. C. H. Watkins, Peter Dayan
DOI: 10.1007/BF00992698
Keywords:
Abstract: Q-learning (Watkins, 1989) is a simple way for agents to learn how to act optimally in controlled Markovian domains. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states. This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely. We also sketch extensions to the cases of non-discounted, but absorbing, Markov environments, and where many Q values can be changed each iteration, rather than just one.
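Illustration: the incremental improvement of action-value estimates described in the abstract can be sketched as a one-step tabular update, Q(s, a) ← (1 − α) Q(s, a) + α [r + γ max_b Q(s', b)]. The sketch below is a minimal example, not code from the paper. The two-state environment `toy_step`, the function names, the constant learning rate `alpha = 0.5`, and the discount `gamma = 0.5` are all assumptions chosen for illustration. A constant learning rate is a simplification that suffices here only because the toy environment is deterministic.

```python
import random


def q_update(Q, state, action, reward, next_state, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the new value."""
    best_next = max(Q[next_state])  # value of the best action in the next state
    target = reward + gamma * best_next
    Q[state][action] = (1 - alpha) * Q[state][action] + alpha * target
    return Q[state][action]


def toy_step(state, action):
    """Hypothetical deterministic environment with 2 states and 2 actions.

    Action 0 moves to state 0 with reward 0.
    Action 1 moves to state 1; the reward is 1 only if already in state 1.
    """
    if action == 0:
        return 0, 0.0
    return 1, (1.0 if state == 1 else 0.0)


def run_demo(steps=5000, alpha=0.5, gamma=0.5, seed=0):
    """Learn Q values while choosing actions uniformly at random.

    Random action selection ensures every action is sampled repeatedly
    in every state, and the table Q is a discrete representation.
    """
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]  # Q[state][action]
    state = 0
    for _ in range(steps):
        action = rng.randrange(2)
        next_state, reward = toy_step(state, action)
        q_update(Q, state, action, reward, next_state, alpha, gamma)
        state = next_state
    return Q


if __name__ == "__main__":
    for s, row in enumerate(run_demo()):
        print(f"state {s}: {[round(v, 4) for v in row]}")
```

For this toy problem the optimal values can be worked out by hand: Q*(1, 1) = 1 / (1 − γ) = 2, Q*(0, 1) = γ · 2 = 1, and Q*(0, 0) = Q*(1, 0) = γ · 1 = 0.5. The learned table approaches these values as every state-action pair is visited repeatedly.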