Authors: Hamid R. Maei, Richard S. Sutton, Shalabh Bhatnagar, Csaba Szepesvári
DOI:
Keywords:
Abstract: We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation.
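As a minimal illustration of the per-time-step update the abstract describes, below is a sketch of a Greedy-GQ step with linear function approximation, assuming NumPy; the function name `greedy_gq_step`, the step sizes, and the feature shapes are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def greedy_gq_step(theta, w, phi_sa, phis_next, reward,
                   gamma=0.99, alpha=0.05, beta=0.01):
    """One hypothetical Greedy-GQ update with linear features.

    theta     : (d,) main weights, Q(s, a) = theta @ phi(s, a)
    w         : (d,) secondary weights for the gradient-correction term
    phi_sa    : (d,) feature vector of the current state-action pair
    phis_next : (num_actions, d) feature vectors of all actions in s'
    reward    : scalar reward observed on the transition
    """
    # Greedy action in the next state under the current estimate:
    # the target policy is greedy w.r.t. the learned action values.
    q_next = phis_next @ theta
    a_star = int(np.argmax(q_next))
    phi_bar = phis_next[a_star]

    # TD error toward the greedy target policy.
    delta = reward + gamma * q_next[a_star] - theta @ phi_sa

    # Main update: TD step plus a gradient-correction term,
    # which is what distinguishes this family from plain Q-learning.
    theta = theta + alpha * (delta * phi_sa
                             - gamma * (w @ phi_sa) * phi_bar)

    # Secondary weights track the expected TD error.
    w = w + beta * (delta - w @ phi_sa) * phi_sa
    return theta, w

if __name__ == "__main__":
    # Tiny synthetic transition to show the call shape; every value
    # here is made up for demonstration.
    rng = np.random.default_rng(0)
    d, num_actions = 8, 4
    theta, w = np.zeros(d), np.zeros(d)
    phi_sa = rng.normal(size=d)
    phis_next = rng.normal(size=(num_actions, d))
    theta, w = greedy_gq_step(theta, w, phi_sa, phis_next, reward=1.0)
```

Note that each step costs only a few dot products and vector additions over the d features, matching the linear per-time-step complexity claimed in the abstract.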