Authors: Richard S. Sutton, Adam White, Banafsheh Rafiee, Sina Ghiassian
DOI:
Keywords:
Abstract: The ability to continually make predictions about the world may be central to intelligence. Off-policy learning and general value functions (GVFs) are well-established algorithmic techniques for learning about many signals while interacting with the world. In the past couple of years, many ambitious works have used off-policy GVF learning to improve control performance in both simulation and robotic control tasks. Many of these works use semi-gradient temporal-difference (TD) learning algorithms, like Q-learning, which are potentially divergent. In the last decade, several TD learning algorithms have been proposed that are convergent and computationally efficient, but not much is known about how they perform in practice, especially on robots. In this work, we perform an empirical comparison of modern off-policy GVF learning algorithms on three different robot platforms, providing insights into their strengths and weaknesses. We also discuss the challenges of conducting fair comparative studies of off-policy learning on robots and develop a new evaluation methodology that is successful and applicable to a relatively complicated robot domain.