Combining manual feedback with subsequent MDP reward signals for reinforcement learning

Authors: Peter Stone, W. Bradley Knox

DOI: 10.5555/1838206.1838208

Keywords:

Abstract: As learning agents move from research labs to the real world, it is increasingly important that human users, including those without programming skills, be able to teach agents desired behaviors. Recently, the TAMER framework was introduced for designing agents that can be interactively shaped by human trainers who give only positive and negative feedback signals. Past work on TAMER showed that shaping can greatly reduce the sample complexity required to learn a good policy, can enable lay users to teach agents the behaviors they desire, and can allow agents to learn within a Markov Decision Process (MDP) in the absence of a coded reward function. However, TAMER does not allow this human training to be combined with autonomous learning based on such a coded reward function. This paper leverages the fast learning exhibited within the TAMER framework to hasten a reinforcement learning (RL) algorithm's climb up the learning curve, effectively demonstrating that human reward and MDP reward can be used in conjunction with one another by an autonomous agent. We tested eight plausible TAMER+RL methods for combining a previously learned human reinforcement function, H, with MDP reward in a reinforcement learning algorithm. The paper identifies which of these methods are most effective and analyzes their strengths and weaknesses. Results from these TAMER+RL algorithms indicate better final performance and better cumulative performance than either a TAMER agent or an RL agent alone.
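The core idea described in the abstract, feeding a previously learned human-reinforcement model H into a standard RL loop alongside the MDP reward, can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the paper's implementation: it assumes a Gymnasium-style discrete environment, a callable H(s, a), and tabular Q-learning, and the particular combination shown (adding a decaying weighted H term to the MDP reward) is only one plausible TAMER+RL strategy of the kind the paper evaluates. Parameter names such as w and decay are illustrative.

import numpy as np

def tamer_rl_reward_shaping(env, H, episodes=500, alpha=0.1, gamma=0.99,
                            epsilon=0.1, w=1.0, decay=0.999):
    """Tabular Q-learning whose reward is augmented by a previously learned
    human-reinforcement model H(s, a); the influence weight w decays so that
    the MDP reward dominates as training progresses (illustrative sketch)."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection over the current Q estimates
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # augment the MDP reward with the learned human-feedback estimate
            shaped_r = r + w * H(s, a)
            target = shaped_r + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
        w *= decay  # anneal the human influence over time
    return Q

In this sketch the human model only biases exploration early on; because w is annealed toward zero, the learned policy is ultimately driven by the MDP reward, which is the general motivation behind combining the two signals rather than using either alone.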
