Authors: Peter Stone, W. Bradley Knox
Keywords:
Abstract: As learning agents move from research labs to the real world, it is increasingly important that human users, including those without programming skills, be able to teach agents desired behaviors. Recently, the tamer framework was introduced for designing agents that can be interactively shaped by human trainers who give only positive and negative feedback signals. Past work on tamer showed that shaping can greatly reduce the sample complexity required to learn a good policy, can enable lay users to teach agents the behaviors they desire, and can allow agents to learn within a Markov Decision Process (MDP) in the absence of a coded reward function. However, tamer does not allow this human training to be combined with autonomous learning based on such a coded reward function. This paper leverages the fast learning exhibited within the tamer framework to hasten a reinforcement learning (RL) algorithm's climb up the learning curve, effectively demonstrating that human reinforcement and MDP reward can be used in conjunction with one another by an autonomous agent. We tested eight plausible tamer+rl methods for combining a previously learned human reinforcement function, H, with MDP reward in a reinforcement learning algorithm. This paper identifies which of these methods are most effective and analyzes their strengths and weaknesses. Results from the tamer+rl algorithms indicate better final performance and better cumulative performance than either a tamer agent or an RL agent alone.
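The abstract does not spell out the eight tamer+rl combination methods, but one plausible family is shaping the MDP reward with the learned human reinforcement function H. The sketch below is a minimal, hedged illustration of that idea in Python, not the paper's actual algorithm: the names `h_hat`, `h_weight`, and the gym-style `env` interface (`reset()`, `step()`, `action_space`) are assumptions introduced here for illustration.

```python
import random
from collections import defaultdict

def h_hat(state, action):
    """Stand-in for the learned human reinforcement function H.
    In TAMER, H would be a regression model fit to trainer feedback;
    here it is a placeholder returning a constant."""
    return 0.0

def tamer_rl_q_learning(env, episodes=100, alpha=0.1, gamma=0.99,
                        epsilon=0.1, h_weight=1.0):
    """Q-learning whose effective reward is R(s, a) + h_weight * H(s, a).

    Assumes `env` exposes reset() -> state,
    step(action) -> (next_state, reward, done),
    and action_space as a list of discrete actions.
    """
    q = defaultdict(float)          # Q-values keyed by (state, action)
    actions = env.action_space

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection over the current Q-values.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])

            next_state, reward, done = env.step(action)

            # Combine MDP reward with the human reinforcement model:
            # one plausible tamer+rl combination, assumed for illustration.
            shaped = reward + h_weight * h_hat(state, action)

            # Standard Q-learning update applied to the shaped reward.
            best_next = max(q[(next_state, a)] for a in actions)
            target = shaped + (0.0 if done else gamma * best_next)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q
```

In this sketch, setting `h_weight` to 0 recovers plain Q-learning on the MDP reward alone, so the single weight parameter controls how strongly the previously learned H influences autonomous learning.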