Preference-based policy learning

Authors: Riad Akrour, Marc Schoenauer, Michele Sebag

DOI: 10.1007/978-3-642-23780-5_11

Keywords:

Abstract: Many machine learning approaches in robotics, based on reinforcement learning, inverse optimal control or direct policy learning, critically rely on robot simulators. This paper investigates a simulator-free approach called Preference-based Policy Learning (PPL). PPL iterates a four-step process: the robot demonstrates a candidate policy; the expert ranks this policy comparatively to other ones according to her preferences; these preferences are used to learn a policy return estimate; the robot uses this estimate to build new candidate policies, and the process is iterated until the desired behavior is obtained. PPL requires a good representation of the policy search space to be available, enabling accurate return estimates and limiting the human ranking effort needed to yield a good policy. Furthermore, this representation cannot use informed features (e.g., how far the robot is from any target) due to the simulator-free setting. As a second contribution, the paper proposes a representation based on the agnostic exploitation of the robotic log. The convergence of PPL is analytically studied and its experimental validation on two problems, involving a single robot in a maze and two interacting robots, is presented.
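The four-step loop described in the abstract lends itself to a compact sketch. The following Python fragment is a minimal, illustrative rendering of the PPL iteration, not the authors' implementation: `demonstrate`, `expert_prefers`, the linear return estimate, and the Gaussian perturbation of candidates are all hypothetical stand-ins chosen for brevity (in the paper, the return estimate is learned from the expert's rankings over a representation built from the robotic log).

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # dimension of the behavior representation (illustrative choice)

def demonstrate(policy):
    """Stand-in for executing the policy on the robot and recording its
    behavior; here the 'robotic log' is just the parameter vector itself."""
    return policy

def expert_prefers(log_a, log_b):
    """Placeholder for the human expert's pairwise preference. For this demo
    the (hidden) preference favors behaviors close to a fixed target."""
    target = np.ones(DIM)
    return np.linalg.norm(log_a - target) < np.linalg.norm(log_b - target)

def update_return_estimate(w, preferences, lr=0.1):
    """Fit a linear return estimate w from pairwise preferences with a
    perceptron-style ranking update: push w so preferred logs score higher."""
    for better, worse in preferences:
        if w @ better <= w @ worse:  # ranking constraint violated
            w = w + lr * (better - worse)
    return w

# PPL loop: demonstrate, rank, learn the return estimate, build new policies.
archive, preferences = [], []
w = np.zeros(DIM)                    # current return estimate
candidate = rng.normal(size=DIM)     # initial candidate policy

for iteration in range(30):
    log = demonstrate(candidate)
    # The expert ranks the new demonstration against previously seen ones.
    for old in archive:
        if expert_prefers(log, old):
            preferences.append((log, old))
        else:
            preferences.append((old, log))
    archive.append(log)
    w = update_return_estimate(w, preferences)
    # Build the next candidate by perturbing the best-scoring policy so far.
    best = max(archive, key=lambda x: w @ x)
    candidate = best + 0.3 * rng.normal(size=DIM)

print("estimated return of final candidate:", w @ demonstrate(candidate))
```

Note that the only feedback entering the loop is the expert's pairwise ranking, which is what makes the scheme simulator-free: no informed reward features are ever computed from the environment.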
