Authors: Riad Akrour, Marc Schoenauer, Michele Sebag
DOI: 10.1007/978-3-642-23780-5_11
Keywords:
Abstract: Many machine learning approaches in robotics, based on reinforcement learning, inverse optimal control or direct policy learning, critically rely on robot simulators. This paper investigates a simulator-free approach called Preference-based Policy Learning (PPL). PPL iterates a four-step process: the robot demonstrates a candidate policy; the expert ranks this policy comparatively to other ones according to her preferences; these preferences are used to learn a policy return estimate; the robot uses this estimate to build new policies, and the process is iterated until the desired behavior is obtained. PPL requires that a good representation of the policy search space be available, enabling accurate return estimates and limiting the human ranking effort needed to yield a good policy. Furthermore, this representation cannot use informed features (e.g., how far the robot is from any target) due to the simulator-free setting. As a second contribution, the paper proposes a representation based on the agnostic exploitation of the robotic log. The convergence of PPL is analytically studied and its experimental validation on two problems, involving a single robot in a maze and two interacting robots, is presented.
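The four-step loop described in the abstract can be sketched in code. The snippet below is a minimal illustration under several assumptions, not the authors' implementation: policies are plain parameter vectors, expert feedback is reduced to a scalar preference score, and the ranking-based return learner is replaced by a least-squares fit; `demonstrate` and `expert_score` are hypothetical callbacks supplied by the user.

```python
# Hypothetical sketch of a PPL-style loop (demonstrate -> rank -> learn return
# estimate -> build new policies). Names and details are illustrative only.
import numpy as np

def ppl_loop(demonstrate, expert_score, n_iterations=10, dim=8,
             n_candidates=50, seed=0):
    """demonstrate(policy) -> feature vector describing the demonstrated trajectory
    expert_score(features, archive) -> scalar preference score vs. past demonstrations
    """
    rng = np.random.default_rng(seed)
    archive_feats, archive_scores = [], []
    policy = rng.normal(size=dim)          # parametric policy (e.g. controller weights)

    for _ in range(n_iterations):
        feats = demonstrate(policy)                          # 1. demonstrate candidate policy
        archive_feats.append(feats)
        archive_scores.append(expert_score(feats, archive_feats))  # 2. expert ranks/scores it
        # 3. fit a linear return estimate over trajectory features
        #    (least squares as a stand-in for the paper's preference-based learner)
        w, *_ = np.linalg.lstsq(np.array(archive_feats),
                                np.array(archive_scores), rcond=None)
        # 4. generate new candidate policies and keep the one with the best
        #    estimated return (here demonstrate() is reused as a cheap proxy;
        #    a real setup would avoid extra robot runs at this step)
        candidates = policy + 0.1 * rng.normal(size=(n_candidates, dim))
        estimated = [float(w @ demonstrate(c)) for c in candidates]
        policy = candidates[int(np.argmax(estimated))]
    return policy
```

The design choice to represent expert feedback as a scalar score keeps the sketch short; the paper works with comparative rankings, which would be handled by a pairwise ranking learner instead of the least-squares step above.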