Safe Exploration in Markov Decision Processes

Authors: Pieter Abbeel, Teodor M. Moldovan

DOI:

Keywords:

Abstract: In environments with uncertain dynamics, exploration is necessary to learn how to perform well. Existing reinforcement learning algorithms provide strong exploration guarantees, but they tend to rely on an ergodicity assumption: the essence of ergodicity is that any state is eventually reachable from any other state by following a suitable policy. This assumption allows exploration algorithms to operate by simply favoring states that have rarely been visited before. For most physical systems this assumption is impractical, as the systems would break before any reasonable exploration has taken place, i.e., most physical systems do not satisfy the ergodicity assumption. In this paper we address the need for safe exploration methods in Markov decision processes. We first propose a general formulation of safety through ergodicity. We show that imposing safety by restricting attention to the resulting set of guaranteed safe policies is NP-hard. We then present an efficient algorithm for guaranteed safe, but potentially suboptimal, exploration. At the core is an optimization formulation in which the constraints restrict attention to a subset of the guaranteed safe policies and the objective favors exploration policies. Our framework is compatible with the majority of previously proposed exploration methods, which rely on an exploration bonus. In experiments, which include a Martian terrain exploration problem, we show that our method is able to explore better than classical exploration methods.
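The abstract's core idea, restricting exploration to states from which a designated home state remains reachable and then favoring rarely visited states, can be illustrated with a minimal sketch. This is not the paper's algorithm (which solves an optimization over MDPs with uncertain dynamics); the chain MDP and the `reachable_to` and `explore` helpers below are hypothetical names for a deterministic toy version.

```python
# Illustrative sketch (hypothetical helpers, not the paper's method):
# safety through ergodicity on a tiny deterministic chain MDP. A state is
# treated as safe if the home state is still reachable from it; exploration
# then favors the least-visited safe successor (a simple exploration bonus).
from collections import deque

def reachable_to(home, succ, states):
    """Return the set of states from which `home` is reachable (backward BFS)."""
    pred = {s: set() for s in states}
    for s in states:
        for _a, s2 in succ(s):
            pred[s2].add(s)
    safe, queue = {home}, deque([home])
    while queue:
        s = queue.popleft()
        for p in pred[s]:
            if p not in safe:
                safe.add(p)
                queue.append(p)
    return safe

def explore(start, home, succ, states, steps=20):
    """Greedy safe exploration: always move to the least-visited safe successor."""
    safe = reachable_to(home, succ, states)
    visits = {s: 0 for s in states}
    s = start
    visits[s] += 1
    for _ in range(steps):
        options = [s2 for _a, s2 in succ(s) if s2 in safe]
        if not options:
            break
        s = min(options, key=visits.get)  # exploration bonus: prefer rare states
        visits[s] += 1
    return visits

# Chain 0..4; state 4 is an absorbing trap (the "broken robot" state).
states = list(range(5))
def succ(s):
    if s == 4:
        return [("stay", 4)]
    return [("left", max(s - 1, 0)), ("right", s + 1)]

safe_states = reachable_to(0, succ, states)   # the trap is excluded
visit_counts = explore(start=0, home=0, succ=succ, states=states)
```

An unconstrained count-based explorer would eventually step into state 4 and stay there forever; the safety constraint excludes that state up front, at the cost of never observing it, which is the "potentially suboptimal" trade-off the abstract mentions.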

References (28)
Steffen Udluft, Daniel Schneegaß, Anton Maximilian Schäfer, Alexander Hans, Safe exploration for reinforcement learning. European Symposium on Artificial Neural Networks, pp. 143–148 (2008)
Damien Ernst, Arthur Louette, Introduction to Reinforcement Learning. MIT Press (1998)
John N. Tsitsiklis, Dimitri P. Bertsekas, Neuro-Dynamic Programming (1996)
S. I. Marcus, E. Fernández-Gaucherand, D. Hernández-Hernández, S. Coraluppi, P. Fard, Risk Sensitive Markov Decision Processes. Birkhäuser, Boston, MA, pp. 263–279 (1997), 10.1007/978-1-4612-4120-1_14
Sham Kakade, Michael Kearns, John Langford, Exploration in metric state spaces. International Conference on Machine Learning, pp. 306–312 (2003)
Arnab Nilim, Laurent El Ghaoui, Robust control of Markov decision processes with uncertain transition matrices. Operations Research, vol. 53, pp. 780–798 (2005), 10.1287/OPRE.1050.0216
Anil Aswani, Patrick Bouffard, Claire Tomlin, Extensions of learning-based model predictive control for real-time application to a quadrotor helicopter. Advances in Computing and Communications, pp. 4661–4666 (2012), 10.1109/ACC.2012.6315483
Randolph L. Kirk, Michael T. Mellon, Steven W. Squyres, Nicolas Thomas, Catherine M. Weitz, Alfred S. McEwen, Eric M. Eliason, James W. Bergstrom, Nathan T. Bridges, Candice J. Hansen, W. Alan Delamere, John A. Grant, Virginia C. Gulick, Kenneth E. Herkenhoff, Laszlo Keszthelyi, Mars Reconnaissance Orbiter's High Resolution Imaging Science Experiment (HiRISE). Journal of Geophysical Research, vol. 112 (2007), 10.1029/2005JE002605