PEGASUS: A policy search method for large MDPs and POMDPs

Authors: Andrew Y. Ng, Michael I. Jordan

DOI:

Keywords: Mathematical optimization, Exponential function, Partially observable Markov decision process, Action (physics), Value (ethics), Space (mathematics), State (functional analysis), Mathematics, Markov decision process, Polynomial

Abstract: We propose a new approach to the problem of searching a space of policies for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP), given a model. Our approach is based on the following observation: Any (PO)MDP can be transformed into an "equivalent" POMDP in which all state transitions (given the current state and action) are deterministic. This reduces the general problem of policy search to one in which we need only consider POMDPs with deterministic transitions. We give a natural way of estimating the value of all policies in these transformed POMDPs. Policy search is then simply performed by searching for a policy with high estimated value. We also establish conditions under which our value estimates will be good, recovering theoretical results similar to those of Kearns, Mansour and Ng [7], but with "sample complexity" bounds that have only a polynomial rather than exponential dependence on the horizon time. Our method applies to arbitrary POMDPs, including ones with infinite state and action spaces. We also present empirical results for our approach on a small discrete problem, and on a complex continuous state/continuous action problem involving learning to ride a bicycle.
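
The idea in the abstract can be illustrated with a minimal sketch: draw the random numbers once, reuse them for every policy so that each rollout is deterministic, and then search for the policy with the highest estimated value. Everything concrete here is an assumption made for illustration and is not taken from the paper: the one-dimensional dynamics in `step`, the `reward` function, the threshold policy class, the grid of candidates, and all function names.

```python
import random

def step(state, action, p):
    # Deterministic simulative model g(s, a, p): the number p in [0, 1]
    # carries all of the transition randomness (toy dynamics, assumed).
    return state + action + (p - 0.5)

def reward(state):
    # Toy reward (assumed): staying close to the origin is better.
    return -abs(state)

def policy(theta, state):
    # Hypothetical one-parameter policy: push left above the threshold theta.
    return -1.0 if state > theta else 1.0

def make_scenarios(m, horizon, seed=0):
    # Each scenario fixes an initial state and a sequence of random numbers.
    rng = random.Random(seed)
    return [(rng.uniform(-2.0, 2.0), [rng.random() for _ in range(horizon)])
            for _ in range(m)]

def estimate_value(theta, scenarios, gamma=0.95):
    # Average discounted return over the fixed scenarios. Because the
    # scenarios never change, this is a deterministic function of theta.
    total = 0.0
    for s0, noise in scenarios:
        state, discount = s0, 1.0
        for p in noise:
            total += discount * reward(state)
            state = step(state, policy(theta, state), p)
            discount *= gamma
    return total / len(scenarios)

def search(candidates, scenarios):
    # Policy search reduces to ordinary optimization of the estimate;
    # a plain grid search stands in for any optimizer here.
    return max(candidates, key=lambda th: estimate_value(th, scenarios))

scenarios = make_scenarios(m=30, horizon=20)
candidates = [x / 2.0 for x in range(-6, 7)]
best_theta = search(candidates, scenarios)
```

Because the same scenarios are reused for every candidate, two calls to `estimate_value` with the same `theta` return exactly the same number, which is what lets a standard optimizer be applied to the estimate.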

References (17)
Hajime Kimura, Masayuki Yamamura, Shigenobu Kobayashi. Reinforcement Learning by Stochastic Hill Climbing on Discounted Reward. Machine Learning Proceedings 1995, pp. 295-303 (1995). DOI: 10.1016/B978-1-55860-377-6.50044-X
Jette Randløv, Preben Alstrøm. Learning to Drive a Bicycle Using Reinforcement Learning and Shaping. International Conference on Machine Learning, pp. 463-471 (1998).
John N. Tsitsiklis, Benjamin Van Roy. Learning and value function approximation in complex decision processes. Massachusetts Institute of Technology (1998).
Damien Ernst, Arthur Louette. Introduction to Reinforcement Learning. MIT Press (1998).
Vladimir Naumovich Vapnik. Estimation of Dependences Based on Empirical Data (2010).
John R. Birge, François Louveaux. Introduction to Stochastic Programming (2011).
Leonid Peshkin, Leslie Pack Kaelbling, Kee-Eung Kim, Nicolas Meuleau. Learning finite-state controllers for partially observable environments. Uncertainty in Artificial Intelligence, pp. 427-436 (1999).
Paul Goldberg, Mark Jerrum. Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Conference on Learning Theory, vol. 18, pp. 361-369 (1993). DOI: 10.1145/168304.168377