Authors: Andrew Y. Ng, Michael I. Jordan
DOI:
Keywords: Mathematical optimization, Exponential function, Partially observable Markov decision process, Action, Value, State space, Markov decision process, Polynomial
Abstract: We propose a new approach to the problem of searching a space of policies for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP), given a model. Our approach is based on the following observation: Any (PO)MDP can be transformed into an "equivalent" POMDP in which all state transitions (given the current state and action) are deterministic. This reduces the general policy search problem to one in which we need only consider POMDPs with deterministic transitions. We give a natural way of estimating the value of all policies in these transformed POMDPs. Policy search is then simply performed by searching for a policy with high estimated value. We also establish conditions under which our value estimates will be good, recovering theoretical results similar to those of Kearns, Mansour and Ng [7], but with "sample complexity" bounds that have only a polynomial rather than exponential dependence on the horizon time. Our method applies to arbitrary POMDPs, including ones with infinite state and action spaces. We present empirical results on a small discrete problem, and on a complex continuous state/continuous action problem involving learning to ride a bicycle.
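To make the core idea concrete, here is a minimal sketch of the determinization-plus-evaluation step the abstract describes: all randomness is drawn once up front (fixed "scenarios"), so each rollout becomes a deterministic function of the policy, and the value estimate is the average return over scenarios. The simulator interface `simulate_step` and all names are illustrative assumptions, not the paper's API.

```python
import numpy as np

def estimate_value(policy, simulate_step, initial_states, seeds, horizon, gamma):
    """Estimate a policy's value on a determinized (PO)MDP.

    simulate_step(s, a, u) is an assumed generative model that consumes an
    externally supplied uniform random number u, so that the next state and
    reward are a deterministic function of (s, a, u). Reusing the same seeds
    for every candidate policy makes the estimate a deterministic function
    of the policy, which is what enables ordinary search/optimization.
    """
    returns = []
    for s0, seed in zip(initial_states, seeds):
        rng = np.random.default_rng(seed)  # fixed randomness per scenario
        s, total, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r = simulate_step(s, a, rng.uniform())  # deterministic given noise
            total += discount * r
            discount *= gamma
        returns.append(total)
    return float(np.mean(returns))
```

Policy search then amounts to maximizing `estimate_value` over the policy class with the scenarios held fixed; because the estimate has no residual sampling noise across candidates, standard deterministic optimizers can be applied.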