Efficient approximate policy iteration methods for sequential decision making in reinforcement learning

Authors: Michail G. Lagoudakis, Ronald Parr

DOI:

Keywords:

Abstract: Reinforcement learning is a promising paradigm in which an agent learns how to make good decisions by interacting with an (unknown) environment. This framework can be extended along two dimensions: the number of decision makers (single- or multi-agent) and the nature of the interaction (collaborative or competitive). This characterization leads to four decision-making situations that are considered in this thesis and modeled as Markov decision processes, team Markov decision processes, zero-sum Markov games, and team zero-sum Markov games. Existing reinforcement learning algorithms have not been applied widely to real-world problems, mainly because the required resources grow fast as a function of the size of the problem. Exact, but impractical, solutions are commonly abandoned in favor of approximate, yet practical, solutions. Unfortunately, research on efficient and stable approximate methods has focused mostly on the prediction problem, where the agent tries to learn the outcome of a fixed policy. This thesis contributes methods based on general policy iteration for the control problem, where the agent tries to learn a good policy. The Least-Squares Policy Iteration (LSPI) algorithm learns policies through a least-squares fixed-point approximation of the value function. LSPI makes efficient use of sample experience and is therefore most appropriate for domains where training data are expensive or a simulator of the process is available. Rollout Classification Policy Iteration (RCPI), on the other hand, uses rollouts (Monte-Carlo simulation estimates) to train a classifier that represents the policy; for this reason, RCPI is most appropriate for domains where rollouts come at no (or low) cost. Both algorithms exhibit nice theoretical properties and bear strong connections to other areas, namely feature selection and classification learning, respectively. The proposed methods are demonstrated on a variety of tasks: chain walk, inverted pendulum balancing, bicycle balancing and riding, the game of Tetris, multiagent system administration, distributed power grid control, server-router flow control, and a two-player soccer game. These results demonstrate clearly the efficiency and applicability of the new methods to large problems.
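To make the abstract's description of LSPI concrete, the following is a minimal sketch of least-squares fixed-point policy evaluation wrapped in a policy-iteration loop, assuming a linear Q-function over a user-supplied feature map. The function names (lstdq, lspi), the feature map phi, the ridge term, and the convergence test are illustrative assumptions, not the thesis's exact implementation.

```python
import numpy as np

def lstdq(samples, phi, k, policy, gamma=0.95):
    """Least-squares fixed-point evaluation of the Q-function of `policy`."""
    A = 1e-6 * np.eye(k)          # small ridge term so the solve stays well-posed (assumption)
    b = np.zeros(k)
    for (s, a, r, s_next) in samples:
        f = phi(s, a)                          # features of the visited state-action pair
        f_next = phi(s_next, policy(s_next))   # features of the successor under `policy`
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)               # weights w with Q(s, a) ~ phi(s, a) . w

def lspi(samples, phi, k, actions, gamma=0.95, max_iters=20, tol=1e-6):
    """Approximate policy iteration: alternate LSTDQ evaluation with greedy improvement."""
    w = np.zeros(k)
    greedy = lambda s: max(actions, key=lambda a: phi(s, a) @ w)  # policy implied by current w
    for _ in range(max_iters):
        w_new = lstdq(samples, phi, k, greedy, gamma)
        if np.linalg.norm(w_new - w) < tol:    # weights (and hence the policy) have stabilized
            return w_new
        w = w_new
    return w
```

A rollout-based variant in the spirit of RCPI would instead estimate the best action in a set of sampled states by Monte-Carlo rollouts and fit a classifier to those state-action labels, avoiding an explicit value-function representation.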
