Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces

Authors: Ian Gemp, Sridhar Mahadevan, Nicholas Jacek, Ji Liu, Stephen Giguere

DOI:

Keywords: Mathematical optimization, Reinforcement learning, Temporal difference learning, Operator (computer programming), Computer science, Stochastic optimization, Dual (category theory), Monotone polygon, Operator theory, Legendre transformation

Abstract: In this paper, we set forth a new vision of reinforcement learning developed by us over the past few years, one that yields mathematically rigorous solutions to longstanding important questions that have remained unresolved: (i) how to design reliable, convergent, and robust reinforcement learning algorithms; (ii) how to guarantee that reinforcement learning satisfies pre-specified "safety" guarantees and remains in a stable region of the parameter space; (iii) how to design "off-policy" temporal difference learning algorithms in a reliable and stable manner; and finally (iv) how to integrate the study of reinforcement learning into the rich theory of stochastic optimization. In this paper, we provide detailed answers to all these questions using the powerful framework of proximal operators. The key idea that emerges is the use of primal-dual spaces connected through the use of a Legendre transform. This allows temporal difference updates to occur in dual spaces, allowing a variety of important technical advantages. The Legendre transform elegantly generalizes past algorithms for solving reinforcement learning problems, such as natural gradient methods, which we show relate closely to the previously unconnected framework of mirror descent methods. Equally importantly, proximal operator theory enables the systematic development of operator splitting methods, which show how to safely and reliably decompose the complex products of gradients that occur in recent variants of gradient-based temporal difference learning. This key technical innovation makes it possible to finally design "true" stochastic gradient methods for reinforcement learning. Finally, Legendre transforms enable a variety of other benefits, including modeling sparsity and domain geometry. Our work builds extensively on recent work on the convergence of saddle-point algorithms, and on the theory of monotone operators.
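The two building blocks the abstract leans on can be sketched concretely. The following is a minimal illustrative sketch (not the paper's algorithm): the proximal operator of the L1 norm, which is the standard soft-thresholding map used to model sparsity, and a single mirror descent step on the probability simplex using the negative-entropy mirror map, whose Legendre dual yields the familiar exponentiated-gradient update. The function names and step sizes here are hypothetical choices for exposition.

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of f(x) = lam * ||x||_1.

    prox_f(x) = argmin_u  0.5 * ||u - x||^2 + lam * ||u||_1,
    which has the closed-form soft-thresholding solution below.
    """
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def mirror_descent_step(theta, grad, step):
    """One mirror descent update on the probability simplex.

    With the negative-entropy mirror map, the dual (Legendre) update
    reduces to a multiplicative exponentiated-gradient step followed
    by renormalization back onto the simplex.
    """
    unnorm = theta * np.exp(-step * grad)
    return unnorm / unnorm.sum()

# Soft-thresholding zeroes out small coordinates (sparsity):
x = np.array([1.5, -0.2, 0.7])
print(prox_l1(x, 0.5))  # -> [1.0, 0.0, 0.2]

# A mirror descent step shifts probability mass away from
# coordinates with large gradients, staying on the simplex:
theta = np.ones(3) / 3.0
theta_new = mirror_descent_step(theta, np.array([1.0, 0.0, 0.0]), 1.0)
print(theta_new, theta_new.sum())
```

The choice of negative entropy as the mirror map is what distinguishes mirror descent from plain projected gradient descent; swapping in the squared Euclidean norm recovers the ordinary gradient step.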
