Authors: Ian Gemp, Sridhar Mahadevan, Nicholas Jacek, Ji Liu, Stephen Giguere
DOI:
Keywords: Mathematical optimization, Reinforcement learning, Temporal difference learning, Operator (computer programming), Computer science, Stochastic optimization, Dual (category theory), Monotone polygon, Operator theory, Legendre transformation
Abstract: In this paper, we set forth a new vision of reinforcement learning developed by us over the past few years, one that yields mathematically rigorous solutions to longstanding important questions that have remained unresolved: (i) how to design reliable, convergent, and robust reinforcement learning algorithms; (ii) how to guarantee that reinforcement learning satisfies pre-specified "safety" guarantees and remains in a stable region of the parameter space; (iii) how to design "off-policy" temporal difference learning in a reliable manner; and finally (iv) how to integrate the study of reinforcement learning into the rich theory of stochastic optimization. We provide detailed answers to all of these questions using the powerful framework of proximal operators. The key idea that emerges is the use of primal and dual spaces connected through a Legendre transform. This allows updates to occur in dual spaces, yielding a variety of technical advantages. The Legendre transform elegantly generalizes past algorithms for solving reinforcement learning problems, such as natural gradient methods, which we show relate closely to the previously unconnected framework of mirror descent methods. Equally importantly, proximal operator theory enables the systematic development of operator splitting methods that safely and reliably decompose the complex products of gradients that occur in recent variants of gradient-based temporal difference learning. This innovation makes it possible to design "true" stochastic gradient methods for reinforcement learning. Finally, Legendre transforms enable other benefits, including modeling sparsity and domain geometry. Our work builds extensively on recent results on the convergence of saddle-point algorithms and on the theory of monotone operators.
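To make the primal-dual idea in the abstract concrete, below is a minimal sketch of a TD(0) update taken in the dual space: a Legendre transform (here a p-norm link function, a common choice in the mirror descent literature) carries the weights into the dual space, the temporal difference step is applied there, and the conjugate map returns them to the primal space. The function names, the default step sizes, and the choice of p are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def pnorm_link(theta, p):
    """Gradient of the potential psi(theta) = 0.5 * ||theta||_p^2.
    This is the Legendre-transform map between primal and dual spaces."""
    norm = np.linalg.norm(theta, ord=p)
    if norm == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm ** (p - 2)

def mirror_descent_td0(features, rewards, next_features,
                       alpha=0.05, gamma=0.99, p=None):
    """Sketch of TD(0) with the update taken in the dual space.

    features, next_features: (T, d) arrays of state feature vectors
    rewards: length-T array of observed rewards
    """
    d = features.shape[1]
    if p is None:
        # Hypothetical default: p ~ 2 ln d approximates l1 geometry for
        # large d, which encourages sparse weight vectors; p = 2 recovers
        # ordinary (Euclidean) TD(0).
        p = max(2.0, 2.0 * np.log(d))
    q = p / (p - 1.0)                  # conjugate exponent, 1/p + 1/q = 1
    theta = np.zeros(d)
    for phi, r, phi_next in zip(features, rewards, next_features):
        delta = r + gamma * phi_next @ theta - phi @ theta  # TD error
        w = pnorm_link(theta, p)       # primal -> dual
        w = w + alpha * delta * phi    # TD step applied in the dual space
        theta = pnorm_link(w, q)       # dual -> primal via the conjugate map
    return theta
```

The choice of potential is what lets the same update model sparsity and domain geometry, as the abstract notes: swapping the p-norm link for another strongly convex potential changes the geometry of the update without altering the primal-dual structure.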