CoinDICE: Off-Policy Confidence Interval Estimation

Authors: Csaba Szepesvári, Dale Schuurmans, Bo Dai, Yinlam Chow, Lihong Li

Abstract: We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy's value, given only …
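To make the estimation target concrete (an interval on a target policy's value computed from logged data), here is a minimal sketch of a classical baseline. It is not CoinDICE and it is not behavior-agnostic: it assumes the behavior policy's action probabilities are known, forms trajectory-wise importance-sampling returns (the estimator family studied in earlier off-policy evaluation work such as Precup, Sutton, and Singh, 2000), and wraps them in a percentile bootstrap. The function names, the (state, action, reward) trajectory format, and the toy problem are hypothetical illustrations.

```python
import random
import statistics


def is_returns(trajectories, target_prob, behavior_prob, gamma=0.99):
    """Trajectory-wise importance-sampling estimates, one per trajectory.

    Each trajectory is a list of (state, action, reward) tuples.
    target_prob(s, a) and behavior_prob(s, a) are hypothetical callables
    returning action probabilities. Knowing behavior_prob is an ASSUMPTION
    that the behavior-agnostic setting does not make.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # Product of per-step likelihood ratios over the whole trajectory.
            weight *= target_prob(s, a) / behavior_prob(s, a)
            # Discounted return of the logged trajectory.
            ret += (gamma ** t) * r
        estimates.append(weight * ret)
    return estimates


def bootstrap_ci(samples, alpha=0.05, n_boot=2000, seed=0):
    """Percentile-bootstrap (1 - alpha) interval for the mean of samples."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def toy_target(s, a):
    # Hypothetical target policy: picks action 1 with probability 0.9.
    return 0.9 if a == 1 else 0.1


def toy_behavior(s, a):
    # Hypothetical uniform behavior policy over actions {0, 1}.
    return 0.5


if __name__ == "__main__":
    # Toy one-step problem: action 1 pays reward 1, action 0 pays 0,
    # so the target policy's true value is 0.9.
    gen = random.Random(1)
    data = [[(0, a, float(a))] for a in (gen.randint(0, 1) for _ in range(1000))]
    lo, hi = bootstrap_ci(is_returns(data, toy_target, toy_behavior))
    print(f"95% bootstrap interval for the target value: [{lo:.3f}, {hi:.3f}]")
```

Two caveats apply to this sketch: the percentile bootstrap gives only approximate (asymptotic) coverage, and the product of importance ratios can have very high variance over long horizons.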

References (76)
Csaba Szepesvári, Rémi Munos, Lihong Li. Toward Minimax Off-policy Value Estimation. International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 608-616, 2015.
Sascha Lange, Thomas Gabel, Martin Riedmiller. Batch Reinforcement Learning. In Reinforcement Learning, pp. 45-73, 2012. DOI: 10.1007/978-3-642-27645-3_2.
Patrice Bertail, Emmanuelle Gautherat, Hugo Harari-Kermadec. Empirical φ*-Divergence Minimizers for Hadamard Differentiable Functionals. Springer Proceedings in Mathematics & Statistics, vol. 74, pp. 21-32, 2014. DOI: 10.1007/978-1-4939-0569-0_3.
Thomas G. Dietterich. The MAXQ Method for Hierarchical Reinforcement Learning. International Conference on Machine Learning (ICML), pp. 118-126, 1998.
Doina Precup, Satinder P. Singh, Richard S. Sutton. Eligibility Traces for Off-Policy Policy Evaluation. International Conference on Machine Learning (ICML), pp. 759-766, 2000.
Istvan Szita, Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. International Conference on Machine Learning (ICML), pp. 1031-1038, 2010.
Werner Römisch. Delta Method, Infinite Dimensional. Wiley StatsRef: Statistics Reference Online, 2006. DOI: 10.1002/0471667196.ESS3139.
Akimichi Takemura, Junya Honda. An Asymptotically Optimal Bandit Algorithm for Bounded Support Models. Conference on Learning Theory (COLT), pp. 67-79, 2010.
D. P. de Farias, B. Van Roy. The Linear Programming Approach to Approximate Dynamic Programming. Operations Research, vol. 51, pp. 850-865, 2003. DOI: 10.1287/OPRE.51.6.850.24925.
Csaba Szepesvári, Alborz Geramifard, Richard S. Sutton, Michael Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. Uncertainty in Artificial Intelligence (UAI), pp. 528-536, 2008.