Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning

Authors: Ali Mousavi, Lihong Li, Qiang Liu, Denny Zhou

DOI:

Keywords: Computer science, Kernel (statistics), Stationary distribution, Generalization, Kernel (linear algebra), Black box, Mathematical optimization, Estimator, Reinforcement learning, Fixed point, Operator (computer programming)

Abstract: Off-policy estimation for long-horizon problems is important in many real-life applications such as healthcare and robotics, where high-fidelity simulators may not be available and on-policy evaluation is expensive or impossible. Recently, \cite{liu18breaking} proposed an approach that avoids the \emph{curse of horizon} suffered by typical importance-sampling-based methods. While showing promising results, this approach is limited in practice as it requires data drawn from the \emph{stationary distribution} of a \emph{known} behavior policy. In this work, we propose a novel approach that eliminates both limitations. In particular, we formulate the problem as solving for the fixed point of a certain operator. Using tools from Reproducing Kernel Hilbert Spaces (RKHSs), we develop a new estimator that computes importance ratios of stationary distributions, without knowledge of how the off-policy data are collected. We analyze its asymptotic consistency and finite-sample generalization. Experiments on benchmarks verify the effectiveness of our approach.
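To make concrete what the estimator targets, the following toy sketch computes the importance ratios of stationary distributions exactly for two small Markov chains. This is only an illustration of the quantity $w(s) = d_\pi(s)/d_\mu(s)$ being estimated, not the paper's RKHS-based algorithm; the transition matrices `P_pi` and `P_mu` are made-up examples standing in for the chains induced by the target and (unknown) behavior policies.

```python
import numpy as np

# Made-up 3-state transition matrices for the target policy (pi) and the
# behavior policy (mu); rows sum to one.
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.1, 0.8, 0.1],
                 [0.0, 0.1, 0.9]])
P_mu = np.array([[0.5, 0.5, 0.0],
                 [0.3, 0.4, 0.3],
                 [0.0, 0.5, 0.5]])

def stationary(P):
    """Stationary distribution d with d = P^T d, i.e. the fixed point of
    the transition operator: the left eigenvector of P for eigenvalue 1,
    normalized to sum to one."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

d_pi, d_mu = stationary(P_pi), stationary(P_mu)
w = d_pi / d_mu  # importance ratios of stationary distributions
print(w)
```

In the black-box setting of the paper, neither `P_mu` nor the behavior policy's action probabilities are available; the estimator recovers `w` directly from off-policy transition data by enforcing the fixed-point condition over an RKHS of test functions.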

References (32)
Hamid R. Maei, Richard S. Sutton, Shalabh Bhatnagar, Csaba Szepesvári. Toward Off-Policy Learning Control with Function Approximation. International Conference on Machine Learning, pp. 719-726 (2010).
Csaba Szepesvári, Rémi Munos, Lihong Li. Toward Minimax Off-policy Value Estimation. International Conference on Artificial Intelligence and Statistics, pp. 608-616 (2015).
Richard S. Sutton, A. Rupam Mahmood, Martha White. An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning. Journal of Machine Learning Research, vol. 17, pp. 2603-2631 (2016). doi:10.5555/2946645.3007026
Alain Berlinet, Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics (2011).
Miroslav Dudík, Lihong Li, John Langford. Doubly Robust Policy Evaluation and Learning. arXiv (2011).
S. A. Murphy, M. J. van der Laan, J. M. Robins. Marginal Mean Models for Dynamic Regimes. Journal of the American Statistical Association, vol. 96, pp. 1410-1423 (2001). doi:10.1198/016214501753382327
M. Henmi, R. Yoshida, S. Eguchi. Importance Sampling via the Estimated Sampler. Biometrika, vol. 94, pp. 985-991 (2007). doi:10.1093/biomet/asm076
Sébastien Jean, Kyunghyun Cho, Roland Memisevic, Yoshua Bengio. On Using Very Large Target Vocabulary for Neural Machine Translation. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), vol. 1, pp. 1-10 (2015). doi:10.3115/v1/P15-1001