Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning

Authors: Ali Mousavi, Lihong Li, Qiang Liu, Denny Zhou

DOI:

Keywords: Computer science, Kernel (statistics), Stationary distribution, Generalization, Kernel (linear algebra), Black box, Mathematical optimization, Estimator, Reinforcement learning, Fixed point, Operator (computer programming)

Abstract: Off-policy estimation for long-horizon problems is important in many real-life applications such as healthcare and robotics, where high-fidelity simulators may not be available and on-policy evaluation is expensive or impossible. Recently, \cite{liu18breaking} proposed an approach that avoids the \emph{curse of horizon} suffered by typical importance-sampling-based methods. While showing promising results, this approach is limited in practice as it requires data drawn from the \emph{stationary distribution} of a \emph{known} behavior policy. In this work, we propose a novel approach that eliminates both limitations. In particular, we formulate the problem as solving for the fixed point of a certain operator. Using tools from Reproducing Kernel Hilbert Spaces (RKHSs), we develop a new estimator that computes importance ratios of stationary distributions, without knowledge of how the off-policy data are collected. We analyze its asymptotic consistency and finite-sample generalization. Experiments on benchmarks verify the effectiveness of our approach.
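To make concrete what the estimator targets, the following toy sketch computes the importance ratios of stationary distributions exactly for two small Markov chains. This is only an illustration of the quantity $w(s) = d_\pi(s)/d_\mu(s)$ being estimated, not the paper's RKHS-based algorithm; the transition matrices `P_pi` and `P_mu` are made-up examples standing in for the chains induced by the target and (unknown) behavior policies.

```python
import numpy as np

# Made-up 3-state transition matrices for the target policy (pi) and the
# behavior policy (mu); rows sum to one.
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.1, 0.8, 0.1],
                 [0.0, 0.1, 0.9]])
P_mu = np.array([[0.5, 0.5, 0.0],
                 [0.3, 0.4, 0.3],
                 [0.0, 0.5, 0.5]])

def stationary(P):
    """Stationary distribution d with d = P^T d, i.e. the fixed point of
    the transition operator: the left eigenvector of P for eigenvalue 1,
    normalized to sum to one."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

d_pi, d_mu = stationary(P_pi), stationary(P_mu)
w = d_pi / d_mu  # importance ratios of stationary distributions
print(w)
```

In the black-box setting of the paper, neither `P_mu` nor the behavior policy's action probabilities are available; the estimator recovers `w` directly from off-policy transition data by enforcing the fixed-point condition over an RKHS of test functions.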

References (32)
Hamid R. Maei, Richard S. Sutton, Shalabh Bhatnagar, Csaba Szepesvári. Toward Off-Policy Learning Control with Function Approximation. International Conference on Machine Learning, pp. 719-726 (2010).
Csaba Szepesvári, Rémi Munos, Lihong Li. Toward Minimax Off-policy Value Estimation. International Conference on Artificial Intelligence and Statistics, pp. 608-616 (2015).
Richard S. Sutton, A. Rupam Mahmood, Martha White. An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning. Journal of Machine Learning Research, vol. 17, pp. 2603-2631 (2016). doi:10.5555/2946645.3007026
Alain Berlinet, Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics (2011).
Miroslav Dudík, Lihong Li, John Langford. Doubly Robust Policy Evaluation and Learning. arXiv (2011).
S. A. Murphy, M. J. van der Laan, J. M. Robins. Marginal Mean Models for Dynamic Regimes. Journal of the American Statistical Association, vol. 96, pp. 1410-1423 (2001). doi:10.1198/016214501753382327
M. Henmi, R. Yoshida, S. Eguchi. Importance Sampling via the Estimated Sampler. Biometrika, vol. 94, pp. 985-991 (2007). doi:10.1093/biomet/asm076
Sébastien Jean, Kyunghyun Cho, Roland Memisevic, Yoshua Bengio. On Using Very Large Target Vocabulary for Neural Machine Translation. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), vol. 1, pp. 1-10 (2015). doi:10.3115/v1/P15-1001