On Minimax Optimal Offline Policy Evaluation.

Authors: Csaba Szepesvári, Rémi Munos, Lihong Li

DOI:

Keywords:

Abstract: This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the multi-armed bandit case, establish a minimax risk lower bound, and analyze the risk of two standard estimators. It is shown, and verified in simulation, that one estimator is minimax optimal up to a constant, while the other can be arbitrarily worse, despite its empirical success and popularity. The results are applied to related problems in contextual bandits and fixed-horizon Markov decision processes, and are also related to semi-supervised learning.
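The multi-armed bandit setting described in the abstract can be illustrated with a small simulation. The abstract does not name its two estimators, so this sketch assumes two common choices: an importance-weighted (likelihood ratio) estimator and a regression-style estimator that plugs per-arm sample means into the target policy. The arm means, policies, and function names are hypothetical values chosen for illustration only.

```python
import random


def importance_weighted_estimate(samples, target, behavior):
    """Average of rewards reweighted by target/behavior action probabilities."""
    return sum(target[a] / behavior[a] * r for a, r in samples) / len(samples)


def regression_estimate(samples, target):
    """Plug per-arm sample means into the target policy.

    Arms that never appear in the data contribute 0 (an assumption of this sketch).
    """
    totals, counts = {}, {}
    for a, r in samples:
        totals[a] = totals.get(a, 0.0) + r
        counts[a] = counts.get(a, 0) + 1
    return sum(p * totals[a] / counts[a] for a, p in target.items() if a in counts)


def simulate(n=2000, seed=0):
    """Draw n (action, reward) pairs from the behavior policy and estimate the target value."""
    rng = random.Random(seed)
    mean_reward = [0.2, 0.5, 0.8]          # hypothetical Bernoulli reward means per arm
    behavior = {0: 0.6, 1: 0.3, 2: 0.1}    # hypothetical data-collecting policy
    target = {0: 0.1, 1: 0.3, 2: 0.6}      # hypothetical policy to evaluate
    arms, weights = list(behavior), list(behavior.values())
    samples = []
    for _ in range(n):
        a = rng.choices(arms, weights=weights)[0]
        r = 1.0 if rng.random() < mean_reward[a] else 0.0
        samples.append((a, r))
    true_value = sum(target[a] * mean_reward[a] for a in target)
    return (
        true_value,
        importance_weighted_estimate(samples, target, behavior),
        regression_estimate(samples, target),
    )


if __name__ == "__main__":
    true_value, iw, reg = simulate()
    print(f"true={true_value:.3f} importance_weighted={iw:.3f} regression={reg:.3f}")
```

Rerunning `simulate` with different seeds, or with a behavior policy that rarely plays an arm the target policy favors, gives a rough feel for how differently the two estimators can behave on the same data.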
