Authors: Csaba Szepesvári, Rémi Munos, Lihong Li
DOI:
Keywords:
Abstract: This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on sample observations collected by another policy. We first consider the multi-armed bandit case, establish a minimax risk lower bound, and analyze two standard estimators. It is shown, and verified in simulation, that the regression estimator is minimax optimal up to a constant, while the importance sampling estimator can be arbitrarily worse, despite its empirical success and popularity. The results are applied to related problems in contextual bandits and fixed-horizon Markov decision processes, and also to semi-supervised learning.
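As a concrete illustration of the two estimators compared in the abstract, here is a minimal simulation sketch for the multi-armed bandit case. It is not the paper's experimental setup: the number of arms, the Bernoulli reward model, and the specific behavior and target policies below are all illustrative assumptions, and the behavior policy is assumed to be known.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5                                        # number of arms (assumption)
pi_b = np.full(K, 1.0 / K)                   # behavior policy: uniform (assumption)
pi_t = np.array([0.6, 0.1, 0.1, 0.1, 0.1])   # target policy (assumption)
means = np.linspace(0.1, 0.9, K)             # true mean rewards (assumption)
n = 10_000                                   # number of logged samples

# Data collected by the behavior policy.
arms = rng.choice(K, size=n, p=pi_b)
rewards = rng.binomial(1, means[arms]).astype(float)

# Importance sampling (likelihood ratio) estimator:
# reweight each observed reward by pi_t(a) / pi_b(a) and average.
v_is = np.mean(pi_t[arms] / pi_b[arms] * rewards)

# Regression (direct) estimator:
# estimate each arm's mean reward from the log, then average under pi_t.
mu_hat = np.array([
    rewards[arms == a].mean() if np.any(arms == a) else 0.0
    for a in range(K)
])
v_reg = pi_t @ mu_hat

v_true = pi_t @ means
print(f"true value         : {v_true:.4f}")
print(f"importance sampling: {v_is:.4f}")
print(f"regression         : {v_reg:.4f}")
```

The failure mode the abstract alludes to can be reproduced in this sketch by shrinking the behavior probability of an arm the target policy favors: the importance weight pi_t(a)/pi_b(a) then becomes large and the variance of the importance sampling estimate blows up, while the regression estimator degrades more gracefully.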