Authors: Tengyang Xie, Yu-Xiang Wang, Yifei Ma
DOI:
Keywords:
Abstract: Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE), that is, evaluating a new policy using historical data obtained by different behavior policies, under the model of nonstationary episodic Markov Decision Processes with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of $O\left(H^2 R_{\max}^2 \sum_{t=1}^{H} \mathbb{E}_{\mu}\left[(w_{\pi,\mu}(s_t, a_t))^2\right] / n\right)$ for large $n$, where $w_{\pi,\mu}(s_t, a_t)$ is the ratio of the marginal distributions of the $t$-th step under $\pi$ and $\mu$, $H$ is the horizon, $R_{\max}$ is the maximal reward, and $n$ is the sample size. The result nearly matches the Cramér-Rao lower bound for DAG MDPs in Jiang and Li [2016] for most non-trivial regimes. To the best of our knowledge, this is the first OPE estimator with provably optimal dependence on $H$ and the second moments of the importance weight. Besides theoretical optimality, we empirically demonstrate the superiority of our method in time-varying, partially observable, and long-horizon RL environments.
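The recursive idea in the abstract can be illustrated with a minimal tabular sketch. It is not the authors' implementation, and it rests on several assumptions that do not come from the source: states and actions are finite and hashable, the behavior-policy probabilities are known, and both policies are time-invariant tables for brevity. The function name `mis_estimate`, the `episodes` layout, and the nested-dict policy format `pi[s][a]` are all hypothetical. At each step the sketch uses only one-step importance weights to estimate a per-state reward and a state-to-state transition under the target policy, then pushes the estimated state marginal forward.

```python
from collections import defaultdict


def mis_estimate(episodes, pi, mu):
    """Tabular marginalized-importance-sampling sketch (illustrative only).

    episodes: list of n trajectories, each a list of H tuples
              (s, a, r, s_next), with s_next equal to the next step's s.
    pi, mu:   nested dicts of action probabilities, pi[s][a] and mu[s][a]
              (assumed known and time-invariant in this sketch).
    """
    n = len(episodes)
    horizon = len(episodes[0])

    # Empirical distribution of the initial state.
    d = defaultdict(float)
    for ep in episodes:
        d[ep[0][0]] += 1.0 / n

    value = 0.0
    for t in range(horizon):
        count = defaultdict(int)          # visits to s at step t
        reward_sum = defaultdict(float)   # sum of rho * r at s
        trans_sum = defaultdict(float)    # sum of rho for each (s, s_next)
        for ep in episodes:
            s, a, r, s_next = ep[t]
            rho = pi[s][a] / mu[s][a]     # one-step importance weight
            count[s] += 1
            reward_sum[s] += rho * r
            trans_sum[(s, s_next)] += rho

        # Step-t reward under the estimated target-policy state marginal.
        value += sum(d[s] * reward_sum[s] / count[s] for s in count)

        # Push the estimated marginal forward one step.
        d_next = defaultdict(float)
        for (s, s_next), w in trans_sum.items():
            d_next[s_next] += d[s] * w / count[s]
        d = d_next
    return value


# Tiny hand-made dataset: 3 episodes, horizon 2, states {0, 1}, actions {0, 1}.
episodes = [
    [(0, 0, 1.0, 1), (1, 1, 0.0, 0)],
    [(0, 1, 0.0, 0), (0, 0, 1.0, 1)],
    [(1, 0, 2.0, 1), (1, 0, 0.5, 0)],
]
uniform = {s: {0: 0.5, 1: 0.5} for s in (0, 1)}
on_policy_value = mis_estimate(episodes, uniform, uniform)
```

When the target and behavior policies coincide, every weight is 1 and the sketch reduces to the average observed return over the episodes, which is a convenient sanity check. Because the weights are applied one step at a time rather than multiplied along the whole trajectory, no product of $H$ ratios ever appears.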