Marginalized Off-Policy Evaluation for Reinforcement Learning

Authors: Tengyang Xie, Yu-Xiang Wang, Yifei Ma

DOI:

Keywords:

Abstract: Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE), i.e., the problem of evaluating a new policy using the historical data obtained by different behavior policies, under the model of nonstationary episodic Markov Decision Processes with a long horizon and large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon H. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of $O\big(H^2 R_{\max}^2 \sum_{t=1}^{H} \mathbb{E}_{\mu}[(w_{\pi,\mu}(s_t, a_t))^2]/n\big)$ for large $n$, where $w_{\pi,\mu}(s_t, a_t)$ is the ratio of the marginal distributions at the $t$-th step under $\pi$ and $\mu$, $H$ is the horizon, $R_{\max}$ is the maximal reward, and $n$ is the sample size. The result nearly matches the Cramér-Rao lower bound for DAG MDPs in Jiang and Li [2016] for most non-trivial regimes. To the best of our knowledge, this is the first OPE estimator with provably optimal dependence on $H$ and the second moments of the importance weights. Besides theoretical optimality, we empirically demonstrate the superiority of our method in time-varying, partially observable, and long-horizon RL environments.
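For the tabular, finite-horizon setting, the MIS idea described above (recursively estimating the target policy's state marginals and combining them with reweighted reward estimates) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' reference implementation; the function name `mis_estimate`, the episode data layout, and the handling of unvisited states are assumptions made here for exposition.

```python
# Minimal sketch of a marginalized importance sampling (MIS) style OPE
# estimator for a tabular, nonstationary, finite-horizon MDP.
# Assumptions (not from the paper's code): episodes are lists of (s, a, r)
# tuples, and unvisited states simply contribute zero at that step.
import numpy as np

def mis_estimate(data, pi, mu, S, A, H):
    """data: list of n episodes, each a length-H list of (s, a, r) tuples.
    pi, mu: arrays of shape (H, S, A) with target/behavior action probabilities.
    Returns an estimate of the target policy's H-step value."""
    n = len(data)
    d_pi = np.zeros((H, S))  # estimated state marginals under the target policy
    d_pi[0] = np.bincount([ep[0][0] for ep in data], minlength=S) / n
    value = 0.0
    for t in range(H):
        # Empirical counts at step t, reweighted by the per-step action
        # ratio pi/mu so transitions and rewards reflect the target policy.
        trans = np.zeros((S, S))   # reweighted transition counts s -> s'
        rew = np.zeros(S)          # reweighted reward sums per state
        visits = np.zeros(S)       # raw visit counts per state under mu
        for ep in data:
            s, a, r = ep[t]
            rho = pi[t, s, a] / mu[t, s, a]
            visits[s] += 1
            rew[s] += rho * r
            if t + 1 < H:
                s_next = ep[t + 1][0]
                trans[s, s_next] += rho
        seen = visits > 0
        # Estimated mean reward under pi at visited states.
        r_pi = np.zeros(S)
        r_pi[seen] = rew[seen] / visits[seen]
        value += d_pi[t] @ r_pi
        if t + 1 < H:
            # Recursive marginal update: d_{t+1}^pi(s') = sum_s P_t^pi(s'|s) d_t^pi(s).
            P_pi = np.zeros((S, S))
            P_pi[seen] = trans[seen] / visits[seen, None]
            d_pi[t + 1] = d_pi[t] @ P_pi
    return value
```

The key contrast with step-wise importance sampling is that the estimator never multiplies per-step ratios across the horizon; it only carries forward the estimated state marginal, which is what removes the exponential dependence on H in the variance.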

References (0)