Authors: Ali Mousavi, Lihong Li, Qiang Liu, Denny Zhou
DOI:
Keywords: Computer science, Kernel (statistics), Stationary distribution, Generalization, Kernel (linear algebra), Black box, Mathematical optimization, Estimator, Reinforcement learning, Fixed point, Operator (computer programming)
Abstract: Off-policy estimation for long-horizon problems is important in many real-life applications such as healthcare and robotics, where high-fidelity simulators may not be available and on-policy evaluation is expensive or impossible. Recently, \cite{liu18breaking} proposed an approach that avoids the \emph{curse of horizon} suffered by typical importance-sampling-based methods. While showing promising results, this approach is limited in practice as it requires data drawn from the \emph{stationary distribution} of a \emph{known} behavior policy. In this work, we propose a novel approach that eliminates such limitations. In particular, we formulate the problem as solving for the fixed point of a certain operator. Using tools from Reproducing Kernel Hilbert Spaces (RKHSs), we develop a new estimator that computes importance ratios of stationary distributions, without knowledge of how the off-policy data are collected. We analyze its asymptotic consistency and finite-sample generalization. Experiments on benchmarks verify the effectiveness of our approach.
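The abstract compresses the method into two steps: characterize the stationary density ratio w(s,a) = d_pi(s,a)/d_D(s,a) as the fixed point of an operator that never references the behavior policy, then drive the empirical fixed-point residual to zero in an RKHS norm, which expands into pure kernel evaluations on the data. The sketch below illustrates that recipe on a toy tabular MDP; it is not the paper's estimator (which targets continuous spaces and general function classes), and every concrete choice in it (the Gaussian kernel on random embeddings, the exp parameterization of w, the normalization penalty, all variable names) is an assumption made for illustration.

```python
# Minimal sketch of a kernel fixed-point ratio estimator (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 6, 2, 0.9
nZ = nS * nA                                       # joint (s, a) index space

# Random MDP, target policy pi, reward, initial state distribution.
P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # P[s, a, s']
pi = rng.dirichlet(3.0 * np.ones(nA), size=nS)     # pi[s, a]
R = rng.uniform(size=(nS, nA))                     # reward r(s, a)
d0 = np.ones(nS) / nS

# Ground truth for reference: the discounted occupancy d_pi solves
# d_pi = (1 - gamma) * mu0 + gamma * Ppi^T d_pi.
Ppi = np.einsum('sax,xb->saxb', P, pi).reshape(nZ, nZ)
mu0 = (d0[:, None] * pi).ravel()
d_pi = np.linalg.solve(np.eye(nZ) - gamma * Ppi.T, (1 - gamma) * mu0)
rho_true = d_pi @ R.ravel()

# Off-policy data from an UNKNOWN behavior policy: the estimator below never
# reads `beh`, matching the black-box setting described in the abstract.
beh = rng.dirichlet(np.ones(nA), size=nS)
n = m = 5000
S = rng.integers(nS, size=n)
A = np.array([rng.choice(nA, p=beh[s]) for s in S])
S2 = np.array([rng.choice(nS, p=P[s, a]) for s, a in zip(S, A)])
A2 = np.array([rng.choice(nA, p=pi[s2]) for s2 in S2])   # a' ~ pi(.|s')
S0 = rng.choice(nS, size=m, p=d0)
A0 = np.array([rng.choice(nA, p=pi[s0]) for s0 in S0])   # a0 ~ pi(.|s0)

# Aggregate to empirical counts: C[z, z'] ~ Pr(z, z'); dD = data marginal.
C = np.zeros((nZ, nZ))
np.add.at(C, (S * nA + A, S2 * nA + A2), 1.0 / n)
dD = C.sum(axis=1)
p0 = np.bincount(S0 * nA + A0, minlength=nZ) / m

# Gaussian kernel between (s, a) pairs via random embeddings (an assumption).
E = rng.normal(size=(nZ, 4))
K = np.exp(-((E[:, None, :] - E[None, :, :]) ** 2).sum(-1) / 2.0)

# Fixed-point residual in kernel-feature coefficients: for a ratio table w,
# v(w) = gamma * C^T w - dD * w + (1 - gamma) * p0 collects the coefficients
# of E_D[w(z) (gamma*phi(z') - phi(z))] + (1 - gamma) * E_pi[phi(z0)], so the
# squared RKHS norm of the residual is L(w) = v^T K v.
J = gamma * C.T - np.diag(dD)
theta = np.zeros(nZ)                               # w = exp(theta) >= 0
lam, lr = 1.0, 0.1
for _ in range(30000):
    w = np.exp(theta)
    v = J @ w + (1 - gamma) * p0
    grad_w = 2.0 * (J.T @ (K @ v))                 # d(v^T K v) / dw
    grad_w += 2.0 * lam * (dD @ w - 1.0) * dD      # keep E_D[w] near 1
    theta -= lr * grad_w * w                       # chain rule through exp

w = np.exp(theta)
rho_hat = (dD * w) @ R.ravel() / (dD @ w)          # self-normalized estimate
print(f"true value {rho_true:.4f}, black-box estimate {rho_hat:.4f}")
```

Since this tabular loss is quadratic in w, it could be solved in closed form; gradient descent on theta is kept only to mirror how one would fit a nonlinear ratio model such as a neural network, and the two printed values should roughly agree. Note that `beh` is used solely to generate the log: the loss itself is built from observed transitions alone, which is what makes the estimator black-box.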