Authors: Assaf Hallak, Francois Schnitzler, Timothy Mann, Shie Mannor
DOI:
Keywords:
Abstract: Off-policy learning in dynamic decision problems is essential for providing strong evidence that a new policy is better than the one in use. But how can we prove superiority without testing the new policy? To answer this question, we introduce the G-SCOPE algorithm, which evaluates a new policy based on data generated by the existing policy. Our algorithm is both computationally and sample efficient because it greedily learns to exploit the factored structure of the dynamics of the environment. We present a finite sample analysis of our approach and show through experiments that it scales well on high-dimensional problems with few samples.
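The abstract's core idea can be illustrated with a minimal sketch: learn a factored transition model (one conditional distribution per state variable) from data logged by a behavior policy, then evaluate a different target policy by rolling out inside the learned model. Everything below (the toy dynamics, the binary state variables, the per-variable conditioning on `(s_i, a)`) is an illustrative assumption, not the actual G-SCOPE algorithm, which additionally learns the factored structure greedily.

```python
import random
from collections import defaultdict

rng = random.Random(0)
N_VARS, HORIZON = 2, 5  # two binary state variables, short episodes

def true_step(state, a):
    # Hidden ground-truth dynamics (hypothetical): each bit independently
    # becomes 1 with prob 0.9 if a == 1, else 0.2 -- a factored transition.
    p = 0.9 if a == 1 else 0.2
    return tuple(1 if rng.random() < p else 0 for _ in range(N_VARS))

def behavior_policy(state):
    return rng.randrange(2)  # logging policy: uniform random action

def target_policy(state):
    return 1                 # policy to evaluate, never actually executed

# 1) Collect transitions using only the behavior policy.
counts = defaultdict(lambda: [0, 0])  # (var, s_i, a) -> [n_total, n_next_is_one]
state = (0,) * N_VARS
for _ in range(5000):
    a = behavior_policy(state)
    nxt = true_step(state, a)
    for i in range(N_VARS):
        c = counts[(i, state[i], a)]
        c[0] += 1
        c[1] += nxt[i]
    state = nxt

# 2) Learn one Laplace-smoothed conditional per state variable.
def p_one(i, s_i, a):
    n, k = counts[(i, s_i, a)]
    return (k + 1) / (n + 2)

# 3) Evaluate the target policy by Monte Carlo rollouts in the learned model.
def estimate_value(n_rollouts=2000):
    total = 0.0
    for _ in range(n_rollouts):
        s = (0,) * N_VARS
        for _ in range(HORIZON):
            a = target_policy(s)
            s = tuple(1 if rng.random() < p_one(i, s[i], a) else 0
                      for i in range(N_VARS))
            total += sum(s)  # reward: number of bits set
    return total / n_rollouts

v_hat = estimate_value()
```

Under these toy dynamics the target policy's true value is about 9 (five steps, two bits, each on with probability 0.9), so `v_hat` should land near that without the target policy ever touching the real environment.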