Authors: Josiah P Hanna, Philip S Thomas, Peter Stone, Scott Niekum
DOI:
Keywords:
Abstract: We first derive an analytic expression for the gradient of the variance of an arbitrary, unbiased off-policy policy evaluation estimator, OPE(H, θ). Importance-sampling is one …