Behavior Policy Gradient Supplemental Material

Authors: Josiah P. Hanna, Philip S. Thomas, Peter Stone, Scott Niekum

Abstract: We first derive an analytic expression for the gradient of the variance of an arbitrary, unbiased off-policy policy evaluation estimator, OPE(H, θ). Importance sampling is one …
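The first step rests on a standard identity. Because the estimator is unbiased, its mean equals the true value of the evaluation policy and does not depend on θ, so the gradient of the variance equals the gradient of E[OPE(H, θ)^2]. The likelihood-ratio (score-function) trick then gives E[OPE(H, θ)^2 · Σ_t ∂/∂θ log π_θ(A_t|S_t) + 2 · OPE(H, θ) · ∂/∂θ OPE(H, θ)]. For ordinary importance sampling, ∂/∂θ IS(H, θ) = -IS(H, θ) · Σ_t ∂/∂θ log π_θ(A_t|S_t), so the expression collapses to E[-IS(H, θ)^2 · Σ_t ∂/∂θ log π_θ(A_t|S_t)]. The sketch below is our own illustration, not code from the paper. It assumes a one-step (bandit) problem with deterministic rewards and a softmax behavior policy, and all function names are hypothetical. It computes that gradient exactly and compares it with a finite-difference derivative of the exact variance.

```python
import math


def softmax(theta):
    """Softmax behavior policy pi_theta over a finite action set (assumed parameterization)."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def is_variance(theta, pi_e, r):
    """Exact variance of the one-step IS estimator pi_e(a)/pi_theta(a) * r(a), with A ~ pi_theta."""
    pi_b = softmax(theta)
    v = sum(p * x for p, x in zip(pi_e, r))  # true value of pi_e; IS is unbiased for it
    second_moment = sum((pe * x) ** 2 / pb for pe, x, pb in zip(pi_e, r, pi_b))
    return second_moment - v ** 2


def is_variance_grad(theta, pi_e, r):
    """Gradient of the variance via E_{A ~ pi_theta}[-IS(A)^2 * d/dtheta log pi_theta(A)]."""
    pi_b = softmax(theta)
    k = len(theta)
    grad = [0.0] * k
    for a in range(k):
        is_est = pi_e[a] / pi_b[a] * r[a]
        for i in range(k):
            # d/dtheta_i log softmax(theta)[a] = 1{i == a} - pi_b[i]
            score_i = (1.0 if i == a else 0.0) - pi_b[i]
            grad[i] += pi_b[a] * (-(is_est ** 2) * score_i)
    return grad


def finite_diff_grad(theta, pi_e, r, eps=1e-6):
    """Central-difference derivative of is_variance, used only as a numerical check."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        plus[i] += eps
        minus = list(theta)
        minus[i] -= eps
        grad.append((is_variance(plus, pi_e, r) - is_variance(minus, pi_e, r)) / (2 * eps))
    return grad
```

With, for example, `theta = [0.3, -0.2, 0.5]`, `pi_e = [0.7, 0.2, 0.1]` and `r = [1.0, 2.0, -1.0]`, the two gradients agree to within finite-difference error. A small step against the gradient lowers the variance, which is the kind of behavior policy adjustment the derived expression is meant to enable.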
