Efficient and safe off-policy evaluation: from point estimation to interval estimation

Author: Ziyang Tang

Abstract: Offline reinforcement learning trains a decision-making agent based solely on historical data, without any online interaction with the real-world environment. This data-driven approach is particularly relevant in high-stakes applications, such as medical treatment and robotics, where online interaction with the environment can be prohibitively expensive, dangerous, or ethically problematic. By leveraging large datasets of previously observed behavior, offline reinforcement learning allows practitioners to improve the performance of their decision-making algorithms without incurring the risks and costs of online experimentation. This has the potential to accelerate progress in fields where human lives and safety are at stake, and to enable the development of more sophisticated and reliable autonomous systems. This dissertation is concerned with the problem of off-policy evaluation (OPE), a central challenge in offline reinforcement learning. Specifically, OPE aims to estimate the expected return of a target policy based solely on historical data. Prior work on OPE has predominantly relied on either importance sampling (IS) based methods or value-based estimators. However, IS-based methods, especially trajectory-wise ones, suffer from high and even unbounded variance, particularly in infinite-horizon problems. Conversely, value-based estimators are susceptible to bias arising from model assumptions and the optimization process. These limitations motivate novel and improved OPE methods that can deliver more accurate and reliable estimates of policy …
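To make the variance issue mentioned in the abstract concrete, the sketch below (not taken from the dissertation; the function names, the `(state, action, reward)` trajectory format, and the assumption that behavior-policy probabilities are known are all illustrative) implements the standard trajectory-wise importance sampling estimator. Each logged trajectory's return is reweighted by the cumulative product of target-to-behavior action-probability ratios; because that product is taken over the whole horizon, it can grow or shrink exponentially with trajectory length, which is the source of the high (and, in the infinite-horizon limit, unbounded) variance the abstract refers to.

```python
import numpy as np

def trajectory_is_estimate(trajectories, pi_target, pi_behavior, gamma=0.99):
    """Trajectory-wise importance sampling OPE (illustrative sketch).

    trajectories : list of trajectories, each a list of (state, action, reward).
    pi_target(s, a), pi_behavior(s, a) : action probabilities under the target
        and behavior policies (assumed known or estimated beforehand).
    gamma : discount factor.
    """
    estimates = []
    for traj in trajectories:
        weight = 1.0  # cumulative importance ratio over the whole trajectory
        ret = 0.0     # discounted return of the trajectory
        for t, (s, a, r) in enumerate(traj):
            # multiply per-step ratios; this product is what blows up with horizon
            weight *= pi_target(s, a) / pi_behavior(s, a)
            ret += (gamma ** t) * r
        estimates.append(weight * ret)
    # average of reweighted returns is an unbiased estimate of the target
    # policy's expected return, but its variance grows with trajectory length
    return float(np.mean(estimates))
```

This is the baseline that the dissertation's marginalized/value-function-augmented approaches aim to improve upon: by avoiding the full product of per-step ratios, such methods trade the unbounded variance of the trajectory-wise estimator for the bias concerns of value-based estimation.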
