Authors: David Meger, Herke van Hoof, Scott Fujimoto
DOI:
Keywords:
Abstract: In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic. Our algorithm builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation. We draw the connection between target networks and overestimation bias, and suggest delaying policy updates to reduce per-update error and further improve performance. We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested.
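The abstract's central mechanism is taking the minimum over a pair of critics when forming the Bellman target. Below is a minimal, hedged sketch of that clipped-target computation in PyTorch; the `Critic` architecture, function names, and hyperparameters (e.g. `discount`, the hidden width of 256) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Hypothetical critic: a small MLP mapping (state, action) -> Q-value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))

def clipped_double_q_target(reward, not_done, next_state, next_action,
                            target_critic1, target_critic2, discount=0.99):
    """Bellman target using the element-wise minimum of two target critics
    to limit overestimation, per the abstract's description (assumed form)."""
    with torch.no_grad():
        q1 = target_critic1(next_state, next_action)
        q2 = target_critic2(next_state, next_action)
        return reward + not_done * discount * torch.min(q1, q2)

# Usage sketch with random data (shapes assumed: batch of 32, 4-dim state, 2-dim action).
if __name__ == "__main__":
    c1, c2 = Critic(4, 2), Critic(4, 2)
    target_q = clipped_double_q_target(
        reward=torch.randn(32, 1),
        not_done=torch.ones(32, 1),
        next_state=torch.randn(32, 4),
        next_action=torch.randn(32, 2),
        target_critic1=c1, target_critic2=c2,
    )
    print(target_q.shape)  # torch.Size([32, 1])
```

The delayed policy updates mentioned in the abstract would sit outside this function: the actor and target networks are refreshed only once every few critic updates, so the critic's per-update error has a chance to shrink before it influences the policy.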