Authors: Akimichi Takemura, Junya Honda
DOI:
Keywords:
Abstract: The multiarmed bandit problem is a typical example of the dilemma between exploration and exploitation in reinforcement learning. It is expressed as a model of a gambler playing a slot machine with multiple arms. We study the stochastic bandit problem where each arm has a reward distribution supported in a known bounded interval, e.g. [0, 1]. In this model, Auer et al. (2002) proposed practical policies called UCB and derived finite-time regret bounds for them. However, policies achieving the asymptotic bound given by Burnetas and Katehakis (1996) have been unknown for this model. We propose the Deterministic Minimum Empirical Divergence (DMED) policy and prove that it achieves the asymptotic bound. Furthermore, the index used by DMED for choosing an arm can be computed easily by a convex optimization technique. Although we do not derive a finite-time regret bound, we confirm by simulations that DMED performs close to the asymptotic bound in finite time.
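To illustrate the "convex optimization technique" the abstract mentions, below is a minimal sketch (not the authors' code) of computing a minimum empirical divergence index for a [0, 1]-supported arm: the smallest KL divergence from the empirical distribution to any distribution with mean at least mu, via a one-dimensional concave dual maximization. The function name `dmin`, the `eps` guard, and the use of `scipy.optimize.minimize_scalar` are assumptions made for this sketch, and the dual form used is the standard one for this bounded-support setting.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dmin(samples, mu, eps=1e-12):
    """Sketch of the minimum empirical divergence D_min(F_hat, mu):
    the smallest KL(F_hat || G) over distributions G on [0, 1] with
    mean E_G[X] >= mu, computed through the one-dimensional dual
        D_min = max_{0 <= nu <= 1/(1 - mu)} mean(log(1 - nu * (x - mu))),
    which is concave in nu and hence easy to maximize numerically.
    """
    x = np.asarray(samples, dtype=float)
    if x.mean() >= mu:
        # The mean constraint already holds, so the divergence is zero.
        return 0.0
    hi = 1.0 / (1.0 - mu) - eps  # keep log arguments strictly positive
    # Negate the concave dual objective so a scalar minimizer can be used.
    obj = lambda nu: -np.mean(np.log1p(-nu * (x - mu)))
    res = minimize_scalar(obj, bounds=(0.0, hi), method="bounded")
    return max(-res.fun, 0.0)
```

In a DMED-style loop one would, roughly, compare each suboptimal arm's pull count times this index against log(n) to decide whether the arm still needs exploration; the exact criterion in the paper may differ from this simplified description.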