An Asymptotically Optimal Bandit Algorithm for Bounded Support Models

Authors: Akimichi Takemura, Junya Honda

Abstract: The multiarmed bandit problem is a typical example of the dilemma between exploration and exploitation in reinforcement learning. It is expressed as a model of a gambler playing a slot machine with multiple arms. We study the stochastic bandit problem where each arm has a reward distribution supported in a known bounded interval, e.g. [0, 1]. In this model, Auer et al. (2002) proposed practical policies called UCB and derived finite-time regret bounds for these policies. However, policies achieving the asymptotic bound given by Burnetas and Katehakis (1996) have been unknown for this model. We propose the Deterministic Minimum Empirical Divergence (DMED) policy and prove that DMED achieves the asymptotic bound. Furthermore, the index used by DMED for choosing an arm can be computed easily by a convex optimization technique. Although we do not derive a finite-time regret bound, we confirm by simulations that the regret of DMED is close to the asymptotic bound in finite time.
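
As a sketch of the convex optimization step mentioned in the abstract: for reward distributions on [0, 1], the minimum empirical divergence admits a well-known dual form, D_min(F_hat, mu) = max over 0 <= nu <= 1/(1 - mu) of E_F_hat[log(1 - nu*(X - mu))], which is a one-dimensional concave maximization. The Python below is a minimal illustration under that assumption; the function name dmin and the ternary-search solver are illustrative choices, not the authors' implementation.

import numpy as np

def dmin(samples, mu, iters=100):
    # Minimum empirical divergence D_min(F_hat, mu) for rewards in [0, 1]:
    # the smallest KL divergence from the empirical distribution F_hat of
    # the samples to any distribution on [0, 1] with mean at least mu.
    # Computed via the dual form
    #   max over 0 <= nu < 1/(1 - mu) of E[log(1 - nu * (X - mu))],
    # a concave one-dimensional problem solved here by ternary search.
    x = np.asarray(samples, dtype=float)
    if mu >= 1.0:
        mu = 1.0 - 1e-9  # guard: the dual form needs mu < 1
    if x.mean() >= mu:
        return 0.0  # empirical mean already reaches mu: zero divergence
    f = lambda nu: np.mean(np.log1p(-nu * (x - mu)))
    lo, hi = 0.0, (1.0 - 1e-12) / (1.0 - mu)  # keep the log argument positive
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return f(0.5 * (lo + hi))

# Hypothetical usage: with rewards_i the observed rewards of arm i, pulled
# n_i times, and best_mean the largest empirical mean across arms, DMED
# keeps arm i on its list of candidate arms roughly while
#   n_i * dmin(rewards_i, best_mean) <= log(t).
rewards_i = np.array([0.2, 0.5, 0.1, 0.4])
print(dmin(rewards_i, mu=0.7))

Any one-dimensional concave solver (e.g. bisection on the derivative) would do in place of the ternary search; the point is only that the DMED index reduces to a cheap scalar optimization rather than a search over distributions.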
