Authors: Akimichi Takemura, Junya Honda
DOI:
Keywords:
Abstract: The multiarmed bandit problem is a typical example of the dilemma between exploration and exploitation in reinforcement learning. It is expressed as a model of a gambler playing a slot machine with multiple arms. We study the stochastic bandit problem where each arm has a reward distribution supported in a known bounded interval, e.g. [0, 1]. In this model, Auer et al. (2002) proposed practical policies called UCB and derived finite-time regret bounds for them. However, policies achieving the asymptotic bound given by Burnetas and Katehakis (1996) have been unknown for this model. We propose the Deterministic Minimum Empirical Divergence (DMED) policy and prove that it achieves the asymptotic bound. Furthermore, the index used by DMED for choosing an arm can be computed easily by a convex optimization technique. Although we do not derive a finite-time regret bound, we confirm by simulations that DMED performs close to the asymptotic bound in finite time.
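To illustrate the "convex optimization technique" the abstract mentions, below is a minimal sketch (not the authors' code) of computing a minimum empirical divergence index for a [0, 1]-supported arm: the smallest KL divergence from the empirical distribution to any distribution with mean at least mu, via a one-dimensional concave dual maximization. The function name `dmin`, the `eps` guard, and the use of `scipy.optimize.minimize_scalar` are assumptions made for this sketch, and the dual form used is the standard one for this bounded-support setting.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dmin(samples, mu, eps=1e-12):
    """Sketch of the minimum empirical divergence D_min(F_hat, mu):
    the smallest KL(F_hat || G) over distributions G on [0, 1] with
    mean E_G[X] >= mu, computed through the one-dimensional dual
        D_min = max_{0 <= nu <= 1/(1 - mu)} mean(log(1 - nu * (x - mu))),
    which is concave in nu and hence easy to maximize numerically.
    """
    x = np.asarray(samples, dtype=float)
    if x.mean() >= mu:
        # The mean constraint already holds, so the divergence is zero.
        return 0.0
    hi = 1.0 / (1.0 - mu) - eps  # keep log arguments strictly positive
    # Negate the concave dual objective so a scalar minimizer can be used.
    obj = lambda nu: -np.mean(np.log1p(-nu * (x - mu)))
    res = minimize_scalar(obj, bounds=(0.0, hi), method="bounded")
    return max(-res.fun, 0.0)
```

In a DMED-style loop one would, roughly, compare each suboptimal arm's pull count times this index against log(n) to decide whether the arm still needs exploration; the exact criterion in the paper may differ from this simplified description.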