An Asymptotically Optimal Policy for Uniform Bandits of Unknown Support

Authors: Michael N. Katehakis, Wesley Cowan

Abstract: Consider the problem of sampling sequentially from a finite number N ≥ 2 of populations, specified by random variables X^i_k, i = 1, ..., N and k = 1, 2, ..., where X^i_k denotes the outcome from population i the k-th time it is sampled. It is assumed that for each fixed i, {X^i_k}_{k ≥ 1} is a sequence of i.i.d. uniform random variables over some interval [a_i, b_i], with the support (i.e., a_i, b_i) unknown to the controller. The objective is to have a policy π for deciding from which of the N populations to sample at any time n = 1, 2, ..., so as to maximize the expected sum of outcomes of n samples, or equivalently to minimize the regret due to the lack of information about the parameters {a_i} and {b_i}. In this paper, we present a simple UCB-type policy that is asymptotically optimal. Additionally, finite-horizon regret bounds are given.
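To make the setting concrete, the sketch below simulates a generic UCB-style allocation over uniform arms. It is illustrative only and does not reproduce the paper's exact index: it assumes a midrange estimate of each arm's mean (for Uniform[a_i, b_i] the mean is (a_i + b_i)/2, and the sample minimum and maximum estimate the support endpoints) inflated by a hypothetical logarithmic exploration bonus; the function name `ucb_uniform` and the bonus form are made up for this example.

```python
import math
import random

def ucb_uniform(num_arms, horizon, supports, seed=0):
    """Sequentially sample from Uniform[a_i, b_i] arms using a UCB-style
    index built from each arm's observed sample minimum and maximum.

    `supports` is a list of (a_i, b_i) pairs; the policy never reads them
    directly -- they are used only to draw samples, mimicking the unknown
    support assumption.  Returns (total reward, per-arm sample counts).
    """
    rng = random.Random(seed)
    mins = [float("inf")] * num_arms
    maxs = [float("-inf")] * num_arms
    counts = [0] * num_arms
    total = 0.0

    def index(j, n):
        # Midrange estimate of the mean, plus an illustrative inflation
        # term that grows with the observed width and shrinks with the
        # number of times arm j has been sampled.
        width = maxs[j] - mins[j]
        bonus = (width + 1.0) * math.log(n) / counts[j]
        return (mins[j] + maxs[j]) / 2.0 + bonus

    for n in range(1, horizon + 1):
        if n <= num_arms:
            i = n - 1                      # sample each arm once to start
        else:
            i = max(range(num_arms), key=lambda j: index(j, n))
        a, b = supports[i]
        x = rng.uniform(a, b)              # draw from the true (hidden) support
        mins[i] = min(mins[i], x)
        maxs[i] = max(maxs[i], x)
        counts[i] += 1
        total += x
    return total, counts
```

Run on two arms with supports [0, 1] and [2, 3]: after the forced initial samples, the index of the better arm dominates, so almost all of the budget is allocated to it while the inferior arm is revisited only logarithmically often, which is the behavior an asymptotically optimal policy must exhibit.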

References (4)
Csaba Szepesvári, Rémi Munos, Lihong Li. On Minimax Optimal Offline Policy Evaluation. arXiv: Artificial Intelligence (2014).
Peter L. Bartlett, Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. Uncertainty in Artificial Intelligence, pp. 35-42 (2009).
Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, vol. 58, pp. 527-535 (1952). DOI: 10.1090/S0002-9904-1952-09620-8
T.L. Lai, Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, vol. 6, pp. 4-22 (1985). DOI: 10.1016/0196-8858(85)90002-8