Normal bandits of unknown means and variances

作者: Michael N. Katehakis , Wesley Cowan , Junya Honda

DOI: 10.5555/3122009.3242011

关键词: Random variableRegretPopulationCombinatoricsOutcome (probability)Open problemMathematicsSequenceAsymptotically optimal algorithmSample (statistics)

摘要: Consider the problem of sampling sequentially from a finite number N ≥ 2 populations, specified by random variables Xki, i = 1,...,N; and k 1,2,..., where Xki denotes outcome population kth time it is sampled. It assumed that for each fixed i, {Xki}k≥1 sequence i.i.d. normal variables, with unknown mean µi variance σi2. The objective to have policy π deciding which populations sample at any t 1,2, ... so as maximize expected sum outcomes n total samples or equivalently minimize regret due lack on information parameters In this paper, we present simple inflated (ISM) index asymptotically optimal in sense Theorem 4 below. This resolves standing open Burnetas Katehakis (1996b). Additionally, horizon bounds are given.

参考文章(4)
Csaba Szepesvári, Rémi Munos, Lihong Li, On Minimax Optimal Offline Policy Evaluation. arXiv: Artificial Intelligence. ,(2014)
Peter L. Bartlett, Ambuj Tewari, REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs uncertainty in artificial intelligence. pp. 35- 42 ,(2009)
Herbert Robbins, Some aspects of the sequential design of experiments Bulletin of the American Mathematical Society. ,vol. 58, pp. 527- 535 ,(1952) , 10.1090/S0002-9904-1952-09620-8
T.L Lai, Herbert Robbins, Asymptotically efficient adaptive allocation rules Advances in Applied Mathematics. ,vol. 6, pp. 4- 22 ,(1985) , 10.1016/0196-8858(85)90002-8