Asymptotically Optimal Sequential Experimentation Under Generalized Ranking

Authors: Michael N. Katehakis, Wesley Cowan


Abstract: We consider the classical problem of a controller activating (or sampling) sequentially from a finite number $N \geq 2$ of populations, specified by unknown distributions. Over some time horizon, at each $n = 1, 2, \ldots$, the controller wishes to select a population to sample, with the goal of sampling from the population that optimizes some "score" function of its distribution, e.g., maximizing the expected sum of outcomes or minimizing variability. We define a class of \textit{Uniformly Fast (UF)} sampling policies and show, under mild regularity conditions, that there is an asymptotic lower bound on the total number of sub-optimal activations. We then provide sufficient conditions under which a UCB policy is UF and asymptotically optimal, since it attains this bound. Explicit solutions are provided for a number of examples of interest, including general score functionals on unconstrained Pareto distributions (of potentially infinite mean) and uniform distributions of unknown support. Additional results on bandits of Normal distributions are also provided.
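To illustrate the type of index policy the abstract refers to, here is a minimal sketch of a UCB-style rule on Bernoulli populations. This is not the paper's policy: the paper treats inflated estimates of a general "score" functional, whereas this sketch uses the plain empirical mean as the score and the classic $\sqrt{2 \ln n / T_i}$ inflation term; the function name and parameters are illustrative assumptions.

```python
import math
import random

def ucb_run(means, horizon, seed=0):
    """Run a UCB1-style index policy on simulated Bernoulli populations.

    Illustrative sketch only: the score here is the empirical mean, and the
    exploration inflation is the classic sqrt(2 ln n / T_i) term. Returns
    the number of activations of each population over the horizon.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms   # T_i: activations of population i
    sums = [0.0] * n_arms   # running sum of outcomes from population i

    # Sample each population once so every index is defined.
    for i in range(n_arms):
        sums[i] += float(rng.random() < means[i])
        counts[i] += 1

    for n in range(n_arms, horizon):
        # Index = empirical score + exploration inflation.
        indices = [
            sums[i] / counts[i] + math.sqrt(2.0 * math.log(n + 1) / counts[i])
            for i in range(n_arms)
        ]
        i = max(range(n_arms), key=indices.__getitem__)
        sums[i] += float(rng.random() < means[i])
        counts[i] += 1
    return counts
```

Under such a policy, the count of sub-optimal activations grows only logarithmically in the horizon, which is the sense in which the abstract's lower bound can be attained.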

References (3)
Csaba Szepesvári, Rémi Munos, Lihong Li. On minimax optimal offline policy evaluation. arXiv: Artificial Intelligence (2014).
Peter L. Bartlett, Ambuj Tewari. REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs. Uncertainty in Artificial Intelligence, pp. 35–42 (2009).
Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, vol. 58, pp. 527–535 (1952). DOI: 10.1090/S0002-9904-1952-09620-8