Authors: Michael N. Katehakis, Wesley Cowan
DOI:
Keywords:
Abstract: We consider the classical problem of a controller activating (or sampling) sequentially from a finite number $N \geq 2$ of populations, specified by unknown distributions. Over some time horizon, at each time $n = 1, 2, \ldots$, the controller wishes to select a population to sample, with the goal of sampling from a population that optimizes some "score" function of its distribution, e.g., maximizing the expected sum of outcomes or minimizing variability. We define a class of \textit{Uniformly Fast (UF)} policies and show, under mild regularity conditions, that there is an asymptotic lower bound for the expected total number of sub-optimal activations. Then, we provide sufficient conditions under which a UCB policy is UF and asymptotically optimal, since it attains this lower bound. Explicit solutions are provided for examples of interest, including general score functionals on unconstrained Pareto distributions (of potentially infinite mean), and uniform distributions of unknown support. Additional results on bandits of Normal distributions are also provided.