Authors: Michael N. Katehakis, Wesley Cowan
DOI:
Keywords:
Abstract: Consider the problem of sampling sequentially from a finite number $N \geq 2$ of populations, specified by random variables $X^i_k$, $i = 1, \ldots, N$, and $k = 1, 2, \ldots$; where $X^i_k$ denotes the outcome from population $i$ the $k$-th time it is sampled. It is assumed that for each fixed $i$, $\{X^i_k\}_{k \geq 1}$ is a sequence of i.i.d. uniform random variables over some interval $[a_i, b_i]$, with the support (i.e., $a_i$, $b_i$) unknown to the controller. The objective is to have a policy $\pi$ for deciding from which of the $N$ populations to sample at any time $n = 1, 2, \ldots$ so as to maximize the expected sum of outcomes of $n$ samples or, equivalently, to minimize the regret due to lack of information on the parameters $\{a_i\}$ and $\{b_i\}$. In this paper, we present a simple UCB-type policy that is asymptotically optimal. Additionally, finite horizon regret bounds are given.