Authors: Jean-Yves Audibert, Sébastien Bubeck, Rémi Munos
DOI:
Keywords: Unobservable, Mathematical optimization, Task (project management), Mathematics, Order (exchange), Logarithm, Identification (information), Regret, Scaling
Abstract: We consider the problem of finding the best arm in a stochastic multi-armed bandit game. The regret of a forecaster is here defined by the gap between the mean reward of the optimal arm and the mean reward of the ultimately chosen arm. We propose a highly exploring UCB policy and a new algorithm based on successive rejects. We show that these algorithms are essentially optimal since their regret decreases exponentially at a rate which is, up to a logarithmic factor, the best possible. However, while the UCB policy needs the tuning of a parameter depending on the unobservable hardness of the task, the successive rejects policy benefits from being parameter-free, and is also independent of the scaling of the rewards. As a by-product of our analysis, we show that identifying the best arm (when it is unique) requires a number of samples of order (up to a log(K) factor) Σ_i 1/Δ_i², where the sum is over the suboptimal arms and Δ_i represents the difference between the mean reward of the best arm and that of arm i. This generalizes the well-known fact that one needs of order 1/Δ² samples to differentiate the means of two distributions with gap Δ.
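The successive-rejects strategy described in the abstract splits the pull budget into K−1 phases, samples every surviving arm equally within a phase, and discards the arm with the lowest empirical mean at the end of each phase. The sketch below illustrates this, assuming arms are given as zero-argument reward functions; the phase lengths follow the paper's schedule n_k = ⌈(n−K)/(loḡ(K)·(K+1−k))⌉ with loḡ(K) = 1/2 + Σ_{i=2}^{K} 1/i, but details such as tie-breaking are left unspecified here.

```python
import math
import random

def successive_rejects(arms, budget):
    """Sketch of the Successive Rejects strategy (Audibert, Bubeck & Munos).

    `arms` is a list of zero-argument callables, each returning one
    stochastic reward; `budget` is the total number of pulls n.
    Returns the index of the arm that survives all eliminations.
    """
    K = len(arms)
    # Normalizer from the paper: log_bar(K) = 1/2 + sum_{i=2}^K 1/i.
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    active = list(range(K))
    pulls = [0] * K
    sums = [0.0] * K
    prev_n = 0
    for k in range(1, K):  # K - 1 elimination phases
        # n_k: cumulative pulls per surviving arm by the end of phase k.
        n_k = math.ceil((budget - K) / (log_bar * (K + 1 - k)))
        for i in active:
            for _ in range(n_k - prev_n):
                sums[i] += arms[i]()
                pulls[i] += 1
        prev_n = n_k
        # Reject the surviving arm with the lowest empirical mean.
        worst = min(active, key=lambda i: sums[i] / pulls[i])
        active.remove(worst)
    return active[0]
```

Note that the strategy uses no confidence-interval parameter and compares only empirical means, which is why it is parameter-free and invariant to rescaling the rewards, in contrast to the UCB-style policy.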