Best Arm Identification in Multi-Armed Bandits

Authors: Jean-Yves Audibert, Sébastien Bubeck, Rémi Munos

DOI:

Keywords: Unobservable, Mathematical optimization, Task (project management), Mathematics, Order (exchange), Logarithm, Identification (information), Regret, Scaling

Abstract: We consider the problem of finding the best arm in a stochastic multi-armed bandit game. The regret of a forecaster is here defined by the gap between the mean reward of the optimal arm and the mean reward of the ultimately chosen arm. We propose a highly exploring UCB policy and a new algorithm based on successive rejects. We show that these algorithms are essentially optimal since their regret decreases exponentially at a rate which is, up to a logarithmic factor, the best possible. However, while the UCB policy needs the tuning of a parameter depending on the unobservable hardness of the task, the successive rejects policy benefits from being parameter-free, and also independent of the scaling of the rewards. As a by-product of our analysis, we show that identifying the best arm (when it is unique) requires a number of samples of order (up to a log(K) factor) Σ_i 1/Δ_i², where the sum is over the suboptimal arms and Δ_i represents the difference between the mean reward of the best arm and the mean reward of arm i. This generalizes the well-known fact that one needs of order 1/Δ² samples to differentiate the means of two distributions with gap Δ.
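The successive rejects strategy mentioned in the abstract is simple enough to sketch. Below is a minimal Python sketch, not the authors' code: it runs K - 1 elimination phases and drops the empirically worst surviving arm after each phase. The phase-length schedule (via log_bar(K)) follows the definition given in the paper, stated here as an assumption since the abstract does not spell it out; the pull function, the Bernoulli arm means, and the budget in the usage example are illustrative.

import math
import random

def successive_rejects(pull, K, budget):
    # Sketch of the Successive Rejects strategy from the abstract.
    # pull(i) returns one stochastic reward from arm i (0-indexed);
    # budget is the total number of pulls n. Returns the recommended arm.
    # Assumed phase-length schedule, as reported in the paper:
    #   log_bar(K) = 1/2 + sum_{i=2}^{K} 1/i
    #   n_k = ceil((budget - K) / (log_bar(K) * (K + 1 - k)))
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    surviving = list(range(K))
    sums = [0.0] * K    # cumulative reward per arm
    counts = [0] * K    # number of pulls per arm
    n_prev = 0
    for k in range(1, K):                      # K - 1 elimination phases
        n_k = math.ceil((budget - K) / (log_bar * (K + 1 - k)))
        for arm in surviving:                  # pull each surviving arm equally
            for _ in range(n_k - n_prev):
                sums[arm] += pull(arm)
                counts[arm] += 1
        # Dismiss the surviving arm with the lowest empirical mean.
        worst = min(surviving, key=lambda a: sums[a] / counts[a])
        surviving.remove(worst)
        n_prev = n_k
    return surviving[0]                        # the single arm left is recommended

# Hypothetical usage: four Bernoulli arms with means 0.5, 0.45, 0.4, 0.3.
means = [0.5, 0.45, 0.4, 0.3]
best = successive_rejects(lambda i: float(random.random() < means[i]),
                          K=len(means), budget=2000)
print("recommended arm:", best)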

References (9)
Carlos Domingo, Ricard Gavaldà, Osamu Watanabe, Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms. Data Mining and Knowledge Discovery, vol. 6, pp. 131-152 (2002). DOI: 10.1023/A:1014091514039
Nicolò Cesa-Bianchi, Gábor Lugosi, Prediction, Learning, and Games (2006)
Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Pure Exploration in Multi-Armed Bandits Problems. Algorithmic Learning Theory, pp. 23-37 (2009). DOI: 10.1007/978-3-642-04414-4_7
Herbert Robbins, Some Aspects of the Sequential Design of Experiments. Bulletin of the American Mathematical Society, vol. 58, pp. 527-535 (1952). DOI: 10.1090/S0002-9904-1952-09620-8
Volodymyr Mnih, Csaba Szepesvári, Jean-Yves Audibert, Empirical Bernstein Stopping. Proceedings of the 25th International Conference on Machine Learning (ICML '08), pp. 672-679 (2008). DOI: 10.1145/1390156.1390241
U.M. Feyyad, Data Mining and Knowledge Discovery: Making Sense Out of Data. IEEE Intelligent Systems, vol. 11, pp. 20-25 (1996). DOI: 10.1109/64.539013
Eyal Even-Dar, Shie Mannor, Yishay Mansour, Sridhar Mahadevan, Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. Journal of Machine Learning Research, vol. 7, pp. 1079-1105 (2006)
Andrew W. Moore, Oded Maron, Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation. Neural Information Processing Systems, vol. 6, pp. 59-66 (1993)
Peter Auer, Nicolò Cesa-Bianchi, Paul Fischer, Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, vol. 47, pp. 235-256 (2002). DOI: 10.1023/A:1013689704352