Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems

Authors: Nicolò Cesa-Bianchi, Sébastien Bubeck

Abstract: A multi-armed bandit problem - or, simply, a bandit problem - is a sequential allocation problem defined by a set of actions. At each time step, a unit resource is allocated to an action and some observable payoff is obtained. The goal is to maximize the total payoff obtained in a sequence of allocations. The name refers to the colloquial term for a slot machine (a "one-armed bandit" in American slang). In a casino, a player facing many slot machines at once (a "multi-armed bandit") must repeatedly choose where to insert the next coin. Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration-exploitation trade-off. This is the balance between staying with the option that gave the highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the 1930s, exploration-exploitation trade-offs arise in several modern applications, such as ad placement, website optimization, and packet routing. Mathematically, a multi-armed bandit is defined by the payoff process associated with each option. This book focuses on two extreme cases in which the analysis of regret is particularly simple and elegant: independent and identically distributed payoffs and adversarial payoffs. Besides the basic setting of finitely many actions, it also analyzes some of the most important variants and extensions, such as the contextual bandit model. The monograph is an ideal reference for students and researchers with an interest in bandit problems.
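To make the setting concrete, here is a minimal, illustrative sketch of the stochastic (i.i.d.) case the abstract describes: a UCB-style index policy playing Bernoulli arms, with the pseudo-regret measured against the best arm in hindsight. The arm means, horizon, and exploration constant below are assumptions chosen for illustration, not values taken from the monograph.

import math
import random

# Illustrative sketch of the stochastic bandit protocol with a
# UCB-style index policy. Arm means and horizon are assumed values.

def ucb1(means, horizon, seed=0):
    """Play `horizon` rounds of UCB1 against Bernoulli arms with the
    given means; return the realized pseudo-regret."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # number of times each arm was pulled
    totals = [0.0] * k        # summed payoffs per arm

    # Pull each arm once to initialize its empirical estimate.
    for a in range(k):
        counts[a] = 1
        totals[a] = float(rng.random() < means[a])

    for t in range(k, horizon):
        # Index = empirical mean + confidence width (exploration bonus).
        def index(a):
            return totals[a] / counts[a] + math.sqrt(2 * math.log(t + 1) / counts[a])
        a = max(range(k), key=index)
        counts[a] += 1
        totals[a] += float(rng.random() < means[a])

    best = max(means)
    # Pseudo-regret: expected shortfall vs. always playing the best arm.
    return sum((best - means[a]) * counts[a] for a in range(k))

print(ucb1(means=[0.9, 0.6, 0.5], horizon=10000))

For i.i.d. payoffs, index policies of this kind achieve pseudo-regret that grows only logarithmically with the horizon; the adversarial case requires randomized strategies and is analyzed separately in the monograph.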
