Thompson Sampling for Contextual Bandits with Linear Payoffs

Authors: Shipra Agrawal, Navin Goyal

DOI:

Keywords: Randomized algorithm, Thompson sampling, Mathematics, Bayesian probability, Discrete mathematics, Algorithm, Heuristics, Stochastic game, Upper and lower bounds, Open problem, Generalization

Abstract: Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of Thompson Sampling for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandits problem. We prove a high-probability regret bound of O((d²/ε)·√(T^(1+ε))) in time T for any 0 < ε < 1, where d is the dimension of each context vector and ε is a parameter used by the algorithm. Our results provide the first theoretical guarantees for the contextual version of Thompson Sampling, are close to the lower bound of Ω(d√T) for this problem, and essentially solve a COLT open problem of Chapelle and Li [COLT 2012].
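The scheme the abstract describes maintains a Bayesian posterior over the unknown linear parameter and, at each round, plays the arm whose context maximizes the payoff under a parameter sampled from that posterior. Below is a minimal sketch of this idea under a Gaussian prior and likelihood; the function names (linear_thompson_sampling, contexts_fn, reward_fn), the prior-scale parameter v, and the synthetic test environment are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def linear_thompson_sampling(contexts_fn, reward_fn, d, T, v=1.0, seed=0):
    """Sketch of Thompson Sampling with linear payoffs.

    A Gaussian posterior N(mu_hat, v^2 * B^{-1}) is maintained over the
    unknown parameter; each round, an arm is chosen by maximizing the
    sampled parameter against the adversary-provided context vectors.
    """
    rng = np.random.default_rng(seed)
    B = np.eye(d)           # regularized design matrix
    f = np.zeros(d)         # running sum of reward-weighted contexts
    for t in range(T):
        contexts = contexts_fn(t)                        # shape (num_arms, d)
        mu_hat = np.linalg.solve(B, f)                   # posterior mean
        cov = v ** 2 * np.linalg.inv(B)                  # posterior covariance
        mu_tilde = rng.multivariate_normal(mu_hat, cov)  # posterior sample
        arm = int(np.argmax(contexts @ mu_tilde))        # play the best arm under the sample
        r = reward_fn(t, arm)                            # observe reward for the chosen arm
        b = contexts[arm]
        B += np.outer(b, b)                              # rank-one posterior update
        f += r * b
    return np.linalg.solve(B, f)                         # final parameter estimate

if __name__ == "__main__":
    # Hypothetical synthetic environment: linear rewards with Gaussian noise.
    d, num_arms, T = 5, 10, 2000
    true_mu = np.random.default_rng(1).normal(size=d) / np.sqrt(d)
    env_rng = np.random.default_rng(2)
    ctxs = [env_rng.normal(size=(num_arms, d)) for _ in range(T)]
    est = linear_thompson_sampling(
        contexts_fn=lambda t: ctxs[t],
        reward_fn=lambda t, a: ctxs[t][a] @ true_mu + env_rng.normal(scale=0.1),
        d=d, T=T)
    print("parameter estimation error:", np.linalg.norm(est - true_mu))
```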

References (30)
Sham M. Kakade, Thomas P. Hayes, Varsha Dani, Stochastic Linear Optimization Under Bandit Feedback. Conference on Learning Theory, pp. 355-366 (2008)
Robert E. Schapire, Wei Chu, Lihong Li, Lev Reyzin, Contextual Bandits with Linear Payoff Functions. International Conference on Artificial Intelligence and Statistics, vol. 15, pp. 208-214 (2011)
Malcolm J. A. Strens, A Bayesian Framework for Reinforcement Learning. International Conference on Machine Learning, pp. 943-950 (2000)
Jeremy Wyatt, Exploration and Inference in Learning from Reinforcement. University of Edinburgh, College of Science and Engineering, School of Informatics (1998)
Nathan Korda, David S. Leslie, Anthony Lee, Benedict C. May, Optimistic Bayesian Sampling in Contextual-Bandit Problems. Journal of Machine Learning Research, vol. 13, pp. 2069-2106 (2012), 10.5555/2188385.2343711
Shipra Agrawal, Navin Goyal, Analysis of Thompson Sampling for the Multi-armed Bandit Problem. Conference on Learning Theory (2012)
Leslie Pack Kaelbling, Associative Reinforcement Learning: Functions in k-DNF. Machine Learning, vol. 15, pp. 279-298 (1994), 10.1023/A:1022689909846
Jyotirmoy Sarkar, One-Armed Bandit Problems with Covariates. Annals of Statistics, vol. 19, pp. 1978-2002 (1991), 10.1214/AOS/1176348382
T. L. Lai, Herbert Robbins, Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, vol. 6, pp. 4-22 (1985), 10.1016/0196-8858(85)90002-8
Ole-Christoffer Granmo, Solving Two-Armed Bernoulli Bandit Problems Using a Bayesian Learning Automaton. International Journal of Intelligent Computing and Cybernetics, vol. 3, pp. 207-234 (2010), 10.1108/17563781011049179