Thompson Sampling for Contextual Bandits with Linear Payoffs

Authors: Shipra Agrawal, Navin Goyal

DOI:

Keywords: Randomized algorithm, Thompson sampling, Mathematics, Bayesian probability, Discrete mathematics, Algorithm, Heuristics, Stochastic game, Upper and lower bounds, Open problem, Generalization

Abstract: Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of Thompson Sampling for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandits problem. We prove a high-probability regret bound of O((d²/ε)·√(T^(1+ε))) in time T for any 0 < ε < 1, where d is the dimension of each context vector and ε is a parameter used by the algorithm. Our results provide the first theoretical guarantees for the contextual version of Thompson Sampling, are close to the lower bound of Ω(d√T) for this problem, and essentially solve a COLT open problem of Chapelle and Li [COLT 2012].
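The scheme the abstract describes maintains a Bayesian posterior over the unknown linear parameter and, at each round, plays the arm whose context maximizes the payoff under a parameter sampled from that posterior. Below is a minimal sketch of this idea under a Gaussian prior and likelihood; the function names (linear_thompson_sampling, contexts_fn, reward_fn), the prior-scale parameter v, and the synthetic test environment are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def linear_thompson_sampling(contexts_fn, reward_fn, d, T, v=1.0, seed=0):
    """Sketch of Thompson Sampling with linear payoffs.

    A Gaussian posterior N(mu_hat, v^2 * B^{-1}) is maintained over the
    unknown parameter; each round, an arm is chosen by maximizing the
    sampled parameter against the adversary-provided context vectors.
    """
    rng = np.random.default_rng(seed)
    B = np.eye(d)           # regularized design matrix
    f = np.zeros(d)         # running sum of reward-weighted contexts
    for t in range(T):
        contexts = contexts_fn(t)                        # shape (num_arms, d)
        mu_hat = np.linalg.solve(B, f)                   # posterior mean
        cov = v ** 2 * np.linalg.inv(B)                  # posterior covariance
        mu_tilde = rng.multivariate_normal(mu_hat, cov)  # posterior sample
        arm = int(np.argmax(contexts @ mu_tilde))        # play the best arm under the sample
        r = reward_fn(t, arm)                            # observe reward for the chosen arm
        b = contexts[arm]
        B += np.outer(b, b)                              # rank-one posterior update
        f += r * b
    return np.linalg.solve(B, f)                         # final parameter estimate

if __name__ == "__main__":
    # Hypothetical synthetic environment: linear rewards with Gaussian noise.
    d, num_arms, T = 5, 10, 2000
    true_mu = np.random.default_rng(1).normal(size=d) / np.sqrt(d)
    env_rng = np.random.default_rng(2)
    ctxs = [env_rng.normal(size=(num_arms, d)) for _ in range(T)]
    est = linear_thompson_sampling(
        contexts_fn=lambda t: ctxs[t],
        reward_fn=lambda t, a: ctxs[t][a] @ true_mu + env_rng.normal(scale=0.1),
        d=d, T=T)
    print("parameter estimation error:", np.linalg.norm(est - true_mu))
```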

References (30)
Sham M. Kakade, Thomas P. Hayes, Varsha Dani, Stochastic Linear Optimization Under Bandit Feedback. Conference on Learning Theory, pp. 355-366 (2008)
Robert E. Schapire, Wei Chu, Lihong Li, Lev Reyzin, Contextual Bandits with Linear Payoff Functions. International Conference on Artificial Intelligence and Statistics, vol. 15, pp. 208-214 (2011)
Malcolm J. A. Strens, A Bayesian Framework for Reinforcement Learning. International Conference on Machine Learning, pp. 943-950 (2000)
Jeremy Wyatt, Exploration and Inference in Learning from Reinforcement. University of Edinburgh, College of Science and Engineering, School of Informatics (1998)
Nathan Korda, David S. Leslie, Anthony Lee, Benedict C. May, Optimistic Bayesian Sampling in Contextual-Bandit Problems. Journal of Machine Learning Research, vol. 13, pp. 2069-2106 (2012), 10.5555/2188385.2343711
Shipra Agrawal, Navin Goyal, Analysis of Thompson Sampling for the Multi-armed Bandit Problem. Conference on Learning Theory (2012)
Leslie Pack Kaelbling, Associative Reinforcement Learning: Functions in k-DNF. Machine Learning, vol. 15, pp. 279-298 (1994), 10.1023/A:1022689909846
Jyotirmoy Sarkar, One-Armed Bandit Problems with Covariates. Annals of Statistics, vol. 19, pp. 1978-2002 (1991), 10.1214/AOS/1176348382
T. L. Lai, Herbert Robbins, Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, vol. 6, pp. 4-22 (1985), 10.1016/0196-8858(85)90002-8
Ole-Christoffer Granmo, Solving Two-Armed Bernoulli Bandit Problems Using a Bayesian Learning Automaton. International Journal of Intelligent Computing and Cybernetics, vol. 3, pp. 207-234 (2010), 10.1108/17563781011049179