Authors: Shipra Agrawal, Navin Goyal
DOI:
Keywords: Randomized algorithm, Thompson sampling, Mathematics, Bayesian probability, Discrete mathematics, Algorithm, Heuristics, Stochastic game, Upper and lower bounds, Open problem, Generalization
Abstract: Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of Thompson Sampling for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandits problem. We prove a high-probability regret bound of Õ((d^2/ε)√(T^(1+ε))) in time T for any 0 < ε < 1, where d is the dimension of each context vector and ε is a parameter used by the algorithm. Our results provide the first theoretical guarantees for the contextual version of Thompson Sampling, are close to the lower bound of Ω(d√T) for this problem, and essentially solve the COLT open problem of Chapelle and Li [COLT 2012].
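To make the setting concrete, Thompson Sampling for linear payoffs keeps a Gaussian posterior over an unknown parameter vector, samples from it each round, and plays the arm whose context maximizes the sampled payoff. The sketch below is a minimal illustration of that loop, not the paper's exact algorithm or analysis: the instance sizes, the noise level, the random contexts (the paper allows an adaptive adversary), and the posterior-scaling parameter `v` are all assumed for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: d-dimensional linear payoffs, K arms per round.
d, K, T = 5, 10, 2000
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)  # unknown parameter, unit norm

v = 0.5            # posterior-scaling parameter (assumed value, not from the paper)
B = np.eye(d)      # regularized design matrix: I + sum of x x^T over played arms
f = np.zeros(d)    # running sum of reward * context

regret = 0.0
for t in range(T):
    # The paper allows an adaptive adversary to pick contexts; here they are random.
    contexts = rng.normal(size=(K, d))
    mu_hat = np.linalg.solve(B, f)                     # regularized least-squares estimate
    cov = v ** 2 * np.linalg.inv(B)                    # posterior covariance
    mu_tilde = rng.multivariate_normal(mu_hat, cov)    # sample from the posterior
    a = int(np.argmax(contexts @ mu_tilde))            # play arm with best sampled payoff
    x = contexts[a]
    r = x @ theta_star + 0.1 * rng.normal()            # noisy linear reward
    B += np.outer(x, x)                                # update sufficient statistics
    f += r * x
    regret += np.max(contexts @ theta_star) - x @ theta_star

print(f"cumulative regret over T={T}: {regret:.1f}")
```

On instances like this the cumulative regret grows much more slowly than T, consistent with the √T-type bounds the paper proves for the contextual version.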