Online convex optimization in the bandit setting: gradient descent without a gradient

作者： H. Brendan McMahan , Adam Tauman Kalai , Abraham D. Flaxman

关键词:

摘要: We study a general online convex optimization problem. have set S and an unknown sequence of cost functions c1, c2,..., in each period, we choose feasible point xt S, learn the ct(xt). If function ct is also revealed after period then, as Zinkevich shows [25], gradient descent can be used on these to get regret bounds O(√n). That is, n rounds, total incurred will O(√n) more than best single decision chosen with benefit hindsight, minx Σ ct(x).We extend this "bandit" setting, where, only ct(xt) revealed, bound expected O(n3/4).Our approach uses simple approximation that computed from evaluating at (random) point. show biased estimate sufficient approximate functions. In other words, it possible use without seeing anything value The guarantees hold even most case: against adaptive adversary.For linear problem [15], algorithms low regrets bandit setting recently been given oblivious [1] adversaries [19]. contrast algorithms, which distinguish between explicit explore exploit periods, our algorithm interpreted doing small amount exploration period.

arxiv-vanity.com 本地加速

microsoft.com PDF 下载加速

参考文章(22)

James C. Spall, Introduction to Stochastic Search and Optimization ,(2003)

V. D. Milman, A. Pajor, Isotropic position and inertia ellipsoids and zonoids of the unit ball of a normed n-dimensional space Lecture Notes in Mathematics. pp. 64- 104 ,(1989) , 10.1007/BFB0090049

O. N. Granichin, Randomized Algorithms for Stochastic Approximation under Arbitrary Disturbances Automation and Remote Control. ,vol. 63, pp. 209- 219 ,(2002) , 10.1023/A:1014291407082

Eiji Takimoto, Manfred K. Warmuth, Path Kernels and Multiplicative Updates conference on learning theory. pp. 74- 89 ,(2002) , 10.1007/3-540-45435-7_6

John R. Birge, Franois Louveaux, Introduction to Stochastic Programming ,(2011)

H. Brendan McMahan, Avrim Blum, Online Geometric Optimization in the Bandit Setting Against an Adaptive Adversary Learning Theory. pp. 109- 123 ,(2004) , 10.1007/978-3-540-27819-1_8

John N. Tsitsiklis, Dimitri P. Bertsekas, Neuro-dynamic programming ,(1996)

David P. Helmbold, Robert E. Schapire, Yoram Singer, Manfred K. Warmuth, ON-LINE PORTFOLIO SELECTION USING MULTIPLICATIVE UPDATES Mathematical Finance. ,vol. 8, pp. 325- 347 ,(1998) , 10.1111/1467-9965.00058

Dimitris Bertsimas, Santosh Vempala, Solving convex programs by random walks symposium on the theory of computing. pp. 109- 115 ,(2002) , 10.1145/509907.509926

10.

James C. Spall, A one-measurement form of simultaneous perturbation stochastic approximation Automatica. ,vol. 33, pp. 109- 112 ,(1997) , 10.1016/S0005-1098(96)00149-5

Online convex optimization in the bandit setting: gradient descent without a gradient

来源期刊

我的账户

Online convex optimization in the bandit setting: gradient descent without a gradient

来源期刊

相似文章 10

我的账户