Authors: H. Brendan McMahan, Adam Tauman Kalai, Abraham D. Flaxman
Keywords:
Abstract: We study a general online convex optimization problem. We have a convex set S and an unknown sequence of cost functions c_1, c_2, ..., and in each period, we choose a feasible point x_t in S and learn the cost c_t(x_t). If the function c_t is also revealed after each period then, as Zinkevich shows [25], gradient descent can be used on these functions to get regret bounds of O(√n). That is, after n rounds, the total cost incurred will be O(√n) more than the cost of the best single feasible decision chosen with the benefit of hindsight, min_x Σ_t c_t(x).

We extend this to the "bandit" setting, where, in each period, only the cost c_t(x_t) is revealed, and we bound the expected regret by O(n^(3/4)).

Our approach uses a simple approximation of the gradient that is computed from evaluating the function at a single (random) point. We show that this biased estimate is sufficient to approximate gradient descent on the sequence of functions. In other words, it is possible to use gradient descent without seeing anything more than the value of the functions at a single point. The guarantees hold even in the most general case: against an adaptive adversary.

For the online linear optimization problem [15], algorithms with low regret in the bandit setting have recently been given against oblivious [1] and adaptive [19] adversaries. In contrast to these algorithms, which distinguish between explicit explore and exploit periods, our algorithm can be interpreted as doing a small amount of exploration in each period.
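The single-point gradient approximation described in the abstract can be sketched as follows. This is a minimal illustration under assumptions of our own, not the authors' exact algorithm: the Euclidean-ball feasible set, the step size eta, the perturbation radius delta, and the cost_oracle callback are all hypothetical names introduced for the example.

```python
import numpy as np

def bandit_gradient_descent(cost_oracle, dim, n_rounds,
                            radius=1.0, eta=0.01, delta=0.1, rng=None):
    """Sketch of online gradient descent with a one-point gradient estimate.

    Each round, play y_t = x_t + delta * u_t for a uniformly random unit
    vector u_t, observe only the scalar cost c_t(y_t), and use
    g_t = (dim / delta) * c_t(y_t) * u_t as a (biased) gradient estimate.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(dim)                 # start at the center of the ball
    total_cost = 0.0
    for t in range(n_rounds):
        u = rng.normal(size=dim)
        u /= np.linalg.norm(u)        # uniform random direction on the sphere
        y = x + delta * u             # the point actually played
        cost = cost_oracle(t, y)      # only this scalar value is revealed
        total_cost += cost
        g = (dim / delta) * cost * u  # one-point gradient estimate
        x = x - eta * g               # gradient step
        shrink = radius - delta       # keep x + delta*u inside the ball
        norm = np.linalg.norm(x)
        if norm > shrink:             # project back onto the shrunken ball
            x *= shrink / norm
    return total_cost
```

In expectation, (dim/delta) * c_t(x_t + delta*u_t) * u_t is the gradient of a smoothed version of c_t, which is why a single function evaluation per round can stand in for a true gradient; the perturbation radius delta trades the bias of this estimate against the extra cost of exploring, and projecting onto a slightly shrunken feasible set keeps the perturbed plays feasible.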