作者: Robert E. Schapire , Alina Beygelzimer , Lihong Li , Lev Reyzin , John Langford
DOI:
关键词:
摘要: We address the problem of competing with any large set N policies in nonstochastic bandit setting, where learner must repeatedly select among K actions but observes only reward chosen action.