Authors: Olivier Teytaud, Nicolas Baskiotis, Sylvain Gelly, Cédric Hartland, Michèle Sebag
DOI:
Keywords:
Abstract: Motivated by real-time website optimization, this paper is about online learning in abruptly changing environments. Two extensions of the UCBT algorithm are combined in order to handle dynamic multi-armed bandits, and specifically to cope with fast variations in the rewards. Firstly, a change point detection test based on Page-Hinkley (PH) statistics is used to overcome the limitations due to the inertia of UCBT. Secondly, a controlled forgetting strategy dubbed Meta-Bandit is proposed to take care of the Exploration vs Exploitation trade-off when the PH test is triggered. Extensive empirical validation shows significant improvements compared to the baseline algorithms. The paper also investigates the sensitivity of the approach with respect to the number of available options.
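To make the change point detection step concrete, here is a minimal sketch of a two-sided Page-Hinkley test on a stream of rewards, assuming the textbook formulation: a running mean, a tolerance `delta` on the magnitude of change, and an alarm threshold `lam`. The class name `PageHinkley`, the parameter names, and the default values are illustrative assumptions, not the settings used in the paper. The sketch covers only detection; what happens after an alarm (in the paper, the Meta-Bandit strategy) is not modelled.

```python
class PageHinkley:
    """Two-sided Page-Hinkley change detector (illustrative sketch).

    Assumption: textbook PH test with tolerance `delta` and threshold `lam`;
    the values below are hypothetical defaults, not taken from the paper.
    """

    def __init__(self, delta: float = 0.005, lam: float = 5.0) -> None:
        self.delta = delta  # tolerated magnitude of change in the mean
        self.lam = lam      # alarm threshold on the cumulative deviation
        self.reset()

    def reset(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m_up = 0.0      # cumulative deviation, used to detect an increase
        self.min_up = 0.0
        self.m_down = 0.0    # cumulative deviation, used to detect a decrease
        self.max_down = 0.0

    def update(self, x: float) -> bool:
        """Feed one reward; return True if a change point is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n  # running empirical mean
        # Increase in the mean: m_up climbs away from its running minimum.
        self.m_up += x - self.mean - self.delta
        self.min_up = min(self.min_up, self.m_up)
        # Decrease in the mean: m_down falls away from its running maximum.
        self.m_down += x - self.mean + self.delta
        self.max_down = max(self.max_down, self.m_down)
        return (self.m_up - self.min_up > self.lam) or (
            self.max_down - self.m_down > self.lam
        )


if __name__ == "__main__":
    # A reward stream whose mean drops abruptly from 0.9 to 0.1 at step 200.
    ph = PageHinkley(delta=0.005, lam=5.0)
    stream = [0.9] * 200 + [0.1] * 50
    alarms = [t for t, r in enumerate(stream) if ph.update(r)]
    print("first alarm at step", alarms[0])
```

With these hypothetical settings the first alarm fires a few steps after the drop; a larger `lam` trades slower detection for fewer false alarms on noisy rewards.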