Safe Exploration for Optimizing Contextual Bandits

作者: Maarten de Rijke , Ilya Markov , Rolf Jagerman

DOI:

关键词:

摘要: Contextual bandit problems are a natural fit for many information retrieval tasks, such as learning to rank, text classification, recommendation, etc. However, existing methods contextual have one of two drawbacks: they either do not explore the space all possible document rankings (i.e., actions) and, thus, may miss optimal ranking, or present suboptimal user harm experience. We introduce new method problems, Safe Exploration Algorithm (SEA), which overcomes above drawbacks. SEA starts by using baseline (or production) ranking system policy), does experience is safe execute, but has performance needs be improved. Then uses counterfactual learn policy based on behavior policy. also high-confidence off-policy evaluation estimate newly learned Once at least good policy, execute actions, allowing it actively favorable regions action space. This way, never performs worse than experience, while still exploring being able find an Our experiments classification and confirm comparing (and boundless variant called BSEA) online offline problems.

参考文章(46)
Ken Lang, NewsWeeder: Learning to Filter Netnews Machine Learning Proceedings 1995. pp. 331- 339 ,(1995) , 10.1016/B978-1-55860-377-6.50048-7
Tie-Yan Liu, Tao Qin, Introducing LETOR 4.0 Datasets arXiv: Information Retrieval. ,(2013)
Thorsten Joachims, Adith Swaminathan, Batch learning from logged bandit feedback through counterfactual risk minimization Journal of Machine Learning Research. ,vol. 16, pp. 1731- 1755 ,(2015)
Fernando Fernández, Javier García, Safe exploration of state and action spaces in reinforcement learning Journal of Artificial Intelligence Research. ,vol. 45, pp. 515- 564 ,(2012) , 10.1613/JAIR.3761
Leslie Pack Kaelbling, Associative Reinforcement Learning: Functions in k -DNF Machine Learning. ,vol. 15, pp. 279- 298 ,(1994) , 10.1023/A:1022689909846
Yisong Yue, Thorsten Joachims, Interactively optimizing information retrieval systems as a dueling bandits problem Proceedings of the 26th Annual International Conference on Machine Learning - ICML '09. pp. 1201- 1208 ,(2009) , 10.1145/1553374.1553527
Alina Beygelzimer, John Langford, The offset tree for learning with partial labels Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '09. pp. 129- 138 ,(2009) , 10.1145/1557019.1557040
Thorsten Joachims, Optimizing search engines using clickthrough data Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02. pp. 133- 142 ,(2002) , 10.1145/775047.775067
Kalervo Järvelin, Jaana Kekäläinen, Cumulated gain-based evaluation of IR techniques ACM Transactions on Information Systems. ,vol. 20, pp. 422- 446 ,(2002) , 10.1145/582415.582418
Katja Hofmann, Shimon Whiteson, Maarten De Rijke, Fidelity, Soundness, and Efficiency of Interleaved Comparison Methods ACM Transactions on Information Systems. ,vol. 31, pp. 1- 43 ,(2013) , 10.1145/2536736.2536737