Authors: Maarten de Rijke, Ilya Markov, Rolf Jagerman
DOI:
Keywords:
Abstract: Contextual bandit problems are a natural fit for many information retrieval tasks, such as learning to rank, text classification, recommendation, etc. However, existing learning methods for contextual bandit problems have one of two drawbacks: they either do not explore the space of all possible document rankings (i.e., actions) and, thus, may miss the optimal ranking, or they present suboptimal rankings to users and, thus, may harm the user experience. We introduce a new learning method for contextual bandit problems, the Safe Exploration Algorithm (SEA), which overcomes the above drawbacks. SEA starts by using a baseline (or production) ranking system (i.e., policy), which does not harm the user experience and, thus, is safe to execute, but whose performance needs to be improved. Then SEA uses counterfactual learning to learn a new policy based on the behavior of the baseline policy. SEA also uses high-confidence off-policy evaluation to estimate the performance of the newly learned policy. Once the performance of the newly learned policy is at least as good as that of the baseline policy, SEA starts using the new policy to execute actions, allowing it to actively explore favorable regions of the action space. This way, SEA never performs worse than the baseline policy and, thus, does not harm the user experience, while still exploring the action space and, thus, being able to find an optimal policy. Our experiments on text classification and document retrieval confirm the above by comparing SEA (and a boundless variant called BSEA) to online and offline learning methods for contextual bandit problems.
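The abstract only sketches SEA's safety test at a high level. As a concrete illustration, below is a minimal, self-contained Python sketch of one plausible instantiation: a clipped inverse-propensity-scoring (IPS) estimator with a Hoeffding-style high-confidence lower bound, applied to synthetic logged data. The estimator, the bound, the clipping cap, and the synthetic data are all illustrative assumptions, not the paper's exact off-policy evaluation method.

```python
import numpy as np

rng = np.random.default_rng(0)

def ips_estimate(rewards, log_props, new_props, cap=10.0):
    """Clipped IPS estimate of a new policy's value from data
    logged under a baseline (behavior) policy (illustrative)."""
    w = np.minimum(new_props / log_props, cap)
    return (w * rewards).mean()

def ips_lower_bound(rewards, log_props, new_props, delta=0.05, cap=10.0):
    """Hoeffding-style high-confidence lower bound on the IPS estimate.
    After clipping, each weighted reward lies in [0, cap], so the bound
    width is cap * sqrt(ln(1/delta) / (2n))."""
    w = np.minimum(new_props / log_props, cap)
    v = w * rewards
    return v.mean() - cap * np.sqrt(np.log(1.0 / delta) / (2.0 * len(v)))

# Synthetic logged data (assumed, for illustration only): baseline
# propensities for the logged actions, binary rewards, and the
# probabilities the newly learned policy assigns to the same actions.
n = 10_000
log_props = rng.uniform(0.2, 0.8, size=n)
new_props = np.clip(log_props + rng.normal(0.1, 0.05, size=n), 0.01, 1.0)
rewards = rng.binomial(1, 0.4, size=n).astype(float)

baseline_value = rewards.mean()  # on-policy estimate of baseline value
new_policy_lb = ips_lower_bound(rewards, log_props, new_props)

# SEA-style safety test: only start executing the new policy once its
# high-confidence lower bound is at least the baseline's estimated value.
print(f"baseline value = {baseline_value:.3f}, new policy LB = {new_policy_lb:.3f}")
print("switch to new policy" if new_policy_lb >= baseline_value else "keep baseline")
```

The switch rule in the last lines mirrors the guarantee described in the abstract: actions keep coming from the safe baseline until the learned policy is provably (with high confidence) at least as good, so the user experience is never knowingly degraded.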