Counterfactual Risk Minimization

Authors: Adith Swaminathan, Thorsten Joachims

DOI: 10.1145/2740908.2742564

Abstract: We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we derive generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. Using the CRM principle, we derive a new learning algorithm, Policy Optimizer for Exponential Models (POEM), for structured output prediction. We evaluate POEM on several multi-label classification problems and verify that its empirical performance supports the theory.
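
The abstract describes a propensity-weighted empirical risk estimator and a principle that also accounts for that estimator's variance. The following is a minimal Python sketch of such an objective, not the authors' implementation. It assumes each logged sample carries a loss, the logging policy's propensity for the chosen action, and the candidate policy's probability for that same action. The function name `crm_objective`, the clipping constant `clip`, and the trade-off weight `lam` are hypothetical names chosen for illustration.

```python
import math


def crm_objective(losses, new_probs, logged_propensities, clip=10.0, lam=1.0):
    """Return (risk, penalty, risk + penalty) for one candidate policy.

    losses:               observed loss for each logged action
    new_probs:            candidate policy's probability of the logged action
    logged_propensities:  logging policy's probability of the logged action
    clip, lam:            illustrative hyperparameters (assumptions, not from the source)
    """
    n = len(losses)
    if not (n == len(new_probs) == len(logged_propensities)):
        raise ValueError("all inputs must have the same length")
    if n < 2:
        raise ValueError("need at least two logged samples for a sample variance")

    # Importance-weighted loss per sample; the weight is clipped to limit variance.
    weighted = [
        loss * min(clip, q / p)
        for loss, q, p in zip(losses, new_probs, logged_propensities)
    ]

    # Propensity-weighted empirical risk: mean of the weighted losses.
    risk = sum(weighted) / n

    # Unbiased sample variance of the weighted losses.
    variance = sum((w - risk) ** 2 for w in weighted) / (n - 1)

    # Variance penalty: larger when the estimate is less reliable.
    penalty = lam * math.sqrt(variance / n)
    return risk, penalty, risk + penalty


if __name__ == "__main__":
    risk, penalty, total = crm_objective(
        losses=[1.0, 0.0, 1.0, 0.0],
        new_probs=[0.5, 0.5, 0.5, 0.5],
        logged_propensities=[0.25, 0.5, 0.5, 0.25],
    )
    print(risk, penalty, total)
```

Under these assumptions, a learner would compare candidate policies by the penalized value rather than by the raw risk estimate, so that a policy whose estimate rests on a few heavily weighted samples is not preferred merely because of estimator noise.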

References (20)
Csaba Szepesvári, Rémi Munos, Lihong Li. On Minimax Optimal Offline Policy Evaluation. arXiv: Artificial Intelligence (2014).
Massimiliano Pontil, Andreas Maurer. Empirical Bernstein Bounds and Sample Variance Penalization. Conference on Learning Theory (2009).
Miroslav Dudík, Lihong Li, John Langford. Doubly Robust Policy Evaluation and Learning. arXiv: Learning (2011).
Lihong Li, Shunbao Chen, Ankur Gupta, Jim Kleban. Counterfactual Estimation and Optimization of Click Metrics for Search Engines. arXiv: Learning (2014).
Alina Beygelzimer, John Langford. The Offset Tree for Learning with Partial Labels. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), pp. 129-138 (2009). DOI: 10.1145/1557019.1557040
Edward L. Ionides. Truncated Importance Sampling. Journal of Computational and Graphical Statistics, vol. 17, pp. 295-311 (2008). DOI: 10.1198/106186008X320456
Robert Schapire, Lihong Li, Satyen Kale, Alekh Agarwal, John Langford, Daniel Hsu. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. International Conference on Machine Learning, pp. 1638-1646 (2014).
Thorsten Joachims, Adith Swaminathan. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. International Conference on Machine Learning, pp. 814-823 (2015).
Sham M. Kakade, Lihong Li, John Langford, Alex Strehl. Learning from Logged Implicit Exploration Data. Neural Information Processing Systems, vol. 23, pp. 2217-2225 (2010).
Olivier Nicol, Philippe Preux, Jérémie Mary. Improving Offline Evaluation of Contextual Bandit Algorithms via Bootstrapping Techniques. International Conference on Machine Learning, vol. 32, pp. 172-180 (2014).