Abstract: We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., an ad ranking) for a given input (e.g., a query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we derive generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. Using the CRM principle, we derive a new learning algorithm, Policy Optimizer for Exponential Models (POEM), for structured output prediction. We evaluate POEM on several multi-label classification problems and verify that its empirical performance supports the theory.
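
To make the propensity-weighted estimator and its variance term concrete, the following is a minimal sketch, not the POEM implementation. It assumes each logged sample provides a loss, the logging policy's propensity for the action it took, and the probability a candidate policy assigns to that same action. The function name `crm_objective`, the optional `clip` cap on importance weights, and the trade-off weight `lam` are hypothetical names introduced here for illustration.

```python
import math


def crm_objective(losses, logged_propensities, new_policy_probs,
                  clip=None, lam=0.0):
    """Propensity-weighted empirical risk plus a variance penalty.

    Illustrative only: `clip` and `lam` are assumed hyperparameters,
    not values taken from the abstract.
    """
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two logged samples to estimate variance")

    terms = []
    for delta, p, q in zip(losses, logged_propensities, new_policy_probs):
        weight = q / p  # importance weight: candidate policy vs. logging policy
        if clip is not None:
            weight = min(clip, weight)  # cap large weights to limit variance
        terms.append(delta * weight)

    mean = sum(terms) / n  # propensity-weighted empirical risk
    variance = sum((t - mean) ** 2 for t in terms) / (n - 1)  # sample variance
    return mean + lam * math.sqrt(variance / n)
```

With `lam=0.0` and no clipping this reduces to the plain propensity-weighted risk estimate; a positive `lam` penalizes candidate policies whose estimate has high sample variance, which is the kind of trade-off the variance-sensitive bounds motivate.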