Authors: Saayan Mitra, Somdeb Sarkhel, Aritra Ghosh, Viswanathan Swaminathan
DOI:
Keywords:
Abstract: Maximizing utility under a budget constraint is the primary goal for advertisers in real-time bidding (RTB) systems. The policy that maximizes this utility is referred to as the optimal bidding strategy. Earlier work on optimal bidding strategies applies model-based batch reinforcement learning methods, which cannot generalize to unknown budget and time constraints. Further, the advertiser observes a censored market price, which makes direct evaluation on batch test datasets infeasible. Previous work ignores losing auctions to alleviate the difficulty of censored states, thus significantly modifying the test distribution. We address the challenge of lacking a clear evaluation procedure, as well as the error propagated through batch reinforcement learning methods in RTB systems. We exploit two conditional independence structures in the sequential bidding process that allow us to propose a novel, practical framework that uses the maximum entropy principle to imitate the behavior of the true distribution observed in real-time traffic. Moreover, the framework allows us to train a model under unseen budget conditions rather than limiting it to those observed in history. We compare our method with several baselines on real-world datasets and demonstrate improved performance under various settings.
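The censoring issue in the abstract can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' framework. It assumes a second-price auction in which the advertiser wins only if its bid strictly exceeds the market price (the highest competing bid; a tie is treated as a loss here). It also uses hypothetical prices. The sketch shows why dropping losing auctions distorts the data: only low market prices survive, so the observed distribution is biased downward.

```python
# Toy illustration with HYPOTHETICAL data (not from the paper).
# Assumption: second-price auction; the advertiser wins iff bid > market price
# and then sees the market price. On a loss it only learns price >= bid (censored).

def observe(bid, market_price):
    """Return (won, observed_price); observed_price is None when censored."""
    if bid > market_price:
        return True, market_price
    return False, None


def mean(values):
    return sum(values) / len(values)


bids = [5, 5, 5, 5, 5, 5]
market_prices = [2, 3, 4, 6, 8, 10]  # hypothetical true market prices

logs = [observe(b, m) for b, m in zip(bids, market_prices)]

# Naive approach: ignore losing auctions and keep only the uncensored prices.
winning_prices = [p for won, p in logs if won]

true_mean = mean(market_prices)    # mean over all auctions
naive_mean = mean(winning_prices)  # mean over winning auctions only
print(true_mean, naive_mean)       # prints: 5.5 3.0
```

Because every kept price is below the bid, the naive mean (3.0) understates the true mean (5.5). Any evaluation built on winning auctions alone therefore sees a different distribution from the one the bidder actually faces.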