作者: Assaf Hallak , Dotan Di Castro , Shie Mannor
DOI:
关键词:
摘要: We consider a planning problem where the dynamics and rewards of environment depend on hidden static parameter referred to as context. The objective is learn strategy that maximizes accumulated reward across all contexts. new model, called Contextual Markov Decision Process (CMDP), can model customer's behavior when interacting with website (the learner). depends gender, age, location, device, etc. Based behavior, determine customer characteristics, optimize interaction between them. Our work focuses one basic scenario--finite horizon small known number possible suggest family algorithms provable guarantees underlying models latent contexts, CMDPs. Bounds are obtained for specific naive implementations, extensions framework discussed, laying ground future research.