Authors: Andrea Goldsmith, Anusha Lalitha
DOI: 10.1109/JSAIT.2021.3080661
Keywords:
Abstract: We study a decentralized cooperative multi-agent multi-armed bandit problem with $K$ arms and $N$ agents connected over a network. In our model, each arm's reward distribution is the same for all agents, and rewards are drawn independently across agents and across time steps. In each round, agents choose an arm to play and subsequently send a message to their neighbors. The goal is to minimize the cumulative regret averaged over the entire network. We propose a decentralized Bayesian framework that extends single-agent Bayesian bandit algorithms to the decentralized setting. Specifically, we propose an information assimilation algorithm that can be combined with existing Bayesian algorithms, and using this, we obtain a decentralized Thompson Sampling algorithm and a decentralized Bayes-UCB algorithm. We analyze the decentralized Thompson Sampling algorithm under Bernoulli rewards and establish a problem-dependent upper bound on the cumulative regret. We show that the regret incurred scales logarithmically over the time horizon, with constants that match those of an optimal centralized agent with access to all observations across the network. Our analysis also characterizes the cumulative regret in terms of the network structure. Through extensive numerical studies, we show that our extensions of Thompson Sampling and Bayes-UCB incur lesser cumulative regret than state-of-the-art algorithms inspired by the Upper Confidence Bound algorithm. We also implement the proposed algorithms under a gossip protocol and over time-varying networks, where each communication link has a fixed probability of failure.
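To make the setting concrete, below is a minimal sketch of decentralized Thompson Sampling with neighbor message passing on Bernoulli arms. This is an illustration of the general idea described in the abstract, not the paper's exact information assimilation rule: here each agent keeps a Beta posterior per arm, plays the arm with the highest posterior sample, and then folds its own observation and its neighbors' broadcast (arm, reward) pairs into its posterior. The function name, the undirected-adjacency representation, and the naive update rule are assumptions for this sketch.

```python
import random

def decentralized_thompson_sampling(means, adjacency, T, seed=0):
    """Illustrative decentralized Thompson Sampling for Bernoulli bandits.

    means: true success probability of each arm (identical for all agents)
    adjacency: per-agent neighbor lists; assumed symmetric (undirected graph)
    T: number of rounds
    Returns per-agent Beta posterior parameters (alpha, beta) per arm.
    """
    rng = random.Random(seed)
    K, N = len(means), len(adjacency)
    # Beta(1, 1) prior for every (agent, arm) pair.
    alpha = [[1] * K for _ in range(N)]
    beta = [[1] * K for _ in range(N)]
    for _ in range(T):
        messages = []  # (agent, arm, reward) triples broadcast this round
        for i in range(N):
            # Sample a mean from each arm's posterior; play the argmax.
            samples = [rng.betavariate(alpha[i][k], beta[i][k]) for k in range(K)]
            arm = max(range(K), key=lambda k: samples[k])
            reward = 1 if rng.random() < means[arm] else 0
            messages.append((i, arm, reward))
        # Information assimilation (simplified): each agent updates its
        # posterior with its own observation and its neighbors' messages.
        for i, arm, reward in messages:
            for j in [i] + adjacency[i]:
                if reward:
                    alpha[j][arm] += 1
                else:
                    beta[j][arm] += 1
    return alpha, beta
```

In this sketch, after $T$ rounds each agent's posterior has absorbed $T \cdot (1 + \deg)$ observations, which is the sense in which communication lets a decentralized agent approach the centralized agent that sees all $N T$ observations.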