loader
Generating audio...

arxiv

Paper 2410.23867

QuACK: A Multipurpose Queuing Algorithm for Cooperative $k$-Armed Bandits

Authors: Benjamin Howson, Sarah Filippi, Ciara Pike-Burke

Published: 2024-10-31

Abstract:

We study the cooperative stochastic $k$-armed bandit problem, where a network of $m$ agents collaborate to find the optimal action. In contrast to most prior work on this problem, which focuses on extending a specific algorithm to the multi-agent setting, we provide a black-box reduction that allows us to extend any single-agent bandit algorithm to the multi-agent setting. Under mild assumptions on the bandit environment, we prove that our reduction transfers the regret guarantees of the single-agent algorithm to the multi-agent setting. These guarantees are tight in subgaussian environments, in that using a near minimax optimal single-player algorithm is near minimax optimal in the multi-player setting up to an additive graph-dependent quantity. Our reduction and theoretical results are also general, and apply to many different bandit settings. By plugging in appropriate single-player algorithms, we can easily develop provably efficient algorithms for many multi-player settings such as heavy-tailed bandits, duelling bandits and bandits with local differential privacy, among others. Experimentally, our approach is competitive with or outperforms specialised multi-agent algorithms.

Paper Content:
Page 1: QuACK: A Multipurpose Queuing Algorithm for Cooperative k-Armed Bandits Benjamin Howson Sarah Filippi Ciara Pike-Burke Department of Mathematics, Imperial College London November 1, 2024 Abstract We study the cooperative stochastic k-armed bandit problem, where a network of magents collaborate to find the optimal action. In contrast to most prior work on this problem, which focuses on extending a specific algorithm to the multi-agent setting, we provide a black-box reduction that allows us to extend any single-agent bandit algorithm to the multi-agent setting. Under mild assumptions on the bandit envi- ronment, we prove that our reduction transfers the regret guarantees of the single-agent algorithm to the multi-agent setting. These guarantees are tight in subgaussian environments, in that using a near mini- max optimal single-player algorithm is near minimax optimal in the multi-player setting up to an additive graph-dependent quantity. Our reduction and theoretical results are also general, and apply to many differ- ent bandit settings. By plugging in appropriate single-player algorithms, we can easily develop provably efficient algorithms for many multi-player settings such as heavy-tailed bandits, duelling bandits and ban- dits with local differential privacy, among others. Experimentally, our approach is competitive with or outperforms specialised multi-agent algorithms. 1 Introduction Stochastic multi-armed bandit problems are a fundamental model for sequential decision-making. Here, a single agent sequentially interacts with the environment by selecting an action and receiving a reward from an unknown distribution associated with that action (Lattimore and Szepesv ´ari, 2020). There are numerous provably efficient algorithms for this setting that employ different mechanisms for balancing the exploration- exploitation trade-off, such as optimism, posterior sampling, bootstrapping, and soft-max exploration (Auer et al., 2002; Thompson, 1933; Kveton et al., 2019; Bian and Jun, 2022). However, decision-making tasks naturally arise in distributed settings, such as recommender systems and sensor networks (Tekin et al., 2014; Tran-Thanh et al., 2011), where there exist multiple decision-makers interacting with the environment. For example, large-scale recommender systems are often a distributed sys- tem of servers, where each server hosts a decision-maker. Every time a user arrives to one of the servers, the corresponding decision-maker chooses an item to recommend, and the user will provide a reward signal in response. Upon receiving feedback, the decision-makers can update their own knowledge on the quality of this item. Furthermore, they can share this information with neighbouring servers to improve future decision- making at all servers. Motivated by these applications, we study an extension of the traditional multi-armed bandit model where there is a network of mdecision-makers who each interact with the same k-armed bandit environment. Each agent can communicate over the network, allowing for the possibility of collaboration to speed up the learning process. 1arXiv:2410.23867v1 [cs.LG] 31 Oct 2024 Page 2: 1.1 Related Work Designing provably efficient algorithms for the multi-player setting is more challenging than the single-agent setting. Previous work has extended specific single-player algorithms to the multiplayer setting through one of two methods: gossiping or electing a leader. Gossiping is a popular technique in distributed computing that uses an iterative averaging procedure to com- bine information from neighbouring agents to approximate the full network information (Xiao and Boyd, 2004; Duchi et al., 2012). Landgren et al. (2016) and Mart ´ınez-Rubio et al. (2019) combine this procedure with the upper confidence bound algorithm (Auer et al., 2002). Lalitha and Goldsmith (2021) show that we can use gossiping to design a provably efficient multi-agent version of Thompson Sampling (Thompson, 1933). Notably, all of these algorithms are asymptotically optimal. However, the analyses of these algorithms are complex, and specific to the algorithm and variant of the gossiping procedure considered. Furthermore, the performance depends on the choice of communication matrix that the gossip procedure uses for iterative averaging. Even when the network is known, choosing this matrix is a non-trivial task, and it is common to use heuristics (Xiao and Boyd, 2004). Theleader-based methods require electing a leading agent who directs the exploration of the entire network. Bar-On and Mansour (2019) consider the non-stochastic setting and analyse the exponential weights algo- rithm with numerous leaders send their action distributions to the non-leading agents in their neighbourhood. Wang et al. (2020) elect one leading agent who performs all the exploration using a upper confidence bound algorithm. Here, the leader does not observe any information from the other agents in the network, who all play greedily with respect to the empirical means of the leader. Again, these papers analyse extensions of specific bandit algorithms. Our approach also fits into the category of leader-based approaches. However, we develop a leader-based black-box reduction where the leading agent can use any bandit algorithm for decision-making. Specifically, we build upon queuing methods commonly used to handle delayed feedback (Joulani et al., 2013; Mandel et al., 2015). However, extending these queuing methods to the multi-agents setting introduces unique theoretical challenges that do not arise in single-agent scenarios. Additional works investigate variations of the multi-agent bandit problem. Szorenyi et al. (2013) study the multi-armed bandit problem in peer-to-peer networks where each agent communicates with two other agents in the network, which are chosen randomly in every round. Yang et al. (2021) and Chen et al. (2023) look into the setting where each agent on the network plays at different and possibly unknown times. Madhushani et al. (2021) study the setting where there is imperfect communication between agents. Shahrampour et al. (2017); Hossain et al. (2021) study the case where every agent has their own distribution over the rewards for each action. Perhaps the best-studied multi-agent problem is where agents receive zero or degraded rewards if they play the same action at the same time. See Boursier and Perchet (2024) for a recent survey on this topic. 1.2 Contributions We design and analyse an algorithm that reduces cooperative stochastic multi-agent k-armed bandit problems to the single-agent version of the problem. Our main contributions are as follows: • We propose a black-box reduction, QuACK, that accepts any single-agent bandit algorithm as input and immediately extends it to the multi-agent setting. • Theorem 1 shows that we can upper bound the performance over the entire network in terms of the guarantees of the chosen single-agent bandit algorithm. Pairing our reduction with a near optimal algorithm yields a near optimal algorithm for the multi-agent k-armed subgaussian bandit problem, up to an additive graph-dependent quantity. These results are competitive or better than the case-by-case analyses of previous works. • Our theoretical guarantees hold under mild assumptions on the bandit environment. We require that the distribution of the reward depends only on the chosen action, and each distribution has a finite first moment. Hence, if there exists a provably efficient algorithm for a single-agent bandit problem, and 2 Page 3: this bandit problem satisfies the assumptions, our reduction guarantees that it will be provably efficient in the multi-agent setting. This makes developing provably efficient multi-agent algorithms for various bandit problems simple. In particular, in Section 4: •We demonstrate the simplicity by using our reduction to design multi-agent algorithms for heavy- tailed bandits and duelling bandits, which have all be considered separately in the literature (Land- gren et al., 2016; Dubey and Pentland, 2020; Raveh et al., 2024). The resulting algorithms have comparable guarantees to those developed specifically for each setting. •We develop provably efficient algorithms for new multi-agent settings, such as local differential privacy, by using an appropriate single-player algorithm (Ren et al., 2020). • Finally, we perform an experimental comparison to existing works which shows that our reduction is competitive or outperforms existing methods when paired with a comparable bandit algorithm. 2 Problem Setting Our paper considers multi-agent variants of stochastic multi-armed bandit problems where we have a finite set of actions: Asuch that |A|=k. We formalise the network of magents and their connections through an undirected graph G= (V, E)with: V={1,2,···, m} E⊆ {(v, w)∈V×V:v̸=w}. Here v∈Vrepresents an agent and e∈Erepresents a communication channel between a pair of agents. Each agent in the network is allowed to communicate with any other agent in their neighbourhood , which we define for all v∈Vas follows: Nv={w∈V: (v, w)∈E}. Each round consists of every agent simultaneously playing their own action, receiving their own reward and communicating with their neighbours. Formally, for each t∈ {1,2,···, n}: • Each agent v∈Vplays action Av t∈ A. • Each agent v∈Vreceives reward Xv t∈R. • Each agent v∈Vcommunicates with w∈Nv. Throughout, we will make use of a standard and mild assumption on the bandit environment (Lattimore and Szepesv ´ari, 2020). Assumption 1. The bandit environment generates feedback that depends only on the chosen action. Letting Padenote the reward distribution for action a, we assume that: Xv t|Av t=ai.i.d.∼Paandµa=EX∼Pa[X]<∞. for all t∈N,a∈ A andv∈V. Additionally, we make a mild assumption on the graph and the communication protocol. Assumption 2. Messages can be passed from one agent v∈Vto another agent w∈Valong a shortest path in dvwtime steps, and the graph diameter is finite: d= max v,w∈Vdvw<∞. 3 Page 4: 2.1 Measuring Performance In the multi-agent bandit problem, agents collaborate to minimise the regret of the entire network, also known as the group regret : RG(n) =X a∈A∆aE"mX v=1Tav(n)# . where the expectation is over the network of agents interacting with the bandit environment. Here, ∆a= µ⋆−µais the sub-optimality gap of action awhere µ⋆= max µadenotes the expected reward of the optimal action, and Tav(n) =nX t=11{Av t=a} (1) is the number of times agent v∈Vchooses action a∈ A over the course of nrounds. From standard arguments, we can deduce a minimax lower bound on the group regret when all reward distributions are Gaussian (Lattimore and Szepesv ´ari, 2020): RG(n)≥1 27p mn(k−1). This lower bound holds for cases where each agent can communicate immediately with any other agent on the graph. Hence, it may not be tight for graphs with particular structure. However, it does demonstrate that information sharing can be used to improve the group performance in the worst-case. 3 A Black-Box Reduction We present QuACK, a queuing algorithm for cooperative k-armed bandits . QuACK is a multipurpose black- box reduction that can be paired with any single-agent bandit algorithm to extend it to the multi-agent case. QuACK requires the index of the leader on the network and a bandit algorithm πas input. The leader will useπto select actions for the whole network. The leader begins by initialising an empty queue for each action. These queues will store the rewards that the other agents observe from the bandit environment and have passed to the leader. Intuitively, the leader can use the reward samples in these queues instead of playing actions in the real environment. Specifically, in each round, the leader observes the action suggested by π. If there is a reward in the queue for that action, the leader will take the reward from the queue and pass this toπ. Importantly, the leader does not play this action, but instead uses the reward from the queue to update π. This continues until πsuggests playing an action whose queue is empty. In this case, the leader plays the action and receives a reward from the environment. The leader then communicates this decision to the other, follower , agents. Once a follower receives instruction from the leader (note that this can take some time since the message needs to travel over the network), they will play the instructed action, and pass the reward back to the leader (which can also take some time). Once the leader receives rewards from their followers, they will place the rewards in the corresponding queues, and begin the process again. Algorithm 1 presents the pseudo-code for QuACK. For clarity, it is presented assuming that the leader is given as input and one shortest path between each follower w∈Vand the chosen leader v∈Vhas been previously identified. Appendix A explains how we can elect a leader in a distributed manner and Section 3.1 suggests a well-known distributed algorithm for computing the paths once we have the leader. 3.1 Message-Passing Protocol The message a follower w̸=vsends to the leader vin round tcontains only the action and the reward: (Aw t, Xw t). Sending such simple messages is possible since there is a known fixed route between the leader and any follower. Computing these routes can be done in a distributed manner before the start of the interaction. 4 Page 5: Algorithm 1 QuACK Input: index of the leader v∈V Input for Leader: bandit algorithm π Initialisation for Leader: Qa=∅for all a∈ A fort∈ {1,2,···, n}do Leader ( v) Receive messages via shortest path: Mt={(Aw t−dvw, Xw t−dvw)}w̸=v Append Qa=Qa∪ {(a, x)for(a, x)∈ M t} Select awithπ while Qa̸=∅do Remove xfrom Qa Update πwith(a, x) Select awithπ end while PlayAv t=a Observe Xv t∼PAv t Send Av tto all w̸=vvia shortest path Follower ( w̸=v) ift > d vwthen Receive Av t−dvwfrom vvia shortest path PlayAw t=Av t−dvw else PlayAw tuniformly at random end if Observe Xw t∼PAw t Send (Aw t, Xw t)tovvia shortest path. end for Specifically, this amounts to constructing the shortest path tree of the graph which can be done by solving the single-source shortest-path problem before interacting with the bandit environment (Ahuja et al., 1993). Figure 1 illustrates the shortest path tree for the grid graph, where the black vertex represents the leader. Figure 1: Grid Graph and its Shortest Path Tree. The Distributed Bellman-Ford algorithm can solve the single-source shortest-path problem exactly and it does so using approximately miterations, whilst only requiring that each agent knows their neighbours and has a unique identifier (Ford, 1956; Bellman, 1958; Moore, 1959; Elkin, 2020). Appendix A presents full details of this algorithm for completeness. Sending instructions from the leader to the followers using the shortest path tree is straightforward. The leader communicates the action they played to their children, and the children send this action to their children at 5 Page 6: the end of the next round, and so forth. Using the shortest path tree also makes passing messages to the leader straightforward. Each follower will forward messages from their children to their parent, and send their own message to their parents. Thus, in the t-th round, each follower wwill perform the following communication: • Forward Aw t=Av t−dvwto each z∈ Cw • Send Mv t= (Aw t, Xw t)∪ {Mz t−1}z∈CwtoPz Here,CwandPwdenote the children and the parent of agent won the shortest path tree, respectively. The leader vonly needs to send the action they choose in each round to their children, which corresponds to just the first item the above list. Thus, the communication complexity, in terms of the number of messages sent in each round, for each agent wis upper bounded by the diameter of the graph. Note that we could use any other message-passing protocol that satisfies Assumption 2. For example, each agent could forward everything they have been sent in the past mrounds. However, in this case, all messages Mw twould have to contain additional information about who originally sent the message and the time the message was sent, e.g. wandt. This additional information is required so that the followers can identify the most recent action sent to them by the leader. Furthermore, the leader would have to check that Mw tis not added to the queue more than once. 3.2 Theoretical Analysis We now analyse QuACK (Algorithm 1) with an arbitrary single player algorithm π. We will see that the group regret of QuACK depends on the single-agent regret of πand some graph-dependent quantities. We define the regret of the single-agent bandit algorithm πover an n-round interaction with the single-player environment as follows: Sπ(n) =X a̸=⋆∆aEP"nX t=11{at=a}# where atdenotes the action taken at time t when following πandEP[·]denotes expectation with respect to the interaction with the single-player environment. We will show that, by maintaining the queues, the leader creates a simulated single-player version of the environment for the bandit algorithm. For this, we need to define the number of times the single-agent bandit algorithm πchooses action a∈ A in totality. This counter includes the number of times the leader, who operates this single-player algorithm, chooses this action in the bandit environment, as well as the number of samples taken from the queue. Letting asdenote the s-th action chosen by the bandit algorithm allows us to define this counter: T′ a(sn) =nX t=1stX s=st−1+11{as=a}. (2) Here, strepresents the number of simulated rounds the bandit algorithm has completed by the end of the t-th round. This includes the action chosen by the leader in this round, e.g. Av t=ast. From Section 3.1, we know that the leader appends each message to the queue exactly once, which implies that: sn≤mn. Lemma 1 shows that the bandit algorithm is operating in a simulated version of a single-player environment. Lemma 1. Under Assumption 1, QuACK guarantees that, for all n: Sπ(n) =X a̸=⋆∆aE[T′ a(n)]. Proof. Under Assumption 1, the rewards for each action-agent pair are independent and identically dis- tributed random variables. Properties of exchangeable random variables guarantee that the sequence of rewards the leading agent feeds to the bandit algorithm has the same distribution as in the single-player environment. Therefore, the expectation in the definition of the single-player regret and in the multi-agent setting are the same. See Appendix B.1 for a formal proof. 6 Page 7: Recall that the group regret is defined in terms of times the network of agents plays each action. Therefore, Lemma 2 relates this quantity to the number of times the single-agent bandit algorithm plays each action. Lemma 2. Running QuACK with an arbitrary v∈Vas the leader guarantees that: mX w=1Taw(t−dvw)≤T′ a(st) + 2mX w=1dvw. for all t∈N. Proof. See Appendix B.2 Lemmas 1 and 2 allow us to upper bound the group regret by the regret of the single-player bandit algorithm plus an additive graph-dependent quantity. Theorem 1. Under Assumptions 1 and 2, the group regret of QuACK run with single-player bandit algorithm πand leading agent v∈Vis bounded by RG(n)≤Sπ(mn) + 3·mX w=1dvw!kX a=1∆a. Proof. Firstly, we upper bound the group plays for action a∈ A at the end of the final round as: mX w=1Taw(n)≤mX w=1Taw(n−dvw) +X w̸=vdvw. (3) Combining Equation (3) with Lemma 2 and sn< mn allows us to upper bound the number of times the entire network plays action aby the number of times πhas played this action: mX w=1Taw(n)≤mX w=1Taw(n−dvw) +X w̸=vdvw ≤T′ a(mn) + 3X w̸=vdvw (4) Plugging Equation (4) into the definition of the group regret gives us: RG(n) =X a̸=⋆∆aE"mX w=1Taw(n)# ≤X a̸=⋆∆aE[T′ a(mn)] + 3X w̸=vdvw X a̸=⋆∆a =Sπ(mn) + 3X w̸=vdvw X a̸=⋆∆a where the final line follows from Lemma 1 and the definition of regret in the single-player setting. See Appendix B.3 for further details. Theorem 1 suggests that we should pick: v⋆= arg min v∈VmX w=1dvw, as the leading agent. This is intuitive, as minimising the sum of the shortest paths will minimise the total delay in sending and receiving instructions and feedback from the followers. Appendix A presents an approach for 7 Page 8: finding this agent in a distributed manner. Nevertheless, Assumption 2 guarantees that the worst-case graph- dependence, regardless of the chosen leading agent, is given by: mX w=1dvw≤d(m−1). It is important to highlight that Theorem 1 holds for any choice of bandit algorithm and makes weak assump- tions on the rewards. This is in contrast to existing works, where it is common to analyse a specific bandit algorithm and make stronger assumptions on the reward distributions, such as subgaussian tails. In Section 4 we show that the guarantees for QuACK are competitive with the case-specific analyses that exist in the literature, despite being strictly more general. 4 Instances of QuACK Theorem 1 holds under the assumption that the rewards are conditionally independent given the actions and the expected reward for each of the actions is finite. Therefore, we can apply our reduction in numerous bandit environments. Here, we present a non-exhaustive selection of bandit environments that satisfy Assumption 1. For each, we describe how to translate single-agent algorithms to the multi-agent setting, provide bounds on their group regret, and compare to any existing multi-agent works in that setting. In each setting, we choose one specific single-player algorithm to use with QuACK. However, any other provably efficient single-player algorithm could be combined with our reduction to get similar results. 4.1 Subgaussian Bandits Traditionally, the multi-armed bandit problem consists of playing an action and receiving a reward drawn independently from a σ-subgaussian distribution that depends only on the chosen action (Lattimore and Szepesv ´ari, 2020). Thus, Assumption 1 holds. Corollary 1 presents the group regret bound for our algo- rithm when the leader uses the classic upper confidence bound algorithm (Auer et al., 2002). Corollary 1. Running QuACK with an arbitrary leader following the upper confidence bound algorithm with confidence parameter δ= 1/(mn)2guarantees: RG(n)≤X a̸=⋆16σ2ln (mn) ∆a+ 3 + 3mX w=1dvw!X a̸=⋆∆a. Proof. Theorem 7.1 of Lattimore and Szepesv ´ari (2020) provides an upper bound on the regret of the upper confidence bound algorithm over a T-round interaction with the bandit environment: Sπ(T)≤X a̸=⋆16 ln ( T) ∆a+ 3X a̸=⋆∆a. (5) Combining this upper bound with Theorem 1 and taking T=mngives the result. Mart ´ınez-Rubio et al. (2019) analyse a multi-agent extension of the upper confidence bound algorithm and offer the best theoretical guarantees in the subgaussian setting. They obtain the following bound on the group regret: X a̸=⋆16η 1 +ϵ 2 σ2ln (mn) ∆a+ (CG+m+ 4)X a̸=⋆∆a Here, η >1andϵ∈(0,1)are hyperparameters and CGis a graph-dependent quantity they define as: CG=& 6mln2m ϵ p 2 ln|λ|−1' 8 Page 9: where λis the second largest singular value of the communication matrix they use for the gossiping procedure. Corollary 1 shows a strict improvement in the leading-order term. Comparing the graph-dependence is a little more challenging and requires specifying the communication matrix. Following their recommendations, we can show that for regular graphs, our graph-dependence is smaller than a function of theirs: 3mX w=1dvw<CGp 2 ln|λ|−1 which suggests the dependence in our bounds is better in regular graphs with a small spectral gap. Addition- ally, we can prove a strict improvement on the graph-dependence for specific graphs, such as the star network. The amount of per-round communication of the two methods is similar. Mart ´ınez-Rubio et al. (2019) have each agent send two vectors of length kto each neighbour. Conversely, QuACK sends an action-reward pair to a subset of the neighbours. See Appendix C for further details. 4.2 Heavy-Tailed Bandits Bubeck et al. (2013) introduced the single-agent heavy tailed bandit problem. Here, the the subgaussian assumption on the reward distribution for each action is relaxed, and instead it is assumed that each reward distribution has finite moments of order 1 +ϵwhere ϵ >0controls the heaviness of the tails. Bubeck et al. (2013) propose a generic solution for the single-agent setting that they call the robust upper confidence bound algorithm that replaces the empirical mean with a robust estimator, such as the truncated-mean, median-of- means, or Catoni’s M (Bubeck et al., 2013). The only difference between this setting and the subguassian setting is the weaker assumption on the tails of the rewards. Therefore, Assumption 1 holds and the QuACK algorithm can be used for the multi-agent heavy-tailed bandit problem with the following regret guarantees. Corollary 2. Running QuACK with an arbitrary leader and the truncated-mean robust upper confidence bound algorithm guarantees: RG(n)≤ O X a̸=⋆σ1 ϵln (mn) ∆1 ϵa+ mX w=1dvw!X a̸=⋆∆a  where EX∼Pa|X|1+ϵ≤σfor all a∈ A. Dubey and Pentland (2020) study the multi-agent heavy-tailed bandit problem and devise two algorithms based on the truncated-mean estimator. Their best algorithm achieves a group regret bound with a similar form to Corollary 2, except the leading order term involving nis scaled by a graph-dependent quantity. This graph-dependent term is the independence number of the γ-th power of the graph, where γis a hyperparam- eter of their algorithm. Thus, our bound is strictly better when this quantity is greater than one, which occurs when γ < d . Since they also employ a leader-based algorithm that uses a message passing procedure, the additive terms in their bound and their per-round communication are comparable. See Appendix C for further details. 4.3 Duelling Bandits Yue et al. (2009) introduced the single-agent duelling bandit problem. Here, feedback from the environment is the outcome of a duel between two actions. Modifying Algorithm 1 for the multi-agent duelling bandit problem amounts to maintaining a queue for each pair of actions: A+=A × A . Now, each agent will communicate the pair of actions and the outcome of the duel to their neighbours. Assumption 1 holds since the outcomes of the duels are assumed to depend only on the pair of actions chosen in each round. Recall that we define the group regret relative to a fixed optimal action. Therefore, Theorem 1 applies to solution concepts, such as the Condorcet, Copeland, and Borda winners (Zoghi et al., 2014, 2015; Jamieson et al., 2015). For illustrative purposes, we present an upper bound on the group regret when the 9 Page 10: leader uses the relative upper confidence bound algorithm, which is designed under the assumption that the Condorcet winner exists (Zoghi et al., 2014). Corollary 3. Running QuACK with an arbitrary leader following the relative upper confidence bound algo- rithm of Zoghi et al. (2014) guarantees: RG(n)≤ O X a̸=⋆ln (mn) ˜∆a+ mX w=1dvw!X a̸=⋆˜∆a  where ˜∆a=P(a⋆≻a)−1/2denotes the sub-optimality gap of action ain the the duelling bandit setting. Raveh et al. (2024) develop several algorithms for the multi-agent duelling bandit problem. Their best algo- rithm builds on the relative upper confidence bound algorithm and achieves a similar guarantee. However, the leading order term involving nis scaled by the clique covering number of the γ-th power of the graph, where γis a hyperparameter of their algorithm. Thus, we obtain a strict improvement whenever this quantity is greater than one, which occurs when γ < d . Since their algorithm employs a message passing procedure, the additive terms in their bound and the per-round communication are comparable to ours. See Appendix C for further details. 4.4 Local Differential Privacy Privacy is a topic of growing interest in the machine learning community and seeks to protect private infor- mation within datasets. Ren et al. (2020) study local differential privacy (LDP) in the bandit setting, where the environment uses a mechanism to make the rewards private before sending them to the agent. Formally, in the multi-agent setting, the only difference is that the rewards each agent receives are privatised: Xv t=fϵ(X)forXi.i.d.∼PAv t where fϵdenotes the privatising mechanism that ensures ϵ-LDP of the rewards before handing them to the agent. Procedures for privatising rewards include adding Bernoulli or Laplace noise to the rewards. Proving that the mechanism provides the specified level of privacy is independent of the bandit algorithm, e.g. Lem- mas 2 and 5 of Ren et al. (2020). Therefore, we can directly use these mechanisms in the multi-agent setting and guarantee that the rewards are ϵ-LDP. Currently, no multi-agent algorithms for this problem setting exist. However, Assumption 1 is satisfied in this setting because the curator adds i.i.d. noise to the rewards. There- fore, we develop the first provably efficient algorithm for multi-agent bandits with local differential privacy by plugging an existing single agent algorithm into QuACK. Corollary 4. Running QuACK with an arbitrary leader following LDP-UCB-L of Ren et al. (2020) is ϵ-LDP and guarantees: RG(n)≤ O X a̸=⋆1 ϵ2ln (mn) ∆a+ mX w=1dvw!X a̸=⋆∆a . 5 Experimental Results We compare the performance of QuACK (Algorithm 1) initialised with UCB against coop-UCB (Landgren et al., 2016) and DDUCB (Mart ´ınez-Rubio et al., 2019). Additionally, we initialise QuACK with Thompson Sampling to compare against Dec-TS (Lalitha and Goldsmith, 2021). Our experiments consider a simple bandit environment with ten actions where each reward distribution is Bernoulli with µ1= 0.5andµa= 0.45otherwise. Following existing works, we conduct our experiments on several standard graph structures, namely cycle and grid graphs. Additionally, we investigate the star graph to verify our theoretical insights. All results are averaged over 100independent runs and we represent uncertainty by shading between the 2.5-th and 97.5-th quantiles. 10 Page 11: Figure 2: Group Regret for a Network of 196Agents. Figure 2 compares the group regret of each algorithm across various graph structures. For UCB, QuACK significantly outperforms existing methods. These results support the claims we made in Section 4 based on our theoretical results. For Thompson Sampling, we observe that our algorithm is competitive with existing work on the cycle and grid structures, and performs significantly better on the star structure, also supporting our theoretical claims. Appendix D displays similar results for additional experimental settings and provides further details on the settings. 6 Conclusion This paper proposes a generic black-box algorithm that can extend any single-agent bandit algorithm to the multi-agent setting. Under mild assumptions on the bandit environment, we show that our black-box ap- proach immediately transfers the theoretical guarantees of the chosen bandit algorithm from the single agent setting to the multi-agent setting. In the subgaussian setting, we can pair our algorithm with any nearly minimax optimal single agent algorithm and match the minimax lower bound in the multi-agent setting, up to an additive graph-dependent quantity. We suspect that the lower bound is loose because it fails to capture the structure of the graph and the commu- nication delays. Proving graph-dependent lower bounds is an interesting open problem. Additionally, we assumed that each agent can communicate real-valued messages to each neighbour at the end of every round with arbitrary precision. However, some practical applications impose constraints on the amount of communication in terms of frequency or the number of bits. Relaxing these assumptions are an important next step for this line of research. References R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network Flows: Theory, Algorithms, and Applications . Prentice- Hall, 1993. Peter Auer, Nicol `o Cesa-Bianchi, and Paul Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning , 2002. Yogev Bar-On and Yishay Mansour. Individual Regret in Cooperative Nonstochastic Multi-Armed Bandits. InAdvances in Neural Information Processing Systems , 2019. Richard Bellman. On a Routing Problem. Quarterly of Applied Mathematics , 1958. Dimitri P. Bertsekas and Robert G. Gallager. Data Networks . Prentice-Hall, 1992. 11 Page 12: Jie Bian and Kwang-Sung Jun. Maillard Sampling: Boltzmann Exploration Done Optimally. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics , 2022. Etienne Boursier and Vianney Perchet. A Survey on Multi-player Bandits. Journal of Machine Learning Research , 2024. Sebastien Bubeck, Nicolo Cesa-Bianchi, and Gabor Lugosi. Bandits With Heavy Tail. IEEE Transactions on Information Theory , 2013. Yu-Zhen Janice Chen, Lin Yang, Xuchuang Wang, Xutong Liu, Mohammad Hajiesmaili, John C. S. Lui, and Don Towsley. On-Demand Communication for Asynchronous Multi-Agent Bandits. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics , 2023. F. R. K. Chung. Diameters and Eigenvalues. Journal of the American Mathematical Society , 1989. F. R. K. Chung. Spectral Graph Theory . American Mathematical Society, 1997. Donald L. Cohn. Measure Theory . Birkh auser/Springer, New York, 2013. Abhimanyu Dubey and Alex ‘Sandy’ Pentland. Cooperative Multi-Agent Bandits with Heavy Tails. In Proceedings of the 37th International Conference on Machine Learning , 2020. J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling. IEEE Transactions on Automatic Control , 2012. Michael Elkin. Distributed Exact Shortest Paths in Sublinear Time. J. ACM , 2020. Wan Fokkink. Distributed Algorithms: An Intuitive Approach . The MIT Press, 2013. Lester Ford. Network Flow Theory . RAND Corporation, 1956. Safwan Hossain, Evi Micha, and Nisarg Shah. Fair Algorithms for Multi-Agent Multi-Armed Bandits. In Advances in Neural Information Processing Systems , 2021. Kevin Jamieson, Sumeet Katariya, Atul Deshpande, and Robert Nowak. Sparse Dueling Bandits. In Pro- ceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics , 2015. Pooria Joulani, Andras Gyorgy, and Csaba Szepesvari. Online Learning under Delayed Feedback. In Pro- ceedings of the 30th International Conference on Machine Learning , 2013. Branislav Kveton, Csaba Szepesvari, Sharan Vaswani, Zheng Wen, Tor Lattimore, and Mohammad Ghavamzadeh. Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits. In Pro- ceedings of the 36th International Conference on Machine Learning , 2019. Anusha Lalitha and Andrea Goldsmith. Bayesian Algorithms for Decentralized Stochastic Bandits. IEEE Journal on Selected Areas in Information Theory , 2021. Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. Distributed Cooperative Decision-Making in Multi-Armed Bandits: Frequentist and Bayesian Algorithms. In 2016 IEEE 55th Conference on Deci- sion and Control , 2016. Tor Lattimore and Csaba Szepesv ´ari.Bandit Algorithms . Cambridge University Press, 2020. Udari Madhushani, Abhimanyu Dubey, Naomi Leonard, and Alex Pentland. One more step towards reality: Cooperative bandits with imperfect communication. Advances in Neural Information Processing Systems , 2021. Travis Mandel, Yun-En Liu, Emma Brunskill, and Zoran Popovi ´c. The Queue Method: Handling Delay, Heuristics, Prior Data, and Evaluation in Bandits. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence , 2015. 12 Page 13: David Mart ´ınez-Rubio, Varun Kanade, and Patrick Rebeschini. Decentralized Cooperative Stochastic Ban- dits. In Proceedings of the 33rd International Conference on Neural Information Processing Systems , 2019. Edward Moore. The Shortest Path Through a Maze . Bell Telephone System., 1959. Or Raveh, Junya Honda, and Masashi Sugiyama. Multi-Player Approaches for Dueling Bandits. Preprint, 2024. Wenbo Ren, Xingyu Zhou, Jia Liu, and Ness B. Shroff. Multi-Armed Bandits with Local Differential Privacy. Preprint, 2020. Nicola Santoro. Design and Analysis of Distributed Algorithms . John-Wiley & Sons, USA, 2006. Shahin Shahrampour, Alexander Rakhlin, and Ali Jadbabaie. Multi-armed bandits in multi-agent networks. In2017 IEEE International Conference on Acoustics, Speech and Signal Processing , 2017. Balazs Szorenyi, Robert Busa-Fekete, Istvan Hegedus, Robert Ormandi, Mark Jelasity, and Balazs Kegl. Gossip-based distributed stochastic bandit algorithms. In Proceedings of the 30th International Conference on Machine Learning , 2013. Cem Tekin, Simpson Zhang, and Mihaela van der Schaar. Distributed Online Learning in Social Recom- mender Systems. IEEE Journal of Selected Topics in Signal Processing , 2014. William R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika , 1933. Long Tran-Thanh, Alex Rogers, and Nicholas Jennings. Long-term information collection with energy har- vesting wireless sensors: A multi-armed bandit based approach. Autonomous Agents and Multi-Agent Systems , 2011. Po-An Wang, Alexandre Proutiere, Kaito Ariu, Yassir Jedra, and Alessio Russo. Optimal Algorithms for Mul- tiplayer Multi-Armed Bandits. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics , 2020. Lin Xiao and Stephen Boyd. Fast Linear Iterations for Distributed Averaging. Systems & Control Letters , 2004. Lin Yang, Yu-Zhen Janice Chen, Stephen Pasteris, Mohammad Hajiesmaili, John C. S. Lui, and Don Towsley. Cooperative Stochastic Bandits with Asynchronous Agents and Constrained Feedback. In Advances in Neural Information Processing Systems , 2021. Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed duelling bandits problem. InProceedings of the 22nd Conference on Learning Theory , 2009. Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten Rijke. Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem. In Proceedings of the 31st International Conference on Machine Learning , 2014. Masrour Zoghi, Zohar Karnin, Shimon Whiteson, and Maarten de Rijke. Copeland Dueling Bandits. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1 , 2015. 13 Page 14: A Implementation Details Section 3 presents QuACK assuming that we have elected a leader and have computed a shortest path tree for our message passing protocol. Here, we describe how we can perform these steps in a distributed manner using well-known algorithms. A.1 Shortest Path Tree Constructing the shortest path tree rooted at a fixed vertex amounts to solving the single-source shortest path problem. Traditionally, we can solve this problem in non-distributed systems using the Bellman-Ford algo- rithm (Ford, 1956; Bellman, 1958). However, using this algorithm would require the existence of an agent who possesses knowledge of the entire network topology. Therefore, we will present the distributed variant of this algorithm (Bertsekas and Gallager, 1992). Here, each agent knows only their neighbours. Algorithm 2 presents the pseudo-code for the Distributed Bellman-Ford algorithm. Briefly, this algorithm takes a source agent as input, and each agent keeps track of their distances from the source and their parent on the shortest path tree. Algorithm 2 Distributed Bellman-Ford ( DBF) Input. Index of a source agent v Setd1 v= 0andd1 w=∞forw̸=v SetPw=∅for each w∈Vto initialise their parent. fort∈ {1,2,···m−1}do Each w∈Vsends dt wtoz∈Nw Each w∈Vreceives dt zfrom z∈Nw Each w∈Vupdates their distance from the source dt+1 w= min z∈Nw∪{w} dt z+ 1{z̸=w} Each w∈Vupdates their parent: Pw=( arg minz∈Nwdt z+ 1 ifdt+1 w< dt w Pw ifdt+1 w=dt w end for Each w∈Vsends their identifier wto their parent Pw. Each w∈Vreceives the identifiers from their children and creates their set of children: Cw={z∈V:Pz=w} After Algorithm 2 terminates, each agent will know their children and parent on the shortest path tree rooted at the index of the input agent. Indeed, these are the quantities we require to implement the message-passing scheme described in Section 3.1. A.2 Leader Election Leader Election (LE) is a well-known problem in distributed computing (Santoro, 2006; Fokkink, 2013). Here, we want the network to unanimously agree upon a single agent as a leader. Generally, there are impossibility results for anonymous networks (Fokkink, 2013). Therefore, we will assume that each agent has a unique identifier. Wang et al. (2020) devise a simple leader election scheme that requires at most d+ 1 iterations, where dis the diameter of the graph. Algorithm 3 presents a slight modification of their procedure that will ease the exposition of finding the best leader. 14 Page 15: Algorithm 3 Leader Election ( LE) Each v∈Vsetsfvarbitrarily. Section A.3 has each agent set fvto their sum of shortest paths. Each v∈Vsaves their initial value Iv= (v, fv) Each v∈Vcreates their state value S1 v= (v, fv) fort∈ {1,2,···, m}do Each v∈Vreceives St wfrom w∈Nv if∃w∈Nvsuch that fw< fvthen Update St+1 v= (w, fw) else if∃w∈Nvsuch that fw=fvthen Update St+1 v= (min {w, v}, fv) else St+1 v=St v end if end for ifIv=Sm+1 v then Agent vis the leader. end if Running Algorithm 3 requires at most miterations because every agent will have seen the identifier of every other agent in the graph at the end of this iteration. Furthermore, Algorithm 3 will always terminate with exactly one leader. To see this, we can consider two possible cases. •Unique Minimum . Here, there exists exactly one agent vwho possesses the smallest fvvalue. There- fore, they never get to update their state and become the leader. •Non-Unique Minimum . Let f= min v∈Vfvand suppose multiple agents attain this value. Algorithm 3 implements an index-based tie-breaking rule to handle this case. Notably, agents with fv=fwill eventually receive an fw=fand they will proceed to check if their index is minimal. Since each agent has a unique index, there will be exactly one agent who possesses fv=fwith the minimum index. This agent will never update their state and will become the leader. A.3 Optimising the Leader Algorithm 4 Best Leader forv∈V={1,2,···, m}do Compute Sum of Shortest Paths Execute DBF(v) Each agent now knows dvw(distance from source v) Each agent now knows Pw(their parent on the shortest path tree with source v) Each w∈Vcreates message Mw 1= (w, dvw) fort∈ {1,2,···, m}do Each w∈Vsends Mw ttoPw Each w∈Vupdates Mw t+1=Mw t∪ {Mz t}z∈Cw end for Agent vcalculates fv=P wdvwfrom the messages. end for Use Sum of Shortest Paths as fvin Leader Election Each agent now knows their value for fv=Pm w=1dvw Execute LEto elect the leader with minimal sum of shortest paths. Combining Sections A.2 and A.1 allows us to find the best-positioned leading agent in the network according 15 Page 16: to our theoretical guarantees. Algorithm 4 presents the pseudo-code for finding and electing the leader. Briefly, this procedure starts with each agent calling Algorithm 2. Afterwards, we perform several rounds of message passing so that agent vcan calculate the sum of shortest paths: fv. Finally, Algorithm 3 performs leader election with fvequal to the sum of shortest paths. B Missing Proofs Throughout the main paper we deferred several proofs to the appendix and gave sketches of the proofs for the main results. Here, we present full proofs of the claims made in the main paper and we do so in chronological order. Specifically, the remainder of this section will have the following structure: • Section B.1 presents a proof of Lemma 1 which establishes the equivalence of playing in the bandit environment and the queued version of the environment created by the feedback from non-leading agents. • Section B.2 presents a proof of Lemma 2 which tells us that the difference between the group plays and the pseudo plays of the leader is almost surely bounded by a graph-dependent constant. • Section B.3 presents a full proof of Theorem 1 and fills in steps missing from the proof sketches given in the main paper. B.1 Proof of Lemma 1 In QuACK, the leader uses rewards from the queues as well as their own observations to simulate an environ- ment for updating the bandit algorithm π. Recall that the rewards in the queues have been gathered by other agents in the network when interacting with the environment. Lemma 1 relates the expected number of times the leader updates the algorithm πfor each action ato the expected number of times the single-agent bandit algorithm πwould play action aover an n-round interaction with the environment. More precisely it states that: X a̸=⋆∆aEP"nX t=11{At=a}# =X a̸=⋆∆aE[T′ a(n)]. (6) In Equation (6), the first expectation is with respect to the bandit algorithm πinteracting directly with the bandit environment. The second expectation is with respect to bandit algorithm πinteracting with the simu- lated environment created by the leader. For each action a∈ A, letPabe the distribution over possible rewards when playing action ain the bandit environment, as defined in Assumption 1. For each a, we also define ˜Pawhich is a probability measure over the possible rewards for playing action ain the simulated environment created by the leader. In addition let pπbe the density over action-reward pairs up to time nwhen the single player bandit algorithm πinteracts with the environment, and let ˜pπbe the density over action-reward pairs in the simulated environment up to timen. Following the notation of Lattimore and Szepesv ´ari (2020), we can write down these densities as follows: pπ(a1, x1,···, an, xn) =nY t=1π(at|a1, x1,···, at−1, xt−1)pat(xt) ˜pπ(a1, x1,···, an, xn) =nY t=1π(at|a1, x1,···, at−1, xt−1) ˜pat(xt) where paand˜padenote the Radon-Nikodym derivatives of Paand˜Pa, respectively. Using these notations we can define the two expectations in equation (6) more precisely, i.e. for any action a∈ A, EP"nX t=11{At=a}# =Z HnnX t=11{at=a}pπ(h)dh 16 Page 17: whereas E[T′ a(n)] =Z HnnX t=11{at=a}˜pπ(h)dh where h= (x1, a1, . . . x n, an)∈ H n= (A ×R)nis a sequence of naction-reward pairs. Under Assumption 1 we know that the rewards for each action-agent pair are independent and identically distributed random variables. Therefore, using exchangeability of independent and identically distributed random variables, we have that: Pad=˜Pa The Radon-Nikodym Theorem tells us that the pais unique, e.g. see Theorem 4.2.2 of Cohn (2013). Combin- ingPa=˜Pawith this well-known result tells us that pa(x) = ˜pa(x)for all x∈R. Therefore pπ(h) = ˜pπ(h) for all h∈ H n= (A ×R)nand hence, Z Hnf(h)pπ(h)dh=Z Hnf(h) ˜pπ(h)dh for any arbitrary function f:Hn→R. Choosing f(h) =Pn t=11{at=a}completes the proof. B.2 Proof of Lemma 2 Define the last round that the leader vertex plays action ain the bandit environment as follows: τ= max j≤t:Av j=a . Recall stdenotes the pseudo round counter when the leading agent makes their t-th decision in the bandit environment. Using Equation (2) and leveraging the fact that the round in the previous display is associated with a play of action agives us the following: T′ av(st) =stX s=11{av s=a} =sτX s=11{av s=a}+stX s=sτ+11{av s=a} =T′ av(sτ) +stX s=sτ+11{av s=a} (⋆)=mX w=1Taw(τ−dvw) +stX s=sτ+11{av s=a} ≥mX w=1Taw(τ−dvw) (7) where (⋆)follows from the fact that the leader plays action ain round τso the number of times they play this action in the queue must be equivalent to the group plays of this action in the bandit environment. Now, Equation (7) relates the number of times the bandit algorithm has played action ato the number of times the network has chosen this action. Recall that each follower will play action afor the last time in round: τw=τ+dvw where τis the last round that the leader plays action ain the bandit environment. Using Equation (1) with the definition and properties of τwallows us to get an upper bound on the number of plays for any non-leading agent was follows: Taw(t−dvw) =t−dvwX j=11{Aw j=a} 17 Page 18: =τ−dvwX j=11{Aw j=a}+t−dvwX j=τ−dvw+11{Aw j=a} (⋆) ≤Taw(τ−dvw) +τ+dvwX j=τ−dvw+11{Aw j=a} ≤Taw(τ−dvw) + 2dvw (8) Notably, (⋆)follows from analysing the two possible cases: • When t−dvw≥τw=τ+dvw, the inequality is actually an equality because τwis the last round follower wplays this action up to and including the t-th round. • Otherwise, t−dvw< τw=τ+dvwand we are extending the summation to include more rounds which gives us the inequality. Summing Equation (8) over all agents and substituting into Equation (7) gives us: mX w=1Taw(t−dvw)≤mX w=1Taw(τ−dvw) + 2mX w=1dvw (Equation (8)) ≤T′ a(st) + 2mX w=1dvw (Equation (7)) as required. B.3 Proof of Theorem 1 Within Section 3, we provided a nearly complete proof. Here, we present each step for completeness. Firstly, we can rewrite the group plays for action aat the end of the final round as follows: mX w=1Taw(n) =Tav(n) +X w̸=vTaw(n) =Tav(n) +X w̸=vnX t=11{Aw t=a} =Tav(n) +X w̸=vn−dvwX t=11{Aw t=a}+X w̸=vnX t=n−dvw+11{Aw t=a} =Tav(n) +X w̸=vTaw(n−dvw) +X w̸=vnX t=n−dvw+11{Aw t=a} ≤Tav(n) +X w̸=vTaw(n−dvw) +X w̸=vdvw (9) Combining Equation (9) with Lemma 2 allows us to upper bound the group plays by the number of times the leader plays the action in the queued version of the environment: mX w=1Taw(n)≤Tav(n) +X w̸=vTav(n−dvw) +X w̸=vdvw (Equation (9)) =mX w=1Taw(n−dvw) +mX w=1dvw (Since dvv= 0) ≤T′ av(sn) + 3X w̸=vdvw (10) 18 Page 19: where the final line follows from Lemma 2. Plugging Equation (10) allows us to get an upper bound on the group regret: RG(n) =X a̸=⋆∆aE"mX w=1Taw(n)# ≤X a̸=⋆∆aE T′ av(sn) + 3X w̸=vdvw  (Equation (10)) =X a̸=⋆∆aE[T′ av(sn)] + 3X w̸=vdvw X a̸=⋆∆a ≤X a̸=⋆∆aE[T′ av(mn)] + 3X w̸=vdvw X a̸=⋆∆a (Since sn≤mn) =Sπ(mn) + 3X w̸=vdvw X a̸=⋆∆a (11) where the final line follows from Lemma 1. C Regret Bound Comparison Our theoretical results span several multi-agent bandit problems, which include: subgaussian rewards, heavy- tailed rewards, duelling bandits and bandits with local differential privacy, amongst any other bandit problems that satisfied Assumption 1. Here, we compare our bounds to existing works in the literature. C.1 Subgaussian Rewards Several works design algorithms under a subgaussian assumption on the reward distributions (Landgren et al., 2016; Mart ´ınez-Rubio et al., 2019; Lalitha and Goldsmith, 2021). Mart ´ınez-Rubio et al. (2019) show that their guarantees improve upon those shown by Landgren et al. (2016). Lalitha and Goldsmith (2021) remark that their analysis does not provide guarantees as tight as those found in previous works (Landgren et al., 2016; Mart ´ınez-Rubio et al., 2019). Thus, we focus on a comparison with DDUCB of Mart ´ınez-Rubio et al. (2019). Mart ´ınez-Rubio et al. (2019) use a gossiping procedure. This requires a communication matrix, P∈Rm×m. Notably, Pis doubly stochastic and must respect the structure of the graph, e.g. Pvw= 0 if(v, w)̸∈E. Throughout, we will require the eigenvalues of the Pand follow the authors in assuming that the eigenvalues of this matrix satisfy the following: 1 =λ1(P)>|λ2(P)| ≥ |λ3(P)| ≥ ··· ≥ | λm(P)| ≥0 Theorem 3.2 of Mart ´ınez-Rubio et al. (2019) prove the following upper bound on the group regret of their algorithm, DDUCB: X a̸=⋆16η 1 +ϵ 2 σ2ln (mn) ∆a+ (CG+m+ 4)X a̸=⋆∆a Here, η > 1andϵ∈(0,η−1/7(η+1)are tune-able hyperparameters and CGis a graph-dependent quantity defined as: CG=& 6mln2m ϵ p 2 ln|λ|−1' . 19 Page 20: Corollary 1 shows that the group regret for running QuACK with the upper confidence bound algorithm yields the following bound on the group regret: RG(n)≤X a̸=⋆16σ2ln (mn) ∆a+ 3 + 3mX w=1dvw!X a̸=⋆∆a Inspecting the leading order terms reveals: 16η 1 +ϵ 2 σ2ln (mn)>16σ2ln (mn) where the inequality follows from the fact that η >1andϵ >0. Thus, we obtain a strictly better constant multiplying the leading order term. The additive graph dependent terms are different in each bound. Although, we remark that these terms become negligible for nsufficiently large, it is still interesting to compare them. To compare these additive graph dependent terms, we consider specific graph structures, such as regular and star graphs. Throughout, we will assume that the communication matrix is chosen as-per the recommendations of Mart ´ınez-Rubio et al. (2019): P=I−1 1 + max v∈V|Nv|(D−M) where DandMdenote the degree matrix and the adjacency matrix of the graph. Formally, DandMhave the following element-wise definitions: Dvw=( |Nv|ifw=v 0 otherwiseand Mvw=( 1if(v, w)∈E 0otherwise Regular Graphs. Graphs are called δ-regular if every agent has exactly δneighbours, e.g. |Nv|=δfor allv∈V. For these graphs, the expression for Psimplifies and allows us to obtain an expression for the spectral term found in the graph-dependent term of Mart ´ınez-Rubio et al. (2019): P=I+M 1 +δ=⇒λ:=λ2(P) =1 +λ2(M) 1 +δ Chung (1989) provide an upper bound on dfor regular graphs, which we can substitute into the graph- dependent term in the group regret of QuACK-UCB: 3 + 3mX w=1dvw≤3 + 3 d(m−1) ≤3 + 3 (m−1) ln ( m−1) ln 1+δ |1+λ2(M)|  = 3 +3 (m−1) ln ( m−1) ln|λ|−1 = 3 +& 6 (m−1) ln ( m−1)p 2 ln|λ|−1p 2 ln|λ|−1' (⋆) <3 +& 6 (m−1) ln2m ϵ p 2 ln|λ|−1p 2 ln|λ|−1' = 3 +CG−6 ln(m−1)p 2 ln|λ|−1 <CGp 2 ln|λ|−1+m+ 4 20 Page 21: where (⋆)follows from the fact that ln(2m/ϵ)<ln(m−1)forϵ∈(0,1). This establishes a relationship between an upper bound on the graph-dependence of our algorithm with the graph-dependence of Mart ´ınez- Rubio et al. (2019). Generally, this suggests that the graph-dependence of our leader-based approach is smaller for δ-regular graphs where the communication matrix has large values for |λ|. Hence, in these cases, the graph-dependent terms in our bound on the group regret are smaller. Star Graphs. The star graph consists of m−1vertices connected to a single central vertex. Here, we can exactly evaluate the spectrum of the communication matrix. Notably, the spectrum of L:=D−Mis known for this graph (Chung, 1997): λ(L) =  m with multiplicity 1 1 with multiplicity m−2 0 with multiplicity 1 Therefore, the spectrum of the communication matrix Pcan be found by plugging in these quantities: λ(P) = 1−λ(L) m=  0 with multiplicity 1 m−1 mwith multiplicity m−2 1 with multiplicity 1 which gives us λ=λ2(P) =m−1/m. Plugging this into their graph-dependent term yields: CG+m+ 4 = 6mlnm ϵ r 2 ln m m−1 +m+ 4 ≥6mln (m)r 2 ln m m−1+m+ 4 >7m+ 4 where the final inequality follows from the fact that m≥2is required for the multi-agent setting. Running Algorithm 4 will choose the central vertex as the leading agent and this will give make the graph-dependence in our algorithm: 3 +mX w=1dvw=m+ 2 which is strictly smaller than that of Mart ´ınez-Rubio et al. (2019). C.2 Heavy-Tailed Rewards Dubey and Pentland (2020) design a multi-agent extension of the robust upper confidence bound algorithm using the truncated mean estimator. To the best of our knowledge, this is the only work addressing heavy- tailed environments in the multi-agent setting. Before stating their theoretical results, we first define the independence number of a graph: α(G) = max S⊆V{|S|: (v, w)̸∈Efor all (v, w)∈S} which is the size of the largest subset of vertices where no two vertices are adjacent. Dubey and Pentland (2020) develop several algorithms, and their best guarantee on the group regret up to absolute constants is given by: α(Gγ)X a̸=⋆σ1 ϵln (n) ∆1 ϵa+ (mγ·α(Gγ) +α(Gγ) +m)X a̸=⋆∆a 21 Page 22: Here, ϵcontrols the heaviness of the tails for the reward distributions, γis a parameter their algorithm takes as input that governs how far each agent can communicate, and the γ-th power of the graph is defined as: Gγ= (V, E γ)where Eγ= 1{(v, w)∈V×V:dvw≤γ} which adds an edge to the graph whenever dvw≤γ. From Corollary 2, the group regret of running QuACK with the robust upper confidence bound using the truncated-mean estimator has the following upper bound: RG(n)≤ O X a̸=⋆σ1 ϵln (mn) ∆1 ϵa+ mX w=1dvw!X a̸=⋆∆a  Thus, we obtain a smaller leading order term whenever: nα(Gγ)−1≥m Notably, α(Gγ)∈[1, m]for all graphs. Therefore, our graph-dependence is smaller whenever α(Gγ)>1 andm < n , i.e. the number of agents is smaller than the number of rounds.1Specifying γ≥das the input parameter will yields an independence number of 1in the bound Dubey and Pentland (2020). However, their analysis is specific to the robust upper confidence bound algorithm and the truncated mean estimator, whereas Theorem 1 is strictly more general. Dubey and Pentland (2020) have an additive graph-dependent term that displays a trade-off between the input parameter γand the independent number. Generally, increasing γwill decrease the independence number and decreasing γwill increase the independent number. Generally, we expect their additive term to be comparable to ours in many cases. For example, choosing γ=dwill minimise the leading order term in the regret bound and yields an additive graph-dependence of: mγ·α(Gγ) +α(Gγ) +m=md+m+ 1 Conversely Corollary 2 shows that our additive graph-dependence of the order: mX w=1dvw≤m(d−1) C.3 Duelling Bandits Raveh et al. (2024) design a multi-agent extension of various duelling bandit algorithms. To the best of our knowledge, this is this only work addressing the multi-agent duelling bandit problem. Their best guarantee is for an extension of the relative upper confidence bound algorithm (Zoghi et al., 2014). Up to absolute constants, they show that this algorithm achieves an upper bound on the group regret given by: χ(Gγ)X a̸=⋆ln (n) ˜∆a+m2k2(1 + ln (1 + |Nγ ⋆|))˜∆max+ (1 + γ)km˜∆max where 1≤χ(Gγ)≤mdenotes the clique covering number of the γ-th power of the graph and γis an input parameter. From Corollary 3, we can upper bound the group regret when running QuACK with the relative upper confidence bound algorithm: RG(n)≤ O X a̸=⋆ln (mn) ˜∆a+ mX w=1dvw!X a̸=⋆˜∆a  Comparing the leading-order term reveals that we obtain a strictly better leading order term whenever: nχ(Gγ)−1≥m 1Note that m > n implies that all multi-agent bounds on the group regret become linear in the horizon. 22 Page 23: Notably, χ(Gγ)∈[1, m]for all graphs. Therefore, our graph-dependence is smaller whenever χ(Gγ)>1 andm < n , i.e. the number of agents is smaller than the number of rounds. Specifying γ≥das the input parameter will yield a clique covering number of 1in the bound of Raveh et al. (2024). However, their anal- ysis is specific to the relative upper confidence bound algorithm, whereas Theorem 1 is strictly more general. Raveh et al. (2024) have an additive graph-dependent term that is strictly larger. Notably, their graph- dependence is always quadratic in the network size, whereas ours scales as md. Notably, d < m in connected graphs. D Additional Experiments Experimentally, we seek to compare our black-box reduction to existing works, who design and analyse extensions of well-known single agent bandit algorithms. Our simulations were conducted on a personal machine (Intel i5-10310U CPU @ 1.70Ghz x 8). Furthermore, all results are averages over 100independent runs. D.1 Subgaussian Experiments Here, we describe the missing experimental details for the experiments found in the main paper, such as the hyperparameters for existing algorithms. Firstly, Landgren et al. (2016), Mart ´ınez-Rubio et al. (2019) and Lalitha and Goldsmith (2021) are all gossip-based algorithms. Thus, they all require the specification of a communication matrix. All authors recommend choosing this matrix based on the graph Laplacian. Formally, they use the following doubly stochastic matrix: P=I−1 1 + max v∈V|Nv|(D−M) where M,Dandδdenote the adjacency matrix, the degree matrix and the maximum degree of the underlying graph. Below, we provide specific details on additional hyperparameters of each algorithm. • Landgren et al. (2016) (coop-UCB) has been implemented by other researchers (Mart ´ınez-Rubio et al., 2019). Their algorithm requires an exploration hyperparameter γ >1; we choose this parameter based on the empirical results of existing work (Mart ´ınez-Rubio et al., 2019; Lalitha and Goldsmith, 2021). •Exploration Parameter: γ= 1.01. • Mart ´ınez-Rubio et al. (2019) (DDUCB) provide an implementation of their algorithm on Github. Their algorithm requires specifying two hyperparameters; we use the values stated in their theorems and chosen in their experiments. •Exploration Parameter (Theorem 3.2 of Mart ´ınez-Rubio et al. (2019)): η= 2. •Approximation Parameter (Theorem 3.2 of Mart ´ınez-Rubio et al. (2019)): ϵ=1/22. • Lalitha and Goldsmith (2021) (DecTS) provide an implementation of their algorithm on Github. Their algorithm requires the specification of a learning rate. We use the value found in the theoretical results that they use in their experiments. •Learning Rate (Theorem 1 of Lalitha and Goldsmith (2021)): η=m. • Algorithm 1 requires the specification of a leader and a single-player bandit algorithm. •Leader: As Theorem 1 suggests, we select the leader as the solution to the graph median problem for each experimental setting. We do so using Algorithm 4. •Bandit Algorithm: Our goal is to compare how well our black-box reduction performs relative to existing works, who design and analyse specific algorithms designed for the setting. Therefore, we consider UCB (Auer et al., 2002) and Thompson Sampling (Thompson, 1933) to compare with existing literature. 23 Page 24: Our experiments consider three network structures, namely cycle, grid and star graphs with varying network sizes. Figure 2 in the main paper shows the experimental results for the largest network size of m= 196 agents. Here, we present the experimental results for smaller network sizes. Cycle Graph. The cycle graph consists of agents organised in a circle who are connected to their clockwise and anti-clockwise neighbours. Thus, the graph is 2-regular. Section 5 presents the empirical results for a cycle with m= 196 agents. Figure 3 displays the group regret for cycle graphs with varying numbers of agents. Figure 3: Group Regret for Cycle Graphs. Figure 3, shows that QuACK-UCB outperforms the other UCB-based algorithms on all network sizes. As Section C.1 suggests, the performance gap increases with the size of the network. Furthermore, QuACK-TS is competitive with DecTS. Grid Graph. The grid graph consists of agents organised in a square lattice and each agent can have two, three or four neighbours. Section 5 presents the results of our experiments for a grid graph with m= 196 agents. Figure 4 displays the group regret for grid graphs with varying numbers of agents. Figure 4: Group Regret for Grid Graphs. Here, QuACK-UCB out performs the other UCB-based algorithms on all network sizes. Furthermore, QuACK-TS is competitive with DecTS. 24 Page 25: Star Graph. The star graph is a common sub-structure that appears in social networks. This particular graph consists of a single central vertex, who is connected to all other vertices. The non-central vertices have exactly one neighbour, which is the central vertex. Section 5 presents the empirical results for a star with m= 196 agents. However, we also conduct experiments on smaller stars to verify our theoretical results. Figure 4 displays the group regret for cycle graphs with varying numbers of agents. Figure 5: Group Regret for Star Graphs. Here, QuACK-UCB out performs the other UCB-based algorithms on all network sizes. Furthermore, QuACK-TS is competitive with DecTS on the smallest star network. However, QuACK-TS achieves smaller group regret for larger star networks. Network Scaling. Finally, Figure 6 visualises the group regret at the end of the final round as a function of the network size for each graph structure. Figure 6: Network Size Scaling of Group Regret. Notably, we can see the group regret of Coop-UCB and DDUCB scales quickly with the network size for each graph. Conversely, the group regret of QuACK-UCB, QuACK-TS and Dec-TS have a shallower gradient, indicating a better dependence on the network size. 25 Page 26: D.2 Heavy-Tailed Experiments Following Dubey and Pentland (2020), we conduct our experiments on standard α-stable densities. Specifi- cally, α= 1.9in our experiments and we select the location of the densities such that: µa=( 0.7ifa= 1 0.4ifa̸= 1 where A={1,2,···,5}. Our experiments in this setting consider the cycle and grid network structures withm∈ {16,36,64}agents. These experiments use a smaller number of agents and fewer rounds because the computational complexity of computing the truncated mean estimator grows linearly with the number of reward samples for each action.2 Below, we provide specific details of the hyperparameters used for each algorithms. • Dubey and Pentland (2020) obtain the best theoretical guarantees for CMP-UCB and this algorithm partitions the graph into disjoint subsets. Each of these subset has one leader that uses the robust upper confidence bound algorithm. Following their guidance, we select the leaders such that they form the maximal weighted independent set of each graph. •We tune the hyperparameter γof their algorithm such that the networks under consideration have 1,2and4leaders. With 1leader we expect their algorithm to outperform ours. With ≥2leaders, we expect our algorithm to perform best. • Algorithm 1 requires a single-agent heavy-tailed bandit algorithm as input. For a direct comparison, we choose the robust upper confidence bound algorithm with the truncated mean. We select the leader as the arg min of the graph-median problem. Figure 7 displays the group regret for the various cycle and grid graphs. From Section C.2, we expect QuACK will perform better than CMP-UCB when it receives a γsuch that the resulting graph has independence number is greater than 1. Notably, our experimental results align exactly with our theoretical predictions. 2The number of rewards for each action depends implicitly on the network size in the multi-agent setting. 26 Page 27: Figure 7: Group Regret for Cycle and Grid Graphs. 27

---