Paper Content:
Page 1:
QuACK: A Multipurpose Queuing Algorithm for Cooperative
k-Armed Bandits
Benjamin Howson Sarah Filippi Ciara Pike-Burke
Department of Mathematics, Imperial College London
November 1, 2024
Abstract
We study the cooperative stochastic k-armed bandit problem, where a network of magents collaborate
to find the optimal action. In contrast to most prior work on this problem, which focuses on extending
a specific algorithm to the multi-agent setting, we provide a black-box reduction that allows us to extend
any single-agent bandit algorithm to the multi-agent setting. Under mild assumptions on the bandit envi-
ronment, we prove that our reduction transfers the regret guarantees of the single-agent algorithm to the
multi-agent setting. These guarantees are tight in subgaussian environments, in that using a near mini-
max optimal single-player algorithm is near minimax optimal in the multi-player setting up to an additive
graph-dependent quantity. Our reduction and theoretical results are also general, and apply to many differ-
ent bandit settings. By plugging in appropriate single-player algorithms, we can easily develop provably
efficient algorithms for many multi-player settings such as heavy-tailed bandits, duelling bandits and ban-
dits with local differential privacy, among others. Experimentally, our approach is competitive with or
outperforms specialised multi-agent algorithms.
1 Introduction
Stochastic multi-armed bandit problems are a fundamental model for sequential decision-making. Here, a
single agent sequentially interacts with the environment by selecting an action and receiving a reward from
an unknown distribution associated with that action (Lattimore and Szepesv ´ari, 2020). There are numerous
provably efficient algorithms for this setting that employ different mechanisms for balancing the exploration-
exploitation trade-off, such as optimism, posterior sampling, bootstrapping, and soft-max exploration (Auer
et al., 2002; Thompson, 1933; Kveton et al., 2019; Bian and Jun, 2022).
However, decision-making tasks naturally arise in distributed settings, such as recommender systems and
sensor networks (Tekin et al., 2014; Tran-Thanh et al., 2011), where there exist multiple decision-makers
interacting with the environment. For example, large-scale recommender systems are often a distributed sys-
tem of servers, where each server hosts a decision-maker. Every time a user arrives to one of the servers, the
corresponding decision-maker chooses an item to recommend, and the user will provide a reward signal in
response. Upon receiving feedback, the decision-makers can update their own knowledge on the quality of
this item. Furthermore, they can share this information with neighbouring servers to improve future decision-
making at all servers.
Motivated by these applications, we study an extension of the traditional multi-armed bandit model where
there is a network of mdecision-makers who each interact with the same k-armed bandit environment. Each
agent can communicate over the network, allowing for the possibility of collaboration to speed up the learning
process.
1arXiv:2410.23867v1 [cs.LG] 31 Oct 2024
Page 2:
1.1 Related Work
Designing provably efficient algorithms for the multi-player setting is more challenging than the single-agent
setting. Previous work has extended specific single-player algorithms to the multiplayer setting through one
of two methods: gossiping or electing a leader.
Gossiping is a popular technique in distributed computing that uses an iterative averaging procedure to com-
bine information from neighbouring agents to approximate the full network information (Xiao and Boyd,
2004; Duchi et al., 2012). Landgren et al. (2016) and Mart ´ınez-Rubio et al. (2019) combine this procedure
with the upper confidence bound algorithm (Auer et al., 2002). Lalitha and Goldsmith (2021) show that
we can use gossiping to design a provably efficient multi-agent version of Thompson Sampling (Thompson,
1933). Notably, all of these algorithms are asymptotically optimal. However, the analyses of these algorithms
are complex, and specific to the algorithm and variant of the gossiping procedure considered. Furthermore,
the performance depends on the choice of communication matrix that the gossip procedure uses for iterative
averaging. Even when the network is known, choosing this matrix is a non-trivial task, and it is common to
use heuristics (Xiao and Boyd, 2004).
Theleader-based methods require electing a leading agent who directs the exploration of the entire network.
Bar-On and Mansour (2019) consider the non-stochastic setting and analyse the exponential weights algo-
rithm with numerous leaders send their action distributions to the non-leading agents in their neighbourhood.
Wang et al. (2020) elect one leading agent who performs all the exploration using a upper confidence bound
algorithm. Here, the leader does not observe any information from the other agents in the network, who all
play greedily with respect to the empirical means of the leader. Again, these papers analyse extensions of
specific bandit algorithms. Our approach also fits into the category of leader-based approaches. However,
we develop a leader-based black-box reduction where the leading agent can use any bandit algorithm for
decision-making. Specifically, we build upon queuing methods commonly used to handle delayed feedback
(Joulani et al., 2013; Mandel et al., 2015). However, extending these queuing methods to the multi-agents
setting introduces unique theoretical challenges that do not arise in single-agent scenarios.
Additional works investigate variations of the multi-agent bandit problem. Szorenyi et al. (2013) study the
multi-armed bandit problem in peer-to-peer networks where each agent communicates with two other agents
in the network, which are chosen randomly in every round. Yang et al. (2021) and Chen et al. (2023) look
into the setting where each agent on the network plays at different and possibly unknown times. Madhushani
et al. (2021) study the setting where there is imperfect communication between agents. Shahrampour et al.
(2017); Hossain et al. (2021) study the case where every agent has their own distribution over the rewards for
each action. Perhaps the best-studied multi-agent problem is where agents receive zero or degraded rewards
if they play the same action at the same time. See Boursier and Perchet (2024) for a recent survey on this
topic.
1.2 Contributions
We design and analyse an algorithm that reduces cooperative stochastic multi-agent k-armed bandit problems
to the single-agent version of the problem. Our main contributions are as follows:
• We propose a black-box reduction, QuACK, that accepts any single-agent bandit algorithm as input
and immediately extends it to the multi-agent setting.
• Theorem 1 shows that we can upper bound the performance over the entire network in terms of the
guarantees of the chosen single-agent bandit algorithm. Pairing our reduction with a near optimal
algorithm yields a near optimal algorithm for the multi-agent k-armed subgaussian bandit problem, up
to an additive graph-dependent quantity. These results are competitive or better than the case-by-case
analyses of previous works.
• Our theoretical guarantees hold under mild assumptions on the bandit environment. We require that
the distribution of the reward depends only on the chosen action, and each distribution has a finite first
moment. Hence, if there exists a provably efficient algorithm for a single-agent bandit problem, and
2
Page 3:
this bandit problem satisfies the assumptions, our reduction guarantees that it will be provably efficient
in the multi-agent setting. This makes developing provably efficient multi-agent algorithms for various
bandit problems simple. In particular, in Section 4:
•We demonstrate the simplicity by using our reduction to design multi-agent algorithms for heavy-
tailed bandits and duelling bandits, which have all be considered separately in the literature (Land-
gren et al., 2016; Dubey and Pentland, 2020; Raveh et al., 2024). The resulting algorithms have
comparable guarantees to those developed specifically for each setting.
•We develop provably efficient algorithms for new multi-agent settings, such as local differential
privacy, by using an appropriate single-player algorithm (Ren et al., 2020).
• Finally, we perform an experimental comparison to existing works which shows that our reduction is
competitive or outperforms existing methods when paired with a comparable bandit algorithm.
2 Problem Setting
Our paper considers multi-agent variants of stochastic multi-armed bandit problems where we have a finite
set of actions: Asuch that |A|=k. We formalise the network of magents and their connections through an
undirected graph G= (V, E)with:
V={1,2,···, m}
E⊆ {(v, w)∈V×V:v̸=w}.
Here v∈Vrepresents an agent and e∈Erepresents a communication channel between a pair of agents.
Each agent in the network is allowed to communicate with any other agent in their neighbourhood , which we
define for all v∈Vas follows:
Nv={w∈V: (v, w)∈E}.
Each round consists of every agent simultaneously playing their own action, receiving their own reward and
communicating with their neighbours. Formally, for each t∈ {1,2,···, n}:
• Each agent v∈Vplays action Av
t∈ A.
• Each agent v∈Vreceives reward Xv
t∈R.
• Each agent v∈Vcommunicates with w∈Nv.
Throughout, we will make use of a standard and mild assumption on the bandit environment (Lattimore and
Szepesv ´ari, 2020).
Assumption 1. The bandit environment generates feedback that depends only on the chosen action. Letting
Padenote the reward distribution for action a, we assume that:
Xv
t|Av
t=ai.i.d.∼Paandµa=EX∼Pa[X]<∞.
for all t∈N,a∈ A andv∈V.
Additionally, we make a mild assumption on the graph and the communication protocol.
Assumption 2. Messages can be passed from one agent v∈Vto another agent w∈Valong a shortest
path in dvwtime steps, and the graph diameter is finite: d= max v,w∈Vdvw<∞.
3
Page 4:
2.1 Measuring Performance
In the multi-agent bandit problem, agents collaborate to minimise the regret of the entire network, also known
as the group regret :
RG(n) =X
a∈A∆aE"mX
v=1Tav(n)#
.
where the expectation is over the network of agents interacting with the bandit environment. Here, ∆a=
µ⋆−µais the sub-optimality gap of action awhere µ⋆= max µadenotes the expected reward of the optimal
action, and
Tav(n) =nX
t=11{Av
t=a} (1)
is the number of times agent v∈Vchooses action a∈ A over the course of nrounds. From standard
arguments, we can deduce a minimax lower bound on the group regret when all reward distributions are
Gaussian (Lattimore and Szepesv ´ari, 2020):
RG(n)≥1
27p
mn(k−1).
This lower bound holds for cases where each agent can communicate immediately with any other agent on
the graph. Hence, it may not be tight for graphs with particular structure. However, it does demonstrate that
information sharing can be used to improve the group performance in the worst-case.
3 A Black-Box Reduction
We present QuACK, a queuing algorithm for cooperative k-armed bandits . QuACK is a multipurpose black-
box reduction that can be paired with any single-agent bandit algorithm to extend it to the multi-agent case.
QuACK requires the index of the leader on the network and a bandit algorithm πas input. The leader will
useπto select actions for the whole network. The leader begins by initialising an empty queue for each
action. These queues will store the rewards that the other agents observe from the bandit environment and
have passed to the leader. Intuitively, the leader can use the reward samples in these queues instead of playing
actions in the real environment. Specifically, in each round, the leader observes the action suggested by π.
If there is a reward in the queue for that action, the leader will take the reward from the queue and pass this
toπ. Importantly, the leader does not play this action, but instead uses the reward from the queue to update
π. This continues until πsuggests playing an action whose queue is empty. In this case, the leader plays the
action and receives a reward from the environment. The leader then communicates this decision to the other,
follower , agents. Once a follower receives instruction from the leader (note that this can take some time since
the message needs to travel over the network), they will play the instructed action, and pass the reward back
to the leader (which can also take some time). Once the leader receives rewards from their followers, they
will place the rewards in the corresponding queues, and begin the process again.
Algorithm 1 presents the pseudo-code for QuACK. For clarity, it is presented assuming that the leader is
given as input and one shortest path between each follower w∈Vand the chosen leader v∈Vhas been
previously identified. Appendix A explains how we can elect a leader in a distributed manner and Section 3.1
suggests a well-known distributed algorithm for computing the paths once we have the leader.
3.1 Message-Passing Protocol
The message a follower w̸=vsends to the leader vin round tcontains only the action and the reward:
(Aw
t, Xw
t).
Sending such simple messages is possible since there is a known fixed route between the leader and any
follower. Computing these routes can be done in a distributed manner before the start of the interaction.
4
Page 5:
Algorithm 1 QuACK
Input: index of the leader v∈V
Input for Leader: bandit algorithm π
Initialisation for Leader: Qa=∅for all a∈ A
fort∈ {1,2,···, n}do
Leader ( v)
Receive messages via shortest path:
Mt={(Aw
t−dvw, Xw
t−dvw)}w̸=v
Append Qa=Qa∪ {(a, x)for(a, x)∈ M t}
Select awithπ
while Qa̸=∅do
Remove xfrom Qa
Update πwith(a, x)
Select awithπ
end while
PlayAv
t=a
Observe Xv
t∼PAv
t
Send Av
tto all w̸=vvia shortest path
Follower ( w̸=v)
ift > d vwthen
Receive Av
t−dvwfrom vvia shortest path
PlayAw
t=Av
t−dvw
else
PlayAw
tuniformly at random
end if
Observe Xw
t∼PAw
t
Send (Aw
t, Xw
t)tovvia shortest path.
end for
Specifically, this amounts to constructing the shortest path tree of the graph which can be done by solving
the single-source shortest-path problem before interacting with the bandit environment (Ahuja et al., 1993).
Figure 1 illustrates the shortest path tree for the grid graph, where the black vertex represents the leader.
Figure 1: Grid Graph and its Shortest Path Tree.
The Distributed Bellman-Ford algorithm can solve the single-source shortest-path problem exactly and it
does so using approximately miterations, whilst only requiring that each agent knows their neighbours and
has a unique identifier (Ford, 1956; Bellman, 1958; Moore, 1959; Elkin, 2020). Appendix A presents full
details of this algorithm for completeness.
Sending instructions from the leader to the followers using the shortest path tree is straightforward. The leader
communicates the action they played to their children, and the children send this action to their children at
5
Page 6:
the end of the next round, and so forth. Using the shortest path tree also makes passing messages to the leader
straightforward. Each follower will forward messages from their children to their parent, and send their own
message to their parents. Thus, in the t-th round, each follower wwill perform the following communication:
• Forward Aw
t=Av
t−dvwto each z∈ Cw
• Send Mv
t= (Aw
t, Xw
t)∪ {Mz
t−1}z∈CwtoPz
Here,CwandPwdenote the children and the parent of agent won the shortest path tree, respectively. The
leader vonly needs to send the action they choose in each round to their children, which corresponds to just
the first item the above list. Thus, the communication complexity, in terms of the number of messages sent in
each round, for each agent wis upper bounded by the diameter of the graph.
Note that we could use any other message-passing protocol that satisfies Assumption 2. For example, each
agent could forward everything they have been sent in the past mrounds. However, in this case, all messages
Mw
twould have to contain additional information about who originally sent the message and the time the
message was sent, e.g. wandt. This additional information is required so that the followers can identify the
most recent action sent to them by the leader. Furthermore, the leader would have to check that Mw
tis not
added to the queue more than once.
3.2 Theoretical Analysis
We now analyse QuACK (Algorithm 1) with an arbitrary single player algorithm π. We will see that the
group regret of QuACK depends on the single-agent regret of πand some graph-dependent quantities.
We define the regret of the single-agent bandit algorithm πover an n-round interaction with the single-player
environment as follows:
Sπ(n) =X
a̸=⋆∆aEP"nX
t=11{at=a}#
where atdenotes the action taken at time t when following πandEP[·]denotes expectation with respect to
the interaction with the single-player environment.
We will show that, by maintaining the queues, the leader creates a simulated single-player version of the
environment for the bandit algorithm. For this, we need to define the number of times the single-agent bandit
algorithm πchooses action a∈ A in totality. This counter includes the number of times the leader, who
operates this single-player algorithm, chooses this action in the bandit environment, as well as the number of
samples taken from the queue. Letting asdenote the s-th action chosen by the bandit algorithm allows us to
define this counter:
T′
a(sn) =nX
t=1stX
s=st−1+11{as=a}. (2)
Here, strepresents the number of simulated rounds the bandit algorithm has completed by the end of the t-th
round. This includes the action chosen by the leader in this round, e.g. Av
t=ast. From Section 3.1, we
know that the leader appends each message to the queue exactly once, which implies that: sn≤mn.
Lemma 1 shows that the bandit algorithm is operating in a simulated version of a single-player environment.
Lemma 1. Under Assumption 1, QuACK guarantees that, for all n:
Sπ(n) =X
a̸=⋆∆aE[T′
a(n)].
Proof. Under Assumption 1, the rewards for each action-agent pair are independent and identically dis-
tributed random variables. Properties of exchangeable random variables guarantee that the sequence of
rewards the leading agent feeds to the bandit algorithm has the same distribution as in the single-player
environment. Therefore, the expectation in the definition of the single-player regret and in the multi-agent
setting are the same. See Appendix B.1 for a formal proof.
6
Page 7:
Recall that the group regret is defined in terms of times the network of agents plays each action. Therefore,
Lemma 2 relates this quantity to the number of times the single-agent bandit algorithm plays each action.
Lemma 2. Running QuACK with an arbitrary v∈Vas the leader guarantees that:
mX
w=1Taw(t−dvw)≤T′
a(st) + 2mX
w=1dvw.
for all t∈N.
Proof. See Appendix B.2
Lemmas 1 and 2 allow us to upper bound the group regret by the regret of the single-player bandit algorithm
plus an additive graph-dependent quantity.
Theorem 1. Under Assumptions 1 and 2, the group regret of QuACK run with single-player bandit algorithm
πand leading agent v∈Vis bounded by
RG(n)≤Sπ(mn) +
3·mX
w=1dvw!kX
a=1∆a.
Proof. Firstly, we upper bound the group plays for action a∈ A at the end of the final round as:
mX
w=1Taw(n)≤mX
w=1Taw(n−dvw) +X
w̸=vdvw. (3)
Combining Equation (3) with Lemma 2 and sn< mn allows us to upper bound the number of times the
entire network plays action aby the number of times πhas played this action:
mX
w=1Taw(n)≤mX
w=1Taw(n−dvw) +X
w̸=vdvw
≤T′
a(mn) + 3X
w̸=vdvw (4)
Plugging Equation (4) into the definition of the group regret gives us:
RG(n) =X
a̸=⋆∆aE"mX
w=1Taw(n)#
≤X
a̸=⋆∆aE[T′
a(mn)] +
3X
w̸=vdvw
X
a̸=⋆∆a
=Sπ(mn) +
3X
w̸=vdvw
X
a̸=⋆∆a
where the final line follows from Lemma 1 and the definition of regret in the single-player setting. See
Appendix B.3 for further details.
Theorem 1 suggests that we should pick:
v⋆= arg min
v∈VmX
w=1dvw,
as the leading agent. This is intuitive, as minimising the sum of the shortest paths will minimise the total delay
in sending and receiving instructions and feedback from the followers. Appendix A presents an approach for
7
Page 8:
finding this agent in a distributed manner. Nevertheless, Assumption 2 guarantees that the worst-case graph-
dependence, regardless of the chosen leading agent, is given by:
mX
w=1dvw≤d(m−1).
It is important to highlight that Theorem 1 holds for any choice of bandit algorithm and makes weak assump-
tions on the rewards. This is in contrast to existing works, where it is common to analyse a specific bandit
algorithm and make stronger assumptions on the reward distributions, such as subgaussian tails. In Section
4 we show that the guarantees for QuACK are competitive with the case-specific analyses that exist in the
literature, despite being strictly more general.
4 Instances of QuACK
Theorem 1 holds under the assumption that the rewards are conditionally independent given the actions and
the expected reward for each of the actions is finite. Therefore, we can apply our reduction in numerous bandit
environments. Here, we present a non-exhaustive selection of bandit environments that satisfy Assumption 1.
For each, we describe how to translate single-agent algorithms to the multi-agent setting, provide bounds on
their group regret, and compare to any existing multi-agent works in that setting. In each setting, we choose
one specific single-player algorithm to use with QuACK. However, any other provably efficient single-player
algorithm could be combined with our reduction to get similar results.
4.1 Subgaussian Bandits
Traditionally, the multi-armed bandit problem consists of playing an action and receiving a reward drawn
independently from a σ-subgaussian distribution that depends only on the chosen action (Lattimore and
Szepesv ´ari, 2020). Thus, Assumption 1 holds. Corollary 1 presents the group regret bound for our algo-
rithm when the leader uses the classic upper confidence bound algorithm (Auer et al., 2002).
Corollary 1. Running QuACK with an arbitrary leader following the upper confidence bound algorithm with
confidence parameter δ= 1/(mn)2guarantees:
RG(n)≤X
a̸=⋆16σ2ln (mn)
∆a+
3 + 3mX
w=1dvw!X
a̸=⋆∆a.
Proof. Theorem 7.1 of Lattimore and Szepesv ´ari (2020) provides an upper bound on the regret of the upper
confidence bound algorithm over a T-round interaction with the bandit environment:
Sπ(T)≤X
a̸=⋆16 ln ( T)
∆a+ 3X
a̸=⋆∆a. (5)
Combining this upper bound with Theorem 1 and taking T=mngives the result.
Mart ´ınez-Rubio et al. (2019) analyse a multi-agent extension of the upper confidence bound algorithm and
offer the best theoretical guarantees in the subgaussian setting. They obtain the following bound on the group
regret:
X
a̸=⋆16η
1 +ϵ
2
σ2ln (mn)
∆a+ (CG+m+ 4)X
a̸=⋆∆a
Here, η >1andϵ∈(0,1)are hyperparameters and CGis a graph-dependent quantity they define as:
CG=&
6mln 2m
ϵ
p
2 ln|λ|−1'
8
Page 9:
where λis the second largest singular value of the communication matrix they use for the gossiping procedure.
Corollary 1 shows a strict improvement in the leading-order term. Comparing the graph-dependence is a little
more challenging and requires specifying the communication matrix. Following their recommendations, we
can show that for regular graphs, our graph-dependence is smaller than a function of theirs:
3mX
w=1dvw<CGp
2 ln|λ|−1
which suggests the dependence in our bounds is better in regular graphs with a small spectral gap. Addition-
ally, we can prove a strict improvement on the graph-dependence for specific graphs, such as the star network.
The amount of per-round communication of the two methods is similar. Mart ´ınez-Rubio et al. (2019) have
each agent send two vectors of length kto each neighbour. Conversely, QuACK sends an action-reward pair
to a subset of the neighbours. See Appendix C for further details.
4.2 Heavy-Tailed Bandits
Bubeck et al. (2013) introduced the single-agent heavy tailed bandit problem. Here, the the subgaussian
assumption on the reward distribution for each action is relaxed, and instead it is assumed that each reward
distribution has finite moments of order 1 +ϵwhere ϵ >0controls the heaviness of the tails. Bubeck et al.
(2013) propose a generic solution for the single-agent setting that they call the robust upper confidence bound
algorithm that replaces the empirical mean with a robust estimator, such as the truncated-mean, median-of-
means, or Catoni’s M (Bubeck et al., 2013).
The only difference between this setting and the subguassian setting is the weaker assumption on the tails
of the rewards. Therefore, Assumption 1 holds and the QuACK algorithm can be used for the multi-agent
heavy-tailed bandit problem with the following regret guarantees.
Corollary 2. Running QuACK with an arbitrary leader and the truncated-mean robust upper confidence
bound algorithm guarantees:
RG(n)≤ O
X
a̸=⋆σ1
ϵln (mn)
∆1
ϵa+ mX
w=1dvw!X
a̸=⋆∆a
where EX∼Pa|X|1+ϵ≤σfor all a∈ A.
Dubey and Pentland (2020) study the multi-agent heavy-tailed bandit problem and devise two algorithms
based on the truncated-mean estimator. Their best algorithm achieves a group regret bound with a similar
form to Corollary 2, except the leading order term involving nis scaled by a graph-dependent quantity. This
graph-dependent term is the independence number of the γ-th power of the graph, where γis a hyperparam-
eter of their algorithm. Thus, our bound is strictly better when this quantity is greater than one, which occurs
when γ < d . Since they also employ a leader-based algorithm that uses a message passing procedure, the
additive terms in their bound and their per-round communication are comparable. See Appendix C for further
details.
4.3 Duelling Bandits
Yue et al. (2009) introduced the single-agent duelling bandit problem. Here, feedback from the environment
is the outcome of a duel between two actions. Modifying Algorithm 1 for the multi-agent duelling bandit
problem amounts to maintaining a queue for each pair of actions: A+=A × A . Now, each agent will
communicate the pair of actions and the outcome of the duel to their neighbours.
Assumption 1 holds since the outcomes of the duels are assumed to depend only on the pair of actions chosen
in each round. Recall that we define the group regret relative to a fixed optimal action. Therefore, Theorem 1
applies to solution concepts, such as the Condorcet, Copeland, and Borda winners (Zoghi et al., 2014, 2015;
Jamieson et al., 2015). For illustrative purposes, we present an upper bound on the group regret when the
9
Page 10:
leader uses the relative upper confidence bound algorithm, which is designed under the assumption that the
Condorcet winner exists (Zoghi et al., 2014).
Corollary 3. Running QuACK with an arbitrary leader following the relative upper confidence bound algo-
rithm of Zoghi et al. (2014) guarantees:
RG(n)≤ O
X
a̸=⋆ln (mn)
˜∆a+ mX
w=1dvw!X
a̸=⋆˜∆a
where ˜∆a=P(a⋆≻a)−1/2denotes the sub-optimality gap of action ain the the duelling bandit setting.
Raveh et al. (2024) develop several algorithms for the multi-agent duelling bandit problem. Their best algo-
rithm builds on the relative upper confidence bound algorithm and achieves a similar guarantee. However,
the leading order term involving nis scaled by the clique covering number of the γ-th power of the graph,
where γis a hyperparameter of their algorithm. Thus, we obtain a strict improvement whenever this quantity
is greater than one, which occurs when γ < d . Since their algorithm employs a message passing procedure,
the additive terms in their bound and the per-round communication are comparable to ours. See Appendix C
for further details.
4.4 Local Differential Privacy
Privacy is a topic of growing interest in the machine learning community and seeks to protect private infor-
mation within datasets. Ren et al. (2020) study local differential privacy (LDP) in the bandit setting, where
the environment uses a mechanism to make the rewards private before sending them to the agent. Formally,
in the multi-agent setting, the only difference is that the rewards each agent receives are privatised:
Xv
t=fϵ(X)forXi.i.d.∼PAv
t
where fϵdenotes the privatising mechanism that ensures ϵ-LDP of the rewards before handing them to the
agent. Procedures for privatising rewards include adding Bernoulli or Laplace noise to the rewards. Proving
that the mechanism provides the specified level of privacy is independent of the bandit algorithm, e.g. Lem-
mas 2 and 5 of Ren et al. (2020). Therefore, we can directly use these mechanisms in the multi-agent setting
and guarantee that the rewards are ϵ-LDP. Currently, no multi-agent algorithms for this problem setting exist.
However, Assumption 1 is satisfied in this setting because the curator adds i.i.d. noise to the rewards. There-
fore, we develop the first provably efficient algorithm for multi-agent bandits with local differential privacy
by plugging an existing single agent algorithm into QuACK.
Corollary 4. Running QuACK with an arbitrary leader following LDP-UCB-L of Ren et al. (2020) is ϵ-LDP
and guarantees:
RG(n)≤ O
X
a̸=⋆1
ϵ2ln (mn)
∆a+ mX
w=1dvw!X
a̸=⋆∆a
.
5 Experimental Results
We compare the performance of QuACK (Algorithm 1) initialised with UCB against coop-UCB (Landgren
et al., 2016) and DDUCB (Mart ´ınez-Rubio et al., 2019). Additionally, we initialise QuACK with Thompson
Sampling to compare against Dec-TS (Lalitha and Goldsmith, 2021).
Our experiments consider a simple bandit environment with ten actions where each reward distribution is
Bernoulli with µ1= 0.5andµa= 0.45otherwise. Following existing works, we conduct our experiments
on several standard graph structures, namely cycle and grid graphs. Additionally, we investigate the star
graph to verify our theoretical insights. All results are averaged over 100independent runs and we represent
uncertainty by shading between the 2.5-th and 97.5-th quantiles.
10
Page 11:
Figure 2: Group Regret for a Network of 196Agents.
Figure 2 compares the group regret of each algorithm across various graph structures. For UCB, QuACK
significantly outperforms existing methods. These results support the claims we made in Section 4 based on
our theoretical results. For Thompson Sampling, we observe that our algorithm is competitive with existing
work on the cycle and grid structures, and performs significantly better on the star structure, also supporting
our theoretical claims. Appendix D displays similar results for additional experimental settings and provides
further details on the settings.
6 Conclusion
This paper proposes a generic black-box algorithm that can extend any single-agent bandit algorithm to the
multi-agent setting. Under mild assumptions on the bandit environment, we show that our black-box ap-
proach immediately transfers the theoretical guarantees of the chosen bandit algorithm from the single agent
setting to the multi-agent setting.
In the subgaussian setting, we can pair our algorithm with any nearly minimax optimal single agent algorithm
and match the minimax lower bound in the multi-agent setting, up to an additive graph-dependent quantity.
We suspect that the lower bound is loose because it fails to capture the structure of the graph and the commu-
nication delays. Proving graph-dependent lower bounds is an interesting open problem.
Additionally, we assumed that each agent can communicate real-valued messages to each neighbour at the
end of every round with arbitrary precision. However, some practical applications impose constraints on the
amount of communication in terms of frequency or the number of bits. Relaxing these assumptions are an
important next step for this line of research.
References
R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network Flows: Theory, Algorithms, and Applications . Prentice-
Hall, 1993.
Peter Auer, Nicol `o Cesa-Bianchi, and Paul Fischer. Finite-time Analysis of the Multiarmed Bandit Problem.
Machine Learning , 2002.
Yogev Bar-On and Yishay Mansour. Individual Regret in Cooperative Nonstochastic Multi-Armed Bandits.
InAdvances in Neural Information Processing Systems , 2019.
Richard Bellman. On a Routing Problem. Quarterly of Applied Mathematics , 1958.
Dimitri P. Bertsekas and Robert G. Gallager. Data Networks . Prentice-Hall, 1992.
11
Page 12:
Jie Bian and Kwang-Sung Jun. Maillard Sampling: Boltzmann Exploration Done Optimally. In Proceedings
of The 25th International Conference on Artificial Intelligence and Statistics , 2022.
Etienne Boursier and Vianney Perchet. A Survey on Multi-player Bandits. Journal of Machine Learning
Research , 2024.
Sebastien Bubeck, Nicolo Cesa-Bianchi, and Gabor Lugosi. Bandits With Heavy Tail. IEEE Transactions on
Information Theory , 2013.
Yu-Zhen Janice Chen, Lin Yang, Xuchuang Wang, Xutong Liu, Mohammad Hajiesmaili, John C. S. Lui, and
Don Towsley. On-Demand Communication for Asynchronous Multi-Agent Bandits. In Proceedings of
The 26th International Conference on Artificial Intelligence and Statistics , 2023.
F. R. K. Chung. Diameters and Eigenvalues. Journal of the American Mathematical Society , 1989.
F. R. K. Chung. Spectral Graph Theory . American Mathematical Society, 1997.
Donald L. Cohn. Measure Theory . Birkh auser/Springer, New York, 2013.
Abhimanyu Dubey and Alex ‘Sandy’ Pentland. Cooperative Multi-Agent Bandits with Heavy Tails. In
Proceedings of the 37th International Conference on Machine Learning , 2020.
J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual Averaging for Distributed Optimization: Convergence
Analysis and Network Scaling. IEEE Transactions on Automatic Control , 2012.
Michael Elkin. Distributed Exact Shortest Paths in Sublinear Time. J. ACM , 2020.
Wan Fokkink. Distributed Algorithms: An Intuitive Approach . The MIT Press, 2013.
Lester Ford. Network Flow Theory . RAND Corporation, 1956.
Safwan Hossain, Evi Micha, and Nisarg Shah. Fair Algorithms for Multi-Agent Multi-Armed Bandits. In
Advances in Neural Information Processing Systems , 2021.
Kevin Jamieson, Sumeet Katariya, Atul Deshpande, and Robert Nowak. Sparse Dueling Bandits. In Pro-
ceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics , 2015.
Pooria Joulani, Andras Gyorgy, and Csaba Szepesvari. Online Learning under Delayed Feedback. In Pro-
ceedings of the 30th International Conference on Machine Learning , 2013.
Branislav Kveton, Csaba Szepesvari, Sharan Vaswani, Zheng Wen, Tor Lattimore, and Mohammad
Ghavamzadeh. Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits. In Pro-
ceedings of the 36th International Conference on Machine Learning , 2019.
Anusha Lalitha and Andrea Goldsmith. Bayesian Algorithms for Decentralized Stochastic Bandits. IEEE
Journal on Selected Areas in Information Theory , 2021.
Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. Distributed Cooperative Decision-Making
in Multi-Armed Bandits: Frequentist and Bayesian Algorithms. In 2016 IEEE 55th Conference on Deci-
sion and Control , 2016.
Tor Lattimore and Csaba Szepesv ´ari.Bandit Algorithms . Cambridge University Press, 2020.
Udari Madhushani, Abhimanyu Dubey, Naomi Leonard, and Alex Pentland. One more step towards reality:
Cooperative bandits with imperfect communication. Advances in Neural Information Processing Systems ,
2021.
Travis Mandel, Yun-En Liu, Emma Brunskill, and Zoran Popovi ´c. The Queue Method: Handling Delay,
Heuristics, Prior Data, and Evaluation in Bandits. In Proceedings of the Twenty-Ninth AAAI Conference
on Artificial Intelligence , 2015.
12
Page 13:
David Mart ´ınez-Rubio, Varun Kanade, and Patrick Rebeschini. Decentralized Cooperative Stochastic Ban-
dits. In Proceedings of the 33rd International Conference on Neural Information Processing Systems ,
2019.
Edward Moore. The Shortest Path Through a Maze . Bell Telephone System., 1959.
Or Raveh, Junya Honda, and Masashi Sugiyama. Multi-Player Approaches for Dueling Bandits. Preprint,
2024.
Wenbo Ren, Xingyu Zhou, Jia Liu, and Ness B. Shroff. Multi-Armed Bandits with Local Differential Privacy.
Preprint, 2020.
Nicola Santoro. Design and Analysis of Distributed Algorithms . John-Wiley & Sons, USA, 2006.
Shahin Shahrampour, Alexander Rakhlin, and Ali Jadbabaie. Multi-armed bandits in multi-agent networks.
In2017 IEEE International Conference on Acoustics, Speech and Signal Processing , 2017.
Balazs Szorenyi, Robert Busa-Fekete, Istvan Hegedus, Robert Ormandi, Mark Jelasity, and Balazs Kegl.
Gossip-based distributed stochastic bandit algorithms. In Proceedings of the 30th International Conference
on Machine Learning , 2013.
Cem Tekin, Simpson Zhang, and Mihaela van der Schaar. Distributed Online Learning in Social Recom-
mender Systems. IEEE Journal of Selected Topics in Signal Processing , 2014.
William R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the
Evidence of Two Samples. Biometrika , 1933.
Long Tran-Thanh, Alex Rogers, and Nicholas Jennings. Long-term information collection with energy har-
vesting wireless sensors: A multi-armed bandit based approach. Autonomous Agents and Multi-Agent
Systems , 2011.
Po-An Wang, Alexandre Proutiere, Kaito Ariu, Yassir Jedra, and Alessio Russo. Optimal Algorithms for Mul-
tiplayer Multi-Armed Bandits. In Proceedings of the Twenty Third International Conference on Artificial
Intelligence and Statistics , 2020.
Lin Xiao and Stephen Boyd. Fast Linear Iterations for Distributed Averaging. Systems & Control Letters ,
2004.
Lin Yang, Yu-Zhen Janice Chen, Stephen Pasteris, Mohammad Hajiesmaili, John C. S. Lui, and Don Towsley.
Cooperative Stochastic Bandits with Asynchronous Agents and Constrained Feedback. In Advances in
Neural Information Processing Systems , 2021.
Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed duelling bandits problem.
InProceedings of the 22nd Conference on Learning Theory , 2009.
Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten Rijke. Relative Upper Confidence Bound for
the K-Armed Dueling Bandit Problem. In Proceedings of the 31st International Conference on Machine
Learning , 2014.
Masrour Zoghi, Zohar Karnin, Shimon Whiteson, and Maarten de Rijke. Copeland Dueling Bandits. In
Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1 ,
2015.
13
Page 14:
A Implementation Details
Section 3 presents QuACK assuming that we have elected a leader and have computed a shortest path tree
for our message passing protocol. Here, we describe how we can perform these steps in a distributed manner
using well-known algorithms.
A.1 Shortest Path Tree
Constructing the shortest path tree rooted at a fixed vertex amounts to solving the single-source shortest path
problem. Traditionally, we can solve this problem in non-distributed systems using the Bellman-Ford algo-
rithm (Ford, 1956; Bellman, 1958). However, using this algorithm would require the existence of an agent
who possesses knowledge of the entire network topology. Therefore, we will present the distributed variant
of this algorithm (Bertsekas and Gallager, 1992). Here, each agent knows only their neighbours.
Algorithm 2 presents the pseudo-code for the Distributed Bellman-Ford algorithm. Briefly, this algorithm
takes a source agent as input, and each agent keeps track of their distances from the source and their parent
on the shortest path tree.
Algorithm 2 Distributed Bellman-Ford ( DBF)
Input. Index of a source agent v
Setd1
v= 0andd1
w=∞forw̸=v
SetPw=∅for each w∈Vto initialise their parent.
fort∈ {1,2,···m−1}do
Each w∈Vsends dt
wtoz∈Nw
Each w∈Vreceives dt
zfrom z∈Nw
Each w∈Vupdates their distance from the source
dt+1
w= min
z∈Nw∪{w}
dt
z+ 1{z̸=w}
Each w∈Vupdates their parent:
Pw=(
arg minz∈Nwdt
z+ 1 ifdt+1
w< dt
w
Pw ifdt+1
w=dt
w
end for
Each w∈Vsends their identifier wto their parent Pw.
Each w∈Vreceives the identifiers from their children and creates their set of children:
Cw={z∈V:Pz=w}
After Algorithm 2 terminates, each agent will know their children and parent on the shortest path tree rooted
at the index of the input agent. Indeed, these are the quantities we require to implement the message-passing
scheme described in Section 3.1.
A.2 Leader Election
Leader Election (LE) is a well-known problem in distributed computing (Santoro, 2006; Fokkink, 2013).
Here, we want the network to unanimously agree upon a single agent as a leader. Generally, there are
impossibility results for anonymous networks (Fokkink, 2013). Therefore, we will assume that each agent
has a unique identifier. Wang et al. (2020) devise a simple leader election scheme that requires at most d+ 1
iterations, where dis the diameter of the graph. Algorithm 3 presents a slight modification of their procedure
that will ease the exposition of finding the best leader.
14
Page 15:
Algorithm 3 Leader Election ( LE)
Each v∈Vsetsfvarbitrarily.
Section A.3 has each agent set fvto their sum of shortest paths.
Each v∈Vsaves their initial value Iv= (v, fv)
Each v∈Vcreates their state value S1
v= (v, fv)
fort∈ {1,2,···, m}do
Each v∈Vreceives St
wfrom w∈Nv
if∃w∈Nvsuch that fw< fvthen
Update St+1
v= (w, fw)
else if∃w∈Nvsuch that fw=fvthen
Update St+1
v= (min {w, v}, fv)
else
St+1
v=St
v
end if
end for
ifIv=Sm+1
v then
Agent vis the leader.
end if
Running Algorithm 3 requires at most miterations because every agent will have seen the identifier of every
other agent in the graph at the end of this iteration. Furthermore, Algorithm 3 will always terminate with
exactly one leader. To see this, we can consider two possible cases.
•Unique Minimum . Here, there exists exactly one agent vwho possesses the smallest fvvalue. There-
fore, they never get to update their state and become the leader.
•Non-Unique Minimum . Let f= min v∈Vfvand suppose multiple agents attain this value. Algorithm
3 implements an index-based tie-breaking rule to handle this case. Notably, agents with fv=fwill
eventually receive an fw=fand they will proceed to check if their index is minimal. Since each agent
has a unique index, there will be exactly one agent who possesses fv=fwith the minimum index.
This agent will never update their state and will become the leader.
A.3 Optimising the Leader
Algorithm 4 Best Leader
forv∈V={1,2,···, m}do
Compute Sum of Shortest Paths
Execute DBF(v)
Each agent now knows dvw(distance from source v)
Each agent now knows Pw(their parent on the shortest path tree with source v)
Each w∈Vcreates message Mw
1= (w, dvw)
fort∈ {1,2,···, m}do
Each w∈Vsends Mw
ttoPw
Each w∈Vupdates Mw
t+1=Mw
t∪ {Mz
t}z∈Cw
end for
Agent vcalculates fv=P
wdvwfrom the messages.
end for
Use Sum of Shortest Paths as fvin Leader Election
Each agent now knows their value for fv=Pm
w=1dvw
Execute LEto elect the leader with minimal sum of shortest paths.
Combining Sections A.2 and A.1 allows us to find the best-positioned leading agent in the network according
15
Page 16:
to our theoretical guarantees. Algorithm 4 presents the pseudo-code for finding and electing the leader.
Briefly, this procedure starts with each agent calling Algorithm 2. Afterwards, we perform several rounds of
message passing so that agent vcan calculate the sum of shortest paths: fv. Finally, Algorithm 3 performs
leader election with fvequal to the sum of shortest paths.
B Missing Proofs
Throughout the main paper we deferred several proofs to the appendix and gave sketches of the proofs for the
main results. Here, we present full proofs of the claims made in the main paper and we do so in chronological
order. Specifically, the remainder of this section will have the following structure:
• Section B.1 presents a proof of Lemma 1 which establishes the equivalence of playing in the bandit
environment and the queued version of the environment created by the feedback from non-leading
agents.
• Section B.2 presents a proof of Lemma 2 which tells us that the difference between the group plays
and the pseudo plays of the leader is almost surely bounded by a graph-dependent constant.
• Section B.3 presents a full proof of Theorem 1 and fills in steps missing from the proof sketches given
in the main paper.
B.1 Proof of Lemma 1
In QuACK, the leader uses rewards from the queues as well as their own observations to simulate an environ-
ment for updating the bandit algorithm π. Recall that the rewards in the queues have been gathered by other
agents in the network when interacting with the environment. Lemma 1 relates the expected number of times
the leader updates the algorithm πfor each action ato the expected number of times the single-agent bandit
algorithm πwould play action aover an n-round interaction with the environment. More precisely it states
that:
X
a̸=⋆∆aEP"nX
t=11{At=a}#
=X
a̸=⋆∆aE[T′
a(n)]. (6)
In Equation (6), the first expectation is with respect to the bandit algorithm πinteracting directly with the
bandit environment. The second expectation is with respect to bandit algorithm πinteracting with the simu-
lated environment created by the leader.
For each action a∈ A, letPabe the distribution over possible rewards when playing action ain the bandit
environment, as defined in Assumption 1. For each a, we also define ˜Pawhich is a probability measure over
the possible rewards for playing action ain the simulated environment created by the leader. In addition let
pπbe the density over action-reward pairs up to time nwhen the single player bandit algorithm πinteracts
with the environment, and let ˜pπbe the density over action-reward pairs in the simulated environment up to
timen. Following the notation of Lattimore and Szepesv ´ari (2020), we can write down these densities as
follows:
pπ(a1, x1,···, an, xn) =nY
t=1π(at|a1, x1,···, at−1, xt−1)pat(xt)
˜pπ(a1, x1,···, an, xn) =nY
t=1π(at|a1, x1,···, at−1, xt−1) ˜pat(xt)
where paand˜padenote the Radon-Nikodym derivatives of Paand˜Pa, respectively. Using these notations
we can define the two expectations in equation (6) more precisely, i.e. for any action a∈ A,
EP"nX
t=11{At=a}#
=Z
HnnX
t=11{at=a}pπ(h)dh
16
Page 17:
whereas
E[T′
a(n)] =Z
HnnX
t=11{at=a}˜pπ(h)dh
where h= (x1, a1, . . . x n, an)∈ H n= (A ×R)nis a sequence of naction-reward pairs. Under Assumption
1 we know that the rewards for each action-agent pair are independent and identically distributed random
variables. Therefore, using exchangeability of independent and identically distributed random variables, we
have that:
Pad=˜Pa
The Radon-Nikodym Theorem tells us that the pais unique, e.g. see Theorem 4.2.2 of Cohn (2013). Combin-
ingPa=˜Pawith this well-known result tells us that pa(x) = ˜pa(x)for all x∈R. Therefore pπ(h) = ˜pπ(h)
for all h∈ H n= (A ×R)nand hence,
Z
Hnf(h)pπ(h)dh=Z
Hnf(h) ˜pπ(h)dh
for any arbitrary function f:Hn→R. Choosing f(h) =Pn
t=11{at=a}completes the proof.
B.2 Proof of Lemma 2
Define the last round that the leader vertex plays action ain the bandit environment as follows:
τ= max
j≤t:Av
j=a
.
Recall stdenotes the pseudo round counter when the leading agent makes their t-th decision in the bandit
environment. Using Equation (2) and leveraging the fact that the round in the previous display is associated
with a play of action agives us the following:
T′
av(st) =stX
s=11{av
s=a}
=sτX
s=11{av
s=a}+stX
s=sτ+11{av
s=a}
=T′
av(sτ) +stX
s=sτ+11{av
s=a}
(⋆)=mX
w=1Taw(τ−dvw) +stX
s=sτ+11{av
s=a}
≥mX
w=1Taw(τ−dvw) (7)
where (⋆)follows from the fact that the leader plays action ain round τso the number of times they play
this action in the queue must be equivalent to the group plays of this action in the bandit environment. Now,
Equation (7) relates the number of times the bandit algorithm has played action ato the number of times the
network has chosen this action. Recall that each follower will play action afor the last time in round:
τw=τ+dvw
where τis the last round that the leader plays action ain the bandit environment. Using Equation (1) with the
definition and properties of τwallows us to get an upper bound on the number of plays for any non-leading
agent was follows:
Taw(t−dvw) =t−dvwX
j=11{Aw
j=a}
17
Page 18:
=τ−dvwX
j=11{Aw
j=a}+t−dvwX
j=τ−dvw+11{Aw
j=a}
(⋆)
≤Taw(τ−dvw) +τ+dvwX
j=τ−dvw+11{Aw
j=a}
≤Taw(τ−dvw) + 2dvw (8)
Notably, (⋆)follows from analysing the two possible cases:
• When t−dvw≥τw=τ+dvw, the inequality is actually an equality because τwis the last round
follower wplays this action up to and including the t-th round.
• Otherwise, t−dvw< τw=τ+dvwand we are extending the summation to include more rounds
which gives us the inequality.
Summing Equation (8) over all agents and substituting into Equation (7) gives us:
mX
w=1Taw(t−dvw)≤mX
w=1Taw(τ−dvw) + 2mX
w=1dvw (Equation (8))
≤T′
a(st) + 2mX
w=1dvw (Equation (7))
as required.
B.3 Proof of Theorem 1
Within Section 3, we provided a nearly complete proof. Here, we present each step for completeness. Firstly,
we can rewrite the group plays for action aat the end of the final round as follows:
mX
w=1Taw(n) =Tav(n) +X
w̸=vTaw(n)
=Tav(n) +X
w̸=vnX
t=11{Aw
t=a}
=Tav(n) +X
w̸=vn−dvwX
t=11{Aw
t=a}+X
w̸=vnX
t=n−dvw+11{Aw
t=a}
=Tav(n) +X
w̸=vTaw(n−dvw) +X
w̸=vnX
t=n−dvw+11{Aw
t=a}
≤Tav(n) +X
w̸=vTaw(n−dvw) +X
w̸=vdvw (9)
Combining Equation (9) with Lemma 2 allows us to upper bound the group plays by the number of times the
leader plays the action in the queued version of the environment:
mX
w=1Taw(n)≤Tav(n) +X
w̸=vTav(n−dvw) +X
w̸=vdvw (Equation (9))
=mX
w=1Taw(n−dvw) +mX
w=1dvw (Since dvv= 0)
≤T′
av(sn) + 3X
w̸=vdvw (10)
18
Page 19:
where the final line follows from Lemma 2. Plugging Equation (10) allows us to get an upper bound on the
group regret:
RG(n) =X
a̸=⋆∆aE"mX
w=1Taw(n)#
≤X
a̸=⋆∆aE
T′
av(sn) + 3X
w̸=vdvw
(Equation (10))
=X
a̸=⋆∆aE[T′
av(sn)] +
3X
w̸=vdvw
X
a̸=⋆∆a
≤X
a̸=⋆∆aE[T′
av(mn)] +
3X
w̸=vdvw
X
a̸=⋆∆a (Since sn≤mn)
=Sπ(mn) +
3X
w̸=vdvw
X
a̸=⋆∆a (11)
where the final line follows from Lemma 1.
C Regret Bound Comparison
Our theoretical results span several multi-agent bandit problems, which include: subgaussian rewards, heavy-
tailed rewards, duelling bandits and bandits with local differential privacy, amongst any other bandit problems
that satisfied Assumption 1. Here, we compare our bounds to existing works in the literature.
C.1 Subgaussian Rewards
Several works design algorithms under a subgaussian assumption on the reward distributions (Landgren et al.,
2016; Mart ´ınez-Rubio et al., 2019; Lalitha and Goldsmith, 2021). Mart ´ınez-Rubio et al. (2019) show that their
guarantees improve upon those shown by Landgren et al. (2016). Lalitha and Goldsmith (2021) remark that
their analysis does not provide guarantees as tight as those found in previous works (Landgren et al., 2016;
Mart ´ınez-Rubio et al., 2019). Thus, we focus on a comparison with DDUCB of Mart ´ınez-Rubio et al. (2019).
Mart ´ınez-Rubio et al. (2019) use a gossiping procedure. This requires a communication matrix, P∈Rm×m.
Notably, Pis doubly stochastic and must respect the structure of the graph, e.g. Pvw= 0 if(v, w)̸∈E.
Throughout, we will require the eigenvalues of the Pand follow the authors in assuming that the eigenvalues
of this matrix satisfy the following:
1 =λ1(P)>|λ2(P)| ≥ |λ3(P)| ≥ ··· ≥ | λm(P)| ≥0
Theorem 3.2 of Mart ´ınez-Rubio et al. (2019) prove the following upper bound on the group regret of their
algorithm, DDUCB:
X
a̸=⋆16η
1 +ϵ
2
σ2ln (mn)
∆a+ (CG+m+ 4)X
a̸=⋆∆a
Here, η > 1andϵ∈(0,η−1/7(η+1)are tune-able hyperparameters and CGis a graph-dependent quantity
defined as:
CG=&
6mln 2m
ϵ
p
2 ln|λ|−1'
.
19
Page 20:
Corollary 1 shows that the group regret for running QuACK with the upper confidence bound algorithm yields
the following bound on the group regret:
RG(n)≤X
a̸=⋆16σ2ln (mn)
∆a+
3 + 3mX
w=1dvw!X
a̸=⋆∆a
Inspecting the leading order terms reveals:
16η
1 +ϵ
2
σ2ln (mn)>16σ2ln (mn)
where the inequality follows from the fact that η >1andϵ >0. Thus, we obtain a strictly better constant
multiplying the leading order term.
The additive graph dependent terms are different in each bound. Although, we remark that these terms
become negligible for nsufficiently large, it is still interesting to compare them. To compare these additive
graph dependent terms, we consider specific graph structures, such as regular and star graphs. Throughout,
we will assume that the communication matrix is chosen as-per the recommendations of Mart ´ınez-Rubio
et al. (2019):
P=I−1
1 + max v∈V|Nv|(D−M)
where DandMdenote the degree matrix and the adjacency matrix of the graph. Formally, DandMhave
the following element-wise definitions:
Dvw=(
|Nv|ifw=v
0 otherwiseand Mvw=(
1if(v, w)∈E
0otherwise
Regular Graphs. Graphs are called δ-regular if every agent has exactly δneighbours, e.g. |Nv|=δfor
allv∈V. For these graphs, the expression for Psimplifies and allows us to obtain an expression for the
spectral term found in the graph-dependent term of Mart ´ınez-Rubio et al. (2019):
P=I+M
1 +δ=⇒λ:=λ2(P) =1 +λ2(M)
1 +δ
Chung (1989) provide an upper bound on dfor regular graphs, which we can substitute into the graph-
dependent term in the group regret of QuACK-UCB:
3 + 3mX
w=1dvw≤3 + 3 d(m−1)
≤3 +
3 (m−1) ln ( m−1)
ln
1+δ
|1+λ2(M)|
= 3 +3 (m−1) ln ( m−1)
ln|λ|−1
= 3 +&
6 (m−1) ln ( m−1)p
2 ln|λ|−1p
2 ln|λ|−1'
(⋆)
<3 +&
6 (m−1) ln 2m
ϵ
p
2 ln|λ|−1p
2 ln|λ|−1'
= 3 +CG−6 ln(m−1)p
2 ln|λ|−1
<CGp
2 ln|λ|−1+m+ 4
20
Page 21:
where (⋆)follows from the fact that ln(2m/ϵ)<ln(m−1)forϵ∈(0,1). This establishes a relationship
between an upper bound on the graph-dependence of our algorithm with the graph-dependence of Mart ´ınez-
Rubio et al. (2019). Generally, this suggests that the graph-dependence of our leader-based approach is
smaller for δ-regular graphs where the communication matrix has large values for |λ|. Hence, in these cases,
the graph-dependent terms in our bound on the group regret are smaller.
Star Graphs. The star graph consists of m−1vertices connected to a single central vertex. Here, we can
exactly evaluate the spectrum of the communication matrix. Notably, the spectrum of L:=D−Mis known
for this graph (Chung, 1997):
λ(L) =
m with multiplicity 1
1 with multiplicity m−2
0 with multiplicity 1
Therefore, the spectrum of the communication matrix Pcan be found by plugging in these quantities:
λ(P) = 1−λ(L)
m=
0 with multiplicity 1
m−1
mwith multiplicity m−2
1 with multiplicity 1
which gives us λ=λ2(P) =m−1/m. Plugging this into their graph-dependent term yields:
CG+m+ 4 =
6mln m
ϵ
r
2 ln
m
m−1
+m+ 4
≥6mln (m)r
2 ln
m
m−1+m+ 4
>7m+ 4
where the final inequality follows from the fact that m≥2is required for the multi-agent setting. Running
Algorithm 4 will choose the central vertex as the leading agent and this will give make the graph-dependence
in our algorithm:
3 +mX
w=1dvw=m+ 2
which is strictly smaller than that of Mart ´ınez-Rubio et al. (2019).
C.2 Heavy-Tailed Rewards
Dubey and Pentland (2020) design a multi-agent extension of the robust upper confidence bound algorithm
using the truncated mean estimator. To the best of our knowledge, this is the only work addressing heavy-
tailed environments in the multi-agent setting. Before stating their theoretical results, we first define the
independence number of a graph:
α(G) = max
S⊆V{|S|: (v, w)̸∈Efor all (v, w)∈S}
which is the size of the largest subset of vertices where no two vertices are adjacent. Dubey and Pentland
(2020) develop several algorithms, and their best guarantee on the group regret up to absolute constants is
given by:
α(Gγ)X
a̸=⋆σ1
ϵln (n)
∆1
ϵa+ (mγ·α(Gγ) +α(Gγ) +m)X
a̸=⋆∆a
21
Page 22:
Here, ϵcontrols the heaviness of the tails for the reward distributions, γis a parameter their algorithm takes
as input that governs how far each agent can communicate, and the γ-th power of the graph is defined as:
Gγ= (V, E γ)where Eγ= 1{(v, w)∈V×V:dvw≤γ}
which adds an edge to the graph whenever dvw≤γ. From Corollary 2, the group regret of running QuACK
with the robust upper confidence bound using the truncated-mean estimator has the following upper bound:
RG(n)≤ O
X
a̸=⋆σ1
ϵln (mn)
∆1
ϵa+ mX
w=1dvw!X
a̸=⋆∆a
Thus, we obtain a smaller leading order term whenever:
nα(Gγ)−1≥m
Notably, α(Gγ)∈[1, m]for all graphs. Therefore, our graph-dependence is smaller whenever α(Gγ)>1
andm < n , i.e. the number of agents is smaller than the number of rounds.1Specifying γ≥das the input
parameter will yields an independence number of 1in the bound Dubey and Pentland (2020). However, their
analysis is specific to the robust upper confidence bound algorithm and the truncated mean estimator, whereas
Theorem 1 is strictly more general.
Dubey and Pentland (2020) have an additive graph-dependent term that displays a trade-off between the input
parameter γand the independent number. Generally, increasing γwill decrease the independence number and
decreasing γwill increase the independent number. Generally, we expect their additive term to be comparable
to ours in many cases. For example, choosing γ=dwill minimise the leading order term in the regret bound
and yields an additive graph-dependence of:
mγ·α(Gγ) +α(Gγ) +m=md+m+ 1
Conversely Corollary 2 shows that our additive graph-dependence of the order:
mX
w=1dvw≤m(d−1)
C.3 Duelling Bandits
Raveh et al. (2024) design a multi-agent extension of various duelling bandit algorithms. To the best of our
knowledge, this is this only work addressing the multi-agent duelling bandit problem. Their best guarantee
is for an extension of the relative upper confidence bound algorithm (Zoghi et al., 2014). Up to absolute
constants, they show that this algorithm achieves an upper bound on the group regret given by:
χ(Gγ)X
a̸=⋆ln (n)
˜∆a+m2k2(1 + ln (1 + |Nγ
⋆|))˜∆max+ (1 + γ)km˜∆max
where 1≤χ(Gγ)≤mdenotes the clique covering number of the γ-th power of the graph and γis an input
parameter. From Corollary 3, we can upper bound the group regret when running QuACK with the relative
upper confidence bound algorithm:
RG(n)≤ O
X
a̸=⋆ln (mn)
˜∆a+ mX
w=1dvw!X
a̸=⋆˜∆a
Comparing the leading-order term reveals that we obtain a strictly better leading order term whenever:
nχ(Gγ)−1≥m
1Note that m > n implies that all multi-agent bounds on the group regret become linear in the horizon.
22
Page 23:
Notably, χ(Gγ)∈[1, m]for all graphs. Therefore, our graph-dependence is smaller whenever χ(Gγ)>1
andm < n , i.e. the number of agents is smaller than the number of rounds. Specifying γ≥das the input
parameter will yield a clique covering number of 1in the bound of Raveh et al. (2024). However, their anal-
ysis is specific to the relative upper confidence bound algorithm, whereas Theorem 1 is strictly more general.
Raveh et al. (2024) have an additive graph-dependent term that is strictly larger. Notably, their graph-
dependence is always quadratic in the network size, whereas ours scales as md. Notably, d < m in connected
graphs.
D Additional Experiments
Experimentally, we seek to compare our black-box reduction to existing works, who design and analyse
extensions of well-known single agent bandit algorithms. Our simulations were conducted on a personal
machine (Intel i5-10310U CPU @ 1.70Ghz x 8). Furthermore, all results are averages over 100independent
runs.
D.1 Subgaussian Experiments
Here, we describe the missing experimental details for the experiments found in the main paper, such as the
hyperparameters for existing algorithms. Firstly, Landgren et al. (2016), Mart ´ınez-Rubio et al. (2019) and
Lalitha and Goldsmith (2021) are all gossip-based algorithms. Thus, they all require the specification of a
communication matrix. All authors recommend choosing this matrix based on the graph Laplacian. Formally,
they use the following doubly stochastic matrix:
P=I−1
1 + max v∈V|Nv|(D−M)
where M,Dandδdenote the adjacency matrix, the degree matrix and the maximum degree of the underlying
graph. Below, we provide specific details on additional hyperparameters of each algorithm.
• Landgren et al. (2016) (coop-UCB) has been implemented by other researchers (Mart ´ınez-Rubio et al.,
2019). Their algorithm requires an exploration hyperparameter γ >1; we choose this parameter based
on the empirical results of existing work (Mart ´ınez-Rubio et al., 2019; Lalitha and Goldsmith, 2021).
•Exploration Parameter: γ= 1.01.
• Mart ´ınez-Rubio et al. (2019) (DDUCB) provide an implementation of their algorithm on Github. Their
algorithm requires specifying two hyperparameters; we use the values stated in their theorems and
chosen in their experiments.
•Exploration Parameter (Theorem 3.2 of Mart ´ınez-Rubio et al. (2019)): η= 2.
•Approximation Parameter (Theorem 3.2 of Mart ´ınez-Rubio et al. (2019)): ϵ=1/22.
• Lalitha and Goldsmith (2021) (DecTS) provide an implementation of their algorithm on Github. Their
algorithm requires the specification of a learning rate. We use the value found in the theoretical results
that they use in their experiments.
•Learning Rate (Theorem 1 of Lalitha and Goldsmith (2021)): η=m.
• Algorithm 1 requires the specification of a leader and a single-player bandit algorithm.
•Leader: As Theorem 1 suggests, we select the leader as the solution to the graph median problem
for each experimental setting. We do so using Algorithm 4.
•Bandit Algorithm: Our goal is to compare how well our black-box reduction performs relative to
existing works, who design and analyse specific algorithms designed for the setting. Therefore,
we consider UCB (Auer et al., 2002) and Thompson Sampling (Thompson, 1933) to compare
with existing literature.
23
Page 24:
Our experiments consider three network structures, namely cycle, grid and star graphs with varying network
sizes. Figure 2 in the main paper shows the experimental results for the largest network size of m= 196
agents. Here, we present the experimental results for smaller network sizes.
Cycle Graph. The cycle graph consists of agents organised in a circle who are connected to their clockwise
and anti-clockwise neighbours. Thus, the graph is 2-regular. Section 5 presents the empirical results for a
cycle with m= 196 agents. Figure 3 displays the group regret for cycle graphs with varying numbers of
agents.
Figure 3: Group Regret for Cycle Graphs.
Figure 3, shows that QuACK-UCB outperforms the other UCB-based algorithms on all network sizes. As
Section C.1 suggests, the performance gap increases with the size of the network. Furthermore, QuACK-TS
is competitive with DecTS.
Grid Graph. The grid graph consists of agents organised in a square lattice and each agent can have two,
three or four neighbours. Section 5 presents the results of our experiments for a grid graph with m= 196
agents. Figure 4 displays the group regret for grid graphs with varying numbers of agents.
Figure 4: Group Regret for Grid Graphs.
Here, QuACK-UCB out performs the other UCB-based algorithms on all network sizes. Furthermore,
QuACK-TS is competitive with DecTS.
24
Page 25:
Star Graph. The star graph is a common sub-structure that appears in social networks. This particular
graph consists of a single central vertex, who is connected to all other vertices. The non-central vertices have
exactly one neighbour, which is the central vertex. Section 5 presents the empirical results for a star with
m= 196 agents. However, we also conduct experiments on smaller stars to verify our theoretical results.
Figure 4 displays the group regret for cycle graphs with varying numbers of agents.
Figure 5: Group Regret for Star Graphs.
Here, QuACK-UCB out performs the other UCB-based algorithms on all network sizes. Furthermore,
QuACK-TS is competitive with DecTS on the smallest star network. However, QuACK-TS achieves smaller
group regret for larger star networks.
Network Scaling. Finally, Figure 6 visualises the group regret at the end of the final round as a function of
the network size for each graph structure.
Figure 6: Network Size Scaling of Group Regret.
Notably, we can see the group regret of Coop-UCB and DDUCB scales quickly with the network size for each
graph. Conversely, the group regret of QuACK-UCB, QuACK-TS and Dec-TS have a shallower gradient,
indicating a better dependence on the network size.
25
Page 26:
D.2 Heavy-Tailed Experiments
Following Dubey and Pentland (2020), we conduct our experiments on standard α-stable densities. Specifi-
cally, α= 1.9in our experiments and we select the location of the densities such that:
µa=(
0.7ifa= 1
0.4ifa̸= 1
where A={1,2,···,5}. Our experiments in this setting consider the cycle and grid network structures
withm∈ {16,36,64}agents. These experiments use a smaller number of agents and fewer rounds because
the computational complexity of computing the truncated mean estimator grows linearly with the number of
reward samples for each action.2
Below, we provide specific details of the hyperparameters used for each algorithms.
• Dubey and Pentland (2020) obtain the best theoretical guarantees for CMP-UCB and this algorithm
partitions the graph into disjoint subsets. Each of these subset has one leader that uses the robust upper
confidence bound algorithm. Following their guidance, we select the leaders such that they form the
maximal weighted independent set of each graph.
•We tune the hyperparameter γof their algorithm such that the networks under consideration have
1,2and4leaders. With 1leader we expect their algorithm to outperform ours. With ≥2leaders,
we expect our algorithm to perform best.
• Algorithm 1 requires a single-agent heavy-tailed bandit algorithm as input. For a direct comparison,
we choose the robust upper confidence bound algorithm with the truncated mean. We select the leader
as the arg min of the graph-median problem.
Figure 7 displays the group regret for the various cycle and grid graphs. From Section C.2, we expect QuACK
will perform better than CMP-UCB when it receives a γsuch that the resulting graph has independence
number is greater than 1. Notably, our experimental results align exactly with our theoretical predictions.
2The number of rewards for each action depends implicitly on the network size in the multi-agent setting.
26
Page 27:
Figure 7: Group Regret for Cycle and Grid Graphs.
27