Paper Content:
Page 1:
arXiv:2501.03187v1 [cs.AI] 6 Jan 2025Turn-based Multi-Agent Reinforcement Learning Model Chec king
Dennis Gross
Institute for Computing and Information Sciences, Radboud University, Toernooiveld 212, 6525 EC Nijmegen,
The Netherlands
dgross@science.ru.nl
Keywords: Turn-based Multi-Agent Reinforcement Learning, Model Che cking
Abstract: In this paper, we propose a novel approach for verifying the c ompliance of turn-based multi-agent reinforce-
ment learning (TMARL) agents with complex requirements in s tochastic multiplayer games. Our method
overcomes the limitations of existing verification approac hes, which are inadequate for dealing with TMARL
agents and not scalable to large games with multiple agents. Our approach relies on tight integration of
TMARL and a verification technique referred to as model check ing. We demonstrate the effectiveness and
scalability of our technique through experiments in differ ent types of environments. Our experiments show
that our method is suited to verify TMARL agents and scales be tter than naive monolithic model checking.
1 INTRODUCTION
AI technology has revolutionized the game in-
dustry ( Berner et al. ,2019 ), enabling the creation
of agents that can outperform human players us-
ingturn-based multi-agent reinforcement learning
(TMARL) (Silver et al. ,2016 ). TMARL consists of
multiple agents, where each one learns a near-optimal
policy based on its own objective by making observa-
tions and gaining rewards through turn-based interac-
tions with the environment ( Wong et al. ,2022 ).
The strength of these agents can also be a problem,
limiting the gameplay experience and hindering the
design of high-quality games with non-player charac-
ters (NPCs) ( Svelch ,2020 ;Nam et al. ,2022 ). Game
developers want to ensure that their TMARL agents
behave as intended, and tracking their rewards can al-
low them to fine-tune their performance. However, re-
wards are not expressive enough to encode more com-
plex requirements for TMARL agents, such as ensur-
ing that a specific sequence of events occurs in a par-
ticular order ( Littman et al. ,2017 ;Hahn et al. ,2019 ;
Hasanbeig et al. ,2020 ;Vamplew et al. ,2022 ).
This paper addresses the challenge of verifying
the compliance of TMARL agents with complex
requirements by combining TMARL with rigorous
model checking (Baier and Katoen ,2008 ). Rigorous
model checking is a formal verification technique
that uses mathematical models to verify the correct-
ness of a system with respect to a given property.
It is called ”rigorous” because it provides guaran-
tees of correctness based on rigorous mathematicalreasoning and logical deductions. In the context of
this paper, rigorous model checking is used to ver-
ify TMARL agents. The system being verified is the
TMARL system, which is modeled as a Markov deci-
sion process (MDP) treating the collection of agents
as a joint agent , and the property is the set of re-
quirements that the agents must satisfy. Our proposed
method1supports a broad range of properties that
can be expressed by probabilistic computation tree
logic (PCTL; Hansson and Jonsson ,1994 ). We eval-
uate our method on different TMARL benchmarks
and show that it outperforms naive monolithic model
checking2.
To summarize, the main contributions of this pa-
per are:
1. rigorous model checking of TMARL agents,
2. a method that outperforms naive monolithic
model checking on different benchmarks.
The paper is structured in the following way. First,
we summarize the related work and position our
paper in it. Second, we explain the fundamentals
of our technique. Then, we present the TMARL
model checking method and describe its function-
alities and limitations. After that, we evaluate
1GitHub-Repository: https://github.com/DennisGross/COOL-MC/tree/marko
2Naive monolithic model checking is called ”naive” be-
cause it does not take into account the complexity of the
system or the number of possible states it can be in, and it
is called ”monolithic” because it treats the entire system a s
a single entity, without considering the individual compo-
nents of the system or the interactions between them.
Page 2:
our method in multiple environments from the AI
and model checking community ( Lee and Togelius ,
2017 ;Even-Dar et al. ,2006 ;Abu Dalffa et al. ,2019 ;
Hartmanns et al. ,2019 ). The empirical analysis
shows that the TMARL model checking method can
effectively check PCTL properties of TMARL agents.
2 RELATED WORK
PRISM (Kwiatkowska et al. , 2011 ) and
Storm (Hensel et al. ,2022 ) are tools for for-
mal modeling and analysis of systems that ex-
hibit uncertain behavior. PRISM is also a lan-
guage for modeling discrete-time Markov chains
(DTMCs) and MDPs. We use PRISM to model
the TMARL environments as MDPs. Until now,
PRISM and Storm do not allow verifying TMARL
agents. PRISM-games ( Kwiatkowska et al. ,2018 )
is an extension of PRISM to verify stochas-
tic multi-player games (including turn-based
stochastic multi-player games). Various works
about turn-based stochastic game model check-
ing have been published ( Kwiatkowska et al. ,
2022 ,2019 ;Li et al. ,2020 ;Hansen et al. ,2013 ;
Kucera ,2011 ). None of them focus on TMARL
systems. TMARL has been applied to mul-
tiple turn-based games ( Wender and Watson ,
2008 ;Pagalyte et al. ,2020 ;Silver et al. ,2016 ;
Videga´ ın and Garc´ ıa-S´ anchez ,2021 ;Pagalyte et al. ,
2020 ). The major work about model checking for
RL agents focuses on single RL agents ( Wang et al. ,
2020 ;Hasanbeig et al. ,2020 ;Hahn et al. ,2019 ;
Hasanbeig et al. , 2019 ; Fulton and Platzer ,
2019 ;Sadigh et al. ,2014 ;Bouton et al. ,2019 ;
Chatterjee et al. ,2017 ). However, model checking
work exists for cooperative MARL ( Riley et al. ,
2021a ;Khan et al. ,2019 ;Riley et al. ,2021b ), but
no work for TMARL. Therefore, with our research,
we try to close the gap between TMARL and model
checking.
3 BACKGROUND
In this section, we introduce the fundamentals of our
work. We begin by summarizing the modeling and
analysis of probabilistic systems, which forms the ba-
sis of our approach to check TMARL agents. We then
describe TMARL in more detail.3.1 Probabilistic Systems
Aprobability distribution over a set Xis a function
µ:X→[0,1]with ∑x∈Xµ(x) =1. The set of all dis-
tributions on Xis denoted Distr(X).
Definition 3.1 (Markov Decision Process) .A
Markov decision process (MDP) is a tuple M=
(S,s0,A,T,rew)where Sis a finite, nonempty set of
states; s0∈Sis an initial state; Ais a finite set of ac-
tions; T:S×A→Distr(S)is a probability transition
function; rew:S×A→Ris a reward function.
We employ a factored state representation where
each state sis a vector of features (f1,f2,...,fn)where
each feature fj∈Zfor 1≤i≤n(nis the dimen-
sion of the state). The available actions in s∈S
areA(s) ={a∈A|T(s,a)/ne}ationslash=⊥}. An MDP with
only one action per state ( ∀s∈S:|A(s)|=1) is a
DTMC. A path of an MDP Mis an (in)finite sequence
τ=s0a0,r0− −− → s1a1,r1− −− →..., where si∈S,ai∈A(si),
ri:=rew(si,ai), and T(si,ai)(si+1)/ne}ationslash=0. A state s′
is reachable from state sif there exists a path τfrom
state sto state s′. We say a state sis reachable if sis
reachable from s0.
Definition 3.2 (Policy) .Amemoryless deterministic
policy for an MDP Mis a function π:S→Athat maps
a state s∈Sto an action a∈A(s).
Applying a policy πto an MDP Myields an
induced DTMC , denoted as D, where all non-
determinism is resolved. A state sis reachable by a
policy πifsis reachable in the DTMC induced by π.
We specify the properties of a DTMC via the specifi-
cation language PCTL ( Wang et al. ,2020 ).
Definition 3.3 (PCTL Syntax) .Let APbe a set
of atomic propositions. The following gram-
mar defines a state formula: Φ::=true|a|Φ1∧
Φ2| ¬Φ|P⊲⊳p|Pmax
⊲⊳p(φ)|Pmin
⊲⊳p(φ)where a∈AP,⊲⊳∈
{<,>,≤,≥},p∈[0,1]is a threshold, and φis a
path formula which is formed according to the fol-
lowing grammar φ::=XΦ|φ1Uφ2|φ1Fθtφ2|GΦ
with θ={<,≤}.
For MDPs, PCTL formulae are interpreted over
the states of the induced DTMC of an MDP and
a policy. In a slight abuse of notation, we use
PCTL state formulas to denote probability values.
That is, we sometimes write P⊲⊳p(φ)where we
omit the threshold p. For instance, in this paper,
P(F collision )denotes the reachability probability
of eventually running into a collision. There exist
a variety of model checking algorithms for verify-
ing PCTL properties ( Courcoubetis and Yannakakis ,
1988 ,1995 ). PRISM ( Kwiatkowska et al. ,2011 )
and Storm ( Hensel et al. ,2022 ) offer efficient and
Page 3:
Agent 1 π(s)
Environmenta
s,r
Figure 1: This diagram represents a single RL system in
which an agent (Agent 1) interacts with an environment.
The agent observes a state (denoted as s) and a reward (de-
noted as r) from the environment based on its previous ac-
tion (denoted as a). The agent then uses this information to
select the next action, which it sends back to the environ-
ment.
mature tool support for verifying probabilistic sys-
tems ( Kwiatkowska et al. ,2011 ;Hensel et al. ,2022 ).
Definition 3.4 (Turn-based stochastic multi-player
game) .Aturn-based stochastic multi-player game
(TSG) is a tuple(S,s0,I,A,(Si)i∈I,T,{rew i}i∈I)where
Sis a finite, nonempty set of states; s0∈Sis an initial
state; Iis a finite, nonempty set of agents; Ais a finite,
nonempty set of actions available to all agents; (Si)i∈I
is a partition of the state space S;T:S×A→[0,1]is a
transition function; and rew i:Si×A→Ris a reward
function for each agent i.
Each agent i∈Ihas a policy πi:Si→Athat maps
a state si∈Sito an action ai∈A. The joint policy π
induced by the set of agent policies {πi}i∈Iis the map-
ping from states into actions and transforms the TSG
into an induced DTMC.
3.2 Turn-based Multi-Agent
Reinforcement Learning (TMARL)
We now introduce TMARL. The standard learning
goal for RL is to find a policy πin an MDP such
that πmaximizes the expected discounted reward,
that is,E[∑L
t=0γtRt], where γwith 0≤γ≤1 is the
discount factor, Rtis the reward at time t, and L
is the total number of steps ( Kaelbling et al. ,1996 ).
TMARL extends the RL idea to find near-optimal
agent policies πiin a TSG setting (compare Figure 1
with Figure 2 ). Each policy πiis represented by a
neural network. A neural network is a function pa-
rameterized by weights θi. The neural network policy
πican be trained by minimizing a sequence of loss
functions J(θi,s,ai)(Mnih et al. ,2013 ).
4 Model Checking of TMARL
agents
We now describe how to verify trained TMARL
agents. Recall, the joint policy πinduced by the set of
all agent policies {πi}i∈Iis a single policy π. The tool
COOL-MC (Gross et al. ,2022 ) allows model check-EnvironmentAgent 1 π1(s)
Agent 2 π2(s)a1
s1,r1 a2
s2,r2
Figure 2: This diagram represents a TMARL system in
which two agents (Agent 1 and Agent 2) interact in a turn-
based manner with a shared environment. The agents re-
ceive states (denoted as s1ands2) and rewards (denoted as
r1andr2) from the environment based on their previous
actions (denoted as a1anda2). The agents then use this in-
formation to select their next actions, which they send back
to the environment.
State s
Extract turn from state s
Which turn? π1(s) π2(s)
Action aturn=1 turn=2
a1 a2Joint Policy Wrapper
Figure 3: An example of a joint policy wrapper with two
policies. The wrapper takes in a state (denoted as s) and
extracts the current turn from that state. It then uses this
information to determine which of two policies ( π1andπ2)
should choose the next action. The selected policy then pro-
duces an action, which is output by the joint policy wrapper.
ing of a single RL policy πagainst a user-provided
PCTL property P(φ)and MDP M. Thereby, it builds
the induced DTMC Dincrementally ( Cassez et al. ,
2005 ).
To verify a TMARL system, we model it as a nor-
mal MDP. We have to extend the MDP with an addi-
tional feature called turn that controls which agent’s
turn it is. To support joint policies π(s), and there-
fore multiple TMARL agents, we created a joint pol-
icy wrapper that queries the corresponding TMARL
agent policy at every turn (see Figure 3 ). With the
joint policy wrapper, we build the induced DTMC the
following way. For every state sthat is reachable via
the joint policy π, we query for an action a=π(s).
In the underlying MDP M, only states s′that may be
reached via that action a∈A(s)are expanded. The
resulting DTMC induced by Mandπis fully deter-
ministic, as no action choices are left open and ready
for efficient model checking.
Limitations. Our method allows the model check-
ing of probabilistic policies by always choosing the
action with the highest probability at each state. We
Page 4:
support any environment that can be modeled us-
ing the PRISM language ( Kwiatkowska et al. ,2011 ).
However, our method does not consider PCTL prop-
erties with the reward operator ( Hansson and Jonsson ,
1994 ). When creating the joint policy, there is no sep-
aration of which agent receives which reward.
TSGs with more than two agents must han-
dle inactive agents who no longer participate in
the game. Our method handles this by allow-
ing non-participating agents to only apply actions
that only change the turn feature, allowing the next
agent to make a move. This must be considered
when using the expected time step PCTL opera-
torHansson and Jonsson (1994 ).
Our method is independent of the learning algo-
rithm and allows for the model checking of TMARL
policies that select their actions based on the cur-
rent and fully-observed state. For simplicity, we fo-
cus on TMARL agents with the same action space
in this paper. However, extending our method to
support TMARL agents with different action spaces
and different partial observations for different agents
is straightforward.
5 EXPERIMENTS
We now evaluate our proposed model checking
method in multiple environments.
5.1 Setup
In this section, we provide an overview of the ex-
perimental setup. We first introduce the environ-
ments, followed by the trained TMARL agents. Next,
we describe the model checking properties that we
used, and finally, we provide details about the tech-
nical setup.
Environments. Pokemon is an environment from
the game franchise Pokemon that was developed
by Game Freak and published by Nintendo in
1996 ( Freak ,1996 ). It is used in the Showdown AI
competition ( Lee and Togelius ,2017 ). In a Pokemon
battle, two agents fight one another until the Poke-
mon of one agent is knocked out (see Figure 4 ). The
impact of randomness in Pokemon is significant, and
much of the game’s competitive strategy comes from
accurately estimating stochastic events. The dam-
age calculation formula for attacks includes a random
multiplier between the values of 0.85 and 1.0. Each
Pokemon has four different attack actions (tackle,
punch, poison, sleep) and can use items (for example,
heal pots to recover its hit points (HP)). The attacks
Figure 4: This screenshot shows a scene from the Show-
down AI competition, in which two Pokemon characters
are engaged in a battle. We model this scene in PRISM.
The AI-controlled Pokemons use different policies πito try
and defeat their opponent. The outcome of the battle will
depend on the abilities and actions of the two Pokemons, as
well as on random elements.
tackle andpunch decrease the opponent’s Pokemon
HP,poison decreases the HP over multiple turns, and
sleep does not allow the opponent’s Pokemon to at-
tack for multiple turns. All actions in Pokemon have
a success rate, and there is a chance that they fail.
S={(turn,done,HP0,sleeping 0,poisoned 0,
healpots 0,sleeps 0,poisons 0,punches 0,
HP1,sleeping 1,poisoned 1,healpots 1,sleeps 1,
poisons 1,punches 1),...}
Act={sleep,tackle,heal,punch,poison}
rew i=
if agent iwins:
5000
+max(100−HP1−0.2∗(100−HP0),0)
otherwise:
max(100−HP1−0.2∗(100−HP0),0)
The main purpose of this environment is to show that
it is possible to verify TMARL agents in complex en-
vironments. The main difference to the Showdown AI
competition is that each agent observes the full game
state, each agent has only the previously mentioned
action choices, and our environment allows one Poke-
mon per agent.
The multi-armed bandit problem (MABP) is a
problem in which a fixed limited set of resources
must be allocated between competing choices in a
way that maximizes their expected gain when each
choice’s properties are only partially known at the
time of allocation and may become better under-
stood as time passes or by allocating resources to the
choice ( Even-Dar et al. ,2006 ;Shahrampour et al. ,
Page 5:
Env. Label PCTL Property Query ( P(φ)) =|S| | T|Time (s)
Pokemon won1 P(F won 1) TO TO TO TO
Pokemon (5HP) HP5NoHealP1 P(F won 1) 0.34 213 640 14
Pokemon (5HP) HP5NoHealP2 P(F won 2) 0.66 222 667 14
Pokemon (5HP) usePoisons0HealP1 P(poisons 1=2 Upoisons 1<2) 0.0 238 715 17
Pokemon (20HP) useHeal20P1 P(healpot 1=1 Uhealpot 1=0) 0.65 6720 40315 401
MABP 25 lost1 P(Flost 1) 0.0 50 51 5
MABP 100 lost1 P(Flost 1) 0.0 100 100 296
Tic-Tac-Toe marking order P(((cell 10=0 Ucell 10=2)Ucell 12=2)Ucell 11=2) 1.0 18 30 0 .5
CC CC1KO P(F player 1ko) 0.0 205 281 54
CC CC2KO P(F player 2ko) 0.001 197 273 53
CC CC3KO P(F player 3ko) 0.0 205 281 54
CC collision P(F collision ) 0.36 199 275 47
Table 1: The table presents the results of running probabili stic model checking via our method on various environments. For
each environment, the table lists the label for the probabil istic computation tree logic (PCTL) property query that was used,
the result of the query, the number of states in the environme nt, the number of transitions, and the time it took to run the q uery.
TOindicates that the query did not complete within 24 hours, an d therefore the time taken is unknown.
2017 ). It is a classic RL problem. We transformed
this problem into a turn-based MABP. At each turn,
an agent has to learn which action maximizes its ex-
pected reward.
S={(HP1,HP2,...,HPN,turn,done),...}
Act={bandit 1,bandit 2}
rew i=/braceleftBigg
1, if agent iis alive
0, otherwise
The Tic-Tac-Toe environment is a paper-and-
pencil game for two agents who take turns mark-
ing the empty cells in a 3x3 grid with X or O. The
agent who succeeds in placing three of their marks
in a horizontal, vertical, or diagonal row is the win-
ner ( Abu Dalffa et al. ,2019 ). With a probability of
10%, an agent does not draw in the grid during
its turn.
S={(cell 00,cell 01,cell 02,cell 10,cell 11,cell 12,
cell 20,cell 21,cell 22,turn,done),...}
Act=A mark action per cell.
rew i=/braceleftBigg
500, if agent iwins
0, otherwise
The Coin Collection (CC) environment is a game in
which three agents must collect coins in a 4x4 grid
world without colliding with each other. If an agent
collides with another, the environment terminates. An
agent can attack another agent by standing next to
them with a success rate of 0 .4. Each agent receives a
reward for every round it is not knocked out (its HP is
not 0) and a larger reward for collecting coins. TheCC environment is inspired by the QComp bench-
mark resource gathering ( Hartmanns et al. ,2019 ).
S={(x1,y1,hp1,x2,y2,hp2,x3,y3,hp3,
coin x,coin y,done,turn),...}
Act={up,right,down,le ft,
hitup,hitright,hitdown,hitle ft}
rew i=/braceleftBigg
100, if agent icollects coin
1, otherwise
Trained TMARL agents. In the training results,
agent 1 in Pokemon has an average reward of 818 .23
over 100 episodes, while agent 2 has an average re-
ward of 690 .32 over the same number of episodes
(50,000 episodes in total). In Tic-Tac-Toe (10 ,000
episodes in total), agent 1 has an average reward of
370.0, while agent 2 has an average reward of 100.
In CC (10 ,000 episodes in total), agent 1 has an aver-
age reward of 29 .72, agent 2 has an average reward of
111.58, and agent 3 has an average reward of 117 .69.
The reward of the TMARL agents can be neglected
because we only use them for performance measure-
ments. All of our training runs used a seed of 128, an
ε=0.5.εmin=0.1,εdec=0.9999, γ=0.99, a learn-
ing rate of 0 .0001, batch size of 32, replay buffer size
of 300,000, and a target network replacement interval
of 304. The Pokemon agents have four layers, each
with 256 rectifier neurons. The Tic-Tac-Toe, MABP,
and CC agents have two layers, each with 256 recti-
fier neurons.
Properties. Table 1 presents the property queries of
the trained policies. For example, HP5NoHealP 1 de-
Page 6:
scribes the probability of agent 2 winning the Poke-
mon battle when both Pokemon have HP=5 and no
more heal pots. Note, at this point, that our main goal
is to verify the trained TMARL policies and that we
do not focus on training near-optimal policies.
Technical setup. We executed our benchmarks on
an NVIDIA GeForce GTX 1060 Mobile GPU, 16
GB RAM, and an Intel(R) Core(TM) i7-8750H
CPU @ 2.20GHz x 12. For model checking, we use
Storm 1.7.1 (dev).
5.2 Analysis
In this section, we address the following research
questions:
1. Does our proposed method scale better than naive
monolithic model checking?
2. How many TMARL agents can our method han-
dle?
3. Do the TMARL agents perform specific game
moves?
We will provide detailed answers to these questions
and discuss the implications of our findings.
Does our proposed method scale better than naive
monolithic model checking? In this experiment,
we compare our method with a naive monolithic
model checking in the Pokemon environment. For a
whole Pokemon battle (starting with HP=100, unlim-
ited tackle attacks, 5 punch attacks, 2 sleep attacks,
2 poison attacks, and 3 heal pots), naive monolithic
model checking runs out of memory. On the other
hand, our method runs out of time (time out after
24 hours). However, we can train TMARL agents
in the Pokemon environment and can, for example,
analyze the end game. For instance, our method al-
lows the model checking of environments with HP of
20 and 1 heal pot left, and we can quantify the prob-
ability that Pokemon 1 uses the heal pot with 0 .65
(see useHeal20P1 in Table 1 ). On the other hand, for
naive monolithic model checking, it is impossible to
extract this probability because it runs out of mem-
ory with a model that contains 31 ,502,736 states and
51,6547,296 transitions. However, at some point,
our model checking method is also limited by the
size of the induced DTMC and runs out of mem-
ory ( Gross et al. ,2022 ).
How many TMARL agents can our method han-
dle? We perform this experiment in the MABP en-
vironment with multiple agents because, in this en-
vironment, it is straightforward to show how our25 50 75 100 125
Number of Agents0123456Time in Seconds
Figure 5: The diagram shows the time it takes to build a
state for a TMARL system as the number of agents in the
system increases. The curve in the diagram indicates that
the time it takes to build a state increases exponentially as
the number of agents increases.
method performs with different numbers of TMARL
agents. To evaluate our method, we train multiple
agents using TMARL to play the MABP game with
different numbers of agents. We then compare the
performance of our method to naive monolithic model
checking, and evaluate the scalability of both methods
as we increase the number of agents in the game.
Naive monolithic model checking is unable to ver-
ify (Pmax(F lost 1)) 24 agents in the MABP environ-
ment due to memory constraints. The largest possi-
ble MDP that can be checked using monolithic model
checking contains 23 agents and has 100 ,663,296
states and 247 ,463,936 transitions. In contrast, our
method allows the model checking up to over 100
TMARL agents, and we can verify in each of the
TMARL systems that agent 1 never uses the riskier
bandit (see, for 25 agents, the property query lost1 in
Table 1 ). This experiment shows, that the limitation
of our approach is the action querying time, which in-
creases with the number of agents (see Figure 5 ).
Do the TMARL agents perform specific game
moves? In the Pokemon environment, agent 1
uses a heal pot with only 20 HP remaining (see
useHeal 20P1 inTable 1 ). This is a reasonable strat-
egy in a late game when the agent is low on HP
and needs to restore its HP to avoid being defeated.
Furthermore, agent 1 wins in the end game with
HP=5 and no heal pot left with a probability of
HP5NoHealP 1=0.33, and agent 2 wins with a prob-
ability of HP5NoHealP 2=0.66. In Tic-Tac-Toe,
agent 2 first marks cell 10, then cell 12, and finally
cell 11in a specific order. In the CC environment, we
observe that only the second agent may get knocked
Page 7:
out (CC2KO=0.001) and that a collision occurs in
36% of the cases (collision). Overall, these details
show that our method gives insight into policy behav-
iors in different environments.
6 CONCLUSION
In this work, we presented an analytical method for
model checking TMARL agents. Our method is
based on constructing an induced DTMC from the
TMARL system and using probabilistic model check-
ing techniques to verify the behavior of the agents.
We applied our method to multiple environments and
found that it is able to accurately verify the behavior
of the TMARL agents. Our method can handle sce-
narios that can not be verified using naive monolithic
model checking methods. However, at some point,
our technique is limited by the size of the induced
DTMC and the number of TMARL agents in the sys-
tem.
In future work, we plan to extend our method
to incorporate safe TMARL approaches. This has
been previously done in the single agent RL do-
main ( Jin et al. ,2022 ;Jothimurugan et al. ,2022 ), and
we believe it can also be applied to TMARL systems.
We also plan to combine our proposed method with
interpretable RL techniques ( Davoodi and Komeili ,
2021 ) to better understand the trained TMARL
agents. This could provide valuable insights into the
behavior of the agents.
REFERENCES
Abu Dalffa, M., Abu-Nasser, B. S., and Abu-Naser, S. S.
(2019). Tic-tac-toe learning using artificial neural net-
works.
Baier, C. and Katoen, J. (2008). Principles of model check-
ing. MIT Press.
Berner, C., Brockman, G., Chan, B., Cheung, V ., Debiak, P.,
Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse,
C., J´ ozefowicz, R., Gray, S., Olsson, C., Pachocki, J.,
Petrov, M., de Oliveira Pinto, H. P., Raiman, J., Sali-
mans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskeve r,
I., Tang, J., Wolski, F., and Zhang, S. (2019). Dota
2 with large scale deep reinforcement learning. CoRR ,
abs/1912.06680.
Bouton, M., Karlsson, J., Nakhaei, A., Fujimura, K.,
Kochenderfer, M. J., and Tumova, J. (2019). Rein-
forcement learning with probabilistic guarantees for au-
tonomous driving. CoRR , abs/1904.07189.
Cassez, F., David, A., Fleury, E., Larsen, K. G., and Lime,
D. (2005). Efficient on-the-fly algorithms for the analysisof timed games. In CONCUR , volume 3653 of Lecture
Notes in Computer Science , pages 66–80. Springer.
Chatterjee, K., Novotn´ y, P., P´ erez, G. A., Raskin, J., and
Zikelic, D. (2017). Optimizing expectation with guaran-
tees in pomdps. In AAAI , pages 3725–3732. AAAI Press.
Courcoubetis, C. and Yannakakis, M. (1988). Verifying
temporal properties of finite-state probabilistic program s.
InFOCS , pages 338–345. IEEE Computer Society.
Courcoubetis, C. and Yannakakis, M. (1995). The complex-
ity of probabilistic verification. J. ACM , 42(4):857–907.
Davoodi, O. and Komeili, M. (2021). Feature-based inter-
pretable reinforcement learning based on state-transitio n
models. In SMC , pages 301–308. IEEE.
Even-Dar, E., Mannor, S., Mansour, Y ., and Mahadevan, S.
(2006). Action elimination and stopping conditions for
the multi-armed bandit and reinforcement learning prob-
lems. Journal of machine learning research , 7(6).
Freak, G. (1996). Pokemon series.
Fulton, N. and Platzer, A. (2019). Verifiably safe off-model
reinforcement learning. In TACAS (1) , volume 11427
ofLecture Notes in Computer Science , pages 413–430.
Springer.
Gross, D., Jansen, N., Junges, S., and P´ erez, G. A. (2022).
Cool-mc: A comprehensive tool for reinforcement learn-
ing and model checking. In SETTA . Springer.
Hahn, E. M., Perez, M., Schewe, S., Somenzi, F., Trivedi,
A., and Wojtczak, D. (2019). Omega-regular objectives
in model-free reinforcement learning. In TACAS (1) , vol-
ume 11427 of Lecture Notes in Computer Science , pages
395–412. Springer.
Hansen, T. D., Miltersen, P. B., and Zwick, U. (2013). Strat-
egy iteration is strongly polynomial for 2-player turn-
based stochastic games with a constant discount factor.
J. ACM , 60(1):1:1–1:16.
Hansson, H. and Jonsson, B. (1994). A logic for reason-
ing about time and reliability. Formal Aspects Comput. ,
6(5):512–535.
Hartmanns, A., Klauck, M., Parker, D., Quatmann, T., and
Ruijters, E. (2019). The quantitative verification bench-
mark set. In TACAS (1) , volume 11427 of Lecture Notes
in Computer Science , pages 344–350. Springer.
Hasanbeig, M., Kroening, D., and Abate, A. (2019).
Towards verifiable and safe model-free reinforcement
learning. In OVERLAY@AI*IA , volume 2509 of CEUR
Workshop Proceedings , page 1. CEUR-WS.org.
Hasanbeig, M., Kroening, D., and Abate, A. (2020). Deep
reinforcement learning with temporal logics. In FOR-
MATS , volume 12288 of Lecture Notes in Computer Sci-
ence, pages 1–22. Springer.
Hensel, C., Junges, S., Katoen, J., Quatmann, T., and V olk,
M. (2022). The probabilistic model checker Storm. Int.
J. Softw. Tools Technol. Transf. , 24(4):589–610.
Page 8:
Jin, P., Tian, J., Zhi, D., Wen, X., and Zhang, M. (2022).
Trainify: A cegar-driven training and verification frame-
work for safe deep reinforcement learning. In CAV (1) ,
volume 13371 of Lecture Notes in Computer Science ,
pages 193–218. Springer.
Jothimurugan, K., Bansal, S., Bastani, O., and Alur, R.
(2022). Specification-guided learning of nash equilib-
ria with high social welfare. In CAV (2) , volume 13372
ofLecture Notes in Computer Science , pages 343–363.
Springer.
Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996).
Reinforcement learning: A survey. Journal of artificial
intelligence research , 4:237–285.
Khan, A., Zhang, C., Li, S., Wu, J., Schlotfeldt, B., Tang,
S. Y ., Ribeiro, A., Bastani, O., and Kumar, V . (2019).
Learning safe unlabeled multi-robot planning with mo-
tion constraints. In IROS , pages 7558–7565. IEEE.
Kucera, A. (2011). Turn-based stochastic games. In Lec-
tures in Game Theory for Computer Scientists , pages
146–184. Cambridge University Press.
Kwiatkowska, M., Norman, G., and Parker, D. (2019). Ver-
ification and control of turn-based probabilistic real-tim e
games. In The Art of Modelling Computational Systems ,
volume 11760 of Lecture Notes in Computer Science ,
pages 379–396. Springer.
Kwiatkowska, M., Norman, G., Parker, D., and Santos, G.
(2022). Symbolic verification and strategy synthesis for
turn-based stochastic games. CoRR , abs/2211.06141.
Kwiatkowska, M., Parker, D., and Wiltsche, C. (2018).
PRISM-games: verification and strategy synthesis for
stochastic multi-player games with multiple objectives.
Int. J. Softw. Tools Technol. Transf. , 20(2):195–210.
Kwiatkowska, M. Z., Norman, G., and Parker, D. (2011).
PRISM 4.0: Verification of probabilistic real-time sys-
tems. In CAV, volume 6806 of Lecture Notes in Com-
puter Science , pages 585–591. Springer.
Lee, S. and Togelius, J. (2017). Showdown AI competition.
InCIG, pages 191–198. IEEE.
Li, J., Zhou, Y ., Ren, T., and Zhu, J. (2020). Exploration
analysis in finite-horizon turn-based stochastic games. In
UAI, volume 124 of Proceedings of Machine Learning
Research , pages 201–210. AUAI Press.
Littman, M. L., Topcu, U., Fu, J., Isbell, C., Wen, M., and
MacGlashan, J. (2017). Environment-independent task
specifications via GLTL. CoRR, abs/1704.04341 .
Mnih, V ., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., and Riedmiller, M. A.
(2013). Playing atari with deep reinforcement learning.
CoRR , abs/1312.5602.
Nam, S., Hsueh, C., and Ikeda, K. (2022). Generation
of game stages with quality and diversity by reinforce-
ment learning in turn-based RPG. IEEE Trans. Games ,
14(3):488–501.Pagalyte, E., Mancini, M., and Climent, L. (2020). Go
with the flow: Reinforcement learning in turn-based bat-
tle video games. In IVA, pages 44:1–44:8. ACM.
Riley, J., Calinescu, R., Paterson, C., Kudenko, D., and
Banks, A. (2021a). Reinforcement learning with quan-
titative verification for assured multi-agent policies. In
13th International Conference on Agents and Artificial
Intelligence . York.
Riley, J., Calinescu, R., Paterson, C., Kudenko, D., and
Banks, A. (2021b). Utilising assured multi-agent rein-
forcement learning within safety-critical scenarios. In
KES, volume 192 of Procedia Computer Science , pages
1061–1070. Elsevier.
Sadigh, D., Kim, E. S., Coogan, S., Sastry, S. S., and Seshia,
S. A. (2014). A learning based approach to control syn-
thesis of Markov decision processes for linear temporal
logic specifications. In CDC , pages 1091–1096. IEEE.
Shahrampour, S., Rakhlin, A., and Jadbabaie, A. (2017).
Multi-armed bandits in multi-agent networks. In
ICASSP , pages 2786–2790. IEEE.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L.,
van den Driessche, G., Schrittwieser, J., Antonoglou, I.,
Panneershelvam, V ., Lanctot, M., Dieleman, S., Grewe,
D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap,
T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Has-
sabis, D. (2016). Mastering the game of go with deep
neural networks and tree search. Nat., 529(7587):484–
489.
Svelch, J. (2020). Should the monster play fair?: Reception
of artificial intelligence in Alien: Isolation .Game Stud. ,
20(2).
Vamplew, P., Smith, B. J., K¨ allstr¨ om, J., de Oliveira Ramo s,
G., Radulescu, R., Roijers, D. M., Hayes, C. F., Heintz,
F., Mannion, P., Libin, P. J. K., Dazeley, R., and Foale,
C. (2022). Scalar reward is not enough: a response to
silver, singh, precup and sutton (2021). Auton. Agents
Multi Agent Syst. , 36(2):41.
Videga´ ın, S. and Garc´ ıa-S´ anchez, P. (2021). Performanc e
study of minimax and reinforcement learning agents
playing the turn-based game iwoki. Appl. Artif. Intell. ,
35(10):717–744.
Wang, Y ., Roohi, N., West, M., Viswanathan, M., and
Dullerud, G. E. (2020). Statistically model checking
PCTL specifications on Markov decision processes via
reinforcement learning. In CDC , pages 1392–1397.
IEEE.
Wender, S. and Watson, I. D. (2008). Using reinforcement
learning for city site selection in the turn-based strategy
game civilization IV. In CIG, pages 372–377. IEEE.
Wong, A., B¨ ack, T., Kononova, A. V ., and Plaat, A. (2022).
Deep multiagent reinforcement learning: Challenges and
directions. Artificial Intelligence Review , pages 1–34.
Page 9:
Page 10:
Page 11:
This figure "orcid.png" is available in "png"
format from:
http://arxiv.org/ps/2501.03187v1