Abstract —Chatbots are software typically embedded in Web
and Mobile applications designed to assist the user in a plethora
of activities, from chit-chatting to task completion. They enable
diverse forms of interactions, such as text and voice commands. Like
any software, chatbots are susceptible to bugs, and their
pervasiveness in our lives, as well as the underlying technological
advancements, call for tailored quality assurance techniques.
However, test case generation techniques for conversational
chatbots are still limited. In this paper, we present Chatbot Test
Generator (CTG), an automated testing technique designed for
task-based chatbots. We conducted an experiment comparing
CTG with state-of-the-art BOTIUM and CHARM tools with
seven chatbots, observing that the test cases generated by
CTG outperformed the competitors, in terms of robustness and effectiveness.
Index Terms —Chatbot Testing, Test Automation, Coverage,
Flakiness, Mutation Testing
Task-based chatbots are chatbots designed to deliver func-
tionalities through conversations. Notable examples are chat-
bots for assisting users, booking services, and completing
transactions [1], [2]. In recent years, task-based chatbots
(hereafter simply chatbots) have gained popularity thanks
to their integration into Web and Mobile applications, and
advances in underlying technologies have made them ubiquitous
in a multitude of domains, such as e-commerce, booking, tech
support, healthcare, and more [1]–[5]. Further, chatbots have
become significant for businesses, as
they allow for a reduction in personnel expenses by providing
a service that can be active 24/7.
As chatbots continue to permeate our daily lives supported
by a large number of design platforms, such as Dialogflow [6],
Amazon Lex [7], IBM Watson Assistant [8], and Rasa [9],
ensuring their reliability has become a critical concern. Unlike
regular test case generation approaches that validate software
through interactions taking the form of, for instance, API calls,
HTTP requests, or clicks on visual elements, chatbots require
the generation of actual conversations, that is, natural language
sentences that correspond to actual user requests. Further,
responses also take the form of natural language sentences
that must be interpreted to determine their correctness. It is
thus imperative to design ad-hoc techniques that can generate
test cases that thoroughly exercise software whose interface is
conversational [10], [11].
The initial effort in test case generation for chatbots was
mainly focused on exercising speech recognition capabili-
ties [12]–[16], and other non-functional characteristics [17].
The generation of test cases (i.e., conversations) to thoroughly
validate the functionalities implemented by the chatbot under
test required extensive manual intervention [9], [18]–[24]. So
far, two approaches have targeted automatic test case generation
for chatbots: BOTIUM [25], which is a state-of-the-practice
tool originally developed by Botium GmbH (now Cyara [26]),
and CHARM [27], which extends BOTIUM with the capability
to generate more diverse test cases. However, both approaches
are not able to thoroughly cover the conversational input space
of a chatbot and often generate incorrect test cases, due to the
challenge of predicting the responses that must be produced
by tested chatbots.
To address the need for an automated testing solution
capable of generating robust and effective test cases that can
be used, this paper presents Chatbot Test Generator (CTG),
an automated testing tool designed for Dialogflow task-based
chatbots. CTG exploits the tests generated by BOTIUM as seed
tests that are systematically augmented to produce test cases
that cover alternative conversational paths. Moreover, CTG
is the first dynamic test generation approach for chatbots,
that is, it executes the test cases and records the chatbot
responses to obtain test cases that can be faithfully re-executed
as regression tests, compared to statically generated tests that
may include incorrect oracles. Finally, CTG generates test cases
with setup and teardown operations that prepare and clean
arXiv:2503.05561v1 [cs.SE] 7 Mar 2025
with the generated test cases. We conducted an experiment
comparing CTG with BOTIUM and CHARM, state-of-the-art
tools, observing that the test cases generated by CTG
outperformed those generated by competing approaches, in
terms of both robustness and effectiveness. In particular, CTG
generated 95% correct test cases, compared to BOTIUM and
CHARM that generated 61% and 82% correct tests, respectively.
Moreover, the tests generated by CTG killed the highest number
of mutants in five out of the seven chatbots considered in our
mutation testing experiment. This result suggests that the tests
obtained with CTG can be used to create better regression test
suites than using state-of-the-art approaches.
This paper provides the following contributions: (i) it
proposes CTG, the first dynamic test case generation tool
for Dialogflow task-based chatbots, (ii) it presents a systematic
strategy to generate reliable conversations that cover the
possible conversational paths, including the usage of entity
values, alternative utterances, and setup and teardown of
the testing environment, and (iii) it presents empirical results
that show how CTG outperforms the BOTIUM and CHARM
state-of-the-art tools when generating tests for seven task-based
chatbots.
The paper is organized as follows. Section II introduces
the key elements of task-based chatbots. Section III describes
CTG. Section IV presents our experiment. Section V discusses
related work. Then, Section VI provides final remarks.
Chatbots, also known as virtual assistants or conversational
agents, are software programs designed to interact with human
users through text, voice, images, or other communication
channels [2]. Chatbots can be categorized based on their operational
focus (i.e., informative, conversational, and task-based),
their context (i.e., domain-specific or domain-independent), and
the technology and modules they are composed of (e.g., natural
language processors, speech recognition systems, machine
learning models) [1], [2]. Task-based chatbots are chatbots
designed to efficiently complete specific tasks through user
conversations, prioritizing functionality over casual chat. They
aim to fulfill a user’s intent, such as booking a service or
guiding users through procedures.
Dialogflow [6] is a Natural Language Understanding (NLU)
platform, part of the Google Cloud Platform, and one of the
most popular platforms for multi-language chatbot design [28].
The platform offers a user-friendly Web interface, advanced
machine learning and voice recognition modules for NLU, and
native support with Google Cloud services. Further, it integrates
with existing services (e.g., Slack, Telegram, Messenger) and
custom external services, via Cloud Functions for Firebase.
Finally, it exposes APIs that enable connections with existing
testing tools, such as Botium [25]. In terms of structure,
a Dialogflow chatbot is composed of several components
encapsulating the main concepts of a conversation [29], [30]:
• Intents: they represent the possible goals of a user that
the chatbot considers as tasks to achieve; users express
their intent using utterances (e.g., the intent to order a
pizza can be expressed by the sentence "I would like to
order a pizza");
• Actions: they are the actions that a chatbot may perform
to complete a task; actions can be either plain text
responses/requests (e.g., asking the question "What pizza
would you like to order?"), graphical objects to interact
with, or requests invoking external services;
• Entities: the data types that can be used in a conversation,
either representing simple literals (e.g., "Pepperoni pizza"),
regular expressions, or compositions of other entities;
• Flows: the possible user-bot interactions composing conversations,
which can include context data representing
the time-bounded memory of the chatbot.
Testing a task-based chatbot requires generating conversa-
tions that thoroughly cover the intents, entities, and flows that
are defined in a chatbot.
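As an illustration of these components, the following minimal Python sketch models intents and entities as plain data classes and computes the conversational size of a chatbot (number of intents, entities, and entity values). The class names, the intent name `order_pizza`, and the second entity value are hypothetical and chosen only for the example; flows and actions are omitted, and this is not Dialogflow's own data model.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    """A user goal with the utterances expressing it and the bot responses."""
    name: str
    utterances: list[str] = field(default_factory=list)
    responses: list[str] = field(default_factory=list)


@dataclass
class Entity:
    """A data type of the conversation, reduced here to literal values."""
    name: str
    values: list[str] = field(default_factory=list)


@dataclass
class Chatbot:
    intents: list[Intent] = field(default_factory=list)
    entities: list[Entity] = field(default_factory=list)


def conversational_size(bot: Chatbot) -> tuple[int, int, int]:
    """Return (#intents, #entities, #entity values) of a chatbot."""
    n_values = sum(len(e.values) for e in bot.entities)
    return len(bot.intents), len(bot.entities), n_values


# Hypothetical pizza bot: the names and the second entity value are made up.
pizza_bot = Chatbot(
    intents=[Intent(
        name="order_pizza",
        utterances=["I would like to order a pizza"],
        responses=["What pizza would you like to order?"],
    )],
    entities=[Entity("pizza", ["Pepperoni pizza", "Margherita pizza"])],
)

print(conversational_size(pizza_bot))
```

A test generator that aims at thorough coverage needs at least this information: which utterances trigger each intent and which values each entity accepts.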
Chatbot Test Generator (CTG) is an automated test gener-
ator tool for Dialogflow chatbots, developed in Node.js. Its
architecture is shown in Figure 1.
Starting from a set of BOTIUM seed tests that exercise the
intents defined in the chatbot under test (e.g., one seed test
per intent), CTG generates augmented tests that cover inputs
often ignored by test case generation. The seed tests represent
the backbone of the conversations explored by CTG. Further,
CTG is able to set up and tear down the environment of the
chatbot, preventing any side-effect between test executions.
Unlike other approaches that generate tests for chatbots
using static analysis (i.e., without executing the chatbot), CTG
incrementally generates test cases by sending user messages
and recording bot responses at runtime, thus producing test
cases that resemble actual conversations and are therefore more reliable
for regression testing than statically generated tests. CTG
consists of four key components. The GENERATOR orchestrates
the entire test generation process. The EXPANDER retrieves
alternative utterances and entity values that can be used
as replacements for those present in the seed test cases,
then it unrolls the user-bot interactions to explore additional
conversational paths based on the extracted alternative values,
potentially exercising unexplored intents. The EXECUTOR
runs the generated tests, using the BOTIUM connector module to
interact with the chatbot deployed on Dialogflow, and records
the conversation (i.e., user requests and bot responses). The
CLEANER is configured according to a cleaning routine defined
by the tester to set up and tear down the environment, including
the persistence layer of the chatbot under test, thus favoring
testing in a neutral environment. In the following, we present
the main algorithms of CTG in detail.
A. Generator
Algorithm 1 shows the generateTests function implemented
by the GENERATOR. The function takes as input the seed tests
ST, used as a basis for the generation process, the cleaning
routine CR, used to configure the CLEANER with the logic to
connect to the chatbot and to set up and tear down its back-end,
and the reference to the chatbot under test bot, used to extract
the alternative utterances and entity values from the chatbot,
and returns the collection AT of augmented tests for each seed
test (e.g., AT[st_i][j] is the j-th augmented test for seed test i).

Fig. 1. CTG architecture.
[Figure labels: User requests, Bot responses, Bot Domain, Data cleaning, Seed Tests, Utterances, Data extraction.]

Algorithm 1 CTG test case generation process.
 1  ST: the seed tests
 2  CR: the cleaning routine to set up the connection with the chatbot
 3  bot: the bot under test
 4  AT: the augmented tests
 5  function EXPAND(st, botMsg, bot)    ▷ Alg. 2
 6  function SETUP(CR)                  ▷ Alg. 3
 7  function TEARDOWN(CR)               ▷ Alg. 3
 8
 9  function GENERATETESTS(ST, CR, bot)
10    AT ← ∅
11    for st in ST do
12      startMsgs ← EXP.expand(st, null, bot)
13      for i = 0; i < |startMsgs|; i++ do
14        at ← ()
15        at.seed ← true
16        AT[st] ← AT[st] ∪ {at}
17      end for
18      for i = 0; i < |AT[st]|; i++ do
19        conn ← CLEANER.setUp(CR)
20        if AT[st][i].seed then
21          userMsg ← startMsgs[i]
22          do
23            if |userMsg| = 1 then
24              AT[st][i].addUserMsg(userMsg)
25              botMsg ← EXEC.sendMsg(userMsg, conn)
26              AT[st][i].addBotMsg(botMsg)
27              userMsg ← EXP.expand(st, botMsg, bot)
28            else
29              for j = 1; j < |userMsg|; j++ do
30                at′ ← ()
31                at′.addUserMsg(AT[st][i].getUserMsgs())
32                at′.addUserMsg(userMsg[j])
33                at′.seed ← false
34                AT[st] ← AT[st] ∪ {at′}
35              end for
36              AT[st][i].addUserMsg(userMsg[0])
37              botMsg ← EXEC.sendMsg(userMsg[0], conn)
38              AT[st][i].addBotMsg(botMsg)
39              userMsg ← EXP.expand(st, botMsg, bot)
40            end if
41          while userMsg ≠ null
42        else
43          userMsgs ← AT[st][i].getUserMsgs()
44          for j = 0; j < |userMsgs|; j++ do
45            botMsg ← EXEC.sendMsg(userMsgs[j], conn)
46            AT[st][i].addBotMsg(botMsg)
47          end for
48        end if
49        conn ← CLEANER.tearDown(CR)
50      end for
51    end for
52    return AT
53  end function
The seed tests ST must cover the conversational flows
that should be used to generate alternative conversations.
Although seed tests could be obtained in multiple ways,
including manually implementing them, in our assessment
we used BOTIUM to generate seed tests, since BOTIUM can
automatically generate a set of test cases that exercise the main
conversational flow defined in a chatbot.
The function generateTests iterates over each seed test case
to generate corresponding alternative test cases (lines 11-51).
The GENERATOR starts by using the EXPANDER to retrieve all
alternative messages that a user can use to start a conversation
according to the set of salutation messages expected by the
chatbot (line 12). To perform such a task, the EXPANDER
accesses the chatbot structural files, extracting the alternative
utterances and entity values. For each message that can be
used to start a conversation, a new empty augmented test case
is created and added to the output set (lines 13-17). Each new
test case is flagged as “seed” (line 15), since the algorithm
discriminates tests that cover the seed conversational flow from
the alternatives that may originate from the expansion of the
input data in future interactions (e.g., after the greetings, the
conversation may take a new path if the user orders a pizza
instead of a hamburger).
For each new test case (lines 18-50), the connection to the
chatbot is established through the CLEANER (line 19). If the
test is flagged as “seed” (line 20), one of the available salutation
messages is used to initiate the test (line 21).
Then, the conversation is iteratively built and executed until
no more messages can be sent to the chatbot (lines 22-41).
Within each iteration, if the EXPANDER has not identified any
alternative message to be sent, the only possible message is sent
to the chatbot using the EXECUTOR, which exploits Botium
APIs to communicate with the bot deployed on Dialogflow.
The bot response is collected and encoded as an expectation in the
test, and the next user message to be sent is retrieved from the
alternative messages identified by the EXPANDER (lines 23-27).
If, instead, there are multiple alternatives for the user
message to be sent, for each alternative, a new augmented test
case is created (lines 29-35): every alternative test inherits the
initial part of the conversation already established within the
seed test from which the new test is created (line 31) and such
test is added to the output set of augmented tests associated
with the seed test (line 34). Then, the user-bot interaction
proceeds for the seed test (lines 36-39), using the first user
message among the alternatives that, by construction, indicates
a seed test (i.e., userMsg[0], line 36).
Once an augmented test has reached the end of the execution,
the alternative augmented test cases that are pending are
retrieved and completed using the same strategy (lines 42-48).
Eventually, at the end of each test case interaction with the bot,
the CLEANER is called to restore the chatbot to its original
state (line 49). The aforementioned procedure is repeated until
no more seed tests are present, and finally the whole set of
augmented tests is returned (line 52).
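The generation loop for a single seed test can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the CTG implementation (which is written in Node.js): `expand` and `open_session` are hypothetical callables standing in for the EXPANDER and for the EXECUTOR plus CLEANER, an empty list replaces `null` as the end-of-conversation marker, and the toy bot with its "Booked ..." responses is invented for the example.

```python
def generate_tests(expand, open_session):
    """Grow augmented tests from one seed by recording live bot responses.

    expand(bot_msg) returns the alternative user messages that may follow
    bot_msg (None means "conversation start", an empty list means "stop").
    open_session() returns a send(user_msg) -> bot_msg function.
    """
    start_msgs = expand(None)
    tests = [{"seed": True, "user": [], "bot": []} for _ in start_msgs]
    i = 0
    while i < len(tests):          # the list grows while it is visited
        test = tests[i]
        send = open_session()      # stands in for CLEANER.setUp
        if test["seed"]:
            alternatives = [start_msgs[i]]
            while alternatives:
                # Fork one pending test per extra alternative (prefix + alt).
                for alt in alternatives[1:]:
                    tests.append({"seed": False,
                                  "user": test["user"] + [alt], "bot": []})
                # The seed test continues with the first alternative.
                test["user"].append(alternatives[0])
                reply = send(alternatives[0])
                test["bot"].append(reply)
                alternatives = expand(reply)
        else:
            # Pending test: replay its user messages, record the responses.
            for msg in test["user"]:
                test["bot"].append(send(msg))
        i += 1                     # CLEANER.tearDown would run here
    return tests


def toy_expand(bot_msg):
    """Hypothetical expander for a toy appointment bot."""
    if bot_msg is None:
        return ["Hi, I need an appointment"]
    if "@service" in bot_msg:
        return ["For service1", "For service2"]
    return []


def toy_session():
    """Hypothetical bot: asks for a service, then confirms the booking."""
    def send(user_msg):
        if user_msg.startswith("For "):
            return "Booked " + user_msg[len("For "):]
        return "Hello, for which @service?"
    return send


tests = generate_tests(toy_expand, toy_session)
for t in tests:
    print(t["seed"], t["user"], "->", t["bot"])
```

As in Algorithm 1, only tests flagged as seed keep expanding; each forked test replays its recorded prefix plus one alternative message, and its expectations come from the responses observed at runtime.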
B. Expander
Algorithm 2 shows the expand function implemented by the
EXPANDER. The function takes as input a test case t, a bot
message botMsg occurring in the test, and the reference to the
chatbot bot, and returns all the alternative user responses alt
that can be sent to the chatbot for the message botMsg.
To perform its tasks, the EXPANDER accesses the Dialogflow
data defined for the bot and exploits the BOTIUM test case
structure, outlined in the example in Figure 2. In BOTIUM, a test
case is a conversation that alternates user sentences, occurring
after the tag #me, and bot sentences, occurring after the tag
#bot. Each chatbot intent is defined by user utterances and
corresponding chatbot responses (User Utterances and Bot
Responses sections in Figure 2), which the chatbot uses internally
to train a model that recognizes sentences flexibly. BOTIUM
exploits this dataset to build the seed test cases as finite
sequences of user-bot interactions, where the bot
messages can be used as oracles to compare with actual
responses during test execution.
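The #me/#bot structure described above can be read with a few lines of Python. The sketch below is a simplified reader written for illustration, assuming that each tag sits on its own line and that the following non-empty lines form the message; it is not BOTIUM's own parser, and the function name `parse_convo` is hypothetical.

```python
def parse_convo(text):
    """Split a conversation into (sender, message) turns on #me / #bot tags."""
    turns = []
    sender, lines = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if line in ("#me", "#bot"):
            if sender is not None:
                turns.append((sender, " ".join(lines)))
            sender, lines = line[1:], []   # drop the leading '#'
        elif line and sender is not None:
            lines.append(line)
    if sender is not None:
        turns.append((sender, " ".join(lines)))
    return turns


convo = """
#me
Hi, I need an appointment

#bot
Hello, for which @service?

#me
For service1
"""

turns = parse_convo(convo)
# The bot turns are the expectations (oracles) checked during execution.
oracles = [message for sender, message in turns if sender == "bot"]
print(turns)
print(oracles)
```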
When a sentence refers to an entity, the bot identifies it
with the @ symbol (e.g., in the example, the bot is asking
for the service the user wants). The chatbot has a number of
entity values that can be accepted for each entity. During the
conversation, the user must choose one of them to produce a
response that is understandable by the bot. The EXPANDER
accesses this information to find any alternative utterance and
entity value that can be used in the conversation.
If the bot message to answer is null
(lines 7-9), the conversation has just started.
The user message for which the alternatives must be found
is the one starting the test case (e.g., the first #me message
of Figure 2); then, the EXPANDER can extract the alternative
sentences by accessing the user utterances file associated with
that message ( User Utterances section in Figure 2). If, instead,
the bot message is a prompt referring to one or more required
entity values (lines 10-11), where an entity is determined by
the presence of the “@” symbol, the alternative messages
are obtained by combining the entity values extracted for
each entity from the chatbot (@service values section in
Figure 2); the logic of entity value extraction and combination
is encapsulated by the method getEntityValues, not shown for
space reasons. Otherwise, if the bot message is a simple plain text
response (lines 12-15), the next user message is retrieved from
the test case (line 13), and the alternative user responses are
extracted from the utterances file related to the user message
(line 14). Finally, the set of alternative messages discovered
is returned to the GENERATOR (line 16), at its disposal for the
test generation process.
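The three cases handled by the expand function can be sketched in Python as follows. The data layout (a dictionary of intents and entities, tests as lists of (sender, message) turns) is an assumption made for the example. Since getEntityValues is not shown in the paper, the combination logic used here (Cartesian product of the entity values, joined by spaces) is only a guess, and returning bare values instead of full sentences is a simplification.

```python
import re
from itertools import product


def get_utterances(user_msg, bot):
    """Alternative utterances of the intent that user_msg belongs to."""
    for utterances in bot["intents"].values():
        if user_msg in utterances:
            return list(utterances)
    return [user_msg]


def get_entity_values(entity_names, bot):
    """Assumed combination logic: Cartesian product, joined by spaces."""
    pools = [bot["entities"][name] for name in entity_names]
    return [" ".join(combo) for combo in product(*pools)]


def expand(test, bot_msg, bot):
    """test is a list of (sender, message) turns, sender in {"me", "bot"}."""
    if bot_msg is None:                                  # conversation start
        first = next(m for s, m in test if s == "me")
        return get_utterances(first, bot)
    entity_names = re.findall(r"@(\w+)", bot_msg)
    if entity_names:                                     # prompt for entities
        return get_entity_values(entity_names, bot)
    if ("bot", bot_msg) not in test:                     # response not in seed
        return []
    following = test[test.index(("bot", bot_msg)) + 1:]  # plain text response
    next_user = [m for s, m in following if s == "me"]
    return get_utterances(next_user[0], bot) if next_user else []


# Toy data mirroring the example of Figure 2.
bot = {
    "intents": {"I": ["Hi, I need an appointment", "Hello, I need your service"]},
    "entities": {"service": ["service1", "service2"]},
}
seed = [
    ("me", "Hi, I need an appointment"),
    ("bot", "Hello, for which @service?"),
    ("me", "For service1"),
]

print(expand(seed, None, bot))
print(expand(seed, "Hello, for which @service?", bot))
```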
C. Cleaner
Algorithm 3 shows the setUp and tearDown functions of
the CLEANER component. The setUp function is called at the
beginning of each test generation process (line 19 from Algorithm
1) to establish a connection to the chatbot under test and to its
persistence layer; it takes as input the cleaning routine CR to
inject into the CLEANER the logic to manage the exact cleaning
process for the target chatbot (e.g., how to remove events from a
Google Calendar for a calendar managing chatbot), and returns
a connection conn to communicate with the chatbot. More in
detail, the service account configuration pertaining to the external
service that the chatbot connects to, if any, is built (line 7); a
service account could refer, for instance, to the Google API
managing Google Calendar, thus requiring data such as private
keys and project ids to later connect to such service. Next, the
connection to the chatbot is opened and returned (lines 8-9).
Eventually, the connection pointer and the cleaning routine are
used in the tearDown function, called at the end of each test
generation process (line 49 from Algorithm 1), to restore the
chatbot to its original state, by querying the service for any
persistent item to clean (lines 13-15), such as events from a
given calendar; finally, the connection is closed (lines 16-17).
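The setUp/tearDown pairing maps naturally onto a context manager. The following Python sketch illustrates the idea with a hypothetical in-memory `FakeCalendar` standing in for the external service; the class, its methods, and the `connect` callable are assumptions for the example and do not reflect the CTG or Google Calendar APIs.

```python
from contextlib import contextmanager


class FakeCalendar:
    """Hypothetical external service holding the chatbot's persistent items."""

    def __init__(self):
        self.events = []

    def items(self):
        return list(self.events)

    def clean(self):
        self.events.clear()


@contextmanager
def clean_environment(connect):
    conn = connect()                 # setUp: open the connection
    try:
        yield conn
    finally:                         # tearDown: runs even if the test raises
        if conn is not None and len(conn.items()) >= 1:
            conn.clean()


calendar = FakeCalendar()
with clean_environment(lambda: calendar) as conn:
    conn.events.append("Oil change, Monday 10:00")   # side effect of a test
print(calendar.events)
```

Running the cleanup in a `finally` clause keeps the environment neutral for the next test even when the current one fails midway.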
To study the effectiveness of Chatbot Test Generator (CTG),
we considered the following research questions:
• RQ1-Correctness: How often does CTG generate semantically
correct tests? This research question studies the
reliability of the generated test cases, considering the rate
of wrong tests (i.e., tests that may fail even for passing
executions) that are generated.
• RQ2-Coverage: How thoroughly can CTG exercise
conversations? This research question studies how thoroughly
the generated tests exercise the intents and entities
implemented in chatbots.
• RQ3-Mutations: What is the defect detection capability
of the tests generated by CTG? This research question
studies how well test cases can reveal faults injected in
conversations implemented in task-based chatbots.
The three research questions are investigated by comparing
the test cases generated by CTG with the test cases generated
by the baseline approach BOTIUM [25] and the state-of-the-
art approach CHARM [27], with seven task-based chatbots
selected from open-source third-party repositories. In the
rest of this section, we describe the subject chatbots used
for the evaluation (Section IV-A), the competing techniques
(Section IV-B), the experimental procedure defined to answer
the research questions (Section IV-C), the empirical results
obtained for the three research questions (Sections IV-D, IV-E,
and IV-F), and the threats to validity (Section IV-G).

Algorithm 2 CTG alternatives expander process.
 1  t: the test case to expand data from
 2  botMsg: the bot message used to retrieve the next user message
 3  bot: the bot under test
 4  alt: the alternative responses (e.g., entities, utterances)
 5
 6  function EXPAND(t, botMsg, bot)
 7    if botMsg = null then
 8      userMsg ← getFirstUserMsg(t)
 9      alt ← getUtterances(userMsg, bot)
10    else if botMsg.contains(“@”) then
11      alt ← getEntityValues(botMsg, bot)
12    else
13      userMsg ← getNextUserMsg(t, botMsg)
14      alt ← getUtterances(userMsg, bot)
15    end if
16    return alt
17  end function

Fig. 2. Overview of a BOTIUM test case and Dialogflow chatbot data.
[Figure content. Test case for intent I: "Hi, I need an appointment" / "Hello, for which @service?" / "For service_i". Chatbot data for intent I: User Utterances ("Hi, I need an appointment", "Hello, I need your service", …); Bot Responses ("Hello, for which @service?", "How can I assist you?", …); @service values (service_1, service_2, …, service_n).]

Algorithm 3 CTG cleaning process.
 1  CR: cleaning routine.
 2  conn: the connection session to the chatbot
 3  function CLEAN(CR)    ▷ Extern
 4  function ITEMS()      ▷ Extern
 5
 6  function SETUP(CR)
 7    serviceAccount ← buildService(CR)
 8    conn ← connect(serviceAccount, CR)
 9    return conn
10  end function
11
12  function TEARDOWN(CR)
13    if conn ≠ null ∧ |conn.items()| ≥ 1 then
14      conn.clean(CR)
15    end if
16    conn ← null
17    return null
18  end function

TABLE I
Name                    Domain        # Intents  # Entities (# Values)
E-Commerce              Shopping      11         0 (0)
Room Reservation        Hotels        6          1 (3)
Weather Forecast        Weather       4          0 (0)
Currency Converter      Currencies    3          0 (0)
Temperature Converter   Temperatures  5          2 (12)
Appointments Scheduler  Cars          3          1 (10)
News                    Info          3          1 (9)
A. Subject Chatbots
To investigate the research questions, we selected seven
Dialogflow chatbots from the ASYM0B repository of open-
source third-party chatbots [31], which has already been used
in previous studies [27], [29], [30]. The selected chatbots
are representative of different domains (e.g., hotel booking,
weather forecasting), and designed to cover non-trivial conversa-
tional aspects (e.g., nested intents, input contexts, and back-end
components to interact with external services). Table I reports
the name, the domain, and the conversational size of each
selected chatbot (i.e., the number of intents, custom entities,
and entity values).
Since the Weather Forecast and News chatbots did not
include a configuration file for the back-end component, we
had to define one ourselves to integrate with free API services
that replace the original ones, which are no longer available. We specifically
used the OpenWeather and News APIs. We carefully inspected
and tested the modified chatbots to ensure they preserved their
integrity, conversational structure, and coherence.
B. Competing Techniques
We selected the BOTIUM [25] baseline and the CHARM [27]
state-of-the-art approaches to compare with CTG. BOTIUM [25]
is an automated quality assurance framework for chatbot testing
designed by Botium GmbH (now Cyara [26]), widely used
in both industrial and academic contexts [11], [27]. It offers
support for generating and executing test cases on various
chatbot design platforms, including Dialogflow [6], Amazon
Lex [7] and Rasa [9]. It represents the state-of-the-practice ap-
proach for test case generation for chatbots. CHARM [27] is an
open-source testing tool for Dialogflow chatbots implemented
in Python. Similarly to CTG , it uses BOTIUM as a back-end
for test generation and execution. CHARM extends the test
case generation capabilities of BOTIUM with mutations over
utterances of seed test cases, such as language back-translations
and synonym substitutions.
C. Experimental Procedure
To address RQ1-Correctness, we studied the rate of errors
present in the tests generated by CTG, BOTIUM, and CHARM.
We separately deployed each subject chatbot on Dialogflow.
Then, we used the testing techniques to automatically generate
test cases for each chatbot, configuring each tool to use the
multistepconvos BOTIUM feature, which causes the generation
of test cases that tentatively cover full user-bot conversations,
according to the conversations defined in the tested chatbots. As
seed tests required by the CTG generation phase (Algorithm
1), we used one BOTIUM seed test per intent. To assess the
correctness of the tests, we executed each test 15 times. We
then manually inspected the failures to determine the presence
of semantically wrong test cases, due to malformed utterances,
wrong oracles, incoherent interactions, or flaky interactions¹.
The remaining tests were deemed as correct .
¹ A flaky test is a test that passes and fails periodically without any code
changes [32].
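The repeated-execution step can be illustrated with a small Python sketch that runs a test a fixed number of times and labels it by its pass/fail pattern. The labels and the toy bot that alternates between two greetings are assumptions for the example; in the study, failures were additionally inspected manually to identify their cause, which no script of this kind replaces.

```python
from itertools import cycle


def classify(run_test, runs=15):
    """Run a test repeatedly and label it by its pass/fail pattern."""
    outcomes = [run_test() for _ in range(runs)]
    if all(outcomes):
        return "passing"
    if not any(outcomes):
        return "failing"
    return "flaky"


# Toy bot that alternates between two greetings on successive runs.
greetings = cycle(["Hi!", "Good day!"])
label = classify(lambda: next(greetings) == "Hi!")
print(label)
```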
To address RQ2-Coverage, we executed the correct test
cases resulting from RQ1 and collected the set of intents and
entities exercised by each test. To obtain this information, we
ran the tests in verbose mode and implemented a script to
parse the collected logs, extracting the information about the
executed elements of the conversation. We derived two main
indicators of the capability of the tests to exercise conversations:
intent coverage and entity coverage [33]. Intent coverage is the
percentage of intents in the chatbots that are exercised by the
test cases, and measures the capability of the tests to cover at
least once each user’s intent implemented in the chatbot. Entity
coverage measures the percentage of actual entity values that
are exercised in the tests in comparison to the set of entity
values that the chatbot is designed to recognize. For instance,
the Appointments Scheduler chatbot could be used to schedule
appointments about driver license or vehicle registration (these
are possible values for the entity type AppointmentType); thus,
entity coverage indicates how extensively the conversation has been exercised
in terms of the kind of values used.
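Both indicators reduce to the same ratio over sets, as the following Python sketch shows. The intent names and the placeholder entity values `type_3` to `type_10` are hypothetical; only "driver license" and "vehicle registration" come from the example above. Returning `None` for chatbots that define no entity values is a design choice of this sketch, not something stated in the paper.

```python
def coverage(exercised, defined):
    """Percentage of defined elements that appear among the exercised ones."""
    defined = set(defined)
    if not defined:
        return None   # undefined when the chatbot declares no such element
    return 100.0 * len(defined & set(exercised)) / len(defined)


# Hypothetical intents of a chatbot with three intents.
defined_intents = ["welcome", "book_appointment", "cancel_appointment"]
exercised_intents = ["welcome", "book_appointment"]

# Ten entity values; all but the first two are placeholders.
defined_values = ["driver license", "vehicle registration"] + [
    f"type_{i}" for i in range(3, 11)
]
exercised_values = ["driver license", "vehicle registration"]

intent_cov = coverage(exercised_intents, defined_intents)
entity_cov = coverage(exercised_values, defined_values)
print(round(intent_cov, 1), entity_cov)
```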
To address RQ3-Mutations, we injected mutants into subject
chatbots using Mutabot [30], a mutation testing tool for Dialogflow
conversational chatbots. Mutabot generates mutations
at multiple levels, including structural changes affecting a
conversation (e.g., removing an intent from a chatbot), intents
(e.g., flagging an intent as fallback² or changing its priority³),
and entities (e.g., removing or changing entity definitions). In
our evaluation, we applied the following operators affecting the
main conversational characteristics of the selected chatbots: 1)
intent removal, 2) entity removal, 3) intent parameter removal,
4) intent priority change, 5) intent flagging as fallback, 6) entity
renaming, and 7) entity value change. After generating the
mutants, we manually inspected the resulting chatbots to detect
and discard the equivalent⁴ mutants, which we could determine
by comparing the original and the mutated code elements, and
detecting those that had no impact on the behavior of the
chatbot (e.g., changing the priority of an intent when there are
no other intents to compete with). We independently deployed
the mutated chatbots on Dialogflow, and tested them with the
test suites produced for RQ1, measuring the rate of mutants
killed by each technique. To avoid any side effect and guarantee
replicability and fairness in the results, we reset the state of
the chatbots after each run.
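To make the mutation procedure concrete, the Python sketch below applies an intent removal operator to a toy chatbot and computes the rate of killed mutants for a test suite. The dictionary-based bot, the exact-match `respond` function, and all intent names and responses are assumptions for the example; this is a simplified stand-in and not Mutabot's implementation.

```python
import copy


def remove_intent(bot, name):
    """Intent removal operator: return a mutated copy without the intent."""
    mutant = copy.deepcopy(bot)
    del mutant["intents"][name]
    return mutant


def respond(bot, utterance):
    """Toy NLU: exact utterance match, otherwise the fallback response."""
    for intent in bot["intents"].values():
        if utterance in intent["utterances"]:
            return intent["response"]
    return bot["fallback"]


def kills(test, bot):
    """A test kills a mutant if any recorded expectation no longer holds."""
    return any(respond(bot, user) != expected for user, expected in test)


def mutation_score(tests, mutants):
    """Fraction of mutants killed by at least one test of the suite."""
    killed = sum(any(kills(t, m) for t in tests) for m in mutants)
    return killed / len(mutants)


bot = {
    "intents": {
        "greet": {"utterances": ["Hi"], "response": "Hello!"},
        "book": {"utterances": ["Book a room"], "response": "Which room?"},
    },
    "fallback": "Sorry, I did not get that.",
}
mutants = [remove_intent(bot, "greet"), remove_intent(bot, "book")]

# A suite that only exercises the greeting misses the second mutant.
suite = [[("Hi", "Hello!")]]
print(mutation_score(suite, mutants))
```

The example also shows why coverage matters for fault detection: a suite that never exercises an intent cannot kill a mutant that removes it.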
The resources needed to replicate our experiments, including
tools, chatbots, and generated tests, are publicly available at
D. Results for RQ1-Correctness
Table II lists the number of test cases generated for each
chatbot and technique. Overall, CHARM generated a smaller
set of test cases (106 test cases) compared to BOTIUM and
CTG, since it often failed at generating tests for intents whose
execution is constrained, such as nested intents that can be
exercised only after other intents have been exercised, or intents
requiring some input contexts defined by former interactions.
The consequence is that CHARM fails to exercise several
alternative conversations, in particular in the E-Commerce and
Room Reservation chatbots.
² A fallback intent is designed to respond to the user requests that the chatbot
could not understand.
³ The intent priority determines the order of activation when multiple intents
compete for a user request.
⁴ An equivalent mutant is a mutant that cannot be killed by any test case [34].

TABLE II
Chatbot                 BOTIUM  CHARM  CTG
E-Commerce              196     20     196
Room Reservation        233     21     233
Weather Forecast        11      11     16
Currency Converter      39      17     39
Temperature Converter   15      15     19
Appointments Scheduler  21      12     24
News                    10      10     19
Total                   525     106    546

TABLE III
                        BOTIUM       CHARM        CTG
Chatbot                 SW    C      SW    C      SW    C
E-Commerce              0%    100%   0%    100%   0%    100%
Room Reservation        80%   20%    24%   76%    4%    96%
Weather Forecast        36%   64%    18%   82%    40%   60%
Currency Converter      18%   82%    24%   76%    0%    100%
Temperature Converter   27%   73%    53%   47%    0%    100%
Appointments Scheduler  24%   76%    0%    100%   43%   57%
News                    0%    100%   0%    100%   0%    100%
Total                   39%   61%    18%   82%    5%    95%
On the other hand, the performance of BOTIUM and CTG is
comparable (525 test cases generated by BOTIUM and 546 test
cases generated by CTG), since both techniques are able to
discover alternative execution paths. However, the EXPANDER
component of CTG can also retrieve alternative entity values
(see lines 10-11 of Algorithm 2) when the available user
utterances are incomplete, and the chatbot must interact with
the user to fill the slots with missing entities (e.g., a
user requesting an appointment without specifying the kind
of service forces the chatbot to produce additional
interactions to ask the user what service is needed). If all
the cases are covered by the utterances available in the training
data, CTG does not activate the expansion mechanism and
the conversations tentatively generated by CTG and BOTIUM
are the same. Even when addressing the same conversational
flow, however, CTG and BOTIUM generate different tests, since they
rely on different information: CTG generates tests dynamically,
based on actual bot responses, while BOTIUM generates tests
statically, guessing responses from source files.
Although the number of test cases generated by BOTIUM
and CTG is comparable, the test cases generated by CTG
were more robust than those generated by the other techniques.
Notice that the test generation timings were comparable and
required a few seconds across the various tools. Nevertheless,
since CTG dynamically generates tests by interacting with the
chatbot to record actual responses, the generation time depends
on external services that could potentially introduce delays.
However, no significant impact was observed for the subject
chatbots.

Fig. 3. A flaky test case generated by all the techniques.
[Figure content. Expected response: "Hi! I’m your room booking bot. I can help you to find a perfect room for your meeting and manage your reservations." Actual response: "Good day! I’m your room booking bot. I can help you to find a perfect room for your meeting and manage your reservations."]

Table III reports the results of RQ1. The column SW indicates
the percentage of semantically wrong test cases that fail when
executed, while C indicates the percentage of correct test
cases that can be reliably executed. All techniques generated
no semantically wrong test cases for E-Commerce and News
chatbots, where dialogues could be reliably extracted from
the implementation. Generating reliable test cases was more
challenging for the other chatbots. CTG generated the highest
number of correct test cases for five out of seven chatbots,
while CHARM outperformed the other techniques in the other
two chatbots. Overall, BOTIUM generated 61% correct test
cases, reporting the lowest performance compared to the other
techniques. CHARM achieved 82% correct test cases on average.
CTG outperformed the competing techniques, achieving 95%
correct test cases on average. Note that CHARM, in addition to
generating a slightly smaller percentage of correct test cases
than CTG, generates fewer tests in total.
Semantically wrong test cases, which fail even when the
chatbot responds correctly, have various causes. We identified
four main cases, which we discuss below.
(1) Oracle Weaknesses: Chatbots are often designed to
provide multiple, semantically equivalent, responses to the
same user requests. This can be the source of flakiness, since
the expected response specified in a generated test, although
correct, can differ with respect to the actual response provided
by a chatbot when the test is executed, resulting in the spurious
failure of an assert statement. This limitation is common to
all the compared techniques, since BOTIUM serves as the
backbone for test execution, where the oracle is defined by
default through text matching between the chatbot expected
and actual responses. Figure 3 shows an example of a flaky test
case generated by all the compared techniques for the Room
Reservation chatbot. The test case simply submits a salutation
message and encodes an expectation about the response of the
chatbot (column Expected). However, the Room Reservation
chatbot can respond in multiple ways. For example, the actual
response could start with Good day rather than Hi, causing
a test failure. Implementing a flexible oracle strategy is an
open challenge for all compared techniques.
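
The spurious failure described in (1) can be reproduced with a minimal sketch. The function names exact_match_oracle and any_of_oracle are hypothetical, and the relaxed oracle is only one possible mitigation, assuming that the set of equivalent responses is known in advance. It is not a feature of BOTIUM, CHARM, or CTG.

```python
# Hypothetical sketch: the function names and the "any of" relaxation are
# assumptions, not part of BOTIUM, CHARM, or CTG.

EXPECTED = ("Hi! I'm your room booking bot. I can help you to find a perfect "
            "room for your meeting and manage your reservations.")
ACTUAL = ("Good day! I'm your room booking bot. I can help you to find a "
          "perfect room for your meeting and manage your reservations.")


def exact_match_oracle(expected: str, actual: str) -> bool:
    """Pass only if the actual response equals the recorded one."""
    return expected.strip() == actual.strip()


def any_of_oracle(accepted: set, actual: str) -> bool:
    """Pass if the actual response is one of the known equivalent variants."""
    return actual.strip() in {a.strip() for a in accepted}


# The exact oracle fails although the response is semantically correct.
flaky_verdict = exact_match_oracle(EXPECTED, ACTUAL)
# Listing both greetings as acceptable removes the spurious failure.
relaxed_verdict = any_of_oracle({EXPECTED, ACTUAL}, ACTUAL)
```

The relaxed oracle only shifts the problem to enumerating the variants, which is why a general flexible oracle remains open.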
(2) Dirty Environment: Flakiness may also occur when the
same scenario affecting the chatbot state is exercised multiple
times without proper cleanup of the state. While CTG uses
the CLEANER component to properly set up and tear down the
environment, the test cases generated by both BOTIUM and
CHARM may interfere with each other.
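
The interference described in (2) can be illustrated with a toy example. ToyRoomBot, its messages, and its reset() hook are hypothetical and stand in for a chatbot backed by persistent state; the sketch does not reproduce the CLEANER component itself.

```python
# Hypothetical toy bot: the class, its messages, and the reset() hook are
# assumptions used only to illustrate state leaking between test runs.

class ToyRoomBot:
    def __init__(self) -> None:
        self.booked = set()

    def reply(self, utterance: str) -> str:
        room = utterance.removeprefix("book ").strip()
        if room in self.booked:
            return f"Sorry, {room} is already booked."
        self.booked.add(room)
        return f"{room} booked."

    def reset(self) -> None:
        """Tear-down hook: discard all state created by a test."""
        self.booked.clear()


def run_test(bot: ToyRoomBot, clean: bool) -> bool:
    """Run one recorded test; optionally clean the state afterwards."""
    try:
        return bot.reply("book room A") == "room A booked."
    finally:
        if clean:
            bot.reset()


# Without cleanup the second execution of the same test fails.
dirty_bot = ToyRoomBot()
dirty_runs = [run_test(dirty_bot, clean=False) for _ in range(2)]

# With cleanup both executions observe the same initial state.
clean_bot = ToyRoomBot()
clean_runs = [run_test(clean_bot, clean=True) for _ in range(2)]
```

The same test passes once and then fails when the state is left dirty, which is the flaky behavior that setup and teardown are meant to prevent.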
(3) Dynamic Responses: Both BOTIUM and CHARM generate
test cases statically by combining the user utterances and the
bot responses as they are taken from the chatbot design files,
so they can miss any dynamic response that may depend
on the actions executed by the chatbot. On the other hand,
CTG can collect dynamic responses according to the status of
the conversation, as shown in Figure 4 for the Appointments
Scheduler chatbot, where the final chatbot responses depend
on the actual finalization of the user request (e.g., if a slot is
available, the appointment will be confirmed).

Fig. 4. Test cases generated by CTG capturing dynamic responses.
  Scenario #1
    User: I would like to set an appointment for 3pm on Tuesday
    Bot:  What services are you looking to get? DMV offers Driver
          license and vehicle registration services.
    User: Driver License
    Bot:  Let me see if we can fit you in on 2024-05-07 at 15:00!
          Yes It is fine!
  Scenario #2
    User: I would like to set an appointment for 3pm on Tuesday
    Bot:  What services are you looking to get? DMV offers Driver
          license and vehicle registration services.
    User: Driver License
    Bot:  I’m sorry, there are no slots available for
          2024-05-07 at 15:00.

Fig. 5. Comparison of the same test case as generated by BOTIUM and CTG.
  BOTIUM (left)
    User: now into Euros
    Bot:  Invalid currency conversion parameters
  CTG (right)
    User: Convert 30 Dollars
    Bot:  What is the currency-to?
    User: now into Euros
    Bot:  At the moment 30USD are 27.642EUR

TABLE IV
                         BOTIUM             CHARM              CTG
Chatbot                  Intents  Entities  Intents  Entities  Intents  Entities
E-Commerce               91%      -         36%      -         91%      -
Room Reservation         83%      100%      50%      0%        83%      100%
Weather Forecast         50%      -         50%      -         50%      -
Currency Converter       66%      -         66%      -         66%      -
Temperature Converter    60%      33%       60%      50%       60%      50%
Appointments Scheduler   66%      10%       66%      20%       66%      20%
News                     66%      11%       66%      11%       66%      56%
Total                    74%      26%       51%      26%       74%      47%

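
The idea of recording dynamic responses, as discussed in (3), can be sketched as follows. ToySchedulerBot and record_conversation are hypothetical names, and the toy bot only mimics a backend whose answer depends on slot availability; this is an illustration under those assumptions, not the CTG implementation.

```python
# Hypothetical sketch: ToySchedulerBot and record_conversation are
# illustrative names, not the CTG implementation.

class ToySchedulerBot:
    def __init__(self, free_slots: set) -> None:
        self.free_slots = free_slots
        self.pending_slot = None

    def reply(self, utterance: str) -> str:
        if utterance.startswith("appointment "):
            self.pending_slot = utterance.removeprefix("appointment ")
            return "What services are you looking to get?"
        # Any other utterance is treated as the requested service.
        if self.pending_slot in self.free_slots:
            return f"Confirmed for {self.pending_slot}."
        return f"No slots available for {self.pending_slot}."


def record_conversation(bot, utterances):
    """Send each utterance and record the response actually returned."""
    return [(u, bot.reply(u)) for u in utterances]


USER_TURNS = ["appointment 2024-05-07 15:00", "Driver License"]

# Same user turns, different backend state, different recorded oracle.
scenario_1 = record_conversation(ToySchedulerBot({"2024-05-07 15:00"}), USER_TURNS)
scenario_2 = record_conversation(ToySchedulerBot(set()), USER_TURNS)
```

A statically generated test would contain a single fixed final response and could not distinguish the two scenarios.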
(4) Complex Conversational Flow: The generation strategies
implemented by BOTIUM and CHARM occasionally neglect
the preliminary steps of conversations or skip some necessary
interactions. In contrast, CTG can address nested intents and
interactions by iteratively building the conversations as long
as user requests and bot responses can be exchanged (see
lines 22-41 in Algorithm 1). Incorrect sequences of user-bot
interactions can cause the chatbots to misunderstand the
requests, resulting in semantically wrong test cases. For
example, Figure 5 shows two test cases generated for the
Currency Converter chatbot. The test case generated by
BOTIUM (left side), due to its limited ability to explore
nested intents, starts the conversation with a sentence, now
into Euros, that is not meaningful alone. In fact, the
bot responds with Invalid currency conversion
parameters, failing the test, which expects a meaningful
response. In contrast, the test case generated by CTG (right
side) implements the entire conversation, including the initial
request for a given amount to be converted into a new currency.
The test can be repeatedly executed without problems and can
be reliably used for regression testing.
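
The effect of skipping a preliminary step, as in (4), can be shown with a toy bot that has one follow-up intent. ToyCurrencyBot, the run helper, and the fixed exchange rate are assumptions made for illustration; the real Currency Converter chatbot relies on an external service.

```python
# Hypothetical toy bot with one follow-up intent; the fixed rate is an
# assumption chosen only for illustration.

USD_TO_EUR = 0.9


class ToyCurrencyBot:
    def __init__(self) -> None:
        self.amount = None  # context set by the first intent

    def reply(self, utterance: str) -> str:
        words = utterance.split()
        if words[0] == "Convert":
            self.amount = float(words[1])
            return "What is the currency-to?"
        if utterance == "now into Euros" and self.amount is not None:
            return f"{self.amount * USD_TO_EUR:.2f} EUR"
        return "Invalid currency conversion parameters"


def run(utterances):
    """Replay a conversation on a fresh bot and return the last response."""
    bot = ToyCurrencyBot()
    response = ""
    for u in utterances:
        response = bot.reply(u)
    return response


# The follow-up utterance alone lacks the context it depends on.
follow_up_alone = run(["now into Euros"])
# The complete conversation sets the context first.
full_conversation = run(["Convert 30 Dollars", "now into Euros"])
```

Only the complete conversation yields a meaningful answer, which matches the contrast between the two sides of Figure 5.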
To summarize, while oracle weaknesses affect all compared
techniques, the other limitations do not affect CTG.
E. Results for RQ2-Coverage
Table IV reports intent and entity coverage achieved for each
chatbot and testing technique.
Concerning intent coverage, all the techniques performed
the same with five chatbots: Currency Converter (66%),
Appointments Scheduler (66%), News (66%), Temperature
Converter (60%), and Weather Forecast (50%). Additionally,
BOTIUM and CTG both covered all intents but one in Room
Reservation (83%) and in E-Commerce (91%), while CHARM
performed significantly worse in Room Reservation (50%) and
E-Commerce (36%), due to the lower number of generated test
cases (see again Table II). Overall, CTG and BOTIUM scored
the same (74% of intents covered on average), followed by
CHARM with 51%.
Concerning entity coverage, BOTIUM and CTG cover all
entities in every chatbot for at least one value. Instead, CHARM
does not cover the entity in Room Reservation, as it failed
to generate tests for the intents that use that entity. On the
other hand, BOTIUM misses more entity values than CHARM
and CTG in Temperature Converter (33% vs 50%) and in
Appointments Scheduler (10% vs 20%), whereas it scores the
same as CHARM in News (11%), so the two tie overall
(26% both). CTG covers 47% of entity values, scoring the
same as or better than the other techniques in all cases.
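
As a worked example of how such percentages can arise, the sketch below assumes that coverage is the share of declared items exercised by at least one generated test. The intent names are invented, and the total of six intents is inferred from the reported 83% with one uncovered intent; it is not stated in the text.

```python
# Hypothetical sketch: coverage is assumed to be the share of declared
# items exercised by at least one generated test.

def coverage(declared: set, exercised: set) -> int:
    """Return the percentage of declared items that were exercised."""
    if not declared:
        raise ValueError("no declared items")
    covered = declared & exercised
    return round(100 * len(covered) / len(declared))


# Illustrative intent names; six intents with one uncovered gives 83%,
# which is consistent with "all intents but one" for Room Reservation.
declared_intents = {"welcome", "book", "cancel", "list", "help", "fallback"}
exercised_intents = declared_intents - {"fallback"}

room_reservation_intents = coverage(declared_intents, exercised_intents)
```

The same function applies to entity values by passing the declared and exercised values instead of intent names.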
The main reasons for uncovered intents and entities lie in
the core test generation functionality, which is
shared among all the techniques. Since the testing tools generate
test cases from the training data, which reflect the positive usages
of the chatbots, they often fail to exercise the fallback intents
that are designed to manage the negative scenarios that may
originate from realistic but wrong interactions (e.g., asking to
rephrase or complete a message received with badly formatted
data) [30]. Further, some intents may require specific
prerequisites to be activated, such as context data that must
be provided as input from past interactions. Among the testing
tools, CHARM is the one that suffered the most, as observed
especially for the E-Commerce and the Room Reservation
chatbots, affecting both test generation and coverage.
Nevertheless, the number of test cases does not necessarily
reflect the capability of a technique to cover intents and entities.
For example, we observed cases where radically different
numbers of test cases resulted in the same intent and entity
coverage, such as in the Currency Converter chatbot, where all
techniques scored 66% of intents covered, although BOTIUM
and CTG generated 39 test cases while CHARM generated
only 17. This suggests that the generated tests are sometimes
equivalent variations of other existing tests [33], for instance,
having a user utterance replaced with another one producing
the same effect (e.g., Convert 10 Pounds to Dollars versus
How much is 10 Pounds in Dollars?).
In a nutshell, all the techniques performed better in covering
intents than entities, with CTG providing the best
combined coverage. Still, the achieved intent and entity
coverage can be significantly improved.
F. Results for RQ3-Mutants
To assess the fault revealing capability of the compared
approaches, we measured the mutation score. To compute
this score, we manually inspected the generated mutants and
discarded the equivalent ones: we discarded six equivalents
from Room Reservation, one from Weather Forecast, and one
from Currency Converter. Table V reports the mutation score
obtained by each technique. In five chatbots out of seven
(i.e., Room Reservation, Temperature Converter, Appointments
Scheduler, Currency Converter, and News) CTG performs as well
as or better than the other techniques. CHARM performs as well
as or better than the other approaches in three cases, and
BOTIUM performs as well as or better than the other approaches
in one case. Overall, CTG outperformed the competing
approaches, killing 57% of the mutants, with BOTIUM and
CHARM killing 47% and 45%, respectively.

TABLE V
# Killed / # Generated (% Killed)
Chatbot                  BOTIUM        CHARM         CTG
E-Commerce               10/22 (46%)   11/22 (50%)   10/22 (46%)
Room Reservation         9/16 (56%)    7/16 (44%)    11/16 (69%)
Weather Forecast         5/13 (38%)    7/13 (54%)    6/13 (46%)
Currency Converter       3/6 (50%)     3/6 (50%)     3/6 (50%)
Temperature Converter    6/16 (38%)    6/16 (38%)    8/16 (50%)
Appointments Scheduler   5/12 (42%)    4/12 (33%)    7/12 (58%)
News                     8/12 (67%)    6/12 (50%)    10/12 (83%)
Total                    46/97 (47%)   44/97 (45%)   55/97 (57%)

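
The overall scores can be recomputed from the totals reported in Table V. The sketch assumes that the "# Generated" counts already exclude the discarded equivalent mutants, which is not stated explicitly here; the function name mutation_score is illustrative.

```python
# Sketch assuming "# Generated" in Table V already excludes the
# equivalent mutants; the text does not state this explicitly.

# (killed, generated) totals per technique, taken from Table V.
TOTALS = {"BOTIUM": (46, 97), "CHARM": (44, 97), "CTG": (55, 97)}


def mutation_score(killed: int, generated: int) -> int:
    """Percentage of mutants killed, rounded to the nearest integer."""
    if generated <= 0:
        raise ValueError("generated must be positive")
    return round(100 * killed / generated)


scores = {tool: mutation_score(*counts) for tool, counts in TOTALS.items()}
best = max(scores, key=scores.get)
```

The recomputed totals agree with the 47%, 45%, and 57% reported in the text.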
In general, mutants that affected intents and entities (e.g.,
removing an entity, or removing a required parameter referring to
an entity from a user utterance) were easier for CTG to kill,
because it generates test cases that record
the actual chatbot responses and use them as oracles, rather
than building static conversations unable to track such details,
and because it exercises entities more systematically.
For instance, the News chatbot presents a conversational scenario
in which the user has to specify the topic of the news; if that
parameter is missing, once a test exercises the scenario, the
mutant will likely activate the fallback intent because it is unable
to understand the request prompted by the test, and is thus
detected. On the other hand, some mutations, as reported in
previous studies [30], are hard for any technique to detect,
such as those affecting negative scenarios (e.g., turning an
ordinary intent into a fallback intent), as they are generally
neglected by the test generation process. This result confirms
that more research is needed in test case generation for
task-based chatbots.
G. Threats to Validity
An internal threat to the validity of this study stems from
the manual intervention needed to deploy the mutated chatbots on
Dialogflow, to detect the semantically wrong test cases, and
to inspect the mutants (e.g., to detect equivalent mutants).
We mitigated this issue by having two authors carefully inspect
the produced artifacts and discuss the result of the inspection
until they reached agreement. An external threat to validity
concerns the generalizability of the results. Although our
findings are not final, they provide an interesting picture of the
state of the art in testing task-based chatbots. The assessment
considers chatbots from different domains, using third-party
software also involved in other experiments [27], [29], [30],
and compares CTG to BOTIUM and CHARM, which are the
state-of-the-art testing tools [11], [30]. We selected Google
Dialogflow as the target framework, which is one of the most
popular platforms for chatbot design [28]. Although we cannot
make claims about the generalizability of the results to chatbots
implemented with other platforms, such as Rasa and Amazon
Lex, their similarity suggests that results are unlikely to change
drastically among task-based chatbot platforms.
Chatbots have existed for a long time (e.g., Weizenbaum’s
creation named Eliza [35]), but only recently have they
been integrated with AI and natural language processing
capabilities [2], [36]. Traditional testing methods have proven
challenging to apply to this new technology. So far, a plethora
of methods have focused on performance, usability, design
metrics, and guidelines [10], [37], [38]. On the other hand,
few approaches have addressed the functional testing of
chatbots [11].
Concerning industrial and open-source testing tools,
Amazon released the Alexa Simulator [39] to test Alexa
skills without the need for a device, supporting both text and
voice-based interactions. An open-source framework for offline
black-box testing of Alexa skills, named Alexa Skill Test
Framework [18], has also been provided. The capabilities of
the framework include mocking external services and support
for audio streams. Both solutions, although highly configurable,
are tailored to Alexa. Rasa bot builder [9] provides testing
features focused on conversational flows and Natural Language
Understanding (NLU) models, but test stories must be manually
written and maintained. Chatbottest [17] offers support for
improving the quality of chatbot design based on heuristic
evaluations, but no actual test generation is supported, and
the tool relies on a Chrome extension to evaluate chatbots
operating on only a few platforms (e.g., Telegram). Bespoken
offers a commercial solution [19] for conversational AI chatbot
testing, using a rich dashboard that integrates utility features,
such as voice support and test reporting. Similarly to
Rasa, conversational scenarios in Bespoken must be manually
configured. Playwright [40] supports end-to-end testing of
chatbots that rely on a Document Object Model (DOM)
interface. It employs a recording mechanism to save interactions
over the DOM for replication, thus tests have to be recorded
manually. Unlike these frameworks, CTG provides automated
test generation capabilities.
BOTIUM [25] is an automated quality assurance framework
providing both automatic generation and execution of test cases
for chatbots, supporting multiple natural language processing
engines and chatbot design platforms, including Amazon
Lex [7], Rasa [9] and Dialogflow [6]. The test cases generated
in BOTIUM are textual conversations between the user and
the chatbot. Based on our findings, BOTIUM shows limitations
in covering conversations when custom entities and external
services are involved. Further research on mutation testing has
shown how test suites generated by BOTIUM struggle to detect
numerous mutants affecting conversational properties [30],
[41]. To alleviate the cost of manual chatbot testing, in 2017
researchers from IBM Research Lab proposed Bottester [20],
a tool that simulates user interactions with a chatbot. The tool
takes as input a specification of the conversational flows and
additional parameters to simulate actual interactions, such as
a sleep time to separate each user input. The CHARM tool [27]
was proposed by Bravo-Santos et al. as an extension to
BOTIUM for Dialogflow chatbots. CHARM enriches BOTIUM
conversations using a set of mutations to test chatbot robustness
and accuracy. We empirically compared CTG with BOTIUM
and CHARM, showing how CTG's capabilities can improve
test effectiveness.
Similarly to CHARM, further work addresses input mutation
for the quality assurance of chatbots. Guichard et al. [22] investigate
the process of automatically paraphrasing user sentences, and
they evaluate robustness of text-based conversational agents
on the BoTest testing framework [21]. Bozic et al. propose an
automated approach for functional chatbot testing involving
AI planning as foundation for test generation [42]: a plan
represents an abstract test case, composed of user’s requests as
actions to perform and an intent as a goal to achieve, allowing
a re-planning phase once stuck. Bozic and Wotawa [23] further
investigate the oracle problem in the chatbot domain. As chatbot
output may be difficult to predict, metamorphic testing [43] can
be integrated into the testing process, defining metamorphic
relations, such as synonym transformations and word removals.
In a subsequent work [24], they refined the approach by
introducing an ontology-based infrastructure. Since employing
the ontology requires some effort and knowledge, the approach
can benefit from automated black-box test generator tools, such
as BOTIUM, CHARM, and CTG.
The ubiquity of chatbots in human activities, as well as
the involvement of advanced technologies for their design,
demands suitable quality assurance techniques and tools. So
far, only a few approaches have addressed functional testing of
task-based chatbots [11], [24]. State-of-the-art approaches are
limited to the static generation of test cases that cannot always
capture the complexity and diversity of questions and responses
that are part of conversations, generating test cases that may be
fragile or require manual fixtures, and providing limited coverage
of the intents and entities present in a chatbot [25], [27].
In this paper, we have presented CTG, a dynamic test case
generation technique for Dialogflow task-based chatbots. CTG
leverages BOTIUM test cases to produce test variants capable
of exploring scenarios that are not covered by existing tests.
The tool records all bot responses, even those depending on
external services, and employs setup and teardown operations
to clean up the environment, favoring regression testing in a
neutral environment. Our empirical results show that CTG can
generate a higher rate of correct test cases that extensively cover
conversations compared to BOTIUM and CHARM. In future
work, our aim is to further refine and expand the capabilities of
CTG. This will involve conducting additional experiments,
such as investigating the adaptability of the tool to other
chatbot design platforms (e.g., Rasa), as well as designing
more sophisticated test strategies to implement flexible oracles
and to cover negative scenarios in the context of LLM-based
chatbot testing [36].
