arxiv

Paper 2503.05561

Test Case Generation for Dialogflow Task-Based Chatbots

Authors: Rocco Gianni Rapisarda, Davide Ginelli, Diego Clerissi, Leonardo Mariani

Published: 2025-03-07

Abstract:

Chatbots are software typically embedded in Web and Mobile applications designed to assist the user in a plethora of activities, from chit-chatting to task completion. They enable diverse forms of interactions, like text and voice commands. As any software, even chatbots are susceptible to bugs, and their pervasiveness in our lives, as well as the underlying technological advancements, call for tailored quality assurance techniques. However, test case generation techniques for conversational chatbots are still limited. In this paper, we present Chatbot Test Generator (CTG), an automated testing technique designed for task-based chatbots. We conducted an experiment comparing CTG with state-of-the-art BOTIUM and CHARM tools with seven chatbots, observing that the test cases generated by CTG outperformed the competitors, in terms of robustness and effectiveness.

Paper Content:

Rocco Gianni Rapisarda, Dep. of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy, r.rapisarda2@campus.unimib.it
Davide Ginelli, Dep. of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy, davide.ginelli@unimib.it
Diego Clerissi, Dep. of Informatics, Bioeng., Robotics and Systems Engineering, University of Genova, Genova, Italy, diego.clerissi@dibris.unige.it
Leonardo Mariani, Dep. of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy, leonardo.mariani@unimib.it

Index Terms: Chatbot Testing, Test Automation, Coverage, Flakiness, Mutation Testing

I. INTRODUCTION

Task-based chatbots are chatbots designed to deliver functionalities through conversations. Notable examples are chatbots for assisting users, booking services, and completing transactions [1], [2].
In recent years, task-based chatbots (hereafter simply chatbots) have gained popularity thanks to their integration into Web and Mobile applications, and advances in underlying technologies have made them ubiquitous in a multitude of domains, such as e-commerce, booking, tech support, healthcare, and more [1]–[5]. Further, chatbots have become significant in the context of business companies, as they allow for a reduction in personnel expenses by providing a potentially active 24/7 service. As chatbots continue to permeate our daily lives supported by a large number of design platforms, such as Dialogflow [6], Amazon Lex [7], IBM Watson Assistant [8], and Rasa [9], ensuring their reliability has become a critical concern.

Unlike regular test case generation approaches that validate software through interactions taking the form of, for instance, API calls, HTTP requests, or clicks on visual elements, chatbots require the generation of actual conversations, that is, natural language sentences that correspond to actual user requests. Further, responses also take the form of natural language sentences that must be interpreted to determine their correctness. It is thus imperative to design ad-hoc techniques that can generate test cases that thoroughly exercise software whose interface is conversational [10], [11].

The initial effort in test case generation for chatbots was mainly focused on exercising speech recognition capabilities [12]–[16], and other non-functional characteristics [17]. The generation of test cases (i.e., conversations) to thoroughly validate the functionalities implemented by the chatbot under test required extensive manual intervention [9], [18]–[24]. So far, two approaches have targeted automatic test case generation for chatbots: BOTIUM [25], which is a state-of-the-practice tool originally developed by Botium GmbH (now Cyara [26]), and CHARM [27], which extends BOTIUM with the capability to generate more diverse test cases.
However, both approaches are not able to thoroughly cover the conversational input space of a chatbot and often generate incorrect test cases, due to the challenge of predicting the responses that must be produced by tested chatbots.

To address the need for an automated testing solution capable of generating robust and effective test cases, this paper presents Chatbot Test Generator (CTG), an automated testing tool designed for Dialogflow task-based chatbots. CTG exploits the tests generated by BOTIUM as seed tests that are systematically augmented to produce test cases that cover alternative conversational paths. Moreover, CTG is the first dynamic test generation approach for chatbots, that is, it executes the test cases and records the chatbot responses to obtain test cases that can be faithfully re-executed as regression tests, in contrast to statically generated tests that may include incorrect oracles. Finally, CTG generates test cases with setup and teardown operations that prepare and clean up the environment, preventing any accidental interference with the generated test cases.

arXiv:2503.05561v1 [cs.SE] 7 Mar 2025

We conducted an experiment comparing CTG with BOTIUM and CHARM, state-of-the-art tools, observing that the test cases generated by CTG outperformed those generated by competing approaches, in terms of both robustness and effectiveness. In particular, CTG generated 95% correct test cases, compared to BOTIUM and CHARM that generated 61% and 82% correct tests, respectively. Moreover, the tests generated by CTG killed the highest number of mutants in five out of the seven chatbots considered in our mutation testing experiment. This result suggests that the tests obtained with CTG can be used to create better regression test suites than those obtained with state-of-the-art approaches.
This paper provides the following contributions: (i) it proposes CTG, the first dynamic test case generation tool for Dialogflow task-based chatbots, (ii) it presents a systematic strategy to generate reliable conversations that cover the possible conversational paths, including the usage of entity values, alternative utterances, and set up and tear down of the testing environment, and (iii) it presents empirical results that show how CTG outperforms the BOTIUM and CHARM state-of-the-art tools when generating tests for seven task-based chatbots.

The paper is organized as follows. Section II introduces the key elements of task-based chatbots. Section III describes CTG. Section IV presents our experiment. Section V discusses related work. Then, Section VI provides final remarks.

II. DIALOGFLOW TASK-BASED CHATBOTS

Chatbots, also known as virtual assistants or conversational agents, are software programs designed to interact with human users through text, voice, images, or other communication channels [2]. Chatbots can be categorized based on their operational focus (i.e., informative, conversational, and task-based), their context (i.e., domain-specific or domain-independent), and the technology and modules they are composed of (e.g., natural language processors, speech recognition systems, machine learning models) [1], [2]. Task-based chatbots are chatbots designed to efficiently complete specific tasks through user conversations, prioritizing functionality over casual chat. They aim to fulfill a user's intent, such as booking a service or guiding users through procedures.

Dialogflow [6] is a Natural Language Understanding (NLU) platform, part of the Google Cloud Platform, and one of the most popular platforms for multi-language chatbot design [28]. The platform offers a user-friendly Web interface, advanced machine learning and voice recognition modules for NLU, and native support for Google Cloud services.
Further, it integrates with existing services (e.g., Slack, Telegram, Messenger) and custom external services, via Cloud Functions for Firebase. Finally, it exposes APIs that enable connections with existing testing tools, such as Botium [25].

In terms of structure, a Dialogflow chatbot is composed of several components encapsulating the main concepts of a conversation [29], [30]:
• Intents: they represent the possible goals of a user that the chatbot considers as tasks to achieve; users express their intent using utterances (e.g., the intent to order a pizza can be expressed by the sentence "I would like to order a pizza");
• Actions: they are the actions that a chatbot may perform to complete a task; actions can be either plain text responses/requests (e.g., asking the question "What pizza would you like to order?"), graphical objects to interact with, or requests invoking external services;
• Entities: the data types that can be used in a conversation, either representing simple literals (e.g., "Pepperoni pizza"), regular expressions, or compositions of other entities;
• Flows: the possible user-bot interactions composing conversations, which can include context data representing the time-bounded memory of the chatbot.

Testing a task-based chatbot requires generating conversations that thoroughly cover the intents, entities, and flows that are defined in a chatbot.

III. CHATBOTS TEST GENERATOR

Chatbot Test Generator (CTG) is an automated test generator tool for Dialogflow chatbots, developed in Node.js. Its architecture is shown in Figure 1. Starting from a set of BOTIUM seed tests that exercise the intents defined in the chatbot under test (e.g., one seed test per intent), CTG generates augmented tests that cover inputs often ignored by test case generation. The seed tests represent the backbone of the conversations explored by CTG.
Further, CTG is able to set up and tear down the environment of the chatbot, preventing any side effect between test executions. Unlike other approaches that generate tests for chatbots using static analysis (i.e., without executing the chatbot), CTG incrementally generates test cases by sending user messages and recording bot responses at runtime, thus producing test cases that resemble actual conversations and are more reliable for regression testing than statically generated tests.

CTG consists of four key components. The GENERATOR orchestrates the entire test generation process. The EXPANDER retrieves alternative utterances and entity values that can be used as replacements for those present in the seed test cases, then it unrolls the user-bot interactions to explore additional conversational paths based on the extracted alternative values, potentially exercising unexplored intents. The EXECUTOR runs the generated tests, using the BOTIUM connector module to interact with the chatbot deployed on Dialogflow, and records the conversation (i.e., user requests and bot responses). The CLEANER is configured according to a cleaning routine defined by the tester to set up and tear down the environment, including the persistence layer of the chatbot under test, thus favoring testing in a neutral environment. In the following, we present the main algorithms of CTG in detail.

A. Generator

Algorithm 1 shows the generateTests function implemented by the GENERATOR. The function takes as input the seed tests ST, used as a basis for the generation process, the cleaning routine CR, used to configure the CLEANER with the logic to connect to the chatbot and to set up and tear down its back-end, and the reference to the chatbot under test bot, used to extract the alternative utterances and entity values from the chatbot, and returns the collection AT of augmented tests for each seed test (e.g., AT[st_i][j] is the j-th augmented test for seed test i). The seed tests ST must cover the conversational flows that should be used to generate alternative conversations.

[Fig. 1. CTG architecture. Labels: Seed Tests, Cleaning Routine, User requests, Bot responses, Bot, Domain, Entities, Utterances, Data extraction, Data cleaning, Persistence layer, Augmented Tests.]

Algorithm 1 CTG test case generation process.
Input:
 1  ST: the seed tests
 2  CR: the cleaning routine to set up the connection with the chatbot
 3  bot: the bot under test
Output:
 4  AT: the augmented tests
Dependencies:
 5  function EXPAND(st, botMsg, bot)  ▷ Alg. 2
 6  function SETUP(CR)  ▷ Alg. 3
 7  function TEARDOWN(CR)  ▷ Alg. 3
 8
 9  function GENERATETESTS(ST, CR, bot)
10    AT ← ∅
11    for st in ST do
12      startMsgs ← EXP.expand(st, null, bot)
13      for i = 0; i < |startMsgs|; i++ do
14        at ← ()
15        at.seed ← true
16        AT[st] ← AT[st] ∪ {at}
17      end for
18      for i = 0; i < |AT[st]|; i++ do
19        conn ← CLEANER.setUp(CR)
20        if AT[st][i].seed then
21          userMsg ← startMsgs[i]
22          do
23            if |userMsg| = 1 then
24              AT[st][i].addUserMsg(userMsg)
25              botMsg ← EXEC.sendMsg(userMsg, conn)
26              AT[st][i].addBotMsg(botMsg)
27              userMsg ← EXP.expand(st, botMsg, bot)
28            else
29              for j = 1; j < |userMsg|; j++ do
30                at′ ← ()
31                at′.addUserMsg(AT[st][i].getUserMsgs())
32                at′.addUserMsg(userMsg[j])
33                at′.seed ← false
34                AT[st] ← AT[st] ∪ {at′}
35              end for
36              AT[st][i].addUserMsg(userMsg[0])
37              botMsg ← EXEC.sendMsg(userMsg[0], conn)
38              AT[st][i].addBotMsg(botMsg)
39              userMsg ← EXP.expand(st, botMsg, bot)
40            end if
41          while userMsg ≠ null
42        else
43          userMsgs ← AT[st][i].getUserMsgs()
44          for j = 0; j < |userMsgs|; j++ do
45            botMsg ← EXEC.sendMsg(userMsgs[j], conn)
46            AT[st][i].addBotMsg(botMsg)
47          end for
48        end if
49        conn ← CLEANER.tearDown(CR)
50      end for
51    end for
52    return AT
53  end function
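The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Node.js implementation: the names AugmentedTest, generate_tests, toy_expand, and toy_send are hypothetical, the expander, executor, and cleaner are passed in as plain callables, and the toy bot only mimics the appointment example of Figure 2.

```python
from dataclasses import dataclass, field


@dataclass
class AugmentedTest:
    seed: bool                                     # True if it follows the seed flow
    user_msgs: list = field(default_factory=list)  # messages sent by the user
    bot_msgs: list = field(default_factory=list)   # recorded bot replies (oracles)


def generate_tests(seed_tests, expand, send_msg, set_up, tear_down):
    """Worklist sketch of Algorithm 1; all collaborators are injected callables."""
    augmented = {}
    for name, seed in seed_tests.items():
        start_msgs = expand(seed, None)
        tests = [AugmentedTest(seed=True) for _ in start_msgs]
        i = 0
        while i < len(tests):          # the list grows while it is visited
            test, conn = tests[i], set_up()
            if test.seed:
                alternatives = [start_msgs[i]]
                while alternatives:
                    # Every extra alternative becomes a pending test that
                    # inherits the conversation prefix built so far.
                    for alt in alternatives[1:]:
                        tests.append(AugmentedTest(False, test.user_msgs + [alt]))
                    test.user_msgs.append(alternatives[0])
                    bot_msg = send_msg(conn, alternatives[0])
                    test.bot_msgs.append(bot_msg)
                    alternatives = expand(seed, bot_msg)
            else:
                # Pending test: replay its user messages and record the replies.
                for msg in test.user_msgs:
                    test.bot_msgs.append(send_msg(conn, msg))
            tear_down(conn)
            i += 1
        augmented[name] = tests
    return augmented


# Toy collaborators (hypothetical, for illustration only).
def toy_expand(seed, bot_msg):
    if bot_msg is None:
        return ["Hi, I need an appointment", "Hello, I need your service"]
    if "@service" in bot_msg:
        return ["service1", "service2"]
    return []                          # nothing left to say


def toy_send(conn, user_msg):
    conn.append(user_msg)              # the "connection" just logs traffic
    if user_msg.startswith("service"):
        return "Booked " + user_msg
    return "Hello, for which @service?"


result = generate_tests({"appointment": None}, toy_expand, toy_send,
                        set_up=list, tear_down=lambda conn: conn.clear())
for t in result["appointment"]:
    print(t.seed, t.user_msgs, t.bot_msgs)
```

With the toy bot, the two salutations produce two seed tests, and each of them spawns one pending test for the second service value. As in Algorithm 1, a pending test only replays the user messages it inherited, each in a freshly set-up environment.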
Although seed tests could be obtained in multiple ways, including manually implementing them, in our assessment we used BOTIUM to generate seed tests, since BOTIUM can automatically generate a set of test cases that exercise the main conversational flow defined in a chatbot.

The function generateTests iterates over each seed test case to generate corresponding alternative test cases (lines 11-51). The GENERATOR starts by using the EXPANDER to retrieve all alternative messages that a user can use to start a conversation according to the set of salutation messages expected by the chatbot (line 12). To perform such a task, the EXPANDER accesses the chatbot structural files, extracting the alternative utterances and entity values. For each message that can be used to start a conversation, a new empty augmented test case is created and added to the output set (lines 13-17). Each new test case is flagged as "seed" (line 15), since the algorithm discriminates tests that cover the seed conversational flow from the alternatives that may originate from the expansion of the input data in future interactions (e.g., after the greetings, the conversation may take a new path if the user orders a pizza instead of a hamburger).

For each new test case (lines 18-50), the connection to the chatbot is established through the CLEANER (line 19). If the test is flagged as "seed" (line 20), one of the available salutation messages is used to initiate the test (line 21). Then, the conversation is iteratively built and executed until no more messages can be sent to the chatbot (lines 22-41). Within each iteration, if the EXPANDER has not identified any alternative message to be sent, the only possible message is sent to the chatbot using the EXECUTOR, which exploits Botium APIs to communicate with the bot deployed on Dialogflow.
The bot response is collected and encoded as an expectation in the test, and the next user message to be sent is retrieved from the alternative messages identified by the EXPANDER (lines 23-27). If, instead, there are multiple alternatives for the user message to be sent, a new augmented test case is created for each alternative (lines 29-35): every alternative test inherits the initial part of the conversation already established within the seed test from which the new test is created (line 31), and the new test is added to the output set of augmented tests associated with the seed test (line 34). Then, the user-bot interaction proceeds for the seed test (lines 36-39), using the first user message among the alternatives that, by construction, indicates a seed test (i.e., userMsg[0], line 36). Once an augmented test has reached the end of the execution, the alternative augmented test cases that are pending are retrieved and completed using the same strategy (lines 42-48). Eventually, at the end of each test case interaction with the bot, the CLEANER is called to restore the chatbot to its original state (line 49). The aforementioned procedure is repeated until no more seed tests are present, and finally the whole set of augmented tests is returned (line 52).

B. Expander

Algorithm 2 shows the expand function implemented by the EXPANDER. The function takes as input a test case t, a bot message botMsg occurring in the test, and the reference to the chatbot bot, and returns all the alternative user responses alt that can be sent to the chatbot for the message botMsg. To perform its tasks, the EXPANDER accesses the Dialogflow data defined for the bot and exploits the BOTIUM test case structure, outlined in the example in Figure 2. In BOTIUM, a test case is a conversation that alternates user sentences, occurring after the tag #me, and bot sentences, occurring after the tag #bot.
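As a small illustration of this structure, the following Python sketch splits such a conversation into turns. It assumes that each #me or #bot tag sits on its own line and is followed by the message text, and that any text before the first tag (such as a test name) can be skipped. The function name parse_convo is hypothetical, and the sketch deliberately ignores any other feature a real BOTIUM test file may use.

```python
def parse_convo(text):
    """Split a '#me' / '#bot' conversation into (speaker, message) turns."""
    turns = []
    speaker, lines = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if line in ("#me", "#bot"):
            if speaker is not None:
                turns.append((speaker, " ".join(lines)))
            speaker, lines = line[1:], []   # drop the leading '#'
        elif line and speaker is not None:
            lines.append(line)              # text belonging to the current turn
    if speaker is not None:
        turns.append((speaker, " ".join(lines)))
    return turns


sample = """appointment test

#me
Hi, I need an appointment

#bot
Hello, for which @service?

#me
For service1
"""

for who, msg in parse_convo(sample):
    print(who, "->", msg)
```

The resulting list of (speaker, message) pairs is the kind of structure from which bot sentences can be used as oracles and user sentences as inputs.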
Each chatbot intent is defined by user utterances and corresponding chatbot responses (User Utterances and Bot Responses sections in Figure 2), which the chatbot uses internally to train a model that recognizes sentences flexibly. BOTIUM exploits this dataset to build the seed test cases as finite combined sequences of user-bot interactions, where the bot messages can be used as oracles to compare with actual responses during test execution. When a sentence refers to an entity, the bot identifies it with the @ symbol (e.g., in the example, the bot is asking for the service the user wants). The chatbot has a number of entity values that can be accepted for each entity. During the conversation, the user must choose one of them to produce a response that is understandable by the bot. The EXPANDER accesses this information to find any alternative utterance and entity value that can be used in the conversation.

If the bot message to answer is assigned the value null (lines 7-9), the conversation has just started. The user message for which the alternatives must be found is the one starting the test case (e.g., the first #me message of Figure 2); then, the EXPANDER can extract the alternative sentences by accessing the user utterances file associated with that message (User Utterances section in Figure 2). If, instead, the bot message is a prompt referring to one or more required entity values (lines 10-11), where an entity is determined by the presence of the "@" symbol, the alternative messages are obtained by combining the entity values extracted for each entity from the chatbot (@service values section in Figure 2); the logic of entity value extraction and combination is encapsulated by the method getEntityValues, not shown for space reasons.
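The three branches of the expand function can be sketched in Python as below. Since getEntityValues is not shown in the paper, the combination strategy used here (a Cartesian product of the values of every @entity named in the prompt, joined by spaces) is only an assumption, as are the hypothetical data layouts: utterances maps a seed user message to the alternative utterances of the same intent, and entity_values maps an entity name to its accepted values.

```python
import re
from itertools import product


def get_entity_values(bot_msg, entity_values):
    """Combine the known values of every @entity named in the bot prompt."""
    names = re.findall(r"@(\w+)", bot_msg)
    if not names:
        return []
    pools = [entity_values.get(name, []) for name in names]
    return [" ".join(combo) for combo in product(*pools)]


def next_user_msg(turns, bot_msg):
    """Return the user message that follows bot_msg in the seed test, if any."""
    for (who, msg), (next_who, next_msg) in zip(turns, turns[1:]):
        if who == "bot" and msg == bot_msg and next_who == "me":
            return next_msg
    return None


def expand(turns, bot_msg, utterances, entity_values):
    if bot_msg is None:                       # conversation start
        first = next(msg for who, msg in turns if who == "me")
        return utterances.get(first, [first])
    if "@" in bot_msg:                        # prompt asking for entity values
        return get_entity_values(bot_msg, entity_values)
    user_msg = next_user_msg(turns, bot_msg)  # plain text response
    if user_msg is None:
        return []
    return utterances.get(user_msg, [user_msg])


# Toy data mirroring Figure 2 (hypothetical layout).
turns = [("me", "Hi, I need an appointment"),
         ("bot", "Hello, for which @service?"),
         ("me", "For service1")]
utterances = {"Hi, I need an appointment":
              ["Hi, I need an appointment", "Hello, I need your service"]}
entity_values = {"service": ["service1", "service2"]}

print(expand(turns, None, utterances, entity_values))
print(expand(turns, "Hello, for which @service?", utterances, entity_values))
```

Returning an empty list when nothing can be said plays the role of the null value that ends the conversation loop in Algorithm 1.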
Otherwise, if the bot message is a simple plain text response (lines 12-15), the next user message is retrieved from the test case (line 13), and the alternative user responses are extracted from the utterances file related to the user message (line 14). Finally, the set of alternative messages discovered is returned to the GENERATOR (line 16), which uses it in the test generation process.

C. Cleaner

Algorithm 3 shows the setUp and tearDown functions of the CLEANER component. The setUp function is called at the beginning of each test generation process (line 19 of Algorithm 1) to establish a connection to the chatbot under test and to its persistence layer; it takes as input the cleaning routine CR, which injects into the CLEANER the logic to manage the exact cleaning process for the target chatbot (e.g., how to remove events from a Google Calendar for a calendar managing chatbot), and returns a connection conn to communicate with the chatbot. More in detail, the service account configuration pertaining to the external service that the chatbot connects to, if any, is built (line 7); a service account could refer, for instance, to the Google API managing Google Calendar, thus requiring data such as private keys and project ids to later connect to such a service. Next, the connection to the chatbot is opened and returned (lines 8-9). Eventually, the connection pointer and the cleaning routine are used in the tearDown function, called at the end of each test generation process (line 49 of Algorithm 1), to restore the chatbot to its original state, by querying the service for any persistent item to clean (lines 13-15), such as events from a given calendar; finally, the connection is closed (lines 16-17).

IV. EMPIRICAL EVALUATION

To study the effectiveness of Chatbot Test Generator (CTG) we considered the following research questions:
• RQ1-Correctness: How often does CTG generate semantically correct tests?
This research question studies the reliability of the generated test cases, considering the rate of wrong tests (i.e., tests that may fail even for passing executions) that are generated.
• RQ2-Coverage: How thoroughly can CTG exercise conversations? This research question studies how thoroughly the generated tests exercise the intents and entities implemented in chatbots.
• RQ3-Mutations: What is the defect detection capability of the tests generated by CTG? This research question studies how well test cases can reveal faults injected in conversations implemented in task-based chatbots.

The three research questions are investigated by comparing the test cases generated by CTG with the test cases generated by the baseline approach BOTIUM [25] and the state-of-the-art approach CHARM [27], with seven task-based chatbots selected from open-source third-party repositories. In the rest of this section, we describe the subject chatbots used for the evaluation (Section IV-A), the competing techniques (Section IV-B), the experimental procedure defined to answer the research questions (Section IV-C), the empirical results obtained for the three research questions (Sections IV-D, IV-E, and IV-F), and the threats to validity (Section IV-G).

Algorithm 2 CTG alternatives expander process.
Input:
 1  t: the test case to expand data from
 2  botMsg: the bot message used to retrieve the next user message
 3  bot: the bot under test
Output:
 4  alt: the alternative responses (e.g., entities, utterances)
 5
 6  function EXPAND(t, botMsg, bot)
 7    if botMsg = null then
 8      userMsg ← getFirstUserMsg(t)
 9      alt ← getUtterances(userMsg, bot)
10    else if botMsg.contains("@") then
11      alt ← getEntityValues(botMsg, bot)
12    else
13      userMsg ← getNextUserMsg(t, botMsg)
14      alt ← getUtterances(userMsg, bot)
15    end if
16    return alt
17  end function

[Fig. 2. Overview of a BOTIUM test case and Dialogflow chatbot data. Test case for intent I: "#me Hi, I need an appointment", "#bot Hello, for which @service?", "#me For service_i", ... Chatbot data for intent I: User Utterances ("Hi, I need an appointment", "Hello, I need your service", ...); Bot Responses ("Hello, for which @service?", "How can I assist you?", ...); @service values (service_1, service_2, ..., service_n).]

Algorithm 3 CTG cleaning process.
Input:
 1  CR: cleaning routine
Output:
 2  conn: the connection session to the chatbot
Dependencies:
 3  function CLEAN(CR)  ▷ Extern
 4  function ITEMS()  ▷ Extern
 5
 6  function SETUP(CR)
 7    serviceAccount ← buildService(CR)
 8    conn ← connect(serviceAccount, CR)
 9    return conn
10  end function
11
12  function TEARDOWN(CR)
13    if conn ≠ null ∧ |conn.items()| ≥ 1 then
14      conn.clean(CR)
15    end if
16    conn ← null
17    return null
18  end function

TABLE I. SUBJECT CHATBOTS.
Name | Domain | # Intents | # Entities (# Values)
E-Commerce | Shopping | 11 | 0 (0)
Room Reservation | Hotels | 6 | 1 (3)
Weather Forecast | Weather | 4 | 0 (0)
Currency Converter | Currencies | 3 | 0 (0)
Temperature Converter | Temperatures | 5 | 2 (12)
Appointments Scheduler | Cars | 3 | 1 (10)
News | Info | 3 | 1 (9)

A. Subject Chatbots

To investigate the research questions, we selected seven Dialogflow chatbots from the ASYMOB repository of open-source third-party chatbots [31], which has already been used in previous studies [27], [29], [30]. The selected chatbots are representative of different domains (e.g., hotel booking, weather forecasting), and designed to cover non-trivial conversational aspects (e.g., nested intents, input contexts, and back-end components to interact with external services). Table I reports the name, the domain, and the conversational size of each selected chatbot (i.e., the number of intents, custom entities, and entity values). Since the Weather Forecast and News chatbots did not include a configuration file for the back-end component, we had to define one ourselves to integrate with free API services alternative to the original, unavailable, ones. We specifically used the OpenWeather and News APIs.
We carefully inspected and tested the modified chatbots to ensure they preserved their integrity, conversational structure, and coherence.

B. Competing Techniques

We selected the BOTIUM [25] baseline and the CHARM [27] state-of-the-art approaches to compare with CTG. BOTIUM [25] is an automated quality assurance framework for chatbot testing designed by Botium GmbH (now Cyara [26]), widely used in both industrial and academic contexts [11], [27]. It offers support for generating and executing test cases on various chatbot design platforms, including Dialogflow [6], Amazon Lex [7] and Rasa [9]. It represents the state-of-the-practice approach for test case generation for chatbots. CHARM [27] is an open-source testing tool for Dialogflow chatbots implemented in Python. Similarly to CTG, it uses BOTIUM as a back-end for test generation and execution. CHARM extends the test case generation capabilities of BOTIUM with mutations over utterances of seed test cases, such as language back-translations and synonym substitutions.

C. Experimental Procedure

To address RQ1-Correctness, we studied the rate of errors present in the tests generated by CTG, BOTIUM and CHARM. We separately deployed each subject chatbot on Dialogflow. Then, we used the testing techniques to automatically generate test cases for each chatbot, configuring each tool to use the multistepconvos BOTIUM feature, which causes the generation of test cases that tentatively cover full user-bot conversations, according to the conversations defined in the tested chatbots. As seed tests required by the CTG generation phase (Algorithm 1), we used one BOTIUM seed test per intent. To assess the correctness of the tests, we executed each test 15 times. We then manually inspected the failures to determine the presence of semantically wrong test cases, due to malformed utterances, wrong oracles, incoherent interactions, or flaky interactions¹. The remaining tests were deemed correct.
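The repeated-execution step can be illustrated with a short Python sketch. It only automates the triage that precedes the manual inspection described above, and it is not the authors' procedure: the function names, the three labels, and the toy test runners are assumptions introduced for illustration.

```python
from itertools import cycle


def classify(run_test, repetitions=15):
    """Run one test several times and summarize its pass/fail behavior."""
    outcomes = [bool(run_test()) for _ in range(repetitions)]
    if all(outcomes):
        return "stable-pass"
    if not any(outcomes):
        return "stable-fail"
    return "flaky"                      # both passes and failures observed


def triage(tests, repetitions=15):
    """Map each test name to its classification."""
    return {name: classify(run, repetitions) for name, run in tests.items()}


# Toy runners (hypothetical): the third one models a bot that alternates
# between two greetings while the test expects only "Hi!".
greeting = cycle(["Hi!", "Good day!"])
report = triage({
    "always_ok": lambda: True,
    "wrong_oracle": lambda: False,
    "greeting": lambda: next(greeting) == "Hi!",
})
print(report)
```

Tests that land in the stable-fail or flaky groups are the ones that would then need manual inspection to decide whether the test, rather than the chatbot, is wrong.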
¹ A flaky test is a test that passes and fails periodically without any code changes [32].

To address RQ2-Coverage, we executed the correct test cases resulting from RQ1 and collected the set of intents and entities exercised by each test. To obtain this information, we ran the tests in verbose mode and implemented a script to parse the collected logs, extracting the information about the executed elements of the conversation. We derived two main indicators of the capability of the tests to exercise conversations: intent coverage and entity coverage [33]. Intent coverage is the percentage of intents in the chatbots that are exercised by the test cases, and measures the capability of the tests to cover at least once each user intent implemented in the chatbot. Entity coverage measures the percentage of actual entity values that are exercised in the tests in comparison to the set of entity values that the chatbot is designed to recognize. For instance, the Appointments Scheduler chatbot could be used to schedule appointments about driver license or vehicle registration (these are possible values for the entity type AppointmentType); entity coverage thus indicates how extensively the conversation has been exercised in terms of the kinds of values used.

To address RQ3-Mutations, we injected mutants into subject chatbots using Mutabot [30], a mutation testing tool for Dialogflow conversational chatbots. Mutabot generates mutations at multiple levels, including structural changes affecting a conversation (e.g., removing an intent from a chatbot), intents (e.g., flagging an intent as fallback² or changing its priority³), and entities (e.g., removing or changing entity definitions).
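Both coverage indicators defined for RQ2 reduce to the same ratio, which can be written as a short Python sketch. The function name and the sample sets are hypothetical, and returning None when a chatbot defines no entity values is a choice made here, since the text does not state how that case is reported.

```python
def coverage(exercised, defined):
    """Percentage of defined elements (intents or entity values) that were exercised."""
    defined = set(defined)
    if not defined:
        return None                     # nothing to cover, ratio undefined
    # Elements outside the defined set (e.g., unexpected log entries) are ignored.
    return 100.0 * len(set(exercised) & defined) / len(defined)


# Toy data (hypothetical): four intents, two entity values.
defined_intents = {"greet", "book", "cancel", "help"}
exercised_intents = {"greet", "book", "help"}
defined_values = {"driver license", "vehicle registration"}
exercised_values = {"driver license"}

print(coverage(exercised_intents, defined_intents))   # intent coverage
print(coverage(exercised_values, defined_values))     # entity coverage
```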
In our evaluation, we applied the following operators affecting the main conversational characteristics of the selected chatbots: 1) intent removal, 2) entity removal, 3) intent parameter removal, 4) intent priority change, 5) intent flagging as fallback, 6) entity renaming, and 7) entity value change. After generating the mutants, we manually inspected the resulting chatbots to detect and discard the equivalent⁴ mutants, which we could determine by comparing the original and the mutated code elements, and detecting those that had no impact on the behavior of the chatbot (e.g., changing the priority of an intent when there are no other intents to compete with). We independently deployed the mutated chatbots on Dialogflow, and tested them with the test suites produced for RQ1, measuring the rate of mutants killed by each technique.

To avoid any side effect and guarantee replicability and fairness in the results, we reset the state of the chatbots after each run. The resources needed to replicate our experiments, including tools, chatbots, and generated tests, are publicly available at https://gitlab.com/ctg-experiment1/ctg-experiment.

² A fallback intent is designed to respond to the user requests that the chatbot could not understand.
³ The intent priority determines the order of activation when multiple intents compete for a user request.
⁴ An equivalent mutant is a mutant that cannot be killed by any test case [34].

D. Results for RQ1-Correctness

TABLE II. GENERATED TEST CASES FOR EACH CHATBOT AND TECHNIQUE.
Chatbot | BOTIUM | CHARM | CTG
E-Commerce | 196 | 20 | 196
Room Reservation | 233 | 21 | 233
Weather Forecast | 11 | 11 | 16
Currency Converter | 39 | 17 | 39
Temperature Converter | 15 | 15 | 19
Appointments Scheduler | 21 | 12 | 24
News | 10 | 10 | 19
Total | 525 | 106 | 546

TABLE III. RESULTS OF RQ1: SEMANTICALLY WRONG (SW) AND CORRECT (C) TEST CASES FOR EACH CHATBOT AND TECHNIQUE (%).
Chatbot | BOTIUM SW | BOTIUM C | CHARM SW | CHARM C | CTG SW | CTG C
E-Commerce | 0% | 100% | 0% | 100% | 0% | 100%
Room Reservation | 80% | 20% | 24% | 76% | 4% | 96%
Weather Forecast | 36% | 64% | 18% | 82% | 40% | 60%
Currency Converter | 18% | 82% | 24% | 76% | 0% | 100%
Temperature Converter | 27% | 73% | 53% | 47% | 0% | 100%
Appointments Scheduler | 24% | 76% | 0% | 100% | 43% | 57%
News | 0% | 100% | 0% | 100% | 0% | 100%
Total | 39% | 61% | 18% | 82% | 5% | 95%

Table II lists the number of test cases generated for each chatbot and technique. Overall, CHARM generated a smaller set of test cases (106 test cases) compared to BOTIUM and CTG, since it often failed at generating tests for intents whose execution is constrained, such as nested intents that can be exercised only after other intents have been exercised, or intents requiring some input contexts defined by former interactions. The consequence is that CHARM fails to exercise several alternative conversations, in particular in the E-Commerce and Room Reservation chatbots. On the other hand, the performance of BOTIUM and CTG is comparable (525 test cases generated by BOTIUM and 546 test cases generated by CTG), since both techniques are able to discover alternative execution paths. However, the EXPANDER component of CTG can also retrieve alternative entity values (see lines 10-11 of Algorithm 2) when the available user utterances are incomplete, and the chatbot must interact with the user to fill the slots with missing entities (e.g., a user requesting an appointment without specifying the kind of service forces the chatbot to produce additional interactions to ask the user what service is needed). If all the cases are covered by the utterances available in the training data, CTG does not activate the expansion mechanism and the conversations tentatively generated by CTG and BOTIUM are the same.
In fact, CTG and BOTIUM, even when addressing the same conversational flow, generate different tests since they rely on different information: CTG generates tests dynamically, based on actual bot responses, while BOTIUM generates tests statically, guessing responses from source files. Although the number of test cases generated by BOTIUM and CTG is comparable, the test cases generated by CTG were more robust than those generated by the other techniques. Notice that the test generation timings were comparable and required a few seconds across the various tools. Nevertheless, since CTG dynamically generates tests by interacting with the chatbot to record actual responses, the generation time depends on external services that could potentially introduce delays. However, no significant impact was observed for the subject chatbots.

Fig. 3. A flaky test case generated by all the techniques.
Expected:
#me hello
#bot Hi! I’m your room booking bot. I can help you to find a perfect room for your meeting and manage your reservations.
Actual:
#me hello
#bot Good day! I’m your room booking bot. I can help you to find a perfect room for your meeting and manage your reservations.

Table III reports the results of RQ1. The column SW indicates the percentage of semantically wrong test cases that fail when executed, while C indicates the percentage of correct test cases that can be reliably executed. All techniques generated no semantically wrong test cases for the E-Commerce and News chatbots, where dialogues could be reliably extracted from the implementation. Generating reliable test cases was more challenging for the other chatbots. CTG generated the highest number of correct test cases for five out of seven chatbots, while CHARM outperformed the other techniques in the other two chatbots. Overall, BOTIUM generated 61% correct test cases, reporting the lowest performance compared to the other techniques. CHARM achieved 82% correct test cases on average.
CTG outperformed the competing techniques, achieving 95% correct test cases on average. Note that CHARM, in addition to generating a slightly smaller percentage of correct test cases than CTG, generates fewer tests in total.

The reasons for semantically wrong test cases that fail even in case of correct responses are various. We identified four main cases, which we discuss below.

(1) Oracle Weaknesses: Chatbots are often designed to provide multiple, semantically equivalent responses to the same user requests. This can be a source of flakiness, since the expected response specified in a generated test, although correct, can differ from the actual response provided by a chatbot when the test is executed, resulting in a spurious failure of an assert statement. This limitation is common to all the compared techniques, since BOTIUM serves as the backbone for test execution, where the oracle is defined by default through text matching between the expected and actual chatbot responses. Figure 3 shows an example of a flaky test case generated by all the compared techniques for the Room Reservation chatbot. The test case simply submits a salutation message and encodes an expectation about the response of the chatbot (column Expected). However, the Room Reservation chatbot can respond in multiple ways. For example, the actual response could start with Good day rather than Hi, causing a test failure. Implementing a flexible oracle strategy is an open challenge for all compared techniques.

(2) Dirty Environment: Flakiness may also occur when the same scenario affecting the chatbot state is exercised multiple times without proper cleanup of the state. While CTG uses the CLEANER component to properly set up and tear down the environment, the test cases generated by both BOTIUM and CHARM may interfere with each other.
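To make the oracle weakness of item (1) concrete, the following minimal Python sketch contrasts strict text matching with a more tolerant check on the greeting of Figure 3. It is only an illustration under stated assumptions: the function names strict_oracle and flexible_oracle are hypothetical, they are not part of BOTIUM, CHARM, or CTG, and the set of accepted response variants is assumed to be known in advance (e.g., taken from the chatbot design).

```python
from typing import Iterable


def strict_oracle(expected: str, actual: str) -> bool:
    """Text-matching oracle: passes only when the two responses are identical."""
    return expected.strip() == actual.strip()


def flexible_oracle(accepted: Iterable[str], actual: str) -> bool:
    """Hypothetical tolerant oracle: passes if the actual response equals
    any of the semantically equivalent variants that are accepted."""
    return actual.strip() in {variant.strip() for variant in accepted}


# Greeting of the Room Reservation chatbot, as reported in Fig. 3.
SUFFIX = (" I'm your room booking bot. I can help you to find a perfect room "
          "for your meeting and manage your reservations.")
recorded = "Hi!" + SUFFIX        # response stored in the generated test
actual = "Good day!" + SUFFIX    # response returned when the test is executed

# The strict comparison reports a spurious failure, the tolerant one passes.
print(strict_oracle(recorded, actual))
print(flexible_oracle([recorded, "Good day!" + SUFFIX], actual))
```

Enumerating the accepted variants is the simplest possible relaxation; it does not address responses whose content depends on the chatbot state.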
(3) Dynamic Responses: Both BOTIUM and CHARM generate test cases statically by combining the user utterances and the bot responses as they are taken from the chatbot design files, thus they could miss any dynamic response that may depend on the actions executed by the chatbot. On the other hand, CTG can collect dynamic responses according to the status of the conversation, as shown in Figure 4 for the Appointments Scheduler chatbot, where the final chatbot responses depend on the actual finalization of the user request (e.g., if a slot is available, the appointment will be confirmed).

Fig. 4. Test cases generated by CTG capturing dynamic responses.
Scenario #1:
#me I would like to set an appointment for 3pm on Tuesday
#bot What services are you looking to get? DMV offers Driver license and vehicle registration services.
#me Driver License
#bot Let me see if we can fit you in on 2024-05-07 at 15:00! Yes It is fine!
Scenario #2:
#me I would like to set an appointment for 3pm on Tuesday
#bot What services are you looking to get? DMV offers Driver license and vehicle registration services.
#me Driver License
#bot I’m sorry, there are no slots available for 2024-05-07 at 15:00.

Fig. 5. Comparison of the same test case as generated by BOTIUM and CTG.
BOTIUM:
#me now into Euros
#bot Invalid currency conversion parameters
CTG:
#me Convert 30 Dollars
#bot What is the currency-to?
#me now into Euros
#bot At the moment 30USD are 27.642EUR

TABLE IV
RESULTS OF RQ2: INTENT AND ENTITY COVERAGE.

Chatbot                  BOTIUM Intents  BOTIUM Entities  CHARM Intents  CHARM Entities  CTG Intents  CTG Entities
E-Commerce                          91%                -            36%               -          91%             -
Room Reservation                    83%             100%            50%              0%          83%          100%
Weather Forecast                    50%                -            50%               -          50%             -
Currency Converter                  66%                -            66%               -          66%             -
Temperature Converter               60%              33%            60%             50%          60%           50%
Appointments Scheduler              66%              10%            66%             20%          66%           20%
News                                66%              11%            66%             11%          66%           56%
Total                               74%              26%            51%             26%          74%           47%
(4) Complex Conversational Flow: The generation strategies implemented by BOTIUM and CHARM occasionally neglect the preliminary steps of conversations or skip some necessary interactions. In contrast, CTG can address nested intents and interactions by iteratively building the conversations as long as user requests and bot responses can be exchanged (see lines 22-41 in Algorithm 1). Incorrect sequences of user-bot interactions can cause the chatbots to fail to understand the requests, resulting in semantically wrong test cases. For example, Figure 5 shows two test cases generated for the Currency Converter chatbot. The test case generated by BOTIUM (left side), due to its limited ability to explore nested intents, starts the conversation with a sentence, now into Euros, that is not meaningful alone. In fact, the bot responds with Invalid currency conversion parameters, failing the test, which expects a meaningful response. In contrast, the test case generated by CTG (right side) implements the entire conversation, including the initial request for a given amount to be converted into a new currency. The test can be repeatedly executed without problems and can be reliably used for regression testing.

To summarize, while oracle weaknesses affect all compared techniques, the other limitations do not affect CTG.

E. Results for RQ2-Coverage

Table IV reports intent and entity coverage achieved for each chatbot and testing technique. Concerning intent coverage, all the techniques performed the same with five chatbots: Currency Converter (66%), Appointments Scheduler (66%), News (66%), Temperature Converter (60%), and Weather Forecast (50%). Additionally, BOTIUM and CTG both covered all intents but one in Room Reservation (83%) and in E-Commerce (91%), while CHARM performed significantly worse in Room Reservation (50%) and E-Commerce (36%), due to the lower number of generated test cases (see again Table II).
Overall, CTG and BOTIUM scored the same (74% of intents covered on average), followed by CHARM with 51%. Concerning entity coverage, BOTIUM and CTG cover all entities in every chatbot for at least one value. CHARM, instead, does not cover the entity in Room Reservation, as it failed to generate tests for the intents that use that entity. On the other hand, BOTIUM misses more entity values than CHARM and CTG in Temperature Converter (33% vs 50%) and in Appointments Scheduler (10% vs 20%), whereas it scores the same as CHARM in News (11%), so the two tie overall (26% each). CTG covers 47% of entity values, scoring the same as or better than the other techniques in all cases.

The main reasons for uncovered intents and entities lie in the core test generation functionality, which is shared among all the techniques. Since the testing tools generate test cases from the training data, which reflect the positive usages of the chatbots, they often fail to exercise the fallback intents that are designed to manage the negative scenarios which may originate from realistic but wrong interactions (e.g., asking the user to rephrase or complete a message received with badly formatted data) [30]. Further, some intents may require specific prerequisites to be activated, such as context data that must be provided as input from past interactions. Among the testing tools, CHARM is the one that suffered the most, as observed especially for the E-Commerce and Room Reservation chatbots, affecting both test generation and coverage. Nevertheless, the number of test cases does not necessarily reflect the capability of a technique to cover intents and entities. For example, we observed cases where radically different numbers of test cases resulted in the same intent and entity coverage, such as in the Currency Converter chatbot, where all techniques scored 66% of intents covered, although BOTIUM and CTG generated 39 test cases while CHARM generated only 17.
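The observation that the size of a test suite and its coverage are only loosely related can be illustrated with a small Python sketch. It assumes that intent coverage is the fraction of the intents of a chatbot exercised by at least one test, and it represents each test simply as the list of intents it triggers; the intent names, the two suites, and the function intent_coverage are hypothetical and do not come from the subject chatbots or from any of the compared tools.

```python
from typing import Iterable, List, Set


def intent_coverage(all_intents: Set[str], tests: Iterable[List[str]]) -> float:
    """Fraction of intents exercised by at least one test of the suite."""
    exercised: Set[str] = set()
    for triggered in tests:
        # Ignore anything that is not a declared intent of the chatbot.
        exercised.update(intent for intent in triggered if intent in all_intents)
    return len(exercised) / len(all_intents)


# Hypothetical chatbot with four intents; the fallback is never exercised.
intents = {"convert", "convert.target", "help", "fallback"}

small_suite = [
    ["convert", "convert.target"],
    ["help"],
]
# The larger suite only adds paraphrased requests for the same intents.
large_suite = small_suite + [
    ["convert", "convert.target"],
    ["convert"],
    ["help"],
]

print(intent_coverage(intents, small_suite))
print(intent_coverage(intents, large_suite))
```

In this sketch, the larger suite only adds requests that trigger already exercised intents, so both suites reach the same coverage.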
This suggests that the generated tests are sometimes equivalent variations of other existing tests [33], for instance, having a user utterance replaced with another one producing the same effect (e.g., Convert 10 Pounds to Dollars versus How much is 10 Pounds in Dollars?).

In a nutshell, all the techniques performed better in covering intents than entities, with CTG providing the best combined coverage. Nevertheless, the achieved intent and entity coverage can still be significantly improved.

F. Results for RQ3-Mutants

To assess the fault-revealing capability of the compared approaches, we measured the mutation score. To compute this score, we manually inspected the generated mutants and discarded the equivalent ones: we discarded six equivalent mutants from Room Reservation, one from Weather Forecast, and one from Currency Converter.

TABLE V
RESULTS OF RQ3: MUTANTS KILLED/TOTAL.
# Killed / # Generated (% Killed)

Chatbot                  BOTIUM       CHARM        CTG
E-Commerce               10/22 (46%)  11/22 (50%)  10/22 (46%)
Room Reservation         9/16 (56%)   7/16 (44%)   11/16 (69%)
Weather Forecast         5/13 (38%)   7/13 (54%)   6/13 (46%)
Currency Converter       3/6 (50%)    3/6 (50%)    3/6 (50%)
Temperature Converter    6/16 (38%)   6/16 (38%)   8/16 (50%)
Appointments Scheduler   5/12 (42%)   4/12 (33%)   7/12 (58%)
News                     8/12 (67%)   6/12 (50%)   10/12 (83%)
Total                    46/97 (47%)  44/97 (45%)  55/97 (57%)

Table V reports the mutation score obtained by each technique. In five chatbots out of seven (i.e., Room Reservation, Temperature Converter, Appointments Scheduler, Currency Converter, and News), CTG performs as well as or better than the other techniques. CHARM performs as well as or better than the other approaches in three cases, and BOTIUM does so in one case. Overall, CTG outperformed the competing approaches, killing 57% of the mutants, with BOTIUM and CHARM killing 47% and 45%, respectively.
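As a worked check of the totals in Table V, the short Python sketch below recomputes the overall mutation score of each technique as the number of killed mutants divided by the number of mutants listed in the table. The per-chatbot pairs are copied from Table V; the dictionary layout and the function name mutation_score are illustrative choices, not part of any of the tools.

```python
from typing import Dict, Tuple

# (killed, generated) pairs per chatbot, copied from Table V.
TABLE_V: Dict[str, Dict[str, Tuple[int, int]]] = {
    "BOTIUM": {
        "E-Commerce": (10, 22), "Room Reservation": (9, 16),
        "Weather Forecast": (5, 13), "Currency Converter": (3, 6),
        "Temperature Converter": (6, 16), "Appointments Scheduler": (5, 12),
        "News": (8, 12),
    },
    "CHARM": {
        "E-Commerce": (11, 22), "Room Reservation": (7, 16),
        "Weather Forecast": (7, 13), "Currency Converter": (3, 6),
        "Temperature Converter": (6, 16), "Appointments Scheduler": (4, 12),
        "News": (6, 12),
    },
    "CTG": {
        "E-Commerce": (10, 22), "Room Reservation": (11, 16),
        "Weather Forecast": (6, 13), "Currency Converter": (3, 6),
        "Temperature Converter": (8, 16), "Appointments Scheduler": (7, 12),
        "News": (10, 12),
    },
}


def mutation_score(results: Dict[str, Tuple[int, int]]) -> Tuple[int, int, int]:
    """Return (killed, total, percentage of killed mutants, rounded)."""
    killed = sum(k for k, _ in results.values())
    total = sum(t for _, t in results.values())
    return killed, total, round(100 * killed / total)


for technique, results in TABLE_V.items():
    killed, total, pct = mutation_score(results)
    print(f"{technique}: {killed}/{total} ({pct}%)")
```

The recomputed totals agree with the last row of Table V: 46/97 (47%) for BOTIUM, 44/97 (45%) for CHARM, and 55/97 (57%) for CTG.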
In general, mutants that affected intents and entities (e.g., removing an entity, or removing a required parameter referring to an entity from a user utterance) were easier for CTG to kill, because it generates test cases that record the actual chatbot responses and use them as oracles, rather than building static conversations unable to track such details, and because it exercises entities more systematically. For instance, the News chatbot presents a conversational scenario in which the user has to specify the topic of a news item; if this parameter is missing, once a test exercises the scenario, the mutant will likely activate the fallback intent, being unable to understand the request prompted by the test, and thus gets detected. On the other hand, some mutations, as reported in previous studies [30], are hard to detect with any technique, such as those affecting negative scenarios (e.g., turning an ordinary intent into a fallback intent), as they are generally neglected by the test generation process. This result confirms that more research is needed in test case generation for task-based chatbots.

G. Threats to Validity

An internal threat to the validity of this study depends on the manual intervention needed to deploy the mutated chatbots on Dialogflow, to detect the semantically wrong test cases, and to inspect the mutants (e.g., to detect equivalent mutants). We mitigated this issue by having two authors carefully inspect the produced artifacts and discuss the result of the inspection until they reached agreement.

An external threat to validity concerns the generalizability of the results. Although our findings are not final, they provide an interesting picture of the state of the art in testing task-based chatbots.
The assessment considers chatbots from different domains, using third-party software also involved in other experiments [27], [29], [30], and compares CTG to BOTIUM and CHARM, which are the state-of-the-art testing tools [11], [30]. We selected Google Dialogflow as the target framework, which is one of the most popular platforms for chatbot design [28]. Although we cannot make claims about the generalizability of the results to chatbots implemented with other platforms, such as Rasa and Amazon Lex, their similarity suggests that results are unlikely to change drastically among task-based chatbot platforms.

V. RELATED WORK

Chatbots have existed for a long time (e.g., Weizenbaum’s creation named Eliza [35]), but only recently have they been integrated with AI and natural language processing capabilities [2], [36]. Traditional testing methods have proven challenging to apply to this new technology. So far, a plethora of methods have focused on performance, usability, design metrics, and guidelines [10], [37], [38]. On the other hand, few approaches have addressed the functional testing of chatbots [11].

Concerning industrial and open-source testing tools, Amazon released the Alexa Simulator [39] to test Alexa skills without the need for a device, supporting both text and voice-based interactions. An open-source framework for offline black-box testing of Alexa skills, named Alexa Skill Test Framework [18], has also been provided. The capabilities of the framework include mocking external services and support for audio streams. Both solutions, although highly configurable, are tailored to Alexa. Rasa bot builder [9] provides testing features focused on conversational flows and Natural Language Understanding (NLU) models, but test stories must be manually written and maintained.
Chatbottest [17] offers support for improving the quality of chatbot design based on heuristic evaluations, but no actual test generation is supported, and the tool relies on a Chrome extension to evaluate chatbots operating on only a few platforms (e.g., Telegram). Bespoken offers a commercial solution [19] for conversational AI chatbot testing, using a rich dashboard that integrates utility features, such as voice support and test reporting. Similarly to Rasa, conversational scenarios in Bespoken must be manually configured. Playwright [40] supports end-to-end testing of chatbots that rely on a Document Object Model (DOM) interface. It employs a recording mechanism to save interactions over the DOM for replication, thus tests have to be recorded manually. Unlike these frameworks, CTG provides automated test generation capabilities.

BOTIUM [25] is an automated quality assurance framework providing both automatic generation and execution of test cases for chatbots, supporting multiple natural language processing engines and chatbot design platforms, including Amazon Lex [7], Rasa [9], and Dialogflow [6]. The test cases generated in BOTIUM are textual conversations between the user and the chatbot. Based on our findings, BOTIUM shows limitations in covering conversations when custom entities and external services are involved. Further research on mutation testing has shown how test suites generated by BOTIUM struggle to detect numerous mutants affecting conversational properties [30], [41].

To alleviate the cost of manual chatbot testing, in 2017 researchers from IBM Research Lab proposed Bottester [20], a tool that simulates user interactions with a chatbot. The tool takes as input a specification of the conversational flows and additional parameters to simulate actual interactions, such as a sleep time to separate each user input. The CHARM tool [27] has been proposed by Bravo-Santos et al. as an extension to BOTIUM for Dialogflow chatbots.
CHARM enriches BOTIUM conversations using a set of mutations to test chatbot robustness and accuracy. We empirically compared CTG with BOTIUM and CHARM, showing how the capabilities of CTG can improve test effectiveness.

Similarly to CHARM, further work addresses input mutation for the quality assurance of chatbots. Guichard et al. [22] investigate the process of automatically paraphrasing user sentences, and they evaluate the robustness of text-based conversational agents on the BoTest testing framework [21]. Bozic et al. propose an automated approach for functional chatbot testing that uses AI planning as the foundation for test generation [42]: a plan represents an abstract test case, composed of user requests as actions to perform and an intent as a goal to achieve, allowing a re-planning phase when the plan gets stuck. Bozic and Wotawa [23] further investigate the oracle problem in the chatbot domain. As chatbot output may be difficult to predict, metamorphic testing [43] can be integrated into the testing process, defining metamorphic relations, such as synonym transformations and word removals. In a subsequent work [24], the approach was refined by introducing an ontology-based infrastructure. Since employing the ontology requires some effort and knowledge, the approach can benefit from automated black-box test generator tools, such as BOTIUM, CHARM, and CTG.

VI. CONCLUSION

The ubiquity of chatbots in human activities, as well as the involvement of advanced technologies for their design, demands suitable quality assurance techniques and tools. So far, only a few approaches have addressed functional testing of task-based chatbots [11], [24].
State-of-the-art approaches are limited to the static generation of test cases that cannot always capture the complexity and diversity of questions and responses that are part of conversations, generating test cases that may be fragile or require manual fixtures, providing limited coverage of intents and entities present in a chatbot [25], [27].

In this paper, we have presented CTG, a dynamic test case generation technique for Dialogflow task-based chatbots. CTG leverages BOTIUM test cases to produce test variants capable of exploring scenarios that are not covered by existing tests. The tool records all bot responses, even those depending on external services, and employs setup and teardown operations to clean up the environment, favoring regression testing in a neutral environment. Our empirical results show that CTG can generate a higher rate of correct test cases that extensively cover conversations compared to BOTIUM and CHARM.

In future work, our aim is to further refine and expand the capabilities of CTG. This will involve conducting additional experiments, such as investigating the adaptability of the tool to other chatbot design platforms (e.g., Rasa), as well as designing more sophisticated test strategies to implement flexible oracles and to cover negative scenarios in the context of LLM-based chatbot testing [36].

REFERENCES

[1] E. Adamopoulou and L. Moussiades, “An overview of chatbot technology,” in IFIP International Conference on Artificial Intelligence Applications and Innovations (AIAI). Springer, 2020, pp. 373–383.
[2] E. Adamopoulou and L. Moussiades, “Chatbots: History, technology, and applications,” Machine Learning with Applications, vol. 2, p. 100006, 2020.
[3] J. Grudin and R. Jacques, “Chatbots, humbots, and the quest for artificial general intelligence,” in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–11.
[4] A. Følstad and P. B. Brandtzæg, “Chatbots and the new world of HCI,” Interactions, vol. 24, no. 4, pp. 38–42, 2017.
[5] B. A. Shawar and E. Atwell, “Chatbots: Are they really useful?” Journal for Language Technology and Computational Linguistics, vol. 22, no. 1, pp. 29–49, 2007.
[6] Dialogflow. [Online]. Available: https://dialogflow.cloud.google.com
[7] Amazon Lex. [Online]. Available: https://aws.amazon.com/lex
[8] IBM Watson Assistant. [Online]. Available: https://www.ibm.com/products/watsonx-assistant
[9] Rasa. [Online]. Available: https://rasa.com
[10] J. Deriu, A. Rodrigo, A. Otegi, G. Echegoyen, S. Rosset, E. Agirre, and M. Cieliebak, “Survey on evaluation methods for dialogue systems,” Artificial Intelligence Review, vol. 54, pp. 755–810, 2021.
[11] X. Li, C. Tao, J. Gao, and H. Guo, “A review of quality assurance research of dialogue systems,” in Proceedings of the 2022 IEEE International Conference On Artificial Intelligence Testing (AITest). IEEE, 2022, pp. 87–94.
[12] F. Iwama and T. Fukuda, “Automated testing of basic recognition capability for speech recognition systems,” in Proceedings of the 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST). IEEE, 2019, pp. 13–24.
[13] M. H. Asyrofi, F. Thung, D. Lo, and L. Jiang, “CrossASR: Efficient differential testing of automatic speech recognition via text-to-speech,” in Proceedings of the 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2020, pp. 640–650.
[14] L. Schönherr, K. Kohls, S. Zeiler, T. Holz, and D. Kolossa, “Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding,” in Proceedings of the 26th Annual Network and Distributed System Security Symposium (NDSS), 2019.
[15] Y. Zhang, L. Xu, A. Mendoza, G. Yang, P. Chinprutthiwong, and G. Gu, “Life after speech recognition: Fuzzing semantic misinterpretation for voice assistant applications,” in Proceedings of the 26th Annual Network and Distributed System Security Symposium (NDSS), 2019.
[16] Y. Qin, N. Carlini, G. Cottrell, I. Goodfellow, and C. Raffel, “Imperceptible, robust, and targeted adversarial examples for automatic speech recognition,” in Proceedings of the 36th International Conference on Machine Learning (ICML). PMLR, 2019, pp. 5231–5240.
[17] Chatbottest. [Online]. Available: https://chatbottest.com
[18] Alexa Test Skill Framework. [Online]. Available: https://github.com/BrianMacIntosh/alexa-skill-test-framework
[19] Bespoken. [Online]. Available: https://bespoken.io
[20] M. Vasconcelos, H. Candello, C. Pinhanez, and T. dos Santos, “Bottester: Testing conversational systems with simulated users,” in Proceedings of the 16th Brazilian Symposium on Human Factors in Computing Systems, 2017, pp. 1–4.
[21] E. Ruane, T. Faure, R. Smith, D. Bean, J. Carson-Berndsen, and A. Ventresque, “Botest: A framework to test the quality of conversational agents using divergent input examples,” in Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion (IUI-C), 2018, pp. 1–2.
[22] J. Guichard, E. Ruane, R. Smith, D. Bean, and A. Ventresque, “Assessing the robustness of conversational agents using paraphrases,” in Proceedings of the 2019 IEEE International Conference On Artificial Intelligence Testing (AITest). IEEE, 2019, pp. 55–62.
[23] J. Bozic and F. Wotawa, “Testing chatbots using metamorphic relations,” in Proceedings of the 31st IFIP International Conference on Testing Software and Systems (ICTSS). Springer, 2019, pp. 41–55.
[24] J. Božić, “Ontology-based metamorphic testing for chatbots,” Software Quality Journal (SQJ), vol. 30, no. 1, pp. 227–251, 2022.
[25] Botium. [Online]. Available: https://botium-docs.readthedocs.io/en/latest
[26] Cyara. [Online]. Available: https://cyara.com/products/botium/
[27] S. Bravo-Santos, E. Guerra, and J. de Lara, “Testing chatbots with Charm,” in Proceedings of the 13th International Conference on the Quality of Information and Communications Technology (QUATIC). Springer, 2020, pp. 426–438.
[28] A. Abdellatif, K. Badran, D. E. Costa, and E. Shihab, “A comparison of natural language understanding platforms for chatbots in software engineering,” IEEE Transactions on Software Engineering (TSE), vol. 48, no. 8, pp. 3087–3102, 2021.
[29] P. C. Cañizares, S. Pérez-Soler, E. Guerra, and J. de Lara, “Automating the measurement of heterogeneous chatbot designs,” in Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing (SAC), 2022, pp. 1491–1498.
[30] M. Ferdinando Urrico, D. Clerissi, and L. Mariani, “Mutabot: A mutation testing approach for chatbots,” in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering: Companion Proceedings (ICSE-C), 2024.
[31] ASYM0B chatbots repository. [Online]. Available: https://github.com/ASYM0B/Dataset/tree/main/Dialogflow
[32] W. Zheng, G. Liu, M. Zhang, X. Chen, and W. Zhao, “Research progress of flaky tests,” in Proceedings of the 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2021, pp. 639–646.
[33] P. C. Cañizares, D. Ávila, S. Pérez-Soler, E. Guerra, and J. de Lara, “Coverage-based strategies for the automated synthesis of test scenarios for conversational agents,” in Proceedings of the 5th ACM/IEEE International Conference on Automation of Software Test (AST), 2024.
[34] Y. Jia and M. Harman, “An analysis and survey of the development of mutation testing,” IEEE Transactions on Software Engineering (TSE), vol. 37, no. 5, pp. 649–678, 2010.
[35] J. Weizenbaum, “Eliza - A computer program for the study of natural language communication between man and machine,” Communications of the ACM, vol. 9, no. 1, pp. 36–45, 1966.
[36] J. Sánchez Cuadrado, S. Pérez-Soler, E. Guerra, and J. De Lara, “Automating the development of task-oriented LLM-based chatbots,” in Proceedings of the 6th ACM Conference on Conversational User Interfaces (CUI), 2024, pp. 1–10.
[37] L. Laranjo, A. G. Dunn, H. L. Tong, A. B. Kocaballi, J. Chen, R. Bashir, D. Surian, B. Gallego, F. Magrabi, A. Y. Lau et al., “Conversational agents in healthcare: A systematic review,” Journal of the American Medical Informatics Association, vol. 25, no. 9, pp. 1248–1258, 2018.
[38] P. C. Cañizares, J. M. López-Morales, S. Pérez-Soler, E. Guerra, and J. de Lara, “Measuring and clustering heterogeneous chatbot designs,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 33, no. 4, pp. 1–43, 2024.
[39] Alexa Simulator. [Online]. Available: https://developer.amazon.com/en-US/docs/alexa/devconsole/alexa-simulator.html
[40] Playwright. [Online]. Available: https://playwright.dev
[41] P. Gómez-Abajo, S. Pérez-Soler, P. C. Cañizares, E. Guerra, and J. de Lara, “Mutation testing for task-oriented chatbots,” in Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering (EASE), 2024, pp. 232–241.
[42] J. Bozic, O. A. Tazl, and F. Wotawa, “Chatbot testing using AI planning,” in Proceedings of the 2019 IEEE International Conference On Artificial Intelligence Testing (AITest). IEEE, 2019, pp. 37–44.
[43] T. Y. Chen, F.-C. Kuo, H. Liu, P.-L. Poon, D. Towey, T. Tse, and Z. Q. Zhou, “Metamorphic testing: A review of challenges and opportunities,” ACM Computing Surveys (CSUR), vol. 51, no. 1, pp. 1–27, 2018.