
arxiv

Paper 2502.08820

Can a Single Model Master Both Multi-turn Conversations and Tool Use? CoALM: A Unified Conversational Agentic Language Model

Authors: Emre Can Acikgoz, Jeremiah Greer, Akul Datta, Ze Yang, William Zeng, Oussama Elachqar, Emmanouil Koukoumidis, Dilek Hakkani-Tür, Gokhan Tur

Published: 2025-02-12

Abstract:

Large Language Models (LLMs) with API-calling capabilities have enabled the building of effective Language Agents (LA), while also revolutionizing the conventional task-oriented dialogue (TOD) paradigm. However, current approaches face a critical dilemma: TOD systems are often trained on a limited set of target APIs, requiring new data to maintain their quality when interfacing with new services, while LAs are not trained to maintain user intent over multi-turn conversations. Because both robust multi-turn management and advanced function calling are crucial for effective conversational agents, we evaluate these skills on three popular benchmarks: MultiWOZ 2.4 (TOD), BFCL V3 (LA), and API-Bank (LA). Our analyses reveal that specialized approaches excel in one domain but underperform in the other. To bridge this chasm, we introduce CoALM (Conversational Agentic Language Model), a unified approach that integrates both conversational and agentic capabilities. We created CoALM-IT, a carefully constructed multi-task dataset that interleaves multi-turn ReAct reasoning with complex API usage. Using CoALM-IT, we train three models, CoALM 8B, CoALM 70B, and CoALM 405B, which outperform top domain-specific models, including GPT-4o, across all three benchmarks. This demonstrates the feasibility of a single-model approach for both TOD and LA, setting a new standard for conversational agents.

Affiliations: 1 University of Illinois Urbana-Champaign (Acikgoz, Datta, Yang, Hakkani-Tür, Tur); 2 Oumi (Greer, Zeng, Elachqar, Koukoumidis)
Contact: {acikgoz2, akuld2, zey2, dilek, gokhan}@illinois.edu; {jeremy, william, oussama, manos}@oumi.ai
1 Introduction

The concept of intelligent agents has long been a cornerstone of artificial intelligence research (Minsky, 1986), developing in parallel with the field of human-to-machine conversation (Young, 2002). The advent of LLMs (OpenAI et al., 2024; Dubey et al., 2024) has revolutionized both fields and enabled powerful Language Agents (LA) (Schick et al., 2024) while transforming modular dialogue systems into end-to-end solutions (Hudeček and Dusek, 2023). Despite sharing LLM foundations, the two are typically studied and analyzed separately: dialogue models focus on tasks such as multi-turn interaction, delivering relevant information to users, and dialogue management with state tracking, whereas LAs concentrate exclusively on tool-calling skills.

[Figure 1 panels. TODs: multi-turn conversation; Success, DST; LLM-based; system response generation; previous work: Hudeček GPT3.5 (Hudeček et al., 2023), FNCTOD (Li et al., 2024), NC-Latent-TOD (King et al., 2024). Language Agents: single-turn; function accuracy; LLM-based; function calling; previous work: Toolformer (Schick et al., 2023), Hammer (Lin et al., 2024), ToolACE (Liu et al., 2024). Conversational Agents: multi-turn conversation; Success, DST, function accuracy; LLM-based; system response generation; function calling.]
Figure 1: Unifying Capabilities of TOD Systems and LAs. TOD systems excel in multi-turn conversations and task completion but lack advanced API capabilities, while LAs handle APIs well but struggle with coherent multi-turn dialogue.

Footnote 1: https://emrecanacikgoz.github.io/CoALM/

What if a single model could master both conversational and agentic tasks at the same time? Our paper addresses the vision of a unified conversational agent. Such an agent must excel not only in handling multi-turn conversations and TOD tasks but also in leveraging advanced LA capabilities, such as compound tool usage.
Previous research has focused on training dialogue agents in controlled scenarios (e.g., booking and reservation tasks) (Li et al., 2024) with a limited set of functions derived from dialogue actions (e.g., find_attraction, book_hotel), or has relied on long hand-crafted prompts tied to a small set of predefined APIs (Xu et al., 2024) and served by proprietary models like GPT-4. However, these approaches face limitations in real-world applications. Specifically, existing systems cannot easily adapt to new services without expensive fine-tuning or prompt engineering, yet real-world users often need access to a diverse range of APIs and functionalities according to their needs. Moreover, previous work has shown a notable performance gap on TOD tasks between closed-source and open-source models (Hudeček and Dusek, 2023; Xu et al., 2024; Li et al., 2024). This tension underscores the need for an integrated open-source framework that supports both long-term dialogue state tracking and complex function calling across a wide variety of APIs (see footnote 2).

arXiv:2502.08820v3 [cs.AI] 19 Feb 2025

[Figure 2: pipeline diagram. SNIPS (TOD, state tracking), Hammer and ToolAce (LA, function calls), and SGD converted to ReAct form (CRA) are merged into CoALM-IT, which is used for instruction fine-tuning of Llama-{8B,70B,405B} to obtain CoALM. Samples shown in the figure:
SNIPS (TOD). Input: Book a table at a restaurant in Portugal with parking for me in 19 minutes. Output: {"domain": "BookRestaurant", "slot_values": {"restaurant_type": "restaurant", "country": "Portugal", "facility": "parking", "party_size_description": "me", "timeRange": "in 19 minutes"}}
Hammer (LA). Input: What are the current weather conditions in Sydney? Output: {[{"name": "WoDdNSe7e7K5", "arguments": {"LzZsvxUC": "Sydney"}}]}
ToolAce (LA). User: I am planning a trip to Japan soon. Could you recommend games popular in Japan to help me understand more about the local culture? System: Get Suggestion(term="popular", country="Japan", lang="en")
SGD ReAct (CRA). <Conversation History> User: Yes, please tell me how to get tickets for the Jazz Festival. Thought1: The user is interested in purchasing tickets. I need to call the BuyEventTickets API for the Jazz Festival. Action: BuyEventTickets(event_name='Jazz Festival', number_of_seats=2, date='2023-10-07', city='New York') Observation: {'status': 'success', 'message': 'Successfully purchased 2 tickets'} Thought2: The tickets were successfully purchased. I need to confirm this with the user. System: You have successfully purchased 2 tickets for the Jazz Festival in Central Park!]
Figure 2: Overview of the CoALM Pipeline. This figure illustrates our dataset generation and fine-tuning framework. The top three rows depict the data transformation processes, along with a corresponding sample shown on the right. In each training sample, green text highlights the input components of the instruction sample, while purple text indicates the target outputs optimized during fine-tuning. For detailed examples, refer to Figures 4 - 7.

We propose CoALM (Conversational Agentic Language Model), a unified approach that integrates TOD strengths (e.g., multi-turn state tracking) with LA capabilities (e.g., dynamic tool use).
To achieve this, we develop CoALM-IT, a dataset spanning three dimensions: dialogue state tracking, complex function calling, and multi-turn conversations in ReAct style, where the agent integrates its reasoning process with actions before providing the final response (Yao et al., 2023). The novelty of CoALM-IT comes from its Conversational ReAct API (CRA) instances, which make it the first multi-turn TOD dataset to explicitly incorporate ReAct-style reasoning with multiple think steps, generated using GPT-4o. The first think step decides whether to call an API, and the second decides whether to respond to the user. Leveraging CoALM-IT, we trained the CoALM model series: CoALM 8B, CoALM 70B, and CoALM 405B, a family of conversational agents that demonstrates state-of-the-art performance on both TOD and LA domains. To evaluate this comprehensively, we perform experiments on one TOD benchmark, MultiWOZ 2.4 (Ye et al., 2022), and two popular function calling benchmarks, the Berkeley Function Calling Leaderboard V3 (BFCL) (Yan et al., 2024) and API-Bank (Li et al., 2023), in completely zero-shot settings (see footnote 3).

Our experiments reveal a stark gap in existing models: while LAs excel at tool calling on BFCL V3, they falter on MultiWOZ 2.4 with poor task completion. Conversely, base LLMs and traditional TOD systems show limited function calling abilities, as evidenced by their low performance on BFCL V3 and API-Bank. In contrast, our CoALM models excel across both TOD and LA tasks. Our larger-scale open-source variants, CoALM 70B and CoALM 405B, outperform GPT-4o and other domain-specific models on both TOD (MultiWOZ) and function calling benchmarks (BFCL V3 and API-Bank).

Footnote 2: In this work, terms such as "tool use", "function calling", and "API calling" are used interchangeably.
Footnote 3: Here, "zero-shot" means that none of the evaluation benchmarks' training sets were used while training the CoALM models with CoALM-IT.
In this paper, we study: How can we craft a single conversational agentic LLM that elegantly interweaves multi-turn dialogue mastery with powerful function calling capabilities? Our key contributions are as follows:
• We analyze the gap between two domains, TOD systems and LAs, through evaluations on MultiWOZ 2.4, BFCL V3, and API-Bank, showing the limitations of existing approaches.
• We introduce CoALM-IT, a hybrid multi-task dataset for conversational agents that, for the first time, explicitly incorporates ReAct-style reasoning steps in multi-turn TOD scenarios. To our knowledge, no prior effort has trained ReAct-based models using multi-turn TOD data in this manner.
• We propose CoALM, a family of models trained with CoALM-IT: CoALM 8B, CoALM 70B, and the largest open-source conversational agent, CoALM 405B, all unified by multi-turn dialogue skills and advanced function calling capabilities.
• Our larger models, CoALM 70B and CoALM 405B, outperform GPT-4o and GPT-4o-mini on both TOD and LA tasks, narrowing the gap between agents built on closed-source and open-source models.
To foster further research within the open-source community, we publicly release code, all model weights, datasets, intermediate checkpoints, and training configurations.

2 Related Work

Dialogues and the Domain Shift. Earlier studies applied LLMs to dialogue applications through supervised fine-tuning (Su et al., 2022; Gupta et al., 2022) or different prompting methods (Hu et al., 2022; Chung et al., 2023; Zhang et al., 2023). Following these, Hudeček and Dusek (2023) examined the dialogue management abilities of instruction-tuned LLMs in handling goal-oriented multi-turn conversations. More recently, work on dialogue agents has primarily focused on leveraging dialogue acts to derive API calls for backend services (Li et al., 2024; Xu et al., 2024; King and Flanigan, 2024).
FNCTOD (Li et al., 2024) fine-tunes on a small dataset restricted to a limited set of domain-specific APIs for state tracking, whereas AutoTOD (Xu et al., 2024) uses GPT-4 with hand-crafted prompts that rely on a narrow set of predefined APIs with long instructions for each dialogue domain. However, these approaches are brittle and difficult to scale in real-life scenarios, as they require costly re-training or extensive prompt engineering to handle new services, unseen domains, and unexpected user requests. Our work aligns with these studies in building such agents, but CoALM can manage thousands of complex APIs at the same time and can generalize to unseen domains without expensive training cycles or time-intensive prompt engineering.

Language Agents. Tool learning with LLMs has evolved from simple reasoning (Wei et al., 2022) to more sophisticated approaches (Yao et al., 2023; Ling et al., 2023). Early work relied on prompting to enable tool usage (Yao et al., 2023; Paranjape et al., 2023), but more recent research has focused on specialized fine-tuning approaches for accurate function calling (Schick et al., 2024; Patil et al., 2023; Wang et al., 2024; Zhang et al., 2024). For example, Toolformer (Schick et al., 2024) explored how LLMs can autonomously learn when and how to call APIs, leading to improved performance in task-specific settings. In this direction, recent works (Abdelaziz et al., 2024; Liu et al., 2024; Lin et al., 2024) focus on fine-tuning on synthetically generated data to integrate more complex tool calling capabilities, such as nested function calls and irrelevance detection. These approaches have shown promising results on LA benchmarks; however, they mostly operate on single-turn interactions with the user and fall short of enabling user-driven, multi-domain, multi-turn task completion, which is essential for real-world conversational systems.
3 Preliminaries

A Conversational Agent, at its core, must understand user intents, maintain context across multi-turn interactions, and respond contextually. Beyond traditional TOD tasks, modern conversational agents are also expected to exhibit agentic abilities, such as tool calling, planning, and decision making, to fulfill complex user requests. An effective conversational agent integrates these capabilities as skills, ensuring natural and relevant interactions while efficiently completing the user's objectives. The detailed task formulations for TOD systems and LA are provided in Appendix A.

Data Domain | Data Type | Data Name | Data Format | # of Data Samples | # of Total Tokens | Avg. Tokens Per Sample
TOD | Single-Turn | SNIPS | State Tracking | 13,028 | 12,278,780 | 942.49
LA | Single-Turn | Hammer | API Call | 13,819 | 10,199,147 | 738.05
LA | Multi-Turn | ToolAce | API Call | 202,500 | 129,001,612 | 637.04
CRA | Multi-Turn | SGD | ReAct API Call | 82,236 | 59,704,782 | 726.02
Total | | | | 311,583 | 211,184,321 | 760.90
Table 1: CoALM-IT Dataset Details. Statistical details of our proposed CoALM-IT dataset showcasing the training mixtures. Generated CRA denotes the Conversational ReAct API dataset.

3.1 Why Do We Need Both TOD and LA Capabilities?

Multi-turn interactions are critical for refining ambiguous user requests. For example, when a user says "Find me a hotel", the system can ask clarifying questions about the user's intention (e.g., location, price range) instead of returning generic results. This ensures meaningful and task-specific conversations. Traditional TOD systems excel at handling these multi-turn interactions, but only over a small set of APIs (e.g., query_restaurant, book_hotel) (Ye et al., 2022). By training on structured dialogue flows, they achieve high task success rates in controlled scenarios (e.g., standard booking or reservation tasks) without requiring complex function-calling capabilities.
However, these systems struggle to adapt to new services (e.g., airline, retail) without expensive re-training.

In real-world settings, users may need to access a wide variety of APIs (e.g., search_direct_flight, get_product_details). This is where LAs shine: they leverage LLMs and can rapidly learn to use unseen tools, since they are already proficient at determining when to invoke an API and deciding which API to use from a diverse set of available functions. Without these skills, agents fail to fulfill complex user goals, limiting their utility. Together, these skills form the backbone of unified conversational agents, enabling them to transition from passive responders to proactive collaborators capable of managing intricate tasks and sustaining user engagement.

3.2 Can TOD Systems Solve Function Calling Tasks?

The benchmark results demonstrate the limitations of TOD systems in function calling scenarios. Despite achieving top performance on MultiWOZ metrics (Table 2), these systems show significantly lower accuracy on both the API-Bank (Table 3) and BFCL (Table 4) benchmarks. This performance gap reveals that TOD systems' traditional strengths in dialogue management do not translate well to handling diverse, unseen, and complex API calls.

3.3 Can LAs Handle Task-oriented Multi-turn Conversations?

Conversely, agentic models like ToolAce (Liu et al., 2024), Hammer (Lin et al., 2024), and Granite (Abdelaziz et al., 2024), while achieving accurate results on API-Bank and BFCL V3, perform poorly on MultiWOZ's task completion metrics. These results highlight a critical weakness: while they deliver strong performance on function execution tasks, they fall short in maintaining coherent multi-turn conversations and properly fulfilling user intents.
Their specialized optimization for tool calling impairs their dialogue management abilities, indicating that current LAs need more balanced capabilities to handle task-oriented conversations effectively.

4 Methodology

Our approach, illustrated in Figure 2, develops a unified agent skilled in goal-oriented multi-turn conversations and function calling. First, we build CoALM-IT, a broad instruction-tuning (IT) dataset that spans multiple domains, tasks, and unique reasoning structures. Next, we fine-tune on the proposed CoALM-IT dataset to produce CoALM, a balanced conversational agent model series capable of complex reasoning, fluent dialogue, user intent fulfillment, and function calling.

4.1 Conversational Agent Dataset Generation

To develop a conversational agent with diverse capabilities, we created a comprehensive dataset that combines samples across multiple skills essential for both multi-turn task-oriented conversations and tool utilization. Figure 2 summarizes how the dataset is created and Table 1 provides detailed statistics of CoALM-IT.

TOD Datasets. An accurate dialogue system needs to master three fundamental capabilities: providing accurate information to users, fulfilling user goals, and tracking dialogue states to understand user intents and goals throughout conversations (Walker et al., 1997). To equip our model with these skills, we utilized the SNIPS dataset (Coucke et al., 2018), originally designed for language understanding but repurposed here for single-turn dialogue state tracking (DST). We extracted its training split and converted it into the state tracking IT format by crafting a detailed instruction prompt, as illustrated in Figure 4. This transformation resulted in a training set of 24,542 samples for effective DST.

Function Calling Datasets. Tool calling capability is the ability to select appropriate APIs and access external knowledge, which is crucial in modern LAs.
An effective agent must not only choose the correct API but also provide properly typed parameters (e.g., integers or strings) and manage complex scenarios involving sequential or parallel function calls. To develop these skills, we incorporated datasets from two state-of-the-art agent models: Hammer (Lin et al., 2024) and ToolACE (Liu et al., 2024). Hammer's training dataset incorporates random noise by replacing function and parameter names to prevent overfitting (see Figure 2), forcing the model to reason about API functionality through the provided descriptions rather than memorizing specific identifiers. ToolACE provides multi-turn conversational scenarios in open-domain settings, where function calls may occur across multiple turns but no database is provided. We post-process these datasets by incorporating the prompt instructions and adding conversation history if available. As reported in Table 1, the combined API calling corpus contains 216,319 samples. A function calling training sample for the Hammer dataset can be seen in Figure 5.

Conversational ReAct-based API Calling (CRA) Dataset. While state tracking enables the understanding of user intent and function calling provides external knowledge access, integrating these capabilities within multi-turn task-oriented conversations requires additional reasoning about when to make API calls and how to interpret their results. Our primary contribution is a completely new User and Agent conversation structure: User-Thought1-Action-Observation-Thought2-Response. Starting from the multi-turn SGD dataset (Rastogi et al., 2020), we systematically transform each turn to include two distinct reasoning steps (Thought1 and Thought2) and potential API calls (Action and Observation), extending the traditional ReAct format (Yao et al., 2023) by incorporating GPT-4o for content generation (Figure 2 top row).
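The turn structure above can be pictured with a small data container. The following is a minimal sketch, not the authors' data format: the class name `CRATurn`, its field names, the `render_turn` helper, and the choice to skip the Action and Observation lines when no API call is made are all assumptions for illustration. The example values are taken from the SGD ReAct sample shown in Figure 2.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CRATurn:
    """One turn in User-Thought1-Action-Observation-Thought2-Response order (hypothetical container)."""
    user: str
    thought1: str
    action: Optional[str]       # None when Thought1 decides that no API call is needed
    observation: Optional[str]  # API result; None when no call was made
    thought2: str
    response: str


def render_turn(turn: CRATurn) -> str:
    """Serialize a turn as labeled lines, skipping the API steps when no call was made."""
    parts = [("User", turn.user), ("Thought1", turn.thought1)]
    if turn.action is not None:
        parts.append(("Action", turn.action))
        parts.append(("Observation", turn.observation or ""))
    parts.append(("Thought2", turn.thought2))
    parts.append(("System", turn.response))
    return "\n".join(f"{label}: {text}" for label, text in parts)


# Values copied from the Jazz Festival sample in Figure 2.
example = CRATurn(
    user="Yes, please tell me how to get tickets for the Jazz Festival.",
    thought1="The user is interested in purchasing tickets. "
             "I need to call the BuyEventTickets API for the Jazz Festival.",
    action="BuyEventTickets(event_name='Jazz Festival', number_of_seats=2, "
           "date='2023-10-07', city='New York')",
    observation="{'status': 'success', 'message': 'Successfully purchased 2 tickets'}",
    thought2="The tickets were successfully purchased. I need to confirm this with the user.",
    response="You have successfully purchased 2 tickets for the Jazz Festival in Central Park!",
)

print(render_turn(example))
```

Keeping `action` and `observation` optional mirrors the statement that API calls are only "potential" within a turn; how the released dataset actually encodes turns without a call is not specified here.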
Our structure includes two main parts: (i) User-Thought1-Action, which focuses on understanding the user's intent through reasoning and invoking the right API if necessary (Figure 6 bottom); and (ii) Observation-Thought2-Response, where the agent analyzes the returned observations and formulates an appropriate response to the user (Figure 7 bottom). This transformation is achieved with a carefully designed prompt in Table 6, which enforces strict "Role Definition", "Task Information", and "Output Format". Since CRA is generated via GPT-4o (OpenAI et al., 2024), it is also validated by human evaluators (Appendix D). To the best of our knowledge, this is the first ReAct-based Conversational API dataset that incorporates multiple intermediate reasoning steps in multi-turn settings for TOD. This process yielded 82,236 samples, specifically tailored for task-oriented domains such as hotel bookings and restaurant reservations.

We merge all three datasets into a single training set called CoALM-IT; please refer to Table 1 for details. We fine-tune our CoALM models on this merged dataset in one pass. By interleaving samples from TOD, LA, and CRA, the model continuously practices different conversational skills without overfitting to any single domain or task type.

4.2 Fine-tuning Towards Conversational Agents

We followed a multitask fine-tuning approach to develop the CoALM models' diverse capabilities across TOD, function calling, and multi-turn reasoning by training on CoALM-IT. Our training process is structured to target specific skills through different optimization objectives in completely zero-shot settings, as our CoALM-IT dataset does not contain any of the evaluation benchmarks' training sets.

Multitask Fine-tuning.
As described in Section 4.1 and illustrated in Figure 2, our CoALM-IT dataset combines samples from three distinct domains, each designed to cultivate a specific skill: (i) TOD (Task-Oriented Dialogue) for strengthening dialogue state tracking, (ii) LA (Language Agent) for teaching the model when and how to invoke function calls, and (iii) ReAct for multi-turn conversation, multi-step reasoning, and function calling.

For TOD, we augment SNIPS data with prompt instructions (Figure 4), training the model to generate structured dialogue states in response to user queries. For function calling (LA), we optimize CoALM to select the correct APIs and produce accurate function calls with proper parameter types (Figure 5), emphasizing reasoning over memorized patterns. We then address complex multi-turn conversations with API integration using our CRA dataset, formatted in the ReAct style. This stage uses two objectives: (1) action prediction (Figure 6), where the model learns to issue the appropriate function call given the conversation history, and (2) response generation (Figure 7), where it synthesizes coherent replies based on both API results and intermediate reasoning steps. Rather than merely producing answers, the model learns to reason, decide, and act in multiple stages before arriving at a final response. Notably, we trained our models on CoALM-IT by interleaving TOD, LA, and CRA samples, enabling the model to continuously practice diverse conversational skills while avoiding overfitting to any single domain or task type.

Training Details. We developed the CoALM model series by fine-tuning Llama 3.1 8B, Llama 3.3 70B, and Llama 3.1 405B (Dubey et al., 2024) using a consistent Alpaca (Instruction-Input-Output) format. To balance efficiency and model quality, we applied LoRA (Hu et al., 2021) with rank r = 16 and scaling factor α = 32 to all linear layers, and trained in mixed-precision bfloat16 (bf16) on 8 NVIDIA H100 GPUs.
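To make the LoRA setting concrete, the sketch below works through what r = 16 and α = 32 imply for a single linear layer under the standard LoRA formulation, where a frozen weight W receives an update (α/r)·B·A with A of shape r × d_in and B of shape d_out × r. The 4096 × 4096 layer shape and the helper names are illustrative assumptions, not values reported for the CoALM models.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """Multiplier applied to the low-rank update B @ A in the standard LoRA formulation."""
    return alpha / rank


RANK, ALPHA = 16, 32   # values reported in the text
D_IN = D_OUT = 4096    # illustrative layer shape (an assumption)

full = D_IN * D_OUT                            # frozen weights in the layer
extra = lora_extra_params(D_IN, D_OUT, RANK)   # trainable LoRA weights

print(f"scaling = {lora_scaling(ALPHA, RANK)}")
print(f"LoRA params = {extra:,} ({100 * extra / full:.2f}% of the {full:,} frozen weights)")
```

For this assumed shape the adapter trains 131,072 weights per layer, under one percent of the frozen matrix, which is the efficiency argument behind choosing LoRA for models of this size.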
Under these settings, CoALM 8B required approximately 8 hours of training, while CoALM 70B took about 60 hours. We used a global batch size of 8, trained for 3 epochs with a learning rate of 1e-4, and employed a linear warm-up schedule with a 0.1 ratio. For CoALM 405B, we fine-tuned Llama 3.1 405B using QLoRA (Dettmers et al., 2023) with the same rank and scaling factor, using bitsandbytes (BitsAndBytes, 2025) with a quantization type of normalized float 4 (nf4). The precise training configurations for CoALM 8B, CoALM 70B, and CoALM 405B are included in the HuggingFace pages. Our training pipeline leveraged the Oumi framework (see footnote 4) to ensure reproducibility and streamlined management (Oumi, 2025).

Method | Success | JGA
CoALM 8B (ours) | 51.6 | 30.4
CoALM 70B (ours) | 69.4 | 43.8
CoALM 405B (ours)* | 66.7 | 38.8
Hammer 2.0 7B | 23.5 | 21.7
ToolAce | 18.0 | 34.4
Granite-20B-Code | 10.7 | 21.8
CodeActAgent | 9.5 | 20.2
Llama 3.1 8B Instruct | 19.9 | 26.3
Llama 3.3 70B Instruct | 67.6 | 40.8
Mistral-7B-Instruct-v0.3 | 31.2 | 27.0
FNCTOD (Li et al., 2024) | 44.4 | 37.9
NC-Latent-TOD (King and Flanigan, 2024) | 68.3 | 39.7
GPT 3.5 Turbo (Hudeček and Dusek, 2023) | - | 13.5
GPT4o-mini | 69.9 | 38.4
GPT4o | 75.5 | 36.9
Table 2: MultiWOZ 2.4 Benchmark Results. Performance comparison across models on the MultiWOZ 2.4 dialogue benchmark. Best scores are highlighted in bold. The asterisk (*) on CoALM 405B denotes the checkpoint from one completed epoch, as the model is still under training.

5 Experiments

This section presents results highlighting CoALM's effectiveness in unifying conversational management and advanced API calling, outperforming specialized models across both TOD and LA benchmarks.

5.1 Experimental Settings

Evaluation Benchmarks. We evaluate our approach on three complementary benchmarks that assess different aspects of model performance: MultiWOZ 2.4 (TOD), API-Bank (LA), and BFCL V3 (LA).
Specifically, MultiWOZ 2.4 (Ye et al., 2022) is a multi-domain TOD dataset covering scenarios such as hotel booking and transportation, where we measure Success Rate and Joint Goal Accuracy (JGA); in our zero-shot setting, we rely on the test set of 999 samples, using a slightly modified AutoTOD prompt (Xu et al., 2024). API-Bank (Li et al., 2023) focuses on evaluating tool-augmented LAs through 314 tool-use dialogues and 753 API calls, tested at two levels: L-1 (invoking a known API) and L-2 (retrieving and calling from multiple candidates). Lastly, BFCL V3 (see footnote 5) (Patil et al., 2023) provides over 1,800 test cases spanning tasks like simple, multiple, and parallel function calls, evaluated by Abstract Syntax Tree (AST) accuracy and Executable Function Accuracy. See Appendix B for further details.

Footnote 4: https://github.com/oumi-ai/oumi
Footnote 5: https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html

Model | Rouge-L* L-1 | Rouge-L* L-2 | Rouge-1 L-1 | Rouge-1 L-2 | Rouge-2 L-1 | Rouge-2 L-2 | BLEU-4 L-1 | BLEU-4 L-2
CoALM 8B (ours) | 92.8 | 81.9 | 94.1 | 81.2 | 91.9 | 76.4 | 89.4 | 69.7
CoALM 70B (ours) | 92.7 | 83.2 | 94.5 | 82.7 | 92.5 | 78.9 | 89.5 | 72.4
CoALM 405B (ours)* | 93.4 | 77.8 | 94.5 | 77.1 | 92.4 | 71.9 | 90.3 | 64.4
Llama 3.1 8B Instruct | 72.7 | 75.2 | 84.0 | 81.4 | 79.8 | 76.3 | 62.3 | 65.1
Qwen2.5 7B Instruct | 84.3 | 73.9 | 88.9 | 78.5 | 84.6 | 71.2 | 76.4 | 64.2
Hammer 2.0 7B | 90.1 | 74.0 | 92.3 | 74.1 | 89.9 | 68.5 | 85.4 | 58.4
ToolAce | 81.5 | 63.6 | 88.8 | 71.3 | 85.0 | 63.0 | 76.1 | 67.0
Granite-20B-Code | 60.3 | 45.7 | 64.7 | 48.9 | 59.5 | 43.4 | 43.8 | 29.3
Fnc-TOD 13B | 3.9 | 3.3 | 22.1 | 23.4 | 8.0 | 9.2 | 1.5 | 1.1
LDST | 8.3 | 7.1 | 12.8 | 11.6 | 2.7 | 2.4 | 6.2 | 5.7
tod-zero-bqag3oyb | 3.7 | 4.2 | 11.5 | 12.4 | 1.1 | 2.2 | 1.0 | 0.9
nc-latent-tod-step-2 | 3.2 | 3.2 | 14.3 | 13.3 | 3.2 | 1.5 | 0.8 | 0.8
Table 3: API-Bank Benchmark Results. Performance comparison across models on the API-Bank function calling benchmark. Best scores are highlighted in bold and the second-best results are underlined. The asterisk (*) on CoALM 405B denotes one completed epoch, as the model is still in the training process.

Baselines.
In the LA tasks, we included strong baselines like Hammer (Lin et al., 2024), ToolAce (Liu et al., 2024), and Granite (Abdelaziz et al., 2024), which represent state-of-the-art models in agentic tasks, along with OpenAI models. For MultiWOZ evaluations, we recognize that many existing TOD models are trained with classification-based supervised fine-tuning, focusing primarily on DST. Such models do not support free-form dialogue generation, nor do they exhibit broader "chat" capabilities. In contrast, our approach aims to unify both conversational (TOD) and agentic (LA) tasks in a single, generative framework. Other models are evaluated in zero-shot settings but report per-domain JGA rather than overall JGA. We therefore used the popular zero-shot models FNCTOD (Li et al., 2024) and NC-Latent-TOD (King and Flanigan, 2024) as our TOD baselines. Please see Appendix C for more details on these baseline models.

5.2 Results on MultiWOZ

LA models struggle with TOD. Table 2 summarizes results on MultiWOZ 2.4. Baseline models optimized for function calling (ToolAce, Hammer, Granite, CodeAct) achieve low Success Rate and JGA. Although these agents can call APIs effectively, they fail to track user intents across multiple sessions or deliver correct final answers to the user. The exception is ToolAce, whose JGA reaches 34.4%, close to domain-specific TOD models like FNCTOD. Instruction-tuned base LLMs like Llama 3.1 8B perform moderately better on MultiWOZ, reaching a 19.9% Success Rate and 26.3% JGA.

CoALM surpasses and generalizes in TOD. In contrast, our smallest model, CoALM 8B, achieves a 51.6% Success Rate, more than doubling the Success performance of Llama 3.1 8B and surpassing other LAs. Moreover, our CoALM 70B model achieves the top DST result with 43.8% JGA, even outperforming GPT-4o and GPT-4o-mini.
This shows CoALM’s ability with coherent multi-turn state-tracking, outperforming existing baselines and domain-specific models like FNC- TOD. Notably, CoALM’s strong performance is achieved without any MultiWOZ samples in its CoALM-IT training dataset, demonstrating its ro- bustness in out-of-distribution (OOD) generaliza- tion. 5.3 Results on API-Bank and BFCL CoALM adeptly orchestrates function calls. Ta- ble 3 shows API-Bank scores to test model’s API calling capabilities where Rouge-L is the primary evaluation metric. TOD models in the bottom row yield suboptimal results in this task. On the other hand, CoALM 8B achieves a Rouge-L score of 92.8 at Level-1 and 81.9 at Level-2, surpassing both TOD-oriented models and tool-centric LAs by a sig- nificant margin. It also achieves top performance on nearly all metrics. Moreover, we scale CoALM 8B accuracy with CoALM 70B and CoALM 405B models achieving top best and second best scores. This suggests that CoALM’s balanced approach en- ables it not only to retrieve and call the correct API but also to generate precise responses grounded in the returned results, fulfilling complex user re- quests effectively. CoALM outperforms specialized LAs and GPT- 4o.We next assess function calling accuracy on BFCL V3 (Table 4). Models trained only for TOD or basic instruction-following underperform. While LAs like Hammer and ToolAce fare better, our smallest model CoALM 8B surpasses them (see Figure 3 for error analysis examples). Our larger scale models outperform GPT-4o, GPT-4o- mini and Llama-3.1-405B in overall accuracy. Re- markably, CoALM 405B achieves 100% accuracy on the relevance detection task, highlighting its agentic reasoning capabilities through hallucina- tion. 
CoALM 405B stands as the top-performing fully open-source model on the BFCL V3 leaderboard.

| Model | Overall Acc | Non-Live AST Acc | Non-Live Exec Acc | Live Acc | Multi Turn Acc | Relevance Detection | Irrelevance Detection |
|---|---|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.3 | 38.35% | 56.33% | 63.77% | 57.31% | 0.25% | 77.78% | 41.84% |
| Llama-3.1-8B-Instruct | 49.84% | 84.25% | 79.75% | 60.33% | 10.25% | 75.61% | 47.92% |
| Llama-3.3-70B-Instruct | 51.36% | 84.85% | 90.05% | 62.51% | 7.25% | 95.12% | 48.33% |
| ToolAce | 52.55% | 82.19% | 86.98% | 71.08% | 0.88% | 70.73% | 87.29% |
| Hammer2.0-7b | 52.13% | 86.94% | 83.66% | 71.17% | 0.38% | 95.12% | 73.20% |
| Llama-3.1-405B-Instruct | 56.38% | 89.71% | 84.70% | 70.77% | 11.75% | 88.89% | 70.86% |
| GPT-4o-mini (2024-07-18) | 59.40% | 86.52% | 85.05% | 73.26% | 19.00% | 78.05% | 76.97% |
| GPT-4o (2024-08-06) | 59.83% | 70.08% | 60.79% | 76.41% | 34.62% | 51.22% | 87.34% |
| CoALM 8B (ours) | 54.11% | 85.17% | 78.61% | 72.59% | 7.00% | 77.78% | 83.00% |
| CoALM 70B (ours) | 60.49% | 82.94% | 81.36% | 72.19% | 26.25% | 72.22% | 85.36% |
| CoALM 405B (ours)* | 63.34% | 90.46% | 84.75% | 74.59% | 28.25% | 100.00% | 72.26% |

Table 4: BFCL V3 Benchmark Results. Performance comparison on the BFCL V3 function-calling benchmark. The best results are highlighted in bold, while the second-best results are underlined. The asterisk (*) on CoALM 405B denotes one completed epoch, as the model continues training.

| Model | MultiWOZ 2.4 Success | MultiWOZ 2.4 DST | API-Bank Rouge-L1 | API-Bank Rouge-L2 | BFCL-V3 Overall Success |
|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | 19.9 | 26.3 | 72.7 | 75.2 | 49.8 |
| + CoALM-IT w/o LA | 46.0 (26.1↑, 5.6↓) | 28.5 (2.2↑, 1.9↓) | 45.5 (27.2↓, 47.3↓) | 48.8 (26.4↓, 33.1↓) | 35.4 (14.4↓, 18.3↓) |
| + CoALM-IT w/o TOD | 42.0 (22.1↑, 9.6↓) | 19.4 (6.9↓, 11.0↓) | 92.7 (20.0↑, 0.1↓) | 78.9 (13.7↑, 3.0↓) | 54.1 (4.3↑, 0.4↑) |
| + CoALM-IT w/o CRA | 50.0 (30.1↑, 1.6↓) | 34.5 (8.2↑, 4.1↑) | 91.3 (18.6↑, 1.5↓) | 78.8 (3.6↑, 3.1↓) | 56.6 (10.6↑, 2.9↑) |
| CoALM 8B | 51.6 | 30.4 | 92.8 | 81.9 | 53.7 |

Table 5: Dataset Domain Effects. Experimental results highlighting the impact of excluding specific domain datasets during CoALM fine-tuning. MultiWOZ 2.4 is the TOD task; API-Bank and BFCL-V3 are the function calling tasks. w/o indicates excluding the corresponding dataset during fine-tuning. Each row displays performance changes in parentheses with respect to the base model (Llama) and the final model (CoALM), i.e. (∆Llama, ∆CoALM). Performance gains are highlighted in green, while drops are marked in red.

5.4 Domain Impact on Performance

Table 5 highlights the performance impact of CoALM-IT’s fine-tuning components. Removing the LA datasets significantly reduces function calling performance, with API-Bank Rouge-L1 dropping by 47.3 points and BFCL success falling by 18.3 points relative to CoALM 8B. Excluding the DST dataset (w/o TOD) leads to a notable decline in JGA, which drops by 11.0 points relative to CoALM and even falls 6.9 points below base Llama. This underscores the essential role of fine-tuning on state tracking to capture user intents effectively. Finally, removing the GPT-4-generated CRA dataset has a negative impact on MultiWOZ 2.4’s Success metric, which plummets by 11.7%. Multi-turn function calling accuracy on API-Bank also drops in both the L1 and L2 metrics. This indicates that the CRA dataset is instrumental in developing coherent and contextually aware responses in multi-turn settings. However, JGA and BFCL’s overall success see slight improvements, suggesting that certain specialized skills may benefit marginally in the absence of broader conversational reasoning. These results confirm that each dataset is crucial for balanced task performance, enabling CoALM to generalize effectively across different tasks without overfitting to one domain.

6 Conclusion and Future Work

In this work, we highlighted a critical gap between LA and TOD systems, where each excels in complementary capabilities: function calling and multi-turn conversation management, respectively. To solve this, we introduced CoALM, unified conversational agents that seamlessly integrate sophisticated API usage with natural multi-turn dialogue.
Through fine-tuning on CoALM-IT with a hybrid fine-tuning strategy, CoALM achieves leading performance on both TOD and LA benchmarks, demonstrating that a single model can indeed master multi-turn conversations and tool use effectively. Future work can investigate using reinforcement learning (RL) to generate large-scale interaction trajectories supported with API calls, which could further enhance the self-evolution of conversational agents through purely RL-based optimization. Another direction is improving the multi-turn function calling and user interaction abilities of these models, which remains a difficult problem with generally low accuracy. We believe that our findings, methodologies, and published resources will foster future research to create more capable and versatile conversational systems.

7 Limitations

While CoALM demonstrates improved performance across both conversational TOD and agentic tasks, we conducted all experiments solely using the Llama model family, limiting our insights into other architectures like Mistral and Qwen. Furthermore, many TOD systems rely on classification-based supervised fine-tuning (DST-only) and lack free-form chat capabilities, so we are not able to integrate them in our chat-based evaluation setup for head-to-head comparisons. We also did not systematically assess CoALM’s general reasoning abilities after post-training, leaving open the question of potential catastrophic forgetting. Even though we introduced the open-source model CoALM 405B, inference with it requires 16 H100 GPUs, which may limit accessibility for some researchers. Lastly, our current approach still relies on curated fine-tuning data; future work might investigate self-evolving methods that learn complex function calling skills continuously by leveraging RL.
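As a rough illustration of the 16-GPU figure mentioned above, the following sketch estimates the memory needed just to hold the weights of a 405B-parameter model. It assumes 16-bit weights and the 80 GB H100 variant; neither detail is stated in this section, and the function names are illustrative only.

```python
import math

PARAMS = 405e9          # parameter count of CoALM 405B
BYTES_PER_PARAM = 2     # assumption: 16-bit (bf16/fp16) weights
GPU_MEMORY_GB = 80      # assumption: 80 GB H100 variant


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def min_gpus_for_weights(weights_gb: float, gpu_memory_gb: float) -> int:
    """Smallest GPU count whose combined memory fits the weights."""
    return math.ceil(weights_gb / gpu_memory_gb)


weights_gb = weight_memory_gb(PARAMS, BYTES_PER_PARAM)
print(f"weights: {weights_gb:.0f} GB")
print(f"minimum GPUs for weights only: {min_gpus_for_weights(weights_gb, GPU_MEMORY_GB)}")
print(f"8 GPUs provide {8 * GPU_MEMORY_GB} GB, 16 GPUs provide {16 * GPU_MEMORY_GB} GB")
```

Under these assumptions the weights alone exceed the combined memory of 8 GPUs, while 16 GPUs leave headroom for the KV cache and activations. Quantized weights or a different GPU memory size would change the estimate.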
8 Acknowledgements

We would like to acknowledge the Oumi AI team (Oumi, 2025) for their assistance in training and scaling with the larger CoALM models. We would also like to thank Together AI (TogetherAI, 2025) for providing the cluster resources necessary to enable CoALM 405B training. This project also has benefited from the Microsoft Accelerate Foundation Models Research (AFMR) grant program, through which leading foundation models hosted by Microsoft Azure and access to Azure credits were provided to conduct the research.

References

Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matthew Stallone, Rameswar Panda, Yara Rizk, G P Shrivatsa Bhargav, Maxwell Crouse, Chulaka Gunasekara, Shajith Ikbal, Sachindra Joshi, Hima Karanam, Vineet Kumar, Asim Munawar, Sumit Neelam, Dinesh Raghu, Udit Sharma, Adriana Meza Soria, Dheeraj Sreedhar, Praveen Venkateswaran, Merve Unuvar, David Daniel Cox, Salim Roukos, Luis A. Lastras, and Pavan Kapanipathi. 2024. Granite-function calling model: Introducing function calling abilities via multi-task learning of granular tasks. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1131–1139, Miami, Florida, US. Association for Computational Linguistics.

BitsAndBytes. 2025. bitsandbytes.

Willy Chung, Samuel Cahyawijaya, Bryan Wilie, Holy Lovenia, and Pascale Fung. 2023. InstructTODS: Large language models for end-to-end task-oriented dialogue systems. In Proceedings of the Second Workshop on Natural Language Interfaces, pages 1–21, Bali, Indonesia. Association for Computational Linguistics.

Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms. Preprint, arXiv:2305.14314.

Abhimanyu Dubey et al. 2024. The llama 3 herd of models. ArXiv, abs/2407.21783.

Yujie Feng, Zexin Lu, Bo Liu, Liming Zhan, and Xiao-Ming Wu. 2023. Towards LLM-driven dialogue state tracking. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 739–755, Singapore. Association for Computational Linguistics.

Prakhar Gupta, Cathy Jiao, Yi-Ting Yeh, Shikib Mehri, Maxine Eskenazi, and Jeffrey Bigham. 2022. InstructDial: Improving zero and few-shot generalization in dialogue through instruction tuning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 505–525, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Yushi Hu, Chia-Hsuan Lee, Tianbao Xie, Tao Yu, Noah A. Smith, and Mari Ostendorf. 2022. In-context learning for few-shot dialogue state tracking. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2627–2643, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Vojtěch Hudeček and Ondrej Dusek. 2023. Are large language models all you need for task-oriented dialogue? In Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 216–228, Prague, Czechia. Association for Computational Linguistics.

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.

Brendan King and Jeffrey Flanigan. 2024. Unsupervised end-to-end task-oriented dialogue with LLMs: The power of the noisy channel. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8283–8300, Miami, Florida, USA. Association for Computational Linguistics.

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. 2023. API-bank: A comprehensive benchmark for tool-augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116, Singapore. Association for Computational Linguistics.

Zekun Li, Zhiyu Zoey Chen, Mike Ross, Patrick Huber, Seungwhan Moon, Zhaojiang Lin, Xin Luna Dong, Adithya Sagar, Xifeng Yan, and Paul A. Crook. 2024. Large language models as zero-shot dialogue state tracker through function calling. Preprint, arXiv:2402.10466.

Qiqiang Lin, Muning Wen, Qiuying Peng, Guanyu Nie, Junwei Liao, Jun Wang, Xiaoyun Mo, Jiamu Zhou, Cheng Cheng, Yin Zhao, and Weinan Zhang. 2024. Hammer: Robust function-calling for on-device language models via function masking. ArXiv, abs/2410.04587.

Yuan Ling, Fanyou Wu, Shujing Dong, Yarong Feng, George Karypis, and Chandan K. Reddy. 2023. International workshop on multimodal learning - 2023 theme: Multimodal learning with foundation models. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’23, page 5868–5869, New York, NY, USA. Association for Computing Machinery.

Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, and Enhong Chen. 2024. Toolace: Winning the points of llm function calling. ArXiv, abs/2409.00920.

Marvin Minsky. 1986. The Society of Mind. Simon & Schuster.

OpenAI, Josh Achiam, et al. 2024. Gpt-4 technical report. Preprint, arXiv:2303.08774.

Oumi. 2025. Oumi: an open, end-to-end platform for building large foundation models.

Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. 2023. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014.

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2023. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334.

Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2020. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8689–8696.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2024. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36.

Yixuan Su, Lei Shu, Elman Mansimov, Arshit Gupta, Deng Cai, Yi-An Lai, and Yi Zhang. 2022. Multi-task pre-training for plug-and-play task-oriented dialogue system. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4661–4676, Dublin, Ireland. Association for Computational Linguistics.

TogetherAI. 2025. Together ai: The ai acceleration cloud.

Marilyn A. Walker, Diane J. Litman, Candace A. Kamm, and Alicia Abella. 1997. PARADISE: A framework for evaluating spoken dialogue agents. In 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pages 271–280, Madrid, Spain. Association for Computational Linguistics.

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. 2024. Executable code actions elicit better llm agents. In ICLR 2024 Workshop on Large Language Model (LLM) Agents.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.

Heng-Da Xu, Xian-Ling Mao, Puhai Yang, Fanshu Sun, and Heyan Huang. 2024. Rethinking task-oriented dialogue systems: From complex modularity to zero-shot autonomous agent. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2748–2763, Bangkok, Thailand. Association for Computational Linguistics.

Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2024. Berkeley function calling leaderboard. https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. React: Synergizing reasoning and acting in language models. Preprint, arXiv:2210.03629.

Fanghua Ye, Jarana Manotumruksa, and Emine Yilmaz. 2022. MultiWOZ 2.4: A multi-domain task-oriented dialogue dataset with essential annotation corrections to improve state tracking evaluation. In Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 351–360, Edinburgh, UK. Association for Computational Linguistics.

Steve Young. 2002. Talking to machines (statistically speaking). In 7th International Conference on Spoken Language Processing (ICSLP 2002), pages 9–16.

Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, et al. 2024. xlam: A family of large action models to empower ai agent systems. CoRR.
Xiaoying Zhang, Baolin Peng, Kun Li, Jingyan Zhou, and Helen Meng. 2023. SGP-TOD: Building task bots effortlessly via schema-guided LLM prompting. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13348–13369, Singapore. Association for Computational Linguistics.

Appendix

A Problem Formulation

A.1 End-to-End TOD Systems with LLMs

LLM-based end-to-end TOD systems generate contextually relevant responses based on dialogue history and task instructions. Let $F$ be a language model parameterized by $\theta$, which maps an input context given as a prompt $T$ to an output system response $y_t$. At each dialogue turn $t$, the system receives three key components: task instructions $G$, dialogue history $H_t$ comprising prior user-system interactions $\{(u_1, y_1), \ldots, (u_{t-1}, y_{t-1})\}$, and the current user input $u_t$. These elements are combined to form the complete prompt $T_t = (G, H_t, u_t)$. The model generates a response $y_t$ by modeling the conditional probability:

$$P(y_t \mid T_t; \theta) = P(y_t \mid G, H_t, u_t; \theta), \tag{1}$$

where $P(y_t \mid T_t; \theta)$ denotes the probability of generating the response $y_t$ given the prompt $T_t$ and the model parameters $\theta$. The dialogue progresses by updating the history after each turn, $H_{t+1} = H_t + [(u_t, y_t)]$, maintaining the sequential nature of the interaction while preserving task orientation through $G$.

A.2 Function Calling with Language Agents

A language model $F_\theta$ maps an input $x = (G, u, \Omega)$, where $G$ is the task prompt, $u$ is the user query, and $\Omega = \{f_1, \ldots, f_n\}$ is the set of available functions with their arguments and descriptions, to a structured function call $y$. The model generates the target function call in a structured format, such as JSON or a text schema. The generation probability is defined as:

$$P(y \mid x; \theta) = P(y \mid G, u, \Omega; \theta) \tag{2}$$

This formulation enables the model to translate natural language inputs into precise and well-structured function calls, facilitating seamless integration with external systems.

ReAct Prompting.
ReAct (Yao et al., 2023) integrates reasoning and action-taking to enable more effective decision-making. It facilitates intermediate reasoning by breaking down complex tasks into smaller, interpretable reasoning steps. Additionally, it enables interaction with external tools or APIs by producing structured actions that integrate effectively with external systems. After an API execution, ReAct incorporates the resulting observations dynamically, adapting subsequent reasoning and actions based on the results of previous steps, thus improving the system’s responsiveness and overall task performance.

B Details of the Evaluation Benchmarks

MultiWOZ 2.4. MultiWOZ 2.4 (Ye et al., 2022) is a multi-domain TOD dataset designed to evaluate dialogue systems’ ability to handle complex conversations across multiple domains such as hotel booking, restaurant reservations, and transportation. We employ two metrics in our MultiWOZ evaluations: Success Rate, which assesses whether all user-requested information related to the entity is successfully provided, and Joint Goal Accuracy (JGA), which measures the accuracy of predicted dialogue states, reflecting the system’s ability to track user intents. During our zero-shot evaluations, we used its test set, which contains 999 samples, and incorporated the AutoTOD prompt (Xu et al., 2024) with slight modifications, thereby generating system responses analogous to those produced in a chat-based inference setting.

API-Bank. API-Bank (Li et al., 2023) is designed to evaluate tool-augmented LAs, focusing on their ability to plan, retrieve, and invoke APIs effectively. It includes 314 tool-use dialogues and 753 API calls, with two evaluation levels: Level 1 (L-1), which tests the accuracy of invoking a known API based on a given query, and Level 2 (L-2), which assesses the retrieval and invocation of APIs from a candidate list, simulating real-world scenarios with multiple API options.
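To make the two kinds of checks in this appendix concrete, the sketch below shows exact-match scoring of dialogue states (the usual reading of JGA) and a strict comparison of a predicted API call with a reference call. It is an illustration only, not the official MultiWOZ or API-Bank scoring code: the helper names `joint_goal_accuracy` and `call_matches` and the hotel/train states are hypothetical, and the API call values are modeled on the API-Bank example in Figure 3.

```python
from typing import Any

# A dialogue state maps each domain to its slot-value pairs.
DialogueState = dict[str, dict[str, str]]


def joint_goal_accuracy(predicted: list[DialogueState], gold: list[DialogueState]) -> float:
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def call_matches(predicted: dict[str, Any], reference: dict[str, Any]) -> bool:
    """True only if the API name and all arguments agree with the reference."""
    return (
        predicted.get("name") == reference.get("name")
        and predicted.get("arguments") == reference.get("arguments")
    )


# Hypothetical three-turn dialogue: the last predicted turn misses the train slot.
gold_states = [
    {"hotel": {"area": "north"}},
    {"hotel": {"area": "north", "stars": "4"}},
    {"hotel": {"area": "north", "stars": "4"}, "train": {"day": "friday"}},
]
predicted_states = [
    {"hotel": {"area": "north"}},
    {"hotel": {"area": "north", "stars": "4"}},
    {"hotel": {"area": "north", "stars": "4"}},
]
jga = joint_goal_accuracy(predicted_states, gold_states)

reference_call = {
    "name": "ModifyReminder",
    "arguments": {
        "token": "p9o8i7u6y5t4r3e2w1q",
        "content": "Submit report",
        "time": "2023-03-25 14:00:00",
    },
}
# Same API name, but a wrong token and a wrong time.
wrong_call = {
    "name": "ModifyReminder",
    "arguments": {
        "token": "p9o8i7u6y5t4k3e2w1q",
        "content": "Submit report",
        "time": "2023-03-17 09:00:00",
    },
}
print(f"JGA: {jga:.3f}")
print("wrong call matches reference:", call_matches(wrong_call, reference_call))
```

Exact matching is deliberately strict: a single wrong slot or argument value makes the whole turn or call count as incorrect, which is why multi-turn state tracking and parallel function calls are hard to score well on.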
By addressing these planning, retrieval, and invocation challenges, API-Bank advances the understanding and enhancement of tool-augmented reasoning in LLMs. During evaluations, we used the official evaluation code from the repository of previous works (Lin et al., 2024).

Berkeley Function Calling Leaderboard. In addition to API-Bank, we also used BFCL V3⁶ (Patil et al., 2023), which provides a diverse evaluation framework for assessing the models’ ability to perform function calls across various objectives. It includes more than 1,800 test cases that span tasks such as simple functions, multiple functions, and parallel functions for Python and other environments such as REST APIs and JavaScript. Models are evaluated using two primary metrics: (i) Abstract Syntax Tree (AST) accuracy, which ensures syntactic correctness by verifying function structures, parameters, and types against predefined documentation, and (ii) Executable Function Accuracy, which evaluates whether generated functions execute correctly and produce the expected outputs, emphasizing real-world applicability. In our experiments, we employed the official repository released by the authors and followed the provided instructions to get model results.

⁶ https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html

C Baseline Model Overviews Used in Experiments

In this section, we provide an overview of the models used in our experiments, including brief descriptions, checkpoints, and references to the training reproduction code.

C.1 Base Models

Llama 3.1. The Llama (Large Language Model Meta AI) (Dubey et al., 2024) family is a set of open-source language models from Meta AI, ranging from 7 to 405 billion parameters. Trained on a large corpus of web content, academic texts, and books, they excel at reasoning, question-answering, and code generation. Their architecture supports efficient fine-tuning and deployment.
In our experiments, we use Llama-3.1-8B-Instruct⁷, released in July 2024, which offers improved multilingual capabilities, longer context windows, and state-of-the-art performance in general knowledge, math, and tool usage.

Mistral v0.3. Mistral 7B (Jiang et al., 2023) is one of the state-of-the-art, open-source LLMs produced by Mistral AI. It employs innovative mechanisms such as grouped-query and sliding window attention, which enable efficient processing of longer sequences and faster inference times. In our experiments, we use Mistral-7B-Instruct-v0.3⁸, released on May 22, 2024, and available on Hugging Face.

⁷ https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
⁸ https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

C.2 TOD Models

LDST. LDST (LLM-driven Dialogue State Tracking) (Feng et al., 2023) is an approach that overcomes the limitations of proprietary models in state tracking by leveraging a fine-tuned LLaMa 7B model. The approach combines a novel assembled domain-slot instruction tuning technique with parameter-efficient strategies, enabling resource-efficient performance that tries to match larger models. In our experiments, we fine-tuned LDST using the provided checkpoints; implementation details are available in the public repository⁹.

Fnc-TOD. FNC-TOD (Function-Calling for Task-Oriented Dialogue) focuses on DST in LLMs through function calling mechanisms. The method conceptualizes domain schemas as functions and embeds function specifications within the system prompt. This approach achieved improved conversation state tracking in task-oriented dialogues using a fine-tuned Llama-2-13b-chat-hf model, trained on a focused dataset of 7,200 task-oriented dialogues spanning 36 domains. For our experiments, we utilized the authors’ publicly released Zekunli/FncTOD-Llama-13b model, available on Hugging Face¹⁰.

NC-Latent-TOD.
This work introduces an unsupervised approach to TOD systems that operates solely with API schemas and unlabeled dialogues, eliminating the need for costly turn-level annotations. The system generates pseudo-labels for API calls and system actions while using a hard expectation-maximization approach with LLM predictions for iterative fine-tuning, enhanced by a noisy-channel reranking method (King and Flanigan, 2024). During our experiments, we used two different models, nc-latent-tod-step-2-final¹¹ and tod-zero-bqag3oyb-32000¹², shared by the authors.

⁹ https://github.com/WoodScene/LDST
¹⁰ https://huggingface.co/Zekunli/FncTOD-Llama-13b
¹¹ https://huggingface.co/Brendan/nc-latent-tod-step-2-final
¹² https://huggingface.co/Brendan/tod-zero-bqag3oyb-32000

C.3 Language Agents

CodeAct-Agent. CodeAct (Wang et al., 2024) is a framework that enables LLM agents to generate and execute Python code as actions to interact with the environment, rather than being limited to JSON or structured text formats. By integrating a Python interpreter, it allows agents to dynamically adjust their actions based on execution results, leverage existing Python packages, and utilize programming constructs like loops and conditionals for complex operations. The authors developed CodeActAgent by fine-tuning both Mistral 7B and Llama2 7B models on the CodeAct-Instruct dataset. For our experiments, we utilized the authors’ officially released CodeActAgent-Mistral-7b-v0.1 model, available on Hugging Face¹³.

Granite-20B. This work introduces Granite-20B, an open-source LLM specifically designed for function calling capabilities. The model is trained using a multi-task approach on seven core function calling tasks: Nested Function Calling, Function Chaining, Parallel Functions, Function Name Detection, Parameter-Value Pair Detection, Next-Best Function, and Response Generation. We used the official model weights granite-20b-code-instruct-8k provided on Hugging Face¹⁴.

Hammer2.0-7B.
Hammer (Lin et al., 2024) is a family of small-scale models, up to 7B parameters, designed for on-device function calling. It addresses generalization challenges in function calling through two key innovations: an irrelevance-augmented dataset that enhances models’ ability to identify inappropriate functions, and a function masking technique that reduces naming-based misinterpretations by focusing on function descriptions. Built by fine-tuning on the xLAM-function-calling dataset¹⁵ with 7,500 additional instances for irrelevance detection, Hammer achieves state-of-the-art performance on the BFCL Benchmark. For our experiments, we utilized the official Hammer 2.0 model weights available on Hugging Face¹⁶, and also trained it from scratch for reproducibility using the provided public repository and training scripts¹⁷.

ToolAce 8B. This work introduces ToolACE (Liu et al., 2024), an automated pipeline for generating high-quality function-calling training data. The system features a self-evolution synthesis process that curates a pool of 26,507 diverse APIs, coupled with a multi-agent dialogue generation system and a dual-layer verification process for ensuring data accuracy. Fine-tuned from Llama-3.1-8B-Instruct on the generated data, ToolACE achieves top results on the BFCL Leaderboard. We used the official Hugging Face checkpoint¹⁸ and dataset¹⁹.

¹³ https://huggingface.co/xingyaoww/CodeActAgent-Mistral-7b-v0.1
¹⁴ https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k
¹⁵ https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k
¹⁶ https://huggingface.co/MadeAgents/Hammer2.0-7b
¹⁷ https://github.com/MadeAgents/Hammer

D Human Validation for Generated CRA Dataset

To analyze the quality of generated conversations, we implemented a systematic random sampling approach. From the generated dataset, we randomly selected 100 dialogue instances for validation.
We conducted the evaluation against a predefined set of 51 available functions, covering transportation, booking, entertainment, and utility services. We scrutinized each function’s schema, including its parameters and expected usage, to ensure compliance. We asked a senior Computer Science student to evaluate these generated samples across four key dimensions:

• Undefined Function Call: Validating API names and parameters against the predefined function list to identify undefined functions or invalid arguments.
• Incorrect Argument Type: Checking argument structures to ensure compliance with the function schemas.
• Argument Hallucination: Detecting unnecessary or irrelevant arguments misaligned with the conversation context.
• Low-Quality Reasoning and Planning: Identifying logical gaps in thought steps or unnecessary API calls in the ReAct structure.

We asked for a binary score (1 for no errors, 0 for detected issues) for each generated dialogue, with mandatory feedback for any errors. Our evaluation of 100 dialogues showed a 9% error rate, mostly in restaurant reservations where key details like the restaurant name or dining time were missing. These errors stemmed from Argument Hallucination and Low-Quality Reasoning. Results, including dialogue IDs, scores, and feedback, were systematically collected to identify areas for improvement.

¹⁸ https://huggingface.co/Team-ACE/ToolACE-8B
¹⁹ https://huggingface.co/datasets/Team-ACE/ToolACE

[Figure 3 has two panels, Task 2: API-Bank (L1) and Task 3: BFCL V3 (Parallel). Each panel compares the outputs of (a) Llama 3.1 8b, (b) ToolAce, and (c) CoALM 8B with the ground truth; the outputs are annotated as wrong syntax, missing functions, wrong values, or correct. The inputs and ground truths of the two examples are reproduced below.]

Task 2: API-Bank (L1)
User: Hi, can you help me modify a reminder?
AI: Of course! What do you want to modify about the reminder?
User: I need to change the content of the reminder.
AI: Okay, I can help with that. Do you have a specific reminder in mind?
User: I need to change the content of the reminder for the proposal submission on March 25th at 2:00 PM.
AI: I will need your user token. Can you please provide your username and password?
User: My username is user3 and password is user3pass. I want to change it to submit documents instead.
Generate API Request:
Ground Truth: [{"name": "GetUserToken", "arguments": {"username": "user3", "password": "user3pass"}}, {"name": "ModifyReminder", "arguments": {"token": "p9o8i7u6y5t4r3e2w1q", "content": "Submit report", "time": "2023-03-25 14:00:00"}}]

Task 3: BFCL V3 (Parallel)
User: I had 8 pieces of frozen mango and a chai tea. Earlier I had two slices of pepperoni pizza and a coffee.
Ground Truth: log_food(food_name="mango", portion_amount=8.0, portion_unit="piece"), log_food(food_name="chai tea", portion_amount=1.0, portion_unit="cup"), log_food(food_name="pepperoni pizza", portion_amount=2.0, portion_unit="slice"), log_food(food_name="coffee", portion_amount=1.0, portion_unit="cup")

Figure 3: Error Analysis of Function-Calling Results. Illustrated performance comparison on function calling benchmarks API-Bank L1 (top) and BFCL V3 parallel function call (bottom). Results demonstrate CoALM’s consistent performance compared to other baselines.

SNIPS SFT Sample | Format: Dialogue State Tracking

Instruction: You are a helpful assistant who is assigned to find the intents shown by the user on 7 domains - GetWeather, AddToPlaylist, SearchScreeningEvent, BookRestaurant, SearchCreativeWork, RateBook, PlayMusic.
The user can seek for BookRestaurant by slots - poi, restaurant_type, served_dish, timeRange, party_size_number, restaurant_name, state, country, party_size_description, sort, city, spatial_relation, cuisine, facility.
The user can seek for GetWeather by slots - condition_temperature, geographic_poi, current_location, timeRange, condition_description, state, country, city, spatial_relation.
The user can seek for SearchCreativeWork by slots - object_type, object_name.
The user can seek for PlayMusic by slots - track, playlist, service, genre, year, album, music_item, sort, artist.
The user can seek for SearchScreeningEvent by slots - movie_name, location_name, timeRange, object_type, movie_type, object_location_type, spatial_relation.
The user can seek for RateBook by slots - rating_value, rating_unit, object_type, object_select, object_part_of_series_type, best_rating, object_name.
Do not capture any other slots!

# Task
You will be provided with an user utterance. You must find all the user intents and output them in JSON format.

# Sample Output
{"domain": "AddToPlaylist", "slot_values": {"music_item": "abc", "artist": "xyz"}}

Input:
User: Book a table at a restaurant in Portugal with parking for me and bonnie in 19 minutes

Output:
System: {"domain": "BookRestaurant", "slot_values": {"restaurant_type": "restaurant", "country": "Portugal", "facility": "parking", "party_size_description": "me and bonnie", "timeRange": "in 19 minutes"}}

Figure 4: SNIPS fine-tuning sample example.

Page 17:
Hammer SFT Sample | Format: Function Calling

Instruction:
[BEGIN OF TASK INSTRUCTION]
You are a tool calling assistant. In order to complete the user’s request, you need to select one or more appropriate tools from the following tools and fill in the correct values for the tool parameters. Your specific tasks are:
1. Make one or more function/tool calls to meet the request based on the question.
2. If none of the function can be used, point it out and refuse to answer.
3. If the given question lacks the parameters required by the function, also point it out.
[END OF TASK INSTRUCTION]

[BEGIN OF AVAILABLE TOOLS]
[{"name": "LxOm64zLyg", "description": "Gets hourly weather forecast information for given geographical coordinates using the RapidAPI service.", "parameters": {"TDpjPd": {"description": "The latitude of the geographical location.", "type": "int", "default": 46.95828}, "78th2U3lFj": {"description": "The longitude of the geographical location.", "type": "int", "default": 10.87152}}},
{"name": "WoDdNSe7e7K5", "description": "Fetches weather updates for a given city using the RapidAPI Weather API.", "parameters": {"LzZsvxUC": {"description": "The name of the city for which to retrieve weather information.", "type": "str", "default": "London"}}},
{"name": "CBrCNmwOERb", "description": "Fetches the hourly weather forecast for a given location using the RapidAPI service.", "parameters": {"TDEJ.ZwMt": {"description": "The name of the location for which to retrieve the hourly weather forecast.", "type": "str", "default": "Berlin"}}},
{"name": "1YTQVXkwLY", "description": "Returns an air quality forecast for a given location.", "parameters": {"2bkgDA": {"description": "The latitude of the location for which the air quality forecast is to be retrieved.", "type": "int", "default": "35.779"}, "DQi.ReZ16": {"description": "The longitude of the location for which the air quality forecast is to be retrieved.", "type": "int", "default": "-78.638"}, "hF.1": {"description": "The number of hours for which the forecast is to be retrieved (default is 72).", "type": "int", "default": "72"}}}]
[END OF AVAILABLE TOOLS]

[BEGIN OF FORMAT INSTRUCTION]
The output MUST strictly adhere to the following JSON format, and NO other text MUST be included. The example format is as follows. Please make sure the parameter type is correct. If no function call is needed, please directly output an empty list '[]'
[
{"name": "func_name1", "arguments": {"argument1": "value1", "argument2": "value2"}},
...
(more tool calls as required)
]
[END OF FORMAT INSTRUCTION]

Input:
[BEGIN OF QUERY]
What are the current weather conditions in Sydney?
[END OF QUERY]

Output:
[{"name": "WoDdNSe7e7K5", "arguments": {"LzZsvxUC": "Sydney"}}]

Figure 5: Hammer fine-tuning sample example.

Page 18:
SGD Instruction Sample | Format: Action Optimization

Instruction:
[BEGIN OF TASK INSTRUCTION]
You are a helpful conversational assistant who can perform API function calling. Your goal is to understand user queries and respond using the appropriate API functions. In order to complete the user’s request, you need to select a tool from the following functions and fill in the correct values for the function parameters. Your specific tasks are:
1. Analyze the user’s query within the given dialogue context to identify their intent and relevant details.
2. Make a function/tool call and provide the necessary arguments to meet the request based on the user query.
3. Formulate a natural and coherent response, guiding the conversation towards resolving the user’s request.
[END OF TASK INSTRUCTION]

[BEGIN OF AVAILABLE TOOLS]
1. FindBus(from_location, to_location, leaving_date)
2. ReserveCar(pickup_location, pickup_date, pickup_time, dropoff_date, type)
. . .
50. FindProvider(city, type)
51. GetRide(destination, number_of_riders, shared_ride)
[END OF AVAILABLE TOOLS]

[BEGIN OF FORMAT INSTRUCTION]
The output MUST strictly adhere to the following structured text format.
Example Output API Call Format:
function_name(argument1=value1, argument2=value2, ...)
[END OF FORMAT INSTRUCTION]

[BEGIN OF CONVERSATION HISTORY]
User: I’m looking for events happening in New York this weekend.
Thought: The user wants to find events in a specific location and timeframe. I’ll need to call the FindEvents API with the appropriate category and city.
Action: FindEvents(category='all', city_of_event='New York')
Observation: 'events': ['name': 'Jazz Festival', 'date': '2023-10-07', 'location': 'Central Park', 'name': 'Food Truck Rally', 'date': '2023-10-08', 'location': 'Union Square']
Thought: I have the event details now. I should summarize the events for the user.
System: This weekend in New York, you can attend the Jazz Festival at Central Park on October 7th and the Food Truck Rally at Union Square on October 8th. Would you like more information about any of these events?
User: Yes, can you tell me more about the Jazz Festival?
Thought: I need to provide more details about the Jazz Festival. I don’t need an API call for that.
System: The Jazz Festival in Central Park will feature various artists performing live jazz music throughout the day. It starts at 11 AM and goes until 8 PM. Would you like to know how to get tickets?
[END OF CONVERSATION HISTORY]

Input:
User: Yes, please tell me how to get tickets for the Jazz Festival.
Thought: The user is interested in purchasing tickets. I need to call the BuyEventTickets API for the Jazz Festival.

Output:
Action: BuyEventTickets(event_name='Jazz Festival', number_of_seats=2, date='2023-10-07', city_of_event='New York')

Figure 6: SGD fine-tuning sample example targeting function optimization.

Page 19:
SGD Instruction Sample | Format: System Response Optimization

Instruction:
[BEGIN OF TASK INSTRUCTION]
You are a helpful conversational assistant specializing in understanding user queries and providing accurate, reasoned responses. Your goal is to analyze the user’s input, reason about their intent and needs, and provide a coherent and contextually appropriate system response. Your specific tasks are:
- Ensure your response is informative and contextually relevant, guiding the conversation toward successful task completion.
- Analyze the user’s input in the context of the conversation history (if available) to identify their intent and relevant details.
- Use logical reasoning to determine the most suitable response, considering the user’s needs and the dialogue context.
- Generate a natural and coherent system response to address the user’s request or query effectively.
[END OF TASK INSTRUCTION]

[BEGIN OF FORMAT INSTRUCTION]
The output MUST strictly adhere to the following structured text format.
Example Output API Call Format:
function_name(argument1=value1, argument2=value2, ...)
[END OF FORMAT INSTRUCTION]

[BEGIN OF CONVERSATION HISTORY]
User: I’m looking for events happening in New York this weekend.
Thought: The user wants to find events in a specific location and timeframe. I’ll need to call the FindEvents API with the appropriate category and city.
Action: FindEvents(category='all', city_of_event='New York')
Observation: 'events': ['name': 'Jazz Festival', 'date': '2023-10-07', 'location': 'Central Park', 'name': 'Food Truck Rally', 'date': '2023-10-08', 'location': 'Union Square']
Thought: I have the event details now. I should summarize the events for the user.
System: This weekend in New York, you can attend the Jazz Festival at Central Park on October 7th and the Food Truck Rally at Union Square on October 8th. Would you like more information about any of these events?
User: Yes, can you tell me more about the Jazz Festival?
Thought: I need to provide more details about the Jazz Festival. I don’t need an API call for that.
System: The Jazz Festival in Central Park will feature various artists performing live jazz music throughout the day. It starts at 11 AM and goes until 8 PM. Would you like to know how to get tickets?
[END OF CONVERSATION HISTORY]

Input:
User: Yes, please tell me how to get tickets for the Jazz Festival.
Thought: The user is interested in purchasing tickets. I need to call the BuyEventTickets API for the Jazz Festival.
Action: BuyEventTickets(event_name='Jazz Festival', number_of_seats=2, date='2023-10-07', city_of_event='New York')
Observation: 'status': 'success', 'message': 'You have successfully purchased 2 tickets for the Jazz Festival.'
Thought: The tickets were successfully purchased. I need to confirm this with the user.

Output:
System: You have successfully purchased 2 tickets for the Jazz Festival in Central Park! Enjoy the music!

Figure 7: SGD fine-tuning sample example targeting system response optimization.

Page 20:
You are an advanced AI assistant specializing in conversational dialogues. You have access to a variety of services and APIs to assist users with their requests and your goal is to provide helpful and informative responses to user queries and commands. You can interact with multiple services and APIs to fulfill user requests. Your responses should be natural, informative, and tailored to the user’s needs.

# Task Information:
- You are asked to create a dataset in the format: User - Thought1 - API - API Input Arguments - API Result - Thought2 - System, or User - Thought - System.
- For the given # User Input, generate a multi-turn dialogue that follows this format, with each turn exhibiting realistic context reasoning, thought processes, and API interaction where applicable.
- The dialogues should be converted to follow a specific # Output Format, which includes reasoning on whether an API call is needed or if the system can respond directly.
- If the system decides that an API call is necessary, use this format: User - Thought1 - API - API Input Arguments - API Result - Thought2 - System.
- Call the right API from # Avaliable Functions and provide the necessary input arguments to fulfill the user’s request.
- If you think a function argument is not necessary, you can skip it. Don’t provide unnecessary arguments and None values.
- Ensure that the API calls are used logically and that the dialogue remains coherent and natural throughout the exchange.
- If the system determines that an API call is not necessary, use this format: User - Thought - System.
- Include intermediate thoughts where appropriate to capture the model’s internal reasoning, and clearly separate the different components of the format.

# Available Functions:
1. CheckBalance(account_type)
2. TransferMoney(account_type, amount, recipient_account_name)
. . .
52. FindAttractions(location)
53. GetWeather(city)

# Output Format:
- If an API Call is Needed:
User: [User Input]
Thought1: [I need to call an API]
API Name: [API Call Name: CheckBalance(), TransferMoney(),..., FindAttractions(), GetWeather()]
API Input: [The input parameters for the API]
API Result: [API output result]
Thought2: [2nd thought after API Result that ensure if the information is enough before the system response]
System: [Your system response here]
- If an API Call is Not Needed:
User: [User Input]
Thought: [I don’t need an API and I want to respond to the user]
System: [Your system response here]

Below is an example of the format:

# Example:
User: Add Don and Sherri to my "Meditate to Sounds of Nature" playlist.
Thought1: The user wants to add two artists to a specific playlist. I’ll need to extract the playlist name and the artists’ names.
API Name: AddToPlaylist
API Input: 'playlist_name': 'Meditate to Sounds of Nature', 'artists': Don Sherri
API Result: 'status': 'success', 'message': 'Don and Sherri have been added to your playlist.'
Thought2: The API call was successful, and the artists were added to the playlist. I need to inform the user about this.
System: Don and Sherri have been successfully added to your "Meditate to Sounds of Nature" playlist. Enjoy your music!
User: Thank you for adding them. I think I can add more songs to the playlist.
Thought: I don’t need an API call for that, I want to respond to the user.
System: Would you like to add more songs?
User: Yes, please add "Calm River" to the playlist.
Thought1: The user wants to add another song to the playlist. I’ll note the song title and proceed with the request.
API Name: AddToPlaylist
API Input: 'playlist_name': 'Meditate to Sounds of Nature', 'songs': Calm River
API Result: 'status': 'success', 'message': 'Calm River has been added to your playlist.'
Thought2: The song was successfully added. I should let the user know.
System: "Calm River" has been successfully added to your "Meditate to Sounds of Nature" playlist. Would you like to do anything else?
User: No, that’s all for now. Thank you!
Thought: The user is satisfied and doesn’t need further assistance. I should acknowledge their gratitude.
System: You’re welcome! If you need anything else, feel free to ask. Enjoy your relaxing music!

# User Input:

Table 6: API-driven conversation prompt to generate SGD fine-tuning samples with GPT-4o. Only functions used across all dialogues are included due to their large number.
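
The output format that the Table 6 prompt asks for is regular enough to be checked mechanically. The following is a minimal sketch, not the authors' pipeline, of how generated dialogues might be split into turns and validated against the two allowed field orders (User - Thought1 - API Name - API Input - API Result - Thought2 - System, or User - Thought - System). The names Turn, parse_dialogue, and is_well_formed are hypothetical and introduced only for illustration; the field labels are the ones listed in the prompt.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Field labels copied from the "# Output Format" section of the prompt.
LINE_RE = re.compile(
    r"^(User|Thought1|Thought2|Thought|API Name|API Input|API Result|System):\s*(.*)$"
)

# The two field orders the prompt allows for a single turn.
API_ORDER = ["User", "Thought1", "API Name", "API Input", "API Result", "Thought2", "System"]
PLAIN_ORDER = ["User", "Thought", "System"]


@dataclass
class Turn:
    """One user turn plus everything generated in response (hypothetical container)."""
    fields: dict = field(default_factory=dict)

    @property
    def uses_api(self) -> bool:
        return "API Name" in self.fields


def parse_dialogue(text: str) -> list[Turn]:
    """Split a generated dialogue into turns; each 'User:' line opens a new turn."""
    turns: list[Turn] = []
    current = None
    last_label = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match:
            label, value = match.groups()
            if label == "User":
                current = Turn()
                turns.append(current)
            if current is None:
                continue  # ignore labelled text that precedes the first user turn
            current.fields[label] = value
            last_label = label
        elif current is not None and last_label is not None:
            # Unlabelled line: treat it as a continuation of the previous field.
            current.fields[last_label] += " " + line
    return turns


def is_well_formed(turn: Turn) -> bool:
    """True if the turn's fields appear in one of the two orders the prompt allows."""
    return list(turn.fields) in (API_ORDER, PLAIN_ORDER)
```

A filter built on is_well_formed could discard generations that skip a field or mix the two formats before they are converted into training samples; whether the authors applied such a filter is not stated in this section.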

---