
arxiv

Paper 2402.11359

Offline Training of Language Model Agents with Functions as Learnable Weights

Authors: Shaokun Zhang, Jieyu Zhang, Jiale Liu, Linxin Song, Chi Wang, Ranjay Krishna, Qingyun Wu

Published: 2024-02-17

Abstract:

Researchers and practitioners have recently reframed powerful Large Language Models (LLMs) as agents, enabling them to automate complex tasks largely via the use of specialized functions. To facilitate the development of LLM agents, we present a novel paradigm of training LLM agents without modifying the LLM weights, which is particularly useful when the LLMs are difficult or inaccessible for modifications. Inspired by how humans continuously forge tools to adapt to real-world tasks, rather than change our biological structure to fit a static set of tools, we propose to progressively forge the agent's functions to better solve the downstream tasks instead of modifying the LLM weights. By treating the functions as learnable 'agent parameters' and leveraging the fundamental idea of model training in artificial intelligence, we develop AgentOptimizer, which employs the LLM to update agents' functions, and devise an agent training algorithm with two strategies, roll-back and early-stop, to streamline the training process. With extensive experiments, we showcase that the agent training paradigm could significantly improve the performance of representative LLM agents in various downstream tasks. We also study the behavior of the agent training regarding aspects like the learning curve and domain transferability.

Paper Content:
Page 1: Offline Training of Language Model Agents with Functions as Learnable Weights

Shaokun Zhang* 1, Jieyu Zhang* 2, Jiale Liu 1, Linxin Song 3, Chi Wang 4, Ranjay Krishna 2, Qingyun Wu 1

*Equal contribution. 1 Pennsylvania State University, 2 University of Washington, 3 University of Southern California, 4 Microsoft Research. Correspondence to: Qingyun Wu <qingyun.wu@psu.edu>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). arXiv:2402.11359v4 [cs.AI] 30 Jul 2024.

Abstract

Researchers and practitioners have recently reframed powerful Large Language Models (LLMs) as agents, enabling them to automate complex tasks largely via the use of specialized functions. To facilitate the development of LLM agents, we present a novel paradigm of training LLM agents without modifying the LLM weights, which is particularly useful when the LLMs are difficult or inaccessible for modifications. Inspired by how humans continuously forge tools to adapt to real-world tasks, rather than change our biological structure to fit a static set of tools, we propose to progressively forge the agent's functions to better solve the downstream tasks instead of modifying the LLM weights. By treating the functions as learnable 'agent parameters' and leveraging the fundamental idea of model training in artificial intelligence, we develop AgentOptimizer, which employs the LLM to update agents' functions, and devise an agent training algorithm with two strategies, roll-back and early-stop, to streamline the training process. With extensive experiments, we showcase that the agent training paradigm could significantly improve the performance of representative LLM agents in various downstream tasks. We also study the behavior of the agent training regarding aspects like the learning curve and domain transferability. We have integrated our method into the AutoGen library.

1. Introduction

Reframing Large Language Models (LLMs) as agents has ushered in a new paradigm of automation, one where LLMs can utilize existing functions (footnote 1) to accomplish complex tasks (Xi et al., 2023; Wang et al., 2023b; Yao et al., 2023; Wang et al., 2023a; Humphreys et al., 2022; Shridhar et al., 2021). For example, LLM agents armed with a function to 'search over Wikipedia' can answer knowledge questions; agents with the ability to 'issue SQL queries' can search large databases. Functions allow LLMs to access external knowledge sources (Peng et al., 2023), offload numerical computation (Wu et al., 2023b), search the internet (Shi et al., 2017), and much more (Qin et al., 2023).

Footnote 1: Note that the literature has sometimes used the term 'functions' to refer to tools or other actions.

[Figure 1 diagram: two panels. Model Training: training set, ML model, loss, numerical optimizer, update parameters. Agent Training: training set, LLM agent, history, AgentOptimizer, update functions.]

Figure 1. The comparison between model training and agent training. In model training, numerical optimizers (Ruder, 2016) such as SGD and Adam optimize the model weights according to the loss on the training set. In contrast, agent training iteratively updates the agents' functions according to the execution history using the proposed AgentOptimizer.

To equip LLM agents with useful functions, users first need to manually create functions that would be helpful for specific downstream tasks. This curation process may require many iterations and can therefore be time-consuming. Since LLMs are black boxes, researchers have found that LLMs unexpectedly fail to utilize certain kinds of functions (Qin et al., 2023). In response, researchers have tried to improve the underlying LLM's capability of using existing functions by finetuning the LLM with ground truth function calls (Qin et al., 2024; Zeng et al., 2023a). This finetuning process requires large computing resources. Worse, it limits which LLMs can be used, since many LLMs are proprietary.

Inspired by the fact that human-made tools become an extension of the human user (Botvinick & Cohen, 1998) and by how humans forge tools to best adapt to real-world tasks, rather than change the biological structure of the human to fit a static set of tools, we propose a new agent training paradigm that 'forges' the functions for the LLM agent to use so as to best adapt to the training tasks. It addresses both aforementioned challenges at the same time, since it does not require finetuning the underlying LLM and can start with an empty set of functions. While the LLM's parameters are never updated, its functions are optimized to maximize the agent's ability to solve tasks.

Specifically, we draw an analogy between traditional model training and our agent training (Figure 1). (1) Instead of updating model parameters, our training process updates the functions for LLM agents, viewing them as the agent's 'trainable parameters'. (2) Instead of a loss calculated over a training set, our training process uses the agent's execution history and performance on training tasks as the basis for updating the agent's functions. Since we operate in the space of functions, numeric optimizers such as SGD or Adam are not applicable. Instead, we develop AgentOptimizer, which leverages the LLM to update the agent's functions based on the execution history and the agent-generated / ground truth answers from the current epoch. In particular, the AgentOptimizer is instructed to progressively update the current function set by performing one of the predefined function manipulation actions (add, revise, and remove), rather than regenerating the whole function set at each optimization step.
With AgentOptimizer, the overall workflow of LLM agent training is as follows: given a training set and an empty function set, at each epoch, we first evaluate the agent system against the training set and collect the execution history as well as the agent-generated / ground truth answers; we then feed this information to the AgentOptimizer, which performs an optimization step to update the current function set. To avoid potential performance degradation caused by function updates, we introduce two simple strategies: roll-back and early-stop. The former withdraws the current function updates and rolls back to the previous state if the performance over the training set degrades, while the latter terminates the optimization process early when a certain number of consecutive optimization steps bring no performance boost over the training set.

We conducted extensive empirical evaluations on three distinct tasks: mathematical reasoning (MATH) (Hendrycks et al., 2021), tabular processing (TabMWP) (Lu et al., 2023), and general real-world problems (GAIA) (Mialon et al., 2023). We trained two typical agent systems, GPT-4+ agent (OpenAI, 2023) and ReAct agent (Yao et al., 2023), using the agent training method. For the MATH dataset, agent training resulted in a clear performance improvement in almost all cases. For the more realistic and complex tasks GAIA and TabMWP, agent training led to an average performance improvement of 6% and 8.5% for the GPT-4+ agent and the ReAct agent, respectively. We also perform ablation studies to demonstrate the efficacy of the different components of the agent training method. In addition to ablation, we analyze its extension to large-scale training and its transferability across different domains.
Our contributions are summarized below:

• (Paradigm) Inspired by the fundamental idea of model training in machine learning, we introduce a tailored paradigm for training LLM agents without modifying the LLMs, to build specialized LLM agents for a given application: we establish analogies between the learnable parameters inherent in traditional models and the operational functions of LLM agents, as well as between the models' loss functions and the agents' execution history over the training set, to craft a training regime that enhances the LLM agents' capabilities;

• (Methodology) To realize this paradigm, we propose the AgentOptimizer as an alternative to the numeric optimizers used in traditional model training. It is designed to operate in the space of the operational functions of LLM agents via the exceptional capabilities of LLMs. Based on the AgentOptimizer, we develop a training algorithm with two additional techniques (roll-back and early-stop) that streamline the training process;

• (Experiments) We conduct extensive experiments on three distinct tasks in training two typical agent systems, the GPT-4+ agent and the ReAct agent, to showcase the advantage of the proposed paradigm. We also provide ablation studies and detailed analysis to understand the behavior of the agent training.

2. Methodology

We begin by defining notation and setting up the research problem. We use $S_{\mathcal{F}}$ to denote any LLM agent system with function set $\mathcal{F} = \{f_1, \dots, f_n \mid \forall i \in [n],\ f_i \in \mathcal{V}\}$, where $f_i$ denotes the $i$-th function that can be used by agent system $S_{\mathcal{F}}$ in the function space $\mathcal{V}$. Throughout this work, we assume black-box LLMs such as ChatGPT (OpenAI, 2022) offered in the form of LLM-as-a-service. Given any task with training data $\mathcal{D}_{train}$ and test data $\mathcal{D}_{test}$, the goal of this study is to find a set of functions $\mathcal{F}^*$ that could improve the LLM agent's expected performance on unseen test data $\mathcal{D}_{test}$.
To put it more formally,

$$\mathcal{F}^* = \arg\min_{\mathcal{F} \subset \mathcal{V}} \mathbb{E}\left[\mathrm{Loss}(S_{\mathcal{F}}, \mathcal{D}_{test})\right], \tag{1}$$

where $\mathrm{Loss}(S_{\mathcal{F}}, \mathcal{D}_{test})$ measures the average loss of the agent $S_{\mathcal{F}}$ on test data $\mathcal{D}_{test}$. In the context of agent training throughout this paper, loss is defined as the rate of failed problem-solving attempts using agent systems. However, the test set and its distribution are not available. In traditional machine learning model training, it is common practice to assume that the distributions of the training and test data are the same or similar. While this assumption does not always hold, in machine learning practice training loss is used ubiquitously as the primary metric for parameter selection, as a compromise solution. In the same spirit, we also employ training data as a proxy for test data. We then optimize the functions of the language agent in function space by minimizing the loss on the training data. This approach allows us to approximate the performance of the language agent on unseen test data, i.e.,

$$\hat{\mathcal{F}} = \arg\min_{\mathcal{F} \subset \mathcal{V}} \mathrm{Loss}(S_{\mathcal{F}}, \mathcal{D}_{train}),$$

where $\hat{\mathcal{F}}$ is an approximation of $\mathcal{F}^*$.

2.1. The AgentOptimizer

To obtain $\hat{\mathcal{F}}$, it is critical to develop an optimizer tailored for agent training: it should be capable of updating the current functions according to the agent system's performance on the training set. In contrast to traditional model training, where the optimization is conducted over a numeric model parameter space and derivative-based optimizers can be applied with a loss of choice, agent training aims to search for the optimal set of functions for the agent system, and therefore existing numeric model optimizers are not applicable.

Considering these, we propose the AgentOptimizer, which leverages LLMs' exceptional capability of understanding and generating language to iteratively update the current set of functions as an optimizer.
Specifically, at each optimization step, we prompt the AgentOptimizer with the current status of the agent system and its execution history and performance on the training set, and instruct it to update the functions of the agent system. Intuitively speaking, this iterative optimization paradigm could lead to the identification of optimal functions in a large language space, analogous to iteratively performing gradient descent when training traditional machine learning models.

The input to the AgentOptimizer. We use $\mathcal{H}$ to denote the information used to prompt LLMs, which mainly comprises the following two parts: 1) the execution history of the agents in solving each problem of the training set, including the details of how the agent uses current functions, and 2) the final performance over the training data. In addition, we include the current set of functions associated with the agent system as input. This information is necessary for the AgentOptimizer to be aware of the current state of the agent system and accordingly suggest function updates.

Algorithm 1 Progressive Function Update (AgentOptimizer.step)
Input: Functions to be optimized $\mathcal{F}_0$, historical information $\mathcal{H}$
Output: Updated agent functions $\tilde{\mathcal{F}}$
Initialization: $\tilde{\mathcal{F}} \leftarrow \mathcal{F}_0$, $t \leftarrow 0$
while $t <$ MAXNUM do
    Action $\leftarrow$ LLM($\mathcal{F}_t$, $\mathcal{H}$)
    if Action = TERMINATE then
        break
    else
        // add/revise/remove function
        $\mathcal{F}_{t+1}$ = Action($\mathcal{F}_t$)
        $\tilde{\mathcal{F}} \leftarrow \mathcal{F}_{t+1}$
    $t \leftarrow t + 1$
Return $\tilde{\mathcal{F}}$

Progressive function update. Given the input information $\mathcal{H}$, a naive way of updating the current functions is to instruct the LLM to regenerate the whole function set to replace the existing one. However, such an aggressive optimization step is unwise, since it overwrites all existing functions, discards useful functions already established, and requires the LLM to generate multiple functions in a single shot. In contrast, we propose to progressively update the functions via predefined actions within each optimization step.
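The progressive update of Algorithm 1 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and not the authors' implementation: the `Action` dataclass, the dictionary representation of a function set (name mapped to source code), the `MAX_NUM` value, and in particular `scripted_proposer`, which stands in for the LLM call by replaying a fixed list of actions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

MAX_NUM = 3  # hypothetical cap on the number of actions per optimization step


@dataclass(frozen=True)
class Action:
    """One function-manipulation action proposed by the optimizer."""
    kind: str                   # "add", "revise", "remove", or "TERMINATE"
    name: Optional[str] = None  # name of the function the action targets
    code: Optional[str] = None  # source code, used by add and revise


def apply_action(functions: Dict[str, str], action: Action) -> Dict[str, str]:
    """Return a new function set with exactly one action applied."""
    updated = dict(functions)
    if action.kind == "add":
        if action.name in updated:
            raise ValueError(f"function already exists: {action.name}")
        updated[action.name] = action.code
    elif action.kind == "revise":
        if action.name not in updated:
            raise KeyError(action.name)
        updated[action.name] = action.code
    elif action.kind == "remove":
        if action.name not in updated:
            raise KeyError(action.name)
        del updated[action.name]
    else:
        raise ValueError(f"unknown action: {action.kind}")
    return updated


def optimizer_step(functions, history, propose: Callable, max_num: int = MAX_NUM):
    """Apply proposed actions one at a time until TERMINATE or the cap is hit."""
    current = dict(functions)
    for _ in range(max_num):
        action = propose(current, history)
        if action.kind == "TERMINATE":
            break
        current = apply_action(current, action)
    return current


def scripted_proposer(script):
    """Stand-in for the LLM: replays a fixed list of actions, then terminates."""
    pending = list(script)

    def propose(functions, history):
        return pending.pop(0) if pending else Action("TERMINATE")

    return propose


propose = scripted_proposer([
    Action("add", "gcd", "def gcd(a, b): ..."),
    Action("revise", "gcd",
           "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a"),
    Action("TERMINATE"),
])
result = optimizer_step({}, history=[], propose=propose)
print(sorted(result))
```

Because each action touches a single function, functions that already work survive the step untouched, which is the property the text contrasts with regenerating the whole set.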
In particular, we adopt four actions: 1) add_function: add one new function that may be useful; 2) revise_function: revise one existing function; 3) remove_function: remove one existing function; and 4) TERMINATE: terminate the function update process. Except for the TERMINATE action, all the actions require certain action arguments as input. For example, to perform the add_function action, the LLM needs to generate the name, description, code, etc. of the function as the action arguments so that, when executed, this action adds the new function to the function set. More details are presented in Appendix D.3. At each time step, the AgentOptimizer is prompted to choose one action until the maximum number of actions is reached or the AgentOptimizer chooses TERMINATE, and the resulting function set is returned. The overall procedure of progressive function update is shown in Algorithm 1.

2.2. Agent Training

With the AgentOptimizer, we then present the overall agent training procedure. In practice, the function updates suggested by the AgentOptimizer may cause performance degradation, since the LLM is not an oracle for updating the functions. Therefore, we propose two simple strategies for the training procedure to avoid performance degradation.

Roll-back. To avoid performance degradation after a function update, we employ a simple yet effective strategy named roll-back. Specifically, at each optimization step, if the latest function update leads to a performance drop on the training set, the AgentOptimizer withdraws this update. Moreover, considering that LLMs have been shown to recognize patterns from in-context demonstrations (Wei et al., 2022), we also record the failed function update and the corresponding performance in a list (Line 11 of Algorithm 2).
This list is used as part of the prompt for the next function generations from the AgentOptimizer. We expect that LLMs can use the historical failure information to generate better functions. The list is cleared once a performance improvement is achieved.

Early-stop. In extreme situations, the optimization process may get stuck, with roll-back repeating without any improvement in performance. In this case, it is wise to terminate the optimization process, and we employ an early-stop strategy: the optimization process is automatically terminated after $C$ consecutive optimization steps without performance improvement over the training set.

Overall agent training algorithm. The pseudocode of the agent training is shown in Algorithm 2. The agent training process takes as input the following parameters: training data $\mathcal{D}_{train}$, agent system $S$, maximum training epoch $E$, and the early-stop threshold $C$. After an initialization step, which sets the initial function list $\mathcal{F}_0$ and the initial historical information $\mathcal{H}_0$ to empty sets, the algorithm proceeds as follows: at each iteration $i$, the AgentOptimizer optimizes the function list $\mathcal{F}_i$ to obtain $\mathcal{F}_{i+1}$ based on historical information $\mathcal{H}_i$. The updated function set is then evaluated on the training set to obtain evaluation information for the next epoch of training. The training procedure terminates when the maximum epoch or the early-stop threshold is reached.

Algorithm 2 Agent Training
Input: Training data $\mathcal{D}_{train}$, agent system $S$, max training epoch $E$, early-stop threshold $C$
Output: Enhanced agent system $S_{\hat{\mathcal{F}}}$
1   Initialization: $i \leftarrow 0$, $r \leftarrow 0$, $\mathcal{H}_0 \leftarrow \emptyset$, $\mathcal{F}_0 \leftarrow \emptyset$
2   while $i < E$ do
3       if $\mathcal{H}_i \neq \emptyset$ then
4           $\mathcal{F}_{i+1}$ = AgentOptimizer.step($\mathcal{F}_i$, $\mathcal{H}_i$)
5       else
6           $\mathcal{F}_{i+1} \leftarrow \mathcal{F}_i$
7       $\mathcal{H}_{i+1}$ = Eval($S_{\mathcal{F}_{i+1}}$, $\mathcal{D}_{train}$)
8       if $\mathcal{H}_{i+1}$.loss $<$ $\mathcal{H}_i$.loss then
9           $\mathcal{H}_i$.failure_record $\leftarrow \emptyset$, $\hat{\mathcal{F}} \leftarrow \mathcal{F}_{i+1}$, $i \leftarrow i + 1$, $r \leftarrow 0$
10      else
11          $\mathcal{H}_i$.failure_record $\leftarrow (\mathcal{F}_{i+1}, \mathcal{H}_{i+1}$.loss$)$, $r \leftarrow r + 1$
12          if $r > C$ then // Early stop
13              break
14  Return $S_{\hat{\mathcal{F}}}$

3. Experiments

We conduct experiments to prove the superiority of the proposed method.
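As a companion to Algorithm 2 in Section 2.2, the following minimal Python sketch shows how roll-back and early-stop interact in a training loop. It is an illustration under stated assumptions, not the authors' code: `evaluate` and `optimizer_step` are hypothetical callables supplied by the caller, function sets are plain dictionaries, every optimization attempt counts toward the epoch budget, and the loop stops once `stalled >= early_stop`, which follows the prose description of early-stop rather than the exact counters of Algorithm 2.

```python
from typing import Callable, Dict, List, Tuple

FunctionSet = Dict[str, str]  # assumed representation: function name -> source code


def train_agent(
    evaluate: Callable[[FunctionSet], float],
    optimizer_step: Callable[[FunctionSet, float, List[Tuple[FunctionSet, float]]], FunctionSet],
    max_epochs: int,
    early_stop: int,
) -> Tuple[FunctionSet, float]:
    """Keep the best function set seen so far; roll back any update that does not help."""
    best_functions: FunctionSet = {}
    best_loss = evaluate(best_functions)  # baseline with an empty function set
    failures: List[Tuple[FunctionSet, float]] = []
    stalled = 0
    for _ in range(max_epochs):
        candidate = optimizer_step(best_functions, best_loss, failures)
        loss = evaluate(candidate)
        if loss < best_loss:
            best_functions, best_loss = candidate, loss
            failures.clear()  # failure record is reset after an improvement
            stalled = 0
        else:
            # Roll-back: the candidate is discarded and only recorded as a failure.
            failures.append((candidate, loss))
            stalled += 1
            if stalled >= early_stop:
                break
    return best_functions, best_loss


# Toy setup (hypothetical): the loss is the fraction of required helpers still missing.
REQUIRED = ["gcd", "lcm", "is_prime", "factorial"]


def toy_evaluate(functions: FunctionSet) -> float:
    missing = [name for name in REQUIRED if name not in functions]
    return len(missing) / len(REQUIRED)


proposals = iter([
    {"gcd": "..."},                                   # improves the loss
    {},                                               # degrades it, so it is rolled back
    {"gcd": "...", "lcm": "..."},                     # improves the loss
    {"gcd": "...", "lcm": "..."},                     # no gain
    {"gcd": "...", "lcm": "..."},                     # no gain, triggers early stop
    {"gcd": "...", "lcm": "...", "is_prime": "..."},  # never requested
])


def toy_optimizer(functions, loss, failures):
    """Stand-in for AgentOptimizer.step: replays scripted proposals."""
    return dict(next(proposals))


best, loss = train_agent(toy_evaluate, toy_optimizer, max_epochs=10, early_stop=2)
print(sorted(best), loss)
```

Passing the failure list to `optimizer_step` mirrors the idea that failed updates are shown to the LLM in later prompts; the toy optimizer simply ignores it.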
We begin by providing the experimental settings in Section 3.1. We then evaluate the agent training method on three datasets to verify its effectiveness in Section 3.2. Finally, we perform in-depth investigations in the last three sections to provide a better understanding of the proposed agent training method.

3.1. Experimental Setup

Evaluation tasks and metrics. To evaluate the effectiveness of the proposed agent training, we conducted experiments on three distinct tasks: Mathematical Reasoning, Tabular Processing, and General Real-World Tasks. Due to the high cost of OpenAI models, it is impractical to evaluate the method on the complete datasets; we therefore subsample data from these datasets for training and testing, following the same settings as previous works (Yuan et al., 2024; Wu et al., 2023b). The number of training examples is set according to the LLM's context limit.

(1) Mathematical reasoning: Following a setting similar to Yuan et al. (2024), we use a subset of the MATH dataset (Hendrycks et al., 2021) to evaluate the LLM agent's performance in addressing mathematical problems. For each data type (7 in total), we randomly choose 20 training examples and 80 test examples, and report the accuracy for each data type separately.

(2) Tabular processing: The TabMWP dataset (Lu et al., 2023) evaluates agents in processing structured data in tables, where each data sample contains one table and one question in natural language. We randomly sampled 100 test examples and 10 training examples. We measured the model performance using accuracy based on exact match.

(3) General real-world tasks: The GAIA dataset (Mialon et al., 2023) is dedicated to evaluating LLM agents in solving unambiguous real-world questions. From its public subset, we randomly select 10 questions for training and 100 questions for testing and report the correct rate, as suggested in the original paper.

Agent systems employed.
We employ the proposed agent training method to train two typical LLM agent systems:

(1) GPT-4+ agent: The GPT-4+ agent is essentially GPT-4 with function calling and a code interpreter. GPT-4 plays the role of making reasoning decisions, while the code interpreter executes the code and function calls suggested by GPT-4.

(2) ReAct agent: The ReAct agent (Yao et al., 2023) generates both reasoning traces and task-specific actions in an interleaved manner to solve tasks. In our evaluations, we optimized the ReAct agent to improve the actions that it may take at each action step after a reasoning process.

For both the GPT-4+ agent and the ReAct agent, we initialize them with Python as the initial function, which can execute the Python code suggested by the LLMs.

Models. For the more challenging tasks on the MATH and GAIA datasets, we used GPT-4-1106-preview for both the AgentOptimizer and the LLM agents. For the easier task TabMWP, we chose to use GPT-3.5-turbo-1106 to construct the LLM agents and GPT-4-1106-preview to construct the AgentOptimizer. This was done to better visualize the improvement brought by the agent training and did not sacrifice the conclusions obtained from the experiments.

Data type | GPT-4+ Agent w/o Agent Training (Train / Test) | GPT-4+ Agent w/ Agent Training (Train / Test) | ReAct Agent w/o Agent Training (Train / Test) | ReAct Agent w/ Agent Training (Train / Test)
----------|----------|----------|----------|----------
P.Algebra | 60.0% / 78.8% | 65.0% / 82.6% | 55.0% / 87.5% | 55.0% / 87.5%
Algebra | 55.0% / 66.3% | 65.0% / 65.0% | 55.0% / 83.8% | 60.0% / 82.5%
I.Algebra | 30.0% / 30.0% | 40.0% / 38.8% | 25.0% / 50.0% | 35.0% / 51.3%
Geometry | 30.0% / 40.0% | 40.0% / 42.5% | 5.0% / 53.8% | 15.0% / 58.8%
C.Probability | 65.0% / 72.5% | 65.0% / 76.3% | 45.0% / 73.8% | 50.0% / 78.8%
Precalculus | 5.0% / 32.5% | 10.0% / 35.0% | 5.0% / 53.8% | 10.0% / 62.5%
N.Theory | 70.0% / 56.3% | 80.0% / 67.5% | 75.0% / 68.8% | 75.0% / 72.5%

Table 1. Train/Test accuracy of GPT-4+/ReAct agents with/without agent training on the MATH dataset. We show the accuracy of each data type. We can observe that agent training leads to clearly better performance for both agent systems in most cases.

Method | GAIA Train | GAIA Test | TabMWP Train | TabMWP Test
----------|----------|----------|----------|----------
GPT-4+ Agent w/o Agent Training | 10.0% | 16.0% | 30.0% | 51.0%
GPT-4+ Agent w/ Agent Training | 30.0% | 23.0% | 66.7% | 56.0%
ReAct Agent w/o Agent Training | 20.0% | 12.0% | 63.3% | 59.0%
ReAct Agent w/ Agent Training | 40.0% | 18.0% | 73.3% | 70.0%

Table 2. Train/Test accuracy of GPT-4+/ReAct agents with/without agent training on the GAIA and TabMWP datasets. We can observe that agent training leads to better performance for both the GPT-4+ and ReAct agents on both datasets.

Method | Number Theory | Intermediate Algebra | Counting and Probability
----------|----------|----------|----------
No Agent Training | 56.3% | 30.0% | 72.5%
Agent Training w/o Roll-back & Early-stop | 63.8% | 36.3% | 72.5%
Agent Training w/o Progressive Function Update | 60.0% | 28.8% | 70.0%
Agent Training (Ours) | 67.5% | 38.8% | 76.3%

Table 3. We take the training of the GPT-4+ agent as an example and perform ablation to investigate the effect of different components of the agent training method on three data types of the MATH dataset.

3.2. Main Results

Mathematical reasoning. We first evaluated the performance of the GPT-4+ agent and the ReAct agent on the MATH dataset, as well as their performance after agent training, on the train/test splits, as shown in Table 1. Across seven data types, we observed that agent training led to better performance on the test set in most cases (11 out of 14). Additionally, training performance improved in almost all cases, while in the remaining cases it remained the same. Our results indicate that agent training can produce functions useful for unseen test tasks. Interestingly, for counting and probability problems, when training the GPT-4+ agent, the training performance remains the same while the test performance improves from 72.5% to 76.3%.
This suggests that in specific situations, even if the generated functions do not lead to performance improvement on the training set, they are helpful for the unseen test data.

Tabular processing and general real-world tasks. We then perform evaluations on the tabular processing task TabMWP (Lu et al., 2023) and the general real-world task set GAIA (Mialon et al., 2023), as shown in Table 2. Our observations indicate that agent training led to performance improvements for both agent systems. Since these two datasets are more realistic and complex than MATH, our results demonstrate that agent training can generate general and usable functions that increase agents' realistic task-solving capabilities, indicating that agent training is practically useful to some extent.

3.3. Ablation and Analysis

3.3.1. Ablation

We conducted ablation experiments to evaluate the effectiveness of two different components of the agent training method: (1) roll-back & early-stop, and (2) progressive function updating. To this end, we chose the three data types of MATH that resulted in the largest performance improvements when training the GPT-4+ agent: number theory, intermediate algebra, and counting and probability (footnote 2). Specifically, to investigate (1), we removed roll-back and early-stop and trained the agent until reaching the maximum epoch number; the agent state does not roll back when the training performance drops. For (2), we replaced the progressive function update with a one-step function generation, which directly prompted the GPT-4 in the AgentOptimizer to generate the functions at each epoch. We also show the original GPT-4+ agent performance without agent training.

As shown in Table 3, the performance dropped substantially if either one of them was removed.
Another interesting observation is that agent training without progressive function update performed even worse than the original GPT-4+ agent without agent training. This shows that prompting LLMs to generate functions is non-trivial. A bad function generation method may even have a negative effect. Therefore, a carefully designed function generation algorithm is desirable.

Footnote 2: Pre-Algebra was not selected due to its similarity to Intermediate Algebra, despite having the same performance improvements as counting and probability.

3.3.2. Learning Curve

[Figure 2 plots: accuracy (%) versus epoch, with train and test curves. Panels: (a) Positive - I.Algebra, (b) Negative - Algebra.]

Figure 2. On the MATH dataset, we visualize the changes in train/test performance across epochs when training a GPT-4+ agent. For analysis purposes, we select one data type where the training does improve the test performance (Positive) and another where it does not (Negative).

For analysis purposes, we visualize the learning curve when training the GPT-4+ agent to solve mathematical problems in Figure 2. According to the type of experimental result, i.e., whether test performance improves (positive) or not (negative), we choose two data types: the only data type that failed to improve the test performance (Algebra), and one data type similar to the failed one that did obtain a test performance improvement (Intermediate Algebra). Regarding the positive results on Intermediate Algebra in Figure 2a, we observe that once the optimization starts, the test performance in most epochs is better than at the start, and the test performance is in general positively correlated with the training performance. These observations provide evidence for the effectiveness of agent training. However, we also notice that the highest test performance is not at the last epoch, where the algorithm terminates.
To some extent, this reflects the GPT-4+ agent overfitting to the training set and suffering a test performance drop while the training performance remains the same. Regarding the negative results on Algebra in Figure 2b, we make a similar observation: the test performance drops while the training performance remains the same. We also found cases where the test performance remains the same while the training performance improves, indicating that sometimes the generated tool may not be general enough to be useful but does not harm the performance in solving tasks.

3.3.3. Domain Transferability

Figure 3. To investigate the domain transferability of the agent training method, we show test performances of three different data types of the MATH dataset after training with different domains.

We then investigate the generalization and transferability (Zhou et al., 2022) of the agent training method when the test data and training data are not sampled from the same domain. We use three data types in MATH: algebra, intermediate algebra, and geometry. We deliberately choose two data types that have similar distributions (algebra and intermediate algebra) and another data type that should have the largest semantic distance from algebra and intermediate algebra (geometry). We then train the GPT-4+ agent on each of these three data types, evaluate it on all train-test pairs, and show the test performance in Figure 3.

We observe that in most cases (2 out of 3), when the training and test data come from the same domain, agent training leads to the best test performance compared with training on other domains, which matches intuition. However, we observed an exception when testing on algebra: using intermediate algebra for training led to better performance than using algebra (67.5% vs. 65.0%).
This could be because intermediate algebra shares a similar distribution with algebra, and the harder problems in intermediate algebra may make it easier to learn basic, general functions that also work for basic problems. Another observation is that using geometry as the training domain leads to the worst test performance on algebra and intermediate algebra. This is because its distribution is far from both of the other two data types.

Method | MATH - Train | MATH - Test | TabMWP - Train | TabMWP - Test
----------|----------|----------|----------|----------
CREATOR (Qian et al., 2023) | N/A | 75.0% | N/A | 30.0%
CRAFT (Yuan et al., 2024) | 50.0% | 73.8% | 38.0% | 38.5%
GPT-4+ Agent w/ Agent Training | 60.0% | 66.25% | 66.6% | 56.0%
ReAct Agent w/ Agent Training | 60.0% | 77.5% | 73.3% | 70.0%

Table 4. The comparisons between the trained agent systems and two typical tool-creation methods on the MATH and TabMWP datasets. CREATOR doesn't involve a training stage, so its training performance is unavailable. The results indicate that both the GPT-4+ agent and the ReAct agent trained with our method outperform the tool-creation methods in most cases.

3.3.4. Extend to Large Scale Training Data - Batch Training

The proposed agent training method has one obvious bottleneck: the training data size is limited by the context limit of the LLM-backed optimizer. This limitation prevents the full utilization of large-scale training data. A similar bottleneck occurs in traditional model training, where the constraint comes from the GPU/CPU memory. To resolve this problem, traditional machine learning uses the concept of batch training (Masters & Luschi, 2018). This method divides the dataset into smaller subsets (batches) and trains the model iteratively on each batch to overcome the memory limitation.

Building on this practice, we propose a straightforward batch training method for our agent training flow.
Specifically, at each training iteration we randomly sample from the large-scale training data one batch that fits within the LLM context limit. Other procedures remain the same. We evaluate the GPT-4+ agent system on the intermediate algebra subset of the MATH dataset with 100 problems for training and 80 problems for testing, where the test data is the same as in previous sections. We tried four different batch sizes (5, 10, 15, and 20) and set the number of epochs to 40, 20, 13, and 10, respectively, so that the number of examples used for training is approximately the same. We show the final test performance in Figure 4. The results show that larger training data does not lead to test performance improvement in most cases, and only one case achieved a minor improvement. Even when the batch size is set to 20, which is the same as the training data size in Figure 4, the test performance drops by 7.8%. This drop may be due to the frequent changing of training examples at each epoch, which prevents the AgentOptimizer from generating stable and effective functions.

[Figure 4: accuracy (%) versus batch size (5, 10, 15, 20) for regular training and batch training.]

Figure 4. Comparison between the "regular training" of our method and the extended "batch training". Batch training with an enlarged training set doesn't necessarily lead to better performance across batch settings.

3.4. Agent Training vs. Tool-Creation

Tool-creation algorithms (Cai et al., 2024; Qian et al., 2023) prompt LLMs to create tools that are tailored to specific tasks. Since the tool-creation procedure is a one-time process that does not include subsequent optimization mechanisms based on training performance, the design philosophy emphasizes that the created tools can be used (without error), but not that they are used effectively (to improve performance).
In this section, we compare the trained GPT-4+/ReAct agents with two recent tool-creation methods, CREATOR (Qian et al., 2023) and CRAFT (Yuan et al., 2024), on the MATH and TabMWP datasets. For TabMWP, we follow the same experimental setting as Section 3.2. We choose these two datasets because the baseline code for them is available, which allows a rigorous comparison. To cover all data types of the MATH dataset, we randomly sample 20 examples for training and 80 examples for testing from all data types. As shown in Table 4, after agent training, both the GPT-4+ agent and the ReAct agent perform better than the tool-creation methods in most cases, indicating that agent training is a promising paradigm for distilling functions/tools from advanced large language models.

3.5. Analysis of the Learned Functions

We conducted an in-depth analysis of the generated functions. First, we present a list of frequently used functions generated for all datasets in Table 5. Then, we show the number of successful function calls at the second and last epochs (the functions may not be the same) during agent training in Table 6. We also report the widely adopted cyclomatic complexity (McCabe, 1994) of the generated functions. We calculate the complexity using the Lizard Python library and present the average complexity of tools for each task when optimizing both the GPT-4+ agent and the ReAct agent.

Our observations indicate that the number of successful function calls improves significantly on most datasets, indicating that the optimized functions are becoming more effective compared to the initial list. Regarding function complexity, a good function should have a complexity of no more than 10. A less complex function is less prone to trigger bugs.
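To make the complexity metric concrete, the sketch below approximates cyclomatic complexity as one plus the number of decision points in a piece of Python source. It is a simplified standard-library approximation and not the Lizard implementation used for the reported numbers; the helper name `cyclomatic_complexity` and the choice of which syntax nodes count as decision points are assumptions made for illustration. The sample function is the `evaluate_expression` example from the AgentOptimizer prompt in Appendix D.1.

```python
import ast

# Syntax nodes that each open one extra execution path (an approximation of McCabe's metric).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate complexity: 1 + number of decision points found in the source."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two short-circuit paths.
            complexity += len(node.values) - 1
    return complexity


# The source is only parsed, never executed, so sympy does not need to be installed.
SAMPLE = '''
def evaluate_expression(expression):
    try:
        result = sympify(expression)
        if result.is_number:
            result = float(result)
        else:
            result = str(result)
        return result
    except SympifyError as e:
        return str(e)
'''

score = cyclomatic_complexity(SAMPLE)
print(score, "within threshold" if score <= 10 else "too complex")
```

Under this counting rule the sample has one `if` and one `except` handler, so it scores 3, well under the threshold of 10 mentioned above.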
We observed that the created functions for the three tasks exhibit relatively low complexity, indicating that the functions are reliable.

| Tasks | Top Used Functions |
| MATH | evaluate_expression, calculate_polynomial_roots, solve_algebraic_equation, calculate_circumference, calculate_polynomial_roots, solve_algebraic_equation, calculate_complex_magnitude |
| GAIA | scrape_wikipedia_table, extract_pdf_text, perform_web_search, fetch_web_content |
| TabMWP | calculate_total_cost, analyze_stem_leaf_plot, calculate_basic_statistics, perform_table_calculation, perform_arithmetic_operations, statistical_analysis |

Table 5. For illustration purposes, we list frequently used (during testing) functions generated by AgentOptimizer in different tasks.

| Metrics | MATH | GAIA | TabMWP |
| Second Epoch | 11 | 8 | 19 |
| Last Epoch | 23 | 10 | 41 |
| Avg. Complexity | 1.2 | 3.7 | 5.0 |

Table 6. The number of successful function calls in the second epoch and the last epoch (functions may not be the same) of the agent training. We also show the cyclomatic complexity of the generated functions in the last row.

4. Related Work

There has been a growing volume of research focusing on employing LLMs to construct autonomous agents for reasoning, planning, and adapting to new observations in real-world tasks (Xi et al., 2023; Wang et al., 2023b; Hong et al., 2024; Yao et al., 2023; Wu et al., 2023a; Li et al., 2023; BabyAGI, 2023; Park et al., 2023). In such LLM agents, the functions/tools/actions that the LLM can leverage to interact with the environment or solve sub-tasks play a critical role, yet are often manually crafted (Yao et al., 2023). Recent works have explored automatic tool creation (Cai et al., 2024; Qian et al., 2023; Yuan et al., 2024).
Specifically, Tool-maker (Cai et al., 2024) proposes to create tools through three demonstrations and then validates the created tool using three validation examples; CREATOR (Qian et al., 2023) proposes to create tools exclusively for each query; and CRAFT (Yuan et al., 2024) first creates customizable tools tailored for specific problems and then retrieves relevant tools for user queries at inference time. In this work, we propose a conceptual framework that treats functions as learnable parameters, analogous to those in traditional AI models, and develop a generic agent training paradigm to improve functions iteratively across epochs. Different from prior works, our AgentOptimizer updates the function set based on the LLM agent's execution history over the whole training set, rather than making functions according to individual query-answer pair(s); this approach not only takes the specific LLM agent's behavior into consideration for function creation (in contrast to looking at the query-answer pair only), but also tends to produce generic functions that work for the whole training set. By formulating an iterative optimization process, the AgentOptimizer can continuously update the functions based on the execution history of each epoch during training in a trial-and-error manner.

Sharing a similar goal of improving LLM agents, another line of work aims to enhance agent capability by modifying the underlying LLMs (Patil et al., 2023; Qin et al., 2024; Zeng et al., 2023a). For instance, ToolLLM (Qin et al., 2024) collects a massive number of APIs to construct instruction data for finetuning LLaMA (Touvron et al., 2023), obtaining a new LLM optimized for using the collected APIs; AgentTuning (Zeng et al., 2023a) proposes to enhance agent abilities through a hybrid instruction-tuning strategy that tunes the LLM's parameters.
In contrast, we explore a new paradigm of training LLM agents without modifying the underlying LLM, which is particularly useful when the LLMs are online services that are not available for tuning, like GPT-4, or when tuning and maintaining a new LLM is expensive and time-consuming. Besides, in this work, we leverage the exceptional capability of the LLM to build an optimizer (the AgentOptimizer) for training the agents, mimicking the numeric optimizers in model training such as SGD and Adam. The idea of using an LLM as an optimizer has been shown to be effective by prior work (Yang et al., 2024; Zhang et al., 2023). While these prior works mainly leverage the LLM as an optimizer for problems like prompt optimization (Yang et al., 2024) and hyperparameter optimization (Zhang et al., 2023), our AgentOptimizer is designed specifically for the novel agent training paradigm and progressively updates the LLM agent's functions via multiple add, revise, and/or remove actions within each optimization step.

5. Conclusion

In this study, we propose a novel approach to train specialized LLM agents. The core idea is to draw an analogy between LLM agent training and traditional model training, where the learnable parameters in traditional models correspond to the operational functions of LLM agents, and the models' loss functions correspond to the historical performance metrics of the agents. Leveraging the impressive optimization capability of LLMs, we enhance the agents by updating the agent functions through the proposed AgentOptimizer. We evaluate the proposed method on multiple distinct tasks in training two typical agent systems and demonstrate that agent training yields clear performance improvements.

Impact Statements

This paper presents research aimed at advancing the field of language agents.
Our work has several potential societal consequences, both positive and negative, that we feel need to be highlighted. On the positive side, language agents could be the core of many real-life applications (Hosseini et al., 2023; Cai et al., 2019), and our work could greatly benefit these applications by enhancing the agents. For instance, a language agent could be the core of an industrial robot (Zeng et al., 2023b), and our work could potentially enhance working efficiency. On the negative side, the development of language agents raises the possibility of misuse of enhanced agents, such as using language agents to generate misinformation or harmful content (Navigli et al., 2023) on social media for illegal purposes. Another concern is allowing language models to make changes in external environments (Tian et al., 2023). For instance, allowing language models to execute code on a computer may lead to unintended consequences (Liu et al., 2024).

References

BabyAGI. Github - babyagi. https://github.com/yoheinakajima/babyagi, 2023.
Botvinick, M. and Cohen, J. Rubber hands 'feel' touch that eyes see. Nature, 391(6669):756–756, 1998.
Cai, C. J., Winter, S., Steiner, D., Wilcox, L., and Terry, M. "Hello AI": uncovering the onboarding needs of medical practitioners for human-AI collaborative decision-making. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):1–24, 2019.
Cai, T., Wang, X., Ma, T., Chen, X., and Zhou, D. Large language models as tool makers. ICLR, 2024.
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. In NeurIPS, 2021.
Hoeffding, W. Probability inequalities for sums of bounded random variables. In The collected works of Wassily Hoeffding, pp. 409–426. Springer, 1994.
Hong, S., Zheng, X., Chen, J., Cheng, Y., Wang, J., Zhang, C., Wang, Z., Yau, S. K. S., Lin, Z., Zhou, L., et al. Metagpt: Meta programming for multi-agent collaborative framework. 2024.
Hosseini, M., Gao, C. A., Liebovitz, D. M., Carvalho, A. M., Ahmad, F. S., Luo, Y., MacDonald, N., Holmes, K. L., and Kho, A. An exploratory survey about using chatgpt in education, healthcare, and research. medRxiv, pp. 2023–03, 2023.
Humphreys, P. C., Raposo, D., Pohlen, T., Thornton, G., Chhaparia, R., Muldal, A., Abramson, J., Georgiev, P., Santoro, A., and Lillicrap, T. A data-driven approach for learning to control computers. In ICML, pp. 9466–9482. PMLR, 2022.
Jayaseelan, N. Llama 2: The new open source language model.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
Li, G., Hammoud, H. A. A. K., Itani, H., Khizbullin, D., and Ghanem, B. Camel: Communicative agents for "mind" exploration of large language model society. In NeurIPS, 2023.
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. JMLR, 2018.
Liu, M., Wang, J., Lin, T., Ma, Q., Fang, Z., and Wu, Y. An empirical study of the code generation of safety-critical software using llms. Applied Sciences, 2024.
Lu, P., Qiu, L., Chang, K.-W., Wu, Y. N., Zhu, S.-C., Rajpurohit, T., Clark, P., and Kalyan, A. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In ICLR, 2023.
Masters, D. and Luschi, C. Revisiting small batch training for deep neural networks. arXiv preprint arXiv:1804.07612, 2018.
McCabe, T. J. Software complexity, crosstalk. Journal of Defense Software Engineering, 1994.
Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y., and Scialom, T. Gaia: a benchmark for general ai assistants. arXiv preprint arXiv:2311.12983, 2023.
Navigli, R., Conia, S., and Ross, B. Biases in large language models: Origins, inventory and discussion. ACM Journal of Data and Information Quality, 2023.
OpenAI. Introducing ChatGPT, 2022. URL https://openai.com/blog/chatgpt. (Accessed on Jun 18, 2023).
OpenAI. Gpt-4 technical report, 2023.
Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In UIST, pp. 1–22, 2023.
Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334, 2023.
Peng, B., Galley, M., He, P., Cheng, H., Xie, Y., Hu, Y., Huang, Q., Liden, L., Yu, Z., Chen, W., et al. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813, 2023.
Qian, C., Han, C., Fung, Y., Qin, Y., Liu, Z., and Ji, H. Creator: Tool creation for disentangling abstract and concrete reasoning of large language models. In EMNLP, pp. 6922–6939, 2023.
Qin, Y., Hu, S., Lin, Y., Chen, W., Ding, N., Cui, G., Zeng, Z., Huang, Y., Xiao, C., Han, C., et al. Tool learning with foundation models. arXiv preprint arXiv:2304.08354, 2023.
Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. ICLR, 2024.
Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
Ruder, S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
Shi, T., Karpathy, A., Fan, L., Hernandez, J., and Liang, P. World of bits: An open-domain platform for web-based agents. In ICML. PMLR, 2017.
Shridhar, M., Yuan, X., Côté, M.-A., Bisk, Y., Trischler, A., and Hausknecht, M. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In ICLR, 2021.
Tian, Y., Yang, X., Zhang, J., Dong, Y., and Su, H. Evil geniuses: Delving into the safety of llm-based agents. arXiv preprint arXiv:2311.11855, 2023.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023a.
Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., et al. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432, 2023b.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. 35:24824–24837, 2022.
Wu, Q., Wang, C., and Huang, S. Frugal optimization for cost-related hyperparameters. In AAAI, 2021.
Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., and Wang, C. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023a.
Wu, Y., Jia, F., Zhang, S., Wu, Q., Li, H., Zhu, E., Wang, Y., Lee, Y. T., Peng, R., and Wang, C. An empirical study on challenging math problem solving with gpt-4. arXiv preprint arXiv:2306.01337, 2023b.
Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023.
Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q. V., Zhou, D., and Chen, X. Large language models as optimizers. 2024.
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models. 2023.
Yuan, L., Chen, Y., Wang, X., Fung, Y. R., Peng, H., and Ji, H. Craft: Customizing llms by creating and retrieving from specialized toolsets. ICLR, 2024.
Zeng, A., Liu, M., Lu, R., Wang, B., Liu, X., Dong, Y., and Tang, J. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823, 2023a.
Zeng, F., Gan, W., Wang, Y., Liu, N., and Yu, P. S. Large language models for robotics: A survey. arXiv preprint arXiv:2311.07226, 2023b.
Zhang, M. R., Desai, N., Bae, J., Lorraine, J., and Ba, J. Using large language models for hyperparameter optimization. arXiv e-prints, pp. arXiv–2312, 2023.
Zhang, S., Jia, F., Wang, C., and Wu, Q. Targeted hyperparameter optimization with lexicographic preferences over multiple objectives. In ICLR, 2022.
Zhou, K., Liu, Z., Qiao, Y., Xiang, T., and Loy, C. C. Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

Appendix

A Supplementary Theoretical Analysis
B Supplementary Experimental Results
B.1 Evaluations on Other Language Models
B.2 More Experimental Results after Removing Roll-back & Early-stop
C Supplementary Analysis of Agent Training versus Model Training
D Implementation Details
D.1 Prompt Design for AgentOptimizer
D.2 Prompt Design for ReAct
D.3 Function calls of LLM backed AgentOptimizer
E Generated Functions
E.1 Trained Functions in MATH
E.2 Trained Functions in GAIA
E.3 Trained Functions in TabMWP
F Case Study
F.1 Case Study for MATH
F.2 Case Study for GAIA
F.3 Case Study for TabMWP
G Hyperparameters Settings
H Limitations

A. Supplementary Theoretical Analysis

In this section, we attempt to provide a theoretical analysis of the proposed agent training method. The objective is to provide an upper bound on the expected test loss difference between the trained agent function and the global optimal function. As an initial attempt, our analysis of the generalization bound of agent training requires the following two strong assumptions. We leave the relaxation of these two assumptions to future work.

Assumption A.1. In the agent training scenario, the training data $D_{\mathrm{train}}$ and test data $D_{\mathrm{test}}$ come from the same distribution $P$, i.e., $D_{\mathrm{train}}, D_{\mathrm{test}} \in P$.

In classical machine learning model training, it is common practice to assume that the distributions of the training and test data are the same or similar, and then to use the training loss as the primary metric for parameter selection.

Assumption A.2. Given training data $D_{\mathrm{train}}$, the proposed agent training method can identify the function set $\hat{F}$ that achieves the smallest loss on $D_{\mathrm{train}}$ after agent training:

$$\hat{F} = \arg\min_{F \subset V} \mathrm{Loss}(S_F, D_{\mathrm{train}}). \tag{2}$$

Lemma A.3. Under Assumption A.1, for any agent system $S_F$ with function set $F$, with probability at least $1-\delta$ ($\delta \in (0,1)$), we have:

$$\left| \mathrm{Loss}(S_F, D_{\mathrm{train}}) - \mathbb{E}[\mathrm{Loss}(S_F, D_{\mathrm{test}})] \right| \le \sqrt{\frac{\beta \ln(2/\delta)}{2 |D_{\mathrm{train}}|}},$$

in which $\beta$ represents the distance between the largest and the lowest loss value on any data instance. Specifically, for any data instance $d \in P$, $l_{S_F}(d) < \beta$, where $l_{S_F}$ denotes the loss function, which measures the loss of each data instance for agent system $S_F$.

Proof of Lemma A.3. For any training data set $D_{\mathrm{train}}$ and potential test data set $D_{\mathrm{test}}$ from the data distribution $P$, we have

$$\left| \mathrm{Loss}(S_F, D_{\mathrm{train}}) - \mathbb{E}[\mathrm{Loss}(S_F, D_{\mathrm{test}})] \right| = \left| \frac{1}{|D_{\mathrm{train}}|} \sum_{i=1}^{|D_{\mathrm{train}}|} l_{S_F}(d_i) - \mathbb{E}_{d \sim P}[l_{S_F}(d)] \right|. \tag{3}$$

According to Hoeffding's inequality (Hoeffding, 1994), we have:

$$\mathbb{P}\left( \left| \mathrm{Loss}(S_F, D_{\mathrm{train}}) - \mathbb{E}[\mathrm{Loss}(S_F, D_{\mathrm{test}})] \right| > \epsilon \right) = \mathbb{P}\left( \left| \frac{1}{|D_{\mathrm{train}}|} \sum_{i=1}^{|D_{\mathrm{train}}|} l_{S_F}(d_i) - \mathbb{E}_{d \sim P}[l_{S_F}(d)] \right| > \epsilon \right) \le 2 \exp\left( \frac{-2 |D_{\mathrm{train}}| \epsilon^2}{\frac{1}{|D_{\mathrm{train}}|} \sum_{i=1}^{|D_{\mathrm{train}}|} \beta} \right) = 2 \exp\left( \frac{-2 |D_{\mathrm{train}}| \epsilon^2}{\beta} \right). \tag{4}$$

Then with probability at least $1 - 2 \exp\left( \frac{-2 |D_{\mathrm{train}}| \epsilon^2}{\beta} \right)$, we have:

$$\left| \mathrm{Loss}(S_F, D_{\mathrm{train}}) - \mathbb{E}[\mathrm{Loss}(S_F, D_{\mathrm{test}})] \right| \le \epsilon. \tag{5}$$

Taking $\delta = 2 \exp\left( \frac{-2 |D_{\mathrm{train}}| \epsilon^2}{\beta} \right)$, we have:

$$\epsilon = \sqrt{\frac{\beta \ln(2/\delta)}{2 |D_{\mathrm{train}}|}}. \tag{6}$$

Combining Equation 5 and Equation 6, with probability at least $1-\delta$, we have:

$$\left| \mathrm{Loss}(S_F, D_{\mathrm{train}}) - \mathbb{E}[\mathrm{Loss}(S_F, D_{\mathrm{test}})] \right| \le \sqrt{\frac{\beta \ln(2/\delta)}{2 |D_{\mathrm{train}}|}}, \tag{7}$$

which completes the proof.

Theorem A.4. Under Assumption A.1 and Assumption A.2, with probability at least $1-\delta$ ($\delta \in (0,1)$), the trained agent system $S_{\hat{F}}$ with trained function list $\hat{F}$ satisfies:

$$\mathbb{E}[\mathrm{Loss}(S_{\hat{F}}, D_{\mathrm{test}})] - \mathbb{E}[\mathrm{Loss}(S_{F^*}, D_{\mathrm{test}})] \le 2 \sqrt{\frac{\beta \ln(2/\delta)}{2 |D_{\mathrm{train}}|}}, \tag{8}$$

where $F^*$ denotes the optimal function in the function space $V$, i.e., $F^* = \arg\min_{F \subset V} \mathbb{E}[\mathrm{Loss}(S_F, D_{\mathrm{test}})]$.

Proof of Theorem A.4. Substituting $\hat{F}$ into Lemma A.3, with probability at least $1-\delta$ ($\delta \in (0,1)$), we have:

$$\left| \mathrm{Loss}(S_{\hat{F}}, D_{\mathrm{train}}) - \mathbb{E}[\mathrm{Loss}(S_{\hat{F}}, D_{\mathrm{test}})] \right| \le \sqrt{\frac{\beta \ln(2/\delta)}{2 |D_{\mathrm{train}}|}}. \tag{9}$$

Since $\hat{F} = \arg\min_{F \subset V} \mathrm{Loss}(S_F, D_{\mathrm{train}})$, we have:

$$\mathrm{Loss}(S_{\hat{F}}, D_{\mathrm{train}}) \le \mathrm{Loss}(S_{F^*}, D_{\mathrm{train}}). \tag{10}$$

Combining Equation 9 and Equation 10, we have:

$$\mathbb{E}[\mathrm{Loss}(S_{\hat{F}}, D_{\mathrm{test}})] \le \mathrm{Loss}(S_{F^*}, D_{\mathrm{train}}) + \sqrt{\frac{\beta \ln(2/\delta)}{2 |D_{\mathrm{train}}|}}. \tag{11}$$

Substituting $S_{F^*}$ into Lemma A.3, we have:

$$\left| \mathrm{Loss}(S_{F^*}, D_{\mathrm{train}}) - \mathbb{E}[\mathrm{Loss}(S_{F^*}, D_{\mathrm{test}})] \right| \le \sqrt{\frac{\beta \ln(2/\delta)}{2 |D_{\mathrm{train}}|}}. \tag{12}$$

Combining Equation 11 and Equation 12, with probability at least $1-\delta$, we have:

$$\mathbb{E}[\mathrm{Loss}(S_{\hat{F}}, D_{\mathrm{test}})] - \mathbb{E}[\mathrm{Loss}(S_{F^*}, D_{\mathrm{test}})] \le 2 \sqrt{\frac{\beta \ln(2/\delta)}{2 |D_{\mathrm{train}}|}}, \tag{13}$$

which completes the proof.

Theorem A.4 provides an upper bound on the expected test loss difference between the trained agent function $\hat{F}$ and the global optimal function $F^*$. We observe from Equation 13 that a larger training set leads to a narrower upper bound. However, the training set is limited by the LLM's context limit. This limitation inspires us to investigate a better way of extending the training dataset, rather than relying on the straightforward batch training approach described in Section 3.3.4.

B. Supplementary Experimental Results

B.1. Evaluations on Other Language Models

| | Code-Llama-34B | Mixtral-8x7B | GPT-3.5-turbo-1106 |
| Before Training | 7.5% | 23.8% | 25.0% |
| After Training | 11.3% | 28.8% | 28.8% |

Table 7. The performance of agents backed by other language models, evaluated before and after agent training on the MATH dataset. The results indicate that agent training still leads to significant performance improvements.

In this section, we conducted experiments to evaluate the performance of agents backed by various language models after agent training, including GPT-3.5-turbo-1106 (OpenAI, 2022) and the open-source models Mixtral-8x7B (Jiang et al., 2024; 2023) and Code-Llama-34B (Roziere et al., 2023; Jayaseelan). The LLM that backed the AgentOptimizer was GPT-4-1106-preview. We performed experiments on the MATH dataset using the same settings as described in Section 3.4.
The results are presented in Table 7. Our findings indicate that agent training leads to better performance on all three models, suggesting that agent training is agnostic to the LLM that backs the agent.

B.2. More Experimental Results after Removing Roll-back & Early-stop

[Figure 5: (a) Training performance w/o roll-back & early-stop: accuracy (%) over epochs 0 to 9 for Number Theory, Intermediate Algebra, and Counting and Probability. (b) Test performance w/o roll-back & early-stop: accuracy (%) by problem type (I.Algebra, N.Theory, C.Probability); GPT-4+ Agent w/o Roll-back/Early-exit: 36.3%, 63.8%, 72.5%; GPT-4+ Agent w/ Roll-back/Early-exit: 38.8%, 67.5%, 76.3%.]

Figure 5. The learning curve of the training performance and the final test performance of GPT-4+ agents after removing the roll-back and early-stop mechanisms.

We present additional experimental results in Figure 5 after removing roll-back and early-stop. Specifically, we illustrate the training performance curve in Figure 5a. We observe that the training performance fluctuates with the number of training epochs, indicating that the learned functions are not stable and do not necessarily improve training performance at each epoch. This unstable function optimization leads to a drop in test performance, as shown in Figure 5b.

C. Supplementary Analysis of Agent Training versus Model Training

| | Optimizer | Target | Human Interpretable | Access to Model/LLM Weights |
| Model Training | SGD etc. | Model Weights | ✗ | ✓ |
| Agent Training | LLMs | Functions | ✓ | ✗ |

Table 8. Comparing Model Training and Agent Training: Model training relies on an optimizer such as SGD. It is not human-interpretable and requires access to model parameters. In contrast, agent training uses LLMs as the optimizer, which is interpretable through natural language and the generated functions. Furthermore, agent training does not require access to model parameters.
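To make the contrast with gradient-based training concrete, the sketch below shows one possible shape of an agent training loop in which an optimizer proposes a new function set each epoch. It is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes that roll-back means discarding a proposed function set when training performance drops, and that early-stop means halting after a fixed number of consecutive epochs without improvement. The names `train_agent`, `propose_update`, `evaluate`, and `patience`, as well as the toy scoring rule, are hypothetical; in a real system `propose_update` would call the LLM-backed optimizer and `evaluate` would run the agent on the training set.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in: a function set is represented by the names of its functions.
FunctionSet = Tuple[str, ...]


def train_agent(
    initial: FunctionSet,
    propose_update: Callable[[FunctionSet, float], FunctionSet],
    evaluate: Callable[[FunctionSet], float],
    max_epochs: int = 10,
    patience: int = 3,
) -> Tuple[FunctionSet, List[float]]:
    """Keep the best function set seen so far; stop after `patience` stale epochs."""
    best, best_score = initial, evaluate(initial)
    history = [best_score]
    stale = 0
    for _ in range(max_epochs):
        candidate = propose_update(best, best_score)
        score = evaluate(candidate)
        if score < best_score:
            stale += 1  # roll-back (assumed): discard the candidate, keep `best`
        else:
            stale = 0 if score > best_score else stale + 1
            best, best_score = candidate, score
        history.append(best_score)
        if stale >= patience:  # early-stop (assumed): no recent improvement
            break
    return best, history


# Toy setup: useful functions add 20 points, any other function costs 30.
GOOD = {"evaluate_expression", "solve_equation"}


def toy_evaluate(functions: FunctionSet) -> float:
    good = sum(1 for name in functions if name in GOOD)
    return 20 * good - 30 * (len(functions) - good)


proposals = iter([
    ("evaluate_expression",),
    ("evaluate_expression", "buggy_helper"),
    ("evaluate_expression", "solve_equation"),
    ("evaluate_expression", "solve_equation"),
    ("evaluate_expression", "solve_equation"),
])

best, history = train_agent(
    (), lambda functions, score: next(proposals), toy_evaluate, max_epochs=5, patience=2
)
print(best, history)
```

In this toy run the second proposal lowers the score and is discarded, so the recorded training score never decreases, unlike the fluctuating curve reported in Figure 5a for training without roll-back.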
Table 8 summarizes the differences between these two training paradigms. Although both paradigms share a similar workflow of improving from training data by leveraging their optimizers, they have different features. Specifically, the optimizers in traditional model training are gradient descent optimization algorithms, which update the model parameters in the opposite direction of the gradient of the loss function. However, the complex parameter-updating logic is not interpretable to humans, and model training requires accessible parameters. In contrast, the optimizers in agent training are LLMs, which are prompted to update the agent functions using natural language at each optimization step. The optimization is interpretable to humans (functions and natural language), and it doesn't require accessible parameters.

D. Implementation Details

D.1. Prompt Design for AgentOptimizer

You are a function optimizer. Your task is to maintain a list of functions for the assistant according to the existing function set and conversation history that happens between the assistant and the user. You can perform one of the following four actions to manipulate the function set using the functions you have:
1. Revise one existing function (using revise_function).
2. Remove one existing function (using remove_function).
3. Add one new function (using add_function).
4. Directly return "TERMINATE" to me if no more actions are needed for the current function set.

Below are the principles that you need to follow for taking these four actions.

(1) Revise one existing function:
1. Pay more attention to the failed tasks and corresponding error information, and optimize the function used in these tasks according to the conversation history if needed.
2. A failed function call can occur due to incorrect input arguments (missing arguments) or an incorrect function code implementation. You should focus more on the function code implementation and make it easy to get success function call.
3. Do not revise the function that you think works well and plays a critical role in solving the problems according to the conversation history. Only making revisions if needed.
4. Sometimes, a NameError may occur. To fix this error, you can either revise the name of the function in the code implementation or revise the name of the function call to make these two names consistent.

(2) Remove one existing function:
1. Only remove the function that you think is not needed anymore in future tasks.

(3) Add one new function:
1. The added function should be general enough to be used in future tasks. For instance, if you encounter a problem that this function can solve, or one step of it, you can use the generated function directly instead of starting from scratch.
2. The added new function should solve a higher-level question that encompasses the original query and extend the code's functionality to make it more versatile and widely applicable.
3. Replace specific strings or variable names with general variables to enhance the tool's applicability to various queries. All names used inside the function should be passed in as arguments.

Below is an example of a function that potentially deserves to be added, which can be used to solve a higher-level question:

    {{
        "name": "evaluate_expression",
        "description": "Evaluate arithmetic or mathematical expressions provided as strings.",
        "arguments": {{
            "expression": {{
                "type": "string",
                "description": "The mathematical expression to evaluate."
            }}
        }},
        "packages": "sympy",
        "code": "from sympy import sympify, SympifyError\n\ndef evaluate_expression(expression):\n    try:\n        result = sympify(expression)\n        if result.is_number:\n            result = float(result)\n        else:\n            result = str(result)\n        return result\n    except SympifyError as e:\n        return str(e)"
    }}

(4) Directly return "TERMINATE":
If you think there is no need to perform any other actions for the current function set since the current list is optimal more actions will harm the performance in future tasks. Please directly reply to me with "TERMINATE".

One function signature includes the following five elements:
1. Function name
2. Function description
3. JSON schema of arguments encoded as a string
4. A list of package names imported by the function packages
5. The code implementation

Below are the signatures of the current functions.
List A: {current_function_signature}
The success rate (performance) with this function set is {success_rate}.
The following list are the function signatures that you have after taking {actions_num} actions in our previous conversations.
List B: {updated_function_signature}.
We also provide more examples for different functions and their corresponding success rates. The following function signatures are arranged in ascending order based on their success rates, where higher success rates indicate better quality.
{historical_fail_functions}
Here are {conversation_num} conversation histories of solving {conversation_num} tasks.
History: {history}
The following table shows the statistical information for solving each task in each conversation and indicates whether each task was successfully solved. 1 represents correct. 0 represents wrong.
statistic: {statistic}
According to the information I provide, please take one of four actions to manipulate list B using the functions you know. Instead of returning TERMINATE directly or taking no action, you should try your best to optimize the function set.
Only take no action if you really think the current list is optimal, as more actions will harm performance in future tasks. Even adding a general function that can substitute the assistant's repeated suggestions of Python code with the same functionality could also be helpful.

D.2. Prompt Design for ReAct

Answer the following question using your coding skills. Below is a list of the tools you can use and their detailed descriptions:
{tool_descriptions}
You should always follow the below template, when you respond you should provide one (Thought, Action, Action Input) triplet and wait for observation before proceeding to the next round, unless you have reached a FINAL ANSWER.
YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings. If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise. If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise. If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string.
TEMPLATE:
Question: the input question you must answer
Thought: your reasoning about the current situation
Action 1: the action to take, should be one of [{tool_names}]
Action 1 Input: the arguments passed to action 1
Observation 1: the result of action 1
Action 2: the action to take, should be one of [{tool_names}]
Action 2 Input: the input to action 2
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
FINAL ANSWER: the final answer to the original input question

D.3. Function calls of LLM backed AgentOptimizer

Add function: add a new function that may be used in future tasks.
ADD_FUNC = {
    "type": "function",
    "function": {
        "name": "add_function",
        "description": "Add a function in the context of the conversation. Necessary Python packages must be declared. The name of the function MUST be the same with the function name in the code you generated.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "string",
                    "description": "The name of the function in the code implementation."
                },
                "description": {
                    "type": "string",
                    "description": "A short description of the function."
                },
                "arguments": {
                    "type": "string",
                    "description": "JSON schema of arguments encoded as a string. Please note that the JSON schema only supports specific types including string, integer, object, array, boolean. (do not have float type) For example: { \"url\": { \"type\": \"string\", \"description\": \"The URL\", }}. Please avoid the error 'array schema missing items' when using array type."
                },
                "packages": {
                    "type": "string",
                    "description": "A list of package names imported by the function, and that need to be installed with pip prior to invoking the function. This solves ModuleNotFoundError. It should be string, not list."
                },
                "code": {
                    "type": "string",
                    "description": "The implementation in Python. Do not include the function declaration."
                }
            },
            "required": ["name", "description", "arguments", "packages", "code"]
        }
    }
}

Revise function: revise one existing function.

REVISE_FUNC = {
    "type": "function",
    "function": {
        "name": "revise_function",
        "description": "Revise a function in the context of the conversation. Necessary Python packages must be declared. The name of the function MUST be the same with the function name in the code you generated.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "string",
                    "description": "The name of the function in the code implementation."
                },
                "description": {
                    "type": "string",
                    "description": "A short description of the function."
                },
                "arguments": {
                    "type": "string",
                    "description": "JSON schema of arguments encoded as a string. Please note that the JSON schema only supports specific types including string, integer, object, array, boolean. (do not have float type) For example: { \"url\": { \"type\": \"string\", \"description\": \"The URL\", }}. Please avoid the error 'array schema missing items' when using array type."
                },
                "packages": {
                    "type": "string",
                    "description": "A list of package names imported by the function, and that need to be installed with pip prior to invoking the function. This solves ModuleNotFoundError. It should be string, not list."
                },
                "code": {
                    "type": "string",
                    "description": "The implementation in Python. Do not include the function declaration."
                }
            },
            "required": ["name", "description", "arguments", "packages", "code"]
        }
    }
}

Remove function: remove one existing function.

REMOVE_FUNC = {
    "type": "function",
    "function": {
        "name": "remove_function",
        "description": "Remove one function in the context of the conversation. Once remove one function, the assistant will not use this function in future conversation.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "string",
                    "description": "The name of the function in the code implementation."
                }
            },
            "required": ["name"]
        }
    }
}

E. Generated Functions

Here we provide a list of typical function examples generated by AgentOptimizer for all three datasets.

E.1. Trained Functions in MATH

GPT-4+ Agent

1. evaluate_expression: evaluate arithmetic or mathematical expressions provided as strings.

from sympy import sympify, SympifyError

def evaluate_expression(expression):
    try:
        result = sympify(expression)
        if result.is_number:
            result = float(result)
        else:
            result = str(result)
        return result
    except SympifyError as e:
        return str(e)

2.
calculate_polynomial_roots: find all real roots of a given polynomial.

from sympy import solve, sympify, SympifyError

def calculate_polynomial_roots(polynomial):
    try:
        # Parse the polynomial and solve for its roots
        roots = solve(sympify(polynomial), dict=True)
        # Convert roots to a standard list format
        roots_list = [str(root) for root in roots]
        return roots_list
    except SympifyError as e:
        return str(e)

3. solve_algebraic_equation: revises the existing function that solves a single variable algebraic equation to handle equations more robustly with variable terms on both sides.

from sympy import symbols, Eq, solve, parse_expr

def solve_algebraic_equation(equation, variable):
    # Create a symbolic variable
    symbol = symbols(variable)
    # Parse the equation string into a sympy expression
    left_part, right_part = equation.split('=')
    eq = Eq(parse_expr(left_part), parse_expr(right_part))
    # Solve the equation for the variable
    solution = solve(eq, symbol)
    # Return the solution
    return solution

ReAct Agent

1. convert_decimal_to_fraction: convert a decimal number to its equivalent in the simplest fractional form.

from fractions import Fraction

def convert_decimal_to_fraction(decimal_number):
    try:
        # Convert string to a floating point number
        decimal_number = float(decimal_number)
        # Create a Fraction from the decimal number
        fraction_result = Fraction(decimal_number).limit_denominator()
        # Return the fraction as a string in the form 'numerator/denominator'
        return str(fraction_result)
    except ValueError as e:
        return str(e)

2. evaluate_math_expression: evaluate a wide range of mathematical expressions provided as strings, including basic arithmetic, factorial, combinations, and permutations.

from sympy import sympify, factorial, binomial

def evaluate_math_expression(expression):
    try:
        # Extend the namespace with factorial and binomial functions
        local_dict = {'factorial': factorial, 'comb': binomial}
        # Evaluate the expression using sympy's sympify function
        result = sympify(expression, locals=local_dict)
        if result.is_number:
            return float(result)
        else:
            return str(result)
    except Exception as e:
        return str(e)

3. get_polynomial_degree: given a polynomial expression as a string, return the degree of the polynomial.

from sympy import Poly, SympifyError

def get_polynomial_degree(expression):
    try:
        # Convert the string expression into a polynomial
        poly = Poly(expression)
        # Return the degree of the polynomial
        return poly.degree()
    except SympifyError as e:
        return str(e)

E.2. Trained Functions in GAIA

GPT-4+ Agent

1. perform_web_search: performs a web search using Bing Search API and returns the top search results including URLs and snippets.

import os
import requests

def perform_web_search(query):
    subscription_key = os.environ['BING_SEARCH_V7_SUBSCRIPTION_KEY']
    endpoint = os.environ['BING_SEARCH_V7_ENDPOINT'] + '/v7.0/search'
    headers = {'Ocp-Apim-Subscription-Key': subscription_key}
    params = {'q': query, 'textDecorations': True, 'textFormat': 'HTML'}
    response = requests.get(endpoint, headers=headers, params=params)
    response.raise_for_status()
    search_results = response.json()
    top_results = [{'url': result['url'], 'snippet': result['snippet']}
                   for result in search_results.get('webPages', {}).get('value', [])]
    return top_results

2. scrape_wikipedia_table: scrapes data from a table on a Wikipedia page based on a header keyword.

import requests
from bs4 import BeautifulSoup

def scrape_wikipedia_table(url, header_keyword):
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    headers = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
    data = []
    for header in headers:
        if header_keyword.lower() in header.text.lower():
            table = header.find_next_sibling('table', class_='wikitable')
            if table:
                rows = table.find_all('tr')
                for row in rows:
                    cols = row.find_all(['th', 'td'])
                    cols = [ele.text.strip() for ele in cols]
                    data.append([ele for ele in cols if ele])
                break
    return data

3. extract_pdf_text: extracts text from a PDF file.

import fitz  # PyMuPDF

def extract_pdf_text(file_path):
    # Open the PDF file
    with fitz.open(file_path) as pdf:
        text = ''
        # Iterate over each page
        for page_num in range(len(pdf)):
            page = pdf[page_num]
            text += page.get_text()
    return text

ReAct Agent

1. fetch_webpage_content: retrieve the HTML content of a given webpage URL.

import requests

def fetch_webpage_content(url):
    response = requests.get(url)
    response.raise_for_status()
    return response.text

2. fetch_bing_search_results: retrieve search results from Bing Web Search API.

import os
import requests

def fetch_bing_search_results(query):
    subscription_key = os.environ['BING_SEARCH_V7_SUBSCRIPTION_KEY']
    endpoint = os.environ['BING_SEARCH_V7_ENDPOINT'] + "/v7.0/search"
    headers = {'Ocp-Apim-Subscription-Key': subscription_key}
    params = {'q': query, 'textDecorations': True, 'textFormat': 'HTML'}
    response = requests.get(endpoint, headers=headers, params=params)
    response.raise_for_status()
    return response.json()

3. extract_text_from_pdf: extracts all text from a given PDF file.

import fitz  # PyMuPDF

def extract_text_from_pdf(file_path):
    try:
        # Open the PDF file
        with fitz.open(file_path) as pdf:
            text = ''
            # Extract text from each page in the PDF
            for page in pdf:
                text += page.get_text()
        return text
    except Exception as e:
        return f'An error occurred: {str(e)}'

E.3. Trained Functions in TabMWP

GPT-4+ Agent

1. perform_arithmetic_operations: perform basic arithmetic operations such as sum, average, maximum, minimum, difference, and rate of change on a given list of numbers.

def perform_arithmetic_operations(numbers, operation):
    result = None
    if operation == 'sum':
        result = sum(numbers)
    elif operation == 'avg':
        result = sum(numbers) / len(numbers) if numbers else None
    elif operation == 'max':
        result = max(numbers) if numbers else None
    elif operation == 'min':
        result = min(numbers) if numbers else None
    elif operation == 'diff' and len(numbers) > 1:
        result = numbers[0] - numbers[1]
    elif operation == 'rate_of_change' and len(numbers) > 1 and numbers[1] != 0:
        result = ((numbers[0] - numbers[1]) / abs(numbers[1])) * 100
    return result

2. analyze_stem_leaf_plot: analyze a given stem-leaf plot to calculate the total count of values within a specified range.

def analyze_stem_leaf_plot(stem_leaf_data, min_value, max_value):
    count = 0
    for stem, leaves in stem_leaf_data.items():
        for leaf in leaves:
            value = int(stem) * 10 + leaf
            if min_value <= value < max_value:
                count += 1
    return count

3. calculate_range: calculate the range (difference between the maximum and minimum) of a list of numbers.

def calculate_range(numbers):
    return max(numbers) - min(numbers)

ReAct Agent

1. calculate_total_cost_general: calculate the total cost given a unit price and quantity, supporting both the quantity as a string or an integer.

def calculate_total_cost_general(unit_price, quantity):
    return float(unit_price) * (int(quantity) if isinstance(quantity, str) else quantity)

F.
Case Study

We then present three case studies of the trained GPT-4+ agent on three different datasets to illustrate why the well-optimized learned function leads to the correct result in each case.

F.1. Case Study for MATH

A: To determine the number of rectangles with an area of 3 square units that can be formed in the given 6 by 6 grid, we should consider the possible dimensions of such rectangles. Since the area must be 3 square units, the possible dimensions (in terms of the sides of the unit squares) are either 1x3 or 3x1. Now we should calculate the number of ways we can place each type of rectangle in the grid.
1. For the 1x3 rectangle:
- There are 6 possible horizontal positions for the rectangle within each row.
- Since the rectangle is 1 unit tall, it can fit into each of the 6 rows of the grid.
- Therefore, the total for this type is 6 horizontal positions * 6 rows.
2. For the 3x1 rectangle:
- There are 6 possible vertical positions for the rectangle within each column.
- Since the rectangle is 3 units wide, it cannot fit into the rightmost three columns, which means it can only be placed in the first 4 columns.
- Therefore, the total for this type is 6 vertical positions * 4 columns.
Let's calculate the total number of rectangles using the above logic:
```python
# calculation for 1x3 rectangles
horizontal_positions = 6
rows_for_1x3 = 6
total_1x3_rectangles = horizontal_positions * rows_for_1x3

# calculation for 3x1 rectangles
vertical_positions = 6
columns_for_3x1 = 4
total_3x1_rectangles = vertical_positions * columns_for_3x1

# total rectangles of area 3 square units
total_rectangles = total_1x3_rectangles + total_3x1_rectangles
print(total_rectangles)
```
Please execute the above code and provide the output.

UserProxy: 60

A: The output you provided indicates that there are 60 rectangles of area 3 square units that can be formed using only the line segments of the grid as the sides of the rectangles.

Therefore, the answer to the
given problem is \boxed{60}.

MATH Problem: Consider the six by six grid of unit squares below. How many rectangles of area 3 square units can be formed using only the line segments of the grid as the sides of the rectangles?
[asy]
unitsize(0.1inch);
draw((0,0)--(6,0)--(6,6)--(0,6)--cycle);
draw((1,0)--(1,6));
draw((2,0)--(2,6));
draw((3,0)--(3,6));
draw((4,0)--(4,6));
draw((5,0)--(5,6));
draw((0,1)--(6,1));
draw((0,2)--(6,2));
draw((0,3)--(6,3));
draw((0,4)--(6,4));
draw((0,5)--(6,5));
[/asy]

(a) Before Agent Training (the transcript above, ending in \boxed{60}) (b) After Agent Training (the transcript below, ending in 48)

A: First, we need to figure out which possible side lengths can form rectangles with an area of 3 square units. Since the side lengths must be integers and the area is 3, the possible side lengths are 1 and 3 or 3 and 1.

We can have rectangles oriented in two different ways:
- Vertically, with a height of 3 units and a width of 1 unit.
- Horizontally, with a height of 1 unit and a width of 3 units.

Let's calculate the number of ways we can position these rectangles within the grid.
Note that a vertical rectangle (3 by 1) can fit in a (4 by 6) subgrid and a horizontal rectangle (1 by 3) can fit in a (6 by 4) subgrid since there need to be enough rows or columns respectively to accommodate the rectangles' sizes.

For vertical rectangles (3x1):
- There are 4 possibilities in vertical orientation for the longer side (3 units) since it takes up 3 out of 6 possible vertical unit lengths.
- There are 6 possibilities for the shorter side (1 unit) since it can be placed in any of the 6 vertical columns.

For horizontal rectangles (1x3):
- There are 6 possibilities in the horizontal orientation for the shorter side (1 unit) since it can be placed in any of the 6 horizontal rows.
- There are 4 possibilities for the longer side (3 units) since it takes up 3 out of 6 possible horizontal unit lengths.

The total number of rectangles of area 3 square units can be calculated by adding the number of vertical rectangles and the number of horizontal rectangles.

Would you like me to proceed to express the sum of these numbers, or do you already wish to calculate them using a corresponding function call?

UserProxy: Continue. Please keep solving the problem until you need to query. (If you get to the answer, put it in \boxed{}.)

A: "function_call": {"arguments": "{\"expression\":\"(4 * 6) + (4 * 6)\"}", "name": "evaluate_expression"}

UserProxy: 48

Figure 6. Comparison of the GPT-4+ Agent before and after agent training. After training, the well-optimized learned function leads to the correct result, whereas the Python code generated on the fly does not.

F.2. Case Study for GAIA

A: To determine the country with the least number of athletes at the 1928 Summer Olympics and provide the IOC country code, I will perform these steps:
1. Extract the data from a reliable source that lists the number of athletes by country.
2. Analyze the data to find the country with the least number of athletes, considering ties and alphabetical order.
3.
Output the IOC country code of that country.
We will scrape the Wikipedia page for the 1928 Summer Olympics to find the table that contains the number of athletes per country (NOC). Here's the Python code:
```python
import requests
from bs4 import BeautifulSoup
import os

# The Wikipedia URL for 1928 Summer Olympics
url = 'https://en.wikipedia.org/wiki/1928_Summer_Olympics'

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the response content with BeautifulSoup
    ...
```

UserProxy: Execution failed. Traceback: ValueError: invalid literal for int() with base 10: 'Modern pentathlon (riding)'

A: It appears there was an error because of a string that could not be converted to an integer. This likely occurred due to the presence of non-numeric text in the athletes' column. Here is modified code: ...

UserProxy: Execution succeeded. Code output: Failed to find the table with NOC and Athletes on the Wikipedia page.

... (This trial and error repeats 4 times.)

A: The difficulty in identifying the correct table suggests that the approach of directly parsing the HTML may be inadequate due to the complexity or variability of the page structure. Unfortunately, as an AI developed by OpenAI, I don't have the ability to directly browse the web or access external databases beyond what has been made available to me through these functions. Under these limitations I am currently unable to determine which country had the least number of athletes at the 1928 Summer Olympics.
FINAL ANSWER: Unable to determine.

GAIA Problem: What country had the least number of athletes at the 1928 Summer Olympics? If there's a tie for a number of athletes, return the first in alphabetical order. Give the IOC country code as your answer.

(a) Before Agent Training (the transcript above) (b) After Agent Training (the transcript below)

A: "function_call": {"name": "scrape_Wikipedia_table", "arguments": "{\"url\": \"https://en.wikipedia.org/wiki/1928_Summer_Olympics\", \"header_keywords\": \"Nations\"}"}

UserProxy: Execution result: [['Participating National Olympic Committees'], ['Argentina (81) \n Australia (18)\n Austria (73)\n...\n Yugoslavia(34)']]

A: The data extracted from the table on the Wikipedia page for the 1928 Summer Olympics contains the participating nations and the number of athletes they had. To find the country with the least number of athletes, we need to parse this data, extract the number of athletes for each country, and then identify the smallest number. Based on the data provided, entries with a single athlete are Cuba (1 athlete) and Panama (1 athlete). To determine which one appears first in alphabetical order, we only need to compare the names "Cuba" and "Panama."
FINAL ANSWER: CUB

Figure 7. Comparison of the GPT-4+ Agent task-solving trajectory on GAIA before and after agent training. After training, the agent can successfully leverage the well-optimized function to handle complex web-scraping problems and solve the otherwise coding-heavy task. The Python code suggested by the assistant is truncated for simplicity.

F.3. Case Study for TabMWP

After multi-turn conversation…

A: "function_call": {"arguments": "{\"numbers\":\"15,16,17,18,23,27,34,37,39,59\",\"operation\":\"min\"}", "name": "perform_list_operations"}

UserProxy: 15

A: The smallest number of cows on a farm based on the given stem-and-leaf plot is 15. \boxed{15} TERMINATE

A:
1. To find the smallest number of cows, I need to look at the stem and leaf plot provided.
The stem represents the tens place of the number of cows, and the leaf represents the ones place.
2. From the table:
- The smallest number of cows is represented by the smallest possible combination of stem and leaf. Therefore, the smallest number of cows is 13.
3. Here is the leaf that can be counted using Python code:
```python
leaf = [5, 6, 7, 8, 3, 7, 3, 7, 9, 9]
print(leaf)
```
1. I will wait for the user to give the result.
2. Once the user provides the result, I can then continue with the next step.

UserProxy: [5, 6, 7, 8, 3, 7, 3, 7, 9, 9]

A: The smallest number of cows recorded on a farm, based on the data provided, is \boxed{13}. Since this is the final answer, I consider the problem solved. TERMINATE

TabMWP Problem: A researcher recorded the number of cows on each farm in the county. What is the smallest number of cows?
Here is the table that provides you with some necessary information:
Cows per farm
Stem | Leaf
1 | 5, 6, 7, 8
2 | 3, 7
3 |
4 | 3, 7, 9
5 | 9

(a) Before Agent Training (the code-based transcript ending in \boxed{13}) (b) After Agent Training (the function-call transcript ending in \boxed{15})

Figure 8. Comparison of the GPT-3.5-turbo + Agent task-solving trajectory on TabMWP before and after agent training. After training, the agent can successfully leverage the well-optimized function to obtain an accurate result.

G. Hyperparameter Settings

The proposed agent training method involves several hyperparameters, including the training epoch, early-stop threshold, and maximum number of actions. In our empirical experiments across all three datasets, we consistently utilized the same hyperparameter configuration for the proposed agent training algorithm. Specifically:
(1) We set the training epoch to 10 for all experiments.
(2) An early stopping criterion was established with a threshold of 10 epochs. If there were 10 consecutive epochs without any improvement in training performance, the training process terminated.
(3) Additionally, we restricted the maximum number of actions taken during each function update step to 3.
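
To make the role of the three hyperparameters concrete, the following is a minimal Python sketch of a training loop that uses them. It is an illustration under stated assumptions, not the authors' implementation: `evaluate` and `propose_actions` are hypothetical callables standing in for running the agent on the training tasks and for querying the LLM-backed optimizer, the dictionary representation of a function set is invented for this example, and the roll-back rule (discard an update whose training score drops) is an assumed reading of the roll-back strategy named in the abstract.

```python
from typing import Callable, Dict, List, Tuple

FunctionSet = Dict[str, str]  # hypothetical: function name -> source code
Action = Dict[str, str]       # hypothetical: {"type": ..., "name": ..., "code": ...}


def apply_action(functions: FunctionSet, action: Action) -> FunctionSet:
    """Return a copy of the function set with one add/revise/remove applied."""
    updated = dict(functions)
    kind = action["type"]
    if kind in ("add", "revise"):
        updated[action["name"]] = action["code"]
    elif kind == "remove":
        updated.pop(action["name"], None)
    else:
        raise ValueError(f"unknown action type: {kind}")
    return updated


def train_agent(
    evaluate: Callable[[FunctionSet], float],
    propose_actions: Callable[[FunctionSet, float], List[Action]],
    max_epochs: int = 10,            # training epochs, as in (1)
    early_stop_threshold: int = 10,  # epochs without improvement, as in (2)
    max_actions: int = 3,            # actions per update step, as in (3)
) -> Tuple[FunctionSet, float]:
    """Sketch of function-set training with early stop and an assumed roll-back."""
    current: FunctionSet = {}
    current_score = evaluate(current)
    stale_epochs = 0
    for _ in range(max_epochs):
        # Cap the number of actions taken in one function update step.
        actions = propose_actions(current, current_score)[:max_actions]
        if not actions:  # an empty list models the optimizer replying TERMINATE
            break
        candidate = current
        for action in actions:
            candidate = apply_action(candidate, action)
        score = evaluate(candidate)
        if score < current_score:
            # Assumed roll-back rule: drop the update and keep the previous set.
            stale_epochs += 1
        else:
            improved = score > current_score
            current, current_score = candidate, score
            stale_epochs = 0 if improved else stale_epochs + 1
        if stale_epochs >= early_stop_threshold:
            break
    return current, current_score
```

In this sketch, with both the epoch count and the early-stop threshold set to 10 as in the list above, the early-stop check can only fire on the final epoch; a smaller threshold would be needed for it to end training sooner.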
It is essential to recognize that optimal hyperparameter settings can vary with the specific problem and task. However, for our research, we kept these parameters fixed to ensure a consistent experimental setup. Combining our algorithm with hyperparameter tuning techniques from previous work (Li et al., 2018; Wu et al., 2021; Zhang et al., 2022) may further enhance performance.

H. Limitations

A significant bottleneck in the agent training algorithm is that the size of the training data is limited by the LLM context limit. This constraint severely restricts its applicability to large-scale training scenarios. Furthermore, in Section 3.3.4, we empirically demonstrate that directly applying batch training techniques from traditional machine learning to agent training is ineffective and presents a non-trivial challenge. We regard addressing these limitations as follow-up work.

---