loader
Generating audio...

arxiv

Paper 2501.05707

Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

Authors: Vighnesh Subramaniam, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Shuang Li, Igor Mordatch

Published: 2025-01-10

Abstract:

Large language models (LLMs) have achieved remarkable performance in recent years but are fundamentally limited by the underlying training data. To improve models beyond the training data, recent works have explored how LLMs can be used to generate synthetic data for autonomous self-improvement. However, successive steps of self-improvement can reach a point of diminishing returns. In this work, we propose a complementary approach towards self-improvement where finetuning is applied to a multiagent society of language models. A group of language models, all starting from the same base model, are independently specialized by updating each one using data generated through multiagent interactions among the models. By training each model on independent sets of data, we illustrate how this approach enables specialization across models and diversification over the set of models. As a result, our overall system is able to preserve diverse reasoning chains and autonomously improve over many more rounds of fine-tuning than single-agent self-improvement methods. We quantitatively illustrate the efficacy of the approach across a wide suite of reasoning tasks.

Paper Content:
Page 1: Published as a conference paper at ICLR 2025 MULTIAGENT FINETUNING : SELFIMPROVEMENT WITH DIVERSE REASONING CHAINS Vighnesh Subramaniam∗ MIT CSAIL vsub851@mit.eduYilun Du∗ Harvard University ydu@seas.harvard.eduJoshua B. Tenenbaum MIT CSAIL, BCS, CBMM jbt@mit.edu Antonio Torralba MIT CSAIL torralba@mit.eduShuang Li† Stanford University lishuang@stanford.eduIgor Mordatch† UC Berkeley mordatch@berkeley.edu ABSTRACT Large language models (LLMs) have achieved remarkable performance in recent years but are fundamentally limited by the underlying training data. To improve models beyond the training data, recent works have explored how LLMs can be used to generate synthetic data for autonomous self-improvement. However, successive steps of self-improvement can reach a point of diminishing returns. In this work, we propose a complementary approach towards self-improvement where finetuning is applied to a multiagent society of language models. A group of language models, all starting from the same base model, are independently specialized by updating each one using data generated through multiagent interactions among the models. By training each model on independent sets of data, we illustrate how this approach enables specialization across models and diversification over the set of models. As a result, our overall system is able to preserve diverse reasoning chains and autonomously improve over many more rounds of fine-tuning than single-agent self-improvement methods. We quantitatively illustrate the efficacy of the approach across a wide suite of reasoning tasks. Project website at https://llm-multiagent-ft.github.io 1 I NTRODUCTION Recent breakthroughs in large language models (LLMs) like GPT-3.5 and GPT-4 have demonstrated remarkable proficiency in language generation, comprehension, question answering, and transla- tion (OpenAI, 2023; Touvron et al., 2023). Despite these advancements, LLMs are fundamentally constrained by the data they are trained on, with existing models already using much of the available data on the Internet (Brown et al., 2020). To further enhance the performance of LLMs, recent research on self-improvement , where LLMs generate additional synthetic data on which they are trained on (Huang et al., 2022; Yu et al., 2023). One approach to increase the data available to LLMs is to use powerful existing frontier models like GPT-4 to generate additional supervisory data. However, this approach is limited by the inherent quality of frontier models, preventing models from becoming better than the frontier of what the best existing models can accomplish. In addition, such an approach incurs high financial costs due to inference expenses of such large models and is also often legally prohibited with existing commercial-grade models. An alternative approach is to directly leverage existing language models to generate additional synthetic data for their self-improvement (Zelikman et al., 2022; Bai et al., 2022; Chen et al., 2024b; Yuan et al., 2024). In such works, language models are used to iteratively collect data that they are then finetuned on. However, as models are repeatedly trained, performance gains often plateau relatively quickly as diversity decreases (Figure 1) and the self-improvement loop is often only ∗Equal Contribution, Corresponding authors †Equal advising 1arXiv:2501.05707v2 [cs.CL] 3 Mar 2025 Page 2: Published as a conference paper at ICLR 2025 1 2 3 4 5 Iterations of finetuning Iterations of finetuning Iterations of finetuning4550556065Accuracy Phi-3 Multiagent FT (Ours) Single-agent FT 1 2 3 4 51618202224262830 Mistral Multiagent FT (Ours) Single-agent FT 1 2 3 4 550.052.555.057.560.062.565.067.570.0 LLaMA-3 (8B) Multiagent FT (Ours) Single-agent FT Figure 1: Multiagent finetuning improves reasoning performance over multiple rounds of finetuning . Our multiagent finetuning procedure enables models to improve across multiple iterations of finetuing. Results reported on the MATH dataset. run for two or three rounds (Lu et al., 2023; Song et al., 2024). This limits the applicability of self-improvement to autonomously improve language models, as models can only be improved a limited amount above their base performance. In this paper, we propose a new approach to self-improvement that can help mitigate the issue of decreased gains of performance after multiple rounds of fine-tuning. Instead of fine-tuning a single model, our method finetunes a multiagent set of language models from the same base model and then independently specializes each model to capture parts of a task of interest. Our key insight is that by finetuning multiple models, we can encourage specialization and diversification across responses, which can enable consistent performance gains over many rounds of fine-tuning. To achieve specialization between models, we fine-tune each model repeatedly on independent subsets of the generated data corresponding to responses from the respective particular model. Within our multiagent set of models, we propose to specialize models into distinct functionalities within the output generation procedure. First, we specialize a set of models to be generation agents that produce a set of initial responses given queries. Since initial responses can often be suboptimal, especially for challenging reasoning tasks, we further propose to specialize a set of models as critic agents that evaluate and refine the generations of other models. By using this set of distinct models in combination through multiagent debate (Du et al., 2023), we are able to construct a robust feedback loop for generating final responses, with experiments on other multiagent methods in Appendix D. By training each model on distinct sets of data and roles, our approach fosters specialization across models and promotes diversification within the society of models. Consequently, our system can au- tonomously improve over many more rounds of finetuning compared to single-agent self-improvement methods (Figure 1). We quantitatively demonstrate the effectiveness of our approach across a com- prehensive suite of reasoning tasks, illustrating significant performance gains, as shown in Table 1. In our experiments, we illustrate how our proposed method can be directly applied to both open-source LLMs such as Phi-3, Mistral, and LLaMA-3 as well proprietary LLMs such as GPT-3.5 to substan- tially improve performance. In addition, the finetuned models can generalize to novel datasets and outperform the baseline methods trained directly on these new datasets. Overall, our paper has the following contributions: (1)We propose to leverage multiagent interaction as an approach to self-improvement with language models. (2)We propose to specialize models with distinct roles to enable detailed feedback between agents and to improve the final output quality. (3) We quantitatively verify the applicability of our approach across a wide suite of reasoning tasks on both open-source and proprietary language models. (4)We demonstrate that the finetuned agents can generalize across different datasets in a zero-shot manner. 2 M ULTIAGENT FINETUNING OF LANGUAGE MODELS We provide an overview of our approach towards multiagent finetuning of language models, where we learn a multiagent society of models to accomplish a task. Our method involves two components. We first use a multiagent debate method to construct a finetuning dataset for raining models (though other multiagent generation methods can also be used, see Appendix Section D). We then introduce our approach, multiagent finetuning, where we specialize each LLM model by finetuning each model 2 Page 3: Published as a conference paper at ICLR 2025 Multiagent DebateInput 𝑥Round 1Model 1Model 𝑁 Majority V oting Result 𝑦$𝑦!,!𝑦!,#…Model 𝑛…𝑦!,$Model 1Model 𝑁…Model 𝑛…𝑦%,!𝑦%,#𝑦%,$………Round 𝑀 Finetuning Generation ModelsGeneration Model 𝐴'!&The “correct” output of Model 1 after the first round of multiagent debate {𝑦!,!} Critic Model 𝐴'$,!'A mix of “correct and incorrect” output of Model 1 after the 𝑚-th(𝑚>1) round of multiagent debate {𝑦#,!}Summarize the responses from other modelsRound 2…Finetuning Critic ModelsFinetuningGeneration Model 𝐴'#&The “correct” output of Model N after the first round of multiagent debate {𝑦!,$}Finetuning FinetuningCritic Model 𝐴'(,#'Finetuning…A mix of “correct and incorrect” output of Model N after the 𝑚-th(𝑚>1) round of multiagent debate {𝑦#,$} Figure 2: Overview of Multiagent Finetuning .We first use multiagent debate and majority voting to create the finetuning datasets (left). These datasets are then used to finetune the generation and critic agents (right). When finetuning generation models, we use the majority voted result (”correct” output) to select first-round responses from each agent. We then finetune critic models using responses from the final round based on whether responses match the majority voted result (mix of ”correct and incorrect” outputs). The finetuned models are combined through multiagent debate to generate more accurate answers. In this figure, we illustrate a single finetuning iteration. Applying multiple rounds of finetuning iterations can significantly boost performance. on its own generated data. An overview of our approach can be seen in Figure 2. We first provide an introduction of our multiagent debate method in Section 2.1. We then discuss how to fine-tune a single model on generated data in Section 2.2, and the proposed multiagent finetuning in Section 2.3 and Section 2.4. We then show how to apply finetuned models for inference in Section 2.5. 2.1 M ULTIAGENT DEBATE Multiagent debate (Du et al., 2023) involves a series of Nlanguage model agents—either specific copies or finetuned versions of the same model—each tasked with generating a response to a given problem. After the initial responses are generated, a debate round is initiated among the agents. In our paper, we concatenate and summarize the responses from other agents. Each agent is instructed to construct a new response based on its prior response and the summarized responses from the others. The final result is determined by majority vote based on the outputs from the last round of debate. The multiagent debate is illustrated in Figure 2. 2.2 F INETUNING MODELS ON GENERATED DATA We start by considering how to use data generated by multiagent debate data to finetune a single LLM model for self-improvement. Given a set of natural language inputs Dtask={xi}, we use a multiagent debate method (Du et al., 2023), specifically a debate with Nagents and Mrounds, to generate responses for each input in Dtask. We obtain the final predicted output ˆyifor each xi through majority voting in the last round of debate. We use this to construct a “ground truth” dataset of{(xi,ˆyi)}. In the single LLM model setting, we then finetune the model on the set of generated responses yiwhich match ˆyigiven input xi. While the final debate results ˆyiare accurate, they often similar in style and methodology. As a result, repeatedly capturing a dataset of {(xi,ˆyi)}pairs for multiple rounds of finetuning often leads to a plateau of self-improvement performance. 2.3 F INETUNING MULTIPLE GENERATION AND CRITIC MODELS Our goal in multiagent finetuning is to create datasets that construct a set of models representing different agents that are diverse and accurately solve problems. Instead of building a single dataset to finetune each model, we propose creating different datasets to finetune different models. A set of 3 Page 4: Published as a conference paper at ICLR 2025 Algorithm 1 Multiagent Finetuning of Language Models Require: A pretrained LLM A; A set of language inputs Dtask={xi}; The number of agents N; The number of debate rounds M; The number of finetuning iterations L. 1:AG 1,···, AG N←A# Copy the LLM to build Ngeneration agents 2:AC 1,···, AC N←A# Copy the LLM to build Ncritic agents 3: # Multiple Iterations of Finetuning 4:forl= 1→Ldo 5: # Multiagent Debate 6: forxinDtaskdo# Iterate over the input tasks 7: forminMdo# M rounds of debate 8: ifm= 0then 9: y1,1,···, y1,N←AG 1(x),···, AG N(x)# Response of each generation agent 10: else 11: xs m,1,···, xs m,N←Summarize the responses from other agents in round m−1 12: ym,1,···, ym,N←AC 1(xs m,1),···, AC N(xs m,N)# Response of each critic agent 13: end if 14: end for 15: ˆy←Majority V oting {yM,1,···, yM,N}# Responses of the final round of debate 16: end for 17: # Multiagent Finetuning 18: Initialize datasets for finetuning generation models {DG n}N n=1 19: Initialize datasets for finetuning critic models {DC n}N n=1 20: forninNdo# Iterate over all the agents 21: forxinDtaskdo# Iterate over the input tasks 22: DG n← DG n∪ {(x, y 1,n)|y1,n= ˆy}# Add pairs 23: DC− n← DC− n∪ {(x,(y1,n,···, yM,n))|y1,n̸= ˆy, yM,n= ˆy}# Add pairs 24: DC+ n← DC+ n∪ {(x,(y1,n,···, yM,n))|y1,n= ˆy, yM,n= ˆy}# Add pairs 25: DC n←wDC− n+ (1−w)DC+ n # Combine the datasets 26: end for 27: ˆAG n←Finetune (An,DG n)# Finetune the generation model 28: ˆAC n←Finetune (An,DC n)# Finetune the critic model 29: end for 30: AG 1,···, AG N←ˆAG 1,···,ˆAG N# Generation agent for the next finetuning iteration 31: AC 1,···, AC N←ˆAC 1,···,ˆAC N# Critic agent for the next finetuning iteration 32:end for models are trained as generation agents and others as critic agents. The generation models produce initial responses to input questions. In contrast, the critic models assess the outputs from all generation agents and then select or generate the most effective responses. Finetuning Generation Models. The role of a generation model is to generate accurate responses to input questions. Such models should rely on diverse reasoning chains to promote diversity. Generation agents AG nare constructed from the Ngeneration models which generate a response to the given input x(we omit ifor simplicity). For each agent, we select its outputs ynthat match the final debate results ˆyand construct input-output pairs (x, yn). The resulting dataset for agent AG nisDG n={(x, yn)}. This approach generates a set of finetuning datasets {DG 1,···,DG N}across all Nagents. Each dataset contains different outputs, allowing for specialization and diversification of responses. We finetune each generation model with the corresponding dataset to get Ncorrespondingly finetuned agents {ˆAG 1,···,ˆAG N}. Finetuning Critic Models. The role of a critic model is to further provide accurate critiques to responses from other agents and use these responses to provide an updated answer. Simply finetuning generation models isn’t sufficient for achieving optimal results, especially for more challenging tasks, due to the lack of a feedback mechanism on their outputs. Critic agents AC nare constructed from critic models and evaluate the outputs from all generation agents and then select or synthesize the best responses. This additional step ensures that the system continuously improves and adapts, enhancing overall performance. 4 Page 5: Published as a conference paper at ICLR 2025 In the multiagent debate setting, each agent’s output in the last round of debates is represented as yM,n, where Mdenotes the number of debate rounds. We first identify those outputs yM,nthat align with the final debate results ˆy. These consistent outputs, together with the previous responses, are then used to construct input-output pairs (x,(y1,n, . . . , y M,n))for finetuning the critic models. To enhance the model’s capability to correct incorrect answers generated early in the debate process, we sample a subset of pairs where y1,ndiffers from ˆy, but yM,n matches ˆyand build a dataset DC− n={(x,(y1,n, . . . , y M,n))|y1,n̸= ˆy, yM,n = ˆy}. This indicates that the an- swer was successfully corrected by the end of the debates. We also construct another dataset DC+ n={(x,(y1,n, . . . , y M,n))|y1,n= ˆy, yM,n= ˆy}where both y1,nandyM,nmatch ˆy, demon- strating the agent’s ability to maintain the correct answer throughout the debates. We combine these two datasets to create a comprehensive finetuning dataset for each critic model to construct updated critic agents AC n: DC n=wDC− n+ (1−w)DC+ n. (1) In the above expression, wis a tunable hyperparameter representing the proportion of data sampled from the first set, while (1−w)represents the proportion of data sampled from the second set. This method generates a series of datasets {DC 1,···,DC N}for finetuning the critic models, denoted as {ˆAC 1,···,ˆAC N}after the finetuning process. 2.4 M ULTIPLE ITERATIONS OF FINETUNING The finetuned models are capable of generating responses through multiagent debate. We found that iterative application of the multiagent finetuning allows for continuous learning and adaptation, leading to progressively refined and more accurate responses over time. The finetuned generation agents {ˆAG 1,···,ˆAG N}and critic agents {ˆAC 1,···,ˆAC N}are used to gather datasets for the next iteration through multiagent debate. The algorithm for the proposed approach of Literations of finetuning is detailed in Algorithm 1. The steps for collecting data for finetuning the generation models are marked in red, and the finetuning of critic models is shown in blue. 2.5 I NFERENCE At inference time, we have a set of finetuned generation models which represent generation agents {ˆAG 1,···,ˆAG N}, and a set of finetuned critic models which represent critic agents {ˆAC 1,···,ˆAC N}. We conduct a multiagent debate among these agents, where each individual generation agent participates in the first round of the debate, followed by each individual critic agent in subsequent rounds. Each agent takes the responses from all other agents and generates a new response in each round of the debate. We found that summarizing the responses from the other agents helps eliminate redundant information while retaining the most important details, thereby further improving performance. The final result is determined by a majority vote based on the responses from the final round of the debate. We provide pseudocode in Algorithm 2. 3 E XPERIMENTS 3.1 L ANGUAGE REASONING TASKS We evaluate our method and baselines on three language reasoning tasks. Arithmetic. consists of 1,000 generated arithmetic problems in the form a+b·c+d−e·f. Following the generation procedure in (Du et al., 2023), each variable is assigned a random value up to a maximum of 30. Grade School Math (GSM). (Cobbe et al., 2021) consists of math word problems that require multi-step mathematical reasoning. Each example includes a problem statement, the numerical answer, and an explanation of the answer. MATH. Hendrycks et al. (2021) consists of competition-level math problems categorized into five difficulty levels. For our experiments, we sample problems from the first three levels. For each dataset, we randomly select 500examples for finetuning the language model. Additionally, we select 500held-out problems for evaluation. We parse the generated answers and evaluate their correctness by comparing them with the ground truth answers. Accuracy is reported based on how 5 Page 6: Published as a conference paper at ICLR 2025 frequently the model returns the correct answer. We also report the standard error of each accuracy value to measure the significance of improvement. 3.2 B ASELINES We compare the proposed method with various baselines. In all multiagent settings, we use three agents, and for all debate settings, we conduct two rounds of debates to ensure a fair comparison (additional results with five agents in Appendix Section F). Base utilizes a single language model to process input and generate responses. Majority is a multiagent baseline that selects responses based on a majority vote from multiple agents. If no response secures a majority, one of the potential answers is chosen at random. Debate is a multiagent debate baseline as described in Du et al. (2023). The debate structure is outlined in Figure 2. STaR (Zelikman et al., 2022) iteratively finetunes the language agent using a dataset with ground truth answers for each problem. Initially, the LM generates an answer for each problem, and correct responses, as verified by the ground truth, are added to the finetuning dataset. For problems answered incorrectly, the LM is reprompted with a hint that includes the ground truth answer. Problems where the generated response includes the correct answer are added to the finetuning dataset. The LM is finetuned on the collected dataset. This iterative process of building the dataset and finetuning is repeated until the finetuning loss saturates. The final model is then used for evaluation. Majority FT is a baseline that incorporates both majority voting and finetuning. We prompt the language agents with each problem and conduct a majority vote on their results. We then compile the responses from all agents that align with the majority vote, along with the input, to create a finetuning dataset. The language model is finetuned using this dataset. Finally, we apply majority voting to the outputs of the finetuned model to determine the final answer. 3.3 Q UANTITATIVE RESULTS We compare baselines and our method, which was finetuned for only a single iteration ( L= 1), in Table 1. The accuracy and standard error for each dataset are reported. We use three distinct base language models: three open-source models, Phi-3 4B (Abdin et al., 2024), Mistral 7B (Jiang et al., 2023), and LLaMA-3 8B (Dubey et al., 2024); and one proprietary model, GPT-3.5 (OpenAI, 2022). Our method outperforms all the baselines. Although “STaR” utilizes ground truth labels for data selection and undergoes multiple iterations of finetuning, it still performs worse than our method, which uses only a single finetuning iteration without access to ground truth. The “Majority”, “Debate” and “STaR” methods outperform the “Base” model, demonstrating that majority voting, multiagent debate, and finetuning all contribute to improved performance. “Majority FT” enhances the performance of “Majority” by incorporating a finetuning procedure. Our method is only finetuned on 500 examples and still shows significant improvement over the baselines, particularly on more challenging datasets such as GSM and MATH. Additional evaluations on a larger set of problems and datasets can be found in Appendix Section H. 3.4 M ULTIPLE ITERATIONS OF FINETUNING To verify the effectiveness of multiple iterations of finetuning, as described in Section 2.4, we present the performance of our proposed method “Multiagent FT (Ours)” over five iterations of finetuning in Figure 1. We tested this method on two open-source models, Mistral and Phi-3, using the MATH dataset. The results demonstrate that “Multiagent FT (Ours)” consistently improves performance over time. For example, the accuracy of Phi-3 increased from 58.8% to 66.0%, and the accuracy of Mistral improved from 22.5% to 28.2%. Our method with five rounds of finetuning is 12.6% and 9.31% more accurate than the best baseline listed in Table 1 using Phi-3 and Mistral, respectively. In contrast, finetuning a single agent (”Single-agent FT”), as described in Section 2.2, shows that performance saturates after one iteration of finetuning and starts dropping afterward, indicating potential overfitting to generated responses. This issue occurs when the single model, after several finetuning cycles, becomes fixated on a small range of responses, which limits its diversity and 6 Page 7: Published as a conference paper at ICLR 2025 LLM Methods Arithmetic GSM MATH GPT-3.5 (OpenAI, 2022)Base 81.99±0.99 75.60 ±1.36 46.83 ±2.25 Majority 94.40±1.03 81.20 ±1.24 51.40 ±2.23 Debate 98.21±0.54 83.30 ±1.18 55.73±2.21 STaR 98.38±0.57 83.60±1.17 53.00 ±2.23 Majority FT 98.40±0.56 83.70 ±1.17 53.40 ±2.23 Ours 99.62±0.28 85.60±1.11 60.60±2.18 Phi-3 (Abdin et al., 2024)Base 88.30±1.44 81.20 ±1.74 45.60 ±2.10 Majority 91.80±1.23 81.80 ±1.72 47.20 ±1.82 Debate 96.20±0.86 84.40 ±1.58 53.40±2.28 STaR 94.80±0.99 85.80±1.21 51.80 ±2.06 Majority FT 93.80±1.08 82.20 ±1.71 48.60 ±2.16 Ours 99.40±0.34 88.60±1.42 58.80±2.22 Mistral (Jiang et al., 2023)Base 10.80±0.51 35.60 ±1.92 16.60 ±1.21 Majority 14.80±1.17 41.80 ±0.88 16.80 ±1.25 Debate 19.60±1.12 52.60±1.26 18.20 ±1.37 STaR 17.40±0.97 45.50 ±1.54 17.84 ±1.23 Majority FT 16.40±0.73 44.60 ±1.65 18.91±1.37 Ours 22.60±0.97 58.40±2.11 22.50±1.87 LLaMA-3 (Dubey et al., 2024)Base 43.20±2.22 75.00 ±1.94 46.80 ±2.23 Majority 45.80±2.23 76.40 ±1.90 47.20 ±2.23 Debate 48.40±2.24 78.40±1.44 51.60 ±2.23 Majority FT 49.20±2.24 77.20 ±1.87 52.20±2.23 Ours 52.00±2.24 88.60±1.77 57.40±2.21 Table 1: Quantitative results of the proposed method and baselines. Our method outperforms the baselines across all datasets, as indicated by accuracy (%) ±standard error. The highest values are highlighted in red, and the second-highest values are highlighted in blue. All results are reported over 500 fixed evaluation problems, expect GSM results for GPT-3.5 which are reported over 1000 fixed evaluation problems (to construct nonoverlapping confidence bars). 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 Iterations of finetuning0.20.30.40.50.60.70.8Negative Log-Likelihood Diversity Metric: Likelihood LLaMA-3 Multiagent FT (Ours) Single-agent FT 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 Iterations of finetuning0.20.30.40.50.60.70.8Negative Log-Likelihood Mistral Multiagent FT (Ours) Single-agent FT 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 Iterations of finetuning0.100.150.200.250.300.350.40Embedding Dissimilarity Diversity Metric: Embedding Dissimilarity LLaMA-3 Multiagent FT (Ours) Single-agent FT 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 Iterations of finetuning0.200.250.300.350.400.45Embedding Dissimilarity Mistral Multiagent FT (Ours) Single-agent FT Figure 3: Diversity is preserved and can improve across iterations of finetuning. We measure the response diversity of our method and the single-agent finetuning method on the MATH dataset using two diversity measures. The diversity of our method remains consistent over finetuning iterations for one metric and improves for another metric, whereas the diversity of the single-agent method drops significantly. prevents further enhancement. However, finetuning multiple generation and critic agents using our proposed method increases diversity and consistently improves performance. 4 A NALYSIS In this section, we aim to answer the following questions: 1) How important is the proposed multiagent finetuning procedure? 2) Will it increase response diversity? 3) Can the finetuned agent generalize to other datasets in a zero-shot setting? 4.1 A BLATION STUDIES We examine each component of the proposed method, as shown in Table 2. Multiagent FT (Ours) refers to our proposed method with a single round of finetuning, L= 1. Multiagent FT w/o summary removes the summarization step from the multiagent debate. Instead of summarizing, the responses from other agents are directly concatenated and presented to each agent. Summarization helps by eliminating redundant information and retaining the most critical points; therefore, omitting the summarization step can negatively impact performance. 7 Page 8: Published as a conference paper at ICLR 2025 LLM Ablations Arithmetic GSM MATH GPT-3.5 (OpenAI, 2022)Multiagent FT (Ours) 99.62±0.28 85.60±1.67 60.60±2.18 Multiagent FT w/o summary 99.20±0.40 82.20 ±1.72 51.70 ±2.24 Multiagent FT w/o critic 99.20±0.40 83.80±1.65 50.80 ±2.24 Single-agent FT 99.00±0.45 83.60 ±1.66 56.80±2.21 Single-agent FT w/o debate 87.20±1.49 75.00 ±1.93 48.89 ±2.23 Phi-3 (Abdin et al., 2024)Multiagent FT (Ours) 99.40±0.34 88.60±1.42 58.80±2.22 Multiagent FT w/o summary 98.80±0.51 84.40 ±1.68 55.00 ±2.09 Multiagent FT w/o critic 98.20±0.62 86.00 ±1.58 56.60 ±2.22 Single-agent FT 97.40±0.71 86.80±1.51 56.80±2.21 Single-agent FT w/o debate 92.20±1.20 83.60 ±1.66 50.20 ±2.24 Mistral (Jiang et al., 2023)Multiagent FT (Ours) 22.60±1.87 58.40±2.11 22.50±1.87 Multiagent FT w/o summary 21.80±1.84 56.00±1.56 20.20±1.55 Multiagent FT w/o critic 21.00±1.82 54.80 ±1.60 19.01 ±1.59 Single-agent FT 21.20±1.83 55.00 ±2.22 19.21 ±1.69 Single-agent FT w/o debate 17.71±1.70 51.20 ±2.24 17.22 ±1.54 LLaMA-3 (Dubey et al., 2024)Multiagent FT (Ours) 52.00±2.24 88.60±1.77 57.40±2.21 Multiagent FT w/o summary 50.40±2.24 83.20 ±1.67 51.60 ±2.23 Multiagent FT w/o critic 48.60±2.24 82.20 ±1.70 50.50 ±2.23 Single-agent FT 48.00±2.23 84.40±1.62 52.40±2.23 Single-agent FT w/o debate 44.00±2.22 81.60 ±1.73 48.80 ±2.24 Table 2: Ablation results . We examine each component of the proposed method and found that summarization, the combination of critic and generation agents, multiagent finetuning, and multiagent debate all contribute to performance improvement. The accuracy (%) ±standard error is reported. Multiagent FT w/o critic : The critic agents evaluate the outputs from all generation agents and select or synthesize the best responses. Removing the critic agents and only finetuning the Ngeneration agents could hurt performance, as the critic agents play a crucial role of refining the final output. Single-agent FT involves finetuning only a single LLM as covered in Section 2.2 and using it as an agent in multiagent debate. This approach can easily lead to model collapse, where the agent generates similar responses after finetuning, thereby reducing diversity and hurting performance. Therefore, multiagent finetuning is necessary to maintain high performance in reasoning tasks. Single-agent FT w/o Debate further eliminates the debate procedure, with the finetuned LLM generating responses directly. As shown in Du et al. (2023), multiagent debate can significantly boost performance, so removing it could lead to a performance drop. These results indicate that summarization, the combination of critic and generation agents, multiagent finetuning, and multiagent debate all contribute to performance improvement. Our proposed method integrates these components into a single, unified framework, leveraging their combined benefits. 4.2 A GENT RESPONSE DIVERSITY By finetuning multiple agents with distinct roles, our approach enables us to obtain more diverse responses across rounds of finetuning compared to a single agent. Figure 3 illustrates the diversity of generations from our method and single-agent across rounds of finetuning using two metrics of diversity. We cover one metric of diversity, negative log-likelihood, here and cover the other in Section C.4. In our first diversity metric, we aim to characterize specialization by tracking the likelihood of responses of other agents using likelihood calculations of a specific agent. If we are increasing diversity, then the log-likelihood of responses from other agents will decrease across iterations of finetuning. The reasoning used by other agents would be considered less common for the specific agent, indicating a divergence in responses. If accuracy increases while likelihood of responses from other agents decreases, this indicates must specialization. We evaluate the negative log-likelilhood (NLL) of responses from other critic agents using another held-out critic agent and plot this over iterations of finetuning. We do the same with Single-Agent FT, using responses from other agents and evaluate likelihood using a held-out agent. Larger NLL values indicate that the model has assigned low likelihood to a sequence and lower NLL values indicate that the model has assigned higher likelihood to a sequence. We measure this over iterations of finetuning for our method as well as Single-Agent FT. 8 Page 9: Published as a conference paper at ICLR 2025 200.15 0.20 0.25 0.30 0.35 0.40 0.45Diversity Metric: Embedding Dissimlarity 30405060MATH AccuracyPhi-3 MistralSingle-Agent FTMultiagentFT(Ours) Single-Agent FTMultiagentFT(Ours) Figure 4: Relationship between accuracy and diver- sity. We visualize the relationship between embedding dissimilarity and MATH accuracy across rounds of fine- tuning. Our multiagent finetuning preserves diversity across rounds of finetuning while improving accuracy.We compute the diversity across all test exam- ples and present the results in Figure 3. For the “Single-agent FT”, all agents are the same fine- tuned language models, and M= 1. We notice that NLL increases across iterations of finetun- ing for our method, meaning that responses from other critic agents are more diversity accord- ing to our held-out critic agent. Moreover, our responses are more diverse than using Single- Agent FT. This aligns with our previous ob- servation that diverse responses can mitigate model collapse and prevent the model from over- fitting to the finetuning data, leading to better performance. We also include another metric, embedding dissimilarity, as a further compar- ison, finding that responses from our method preserves diversity, where as diversity reduces significantly with Single-agent FT. We provide additional metrics for evaluating diversity in gen- erations in Appendix Section C, and similarly find that multiagent finetuning preserves the final diversity of generations. We further analyze the relationship between diversity and performance and show this in Figure 4. Specifically, we see that an improvement in the diversity of responses correlates positively with an improvement in performance across rounds of finetuning across both Phi-3 and Mistral models. This suggests that in general, increasing the diversity of responses can be helpful for improvement over multiple rounds of fine-tuning. In Appendix Section E, we compare our approach with additional approaches to improve the diversity of samples such as increasing the temperature at which samples are generated, or using unique IDs in a single language to simulate a single agent. We find that our approach outperforms these baselines. 4.3 Z ERO-SHOT GENERALIZATION Base Majority Debate ST aR Majority-FT Ours (zero-shot)0204060GSM Accuracy 44.0049.0051.00 48.00 45.0054.00 Figure 5: Zero-shot generalization of the proposed method . Our method demonstrates zero-shot generalization capabilities. When trained on the MATH dataset, it can effectively generalize to the GSM dataset. It outperforms all the baselines that are trained on the GSM dataset.We investigate the zero-shot generalization of the proposed method across different datasets. Specifically, we use generation and critic agents finetuned on the MATH dataset and evaluate their performance on 100 ran- domly sampled examples from the GSM dataset. We compare our method to baseline methods used in Table 1. These base- lines are trained on the GSM dataset. All methods use Mistral as the base LLM. Figure 5 shows that our method surpasses all the baseline methods, even though it has never seen data from the GSM dataset, indicating the strong zero-shot generalization capability of the proposed method. We show further results in Section H.3. 5 R ELATED WORK Finetuning methods generally fall into three categories: human-in-the-loop, distillation, and self-improvement. We briefly cover the first two categories and spend more time on self-improvement, which is more related to our work. Finetuning with human-in-the-loop and distillation : Several human-in-the-loop methods have been introduced for finetuning, most noticeably RLHF (Christiano et al., 2017; Sun et al., 2023) and 9 Page 10: Published as a conference paper at ICLR 2025 DPO (Rafailov et al., 2024). These methods have been employed as part of instruction tuning (Zhang et al., 2023), improving the generated responses to instructions. Several instruction tuning datasets (Wang et al., 2022; Longpre et al., 2023) have been released publicly, some with human-generated responses. Other datasets have been constructed using the second category of finetuning methods, distillation, whereby a much larger, highly performant LLM is used to generate data that finetunes a smaller LLM (Peng et al., 2023; Liu et al., 2024). These approaches have been used to build recent LLMs such as Alpaca (Taori et al., 2023) or Vicuna (Chiang et al., 2023) using responses generated by GPT-3.5 or GPT-4 (Achiam et al., 2023). Finetuning with self-improvement : Self-improvement methods (Huang et al., 2022; Yu et al., 2023; Yuan et al., 2024; Hsieh et al., 2023; Welleck et al., 2022) improve the performance of LLMs through the finetuning. Common approaches include iterated learning (Anthony et al., 2017; Vani et al.; Polu et al., 2022; Xu et al., 2024) where solution/methods discovered by optimization on prior data are used to uncover further solutions or, in this context, provide additional finetuning data. Some of the main papers we use for comparison finetune using bootstrapping through rationale generation (Zelikman et al., 2022; Lee et al., 2024; Pang et al., 2024; Zhang et al., 2024; Lu et al., 2023) or use self-play/self-training methods through reinforcement learning (Chen et al., 2024b; Yuan et al., 2024; Chen et al., 2024a). Most methods find that using self-generated rationales leads to significant improvement when finetuning. However, these works and many others rely on access to ground truth answer. Overall, existing works often show a plateauing effect with limited boosts in improvement after several rounds of fine-tuning. Our work proposes to use multiagent interaction as an approach to get more consistent gains after multiple rounds of finetuning. Multiagent Interaction : Our work builds on the combination of finetuning and multiagent interaction systems. We primarily incorporate multiagent debate (Du et al., 2023; Chan et al., 2023; Pham et al., 2023; Liang et al., 2023) due to its success in improving factuality and reasoning in LLMs in a variety of tasks at inference time. Several other multiagent interactions could also serve as the basis for this paper. Tree-of-thought (Yao et al., 2024; Long, 2023) and graph-of-thought (Besta et al., 2024) represent two common multiagent interaction systems over LLMs that incorporate responses across multiple LLMs, which improves reasoning. Other works (Wu et al., 2023) have designed more flexible systems for multiagent conversations built on structured program synthesis rather than natural language. Prior work has also focused on incorporating multiagent interaction into domains beyond factuality and reasoning such as strategy and communication games (Abdelnabi et al., 2023). More recently, this has led to multiagent interaction systems over LLMs that have optimized via equilibrium search for factuality and reasoning tasks (Jacob et al., 2023b;a). In contrast to existing works, our work aims to use multiagent interaction as a method to finetune language models. 6 C ONCLUSION AND LIMITATIONS Limitations. In comparison to existing works in single model finetuning, multiagent finetuning is substantially more expensive at both training and inference time as multiple copies of a model need to be trained and run. To run multiagent finetuning experiments on open source models, we used either four H100 GPUs or four A100 GPUs. Models took between 120GB - 240GB of GPU memory and inference took between 12-24 hours across multiple GPUs. To improve the training time of multiagent models, it may be interesting to instead share weights across different instances of models. To improve inference time in multiagent models, we can directly distill the debate procedure into a single modelor use quantization as part of finetuning. Conclusion. In this paper, we have introduced a novel multiagent finetuning framework that sig- nificantly enhances the performance and diversity of language models. By employing a society of agents with distinct roles, our method effectively improves the feedback mechanism and overall output quality, mitigating the limitations inherent in single-agent self-improvement methods. This system allows for autonomous self-improvement through iterative finetuning, leading to substantial performance gains across a comprehensive suite of reasoning tasks. Importantly, our approach is versatile and can be applied to both open-source and proprietary LLMs, ensuring broad utility and impact. Additionally, our method can be integrated with other finetuning approaches such that incorporate human feedback such as RLHF or DPO, which we leave to future work. This work opens new avenues for future research in language model enhancement and sets a foundation for further advancements in the field. 10 Page 11: Published as a conference paper at ICLR 2025 ACKNOWLEDGMENTS This work was supported by the Center for Brains, Minds, and Machines, NSF STC award CCF- 1231216, the NSF award 2124052, the MIT CSAIL Machine Learning Applications Initiative, the MIT-IBM Watson AI Lab, the CBMM-Siemens Graduate Fellowship, the DARPA Mathematics for the DIscovery of ALgorithms and Architectures (DIAL) program, the DARPA Knowledge Management at Scale and Speed (KMASS) program, the DARPA Machine Common Sense (MCS) program, the Air Force Office of Scientific Research (AFOSR) under award number FA9550-21-1- 0399, the United States Air Force Research Laboratory and the Department of the Air Force Artificial Intelligence Accelerator under Cooperative Agreement Number FA8750-19-2-1000, and ONR MURI grant N00014-22-1-2740. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Department of the Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. REFERENCES Sahar Abdelnabi, Amr Gomaa, Sarath Sivaprasad, Lea Sch ¨onherr, and Mario Fritz. Llm-deliberation: Evaluating llms with interactive multi-agent negotiation games. arXiv preprint arXiv:2309.17234 , 2023. 10 Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219 , 2024. 6, 7, 8 Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 , 2023. 10 Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. Advances in neural information processing systems , 30, 2017. 10 Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073 , 2022. 1 Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 38, pp. 17682–17690, 2024. 10 Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901, 2020. 1 Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201 , 2023. 10 Zhipeng Chen, Kun Zhou, Wayne Xin Zhao, Junchen Wan, Fuzheng Zhang, Di Zhang, and Ji-Rong Wen. Improving large language models via fine-grained reinforcement learning with minimum editing constraint. arXiv preprint arXiv:2401.06081 , 2024a. 10 Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335 , 2024b. 1, 10 Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot 11 Page 12: Published as a conference paper at ICLR 2025 impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023) , 2(3):6, 2023. 10 Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems , 30, 2017. 9 Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 , 2021. 5 Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factual- ity and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 , 2023. 2, 3, 5, 6, 8, 10, 15, 20 Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 , 2024. 6, 7, 8, 20, 21, 22 Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874 , 2021. 5 Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301 , 2023. 10 Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610 , 2022. 1, 10 Athul Paul Jacob, Gabriele Farina, and Jacob Andreas. Regularized conventions: Equilibrium computation as a model of pragmatic reasoning. arXiv preprint arXiv:2311.09712 , 2023a. 10 Athul Paul Jacob, Yikang Shen, Gabriele Farina, and Jacob Andreas. The consensus game: Language model generation via equilibrium search. arXiv preprint arXiv:2310.09139 , 2023b. 10 Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825 , 2023. 6, 7, 8 Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipali, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. Llm2llm: Boosting llms with novel iterative data enhancement. arXiv preprint arXiv:2403.15042 , 2024. 10 Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118 , 2023. 10 Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems , 36, 2024. 10 Jieyi Long. Large language model guided tree-of-thought. arXiv preprint arXiv:2305.08291 , 2023. 10 Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. In International Conference on Machine Learning , pp. 22631–22648. PMLR, 2023. 10 Jianqiao Lu, Wanjun Zhong, Wenyong Huang, Yufei Wang, Fei Mi, Baojun Wang, Weichao Wang, Lifeng Shang, and Qun Liu. Self: Language-driven self-evolution for large language model. arXiv preprint arXiv:2310.00533 , 2023. 2, 10 12 Page 13: Published as a conference paper at ICLR 2025 OpenAI. Chatgpt: Optimizing language models for dialogue, December 2022. URL https: //openai.com/blog/chatgpt/ . 6, 7, 8 R OpenAI. Gpt-4 technical report. arXiv , pp. 2303–08774, 2023. 1 Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization. arXiv preprint arXiv:2404.19733 , 2024. 10 Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277 , 2023. 10 Chau Pham, Boyi Liu, Yingxiang Yang, Zhengyu Chen, Tianyi Liu, Jianbo Yuan, Bryan A Plum- mer, Zhaoran Wang, and Hongxia Yang. Let models speak ciphers: Multiagent debate through embeddings. arXiv preprint arXiv:2310.06272 , 2023. 10 Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. Formal mathematics statement curriculum learning. arXiv preprint arXiv:2202.01344 , 2022. 10 Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems , 36, 2024. 10 Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. , 21(140):1–67, 2020. 19 Yuda Song, Hanlin Zhang, Carson Eisenach, Sham Kakade, Dean Foster, and Udaya Ghai. Mind the gap: Examining the self-improvement capabilities of large language models. arXiv preprint arXiv:2412.02674 , 2024. 2 Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525 , 2023. 9 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm. stanford. edu/2023/03/13/alpaca. html, 3(6):7, 2023. 10 Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 , 2023. 1 A Vani, M Schwarzer, Y Lu, E Dhekane, and A Courville. Iterated learning for emergent systematicity in vqa. arxiv 2021. arXiv preprint arXiv:2105.01119 . 10 Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705 , 2022. 10 Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. Generating sequences by learning to self-correct. arXiv preprint arXiv:2211.00053 , 2022. 10 Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155 , 2023. 10 Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116 , 2024. 10 13 Page 14: Published as a conference paper at ICLR 2025 Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems , 36, 2024. 10 Xiao Yu, Baolin Peng, Michel Galley, Jianfeng Gao, and Zhou Yu. Teaching language models to self-improve through interactive demonstrations. arXiv preprint arXiv:2310.13522 , 2023. 1, 10 Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models. arXiv preprint arXiv:2401.10020 , 2024. 1, 10 Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D Goodman. Star: Self-taught reasoner bootstrap- ping reasoning with reasoning. In Proceedings of the 36th International Conference on Neural Information Processing Systems , pp. 15476–15488, 2022. 1, 6, 10 Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792 , 2023. 10 Yunxiang Zhang, Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, and Lu Wang. Small language models need strong verifiers to self-correct rea- soning. arXiv preprint arXiv:2404.17140 , 2024. 10 14 Page 15: Published as a conference paper at ICLR 2025 A A PPENDIX SUMMARY We add additional details for our methods and experiments as well as additional results to provide more evidence of improvements with multiagent finetuning. In Section B, we provide additional details on summarization, inference and training details using multiagent finetuning with debate. In Section C, we cover additional metrics for measuring diversity in agent responses based (1) consensus and (2) KL-divergence (3) likelihood. Both metrics show that diversity is maintained or increases while accuracy increase over rounds of finetuning. In Section D, we introduce a cooperative approach for composing agent responses rather than a competitive approach through multiagent debate. We apply multiagent finetuning with the cooperative approach to analyze whether our method is agnostic to the approach style. We find strong similar improvements when our method is applied to a cooperative approach. In Section E, we include an additional baseline based on Single Agent FT where we increase the sampling temperature applied across all agents. This is a proxy for increasing diversity that is complementary to our method. We find that multiagent finetuning significantly outperforms methods that modify temperature to artificially induce diversity. In Section F, we add an additional experiment where we apply multiagent finetuning to responses across 5 agents instead of 3. We see significant improvements in performance when using additional agents. In Section G, we present a simple mathematical model illustrating how multiagent finetuning can improve diversity. Finally, in Section H, we present additional evaluations of multiagent finetuning across a wide suite of datasets. B M ETHODOLOGY DETAILS B.1 S UMMARIZATION DETAILS As done in Du et al. (2023), we incorporate summarization into the multiagent debate procedure. In summarization, we have an LLM agent take responses from other agents as input and summarize the answers to the responses. During round mof debate, we introduce a summarization agent AS nwhich takes responses from the other N−1agents in the last round, (ym−1 1,···, ym−1 n−1, ym−1 n+1,···ym−1 N) and generates a summary of the responses xs m,n. This summary is sent to the critic agent AC nto generate a new response. B.2 I NFERENCE DETAILS The pseudocode of our method for inference is shown below. . Algorithm 2 Inference Require: A set of finetuned generation agents {ˆAG 1,···,ˆAG N}; A set of finetuned critic agents {ˆAC 1,···,ˆAC N}; A test set of language inputs and ground truth responses Dtask={xi, yi}; The number of agents N; The number of debate rounds M. 1:success ←0 2:forx, yinDtaskdo# Iterate over the input tasks 3: forminMdo# M rounds of debate 4: ifm= 0then 5: y1,1,···y1,N←ˆAG 1(x),···,ˆAG N(x)# Response of each generation agent 6: else 7: xs m,1,···, xs m,N←Summarize the responses from other generator agents 8: ym,1,···, ym,N←ˆAC 1(xs m,1),···,ˆAC N(xs m,N)# Response of each critic agent 9: end if 10: end for 11: ˆy←Majority V oting {yM,1,···, yM,N}# Responses of the final round of debate 12: success ←success +I(ˆy=y) 13:end for 14:Accuracy ←success |D| 15 Page 16: Published as a conference paper at ICLR 2025 B.3 E XPERIMENTAL DETAILS For all open-source models, we perform finetuning using a total of eight 40GB A100 GPUs and four 80GB H100 GPUs. The evaluation of individual inference times for multi-agent finetuning with open-source models took approximately 30 to 36 hours. Phi-3 We ran our results using Phi-3-Mini-128K-Instruct which has 4 billion tunable parameters. We finetune the entire model end-to-end (no LoRA or memory adaptation) on two 40GB A100 GPUs or one 80GB H100 GPU and run a total of two epochs of finetuning for generation agents and one epoch of finetuning for critic agents. We use a batch size of 1 and a learning rate of 5e−6for generation agents and 5e−7for critic agents. When applying multiple iterations of finetuning, we use a learning rate of 5e−7across both generation and critic agents. Models are finetuned with a fixed training set of 500 randomly selected questions (where we do not provide answer annotations for the questions) and then evaluated on a separate test set of 500 randomly selected questions. Mistral We ran our results using Mistral-7B-Instruct-v0.2 , which has 7 billion tunable parameters. We finetune the entire model end-to-end (no LoRA or memory adaptation) on four 40GB A100 GPUs or two 80GB H100 GPUs and run a total of one epoch of finetuning. We use a batch size of 1 and a learning rate of 5e−7for generation agents and 5e−7for critic agents and a weight decay of1e−2. When applying multiple iterations of finetuning, we use a learning rate of 5e−7across both generation and critic agents. Models are finetuned with a fixed training set of 500 randomly selected questions (where we do not provide answer annotations for the questions) and then evaluated on a separate test set of 500 randomly selected questions. LLaMA-3 We ran our using Meta-Llama-3-8B-Instruct , which has 8 billion tunable param- eters. We finetune the entire model end-to-end (no LoRA or memory adaptation) on three 80GB H100 GPUs and run a total of two epochs of finetuning. We use a batch size of 1 and a learning rate of 5e−7for generation agents and 2e−7for critic agents. When applying multiple iterations of finetuning, we use a learning rate of 5e−7across both generation and critic agents as well as a weight decay of 1e−2. Models are finetuned with a fixed training set of 500 randomly selected questions (where we do not provide answer annotations for the questions) and then evaluated on a separate test set of 500 randomly selected questions. GPT-3.5 We ran our results on the gpt-3.5-turbo-0613 model. We use the finetuning API and run a total of two epochs of finetuning, using a batch size of 1 and a learning rate multiplier of 1. Models are finetuned with a fixed training set of 500 randomly selected questions (where we do not provide answer annotations for the questions) and then evaluated on a separate test set of 500 randomly selected questions. C D IVERSITY METRICS We cover different metrics for measuring diversity for both Phi-3 and Mistral to provide an overview of the diversity of our method in comparison to Single Agent FT. C.1 C ONSENSUS We further analyze the diversity of responses from our method to show that diversity is preserved. Rather than using text embeddings, we further measure the consensus among agents as a more interpretable alternative. This is measured as the proportion of agents that have the same final answer in a given round of debate. We take an average of this proportion across all 500 problems used for evaluation. To obtain the mean consensus of our single agent finetuning baseline, we prompt the single-agent finetuned model 3 times, take a majority vote over generated answers, and find the proportion of agents that had a generated answer that was the majority vote. In order to convert this to diversity, we take the difference of the mean consensus value from 1, which represents the average number of agents with a different response from the consensus answer. We measure diversity as the inverse of consensus. Specifically, we consider the agent responses in the final round of debate {yM,1,···, yM,N}that match the majority-voted final response ˆy. The 16 Page 17: Published as a conference paper at ICLR 2025 1 2 3 4 5 Iterations of Finetuning0102030405060Diversity Phi-3 Multiagent FT (Ours) Single-agent FT 1 2 3 4 5 Iterations of Finetuning0102030405060 Mistral Multiagent FT (Ours) Single-agent FT Figure 6: Consensus: Response diversity across finetuning iterations . We measure the response diversity based on agent consensus of our method and the single-agent finetuning method on the MATH dataset. The diversity of our method remains consistent over finetuning iterations, whereas the diversity of the single-agent method drops significantly. 1 2 3 4 5 Iterations of Finetuning0.020.040.060.080.10KL Divergence across agents Phi-3 Multiagent FT (Ours) Single-agent FT 1 2 3 4 5 Iterations of Finetuning0.040.060.080.100.12 Mistral Figure 7: KL-Divergence: Response diversity across finetuning iterations . We measure diversity based on the KL-divergence between the probabilities of the output tokens between agents. Similar to our likelihood measurement, we find that diversity is preserved across rounds of finetuning. consensus is computed as the percentage of responses in {yM,1,···, yM,N}that match ˆy: Consensus =PN n=1I(yM,n= ˆy) N, where Iis the indicator function. Diversity is then given by Diversity = 1−Consensus. We show results in Figure 6. As seen with our prior metric, embedding dissimilarity, we can preserve diversity based on the responses given by the agents, rather than based on the embeddings of a language model. C.2 KL-D IVERGENCE We next measure diversity by computing KL divergence between the probability distributions com- puted based on the final answers from different agents. We estimate the probability distribution of each agent’s response using the likelihoods from Gemma-2 (2B) For each test example, we compute the KL divergence between the responses of any two agents and then average the values from all pairs of agents to determine the overall KL divergence. We see results in Figure 7. Specifically, we see that diversity is preserved using our method whereby KL-divergence is consistently higher than the single-agent finetuning baseline. 17 Page 18: Published as a conference paper at ICLR 2025 1 2 3 4 5 Iterations of Finetuning0.100.150.200.250.30KL-Divergence LLaMA-3: KL-Divergence between finetuned models and base model Single-agent FT Multiagent FT (Generation Agents) Multiagent FT (Critic Agents) Figure 8: KL Diversity between finetuned and unfinetuned LLM. We measure the KL-divergence between likelihoods of responses from finetuned agents and base LLM agents for single-agent finetuning and genera- tion/critic agents from multiagent finetuning. Likelihoods are calculated using Gemma-2 (2B). We find that our method diverges from the base LLM probabilities and furthermore, critic agents have better divergence in responses and our method has better diversity metrics than single-agent FT. 1 2 3 4 5 Iterations of Finetuning0.150.200.250.300.35Embedding dissimilarity across agents Phi-3 Multiagent FT (Ours) Single-agent FT 1 2 3 4 5 Iterations of Finetuning0.200.250.300.350.400.45 Mistral Figure 9: Embedding Dissimilarity: Response diversity across finetuning iterations. We measure the response diversity based on the embedding dissimilarity between the responses of different agents, where embeddings are computed using the T5-3B encoder. We notice that similar to likelihood measurement, we find that diversity is preserved across rounds of finetuning. C.3 KL-D IVERGENCE ACROSS MODELS We further analyze diversity by comparing the KL-divergence of generation and critic agents with the likelihood of responses from the base LLM model across iterations of finetuning. We measure the KL-divergence between each agent responses and responses from a base LLM for 500 MATH examples. We average KL-divergence across all examples for each iteration of finetuning. We apply this measure to agents formed through Single Agent-FT and to generation and critic agents formed through our method. For Single-Agent FT, we find the KL divergence for each finetuned agent and average the KL-divergence across all examples and all agents per iteration of finetuning. For our method, we separate generation and critic agents and find the average KL-divergence for both. We measure likelihoods using Gemma-2 (2B), similar to Figure 7. We show results in Figure 8. We see that critic agents generally have higher KL-divergences from the base LLM and both critic and generation agents have higher KL-divergences across iterations of finetuning. C.4 E MBEDDING DISSIMILARITY Finally, we analyze diversity by measuring the embedding dissimilarity between responses of different agents. Specifically, we consider agent responses in the final round of debate {yM,1,···, yM,N}that match the majority-voted final majority-voted final response ˆy. For each response, we obtain pretrained 18 Page 19: Published as a conference paper at ICLR 2025 LLM Methods Arithmetic GSM MATH GPT-3.5Cooperative (Base) 96.60±0.83 81.80 ±1.73 53.60 ±2.23 Cooperative (FT) 98.80±0.39 84.00±1.64 56.40±2.21 Table 3: Cooperative Finetuning. Our method supports fine-tuning in cooperative settings, where agents work together (e.g., 3 agents, 2 rounds). 1 2 3 4 5 Iterations of finetuning4550556065Accuracy Phi-3 Multiagent FT (Ours) Single-agent FT (T emp = 1.0) Single-agent FT (T emp = 2.0) 1 2 3 4 5 Iterations of finetuning1618202224262830 Mistral Multiagent FT (Ours) Single-agent FT (T emp = 1.0) Single-agent FT (T emp = 2.0) Figure 10: Inducing diversity through increasing temperature . We introduce an additional baseline where we apply the Single-Agent FT baselin with a temperature of 2. By increasing the sampling temperature, we allow the model to generate more diverse responses. We observe that our method out-performs higher temperature settings, which demonstrates that temperature does not increase diversity in a way that is useful for accuracy. contextual word embeddings from a held-out language model, in this case the T5-3B encoder model (Raffel et al., 2020). We feed each agent response to the T5 encoder model to obtain word embeddings and extract the embedding associated with the classification token [CLS] . As done in prior work, we use this embedding as a representation of the sequence. We compare the similarity of the agent responses using cosine similarity of the [CLS] embeddings. Since cosine similarity measures similarity, to obtain a metric for diversity, we take the complement of cosine similarity by subtracting the value from 1. D C OOPERATIVE FINETUNING In this paper, our method mainly builds on a competitive approach for composing agent responses with multiagent debate. Our approach for multiagent finetuning can be applied to both the competitive setting, where critic agents provide feedback to generator agents, and cooperative settings, where agents work together in a ”mixture of experts” style to generate answers. Instead of prompting agents to critique responses from other agents, in the second round of conversation, we prompt agents to cooperate with other agents. We ask each agent to generate a new response by merging their own response with the responses of other agents, using the prompt “Can you derive a new solution by combining your solution with the solutions of other agents?”. Under this cooperative setting, the proposed multi-agent finetuning improves the performance, as demonstrated by Cooperative (FT) outperforming Cooperative (Base). We show results in Table 3. More specifically, we see that we can finetune with a cooperative method with multiagent finetuning and achieve similar improvements in performance. This demonstrates that our method can be applied to other multiagent prompt settings as a general finetuning method for LLMs. E A DDITIONAL COMPARISONS We compare our approach to two additional approaches to improve the diversity of reasoning chains. 19 Page 20: Published as a conference paper at ICLR 2025 LLM Methods Arithmetic GSM MATH GPT-3.5Debate 99.40±0.34 85.40 ±1.58 58.20 ±2.22 Majority FT 99.60±0.28 86.20±1.54 59.00±2.19 Ours 100.00 ±0.00 88.20±1.44 62.80±2.16 Phi-3Debate 97.40±0.71 86.00±1.55 55.20±2.22 Majority FT 95.80±0.90 84.80 ±1.61 53.20 ±2.23 (Ours) 99.80±0.20 89.40±1.38 60.40±2.19 Table 4: More agents of debate . With 5 agents and 2 rounds of debate, our methods still outperform the baselines and show better results than the 3 agents and 2 rounds of debate results presented in Table 1 of the main paper. LLM Methods MATH LLaMA-3 (Dubey et al., 2024)Base 46.80±2.23 Debate 51.60±2.23 Unique ID 50.80±2.24 Ours 57.40±2.21 Table 5: Unique ID vs Multiagent Finetuning. We introduce an additional comparison to multiagent finetuning where we feed a unique ID token to each agent, corresponding to a generation or critic agent. We find that this is not comparable to improvements on multiagent finetuning. E.1 M ODULATING TEMPERATURES We first consider inducing diverse responses from LLM agents by increasing the temperature of generation. We add an additional baseline where we vary the temperature of agents finetuned using Single Agent-FT. Higher temperature values may be a proxy for more diverse responses. We show results over rounds of finetuning in Figure 10. We see that our method surpasses the performance of this baseline. This likely because higher temperature values can reduce accuracy due to increased variability of samples. Our method preserves diversity of responses while increasing accuracy using a more carefully designed finetuning method. E.2 U NIQUE IDFOR AGENTS We next considering an additional comparison to multiagent finetuning that can preserve diversity while reducing the cost of finetuning. The method involves using a unique identifier as part of the prompt fed to each agent. We feed each generation agent an ID given by GEN1 ,GEN2 , etc. Similarly, each critic agent is given an ID CRIT1 ,CRIT2 , etc. Additionally, we provide a short description to the agent, explaining what the ID refers to. For generation agents, we state that the agent is tasked with creating a solution. For critic agents, we state that the agent is tasked with evaluating and improving responses. The ID is presented to the agent at the beginning of each prompt, marked by the string Agent ID: GEN1 (This is a generation agent tasked with creating a solution.) as an example of the ID fed to generation agent 1. We compare the unique ID approach on the same 500 MATH examples reported in Table 1. Results are shown in Table 5. We find that multiagent finetuning performs significantly better and that using unique IDs is fairly similar to debate. This demonstrates that mechanisms for generating solutions and critiquing them is unlocked via finetuning. F A DDITIONAL AGENTS IN DEBATE In Table 4, we show the influence additional agents with finetuning. We use 5 agents and 2 rounds of debate. We find that additional agents improves results as noted in prior work (Du et al., 2023) over 3 agents, 2 rounds of debate. This also implies that our method will scale with larger number of finetuned agents. 20 Page 21: Published as a conference paper at ICLR 2025 LLM Methods MATH LLaMA-3 (Dubey et al., 2024)Base 24.40±1.92 Majority 25.20±1.94 Debate 29.80±2.05 Majority FT 28.00±2.01 Ours 34.20±2.12 Table 6: Additional Evaluation of Multiagent Finetuning on more difficult tasks. Our method outperforms the baselines on more difficult tasks including examples from all levels of MATH. This shows the applicability of our method in more broad settings. G M ATHEMATICAL MODEL OF DIVERSITY OVER ROUNDS OF FINETUNING We consider a simple mathematical model illustrating how diversity can arise by finetuning models only on answers that they are accurate on. Consider a training dataset of problems in three topics, A, B, and C as well as three models we train all initialized from the same base model. For each model, we assign a specialization skill score SA,SB,SCbetween 0 and 1, representing how accurate the model is at answering questions in the specified topic. All three models are initialized to have a skill of 0.33 on each topic. The specialization Sifor each topic icorresponds to the percentage of questions in topic ithe model get accurate, where SAof 0 represents that a model would get 0% of questions in topic A correct. At each iteration, a model is trained on all questions it answers correctly in each topic. This increases the specialization skill score by fraction of training the model saw for each specific topic. Formally, the updated skill of model Aat iteration twould be: St A=St−1 A 1 +St−1 A St−1 A+St−1 B+St−1 C . (2) To account for a finite amount of capacity in each model, after the above skill update, the skills across all models at iteration tare then normalized to have a sum of one. Without loss of generality, assume that at iteration t, St Ais larger than St BandSt C(which happens by random chance, since we have a finite number of questions). Under the update rule described, the ratio St+1 AtoSt Ais given by  1 +St A St A+St B+St C / X i∈{A,B,C } 1 +St i St A+St B+St C St i . (3) Since St Ais greater than or equal to St i, the above expression is greater than or equal to  1 +St A St A+St B+St C / X i∈{A,B,C } 1 +St A St A+St B+St C St i = 1, (4) where we use the identity that the sum of St iis equal to 1 to indicate since they are normalization of the scores. We thus have that St+1 Awill be larger than St A, with specialization on topic A monotonically increasing over iterations of training. Since a priori the model has no preference for any particular topic, random sampling each initial base model will lead to skill preference over a different random topic. This repeated procedure will then eventually result in models specializing in either topic A, B, C, ensuring diversity across models. This mathematical model is similar to the multiagent finetuning procedure in the paper, where we selectively train generators and critics on datasets they are accurate on and illustrate how they can then specialize on different portions of data. H A DDITIONAL EVALUATIONS H.1 L ARGER MATH E VALUATION To further evaluate multiagent finetuning, we evaluate on the MATH dataset across all 5 levels of difficulty, instead of selecting examples from levels 1-3. We extract 500 examples for training and 500 examples for testing and evaluate on LLaMA-3. 21 Page 22: Published as a conference paper at ICLR 2025 1 2 3 4 5 Iterations of finetuning30323436384042Accuracy LLaMA-3 (8B) Multiagent FT (Ours) Single-agent FT Figure 11: Multiple iterations of finetuning over all levels of MATH . We apply multiple iterations of finetuning over 500 examples of MATH sampled from all levels. Even over a more difficult domain, we see significant improvements from multiagent finetuning that continue to self-improve. LLM Methods MMLU LLaMA-3 (Dubey et al., 2024)Base 60.40±2.18 Majority 61.80±2.17 Debate 65.80±2.12 Majority FT 63.40±2.15 Ours 68.80±2.07 Table 7: MMLU Evaluation We introduce an additional evaluation with the MMLU benchmark, finetuning on 500 MMLU examples and testing on 500 different MMLU examples. We find that our method performs better than other baselines. We show results across all baselines in Table 6 and results across multiple rounds of finetuning in Figure 11. We see consistent improvement using LLaMA-3. H.2 MMLU We add an additional comparison with MMLU to further establish thte improvement of our method on a task related to general factuality and reasoning instead of mathematics. We finetune on 500 MMLU examples randomly sampled from all 57 subjects. We then evaluate on a different set of 500 randomly sampled examples. We show results in Table 7. We see that our method can improve performance on a task related to factuality. H.3 Z ERO-SHOT GENERALIZATION EVALUATION We include a larger evaluation of zero-shot evaluation of our method in Figure 12, where we finetune on 500 MATH problems and test on 1000 GSM problems. We find that our method performs significantly better than all other baselines. Furthermore, we test another setting to measure zero-shot performance by finetuning on the arithmetic dataset and evaluating on the GSM dataset. We finetune using 500 arithmetic problems and evaluate each method on 1000 GSM problems. See Figure 13. We find that our method also performs significantly better than all other baselines. 22 Page 23: Published as a conference paper at ICLR 2025 Base Majority Debate ST aR Majority-FT Ours (zero-shot)0204060GSM Accuracy42.1045.6049.7047.8046.1053.00 Figure 12: Testing zero-shot generalization across 1000 GSM problems We test the zero-shot capabilities of our method using models trained on the MATH dataset. We find that over 1000 problems of GSM, our method performs better than all baselines. Base Majority Debate Majority-FT Ours (zero-shot)0204060GSM Accuracy42.1045.6049.7048.8053.30 Figure 13: Zero-shot generalization after arithmetic finetuning . We evaluate the ability of our method to generalize after finetuning Mistral on the arithmetic task and evaluating on GSM. We find that this aids in GSM performance, even more than finetuning with MATH. 23

---