arxiv

Paper 2502.02533

Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies

Authors: Han Zhou, Xingchen Wan, Ruoxi Sun, Hamid Palangi, Shariq Iqbal, Ivan Vulić, Anna Korhonen, Sercan Ö. Arık

Published: 2025-02-04

Abstract:

Large language models, employed as multiple agents that interact and collaborate with each other, have excelled at solving complex tasks. The agents are programmed with prompts that declare their functionality, along with the topologies that orchestrate interactions across agents. Designing prompts and topologies for multi-agent systems (MAS) is inherently complex. To automate the entire design process, we first conduct an in-depth analysis of the design space aiming to understand the factors behind building effective MAS. We reveal that prompts together with topologies play critical roles in enabling more effective MAS design. Based on the insights, we propose Multi-Agent System Search (MASS), a MAS optimization framework that efficiently exploits the complex MAS design space by interleaving its optimization stages, from local to global, from prompts to topologies, over three stages: 1) block-level (local) prompt optimization; 2) workflow topology optimization; 3) workflow-level (global) prompt optimization, where each stage is conditioned on the iteratively optimized prompts/topologies from former stages. We show that MASS-optimized multi-agent systems outperform a spectrum of existing alternatives by a substantial margin. Based on the MASS-found systems, we finally propose design principles behind building effective multi-agent systems.

Paper Content:
Page 1: Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies

Han Zhou1,2,*, Xingchen Wan1, Ruoxi Sun1, Hamid Palangi1, Shariq Iqbal1, Ivan Vulić1,2, Anna Korhonen2 and Sercan Ö. Arık1
1Google, 2University of Cambridge

Corresponding author(s): hzhouml@google.com, soarik@google.com
* This work was done while Han Zhou was a Student Researcher at Google Cloud AI Research.
arXiv:2502.02533v1 [cs.LG] 4 Feb 2025

1. Introduction

Figure 1: The proposed Multi-Agent System Search (MASS) framework discovers effective multi-agent system designs (with both optimized topology and optimized prompts, right) via interleaved prompt optimization and topology optimization in a customizable multi-agent design space (key components illustrated on the left).

Large language models (LLMs) have showcased extraordinary capabilities in understanding, reasoning, and generating coherent responses based on user prompts, revolutionizing a wide range of applications (Kojima et al., 2022; Ouyang et al., 2022). LLM-based agents enhance usability by autonomously handling complex tasks across diverse domains, including code generation and debugging (Jimenez et al., 2023), retrieval-augmented generation (Singh et al., 2025; Wang et al., 2024a), data analysis (Guo et al., 2024; Hu et al., 2024b), and interactive decision-making (Li et al., 2025; Su et al., 2025). These agents are typically programmed with prompts that direct them to interact with the environment, utilize available tools, and approach their objectives over multiple turns (Yao et al., 2023). Beyond individual agents, LLMs can be orchestrated within complex topologies that coordinate multiple agents toward a shared objective. This type of multi-agent system (MAS) typically outperforms its single-agent counterpart by involving more diverse agentic perspectives or role profiles, such as agents as verifiers (Shinn et al., 2024) and multi-agent debate (Qian et al., 2024; Wang et al., 2024b).

However, designing effective MAS for new domains often proves to be challenging.
First, a single agent might suffer from prompt sensitivity (Verma et al., 2024), where simple modifications to the prompt can cause significant and unexpected performance degradation (Liu et al., 2024a; Zhou et al., 2024b). In MAS, when sensitive agents are cascaded, the compounding effect of prompt sensitivity may be amplified. Beyond prompt design, crafting an effective topology can demand a substantial amount of manual experimentation based on trial and error. The problem is exacerbated by the combinatorial search space, which covers not only the unbounded space of prompt designs but also the decisions of which agents to integrate into the topology.

Although recent research has explored automating various aspects of agentic design, there is still a gap in understanding what matters most for improving MAS performance. For example, DSPy (Khattab et al., 2024) automates the design of exemplars for improved prompt programming. Li et al. (2024a) propose to optimize MAS by scaling up the number of agents in majority voting. ADAS (Hu et al., 2024a) programs new topologies expressed in code via an LLM-based meta-agent. AFlow (Zhang et al., 2024b) searches for better topologies using Monte Carlo Tree Search within a set of predefined operators. However, the interplay between multiple design spaces, including prompts and topologies, remains unclear.

In this paper, we first conduct in-depth analyses of common design spaces in MAS, examining the influence of various aspects such as optimizing the prompts, scaling the number of agents, and involving different types of topologies. Our analyses reveal that prompts frequently form an influential design component that yields strong-performing MAS, and that influential topologies represent only a small fraction of the full search space.
Based on these insights, we aim to distill the essence of influential MAS components into a pruned search space, thereby lowering the complexity of the overall search process. We propose Multi-Agent System Search (MASS), a novel multi-stage optimization framework that automates the optimization for MAS over an efficient search space. MASS integrates a plug-and-play prompt optimizer and workflow optimizer over a configurable topology space. It overcomes the complexity of joint optimization on MAS by interleaving the optimization stages, from local to global, from prompts to topologies, over three stages: 1) block-level (local) prompt 'warm-up' for each topology block; 2) workflow topology optimization in a pruned set of topology space; 3) workflow-level (global) prompt optimization given the best-found topology.

By optimizing over the identified influential components, MASS yields optimized MAS that achieve state-of-the-art performance, outperforming existing manually crafted MAS baselines and automatically generated alternatives by a substantial margin across an extensive selection of tasks, including reasoning, multi-hop understanding, and code generation. Based on the strongest MAS found by MASS, we provide further insights and guidelines for building effective MAS. Overall, our contributions can be summarized as follows: 1) we provide an in-depth analysis of the design factors that influence the performance of LLM-based MAS, highlighting the importance of prompts and identifying the influential topologies; 2) we propose MASS, a novel multi-stage optimizer that automates the MAS design by interleaving the optimization of prompts and topologies in an influential search space; 3) MASS shows significant performance improvement on various evaluation benchmarks, delivering guidelines for building effective multi-agent systems for the future.

2. Designing Multi-Agent Systems

In this section, we provide a formulation for designing MAS, followed by analyzing the influence of prompt and topology designs. We refer to the structural arrangements of agents (or equivalently, building blocks) as the topology of agents and define the workflow $\mathcal{W}$ as the logical sequence across different topologies that builds the MAS. The design of a MAS can thus be broadly divided into two levels: block-level design and workflow-level orchestration. At the block level, we aim to design effective individual agents that best perform their intended role with better prompt design. At the workflow level, on the other hand, the optimization involves determining the types and quantities of agents to include and how to arrange them in the most effective way, referred to as topology optimization. Formally, given a search space $\mathcal{A}$ that defines all valid configurations $a$ over the blocks (see Fig. 4), workflow topology optimization can be expressed as the following optimization problem with an objective function $f(\cdot, \cdot)$ on a target input and output set $(x, y) \sim \mathcal{D}$:

$$\mathcal{W}^{*}(a) = \arg\max_{a \sim \mathcal{A}} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ f\big(\mathcal{W}(a(x)), y\big) \right]. \tag{1}$$

In the rest of this section, we provide an in-depth analysis of each component of MAS design.

2.1. Block-level: Prompt Design for Agents

Figure 2: Accuracy vs. the total token count per question for prompt-optimized agents on MATH with Gemini 1.5 Pro, compared to scaling agents with self-consistency (SC), self-refine (reflect), and multi-agent debate (debate) only. The error bars indicate 1 standard deviation. We show that by utilizing more compute, better accuracy can be obtained via more effective prompting. [Plot: x-axis tokens, y-axis accuracy (%); series SC, Reflect, Debate, Prompting, Prompt->SC.]

At the block level, the primary "optimizable component" that significantly influences downstream performance is the prompt, which defines the role of the agent (e.g., "You are an expert in reflecting on errors ..."), provides additional instructions to shape its behavior (e.g., "You should think step by step ...") and, optionally, contains few-shot demonstrations (in-context examples) to guide the agent's responses (Wan et al., 2024, 2025). For instance, a state-of-the-art prompt optimizer searches both instructions and few-shot demonstrations, where demonstrations are bootstrapped from the model's own correct predictions on the validation set based on a validation metric. Conditioned on the demonstrations, the prompt optimizer then proposes a few candidates for the instruction with a dataset summary or various hints to improve candidate diversity (Opsahl-Ong et al., 2024). The instructions and demonstrations are then jointly optimized.

Although it is well known that LLMs are sensitive to prompts (Verma et al., 2024; Zhou et al., 2024a), applying automatic prompt optimization (APO) techniques to MAS is rather non-trivial. In single-turn tasks, APO can be easily performed by treating prompts as optimizable variables and performance over a validation set as the target.
In MAS, APO becomes more complex due to the interdependence across agents (e.g., the output of one agent may be the input of another agent in a cascade, with ground-truth responses for intermediate outputs not being available) and the exponentially increasing complexity of combinatorial optimization as the number of agents $n$ grows. The reward signals also become sparser as $n$ increases, preventing us from implementing APO directly on MAS within any manageable budget. As such, many prior works (Xia et al., 2024; Zhang et al., 2024f) in MAS still primarily use handcrafted prompts instead of including the prompts as optimizable components in the MAS design.

To systematically understand the influence of prompt design in MAS, we quantitatively analyze the effect of prompt optimization and compare its effectiveness to other operations common in the MAS literature, such as scaling with more agents but with default prompts. We conduct APO on a chain-of-thought (Kojima et al., 2022) agent with both instruction optimization and 1-shot exemplar optimization via MIPRO (Opsahl-Ong et al., 2024), and fairly compare the total inference token cost with self-consistency (Kojima et al., 2022), self-refine (Madaan et al., 2024), and multi-agent debate (Du et al., 2024), where the specifications are provided in App. §B. In Fig. 2, prompting, which equips agents with more informative instructions and exemplars, demonstrates significant advantages in token-effectiveness over other building blocks. Furthermore, by applying self-consistency on top of the prompt-optimized agent, we observe improved scaling of performance with token cost, whereas standard approaches to scaling the number of agents (e.g., SC or Reflect) saturate much earlier.
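As an illustration of the "Prompt->SC" setting, where self-consistency is applied on top of a single prompt-optimized agent, the following is a minimal sketch. The `sample_answer` callable stands in for one stochastic call to the agent and is a hypothetical placeholder, as is the canned stub used in the demo; the paper does not specify this interface.

```python
from collections import Counter
from typing import Callable, List


def self_consistency(sample_answer: Callable[[str], str], question: str, n_samples: int) -> str:
    """Draw n_samples answers for one question and return the most frequent one."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers: List[str] = [sample_answer(question) for _ in range(n_samples)]
    # Counter.most_common orders equal counts by first occurrence,
    # so ties resolve to the answer that was sampled first.
    return Counter(answers).most_common(1)[0][0]


# Demo with a hypothetical stub agent that replays canned answers.
_canned = iter(["12", "15", "12", "9", "12"])
demo_answer = self_consistency(lambda q: next(_canned), "What is 7 + 5?", n_samples=5)
```

The token cost of this block grows linearly with `n_samples`, which is why the comparison above is made against total inference tokens rather than against the number of agents alone.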
This empirical observation sheds light on the importance of prompting while providing early evidence for designing effective MAS: optimize agents locally before scaling their topology.

2.2. Workflow-level Search Space Design

Figure 3: The performance of different topologies with Gemini 1.5 Pro compared to the base agent, with each topology being optimized with APO, where Sum. (Summarize) and Exe. (Executor) are task-specific topologies as illustrated in Fig. 4. We observe that not all topologies have a positive influence on the MAS design. [Panels: HotpotQA and LiveCodeBench; x-axis method (CoT, SC, Ref., Deb., and Sum. or Exe.); y-axis performance (%), annotated with the change relative to the base agent.]

At the workflow level, the primary focus is on orchestrating agents to achieve the best performance effectively. As a relatively new concept specific to MAS, topology optimization has recently garnered significant attention (Li et al., 2024c; Zhang et al., 2024b). However, while much of the existing research emphasizes search methods (such as discovering the most efficient and effective way to identify the optimal configuration), there has been less focus on the design of search spaces, which determines the perimeter and the scope of any search algorithm. This imbalance parallels the historical development of neural architecture search (NAS) (White et al., 2023). Initially, the field concentrated on sophisticated search methods, such as Bayesian optimization (Kandasamy et al., 2018; Ru et al., 2021) and differentiable search (Liu et al., 2018). Follow-up works have highlighted the often-overlooked importance of search space design, arguing that it can be equally, if not more, critical (Wan et al., 2022; Zhou et al., 2023, 2024c).
Figure 4: Illustration of the MASS framework with its search space and the multi-stage optimization. The search space combines both prompts (Instruction, Demo) and configurable agentic building blocks (Aggregate, Reflect, Debate, Summarize, and Tool-use). 1) Block-level Prompt Optimization: we conduct block-level prompt optimization for each agentic module individually (denoted by </>); 2) Workflow Topology Optimization: conditioned on the best prompts found in Stage 1 on each agent block, MASS samples valid configurations from an influence-weighted design space while fusing the prompts of each building block from Stage 1; 3) Workflow-level Prompt Optimization: conditioned on the best workflow found in Stage 2, we again conduct workflow-level prompt optimization on the best-found MAS (topologies visualized for illustration only).

Inspired by this insight, we hypothesize that manually crafted topologies might be sub-optimal, and that automatic topology optimization (potentially framed as a rigorous optimization problem) can play a similarly pivotal role through a judiciously designed search space for MAS. To this end, we first define an expressive search space, similar to prior works, that consists of the connections between the following building blocks:

• Aggregate: Agents can collaborate in parallel with diversified predictions, followed by an aggregation operator that obtains the most consistent prediction. The aggregate block can be parameterized by $N_a$ agents acting in parallel. Majority vote (Li et al., 2024a) and self-consistency (Chen et al., 2024c) sit within this topology.
• Reflect: Agents can act as verifiers, providing critiques and improvement suggestions based on former predictions. The feedback is then fed into the predictor or the reflector itself for iterative improvement. Similarly, reflect can be parameterized by $N_r$, which defines the number of rounds of self-reflection. Self-refine (Madaan et al., 2024) and Reflexion (Shinn et al., 2024) represent this block.
• Debate: Agents in debate can elicit more truthful predictions than single-agent prediction (Du et al., 2024; Liang et al., 2024), where each debating agent collects opinions from all other agents and provides an updated response. This topology involves a mixture of agents, and $N_d$ defines the number of rounds of debate.
• Custom Agents: While the former three forms of agents represent the vast majority of agent topologies constructed as multiple parallel, serial, and mixture of agents, more versatile definitions of agents can be inserted into the MAS design space. For example, for task-specific use cases, we introduce an agent as summarize to improve the long-context capability in the customizable design space.
• Tool-use: Building towards an effective MAS, enabling agents to leverage tools to access external information is critical for system performance, such as using a retriever for RAG (Lewis et al., 2020) and an executor with test cases in coding (Chen et al., 2024d). We introduce tool-use as an optimizable binary 'insertion' decision $N_T \in \{0, 1\}$.

To understand the influence of individual topologies, we report the performance of various topologies in Fig. 3. Not all topologies are beneficial to MAS design, and positively influential topologies represent only a small fraction of the overall set: in HotpotQA (Yang et al., 2018), only debate brings a 3% gain, while others fail to improve or even degrade system performance. We observe similar trends in the test-output-prediction subtask of LiveCodeBench (Jain et al., 2024). This highlights the importance of searching in the influential subset of the search space, whereas including detrimental building blocks may not only result in higher search complexity but also degrade performance.

3. MASS: Multi-Agent System Search

Our analyses in Sec. 2 underscore the importance of well-designed prompts for individual agents and the careful definition of the search space to achieve effective MAS performance. Building on these, we propose a multi-stage optimization algorithm, Multi-Agent System Search (MASS), that surpasses prior art that focused solely on optimizing workflow topology without appropriate prompt designs. Instead, our approach demonstrates the greater effectiveness of MAS design with properly optimized prompts and thoughtfully designed search spaces. The MASS framework is illustrated in Algorithm 1 and Fig. 4, following an intuition from local to global, from block-level to workflow-level, that conquers the complexity of combinatorial optimization with effective per-stage optimization, detailed below.
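The building blocks of Sec. 2.2 and their parameters ($N_a$, $N_r$, $N_d$, $N_T$) can be pictured as one small configuration record per candidate topology. The sketch below is illustrative only: the class name, field names, and defaults are hypothetical, and treating aggregate as active only when more than one agent runs in parallel is an assumption rather than something stated in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TopologyConfig:
    """One point in the topology search space (field names are illustrative)."""

    n_aggregate: int = 1     # N_a: agents acting in parallel before aggregation
    n_reflect: int = 0       # N_r: rounds of self-reflection
    n_debate: int = 0        # N_d: rounds of debate
    summarize: bool = False  # task-specific custom agent
    tool_use: int = 0        # N_T in {0, 1}: binary insertion decision

    def __post_init__(self) -> None:
        if self.n_aggregate < 1:
            raise ValueError("n_aggregate must be >= 1")
        if self.n_reflect < 0 or self.n_debate < 0:
            raise ValueError("round counts must be non-negative")
        if self.tool_use not in (0, 1):
            raise ValueError("tool_use must be 0 or 1")

    def active_blocks(self) -> List[str]:
        """Names of the blocks this configuration switches on."""
        blocks: List[str] = []
        if self.summarize:
            blocks.append("summarize")
        if self.n_reflect > 0:
            blocks.append("reflect")
        if self.n_debate > 0:
            blocks.append("debate")
        # Assumption: a single predictor needs no aggregation step.
        if self.n_aggregate > 1:
            blocks.append("aggregate")
        if self.tool_use == 1:
            blocks.append("tool-use")
        return blocks


demo_config = TopologyConfig(n_debate=2, tool_use=1)
```

Keeping the record immutable makes each sampled configuration safe to use as a dictionary key when storing validation results.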
Algorithm 1 MASS: Multi-Agent System Search

Input: Agentic modules in the search space $a_i \in \mathcal{A}$, workflow of agents $\mathcal{W}(a)$, prompt optimizer $\mathcal{O}$, evaluator $\mathcal{E}$, validation set $\mathcal{D}$, temperature $t$, number of candidates $N$, budget $B$.
Output: Optimized multi-agent system $\mathcal{W}^{*}$.
[Block-level Prompt Optimization]
Prompt optimization for the initial agent: $a_0^{*} \leftarrow \mathcal{O}_{\mathcal{D}}(a_0)$.
for $a_i$ in $\mathcal{A} \setminus \{a_0\}$ do
    Local prompt optimization for each building block in the design space: $a_i^{*} \leftarrow \mathcal{O}_{\mathcal{D}}(a_i \mid a_0^{*})$.
    Obtain incremental influence $I_{a_i} \leftarrow \mathcal{E}(a_i^{*}) / \mathcal{E}(a_0^{*})$.
end for
[Workflow Topology Optimization]
Obtain the selection probability $p_a \leftarrow \mathrm{Softmax}(I_a, t)$.
while $n < N$ do
    Reject invalid configurations $c$ and cap a budget $B$. The design space is pruned by the selection probability $p_a$; $\mathcal{W}_c \leftarrow (a_i^{*}(\cdot), a_{i+1}^{*}(\cdot), \ldots)$ with optimized prompts.
    Store evaluations $\mathcal{E}_{\mathcal{D}}(\mathcal{W}_c)$ and propose new $\mathcal{W}$.
end while
Obtain the best-performing $\mathcal{W}_c^{*} \leftarrow \arg\max_{c \in \mathcal{C}} \mathcal{E}_{\mathcal{D}}(\mathcal{W}_c)$.
[Workflow-level Prompt Optimization]
Workflow-level prompt optimization for the best-performing topology: $\mathcal{W}^{*} \leftarrow \mathcal{O}_{\mathcal{D}}(\mathcal{W}_c^{*})$.
Return optimized multi-agent system $\mathcal{W}^{*}$.

1) Block-level prompt optimization. Before composing agents, we first ensure that individual agents are thoroughly optimized at the block level, as highlighted in Sec. 2.1 and Fig. 2. This step ensures that each agent is primed for its role with the most effective instructions in the most manageable computation budget. To further overcome the complexity of joint optimization on a large MAS space, we first warm up the initial predictor with single-agent APO, $a_0^{*} \leftarrow \mathcal{O}_{\mathcal{D}}(a_0)$, where both instruction and exemplars are jointly optimized with the modular prompt optimizer $\mathcal{O}$.
Then, conditioned on the warmed-up predictor, we continue optimizing each topology with a minimum number of agents, $a_i^{*} \leftarrow \mathcal{O}_{\mathcal{D}}(a_i \mid a_0^{*})$. For example, 2 predictors paired with 1 debator form the minimum building block for the debate topology, thereby lowering the complexity of optimization, and this topology can be scaled up later with more predictors and debators, all equipped with optimized prompts. To measure the influence of each building block, we store the validation performance once the optimization is completed. Importantly, although Stage (1) serves as the warm-up stage per building block, it is still a critical stage that guarantees that the follow-up topology optimization searches in an effective space, composing well-performing agents instead of suffering from the compounding impact of any ill-formed agents with manual prompts.

2) Workflow topology optimization. In this stage, we focus on optimizing the overall MAS structure, determining the most effective arrangement and connectivity between agents. The analysis in Fig. 3 shows that beneficial topologies represent only a small fraction of the full design space. Therefore, we aim to distill the essence of strong-performing topologies into a pruned space, thereby making the workflow-level topology search more efficient. Here, we propose to measure the incremental influence $I_{a_i} = \mathcal{E}(a_i^{*}) / \mathcal{E}(a_0^{*})$, which quantifies the relative gain of integrating the topology $a_i$ over the initial agent $a_0$. Following the intuition that an influential dimension comes with a higher selection probability, we activate the corresponding topology dimension $a$ if $u > p_a$, given $u \sim \mathcal{U}(0, 1)$ and $p_a = \mathrm{Softmax}(I_a, t)$. To compose diverse topologies into a unified space, we constrain the workflow with a rule-based order to reduce the optimization complexity, following a predefined sequence: [summarize, reflect, debate, aggregate].
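The influence-weighted sampling described above can be sketched in a few lines. Two points are assumptions rather than statements from the text: the temperature is applied as $\exp(I_a / t)$ normalized over all dimensions, and a dimension is activated when $u < p_a$ so that a larger $p_a$ means more frequent activation (the text writes the comparison as $u > p_a$). The function names and the example scores are hypothetical.

```python
import math
import random
from typing import Dict, List

# Rule-based order of blocks within a workflow, as given in the text.
ORDER = ["summarize", "reflect", "debate", "aggregate"]


def selection_probabilities(scores: Dict[str, float], base_score: float, t: float) -> Dict[str, float]:
    """Softmax over incremental influences I_a = E(a*) / E(a0*) with temperature t."""
    influence = {a: s / base_score for a, s in scores.items()}
    top = max(influence.values())  # subtract the max for numerical stability
    weights = {a: math.exp((i - top) / t) for a, i in influence.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def sample_active(p: Dict[str, float], rng: random.Random) -> List[str]:
    """Draw u ~ U(0, 1) per dimension and return the activated blocks in ORDER."""
    # Assumption: activate when u < p_a, so higher probability activates more often.
    return [a for a in ORDER if a in p and rng.random() < p[a]]


# Demo with hypothetical validation scores (not results from the paper).
demo_p = selection_probabilities({"debate": 0.66, "reflect": 0.62, "aggregate": 0.65}, base_score=0.64, t=0.05)
demo_active = sample_active(demo_p, random.Random(0))
```

With a small temperature such as 0.05, small differences in influence translate into large differences in selection probability, which is the sharpening effect the softmax temperature is used for.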
We integrate rejection sampling over the pre-defined design space that rejects any deactivated dimension or any invalid topology composition exceeding a maximum budget $B$ on the number of agents. We refer to App. §B for the detailed search space per task.

3) Workflow-level prompt optimization. As a final step, we treat the entire MAS design as an integrated entity and run an additional round of prompt optimization, conditioned on the best topology discovered in Stage (2), $\mathcal{W}^{*} = \mathcal{O}_{\mathcal{D}}(\mathcal{W}_c^{*})$. It is worth noting that although prompts were optimized at the individual level in Stage (1), this stage acts as an adaptation or fine-tuning process, ensuring that prompts are tailored for orchestration within the MAS and that the interdependence between agents is optimized appropriately. Our experiments (Figs. 5 and 6) demonstrate that this stage often yields practical benefits.

4. Related Work

Forms of LLM-based agentic systems. The simplest form of an LLM-based agentic system involves a single agent that can dynamically interact and respond to the environment (Yao et al., 2023). Recent advances endow agents with diverse roles and tools (Wu et al., 2023), orchestrating multiple agents to cooperate with each other (Chen et al., 2024b). Standard forms of agent cooperation (i.e., topology) often involve parallel and serial flows of information. The parallel form usually diversifies the exploration among many agents in parallel (Li et al., 2024a), and self-consistency (SC) (Wang et al., 2023) is a representative way of scaling agents in parallel. The serial form aims to advance the exploitation of a task via a chain of agents, where LLMs can serve as reflective agents to self-justify and refine former predictions (Madaan et al., 2024; Shinn et al., 2024).
Later, the opinions from multiple agents can be summarized to retrieve the most consistent answer by an aggregation agent (Chen et al., 2024c; Lin et al., 2024). Moreover, multi-agent debate consists of a more complex flow of information (Chen et al., 2024a; Wang et al., 2024c; Zhang et al., 2024c), and recent research shows that debating can elicit more truthful predictions (Du et al., 2024; Khan et al., 2024). Recent agent topologies extend beyond the above connections (Qian et al., 2024; Wang et al., 2024b), and MASS can automatically search for the best topology among the aforementioned spaces.

Automatic optimization for MAS. Recent research has started automating agent design by interpreting agent functions as learnable policies (Zhang et al., 2024d,e) and synthesizing trajectories for agent fine-tuning (Qiao et al., 2024). Going beyond a single agent, automatic multi-agent optimization faces a higher level of complexity, thereby requiring a more sophisticated design of search space and algorithms. Among recent advances in multi-agent optimization, the optimization space has spanned prompts (Khattab et al., 2024), tools (Zhou et al., 2024d), workflows (Li et al., 2024c), and thinking strategies (Shang et al., 2024). Closer to our topology search space, DyLAN (Liu et al., 2024b) dynamically activates the composition of agents, and Archon (Saad-Falcon et al., 2024) frames MAS as a hyperparameter optimization problem. Neither of them has taken the important prompt space into account, whereas we demonstrated the importance of prompt optimization in Sec. 2.2. In addition, GPTSwarm (Zhuge et al., 2024) optimizes the connections between agentic nodes using a policy gradient algorithm. State-of-the-art automatic agent design methods, ADAS (Hu et al., 2024a) and AFlow (Zhang et al., 2024b), also attempt to optimize agentic workflows with advanced search algorithms and LLMs as optimizers.
However, we observe that the importance of proper prompt designs has been relatively under-studied in these prior works.

5. Experiments

Models and evaluation data. Aside from the common benchmarks used for automating MAS (Hu et al., 2024a; Zhang et al., 2024b), we conduct experiments on an extensive collection of tasks: 1) Hendrycks' MATH (Hendrycks et al., 2021) and DROP (Dua et al., 2019) for reasoning; 2) HotpotQA (Yang et al., 2018), MuSiQue (Trivedi et al., 2022), and 2WikiMultiHopQA (Ho et al., 2020) from LongBench (Bai et al., 2024) for long-context understanding; 3) MBPP (Austin et al., 2021), HumanEval (Chen et al., 2021), and LiveCodeBench (LCB) 'test output prediction' (Jain et al., 2024) for coding. We refer to App. §B & §D for details on data splits and prompt templates. We run all experiments primarily on two Gemini 1.5 model sizes (Reid et al., 2024) (gemini-1.5-pro-002 and gemini-1.5-flash-002) and further validate key findings on Claude 3.5 Sonnet (@20240620) (Anthropic, 2024).

Table 1: Results on the evaluation set with Gemini 1.5 Pro and Gemini 1.5 Flash. We report the mean and standard deviation (mean ± std) for all results with 3 runs of evaluations. We report the accuracy (%) for MATH and the test-output-prediction subtask of LiveCodeBench (LCB), F1 score for DROP, HotpotQA, MuSiQue, and 2WikiMQA, and pass@1 for MBPP and HumanEval. We note that the meta-prompt of AFlow* only works properly with Claude 3.5 Sonnet. Therefore, we reproduce AFlow with Gemini 1.5 Pro as the executor and Claude 3.5 Sonnet as the optimizer, where * indicates the results are only for reference. The number of agents in inference for all methods is below 10. Task groups: Reasoning (MATH, DROP); Multi-hop Long-context (HotpotQA, MuSiQue, 2WikiMQA); Coding (MBPP, HumanEval, LCB).

Gemini-1.5-pro-002
Method | MATH | DROP | HotpotQA | MuSiQue | 2WikiMQA | MBPP | HumanEval | LCB | Avg.
CoT | 71.67 ± 3.30 | 70.59 ± 1.67 | 57.43 ± 0.52 | 37.81 ± 1.43 | 63.39 ± 1.12 | 68.33 ± 0.47 | 86.67 ± 0.94 | 66.33 ± 0.62 | 65.28
Self-Consistency | 77.33 ± 1.25 | 74.06 ± 0.90 | 58.60 ± 2.19 | 41.81 ± 1.00 | 67.79 ± 1.19 | 69.50 ± 0.71 | 86.00 ± 0.82 | 70.33 ± 0.94 | 68.18
Self-Refine | 79.67 ± 2.36 | 71.03 ± 1.31 | 60.62 ± 3.33 | 42.15 ± 1.34 | 66.74 ± 2.43 | 63.67 ± 0.24 | 84.00 ± 1.63 | 67.33 ± 1.31 | 66.90
Multi-Agent Debate | 78.67 ± 0.94 | 71.78 ± 0.71 | 64.87 ± 0.23 | 46.00 ± 0.80 | 71.78 ± 0.63 | 68.67 ± 0.85 | 86.67 ± 1.25 | 73.67 ± 1.65 | 70.26
ADAS | 80.00 ± 0.82 | 72.96 ± 0.90 | 65.88 ± 1.29 | 41.95 ± 1.24 | 71.14 ± 0.66 | 73.00 ± 1.08 | 87.67 ± 1.70 | 65.17 ± 1.25 | 69.72
AFlow* | 76.00 ± 0.82 | 88.92 ± 0.63 | 68.62 ± 0.47 | 32.05 ± 1.29 | 76.51 ± 1.05 | - | 88.00 ± 0.00 | - | -
MASS (Ours) | 84.67 ± 0.47 | 90.52 ± 0.64 | 69.91 ± 1.11 | 51.40 ± 0.42 | 73.34 ± 0.67 | 86.50 ± 0.41 | 91.67 ± 0.47 | 82.33 ± 0.85 | 78.79

Gemini-1.5-flash-002
Method | MATH | DROP | HotpotQA | MuSiQue | 2WikiMQA | MBPP | HumanEval | LCB | Avg.
CoT | 66.67 ± 2.36 | 71.79 ± 0.69 | 57.82 ± 1.10 | 37.10 ± 1.35 | 63.40 ± 0.68 | 63.33 ± 1.25 | 75.67 ± 1.89 | 51.17 ± 0.24 | 60.87
Self-Consistency | 69.33 ± 1.25 | 73.42 ± 0.19 | 60.19 ± 1.01 | 41.94 ± 0.93 | 67.98 ± 0.72 | 63.67 ± 0.62 | 77.67 ± 1.89 | 53.83 ± 1.18 | 63.50
Self-Refine | 71.33 ± 0.94 | 73.71 ± 1.09 | 58.84 ± 3.04 | 41.21 ± 1.99 | 65.56 ± 1.57 | 63.33 ± 1.25 | 81.67 ± 1.89 | 52.00 ± 1.41 | 63.46
Multi-Agent Debate | 71.67 ± 0.94 | 74.79 ± 0.87 | 64.17 ± 1.69 | 46.27 ± 1.33 | 72.19 ± 0.54 | 63.00 ± 0.71 | 79.67 ± 1.25 | 55.50 ± 0.41 | 65.91
ADAS | 68.00 ± 1.41 | 75.95 ± 1.18 | 61.36 ± 2.89 | 48.81 ± 1.03 | 66.90 ± 1.00 | 65.83 ± 0.24 | 80.67 ± 2.49 | 50.50 ± 1.63 | 64.75
MASS (Ours) | 81.00 ± 2.45 | 91.68 ± 0.14 | 66.53 ± 0.38 | 43.67 ± 1.21 | 76.69 ± 0.50 | 78.00 ± 0.82 | 84.67 ± 0.47 | 72.17 ± 0.85 | 74.30

Baselines.
We consider the following baselines: 1) CoT (Kojima et al., 2022): direct chain-of-thought reasoning via zero-shot prompting; 2) CoT-SC (Wang et al., 2023): self-consistency, which selects the most consistent answer from diversified reasoning traces; 3) Self-Refine (Madaan et al., 2024; Shinn et al., 2024): reflective agents that verify and self-refine predictions; 4) Multi-Agent Debate (Du et al., 2024; Liang et al., 2024): agents that justify their answers and aggregate information from other agents; 5) ADAS (Hu et al., 2024a): an automatic agent design framework in which an LLM-based meta-agent iteratively proposes new agents based on former evaluations; 6) AFlow (Zhang et al., 2024b): automatic workflow design via Monte Carlo Tree Search over a set of pre-defined operators. For a fair comparison, we limit the maximum number of agents to 10 for all baselines. We refer to App. §B for all specifications.

Setup. Mass integrates the state-of-the-art prompt optimizer, MIPRO (Opsahl-Ong et al., 2024), which optimizes both instructions and demonstrations for each agent via a Bayesian surrogate model. We limit the number of bootstrapped demonstrations to 3 and instruction candidates to 10 per agent, over 10 rounds. In topology optimization for all tasks, we search for 10 different topologies via rejection sampling. During topology optimization, each topology is evaluated on the validation set 3 times to stabilize the prediction. The optimized MAS is then reported on the held-out test set over three runs. We set the model temperature T at 0.7, the maximum output tokens at 4096, and the t in Softmax at 0.05 for sharpening the selection probability p_a for each search dimension. We use the same LLM backbone as both evaluator and optimizer in all phases.

[Figure 5: bar charts. Left panel: x-axis Stage (CoT, APO, 1PO, 2TO, 3PO), y-axis Performance (%), bar labels 63.5, 68.0, 74.2, 77.3, 78.8. Right panel: Ablation, with bars including w/o PO and w/o Prune.]
Figure 5 | Left: average performance per optimization stage of Mass over 8 evaluation tasks on Gemini 1.5 Pro. We compare Mass with a single-agent (CoT) starting point as the reference and an APO baseline that optimizes the single agent with MIPROv2 (Opsahl-Ong et al., 2024). Refer to App. §C for the detailed ablation per task. Right: a comparative ablation study on topology optimization (2TO) without pruning and without the former stage of prompt optimization (1PO), evaluated on HotpotQA.

[Figure 6: line plot. x-axis: Round (0 to 80); y-axis: F1 (%); series: ADAS, AFlow, MASS (Ours); annotations: Better Prompt, 1PO: Aggregate, 1PO: Debate, 2TO: More parallel agents involved, 3PO: Workflow-level PO.]
Figure 6 | The optimization trajectories of Mass compared to automatic agent design baselines per validation round on DROP. We note that, as a distinct advantage of Mass, the optimization within stages (1) & (2) of Mass can be completely parallelized, whereas ADAS and AFlow are iterative algorithms that have to wait to propose new agents until finishing earlier trajectories.

Main results. We present the main results of Mass compared to the baselines on the evaluation set in Table 1. Mass yields substantial gains over common forms of multi-agent systems (e.g., self-consistency, self-refine, and multi-agent debate) that scale up without optimizing prompts for agents in collaboration. Mass leads to high-performing MAS: 78.8% and 74.3% on average on Gemini 1.5 Pro and Flash, respectively, and we observe consistent improvements on Claude 3.5 Sonnet, as reported in Table 4. Comparing Mass with the state-of-the-art automatic agent design baselines, ADAS and AFlow, we first notice that ADAS brings only subtle gains, even though its meta-agent generation is already conditioned on the common forms of agents. The meta-agent keeps proposing complex topologies without optimizing the prompt design. AFlow, on the other hand, is competitive with Mass, especially on 2WikiMQA and HumanEval.
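Returning to the search setup described above, the role of the sharpening temperature and of rejection sampling can be illustrated with a minimal Python sketch. It is not the authors' implementation: the influence scores, the per-dimension acceptance rule, and the `count_agents` cost model are hypothetical assumptions made for illustration. Only t = 0.05, the 10 sampled topologies, the 10-agent limit, and the per-dimension search spaces (Table 3) are taken from the text.

```python
import math
import random

# Per-dimension search spaces as listed in Table 3 (MATH/DROP-style task).
SEARCH_SPACE = {
    "aggregate": [1, 3, 5, 7, 9],
    "reflect": [0, 1, 2, 3, 4],
    "debate": [0, 1, 2, 3, 4],
}
# Value used when a dimension is not selected (smallest entry of each space).
DEFAULT = {name: values[0] for name, values in SEARCH_SPACE.items()}


def softmax(scores, t):
    """Temperature-scaled softmax; a small t sharpens the distribution."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp((v - peak) / t) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def count_agents(cfg):
    """Hypothetical cost model: sum of the selected values of all dimensions."""
    return cfg["aggregate"] + cfg["reflect"] + cfg["debate"]


def sample_topologies(influence, n=10, budget=10, t=0.05, max_tries=10_000, seed=0):
    """Draw up to n distinct in-budget configurations by rejection sampling."""
    rng = random.Random(seed)
    p = softmax(influence, t)
    found = []
    for _ in range(max_tries):
        cfg = dict(DEFAULT)
        for name, values in SEARCH_SPACE.items():
            # Assumed rule: activate a dimension with probability p_a.
            if rng.random() <= p[name]:
                cfg[name] = rng.choice(values)
        # Reject over-budget or duplicate configurations.
        if count_agents(cfg) <= budget and cfg not in found:
            found.append(cfg)
        if len(found) == n:
            break
    return found


# Hypothetical influence scores, one per search dimension.
influence = {"aggregate": 1.2, "reflect": 1.0, "debate": 1.1}
sharp = softmax(influence, t=0.05)
flat = softmax(influence, t=1.0)
topologies = sample_topologies(influence)
```

With these hypothetical scores, t = 0.05 concentrates most of the probability on the highest-scoring dimension, whereas t = 1 leaves the three dimensions close to uniform.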
We attribute the performance of AFlow to: 1) its ‘expansion’ phase, which generates new nodes based on an error log that contrasts the predictions with the ground truth and thereby provides implicit textual gradients (Pryzant et al., 2023) for reflecting on any formatting errors in prompt design; 2) a more refined search space within a pre-defined set of operators. Though AFlow, like Mass, draws on the importance of search space design, it still lacks a prompt optimization phase to properly optimize its pre-defined operators, which results in under-performing MAS search results on MATH and MuSiQue. In contrast to these baselines, the consistent improvements brought by Mass highlight the importance of searching in both the prompt and topology design spaces.

[Figure 7 panels]
(1) Block-level Prompt Optimization (62% → 79%)
(2) Workflow Topology Optimization (79% → 83%)
(3) Workflow-level Prompt Optimization (83% → 85%)
Predictor: Let's think step by step to solve the given problem. Clearly explain your reasoning process, showing all intermediate calculations and justifications. Express your final answer as a single numerical value or simplified expression enclosed within <answer></answer> tags. Avoid extraneous text or explanations outside of the core reasoning and final answer. <Task Demo: Exemplar_1>
Debator: You are a seasoned math professor specializing in clear and concise explanations. You are reviewing student solutions to math problems. Below, you will find the problem, followed by solutions from several students. Carefully examine each student's solution, identifying any errors in their logic or calculations. Provide a comprehensive rationale explaining your analysis of each student's work, clearly stating whether their final answer is correct or incorrect and why. Finally, provide your own definitive and simplified solution to the problem, ensuring its accuracy and clarity. Present your final answer bracketed between <answer> and </answer> at the end. Question: Compute $17^{-1}\\pmod{83}$. Solutions: Agent 0: 44\nAgent 1: 74 Rationale: <Rationale> Answer: 44 <Task Demo: Exemplar_2> <Task Demo: Exemplar_3>
Figure 7 | A demonstration of the optimization trajectory of Mass on MATH. In (1) block-level optimization, multi-agent debate serves as the best-performing topology. In (2) workflow topology optimization, aggregating with more parallel agents outweighs the performance of agents in debate. Lastly, (3) workflow-level optimization discovers the optimal prompt conditioned on the best topology.

Ablating optimization stages. To understand the incremental gain per Mass optimization stage, we provide a stage-by-stage ablation study in Fig. 5. We list the average performance of Mass from block-level to workflow-level optimization and compare it with a single-agent APO baseline, where the block-level optimization performance indicates the best-performing building block a ∈ A after APO. First, we notice that there is a large gain, 6% on average, between block-level optimization and single-agent optimization, showing that MAS benefits substantially from having its agents optimized inside the building block. In addition, going from Stage (1) to (2), another 3% gain can be achieved by composing influential topologies while searching for the optimal configurations. Here, we provide an additional ablation on conducting Stage (2) without prompt optimization beforehand or without search space pruning. Fig. 5 (right) shows that both are critical for effective search space exploration. Lastly, Mass obtains further gains (∼2%) by conducting workflow-level prompt optimization on the best-found topology, which indicates that optimizing the prompts towards modeling the interdependence of agents is beneficial in the MAS design.
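The per-stage gains quoted above can be checked against the bar labels of Fig. 5 (left). The short Python sketch below does this arithmetic; it assumes that the five labels (63.5, 68.0, 74.2, 77.3, 78.8) map, in order, onto the five stages on the x-axis (CoT, APO, 1PO, 2TO, 3PO), which is a reading of the figure rather than something the text states explicitly.

```python
# Average score (%) per stage, read off the bar labels of Fig. 5 (left).
# Assumption: the five labels map, in order, onto the five stages on the x-axis.
STAGE_SCORES = {"CoT": 63.5, "APO": 68.0, "1PO": 74.2, "2TO": 77.3, "3PO": 78.8}


def stage_gains(scores):
    """Return the gain (in percentage points) between consecutive stages."""
    names = list(scores)
    return {
        f"{prev}->{curr}": round(scores[curr] - scores[prev], 1)
        for prev, curr in zip(names, names[1:])
    }


gains = stage_gains(STAGE_SCORES)
for step, gain in gains.items():
    print(f"{step}: +{gain:.1f}")
```

Under this reading, the APO to 1PO step is +6.2 points, 1PO to 2TO is +3.1, and 2TO to 3PO is +1.5, in line with the 6%, 3%, and roughly 2% gains stated in the text.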
Cost-effectiveness of Mass. We analyze the cost-effectiveness of Mass. In particular, we visualize the optimization trajectory of Mass in Fig. 6. Mass’s trajectory shows a steady trend of optimization that gradually improves the validation performance by interleaving the search towards better prompts and topologies. In contrast, among the automatic design baselines without explicit prompt optimization stages, AFlow is exposed to a larger variance in its optimization due to the nature of MCTS, whereas ADAS gets trapped in discovering over-complex topologies that appear to be less effective than optimizing in the prompt design space. Overall, the optimization trajectory of Mass highlights the importance of optimizing in an effective design space, where interleaved optimization further resolves the complexity with more consecutive rewards. Following Sec. 2.1, Mass also demonstrates advanced token-effectiveness, for which we refer to Fig. 9.

Best-found MAS architectures & design principles. We further inspect an example of optimized prompts and the trajectory of Mass in discovering more effective topologies in Fig. 7. The optimization starts from a zero-shot CoT agent, and Mass in Stage (1) soon identifies the high-performing topology in debate with its optimized prompt. However, as found in Stage (2), aggregating with more parallel agents actually outweighs the multi-agent debate. Workflow-level prompt optimization then leads to the best-performing predictor for aggregation. The overall optimization flow sheds light on our guidelines for building effective MAS: 1) optimizing individual agents properly is important before composing them into an MAS; 2) more effective MAS can be built by composing influential topologies; and 3) modeling the interdependence between agents is beneficial, and can be achieved via workflow-level joint optimization.

6. Conclusion

We approach designing effective MAS by first conducting a thorough analysis of the massive design space, revealing the crucial role of prompts, and identifying an influential subset of the search space. Building on these findings, we introduce Mass, a novel multi-stage optimization framework that searches within a pruned design space, interleaving prompt and topology optimization to efficiently generate high-performing MAS. Our experiments demonstrate that Mass-optimized MAS significantly outperform existing manual and automated approaches across an extensive set of tasks. Finally, based on the optimized systems discovered by Mass, we extract valuable design principles to guide the development of future effective LLM-based MAS.

Acknowledgment

We thank Jinsung Yoon and all other colleagues from Google Cloud AI Research for their valuable feedback.

References

Anthropic. The claude 3 model family: Opus, sonnet, haiku. 2024.
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y. Dong, J. Tang, and J. Li. LongBench: A bilingual, multitask benchmark for long context understanding. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, Bangkok, Thailand, Aug. 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.172. URL https://aclanthology.org/2024.acl-long.172/.
J. Chen, S. Saha, and M. Bansal. ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7066–7085, Bangkok, Thailand, Aug. 2024a.
Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.381. URL https://aclanthology.org/2024.acl-long.381/.
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C.-M. Chan, H. Yu, Y. Lu, Y.-H. Hung, C. Qian, Y. Qin, X. Cong, R. Xie, Z. Liu, M. Sun, and J. Zhou. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=EHg5GDnyq1.
X. Chen, R. Aksitov, U. Alon, J. Ren, K. Xiao, P. Yin, S. Prakash, C. Sutton, X. Wang, and D. Zhou. Universal self-consistency for large language models. In ICML 2024 Workshop on In-Context Learning, 2024c. URL https://openreview.net/forum?id=LjsjHF7nAN.
X. Chen, M. Lin, N. Schärli, and D. Zhou. Teaching large language models to self-debug. In The Twelfth International Conference on Learning Representations, 2024d. URL https://openreview.net/forum?id=KuPixIqPiq.
Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=zj7YuTE4t8.
D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1246. URL https://aclanthology.org/N19-1246/.
S. Guo, C. Deng, Y. Wen, H. Chen, Y. Chang, and J. Wang. Ds-agent: Automated data science by empowering large language models with case-based reasoning, 2024. URL https://arxiv.org/abs/2402.17453.
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021. URL https://openreview.net/forum?id=7Bywt2mQsCe.
X. Ho, A.-K. Duong Nguyen, S. Sugawara, and A. Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In D. Scott, N. Bel, and C. Zong, editors, Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online), Dec. 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.580. URL https://aclanthology.org/2020.coling-main.580/.
S. Hu, C. Lu, and J. Clune. Automated design of agentic systems. arXiv preprint arXiv:2408.08435, 2024a.
X. Hu, Z. Zhao, S. Wei, Z. Chai, Q. Ma, G. Wang, X. Wang, J. Su, J. Xu, M. Zhu, Y. Cheng, J. Yuan, J. Li, K. Kuang, Y. Yang, H. Yang, and F. Wu. Infiagent-dabench: Evaluating agents on data analysis tasks, 2024b. URL https://arxiv.org/abs/2401.05507.
N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.
K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing. Neural architecture search with bayesian optimisation and optimal transport. Advances in neural information processing systems, 31, 2018.
A. Khan, J.
Hughes, D. Valentine, L. Ruis, K. Sachan, A. Radhakrishnan, E. Grefenstette, S. R. Bowman, T. Rocktäschel, and E. Perez. Debating with more persuasive LLMs leads to more truthful answers. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=iLCZtl7FTa.
O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. V. A, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, and C. Potts. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=sY5N0zY5Od.
T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
J. Li, Q. Zhang, Y. Yu, Q. FU, and D. Ye. More agents is all you need. Transactions on Machine Learning Research, 2024a. ISSN 2835-8856. URL https://openreview.net/forum?id=bgzUSZ8aeg.
M. Li, S. Zhao, Q. Wang, K. Wang, Y. Zhou, S. Srivastava, C. Gokmen, T. Lee, L. E. Li, R. Zhang, W. Liu, P. Liang, L. Fei-Fei, J. Mao, and J. Wu. Embodied agent interface: Benchmarking llms for embodied decision making, 2025. URL https://arxiv.org/abs/2410.07166.
Y. Li, Y. Du, J. Zhang, L. Hou, P. Grabowski, Y. Li, and E. Ie. Improving multi-agent debate with sparse communication topology. In Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7281–7294, Miami, Florida, USA, Nov. 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.427. URL https://aclanthology.org/2024.findings-emnlp.427/.
Z. Li, S. Xu, K. Mei, W. Hua, B. Rama, O. Raheja, H. Wang, H. Zhu, and Y. Zhang. Autoflow: Automated workflow generation for large language model agents. arXiv preprint arXiv:2407.12821, 2024c.
T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang, Y. Yang, S. Shi, and Z. Tu. Encouraging divergent thinking in large language models through multi-agent debate. In Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17889–17904, Miami, Florida, USA, Nov. 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.992. URL https://aclanthology.org/2024.emnlp-main.992/.
L. Lin, J. Fu, P. Liu, Q. Li, Y. Gong, J. Wan, F. Zhang, Z. Wang, D. Zhang, and K. Gai. Just ask one more time! self-agreement improves reasoning of language models in (almost) all scenarios. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 3829–3852, Bangkok, Thailand, Aug. 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.230. URL https://aclanthology.org/2024.findings-acl.230/.
F. Liu, N. AlDahoul, G. Eady, Y. Zaki, B. AlShebli, and T. Rahwan. Self-reflection outcome is sensitive to prompt construction. arXiv preprint arXiv:2406.10400, 2024a.
H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
Z. Liu, Y. Zhang, P. Li, Y. Liu, and D. Yang. A dynamic LLM-powered agent network for task-oriented agent collaboration. In First Conference on Language Modeling, 2024b. URL https://openreview.net/forum?id=XII0Wp1XA9.
A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. Self-refine: Iterative refinement with self-feedback.
Advances in Neural Information Processing Systems, 36, 2024.
K. Opsahl-Ong, M. J. Ryan, J. Purtell, D. Broman, C. Potts, M. Zaharia, and O. Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. In Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9340–9366, Miami, Florida, USA, Nov. 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.525. URL https://aclanthology.org/2024.emnlp-main.525/.
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
R. Pryzant, D. Iter, J. Li, Y. Lee, C. Zhu, and M. Zeng. Automatic prompt optimization with “gradient descent” and beam search. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7957–7968, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.494. URL https://aclanthology.org/2023.emnlp-main.494/.
C. Qian, Z. Xie, Y. Wang, W. Liu, Y. Dang, Z. Du, W. Chen, C. Yang, Z. Liu, and M. Sun. Scaling large-language-model-based multi-agent collaboration. arXiv preprint arXiv:2406.07155, 2024.
S. Qiao, N. Zhang, R. Fang, Y. Luo, W. Zhou, Y. Jiang, C. Lv, and H. Chen. AutoAct: Automatic agent learning from scratch for QA via self-planning. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3003–3021, Bangkok, Thailand, Aug. 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.165. URL https://aclanthology.org/2024.acl-long.165/.
M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. P. Lillicrap, J. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, I. Antonoglou, R. Anil, S. Borgeaud, A. M. Dai, K. Millican, E. Dyer, M. Glaese, T. Sottiaux, B. Lee, F. Viola, M. Reynolds, Y. Xu, J. Molloy, J. Chen, M. Isard, P. Barham, T. Hennigan, R. McIlroy, M. Johnson, J. Schalkwyk, E. Collins, E. Rutherford, E. Moreira, K. Ayoub, M. Goel, C. Meyer, G. Thornton, Z. Yang, H. Michalewski, Z. Abbas, N. Schucher, A. Anand, R. Ives, J. Keeling, K. Lenc, S. Haykal, S. Shakeri, P. Shyam, A. Chowdhery, R. Ring, S. Spencer, E. Sezener, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. CoRR, abs/2403.05530, 2024. doi: 10.48550/ARXIV.2403.05530. URL https://doi.org/10.48550/arXiv.2403.05530.
B. Ru, X. Wan, X. Dong, and M. Osborne. Interpretable neural architecture search via bayesian optimisation with weisfeiler-lehman kernels. International Conference on Learning Representations (ICLR), 2021.
J. Saad-Falcon, A. G. Lafuente, S. Natarajan, N. Maru, H. Todorov, E. Guha, E. K. Buchanan, M. Chen, N. Guha, C. Ré, et al. Archon: An architecture search framework for inference-time techniques. arXiv preprint arXiv:2409.15254, 2024.
Y. Shang, Y. Li, K. Zhao, L. Ma, J. Liu, F. Xu, and Y. Li. Agentsquare: Automatic llm agent search in modular design space. arXiv preprint arXiv:2410.06153, 2024.
N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
A. Singh, A. Ehtesham, S. Kumar, and T. T. Khoei. Agentic retrieval-augmented generation: A survey on agentic rag. arXiv preprint arXiv:2501.09136, 2025.
H. Su, R. Sun, J. Yoon, P. Yin, T. Yu, and S. Ö. Arık. Learn-by-interact: A data-centric framework for self-adaptive agents in realistic environments. arXiv preprint arXiv:2501.10893, 2025.
H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022. doi: 10.1162/tacl_a_00475. URL https://aclanthology.org/2022.tacl-1.31/.
M. Verma, S. Bhambri, and S. Kambhampati. On the brittle foundations of react prompting for agentic large language models. arXiv preprint arXiv:2405.13966, 2024.
X. Wan, B. Ru, P. M. Esperança, and Z. Li. On redundancy and diversity in cell-based neural architecture search. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=rFJWoYoxrDB.
X. Wan, R. Sun, H. Nakhost, and S. O. Arik. Teach better or show smarter? on instructions and exemplars in automatic prompt optimization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=IdtoJVWVnX.
X. Wan, H. Zhou, R. Sun, H. Nakhost, K. Jiang, and S. Ö. Arık. From few to many: Self-improving many-shot reasoners through iterative optimization and generation. arXiv preprint arXiv:2502.00330, 2025.
F. Wang, X. Wan, R. Sun, J. Chen, and S. Ö. Arık. Astute rag: Overcoming imperfect retrieval augmentation and knowledge conflicts for large language models. arXiv preprint arXiv:2410.07176, 2024a.
J. Wang, J. Wang, B. Athiwaratkun, C. Zhang, and J. Zou. Mixture-of-agents enhances large language model capabilities. arXiv preprint arXiv:2406.04692, 2024b.
Q. Wang, Z. Wang, Y. Su, H. Tong, and Y. Song. Rethinking the bounds of LLM reasoning: Are multi-agent discussions the key? In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6106–6131, Bangkok, Thailand, Aug. 2024c. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.331. URL https://aclanthology.org/2024.acl-long.331/.
X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw.
C. White, M. Safari, R. Sukthanker, B. Ru, T. Elsken, A. Zela, D. Dey, and F. Hutter. Neural architecture search: Insights from 1000 papers. arXiv preprint arXiv:2301.08727, 2023.
Q. Wu, G. Bansal, J. Zhang, Y. Wu, S. Zhang, E. Zhu, B. Li, L. Jiang, X. Zhang, and C. Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.
C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint arXiv:2407.01489, 2024.
Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium, Oct.-Nov. 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1259. URL https://aclanthology.org/D18-1259/.
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X.
G. Zhang, Y. Yue, Z. Li, S. Yun, G. Wan, K. Wang, D. Cheng, J. X. Yu, and T. Chen. Cut the crap: An economical communication pipeline for llm-based multi-agent systems. arXiv preprint arXiv:2410.02506, 2024a.
J. Zhang, J. Xiang, Z. Yu, F. Teng, X. Chen, J. Chen, M. Zhuge, X. Cheng, S.
Hong, J. Wang, et al. Aflow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024b.
J. Zhang, X. Xu, N. Zhang, R. Liu, B. Hooi, and S. Deng. Exploring collaboration mechanisms for LLM agents: A social psychology view. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14544–14607, Bangkok, Thailand, Aug. 2024c. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.782. URL https://aclanthology.org/2024.acl-long.782/.
S. Zhang, J. Zhang, J. Liu, L. Song, C. Wang, R. Krishna, and Q. Wu. Offline training of language model agents with functions as learnable weights. In Forty-first International Conference on Machine Learning, 2024d. URL https://openreview.net/forum?id=2xbkWiEuR1.
W. Zhang, K. Tang, H. Wu, M. Wang, Y. Shen, G. Hou, Z. Tan, P. Li, Y. Zhuang, and W. Lu. Agent-pro: Learning to evolve via policy-level reflection and optimization. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5348–5375, Bangkok, Thailand, Aug. 2024e. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.292. URL https://aclanthology.org/2024.acl-long.292/.
Y. Zhang, R. Sun, Y. Chen, T. Pfister, R. Zhang, and S. O. Arik. Chain of agents: Large language models collaborating on long-context tasks. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024f. URL https://openreview.net/forum?id=LuCLf4BJsr.
H. Zhou, X. Wan, I. Vulić, and A. Korhonen. Survival of the most influential prompts: Efficient black-box prompt search via clustering and pruning. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13064–13077, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.870. URL https://aclanthology.org/2023.findings-emnlp.870/.
H. Zhou, X. Wan, Y. Liu, N. Collier, I. Vulić, and A. Korhonen. Fairer preferences elicit improved human-aligned large language model judgments. In Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1241–1252, Miami, Florida, USA, Nov. 2024a. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.72. URL https://aclanthology.org/2024.emnlp-main.72/.
H. Zhou, X. Wan, L. Proleev, D. Mincu, J. Chen, K. A. Heller, and S. Roy. Batch calibration: Rethinking calibration for in-context learning and prompt engineering. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=L3FHMoKZcS.
H. Zhou, X. Wan, I. Vulić, and A. Korhonen. AutoPEFT: Automatic configuration search for parameter-efficient fine-tuning. Transactions of the Association for Computational Linguistics, 12:525–542, 2024c. doi: 10.1162/tacl_a_00662. URL https://aclanthology.org/2024.tacl-1.29/.
W. Zhou, Y. Ou, S. Ding, L. Li, J. Wu, T. Wang, J. Chen, S. Wang, X. Xu, N. Zhang, et al. Symbolic learning enables self-evolving agents. arXiv preprint arXiv:2406.18532, 2024d.
M. Zhuge, W. Wang, L. Kirsch, F. Faccio, D. Khizbullin, and J. Schmidhuber. GPTSwarm: Language agents as optimizable graphs. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=uTC9AFXIhg.

A. Limitations and future work

Mass is a multi-agent design meta-framework also orthogonal to prompt and topology optimizers. Mass has brought substantial improvements over a single agent design by searching in a customizable topology space.
Though our proposed topology space covers the vast majority of effective MAS designs, including serial, parallel, and mixed connections, it is still likely that incorporating other topologies may further improve the final performance of Mass, which is complementary to the development of Mass. For instance, the debate topology proposed in Mass involves a fully-connected topology across agents. Recent work has identified the sparsity of agent communications (Li et al., 2024b; Zhang et al., 2024a), and pruning redundant communications may further enhance the overall efficiency of the strongest Mass-found design. Though the topology optimizer in Mass already traverses the proposed topology space efficiently, incorporating more advanced search algorithms, such as Bayesian optimization (Kandasamy et al., 2018; Ru et al., 2021), may further improve the sample efficiency of Mass when it faces a more complex design space. Similarly, the sample efficiency of the prompt optimizer may be further enhanced by conditioning on textual feedback from error logs (Pryzant et al., 2023; Wan et al., 2024), which we will endeavor to explore in future work.

B. Implementation details

B.1. Datasets

In this work, we include the following datasets: 1) Hendrycks' MATH (Hendrycks et al., 2021), consisting of challenging competition-level mathematics problems, and DROP (Dua et al., 2019), which requires discrete and symbolic reasoning over paragraphs; 2) HotpotQA (Yang et al., 2018), MuSiQue (Trivedi et al., 2022), and 2WikiMultiHopQA (Ho et al., 2020), to evaluate information seeking from long contexts with agentic systems, for which we report results on the standardized versions in LongBench (Bai et al., 2024); 3) MBPP (Austin et al., 2021), HumanEval (Chen et al., 2021), and LiveCodeBench (Jain et al., 2024) as well-established coding benchmarks. Regarding LiveCodeBench, we use the ‘test output prediction’ task as an agent cooperative task.
In line with AFlow (Zhang et al., 2024b), we use the public test cases of MBPP and HumanEval for the executor to retrieve reliable external feedback signals. To save computation resources, we randomly sample a subset of the original validation and test splits to conduct all the experiments; the specifications are reported in Table 2.

Table 2 | The specification of evaluation tasks: dataset split, topology search space, and the Mass-optimized MAS (on Gemini 1.5 Pro).
Task | Type | Val size | Test size | Topology Search Space | Mass
MATH | Mathematical Reasoning | 60 | 100 | {Aggregate, Reflect, Debate} | {9, 0, 0}
DROP | Discrete Reasoning | 60 | 200 | {Aggregate, Reflect, Debate} | {5, 0, 0}
HotpotQA | Long-context Understanding | 50 | 100 | {Summarize, Aggregate, Reflect, Debate} | {0, 5, 0, 1}
MuSiQue | Long-context Understanding | 50 | 100 | {Summarize, Aggregate, Reflect, Debate} | {0, 3, 0, 2}
2WikiMQA | Long-context Understanding | 50 | 100 | {Summarize, Aggregate, Reflect, Debate} | {0, 3, 0, 1}
MBPP | Coding | 60 | 200 | {Aggregate, Reflect, Debate, Executor} | {1, 4, 0, 1}
HumanEval | Coding | 50 | 100 | {Aggregate, Reflect, Debate, Executor} | {1, 3, 0, 1}
LiveCodeBench | Coding: test output prediction | 100 | 200 | {Aggregate, Reflect, Debate, Executor} | {3, 1, 1, 1}

Table 3 | The search dimension for each topology. The minimum topology defines the building block that Mass Stage (1) optimized.
Topology | Search Space | Minimum Topology Building Block | Specification
Summarize | {0, 1, 2, 3, 4} | {Summarizer, Predictor} | {1, 1}
Aggregate | {1, 3, 5, 7, 9} | {Predictor, Aggregator} | {3, 1}
Reflect | {0, 1, 2, 3, 4} | {Predictor, Reflector} | {1, 1}
Debate | {0, 1, 2, 3, 4} | {Predictor, Debator} | {2, 1}
Execute | {0, 1} | {Predictor, Executor, Reflector} | {1, 1, 1}

B.2. Baselines

In this section, we report the specifications of all our baselines.
We note that for the baselines CoT, SC, Self-Refine, and Multi-Agent Debate, we follow the prompts given in ADAS (Hu et al., 2024a).

1) Chain-of-Thought (CoT) (Kojima et al., 2022). Direct chain-of-thought reasoning via zero-shot prompting: “Please think step by step and then solve the task.”

2) Self-Consistency (SC) (Wang et al., 2023). In self-consistency, we generate diverse chain-of-thought reasoning traces with a temperature of 0.8, followed by a rule-based majority vote that selects the most consistent answer. In Table 1, we report SC@9 to provide a fair comparison across baselines.

3) Self-Refine (Madaan et al., 2024). This baseline consists of one predictor that repeatedly takes feedback and a self-reflector that provides criticism. It stops whenever the self-reflector outputs “correct” in its prediction. We set the maximum number of reflection rounds to 5, such that the worst case involves 11 (1+2×5) calls.

4) Multi-Agent Debate (Du et al., 2024; Liang et al., 2024). This baseline involves 3 agents that conduct reasoning and debating for 3 rounds. The opinions across the rounds of debating are finally judged by an aggregator that makes the final prediction. Hence, it contains 10 (3×3+1) agents.

5) Automated Design of Agentic Systems (ADAS) (Hu et al., 2024a). Consistent with our main experimental setup, we use Gemini 1.5 as both the LLM optimizer and the evaluator for reproducing all ADAS results. The generation of ADAS is conditioned on former evaluations of baselines, including CoT, SC, Self-Refine, and Multi-Agent Debate. We report ADAS with 30 rounds of search, and each round is evaluated on the validation set 3 times to stabilize the prediction.

6) AFlow (Zhang et al., 2024b). Automatic workflow design via Monte Carlo Tree Search over a set of pre-defined operators. Similar to ADAS, AFlow also relies on an LLM optimizer to generate new nodes and topologies expressed in code.
However, we find that the meta-prompt of AFlow does not generalize to other LLM backbones. Consequently, we report AFlow with its original LLM optimizer, Claude 3.5 Sonnet, and reproduce experiments with Gemini 1.5 Pro as the LLM executor. The comparison is therefore not completely fair, and we treat the results from AFlow as a reference point. We note that the ‘-’ in Table 1 refers to out-of-time errors, where the LLM executor became trapped executing scripts that accidentally contained infinite loops. We still endeavored to report most results from AFlow, as shown in Table 1 & Fig. 6, with the default experimental setup from AFlow: 20 rounds, 5 validation runs per round, and k = 3.

[Figure 8: two panels. "Topology Building Blocks": Summarize, Aggregate, Reflect, Debate, Executor. "MASS-optimized Topology": MATH, DROP, HotpotQA, MuSiQue, 2WikiMQA, MBPP, HumanEval, LiveCodeBench.]

Figure 8 | Visualization of the topology building blocks and best Mass-discovered topologies from Gemini 1.5 Pro.

B.3. Mass: Multi-Agent System Search

In this section, we provide additional details for Mass. The topology search space for each task is defined in Table 2. In addition, for Stage (1) block-level prompt optimization, the specification of the building block is defined in Table 3. We visualize both the minimum building blocks and the optimized topologies in Fig. 8. We refer the reader to App. §D & §E for the prompt templates we used to define each type of agent and the best prompts discovered.

C. Additional experiments

Table 4 | Results on the evaluation set with Claude 3.5 Sonnet. We keep the same experimental setup as Table 1. Since Claude 3.5 Sonnet does not support the same context window as Gemini, we report the standard HotpotQA instead of the LongBench version.
As we transfer the prompt template for each agent from Gemini to Claude, the basic topology on some tasks can suffer severe performance degradation; Mass successfully recovers the performance and brings significant improvements over the initial agent.

Claude-3.5-Sonnet. Task groups: Reasoning (MATH, DROP); Multi-hop (HotpotQA); Coding (MBPP, HumanEval, LCB).

| Method | MATH | DROP | HotpotQA | MBPP | HumanEval | LCB | Avg. |
| CoT | 57.33 ± 0.94 | 55.52 ± 0.42 | 23.56 ± 1.52 | 67.50 ± 1.47 | 88.67 ± 1.70 | 72.67 ± 2.39 | 60.21 |
| Self-Consistency | 61.67 ± 1.89 | 57.86 ± 0.45 | 25.69 ± 0.44 | 69.17 ± 0.62 | 90.00 ± 0.82 | 72.67 ± 2.39 | 62.84 |
| Self-Refine | 57.00 ± 1.63 | 56.26 ± 0.56 | 23.57 ± 2.56 | 68.00 ± 0.82 | 87.00 ± 1.41 | 49.33 ± 1.65 | 56.86 |
| Multi-Agent Debate | 45.00 ± 3.74 | 26.62 ± 0.11 | 31.41 ± 3.30 | 00.00 ± 0.00 | 84.33 ± 3.30 | 72.82 ± 1.84 | 43.36 |
| Mass | 63.00 ± 0.00 | 68.93 ± 0.38 | 66.98 ± 0.99 | 68.83 ± 0.62 | 93.00 ± 0.82 | 73.73 ± 1.43 | 72.43 |

Table 5 | The detailed ablation results per optimization stage of Mass. Practical gains can be obtained by further conducting workflow-level prompt optimization (3PO) on the best-found topology.

Gemini-1.5-pro-002. Task groups: Reasoning (MATH, DROP); Multi-hop Long-context (HotpotQA, MuSiQue, 2WikiMQA); Coding (MBPP, HumanEval, LCB).

| Method | MATH | DROP | HotpotQA | MuSiQue | 2WikiMQA | MBPP | HumanEval | LCB | Avg. |
| Base Agent | 62.33 ± 0.94 | 71.65 ± 0.61 | 56.96 ± 1.26 | 43.32 ± 0.13 | 49.20 ± 0.61 | 68.83 ± 0.85 | 89.33 ± 1.70 | 66.33 ± 2.09 | 63.54 |
| + APO | 79.33 ± 1.89 | 77.51 ± 0.38 | 59.72 ± 0.00 | 43.97 ± 0.00 | 61.49 ± 0.24 | 67.00 ± 1.08 | 86.33 ± 1.25 | 68.50 ± 1.22 | 67.44 |
| + 1PO | 80.00 ± 0.00 | 86.45 ± 0.90 | 62.52 ± 1.86 | 48.86 ± 0.61 | 67.40 ± 0.58 | 80.33 ± 1.25 | 91.67 ± 1.25 | 76.00 ± 0.00 | 74.56 |
| + 2TO | 83.00 ± 1.63 | 86.75 ± 1.32 | 65.22 ± 1.34 | 52.61 ± 0.52 | 72.82 ± 0.86 | 85.00 ± 1.08 | 92.00 ± 0.82 | 81.33 ± 0.00 | 77.55 |
| + 3PO | 84.67 ± 0.47 | 90.52 ± 0.64 | 69.91 ± 1.11 | 51.40 ± 0.42 | 73.34 ± 0.67 | 86.50 ± 0.41 | 91.67 ± 0.47 | 82.33 ± 0.85 | 78.40 |

[Figure 9: accuracy (%) versus total tokens on MATH (gemini-1.5-pro-002). Plotted methods: CoT, CoT-SC@3, CoT-SC@5, Refine@5, Debate 1R@2A, Debate 2R@3A, Step-Back, Quality-Diverse, Role Assign, ADAS-Tool, ADAS-T&S, and MASS.]

Figure 9 | The Pareto-front of Mass-optimized designs compared to multi-agent baselines.
Total tokens include both inference input tokens and output tokens. Additional multi-agent baselines from ADAS (Hu et al., 2024a) and the two best-found ADAS designs are included.

D. Prompt template

We provide all prompt templates we used for defining the Mass search space. We use <> to enclose text that has been skipped for presentation purposes. We follow DSPy (Khattab et al., 2024) in constructing these agentic templates. The general template for instruction, exemplar, and input/output fields:

<Instruction>

---

Follow the following format.

Input: ${Input}
...
Output: ${output}

---

<example_1>

---

Input: <Input>
...
Output: <output>

MATH:

Predictor:

Let’s think step by step.
---
Question: ${question}
Reasoning: Let’s think step by step in order to ${produce the answer}. We ...
Answer: ${answer}

------------
Reflector:

Please review the answer above and criticize on where might be wrong. If you are absolutely sure it is correct, output ’True’ in ’correctness’.

---
Question: ${question}
Text: ${text}
Reasoning: Let’s think step by step in order to ${produce the correctness}. We ...
Feedback: ${feedback}
Correctness: True/False indicating if answer is correct given the question.

------------
Refiner:

Given previous attempts and feedback, carefully consider where you could go wrong in your latest attempt. Using insights from previous attempts, try to solve the task better. Show your final answer bracketed between <answer> and </answer> at the end.

---
Question: ${question}
Previous answer: ${previous_answer}
Reflection: ${reflection}
Correctness: ${correctness}
Thinking: ${thinking}
Answer: ${answer}

------------

Debator:

These are the solutions to the question from other agents.
Examine the solutions from other agents in your rationale, finish by giving an updated answer. Show your final answer bracketed between <answer> and </answer> at the end.

---
Question: ${question}
Solutions: the solutions to the question from other agents
Reasoning: Let’s think step by step in order to ${Examine the solutions from other agents}. We ...
Answer: The updated answer for the question. Do not repeat Answer:

DROP:

Predictor:

Please think step by step and then solve the task. # Your Task:
Please answer the following question based on the given context.
---
Question: ${question}
Context: ${context}
Thinking: ${thinking}
Answer: Directly answer the question. Keep it very concise.

------------
Reflector:

Verify that the answer is based on the provided context. Give your reflection in the rationale.

---
Question: ${question}
Context: ${context}
Text: ${text}
Reasoning: Let’s think step by step in order to ${produce the correctness}. We ...
Correctness: True/False indicating if answer is correct given the observations and question.

------------
Refiner:

Please think step by step and then solve the task. # Your Task:
Based on the reflection, correctness of the previous answer, and the context again, give an updated answer.

---
Question: ${question}
Context: ${context}
Previous answer: ${previous_answer}
Reflection: ${reflection}
Correctness: ${correctness}
Thinking: ${thinking}
Answer: Directly answer the question. Keep it very concise.

------------

Debator:

These are the solutions to the question from other agents. Based on the context, examine the solutions from other agents in your rationale, finish by giving an updated answer.
---
Question: ${question}
Context: ${context}
Solutions: the solutions to the question from other agents
Reasoning: Let’s think step by step in order to ${Examine the solutions from other agents}. We ...
Answer: The updated answer for the question. Do not repeat Answer:

HotpotQA, MuSiQue, and 2WikiMQA:

Predictor:

Answer the question with information based on the context. Only return the answer as your output.
---
Question: ${question}
Context: ${context}
Answer: Only give me the answer. Do not output any other words.

------------
Summarizer:

Based on the question, retrieve relevant information from context that is ONLY helpful in answering the question. Include all key information. Do not repeat context.
---
Question: ${question}
Context: ${context}
Summary: Only generate the summary. Start with Summary:

------------
Reflector:

Verify that the answer is based on the provided context.

---
Question: ${question}
Context: ${context}
Text: ${text}
Reasoning: Let’s think step by step in order to ${produce the correctness}. We ...
Correctness: True/False indicating if answer is correct given the observations and question.

------------

Debator:

These are the solutions to the question from other agents. Based on the context, examine the solutions from other agents in your rationale, finish by giving an updated answer.

---
Question: ${question}
Context: ${context}
Solutions: the solutions to the question from other agents
Reasoning: Let’s think step by step in order to ${Examine the solutions from other agents}. We ...
Answer: The updated answer for the question. Do not repeat Answer:

MBPP:

Predictor:

Let’s think step by step. Provide a complete and correct code implementation in python.
---
Question: ${question}
Thinking: ${thinking}
Answer: Only the code implementation. Do not include example usage or explainations.

------------
Reflector:

Please determine the correctness of the solution in passing all test cases. If it fails, based on the error message and trackback, think step by step, carefully propose an updated solution in the answer output with a correct code implementation in python.

---
Question: ${question}
Previous solution: ${previous_solution}
Traceback: It contains the test cases, execution results, and ground truth. If there is an error, the relevant traceback is given.
Correctness: ’True/False’ based on the correctness of executive feedback. If there is an error message, output ’False’
Thinking: ${thinking}
Answer: ${answer}

------------

Debator:

These are the solutions to the question from other agents. Examine the solutions from other agents in your rationale, finish by giving an updated answer. Let’s think step by step. Provide a complete and correct code implementation in python.

---
Question: ${question}
Solutions: the solutions to the question from other agents
Reasoning: Let’s think step by step in order to ${Examine the solutions from other agents}. We ...
Answer: ${answer}

HumanEval:

Predictor:

Let’s think step by step. Provide a complete and correct code implementation in python.
---
Question: ${question}
Thinking: ${thinking}
Answer: ${answer}

------------
Reflector:

Please determine the correctness of the solution in passing all test cases. If it fails, based on the error message and trackback, think step by step, carefully propose an updated solution in the answer output with a correct code implementation in python.
---
Question: ${question}
Previous solution: ${previous_solution}
Traceback: ${traceback}
Thinking: ${thinking}
Answer: ${answer}

------------

Debator:

These are the solutions to the question from other agents. Examine the solutions from other agents in your rationale, finish by giving an updated answer. Let’s think step by step. Provide a complete and correct code implementation in python.

---
Question: ${question}
Solutions: the solutions to the question from other agents
Reasoning: Let’s think step by step in order to ${Examine the solutions from other agents}. We ...
Answer: ${answer}

LiveCodeBench:

Predictor:

You are a helpful programming assistant and an expert Python programmer. The user has written a input for the testcase. Think step by step. You will generate the code based on the problem requirepement. You will calculate the output of the testcase and write the whole assertion statement in the markdown code block with the correct output.
---
Question: ${question}
Thinking: ${thinking}
Code: ${code}
Answer: complete the testcase with assertion.

------------
Reflector:

If there is an executive output in the traceback, parse the output into an assertion in the answer given the executive output.

---
Question: ${question}
Previous solution: ${previous_solution}
Traceback: It contains the test cases, execution results, and ground truth. If there is an error, the relevant traceback is given.
Correctness: ’True/False’ based on the correctness of executive feedback. If there is an error message, output ’False’
Thinking: ${thinking}
Answer: ${answer}

------------

Debator:

These are the solutions to the question from other agents.
Examine the solutions from other agents in your rationale, finish by giving an updated answer.

---
Question: ${question}
Solutions: the solutions to the question from other agents
Reasoning: Let’s think step by step in order to ${Examine the solutions from other agents}. We ...
Answer: assert ${function(input)} == {executive_output}

E. Best Prompts Discovered

In this section, we show a few typical prompts that have been discovered by Mass. As before, <> marks content that has been skipped.

MATH:

Predictor:

Let’s think step by step to solve the given problem. Clearly explain your reasoning process, showing all intermediate calculations and justifications. Express your final answer as a single numerical value or simplified expression enclosed within <answer></answer> tags. Avoid extraneous text or explanations outside of the core reasoning and final answer.

---

Follow the following format.

Question: ${question}
Reasoning: Let’s think step by step in order to ${produce the answer}. We ...
Answer: ${answer}

---

Question: What is the average of the two smallest positive integer solutions to the congruence $$14u \\equiv 46 \\pmod{100}~?$$
Reasoning: Let’s think step by step in order to find the average of the two smallest positive integer solutions to the congruence $14u \\equiv 46 \\pmod{100}$. We are looking for solutions to $14u \\equiv 46 \\pmod{100}$. We can divide the congruence by 2 to get $7u \\equiv 23 \\pmod{50}$. We want to find the inverse of 7 modulo 50. We use the Euclidean Algorithm: \\begin{align*} 50 &= 7 \\cdot 7 + 1 \\\\ 1 &= 50 - 7 \\cdot 7 \\end{align*} Thus, $7(-7) \\equiv 1 \\pmod{50}$, so $7^{-1} \\equiv -7 \\equiv 43 \\pmod{50}$. Multiplying $7u \\equiv 23 \\pmod{50}$ by 43 gives $u \\equiv 23 \\cdot 43 \\pmod{50}$. $23 \\cdot 43 = 989 = 50 \\cdot 19 + 39$, so $23 \\cdot 43 \\equiv 39 \\pmod{50}$.
Therefore, $u \\equiv 39 \\pmod{50}$. The two smallest positive integer solutions are $u = 39$ and $u = 39+50=89$. The average of these two solutions is $\\frac{39+89}{2} = \\frac{128}{2} = 64$.
Answer: 64

---

In the following prompts, interestingly, we observe that including the data summary, task demonstrations, and past instructions that have been used in MIPRO (Opsahl-Ong et al., 2024) to propose new candidates actually improves the final performance. Hence, we keep these prompts that lead to strong task performance.

DROP:

Predictor:

This dataset is designed for extractive question answering, focusing on retrieving concise, factual answers from short texts. Many questions involve extracting numerical information and performing simple calculations, suggesting applications in areas like sports analytics or financial data analysis. However, the dataset’s Western cultural bias and lack of complex reasoning questions limit its generalizability and real-world applicability.

TASK DEMO(S):
<example_1>
Question: How many more points did the Spurs win by in Game 4 against the Mavericks?

Context:
The Mavericks finished 49-33, one game ahead of Phoenix for the eighth and final playoff spot, which meant that they would once again have to face their in-state rivals, the San Antonio Spurs, who were the top seed in the Western Conference with a 62-20 record. In Game 1 in San Antonio, Dallas had an 81-71 lead in the fourth quarter, but the Spurs rallied back and took Game 1, 85-90. However, the Mavs forced 22 turnovers in Game 2 to rout the Spurs 113-92, splitting the first two games before the series went to Dallas. In Game 3, Manu Gin\u00f3bili hit a shot that put the Spurs up 108-106 with 1.7 seconds left, but a buzzer-beater by Vince Carter gave the Mavs the victory, putting them up 2-1 in the series.
The Spurs took Game 4 in Dallas 93-89 despite a late Dallas comeback after the Spurs at one point had a 20-point lead and later won Game 5 at home, 109-103, giving them a 3-2 series lead. The Mavs avoided elimination in Game 6 at home by rallying in the fourth quarter, winning 111-113. Game 7 was on the Spurs home court, and the Spurs beat the Mavericks 119-96, putting an end to the Mavericks season.

Thinking:
The Spurs scored 93 points in Game 4. The Mavericks scored 89 points in Game 4. The difference is 93 - 89 = 4.
Answer: 4


BASIC INSTRUCTION:
‘‘‘
You are a highly specialized AI tasked with extracting critical numerical information for an urgent news report. A live broadcast is relying on your accuracy and speed. Think step-by-step, focusing on the numerical information provided in the context. Then, answer the question concisely with the extracted numerical answer. Failure to provide the correct numerical information will result in the broadcast being interrupted.

Question: {question}
Context: {context}
‘‘‘

TIP: Keep the instruction clear and concise.

PROPOSED INSTRUCTION:

‘‘‘
Extract the numerical answer to the following question. Show your reasoning by identifying the relevant numbers from the provided context and performing any necessary calculations. Respond with only the final numerical answer.

Question: {question}
Context: {context}
‘‘‘

HotpotQA:

Predictor:

This multi-passage question answering dataset focuses on complex questions requiring synthesis of information from multiple Wikipedia-like sources, often involving named entities and temporal reasoning. It emphasizes integrating information, handling ambiguity, and leveraging real-world knowledge, posing a significant challenge for models relying solely on provided text.
The dataset appears well-suited for evaluating advanced language models’ reasoning abilities across diverse domains and varying complexity levels.

TASK DEMO(S):
Question: The actor that plays Phileas Fogg in \"Around the World in 80 Days\", co-starred with Gary Cooper in a 1939 Goldwyn Productions film based on a novel by what author?
Context: Provided in prompt
Answer: Charles L. Clifford


BASIC INSTRUCTION: From the provided text, extract the answer to the question. Output *only* the answer.

TIP: Keep the instruction clear and concise. Emphasize reliance *only* on the provided text.

PROPOSED INSTRUCTION: Answer the question using only the provided context. Do not use external knowledge.

---
<example_1>

------

Debator:

This multi-passage question answering dataset focuses on complex questions requiring synthesis of information from multiple Wikipedia-like sources, often involving named entities and temporal reasoning. It emphasizes integrating information, handling ambiguity, and leveraging real-world knowledge, posing a significant challenge for models relying solely on provided text. The dataset appears well-suited for evaluating advanced language models’ reasoning abilities across diverse domains and varying complexity levels.

TASK DEMO(S):
Provided above.

BASIC INSTRUCTION: These are the solutions to the question from other agents. Based on the context, examine the solutions from other agents in your rationale, finish by giving an updated answer.

TIP: Don’t be afraid to be creative when creating the new instruction!

PROPOSED INSTRUCTION: You are an expert fact-checker for a major publication. Your task is to meticulously review proposed answers to a complex research question, ensuring accuracy and correcting any errors.
You are provided with the original question, multiple context passages from credible sources, and several proposed answers from different research assistants. Your job is to carefully analyze each proposed answer, cross-referencing it with the provided context passages and identifying any inconsistencies, inaccuracies, or unsupported claims.

**Question:** [Insert Question Here]

**Context Passages:**
[Insert Passages Here]

**Proposed Answers:**
* Assistant 1: [Insert Assistant 1’s Answer]
* Assistant 2: [Insert Assistant 2’s Answer]
...
* Assistant N: [Insert Assistant N’s Answer]


**Instructions:**

1. **Fact-Check & Analyze:** Evaluate each proposed answer individually. For each answer:
* **Verdict:** Indicate whether the answer is \"Correct,\" \"Incorrect,\" \"Partially Correct,\" or \"Not Supported by Context.\"
* **Evidence:** Provide specific quotes and passage numbers from the context to support your verdict. Explain how the evidence supports or refutes the proposed answer. Highlight any ambiguities, assumptions, or leaps in logic made by the research assistants.
* **Corrections\/Improvements (if applicable):** Suggest specific corrections or improvements to partially correct or incorrect answers. Explain how these changes align with the context.

2. **Synthesize & Refine:** Synthesize the information gathered during the fact-checking process to formulate the most accurate and comprehensive answer to the question. This may involve:
* Selecting the most accurate proposed answer.
* Combining elements from multiple proposed answers.
* Developing a completely new answer based on your analysis of the evidence.

3. **Final Answer:** Clearly state your final, verified answer to the question.

4.
**Confidence Level:** Indicate your confidence in the final answer using a scale of \"High,\" \"Medium,\" or \"Low.\" Briefly explain the factors influencing your confidence level.


This revised instruction emphasizes a more rigorous fact-checking process, encouraging the LM to critically evaluate each proposed answer and provide detailed justifications for its assessments. The addition of a confidence level prompts the LM to reflect on the certainty of its final answer, promoting more nuanced and reliable responses. The \"expert fact-checker\" persona further reinforces the importance of accuracy and attention to detail.

---
<example_1>
<example_2>

MBPP:

Predictor:

You are a highly skilled Python programmer tasked with generating a correct and efficient Python function based on the given natural language problem description. Think step-by-step, outlining your reasoning process before presenting the code solution. Your response should adhere to the following structure:

**Thinking:** Provide a clear and concise breakdown of your thought process, including the steps you’ll take to solve the problem. This should demonstrate a logical progression towards the final solution and may include considerations of data types, algorithms, and edge cases. For example:

1. Identify the input data type and expected output.
2. Determine the core logic or algorithm required.
3. Consider potential edge cases or special scenarios.
4. Outline the steps for implementing the solution in Python.

**Answer:** Present your complete and correct Python code implementation within a code block (using triple backticks). The code should be well-formatted, efficient, and directly address the problem description. Ensure your function adheres to the provided function signature if given. For example:

‘‘‘python
def function_name(input_arguments):
    # Code implementation here
    # ...
    return output
‘‘‘

Focus on producing functional code that accurately solves the problem. Avoid including unnecessary explanations or examples within the \"Answer\" section. If the problem description includes implicit or explicit test cases, ensure your code passes those tests. Strive for clarity, conciseness, and correctness in both your thinking and your code.

---
<example_1>
<example_2>
<example_3>

------
Reflector:

This dataset is designed for Python code generation, translating natural language problem descriptions into simple functions and their corresponding test cases. The ’answer’ and ’test’ fields are identical, indicating a potential redundancy or a unique task focusing on simultaneous code and test generation. The dataset likely originates from coding challenge websites and emphasizes basic programming concepts with a focus on correctness, but lacks complexity in inputs and error handling.

TASK DEMO(S):
Question: Write a function that takes in two numbers and returns a tuple with the second number and then the first number.

def swap_numbers(a,b):
Previous Solution: def swap_numbers(a,b):
    return (b, a)

Traceback: Test case: print(swap_numbers(10,20))
Output: (20, 10)
Ground Truth: (20,10)
Correctness: True
Thinking: The provided solution correctly swaps the order of the two input numbers and returns them as a tuple. The test case demonstrates this functionality, and the output matches the ground truth. Therefore, no changes are required.
Answer: ‘‘‘python
def swap_numbers(a,b):
    return (b, a)
‘‘‘
<example_2>
<example_3>


BASIC INSTRUCTION: Please determine the correctness of the solution in passing all test cases.
If it fails, based on the error message and trackback, think step by step, carefully propose an updated solution in the answer output with a correct code implementation in python.

TIP: The instruction should include a high stakes scenario in which the LM must solve the task!

PROPOSED INSTRUCTION:

You are an automated code reviewer for a mission-critical satellite control system. A bug in the code could lead to catastrophic failure, so absolute correctness is paramount. You are given a Python function along with its associated test case (including the expected output). Analyze the provided

<example_1>
<example_2>
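To make the layout of the general template in App. §D concrete, the following is a minimal sketch of how an instruction, a field format, optional exemplars, and the final input could be assembled into one prompt string. It is an illustration only: the class name AgentTemplate, its attributes, and the render method are hypothetical names introduced here, and the sketch does not reproduce the DSPy API or the exact implementation used in Mass.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Separator between prompt sections, mirroring the "---" lines in the general template.
SEPARATOR = "\n\n---\n\n"


@dataclass
class AgentTemplate:
    """Hypothetical container for one agent's prompt (not part of DSPy or Mass)."""

    instruction: str
    input_fields: list[str]
    output_fields: list[str]
    exemplars: list[dict[str, str]] = field(default_factory=list)

    def _format_spec(self) -> str:
        # "Follow the following format." followed by one "Field: ${field}" line per field.
        lines = ["Follow the following format.", ""]
        for name in self.input_fields + self.output_fields:
            lines.append(f"{name}: ${{{name.lower()}}}")
        return "\n".join(lines)

    def _render_example(self, example: dict[str, str]) -> str:
        # An exemplar fills in every input and output field.
        names = self.input_fields + self.output_fields
        return "\n".join(f"{name}: {example[name]}" for name in names)

    def render(self, inputs: dict[str, str]) -> str:
        # Order: instruction, format specification, exemplars, then the new query.
        parts = [self.instruction, self._format_spec()]
        parts.extend(self._render_example(ex) for ex in self.exemplars)
        query = [f"{name}: {inputs[name]}" for name in self.input_fields]
        # Leave the first output field open for the model to complete.
        query.append(f"{self.output_fields[0]}:")
        parts.append("\n".join(query))
        return SEPARATOR.join(parts)


if __name__ == "__main__":
    predictor = AgentTemplate(
        instruction="Let's think step by step.",
        input_fields=["Question"],
        output_fields=["Reasoning", "Answer"],
        exemplars=[{"Question": "1+1?", "Reasoning": "1+1=2.", "Answer": "2"}],
    )
    print(predictor.render({"Question": "2+3?"}))
```

Under this sketch, prompt optimization would amount to editing the instruction string and the exemplars list of each agent, while the field names stay fixed by the agent type.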

---