Authors: Han Zhou, Xingchen Wan, Ruoxi Sun, Hamid Palangi, Shariq Iqbal, Ivan Vulić, Anna Korhonen, Sercan Ö. Arık
Paper Content:
Multi-Agent Design: Optimizing Agents with
Better Prompts and Topologies
Han Zhou 1,2,*, Xingchen Wan 1, Ruoxi Sun 1, Hamid Palangi 1, Shariq Iqbal 1, Ivan Vulić 1,2, Anna Korhonen 2,
and Sercan Ö. Arık 1
1 Google, 2 University of Cambridge
Large language models, employed as multiple agents that interact and collaborate with each other,
have excelled at solving complex tasks. The agents are programmed with prompts that declare their
functionality, along with the topologies that orchestrate interactions across agents. Designing prompts
and topologies for multi-agent systems (MAS) is inherently complex. To automate the entire design
process, we first conduct an in-depth analysis of the design space aiming to understand the factors
behind building effective MAS. We reveal that prompts together with topologies play critical roles in
enabling more effective MAS design. Based on the insights, we propose Multi-Agent System Search
(Mass), a MAS optimization framework that efficiently exploits the complex MAS design space by
interleaving its optimization stages, from local to global, from prompts to topologies, over three stages:
1) block-level (local) prompt optimization; 2) workflow topology optimization; 3) workflow-level (global)
prompt optimization, where each stage is conditioned on the iteratively optimized prompts/topologies
from former stages. We show that Mass-optimized multi-agent systems outperform a spectrum of
existing alternatives by a substantial margin. Based on the Mass-found systems, we finally propose
design principles behind building effective multi-agent systems.
1. Introduction
[Figure 1: diagram of the multi-agent design space. Left: the prompt optimization space (instruction, exemplar) and the topology optimizer over building blocks (aggregate, reflect, debate, summarize, tool-use, custom). Right: the optimized topology with an optimized prompt for each agent type.]
Figure 1 | The proposed Multi-Agent System Search (Mass) framework discovers effective multi-agent system designs (with both optimized topology and optimized prompts, right) via interleaved prompt optimization and topology optimization in a customizable multi-agent design space (key components illustrated on the left).
Large language models (LLMs) have showcased
extraordinary capabilities in understanding, rea-
soning, and generating coherent responses based
on user prompts, revolutionizing a wide range
of applications (Kojima et al., 2022; Ouyang
et al., 2022). LLM-based agents enhance us-
ability by autonomously handling complex tasks
across diverse domains, including code generation
and debugging (Jimenez et al., 2023), retrieval-
augmented generation (Singh et al., 2025; Wang
et al., 2024a), data analysis (Guo et al., 2024; Hu
et al., 2024b), and interactive decision-making
(Li et al., 2025; Su et al., 2025). These agents are
typically programmed with prompts that instruct
them to interact with the environment, utilize
available tools, and approach their objectives over
multiple turns (Yao et al., 2023). Beyond individ-
ual agents, LLMs can be orchestrated within com-
plex topologies that coordinate multiple agents
toward a shared objective. This type of multi-agent system (MAS) typically outperforms its single-
agent counterpart by involving more diverse agentic perspectives or role profiles, such as agents as
verifiers (Shinn et al., 2024) and multi-agent debate (Qian et al., 2024; Wang et al., 2024b).
Corresponding author(s): hzhouml@google.com, soarik@google.com
* This work was done while Han Zhou was a Student Researcher at Google Cloud AI Research.
arXiv:2502.02533v1 [cs.LG] 4 Feb 2025
However, designing effective MAS for new domains often proves to be challenging. First, a
single agent might suffer from prompt sensitivity (Verma et al., 2024), where simple modifications
in the prompt can cause significant and unexpected performance degradation (Liu et al.,
2024a; Zhou et al., 2024b). In MAS, when sensitive agents are cascaded, the compounding effect
due to prompt sensitivity may be amplified. Together with the prompt design, crafting an effective
topology might demand a substantial amount of manual experimentation, based on trial and error.
The problem complexity is exacerbated by the overall combinatorial search space, over not only the
unbounded space of prompt design but also the design decisions of what agent to integrate into the
topology.
Although recent research has explored automating various aspects of agentic designs, there is still
a gap in understanding of what matters most regarding improved MAS performance. For example,
DSPy (Khattab et al., 2024) automates the process of designing exemplars for improved prompt
programming. Li et al. (2024a) propose to optimize MAS by scaling up the number of agents
in majority voting. ADAS (Hu et al., 2024a) programs new topologies expressed in code via an
LLM-based meta-agent. AFlow (Zhang et al., 2024b) searches better topologies using Monte Carlo
Tree Search within a set of predefined operators. However, the interplay between multiple design
spaces, including prompts and topologies, remains unclear.
In this paper, we first conduct in-depth analyses of common design spaces in MAS, examining
the influence of various aspects such as optimizing the prompts, scaling the number of agents, and
involving different types of topologies. Our analyses reveal that prompts frequently form an influential
design component that yields strong-performing MAS, and influential topologies only represent a small
fraction of the full search space. Based on these insights, we aim to distill the essence of influential
MAS components into a pruned search space, thereby lowering the complexity of the overall search
process. We propose Multi-Agent System Search (Mass), a novel multi-stage optimization framework
that automates the optimization for MAS over an efficient search space. Mass integrates a plug-and-play
prompt optimizer and workflow optimizer over a configurable topology space. It overcomes
the complexity of joint optimization on MAS by interleaving the optimization stages, from local
to global, from prompts to topologies, over three stages: 1) block-level (local) prompt ‘warm-up’
for each topology block; 2) workflow topology optimization in a pruned set of topology space; 3)
workflow-level (global) prompt optimization given the best-found topology.
By optimizing over the identified influential components, Mass yields optimized MAS that
achieves state-of-the-art performance, outperforming existing manually-crafted MAS baselines and
automatically-generated alternatives, by a substantial margin, demonstrated across an extensive
selection of tasks, including reasoning, multi-hop understanding, and code generation. Based on the
strongest MAS found by Mass, we provide further insights and guidelines behind building effective
MAS. Overall, our contributions can be summarized as follows: 1) we provide an in-depth analysis of
the design factors that influence the performance of LLM-based MAS, highlighting the importance
of prompts and identifying the influential topologies; 2) we propose Mass, a novel multi-stage
optimizer that automates the MAS design by interleaving the optimization of prompts and topologies
in an influential search space; 3) Mass shows significant performance improvement on various
evaluation benchmarks, delivering guidelines for building effective multi-agent systems for the future.
2. Designing Multi-Agent Systems
In this section, we provide a formulation for designing MAS, followed by analyzing the influence of
prompt and topology designs. We refer to the structural arrangements of agents (or equivalently,
building blocks) as the topology of agents and define the workflow $\mathcal{W}$ as the logical sequence across
different topologies that builds the MAS. The design of a MAS can thus be broadly divided into two
levels: block-level design and workflow-level orchestration. At the block level, we aim to design
effective individual agents that best perform their intended role with better prompt design. On the
other hand, at the workflow level, the optimization involves determining the types and quantities
of agents to include and how to arrange them in the most effective way, referred to as the topology
optimization. Formally, given a search space $\mathcal{A}$ that defines all valid configurations $a$ over the blocks
(see Fig. 4), workflow topology optimization can be expressed as the following optimization problem
with an objective function $f(\cdot,\cdot)$ on a target input and output set $(x, y) \sim \mathcal{D}$:
$$\mathcal{W}^*(a) = \arg\max_{a \sim \mathcal{A}} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[f(\mathcal{W}(a(x)), y)\big]. \qquad (1)$$
In the rest of this section, we provide an in-depth analysis of each component of MAS design.
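As a minimal sketch of Eq. (1), the search can be read as scoring each candidate configuration by its mean objective value over a dataset and keeping the best one. The function `select_best_config`, the toy string-transform "workflows", and the exact-match objective below are illustrative assumptions, not part of the paper.

```python
from statistics import mean
from typing import Callable, Iterable, Optional, Sequence, Tuple

Example = Tuple[str, str]  # (input x, reference output y)


def exact_match(prediction: str, reference: str) -> float:
    """Toy objective f(., .): 1.0 if the prediction equals the reference."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def select_best_config(
    configs: Iterable[str],
    build_workflow: Callable[[str], Callable[[str], str]],
    dataset: Sequence[Example],
    objective: Callable[[str, str], float] = exact_match,
) -> Tuple[Optional[str], float]:
    """Return the configuration with the highest mean objective on the dataset."""
    best_config, best_score = None, float("-inf")
    for config in configs:
        workflow = build_workflow(config)
        # Empirical estimate of the expectation in Eq. (1)
        score = mean(objective(workflow(x), y) for x, y in dataset)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical stand-ins for real workflows: each configuration is a string transform
def build_toy_workflow(config: str) -> Callable[[str], str]:
    return {"upper": str.upper, "lower": str.lower, "identity": lambda s: s}[config]


toy_data = [("abc", "ABC"), ("Def", "DEF"), ("ghi", "ghi")]
best, score = select_best_config(["upper", "lower", "identity"], build_toy_workflow, toy_data)
```

In this toy setting the exhaustive loop is affordable; the later sections concern the case where the configuration space is too large to enumerate.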
2.1. Block-level: Prompt Design for Agents
[Figure 2: accuracy (%) versus total tokens per question for SC, Reflect, Debate, Prompting, and Prompt->SC.]
Figure 2 | Accuracy vs. the total token counts for prompt-optimized agents per question on MATH by Gemini 1.5 Pro compared to scaling agents with self-consistency (SC), self-refine (reflect), and multi-agent debate (debate) only. The error bar indicates 1 standard deviation. We show that by utilizing more compute, better accuracy can be obtained via more effective prompting.
At the block level, the primary “optimizable component” that significantly influences downstream
performance is the prompt, which defines the
role of the agent (e.g., “ You are an expert in re-
flecting on errors ...”), provides additional instruc-
tions to shape its behavior (e.g., “ You should think
step by step ...”) and optionally, contains few-shot
demonstrations (in-context examples) to guide
the agent’s responses (Wan et al., 2024, 2025).
For instance, a state-of-the-art prompt optimizer
searches both instructions and few-shot demon-
strations, where demonstrations are bootstrapped
from the model’s own, correct predictions on the
validation set based on a validation metric. Con-
ditioned on the demonstrations, the prompt op-
timizer then proposes a few candidates for the
instruction with a dataset summary or various
hints to improve candidate diversity (Opsahl-Ong
et al., 2024). The instructions and demonstrations
are then jointly optimized.
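The bootstrapping step described above can be sketched as follows: run the current agent on validation examples and keep only the predictions that the validation metric accepts, then place them in the prompt as few-shot demonstrations. This is a simplified illustration rather than the optimizer's actual implementation; `bootstrap_demos`, `format_prompt`, and the toy agent are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple


def bootstrap_demos(
    agent: Callable[[str], str],
    validation_set: Sequence[Tuple[str, str]],
    metric: Callable[[str, str], bool],
    max_demos: int = 3,
) -> List[Tuple[str, str]]:
    """Collect up to `max_demos` (question, model answer) pairs accepted by the metric."""
    demos: List[Tuple[str, str]] = []
    for question, reference in validation_set:
        prediction = agent(question)
        if metric(prediction, reference):
            demos.append((question, prediction))
        if len(demos) >= max_demos:
            break
    return demos


def format_prompt(instruction: str, demos: Sequence[Tuple[str, str]], question: str) -> str:
    """Assemble the instruction, the few-shot demonstrations, and the new question."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
    return f"{instruction}\n{shots}\nQ: {question}\nA:"


# Toy agent (an assumption for illustration): adds integers, but is wrong on "2+2"
def toy_agent(question: str) -> str:
    if question == "2+2":
        return "5"
    return str(sum(int(term) for term in question.split("+")))


val = [("1+1", "2"), ("2+2", "4"), ("3+4", "7")]
demos = bootstrap_demos(toy_agent, val, lambda pred, ref: pred == ref)
prompt = format_prompt("Let's think step by step.", demos, "5+6")
```

Only the two correctly answered questions become demonstrations; the incorrect trace for "2+2" is filtered out before it can reach the prompt.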
Although it is well known that LLMs are sensitive to prompts (Verma et al., 2024; Zhou et al.,
2024a), applying automatic prompt optimization (APO) techniques to MAS is rather non-trivial. In
single-turn tasks, APO can be easily performed by treating prompts as optimizable variables
and performance over a validation set as the target. In MAS, by contrast, APO becomes more complex due to the
interdependence across agents (e.g., the output of one agent may be the input of another agent in a
cascade, with ground-truth responses for intermediate outputs not being available) and the exponentially
increasing complexity of combinatorial optimization as the number of agents $n$ grows. The
reward signals also become sparser as $n$ increases, preventing us from implementing APO
directly on MAS within any manageable budget. As such, many prior works (Xia et al., 2024; Zhang
et al., 2024f) in MAS still primarily use handcrafted prompts instead of including the prompts as
optimizable components in the MAS design.
To systematically understand the influence of prompt design in MAS, we specifically and quantita-
tively analyze the effect of prompt optimization and compare its effectiveness to other operations
common in MAS literature, such as scaling with more agents but with default prompts. We conduct
APO on a chain-of-thought (Kojima et al., 2022) agent with both instruction optimization and 1-shot
exemplar optimization via MIPRO (Opsahl-Ong et al., 2024), and fairly compare the total inference to-
ken cost with self-consistency (Kojima et al., 2022), self-refine (Madaan et al., 2024), and multi-agent
debate (Du et al., 2024), where the specifications are provided in App. §B. In Fig. 2, prompting, which
equips agents with more informative instructions and exemplars, demonstrates significant advantages
in its token-effectiveness over other building blocks. Furthermore, by applying self-consistency on
top of the prompt-optimized agent, we observe an improved scaling performance on the token cost,
whereas standard approaches in scaling the number of agents (e.g., SC or Reflect) saturate much
earlier. This empirical observation sheds light on the importance of prompting while providing early
evidence for designing effective MAS: optimize agents locally before scaling their topology.
2.2. Workflow-level Search Space Design
[Figure 3: bar charts of performance (%) per method relative to the base agent. Panels: HotpotQA (CoT, SC, Ref., Deb., Sum.; annotations as printed: Base, +1%, -0%, +3%, -2%) and LiveCodeBench (CoT, SC, Ref., Deb., Exe.; annotations as printed: Base, +7%, -15%, +6%, +10%).]
Figure 3 | The performance of different topologies with Gemini 1.5 Pro compared to the base agent, with each topology being optimized with APO, where Sum. (Summarize) and Exe. (Executor) are task-specific topologies as illustrated in Fig. 4. We observe that not all topologies have a positive influence on the MAS design.
At the workflow level, the primary focus is on
orchestrating agents to achieve the best perfor-
mance effectively. As a relatively new concept
specific to MAS, topology optimization has re-
cently garnered significant attention (Li et al.,
2024c; Zhang et al., 2024b). However, while
much of the existing research emphasizes search
methods (such as discovering the most efficient
and effective way to identify the optimal configuration),
there has been less focus on the design
of search spaces, which determines the perimeter
and the scope of any search algorithm. This
imbalance draws a parallel to the historical development
of neural architecture search (NAS) (White
et al., 2023). Initially, the field concentrated on
sophisticated search methods, such as Bayesian
optimization (Kandasamy et al., 2018; Ru et al.,
2021) and differentiable search (Liu et al., 2018). Follow-up works have highlighted the often-
overlooked importance of search space design, arguing that it can be equally, if not more, critical (Wan
et al., 2022; Zhou et al., 2023, 2024c). Inspired by this insight, we hypothesize that manually crafted
topologies might be sub-optimal, and automatic topology optimization (potentially framed as a
rigorous optimization problem) can play a similarly pivotal role via judiciously designing search space
for MAS. To this end, we first define an expressive search space, similar to prior works, that consists
of the connections between the following building blocks:
• Aggregate: Agents can collaborate in parallel with diversified predictions, followed by an aggregation operator that obtains the most consistent prediction. The aggregate block can be parameterized by $N_a$ agents acting in parallel. Majority vote (Li et al., 2024a) and self-consistency (Chen et al., 2024c) sit within this topology.
• Reflect: Agents can act as verifiers, providing critiques and improvement suggestions based on former predictions. The feedback is then fed into the predictor or the reflector itself for iterative improvement. Similarly, reflect can be parameterized by $N_r$, which defines the number of rounds of self-reflection. Self-refine (Madaan et al., 2024) and Reflexion (Shinn et al., 2024) represent this block.
• Debate: Agents in debate can elicit more truthful predictions than single-agent prediction (Du et al., 2024; Liang et al., 2024), where each debating agent collects opinions from all other agents and provides an updated response. This topology involves a mixture of agents, and $N_d$ defines the number of rounds of debate.
• Custom Agents: While the former three forms of agents represent the vast majority of agent topologies, constructed as multiple parallel, serial, and mixture of agents, more versatile definitions of agents can be inserted into the MAS design space. For example, for task-specific use cases, we introduce an agent as summarize to improve the long-context capability in the customizable design space.
• Tool-use: Building towards an effective MAS, enabling agents to leverage tools to access external information is critical for system performance, such as using a retriever for RAG (Lewis et al., 2020) and an executor with test cases in coding (Chen et al., 2024d). We introduce tool-use as an optimizable binary ‘insertion’ decision $N_T \in \{0, 1\}$.
Figure 4 | Illustration of the Mass framework with its search space and the multi-stage optimization. The search space combines both prompts (Instruction, Demo) and configurable agentic building blocks (Aggregate, Reflect, Debate, Summarize, and Tool-use). 1) Block-level Prompt Optimization: we conduct block-level prompt optimization for each agentic module individually (denoted by </>); 2) Workflow Topology Optimization: conditioned on the best prompts found in Stage 1 on each agent block, Mass samples valid configurations from an influence-weighted design space while fusing the prompts of each building block from Stage 1; 3) Workflow-level Prompt Optimization: conditioned on the best workflow found in Stage 2, we again conduct workflow-level prompt optimization on the best-found MAS (topologies visualized for illustration only).
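One way to picture the building blocks listed above is as a small configuration record with one field per block parameter. The sketch below is only illustrative: the class `TopologyConfig`, its field names, and the convention that the aggregate block counts as active when more than one parallel agent is used are assumptions, not definitions from the paper. The ordering in `active_blocks` follows the rule-based sequence [summarize, reflect, debate, aggregate] used later for composing workflows.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TopologyConfig:
    """One point in the topology search space (field names are illustrative)."""
    n_aggregate: int = 1         # N_a: agents acting in parallel before aggregation
    n_reflect: int = 0           # N_r: rounds of self-reflection
    n_debate: int = 0            # N_d: rounds of debate
    use_summarize: bool = False  # custom, task-specific block
    use_tool: bool = False       # N_T in {0, 1}: binary tool insertion

    def __post_init__(self) -> None:
        if self.n_aggregate < 1 or self.n_reflect < 0 or self.n_debate < 0:
            raise ValueError("block parameters must be non-negative (N_a >= 1)")

    def active_blocks(self) -> List[str]:
        """Active blocks in the order summarize, reflect, debate, aggregate."""
        blocks = []
        if self.use_summarize:
            blocks.append("summarize")
        if self.n_reflect > 0:
            blocks.append("reflect")
        if self.n_debate > 0:
            blocks.append("debate")
        if self.n_aggregate > 1:  # assumption: a single agent needs no aggregation
            blocks.append("aggregate")
        return blocks


cfg = TopologyConfig(n_aggregate=3, n_debate=1)
single_agent = TopologyConfig()
```

A frozen dataclass keeps each sampled configuration hashable, which is convenient if evaluations are cached per configuration.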
To understand the influence of individual topologies, we report the performance of various topologies in Fig. 3. Notably, not all topologies are beneficial to MAS design, and positively influential topologies represent only a small fraction of the overall set. For example, on HotpotQA (Yang et al., 2018), only debate brings a 3% gain, while the others fail to improve or even degrade system performance. We observe similar trends in the test-output-prediction subtask of LiveCodeBench (Jain et al., 2024). This highlights the importance of searching within the influential subset of the search space, since including detrimental building blocks may not only result in higher search complexity but also degrade performance.
3. Mass: Multi-Agent System Search
Our analyses in Sec. 2 underscore the importance of well-designed prompts for individual agents and
the careful definition of the search space to achieve effective MAS performance. Building on these, we
propose a multi-stage optimization algorithm, Multi-Agent System Search (Mass), that surpasses
prior art that focused solely on optimizing workflow topology without appropriate prompt designs.
Instead, our approach demonstrates the greater effectiveness of MAS design with properly optimized
prompts and thoughtfully designed search spaces. The Mass framework is illustrated in Algorithm 1 and
Fig. 4, following an intuition from local to global, from block-level to workflow-level, that conquers
the complexity of combinatorial optimization with effective per-stage optimization detailed below.
Algorithm 1 Mass: Multi-Agent System Search
1: Input: Agentic modules in the search space $a_i \in \mathcal{A}$, workflow of agents $\mathcal{W}(a)$, prompt optimizer $\mathcal{O}$, evaluator $\mathcal{E}$, validation set $\mathcal{D}$, temperature $t$, number of candidates $N$, budget $B$.
2: Output: Optimized multi-agent system $\mathcal{W}^*$.
3: [Block-level Prompt Optimization]
4: Prompt optimization for the initial agent: $a_0^* \leftarrow \mathcal{O}_\mathcal{D}(a_0)$.
5: for $a_i$ in $\mathcal{A} \setminus \{a_0\}$ do
6:     Local prompt optimization for each building block in the design space: $a_i^* \leftarrow \mathcal{O}_\mathcal{D}(a_i \mid a_0^*)$.
7:     Obtain incremental influence: $I_{a_i} \leftarrow \mathcal{E}(a_i^*) / \mathcal{E}(a_0^*)$.
8: end for
9: [Workflow Topology Optimization]
10: Obtain the selection probability $p_a \leftarrow \mathrm{Softmax}(I_a, t)$.
11: while $n < N$ do
12:     Reject invalid configurations $c$ and cap a budget $B$. The design space is pruned by the selection probability $p_a$; $\mathcal{W}_c \leftarrow (a_i^*(\cdot), a_{i+1}^*(\cdot), \ldots)$ with optimized prompts.
13:     Store evaluations $\mathcal{E}_\mathcal{D}(\mathcal{W}_c)$ and propose new $\mathcal{W}$.
14: end while
15: Obtain the best-performing $\mathcal{W}_c^* \leftarrow \arg\max_{c \in \mathcal{C}} \mathcal{E}_\mathcal{D}(\mathcal{W}_c)$.
16: [Workflow-level Prompt Optimization]
17: Workflow-level prompt optimization for the best-performing topology: $\mathcal{W}^* \leftarrow \mathcal{O}_\mathcal{D}(\mathcal{W}_c^*)$.
18: Return optimized multi-agent system $\mathcal{W}^*$.
1) Block-level prompt optimization. Before
composing agents, we first ensure that individual agents are thoroughly optimized at the block level, as highlighted in Sec. 2.1 and Fig. 2. This step ensures that each agent is primed for its role with the most effective instructions within the most manageable computation budget. To further overcome the complexity of joint optimization on a large MAS space, we first warm up the initial predictor with single-agent APO, $a_0^* \leftarrow \mathcal{O}_\mathcal{D}(a_0)$, where both instruction and exemplars are jointly optimized with the modular prompt optimizer $\mathcal{O}$. Then, conditioning on the warmed-up predictor, we continue optimizing each topology with a minimum number of agents, $a_i^* \leftarrow \mathcal{O}_\mathcal{D}(a_i \mid a_0^*)$. For example, 2 predictors paired with 1 debater form the minimum building block for the debate topology, thereby lowering the complexity of optimization; this topology can be scaled up later with more predictors and debaters, all equipped with optimized prompts. To measure the influence of each building block, we store the validation performance once the optimization is completed. Importantly, although Stage (1) serves as the warm-up stage per building block, it is still a critical stage that guarantees that the follow-up topology optimization searches in an effective space, composing well-performing agents instead of suffering from the compounding impact of any ill-formed agents with manual prompts.
2) Workflow topology optimization. In this stage, we focus on optimizing the overall MAS structure, determining the most effective arrangement and connectivity between agents. The analysis in Fig. 3 shows that beneficial topologies only represent a small fraction of the full design space. Therefore, we aim to distill the essence of strong-performing topologies into a pruned space, thereby making the workflow-level topology search more efficient. Here, we propose to measure the incremental influence $I_{a_i} = \mathcal{E}(a_i^*) / \mathcal{E}(a_0^*)$, which quantifies the relative gain of integrating the topology $a_i$ over the initial agent $a_0$. Following the intuition that an influential dimension comes with a higher selection probability, we activate the corresponding topology dimension $a$ if $u > p_a$, given $u \sim \mathcal{U}(0, 1)$ and $p_a = \mathrm{Softmax}(I_a, t)$. To compose diverse topologies into a unified space, we constrain the workflow with a rule-based order to reduce the optimization complexity, following a predefined sequence: [summarize, reflect, debate, aggregate]. We integrate rejection sampling over the predefined design space, which rejects any deactivated dimension or invalid topology composition exceeding a maximum budget $B$ on the number of agents. We refer to App. §B for the detailed search space per task.
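To make the influence and selection-probability quantities concrete, the sketch below computes $I_a$ as a score ratio and then applies a temperature softmax across the topology dimensions. The text does not spell out the form of $\mathrm{Softmax}(I_a, t)$, so the standard definition $\exp(I_a / t) / \sum_b \exp(I_b / t)$ is assumed here; the function names and the validation scores are hypothetical and only serve as an illustration.

```python
import math
from typing import Dict


def incremental_influence(block_scores: Dict[str, float], base_score: float) -> Dict[str, float]:
    """I_a = E(a*) / E(a_0*): score of each optimized block relative to the base agent."""
    return {name: score / base_score for name, score in block_scores.items()}


def softmax_with_temperature(influence: Dict[str, float], t: float) -> Dict[str, float]:
    """Assumed form: p_a = exp(I_a / t) / sum_b exp(I_b / t), shifted by the max for stability."""
    peak = max(influence.values())
    weights = {name: math.exp((value - peak) / t) for name, value in influence.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical validation scores, for illustration only
base = 62.0
scores = {"aggregate": 63.0, "reflect": 62.0, "debate": 65.0, "summarize": 60.0}

influence = incremental_influence(scores, base)
probs_sharp = softmax_with_temperature(influence, t=0.05)
probs_flat = softmax_with_temperature(influence, t=1.0)
```

With a small temperature, small differences in influence translate into clearly separated probabilities, whereas a temperature of 1.0 leaves the distribution close to uniform.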
3) Workflow-level prompt optimization. As a final step, we treat the entire MAS design as an
integrated entity and run an additional round of prompt optimization, conditioned on the best
topology discovered in Stage (2), $\mathcal{W}^* = \mathcal{O}_\mathcal{D}(\mathcal{W}_c^*)$. It is worth noting that although prompts were
optimized at the individual level in Stage (1), this stage acts as an adaptation or fine-tuning process,
ensuring that prompts are tailored for orchestration within the MAS and that the interdependence
between agents is optimized appropriately. Our experiments (Fig. 5 & 6) demonstrate that this stage
often yields practical benefits.
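The three stages can be summarized in a short control-flow sketch. Everything below is a simplified illustration under stated assumptions: the prompt optimizer, evaluator, and workflow sampler are passed in as plain callables, influence-based pruning and budget checks are delegated to the sampler, the single-agent workflow is included as a starting candidate, and all names (`mass_search` and the toy stand-ins) are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

Prompts = Dict[str, str]    # block name -> prompt text
Workflow = Tuple[str, ...]  # ordered block names


def mass_search(
    base: str,
    blocks: List[str],
    optimize_block: Callable[[str, Prompts], str],
    sample_workflow: Callable[[random.Random], Workflow],
    evaluate: Callable[[Workflow, Prompts], float],
    optimize_workflow: Callable[[Workflow, Prompts], Prompts],
    num_candidates: int = 10,
    seed: int = 0,
) -> Tuple[Workflow, Prompts, float]:
    rng = random.Random(seed)

    # Stage 1: warm up the base predictor, then each block conditioned on it.
    prompts: Prompts = {base: optimize_block(base, {})}
    for block in blocks:
        prompts[block] = optimize_block(block, {base: prompts[base]})

    # Stage 2: sample candidate workflows and keep the best one on validation.
    best_workflow: Workflow = (base,)
    best_score = evaluate(best_workflow, prompts)
    for _ in range(num_candidates):
        candidate = sample_workflow(rng)
        score = evaluate(candidate, prompts)
        if score > best_score:
            best_workflow, best_score = candidate, score

    # Stage 3: re-optimize prompts jointly for the selected workflow.
    final_prompts = optimize_workflow(best_workflow, prompts)
    return best_workflow, final_prompts, evaluate(best_workflow, final_prompts)


# Toy stand-ins so the sketch runs end to end (not real optimizers or evaluators)
ORDER = ["summarize", "reflect", "debate", "aggregate"]


def toy_optimize_block(block: str, context: Prompts) -> str:
    return f"local:{block}"


def toy_sample(rng: random.Random) -> Workflow:
    return ("predictor", *[b for b in ORDER if rng.random() < 0.5])


def toy_evaluate(workflow: Workflow, prompts: Prompts) -> float:
    bonus = 0.5 if all(prompts[b].startswith("global:") for b in workflow) else 0.0
    return len(workflow) + bonus


def toy_optimize_workflow(workflow: Workflow, prompts: Prompts) -> Prompts:
    return {b: "global:" + prompts[b] for b in workflow}


workflow, final_prompts, score = mass_search(
    "predictor", ORDER, toy_optimize_block, toy_sample, toy_evaluate, toy_optimize_workflow
)
```

The point of the sketch is the ordering of the stages: topology candidates are only ever evaluated with prompts that were already optimized locally, and the final prompt pass sees the selected workflow as a whole.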
4. Related Work
Forms of LLM-based agentic systems. The simplest form of an LLM-based agentic system involves
a single agent that can dynamically interact and respond to the environment (Yao et al., 2023).
Recent advances endow agents with diverse roles and tools (Wu et al., 2023), orchestrating multiple
agents to cooperate with each other (Chen et al., 2024b). Standard forms of agent cooperation (i.e.,
topology) often involve parallel and serial flows of information. The parallel form usually diversifies
the exploration among many agents in parallel (Li et al., 2024a), and self-consistency (SC) (Wang
et al., 2023) is a representative way for scaling agents in parallel. The serial form aims to advance the
exploitation of a task via a chain of agents, where LLMs can serve as reflective agents to self-justify
and refine former predictions (Madaan et al., 2024; Shinn et al., 2024). Later, the opinions from
multiple agents can be summarized to retrieve the most consistent answer by an aggregation agent
(Chen et al., 2024c; Lin et al., 2024). Moreover, multi-agent debate consists of a more complex flow
of information (Chen et al., 2024a; Wang et al., 2024c; Zhang et al., 2024c), and recent research
shows that debating can elicit more truthful predictions (Du et al., 2024; Khan et al., 2024). Recent
agent topology extends beyond the above connections (Qian et al., 2024; Wang et al., 2024b), and
Mass can automatically search the best topology among the aforementioned spaces.
Automatic optimization for MAS. Recent research starts automating agent design by interpreting
agent functions as learnable policies (Zhang et al., 2024d,e) and synthesizing trajectories for agent
fine-tuning (Qiao et al., 2024). Going further from a single agent, automatic multi-agent optimization
faces a higher level of complexity, thereby requiring a more sophisticated design of search space
and algorithms. Among all recent advances in multi-agent optimization, the optimization space has
spanned prompts (Khattab et al., 2024), tools (Zhou et al., 2024d), workflows (Li et al., 2024c), and
thinking strategies (Shang et al., 2024). Aligning closer to our topology search space, DyLAN (Liu
et al., 2024b) dynamically activates the composition of agents, and Archon (Saad-Falcon et al., 2024)
frames MAS as a hyperparameter optimization problem. Neither of them has taken the important
prompt space into account, where we demonstrated the importance of prompt optimization in Sec. 2.2.
In addition, GPTSwarm (Zhuge et al., 2024) optimizes the connections between agentic nodes using a
policy gradient algorithm. State-of-the-art automatic agent design methods, ADAS (Hu et al., 2024a)
and AFlow (Zhang et al., 2024b), also attempt to optimize agentic workflows with advanced search
algorithms and LLM as optimizers. However, we observe that the importance of proper prompt designs
has been relatively under-studied in these prior works.
5. Experiments
Models and evaluation data. Aside from the common benchmarks used for automating MAS (Hu et al., 2024a; Zhang et al., 2024b), we conduct experiments on an extensive collection of tasks: 1) Hendrycks' MATH (Hendrycks et al., 2021) and DROP (Dua et al., 2019) for reasoning; 2) HotpotQA (Yang et al., 2018), MuSiQue (Trivedi et al., 2022), and 2WikiMultiHopQA (Ho et al., 2020) from LongBench (Bai et al., 2024) for long-context understanding; 3) MBPP (Austin et al., 2021), HumanEval (Chen et al., 2021), and LiveCodeBench (LCB) ‘test output prediction’ (Jain et al., 2024) for coding. We refer to App. §B & §D for details on data splits and prompt templates. We run all experiments primarily on two Gemini 1.5 model sizes (Reid et al., 2024) (gemini-1.5-pro-002 and gemini-1.5-flash-002) and further validate key findings on Claude 3.5 Sonnet (@20240620) (Anthropic, 2024).
Table 1 | Results on the evaluation set with Gemini 1.5 Pro and Gemini 1.5 Flash. We report the mean and standard deviation (mean ± std) for all results with 3 runs of evaluations. We report the accuracy (%) for MATH and the test-output-prediction subtask of LiveCodeBench (LCB), F1 score for DROP, HotpotQA, MuSiQue, and 2WikiMQA, and pass@1 for MBPP and HumanEval. We note that the meta-prompt of AFlow* only works properly with Claude 3.5 Sonnet. Therefore, we reproduce AFlow with Gemini 1.5 Pro as the executor and Claude 3.5 Sonnet as the optimizer, where * indicates the results are only for reference. The number of agents in inference is below 10 for all methods.
Task groups: Reasoning (MATH, DROP); Multi-hop Long-context (HotpotQA, MuSiQue, 2WikiMQA); Coding (MBPP, HumanEval, LCB).
Gemini-1.5-pro-002
Method | MATH | DROP | HotpotQA | MuSiQue | 2WikiMQA | MBPP | HumanEval | LCB | Avg.
CoT | 71.67 ± 3.30 | 70.59 ± 1.67 | 57.43 ± 0.52 | 37.81 ± 1.43 | 63.39 ± 1.12 | 68.33 ± 0.47 | 86.67 ± 0.94 | 66.33 ± 0.62 | 65.28
Self-Consistency | 77.33 ± 1.25 | 74.06 ± 0.90 | 58.60 ± 2.19 | 41.81 ± 1.00 | 67.79 ± 1.19 | 69.50 ± 0.71 | 86.00 ± 0.82 | 70.33 ± 0.94 | 68.18
Self-Refine | 79.67 ± 2.36 | 71.03 ± 1.31 | 60.62 ± 3.33 | 42.15 ± 1.34 | 66.74 ± 2.43 | 63.67 ± 0.24 | 84.00 ± 1.63 | 67.33 ± 1.31 | 66.90
Multi-Agent Debate | 78.67 ± 0.94 | 71.78 ± 0.71 | 64.87 ± 0.23 | 46.00 ± 0.80 | 71.78 ± 0.63 | 68.67 ± 0.85 | 86.67 ± 1.25 | 73.67 ± 1.65 | 70.26
ADAS | 80.00 ± 0.82 | 72.96 ± 0.90 | 65.88 ± 1.29 | 41.95 ± 1.24 | 71.14 ± 0.66 | 73.00 ± 1.08 | 87.67 ± 1.70 | 65.17 ± 1.25 | 69.72
AFlow* | 76.00 ± 0.82 | 88.92 ± 0.63 | 68.62 ± 0.47 | 32.05 ± 1.29 | 76.51 ± 1.05 | - | 88.00 ± 0.00 | - | -
Mass (Ours) | 84.67 ± 0.47 | 90.52 ± 0.64 | 69.91 ± 1.11 | 51.40 ± 0.42 | 73.34 ± 0.67 | 86.50 ± 0.41 | 91.67 ± 0.47 | 82.33 ± 0.85 | 78.79
Gemini-1.5-flash-002
Method | MATH | DROP | HotpotQA | MuSiQue | 2WikiMQA | MBPP | HumanEval | LCB | Avg.
CoT | 66.67 ± 2.36 | 71.79 ± 0.69 | 57.82 ± 1.10 | 37.10 ± 1.35 | 63.40 ± 0.68 | 63.33 ± 1.25 | 75.67 ± 1.89 | 51.17 ± 0.24 | 60.87
Self-Consistency | 69.33 ± 1.25 | 73.42 ± 0.19 | 60.19 ± 1.01 | 41.94 ± 0.93 | 67.98 ± 0.72 | 63.67 ± 0.62 | 77.67 ± 1.89 | 53.83 ± 1.18 | 63.50
Self-Refine | 71.33 ± 0.94 | 73.71 ± 1.09 | 58.84 ± 3.04 | 41.21 ± 1.99 | 65.56 ± 1.57 | 63.33 ± 1.25 | 81.67 ± 1.89 | 52.00 ± 1.41 | 63.46
Multi-Agent Debate | 71.67 ± 0.94 | 74.79 ± 0.87 | 64.17 ± 1.69 | 46.27 ± 1.33 | 72.19 ± 0.54 | 63.00 ± 0.71 | 79.67 ± 1.25 | 55.50 ± 0.41 | 65.91
ADAS | 68.00 ± 1.41 | 75.95 ± 1.18 | 61.36 ± 2.89 | 48.81 ± 1.03 | 66.90 ± 1.00 | 65.83 ± 0.24 | 80.67 ± 2.49 | 50.50 ± 1.63 | 64.75
Mass (Ours) | 81.00 ± 2.45 | 91.68 ± 0.14 | 66.53 ± 0.38 | 43.67 ± 1.21 | 76.69 ± 0.50 | 78.00 ± 0.82 | 84.67 ± 0.47 | 72.17 ± 0.85 | 74.30
Baselines. We consider the following baselines: 1) CoT (Kojima et al., 2022): direct chain-of-thought
reasoning via zero-shot prompting; 2) CoT-SC (Wang et al., 2023): with self-consistency to find the
most consistent answers from diversified reasoning traces; 3) Self-Refine (Madaan et al., 2024; Shinn
et al., 2024): reflective agents to verify and self-refine predictions; 4) Multi-Agent Debate (Du et al.,
2024; Liang et al., 2024): with agents justifying answers and aggregating information from other
agents; 5) ADAS (Hu et al., 2024a): an automatic agent design framework, where an LLM-based
meta-agent iteratively proposes new agents based on former evaluations; 6) AFlow (Zhang et al.,
2024b): automatic workflow design via Monte Carlo Tree Search over a set of pre-defined operators.
We fairly compare all baselines by limiting the maximum number of agents to 10. We refer to App. §B
for all specifications.
[Figure 5, left panel: Performance (%) per stage: CoT 63.5, APO 68.0, 1PO 74.2, 2TO 77.3, 3PO 78.8.
Right panel: Ablation, with bars for w/o PO, w/o Prune, and PO TO.]
Figure 5 | Left: average performance per optimization stage of Mass over 8 evaluation tasks
on Gemini 1.5 Pro. We compare Mass with a single agent (CoT) starting point as the reference
and an APO baseline that optimizes over the single agent by MIPROv2 (Opsahl-Ong et al., 2024).
Refer to App. §C for the detailed ablation per task. Right: a comparative ablation study on topology
optimization (2TO) without pruning and without the former stage of prompt optimization (1PO)
evaluated on HotpotQA.
[Figure 6: F1 (%) versus validation round for ADAS, AFlow, and MASS (Ours). Annotations on the
plot: Better Prompt; 1PO: Aggregate; 1PO: Debate; 2TO: More parallel agents involved; 3PO:
Workflow-level PO.]
Figure 6 | The optimization trajectories of Mass compared to automatic agent design baselines
per validation round on DROP. We note that, as a distinct advantage of Mass, the optimization
within stages (1) & (2) of Mass can be completely parallelized, whereas ADAS and AFlow are
iterative algorithms that have to wait to propose new agents until finishing earlier trajectories.
Setup. Mass integrates the state-of-the-art prompt optimizer, MIPRO (Opsahl-Ong et al., 2024),
which optimizes both instructions and demonstrations for each agent via a Bayesian surrogate model.
We limit the number of bootstrapped demonstrations to 3 and instruction candidates to 10, per agent
in 10 rounds. In topology optimization for all tasks, we search for 10 different topologies via rejection
sampling. Along with topology optimization, each topology is evaluated on the validation set 3 times
to stabilize the prediction. The optimized MAS is then reported on the held-out test set over three
runs. We set model temperature 𝑇 at 0.7, maximum output tokens at 4096, and the 𝑡 in Softmax at
0.05 for sharpening the selection probability 𝑝_𝑎 for each search dimension. We implement the same
LLM backbone as both evaluator and optimizer in all phases.
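The Setup paragraph mentions a Softmax temperature 𝑡 = 0.05 that sharpens the per-dimension selection probability, and rejection sampling of 10 topologies. The sketch below shows one way these two pieces could fit together. It rests on several assumptions that are not stated in this section: the per-dimension scores are made-up numbers, each dimension is activated with its Softmax probability, an inactive dimension falls back to its smallest value, and the agent budget is approximated by the sum of the chosen values. None of the function names come from the authors' code.

```python
import math
import random


def sharpened_softmax(scores, t=0.05):
    """Softmax over per-dimension scores; a lower t gives a sharper distribution."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp((v - m) / t) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


def sample_topologies(scores, options, n=10, max_agents=10, t=0.05,
                      seed=0, max_tries=100_000):
    """Rejection-sample up to n distinct configurations within a budget."""
    rng = random.Random(seed)
    probs = sharpened_softmax(scores, t)
    found = []
    for _ in range(max_tries):
        if len(found) == n:
            break
        # Assumption: activate a dimension with probability p_a,
        # otherwise keep its smallest allowed value.
        cfg = {
            dim: rng.choice(vals) if rng.random() < probs[dim] else min(vals)
            for dim, vals in options.items()
        }
        # Hypothetical budget proxy: the sum of the chosen values.
        if sum(cfg.values()) > max_agents or cfg in found:
            continue  # reject and resample
        found.append(cfg)
    return found


if __name__ == "__main__":
    # Made-up scores for illustration only.
    scores = {"Aggregate": 1.10, "Reflect": 1.00, "Debate": 1.05}
    options = {
        "Aggregate": [1, 3, 5, 7, 9],
        "Reflect": [0, 1, 2, 3, 4],
        "Debate": [0, 1, 2, 3, 4],
    }
    for cfg in sample_topologies(scores, options):
        print(cfg)
```

With a temperature as low as 0.05, small differences between scores translate into large differences in selection probability, so sampling concentrates on the dimensions that scored best.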
Main results. We present the main results of Mass compared to the baselines on the evaluation
set in Table 1. Mass yields substantial gains over common forms of multi-agent systems (e.g., self-
consistency, self-refine, and multi-agent debate) that scale up without optimizing prompts for agents
in collaboration. Mass leads to high-performing MAS: 78.8% and 74.3% on average on Gemini
1.5 Pro and Flash, respectively, and we observe consistent improvements on Claude 3.5 Sonnet, as
reported in Table 4. Comparing Mass with the state-of-the-art automatic agent design baselines, ADAS
and AFlow, we first notice that ADAS brings only subtle gains, even though its meta-agent generation
is already conditioned on the common forms of agents. The meta-agent keeps proposing complex
topologies but without optimizing the prompt design. AFlow, on the other hand, demonstrates
performance competitive with Mass, especially on 2WikiMQA and HumanEval.
We attribute the performance of AFlow to: 1) its ‘expansion’ phase that generates new nodes
based on an error log that contrasts the predictions with the ground truth, which provides implicit
textual gradients (Pryzant et al., 2023) to reflect on any formatting errors in prompt design; 2) a more
refined search space within a pre-defined set of operators. Though AFlow, like Mass, recognizes
the importance of search space design, it still lacks a prompt optimization phase to
optimize its pre-defined operators properly, which leaves its MAS search results under-performing
on MATH and MuSiQue. Different from these baselines, the consistent improvements brought by
Mass highlight the importance of searching in both the prompt and topology design spaces.
[Figure 7 content]
(1) Block-level Prompt Optimization (62% → 79%)
(2) Workflow Topology Optimization (79% → 83%)
(3) Workflow-level Prompt Optimization (83% → 85%)
[Topology diagrams built from Predictor (P), Debator (D), and Aggregator (A) nodes.]
Predictor:
Let's think step by step to solve the given problem. Clearly explain your reasoning process, showing
all intermediate calculations and justifications. Express your final answer as a single numerical value
or simplified expression enclosed within <answer></answer> tags. Avoid extraneous text or
explanations outside of the core reasoning and final answer.
<Task Demo: Exemplar_1>
Debator:
You are a seasoned math professor specializing in clear and concise
explanations. You are reviewing student solutions to math problems.
Below, you will find the problem, followed by solutions from several
students. Carefully examine each student's solution, identifying any
errors in their logic or calculations. Provide a comprehensive
rationale explaining your analysis of each student's work, clearly
stating whether their final answer is correct or incorrect and why.
Finally, provide your own definitive and simplified solution to the
problem, ensuring its accuracy and clarity. Present your final answer
bracketed between <answer> and </answer> at the end.
Question: Compute $17^{-1}\\pmod{83}$.
Solutions: Agent 0: 44\nAgent 1: 74
Rationale: <Rationale>
Answer: 44
<Task Demo: Exemplar_2>
<Task Demo: Exemplar_3>
Figure 7 | A demonstration of the optimization trajectory of Mass on MATH. In (1) block-level
optimization, multi-agent debate serves as the best-performing topology. In (2) workflow topology
optimization, aggregating with more parallel agents outweighs the performance of agents in debate.
Lastly, (3) workflow-level optimization discovers the optimal prompt conditioned on the best topology.
Ablating optimization stages. To understand the incremental gain per Mass optimization stage,
we provide a stage-by-stage ablation study in Fig. 5. We list the average performance of Mass from
block-level to workflow-level optimization and compare it with a single-agent APO baseline, where
the block-level optimization performance indicates the best-performing building block 𝑎 ∈ A after
APO. First, we notice a large gain, 6% on average, between block-level optimization
and single-agent optimization, showing that MAS benefits substantially from having its agents
optimized inside the building block. In addition, going from Stage (1) to (2), another 3% gain can
be achieved by composing influential topologies while searching for the optimal configurations. Here,
we provide an additional ablation on conducting Stage (2) without prompt optimization beforehand
or without search space pruning. Fig. 5 (right) shows that both of them are critical for effective
search space exploration. Lastly, Mass obtains further gains (∼2%) by conducting workflow-level
prompt optimization on the best-found topology, which indicates that optimizing the prompts towards
modeling the interdependence of agents is beneficial in the MAS design.
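The per-stage gains quoted in the ablation paragraph can be checked against the average scores shown in Fig. 5 (left). The short Python sketch below does the subtraction. Pairing each value with a stage label in the order they appear in the figure is an assumption of this sketch.

```python
# Average performance (%) per stage, read off Fig. 5 (left) in order.
STAGES = [
    ("CoT", 63.5),
    ("APO", 68.0),
    ("1PO", 74.2),
    ("2TO", 77.3),
    ("3PO", 78.8),
]


def stage_gains(stages):
    """Return (from_stage, to_stage, gain) for each pair of consecutive stages."""
    return [
        (a, b, round(score_b - score_a, 1))
        for (a, score_a), (b, score_b) in zip(stages, stages[1:])
    ]


if __name__ == "__main__":
    for src, dst, gain in stage_gains(STAGES):
        print(f"{src} -> {dst}: +{gain:.1f} points")
```

The differences from APO to 1PO, from 1PO to 2TO, and from 2TO to 3PO are 6.2, 3.1, and 1.5 points, which match the roughly 6%, 3%, and 2% gains discussed in the text.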
Cost-effectiveness of Mass. We analyze the cost-effectiveness of Mass. In particular,
we visualize the optimization trajectory of Mass in Fig. 6. Mass’s trajectory demonstrates
a steady trend of optimization that gradually improves the validation performance by interleaving
the search towards better prompts and topologies. In contrast, among the automatic design
baselines without explicit prompt optimization stages, AFlow is exposed to a larger variance in its
optimization due to the nature of MCTS, whereas ADAS gets trapped in discovering over-complex
topologies, a direction that appears to be less effective than searching the prompt design space.
Overall, the optimization trajectory of Mass highlights the importance of optimizing in an effective
design space, where interleaved optimization further resolves the complexity with more consecutive
rewards. Following Sec. 2.1, Mass also demonstrates advanced token-effectiveness; see Fig. 9.
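The caption of Fig. 6 notes that the optimization within Stages (1) and (2) can be parallelized, since candidates there do not depend on each other's results. The following minimal sketch shows that pattern with Python's standard library. The `evaluate` function is a hypothetical placeholder for a validation run and returns a dummy score. It is not part of Mass.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate(candidate):
    """Hypothetical stand-in for scoring one candidate on the validation set."""
    # A real evaluator would query the LLM backbone; this returns a dummy score.
    return sum(candidate.values()) / 10.0


def evaluate_in_parallel(candidates, max_workers=4):
    """Score independent candidates concurrently and return the best one."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(evaluate, candidates))  # order matches the input
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores


if __name__ == "__main__":
    candidates = [
        {"aggregate": 1, "debate": 0},
        {"aggregate": 5, "debate": 1},
        {"aggregate": 3, "debate": 0},
    ]
    best, scores = evaluate_in_parallel(candidates)
    print(best, scores)
```

Threads suit this case because an LLM call is dominated by network latency. An iterative method that must see earlier results before proposing the next candidate cannot be batched in this way.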
Best-found MAS architectures & Design principles. We further inspect an example of optimized
prompts and the trajectory of Mass in discovering more effective topologies in Fig. 7. The optimization
starts from a zero-shot CoT agent, and soon Mass in Stage (1) identifies the high-performing topology
in debate with its optimized prompt. However, as found in Stage (2), aggregating with more parallel
agents actually outweighs the multi-agent debate. Workflow-level prompt optimization then leads
to the best-performing predictor for aggregation. The overall optimization flow sheds light on our
guidelines for building effective MAS: 1) optimizing individual agents properly is important before
composing them into an MAS; 2) more effective MAS can be built by composing influential topologies;
and 3) modeling the interdependence between agents is beneficial, and can be achieved via workflow-
level joint optimization.
6. Conclusion
We approach designing effective MAS by first conducting a thorough analysis of the massive design
space, revealing the crucial role of prompts, and identifying an influential subset of search space.
Building on these findings, we introduce Mass, a novel multi-stage optimization framework that
searches within a pruned design space, interleaving prompt and topology optimization to efficiently
generate high-performing MAS. Our experiments demonstrate that Mass-optimized MAS significantly
outperforms existing manual and automated approaches across an extensive set of tasks. Finally,
based on the optimized systems discovered by Mass, we extract valuable design principles to guide
the development of future effective LLM-based MAS.
Acknowledgment
We thank Jinsung Yoon and all other colleagues from Google Cloud AI Research for their valuable
feedback.
References
Anthropic. The claude 3 model family: Opus, sonnet, haiku. 2024.
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le,
et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 , 2021.
Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y. Dong, J. Tang,
and J. Li. LongBench: A bilingual, multitask benchmark for long context understanding. In L.-W.
Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers) , pages 3119–3137, Bangkok, Thailand,
Aug. 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.172. URL
https://aclanthology.org/2024.acl-long.172/ .
J. Chen, S. Saha, and M. Bansal. ReConcile: Round-table conference improves reasoning via consensus
among diverse LLMs. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages
7066–7085, Bangkok, Thailand, Aug. 2024a. Association for Computational Linguistics. doi:
10.18653/v1/2024.acl-long.381. URL https://aclanthology.org/2024.acl-long.381/ .
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda,
N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint
arXiv:2107.03374 , 2021.
W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C.-M. Chan, H. Yu, Y. Lu, Y.-H. Hung, C. Qian, Y. Qin,
X. Cong, R. Xie, Z. Liu, M. Sun, and J. Zhou. Agentverse: Facilitating multi-agent collaboration and
exploring emergent behaviors. In The Twelfth International Conference on Learning Representations ,
2024b. URL https://openreview.net/forum?id=EHg5GDnyq1 .
X. Chen, R. Aksitov, U. Alon, J. Ren, K. Xiao, P. Yin, S. Prakash, C. Sutton, X. Wang, and D. Zhou.
Universal self-consistency for large language models. In ICML 2024 Workshop on In-Context Learning ,
2024c. URL https://openreview.net/forum?id=LjsjHF7nAN .
X. Chen, M. Lin, N. Schärli, and D. Zhou. Teaching large language models to self-debug. In The
Twelfth International Conference on Learning Representations , 2024d. URL https://openreview.
net/forum?id=KuPixIqPiq .
Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch. Improving factuality and reasoning in
language models through multiagent debate. In Forty-first International Conference on Machine
Learning, ICML 2024, Vienna, Austria, July 21-27, 2024 . OpenReview.net, 2024. URL https:
//openreview.net/forum?id=zj7YuTE4t8 .
D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension
benchmark requiring discrete reasoning over paragraphs. In J. Burstein, C. Doran, and T. Solorio,
editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages
2368–2378, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi:
10.18653/v1/N19-1246. URL https://aclanthology.org/N19-1246/ .
S. Guo, C. Deng, Y. Wen, H. Chen, Y. Chang, and J. Wang. Ds-agent: Automated data science by
empowering large language models with case-based reasoning, 2024. URL https://arxiv.org/
abs/2402.17453 .
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt.
Measuring mathematical problem solving with the math dataset. NeurIPS , 2021. URL https:
//openreview.net/forum?id=7Bywt2mQsCe .
X. Ho, A.-K. Duong Nguyen, S. Sugawara, and A. Aizawa. Constructing a multi-hop QA dataset for
comprehensive evaluation of reasoning steps. In D. Scott, N. Bel, and C. Zong, editors, Proceedings
of the 28th International Conference on Computational Linguistics , pages 6609–6625, Barcelona,
Spain (Online), Dec. 2020. International Committee on Computational Linguistics. doi: 10.18653/
v1/2020.coling-main.580. URL https://aclanthology.org/2020.coling-main.580/ .
S. Hu, C. Lu, and J. Clune. Automated design of agentic systems. arXiv preprint arXiv:2408.08435 ,
2024a.
X. Hu, Z. Zhao, S. Wei, Z. Chai, Q. Ma, G. Wang, X. Wang, J. Su, J. Xu, M. Zhu, Y. Cheng, J. Yuan,
J. Li, K. Kuang, Y. Yang, H. Yang, and F. Wu. Infiagent-dabench: Evaluating agents on data analysis
tasks, 2024b. URL https://arxiv.org/abs/2401.05507 .
N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica.
Livecodebench: Holistic and contamination free evaluation of large language models for code.
arXiv preprint arXiv:2403.07974 , 2024.
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can
language models resolve real-world github issues? arXiv preprint arXiv:2310.06770 , 2023.
K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing. Neural architecture search with
bayesian optimisation and optimal transport. Advances in neural information processing systems , 31,
2018.
A. Khan, J. Hughes, D. Valentine, L. Ruis, K. Sachan, A. Radhakrishnan, E. Grefenstette, S. R. Bowman,
T. Rocktäschel, and E. Perez. Debating with more persuasive LLMs leads to more truthful answers.
In Forty-first International Conference on Machine Learning , 2024. URL https://openreview.
net/forum?id=iLCZtl7FTa .
O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. V. A, S. Haq, A. Sharma, T. T. Joshi,
H. Moazam, H. Miller, M. Zaharia, and C. Potts. DSPy: Compiling declarative language model calls
into state-of-the-art pipelines. In The Twelfth International Conference on Learning Representations ,
2024. URL https://openreview.net/forum?id=sY5N0zY5Od .
T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot
reasoners. Advances in neural information processing systems , 35:22199–22213, 2022.
P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih,
T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances
in Neural Information Processing Systems , 33:9459–9474, 2020.
J. Li, Q. Zhang, Y. Yu, Q. FU, and D. Ye. More agents is all you need. Transactions on Machine Learning
Research , 2024a. ISSN 2835-8856. URL https://openreview.net/forum?id=bgzUSZ8aeg .
M. Li, S. Zhao, Q. Wang, K. Wang, Y. Zhou, S. Srivastava, C. Gokmen, T. Lee, L. E. Li, R. Zhang, W. Liu,
P. Liang, L. Fei-Fei, J. Mao, and J. Wu. Embodied agent interface: Benchmarking llms for embodied
decision making, 2025. URL https://arxiv.org/abs/2410.07166 .
Y. Li, Y. Du, J. Zhang, L. Hou, P. Grabowski, Y. Li, and E. Ie. Improving multi-agent debate with sparse
communication topology. In Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, editors, Findings of the
Association for Computational Linguistics: EMNLP 2024 , pages 7281–7294, Miami, Florida, USA, Nov.
2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.427.
URL https://aclanthology.org/2024.findings-emnlp.427/ .
Z. Li, S. Xu, K. Mei, W. Hua, B. Rama, O. Raheja, H. Wang, H. Zhu, and Y. Zhang. Autoflow: Automated
workflow generation for large language model agents. arXiv preprint arXiv:2407.12821 , 2024c.
T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang, Y. Yang, S. Shi, and Z. Tu. Encouraging divergent
thinking in large language models through multi-agent debate. In Y. Al-Onaizan, M. Bansal, and
Y.-N. Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language
Processing , pages 17889–17904, Miami, Florida, USA, Nov. 2024. Association for Computational
Linguistics. doi: 10.18653/v1/2024.emnlp-main.992. URL https://aclanthology.org/2024.
emnlp-main.992/ .
L. Lin, J. Fu, P. Liu, Q. Li, Y. Gong, J. Wan, F. Zhang, Z. Wang, D. Zhang, and K. Gai. Just ask one more
time! self-agreement improves reasoning of language models in (almost) all scenarios. In L.-W.
Ku, A. Martins, and V. Srikumar, editors, Findings of the Association for Computational Linguistics:
ACL 2024 , pages 3829–3852, Bangkok, Thailand, Aug. 2024. Association for Computational
Linguistics. doi: 10.18653/v1/2024.findings-acl.230. URL https://aclanthology.org/2024.
findings-acl.230/ .
F. Liu, N. AlDahoul, G. Eady, Y. Zaki, B. AlShebli, and T. Rahwan. Self-reflection outcome is sensitive
to prompt construction. arXiv preprint arXiv:2406.10400 , 2024a.
H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. arXiv preprint
arXiv:1806.09055 , 2018.
Z. Liu, Y. Zhang, P. Li, Y. Liu, and D. Yang. A dynamic LLM-powered agent network for task-
oriented agent collaboration. In First Conference on Language Modeling , 2024b. URL https:
//openreview.net/forum?id=XII0Wp1XA9 .
A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye,
Y. Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information
Processing Systems , 36, 2024.
K. Opsahl-Ong, M. J. Ryan, J. Purtell, D. Broman, C. Potts, M. Zaharia, and O. Khattab. Opti-
mizing instructions and demonstrations for multi-stage language model programs. In Y. Al-
Onaizan, M. Bansal, and Y.-N. Chen, editors, Proceedings of the 2024 Conference on Empirical
Methods in Natural Language Processing , pages 9340–9366, Miami, Florida, USA, Nov. 2024.
Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.525. URL
https://aclanthology.org/2024.emnlp-main.525/ .
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama,
A. Ray, et al. Training language models to follow instructions with human feedback. Advances in
neural information processing systems , 35:27730–27744, 2022.
R. Pryzant, D. Iter, J. Li, Y. Lee, C. Zhu, and M. Zeng. Automatic prompt optimization with “gradient
descent” and beam search. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023
Conference on Empirical Methods in Natural Language Processing , pages 7957–7968, Singapore, Dec.
2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.494. URL
https://aclanthology.org/2023.emnlp-main.494/ .
C. Qian, Z. Xie, Y. Wang, W. Liu, Y. Dang, Z. Du, W. Chen, C. Yang, Z. Liu, and M. Sun. Scaling
large-language-model-based multi-agent collaboration. arXiv preprint arXiv:2406.07155 , 2024.
S. Qiao, N. Zhang, R. Fang, Y. Luo, W. Zhou, Y. Jiang, C. Lv, and H. Chen. AutoAct: Automatic agent
learning from scratch for QA via self-planning. In L.-W. Ku, A. Martins, and V. Srikumar, editors,
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers) , pages 3003–3021, Bangkok, Thailand, Aug. 2024. Association for Computational
Linguistics. doi: 10.18653/v1/2024.acl-long.165. URL https://aclanthology.org/2024.
acl-long.165/ .
M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. P. Lillicrap, J. Alayrac, R. Soricut, A. Lazaridou,
O. Firat, J. Schrittwieser, I. Antonoglou, R. Anil, S. Borgeaud, A. M. Dai, K. Millican, E. Dyer,
M. Glaese, T. Sottiaux, B. Lee, F. Viola, M. Reynolds, Y. Xu, J. Molloy, J. Chen, M. Isard, P. Barham,
T. Hennigan, R. McIlroy, M. Johnson, J. Schalkwyk, E. Collins, E. Rutherford, E. Moreira, K. Ayoub,
M. Goel, C. Meyer, G. Thornton, Z. Yang, H. Michalewski, Z. Abbas, N. Schucher, A. Anand,
R. Ives, J. Keeling, K. Lenc, S. Haykal, S. Shakeri, P. Shyam, A. Chowdhery, R. Ring, S. Spencer,
E. Sezener, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens
of context. CoRR, abs/2403.05530, 2024. doi: 10.48550/ARXIV.2403.05530. URL https:
//doi.org/10.48550/arXiv.2403.05530 .
B. Ru, X. Wan, X. Dong, and M. Osborne. Interpretable neural architecture search via bayesian
optimisation with weisfeiler-lehman kernels. International Conference on Learning Representations
(ICLR), 2021.
J. Saad-Falcon, A. G. Lafuente, S. Natarajan, N. Maru, H. Todorov, E. Guha, E. K. Buchanan, M. Chen,
N. Guha, C. Ré, et al. Archon: An architecture search framework for inference-time techniques.
arXiv preprint arXiv:2409.15254 , 2024.
Y. Shang, Y. Li, K. Zhao, L. Ma, J. Liu, F. Xu, and Y. Li. Agentsquare: Automatic llm agent search in
modular design space. arXiv preprint arXiv:2410.06153 , 2024.
N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with
verbal reinforcement learning. Advances in Neural Information Processing Systems , 36, 2024.
A. Singh, A. Ehtesham, S. Kumar, and T. T. Khoei. Agentic retrieval-augmented generation: A survey
on agentic rag. arXiv preprint arXiv:2501.09136 , 2025.
H. Su, R. Sun, J. Yoon, P. Yin, T. Yu, and S. Ö. Arık. Learn-by-interact: A data-centric framework for
self-adaptive agents in realistic environments. arXiv preprint arXiv:2501.10893 , 2025.
H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal. MuSiQue: Multihop questions via single-
hop question composition. Transactions of the Association for Computational Linguistics , 10:539–554,
2022. doi: 10.1162/tacl_a_00475. URL https://aclanthology.org/2022.tacl-1.31/ .
M. Verma, S. Bhambri, and S. Kambhampati. On the brittle foundations of react prompting for agentic
large language models. arXiv preprint arXiv:2405.13966 , 2024.
X. Wan, B. Ru, P. M. Esperança, and Z. Li. On redundancy and diversity in cell-based neural
architecture search. In International Conference on Learning Representations , 2022. URL https:
//openreview.net/forum?id=rFJWoYoxrDB .
X. Wan, R. Sun, H. Nakhost, and S. O. Arik. Teach better or show smarter? on instructions and
exemplars in automatic prompt optimization. In The Thirty-eighth Annual Conference on Neural In-
formation Processing Systems , 2024. URL https://openreview.net/forum?id=IdtoJVWVnX .
X. Wan, H. Zhou, R. Sun, H. Nakhost, K. Jiang, and S. Ö. Arık. From few to many: Self-improving many-
shot reasoners through iterative optimization and generation. arXiv preprint arXiv:2502.00330 ,
2025.
F. Wang, X. Wan, R. Sun, J. Chen, and S. Ö. Arık. Astute rag: Overcoming imperfect retrieval
augmentation and knowledge conflicts for large language models. arXiv preprint arXiv:2410.07176 ,
2024a.
J. Wang, J. Wang, B. Athiwaratkun, C. Zhang, and J. Zou. Mixture-of-agents enhances large language
model capabilities. arXiv preprint arXiv:2406.04692 , 2024b.
Q. Wang, Z. Wang, Y. Su, H. Tong, and Y. Song. Rethinking the bounds of LLM reasoning: Are
multi-agent discussions the key? In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of
the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) ,
pages 6106–6131, Bangkok, Thailand, Aug. 2024c. Association for Computational Linguistics. doi:
10.18653/v1/2024.acl-long.331. URL https://aclanthology.org/2024.acl-long.331/ .
X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-
consistency improves chain of thought reasoning in language models. In The Eleventh International
Conference on Learning Representations , 2023. URL https://openreview.net/forum?id=
1PL1NIMMrw .
C. White, M. Safari, R. Sukthanker, B. Ru, T. Elsken, A. Zela, D. Dey, and F. Hutter. Neural architecture
search: Insights from 1000 papers. arXiv preprint arXiv:2301.08727 , 2023.
Q. Wu, G. Bansal, J. Zhang, Y. Wu, S. Zhang, E. Zhu, B. Li, L. Jiang, X. Zhang, and C. Wang.
Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv
preprint arXiv:2308.08155 , 2023.
C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering
agents. arXiv preprint arXiv:2407.01489 , 2024.
Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning. HotpotQA: A
dataset for diverse, explainable multi-hop question answering. In E. Riloff, D. Chiang, J. Hock-
enmaier, and J. Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Nat-
ural Language Processing , pages 2369–2380, Brussels, Belgium, Oct.-Nov. 2018. Association for
Computational Linguistics. doi: 10.18653/v1/D18-1259. URL https://aclanthology.org/
D18-1259/ .
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. React: Synergizing reasoning
and acting in language models. In The Eleventh International Conference on Learning Representations ,
2023. URL https://openreview.net/forum?id=WE_vluYUL-X .
G. Zhang, Y. Yue, Z. Li, S. Yun, G. Wan, K. Wang, D. Cheng, J. X. Yu, and T. Chen. Cut the
crap: An economical communication pipeline for llm-based multi-agent systems. arXiv preprint
arXiv:2410.02506 , 2024a.
J. Zhang, J. Xiang, Z. Yu, F. Teng, X. Chen, J. Chen, M. Zhuge, X. Cheng, S. Hong, J. Wang, et al.
Aflow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762 , 2024b.
J. Zhang, X. Xu, N. Zhang, R. Liu, B. Hooi, and S. Deng. Exploring collaboration mechanisms for LLM
agents: A social psychology view. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the
62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages
14544–14607, Bangkok, Thailand, Aug. 2024c. Association for Computational Linguistics. doi:
10.18653/v1/2024.acl-long.782. URL https://aclanthology.org/2024.acl-long.782/ .
S. Zhang, J. Zhang, J. Liu, L. Song, C. Wang, R. Krishna, and Q. Wu. Offline training of language
model agents with functions as learnable weights. In Forty-first International Conference on Machine
Learning , 2024d. URL https://openreview.net/forum?id=2xbkWiEuR1 .
W. Zhang, K. Tang, H. Wu, M. Wang, Y. Shen, G. Hou, Z. Tan, P. Li, Y. Zhuang, and W. Lu. Agent-
pro: Learning to evolve via policy-level reflection and optimization. In L.-W. Ku, A. Martins,
and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Com-
putational Linguistics (Volume 1: Long Papers) , pages 5348–5375, Bangkok, Thailand, Aug.
2024e. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.292. URL
https://aclanthology.org/2024.acl-long.292/ .
Y. Zhang, R. Sun, Y. Chen, T. Pfister, R. Zhang, and S. O. Arik. Chain of agents: Large language models
collaborating on long-context tasks. In The Thirty-eighth Annual Conference on Neural Information
Processing Systems , 2024f. URL https://openreview.net/forum?id=LuCLf4BJsr .
H. Zhou, X. Wan, I. Vulić, and A. Korhonen. Survival of the most influential prompts: Efficient black-
box prompt search via clustering and pruning. In H. Bouamor, J. Pino, and K. Bali, editors, Findings
of the Association for Computational Linguistics: EMNLP 2023 , pages 13064–13077, Singapore, Dec.
2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.870. URL
https://aclanthology.org/2023.findings-emnlp.870/ .
H. Zhou, X. Wan, Y. Liu, N. Collier, I. Vulić, and A. Korhonen. Fairer preferences elicit improved
human-aligned large language model judgments. In Y. Al-Onaizan, M. Bansal, and Y.-N. Chen,
editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing ,
pages 1241–1252, Miami, Florida, USA, Nov. 2024a. Association for Computational Linguistics. doi:
10.18653/v1/2024.emnlp-main.72. URL https://aclanthology.org/2024.emnlp-main.
72/.
H. Zhou, X. Wan, L. Proleev, D. Mincu, J. Chen, K. A. Heller, and S. Roy. Batch calibration: Rethinking
calibration for in-context learning and prompt engineering. In The Twelfth International Conference
on Learning Representations , 2024b. URL https://openreview.net/forum?id=L3FHMoKZcS .
H. Zhou, X. Wan, I. Vulić, and A. Korhonen. AutoPEFT: Automatic configuration search for parameter-
efficient fine-tuning. Transactions of the Association for Computational Linguistics , 12:525–542,
2024c. doi: 10.1162/tacl_a_00662. URL https://aclanthology.org/2024.tacl-1.29/ .
W. Zhou, Y. Ou, S. Ding, L. Li, J. Wu, T. Wang, J. Chen, S. Wang, X. Xu, N. Zhang, et al. Symbolic
learning enables self-evolving agents. arXiv preprint arXiv:2406.18532 , 2024d.
M. Zhuge, W. Wang, L. Kirsch, F. Faccio, D. Khizbullin, and J. Schmidhuber. GPTSwarm: Language
agents as optimizable graphs. In Forty-first International Conference on Machine Learning , 2024.
URL https://openreview.net/forum?id=uTC9AFXIhg .
A. Limitations and future work
Mass is a multi-agent design meta-framework that is also orthogonal to prompt and topology optimizers.
Mass has brought substantial improvements over a single-agent design by searching in a customizable
topology space. Though our proposed topology space covers the vast majority of effective MAS
designs, including serial, parallel, and mixtures of connections, incorporating
other topologies may further improve the final performance of Mass, which is complementary
to the development of Mass. For instance, the debate topology proposed in Mass involves a
fully-connected topology across agents. Recent work has identified the sparsity of agent
communications (Li et al., 2024b; Zhang et al., 2024a), and pruning redundant communications may
further enhance the overall efficiency of the strongest Mass-found design. Though the topology
optimizer in Mass already traverses the proposed topology space efficiently, incorporating more
advanced search algorithms, such as Bayesian optimization (Kandasamy et al., 2018; Ru et al., 2021),
may further improve the sample efficiency of Mass when it faces a more complex design space.
Similarly, the sample efficiency of the prompt optimizer may be further enhanced by conditioning on
textual feedback from error logs (Pryzant et al., 2023; Wan et al., 2024), which we will endeavor to
explore in future work.
B. Implementation details
B.1. Datasets
In this work, we include the following datasets: 1) Hendrycks’ MATH (Hendrycks et al., 2021),
consisting of challenging competition-level mathematics problems, and DROP (Dua et al., 2019), which
requires discrete and symbolic reasoning over paragraphs; 2) HotpotQA (Yang et al., 2018), MuSiQue
(Trivedi et al., 2022), and 2WikiMultiHopQA (Ho et al., 2020) to evaluate information seeking over
long contexts with agentic systems, for which we use the standardized versions in LongBench (Bai
et al., 2024); 3) MBPP (Austin et al., 2021), HumanEval (Chen et al., 2021), and LiveCodeBench (Jain
et al., 2024) as well-established coding benchmarks. Regarding LiveCodeBench, we use the ‘test
output prediction’ task as an agent cooperative task. In line with AFlow (Zhang et al., 2024b), we use
the public test cases of MBPP and HumanEval for the executor to retrieve reliable external feedback
signals.
To save computation resources, we randomly sample a subset of the original validation and test
splits to conduct all the experiments, where the specifications are reported in Table 2.
Table 2|The specification of evaluation tasks: dataset split, topology search space, and the Mass-optimized MAS (on Gemini 1.5 Pro)

Task           Type                            |Val|  |Test|  Topology Search Space                     Mass
MATH           Mathematical Reasoning          60     100     {Aggregate, Reflect, Debate}              {9, 0, 0}
DROP           Discrete Reasoning              60     200     {Aggregate, Reflect, Debate}              {5, 0, 0}
HotpotQA       Long-context Understanding      50     100     {Summarize, Aggregate, Reflect, Debate}   {0, 5, 0, 1}
MuSiQue        Long-context Understanding      50     100     {Summarize, Aggregate, Reflect, Debate}   {0, 3, 0, 2}
2WikiMQA       Long-context Understanding      50     100     {Summarize, Aggregate, Reflect, Debate}   {0, 3, 0, 1}
MBPP           Coding                          60     200     {Aggregate, Reflect, Debate, Executor}    {1, 4, 0, 1}
HumanEval      Coding                          50     100     {Aggregate, Reflect, Debate, Executor}    {1, 3, 0, 1}
LiveCodeBench  Coding: test output prediction  100    200     {Aggregate, Reflect, Debate, Executor}    {3, 1, 1, 1}
Table 3|The search dimension for each topology. The minimum topology defines the building block
that Mass Stage (1) optimized.

Topology   Search Space     Minimum Topology Building Block   Specification
Summarize  {0, 1, 2, 3, 4}  {Summarizer, Predictor}           {1, 1}
Aggregate  {1, 3, 5, 7, 9}  {Predictor, Aggregator}           {3, 1}
Reflect    {0, 1, 2, 3, 4}  {Predictor, Reflector}            {1, 1}
Debate     {0, 1, 2, 3, 4}  {Predictor, Debator}              {2, 1}
Execute    {0, 1}           {Predictor, Executor, Reflector}  {1, 1, 1}
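
To make the size of the search space concrete, the following minimal sketch enumerates the raw Cartesian product of the per-topology search dimensions listed in Table 3. It is an illustration only: the names `SEARCH_DIMENSIONS` and `enumerate_configurations` are hypothetical, and the sketch ignores any budget constraint or pruning that the actual topology optimizer may apply.

```python
from itertools import product

# Search dimensions copied from Table 3 (names and layout are illustrative).
SEARCH_DIMENSIONS = {
    "Summarize": [0, 1, 2, 3, 4],
    "Aggregate": [1, 3, 5, 7, 9],
    "Reflect": [0, 1, 2, 3, 4],
    "Debate": [0, 1, 2, 3, 4],
    "Execute": [0, 1],
}


def enumerate_configurations(topologies):
    """Return every combination of dimension values for the given topologies."""
    value_lists = [SEARCH_DIMENSIONS[name] for name in topologies]
    return [dict(zip(topologies, values)) for values in product(*value_lists)]


# Task-specific spaces, following the topology sets in Table 2.
math_space = enumerate_configurations(["Aggregate", "Reflect", "Debate"])
coding_space = enumerate_configurations(["Aggregate", "Reflect", "Debate", "Execute"])

print(len(math_space))    # 5 * 5 * 5 = 125 raw combinations
print(len(coding_space))  # 5 * 5 * 5 * 2 = 250 raw combinations
```

Under this reading, the Mass-optimized MATH design {9, 0, 0} from Table 2 is one of the 125 raw combinations.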
B.2. Baselines
In this section, we report the specifications of all our baselines. We note that for the baselines CoT,
SC, Self-Refine, and Multi-Agent Debate, we follow the prompts given in ADAS (Hu et al., 2024a).
1) Chain-of-Thought (CoT) (Kojima et al., 2022). Direct chain-of-thought reasoning via zero-shot
prompting: “Please think step by step and then solve the task.”
2) Self-Consistency (SC) (Wang et al., 2023). In self-consistency, we generate diverse chain-of-thought
reasoning traces with a temperature of 0.8, followed by a rule-based majority vote that
selects the most consistent answer. In Table 1, we report SC@9 to provide a fair comparison across
baselines.
3) Self-Refine (Madaan et al., 2024). This baseline consists of one predictor that repeatedly
takes feedback and a self-reflector that provides criticism. It stops as soon as the
self-reflector outputs “correct” in its prediction. We set the maximum number of rounds of reflections
to 5, such that the worst case will involve 11 (1 + 2 × 5) calls.
4) Multi-Agent Debate (Du et al., 2024; Liang et al., 2024). This baseline involves 3 agents
that reason and debate for 3 rounds. The opinions across the debate rounds are
finally judged by an aggregator that makes the final prediction. Hence, it contains 10 (3 × 3 + 1)
agents.
5) Automated Design of Agentic Systems (ADAS) (Hu et al., 2024a). Consistent with our main
experimental setup, we use Gemini 1.5 as both the LLM optimizer and evaluator for reproducing all
ADAS results. The generation of ADAS is conditioned on former evaluations of baselines, including
CoT, SC, Self-Refine, and Multi-Agent Debate. We report ADAS with 30 rounds of search, and each
round is evaluated on the validation set 3 times to stabilize the prediction.
6) AFlow (Zhang et al., 2024b). Automatic workflow design via Monte Carlo Tree Search over a
set of pre-defined operators. Similar to ADAS, AFlow also relies on an LLM optimizer to generate
new nodes and topologies expressed in code. However, we find that the meta-prompt of AFlow does not
generalize to other LLM backbones. Consequently, we report AFlow with its original LLM optimizer,
Claude 3.5 Sonnet, and reproduce experiments with Gemini 1.5 Pro as the LLM executor. Therefore,
the comparison is not completely fair, and we treat the results from AFlow as a good reference. We
note that the ‘-’ in Table 1 refers to out-of-time errors, where the LLM executor was trapped
executing accidental scripts with infinite loops. We still endeavored to report most results from AFlow
as shown in Table 1 & Fig. 6 with the default experimental setup from AFlow: 20 rounds, 5 runs of
validation per round, and k at 3.
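
The call budgets quoted for Self-Refine and Multi-Agent Debate, and the rule-based vote used by Self-Consistency, can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the per-round breakdown of Self-Refine (one reflector call plus one predictor call) is an assumption consistent with the 1 + 2 × 5 count, and the tie-breaking rule of the vote is not specified in the text.

```python
from collections import Counter


def self_refine_max_calls(max_rounds: int) -> int:
    # Assumed breakdown: one initial prediction, then one reflector call
    # and one predictor call per reflection round.
    return 1 + 2 * max_rounds


def debate_calls(num_agents: int, num_rounds: int) -> int:
    # Every agent responds in every round, plus one final aggregator call.
    return num_agents * num_rounds + 1


def majority_vote(answers):
    # Most frequent answer wins; ties go to the answer seen first
    # (an assumption, since the tie-breaking rule is not stated).
    return Counter(answers).most_common(1)[0][0]


print(self_refine_max_calls(5))                 # 11
print(debate_calls(3, 3))                       # 10
print(majority_vote(["64", "39", "64", "89"]))  # 64
```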
[Figure 8: two panels. "Topology Building Blocks": Summarize, Aggregate, Reflect, Debate, Executor. "MASS-optimized Topology": MATH, DROP, HotpotQA, MuSiQue, 2WikiMQA, MBPP, HumanEval, LiveCodeBench.]
Figure 8|Visualization of the topology building blocks and best Mass-discovered topologies from
Gemini 1.5 Pro.
B.3. Mass: Multi-Agent System Search
In this section, we provide additional details for Mass. The topology search space for each task is
defined in Table 2. In addition, for Stage (1) block-level prompt optimization, the specification of
the building block is defined in Table 3. We provide the visualization of both the minimum building
blocks and the optimized topology in Fig. 8. We refer the reader to App. §D & §E for the prompt
templates we used to define each type of agent and the best prompts discovered.
C. Additional experiments
Table 4|Results on the evaluation set with Claude 3.5 Sonnet. We keep the same experimental setup
as Table 1. Since Claude 3.5 Sonnet does not support the same context window as Gemini, we report
the standard HotpotQA instead of the LongBench version. As we transfer the prompt template for each
agent from Gemini to Claude, it is noticeable that the basic topology on some tasks may result in
severe degradation of performance, and Mass successfully recovers the performance and brings
significant improvements over the initial agent.

Claude-3.5-Sonnet
Task                Reasoning                   Multi-hop     Coding
Method              MATH          DROP          HotpotQA      MBPP          HumanEval     LCB           Avg.
CoT                 57.33 ± 0.94  55.52 ± 0.42  23.56 ± 1.52  67.50 ± 1.47  88.67 ± 1.70  72.67 ± 2.39  60.21
Self-Consistency    61.67 ± 1.89  57.86 ± 0.45  25.69 ± 0.44  69.17 ± 0.62  90.00 ± 0.82  72.67 ± 2.39  62.84
Self-Refine         57.00 ± 1.63  56.26 ± 0.56  23.57 ± 2.56  68.00 ± 0.82  87.00 ± 1.41  49.33 ± 1.65  56.86
Multi-Agent Debate  45.00 ± 3.74  26.62 ± 0.11  31.41 ± 3.30   0.00 ± 0.00  84.33 ± 3.30  72.82 ± 1.84  43.36
Mass                63.00 ± 0.00  68.93 ± 0.38  66.98 ± 0.99  68.83 ± 0.62  93.00 ± 0.82  73.73 ± 1.43  72.43
Table 5|The detailed ablation results per optimization stage of Mass. Practical gains can be obtained
by further conducting workflow-level prompt optimization (3PO) on the best-found topology.

Gemini-1.5-pro-002
Task        Reasoning                   Multi-hop Long-context                    Coding
Method      MATH          DROP          HotpotQA      MuSiQue       2WikiMQA      MBPP          HumanEval     LCB           Avg.
Base Agent  62.33 ± 0.94  71.65 ± 0.61  56.96 ± 1.26  43.32 ± 0.13  49.20 ± 0.61  68.83 ± 0.85  89.33 ± 1.70  66.33 ± 2.09  63.54
+ APO       79.33 ± 1.89  77.51 ± 0.38  59.72 ± 0.00  43.97 ± 0.00  61.49 ± 0.24  67.00 ± 1.08  86.33 ± 1.25  68.50 ± 1.22  67.44
+ 1PO       80.00 ± 0.00  86.45 ± 0.90  62.52 ± 1.86  48.86 ± 0.61  67.40 ± 0.58  80.33 ± 1.25  91.67 ± 1.25  76.00 ± 0.00  74.56
+ 2TO       83.00 ± 1.63  86.75 ± 1.32  65.22 ± 1.34  52.61 ± 0.52  72.82 ± 0.86  85.00 ± 1.08  92.00 ± 0.82  81.33 ± 0.00  77.55
+ 3PO       84.67 ± 0.47  90.52 ± 0.64  69.91 ± 1.11  51.40 ± 0.42  73.34 ± 0.67  86.50 ± 0.41  91.67 ± 0.47  82.33 ± 0.85  78.40
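
The row-to-row differences in the Avg. column of Table 5 can be computed as follows. This is only an arithmetic illustration on the reported averages; the variable names are hypothetical, and treating consecutive rows as incremental steps is a reading of the table layout rather than a statement from the text.

```python
# Average scores copied from the Avg. column of Table 5.
stage_averages = [
    ("Base Agent", 63.54),
    ("+ APO", 67.44),
    ("+ 1PO", 74.56),
    ("+ 2TO", 77.55),
    ("+ 3PO", 78.40),
]

# Difference between each row and the row directly above it.
gains = {
    current_name: round(current - previous, 2)
    for (_, previous), (current_name, current) in zip(stage_averages, stage_averages[1:])
}

for name, gain in gains.items():
    print(f"{name}: {gain:+.2f}")
```

On these numbers, the largest single difference is between the "+ APO" and "+ 1PO" rows (7.12 points), while the final workflow-level prompt optimization row adds 0.85 points on average.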
[Figure 9: scatter plot for MATH (gemini-1.5-pro-002). x-axis: Total Tokens (0 to 8000); y-axis: Accuracy (%) (70 to 84). Plotted designs: CoT, CoT-SC@3, CoT-SC@5, Refine@5, Debate 1R@2A, Debate 2R@3A, Step-Back, Quality-Diverse, Role Assign, ADAS-Tool, ADAS-T&S, and several MASS designs.]
Figure 9|The Pareto-front of Mass-optimized designs compared to multi-agent baselines. Total
tokens include both inference input tokens and output tokens. Additional multi-agent baselines from
ADAS (Hu et al., 2024a) and two best-found ADAS designs are included.
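
For readers unfamiliar with the notion, a Pareto front over (total tokens, accuracy) pairs can be computed as in the sketch below. It assumes that fewer tokens and higher accuracy are both preferred; the function name `pareto_front` and the data points are hypothetical and are not read off Figure 9.

```python
def pareto_front(points):
    """Return names of designs that no other design dominates.

    points: list of (name, total_tokens, accuracy) tuples.
    A design is dominated if another design uses no more tokens and is
    at least as accurate, and is strictly better on one of the two.
    """
    front = []
    for name, tokens, accuracy in points:
        dominated = any(
            (t <= tokens and a >= accuracy) and (t < tokens or a > accuracy)
            for _, t, a in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical designs for illustration only.
designs = [("A", 500, 72.0), ("B", 1500, 78.0), ("C", 2000, 76.0), ("D", 4000, 83.0)]
print(pareto_front(designs))  # ['A', 'B', 'D']
```

Design C is excluded because B is both cheaper and more accurate.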
D. Prompt template
We provide all prompt templates we used for defining the Mass search space. We use <> to enclose
texts that have been skipped for presentation purposes. We follow DSPy (Khattab et al., 2024) in
constructing these agentic templates.
The general template for instruction, exemplar, and input/output fields:

<Instruction>

---

Follow the following format.

Input: ${Input}
...
Output: ${output}

---

<example_1>

---

Input: <Input>
...
Output: <output>
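
As a rough illustration of how such a template can be filled, the sketch below assembles an instruction, a format description, exemplars, and a new input into one prompt string. It loosely mirrors the layout above and is not the DSPy implementation; `render_prompt` and its arguments are hypothetical names.

```python
def render_prompt(instruction, fields, examples, query):
    """Assemble instruction, format description, exemplars, and the new input."""
    # Format description: one "Field: ${field}" line per field.
    format_block = "\n".join(f"{name}: ${{{name.lower()}}}" for name in fields)
    parts = [instruction, "---", "Follow the following format.", format_block]
    # Each exemplar is rendered with all of its fields filled in.
    for example in examples:
        example_block = "\n".join(f"{name}: {example[name]}" for name in fields)
        parts += ["---", example_block]
    # Fields missing from the query (typically the output) are left open.
    query_block = "\n".join(
        f"{name}: {query.get(name, '')}".rstrip() for name in fields
    )
    parts += ["---", query_block]
    return "\n\n".join(parts)


prompt = render_prompt(
    instruction="Let's think step by step.",
    fields=["Question", "Answer"],
    examples=[{"Question": "What is 2 + 2?", "Answer": "4"}],
    query={"Question": "What is 3 + 5?"},
)
print(prompt)
```

In this sketch, optimizing the instruction or the exemplars amounts to changing the `instruction` and `examples` arguments while the field layout stays fixed.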
MATH:
Predictor:

Let's think step by step.
---
Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: ${answer}

------------
Reflector:

Please review the answer above and criticize on where might be wrong. If you are absolutely sure it is correct, output 'True' in 'correctness'.

---
Question: ${question}
Text: ${text}
Reasoning: Let's think step by step in order to ${produce the correctness}. We ...
Feedback: ${feedback}
Correctness: True/False indicating if answer is correct given the question.

------------
Refiner:

Given previous attempts and feedback, carefully consider where you could go wrong in your latest attempt. Using insights from previous attempts, try to solve the task better. Show your final answer bracketed between <answer> and </answer> at the end.

---
Question: ${question}
Previous answer: ${previous_answer}
Reflection: ${reflection}
Correctness: ${correctness}
Thinking: ${thinking}
Answer: ${answer}

------------

Debator:

These are the solutions to the question from other agents. Examine the solutions from other agents in your rationale, finish by giving an updated answer. Show your final answer bracketed between <answer> and </answer> at the end.

---
Question: ${question}
Solutions: the solutions to the question from other agents
Reasoning: Let's think step by step in order to ${Examine the solutions from other agents}. We ...
Answer: The updated answer for the question. Do not repeat Answer:
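
Since the Refiner and Debator instructions ask for the final answer to be bracketed between <answer> and </answer>, a downstream parser is needed to recover it. The following is a minimal sketch of such a parser; the function name `extract_answer` and the choice to take the last tagged span are assumptions, not details given in the text.

```python
import re

# Non-greedy match so that several tagged spans are captured separately.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(response):
    """Return the content of the last <answer>...</answer> span, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].strip() if matches else None


print(extract_answer("The average is (39 + 89) / 2. <answer> 64 </answer>"))  # 64
print(extract_answer("No tagged answer here."))                                # None
```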
DROP:
Predictor:

Please think step by step and then solve the task. # Your Task:
Please answer the following question based on the given context.
---
Question: ${question}
Context: ${context}
Thinking: ${thinking}
Answer: Directly answer the question. Keep it very concise.

------------
Reflector:

Verify that the answer is based on the provided context. Give your reflection in the rationale.

---
Question: ${question}
Context: ${context}
Text: ${text}
Reasoning: Let's think step by step in order to ${produce the correctness}. We ...
Correctness: True/False indicating if answer is correct given the observations and question.

------------
Refiner:

Please think step by step and then solve the task. # Your Task:
Based on the reflection, correctness of the previous answer, and the context again, give an updated answer.

---
Question: ${question}
Context: ${context}
Previous answer: ${previous_answer}
Reflection: ${reflection}
Correctness: ${correctness}
Thinking: ${thinking}
Answer: Directly answer the question. Keep it very concise.

------------

Debator:

These are the solutions to the question from other agents. Based on the context, examine the solutions from other agents in your rationale, finish by giving an updated answer.

---
Question: ${question}
Context: ${context}
Solutions: the solutions to the question from other agents
Reasoning: Let's think step by step in order to ${Examine the solutions from other agents}. We ...
Answer: The updated answer for the question. Do not repeat Answer:
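
One way the Predictor, Reflector, and Refiner templates could be chained is sketched below. This is an assumption-laden illustration rather than the Mass implementation: `predict`, `reflect`, and `refine` are hypothetical callables standing in for LLM calls, and stopping as soon as the reflector reports a correct answer is an assumed control flow.

```python
def reflect_and_refine(question, predict, reflect, refine, max_rounds):
    """Run a predictor, then alternate reflector and refiner calls."""
    answer = predict(question)
    for _ in range(max_rounds):
        # The reflector returns its reflection and a True/False correctness flag.
        reflection, correct = reflect(question, answer)
        if correct:
            break
        # The refiner sees the previous answer and the reflection on it.
        answer = refine(question, answer, reflection)
    return answer


# Stub agents for illustration; real agents would call an LLM.
def stub_predict(question):
    return "3"


def stub_reflect(question, answer):
    return ("Recheck the subtraction.", answer == "4")


def stub_refine(question, previous_answer, reflection):
    return "4"


print(reflect_and_refine("What is 93 - 89?", stub_predict, stub_reflect, stub_refine, 2))  # 4
```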
HotpotQA, MuSiQue, and 2WikiMQA:
Predictor:

Answer the question with information based on the context. Only return the answer as your output.
---
Question: ${question}
Context: ${context}
Answer: Only give me the answer. Do not output any other words.

------------
Summarizer:

Based on the question, retrieve relevant information from context that is ONLY helpful in answering the question. Include all key information. Do not repeat context.
---
Question: ${question}
Context: ${context}
Summary: Only generate the summary. Start with Summary:

------------
Reflector:

Verify that the answer is based on the provided context.

---
Question: ${question}
Context: ${context}
Text: ${text}
Reasoning: Let's think step by step in order to ${produce the correctness}. We ...
Correctness: True/False indicating if answer is correct given the observations and question.

------------

Debator:

These are the solutions to the question from other agents. Based on the context, examine the solutions from other agents in your rationale, finish by giving an updated answer.

---
Question: ${question}
Context: ${context}
Solutions: the solutions to the question from other agents
Reasoning: Let's think step by step in order to ${Examine the solutions from other agents}. We ...
Answer: The updated answer for the question. Do not repeat Answer:
MBPP:
Predictor:

Let's think step by step. Provide a complete and correct code implementation in python.
---
Question: ${question}
Thinking: ${thinking}
Answer: Only the code implementation. Do not include example usage or explainations.

------------
Reflector:

Please determine the correctness of the solution in passing all test cases. If it fails, based on the error message and trackback, think step by step, carefully propose an updated solution in the answer output with a correct code implementation in python.

---
Question: ${question}
Previous solution: ${previous_solution}
Traceback: It contains the test cases, execution results, and ground truth. If there is an error, the relevant traceback is given.
Correctness: 'True/False' based on the correctness of executive feedback. If there is an error message, output 'False'
Thinking: ${thinking}
Answer: ${answer}

------------

Debator:

These are the solutions to the question from other agents. Examine the solutions from other agents in your rationale, finish by giving an updated answer. Let's think step by step. Provide a complete and correct code implementation in python.

---
Question: ${question}
Solutions: the solutions to the question from other agents
Reasoning: Let's think step by step in order to ${Examine the solutions from other agents}. We ...
Answer: ${answer}
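
The Traceback field above is described as containing the test cases, execution results, and ground truth, with the relevant traceback in case of an error. A minimal sketch of an executor that produces such feedback is given below. It is hypothetical: the name `run_test_case`, the report layout, and the use of `exec`/`eval` are assumptions for illustration, and a real executor should run untrusted code in a sandbox with a timeout.

```python
import traceback


def run_test_case(solution_code, test_expression, ground_truth):
    """Execute a candidate solution on one test case and build feedback text.

    Returns (passed, report). The report lists the test case, the execution
    result or the traceback, and the ground truth.
    """
    lines = [f"Test case: {test_expression}"]
    try:
        namespace = {}
        exec(solution_code, namespace)               # define the candidate function
        output = eval(test_expression, namespace)    # run it on the test input
        lines.append(f"Output: {output!r}")
        passed = output == ground_truth
    except Exception:
        lines.append(traceback.format_exc())         # error message for the reflector
        passed = False
    lines.append(f"Ground Truth: {ground_truth!r}")
    return passed, "\n".join(lines)


good_code = "def swap_numbers(a, b):\n    return (b, a)\n"
bad_code = "def swap_numbers(a, b):\n    return a / 0\n"

passed, report = run_test_case(good_code, "swap_numbers(10, 20)", (20, 10))
failed, error_report = run_test_case(bad_code, "swap_numbers(10, 20)", (20, 10))
print(passed)
print(report)
```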
HumanEval:
Predictor:

Let's think step by step. Provide a complete and correct code implementation in python.
---
Question: ${question}
Thinking: ${thinking}
Answer: ${answer}

------------
Reflector:

Please determine the correctness of the solution in passing all test cases. If it fails, based on the error message and trackback, think step by step, carefully propose an updated solution in the answer output with a correct code implementation in python.

---
Question: ${question}
Previous solution: ${previous_solution}
Traceback: ${traceback}
Thinking: ${thinking}
Answer: ${answer}

------------

Debator:

These are the solutions to the question from other agents. Examine the solutions from other agents in your rationale, finish by giving an updated answer. Let's think step by step. Provide a complete and correct code implementation in python.

---
Question: ${question}
Solutions: the solutions to the question from other agents
Reasoning: Let's think step by step in order to ${Examine the solutions from other agents}. We ...
Answer: ${answer}
LiveCodeBench:
Predictor:

You are a helpful programming assistant and an expert Python programmer. The user has written a input for the testcase. Think step by step. You will generate the code based on the problem requirepement. You will calculate the output of the testcase and write the whole assertion statement in the markdown code block with the correct output.
---
Question: ${question}
Thinking: ${thinking}
Code: ${code}
Answer: complete the testcase with assertion.

------------
Reflector:

If there is an executive output in the traceback, parse the output into an assertion in the answer given the executive output.

---
Question: ${question}
Previous solution: ${previous_solution}
Traceback: It contains the test cases, execution results, and ground truth. If there is an error, the relevant traceback is given.
Correctness: 'True/False' based on the correctness of executive feedback. If there is an error message, output 'False'
Thinking: ${thinking}
Answer: ${answer}

------------

Debator:

These are the solutions to the question from other agents. Examine the solutions from other agents in your rationale, finish by giving an updated answer.

---
Question: ${question}
Solutions: the solutions to the question from other agents
Reasoning: Let's think step by step in order to ${Examine the solutions from other agents}. We ...
Answer: assert ${function(input)} == {executive_output}
E. Best Prompts Discovered
In this section, we show a few typical prompts that have been discovered by Mass. As before, we
use <> to indicate content that has been skipped.
MATH:
Predictor:

Let's think step by step to solve the given problem. Clearly explain your reasoning process, showing all intermediate calculations and justifications. Express your final answer as a single numerical value or simplified expression enclosed within <answer></answer> tags. Avoid extraneous text or explanations outside of the core reasoning and final answer.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: ${answer}

---

Question: What is the average of the two smallest positive integer solutions to the congruence $$14u \\equiv 46 \\pmod{100}~?$$
Reasoning: Let's think step by step in order to find the average of the two smallest positive integer solutions to the congruence $14u \\equiv 46 \\pmod{100}$. We are looking for solutions to $14u \\equiv 46 \\pmod{100}$. We can divide the congruence by 2 to get $7u \\equiv 23 \\pmod{50}$. We want to find the inverse of 7 modulo 50. We use the Euclidean Algorithm: \\begin{align*} 50 &= 7 \\cdot 7 + 1 \\\\ 1 &= 50 - 7 \\cdot 7 \\end{align*} Thus, $7(-7) \\equiv 1 \\pmod{50}$, so $7^{-1} \\equiv -7 \\equiv 43 \\pmod{50}$. Multiplying $7u \\equiv 23 \\pmod{50}$ by 43 gives $u \\equiv 23 \\cdot 43 \\pmod{50}$. $23 \\cdot 43 = 989 = 50 \\cdot 19 + 39$, so $23 \\cdot 43 \\equiv 39 \\pmod{50}$. Therefore, $u \\equiv 39 \\pmod{50}$. The two smallest positive integer solutions are $u = 39$ and $u = 39+50=89$. The average of these two solutions is $\\frac{39+89}{2} = \\frac{128}{2} = 64$.
Answer: 64

---
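
The arithmetic in the exemplar above can be checked by brute force. The short script below is only a sanity check of the worked example and is not part of the discovered prompt; it requires Python 3.8 or later for the modular inverse via `pow`.

```python
# Brute-force search for solutions of 14u = 46 (mod 100) in one full period.
solutions = [u for u in range(1, 101) if (14 * u) % 100 == 46]
two_smallest = solutions[:2]
average = sum(two_smallest) / 2

# Modular inverse of 7 modulo 50, as used in the exemplar's reasoning.
inverse_of_7 = pow(7, -1, 50)

print(two_smallest)   # [39, 89]
print(average)        # 64.0
print(inverse_of_7)   # 43
```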
In the following prompts, interestingly, we observe that including the data summary, task
demonstrations, and past instructions that MIPRO (Opsahl-Ong et al., 2024) uses to propose
new candidates actually improves the final performance. Hence, we keep these prompts that lead to
strong task performance.
DROP:
Predictor:

This dataset is designed for extractive question answering, focusing on retrieving concise, factual answers from short texts. Many questions involve extracting numerical information and performing simple calculations, suggesting applications in areas like sports analytics or financial data analysis. However, the dataset's Western cultural bias and lack of complex reasoning questions limit its generalizability and real-world applicability.

TASK DEMO(S):
<example_1>
Question: How many more points did the Spurs win by in Game 4 against the Mavericks?

Context:
The Mavericks finished 49-33, one game ahead of Phoenix for the eighth and final playoff spot, which meant that they would once again have to face their in-state rivals, the San Antonio Spurs, who were the top seed in the Western Conference with a 62-20 record. In Game 1 in San Antonio, Dallas had an 81-71 lead in the fourth quarter, but the Spurs rallied back and took Game 1, 85-90. However, the Mavs forced 22 turnovers in Game 2 to rout the Spurs 113-92, splitting the first two games before the series went to Dallas. In Game 3, Manu Gin\u00f3bili hit a shot that put the Spurs up 108-106 with 1.7 seconds left, but a buzzer-beater by Vince Carter gave the Mavs the victory, putting them up 2-1 in the series. The Spurs took Game 4 in Dallas 93-89 despite a late Dallas comeback after the Spurs at one point had a 20-point lead and later won Game 5 at home, 109-103, giving them a 3-2 series lead. The Mavs avoided elimination in Game 6 at home by rallying in the fourth quarter, winning 111-113. Game 7 was on the Spurs home court, and the Spurs beat the Mavericks 119-96, putting an end to the Mavericks season.

Thinking:
The Spurs scored 93 points in Game 4. The Mavericks scored 89 points in Game 4. The difference is 93 - 89 = 4.
Answer: 4


BASIC INSTRUCTION:
```
You are a highly specialized AI tasked with extracting critical numerical information for an urgent news report. A live broadcast is relying on your accuracy and speed. Think step-by-step, focusing on the numerical information provided in the context. Then, answer the question concisely with the extracted numerical answer. Failure to provide the correct numerical information will result in the broadcast being interrupted.

Question: {question}
Context: {context}
```

TIP: Keep the instruction clear and concise.

PROPOSED INSTRUCTION:

```
Extract the numerical answer to the following question. Show your reasoning by identifying the relevant numbers from the provided context and performing any necessary calculations. Respond with only the final numerical answer.

Question: {question}
Context: {context}
```
HotpotQA:
Predictor:

This multi-passage question answering dataset focuses on complex questions requiring synthesis of information from multiple Wikipedia-like sources, often involving named entities and temporal reasoning. It emphasizes integrating information, handling ambiguity, and leveraging real-world knowledge, posing a significant challenge for models relying solely on provided text. The dataset appears well-suited for evaluating advanced language models' reasoning abilities across diverse domains and varying complexity levels.

TASK DEMO(S):
Question: The actor that plays Phileas Fogg in \"Around the World in 80 Days\", co-starred with Gary Cooper in a 1939 Goldwyn Productions film based on a novel by what author?
Context: Provided in prompt
Answer: Charles L. Clifford


BASIC INSTRUCTION: From the provided text, extract the answer to the question. Output *only* the answer.

TIP: Keep the instruction clear and concise. Emphasize reliance *only* on the provided text.

PROPOSED INSTRUCTION: Answer the question using only the provided context. Do not use external knowledge.

---
<example_1>

------

Debator:

This multi-passage question answering dataset focuses on complex questions requiring synthesis of information from multiple Wikipedia-like sources, often involving named entities and temporal reasoning. It emphasizes integrating information, handling ambiguity, and leveraging real-world knowledge, posing a significant challenge for models relying solely on provided text. The dataset appears well-suited for evaluating advanced language models' reasoning abilities across diverse domains and varying complexity levels.

TASK DEMO(S):
Provided above.

BASIC INSTRUCTION: These are the solutions to the question from other agents. Based on the context, examine the solutions from other agents in your rationale, finish by giving an updated answer.

TIP: Don't be afraid to be creative when creating the new instruction!

PROPOSED INSTRUCTION: You are an expert fact-checker for a major publication. Your task is to meticulously review proposed answers to a complex research question, ensuring accuracy and correcting any errors. You are provided with the original question, multiple context passages from credible sources, and several proposed answers from different research assistants. Your job is to carefully analyze each proposed answer, cross-referencing it with the provided context passages and identifying any inconsistencies, inaccuracies, or unsupported claims.

**Question:** [Insert Question Here]

**Context Passages:**
[Insert Passages Here]

**Proposed Answers:**
* Assistant 1: [Insert Assistant 1's Answer]
* Assistant 2: [Insert Assistant 2's Answer]
...
* Assistant N: [Insert Assistant N's Answer]


**Instructions:**

1. **Fact-Check & Analyze:** Evaluate each proposed answer individually. For each answer:
* **Verdict:** Indicate whether the answer is \"Correct,\" \"Incorrect,\" \"Partially Correct,\" or \"Not Supported by Context.\"
* **Evidence:** Provide specific quotes and passage numbers from the context to support your verdict. Explain how the evidence supports or refutes the proposed answer. Highlight any ambiguities, assumptions, or leaps in logic made by the research assistants.
* **Corrections\/Improvements (if applicable):** Suggest specific corrections or improvements to partially correct or incorrect answers. Explain how these changes align with the context.

2. **Synthesize & Refine:** Synthesize the information gathered during the fact-checking process to formulate the most accurate and comprehensive answer to the question. This may involve:
* Selecting the most accurate proposed answer.
* Combining elements from multiple proposed answers.
* Developing a completely new answer based on your analysis of the evidence.

3. **Final Answer:** Clearly state your final, verified answer to the question.

4. **Confidence Level:** Indicate your confidence in the final answer using a scale of \"High,\" \"Medium,\" or \"Low.\" Briefly explain the factors influencing your confidence level.


This revised instruction emphasizes a more rigorous fact-checking process, encouraging the LM to critically evaluate each proposed answer and provide detailed justifications for its assessments. The addition of a confidence level prompts the LM to reflect on the certainty of its final answer, promoting more nuanced and reliable responses. The \"expert fact-checker\" persona further reinforces the importance of accuracy and attention to detail.

---
<example_1>
<example_2>
MBPP:
Predictor:

You are a highly skilled Python programmer tasked with generating a correct and efficient Python function based on the given natural language problem description. Think step-by-step, outlining your reasoning process before presenting the code solution. Your response should adhere to the following structure:

**Thinking:** Provide a clear and concise breakdown of your thought process, including the steps you'll take to solve the problem. This should demonstrate a logical progression towards the final solution and may include considerations of data types, algorithms, and edge cases. For example:

1. Identify the input data type and expected output.
2. Determine the core logic or algorithm required.
3. Consider potential edge cases or special scenarios.
4. Outline the steps for implementing the solution in Python.

**Answer:** Present your complete and correct Python code implementation within a code block (using triple backticks). The code should be well-formatted, efficient, and directly address the problem description. Ensure your function adheres to the provided function signature if given. For example:

```python
def function_name(input_arguments):
    # Code implementation here
    # ...
    return output
```

Focus on producing functional code that accurately solves the problem. Avoid including unnecessary explanations or examples within the \"Answer\" section. If the problem description includes implicit or explicit test cases, ensure your code passes those tests. Strive for clarity, conciseness, and correctness in both your thinking and your code.

---
<example_1>
<example_2>
<example_3>

------
Reflector:

This dataset is designed for Python code generation, translating natural language problem descriptions into simple functions and their corresponding test cases. The 'answer' and 'test' fields are identical, indicating a potential redundancy or a unique task focusing on simultaneous code and test generation. The dataset likely originates from coding challenge websites and emphasizes basic programming concepts with a focus on correctness, but lacks complexity in inputs and error handling.

TASK DEMO(S):
Question: Write a function that takes in two numbers and returns a tuple with the second number and then the first number.

def swap_numbers(a,b):
Previous Solution: def swap_numbers(a,b):
    return (b, a)

Traceback: Test case: print(swap_numbers(10,20))
Output: (20, 10)
Ground Truth: (20,10)
Correctness: True
Thinking: The provided solution correctly swaps the order of the two input numbers and returns them as a tuple. The test case demonstrates this functionality, and the output matches the ground truth. Therefore, no changes are required.
Answer: ```python
def swap_numbers(a,b):
    return (b, a)
```
<example_2>
<example_3>


BASIC INSTRUCTION: Please determine the correctness of the solution in passing all test cases. If it fails, based on the error message and trackback, think step by step, carefully propose an updated solution in the answer output with a correct code implementation in python.

TIP: The instruction should include a high stakes scenario in which the LM must solve the task!

PROPOSED INSTRUCTION:

You are an automated code reviewer for a mission-critical satellite control system. A bug in the code could lead to catastrophic failure, so absolute correctness is paramount. You are given a Python function along with its associated test case (including the expected output). Analyze the provided

<example_1>
<example_2>