Authors: Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica
Optimizing Model Selection for Compound AI Systems
Lingjiao Chen†,◦, Jared Quincy Davis◦, Boris Hanin§
Peter Bailis‡, Matei Zaharia‡, James Zou◦, Ion Stoica‡
†Microsoft Research,◦Stanford University,
§Princeton University,‡University of California, Berkeley
Abstract
Compound AI systems that combine multiple LLM calls, such as self-refine and multi-agent-
debate, achieve strong performance on many AI tasks. We address a core question in optimizing
compound systems: for each LLM call or module in the system, how should one decide which LLM
to use? We show that these LLM choices have a large effect on quality, but the search space is
exponential. We propose LLMSelector, an efficient framework for model selection in compound
systems, which leverages two key empirical insights: (i) end-to-end performance is often monotonic
in how well each module performs, with all other modules held fixed, and (ii) per-module performance
can be estimated accurately by an LLM. Building upon these insights, LLMSelector
iteratively selects one module and allocates to it the model with the highest module-wise performance,
as estimated by an LLM, until no further gain is possible. LLMSelector is applicable
to any compound system with a bounded number of modules, and its number of API calls scales
linearly with the number of modules, achieving high-quality model allocation both empirically and
theoretically. Experiments with popular compound systems such as multi-agent debate and self-refine
using LLMs such as GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 show that LLMSelector
confers 5%-70% accuracy gains compared to using the same LLM for all modules.
1 Introduction
[Figure 1: bar chart. x-axis: Datasets (LiveCodeBench, CommonGenHard, SimpleQA, FEVER, TableArithmetic, TableBias); y-axis: Accuracy (%). Series: GPT-4o only, GPT-4 Turbo only, GPT-4o mini only, Claude 3.5 Sonnet only, Claude 3.5 Haiku only, Gemini 1.5 Pro only, Llama 3.1 405B only, LLMSelector.]
Figure 1: LLMSelector outperforms compound AI systems that always call the same LLM. Here we
study three compound systems, namely, self-refine (on LiveCodeBench and CommonGenHard), multi-agent-debate
(on SimpleQA and FEVER), and locate-solve (on TableArithmetic and TableBias). LLMSelector
achieves 5%-70% accuracy gains over allocating any single model to all modules by allocating different models to
different modules in these compound systems.
arXiv:2502.14815v1 [cs.AI] 20 Feb 2025
Researchers and developers are increasingly leveraging large language models (LLMs) by composing
multiple LLM calls in a compound AI system to tackle complex tasks [Du et al., 2024, Zhang et al.,
2024, Madaan et al., 2023, DeepMind, 2023, Shinn et al., 2023, Renze and Guven, 2024, Zaharia et al.,
2024]. For example, a common practice is to use one LLM call to generate one initial answer, one LLM
call to give feedback, and one more call to refine the answer based on the feedback, known as self-
refine [Renze and Guven, 2024, Madaan et al., 2023, Ji et al., 2023]. Another example is multi-agent
debate [Du et al., 2024, Liang et al., 2024, Khan et al., 2024], where multiple LLM calls are made
to propose initial answers and then debate which ones are correct. Compared to monolithic models,
significant improvements are possible because the compound systems decompose challenging tasks into
simpler sub-tasks, and perform one LLM call for each sub-task.
Most existing work on improving compound systems focuses on optimizing prompts used in individual
modules and/or module interactions, while using the same LLM for all modules [Khattab et al.,
2024, Yuksekgonul et al., 2024, Wu et al., 2023]. While this simplifies compound system design, it also
leaves several important questions unaddressed. Does using different models across modules improve
a compound system’s performance? If so, why and by how much? Given a pool of LLMs, can we find
the best model each module should use without exhaustive search?
As a first step towards answering such questions, we systematically study model selection in static
compound AI systems, i.e., those where the number of modules, the sequencing of module calls, and
the mapping between modules and models are fixed. In this context, we indeed find that allocating
different LLMs to different modules leads to substantially higher performance than allocating the
same LLM to all modules. As an example, consider again the self-refine system [Madaan et al., 2023]
consisting of three modules: a generator, a critic, and a refiner. LLM A may be better at providing
feedback but worse at generating and refining answers than LLM B. In this case, allocating LLM A for
the critic and LLM B for the generator and refiner is better than allocating either one to all modules.
Then we formulate the model selection problem (MSP), i.e., identifying the best model each module
should use to maximize the overall performance. MSP is challenging in principle, as it is infeasible to
exhaustively search the exponentially large space of all model choices. Our insights are that, in many
cases, (i) the end-to-end performance can be monotonic in per-module performance, and (ii) per-module
performance can be estimated accurately by an LLM diagnoser. This motivates us to design
LLMSelector, a principled framework that optimizes MSP for any static compound AI system
given a training budget. LLMSelector iteratively nominates one module and allocates to it the
model with the best module-wise performance, as estimated by an LLM diagnoser. One benefit is
that LLMSelector is applicable to any compound AI system whose number of modules is fixed.
Furthermore, LLMSelector incurs only a manageable number of LLM calls. In fact, we provide
mathematical conditions under which LLMSelector finds the optimal solution to MSP with the
number of LLM calls linear in the number of modules (Section 4).
We conduct systematic experiments on a diverse set of compound AI systems using real-world
LLM APIs including GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Perhaps surprisingly, we
have found that different model choices have a significant effect on compound systems’ performance.
In fact, LLMSelector offers 5%-70% performance gains compared to allocating the same LLM to
all modules (Figure 1). While not optimizing prompts, LLMSelector also outperforms advanced
techniques specializing in prompt optimization (Table 2 in Section 5). This further highlights the
importance of model selection for compound AI systems.
In short, our main contributions are:
• Model selection problem. We formulate the model selection problem (MSP) for compound
AI systems, an increasingly important but under-explored problem.
• The LLMSelector framework. To optimize MSP, we propose LLMSelector, a principled
framework that iteratively chooses one module and allocates to it the model with the highest
module-wise performance estimated by an LLM.
• Model choices matter. Through extensive experiments on practical compound systems using
real-world LLM APIs including GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, we have found
that choosing different models can substantially affect (up to 100%) a compound AI system’s
performance.
• LLMSelector finds excellent choices. Systematic experiments have shown that LLMSelector
identifies model choices that outperform allocating the same LLM to all modules by
5%-70%.
• Open-source artifacts. We release our code and data (footnote 1), including compound systems’
intermediate outputs generated by commercial LLM APIs.
2 Related Work
Compound AI system optimization. Prompt engineering and module interaction design are
central topics of compound AI system optimization. While existing work often relies on manually
tuning them [DeepMind, 2023, Shinn et al., 2023, Zhou et al., 2024b, Pryzant et al., 2023, Fourney
et al., 2024, Zhao et al., 2024, Lu et al., 2023, Zhao et al., 2024], recent work studies how to automate
this process, such as DSPy [Khattab et al., 2024], Textgrad [Yuksekgonul et al., 2024], and Autogen [Wu
et al., 2023]. For example, DSPy uses Bayesian optimization to adjust prompts for all modules, while
Textgrad uses textual feedback to optimize prompts for individual modules. On the other hand, our
work focuses on model selection, a third axis for compound system optimization, complementary to
prompt optimization and module interaction design.
Model market utilization. Model market utilization studies how to use all available (proprietary
and open-source) models for downstream tasks [Lu et al., 2024, Ramírez et al., 2024, Miao et al., 2023].
Extensive work has built various techniques to utilize different models, such as model cascade [Chen
et al., 2024b], model routing [Hu et al., 2024, Stripelis et al., 2024], and mixture-of-experts [Wang
et al., 2024]. While they mainly focus on single-stage tasks such as classification [Chen et al., 2020,
Huang et al., 2025] and question answering [Chen et al., 2024b, Shekhar et al., 2024], we study model
utilization for compound AI systems requiring multiple stages. This is a much more challenging problem
as the search space is much larger.
Model selection. Model selection is a critical part of classic ML and has been extensively studied
in the literature [Kohavi, 1995, Akaike, 1974, Elsken et al., 2019]. While classic techniques focus on
model selection for one ML task, compound systems involve multiple ML tasks. Thus, model selection
becomes more challenging as the search space is exponentially large in the number of tasks.
LLM-as-a-judge. LLMs have been increasingly used for evaluating and judging complex generations,
a phenomenon termed LLM-as-a-judge. Researchers have extensively studied how LLM judges
align with human preferences in real-world scenarios [Zheng et al., 2023, Shankar et al., 2024], how to
improve their quality [Kim et al., 2023], how to evaluate them [Chiang et al., 2024, Chen et al., 2024a, Zeng
et al., 2023], as well as many other applications [Johri et al., 2025, Dhole et al., 2024, Gu et al., 2024,
Zhou et al., 2024a]. In this paper, we find a novel use case of LLM-as-a-judge: diagnosing module-wise
performance to accelerate the model allocation search process.
3 Compound AI Systems
Static Compound AI systems. As defined by Zaharia et al. [2024], compound AI systems address
AI tasks by synthesizing multiple components that interact with each other. Here, we denote a static
compound AI system by a directed acyclic graph $G \triangleq (V, E)$, where each node $v \in V$ denotes one
module, and each directed edge $e \triangleq (u, v) \in E$ indicates that the output from module $u$ is sent to
module $v$ as input. Without loss of generality, we assume a final output module that generates the final
output and has no outgoing edges, and an input module that represents the input query and has
no incoming edges.
LLM modules. An LLM module is a module that utilizes an LLM to process the inputs. It typically
concatenates all inputs as a text snippet (via some prompt template), obtains an LLM’s response to
this snippet, and sends the response as output (potentially after some postprocessing). Throughout
this paper, all modules are LLM modules to simplify notation. In practice, if a module is not an
LLM module, one can either merge it into an LLM module (e.g., a module that postprocesses output
from some LLM module), or convert it into an LLM module by conceptually “adding” an LLM to the
module.
1 https://github.com/LLMSELECTOR/LLMSELECTOR
[Figure 2: two diagrams. (a) self-refine, with modules Gen, Critic, and Refine. (b) multi-agent-debate, with modules Gen 1, Gen 2, Gen 3 and Debate 1, Debate 2, Debate 3.]
Figure 2: Examples of static compound AI systems. (a) self-refine system. (b) multi-agent-debate
system. The diamond and star represent the input and output modules, and the circles represent the
LLM modules. Directed lines represent data flow, and we omit most query inputs for simplicity.
Examples. Consider two examples of static compound AI systems, self-refine and multi-agent-
debate. Self-refine, as shown in Figure 2(a), consists of three modules: a generator, a critic, and
a refiner. Given a query, the generator produces an initial answer. The critic provides feedback on
the initial answer, and the refiner uses the feedback to improve the initial answer. Figure 2(b) shows
the architecture of a six-module system: multi-agent-debate. Here, three generators first give their
initial answers to a question, then three debaters debate with each other based on these initial answers.
Refinements and debates can be iterative, but we focus on only one iteration for simplicity.
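As an illustration, the two example systems can be written down as directed acyclic graphs in a few lines of Python. This is a minimal sketch, not the released implementation: the module names, the exact edge sets (read off the descriptions above, with the query wired to every module), and the helper functions are assumptions made for this example.

```python
from graphlib import TopologicalSorter

# Each key is a module; its value is the set of modules whose outputs it consumes.
# Edge sets are assumptions based on the textual description of each system.
SELF_REFINE = {
    "generator": {"input"},
    "critic": {"input", "generator"},
    "refiner": {"input", "generator", "critic"},
    "output": {"refiner"},
}


def multi_agent_debate(n_agents: int = 3) -> dict[str, set[str]]:
    """Build a one-round debate graph: n generators followed by n debaters."""
    generators = [f"gen_{i}" for i in range(1, n_agents + 1)]
    debaters = [f"debate_{i}" for i in range(1, n_agents + 1)]
    graph: dict[str, set[str]] = {g: {"input"} for g in generators}
    for d in debaters:
        graph[d] = {"input", *generators}  # every debater sees all initial answers
    graph["output"] = set(debaters)
    return graph


def execution_order(graph: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which the modules can be run."""
    return list(TopologicalSorter(graph).static_order())


def llm_modules(graph: dict[str, set[str]]) -> list[str]:
    """All modules except the special input and output nodes."""
    return [v for v in execution_order(graph) if v not in ("input", "output")]
```

Under these assumptions, `llm_modules(SELF_REFINE)` lists the three modules in the order generator, critic, refiner, and `multi_agent_debate()` yields six LLM modules.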
Notations. Table 1 lists our notation. We also use $f_{i \to k}$ to indicate a function that is the same as
function $f$ except that the value $i$ is mapped to the value $k$.
Table 1: Notations.
Symbol          Description
G = (V, E)      A compound AI system
|V|             Number of LLM modules
M               The set of LLMs
f : V → M       A model allocation
z               One task
P(f)            End-to-end performance
p(f, z)         End-to-end performance on z
p_i(f, z)       ith module's performance on z
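The override notation can be read as a copy-and-update on a mapping from modules to models. The sketch below is only an illustration: the function name `reallocate` and the model names are hypothetical.

```python
def reallocate(f: dict[str, str], i: str, k: str) -> dict[str, str]:
    """Return the allocation that equals f except that module i is mapped to model k."""
    if i not in f:
        raise KeyError(f"unknown module: {i}")
    g = dict(f)  # copy, so the original allocation is left untouched
    g[i] = k
    return g


# Hypothetical allocation for the self-refine system.
f = {"generator": "model_b", "critic": "model_b", "refiner": "model_b"}
f_critic_a = reallocate(f, "critic", "model_a")
```

Here `f_critic_a` uses `model_a` for the critic and `model_b` elsewhere, while `f` itself is unchanged.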
4 Modeling and Optimizing Model Selection
This section presents how to model and optimize model selection for static compound AI systems.
[Figure 3: workflow diagram. Inputs: a training set, a compound system's architecture, and candidate LLMs (LLM 1, LLM 2, ..., LLM K). LLMSelector: 1. module nominator, 2. model updater. Output: an optimized model allocation.]
Figure 3: LLMSelector Workflow. LLMSelector takes as input a compound AI system, a pool
of candidate LLMs, a training dataset consisting of question-answer pairs, and a training budget.
Then LLMSelector iteratively nominates one module and allocates to it the model with the highest
module-wise performance estimated by an LLM. This is repeated until the budget is reached or no
performance gain is possible. Finally, LLMSelector returns an optimized model allocation.
4.1 Problem Statement
Consider a static compound AI system $G = (V, E)$ and a set of LLMs $M \triangleq \{1, 2, \cdots, |M|\}$ to use. Let
$F$ denote the set of all possible model allocations $f : V \mapsto M$, each of which allocates an LLM $k \in M$ to each module
$v \in V$. Given a task distribution $D$, the performance of the compound AI system using the model
allocation $f \in F$ is $P(f) \triangleq \mathbb{E}_{z \in D}[p(f, z)]$. Here, $z$ denotes a task sampled from the data distribution,
and $p(f, z)$ is the performance of the compound AI system on the given task $z$ using the allocation $f$.
Our goal is to find one model allocation that maximizes the overall performance, i.e.,
$$\max_{f \in F} P(f) \qquad (1)$$
4.2 The assumptions
Problem (1) is challenging without any assumptions, as it is infeasible to exhaustively search all possible
model allocations, the number of which grows exponentially in the number of modules $|V|$. Here we list
our assumptions to enable tractable analysis.
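To make the size of the search space concrete, the sketch below counts and enumerates allocations. With 10 candidate models, the three-module self-refine system already has $10^3$ allocations and the six-module multi-agent-debate system has $10^6$. The function names and the `performance` callable are hypothetical; in practice each call to `performance` would require running the whole compound system on the training data.

```python
from itertools import product


def num_allocations(num_modules: int, num_models: int) -> int:
    """Size of the search space: one model choice per module."""
    return num_models ** num_modules


def all_allocations(modules, models):
    """Yield every allocation as a dict mapping module -> model."""
    for choice in product(models, repeat=len(modules)):
        yield dict(zip(modules, choice))


def exhaustive_search(modules, models, performance):
    """Brute-force baseline: evaluate every allocation and keep the best one."""
    return max(all_allocations(modules, models), key=performance)
```

For example, `num_allocations(6, 10)` evaluates to `1000000`, which is why exhaustive search quickly becomes impractical.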
Binary performance. For simplicity, we only consider binary performance, i.e., $p(f, z) \in \{0, 1\}$.
Decomposition to per-module performance. In classic computing systems such as a hardware
stack, optimizing individual components (such as CPU, GPU, and memory) often leads to better overall
performance. Similarly, improving individual modules’ quality should also lead to better overall quality
of a compound AI system. For the sake of analysis, we also assume that we can decompose a compound
system’s performance as a monotone function of individual modules’ performance. Formally, let $p_i(f, z)$
denote module $v_i$’s performance on the task $z$ using allocation $f$. Then the end-to-end performance can
be decomposed as $p(f, z) = h(p_1(f, z), p_2(f, z), \cdots, p_L(f, z))$, where $L = |V|$ is the number of modules and $h(\cdot)$ is monotonically increasing.
Monotone module-wise performance. The module-wise performance needs to satisfy certain
properties to enable us to analyze the interplay between individual modules and the compound systems.
In this paper, we focus on module-wise performance $p_i$ with the following two conditions, stated for any allocations $f, f' \in F$ and any models $k, k' \in M$.
• $p_i$ is intra-monotone, which means that
$$p_i(f_{i \to k}, z) \ge p_i(f_{i \to k'}, z) \implies p_i(f'_{i \to k}, z) \ge p_i(f'_{i \to k'}, z).$$
In simple terms, $p_i$ induces a “ranking” for each module: if model $k$ is at least as good as model $k'$ for a given
module under one allocation, it remains so no matter how models are allocated to other modules.
• $p_i$ is inter-monotone, which indicates that
$$p_i(f_{i \to k}, z) > p_i(f_{i \to k'}, z) \implies \forall j,\ p_j(f'_{i \to k}, z) \ge p_j(f'_{i \to k'}, z).$$
In other words, if the $i$th module’s performance improves when its allocated model is replaced (say, model A
by model B), then this replacement should not hurt other modules’ performance, no matter what models
are allocated to other modules.
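For small systems, the two conditions can be checked by brute force for a fixed task $z$. The sketch below is illustrative only: allocations are tuples of model indices, `perf(i, f)` is a hypothetical stand-in for $p_i(f, z)$, and the two toy performance functions are made up for this example.

```python
from itertools import product


def override(f, i, k):
    """Return f with module i reassigned to model k."""
    return f[:i] + (k,) + f[i + 1:]


def is_intra_monotone(perf, num_modules, models):
    allocs = list(product(models, repeat=num_modules))
    for i in range(num_modules):
        for k, k2 in product(models, repeat=2):
            for f, g in product(allocs, repeat=2):
                premise = perf(i, override(f, i, k)) >= perf(i, override(f, i, k2))
                if premise and perf(i, override(g, i, k)) < perf(i, override(g, i, k2)):
                    return False
    return True


def is_inter_monotone(perf, num_modules, models):
    allocs = list(product(models, repeat=num_modules))
    for i in range(num_modules):
        for k, k2 in product(models, repeat=2):
            for f, g in product(allocs, repeat=2):
                if perf(i, override(f, i, k)) > perf(i, override(f, i, k2)):
                    for j in range(num_modules):
                        if perf(j, override(g, i, k)) < perf(j, override(g, i, k2)):
                            return False
    return True


# Toy case 1: each module's score depends only on its own model.
SKILL = [[1, 0], [0, 1]]  # SKILL[i][k]: hypothetical score of model k on module i


def own_skill(i, f):
    return SKILL[i][f[i]]


# Toy case 2: module 1 succeeds only when both modules use the same model,
# so its preferred model flips with the choice made for module 0.
def agreement(i, f):
    return 1 if i == 0 else int(f[0] == f[1])
```

In this toy setting, `own_skill` satisfies both conditions, whereas `agreement` violates intra-monotonicity because the ranking of models for module 1 depends on the model allocated to module 0.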
Do the assumptions always hold? The above two conditions simplify our analysis, but they are not
always satisfied in practice. In these cases, while our analysis may not hold, the derived algorithm is
still applicable and demonstrates superior performance (as shown later in Section 5).
Optimality Characterization. Suppose the module-wise performance is both intra-monotone and
inter-monotone. Then we are able to study the optimal allocation through the lens of module-wise
performance. In particular, we first argue that it is possible to find a model allocation that maximizes
the performance for each module. This is because the module-wise performance is inter-monotone:
improving the model used for one module cannot hurt the performance of other modules. The
second observation is that a module-wise optimal allocation must also be a globally optimal allocation.
This is due to the fact that the end-to-end performance is a monotone function of the individual
module-wise performances.
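A toy instance of this argument is sketched below. It assumes binary module-wise scores that depend only on each module's own model and takes $h$ to be the product of the module scores (the system succeeds only if every module does), which is monotone for non-negative inputs. The score table and model names are hypothetical.

```python
from itertools import product
from math import prod

MODELS = ["model_a", "model_b"]
# Hypothetical module-wise scores p_i for a two-module system.
SKILL = {
    "locate": {"model_a": 1, "model_b": 0},
    "solve": {"model_a": 0, "model_b": 1},
}


def end_to_end(f):
    """h = product of module-wise scores: every module must succeed."""
    return prod(SKILL[m][f[m]] for m in SKILL)


def module_wise_optimum():
    """Pick, independently for each module, the model with the highest score."""
    return {m: max(MODELS, key=scores.get) for m, scores in SKILL.items()}


def global_optimum():
    """Brute-force search over all allocations for comparison."""
    modules = list(SKILL)
    allocs = (dict(zip(modules, c)) for c in product(MODELS, repeat=len(modules)))
    return max(allocs, key=end_to_end)
```

In this example both functions return the same allocation (`model_a` for locate, `model_b` for solve), while allocating either model to both modules gives an end-to-end score of 0.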
4.3 The LLMSelector framework
The above analysis motivates our design of LLMSelector, a principled framework for optimizing
model allocation in compound AI systems within a budget constraint.
Figure 3 gives an overview of how LLMSelector works. It takes the compound AI system architecture
$G$, the set of LLMs $M$, a training dataset $D_{Tr}$, and a training budget $B$ as input, and
returns an optimized model allocation $\hat{f}$ as the output. Here, each data point in the training dataset
$z = (q, a) \in D_{Tr}$ is a question-answer pair specifying a possible question and desired answer. LLMSelector
involves an iterative process. In each iteration, it nominates one module and then allocates to
the module the model with the highest module-wise performance. This is repeated until the training
budget runs out or no module can be further improved by updating one module at a time. The
details can be found in Algorithm 1. The following result shows when LLMSelector can identify
the optimal allocation. The proof is left to the appendix.
Theorem 4.1. Suppose for each task $z$ in $D_{Tr}$, the optimal allocation is unique. Then Algorithm 1
converges to the optimal allocation on the training data after $L$ iterations.
The LLM diagnoser. LLMSelector requires access to the module-wise performance function $p_i$.
In practice, however, this is often unavailable or too expensive to collect. Therefore, we propose to
use an LLM diagnoser to estimate the module-wise performance function. In particular, we give an LLM
as input a compound AI system $G = (V, E)$, a task $z = (q, a)$ consisting of a question $q$ and the
desired answer $a$, and the inputs and outputs of each module $v \in V$ using a specific allocation $f$, and
ask it to determine the $j$th module’s performance. Let $\hat{p}_j(f, z)$ denote the output of the LLM diagnoser.
Then we approximate the module-wise performance by $p_j(f, z) = \hat{p}_j(f, z) + \gamma p(f, z)$, where $\gamma \ge 0$ is a
hyperparameter balancing the LLM’s estimation and the end-to-end performance. The prompt used
for the LLM diagnoser can be found in the appendix.
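The iterative procedure of Algorithm 1, together with the combined module-wise score, can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the released implementation: `llm_selector`, `combined_score`, and `module_perf` are hypothetical names, the initial allocation is passed in explicitly instead of being drawn at random, and a small lookup table stands in for the LLM diagnoser so that no API calls are made.

```python
from collections import Counter


def combined_score(diagnoser_score, end_to_end_score, gamma=0.0):
    """Module-wise estimate: diagnoser output plus gamma times end-to-end performance."""
    return diagnoser_score + gamma * end_to_end_score


def llm_selector(modules, models, tasks, module_perf, budget, init):
    """Round-robin model selection loosely following Algorithm 1.

    module_perf(j, f, z) is a hypothetical callable that estimates the performance
    of module j on task z under allocation f (a dict mapping module -> model).
    """
    num_modules = len(modules)
    per_task = [dict(init) for _ in tasks]  # one allocation per training task
    history = [dict(init)]                  # aggregated allocations f_0, f_1, ...
    cost, i = 0, 0
    while cost <= budget - len(models):
        i += 1
        j = modules[i % num_modules]        # nominate a module
        for f_z, z in zip(per_task, tasks):  # best model for module j on each task
            f_z[j] = max(models, key=lambda k: module_perf(j, {**f_z, j: k}, z))
        votes = Counter(tuple(f_z[m] for m in modules) for f_z in per_task)
        history.append(dict(zip(modules, votes.most_common(1)[0][0])))  # aggregate by mode
        cost += len(models)                 # same cost accounting as Algorithm 1
        recent = history[-num_modules - 1:-1]
        if i > num_modules and all(f == history[-1] for f in recent):
            break                           # no change over a full pass: stop
    return history[-1]


# Toy stand-in for the diagnoser: a fixed table of module-wise scores.
SKILL = {"locate": {"model_a": 1, "model_b": 0}, "solve": {"model_a": 0, "model_b": 1}}


def toy_perf(j, f, z):
    return combined_score(SKILL[j][f[j]], 0, gamma=0.0)


best = llm_selector(
    modules=["locate", "solve"],
    models=["model_a", "model_b"],
    tasks=[0, 1, 2],
    module_perf=toy_perf,
    budget=20,
    init={"locate": "model_b", "solve": "model_b"},
)
```

In this toy run, `best` allocates `model_a` to the locate module and `model_b` to the solve module. Each iteration touches a single module, which is what keeps the number of evaluations linear in the number of modules.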
Algorithm 1: How LLMSelector works.
Input: A compound system $G = (V, E)$, a pool of $K$ candidate LLMs, a training dataset $D_{Tr}$, and a training budget $B$
Output: An optimized model allocation $\hat{f}$
Choose a random $f_0 \in F$ // initialize
Set $i \leftarrow 0$, $c \leftarrow 0$, $\delta \leftarrow$ False, $f_z \leftarrow f_0, \forall z \in D_{Tr}$
while $c \le B - |M|$ and $\delta =$ False do
    $i \leftarrow i + 1$ // advance the iteration counter
    $j \leftarrow i \bmod L + 1$ // nominate a module
    $k_z \leftarrow \arg\max_{k \in M} p_j(f_{z, j \to k}, z), \forall z \in D_{Tr}$
    $f_z \leftarrow f_{z, j \to k_z}, \forall z \in D_{Tr}$ // select a model
    $f_i \leftarrow \mathrm{mode}\{f_z : z \in D_{Tr}\}$ // aggregate
    $c \leftarrow c + |M|$ // update the cost
    if $i > L$ then
        $\delta \leftarrow \prod_{t=i-L}^{i-1} \mathbb{1}[f_t = f_i]$ // stop criterion
    end
end
Return $f_i$ // optimized model choices
5 Experiments
We compare the performance of LLMSelector with vanilla compound AI systems using real-world
LLMs in this section. Our goal is three-fold: (i) validating that allocating different models to
different modules can substantially improve compound AI systems’ performance, (ii) quantifying the
performance gains enabled by LLMSelector, and (iii) understanding how LLMSelector’s LLM
diagnoser makes it possible to identify effective model allocations efficiently.
Experiment setups. Throughout this paper, we use K = 10 real-world LLMs, including GPT-4o,
GPT-4o mini, GPT-4 Turbo, Claude 3.5 Sonnet, Claude 3.5 Haiku, Gemini 1.5 Pro, Gemini 1.5
Flash, Llama 3.1 405B, Llama 3.1 70B, and Qwen 2.5 72B. The temperature is 0.1 for all models.
The maximum number of tokens is 1000 unless specified. By default, we use 50% of each dataset for
training and the other 50% for evaluation.
5.1 A case study on TableArithmetic
Let us start with a case study on TableArithmetic, a synthetic dataset consisting of 100 questions.
Here, each question involves a table consisting of “ID” and “task” rows. The goal is to solve the task
corresponding to a specific ID. The table in each question has a total of 200 entries, and the task in
each entry is a simple arithmetic question “What is X+(10.9>10.11)?”, where X is a random integer
between 1 and 100.
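A question of this form could be generated as in the sketch below. This is an illustration, not the released data generator: the 10-digit ID format, the field names, and the seeding are assumptions. The expected answer treats the true comparison as 1, which matches the example in Figure 4, where the task with X = 48 has the answer 49.

```python
import random


def make_question(num_entries: int = 200, seed: int = 0) -> dict:
    """Build one synthetic TableArithmetic-style question (hypothetical format)."""
    rng = random.Random(seed)
    ids = rng.sample(range(10**9, 10**10), num_entries)  # distinct 10-digit IDs
    xs = [rng.randint(1, 100) for _ in ids]              # X is an integer in [1, 100]
    tasks = [f"What is {x}+(10.9>10.11)?" for x in xs]
    target = rng.randrange(num_entries)
    return {
        "ids": ids,
        "tasks": tasks,
        "question": f"What is the solution to the task with ID {ids[target]}?",
        "x": xs[target],
        # 10.9 > 10.11 is true, and a true comparison counts as 1.
        "answer": xs[target] + (10.9 > 10.11),
    }


example = make_question()
```

Solving a question therefore needs two distinct skills: locating the right entry among 200, and correctly judging that 10.9 is greater than 10.11.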
The locate-solve system. To address TableArithmetic, we use the locate-solve system consisting
of two modules. The first module, locate, extracts the task with the corresponding ID, and the second
module, solve, takes the first module’s output and then answers the extracted task. For this specific
case study, we only use five models: GPT-4o, GPT-4o mini, Claude 3.5 Sonnet, Gemini 1.5 Pro, and
Llama 3.1 405B.
LLMSelector Setup. We use Gemini 1.5 Pro as the LLM diagnoser. For this case study, we set
$\gamma = 0$, that is, we fully rely on the LLM diagnoser as the module-wise performance function for each
module.
Performance Analysis. Figure 4 demonstrates how LLMSelector performs on this task. We first
note that allocating any fixed model to all modules leads to poor end-to-end performance, as shown in
Figure 4(a). This is because no model has high performance for all modules. Second, LLMSelector’s
accuracy is perfect. This is because (i) there exists some model with perfect accuracy on each module,
and (ii) LLMSelector learns this efficiently. For example, Claude 3.5 is perfect on the first module,
and Gemini 1.5 Pro makes no mistake on the second module. LLMSelector learns to leverage the
best model for each module, and thus reaches the best performance. To further understand this,
Figure 4(b) gives a concrete example. The query asks for the solution to the task with a specific ID. The
locate module using Claude 3.5 correctly identifies the task “What is 48+(10.9>10.11)?”, but the
solve module using Claude 3.5 incorrectly suggests that 10.9 is less than 10.11 and thus gives a wrong
answer. On the other hand, the locate module using Gemini 1.5 Pro extracts the wrong task, but the
solve module using Gemini 1.5 Pro solves the extracted task correctly. LLMSelector learns to use Claude 3.5
for the first module and Gemini 1.5 Pro for the second module, and therefore correctly answers this query.
[Figure 4: three panels. (a) Overall performance: bar chart of Performance (%) by Method (LLMSelector, GPT-4o, GPT-4o mini, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.1 405B) for end-to-end, module 1, and module 2. (b) Example: (b1) Input question, (b2) Allocating Claude 3.5, (b3) Allocating Gemini 1.5 Pro, (b4) Allocation by LLMSelector. (c) Optimizer effect: (c1) Test accuracy, plotted as Accuracy versus Budget for LLMSelector, greedy search, and random search; (c2) End-to-end error for each pair of Module 1 and Module 2 models; (c3) LLM diagnoser's judgment.]
Figure 4: A case study on the TableArithmetic dataset. (a) Overall performance. No single LLM
performs well on both modules: Claude 3.5 does well on Module 1 but poorly on Module 2, while
Gemini 1.5 Pro does well on Module 2 but poorly on Module 1. LLMSelector learns to use the best
LLM for each module and thus achieves high performance on both modules and on the whole system.
(b) An example. Claude 3.5 fails to answer the extracted
task correctly, while Gemini 1.5 cannot extract the correct task. LLMSelector allocates them to
different modules to obtain the correct answer 49. (c) Optimizer’s effect. (c1) LLMSelector reduces
the cost by 60% to reach the same accuracy as the exhaustive search. (c2) Greedy search’s accuracy is
surprisingly low because it gets stuck in a locally optimal solution. (c3) The LLM diagnoser enables
LLMSelector to escape the local optimum.
Table 2: Performance of LLMSelector and other approaches for optimizing compound AI systems.
We focus on three compound systems and apply each of them to two tasks: self-refine for LiveCodeBench
and CommonGenHard, multi-agent-debate for SimpleQA and FEVER, and locate-solve
for TableArithmetic and TableBias. The performance gain is the improvement by LLMSelector
against the best of allocating any fixed (same) model to all modules (with underlines). We also compare
LLMSelector with the MIPROv2 optimizer implemented in DSPy (using GPT-4o as the LLM).
We set max_bootstrapped_demos=2, max_labeled_demos=2, and all other parameters as default for
MIPROv2. We also box the second-best result for each dataset. Overall, LLMSelector achieves
5%-70% accuracy gains over allocating any fixed model to all modules. Interestingly, LLMSelector
also outperforms DSPy with MIPROv2, which specializes in prompt optimization for compound systems.
This further suggests the importance of model selection for compound systems.
                     self-refine                     multi-agent-debate     locate-solve
Method               LiveCodeBench   CommonGenHard   SimpleQA   FEVER       TableArith   TableBias
GPT-4o               86%             44%             20%        64%         0%           0%
GPT-4 Turbo          81%             46%             16%        64%         4%           0%
GPT-4o mini          72%             8%              5%         62%         0%           0%
Claude 3.5 Sonnet    89%             56%             20%        60%         0%           0%
Claude 3.5 Haiku     43%             20%             8%         57%         0%           44%
Gemini 1.5 Pro       87%             41%             16%        59%         30%          0%
Gemini 1.5 Flash     80%             13%             5%         38%         8%           2%
Llama 3.1 405B       80%             75%             21%        65%         0%           0%
Llama 3.1 70B        59%             68%             12%        7%          0%           42%
Qwen 2.5 72B         81%             30%             5%         47%         0%           0%
DSPy                 87%             71%             22%        68%         0%           0%
LLMSelector          95%             84%             28%        70%         100%         100%
Gains                6%              11%             7%         5%          70%          56%
Optimizer analysis. Next, we focus on understanding the search efficiency of LLMSelector.
In particular, we compare LLMSelector with two baselines: random search and greedy search.
Given an LLM API budget $B$, random search randomly chooses $B$ model allocations from all possible
allocations, and then returns the one with the highest end-to-end performance. The greedy search
iteratively chooses one module and allocates to it the model with the highest end-to-end performance.
As shown in Figure 4(c1), we have found that LLMSelector consistently outperforms these
baselines. In particular, while random search needs to explore all 25 model allocations to ensure that
the optimal allocation is identified, LLMSelector only needs to try 10 model allocations, resulting in a 60%
cost reduction. Interestingly, the greedy search method has a very low accuracy, even given a large
search budget. This is because end-to-end performance is not always sufficient to reflect module-wise
performance. To see this, Figure 4(c2) gives the training accuracy for each model allocation. We
observe that allocating GPT-4o mini to both modules is a “locally” optimal solution: changing one
module’s model would not improve the performance. However, its performance is actually much lower
than that of the optimal allocation (Claude 3.5 Sonnet for module 1 and Gemini 1.5 Pro for module 2).
LLMSelector escapes from the “locally” optimal allocation with the help of the LLM diagnoser. Consider, for
example, that LLMSelector starts with the locally optimal allocation: using GPT-4o mini for both
modules. Figure 4(c3) shows the LLM diagnoser’s judgment when evaluating module 1’s performance.
While switching the model to Claude 3.5 Sonnet does not improve the end-to-end performance, the
diagnoser recognizes that Claude 3.5 Sonnet performs well for the first module, and thus enables
LLMSelector to move from the initial allocation to allocating Claude 3.5 Sonnet to module 1.
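The difference between the two search signals can be reproduced on a toy problem. In the sketch below, all numbers are hypothetical and chosen only to create a local optimum; they are not the values from Figure 4. A one-module-at-a-time search guided by end-to-end accuracy stays at its starting point, whereas the same search guided by module-wise scores (standing in for the diagnoser's judgment) reaches the best allocation.

```python
from itertools import product

MODELS = ["model_a", "model_b", "model_c"]

# Hypothetical end-to-end accuracy for each (module 1 model, module 2 model) pair.
ACC = {pair: 0.0 for pair in product(MODELS, repeat=2)}
ACC[("model_c", "model_c")] = 0.3  # mediocre, but no single swap improves it
ACC[("model_a", "model_b")] = 1.0  # the best allocation

# Hypothetical module-wise scores, standing in for an LLM diagnoser.
MODULE_SCORE = [
    {"model_a": 1.0, "model_b": 0.0, "model_c": 0.3},
    {"model_a": 0.0, "model_b": 1.0, "model_c": 0.3},
]


def coordinate_search(start, score):
    """Update one module at a time; score(j, alloc) rates alloc from module j's view."""
    current = tuple(start)
    improved = True
    while improved:
        improved = False
        for j in range(len(current)):
            def with_model(k):
                return current[:j] + (k,) + current[j + 1:]
            candidate = with_model(max(MODELS, key=lambda k: score(j, with_model(k))))
            if score(j, candidate) > score(j, current):
                current, improved = candidate, True
    return current


start = ("model_c", "model_c")
greedy = coordinate_search(start, lambda j, f: ACC[f])               # end-to-end signal
guided = coordinate_search(start, lambda j, f: MODULE_SCORE[j][f[j]])  # module-wise signal

# Cost arithmetic from the case study: 10 evaluated allocations instead of 25.
cost_reduction = (25 - 10) / 25
```

Here `greedy` remains at the starting allocation with accuracy 0.3, while `guided` ends at `("model_a", "model_b")` with accuracy 1.0, and `cost_reduction` equals 0.6.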
5.2 Quantitative Performance Improvement
Next, we study the performance of LLMSelector on practical compound AI systems. In particular,
we focus on three compound AI systems, namely, locate-solve, self-refine [Renze and Guven, 2024], and
multi-agent-debate [Du et al., 2024]. The architectures of these systems are shown in Figure 5 in the
appendix. We use six datasets: TableArithmetic and TableBias for locate-solve, LiveCodeBench [Jain
et al., 2024] and CommonGenHard [Renze and Guven, 2024] for self-refine, and SimpleQA [Wei et al., 2024]
and FEVER [Thorne et al., 2018] for multi-agent-debate. We compare LLMSelector with using
any fixed model for all modules and DSPy [Khattab et al., 2024], an open-source library specialized
for prompt optimization in compound systems. For DSPy, we use the optimizer MIPROv2, which
searches for the best prompts using Bayesian optimization. We use GPT-4o as the backbone LLM, and set
max_bootstrapped_demos=2, max_labeled_demos=2, and all other parameters as default for MIPROv2.
More details are given in the appendix.
Table 2 summarizes the quantitative results. First, we observe that no LLM is universally better
than all other LLMs for all tasks. For example, Gemini 1.5 Pro performs the best on TableArithmetic,
but Claude 3.5 Sonnet is the best for LiveCodeBench and Llama 3.1 405B is the best for FEVER.
Second, LLMSelector offers 5%-70%
performance gains compared to the best baselines. Interestingly, LLMSelector also outperforms
the DSPy library, which extensively optimizes the prompts. For example, on the SimpleQA dataset,
optimizing the prompts by DSPy leads to a 1% accuracy gain, but LLMSelector’s improvement
is much more substantial (7%). This is again because different models have their own strengths and
weaknesses, and prompting alone is not adequate to turn an LLM’s weakness into its strength. On the
other hand, LLMSelector searches for the LLM with the desired strength directly and thus offers
more benefits.
5.3 Qualitative Understanding
To further understand when and why LLMSelector outperforms allocating the same model to all
modules, we dive into a few specific examples and compare how LLMSelector’s generations differ
from those produced by allocating the same LLM. In particular, Figure 6 in the appendix gives one example
from the SimpleQA dataset answered by the multi-agent-debate system. LLMSelector learns to
allocate GPT-4o, Llama 3.1 405B, and Gemini 1.5 Pro to the three answer generators, respectively, and
to use GPT-4o for the three debaters. In this example, the three generators give completely different
answers: 8, 3, and -18, and the GPT-4o debaters identify that 3 is the correct answer. Allocating
GPT-4o to all modules leads to an incorrect answer, however. This is because the GPT-4o generators
always return 8 and thus the debaters fail to identify this mistake. We leave more analysis to the
appendix due to the space limit.
6 Conclusion
In this paper, we study how to select which LLMs to allocate to which modules to optimize a given compound
AI system, an important but under-explored question. We propose and develop LLMSelector,
an efficient framework to address this question by leveraging two key insights: (i) end-to-end performance
is often monotonic in per-module performance, and (ii) module-wise performance can be
accurately estimated by an LLM. Our empirical evaluations with real-world LLM APIs show that
LLMSelector offers substantial performance gains (5%-70%) over allocating the same model to
all modules, highlighting the importance of model selection. We also release our code and data
via https://github.com/LLMSELECTOR/LLMSELECTOR to stimulate more research on optimizing compound
AI systems.
References
Hirotugu Akaike. A new look at the statistical model identification. IEEE transactions on automatic
control , 19(6):716–723, 1974.
Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang,
Yao Wan, Pan Zhou, and Lichao Sun. Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with
vision-language benchmark. arXiv preprint arXiv:2402.04788 , 2024a.
Lingjiao Chen, Matei Zaharia, and James Y Zou. Frugalml: How to use ml prediction apis more
accurately and cheaply. Advances in neural information processing systems , 33:10685–10696, 2020.
Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while
reducing cost and improving performance. TMLR , 2024b.
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng
Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open
platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132 , 2024.
DeepMind. Alphacode 2 technical report. 2023. URL https://storage.googleapis.com/
deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf .
Kaustubh D Dhole, Kai Shu, and Eugene Agichtein. Conqret: Benchmarking fine-grained evaluation
of retrieval augmented argumentation with llm judges. arXiv preprint arXiv:2412.05206 , 2024.
Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factu-
ality and reasoning in language models through multiagent debate. In ICML , 2024.
Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal
of Machine Learning Research , 20(55):1–21, 2019.
Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Friederike Niedtner,
Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, et al. Magentic-one: A generalist
multi-agent system for solving complex tasks. arXiv preprint arXiv:2411.04468 , 2024.
Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen,
Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. arXiv preprint arXiv:2411.15594 ,
2024.
Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt
Keutzer, and Shriyash Kaustubh Upadhyay. Routerbench: A benchmark for multi-llm routing
system. arXiv preprint arXiv:2403.12031 , 2024.
Keke Huang, Yimin Shi, Dujian Ding, Yifei Li, Yang Fei, Laks Lakshmanan, and Xiaokui Xiao.
Thriftllm: On cost-effective selection of large language models for classification queries. arXiv
preprint arXiv:2501.04901 , 2025.
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando
Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evalu-
ation of large language models for code. arXiv preprint arXiv:2403.07974 , 2024.
Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. Towards mitigating
llm hallucination via self reflection. In Findings of the Association for Computational Linguistics:
EMNLP 2023 , pages 1827–1843, 2023.
Shreya Johri, Jaehwan Jeong, Benjamin A Tran, Daniel I Schlessinger, Shannon Wongvibulsin, Lean-
dra A Barnes, Hong-Yu Zhou, Zhuo Ran Cai, Eliezer M Van Allen, David Kim, et al. An evaluation
framework for clinical use of large language models in patient interaction tasks. Nature Medicine ,
pages 1–10, 2025.
Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward
Grefenstette, Samuel R Bowman, Tim Rocktäschel, and Ethan Perez. Debating with more persuasive
llms leads to more truthful answers. arXiv preprint arXiv:2402.06782 , 2024.
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Saiful Haq,
Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, Heather Miller, et al. Dspy: Compiling declar-
ative language model calls into state-of-the-art pipelines. In The Twelfth International Conference
on Learning Representations , 2024.
Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun,
Seongjin Shin, Sungdong Kim, James Thorne, et al. Prometheus: Inducing fine-grained evaluation
capability in language models. In The Twelfth International Conference on Learning Representa-
tions , 2023.
R Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection.
Morgan Kaufman Publishing , 1995.
Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi,
and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent
debate. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, EMNLP , November 2024.
Jinliang Lu, Ziliang Pang, Min Xiao, Yaochen Zhu, Rui Xia, and Jiajun Zhang. Merge, ensemble, and
cooperate! a survey on collaborative strategies in the era of large language models. arXiv preprint
arXiv:2407.06089 , 2024.
Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu,
and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models.
Advances in Neural Information Processing Systems , 36, 2023.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri
Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with
self-feedback. Advances in Neural Information Processing Systems , 36, 2023.
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, and Zhihao
Jia. Towards efficient generative large language model serving: A survey from algorithms to systems.
arXiv preprint arXiv:2312.15234 , 2023.
Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt
optimization with “gradient descent” and beam search. arXiv preprint arXiv:2305.03495, 2023.
Guillem Ramírez, Alexandra Birch, and Ivan Titov. Optimising calls to large language models with
uncertainty-based two-tier selection. arXiv preprint arXiv:2405.02134 , 2024.
Matthew Renze and Erhan Guven. Self-reflection in llm agents: Effects on problem-solving perfor-
mance. arXiv preprint arXiv:2405.06682 , 2024.
Shreya Shankar, JD Zamfirescu-Pereira, Björn Hartmann, Aditya Parameswaran, and Ian Arawjo.
Who validates the validators? aligning llm-assisted evaluation of llm outputs with human prefer-
ences. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Tech-
nology , pages 1–14, 2024.
Shivanshu Shekhar, Tanishq Dubey, Koyel Mukherjee, Apoorv Saxena, Atharv Tyagi, and Nishanth
Kotla. Towards optimizing the costs of llm usage. arXiv preprint arXiv:2402.01742 , 2024.
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion:
Language agents with verbal reinforcement learning. Advances in Neural Information Processing
Systems , 36, 2023.
Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao,
Salman Avestimehr, and Chaoyang He. Tensoropera router: A multi-model router for efficient llm
inference. arXiv preprint arXiv:2408.12320 , 2024.
James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. The
FEVER2.0 shared task. In Proceedings of the Second Workshop on Fact Extraction and VERification
(FEVER) , 2018.
Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances
large language model capabilities. arXiv preprint arXiv:2406.04692 , 2024.
Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John
Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv
preprint arXiv:2411.04368 , 2024.
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang,
Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent
conversation framework. arXiv preprint arXiv:2308.08155 , 2023.
Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James
Zou. Textgrad: Automatic “differentiation” via text. arXiv preprint arXiv:2406.07496, 2024.
Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James
Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. The shift from models to
compound ai systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/ ,
2024.
Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, and Danqi Chen. Evaluating large
language models at evaluating instruction following. arXiv preprint arXiv:2310.07641 , 2023.
Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Ö Arik. Chain of agents:
Large language models collaborating on long-context tasks. arXiv preprint arXiv:2406.02818 , 2024.
Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm
agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence ,
volume 38, pages 19632–19642, 2024.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin,
Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.
Advances in Neural Information Processing Systems , 36:46595–46623, 2023.
Ruiyang Zhou, Lu Chen, and Kai Yu. Is llm a reliable reviewer? a comprehensive evaluation of llm
on automatic paper reviewing tasks. In Proceedings of the 2024 Joint International Conference
on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) , pages
9340–9351, 2024a.
Wangchunshu Zhou, Yixin Ou, Shengwei Ding, Long Li, Jialong Wu, Tiannan Wang, Jiamin Chen,
Shuai Wang, Xiaohua Xu, Ningyu Zhang, Huajun Chen, and Yuchen Eleanor Jiang. Symbolic
learning enables self-evolving agents. 2024b. URL https://arxiv.org/abs/2406.18532 .
A Proof of Theorem 4.1
Proof. The proof consists of two parts. First, we show that at iteration $j$, the allocation $f_z$ assigns the
same models to the first $j$ modules as the optimal allocation for each task $z$. Second, we show that
taking the mode over all tasks’ allocations leads to the optimal allocation for the training dataset.
We first note that the uniqueness of a task’s optimal model allocation implies that for each module
only one unique model maximizes the per-module quality. That is, for each $i$, there exists some $k$
such that for any $k' \neq k$, we have $p_i(f_{i \to k}) > p_i(f_{i \to k'})$. Suppose not. Let $k^*$ be the model
allocated to module $i$ by the optimal allocation. Due to the monotone assumption, $k^*$ should also maximize
module $i$’s performance. Let $k' \neq k^*$ be another model that maximizes module $i$’s performance. By the
inter-monotone assumption, switching from $k^*$ to $k'$ does not hurt any other module’s performance.
By the monotone assumption, $k'$ then also maximizes the overall performance, which contradicts the
uniqueness of the optimal allocation. Therefore,
for each module, there is only one unique model that maximizes its performance, regardless of how
other modules are allocated.
Now we can show that at iteration $j$, the allocation $f_z$ assigns the same models to the first $j$ modules
as the optimal allocation. To see this, note that the unique “best” model for each
module must also be the optimal model for the end-to-end system. This is again because of the
monotone assumption: otherwise, one could change the model in the optimal allocation to improve the
performance of one module and thus of the overall system. Therefore, allocating the per-module optimal
model is the same as allocating the optimal model for the entire system. Thus, at iteration $j$, the allocation
$f_z$ assigns the same models to the first $j$ modules as the optimal allocation.
Now we study the second part. By the first part, after $L$ iterations, each $f_z$ has become the best
allocation for task $z$. Recall that we focus on binary performance, i.e., $p(\cdot) \in \{0, 1\}$. Hence, if the model
allocation is not one of the $f_z$, its end-to-end performance is simply 0. Now, for any $f_z$, its performance
on the training dataset is the average of its performance over all data points, i.e.,

$$\frac{1}{\|D_{Tr}\|} \sum_{z' \in D_{Tr}} p(f_z, z').$$

Now recall that the optimal allocation for each query is unique. That is, $p(f_z, z')$ is 1 if $f_z = f_{z'}$, and
0 otherwise. Hence, the training performance is proportional to

$$\sum_{z' \in D_{Tr}} \mathbb{1}_{f_z = f_{z'}}.$$

That is, the performance of allocation $f_z$ is proportional to the number of training data points whose
optimal allocation is the same as $f_z$. Therefore, taking the mode of all optimal allocations is sufficient
to obtain the best allocation for the training dataset.
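
The second part of the argument can be illustrated with a small sketch. It assumes that the per-task optimal allocations are already available as tuples of model names (one entry per module); the names `mode_allocation` and `training_accuracy` and the model strings are hypothetical and only serve the illustration.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple

# An allocation assigns one model (here: a name) to each module.
Allocation = Tuple[Hashable, ...]


def mode_allocation(per_task_optimal: Sequence[Allocation]) -> Allocation:
    """Return the allocation that is optimal for the largest number of tasks."""
    if not per_task_optimal:
        raise ValueError("need at least one per-task allocation")
    return Counter(per_task_optimal).most_common(1)[0][0]


def training_accuracy(candidate: Allocation,
                      per_task_optimal: Sequence[Allocation]) -> float:
    """Binary performance: a task counts as solved only if the candidate
    equals that task's (unique) optimal allocation."""
    hits = sum(candidate == f for f in per_task_optimal)
    return hits / len(per_task_optimal)


# Hypothetical per-task optimal allocations for a two-module system.
per_task = [
    ("gpt-4o", "claude"),
    ("gpt-4o", "claude"),
    ("gemini", "claude"),
    ("gpt-4o", "gemini"),
]
best = mode_allocation(per_task)
print(best, training_accuracy(best, per_task))
```

Under the binary-performance and uniqueness assumptions of the proof, no other allocation in `per_task` can score higher than the mode, since each allocation's score is just its frequency.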
B Experiment Details
B.1 Compound AI systems
In this paper, we focus on three compound AI systems, locate-solve, self-refine and multi-agent-debate.
Their architectures are shown in Figure 5. Locate-solve consists of two modules: the first module
extracts the task associated with an ID from an input table, and the second module returns the answer
to the extracted task. Self-refine has a generator, a critic, and a refiner. The generator gives an initial
answer to a question, the critic gives feedback to this answer, and the refiner uses the feedback to refine
the original answer. Multi-agent-debate has two types of modules, answer generators and debaters.
The answer generators offer initial answers to a question. The debaters take the initial answers and
then debate which one is correct. In this paper, we focus on a six-module multi-agent-debate: three
modules are answer generators, and the other three are the debaters.
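
As a rough illustration of what a per-module model allocation means for such a system, the following sketch wires up self-refine with a separate model name for each module. The `call_llm` callable, the prompt wording, and the model names are all assumptions made for this example; they are not taken from the released code.

```python
from typing import Callable, Dict

# Hypothetical client signature: (model_name, prompt) -> generated text.
LLMCall = Callable[[str, str], str]


def self_refine(question: str, allocation: Dict[str, str], call_llm: LLMCall) -> str:
    """Run generator -> critic -> refiner, each with its own allocated model."""
    # The generator gives an initial answer.
    answer = call_llm(allocation["generator"], f"Question: {question}\nAnswer:")
    # The critic gives feedback on that answer.
    feedback = call_llm(
        allocation["critic"],
        f"Question: {question}\nAnswer: {answer}\nGive feedback on this answer:",
    )
    # The refiner uses the feedback to refine the original answer.
    return call_llm(
        allocation["refiner"],
        f"Question: {question}\nAnswer: {answer}\n"
        f"Feedback: {feedback}\nRefined answer:",
    )


def fake_llm(model: str, prompt: str) -> str:
    """Stand-in for a real API client; reports which model handled the call."""
    return f"<{model}>"


allocation = {"generator": "model-a", "critic": "model-b", "refiner": "model-a"}
print(self_refine("What is 2 + 2?", allocation, fake_llm))
```

Model selection then amounts to choosing the values of the `allocation` dictionary; the pipeline code itself stays unchanged.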
B.2 Datasets and evaluation metrics
Now we provide details of all datasets used in this paper.
Figure 5: The architectures of the compound AI systems studied in the experiments. (a) locate-solve,
consisting of two modules (Locate, Solve). (b) self-refine, using three modules (Gen, Critic, Refine).
(c) multi-agent-debate, which involves six modules in total (Gen 1-3, Debate 1-3).
LiveCodeBench. LiveCodeBench [Jain et al., 2024] is a benchmark for code understanding. We use
the test output prediction task in LiveCodeBench. It contains 479 questions in total. Each question
contains a program and an input. The goal is to predict the output of the program. Note that this
is a generative task, as the output space of a given program is unbounded. We use exact match to
measure the performance of a compound system’s generation.
CommonGenHard. CommonGenHard [Madaan et al., 2023] is a constrained generation dataset
consisting of 200 questions. Each question gives 20-30 concepts, and the goal is to generate a coherent
paragraph that uses all the provided concepts. Since all LLMs used in our evaluation generate coherent
texts, we evaluate only whether all concepts are included. That is, the quality is
1 if all concepts are contained in the generated paragraph, and 0 if any concept is missing.
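
A minimal sketch of such a binary coverage metric is given below. The matching rule (case-insensitive, whole-word, single-word concepts) is an assumption of this example, since the text above does not specify how inflected or multi-word concepts are handled; `concept_coverage` is a hypothetical name.

```python
import re
from typing import Iterable


def concept_coverage(paragraph: str, concepts: Iterable[str]) -> int:
    """Return 1 if every concept appears in the paragraph, else 0.

    Assumption: concepts are single words matched case-insensitively
    against whole words of the paragraph.
    """
    words = set(re.findall(r"[a-z0-9']+", paragraph.lower()))
    return int(all(c.lower() in words for c in concepts))


print(concept_coverage("The dog chased a ball across the park.", ["dog", "ball", "park"]))
print(concept_coverage("The dog chased a ball.", ["dog", "frisbee"]))
```

The first call prints 1 and the second prints 0, because "frisbee" is missing from the second paragraph.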
SimpleQA. SimpleQA [Wei et al., 2024] contains 4326 short, fact-seeking questions. Example
questions include “Who received the IEEE Frank Rosenblatt Award in 2010” and “What is the first and
last name of the woman whom the British linguist Bernard Comrie married in 1985”. While these
questions are seemingly simple, LLMs actually struggle to answer them correctly. We use exact match to
measure the generation quality of a compound system.
FEVER. FEVER [Thorne et al., 2018] is a fact-verification dataset consisting of 2384 questions.
Each question contains a claim, and the task is to classify the claim as one of NOT ENOUGH INFO,
SUPPORTS, REFUTES. We again use exact match as the accuracy metric.
TableArithmetic. TableArithmetic is a synthetic dataset used to understand the locate-solve
system’s performance. It contains 100 questions. Each question consists of a table of “ID” and “task”
rows, and the goal is to solve the task associated with a specific ID. Each row contains 100 entries.
Each question has the form of “What is X+(10.9>10.11)?”, where X is a randomly generated integer.
TableBias. TableBias is another synthetic dataset. It contains 100 questions. Each question
consists of a table of “ID” and “task” rows, and the goal is to solve the task associated with a specific
ID. Here, each table contains 80 entries. Each question has the form of “The surgeon, who is the boy’s
father, says I cannot operate on this boy, he is my son. Who is the doctor to the boy? (Ax) Father
(Bx) Mother”, where again x is a randomly generated integer.
B.3 LLM models
We use 10 LLMs offered by third-party providers, including GPT-4o, GPT-4o mini, GPT-4 Turbo,
Claude 3.5 Sonnet, Claude 3.5 Haiku, Gemini 1.5 Pro, Gemini 1.5 Flash, Llama 3.1 405B, Llama
3.1 70B, and Qwen 2.5 72B. In particular, GPT-4o, GPT-4o mini, and GPT-4 Turbo correspond to
gpt-4o-2024-05-13, gpt-4o-mini-2024-07-18 and gpt-4-turbo-2024-04-09 offered by OpenAI. Claude 3.5
Sonnet and Claude 3.5 Haiku refer to claude-3-5-sonnet-20240620 and claude-3-haiku-20240307 by
Anthropic. Gemini 1.5 Pro and Gemini 1.5 Flash are gemini-1.5-pro and gemini-1.5-flash by Google,
since Google does not offer dated snapshots of its APIs. Finally, open-source models are
accessed via the togetherAI APIs. In particular, Llama 3.1 405B, Llama 3.1 70B, and Qwen 2.5
72B correspond to meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo,
meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo, and Qwen/Qwen2.5-72B-Instruct-Turbo by togetherAI.
B.4 Prompt for the LLM diagnoser
The following box gives the prompt template for the LLM diagnoser.
LLM diagnoser prompt
You are an error diagnosis expert for compound AI systems. Below is the description of a
compound AI system consisting of multiple modules, a query, the generations from each module
of the compound AI system, the final output, and the desired answer. Assume that the desired
answer is 100% correct. If the final output matches the correct answer, generate ‘error: 0’.
Otherwise, analyze whether module i leads to the mistake. If so, generate ‘error: 1’. Otherwise,
generate ’error: 0’. Think step by step.
[Compound AI system]:
[query]:
[module 0 output]:
[module 1 output]:
...:
[module |V| output]:
[final output]:
[desired answer]:
[your analysis]:
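
To make the template concrete, the sketch below fills the bracketed fields and parses the diagnoser's verdict. It is an illustration under assumptions: the function names are hypothetical, the instruction paragraph of the prompt is omitted for brevity, and taking the last ‘error: 0/1’ occurrence as the verdict (because the step-by-step analysis may mention the flag earlier) is a design choice of this example rather than something stated in the paper.

```python
import re
from typing import Optional, Sequence


def build_diagnoser_input(system_desc: str, query: str,
                          module_outputs: Sequence[str],
                          final_output: str, desired_answer: str) -> str:
    """Fill the bracketed fields of the template (instruction text omitted)."""
    lines = [f"[Compound AI system]: {system_desc}", f"[query]: {query}"]
    lines += [f"[module {i} output]: {out}" for i, out in enumerate(module_outputs)]
    lines += [
        f"[final output]: {final_output}",
        f"[desired answer]: {desired_answer}",
        "[your analysis]:",
    ]
    return "\n".join(lines)


def parse_error_flag(response: str) -> Optional[int]:
    """Return the last 'error: 0' or 'error: 1' flag, or None if absent."""
    matches = re.findall(r"error:\s*([01])", response, flags=re.IGNORECASE)
    return int(matches[-1]) if matches else None


prompt = build_diagnoser_input(
    "locate-solve", "What is the task for ID 7?", ["task text", "42"], "42", "43"
)
print(prompt)
print(parse_error_flag("The solver miscomputed the sum. error: 1"))
```

Returning `None` when no flag is found lets the caller decide how to treat malformed diagnoser outputs, for example by retrying or discarding them.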
B.5 Qualitative example analysis
To better understand why LLMSelector can outperform allocating the same LLM to all modules,
we give more examples for self-refine and multi-agent-debate, as shown in Figure 6 and Figure 7.
In addition to the example in Figure 6, which is analyzed in the main paper, another example from
the LiveCodeBench dataset answered by the self-refine system is shown in Figure 7. In this case,
LLMSelector learns to use Claude 3.5 Sonnet for the generator and refiner, and GPT-4o for
the critic module. Recall that always allocating Claude 3.5 Sonnet is better than always allocating
any other LLM. However, doing so leads to an incorrect answer on this example, as Claude 3.5 Sonnet
as the critic fails to realize that its own generation is incorrect. In contrast, GPT-4o as the critic correctly
identifies that the initial generation is incorrect. Thus, LLMSelector correctly answers this question.
Figure 6: An illustrative example of applying LLMSelector to multi-agent-debate on SimpleQA. (a)
The query: “How many seats in the Chamber of Deputies did the Italian Communist Party lose in the
1958 Italian General Election?” (b) Model allocation learned by LLMSelector. By allocating GPT-4o, Gemini 1.5 Pro,
and Llama 3.1 405B to the three generators separately, LLMSelector enables a diverse set of initial
answers (the figure shows 8, 3, and -18), and thus the debaters recognize the correct answer (3).
(c) Allocating GPT-4o to all modules.
GPT-4o as the generator consistently generates the incorrect answer 8; thus, the debaters fail to
identify this issue, leading to an incorrect final answer (8).
Figure 7, panel (a), the query:

    def checkArray(nums: List[int], k: int) -> bool:
        a = [0] * (len(nums) + 1)
        s = 0
        for i in range(len(nums)):
            s += a[i]
            nums[i] -= s
            if nums[i] < 0:
                return False
            if i <= len(nums) - k:
                s += nums[i]
                a[i + k] -= nums[i]
                nums[i] = 0
        return not any(nums)

    Input: checkArray(nums = [2, 2, 3, 1, 1, 0], k = 3)
    Output:

Panel (b), allocation learned by LLMSelector (GPT-4o as the critic, Claude 3.5 Sonnet as the generator
and refiner): Gen outputs False, Critic judges it Incorrect, Refine outputs True (correct).
Panel (c), allocating Claude 3.5 Sonnet to all modules: Gen outputs False, Critic judges it Correct,
Refine outputs False (incorrect).

Figure 7: An illustrative example of applying LLMSelector to self-refine on the LiveCodeBench
dataset. (a) The query. (b) Model allocation learned by LLMSelector. Allocating GPT-4o to the
critic recognizes the mistake made by the initial generation and thus leads to the correct answer “True”.
(c) Allocating Claude 3.5 Sonnet to all modules. Claude 3.5 Sonnet as the critic tends to agree with
its initial generation, and thus its own mistakes can easily be overlooked.