Authors: Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal
Paper Content:
Page 1:
Symbolic Mixture-of-Experts:
Adaptive Skill-based Routing for Heterogeneous Reasoning
Justin Chih-Yao Chen∗, Sukwon Yun∗, Elias Stengel-Eskin∗,
Tianlong Chen, Mohit Bansal
UNC Chapel Hill
https://symbolic-moe.github.io
Abstract
Combining existing pre-trained expert LLMs is a promising avenue for scalably
tackling large-scale and diverse tasks. However, selecting experts at the task level
is often too coarse-grained, as heterogeneous tasks may require different expertise
for each instance. To enable adaptive instance-level mixing of pre-trained LLM
experts, we propose SYMBOLIC-MOE, a symbolic, text-based, and gradient-free
Mixture-of-Experts framework. SYMBOLIC-MOE takes a fine-grained approach to
selection by emphasizing skills, i.e., specialized subcategories or subtopics such as
algebra in mathematics or molecular biology in biomedical reasoning. We propose
a skill-based recruiting strategy that dynamically selects the most relevant set of
expert LLMs for diverse reasoning tasks based on their strengths. Each selected
expert then generates its own reasoning, resulting in k outputs from k experts,
which are then synthesized into a final high-quality response by an aggregator. The
aggregator is chosen based on its ability to integrate diverse reasoning outputs.
We show that instance-level expert selection improves performance by a large
margin but – when implemented naively – can introduce a high computational
overhead due to the need for constant model loading and offloading. To address
this, we implement a batch inference strategy that groups instances based on their
assigned experts, ensuring each model will only be loaded once. This allows us
to integrate 16 models on a single GPU with a time cost comparable to prior
multi-agent baselines using 4 GPUs. Through extensive evaluations on diverse
benchmarks (MMLU-Pro, GPQA, AIME, and MedMCQA), we demonstrate that
SYMBOLIC-MOE outperforms strong LLMs like GPT4o-mini, as well as multi-agent
approaches, with an absolute average improvement of 8.15% over the best
multi-agent baseline. Moreover, SYMBOLIC-MOE removes the need for expensive
multi-round discussions, outperforming discussion baselines with less computation.
1 Introduction
A core strength of humans is our ability to communicate and coordinate with each other using language
(Clark, 1996; Yow & Lim, 2019; Xu et al., 2023). This allows diverse experts to contribute specialized
knowledge towards solving a problem, and is common across a variety of settings, including research,
medicine, and engineering. Like humans, large language models (LLMs) often have differing skills
and strengths, derived from differences in their architectures and training regimens. For instance,
math-specific models like MetaMath (Yu et al., 2023), WizardMath (Luo et al., 2023), and QwenMath
(Yang et al., 2024) are post-trained with mathematical reasoning data, making them particularly
adept at math tasks – often at the cost of performance on out-of-distribution tasks (Kumar et al.,
2022; Chu et al., 2025) like commonsense or medical reasoning (Lobo et al., 2024). Even within
specialized domains, differences in pre-training data can lead to nuanced variations in expertise: one
math-focused model may excel at algebra, while another is better suited for geometry. This motivates
our development of an automated, skill-based framework designed to identify and select the most
∗Equal contribution.
arXiv:2503.05641v2 [cs.CL] 11 Mar 2025
suitable set of expert models for a given problem.¹ Allowing multiple diverse models to combine their
answers can improve even the strongest models (Chen et al., 2024c), and as tasks grow in complexity,
leveraging specialized models at test time presents a promising approach to scaling LLM capabilities.
Indeed, combining multiple “expert” models, i.e., Mixture-of-Experts (MoE) is a well-studied notion
in machine learning (Jacobs et al., 1991; Eigen et al., 2013) and has been applied widely for large
pre-trained models, enabling better performance at a lower computing cost (Shazeer et al., 2017a;
Fedus et al., 2022; Riquelme et al., 2021). However, in the conventional MoE settings, experts are
typically sub-models, i.e. subsets of parameters within a larger model, trained simultaneously in
an end-to-end fashion, where at test time, experts are combined in the model’s parameter space.
This generally requires end-to-end training from scratch, which is often computationally expensive
and precludes the re-use of the vast pool of existing already-trained LLMs. Building on recent
efforts in combining a fixed set of models through multi-agent discussions (Chen et al., 2024c;
Du et al., 2023; Liang et al., 2023; Wang et al., 2024a), we propose exploring a new training-free
paradigm for large-scale MoEs: a symbolic mixture of experts (SYMBOLIC-MOE). Rather than using
information encoded in the model’s hidden state, SYMBOLIC-MOE uses symbolic structures in two
ways. First, SYMBOLIC-MOE infers a set of discrete skills needed to solve a problem, measuring
the abilities of each model in a pool of candidate expert models according to the set. It then uses
skill-based performance as a “router” to recruit a sparse subset of experts for each problem. Second,
SYMBOLIC-MOE combines pretrained experts through a symbolic channel, i.e., language, which
is a common protocol already shared by all models involved. To take advantage of the diverse set
of expert LLMs, two key challenges arise: (1) Effective Expert Selection: Given a large set of
existing LLMs, how can we choose the best experts for each example? (2) Scalable Expert Mixing:
How can we scalably serve a large number of experts without increasing the demand for GPUs?
Effective Expert Selection: The increasing diversity of benchmarks (Miranda et al., 2024) and
the growing number of models means that experts must be selected not at the level of tasks, but at
the level of individual queries. Even at the task level, manual selection can be labor-intensive, and
the performance of multi-agent frameworks can be sensitive to the agents chosen for a task (Chen
et al., 2024c; Liang et al., 2023; Wang et al., 2024b). At the query level, manual selection becomes
infeasible, especially in multi-model settings where different experts must collaborate to address
complex queries (Wang et al., 2024c; Rein et al., 2023). Moreover, manually selecting models for
each task does not guarantee optimal performance. For instance, as shown in Fig. 1 (a), while a given
subset of models may perform well on math tasks on average, their proficiency in specific subfields
like algebra or probability might vary. Therefore, using a fixed subset of models on all math
samples might hurt performance on particular subtasks. This underscores the need for an automated,
fine-grained selection mechanism as shown in Fig. 1 (b). Given that evaluating all possible model
subsets is computationally infeasible, we instead propose an efficient approach that activates only the
most relevant experts while minimizing computational overhead.
Scalable Expert Mixing: Unless models are retrained jointly, the integration of multiple models
must occur through their outputs – typically via a symbolic channel, such as multi-agent discussion.
Past work has often relied on multiple rounds of inference, leading to significant GPU demands.
As shown in Fig. 3 (I), when selecting a fixed set of models for a given domain (e.g., as done by
Li et al. (2024b; 2025)), models could in principle be loaded onto parallel GPUs. This reduces the
latency of the system but increases the number of GPUs needed. Moreover, it does not scale to a
dynamic setting like the one we consider, where the number of GPUs required would be equal to the
number of potential models available (in our case, 16), making this option prohibitively expensive.
Given k models and n queries, this solution requires k GPUs. Alternatively, models could be run
sequentially on a single GPU (cf. Fig. 3 (II)), which minimizes GPU usage but incurs high overhead
due to frequent loading and off-loading of models. In other words, this solution requires O(kn) loads.
Given the high cost of loading models into GPU memory (Griggs et al., 2024; Li et al., 2024a), this
solution is also untenable. Processing each query separately also fails to fully utilize GPU memory,
further limiting its efficiency.
Driven by these motivations, in SYMBOLIC-MOE, we achieve effective expert selection and scalable
expert mixing through the following approach. First, to enhance expert selection, we introduce
automated skill-based recruiting, which leverages an inferred skill-based model profile – a dictionary
mapping the most relevant models to specific skills (e.g., keywords or subcategories) required to
¹ In this work, we use “experts” to refer to the LLMs that are selected to solve a problem.
[Figure 1 image omitted. Panel (a), Existing Multi-Agent Work: task-level recruitment with fixed experts for MATH and a limited number of agents (e.g., 3); resource-intensive. Panel (b), Symbolic-MoE (Ours): instance-level recruitment in which a router selects skill-specific experts (e.g., “Algebra” experts for Q1, “Probability” experts for Q2) from a large LLM pool (e.g., 16), followed by an aggregator; computationally efficient.]
Figure 1: Comparison between existing multi-agent work and our work. (a) In prior work, a fixed set
of task-level experts (e.g., Phi, Mistral, and Llama) is recruited to solve mathematical problems, where
heterogeneous questions may differ in the skills required to solve them (e.g., Q1 requires algebra, while
Q2 focuses on probability). Expert models then engage in multiple rounds of discussion, making
this approach resource-intensive. (b) In contrast, SYMBOLIC-MOE adaptively recruits instance-level
experts based on skills needed (“Algebra” experts for Q1 and a different set of “Probability” experts
for Q2). By generating only a single round of responses with a selected aggregator to synthesize the
final output, our approach is both more performant and more efficient.
solve a given query – using a small subset of available samples. Like a router in a standard MoE
setting, SYMBOLIC-MOE’s skill-based routing allows the most relevant k experts to be activated
while keeping the others inactive, optimizing both accuracy and efficiency. Next, to enable scalable
expert mixing, we propose a batch inference mechanism (illustrated in Fig. 3 (III)). For each
query, the router first selects the necessary experts, and then examples are grouped into batches
per model. We then run all queries for a given model in a single batch, allowing us to load and
off-load O(k) times in total, which is far faster than sequential processing’s O(kn). Finally, while
SYMBOLIC-MOE’s batched implementation accommodates up to 16 models on a single GPU, it
can also be parallelized across multiple GPUs. This flexibility ensures both speedups with increased
computing power and accessibility for users with limited hardware resources. As shown in Figure 1,
existing multi-agent approaches use a fixed set of experts for each task (e.g., all math problems),
regardless of fine-grained topics (e.g., algebra or probability), and rely on multiple discussion rounds,
making them non-adaptive and resource-intensive. In contrast, our approach introduces adaptive
query-level recruitment based on fine-grained skills, selecting the most suitable models for each query.
Furthermore, we improve efficiency through (1) a batch inference strategy that processes all queries
for a model in a single batch and (2) a task-specific aggregator instead of multi-agent discussions,
ensuring that expert models generate their answers only once.
We evaluate SYMBOLIC-MOE on diverse benchmarks, including MMLU-Pro (Wang et al., 2024c),
GPQA (Rein et al., 2023), AIME 2024 (MAA, 2024), and MedMCQA (Pal et al., 2022), using a
diverse model pool. We show that integrating this model pool with automated skill-based recruiting
yields an average accuracy improvement of 8.15% over the best multi-agent baseline. Moreover,
despite primarily using LLMs with 7-8 billion (B) parameters, SYMBOLIC-MOE achieves comparable
performance with larger 70B models, and on average, outperforms strong proprietary models like
GPT4o-mini (OpenAI, 2024), Gemini 1.5 Pro (Team et al., 2024), and DeepSeek-V3
(DeepSeek-AI et al., 2025b). Without any manual intervention, SYMBOLIC-MOE consistently surpasses all
baselines, whereas the strongest baseline changes across settings, with some baselines performing
competitively in one setting but poorly in others. Thus, SYMBOLIC-MOE eliminates the need for
the user to evaluate and compare a large number of possible baselines for each setting. Notably,
SYMBOLIC-MOE obtains these benefits while reducing the amount of compute needed. Using a
single GPU, SYMBOLIC-MOE has 44% less run-time than a mixture-of-agents baseline (Wang et al.,
2024a). When four GPUs are available for both methods, we obtain an almost 2× speedup over
this baseline. Our analysis highlights that SYMBOLIC-MOE shows strong robustness despite the
variation in the best-performing models and optimal aggregators for each task. Additionally, we show
that selecting a task-specific aggregator based on its ability to integrate diverse answers can achieve
performance comparable to multi-round discussions while requiring substantially less compute.
[Figure 2 image omitted. Panel (a), Preprocessing: model profile creation from a pool of models and a validation set, plus aggregator selection using inputs of the form {question}, {correct_CoT}, {incorrect_CoT}, {incorrect_CoT}. Panel (b), Inference: a benchmark test sample is annotated with a skill (e.g., “Algebra”), the router performs adaptive skill-based recruitment using the model profiles, the recruited experts produce chains of thought, and the aggregator produces the final answer.]
Figure 2: Overview of SYMBOLIC-MOE. (a) Preprocessing: Given a validation set and a pool of
agents, we create model profiles and select an aggregator. (b) Inference-Time: For each test example,
SYMBOLIC-MOE activates the most relevant models (experts) based on skill-based routing, using
model profiles determined during preprocessing. These models generate CoT responses, which the
aggregator (chosen based on its ability to select correct answers) synthesizes into a final answer.
2 Related Work
Mixture-of-Agents. Traditional Mixture-of-Experts (MoE) (Jacobs et al., 1991; Jordan & Jacobs,
1994; Chen et al., 1999; Yuksel et al., 2012) models distribute computation across multiple experts,
with growing interest in sparsity-driven approaches. The Sparse MoE (SMoE) approach (Shazeer et al.,
2017a) improves efficiency by activating only the most relevant experts per input, enhancing scalability
for high-dimensional data. This method has been widely applied in vision tasks (Riquelme et al., 2021;
Wang et al., 2020; Yang et al., 2019; Abbas & Andreopoulos, 2020), language tasks (Lepikhin et al.,
2021; Zhang et al., 2021; Zuo et al., 2022; Jiang et al., 2021) and multimodal learning (Kudugunta
et al., 2021; Yun et al., 2024). In contrast to past sparse MoE work, the models in our approach
are combined through symbolic channels (e.g., language outputs) as opposed to in the parameter
space, allowing us to re-use existing LLMs without training. MoA (Wang et al., 2024a) introduces
a framework for combining LLM agents into ensembles that relies on a fixed set of agents across
tasks and instances. This approach requires multiple rounds of generation and aggregation before
producing a final answer. Similarly, Self-MoA (SMoA; Wang et al., 2024a) posits that using multiple
distinct LLMs is unnecessary, suggesting that optimal performance can be achieved by invoking
the task-best model multiple times alongside the task-best aggregator. Our work differs from MoA
and SMoA by introducing adaptive, instance-level, skill-based routing while avoiding costly multi-
model discussions in favor of streamlined aggregation, significantly reducing computational overhead.
We also find that mixing different LLMs is advantageous when paired with effective routing and
aggregation strategies. Moreover, we show that the best aggregator for a task is not necessarily the
best-performing model overall, but can be identified through a synthetic task we introduce, designed
to evaluate aggregation effectiveness.
Multi-Agent Reasoning. Multi-agent reasoning has emerged as a promising paradigm for enhancing
complex problem-solving and decision-making in AI systems. Early approaches employed
reinforcement learning-based coordination (Lowe et al., 2017; Foerster et al., 2018; Jaques et al.,
2019), while recent efforts leverage LLM-based multi-agent frameworks. One line of research
explores student-teacher paradigms (Magister et al., 2022; Fu et al., 2023; Ho et al., 2022; Du
et al., 2023; Chen et al., 2024a), where reasoning capabilities are distilled from stronger to weaker
agents. Another approach investigates multi-agent debate frameworks, where agents interact to refine
arguments and enhance collective decision-making; this has been explored with multiple instances
of a single model (Liang et al., 2023; Xiong et al., 2023; Chan et al., 2023) or debates between
multiple LLM types (Chen et al., 2024c). In both cases, the set of models is predefined by the user.
In contrast, our approach automatically selects models based on inferred skills. Additionally, our
framework achieves superior efficiency by avoiding multi-round discussions while still outperforming
debate-based methods.
3 Methodology
SYMBOLIC-MOE consists of three stages: (I) model profile creation and aggregator selection (Fig. 2
(a)), which occur during preprocessing, followed by (II) expert recruitment and (III) final answer
generation (Fig. 2 (b)), both of which take place during inference. First, we establish the problem
setup in §3.1. Then, in the preprocessing stage (§3.2), we describe how the model profile is created
in §3.2.1, which enables a fine-grained skill-based assessment of each model. We also describe the
process for selecting the aggregator in §3.2.2, which is responsible for generating the final answer.
Next, in the inference stage (§3.3), we introduce the recruitment process in §3.3.1, where the experts
are selected based on the skills needed for each problem, and how the answers from the recruited
experts are combined using the aggregator to improve reasoning in §3.3.2.
3.1 Problem Setup
Given a pool of n models $M = \{M_i\}_{i=1}^{n}$, where each model represents a distinct LLM with
potentially different pre-training datasets and architectures, our goal is to optimize performance
through dynamic allocation – solving each problem with the most suitable subset of k models – and
combined reasoning – allowing experts to combine information to enhance reasoning. To achieve this,
we assume access to a small validation set to obtain (1) model profiles $P_i$ for all $i \le n$, and (2) aggregator
profiles that benchmark the ability of each model to act as an aggregator. We use these profiles to
recruit experts (at the instance level) and to select the aggregator (at the task level).
3.2 Preprocessing
3.2.1 Model Profile Creation
To recruit the k most suitable experts for a given question, we assess each model’s specialized skills
across various problem-solving domains, illustrated in Fig. 2 (a). This is done by evaluating their
performance on the validation set for each task (see Table 6 for sizes), thereby constructing a model
profile $P_i$ for each model $M_i$. For each question in the validation set, we first prompt an LLM –
referred to as the “Keyword LLM” – to identify the essential skills required to solve the problem.
The prompt used for generating these required skills is provided in Appendix G. For consistency, we
use Qwen2.5-7B-Instruct (Team, 2024) as the Keyword LLM. To reduce noise, we generate keyword
annotations for each question five times, and retain only those that appear more than once across the
whole validation set. These extracted skills represent core knowledge areas necessary for solving
the problem – for instance, a given college-level math problem may require skills such as algebra,
calculus, and geometry. Once all questions are annotated with their required skills, each model $M_i$
in the pool attempts to solve them using Chain-of-Thought reasoning (Wei et al., 2022). A correct
answer increases the score of each associated skill by +1, while an incorrect answer results in a
−1 penalty. At the end of this process, each model has a profile $P_i \in \{P_1, \ldots, P_n\}$ represented as a
dictionary. For example, a model’s profile may be: {‘Algebra’: 10, ‘Biology’: 3, ‘Chemistry’: -6, ...}.
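The scoring rule above can be illustrated with a minimal sketch. The function name `build_profile` and the toy validation records are hypothetical and are not taken from the paper's code; the sketch only assumes that each validation question comes with its annotated skills and a correctness flag for the model being profiled.

```python
from collections import defaultdict


def build_profile(records):
    """Build a skill profile for one model.

    `records` is an iterable of (skills, is_correct) pairs, one per
    validation question (hypothetical input format).
    """
    profile = defaultdict(int)
    for skills, is_correct in records:
        delta = 1 if is_correct else -1  # +1 for a correct answer, -1 otherwise
        for skill in skills:
            profile[skill] += delta
    return dict(profile)


# Toy validation outcomes for a single model (illustrative only).
records = [
    (["Algebra", "Calculus"], True),
    (["Algebra"], True),
    (["Chemistry", "Algebra"], False),
]
profile = build_profile(records)
print(profile)
```

Running the same function once per model in the pool would yield one dictionary per model, which is the form of profile the routing step consumes.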
3.2.2 Aggregator Selection
An aggregator is a model that consolidates k outputs into a single high-quality response. Our pilot
experiments, along with findings from Wang et al. (2024a) and Li et al. (2024b), indicate that the
aggregator model plays a crucial role in the final performance, and selecting the most effective
model for aggregation is a non-trivial challenge. We find that reasoning capability and aggregation
ability are often orthogonal. That is, a strong reasoning model is not necessarily a strong aggregator
and vice versa; qualitatively, we show this later in Table 5. We find that adaptively selecting an
aggregator on the instance level based on model profiles is less effective, motivating us to choose
the aggregator based on a model’s ability to consolidate different answers. To identify the best
aggregator per task, we construct a synthetic task using the same validation set. From the profile
creation process, we obtain outputs from all models, some correct and some incorrect. For each
question, we sample one correct reasoning chain and two incorrect ones, structuring the input as
follows: {question}, {correct_CoT}, {incorrect_CoT}, {incorrect_CoT}. We
shuffle the order of the correct and incorrect CoTs and instruct each model to act as an aggregator
(using the prompt shown in Appendix G), synthesizing a final output with a predicted answer. We
then benchmark each model’s aggregation ability, measuring how well it can generate a correct
answer based on this input, and select the best-performing aggregator for each task.
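As a rough sketch of this synthetic task, the snippet below builds one aggregation input and picks the aggregator with the highest accuracy. The helper names (`build_aggregation_input`, `select_aggregator`) and the accuracy numbers are hypothetical; the actual prompt is the one given in Appendix G and is not reproduced here.

```python
import random


def build_aggregation_input(question, correct_cots, incorrect_cots, rng):
    """Sample one correct and two incorrect chains, then shuffle their order."""
    chains = [rng.choice(correct_cots)] + rng.sample(incorrect_cots, 2)
    rng.shuffle(chains)
    return {"question": question, "chains": chains}


def select_aggregator(accuracy_by_model):
    """Return the model with the highest aggregation accuracy on the task."""
    return max(accuracy_by_model, key=accuracy_by_model.get)


rng = random.Random(0)
item = build_aggregation_input(
    "What is 2 + 2?",
    correct_cots=["2 + 2 = 4, so the answer is 4."],
    incorrect_cots=["The answer is 5.", "The answer is 3.", "The answer is 22."],
    rng=rng,
)

# Hypothetical aggregation accuracies measured on the synthetic task.
best = select_aggregator({"model_a": 0.61, "model_b": 0.74, "model_c": 0.58})
```

Shuffling matters here because a fixed position for the correct chain would let a candidate aggregator succeed by position alone.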
3.3 Inference
3.3.1 Skill-based Recruiting
At inference time (see Fig. 2 (b)), we follow the same keyword annotation procedure as in Section 3.2.1
to generate relevant keywords for the test sample. To align a test sample’s keywords with those in
the model profiles, we employ Sentence-BERT (Reimers & Gurevych, 2020) to match keywords
via the cosine similarity between their embeddings. Each test-time keyword is then mapped to its
closest counterpart in the model profile. Next, the expert recruitment is performed by selecting the
top-k models whose profiles best match the required skills of the test sample. This is determined by
two factors: (1) local suitability score and (2) global competency. For each model $M_i$, its local
suitability score for a test sample $q$, $S(M_i, q)$, is computed as the sum of its skill scores over the set
of keywords needed for $q$ (denoted as $K_q$). It can be expressed as follows:
$$S(M_i, q) = \sum_{k_j \in K_q} s^{(i)}_{k_j}$$
where $s^{(i)}_{k_j}$ represents the score of model $M_i$ for the $j$-th skill in the test sample $q$. This results in a
model ranking distribution $D_q$ for each test sample $q$:
$$D_q = \{S(M_1, q), S(M_2, q), \ldots, S(M_n, q)\}$$
Intuitively, suppose $M_1$ has scores of +3, +5, and −2 for algebra, calculus, and geometry, respectively,
which are needed for a given sample; its total score for this sample would be 3 + 5 − 2 = 6.
Having this calculation for all models yields a ranking distribution of model strengths for the given
sample, e.g., {M1: 6, M2: 3, ..., Mn: -10}, which is the ranking of how suitable a model is for a
sample. To account for each model’s overall strength in a task, i.e., global competency, we compute
each model’s total score across all keywords in its profile, and normalize it by the total sum of all
agents’ scores. We denote this global strength as $\gamma_i$, representing a model’s overall task performance
relative to others. Finally, the expert selection is performed by sampling from the product of the local
suitability score, $S(M_i, q)$ (from a model rank distribution $D_q$) and the global competency $\gamma_i$. That
is, the relevance score of a model $M_i$ for a test sample $q$ is:
$$w^{(i)}_q = \gamma_i \, S(M_i, q)$$
We apply a softmax function with the temperature set to 0.5 to this distribution $\{w^{(i)}_q\}_{i=1}^{n}$, and then
sample k experts for each problem, i.e.,
$$E^{(i)}_q \sim \mathrm{Categorical}\left(w^{(1)}_q, w^{(2)}_q, \ldots, w^{(n)}_q\right), \quad i \in \{1, 2, \ldots, k\}$$
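The scoring and sampling steps above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function names are hypothetical, keywords are assumed to be already mapped onto profile keys, the total of all profile scores is assumed to be positive, and the k experts are drawn as independent categorical samples (so the same model may be drawn more than once).

```python
import math
import random


def local_suitability(profile, keywords):
    """S(M_i, q): sum of the model's skill scores over the keywords of q."""
    return sum(profile.get(kw, 0) for kw in keywords)


def global_competency(profiles):
    """gamma_i: each model's total score divided by the sum over all models.

    Assumes the overall sum is positive (an assumption of this sketch).
    """
    totals = {m: sum(p.values()) for m, p in profiles.items()}
    denom = sum(totals.values())
    return {m: t / denom for m, t in totals.items()}


def softmax(scores, temperature=0.5):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def recruit(profiles, keywords, k, rng):
    """Sample k experts from softmax(gamma_i * S(M_i, q)) with temperature 0.5."""
    gamma = global_competency(profiles)
    models = list(profiles)
    w = [gamma[m] * local_suitability(profiles[m], keywords) for m in models]
    probs = softmax(w, temperature=0.5)
    return rng.choices(models, weights=probs, k=k)


profiles = {
    "M1": {"algebra": 3, "calculus": 5, "geometry": -2},
    "M2": {"algebra": 1, "calculus": 1, "geometry": 1},
}
keywords = ["algebra", "calculus", "geometry"]
experts = recruit(profiles, keywords, k=3, rng=random.Random(0))
```

For `M1`, the local suitability is 3 + 5 − 2 = 6, matching the worked example in the text; a lower temperature concentrates the sampling mass on the highest-scoring models.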
To enhance efficiency and reduce computational overhead, we filter out low-frequency experts –
those who appear in fewer than 5% of test cases. For example, given a test set with 100 samples,
where 3 experts are recruited per sample (totaling 300 selections), any expert appearing fewer than
300 × 5% = 15 times is discarded and resampled from the remaining higher-frequency experts.
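A minimal sketch of this filtering step is given below. The text does not specify how replacements are drawn from the remaining experts, so the frequency-proportional resampling used here is an assumption of the sketch, as are the function name `filter_low_frequency` and the toy assignments.

```python
import random
from collections import Counter


def filter_low_frequency(assignments, min_share=0.05, seed=0):
    """Replace experts whose selection count falls below min_share of all selections.

    Replacements are drawn in proportion to the kept experts' counts
    (an assumption; the resampling scheme is not specified in the text).
    """
    rng = random.Random(seed)
    counts = Counter(e for experts in assignments for e in experts)
    threshold = min_share * sum(counts.values())
    kept = sorted(e for e, c in counts.items() if c >= threshold)
    weights = [counts[e] for e in kept]
    filtered = []
    for experts in assignments:
        row = []
        for e in experts:
            if counts[e] >= threshold:
                row.append(e)
            else:
                row.append(rng.choices(kept, weights=weights, k=1)[0])
        filtered.append(row)
    return filtered


# 100 samples with 3 experts each; expert "D" is selected only once.
assignments = [["A", "B", "C"]] * 99 + [["A", "B", "D"]]
filtered = filter_low_frequency(assignments)
```

Dropping rarely used experts reduces the number of distinct models that must be loaded, at the cost of occasionally overriding the router's choice.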
3.3.2 Final Answer Generation
After expert recruitment, each sample will be passed to the experts, i.e., the input for each expert is
the test problem, $x_0 = q$. These experts generate their reasoning paths to the problem in the form of
Chain-of-Thought (Wei et al., 2022):
$$y^{(i)}_0 = E^{(i)}(x_0) \quad \forall i \in \{1, 2, \ldots, k\}$$
Then, the task-specific aggregator $A^*$ is introduced to synthesize the k outputs into a high-quality
final answer (Wang et al., 2024a). That is, the final answer is produced by:
$$y = A^*\left(\Big\Vert_{i=1}^{k} y^{(i)}_0\right),$$
where $\Vert$ denotes the concatenation operation. In Appendix H, we provide a detailed discussion on
SYMBOLIC-MOE in the context of sparse MoE frameworks and how it shares their design principles.
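The single-round generate-then-aggregate flow can be sketched with plain callables standing in for LLM calls. Everything below is a placeholder: the stub experts, the stub aggregator, and the `[Expert i]` separator format are hypothetical and do not reflect the prompts in Appendix G.

```python
def generate_final_answer(question, experts, aggregator):
    """Run each expert once on the question, concatenate outputs, then aggregate."""
    outputs = [expert(question) for expert in experts]  # one CoT per expert
    joined = "\n\n".join(
        f"[Expert {i}] {out}" for i, out in enumerate(outputs, start=1)
    )
    return aggregator(question, joined)


# Stub experts and aggregator standing in for real model calls (illustrative only).
experts = [
    lambda q: "f(x) = a(x - 2)(x - 5); f(0) = 10a = 10, so a = 1.",
    lambda q: "Plugging in (0, 10) gives a = 1.",
    lambda q: "The roots sum to 7, so a = 7.",
]


def toy_aggregator(question, context):
    """Count how many expert outputs were concatenated into the context."""
    return context.count("[Expert ")


result = generate_final_answer("What is a?", experts, toy_aggregator)
```

Each expert is called exactly once per question, which is what distinguishes this flow from multi-round discussion.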
[Figure 3 image omitted. Panel (I), Parallel Inference w/ Multiple GPUs: requires n GPUs at the same time. Panel (II), Sequential Inference w/ Single GPU: models are loaded and reloaded question by question, taking n times longer than using 1 model. Panel (III), Batch Inference, flexible to the number of GPUs: questions are grouped by expert (in the example, Qwen: questions 1, 2; Llama: 1; Mistral: 1, 3; Gemma: 2, 3; Phi: 2, 3), which fits in a single GPU and can be accelerated with more GPUs.]
Figure 3: Different approaches to achieving adaptiveness in SYMBOLIC-MOE, which uses different
models for each instance. In a naive setup (I), k GPUs must be hosted simultaneously, allowing
immediate access to outputs from each model. Another naive setup (II) requires only a single GPU
but involves constant loading and offloading of models to obtain outputs from the corresponding
model. Our scalable batch inference process (III) strikes a balance between (I) and (II). When models
are assigned to problems, we group samples by model and sequentially load the corresponding LLM
onto a single GPU to generate outputs efficiently. Moreover, this approach still allows us to parallelize
across GPUs if they are available.
3.3.3 Scalable Batched Inference
In our experiments, we mostly consider 7B–8B parameter LLMs, which have a substantial memory
footprint. Due to the adaptive nature of the recruitment process, the set of participating LLMs may
change dynamically for different problems. For instance, one sample may require Qwen, Llama, and
Mistral, while another may need Gemma, Exaone, and Phi. A naive implementation of this approach
can lead to high latency, particularly when the required models change frequently. In such cases,
the computational cost includes not only inference but also data transfers across GPUs. To mitigate
these computational challenges, SYMBOLIC-MOE introduces a novel batching strategy to maximize
throughput. Specifically, for a given set of instances, we precompute (using inferred skills) which
LLMs will be called for each instance. We then group instances based on their required experts, as
illustrated in Fig. 3 (III) and Algorithm 1 in Appendix F. In other words, each active expert receives
all relevant instances at once, ensuring that each expert is loaded only once per batch. This enables
efficient batched inference on a single GPU while supporting a pool of 16 LLMs. Moreover, this
approach is flexible, as more GPUs can further accelerate inference through parallelization.
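The grouping step can be illustrated with a short sketch that uses the three-question example from Fig. 3. The function name `group_by_expert` is hypothetical, and the sketch covers only the grouping and the resulting load counts, not Algorithm 1 in Appendix F.

```python
from collections import defaultdict


def group_by_expert(assignments):
    """Invert {question_id: [experts]} into {expert: [question_ids]}."""
    batches = defaultdict(list)
    for qid, experts in assignments.items():
        for expert in experts:
            batches[expert].append(qid)
    return dict(batches)


# Expert assignments for three questions, following the example in Fig. 3.
assignments = {
    1: ["Qwen", "Llama", "Mistral"],
    2: ["Qwen", "Gemma", "Phi"],
    3: ["Mistral", "Gemma", "Phi"],
}
batches = group_by_expert(assignments)

# Per-question sequential processing loads one model per (question, expert) pair,
# whereas batched processing loads each distinct expert once.
sequential_loads = sum(len(experts) for experts in assignments.values())
batched_loads = len(batches)
print(batches, sequential_loads, batched_loads)
```

Because the batches are independent of one another, they could also be distributed across several GPUs when more hardware is available.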
4 Results and Analysis
4.1 Experimental Setup
Model Pool and Datasets. Our experiments consider 16 LLMs ranging from 3.5B to 12B parameters,
with most models falling in the 7–8B range. These include general-purpose instruction-tuned
models, domain-specific fine-tuned variants on math and biology, and models distilled from DeepSeek
R1’s trajectories (DeepSeek-AI et al., 2025a). A full list of models is provided in Table 7 in the
Appendix. We measure performance on a diverse range of datasets, chosen to require expertise in a
number of domains. First, we consider MMLU-Pro (Wang et al., 2024c), which is a harder version of
MMLU (Hendrycks et al., 2021), containing a variety of questions across 14 college-level subjects,
from Philosophy and History to Economics and Chemistry. Given its large test set of 12,000 samples
and the computational cost of evaluating proprietary models, we employ stratified sampling to create
a subset of 2,100 samples, ensuring each category contains 150 samples. We also evaluate on AIME
2024, which is a challenging mathematics competition dataset containing math Olympiad problems.
For more science-specific reasoning, we test on GPQA (Rein et al., 2023), which contains questions
across a variety of science fields written by experts, explicitly designed to be difficult to answer even
by skilled humans with web access. Finally, we also include MedMCQA (Pal et al., 2022), which
covers questions sourced from medical entrance exams across 21 medical subjects. For each dataset,
Category                    Method                  Model           MMLU-Pro      AIME          GPQA          MedMCQA       Avg.
Closed-Source Single Model  Zero-Shot CoT           GPT4o-mini      63.95         10.00         42.93         68.18         46.27
Closed-Source Single Model  Zero-Shot CoT           Gemini 1.5 Pro  76.38         36.67         61.62         72.68         61.84
Closed-Source Single Model  Zero-Shot CoT           DeepSeekV3      76.29         26.00         60.10         74.09         59.12
Open-Source 70B Model       Zero-Shot CoT           Qwen2.5 72B     71.54 ±0.88   25.55 ±3.85   51.02 ±0.27   69.02 ±0.32   54.28
Open-Source 70B Model       Zero-Shot CoT           Llama3.3 70B    69.26 ±0.47   32.22 ±3.85   51.44 ±0.62   59.78 ±0.74   53.18
Open-Source 7B Model        Zero-Shot CoT           QwenR1          52.57 ±0.45   55.93 ±5.16   44.95 ±1.49   38.72 ±0.44   48.04
Open-Source 7B Model        Zero-Shot CoT           Task-Best       54.89 ±0.53   55.93 ±5.16   48.43 ±3.10   55.44 ±0.50   53.62
Advanced Single Model       Self-Refine (SR)        Task-Best       53.74 ±0.20   53.33 ±3.34   50.84 ±3.65   49.57 ±0.59   51.87
Advanced Single Model       Self-Consistency (SC)   Task-Best x5    56.71 ±0.14   67.78 ±1.57   53.54 ±0.36   56.85 ±0.11   58.72
Single-Model Multi-Agent    Debate                  Task-Best x3    56.21 ±0.55   56.67 ±6.67   50.51 ±0.51   51.63 ±0.80   53.76
Single-Model Multi-Agent    Self-MoA                Task-Best x3    55.43 ±0.72   55.56 ±5.09   52.86 ±1.46   53.27 ±0.60   54.28
Multi-Model Multi-Agent     MoA                     Top-3           61.78 ±0.25   41.11 ±5.09   52.86 ±3.37   59.29 ±0.32   53.76
Multi-Model Multi-Agent     ReConcile               Top-3           56.46 ±0.10   50.00 ±7.20   47.98 ±2.32   60.74 ±0.43   53.80
Multi-Model Multi-Agent     SYMBOLIC-MOE            Adaptive        63.71 ±0.43   68.88 ±5.08   57.78 ±2.09   59.35 ±0.14   62.43
Table 1: Comparison of SYMBOLIC-MOE with single-model and multi-model baselines.
SYMBOLIC-MOE outperforms all multi-agent baselines and achieves performance comparable to strong
proprietary models like GPT4o-mini, as well as 70B models, while primarily operating with 7-8B models.
Notably, no single baseline consistently secures the second-best performance, even when the strongest
models for each task are known. In contrast, our method demonstrates robustness, yielding superior
results through adaptive expert selection. We bold the best results and underline the second-best
(excluding methods using bigger or proprietary models, shown in gray).
we sample around 350 examples as the validation set to create the agent and aggregator selection
profiles.² Full dataset statistics are provided in Table 6 in the Appendix.
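The stratified subsampling described above for MMLU-Pro (150 questions from each of 14 categories, 2,100 in total) can be sketched as follows. This is a minimal illustration rather than the authors' sampling script: the `category` field name, the toy data, and the seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(examples, per_category, seed=0):
    """Draw `per_category` examples from every category, without replacement."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the subset reproducible
    by_category = defaultdict(list)
    for example in examples:
        by_category[example["category"]].append(example)

    subset = []
    for category in sorted(by_category):
        pool = by_category[category]
        if len(pool) < per_category:
            raise ValueError(f"category {category!r} has only {len(pool)} examples")
        subset.extend(rng.sample(pool, per_category))
    return subset


# Toy stand-in for the benchmark: 14 hypothetical categories with 800 questions each.
toy_examples = [
    {"id": f"c{c}-q{i}", "category": f"c{c}"} for c in range(14) for i in range(800)
]
subset = stratified_sample(toy_examples, per_category=150)
print(len(subset))  # 14 * 150 = 2100
```

Sampling a fixed count per category, rather than a fixed fraction of the whole test set, is what guarantees that every subject carries equal weight in the subset.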
Baselines. We compare against four categories of baselines.
• Zero-shot single-model methods: This category includes proprietary models such as GPT-4o-mini
(OpenAI, 2024), Gemini 1.5 Pro (Team et al., 2024), and DeepSeek-V3 (DeepSeek-AI et al.,
2025b); high-capacity open-source models like Qwen2.5 72B (Qwen et al., 2025) and Llama 3.3
70B (AI@Meta, 2024); and strong distilled 7B models such as QwenR1 (DeepSeek-AI et al.,
2025a). For reference, we also report the best task-specific model from our pool for each task,
denoted as Task-Best.
• Advanced single-model baselines with inference-time compute: We evaluate methods that
enhance inference-time reasoning, specifically Self-Refine (SR) (Madaan et al., 2023b) and
Self-Consistency (SC) (Wang et al., 2023b). To ensure a fair comparison, we set SC’s sample size
to 5, closely matching the four large language model (LLM) calls in SYMBOLIC-MOE, which
engages three experts and one aggregator model.³ Additionally, for these baselines, we use the
best-performing LLM for each task, identified on the same validation set used for our agent
profile creation.
• Single-model multi-agent baselines: To isolate the impact of SYMBOLIC-MOE’s recruitment
strategy, we compare against methods where multiple instances of the same model collaborate.
Specifically, we consider Multi-Agent Debate (Debate) (Du et al., 2023) and
Self-Mixture-of-Agents (Self-MoA) (Li et al., 2025), both of which rely on iterative, multi-round
discussions using a single model. These baselines employ three agents, each using the same
task-best model, and conduct two rounds of discussion, resulting in a total of 6 LLM calls per sample.
• Multi-model multi-agent baselines: We also evaluate approaches leveraging diverse models in a
multi-agent setup. This includes Mixture-of-Agents (MoA) (Wang et al., 2024a) and ReConcile
(Chen et al., 2024c), both of which incorporate a fixed set of models in multi-round interactions. To
ensure a fair comparison with our approach, particularly in the use of the validation set, we select
the top three performing models from the validation set and conduct multi-round interactions. In
MoA, agents participate in two rounds of discussion, while agents in ReConcile engage in three
rounds, leading to 6 and 9 LLM calls per sample, respectively.
² For AIME, we sample validation questions from prior years’ problems (2012-2023).
³ We use an odd number of SC calls to avoid ties.
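The per-sample LLM call counts quoted for each method follow from simple agents-times-rounds arithmetic, which the sketch below reproduces. The helper `llm_calls` and the method labels are hypothetical names introduced only for this illustration.

```python
def llm_calls(num_agents, num_rounds=1, num_aggregators=0):
    """Count LLM calls per sample: every agent responds once per round,
    and each aggregator adds one more call at the end."""
    return num_agents * num_rounds + num_aggregators


call_budget = {
    "Self-Consistency": llm_calls(num_agents=5),                    # 5 samples, one round
    "Debate": llm_calls(num_agents=3, num_rounds=2),                # 3 agents x 2 rounds
    "Self-MoA": llm_calls(num_agents=3, num_rounds=2),              # 3 agents x 2 rounds
    "MoA": llm_calls(num_agents=3, num_rounds=2),                   # 3 agents x 2 rounds
    "ReConcile": llm_calls(num_agents=3, num_rounds=3),             # 3 agents x 3 rounds
    "Symbolic-MoE": llm_calls(num_agents=3, num_aggregators=1),     # 3 experts + 1 aggregator
}

for method, calls in call_budget.items():
    print(f"{method}: {calls} calls per sample")
```

Under this accounting, the expert-plus-aggregator design uses the fewest calls of any method in the comparison.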
For single-model baselines, we use the strongest task-specific LLM, while for multi-model baselines,
we select the top three models per task. These selections, like those in SYMBOLIC-MOE, are
determined based on validation performance. Table 8 in the Appendix details the top-1 and top-3
models for each task.
Implementation Details. We conduct our experiments for SYMBOLIC-MOE and other single-model
baselines on a single A6000 GPU with 48 GB of memory, while MoA and ReConcile are
executed on 8 A6000 GPUs for parallelization. For the 70B models, we use the original version
without quantization and perform inference on 4 A6000 GPUs. All open-source models utilize vLLM
(Kwon et al., 2023) for inference. The temperature is set to 0.7 for all methods. The maximum output
token length is fixed at 4096 for all models, except for QwenR1 and LlamaR1, which have a limit of
32768 since they are trained with longer trajectories and tend to generate longer outputs. All results,
except those from proprietary models (due to budget constraints), are averaged over three random
seeds. Further details on the model pool, the distribution of recruited experts, and all the prompts we
use can be found in Table 7 and Appendix G.
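As a rough illustration, the decoding settings stated above can be collected into one small configuration helper. This is a sketch, not the authors' code: `generation_config` and the seed values 0, 1, 2 are assumptions, and the dictionary keys are only chosen to resemble the temperature, token-limit, and seed options that inference libraries such as vLLM expose.

```python
# Models with the longer output budget, as named in the text.
LONG_TRAJECTORY_MODELS = {"QwenR1", "LlamaR1"}


def generation_config(model_name, seed):
    """Return the decoding settings described in the implementation details."""
    max_tokens = 32768 if model_name in LONG_TRAJECTORY_MODELS else 4096
    return {"temperature": 0.7, "max_tokens": max_tokens, "seed": seed}


# Three runs per model; the concrete seed values here are hypothetical.
SEEDS = (0, 1, 2)
qwen_r1_runs = [generation_config("QwenR1", seed) for seed in SEEDS]
other_runs = [generation_config("some-7b-model", seed) for seed in SEEDS]

print(qwen_r1_runs[0])
print(other_runs[0])
```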
4.2 Main Results
We present the main results in Table 1, and summarize the key findings below.
SYMBOLIC-MOE consistently outperforms all baselines. Across all domains, SYMBOLIC-MOE
shows superior performance compared to all baselines, beating single-model baselines using the best
overall model (e.g., SR, SC), multi-agent debate with a single model (Debate and Self-MoA), as well
as multi-model multi-agent baselines like MoA and ReConcile. SYMBOLIC-MOE outperforms the
most competitive multi-agent baseline, Self-MoA, by 8.15% (absolute) on average, with consistent
improvements across domains (e.g., 8.28% gain on MMLU-Pro, 13.45% on AIME, 4.92% on GPQA,
6.08% on MedMCQA). These gains are also seen when comparing to multi-model baselines like
MoA and ReConcile that use the top three strongest models per domain. SYMBOLIC-MOE also
substantially outperforms test-time scaling methods, such as Self-Refine (Madaan et al., 2023b)
and Self-Consistency (Wang et al., 2023a). Surprisingly, when using the task-best model, SC beats
multi-agent debate baselines (e.g., Self-MoA, MoA), though it still underperforms SYMBOLIC-MOE
by an average of 3.71%. This indicates that scaling test-time compute with the task-best model is a
simple but effective way to improve performance, and that adaptively selecting the most suitable
experts leads to further improvements.
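The average gaps quoted above can be recomputed directly from the per-task accuracies in Table 1. The short sketch below does this for the two comparisons stated as averages; the dictionary layout and helper names are illustrative choices, while the numbers are copied from the table.

```python
# Per-task accuracies from Table 1, in the order MMLU-Pro, AIME, GPQA, MedMCQA.
table1 = {
    "Self-Consistency": [56.71, 67.78, 53.54, 56.85],
    "Self-MoA": [55.43, 55.56, 52.86, 53.27],
    "Symbolic-MoE": [63.71, 68.88, 57.78, 59.35],
}


def average(scores):
    """Unweighted mean over the four tasks, as in the Avg. column."""
    return sum(scores) / len(scores)


def average_gap(method, baseline):
    """Absolute difference between two methods' task-averaged accuracy."""
    return round(average(table1[method]) - average(table1[baseline]), 2)


gap_vs_self_moa = average_gap("Symbolic-MoE", "Self-MoA")
gap_vs_sc = average_gap("Symbolic-MoE", "Self-Consistency")
print(gap_vs_self_moa, gap_vs_sc)  # 8.15 3.71
```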
SYMBOLIC-MOE generalizes well across tasks. No single baseline in Table 1 is universally
effective across all tasks. For instance, while MoA performs well on MMLU-Pro, it struggles on
AIME; ReConcile excels in MedMCQA but fails to generalize to GPQA. Knowing which method
works best for a task is therefore nontrivial, requiring running every method on validation sets to
choose from the many settings available. In contrast, SYMBOLIC-MOE consistently delivers strong
performance across all domains. SYMBOLIC-MOE especially excels on AIME and GPQA, where
SC is a surprisingly strong baseline, and where other strong methods like MoA and Self-MoA fall far
behind SYMBOLIC-MOE. Moreover, we see that, while SC with the top model is the most competitive
setting on AIME and GPQA, it lags behind other baselines on MMLU-Pro and MedMCQA, where
multi-agent baselines perform better. This discrepancy may stem from the broader subject diversity in
MMLU-Pro and MedMCQA, where agent collaboration facilitates better consensus, whereas AIME
is more math-focused and favors individual model performance: the task-best model, QwenR1
7B, already delivers strong solo performance. In light of that, we include QwenR1 7B in Table 1,
which is a powerful model distilled from DeepSeek-R1’s trajectories (DeepSeek-AI et al., 2025a).
While QwenR1 demonstrates exceptional math and code reasoning capabilities (55.93% on AIME),
leading to strong Self-Consistency performance on AIME (67.78%), it struggles to generalize to other
domains such as MedMCQA. This further underscores the need for a robust and flexible framework
like SYMBOLIC-MOE to achieve broad generalization across diverse tasks.
SYMBOLIC-MOE matches strong proprietary models and larger 70B models. In Table 1,
we also find that SYMBOLIC-MOE achieves a similar average performance to models that have
substantially more parameters. For example, SYMBOLIC-MOE outperforms Llama3.3 70B on
AIME and GPQA and roughly matches it on MedMCQA, despite requiring only four 7-8B models
(three for the experts and one for the aggregator). Similarly, SYMBOLIC-MOE outperforms or
matches a number of strong proprietary models on average – for instance, it matches Gemini 1.5 Pro
and outperforms GPT4o-mini, driven in part by sizable gains on AIME and GPQA. These results
underscore that, by drawing from a large pool of experts, SYMBOLIC-MOE enables a smaller set of
models to achieve performance comparable to models with significantly more parameters, especially
when considering a heterogeneous set of tasks like the ones we explore.
Method | # GPUs | Run Time (s)
Sequential Inference | 1 | 196.92
MoA | 1 | 45.98
MoA | 4 | 21.66
SYMBOLIC-MOE | 1 | 25.76
SYMBOLIC-MOE | 4 | 10.85
Table 2: Efficiency comparison of MoA and SYMBOLIC-MOE. We report time per sample, based on
the total time needed to perform inference for the whole GPQA test set (divided by the number of
questions in the test set).
SYMBOLIC-MOE is more efficient. We compare SYMBOLIC-MOE’s efficiency to a naive
implementation of sequential inference and to that of MoA in Table 2. On GPQA, we measure the
average run-time for the 198 examples in the test set. Unsurprisingly, the sequential inference
baseline shows the highest latency, since the model is constantly changing, requiring loading
and offloading it for each instance. We also find that SYMBOLIC-MOE operates in 44% less time
on a single GPU than MoA, which requires discussion, while achieving better accuracy. Moreover,
SYMBOLIC-MOE running on a single GPU takes a similar amount of time to MoA on 4 GPUs,
and when we run SYMBOLIC-MOE on 4 GPUs, it achieves an almost 2× speedup over MoA.
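The two headline efficiency figures follow from the run times in Table 2, as the short check below shows. The helper names are illustrative; the timings are copied from the table.

```python
# Per-sample run times in seconds from Table 2 (GPQA test set), keyed by (method, #GPUs).
run_time = {
    ("Sequential Inference", 1): 196.92,
    ("MoA", 1): 45.98,
    ("MoA", 4): 21.66,
    ("Symbolic-MoE", 1): 25.76,
    ("Symbolic-MoE", 4): 10.85,
}


def time_saved(baseline, method):
    """Fraction of the baseline's run time that the method avoids."""
    return 1.0 - run_time[method] / run_time[baseline]


def speedup(baseline, method):
    """How many times faster the method is than the baseline."""
    return run_time[baseline] / run_time[method]


single_gpu_saving = time_saved(("MoA", 1), ("Symbolic-MoE", 1))
four_gpu_speedup = speedup(("MoA", 4), ("Symbolic-MoE", 4))
print(f"{single_gpu_saving:.0%} less time, {four_gpu_speedup:.2f}x speedup")
```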
SYMBOLIC-MOE achieves efficiency in two key ways:
(1) it employs batched inference to minimize model loading time, and (2) it eliminates round-wise
discussions, which introduce significant overhead at each round. In Table 3 we find that with a reliable
aggregator, skipping inter-agent discussions yields performance comparable to actually conducting
them. Engaging k agents in a one-round discussion requires generating k initial outputs, concatenating
them, and then generating k updated outputs, significantly increasing overhead.
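The batched-inference idea in point (1) can be sketched as follows: group instances by their assigned experts so that each model is loaded once and answers all of its instances in a single batch. This is a simplified illustration, not the authors' implementation; `fake_load_model`, the model names, and the toy assignments are hypothetical stand-ins for real model loading and routing.

```python
from collections import defaultdict


def group_by_expert(assignments):
    """Invert {instance_id: [experts]} into {expert: [instance_ids]}."""
    batches = defaultdict(list)
    for instance_id, experts in assignments.items():
        for expert in experts:
            batches[expert].append(instance_id)
    return dict(batches)


def run_batched(questions, assignments, load_model):
    """Load each expert once and run it on every instance routed to it."""
    outputs = defaultdict(dict)
    num_loads = 0
    for expert, instance_ids in group_by_expert(assignments).items():
        generate = load_model(expert)  # one load per expert, not per instance
        num_loads += 1
        prompts = [questions[i] for i in instance_ids]
        for instance_id, answer in zip(instance_ids, generate(prompts)):
            outputs[instance_id][expert] = answer
    return dict(outputs), num_loads


def fake_load_model(name):
    """Hypothetical loader: returns a function mapping prompts to canned answers."""
    return lambda prompts: [f"{name} answers: {prompt}" for prompt in prompts]


questions = {0: "q0", 1: "q1", 2: "q2", 3: "q3"}
assignments = {
    0: ["model_a", "model_b", "model_c"],
    1: ["model_b", "model_c", "model_d"],
    2: ["model_a", "model_c", "model_d"],
    3: ["model_a", "model_b", "model_d"],
}
outputs, num_loads = run_batched(questions, assignments, fake_load_model)
naive_loads = sum(len(experts) for experts in assignments.values())
print(num_loads, naive_loads)  # 4 12
```

In this toy setting, processing instances one at a time would trigger 12 model loads, while grouping by expert needs only 4, one per distinct model.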
Discuss | Aggr. | MMLU-Pro | GPQA
✓ | Adaptive | 59.07 | 57.01
✗ | Adaptive | 57.12 | 58.01
✓ | Task-best | 57.81 | 57.78
✗ | Task-best | 56.67 | 57.01
✓ | Task-specific | 63.83 | 57.72
✗ | Task-specific | 63.71 | 57.78
Table 3: Comparison of SYMBOLIC-MOE with and without discussion across varying aggregators.
Discussion stabilizes performance with suboptimal aggregators (first four rows) but has little effect
with an optimal one (last two rows).
Like multi-agent discussion baselines, SYMBOLIC-MOE can also operate in a discussion-based
manner. Instead of aggregating initial responses, models first engage in a round of discussion
before submitting final answers to an aggregator. Table 3 evaluates this approach, comparing
the adaptive aggregator (suboptimal) and task-best aggregator (suboptimal), as well as the
task-specific aggregator (optimal) on MMLU-Pro and GPQA. Given the optimal aggregator, we see
that discussion yields marginal gains on MMLU-Pro (63.83 vs. 63.71) and a modest drop on
GPQA (57.72 vs. 57.78). We find that while round-wise discussion can improve performance
incrementally, the final outcome is ultimately determined by the strength of the aggregator.
As a result, SYMBOLIC-MOE has a lower run time than the baselines while maintaining or even
surpassing their performance, as demonstrated in Table 2.
4.3 Analysis
Recruiting Strategy | Acc.
Top-3 Experts | 52.86
Top-5 Experts | 47.68
3 Random Experts | 42.61
5 Random Experts | 44.92
Model Profile (Ours) | 57.78
Table 4: Comparison of different recruiting strategies on GPQA.
Utility of the Model Profile. SYMBOLIC-MOE profiles models based on their skills and leverages
this information to recruit experts effectively. To underscore the importance of this step, we
compare several alternative selection strategies in Table 4, evaluating accuracy on GPQA.
Specifically, we assess performance when selecting the top-k agents overall and when choosing
k agents at random. In the top-k approach, experts are fixed as the best-performing models
for the task, whereas in the random-k strategy, the selected experts vary across instances.
Our results demonstrate that skill-based selection is essential. Notably, although
selecting the top k experts for a task may seem intuitive, it consistently underperforms compared to
SYMBOLIC-MOE’s adaptive instance-level expert selection. Interestingly, top-5 selection performs
worse than top-3 selection, suggesting that a broader selection increases the likelihood of including
weaker models, leading to performance degradation. Additionally, the random selection strategy
consistently harms performance, showing a 12.86% to 15.61% drop compared to SYMBOLIC-MOE,
likely also due to the inclusion of weaker experts. Overall, skill-based instance-level selection
consistently outperforms more coarse-grained approaches, with SYMBOLIC-MOE surpassing MoA by
4.92% as also shown in Table 1. These findings emphasize the value of leveraging performance-based
statistics and individual skill assessments for expert selection.
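To make the contrast between fixed top-k selection and instance-level, skill-based recruiting concrete, the sketch below ranks models by a summed per-skill score for each question. This is a deliberately simplified illustration: the scoring rule, the profile values, and the model and skill names are all assumptions made for the example and should not be read as the exact recruiting formula used by SYMBOLIC-MOE.

```python
def skill_score(profile, required_skills):
    """Sum a model's per-skill scores over the skills a question needs."""
    return sum(profile.get(skill, 0) for skill in required_skills)


def recruit(profiles, required_skills, k=3):
    """Return the k models with the highest summed skill score (ties broken by name)."""
    ranked = sorted(
        profiles,
        key=lambda model: (-skill_score(profiles[model], required_skills), model),
    )
    return ranked[:k]


# Hypothetical profiles: per-skill scores that would be estimated on a validation set.
profiles = {
    "model_a": {"algebra": 5, "genetics": 1, "organic chemistry": 0},
    "model_b": {"algebra": 1, "genetics": 4, "organic chemistry": 3},
    "model_c": {"algebra": 3, "genetics": 0, "organic chemistry": 4},
    "model_d": {"algebra": 0, "genetics": 5, "organic chemistry": 1},
}

math_experts = recruit(profiles, ["algebra"], k=2)
bio_chem_experts = recruit(profiles, ["genetics", "organic chemistry"], k=2)
print(math_experts)      # ['model_a', 'model_c']
print(bio_chem_experts)  # ['model_b', 'model_d']
```

The two questions recruit disjoint sets of experts, which a fixed top-k choice for the whole task cannot do.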
Aggregator | MMLU-Pro | GPQA
Random | 52.29 | 48.92
Adaptive | 57.12 | 58.01
Task-specific | 63.71 | 57.78
Table 5: Ablations on different aggregators in our full setting.
Role and Selection of the Aggregator. Unlike most of our discussion-based multi-agent baselines,
SYMBOLIC-MOE collects a single CoT and answer from each expert and combines them via an
aggregator. Besides the efficiency gain shown in Table 2, we investigate the role the aggregator
plays in our framework. While experts are selected per instance, the aggregator is chosen at
the task level, as we find that reasoning ability does not necessarily translate to effective
aggregation. Table 5 examines the impact of aggregator selection in SYMBOLIC-MOE, comparing
three strategies: (1) a randomly chosen aggregator from the model pool, (2) an instance-specific
top-1 aggregator based on model profiling, and (3) a task-specific aggregator determined by
task-level performance. Evaluated on MMLU-Pro and GPQA, the results indicate that a random
aggregator substantially degrades performance, showing that the aggregator plays a crucial role.
While the instance-specific top-1 aggregator improves outcomes on both datasets, the task-specific
aggregator outperforms it on MMLU-Pro and performs comparably on GPQA. We further find that the
similar performance of instance-specific and task-specific aggregation on GPQA is due to a high
degree of overlap in selected aggregators.
Overall, these findings suggest that a model being a good reasoner does not imply it will be a good
aggregator, supporting our decision to use task-based aggregation.
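One plausible way to realize task-level aggregator selection is to score every candidate by how often its aggregated answer is correct on validation questions and keep the best one for the whole task. The sketch below illustrates this under stated assumptions: the outcome lists and model names are made-up toy data, and the selection criterion is an illustrative reading of "task-level performance" rather than the authors' exact procedure.

```python
def accuracy(outcomes):
    """Fraction of correct (1) outcomes in a list of 0/1 results."""
    return sum(outcomes) / len(outcomes)


def pick_task_aggregator(aggregation_outcomes):
    """Choose the model whose aggregated answers were most often correct."""
    return max(aggregation_outcomes, key=lambda model: accuracy(aggregation_outcomes[model]))


# Hypothetical validation outcomes (1 = correct, 0 = incorrect) for one task.
reasoning_outcomes = {
    "model_a": [1, 1, 1, 0, 1, 1],  # answering questions on its own
    "model_b": [1, 0, 1, 0, 1, 0],
}
aggregation_outcomes = {
    "model_a": [1, 0, 0, 1, 0, 1],  # synthesizing other experts' outputs
    "model_b": [1, 1, 0, 1, 1, 1],
}

best_reasoner = max(reasoning_outcomes, key=lambda model: accuracy(reasoning_outcomes[model]))
best_aggregator = pick_task_aggregator(aggregation_outcomes)
print(best_reasoner, best_aggregator)  # model_a model_b
```

In this toy data the strongest solo reasoner is not the strongest aggregator, mirroring the observation that the two abilities need to be measured separately.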
5 Discussion and Conclusion
A key feature highlighted in Table 1 is the consistency of SYMBOLIC-MOE’s performance. While
baseline methods occasionally do well in isolated settings (e.g., MoA on MMLU-Pro, ReConcile on
MedMCQA), it is important to highlight that no baseline does well consistently across settings. This
means that, without SYMBOLIC-MOE, getting a strong overall result would require evaluating all
the baseline methods and choosing the best settings manually. In contrast to the baselines,
SYMBOLIC-MOE consistently achieves high performance without human intervention. By automatically
selecting discussion agents, SYMBOLIC-MOE provides a consistent recipe that generalizes across domains.
Modularity. Another key advantage of SYMBOLIC-MOE is its modularity. Unlike typical
Mixture-of-Experts (MoE) frameworks, which need to be trained end-to-end from scratch in a
centralized manner, SYMBOLIC-MOE uses the symbolic output channel of existing models to combine
experts. This gradient-free approach enables seamless integration of pre-trained models without
updates, allowing them to be trained independently and in a distributed fashion. Such delegation
enhances domain specialization for each model. Moreover, while standard MoEs have a fixed size
determined before training, SYMBOLIC-MOE can adaptively grow and evolve as models are updated.
Given the rapid advancements in LLMs, cost-effective and efficient updates are essential:
state-of-the-art models are often replaced within months. SYMBOLIC-MOE’s modular and gradient-free
design simplifies these updates, requiring only a few calls to obtain a new model’s agent profile.
Connections to Inference-Time Scaling. Like other LLM-discussion frameworks, SYMBOLIC-MOE
can be seen as a form of multi-model inference-time scaling. Past work (Wang et al., 2023a;
Snell et al., 2024; Shao et al., 2024) has highlighted the benefits of adding inference-time compute,
ranging from simply sampling multiple responses (as done in our SC baselines) to more complex
strategies such as refinement (Madaan et al., 2023a; Chen et al., 2024b). SYMBOLIC-MOE surpasses
several such baselines (Table 1) by adaptively selecting the optimal set of expert models while
avoiding the costly discussion process, reducing overhead while improving performance. Moreover,
our novel batching method enables SYMBOLIC-MOE to run efficiently on a single GPU while
remaining flexible for acceleration with multiple GPUs.
Conclusion. We introduced SYMBOLIC-MOE, a scalable MoE framework that combines models
through their symbolic output (i.e., via natural language discussion). SYMBOLIC-MOE infers which
skills are needed for a given problem and recruits agents based on those skills to engage in a
discussion about a given input, guiding the discussion to reach a better consensus. On four diverse
reasoning datasets, SYMBOLIC-MOE outperforms standard inference-time scaling methods as well
as other debate frameworks and other mixture-of-agents methods, leading to consistently strong
performance across domains without human intervention. SYMBOLIC-MOE’s average performance
across heterogeneous tasks is in fact stronger than that of advanced proprietary models such as
GPT4o-mini. Moreover, unlike past work that requires multiple models to be loaded on separate
GPUs running in parallel, SYMBOLIC-MOE introduces a novel batching strategy that allows us to
run on a single GPU in roughly the same amount of time, obtaining the best of both worlds in terms
of performance and efficiency.
Limitations
Like other multi-agent discussion methods (Du et al., 2023; Chen et al., 2024c), SYMBOLIC-MOE
involves running multiple models, which increases inference cost. This cost can be reduced via
distillation: Chen et al. (2024a) distill multi-agent discussions between a fixed set of agents into a
single model, showing improvements over distilling from single models. This approach could easily
be adapted to distill from a variable set of agents, allowing the student model to benefit from the
routing and skill selection process. We leave distilling from SYMBOLIC-MOE to future work.
SYMBOLIC-MOE also relies on skills inferred from a small validation set to set the agent profiles. In
our experiments, we ensure fair comparisons to the baselines by choosing models for the baselines
according to the same validation set, giving the baselines equal access to the data. Skill inference
relies on the Keyword LLM being sufficiently trained on a given domain to infer relevant skills –
empirically, we find that this is the case across a variety of domains. Overall, SYMBOLIC-MOE will
continue to improve with better skill inference modules, which can easily be swapped in.
Acknowledgements
This work was supported by NSF-CAREER Award 1846185, DARPA ECOLE Program No.
HR00112390060, Microsoft Accelerate Foundation Models Research (AFMR) grant program, NSF-
AI Engage Institute DRL-2112635, National Institutes of Health (NIH) under other transactions
1OT2OD038045-01, and Cisco and Capital One Faculty Awards. The views and conclusions
contained in this document are those of the authors and should not be interpreted as representing official
policies, either expressed or implied, of the NIH or other sponsors.
References
Alhabib Abbas and Yiannis Andreopoulos. Biased mixtures of experts: Enabling computer vision
inference under data transfer limitations. IEEE Transactions on Image Processing, 29:7656–7667,
2020.
AI@Meta. Llama 3.3 model card. 2024. URL
https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md.
Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and
Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv
preprint arXiv:2308.07201, 2023.
Justin Chen, Swarnadeep Saha, Elias Stengel-Eskin, and Mohit Bansal. Magdi: Structured distillation
of multi-agent interaction graphs improves reasoning in smaller language models. In Forty-first
International Conference on Machine Learning, 2024a.
Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, and Mohit Bansal.
Magicore: Multi-agent, iterative, coarse-to-fine refinement for reasoning. arXiv preprint
arXiv:2409.12147, 2024b.
Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. Reconcile: Round-table conference
improves reasoning via consensus among diverse llms, 2024c. URL
https://arxiv.org/abs/2309.13007.
Ke Chen, Lei Xu, and Huisheng Chi. Improved learning algorithms for mixture of experts in
multiclass classification. Neural networks, 12(9):1229–1252, 1999.
Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V.
Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation
model post-training. arXiv preprint arXiv:2501.17161, 2025. URL
https://arxiv.org/abs/2501.17161.
Herbert H Clark. Using language. Cambridge university press, 1996.
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu,
Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning
capability in llms via reinforcement learning, 2025a. URL
https://arxiv.org/abs/2501.12948.
DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang
Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v3
technical report. 2025b. URL https://arxiv.org/abs/2412.19437.
Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving
factuality and reasoning in language models through multiagent debate. arXiv preprint
arXiv:2305.14325, 2023.
David Eigen, Marc’Aurelio Ranzato, and Ilya Sutskever. Learning factored representations in a deep
mixture of experts. arXiv preprint arXiv:1312.4314, 2013.
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter
models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39,
2022. URL http://jmlr.org/papers/v23/21-0998.html.
Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson.
Counterfactual multi-agent policy gradients. In Proceedings of the AAAI conference on artificial
intelligence, volume 32, 2018.
Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. Specializing smaller language
models towards multi-step reasoning. In International Conference on Machine Learning, pp.
10421–10430. PMLR, 2023.
Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, and Ion
Stoica. Mélange: Cost efficient large language model serving by exploiting gpu heterogeneity.
arXiv preprint arXiv:2404.14527, 2024.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob
Steinhardt. Measuring massive multitask language understanding. Proceedings of the International
Conference on Learning Representations (ICLR), 2021.
Namgyu Ho, Laura Schmid, and Se-Young Yun. Large language models are reasoning teachers.
arXiv preprint arXiv:2212.10071, 2022.
Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of
local experts. Neural computation, 3(1):79–87, 1991.
Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, DJ Strouse,
Joel Z Leibo, and Nando De Freitas. Social influence as intrinsic motivation for multi-agent deep
reinforcement learning. In International conference on machine learning, pp. 3040–3049. PMLR,
2019.
Hao Jiang, Ke Zhan, Jianwei Qu, Yongkang Wu, Zhaoye Fei, Xinyu Zhang, Lei Chen, Zhicheng Dou,
Xipeng Qiu, Zikai Guo, et al. Towards more effective and economic sparsely-activated model.
arXiv preprint arXiv:2110.07431, 2021.
Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the em algorithm. Neural
computation, 6(2):181–214, 1994.
Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin, Minh-Thang
Luong, and Orhan Firat. Beyond distillation: Task-level mixture-of-experts for efficient inference.
In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Findings
of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana,
Dominican Republic, 16-20 November, 2021, pp. 3577–3599. Association for Computational
Linguistics, 2021. doi: 10.18653/v1/2021.findings-emnlp.304. URL
https://doi.org/10.18653/v1/2021.findings-emnlp.304.
Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, and Percy Liang.
Fine-tuning can distort pretrained features and underperform out-of-distribution. In International
Conference on Learning Representations, 2022. URL
https://openreview.net/forum?id=UYneFzXSJWh.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E.
Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model
serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating
Systems Principles, 2023.
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang,
Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional
computation and automatic sharding. In 9th International Conference on Learning Representations,
ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL
https://openreview.net/forum?id=qrwe7XHTmYb.
Baolin Li, Yankai Jiang, Vijay Gadepally, and Devesh Tiwari. Llm inference serving: Survey of
recent advances and opportunities. arXiv preprint arXiv:2407.12391, 2024a.
Dawei Li, Zhen Tan, Peijia Qian, Yifan Li, Kumar Satvik Chaudhary, Lijie Hu, and Jiayi Shen.
Smoa: Improving multi-agent large language models with sparse mixture-of-agents. arXiv preprint
arXiv:2411.03284, 2024b.
Wenzhe Li, Yong Lin, Mengzhou Xia, and Chi Jin. Rethinking mixture-of-agents: Is mixing different
large language models beneficial? arXiv preprint arXiv:2502.00674, 2025.
Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi,
and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent
debate. arXiv preprint arXiv:2305.19118, 2023.
Elita Lobo, Chirag Agarwal, and Himabindu Lakkaraju. On the impact of fine-tuning on
chain-of-thought reasoning. arXiv preprint arXiv:2411.15382, 2024.
Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent
actor-critic for mixed cooperative-competitive environments. Advances in neural information
processing systems, 30, 2017.
Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng,
Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical
reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583,
2023.
MAA. American invitational mathematics examination - aime. In American Invitational
Mathematics Examination, February 2024. URL
https://maa.org/math-competitions/american-invitational-mathematics-examination-aime.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe,
Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta,
Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh,
and Peter Clark. Self-refine: Iterative refinement with self-feedback. In
NeurIPS, 2023a. URL
http://papers.nips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon,
Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder,
Shashank Gupta, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with
self-feedback, 2023b.
Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn.
Teaching small language models to reason. arXiv preprint arXiv:2212.08410, 2022.
Brando Miranda, Alycia Lee, Sudharsan Sundar, and Sanmi Koyejo. Beyond scale: the diversity
coefficient as a data quality metric demonstrates LLMs are pre-trained on formally diverse data,
2024. URL https://openreview.net/forum?id=506Sxc0Adp.
OpenAI. Gpt-4o mini: advancing cost-efficient intelligence, 2024. URL
https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/.
Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale
multi-subject multi-choice dataset for medical domain question answering. In Conference on
health, inference, and learning, pp. 248–260. PMLR, 2022.
Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan
Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang,
Jianxin Yang, Jiaxi Yang, Jingren Zhou, et al. Qwen2.5 technical report. 2025. URL
https://arxiv.org/abs/2412.15115.
Nils Reimers and Iryna Gurevych. Making monolingual sentence embeddings multilingual using
knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing. Association for Computational Linguistics, November 2020. URL
https://arxiv.org/abs/2004.09813.
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani,
Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark,
2023. URL https://arxiv.org/abs/2311.12022.
Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André
Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. In
Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman
Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on
Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp.
8583–8595, 2021. URL
https://proceedings.neurips.cc/paper/2021/hash/48237d9f2dea8c74c2a72126cf63d933-Abstract.html.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang,
Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical
reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and
Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv
preprint arXiv:1701.06538, 2017a.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton,
and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts
layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France,
April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017b. URL
https://openreview.net/forum?id=B1ckMDqlg.
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally
can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314 , 2024.
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer,
et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.
URL https://arxiv.org/abs/2403.05530.
Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.
github.io/blog/qwen2.5/ .
Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances
large language model capabilities. arXiv preprint arXiv:2406.04692 , 2024a.
Qineng Wang, Zihao Wang, Ying Su, Hanghang Tong, and Yangqiu Song. Rethinking the bounds
of LLM reasoning: Are multi-agent discussions the key? In Lun-Wei Ku, Andre Martins,
and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers) , pp. 6106–6131, Bangkok, Thailand, August
2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.331. URL
https://aclanthology.org/2024.acl-long.331/ .
Xin Wang, Fisher Yu, Lisa Dunlap, Yi-An Ma, Ruth Wang, Azalia Mirhoseini, Trevor Darrell, and
Joseph E Gonzalez. Deep mixture of experts via shallow embedding. In Uncertainty in artificial
intelligence , pp. 552–562. PMLR, 2020.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha
Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language
models. In ICLR . OpenReview.net, 2023a. URL https://openreview.net/pdf?id=
1PL1NIMMrw .
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha
Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language
models. In The Eleventh International Conference on Learning Representations , 2023b. URL
https://openreview.net/forum?id=1PL1NIMMrw .
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming
Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi
Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A More Robust and Challenging Multi-Task
Language Understanding Benchmark. In The Thirty-eight Conference on Neural Information
Processing Systems Datasets and Benchmarks Track , November 2024c.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny
Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in
neural information processing systems , 35:24824–24837, 2022.
Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, and Bing Qin. Examining inter-consistency of large
language models collaboration: An in-depth analysis via debate. arXiv preprint arXiv:2305.11595 ,
2023.
Enwei Xu, Wei Wang, and Qingxia Wang. The effectiveness of collaborative problem solving in
promoting students’ critical thinking: A meta-analysis based on empirical literature. Humanities
and Social Sciences Communications, 10(1):16, 2023. ISSN 2662-9992. doi: 10.1057/s41599-023-01508-1.
URL https://doi.org/10.1057/s41599-023-01508-1.
An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu,
Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu,
Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical
expert model via self-improvement. arXiv preprint arXiv:2409.12122 , 2024. URL https:
//arxiv.org/abs/2409.12122 .
Brandon Yang, Gabriel Bender, Quoc V Le, and Jiquan Ngiam. Condconv: Conditionally parameter-
ized convolutions for efficient inference. Advances in Neural Information Processing Systems , 32,
2019.
W. Quin Yow and Tony Zhao Ming Lim. Sharing the same languages helps us work better together.
Palgrave Communications , 5(1):154, 2019. ISSN 2055-1045. doi: 10.1057/s41599-019-0365-z.
URL https://doi.org/10.1057/s41599-019-0365-z.
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo
Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for
large language models. arXiv preprint arXiv:2309.12284 , 2023.
Seniha Esen Yuksel, Joseph N. Wilson, and Paul D. Gader. Twenty years of mixture of experts.
IEEE Transactions on Neural Networks and Learning Systems , 23(8):1177–1193, 2012. doi:
10.1109/TNNLS.2012.2200299.
Sukwon Yun, Inyoung Choi, Jie Peng, Yangfan Wu, Jingxuan Bao, Qiyiwen Zhang, Jiayi Xin,
Qi Long, and Tianlong Chen. Flex-moe: Modeling arbitrary modality combination via the flexible
mixture-of-experts. arXiv preprint arXiv:2410.08245 , 2024.
Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. Moefication:
Conditional computation of transformer models for efficient inference. arXiv preprint arXiv:2110.01786,
2021.
Simiao Zuo, Xiaodong Liu, Jian Jiao, Young Jin Kim, Hany Hassan, Ruofei Zhang, Jianfeng Gao,
and Tuo Zhao. Taming sparsely activated transformer with stochastic experts. In International
Conference on Learning Representations , 2022. URL https://openreview.net/forum?
id=B72HXs80q4 .
Appendix
A Dataset Statistics and Licenses
We provide the sample sizes and licenses of the datasets used in this work in Table 6. All the datasets
are in English, and all datasets are used in a fashion consistent with their intended use.
Dataset                         Validation Size   Test Size   License
MMLU-Pro (Wang et al., 2024c)   350               2,100       Apache License
AIME (MAA, 2024)                354               30          CC0
GPQA (Rein et al., 2023)        249               198         MIT License
MedMCQA (Pal et al., 2022)      504               4,183       MIT License
Table 6: The statistics and licenses of the datasets we use in this work.
B Model Pool
Model Name Size Huggingface Link
BioLlama 8B ContactDoctor/Bio-Medical-Llama-3-8B
DeepSeekMath 7B deepseek-ai/deepseek-math-7b-instruct
Exaone 7.8B LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct
Gemma2 9B google/gemma-2-9b-it
GLM4 9B THUDM/glm-4-9b
Granite 8B ibm-granite/granite-3.1-8b-instruct
InternLM3 8B internlm/internlm3-8b-instruct
Llama3.1 8B meta-llama/Llama-3.1-8B-Instruct
LlamaR1 8B deepseek-ai/DeepSeek-R1-Distill-Llama-8B
Mathstral 7B mistralai/Mathstral-7B-v0.1
Mistral 12B mistralai/Mistral-Nemo-Instruct-2407
Phi3.5-mini 3.5B microsoft/Phi-3.5-mini-instruct
Qwen2.5 7B Qwen/Qwen2.5-7B-Instruct
Qwen2.5-Coder 7B Qwen/Qwen2.5-Coder-7B-Instruct
Qwen2.5-Math 7B Qwen/Qwen2.5-Math-7B-Instruct
QwenR1 7B deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
Table 7: The models constituting the model pool.
C Performance on the Validation Set
Table 8 shows the performance of each model on the validation set. We highlight the top-1 and
top-3 models in bold font and yellow background, respectively. This information is also used for the
baselines we compare against in Table 1.
D Performance of Each Model as an Aggregator
Table 9 shows the performance of each model when acting as an aggregator. Note that the best-
performing model in Table 8 can be different from the best aggregator model in Table 9, motivating
us to choose the aggregator based on this synthetic task described in Section 3.2.2.
E Distribution of Experts
We present the distribution of recruited experts across different datasets in Fig. 4. As noted in
Section 3.2.1, we trim experts with occurrences below 5% to reduce model loading time. This is
Model MMLU-Pro AIME GPQA MedMCQA
BioLlama 37.71 0.85 27.31 42.86
DeepSeekMath 32.57 3.32 28.11 35.71
Exaone 52.29 25.99 32.13 56.35
Gemma 53.71 7.73 36.95 64.29
GLM 50.29 7.37 30.92 58.33
Granite 43.43 5.92 34.14 56.15
InternLM 43.14 7.91 36.14 55.56
Llama 46.00 6.78 33.73 66.87
LlamaR1 54.29 51.98 56.22 53.37
Mathstral 34.57 3.11 36.55 52.38
Mistral 45.14 1.41 33.73 46.43
Phi 46.57 1.41 47.79 65.87
Qwen 54.00 13.56 37.35 67.06
QwenCode 46.29 9.89 30.52 50.79
QwenMath 31.71 11.13 28.51 36.90
QwenR1 53.43 57.06 51.41 37.90
Table 8: Comparison of model performance on the validation set. The best model on each task is
bolded, and the top 3 models on each task are highlighted in yellow.
Model MMLU-Pro AIME GPQA MedMCQA
BioLlama 37.31 21.47 30.12 42.46
DeepSeekMath 32.57 5.37 21.69 35.71
Exaone 57.43 47.92 35.34 56.35
Gemma 49.71 3.11 31.73 53.37
GLM 52.57 26.27 35.34 51.39
Granite 48.86 36.44 38.96 56.15
InternLM 55.14 16.95 42.57 51.59
Llama 51.14 11.86 40.56 50.60
LlamaR1 59.71 53.67 46.18 49.01
Mathstral 41.71 26.27 35.74 46.43
Mistral 48.00 18.93 33.33 46.43
Phi 27.71 9.04 26.10 25.40
Qwen 56.86 38.14 39.36 53.37
QwenCode 51.14 29.66 38.96 50.79
QwenMath 31.71 5.93 16.06 36.90
QwenR1 58.00 57.63 48.59 45.44
Table 9: Performance of each model when used as an aggregator, on the validation set. The best
model on each task is bolded, and is selected as the task-specific aggregator.
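The per-task aggregator choice described in the caption of Table 9 amounts to an argmax over each column. The following is a minimal sketch of that selection, assuming a nested dictionary of scores; the name `select_aggregators` and the data layout are illustrative and not taken from the paper's code. Only four rows of Table 9 are copied here to keep the example short.

```python
from typing import Dict

# A subset of Table 9 (aggregator accuracy on the validation set).
TABLE9_SUBSET: Dict[str, Dict[str, float]] = {
    "Exaone":  {"MMLU-Pro": 57.43, "AIME": 47.92, "GPQA": 35.34, "MedMCQA": 56.35},
    "LlamaR1": {"MMLU-Pro": 59.71, "AIME": 53.67, "GPQA": 46.18, "MedMCQA": 49.01},
    "Qwen":    {"MMLU-Pro": 56.86, "AIME": 38.14, "GPQA": 39.36, "MedMCQA": 53.37},
    "QwenR1":  {"MMLU-Pro": 58.00, "AIME": 57.63, "GPQA": 48.59, "MedMCQA": 45.44},
}


def select_aggregators(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Return, for each task, the model with the highest aggregator score."""
    tasks = next(iter(scores.values())).keys()
    return {task: max(scores, key=lambda m: scores[m][task]) for task in tasks}


aggregators = select_aggregators(TABLE9_SUBSET)
```

On this subset the selection is LlamaR1 for MMLU-Pro, QwenR1 for AIME and GPQA, and Exaone for MedMCQA, which matches the column maxima of the full table.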
[Figure 4: pie charts of the recruited experts per dataset. Labeled slices are listed below.]
Before trimming (top row):
MMLU-Pro: Gemma 21.5%, QwenR1 18.6%, LlamaR1 14.0%, Qwen 12.4%, GLM 8.7%, Exaone 7.2%
AIME: QwenR1 60.0%, LlamaR1 40.0%
GPQA: LlamaR1 54.9%, QwenR1 31.5%
MedMCQA: Qwen 23.2%, Llama 20.9%, Phi 18.2%, Gemma 18.2%, GLM 5.1%
After trimming and re-sampling (bottom row):
MMLU-Pro: Gemma 25.9%, QwenR1 22.7%, LlamaR1 17.0%, Qwen 15.2%, GLM 10.6%, Exaone 8.7%
AIME: QwenR1 60.0%, LlamaR1 40.0%
GPQA: LlamaR1 64.1%, QwenR1 35.9%
MedMCQA: Qwen 26.9%, Llama 24.5%, Phi 21.4%, Gemma 21.1%, GLM 6.1%
Figure 4: Distribution of the recruited experts across datasets. Top row: the distribution before
trimming. Bottom row: the distribution after trimming and re-sampling.
visualized in Fig. 4, where the top row shows the distribution before trimming, and the bottom row
shows the distribution after trimming. The distribution varies significantly across datasets – on more
diverse datasets such as MMLU-Pro, the recruited experts are also more varied. In contrast, for AIME
and GPQA, which focus more on math and science, the recruited experts are dominated by a few
models.
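To make the trimming step concrete, here is a minimal sketch that drops experts whose share of recruitments falls below 5% and renormalizes the remaining shares. It is an approximation under stated assumptions: the paper re-samples after trimming, whereas this sketch simply renormalizes, and the function name `trim_and_renormalize` and the example counts are hypothetical.

```python
from typing import Dict


def trim_and_renormalize(counts: Dict[str, int], threshold: float = 0.05) -> Dict[str, float]:
    """Drop experts recruited in fewer than `threshold` of all cases, then renormalize.

    Assumption: renormalization stands in for the re-sampling step used in the paper.
    """
    total = sum(counts.values())
    kept = {model: c for model, c in counts.items() if c / total >= threshold}
    kept_total = sum(kept.values())
    return {model: c / kept_total for model, c in kept.items()}


# Hypothetical recruitment counts for one dataset.
shares = trim_and_renormalize({"ModelA": 60, "ModelB": 37, "ModelC": 3})
```

In this toy example ModelC accounts for 3% of recruitments and is removed, so only two models need to be loaded at inference time.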
F Algorithm
We provide the algorithm for our batched inference strategy in Algorithm 1.
Algorithm 1 BatchedInference
Require: Test samples Q, Model pool M
Ensure: Inference results for all samples
expert_sample_map ← ∅                                   ▷ Expert-to-samples mapping
for q ∈ Q do
    E_q^(1), E_q^(2), ..., E_q^(k) ← RECRUITEXPERTS(q, M)   ▷ Select k experts per sample (§3.3.1)
    for e ∈ E_q do
        expert_sample_map[e] ← expert_sample_map[e] ∪ {q}
    end for
end for

results ← ∅                                             ▷ Results collection
for (e, q_e) ∈ expert_sample_map do
    results ← results ∪ e.GENERATE(q_e)                 ▷ Batch inference per expert
end for
return results
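The two passes of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: `recruit_experts` and `load_and_generate` are hypothetical callables standing in for the skill-based recruiter and for loading a model and running it on a batch. Unlike the set union in Algorithm 1, the lists here keep duplicates if the same expert is recruited twice for one question.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def batched_inference(
    questions: Sequence[str],
    recruit_experts: Callable[[str], List[str]],
    load_and_generate: Callable[[str, List[str]], List[str]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Group questions by assigned expert so that each expert is invoked once."""
    # Pass 1: build the expert-to-samples mapping.
    expert_sample_map: Dict[str, List[str]] = defaultdict(list)
    for question in questions:
        for expert in recruit_experts(question):
            expert_sample_map[expert].append(question)

    # Pass 2: one batched call per expert, so each model is loaded a single time.
    results: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for expert, batch in expert_sample_map.items():
        outputs = load_and_generate(expert, batch)
        for question, output in zip(batch, outputs):
            results[question].append((expert, output))
    return dict(results)
```

The number of model loads equals the number of distinct recruited experts rather than the number of (question, expert) pairs, which is the source of the savings described in the paper.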
G Prompts
Prompt for the Keyword LLM to Generate Keywords
Question: {question}
What are the core knowledge, subjects or skills needed to solve this problem? List 2-5 keywords
separated in comma. Example keywords: psychology, virology, behavioral theory, microbiology,
diplomacy, political science, property law, finance, business. Give ONLY the keywords, no
other words or explanation.
Follow this format: Keywords: <keyword1>, <keyword2>...
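A response that follows the requested format can be turned into a keyword list with a few lines of string handling. The sketch below is one possible parser, not the paper's code; the function name `parse_keywords` and the choice to lowercase keywords are assumptions.

```python
from typing import List


def parse_keywords(response: str) -> List[str]:
    """Extract the comma-separated keywords that follow 'Keywords:' in a response."""
    _, found, tail = response.partition("Keywords:")
    tail = tail.strip()
    if not found or not tail:
        return []
    first_line = tail.splitlines()[0]
    # Lowercasing is an assumption made here to ease later matching.
    return [kw.strip().lower() for kw in first_line.split(",") if kw.strip()]


keywords = parse_keywords("Keywords: Algebra, number theory , ")
```

Returning an empty list for malformed responses lets the caller decide whether to retry the keyword prompt or fall back to a default.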
Prompt for Zero-shot Chain-of-Thought Generation (Multiple Choice)
Question: {question}
Provide your step-by-step reasoning first, and then print “The answer is (X)” where X is the
answer choice (one capital letter), at the end of your response.
Prompt for Zero-shot Chain-of-Thought Generation (Math)
Question: {question}
Provide your step-by-step reasoning first, and then print “The answer is \\boxed{X}”, where X
is the final answer, at the end of your response.
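Both chain-of-thought prompts fix the final line of the response, which makes answers easy to extract with regular expressions. The following is a minimal sketch under that assumption; the helper names are hypothetical, and the boxed pattern does not handle nested braces such as a fraction inside `\boxed{}`.

```python
import re
from typing import Optional

# Matches 'The answer is (X)' with a single capital letter.
_CHOICE_PATTERN = re.compile(r"The answer is \(([A-Z])\)")
# Matches '\boxed{...}' without nested braces.
_BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_choice(response: str) -> Optional[str]:
    """Return the last multiple-choice letter stated in the response, if any."""
    matches = _CHOICE_PATTERN.findall(response)
    return matches[-1] if matches else None


def extract_boxed(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = _BOXED_PATTERN.findall(response)
    return matches[-1].strip() if matches else None
```

Taking the last match is a design choice: models sometimes state a tentative answer during their reasoning and revise it at the end.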
Prompt for the Aggregator (Wang et al., 2024a)
You have been provided with a set of responses from various open-source models to the latest
user query. Your task is to synthesize these responses into a single, high-quality response. It
is crucial to critically evaluate the information provided in these responses, recognizing that
some of it may be biased or incorrect. Your response should not simply replicate the given
answers but should offer a refined, accurate, and comprehensive reply to the instruction. Ensure
your response is well-structured, coherent, and adheres to the highest standards of accuracy and
reliability.
Responses from models:
{model 1 response}
{model 2 response}
{model 3 response}
Question: {question}
Provide your step-by-step reasoning first, and then print “The answer is (X)” where X is the
answer choice (one capital letter), at the end of your response.
H SYMBOLIC-MOE as a Sparse Mixture-of-Experts
In the Sparse Mixture-of-Experts (SMoE) framework (Shazeer et al., 2017a), a trainable router
dynamically selects a subset of experts for each input. Formally, given an input $x$, the output $y$ of an
SMoE layer is computed as:
\[
y = \sum_{i=1}^{k} R(x)_i \cdot f_i(x), \qquad
R(x) = \mathrm{softmax}\big(\text{Top-K}(g(x), k)\big) \tag{1}
\]
where $f_i(x)$ represents the response of the $i$-th expert, and $R(x)$ is a trainable router that assigns
selection probabilities to each expert based on $g(x)$, typically a small feedforward network (Shazeer
et al., 2017b; Riquelme et al., 2021). The Top-K operation retains only the top $k$ experts, setting the
probabilities of the others to zero after the softmax operation.
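For readers less familiar with sparse gating, the following is a small pure-Python sketch of the routing in Eq. (1): keep the $k$ largest gate logits, apply a softmax over them, and give every other expert zero weight. It uses scalar toy experts and is meant only to illustrate the computation; the function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def top_k_softmax(logits: Sequence[float], k: int) -> List[float]:
    """Softmax over the k largest logits; all other positions get weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - max_logit) for i in top}
    normalizer = sum(exps.values())
    return [exps[i] / normalizer if i in exps else 0.0 for i in range(len(logits))]


def smoe_output(
    x: float,
    experts: Sequence[Callable[[float], float]],
    gate: Callable[[float], Sequence[float]],
    k: int,
) -> float:
    """Weighted sum of the selected experts' outputs, as in Eq. (1)."""
    weights = top_k_softmax(gate(x), k)
    # Experts with zero weight are skipped, so only k experts are evaluated.
    return sum(w * f(x) for w, f in zip(weights, experts) if w > 0.0)
```

Skipping zero-weight experts is what makes the layer sparse: compute scales with $k$ rather than with the total number of experts.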
However, directly applying SMoE in our framework presents key challenges. Unlike SMoE, our
method operates in a symbolic, text-based space and is designed for test-time inference, meaning that
we do not rely on a trainable router to learn expert selection, nor do the experts in our method refer to
model parameters. Instead, we introduce a skill-based routing mechanism to select relevant experts
based on predefined competencies rather than learned gating functions. Formally, our aggregation
process can be expressed as:
\[
\begin{aligned}
y &= A^{*}\Big( \mathop{\Vert}_{i=1}^{k} \, y^{(i)} \Big), \\
y^{(i)} &= E^{(i)}(x) \quad \forall i \in \{1, 2, \ldots, k\}, \\
E^{(i)} &\sim \mathrm{Categorical}\big(w^{(1)}, w^{(2)}, \ldots, w^{(n)}\big) \quad \forall i \le k
\end{aligned}
\tag{2}
\]
where $A^{*}$ is the aggregator model determined via the validation set, and $\Vert$ denotes the concatenation of
the experts' responses $y^{(\cdot)}$. Here, $y^{(j)}$ represents the response of expert $j$ to
an input $x$, defined as $E^{(j)}(x)$. Each expert $E^{(i)}$, $\forall i \le k$, is selected by our proposed skill-based
routing strategy (Section 3.3.1). In short, we construct model profiles using a validation set to evaluate
each model's specialization across different skills. This allows us to estimate a probability distribution
$w^{(j)}$ over the $n$ models in the pool based on both their suitability for the required skills and their global
competence relative to other experts.
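Eq. (2) can be read as a three-step procedure: sample $k$ experts from the categorical distribution, collect their responses, and pass the concatenation to the aggregator. The sketch below illustrates this with plain Python callables in place of LLMs. All names are hypothetical, and joining responses with newlines is an assumption about how the concatenation is realized.

```python
import random
from typing import Callable, Dict


def symbolic_moe(
    x: str,
    experts: Dict[str, Callable[[str], str]],
    weights: Dict[str, float],
    aggregator: Callable[[str, str], str],
    k: int,
    rng: random.Random,
) -> str:
    """Sample k experts by weight, concatenate their responses, and aggregate."""
    names = list(experts)
    # Independent draws from Categorical(w), so the same expert may be chosen twice.
    recruited = rng.choices(names, weights=[weights[n] for n in names], k=k)
    responses = [experts[name](x) for name in recruited]
    concatenated = "\n".join(responses)  # assumed realization of the concatenation
    return aggregator(x, concatenated)
```

Passing a seeded `random.Random` instance keeps expert recruitment reproducible across runs.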
This skill-based routing framework retains the core benefits of SMoE while removing the reliance on
a trainable gating mechanism. Specifically, the aggregator model $A^{*}$ in SYMBOLIC-MOE plays a role
analogous to the weighted sum ($\sum$) operation in SMoE, synthesizing outputs from the selected experts.
Likewise, recruiting the experts $E^{(i)}$ corresponds to the Top-K operation in SMoE, ensuring that only
the most relevant and specialized experts contribute to the final output. We thus inherit the key conceptual
benefits of SMoE, namely dynamic expert selection and response aggregation, while also introducing
additional advantages: SYMBOLIC-MOE is gradient-free, eliminating the need for retraining, and is
entirely automatic, leveraging a large pool of pre-trained models to deliver better performance.