loader
Generating audio...

arxiv

Paper 2503.05641

Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning

Authors: Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal

Published: 2025-03-07

Abstract:

Combining existing pre-trained expert LLMs is a promising avenue for scalably tackling large-scale and diverse tasks. However, selecting experts at the task level is often too coarse-grained, as heterogeneous tasks may require different expertise for each instance. To enable adaptive instance-level mixing of pre-trained LLM experts, we propose Symbolic-MoE, a symbolic, text-based, and gradient-free Mixture-of-Experts framework. Symbolic-MoE takes a fine-grained approach to selection by emphasizing skills, e.g., algebra in math or molecular biology in biomedical reasoning. We propose a skill-based recruiting strategy that dynamically selects the most relevant set of expert LLMs for diverse reasoning tasks based on their strengths. Each selected expert then generates its own reasoning, resulting in k outputs from k experts, which are then synthesized into a final high-quality response by an aggregator chosen based on its ability to integrate diverse reasoning outputs. We show that Symbolic-MoE's instance-level expert selection improves performance by a large margin but -- when implemented naively -- can introduce a high computational overhead due to the need for constant model loading and offloading. To address this, we implement a batch inference strategy that groups instances based on their assigned experts, loading each model only once. This allows us to integrate 16 expert models on 1 GPU with a time cost comparable to or better than prior multi-agent baselines using 4 GPUs. Through extensive evaluations on diverse benchmarks (MMLU-Pro, GPQA, AIME, and MedMCQA), we demonstrate that Symbolic-MoE outperforms strong LLMs like GPT4o-mini, as well as multi-agent approaches, with an absolute average improvement of 8.15% over the best multi-agent baseline. Moreover, Symbolic-MoE removes the need for expensive multi-round discussions, outperforming discussion baselines with less computation.

Paper Content:
Page 1: Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning Justin Chih-Yao Chen∗Sukwon Yun∗Elias Stengel-Eskin∗ Tianlong Chen Mohit Bansal UNC Chapel Hill https://symbolic-moe.github.io Abstract Combining existing pre-trained expert LLMs is a promising avenue for scalably tackling large-scale and diverse tasks. However, selecting experts at the task level is often too coarse-grained, as heterogeneous tasks may require different expertise for each instance. To enable adaptive instance-level mixing of pre-trained LLM experts, we propose SYMBOLIC -MOE, a symbolic, text-based, and gradient-free Mixture-of-Experts framework. SYMBOLIC -MOEtakes a fine-grained approach to selection by emphasizing skills, i.e., specialized subcategories or subtopics such as algebra in mathematics or molecular biology in biomedical reasoning. We propose a skill-based recruiting strategy that dynamically selects the most relevant set of expert LLMs for diverse reasoning tasks based on their strengths. Each selected expert then generates its own reasoning, resulting in koutputs from kexperts, which are then synthesized into a final high-quality response by an aggregator. The aggregator is chosen based on its ability to integrate diverse reasoning outputs. We show that instance-level expert selection improves performance by a large margin but – when implemented naively – can introduce a high computational overhead due to the need for constant model loading and offloading. To address this, we implement a batch inference strategy that groups instances based on their assigned experts, ensuring each model will only be loaded once. This allows us to integrate 16 models on a single GPU with a time cost comparable to prior multi-agent baselines using 4 GPUs. Through extensive evaluations on diverse benchmarks (MMLU-Pro, GPQA, AIME, and MedMCQA), we demonstrate that SYMBOLIC -MOEoutperforms strong LLMs like GPT4o-mini, as well as multi- agent approaches, with an absolute average improvement of 8.15% over the best multi-agent baseline. Moreover, SYMBOLIC -MOEremoves the need for expensive multi-round discussions, outperforming discussion baselines with less computation. 1 Introduction A core strength of humans is our ability to communicate and coordinate with each other using language (Clark, 1996; Yow & Lim, 2019; Xu et al., 2023). This allows diverse experts to contribute specialized knowledge towards solving a problem, and is common across a variety of settings, including research, medicine, and engineering. Like humans, large language models (LLMs) often have differing skills and strengths, derived from differences in their architectures and training regimens. For instance, math-specific models like MetaMath (Yu et al., 2023), WizardMath (Luo et al., 2023), and QwenMath (Yang et al., 2024) are post-trained with mathematical reasoning data, making them particularly adept at math tasks – often at the cost of performance on out-of-distribution tasks (Kumar et al., 2022; Chu et al., 2025) like commonsense or medical reasoning (Lobo et al., 2024). Even within specialized domains, differences in pre-training data can lead to nuanced variations in expertise: one math-focused model may excel at algebra, while another is better suited for geometry. This motivates our development of an automated, skill-based framework designed to identify and select the most ∗Equal contribution. 1arXiv:2503.05641v2 [cs.CL] 11 Mar 2025 Page 2: suitable set of expert models for a given problem.1Allowing multiple diverse models to combine their answers can improve even the strongest models (Chen et al., 2024c), and as tasks grow in complexity, leveraging specialized models at test time presents a promising approach to scaling LLM capabilities. Indeed, combining multiple “expert” models, i.e., Mixture-of-Experts (MoE) is a well-studied notion in machine learning (Jacobs et al., 1991; Eigen et al., 2013) and has been applied widely for large pre-trained models, enabling better performance at a lower computing cost (Shazeer et al., 2017a; Fedus et al., 2022; Riquelme et al., 2021). However, in the conventional MoE settings, experts are typically sub-models, i.e. subsets of parameters within a larger model, trained simultaneously in an end-to-end fashion, where at test time, experts are combined in the model’s parameter space. This generally requires end-to-end training from scratch, which is often computationally expensive and precludes the re-use of the vast pool of existing already-trained LLMs. Building on recent efforts in combining a fixed set of models through multi-agent discussions (Chen et al., 2024c; Du et al., 2023; Liang et al., 2023; Wang et al., 2024a), we propose exploring a new training-free paradigm for large-scale MoEs: a symbolic mixture of experts ( SYMBOLIC -MOE). Rather than using information encoded in the model’s hidden state, SYMBOLIC -MOEuses symbolic structures in two ways: First, SYMBOLIC -MOEinfers a set of discrete skills needed to solve a problem, measuring the abilities of each model in a pool of candidate expert models according to the set. It then uses skill-based performance as a “router” to recruit a sparse subset of experts for each problem . Secondly, SYMBOLIC -MOEcombines pretrained experts through a symbolic channel, i.e., language, which is a common protocol already shared by all models involved. To take advantage of the diverse set of expert LLMs, two key challenges present: (1) Effective Expert Selection :Given a large set of existing LLMs, how can we choose the best experts for each example? (2) Scalable Expert Mixing : How can we scalably serve a large number of experts without increasing the demand for GPUs? Effective Expert Selection: The increasing diversity of benchmarks (Miranda et al., 2024) and the growing number of models means that experts must be selected not at the level of tasks, but at the level of individual queries. Even at the task level, manual selection can be labor-intensive, and the performance of multi-agent frameworks can be sensitive to the agents chosen for a task (Chen et al., 2024c; Liang et al., 2023; Wang et al., 2024b). At the query level, manual selection becomes infeasible, especially in multi-model settings where different experts must collaborate to address complex queries (Wang et al., 2024c; Rein et al., 2023). Moreover, manually selecting models for each task does not guarantee optimal performance. For instance, as shown in Fig. 1 (a), while a given subset of models may perform well on math tasks on average, their proficiency in specific subfields like algebra or probability might vary. Therefore, using of a fixed subset of models on all math samples might hurt performance on particular subtasks. This underscores the need for an automated, fine-grained selection mechanism as shown in Fig. 1 (b). Given that evaluating all possible model subsets is computationally infeasible, we instead propose an efficient approach that activates only the most relevant experts while minimizing computational overhead. Scalable Expert Mixing: Unless models are retrained jointly, the integration of multiple models must occur through their outputs – typically via a symbolic channel, such as multi-agent discussion. Past work has often relied on multiple rounds of inference, leading to significant GPU demands. As shown in Fig. 3 (I), when selecting a fixed set of models for a given domain (e.g. as done by (Li et al., 2024b; 2025)), models could in principle be loaded onto parallel GPUs. This reduces the latency of the system but increases the number of GPUs needed. Moreover, it does not scale to a dynamic setting like the one we consider, where the number of GPUs required would be equal to the number of potential models available (in our case, 16), making this option prohibitively expensive. Given kmodels and nqueries, this solution requires kGPUs. Alternatively, models could be run sequentially on a single GPU (cf. Fig. 3 (II)), which minimizes GPU usage but incurs high overhead due to frequent loading and off-loading of models. In other words, this solution requires O(kn)loads. Given the high cost of loading models into GPU memory (Griggs et al., 2024; Li et al., 2024a), this solution is also untenable. Processing each query separately also fails to fully utilize GPU memory, further limiting its efficiency. Driven by these motivations, in SYMBOLIC -MOE, we achieve effective expert selection and scalable expert mixing through the following approach. First, to enhance expert selection, we introduce automated skill-based recruiting , which leverages an inferred skill-based model profile – a dictionary mapping the most relevant models to specific skills (e.g., keywords or subcategories) required to 1We refer experts to those LLMs that are selected to solve a problem in this work. 2 Page 3: Router CoTCoTCoTAggregator“Algebra” Experts for Q1 CoTCoTCoTAggregator“Probability” Experts for Q2 (b) Symbolic-MoE (Ours)(a) Existing Multi-Agent Work CoTCoTCoT Fixed Experts for MATH Task-Level Recruitment ResourceIntensive CoTCoTCoT CoTCoTCoT Fixed Experts for MATHCoTCoTCoTSkill: “Algebra” Skill: “Probability” MATH Q1. MATH Q2. MATH Q1. MATH Q2. Instance-Level RecruitmentComputationallyEfficient ……A1. A2. A1. A2. A quadratic function!"="!+%"+& has roots at "=3 and "=8.... what is the value of &?A biased coin has a probability ) of landing on heads. ..., what is the value of )?Given that the quadratic function has roots at ... the value of& is “12”. The probability of at least one head is!"... the value of) is “#$”. A quadratic function!"="!+%"+& has roots at "=3 and "=8.... what is the value of &?A biased coin has a probability ) of landing on heads. ..., what is the value of )?Let’s break down...The quadratic function canbe written as...the value of & is “24” Let’s break down...Flipping the coin twice results in.. the value of) is “%&”. Limited Number of Agents (e.g., 3) Large LLM Pool(e.g., 16)Figure 1: Comparison between existing multi-agent work and our work. (a) In prior work, a fixed set of task-level experts (e.g., Phi, Mistral, and Llama) is recruited to solve mathematical problems, where heterogeneous questions may differ in the skills required to solve them (e.g Q1 requires algebra, while Q2 focuses on probability). Expert models then engage in multiple rounds of discussion, making this approach resource-intensive. (b) In contrast, SYMBOLIC -MOEadaptively recruits instance-level experts based on skills needed (“Algebra” experts for Q1 and a different set of “Probability” experts for Q2). By generating only a single round of responses with a selected aggregator to synthesize the final output, our approach is both more performant and more efficient. solve a given query – using a small subset of available samples. Like a router in a standard MoE setting, SYMBOLIC -MOE’s skill-based routing allows the most relevant kexperts to be activated while keeping the others inactive, optimizing both accuracy and efficiency. Next, to enable scalable expert mixing, we propose a batch inference mechanism (illustrated in Fig. 3 (III)). For each query, the router first selects the necessary experts, and then examples are grouped into batches per model. We then run all queries for a given model in a single batch, allowing us to load and off-load O(k)times in total, which is far faster than sequential processing’s O(kn). Finally, while SYMBOLIC -MOE’s batched implementation accommodates up to 16models on a single GPU , it can also be parallelized across multiple GPUs. This flexibility ensures both speedups with increased computing power and accessibility for users with limited hardware resources. As shown in Figure 1, existing multi-agent approaches use a fixed set of experts for each task (e.g., all math problems), regardless of fine-grained topics (e.g., algebra or probability), and rely on multiple discussion rounds, making them non-adaptive and resource-intensive. In contrast, our approach introduces adaptive query-level recruitment based on fine-grained skills, selecting the most suitable models for each query. Furthermore, we improve efficiency through (1) a batch inference strategy that processes all queries for a model in a single batch and (2) a task-specific aggregator instead of multi-agent discussions, ensuring that expert models generate their answers only once. We evaluate SYMBOLIC -MOEon diverse benchmarks, including MMLU-Pro (Wang et al., 2024c), GPQA (Rein et al., 2023), AIME 2024 (MAA, 2024), and MedMCQA Pal et al. (2022), using a diverse model pool. We show that integrating this model pool with automated skill-based recruiting yields an average accuracy improvement of 8.15% over the best multi-agent baseline. Moreover, despite primarily using LLMs with 7-8 billion (B) parameters, SYMBOLIC -MOEachieves comparable performance with larger 70B models, and on average, outperforms strong proprietary models like GPT4o-mini (OpenAI, 2024), Gemini 1.5 Pro (Team et al., 2024), and DeepSeek-V3 (DeepSeek- AI et al., 2025b). Without any manual intervention, SYMBOLIC -MOEconsistently surpasses all baselines, whereas the strongest baseline changes across settings, with some baselines performing competitively in one setting but poorly in others. Thus, SYMBOLIC -MOEeliminates the need for the user to evaluate and compare a large number of possible baselines for each setting. Notably, SYMBOLIC -MOEobtains these benefits while reducing the amount of compute needed. Using a single GPU, SYMBOLIC -MOEhas44% less run-time than a mixture-of-agents baseline (Wang et al., 2024a). When four GPUs are available for both methods, we obtain an almost 2×speedup over this baseline. Our analysis highlights that SYMBOLIC -MOEshows strong robustness despite the variation in the best-performing models and optimal aggregators for each task. Additionally, we show that selecting a task-specific aggregator based on its ability to integrate diverse answers can achieve performance comparable to multi-round discussions while requiring substantially less compute. 3 Page 4: …Aggregator Selection Model Profile Creation Pool of Models … Router Algebra Experts A quadratic function!"=*"!+%"+& has roots at "=2 and "=5. If !" passes the point (0,10), what is the value of *?(b)Inference(a)Preprocessing Chains-of-Thought Final AnswerLet's break down...The quadratic function can be written as !"=*("−2)("−5) ...- Therefore, the value of * is “1”.... the value of * is “1”.Gene.Bio.+3-1+5Calc.… …Aggr. Benchmark Test Sample Router Samples from Model ProfilesFinal Output Model ProfileValidation Set …Adaptive Skill-Based RecruitmentLet's break down...!"=*("−2)("−5) ..!Skill: “Algebra” +5+3-1-4 " Aggregate … >>Benchmark or +6{correct_CoT}{incorrcet_CoT}{incorrcet_CoT}{question}Figure 2: Overview of SYMBOLIC -MOE. (a) Preprocessing: Given a validation set and a pool of agents, we create model profiles and select an aggregator. (b) Inference-Time: For each test example, SYMBOLIC -MOEactivates the most relevant models (experts) based on skill-based routing, using model profiles determined during preprocessing. These models generate CoT responses, which the aggregator (chosen based on its ability to select correct answers) synthesizes into a final answer. 2 Related Work Mixture-of-Agents. Traditional Mixture-of-Experts (MoE) (Jacobs et al., 1991; Jordan & Jacobs, 1994; Chen et al., 1999; Yuksel et al., 2012) models distribute computation across multiple experts, with growing interest in sparsity-driven approaches. The Sparse MoE (SMoE) approach (Shazeer et al., 2017a) improves efficiency by activating only the most relevant experts per input, enhancing scalability for high-dimensional data. This method has been widely applied in vision tasks (Riquelme et al., 2021; Wang et al., 2020; Yang et al., 2019; Abbas & Andreopoulos, 2020), language tasks (Lepikhin et al., 2021; Zhang et al., 2021; Zuo et al., 2022; Jiang et al., 2021) and multimodal learning (Kudugunta et al., 2021; Yun et al., 2024). In contrast to past sparse MoE work, the models in our approach are combined through symbolic channels (e.g., language outputs) as opposed to in the parameter space, allowing us to re-use existing LLMs without training. MoA (Wang et al., 2024a) introduces a framework for combining LLM agents into ensembles that relies on a fixed set of agents across tasks and instances. This approach requires multiple rounds of generation and aggregation before producing a final answer. Similarly, Self-MoA (SMoA; Wang et al., 2024a) posits that using multiple distinct LLMs is unnecessary, suggesting that optimal performance can be achieved by invoking the task-best model multiple times alongside the task-best aggregator. Our work differs from MoA and SMoA by introducing adaptive, instance-level, skill-based routing while avoiding costly multi- model discussions in favor of streamlined aggregation, significantly reducing computational overhead. We also find that mixing different LLMs is advantageous when paired with effective routing and aggregation strategies. Moreover, we show that the best aggregator for a task is not necessarily the best-performing model overall, but can be identified through a synthetic task we introduce, designed to evaluate aggregation effectiveness. Multi-Agent Reasoning. Multi-agent reasoning has emerged as a promising paradigm for en- hancing complex problem-solving and decision-making in AI systems. Early approaches employed reinforcement learning-based coordination (Lowe et al., 2017; Foerster et al., 2018; Jaques et al., 2019), while recent efforts leverage LLM-based multi-agent frameworks. One line of research explores student-teacher paradigms (Magister et al., 2022; Fu et al., 2023; Ho et al., 2022; Du et al., 2023; Chen et al., 2024a), where reasoning capabilities are distilled from stronger to weaker agents. Another approach investigates multi-agent debate frameworks, where agents interact to refine arguments and enhance collective decision-making; this has been explored with multiple instances of a single model (Liang et al., 2023; Xiong et al., 2023; Chan et al., 2023) or debates between multiple LLM types (Chen et al., 2024c). In both cases, the set of models is predefined by the user. In contrast, our approach automatically selects models based on inferred skills. Additionally, our framework achieves superior efficiency by avoiding multi-round discussions while still outperforming debate-based methods. 4 Page 5: 3 Methodology SYMBOLIC -MOEconsists of three stages: (I) model profile creation and aggregator selection (Fig. 2 (a)), which occur during preprocessing, followed by (II) expert recruitment and (III) final answer generation (Fig. 2 (b)), both of which take place during inference. First, we establish the problem setup in §3.1. Then, in the preprocessing stage ( §3.2), we describe how the model profile is created in§3.2.1, which enables a fine-grained skill-based assessment of each model. We also describe the process for selecting the aggregator in §3.2.2, which is responsible for generating the final answer. Next, in the inference stage ( §3.3), we introduce the recruitment process in §3.3.1, where the experts are selected based on the skills needed for each problem, and how the answers from the recruited experts are combined using the aggregator to improve reasoning in §3.3.2. 3.1 Problem Setup Given a pool of nmodels M={Mi}n i=1, where each model represents a distinct LLM with potentially different pre-training datasets and architectures, our goal is to optimize performance through dynamic allocation – solving each problem with the most suitable subset of kmodels – and combined reasoning – allowing experts to combine information to enhance reasoning. To achieve this, we assume access to a small validation set to obtain (1) model profiles Pi∀i≤n, and (2) aggregator profiles that benchmark the ability of each model to act as an aggregator. We use these profiles to recruit experts (at the instance level) and to select the aggregator (at the task level). 3.2 Preprocessing 3.2.1 Model Profile Creation To recruit the kmost suitable experts for a given question, we assess each model’s specialized skills across various problem-solving domains, illustrated in Fig. 2 (a). This is done by evaluating their performance on the validation set for each task (see Table 6 for sizes), thereby constructing a model profile Pifor each model Mi. For each question in the validation set, we first prompt an LLM – referred to as the “Keyword LLM” – to identify the essential skills required to solve the problem. The prompt used for generating these required skills is provided in Appendix G. For consistency, we use Qwen2.5-7B-Instruct (Team, 2024) as the Keyword LLM. To reduce noise, we generate keyword annotations for each question five times, and retain only those that appear more than once across the whole validation set. These extracted skills represent core knowledge areas necessary for solving the problem – for instance, a given college-level math problem may require skills such as algebra, calculus, and geometry. Once all questions are annotated with their required skills, each model Mi in the pool attempts to solve them using Chain-of-Thought reasoning (Wei et al., 2022). A correct answer increases the score of each associated skill by +1, while an incorrect answer results in a −1penalty. At the end of this process, each model has a profile Pi∈ {P1, . . . , Pn}represented as a dictionary. For example, a model’s profile may be: {‘Algebra’: 10, ‘Biology’: 3, ‘Chemistry’: -6, ... }. 3.2.2 Aggregator Selection An aggregator is a model that consolidates koutputs into a single high-quality response. Our pilot experiments, along with findings from Wang et al. (2024a) and Li et al. (2024b), indicate that the aggregator model plays a crucial role in the final performance, and selecting the most effective model for aggregation is a non-trivial challenge. We find that reasoning capability and aggregation ability are often orthogonal. That is, a strong reasoning model is not necessarily a strong aggregator and vice versa; qualitatively, we show this later in Table 5. We find that adaptively selecting an aggregator on the instance level based on model profiles is less effective, motivating us to choose the aggregator based on a model’s ability to consolidate different answers. To identify the best aggregator per task, we construct a synthetic task using the same validation set. From the profile creation process, we obtain outputs from all models, some correct and some incorrect. For each question, we sample one correct reasoning chain and two incorrect ones, structuring the input as follows: {question },{correct CoT},{incorrect CoT},{incorrect CoT}. We shuffle the order of the correct and incorrect CoTs and instruct each model to act as an aggregator (using the prompt shown in Appendix G), synthesizing a final output with a predicted answer. We 5 Page 6: then benchmark each model’s aggregation ability, measuring how well it can generate a correct answer based on this input, and select the best-performing aggregator for each task. 3.3 Inference 3.3.1 Skill-based Recruiting At inference time (see Fig. 2 (b)), we follow the same keyword annotation procedure as in Section 3.2.1 to generate relevant keywords for the test sample. To align a test sample’s keywords with those in the model profiles, we employ Sentence-BERT (Reimers & Gurevych, 2020) to match keywords via the cosine similarity between their embeddings. Each test-time keyword is then mapped to its closest counterpart in the model profile. Next, the expert recruitment is performed by selecting the topkmodels whose profiles best match the required skills of the test sample. This is determined by two factors: (1) local suitability score and(2) global competency . For each model Mi, itslocal suitability score for a test sample q,S(Mi,q)is computed as the sum of its skill scores over the set of keywords needed for q(denoted as Kq). It can be expressed as follows: S(Mi,q) =∑ kj∈Kqs(i) kj where s(i) kjrepresents the score of model Mifor the j-th skill in the test sample q. This results in an model ranking distribution Dqfor each test sample q: Dq={S(M1,q),S(M2,q), ...,S(Mn,q)} Intuitively, suppose M1has scores of +3,+5, and−2for algebra, calculus, and geometry, respec- tively, which are needed for a given sample; its total score for this sample would be 3+5−2=6. Having this calculation for all models yields a ranking distribution of model strengths for the given sample, e.g., {M1: 6,M2: 3, ..., Mn: -10}, which is the ranking of how suitable a model is for a sample . To account for each model’s overall strength in a task, i.e., global competency , we compute each model’s total score across all keywords in their profile, and normalize it by the total sum of all agents’ scores. We denote this global strength as γi, representing a model’s overall task performance relative to others. Finally, the expert selection is performed by sampling from the product of the local suitability score, S(Mi,q)(from a model rank distribution Dq) and the global competency γi. That is, the relevance score of a model Mifor a test sample qis: w(i) q=γiS(Mi,q) We apply a softmax function with the temperature set to 0.5to this distribution {w(i) q}n i=1, and then sample kexperts for each problem, i.e., E(i) q∼Categorical (w(1) q,w(2) q, ...,w(n) q),i={1, 2, ..., k} To enhance efficiency and reduce computational overhead, we filter out low-frequency experts – those who appear in fewer than 5%of test cases. For example, given a test set with 100 samples, where 3 experts are recruited per sample (totaling 300 selections), any expert appearing fewer than 300×5%=6times is discarded and resampled from the remaining higher-frequency experts. 3.3.2 Final Answer Generation After expert recruitment, each sample will be passed to the experts, i.e., the input for each expert is the test problem, x0=q. These experts generate their reasoning paths to the problem in the form of Chain-of-Thought (Wei et al., 2022): y(i) 0=E(i)(x0)∀i∈ {1, 2, ..., k} Then, the task-specific aggregator A∗is introduced to synthesize the koutputs into a high-quality final answer (Wang et al., 2024a). That is, the final answer is produced by: y=A∗( k i=1y(i) 0), where denotes the concatenation operation. In Appendix H, we provide a detailed discussion on SYMBOLIC -MOE in the context of sparse MoE frameworks and how it shares its design principles. 6 Page 7: IDQuestionExpert 1Expert 2Expert k1How many of the following …QwenLlamaMistral2A large gene has dozens of …QwenGemmaPhi3Consider the following …MistralGemmaPhi ExpertIDQwen1, 2Llama1Mistral1, 3Gemma2, 3Phi2, 3abcdeIDExpert 1Expert 2Expert k1Qwen OutputLlama Output…2Qwen Output……3………aa(III) Batch Inference: Flexible to the Number of GPUs(I) Parallel Inference w/ Multiple GPUs QwenLlamaMistral …Model n Requires n GPUs at the same timeQuestionExpert 1Expert 2Expert kHow …QwenLlamaMistral Gemma(II) Sequential Inference w/ Single GPUQuestionExpert 1Expert 2Expert kHow ...QwenLlamaMistralQwen QuestionExpert 1Expert 2Expert kA large ...GemmaQwenPhiLlama Mistral GemmaQwenPhiTakes n times longer than using 1 modelQuestionExpert 1Expert 2Expert kA large ...GemmaQwenPhiFits in a single GPUCan be accelerated w/ more GPUs QwenLlamaMistral …Model n Gemma SequentialGenerationGroup by ExpertsLoad Reloading a used modelLoad Figure 3: Different approaches to achieving adaptiveness in SYMBOLIC -MOE, which uses different models for each instance. In a naive setup (I), kGPUs must be hosted simultaneously, allowing immediate access to outputs from each model. Another naive setup (II) requires only a single GPU but involves constant loading and offloading of models to obtain outputs from the corresponding model. Our scalable batch inference process (III) strikes a balance between (I) and (II). When models are assigned to problems, we group samples by model and sequentially load the corresponding LLM onto a single GPU to generate outputs efficiently. Moreover, this approach still allows us to parallelize across GPUs if they are available. 3.3.3 Scalable Batched Inference In our experiments, we mostly consider 7B–8B parameter LLMs, which have a substantial memory footprint. Due to the adaptive nature of the recruitment process, the set of participating LLMs may change dynamically for different problems. For instance, one sample may require Qwen, Llama, and Mistral, while another may need Gemma, Exaone, and Phi. A naive implementation of this approach can lead to high latency, particularly when the required models change frequently. In such cases, the computational cost includes not only inference but also data transfers across GPUs. To mitigate these computational challenges, SYMBOLIC -MOEintroduces a novel batching strategy to maximize throughput. Specifically, for a given set of instances, we precompute (using inferred skills) which LLMs will be called for each instance. We then group instances based on their required experts, as illustrated in Fig. 3 (III) and Algorithm 1 in Appendix F. In other words, each active expert receives all relevant instances at once, ensuring that each expert is loaded only once per batch. This enables efficient batched inference on a single GPU while supporting a pool of 16LLMs. Moreover, this approach is flexible, as more GPUs can further accelerate inference through parallelization. 4 Results and Analysis 4.1 Experimental Setup Model Pool and Datasets. Our experiments consider 16LLMs ranging from 3.5B to 12B parame- ters, with most models falling in the 7–8B range. These include general-purpose instruction-tuned models, domain-specific fine-tuned variants on math and biology, and models distilled from DeepSeek R1’s trajectories (DeepSeek-AI et al., 2025a). A full list of models is provided in Table 7 in the Appendix. We measure performance on a diverse range of datasets, chosen to require expertise in a number of domains. First, we consider MMLU-Pro (Wang et al., 2024c), which is a harder version of MMLU (Hendrycks et al., 2021), containing a variety of questions across 14 college-level subjects, from Philosophy and History to Economics and Chemistry. Given its large test set of 12,000 samples and the computational cost of evaluating proprietary models, we employ stratified sampling to create a subset of 2,100 samples, ensuring each category contains 150 samples. We also evaluate on AIME 2024, which is a challenging mathematics competition dataset containing math Olympiad problems. For more science-specific reasoning, we test on GPQA (Rein et al., 2023), which contains questions across a variety of science fields written by experts, explicitly designed to be difficult to answer even by skilled humans with web access. Finally, we also include MedMCQA (Pal et al., 2022), which covers questions sourced from medical entrance exams across 21 medical subjects. For each dataset, 7 Page 8: Category Method Model MMLU-Pro AIME GPQA MedMCQA Avg. Zero-Shot CoT GPT4o-mini 63.95 10.00 42.93 68.18 46.27 Zero-Shot CoT Gemini 1.5 Pro 76.38 36.67 61.62 72.68 61.84Close-Source Single ModelZero-Shot CoT DeepSeekV3 76.29 26.00 60.10 74.09 59.12 Open-Source Zero-Shot CoT Qwen2.5 72B 71.54 ±0.88 25.55 ±3.85 51.02 ±0.27 69.02 ±0.32 54.28 70B Model Zero-Shot CoT Llama3.3 70B 69.26 ±0.47 32.22 ±3.85 51.44 ±0.62 59.78 ±0.74 53.18 Open-Source 7B ModelZero-Shot CoT QwenR1 52.57 ±0.45 55.93 ±5.16 44.95 ±1.49 38.72 ±0.44 48.04 Zero-Shot CoT Task-Best 54.89 ±0.53 55.93 ±5.16 48.43 ±3.10 55.44 ±0.50 53.62 Advanced Single ModelSelf-Refine (SR) Task-Best 53.74 ±0.20 53.33 ±3.34 50.84 ±3.65 49.57 ±0.59 51.87 Self-Consistency (SC) Task-Best x5 56.71 ±0.14 67.78 ±1.57 53.54 ±0.36 56.85 ±0.11 58.72 Single-Model Multi-AgentDebate Task-Best x3 56.21 ±0.55 56.67 ±6.67 50.51 ±0.51 51.63 ±0.80 53.76 Self-MoA Task-Best x3 55.43 ±0.72 55.56 ±5.09 52.86 ±1.46 53.27 ±0.60 54.28 Multi-Model Multi-AgentMoA Top-3 61.78 ±0.25 41.11 ±5.09 52.86 ±3.37 59.29 ±0.32 53.76 ReConcile Top-3 56.46 ±0.10 50.00 ±7.20 47.98 ±2.32 60.74 ±0.43 53.80 SYMBOLIC -MOE Adaptive 63.71 ±0.43 68.88 ±5.08 57.78 ±2.09 59.35 ±0.14 62.43 Table 1: Comparison of SYMBOLIC -MOEwith single-model and multi-model baselines. SYMBOLIC - MOEoutperforms all multi-agent baselines and achieves performance comparable to strong propri- etary models like GPT4o-mini, as well as 70B models, while primarily operating with 7-8B models. Notably, no single baseline consistently secures the second-best performance, even when the strongest models for each task are known. In contrast, our method demonstrates robustness, yielding superior results through adaptive expert selection. We bold the best results and underline the second-best (excluding methods using bigger or proprietary models, shown in gray). we sample around 350 samples as the validation set to create the agent and aggregator selection profiles.2Full dataset statistics are provided in Table 6 in the Appendix. Baselines. We compare against four categories of baselines. •Zero-shot single-model methods : This category includes proprietary models such as GPT-4o- mini (OpenAI, 2024), Gemini 1.5 Pro (Team et al., 2024), and DeepSeek-V3 (DeepSeek-AI et al., 2025b); high-capacity open-source models like Qwen2.5 72B (Qwen et al., 2025) and Llama 3.3 70B (AI@Meta, 2024); and strong distilled 7B models such as QwenR1 (DeepSeek-AI et al., 2025a). For reference, we also report the best task-specific model from our pool for each task, denoted as Task-Best. •Advanced single-model baselines with inference-time compute : We evaluate methods that enhance inference-time reasoning, specifically Self-Refine (SR) (Madaan et al., 2023b) and Self- Consistency (SC) (Wang et al., 2023b). To ensure a fair comparison, we set SC’s sample size to 5, aligning with the number of large language model (LLM) calls in SYMBOLIC -MOE, which engages three experts and one aggregator model.3Additionally, for these baselines, we use the best-performing LLM for each task, inferred on the same dev set used for our agent profile creation. •Single-model multi-agent baselines : To isolate the impact of SYMBOLIC -MOE’s recruitment strategy, we compare against methods where multiple instances of the same model collaborate. Specifically, we consider Multi-Agent Debate (Debate) (Du et al., 2023) and Self-Mixture-of- Agents (Self-MoA) (Li et al., 2025), both of which rely on iterative, multi-round discussions using a single model. These baselines employ three agents, each using the same task-best model, and conduct two rounds of discussion, resulting in a total of 6LLM calls per sample. •Multi-model multi-agent baselines : We also evaluate approaches leveraging diverse models in a multi-agent setup. This includes Mixture-of-Agents (MoA) (Wang et al., 2024a) and ReConcile (Chen et al., 2024c), both of which incorporate a fixed set of models in multi-round interactions. To ensure a fair comparison with our approach, particularly in the use of the validation set, we select the top three performing models from the validation set and conduct multi-round interactions. In MoA, agents participate in two rounds of discussion, while agents in ReConcile engage in three rounds, leading to 6and9LLM calls per sample, respectively. 2For AIME, we sample validation questions from prior years’ problems (2012-2023). 3We use an odd number of SC calls to avoid ties. 8 Page 9: For single-model baselines, we use the strongest task-specific LLM, while for multi-model baselines, we select the top three models per task. These selections, like those in SYMBOLIC -MOE, are determined based on validation performance. Table 8 in the Appendix details the top-1 and top-3 models for each task. Implementation Details. We conduct our experiments for SYMBOLIC -MOEand other single- model baselines on a single A6000 GPU with 48 GB of memory, while MoA and ReConcile are executed on 8 A6000 GPUs for parallelization. For the 70B models, we use the original version without quantization and perform inference on 4 A6000 GPUs. All open-source models utilize vLLM (Kwon et al., 2023) for inference. The temperature is set to 0.7 for all methods. The maximum output token length is fixed at 4096 for all models, except for QwenR1 and LlamaR1, which have a limit of 32768 since they are trained with longer trajectories and tend to generate longer outputs. All results, except those from proprietary models (due to budget constraints), are averaged over three random seeds. Further details on the model pool, distribution of the expert recruited, and all the prompts we use can be found in Table 7 and Appendix G. 4.2 Main Results We present the main results in Table 1, and summarize the key findings below. SYMBOLIC -MOEconsistently outperforms all baselines. Across all domains, SYMBOLIC -MOE shows superior performance compared to all baselines, beating single-model baselines using the best overall model (e.g., SR, SC), multi-agent debate with a single model (Debate and Self-MoA), as well as multi-model multi-agent baselines like MoA and ReConcile. SYMBOLIC -MOEoutperforms the most competitive multi-agent baseline, Self-MoA, by 8.15% (absolute) on average, with consistent improvements across domains (e.g., 8.28% gain on MMLU-Pro, 13.45% on AIME, 4.92% on GPQA, 6.08% on MedMCQA). These gains are also seen when comparing to multi-model baselines like MoA and ReConcile that use the top three strongest models per domain. SYMBOLIC -MOEalso substantially outperforms test-time scaling methods, such as Self-Refine (Madaan et al., 2023b) and Self-Consistency (Wang et al., 2023a). Surprisingly, when using the task-best model, SC beats multi-agent debate baselines (e.g., Self-MoA, MoA), though it still underperforms SYMBOLIC -MOE by an average of 3.71% . This indicates that scaling test-time compute with the task-best model is a simple but effective way to improve performance, and adaptively selecting the most suitable experts leads to further improvements. SYMBOLIC -MOEgeneralize well across tasks. No single baseline in Table 1 is universally effective across all tasks. For instance, while MoA performs well on MMLU-Pro, it struggles on AIME; ReConcile excels in MedMCQA but fails to generalize to GPQA. Knowing which method works best for a task is therefore nontrivial, requiring running every method on validation sets to choose from the many settings available. In contrast, SYMBOLIC -MOEconsistently delivers a strong performance across all domains. SYMBOLIC -MOEespecially excels on AIME and GPQA, where SC is a surprisingly strong baseline, and where other strong methods like MoA and Self-MoA fall far behind SYMBOLIC -MOE. Moreover, we see that, while SC with the top model is the most competitive setting on AIME and GPQA, it lags behind other baselines on MMLU-Pro and MedMCQA, where multi-agent baselines perform better. This discrepancy may stem from the broader subject diversity in MMLU-Pro and MedMCQA, where agent collaboration facilitates better consensus, whereas AIME is more math-focused, and favors individual model performance – the task-best model, QwenR1 7B, delivers strong solo performance already. In light of that, we include QwenR1 7B in Table 1, which is a powerful model distilled from DeepSeek-R1’s trajectories (DeepSeek-AI et al., 2025a). While QwenR1 demonstrates exceptional math and code reasoning capabilities ( 55.93% on AIME), leading to strong Self-Consistency performance on AIME ( 67.78% ), it struggles to generalize to other domains such as MedMCQA. This further underscores the need for a robust and flexible framework like S YMBOLIC -MOE to achieve broad generalization across diverse tasks. SYMBOLIC -MOEmatches strong proprietary models and larger 70B models. In Table 1, we also find that SYMBOLIC -MOEachieves a similar average performance to models that have substantially more parameters. For example, SYMBOLIC -MOEoutperforms Llama3.3 70B on AIME and GPQA and roughly matches it on MedMCQA, despite requiring only four 7-8B models 9 Page 10: (three for the experts and one for the aggregator). Similarly, SYMBOLIC -MOEoutperforms or matches a number of strong proprietary models on average – for instance, it matches Gemini 1.5 Pro and outperforms GPT4o-mini, driven in part by sizable gains on AIME and GPQA. These results underscore that, by drawing from a large pool of experts, SYMBOLIC -MOEenables a smaller set of models to achieve performance comparable to models with significantly more parameters, especially when considering a heterogeneous set of tasks like the ones we explore. Method # GPUs Run Time (s) Sequential Inference 1 196.92 MoA 1 45.98 MoA 4 21.66 SYMBOLIC -MOE 1 25.76 SYMBOLIC -MOE 4 10.85 Table 2: Efficiency comparison of MoA and SYMBOLIC -MOE. We report time per sample, based on the total time needed to perform inference for the whole GPQA test set (divided by the number of ques- tions in the test set).SYMBOLIC -MOEis more efficient. We compare SYMBOLIC -MOE’s efficiency to a naive implementa- tion of sequential inference and to that of MoA and in Table 2. On GPQA, we measure the average run-time for the 198 examples in the test set. Unsurprisingly, the sequential inference baseline shows the highest latency, since the model is constantly changing, requiring loading and offloading it for each instance. We also find that SYMBOLIC -MOEoperates in 44% less time on a single GPU than MoA, which requires discussion, while achiev- ing better accuracy. Also, SYMBOLIC -MOErunning on a single GPU shows similar time with MoA on 4 GPUs, and furthermore, when we run SYMBOLIC -MOEon 4 GPUs, it results in almost 2×speedup over MoA. SYMBOLIC -MOEachieves efficiency in two key ways: (1) it employs batched inference to minimize model loading time, and (2) it eliminates round-wise discussions, which introduce significant overhead at each round. In Table 3 we find that with a reliable aggregator, skipping inter-agent discussions yields performance comparable to actually conducting them. Engaging kagents in a one-round discussion requires generating kinitial outputs, concatenating them, and then generating kupdated outputs, significantly increasing overhead. Discuss Aggr. MMLU-Pro GPQA ✓ Adaptive 59.07 57.01 ✗ Adaptive 57.12 58.01 ✓ Task-best 57.81 57.78 ✗ Task-best 56.67 57.01 ✓ Task-specific 63.83 57.72 ✗ Task-specific 63.71 57.78 Table 3: Comparison of SYMBOLIC - MOEwith and without discussion across varying aggregators. Discussion stabi- lizes performance with suboptimal aggre- gators (first four rows) but has little effect with an optimal one (last two rows).Like multi-agent discussion baselines, SYMBOLIC -MOE can also operate in a discussion-based manner. Instead of aggregating initial responses, models first engage in a round discussion before submitting final answers to an aggregator. Table 3 evaluates this approach, comparing the adaptive aggregator (suboptimal) and task-best aggre- gator (suboptimal), as well as the task-specific aggregator (optimal) on MMLU-Pro and GPQA. Given the optimal aggregator, we see that discussion yields marginal gains on MMLU-Pro ( 63.83 vs.63.71 ) and a modest drop on GPQA ( 57.72 vs.57.78 ). We find that while round-wise discussion does improve performance incrementally, but the final outcome is ultimately determined by the strength of the aggregator. As a result, SYMBOLIC -MOEhas less run time than the baselines while maintaining or even surpassing their performance, as demonstrated in Table 2. 4.3 Analysis Recruiting Strategy Acc. Top-3 Experts 52.86 Top-5 Experts 47.68 3 Random Experts 42.61 5 Random Experts 44.92 Model Profile (Ours) 57.78 Table 4: Comparison of different recruit- ing strategies on GPQA.Utility of the Model Profile. SYMBOLIC -MOEprofiles models based on their skills and leverages this information to recruit experts effectively. To underscore the importance of this step, we compare several alternative selection strate- gies in Table 4, evaluating accuracy on GPQA. Specifically, we assess performance when selecting the top- kagents overall and when choosing kagents at random. In the top- k approach, experts are fixed as the best-performing models for the task, whereas in the random- kstrategy, the selected experts vary across instances. Our results demonstrate that skill-based selection is essential. Notably, although 10 Page 11: selecting the top kexperts for a task may seem intuitive, it consistently underperforms compared to SYMBOLIC -MOE’s adaptive instance-level expert selection. Interestingly, top- 5selection performs worse than top- 3selection, suggesting that a broader selection increases the likelihood of including weaker models, leading to performance degradation. Additionally, the random selection strategy consistently harms performance, showing a 12.86% −15.61% drop compared to SYMBOLIC -MOE, likely also due to the inclusion of weaker experts. Overall, skill-based instance-level selection con- sistently outperforms more coarse-grained approaches, with SYMBOLIC -MOEsurpassing MoA by 4.92% as also shown in Table 1. These findings emphasize the value of leveraging performance-based statistics and individual skill assessments for expert selection. Aggregator MMLU-Pro GPQA Random 52.29 48.92 Adaptive 57.12 58.01 Task-specific 63.71 57.78 Table 5: Ablations on different aggrega- tors in our full setting.Role and Selection of the Aggregator. Unlike most of our discussion-based multi-agent baselines, SYMBOLIC - MOEcollects a single CoT and answer from each expert and combines them via an aggregator. Besides the effi- ciency gain as shown in Table 2, we investigate the role the aggregator plays in our framework. While experts are selected per instance, the aggregator is chosen based at the task, as we find that reasoning ability does not neces- sarily translate to effective aggregation. Table 5 examines the impact of aggregator selection in SYMBOLIC -MOE, comparing three strategies: (1) a randomly chosen aggregator from the model pool, (2) an instance-specific top-1 aggregator based on model profiling, and (3) a task-specific aggregator determined by task-level performance. Evaluated on MMLU-Pro and GPQA, the results indicate that a random aggregator substantially degrades perfor- mance, indicating that the aggregator plays a crucial role. While the instance-specific top-1 aggregator improves outcomes on both datasets, the task-specific aggregator outperforms it on MMLU-Pro and performs comparably on GPQA. We further find that the similar performance of instance-specific and task-specific aggregation on GPQA is due to a high degree of overlap in selected aggregators. Overall, these findings suggest that a model being a good reasoner does not imply it will be a good aggregator, supporting our decision to use task-based aggregation. 5 Discussion and Conclusion A key feature highlighted in Table 1 is the consistency ofSYMBOLIC -MOE’s performance. While baseline methods occasionally do well in isolated settings (e.g. MoA on MMLU-Pro, ReConcile on MedMCQA) it is important to highlight that no baseline does well consistently across settings. This means that – without SYMBOLIC -MOE– getting a strong overall result would require evaluating all the baseline methods and choosing the best settings manually. In contrast to the baselines, SYMBOLIC - MOEconsistently achieves high performance without human intervention . By automatically selecting discussion agents, S YMBOLIC -MOE provides a consistent recipe that generalizes across domains. Modularity. Another key advantage of SYMBOLIC -MOEis its modularity. Unlike typical Mixture- of-Experts (MoEs) frameworks, which need to be trained end-to-end from scratch in a centralized manner, SYMBOLIC -MOEuses the symbolic output channel of existing models to combine experts. This gradient-free approach enables seamless integration of pre-trained models without updates, allowing them to be trained independently and distributedly. Such delegation enhances domain specialization for each model. Moreover, while standard MoEs have a fixed size determined before training, SYMBOLIC -MOEcanadaptively grow and evolve as models are updated. Given the rapid advancements in LLMs, cost-effective and efficient updates are essential – state-of-the-art models are often replaced within months. SYMBOLIC -MOE’s modular and gradient-free design simplifies these updates, requiring only a few calls to obtain a new model’s agent profile. Connections to Inference-Time Scaling. Like other LLM-discussion frameworks, SYMBOLIC - MOEcan be seen as a form of multi-model inference-time scaling. Past work (Wang et al., 2023a; Snell et al., 2024; Shao et al., 2024) has highlighted the benefits of adding inference-time compute, ranging from simply sampling multiple responses (as done in our SC baselines) to more complex strategies such as refinement (Madaan et al., 2023a; Chen et al., 2024b). SYMBOLIC -MOEsurpasses several such baselines (Table 1) by adaptively selecting the optimal set of expert models while avoiding the costly discussion process, reducing overhead while improving performance. Moreover, 11 Page 12: our novel batching method enables SYMBOLIC -MOEto run efficiently on a single GPU while remaining flexible for acceleration with multiple GPUs. Conclusion. We introduced SYMBOLIC -MOE, a scalable MoE framework that combines models through their symbolic output (i.e., via natural language discussion). SYMBOLIC -MOEinfers which skills are needed for a given problem and recruits agents based on those skills to engage in a discussion about a given input, guiding the discussion to reach a better consensus. On four diverse reasoning datasets, SYMBOLIC -MOEoutperforms standard inference-time scaling methods as well as other debate frameworks and other mixture-of-agents methods, leading to consistently strong performance across domains without human intervention. SYMBOLIC -MOE’s average performance across heterogeneous tasks is in fact stronger than that of advanced proprietary models such as GPT4o-mini. Moreover, unlike past work that requires multiple models to be loaded on separate GPUs running in parallel, SYMBOLIC -MOEintroduces a novel batching strategy that allows us to run on a single GPU in roughly the same amount of time, obtaining the best of both worlds in terms of performance and efficiency. Limitations Like other multi-agent discussion methods (Du et al., 2023; Chen et al., 2024c), SYMBOLIC -MOE involves running multiple models, which increases inference cost. This cost can be reduced via distillation: Chen et al. (2024a) distill multi-agent discussions between a fixed set of agents into a single model, showing improvements over distilling from single models. This approach could easily be adapted to distill from a variable set of agents, allowing the student model to benefit from the routing and skill selection process. We leave distilling from S YMBOLIC -MOE to future work. SYMBOLIC -MOEalso relies on skills inferred from a small validation set to set the agent profiles. In our experiments, we ensure fair comparisons to the baselines by choosing models for the baselines according to the same validation set, giving the baselines equal access to the data. Skill inference relies on the Keyword LLM being sufficiently trained on a given domain to infer relevant skills – empirically, we find that this is the case across a variety of domains. Overall, SYMBOLIC -MOEwill continue to improve with better skill inference modules, which can easily be swapped in. Acknowledgements This work was supported by NSF-CAREER Award 1846185, DARPA ECOLE Program No. HR00112390060, Microsoft Accelerate Foundation Models Research (AFMR) grant program, NSF- AI Engage Institute DRL-2112635, National Institutes of Health (NIH) under other transactions 1OT2OD038045-01, and Cisco and Capital One Faculty Awards. The views and conclusions con- tained in this document are those of the authors and should not be interpreted as representing official policies, either expressed or implied, of the NIH or other sponsors. References Alhabib Abbas and Yiannis Andreopoulos. Biased mixtures of experts: Enabling computer vision inference under data transfer limitations. IEEE Transactions on Image Processing , 29:7656–7667, 2020. AI@Meta. Llama 3.3 model card. 2024. URL https://github.com/meta-llama/ llama-models/blob/main/models/llama3_3/MODEL_CARD.md . Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201 , 2023. Justin Chen, Swarnadeep Saha, Elias Stengel-Eskin, and Mohit Bansal. Magdi: Structured distillation of multi-agent interaction graphs improves reasoning in smaller language models. In Forty-first International Conference on Machine Learning , 2024a. 12 Page 13: Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, and Mohit Bansal. Magicore: Multi-agent, iterative, coarse-to-fine refinement for reasoning. arXiv preprint arXiv:2409.12147 , 2024b. Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. Reconcile: Round-table conference improves reasoning via consensus among diverse llms, 2024c. URL https://arxiv.org/ abs/2309.13007 . Ke Chen, Lei Xu, and Huisheng Chi. Improved learning algorithms for mixture of experts in multiclass classification. Neural networks , 12(9):1229–1252, 1999. Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V . Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161 , 2025. URL https://arxiv.org/ abs/2501.17161 . Herbert H Clark. Using language . Cambridge university press, 1996. DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025a. URL https://arxiv.org/abs/2501. 12948 . DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v3 technical report. 2025b. URL https://arxiv.org/abs/2412.19437 . Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factual- ity and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 , 2023. David Eigen, Marc’Aurelio Ranzato, and Ilya Sutskever. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314 , 2013. William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research , 23(120):1–39, 2022. URL http://jmlr.org/papers/v23/21-0998.html . Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI conference on artificial intelligence , volume 32, 2018. Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. Specializing smaller language models towards multi-step reasoning. In International Conference on Machine Learning , pp. 10421–10430. PMLR, 2023. Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, and Ion Stoica. M \’elange: Cost efficient large language model serving by exploiting gpu heterogeneity. arXiv preprint arXiv:2404.14527 , 2024. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR) , 2021. Namgyu Ho, Laura Schmid, and Se-Young Yun. Large language models are reasoning teachers. arXiv preprint arXiv:2212.10071 , 2022. Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation , 3(1):79–87, 1991. Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, DJ Strouse, Joel Z Leibo, and Nando De Freitas. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International conference on machine learning , pp. 3040–3049. PMLR, 2019. 13 Page 14: Hao Jiang, Ke Zhan, Jianwei Qu, Yongkang Wu, Zhaoye Fei, Xinyu Zhang, Lei Chen, Zhicheng Dou, Xipeng Qiu, Zikai Guo, et al. Towards more effective and economic sparsely-activated model. arXiv preprint arXiv:2110.07431 , 2021. Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the em algorithm. Neural computation , 6(2):181–214, 1994. Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin, Minh-Thang Luong, and Orhan Firat. Beyond distillation: Task-level mixture-of-experts for efficient inference. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021 , pp. 3577–3599. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.findings-emnlp.304. URL https://doi.org/10. 18653/v1/2021.findings-emnlp.304 . Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, and Percy Liang. Fine- tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations , 2022. URL https://openreview.net/forum? id=UYneFzXSJWh . Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , 2023. Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 . OpenReview.net, 2021. URL https:// openreview.net/forum?id=qrwe7XHTmYb . Baolin Li, Yankai Jiang, Vijay Gadepally, and Devesh Tiwari. Llm inference serving: Survey of recent advances and opportunities. arXiv preprint arXiv:2407.12391 , 2024a. Dawei Li, Zhen Tan, Peijia Qian, Yifan Li, Kumar Satvik Chaudhary, Lijie Hu, and Jiayi Shen. Smoa: Improving multi-agent large language models with sparse mixture-of-agents. arXiv preprint arXiv:2411.03284 , 2024b. Wenzhe Li, Yong Lin, Mengzhou Xia, and Chi Jin. Rethinking mixture-of-agents: Is mixing different large language models beneficial? arXiv preprint arXiv:2502.00674 , 2025. Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118 , 2023. Elita Lobo, Chirag Agarwal, and Himabindu Lakkaraju. On the impact of fine-tuning on chain-of- thought reasoning. arXiv preprint arXiv:2411.15382 , 2024. Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems , 30, 2017. Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583 , 2023. MAA. American invitational mathematics examination - aime. in american invitational mathematics examination, 2 2024. URL https://maa.org/math-competitions/ american-invitational-mathematics-examination-aime . 14 Page 15: Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegr- effe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdan- bakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In NeurIPS , 2023a. URL http://papers.nips.cc/paper_files/paper/2023/hash/ 91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html . Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023b. Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. Teaching small language models to reason. arXiv preprint arXiv:2212.08410 , 2022. Brando Miranda, Alycia Lee, Sudharsan Sundar, and Sanmi Koyejo. Beyond scale: the diversity coefficient as a data quality metric demonstrates LLMs are pre-trained on formally diverse data, 2024. URL https://openreview.net/forum?id=506Sxc0Adp . OpenAI. Gpt-4o mini: advancing cost-efficient intelligence, 2024. URL https://openai.com/ index/gpt-4o-mini-advancing-cost-efficient-intelligence/ . Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning , pp. 248–260. PMLR, 2022. Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, et al. Qwen2.5 technical report. 2025. URL https: //arxiv.org/abs/2412.15115 . Nils Reimers and Iryna Gurevych. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing . Association for Computational Linguistics, 11 2020. URL https:// arxiv.org/abs/2004.09813 . David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022 . Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, Andr ´e Su- sano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual , pp. 8583–8595, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/ 48237d9f2dea8c74c2a72126cf63d933-Abstract.html . Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 , 2024. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 , 2017a. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V . Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings . OpenReview.net, 2017b. URL https:// openreview.net/forum?id=B1ckMDqlg . 15 Page 16: Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314 , 2024. Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. URLhttps://arxiv.org/abs/2403.05530 . Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm. github.io/blog/qwen2.5/ . Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities. arXiv preprint arXiv:2406.04692 , 2024a. Qineng Wang, Zihao Wang, Ying Su, Hanghang Tong, and Yangqiu Song. Rethinking the bounds of LLM reasoning: Are multi-agent discussions the key? In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pp. 6106–6131, Bangkok, Thailand, August 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.331. URL https://aclanthology.org/2024.acl-long.331/ . Xin Wang, Fisher Yu, Lisa Dunlap, Yi-An Ma, Ruth Wang, Azalia Mirhoseini, Trevor Darrell, and Joseph E Gonzalez. Deep mixture of experts via shallow embedding. In Uncertainty in artificial intelligence , pp. 552–562. PMLR, 2020. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V . Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In ICLR . OpenReview.net, 2023a. URL https://openreview.net/pdf?id= 1PL1NIMMrw . Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations , 2023b. URL https://openreview.net/forum?id=1PL1NIMMrw . Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , November 2024c. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems , 35:24824–24837, 2022. Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, and Bing Qin. Examining inter-consistency of large language models collaboration: An in-depth analysis via debate. arXiv preprint arXiv:2305.11595 , 2023. Enwei Xu, Wei Wang, and Qingxia Wang. The effectiveness of collaborative problem solving in promoting students’ critical thinking: A meta-analysis based on empirical literature. Human- ities and Social Sciences Communications , 10(1):16, 2023. ISSN 2662-9992. doi: 10.1057/ s41599-023-01508-1. URL https://doi.org/10.1057/s41599-023-01508-1 . An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122 , 2024. URL https: //arxiv.org/abs/2409.12122 . Brandon Yang, Gabriel Bender, Quoc V Le, and Jiquan Ngiam. Condconv: Conditionally parameter- ized convolutions for efficient inference. Advances in Neural Information Processing Systems , 32, 2019. 16 Page 17: W. Quin Yow and Tony Zhao Ming Lim. Sharing the same languages helps us work better together. Palgrave Communications , 5(1):154, 2019. ISSN 2055-1045. doi: 10.1057/s41599-019-0365-z. URLhttps://doi.org/10.1057/s41599-019-0365-z . Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284 , 2023. Seniha Esen Yuksel, Joseph N. Wilson, and Paul D. Gader. Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems , 23(8):1177–1193, 2012. doi: 10.1109/TNNLS.2012.2200299. Sukwon Yun, Inyoung Choi, Jie Peng, Yangfan Wu, Jingxuan Bao, Qiyiwen Zhang, Jiayi Xin, Qi Long, and Tianlong Chen. Flex-moe: Modeling arbitrary modality combination via the flexible mixture-of-experts. arXiv preprint arXiv:2410.08245 , 2024. Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. Moefication: Condi- tional computation of transformer models for efficient inference. arXiv preprint arXiv:2110.01786 , 2021. Simiao Zuo, Xiaodong Liu, Jian Jiao, Young Jin Kim, Hany Hassan, Ruofei Zhang, Jianfeng Gao, and Tuo Zhao. Taming sparsely activated transformer with stochastic experts. In International Conference on Learning Representations , 2022. URL https://openreview.net/forum? id=B72HXs80q4 . 17 Page 18: Appendix A Dataset Statistics and Licenses We provide the sample sizes and licenses of the datasets used in this work in Table 6. All the datasets are in English, and all datasets are used in a fashion consistent with their intended use. Validation Size Test Size License MMLU-Pro (Wang et al., 2024c) 350 2,100 Apache License AIME (MAA, 2024) 354 30 CC0 GPQA (Rein et al., 2023) 249 198 MIT License MedMCQA (Pal et al., 2022) 504 4,183 MIT License Table 6: The statistics and licenses of the datasets we use in this work. B Model Pool Model Name Size Huggingface Link BioLlama 8B ContactDoctor/Bio-Medical-Llama-3-8B DeepSeekMath 7B deepseek-ai/deepseek-math-7b-instruct Exaone 7.8B LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct Gemma2 9B google/gemma-2-9b-it GLM4 9B THUDM/glm-4-9b Granite 8B ibm-granite/granite-3.1-8b-instruct InternLM3 8B internlm/internlm3-8b-instruct Llama3.1 8B meta-llama/Llama-3.1-8B-Instruct LlamaR1 8B deepseek-ai/DeepSeek-R1-Distill-Llama-8B Mathstral 7B mistralai/Mathstral-7B-v0.1 Mistral 12B mistralai/Mistral-Nemo-Instruct-2407 Phi3.5-mini 3.5B microsoft/Phi-3.5-mini-instruct Qwen2.5 7B Qwen/Qwen2.5-7B-Instruct Qwen2.5-Coder 7B Qwen/Qwen2.5-Coder-7B-Instruct Qwen2.5-Math 7B Qwen/Qwen2.5-Math-7B-Instruct QwenR1 7B deepseek-ai/DeepSeek-R1-Distill-Qwen-7B Table 7: The models constituting the model pool. C Performance on the Validation Set Table 8 shows the performance of each model on the validation set. We highlight the top-1 and top-3 models in bold font and yellow background, respectively. This information is also used for the baselines we compare against in Table 1. D Performance of Each Model as an Aggregator Table 9 shows the performance of each model when acting as an aggregator. Note that the best- performing model in Table 8 can be different from the best aggregator model in Table 9, motivating us to choose the aggregator based on this synthetic task described in Section 3.2.2. E Distribution of Experts We present the distribution of recruited experts across different datasets in Fig. 4. As noted in Section 3.2.1, we trim experts with occurrences below 5% to reduce model loading time. This is 18 Page 19: Model MMLU-Pro AIME GPQA MedMCQA BioLlama 37.71 0.85 27.31 42.86 DeepSeekMath 32.57 3.32 28.11 35.71 Exaone 52.29 25.99 32.13 56.35 Gemma 53.71 7.73 36.95 64.29 GLM 50.29 7.37 30.92 58.33 Granite 43.43 5.92 34.14 56.15 InternLM 43.14 7.91 36.14 55.56 Llama 46.00 6.78 33.73 66.87 LlamaR1 54.29 51.98 56.22 53.37 Mathstral 34.57 3.11 36.55 52.38 Mistral 45.14 1.41 33.73 46.43 Phi 46.57 1.41 47.79 65.87 Qwen 54.00 13.56 37.35 67.06 QwenCode 46.29 9.89 30.52 50.79 QwenMath 31.71 11.13 28.51 36.90 QwenR1 53.43 57.06 51.41 37.90 Table 8: Comparison of model performance on the validation set. The best model on each task is bolded , and the top 3 models on each task are highlighted in yellow. Model MMLU-Pro AIME GPQA MedMCQA BioLlama 37.31 21.47 30.12 42.46 DeepSeekMath 32.57 5.37 21.69 35.71 Exaone 57.43 47.92 35.34 56.35 Gemma 49.71 3.11 31.73 53.37 GLM 52.57 26.27 35.34 51.39 Granite 48.86 36.44 38.96 56.15 InternLM 55.14 16.95 42.57 51.59 Llama 51.14 11.86 40.56 50.60 LlamaR1 59.71 53.67 46.18 49.01 Mathstral 41.71 26.27 35.74 46.43 Mistral 48.00 18.93 33.33 46.43 Phi 27.71 9.04 26.10 25.40 Qwen 56.86 38.14 39.36 53.37 QwenCode 51.14 29.66 38.96 50.79 QwenMath 31.71 5.93 16.06 36.90 QwenR1 58.00 57.63 48.59 45.44 Table 9: Performance of each model when used as an aggregator, on the validation set. The best model on each task is bolded , and is selected as the task-specific aggregator. 19 Page 20: Gemma 21.5% QwenR1 18.6% LlamaR1 14.0%Qwen 12.4%GLM 8.7%Exaone 7.2%MMLU-Pro (Before) QwenR1 60.0%LlamaR1 40.0%AIME (Before) LlamaR1 54.9% QwenR1 31.5%GPQA (Before) Qwen 23.2% Llama 20.9% Phi 18.2%Gemma 18.2%GLM 5.1%MedMCQA (Before) Gemma 25.9% QwenR1 22.7%LlamaR1 17.0%Qwen 15.2%GLM 10.6%Exaone 8.7%MMLU-Pro (After) QwenR1 60.0%LlamaR1 40.0%AIME (After) LlamaR1 64.1%QwenR1 35.9%GPQA (After) Qwen 26.9% Llama 24.5%Phi 21.4%Gemma 21.1%GLM 6.1%MedMCQA (After)Figure 4: Distribution of the recruited experts across datasets. Top row: the distribution before trimming. Bottom row: the distribution after trimming and re-sampling. visualized in Fig. 4, where the top row shows the distribution before trimming, and the bottom row shows the distribution after trimming. The distribution varies significantly across datasets – on more diverse datasets such as MMLU-Pro, the recruited experts are also more varied. In contrast, for AIME and GPQA, which focus more on math and science, the recruited experts are dominated by a few models. F Algorithm We provide the algorithm for our batched inference strategy in Appendix F. Algorithm 1 BatchedInference Require: Test samples Q, Model pool M Ensure: Inference results for all samples 1:expert sample map←∅ ▷Expert-to-samples mapping 2:forq∈ Q do 3: E(1) q,E(2) q, ...,E(k) q←RECRUIT EXPERTS (q,M) ▷Select kexperts per sample ( §3.3.1) 4: fore∈Eqdo 5: expert sample map[e]←expert sample map[e]∪{q} 6: end for 7:end for 8: 9:results ←∅ ▷Results collection 10:for(e,qe)∈expert sample map do 11: results ←results ∪e.GENERATE (qe) ▷Batch inference per expert 12:end for 13:return results 20 Page 21: G Prompts Prompt for the Keyword LLM to Generate Keywords Question: {question } What are the core knowledge, subjects or skills needed to solve this problem? List 2-5 keywords separated in comma. Example keywords: psychology, virology, behavioral theory, microbiology, diplomacy, political science, property law, finance, business. Give ONLY the keywords, no other words or explanation. Follow this format: Keywords: <keyword1 >,<keyword2 >... Prompt for Zero-shot Chain-of-Thought Generation (Multiple Choice) Question: {question } Provide your step-by-step reasoning first, and then print “The answer is (X)” where X is the answer choice (one capital letter), at the end of your response. Prompt for Zero-shot Chain-of-Thought Generation (Math) Question: {question } Provide your step-by-step reasoning first, and then print “The answer is \\boxed{X}”, where X is the final answer, at the end of your response. Prompt for the Aggregator (Wang et al., 2024a) You have been provided with a set of responses from various open-source models to the latest user query. Your task is to synthesize these responses into a single, high-quality response. It is crucial to critically evaluate the information provided in these responses, recognizing that some of it may be biased or incorrect. Your response should not simply replicate the given answers but should offer a refined, accurate, and comprehensive reply to the instruction. Ensure your response is well-structured, coherent, and adheres to the highest standards of accuracy and reliability. Responses from models: {model 1response } {model 2response } {model 3response } Question: {question } Provide your step-by-step reasoning first, and then print “The answer is (X)” where X is the answer choice (one capital letter), at the end of your response. H S YMBOLIC -MOE as a Sparse Mixture-of-Expert In the Sparse Mixture-of-Experts (SMoE) framework (Shazeer et al., 2017a), a trainable router dynamically selects a subset of experts for each input. Formally, given an input x, the output of an SMoE layer, yis computed as: y=k ∑ i=1R(x)i·fi(x), R(x) =softmax (Top-K (g(x)),k)(1) 21 Page 22: where fi(x)represents the response of the i-th expert, and R(x)is a trainable router that assigns selection probabilities to each expert based on g(x), typically a small feedforward network (Shazeer et al., 2017b; Riquelme et al., 2021). The Top-K operation retains only the top kexperts, setting the probabilities of others to zero after the softmax operation. However, directly applying SMoE in our framework presents key challenges. Unlike SMoE, our method operates in a symbolic, text-based space and is designed for test-time inference, meaning that we do not rely on a trainable router to learn expert selection, nor do the experts in our method refer to model parameters. Instead, we introduce a skill-based routing mechanism to select relevant experts based on predefined competencies rather than learned gating functions. Formally, our aggregation process can be expressed as: y=A∗( k i=1y(i)) y(i)=E(i)(x)∀i∈ {1, 2, ..., k} E(i)∼Categorical (w(1),w(2), ...,w(n))∀i≤k(2) where A∗is the aggregator model determined via validation set, and denotes the concatenation of experts’ responses, i.e., y(·). Here, y(j)represents the output of expert j’s forward response given an input x, defined as E(j)(x). Each expert E(i),∀i≤kis selected from our proposed skill-based routing strategy (Section 3.3.1). In short, we construct model profiles using a validation set to evaluate each model’s specialization across different skills. This allows us to estimate a probability distribution w(j)over models based on both their suitability for the required skills and their global competence relative to other experts. This skill-based routing framework retains the core benefits of SMoE while removing the reliance on a trainable gating mechanism. Specifically, the aggregator model A∗inSYMBOLIC -MOEplays a role analogous to the weighted sum ( ∑) operation in SMoE, synthesizing outputs from selected experts. Likewise, the recruited agent E(i)corresponds to the Top-k operation in SMoE, ensuring that only the most relevant and specialized experts contribute to the final output. We inherit the key conceptual benefits of SMoE – dynamic expert selection and response aggregation – while also introducing additional advantages. SYMBOLIC -MOEis gradient-free, eliminating the need for retraining, and is entirely automatic, leveraging a large pool of pre-trained models to deliver a better performance. 22

---