Authors: Kaustubh Vyas, Damien Graux, Yijun Yang, Sébastien Montella, Chenxin Diao, Wendi Zhou, Pavlos Vougiouklis, Ruofei Lai, Yang Ren, Keshuang Li, Jeff Z. Pan
Under review as a conference paper at ICLR 2025
FROM AN LLM SWARM TO A PDDL-EMPOWERED
HIVE: PLANNING SELF-EXECUTED INSTRUCTIONS IN
A MULTI-MODAL JUNGLE
Kaustubh Vyas¹, Damien Graux¹, Yijun Yang¹, Sébastien Montella¹, Chenxin Diao¹,
Wendi Zhou², Pavlos Vougiouklis¹, Ruofei Lai¹, Yang Ren¹, Keshuang Li¹, Jeff Z. Pan¹,²
¹Huawei Technologies Ltd., UK, ²University of Edinburgh, UK
{firstname.lastname}@huawei.com
ABSTRACT
In response to the call for agent-based solutions that leverage the ever-increasing
capabilities of the deep models’ ecosystem, we introduce HIVE, a comprehensive
solution for selecting appropriate models and subsequently planning a set
of atomic actions to satisfy the end-users’ instructions. HIVE operates over sets
of models and, upon receiving natural language instructions (i.e. user queries),
schedules and executes explainable plans of atomic actions. These actions can
involve one or more of the available models to achieve the overall task, while
respecting end-users’ specific constraints. Notably, HIVE handles tasks that involve
multi-modal inputs and outputs, enabling it to handle complex, real-world queries.
Our system is capable of planning complex chains of actions while guaranteeing
explainability, using an LLM-based formal logic backbone empowered by PDDL
operations. We introduce the MUSE benchmark in order to offer a comprehensive
evaluation of the multi-modal capabilities of agent systems. Our findings show
that our framework redefines the state-of-the-art for task selection, outperforming
other competing systems that plan operations across multiple models, while
offering transparency guarantees and fully adhering to user constraints.
1 INTRODUCTION
Within the past few years, the number of available models –either through commercial paywalls or
open-sourced– has exploded both in terms of intrinsic performances and in terms of tasks handled
by them, ranging from text generation (Achiam et al., 2023; Anthropic, 2023; Team et al., 2023)
to more specific actions such as code generation (Becker et al., 2023; Dong et al., 2024) or image
generation (Wang et al., 2023b; Zhu et al., 2023). This rapid growth has unlocked unprecedented
potential for real-world applications, inspiring practitioners, especially in industry, to envision new
use cases that leverage these powerful models (Liu et al., 2023c; Shen et al., 2024; Lu et al., 2024;
Xing et al., 2024). However, if creativity and possibility have been unleashed by such a surge,
implementing pipelines that involve multiple models remains a complex and largely manual (and
often cumbersome) process, particularly when addressing tasks beyond the original design of these
models. This often leads developers to create ad hoc modules to manage these complexities. In
addition, a significant number of models available in the wild are either advanced proof-of-concept
or very specialised ones; see e.g. the more than one million models available on the HuggingFace
platform¹. As a consequence of this abundance, navigating through this jungle to select the
appropriate models for a set of tasks has become very challenging. This complexity arises both in terms
of performance and compatibility. Connecting models’ input and output formats is complex, as the
generated results are often difficult to control (Scholak et al., 2021; Qin et al., 2022). Moreover,
planning and chaining tasks for real-world use-cases present a significant challenge too.
In this study, we present a comprehensive solution to tackle the two aforementioned challenges, i.e.
(I) selecting appropriate models and then (II) planning a set of atomic actions to achieve the
objectives in the end-users’ instructions (i.e., user queries). Our system, HIVE, takes natural language
instructions (potentially involving multi-modal inputs and outputs) and can effectively schedule,
execute and explain plans composed of atomic actions. These plans may involve one or more
models, carefully orchestrated to accomplish the overall task while adhering to user-specific constraints,
such as model size or licensing, to name a few.
¹ 1 016 247 models on https://huggingface.co/models as of October 1st, 2024.
arXiv:2412.12839v1 [cs.AI] 17 Dec 2024
One of our key contributions addresses the first challenge: the lack of a machine-understandable
interface that consolidates comprehensive information about available models. To bridge this gap,
we propose a Capability Knowledge Graph (C-KG), which captures the dimensions along which
models must be sorted for automated planning and execution.
For each model, inter alia, C-KG captures critical details such as supported tasks, performance
metrics from state-of-the-art benchmarks, and minimal code snippets for inference. Additionally,
to enable the planning of complex action sequences with guaranteed explainability, we developed
a novel planning approach which, instead of relying solely on LLM reasoning capabilities, also
employs formal logic to reach its conclusions. To achieve this, we took advantage of PDDL –
a formal language widely used in robotics for defining planning problems (Aeronautiques et al.,
1998), mapping the end-users’ instructions with atomic actions, thereby enabling the conversion
of natural language instructions into a PDDL problem space. This approach allows us to formally
plan before executing the tasks using code snippets from the C-KG. As a result, it has enabled us to
generate detailed reports that provide end-users with fine-grained and reliable explanations.
In the absence of standard publicly available benchmarks for solving real-world tasks, we
introduce MUSE, a new evaluation benchmark of complex queries involving multi-modal inputs and
outputs, to assess our proposed framework. Using MUSE, we reviewed the closest existing
solutions, namely HuggingGPT (Shen et al., 2024) and ControlLLM (Liu et al., 2023c), which only
tackle sub-problems of our broader objectives. The results indicate that HIVE consistently
outperforms these competitors across all benchmark dimensions. HIVE
demonstrates a 30% higher accuracy in task selection and respects user constraints in 100% of cases,
while being more reliable.
2 PRELIMINARIES
Planning Domain Definition Language (PDDL) (Aeronautiques et al., 1998) is a standardised lan-
guage extensively used in the field of artificial intelligence (AI) planning to represent planning do-
mains and problems. PDDL provides a formal syntax and semantics for defining the components of
a planning task, including actions, predicates, objects, and their relationships. It enables the clear
specification of the initial state, goal conditions, and permissible actions within a domain, facilitating
the development and comparison of planning algorithms. In our work, PDDL plays a critical role in
task decomposition and planning. By defining tasks as actions within PDDL domains, we leverage
established planning techniques to generate coherent and feasible plans. The use of PDDL allows us
to formally model complex tasks, ensuring that the system can reason about the preconditions and
effects of actions within a well-defined framework.
Let $D = \{d_1, d_2, \ldots, d_{|D|}\}$ be a set of PDDL domains. Each PDDL domain $d_j \in D$ is associated
with a set of PDDL actions s.t. $a^{d_j} = \{a^{d_j}_1, a^{d_j}_2, \ldots, a^{d_j}_{A_j}\}$, where $j \in [1, |D|]$ and $A_j \in \mathbb{N}$ is the
number of actions included within the PDDL domain $d_j$. Furthermore, let $T$ be a set of different tasks,
consisting of all PDDL actions across the available PDDL domains in $D$, as follows: $T = \bigcup_{j=1}^{|D|} a^{d_j}$.
Finally, we define $M = \{m_1, m_2, \ldots, m_{|M|}\}$ as the set of all models available for completing a set
of different tasks $T$ or combinations thereof.
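As a minimal illustration of these definitions, the sketch below models $D$ as a mapping from domain names to action sets and builds $T$ as their union. The domain and action names are hypothetical placeholders, not taken from the actual PDDL library.

```python
# Hypothetical PDDL domains mapped to their action sets a^{d_j}.
# All names are illustrative assumptions, not HIVE's actual domain files.
domains: dict[str, frozenset[str]] = {
    "audio": frozenset({"automatic_speech_recognition", "text_to_speech"}),
    "image": frozenset({"image_generation", "image_captioning"}),
    "text": frozenset({"question_answering", "summarisation"}),
}


def all_tasks(domains: dict[str, frozenset[str]]) -> frozenset[str]:
    """Return T, the union of the action sets of every domain in D."""
    return frozenset().union(*domains.values())


T = all_tasks(domains)
print(len(domains), len(T))
```

Keeping the action sets immutable makes it cheap to take unions and subsets later, which is how the planning stages are expressed in Section 3.2.1.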
3 HIVE – GENERAL ARCHITECTURE
3.1 CAPABILITY KNOWLEDGE GRAPH
[Figure 1 diagram: a user query and preferences enter the Task Planner (Parsing & Rephrasing, Domain Classification, Action Selection, rule-based merging and reconstruction of the PDDL domain and problem from the PDDL library, Plan); the Plan Executioner then performs Model Selection, Code Snippet Extraction, Argument Mapping and Code Execution for each action in the plan, returning a response and a report. The C-KG is built offline from model cards, Papers With Code and code snippets.]
Figure 1: HIVE modular architecture.
We extract model cards² directly from HuggingFace and incorporate an OpenIE extraction route for
converting the textual descriptions from each model card into a structured representation. We align
the models with Papers With Code³, which enables us to collect information about how a particular
model performs across different benchmarks. Our goal is to build a Capability Knowledge Graph,
denoted as C-KG, which is a graph $G(M, T, E)$, such that each model $m_j \in M$ is associated with
one or more tasks from $T$ along with its relevant performance, and different types of edges in $E$
with various properties (e.g., number of parameters or supported languages). We search each model
card collected from HuggingFace for a potential arXiv⁴ paper ID. We make use of the arXiv paper ID in
order to bridge models from HuggingFace with their corresponding performances in different tasks
and benchmarks. When we identify an arXiv paper ID within the knowledge base of Papers With
Code, we extract all model versions⁵ and their performances across the different benchmarks.
² https://huggingface.co/docs/hub/en/model-cards
³ https://paperswithcode.com/
Each different model version is represented as a separate vertex within the C-KG. A model
vertex $m_j$ in the C-KG is aligned with its model card iff the model’s ID (e.g.,
flan-t5-base⁶) can be matched in both the original model card and the ID of at least one of the
retrieved records from Papers With Code. This enables us to build a knowledge graph in which different models are
associated with different benchmarks and tasks. Apart from the relevant performance scores in the
various benchmarks and the models’ properties extracted from HuggingFace, we extract from the
model cards coding snippets that describe how each model can be loaded and executed. These ex-
ecution snippets enable on-demand loading of a chosen model and execution of its inference step
based on the provided parameters, if selected by model selection pipelines (see Section 3.2.2).
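A minimal sketch of how such a graph and the ID-based alignment could be held in memory is given below. It is an illustration under stated assumptions: the class, the `performs` edge type and the record layout are hypothetical, and only the `execution` edge name comes from the text.

```python
from collections import defaultdict


class CapabilityKG:
    """Tiny in-memory stand-in for the C-KG: typed edges leaving model vertices."""

    def __init__(self):
        # model_id -> list of (edge_type, target, properties)
        self.edges = defaultdict(list)

    def add_edge(self, model_id, edge_type, target, **properties):
        self.edges[model_id].append((edge_type, target, properties))

    def models_for_task(self, task):
        """Return the models that have a (hypothetical) 'performs' edge to `task`."""
        return sorted(
            model
            for model, edges in self.edges.items()
            if any(etype == "performs" and target == task for etype, target, _ in edges)
        )


def align(card_ids, benchmark_records):
    """Keep benchmark records whose model ID also appears among the model cards."""
    known = {card_id.split("/")[-1].lower() for card_id in card_ids}
    return [r for r in benchmark_records if r["model"].lower() in known]


kg = CapabilityKG()
kg.add_edge("google/flan-t5-base", "performs", "text-to-text", benchmark="MMLU")
print(kg.models_for_task("text-to-text"))
```

Matching on the lower-cased last path component is one simple way to realise the "ID can be matched in both sources" condition; the matching rule actually used by the authors is not specified here.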
When processing the HuggingFace model card, we leverage a combination of keywords (see Ap-
pendix C) and regular expressions to identify coding blocks that showcase simplified examples of
loading and executing a particular model $m_j$. After the coding blocks are extracted, we prompt (see
Appendix C) an LLM that is proficient in code generation, such as DeepSeek-Coder⁷, to generate a
suitable Python function invoking $m_j$ and running inference while taking into consideration input
arguments of the originally extracted coding block. The resulting Python function, together with its
signature (returned type, variable types and default values), is finally stored within the C-KG and
connected with its corresponding model using an execution edge in $E$.
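The keyword-plus-regular-expression step can be sketched as follows, assuming that model cards are Markdown documents with fenced code blocks. The keyword list is a hypothetical stand-in for the one in Appendix C, and the subsequent LLM rewriting step is not shown.

```python
import re

# Assumed keywords hinting that a block loads or runs a model; the actual
# keyword list lives in the paper's Appendix C and is not reproduced here.
KEYWORDS = ("from_pretrained", "pipeline(", "generate(")

# Matches Markdown fenced blocks (three backticks), optionally tagged with a language.
FENCE = re.compile(r"`{3}(\w+)?[ \t]*\n(.*?)`{3}", re.DOTALL)


def extract_snippets(model_card: str) -> list[str]:
    """Return Python-looking fenced blocks that mention a loading/inference keyword."""
    snippets = []
    for lang, body in FENCE.findall(model_card):
        if lang and lang.lower() not in ("python", "py"):
            continue  # skip blocks explicitly tagged as bash, json, ...
        if any(keyword in body for keyword in KEYWORDS):
            snippets.append(body.strip())
    return snippets
```

Untagged blocks are kept on purpose, since many model cards omit the language tag; the keyword filter is what discards installation commands and configuration dumps.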
3.2 PLANNING MODEL ACTIONS
Having extracted and systematically structured the information pertinent to models associated with
various tasks, we are now positioned to delineate the specific actions required to accomplish the
objectives in the user query, as depicted in Figure 1.
⁴ https://arxiv.org/
⁵ arXiv papers may report results of different model variants, e.g. performance across different model sizes.
⁶ https://huggingface.co/google/flan-t5-base
⁷ https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Base
3.2.1 TASK PLANNER
Parsing User Query. User queries are often vague and unstructured, making it challenging for
systems to understand the user’s intent accurately. To overcome this, we introduce a parsing-rephrasing
stage that bridges the gap between the ambiguous query and the system’s structured requirements.
This initial step sets the foundation for the subsequent stages, enabling the system to extract relevant
information.
We parse an input user query $q$ into distinct components: instruction ($i$: str), input text ($t$: str),
question ($s$: str), URL ($u$: str), data ($x$: dict), and categories ($g$: list), as follows:
$$P(q) = \{i, t, s, u, x, g\} \tag{1}$$
with $P$ the parsing function. The ability of LLMs to parse user queries into structured formats,
like JSON, has been highlighted across multiple research efforts (Petroni et al., 2019; Wei et al.,
2023). Using prompt engineering with a few-shot setting (Brown, 2020) (see Appendix C), we guide
the LLM to convert an unstructured user input into structured data. Additionally, we ask the LLM
to rewrite the user instruction to enhance clarity, simplifying complex directives and converting
implicit information into explicit statements.
The instruction is crucial as it is used in the later stages to decompose the user query into smaller
parts and determine objectives in the user query, q. By transforming vague queries into well-defined
components, our system becomes more robust to handle diverse and complex user inputs.
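The structured result of Eq. 1 can be sketched as a small data class populated from the LLM’s JSON answer. The field and key names below are assumptions chosen to mirror the six components; the prompt itself (Appendix C) and the LLM call are omitted.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ParsedQuery:
    """Structured form P(q) = {i, t, s, u, x, g} of a user query."""
    instruction: str = ""                            # i
    input_text: str = ""                             # t
    question: str = ""                               # s
    url: str = ""                                    # u
    data: dict = field(default_factory=dict)         # x
    categories: list = field(default_factory=list)   # g


def parse_llm_output(raw: str) -> ParsedQuery:
    """Build a ParsedQuery from an LLM's JSON answer, ignoring unknown keys."""
    payload = json.loads(raw)
    known = {k: payload[k] for k in ParsedQuery.__dataclass_fields__ if k in payload}
    return ParsedQuery(**known)


raw = '{"instruction": "Transcribe the audio", "url": "audio.wav", "extra": 1}'
print(parse_llm_output(raw))
```

Defaulting every component to an empty value keeps later stages simple: they can test for emptiness instead of handling missing keys.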
Task Decomposition. Given the resulting instruction, after processing the input user query $q$, we
proceed to decompose it into smaller, manageable steps to identify a specific plan to attain the
objectives (Wei et al., 2022; Yao et al., 2023b) and understand how each part of the instruction is
associated with achievable goals within the system’s capabilities. We utilise an LLM as a classifier
(Zhang et al., 2024) in a few-shot example setting (Brown, 2020) (see Appendix C) to identify the
relevant domains from the original set of PDDL domains, D.
By providing the LLM with examples of instructions and their associated domains, we guide it to
select the pertinent subset $D^* \subseteq D$ that aligns with the instruction. We prioritise recall in this
classification step to ensure that all potentially relevant domains are considered, minimizing the
risk of missing critical actions required to fulfil the user’s objectives. This approach enhances the
system’s robustness by accounting for a wider range of possible actions.
Once we determine the subset of relevant PDDL domains, $D^*$, we leverage the predefined PDDL
structures of each classified domain. These domain structures are then merged to create a unified
PDDL domain file inclusive of all actions from $D^*$, s.t.
$$a^{D^*} = \bigcup_{\substack{j=1 \\ d_j \in D^*}}^{|D|} a^{d_j}. \tag{2}$$
This method ensures that the combined domain file encompasses all necessary actions while
maintaining consistency and comprehensibility. Next, we exploit the parsed instruction $i$ and the
compiled set of actions from $D^*$, $a^{D^*}$, with an LLM to determine the specific actions required to achieve
the instruction’s objectives (see Appendix C), as follows: $a^{D^*}_i \subseteq a^{D^*}$.
This allows us to precisely map high-level user intents to concrete actions. Following this selection, the
combined PDDL domain file (i.e. $a^{D^*}$) and the identified action set (i.e. $a^{D^*}_i$) help to reconstruct
the corresponding PDDL problem. Finally, a Best First Width Search (Lipovetzky & Geffner, 2017)
logical reasoner computes a detailed Plan of Actions ordering the actions in $a^{D^*}_i$ such that the system
can execute them step-by-step, ensuring coherent execution in line with the user’s intent.
This hierarchical approach not only enhances the system’s robustness in parsing and understand-
ing diverse and complex user inputs but also guarantees the accuracy and feasibility of generated
actionable plans by rigorously structuring them within established domain constraints.
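The merge-then-plan idea can be illustrated with a toy forward planner. This is a sketch under strong assumptions: it uses plain breadth-first search over add-only effects rather than Best First Width Search, it works on Python sets instead of PDDL files, and every domain, action and predicate name is hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset
    effects: frozenset  # add effects only; delete effects are omitted for brevity


def merge_domains(domains: dict, selected: set) -> list:
    """Rule-based merge: collect the actions of every selected domain (cf. Eq. 2)."""
    return [action for name in sorted(selected) for action in domains[name]]


def plan(initial: frozenset, goal: frozenset, actions: list):
    """Breadth-first forward search; returns a shortest list of action names or None."""
    queue = deque([(initial, [])])
    seen = {initial}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for action in actions:
            if action.preconditions <= state:
                nxt = state | action.effects
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [action.name]))
    return None


# Hypothetical domains and predicates, for illustration only.
domains = {
    "audio": [Action("transcribe", frozenset({"has_audio"}), frozenset({"has_text"}))],
    "text": [Action("find_entities", frozenset({"has_text"}), frozenset({"has_entities"}))],
    "image": [Action("generate_image", frozenset({"has_text"}), frozenset({"has_image"}))],
}
actions = merge_domains(domains, {"audio", "text"})
print(plan(frozenset({"has_audio"}), frozenset({"has_entities"}), actions))
# prints ['transcribe', 'find_entities']
```

Because the plan is derived from explicit preconditions and effects, each step can be justified by the predicates it consumes and produces, which is the property the explanation reports rely on.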
3.2.2 MODEL SELECTION
Following the task planning, the next stage is the selection of appropriate models capable of
performing the specified actions in the plan. Utilising the information from our C-KG (see
Section 3.1), our goal is to identify a model combination $M^\star \subseteq M$ that would satisfy a set of conditions
$C = \{c_1, c_2, \ldots, c_{|C|}\}$ imposed by the user while offering guarantees that the selected model
combination will be suitable to generate an output $y$ that addresses the original input query $q$, s.t.
$$M^\star = \arg\max_{M} \, p(y \mid q, M, C) \tag{3}$$
For instance, if licensing is a concern, we filter out models that do not meet the required licensing
terms. Similarly, if there are limitations on computational resources, we prioritise models that are
efficient in size and cost. Additionally, we consider performance scores from relevant benchmarks
to guide our selection. By analyzing these performance metrics, we can choose models that have
demonstrated high effectiveness on tasks similar to those required, ensuring that the selected models
are not only compliant with user constraints but also optimal for the specific tasks at hand.
By balancing these factors: model capabilities, licensing requirements, resource constraints, and
performance metrics, we systematically select the most suitable model for each action. This ensures
that the execution of the plan is aligned with both the technical requirements of the tasks and the
practical considerations of the user. Ultimately, this careful selection enhances the system’s effi-
ciency and effectiveness, enabling it to perform complex tasks while adhering to user constraints.
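One way to read this selection step is as hard filtering on user constraints followed by ranking on benchmark scores. The sketch below follows that reading; the record fields, model IDs, sizes and scores are all hypothetical placeholders rather than C-KG contents.

```python
from dataclasses import dataclass


@dataclass
class ModelRecord:
    model_id: str
    task: str
    license: str
    params_b: float  # parameter count in billions
    score: float     # benchmark score for the task (higher is better)


def select_model(records, task, allowed_licenses=None, max_params_b=None):
    """Filter by hard user constraints, then pick the best-scoring survivor."""
    candidates = [
        r for r in records
        if r.task == task
        and (allowed_licenses is None or r.license in allowed_licenses)
        and (max_params_b is None or r.params_b <= max_params_b)
    ]
    # Returning None makes an unsatisfiable constraint explicit to the caller.
    return max(candidates, key=lambda r: r.score, default=None)


# Placeholder records; IDs, sizes and scores are invented for illustration.
records = [
    ModelRecord("example/asr-large", "asr", "Apache-2.0", 1.5, 0.90),
    ModelRecord("example/asr-small", "asr", "CC-BY-4.0", 0.3, 0.80),
]
print(select_model(records, "asr", max_params_b=1.0).model_id)  # example/asr-small
```

Treating constraints as filters rather than soft penalties means a constraint is never traded away for a better score, and an empty candidate set is reported instead of silently falling back to a non-compliant model.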
3.2.3 PLAN EXECUTION
With the appropriate models selected for each action, we proceed to the plan execution phase. The
execution involves retrieving the Python code snippets that are stored in the C-KG and linked to the
chosen models through the execution edge. More specifically, we map the arguments of the
Python functions to the relevant components extracted from the parsed user query, $\{i, t, s, u, x, g\}$
(cf. Eq. 1), using a complex similarity mapping algorithm. This mapping ensures that the models
receive the correct inputs derived from the user’s query, accurately capturing the user’s intent. In
cases where the Python code snippet for the selected model is not available, we employ a fallback
strategy. We search for code snippets from other models that have been assigned to the same task
and possess similar functionalities. This approach leverages the semantic and functional similarities
between models within the same domain, allowing us to substitute models when necessary without
compromising the action’s intended outcome. By orchestrating the retrieval of code snippets and
the fine-grained mapping of arguments, our system seamlessly transforms high-level plans into exe-
cutable code, ensuring that the models operate on the intended data and parameters. This execution
phase is critical as it bridges the gap between planning and action, which guarantees that each step
of the plan is performed correctly and efficiently.
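The argument-mapping idea can be illustrated with a deliberately simple stand-in that matches parameter names to parsed components by string similarity. This is not the authors’ mapping algorithm, whose details are not given here; the threshold, the component keys and the `transcribe` function are assumptions.

```python
import inspect
from difflib import SequenceMatcher


def map_arguments(func, parsed: dict, threshold: float = 0.4) -> dict:
    """Bind each parameter of `func` to the most similarly named parsed component."""
    kwargs = {}
    for name, param in inspect.signature(func).parameters.items():
        best_key, best_score = None, 0.0
        for key in parsed:
            score = SequenceMatcher(None, name.lower(), key.lower()).ratio()
            if score > best_score:
                best_key, best_score = key, score
        if best_key is not None and best_score >= threshold and parsed[best_key]:
            kwargs[name] = parsed[best_key]
        elif param.default is inspect.Parameter.empty:
            # A required argument without a plausible source cannot be executed.
            raise ValueError(f"no parsed component matches required argument {name!r}")
    return kwargs


def transcribe(audio_url: str, language: str = "en") -> str:
    """Hypothetical stand-in for a code snippet stored in the C-KG."""
    return f"transcript of {audio_url} ({language})"


parsed = {"instruction": "Transcribe the audio", "url": "audio.wav", "input_text": ""}
print(transcribe(**map_arguments(transcribe, parsed)))
```

Parameters with defaults are left untouched when no component matches well enough, so the snippet’s own defaults remain in effect.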
4 EXPERIMENTS
4.1 BASELINES
Addressing complex, multi-modal real-world tasks presents substantial challenges, and current so-
lutions are limited. For our experiments, we compare our proposed method against the most relevant
state-of-the-art techniques: HuggingGPT (Shen et al., 2024) and ControlLLM (Liu et al., 2023c).
These represent significant advancements in integrating LLMs with task planning and execution
frameworks.
We propose two innovative methods: HIVE and HIVE light. HIVE leverages the advanced
capabilities of ChatGPT for parsing user queries and decomposing tasks. To address the computational
challenges associated with ChatGPT-based systems, we designed HIVE light as an efficient
alternative that can be deployed on local servers. HIVE light employs InternLM2.5-7B-chat⁸ (Cai et al., 2024)
for parsing user queries and Mistral-7B-Instruct-v0.3⁹ for task decomposition, both of which have
been subjected to 8-bit quantization. We selected a chat-oriented model for parsing, as
conversational models excel at understanding subtle queries. Additionally, an instruction fine-tuned model
was chosen for task decomposition due to its ability to follow instructions precisely. By
employing this dual-method setup, we ensure a thorough performance evaluation, positioning HIVE
and HIVE light as strong competitors to existing state-of-the-art frameworks.
⁸ https://huggingface.co/internlm/internlm2_5-7b-chat
⁹ https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
Table 1: Comparison of Task Selection, Flow of Tasks, and Output across all competitors.
Query Types | HuggingGPT       | ControlLLM       | HIVE light       | HIVE
            | TS   FoT  O      | TS   FoT  O      | TS   FoT  O      | TS   FoT  O
Single Task | 0.47 0.47 0.83   | 0.74 0.74 0.74   | 0.80 0.80 0.72   | 0.88 0.88 0.79
Two Tasks   | 0.64 0.55 0.44   | 0.33 0.33 0.38   | 0.67 0.62 0.52   | 0.71 0.69 0.58
Three Tasks | 0.42 0.42 0.30   | 0.36 0.36 0.33   | 0.57 0.43 0.33   | 0.67 0.67 0.46
Overall     | 0.57 0.51 0.53   | 0.43 0.43 0.47   | 0.69 0.64 0.55   | 0.74 0.73 0.62
4.2 MUSE – MULTI-MODAL SUB-TASK EXECUTION BENCHMARK
In the absence of standard publicly available benchmarks for solving real-world tasks, and recog-
nizing that HuggingGPT (Shen et al., 2024) did not release their evaluation dataset, we developed
a new benchmark to assess our proposed framework alongside state-of-the-art methods like Hug-
gingGPT (Shen et al., 2024) and ControlLLM (Liu et al., 2023c). Although ControlLLM (Liu et al.,
2023c) released their benchmark, it utilises a fine-tuned task decomposer, which could introduce
bias if used for our evaluation. Therefore, to ensure a fair and unbiased comparison, we collaborated
with experts from diverse linguistic backgrounds to create a set of 100 heterogeneous, real-world
user queries (see Appendix Table 5). These queries are categorised into three types: Single-task,
Two-task, and Three-task queries. In order to facilitate a fair comparison, we included only those
tasks and models that are supported by all three systems. The benchmark is designed to cover
various task domains, such as automatic speech recognition, question answering, and image generation,
involving 15 models across different modalities¹⁰. This comprehensive benchmark enables us to
rigorously evaluate the performance and generalisability of our framework.
Metrics. To assess our framework against state-of-the-art methods, we evaluate performance on
three fronts, using binary metrics for simplicity and clarity:
• Task Selection (TS): Determines whether the system accurately identifies the required tasks from
the user’s query. We assign a binary score: 1 if the system selects all the tasks correctly, and 0 if it
misses a task or selects irrelevant tasks. Correct task selection is crucial as it lays the foundation
for successful execution and directly impacts the relevance of the final output.
• Flow of Thought (FoT): Evaluates the logical sequence and integration of the selected tasks.
A binary score is given based on whether the system establishes the correct flow: 1 for a proper
flow that respects task dependencies and order, and 0 for an incorrect sequence. This ensures that,
especially in multi-task queries involving two or three tasks, the system processes tasks in an order
that leads to the desired outcome.
• Final Output (O): Assesses the correctness of the system’s final response to the user’s query. We
adopt a binary evaluation: 1 if the output fulfils the user’s requirements, and 0 if it falls short.
This includes evaluating the relevance of generated content and the overall satisfaction of the
user’s intent. Note that we do not evaluate the quality of the output; that is, we do not judge whether the
output is accurate but only whether the expected task has been performed.
By employing these binary metrics for each query, we simplify the evaluation process while effec-
tively capturing the essential aspects of each system’s performance.
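The first two binary metrics can be sketched as below, under the simplifying assumption that each query has a single reference list of tasks in a single valid order (queries admitting several valid orders would need a richer reference). The Final Output score is judged per query and is not computed here.

```python
def task_selection(predicted: list, expected: list) -> int:
    """1 only if exactly the required tasks were selected (none missing, none extra)."""
    return int(set(predicted) == set(expected))


def flow_of_thought(predicted: list, expected: list) -> int:
    """1 only if the selected tasks appear in the expected order."""
    return int(list(predicted) == list(expected))


def mean_score(scores: list) -> float:
    """Average of binary scores over a set of queries (0.0 for an empty set)."""
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical task labels for a two-task query.
expected = ["asr", "ner"]
print(task_selection(["ner", "asr"], expected), flow_of_thought(["ner", "asr"], expected))
# prints: 1 0
```

Averaging these per-query scores over a query type yields values on the same 0 to 1 scale as those reported in Table 1.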
4.3 OVERALL RESULTS
The results of the experiment are presented in Table 1. The scores highlight the effectiveness of
our proposed approach in handling complex multi-modal tasks. HIVE consistently outperforms
the baseline methods in overall performance across all evaluation metrics: TS, FoT, and O. This
performance is evident in both single-task and multi-task scenarios. Remarkably, HIVE light, which
is based on 8-bit quantised models with only 7 billion parameters, surpasses both HuggingGPT and
ControlLLM in the overall evaluation. This underscores the effectiveness of our approach even with
reduced computational resources.
¹⁰ MUSE involves 10 domains leading to 10 distinct PDDL domain files covering 15 tasks. 67% of the queries
are multi-modal [text, image, audio], see also Appendix A and the supplementary material.
Table 2: Cross-Modality Performances.
Input → Output | HuggingGPT | ControlLLM | HIVE light
Image → Text   | 0.48       | 0.45       | 0.52
Audio → Text   | 0.25       | 0.25       | 0.67
Text → Image   | 0.36       | 0.18       | 0.75
Audio → Image  | 0.50       | 0.25       | 1.00
Text → Audio   | 0.33       | 0.14       | 0.71
Image → Audio  | 0.80       | 0.00       | 0.00
[Figure 2: horizontal bar chart, x-axis in seconds (ticks 0, 5, 10), one group of bars per query type. Bar labels are listed below, with rows following the legend order.]
System     | Single | Two   | Three | Overall
HuggingGPT | 4.05   | 4.31  | 6.59  | 4.52
ControlLLM | 10.97  | 12.14 | 11.13 | 11.71
HIVE light | 3.2    | 4.7   | 6.06  | 4.52
HIVE       | 4.28   | 5.02  | 6.4   | 5.05
Figure 2: Average times (s) before execution.
In the case of single-task queries, all methods perform relatively well in generating correct final
outputs. However, HIVE stands out by achieving the highest performance in Task Selection and
Flow of Thought, indicating a more accurate and coherent understanding and reasoning process.
Conversely, HuggingGPT attains the highest Final Output performance but performs poorly in Task
Selection and Flow of Thought metrics. This discrepancy arises because HuggingGPT tends to
collect as many relevant tasks as possible, even if a query requires only a single task, leading to
over-selection. Despite this over-selection, its strong ChatGPT backbone enables it to produce
high-quality final outputs. Nevertheless, this approach may reduce the trustworthiness of the results, as it
does not align precisely with the intended tasks (further discussed in Section 4.4.2). When it comes
to multi-task queries, as expected, HIVE distinctly outperforms its competitors across all metrics.
PDDL planning in HIVE light allows it to surpass HuggingGPT and ControlLLM, demonstrating
proficiency in task selection, task flow management and execution. In contrast, ControlLLM
struggles considerably with multi-task queries, exhibiting the weakest performance among the evaluated
methods. While HIVE, HIVE light, and HuggingGPT employ prompt-based strategies to provide
flexibility and adaptability, ControlLLM relies on a fine-tuned task decomposer. This approach limits its
capacity to generalise to queries that deviate even slightly from its training set, leading to significant
performance declines in multi-task scenarios.
4.4 DISCUSSIONS
4.4.1 CROSS-MODALITY PERFORMANCES
To gain a better understanding of the multi-modality of these systems’ capabilities, we dig deeper
into the Final Output (O) results from Table 1. We divide this investigation into three distinct parts
based on the output modality and analyze the performance when the other two modalities are in-
volved in the input (Table 2).
When it comes to text output, HIVE light leads its competitors when the
input includes image or audio (0.52 against 0.48 at best for image inputs, and 0.67 against 0.25 for
audio inputs). This showcases HIVE light’s ability to integrate visual and auditory
data to enhance text outputs. In the context of image output, HIVE light once again outperforms
the other systems, illustrating its proficiency in converting textual and audio inputs into coherent
visual responses. Lastly, although our system shows commendable performance in text-based audio
generation, it falls short of achieving the desired objective in the image-to-audio scenario, indicating
an area for potential improvement in future iterations.
4.4.2 TRUSTWORTHINESS
In order to review the connection between justifications (i.e. the conjunction¹¹ of TS and FoT)
and outputs (O), we group in Figure 3 the results based on (justification, output) scores, which can
respectively be correct ⊤ or incorrect ⊥. The failure cases Err are further discussed in Section 4.4.3.
There are thereby four distinct cases. First, ⊤⊤ corresponds to fully correct cases, where
both justifications and outputs are correct; here the HIVE solutions (50+) outperform both HuggingGPT (25) and
ControlLLM (33). Then, ⊤⊥ means that the plans were correct but the execution did not go through.
In this category the four reviewed systems perform similarly, ranging from 6 to 15 cases. On the
right-hand side of Figure 3, the plain-wrong case ⊥⊥ shows ControlLLM as the “worst” system
(see results in Section 4.3), having the highest count. Finally, the critical ⊥⊤ case indicates a lack
of trustworthiness across all the baseline systems that may result in misleading outcomes. This case
involves an incorrect plan or justification, despite the output being correct. Such cases have been
recorded 9 times for ControlLLM and 13 times for HuggingGPT, while being absent from HIVE and
occurring only once for HIVE light, showcasing the reliability of the results produced by HIVE.
Overall, this discussion allowed us to highlight two aspects. First, unlike other solutions,
HuggingGPT exhibits results ranging from 13 to 25 in the four categories, meaning that it is hard to rely
on it. Second, HIVE(s) tend not to fall into incoherent cases where results are correct without the
corresponding plan being correct; in other words, when results are good, their explanations can be
trusted as well.
¹¹ ⊤ iff TS AND FoT are both 1; ⊥ in all other cases.
[Figure 3: grouped bar chart of the number of queries (y-axis ticks 0, 20, 40) per (justification, output) case ⊤⊤, ⊤⊥, ⊥⊤, ⊥⊥ for HuggingGPT, ControlLLM, HIVE light and HIVE.]
Figure 3: Correlations between justifications (TS AND FoT) and outputs (O).
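The grouping by (justification, output) pairs can be sketched as follows, assuming each query is summarised by its three binary scores (TS, FoT, O). The function names and the sample scores are illustrative, and failed runs (Err) are not modelled.

```python
from collections import Counter


def justification(ts: int, fot: int) -> bool:
    """True (⊤) iff both Task Selection and Flow of Thought are 1."""
    return ts == 1 and fot == 1


def group_results(results):
    """Count queries per (justification, output) pair of truth values."""
    return Counter((justification(ts, fot), o == 1) for ts, fot, o in results)


# Invented scores for four queries, as (TS, FoT, O) triples.
results = [(1, 1, 1), (1, 0, 1), (0, 0, 0), (1, 1, 0)]
counts = group_results(results)
print(counts[(False, True)])  # number of ⊥⊤ cases: correct output, wrong justification
```

The ⊥⊤ bucket is the one to watch: a non-zero count means that some correct outputs were produced without a correct plan behind them.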
4.4.3 ROBUSTNESS
Since most of MUSE’s queries require multiple models to interact together in a compatible manner,
we noticed that sometimes the tested systems fail at dealing with either the justification or the output
parts. In Table 3, we list the different cases encountered. The first point to be highlighted is that,
among the four systems, HuggingGPT is the least robust one by far: 22 Err against 10, 7 and 6
for the other solutions. Second, and even more critical, HuggingGPT, unlike the competition, is
able to generate correct results (⊤) while failing (Err) in its plan construction. This reinforces
the observation that its justifications cannot be trusted, as the executed actions tend in many cases not to be
consistent with the compiled plan, relying on GPT-3.5 in most places. This last finding is consistent with
the ⊥⊤ discussion in Section 4.4.2.
4.4.4 LATENCY FOR PLANNING
Lastly, we analyse the time (in seconds) the systems need to come up with a plan and select
suitable models. As the chosen models may differ, and since no rule such as “the quicker the
better” was enforced (see Section 4.5 for a discussion of model selection capabilities), we
measure latencies up to the model selection stage.
Figure 2 presents these latencies following the split introduced in Section 4.3, i.e. by the
number of tasks involved in the queries. First, HuggingGPT and HIVE(s) share the same order
of magnitude, whereas ControlLLM is an order of magnitude slower (always 10+ seconds). Next,
as expected, the more tasks a query involves, the longer the systems take to reach an execution
plan. Interestingly, HuggingGPT's scaling does not appear to be linear, as the slope increases
sharply between two- and three-task queries. This behaviour is consistent with HuggingGPT's
internal implementation: whereas the other systems are rule-based (see the C-KG used by HIVE(s)
to select models), HuggingGPT must prompt an LLM (with the model descriptions included) to
select which models to use for each task.
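The measurement boundary described above can be sketched as follows. This is only an illustration of what is timed (planning plus model selection, excluding model execution); `planning_latency`, `toy_plan` and `toy_select` are hypothetical names, and the two toy functions merely stand in for a real planner and selector.

```python
import time
from typing import Any, Callable, List, Tuple


def planning_latency(
    query: str,
    plan_fn: Callable[[str], List[str]],
    select_fn: Callable[[List[str]], Any],
) -> Tuple[float, Any]:
    """Time planning plus model selection; model execution is deliberately excluded."""
    start = time.perf_counter()
    tasks = plan_fn(query)       # build the plan (here: a list of task names)
    models = select_fn(tasks)    # map each planned task to a model
    elapsed = time.perf_counter() - start
    return elapsed, models


# Hypothetical stand-ins for a real planner and a real model selector.
def toy_plan(query: str) -> List[str]:
    return ["asr", "ner"]


def toy_select(tasks: List[str]) -> dict:
    return {task: f"model-for-{task}" for task in tasks}


elapsed, models = planning_latency(
    "Transcribe the audio and find entity tokens", toy_plan, toy_select
)
print(f"{elapsed:.6f}s", models)
```

Stopping the clock before execution keeps the comparison independent of which models each system happens to pick.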
Table 3: Enumeration of failing cases (Err), either as justifications (TS and FoT) or outputs (O).
Failure Type   HuggingGPT   ControlLLM   HIVE light   HIVE
Err, ⊤         8            0            0            0
⊤, Err         3            1            3            3
⊥, Err         3            1            1            0
Err, Err       8            8            3            3
Overall        22           10           7            6
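As a quick consistency check, the Overall row of Table 3 can be re-derived by summing the four failure categories per system. The sketch below only restates the counts from the table; the dictionary layout and the ASCII labels ("T" for ⊤, "F" for ⊥) are our own notation.

```python
# Failure counts per system, copied from Table 3.
# Keys are (justification, output) outcomes: "Err" = failure, "T" = ⊤, "F" = ⊥.
SYSTEMS = ["HuggingGPT", "ControlLLM", "HIVE light", "HIVE"]
FAILURES = {
    ("Err", "T"):   [8, 0, 0, 0],
    ("T", "Err"):   [3, 1, 3, 3],
    ("F", "Err"):   [3, 1, 1, 0],
    ("Err", "Err"): [8, 8, 3, 3],
}


def overall(failures):
    """Sum every failure category column-wise, giving one total per system."""
    return [sum(column) for column in zip(*failures.values())]


totals = dict(zip(SYSTEMS, overall(FAILURES)))
print(totals)
```

The totals match the Overall row (22, 10, 7 and 6), and the first row makes visible that only HuggingGPT produces correct outputs from a failed justification.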
4.5 TAKING INTO ACCOUNT USERS' CONSTRAINTS IN TERMS OF MODEL SELECTION
As described in Section 3.2.2, once a plan of actions is established, HIVE selects the best
models to carry it out. Naturally, what counts as “best” depends on the circumstances: when
resources are scarce, one may decide to use the smallest possible models even if the resulting
quality is reduced; alternatively, users may choose to select models based on their (recorded)
results on specific benchmarks. To accommodate these cases, HIVE allows users to specify
selection criteria. In this section¹², we compare the capabilities of HIVE and HuggingGPT when
users impose conditions of their own on model selection. Since ControlLLM has one-to-one
mappings of models to tasks, it is de facto excluded.
In practice, we use the following query¹³: “Transcribe the audio from .audio 1.wav
and find entity tokens”. Regarding the task-model mappings, we give both HIVE and
HuggingGPT access to openai/whisper-large-v2 and nvidia/parakeet-rnnt-1.1b (licensed under
Apache-2.0 and CC-BY-4.0, respectively) for ASR, and to dslim/bert-base-NER (MIT license)
for NER. We first run the query without any constraints (control run, see Appendix B):
both systems, HIVE and HuggingGPT, were able to transcribe the audio file and perform NER
(although HuggingGPT's NER result set was empty). We then applied the following model
selection constraints sequentially:
1. License restrictions (only use Openrail++ and Deepseek): HIVE returned nothing, which was
the expected behaviour since none of the available models carries a requested license;
HuggingGPT, on the other hand, performed the task as in the control run, thereby infringing
the restrictions (see Appendix B).
2. Use the “smallest possible” model¹⁴: HIVE complied with the user's choice and used the
smaller models, whereas HuggingGPT kept using openai/whisper-large-v2 as in the control run
(see Appendix B).
3. Filter for the model with the best results on the speech recognition on common
voice english¹⁵ benchmark: using the benchmark records from the C-KG, HIVE was able to
select the correct model, unlike HuggingGPT, which chose the same models as in the control
run (see Appendix B).
Overall, HIVE answered each time while properly taking the given constraints into account,
whereas HuggingGPT failed every time, even misleading users with its justifications (see
Section 4.4.2 for further discussion).
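To make the three constraints concrete, the following sketch filters and ranks a small catalogue of candidate models. It is not HIVE's implementation: the `Model` record and the `select` function are hypothetical, and the size and word-error-rate values are placeholders chosen arbitrarily for illustration (only the model names and licenses come from the text above).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Model:
    name: str
    task: str
    license: str
    size_gb: float               # PLACEHOLDER, not a measured disk footprint
    wer: Optional[float] = None  # PLACEHOLDER benchmark score (lower is better)


CATALOGUE: List[Model] = [
    Model("openai/whisper-large-v2", "asr", "apache-2.0", size_gb=2.0, wer=0.2),
    Model("nvidia/parakeet-rnnt-1.1b", "asr", "cc-by-4.0", size_gb=1.0, wer=0.1),
    Model("dslim/bert-base-NER", "ner", "mit", size_gb=0.5),
]


def select(
    task: str,
    catalogue: Iterable[Model],
    allowed_licenses: Optional[Iterable[str]] = None,
    score: Optional[Callable[[Model], float]] = None,
) -> Optional[Model]:
    """Return the lowest-scoring model for `task`, or None if no model qualifies."""
    candidates = [m for m in catalogue if m.task == task]
    if allowed_licenses is not None:
        allowed = {lic.lower() for lic in allowed_licenses}
        candidates = [m for m in candidates if m.license.lower() in allowed]
    if not candidates:
        return None              # refuse rather than silently ignore the constraint
    if score is None:
        return candidates[0]
    return min(candidates, key=score)


# 1. License restriction that no available model satisfies.
print(select("asr", CATALOGUE, allowed_licenses=["OpenRAIL++", "DeepSeek"]))
# 2. Smallest model, using the (placeholder) footprint as the score.
print(select("asr", CATALOGUE, score=lambda m: m.size_gb).name)
# 3. Best (placeholder) benchmark result for the task.
print(select("asr", CATALOGUE, score=lambda m: m.wer).name)
```

Returning `None` when no candidate survives the filter mirrors the expected behaviour in the first scenario, where answering anyway would violate the user's restriction.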
¹² See also Appendix B for an extensive result description.
¹³ A multimodal one involving two tasks: ASR and NER.
¹⁴ We use the model disk footprint as a proxy for its size.
¹⁵ Introduced in “Common Voice: A Massively-Multilingual Speech Corpus” (Ardila et al., 2019).
5 RELATED WORK
Automated Planning. The cognitive ability to organize and coordinate actions toward a specific
goal is referred to as planning. While humans innately possess this capacity, machines lack such
a capability. Automated planning has garnered significant interest from researchers across
various domains, including robotics (Guo et al., 2023), autonomous vehicles (Madridano et al.,
2021), and dialogue systems (Wang et al., 2023a). The methodologies employed to devise sequences
of actions have evolved considerably, particularly in light of recent breakthroughs in deep
learning.
Before the advent of large language models (LLMs), planning frameworks such as STRIPS (Fikes
& Nilsson, 1971) or HTN (Erol et al., 1994) were developed to decompose tasks into a series of
actions (or sub-tasks) leading to the desired outcomes (Sacerdoti, 1975). Building upon these
frameworks, the Planning Domain Definition Language (PDDL) (Aeronautiques et al., 1998) emerged
as a widely adopted standardized language for defining planning problems and domains. However,
LLMs have since superseded those frameworks, standing as planners in their own right (i.e. the
LLM-as-planner paradigm). Multiple prompt engineering techniques (Liu et al., 2023b; Graux et
al., 2024) were designed to leverage in-context learning, aiming to directly generate multi-step
problem solutions. More specifically, Chain-of-Thought (Wei et al., 2022) revealed the promising
reasoning capabilities of LLMs, and new techniques followed, such as the self-consistency
decoding strategy (Wang et al., 2022), Tree-of-Thought (Yao et al., 2023a), Program-of-Thought
(Chen et al., 2023) and Graph-of-Thought (Yao et al., 2023c; Besta et al., 2024). Nevertheless,
LLMs still struggle to produce acceptable and logical plans, especially as the complexity of the
problem increases (Valmeekam et al., 2023; Xiao et al., 2024; Zheng et al., 2024). Numerous
initiatives have therefore sought to integrate problem-specific languages like PDDL alongside
LLMs to maximize their effectiveness and leverage their full potential (Pallagani et al., 2023;
Liu et al., 2023a; Oswald et al., 2024).
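The classical formulation behind STRIPS and PDDL can be illustrated in a few lines: each action has preconditions, an add list and a delete list, and a planner searches for a sequence of actions that reaches the goal. The sketch below is a generic breadth-first forward search, not HIVE's planner; the action names and facts (`have_audio`, `have_text`, and so on) are hypothetical and loosely mirror an ASR-then-NER query.

```python
from collections import deque
from typing import FrozenSet, Iterable, List, NamedTuple, Optional


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]     # facts that must hold before the action applies
    add: FrozenSet[str]     # facts made true by the action
    delete: FrozenSet[str]  # facts made false by the action


def plan(
    init: Iterable[str], goal: Iterable[str], actions: List[Action]
) -> Optional[List[str]]:
    """Breadth-first forward search; returns a shortest plan, or None if unreachable."""
    start = frozenset(init)
    target = frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if target <= state:
            return steps
        for action in actions:
            if action.pre <= state:
                nxt = (state - action.delete) | action.add
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [action.name]))
    return None


# Hypothetical actions: each consumes one modality and produces another.
ACTIONS = [
    Action("asr", frozenset({"have_audio"}), frozenset({"have_text"}), frozenset()),
    Action("ner", frozenset({"have_text"}), frozenset({"have_entities"}), frozenset()),
    Action("tts", frozenset({"have_text"}), frozenset({"have_speech"}), frozenset()),
]

print(plan({"have_audio"}, {"have_entities"}, ACTIONS))
```

Because every step is an explicit action with checked preconditions, the resulting plan is inspectable, and an unreachable goal is reported as `None` instead of yielding an unsupported answer.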
LLM-as-Agent. The genesis of large language models (LLMs) primarily stemmed from textual
content, which initially narrowed the research focus to text generation. However, to address the
diversity of real-world scenarios, significant efforts have been directed toward developing
vision or speech LLMs, thereby aligning with a multi-modal paradigm (Zhu et al., 2023; Wu et al.,
2023; Wang et al., 2023b). Additionally, to expand the capabilities of LLMs, there has been an
increasing trend to integrate external tools with LLMs. Toolformer (Schick et al., 2024)
pioneered the invocation of tool calls within generated sequences via special tokens, giving
rise to tool-augmented LLMs (Qin et al., 2023a;b; Guo et al., 2024; Qu et al., 2024). Then, ReAct
(Yao et al., 2023b) introduced such intermediate tool calls during the reasoning process by
incorporating intermediate outcomes within the prompt to better guide the final resolution of
the problem. In contrast to ReAct, Reflexion (Shinn et al., 2023) adds verbal feedback on those
intermediate results to further assess and verify outcomes. In the meantime, a plethora of
fine-tuned LLMs tailored for specific tasks has become ubiquitous on platforms such as Hugging
Face Hub (Wolf et al., 2019), alongside proprietary models such as GPT-4 (Achiam et al., 2023),
Claude (Anthropic, 2023), and Gemini (Team et al., 2023), offering the opportunity to consider
these LLMs as distinct agents. Indeed, the gathering of technical details for each parametric
model stands as a critical component in the reporting and tracking efforts underlined by the use
of Model Cards (Mitchell et al., 2019). HuggingGPT (Shen et al., 2024) leverages such a large
pool of LLMs using ChatGPT as the core controller. Following a similar approach, ControlLLM (Liu
et al., 2023c) and Chameleon (Lu et al., 2024) explore task planning via prompt engineering and
integrate a more diverse pool of tools. While HuggingGPT, ControlLLM and Chameleon appoint
appropriate models for each sub-task, their model selection process remains sub-optimal, as they
do not identify the most accurate model. Thus, even if these frameworks can fulfil their plans,
the resulting performance may be unsatisfactory when the best agent is not used. To the best of
our knowledge, our work represents the first attempt to address this gap.
6 CONCLUSION
Our research introduces HIVE, an innovative and comprehensive solution designed to navigate the
complexities of model selection and task planning using a diverse set of deep learning models.
By leveraging a Capability Knowledge Graph and an LLM-based formal logic planner, we transcend
the limitations of existing systems. HIVE stands out for its capability to plan and explain
complex action chains while respecting user-specific constraints, thereby achieving both high
performance and full transparency. Empirical evaluations on our newly designed benchmark reveal
HIVE's superior performance, consistently outperforming competing platforms such as HuggingGPT
and ControlLLM. This underscores HIVE's potential to redefine the state of the art in task
selection and planning, ultimately facilitating more efficient and user-friendly applications of
advanced deep models. HIVE thus advances the handling of multi-modal tasks.
REFERENCES
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale-
man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical
report. arXiv preprint arXiv:2303.08774 , 2023.
Constructions Aeronautiques, Adele Howe, Craig Knoblock, ISI Drew McDermott, Ashwin Ram,
Manuela Veloso, Daniel Weld, David Wilkins Sri, Anthony Barrett, Dave Christianson, et al.
Pddl— the planning domain definition language. Technical Report, Tech. Rep. , 1998.
Anthropic. Claude (oct 8 version). Accessed: 2023-10-08, 2023. URL https://www.anthropic.com/.
Large language model.
Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li,
Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, and Furu Wei. SpeechT5: Unified-modal encoder-
decoder pre-training for spoken language processing. In Proceedings of the 60th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers) , pp. 5723–5738, May
2022.
Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer,
Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. Common voice: A
massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670 , 2019.
Brett A. Becker, Paul Denny, James Finnie-Ansley, Andrew Luxton-Reilly, James Prather, and
Eddie Antonio Santos. Programming is hard - or at least it used to be: Educational opportunities
and challenges of ai code generation. In Proceedings of the 54th ACM Technical Symposium
on Computer Science Education V. 1, SIGCSE 2023, pp. 500–506, New York, NY, USA, 2023.
Association for Computing Machinery. ISBN 9781450394314. doi: 10.1145/3545945.3569759.
URL https://doi.org/10.1145/3545945.3569759.
Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gian-
inazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of
thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI
Conference on Artificial Intelligence , volume 38, pp. 17682–17690, 2024.
Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 , 2020.
Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui
Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye
Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting
Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li,
Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun
Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li Ma, Runyuan Ma, Zerun Ma, Wenchang
Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, Fukai Shang, Yunfan Shao, Demin Song, Zifan Song,
Zhihao Sui, Peng Sun, Yu Sun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang,
Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong,
Chao Xu, Ruiliang Xu, Hang Yan, Yirong Yan, Xiaogui Yang, Haochen Ye, Huaiyuan Ying, Jia
Yu, Jing Yu, Yuhang Zang, Chuyu Zhang, Li Zhang, Pan Zhang, Peng Zhang, Ruijie Zhang, Shuo
Zhang, Songyang Zhang, Wenjian Zhang, Wenwei Zhang, Xingcheng Zhang, Xinyue Zhang, Hui
Zhao, Qian Zhao, Xiaomeng Zhao, Fengzhe Zhou, Zaida Zhou, Jingming Zhuo, Yicheng Zou,
Xipeng Qiu, Yu Qiao, and Dahua Lin. Internlm2 technical report, 2024.
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and
Sergey Zagoruyko. End-to-end object detection with transformers. CoRR , abs/2005.12872, 2020.
URL https://arxiv.org/abs/2005.12872 .
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompt-
ing: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on
Machine Learning Research , 2023.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep
bidirectional transformers for language understanding. CoRR , abs/1810.04805, 2018. URL
http://arxiv.org/abs/1810.04805 .
Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. Self-collaboration code generation via chatgpt. ACM
Trans. Softw. Eng. Methodol., 33(7), September 2024. ISSN 1049-331X. doi: 10.1145/3672459.
URL https://doi.org/10.1145/3672459.
Kutluhan Erol, James A Hendler, and Dana S Nau. Semantics for hierarchical task-network plan-
ning. Citeseer, 1994.
Richard E Fikes and Nils J Nilsson. Strips: A new approach to the application of theorem proving
to problem solving. Artificial intelligence , 2(3-4):189–208, 1971.
Damien Graux, Sébastien Montella, Hajira Jabeen, Claire Gardent, and Jeff Z Pan. [prompteng] first
international workshop on prompt engineering for pre-trained language models. In Companion
Proceedings of the ACM on Web Conference 2024, pp. 1311–1312, 2024.
Huihui Guo, Fan Wu, Yunchuan Qin, Ruihui Li, Keqin Li, and Kenli Li. Recent trends in task and
motion planning for robotics: A survey. ACM Comput. Surv. , 55(13s), jul 2023. ISSN 0360-0300.
doi: 10.1145/3583136. URL https://doi.org/10.1145/3583136 .
Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong
Sun, and Yang Liu. Stabletoolbench: Towards stable large-scale benchmarking on tool learning
of large language models, 2024.
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot,
Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier,
Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril,
Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023.
URL https://arxiv.org/abs/2310.06825.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer
Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training
for natural language generation, translation, and comprehension, 2019a.
URL https://arxiv.org/abs/1910.13461.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer
Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: denoising sequence-to-sequence pre-
training for natural language generation, translation, and comprehension. CoRR , abs/1910.13461,
2019b. URL http://arxiv.org/abs/1910.13461 .
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image
pre-training for unified vision-language understanding and generation, 2022a.
URL https://arxiv.org/abs/2201.12086.
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image
pre-training for unified vision-language understanding and generation, 2022b.
URL https://arxiv.org/abs/2201.12086.
Nir Lipovetzky and Hector Geffner. Best-first width search: Exploration and exploitation in
classical planning. Proceedings of the AAAI Conference on Artificial Intelligence , 31(1), Feb.
2017. doi: 10.1609/aaai.v31i1.11027.
URL https://ojs.aaai.org/index.php/AAAI/article/view/11027.
B. Liu, Yuqian Jiang, Xiaohan Zhang, Qian Liu, Shiqi Zhang, Joydeep Biswas, and Peter
Stone. Llm+p: Empowering large language models with optimal planning proficiency. ArXiv,
abs/2304.11477, 2023a. URL https://api.semanticscholar.org/CorpusID:258298051.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-
train, prompt, and predict: A systematic survey of prompting methods in natural language pro-
cessing. ACM Comput. Surv. , 55(9), jan 2023b. ISSN 0360-0300. doi: 10.1145/3560815. URL
https://doi.org/10.1145/3560815 .
Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Zhiheng Li, Xizhou Zhu, Lewei Lu, Qifeng
Chen, Yu Qiao, Jifeng Dai, and Wenhai Wang. Controlllm: Augment language models with tools
by searching on graphs. arXiv preprint arXiv:2305.10601 , 2023c.
Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu,
and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language
models. Advances in Neural Information Processing Systems, 36, 2024.
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson,
Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In
Proceedings of the conference on fairness, accountability, and transparency , pp. 220–229, 2019.
James Oswald, Kavitha Srinivas, Harsha Kokel, Junkyu Lee, Michael Katz, and Shirin Sohrabi.
Large language models as planning domain generators. Proceedings of the International
Conference on Automated Planning and Scheduling, 34(1):423–431, May 2024.
doi: 10.1609/icaps.v34i1.31502.
URL https://ojs.aaai.org/index.php/ICAPS/article/view/31502.
Vishal Pallagani, Bharath Muppasani, Biplav Srivastava, Francesca Rossi, Lior Horesh, Keerthiram
Murugesan, Andrea Loreggia, Francesco Fabiano, Rony Joseph, and Yathin Kethepalli.
Plansformer tool: Demonstrating generation of symbolic plans using transformers. In Edith Elkind
(ed.), Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence,
IJCAI-23, pp. 7158–7162. International Joint Conferences on Artificial Intelligence
Organization, 8 2023. doi: 10.24963/ijcai.2023/839.
URL https://doi.org/10.24963/ijcai.2023/839. Demo Track.
Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller,
and Sebastian Riedel. Language models as knowledge bases? arXiv preprint arXiv:1909.01066,
2019.
Lianhui Qin, Sean Welleck, Daniel Khashabi, and Yejin Choi. Cold decoding: Energy-based
constrained text generation with langevin dynamics. In S. Koyejo, S. Mohamed, A. Agarwal,
D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems,
volume 35, pp. 9538–9551. Curran Associates, Inc., 2022.
URL https://proceedings.neurips.cc/paper_files/paper/2022/file/3e25d1aff47964c8409fd5c8dc0438d7-Paper-Conference.pdf.
Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei
Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu
Tian, Kunlun Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li,
Ziwei Tang, Jing Yi, Yuzhang Zhu, Zhenning Dai, Lan Yan, Xin Cong, Yaxi Lu, Weilin Zhao,
Yuxiang Huang, Junxi Yan, Xu Han, Xian Sun, Dahai Li, Jason Phang, Cheng Yang, Tongshuang
Wu, Heng Ji, Zhiyuan Liu, and Maosong Sun. Tool learning with foundation models, 2023a.
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru
Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li,
Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+
real-world apis, 2023b.
Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu,
and Ji-Rong Wen. Tool learning with large language models: A survey. arXiv preprint
arXiv:2405.17935 , 2024.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever.
Robust speech recognition via large-scale weak supervision, 2022.
URL https://arxiv.org/abs/2212.04356.
René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction.
CoRR, abs/2103.13413, 2021. URL https://arxiv.org/abs/2103.13413.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.
High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022.
Earl D. Sacerdoti. The nonlinear nature of plans. In Proceedings of the 4th International Joint
Conference on Artificial Intelligence - Volume 1 , IJCAI’75, pp. 206–214, San Francisco, CA,
USA, 1975. Morgan Kaufmann Publishers Inc.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of
bert: smaller, faster, cheaper and lighter. In NeurIPS EMC2 Workshop, 2019.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro,
Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can
teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 2024.
Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. PICARD: Parsing incrementally for
constrained auto-regressive decoding from language models. In Marie-Francine Moens, Xuanjing
Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on
Empirical Methods in Natural Language Processing, pp. 9895–9901, Online and Punta Cana,
Dominican Republic, November 2021. Association for Computational Linguistics.
doi: 10.18653/v1/2021.emnlp-main.779. URL https://aclanthology.org/2021.emnlp-main.779.
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugging-
gpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information
Processing Systems , 36, 2024.
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao.
Reflexion: language agents with verbal reinforcement learning. In A. Oh, T. Naumann,
A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information
Processing Systems, volume 36, pp. 8634–8652. Curran Associates, Inc., 2023.
URL https://proceedings.neurips.cc/paper_files/paper/2023/file/1b44b878bb782e6954cd888628510e90-Paper-Conference.pdf.
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu,
Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly
capable multimodal models. arXiv preprint arXiv:2312.11805 , 2023.
Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao
Kambhampati. Planbench: An extensible benchmark for evaluating large language
models on planning and reasoning about change. In A. Oh, T. Naumann, A. Globerson,
K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing
Systems, volume 36, pp. 38975–38987. Curran Associates, Inc., 2023.
URL https://proceedings.neurips.cc/paper_files/paper/2023/file/7a92bcdede88c7afd108072faf5485c8-Paper-Datasets_and_Benchmarks.pdf.
Hongru Wang, Minda Hu, Yang Deng, Rui Wang, Fei Mi, Weichao Wang, Yasheng Wang, Wai-Chung
Kwan, Irwin King, and Kam-Fai Wong. Large language models as source planner for
personalized knowledge-grounded dialogues. In Houda Bouamor, Juan Pino, and Kalika Bali
(eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 9556–9569,
Singapore, December 2023a. Association for Computational Linguistics.
doi: 10.18653/v1/2023.findings-emnlp.641.
URL https://aclanthology.org/2023.findings-emnlp.641.
Xinyu Wang, Bohan Zhuang, and Qi Wu. Switchgpt: Adapting large language models for non-text
outputs. arXiv preprint arXiv:2309.07623 , 2023b.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdh-
ery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.
arXiv preprint arXiv:2203.11171 , 2022.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny
Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in
neural information processing systems , 35:24824–24837, 2022.
Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Xin Zhang, Shen Huang, Pengjun Xie, Jinan
Xu, Yufeng Chen, Meishan Zhang, et al. Zero-shot information extraction via chatting with
chatgpt. arXiv preprint arXiv:2302.10205 , 2023.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,
Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface's transformers:
State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
Chenfei Wu, Sheng-Kai Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual
chatgpt: Talking, drawing and editing with visual foundation models. ArXiv , abs/2303.04671,
2023. URL https://api.semanticscholar.org/CorpusID:257404891 .
Ruixuan Xiao, Wentao Ma, Ke Wang, Yuchuan Wu, Junbo Zhao, Haobo Wang, Fei Huang, and
Yongbin Li. Flowbench: Revisiting and benchmarking workflow-guided planning for llm-based
agents. arXiv preprint arXiv:2406.14884 , 2024.
Mingzhe Xing, Rongkai Zhang, Hui Xue, Qi Chen, Fan Yang, and Zhen Xiao. Understanding the
weakness of large language model agents within a complex android environment. arXiv preprint
arXiv:2402.06596 , 2024.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik
Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In
A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in
Neural Information Processing Systems, volume 36, pp. 11809–11822. Curran Associates, Inc.,
2023a.
URL https://proceedings.neurips.cc/paper_files/paper/2023/file/271db9922b8d1f4dd7aaef84ed5ac703-Paper-Conference.pdf.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao.
ReAct: Synergizing reasoning and acting in language models. In International Conference on
Learning Representations (ICLR) , 2023b.
Yao Yao, Zuchao Li, and Hai Zhao. Beyond chain-of-thought, effective graph-of-thought reasoning
in language models. arXiv preprint arXiv:2305.16582 , 2023c.
Yazhou Zhang, Mengyao Wang, Chenyu Ren, Qiuchi Li, Prayag Tiwari, Benyou Wang, and Jing
Qin. Pushing the limit of llm capacity for text classification. arXiv preprint arXiv:2402.07470 ,
2024.
Yilun Zhao, Linyong Nan, Zhenting Qi, Rui Zhang, and Dragomir Radev. ReasTAP: Injecting
table reasoning skills during pre-training via synthetic reasoning examples. In Proceedings of the
2022 Conference on Empirical Methods in Natural Language Processing , pp. 9006–9018, Abu
Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL
https://aclanthology.org/2022.emnlp-main.615 .
Huaixiu Steven Zheng, Swaroop Mishra, Hugh Zhang, Xinyun Chen, Minmin Chen, Azade Nova,
Le Hou, Heng-Tze Cheng, Quoc V Le, Ed H Chi, et al. Natural plan: Benchmarking llms on
natural language planning. arXiv preprint arXiv:2406.04520 , 2024.
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: En-
hancing vision-language understanding with advanced large language models. arXiv preprint
arXiv:2304.10592 , 2023.
Ángel Madridano, Abdulla Al-Kaff, David Martín, and Arturo de la Escalera. Trajectory planning
for multi-robot systems: Methods and applications. Expert Systems with Applications, 173:
114660, 2021. ISSN 0957-4174. doi: https://doi.org/10.1016/j.eswa.2021.114660.
URL https://www.sciencedirect.com/science/article/pii/S0957417421001019.
A MUSE EXPERIMENTAL SETUP
Table 4 provides a comprehensive overview of the domains and tasks encompassed within the MUSE
benchmark. All three competing systems have access to the models associated with each task listed
in Table 4. It should be noted that both HuggingGPT and ControlLLM utilise ChatGPT as their
backbone model, leveraging it for the execution of certain tasks. Since ControlLLM permits only
one-to-one mappings, to maintain an unbiased benchmark, we assign one model per task (for details,
please refer to the Supplementary Material). Moreover, to preserve the naturalness of the queries,
we have refrained from making any grammatical or spelling corrections in the dataset.
Table 4: AI Tasks and Associated Models.
Domain                Task                          Model
Audio                 Automatic Speech Recognition  openai/whisper-large-v2 (Radford et al., 2022)
                      Text to Speech                microsoft/speecht5_tts (Ao et al., 2022)
Image Generation      Text to Image                 stabilityai/stable-diffusion-2-1 (Rombach et al., 2022)
Image to Text         Image Captioning              Salesforce/blip-image-captioning-base (Li et al., 2022b)
                      Object Detection              facebook/detr-resnet-101 (Carion et al., 2020)
                      Visual Question Answering     Salesforce/blip-vqa-base (Li et al., 2022a)
Image to Image        Depth Estimation              Intel/dpt-hybrid-midas (Ranftl et al., 2021)
Machine Translation   Translation from xx to yy     mistralai/Mistral-7B-Instruct-v0.1 (Jiang et al., 2023)
Question Answering    Answer based on Context       distilbert/distilbert-base-uncased-distilled-squad (Sanh et al., 2019)
                      Open QA                       mistralai/Mistral-7B-Instruct-v0.1 (Jiang et al., 2023)
                      Table QA                      Yale-LILY/reastap-large (Zhao et al., 2022)
Summarisation         Abstractive Summarisation     facebook/bart-large-cnn (Lewis et al., 2019b)
Text Generation       Text Generation               mistralai/Mistral-7B-Instruct-v0.1 (Jiang et al., 2023)
Text Classification   Text Classification           facebook/bart-large-mnli (Lewis et al., 2019a)
Token Classification  Named Entity Recognition      dslim/bert-base-NER (Devlin et al., 2018)
Table 5: Small sample of queries from the MUSE benchmark.
Domain                                          Queries
“object detection”                              How many people is in the image? ./data/images/image 6.jpg
“image to text”, “text generation”              Write me a sonnet about this image ./data/images/image 11.jpg
“image to text”, “image generation”             Build a pictorial presentation of the flower after deciphering
                                                the audio ./data/audios/audio 4.wav
“audio”, “question answering”                   Use audio transcription as context and answer ‘what is the
                                                country starting the war whom the associated speech is about?’
                                                from ./data/audios/audio6.wav
“audio”, “summarisation”, “image generation”    Extract a summary from the audio transcript
                                                ./data/audios/audio 8.wav, and generate an image based on the
                                                summary
B SCENARIO DETAILS FOR MODEL SELECTION VALIDATION
↓↓ No Constraints (Cold run) ↓↓
HIVE light
We, the people of the United States, in order to form a more perfect Union, establish justice,
ensure domestic tranquility, provide for the common defense, promote the general welfare, and
secure the blessings of liberty to ourselves and our posterity, do ordain and establish this
Constitution for the United States of America.
{
  "LOC": [
    "United States",
    "United States of America"
  ],
  "MISC": [
    "Constitution"
  ]
}
Comment: Successful!
HuggingGPT
Based on the inference results, I have transcribed the audio file
“.[Audio]” and obtained the following text: “We, the people of the
United States, in order to form a more perfect union, establish jus-
tice, ensure domestic tranquility, provide for the common defense,
promote the general welfare, and secure the blessings of liberty to
ourselves United States of America.”
For the named entity recognition task, I have used the model
“dslim/bert-base-NER.” However, the inference result for this task
is empty, indicating that no named entity tokens were found in the
transcribed text.
Here is the complete workflow for your request:
1. Automatic Speech Recognition (ASR) Task:
Model used: “openai/whisper-large-v2”
Inference result: The audio file “.[Audio]” was transcribed, and
the resulting text is the one mentioned above.
2. Named Entity Recognition (NER) Task:
Model used: “dslim/bert-base-NER”
Inference result: No named entity tokens were found in the
transcribed text.
Please note that the inference results are based on the models used,
and there is a possibility of errors or incomplete detection.
Comment: Unsuccessful: the system failed to pass the output from
the previous task as input to the subsequent one, preventing the NER
model from identifying any entities.
↓↓ Specifically asked to use the smallest model available for each task ↓↓
HIVE light
We, the people of the United States, in order to form a more perfect
Union, establish justice, ensure domestic tranquility, provide for the
common defense, promote the general welfare, and secure the blessings
of liberty to ourselves and our posterity, do ordain and establish
this Constitution for the United States of America.
{
  "LOC": [
    "United States",
    "United States of America"
  ],
  "MISC": [
    "Constitution"
  ]
}
Comment: Successful!
HuggingGPT
Based on the inference results, I have transcribed the audio file
“.[Audio]” and found the following named entity tokens:
“We the people of Unites States” – ORGANISATION
“the United States of America” – LOCATION
To accomplish this, I followed the following workflow:
1. Task 1: Audio-to-Audio Transcription
Model: No suitable model found for this task
Inference Result Error: “audio-to-audio” not found in available tasks
2. Task 2: Automatic Speech Recognition (ASR) Task:
Model: I used the “openai/whisper-large-v2” model
Inference result: “We, the people of the United States, in order
to form a more perfect union, establish justice, ensure domestic
tranquility, provide for the common defense, promote the general
welfare, and secure the blessings of liberty to ourselves United
States of America.”
3. Task 3: Named Entity Recognition (NER) Task:
Model: I used the “dslim/bert-base-NER” model
Inference result: No named entity tokens were found in the
transcribed text.
I apologize for not being able to provide any named entity tokens
in the transcribed text. If you have any further questions or need
assistance with anything else, please let me know!
Comment: Unsuccessful: the system failed in task selection and
selected “audio-to-audio” as the first task, which is irrelevant to the
user query. The system also failed to identify any entities.
↓↓ Specifically asked to use openrail++ & deepseek licensed models ↓↓
HIVE light
Comment: Successful: as there were no models with openrail++ or deepseek
licenses for ASR or NER, no model should be selected.
HuggingGPT
I apologize for any confusion caused. Based on the inference results, here is the detailed workflow for
your request:
1. Automatic Speech Recognition (ASR) Task:
Task: Transcribing the audio file “[Audio]”.
Model used: “openai/whisper-large-v2”
Inference result: The transcribed text is as follows: “We, the people of the United States, in
order to form a more perfect union, establish justice, ensure domestic tranquility, provide for the
common defense, promote the general welfare and secure the blessings of liberty to ourselves and
our posterity, do ordain and establish this Constitution for the United States of America.”
2. Named Entity Recognition (NER) Task:
Task: Finding named entity tokens in the transcribed text.
Model used: “dslim/bert-base-NER”.
Inference result: Unfortunately, there is no available inference result for this task.
3. Zero-shot Classification Task:
Task: Classifying named entity tokens.
Model used: “gpt-3.5-turbo-0613”
Inference result: I apologize, but as an AI language model, I am unable to access or transcribe
audio directly. However, once the audio is transcribed, I can assist you with the zero-shot
classification task. Could you please provide the transcription of the audio file, and I’ll be happy
to help you with the classification task using the provided text and image information?
4. Text-to-Image Task:
Task: Generating images based on named entity tokens.
Model used: “stability/stable-diffusion-v1”
Inference result: Unfortunately, there was an error in the inference process for this task.
Please note that the inference results are based on the available models and their capabilities. If there
are any specific requirements or modifications you would like to make, please let me know.
Comment: Unsuccessful: the system first selected the wrong tasks and then misled the user by
selecting models that do not comply with the user’s license constraints.
↓↓ Specifically asked to choose the Automatic speech recognition model based on best
performance on speech recognition on common voice english ↓↓
HIVE light
we the people of the united states in order to form a more perfect
union establish justice insure domestic tranquillity provide for the
common defense promote the general welfare and secure the blessings
of liberty to ourselves and our posterity do ordain and establish
this constitution for the united states of america
{
  "LOC": [
    "united States",
    "united States of america"
  ],
  "MISC": [
    "constitution"
  ]
}
Comment: Successful: to test model selection based on benchmark
performance, we added a “parakeet-rnnt” model to the mix, as it
performs better than “whisper-large-v2” on this specific benchmark.
HIVE light successfully selects it for the ASR task.
HuggingGPT
Based on the Inference results, I will directly answer your request and provide a detailed
workflow for your specific task. Here is the complete path or URLs of the files in the inference
results:
1. Transcribing the audio file “[Audio]”
Model used: “openai/whisper-large-v2”
Inference Result: The transcribed text is as follows: “We, the people of the United
States, in order to form a more perfect union, establish justice, ensure domestic tranquility,
provide for the common defense, promote the general welfare, and secure the blessings
of liberty to ourselves United States of America.”
2. Finding named entity tokens in the transcribed text:
Model used: “dslim/bert-base-NER”
Inference result: No named entity tokens were found in the transcribed text.
3. Zero-shot classification for the task of “named entity tokens”:
Model used: “gpt-3.5-turbo-0613”
Inference result: As an AI language model, I don’t have direct access to external files or
the ability to process audio. Therefore, I am unable to perform zero-shot classification on
the named entity tokens.
4. Generating a response for the task of selecting models based on scores from
“speech-recognition-on-common-voice-english”:
Model used: “gpt-3.5-turbo-0613”
Inference result: Based on the input the following response is generated:
“To select models for speech recognition on Common Voice English, you can use the
scores obtained from the speech recognition systems trained on the Common Voice English
dataset. By evaluating the performance of different models using these scores, you
can determine which models are most suitable for the task. This allows you to choose
the best models for your specific speech recognition needs. Factors such as accuracy,
robustness and language coverage should be considered when selecting these models.
The scores provide a valuable metric for comparing and evaluating different models,
ultimately assisting in the decision-making process.”
Please note that the information provided is based on the inference results from the models
used. If you have any specific concerns or further questions, feel free to let me know!
Comment: Unsuccessful: the system fails at the task selection stage while also ignoring the
requirement to select the ASR model based on benchmark performance.
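The four scenarios above differ only in the constraint placed on model selection: none, smallest model, license, and benchmark score. A minimal Python sketch of such constraint-aware selection is given below. The `Model` record, the `select_model` helper, and every model name, size, license tag and score in `MODELS` are hypothetical values chosen for illustration; they are not taken from HIVE or from its C-KG.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Model:
    name: str
    task: str
    params_m: int   # size in millions of parameters (illustrative value)
    license: str    # license tag (illustrative value)
    scores: dict = field(default_factory=dict)  # benchmark -> score, higher is better


def select_model(models, task, licenses=None, smallest=False, benchmark=None) -> Optional[Model]:
    """Return one model for `task` that satisfies the constraints, or None."""
    candidates = [m for m in models if m.task == task]
    if licenses is not None:
        # License constraint: keep only models with an accepted license.
        candidates = [m for m in candidates if m.license in licenses]
    if benchmark is not None:
        # Benchmark constraint: keep only models that report a score on it.
        candidates = [m for m in candidates if benchmark in m.scores]
    if not candidates:
        # No model satisfies the constraints, so none is selected.
        return None
    if benchmark is not None:
        return max(candidates, key=lambda m: m.scores[benchmark])
    if smallest:
        return min(candidates, key=lambda m: m.params_m)
    return candidates[0]  # cold run: first matching model


# Hypothetical catalogue; names, sizes, licenses and scores are made up.
MODELS = [
    Model("asr-large", "asr", 1500, "license-a", {"benchmark-x": 0.80}),
    Model("asr-small", "asr", 600, "license-b", {"benchmark-x": 0.90}),
    Model("ner-base", "ner", 110, "license-c"),
]

cold = select_model(MODELS, "asr")
small = select_model(MODELS, "asr", smallest=True)
licensed = select_model(MODELS, "asr", licenses={"openrail++", "deepseek"})
best = select_model(MODELS, "asr", benchmark="benchmark-x")
```

Under these assumptions, the license scenario returns no model at all, which mirrors the behaviour the comment above describes as successful.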
C HIVE’S PROMPTS
Usage Snippet Extraction (§3.1)
You are a python programming expert, mainly used to convert python code snippets into
python functions. Follow the rules:
1. You also make sure all the function variables have default value.
2. There should always be input variable with default value that takes the input for the model.
3. Take Model path as a variable with default value of model name
4. Return the model response in the python function
Transform the code snippets in the text into one signed Python function including all the
potential variables and default values for all. Only respond with code and in markdown
format “‘python“‘. {code}
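To illustrate how a reply to the prompt above could be checked before use, the following Python sketch extracts the code from a markdown-formatted reply and verifies rules 1 and 4 (every argument has a default value, and the function returns something). The helper `extract_function` and the sample reply are hypothetical and are not part of HIVE; the fence marker is built programmatically only to keep this example self-contained.

```python
import ast
import re

FENCE = "`" * 3  # markdown code-fence marker
CODE_RE = re.compile(FENCE + r"python[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_function(response: str) -> ast.FunctionDef:
    """Extract the single Python function from a markdown reply and validate it."""
    match = CODE_RE.search(response)
    if match is None:
        raise ValueError("no python code block found in the response")
    tree = ast.parse(match.group(1))
    funcs = [node for node in tree.body if isinstance(node, ast.FunctionDef)]
    if len(funcs) != 1:
        raise ValueError("expected exactly one function definition")
    func = funcs[0]
    args = func.args
    # Rule 1: all positional and keyword-only arguments carry a default value.
    if len(args.defaults) != len(args.posonlyargs) + len(args.args):
        raise ValueError("every argument needs a default value")
    if any(default is None for default in args.kw_defaults):
        raise ValueError("every keyword-only argument needs a default value")
    # Rule 4: the function returns a value.
    returns = [n for n in ast.walk(func) if isinstance(n, ast.Return) and n.value is not None]
    if not returns:
        raise ValueError("the function must return the model response")
    return func


# Hypothetical reply; the function body is a stand-in for a real model call.
sample_reply = (
    FENCE + "python\n"
    "def run_model(input_text='hello', model_path='some-org/some-model'):\n"
    "    return input_text.upper()\n"
    + FENCE
)
checked = extract_function(sample_reply)
```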
Parsing & Rephrasing (§3.2)
Task Decomposition stage: The AI assistant can parse user input into multiple inputs and
fill the relevant keys in the following JSON.
{"instruction": None, "input_text": None, "question": None, "url": None, "data_dict": {}, "categories": []}
Example 1
User: What is the date mentioned in this audio www.google.com/audio file.mp3?
Response: {{"instruction": "Convert the audio to text and then answer the question", "url":
"www.google.com/audio file.mp3", "input_text": "What is the date?"}}
. . . . . .
Example 5
User: Please transcribe the voice into text ./audio/audio 1.mp3 and classify the transcribed
text into categories such as ’movie’, ’music’, ’painting’, or ’Other’.
Response: {"instruction": "Convert the audio to text, and perform text classification", "url":
"./audio/audio 1.mp3", "categories": ['movie', 'music', 'painting', 'Other']}
Based on the above example, parse the following:
User: {USER INPUT}
Response:
Only return a JSON
The JSON keys are defined below:
instruction: what are the tasks asked in the question, it maybe one, two or three tasks.
Try to find the implicit tasks as well
input_text: extract the original text or context in the input. Do not generate on your own
question: extract if there is any question asked
url: extract url passed in the user query
data_dict: extract dictionary passed in the user query
categories: extract ALL categories mentioned in the user query
Do not generate anything other than a parsable JSON
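As an illustration of how the parsable JSON requested above might be consumed, the sketch below loads a reply and fills any missing key with the default from the template. The helper `parse_response` and the sample reply are hypothetical; the sketch also assumes the reply is strict JSON with double-quoted strings, which is an assumption rather than something the prompt guarantees.

```python
import copy
import json

# Default values mirroring the key template of the prompt above.
DEFAULTS = {
    "instruction": None,
    "input_text": None,
    "question": None,
    "url": None,
    "data_dict": {},
    "categories": [],
}


def parse_response(raw: str) -> dict:
    """Parse the JSON object found in `raw` and complete it with the defaults."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in the response")
    parsed = json.loads(raw[start:end + 1])
    result = copy.deepcopy(DEFAULTS)
    # Keep only the keys defined in the template; unknown keys are ignored.
    result.update({key: value for key, value in parsed.items() if key in DEFAULTS})
    return result


# Hypothetical reply used for illustration.
sample = (
    'Response: {"instruction": "Convert the audio to text and then answer the question", '
    '"url": "./audio/example.mp3", "question": "What is the date?"}'
)
fields = parse_response(sample)
```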
Domain Classification (§3.2)
You are professional in natural language processing task. Find which domains are re-
lated with the provided user query? You should pick domains from the following list
{domains}. You MUST NOT output other domains not in the provided list. Here are the
examples:
Example 1: Answer the following questions in detail and give me a summarisation
for the answer
Domains: question answering; summarisation
. . .
Example 11: Summarise the transcript of the audio and find entities in it
Domains: audio; summarisation; token classification
Provided Query: {query}. ’The order matters’
Domains:
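The examples in this prompt return domains as a semicolon-separated list whose order matters. A minimal sketch of how such a reply could be turned into an ordered list restricted to the allowed domains follows; the helper `parse_domains` and the contents of `ALLOWED` are assumptions made for illustration.

```python
def parse_domains(raw: str, allowed) -> list:
    """Split a 'a; b; c' reply, keep the order, and drop unknown or repeated domains."""
    allowed_set = {domain.lower() for domain in allowed}
    ordered = []
    for part in raw.split(";"):
        domain = part.strip().lower()
        if domain in allowed_set and domain not in ordered:
            ordered.append(domain)
    return ordered


# Illustrative subset of domains; the real list is injected through {domains}.
ALLOWED = ["audio", "summarisation", "token classification", "question answering"]

domains = parse_domains("audio; summarisation; token classification; cooking", ALLOWED)
```

Filtering against the allowed list enforces, after the fact, the instruction not to output domains outside the provided list.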
Action Selection (§3.2)
You are an action selector. Given a Task and a list of Actions,
Select the crucial/mandatory actions. Follow the instructions below:
1. Pick the least amount of actions that can do the job in Task
2. Only select actions that are REQUIRED and NECESSARY for Task
3. Focus on precision of selection
4. Do not select EVALUATION or SCORE actions unless explicitly asked for in Task
Example 1
Task: I want to select a schema and then generate a SQL query for a question
Actions: [”Schema-Selection”, ”generate SQL”, ”execute query”, ”validate SQL”]
Can you understand the requirements of the Task and select necessary actions from the
Actions. Do not give any explanations, only return a list and nothing else. Select at most
three diverse, yet relevant actions.
Selected Actions: [”Schema-Selection”, ”generate SQL”]
. . .
. . .
Example 3
Task: Retrieve documents on renewable energy advancements and summarise the latest
technologies
Actions: [”query based summarization”, ”rank documents”,
”retrieve most relevant document”, ”keyphrase extraction”, ”summarization evaluation”,
”get extractive summarization”, ”get abstractive summarization”,
”retrieve multiple documents”]
Can you understand the requirements of the Task and select necessary actions from the
Actions. Do not give any explanations, only return a list and nothing else. Select at most
three diverse, yet relevant actions.
Selected Actions: [”retrieve multiple documents”, ”get extractive summarization”]
Task: {user instruction}
Actions: {actions}
Can you understand the requirements of the Task and select necessary actions from the
Actions. Do not give any explanations, only return a list and nothing else. Select at most
three diverse, yet relevant actions.
Selected Actions:
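The prompt asks for a bare list of at most three actions drawn from the provided ones. The sketch below shows one way such a reply could be validated; `parse_selected_actions` is a hypothetical helper, and the action names are written with the spacing shown in Example 3.

```python
import ast


def parse_selected_actions(raw: str, available, limit: int = 3) -> list:
    """Parse a list literal and keep at most `limit` distinct, known actions."""
    value = ast.literal_eval(raw.strip())
    if not isinstance(value, list):
        raise ValueError("the reply must be a list of actions")
    selected = []
    for action in value:
        # Ignore actions that were not offered and repeated actions.
        if action in available and action not in selected:
            selected.append(action)
    return selected[:limit]


AVAILABLE = [
    "query based summarization",
    "rank documents",
    "retrieve most relevant document",
    "keyphrase extraction",
    "summarization evaluation",
    "get extractive summarization",
    "get abstractive summarization",
    "retrieve multiple documents",
]

reply = '["retrieve multiple documents", "get extractive summarization", "made up action"]'
actions = parse_selected_actions(reply, AVAILABLE)
```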
Figure 4: Capability Knowledge Graph in use for the MUSE Benchmark.
D MUSE CAPABILITY-KG VISUALISATION
As described in Section 4.2, we introduced the MUSE benchmark in order to compare the
performance of HuggingGPT, ControlLLM and HIVE. The latter was evaluated in two variants:
HIVE light, which uses an 8-bit quantised 7B model for planning tasks, and HIVE, which relies on
GPT-3.5 for the same step. In particular, MUSE comes with 100 multi-modal, multi-task, complex,
natural-language queries¹⁶.
In this appendix, Figure 4 provides a snapshot of the Capability Knowledge Graph (C-KG), which
corresponds to the models and their associated pieces of data used by HIVE to tackle the MUSE
benchmark¹⁷. It is an excerpt, containing 461 triples involving 381 entities, of the complete C-KG,
which contains 125k triples for 39k distinct entities. The complete C-KG was built by extracting
metadata from over 600,000 models listed on HuggingFace, of which 26,806 were retained based
on popularity metrics. In Figure 4, model nodes are depicted in yellow and sit at the core of the
knowledge graph, since the C-KG building process was designed around models,
starting from the HuggingFace model cards (see Section 3.1 for more details on this process).
Visually, the graph has two parts: the two stable-diffusion models (xl-base-1.0 & 2-1) are
disconnected from the rest of the graph. The remaining 13 models, on the other hand, form
a single connected piece, in which organisation nodes (in red) and license nodes (in black) often
act as connection hubs. Interestingly, the main part of the C-KG is “bordered” by gray
and purple nodes, which correspond to metrics and their respective scores for the various benchmarks
(in bright green), with data points retrieved from the paperswithcode API. Finally, the image-to-text
block, composed of blip-image-captioning-base and blip-vqa-base, is somewhat separated from the
main graph, as is whisper-large-v2, which is completely surrounded by all the languages it covers
(light blue nodes).
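The observation that some models are disconnected from the rest of the graph can be checked mechanically on a set of triples. The Python sketch below computes connected components over an undirected view of a tiny triple set. The predicate names, the placeholder nodes such as "organisation-a", and the specific links in `TRIPLES` are hypothetical and do not reproduce the actual content of the MUSE C-KG.

```python
from collections import defaultdict


def connected_components(triples):
    """Group the entities of (subject, predicate, object) triples into components."""
    neighbours = defaultdict(set)
    for subject, _predicate, obj in triples:
        # Treat every triple as an undirected edge between subject and object.
        neighbours[subject].add(obj)
        neighbours[obj].add(subject)
    seen, components = set(), []
    for node in neighbours:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        components.append(component)
    return components


# Toy triples for illustration only; they are not taken from the real C-KG.
TRIPLES = [
    ("whisper-large-v2", "hasLanguage", "language-a"),
    ("whisper-large-v2", "hasLicense", "license-a"),
    ("blip-vqa-base", "hasLicense", "license-a"),
    ("blip-vqa-base", "publishedBy", "organisation-a"),
    ("blip-image-captioning-base", "publishedBy", "organisation-a"),
    ("stable-diffusion-2-1", "hasTask", "task-a"),
    ("stable-diffusion-xl-base-1.0", "hasTask", "task-a"),
]

components = connected_components(TRIPLES)
```

In this toy example, shared license and organisation nodes join three models into one component, which is the hub role described above, while the two stable-diffusion entries end up in a component of their own.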
¹⁶ See the supplementary material archive for all the query details and their associated data.
¹⁷ A web interface to explore the MUSE C-KG is also available as supplementary material.