KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths
over Knowledge Graphs
Qi Zhao¹, Hongyu Yang¹, Qi Song¹∗, Xinwei Yao², Xiangyang Li¹
¹University of Science and Technology of China, Hefei, Anhui, China
²Zhejiang University of Technology, Hangzhou, Zhejiang, China
{zq2021, hongyuyang}@mail.ustc.edu.cn
xwyao@zjut.edu.cn, {qisong09, xiangyangli}@ustc.edu.cn
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in
various complex tasks, yet they still suffer from hallucinations. Introducing
external knowledge, such as knowledge graphs, can enhance the LLMs' ability to
provide factual answers. LLMs can also interactively explore knowledge graphs.
However, most approaches suffer from insufficient excavation of the internal
knowledge of LLMs, limited generation of trustworthy knowledge reasoning paths,
and a vague integration of internal and external knowledge. Therefore, we
propose KnowPath, a knowledge-enhanced large model framework driven by the
collaboration of internal and external knowledge. It relies on the internal
knowledge of the LLM to guide the exploration of interpretable directed
subgraphs in external knowledge graphs, better integrating the two knowledge
sources for more accurate reasoning. Extensive experiments on multiple
real-world datasets confirm the superiority of KnowPath.
1 Introduction
Large language models (LLMs) are increasingly being applied across Natural
Language Processing (NLP) tasks, such as text generation [Wang et al., 2024;
Dong et al., 2023], knowledge-based question answering [Luo et al., 2024a;
Zhao et al., 2024], and domain-specific applications [Alberts et al., 2023;
Jung et al., 2024]. In most scenarios, LLMs serve as intermediary agents for
implementing various functions [Hu et al., 2024; Huang et al., 2024; Guo et
al., 2024]. However, due to the characteristics of generative models, LLMs
still suffer from hallucination issues, often generating incorrect answers that
can lead to uncontrollable and severe consequences [Li et al., 2024].
Introducing knowledge graphs (KGs) to mitigate this phenomenon is promising
[Yin et al., 2022]. This is because knowledge graphs store a large amount of
structured factual knowledge, which can provide large models with accurate
knowledge dependencies. At the same time, correcting the knowledge in large
models often requires fine-tuning their model parameters, which inevitably
incurs high computational costs [Sun et al., 2024]. In contrast, updating
knowledge graphs is relatively simple and incurs minimal overhead.

Figure 1: (a) The LLMs-only approach suffers from severe hallucinations. (b)
The LLMs-with-KGs approach provides insufficient information, and its
graph-based reasoning with KGs is often inaccurate. (c) We first mine the
internal knowledge of LLMs, offering more information for external KG reasoning
and achieving better integration of internal and external knowledge in LLMs.

∗Corresponding author
The paradigms of combining LLMs with KGs can be classified into three main
categories. The first is knowledge injection during pre-training or fine-tuning
[Luo et al., 2024b; Cao et al., 2023; Jiang et al., 2022; Yang et al., 2024].
While the model's ability to grasp knowledge improves, these methods introduce
high computational costs and catastrophic forgetting. The second entails using
LLMs as agents to reason through knowledge retrieved from the KGs. This
approach does not require fine-tuning or retraining, significantly reducing
overhead [Jiang et al., 2023; Yang et al., 2023]. However, it relies heavily on
the completeness of external KGs and underutilizes the internal knowledge of
the LLMs. The third enables LLMs to participate in the process of knowledge
exploration within external KGs [Ma et al., 2024]. In this case, the LLMs can
engage in the selection of knowledge nodes at each step [Sun et al., 2024; Chen
et al., 2024; Xu et al., 2024], thereby leveraging the advantages of the
internal knowledge of the LLMs to some extent.

arXiv:2502.12029v2 [cs.AI] 13 Mar 2025

Figure 2: The workflow of KnowPath. It contains: (a) Inference Paths Generation
to exploit the internal knowledge of LLMs, (b) Subgraph Exploration to generate
a trustworthy directed subgraph, (c) Evaluation-based Answering to integrate
internal and external knowledge.
Existing ways of combining LLMs with KGs still have limitations. 1)
Insufficient exploration of internal knowledge in LLMs. When exploring KGs,
most approaches primarily treat LLMs as agents to select relevant relationships
and entities, overlooking the potential of the internal knowledge. 2)
Constrained generation of trustworthy reasoning paths. Some methods have
attempted to generate highly interpretable reasoning paths, but they limit the
scale of path exploration and require additional memory. The generated paths
also lack intuitive visual interpretability. 3) Ambiguous fusion of internal
and external knowledge. How to better integrate the internal knowledge of LLMs
with the external knowledge in KGs still requires further exploration.
To overcome the above limitations, we propose KnowPath, a knowledge-enhanced
large model framework driven by the collaboration of internal and external
knowledge. Specifically, KnowPath consists of three stages. 1) Inference paths
generation. To fully exploit the internal knowledge of LLMs and to adapt to
zero-shot scenarios, this stage employs a prompt-driven approach to extract the
knowledge triples most relevant to the topic entities, and then generates
reasoning paths based on these knowledge triples to attempt answering the
question. 2) Trustworthy directed subgraph exploration. In this stage the LLM
combines the previously generated knowledge reasoning paths to select entities
and relationships, and then responds based on the subgraph formed by these
selections. This stage enables the LLMs to fully participate in the effective
construction of external knowledge, while providing a clear process for
constructing subgraphs. 3) Evaluation-based answering. At this stage, external
knowledge primarily guides KnowPath, while internal knowledge assists in
generating the answer. Our contributions can be summarized as follows:
• We focus on a new view, emphasizing the importance of the LLMs' powerful
  internal knowledge in knowledge question answering, via a prompt-based
  internal knowledge reasoning path generation method for LLMs.
• We build a knowledge-enhanced large model framework driven by the
  collaboration of internal and external knowledge. It not only integrates both
  the internal and external knowledge of the LLMs better, but also provides
  clearer and more trustworthy reasoning paths.
• Extensive experiments conducted on multiple knowledge question answering
  datasets demonstrate that our KnowPath significantly mitigates the
  hallucination problem in LLMs and outperforms the existing best.
2 KnowPath
2.1 Preliminary
Topic Entities represent the main entities in a query $Q$, denoted as $e_0$.
Each $Q$ contains $N$ topic entities $\{e_0^1, \dots, e_0^N\}$.
Inference Paths are a set of paths $P = \{p_1, \dots, p_L\}$ generated by the
LLM's own knowledge, where $L \in [1, N]$ is dynamically determined by the LLM
agent. Each path $p$ starts from a topic entity
$e_0 \in \{e_0^1, \dots, e_0^N\}$ and can be represented as
$p = e_0 \to r_1 \to e_1 \to \dots \to r_n \to e_n$, where $e_i$ and $r_i$
represent entities and relationships, respectively.
Knowledge Graph (KG) is composed of many structured knowledge triples:
$K = \{(e_h, r, e_t) \mid r \in R,\ e_h, e_t \in E\}$, where $E$ represents all
entities in the knowledge graph, $R$ represents all relationships, and $e_h$
and $e_t$ represent the head and tail entities, respectively.
KG Subgraph refers to a connected subgraph extracted from the knowledge graph
$K$, where the entities and relationships are entirely derived from $K$, i.e.,
$K_s \subseteq K$.
2.2 Inference Paths Generation
Due to the extensive world knowledge stored within their parameters, LLMs can
be considered a complementary representation of KGs. To fully excavate the
internal knowledge of LLMs and guide the exploration of KGs, we propose a
prompt-driven method to extract the internal knowledge of LLMs effectively. It
retrieves reasoning paths from the model's internal knowledge, clearly displays
the reasoning process, and is particularly effective in zero-shot scenarios.
Specifically, given a query $Q$, we first guide the LLM to extract the most
relevant topic entities $\{e_0^1, \dots, e_0^N\}$ through a specially designed
prompt. Then, based on these topic entities, the large model is instructed to
generate a set of knowledge triples associated with them. The number of triples
$n$ is variable. Finally, the LLM attempts to answer based on the previously
generated knowledge triples and provides a specific reasoning path from
entities and relations to the answer. Each path is in the form of
$p = e_0^1 \to r_1 \to e_1 \to \dots \to r_n \to e_n$. The details of the
Inference Paths Generation process are presented in Appendix A.1.
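To make the path format concrete, the following minimal Python sketch splits a textual inference path of the form e0 -> r1 -> e1 -> ... into (head, relation, tail) triples. The function name `parse_inference_path` and the "->" separator are illustrative assumptions, not part of the paper's implementation; the first example path is the one shown in Figure 2. Real LLM output may not alternate strictly between entities and relations, in which case this sketch raises an error rather than guessing.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def parse_inference_path(path: str, sep: str = "->") -> List[Triple]:
    """Split 'e0 -> r1 -> e1 -> ... -> rn -> en' into (head, relation, tail) triples.

    Hypothetical helper: assumes a strict entity/relation alternation.
    """
    parts = [p.strip() for p in path.split(sep) if p.strip()]
    # A well-formed path has an odd number of elements: e0, r1, e1, ..., rn, en.
    if len(parts) < 3 or len(parts) % 2 == 0:
        raise ValueError("expected entity, relation, entity, ... (odd length >= 3)")
    return [(parts[i], parts[i + 1], parts[i + 2]) for i in range(0, len(parts) - 2, 2)]


# Path taken from the Figure 2 example.
figure2_path = "Whistler Mountain->located_in->British Columbia"
figure2_triples = parse_inference_path(figure2_path)

# Generic two-hop path with placeholder names.
generic_triples = parse_inference_path("A -> r1 -> B -> r2 -> C")
```

Each consecutive pair of hops shares an entity, so the tail of one triple is the head of the next.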
2.3 Subgraph Exploration
Exploration Initialization. KnowPath performs subgraph exploration for a
maximum of $D$ rounds. Each round corresponds to an additional hop in the
knowledge graph $K$, and the $j$-th round contains $N$ subgraphs
$\{K_{s,j}^1, \dots, K_{s,j}^N\}$. Each subgraph $K_{s,j}^i$ is composed of a
set of knowledge graph reasoning paths, i.e.
$K_{s,j}^i = \{p_{1,j}^i \cup \dots \cup p_{l,j}^i\},\ i \in [1, N]$. The
number of reasoning paths $l$ is flexibly determined by the LLM agent. Taking
the $D$-th round and the $z$-th path as an example, it starts exploration from
one topic entity $e_0^i$ and ultimately forms a connected subgraph of the KG,
denoted as
$p_{z,D}^i = \{e_0^i, r_{1,z}^i, e_{1,z}^i, r_{2,z}^i, e_{2,z}^i, \dots, r_{D,z}^i, e_{D,z}^i\}$.
At the start of the first round of subgraph exploration ($D = 0$), each path
$p^i$ consists only of the current topic entity, i.e. $p_{z,0}^i = \{e_0^i\}$.
Relation Exploration. Relation exploration aims to expand the subgraphs
obtained in each round of exploration, enabling deep reasoning. Specifically,
for the $i$-th subgraph and the $j$-th round of subgraph exploration, the
candidate entities are denoted as
$E_j^i = \{e_{j-1,1}^i, \dots, e_{j-1,l}^i\}$, where $e_{j-1,1}^i$ is the tail
entity of the reasoning path $p_{1,j-1}^i$. Based on these candidates $E_j^i$,
we search for all corresponding single-hop relations in the knowledge graph
$K$, denoted as $R_{a,j}^i = \{r_1, \dots, r_M\}$, where $M$ is determined by
the specific knowledge graph $K$. Finally, the LLM agent relies on the query
$Q$, the inference path $P$ generated through the LLM's internal knowledge
(Section 2.2), and all topic entities $e_0$ to select the most relevant
candidate relations from $R_{a,j}^i$, denoted as $R_j^i \subseteq R_{a,j}^i$,
whose size is dynamically determined by the LLM agent.
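The relation exploration step can be illustrated with a small in-memory sketch. Everything here is an assumption for illustration: the toy triples are borrowed from the Figure 2 example rather than from Freebase, and the LLM's selection is replaced by a hypothetical scoring rule that prefers relations already present in the inference path.

```python
from typing import Callable, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

# Toy knowledge graph; the triples are those listed in the Figure 2 example.
TOY_KG: List[Triple] = [
    ("Whistler Mountain", "located_in", "British Columbia"),
    ("Whistler Mountain", "part_of", "Coast Mountains"),
    ("Demi Lovato Summer Tour 2009", "held_in", "United States"),
    ("Demi Lovato Summer Tour 2009", "performer", "Demi Lovato"),
]


def candidate_relations(kg: Iterable[Triple], entities: Iterable[str]) -> List[str]:
    """Collect every single-hop relation touching a candidate entity (R_a)."""
    wanted = set(entities)
    found: Set[str] = set()
    for head, rel, tail in kg:
        if head in wanted or tail in wanted:
            found.add(rel)
    return sorted(found)


def select_relations(all_rels: List[str], score: Callable[[str], float], keep: int) -> List[str]:
    """Stand-in for the LLM agent: keep the `keep` highest-scoring relations (R)."""
    return sorted(all_rels, key=score, reverse=True)[:keep]


# Relations that appear in the (internally generated) inference path.
path_relations = {"located_in"}
r_all = candidate_relations(TOY_KG, ["Whistler Mountain"])
r_selected = select_relations(r_all, lambda r: 1.0 if r in path_relations else 0.0, keep=1)
```

In KnowPath the selection is made by prompting the LLM with the query, the inference paths, and the topic entities; the scoring lambda only mimics the idea that the inference path biases which relations are kept.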
Entity Exploration. Entity exploration depends on the already determined
candidate entities and candidate relations. Taking the $i$-th subgraph and the
$j$-th round of subgraph exploration as an example, relying on $E_j^i$ and
$R_j^i$, we perform queries like $(e, r, ?)$ or $(?, r, e)$ on the knowledge
graph $K$ to retrieve the corresponding entities
$E_{a,j}^i = \{e_1, \dots, e_N\}$, where $N$ varies depending on the knowledge
graph $K$. Then, the agent also considers the query $Q$, the inference path $P$
from Section 2.2, the topic entity $e_0^i$, and the candidate relation set
$R_j^i$ to select from $E_{a,j}^i$ the most relevant entity set
$E_{j+1}^i = \{e_{j,1}^i, \dots, e_{j,l}^i\} \subseteq E_{a,j}^i$. Note that
$e_{j,1}^i$ is the tail entity of the reasoning path $p_{1,j}^i$.
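The two query shapes, (e, r, ?) and (?, r, e), can be sketched over a list of triples as follows. This is an illustrative stand-in for a real Freebase lookup: the function name `explore_entities`, the returned tuple layout, and the toy triples (taken from the Figure 2 example) are assumptions.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

TOY_KG: List[Triple] = [
    ("Whistler Mountain", "located_in", "British Columbia"),
    ("Whistler Mountain", "part_of", "Coast Mountains"),
]


def explore_entities(
    kg: Iterable[Triple], entities: Iterable[str], relations: Iterable[str]
) -> List[Tuple[str, str, str, bool]]:
    """Return (source, relation, neighbour, source_is_head) for every match.

    source_is_head is True for an (e, r, ?) match and False for a (?, r, e) match.
    """
    ents, rels = set(entities), set(relations)
    matches: List[Tuple[str, str, str, bool]] = []
    for head, rel, tail in kg:
        if rel not in rels:
            continue
        if head in ents:  # query (e, r, ?): the candidate is the head
            matches.append((head, rel, tail, True))
        if tail in ents:  # query (?, r, e): the candidate is the tail
            matches.append((tail, rel, head, False))
    return matches


forward = explore_entities(TOY_KG, ["Whistler Mountain"], ["located_in"])
backward = explore_entities(TOY_KG, ["Coast Mountains"], ["part_of"])
```

Keeping the direction flag alongside each match is what later allows the reasoning path to be extended with correctly oriented edges.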
Subgraph Update. Relation exploration determines entity exploration, and we
update the subgraph only after completing the entity exploration. Specifically,
for the $i$-th subgraph and the $j$-th round of subgraph exploration, we append
the result of the exploration $(, r, e_{j,1}^i)$ to the path $p_{1,j}^i$ in the
subgraph $K_{s,j}^i$. This path update algorithm not only considers the
directionality of entities and relations, but also automatically determines and
updates the paths. Its detailed process is described in Algorithm 2. The final
subgraph can be flexibly expanded due to the variable number of paths $l$.

Algorithm 1 Subgraph Exploration
Require: entityDict, entityName, question, maxWidth, depth, path
  Set originalPath as path
  if depth = 0 then
      Initialize path as [ ] * maxWidth
  end if
  for eid in entityDict do
      Find relevantRelations
      for relation in relevantRelations do
          Find entities linked by relation
      end for
  end for
  Extract relevantEntities using candidate entities
  Update path and entityDict based on relevance
  extraPath ← (path − originalPath)
  return extraPath, entityDict

Algorithm 2 Update Reasoning Path in Subgraph
Require: path, pathIsHead, isHead, r, e
  if not pathIsHead then
      if not isHead then
          newPath ← path + [←, r, ←, e]
      else
          newPath ← path + [→, r, →, e]
      end if
  else
      if not isHead then
          newPath ← [e, →, r, →] + path
      else
          newPath ← [e, ←, r, ←] + path
      end if
  end if
  Append newPath to path
  return path
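The branching in Algorithm 2 can be written as a short Python function. This is a sketch under assumptions: paths are modelled as flat lists of strings, arrows are the strings "->" and "<-", and the function returns the new path instead of appending it to a collection as the final step of Algorithm 2 does. The meaning of the two flags follows the pseudocode only; their exact semantics in the authors' code are not specified in the text.

```python
from typing import List

RIGHT, LEFT = "->", "<-"


def update_path(path: List[str], path_is_head: bool, is_head: bool, r: str, e: str) -> List[str]:
    """Extend `path` with relation `r` and entity `e`, mirroring Algorithm 2.

    path_is_head=False: the new hop is appended at the end of the path.
    path_is_head=True:  the new hop is prepended at the start of the path.
    is_head decides the arrow direction within each case.
    """
    if not path_is_head:
        arrow = RIGHT if is_head else LEFT
        return path + [arrow, r, arrow, e]
    arrow = LEFT if is_head else RIGHT
    return [e, arrow, r, arrow] + path


# Appending a forward hop to a path that currently holds only a topic entity.
appended = update_path(["Whistler Mountain"], False, True, "located_in", "British Columbia")

# Prepending a hop in front of an existing path.
prepended = update_path(["British Columbia"], True, False, "located_in", "Whistler Mountain")
```

Because the arrows are stored inside the path, the resulting subgraph remains directed and can be rendered directly as a readable reasoning chain.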
2.4 Evaluation-based Answering
After completing the subgraph update for each round, the agent attempts to
answer the query through the subgraph $\{K_{s,j}^1, \dots, K_{s,j}^N\}$. If it
determines that the current subgraph is insufficient to answer the question,
the next round of subgraph exploration is executed, until the maximum
exploration depth $D$ is reached. Otherwise, it outputs the final answer along
with the corresponding interpretable directed subgraph. Unlike previous work
[Chen et al., 2024], even if no answer is found at the maximum exploration
depth, KnowPath relies on the inference path $P$ to respond. The framework of
KnowPath is shown in Figure 2.
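The control flow described above (explore one round, try to answer, fall back to the inference paths at maximum depth) can be sketched as a loop. The three callables are hypothetical placeholders for the LLM-backed steps and are not the authors' interfaces; the default depth of 3 follows the setting reported in Section 3.3.

```python
from typing import Callable, List, Optional, Tuple

Path = List[str]


def evaluation_based_answering(
    question: str,
    inference_paths: List[str],
    explore_round: Callable[[str, List[str], List[Path], int], List[Path]],
    try_answer: Callable[[str, List[Path]], Optional[str]],
    answer_from_paths: Callable[[str, List[str]], str],
    max_depth: int = 3,
) -> Tuple[str, List[Path]]:
    """Explore round by round; answer as soon as the subgraph is sufficient."""
    subgraph: List[Path] = []
    for depth in range(1, max_depth + 1):
        # One round = one relation exploration plus one entity exploration.
        subgraph = subgraph + explore_round(question, inference_paths, subgraph, depth)
        answer = try_answer(question, subgraph)  # None means "insufficient"
        if answer is not None:
            return answer, subgraph
    # Maximum depth reached without an answer: rely on the inference paths.
    return answer_from_paths(question, inference_paths), subgraph


# Stub callables used only to exercise the control flow.
def _demo_explore(q: str, paths: List[str], subgraph: List[Path], depth: int) -> List[Path]:
    return [[f"hop{depth}"]]


def _demo_try(q: str, subgraph: List[Path]) -> Optional[str]:
    return "found" if len(subgraph) >= 2 else None


def _demo_fallback(q: str, paths: List[str]) -> str:
    return "fallback"


early = evaluation_based_answering("q", [], _demo_explore, _demo_try, _demo_fallback)
exhausted = evaluation_based_answering("q", [], _demo_explore, lambda q, s: None, _demo_fallback)
```

The early exit is what keeps the number of LLM calls low on easy questions, while the fallback guarantees that a response is always produced.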
Page 4:
Method                                CWQ        WebQSP     Simple Questions  WebQuestions
LLM only
IO prompt [Brown et al., 2020]        37.6±0.8   63.3±1.2   20.0±0.5          48.7±1.4
CoT [Wei et al., 2022]                38.8±1.5   62.2±0.7   20.5±0.4          49.1±0.9
RoG w/o planning [Luo et al., 2024b]  43.0±0.9   66.9±1.3   -                 -
SC [Wang et al., 2022]                45.4±1.1   61.1±0.5   18.9±0.6          50.3±1.2
Fine-Tuned KG Enhanced LLM
UniKGQA [Jiang et al., 2022]          51.2±1.0   75.1±0.8   -                 -
RE-KBQA [Cao et al., 2023]            50.3±1.2   74.6±1.0   -                 -
ChatKBQA [Luo et al., 2024a]          76.5±1.3   78.1±1.1   85.8±0.9          55.1±0.6
RoG [Luo et al., 2024b]               64.5±0.7   85.7±1.4   73.3±0.8          56.3±1.0
Prompting KG Enhanced LLM with GPT3.5
StructGPT [Jiang et al., 2023]        54.3±1.0   72.6±1.2   50.2±0.5          51.3±0.9
ToG [Sun et al., 2024]                57.1±1.5   76.2±0.8   53.6±1.0          54.5±0.7
PoG [Chen et al., 2024]               63.2±1.0   82.0±0.9   58.3±0.6          57.8±1.2
KnowPath (Ours)                       67.9±0.6   84.1±1.3   61.5±0.8          60.0±1.0
Prompting KG Enhanced LLM with DeepSeek-V3
ToG [Sun et al., 2024]                60.9±0.7   82.6±1.0   59.7±0.9          57.9±0.8
PoG [Chen et al., 2024]               68.3±1.1   85.3±0.9   63.9±0.5          61.2±1.3
KnowPath (Ours)                       73.5±0.9   89.0±0.8   65.3±1.0          64.0±0.7

Table 1: Hits@1 scores (%) of different models on four datasets under various
knowledge-enhanced methods. We use GPT-3.5 Turbo and DeepSeek-V3 as the primary
backbones. Bold text indicates the results achieved by our method.
3 Experimental Setup
3.1 Baselines
We chose advanced baselines for comparison from the three main paradigms of
existing knowledge-based question answering. 1) The first is the LLM-only
paradigm, including the standard prompt (IO prompt [Brown et al., 2020]), the
chain-of-thought prompt (CoT [Wei et al., 2022]), self-consistency (SC [Wang et
al., 2022]), and RoG without planning (RoG w/o planning [Luo et al., 2024b]).
2) The second is the KG-enhanced fine-tuned LLMs, which include ChatKBQA [Luo
et al., 2024a], RoG [Luo et al., 2024b], UniKGQA [Jiang et al., 2022], and
RE-KBQA [Cao et al., 2023]. 3) The third is the KG-enhanced prompt-based LLMs,
including Think-on-Graph (ToG [Sun et al., 2024]), Plan-on-Graph (PoG [Chen et
al., 2024]), and StructGPT [Jiang et al., 2023]. Unlike the second, this scheme
no longer requires fine-tuning and has become a widely researched mode today.
3.2 Datasets and Metrics
Datasets. We adopt four knowledge-based question answering datasets: the
single-hop Simple Questions [Bordes et al., 2015], the complex multi-hop CWQ
[Talmor and Berant, 2018] and WebQSP [Yih et al., 2016], and the open-domain
WebQuestions [Berant et al., 2013].
Metrics. Following previous research [Chen et al., 2024], we apply exact match
accuracy (Hits@1) for evaluation.
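As a reference point for the metric, a minimal sketch of Hits@1 as exact-match accuracy is given below. The normalization (lower-casing and whitespace stripping) and the convention that a prediction counts as a hit when it equals any gold answer are assumptions; the paper does not spell out these details.

```python
from typing import List, Sequence


def _normalize(text: str) -> str:
    # Assumed normalization: case-insensitive, surrounding whitespace ignored.
    return text.strip().lower()


def hits_at_1(predictions: Sequence[str], gold_answers: Sequence[List[str]]) -> float:
    """Percentage of questions whose top prediction exactly matches a gold answer."""
    if len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold_answers must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        1
        for pred, golds in zip(predictions, gold_answers)
        if _normalize(pred) in {_normalize(g) for g in golds}
    )
    return 100.0 * hits / len(predictions)


# Two toy questions: the first prediction matches a gold answer, the second does not.
score = hits_at_1(["Dutch", "Paris"], [["dutch", "Nederlands"], ["London"]])
```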
3.3 Experiment Details
Following previous research [Chen et al., 2024], to control the overall costs,
the maximum subgraph exploration depth $D_{max}$ is set to 3. Since Freebase
[Bollacker et al., 2008] supports all the aforementioned datasets, we apply it
as the base graph for subgraph exploration, and we apply GPT-3.5-turbo-1106 and
DeepSeek-V3 as the base models. All experiments are deployed on four NVIDIA
A800-40G GPUs.
4 Result
4.1 Main results
We conducted comprehensive experiments on four widely
used knowledge-based question answering datasets. The ex-
perimental results are presented in Table 1, and four key find-
ings are outlined as follows:
KnowPath performs the best. KnowPath outperforms all the prompting-driven
KG-enhanced methods. For instance, on the multi-hop CWQ, regardless of the base
model used, KnowPath achieves a maximum improvement of about 13% in Hits@1. In
addition, KnowPath outperforms the LLM-only methods by a clear margin and
surpasses the majority of Fine-Tuned KG-Enhanced LLM methods. On the most
challenging open-domain question answering dataset, WebQuestions, KnowPath
achieves the best performance compared to strong baselines from other paradigms
(e.g., PoG 61.2% vs Ours 64.0%). This demonstrates KnowPath's ability to
enhance the factuality of LLMs in open-domain question answering, which is an
intriguing phenomenon worth further exploration.
KnowPath excels at complex multi-hop tasks. On both CWQ and WebQSP, KnowPath
outperforms the latest strong baseline PoG, achieving an average improvement of
approximately 5% and 2.9%, respectively. On WebQSP, DeepSeek-V3 with KnowPath
not only outperforms all Prompting-based KG-Enhanced LLMs but also surpasses
the strongest baseline RoG among Fine-Tuned KG-Enhanced LLMs (85.7% vs 89.0%).
On the more challenging multi-hop CWQ, the improvement of KnowPath over PoG is
significantly greater than the improvement on the simpler single-hop Simple
Questions (5.2% vs 1.4%). Together, these results indicate that KnowPath is
particularly effective when deep reasoning is required.

Method      CWQ    WebQSP  SimpleQA  WebQ
KnowPath    73.5   89.0    65.3      64.0
-w/o IPG    67.3   84.5    63.1      61.0
-w/o SE     64.7   83.1    60.4      60.7
Base        39.2   66.7    23.0      53.7

Table 2: Ablation experiment results on four knowledge-based question answering
tasks. IPG stands for the Inference Paths Generation module, while SE stands
for the Subgraph Exploration module.
Knowledge enhancement greatly aids factual question answering. When question
answering is based solely on LLMs, the performance is poor across multiple
tasks. For example, CoT achieves only about 20.5% Hits@1 on Simple Questions.
This is caused by the hallucinations inherent in LLMs. Whatever method is
applied to introduce the KGs, the KG-enhanced approaches significantly
outperform LLM-only. The maximum improvements across the four tasks are 35.9%,
27.9%, 46.4%, and 15.3%. These further emphasize the importance of introducing
knowledge graphs for generating correct answers.
The stronger the base, the higher the performance. As DeepSeek-V3 is better
than GPT-3.5, even though both are prompting-based knowledge-enhanced, their
performance on all tasks shows a significant difference after incorporating
KnowPath. Replacing GPT-3.5 with DeepSeek-V3, KnowPath achieved a maximum
improvement from 67.9% to 73.5% on CWQ, and on Simple Questions it improved by
at least 3.8%. These findings indicate that an improvement in the base model
directly drives an improvement in knowledge-based question answering.
KnowPath is a more flexible plugin. Compared to fine-
tuned knowledge-enhanced LLMs, our KnowPath does not
require fine-tuning of the LLM, yet it outperforms most of the
fine-tuned methods. In addition, on the CWQ dataset, Know-
Path with DeepSeek-V3 achieves performance that is very
close to the strongest baseline, ChatKBQA, which requires
fine-tuning for knowledge enhancement. On the WebQSP
dataset, it outperforms ChatKBQA by about 11% (78.1%
vs 89.0%). Overall, the resource consumption of KnowPath
is significantly lower than that of Fine-Tuned KG-Enhanced
LLMs. This is because KnowPath improves performance by
optimizing inference paths and enhancing knowledge integra-
tion, making it a more flexible and plug-and-play framework.
4.2 Ablation Study
We validate the effectiveness of each component of Know-
Path and quantify their contributions to performance. Its re-
sults are presented in Table 2, and visualized in Figure 3.
Each component contributes to the overall remarkable performance. Removing
either module lowers performance on every dataset. However, compared to the
base model, the addition of these modules still significantly improves the
overall performance.

Method     LLM Call  Total Token  Input Token
ToG        22.6      9669.4       8182.9
PoG        16.3      8156.2       7803.0
KnowPath   9.9       2742.4       2368.9

Table 3: Cost-effectiveness analysis on the CWQ dataset between our KnowPath
and the strong prompt-driven knowledge-enhanced baselines (ToG and PoG). The
Total Token includes two parts: the total number of tokens from multiple input
prompts and the total number of tokens from the intermediate results returned
by the LLM. The Input Token represents only the total number of tokens from the
multiple input prompts. The LLM Call refers to the total number of accesses to
the LLM agent.

Figure 3: Comparison of KnowPath, its individual components, and strong
baseline methods (ToG and PoG) on the performance across four commonly used
knowledge-based question answering datasets.
It is necessary to focus on the powerful internal knowl-
edge of LLMs. Eliminating the Subgraph Exploration and
relying solely on the internal knowledge mining of LLMs to
generate reasoning paths and provide answers proves to be
highly effective. It has shown significant improvement across
all four datasets, with an average performance enhancement
of approximately 21.6%. The most notable improvement was
observed on SimpleQA, where performance leaped from 23%
to 60.4%. This indicates that even without the incorporation
of external knowledge graphs, the performance of the model
in generating factual responses can be enhanced to a certain
extent through internal mining methods. However, without
the guidance of internal knowledge reasoning paths, Know-
Path has seen some performance decline across all tasks, es-
pecially in complex multi-hop CWQ and WebQSP.
The most critical component, trustworthy directed Subgraph Exploration, is
depth-sensitive. Removing the subgraph exploration leads to a significant
decline in KnowPath across all tasks, averaging a drop of approximately 5.7%.
This performance dip is particularly pronounced in complex multi-hop tasks. For
instance, on CWQ, KnowPath without subgraph exploration experiences a nearly 9%
decrease.

Figure 4: Visualization of the cost-effectiveness analysis on four public
knowledge-based question-answering datasets.

Figure 5: Analysis of key parameters. (a) Exploration temperature. (b) The
count of triples.
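The averages quoted in the ablation discussion can be re-derived directly from the Table 2 figures. The short script below is only an arithmetic check; the dictionary layout and the helper name `mean_gap` are illustrative.

```python
from typing import Dict, List

# Hits@1 from Table 2, in the order CWQ, WebQSP, SimpleQA, WebQ.
TABLE2: Dict[str, List[float]] = {
    "KnowPath": [73.5, 89.0, 65.3, 64.0],
    "w/o IPG": [67.3, 84.5, 63.1, 61.0],
    "w/o SE": [64.7, 83.1, 60.4, 60.7],
    "Base": [39.2, 66.7, 23.0, 53.7],
}


def mean_gap(higher: List[float], lower: List[float]) -> float:
    """Average per-dataset difference, in percentage points."""
    return sum(h - l for h, l in zip(higher, lower)) / len(higher)


# Internal knowledge only (w/o SE) versus the base model: about 21.6 points.
gain_internal_only = mean_gap(TABLE2["w/o SE"], TABLE2["Base"])

# Full KnowPath versus KnowPath without subgraph exploration: about 5.7 points.
drop_without_se = mean_gap(TABLE2["KnowPath"], TABLE2["w/o SE"])

# Largest single drop without subgraph exploration, on CWQ: about 8.8 points.
cwq_drop = TABLE2["KnowPath"][0] - TABLE2["w/o SE"][0]
```

Note that these gaps are differences in Hits@1, so they are percentage points rather than relative percentages.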
4.3 Cost-effectiveness Analysis
To explore the cost-effectiveness of KnowPath while main-
taining high accuracy, we conducted a cost-benefit analysis.
In this experiment, we tracked the primary sources of cost,
including the LLM Call, Input Token, and Total Token usage.
The results are presented in the Table 3, and are visualized in
Figure 4. Our key findings are described as follows:
The number of accesses to the LLM agent was significantly reduced.
Specifically, the LLM calls for ToG and PoG were 2.28x and 1.64x those of
KnowPath, respectively. This exceptionally low cost can be attributed to the
fact that the Subgraph Exploration does not limit the scale of the path search,
and this can be broken down into three key reasons. First, in each round of
subgraph exploration, only one relation exploration and one entity exploration
are conducted. Second, the Evaluation-based Answering only accesses the LLM
once after each round of subgraph exploration to judge whether the current
subgraph can answer the question. If it cannot, the next round is performed.
Third, if the largest explored subgraph still cannot answer the question,
KnowPath will rely on the Inference Paths Generation.
The number of tokens used is reduced several-fold. Whether in Total Token or
Input Token, KnowPath saves approximately 4.0x compared to ToG and PoG. This is
mainly because all the prompts used in KnowPath are based on a carefully
designed zero-shot approach, rather than the in-context learning used by
previous methods, which requires providing a large context to ensure the
factuality of the answers. We explored the reasons behind this difference.
First, previous methods rely on more contextual information for in-context
learning to ensure the correctness of the output. Second, KnowPath fully
leverages the powerful internal relevant knowledge and uses it as the input
signal for the agent. This not only provides more contextual reference but also
significantly improves the accuracy and efficiency of relation and entity
exploration in subgraph exploration, ensuring that the generated subgraph is
highly relevant while enabling the most effective reasoning toward potential
answers.
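The cost ratios can be recomputed from the CWQ figures in Table 3. The script below is an arithmetic check only, with an illustrative helper name. On these CWQ numbers alone the token ratios come out at roughly 3.0x to 3.5x, so the larger multiple quoted in the text may reflect the other datasets shown in Figure 4, which are not tabulated here.

```python
from typing import Dict, Tuple

# (LLM Call, Total Token, Input Token) on CWQ, from Table 3.
TABLE3: Dict[str, Tuple[float, float, float]] = {
    "ToG": (22.6, 9669.4, 8182.9),
    "PoG": (16.3, 8156.2, 7803.0),
    "KnowPath": (9.9, 2742.4, 2368.9),
}


def cost_ratios(baseline: str, ours: str = "KnowPath") -> Tuple[float, ...]:
    """Baseline cost divided by our cost for each of the three columns."""
    return tuple(b / o for b, o in zip(TABLE3[baseline], TABLE3[ours]))


tog_calls, tog_total, tog_input = cost_ratios("ToG")
pog_calls, pog_total, pog_input = cost_ratios("PoG")

# Relative reduction in LLM calls versus ToG, as a fraction of ToG's calls.
call_reduction_vs_tog = 1.0 - TABLE3["KnowPath"][0] / TABLE3["ToG"][0]
```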
4.4 Parameter analysis
We analyze the key parameters that affect the performance of
KnowPath on the WebQSP, and discuss the following issues:
What is the impact of the temperature in Subgraph Exploration? We explore the
optimal temperature from 0.2 to 1, and the relation between it and Hits@1 is
shown in Figure 5a. During subgraph exploration, variations in the temperature
affect the divergence of the model's generated answers. A lower temperature
negatively impacts KnowPath's performance, as the model generates overly
conservative answers with insufficient knowledge, while the LLM relies on its
internal knowledge when exploring and selecting entities and relationships. A
higher temperature also harms KnowPath, as the divergent answers may deviate
from the given candidates. Extensive experiments show that 0.4 is the optimal
temperature, consistent with other existing works [Chen et al., 2024].
How is the count of knowledge triples determined in Inference Paths Generation?
We explored it with a step size of 15, and the relationship between the count
of knowledge triples and Hits@1 is shown in Figure 5b. When the count is 0,
KnowPath's performance is poor due to the lack of internal knowledge
exploration. When the count is too large, such as 45, its performance is also
suboptimal, as excessive exploration introduces irrelevant knowledge as
interference. Extensive experiments show that 15 is the optimal count.
Figure 6: The case study on the multi-hop CWQ and open-domain WebQuestions
datasets. (a) A case from CWQ (question: "What text in the religion which
include Zhang Jue as a key figure is considered to be sacred?"). (b) A case
from WebQuestions (question: "who won the league cup in 2002?"). To provide a
clear and vivid comparison with the strong baselines (ToG and PoG), we
visualized the execution process of KnowPath.
4.5 Case Study
To provide a clear and vivid comparison with the strong baselines, we
visualized the execution process of KnowPath, as shown in Figure 6. In the CWQ
case, ToG and PoG can only extract context from the question, failing to gather
enough accurate knowledge for a correct answer, thus producing the incorrect
answer "Taiping Jing." In contrast, KnowPath uncovers large model reasoning
paths that provide additional, sufficient information. This enables key nodes,
such as "Taoism," to be identified during subgraph exploration, ultimately
leading to the correct answer, "Zhuang Zhou." In the WebQuestions case, ToG is
unable to answer the question due to insufficient information. Although PoG
provides a reasoning chain, the knowledge derived from the reasoning process is
inaccurate, and the final answer still relies on the reasoning of the large
model, resulting in the incorrect answer "Blackburn Rovers." In contrast,
guided by the inference path, KnowPath accurately identified the relationship
"time.event.instance_of_recurring_event" and, through reasoning with the node
"2002-03-Football League Cup," ultimately arrived at the correct result node
"Liverpool F.C." Overall, KnowPath not only provides answers but also generates
directed subgraphs, which serve as the foundation for trustworthy reasoning and
significantly enhance the interpretability of the results.
5 Related Work
Prompt-driven LLM inference.
Chain of Thought (CoT) [Wei et al., 2022] effectively improves the reasoning ability of large models, enhancing performance on complex tasks with minimal contextual prompts. Self-Consistency (SC) [Wang et al., 2022] samples multiple reasoning paths and selects the most consistent answer, with further improvements seen in DIVERSE [Li et al., 2022] and Vote Complex [Fu et al., 2022]. Other methods have explored CoT enhancements in zero-shot scenarios [Kojima et al., 2022; Chung et al., 2024]. However, reasoning based solely on the model's own knowledge still suffers from significant hallucination issues, which remain unresolved.
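The selection step of Self-Consistency can be sketched as a majority vote over the final answers of independently sampled reasoning paths. The snippet below is a minimal illustration under stated assumptions, not the original implementation: the function name self_consistent_answer and the whitespace and lowercase normalization are choices made for this example, and the sampled answers are invented.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Return the most frequent final answer among sampled reasoning paths.

    Answers are stripped and lowercased before counting (an assumption of
    this sketch). Returns None when no non-empty answer was sampled.
    """
    normalized = [a.strip().lower() for a in sampled_answers if a.strip()]
    if not normalized:
        return None
    answer, _count = Counter(normalized).most_common(1)[0]
    return answer


# Hypothetical final answers extracted from three sampled reasoning paths.
samples = ["Zhuang Zhou", "Taiping Jing", "zhuang zhou "]
print(self_consistent_answer(samples))
```

The vote only aggregates what the model already believes, so a consistently hallucinated answer still wins, which is the limitation noted above.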
KG-enhanced LLM inference.
Early works enhanced model knowledge understanding by injecting KGs into model parameters through fine-tuning or retraining [Cao et al., 2023; Jiang et al., 2022; Yang et al., 2024]. ChatKBQA [Luo et al., 2024a] and RoG [Luo et al., 2024b] utilize fine-tuned LLMs to generate logical forms. StructGPT [Jiang et al., 2023], based on the RAG approach, retrieves information from KGs for question answering. ToG [Sun et al., 2024] and PoG [Chen et al., 2024] involve LLMs in knowledge graph reasoning, using them as agents to assist in selecting entities and relations during exploration. Despite achieving strong performance, these methods still face challenges such as insufficient internal knowledge mining and the inability to generate trustworthy reasoning paths.
6 Conclusion
In this paper, to enhance the ability of LLMs to provide factual answers, we propose KnowPath, a knowledge-enhanced reasoning framework driven by the collaboration of internal and external knowledge. It leverages the reasoning paths generated from the extensive internal knowledge of LLMs to guide trustworthy directed subgraph exploration over knowledge graphs. Extensive experiments show that: 1) KnowPath achieves the best results among the compared methods and excels at complex multi-hop tasks. 2) It is remarkably cost-effective, with a 55% reduction in the number of LLM calls and a 75% decrease in the number of tokens consumed compared to the strong baselines. 3) KnowPath can explore directed subgraphs of the KGs, providing an intuitive and trustworthy reasoning process that greatly enhances overall interpretability.
References
[Alberts et al., 2023] Ian L Alberts, Lorenzo Mercolli, Thomas Pyka, George Prenosil, Kuangyu Shi, Axel Rominger, and Ali Afshar-Oromieh. Large language models (LLM) and ChatGPT: what will the impact on nuclear medicine be? European journal of nuclear medicine and molecular imaging, 50(6):1549–1552, 2023.
[Berant et al., 2013] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1533–1544, 2013.
[Bollacker et al., 2008] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250, 2008.
[Bordes et al., 2015] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075, 2015.
[Brown et al., 2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[Cao et al., 2023] Yong Cao, Xianzhi Li, Huiwen Liu, Wen Dai, Shuai Chen, Bin Wang, Min Chen, and Daniel Hershcovich. Pay more attention to relation exploration for knowledge base question answering. arXiv preprint arXiv:2305.02118, 2023.
[Chen et al., 2024] Liyi Chen, Panrong Tong, Zhongming Jin, Ying Sun, Jieping Ye, and Hui Xiong. Plan-on-graph: Self-correcting adaptive planning of large language model on knowledge graphs. CoRR, abs/2410.23875, 2024.
[Chung et al., 2024] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
[Dong et al., 2023] Xiangjue Dong, Yibo Wang, Philip S Yu, and James Caverlee. Probing explicit and implicit gender bias through LLM conditional text generation. arXiv preprint arXiv:2311.00306, 2023.
[Fu et al., 2022] Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. Complexity-based prompting for multi-step reasoning. In The Eleventh International Conference on Learning Representations, 2022.
[Guo et al., 2024] Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024, Jeju, South Korea, August 3-9, 2024, pages 8048–8057. ijcai.org, 2024.
[Hu et al., 2024] Yuxuan Hu, Gemju Sherpa, Lan Zhang, Weihua Li, Quan Bai, Yijun Wang, and Xiaodan Wang. An LLM-enhanced agent-based simulation tool for information propagation. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024, Jeju, South Korea, August 3-9, 2024, pages 8679–8682. ijcai.org, 2024.
[Huang et al., 2024] Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. Understanding the planning of LLM agents: A survey. arXiv preprint arXiv:2402.02716, 2024.
[Jiang et al., 2022] Jinhao Jiang, Kun Zhou, Wayne Xin Zhao, and Ji-Rong Wen. UniKGQA: Unified retrieval and reasoning for solving multi-hop question answering over knowledge graph. arXiv preprint arXiv:2212.00959, 2022.
[Jiang et al., 2023] Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. StructGPT: A general framework for large language model to reason over structured data. arXiv preprint arXiv:2305.09645, 2023.
[Jung et al., 2024] Sung Jae Jung, Hajung Kim, and Kyoung Sang Jang. LLM based biological named entity recognition from scientific literature. In 2024 IEEE International Conference on Big Data and Smart Computing (BigComp), pages 433–435. IEEE, 2024.
[Kojima et al., 2022] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
[Li et al., 2022] Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. On the advance of making language models better reasoners. arXiv preprint arXiv:2206.02336, 2022.
[Li et al., 2024] Johnny Li, Saksham Consul, Eda Zhou, James Wong, Naila Farooqui, Yuxin Ye, Nithyashree Manohar, Zhuxiaona Wei, Tian Wu, Ben Echols, et al. Banishing LLM hallucinations requires rethinking generalization. arXiv preprint arXiv:2406.17642, 2024.
[Luo et al., 2024a] Haoran Luo, Haihong E, Zichen Tang, Shiyao Peng, Yikai Guo, Wentai Zhang, Chenghao Ma, Guanting Dong, Meina Song, Wei Lin, Yifan Zhu, and Anh Tuan Luu. ChatKBQA: A generate-then-retrieve framework for knowledge base question answering with fine-tuned large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 2039–2056. Association for Computational Linguistics, 2024.
[Luo et al., 2024b] Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. Reasoning on graphs: Faithful and interpretable large language model reasoning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
[Ma et al., 2024] Shengjie Ma, Chengjin Xu, Xuhui Jiang, Muzhi Li, Huaren Qu, and Jian Guo. Think-on-graph 2.0: Deep and interpretable large language model reasoning with knowledge graph-guided retrieval. arXiv e-prints, pages arXiv–2407, 2024.
[Sun et al., 2024] Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel M. Ni, Heung-Yeung Shum, and Jian Guo. Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
[Talmor and Berant, 2018] Alon Talmor and Jonathan Berant. The web as a knowledge-base for answering complex questions. In Marilyn Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 641–651, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.
[Wang et al., 2022] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
[Wang et al., 2024] Ziao Wang, Xiaofeng Zhang, and Hongwei Du. Beyond what if: Advancing counterfactual text generation with structural causal modeling. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024, Jeju, South Korea, August 3-9, 2024, pages 6522–6530. ijcai.org, 2024.
[Wei et al., 2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
[Xu et al., 2024] Yao Xu, Shizhu He, Jiabei Chen, Zihao Wang, Yangqiu Song, Hanghang Tong, Guang Liu, Kang Liu, and Jun Zhao. Generate-on-graph: Treat LLM as both agent and KG in incomplete knowledge graph question answering. arXiv preprint arXiv:2404.14741, 2024.
[Yang et al., 2023] Linyao Yang, Hongyang Chen, Zhao Li, Xiao Ding, and Xindong Wu. ChatGPT is not enough: Enhancing large language models with knowledge graphs for fact-aware language modeling. arXiv preprint arXiv:2306.11489, 2023.
[Yang et al., 2024] Linyao Yang, Hongyang Chen, Zhao Li, Xiao Ding, and Xindong Wu. Give us the facts: Enhancing large language models with knowledge graphs for fact-aware language modeling. IEEE Transactions on Knowledge and Data Engineering, 2024.
[Yih et al., 2016] Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. The value of semantic parse labeling for knowledge base question answering. In Katrin Erk and Noah A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 201–206, Berlin, Germany, August 2016. Association for Computational Linguistics.
[Yin et al., 2022] Da Yin, Li Dong, Hao Cheng, Xiaodong Liu, Kai-Wei Chang, Furu Wei, and Jianfeng Gao. A survey of knowledge-intensive NLP with pre-trained language models. arXiv preprint arXiv:2202.08772, 2022.
[Zhao et al., 2024] Ruilin Zhao, Feng Zhao, Long Wang, Xianzhi Wang, and Guandong Xu. KG-CoT: Chain-of-thought prompting of large language models over knowledge graphs for knowledge-aware question answering. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024, Jeju, South Korea, August 3-9, 2024, pages 6642–6650. ijcai.org, 2024.