Paper 2502.12029

KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs

Authors: Qi Zhao, Hongyu Yang, Qi Song, Xinwei Yao, Xiangyang Li

Published: 2025-02-17

Abstract:

Large language models (LLMs) have demonstrated remarkable capabilities in various complex tasks, yet they still suffer from hallucinations. Introducing external knowledge, such as knowledge graphs, can enhance the LLMs' ability to provide factual answers, and LLMs are able to explore knowledge graphs interactively. However, most approaches are limited by insufficient excavation of the LLM's internal knowledge, limited generation of trustworthy knowledge reasoning paths, and a vague integration of internal and external knowledge. Therefore, we propose KnowPath, a knowledge-enhanced large model framework driven by the collaboration of internal and external knowledge. It relies on the internal knowledge of the LLM to guide the exploration of interpretable directed subgraphs in external knowledge graphs, better integrating the two knowledge sources for more accurate reasoning. Extensive experiments on multiple real-world datasets confirm the superiority of KnowPath.

Paper Content:
Qi Zhao1, Hongyu Yang1, Qi Song1∗, Xinwei Yao2, Xiangyang Li1
1 University of Science and Technology of China, Hefei, Anhui, China
2 Zhejiang University of Technology, Hangzhou, Zhejiang, China
{zq2021, hongyuyang}@mail.ustc.edu.cn, xwyao@zjut.edu.cn, {qisong09, xiangyangli}@ustc.edu.cn

1 Introduction

Large language models (LLMs) are increasingly being applied in various fields of Natural Language Processing (NLP) tasks, such as text generation [Wang et al., 2024; Dong et al., 2023], knowledge-based question answering [Luo et al., 2024a; Zhao et al., 2024], and applications in specific domains [Alberts et al., 2023; Jung et al., 2024]. In most scenarios, LLMs serve as intermediary agents for implementing various functions [Hu et al., 2024; Huang et al., 2024; Guo et al., 2024].
However, due to the characteristics of generative models, LLMs still suffer from hallucination issues, often generating incorrect answers that can lead to uncontrollable and severe consequences [Li et al., 2024]. Introducing knowledge graphs (KGs) to mitigate this phenomenon is promising [Yin et al., 2022]. This is because knowledge graphs store a large amount of structured factual knowledge, which can provide large models with accurate knowledge dependencies. At the same time, correcting the knowledge in large models often requires fine-tuning their model parameters, which inevitably incurs high computational costs [Sun et al., 2024]. In contrast, updating knowledge graphs is relatively simple and incurs minimal overhead.

∗Corresponding author
arXiv:2502.12029v2 [cs.AI] 13 Mar 2025

Figure 1: (a.) The LLMs-only approach suffers from severe hallucinations. (b.) The LLMs with KGs approach provides insufficient information, and their graph-based reasoning with KGs is often inaccurate. (c.) We first mine the internal knowledge of LLMs, offering more information for external KG reasoning and achieving better integration of internal and external knowledge in LLMs. (Example query in the figure: "What language is spoken in Netherlands and Belgium?")

The paradigms of combining LLMs with KGs can be classified into three main categories. The first is knowledge injection during pre-training or fine-tuning [Luo et al., 2024b; Cao et al., 2023; Jiang et al., 2022; Yang et al., 2024]. While the model's ability to grasp knowledge improves, these methods introduce high computational costs and catastrophic forgetting. The second uses LLMs as agents to reason over knowledge retrieved from the KGs. This approach does not require fine-tuning or retraining, significantly reducing overhead [Jiang et al., 2023; Yang et al., 2023]. However, it relies heavily on the completeness of external KGs and underutilizes the internal knowledge of the LLMs. The third enables LLMs to participate in the process of knowledge exploration within external KGs [Ma et al., 2024]. In this case, the LLMs can engage in the selection of knowledge nodes at each step [Sun et al., 2024; Chen et al., 2024; Xu et al., 2024], thereby leveraging the advantages of the internal knowledge of the LLMs to some extent.

Figure 2: The workflow of KnowPath. It contains: (a) Inference Paths Generation to exploit the internal knowledge of LLMs, (b) Subgraph Exploration to generate a trustworthy directed subgraph, (c) Evaluation-based Answering to integrate internal and external knowledge. (Example question in the figure: "Where is located the Whistler mountain, and is the same place in which there has been the Demi Lovato Summer Toor 2009?"; the subgraph is explored at depths 1 to 3, with retained and abandoned entities and relations marked.)

The existing patterns for combining LLMs with KGs still have limitations. 1) Insufficient exploration of internal knowledge in LLMs. When exploring KGs, most approaches primarily treat LLMs as agents that select relevant relationships and entities, overlooking the potential of their internal knowledge. 2) Constrained generation of trustworthy reasoning paths. Some methods have attempted to generate highly interpretable reasoning paths, but they limit the scale of path exploration and require additional memory. The generated paths also lack intuitive visual interpretability. 3) Ambiguous fusion of internal and external knowledge. How to better integrate the internal knowledge of LLMs with the external knowledge in KGs still requires further exploration.

To overcome the above limitations, we propose KnowPath, a knowledge-enhanced large model framework driven by the collaboration of internal and external knowledge. Specifically, KnowPath consists of three stages. 1) Inference paths generation. To fully exploit the internal knowledge of LLMs and to adapt to zero-shot scenarios, this stage employs a prompt-driven approach to extract the knowledge triples most relevant to the topic entities, and then generates reasoning paths based on these knowledge triples to attempt answering the question. 2) Trustworthy directed subgraph exploration. Here the LLM uses the previously generated knowledge reasoning paths to select entities and relationships, and then responds based on the subgraph formed by these selections. This stage enables the LLMs to fully participate in the effective construction of external knowledge, while providing a clear process for constructing subgraphs. 3) Evaluation-based answering. At this stage, external knowledge primarily guides KnowPath, while internal knowledge assists in generating the answer.
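The three stages above can be sketched as a small control loop. This is only an illustrative skeleton, not the authors' implementation: the function name `knowpath_sketch`, the three callables, and the toy paths are all hypothetical, and the LLM and KG calls are replaced by stubs. The default depth of 3 and the fallback to the internally generated answer follow what the paper states in Sections 2.4 and 3.3.

```python
from typing import Callable, List, Optional, Tuple

Path = List[str]  # alternating entities and relations, e.g. ["A", "rel", "B"]


def knowpath_sketch(
    question: str,
    generate_paths: Callable[[str], Tuple[List[Path], str]],
    explore: Callable[[str, List[Path], List[Path]], List[Path]],
    evaluate: Callable[[str, List[Path]], Optional[str]],
    max_depth: int = 3,
) -> Tuple[str, List[Path]]:
    """Hypothetical skeleton of the three-stage loop; all callables are stand-ins."""
    # Stage 1: inference paths (and a tentative answer) from internal knowledge.
    inference_paths, internal_answer = generate_paths(question)
    subgraph: List[Path] = []
    for _ in range(max_depth):
        # Stage 2: grow the subgraph by one hop, guided by the inference paths.
        subgraph = explore(question, inference_paths, subgraph)
        # Stage 3: ask whether the current subgraph is sufficient to answer.
        answer = evaluate(question, subgraph)
        if answer is not None:
            return answer, subgraph
    # No answer at the maximum depth: fall back to the internal inference paths.
    return internal_answer, subgraph


# Toy stubs (made-up data) standing in for LLM prompts and KG queries.
def fake_generate(q: str) -> Tuple[List[Path], str]:
    return [["Netherlands", "main_language", "Dutch"]], "internal guess"


def fake_explore(q: str, paths: List[Path], subgraph: List[Path]) -> List[Path]:
    return subgraph + [["Netherlands", "hop", "Dutch"]]


def fake_evaluate(q: str, subgraph: List[Path]) -> Optional[str]:
    return "Dutch" if len(subgraph) >= 2 else None


answer, subgraph = knowpath_sketch(
    "What language is spoken in Netherlands and Belgium?",
    fake_generate, fake_explore, fake_evaluate,
)
```

Passing the stages in as callables keeps the loop independent of any particular LLM client or graph store, which is a design choice of this sketch rather than something the paper specifies.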
Our contributions can be summarized as follows:

• We focus on a new view, emphasizing the importance of the LLMs' powerful internal knowledge in knowledge question answering, via a prompt-based internal knowledge reasoning path generation method for LLMs.
• We build a knowledge-enhanced large model framework driven by the collaboration of internal and external knowledge. It not only integrates the internal and external knowledge of the LLMs better, but also provides clearer and more trustworthy reasoning paths.
• Extensive experiments conducted on multiple knowledge question answering datasets demonstrate that KnowPath significantly mitigates the hallucination problem in LLMs and outperforms the best existing methods.

2 KnowPath

2.1 Preliminary

Topic Entities represent the main entities in a query $Q$, denoted as $e_0$. Each $Q$ contains $N$ topic entities $\{e_0^1, \dots, e_0^N\}$.

Inference Paths are a set of paths $P = \{p_1, \dots, p_L\}$ generated from the LLM's own knowledge, where $L \in [1, N]$ is dynamically determined by the LLM agent. Each path $p$ starts from a topic entity $e_0 \in \{e_0^1, \dots, e_0^N\}$ and can be represented as $p = e_0 \to r_1 \to e_1 \to \dots \to r_n \to e_n$, where $e_i$ and $r_i$ represent entities and relationships, respectively.

Knowledge Graph (KG) is composed of many structured knowledge triples: $K = \{(e_h, r, e_t) \mid r \in R,\ e_h, e_t \in E\}$, where $E$ represents all entities in the knowledge graph, $R$ represents all relationships, and $e_h$ and $e_t$ represent the head and tail entities, respectively.

KG Subgraph refers to a connected subgraph extracted from the knowledge graph $K$, where the entities and relationships are entirely derived from $K$, i.e., $K_s \subseteq K$.

2.2 Inference Paths Generation

Due to the extensive world knowledge stored within their parameters, LLMs can be considered a complementary representation of KGs. To fully excavate the internal knowledge of LLMs and guide the exploration of KGs, we propose a prompt-driven method to extract the internal knowledge of LLMs effectively.
It can retrieve reasoning paths from the model's internal knowledge and clearly display the reasoning process, and it is particularly effective in zero-shot scenarios. Specifically, given a query $Q$, we first guide the LLM to extract the most relevant topic entities $\{e_0^1, \dots, e_0^N\}$ through a specially designed prompt. Then, based on these topic entities, the large model is instructed to generate a set of knowledge triples associated with them. The number of triples $n$ is variable. Finally, the LLM attempts to answer based on the previously generated knowledge triples and provides a specific reasoning path from entities and relations to the answer. Each path is in the form of $P = e_0^1 \to r_1 \to e_1 \to \dots \to r_n \to e_n$. The details of the Inference Paths Generation process are presented in Appendix A.1.

2.3 Subgraph Exploration

Exploration Initialization. KnowPath performs subgraph exploration for a maximum of $D$ rounds. Each round corresponds to an additional hop in the knowledge graph $K$, and the $j$-th round contains $N$ subgraphs $\{K_{s,j}^1, \dots, K_{s,j}^N\}$. Each subgraph $K_{s,j}^i$ is composed of a set of knowledge graph reasoning paths, i.e. $K_{s,j}^i = \{p_{1,j}^i \cup \dots \cup p_{l,j}^i,\ i \in [1, N]\}$. The number of reasoning paths $l$ is flexibly determined by the LLM agent. Taking the $D$-th round and the $z$-th path as an example, it starts exploration from one topic entity $e_0^i$ and ultimately forms a connected subgraph of the KG, denoted as $p_{z,D}^i = \{e_0^i, e_{1,z}^i, r_{1,z}^i, e_{2,z}^i, r_{2,z}^i, \dots, r_{D,z}^i, e_{D,z}^i\}$. At the start of the first round of subgraph exploration ($D = 0$), each path $p^i$ corresponds to the current topic entity, i.e. $p_{z,0}^0 = \{e_0^1\}$.

Relation Exploration. Relation exploration aims to expand the subgraphs obtained in each round of exploration, enabling deep reasoning. Specifically, for the $i$-th subgraph and the $j$-th round of subgraph exploration, the candidate entities are denoted as $E_j^i = \{e_{j-1,1}^i, \dots, e_{j-1,l}^i\}$, where $e_{j-1,1}^i$ is the tail entity of the reasoning path $p_{1,j-1}^i$. Based on these candidates $E_j^i$, we search for all corresponding single-hop relations in the knowledge graph $K$, denoted as $R_{a,j}^i = \{r_1, \dots, r_M\}$, where $M$ is determined by the specific knowledge graph $K$. Finally, the LLM agent relies on the query $Q$, the inference path $P$ generated through the LLM's internal knowledge (Section 2.2), and all topic entities $e_0$ to select the most relevant candidate relations from $R_{a,j}^i$, denoted as $R_j^i \subseteq R_{a,j}^i$, which is dynamically determined by the LLM agent.

Entity Exploration. Entity exploration depends on the already determined candidate entities and candidate relations. Taking the $i$-th subgraph and the $j$-th round of subgraph exploration as an example, relying on $E_j^i$ and $R_j^i$, we perform queries like $(e, r, ?)$ or $(?, r, e)$ on the knowledge graph $K$ to retrieve the corresponding entities $E_{a,j}^i = \{e_1, \dots, e_N\}$, where $N$ varies depending on the knowledge graph $K$. Then, the agent also considers the query $Q$, the inference path $P$ from Section 2.2, the topic entity $e_0^i$, and the candidate relation set $R_j^i$ to select from $E_{a,j}^i$ the most relevant entity set $E_{j+1}^i = \{e_{j,1}^i, \dots, e_{j,l}^i\} \subseteq E_{a,j}^i$. Note that $e_{j,1}^i$ is the tail entity of the reasoning path $p_{1,j}^i$.

Subgraph Update. Relation exploration determines entity exploration, and we update the subgraph only after completing the entity exploration. Specifically, for the $i$-th subgraph and the $j$-th round of subgraph exploration, we append the result of the exploration $(, r, e_{j,1}^i)$ to the path $p_{1,j}^i$ in the subgraph $K_{s,j}^i$.
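The direction-aware path update described here can be illustrated with a short sketch that mirrors the four branches of the paper's Algorithm 2. Several things are assumptions of this sketch: the paper does not define the exact meaning of the `pathIsHead` and `isHead` flags, so they are passed through as plain booleans; the arrows are written as the ASCII tokens `"->"` and `"<-"`; and the function returns the new path instead of appending it to a path collection.

```python
from typing import List

LEFT, RIGHT = "<-", "->"  # ASCII stand-ins for the arrows in Algorithm 2


def update_path(path: List[str], path_is_head: bool, is_head: bool,
                relation: str, entity: str) -> List[str]:
    """Return a new path extended by one (relation, entity) hop.

    Mirrors the four branches of Algorithm 2; the flag semantics are
    taken as given, not interpreted.
    """
    if not path_is_head:
        # Extend at the end of the path.
        arrow = RIGHT if is_head else LEFT
        return path + [arrow, relation, arrow, entity]
    # Extend at the front of the path.
    arrow = LEFT if is_head else RIGHT
    return [entity, arrow, relation, arrow] + path


# Entity and relation names taken from the Figure 2 example.
extended = update_path(["Whistler Mountain"], False, True,
                       "located_in", "British Columbia")
prepended = update_path(["British Columbia"], True, False,
                        "located_in", "Whistler Mountain")
```

Keeping the arrows as explicit tokens in the list means the path can be joined into a readable directed chain with `" ".join(extended)` when it is shown to the LLM or to a reader.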
This path update algorithm not only considers the directionality of entities and relations, but also automatically determines and updates the paths. Its detailed process is described in Algorithm 2. The final subgraph can be flexibly expanded due to the variable number of paths $l$.

Algorithm 1 Subgraph Exploration
Require: entityDict, entityName, question, maxWidth, depth, path
  Set originalPath as path
  if depth = 0 then
    Initialize path as [ ] ∗ maxWidth
  end if
  for eid in entityDict do
    Find relevantRelations
    for relation in relevantRelations do
      Find entities linked by relation
    end for
  end for
  Extract relevantEntities using candidate entities
  Update path and entityDict based on relevance
  extraPath ← (path − originalPath)
  return extraPath, entityDict

Algorithm 2 Update Reasoning Path in Subgraph
Require: path, pathIsHead, isHead, r, e
  if not pathIsHead then
    if not isHead then
      newPath ← path + [←, r, ←, e]
    else
      newPath ← path + [→, r, →, e]
    end if
  else
    if not isHead then
      newPath ← [e, →, r, →] + path
    else
      newPath ← [e, ←, r, ←] + path
    end if
  end if
  Append newPath to path
  return path

2.4 Evaluation-based Answering

After completing the subgraph update for each round, the agent attempts to answer the query through the subgraph $\{K_{s,j}^1, \dots, K_{s,j}^N\}$. If it determines that the current subgraph is insufficient to answer the question, the next round of subgraph exploration is executed, until the maximum exploration depth $D$ is reached. Otherwise, it outputs the final answer along with the corresponding interpretable directed subgraph. Unlike previous work [Chen et al., 2024], even if no answer is found at the maximum exploration depth, KnowPath relies on the inference path $P$ to respond. The framework of KnowPath is shown in Figure 2.

Table 1: Hits@1 scores (%) of different models on four datasets under various knowledge-enhanced methods. We use GPT-3.5 Turbo and DeepSeek-V3 as the primary backbones. Bold text indicates the results achieved by our method.

Method                                  CWQ        WebQSP     Simple Questions   WebQuestions
LLM only
IO prompt [Brown et al., 2020]          37.6±0.8   63.3±1.2   20.0±0.5           48.7±1.4
CoT [Wei et al., 2022]                  38.8±1.5   62.2±0.7   20.5±0.4           49.1±0.9
RoG w/o planning [Luo et al., 2024b]    43.0±0.9   66.9±1.3   -                  -
SC [Wang et al., 2022]                  45.4±1.1   61.1±0.5   18.9±0.6           50.3±1.2
Fine-Tuned KG Enhanced LLM
UniKGQA [Jiang et al., 2022]            51.2±1.0   75.1±0.8   -                  -
RE-KBQA [Cao et al., 2023]              50.3±1.2   74.6±1.0   -                  -
ChatKBQA [Luo et al., 2024a]            76.5±1.3   78.1±1.1   85.8±0.9           55.1±0.6
RoG [Luo et al., 2024b]                 64.5±0.7   85.7±1.4   73.3±0.8           56.3±1.0
Prompting KG Enhanced LLM with GPT3.5
StructGPT [Jiang et al., 2023]          54.3±1.0   72.6±1.2   50.2±0.5           51.3±0.9
ToG [Sun et al., 2024]                  57.1±1.5   76.2±0.8   53.6±1.0           54.5±0.7
PoG [Chen et al., 2024]                 63.2±1.0   82.0±0.9   58.3±0.6           57.8±1.2
KnowPath (Ours)                         67.9±0.6   84.1±1.3   61.5±0.8           60.0±1.0
Prompting KG Enhanced LLM with DeepSeek-V3
ToG [Sun et al., 2024]                  60.9±0.7   82.6±1.0   59.7±0.9           57.9±0.8
PoG [Chen et al., 2024]                 68.3±1.1   85.3±0.9   63.9±0.5           61.2±1.3
KnowPath (Ours)                         73.5±0.9   89.0±0.8   65.3±1.0           64.0±0.7

3 Experimental Setup

3.1 Baselines

We chose advanced baselines for comparison based on the three main paradigms of existing knowledge-based question answering. 1) The first is the LLM-only paradigm, including the standard prompt (IO prompt [Brown et al., 2020]), the chain-of-thought prompt (CoT [Wei et al., 2022]), self-consistency (SC [Wang et al., 2022]), and RoG without planning (RoG w/o planning [Luo et al., 2024b]). 2) The second is the KG-enhanced fine-tuned LLMs, which include ChatKBQA [Luo et al., 2024a], RoG [Luo et al., 2024b], UniKGQA [Jiang et al., 2022], and RE-KBQA [Cao et al., 2023]. 3) The third is the KG-enhanced prompt-based LLMs, including Think on graph (ToG [Sun et al., 2024]), Plan on graph (PoG [Chen et al., 2024]), and StructGPT [Jiang et al., 2023]. Unlike the second, this scheme no longer requires fine-tuning and has become a widely researched mode today.

3.2 Datasets and Metrics

Datasets. We adopt four knowledge-based question answering datasets: the single-hop Simple Questions [Bordes et al., 2015], the complex multi-hop CWQ [Talmor and Berant, 2018] and WebQSP [Yih et al., 2016], and the open-domain WebQuestions [Berant et al., 2013].

Metrics. Following previous research [Chen et al., 2024], we apply exact match accuracy (Hits@1) for evaluation.

3.3 Experiment Details

Following previous research [Chen et al., 2024], to control the overall costs, the maximum subgraph exploration depth $D_{max}$ is set to 3. Since Freebase [Bollacker et al., 2008] supports all the aforementioned datasets, we apply it as the base graph for subgraph exploration, and we apply GPT-3.5-turbo-1106 and DeepSeek-V3 as the base models. All experiments are deployed on four NVIDIA A800-40G GPUs.

4 Results

4.1 Main results

We conducted comprehensive experiments on four widely used knowledge-based question answering datasets. The experimental results are presented in Table 1, and the key findings are outlined as follows:

KnowPath performs the best. KnowPath outperforms all prompting-driven KG-enhanced methods. For instance, on the multi-hop CWQ, regardless of the base model used, KnowPath achieves a maximum improvement of about 13 percentage points in Hits@1. In addition, KnowPath outperforms the LLM-only methods by a clear margin and surpasses the majority of Fine-Tuned KG-Enhanced LLM methods. On the most challenging open-domain question answering dataset, WebQuestions, KnowPath achieves the best performance compared to strong baselines from other paradigms (e.g., PoG 61.2% vs Ours 64.0%). This demonstrates KnowPath's ability to enhance the factuality of LLMs in open-domain question answering, which is an intriguing phenomenon worth further exploration.

KnowPath excels at complex multi-hop tasks.
On both CWQ and WebQSP, KnowPath outperforms the latest strong baseline PoG, achieving an average improvement of approximately 5 and 2.9 percentage points, respectively. On WebQSP, DeepSeek-V3 with KnowPath not only outperforms all Prompting-based KG-Enhanced LLMs but also surpasses the strongest baseline, RoG, among Fine-Tuned KG-Enhanced LLMs (85.7% vs 89.0%). On the more challenging multi-hop CWQ, the improvement of KnowPath over PoG is significantly greater than the improvement on the simpler single-hop Simple Questions (5.2 vs 1.4 percentage points). Together, these results indicate that KnowPath's advantage grows with the depth of reasoning a task requires.

Table 2: Ablation experiment results on four knowledge-based question answering tasks. IPG stands for the Inference Paths Generation module, while SE stands for the Subgraph Exploration module.

Method      CWQ    WebQSP   SimpleQA   WebQ
KnowPath    73.5   89.0     65.3       64.0
-w/o IPG    67.3   84.5     63.1       61.0
-w/o SE     64.7   83.1     60.4       60.7
Base        39.2   66.7     23.0       53.7

Knowledge enhancement greatly aids factual question answering. When question answering is based solely on LLMs, the performance is poor across multiple tasks. For example, CoT achieves only about 20.5% Hits@1 on Simple Questions. This is caused by the hallucinations inherent in LLMs. Whatever method is applied to introduce the KGs, it significantly outperforms the LLM-only methods. The maximum improvements across the four tasks are 35.9, 27.9, 46.4, and 15.3 percentage points. These further emphasize the importance of introducing knowledge graphs for generating correct answers.

The stronger the base, the higher the performance. As DeepSeek-V3 is stronger than GPT-3.5, the two backbones show a significant difference on all tasks after incorporating KnowPath, even though both are used in the same prompting-based knowledge-enhanced setting. Replacing GPT-3.5 with DeepSeek-V3, KnowPath achieved its largest improvement on CWQ, from 67.9% to 73.5%, and on Simple Questions it improved by at least 3.8 percentage points.
These findings indicate that a stronger base model directly improves performance in knowledge-based question answering.

KnowPath is a more flexible plugin. Compared to fine-tuned knowledge-enhanced LLMs, KnowPath does not require fine-tuning of the LLM, yet it outperforms most of the fine-tuned methods. In addition, on the CWQ dataset, KnowPath with DeepSeek-V3 achieves performance that is very close to the strongest baseline, ChatKBQA, which requires fine-tuning for knowledge enhancement. On the WebQSP dataset, it outperforms ChatKBQA by about 11 percentage points (78.1% vs 89.0%). Overall, the resource consumption of KnowPath is significantly lower than that of Fine-Tuned KG-Enhanced LLMs. This is because KnowPath improves performance by optimizing inference paths and enhancing knowledge integration, making it a more flexible and plug-and-play framework.

4.2 Ablation Study

We validate the effectiveness of each component of KnowPath and quantify their contributions to performance. The results are presented in Table 2 and visualized in Figure 3.

Each component contributes to the overall remarkable performance. Removing either module lowers performance on every dataset. However, compared to the base model, the addition of these modules still significantly improves the overall performance.

Table 3: Cost-effectiveness analysis on the CWQ dataset between our KnowPath and the strong prompt-driven knowledge-enhanced benchmarks (ToG and PoG). The Total Token includes two parts: the total number of tokens from multiple input prompts and the total number of tokens from the intermediate results returned by the LLM. The Input Token represents only the total number of tokens from the multiple input prompts. The LLM Call refers to the total number of accesses to the LLM agent.

Method     LLM Call   Total Token   Input Token
ToG        22.6       9669.4        8182.9
PoG        16.3       8156.2        7803.0
KnowPath   9.9        2742.4        2368.9

Figure 3: Comparison of KnowPath, its individual components, and strong baseline methods (ToG and PoG) on the performance across four commonly used knowledge-based question answering datasets.

It is necessary to focus on the powerful internal knowledge of LLMs. Eliminating the Subgraph Exploration and relying solely on the internal knowledge mining of LLMs to generate reasoning paths and provide answers proves to be highly effective. Compared to the base model, it shows significant improvement across all four datasets, with an average gain of approximately 21.6 percentage points. The most notable improvement was observed on SimpleQA, where performance leaped from 23% to 60.4%. This indicates that even without the incorporation of external knowledge graphs, the model's ability to generate factual responses can be enhanced to a certain extent through internal mining methods. Conversely, without the guidance of internal knowledge reasoning paths, KnowPath shows some performance decline across all tasks, especially on the complex multi-hop CWQ and WebQSP.

The most critical credible directed Subgraph Exploration is depth-sensitive. Removing the subgraph exploration leads to a significant decline in KnowPath across all tasks, averaging a drop of approximately 5.7 percentage points. This performance dip is particularly pronounced in complex multi-hop tasks. For instance, on CWQ, KnowPath without subgraph exploration experiences a decrease of nearly 9 percentage points.

Figure 4: Visualization of the cost-effectiveness analysis on four public knowledge-based question-answering datasets.

Figure 5: Analysis of key parameters. (a) Exploration temperature. (b) The count of triples.

4.3 Cost-effectiveness Analysis

To explore the cost-effectiveness of KnowPath while maintaining high accuracy, we conducted a cost-benefit analysis.
In this experiment, we tracked the primary sources of cost, including the LLM Call, Input Token, and Total Token usage. The results are presented in Table 3 and visualized in Figure 4. Our key findings are described as follows:

The number of accesses to the LLM agent was significantly reduced. Specifically, the LLM calls for ToG and PoG were 2.28x and 1.64x those of KnowPath, respectively. This exceptionally low cost can be attributed to the fact that the Subgraph Exploration does not limit the scale of the path search, and it can be broken down into three key reasons. First, in each round of subgraph exploration, only one relation exploration and one entity exploration are conducted. Second, the Evaluation-based Answering accesses the LLM only once after each round of subgraph exploration to judge whether the current subgraph can answer the question. If it cannot, the next round is performed. Third, if the largest explored subgraph still cannot answer the question, KnowPath relies on the Inference Paths Generation.

The number of tokens used is reduced several times over. Whether in Total Token or Input Token, KnowPath saves approximately 4.0x compared to ToG and PoG. This is mainly because all the prompts used in KnowPath are based on a carefully designed zero-shot approach, rather than the in-context learning used by previous methods, which requires providing a large context to ensure the factuality of the answers. We explored the reasons behind this difference. First, previous methods rely on more contextual information for in-context learning to ensure the correctness of the output. Second, KnowPath fully leverages the LLM's relevant internal knowledge and uses it as the input signal for the agent.
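The cost ratios discussed here can be recomputed directly from Table 3. The snippet below is a small worked calculation, not part of the paper's code: the dictionary layout, key names, and the helper `cost_ratios` are illustrative choices, and the numbers are copied from Table 3 (CWQ).

```python
# Values copied from Table 3 (CWQ); the key names are illustrative.
TABLE3 = {
    "ToG":      {"llm_calls": 22.6, "total_tokens": 9669.4, "input_tokens": 8182.9},
    "PoG":      {"llm_calls": 16.3, "total_tokens": 8156.2, "input_tokens": 7803.0},
    "KnowPath": {"llm_calls": 9.9,  "total_tokens": 2742.4, "input_tokens": 2368.9},
}


def cost_ratios(baseline: str, ours: str = "KnowPath") -> dict:
    """Ratio of a baseline's cost to KnowPath's cost for each tracked quantity."""
    return {key: TABLE3[baseline][key] / TABLE3[ours][key] for key in TABLE3[ours]}


def reduction_percent(baseline: str, key: str, ours: str = "KnowPath") -> float:
    """Relative saving of KnowPath versus a baseline, in percent."""
    return 100.0 * (1.0 - TABLE3[ours][key] / TABLE3[baseline][key])


for name in ("ToG", "PoG"):
    rounded = {key: round(value, 2) for key, value in cost_ratios(name).items()}
    print(name, rounded)
```

With these table values, the LLM-call ratios come out close to the figures quoted in the text, while the token ratios land at roughly 3.0x to 3.5x, so the "approximately 4.0x" figure is best read as a rounded upper bound. `reduction_percent` expresses the same comparison as a percentage saving.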
Using internal knowledge as the input signal not only provides more contextual reference but also significantly improves the accuracy and efficiency of relation and entity exploration in subgraph exploration, ensuring that the generated subgraph is highly relevant while enabling the most effective reasoning toward potential answers.

4.4 Parameter analysis

We analyze the key parameters that affect the performance of KnowPath on WebQSP, and discuss the following issues:

What is the impact of the temperature in Subgraph Exploration? We explore the optimal temperature from 0.2 to 1, and the relation between it and Hits@1 is shown in Figure 5a. During subgraph exploration, variations in the temperature affect the divergence of the model's generated answers. A lower temperature negatively impacts KnowPath's performance: the LLM relies on its internal knowledge when exploring and selecting entities and relationships, and at a low temperature it generates overly conservative answers with insufficient knowledge. A higher temperature also harms KnowPath, as the divergent answers may deviate from the given candidates. Extensive experiments show that 0.4 is the optimal temperature, consistent with other existing works [Chen et al., 2024].

How is the count of knowledge triples determined in Inference Paths Generation? We explored it with a step size of 15, and the relationship between the count of knowledge triples and Hits@1 is shown in Figure 5b. When the count is 0, KnowPath's performance is poor due to the lack of internal knowledge exploration. When the count is too large, such as 45, its performance is also suboptimal, as excessive exploration introduces irrelevant knowledge as interference. Extensive experiments show that 15 is the optimal count.

Figure 6: The case study on the multi-hop CWQ and open-domain WebQuestions datasets. To provide a clear and vivid comparison with the strong baselines (ToG and PoG), we visualized the execution process of KnowPath. (a.) A case from CWQ, with the question "What text in the religion which include Zhang Jue as a key figure is considered to be sacred?". (b.) A case from WebQuestions, with the question "who won the league cup in 2002?". Each panel shows the ToG, PoG, and KnowPath outputs together with the KnowPath inference path and explored subgraph.

4.5 Case Study

To provide a clear and vivid comparison with the strong baselines, we visualized the execution process of KnowPath, as shown in Figure 6.
In the CWQ, ToG and PoG can only ex- tract context from the question, failing to gather enough ac- curate knowledge for a correct answer, thus producing the in- correct answer ”Taiping Jing.” In contrast, KnowPath uncov- ers large model reasoning paths that provide additional, suffi- cient information. This enables key nodes, such as ”Taoism,” to be identified during subgraph exploration, ultimately lead- ing to the correct answer, ”Zhuang Zhou.” In the WebQues- tions, ToG is unable to answer the question due to insufficient information. Although PoG provides a reasoning chain, the knowledge derived from the reasoning process is inaccurate, and the final answer still relies on the reasoning of the large model, resulting in the incorrect answer ”Blackburn Rovers.” In contrast, guided by Inference, KnowPath accurately identi- fied the relationship ”time.event.instance ofrecurring event” and, through reasoning with the node ”2002-03-Football League Cup,” ultimately arrived at the correct result node ”Liverpool F.C.” Overall, KnowPath not only provides an- swers but also generates directed subgraphs, which serve as the foundation for trustworthy reasoning and significantly en- hance the interpretability of the results. 5 Related Work Prompt-driven LLM inference. CoT[Wei et al. , 2022 ](Chain of Thought) effectively im- proves the reasoning ability of large models, enhancing per- formance on complex tasks with minimal contextual prompts. Self-Consistency (SC) [Wang et al. , 2022 ]samples multiple reasoning paths to select the most consistent answer, with further improvements seen in DIVERSE [Liet al. , 2022 ]and V ote Complex [Fuet al. , 2022 ]. Other methods have ex-plored CoT enhancements in zero-shot scenarios [Kojima et al., 2022; Chung et al. , 2024 ]. However, reasoning solely based on the model’s knowledge still faces significant hallu- cination issues, which remain unresolved. KG-enhanced LLM inference. 
Early works enhanced model knowledge understanding by injecting KGs into model parameters through fine-tuning or retraining [Cao et al., 2023; Jiang et al., 2022; Yang et al., 2024]. ChatKBQA [Luo et al., 2024a] and RoG [Luo et al., 2024b] utilize fine-tuned LLMs to generate logical forms. StructGPT [Jiang et al., 2023], based on the RAG approach, retrieves information from KGs for question answering. ToG [Sun et al., 2024] and PoG [Chen et al., 2024] involve LLMs in knowledge graph reasoning, using them as agents to assist in selecting entities and relationships during exploration. Despite achieving strong performance, these methods still face challenges such as insufficient internal knowledge mining and the inability to generate trustworthy reasoning paths.

6 Conclusion
In this paper, to enhance the ability of LLMs to provide factual answers, we propose the knowledge-enhanced reasoning framework KnowPath, driven by the collaboration of internal and external knowledge. It focuses on leveraging the reasoning paths generated by the extensive internal knowledge of LLMs to guide the trustworthy directed subgraph exploration of knowledge graphs. Extensive experiments show that: 1) KnowPath achieves the best performance in our experiments and excels at complex multi-hop tasks. 2) It demonstrates remarkable cost-effectiveness, with a 55% reduction in the number of LLM calls and a 75% decrease in the number of tokens consumed compared to the strong baselines. 3) KnowPath can explore directed subgraphs of the KGs, providing an intuitive and trustworthy reasoning process and greatly enhancing the overall interpretability.

References
[Alberts et al., 2023] Ian L Alberts, Lorenzo Mercolli, Thomas Pyka, George Prenosil, Kuangyu Shi, Axel Rominger, and Ali Afshar-Oromieh. Large language models (llm) and chatgpt: what will the impact on nuclear medicine be? European journal of nuclear medicine and molecular imaging, 50(6):1549–1552, 2023.
[Berant et al., 2013] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1533–1544, 2013.
[Bollacker et al., 2008] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250, 2008.
[Bordes et al., 2015] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075, 2015.
[Brown et al., 2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[Cao et al., 2023] Yong Cao, Xianzhi Li, Huiwen Liu, Wen Dai, Shuai Chen, Bin Wang, Min Chen, and Daniel Hershcovich. Pay more attention to relation exploration for knowledge base question answering. arXiv preprint arXiv:2305.02118, 2023.
[Chen et al., 2024] Liyi Chen, Panrong Tong, Zhongming Jin, Ying Sun, Jieping Ye, and Hui Xiong. Plan-on-graph: Self-correcting adaptive planning of large language model on knowledge graphs. CoRR, abs/2410.23875, 2024.
[Chung et al., 2024] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
[Dong et al., 2023] Xiangjue Dong, Yibo Wang, Philip S Yu, and James Caverlee. Probing explicit and implicit gender bias through llm conditional text generation. arXiv preprint arXiv:2311.00306, 2023.
[Fu et al., 2022] Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. Complexity-based prompting for multi-step reasoning. In The Eleventh International Conference on Learning Representations, 2022.
[Guo et al., 2024] Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024, Jeju, South Korea, August 3-9, 2024, pages 8048–8057. ijcai.org, 2024.
[Hu et al., 2024] Yuxuan Hu, Gemju Sherpa, Lan Zhang, Weihua Li, Quan Bai, Yijun Wang, and Xiaodan Wang. An llm-enhanced agent-based simulation tool for information propagation. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024, Jeju, South Korea, August 3-9, 2024, pages 8679–8682. ijcai.org, 2024.
[Huang et al., 2024] Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. Understanding the planning of llm agents: A survey. arXiv preprint arXiv:2402.02716, 2024.
[Jiang et al., 2022] Jinhao Jiang, Kun Zhou, Wayne Xin Zhao, and Ji-Rong Wen. Unikgqa: Unified retrieval and reasoning for solving multi-hop question answering over knowledge graph. arXiv preprint arXiv:2212.00959, 2022.
[Jiang et al., 2023] Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. Structgpt: A general framework for large language model to reason over structured data. arXiv preprint arXiv:2305.09645, 2023.
[Jung et al., 2024] Sung Jae Jung, Hajung Kim, and Kyoung Sang Jang. Llm based biological named entity recognition from scientific literature. In 2024 IEEE International Conference on Big Data and Smart Computing (BigComp), pages 433–435. IEEE, 2024.
[Kojima et al., 2022] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
[Li et al., 2022] Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. On the advance of making language models better reasoners. arXiv preprint arXiv:2206.02336, 2022.
[Li et al., 2024] Johnny Li, Saksham Consul, Eda Zhou, James Wong, Naila Farooqui, Yuxin Ye, Nithyashree Manohar, Zhuxiaona Wei, Tian Wu, Ben Echols, et al. Banishing llm hallucinations requires rethinking generalization. arXiv preprint arXiv:2406.17642, 2024.
[Luo et al., 2024a] Haoran Luo, Haihong E, Zichen Tang, Shiyao Peng, Yikai Guo, Wentai Zhang, Chenghao Ma, Guanting Dong, Meina Song, Wei Lin, Yifan Zhu, and Anh Tuan Luu. Chatkbqa: A generate-then-retrieve framework for knowledge base question answering with fine-tuned large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 2039–2056. Association for Computational Linguistics, 2024.
[Luo et al., 2024b] Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. Reasoning on graphs: Faithful and interpretable large language model reasoning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
[Ma et al., 2024] Shengjie Ma, Chengjin Xu, Xuhui Jiang, Muzhi Li, Huaren Qu, and Jian Guo. Think-on-graph 2.0: Deep and interpretable large language model reasoning with knowledge graph-guided retrieval. arXiv e-prints, pages arXiv–2407, 2024.
[Sun et al., 2024] Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel M. Ni, Heung-Yeung Shum, and Jian Guo. Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
[Talmor and Berant, 2018] Alon Talmor and Jonathan Berant. The web as a knowledge-base for answering complex questions. In Marilyn Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 641–651, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.
[Wang et al., 2022] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
[Wang et al., 2024] Ziao Wang, Xiaofeng Zhang, and Hongwei Du. Beyond what if: Advancing counterfactual text generation with structural causal modeling. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024, Jeju, South Korea, August 3-9, 2024, pages 6522–6530. ijcai.org, 2024.
[Wei et al., 2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
[Xu et al., 2024] Yao Xu, Shizhu He, Jiabei Chen, Zihao Wang, Yangqiu Song, Hanghang Tong, Guang Liu, Kang Liu, and Jun Zhao. Generate-on-graph: Treat llm as both agent and kg in incomplete knowledge graph question answering. arXiv preprint arXiv:2404.14741, 2024.
[Yang et al., 2023] Linyao Yang, Hongyang Chen, Zhao Li, Xiao Ding, and Xindong Wu. Chatgpt is not enough: Enhancing large language models with knowledge graphs for fact-aware language modeling. arXiv preprint arXiv:2306.11489, 2023.
[Yang et al., 2024] Linyao Yang, Hongyang Chen, Zhao Li, Xiao Ding, and Xindong Wu. Give us the facts: Enhancing large language models with knowledge graphs for fact-aware language modeling. IEEE Transactions on Knowledge and Data Engineering, 2024.
[Yih et al., 2016] Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. The value of semantic parse labeling for knowledge base question answering. In Katrin Erk and Noah A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 201–206, Berlin, Germany, August 2016. Association for Computational Linguistics.
[Yin et al., 2022] Da Yin, Li Dong, Hao Cheng, Xiaodong Liu, Kai-Wei Chang, Furu Wei, and Jianfeng Gao. A survey of knowledge-intensive nlp with pre-trained language models. arXiv preprint arXiv:2202.08772, 2022.
[Zhao et al., 2024] Ruilin Zhao, Feng Zhao, Long Wang, Xianzhi Wang, and Guandong Xu. Kg-cot: Chain-of-thought prompting of large language models over knowledge graphs for knowledge-aware question answering. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024, Jeju, South Korea, August 3-9, 2024, pages 6642–6650. ijcai.org, 2024.

---