Paper 2412.11418v1

ConceptEdit: Conceptualization-Augmented Knowledge Editing in Large Language Models for Commonsense Reasoning

Authors: Liyu Zhang, Weiqi Wang, Tianqing Fang, Yangqiu Song

Published: 2024-12-16

Abstract:

Knowledge Editing (KE) aims to adjust a Large Language Model's (LLM) internal representations and parameters to correct inaccuracies and improve output consistency without incurring the computational expense of re-training the entire model. However, editing commonsense knowledge still faces difficulties, including limited knowledge coverage in existing resources, the infeasibility of annotating labels for an overabundance of commonsense knowledge, and the strict knowledge formats of current editing methods. In this paper, we address these challenges by presenting ConceptEdit, a framework that integrates conceptualization and instantiation into the KE pipeline for LLMs to enhance their commonsense reasoning capabilities. ConceptEdit dynamically diagnoses implausible commonsense knowledge within an LLM using another verifier LLM and augments the source knowledge to be edited with conceptualization for stronger generalizability. Experimental results demonstrate that LLMs enhanced with ConceptEdit successfully generate commonsense knowledge with improved plausibility compared to other baselines and achieve stronger performance across multiple question answering benchmarks.

Liyu Zhang*, Weiqi Wang*, Tianqing Fang, Yangqiu Song
Department of Computer Science and Engineering, HKUST, Hong Kong SAR, China
lzhangcx@connect.ust.hk, {wwangbw, tfangaa, yqsong}@cse.ust.hk
*Equal contribution

1 Introduction

With the recent advancements in Large Language Models (LLMs; OpenAI, 2024b,a; Dubey et al., 2024; Chan et al., 2024), Knowledge Editing (KE; Zhang et al., 2024; Wang et al., 2025) methods have emerged as a computationally efficient strategy to correct inaccurate responses and update LLMs with timely or new knowledge by directly modifying their internal weights or representations, without fully re-training the entire model. Such methods have been applied to various domains, including factual reasoning (Ju et al., 2024; Wang et al., 2024a), medical knowledge (Xu et al., 2024b), and commonsense reasoning (Huang et al., 2024), and have proven effective in enhancing domain-specific expertise.

Despite their success, current KE methods face several challenges when applied to commonsense knowledge (Davis and Marcus, 2015). First, existing commonsense knowledge bases (West et al., 2023; Fang et al., 2021b; Yang et al., 2023; Fang et al., 2021a, 2023; Ding et al., 2024; Xu et al., 2024a) offer only limited coverage of the extensive and diverse information required for robust reasoning. They often focus on isolated facts rather than forming hierarchical structures that enable generalization through editing (Ma et al., 2021b; Wang et al., 2024d). Second, the inherently unstructured and wide-ranging nature of commonsense knowledge complicates scaling and curation, making it infeasible to rely on human annotation alone to correct implausible knowledge in LLMs. Finally, the flexible representation of commonsense knowledge, where a single fact may manifest in multiple formats, necessitates editing at the (relation, tail) pair level rather than at individual tokens.

To address these issues, we present ConceptEdit, a novel knowledge editing framework tailored for editing commonsense knowledge within LLMs. To handle the vast, potentially unlabeled commonsense knowledge, we employ VERA (Liu et al., 2023), an automated commonsense plausibility verifier: we prompt an LLM to generate commonsense knowledge and use VERA to determine its plausibility. For knowledge deemed erroneous and requiring edits, we integrate conceptualization and instantiation (Wang et al., 2023b,a) to enrich semantic coverage and support more generalizable editing, covering not only the targeted knowledge but also other potentially relevant yet implausible information within the LLM.
To ensure flexibility, ConceptEdit adopts an open-ended format for editing, enabling the handling of arbitrary knowledge structures rather than focusing solely on traditional (h, r, t) triplets. Experimental results on AbstractATOMIC (He et al., 2024) demonstrate that LLMs enhanced by ConceptEdit generate commonsense knowledge with improved plausibility. Further evaluations across five commonsense question-answering benchmarks also show performance improvements. We will release our data, models, and code publicly upon acceptance.

Figure 1: An overview of ConceptEdit, which pipelines conceptualization and instantiation, knowledge editing, and LLM verification together for automated and scalable knowledge editing over commonsense knowledge. (Figure example: for the prompt "What effects does the event of Alice plays together every day have on others?", the LLM answers "Others will feel the urge to sneeze repeatedly." The ground-truth triple in the commonsense knowledge base is (PersonX plays together every day, oEffect, get to know someone); conceptualization abstracts the head event to "PersonX has fun" and "PersonX engages in enjoyable group activities", yielding triples such as (PersonX has fun, xIntent, to be amused) in the augmented abstract knowledge base.)

2 Related Works

2.1 Knowledge Editing

Knowledge editing (Cao et al., 2021), which aims to update an LLM's internal knowledge without full retraining or relying solely on prompt engineering, is becoming increasingly crucial. Meng et al. (2022) propose ROME, which identifies and updates factual associations within specific MLP layers, achieving precise single-fact edits guided by causal mediation analysis. MEMIT (Meng et al., 2023) extends ROME's principles to handle large-scale edits simultaneously. By distributing updates across multiple layers and parameters, MEMIT efficiently integrates thousands of facts while maintaining specificity and fluency. GRACE (Hartvigsen et al., 2023), on the other hand, avoids internal parameter changes by integrating external dictionaries and adapters as a modular memory source. This approach allows flexible, inference-time access to new knowledge, though it may sacrifice some internal coherence and interpretability. In our work, we build upon these methods to improve the editing of commonsense knowledge in LLMs.

2.2 Conceptualization in Commonsense

Conceptualization abstracts entities or events into general concepts, forming abstract commonsense knowledge (Murphy, 2004), while instantiation grounds these concepts into new instances, introducing additional commonsense knowledge. Previous work largely focused on entity-level conceptualization (Durme et al., 2009; Song et al., 2011, 2015; Liu et al., 2022; Peng et al., 2022), with He et al. (2024); Wang et al. (2023b,a) pioneering event-level conceptualization from WordNet (Miller, 1995) and Probase (Wu et al., 2012). For instantiation, Allaway et al. (2023) introduced a controllable generative framework that automatically identifies valid instances. In this work, we leverage the conceptualization distillation framework proposed by Wang et al. (2024c) to augment the knowledge being edited, ensuring broader semantic coverage and thereby improving the generalizability of edited knowledge.

3 The ConceptEdit Framework

An overview of ConceptEdit is presented in Figure 1.
Our framework consists of three main components: (1) automated knowledge verification with VERA (Liu et al., 2023), (2) abstract knowledge acquisition via conceptualization and instantiation, and (3) LLM knowledge editing. We use the AbstractATOMIC (He et al., 2024) and CANDLE (Wang et al., 2024c) datasets for training and evaluation, as they are two rich sources of abstract knowledge with conceptualization and instantiation. The training sets of both datasets are used for editing, and the testing sets are used for evaluation.

3.1 Automated Knowledge Verification

Since commonsense knowledge is vast, traditional human-in-the-loop methods for detecting and correcting erroneous outputs in LLMs are neither easily scalable nor adaptable. Inspired by recent advances in using LLMs as automated judges (Raina et al., 2024; Wang et al., 2024b), we propose a fully automated verification strategy to assess an LLM's internal commonsense knowledge. We use VERA (Liu et al., 2023), a discriminative LLM trained to score the plausibility of arbitrary commonsense statements, as our evaluation tool. For each triple in the AbstractATOMIC (He et al., 2024) training set, we prompt the LLM with the head event and request it to generate the corresponding relation and tail. VERA then evaluates the plausibility of the generated knowledge by producing a score in the range [0, 1], where values above 0.5 are considered plausible and those below 0.5 are deemed implausible. By iterating over all triples, this process provides both the LLM's generated responses and VERA's discrimination results, pinpointing which portions of the generated knowledge are incorrect. Consequently, we can identify the exact "areas" within the LLM's internal knowledge that require editing. This automated pipeline eliminates the dependence on costly human annotations for error detection, enabling scalable and efficient improvements of the LLM's commonsense understanding.
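The verification loop of Section 3.1 can be illustrated with a minimal sketch. It is not the authors' implementation: the `Triple` class, the `to_statement` template, and the `scorer` callable (a stand-in for VERA) are hypothetical names introduced here. The only detail taken from the text is the 0.5 threshold; treating a score of exactly 0.5 as implausible is an assumption, since the paper does not specify it.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

PLAUSIBILITY_THRESHOLD = 0.5  # threshold stated in Section 3.1


@dataclass(frozen=True)
class Triple:
    """A commonsense assertion: head event, relation, tail."""
    head: str
    relation: str
    tail: str


def to_statement(triple: Triple) -> str:
    # Hypothetical verbalization; the paper does not give the exact template.
    return f"{triple.head} [{triple.relation}] {triple.tail}"


def partition_by_plausibility(
    generated: Iterable[Triple],
    scorer: Callable[[str], float],
    threshold: float = PLAUSIBILITY_THRESHOLD,
) -> Tuple[List[Triple], List[Triple]]:
    """Split LLM-generated triples into (plausible, implausible).

    `scorer` stands in for a plausibility verifier returning a score in [0, 1].
    Assumption: a score exactly equal to the threshold counts as implausible.
    """
    plausible: List[Triple] = []
    implausible: List[Triple] = []
    for triple in generated:
        score = scorer(to_statement(triple))
        if score > threshold:
            plausible.append(triple)
        else:
            implausible.append(triple)
    return plausible, implausible
```

In this sketch, the second list returned by `partition_by_plausibility` plays the role of the knowledge flagged for editing, and any verifier that maps a statement to a score in [0, 1] can be passed as `scorer`.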
3.2 Conceptualization and Instantiation

While existing approaches primarily integrate decontextualized commonsense knowledge into LLMs through KE techniques, we hypothesize that capturing the diverse patterns that the same piece of knowledge can exhibit under different contexts is equally important. To this end, we augment the knowledge to be edited by implementing both conceptualization and instantiation, following Wang et al. (2024c). For each triple targeted for editing, we first abstract its instances into more general concepts by prompting GPT-4o, producing abstract knowledge triples (Figure 1). We then instantiate these abstract concepts into novel, context-specific instances, again using GPT-4o, thereby forming a rich knowledge base. This process yields approximately 160,000 commonsense knowledge triples, substantially improving the semantic coverage and contextual adaptability of the edited knowledge.

3.3 LLM Knowledge Editing

Finally, we apply knowledge editing to the LLM using the enriched knowledge base generated through our conceptualization and instantiation processes, correcting errors identified by VERA. To accomplish this, we experiment with three established knowledge editing methods: MEMIT (Meng et al., 2023), ROME (Meng et al., 2022), and GRACE (Hartvigsen et al., 2023). For GRACE, which relies on adapters to determine whether and how to use an external dictionary, we adopt the original deferral mechanism implementation.

Figure 2: Average plausible rate and expert acceptance rate of LLMs' generation after ConceptEdit. (Two panels, "Average Plausible Rate" and "Average Expert Acceptance Rate", each comparing Vanilla, MEMIT, GRACE, and ROME.)
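The augmentation of Section 3.2 and the open-ended edit format can be sketched as follows. This is an illustrative sketch only: `conceptualize` and `instantiate` are hypothetical callables standing in for the GPT-4o prompting steps, the `prompt`/`target` dictionary layout is an assumed format rather than the input schema of any particular editing library, and keeping the relation and tail fixed while varying the head is an assumption modeled on the Figure 1 example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Triple:
    """A commonsense assertion: head event, relation, tail."""
    head: str
    relation: str
    tail: str


def augment(
    triple: Triple,
    conceptualize: Callable[[str], List[str]],
    instantiate: Callable[[str], List[str]],
) -> List[Triple]:
    """Return the original triple plus abstract and re-instantiated variants.

    Assumption: relation and tail are carried over unchanged while the head
    event is abstracted and then grounded into new instances.
    """
    augmented = [triple]
    for concept in conceptualize(triple.head):
        augmented.append(Triple(concept, triple.relation, triple.tail))
        for instance in instantiate(concept):
            augmented.append(Triple(instance, triple.relation, triple.tail))
    # Drop duplicates while preserving order (frozen dataclasses are hashable).
    return list(dict.fromkeys(augmented))


def to_edit_request(triple: Triple) -> Dict[str, str]:
    """Build a hypothetical open-ended edit request.

    The head event is the prompt and the (relation, tail) pair is the target,
    mirroring editing at the (relation, tail) level described in the paper.
    """
    return {"prompt": triple.head, "target": f"{triple.relation}: {triple.tail}"}
```

Each request produced by `to_edit_request` would then be passed to whichever editing method is in use (MEMIT, ROME, or GRACE); that hand-off is deliberately left out because it depends on the editing toolkit.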
We evaluate our framework with these editing methods on four representative LLM backbones: Mistral-7B-Instruct-v0.2 (Jiang et al., 2023), Meta-Llama-3-8B-Instruct (Dubey et al., 2024), ChatGLM2-6b (Zeng et al., 2024), and GPT-J-6B (Wang and Komatsuzaki, 2021).

4 Experiments and Analyses

In this section, we first evaluate the LLMs after applying ConceptEdit using both expert and automated assessments. We then illustrate their improved performance on downstream tasks and present several ablation studies.

4.1 LLMs-After-Editing Evaluation

We first evaluate LLMs after editing via two measures. First, we prompt these LLMs with head events in the testing set of AbstractATOMIC and ask them to complete the commonsense knowledge. We then ask VERA to score the generations and calculate the plausible ratio, i.e., the fraction of generations whose scores are above 0.5. Second, we sample a subset of 200 generations and recruit two expert annotators to manually analyze the acceptance ratio of the plausible assertions that passed VERA's filtering. We compare models edited with MEMIT, GRACE, and ROME, and use the unedited (vanilla) models as a baseline.

As shown in Figure 2, both VERA and human evaluations exhibit consistent trends. For instance, while human raters tend to assign higher scores compared to VERA, their evaluations align directionally, with both methods identifying similar patterns of improvement. When applying MEMIT-based editing, both VERA and human evaluations show notable enhancements over the Vanilla baseline. Similarly, GRACE and ROME edits enhance plausibility scores, with MEMIT and GRACE achieving the highest overall performance.

Figure 3: Performance of the best LLM after editing on five downstream tasks compared to the vanilla baseline. (Bar chart of Score for SocialIQA, PIQA, aNLI, WG, and CSQA; series: Vanilla Baseline, After Editing.)
The strong results from expert annotations further validate the reliability of VERA's judgments, supporting the use of VERA in our framework as an effective commonsense evaluator to identify implausible knowledge requiring further editing. This approach reduces reliance on manual annotations while preserving robust assessment capabilities.

4.2 Downstream Improvements

To assess whether enhanced internal commonsense reasoning improves downstream task performance, we evaluate the edited models on multiple commonsense reasoning benchmarks. Following Ma et al. (2021a), we test our framework on the validation splits of five widely used commonsense QA benchmarks: Abductive NLI (aNLI; Bhagavatula et al., 2020), CommonsenseQA (CSQA; Talmor et al., 2019), PhysicalIQA (PIQA; Bisk et al., 2020), SocialIQA (SocialIQA; Sap et al., 2019), and WinoGrande (WG; Sakaguchi et al., 2021). These benchmarks are designed to evaluate a range of knowledge types crucial for robust commonsense reasoning (Shi et al., 2023; Wang and Song, 2024).

We compare the performance of the best LLM edited with ConceptEdit against its corresponding vanilla baseline across all benchmarks, with the results visualized in Figure 3. The results show that models edited with ConceptEdit achieve significant performance improvements across all benchmarks, with particularly notable gains in aNLI and SocialIQA. These findings demonstrate the effectiveness of ConceptEdit in enhancing commonsense reasoning capabilities and suggest its potential for broader applications in improving LLM performance on real-world reasoning tasks.

Figure 4: VERA evaluation scores of edited LLMs with and without integrating conceptualization. (Bar chart of Mean Score for Mistral-7B, Llama-3-8B, and ChatGLM2-6b; series: Conceptualization, Non-Conceptualization.)
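The plausible ratio used in Section 4.1 is simply the fraction of verifier scores above 0.5. The sketch below makes that arithmetic explicit; the function name and the handling of a score exactly equal to the threshold (counted as not plausible) are assumptions, and the scores in the usage note are made-up illustrative values, not results from the paper.

```python
from typing import Sequence


def plausible_rate(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of generations whose plausibility score exceeds the threshold.

    Assumption: a score exactly equal to the threshold is not counted as plausible.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(score > threshold for score in scores) / len(scores)
```

For example, `plausible_rate([0.9, 0.2, 0.7, 0.5])` counts two of four scores as plausible. Computing this quantity for an edited model and for its unedited counterpart on the same test prompts gives the kind of comparison plotted in Figure 2.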
4.3 Ablation Study

Finally, to validate the effect of conceptualization, we conducted an ablation study on MEMIT by removing the conceptualization step and comparing performance. In this setup, we edit LLMs both with and without the integration of conceptualization and instantiation, and evaluate their performance by examining the average VERA scores of the generated outputs on the testing set. The conceptualized variant leveraged enriched commonsense triples generated via abstraction and instantiation prior to the editing process, while the non-conceptualized variant directly applied MEMIT without these preprocessing steps.

Figure 4 demonstrates that the conceptualized variants consistently outperform their non-conceptualized counterparts, achieving higher plausibility and improved downstream task accuracy. These results suggest that the enriched conceptual patterns introduced before editing not only enhance plausibility but also enable the model to generalize commonsense knowledge to more complex reasoning tasks, ultimately boosting overall performance.

5 Conclusions

In this paper, we introduce ConceptEdit, a novel knowledge editing framework designed to enhance commonsense reasoning in LLMs by addressing the challenges of limited knowledge coverage, scalability, and flexible representation. By integrating automated verification through VERA and semantic enrichment via conceptualization and instantiation, ConceptEdit enables more effective and generalizable editing of commonsense knowledge. Experimental results demonstrate significant improvements in both knowledge plausibility and downstream task performance, validating the effectiveness of our approach. We envision that ConceptEdit will inspire future research on scalable and context-aware knowledge editing, paving the way for LLMs to better handle the complexity and diversity of commonsense reasoning.
Limitations

Our approach, ConceptEdit, advances LLM commonsense reasoning through conceptualization and iterative knowledge editing, yet several challenges persist. First, editing one piece of knowledge can cascade through related concepts, creating non-linear interactions that are difficult to detect and manage, especially as the knowledge base scales up. Second, iterative updates risk knowledge drift, where successive edits subtly conflict with or overwrite prior facts, emphasizing the need for robust frameworks to maintain consistency. Finally, the lack of stable ground truth for commonsense, which is often context-sensitive and culturally variable, complicates standardization. Addressing these challenges will require globally coordinated editing mechanisms, improved theoretical frameworks, and systematic human-in-the-loop validation to ensure edits align with broader consensus and expert judgment.

Ethics Statement

In this paper, all datasets and models used are free and accessible for research purposes, aligning with their intended usage. The expert annotators are graduate students with extensive experience in NLP and commonsense reasoning research, and they voluntarily agreed to participate without compensation. Therefore, we believe there are no ethical concerns associated with our work.

References

Emily Allaway, Jena D. Hwang, Chandra Bhagavatula, Kathleen R. McKeown, Doug Downey, and Yejin Choi. 2023. Penguins don't fly: Reasoning about generics through instantiations and exceptions. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, pages 2610–2627. Association for Computational Linguistics.

Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-tau Yih, and Yejin Choi. 2020. Abductive commonsense reasoning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7432–7439. AAAI Press.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing factual knowledge in language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 6491–6506. Association for Computational Linguistics.

Chunkit Chan, Cheng Jiayang, Weiqi Wang, Yuxin Jiang, Tianqing Fang, Xin Liu, and Yangqiu Song. 2024. Exploring the potential of chatgpt on sentence level relations: A focus on temporal, causal, and discourse relations. In Findings of the Association for Computational Linguistics: EACL 2024, St. Julian's, Malta, March 17-22, 2024, pages 684–721. Association for Computational Linguistics.

Ernest Davis and Gary Marcus. 2015. Commonsense reasoning and commonsense knowledge in artificial intelligence. Commun. ACM, 58(9):92–103.

Wenxuan Ding, Weiqi Wang, Sze Heng Douglas Kwok, Minghao Liu, Tianqing Fang, Jiaxin Bai, Xin Liu, Changlong Yu, Zheng Li, Chen Luo, Qingyu Yin, Bing Yin, Junxian He, and Yangqiu Song. 2024. Intentionqa: A benchmark for evaluating purchase intention comprehension abilities of language models in e-commerce. In Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12-16, 2024, pages 2247–2266. Association for Computational Linguistics.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Grégoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel M. Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, et al. 2024. The llama 3 herd of models. CoRR, abs/2407.21783.

Benjamin Van Durme, Phillip Michalak, and Lenhart K. Schubert. 2009. Deriving generalized knowledge from corpora using wordnet abstraction. In EACL 2009, 12th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, Athens, Greece, March 30 - April 3, 2009, pages 808–816. The Association for Computer Linguistics.

Tianqing Fang, Quyet V. Do, Sehyun Choi, Weiqi Wang, and Yangqiu Song. 2023. CKBP v2: An expert-annotated evaluation set for commonsense knowledge base population. CoRR, abs/2304.10392.

Tianqing Fang, Weiqi Wang, Sehyun Choi, Shibo Hao, Hongming Zhang, Yangqiu Song, and Bin He. 2021a. Benchmarking commonsense knowledge base population with an effective evaluation dataset. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 8949–8964. Association for Computational Linguistics.

Tianqing Fang, Hongming Zhang, Weiqi Wang, Yangqiu Song, and Bin He. 2021b. DISCOS: bridging the gap between discourse knowledge and commonsense knowledge. In WWW '21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021, pages 2648–2659. ACM / IW3C2.

Tom Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, and Marzyeh Ghassemi. 2023. Aging with GRACE: lifelong model editing with discrete key-value adaptors. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.

Mutian He, Tianqing Fang, Weiqi Wang, and Yangqiu Song. 2024. Acquiring and modeling abstract commonsense knowledge via conceptualization. Artif. Intell., 333:104149.

Xiusheng Huang, Yequan Wang, Jun Zhao, and Kang Liu. 2024. Commonsense knowledge editing based on free-text in llms. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 14870–14880. Association for Computational Linguistics.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. CoRR, abs/2310.06825.

Tianjie Ju, Yijin Chen, Xinwei Yuan, Zhuosheng Zhang, Wei Du, Yubin Zheng, and Gongshen Liu. 2024. Investigating multi-hop factual shortcuts in knowledge editing of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 8987–9001. Association for Computational Linguistics.

Jiacheng Liu, Wenya Wang, Dianzhuo Wang, Noah A. Smith, Yejin Choi, and Hannaneh Hajishirzi. 2023. Vera: A general-purpose plausibility estimation model for commonsense statements. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 1264–1287. Association for Computational Linguistics.

Jingping Liu, Tao Chen, Chao Wang, Jiaqing Liang, Lihan Chen, Yanghua Xiao, Yunwen Chen, and Ke Jin. 2022. Vocsk: Verb-oriented commonsense knowledge mining with taxonomy-guided induction. Artif. Intell., 310:103744.

Kaixin Ma, Filip Ilievski, Jonathan Francis, Yonatan Bisk, Eric Nyberg, and Alessandro Oltramari. 2021a. Knowledge-driven data construction for zero-shot evaluation in commonsense question answering. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 13507–13515. AAAI Press.
Kaixin Ma, Filip Ilievski, Jonathan Francis, Satoru Ozaki, Eric Nyberg, and Alessandro Oltramari. 2021b. Exploring strategies for generalizable commonsense reasoning with pre-trained models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 5474–5483. Association for Computational Linguistics.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.

Kevin Meng, Arnab Sen Sharma, Alex J. Andonian, Yonatan Belinkov, and David Bau. 2023. Mass-editing memory in a transformer. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

George A. Miller. 1995. Wordnet: A lexical database for english. Commun. ACM, 38(11):39–41.

Gregory Murphy. 2004. The big book of concepts. MIT press.

OpenAI. 2024a. Gpt-4o mini: advancing cost-efficient intelligence. OpenAI.

OpenAI. 2024b. Hello gpt-4o. OpenAI.

Hao Peng, Xiaozhi Wang, Shengding Hu, Hailong Jin, Lei Hou, Juanzi Li, Zhiyuan Liu, and Qun Liu. 2022. COPEN: probing conceptual knowledge in pre-trained language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 5015–5035. Association for Computational Linguistics.

Vyas Raina, Adian Liusie, and Mark J. F. Gales. 2024. Is llm-as-a-judge robust? investigating universal adversarial attacks on zero-shot LLM assessment. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 7499–7517. Association for Computational Linguistics.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: an adversarial winograd schema challenge at scale. Commun. ACM, 64(9):99–106.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social iqa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 4462–4472. Association for Computational Linguistics.

Haochen Shi, Weiqi Wang, Tianqing Fang, Baixuan Xu, Wenxuan Ding, Xin Liu, and Yangqiu Song. 2023. QADYNAMICS: training dynamics-driven synthetic QA diagnostic for zero-shot commonsense question answering. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 15329–15341. Association for Computational Linguistics.

Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu Chen. 2011. Short text conceptualization using a probabilistic knowledgebase. In IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011, pages 2330–2336. IJCAI/AAAI.

Yangqiu Song, Shusen Wang, and Haixun Wang. 2015. Open domain short text conceptualization: A generative + descriptive modeling approach. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 3820–3826. AAAI Press.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4149–4158. Association for Computational Linguistics.

Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.

Jiaan Wang, Yunlong Liang, Zengkui Sun, Yuxuan Cao, Jiarong Xu, and Fandong Meng. 2024a. Cross-lingual knowledge editing in large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 11676–11686. Association for Computational Linguistics.

Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, and Jundong Li. 2025. Knowledge editing for large language models: A survey. ACM Comput. Surv., 57(3):59:1–59:37.

Weiqi Wang, Limeng Cui, Xin Liu, Sreyashi Nag, Wenju Xu, Sheikh Sarwar, Chen Luo, Yang Laurence Li, Hansu Gu, Hui Liu, Changlong Yu, Jiaxin Bai, Yifan Gao, Haiyang Zhang, Qi He, Shuiwang Ji, and Yangqiu Song. 2024b. EcomScriptBench: A multi-task benchmark for e-commerce script planning via step-wise intention-driven product association. CoRR.

Weiqi Wang, Tianqing Fang, Wenxuan Ding, Baixuan Xu, Xin Liu, Yangqiu Song, and Antoine Bosselut. 2023a. CAR: Conceptualization-augmented reasoner for zero-shot commonsense question answering. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13520–13545, Singapore. Association for Computational Linguistics.

Weiqi Wang, Tianqing Fang, Chunyang Li, Haochen Shi, Wenxuan Ding, Baixuan Xu, Zhaowei Wang, Jiaxin Bai, Xin Liu, Jiayang Cheng, Chunkit Chan, and Yangqiu Song. 2024c. CANDLE: iterative conceptualization and instantiation distillation from large language models for commonsense reasoning.
In Proceedings of the 62nd Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024 . Association for Computational Linguis- tics. 7 Page 8: Weiqi Wang, Tianqing Fang, Haochen Shi, Baixuan Xu, Wenxuan Ding, Liyu Zhang, Wei Fan, Jiaxin Bai, Haoran Li, Xin Liu, and Yangqiu Song. 2024d. On the role of entity and event level conceptualiza- tion in generalizable reasoning: A survey of tasks, methods, applications, and future directions. CoRR , abs/2406.10885. Weiqi Wang, Tianqing Fang, Baixuan Xu, Chun Yi Louis Bo, Yangqiu Song, and Lei Chen. 2023b. CAT: A contextualized conceptualization and instan- tiation framework for commonsense reasoning. In Proceedings of the 61st Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023 , pages 13111–13140. Association for Computa- tional Linguistics. Weiqi Wang and Yangqiu Song. 2024. MARS: bench- marking the metaphysical reasoning abilities of lan- guage models with a multi-task evaluation dataset. CoRR , abs/2406.02106. Peter West, Ronan Le Bras, Taylor Sorensen, Bill Yuchen Lin, Liwei Jiang, Ximing Lu, Khyathi Chandu, Jack Hessel, Ashutosh Baheti, Chandra Bhagavatula, and Yejin Choi. 2023. Novacomet: Open commonsense foundation models with sym- bolic knowledge distillation. In Findings of the Asso- ciation for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023 , pages 1127–1149. Association for Computational Linguistics. Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Qili Zhu. 2012. Probase: a probabilistic tax- onomy for text understanding. In Proceedings of the ACM SIGMOD International Conference on Manage- ment of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012 , pages 481–492. ACM. 
Baixuan Xu, Weiqi Wang, Haochen Shi, Wenxuan Ding, Huihao Jing, Tianqing Fang, Jiaxin Bai, Xin Liu, Changlong Yu, Zheng Li, Chen Luo, Qingyu Yin, Bing Yin, Long Chen, and Yangqiu Song. 2024a. MIND: multimodal shopping intention distillation from large vision-language models for e-commerce purchase understanding. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 7800–7815. Association for Computational Linguistics.

Derong Xu, Ziheng Zhang, Zhihong Zhu, Zhenxi Lin, Qidong Liu, Xian Wu, Tong Xu, Wanyu Wang, Yuyang Ye, Xiangyu Zhao, Enhong Chen, and Yefeng Zheng. 2024b. Editing factual knowledge and explanatory ability of medical large language models. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, CIKM 2024, Boise, ID, USA, October 21-25, 2024, pages 2660–2670. ACM.

Zonglin Yang, Xinya Du, Erik Cambria, and Claire Cardie. 2023. End-to-end case-based reasoning for commonsense knowledge base completion. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, pages 3491–3504. Association for Computational Linguistics.

Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, and Zihan Wang. 2024. Chatglm: A family of large language models from GLM-130B to GLM-4 all tools. CoRR, abs/2406.12793.

Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, Siyuan Cheng, Ziwen Xu, Xin Xu, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, and Huajun Chen. 2024. A comprehensive study of knowledge editing for large language models. CoRR, abs/2401.01286.

---