loader
Generating audio...
Extracting PDF content...

arxiv

Paper 2503.04141

HEISIR: Hierarchical Expansion of Inverted Semantic Indexing for Training-free Retrieval of Conversational Data using LLMs

Authors: Sangyeop Kim, Hangyeul Lee, Yohan Lee

Published: 2025-03-06

Abstract:

The growth of conversational AI services has increased demand for effective information retrieval from dialogue data. However, existing methods often face challenges in capturing semantic intent or require extensive labeling and fine-tuning. This paper introduces HEISIR (Hierarchical Expansion of Inverted Semantic Indexing for Retrieval), a novel framework that enhances semantic understanding in conversational data retrieval through optimized data ingestion, eliminating the need for resource-intensive labeling or model adaptation. HEISIR implements a two-step process: (1) Hierarchical Triplets Formulation and (2) Adjunct Augmentation, creating semantic indices consisting of Subject-Verb-Object-Adjunct (SVOA) quadruplets. This structured representation effectively captures the underlying semantic information from dialogue content. HEISIR achieves high retrieval performance while maintaining low latency during the actual retrieval process. Our experimental results demonstrate that HEISIR outperforms fine-tuned models across various embedding types and language models. Beyond improving retrieval capabilities, HEISIR also offers opportunities for intent and topic analysis in conversational data, providing a versatile solution for dialogue systems.

Paper Content: on Alphaxiv
Page 1: HEISIR: Hierarchical Expansion of Inverted Semantic Indexing for Training-free Retrieval of Conversational Data using LLMs Sangyeop Kim†1,2, Hangyeul Lee2, Yohan Lee1 1Coxwave 2Seoul National University sangyeop.kim@coxwave.com, mikelee@snu.ac.kr, yohan.lee@coxwave.com Abstract The growth of conversational AI services has increased demand for effective information re- trieval from dialogue data. However, existing methods often face challenges in capturing semantic intent or require extensive labeling and fine-tuning. This paper introduces HEISIR (Hierarchical Expansion of Inverted Seman- tic Indexing for Retrieval), a novel frame- work that enhances semantic understanding in conversational data retrieval through op- timized data ingestion, eliminating the need for resource-intensive labeling or model adap- tation. HEISIR implements a two-step pro- cess: (1) Hierarchical Triplets Formulation and (2) Adjunct Augmentation, creating seman- tic indices consisting of Subject-Verb-Object- Adjunct (SVOA) quadruplets. This structured representation effectively captures the underly- ing semantic information from dialogue con- tent. HEISIR achieves high retrieval perfor- mance while maintaining low latency during the actual retrieval process. Our experimental results demonstrate that HEISIR outperforms fine-tuned models across various embedding types and language models. Beyond improv- ing retrieval capabilities, HEISIR also offers opportunities for intent and topic analysis in conversational data, providing a versatile solu- tion for dialogue systems. 1 Introduction Conversational AI is being deployed across diverse industries, including personalized services (Koca- balli et al., 2019), code generation (Li et al., 2022), educational tutoring (Mousavinasab et al., 2021) and even in social welfare services (Jo et al., 2023). This rapid expansion has resulted in vast amounts of dialogue data, rich with insights into user needs and behavioral patterns. To harness this wealth of information, Information Retrieval approach tai- lored to the unique characteristics of conversational †Corresponding author.data is necessary. We define this area as Conversa- tional Data Retrieval (CDR), which aims to enable efficient access to and extraction of relevant infor- mation from large-scale conversational data. CDR is highly important for both end-users and conversational AI service providers. End-users frequently need to retrieve specific information from past conversations, such as previous agree- ments or discussion outcomes, using natural lan- guage queries. Service providers can utilize CDR to analyze user interactions and enhance their AI- powered services. For example, CDR can enhance Retrieval Augmented Generation (RAG) systems by providing relevant dialogue examples (Lewis et al., 2020; Wang et al., 2024), and support ser- vice improvement by identifying patterns in user behavior and recurring topics (Motger et al., 2022; Owoicho et al., 2022). However, conventional sparse and dense retrieval methods exhibit limited performance when applied to conversational data. Unlike traditional document data, conversational data has distinct characteristics that make traditional retrieval approaches insuffi- cient. First, each utterance is assigned to a specific speaker, meaning that the same message can be expressed in entirely different ways depending on who is speaking. Additionally, a speaker’s inten- tion can be multifaceted; a single utterance may have multiple, often complex, intents in it. Further- more, the meaning of utterances depends heavily on the context of the dialogue and the relationship between participants, often without explicit con- textual markers. These unique characteristics pose significant challenges that existing context-aware retrieval models, trained primarily on document- based QA tasks (Yang et al., 2015; Rajpurkar et al., 2016; Nguyen et al., 2016), struggle to address. Apart from the limited effectiveness of existing retrieval methods on conversational data, the struc- ture of the retrieval system itself poses a significant challenge to practical implementation. A typical re- 1arXiv:2503.04141v1 [cs.IR] 6 Mar 2025 Page 2: Figure 1: Architecture of HEISIR framework: Data Ingestion Phase trieval system consists of two main phases (Wang et al., 2021): the data ingestion phase, where in- coming data is processed and indexed offline, and the retrieval phase, where relevant information is searched based on user queries. Recent research on retrieval system leverages the powerful context understanding capability of LLMs through tech- niques such as re-ranking (Zhang et al., 2023), query rewriting (Yu et al., 2020), and using larger retrieval models (Ma et al., 2023; Peng et al., 2023). However, these approaches, especially when incor- porating LLMs, introduce significant latency trade- off in the retrieval phase, making them impractical for real-time services. Building on these insights, we propose a novel framework that extracts and processes the inverted semantic indices in the data ingestion phase, unlike traditional approaches that process indices in the retrieval phase. HEISIR (Hierarchical Expansion ofInverted Semantic Indexing for Retrieval) imple- ments structured semantic indices that capture the inherent syntactic hierarchy within sentences. Over- all scheme of our framework is detailed in Figure 1. HEISIR constructs search indices based on the inherent syntax present in all natural language sen- tences, eliminating the need for extensive labeling and training. In data ingestion phase, HEISIR ex- tracts semantic indices and stores them as inverted indices . In retrieval phase, HEISIR computes score to retrieve the most relevant conversations. Our ap- proach not only enhances retrieval performance but also offers practical advantages, as it eliminates latency during the retrieval phase. The key contri- butions of this research are:1.Improved retrieval performance signifi- cantly enhances retrieval capabilities with only negligible increase of latency by optimizing data ingestion. 2.Practical real-world applicability enables deployment of effective dialogue retrieval sys- tems in production environments lacking la- beled training data. 3.Practical-scale Robustness consistently en- hances performance when integrated with any combination of language model scales. 4.Versatility beyond retrieval provides highly interpretable atomic semantic units, enabling intuitive intent and topic analysis. 2 Linguistic Preliminaries To design semantic indices for CDR, it is necessary to identify two fundamental components in a mes- sage: the speaker and the intent. These components correspond to the subject and the verb of a sentence. To capture complex intents while maintaining the structural integrity, HEISIR follows the syntactic comprehension process of humans. Numerous studies support the view that hu- mans comprehend sentences incrementally. Specif- ically, sentences are processed by first construct- ing the simplest syntactic structure and then in- tegrating semantic adjuncts to achieve full un- derstanding (Kamide et al., 2003; Altmann and Mirkovi ´c, 2009; Fossum and Levy, 2012). Phrase Structure Grammar (PSG) (Chomsky, 2002; Gazdar, 1985) is one of the most widely used models to explain this top-down syntactic pro- cessing. PSG breaks down natural language sen- 2 Page 3: tences into constituents —syntactically significant units—organized in a hierarchical binary tree struc- ture. The overall scheme of how PSG parses mes- sages for HEISIR is detailed in Figure 2. Figure 2: Phrase Structure Grammar and Constituents Another key concept central to syntactic pro- cessing is verb valency . Verb valency refers to the number of arguments that a verb can take in a sen- tence, and it describes the relationship between the verb and other elements in the clause, such as the subject, direct object, and indirect object. Verbs can be categorized based on their valency as ava- lent (taking no arguments), monovalent (taking one argument), divalent (taking two arguments), ortrivalent (taking three arguments). Verb valency plays a critical role in determining the constituent hierarchy in sentences, as different valency types lead to different phrase structures. In this study, we develop a semantic indexing framework according to hierarchical syntactic pro- cessing schemes that represent the three most com- mon types of verb valency: monovalent, divalent, and trivalent. We parse the constituents from sen- tences as follows: •Subject : The entity performing the action or the one that the sentence is about. •Verb : The action or state described in the sen- tence. •Object : The entity directly affected by the verb’s action. •Adjunct : An optional or necessary element that provides additional information about the action, such as time, place, or manner. HEISIR borrows these concepts from syntax, but uses them in a slightly different context. For in- stance, HEISIR fixes the Subject to the speaker ofutterance, to perform information retrieval in con- versational data. Therefore, avalent verbs are disre- garded due to the existence of an explicit subject. This makes index tuples include at least two con- stituents: the Subject and the Verb . Furthermore, we use the word Adjunct in a slightly broader con- text than is conventional; For divalent verbs that take two objects, it is syntactically correct to al- locate both direct and indirect objects under the Object constituent. Instead, we include indirect object into the Adjunct category for structural in- tegrity and better search performance. 3 Hierarchical Expansion of Inverted Semantic Indexing for Retrieval Figure 3: 2-Step Expansion Process of HEISIR HEISIR is a novel conversational data retrieval framework that incorporates incremental syntactic processing with inverted indexing. The framework invests resources in the data ingestion phase, allow- ing for optimized performance during the retrieval phase without sacrificing latency. Figure 3 outlines two key steps of the HEISIR framework: (1) Hierarchical Triplets Formula- tion, identifying the fundamental syntactic con- stituents – subject, verb, and object – forming the core structures of the sentence; (2) Adjunct Expan- sion, expanding SVO triplets to SVOA quadruplets, where A represents an Adjunct. While a single-step approach for SVOA quadru- plet extraction may seem straightforward, it has critical limitations in accuracy and control. For example, it generates redundant variations like “teaches kids at kindergarten” and “teaches chil- dren in kindergarten” that carry the same meaning, introducing unnecessary index noise. In contrast, HEISIR’s two-step approach first extracts core SVO triplets and then carefully adds adjuncts, effectively preventing redundancy while 3 Page 4: maintaining high-quality indices. Furthermore, the experimental results supporting this analysis are presented in Section 6.2. The detailed prompts used in our approach can be found in Appendix A. 3.1 Data Ingestion Phase Step 1: Hierarchical Triplets Formulation In this step, HEISIR breaks down sentences into one or more SVO triplets . These triplets capture the core syntactic hierarchy within the dialogue con- tent, addressing two key limitations of traditional embedding methods (Mikolov, 2013; Pennington et al., 2014). First, HEISIR reduces ambiguity in the embeddings of complex sentences. HEISIR de- composes messages into multiple SVO triplets, sig- nificantly reducing sentence complexity. Addition- ally, HEISIR strictly excludes pronouns to ensure semantic completeness in each index. Second, our approach minimizes the impact of non-semantic elements on performance. As discussed in sec- tion 2, HEISIR fixes the subject to the speaker of the message to enhance retrieval performance. During this syntactic transformation, avalent verbs are transformed into their mono-, di-, or trivalent equivalents, and all tenses, auxiliaries, and syn- tactic markers are removed. This process refines HEISIR-generated triplets to capture only semanti- cally significant elements of the message. Step 2: Adjunct Augmentation SVO triplet in- dices extracted in step 1 identify hierarchical struc- tures in the message. To enhance specificity, we introduce a Adjunct Augmentation process. We cat- egorize four distinct patterns of adjuncts: -Detailed Content and Theme of Discussion specifies the target of communication or conver- sational themes using prepositions. -Reason or Causation explains the underlying causes or reasons for actions or situations, ad- dressing the "why" of a scenario. -Condition or Accompanying Circumstance de- scribes the context or conditions under which actions occur or situations exist. -No information indicates cases where adding detailed information is impossible or not mean- ingful for the given sentence structure. Following this step, each message in a conver- sation is encoded into one or more SVOA quadru- plets. These SVOA quadruplets break down com- plex messages into comprehensible semantic units. The indices are then stored in an inverted indexstructure, which facilitates efficient retrieval in the subsequent phase. 3.2 Retrieval Phase: Scoring After deriving SVOA quadruplets from a conversa- tion, we devise a method to collectively evaluate embeddings of five conversational components — conversation, message, SV , SVO, SVOA — in a single score metric. To formalize our scoring method, we represent queries and conversations with embeddings Eqand Econvrespectively, while other conversational com- ponents C={message, SV , SVO, SVOA }are en- coded as Ec. The relevance between components is computed using a vector similarity function f(·,·). The scoring process is defined as follows: Sconv=f(Eq, Econv) (1) Sc= max Ecmf(Eq, Ecm)forc∈C where Ecm∈Ec(m), m∈conv(2) SHEISIR =Sconv+X c∈CSc (3) Equation (1) measures the semantic similarity between the query and the entire conversation con- tent to capture conversation-level relevance. Com- ponent scores are computed as shown in (2), where we evaluate each type separately by finding the highest similarity between the query and compo- nent instances within the conversation. For each conversational component c∈C,Ecmdenotes a set of embeddings of component cin the mes- sagem, and maxEcmpicks best from multiple <component, message> pairs. Finally, the overall HEISIR score in (3) is determined by aggregating the conversation-level similarity with individual component scores to provide a comprehensive mea- sure of relevance. Potential alternatives of maximization function in (2) are summation and averaging. However, we employ maximization instead of these alternatives since summation and averaging introduce noise from less relevant messages, whereas focusing on the most salient component is intuitively more ef- fective. 4 Experiment 4.1 Dataset We select five dialogue datasets addressing key top- ics in conversational data analysis: profanity detec- 4 Page 5: tion, user satisfaction evaluation, and personal in- formation protection. We extract queries from five datasets and map them to corresponding conversa- tion sessions. The statistics of the derived dataset is detailed in Table 1. -BAD (Xu et al., 2021) focuses on improving safety of conversational agents. -DICES (Aroyo et al., 2024) evaluates safety of AI responses in diverse contexts. -Daily Dialog (Li et al., 2017) provides labeled multi-turn dialogues reflecting user sentiments. -PILD (Xu et al., 2020) detects and protects sen- sitive personal information. -USS (Sun et al., 2021) simulates user satisfaction in task-oriented dialogue systems. Dataset Conv. QueryUtterances per Conv.Mapped Conv. per Query Train4,0963,00012.315.5 Validation 590 14.9 Test 4,035 3,501 12.7 15.4 Table 1: Statistics of Dataset 4.2 Baseline For the evaluation of Information Retrieval, we compare the following baseline models: -DPR (Karpukhin et al., 2020): Learns dense em- beddings for queries and passages using a dual- encoder architecture for efficient retrieval. -SPLADE-v3 (Lassance et al., 2024): Employs sparse lexical representations for expansion- based retrieval, enhancing earlier SPLADE ver- sions (Formal et al., 2021). -LLM2Vec (BehnamGhader et al., 2024): Creates dense vector representations of text using large language models to facilitate retrieval tasks. For training-free models, we utilize the follow- ing in our comparison: -CoT Expansion (Jagerman et al., 2023): Ex- pands queries using CoT prompting and uses both original and expanded queries for retrieval. -HyDE (Gao et al., 2022a): Generates hypotheti- cal relevant documents using an LLM to enhance retrieval performance. -LameR (Shen et al., 2023a): Produces hypotheti-cal documents using BM25 initial results to im- prove retrieval effectiveness. 4.3 Experimental Setup For context consideration, we set a window of k= 2previous messages. We set LLMs temperature to 0.0, and use cosine similarity for embedding comparisons. All implementation details, including the versions of the embedding models and LLMs used, can be found in Appendix D. Embedding for HEISIR For this study, We select embedding models based on their perfor- mance in the MTEB benchmark (Muennighoff et al., 2022). We explore Encoder-only models with extended context capabilities, such as GTE (Li et al., 2023) and Nomic (Nussbaum et al., 2024). Another category of embedding models is decoder-only models based on LLMs. Specif- ically, we utilize LLM2Vec (BehnamGhader et al., 2024) with its LLama-3 (AI@Meta, 2024) based version. Additionally, we use NV-Embed (Lee et al., 2024), which currently achieves the highest aver- age MTEB benchmark score. For comprehensive evaluation, We employ latest OpenAI-small and large embedding models (OpenAI, 2024). LLMs for HEISIR The study uses a range of LLM models for different computational envi- ronments. For low-resource settings, we assess Gemma (2B) (Team et al., 2024), Phi-3 (mini, 3.8B) (Abdin et al., 2024), LLama-3 (8B) (AI@Meta, 2024), and Qwen-2 (7B) (Qwen, 2024), which are suitable for environments with limited hardware. We also test API-based models including GPT-3.5- turbo (OpenAI, 2023) and Claude 3 Haiku (An- thropic, 2024). 5 Result 5.1 Baseline Result Type Model acc@1 ndcg@5 ndcg@10 ndcg@20 BaselineDPR 0.0337 0.0274 0.0257 0.0283 SPLADE-v3 0.2048 0.1696 0.1614 0.1743 LLM2Vec 0.2825 0.2225 0.2092 0.2220 Fine-tunedDPR 0.1708 0.1420 0.1342 0.1444 SPLADE-v3 0.2474 0.2015 0.1908 0.2034 LLM2Vec 0.3533 0.2881 0.2745 0.2912 HEISIRGPT-3.5-turbo0.4085 0.3260 0.3056 0.3198+ OpenAI-large Table 2: Performance metrics of baseline, fine-tuned, and best models As shown in Table 2, models that typically excel in general document retrieval struggle to achieve 5 Page 6: ModelGTE (Encoder-only) LLM2Vec (Decoder-only) OpenAI-large (API) acc@1 ndcg@5 ndcg@10 ndcg@20 acc@1 ndcg@5 ndcg@10 ndcg@20 acc@1 ndcg@5 ndcg@10 ndcg@20BasePre-trained 0.3156 0.2550 0.2394 0.2520 0.2825 0.2225 0.2092 0.2220 0.3425 0.2826 0.2670 0.2839 Fine-tuned 0.3708 0.2994 0.2823 0.2984 0.3533 0.2881 0.2745 0.2912 - - - -MiniGemma 0.3096 0.2562 0.2415 0.2541 0.3136 0.2597 0.2428 0.2543 0.3422 0.2733 0.2574 0.2705 Phi-3 0.3582 0.2934 0.2757 0.2901 0.3865 0.3138 0.2936 0.3055 0.3950 0.3192 0.2992 0.3136SmallLLama-3 0.3728 0.3005 0.2829 0.2958 0.3902 0.3155 0.2944 0.3073 0.4053 0.3263 0.3072 0.3207 Qwen-2 0.3613 0.2952 0.2786 0.2921 0.3879 0.3123 0.2912 0.3035 0.4045 0.3224 0.3026 0.3166APIGPT-3.5-turbo 0.3633 0.2943 0.2776 0.2930 0.3916 0.3140 0.2934 0.3068 0.4085 0.3260 0.3056 0.3198 Haiku 0.3716 0.2978 0.2809 0.2943 0.3927 0.3161 0.2958 0.3086 0.4056 0.3245 0.3028 0.3185 Table 3: Performance comparison of different models across Encoder-only (GTE), Decoder-only (LLM2Vec), and API (OpenAI-large) approaches: Underlined values outperform the Fine-tuned model, bold values indicate the best performing LLM model combination for each embedding model. high performance in conversational data retrieval. The LLM2Vec model (BehnamGhader et al., 2024), which is based on a decoder-only architecture spe- cialized for dialogue generation, outperforms other models, indicating that embedding-based retrieval methods are more effective for semantic search in the context of conversational data. Fine-tuning alone does not sufficiently address this issue, sug- gesting the need for retrieval methods specifically tailored to conversational data. 5.2 Main Result In this section, we only discuss the best-performing combinations of LLMs and embedding models. Re- sults for all other combinations also show the ef- fectiveness of HEISIR, with details available in Appendix B. Table 3 demonstrates the effectiveness of HEISIR method in conversational data retrieval across various model architectures and embedding types. Our model, in combination with various LLMs, outperforms all pre-trained encoder-only and decoder-only baselines and API models, except for when paired with Gemma. Notably, many of these models even surpass the fine-tuned baselines, highlighting the key strength of our method: achiev- ing high performance without a need for resource- intensive data labeling and training. These results provide insights into optimal combinations for var- ious constraints. OpenAI-large with GPT-3.5-turbo achieve the highest overall performance. For en- vironments prohibiting external APIs, LLama-3 + LLM2Vec perform best, though other combina- tions show comparable results, offering flexibility. In resource-constrained settings, LLama-3 + GTE provides an efficient balance of performance and resource usage.Model acc@1 ndcg@5 ndcg@10 ndcg@20 Time (s) BM25 0.1465 0.1233 0.1200 0.1299 0.0045 CoT Expansion 0.1225 0.0963 0.0894 0.0926 2.0059 HyDE 0.2702 0.2139 0.2026 0.2128 2.6192 LameR 0.2245 0.1853 0.1752 0.1866 3.5465 LLM2Vec 0.2825 0.2225 0.2092 0.2220 0.2091 HEISIR LLM2Vec 0.3902 0.3155 0.2944 0.3073 0.2778 OpenAI-large 0.3425 0.2826 0.2670 0.2839 0.5966 HEISIR OpenAI-large 0.4085 0.3260 0.3056 0.3198 0.7334 Table 4: Performance Comparison of Different Models 5.3 Practical Retrieval Efficiency Table 4 demonstrates that HEISIR outperforms var- ious training-free methods in both speed and accu- racy. Methods that rely on LLMs for query expan- sion during the search phase, such as CoT Expan- sion (Jagerman et al., 2023), LameR (Shen et al., 2023b), and HyDE (Gao et al., 2022b), require sig- nificantly longer search times. In contrast, HEISIR adds minimal time during retrieval phase (0.0687 seconds to LLM2Vec, 0.1367 seconds to OpenAI- large) by requiring only simple score computation. This efficiency is achieved by shifting the computa- tional load to the data ingestion phase, which takes 3.86 seconds with GPT-3.5-turbo and 1.22 seconds with LLama-3. These ingestion times are reason- able considering they can be significantly reduced through parallelization techniques. Additionally, HEISIR is more cost-effective. While other meth- ods incur costs per search, HEISIR only requires a one-time data processing step during the data inges- tion phase. As search frequency increases, HEISIR remains economically efficient, whereas costs for other methods continue to rise. LLM inference for indexing incurs an additional cost of approximately $0.00894 per conversation, based on the latest API pricing. An interesting trend was observed in the exper- imental results: LameR, which is designed as an 6 Page 7: improvement over HyDE, shows inferior perfor- mance than HyDE. LameR integrates BM25 for higher retrieval performance. This unexpected re- sult suggests that while BM25 may be well-suited for traditional document retrieval, it appears to hin- der performance in conversational data retrieval. 6 Analysis 6.1 Interacting with Each Component Figure 4: Marginal Performance of Components Figure 4 shows the average performance across all experimental settings. Each contour represents a combination of the conversational components: Conversation, Message, SV , SVO, SVOA. No- tably, SVOA index embeddings outperforms ex- isting retrieval methods that rely on conversation and message embeddings. The best performance is achieved when embeddings of all conversational components are combined, demonstrating the ef- fectiveness of HEISIR. 6.2 2-step index construction The key strength of HEISIR lies in its 2-step SVOA quadruplet extraction process. This approach offers both qualitative and quantitative advantages over single-step extraction methods. First, the 2-step approach extracts SVOA quadruplets with greater precision. The Hierarchical Triplet Formulation step establishes the syntactic hierarchy within the message, while the Adjunct Augmentation step collects detailed information, simulating the incre- mental sentence comprehension process of humans. This method consistently produces higher-quality SVOA quadruplets compared to single-step extrac- tion. Second, quantitative evaluations show that our 2-step approach consistently outperforms the single-step method (Table 5).Model Metric LLM embedding acc@1 ndcg@5 ndcg@10 ndcg@20 Phi-3GTE0.3530 0.2883 0.2724 0.2856 (-1.45%) (-1.74%) (-1.20%) (-1.55%) LLM2Vec0.3405 0.2724 0.2559 0.2692 (-11.90%) (-13.19%) (-12.84%) (-11.88%) OpenAI-large0.3879 0.3149 0.2970 0.3137 (-1.80%) (-1.35%) (-0.74%) (+0.03%)LLama-3GTE0.3513 0.2872 0.2713 0.2843 (-5.77%) (-4.43%) (-4.10%) (-3.89%) LLM2Vec0.3362 0.2710 0.2544 0.2693 (-13.84%) (-14.11%) (-13.59%) (-12.37%) OpenAI-large0.3865 0.3106 0.2957 0.3123 (-4.64%) (-4.81%) (-3.74%) (-2.62%)GPT-3.5-turboGTE0.3510 0.2844 0.2682 0.2823 (-3.39%) (-3.36%) (-3.39%) (-3.65%) LLM2Vec0.3359 0.2696 0.2533 0.2673 (-14.22%) (-14.14%) (-13.67%) (-12.88%) OpenAI-large0.3876 0.3126 0.2954 0.3111 (-5.12%) (-4.11%) (-3.34%) (-2.72%) Table 5: Single-step Performance (% Indicates Perfor- mance Change Compared to 2-step) 6.3 Exploring the effects of score ensembles While our results confirm that combining the ex- tracted semantic indices enhances retrieval perfor- mance, this improvement can be claimed as a sim- pleensemble effect . In this subsection, we com- pare the ensemble effect with HEISIR. To measure the ensemble effect, we combine scores from all embedding models for the conversation and mes- sage components, excluding the significantly un- derperforming NV-Embed model (Lee et al., 2024) to avoid underestimating the ensemble effect. For comparison, we apply the HEISIR method to the Phi-3 model (Abdin et al., 2024), which is the smallest and least performant among our HEISIR variants. As shown in Table 6, HEISIR outperforms the ensemble effect, even in its least optimal setting. This demonstrates that HEISIR adds value beyond simple score aggregation. combination acc@1 ndcg@5 ndcg@10 ndcg@20 ensemble effect 0.3445 0.2784 0.2651 0.2802 (- NV-Embed) 0.3522 0.286 0.2715 0.2887 HEISIR Phi-3,GTE 0.3582 0.2934 0.2757 0.2901 HEISIR Phi-3,LLM2Vec 0.3865 0.3138 0.2936 0.3055 HEISIR Phi-3,OpenAI-large 0.3950 0.3192 0.2992 0.3136 Table 6: Comparison of Ensemble Effects 6.4 Potential to Hybrid Search In section 6.1, we observed that progressively adding conversational components improved re- trieval performance. Following this approach, we 7 Page 8: experimented with the addition of keywords to de- termine whether they enhance performance. Pre- vious research on hybrid search methods report promising performance improvements by incor- porating keyword to semantic search (Kuzi et al., 2020; Shen et al., 2023b). We explore addition of BM25 score into various HEISIR settings, with results shown in Table 7. The performance results reveal interesting pat- terns in the integration of BM25 (Robertson et al., 1995) with various models. When combined with weaker models like Gemma and NV-Embed, BM25 provided slight improvements. However, its integra- tion with stronger models ( HEISIR others) led to per- formance declines. This trend aligns with our pre- vious observations in Section 5.3. These findings suggest that BM25’s simplistic relevance estima- tion may struggle to capture the complex semantic relationships present in conversational contexts. Model acc@1 acc@5 ndcg@10 ndcg@20 Only BM25 0.1465 0.3610 0.1200 0.1299 HEISIR Gemma, All-embeddings 0.3048 0.5919 0.2315 0.2430 + BM25 score 0.3105 0.5939 0.2346 0.2465 HEISIR All-LLMs,NV-Embed 0.2806 0.5542 0.2107 0.2196 + BM25 score 0.2912 0.5688 0.2189 0.2279 HEISIR others 0.3822 0.6849 0.2886 0.3020 + BM25 score 0.3767 0.6791 0.2858 0.2997 Table 7: Metric Changes with and without BM25 6.5 Optimization of HEISIR through weighted sum HEISIR currently calculates the final score by equally summing all component scores. However, components may contribute differently to sentence meaning and overall performance, suggesting po- tential benefits from a weighted sum approach. Our experiments using random search demon- strate clear potential for performance improvement through weight optimization, as shown in Table 8. These weights can be further refined for specific domains, potentially yielding even better results. 6.6 Applications of Semantic Indexing Utilizing HEISIR for retrieval offers two key ad- vantages beyond performance. First, it enhances interpretability: the highest-scoring SVOA indices provide structured, detailed insights into retrieval results. Second, HEISIR enables easy and effective result modification. Traditional retrieval systems of- ten struggle to exclude unwanted results across se- mantically similar queries, but HEISIR overcomesModelBaseline Weighted acc@1 ndcg@20 acc@1 ndcg@20 OpenAI-large + GPT-3.5-turbo0.4085 0.3198 0.4179 0.3202 LLama-3 + LLM2Vec0.3902 0.3073 0.4050 0.3097 Table 8: Comparison of model performance with base- line and weighted sum approaches this by allowing the removal of specific semantic indices responsible for undesired results, Furthermore, the SV , SVO, and SVOA indices provide a guidance to user intent and conversation topics, aiding dialogue state tracking. Analysis of SI embeddings from OpenAI-large reveals clear clustering patterns that reproduce most intent la- bels in the USS dataset and encompass intents from other datasets (Table 9). This demonstrates the po- tential of HEISIR for automatic intent classification without fine-tuning. The potential for topic analysis using SVOA is further explored in Appendix C. Intent Label Representative Samples Rejection Declines, denies, refuses, rejects Suggestion & Planning Suggests, wants, plans, advises, offers Expression & Description & Explanation Expresses, describes, implies, explains Request Requests, asks for, needs, seeks Preference Likes, loves, enjoys, dislikes Acknowledgment & Gratitude Acknowledges, greets, thanks, appreciates Inquiry & Curiosity Inquires about, wants to know, wonders Mention & Specification States, specifies, confirms, clarifies Question Asks, questions Mention Mentions Table 9: Intent Clusters and Representative Samples 7 Related Work Semantic Inverted Indexing for Retrieval Tradi- tional retrieval methods streamline text searching with inverted indices (Zobel et al., 1998; Gormley and Tong, 2015; Dey et al., 2024), while knowl- edge graph embeddings (KGEs) focus on reveal- ing the innate structure of questions and answers (Wang et al., 2017; Bordes et al., 2014). However, term-based approaches fail to capture context in multi-turn conversations, and KGEs struggle with representing implicit relationships (Ge et al., 2024). There are very few researches that utilized SVO constituents as semantic indices, but were limited to simplistic SVO structure losing all semantic ad- juncts (Burek et al., 2007; Gao and Wang, 2009). Our research bridges these gaps with hierarchically expanded semantic indices, offering conversation- centric and detailed semantic representations than existing approaches. 8 Page 9: Semantic Search Semantic search models have shown capabilities in handling the semantic com- plexity in natural language queries. Encoder-based models, such as SBERT (Reimers and Gurevych, 2019), DPR (Karpukhin et al., 2020) and Sim- CSE (Gao et al., 2021), have shown high perfor- mance upon their introduction but lacked scalabil- ity in general. Current research efforts, including SGPT (Muennighoff, 2022), RepLLaMA (Ma et al., 2023), leverage LLMs to address the limitation of encoder-based approaches. However, LLM-based models typically require substantial computational resources for training and deployment. In contrast, HEISIR can achieve high performance without the need for labeling and training. 8 Conclusion In this paper, we introduced HEISIR, a novel frame- work for conversational data retrieval that reflects human sentence comprehension by capturing the syntactic hierarchy in natural language. HEISIR employs a 2-step extraction process to significantly improve the precision of semantic indices. Further- more, by building semantic understanding from nat- ural language syntax, HEISIR alleviates the need for extensive labeling and training. Our experiments demonstrate that HEISIR con- sistently enhances retrieval performance across a variety of settings. Additionally, the inherent in- terpretability of the method offers significant ad- vantages, making intent and topic analysis in dia- logue datasets more accessible, thus providing a versatile solution for conversational AI systems. As chat-based services continue to grow rapidly, our findings have the potential to greatly enhance conversational data retrieval for both end-users and service providers. Limitations Multilingual : HEISIR extracts quadruplets from messages based on the syntax of English, which is an isolating language. In isolating languages, word order plays a crucial role in semantic com- prehension since each word typically contains only one morpheme. This suggests that HEISIR may en- counter challenges when extracting SVOA quadru- plets from agglutinative and fusional languages, where word order does not directly reflect the syn- tactic hierarchy. Expanding HEISIR to Document Data : Ex- tending HEISIR from conversations to general doc-uments presents several challenges. Conversational data have clear speakers as subjects in each ut- terance. However, documents often use passive voice or describe third-party actions, creating com- plex subject relationships that are difficult to parse. Furthermore, while conversations typically contain clear actions, documents tend to focus more on de- scribing states and concepts, which makes SVOA analysis more challenging. This complexity neces- sitates advanced prompting and preprocessing tech- niques to accurately capture the subject and context. Additionally, it becomes important to account for avalent verbs in document retrieval, which are not considered in conversational data. Weight Optimization : HEISIR currently em- ploys simple summation to compute similarity scores. However, performance improvements were observed when assigning different weights to each conversational component. While uniform weight aggregation already outperforms current state-of- the-art retrieval models, further exploration is needed to optimize these weights for maximum efficiency. References Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harki- rat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219 . AI@Meta. 2024. Llama 3 model card. Gerry TM Altmann and Jelena Mirkovi ´c. 2009. Incre- mentality and prediction in human sentence process- ing. Cognitive science , 33(4):583–609. Anthropic. 2024. Claude 3 haiku: Our fastest model yet. Accessed: 2024-07-06. Lora Aroyo, Alex Taylor, Mark Diaz, Christopher Homan, Alicia Parrish, Gregory Serapio-García, Vin- odkumar Prabhakaran, and Ding Wang. 2024. Dices dataset: Diversity in conversational ai evaluation for safety. Advances in Neural Information Processing Systems , 36. Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. 2024. Llm2vec: Large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961 . Antoine Bordes, Jason Weston, and Nicolas Usunier. 2014. Open question answering with weakly super- vised embedding models. In Machine Learning and 9 Page 10: Knowledge Discovery in Databases: European Con- ference, ECML PKDD 2014, Nancy, France, Septem- ber 15-19, 2014. Proceedings, Part I 14 , pages 165– 180. Springer. Gaston Burek, Christian Pietsch, and Anne De Roeck. 2007. Svo triple based latent semantic analysis for recognising textual entailment. Noam Chomsky. 2002. Syntactic structures . Mouton de Gruyter. Snehasis Dey, Bhimasen Moharana, Utpal Chandra De, Tapaswini Samant, Trupti Mayee Behera, and Shob- han Banerjee. 2024. Search engine for qna using distributed inverted index system. In 2024 3rd In- ternational Conference for Innovation in Technology (INOCON) , pages 1–4. IEEE. Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. Splade: Sparse lexical and expan- sion model for first stage ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages 2288–2292. Victoria Fossum and Roger Levy. 2012. Sequential vs. hierarchical syntactic models of human incremen- tal sentence processing. In Proceedings of the 3rd workshop on cognitive modeling and computational linguistics (CMCL 2012) , pages 61–69. Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2022a. Precise zero-shot dense retrieval without rele- vance labels. arXiv preprint arXiv:2212.10496 . Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2022b. Precise zero-shot dense retrieval without rele- vance labels. arXiv preprint arXiv:2212.10496 . Ming Gao and Ji-Cheng Wang. 2009. Semantic search based on svo constructions in chinese. In 2009 In- ternational Conference on Web Information Systems and Mining , pages 213–216. IEEE. Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. Simcse: Simple contrastive learning of sentence em- beddings. arXiv preprint arXiv:2104.08821 . Gerald Gazdar. 1985. Generalized phrase structure grammar . Harvard University Press. Xiou Ge, Yun Cheng Wang, Bin Wang, C-C Jay Kuo, et al. 2024. Knowledge graph embedding: An overview. APSIPA Transactions on Signal and Infor- mation Processing , 13(1). Clinton Gormley and Zachary Tong. 2015. Elastic- search: the definitive guide: a distributed real-time search and analytics engine . " O’Reilly Media, Inc.". Rolf Jagerman, Honglei Zhuang, Zhen Qin, Xuanhui Wang, and Michael Bendersky. 2023. Query expan- sion by prompting large language models. arXiv preprint arXiv:2305.03653 .Eunkyung Jo, Daniel A Epstein, Hyunhoon Jung, and Young-Ho Kim. 2023. Understanding the benefits and challenges of deploying conversational ai lever- aging large language models for public health inter- vention. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems , pages 1– 16. Yuki Kamide, Gerry TM Altmann, and Sarah L Hay- wood. 2003. The time-course of prediction in in- cremental sentence processing: Evidence from an- ticipatory eye movements. Journal of Memory and language , 49(1):133–156. Vladimir Karpukhin, Barlas O ˘guz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 . Ahmet Baki Kocaballi, Shlomo Berkovsky, Juan C Quiroz, Liliana Laranjo, Huong Ly Tong, Dana Reza- zadegan, Agustina Briatore, and Enrico Coiera. 2019. The personalization of conversational agents in health care: systematic review. Journal of medical Internet research , 21(11):e15360. Saar Kuzi, Mingyang Zhang, Cheng Li, Michael Ben- dersky, and Marc Najork. 2020. Leveraging semantic and lexical matching to improve the recall of docu- ment retrieval systems: A hybrid approach. arXiv preprint arXiv:2010.01195 . Carlos Lassance, Hervé Déjean, Thibault Formal, and Stéphane Clinchant. 2024. Splade-v3: New baselines for splade. arXiv preprint arXiv:2403.06789 . Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. 2024. Nv-embed: Improved techniques for training llms as generalist embedding models. arXiv preprint arXiv:2405.17428 . Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Hein- rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock- täschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neu- ral Information Processing Systems , 33:9459–9474. Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. Dailydialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957 . Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with alphacode. Science , 378(6624):1092–1097. Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281 . 10 Page 11: Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. 2023. Fine-tuning llama for multi-stage text retrieval. Preprint , arXiv:2310.08319. Tomas Mikolov. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 . Quim Motger, Xavier Franch, and Jordi Marco. 2022. Software-based dialogue systems: survey, taxonomy, and challenges. ACM Computing Surveys , 55(5):1– 42. Elham Mousavinasab, Nahid Zarifsanaiey, Sharareh R. Niakan Kalhori, Mahnaz Rakhshan, Leila Keikha, and Marjan Ghazi Saeedi. 2021. Intelligent tutor- ing systems: a systematic review of characteristics, applications, and evaluation methods. Interactive Learning Environments , 29(1):142–163. Niklas Muennighoff. 2022. Sgpt: Gpt sentence em- beddings for semantic search. arXiv preprint arXiv:2202.08904 . Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2022. Mteb: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316 . Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: A human-generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268 . Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. 2024. Nomic embed: Training a reproducible long context text embedder. Preprint , arXiv:2402.01613. OpenAI. 2023. Gpt-3.5: Generative pre-trained trans- former. Accessed: 2024-07-06. OpenAI. 2024. New embedding models and api updates. https://openai.com/blog/ new-embedding-models-and-api-updates/ . Accessed: 2024-07-06. Paul Owoicho, Jeff Dalton, Mohammad Aliannejadi, Leif Azzopardi, Johanne R Trippas, and Svitlana Vakulenko. 2022. Trec cast 2022: Going beyond user ask and system retrieve with initiative and response generation. In TREC . Zhiyuan Peng, Xuyang Wu, and Yi Fang. 2023. Soft prompt tuning for augmenting dense retrieval with large language models. arXiv preprint arXiv:2307.08303 . Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word rep- resentation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages 1532–1543. Qwen. 2024. Qwen2 technical report. Accessed: 2024- 07-06.Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 . Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084 . Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at trec-3. Nist Special Publication Sp , 109:109. Tao Shen, Guodong Long, Xiubo Geng, Chongyang Tao, Tianyi Zhou, and Daxin Jiang. 2023a. Large language models are strong zero-shot retriever. arXiv preprint arXiv:2304.14233 . Tao Shen, Guodong Long, Xiubo Geng, Chongyang Tao, Tianyi Zhou, and Daxin Jiang. 2023b. Large language models are strong zero-shot retriever. arXiv preprint arXiv:2304.14233 . Weiwei Sun, Shuo Zhang, Krisztian Balog, Zhaochun Ren, Pengjie Ren, Zhumin Chen, and Maarten de Ri- jke. 2021. Simulating user satisfaction for the evalu- ation of task-oriented dialogue systems. In Proceed- ings of the 44rd International ACM SIGIR Confer- ence on Research and Development in Information Retrieval , SIGIR ’21. ACM. Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295 . Jianguo Wang, Xiaomeng Yi, Rentong Guo, Hai Jin, Peng Xu, Shengjun Li, Xiangyu Wang, Xiangzhou Guo, Chengming Li, Xiaohai Xu, et al. 2021. Milvus: A purpose-built vector data management system. In Proceedings of the 2021 International Conference on Management of Data , pages 2614–2627. Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. 2017. Knowledge graph embedding: A survey of approaches and applications. IEEE transactions on knowledge and data engineering , 29(12):2724–2743. Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, et al. 2024. Searching for best practices in retrieval-augmented generation. arXiv preprint arXiv:2407.01219 . Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan. 2021. Bot-adversarial dia- logue for safe conversational agents. In Proceedings of the 2021 Conference of the North American Chap- ter of the Association for Computational Linguistics: Human Language Technologies , pages 2950–2968. 11 Page 12: Qiongkai Xu, Lizhen Qu, Zeyu Gao, and Gholamreza Haffari. 2020. Personal information leakage detec- tion in conversations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Lan- guage Processing (EMNLP) , pages 6567–6580, On- line. Association for Computational Linguistics. Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. Wikiqa: A challenge dataset for open-domain ques- tion answering. In Proceedings of the 2015 con- ference on empirical methods in natural language processing , pages 2013–2018. Shi Yu, Jiahua Liu, Jingqin Yang, Chenyan Xiong, Paul Bennett, Jianfeng Gao, and Zhiyuan Liu. 2020. Few- shot generative conversational query rewriting. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Informa- tion Retrieval , pages 1933–1936. Longhui Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, and Min Zhang. 2023. Rankinggpt: Empowering large language models in text ranking with progressive enhancement. arXiv preprint arXiv:2311.16720 . Justin Zobel, Alistair Moffat, and Kotagiri Ramamo- hanarao. 1998. Inverted files versus signature files for text indexing. ACM Transactions on Database Systems (TODS) , 23(4):453–490.A Prompts for HEISIR We detail the prompts used for our method. All prompts use {{$variable}} as placeholders for ex- ternal variables. While the prompts shown here are designed for GPT-3.5-turbo, we adapted them for other LLM models by adding predefined tokens according to each model’s specific template. Each prompt consists of four rows in the prompt table: common LLM parameters, system prompt, 5-shot examples generated using GPT-4o (omitted in this appendix for brevity), and user message with input placeholders. A.1 Prompt for Hierarchical Triplets Formulation Temperature : 0.0 max_tokens : 1024 System: You must extract all [$information triplet$ ]in the message, given the context and role, subject to the following conditions: •[$information triplet$ ]should include anything you need to understand the message and the role, such as the role’s emotions, the topic of conversation, the intent of the message and etc. •You need to extract an [$information triplet$ ]for a new message according to the following instructions. Structural Constraints of [$information triplet$ ]: •The structure of [$information triplet$ ]:[$subject$ ] [$common verb$ ] [$target content$ ] •[$subject$ ]: Must be the {{$role}}. •[$common verb$ ]: –Must Use a singular present tense verb (e.g., "says", "asks", "wants to", "in- quires about", "looks into") to describe the {{$role}}’s action or intention. –If you need to use the negative form, write it as not in front of the common verb. However, use negative forms only when they are essential to understanding the sentence. •[$target content$ ]: –Must be a noun phrase of 3 characters or less –If[$target content$ ]contains content too specific to be generalized, such as a person’s name, webpage address, or code, generalize it by using general expressions such as person, friend, url, website, code and etc. –Should contain only one content; if multiple noun phrases or contents are needed, separate them into individual [$information triplet$ ]items. How to Construct an Information Triplet: 1.You receive the [$conversation context$ ]along with the [$message$ ]you need to analyze as input. 2.Review the [$conversation context$] to understand the nuance and content of the [$message$ ]. However, the [$conversation context$] should only be used to understand the message and should not be used to extract the [$information triplet$ ]. 3.Extract all the [$information triplet$ ]that can be obtained from the message in the form of a JSON. 4. The JSON format is as follows: { "information_triplet": [ {"{{$role}} [$common verb$]": [$target content$]}, {"{{$role}} [$common verb$]": [$target content$]}, {"{{$role}} [$common verb$]": [$target content$]}, ... ] } Few-shot examples User: [$conversation context$ ] {{$context}} [$message$ ] {{$role}}: {{$message}} Extract as much [$information triplet$ ]as possible from the message while maintain- ing all of the above conditions. We recommend extracting between 5 and 20 pieces of [$information triplet$ ], depending on the length of the sentence. The key of the informa- tion_triplet you create should start with {{$role}}. Answer: Table 10: Prompt and parameters for Hierarachical Triplets Expansion 12 Page 13: A.2 Adjunct Augmentation Temperature : 0.0 max_tokens : 1024 System: You will be provided with $conversation context$, $message$ and $information list$. $message$ is a real conversation message from {{$role}} that you need to analyze. $information list$ is information extracted from $message$, consisting of $subject$, $verb$, and $target content$. You need to elaborate on the $detail$ of $target content$ according to the following instructions. • Prepositions must be actively used to describe the $detail$ of the target content. •If the value placed in $detail$ is vague, such as a pronoun, please use a specific noun to make the meaning clear. • Here are three recommended strategies for elaborating on the $detail$: 1.Detailed Content and theme of Discussion This category involves prepositions that help specify the subject of communication, the object of emotions or actions, and the theme of a conversation. For example –user expresses anger: to assistant. –user asks questions: about climate change. –assistant maintains a professional attitude: towards the user’s queries. –assistant offers advice: regarding data protection. –user expresses confusion: over the assistant’s instructions. –user seeks clarification: with respect to the subscription plans. –assistant invests effort: in improving the user interface. 2.Reasons or Causations This set of phrases explains the cause or reason behind an action or situation. It answers the "why" of a scenario. For example –assistant pauses the service: because of maintenance needs. –user misses the deadline: due to a technical glitch. –user returns the product: owing to a manufacturing defect. –assistant improves response time: thanks to the user’s constructive feedback. 3.Conditions or accompanying Circumstances These prepositions are used to describe the conditions under which something happens or the context that accompanies an action. For example –user solves problems: with patience. –assistant writes articles: for users. –user reads the manual: over the weekend. –user leaves feedback: on the website. –user places the report: under the book. 4.In cases where no specific details can be included If a sentence structure consisting of {{$role}} $verb$ $target content$ $detail$ is impossible or adding detailed information is not meaningful, set "no information" as the value for $detail$. For example –user mentions Christmas: no information. • The answer should be written in JSON format as follows: { "detailed_information": [ {"{{$role}} [$verb$] [$target content$]": "[$detail$]"}, {"{{$role}} [$verb$] [$target content$]": "[$detail$]"}, ... ] } Few-shot examples User: [$conversation context$ ] {{$context}} [$message$ ] {{$role}}: {{$message}} [$information list$ ] {{$info_list}} Choose the best of the three strategies above and write a 2-3 word answer that clarifies the content of the sentence. Do not include the content of the sentence in your answer, but start with the preposition. The entire contents of [$Information list$] should be used as the key for detailed_information without any changes at all. Answer: Table 11: Prompt and parameters for Detailed Descrip- tion Augmentation B All results Table [12-17] displays performance for key com- ponent combinations across LLM and Embedding models. ’p’ and ’r’ represent precision and recall, respectively. The highest value per metric is in bold, with the second-highest underlined. Table 18 shows the baseline performance of all embedding models when using only conversationand messages. C Intent & Topic Analysis In the main text, we observed intent clustering us- ing SV . SVOA or SVO can reveal specific conversa- tion topics. Table C shows the results of clustering into 15 groups. Examining representative topics for each cluster demonstrates that we can recover most major conversation themes from our dataset, along with related topics. This analysis uses a basic K-means algorithm, showcasing the potential of Semantic indices for intent and topic analysis. While the current ap- proach is fundamental, it effectively captures and categorizes conversational content. We anticipate that more advanced methods could yield even more robust and nuanced models, opening up various possibilities for future research in natural language processing and conversational AI. D Reproducing For all experiments, we utilized a single NVIDIA H100 80GB HBM3 GPU. Our study employed various models from Hugging Face and API. The embedding models consisted of Alibaba- NLP/gte-large-en-v1.5 (GTE), nomic-ai/nomic- embed-text-v1 (Nomic), nvidia/NV-Embed-v1 (NV-Embed), and McGill-NLP/LLM2Vec-Meta- Llama-3-8B-Instruct-mntp-supervised (LLM2Vec). Language models included google/gemma-2b-it (Gemma), microsoft/Phi-3-mini-128k-instruct (Phi- 3), meta-llama/Meta-Llama-3-8B-Instruct (LLama- 3), Qwen/Qwen2-7B-Instruct (Qwen-2), GPT-3.5- turbo-0125, and claude-3-haiku-20240307. In the fine-tuning process, train/validation datasets were created like test data, using only train data and excluding DICES due to data unavailability. Fine- tuning ran for 10 epochs. For LLM2Vec, we used LoRA (r=16, alpha=32) with a 1e-4 learning rate; other models used 1e-5. All code and data are ac- cessible in the supplementary materials. E Impact of Context We select GPT-3.5-turbo as the LLM as it is the most accessible and widely model used in industry. Including previous context for semantic indices construction shows minor improvement in retrieval performance, as shown in Table 20. However, in terms of interpretability, a thor- ough analysis of SVOA indices reveals that SVOA quadruplets extracted without previous context fail 13 Page 14: embedding combination acc@1 acc@5 p@5 p@10 r@5 r@10 ndcg@10 ndcg@20 mrr@10 mrr@20 map@10 map@20 GTEsv 0.0229 0.0717 0.0159 0.0133 0.0058 0.0096 0.0158 0.0168 0.0441 0.0481 0.0059 0.0050 sv _svo 0.1311 0.3353 0.1020 0.0873 0.0440 0.0724 0.1045 0.1136 0.2189 0.2277 0.0518 0.0494 sv _svo_svoa 0.1899 0.4467 0.1492 0.1271 0.0709 0.1127 0.1552 0.1688 0.2983 0.3062 0.0831 0.0820 svoa_conv_msg 0.3253 0.6252 0.2424 0.1990 0.1179 0.1805 0.2504 0.2643 0.4514 0.4577 0.1486 0.1450 svo_svoa_conv_msg 0.3131 0.6181 0.2400 0.1969 0.1150 0.1776 0.2465 0.2594 0.4425 0.4489 0.1458 0.1414 sv _svo_svoa_conv_msg 0.3096 0.6124 0.2350 0.1929 0.1120 0.1729 0.2415 0.2541 0.4374 0.4437 0.1423 0.1378 Nomicsv 0.0157 0.0520 0.0115 0.0106 0.0041 0.0077 0.0120 0.0132 0.0325 0.0363 0.0044 0.0037 sv _svo 0.1245 0.3056 0.0972 0.0827 0.0416 0.0676 0.0984 0.1065 0.2030 0.2114 0.0497 0.0473 sv _svo_svoa 0.1799 0.4264 0.1459 0.1232 0.0674 0.1083 0.1497 0.1594 0.2874 0.2952 0.0806 0.0777 svoa_conv_msg 0.2959 0.5910 0.2234 0.1806 0.1088 0.1650 0.2295 0.2421 0.4197 0.4265 0.1354 0.1319 svo_svoa_conv_msg 0.2939 0.5855 0.2246 0.1803 0.1089 0.1628 0.2282 0.2400 0.4179 0.4245 0.1344 0.1299 sv _svo_svoa_conv_msg 0.2919 0.5793 0.2203 0.1781 0.1064 0.1606 0.2244 0.2344 0.4117 0.4182 0.1317 0.1263 NV-Embedsv 0.0183 0.0620 0.0137 0.0119 0.0049 0.0083 0.0137 0.0157 0.0376 0.0425 0.0050 0.0043 sv _svo 0.1282 0.3196 0.0979 0.0824 0.0418 0.0672 0.0992 0.1052 0.2114 0.2192 0.0495 0.0464 sv _svo_svoa 0.1879 0.4247 0.1403 0.1170 0.0652 0.1016 0.1442 0.1544 0.2878 0.2956 0.0769 0.0744 svoa_conv_msg 0.2339 0.4664 0.1616 0.1297 0.0754 0.1143 0.1656 0.1723 0.3326 0.3406 0.0922 0.0868 svo_svoa_conv_msg 0.2482 0.4976 0.1754 0.1419 0.0819 0.1237 0.1793 0.1870 0.3536 0.3607 0.1008 0.0955 sv _svo_svoa_conv_msg 0.2505 0.4970 0.1745 0.1416 0.0808 0.1236 0.1791 0.1871 0.3556 0.3629 0.1004 0.0953 LLM2Vecsv 0.0240 0.0686 0.0155 0.0143 0.0058 0.0104 0.0166 0.0184 0.0447 0.0496 0.0062 0.0054 sv _svo 0.1394 0.3585 0.1136 0.0969 0.0497 0.0803 0.1151 0.1247 0.2326 0.2411 0.0585 0.0562 sv _svo_svoa 0.1974 0.4622 0.1578 0.1317 0.0746 0.1171 0.1611 0.1739 0.3078 0.3153 0.0877 0.0863 svoa_conv_msg 0.3082 0.6130 0.2338 0.1909 0.1134 0.1740 0.2410 0.2538 0.4361 0.4422 0.1419 0.1381 svo_svoa_conv_msg 0.3165 0.6195 0.2378 0.1930 0.1150 0.1743 0.2439 0.2563 0.4421 0.4485 0.1443 0.1397 sv _svo_svoa_conv_msg 0.3136 0.6198 0.2387 0.1923 0.1151 0.1737 0.2428 0.2543 0.4410 0.4474 0.1436 0.1387 OpenAI-smallsv 0.0171 0.0628 0.0138 0.0123 0.0050 0.0091 0.0139 0.0158 0.0376 0.0423 0.0049 0.0043 sv _svo 0.1280 0.3376 0.1068 0.0909 0.0474 0.0757 0.1079 0.1178 0.2174 0.2263 0.0549 0.0529 sv _svo_svoa 0.1962 0.4562 0.1546 0.1308 0.0728 0.1171 0.1599 0.1711 0.3059 0.3134 0.0867 0.0845 svoa_conv_msg 0.3253 0.6190 0.2398 0.1976 0.1173 0.1813 0.2503 0.2656 0.4508 0.4573 0.1495 0.1469 svo_svoa_conv_msg 0.3242 0.6178 0.2392 0.1957 0.1166 0.1779 0.2477 0.2614 0.4485 0.4549 0.1476 0.1442 sv _svo_svoa_conv_msg 0.3208 0.6084 0.2355 0.1925 0.1150 0.1752 0.2440 0.2574 0.4448 0.4511 0.1451 0.1416 OpenAI-largesv 0.0163 0.0603 0.0135 0.0119 0.0054 0.0089 0.0136 0.0152 0.0360 0.0405 0.0049 0.0042 sv _svo 0.1325 0.3399 0.1062 0.0918 0.0472 0.0764 0.1091 0.1183 0.2218 0.2304 0.0554 0.0532 sv _svo_svoa 0.1962 0.4582 0.1545 0.1313 0.0734 0.1176 0.1604 0.1727 0.3078 0.3157 0.0864 0.0848 svoa_conv_msg 0.3508 0.6590 0.2587 0.2130 0.1271 0.1941 0.2694 0.2851 0.4798 0.4856 0.1627 0.1600 svo_svoa_conv_msg 0.3473 0.6387 0.2527 0.2071 0.1226 0.1875 0.2620 0.2756 0.4714 0.4773 0.1575 0.1536 sv _svo_svoa_conv_msg 0.3422 0.6347 0.2483 0.2031 0.1202 0.1836 0.2574 0.2705 0.4656 0.4714 0.1541 0.1502 Table 12: Results for Different Embeddings and Gemma embedding combination acc@1 acc@5 p@5 p@10 r@5 r@10 ndcg@10 ndcg@20 mrr@10 mrr@20 map@10 map@20 GTEsv 0.0374 0.1163 0.0280 0.0241 0.0113 0.0189 0.0286 0.0318 0.0728 0.0785 0.0116 0.0106 sv_svo 0.2396 0.5330 0.1864 0.1564 0.0879 0.1390 0.1920 0.2053 0.3642 0.3709 0.1065 0.1047 sv_svo_svoa 0.2816 0.5698 0.2107 0.1748 0.1015 0.1577 0.2187 0.2336 0.4033 0.4099 0.1267 0.1248 svoa_conv_msg 0.3476 0.6587 0.2638 0.2158 0.1270 0.1949 0.2716 0.2856 0.4805 0.4861 0.1643 0.1603 svo_svoa_conv_msg 0.3610 0.6621 0.2660 0.2191 0.1280 0.1972 0.2760 0.2905 0.4877 0.4933 0.1680 0.1644 sv_svo_svoa_conv_msg 0.3582 0.6695 0.2677 0.2184 0.1286 0.1966 0.2757 0.2901 0.4886 0.4947 0.1677 0.1641 Nomicsv 0.0386 0.1060 0.0262 0.0214 0.0105 0.0170 0.0265 0.0281 0.0693 0.0738 0.0111 0.0099 sv_svo 0.2508 0.5079 0.1847 0.1544 0.0849 0.1343 0.1900 0.1995 0.3603 0.3673 0.1076 0.1034 sv_svo_svoa 0.2913 0.5641 0.2163 0.1773 0.1020 0.1563 0.2216 0.2335 0.4077 0.4148 0.1306 0.1265 svoa_conv_msg 0.3416 0.6264 0.2464 0.2022 0.1199 0.1841 0.2566 0.2701 0.4637 0.4700 0.1541 0.1505 svo_svoa_conv_msg 0.3573 0.6484 0.2600 0.2121 0.1247 0.1919 0.2688 0.2817 0.4815 0.4880 0.1635 0.1590 sv_svo_svoa_conv_msg 0.3602 0.6590 0.2635 0.2138 0.1257 0.1916 0.2707 0.2839 0.4858 0.4920 0.1647 0.1603 NV-Embedsv 0.0371 0.1200 0.0297 0.0245 0.0115 0.0187 0.0287 0.0316 0.0736 0.0794 0.0116 0.0104 sv_svo 0.2616 0.5330 0.1907 0.1572 0.0878 0.1359 0.1949 0.2043 0.3750 0.3819 0.1097 0.1052 sv_svo_svoa 0.2933 0.5601 0.2089 0.1720 0.0996 0.1534 0.2168 0.2277 0.4064 0.4134 0.1259 0.1219 svoa_conv_msg 0.2465 0.4990 0.1776 0.1447 0.0827 0.1269 0.1818 0.1885 0.3520 0.3589 0.1027 0.0973 svo_svoa_conv_msg 0.2773 0.5541 0.2069 0.1702 0.0967 0.1497 0.2124 0.2210 0.3950 0.4011 0.1233 0.1182 sv_svo_svoa_conv_msg 0.2893 0.5630 0.2129 0.1739 0.0997 0.1528 0.2180 0.2266 0.4053 0.4117 0.1275 0.1217 LLM2Vecsv 0.0537 0.1557 0.0385 0.0330 0.0149 0.0251 0.0390 0.0421 0.0986 0.1051 0.0162 0.0143 sv_svo 0.2862 0.5781 0.2155 0.1781 0.1009 0.1567 0.2203 0.2332 0.4101 0.4172 0.1272 0.1237 sv_svo_svoa 0.3193 0.6073 0.2378 0.1930 0.1136 0.1725 0.2430 0.2568 0.4405 0.4470 0.1451 0.1421 svoa_conv_msg 0.3579 0.6552 0.2610 0.2124 0.1263 0.1938 0.2699 0.2833 0.4838 0.4897 0.1627 0.1584 svo_svoa_conv_msg 0.3790 0.6904 0.2817 0.2273 0.1351 0.2044 0.2880 0.2998 0.5081 0.5136 0.1769 0.1712 sv_svo_svoa_conv_msg 0.3865 0.6967 0.2860 0.2321 0.1368 0.2078 0.2936 0.3055 0.5166 0.5219 0.1807 0.1749 OpenAI-smallsv 0.0497 0.1400 0.0356 0.0294 0.0141 0.0225 0.0353 0.0378 0.0896 0.0955 0.0147 0.0131 sv_svo 0.2625 0.5410 0.1964 0.1636 0.0918 0.1453 0.2035 0.2163 0.3828 0.3904 0.1161 0.1130 sv_svo_svoa 0.3022 0.5844 0.2233 0.1841 0.1066 0.1650 0.2314 0.2451 0.4239 0.4310 0.1362 0.1333 svoa_conv_msg 0.3653 0.6550 0.2627 0.2167 0.1285 0.1986 0.2752 0.2915 0.4893 0.4956 0.1677 0.1654 svo_svoa_conv_msg 0.3807 0.6724 0.2736 0.2246 0.1333 0.2032 0.2849 0.2989 0.5054 0.5112 0.1749 0.1713 sv_svo_svoa_conv_msg 0.3893 0.6821 0.2785 0.2283 0.1347 0.2060 0.2898 0.3034 0.5147 0.5201 0.1783 0.1740 OpenAI-largesv 0.0457 0.1328 0.0324 0.0284 0.0127 0.0215 0.0334 0.0362 0.0844 0.0907 0.0137 0.0121 sv_svo 0.2594 0.5456 0.1969 0.1656 0.0909 0.1461 0.2036 0.2175 0.3824 0.3892 0.1157 0.1132 sv_svo_svoa 0.3005 0.5907 0.2210 0.1834 0.1067 0.1647 0.2296 0.2451 0.4243 0.4312 0.1338 0.1325 svoa_conv_msg 0.3965 0.6918 0.2831 0.2319 0.1394 0.2121 0.2950 0.3113 0.5223 0.5276 0.1812 0.1785 svo_svoa_conv_msg 0.3899 0.7021 0.2864 0.2331 0.1396 0.2109 0.2959 0.3117 0.5230 0.5284 0.1823 0.1792 sv_svo_svoa_conv_msg 0.3950 0.7092 0.2898 0.2357 0.1405 0.2133 0.2992 0.3136 0.5285 0.5337 0.1844 0.1803 Table 13: Results for Different Embeddings and Phi-3 14 Page 15: embedding combination acc@1 acc@5 p@5 p@10 r@5 r@10 ndcg@10 ndcg@20 mrr@10 mrr@20 map@10 map@20 GTEsv 0.0286 0.0985 0.0233 0.0201 0.0086 0.0146 0.0231 0.0245 0.0588 0.0635 0.0091 0.0078 sv_svo 0.1965 0.4693 0.1576 0.1370 0.0731 0.1198 0.1644 0.1761 0.3131 0.3213 0.0885 0.0860 sv_svo_svoa 0.2953 0.5995 0.2267 0.1896 0.1079 0.1695 0.2351 0.2478 0.4260 0.4321 0.1372 0.1339 svoa_conv_msg 0.3582 0.6655 0.2699 0.2215 0.1300 0.2004 0.2783 0.2914 0.4886 0.4943 0.1691 0.1642 svo_svoa_conv_msg 0.3767 0.6735 0.2741 0.2244 0.1321 0.2021 0.2844 0.2975 0.5026 0.5086 0.1740 0.1690 sv_svo_svoa_conv_msg 0.3728 0.6730 0.2739 0.2240 0.1313 0.2012 0.2829 0.2958 0.5006 0.5066 0.1725 0.1675 Nomicsv 0.0263 0.0857 0.0201 0.0177 0.0076 0.0131 0.0203 0.0219 0.0523 0.0564 0.0079 0.0069 sv_svo 0.2125 0.4579 0.1587 0.1323 0.0728 0.1147 0.1623 0.1718 0.3177 0.3254 0.0890 0.0850 sv_svo_svoa 0.3193 0.6013 0.2317 0.1900 0.1097 0.1674 0.2382 0.2503 0.4403 0.4466 0.1410 0.1371 svoa_conv_msg 0.3510 0.6387 0.2567 0.2093 0.1243 0.1897 0.2648 0.2776 0.4730 0.4793 0.1603 0.1558 svo_svoa_conv_msg 0.3699 0.6527 0.2679 0.2165 0.1286 0.1938 0.2749 0.2887 0.4902 0.4968 0.1684 0.1638 sv_svo_svoa_conv_msg 0.3693 0.6630 0.2699 0.2189 0.1289 0.1949 0.2770 0.2900 0.4929 0.4991 0.1695 0.1646 NV-Embedsv 0.0280 0.0954 0.0228 0.0217 0.0078 0.0149 0.0240 0.0255 0.0594 0.0643 0.0093 0.0079 sv_svo 0.2025 0.4576 0.1578 0.1308 0.0711 0.1119 0.1609 0.1701 0.3139 0.3219 0.0884 0.0844 sv_svo_svoa 0.3153 0.5930 0.2257 0.1843 0.1060 0.1624 0.2327 0.2425 0.4325 0.4393 0.1370 0.1319 svoa_conv_msg 0.2539 0.5113 0.1856 0.1479 0.0871 0.1298 0.1880 0.1961 0.3651 0.3725 0.1075 0.1020 svo_svoa_conv_msg 0.2819 0.5567 0.2097 0.1705 0.0980 0.1494 0.2141 0.2236 0.3989 0.4061 0.1254 0.1196 sv_svo_svoa_conv_msg 0.2845 0.5644 0.2145 0.1732 0.1001 0.1516 0.2175 0.2264 0.4047 0.4117 0.1274 0.1211 LLM2Vecsv 0.0371 0.1291 0.0309 0.0257 0.0115 0.0190 0.0299 0.0309 0.0764 0.0816 0.0119 0.0100 sv_svo 0.2274 0.5159 0.1826 0.1536 0.0841 0.1344 0.1868 0.1992 0.3509 0.3586 0.1046 0.1013 sv_svo_svoa 0.3433 0.6387 0.2532 0.2047 0.1192 0.1815 0.2586 0.2716 0.4700 0.4765 0.1559 0.1516 svoa_conv_msg 0.3673 0.6735 0.2694 0.2178 0.1295 0.1966 0.2765 0.2897 0.4954 0.5012 0.1670 0.1625 svo_svoa_conv_msg 0.3885 0.6924 0.2836 0.2307 0.1350 0.2070 0.2921 0.3055 0.5169 0.5223 0.1790 0.1740 sv_svo_svoa_conv_msg 0.3902 0.6978 0.2880 0.2326 0.1365 0.2084 0.2944 0.3073 0.5188 0.5244 0.1808 0.1754 OpenAI-smallsv 0.0366 0.1183 0.0275 0.0247 0.0105 0.0179 0.0283 0.0307 0.0724 0.0780 0.0110 0.0098 sv_svo 0.2179 0.4856 0.1686 0.1401 0.0784 0.1226 0.1723 0.1852 0.3290 0.3375 0.0960 0.0937 sv_svo_svoa 0.3156 0.6061 0.2359 0.1943 0.1125 0.1737 0.2427 0.2559 0.4394 0.4460 0.1439 0.1409 svoa_conv_msg 0.3802 0.6655 0.2703 0.2235 0.1310 0.2038 0.2827 0.2972 0.5012 0.5073 0.1727 0.1687 svo_svoa_conv_msg 0.3867 0.6849 0.2820 0.2286 0.1351 0.2068 0.2905 0.3056 0.5118 0.5176 0.1796 0.1758 sv_svo_svoa_conv_msg 0.3905 0.6889 0.2849 0.2296 0.1364 0.2070 0.2919 0.3067 0.5166 0.5219 0.1801 0.1764 OpenAI-largesv 0.0314 0.1125 0.0265 0.0237 0.0100 0.0169 0.0268 0.0286 0.0673 0.0728 0.0104 0.0089 sv_svo 0.2174 0.4896 0.1691 0.1459 0.0792 0.1289 0.1774 0.1922 0.3340 0.3425 0.0981 0.0972 sv_svo_svoa 0.3253 0.6244 0.2408 0.2005 0.1156 0.1788 0.2504 0.2640 0.4535 0.4595 0.1490 0.1459 svoa_conv_msg 0.3933 0.7072 0.2922 0.2381 0.1424 0.2174 0.3016 0.3165 0.5247 0.5299 0.1863 0.1828 svo_svoa_conv_msg 0.4050 0.7149 0.2982 0.2418 0.1437 0.2189 0.3068 0.3208 0.5351 0.5400 0.1903 0.1862 sv_svo_svoa_conv_msg 0.4053 0.7135 0.2971 0.2431 0.1427 0.2194 0.3072 0.3207 0.5339 0.5388 0.1904 0.1861 Table 14: Results for Different Embeddings and LLama-3 embedding combination acc@1 acc@5 p@5 p@10 r@5 r@10 ndcg@10 ndcg@20 mrr@10 mrr@20 map@10 map@20 GTEsv 0.0366 0.1077 0.0267 0.0235 0.0102 0.0172 0.0272 0.0294 0.0683 0.0739 0.0110 0.0097 sv_svo 0.2139 0.4790 0.1676 0.1433 0.0774 0.1248 0.1728 0.1864 0.3271 0.3352 0.0940 0.0920 sv_svo_svoa 0.2842 0.5753 0.2159 0.1793 0.1038 0.1617 0.2231 0.2376 0.4054 0.4122 0.1292 0.1273 svoa_conv_msg 0.3636 0.6638 0.2644 0.2168 0.1276 0.1969 0.2753 0.2887 0.4900 0.4960 0.1669 0.1623 svo_svoa_conv_msg 0.3693 0.6661 0.2692 0.2211 0.1294 0.1999 0.2801 0.2932 0.4962 0.5022 0.1712 0.1659 sv_svo_svoa_conv_msg 0.3613 0.6661 0.2688 0.2208 0.1296 0.1997 0.2786 0.2921 0.4916 0.4980 0.1696 0.1649 Nomicsv 0.0331 0.1005 0.0251 0.0206 0.0099 0.0155 0.0247 0.0260 0.0635 0.0681 0.0102 0.0088 sv_svo 0.2034 0.4682 0.1640 0.1385 0.0745 0.1193 0.1673 0.1796 0.3153 0.3232 0.0918 0.0895 sv_svo_svoa 0.2885 0.5813 0.2228 0.1818 0.1050 0.1608 0.2263 0.2376 0.4119 0.4180 0.1325 0.1286 svoa_conv_msg 0.3393 0.6361 0.2503 0.2055 0.1218 0.1882 0.2599 0.2725 0.4654 0.4709 0.1562 0.1523 svo_svoa_conv_msg 0.3562 0.6558 0.2622 0.2122 0.1270 0.1917 0.2692 0.2819 0.4809 0.4872 0.1643 0.1594 sv_svo_svoa_conv_msg 0.3619 0.6587 0.2645 0.2147 0.1276 0.1927 0.2720 0.2845 0.4859 0.4920 0.1662 0.1610 NV-Embedsv 0.0346 0.1171 0.0290 0.0252 0.0104 0.0180 0.0284 0.0298 0.0696 0.0747 0.0113 0.0097 sv_svo 0.2131 0.4784 0.1682 0.1420 0.0758 0.1216 0.1717 0.1828 0.3283 0.3361 0.0934 0.0900 sv_svo_svoa 0.2862 0.5821 0.2183 0.1787 0.1035 0.1583 0.2232 0.2333 0.4099 0.4158 0.1300 0.1251 svoa_conv_msg 0.2514 0.4996 0.1785 0.1450 0.0829 0.1273 0.1836 0.1908 0.3572 0.3644 0.1045 0.0988 svo_svoa_conv_msg 0.2768 0.5501 0.2075 0.1685 0.0969 0.1479 0.2112 0.2199 0.3929 0.3999 0.1233 0.1175 sv_svo_svoa_conv_msg 0.2825 0.5610 0.2125 0.1715 0.0992 0.1500 0.2154 0.2239 0.4008 0.4077 0.1263 0.1201 LLM2Vecsv 0.0446 0.1451 0.0363 0.0311 0.0139 0.0226 0.0358 0.0383 0.0879 0.0940 0.0145 0.0127 sv_svo 0.2371 0.5304 0.1907 0.1598 0.0880 0.1390 0.1936 0.2086 0.3581 0.3663 0.1081 0.1061 sv_svo_svoa 0.3142 0.6170 0.2371 0.1976 0.1131 0.1769 0.2459 0.2587 0.4427 0.4489 0.1452 0.1419 svoa_conv_msg 0.3599 0.6530 0.2620 0.2139 0.1268 0.1946 0.2709 0.2843 0.4858 0.4916 0.1631 0.1586 svo_svoa_conv_msg 0.3799 0.6815 0.2791 0.2266 0.1347 0.2042 0.2866 0.2992 0.5093 0.5145 0.1746 0.1693 sv_svo_svoa_conv_msg 0.3879 0.6864 0.2847 0.2299 0.1365 0.2065 0.2912 0.3035 0.5152 0.5203 0.1786 0.1728 OpenAI-smallsv 0.0380 0.1291 0.0320 0.0281 0.0121 0.0209 0.0322 0.0344 0.0787 0.0842 0.0130 0.0113 sv_svo 0.2231 0.5021 0.1774 0.1510 0.0832 0.1326 0.1841 0.1976 0.3426 0.3507 0.1023 0.1001 sv_svo_svoa 0.2942 0.5938 0.2283 0.1879 0.1093 0.1686 0.2345 0.2476 0.4220 0.4285 0.1378 0.1348 svoa_conv_msg 0.3705 0.6618 0.2674 0.2194 0.1302 0.2006 0.2783 0.2931 0.4931 0.4995 0.1700 0.1663 svo_svoa_conv_msg 0.3807 0.6778 0.2779 0.2253 0.1348 0.2050 0.2861 0.3001 0.5056 0.5112 0.1760 0.1718 sv_svo_svoa_conv_msg 0.3816 0.6824 0.2800 0.2283 0.1358 0.2069 0.2890 0.3032 0.5082 0.5136 0.1779 0.1738 OpenAI-largesv 0.0428 0.1322 0.0333 0.0293 0.0119 0.0207 0.0338 0.0356 0.0846 0.0904 0.0137 0.0116 sv_svo 0.2291 0.5124 0.1815 0.1539 0.0844 0.1356 0.1871 0.2004 0.3488 0.3558 0.1036 0.1017 sv_svo_svoa 0.3088 0.5950 0.2274 0.1889 0.1089 0.1702 0.2368 0.2510 0.4304 0.4371 0.1392 0.1367 svoa_conv_msg 0.3873 0.6935 0.2852 0.2340 0.1407 0.2139 0.2966 0.3124 0.5186 0.5242 0.1828 0.1798 svo_svoa_conv_msg 0.3965 0.7004 0.2922 0.2366 0.1422 0.2145 0.3005 0.3154 0.5255 0.5312 0.1863 0.1822 sv_svo_svoa_conv_msg 0.4045 0.7021 0.2926 0.2380 0.1422 0.2157 0.3026 0.3166 0.5305 0.5358 0.1874 0.1832 Table 15: Results for Different Embeddings and Qwen-2 15 Page 16: embedding combination acc@1 acc@5 p@5 p@10 r@5 r@10 ndcg@10 ndcg@20 mrr@10 mrr@20 map@10 map@20 GTEsv 0.0448 0.1345 0.0351 0.0292 0.0132 0.0214 0.0342 0.0363 0.0843 0.0903 0.0146 0.0126 sv_svo 0.2211 0.5024 0.1758 0.1482 0.0815 0.1291 0.1796 0.1931 0.3392 0.3475 0.0984 0.0962 sv_svo_svoa 0.2913 0.5910 0.2214 0.1826 0.1057 0.1642 0.2279 0.2424 0.4183 0.4252 0.1329 0.1302 svoa_conv_msg 0.3588 0.6495 0.2633 0.2160 0.1281 0.1967 0.2735 0.2882 0.4837 0.4899 0.1664 0.1623 svo_svoa_conv_msg 0.3676 0.6621 0.2696 0.2207 0.1307 0.1992 0.2790 0.2939 0.4925 0.4982 0.1705 0.1665 si_svo_svoa_conv_msg 0.3633 0.6641 0.2677 0.2199 0.1296 0.1980 0.2776 0.2930 0.4906 0.4964 0.1693 0.1657 Nomicsv 0.0446 0.1188 0.0326 0.0267 0.0125 0.0198 0.0320 0.0341 0.0774 0.0832 0.0142 0.0123 sv_svo 0.2336 0.4959 0.1782 0.1491 0.0826 0.1294 0.1825 0.1934 0.3454 0.3533 0.1025 0.0989 sv_svo_svoa 0.3156 0.5921 0.2268 0.1869 0.1075 0.1662 0.2347 0.2463 0.4340 0.4406 0.1391 0.1349 svoa_conv_msg 0.3482 0.6421 0.2515 0.2060 0.1222 0.1886 0.2618 0.2753 0.4720 0.4784 0.1578 0.1539 svo_svoa_conv_msg 0.3622 0.6558 0.2651 0.2170 0.1284 0.1963 0.2751 0.2871 0.4895 0.4953 0.1682 0.1631 si_svo_svoa_conv_msg 0.3696 0.6750 0.2719 0.2195 0.1307 0.1973 0.2784 0.2899 0.4975 0.5029 0.1702 0.1649 NV-Embedsv 0.0448 0.1308 0.0345 0.0298 0.0128 0.0213 0.0344 0.0362 0.0834 0.0891 0.0146 0.0126 sv_svo 0.2322 0.4944 0.1798 0.1508 0.0811 0.1278 0.1835 0.1929 0.3458 0.3538 0.1023 0.0976 sv_svo_svoa 0.3019 0.5898 0.2235 0.1805 0.1057 0.1604 0.2283 0.2405 0.4244 0.4309 0.1341 0.1301 svoa_conv_msg 0.2471 0.4996 0.1790 0.1452 0.0834 0.1272 0.1831 0.1897 0.3551 0.3625 0.1038 0.0981 svo_svoa_conv_msg 0.2825 0.5470 0.2056 0.1696 0.0951 0.1480 0.2121 0.2212 0.3965 0.4033 0.1233 0.1177 si_svo_svoa_conv_msg 0.2896 0.5590 0.2095 0.1720 0.0971 0.1496 0.2153 0.2255 0.4035 0.4104 0.1251 0.1198 LLM2Vecsv 0.0634 0.1680 0.0448 0.0381 0.0174 0.0284 0.0454 0.0484 0.1102 0.1174 0.0200 0.0175 sv_svo 0.2585 0.5490 0.2009 0.1694 0.0926 0.1464 0.2061 0.2192 0.3811 0.3891 0.1171 0.1137 sv_svo_svoa 0.3259 0.6304 0.2464 0.2024 0.1178 0.1810 0.2531 0.2667 0.4530 0.4589 0.1517 0.1485 svoa_conv_msg 0.3628 0.6590 0.2610 0.2137 0.1261 0.1940 0.2709 0.2850 0.4871 0.4932 0.1633 0.1592 svo_svoa_conv_msg 0.3813 0.6867 0.2800 0.2277 0.1344 0.2054 0.2884 0.3018 0.5101 0.5155 0.1767 0.1718 si_svo_svoa_conv_msg 0.3916 0.6915 0.2861 0.2319 0.1366 0.2081 0.2934 0.3068 0.5184 0.5234 0.1801 0.1749 OpenAI-smallsv 0.0626 0.1520 0.0404 0.0342 0.0158 0.0250 0.0413 0.0443 0.1024 0.1089 0.0183 0.0164 sv_svo 0.2445 0.5310 0.1951 0.1616 0.0898 0.1410 0.1973 0.2097 0.3633 0.3711 0.1115 0.1084 sv_svo_svoa 0.3236 0.6081 0.2329 0.1921 0.1111 0.1723 0.2419 0.2558 0.4425 0.4489 0.1432 0.1405 svoa_conv_msg 0.3656 0.6664 0.2695 0.2194 0.1314 0.2006 0.2785 0.2937 0.4928 0.4987 0.1707 0.1674 svo_svoa_conv_msg 0.3830 0.6781 0.2804 0.2286 0.1355 0.2086 0.2897 0.3035 0.5088 0.5148 0.1792 0.1744 si_svo_svoa_conv_msg 0.3873 0.6861 0.2844 0.2306 0.1371 0.2092 0.2928 0.3065 0.5144 0.5199 0.1812 0.1766 OpenAI-largesv 0.0531 0.1531 0.0402 0.0334 0.0151 0.0247 0.0394 0.0423 0.0962 0.1028 0.0170 0.0149 sv_svo 0.2539 0.5293 0.1934 0.1612 0.0897 0.1401 0.1982 0.2112 0.3720 0.3796 0.1118 0.1089 sv_svo_svoa 0.3202 0.6121 0.2374 0.1953 0.1144 0.1759 0.2455 0.2599 0.4457 0.4518 0.1456 0.1431 svoa_conv_msg 0.4013 0.6992 0.2871 0.2360 0.1414 0.2167 0.3004 0.3153 0.5286 0.5343 0.1857 0.1819 svo_svoa_conv_msg 0.4047 0.7112 0.2940 0.2381 0.1437 0.2168 0.3034 0.3180 0.5343 0.5395 0.1882 0.1844 si_svo_svoa_conv_msg 0.4085 0.7124 0.2955 0.2402 0.1439 0.2178 0.3056 0.3198 0.5362 0.5415 0.1898 0.1856 Table 16: Results for Different Embeddings and GPT-3.5-turbo embedding combination acc@1 acc@5 p@5 p@10 r@5 r@10 ndcg@10 ndcg@20 mrr@10 mrr@20 map@10 map@20 GTEsv 0.0411 0.1297 0.0316 0.0279 0.0120 0.0205 0.0320 0.0340 0.0799 0.0854 0.0129 0.0113 sv_svo 0.2228 0.5013 0.1797 0.1498 0.0829 0.1307 0.1826 0.1947 0.3431 0.3507 0.1012 0.0983 sv_svo_svoa 0.2916 0.6010 0.2268 0.1876 0.1078 0.1673 0.2327 0.2469 0.4242 0.4302 0.1351 0.1324 svoa_conv_msg 0.3628 0.6724 0.2668 0.2172 0.1282 0.1967 0.2748 0.2891 0.4918 0.4977 0.1658 0.1615 sit_svoa_conv_msg 0.3665 0.6795 0.2734 0.2219 0.1308 0.1997 0.2796 0.2944 0.4969 0.5027 0.1702 0.1659 sv_svo_svoa_conv_msg 0.3716 0.6801 0.2715 0.2234 0.1301 0.2007 0.2809 0.2943 0.4995 0.5054 0.1707 0.1659 Nomicsv 0.0394 0.1163 0.0291 0.0242 0.0107 0.0180 0.0285 0.0320 0.0726 0.0788 0.0117 0.0105 sv_svo 0.2359 0.4973 0.1787 0.1492 0.0814 0.1287 0.1828 0.1927 0.3470 0.3546 0.1028 0.0986 sv_svo_svoa 0.3096 0.6147 0.2348 0.1897 0.1105 0.1669 0.2382 0.2486 0.4370 0.4429 0.1410 0.1362 svoa_conv_msg 0.3365 0.6398 0.2514 0.2044 0.1220 0.1861 0.2587 0.2735 0.4647 0.4707 0.1553 0.1521 sit_svoa_conv_msg 0.3559 0.6561 0.2666 0.2175 0.1281 0.1948 0.2735 0.2866 0.4829 0.4886 0.1670 0.1620 sv_svo_svoa_conv_msg 0.3602 0.6590 0.2695 0.2193 0.1294 0.1955 0.2757 0.2882 0.4867 0.4928 0.1687 0.1630 NV-Embedsv 0.0408 0.1282 0.0318 0.0291 0.0115 0.0204 0.0327 0.0343 0.0805 0.0858 0.0131 0.0113 sv_svo 0.2276 0.5056 0.1826 0.1494 0.0838 0.1285 0.1831 0.1923 0.3460 0.3537 0.1024 0.0976 sv_svo_svoa 0.3079 0.5930 0.2248 0.1842 0.1084 0.1644 0.2320 0.2421 0.4306 0.4367 0.1357 0.1303 svoa_conv_msg 0.2539 0.5110 0.1821 0.1460 0.0853 0.1290 0.1853 0.1921 0.3621 0.3692 0.1049 0.0990 sit_svoa_conv_msg 0.2851 0.5701 0.2127 0.1701 0.0994 0.1494 0.2138 0.2245 0.4030 0.4103 0.1239 0.1189 sv_svo_svoa_conv_msg 0.2873 0.5810 0.2179 0.1751 0.1016 0.1531 0.2191 0.2279 0.4094 0.4162 0.1273 0.1212 LLM2Vecsv 0.0548 0.1602 0.0399 0.0365 0.0149 0.0262 0.0418 0.0455 0.1015 0.1087 0.0172 0.0153 sv_svo 0.2716 0.5658 0.2125 0.1743 0.0973 0.1506 0.2143 0.2261 0.3950 0.4021 0.1232 0.1192 sv_svo_svoa 0.3390 0.6412 0.2516 0.2077 0.1195 0.1842 0.2597 0.2719 0.4662 0.4721 0.1556 0.1510 svoa_conv_msg 0.3556 0.6684 0.2662 0.2143 0.1283 0.1945 0.2722 0.2848 0.4867 0.4928 0.1641 0.1587 sit_svoa_conv_msg 0.3887 0.6801 0.2808 0.2302 0.1345 0.2063 0.2913 0.3043 0.5164 0.5224 0.1784 0.1727 sv_svo_svoa_conv_msg 0.3927 0.6944 0.2875 0.2340 0.1369 0.2087 0.2958 0.3086 0.5228 0.5282 0.1815 0.1758 OpenAI-smallsv 0.0506 0.1514 0.0390 0.0331 0.0144 0.0239 0.0387 0.0422 0.0952 0.1019 0.0163 0.0145 sv_svo 0.2582 0.5207 0.1926 0.1612 0.0891 0.1401 0.1991 0.2108 0.3734 0.3805 0.1132 0.1095 sv_svo_svoa 0.3213 0.6233 0.2411 0.1997 0.1140 0.1777 0.2493 0.2633 0.4486 0.4549 0.1486 0.1449 svoa_conv_msg 0.3730 0.6784 0.2744 0.2207 0.1329 0.2015 0.2809 0.2954 0.5017 0.5075 0.1709 0.1674 sit_svoa_conv_msg 0.3856 0.6901 0.2842 0.2300 0.1363 0.2079 0.2916 0.3052 0.5146 0.5202 0.1795 0.1752 sv_svo_svoa_conv_msg 0.3910 0.6978 0.2888 0.2328 0.1384 0.2092 0.2952 0.3085 0.5198 0.5251 0.1824 0.1776 OpenAI-largesv 0.0494 0.1422 0.0354 0.0316 0.0133 0.0229 0.0366 0.0396 0.0914 0.0982 0.0151 0.0132 sv_svo 0.2565 0.5367 0.1942 0.1613 0.0907 0.1408 0.1991 0.2103 0.3748 0.3823 0.1128 0.1089 sv_svo_svoa 0.3276 0.6210 0.2433 0.2003 0.1164 0.1793 0.2509 0.2644 0.4543 0.4603 0.1486 0.1453 svoa_conv_msg 0.3882 0.7072 0.2908 0.2355 0.1419 0.2157 0.2983 0.3132 0.5226 0.5277 0.1834 0.1794 sit_svoa_conv_msg 0.3956 0.7121 0.2949 0.2389 0.1428 0.2160 0.3019 0.3171 0.5280 0.5330 0.1863 0.1827 sv_svo_svoa_conv_msg 0.4056 0.7129 0.2955 0.2385 0.1434 0.2155 0.3028 0.3185 0.5345 0.5395 0.1870 0.1835 Table 17: Results for Different Embeddings and Haiku 16 Page 17: embedding combination acc@1 acc@5 p@5 p@10 r@5 r@10 ndcg@10 ndcg@20 mrr@10 mrr@20 map@10 map@20 GTEmsg 0.2656 0.5278 0.1950 0.1601 0.0944 0.1462 0.2025 0.2151 0.3772 0.3843 0.1169 0.1140 conv 0.2336 0.5016 0.1629 0.1295 0.0767 0.1158 0.1667 0.1729 0.3482 0.3558 0.0879 0.0814 conv_msg 0.3156 0.6055 0.2323 0.1889 0.1118 0.1718 0.2394 0.2520 0.4380 0.4447 0.1404 0.1355 Nomicmsg 0.2294 0.4773 0.1678 0.1391 0.0835 0.1292 0.1766 0.1892 0.3358 0.3438 0.0997 0.0980 conv 0.2131 0.4730 0.1529 0.1241 0.0732 0.1122 0.1582 0.1684 0.3255 0.3345 0.0832 0.0790 conv_msg 0.2708 0.5487 0.1991 0.1631 0.0982 0.1522 0.2077 0.2221 0.3897 0.3972 0.1185 0.1165 NV-Embedmsg 0.1962 0.4170 0.1368 0.1097 0.0612 0.0924 0.1385 0.1391 0.2904 0.2968 0.0738 0.0663 conv 0.0808 0.1839 0.0459 0.0356 0.0213 0.0309 0.0480 0.0510 0.1249 0.1317 0.0221 0.0202 conv_msg 0.1571 0.3382 0.1049 0.0830 0.0474 0.0721 0.1074 0.1111 0.2346 0.2421 0.0563 0.0515 LLM2Vecmsg 0.2619 0.5467 0.1991 0.1613 0.0963 0.1455 0.2032 0.2154 0.3820 0.3899 0.1160 0.1123 conv 0.1537 0.3179 0.0933 0.0735 0.0459 0.0690 0.0989 0.1089 0.2259 0.2359 0.0499 0.0479 conv_msg 0.2825 0.5496 0.2008 0.1633 0.0985 0.1503 0.2092 0.2220 0.3988 0.4067 0.1189 0.1152 OpenAI-smallmsg 0.2599 0.5256 0.1945 0.1601 0.0972 0.1498 0.2031 0.2197 0.3724 0.3799 0.1181 0.1183 conv 0.2705 0.5470 0.1869 0.1493 0.0902 0.1359 0.1929 0.2027 0.3863 0.3942 0.1060 0.1005 conv_msg 0.3116 0.6027 0.2280 0.1887 0.1126 0.1741 0.2390 0.2543 0.4350 0.4422 0.1408 0.1382 OpenAI-largemsg 0.2893 0.5733 0.2146 0.1797 0.1075 0.1677 0.2272 0.2444 0.4123 0.4194 0.1331 0.1332 conv 0.3073 0.5901 0.2129 0.1717 0.1025 0.1551 0.2203 0.2313 0.4263 0.4343 0.1248 0.1185 conv_msg 0.3425 0.6592 0.2591 0.2103 0.1286 0.1954 0.2670 0.2839 0.4759 0.4821 0.1601 0.1578 Table 18: Baseline for conversation data retrieval performance of each embedding model Cluster Representative Samples Restaurants & Food Asks restaurant, mentions food, asks food, asks music at restaurant Movies Likes movie, asks movie, mentions movie, ac- knowledges movie Well-being & Health Asks question about well-being, inquires well- being, asks opinion Reservations Specifies number of people, confirms booking, asks booking Family & Pets Mentions family, mentions work, mentions kids, mentions dog Sports & Hobbies Asks sports, asks interest in sports, mentions foot- ball, plays games Pricing & Admission Asks fee for entrance, asks method of payment, mentions price Conversation Closure Declines help, declines offer, says goodbye, ends conversation Miscellaneous Has kids, greets morning, feels tired, watches TV Location Inquiries Asks location, asks hotel, mentions hotel, asks duration of stay Interest in Others Asks you, greets person with question, asks activ- ity, asks today Greetings Greets person, expresses gratitude, acknowledges response Music & Pets Likes music, mentions hobbies, has dog, enjoys music Recommendations Thinks better, asks for recommendations, suggests change of topic Asking Opinions Expresses opinion, states opinion, expresses frus- tration Table 19: Intent Clusters and Representative Samples from DataModel ContextMetric acc@1 ndcg@5 ndcg@10 ndcg@20 HEISIR GPT-3.5-turbo, All-embeddingsx 0.3569 0.2869 0.2691 0.2813 o 0.3683 0.2959 0.2772 0.2902 Table 20: Metric Changes with and without context to capture relational dependencies between words. Since HEISIR avoids the use of pronouns to en- sure semantic completeness, this lack of context makes it difficult to extract accurate information. Consequently, extracting SVOA quadruplets from context-less messages not only reduces the preci- sion of the semantic indices but also results in fewer indices being generated. For future applications be- yond retrieval tasks, incorporating previous context will be necessary. F Dataset License and Disclaimer The datasets used in this study are subject to the fol- lowing licenses: BAD (Xu et al., 2021) and DICES (Aroyo et al., 2024) (CC BY 4.0), DailyDialog (Li et al., 2017) (CC BY-NC-SA 4.0), PILD (Xu et al., 2020) (MIT license), USS (Sun et al., 2021) (in- dividual licenses), ReDial (CC-BY-4.0), MWOZ (MIT License), and SGD (CC-BY-SA-4.0). We strictly adhered to these licenses and confirm that no commercial use was made of any dataset in this research. While BAD and DICES datasets contain some inherently harmful content, we used these datasets solely as search targets and emphasize that our paper does not include any harmful content in its presented material. 17

---