loader
Generating audio...
Extracting PDF content...

arxiv

Paper 2503.12062

Genicious: Contextual Few-shot Prompting for Insights Discovery

Authors: Vineet Kumar, Ronald Tony, Darshita Rathore, Vipasha Rana, Bhuvanesh Mandora, Kanishka, Chetna Bansal, Anindya Moitra

Published: 2025-03-15

Abstract:

Data and insights discovery is critical for decision-making in modern organizations. We present Genicious, an LLM-aided interface that enables users to interact with tabular datasets and ask complex queries in natural language. By benchmarking various prompting strategies and language models, we have developed an end-to-end tool that leverages contextual few-shot prompting, achieving superior performance in terms of latency, accuracy, and scalability. Genicious empowers stakeholders to explore, analyze and visualize their datasets efficiently while ensuring data security through role-based access control and a Text-to-SQL approach.

Paper Content: on Alphaxiv
Page 1: Genicious: Contextual Few-shot Prompting for Insights Discovery Vineet Kumar PayPal Bengaluru, India vkumar32@paypal.comRonald Tony PayPal Bengaluru, India rtony@paypal.comDarshita Rathore PayPal Bengaluru, India drathore@paypal.com Vipasha Rana PayPal Bengaluru, India viprana@paypal.comBhuvanesh Mandora PayPal Bengaluru, India bmandora@paypal.comKanishka PayPal Bengaluru, India kdhar@paypal.com Chetna Bansal PayPal Bengaluru, India cbansal@paypal.comAnindya Moitra PayPal Bengaluru, India amoitra@paypal.com Abstract Data and insights discovery is critical for decision-making in mod- ern organizations. We present Genicious, an LLM-aided interface that enables users to interact with tabular datasets and ask complex queries in natural language. By benchmarking various prompting strategies and language models, we have developed an end-to-end tool that leverages contextual few-shot prompting, achieving su- perior performance in terms of latency, accuracy, and scalability. Genicious empowers stakeholders to explore, analyze and visual- ize their datasets efficiently while ensuring data security through role-based access control and a Text-to-SQL approach. CCS Concepts •Computing methodologies →Natural language processing ; Natural language processing ;•Information systems →Data analytics ;•Applied computing →Enterprise computing. ACM Reference Format: Vineet Kumar, Ronald Tony, Darshita Rathore, Vipasha Rana, Bhuvanesh Mandora, Kanishka, Chetna Bansal, and Anindya Moitra. 2024. Genicious: Contextual Few-shot Prompting for Insights Discovery. In 8th International Conference on Data Science and Management of Data (12th ACM IKDD CODS and 30th COMAD) (CODS-COMAD Dec ’24), December 18–21, 2024, Jodhpur, India. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3703323. 3704274 1 Introduction In today’s digital age, data has become an essential resource. How- ever, the sheer volume and growing complexity of data present significant challenges in querying and exploration, even for sea- soned professionals. While leaders and senior leadership require quick access to insights about key performance indicators (KPIs), Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). CODS-COMAD Dec ’24, December 18–21, 2024, Jodhpur, India ©2024 Copyright held by the owner/author(s). ACM ISBN 979-8-4007-1124-4/24/12 https://doi.org/10.1145/3703323.3704274traditional methods of data presentation, such as dashboards, of- ten fall short of providing comprehensive, flexible analysis. These pre-designed templates, though useful, constrain users to a limited set of predefined queries and visualizations. This rigidity becomes a significant barrier when decision-makers need to explore data dynamically, ask ad-hoc questions, or investigate nuanced aspects of their business that don’t conform to existing templates. Text-to-SQL interfaces address this limitation by allowing users to pose questions in natural language, breaking free from the con- straints of pre-built dashboards. This capability empowers leaders and team members alike to query data spontaneously, uncovering deeper insights and responding to emerging trends or challenges inreal-time . By bridging the gap between complex database struc- tures and intuitive human inquiry, Text-to-SQL systems not only enhance data accessibility but also foster a culture of data-driven decision-making across all levels of an organization, truly unlocking the power of data analytics. However, Text-to-SQL is a longstanding problem in the do- main of Natural Language Processing. Formally, given a natural language query(𝑄)and its corresponding database schema (𝑆), which presents data and attribute type information, a Text-to- SQL system aims to generate a valid and executable SQL query (𝑦), which when executed on the database; will return results that match the user’s original intent. One of the primary reasons for framing insights discovery as aText-to-SQL process is the need to maintain data confidential- ity. In many cases, the datasets being queried contain sensitive or proprietary information that cannot be shared with external lan- guage models (LLMs). By utilizing a Text-to-SQL approach, the actual data remains secure and undisclosed, as only the schema or metadata of the database is exposed. Text-to-SQL acts as a middle ground where the LLM only needs access to the database structure—not the data itself—to generate SQL queries. These SQL queries can then be executed internally to retrieve insights, ensuring that the raw data never leaves the secure environment. This approach allows users to ask questions and gain insights about specific datasets while maintaining strict data protection and compliance with privacy policies.arXiv:2503.12062v1 [cs.IR] 15 Mar 2025 Page 2: CODS-COMAD Dec ’24, December 18–21, 2024, Jodhpur, India Kumar et al. Recently, the evolution of large language models (LLMs) has reshaped the landscape. Models such as GPT and Llama show ex- ceptional abilities and achieve very good results in various tasks with minimal fine-tuning. These models are particularly effective in few-shot and zero-shot learning contexts, enabling them to tackle complex tasks with little or no additional training data. This ad- vancement has expanded the potential for applying LLMs in more sophisticated and diverse applications for tabular data. Though LLMs have demonstrated impressive capabilities, they continue to struggle with limitations, particularly in handling domain- specific or knowledge-intensive tasks. These challenges often mani- fest as ‘hallucinations’ where the models generate incorrect or mis- leading information. To address these issues, Retrieval-Augmented Generation (RAG) has been developed as a solution that strengthens LLMs by retrieving relevant information from external knowledge sources through semantic similarity matching. By incorporating this external data, RAG significantly reduces the occurrence of factual inaccuracies in generated content. To address the aforementioned challenges and cater to a diverse user base - from SQL novices to experts across various business functions - we have developed Genicious1, an innovative Gen AI- powered system for data querying and insights generation. Geni- cious leverages the text-generative capabilities of LLMs through a novel approach to bridge the gap between natural language commu- nication and complex database interactions. This system is designed to democratize data access, enabling users of all skill levels to ex- tract valuable insights from relational databases through intuitive, conversational interfaces. Here are the key contributions of our paper — •We conducted a comprehensive benchmarking of several widely used open-source and proprietary LLMs for SQL gen- eration, evaluating their performance on both public and our domain-specific datasets. •We explored various prompting strategies aimed at enhanc- ing the efficiency and accuracy of SQL code generation. •Drawing on experimental evidence, we implemented a novel Contextual Few-shot Prompting approach by integrating RAG to dynamically adapt few-shot examples based on user queries, thereby enhancing SQL generation performance. The rest of this paper is organized as follows – In Section 2, we briefly discuss related works. Section 3 talks about various experiments we conducted and the rationale for the design choices we made. Section 4 presents the system descriptions. Finally, Section 5 concludes the paper with a few possible directions for future improvement & research. 2 Related Works This field has witnessed significant advancements in the last decade, evolving from rule-based complex feature engineering systems to very recent large language models (LLMs). Hong et al. [ 7] and Shi et al. [ 13] provide an excellent survey of methods in this direction. Early Text-to-SQL systems were predominantly rule-based [10][11] and often involved extensive feature engineering. While these systems demonstrated great performance in certain domains 1Demo is available here.(for which they were trained ) they lacked the generalization capa- bility. With advancements in Deep Learning, Seq2Seq models have shown promising results in the domain. The appeal of these models [14][16] lies in their ability to transform natural language inputs into corresponding SQL outputs directly. This end-to-end approach significantly streamlines the process, eliminating the need for ex- tensive feature engineering and the creation of complex rule-based systems. Remarkable text generation capabilities of LLMs[ 2] have shown very promising results in this domain [ 4][12]. Broadly there are two main approaches to use LLMs for SQL generation – prompt engi- neering [ 3] and fine-tuning [ 6]. To rigorously evaluate the accuracy of these systems, researchers have developed several benchmark datasets. Among these, Spider [ 17] described as a large-scale com- plex and cross-domain semantic parsing and Text-to-SQL dataset – has become a standard for measuring progress in the field. Cur- rently, leading approaches on the Spider leaderboard are C3[ 4], DIN-SQL[12], and DAIL-SQL[6]. 3 Experiments & Design Choices While developing our Natural Language-based Insights Discovery tool, we faced three crucial decisions that significantly impacted the system’s performance and usability. These key aspects were – •Selection of an appropriate Large Language Model (LLM) •Design of a scalable & accurate prompting strategy •Choice of evaluation criteria for model-generated outputs Each of these decisions required careful consideration of various options, balancing factors such as performance, cost-effectiveness, and scalability. In this section, we will elaborate on the alternatives we explored for each aspect, the choices we ultimately made, and the rationale behind our decisions. 3.1 Problem Formulation Given a natural language query 𝑄and its corresponding database schema 𝑆, (which presents data type and attribute type information), aText-to-SQL system (that utilizes LLMs) aims to generate a valid and executable SQL query 𝑦, when query and schema are fed into LLM in a prompt template T(𝑄, 𝑆). Formally, the auto-regressive generation process of SQL query 𝑦can be formulated as follows: Pr(𝑦|T(𝑄, 𝑆))=|𝑦|Ö 𝑖=1Pr(𝑦𝑖|T(𝑄, 𝑆),𝑦<𝑖) 3.2 Choice of LLMs We chose both popular open-source as well as proprietary LLMs for our experiments. For open-source LLMs, a thorough review by the legal team is done to ensure compliance, and upon approval, the LLMs are added to the repository and made available for use. Due to legal constraints, only LLMs that have been approved through this process are included. Specifically, we selected versions of Llama [15], GPT [ 1], and Mistral [ 8] for our analysis. More details about models and their specifications of parameter size and model type (whether they are foundation model or assistant model) is described in Table 1. All open-source models are full-precision variants. Page 3: Genicious: Contextual Few-shot Prompting for Insights Discovery CODS-COMAD Dec ’24, December 18–21, 2024, Jodhpur, India Model Size Type Llama 2 13 B Chat Mistral 7 B Instruct Llama 3 8 B Instruct Llama 3.1 8 B Instruct GPT 3.5 Turbo 175 B Chat GPT 4 1.8 T Chat Table 1: Candidate LLMs for evaluation 3.3 Prompt Engineering Prompt engineering constitutes a strategic combination of infor- mation such as instructions, questions and the addition of table information, aimed at enhancing the performance of LLMs to gen- erate expected output. We explored advanced prompt engineering techniques that enable us to tackle complex tasks while enhancing the LLM’s reliability using basic instruction, supplementary knowl- edge and ensembling a few techniques as follows: Reasoning: •Prompt Chaining: Involves linking multiple prompts together, where the output of one prompt becomes the input for the next, to guide a model through complex tasks. •Self-Consistency: Involves generating multiple responses to the same prompt and selecting the most coherent answer to improve reliability. •Cross-Consistency: Involves comparing responses generated from different models to identify and select the most consis- tent or accurate answer. Implementation of LLM-based Text-to-SQL systems often relies on In-context learning [ 2]. ICL offers Language models to learn tasks given the instructions and a few examples in the form of demonstrations. •Zero-shot Prompting: Model tries to generate SQL queries without any additional examples. •Few-shot Prompting: Model is given a few examples to serve as a foundation for guiding it to generate responses for better performance. However, LLMs often encounter limitations in generating ac- curate output in cases where access to domain-specific data and specific private information is required. Unlike publicly accessible sources, proprietary data is often stored in relational SQL databases, PDFs, Confluence pages, slide decks, etc. Fine-tuning LLMs, which requires customizing the model using new datasets for a particular task, comes with challenges such as high training costs, power- ful system configuration, sending data to the model, and limited transparency in the model’s decision-making. Hence, there is a need for an approach which relies on learning through contextual information, additional relevant data and enhancement through Retrieval-Augmented Generation (RAG) [ 5] to enrich the context leading to more accurate and contextually appropriate outputs. Using RAG we can make the demonstrations of few-shot prompt- ing more contextual. Based on user query we can retrieve the most relevant and contextual examples to be fed as a system demonstra- tion for SQL generation. For few-shot learning, to determine the optimal number of ex- amples(𝑘)to include in the prompt, we conducted a brief analysis by evaluating accuracy across different values of (𝑘). Our findings Figure 1: Number of Examples (𝑘)in Context vs. Accuracy indicated that beyond 𝑘=4, the improvement in accuracy is not prominent. Additionally, increasing the number of tokens can incur higher computational costs for some models. Therefore, we selected 𝑘=4as the optimal value, as illustrated in Figure 1. 3.4 Evaluation of Different LLMs & Prompting Strategies To evaluate the performance of the various LLMs and prompting strategies described in Section 3.2 and 3.3, we conducted a bench- marking experiment. For this experiment, we utilized a very simple prompt template– <prompt template> {System Instructions} {Database Schema} {Optional Demonstrations} </prompt template> To conduct our experiments we used the Spider [ 17] dataset and evaluated the accuracy of LLM generated model using the following two criteria (both of these methods are from the Spider evaluation framework) (1)Exact Match of generated SQL: We decompose each SQL into several clauses, and conduct a set comparison in each SQL clause (2)Execution Accuracy: We compare the values of the predicted SQL queries with those of the ground truth SQL query 0%20%40%60%80%100% Llama 2MistralLlama 3Llama 3.1GPT 3.5GPT 4EasyMediumHardExtra Hard Figure 2: Performance of LLMs wrt. difficulty level on Spider [17] test set. We found that both of these metrics are highly correlated. There- fore, we are only reporting Execution Accuracy in Table 2. Few-Shot Prompting consistently outperforms Zero-Shot, and Contextual Few-Shot outperforms both methods. We also experimented with self-consistency; however, due to the increased latency and minimal accuracy gains from querying the model multiple times, we chose Page 4: CODS-COMAD Dec ’24, December 18–21, 2024, Jodhpur, India Kumar et al. Contextual Few-Shot Prompting as our preferred strategy. Figure 2 illustrates the comparative performance of LLMs with respect to the difficulty of questions. GPT is performing better for hard questions. Among all competing models, Llama 3.1 proved to be the most efficient and accurate in our experiments. However, when tested on our domain-specific dataset, GPT-3.5 Turbo emerged as the top- performing model. Hence we chose GPT-3.5 Turbo for developing Genicious. Strategy Llama 2 Mistral Llama 3 Llama 3.1 GPT 3.5 T GPT 4 ZS 5.0% 36.0% 27.0% 41.0% 37.0% 28.0% FS 51.0% 51.0% 52.0% 53.0% 53.0% 52.0% CFS 68.0% 70.0% 69.0% 71.0% 65.0% 55.0% SC 26.0% 40.0% 51.0% 53.0% 53.0% 47.0% CFS w/ SC 70.3% 66.7% 70.0% 70.3% 60.0% 56.7% Table 2: Evaluation of different LLMs & Prompting Strategies on Spider [ 17] test set. ZS: Zero Shot, FS: Few Shot (𝑘=4), CFS: Contextual Few Shot (𝑘=4), SC: Self Consistency (𝑛=5), CFS w/ SC: Contextual Few Shot w/ Self Consistency (𝑘=4, 𝑛=5). 4 System Description Genicious is a Java Spring Boot–based application with a React.js user interface, allowing users to interact and communicate through REST APIs for querying various datasets by leveraging advanced natural language processing techniques. The architecture of the Genicious tool is shown in Figure 3. There are two broad compo- nents in the tool – one is an offline phase (shaded in blue) and the other is a query phase. Offline Phase: Onboarding and Vector store Preparation. During the onboarding of a new dataset, a list of candidate questions and their corresponding SQLs are gathered. These questions should ex- haustively cover various themes of the domain. For example, when using the tool to answer questions involving year-over-year com- parisons, a similar question and its corresponding SQL query can be included in the candidate question pool. Once we have the question pool ready we use an embedding model to convert these questions into numerical embeddings. We are using text-embedding-ada-002 model. These embeddings, which are numerical representations capturing the semantic essence of the questions, are stored in a vector database (Milvus), facilitating efficient retrieval during the query phase. As new datasets/projects are introduced, the system is incremen- tally updated by adding corresponding embeddings of the question pool into the vector store. This dynamic approach ensures that the vector store remains current and relevant, continuously improv- ing the accuracy and efficiency of the similarity search and query generation processes. By integrating these advanced techniques and maintaining an adaptive vector store, our system provides a robust and scalable solution for querying diverse projects and generating precise re- sponses. Query Phase: Input Processing and Query Generation. Now, when a user submits a query 𝑄, the system performs the following steps: Figure 3: Tool Architecture (1)Embedding Generation: The user’s input question (𝑄)is converted into a numerical embedding (𝑒)using the text em- bedding model text-embedding-ada-002 . This embedding is then used to retrieve SQL examples that are relevant as part of the Retrieval Augmented Generation (RAG) framework. (2)Similarity Search: Using FAISS (Facebook AI Similarity Search) [ 9], we perform a similarity search with the embed- ding generated in the previous step (𝑒)against pre-stored question embeddings in the vector database. FAISS efficiently ranks the questions based on their proximity to the user’s question. (3)Contextual Examples Retrieval: This step retrieves and presents the most relevant textual questions from the ques- tion pool, mapped back from their numerical embeddings. These retrieved questions are used to generate contextual- ized examples where the few-shot demonstrations are dy- namic (adapting to the input question) and relevant. (4)Prompt Building: We maintain an instruction database that stores prompt templates for each project/dataset. For every user query, an appropriate instruction prompt is fetched from this database and appended with the original user query and retrieved contextual few-shot examples. This context-aware, dynamic few-shot instruction is then sent to the LLMs. (5)LLM Query Generation: The LLM formulates a BigQuery (BQ) query tailored to effectively address the user’s question. (6)SQL Sanitization: The LLM-generated SQL undergoes a sanitization step, which involves scanning the SQL for any harmful keywords (e.g., DROP, ALTER, UPDATE , etc.). (7)Response Delivery: If the SQL is deemed safe, the query is executed on the relational database, and the results are returned via REST API calls, ensuring a seamless and respon- sive interaction. Finally, the results are presented to the user using tables and visualizations. 5 Adoption and Future Steps The P95 latency for a single query response is around 6 seconds. The tool efficiently handles simple to intermediate queries on flat tables, retrieves the latest metrics, performs basic calculations, and investigates trends. We are actively working on adding new fea- tures in future version such as making the bot truly conversational with scoped interactions, and utilizing an agentic framework while ensuring robust data protection, among others. Page 5: Genicious: Contextual Few-shot Prompting for Insights Discovery CODS-COMAD Dec ’24, December 18–21, 2024, Jodhpur, India References [1]Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al .2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023). [2]Tom B Brown. 2020. Language models are few-shot learners. arXiv preprint ArXiv:2005.14165 (2020). [3]Shuaichen Chang and Eric Fosler-Lussier. 2023. How to prompt llms for text- to-sql: A study in zero-shot, single-domain, and cross-domain settings. arXiv preprint arXiv:2305.11853 (2023). [4]Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Jinshu Lin, Dongfang Lou, et al .2023. C3: Zero-shot text-to-sql with chatgpt. arXiv preprint arXiv:2307.07306 (2023). [5]Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining . 6491–6501. [6]Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jin- gren Zhou. 2023. Text-to-sql empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363 (2023). [7]Zijin Hong, Zheng Yuan, Qinggang Zhang, Hao Chen, Junnan Dong, Feiran Huang, and Xiao Huang. 2024. Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL. arXiv preprint arXiv:2406.08426 (2024). [8]Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, De- vendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al .2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).[9]Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535–547. [10] Fei Li and Hosagrahar V Jagadish. 2014. Constructing an interactive natural language interface for relational databases. Proceedings of the VLDB Endowment 8, 1 (2014), 73–84. [11] Tanzim Mahmud, KM Azharul Hasan, Mahtab Ahmed, and Thwoi Hla Ching Chak. 2015. A rule based approach for NLP based query processing. In 2015 2nd international conference on electrical information and communication technologies (EICT) . IEEE, 78–82. [12] Mohammadreza Pourreza and Davood Rafiei. 2024. Din-sql: Decomposed in- context learning of text-to-sql with self-correction. Advances in Neural Informa- tion Processing Systems 36 (2024). [13] Liang Shi, Zhengju Tang, and Zhi Yang. 2024. A Survey on Employing Large Language Models for Text-to-SQL Tasks. arXiv preprint arXiv:2407.15186 (2024). [14] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems 27 (2014). [15] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al .2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023). [16] Xiaojun Xu, Chang Liu, and Dawn Song. 2017. Sqlnet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436 (2017). [17] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al .2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887 (2018).

---