Authors: Vineet Kumar, Ronald Tony, Darshita Rathore, Vipasha Rana, Bhuvanesh Mandora, Kanishka, Chetna Bansal, Anindya Moitra
Page 1:
Genicious: Contextual Few-shot Prompting for Insights Discovery
Vineet Kumar
PayPal
Bengaluru, India
vkumar32@paypal.comRonald Tony
PayPal
Bengaluru, India
rtony@paypal.comDarshita Rathore
PayPal
Bengaluru, India
drathore@paypal.com
Vipasha Rana
PayPal
Bengaluru, India
viprana@paypal.comBhuvanesh Mandora
PayPal
Bengaluru, India
bmandora@paypal.comKanishka
PayPal
Bengaluru, India
kdhar@paypal.com
Chetna Bansal
PayPal
Bengaluru, India
cbansal@paypal.comAnindya Moitra
PayPal
Bengaluru, India
amoitra@paypal.com
Abstract
Data and insights discovery is critical for decision-making in mod-
ern organizations. We present Genicious, an LLM-aided interface
that enables users to interact with tabular datasets and ask complex
queries in natural language. By benchmarking various prompting
strategies and language models, we have developed an end-to-end
tool that leverages contextual few-shot prompting, achieving su-
perior performance in terms of latency, accuracy, and scalability.
Genicious empowers stakeholders to explore, analyze and visual-
ize their datasets efficiently while ensuring data security through
role-based access control and a Text-to-SQL approach.
CCS Concepts
•Computing methodologies →Natural language processing ;
Natural language processing ;•Information systems →Data
analytics ;•Applied computing →Enterprise computing.
ACM Reference Format:
Vineet Kumar, Ronald Tony, Darshita Rathore, Vipasha Rana, Bhuvanesh
Mandora, Kanishka, Chetna Bansal, and Anindya Moitra. 2024. Genicious:
Contextual Few-shot Prompting for Insights Discovery. In 8th International
Conference on Data Science and Management of Data (12th ACM IKDD CODS
and 30th COMAD) (CODS-COMAD Dec ’24), December 18–21, 2024, Jodhpur,
India. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3703323.
3704274
1 Introduction
In today’s digital age, data has become an essential resource. How-
ever, the sheer volume and growing complexity of data present
significant challenges in querying and exploration, even for sea-
soned professionals. While leaders and senior leadership require
quick access to insights about key performance indicators (KPIs),
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).
CODS-COMAD Dec ’24, December 18–21, 2024, Jodhpur, India
©2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-1124-4/24/12
https://doi.org/10.1145/3703323.3704274traditional methods of data presentation, such as dashboards, of-
ten fall short of providing comprehensive, flexible analysis. These
pre-designed templates, though useful, constrain users to a limited
set of predefined queries and visualizations. This rigidity becomes
a significant barrier when decision-makers need to explore data
dynamically, ask ad-hoc questions, or investigate nuanced aspects
of their business that don’t conform to existing templates.
Text-to-SQL interfaces address this limitation by allowing users
to pose questions in natural language, breaking free from the con-
straints of pre-built dashboards. This capability empowers leaders
and team members alike to query data spontaneously, uncovering
deeper insights and responding to emerging trends or challenges
inreal-time . By bridging the gap between complex database struc-
tures and intuitive human inquiry, Text-to-SQL systems not only
enhance data accessibility but also foster a culture of data-driven
decision-making across all levels of an organization, truly unlocking
the power of data analytics.
However, Text-to-SQL is a longstanding problem in the do-
main of Natural Language Processing. Formally, given a natural
language query(𝑄)and its corresponding database schema (𝑆),
which presents data and attribute type information, a Text-to-
SQL system aims to generate a valid and executable SQL query
(𝑦), which when executed on the database; will return results that
match the user’s original intent.
One of the primary reasons for framing insights discovery as
aText-to-SQL process is the need to maintain data confidential-
ity. In many cases, the datasets being queried contain sensitive or
proprietary information that cannot be shared with external lan-
guage models (LLMs). By utilizing a Text-to-SQL approach, the
actual data remains secure and undisclosed, as only the schema or
metadata of the database is exposed.
Text-to-SQL acts as a middle ground where the LLM only needs
access to the database structure—not the data itself—to generate
SQL queries. These SQL queries can then be executed internally
to retrieve insights, ensuring that the raw data never leaves the
secure environment. This approach allows users to ask questions
and gain insights about specific datasets while maintaining strict
data protection and compliance with privacy policies.arXiv:2503.12062v1 [cs.IR] 15 Mar 2025
Page 2:
CODS-COMAD Dec ’24, December 18–21, 2024, Jodhpur, India Kumar et al.
Recently, the evolution of large language models (LLMs) has
reshaped the landscape. Models such as GPT and Llama show ex-
ceptional abilities and achieve very good results in various tasks
with minimal fine-tuning. These models are particularly effective in
few-shot and zero-shot learning contexts, enabling them to tackle
complex tasks with little or no additional training data. This ad-
vancement has expanded the potential for applying LLMs in more
sophisticated and diverse applications for tabular data.
Though LLMs have demonstrated impressive capabilities, they
continue to struggle with limitations, particularly in handling domain-
specific or knowledge-intensive tasks. These challenges often mani-
fest as ‘hallucinations’ where the models generate incorrect or mis-
leading information. To address these issues, Retrieval-Augmented
Generation (RAG) has been developed as a solution that strengthens
LLMs by retrieving relevant information from external knowledge
sources through semantic similarity matching. By incorporating
this external data, RAG significantly reduces the occurrence of
factual inaccuracies in generated content.
To address the aforementioned challenges and cater to a diverse
user base - from SQL novices to experts across various business
functions - we have developed Genicious1, an innovative Gen AI-
powered system for data querying and insights generation. Geni-
cious leverages the text-generative capabilities of LLMs through a
novel approach to bridge the gap between natural language commu-
nication and complex database interactions. This system is designed
to democratize data access, enabling users of all skill levels to ex-
tract valuable insights from relational databases through intuitive,
conversational interfaces.
Here are the key contributions of our paper —
•We conducted a comprehensive benchmarking of several
widely used open-source and proprietary LLMs for SQL gen-
eration, evaluating their performance on both public and our
domain-specific datasets.
•We explored various prompting strategies aimed at enhanc-
ing the efficiency and accuracy of SQL code generation.
•Drawing on experimental evidence, we implemented a novel
Contextual Few-shot Prompting approach by integrating
RAG to dynamically adapt few-shot examples based on user
queries, thereby enhancing SQL generation performance.
The rest of this paper is organized as follows – In Section 2,
we briefly discuss related works. Section 3 talks about various
experiments we conducted and the rationale for the design choices
we made. Section 4 presents the system descriptions. Finally, Section
5 concludes the paper with a few possible directions for future
improvement & research.
2 Related Works
This field has witnessed significant advancements in the last decade,
evolving from rule-based complex feature engineering systems to
very recent large language models (LLMs). Hong et al. [ 7] and Shi
et al. [ 13] provide an excellent survey of methods in this direction.
Early Text-to-SQL systems were predominantly rule-based
[10][11] and often involved extensive feature engineering. While
these systems demonstrated great performance in certain domains
1Demo is available here.(for which they were trained ) they lacked the generalization capa-
bility.
With advancements in Deep Learning, Seq2Seq models have
shown promising results in the domain. The appeal of these models
[14][16] lies in their ability to transform natural language inputs
into corresponding SQL outputs directly. This end-to-end approach
significantly streamlines the process, eliminating the need for ex-
tensive feature engineering and the creation of complex rule-based
systems.
Remarkable text generation capabilities of LLMs[ 2] have shown
very promising results in this domain [ 4][12]. Broadly there are two
main approaches to use LLMs for SQL generation – prompt engi-
neering [ 3] and fine-tuning [ 6]. To rigorously evaluate the accuracy
of these systems, researchers have developed several benchmark
datasets. Among these, Spider [ 17] described as a large-scale com-
plex and cross-domain semantic parsing and Text-to-SQL dataset
– has become a standard for measuring progress in the field. Cur-
rently, leading approaches on the Spider leaderboard are C3[ 4],
DIN-SQL[12], and DAIL-SQL[6].
3 Experiments & Design Choices
While developing our Natural Language-based Insights Discovery
tool, we faced three crucial decisions that significantly impacted
the system’s performance and usability. These key aspects were –
•Selection of an appropriate Large Language Model (LLM)
•Design of a scalable & accurate prompting strategy
•Choice of evaluation criteria for model-generated outputs
Each of these decisions required careful consideration of various
options, balancing factors such as performance, cost-effectiveness,
and scalability. In this section, we will elaborate on the alternatives
we explored for each aspect, the choices we ultimately made, and
the rationale behind our decisions.
3.1 Problem Formulation
Given a natural language query 𝑄and its corresponding database
schema 𝑆, (which presents data type and attribute type information),
aText-to-SQL system (that utilizes LLMs) aims to generate a valid
and executable SQL query 𝑦, when query and schema are fed into
LLM in a prompt template T(𝑄, 𝑆). Formally, the auto-regressive
generation process of SQL query 𝑦can be formulated as follows:
Pr(𝑦|T(𝑄, 𝑆))=|𝑦|Ö
𝑖=1Pr(𝑦𝑖|T(𝑄, 𝑆),𝑦<𝑖)
3.2 Choice of LLMs
We chose both popular open-source as well as proprietary LLMs for
our experiments. For open-source LLMs, a thorough review by the
legal team is done to ensure compliance, and upon approval, the
LLMs are added to the repository and made available for use. Due to
legal constraints, only LLMs that have been approved through this
process are included. Specifically, we selected versions of Llama
[15], GPT [ 1], and Mistral [ 8] for our analysis. More details about
models and their specifications of parameter size and model type
(whether they are foundation model or assistant model) is described
in Table 1. All open-source models are full-precision variants.
Page 3:
Genicious: Contextual Few-shot Prompting for Insights Discovery CODS-COMAD Dec ’24, December 18–21, 2024, Jodhpur, India
Model Size Type
Llama 2 13 B Chat
Mistral 7 B Instruct
Llama 3 8 B Instruct
Llama 3.1 8 B Instruct
GPT 3.5 Turbo 175 B Chat
GPT 4 1.8 T Chat
Table 1: Candidate LLMs for evaluation
3.3 Prompt Engineering
Prompt engineering constitutes a strategic combination of infor-
mation such as instructions, questions and the addition of table
information, aimed at enhancing the performance of LLMs to gen-
erate expected output. We explored advanced prompt engineering
techniques that enable us to tackle complex tasks while enhancing
the LLM’s reliability using basic instruction, supplementary knowl-
edge and ensembling a few techniques as follows:
Reasoning:
•Prompt Chaining: Involves linking multiple prompts together,
where the output of one prompt becomes the input for the
next, to guide a model through complex tasks.
•Self-Consistency: Involves generating multiple responses to
the same prompt and selecting the most coherent answer to
improve reliability.
•Cross-Consistency: Involves comparing responses generated
from different models to identify and select the most consis-
tent or accurate answer.
Implementation of LLM-based Text-to-SQL systems often relies
on In-context learning [ 2]. ICL offers Language models to learn
tasks given the instructions and a few examples in the form of
demonstrations.
•Zero-shot Prompting: Model tries to generate SQL queries
without any additional examples.
•Few-shot Prompting: Model is given a few examples to serve
as a foundation for guiding it to generate responses for better
performance.
However, LLMs often encounter limitations in generating ac-
curate output in cases where access to domain-specific data and
specific private information is required. Unlike publicly accessible
sources, proprietary data is often stored in relational SQL databases,
PDFs, Confluence pages, slide decks, etc. Fine-tuning LLMs, which
requires customizing the model using new datasets for a particular
task, comes with challenges such as high training costs, power-
ful system configuration, sending data to the model, and limited
transparency in the model’s decision-making. Hence, there is a
need for an approach which relies on learning through contextual
information, additional relevant data and enhancement through
Retrieval-Augmented Generation (RAG) [ 5] to enrich the context
leading to more accurate and contextually appropriate outputs.
Using RAG we can make the demonstrations of few-shot prompt-
ing more contextual. Based on user query we can retrieve the most
relevant and contextual examples to be fed as a system demonstra-
tion for SQL generation.
For few-shot learning, to determine the optimal number of ex-
amples(𝑘)to include in the prompt, we conducted a brief analysis
by evaluating accuracy across different values of (𝑘). Our findings
Figure 1: Number of Examples (𝑘)in Context vs. Accuracy
indicated that beyond 𝑘=4, the improvement in accuracy is not
prominent. Additionally, increasing the number of tokens can incur
higher computational costs for some models. Therefore, we selected
𝑘=4as the optimal value, as illustrated in Figure 1.
3.4 Evaluation of Different LLMs & Prompting
Strategies
To evaluate the performance of the various LLMs and prompting
strategies described in Section 3.2 and 3.3, we conducted a bench-
marking experiment. For this experiment, we utilized a very simple
prompt template–
<prompt template>
{System Instructions}
{Database Schema}
{Optional Demonstrations}
</prompt template>
To conduct our experiments we used the Spider [ 17] dataset and
evaluated the accuracy of LLM generated model using the following
two criteria (both of these methods are from the Spider evaluation
framework)
(1)Exact Match of generated SQL: We decompose each SQL into
several clauses, and conduct a set comparison in each SQL
clause
(2)Execution Accuracy: We compare the values of the predicted
SQL queries with those of the ground truth SQL query
0%20%40%60%80%100%
Llama 2MistralLlama 3Llama 3.1GPT 3.5GPT 4EasyMediumHardExtra Hard
Figure 2: Performance of LLMs wrt. difficulty level on Spider
[17] test set.
We found that both of these metrics are highly correlated. There-
fore, we are only reporting Execution Accuracy in Table 2. Few-Shot
Prompting consistently outperforms Zero-Shot, and Contextual
Few-Shot outperforms both methods. We also experimented with
self-consistency; however, due to the increased latency and minimal
accuracy gains from querying the model multiple times, we chose
Page 4:
CODS-COMAD Dec ’24, December 18–21, 2024, Jodhpur, India Kumar et al.
Contextual Few-Shot Prompting as our preferred strategy. Figure
2 illustrates the comparative performance of LLMs with respect
to the difficulty of questions. GPT is performing better for hard
questions.
Among all competing models, Llama 3.1 proved to be the most
efficient and accurate in our experiments. However, when tested
on our domain-specific dataset, GPT-3.5 Turbo emerged as the top-
performing model. Hence we chose GPT-3.5 Turbo for developing
Genicious.
Strategy Llama 2 Mistral Llama 3 Llama 3.1 GPT 3.5 T GPT 4
ZS 5.0% 36.0% 27.0% 41.0% 37.0% 28.0%
FS 51.0% 51.0% 52.0% 53.0% 53.0% 52.0%
CFS 68.0% 70.0% 69.0% 71.0% 65.0% 55.0%
SC 26.0% 40.0% 51.0% 53.0% 53.0% 47.0%
CFS w/ SC 70.3% 66.7% 70.0% 70.3% 60.0% 56.7%
Table 2: Evaluation of different LLMs & Prompting Strategies
on Spider [ 17] test set. ZS: Zero Shot, FS: Few Shot (𝑘=4), CFS:
Contextual Few Shot (𝑘=4), SC: Self Consistency (𝑛=5), CFS w/
SC: Contextual Few Shot w/ Self Consistency (𝑘=4, 𝑛=5).
4 System Description
Genicious is a Java Spring Boot–based application with a React.js
user interface, allowing users to interact and communicate through
REST APIs for querying various datasets by leveraging advanced
natural language processing techniques. The architecture of the
Genicious tool is shown in Figure 3. There are two broad compo-
nents in the tool – one is an offline phase (shaded in blue) and the
other is a query phase.
Offline Phase: Onboarding and Vector store Preparation. During
the onboarding of a new dataset, a list of candidate questions and
their corresponding SQLs are gathered. These questions should ex-
haustively cover various themes of the domain. For example, when
using the tool to answer questions involving year-over-year com-
parisons, a similar question and its corresponding SQL query can be
included in the candidate question pool. Once we have the question
pool ready we use an embedding model to convert these questions
into numerical embeddings. We are using text-embedding-ada-002
model. These embeddings, which are numerical representations
capturing the semantic essence of the questions, are stored in a
vector database (Milvus), facilitating efficient retrieval during the
query phase.
As new datasets/projects are introduced, the system is incremen-
tally updated by adding corresponding embeddings of the question
pool into the vector store. This dynamic approach ensures that the
vector store remains current and relevant, continuously improv-
ing the accuracy and efficiency of the similarity search and query
generation processes.
By integrating these advanced techniques and maintaining an
adaptive vector store, our system provides a robust and scalable
solution for querying diverse projects and generating precise re-
sponses.
Query Phase: Input Processing and Query Generation. Now, when
a user submits a query 𝑄, the system performs the following steps:
Figure 3: Tool Architecture
(1)Embedding Generation: The user’s input question (𝑄)is
converted into a numerical embedding (𝑒)using the text em-
bedding model text-embedding-ada-002 . This embedding
is then used to retrieve SQL examples that are relevant as part
of the Retrieval Augmented Generation (RAG) framework.
(2)Similarity Search: Using FAISS (Facebook AI Similarity
Search) [ 9], we perform a similarity search with the embed-
ding generated in the previous step (𝑒)against pre-stored
question embeddings in the vector database. FAISS efficiently
ranks the questions based on their proximity to the user’s
question.
(3)Contextual Examples Retrieval: This step retrieves and
presents the most relevant textual questions from the ques-
tion pool, mapped back from their numerical embeddings.
These retrieved questions are used to generate contextual-
ized examples where the few-shot demonstrations are dy-
namic (adapting to the input question) and relevant.
(4)Prompt Building: We maintain an instruction database that
stores prompt templates for each project/dataset. For every
user query, an appropriate instruction prompt is fetched from
this database and appended with the original user query and
retrieved contextual few-shot examples. This context-aware,
dynamic few-shot instruction is then sent to the LLMs.
(5)LLM Query Generation: The LLM formulates a BigQuery
(BQ) query tailored to effectively address the user’s question.
(6)SQL Sanitization: The LLM-generated SQL undergoes a
sanitization step, which involves scanning the SQL for any
harmful keywords (e.g., DROP, ALTER, UPDATE , etc.).
(7)Response Delivery: If the SQL is deemed safe, the query
is executed on the relational database, and the results are
returned via REST API calls, ensuring a seamless and respon-
sive interaction. Finally, the results are presented to the user
using tables and visualizations.
5 Adoption and Future Steps
The P95 latency for a single query response is around 6 seconds.
The tool efficiently handles simple to intermediate queries on flat
tables, retrieves the latest metrics, performs basic calculations, and
investigates trends. We are actively working on adding new fea-
tures in future version such as making the bot truly conversational
with scoped interactions, and utilizing an agentic framework while
ensuring robust data protection, among others.
Page 5:
Genicious: Contextual Few-shot Prompting for Insights Discovery CODS-COMAD Dec ’24, December 18–21, 2024, Jodhpur, India
References
[1]Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren-
cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal
Anadkat, et al .2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774
(2023).
[2]Tom B Brown. 2020. Language models are few-shot learners. arXiv preprint
ArXiv:2005.14165 (2020).
[3]Shuaichen Chang and Eric Fosler-Lussier. 2023. How to prompt llms for text-
to-sql: A study in zero-shot, single-domain, and cross-domain settings. arXiv
preprint arXiv:2305.11853 (2023).
[4]Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Jinshu Lin,
Dongfang Lou, et al .2023. C3: Zero-shot text-to-sql with chatgpt. arXiv preprint
arXiv:2307.07306 (2023).
[5]Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin,
Tat-Seng Chua, and Qing Li. 2024. A Survey on RAG Meeting LLMs: Towards
Retrieval-Augmented Large Language Models. In Proceedings of the 30th ACM
SIGKDD Conference on Knowledge Discovery and Data Mining . 6491–6501.
[6]Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jin-
gren Zhou. 2023. Text-to-sql empowered by large language models: A benchmark
evaluation. arXiv preprint arXiv:2308.15363 (2023).
[7]Zijin Hong, Zheng Yuan, Qinggang Zhang, Hao Chen, Junnan Dong, Feiran
Huang, and Xiao Huang. 2024. Next-Generation Database Interfaces: A Survey
of LLM-based Text-to-SQL. arXiv preprint arXiv:2406.08426 (2024).
[8]Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, De-
vendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel,
Guillaume Lample, Lucile Saulnier, et al .2023. Mistral 7B. arXiv preprint
arXiv:2310.06825 (2023).[9]Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity
search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535–547.
[10] Fei Li and Hosagrahar V Jagadish. 2014. Constructing an interactive natural
language interface for relational databases. Proceedings of the VLDB Endowment
8, 1 (2014), 73–84.
[11] Tanzim Mahmud, KM Azharul Hasan, Mahtab Ahmed, and Thwoi Hla Ching
Chak. 2015. A rule based approach for NLP based query processing. In 2015 2nd
international conference on electrical information and communication technologies
(EICT) . IEEE, 78–82.
[12] Mohammadreza Pourreza and Davood Rafiei. 2024. Din-sql: Decomposed in-
context learning of text-to-sql with self-correction. Advances in Neural Informa-
tion Processing Systems 36 (2024).
[13] Liang Shi, Zhengju Tang, and Zhi Yang. 2024. A Survey on Employing Large
Language Models for Text-to-SQL Tasks. arXiv preprint arXiv:2407.15186 (2024).
[14] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning
with neural networks. Advances in neural information processing systems 27
(2014).
[15] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne
Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal
Azhar, et al .2023. Llama: Open and efficient foundation language models. arXiv
preprint arXiv:2302.13971 (2023).
[16] Xiaojun Xu, Chang Liu, and Dawn Song. 2017. Sqlnet: Generating structured
queries from natural language without reinforcement learning. arXiv preprint
arXiv:1711.04436 (2017).
[17] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James
Ma, Irene Li, Qingning Yao, Shanelle Roman, et al .2018. Spider: A large-scale
human-labeled dataset for complex and cross-domain semantic parsing and
text-to-sql task. arXiv preprint arXiv:1809.08887 (2018).