Authors: Shiping Yang, Jie Wu, Wenbiao Ding, Ning Wu, Shining Liang, Ming Gong, Hengyuan Zhang, Dongmei Zhang
Paper Content:
Quantifying the Robustness of Retrieval-Augmented Language Models
Against Spurious Features in Grounding Data
Shiping Yang 1 2 *, Jie Wu 2, Wenbiao Ding 2, Ning Wu 2, Shining Liang 2, Ming Gong 2, Hengyuan Zhang 3,
Dongmei Zhang 2
Abstract
Robustness has become a critical attribute for
the deployment of RAG systems in real-world
applications. Existing research focuses on ro-
bustness to explicit noise (e.g., document seman-
tics) but overlooks spurious features (a.k.a. im-
plicit noise). While previous works have explored
spurious features in LLMs, they are limited to
specific features (e.g., formats) and narrow sce-
narios (e.g., ICL). In this work, we statistically
confirm the presence of spurious features in the
RAG paradigm, a robustness problem caused by
the sensitivity of LLMs to semantic-agnostic fea-
tures. Moreover, we provide a comprehensive
taxonomy of spurious features and empirically
quantify their impact through controlled experi-
ments. Further analysis reveals that not all spu-
rious features are harmful and they can even be
beneficial sometimes. Extensive evaluation re-
sults across multiple LLMs suggest that spurious
features are a widespread and challenging problem
in the field of RAG. We release all code and data at:
https://github.com/maybenotime/RAG-SpuriousFeatures.
1. Introduction
Retrieval-Augmented Generation (RAG) has emerged as a
promising paradigm to mitigate LLM hallucinations (Gao
et al., 2023; Yang et al., 2023b), integrating relevant external
knowledge to improve the factuality and trustworthiness
of LLM-generated outputs (Zhou et al., 2024). However,
Retrieval-Augmented Language Models (RALMs) still face
substantial robustness issues due to the presence of noise in
retrieved documents (Liu et al., 2023; Li et al., 2024c).
*Work done during internship at Microsoft. 1Simon Fraser University, 2Microsoft, 3Tsinghua University.
Correspondence to: Jie Wu <jiewu1@microsoft.com>.

[Figure 1 (image omitted): for the question "The atomic number of indium which belongs to 5th period is",
the same retrieved document ("Period 5 element") is presented in JSON, Markdown (MD), and YAML formats;
the responses are inconsistent (JSON: 49; MD: 59; YAML: 59).]
Figure 1. An example from the SURE framework (Sec. 3), illustrating
the sensitivity of RAG systems to spurious features within
grounding data. MD represents the Markdown format. The original
retrieved document is fed into the LLMs in different formats,
leading to inconsistent responses.

Recent research aims to explore the characteristics that affect
the robustness of RAG systems from the perspective of
grounding data construction (Cuconasu et al., 2024). These
studies examine various factors, including the type (Wu
et al., 2024a), number (Xu et al., 2024), and position of doc-
uments (Liu et al., 2024) within the prompt context. A more
detailed discussion of related work is in Appendix A. How-
ever, previous analyses primarily focus on explicit noise
(i.e., causal features) that significantly alter the semantic
information of grounding data (Wu et al., 2024b; Cuconasu
et al., 2024), while neglecting spurious features (a.k.a. im-
plicit noise) that make slight modifications but still preserve
the meaning of contexts. This limitation extends to existing
evaluation benchmarks, which simulate complex noise sce-
narios to assess the robustness of RAG systems (Chen et al.,
2024a; Wu et al., 2024a), yet provide neither data nor
metrics for measuring the robustness of LLMs against spurious
features.
Contemporary RAG systems typically employ production-
level retrievers, such as Bing and Google, to collect relevant
information from the internet. Unlike a single corpus, the
internet encompasses diverse data with distinct features. For
arXiv:2503.05587v1 [cs.CL] 7 Mar 2025
any given query, there may exist numerous golden docu-
ments that contain the correct answer but differ in style,
format, or other attributes. As shown in Figure 1, we have
observed that LLMs may fail to consistently derive the cor-
rect answer from golden documents with different formats.
A similar phenomenon is reported in Sclar et al. (2024)
and He et al. (2024), which demonstrate that LLMs are
extremely sensitive to the format of prompts (i.e., spurious
features). For more related work, see Appendix A. Unfortunately,
there is no statistical or empirical evidence to support
the existence of spurious features in the RAG paradigm.
This highlights the urgent need to redefine spurious fea-
tures in RAG and systematically quantify the robustness of
RALMs against them.
To address these challenges, we first design a preliminary
experiment to demonstrate that RALMs are sensitive to
semantic-agnostic features in the grounding data, thereby
extending the definition of spurious features to RAG sys-
tems. Building on findings from our preliminary experiment
and recent studies, we identify five types of spurious fea-
tures that may appear in RAG scenarios. Then, we propose
a novel framework, SURE, for automating the process of robustness
evaluation. This framework follows a perturb-then-evaluate
approach, offering great scalability. In SURE, automated
perturbations are applied to the original instances to
inject the corresponding spurious features. The perturbed in-
stances are then examined to ensure that the causal features
remain intact. After these steps, we employ tailored met-
rics to quantify the robustness of RALMs against spurious
features. Further analysis reveals that not every spurious feature
is harmful; some can even be beneficial.
We also find that our proposed metrics are more suitable for
evaluating spurious features in golden documents. Based on
this finding, we distill the most challenging instances from
the synthetic data generated by our framework to create
a lighter benchmark, SIG, enabling more efficient robust-
ness evaluation. We evaluate 12 representative LLMs with
varying architectures and scales, and the results show that
maintaining robustness against spurious features remains a
significant challenge, especially for open-source LLMs.
Our contributions can be summarized as follows: 1) We extend
the definition of spurious features to RAG systems
through a preliminary experiment. To the best of our knowledge,
this is the first comprehensive study to define and
evaluate spurious features from the RAG perspective. 2) We
propose a novel evaluation framework, SURE, to assess the
robustness of RALMs against spurious features, which effectively
simulates five types of spurious features that may
appear in real-world scenarios. 3) Through extensive experiments
and analysis using SURE on two representative LLMs,
we provide valuable insights for future research. 4) Based
on these findings, we curate a lightweight yet challenging
evaluation dataset named SIG and benchmark the robustness
of the current state-of-the-art LLMs against spurious
features.
2. Preliminary
In this section, we first define causal and spurious features in
the context of retrieval-augmented generation and then pro-
vide statistical evidence to support the existence of spurious
features. Specifically, we start by defining an oracle re-
triever to sample data according to the preferences of LLMs
and then apply statistical testing to validate that RALMs
exhibit biases toward semantic-agnostic features within the
grounding data.
2.1. Causal and Spurious Features in RAG
In general, causal features are input features that have a direct
causal effect on the output of a predictive model (Yu et al.,
2020). Their relationship is rooted in causality, rather than
mere statistical correlation. When it comes to Large Lan-
guage Models, the meaning and intent of prompts serve as
causal features that directly influence the models’ responses.
In the context of RAG, causal features refer to the semantic
information of grounding data, including whether the cor-
rect answer exists, the amount of noise documents in the
grounding data, and the content of those noise documents.
In contrast, spurious features are input features that co-occur
with causal features and are erroneously captured by the
model (Neuhaus et al., 2023). These features exhibit a sta-
tistical correlation with the model’s output but lack a causal
relationship. Recent research has shown that LLMs are
sensitive to seemingly trivial features like prompt format-
ting, thereby extending the definition of spurious features to
LLMs (Sclar et al., 2024). Similarly, we hypothesize that
the semantic-agnostic features of the grounding data can
be defined as spurious features in RAG systems. However,
the conclusion drawn from LLMs may not be applicable
to RALMs. Unlike LLMs, the input prompts for RALMs
incorporate additional dynamic content—grounding data—
to augment their output. This grounding data is derived
from retrieved documents rather than static instructions pro-
vided directly by users. Therefore, we design a preliminary
experiment to validate whether RALMs are sensitive to
semantic-agnostic features within the grounding data.
2.2. Oracle Retriever
We aim to confirm the existence of spurious features in RAG
scenarios, i.e., to demonstrate the sensitivity of RALMs to
semantic-agnostic features in grounding data.
There are some challenges in revealing the sensitivity of
RALMs. Specifically, when retrieving from a single corpus,
it is difficult to mine semantically equivalent counterparts
with obvious differences in semantic-agnostic features. This
is because documents from the corpus often share similar
styles and formats. With only slight differences in these
features, it is difficult to observe significant performance
variations using traditional evaluation metrics like accuracy.
Thus, more fine-grained metrics are required to capture the
subtle performance changes. Inspired by the use of LLMs
as supervision signals for document utility (Izacard et al.,
2023; Gan et al., 2024), we propose the oracle score, which
measures fine-grained performance through calculating the
log probability of generating correct answers based on the
given documents. The oracle score is defined as follows:
$$ \mathrm{Oracle}(x, y, \theta) = \sum_{t=1}^{T} \log p(y_t \mid x, y_{<t}, \theta) \tag{1} $$
where $x$ is the input prompt for RALMs, including the instruction
$I$, grounding data $G$, and query $Q$; $y$ represents the
ground truth answer; $\theta$ denotes the model parameters; and
$T$ is the total length of the answer sequence. For cases with
multiple answers, we compute the final score by averaging
the corresponding oracle scores across all answers. Notably,
the oracle score is less effective at estimating the helpfulness
of documents when handling queries with long-sequence answers¹.
This limitation arises from the expansion
of the solution space as the answer length increases.
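
As a minimal sketch of Eq. (1), the snippet below computes the oracle score from per-token log-probabilities that are assumed to have already been obtained from the LLM; how those log-probabilities are extracted is outside the sketch. The function names and the example values are hypothetical and are not taken from the released code.

```python
from typing import Sequence


def oracle_score(token_logprobs: Sequence[float]) -> float:
    """Eq. (1): sum of log p(y_t | x, y_<t, theta) over the T answer tokens."""
    return sum(token_logprobs)


def oracle_score_multi(per_answer_logprobs: Sequence[Sequence[float]]) -> float:
    """For queries with several reference answers, average the per-answer scores."""
    scores = [oracle_score(lp) for lp in per_answer_logprobs]
    return sum(scores) / len(scores)


# Hypothetical log-probabilities for two reference answers of one query.
example_logprobs = [[-0.1, -0.3], [-0.5, -0.2, -0.1]]
example_score = oracle_score_multi(example_logprobs)
```

Because the score is a sum of non-positive terms, longer answers tend to receive lower scores, which is consistent with the restriction to short answers mentioned above.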
We further define an oracle retriever that ranks documents
according to their oracle scores. Since the oracle score is
computed using LLMs as supervision signals, the oracle re-
triever can be regarded as a sampler that reflects the LLMs’
preferences for grounding data. With this sampler, we suc-
cessfully transform the target of demonstrating that RALMs
are sensitive to semantic-agnostic features within grounding
data into showing that the oracle retriever is biased toward
spurious features in documents.
2.3. Preliminary Experiment & Analysis
Using the oracle retriever, we recall 100 documents from
the Wikipedia dump for each query in the NQ-open dataset.
However, the computational cost of calculating oracle scores
across the entire corpus to select top-ranked documents is
prohibitive. Therefore, we introduce Contriever-msmarco, a
traditional dense retriever, for first-stage retrieval, followed
by reranking in descending order based on the oracle scores.
In addition to reducing the computational load, this initial
retrieval also ensures the semantic similarity of retrieved
documents. To further eliminate the effect of causal features,
documents without golden answers are filtered out, ensuring
that the remaining documents have roughly consistent causal
features.
¹ In our experiments, we only use queries whose answer
sequences are shorter than 5 tokens, ensuring the effectiveness of
oracle scores.

We then select the first-ranked and last-ranked documents
for each query from the remaining documents, resulting
in two sets, each containing 2658 samples. By comparing
the differences in feature distributions between these two
sets, we can assess whether the oracle retriever exhibits
bias toward semantic-agnostic features. If these two sets
do not belong to the same feature distribution, this can be
attributed to the oracle retriever’s bias towards semantic-
agnostic features during sampling. To confirm that this
bias is not introduced by the dense retriever in first-stage
retrieval, we establish a control group by randomly sampling
two documents instead of selecting the first- and last-ranked
documents.
To evaluate whether the two distributions are the same, we
employ the Kolmogorov-Smirnov (K-S) test. The following
semantic-agnostic features are measured in the experiments:
1) Flesch Score, 2) Distinct-1, 3) Dependency Tree Depth, 4)
PPL, and 5) Token Length. A detailed introduction of the
K-S test and these features can be found in Appendix B.
We conduct experiments using Mistral-7B-Instruct-v0.3 to
implement the oracle retriever. The K-S statistic and P-value
are presented in Table 3. Furthermore, we visualize the
feature distributions for both the experimental and control
groups in Figure 6. For all tested features in the experimental
group, the K-S test rejects the null hypothesis, concluding
that the distributions of the two sets are significantly different.
In contrast, for the control group, the K-S test fails to reject
the null hypothesis. The results for Llama-3.1-8B-Instruct
are also provided in Appendix B. According to these results,
we can conclude that RALMs exhibit bias toward spurious
features in documents.
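
To make the statistical comparison concrete, the following is a small pure-Python sketch of the two-sample K-S statistic, i.e., the largest gap between the empirical CDFs of a feature in the two document sets. It does not compute the p-value; a statistics library such as SciPy (`scipy.stats.ks_2samp`) would normally be used for that. The feature values below are invented for illustration only.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample K-S statistic: max |F_a(v) - F_b(v)| over all observed values."""
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    for v in sa + sb:
        cdf_a = bisect_right(sa, v) / len(sa)  # empirical CDF of sample a at v
        cdf_b = bisect_right(sb, v) / len(sb)  # empirical CDF of sample b at v
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical token lengths of first-ranked vs. last-ranked documents.
first_ranked = [90, 95, 100, 110]
last_ranked = [120, 130, 140, 150]
statistic = ks_statistic(first_ranked, last_ranked)
```

A statistic near 0 indicates that the two sets follow similar distributions (as expected for the randomly sampled control group), while a value near 1 indicates almost no overlap.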
The preliminary experiment provides statistical evidence
supporting the existence of spurious features in RAG sys-
tems. Nevertheless, it does not offer empirical evidence or
quantitative analysis. Inspired by previous data synthesis
studies (Tan et al., 2024b; Tong et al., 2024; Li et al., 2024b;
Wang et al., 2024a), we use a data synthesis approach to
better control feature variables and quantify the robustness
of RALMs against spurious features.
3. Proposed Framework
In this section, we detail our proposed evaluation framework,
SURE (Spurious FeatUres Robustness Evaluation), which is
designed specifically for assessing the robustness of RALMs
against spurious features in grounding data. As illustrated
in Figure 2, this framework comprises four components: 1)
Comprehensive Taxonomy. We identify five distinct types of
spurious features that may arise in the context of Retrieval-
Augmented Generation. 2) Spurious Features Injection. We
design a data synthesis pipeline to automate the injection of
spurious features, utilizing both model-based and rule-based
methods to construct counterparts of the original document
[Figure 2 (image omitted). Panels: Taxonomy (Style: Simple, Complex; Source: LLM-Gen, Self-Gen;
Logic: Reverse, Random, LLM-Rank; Format: HTML, Markdown, YAML, JSON; Metadata: Source, Time).
Testbed Construction (Instance Preparation: query source NQ-Open, document source Wiki dump;
Spurious Features Injection: model-based perturbations, rule-based modification;
Causal Features Preservation: maintain semantic equivalence via a bidirectional entailment algorithm,
preserve ground truths via a string match strategy).
RAG Systems (Retriever: Contriever-msmarco; LLMs: Llama-3.1-8B-Instruct, Mistral-7B-Instruct).
Evaluation Metrics (Robustness Rate, Win Rate, Lose Rate, Accuracy).]
Figure 2. Overview of our SURE framework. We provide a Comprehensive Taxonomy that includes five types of spurious features, further
divided into 13 subtypes of perturbations (left section). To construct the testbed, we prepare raw instances initially and then synthesize the
modified instances through a workflow consisting of Spurious Features Injection and Causal Features Preservation (middle section). By
applying carefully tailored metrics for Robustness Evaluation, we quantify the robustness of target RAG systems (right section).
with varying spurious features. 3) Causal Features Preservation.
We employ a bidirectional entailment algorithm and
a string matching strategy to ensure that the causal features
of grounding data remain unchanged. 4) Robustness Evaluation.
We report the win, lose, and robustness rates at the instance
level to facilitate a fine-grained evaluation.
3.1. Problem Formulation
Given a query $q$, the retriever $R$ returns a list of relevant
documents from a corpus $D = \{d_i\}_{i=1}^{N}$. The relevance
between document $d$ and query $q$ can be measured by various
methods. In this work, we use a BERT-based dense retriever
to obtain the embeddings of the query and documents, respectively.
The relevance score is calculated by computing their
dot-product similarity:
$$ s(q, d_i) = E(q) \cdot E(d_i). \tag{2} $$
Then, the Top-k documents with the highest similarity
scores are retrieved:
$$ D_{\text{retrieve}} = \operatorname{argtop-}k \, \{ s(q, d_i) \mid d_i \in D \}. \tag{3} $$
To formally quantify the robustness of RAG systems against
spurious features, we define the input prompt for the LLM-
based reader as $P = (I, G, Q)$, where $I$ represents the instruction,
$G$ refers to the grounding data, constituted by a subset
of $D_{\text{retrieve}}$, and $Q$ is the query. A perturbation is introduced
to investigate the impact of spurious features by applying
a semantic-agnostic modification to the original grounding
data, while preserving its causal features. We define $g(\cdot)$
to automate this process, transforming $G$ to $g(G)$ and producing
a counterpart $\hat{P} = (I, g(G), Q)$. The outputs of the
LLM-based reader for $P$ and $\hat{P}$ are compared to evaluate
the impact of the introduced perturbation:
$$ y = \mathrm{LLM}(P), \quad \hat{y} = \mathrm{LLM}(\hat{P}). \tag{4} $$
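
The formulation above can be sketched in a few lines of Python. This is an illustrative sketch only: embeddings are plain lists of floats standing in for the encoder output E(.), and the helper names (`retrieve_top_k`, `build_prompt_pair`) are hypothetical rather than part of the released code.

```python
from typing import Callable, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot-product similarity s(q, d_i) = E(q) . E(d_i), as in Eq. (2)."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(query_emb: Sequence[float],
                   doc_embs: Sequence[Sequence[float]],
                   k: int) -> List[int]:
    """Eq. (3): indices of the k documents with the highest similarity."""
    scores = [dot(query_emb, d) for d in doc_embs]
    order = sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def build_prompt_pair(instruction: str, grounding: List[str], query: str,
                      g: Callable[[List[str]], List[str]]) -> Tuple[tuple, tuple]:
    """Return P = (I, G, Q) and its perturbed counterpart (I, g(G), Q)."""
    return (instruction, grounding, query), (instruction, g(grounding), query)


# Toy embeddings (assumed values, for illustration only).
top2 = retrieve_top_k([1.0, 0.0], [[0.2, 0.9], [0.9, 0.1], [0.5, 0.5]], k=2)
```

Only the grounding data differs between the two prompts, so any change in the reader's output can be attributed to the perturbation g.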
3.2. Taxonomy of Spurious Features
Based on the biased spurious features identified in our pre-
liminary experiment and prior works, we introduce five
types of spurious features and their corresponding perturba-
tions in detail as follows.
Style Perturbations The same content can be expressed
in different styles, using varying tones, words and sentence
structures. As shown in Section 2.3, LLMs exhibit biases
towards readability-related features. Similarly, for humans,
the readability of a text can significantly influence its ac-
cessibility to the audience (Yang et al., 2023a). Therefore,
we define two perturbations from the perspective of readability
style: Simple and Complex. The former simplifies
the grounding data by using basic vocabulary and simple
sentence structure, while the latter employs professional
vocabulary and a formal academic tone to make the documents
more complex.
4
Page 5:
Quantifying the Robustness of Retrieval-Augmented Language Models Against Spurious Features in Grounding Data
Source Perturbations LLM-generated content, including
both misinformation and correct claims, infiltrates every cor-
ner of the internet. Recent studies have shown that neural
retrievers are biased towards LLM-generated content, lead-
ing to the marginalization of human-authored content (Dai
et al., 2024; Chen et al., 2024b). Moreover, our preliminary
experiments demonstrate that LLMs are biased towards the
Perplexity (PPL) of text. Thus, we define two types of
source perturbations: LLM-generated and Self-generated.
Specifically, the LLM-generated perturbation paraphrases
the original document using a powerful LLM, while the self-
generated perturbation employs the same backbone model
used as the generator in the RAG system.
Logic Perturbations The arrangement of sentences
within a passage typically follows a logical order, ensur-
ing the clarity and coherence in the narrative flow. Here, we
simulate scenarios where the intrinsic logical chain is disrupted
by three different perturbations: Random, Reverse,
and LLM-reranked, each representing a distinct sentence
ordering strategy.
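
The two rule-based logic perturbations can be illustrated with the sketch below. It assumes a naive regular-expression sentence splitter, which is a simplification and not necessarily the rule set used in the paper (see Appendix C for the actual implementation); the LLM-reranked variant requires a model and is not shown.

```python
import random
import re
from typing import List


def split_sentences(doc: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", doc.strip()) if s]


def reverse_order(doc: str) -> str:
    """Reverse perturbation: keep every sentence but invert their order."""
    return " ".join(reversed(split_sentences(doc)))


def random_order(doc: str, seed: int = 0) -> str:
    """Random perturbation: shuffle sentences with a fixed seed for reproducibility."""
    sentences = split_sentences(doc)
    random.Random(seed).shuffle(sentences)
    return " ".join(sentences)


example_doc = "Indium is a metal. It has atomic number 49. It is in period 5."
reversed_doc = reverse_order(example_doc)
```

Both functions only permute sentences, so the set of sentences (and hence the answer string) is unchanged.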
Format Perturbations The internet contains various data
formats, including HTML, Markdown, YAML, and JSON.
These formats are usually processed into plain text before
being fed to LLMs. To mitigate the loss of structural in-
formation during this process, some RAG studies propose
using the original format, rather than plain text, to augment
the generation (Tan et al., 2024a). However, as highlighted
in previous research, the prompt format is recognized as a
spurious feature that can significantly impact model perfor-
mance (Sclar et al., 2024; He et al., 2024). Therefore, we
perturb the original document with four common formats to
explore the impact of grounding data format in the context
of RAG.
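
A rule-based format perturbation might look like the following sketch, which wraps the same title and text in the four formats. The JSON, Markdown, and YAML layouts follow the example in Figure 1; the HTML template is an assumption made for illustration, and the YAML output is unquoted, so values containing YAML-special characters would need escaping in practice.

```python
import json
from html import escape


def to_json(title: str, text: str) -> str:
    """Serialize the document as a JSON object with 'title' and 'text' keys."""
    return json.dumps({"title": title, "text": text}, ensure_ascii=False)


def to_markdown(title: str, text: str) -> str:
    """Render the title as a level-1 Markdown heading followed by the text."""
    return f"# {title}\n{text}"


def to_yaml(title: str, text: str) -> str:
    """Render as simple unquoted YAML key-value pairs (no escaping; see lead-in)."""
    return f"title: {title}\ntext: {text}"


def to_html(title: str, text: str) -> str:
    """Hypothetical minimal HTML template; special characters are escaped."""
    return (f"<html><head><title>{escape(title)}</title></head>"
            f"<body><p>{escape(text)}</p></body></html>")


doc_title, doc_text = "Period 5 element", "49 In Indium Post-transition metal"
variants = {
    "JSON": to_json(doc_title, doc_text),
    "Markdown": to_markdown(doc_title, doc_text),
    "YAML": to_yaml(doc_title, doc_text),
    "HTML": to_html(doc_title, doc_text),
}
```

All four variants carry the same title and text; only the surrounding markup differs.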
Metadata Perturbations Metadata is often included in
the HTML results returned by search engines. In our framework,
we consider two main types of metadata: Timestamp
and Data source. The timestamp indicates the time when
the data was created, and data source identifies the origin of
the data.
3.3. Spurious Features Injection
The automation of spurious features injection is essential
for automating the entire evaluation framework. We detail
the process of collecting the original instances and describe
how the automated perturbation was implemented.
Instance Preparation An instance is the dynamic component
of the prompt P, consisting of a query Q and grounding
data G. The queries are drawn from the NQ-open dataset,
while our data source is the English Wikipedia dump as of 20
December 2018. To construct the original instances, we first
select 1,000 queries based on the closed-book QA results of
Mistral-7B-Instruct-v0.3 on the NQ-open dataset. This subset
includes 500 queries that can be answered directly using
parametric knowledge (Known) and 500 queries that require
external knowledge for answering (Unknown). We then
retrieve 100 documents for each query from the Wikipedia
dump to serve as the grounding data. Overall, we have a
total of 100,000 original instances for the following pertur-
bation step.
Automated Perturbation As introduced in Section 3.1,
the perturbation g(.) injects spurious features by modifying
the grounding data. For style and source perturbations,
g(.) is implemented using an LLM² prompted by carefully
crafted guidelines to modify the raw document, producing
counterparts of the original instances. For logic and format
perturbations, we develop g(.) as a heuristic method based
on a set of predefined rules³. To simulate metadata that
may appear in the real world, we first synthesize pseudo
Wikipedia or Twitter links for the raw instances, and then
organize them into HTML format using a rule-based g(.). The
complete implementation details for automated perturbation
are provided in Appendix C.
3.4. Causal Features Preservation
To eliminate the effect of causal features, it is essential to
follow the principle of controlled experiments by keeping
causal features constant while systematically manipulat-
ing spurious features. This approach isolates the impact
of spurious features from that of causal features, enabling
an accurate quantification of robustness against spurious
features. In our framework, we introduce two methods to
ensure the stability of causal features in the grounding data.
Maintain Semantic Equivalence For models capable of
following human instructions, we directly instruct them
to maintain semantic equivalence when injecting spurious
features. Nonetheless, it’s impossible to completely avoid
semantic shift during the perturbation process. To ensure
the semantic consistency before and after introducing per-
turbation, we employ a bidirectional entailment algorithm
to filter out instance pairs (raw instance, perturbed instance)
with semantic inequivalence. Specifically, for document G
and its modified counterpart g(G), we use a Natural Lan-
guage Inference (NLI) system to detect whether the latter
can be inferred from the former, and vice versa. The NLI
system classifies predictions into one of: entailment, neutral,
contradiction. We compute both directions, and the
algorithm returns equivalent if and only if both directions
are predicted as entailment.

² Unless otherwise specified, all model-based g(.) are implemented
using Llama-3.1-70B-Instruct.
³ One exception is that we implement the LLM-reranked perturbation
using an LLM-based g(.).

In general, this algorithm can be implemented by any NLI
system. However, in our case, the concatenation of G and
g(G) sometimes exceeds the context limitation of a BERT-
based NLI model. Hence, we apply an LLM-based NLI
system⁴ to implement the bidirectional entailment algorithm.
The prompt is included in Appendix D.
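
The bidirectional entailment check described above reduces to two NLI calls. The sketch below keeps the NLI system abstract as a callable returning one of the three labels; `toy_nli` is a deliberately crude word-overlap stand-in used only to make the example runnable and is not the LLM-based NLI system used in the paper.

```python
from typing import Callable

# An NLI system maps (premise, hypothesis) to "entailment", "neutral", or "contradiction".
NLI = Callable[[str, str], str]


def semantically_equivalent(original: str, perturbed: str, nli: NLI) -> bool:
    """Return True only if each text entails the other."""
    forward = nli(original, perturbed)   # does G entail g(G)?
    backward = nli(perturbed, original)  # does g(G) entail G?
    return forward == "entailment" and backward == "entailment"


def toy_nli(premise: str, hypothesis: str) -> str:
    """Toy stand-in (assumption): 'entailment' if every hypothesis word is in the premise."""
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    return "entailment" if hypothesis_words <= premise_words else "neutral"


same = semantically_equivalent("indium is element 49", "element 49 is indium", toy_nli)
dropped = semantically_equivalent("indium is element 49", "indium is element", toy_nli)
```

In the second pair the perturbed text loses information, so the backward direction is not an entailment and the pair would be filtered out.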
Preserve Ground Truths While semantic equivalence
protects causal features to the greatest extent, the perturba-
tion may lead to the correct answer being paraphrased into
an alias (e.g., "President Roosevelt" to "Roosevelt"). These
variations in the grounding data are likely to result in false
negatives when determining response correctness, despite
the NQ-Open dataset providing multiple potential answer
variants for each query. To address this issue, we employ a
simple string-matching strategy to filter out documents that
undergo unexpected modifications. For Golden documents
that originally contained the correct answers, we keep them
only if they preserve the ground truths after perturbation.
For Noise documents that did not contain the correct answers,
we discard them if they unexpectedly acquire the
ground truths due to perturbations.
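
The string-matching filter can be sketched as follows. Case-insensitive substring matching is an assumption made here for illustration; the exact normalization used in the paper is not specified in this section.

```python
from typing import Sequence


def contains_answer(doc: str, answers: Sequence[str]) -> bool:
    """True if any ground-truth answer string occurs in the document."""
    lowered = doc.lower()
    return any(a.lower() in lowered for a in answers)


def keep_pair(original: str, perturbed: str, answers: Sequence[str]) -> bool:
    """Keep a (raw, perturbed) pair only if its golden/noise status is unchanged.

    Golden documents must still contain a ground truth after perturbation;
    noise documents must not have acquired one.
    """
    return contains_answer(original, answers) == contains_answer(perturbed, answers)


answers = ["President Roosevelt"]
alias_case = keep_pair("President Roosevelt signed it.", "Roosevelt signed it.", answers)
```

In the alias example the perturbed golden document no longer contains the ground-truth string, so the pair is discarded rather than risking a false negative at evaluation time.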
3.5. Overview of the Synthetic Dataset
Through the steps of spurious features injection and causal
features preservation, we derive the final dataset available
for robustness evaluation. The synthetic dataset generated
by the SURE framework is divided into four subsets
based on the categories of queries and documents within
the instances. Notably, the distribution of the dataset is
model-specific, as the classification of Known and Unknown
queries is determined by the intrinsic knowledge of the target
LLM. Table 1 presents the dataset statistics for assessing
Mistral-7B-Instruct-v0.3. The distribution for Llama-
3.1-8B-Instruct is shown in Appendix E.
3.6. Robustness Evaluation
We employ an evaluation method Y(.), in line with Liu
et al. (2024); Cuconasu et al. (2024), to measure the cor-
rectness of responses generated by RAG systems. This
approach checks whether any of the correct answers is con-
tained within the response produced by the LLM and then
derives a binary label. Previous studies use accuracy as
the primary metric and report it at the dataset level to assess the
robustness of RALMs, which is quantified by calculating
the variations in the models’ accuracy across different types
of noise. However, dataset-level metrics have certain limitations,
as they may fail to capture fine-grained variations that
occur at the instance level. As shown in Figure 3, RALMs
may appear robust at dataset-level evaluations but exhibit
significant sensitivity at the instance level.

⁴ Farquhar et al. (2024) confirm the effectiveness of the LLM-
based NLI system through human annotation, demonstrating that
its performance is on par with the DeBERTa-large model used in
Kuhn et al. (2023).

          K-G     K-N     U-G    U-N     Total
Style     7766    31152   2593   37692   79203
Source    9249    32435   3228   39101   84013
Logic     9724    35537   3587   41990   90838
Format    11037   38018   4141   45518   98714
Meta      11104   38018   4255   45420   98797
Table 1. Statistics of the evaluation dataset for Mistral-7B-Instruct-
v0.3. K-G denotes the instances composed of (Known query,
Golden Document), while U-N refers to the instances consisting
of (Unknown query, Noise Document). The values represent
the number of instance pairs for each type of perturbation within
the category of spurious features.

To quantify whether a RAG system is robust and unbiased at
the instance level, we assign a ternary label to each instance
by comparing the correctness of the LLM’s response before
and after introducing the perturbation. This comparison
process can be formulated as $C = Y(y_i) - Y(\hat{y}_i)$, where
$C$ lies in the set $\{-1, 0, 1\}$. Based on the comparison outcomes,
we define three metrics: Robustness Rate (RR),
Win Rate (WR), and Lose Rate (LR). The RR is calculated
as follows:
$$ RR = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(C == 0) \tag{5} $$
where $N$ is the total number of instances in the dataset;
$y_i$ and $\hat{y}_i$ represent the outputs of the LLM for the original
and perturbed instances. RR measures the proportion of
instances where the RALM’s answer remains consistent
(0) before and after introducing the perturbation. Similarly,
WR and LR quantify the proportions of instances where
the correctness of the RALM’s response changes after the
perturbation, either from incorrect to correct ($C == -1$) or
from correct to incorrect ($C == 1$).
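
As a worked illustration of these definitions, the sketch below computes the correctness label Y(.) by answer containment and then derives RR, WR, and LR from paired labels. The lowercasing in `is_correct` and the example labels are assumptions made for illustration.

```python
from typing import Dict, Sequence


def is_correct(response: str, answers: Sequence[str]) -> int:
    """Y(.): 1 if any ground-truth answer is contained in the response, else 0."""
    lowered = response.lower()
    return int(any(a.lower() in lowered for a in answers))


def robustness_metrics(before: Sequence[int], after: Sequence[int]) -> Dict[str, float]:
    """Compute RR, WR, LR from binary labels before/after the perturbation."""
    if len(before) != len(after) or not before:
        raise ValueError("label lists must be non-empty and of equal length")
    diffs = [b - a for b, a in zip(before, after)]  # C = Y(y_i) - Y(y_hat_i)
    n = len(diffs)
    return {
        "RR": diffs.count(0) / n,   # correctness unchanged
        "WR": diffs.count(-1) / n,  # incorrect -> correct
        "LR": diffs.count(1) / n,   # correct -> incorrect
    }


# Hypothetical labels for four instances: accuracy is 0.5 both before and after.
labels_before = [1, 0, 1, 0]
labels_after = [0, 1, 1, 0]
metrics = robustness_metrics(labels_before, labels_after)
```

Here the dataset-level accuracy is identical before and after the perturbation, yet half of the instances flip, which is the situation that the instance-level metrics are designed to expose.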
4. Experiments
4.1. Experimental Setup
We assess the robustness of RAG systems to spurious features
by evaluating them on their most popular application,
the Question Answering (QA) task, following the standard
"retrieve-read" setting of the RAG paradigm. We employ
Contriever-msmarco as the default retriever and conduct our
main experiments using the synthetic dataset generated by
the SURE framework. Due to space limitations, additional
Mistral-7B-Instruct-v0.3
Spurious  Perturbations         | Known-Golden      | Known-Noise       | Unknown-Golden    | Unknown-Noise     | Total
Features                        | LR    RR     WR   | LR    RR     WR   | LR    RR     WR   | LR    RR     WR   | LR    RR     WR
Style     Simple                | 7.33  85.00  7.67 | 4.45  91.64  3.90 | 7.87  82.95  9.18 | 0.70  98.76  0.54 | 3.06  94.09  2.85
Style     Complex               | 6.05  87.42  6.53 | 3.85  92.03  4.12 | 6.90  85.92  7.17 | 0.62  98.82  0.56 | 2.63  94.61  2.76
Source    LLM-Generated         | 5.91  87.62  6.47 | 3.57  92.27  4.16 | 6.41  86.52  7.06 | 0.59  98.75  0.66 | 2.55  94.56  2.90
Source    Self-Generated        | 6.30  87.06  6.64 | 3.94  92.02  4.04 | 6.26  86.80  6.94 | 0.61  98.77  0.62 | 2.74  94.42  2.85
Logic     Reverse               | 5.44  89.34  5.22 | 2.99  94.10  2.92 | 5.97  88.54  5.49 | 0.48  99.04  0.48 | 2.21  95.65  2.14
Logic     Random                | 4.47  91.87  3.66 | 2.43  95.15  2.42 | 4.18  91.44  4.38 | 0.36  99.27  0.37 | 1.76  96.56  1.68
Logic     LLM-Ranked            | 3.52  93.15  3.33 | 2.07  95.84  2.09 | 3.57  92.89  3.54 | 0.34  99.30  0.36 | 1.48  97.04  1.48
Format    JSON                  | 7.96  88.53  3.51 | 5.15  92.68  2.17 | 6.95  88.92  4.13 | 0.65  99.02  0.33 | 3.46  94.98  1.55
Format    HTML                  | 9.30  87.03  3.67 | 5.89  92.36  1.74 | 8.36  87.39  4.25 | 0.74  99.01  0.26 | 4.00  94.62  1.38
Format    YAML                  | 4.75  90.90  4.35 | 3.88  93.24  2.87 | 5.05  90.53  4.42 | 0.51  99.06  0.44 | 2.47  95.55  1.98
Format    Markdown              | 3.98  92.49  3.53 | 2.91  94.36  2.72 | 4.11  92.59  3.31 | 0.44  99.15  0.41 | 1.94  96.29  1.77
Metadata  Timestamp (pre)       | 2.62  94.90  2.48 | 1.28  97.61  1.11 | 3.15  94.45  2.40 | 0.17  99.67  0.17 | 1.00  98.12  0.89
Metadata  Timestamp (post)      | 2.74  94.87  2.40 | 1.16  97.63  1.21 | 3.45  94.41  2.14 | 0.17  99.68  0.15 | 0.98  98.12  0.89
Metadata  Datasource (wiki)     | 3.78  92.31  3.91 | 1.5   96.66  1.84 | 3.69  92.95  3.36 | 0.26  99.48  0.26 | 1.28  97.31  1.41
Metadata  Datasource (twitter)  | 2.68  93.59  3.73 | 1.3   97.22  1.48 | 2.04  94.90  3.06 | 0.20  99.59  0.21 | 0.98  97.80  1.22
Table 2. Robustness evaluation results of Mistral-7B-Instruct-v0.3 on the synthetic dataset. For timestamp perturbation, (pre) and (post)
represent whether the virtual timestamp we created is before or after the LLM’s knowledge cutoff date. For datasource perturbation,
(wiki) and (twitter) indicate the domains of the pseudo link we synthesize. We use Bold to mark the WR values that are higher than the
LR, suggesting that the perturbation is beneficial.
[Figure 3: schematic comparing responses to a raw document and a modified document across Instance 1 ... Instance N. Dataset-level variation: Acc w/ Raw = Acc w/ Modified. Instance-level variation: Robustness Rate = 50%.]
Figure 3. A comparison of the dataset-level metric (Acc) and the instance-
level metric (RR) for robustness evaluation. The check and cross marks indicate the
correctness of responses. In this example, the value of RR reflects
the unrobustness at the instance level, while the dataset-level metric
overlooks the sensitivity of RALMs to spurious features within
documents.
implementation details can be found in Appendix F.
4.2. Result Analysis
In this section, we analyze our main results by category
and offer insights from various perspectives. The results
of Mistral-7B-Instruct and Llama-3.1-8B-Instruct are presented
in Table 2 and Table 6, respectively.
For Different Queries There is a significant difference
in robustness rates between known and unknown queries
when evaluated on noise documents (compare K-N and U-N), whereas no such gap is observed for golden documents
(compare K-G and U-G). This phenomenon arises from the
tug-of-war between an LLM’s internal prior and external
evidence. When the retrieved content consists of noise documents,
LLMs are likely to override it with their own correct prior
knowledge (Known). However, the robustness to explicit
noise may be affected by the spurious features in documents.
In other words, implicit noise (spurious features)
can coexist with explicit noise, highlighting the challenge
of maintaining robustness to spurious features in complex
noise scenarios.
For Different Grounding Data Across all types of spu-
rious features, the robustness rate on noise documents is
consistently higher than that on golden documents, whether
for known or unknown queries. This derives from the limi-
tation of the RR metric we propose, which measures unro-
bustness by capturing changes in answer correctness rather
than the inconsistency of responses. When tested on noise
documents, regardless of which spurious features are in-
jected, LLMs always generate incorrect responses, as noise
documents lack ground truths. In this case, even though
the responses change, the RR does not decrease since all
responses remain incorrect. However, in real-world appli-
cations, our primary concern is whether the RAG system
can consistently generate correct answers when faced with
golden documents containing different spurious features.
Therefore, we primarily focus on the RR results for the
(K, G) and (U, G) subsets.
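The distinction between dataset-level accuracy and the instance-level rates can be illustrated with a minimal sketch. It assumes that LR, RR, and WR are the percentages of instance pairs whose answer flips from correct to incorrect, keeps the same correctness, and flips from incorrect to correct, respectively, which is consistent with the three rates summing to 100 in Table 2. The function and field names are hypothetical and not taken from the released code.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    lose_rate: float        # correct -> incorrect after perturbation (%)
    robustness_rate: float  # correctness unchanged (%)
    win_rate: float         # incorrect -> correct after perturbation (%)


def correctness_rates(raw_correct, perturbed_correct):
    """Compute LR, RR and WR from paired per-instance correctness flags."""
    if not raw_correct or len(raw_correct) != len(perturbed_correct):
        raise ValueError("need two non-empty lists of equal length")
    n = len(raw_correct)
    pairs = list(zip(raw_correct, perturbed_correct))
    lose = sum(1 for r, p in pairs if r and not p)
    win = sum(1 for r, p in pairs if p and not r)
    same = n - lose - win
    return Rates(100 * lose / n, 100 * same / n, 100 * win / n)


# Toy example mirroring Figure 3: accuracy is identical before and after the
# perturbation, yet half of the instances changed correctness.
raw = [True, True, False, False]
modified = [True, False, True, False]
rates = correctness_rates(raw, modified)
acc_raw = sum(raw) / len(raw)
acc_modified = sum(modified) / len(modified)
```

Because RR only tracks correctness, two different but equally incorrect answers count as robust, which is the limitation discussed above for noise documents.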
For Different Metrics If the win rate surpasses the lose
rate, it shows that more instances were corrected rather than
misanswered after introducing perturbations. Based on the
comparison of WR and LR in Table 2 and Table 6, we can
conclude that not every spurious feature is harmful, and
some can even be beneficial. For example, the
WR of source perturbations is consistently higher than the
LR on both models. This suggests that the performance of
RAG can be enhanced by introducing beneficial spurious
features (e.g., simply paraphrasing documents using LLMs).
For Different Perturbations We observe notable differ-
ences in the robustness rates among five types of spuri-
ous features. However, within each category, the robust-
ness rates across different perturbations are relatively close.
When further comparing perturbations within the same cat-
egory, we find that while their RR values are compara-
ble, their WR and LR can differ significantly, indicating
LLMs’ preference for certain sub-perturbations. In summary,
there are variations in robustness across different
spurious features, and preference exists among the
sub-perturbations of each spurious feature.
4.3. SIG Benchmark & Further Analysis
The raw synthetic dataset is not ideal for extensive evaluation
due to its large size. Furthermore, the class imbalance
results in unfair comparisons across different types of
spurious features. To facilitate more efficient evaluation,
we extract the most challenging data from the subsets of the
synthetic dataset to create a lightweight benchmark: SIG
(Spurious features In Golden document). Specifically, we
select instance pairs where both the Mistral and Llama models
exhibit sensitivity and unrobustness in our main experiments.
An equal number of samples (100) is chosen from
the (K, G) and (U, G) subsets of each perturbation.
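The selection step can be sketched as follows. This is a hedged illustration only: the text does not say how the 100 samples per subset are drawn, so seeded random sampling is an assumption here, and `select_sig_pairs`, its parameters, and the model keys are hypothetical names.

```python
import random


def select_sig_pairs(pair_ids, unrobust_by_model, k=100, seed=0):
    """Pick up to k instance pairs on which every model was unrobust.

    pair_ids: ids of the instance pairs in one (subset, perturbation) cell.
    unrobust_by_model: maps a model name to the ids whose answer
        correctness changed under the perturbation for that model.
    """
    # Keep only pairs on which all models changed correctness.
    shared = set.intersection(*(set(ids) for ids in unrobust_by_model.values()))
    candidates = sorted(pid for pid in pair_ids if pid in shared)
    if len(candidates) <= k:
        return candidates
    # Assumption: uniform random sampling with a fixed seed.
    return sorted(random.Random(seed).sample(candidates, k))


ids = list(range(300))
unrobust = {"mistral": set(range(0, 250)), "llama": set(range(50, 300))}
picked = select_sig_pairs(ids, unrobust, k=100, seed=0)
```

Running the same call once for the (K, G) cell and once for the (U, G) cell of each perturbation would yield the balanced layout described above.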
Robustness Comparison of SOTA LLMs We evaluate
a broader range of models on the SIG benchmark, including
GPT-4o, GPT-4o-mini, Mistral-Large-Instruct (footnote 5),
Llama-3.3-70B-Instruct, Deepseek-v3 (671B, MoE), and
Qwen2.5-72B-Instruct. To better compare the robustness of different
models, we average the RR of each perturbation within a
category to derive the overall robustness for a specific type
of spurious feature. The performance of six SOTA LLMs
is then visualized using a radar chart, as shown in Figure
4. Notably, GPT-4o-mini achieved the best performance,
even surpassing GPT-4o by a large margin on the format
and meta types. Despite the impressive robustness of
closed-source models, they may still exhibit sensitivity to certain
specific perturbations. For instance, GPT-4o achieved only
an 89% robustness rate on the datasource (twitter) perturbation.
5 https://huggingface.co/mistralai/Mistral-Large-Instruct-2411
Figure 4. Comparison of robustness rates across six SOTA LLMs.
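The per-category averaging used for the radar chart can be sketched in a few lines. The RR values below are the Known-Golden figures for Mistral-7B-Instruct-v0.3 from Table 2, used purely as sample input (they are not SIG results), and the function name is hypothetical.

```python
from statistics import mean


def category_robustness(rr_by_perturbation, category_of):
    """Average the RR of all perturbations that belong to the same category."""
    grouped = {}
    for perturbation, rr in rr_by_perturbation.items():
        grouped.setdefault(category_of[perturbation], []).append(rr)
    return {category: mean(values) for category, values in grouped.items()}


# Sample input: Known-Golden RR values taken from Table 2.
rr = {"Simple": 85.00, "Complex": 87.42,
      "LLM-Generated": 87.62, "Self-Generated": 87.06}
category_of = {"Simple": "Style", "Complex": "Style",
               "LLM-Generated": "Source", "Self-Generated": "Source"}
averaged = category_robustness(rr, category_of)
```

Each category then contributes one axis of the radar chart, regardless of how many sub-perturbations it contains.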
Scaling Analysis for Different Model Sizes To investi-
gate the impact of parameter scale on RAG robustness, we
gradually increase the size of LLM-based readers (Qwen2.5
series, ranging from 0.5B to 72B) and evaluate their robust-
ness across five types of spurious features. As illustrated in
Figure 5, the robustness rate for all spurious features shows
a generally upward trend as the model size increases. However,
when we further scale the model from 32B to 72B, the
RR undergoes a significant decline (except for format and
meta). This indicates that robustness issues related to spuri-
ous features cannot be resolved simply by increasing model
size. Interestingly, for meta perturbations, while RALMs
demonstrate strong robustness across all scales (even for the
0.5B model), their performance receives little to no benefit
from scaling up.
Figure 5. Scaling analysis of robustness to spurious features.
5. Conclusion
In this work, we formally highlight the spurious features
problem in RAG systems. Through preliminary experiments,
we provide statistical evidence to support the presence of
spurious features in RALMs. We also propose a novel
evaluation framework, SURE, to assess the robustness of
LLMs against spurious features. This framework includes
a comprehensive taxonomy of spurious features, carefully
designed metrics for robustness evaluation, and a data synthesis
pipeline. Extensive evaluation and further analysis
offer valuable insights for future research.
Impact Statement
We introduce a new type of noise, spurious features, that can
undermine the robustness and trustworthiness of contempo-
rary RAG systems. With the increasing diversity of golden
documents on the internet, this noise may become more
prevalent. Furthermore, traditional methods for improving
the robustness of RAG systems are ineffective for handling
spurious features, as they are implicit and can coexist with
golden documents. Therefore, harmful spurious features
could undermine user trust and contaminate Internet data.
Moving forward, we aim to explore methods for mitigat-
ing hallucinations and robustness issues caused by spurious
features within grounding data.
References
Bajaj, P., Campos, D., Craswell, N., Deng, L., Gao, J., Liu,
X., Majumder, R., McNamara, A., Mitra, B., Nguyen, T.,
et al. Ms marco: A human generated machine reading
comprehension dataset. arXiv preprint arXiv:1611.09268,
2016.
Bian, N., Lin, H., Liu, P., Lu, Y., Zhang, C., He, B., Han, X.,
and Sun, L. Influence of external information on large
language models mirrors social cognitive patterns. IEEE
Transactions on Computational Social Systems, 2024.
Chen, J., Lin, H., Han, X., and Sun, L. Benchmarking large
language models in retrieval-augmented generation. In
Proceedings of the AAAI Conference on Artificial Intelligence,
volume 38, pp. 17754–17762, 2024a.
Chen, X., He, B., Lin, H., Han, X., Wang, T., Cao, B., Sun,
L., and Sun, Y. Spiral of silences: How is large language
model killing information retrieval?–a case study
on open domain question answering. arXiv preprint
arXiv:2404.10496, 2024b.
Cuconasu, F., Trappolini, G., Siciliano, F., Filice, S., Campagnano,
C., Maarek, Y., Tonellotto, N., and Silvestri, F.
The power of noise: Redefining retrieval for rag systems.
In Proceedings of the 47th International ACM SIGIR
Conference on Research and Development in Information
Retrieval, pp. 719–729, 2024.
Dai, S., Zhou, Y., Pang, L., Liu, W., Hu, X., Liu, Y., Zhang,
X., Wang, G., and Xu, J. Neural retrievers are biased
towards llm-generated content. In Proceedings of the
30th ACM SIGKDD Conference on Knowledge Discovery
and Data Mining, pp. 526–537, 2024.
Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G.,
Mazaré, P.-E., Lomeli, M., Hosseini, L., and Jégou, H.
The faiss library. arXiv preprint arXiv:2401.08281, 2024.
Farquhar, S., Kossen, J., Kuhn, L., and Gal, Y. Detecting
hallucinations in large language models using semantic
entropy. Nature, 630(8017):625–630, 2024.
Feldman, P., Foulds, J. R., and Pan, S. Ragged edges:
The double-edged sword of retrieval-augmented chatbots.
arXiv preprint arXiv:2403.01193, 2024.
Gan, C., Yang, D., Hu, B., Zhang, H., Li, S., Liu, Z., Shen,
Y., Ju, L., Zhang, Z., Gu, J., et al. Similarity is not all
you need: Endowing retrieval augmented generation with
multi layered thoughts. arXiv preprint arXiv:2405.19893,
2024.
Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y.,
Sun, J., and Wang, H. Retrieval-augmented generation
for large language models: A survey. arXiv preprint
arXiv:2312.10997, 2023.
He, J., Rungta, M., Koleczek, D., Sekhon, A., Wang, F. X.,
and Hasan, S. Does prompt formatting have any impact
on llm performance? arXiv preprint arXiv:2411.10541,
2024.
Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski,
P., Joulin, A., and Grave, E. Unsupervised dense information
retrieval with contrastive learning. arXiv preprint
arXiv:2112.09118, 2021.
Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni,
F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., and
Grave, E. Atlas: Few-shot learning with retrieval augmented
language models. Journal of Machine Learning
Research, 24(251):1–43, 2023.
Kang, H., Ni, J., and Yao, H. Ever: Mitigating hallucination
in large language models through real-time verification
and rectification. arXiv preprint arXiv:2311.09114, 2023.
Kong, A., Zhao, S., Chen, H., Li, Q., Qin, Y., Sun, R.,
Zhou, X., Wang, E., and Dong, X. Better zero-shot
reasoning with role-play prompting. In Proceedings of
the 2024 Conference of the North American Chapter of
the Association for Computational Linguistics: Human
Language Technologies (Volume 1: Long Papers), pp.
4099–4113, 2024.
Kuhn, L., Gal, Y., and Farquhar, S. Semantic uncertainty:
Linguistic invariances for uncertainty estimation
in natural language generation. arXiv preprint
arXiv:2302.09664, 2023.
Li, C., Wang, J., Zhu, K., Zhang, Y., Hou, W., Lian, J., and
Xie, X. Emotionprompt: Leveraging psychology for large
language models enhancement via emotional stimulus.
arXiv e-prints, pp. arXiv–2307, 2023a.
Li, D., Zhang, H., Li, Y., and Yang, S. Multi-level contrastive
learning for script-based character understanding.
arXiv preprint arXiv:2310.13231, 2023b.
Li, D., Jiang, B., Huang, L., Beigi, A., Zhao, C., Tan, Z.,
Bhattacharjee, A., Jiang, Y., Chen, C., Wu, T., et al. From
generation to judgment: Opportunities and challenges of
llm-as-a-judge. arXiv preprint arXiv:2411.16594, 2024a.
Li, D., Tan, Z., Chen, T., and Liu, H. Contextualization distillation
from large language model for knowledge graph
completion. arXiv preprint arXiv:2402.01729, 2024b.
Li, D., Yang, S., Tan, Z., Baik, J. Y., Yun, S., Lee, J.,
Chacko, A., Hou, B., Duong-Tran, D., Ding, Y., et al.
Dalk: Dynamic co-augmentation of llms and kg to answer
alzheimer’s disease questions with scientific literature.
arXiv preprint arXiv:2405.04819, 2024c.
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua,
M., Petroni, F., and Liang, P. Lost in the middle: How
language models use long contexts. Transactions of the
Association for Computational Linguistics, 12:157–173,
2024.
Liu, Y., Huang, L., Li, S., Chen, S., Zhou, H., Meng, F.,
Zhou, J., and Sun, X. Recall: A benchmark for llms
robustness against external counterfactual knowledge.
arXiv preprint arXiv:2311.08147, 2023.
Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M.,
Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of
demonstrations: What makes in-context learning work?
In Proceedings of the 2022 Conference on Empirical
Methods in Natural Language Processing, pp. 11048–11064,
2022.
Neuhaus, Y., Augustin, M., Boreiko, V., and Hein, M. Spurious
features everywhere-large-scale detection of harmful
spurious features in imagenet. In Proceedings of the
IEEE/CVF International Conference on Computer Vision,
pp. 20235–20246, 2023.
Pezeshkpour, P. and Hruschka, E. Large language models
sensitivity to the order of options in multiple-choice questions.
In Findings of the Association for Computational
Linguistics: NAACL 2024, pp. 2006–2017, 2024.
Sclar, M., Choi, Y., Tsvetkov, Y., and Suhr, A. Quantifying
language models’ sensitivity to spurious features
in prompt design or: How i learned to start worrying
about prompt formatting. In The Twelfth International
Conference on Learning Representations, 2024.
Shuster, K., Poff, S., Chen, M., Kiela, D., and Weston, J.
Retrieval augmentation reduces hallucination in conversation.
In Findings of the Association for Computational
Linguistics: EMNLP 2021, pp. 3784–3803, 2021.
Tan, J., Dou, Z., Wang, W., Wang, M., Chen, W., and Wen,
J.-R. Htmlrag: Html is better than plain text for modeling
retrieved knowledge in rag systems. arXiv preprint
arXiv:2411.02959, 2024a.
Tan, Z., Li, D., Wang, S., Beigi, A., Jiang, B., Bhattacharjee,
A., Karami, M., Li, J., Cheng, L., and Liu, H. Large
language models for data annotation and synthesis: A survey.
In Proceedings of the 2024 Conference on Empirical
Methods in Natural Language Processing, pp. 930–957,
2024b.
Tong, Y., Wang, S., Li, D., Wang, Y., Han, S., Lin, Z.,
Huang, C., Huang, J., and Shang, J. Optimizing language
model’s reasoning abilities with weak supervision. arXiv
preprint arXiv:2405.04086, 2024.
Wang, S., Tong, Y., Zhang, H., Li, D., Zhang, X., and Chen,
T. Bpo: Towards balanced preference optimization between
knowledge breadth and depth in alignment. arXiv
preprint arXiv:2411.10914, 2024a.
Wang, Y., Hernandez, A. G., Kyslyi, R., and Kersting, N.
Evaluating quality of answers for retrieval-augmented
generation: A strong llm is all you need. arXiv preprint
arXiv:2406.18064, 2024b.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi,
E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting
elicits reasoning in large language models. Advances in
neural information processing systems, 35:24824–24837,
2022.
Wu, J., Che, F., Zhang, C., Tao, J., Zhang, S., and Shao,
P. Pandora’s box or aladdin’s lamp: A comprehensive
analysis revealing the role of rag noise in large language
models. arXiv preprint arXiv:2408.13533, 2024a.
Wu, K., Wu, E., and Zou, J. Clasheval: Quantifying the
tug-of-war between an llm’s internal prior and external
evidence. In The Thirty-eight Conference on Neural Information
Processing Systems Datasets and Benchmarks
Track, 2024b.
Xu, P., Ping, W., Wu, X., McAfee, L., Zhu, C., Liu, Z., Subramanian,
S., Bakhturina, E., Shoeybi, M., and Catanzaro,
B. Retrieval meets long context large language models.
In The Twelfth International Conference on Learning
Representations, 2024.
Yang, S., Sun, R., and Wan, X. A new dataset and empirical
study for sentence simplification in chinese. In Proceedings
of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pp.
8306–8321, 2023a.
Yang, S., Sun, R., and Wan, X. A new benchmark and
reverse validation method for passage-level hallucination
detection. In Findings of the Association for Computational
Linguistics: EMNLP 2023, pp. 3898–3908, 2023b.
Yu, K., Guo, X., Liu, L., Li, J., Wang, H., Ling, Z., and
Wu, X. Causality-based feature selection: Methods and
evaluations. ACM Computing Surveys (CSUR), 53(5):
1–36, 2020.
Zhang, H., Shang, C., Wang, S., Zhang, D., Yao, F., Sun,
R., Yu, Y., Yang, Y., and Wei, F. Shifcon: Enhancing
non-dominant language capabilities with a shift-based
contrastive framework. arXiv preprint arXiv:2410.19453,
2024a.
Zhang, H., Wu, Y., Li, D., Yang, Z., Zhao, R., Jiang, Y., and
Tan, F. Balancing speciality and versatility: a coarse to
fine framework for supervised fine-tuning large language
model. arXiv preprint arXiv:2404.10306, 2024b.
Zhou, Y., Liu, Y., Li, X., Jin, J., Qian, H., Liu, Z., Li, C.,
Dou, Z., Ho, T.-Y., and Yu, P. S. Trustworthiness in
retrieval-augmented generation systems: A survey. arXiv
preprint arXiv:2409.10102, 2024.
Zhu, K., Wang, J., Zhou, J., Wang, Z., Chen, H., Wang, Y.,
Yang, L., Ye, W., Zhang, Y., Zhenqiang Gong, N., et al.
Promptbench: Towards evaluating the robustness of large
language models on adversarial prompts. arXiv e-prints,
pp. arXiv–2306, 2023.
Zhuo, J., Zhang, S., Fang, X., Duan, H., Lin, D., and Chen,
K. Prosa: Assessing and understanding the prompt sensitivity
of llms. In Findings of the Association for Computational
Linguistics: EMNLP 2024, pp. 1950–1976,
2024.
A. Related Work
A.1. Robustness Evaluation of Retrieval-Augmented Generation
RAG systems comprise two core components: a retriever and an LLM-based reader. Augmenting LLMs with retrieved
external knowledge has been proven to effectively reduce hallucinations (Shuster et al., 2021; Kang et al., 2023). However,
the retrieved contexts inevitably contain noise in addition to desirable knowledge, which may mislead LLMs to produce an
incorrect response (Bian et al., 2024; Feldman et al., 2024). Previous works have explored automated evaluation frameworks
to assess the robustness of RAG systems in various settings. For instance, Chen et al. (2024a) benchmarked four fundamental
capabilities required for RAG, including noise robustness, negative rejection, information integration and counterfactual
robustness. Some studies have provided a detailed taxonomy of noise documents to further simulate the complexity of
real-world scenarios and highlighted the potential positive effects of certain types of noise (Cuconasu et al., 2024; Wu
et al., 2024a). There are also some recent works that propose using LLM-as-a-judge (Li et al., 2024a) to evaluate the RAG
system (Wang et al., 2024b).
While these studies have identified several explicit noises that affect the robustness of RAG systems, they overlook implicit
noises. This type of noise, such as phrasing and formatting, is everywhere and unavoidable, as it coexists with the grounding
data without altering its semantic information. In this work, we define these semantic-agnostic noises as spurious features
and evaluate the robustness of RALMs to such noises.
A.2. Prompt Sensitivity of LLMs
Large Language Models take prompts as inputs and then generate responses accordingly. Prompts are instructions provided
to an LLM to perform specific tasks automatically and ensure desired qualities in the generated output. However, it is known
that current LLMs are sensitive to the features of input prompts (Zhu et al., 2023). This sensitivity poses challenges for
researchers attempting to evaluate the model’s performance accurately and precisely (Zhuo et al., 2024).
Some existing works have investigated the impact of different prompt techniques on model performance, including chain-of-
thought (Wei et al., 2022), in-context learning (Min et al., 2022), and role-play prompting (Kong et al., 2024). Beyond these
causal features that significantly influence the meaning of prompts, other works have demonstrated that LLMs are highly
sensitive to spurious features (Sclar et al., 2024), e.g., prompt formatting (He et al., 2024), language style (Li et al., 2023a),
and the order of options (Pezeshkpour & Hruschka, 2024).
Currently, there is no statistical or empirical evidence to support the existence of spurious features in RALMs. To address
this gap, we extend the definition of spurious features to RAG systems through statistical testing and empirical analysis.
B. Preliminary Experiment Results
We introduce several semantic-agnostic features we measured in our preliminary experiments. These features include:
• Flesch Score: A readability metric designed to evaluate text difficulty. It is calculated based on the average number of
syllables per word and the average number of words per sentence. The Flesch score is a number on a scale from 0 to
100, where a higher score indicates that the text is easier to read.
• Distinct-1: A metric used to assess the diversity of generated text. It calculates the proportion of unique words (distinct
words) to the total number of words in the output. A higher Distinct-1 score indicates that the text contains a greater
variety of unique words, implying more diversity in the generated content.
• Dependency Tree Depth (DTD): A syntactic complexity metric calculated by analyzing a sentence’s dependency tree.
Dependency Tree Depth refers to the maximum depth of a sentence’s dependency parse tree. A deeper tree suggests more
complex sentence structures, while a shallower tree indicates simpler syntactic constructions.
• Perplexity (PPL): A metric used for evaluating language models, measuring how well a probabilistic model predicts a
given text. It reflects the uncertainty of a language model when generating sequences of words. Lower PPL values
indicate better predictive performance, meaning the model assigns higher probabilities to the actual tokens in the
sequence.
• Token Length: We compute the total number of tokens in a text as an alternative measure of text length, given that
the documents in our corpus have been pre-segmented into fixed 100-word chunks. The value is model-specific and
depends on the model’s vocabulary.
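Two of the features above, Distinct-1 and perplexity, reduce to short formulas, sketched below. The sketch assumes whitespace-split words for Distinct-1 and a list of per-token natural-log probabilities for perplexity; how the paper tokenizes text and which model supplies the probabilities are not specified here, so both inputs are illustrative.

```python
import math


def distinct_1(tokens):
    """Ratio of unique tokens to total tokens (0.0 for an empty input)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def perplexity(token_log_probs):
    """Exponential of the mean negative log-probability of the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


diversity = distinct_1("the cat saw the dog".split())
# A model that assigns probability 0.5 to every token has perplexity 2.
ppl = perplexity([math.log(0.5)] * 4)
```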
Kolmogorov-Smirnov (K-S) Test The K-S test is a non-parametric statistical test used to compare the distributions of two
datasets. It evaluates whether two samples come from the same underlying probability distribution. The null hypothesis of the
K-S test is that the two samples are drawn from the same distribution, while the alternative hypothesis is that the two samples
are drawn from different distributions. The K-S test provides two key values: the K-S statistic quantifies the largest
difference between the empirical cumulative distribution functions of the two samples, and the p-value assesses the statistical
significance of that difference. If the p-value is lower than a chosen significance level (0.05 in our experiments), we reject
the null hypothesis, concluding that the two distributions are significantly different. Otherwise, we fail to reject the null
hypothesis, suggesting that there is no significant difference between the two distributions. We provide the K-S test results
in Table 3 and Table 4, with the corresponding distribution visualization plots shown in Figure 6 and Figure 7.
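As a rough illustration of the statistic described above, the two-sample K-S statistic can be computed with the standard library alone. This sketch covers only the statistic; in practice the p-value would usually come from a statistics package (for example, SciPy provides `scipy.stats.ks_2samp`), and which tool the experiments used is not stated here.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    n, m = len(a_sorted), len(b_sorted)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every point of the pooled sample.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / n
        cdf_b = bisect_right(b_sorted, x) / m
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


identical = ks_statistic([1, 2, 3], [1, 2, 3])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

A statistic of 0 means the two empirical distributions coincide and 1 means they do not overlap at all; the values in Table 3 and Table 4 lie close to 0 even when the difference is significant, because the samples are large.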
                Experimental Group                 Control Group
                K-S statistic   P-value            K-S statistic   P-value
Flesch score    0.0677          1.01×10^-5 ***     0.0301          0.1799
Distinct-1      0.0756          4.95×10^-7 ***     0.0203          0.6431
DTD             0.0636          4.29×10^-5 ***     0.0124          0.9866
PPL             0.0722          1.88×10^-6 ***     0.0162          0.8776
Token Length    0.1708          2.91×10^-34 ***    0.0256          0.3493
Table 3. K-S test results for Mistral-7B-Instruct-v0.3 as the oracle retriever.
                Experimental Group                 Control Group
                K-S statistic   P-value            K-S statistic   P-value
Flesch score    0.0305          0.1694             0.0173          0.8210
Distinct-1      0.0798          8.94×10^-8 ***     0.0327          0.1159
DTD             0.0474          0.0051 **          0.0203          0.6431
PPL             0.0538          0.0009 ***         0.0181          0.7791
Token Length    0.1275          2.99×10^-19 ***    0.0188          0.7349
Table 4. K-S test results for Llama-3.1-8B-Instruct as the oracle retriever.
C. Implementation Details for Injecting Spurious Features
We provide detailed prompts for LLM-based perturbations in Figure 8. For rule-based perturbations, the placeholder templates are
presented in Figure 9.
D. Implementation Details for Preserving Causal Features
We employ a bidirectional entailment algorithm to ensure semantic equivalence before and after introducing spurious
features. The prompts for its core component, the NLI model, are shown in Figure 10.
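The bidirectional check can be sketched as follows, assuming an NLI callable that returns one of the three labels used in Figure 10 (entailment, contradiction, neutral). The function names are hypothetical, and `toy_nli` is only a word-overlap stand-in so the example runs without a model; it is not the LLM-based NLI system.

```python
from typing import Callable

# An NLI function takes (premise, hypothesis) and returns a label string.
NliFn = Callable[[str, str], str]


def semantically_equivalent(raw_text: str, perturbed_text: str, nli: NliFn) -> bool:
    """Accept a perturbation only if entailment holds in both directions."""
    forward = nli(raw_text, perturbed_text)
    backward = nli(perturbed_text, raw_text)
    return forward == "entailment" and backward == "entailment"


def toy_nli(premise: str, hypothesis: str) -> str:
    """Stand-in classifier: the hypothesis is entailed if its words are a
    subset of the premise's words. Illustration only."""
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    return "entailment" if hypothesis_words <= premise_words else "neutral"


kept = semantically_equivalent("paris is the capital", "the capital is paris", toy_nli)
dropped = semantically_equivalent("paris is the capital of france", "paris is the capital", toy_nli)
```

Requiring both directions rejects perturbations that drop information (the second case) as well as those that add it.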
E. Statistics of the Synthetic Dataset
We present the dataset statistics for evaluating Llama-3.1-8B-Instruct in Table 5.
(a) Feature distributions of the experimental group
(b) Feature distribution of the control group
Figure 6. Visualization of feature distributions for Mistral-7B-Instruct-v0.3
(a) Feature distributions of the experimental group
(b) Feature distribution of the control group
Figure 7. Visualization of feature distributions for Llama-3.1-8B-Instruct
Style Perturbations
[Simple]
Please simplify the following text while preserving its original meaning. Use shorter sentences, basic
vocabulary, and clear language. Avoid complex structures, technical terms, or ambiguous expressions.
Here is the passage to simplify: {Document}
[Complex]
Please complexify the following text while preserving its original meaning. Use longer sentences, intricate
sentence structures, and advanced vocabulary. Avoid contractions, informal language, and colloquial
expressions, ensuring the text maintains a professional and authoritative tone throughout.
Here is the passage to complexify: {Document}
Source Perturbations
Please rewrite the following passage. Ensure that the overall meaning, tone, and important details remain
intact. Avoid any significant shifts in style or focus. The aim is to create a fresh version while faithfully
conveying the original content.
Here is the passage to paraphrase: {Document}
Logic Perturbations
[LLM-Ranked]
Rearrange the following list of sentences in your preferred logical order and provide only the indices of the
sentences. Please do not include any explanations.
Example: {Example}
Sentences List: {Sentences List}
The length of the Sentences List is {Length of Sentences List}. Therefore, the indices must contain {Length
of Sentences List} elements, and the index values cannot exceed {Length of Sentences List - 1}.
[Reverse] [Python Code]
[Random] [Python Code]
Figure 8. Prompt templates for LLM-based perturbations.
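The figure marks the Reverse and Random perturbations only as Python code without listing it. A minimal sketch of what such sentence-level reordering could look like is given below, together with a validity check for the index list returned by the LLM-Ranked prompt. All function names are hypothetical, the input is assumed to be a document already split into sentences, and this is not the released implementation.

```python
import random


def reverse_order(sentences):
    """Reverse the sentence order of a document."""
    return list(reversed(sentences))


def random_order(sentences, seed=0):
    """Shuffle the sentence order with a fixed seed for reproducibility."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def apply_ranking(sentences, indices):
    """Reorder sentences by LLM-provided indices after validating them.

    The prompt requires exactly len(sentences) indices, none above
    len(sentences) - 1; here they must also form a permutation.
    """
    if sorted(indices) != list(range(len(sentences))):
        raise ValueError("indices must be a permutation of 0..len(sentences)-1")
    return [sentences[i] for i in indices]


doc = ["A was born in 1900.", "A moved to B.", "A died in 1980."]
reversed_doc = reverse_order(doc)
shuffled_doc = random_order(doc, seed=1)
ranked_doc = apply_ranking(doc, [2, 0, 1])
```

All three operations keep the set of sentences unchanged, so only the ordering differs between the raw and perturbed documents.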
Format Perturbations
[JSON]
{
"title": "{Title}",
"text": "{Document}"
}
[HTML]
<html lang="en">
<head>
<meta charset="UTF-8">
{Title}
</head>
<body> {Document} </body>
</html>
[YAML]
Title: {Title}
Text: {Document}
[Markdown]
# {Title}
{Document}
Metadata Perturbations
[Timestamp]
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name=’timestamp’ content=’{timestamp}’>
{Title}
</head>
<body> {Document} </body>
</html>
[Datasource]
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name=’datasource’ content=’{datasource}’>
{Title}
</head>
<body> {Document} </body>
</html>
Figure 9. Placeholder templates for rule-based perturbations.
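Filling these templates is plain string substitution. The sketch below covers the JSON, YAML, and Markdown templates; `wrap_document` is a hypothetical helper, and serializing the JSON variant with `json.dumps` (which also escapes quotes inside the document) is a design choice made here rather than something stated in the figure.

```python
import json


def wrap_document(title, document, fmt):
    """Render a title and document in one of the Figure 9 formats."""
    if fmt == "json":
        # json.dumps escapes quotes and newlines so the output stays valid JSON.
        return json.dumps({"title": title, "text": document}, indent=0)
    if fmt == "yaml":
        return f"Title: {title}\nText: {document}"
    if fmt == "markdown":
        return f"# {title}\n{document}"
    raise ValueError(f"unsupported format: {fmt}")


as_json = wrap_document("Ada Lovelace", "She wrote the first program.", "json")
as_yaml = wrap_document("Ada Lovelace", "She wrote the first program.", "yaml")
as_markdown = wrap_document("Ada Lovelace", "She wrote the first program.", "markdown")
```

The HTML and metadata templates follow the same pattern, with the timestamp or datasource value substituted into the corresponding meta tag.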
Consider the two passages below.
Premise: {raw text}
Hypothesis: {perturbated text}
Does the premise semantically entail the hypothesis? Answer with ’entailment’ if they are paraphrases, ’contradiction’
if they have opposing meanings, or ’neutral’ if they are neither.
Response:
Figure 10. Prompts for LLM-based NLI system.
          K-G      K-N      U-G     U-N      Total
Style     7321     28975    3038    39869    79203
Source    8768     30145    3709    41391    84013
Logic     9229     33294    4082    44233    90838
Format    10481    35616    4697    47920    98714
Meta      10563    35451    4796    47987    98797
Table 5. Distribution of the synthetic dataset for Llama-3.1-8B-Instruct.
F. Experimental Setup Details
Dataset We conduct our main experiments using the synthetic dataset generated by the SURE framework. The queries are
sourced from the NQ-open dataset (footnote 6), and the documents are derived from the English Wikipedia dump.
Models We test two representative LLMs in our main experiments: Mistral-7B-Instruct-v0.3 and Llama-3.1-8B-Instruct.
Prompts The instruction I in the RAG prompt P = (I, G, Q), shown in Figure 11, is derived from Cuconasu et al. (2024),
with slight modifications to better adapt to our setting.
Implementation Details We follow the typical “retrieve-read” setting of the RAG paradigm. For the retrieval module, we
use Contriever-msmarco (footnote 7), a BERT-based dense retriever, as the default retriever. It is finetuned on the MS MARCO
dataset (Bajaj et al., 2016) after unsupervised pretraining via contrastive learning (Izacard et al., 2021). To optimize the
efficiency of vector similarity searches, we employ the Faiss library (Douze et al., 2024). For the read module, we deploy
LLMs on NVIDIA A100 GPUs and accelerate inference with vllm (footnote 8). Following Li et al. (2023b) and Zhang et al. (2024a;b), we
use a greedy decoding strategy and set the temperature to 0.1 to ensure stable outputs and strong reproducibility.
You are given a question and you MUST respond by EXTRACTING the answer (max 5 tokens) from the provided
document. If the document does not contain the answer, respond with NO-RES.
Figure 11. Instruction I used for the QA task.
6 https://huggingface.co/datasets/google-research-datasets/nq_open
7 https://huggingface.co/facebook/contriever-msmarco
8 https://github.com/vllm-project/vllm
Llama-3.1-8B-Instruct
Spurious Features  Perturbations         | Known-Golden      | Known-Noise       | Unknown-Golden    | Unknown-Noise     | Total
                                         |    LR    RR    WR |    LR    RR    WR |    LR    RR    WR |    LR    RR    WR |    LR    RR    WR
Style              Simple                |  7.79 83.04  9.18 |  1.70 95.80  2.50 |  8.43 82.88  8.69 |  0.27 99.45  0.28 |  1.80 95.96  2.24
                   Complex               |  6.00 85.60  8.40 |  1.91 96.59  1.50 |  6.71 84.86  8.43 |  0.23 99.57  0.20 |  1.63 96.62  1.75
Source             LLM-Generated         |  5.89 86.43  7.69 |  1.43 96.83  1.74 |  6.20 85.71  8.09 |  0.22 99.56  0.22 |  1.51 96.60  1.89
                   Self-Generated        |  6.55 85.01  8.44 |  1.55 96.37  2.09 |  6.52 86.36  7.12 |  0.19 99.57  0.24 |  1.62 96.32  2.06
Logic              Reverse               |  5.06 90.82  4.12 |  1.13 97.82  1.06 |  5.73 89.71  4.56 |  0.21 99.67  0.13 |  1.29 97.64  1.07
                   Random                |  3.91 93.16  2.93 |  0.86 98.31  0.83 |  4.21 91.67  4.12 |  0.14 99.72  0.14 |  0.97 98.18  0.85
                   LLM-Ranked            |  3.24 93.93  2.83 |  0.82 98.43  0.74 |  3.58 93.36  3.06 |  0.13 99.76  0.11 |  0.85 98.39  0.75
Format             JSON                  |  7.01 88.25  4.74 |  1.70 97.25  1.05 |  5.92 89.63  4.45 |  0.25 99.61  0.14 |  1.76 97.08  1.16
                   HTML                  | 11.85 84.46  3.69 |  2.70 96.90  0.40 |  9.33 86.78  3.90 |  0.35 99.61  0.04 |  2.85 96.41  0.74
                   YAML                  |  5.26 89.94  4.80 |  1.26 97.41  1.33 |  4.79 90.80  4.41 |  0.17 99.67  0.16 |  1.32 97.40  1.28
                   Markdown              |  2.32 92.23  5.45 |  0.60 96.89  2.51 |  2.34 93.46  4.19 |  0.07 99.61  0.32 |  0.61 97.55  1.84
Metadata           Timestamp (pre)       |  2.08 95.81  2.11 |  0.28 99.42  0.29 |  2.54 95.56  1.90 |  0.02 99.95  0.03 |  0.46 99.10  0.44
                   Timestamp (post)      |  2.04 95.86  2.10 |  0.25 99.43  0.32 |  2.81 95.56  1.63 |  0.02 99.95  0.03 |  0.46 99.11  0.43
                   Datasource (wiki)     |  2.11 93.45  4.44 |  0.23 98.96  0.81 |  3.25 92.47  4.27 |  0.03 99.86  0.11 |  0.48 98.50  1.03
                   Datasource (twitter)  |  2.27 94.11  3.62 |  0.31 99.25  0.43 |  2.77 93.97  3.25 |  0.02 99.91  0.07 |  0.50 98.77  0.73
Table 6. Robustness evaluation results of Llama-3.1-8B-Instruct on the synthetic dataset. We use Bold to mark the WR values that are
higher than the LR, suggesting that the perturbation is beneficial.