Authors: Viktor Moskvoretskii, Maria Lysyuk, Mikhail Salnikov, Nikolay Ivanov, Sergey Pletenev, Daria Galimzianova, Nikita Krayko, Vasily Konovalov, Irina Nikishina, Alexander Panchenko
Adaptive Retrieval without Self-Knowledge?
Bringing Uncertainty Back Home
Viktor Moskvoretskii1,3, Maria Lysyuk2,1, Mikhail Salnikov2,1, Nikolay Ivanov1,
Sergey Pletenev2,1, Daria Galimzianova4, Nikita Krayko4, Vasily Konovalov2,5,
Irina Nikishina6, Alexander Panchenko2,1
1Skoltech, 2AIRI, 3HSE University, 4MTS AI, 5MIPT, 6University of Hamburg
Correspondence: vvmoskvoretskii@gmail.com
Abstract
Retrieval-Augmented Generation (RAG) improves the correctness of Question Answering (QA) and addresses hallucinations in Large Language Models (LLMs), yet it greatly increases computational costs. Moreover, RAG is not always needed, as it may introduce irrelevant information. Recent adaptive retrieval methods integrate LLMs' intrinsic knowledge with external information by appealing to LLM self-knowledge, but they often neglect efficiency evaluations and comparisons with uncertainty estimation techniques. We bridge this gap by conducting a comprehensive analysis of 35 adaptive retrieval methods, including 8 recent approaches and 27 uncertainty estimation techniques, across 6 datasets using 10 metrics for QA performance, self-knowledge, and efficiency. Our findings show that uncertainty estimation techniques often outperform complex pipelines in terms of efficiency and self-knowledge, while maintaining comparable QA performance.
1 Introduction
Large Language Models (LLMs) have gained popularity due to their remarkable performance across diverse tasks, such as question answering (QA) (Yang et al., 2018; Kwiatkowski et al., 2019). At the same time, hallucinations remain a substantial challenge for LLMs. Relying solely on parametric knowledge to generate trustworthy content is limited by the knowledge boundaries of LLMs (Yin et al., 2024), which may lead to internal hallucinations (Ding et al., 2024). While external information via RAG (Lewis et al., 2020b) can help fill these gaps, it raises the possibility of irrelevance, leading to error accumulation (Shi et al., 2023) and increasing the likelihood of external hallucinations (Ding et al., 2024).
To balance the intrinsic knowledge of LLMs and external information, adaptive retrieval methods have emerged (Su et al., 2024b; Ding et al., 2024; Jeong et al., 2024). These methods rely on LLM self-knowledge, that is, a model's capacity to recognize its own knowledge (Yin et al., 2023), and determine when it lacks critical information.

[Figure 1 plot. Axes: Self-Knowledge, Retrieval Efficiency, LM Efficiency, QA Performance. Methods: IRCoT, AdaptiveRAG, DRAGIN, EigValLaplacian, FLARE, RowenHybrid, Seakr, Best UE.]
Figure 1: Performance comparison of the state-of-the-art models across efficiency metrics (number of LLM calls, retrieval calls), a QA quality metric (In-Accuracy), and the ability to identify self-knowledge (ROC-AUC). The plot shows the reversed ranks of the methods across 6 datasets.
Adaptive retrieval methods may not only im-
prove answer correctness, but also substantially de-
crease retrieval calls, enhancing efficiency. While
recent methods have focused extensively on the
retrieval calls (Su et al., 2024b; Jeong et al., 2024;
Trivedi et al., 2023), they often overlook the cost
of LLM calls, which can be even more expensive,
especially with proprietary models. Furthermore,
recent studies of complex pipelines do not assess
self-knowledge abilities and lack comparisons with
well-established uncertainty estimation methods,
such as Mean Entropy (Fomicheva et al., 2020).
To address these limitations, we conduct a comprehensive study of 35 adaptive retrieval systems, including 8 recently published methods and 27 established uncertainty estimation methods, across 6 QA datasets covering both simple one-hop and complex multi-hop questions. We evaluate these methods in terms of QA performance, self-knowledge, and two types of efficiency, using a total of 10 metrics. Our evaluation, shown in Figure 1, reveals that no single method dominates across all axes. However, well-established uncertainty estimation methods are often more useful than recently published, more complex pipelines.

arXiv:2501.12835v2 [cs.CL] 21 Feb 2025

Finally, we provide a rigorous in-depth assess-
ment of the out-of-distribution (OOD) performance
of uncertainty methods and analyze the complexity
of their functional classes.
Our contributions and findings are as follows:
1. A consistent study of 35 adaptive retrieval methods on 6 single- and multi-hop datasets, evaluating QA performance, self-knowledge, and efficiency across 10 metrics.
2. The first comprehensive application and comparison of 27 well-established uncertainty estimation methods for adaptive retrieval, showcasing their potential and efficiency.
3. An in-depth analysis of uncertainty methods for adaptive retrieval, covering OOD transfer and examining the complexity of their functional classes.
We publish all the code and data.¹
2 Related Work
Retrieval-Augmented Generation methods are widely used to enhance the performance of LLMs in many tasks, such as answering questions that require up-to-date information (Jiang et al., 2024) or questions about rare entities, for which LLMs show poor generation quality due to a lack of internal knowledge (Allen-Zhu and Li, 2024). In the simplest case, the input question is used as a query for databases or search engines. The retrieved information is then incorporated as additional context, which has proven effective for a variety of tasks (Khandelwal et al., 2020; Lewis et al., 2020a) and models (Borgeaud et al., 2022; Ram et al., 2023; Socher et al., 2013). All these methods perform retrieval once before generation, so they are often grouped under the name single-round retrieval augmentation.
Adaptive Retrieval-Augmented Generation. Performing retrieval for every query may be both inefficient and unnecessary. Moreover, retrieving knowledge at every step may be misleading or may even conflict with the knowledge stored in the LLM's parameters (Simhi et al., 2024). Adaptive retrieval methods have emerged as an attempt to determine whether an LLM needs external knowledge by exploiting the model's self-knowledge abilities.

¹ https://github.com/s-nlp/AdaRAGUE

The decision to retrieve may depend on different criteria. It may be based on the text outputs of LLMs (Trivedi et al., 2023), on text consistency (Ding et al., 2024), on the self-aware uncertainty of LLMs derived from their internal states (Jiang et al., 2023; Su et al., 2024b; Yao et al., 2024), or on a trained classifier that decides whether to retrieve (Jeong et al., 2024).
Uncertainty Estimation (UE) measures the con-
fidence in LLM predictions and can be classified
into white-box and black-box methods. White-box
methods require access to internal model details,
such as logits or layer outputs, and are divided
into information-based (using token or sequence
probabilities from a single model), ensemble-based
(leveraging probabilities from different model ver-
sions), and density-based (constructing a probabil-
ity density from latent representations). Black-box
methods, in contrast, only require access to the
model’s output (Fadeeva et al., 2023).
3 Methods
In this section, we briefly introduce the existing
adaptive retrieval methods. More details can be
found in Appendix G.
3.1 End-to-End Methods
IRCoT (Interleaving Retrieval in a CoT) is a dynamic approach that adds extra relevant passages from the retriever to the context if the current CoT step has not yet produced the answer. The query for extra context is based on the last generated CoT sentence (Trivedi et al., 2023).
Adaptive RAG (Jeong et al., 2024) uses a classifier based on the T5-large model (Raffel et al., 2020) that predicts one of three outcomes: do not retrieve at all, retrieve once, or retrieve multiple times with IRCoT.
FLARE (Forward-Looking Active Retrieval augmented generation) is a method that retrieves context when a token probability falls below a threshold, regenerating the response until the next uncertain token or completion (Jiang et al., 2023).
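The token-probability trigger described for FLARE can be illustrated with a minimal sketch. This is not the FLARE implementation: the function names and the threshold value of 0.4 are hypothetical, and the per-token probabilities are assumed to be already available from the generator.

```python
from typing import Optional, Sequence


def first_uncertain_index(token_probs: Sequence[float],
                          threshold: float = 0.4) -> Optional[int]:
    """Return the position of the first token whose probability is below
    `threshold`, or None if every token is confident enough.
    The threshold value is an illustrative assumption."""
    for i, p in enumerate(token_probs):
        if p < threshold:
            return i
    return None


def needs_retrieval(token_probs: Sequence[float],
                    threshold: float = 0.4) -> bool:
    """Trigger retrieval as soon as one low-probability token is found."""
    return first_uncertain_index(token_probs, threshold) is not None


# Hypothetical probabilities of the tokens in one generated sentence.
confident = [0.93, 0.88, 0.71]
uncertain = [0.93, 0.22, 0.15]
print(needs_retrieval(confident), needs_retrieval(uncertain))
```

In a full pipeline, the index of the first uncertain token would mark the point from which the response is regenerated with retrieved context.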
DRAGIN (Dynamic Retrieval Augmented Generation based on Information Needs) monitors token probabilities like FLARE but filters stopwords to identify uncertain tokens. It improves context retrieval by reformulating queries using attention weights and reasoning (Su et al., 2024b).
Rowen (Retrieve Only When It Needs) is a consistency-based approach with two components: the Consistency Language component, which measures answer consistency across English and Chinese, and the Consistency Model component, which evaluates semantic coherence across models. Both output inconsistency scores to trigger retrieval. Rowen Hybrid combines both components (Ding et al., 2024).
SeaKR (Self-aware Knowledge Retrieval) uses
an Uncertainty Module (UM) to monitor LLM in-
ternal states and trigger retrieval when uncertainty
exceeds a threshold. A re-ranking component se-
lects a snippet that reduces uncertainty and im-
proves factual accuracy (Yao et al., 2024).
3.2 Uncertainty Estimation Methods
For uncertainty estimation, we employ 27 different
methods, described in detail in Table 14. In the
main part of our paper, we focus on the 5 best-
performing uncertainty estimation methods, which
include approaches from various method families:
• Lexical Similarity: Measures a consistency score based on the average similarity of sampled responses (Fomicheva et al., 2020).
• Max Entropy: Computes the entropy of each token and aggregates it for the sequence using the maximum value (Fomicheva et al., 2020).
• Mean Entropy: Computes the entropy of each token and aggregates it for the sequence using the mean value (Fomicheva et al., 2020).
• SAR: Measures the entropy of each token, reweights it based on token relevance, and aggregates the values using a sum over the sequence (Duan et al., 2023).
• EigValLaplacian: Computes the sum of Laplacian eigenvalues by constructing a weighted graph based on the consistency of sampled responses (Lin et al., 2023).
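The two entropy-based scores in the list above differ only in how per-token entropies are aggregated. The sketch below illustrates that difference under simple assumptions: each generated token comes with a full next-token probability distribution, entropies are measured in nats, and the function names are hypothetical rather than taken from LM-Polygraph.

```python
import math
from typing import Sequence


def token_entropy(dist: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def mean_entropy(dists: Sequence[Sequence[float]]) -> float:
    """Aggregate per-token entropies over the sequence with the mean."""
    entropies = [token_entropy(d) for d in dists]
    return sum(entropies) / len(entropies)


def max_entropy(dists: Sequence[Sequence[float]]) -> float:
    """Aggregate per-token entropies over the sequence with the maximum."""
    return max(token_entropy(d) for d in dists)


# Toy sequence of two tokens over a two-word vocabulary:
# the first token is maximally uncertain, the second is certain.
dists = [[0.5, 0.5], [1.0, 0.0]]
print(mean_entropy(dists), max_entropy(dists))
```

Max aggregation reacts to a single uncertain token, while mean aggregation dilutes it across the sequence, which is one plausible reason the two scores can rank questions differently.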
4 Experimental Setup
In this section, we briefly discuss the implementa-
tion details and the evaluation setup.
4.1 Implementation Details
We use the LLaMA 3.1-8b-instruct model (Dubey et al., 2024) with the default generation parameters for all experiments. The baseline methods follow their original protocols, including prompting and parameter settings, while uncertainty estimation methods use the AdaptiveRAG protocol (Jeong et al., 2024), with the same prompt and few-shot examples.
For all methods, we use the BM25 retriever (Robertson et al., 1994) with Elasticsearch 7.17.9² and the Wikipedia corpus preprocessed by Karpukhin et al. (2020), following previous studies (Su et al., 2024a; Yao et al., 2024).
Uncertainty method scores are computed on both training and test sets using LM-Polygraph (Fadeeva et al., 2023). A set of classifiers is trained on the training set scores, and we report the performance of the best classifier according to downstream metrics. Additional details are provided in Appendix F.
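To make the "score, then classify" step concrete, here is a minimal sketch that fits the simplest possible classifier, a single threshold on a one-dimensional uncertainty score. It is an illustration only: the paper trains a set of classifiers (details in Appendix F), whereas the threshold rule, the function names, and the toy data below are assumptions. Labels follow the convention that 1 means retrieval is needed.

```python
from typing import List, Sequence, Tuple


def fit_threshold(scores: Sequence[float],
                  labels: Sequence[int]) -> Tuple[float, float]:
    """Pick the threshold t that maximizes training accuracy of the rule
    'retrieve iff score >= t'. Returns (threshold, training accuracy)."""
    # Candidate thresholds: every observed score, plus "never retrieve".
    candidates: List[float] = sorted(set(scores)) + [float("inf")]
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum((s >= t) == bool(y) for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


def should_retrieve(score: float, threshold: float) -> bool:
    """Apply the fitted rule to a new question's uncertainty score."""
    return score >= threshold


# Toy training data: higher uncertainty goes with needing retrieval.
train_scores = [0.1, 0.2, 0.8, 0.9]
train_labels = [0, 0, 1, 1]
t, acc = fit_threshold(train_scores, train_labels)
print(t, acc, should_retrieve(0.85, t))
```

Because the decision needs only the score of the first generation, a rule of this kind adds no LM calls beyond those used to compute the uncertainty estimate.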
4.2 Datasets
We use single-hop and multi-hop QA datasets in the same experimental setup to replicate a real-world scenario in which queries vary in difficulty. The choice of datasets is standard for the task: the single-hop datasets are SQUAD v1.1 (Rajpurkar et al., 2016), Natural Questions (Kwiatkowski et al., 2019), and TriviaQA (Joshi et al., 2017), and the multi-hop datasets are MuSiQue (Trivedi et al., 2022), HotpotQA (Yang et al., 2018), and 2WikiMultiHopQA (Ho et al., 2020), following previous papers (Trivedi et al., 2023; Jeong et al., 2024; Su et al., 2024b; Yao et al., 2024). To ensure consistency, we use the 500-question subsets of the original test sets from previous studies (Trivedi et al., 2023; Jeong et al., 2024).
4.3 Evaluation
We conduct a comprehensive evaluation using downstream QA metrics, efficiency metrics, and self-knowledge metrics to broadly cover every aspect of the model. To compare performance fairly across datasets, we also rank the methods on each dataset (a smaller rank indicates better performance) and average the ranks. This ensures a balanced evaluation, as performance gains may vary in significance across datasets.
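The rank-averaging procedure can be sketched as follows, using In-Accuracy values for three methods on two datasets taken from Table 1. How ties are handled is not stated in the text; sharing the mean rank among tied methods is an assumption of this sketch, as are the function names.

```python
from typing import Dict


def rank_desc(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank methods on one dataset: higher score is better, rank 1 is best.
    Tied methods share the mean of the ranks they span (an assumption)."""
    ordered = sorted(scores.items(), key=lambda kv: -kv[1])
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = mean_rank
        i = j + 1
    return ranks


def average_ranks(per_dataset: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each method's rank over all datasets."""
    methods = list(next(iter(per_dataset.values())))
    all_ranks = [rank_desc(scores) for scores in per_dataset.values()]
    return {m: sum(r[m] for r in all_ranks) / len(all_ranks) for m in methods}


# In-Accuracy values copied from Table 1 (NQ and TQA columns).
in_acc = {
    "NQ": {"DRAGIN": 0.480, "Seakr": 0.406, "EigValLaplacian": 0.512},
    "TQA": {"DRAGIN": 0.666, "Seakr": 0.656, "EigValLaplacian": 0.640},
}
print(average_ranks(in_acc))
```

For metrics where lower is better, such as LM calls or retriever calls, the same functions can be applied to negated values.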
4.3.1 Downstream QA Metrics
To assess the final QA system quality we use In-
Accuracy, EM and F1, following previous stud-
ies (Mallen et al., 2023; Baek et al., 2023; Asai
et al., 2024; Jeong et al., 2024), where:
² https://www.elastic.co/elasticsearch
• In-Accuracy (InAcc) evaluates whether the predicted answer includes the ground truth.
• Exact Match (EM) measures the exact match of the prediction with the ground truth.
• F1 quantifies the degree of token overlap between the predicted answer and the ground truth answer.
We primarily rely on In-Accuracy as the main
metric, as it is more robust to answer variations
compared to EM and provides a better measure
of correctness than F1. Additionally, the overall
trends across these metrics are generally consistent.
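The three downstream metrics can be sketched in a few lines. The answer normalization below (lowercasing, stripping punctuation and English articles) follows the common SQuAD-style convention and is an assumption here, since the text does not specify it; the function names are likewise illustrative.

```python
import re
import string
from collections import Counter
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def in_accuracy(pred: str, golds: Sequence[str]) -> float:
    """1.0 if any normalized gold answer is contained in the prediction."""
    return float(any(normalize(g) in normalize(pred) for g in golds))


def exact_match(pred: str, golds: Sequence[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(pred) == normalize(g) for g in golds))


def f1(pred: str, golds: Sequence[str]) -> float:
    """Best token-overlap F1 between the prediction and any gold answer."""
    def single(p: str, g: str) -> float:
        p_toks, g_toks = normalize(p).split(), normalize(g).split()
        overlap = sum((Counter(p_toks) & Counter(g_toks)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(p_toks), overlap / len(g_toks)
        return 2 * precision * recall / (precision + recall)
    return max(single(pred, g) for g in golds)


pred, golds = "The capital is Paris.", ["Paris"]
print(in_accuracy(pred, golds), exact_match(pred, golds), f1(pred, golds))
```

The example shows why In-Accuracy is more tolerant than EM: a verbose but correct answer counts as a hit for In-Accuracy, fails EM, and receives only partial credit under F1.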
4.3.2 Efficiency Metrics
In addition to enhanced quality, adaptive retrieval
procedures must also demonstrate improvements
in efficiency; otherwise, consistent retrieval might
remain superior. To evaluate it, we measure:
• Retriever Calls (RC): The average number of retriever calls made by the system to answer a single question, following Jeong et al. (2024).
• LM Calls (LMC): The average number of calls to the Language Model per question. Some systems may invoke the LM multiple times to calculate uncertainty, rephrase questions, or generate additional rationales.
4.3.3 Self-Knowledge Metrics
Self-knowledge is defined as a model’s ability to
recognize its own knowledge (Yin et al., 2023).
Measuring self-knowledge provides insight into
the effectiveness of a method’s adaptive retrieval
component, as downstream performance is often
influenced by external factors such as retriever se-
lection, language model generation parameters, etc.
The task of identifying self-knowledge is formulated as a binary classification problem, where the ground truth label $y$ is derived from the In-Accuracy of the model's response without external knowledge. Each method $f$ can be represented as a function mapping input text $x$ to a real-valued self-knowledge score $f(x) \in \mathbb{R}$, where higher values indicate lower self-knowledge. The classification task is then performed by a classifier $C$, producing the final prediction $\hat{y} = C(f(x)) \in \{0, 1\}$.
For evaluation, we adopt metrics established
in prior uncertainty estimation research (Fadeeva
et al., 2024b; Tao et al., 2024) and reflexive self-
knowledge analysis (Ni et al., 2024).
• ROC-AUC (AUC) evaluates the robustness of the method's self-knowledge identification performance: $\mathrm{AUC}(f(x), y)$.
• Spearman Correlation (Corr) measures the alignment between the self-knowledge scores and the ground truth: $\mathrm{Corr}(f(x), y)$.
• Accuracy quantifies the correctness of the self-knowledge classifier: $\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{\hat{y}_i = y_i\}$.
• Overconfidence is a fraction of incorrect answers where the method was confident about self-knowledge, reflecting how often the method incorrectly assumes that the model possesses the required knowledge when it does not: $\frac{1}{\sum_i (1 - \hat{y}_i)} \sum_i \left(1 - \mathbb{1}\{\hat{y}_i = y_i\}\right) \cdot (1 - \hat{y}_i)$.
• Underconfidence is a fraction of correct answers where the method was unconfident about self-knowledge, reflecting how often the method fails to recognize that the model already has the required knowledge: $\frac{1}{\sum_i \hat{y}_i} \sum_i \left(1 - \mathbb{1}\{\hat{y}_i = y_i\}\right) \cdot \hat{y}_i$.
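The self-knowledge metrics above can be sketched directly from their formulas. The sketch reads the label convention off those formulas: a value of 1 means the model lacks the knowledge (retrieval is needed) and 0 means it is confident, which matches higher scores indicating lower self-knowledge. The pairwise ROC-AUC estimator and the function names are illustrative choices, not the evaluation code used in the study.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of questions where the self-knowledge prediction is right."""
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)


def overconfidence(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Among predictions y_hat = 0 ('model knows'), the share that is wrong."""
    confident = [(t, p) for t, p in zip(y_true, y_pred) if p == 0]
    if not confident:
        return 0.0
    return sum(t != p for t, p in confident) / len(confident)


def underconfidence(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Among predictions y_hat = 1 ('model does not know'), the share wrong."""
    unsure = [(t, p) for t, p in zip(y_true, y_pred) if p == 1]
    if not unsure:
        return 0.0
    return sum(t != p for t, p in unsure) / len(unsure)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a positive example gets a higher score than a
    negative one, counting ties as one half (pairwise definition)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs both classes")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1, 1]            # toy labels
y_pred = [0, 1, 0, 1, 1]            # toy classifier decisions
scores = [0.1, 0.6, 0.4, 0.8, 0.9]  # toy uncertainty scores f(x)
print(accuracy(y_true, y_pred), overconfidence(y_true, y_pred),
      underconfidence(y_true, y_pred), roc_auc(y_true, scores))
```

ROC-AUC is computed on the raw scores and therefore does not depend on the classifier threshold, whereas accuracy, overconfidence, and underconfidence are computed on the thresholded decisions.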
5 Results
In the following sections, we describe the results
of baseline and uncertainty methods for down-
stream performance, efficiency and self-knowledge.
Along with the end-to-end and UE methods, we
also apply two additional methods for better com-
parison. “Best UE” refers to the top-performing
uncertainty estimation method for each dataset.
“Ideal” represents the performance of a system with
an oracle providing ideal predictions for the need
to retrieve.
5.1 Downstream and Efficiency Performance
The results in Table 1 show that uncertainty esti-
mation methods outperform baseline methods on
single-hop datasets and perform comparably on
multi-hop datasets, while being significantly more
compute-efficient, often several times cheaper.
While baseline methods may achieve slightly better performance on some datasets, they require multiple calls to both the language model and the retriever, leading to higher computational costs. In contrast, uncertainty estimation methods consistently require at most one retriever call and at most two LM calls per question, significantly reducing inference costs.
Uncertainty estimation for adaptive retrieval consistently outperforms constant retrieval in terms of performance and efficiency. However, analysis of the Ideal uncertainty estimator reveals that current methods still fall short of perfect performance, both in terms of efficiency and In-Accuracy, highlighting the ongoing challenge of accurately identifying self-knowledge within the model.

Method          | NQ                | SQUAD             | TQA               | 2Wiki             | HotPot            | Musique           |
                | InAcc↑ LMC↓  RC↓  | InAcc↑ LMC↓  RC↓  | InAcc↑ LMC↓  RC↓  | InAcc↑ LMC↓  RC↓  | InAcc↑ LMC↓  RC↓  | InAcc↑ LMC↓  RC↓  |
Never RAG       | 0.446   1.0  0.00 | 0.176   1.0  0.00 | 0.636   1.0  0.00 | 0.318   1.0  0.00 | 0.286   1.0  0.00 | 0.106   1.0  0.00 |
Always RAG      | 0.496   1.0  1.00 | 0.312   1.0  1.00 | 0.610   1.0  1.00 | 0.374   1.0  1.00 | 0.410   1.0  1.00 | 0.100   1.0  1.00 |
IRCoT           | 0.478   2.7  2.70 | 0.268   2.7  2.68 | 0.608   2.7  2.74 | 0.454   4.4  4.38 | 0.438   3.5  3.45 | 0.138   4.1  4.08 |
AdaptiveRAG     | 0.496   2.0  0.98 | 0.286   2.0  0.97 | 0.628   1.5  0.54 | 0.454   5.2  2.64 | 0.414   4.6  2.34 | 0.140   3.6  3.63 |
DRAGIN          | 0.480   4.5  2.24 | 0.298   4.3  2.14 | 0.666   4.1  2.06 | 0.456   5.8  2.92 | 0.430   5.1  2.56 | 0.134   6.3  3.15 |
FLARE           | 0.450   3.1  2.07 | 0.238   3.1  2.08 | 0.648   2.1  1.39 | 0.424   3.9  2.85 | 0.372   5.1  4.07 | 0.090   4.1  3.10 |
Rowen CL        | 0.494  29.5  7.24 | 0.196  29.2  7.19 | 0.656  28.7  7.06 | 0.444  32.9  7.87 | 0.354  31.9  7.67 | 0.104  42.1  9.52 |
Rowen CM        | 0.494  29.5  7.27 | 0.196  29.2  7.20 | 0.656  28.7  7.12 | 0.444  32.9  7.87 | 0.356  31.9  7.70 | 0.104  42.1  9.52 |
Rowen Hybrid    | 0.494  55.0  7.27 | 0.196  54.3  7.15 | 0.656  53.4  6.93 | 0.444  61.8  7.85 | 0.354  59.8  7.63 | 0.102  80.2  9.48 |
Seakr           | 0.406  14.6  1.00 | 0.268  14.6  1.00 | 0.656  14.6  1.00 | 0.398  12.3  2.44 | 0.424   9.9  1.76 | 0.118  12.3  2.40 |
EigValLaplacian | 0.512   1.8  0.81 | 0.314   2.0  1.00 | 0.640   1.3  0.26 | 0.384   2.0  0.98 | 0.410   1.9  0.91 | 0.102   2.0  0.99 |
Lex-Similarity  | 0.512   1.6  0.58 | 0.318   2.0  0.96 | 0.646   1.2  0.22 | 0.376   2.0  0.97 | 0.410   2.0  0.95 | 0.100   2.0  1.00 |
Max Entropy     | 0.506   1.7  0.73 | 0.312   2.0  1.00 | 0.650   1.2  0.22 | 0.376   2.0  0.95 | 0.414   2.0  0.99 | 0.100   2.0  1.00 |
Mean Entropy    | 0.498   1.9  0.88 | 0.314   2.0  0.95 | 0.650   1.3  0.30 | 0.378   1.9  0.93 | 0.410   2.0  0.99 | 0.100   2.0  1.00 |
SAR             | 0.500   1.8  0.79 | 0.312   2.0  1.00 | 0.642   1.3  0.29 | 0.380   2.0  0.97 | 0.412   1.9  0.90 | 0.100   2.0  1.00 |
Best UE         | 0.512   1.8  0.81 | 0.318   2.0  0.96 | 0.662   1.3  0.28 | 0.384   2.0  0.98 | 0.414   2.0  0.99 | 0.104   2.0  0.99 |
Ideal           | 0.608   1.6  0.55 | 0.360   1.8  0.82 | 0.736   1.4  0.36 | 0.500   1.7  0.68 | 0.460   1.7  0.71 | 0.164   1.9  0.89 |

Table 1: QA Performance of adaptive retrieval and uncertainty methods. ‘Best UE’ refers to the top-performing uncertainty estimation method for each dataset. ‘Ideal’ represents the performance of a system with an oracle providing ideal predictions for the need to retrieve. ‘InAcc’ denotes In-Accuracy, measuring the QA system’s performance. ‘LMC’ indicates the mean number of LM calls per question, and ‘RC’ represents the mean number of retrieval calls per question.

Takeaway 1: Uncertainty methods outper-
form baselines on single-hop tasks, match
them on multi-hop tasks, and are far more
efficient. The “Ideal” estimator high-
lights room for improvement in the self-
knowledge identification.
5.2 Self-Knowledge Performance
The results in Table 2 demonstrate that, despite
strong downstream performance, most adaptive re-
trieval methods may lack the ability to accurately
identify self-knowledge, exhibiting near-zero corre-
lation and random predictions. For instance, while
DRAGIN typically dominates on downstream tasks,
it performs poorly on self-knowledge metrics.
In contrast, SeaKR exhibits strong self-
knowledge identification on single-hop datasets,
underscoring the value of inspecting the internal
states of language models. However, SeaKR’s per-
formance declines on multi-hop datasets, where in-
ternal states may provide limited information about
the model’s knowledge of more complex questions.
For multi-hop tasks, AdaptiveRAG demonstrates
superior results, highlighting the effectiveness of re-
flexive trainable methods, which apparently handle
complex reasoning better.
These results suggest that internal-state uncertainty excels for simple questions, while reflexive uncertainty methods are better suited for complex reasoning tasks.
According to the results in Figure 2, nearly all
baseline models, except for AdaptiveRAG, exhibit
a tendency to either consistently overestimate self-
knowledge or, conversely, to be underconfident. In
contrast, uncertainty methods strike the best bal-
ance between overconfidence and underconfidence,
demonstrating more adequate and reliable values.
Overall, uncertainty estimation methods consistently exhibit the strongest ability to identify self-knowledge, ranking first or second across all methods. These findings emphasize the need for a more thorough evaluation of adaptive retrieval methods that goes beyond downstream performance alone, since downstream performance shows no significant correlation with self-knowledge (see Table 13).
Takeaway 2: Internal-based SeaKR ex-
cels at simple tasks, while reflexive Adap-
tiveRAG performs better on complex ones.
Uncertainty methods provide the most reli-
able self-knowledge estimates, emphasizing
evaluation beyond QA performance.
5.3 Uncertainty Estimation
We analyze 27 uncertainty estimation methods
across QA performance, efficiency, and self-
knowledge, categorizing them by underlying ap-
proach. Methods are ranked based on their average
performance across datasets, with smaller ranks
indicating better results.
As shown in Figure 3, EigValLaplacian and Lex-Similarity rank highest for In-Accuracy, while SAR variants and Mean Entropy dominate for ROC-AUC, highlighting an inconsistency between self-knowledge and downstream performance. This discrepancy is further evidenced by a moderate Spearman correlation of 0.65 between In-Accuracy and ROC-AUC ranks, likely due to differing sensitivities to Type I and II errors. EigValLaplacian also ranks highest for Retrieval Calls, indicating overconfidence.

Method          | NQ                | SQUAD             | TQA               | 2Wiki             | HotPot            | Musique           |
                | Acc    Corr  AUC  | Acc    Corr  AUC  | Acc    Corr  AUC  | Acc    Corr  AUC  | Acc    Corr  AUC  | Acc    Corr  AUC  |
AdaptiveRAG     | 0.57   0.06  0.54 | 0.73   0.10  0.58 | 0.51  -0.02  0.49 | 0.72   0.34  0.71 | 0.71   0.19  0.62 | 0.88   0.15  0.64 |
DRAGIN          | 0.55   0.12  0.57 | 0.82   0.11  0.58 | 0.36   0.03  0.52 | 0.68  -0.07  0.46 | 0.71   0.01  0.51 | 0.89  -0.01  0.49 |
FLARE           | 0.59   0.16  0.59 | 0.54   0.11  0.58 | 0.58   0.12  0.57 | 0.51   0.20  0.62 | 0.42   0.06  0.54 | 0.59   0.01  0.51 |
Rowen CL        | 0.45  -0.14  0.44 | 0.18  -0.06  0.47 | 0.64  -0.07  0.47 | 0.32  -0.10  0.46 | 0.29  -0.13  0.44 | 0.11   0.00  0.50 |
Rowen CM        | 0.45  -0.03  0.49 | 0.18  -0.06  0.47 | 0.64  -0.13  0.44 | 0.32   0.02  0.51 | 0.29  -0.14  0.44 | 0.11  -0.02  0.49 |
Rowen Hybrid    | 0.45  -0.12  0.44 | 0.17  -0.07  0.46 | 0.63  -0.13  0.43 | 0.32  -0.04  0.48 | 0.29  -0.17  0.41 | 0.11  -0.01  0.49 |
Seakr           | 0.55   0.24  0.64 | 0.82   0.36  0.77 | 0.36   0.47  0.78 | 0.68  -0.22  0.37 | 0.71   0.08  0.55 | 0.89   0.06  0.56 |
EigValLaplacian | 0.60   0.17  0.60 | 0.83   0.10  0.57 | 0.70   0.34  0.71 | 0.69   0.19  0.62 | 0.73   0.27  0.67 | 0.89   0.12  0.62 |
Lex-Similarity  | 0.61   0.22  0.63 | 0.84   0.22  0.67 | 0.73   0.39  0.74 | 0.68   0.21  0.63 | 0.73   0.30  0.69 | 0.90   0.08  0.59 |
Max Entropy     | 0.63   0.20  0.62 | 0.82   0.25  0.69 | 0.72   0.35  0.71 | 0.69   0.19  0.62 | 0.73   0.29  0.69 | 0.89   0.18  0.67 |
Mean Entropy    | 0.61   0.20  0.62 | 0.84   0.32  0.74 | 0.72   0.36  0.72 | 0.68   0.28  0.68 | 0.72   0.31  0.70 | 0.90   0.15  0.64 |
SAR             | 0.61   0.23  0.63 | 0.83   0.28  0.71 | 0.72   0.38  0.73 | 0.69   0.23  0.65 | 0.73   0.30  0.69 | 0.89   0.17  0.66 |

Table 2: Self-knowledge metrics for adaptive retrieval and uncertainty methods. ‘Acc’ and ‘AUC’ refer to accuracy and ROC-AUC, respectively, for identifying self-knowledge. ‘Corr’ denotes the Spearman correlation with the self-knowledge label. Bold values indicate the highest score, underlined values represent the second-highest score.

[Figure 2 plot. Series: OverConfidence, UnderConfidence. Methods: AdaptiveRAG, DRAGIN, FLARE, RowenCL, RowenCM, RowenHybrid, Seakr, EigValLaplacian, Lex-Similarity, Max Entropy, Mean Entropy, SAR.]
Figure 2: Average overconfidence and underconfidence for each method. Deviation from the zero value is undesirable and indicates erroneous behavior. High OverConfidence values reflect cases where the method incorrectly assumes the model has the required knowledge when it does not. High UnderConfidence values indicate instances where the method fails to recognize that the model already possesses the required knowledge.

For our main analysis, we select uncertainty
methods with the best QA performance: EigVal-
Laplacian, Lex-Similarity, and Max Entropy and
top self-knowledge methods: SAR and Mean En-
tropy for self-knowledge assessment. Internal-state
methods generally rank lower for In-Accuracy and
ROC-AUC but perform better in efficiency, sug-
gesting overconfidence. Consistency-based meth-
ods excel in QA performance but drop in self-knowledge, lagging behind logit-based methods,
indicating better stability to distribution shifts.
The Hybrid method balances all metrics, ranking
in the top-5 for In-Accuracy and ROC-AUC and
first for efficiency. However, it requires calculat-
ing all uncertainty estimates, introducing compu-
tational overhead, which may still be justified in
retrieval-limited scenarios.
Finally, we analyze feature importance for the Hybrid method in detail in Figure 16 in Appendix E.
Takeaway 3: Consistency-based methods
excel in downstream performance but lag in
self-knowledge, while logit-based methods
dominate self-knowledge metrics. The Hy-
brid method balances all metrics but incurs
higher computational costs.
6 Out-of-Domain Transfer
To analyze the robustness of UE methods on out-of-
domain (OOD) datasets, we evaluate their perfor-
mance across all possible dataset pairs by training
on each dataset and testing on every other. The
relative change in performance, expressed as a per-
centage compared to in-domain performance (see
Appendix D), is used to assess OOD robustness.
For details of the statistical tests, refer to Appendix B.
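The relative-change measure can be written down in a few lines. The sketch assumes that the in-domain reference for a (train, test) pair is the score obtained when training and testing on the test dataset itself, and that a negative value means degradation; both conventions are assumptions of this illustration (the exact definition is in Appendix D), and the dataset names and numbers are hypothetical.

```python
from typing import Dict, Tuple


def relative_change_pct(in_domain: float, ood: float) -> float:
    """Percentage change of the OOD score relative to the in-domain score."""
    if in_domain == 0:
        raise ValueError("in-domain score must be non-zero")
    return 100.0 * (ood - in_domain) / in_domain


def ood_changes(in_domain: Dict[str, float],
                transfer: Dict[Tuple[str, str], float]
                ) -> Dict[Tuple[str, str], float]:
    """Relative change for every (train, test) pair with train != test.
    `in_domain[d]` is the score when training and testing on dataset d."""
    return {
        (train, test): relative_change_pct(in_domain[test], score)
        for (train, test), score in transfer.items()
        if train != test
    }


# Hypothetical scores for two datasets, A and B.
in_domain = {"A": 0.50, "B": 0.40}
transfer = {("A", "B"): 0.38, ("B", "A"): 0.50}
print(ood_changes(in_domain, transfer))
```

Collecting these values over all dataset pairs yields one distribution of changes per method and metric.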
In Figure 4, we present the distributions of performance change across all train-test dataset pairs. For In-Accuracy, most methods perform comparably, with EigValLaplacian being the only method that significantly lags behind and differs from nearly all others. While most methods are centered around 0, indicating stability, there is a noticeable tail representing a loss in quality. Nevertheless, the loss typically remains under 4%, suggesting strong downstream OOD transfer performance, with occasional cases of positive improvement.

[Figure 3 plot. Panels: In-Accuracy Rank, ROC-AUC Rank, RC Rank; one entry per uncertainty method. Legend: Internal-Based, Logit-Based, Consistency-Based, Hybrid.]
Figure 3: Average ranks of uncertainty methods for In-Accuracy, ROC-AUC, and Retrieval Calls. A smaller rank indicates better average performance. The In-Accuracy ranks demonstrate key downstream metrics, while the ROC-AUC ranks show self-knowledge abilities across different methods, affecting average downstream performance. The Retriever Calls (RC) ranks represent the efficiency of the method. This evaluation led us to choose EigValLaplacian, Lex-Similarity, Max Entropy, Mean Entropy, and SAR for more detailed analysis.

[Figure 4 plot. Panels: QA Performance OOD Change, %; Self-Knowledge OOD Change, %; Retrieval Calls OOD Change, %. Vertical axis: Density. Legend: Lex-Similarity, Max Entropy, EigValLaplacian, Mean Entropy, SAR.]
Figure 4: The transferability of methods between datasets was evaluated using average changes in metrics for Out-Of-Distribution (OOD) data. QA Performance in OOD was measured by In-Accuracy, showing comparable results across methods. Self-Knowledge, evaluated by Accuracy, degraded significantly. Efficiency was assessed by RC, indicating that methods tend to call the retriever more frequently after transfer.

However, the QA performance can be influ-
enced by multiple factors. Self-knowledge transfer,
measured by Accuracy, reveals a complex picture.
While the changes are centered around 0—an en-
couraging sign of stability—the tail indicating qual-
ity loss is notably heavier, with more extreme varia-
tions and no cases of positive transfer. EigValLapla-
cian stands out with the weakest transfer perfor-
mance, whereas other methods show comparable
results without statistically significant differences.
Efficiency transfer analysis shows a similar centering around 0 but reveals the largest percentage changes. Methods tend to call the retriever more frequently when transferred, indicating underconfidence. No significant differences are observed between methods.
Takeaway 4: UE methods show strong OOD robustness for QA performance but lower robustness for self-knowledge and efficiency, with no significant differences between most methods.
7 Uncertainty Estimation Complexity
We analyze the complexity of uncertainty estimation methods $f$ with logistic regression $C(f)$, enabling rigorous evaluation of $f$'s complexity.
To this end, we employ Rademacher Complexity for functional complexity (Yin et al., 2019) and sharpness of the loss landscape with Hessian eigenvalues (Sagun et al., 2016; Glorot and Bengio, 2010).

[Figure 5 plot. Axis: Loss Landscape Sharpness (logarithmic scale); one entry per uncertainty method. Legend: Internal-Based, Logit-Based, Consistency-Based.]
Figure 5: Average loss landscape sharpness in logarithmic scale. Higher values correspond to more complex functions.

Rademacher Complexity quantifies the capacity of a hypothesis class to fit random noise, with higher values indicating greater complexity:
$$R_n(H_f) = \mathbb{E}_{\sigma}\left[\sup_{h \in H_f} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(x_i)\right],$$
where $H_f$ is the hypothesis class induced by uncertainty method $f$, $\sigma_i \sim U\{-1, 1\}$ are Rademacher random variables, and $h(x_i)$ is the model's prediction.
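The definition above can be approximated numerically by sampling the Rademacher variables. The sketch below estimates the empirical Rademacher complexity on a fixed sample for a deliberately simple hypothesis class, threshold classifiers over a one-dimensional uncertainty score with outputs in {-1, +1}. This class is a stand-in chosen so that the supremum can be enumerated exactly; it is not the logistic-regression class analyzed in the paper, and the function name, number of draws, and toy scores are assumptions.

```python
import random
from typing import List, Sequence


def empirical_rademacher(scores: Sequence[float],
                         n_draws: int = 2000,
                         seed: int = 0) -> float:
    """Monte Carlo estimate of the empirical Rademacher complexity of
    threshold classifiers h_t(x) = +1 if f(x) >= t else -1 on `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # On a fixed sample, only thresholds at observed scores (plus +inf)
    # produce distinct prediction vectors, so the class is finite here.
    thresholds: List[float] = sorted(set(scores)) + [float("inf")]
    predictions = [[1 if s >= t else -1 for s in scores] for t in thresholds]

    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        # Supremum over the class of the correlation with the random signs.
        total += max(
            sum(s_i * h_i for s_i, h_i in zip(sigma, h)) / n
            for h in predictions
        )
    return total / n_draws


# Toy uncertainty scores f(x_i) for four questions.
print(empirical_rademacher([0.1, 0.4, 0.35, 0.8]))
```

With a single data point the class can match any sign, so the estimate is exactly 1; as the sample grows, a class this simple can fit less of the noise and the value shrinks.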
Loss Landscape Sharpness quantifies complexity from a different perspective by evaluating the curvature of the loss landscape, with higher values indicating more complex functions that are harder to generalize (Kaur et al., 2023).
Let $L(w) \in C^2$ be a twice continuously differentiable loss function with respect to $w \in \mathbb{R}^d$, and let $H(w) = \nabla_w^2 L(w)$ denote its Hessian. The sharpness at the optimized parameters $w^*$ is defined as:
$$\lambda_{\max} = \sup_{\|v\|_2 = 1} v^\top H(w^*) v,$$
where $\lambda_{\max}$ is the largest eigenvalue of $H(w^*)$, capturing the steepest curvature of the loss surface.
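For a logistic regression on a single uncertainty score, the Hessian is a 2×2 matrix and its largest eigenvalue has a closed form, which makes the sharpness definition easy to illustrate. The sketch assumes the mean logistic (cross-entropy) loss with parameters (a, b) for slope and bias, and evaluates the Hessian at parameters supplied by the caller rather than running an optimizer; the function names and inputs are hypothetical.

```python
import math
from typing import Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logistic_hessian(xs: Sequence[float], a: float, b: float
                     ) -> Tuple[float, float, float]:
    """Hessian of the mean logistic loss w.r.t. (a, b) for the model
    p(x) = sigmoid(a * x + b). Returns (h_aa, h_ab, h_bb).
    For this loss the Hessian does not depend on the labels."""
    n = len(xs)
    h_aa = h_ab = h_bb = 0.0
    for x in xs:
        p = sigmoid(a * x + b)
        w = p * (1.0 - p)          # per-example curvature weight
        h_aa += w * x * x
        h_ab += w * x
        h_bb += w
    return h_aa / n, h_ab / n, h_bb / n


def sharpness(xs: Sequence[float], a: float, b: float) -> float:
    """Largest eigenvalue of the symmetric 2x2 Hessian (closed form)."""
    h_aa, h_ab, h_bb = logistic_hessian(xs, a, b)
    half_trace = (h_aa + h_bb) / 2.0
    return half_trace + math.sqrt(((h_aa - h_bb) / 2.0) ** 2 + h_ab ** 2)


# Toy feature values and hypothetical fitted parameters.
print(sharpness([1.0, -1.0], a=0.0, b=0.0))
```

In higher dimensions the same quantity is usually obtained with a numerical eigenvalue routine or power iteration instead of a closed form.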
Figures 5 and 6 show that internal-based features
induce the most complex functions with sharper
loss landscapes. In contrast, moderately com-
plex features like EigValLaplacian achieve better
QA performance, while overall consistency-based
methods are more complex than logit-based ones.
Figure 6: Normalized Rademacher Complexity for uncertainty methods, grouped into Internal-Based, Logit-Based, and Consistency-Based. Higher values indicate greater feature complexity.
Takeaway 5: Internal-based methods are
the most complex and harder to generalize,
while consistency-based methods are more
complex than logit-based ones.
8 Conclusion
We present a comprehensive computational study
of adaptive retrieval systems, evaluating 27 estab-
lished uncertainty estimation methods alongside 8
recently published methods tailored for this task.
Our analysis considers downstream QA perfor-
mance, efficiency, and self-knowledge, covering
a total of 10 evaluation metrics. Our findings show
that established uncertainty methods achieve perfor-
mance comparable to recently proposed adaptive
retrieval approaches, while being more efficient and
exhibiting stronger self-knowledge capabilities.
Moreover, we conducted an in-depth compari-
son of the 27 uncertainty estimation methods, re-
vealing notable discrepancies between downstream
performance and self-knowledge metrics. Our anal-
ysis of OOD transfer shows minimal deviations
in downstream performance but a significant de-
cline in self-knowledge, with no substantial differ-
ences observed between methods. We also identify
the higher functional complexity of internal-based
methods.
Limitations
• We conduct our study using the LLaMA3.1-8b-instruct
model, which is among the best
open-source models within its parameter
range. However, extending the analysis to
additional models would help validate the consistency
of our findings across different architectures.
• Our evaluation is performed on 6 QA datasets,
which are standard for this task. Expanding
the evaluation to include more QA datasets,
particularly domain-specific ones, could uncover
additional insights and highlight the generalizability
of the methods.
Ethical Considerations
Text information retrieval systems may yield biased
text documents, biasing the resulting generation of
even an ethically aligned LLM in an undesired direction.
Therefore, engineers deploying RAG and
Adaptive RAG pipelines in real-world, user-facing
applications should consider this potential issue.
References
Zeyuan Allen-Zhu and Yuanzhi Li. 2024. Physics of
language models: Part 3.1, knowledge storage and
extraction. Preprint , arXiv:2309.14316.
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and
Hannaneh Hajishirzi. 2024. Self-rag: Learning to
retrieve, generate, and critique through self-reflection.
In The Twelfth International Conference on Learning
Representations, ICLR 2024, Vienna, Austria, May
7-11, 2024 . OpenReview.net.
Jinheon Baek, Soyeong Jeong, Minki Kang, Jong C.
Park, and Sung Ju Hwang. 2023. Knowledge-
augmented language model verification. In Proceed-
ings of the 2023 Conference on Empirical Methods
in Natural Language Processing, EMNLP 2023, Sin-
gapore, December 6-10, 2023 , pages 1720–1736. As-
sociation for Computational Linguistics.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoff-
mann, Trevor Cai, Eliza Rutherford, Katie Milli-
can, George Bm Van Den Driessche, Jean-Baptiste
Lespiau, Bogdan Damoc, Aidan Clark, Diego
De Las Casas, Aurelia Guy, Jacob Menick, Roman
Ring, Tom Hennigan, Saffron Huang, Loren Mag-
giore, Chris Jones, Albin Cassirer, Andy Brock,
Michela Paganini, Geoffrey Irving, Oriol Vinyals,
Simon Osindero, Karen Simonyan, Jack Rae, Erich
Elsen, and Laurent Sifre. 2022. Improving language
models by retrieving from trillions of tokens. In
Proceedings of the 39th International Conference
on Machine Learning , volume 162 of Proceedings
of Machine Learning Research , pages 2206–2240.
PMLR.
Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian
Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort,
Jaques Grobler, Robert Layton, Jake VanderPlas, Ar-
naud Joly, Brian Holt, and Gaël Varoquaux. 2013.
API design for machine learning software: experi-
ences from the scikit-learn project. In ECML PKDD
Workshop: Languages for Data Mining and Machine
Learning , pages 108–122.
Maxime Darrin, Pablo Piantanida, and Pierre Colombo.
2023. Rainproof: An umbrella to shield text genera-
tor from out-of-distribution data. In Proceedings of
the 2023 Conference on Empirical Methods in Natu-
ral Language Processing, EMNLP 2023, Singapore,
December 6-10, 2023 , pages 5831–5857. Association
for Computational Linguistics.
Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen,
and Xueqi Cheng. 2024. Retrieve only when it
needs: Adaptive retrieval augmentation for halluci-
nation mitigation in large language models. CoRR ,
abs/2402.10612.
Jinhao Duan, Hao Cheng, Shiqi Wang, Chenan Wang,
Alex Zavalny, Renjing Xu, Bhavya Kailkhura, and
Kaidi Xu. 2023. Shifting attention to relevance: To-
wards the uncertainty estimation of large language
models. arXiv preprint arXiv:2307.01379 .
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey,
Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman,
Akhil Mathur, Alan Schelten, Amy Yang, Angela
Fan, et al. 2024. The llama 3 herd of models. arXiv
preprint arXiv:2407.21783 .
Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem
Shelmanov, Sergey Petrakov, Haonan Li, Hamdy
Mubarak, Evgenii Tsymbalov, Gleb Kuzmin, Alexan-
der Panchenko, Timothy Baldwin, Preslav Nakov,
and Maxim Panov. 2024a. Fact-checking the out-
put of large language models via token-level uncer-
tainty quantification. In Findings of the Association
for Computational Linguistics, ACL 2024, Bangkok,
Thailand and virtual meeting, August 11-16, 2024 ,
pages 9367–9385. Association for Computational
Linguistics.
Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem
Shelmanov, Sergey Petrakov, Haonan Li, Hamdy
Mubarak, Evgenii Tsymbalov, Gleb Kuzmin, Alexan-
der Panchenko, Timothy Baldwin, et al. 2024b. Fact-
checking the output of large language models via
token-level uncertainty quantification. arXiv preprint
arXiv:2403.04696 .
Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun,
Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin,
Daniil Vasilev, Elizaveta Goncharova, Alexander
Panchenko, Maxim Panov, Timothy Baldwin, and
Artem Shelmanov. 2023. Lm-polygraph: Uncertainty
estimation for language models. In Proceedings of
the 2023 Conference on Empirical Methods in Nat-
ural Language Processing, EMNLP 2023 - System
Demonstrations, Singapore, December 6-10, 2023 ,
pages 446–461. Association for Computational Lin-
guistics.
Marina Fomicheva, Shuo Sun, Lisa Yankovskaya,
Frédéric Blain, Francisco Guzmán, Mark Fishel,
Nikolaos Aletras, Vishrav Chaudhary, and Lucia Spe-
cia. 2020. Unsupervised quality estimation for neural
machine translation. Transactions of the Association
for Computational Linguistics , 8:539–555.
Xavier Glorot and Yoshua Bengio. 2010. Understand-
ing the difficulty of training deep feedforward neural
networks. In Proceedings of the Thirteenth Interna-
tional Conference on Artificial Intelligence and Statis-
tics, volume 9 of Proceedings of Machine Learning
Research , pages 249–256, Chia Laguna Resort, Sar-
dinia, Italy. PMLR.
Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara,
and Akiko Aizawa. 2020. Constructing A multi-hop
QA dataset for comprehensive evaluation of reason-
ing steps. In Proceedings of the 28th International
Conference on Computational Linguistics, COLING
2020, Barcelona, Spain (Online), December 8-13,
2020 , pages 6609–6625. International Committee on
Computational Linguistics.
Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju
Hwang, and Jong Park. 2024. Adaptive-rag: Learn-
ing to adapt retrieval-augmented large language mod-
els through question complexity. In Proceedings of
the 2024 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies (Volume 1: Long
Papers), NAACL 2024, Mexico City, Mexico, June
16-21, 2024 , pages 7036–7050. Association for Com-
putational Linguistics.
Yuxin Jiang, Yufei Wang, Chuhan Wu, Wanjun Zhong,
Xingshan Zeng, Jiahui Gao, Liangyou Li, Xin Jiang,
Lifeng Shang, Ruiming Tang, Qun Liu, and Wei
Wang. 2024. Learning to edit: Aligning llms with
knowledge editing. Preprint , arXiv:2402.11905.
Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun,
Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie
Callan, and Graham Neubig. 2023. Active retrieval
augmented generation. In Proceedings of the 2023
Conference on Empirical Methods in Natural Lan-
guage Processing , pages 7969–7992, Singapore. As-
sociation for Computational Linguistics.
Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke
Zettlemoyer. 2017. Triviaqa: A large scale distantly
supervised challenge dataset for reading comprehen-
sion. In Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics, ACL
2017, Vancouver, Canada, July 30 - August 4, Volume
1: Long Papers , pages 1601–1611. Association for
Computational Linguistics.
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom
Henighan, Dawn Drain, Ethan Perez, Nicholas
Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli
Tran-Johnson, et al. 2022. Language models
(mostly) know what they know. arXiv preprint
arXiv:2207.05221.
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick
S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen,
and Wen-tau Yih. 2020. Dense passage retrieval for
open-domain question answering. In Proceedings of
the 2020 Conference on Empirical Methods in Nat-
ural Language Processing, EMNLP 2020, Online,
November 16-20, 2020 , pages 6769–6781. Associa-
tion for Computational Linguistics.
Simran Kaur, Jeremy Cohen, and Zachary Chase Lipton.
2023. On the maximum hessian eigenvalue and gen-
eralization. In Proceedings on "I Can’t Believe It’s
Not Better! - Understanding Deep Learning Through
Empirical Falsification" at NeurIPS 2022 Workshops ,
volume 187 of Proceedings of Machine Learning
Research , pages 51–65. PMLR.
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke
Zettlemoyer, and Mike Lewis. 2020. Generalization
through memorization: Nearest neighbor language
models. In International Conference on Learning
Representations .
Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023.
Semantic uncertainty: Linguistic invariances for un-
certainty estimation in natural language generation.
In The Eleventh International Conference on Learning
Representations, ICLR 2023, Kigali, Rwanda,
May 1-5, 2023 . OpenReview.net.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red-
field, Michael Collins, Ankur P. Parikh, Chris Alberti,
Danielle Epstein, Illia Polosukhin, Jacob Devlin, Ken-
ton Lee, Kristina Toutanova, Llion Jones, Matthew
Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob
Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natu-
ral questions: a benchmark for question answering
research. Trans. Assoc. Comput. Linguistics , 7:452–
466.
Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin.
2018. A simple unified framework for detecting out-
of-distribution samples and adversarial attacks. In
Advances in Neural Information Processing Systems
31: Annual Conference on Neural Information Pro-
cessing Systems 2018, NeurIPS 2018, December 3-8,
2018, Montréal, Canada , pages 7167–7177.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio
Petroni, Vladimir Karpukhin, Naman Goyal, Hein-
rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock-
täschel, Sebastian Riedel, and Douwe Kiela. 2020a.
Retrieval-augmented generation for knowledge-
intensive nlp tasks. In Advances in Neural Infor-
mation Processing Systems , volume 33, pages 9459–
9474. Curran Associates, Inc.
Patrick S. H. Lewis, Ethan Perez, Aleksandra Pik-
tus, Fabio Petroni, Vladimir Karpukhin, Naman
Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih,
Tim Rocktäschel, Sebastian Riedel, and Douwe
Kiela. 2020b. Retrieval-augmented generation for
knowledge-intensive NLP tasks. In Advances in Neu-
ral Information Processing Systems 33: Annual Con-
ference on Neural Information Processing Systems
2020, NeurIPS 2020, December 6-12, 2020, virtual .
Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. 2023.
Generating with confidence: Uncertainty quantifi-
cation for black-box large language models. arXiv
preprint arXiv:2305.19187 .
Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das,
Daniel Khashabi, and Hannaneh Hajishirzi. 2023.
When not to trust language models: Investigating
effectiveness of parametric and non-parametric mem-
ories. In Proceedings of the 61st Annual Meeting of
the Association for Computational Linguistics (Vol-
ume 1: Long Papers), ACL 2023, Toronto, Canada,
July 9-14, 2023 , pages 9802–9822. Association for
Computational Linguistics.
Shiyu Ni, Keping Bi, Jiafeng Guo, and Xueqi Cheng.
2024. When do llms need retrieval augmentation?
mitigating llms’ overconfidence helps retrieval aug-
mentation. arXiv preprint arXiv:2402.11457 .
Colin Raffel, Noam Shazeer, Adam Roberts, Kather-
ine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the
limits of transfer learning with a unified text-to-text
transformer. Journal of Machine Learning Research ,
21(140):1–67.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and
Percy Liang. 2016. SQuAD: 100,000+ questions
for machine comprehension of text. In Proceedings
of the 2016 Conference on Empirical Methods in
Natural Language Processing, EMNLP 2016, Austin,
Texas, USA, November 1-4, 2016 , pages 2383–2392.
The Association for Computational Linguistics.
Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay,
Amnon Shashua, Kevin Leyton-Brown, and Yoav
Shoham. 2023. In-context retrieval-augmented lan-
guage models. Transactions of the Association for
Computational Linguistics , 11:1316–1331.
Jie Ren, Jiaming Luo, Yao Zhao, Kundan Krishna, Mo-
hammad Saleh, Balaji Lakshminarayanan, and Pe-
ter J. Liu. 2023. Out-of-distribution detection and
selective generation for conditional language models.
In The Eleventh International Conference on Learning
Representations, ICLR 2023, Kigali, Rwanda,
May 1-5, 2023 . OpenReview.net.
Stephen E. Robertson, Steve Walker, Susan Jones,
Micheline Hancock-Beaulieu, and Mike Gatford.
1994. Okapi at TREC-3. In Proceedings of The Third
Text REtrieval Conference, TREC 1994, Gaithers-
burg, Maryland, USA, November 2-4, 1994 , volume
500-225 of NIST Special Publication , pages 109–
126. National Institute of Standards and Technology
(NIST).
Levent Sagun, Leon Bottou, and Yann LeCun. 2016.
Eigenvalues of the hessian in deep learning: Singu-
larity and beyond. arXiv preprint arXiv:1611.07476 .
Freda Shi, Xinyun Chen, Kanishka Misra, Nathan
Scales, David Dohan, Ed H. Chi, Nathanael Schärli,
and Denny Zhou. 2023. Large language models can
be easily distracted by irrelevant context. In International
Conference on Machine Learning, ICML 2023,
23-29 July 2023, Honolulu, Hawaii, USA , volume
202 of Proceedings of Machine Learning Research ,
pages 31210–31227. PMLR.
Adi Simhi, Jonathan Herzig, Idan Szpektor, and Yonatan
Belinkov. 2024. Constructing benchmarks and in-
terventions for combating hallucinations in llms.
Preprint , arXiv:2404.09971.
Richard Socher, Alex Perelygin, Jean Wu, Jason
Chuang, Christopher D. Manning, Andrew Ng, and
Christopher Potts. 2013. Recursive deep models for
semantic compositionality over a sentiment treebank.
In Proceedings of the 2013 Conference on Empirical
Methods in Natural Language Processing, pages
1631–1642, Seattle, Washington, USA. Association
for Computational Linguistics.
Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, and
Yiqun Liu. 2024a. Dragin: Dynamic retrieval aug-
mented generation based on the information needs of
large language models. Preprint , arXiv:2403.10081.
Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu,
and Yiqun Liu. 2024b. DRAGIN: dynamic retrieval
augmented generation based on the real-time informa-
tion needs of large language models. In Proceedings
of the 62nd Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
ACL 2024, Bangkok, Thailand, August 11-16, 2024 ,
pages 12991–13013. Association for Computational
Linguistics.
Junya Takayama and Yuki Arase. 2019. Relevant and
informative response generation using pointwise mu-
tual information. In Proceedings of the First Work-
shop on NLP for Conversational AI , pages 133–138.
Shuchang Tao, Liuyi Yao, Hanxing Ding, Yuexiang Xie,
Qi Cao, Fei Sun, Jinyang Gao, Huawei Shen, and
Bolin Ding. 2024. When to trust llms: Aligning
confidence with response quality. arXiv preprint
arXiv:2404.17287 .
Qwen Team. 2024. Qwen2.5: A party of foundation
models.
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot,
and Ashish Sabharwal. 2022. Musique: Multi-
hop questions via single-hop question composition.
Trans. Assoc. Comput. Linguistics , 10:539–554.
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot,
and Ashish Sabharwal. 2023. Interleaving retrieval
with chain-of-thought reasoning for knowledge-
intensive multi-step questions. In Proceedings of
the 61st Annual Meeting of the Association for Com-
putational Linguistics (Volume 1: Long Papers),
ACL 2023, Toronto, Canada, July 9-14, 2023 , pages
10014–10037. Association for Computational Lin-
guistics.
Liam van der Poel, Ryan Cotterell, and Clara Meis-
ter. 2022. Mutual information alleviates hallucina-
tions in abstractive summarization. In Proceedings of
the 2022 Conference on Empirical Methods in Natu-
ral Language Processing, EMNLP 2022, Abu Dhabi,
United Arab Emirates, December 7-11, 2022 , pages
5956–5965. Association for Computational Linguis-
tics.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng,
Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan
Li, Dayiheng Liu, Fei Huang, et al. 2024. Qwen2
technical report. arXiv preprint arXiv:2407.10671 .
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Ben-
gio, William W. Cohen, Ruslan Salakhutdinov, and
Christopher D. Manning. 2018. Hotpotqa: A dataset
for diverse, explainable multi-hop question answer-
ing. In Proceedings of the 2018 Conference on Em-
pirical Methods in Natural Language Processing,
Brussels, Belgium, October 31 - November 4, 2018 ,
pages 2369–2380. Association for Computational
Linguistics.
Zijun Yao, Weijian Qi, Liangming Pan, Shulin Cao,
Linmei Hu, Weichuan Liu, Lei Hou, and Juanzi
Li. 2024. Seakr: Self-aware knowledge retrieval
for adaptive retrieval augmented generation. CoRR ,
abs/2406.19215.
Dong Yin, Ramchandran Kannan, and Peter Bartlett.
2019. Rademacher complexity for adversarially ro-
bust generalization. In International conference on
machine learning , pages 7085–7094. PMLR.
Xunjian Yin, Xu Zhang, Jie Ruan, and Xiaojun Wan.
2024. Benchmarking knowledge boundary for large
language models: A different perspective on model
evaluation. In Proceedings of the 62nd Annual Meet-
ing of the Association for Computational Linguistics
(Volume 1: Long Papers), ACL 2024, Bangkok, Thai-
land, August 11-16, 2024 , pages 2270–2286. Associ-
ation for Computational Linguistics.
Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu,
Xipeng Qiu, and Xuanjing Huang. 2023. Do large
language models know what they don’t know? arXiv
preprint arXiv:2305.18153 .
KiYoon Yoo, Jangho Kim, Jiho Jang, and Nojun Kwak.
2022. Detection of adversarial examples in text clas-
sification: Benchmark and baseline via robust density
estimation. In Findings of the Association for Com-
putational Linguistics: ACL 2022, Dublin, Ireland,
May 22-27, 2022 , pages 3656–3672. Association for
Computational Linguistics.
A Classifier for UE
We further analyze how the choice of classification algorithm (Logistic Regression, Threshold, KNN, MLP, or Decision
Tree) impacts QA performance. Specifically, we compute the average performance drop for
each uncertainty method when switching from the maximum performance over classifiers to the average. This
sensitivity further indicates the complexity of the method, as more complex methods require a more careful
choice of classifier and hyperparameters.
The results in Figure 7 demonstrate that, except for NumSemSets (the number of semantic clusters
of sampled responses), consistency-based methods exhibit higher sensitivity compared to logit-based
methods. Internal state-based methods and hybrid approaches show the greatest sensitivity, which aligns
with their complex nature. This increased sensitivity likely arises because their features are less explicit,
capturing subtle internal state changes that are inherently harder to fit and explain.
We further illustrate the ranking differences in Table 3, showing that methods with greater stability tend
to achieve higher ranks when averaged across classifiers.
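The sensitivity measure described in this appendix can be sketched in a few lines. The snippet below assumes the drop is reported relative to the best classifier, in percent; the normalization, the function name `classifier_sensitivity`, and the scores are illustrative assumptions rather than the authors' exact procedure or results.

```python
from statistics import mean

def classifier_sensitivity(scores_by_classifier):
    """Relative drop (in %) when moving from the best classifier to the
    average over classifiers for one uncertainty method."""
    values = list(scores_by_classifier.values())
    best = max(values)
    return 100.0 * (best - mean(values)) / best

# Hypothetical In-Accuracy of one uncertainty method under five classifiers.
scores = {
    "logistic_regression": 0.412,
    "threshold": 0.410,
    "knn": 0.404,
    "mlp": 0.408,
    "decision_tree": 0.400,
}
print(round(classifier_sensitivity(scores), 2))
```

A method whose score barely changes across classifiers gets a value near zero, while a method that only works with one well-tuned classifier gets a larger value.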
Figure 7: Average QA performance drop (%) for uncertainty methods when switching from the maximum over classifiers to the
average. Methods are grouped into Internal-Based, Logit-Based, Consistency-Based, and Hybrid.
B Statistical Tests for OOD Testing
To evaluate the OOD performance differences between uncertainty estimation methods under dataset-dependent
conditions, we use the Friedman test, which is suitable for small sample sizes, makes no
assumption of normality, and is appropriate for repeated measurements.
After the Friedman test, we apply the Nemenyi post-hoc test to identify statistically significant pairwise
differences between methods. Like the Friedman test, it is rank-based, and it accounts for multiple comparisons
to ensure robust analysis. We mark significant results with an asterisk above the number.
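The two-step procedure can be sketched as follows. This is a simplified standard-library illustration, not the authors' code: it computes the Friedman statistic without tie correction and the Nemenyi critical difference for mean ranks, and it takes the critical value `q_alpha` as an input from a table (2.343 is the commonly tabulated value for three methods at the 0.05 level). In practice, the p-value for the first step would typically come from a library routine such as `scipy.stats.friedmanchisquare`. The scores are hypothetical.

```python
import math

def average_ranks(row):
    """Rank the scores of one dataset; rank 1 is the highest score and
    tied scores share the average of their ranks."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks

def friedman_statistic(scores):
    """scores[d][m] is the score of method m on dataset d.
    Returns (chi-square statistic, mean rank per method)."""
    n, k = len(scores), len(scores[0])
    rank_rows = [average_ranks(row) for row in scores]
    mean_ranks = [sum(r[m] for r in rank_rows) / n for m in range(k)]
    stat = 12.0 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0
    )
    return stat, mean_ranks

def nemenyi_critical_difference(q_alpha, k, n):
    """Two methods differ significantly if their mean ranks differ by more
    than this value."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# Hypothetical scores: 4 datasets (rows) x 3 methods (columns).
scores = [
    [0.50, 0.45, 0.40],
    [0.31, 0.30, 0.28],
    [0.66, 0.64, 0.63],
    [0.41, 0.40, 0.37],
]
stat, mean_ranks = friedman_statistic(scores)
cd = nemenyi_critical_difference(2.343, k=3, n=4)
print(stat, mean_ranks, cd)
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom; pairwise comparisons are made only if that first test rejects the hypothesis that all methods perform equally.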
C Performance Analysis Across Datasets
Figure 8 shows scatter plots comparing the performance of various Retrieval-Augmented Generation
(RAG) methods on all studied datasets. The x-axis represents the number of LLM calls, while
the y-axis shows the Bootstrap Mean In-Accuracy. Circle sizes in the visualization correspond to the
number of retrieval calls required by each method.
Figure 8: Performance comparison showing the relationship between LLM calls (x-axis) and Bootstrap Mean In-Accuracy (y-axis),
with one panel per dataset: 2WikiMultihopQA, HotpotQA, Musique, NQ, SQuAD, and TriviaQA. Baseline RAG methods and
uncertainty estimation methods are marked separately.
The size of each point indicates the number of retrieval calls required by each method.
Method Mean Max Difference
Hybrid 25.67 11.33 -14.33
RMD 23.50 15.00 -8.50
Perplexity 15.00 8.83 -6.17
MD 19.50 13.50 -6.00
CCP 15.50 9.67 -5.83
EigValLaplacian 11.67 6.00 -5.67
Median Entropy 15.67 10.67 -5.00
Mean Entropy 12.83 8.33 -4.50
RDE 16.83 12.67 -4.17
DegMat 13.67 10.00 -3.67
Max Entropy 10.00 6.50 -3.50
Mean CPMI 14.00 11.00 -3.00
Lex-Similarity 9.33 6.50 -2.83
SentenceSAR 12.00 9.17 -2.83
Min Entropy 11.83 9.33 -2.50
SAR 10.50 8.83 -1.67
PTrue 16.00 14.50 -1.50
Max Probability 11.33 10.50 -0.83
Min Probability 11.00 10.17 -0.83
FisherRao 9.17 8.50 -0.67
Semantic Entropy 9.17 8.50 -0.67
RenyiNeg 9.00 9.33 0.33
Mean Probability 5.83 6.33 0.50
Mean PMI 10.33 11.17 0.83
Median Probability 8.83 9.83 1.00
Eccentricity 7.83 8.83 1.00
NumSemSets 10.33 12.00 1.67
Table 3: Rank of UE methods by In-Accuracy, aggregated using the mean or maximum across different classifiers.
A lower difference indicates reduced stability to classifier choice, whereas a higher difference reflects greater
robustness to classifier choice.
Method NQ SQUAD TQA 2Wiki HotPot Musique
Over Under Over Under Over Under Over Under Over Under Over Under
AdaptiveRAG 0.01 0.43 0.14 0.13 0.19 0.30 0.12 0.15 0.11 0.18 0.02 0.10
DRAGIN 0.00 0.45 0.00 0.18 0.00 0.64 0.00 0.32 0.00 0.29 0.00 0.11
FLARE 0.20 0.21 0.40 0.06 0.18 0.24 0.43 0.06 0.53 0.05 0.34 0.06
Rowen CL 0.55 0.00 0.82 0.00 0.36 0.00 0.68 0.00 0.71 0.00 0.89 0.00
Rowen CM 0.55 0.00 0.82 0.00 0.36 0.00 0.68 0.00 0.71 0.00 0.89 0.00
Rowen Hybrid 0.55 0.00 0.82 0.00 0.36 0.01 0.68 0.00 0.71 0.00 0.89 0.00
Seakr 0.00 0.45 0.00 0.18 0.00 0.64 0.00 0.32 0.00 0.29 0.00 0.11
Lex-Similarity 0.01 0.21 0.00 0.15 0.18 0.06 0.00 0.28 0.02 0.22 0.00 0.10
Max Entropy 0.08 0.20 0.00 0.13 0.17 0.07 0.00 0.29 0.00 0.23 0.00 0.10
EigValLaplacian 0.03 0.33 0.00 0.17 0.21 0.08 0.00 0.30 0.01 0.23 0.00 0.10
SAR 0.06 0.29 0.00 0.16 0.18 0.09 0.00 0.28 0.03 0.22 0.00 0.11
Mean Entropy 0.04 0.29 0.00 0.14 0.14 0.07 0.00 0.29 0.00 0.15 0.00 0.10
Table 4: Over- and UnderConfidence for adaptive retrieval methods and uncertainty estimation. Values closest to zero
indicate the best performance. UnderConfidence refers to cases where the method failed to detect self-knowledge
despite its presence, while OverConfidence reflects cases where the method incorrectly detected self-knowledge
when it was absent.
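The two quantities in Table 4 can be computed from per-question flags. The sketch below assumes both rates are normalized by the total number of questions and that self-knowledge is judged by whether the model answers correctly without retrieval; the function name `confidence_errors`, the argument names, and the toy inputs are hypothetical.

```python
def confidence_errors(retrieved, knows_answer):
    """Return (overconfidence, underconfidence) as fractions of all questions.

    retrieved[i]    -- True if the method triggered retrieval for question i
    knows_answer[i] -- True if the LLM answers question i correctly on its own
    """
    n = len(retrieved)
    pairs = list(zip(retrieved, knows_answer))
    # Overconfidence: retrieval was skipped although the model lacked the answer.
    over = sum(1 for r, k in pairs if not r and not k) / n
    # Underconfidence: retrieval was triggered although the model knew the answer.
    under = sum(1 for r, k in pairs if r and k) / n
    return over, under

# Hypothetical decisions for four questions.
print(confidence_errors([True, True, False, False], [True, False, True, False]))
```

Under this normalization, a method that always retrieves has zero overconfidence and an underconfidence equal to the share of questions the model can already answer.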
Method NQ SQUAD TQA
EM F1 InAcc RC EM F1 InAcc RC EM F1 InAcc RC
CCP 0.398 0.512 0.496 0.94 0.252 0.389 0.312 1.00 0.600 0.692 0.662 0.28
DegMat 0.394 0.514 0.496 0.97 0.252 0.389 0.312 1.00 0.598 0.684 0.644 0.29
Eccentricity 0.404 0.520 0.500 0.84 0.252 0.390 0.312 1.00 0.594 0.677 0.638 0.21
EigValLaplacian 0.406 0.532 0.512 0.81 0.254 0.391 0.314 1.00 0.594 0.682 0.640 0.26
FisherRao 0.390 0.506 0.498 0.88 0.252 0.389 0.312 1.00 0.598 0.688 0.654 0.11
Hybrid 0.410 0.534 0.504 0.65 0.254 0.393 0.314 0.99 0.594 0.689 0.654 0.32
Lex-Similarity 0.420 0.535 0.512 0.58 0.256 0.394 0.318 0.96 0.600 0.689 0.646 0.22
MD 0.398 0.511 0.496 1.00 0.252 0.389 0.312 1.00 0.598 0.681 0.642 0.05
Max Entropy 0.422 0.535 0.506 0.73 0.252 0.389 0.312 1.00 0.598 0.689 0.650 0.22
Max Probability 0.418 0.532 0.502 0.82 0.252 0.389 0.312 1.00 0.592 0.683 0.646 0.21
Mean CPMI 0.390 0.506 0.496 1.00 0.252 0.389 0.312 1.00 0.592 0.675 0.640 0.02
Mean Entropy 0.402 0.514 0.498 0.88 0.254 0.392 0.314 0.95 0.598 0.687 0.650 0.30
Mean PMI 0.390 0.506 0.496 1.00 0.254 0.389 0.312 1.00 0.596 0.683 0.640 0.02
Mean Probability 0.404 0.512 0.498 0.77 0.258 0.394 0.318 0.98 0.592 0.681 0.642 0.06
Median Entropy 0.412 0.519 0.496 1.00 0.252 0.389 0.312 1.00 0.596 0.682 0.644 0.15
Median Probability 0.408 0.512 0.496 1.00 0.252 0.389 0.312 1.00 0.592 0.680 0.644 0.26
Min Entropy 0.398 0.515 0.504 0.93 0.252 0.389 0.312 1.00 0.592 0.675 0.636 0.00
Min Probability 0.398 0.515 0.502 0.91 0.252 0.389 0.312 1.00 0.592 0.675 0.636 0.00
NumSemSets 0.406 0.521 0.502 0.83 0.252 0.389 0.312 1.00 0.590 0.680 0.638 0.28
PTrue 0.388 0.506 0.496 1.00 0.252 0.389 0.312 1.00 0.592 0.676 0.636 0.00
Perplexity 0.404 0.515 0.498 0.77 0.256 0.392 0.316 0.98 0.594 0.683 0.646 0.16
RDE 0.388 0.506 0.496 1.00 0.252 0.389 0.312 1.00 0.588 0.670 0.634 0.08
RMD 0.394 0.508 0.496 1.00 0.252 0.389 0.312 1.00 0.592 0.675 0.636 0.00
RenyiNeg 0.402 0.517 0.498 0.96 0.252 0.389 0.312 1.00 0.594 0.688 0.654 0.24
SAR 0.410 0.526 0.500 0.79 0.254 0.389 0.312 1.00 0.590 0.681 0.642 0.29
Semantic Entropy 0.406 0.521 0.504 0.83 0.260 0.393 0.316 0.92 0.596 0.685 0.640 0.24
SentenceSAR 0.410 0.521 0.500 0.73 0.254 0.391 0.314 0.99 0.596 0.685 0.644 0.24
Table 5: Detailed QA performance results for uncertainty methods on one-hop datasets. ‘InAcc’ denotes In-Accuracy,
and ‘EM’ stands for Exact Match. Higher values indicate better performance. Bold values highlight the best results.
Standard deviations for InAcc, EM, and F1 are ≈0.02±0.003, calculated using bootstrapping.
Method 2Wiki HotPot Musique
EM F1 InAcc RC EM F1 InAcc RC EM F1 InAcc RC
CCP 0.310 0.398 0.376 0.98 0.386 0.497 0.410 1.00 0.088 0.167 0.100 1.00
DegMat 0.314 0.407 0.382 0.95 0.386 0.498 0.410 1.00 0.088 0.168 0.100 1.00
Eccentricity 0.312 0.406 0.384 0.93 0.390 0.502 0.414 0.93 0.088 0.167 0.100 1.00
EigValLaplacian 0.312 0.405 0.384 0.98 0.384 0.501 0.410 0.91 0.088 0.169 0.102 1.00
FisherRao 0.306 0.399 0.378 0.98 0.386 0.497 0.410 1.00 0.088 0.169 0.100 1.00
Hybrid 0.298 0.391 0.368 0.93 0.384 0.491 0.406 0.94 0.090 0.169 0.102 1.00
Lex-Similarity 0.306 0.400 0.376 0.97 0.386 0.498 0.410 0.95 0.088 0.168 0.100 1.00
MD 0.302 0.397 0.374 1.00 0.386 0.497 0.410 1.00 0.088 0.167 0.100 1.00
Max Entropy 0.304 0.398 0.376 0.95 0.390 0.501 0.414 0.99 0.088 0.167 0.100 1.00
Max Probability 0.304 0.396 0.374 1.00 0.386 0.497 0.410 1.00 0.088 0.167 0.100 1.00
Mean CPMI 0.302 0.397 0.376 0.98 0.386 0.497 0.410 1.00 0.090 0.169 0.102 0.99
Mean Entropy 0.306 0.400 0.378 0.93 0.386 0.497 0.410 0.99 0.088 0.167 0.100 1.00
Mean PMI 0.310 0.399 0.382 0.96 0.386 0.497 0.410 1.00 0.088 0.167 0.100 1.00
Mean Probability 0.308 0.400 0.380 0.96 0.388 0.498 0.412 0.97 0.092 0.173 0.104 0.97
Median Entropy 0.308 0.398 0.378 0.98 0.386 0.497 0.410 1.00 0.088 0.167 0.100 1.00
Median Probability 0.304 0.397 0.376 0.94 0.386 0.497 0.410 1.00 0.090 0.169 0.102 1.00
Min Entropy 0.308 0.397 0.376 0.93 0.386 0.497 0.410 1.00 0.090 0.171 0.104 0.99
Min Probability 0.312 0.401 0.376 0.95 0.386 0.497 0.410 1.00 0.090 0.169 0.102 0.99
NumSemSets 0.304 0.396 0.374 1.00 0.386 0.502 0.412 0.95 0.088 0.167 0.100 1.00
PTrue 0.308 0.398 0.372 0.87 0.386 0.497 0.410 1.00 0.090 0.169 0.102 0.99
Perplexity 0.304 0.398 0.376 0.96 0.386 0.498 0.410 1.00 0.088 0.168 0.100 1.00
RDE 0.304 0.398 0.376 0.99 0.386 0.497 0.410 1.00 0.090 0.171 0.102 0.99
RMD 0.304 0.398 0.372 0.97 0.388 0.499 0.412 0.95 0.088 0.167 0.100 1.00
RenyiNeg 0.302 0.396 0.374 1.00 0.390 0.500 0.414 0.97 0.088 0.167 0.100 1.00
SAR 0.310 0.404 0.380 0.97 0.386 0.500 0.412 0.90 0.088 0.167 0.100 1.00
Semantic Entropy 0.304 0.398 0.374 1.00 0.386 0.499 0.412 0.93 0.088 0.169 0.102 1.00
SentenceSAR 0.308 0.403 0.376 0.89 0.384 0.498 0.410 0.90 0.088 0.167 0.100 1.00
Table 6: Detailed QA performance results for uncertainty methods on multi-hop datasets. ‘InAcc’ denotes In-Accuracy,
and ‘EM’ stands for Exact Match. Higher values indicate better performance. Bold values highlight the best results.
Standard deviations for InAcc, EM, and F1 are ≈ 0.02 ± 0.002 for HotPotQA and 2Wiki and ≈ 0.01 ± 0.001 for
Musique, calculated using bootstrapping.
Method NQ SQUAD TQA
EM F1 Acc LLMC RC EM F1 Acc LLMC RC EM F1 Acc LLMC RC
No Context 0.386 0.495 0.446 1.0 0.00 0.156 0.249 0.176 1.0 0.00 0.592 0.675 0.636 1.0 0.00
All Context 0.388 0.506 0.496 1.0 1.00 0.252 0.389 0.312 1.0 1.00 0.522 0.636 0.610 1.0 1.00
AdaptiveRAG 0.388 0.505 0.496 1.0 0.98 0.238 0.366 0.286 1.0 0.97 0.564 0.656 0.628 0.5 0.54
DRAGIN 0.396 0.510 0.480 4.5 2.24 0.244 0.371 0.298 4.3 2.14 0.584 0.691 0.666 4.1 2.06
FLARE 0.358 0.477 0.450 3.1 2.07 0.190 0.303 0.238 3.1 2.08 0.570 0.674 0.648 2.1 1.39
FS-RAG 0.348 0.483 0.428 2.7 2.70 0.226 0.361 0.286 2.8 2.78 0.540 0.640 0.632 2.5 2.47
IRCoT 0.392 0.502 0.478 2.7 2.70 0.210 0.341 0.268 2.7 2.68 0.526 0.634 0.608 2.7 2.74
Rowen CL 0.002 0.104 0.494 29.5 7.24 0.004 0.061 0.196 29.2 7.19 0.022 0.188 0.656 28.7 7.06
Rowen CM 0.002 0.104 0.494 29.5 7.27 0.004 0.061 0.196 29.2 7.20 0.022 0.188 0.656 28.7 7.12
Rowen Hybrid 0.002 0.104 0.494 55.0 7.27 0.004 0.061 0.196 54.3 7.15 0.022 0.189 0.656 53.4 6.93
Seakr 0.360 0.487 0.406 14.6 1.00 0.226 0.361 0.268 14.6 1.00 0.598 0.692 0.656 14.6 1.00
Table 7: Results of baselines for one-hop datasets. LLMC refers to the average number of LLM calls per question,
while RC indicates the average number of retrieval calls per question. For NQ, the standard deviations of Acc, EM,
and F1 are ≈ 0.022 ± 0.001 across all methods. For SQUAD and TQA, the standard deviations of Acc, EM, and F1
are ≈ 0.018 ± 0.006 across all methods. Overall, the methods exhibit similar deviations, with Rowen showing the
lowest deviation, typically ≤ 0.01.
Method 2Wiki HotPotQA Musique
EM F1 Acc LLMC RC EM F1 Acc LLMC RC EM F1 Acc LLMC RC
No Context 0.302 0.371 0.318 1.0 0.00 0.280 0.372 0.286 1.0 0.00 0.100 0.193 0.106 1.0 0.00
All Context 0.302 0.396 0.374 1.0 1.00 0.386 0.497 0.410 1.0 1.00 0.088 0.167 0.100 1.0 1.00
AdaptiveRAG 0.384 0.471 0.454 2.6 2.64 0.396 0.499 0.414 2.3 2.34 0.122 0.216 0.140 3.6 3.63
DRAGIN 0.406 0.480 0.456 5.8 2.92 0.398 0.506 0.430 5.1 2.56 0.116 0.207 0.134 6.3 3.15
FLARE 0.358 0.451 0.424 3.9 2.85 0.298 0.391 0.372 5.1 4.07 0.076 0.161 0.090 4.1 3.10
FS-RAG 0.348 0.431 0.388 3.8 3.76 0.376 0.503 0.422 3.7 3.70 0.088 0.187 0.100 3.4 3.35
IRCoT 0.362 0.460 0.454 4.4 4.38 0.414 0.516 0.438 3.5 3.45 0.116 0.221 0.138 4.1 4.08
Rowen CL 0.002 0.083 0.444 32.9 7.87 0.002 0.084 0.354 31.9 7.67 0.002 0.034 0.104 42.1 9.52
Rowen CM 0.002 0.083 0.444 32.9 7.87 0.002 0.084 0.356 31.9 7.70 0.002 0.034 0.104 42.1 9.52
Rowen Hybrid 0.002 0.083 0.444 61.8 7.85 0.004 0.086 0.354 59.8 7.63 0.002 0.034 0.102 80.2 9.48
Seakr 0.382 0.460 0.398 12.3 2.44 0.400 0.523 0.424 9.9 1.76 0.112 0.215 0.118 12.3 2.40
Table 8: Results of baselines for multi-hop datasets. LLMC refers to the average number of LLM calls per question,
while RC indicates the average number of retrieval calls per question. For 2Wiki and HotPotQA, the standard
deviations of Acc, EM, and F1 are ≤ 0.022 ± 0.001 across all methods. For Musique, the standard deviations
are ≤ 0.015 ± 0.001. Overall, the methods exhibit similar deviations, with Rowen showing the lowest deviation,
typically ≤ 0.01.
Page 18:
Method NQ SQUAD TQA 2Wiki HotPot Musique Avg
Mean CPMI -2.02 -7.44 -4.38 -2.98 -5.76 0.78 -3.63
Mean PMI -1.45 -8.21 -4.69 -4.19 -5.37 3.20 -3.45
RDE -1.29 -7.18 -3.52 -2.23 -4.68 1.18 -2.95
PTrue -1.94 -7.95 -3.77 -2.03 -5.07 3.53 -2.87
EigValLaplacian -3.28 -7.01 -3.44 -2.29 -3.32 2.35 -2.83
Min Probability -1.83 -5.26 -3.52 -2.98 -3.22 1.96 -2.48
RenyiNeg -1.04 -4.23 -6.12 0.21 -3.38 2.40 -2.03
NumSemSets -0.96 -5.90 -3.39 -0.21 -3.20 1.60 -2.01
FisherRao -0.80 -5.26 -3.44 -2.75 -3.12 3.60 -1.96
Min Entropy -2.22 -4.23 -3.08 -0.85 -1.17 0.38 -1.86
Median Entropy -0.89 -3.97 -3.81 -1.48 -3.80 3.60 -1.72
Median Probability -1.13 -2.44 -4.60 -0.11 -3.41 1.96 -1.62
Hybrid -1.83 -4.71 -4.28 -0.32 -5.12 7.06 -1.53
Mean Probability -0.16 -2.39 -4.22 -0.42 -2.43 1.54 -1.35
Max Entropy -1.90 -2.56 -4.80 -0.43 -3.00 6.40 -1.05
CCP 0.32 -2.18 -6.77 -0.64 -2.44 6.00 -0.95
Max Probability -0.64 -2.95 -4.41 -0.64 -2.24 5.20 -0.95
DegMat 0.56 -2.69 -3.91 -1.57 -2.15 4.80 -0.83
Lex-Similarity -2.50 -3.77 -3.41 -0.85 -2.34 8.00 -0.81
Eccentricity 0.00 -2.31 -2.63 -2.60 -2.71 5.60 -0.78
SAR -0.24 -2.31 -2.87 -1.05 -3.11 5.60 -0.66
SentenceSAR 0.32 -2.42 -3.35 -0.85 -2.15 5.20 -0.54
Semantic Entropy -0.48 -1.77 -2.69 0.43 -1.94 3.53 -0.49
RMD 0.08 -7.56 -1.76 -0.11 -0.49 7.60 -0.37
Perplexity 0.24 -2.66 -4.27 0.85 -1.56 5.60 -0.30
Mean Entropy -0.48 -1.40 -3.94 0.74 -1.76 7.20 0.06
MD 0.00 -2.56 -2.74 0.65 0.00 9.60 0.82
Table 9: Average QA performance differences after transfer (in percentage) for each dataset. Negative values
indicate a loss in In-Accuracy compared to in-domain testing, while positive values represent an In-Accuracy gain.
Page 19:
Method acronym | Method full name | Short description
logit based
FisherRao (Darrin et al., 2023) | Fisher-Rao distance | FisherRao is a distance on the Riemannian space formed by the parametric distributions, using the Fisher information matrix as its metric. It computes the geodesic distance between two discrete distributions.
Max Entropy (Fomicheva et al., 2020) | Maximum Token Entropy | The maximum entropy of all tokens in the generated sequence.
Max Probability | Maximum Sequence Probability | The score leverages the probability of the most likely sequence generation.
Mean CPMI (van der Poel et al., 2022) | Mean conditional pointwise mutual information | Extension of the PMI method by considering only those marginal probabilities for which the entropy of the conditional distribution is above certain threshold.
Mean Entropy (Fomicheva et al., 2020) | Mean Token Entropy | The average entropy of each individual token in the generated sequence.
Mean PMI (Takayama and Arase, 2019) | Mean pointwise mutual information | PMI compares the probability of two events (the question and the generated answer) occurring together to what this probability would be if the events were independent.
Mean Probability | Mean Sequence Probability | The total uncertainty is measured via average sequence probability.
Median Entropy (Fomicheva et al., 2020) | Median Token Entropy | The median entropy of all tokens in the generated sequence.
Median Probability | Median Sequence Probability | The total uncertainty is measured via median sequence probability.
Min Entropy (Fomicheva et al., 2020) | Minimum Token Entropy | The minimum entropy of all tokens in the generated sequence.
Min Probability | Minimum Sequence Probability | The score leverages the probability of the least likely sequence generation.
Perplexity (Fomicheva et al., 2020) | Perplexity | The score computes the average negative log probability of generated tokens, which is further exponentiated.
PTrue (Kadavath et al., 2022) | probability P(true) | The method measures the uncertainty of the claim by asking the LLM itself whether the generated claim is true or not. The confidence is the probability of the first generated token y1 being equal to “True”.
RenyiNeg (Darrin et al., 2023) | Rényi negentropy | The score computes alpha-Renyi-divergence between the sample and the uniform distributions.
SAR (Duan et al., 2023) | Shifting Attention to more Relevant | SAR corrects generative inequalities by reviewing the relevance of each token and emphasizing uncertainty quantification attention to those more relevant components. The relevance is measured by calculating similarity of sentence before and after removing the certain token.
SentenceSAR (Duan et al., 2023) | Shifting Attention to more Relevant at Sentence level | SAR measured at sentence-level.
consistency based
CCP (Fadeeva et al., 2024a) | Claim-Conditioned Probability | The method aggregates token-level uncertainties into a claim-level score, it removes the impact of uncertainty about what claim to generate on the current step and what surface form to use.
DegMat (Lin et al., 2023) | Degree matrix | Using the Degree matrix a new uncertainty measure could be found that reflects the average pairwise distance.
Eccentricity (Lin et al., 2023) | Eccentricity | The smallest k eigenvectors of Laplacian Graph are used as the proxy for the models’ embeddings. Then, we could use the average offset from the average embedding as the uncertainty measure.
EigValLaplacian (Lin et al., 2023) | Sum of eigenvalues of the graph Laplacian | The score uses pairwise similarities between the sampled answers to the questions to form the symmetric weighted adjacency matrix (degree matrix). This matrix is further used to create the graph Laplacian. The sum of Eigenvalues of the Graph Laplacian are used as a measure of uncertainty.
Lex-Similarity (Fomicheva et al., 2020) | Lexical similarity | The score computes how similar two words or phrases are in terms of their meaning.
NumSemSets (Lin et al., 2023) | Number of semantic sets | The number of semantic sets initially equals the total number of generated answers K. If two answers are semantically similar, they are put into one cluster. A higher number of semantic sets corresponds to an increased level of uncertainty, as it suggests a higher number of diverse semantic interpretations for the answer.
Semantic Entropy (Kuhn et al., 2023) | Semantic Entropy | The method aims to deal with the generated sequences that have similar meaning while having different probabilities according to the model. The idea is to cluster generated sequences into several semantically homogeneous clusters with a bi-directional entailment algorithm and average the sequence probabilities within the clusters.
internal-based
MD (Lee et al., 2018) | Mahalanobis distance | In this paper, the authors propose a simple yet effective method for detecting any abnormal samples, which is applicable to any pre-trained softmax neural classifier. They obtain the class conditional Gaussian distributions with respect to (low- and upper-level) features of the deep models under Gaussian discriminant analysis, which result in a confidence score based on the Mahalanobis distance.
RDE (Yoo et al., 2022) | Robust density estimation | The method improves over MD by reducing the dimensionality of the last hidden state of the decoder averaged over all generated tokens via PCA decomposition. Additionally, computing of the covariance matrix for each individual class is done by using the Minimum Covariance Determinant estimation. The uncertainty score is computed as the MD in the space of reduced dimensionality.
RMD (Ren et al., 2023) | Relative Mahalanobis distance | The MD distance score is adjusted by subtracting from it the other MD score computed for some large general purpose dataset covering many domains.
blended approach
Hybrid | Hybrid | Our hybrid approach that uses all uncertainty features defined in the table.
Table 10: Description of the uncertainty estimation methods used in the paper. The methods are grouped by their
categories: logit based, consistency-based, internal-based and hybrid.
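To make the token-level definitions in Table 10 concrete, the following is a minimal sketch of three entropy-based scores and perplexity. It assumes that the per-step next-token distributions and the probabilities of the generated tokens are already available; the function names and input layout are illustrative and are not taken from LM-Polygraph.

```python
import math


def token_entropy(dist):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def logit_based_scores(token_dists, chosen_probs):
    """Compute a few logit-based uncertainty scores for one generated answer.

    token_dists:  list of full probability distributions, one per generated token
                  (assumed input layout, for illustration only).
    chosen_probs: probability assigned to the token actually generated at each step.
    """
    entropies = [token_entropy(d) for d in token_dists]
    log_probs = [math.log(p) for p in chosen_probs]
    return {
        "max_entropy": max(entropies),
        "mean_entropy": sum(entropies) / len(entropies),
        "min_entropy": min(entropies),
        # Perplexity: exponentiated average negative log-probability of the tokens.
        "perplexity": math.exp(-sum(log_probs) / len(log_probs)),
    }
```

In all four cases a larger value is read as higher uncertainty, so any of them can be fed to a downstream classifier that decides whether to retrieve.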
Page 20:
Method EigValLaplacian Lex-Similarity Max Entropy Mean Entropy SAR
EigValLaplacian 1.00 0.03 0.81 0.00 0.00
Lex-Similarity 0.03 1.00 0.38 0.42 0.95
Max Entropy 0.81 0.38 1.00 0.00 0.08
Mean Entropy 0.00 0.42 0.00 1.00 0.86
SAR 0.00 0.95 0.08 0.86 1.00
Table 11: In-Accuracy P-Value, Friedman Test Results: Test Statistic: 29.580 P-value: 0.00001
Method EigValLaplacian Lex-Similarity Max Entropy Mean Entropy SAR
EigValLaplacian 1.00 0.00 0.08 0.00 0.05
Lex-Similarity 0.00 1.00 0.81 0.94 0.88
Max Entropy 0.08 0.81 1.00 0.33 1.00
Mean Entropy 0.00 0.94 0.33 1.00 0.42
SAR 0.05 0.88 1.00 0.42 1.00
Table 12: Accuracy P-Value, Friedman Test Results: Test Statistic: 22.847 P-value: 0.00014
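For readers unfamiliar with the Friedman test reported in Tables 11 and 12, here is a small sketch of how its test statistic can be computed by hand. It assumes that each row of the input is one block (for example, one dataset) and each column is one method; this layout and the omission of a tie correction are assumptions of the sketch, not details confirmed by the paper.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def friedman_statistic(blocks):
    """Friedman chi-square statistic (without tie correction).

    blocks: list of rows; each row holds the scores of k methods on one block.
    """
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
```

Under the null hypothesis the statistic is approximately chi-square distributed with k - 1 degrees of freedom, which is how a p-value such as those in the captions would be obtained.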
Page 21:
D Performance Analysis Across OOD Datasets
[Figure 9 image: two heatmap panels. In each panel both axes list the datasets NQ, SQUAD, TQA, 2Wiki, HotPot, Musique, and six cells read 0%. The color bar is labeled "Percentage Difference (%)". The remaining cell values are listed below row by row, in extraction order.]
Panel 1:
-1.7% -21.3% -1.2% -0.3% -0.4%
-7.6% -47.4% -0.9% -1.9% +0.0%
-11.0% -47.7% -9.0% -19.8% -31.1%
-7.0% -0.2% -40.8% -0.8% +0.2%
-2.0% -3.6% -18.1% -1.2% -0.4%
-7.6% -0.2% -45.1% -0.6% -1.6%
Panel 2:
-2.5% -2.2% -1.6% +0.0% +2.0%
-3.1% -4.7% -2.6% +0.0% +0.0%
-3.9% -28.0% -3.1% -17.6% +9.8%
-3.1% -0.6% -4.1% +1.0% +0.0%
-3.1% -3.2% -1.9% -1.6% +0.0%
-3.1% -0.6% -4.4% -2.6% +0.0%
Figure 9: Heatmap of improvement/decrease of the Accuracy and In-Accuracy scores on the OOD setup for the
EigValLaplacian method.
[Figure 10 image: two heatmap panels. In each panel both axes list the datasets NQ, SQUAD, TQA, 2Wiki, HotPot, Musique, and six cells read 0%. The color bar is labeled "Percentage Difference (%)". The remaining cell values are listed below row by row, in extraction order.]
Panel 1:
-0.7% -4.9% +0.3% +0.3% -0.4%
-4.6% -30.8% +0.6% +0.3% +0.4%
-3.0% -29.4% -9.9% -11.6% -21.0%
-3.3% -0.2% -26.4% +0.6% +0.4%
-1.0% -0.5% -20.3% +0.0% +0.2%
-8.2% -1.0% -48.6% +0.0% -0.8%
Panel 2:
+0.6% -1.2% +0.0% +0.0% +8.0%
-3.1% -4.0% +0.0% +0.0% +4.0%
-1.2% -19.5% -4.3% -11.7% +12.0%
-1.6% +0.6% -3.7% +0.0% +6.0%
-3.5% +0.6% -2.5% +0.0% +10.0%
-3.1% -1.3% -5.6% +0.0% +0.0%
Figure 10: Heatmap of improvement/decrease of the Accuracy and In-Accuracy scores on the OOD setup for the
Lex-Similarity method.
[Figure 11 image: two heatmap panels. In each panel both axes list the datasets NQ, SQUAD, TQA, 2Wiki, HotPot, Musique, and six cells read 0%. The color bar is labeled "Percentage Difference (%)". The remaining cell values are listed below row by row, in extraction order.]
Panel 1:
-0.7% -4.9% +0.3% +0.3% -0.4%
-4.6% -30.8% +0.6% +0.3% +0.4%
-3.0% -29.4% -9.9% -11.6% -21.0%
-3.3% -0.2% -26.4% +0.6% +0.4%
-1.0% -0.5% -20.3% +0.0% +0.2%
-8.2% -1.0% -48.6% +0.0% -0.8%
Panel 2:
+0.6% -1.2% +0.0% +0.0% +8.0%
-3.1% -4.0% +0.0% +0.0% +4.0%
-1.2% -19.5% -4.3% -11.7% +12.0%
-1.6% +0.6% -3.7% +0.0% +6.0%
-3.5% +0.6% -2.5% +0.0% +10.0%
-3.1% -1.3% -5.6% +0.0% +0.0%
Figure 11: Heatmap of improvement/decrease of the Accuracy and In-Accuracy scores on the OOD setup for the
MaxEntropy method.
Page 22:
[Figure 12 image: two heatmap panels. In each panel both axes list the datasets NQ, SQUAD, TQA, 2Wiki, HotPot, Musique, and six cells read 0%. The color bar is labeled "Percentage Difference (%)". The remaining cell values are listed below row by row, in extraction order.]
Panel 1:
+0.0% -5.8% +0.0% +0.0% -5.1%
+0.0% -5.8% +0.0% +0.0% -5.1%
-32.5% -42.0% -8.0% -10.6% -1.7%
+0.0% +0.0% -5.8% -3.7% +0.0%
+0.0% +0.0% -5.8% +0.0% -5.1%
+0.0% +0.0% -5.8% +0.0% +0.9%
Panel 2:
-4.7% -7.0% +1.5% +0.0% -6.0%
-4.2% -6.0% -2.0% -1.1% +0.0%
-6.9% -14.1% -9.2% -20.5% -20.1%
-10.8% -7.7% -8.4% -3.2% -10.4%
+0.0% -2.3% -7.5% -30.7% -8.2%
-4.2% +0.0% -6.0% -2.0% -1.1%
Figure 12: Heatmap of improvement/decrease of the In-Accuracy scores on the OOD setup for the SeaKR and
DRAGIN methods.
[Figure 13 image: two heatmap panels. In each panel both axes list the datasets NQ, SQUAD, TQA, 2Wiki, HotPot, Musique, and six cells read 0%. The color bar is labeled "Percentage Difference (%)". The remaining cell values are listed below row by row, in extraction order.]
Panel 1:
-8.3% -6.3% -11.8% -31.2% -15.1%
-3.2% -3.4% -11.6% -25.8% -17.9%
-4.5% -1.7% -13.4% -23.4% -21.7%
-6.3% -3.8% -6.6% -21.2% +0.0%
-3.0% -3.8% -5.6% -9.4% -12.3%
-6.3% -3.8% -6.6% +0.0% -21.2%
Panel 2:
-3.5% -1.6% -3.5% +0.0% -7.8%
-1.2% +0.0% -3.1% -0.5% -5.2%
-1.2% +0.0% -3.1% -0.5% -9.7%
-1.2% -5.9% -0.6% +0.0% -11.7%
-1.2% -0.7% +0.0% -3.1% -3.9%
-1.2% -11.2% -1.9% -7.5% -0.5%
Figure 13: Heatmap of improvement/decrease of the In-Accuracy scores on the OOD setup for the FLARE and
AdaptiveRAG methods. AdaptiveRAG shows the most stable performance in OOD.
Page 23:
E Feature Importance Analysis for Hybrid Method of Uncertainty Estimation
This section presents Figure 14, which shows the importance ranks of individual uncertainty estimation methods
used as features in the hybrid method. In addition, Figure 15 shows the feature importance estimates as a
bar chart for each dataset.
[Figure 14 image: plot of importance rank per dataset. X-axis: NQ, TriviaQA, SQuAD, HotpotQA, 2WikiMultihopQA, Musique. Y-axis: "Rank (1 = Most Important)", ticks from 0 to 25. Legend: Perplexity, Mean Entropy, Mean Probability, CCP, SentenceSAR.]
Figure 14: Top-5 UE methods as features for the hybrid method across datasets.
[Figure 15 image: six bar charts, one per dataset (NQ, TriviaQA, SQuAD, HotpotQA, 2WikiMultihopQA, Musique). Each chart lists the 26 uncertainty estimation methods sorted by feature importance, from the most negative to the most positive value. The plotted values are the same as those shown in Figure 16.]
Figure 15: Feature Importance for each dataset for Hybrid method.
Page 24:
[Figure 16 image: heatmap of feature importance per uncertainty estimation method (rows) and dataset (columns); the color bar is labeled "Feature Importance". Cell values:]
Method NQ TriviaQA SQuAD HotpotQA 2WikiMultihopQA Musique
Perplexity 0.172 0.195 0.066 0.420 0.064 0.520
Mean Entropy 0.138 0.656 0.727 0.202 0.165 0.621
Max Entropy -0.035 0.008 0.250 -0.300 0.054 -0.050
Median Entropy -0.000 0.474 0.546 -0.615 0.301 -0.378
Min Entropy 0.120 0.004 0.020 0.008 0.004 0.069
Mean Probability -0.032 0.153 0.090 0.210 0.193 0.427
Max Probability 0.026 0.155 0.385 -0.553 -0.144 -0.056
Median Probability -0.431 -0.083 -0.297 -0.168 0.694 0.327
Min Probability 0.030 0.025 0.002 -0.065 -0.006 -0.045
PTrue 0.085 -0.273 0.023 0.337 0.521 -0.053
Mean PMI 0.005 -0.023 -0.024 0.129 0.075 -0.114
Mean CPMI 0.004 0.010 0.018 -0.036 -0.008 0.041
RenyiNeg -0.142 -0.071 -0.307 0.562 0.117 -0.387
FisherRao -0.013 -0.083 -0.108 0.181 0.151 0.206
Semantic Entropy -0.212 -0.121 0.174 -0.084 0.036 -0.148
CCP -0.046 0.863 0.438 -0.280 0.238 0.434
SAR 0.107 0.201 -0.016 0.512 -0.179 -0.008
SentenceSAR 0.535 0.397 -0.089 0.370 -0.146 0.365
NumSemSets 0.173 -0.304 -0.223 0.100 0.440 0.006
EigValLaplacian -0.059 0.353 -0.034 -0.145 -0.221 -0.596
DegMat 0.024 0.167 -0.192 0.021 -0.040 0.172
Eccentricity -0.386 0.185 0.654 0.006 0.041 0.661
Lex-Similarity 0.274 -0.012 0.398 0.029 -0.261 0.177
MD -0.000 0.002 0.001 -0.002 0.001 -0.005
RMD 0.001 -0.003 -0.001 0.003 -0.002 0.005
RDE 0.000 0.000 0.001 0.000 0.000 0.000
Figure 16: Feature Importance across datasets for Hybrid method. Different Uncertainty Estimation methods
showed different performance on different datasets.
Page 25:
F Technical Details
For all experiments, we use the LLaMA 3.1-8b-instruct model with its default generation parameters. For
the baseline methods, we strictly adhere to their original procedures, including prompting, parameter
settings, and other configurations. For testing uncertainty estimation methods, we follow the protocol
of AdaptiveRAG (Jeong et al., 2024), using the same prompt and few-shot examples. For the Rowen
Consistency Model evaluation, we use Qwen 2.5-72B-Instruct (Yang et al., 2024; Team, 2024) as the
verification model instead of the original Qwen-Max-0428 due to API usage limitations.
For all our methods, we use the same retriever, the term-based sparse retrieval model
BM25 (Robertson et al., 1994), and the same version of Elasticsearch, 7.17.9³, following previous
studies (Su et al., 2024b; Yao et al., 2024). For the external document corpus, we use the Wikipedia corpus
preprocessed by Karpukhin et al. (2020).
For all uncertainty methods, we compute scores on both the training and test sets with
LM-Polygraph (Fadeeva et al., 2023), which is distributed under the MIT License. Using the training set
scores, we fit multiple classifiers, including Threshold, Logistic Regression, Decision Tree, KNN, and MLP.
We report the performance of the best classifier according to downstream metrics; a further analysis of
classifier stability is provided in Section A.
For the classifiers, we use the scikit-learn library (Buitinck et al., 2013) with the following configurations:
• Logistic Regression with default hyperparameters.
• Threshold Classifier, optimized by finding the best threshold for In-Accuracy over a log-scaled grid
of size 200, spanning the minimum to maximum training uncertainty values.
• Decision Tree with a maximum depth of 3.
• K-Nearest Neighbors (KNN) using 15 neighbors.
• Multi-Layer Perceptron (MLP) configured with 2 hidden layers, each of size 64.
All hyperparameters remained fixed across all runs to ensure consistency.
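The Threshold Classifier from the list above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes strictly positive uncertainty values (needed for a log-scaled grid), assumes that retrieval is triggered when uncertainty exceeds the threshold, and assumes per-question correctness flags with and without retrieval are available. All function names are hypothetical.

```python
import math


def log_grid(lo, hi, size=200):
    """Geometrically spaced grid from lo to hi; both must be positive."""
    step = (math.log(hi) - math.log(lo)) / (size - 1)
    return [math.exp(math.log(lo) + i * step) for i in range(size)]


def in_accuracy(uncertainty, correct_without, correct_with, threshold):
    """Share of correct final answers when retrieval is used only for
    questions whose uncertainty exceeds the threshold."""
    hits = [
        with_ctx if u > threshold else without_ctx
        for u, without_ctx, with_ctx in zip(uncertainty, correct_without, correct_with)
    ]
    return sum(hits) / len(hits)


def fit_threshold(uncertainty, correct_without, correct_with, size=200):
    """Pick the grid threshold that maximizes In-Accuracy on the training set."""
    grid = log_grid(min(uncertainty), max(uncertainty), size)
    return max(
        grid,
        key=lambda t: in_accuracy(uncertainty, correct_without, correct_with, t),
    )
```

The fitted threshold is then applied unchanged to the test-set uncertainty scores.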
Standard deviation is calculated via bootstrap sampling using 1000 rounds.
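A minimal sketch of such a bootstrap estimate is given below. It assumes that the metric is the mean of per-question scores (for example, 0/1 correctness flags) and that questions are resampled with replacement; the seed and function name are illustrative choices.

```python
import random
import statistics


def bootstrap_std(per_question_scores, rounds=1000, seed=0):
    """Standard deviation of the mean score over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(per_question_scores)
    means = []
    for _ in range(rounds):
        # Resample n questions with replacement and recompute the metric.
        sample = [per_question_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.pstdev(means)
```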
For trainable uncertainty methods, such as Mahalanobis Distance, we split the training data into two
equal parts: one part is used to learn the parameters of the uncertainty method, while the other is used to
train the classifier. For Relative Mahalanobis Distance, we utilize C4 as the source of additional relative
data for training the parameters.
Our evaluation was conducted on an NVIDIA A100 GPU. The total runtime was approximately 6
hours for SeaKR, 18 hours for both DRAGIN and FLARE, 36 hours for IRCoT, 2 hours for training
AdaptiveRAG on IRCoT generations, and 10 hours for Rowen CM, Rowen CL, and Hybrid (with caching).
In contrast, computing all uncertainty estimates at once required less than 1 hour, highlighting their
computational efficiency and reduced CO2 emissions.
G Methods
IRCoT (Trivedi et al., 2023) – Interleaves Retrieval in a CoT – was one of the pioneering methods for
multi-hop questions. The authors proposed an approach that interleaves retrieval with steps
in chain-of-thought (CoT) reasoning. First, K paragraphs relevant to the question Q are retrieved, using Q
as the query. Next, two steps, reason and retrieve, are performed iteratively until the
termination criterion is met. The in-context examples show questions, answers, gold relevant contexts, and
an example CoT for each question. In the reason step, the model is shown the CoT reasoning
generated so far and completes the rest. Although the model can generate multiple sentences in the
CoT, only the first generated sentence is kept. If the CoT contains the phrase "answer is:" (shown in the
in-context examples as the phrase after which the final answer is written, which fixes the answer format)
or the maximum number of iterations has been reached⁴, the process terminates. In
the retrieve step, the last generated sentence in the CoT is used as a query to retrieve additional paragraphs
relevant to the question. These newly retrieved paragraphs are added to those
retrieved in previous steps⁵ and serve as the context for the question.
³ https://www.elastic.co/guide/en/elasticsearch/reference/7.17/release-notes-7.17.9.html
Page 26:
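The IRCoT loop described above can be sketched as follows. This is an illustrative outline under stated assumptions: `retrieve` and `generate_sentence` are hypothetical callables standing in for the BM25 retriever and the LLM, and the default `k=4` is a placeholder. The limits of 8 iterations and 15 paragraphs follow the footnotes.

```python
def ircot(question, retrieve, generate_sentence, k=4, max_iters=8, max_paragraphs=15):
    """Interleave single-sentence CoT generation with retrieval.

    retrieve(query, k) -> list of paragraphs (hypothetical retriever interface).
    generate_sentence(question, paragraphs, cot) -> next CoT sentence
        (hypothetical LLM interface; only the first generated sentence is kept).
    """
    paragraphs = list(retrieve(question, k))[:max_paragraphs]
    cot = []
    for _ in range(max_iters):
        # Reason step: extend the chain of thought by one sentence.
        sentence = generate_sentence(question, paragraphs, cot)
        cot.append(sentence)
        if "answer is:" in sentence:
            break
        # Retrieve step: use the last CoT sentence as the next query.
        for paragraph in retrieve(sentence, k):
            if paragraph not in paragraphs and len(paragraphs) < max_paragraphs:
                paragraphs.append(paragraph)
    return cot, paragraphs
```

With stub callables, the loop stops as soon as a sentence contains the phrase "answer is:", or after `max_iters` reasoning steps otherwise.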
Adaptive RAG (Jeong et al., 2024) uses the complexity of the question for adaptive retrieval. Simple
questions can be answered without retrieval at all, while complex questions require a multistep approach
with iterative use of both LLMs and retrievers. Because users often ask simple and straightforward
questions, a strategy designed for complex questions is largely inefficient for
simple queries. The authors proposed a balanced strategy by training a classifier that predicts one of
three outcomes: no retrieval (class A), a single retrieval (class B), or multiple retrievals
(class C, for which the authors use IRCoT (Trivedi et al., 2023)). The classifier, based on the t5-large model, is
trained on the development parts of the six considered datasets. To label the training data, the authors ran
each question through all three strategies and marked the most efficient correct one as the true class. For
example, if all three strategies produce the correct answer, class A is the label. As additional training data,
the authors used the inductive biases in the datasets (this concept assumes that simple questions should be
answered with one-step retrieval and complex questions with multistep retrieval).
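The labeling rule described above can be sketched as a small function. The fallback to a dataset-level label when no strategy answers correctly is one possible reading of the "inductive biases" sentence and is an assumption of this sketch; the function and parameter names are hypothetical.

```python
def complexity_label(correct_no_retrieval, correct_single_step, correct_multi_step,
                     dataset_is_multi_hop=None):
    """Return the training label (A, B, or C) for one question.

    The cheapest strategy that answers correctly wins. If none is correct and the
    dataset type is known, fall back to the dataset's inductive bias (assumption).
    """
    if correct_no_retrieval:
        return "A"  # no retrieval needed
    if correct_single_step:
        return "B"  # one retrieval step suffices
    if correct_multi_step:
        return "C"  # multistep retrieval (IRCoT) is required
    if dataset_is_multi_hop is None:
        return None  # question stays unlabeled
    return "C" if dataset_is_multi_hop else "B"
```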
FLARE (Jiang et al., 2023) – Forward-Looking Active Retrieval augmented generation – is a method
designed to improve the performance of LLMs by selectively integrating external knowledge. The idea
behind FLARE is to monitor the token probabilities produced by the LLM while it generates the answer.
If the model generates a token with a probability below a threshold (i.e., the model is uncertain), FLARE
intervenes by querying an external knowledge source, such as a search engine or structured knowledge
base, to retrieve relevant information. Using this additional context, FLARE regenerates the response
and continues until the next uncertain token or the end of generation. This approach balances
generation quality and response speed.
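A simplified sketch of this retrieve-on-low-confidence loop is shown below. It works at the sentence level and uses two hypothetical callables, `generate_sentence` and `retrieve`; the threshold value and the interface are assumptions made for illustration, not FLARE's actual API.

```python
def flare(question, generate_sentence, retrieve, threshold=0.5, max_sentences=10):
    """Generate an answer sentence by sentence, retrieving only when uncertain.

    generate_sentence(question, answer_so_far, context) -> (sentence, token_probs),
        or (None, []) when generation is finished (hypothetical LLM interface).
    retrieve(query) -> retrieved context (hypothetical retriever interface).
    """
    answer = []
    for _ in range(max_sentences):
        sentence, token_probs = generate_sentence(question, answer, context=None)
        if sentence is None:
            break
        # A low-probability token signals uncertainty: retrieve and regenerate.
        if token_probs and min(token_probs) < threshold:
            context = retrieve(sentence)
            sentence, _ = generate_sentence(question, answer, context=context)
        answer.append(sentence)
    return " ".join(answer)
```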
DRAGIN (Su et al., 2024b) – Dynamic Retrieval Augmented Generation based on the Information
Needs of LLMs – follows a similar approach to FLARE by monitoring the model’s token probabilities
during generation. If the LLM produces tokens with low likelihood, indicating uncertainty or knowledge
gaps, DRAGIN triggers a retrieval process. To better identify uncertain tokens, DRAGIN filters
out all stopwords.⁶ The paper also introduces an additional step: reformulating the query with keywords
before retrieving information. These keywords are selected based on the model’s internal attention
weights and reasoning, allowing the system to determine what information is necessary and target relevant
external knowledge sources more effectively. By incorporating new knowledge and ensuring the relevance
of the retrieved information, DRAGIN improves the coherence of the final response. This approach
reduces the risk of retrieving irrelevant documents and optimizes the model’s reasoning process, especially
in situations where queries may be ambiguous or incomplete.
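The stopword filtering step can be illustrated with a short sketch. It covers only the part described above (ignore stopwords, then check token likelihoods); the tiny stopword set, the threshold, and the function name are placeholders, and the actual DRAGIN trigger is more elaborate than this.

```python
# Tiny illustrative stopword list; a real system would use a full list (e.g. from spaCy).
STOPWORDS = {"the", "a", "an", "of", "is", "in", "and", "to"}


def needs_retrieval(tokens, token_probs, threshold=0.3):
    """Return True if any non-stopword token was generated with low probability."""
    content = [
        (token, prob)
        for token, prob in zip(tokens, token_probs)
        if token.lower() not in STOPWORDS
    ]
    return any(prob < threshold for _, prob in content)
```

Filtering stopwords first prevents low-probability function words, which carry little factual content, from triggering retrieval.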
Rowen (Ding et al., 2024) – Retrieve Only When It Needs – is an approach to
reducing hallucinations in LLMs. It uses an adaptive retrieval mechanism to improve the
accuracy of LLM output, deciding when to use external knowledge sources
based on cross-language and cross-model consistency evaluation.
The Rowen Consistency Language (Rowen CL) component of Rowen generates semantically
equivalent perturbations of the input query in English and Chinese. The model is
asked to produce variations of the same question, and the consistency of
the responses generated in the different languages is then compared. A high degree of inconsistency among
these responses indicates uncertainty in the model’s understanding, prompting the system to initiate a
retrieval process to gather factual information that may clarify or correct the initial response. The Rowen
Consistency Model (Rowen CM) extends this idea by assessing the semantic coherence of responses generated by
different models, OpenAI GPT-3.5 and Qwen-Max-0428⁷, as described in the original paper. A final consistency
score is calculated by comparing outputs from the primary language model with those generated by a
verification model. Rowen Hybrid is the hybrid version of Rowen CL and Rowen CM: if the sum of
the consistency scores for both CL and CM is greater than the threshold, the retriever is used to mitigate
potential hallucinations.
⁴ set to 8 in the experiments
⁵ maximum amount of paragraphs is set to 15 to fit the model’s context limit
⁶ https://spacy.io/usage/linguistic-features
Page 27:
To ensure a reproducible and comparable evaluation, we reimplemented the Rowen
approach using LLaMA3.1-8b-instruct as the primary model and Qwen 2.5-72B-Instruct (Yang et al.,
2024; Team, 2024) as the verification model for consistency model evaluation.
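The Rowen Hybrid decision rule can be sketched as follows. The sketch treats each score as the share of answer pairs that disagree, so that a higher score means less consistency; this reading, the `agree` callable (a stand-in for a semantic equivalence check), and the function names are assumptions made for illustration.

```python
def inconsistency(answer_pairs, agree):
    """Share of answer pairs judged inconsistent by the `agree` callable."""
    disagreements = sum(1 for a, b in answer_pairs if not agree(a, b))
    return disagreements / len(answer_pairs)


def rowen_hybrid_should_retrieve(cl_pairs, cm_pairs, agree, threshold):
    """Trigger retrieval when the summed CL and CM scores exceed the threshold.

    cl_pairs: (answer in language 1, answer in language 2) for each perturbed question.
    cm_pairs: (primary model answer, verification model answer) for each question.
    """
    score = inconsistency(cl_pairs, agree) + inconsistency(cm_pairs, agree)
    return score > threshold
```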
SeaKR (Yao et al., 2024) – Self-aware Knowledge Retrieval for Adaptive RAG uses an uncertainty
approach to minimise hallucinations in LLMs. SeaKR uses the model’s internal states to extract self-aware
uncertainty, activating external knowledge sources only when the LLM exhibits high uncertainty during
generation. This selective retrieval mechanism increases the accuracy and reliability of the generated
output.
The SeaKR Uncertainty Module (SeaKR UM) is a core component that monitors the internal states
of the LLM to quantify its self-aware uncertainty. When the uncertainty level exceeds a predefined
threshold, SeaKR UM triggers the retrieval process to retrieve relevant knowledge snippets from external
databases. To ensure the most effective integration of the retrieved information, the SeaKR Re-ranking
Component (SeaKR RC) re-orders the retrieved snippets based on their ability to reduce the model’s
uncertainty, selecting the snippet that provides the greatest clarity and factual accuracy.
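The interplay of the two SeaKR components can be outlined in a few lines. In this sketch, `uncertainty` and `retrieve` are hypothetical callables standing in for the internal-state uncertainty estimator and the retriever; the code illustrates only the control flow described above.

```python
def seakr_select(question, uncertainty, retrieve, threshold):
    """Return the snippet to add to the prompt, or None if retrieval is skipped.

    uncertainty(question, snippet) -> float, with snippet=None meaning no context
        (hypothetical interface to the self-aware uncertainty estimate).
    retrieve(question) -> list of candidate snippets (hypothetical interface).
    """
    # Uncertainty module: skip retrieval when the model is already confident.
    if uncertainty(question, None) <= threshold:
        return None
    # Re-ranking component: keep the snippet that reduces uncertainty the most.
    snippets = retrieve(question)
    return min(snippets, key=lambda snippet: uncertainty(question, snippet))
```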
To ensure a reproducible and comparable evaluation, we reimplemented the SeaKR
model using Llama-3.1-8b-instruct for the estimation of self-aware uncertainty. For consistency, we
use the same eigenscore threshold as in the original paper; we also tried other thresholds, but this one
gave the best results.
H Correlations between evaluation metrics across each dataset
InAcc EM F1 Acc AUC Corr
InAcc 1.00 0.63 0.75 0.09 -0.02 0.05
EM 0.63 1.00 0.93 -0.12 0.09 0.09
F1 0.75 0.93 1.00 -0.06 0.08 0.09
Acc 0.09 -0.12 -0.06 1.00 0.21 0.15
AUC -0.02 0.09 0.08 0.21 1.00 0.79
Corr 0.05 0.09 0.09 0.15 0.79 1.00
Table 13: Spearman correlations between evaluation metrics normalized across each dataset. The result reveals a
low correlation between downstream metrics (InAcc, EM, F1) and self-knowledge metrics (Acc, AUC, Corr). This
underscores the importance of conducting a more comprehensive evaluation of self-knowledge of adaptive retrieval
systems, rather than relying solely on downstream performance.
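For reference, the Spearman correlation used in Table 13 is the Pearson correlation computed on ranks. The following self-contained sketch shows the computation for two lists of metric values; it is a generic illustration rather than the exact evaluation script, and the per-dataset normalization mentioned in the caption is not reproduced here.

```python
import math


def average_ranks(values):
    """Rank values in ascending order (1 = smallest), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```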
⁷ https://qwenlm.github.io/blog/qwen-max-0428/
Page 28:
Method acronym | Method full name | Short description
logit-based
FisherRao (Darrin et al., 2023) | Fisher-Rao distance | FisherRao is a distance on the Riemannian space formed by the parametric distributions, using the Fisher information matrix as its metric. It computes the geodesic distance between two discrete distributions.
Max Entropy (Fomicheva et al., 2020) | Maximum Token Entropy | The maximum entropy of all tokens in the generated sequence.
Max Probability | Maximum Sequence Probability | The score leverages the probability of the most likely sequence generation.
Mean CPMI (van der Poel et al., 2022) | Mean conditional pointwise mutual information | Extension of the PMI method that considers only those marginal probabilities for which the entropy of the conditional distribution is above a certain threshold.
Mean Entropy (Fomicheva et al., 2020) | Mean Token Entropy | The average entropy over the individual tokens in the generated sequence.
Mean PMI (Takayama and Arase, 2019) | Mean pointwise mutual information | PMI compares the probability of two events (the question and the generated answer) occurring together to what this probability would be if the events were independent.
Mean Probability | Mean Sequence Probability | The total uncertainty is measured via the average sequence probability.
Median Entropy (Fomicheva et al., 2020) | Median Token Entropy | The median entropy of all tokens in the generated sequence.
Median Probability | Median Sequence Probability | The total uncertainty is measured via the median sequence probability.
Min Entropy (Fomicheva et al., 2020) | Minimum Token Entropy | The minimum entropy of all tokens in the generated sequence.
Min Probability | Minimum Sequence Probability | The score leverages the probability of the least likely sequence generation.
Perplexity (Fomicheva et al., 2020) | Perplexity | The score computes the average negative log probability of generated tokens, which is further exponentiated.
PTrue (Kadavath et al., 2022) | Probability P(true) | The method measures the uncertainty of the claim by asking the LLM itself whether the generated claim is true or not. The confidence is the probability of the first generated token y1 being equal to “True”.
RenyiNeg (Darrin et al., 2023) | Rényi negentropy | The score computes the alpha-Rényi divergence between the sample and the uniform distributions.
SAR (Duan et al., 2023) | Shifting Attention to more Relevant | SAR corrects generative inequalities by reviewing the relevance of each token and shifting uncertainty quantification attention toward the more relevant components. The relevance is measured by calculating the similarity of the sentence before and after removing a given token.
SentenceSAR (Duan et al., 2023) | Shifting Attention to more Relevant at Sentence level | SAR measured at the sentence level.
consistency-based
CCP (Fadeeva et al., 2024a) | Claim-Conditioned Probability | The method aggregates token-level uncertainties into a claim-level score; it removes the impact of uncertainty about what claim to generate on the current step and what surface form to use.
DegMat (Lin et al., 2023) | Degree matrix | The degree matrix is used to derive an uncertainty measure that reflects the average pairwise distance.
Eccentricity (Lin et al., 2023) | Eccentricity | The smallest k eigenvectors of the graph Laplacian are used as a proxy for the model's embeddings. The average offset from the average embedding then serves as the uncertainty measure.
EigValLaplacian (Lin et al., 2023) | Sum of eigenvalues of the graph Laplacian | The score uses pairwise similarities between the sampled answers to the questions to form the symmetric weighted adjacency matrix (degree matrix). This matrix is further used to create the graph Laplacian. The sum of the eigenvalues of the graph Laplacian is used as a measure of uncertainty.
Lex-Similarity (Fomicheva et al., 2020) | Lexical similarity | The score computes how similar two words or phrases are in terms of their meaning.
NumSemSets (Lin et al., 2023) | Number of semantic sets | The number of semantic sets initially equals the total number of generated answers K. If two answers are semantically similar, they are put into one cluster. A higher number of semantic sets corresponds to an increased level of uncertainty, as it suggests a higher number of diverse semantic interpretations for the answer.
Semantic Entropy (Kuhn et al., 2023) | Semantic Entropy | The method aims to deal with generated sequences that have similar meaning while having different probabilities according to the model. The idea is to cluster generated sequences into several semantically homogeneous clusters with a bi-directional entailment algorithm and average the sequence probabilities within the clusters.
internal-based
MD (Lee et al., 2018) | Mahalanobis distance | A simple yet effective method for detecting abnormal samples that is applicable to any pre-trained softmax neural classifier. It obtains the class-conditional Gaussian distributions with respect to (low- and upper-level) features of the deep model under Gaussian discriminant analysis, which results in a confidence score based on the Mahalanobis distance.
RDE (Yoo et al., 2022) | Robust density estimation | The method improves over MD by reducing the dimensionality of the last hidden state of the decoder, averaged over all generated tokens, via PCA decomposition. Additionally, the covariance matrix for each individual class is computed using Minimum Covariance Determinant estimation. The uncertainty score is computed as the MD in the space of reduced dimensionality.
RMD (Ren et al., 2023) | Relative Mahalanobis distance | The MD score is adjusted by subtracting from it another MD score computed for some large general-purpose dataset covering many domains.
blended approach
Hybrid | Hybrid | Our hybrid approach that uses all uncertainty features defined in the table.
Table 14: Description of the uncertainty estimation methods used in the paper. The methods are grouped by their
categories: logit-based, consistency-based, internal-based, and hybrid.
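To make the token-level scores in the table concrete, the following is a minimal sketch of a few of them, computed from per-step next-token distributions. It is an illustration under stated assumptions, not the implementation used in the paper: the input layout (`token_dists`, `chosen_probs`) and the function names are hypothetical, entropies are in nats, and the sequence probability is taken as the product of the generated tokens' probabilities.

```python
import math
from statistics import mean, median


def token_entropy(dist):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def logit_based_scores(token_dists, chosen_probs):
    """
    token_dists:  one probability distribution over the vocabulary per generated token
    chosen_probs: probability assigned to the token actually generated at each step
    Both inputs are hypothetical stand-ins for what an LLM decoding API would expose.
    """
    entropies = [token_entropy(d) for d in token_dists]
    log_probs = [math.log(p) for p in chosen_probs]
    return {
        # Aggregations of per-token entropy over the generated sequence.
        "mean_entropy": mean(entropies),
        "max_entropy": max(entropies),
        "min_entropy": min(entropies),
        "median_entropy": median(entropies),
        # Probability of the whole sequence (product of token probabilities).
        "sequence_probability": math.exp(sum(log_probs)),
        # Exponentiated average negative log probability of generated tokens.
        "perplexity": math.exp(-mean(log_probs)),
    }
```

For a two-token answer whose first step is a coin flip between two tokens and whose second step is deterministic, the mean entropy is ln(2)/2 and the perplexity is sqrt(2). For the entropy-based scores and perplexity, higher values indicate higher uncertainty, so a retrieval trigger could threshold on any of them.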
Dataset      NQ                   SQUAD                TQA                  2Wiki                HotPot               Musique
Method       InAcc↑  LMC↓  RC↓    InAcc↑  LMC↓  RC↓    InAcc↑  LMC↓  RC↓    InAcc↑  LMC↓  RC↓    InAcc↑  LMC↓  RC↓    InAcc↑  LMC↓  RC↓
Never RAG    0.446   1.0   0.00   0.176   1.0   0.00   0.636   1.0   0.00   0.318   1.0   0.00   0.286   1.0   0.00   0.106   1.0   0.00
Always RAG   0.496   1.0   1.00   0.312   1.0   1.00   0.610   1.0   1.00   0.374   1.0   1.00   0.410   1.0   1.00   0.100   1.0   1.00
AdaptiveRAG  0.496   2.0   0.98   0.286   2.0   0.97   0.636   1.5   0.54   0.454   5.2   2.64   0.44    4.7   2.41   0.154   3.6   3.63
DRAGIN       0.480   4.5   2.24   0.298   4.3   2.14   0.667   4.0   2.0    0.456   5.8   2.92   0.435   5.1   2.5    0.134   6.3   3.15
FLARE        0.462   4.26  2.0    0.288   3.2   2.5    0.648   2.1   1.39   0.424   3.9   2.85   0.372   5.1   4.07   0.106   4.3   3.11
Seakr        0.406   14.6  1.00   0.286   14.0  1.00   0.656   14.6  1.00   0.398   12.3  2.44   0.424   9.9   1.76   0.118   12.3  2.40
Ideal        0.608   1.6   0.55   0.360   1.8   0.82   0.736   1.4   0.36   0.500   1.7   0.68   0.460   1.7   0.71   0.164   1.9   0.89
Table 15: QA Performance of adaptive retrieval fine-tuned with in-domain data. ‘Ideal’ represents the performance
of a system with an oracle providing ideal predictions for the need to retrieve. ‘InAcc’ denotes In-Accuracy,
measuring the QA system’s performance. ‘LMC’ indicates the mean number of LM calls per question, and ‘RC’
represents the mean number of retrieval calls per question. Among the adaptive retrieval methods, AdaptiveRAG and
DRAGIN achieve the highest InAcc on every dataset.
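The three reported quantities can be illustrated with a small sketch of the two baselines and the oracle. It rests on an assumption inferred from the numbers in Table 15 rather than stated in the caption: the oracle first answers without retrieval (one LM call) and retrieves, with one further LM call, exactly when that answer is wrong. This reading is consistent with Ideal's RC being close to 1 minus the InAcc of Never RAG (on NQ, 1 − 0.446 ≈ 0.55) and its LMC being close to 1 + RC. All function names and the boolean per-question inputs are hypothetical.

```python
def summarize(correct, lm_calls, retrieval_calls):
    """Average per-question correctness and call counts into InAcc, LMC, RC."""
    n = len(correct)
    return {
        "InAcc": sum(correct) / n,
        "LMC": sum(lm_calls) / n,
        "RC": sum(retrieval_calls) / n,
    }


def never_rag(correct_no_rag):
    """Always answer from parametric knowledge: one LM call, no retrieval."""
    n = len(correct_no_rag)
    return summarize(correct_no_rag, [1] * n, [0] * n)


def always_rag(correct_rag):
    """Always retrieve once, then answer with a single LM call."""
    n = len(correct_rag)
    return summarize(correct_rag, [1] * n, [1] * n)


def ideal_oracle(correct_no_rag, correct_rag):
    """Assumed oracle: retrieve only when the no-retrieval answer is wrong."""
    correct, lm_calls, retrieval_calls = [], [], []
    for ok_plain, ok_rag in zip(correct_no_rag, correct_rag):
        if ok_plain:
            # Parametric answer suffices: one LM call, no retrieval.
            correct.append(True)
            lm_calls.append(1)
            retrieval_calls.append(0)
        else:
            # Retrieval is needed: one retrieval and a second LM call.
            correct.append(ok_rag)
            lm_calls.append(2)
            retrieval_calls.append(1)
    return summarize(correct, lm_calls, retrieval_calls)
```

With `correct_no_rag = [True, False, False, True]` and `correct_rag = [False, True, False, True]`, both baselines reach an InAcc of 0.5, while the oracle reaches 0.75 at an LMC of 1.5 and an RC of 0.5. This mirrors the pattern in the table, where Ideal exceeds both baselines at moderate cost.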