Context Filtering with Reward Modeling in Question Answering
Sangryul Kim
KAIST AI
sangryul@kaist.ac.kr
James Thorne
KAIST AI
thorne@kaist.ac.kr
Abstract
Question Answering (QA) in NLP is the task
of finding answers to a query within a relevant
context retrieved by a retrieval system. Yet, the
mix of relevant and irrelevant information in
these contexts can hinder performance enhance-
ments in QA tasks. To address this, we intro-
duce a context filtering approach that removes
non-essential details, summarizing crucial con-
tent through Reward Modeling. This method
emphasizes keeping vital data while omitting
the extraneous during summarization model
training. We offer a framework for developing
efficient QA models by discerning useful infor-
mation from dataset pairs, bypassing the need
for costly human evaluation. Furthermore, we
show that our approach can significantly outper-
form the baseline, as evidenced by a 6.8-fold
increase in the EM Per Token (EPT) metric,
which we propose as a measure of token efficiency,
indicating a notable token-efficiency
boost for low-resource settings.¹
1 Introduction
The ability of language models to effectively under-
stand and process long texts has become a critical
requirement, particularly for question-answering
(QA) applications (Beltagy et al., 2020; Feldman
and El-Yaniv, 2019; Nan et al., 2021; Caciularu
et al., 2022). However, several studies highlight the
problem that even with a substantial amount of rel-
evant context provided, the inclusion of irrelevant
content within the context can adversely affect over-
all performance (Shi et al., 2023; Akimoto et al.,
2023; Sauchuk et al., 2022; Oh and Thorne, 2023).
The challenge often lies in distinguishing useful
information from irrelevant details.
¹Code and datasets are available at https://github.com/xfactlab/coling2025-context-filtering
[Figure 1 (diagram): an origin context about the song "A Country Boy Can Survive" yields a good summary and a bad summary; the reward model produces a better summary, which the reader uses with the question "when did country boy can survive come out" to answer "January 1982".]
Figure 1: For an effective QA task, we filter the context
by producing better summaries with a reward model.
At the same time, rewards extracted from the data
make it possible to discern which parts are helpful and
which are filtered out.
Our research tackles this issue by introducing
a new approach to filter out unnecessary content,
focusing on summarizing the key points through
Reward Modeling. Specifically, we take note of
the Direct Preference Optimization (DPO) method
(Rafailov et al., 2023), which trains models using
positive and negative feedback from datasets composed
of pairs of "chosen" and "rejected" texts.
We suggest employing this technique for filtering
the context in QA tasks. We particularly focus
on the process of constructing the chosen and rejected
datasets by paying attention to the presence or
absence of specific information within the three essential
elements required for the QA task: context,
question, and answer. We investigate how the presence
or lack of each piece of information impacts
the reward modeling process, thus contributing to
the development of an efficient context filtering
model. Our method ultimately aims to enhance the
efficiency of QA models by identifying and retain-
ing only the most relevant information to the query,
thereby improving performance.
arXiv:2412.11707v1 [cs.CL] 16 Dec 2024
We study context efficiency by introducing an
EM Per Token (EPT) metric and use it for comparisons
between models. This allows us to evaluate
the trade-off between context length and the an-
swer Exact Match (EM) score. This is motivated
by the fact that more context induces diminishing
returns (Izacard and Grave, 2021) and comes with
performance overheads.
2 Backgrounds
2.1 Knowledge Refinement
For open-domain question answering, the prevail-
ing trend in research is focused on retrieving the
correct context from a corpus of knowledge to con-
dition a reader (Chen et al., 2017; Karpukhin et al.,
2020; De Cao et al., 2021; Petroni et al., 2021). Simultaneously,
language models have been used to
augment question information (Chen et al., 2023),
generate context (Yu et al., 2022), and summarize
or paraphrase retrieved information (Xu et al.,
2023; Lee et al., 2023). However, recent studies
claim that the presence of irrelevant information
within the context can lead to a decrease in the per-
formance of the model in a process called detrimen-
tal retrieval (Sauchuk et al., 2022; Oh and Thorne,
2023; Shi et al., 2023; Akimoto et al., 2023). There-
fore, there is a need for research on models that can
filter detrimental content from the retrieved context
while aggregating only the essential information.
2.2 Reward Modeling
Many methodologies for training LLMs have been
developed to align models to user preferences:
models are steered away from generating toxic
or unhelpful responses. Many methods are de-
rived from the Bradley-Terry model of competition
(Bradley and Terry, 1952), using preference pairs
containing chosen and rejected responses. In par-
ticular, Reinforcement Learning from Human Feed-
back (Ouyang et al., 2022, RLHF) incorporates hu-
man evaluations for training a reward model that is
used to score responses, guiding fine-tuning a text
generation model to align with user preferences.
However, a drawback of this approach is the need to
collect human preferences for training, which can
be costly. Additionally, Proximal Policy Optimiza-
tion (PPO), commonly used in RLHF, is sensitive
to hyperparameter settings. Improper tuning can
lead to training instability or divergence (Schulman
et al., 2017; Hsu et al., 2020). Overcoming some of
these limitations, Rafailov et al. (2023) proposed
Direct Preference Optimization (DPO). In DPO,
the model itself serves as the source of the reward
function. Let $y_w$ be the preferred output and $y_l$ be
the rejected output for a given task prompt
$x$, and denote the dataset $D = \{x^{(i)}, y_w^{(i)}, y_l^{(i)}\}_{i=1}^{N}$.
We can denote the loss $L_R(r_\phi, D)$ with the reward
function $r_\phi(x, y)$ as below (Rafailov et al., 2023):
$$L_R(r_\phi, D) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right] \tag{1}$$
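As a small numerical illustration of Equation (1), the sketch below evaluates the pairwise loss for hand-picked reward values. The function names and the reward numbers are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairwise_reward_loss(reward_pairs):
    """Average of -log sigmoid(r_w - r_l) over (chosen, rejected) reward pairs."""
    losses = [-log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs]
    return sum(losses) / len(losses)


# Hypothetical reward values, used only to show how the loss behaves.
no_margin = pairwise_reward_loss([(0.0, 0.0)])     # chosen and rejected scored equally
wide_margin = pairwise_reward_loss([(2.0, -1.0)])  # chosen scored well above rejected
print(no_margin, wide_margin)
```

With equal rewards the loss equals log 2, and it shrinks as the reward of the chosen output moves above that of the rejected one, which is the margin the training objective widens.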
In the QA task, we assume that items with a very
high likelihood of containing the answer and whose
surrounding context is closely related to the answer
are "chosen", while those with a low likelihood
of containing the answer and whose surrounding
context is likely unrelated to the answer are "re-
jected". Therefore, we design experiments to find
a method that filters only the context necessary for
the QA model by conducting training in a way that
increases the margin between "chosen" and "re-
jected" in association with the DPO reward loss.
3 Context Filtering
Our experiments study models
for open-domain question answering. We adopt a
summarize-then-read pipeline, adapting
the model structure from Inoue et al. (2021). The
pipeline consists of an abstractive summarization
model and a question answering model. During the
summarize phase, we compare SFT and DPO fine-
tuning to determine how efficiently each model
summarizes. In the question answering phase, we
compare the ability to find the correct answers
within the context that has been filtered through
summarization. We use the FLAN-T5-XL model
(Chung et al., 2022)² (3B parameters) from Hugging Face (Wolf
et al., 2020) throughout all experiments.
This seq2seq encoder-decoder model
is pre-trained for abstractive summarization and
captures the main point of the context. All specific
settings, such as prompts and model templates used
in the experiments, can be found in Appendix B.
3.1 Data Generation
Our experiments are conducted on three question
answering datasets: SQuAD v1.1 (Rajpurkar et al.,
2016), Natural Questions (NQ) (Kwiatkowski et al.,
2019), and TriviaQA (TQA) (Joshi et al., 2017). Retrieved
contextual information is selected from
DPR (Karpukhin et al., 2020)³ for NQ and TQA.
For NQ and TQA, instances where the answer to
the question is not within the top-1 relevant context
are excluded. We split the training set of SQuAD to
create an additional validation dataset and use the
existing validation set as the test set. The datasets
suffixed with "r" in the tables (NQr, TQAr, SQuADr) are those that have
undergone this process. Detailed information is in
Appendix C.
²https://huggingface.co/google/flan-t5-xl
³https://github.com/facebookresearch/DPR
In QA, the task is to find the corresponding
answer $A$ for a given question $Q$ and
a context $C$. We establish three strategies for
summarization by varying which elements of
the necessary information, denoted as the tuple
$I = (Q, A, C)$, are provided. We can construct prompts with
different combinations of information:
Type 1 consists of $I_1 = (Q, A, C)$, containing
all the information (question, answer, and context),
aiming to achieve the best possible summarization
for comparison and the highest likelihood of
generating correct answers for pairwise training.
However, this would not be realistic in an unseen
test setting.
Type 2 consists of $I_2 = (Q, C)$, and the goal is
to summarize the parts of the context related to the
question without providing information about
the answer. This may be most realistic for application
to unseen questions and represents the task
signature of the model we aim to train.
Type 3 is composed of $I_3 = (A, C)$, focusing on
summarizing or extracting related parts in a more
lexical aspect, ensuring that the model includes
the answer, regardless of the context. We use the
outputs generated from the various types of prompt
configurations in the following experiments.
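The three prompt types can be sketched as a small template lookup. The template wording follows Appendix B.1; the exact line breaks, the `build_prompt` helper, and the field names are assumptions made for this illustration rather than the released implementation.

```python
# Templates follow the wording in Appendix B.1; whitespace is an assumption.
TEMPLATES = {
    1: ("Summarize below context into one sentence according to fit the "
        "following context, question and answer.\n"
        "Context: {context}\nQuestion: {question}\nAnswer: {answer}\nOutput:"),
    2: ("Summarize below context into one sentence according to fit the "
        "following context and question.\n"
        "Context: {context}\nQuestion: {question}\nOutput:"),
    3: ("Summarize below context into one sentence according to fit the "
        "following context and answer.\n"
        "Context: {context}\nAnswer: {answer}\nOutput:"),
}

# Which elements of I = (Q, A, C) each prompt type is allowed to see.
REQUIRED_FIELDS = {
    1: ("context", "question", "answer"),
    2: ("context", "question"),
    3: ("context", "answer"),
}


def build_prompt(prompt_type, **fields):
    """Fill the template for one prompt type, using only its permitted fields."""
    required = REQUIRED_FIELDS[prompt_type]
    missing = [name for name in required if name not in fields]
    if missing:
        raise ValueError(f"Type {prompt_type} prompt is missing: {missing}")
    return TEMPLATES[prompt_type].format(**{name: fields[name] for name in required})


example = {
    "context": "The song was released as a single in January 1982.",
    "question": "when did country boy can survive come out",
    "answer": "January 1982",
}
type1_prompt = build_prompt(1, **example)
type2_prompt = build_prompt(2, **example)  # the answer is dropped
type3_prompt = build_prompt(3, **example)  # the question is dropped
```

Restricting each template to its permitted fields keeps the answer out of Type 2 prompts even when the caller passes a full record.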
3.2 SFT Summarizer Training
The objective of Supervised Fine-Tuning (SFT) is
to fine-tune the base model to generate the output
summary $O_1$, created with the Type 1 prompt,
when given a context and question (Type 2) as
input, learning the policy $\pi_{\text{sft}}(O_1 \mid I_2)$. Through
this process, the fine-tuned summarization model
$\pi_{\text{sft}}$ should be able to utilize all the information
included in $I_1 = (Q, A, C)$ to produce the best
summarization results given $I_2 = (Q, C)$.
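One way to picture the SFT data is as input/target pairs in which the input is a Type 2 prompt and the target is the Type 1 output $O_1$. The sketch below is illustrative only: the `make_sft_example` helper and the `input`/`target` keys are hypothetical, and the summary string is borrowed from Figure 1 as a stand-in for a real $O_1$.

```python
# Type 2 template wording follows Appendix B.1; whitespace is an assumption.
TYPE2_TEMPLATE = (
    "Summarize below context into one sentence according to fit the "
    "following context and question.\n"
    "Context: {context}\nQuestion: {question}\nOutput:"
)


def make_sft_example(context, question, type1_summary):
    """Pair an answer-free Type 2 input with the Type 1 summary O1 as the target."""
    return {
        "input": TYPE2_TEMPLATE.format(context=context, question=question),
        "target": type1_summary,
    }


sft_example = make_sft_example(
    context="The song was released as a single in January 1982.",
    question="when did country boy can survive come out",
    # Stand-in for O1: a summary generated with question, answer, and context.
    type1_summary='The song "A Country Boy Can Survive" was released in January 1982.',
)
```

The answer never appears in the input, so the model has to learn to produce answer-aware summaries from the question and context alone.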
3.3 DPO Summarizer Training
For DPO training, we require pairs of outputs
$(y_1, y_2)$ and need to determine which summary
is $y_w$ and which is $y_l$, satisfying $y_w \succ y_l \mid x$ (Rafailov
et al., 2023). Typically, human labelers or an LLM
determine $y_w$ and $y_l$, but in our setting
we assume that outputs from the base model
with two prompt types (Type 2 and Type 3), denoted
$O_2$ and $O_3$ respectively, can serve as candidates for $y_l$.
This is because we assume that the outputs from
Type 2 and Type 3, which were created with missing
information, are less preferred than those from
Type 1. We aim to understand how the lack of
information in each of Types 2 and 3 affects the
reward model through DPO training compared to
Type 1. Hence, we construct two different DPO
models, $\pi^{\text{dpo}}_{O_1,O_2}$ and $\pi^{\text{dpo}}_{O_1,O_3}$.
We use Hugging Face TRL
(von Werra et al., 2020) for DPO training.
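The pairing described above can be sketched as follows. The `prompt`/`chosen`/`rejected` keys mirror the preference-dataset layout commonly used with TRL's DPO trainer. The record layout, the `make_preference_pairs` helper, and the choice to skip identical pairs are assumptions for illustration, not details confirmed by the paper.

```python
def make_preference_pairs(records, rejected_key):
    """Build DPO pairs with O1 as chosen and O2 or O3 as rejected.

    `records` is a hypothetical list of dicts holding a Type 2 prompt and the
    base-model outputs O1, O2, O3 for the same (question, context).
    """
    pairs = []
    for record in records:
        chosen, rejected = record["O1"], record[rejected_key]
        if chosen == rejected:
            # Assumption: identical outputs carry no preference signal, so skip them.
            continue
        pairs.append({"prompt": record["prompt"], "chosen": chosen, "rejected": rejected})
    return pairs


records = [
    {
        "prompt": "Summarize below context into one sentence ... Output:",  # placeholder prompt
        "O1": 'The song "A Country Boy Can Survive" was released in January 1982.',
        "O2": "The song was released as a single in January 1982.",
        "O3": 'The song "A Country Boy Can Survive" was released in January 1982.',
    }
]
pairs_o1_o2 = make_preference_pairs(records, "O2")  # data for the (O1, O2) model
pairs_o1_o3 = make_preference_pairs(records, "O3")  # data for the (O1, O3) model
```

Each of the two DPO models is then trained on its own list of pairs, so the only difference between them is which incomplete-information output plays the rejected role.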
3.4 Reader Training
We study the impact of reducing context through
summarization, and thus the reduced context
length, on the downstream reader. We train the
reader to generate answers using the filtered context,
which is obtained by summarizing each
dataset with the Type 1 trained summarization models.
For baseline evaluation, using Type 1 will determine
how well each model can deliver answers
without loss of information in the "read" phase of
the summarize-then-read pipeline.
3.5 Evaluation
To assess the reader’s ability to generate answers us-
ing summarized contexts, we use the exact match
(EM) score and unigram F1 metrics (Rajpurkar
et al., 2016). For the NQ and TQA datasets, where
there are multiple possible answers, we initially fil-
tered based on whether the first correct answer was
included in the context, treating the first answer as
correct if it matched the given answer. Addition-
ally, we calculate the number of tokens required to
find the answer in the context using our proposed
EM Per Token (EPT) metric. EPT serves as an
indicator of the efficiency of the given context, c:
$$\mathrm{EPT}(c, y^*, \hat{y}) = \frac{\mathrm{EM}(y^*, \hat{y})}{|c|} \tag{2}$$
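A minimal sketch of Equation (2) is given below. It assumes SQuAD-style answer normalization for EM and uses whitespace splitting as a stand-in for the model tokenizer; both choices are assumptions for illustration. The last lines check that the EPT column of Table 1 appears consistent with dividing the EM percentage by the mean token length.

```python
import re
import string


def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(gold, prediction):
    """Return 1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(gold) == normalize(prediction))


def ept(context_tokens, gold, prediction):
    """EM Per Token: exact match divided by the number of context tokens."""
    if not context_tokens:
        raise ValueError("context must contain at least one token")
    return exact_match(gold, prediction) / len(context_tokens)


# Whitespace tokens stand in for the real tokenizer (assumption).
summary = "A Country Boy Can Survive was released in January 1982."
score = ept(summary.split(), gold="January 1982", prediction="january 1982.")

# Corpus-level check against the NQr "Origin" row of Table 1: EM 59.59, Tok Len 147.30.
table_level_ept = round(59.59 / 147.30, 2)
```

A correct answer found in a 10-token summary scores 0.1, whereas the same answer found in a 150-token passage would score roughly fifteen times lower.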
Furthermore, to evaluate the performance of the
summarization itself, we introduce a metric called
Inclusion Rate of Answer (IRA) shown in Table
2. The IRA can verify whether the target answer
is fully contained within the given context. If the
answer is exactly present, it is evaluated as 1; if it is
partially missing or not included, it is evaluated as
0. This metric assesses how well the summarization
is done in terms of leaving room for the reader to
find the answer, essentially checking the potential
for finding the answer before the reader phase. This
metric evaluates the summarization process itself,
independent of the subsequent reading phase.

Dataset      NQr                            TQAr                           SQuADr
Model        EM     F1     Tok Len  EPT     EM     F1     Tok Len  EPT     EM     F1     Tok Len  EPT
Origin       59.59  67.71  147.30   0.40    77.38  83.44  152.41   0.51    68.32  82.97  179.83   0.38
SFT          55.21  63.28   29.99   1.84    69.68  75.49   28.88   2.41    59.66  76.07   30.26   1.97
DPO O1,O2    49.87  58.68   41.92   1.19    66.06  72.80   39.52   1.67    48.79  61.85   35.61   1.37
DPO O1,O3    52.83  59.83   29.97   1.76    62.29  68.16   18.20   3.42    55.40  69.28   21.46   2.58
Table 1: Performance comparison across models on different datasets. EM: Exact Match, F1: F1-score, Tok Len:
Token Length, EPT: EM Per Token.

Model        NQr     TQAr    SQuADr
Original     100     100     100
SFT          75.93   78.76   89.94
DPO O1,O2    71.94   77.27   66.93
DPO O1,O3    68.60   65.91   73.88
Table 2: Inclusion Rate of Answer (IRA, %): the proportion of
examples whose target answer is included in the context
summarized by each model.
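The IRA metric described above can be sketched as a substring check averaged over examples. Whether the original evaluation normalizes case or punctuation before matching is not stated, so the plain case-sensitive containment used here is an assumption, and the helper name is hypothetical.

```python
def inclusion_rate_of_answer(answers, summaries):
    """Percentage of summaries that contain their target answer verbatim."""
    if len(answers) != len(summaries):
        raise ValueError("answers and summaries must be aligned")
    # 1 if the answer is fully present in the summary, 0 if partially missing or absent.
    hits = [1 if answer in summary else 0 for answer, summary in zip(answers, summaries)]
    return 100.0 * sum(hits) / len(hits)


answers = ["January 1982", "Hank Williams Jr."]
summaries = [
    "A Country Boy Can Survive was released in January 1982.",
    "The song was recorded by an American country music artist.",  # answer lost in filtering
]
ira = inclusion_rate_of_answer(answers, summaries)
```

Because the check runs on the summary alone, it bounds what any reader could extract from the filtered context.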
4 Impact of Context Filtering
Trade-off between Token Length and Accuracy
Metrics. This experiment focuses on extractive
QA datasets, where answering a question requires
not only identifying the correct answer and its rel-
evant context but also understanding the relation-
ships between different parts of the context that
contribute to the answer. By comparing the IRA
scores in Table 2 with the EM scores in Table 1,
we observe that context length does not directly
correlate with accuracy. To further investigate, we
analyze the contribution of individual tokens to the
EM score using the previously defined EPT metric.
Our analysis, conducted with both the SFT and
DPO models, demonstrates that models retain most of their
accuracy with reduced context lengths. For the NQ
dataset, reducing the token length to just 20% of
the original retains 92% of the initial performance.
Similarly, for the TQA dataset, a token length
of 12% preserves 80% of the performance, while
for the SQuAD dataset, 12% of the token length
achieves 81% of the original performance. These
results highlight the potential for substantial token
reduction without a severe loss in accuracy and
underscore potential efficiency gains.
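The retention figures quoted above can be recomputed from Table 1. Matching each claim to a table row is our inference: the SFT row for NQ and the DPO O1,O3 rows for TQA and SQuAD give ratios that agree with the quoted percentages to within rounding. The helper name is hypothetical.

```python
# (EM, token length) pairs copied from Table 1.
TABLE1 = {
    "NQ": {"Origin": (59.59, 147.30), "SFT": (55.21, 29.99)},
    "TQA": {"Origin": (77.38, 152.41), "DPO O1,O3": (62.29, 18.20)},
    "SQuAD": {"Origin": (68.32, 179.83), "DPO O1,O3": (55.40, 21.46)},
}


def retention(dataset, model):
    """Return (% of original token length, % of original EM), one decimal place."""
    origin_em, origin_len = TABLE1[dataset]["Origin"]
    model_em, model_len = TABLE1[dataset][model]
    return (round(100 * model_len / origin_len, 1), round(100 * model_em / origin_em, 1))


nq = retention("NQ", "SFT")
tqa = retention("TQA", "DPO O1,O3")
squad = retention("SQuAD", "DPO O1,O3")
print(nq, tqa, squad)
```

Under this reading, about one fifth of the NQ tokens and about one eighth of the TQA and SQuAD tokens are kept, while at least four fifths of the original EM is preserved in each case.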
From another perspective, we can observe that
the filtering process in SFT and DPO also leads to
the loss of information related to the answer span.
This can be interpreted in two ways. Firstly, when
examining the prompt "Summarize below context
into one sentence..." used in SFT and DPO, the
answer might have been deemed relatively less im-
portant in the process of reducing it to one sentence
from a summarization standpoint. This suggests
that some dependency on the prompt and the model
remains, indicating room for improvement. Sec-
ondly, our approach to filtering the context did not
simply involve lexically cutting off existing content
or directly extracting it (Wang et al., 2023); rather,
we restructured it through a prompt, transforming
filtering into a summarization task.
Despite this, the results in Table 1 still reveal
that the trade-off between token length and EM
and F1 metrics is favorable; the efficiency gained
from filtering outweighs the slight loss in accuracy,
indicating benefits of filtering.
Comparison with Different Reward Model
Strategies. DPO is known to enable credit assign-
ment down to the token level (Rafailov et al., 2024).
Therefore, after training, we can indirectly estimate
how it influenced tokens by analyzing the metrics in
Table 1 and the EM rates based on length. During
the Data Generation phase (Section 3.1), we empir-
ically observe that Type 2 outputs, O2, are summa-
rized in a form that provides short answers about
the context and question, while the Type 3 outputs,
O3, are more reflective of the lexical elements sur-
rounding the answer in the context. Furthermore,
the presence of the answer in the prompt for Type
3 induces generations that are more similar to Type
1 compared to Type 2. Therefore, during DPO
training, the DPO$_{O_1,O_2}$ model is designed to produce
longer outputs that are more similar to those
of Type 1, whereas responses from DPO$_{O_1,O_3}$ are
shorter outputs centered around elements present
in Type 1 but absent in Type 3 outputs, indicating
the intended direction of the training results.
5 Conclusions
In this work, we aim to construct an efficient QA
system by filtering out unnecessary information
from the context. By introducing the EPT met-
ric, we assess the efficiency of the context in QA
tasks. The results demonstrate that using the DPO
model with reward modeling and its underlying
SFT model for filtering is more efficient (per token)
than using the full original context. In our future
experiments following this study, we plan to ap-
ply the reward modeling concept developed here to
conduct experiments on effective retrieval models.
Moving beyond filtering within a single context,
we aim to explore filtering across multiple contexts
and incorporate a reward model that uses reader
performance as a reward. Our goal is to build an
integrated Information Retrieval (IR) system that
enhances the overall efficiency and effectiveness of
the retrieval process.
Limitations
In this work, we focus on deriving an efficient con-
text filtering model through reward modeling. The
essence of the reward lies in the base model, such
as the SFT model evaluating the reward, and the
pairs of chosen and rejected outputs for DPO training. We
generate data relying on the model’s parameters
by providing complete information and incomplete
information according to each type, without human
evaluation, and use this data for our experiments.
During this process, we identify the introduction
of some unintended biases, leading to biased re-
sults and inconsistent outcomes across the datasets.
Given the nature of the research, it is crucial to
generate data that aligns with the intended purpose
during the training data generation stage. There-
fore, when conducting additional follow-up experi-
ments with this idea, we believe it is necessary to
establish proper metrics at the data level and a ver-
ification process for evaluating the efficiency and
assessing context filtering in QA tasks, separate
from the efficiency evaluation.
Moreover, for our experiments, we used datasets
that were modified and reduced from existing
sources like NQ, TQA, and SQuAD after under-
going specific processing to suit our experimen-
tal needs. We believe that adding more diverse
datasets, including those covering multi-hop QA
(Yang et al., 2018) and long context QA (Fan et al.,
2019), would allow for deeper interpretations.
Ethics Statement
In our experiments, we utilize the FLAN-T5
(Chung et al., 2022) model, which is a T5 (Raffel
et al., 2020) model further enhanced with instruc-
tion tuning. Therefore, during the training process,
the pre-trained parameters could indirectly or di-
rectly influence the outcomes, leading to uncon-
trolled generations that might result in content that
is ethically or socially problematic. Consequently,
when disclosing such research to the public, it’s
crucial to be aware of these potential risks and
consider implementing engineering and research-
based mitigative measures to prevent such issues.
We utilize ChatGPT in the process of writing pa-
pers, using it for grammar correction and refining
sentences. During this process, expressions may be
included that differ from the intended meaning.
Acknowledgments
This work was partly supported by Institute for
Information & communications Technology
Planning & Evaluation (IITP) grant funded by
the Korea government (MSIT) (RS-2019-II190075,
Artificial Intelligence Graduate School Support
Program (KAIST)) and National Research Foundation
of Korea (NRF) grant funded by the Korea
government (MSIT) (RS-2024-00406715, AI for All:
A Social Platform for Barrier-free AI). This work
was also supported by the Artificial intelligence
convergence cluster development project funded by
the Ministry of Science and ICT (MSIT, Korea) &
Gwangju Metropolitan City.
References
Kosuke Akimoto, Kunihiro Takeoka, and Masafumi
Oyamada. 2023. Context quality matters in training
fusion-in-decoder for extractive open-domain ques-
tion answering. In Findings of the Association for
Computational Linguistics: EMNLP 2023 , pages
11711–11729, Singapore. Association for Compu-
tational Linguistics.
Iz Beltagy, Matthew E. Peters, and Arman Cohan.
2020. Longformer: The long-document transformer.
Preprint , arXiv:2004.05150.
Ralph Allan Bradley and Milton E. Terry. 1952. Rank
analysis of incomplete block designs: I. the method
of paired comparisons. Biometrika , 39(3/4):324–
345.
Avi Caciularu, Ido Dagan, Jacob Goldberger, and Ar-
man Cohan. 2022. Long context question answering
via supervised contrastive learning. In Proceedings
of the 2022 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies , pages 2872–2879,
Seattle, United States. Association for Computational
Linguistics.
Andong Chen, Yuan Sun, Xiaobing Zhao, Rosella
Galindo Esparza, Kehai Chen, Yang Xiang, Tiejun
Zhao, and Min Zhang. 2023. Improving low-resource
question answering by augmenting question infor-
mation. In Findings of the Association for Compu-
tational Linguistics: EMNLP 2023 , pages 10413–
10420, Singapore. Association for Computational
Linguistics.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine
Bordes. 2017. Reading Wikipedia to answer open-
domain questions. In Proceedings of the 55th Annual
Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers) , pages 1870–1879,
Vancouver, Canada. Association for Computational
Linguistics.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret
Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi
Wang, Mostafa Dehghani, Siddhartha Brahma, et al.
2022. Scaling instruction-finetuned language models.
arXiv preprint arXiv:2210.11416 .
Nicola De Cao, Gautier Izacard, Sebastian Riedel, and
Fabio Petroni. 2021. Autoregressive entity retrieval.
In 9th International Conference on Learning Representations,
ICLR 2021, Virtual Event, Austria, May
3-7, 2021. OpenReview.net.
Angela Fan, Yacine Jernite, Ethan Perez, David Grang-
ier, Jason Weston, and Michael Auli. 2019. ELI5:
Long form question answering. In Proceedings of
the 57th Annual Meeting of the Association for Com-
putational Linguistics , pages 3558–3567, Florence,
Italy. Association for Computational Linguistics.
Yair Feldman and Ran El-Yaniv. 2019. Multi-hop para-
graph retrieval for open-domain question answering.
In Proceedings of the 57th Annual Meeting of the Association
for Computational Linguistics, pages 2296–2309,
Florence, Italy. Association for Computational
Linguistics.
Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, and
Moritz Hardt. 2020. Revisiting design choices
in proximal policy optimization. arXiv preprint
arXiv:2009.10897 .
Naoya Inoue, Harsh Trivedi, Steven Sinha, Niranjan Bal-
asubramanian, and Kentaro Inui. 2021. Summarize-
then-answer: Generating concise explanations for
multi-hop reading comprehension. In Proceedings of
the 2021 Conference on Empirical Methods in Natu-
ral Language Processing , pages 6064–6080, Online
and Punta Cana, Dominican Republic. Association
for Computational Linguistics.
Gautier Izacard and Edouard Grave. 2021. Leveraging
passage retrieval with generative models for open domain
question answering. In Proceedings of the 16th
Conference of the European Chapter of the Association
for Computational Linguistics: Main Volume,
pages 874–880, Online. Association for Computational
Linguistics.
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke
Zettlemoyer. 2017. TriviaQA: A large scale distantly
supervised challenge dataset for reading comprehen-
sion. In Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics (Vol-
ume 1: Long Papers) , pages 1601–1611, Vancouver,
Canada. Association for Computational Linguistics.
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick
Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and
Wen-tau Yih. 2020. Dense passage retrieval for open-
domain question answering. In Proceedings of the
2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP) , pages 6769–6781,
Online. Association for Computational Linguistics.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red-
field, Michael Collins, Ankur Parikh, Chris Alberti,
Danielle Epstein, Illia Polosukhin, Jacob Devlin, Ken-
ton Lee, Kristina Toutanova, Llion Jones, Matthew
Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob
Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natu-
ral questions: A benchmark for question answering
research. Transactions of the Association for Compu-
tational Linguistics , 7:452–466.
Yejoon Lee, Philhoon Oh, and James Thorne. 2023.
Knowledge corpus error in question answering. In
Findings of the Association for Computational Lin-
guistics: EMNLP 2023 , pages 9183–9197, Singapore.
Association for Computational Linguistics.
Feng Nan, Cicero Nogueira dos Santos, Henghui Zhu,
Patrick Ng, Kathleen McKeown, Ramesh Nallapati,
Dejiao Zhang, Zhiguo Wang, Andrew O. Arnold, and
Bing Xiang. 2021. Improving factual consistency
of abstractive summarization via question answering.
In Proceedings of the 59th Annual Meeting of the
Association for Computational Linguistics and the
11th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers), pages
6881–6894, Online. Association for Computational
Linguistics.
Philhoon Oh and James Thorne. 2023. Detrimental con-
texts in open-domain question answering. In Find-
ings of the Association for Computational Linguis-
tics: EMNLP 2023 , pages 11589–11605, Singapore.
Association for Computational Linguistics.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida,
Carroll Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, et al.
2022. Training language models to follow instruc-
tions with human feedback. Advances in Neural
Information Processing Systems , 35:27730–27744.
Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick
Lewis, Majid Yazdani, Nicola De Cao, James Thorne,
Yacine Jernite, Vladimir Karpukhin, Jean Maillard,
Vassilis Plachouras, Tim Rocktäschel, and Sebastian
Riedel. 2021. KILT: a benchmark for knowledge
intensive language tasks. In Proceedings of the 2021
Conference of the North American Chapter of the
Association for Computational Linguistics: Human
Language Technologies , pages 2523–2544, Online.
Association for Computational Linguistics.
Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea
Finn. 2024. From $r$ to $q^*$: Your language
model is secretly a q-function. In First Conference
on Language Modeling .
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano
Ermon, Christopher D Manning, and Chelsea Finn.
2023. Direct preference optimization: Your language
model is secretly a reward model. arXiv preprint
arXiv:2305.18290 .
Colin Raffel, Noam Shazeer, Adam Roberts, Kather-
ine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the
limits of transfer learning with a unified text-to-text
transformer. Journal of Machine Learning Research ,
21(140):1–67.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and
Percy Liang. 2016. SQuAD: 100,000+ questions for
machine comprehension of text. In Proceedings of
the 2016 Conference on Empirical Methods in Natu-
ral Language Processing , pages 2383–2392, Austin,
Texas. Association for Computational Linguistics.
Artsiom Sauchuk, James Thorne, Alon Halevy, Nicola
Tonellotto, and Fabrizio Silvestri. 2022. On the role
of relevance in natural language processing tasks. In
Proceedings of the 45th International ACM SIGIR
Conference on Research and Development in Infor-
mation Retrieval , pages 1785–1789.
John Schulman, Filip Wolski, Prafulla Dhariwal,
Alec Radford, and Oleg Klimov. 2017. Proxi-
mal policy optimization algorithms. arXiv preprint
arXiv:1707.06347 .
Freda Shi, Xinyun Chen, Kanishka Misra, Nathan
Scales, David Dohan, Ed Chi, Nathanael Schärli, and
Denny Zhou. 2023. Large language models can be
easily distracted by irrelevant context. In Proceed-
ings of the 40th International Conference on Machine
Learning , ICML’23. JMLR.org.
Leandro von Werra, Younes Belkada, Lewis Tunstall,
Edward Beeching, Tristan Thrush, Nathan
Lambert, and Shengyi Huang. 2020. TRL: Transformer
reinforcement learning. https://github.com/huggingface/trl.
Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan
Parvez, and Graham Neubig. 2023. Learning to filter
context for retrieval-augmented generation. arXiv
preprint arXiv:2311.08377 .
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien
Chaumond, Clement Delangue, Anthony Moi, Pierric
Cistac, Tim Rault, Remi Louf, Morgan Funtowicz,
Joe Davison, Sam Shleifer, Patrick von Platen,
Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu,
Teven Le Scao, Sylvain Gugger, Mariama Drame,
Quentin Lhoest, and Alexander Rush. 2020. Transformers:
State-of-the-art natural language processing.
In Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing: System
Demonstrations, pages 38–45, Online. Association
for Computational Linguistics.
Fangyuan Xu, Weijia Shi, and Eunsol Choi. 2023.
Recomp: Improving retrieval-augmented lms with
compression and selective augmentation. Preprint ,
arXiv:2310.04408.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio,
William Cohen, Ruslan Salakhutdinov, and Christo-
pher D. Manning. 2018. HotpotQA: A dataset for
diverse, explainable multi-hop question answering.
In Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing, pages
2369–2380, Brussels, Belgium. Association for Computational
Linguistics.
Wenhao Yu, Dan Iter, Shuohang Wang, Yichong
Xu, Mingxuan Ju, Soumya Sanyal, Chenguang
Zhu, Michael Zeng, and Meng Jiang. 2022. Gen-
erate rather than retrieve: Large language mod-
els are strong context generators. arXiv preprint
arXiv:2209.10063 .
A Hyperparameters
In our experiments, we conduct three training
phases: SFT training, DPO training, and reader
training. For each experiment, we utilize one
NVIDIA A100 80GB or NVIDIA H100 80GB
GPU. We set the number of training epochs to 3,
and both the training and evaluation batch sizes to
4, with a gradient accumulation step of 32. The
optimizer used is "paged_adamw_32bit," with a
learning rate of 2e-4, and we employ a "cosine"
type learning rate scheduler.
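For reference, the settings above can be collected in one configuration object. The field names below follow the parameter names of Hugging Face `TrainingArguments`, which is an assumption about how the settings map onto the training code. The `TrainConfig` class itself is hypothetical, and the single-GPU effective batch size is derived from the stated values rather than reported in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters from Appendix A (field names are an assumed mapping)."""
    num_train_epochs: int = 3
    per_device_train_batch_size: int = 4
    per_device_eval_batch_size: int = 4
    gradient_accumulation_steps: int = 32
    optim: str = "paged_adamw_32bit"
    learning_rate: float = 2e-4
    lr_scheduler_type: str = "cosine"

    def effective_batch_size(self, num_gpus: int = 1) -> int:
        """Examples contributing to each optimizer step."""
        return self.per_device_train_batch_size * self.gradient_accumulation_steps * num_gpus


config = TrainConfig()
examples_per_step = config.effective_batch_size()  # one GPU, as in the experiments
```

With a per-device batch of 4 and 32 accumulation steps on a single GPU, each optimizer update sees 128 examples.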
B Prompts and Templates
B.1 Prompts for Data Generation and SFT
Training
[Type 1]
Summarize below context into one
sentence according to fit the
following context, question and
answer.
Context: {context}
Question: {question}
Answer: {answer}
Output:
[Type 2]
Summarize below context into one
sentence according to fit the
following context and question.
Context: {context}
Question: {question}
Output:
[Type 3]
Summarize below context into one
sentence according to fit the
following context and answer.
Context: {context}
Answer: {answer}
Output:
B.2 Prompts for Reader Training
[Train Input]
Given the context and question,
predict the answer to the question.
Context: {context}
Question: {question}
Answer:
[Train Output]
{target answer}
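The reader template above can be filled with a filtered context as in the sketch below. The `make_reader_example` helper, the `input`/`output` keys, and the line breaks in the template are assumptions for illustration; the example strings come from Figure 1.

```python
# Template wording follows Appendix B.2; whitespace is an assumption.
READER_TEMPLATE = (
    "Given the context and question, predict the answer to the question.\n"
    "Context: {context}\nQuestion: {question}\nAnswer:"
)


def make_reader_example(filtered_context, question, target_answer):
    """Build one reader training pair from a summarized (filtered) context."""
    return {
        "input": READER_TEMPLATE.format(context=filtered_context, question=question),
        "output": target_answer,
    }


reader_example = make_reader_example(
    filtered_context="A Country Boy Can Survive was released in January 1982.",
    question="when did country boy can survive come out",
    target_answer="January 1982",
)
```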
C Details of Datasets
Split        NQr               TQAr              SQuADr
Train        79,618 (43,032)   78,785 (15,759)   80,000
Validation   8,757 (4,164)     8,837 (1,534)     7,599
Test         3,610 (1,554)     11,313 (1,883)    10,570
Table 3: The number of examples used in the entire
pipeline. For NQ and TQA, the numbers in parentheses
indicate the actual amount of data involved.
The number of examples involved in the experiments
is shown in Table 3.