Authors: Ioannis Chasandras, Odysseas S. Chlapanis, Ion Androutsopoulos
AUEB-Archimedes at RIRAG-2025:
Is obligation concatenation really all you need?
Ioannis Chasandras¹, Odysseas S. Chlapanis¹,² and Ion Androutsopoulos¹,²
¹ Department of Informatics, Athens University of Economics and Business, Greece
² Archimedes/Athena RC, Greece
Abstract
This paper presents the systems we developed
for RIRAG-2025, a shared task that requires
answering regulatory questions by retrieving
relevant passages. The generated answers are
evaluated using RePASs, a reference-free and
model-based metric. Our systems use a combi-
nation of three retrieval models and a reranker.
We show that by exploiting a neural component
of RePASs that extracts important sentences
(‘obligations’) from the retrieved passages, we
achieve a dubiously high score (0.947), even
though the answers are directly extracted from
the retrieved passages and are not actually gen-
erated answers. We then show that by selecting
the answer with the best RePASs among a few
generated alternatives and then iteratively re-
fining this answer by reducing contradictions
and covering more obligations, we can generate
readable, coherent answers that achieve a more
plausible and relatively high score (0.639).
1 Introduction
The Regulatory Information Retrieval and Answer Generation (RIRAG)¹ shared task focuses on the development of systems that can effectively retrieve relevant information from regulatory texts to generate accurate answers for obligation-related queries. It is divided into two subtasks: passage retrieval, where systems identify the ten most relevant passages from regulatory documents, and answer generation, which requires synthesizing comprehensive answers from the retrieved passages.
We participated with three systems and released our code publicly.² Each of them uses a Rank Fusion (Wang et al., 2021) combination of three retrieval models: BM25 (Robertson et al., 1994) and two neural domain-specific retrievers, based on a law-specific and a finance-specific embedding model, respectively. We also apply a neural reranker to the top-N retrieved passages.
¹ https://regnlp.github.io/
² https://github.com/nlpaueb/verify-refine-repass
For answer generation, our first system adver-
sarially exploits the evaluation metric of the task,
called RePASs, by using one of its neural com-
ponents. Specifically, we extract important sen-
tences (‘obligations’) from the retrieved passages
and then concatenate these sentences to get an ‘an-
swer’. Even though the produced answers may
be incoherent and may not answer the question directly, this system achieves a near-perfect score, much higher than the score of human experts. The second system extends this approach with an LLM
that generates an answer (for each question) by it-
eratively reformulating (as parts of an answer) the
extracted obligations of the previous system. This
results in more readable answers, but performance
deteriorates to RePASs scores below those of the
challenge’s baseline (Gokhan et al., 2024).
Our third system works by a) generating mul-
tiple candidate answers and using RePASs to se-
lect the best answer, and b) iteratively refining the
selected answer by removing contradictions and
adding ‘obligation’ sentences that increase RePASs.
This system performs worse than the adversarial
(first) system, but much better than the baseline,
and the answers are coherent and readable.
2 Task setup
Dataset: The dataset of the task consists of train,
development, and test sets (22k, 2.8k, 2.7k ques-
tions respectively). Passages are retrieved from a
corpus of 40 regulatory documents from the Abu
Dhabi Global Markets (ADGM) collection. The
task organizers used a separate hidden test set, with
446 questions, to evaluate the participants.
arXiv:2412.11567v1 [cs.CL] 16 Dec 2024
Evaluation: Passage retrieval is evaluated using recall@10 and MAP@10. Answer generation is evaluated using RePASs, a reference-free metric (Gokhan et al., 2024). To calculate RePASs, entailment and contradiction scores are obtained by
comparing each sentence of the retrieved passages
(used as premises) with each sentence of the gener-
ated answer (hypothesis) using an NLI model. For
each generated sentence (of the answer), the high-
est probabilities for entailment and contradiction
(comparing to retrieved sentences) are selected, and
the scores are averaged over all the sentences of
the answer. Additionally, obligation-sentences are
extracted from the retrieved passages using a Legal-
BERT model (Chalkidis et al., 2020) fine-tuned on
a synthetic dataset (Gokhan et al., 2024). For an
obligation to be considered covered by the gener-
ated answer, a sentence of the answer must entail
the obligation-sentence with a confidence above a
certain threshold, according to another NLI model.
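The computation described above can be sketched as follows. This is a minimal illustration, not the official RePASs implementation: the NLI probabilities are assumed to be precomputed and passed in as plain lists, the function names and the default `threshold` of 0.7 are hypothetical, and the final combination (E − C + OC + 1) / 3 is an assumption that is not stated in this paper, although it reproduces the rows of Table 5.

```python
from typing import Dict, List


def combine(entailment: float, contradiction: float, coverage: float) -> float:
    """Assumed RePASs combination: (E - C + OC + 1) / 3, which lies in [0, 1]."""
    return (entailment - contradiction + coverage + 1.0) / 3.0


def repass(entail: List[List[float]],
           contra: List[List[float]],
           oblig_entail: List[List[float]],
           threshold: float = 0.7) -> Dict[str, float]:
    """Sketch of the RePASs components from precomputed NLI probabilities.

    entail[i][j] / contra[i][j]: probability that passage sentence j (premise)
    entails / contradicts answer sentence i (hypothesis).
    oblig_entail[k][i]: probability that answer sentence i entails obligation k.
    threshold: hypothetical coverage threshold (the real value is not given here).
    """
    if not entail or not contra:
        raise ValueError("the answer must contain at least one sentence")
    # For each answer sentence keep the highest probability, then average.
    e = sum(max(row) for row in entail) / len(entail)
    c = sum(max(row) for row in contra) / len(contra)
    # An obligation is covered if some answer sentence entails it confidently.
    covered = sum(1 for row in oblig_entail if row and max(row) > threshold)
    # Assumption: with no extracted obligations, coverage is taken to be 0.
    oc = covered / len(oblig_entail) if oblig_entail else 0.0
    return {"entailment": e, "contradiction": c, "obligation": oc,
            "repass": combine(e, c, oc)}
```

Under this assumed combination, `combine(0.837, 0.260, 1.000)` gives about 0.859, which matches the human-expert row of Table 5.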
3 Passage retrieval
All three of our systems use the same passage re-
trieval, which improves upon the baseline retrieval
system of the shared task (Gokhan et al., 2024)
in three ways: a) we use domain-specific neural
retrieval models, b) we extend the Rank Fusion ap-
proach (Wang et al., 2021) to include three models
instead of two, and c) we use a reranker.
3.1 Retrieval models
We experiment with BM25 (Robertson et al., 1994) and three of the best³ text embedding models: text-embedding-3-large (OL3) from OpenAI (Neelakantan et al., 2022), and voyage-law-2 (VL2) and voyage-finance-2 (VF2) from Voyage.⁴ The OL3 embedding model is only used for comparison; it is not included in our final systems, because the domain-specific embedding models worked better. We also use the voyage-rerank-2 reranker.
3.2 Rank Fusion
The task combines the financial and legal domains,
which motivates using two domain-specific neural
retrievers. Also, according to Wang et al. (2021),
BM25 should be fused with neural retrievers, be-
cause it captures exact term matching better. Hence,
we expand Rank Fusion to handle three retrievers
instead of two, as follows.
\[ f(p) = a\,\hat{s}_x(p) + b\,\hat{s}_y(p) + \bigl(1 - (a + b)\bigr)\,\hat{s}_z(p) \tag{1} \]
Here $p$ is a retrieved passage, $a$ and $b$ are fusion weights, and $\hat{s}_x(p)$, $\hat{s}_y(p)$, $\hat{s}_z(p)$ are the normalized relevance scores of the three fused retrievers.
³ MTEB-law: https://huggingface.co/spaces/mteb/leaderboard?task=retrieval&language=law
⁴ https://docs.voyageai.com/docs/embeddings
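As an illustration of Eq. (1), the sketch below fuses three score dictionaries. It is not the released implementation: min-max normalization is an assumption (the text only says the scores are normalized), passages missing from a retriever's list are assumed to score 0 for that retriever, and the names `min_max` and `triple_fusion` are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max normalize one retriever's scores to [0, 1] (assumed scheme)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {p: 0.0 for p in scores}
    return {p: (s - lo) / (hi - lo) for p, s in scores.items()}


def triple_fusion(sx: Dict[str, float],
                  sy: Dict[str, float],
                  sz: Dict[str, float],
                  a: float,
                  b: float) -> List[Tuple[str, float]]:
    """Rank passages by f(p) = a*sx(p) + b*sy(p) + (1 - (a + b))*sz(p)."""
    nx, ny, nz = min_max(sx), min_max(sy), min_max(sz)
    passages = set(nx) | set(ny) | set(nz)
    fused = {
        # A passage not returned by a retriever contributes 0 for that retriever.
        p: a * nx.get(p, 0.0) + b * ny.get(p, 0.0) + (1 - (a + b)) * nz.get(p, 0.0)
        for p in passages
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

For example, `triple_fusion(scores_x, scores_y, scores_z, a=0.25, b=0.2)` returns the fused ranking for three hypothetical score dictionaries; which retriever plays the role of x, y, or z is left to the caller.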
3.3 Experimental results for retrieval
We conduct three experiments on the public test set.
In Table 1, we compare the scores of the four single
retrieval models. We see that the domain-specific
voyage-law-2 (VL2) and voyage-finance-2 (VF2)
perform better than BM25 and the generic OL3.
Model Recall@10 MAP@10
BM25 0.6994 0.5584
OL3 0.7385 0.5736
VL2 0.7705 0.6275
VF2 0.7895 0.6559
Table 1: Comparison of single retrieval models.
In the second experiment (Table 2), we compare
Rank Fusion configurations, again on the public
test set. The newly introduced triple Rank Fusion,
with BM25, VL2 and VF2, is the best. The values
of a and b were selected by trying a few combinations.
Rank Fusion        a      b      R@10   M@10
BM25, OL3          0.30   -      78.9   65.0
VL2, VF2           0.40   -      79.4   66.0
BM25, VL2          0.25   -      79.9   66.5
BM25, VF2          0.30   -      80.4   67.6
BM25, VL2, VF2     0.25   0.2    81.1   69.0
Table 2: Comparison of Rank Fusion configurations. R@10 (Recall@10) and M@10 (MAP@10) are reported as percentages.
In the third experiment (Fig. 1), we investigate the effect of reranking the top-N retrieved passages, for different N values, by computing Recall@10 on the public test set. The best value is N = 50.
[Figure 1 plot: x-axis "Top-N reranked passages" (N = 10 to 100); y-axis "Recall@10" (ticks between 0.81 and 0.82).]
Figure 1: Recall@10 scores of our best retriever (Rank Fusion of BM25, VL2, VF2) when reranking the top-N retrieved passages, for different N values.
Our final retrieval model is a triple Rank Fusion model (BM25, VL2, VF2) with reranking (voyage-rerank-2, N = 50), which ranked 4th in the retrieval subtask, achieving 69.4 Recall@10 and 59.4 MAP@10 on the hidden test set.
4 Answer generation
The answer generators of this section use our best
retriever (Section 3, BM25, VL2, VF2, reranker).
4.1 Preprocessing
Filtering: We follow Gokhan et al. (2024), i.e., we
rank the retrieved passages by decreasing relevance
scores; we then keep only passages that satisfy two
conditions: (i) their score must be above a certain threshold, and (ii) their score must not fall below the previous passage’s score by more than max drop.
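A minimal sketch of this filter is given below. It assumes that filtering stops at the first passage that violates either condition (the text does not say whether later passages are still considered), and the function name and default values are illustrative only.

```python
from typing import List, Tuple


def filter_passages(scored: List[Tuple[str, float]],
                    threshold: float = 0.90,
                    max_drop: float = 0.10) -> List[Tuple[str, float]]:
    """Keep top-ranked passages that pass the threshold and max-drop checks."""
    ranked = sorted(scored, key=lambda ps: ps[1], reverse=True)
    kept: List[Tuple[str, float]] = []
    previous_score = None
    for passage, score in ranked:
        # Condition (i): the score must be above the threshold.
        if score <= threshold:
            break
        # Condition (ii): the drop from the previous passage must be small enough.
        if previous_score is not None and previous_score - score > max_drop:
            break  # assumption: stop at the first violation
        kept.append((passage, score))
        previous_score = score
    return kept
```

Because the passages are sorted by decreasing score, stopping at the first passage below the threshold is equivalent to skipping it; only the max-drop condition depends on the stop-versus-skip assumption.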
Extracting obligations: To obtain obligations
from the retrieved passages, we use the same fine-
tuned LegalBERT model used in RePASs (Sec-
tion 2) for obligation extraction. If a passage does
not contain any obligations, we use it as is.
4.2 Experimental results for preprocessing
To select the values of the filtering threshold and max drop (Section 4.1), we conducted two experiments using GPT-4o-mini⁵ for answer generation. The first experiment shows that the recommended values of 0.70 and 0.20 of Gokhan et al. (2024) are outperformed by 0.90 and 0.10, respectively (Table 3).
Threshold   Max Drop   RePASs
0.70        0.20       0.4708
0.75        0.05       0.5006
0.80        0.05       0.5050
0.85        0.15       0.5001
0.90        0.10       0.5117
Table 3: Performance of the baseline answer generator for different values of threshold and max drop, using our best retriever (BM25, VL2, VF2, reranker).
The second experiment compared the performance
of the task’s baseline when (a) the entire retrieved
passages were given to the LLM, or (b) only the
obligations were given, or (c) only the obligations
were given, but with a tailored prompt. No significant difference was noticed between (a) and (b), but (c) was significantly better in RePASs (Table 4), due to the increase in obligation coverage and entailment, even though contradiction was worse. All prompts can be found in Appendix B.
⁵ https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
Context        RePASs   Obl.    Ent.    Con.
Passages       0.411    0.147   0.177   0.090
Obligations    0.413    0.156   0.172   0.090
+ prompt       0.512    0.278   0.366   0.109
Table 4: Performance of the baseline system for different kinds of inputs (entire retrieved passages, obligations only, obligations with tailored prompt).
4.3 Naive Obligation Concatenation (NOC)
Our first answer generator (NOC) adversarially ex-
ploits the extracted obligations (Section 4.1). It
simply concatenates and outputs them as the ‘an-
swer’. From the definition of RePASs (Section 2),
this answer should get an almost perfect obligation
score. Additionally, we expect a low contradiction
score, as obligations should not conflict.
4.4 LLM Obligation Concatenation (LOC)
The answers of NOC (Section 4.3) do not answer the question directly; they are just excerpts from retrieved passages. To alleviate this, we create a variation of NOC, called LOC: for each extracted obligation, we prompt an LLM (GPT-4o-mini) to answer the given question using this obligation. If the generated answer does not cover (Section 2) the original obligation, the LLM is prompted again, up to a maximum of K tries (we use K = 3). Finally, the per-obligation answers are concatenated to form a complete answer.
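The LOC loop can be sketched as follows. The callables `generate` (an LLM call) and `covers` (the coverage check of Section 2) are hypothetical placeholders supplied by the caller, and keeping the last attempt when all K tries fail is an assumption; the text does not say what happens in that case.

```python
from typing import Callable, List


def loc_answer(question: str,
               obligations: List[str],
               generate: Callable[[str, str], str],
               covers: Callable[[str, str], bool],
               max_tries: int = 3) -> str:
    """Answer the question once per obligation, retrying until it is covered."""
    parts: List[str] = []
    for obligation in obligations:
        candidate = ""
        for _ in range(max_tries):
            candidate = generate(question, obligation)
            if covers(candidate, obligation):
                break  # the rewritten text covers the obligation; stop retrying
        # Assumption: after max_tries failures, the last attempt is kept.
        parts.append(candidate)
    # Concatenate the per-obligation answers into one complete answer.
    return " ".join(parts)
```

Passing the LLM and the NLI check as functions keeps the sketch independent of any particular API client or model.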
4.5 Verify and Refine with RePASs (VRR)
Our third answer generator (VRR) first ‘verifies’
the correctness of the answers, then iteratively ‘re-
fines’ them. The first stage (verification) is loosely
inspired by self-consistency (Wang et al., 2023);
it involves the generation of many alternative an-
swers by the LLM and the selection of the one with
the highest RePASs score. The selected answer is
then iteratively refined by reducing contradictions
and increasing obligation coverage, as explained below.
4.5.1 Verification step
In the verification step, we obtain N alternative answers from the LLM (using all the extracted obligations and the question as input) and evaluate them using RePASs. We choose the alternative answer with the best RePASs score.
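This selection amounts to a best-of-N choice, sketched below. The callables `generate` and `score` stand in for the LLM call and the RePASs computation and are hypothetical, as is the default `n = 5` (the value of N is not given here).

```python
from typing import Callable, List


def verify(question: str,
           obligations: List[str],
           generate: Callable[[str, List[str]], str],
           score: Callable[[str], float],
           n: int = 5) -> str:
    """Generate n alternative answers and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(question, obligations) for _ in range(n)]
    # max() keeps the first candidate in case of ties.
    return max(candidates, key=score)
```

In practice `generate` would need sampling with a non-zero temperature, since identical candidates give the selection nothing to choose from.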
4.5.2 Refinement step
Contradiction removal: To remove contradictions: a) we compute the average contradiction score over all the answers (over all the best alternative answers for all questions) across the dataset using the same NLI model as in RePASs, and b) we remove the sentences of the answer that get a contradiction score higher than the average.
System / Group Name       RePASs          Obligation      Entailment      Contradiction
GPT-4o baseline*          0.583           0.220           0.769           0.238
Human experts*            0.859           1.000           0.837           0.260
Indic aiDias              0.973           0.993           0.987           0.062
Ocean’s Eleven            0.971           0.991           0.986           0.065
AUEB NLP Group - NOC      0.947 (0.951)   0.951 (0.963)   0.986 (0.986)   0.096 (0.096)
AUEB NLP Group - VRR      0.639 (0.646)   0.502 (0.524)   0.446 (0.446)   0.031 (0.031)
AICOE                     0.601           0.230           0.827           0.254
AUEB NLP Group - LOC      0.562 (0.568)   0.423 (0.439)   0.375 (0.375)   0.110 (0.110)
Table 5: Leaderboard results for Subtask 2. Results computed by ourselves for our systems are shown in parentheses. Differences are attributed to using different GPUs. *Scores taken from Gokhan et al. (2024).
Obligation insertion: To locate missing obliga-
tions, we extract obligations from the retrieved pas-
sages and the current answer. Obligations from the
retrieved passages that are not covered (Section 2)
by the current answer are missing obligations. We
prompt GPT-4o to insert the missing obligations by
correcting a sentence or adding a new one to the
current answer (complete prompt in Appendix B).
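The two refinement operations can be sketched as follows. All callables are hypothetical placeholders: `contradiction` returns the NLI contradiction score of a sentence, `covers` is the coverage check of Section 2, and `insert` stands in for the GPT-4o prompt. The order of the two operations within a round and the stopping rule are assumptions.

```python
from typing import Callable, List


def remove_contradictions(sentences: List[str],
                          contradiction: Callable[[str], float],
                          avg_contradiction: float) -> List[str]:
    """Drop sentences whose contradiction score exceeds the dataset average."""
    return [s for s in sentences if contradiction(s) <= avg_contradiction]


def missing_obligations(sentences: List[str],
                        obligations: List[str],
                        covers: Callable[[str, str], bool]) -> List[str]:
    """Obligations from the passages that no answer sentence covers."""
    return [o for o in obligations
            if not any(covers(s, o) for s in sentences)]


def refine(sentences: List[str],
           obligations: List[str],
           contradiction: Callable[[str], float],
           avg_contradiction: float,
           covers: Callable[[str, str], bool],
           insert: Callable[[List[str], List[str]], List[str]],
           rounds: int = 1) -> List[str]:
    """Alternate contradiction removal and obligation insertion."""
    for _ in range(rounds):
        sentences = remove_contradictions(sentences, contradiction,
                                          avg_contradiction)
        missing = missing_obligations(sentences, obligations, covers)
        if not missing:
            break  # assumption: stop once every obligation is covered
        sentences = insert(sentences, missing)
    return sentences
```

Here `avg_contradiction` is computed once over the whole dataset, as described above, rather than per answer.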
4.6 Experimental results for generation
In the following experiments we use the hidden test set, GPT-4o-mini as the generator for LOC, and GPT-4o⁶ as the generator for VRR.
Table 5 compares the task’s baseline and hu-
man expert performance, as reported by Gokhan
et al. (2024), to our three submissions (NOC,
VRR, LOC) and to the best submissions of the
top three competitors. NOC achieves an almost perfect RePASs score (0.947), surpassing human experts (+0.088). As expected, obligation and contradiction scores are excellent for the adversarial NOC, but surprisingly entailment scores are even better without directly optimizing towards them.
Similar results are observed for the methods of
the top scoring competitors. However, as already
mentioned, NOC’s answers are just verbatim sen-
tences from the retrieved passages, which proves
that RePASs can easily be deceived. LOC, on the other hand, which rewrites the ‘obligations’ using GPT-4o-mini, performs even worse than the baseline model, which shows that RePASs is also very sensitive to the style of the answer. VRR, which actually generates answers from the retrieved passages, improves upon the task’s baseline substantially (+0.056) and ranks first among systems that do not exceed human performance; we suspect that systems with super-human performance may trick the RePASs measure, like our NOC system.
⁶ https://openai.com/index/hello-gpt-4o/
VRR               RePASs   Improvement
Baseline (Ours)   0.506    -
+ Verification    0.611    +0.105
+ Refinement      0.646    +0.035
Table 6: Contribution of VRR stages, using GPT-4o.
The next experiment (Table 6) measures the con-
tribution of the verification and refinement pro-
cesses of VRR. Both processes are beneficial, but
verification’s improvement is more important.
5 Conclusion
We introduced three systems for the RIRAG shared
task. The retrieval backbone of all systems combined BM25 with two domain-specific neural retrievers and a reranker. We achieved a near-perfect
score with an adversarial system that exploits the
neural model for obligation extraction of RePASs,
highlighting the difficulty of developing a robust
reference-free metric for RAG evaluation. Our best
non-adversarial system (VRR) first generates mul-
tiple alternative answers from the retrieved obliga-
tions, selects the alternative answer that maximizes
RePASs, then iteratively improves it by maximiz-
ing obligation coverage and minimizing contradic-
tions. This system produces coherent answers, and
obtains the highest RePASs score among competi-
tors that do not exceed human performance (which
may be a sign of gaming RePASs).
Limitations
We demonstrated that reference-free model-based
metrics, such as RePASs, used for evaluating
Retrieval-Augmented Generation (RAG) systems,
can be susceptible to adversarial attacks. Specifi-
cally, we showed that it is possible to provide an-
swers that receive a high score from the metric, but
may not be useful to non-experts. The attack was
tailored to RePASs and a specific domain, and it
may not apply to other domains or metrics.
VRR requires an accurate verifier, such as
RePASs, which is not always available. The obliga-
tion extraction component in RePASs is fine-tuned
using a synthetic dataset (Gokhan et al., 2024),
which in turn requires a powerful LLM teacher to
solve the task with few-shot prompting alone. This
is quite rare for hard domain-specific problems.
Acknowledgments
This work has been partially supported by project
MIS 5154714 of the National Recovery and Re-
silience Plan Greece 2.0 funded by the Euro-
pean Union under the NextGenerationEU Program.
All experiments were done using AWS resources
which were provided by the National Infrastruc-
tures for Research and Technology GRNET and
funded by the EU Recovery and Resiliency Facility.
References
Ilias Chalkidis, Manos Fergadiotis, Prodromos Malaka-
siotis, Nikolaos Aletras, and Ion Androutsopoulos.
2020. LEGAL-BERT: The muppets straight out of
law school. In Findings of the Association for Com-
putational Linguistics: EMNLP 2020 , pages 2898–
2904, Online. Association for Computational Lin-
guistics.
Catalina Goanta, Nikolaos Aletras, Ilias Chalkidis, Sofia
Ranchordás, and Gerasimos Spanakis. 2023. Regu-
lation and NLP (RegNLP): Taming large language
models. In Proceedings of the 2023 Conference on
Empirical Methods in Natural Language Processing ,
pages 8712–8724, Singapore. Association for Com-
putational Linguistics.
Tuba Gokhan, Kexin Wang, Iryna Gurevych, and Ted
Briscoe. 2024. Regnlp in action: Facilitating com-
pliance through automated information retrieval and
answer generation. Preprint , arXiv:2409.05677.
Yichen Huang and Timothy Baldwin. 2023. Robust-
ness tests for automatic machine translation metrics
with adversarial attacks. In Findings of the Associ-
ation for Computational Linguistics: EMNLP 2023 ,
pages 5126–5135, Singapore. Association for Computational Linguistics.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio
Petroni, Vladimir Karpukhin, Naman Goyal, Hein-
rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock-
täschel, Sebastian Riedel, and Douwe Kiela. 2020.
Retrieval-augmented generation for knowledge-
intensive nlp tasks. In Proceedings of the 34th Inter-
national Conference on Neural Information Process-
ing Systems , NIPS ’20, pages 9459–9474, Red Hook,
NY, USA. Curran Associates Inc.
Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue,
and Xipeng Qiu. 2020. BERT-ATTACK: Adversar-
ial attack against BERT using BERT. In Proceed-
ings of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP) , pages
6193–6202, Online. Association for Computational
Linguistics.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler
Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon,
Nouha Dziri, Shrimai Prabhumoye, Yiming Yang,
Shashank Gupta, Bodhisattwa Prasad Majumder,
Katherine Hermann, Sean Welleck, Amir Yazdan-
bakhsh, and Peter Clark. 2024. Self-refine: iterative
refinement with self-feedback. In Proceedings of the
37th International Conference on Neural Information
Processing Systems , pages 46534–46594, Red Hook,
NY, USA.
Arvind Neelakantan, Tao Xu, Raul Puri, Alec Rad-
ford, Jesse Michael Han, Jerry Tworek, Qiming
Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy,
Johannes Heidecke, Pranav Shyam, Boris Power,
Tyna Eloundou Nekoul, Girish Sastry, Gretchen
Krueger, David Schnurr, Felipe Petroski Such, Kenny
Hsu, Madeleine Thompson, Tabarak Khan, Toki
Sherbakov, Joanne Jang, Peter Welinder, and Lilian
Weng. 2022. Text and code embeddings by con-
trastive pre-training. Preprint , arXiv:2201.10005.
Xin Quan, Marco Valentino, Louise A. Dennis, and An-
dre Freitas. 2024. Verification and refinement of nat-
ural language explanations through LLM-symbolic
theorem proving. In Proceedings of the 2024 Con-
ference on Empirical Methods in Natural Language
Processing , pages 2933–2958, Miami, Florida, USA.
Association for Computational Linguistics.
Stephen E. Robertson, Steve Walker, Susan Jones,
Micheline Hancock-Beaulieu, and Mike Gatford.
1994. Okapi at trec-3. In Proceedings of The Third
Text REtrieval Conference, TREC 1994 , pages 109–
126, Gaithersburg, Maryland, USA.
Ante Wang, Linfeng Song, Ye Tian, Baolin Peng, Lifeng
Jin, Haitao Mi, Jinsong Su, and Dong Yu. 2024. Self-
consistency boosts calibration for math reasoning. In
Findings of the Association for Computational Lin-
guistics: EMNLP 2024 , pages 6023–6029, Miami,
Florida, USA. Association for Computational Lin-
guistics.
Shuai Wang, Shengyao Zhuang, and Guido Zuccon.
2021. Bert-based dense retrievers require interpo-
lation with bm25 for effective passage retrieval. In
Proceedings of the 2021 ACM SIGIR International
Conference on Theory of Information Retrieval, ICTIR ’21, pages 317–324, New York, NY, USA. Association for Computing Machinery.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le,
Ed H. Chi, Sharan Narang, Aakanksha Chowdhery,
and Denny Zhou. 2023. Self-consistency improves
chain of thought reasoning in language models. In
The Eleventh International Conference on Learning
Representations .
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng,
Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei
Lin, and Daxin Jiang. 2024. WizardLM: Empow-
ering large pre-trained language models to follow
complex instructions. In The Twelfth International
Conference on Learning Representations .
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran,
Thomas L. Griffiths, Yuan Cao, and Karthik
Narasimhan. 2024. Tree of thoughts: deliberate prob-
lem solving with large language models. In Proceed-
ings of the 37th International Conference on Neural
Information Processing Systems , NIPS ’23, pages
11809–11822, Red Hook, NY, USA. Curran Associates Inc.
A Related work
RAG: Retrieval-Augmented Generation (RAG)
(Lewis et al., 2020) systems can help tackle domain-
specific problems that RegNLP (Goanta et al.,
2023) presents, by incorporating information from
large regulatory document collections.
Verify and Refine: VRR is loosely inspired by
LLM methods that select the best answer from
multiple candidates and iteratively refine these an-
swers (Wang et al., 2024; Madaan et al., 2024; Yao
et al., 2024; Quan et al., 2024), frameworks like
Explanation-Refiner (Quan et al., 2024) that use
theorem proving to validate and refine explanations,
and WizardLM (Xu et al., 2024) that evolves in-
struction data to enhance model performance.
Adversarial attacks: Many works implement ad-
versarial attacks that are similar to our NOC sys-
tem. BERT-ATTACK (Li et al., 2020) leverages
a pretrained BERT model to deceive other mod-
els. Huang and Baldwin (2023) show that popu-
lar model-based evaluation metrics for machine-
translation are susceptible to inconsistencies when
given adversarially-degraded translations.
B Prompts
We used GPT-4o to improve all our prompts, and then kept those that, in our judgment, performed the task better on a few (2-3) sample questions.
Baseline prompt (Gokhan et al., 2024)
You are a regulatory compliance assistant. Pro-
vide a detailed answer for the question that fully
integrates all the obligations and best practices
from the given passages. Ensure your response is
cohesive and directly addresses the question. Syn-
thesize the information from all passages into a
single, unified answer.
Prompt for obligations in the context (VRR)
You are a regulatory compliance assistant. Your
task is to provide a brief but concise and detailed
answer to the Question, ensuring that all Obliga-
tions are fully addressed. Directly integrate each
obligation into the response, ensuring no obliga-
tion is missed or implied. Avoid adding information
beyond what is explicitly stated in the Obligations,
and cite specific rules when necessary. Use the ex-
act terminology and structure from the obligations
where applicable, to ensure high alignment and log-
ical consistency. Focus solely on the provided obli-
gations to craft a response that is well-structured,
concise, and free of contradictions.
Prompt for inserting obligations (VRR)
You are a regulatory compliance assistant. Your
task is to integrate the following Obligations that
are missing from the Answer. You may change
sentences or add new ones to cover all Obligations.
Avoid adding changes or sentences that contradict
the Answer and/or the Obligations.
Prompt that rewrites an obligation (LOC)
You are a regulatory compliance assistant. Your
task is to construct a brief but concise response
that addresses the Question by focusing exclu-
sively on the specified Obligation. Ensure your
response clearly identifies and explains the obliga-
tion, including any relevant conditions or restric-
tions. Avoid addressing unrelated aspects of the
Question, and limit your response strictly to what
is explicitly stated in the provided passage.
C Detailed experiments for VRR
Table 7 shows the progression of RePASs through-
out the execution of the VRR algorithm. The Ver-
ification step leads to an increase in all metrics.
Obligation Refinement (‘Ref. Obl.’) alone does not
lead to an increased score, Contradiction Refine-
ment (‘Ref. Contr.’) is necessary. Even though
Page 7:
Obligation Coverage (’Obl.’) increases at the ex-
pense of the Entailment (’Ent.’) score, RePASs
improves overall.
Step RePASs Obl. Ent. Con.
Preprocessing 0.506 0.246 0.408 0.136
Verify 0.611 0.389 0.527 0.083
Ref. Contr. 1 0.638 0.389 0.554 0.030
Ref. Obl. 1 0.634 0.465 0.490 0.053
Ref. Contr. 2 0.643 0.464 0.497 0.032
Ref. Obl. 2 0.637 0.496 0.464 0.049
Ref. Contr. 3 0.643 0.494 0.467 0.030
Ref. Obl. 3 0.642 0.527 0.446 0.046
Ref. Contr. 4 0.647 0.525 0.446 0.031
Ref. Obl. 4 0.641 0.538 0.430 0.045
Table 7: RePASs progress during VRR execution.