Evaluating open-source Large Language Models for
automated fact-checking
Nicolò Fontana1, Francesco Corso1,2, Enrico Zuccolotto1, Francesco Pierri1
1Politecnico di Milano, Italy
2CENTAI, Italy
{nicolo.fontana, francesco.corso, francesco.pierri}@polimi.it
Abstract —The increasing prevalence of online misinformation
has heightened the demand for automated fact-checking solu-
tions. Large Language Models (LLMs) have emerged as potential
tools for assisting in this task, but their effectiveness remains
uncertain. This study evaluates the fact-checking capabilities
of various open-source LLMs, focusing on their ability to
assess claims with different levels of contextual information.
We conduct three key experiments: (1) evaluating whether
LLMs can identify the semantic relationship between a claim
and a fact-checking article, (2) assessing models’ accuracy in
verifying claims when given a related fact-checking article, and
(3) testing LLMs’ fact-checking abilities when leveraging data
from external knowledge sources such as Google and Wikipedia.
Our results indicate that LLMs perform well in identifying claim-
article connections and verifying fact-checked stories but struggle
with confirming factual news, where they are outperformed by
traditional fine-tuned models such as RoBERTa. Additionally, the
introduction of external knowledge does not significantly enhance
LLMs’ performance, calling for more tailored approaches. Our
findings highlight both the potential and limitations of LLMs
in automated fact-checking, emphasizing the need for further
refinements before they can reliably replace human fact-checkers.
Index Terms —fact-checking, large language models, prompting
analysis
I. INTRODUCTION
In today’s digital era, the vast availability of online infor-
mation has facilitated the rapid spread of both misinformation
and disinformation. Online social platforms, in particular,
enable false narratives to gain traction, due to their high-
connectivity nature [1]. This challenge is amplified when influ-
ential individuals propagate misleading content, with research
suggesting that such misinformation can significantly impact
critical events, including election outcomes [2], [3].
The burden of fact-checking and debunking false claims
has been traditionally left to journalists. However, the rise
of the Internet, combined with growing distrust in traditional
media [4], has led to the emergence of independent fact-
checking organizations [5]. These groups focus on verifying
rumors, misconceptions, and fake news spread online. One of
the most notable fact-checking initiatives, Politifact1, received
the Pulitzer Prize for national reporting in 20092. Politifact
introduced a rating system to evaluate claims, and many
organizations have since adopted similar systems. However,
these subjective and often ambiguous ratings complicate
comparisons across fact-checks [6].
1https://www.politifact.com/
2https://www.pulitzer.org/prize-winners-by-year/2009
Fact-checking remains a labor-intensive process, requiring
teams to spend days or even weeks verifying claims [7]. Given
the overwhelming flow of information online3, traditional
methods cannot keep pace, highlighting the need for more
efficient approaches.
One promising avenue is the automation of fact-checking
through artificial intelligence (AI) technologies [8]. Re-
searchers have investigated AI-driven models, such as Convo-
lutional Neural Networks (CNNs) and Graph Neural Networks
(GNNs), to aid in these efforts. While these models have made
significant progress, their effectiveness remains limited [9],
[10]. More recently, Large Language Models (LLMs) based on
transformer architectures have demonstrated significant poten-
tial [11], [12]. These models excel at generating natural lan-
guage responses, answering complex questions, and producing
high-quality content [13], [14], making them an accessible and
versatile tool for a broad audience. Their advanced reasoning
capabilities further position them as suitable candidates for
fact-checking [15].
However, a major limitation of LLMs lies in the outdated
nature of their training data [16], which hampers their ability
to address recent misinformation effectively. To address this
challenge, researchers are exploring innovative approaches to
integrate real-time, external knowledge into these models.
Such advancements aim to enhance their ability to provide
accurate and timely responses to rapidly evolving misinfor-
mation [17].
Building on previous research, our study explores the
use of diverse open LLMs to investigate whether smaller
models can achieve a more favorable balance between cost-
effectiveness and accuracy compared to larger models. Cost-
effective methods are particularly critical for watchdog groups,
non-profits, and smaller organizations that often operate with
limited budgets and resources. These entities play a crucial role
in combating misinformation but frequently lack the financial
capacity to deploy and maintain large-scale AI systems. By
identifying efficient yet accurate alternatives, our research aims
to empower such organizations with accessible tools, enabling
them to enhance their fact-checking capabilities without
incurring prohibitive costs. Additionally, as a baseline for
comparison, we fine-tune RoBERTa, a Small Language Model (SLM).
This reference allows us to put the LLMs’ performance
into perspective.
3https://explodingtopics.com/blog/data-generated-per-day
arXiv:2503.05565v1 [cs.CY] 7 Mar 2025
We articulate our contributions into three key research
questions:
1) RQ1: Can LLMs identify the connection between
a claim and an article? This research question in-
vestigates whether LLMs can accurately determine if a
given claim and its paired article are contextually related,
addressing the same topic. The focus is on evaluating the
models’ capability to assess the relevance and alignment
between the claim and the content of the article.
2) RQ2: Can LLMs judge a claim based on the related
fact-checking article? This question explores whether
LLMs can effectively analyze a related fact-checking
article and provide a trustworthy evaluation of the claim
based on the information contained within the document.
3) RQ3: Are LLMs able to retrieve contextual information
and fact-check claims? This question assesses the
models’ ability to verify the truthfulness of a claim when
they are provided with a related article, as opposed to
relying solely on their pre-trained internal knowledge.
The objective is to determine whether the inclusion of
new, external information enhances their accuracy in
fact-checking tasks.
To address the first two questions (RQ1, RQ2), we conduct
experiments using 24 different prompts to guide the models’
performance, emphasizing the role of effective prompting. For
the third question (RQ3), we use neutral and straightforward
prompts to evaluate various sources of external information,
such as Google or Wikipedia (or their absence), and the
representation format (snippet, summary, or full article) that
yields the best results. In this third scenario, the model acts
autonomously, conducting internet searches to gather informa-
tion and refine its verdicts. For Google searches, we limit the
results to a time frame before the claim to avoid bias and
better simulate real-world fact-checking.
Our analysis utilizes the Fact-Check Insights dataset4,
which is a comprehensive resource containing structured data
from tens of thousands of claims made by political figures
and social media posts, scrutinized and rated by independent
fact-checking organizations such as AFP5, Politifact6, and
Snopes7.
II. RELATED WORK
The study of fake news generation and dissemination has
gained increasing attention, particularly with the rise of on-
line platforms that facilitate rapid information diffusion [18].
Research in this domain has primarily focused on two key
aspects: the identification of fake news and the detection of
its spreaders [15].
4https://www.factcheckinsights.org/download
5https://factcheck.afp.com
6https://www.politifact.com/
7https://www.snopes.com/
Over time, human fact-checking has been increasingly
supported by machine learning methods. Early approaches
leveraged classical machine learning techniques for keyword-based
text analysis using traditional neural networks [19]
and applied network analysis to assess the trustworthiness
of sources propagating misinformation [20]. More recently,
the emergence of Large Language Models has significantly
advanced fake news detection, enabling more sophisticated
analyses. These include evaluating the performance of fake
news detectors against synthetically generated misinforma-
tion rather than human-created content [21] and conducting
sentiment-based assessments through emotional analysis of
news texts [22].
Some approaches aim to enhance human fact-checking by
integrating feedback from LLMs. For instance, LLMs have
been used to improve document retrieval, aiding in the verifi-
cation of statements [23], or to decompose claims hierarchi-
cally into sub-statements for systematic verification [24]. Con-
versely, other methodologies seek to achieve fully autonomous
fact-checking, relying on LLMs to assess the veracity of
claims using a zero-shot approach, without requiring additional
training or human intervention.
A major challenge in automating fact-checking with LLMs
lies in their tendency to produce biased answers [25], [26] and
to hallucinate facts [27], as well as in their reliance on static
training data, which is often outdated [16]. To address this
limitation, researchers have explored methods to integrate external
knowledge into LLMs using frameworks such as ReAct, which
combines reasoning and action within LLMs [28]. This has led
to several innovative fact-checking approaches. For example,
FActScore assigns factuality scores to responses based on
information retrieved from Wikipedia [29], while FACTOOL
employs various tools for evidence collection and reasoning to
analyze claims and assign factuality labels based on supporting
evidence [30]. Additionally, Toolformer enables LLMs to
autonomously integrate external tools, enhancing performance
across various tasks while preserving core language modeling
capabilities [31]. Despite these advancements, research indi-
cates that while LLMs generally perform well, they struggle
more with verifying factual statements than identifying false
ones [17].
To further improve accuracy, researchers have explored
collaborative approaches where multiple models interact and
reason together to reach a verdict. FactCheck-GPT, for in-
stance, addresses factual inaccuracies by enabling multiple
LLMs to debate and converge on a consensus through iter-
ative discussions, supplemented by external searches such as
Google [32]–[34].
Recently, the focus has expanded beyond mere verification
to include misinformation correction. Systems like MUSE
represent a significant advancement in this area, demonstrating
the evolving capability of LLMs to both detect and cor-
rect misinformation in real-time environments [35]. Similarly,
Verify-and-Edit enhances LLM reasoning by incorporating
Chain-of-Thought (CoT) reasoning and external knowledge
sources such as DrQA, Wikipedia, and Google Search to refine
responses and correct factual inaccuracies [36]. The Chain-of-
Verification technique further improves factual accuracy by
leveraging parametric knowledge to revise LLM-generated re-
sponses [37]. Another approach, SELF-CHECKER, evaluates
factuality by integrating real-time web searches (e.g., Bing)
to assign factuality labels, reinforcing the role of external
knowledge in automated fact-checking [38].
Contrary to prior research suggesting LLM superiority,
some studies indicate that task-specific Small Language Mod-
els (SLMs), such as fine-tuned BERT, may outperform LLMs
in certain fact-checking tasks. One study proposes a hy-
brid approach where LLMs generate rationales while SLMs
handle classification, leveraging the strengths of both model
types [39]. This finding aligns with our analysis, which
shows that LLMs fail to achieve significant performance
improvements in zero-shot fact-checking on isolated state-
ments, even when provided with temporally contextualized
information. Similarly, [40] demonstrates that LLMs can be
more effective as supplementary tools to traditional detection
methods. Specifically, the study analyzes feature propagation
in networks of entities and concepts extracted from articles
using LLM assistance.
Another relevant research direction investigates the stylis-
tic metrics underlying written content. One study explores
whether LLMs exhibit human-like planning and creativity in
news article generation [41]. Additionally, building on the
Undeutsch psychological theory that memories of real events
differ from those of imagined ones, [42] suggests that fake and
real news may have distinct stylistic patterns. This raises the
possibility that LLM-based fact-checking methods may rely
more on stylistic cues than actual content analysis, particularly
in low-context scenarios—an issue reflected in our final task.
Unlike other studies, we restrict the retrieval date settings
in Google to obtain realistic estimates of how the LLM would
perform on new data. We also use a new dataset called Fact-Check
Insights, which compiles information from various
sources to reduce bias. Instead of feeding articles to the LLM,
we provide contextual data, such as the author and the date the
claim was made. Additionally, we conduct a temporal analysis
of claim accuracy to determine whether the training data’s
cutoff date affects performance. Furthermore, we focus on
more economical, open-source models, limiting our analysis to
those with at most 70 billion parameters to better understand
their strengths and weaknesses.
III. EXPERIMENTAL DESIGN
We focus exclusively on English-language claims, as most
models are primarily trained in English and are expected
to perform best in this language [43]. Our analysis utilizes
the Fact-Check Insights dataset, a comprehensive resource
available to researchers, journalists, and other stakeholders
engaged in countering political misinformation and falsehoods
online. This dataset comprises structured data from tens of
thousands of claims made by political figures and social media
posts, scrutinized and rated by independent fact-checking
organizations such as AFP, Politifact, and Snopes.
A. Data Preprocessing
Starting from a collection of over 200K observations down-
loaded from Fact-Check Insights in June 2024, we removed
duplicate entries, rows with incomplete or incorrect attributes,
and data that were not in the English language. Next, we
scraped the original fact-checking article associated with each
claim in the dataset, removing those that could not be collected
successfully. After this process, we obtained a final dataset of
60,000 claims along with their corresponding fact-checking
articles.
Following the literature [17], we grouped the claims under
two macro-categories: True and False. The former includes
entries deemed accurate or of higher quality, while the latter
contains content lacking a factual basis, such as fake quotes,
conspiracy theories, misleading edited media, or ironic and
exaggerated criticism. In addition, we assigned all entries
labeled as a mixture of true and false content to False. For
example, labels like Geppetto mark, trustworthy, mostly true,
and correct attribution were categorized as True, while labels
such as false, mostly false, Quattro Pinocchio, and unproven
or legend were classified as False. The resulting classes are
highly unbalanced, with 90% of the claims labeled as False.
This is reasonable since fact-checking activities usually focus
on false claims [44].
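The grouping into two macro-categories can be illustrated with a small lookup. This is only a minimal sketch: the label sets contain just the ratings named in this paper, and the function name and the lowercase normalization are our own assumptions rather than the actual preprocessing code.

```python
# Hypothetical lookup built only from the ratings quoted in the text;
# the real dataset contains many more rating strings.
TRUE_LABELS = {"geppetto mark", "trustworthy", "mostly true", "correct attribution"}
FALSE_LABELS = {"false", "mostly false", "quattro pinocchio", "unproven", "legend", "mixture"}


def to_macro_category(rating):
    """Map a fact-checker rating to 'True', 'False', or None if unknown."""
    key = rating.strip().lower()  # normalization is an assumption of this sketch
    if key in TRUE_LABELS:
        return "True"
    if key in FALSE_LABELS:
        return "False"
    return None  # ratings not listed here are left unmapped
```

Mixed verdicts such as mixture fall into False, consistent with the rule stated above.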
Next, we sampled 50 claims (25 True and 25 False) for
each year from 2013 to 2023. For 2024, we instead sampled
500 claims, only 30 of which were True, as this was
the maximum number available in that year at the time of
collection. We include observations from 2024 to better test the
capabilities of LLMs at judging claims that are not available
in their training data, i.e., because they were published after
the training cutoff date of the model.
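The sampling scheme described above can be sketched as follows. The record layout (dictionaries with `year` and `label` keys), the function name, and the fixed seed are illustrative assumptions, not the code used in our experiments.

```python
import random
from collections import defaultdict


def sample_claims(claims, seed=0):
    """Stratified sample: 25 True + 25 False per year for 2013-2023,
    and 500 claims for 2024 with at most 30 True ones.

    `claims` is assumed to be a list of dicts with 'year' and 'label' keys.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for claim in claims:
        by_group[(claim["year"], claim["label"])].append(claim)

    def take(year, label, n):
        pool = by_group[(year, label)]
        return rng.sample(pool, min(n, len(pool)))  # never ask for more than exists

    sampled = []
    for year in range(2013, 2024):
        sampled += take(year, "True", 25) + take(year, "False", 25)
    true_2024 = take(2024, "True", 30)
    sampled += true_2024 + take(2024, "False", 500 - len(true_2024))
    return sampled
```

With enough claims available in every group, this yields 550 claims for 2013 to 2023 plus 500 claims for 2024.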
B. Approach
To evaluate LLMs as effective tools for fact-checking, we
designed three distinct tasks aimed at thoroughly assessing
their capabilities in this field and addressing the research
questions outlined in Section I.
1) Understanding the article-statement connection: This
task evaluates how accurately an LLM can tell whether an
article-claim pair is related (i.e., the article is
fact-checking the claim) or not (i.e., a random
fact-checking article is picked).
2) Providing an accurate verdict based on a fact-checking
article: In this task, the model is required to
state the verdict on a given claim based on
the associated fact-checking article.
3) Fact-Checking the claim: In this task, the LLM evaluates
the truthfulness of a given statement, with or without
additional contextual information, using a neutral
prompt (i.e., a prompt that does not steer the model
towards a certain decision).
Each task evaluates distinct aspects of the models’ perfor-
mance in fact-checking scenarios. We anticipate that the first
two tasks will be relatively easier for the models, as they
are required to answer queries based on manually provided
contextual information. These tasks test the models’ capabil-
ities using a range of prompts, as described below. The third
task instead simulates a real-world scenario where the model
must retrieve relevant information to verify a given claim. To
achieve this, we incorporate the ReAct framework [28], en-
abling models to access the internet for gathering information
useful in fact-checking.
In all cases, we provide the models with two basic pieces of
metadata related to the claim: the publication date and,
when available (∼70% of the cases), the claim’s author.
Including the date allows the model to contextualize the claim,
as the validity of a statement may change over time with the
emergence of new information. Knowing the claim’s author
can offer insights into its credibility, as claims from unreliable
sources are more likely to be false [45]. Also, fact-checking
agencies such as AFP and PolitiFact highlight the importance
of both the claimant’s identity and the claim’s timing in
assessing its veracity.
C. Prompt Engineering
In this section, we provide a comprehensive overview of the
24 prompts evaluated in our experiments, which we categorize
into three approaches: Zero-Shot, Few-Shot, and Chain-of-
Thought. Zero-Shot prompts require the model to generate
responses without any prior examples, relying solely on its
pre-trained knowledge [46]. Few-Shot [47] prompts provide
a small number of examples to guide the model toward
the desired output style and reasoning process. Chain-of-
Thought [48] prompts explicitly break down the reasoning pro-
cess into intermediate steps, encouraging the model to generate
more structured and explainable responses. We construct these
prompts by combining different modules, as detailed next.
1) Basic prompt modules: Here we detail a few basic
components that are included in all prompts, in the order they
appear as shown in Figure 1 and provided in the Appendix.
The Role prompt module primes the LLM to act as a fact-
checker, aligning its reasoning with fact-checking principles.
This framing encourages a more critical approach to evaluating
claims and, as noted by [49], can improve performance.
The Task prompt module requires the LLM to generate an
explanation prior to providing an answer. This methodology
is supported by literature indicating that additional tokens
enhance the reasoning process of models [50]. After generating
the explanation, the model assigns a score (0-100) rather than
a definitive verdict. This scoring mechanism allows for flex-
ible adjustment of the model’s skepticism, as the acceptance
threshold for responses can be modified. We also decided to
implement a scoring mechanism instead of relying on labels
because, during the setup of our experiments, we empirically
found that instructing the LLM to return a factual label
often led to inconsistent outputs. This approach is particularly
relevant in our context, where understanding and verifying the
reasoning behind a claim is crucial for ensuring the accuracy
and trustworthiness of the verification process.
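The adjustable acceptance threshold can be illustrated with a few lines of code. This is a minimal sketch: the function name is hypothetical, the default of 50 mirrors the threshold reported in Section III-E, and returning None for scores outside 0-100 is an assumption of the sketch.

```python
def score_to_label(score, threshold=50):
    """Turn a 0-100 score into a binary label.

    Raising `threshold` makes the classifier more skeptical, since a claim
    needs a higher score to be accepted as True.
    """
    if score is None or not 0 <= score <= 100:
        return None  # assumption: invalid scores are not mapped to a label here
    return "True" if score > threshold else "False"
```

For example, a score of 70 is accepted under the default threshold but rejected when the threshold is raised to 80.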
We employ a JSON prompt module that provides a structured
format for extracting expected responses, facilitating consistent
and reliable evaluation across the various tasks. Adopting a
similar approach to other work [50], [51], we developed a
prompt that enables the LLMs to produce structured JSON
outputs. Although these methods may occasionally result in
null outputs (approximately 1% of the results), they are highly
advantageous, as they guarantee the presence of a result at the
specified position.
The Final prompt module serves as a brief reminder to
the model regarding how it should structure its responses,
including the required format and the necessity of always
providing a score. This prompt is significant given that the
articles being analyzed can be lengthy. Smaller models, in
particular, struggle to adhere to the specified format when
processing long texts. By including this reminder at the end,
we significantly improve the model’s precision in generating
responses. Hence, this structured approach ensures that the
model remains focused and compliant with the expected output
format.
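To make the ordering of the basic modules concrete, the sketch below assembles a prompt from the four modules, with the Final reminder placed last so that it follows the (possibly long) article. The module wordings, the field labels, and the function name are placeholders chosen for illustration; the actual prompts are those reported in the Appendix.

```python
# Placeholder wordings: these are NOT the prompts used in the study.
ROLE = "You are a professional fact-checker."
TASK = "First explain your reasoning, then assign a score from 0 to 100."
JSON_FORMAT = 'Answer with a JSON object: {"explanation": "...", "score": 0}'
FINAL = "Reminder: reply only with the JSON object and always include a score."


def build_basic_prompt(article, statement, date, author=None):
    """Join the basic modules in order: Role, Task, JSON, inputs, Final."""
    context = [f"Article: {article}", f"Statement: {statement}", f"Date: {date}"]
    if author:  # the author is only available for part of the claims
        context.append(f"Author: {author}")
    return "\n\n".join([ROLE, TASK, JSON_FORMAT, *context, FINAL])
```

Keeping each module as a separate string makes it straightforward to insert optional modules at fixed positions.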
2) Prompt modules for Tasks 1 and 2: We consider three
different approaches for the first two tasks:
• Zero Shot (ZS): This prompt module requires the model
to generate responses based solely on its preexisting
knowledge and understanding. It does not require exam-
ples and it relies on indirect information for prediction.
However, it heavily depends on the quality of semantic
information and may struggle with highly diverse cate-
gories [52].
• Few Shot (FS): Unlike the ZS approach, the FS prompt
module provides the model with several examples before
responding. In this process, we provide one example
per class to avoid the limitations of one-shot prompting,
which relies on only one example. However, there is a
risk of overfitting to the small set of examples, and the
model’s performance can be sensitive to the quality and diversity
of the examples provided. To minimize these problems,
the instances are randomly selected in each iteration.
Also, to avoid confusion, especially for smaller models,
we encapsulated each sample within square brackets to
emphasize that these represent distinct entities, leading
to more precise and clearly defined responses from the
model. The example articles were truncated to fit within
the context window. This method assesses the model’s
ability to generalize from a limited set of examples [52].
• Chain of Thought (CoT): Recent research has shown
that adding a chain-of-thought paradigm can significantly
boost the performance of language models across various
tasks [53], [54]. In this approach, we include the phrase
“Let’s think step-by-step” after the examples, recognizing
that language models can act as zero-shot reasoners [55].
This technique aims to enhance reasoning skills and
achieve better results.
Fig. 1. Example of prompt structure with optional configurations. The black text represents the main prompt, shared across all tasks and configurations. Red
placeholders indicate where the actual article and statement are inserted for each task. Blue components correspond to optional prompt modules, which are
integrated into the main prompt (at the indicated position) based on the specific combination being tested. For instance, a prompt incorporating both Enrich
and Chain-of-Thought will consist of the black main prompt, with the blue Enrich component placed between the Role and Task sub-prompts, and the blue
Chain-of-Thought component appended at the end.
Furthermore, we added the following approaches to obtain
more sophisticated prompts:
• Enriched criteria: This approach enriches the prompt
by providing a more explicit rationale behind labels. For
the False category, we specify that it includes misinformation
types such as fake quotes and conspiracy theories.
Conversely, the True category encompasses content
that demonstrates factual accuracy or contains higher-quality
information. As suggested by previous studies,
this enhancement is expected to lead to higher accuracy
in ambiguous cases. In our context, it should assist the
model in accurately labeling mixed verdicts, such as
mixture [35].
• Self-Reflection: Once we receive the model’s answer, we
feed the LLM with the previous prompt and its generated
response and ask it to reflect on its answer. This allows
the model to reason and think about its answer before
providing a new response. By encouraging this reflective
process, we aim to reduce instances of hallucination,
where the model might generate inaccurate or fabricated
information. This approach fosters critical thinking and
should enhance the reliability of the model’s outputs [56].
• Summary: We also investigated whether providing a
summary focused on the claim rather than an entire
article could enhance the effectiveness of our approach.
This strategy distills relevant information, optimizing the
context window’s use. This prompt depends heavily on the
large language model’s ability to generate a coherent and
accurate summary.
The last two methods are more costly because they involve
calling the LLM twice and require processing longer texts
compared to the first method. In contrast, the first method
involves only a minimal increase in tokens, making it far more
cost-efficient.
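One decomposition consistent with the 24 prompts is three base approaches (ZS, FS, CoT) crossed with the three optional modules switched on or off, i.e., 3 × 2³ = 24. The sketch below enumerates these configurations under that assumption; the identifiers and the call-count helper are illustrative and the exact breakdown is not spelled out in this section.

```python
from itertools import product

BASES = ("zero_shot", "few_shot", "chain_of_thought")
OPTIONS = ("enriched", "self_reflection", "summary")


def prompt_configurations():
    """Enumerate every base approach with every on/off choice of options."""
    configs = []
    for base, flags in product(BASES, product((False, True), repeat=len(OPTIONS))):
        configs.append({"base": base, **dict(zip(OPTIONS, flags))})
    return configs


def llm_calls(config):
    """Rough cost proxy: one main call, plus one extra call each for
    self-reflection and summarization (an assumption of this sketch)."""
    return 1 + int(config["self_reflection"]) + int(config["summary"])
```

Under this reading, enabling only the enriched criteria leaves the number of LLM calls unchanged, which matches the cost argument made above.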
3) ReAct for Task 3: Our fact-checking methodology in-
corporates the ReAct framework [28] to enhance precision
and efficiency in information retrieval. This method can be
broken into two pieces: Reasoning (Re) and Acting (Act).
First, the model uses its internal knowledge to reflect and
think about the user’s input, and then it identifies the steps
required to solve the problem. After Reasoning, the model
starts to Act following the steps identified before; in our case,
it acts using an external tool to retrieve information from
a source. This framework allows the model to perform this
series of Reasoning and Acting steps multiple times, but we limited
our experiment to only one iteration. The model could also
conclude that it does not need to search online to produce
an optimal result and, after just one reasoning step, return
the answer; otherwise, it would perform one search before
answering. In this framework, the LLM operates using a struc-
tured conversational ReAct JSON prompt, ensuring that each
response adheres to a predefined format. The JSON responses
can take one of two forms: “Final answer”, which includes
the final score and explanation, or “Action”, which enables
the model to perform an online search. When performing a
search, the agent autonomously generates a query and calls
the search tool, providing the query as metadata. After the
tool retrieves relevant information, the LLM is called again
to produce a final answer, incorporating the newly obtained
knowledge.
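The single-iteration control flow described above can be sketched without any agent library. Here `llm` and `search` are caller-supplied functions, and the JSON keys (`action`, `action_input`) and action names are hypothetical; the actual agent relies on the ReAct JSON prompt reported in the Appendix.

```python
import json


def react_once(llm, search, claim):
    """Run at most one Reason -> Act cycle and return the final answer.

    llm:    callable taking a prompt string and returning a JSON string
    search: callable taking a query string and returning retrieved text
    """
    step = json.loads(llm(f"Claim: {claim}"))
    if step["action"] == "Search":
        # The model asked for external evidence: query the tool once,
        # then call the model again with the observation appended.
        observation = search(step["action_input"])
        step = json.loads(llm(f"Claim: {claim}\nObservation: {observation}"))
    if step["action"] != "Final answer":
        raise ValueError("model did not return a final answer within one iteration")
    return step["action_input"]
```

If the first response is already a final answer, the search tool is never called, which corresponds to the case where the model decides it does not need to search online.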
In particular, we utilize two information sources: Wikipedia
and Google. For Wikipedia, we leverage its API by developing
a tool that, given a query generated by the LLM, returns
the top 3 matches. Each match includes the title, link, and a
snippet. These results are used to retrieve the snippets, while
for complete content retrieval, we scrape the corresponding
URLs, limiting the data to a certain length to ensure it fits
within the context window. Finally, to generate summaries,
we feed the scraped pages back into the model and request
a summary for each page, resulting in three separate model
calls for summarization.
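A tool of this kind could be built on the public MediaWiki search endpoint, as in the minimal sketch below. It only constructs the request URL and parses a response payload, leaving the HTTP call to the caller; the choice of the `list=search` module and the helper names are assumptions, since the exact endpoint used by our tool is not detailed here.

```python
import re
from urllib.parse import quote, urlencode

API = "https://en.wikipedia.org/w/api.php"


def build_search_url(query, limit=3):
    """URL for a full-text search returning at most `limit` matches."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def parse_matches(payload):
    """Extract title, link, and plain-text snippet from a search response."""
    matches = []
    for item in payload["query"]["search"]:
        title = item["title"]
        matches.append({
            "title": title,
            "link": "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_")),
            # Snippets come back with HTML highlighting tags; strip them.
            "snippet": re.sub(r"<[^>]+>", "", item["snippet"]),
        })
    return matches
```

The returned links can then be scraped for the full-article and summary settings.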
Meanwhile, for Google, we simulate an actual user experi-
ence using the Selenium automation tool. We apply a stricter
criterion, filtering results based on a time range of up to one
week before the claim’s assertion date. Using Selenium, we
systematically scan the first 20 search results and extract the
top three relevant links. As with Wikipedia, we use the results
to gather snippets. For full content, we scrape the URLs, and
in the case of summaries, we prompt the model to generate
summaries from the scraped pages.
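The temporal restriction can be expressed directly in the search URL. The sketch below uses the `tbs` query parameter with a custom date range (`cdr:1,cd_max:M/D/YYYY`), a widely used but unofficial Google URL convention; whether this matches the exact URL we built is an assumption, as is the reading that results must predate the claim by a margin of one week.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def google_search_url(query, claim_date, margin_days=7):
    """Build a Google search URL limited to results published on or
    before `claim_date - margin_days` (one reading of the date filter)."""
    cutoff = claim_date - timedelta(days=margin_days)
    # Custom date range: only an upper bound is set in this sketch.
    tbs = f"cdr:1,cd_max:{cutoff.month}/{cutoff.day}/{cutoff.year}"
    return "https://www.google.com/search?" + urlencode({"q": query, "tbs": tbs})
```

The resulting URL can be opened with a browser automation tool such as Selenium, after which the result links are read from the page.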
Following this process, we conduct experiments across three
different settings:
1) Snippets: In this setting, we provide only short snippets
of information. This approach allows for a quick,
surface-level analysis and is highly cost-effective.
2) Full Article: Here, we present the complete articles
retrieved from the search results. This setting enables a
more in-depth content analysis but may risk overloading
the model’s context window.
3) Summary: In this approach, we supply LLM-generated
summaries based on the articles collected in the previous
setting. These summaries should distill the essential in-
formation, facilitating efficient analysis and comparison
while avoiding the context window limitation. However,
this method is more expensive as it requires invoking
the LLM twice.
In this task, the number of operations, the
complexity, the strict formatting requirements, and the use
of external tools make it more challenging for the model to
follow the format consistently. As a result, we observed that
approximately 2% of the outputs were invalid.
All the prompts used in this study are detailed in the
Appendix.
D. Experimental settings
In our study, we employed four models: two small
and two large ones. Specifically, we utilized the Mistral
family models, including Mistral-7B-Instruct-v0.3
and Mixtral-8x7B-Instruct-v0.1, alongside LLaMA
models, which included Meta-Llama-3-8B-Instruct
and Meta-Llama-3-70B-Instruct.
For interaction with these models, we resort to the Hug-
ging Face API8. We established fixed parameters to ensure
reproducibility across all tasks and tested models. We set the
temperature to a low value of 0.1 to enhance the consistency
of our experiments, while the maximum number of new tokens
was fixed at 256.
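The fixed generation settings can be captured in a small configuration shared by all models. This is a minimal sketch: the organization prefixes in the model identifiers and the `inputs`/`parameters` payload layout follow common Hugging Face conventions and are assumptions here, and no request is actually sent.

```python
# Model identifiers as they commonly appear on the Hugging Face Hub;
# the organization prefixes are an assumption of this sketch.
MODELS = [
    "mistralai/Mistral-7B-Instruct-v0.3",
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "meta-llama/Meta-Llama-3-70B-Instruct",
]

# Same parameters for every task and model, for reproducibility.
GENERATION_PARAMS = {"temperature": 0.1, "max_new_tokens": 256}


def build_payload(prompt):
    """Request body for a text-generation call with the fixed parameters."""
    return {"inputs": prompt, "parameters": dict(GENERATION_PARAMS)}
```

Copying the parameter dictionary for each payload avoids accidental changes to the shared settings between calls.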
To perform the experiments for Task 1 and Task 2, we
structured a framework with placeholders, as presented in sub-
section III-C. We experimented with all possible combinations
of different prompt techniques for each dataset entry.
For the Summary enhancement, we provided a summary
instead of feeding the full article to the model. As mentioned
before, we asked the model to generate a separate response,
following the steps outlined earlier. These procedures were
repeated for each model across the first two tasks.
1) Task 1: Understanding the article-statement connection:
Our experiments involved feeding our dataset into the models.
Each entry was resource-intensive due to its 24 settings. To
ensure a balanced dataset, we divided the entries into two
groups: one with the correct claim and the other with a
random, unrelated claim. This allowed us to create a dataset
with an equal number of explained and unexplained entries.
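The construction of the balanced Task 1 dataset can be sketched as follows. The record layout (dictionaries with `claim` and `article` keys), the `related` flag, and the function name are illustrative assumptions; at least two entries are required so that an unrelated claim always exists.

```python
import random


def build_pairs(entries, seed=0):
    """Pair half of the articles with their own claim and the other half
    with a randomly chosen claim from a different entry."""
    rng = random.Random(seed)
    shuffled = entries[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    pairs = []
    for i, entry in enumerate(shuffled):
        if i < half:
            claim, related = entry["claim"], True
        else:
            # Pick the claim of any other entry, so the pair is unrelated.
            other = rng.choice([e for e in entries if e is not entry])
            claim, related = other["claim"], False
        pairs.append({"article": entry["article"], "claim": claim, "related": related})
    return pairs
```

With an even number of entries, exactly half of the resulting pairs are related.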
8https://huggingface.co/blog/inference-pro#supported-models
2) Task 2: Providing an accurate verdict based on the
article’s knowledge: Similarly to the previous task, this ex-
periment involved a dataset of 1,000 samples. We provided
the LLM with the article verifying the claim and asked
for its verdict, which was explicitly contained within the
article. Our primary objectives were to verify the accuracy
of our label mapping and to assess whether the LLMs could
effectively extract critical information from the text. This task
was instrumental in refining our label mapping.
3) Task 3: Fact-Checking the claim: In this task, we utilized
the complete dataset. Unlike the previous tasks, we provided
the model with claims along with their corresponding metadata
and allowed it to explore freely in 3 settings: Wikipedia,
Google, or no context.
As described in previous sections, we employed a ReAct
agent that required the LLM to perform queries and interact
with external resources, specifically the Wikipedia API and
the Selenium tool.
We developed an agent using the LangChain9 library to
implement this functionality, creating two specific tools: one
for Wikipedia and another for Google.
By cleverly manipulating the URL, we filtered out results
published. This approach aimed to avoid articles that clearly
fact-check the claim.
We deployed our models as autonomous agents by integrat-
ing the capabilities of LangChain with its HuggingFace inte-
gration. To ensure objectivity in the responses, we employed a
zero-shot prompt while assigning the model the role of a fact-
checker. We evaluated three scenarios by providing the model
with different input formats for each source of information:
the snippet, the full article, or a summary generated by the
LLM.
E. Extraction of results
While extracting results from the model, we encountered
situations where the model returned multiple answers for a
single query. In such cases, we selected the last complete an-
swer, which was often the most comprehensive. To ensure the
extracted data complied with the JSON format, we addressed
formatting issues, specifically converting single quotes (') to
double quotes ("). This adjustment was made using a regular
expression.
Once the formatting was corrected, we utilized the JSON
structure to extract relevant scores and explanations. This data
was then compiled into a resulting dataset and subsequently
fed into another program for score computation.
We recorded None entries in the dataset when the extraction
failed or yielded no valid response.
We fixed the threshold for a positive label at a score of 50.
Scores of 50 or lower were categorized as False, while those
greater than 50 were categorized as True. Any values outside
the range of 0 to 100 were classified as None.
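The extraction and thresholding steps described in this subsection can be sketched as below. This is a simplified illustration rather than the code used in the experiments: the JSON keys `score` and `explanation`, the function name, and the brace-matching pattern are assumptions.

```python
import json
import re

# Matches flat JSON-like objects without nested braces (assumed answer shape).
JSON_OBJECT = re.compile(r"\{[^{}]*\}")


def extract_label(raw_output, threshold=50):
    """Return (label, explanation); label is True, False, or None."""
    candidates = JSON_OBJECT.findall(raw_output)
    for candidate in reversed(candidates):  # prefer the last complete answer
        normalized = re.sub(r"'", '"', candidate)  # single -> double quotes
        try:
            answer = json.loads(normalized)
        except json.JSONDecodeError:
            continue  # not valid JSON, try the previous candidate
        score = answer.get("score")  # hypothetical key name
        explanation = answer.get("explanation")  # hypothetical key name
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            return None, explanation
        if not 0 <= score <= 100:
            return None, explanation  # out-of-range scores are recorded as None
        return score > threshold, explanation
    return None, None  # extraction failed


print(extract_label("{'score': 80, 'explanation': 'Supported by the article'}"))
```

Note that a blanket quote replacement of this kind also rewrites apostrophes inside the explanation text, which can make an otherwise valid answer unparseable; such cases fall through to the None outcome in this sketch.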
9 https://www.langchain.com
F. Baseline
To compare our models, we fine-tuned a RoBERTa model
to perform binary classification on article-claim pairs (Task
1 and Task 2) or claim-only embeddings (Task 3) using our
dataset. This dataset consists of a large number of claims,
each labeled as True or False, and originally paired with
fact-checking articles. However, given that RoBERTa can only
process sequences up to 512 tokens, we decided to reduce the
articles’ length, which can exceed 20,000 characters.
For preprocessing, we tokenized the claims using the
RoBERTa tokenizer with a maximum sequence length of 512
tokens. We trained the model on this dataset using a cross-
entropy loss function and an AdamW optimizer, fine-tuning for
three epochs. The dataset was split into training and validation
sets, and model performance was evaluated using accuracy and
classification metrics.
Our results indicate that fine-tuning RoBERTa shows great
performance in distinguishing between true and false claims.
These results are included as baselines in the main graphs.
G. Evaluation metrics
To evaluate our model’s performance comprehensively, we
employ five distinct metrics: Accuracy, Precision, Recall, F1 score, and
ROC AUC Score.
• Recall (Sensitivity/True Positive Rate): Measures the
proportion of actual positives that were correctly identified.
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]
• Precision (Positive Predictive Value): Measures the
proportion of predicted positives that are truly positive.
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]
• Accuracy: The overall proportion of correct predictions,
both true positives and true negatives.
\[ \mathrm{Accuracy} = \frac{TP + TN}{\text{Total Population}} \]
• F1 Score: The harmonic mean of precision and recall. It
balances the two, which is especially useful when there
is an uneven class distribution like in our case.
\[ \mathrm{F1\ Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
• ROC AUC Score (Area Under the Curve): A numerical
value representing the area under the ROC curve. It
summarizes the model’s ability to distinguish between
classes. A score of 1.0 indicates perfect classification,
while 0.5 indicates random guessing.
These metrics collectively offer a detailed understanding of
the model’s efficacy. In addition, we compute recall, precision,
and F1 score separately for each class, thereby distinguishing
the model’s handling of true positives/negatives from false
positives/negatives. This differentiation is crucial for identifying
any potential difficulties the model may encounter when
processing different cases.
In fact-checking, the disaggregation of evaluation metrics
by class serves a particularly critical function. This distinction
is important because the implications of misclassification can
vary considerably depending on whether a true statement is
incorrectly labeled as false or a false statement is incorrectly
labeled as true. It also helps identify potential biases within
the model, such as a predisposition towards skepticism or
credulity. Furthermore, this strategy aligns closely with the
practical priorities of fact-checking, where the relative impor-
tance of avoiding false positives versus false negatives can
vary depending on the context.
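To make the per-class computation concrete, the sketch below derives precision, recall, and F1 for each class from the definitions given in this subsection, together with overall accuracy. The function names and the toy labels are illustrative assumptions and do not correspond to data from the study.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 treating `positive` as the class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def accuracy(y_true, y_pred):
    """Overall proportion of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels for illustration only (not data from the study).
y_true = [True, True, False, False, False, False]
y_pred = [True, False, False, False, False, True]

positive_scores = per_class_metrics(y_true, y_pred, positive=True)
negative_scores = per_class_metrics(y_true, y_pred, positive=False)
print(positive_scores["f1"], negative_scores["f1"], accuracy(y_true, y_pred))
```

Calling the same function once per class, with each class in turn treated as the class of interest, is what separates performance on true claims from performance on false claims.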
IV. RESULTS
We present results for Llama3 8B, Llama3 70B, and Mixtral
8x7B, while the results for Mistral 7B have been relegated to
the Appendix due to the exceptionally high number of faulty
responses generated by this model. Interestingly, this issue
was further exacerbated when a token limit was applied to
the answer prompt, whereas the same modification reduced the
occurrence of faulty responses in the other models.
Task 1: Understanding the article-statement connection
In Fig. 2, we show that all models outperform the fine-tuned
RoBERTa, which exhibits particularly low F1 scores, espe-
cially for the negative class. The only exceptions are certain
prompts for both Mixtral 8x7B and Llama3 8B. Among the
three models, Llama3 70B consistently delivers the strongest
performance, reaching an F1 score of approximately 0.9,
depending on the prompt configuration. In contrast, Llama3
8B and Mixtral 8x7B produce lower scores, with Mixtral 8x7B
emerging as the weakest model, averaging 0.65 (Fig. 3) and
occasionally dropping below the random threshold of 0.5.
As shown in Fig. 3, and consistent with findings in the
literature [57], no single prompt format stands out as the
best across all models. For Llama3 8B, providing the full
article without relying on more complex prompting strategies
yields the highest F1 scores across all prompt configurations.
Meanwhile, Llama3 70B and Mixtral 8x7B achieve their best
results when enriched prompts—those with more detailed class
definitions—are incorporated.
Figure 4 highlights an approximately linear relationship
between the F1 scores for the positive and negative classes,
suggesting that the models maintain a balanced performance
on this task. However, there is a slight tendency for less
effective models, such as Llama3 8B and Mixtral 8x7B, to
perform better on the negative class, possibly due to their
smaller number of parameters.
Finally, Llama3 70B not only outperforms the other models
in terms of F1 score but also generates the fewest faulty
responses (i.e., instances where the model fails to provide a
score or adhere to the expected output format), as illustrated
in Fig. 5. In particular, it exhibits the lowest median fault
rate: 0.003, compared to 0.241 for Llama3 8B and 0.393 for
Mixtral 8x7B.
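For clarity, a fault rate of this kind can be computed as in the short sketch below. It assumes that failed extractions are stored as None and that the median is taken across prompt configurations; both the aggregation level and the toy values are assumptions made for illustration.

```python
from statistics import median


def fault_rate(labels):
    """Fraction of responses for which no valid label could be extracted."""
    return sum(label is None for label in labels) / len(labels)


# Toy extraction results per prompt configuration (illustrative only).
labels_by_prompt = {
    "prompt_a": [True, None, False, True],
    "prompt_b": [True, False, False, True],
    "prompt_c": [None, None, False, True],
}

rates = [fault_rate(labels) for labels in labels_by_prompt.values()]
median_rate = median(rates)
print(median_rate)
```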
[Figure 2 plot. x-axis: Llama3 8B, Llama3 70B, Mixtral 8x7B; y-axis: F1 score (0.0 to 1.0); legend: Positive, Negative, RoBERTa positive, RoBERTa negative.]
Fig. 2. Task 1: Models’ F1 scores computed for both classes. Fine-tuned
RoBERTa is used as a reference baseline.
[Figure 3 plot. Panels: Llama3 8B, Llama3 70B, Mixtral 8x7B; rows: Average, Chain-of-Thought, Few-shot, Zero-shot; axis range: 0.0 to 1.0; rows annotated with the best prompt configuration (full article, enriched, summary, self-reflection).]
Fig. 3. Task 1: For each model and each prompt category (Zero-Shot,
Few-Shot, and Chain-of-Thought), the best prompt’s F1 score is reported.
The average F1 score for each model is also reported with 95% confidence
intervals.
[Figure 4 plot. x-axis: F1 positive class (0.0 to 1.0); y-axis: F1 negative class (0.0 to 1.0); legend: Llama3 8B, Llama3 70B, Mixtral 8x7B.]
Fig. 4. Task 1: Comparison between the F1 scores obtained by each model
for each prompt variation with respect to both classes. The bisector is reported
as a reference.
Task 2: Providing an accurate verdict based on a fact-checking
article
[Figure 5 plot. x-axis: Llama3 8B, Llama3 70B, Mixtral 8x7B; y-axis: % of Faults (0.0 to 1.0).]
Fig. 5. Task 1: Percentage of faults for each model. The median faults
percentage for each model is: 0.241 (Llama3 8B), 0.003 (Llama3 70B), 0.393
(Mixtral 8x7B)
In this task, the performance gap between the positive and
negative classes is more pronounced, affecting not only the
three evaluated models but also the fine-tuned RoBERTa.
This discrepancy is primarily due to the class imbalance
introduced by the 2024 samples in the dataset, which contain
470 false claims but only 30 true claims. This distribution
contrasts with earlier years (2003–2013), where the dataset is
balanced.
Focusing on the results for the negative class, all three
models perform relatively well compared to the baseline
set by the fine-tuned RoBERTa, although none fully match
its performance. Among them, Llama3 70B comes closest,
achieving an F1 score of approximately 0.9, compared to
RoBERTa’s 0.95.
The performance gap widens even further for the positive
class, which contains fewer samples. This is evident both in
comparison to RoBERTa’s baseline and relative to the other
models. Llama3 70B remains the strongest performer, consis-
tently achieving scores above 0.6 across all configurations and
coming closest to RoBERTa’s threshold.
As observed in Task 1, no single prompting strategy proves
to be the most effective across all models. Additionally, the
rate of faulty responses increases for both Llama3 8B and
Llama3 70B, with the latter reaching nearly 30% for some
prompts. Meanwhile, Mixtral 8x7B maintains a consistently
high fault rate, similar to its performance in Task 1.
[Figure 6 plot. x-axis: Llama3 8B, Llama3 70B, Mixtral 8x7B; y-axis: F1 score (0.0 to 1.0); legend: Positive, Negative, RoBERTa positive, RoBERTa negative.]
Fig. 6. Task 2: Models’ F1 scores computed for both classes. Fine-tuned
RoBERTa is used as a reference baseline.
[Figure 7 plot. Panels: Llama3 8B, Llama3 70B, Mixtral 8x7B; rows: Average, Chain-of-Thought, Few-shot, Zero-shot; axis range: 0.0 to 1.0; rows annotated with the best prompt configuration (enriched, summary, self-reflection, full article).]
Fig. 7. Task 2: For each model and each prompt category (Zero-Shot,
Few-Shot, and Chain-of-Thought), the best prompt’s F1 score is reported.
The average F1 score for each model is also reported with 95% confidence
intervals.
[Figure 8 plot. x-axis: Llama3 8B, Llama3 70B, Mixtral 8x7B; y-axis: % of Faults (0.0 to 1.0).]
Fig. 8. Task 2: The percentage of faults for each model. The median faults
percentage for each model is: 0.392 (Llama3 8B), 0.066 (Llama3 70B), 0.370
(Mixtral 8x7B)
Task 3: Fact-Checking the claim
Figure 9 shows that, under the current experimental settings,
LLMs achieve high F1 scores for the negative class. However,
they are consistently outperformed by the RoBERTa baseline.
This suggests that while advanced models can sometimes
perform well in classifying claims as True or False, simpler
approaches may consistently yield better results.
A notable trend in the results is the widening gap between
the F1 scores of the positive and negative classes. Specifically,
the highest F1 score achieved for the positive class is lower
than the lowest F1 score observed for the negative class. This
discrepancy highlights a significant challenge in the classifica-
tion task. As was also the case in Task 2, this phenomenon can
be attributed to dataset imbalance, which skews the model’s
ability to correctly predict both classes with equal proficiency.
Further analysis reveals an intriguing temporal pattern when
breaking down the performance by claim publication date. As
depicted in Figure 10, models exhibit substantially better per-
formance on the positive class for claims that were published
before 2024. Conversely, Figure 11 shows an opposite trend
for the negative class: claims originating from 2024 tend to
yield the highest performance. This suggests that temporal
factors, possibly related to shifts in linguistic patterns, dataset
composition, or the evolving nature of factual claims, may be
influencing the model’s ability to distinguish between true and
false claims.
Unlike the previous tasks, one striking difference is the
relatively low rate of faulty responses across all three models,
as illustrated in Figure 12. This suggests that, at least within
the scope of this particular task, the models are more stable and
less prone to generating erroneous classifications compared to
their performance in earlier experiments.
Another interesting observation emerges when analyzing
the impact of external content sources on model perfor-
mance. Specifically, as shown in Figure 13, when models
are supplemented with information extracted from Google and
Wikipedia, their classification accuracy improves. However, a
crucial factor in this improvement appears to be the format
in which the external information is presented. The results
indicate that models perform significantly better when pro-
vided with a structured summary of the search results, rather
than being fed entire web pages or isolated snippets. This
suggests that concise, well-organized contextual information
is more beneficial for improving model accuracy than raw,
unstructured text.
[Figure 9 plot. x-axis: Llama3 8B, Llama3 70B, Mixtral 8x7B; y-axis: F1 score (0.0 to 1.0); legend: Positive, Negative, RoBERTa positive, RoBERTa negative.]
Fig. 9. Task 3: Models’ F1 scores computed for both classes. Fine-tuned
RoBERTa is used as a reference baseline.
[Figure 10 plot. x-axis: Llama3 8B, Llama3 70B, Mixtral 8x7B; y-axis: F1 score (0.0 to 1.0); legend: Pre 2024, 2024, RoBERTa pre2024, RoBERTa 2024.]
Fig. 10. Task 3: Models’ F1 scores computed for the positive class distin-
guishing between claims dated before 2024 (thus possibly included in the
models’ training dataset) and from 2024. Fine-tuned RoBERTa is used as a
reference baseline.
[Figure 11 plot. x-axis: Llama3 8B, Llama3 70B, Mixtral 8x7B; y-axis: F1 score (0.0 to 1.0); legend: Pre 2024, 2024, RoBERTa pre2024, RoBERTa 2024.]
Fig. 11. Task 3: Models’ F1 scores computed for the negative class
distinguishing between claims dated before 2024 (thus possibly included in
the models’ training dataset) and from 2024. Fine-tuned RoBERTa is used as
a reference baseline.
[Figure 12 plot. x-axis: Llama3 8B, Llama3 70B, Mixtral 8x7B; y-axis: % of Faults (0.0 to 1.0).]
Fig. 12. Task 3: The percentage of faults for each model. The median faults
percentage for each model is: 0.02 (Llama3 8B), 0.024 (Llama3 70B), 0.051
(Mixtral 8x7B)
[Figure 13 plot. Panels: Llama3 8B, Llama3 70B, Mixtral 8x7B; rows: Average, Wikipedia, Google, Zero-shot; axis range: 0.0 to 1.0; best input format annotated as "summary".]
Fig. 13. Task 3: For each model, for each prompt category (Zero-Shot, with
Google contextual information, and with Wikipedia contextual information)
the best prompt’s F1 score is reported. The average F1 score for each model
is also reported with 0.95 confidence intervals.
V. C ONCLUSION
Our study builds upon and expands existing research on the
role of LLMs in fact-checking, offering new insights into their
capabilities and limitations.
To systematically evaluate LLMs in this context, we designed
our study around three key tasks. The first task assessed
the models’ ability to recognize the semantic relationship
between a claim and an article. The second task measured
their performance in verifying a claim’s truthfulness when
provided with a related fact-checking article. Finally, the third
task explored the zero-shot fact-checking abilities of LLMs
under varying levels of external knowledge support.
Our findings from the first task indicate that LLMs effec-
tively identify the semantic connection between claims and
articles, outperforming fine-tuned Small Language Models in
this regard. This suggests a strong potential for LLMs in
assisting fact-checking efforts. Additionally, we observed that
prompt design plays a crucial role in model performance,
highlighting the need for careful prompt engineering in fact-
checking applications.
In the second task, we found that larger LLMs can accu-
rately determine the veracity of claims when provided with a
fact-checking article, particularly when evaluating fake news.
However, in line with previous work [17], their performance
declines significantly when verifying true news, where simpler
fine-tuned Small Language Models substantially outperform
them.
Similarly, in the third task, all three LLMs underperformed
compared to a fine-tuned RoBERTa model when asked to
assess claim veracity—regardless of whether additional ex-
ternal knowledge (sourced from Google or Wikipedia) was
incorporated. These results align with prior findings [39],
which highlight the superior performance of fine-tuned Small
Language Models over LLMs in fake news detection. This
reinforces the idea that, while LLMs can be valuable tools
in fact-checking, they are not yet reliable enough to fully
automate the process.
Contrary to expectations from previous studies, introducing
external knowledge did not enhance LLM performance. A
possible explanation for this, as suggested by [42], is that fake
and factual news often exhibit distinct writing styles, which
may significantly hinder LLMs’ ability to differentiate between
them.
Overall, our study highlights both the promise and the
challenges of using LLMs for fact-checking, emphasizing the
need for further advancements before they can serve as a
standalone solution.
ACKNOWLEDGMENTS
The authors are thankful to Sofia Mongardi for her support
in the design of this manuscript. The work in this paper was
originally submitted as a Master's thesis titled “Evaluating
the Effectiveness of Open Large Language Models in Fact-
checking Claims”, written by Enrico Zuccolotto and supervised
by Prof. Francesco Pierri. This paper is supported by PNRR-
PE-AI FAIR project funded by the NextGeneration EU pro-
gram.
REFERENCES
[1] M. Del Vicario, A. Bessi, F. Zollo, F. Petroni, A. Scala, G. Caldarelli,
H. E. Stanley, and W. Quattrociocchi, “The spreading of misinformation
online,” Proceedings of the National Academy of Sciences , vol.
113, no. 3, p. 554–559, Jan. 2016. [Online]. Available: https:
//www.pnas.org/doi/full/10.1073/pnas.1517441113
[2] H. Allcott and M. Gentzkow, “Social media and fake news in the 2016
election,” Journal of Economic Perspectives , vol. 31, no. 2, p. 211–36,
May 2017. [Online]. Available: https://www.aeaweb.org/articles?id=10.
1257/jep.31.2.211
[3] D. Bär, F. Pierri, G. De Francisci Morales, and S. Feuerriegel,
“Systematic discrepancies in the delivery of political ads on facebook
and instagram,” PNAS Nexus, vol. 3, no. 7, p. pgae247, Jul. 2024.
[Online]. Available: https://doi.org/10.1093/pnasnexus/pgae247
[4] K. Fink, “The biggest challenge facing journalism: A lack of trust,”
Journalism , vol. 20, no. 1, pp. 40–43, 2019. [Online]. Available:
https://doi.org/10.1177/1464884918807069
[5] C. Spivak, “The fact-checking explosion: In a bitter political landscape
marked by rampant allegations of questionable credibility, more and
more news outlets are launching truth-squad operations,” American
Journalism Review , vol. 32, no. 4, pp. 38–44, 2010.
[6] C. Lim, “Checking how fact-checkers check,” Research & Politics ,
vol. 5, no. 3, 2018.
[7] G. Warren, I. Shklovski, and I. Augenstein, “Show me the
work: Fact-checkers’ requirements for explainable automated fact-
checking,” Feb. 2025, arXiv:2502.09083 [cs]. [Online]. Available:
http://arxiv.org/abs/2502.09083
[8] Z. Guo, M. Schlichtkrull, and A. Vlachos, “A survey on automated
fact-checking,” Transactions of the Association for Computational Lin-
guistics , vol. 10, pp. 178–206, 2022.
[9] L. Hu, S. Wei, Z. Zhao, and B. Wu, “Deep learning for fake
news detection: A comprehensive survey,” AI Open , vol. 3, pp.
133–155, 2022. [Online]. Available: https://www.sciencedirect.com/
science/article/pii/S2666651022000134
[10] Y. Wang, S. Qian, J. Hu, Q. Fang, and C. Xu, “Fake news detection
via knowledge-driven multimodal graph convolutional networks,” in
Proceedings of the 2020 International Conference on Multimedia
Retrieval, ser. ICMR ’20. New York, NY, USA: Association
for Computing Machinery, 2020, p. 540–547. [Online]. Available:
https://doi.org/10.1145/3372278.3390713
[11] T. Lin, Y. Wang, X. Liu, and X. Qiu, “A survey of transformers,”
AI Open, vol. 3, pp. 111–132, 2022. [Online]. Available: https:
//www.sciencedirect.com/science/article/pii/S2666651022000146
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in
neural information processing systems, vol. 30, 2017.
[13] M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. B. Shaikh,
N. Akhtar, J. Wu, S. Mirjalili et al., “A survey on large language models:
Applications, challenges, limitations, and practical usage,” Authorea
Preprints , 2023.
[14] N. Fontana, F. Pierri, and L. M. Aiello, “Nicer than humans: How do
large language models behave in the prisoner’s dilemma?” Sep. 2024,
arXiv:2406.13605. [Online]. Available: http://arxiv.org/abs/2406.13605
[15] E. Papageorgiou, C. Chronis, I. Varlamis, and Y. Himeur, “A survey
on the use of large language models (llms) in fake news,” Future
Internet, vol. 16, no. 8, p. 298, Aug. 2024. [Online]. Available:
https://www.mdpi.com/1999-5903/16/8/298
[16] E. Sanu, T. K. Amudaa, P. Bhat, G. Dinesh, A. U. Kumar Chate,
and R. K. P, “Limitations of large language models,” in 2024 8th
International Conference on Computational System and Information
Technology for Sustainable Solutions (CSITSS) , Nov. 2024, p.
1–6. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/
10817070
[17] D. Quelle and A. Bovet, “The perils and promises of fact-checking with
large language models,” Frontiers in Artificial Intelligence , vol. 7, Feb.
2024. [Online]. Available: http://dx.doi.org/10.3389/frai.2024.1341697
[18] F. Pierri and S. Ceri, “False news on social media: A data-driven
survey,” SIGMOD Rec. , vol. 48, no. 2, p. 18–27, Dec. 2019. [Online].
Available: https://dl.acm.org/doi/10.1145/3377330.3377334
[19] J. A. Nasir, O. S. Khan, and I. Varlamis, “Fake news detection: A
hybrid cnn-rnn based deep learning approach,” International Journal
of Information Management Data Insights , vol. 1, no. 1, p. 100007,
Apr. 2021. [Online]. Available: https://www.sciencedirect.com/science/
article/pii/S2667096820300070
[20] A. Mewada and R. K. Dewang, “Cipf: Identifying fake profiles
on social media using a cnn-based communal influence propagation
framework,” Multimedia Tools and Applications , vol. 83, no. 10, p.
29419–29454, Mar. 2024. [Online]. Available: https://doi.org/10.1007/
s11042-023-16685-z
[21] J. Su, C. Cardie, and P. Nakov, “Adapting fake news detection to
the era of large language models,” in Findings of the Association
for Computational Linguistics: NAACL 2024 , K. Duh, H. Gomez,
and S. Bethard, Eds. Mexico City, Mexico: Association for
Computational Linguistics, Jun. 2024, p. 1473–1490. [Online].
Available: https://aclanthology.org/2024.findings-naacl.95/
[22] X. Zhang, J. Cao, X. Li, Q. Sheng, L. Zhong, and K. Shu,
“Mining dual emotion for fake news detection,” in Proceedings of
the Web Conference 2021, ser. WWW ’21. New York, NY, USA:
Association for Computing Machinery, Jun. 2021, p. 3465–3476.
[Online]. Available: https://dl.acm.org/doi/10.1145/3442381.3450004
[23] X. Zhang and W. Gao, “Reinforcement retrieval leveraging fine-
grained feedback for fact checking news claims with black-box
llm,” in Proceedings of the 2024 Joint International Conference
on Computational Linguistics, Language Resources and Evaluation
(LREC-COLING 2024), N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci,
S. Sakti, and N. Xue, Eds. Torino, Italia: ELRA and ICCL, May
2024, p. 13861–13873. [Online]. Available: https://aclanthology.org/
2024.lrec-main.1209/
[24] ——, “Towards llm-based fact verification on news claims with
a hierarchical step-by-step prompting method,” in Proceedings of
the 13th International Joint Conference on Natural Language
Processing and the 3rd Conference of the Asia-Pacific Chapter of the
Association for Computational Linguistics (Volume 1: Long Papers),
J. C. Park, Y. Arase, B. Hu, W. Lu, D. Wijaya, A. Purwarianti,
and A. A. Krisnadhi, Eds. Nusa Dua, Bali: Association for
Computational Linguistics, Nov. 2023, p. 996–1011. [Online].
Available: https://aclanthology.org/2023.ijcnlp-main.64/
[25] G. Nogara, F. Pierri, S. Cresci, L. Luceri, P. Törnberg, and
S. Giordano, “Toxic bias: Perspective api misreads german as
more toxic,” Jul. 2024, arXiv:2312.12651 [cs]. [Online]. Available:
http://arxiv.org/abs/2312.12651
[26] G. Liu, C. A. Bono, and F. Pierri, “Comparing diversity, negativity,
and stereotypes in chinese-language ai technologies: an investigation of
baidu, ernie and qwen,” Feb. 2025, arXiv:2408.15696 [cs]. [Online].
Available: http://arxiv.org/abs/2408.15696
[27] I. Augenstein, T. Baldwin, M. Cha, T. Chakraborty, G. L. Ciampaglia,
D. Corney, R. DiResta, E. Ferrara, S. Hale, A. Halevy, E. Hovy,
H. Ji, F. Menczer, R. Miguez, P. Nakov, D. Scheufele, S. Sharma, and
G. Zagni, “Factuality challenges in the era of large language models
and opportunities for fact-checking,” Nature Machine Intelligence ,
vol. 6, no. 8, p. 852–863, Aug. 2024. [Online]. Available:
https://www.nature.com/articles/s42256-024-00881-z
[28] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and
Y. Cao, “React: Synergizing reasoning and acting in language
models,” International Conference on Learning Representations
(ICLR) , Jan. 2023. [Online]. Available: https://par.nsf.gov/biblio/
10451467-react-synergizing-reasoning-acting-language-models
[29] S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. Koh,
M. Iyyer, L. Zettlemoyer, and H. Hajishirzi, “Factscore: Fine-
grained atomic evaluation of factual precision in long form text
generation,” in Proceedings of the 2023 Conference on Empirical
Methods in Natural Language Processing , H. Bouamor, J. Pino,
and K. Bali, Eds. Singapore: Association for Computational
Linguistics, Dec. 2023, p. 12076–12100. [Online]. Available: https:
//aclanthology.org/2023.emnlp-main.741/
[30] I.-C. Chern, S. Chern, S. Chen, W. Yuan, K. Feng, C. Zhou, J. He,
G. Neubig, and P. Liu, “Factool: Factuality detection in generative
ai – a tool augmented framework for multi-task and multi-domain
scenarios,” Jul. 2023, arXiv:2307.13528 [cs]. [Online]. Available:
http://arxiv.org/abs/2307.13528
[31] T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, E. Hambro,
L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language
models can teach themselves to use tools,” Advances in Neural Informa-
tion Processing Systems, vol. 36, p. 68539–68551, Dec. 2023. [Online].
Available: https://proceedings.neurips.cc/paper_files/paper/2023/hash/
d842425e4bf79ba039352da0f658a906-Abstract-Conference.html
[32] Y. Wang, R. Gangi Reddy, Z. M. Mujahid, A. Arora, A. Rubashevskii,
J. Geng, O. Mohammed Afzal, L. Pan, N. Borenstein, A. Pillai,
I. Augenstein, I. Gurevych, and P. Nakov, “Factcheck-bench: Fine-
grained evaluation benchmark for automatic fact-checkers,” in Findings
of the Association for Computational Linguistics: EMNLP 2024, Y. Al-
Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA:
Association for Computational Linguistics, Nov. 2024, p. 14199–14230.
[Online]. Available: https://aclanthology.org/2024.findings-emnlp.830
[33] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improving
factuality and reasoning in language models through multiagent debate,”
in Proceedings of the 41st International Conference on Machine Learn-
ing, ser. ICML ’24, vol. 235. Vienna, Austria: JMLR.org, Jul. 2024, p.
11733–11763.
[34] K. Kim, S. Lee, K.-H. Huang, H. P. Chan, M. Li, and H. Ji, “Can
llms produce faithful explanations for fact-checking? towards faithful
explainable fact-checking via multi-agent debate,” 2024. [Online].
Available: https://arxiv.org/abs/2402.07401
[35] X. Zhou, A. Sharma, A. X. Zhang, and T. Althoff, “Correcting
misinformation on social media with a large language model,” 2024.
[Online]. Available: https://arxiv.org/abs/2403.11169
[36] R. Zhao, X. Li, S. Joty, C. Qin, and L. Bing, “Verify-and-edit:
A knowledge-enhanced chain-of-thought framework,” in Proceedings
of the 61st Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers) , A. Rogers, J. Boyd-Graber, and
N. Okazaki, Eds. Toronto, Canada: Association for Computational
Linguistics, Jul. 2023, p. 5823–5840. [Online]. Available: https:
//aclanthology.org/2023.acl-long.320/
[37] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz,
and J. Weston, “Chain-of-verification reduces hallucination in large
language models,” in Findings of the Association for Computational
Linguistics: ACL 2024, L.-W. Ku, A. Martins, and V. Srikumar, Eds.
Bangkok, Thailand: Association for Computational Linguistics, Aug.
2024, p. 3563–3578. [Online]. Available: https://aclanthology.org/2024.
findings-acl.212/
[38] M. Li, B. Peng, M. Galley, J. Gao, and Z. Zhang, “Self-checker:
Plug-and-play modules for fact-checking with large language models,”
in Findings of the Association for Computational Linguistics: NAACL
2024, K. Duh, H. Gomez, and S. Bethard, Eds. Mexico City, Mexico:
Association for Computational Linguistics, Jun. 2024, p. 163–181.
[Online]. Available: https://aclanthology.org/2024.findings-naacl.12/
[39] B. Hu, Q. Sheng, J. Cao, Y. Shi, Y. Li, D. Wang, and P. Qi, “Bad
actor, good advisor: Exploring the role of large language models in
fake news detection,” Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 38, no. 20, p. 22105–22113, Mar. 2024. [Online].
Available: https://ojs.aaai.org/index.php/AAAI/article/view/30214
[40] X. Ma, Y. Zhang, K. Ding, J. Yang, J. Wu, and H. Fan,
“On fake news detection with llm enhanced semantics mining,”
in Proceedings of the 2024 Conference on Empirical Methods in
Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y.-N.
Chen, Eds. Miami, Florida, USA: Association for Computational
Linguistics, Nov. 2024, p. 508–521. [Online]. Available: https:
//aclanthology.org/2024.emnlp-main.31/
[41] A. Spangher, N. Peng, S. Gehrmann, and M. Dredze, “Do llms
plan like human writers? comparing journalist coverage of press
releases with llms,” in Proceedings of the 2024 Conference on
Empirical Methods in Natural Language Processing, Y. Al-Onaizan,
M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association
for Computational Linguistics, Nov. 2024, p. 21814–21828. [Online].
Available: https://aclanthology.org/2024.emnlp-main.1216/
[42] J. Wu, J. Guo, and B. Hooi, “Fake news in sheep’s clothing:
Robust fake news detection against llm-empowered style attacks,” in
Proceedings of the 30th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining, ser. KDD ’24. New York, NY, USA:
Association for Computing Machinery, Aug. 2024, p. 3367–3378.
[Online]. Available: https://dl.acm.org/doi/10.1145/3637528.3671977
[43] V. D. Lai, N. Ngo, A. Pouran Ben Veyseh, H. Man,
F. Dernoncourt, T. Bui, and T. H. Nguyen, “ChatGPT beyond
English: Towards a comprehensive evaluation of large language
models in multilingual learning,” in Findings of the Association
for Computational Linguistics: EMNLP 2023 , H. Bouamor, J. Pino,
and K. Bali, Eds. Singapore: Association for Computational
Linguistics, Dec. 2023, pp. 13 171–13 189. [Online]. Available:
https://aclanthology.org/2023.findings-emnlp.878
[44] L. Reynolds and K. McDonell, “Prompt programming for large
language models: Beyond the few-shot paradigm,” in Extended
Abstracts of the 2021 CHI Conference on Human Factors in
Computing Systems, ser. CHI EA ’21. New York, NY, USA:
Association for Computing Machinery, 2021. [Online]. Available:
https://doi.org/10.1145/3411763.3451760
[45] L. A. Henkel and M. E. Mattson, “Reading is believing: The truth effect
and source credibility,” Consciousness and cognition , vol. 20, no. 4, pp.
1705–1721, 2011.
[46] W. Wang, V. W. Zheng, H. Yu, and C. Miao, “A survey of zero-shot
learning: Settings, methods, and applications,” ACM Trans. Intell. Syst.
Technol., vol. 10, no. 2, pp. 13:1–13:37, Jan. 2019. [Online]. Available:
https://dl.acm.org/doi/10.1145/3293318
[47] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal,
A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh,
D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler,
M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish,
A. Radford, I. Sutskever, and D. Amodei, “Language models are
few-shot learners,” in Advances in Neural Information Processing
Systems, vol. 33. Curran Associates, Inc., 2020, p. 1877–1901.
[Online]. Available: https://proceedings.neurips.cc/paper_files/paper/
2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
[48] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le,
D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large
language models,” Advances in Neural Information Processing Systems,
vol. 35, pp. 24 824–24 837, 2022.
[49] A. Kong, S. Zhao, H. Chen, Q. Li, Y. Qin, R. Sun, X. Zhou,
E. Wang, and X. Dong, “Better zero-shot reasoning with role-play
prompting,” in Proceedings of the 2024 Conference of the North
American Chapter of the Association for Computational Linguistics:
Human Language Technologies (Volume 1: Long Papers) , K. Duh,
H. Gomez, and S. Bethard, Eds. Mexico City, Mexico: Association
for Computational Linguistics, Jun. 2024, p. 4099–4113. [Online].
Available: https://aclanthology.org/2024.naacl-long.228/
[50] K. Pelrine, A. Imouza, C. Thibault, M. Reksoprodjo, C. Gupta,
J. Christoph, J.-F. Godbout, and R. Rabbany, “Towards reliable
misinformation mitigation: Generalization, uncertainty, and gpt-4,”
in Proceedings of the 2023 Conference on Empirical Methods in
Natural Language Processing, H. Bouamor, J. Pino, and K. Bali,
Eds. Singapore: Association for Computational Linguistics, Dec.
2023, p. 6399–6429. [Online]. Available: https://aclanthology.org/2023.
emnlp-main.395/
[51] C. Shorten, C. Pierse, T. B. Smith, E. Cardenas, A. Sharma,
J. Trengrove, and B. van Luijt, “Structuredrag: Json response
formatting with large language models,” 2024. [Online]. Available:
https://arxiv.org/abs/2408.11061
[52] S. Schulhoff, M. Ilie, N. Balepur, K. Kahadze, A. Liu, C. Si, Y. Li,
A. Gupta, H. Han, S. Schulhoff, P. S. Dulepet, S. Vidyadhara, D. Ki,
S. Agrawal, C. Pham, G. Kroiz, F. Li, H. Tao, A. Srivastava, H. D.
Costa, S. Gupta, M. L. Rogers, I. Goncearenco, G. Sarli, I. Galynker,
D. Peskoff, M. Carpuat, J. White, S. Anadkat, A. Hoyle, and P. Resnik,
“The prompt report: A systematic survey of prompting techniques,”
2024. [Online]. Available: https://arxiv.org/abs/2406.06608
[53] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang,
D. Schuurmans, C. Cui, O. Bousquet, Q. Le, and E. Chi, “Least-to-
most prompting enables complex reasoning in large language models,”
2023. [Online]. Available: https://arxiv.org/abs/2205.10625
[54] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung,
A. Chowdhery, Q. Le, E. Chi, D. Zhou, and J. Wei, “Challenging
big-bench tasks and whether chain-of-thought can solve them,” in
Findings of the Association for Computational Linguistics: ACL 2023 ,
A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada:
Association for Computational Linguistics, Jul. 2023, p. 13003–13051.
[Online]. Available: https://aclanthology.org/2023.findings-acl.824/
[55] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large lan-
guage models are zero-shot reasoners,” Advances in neural information
processing systems , vol. 35, pp. 22 199–22 213, 2022.
[56] Z. Ji, T. Yu, Y. Xu, N. Lee, E. Ishii, and P. Fung, “Towards
mitigating LLM hallucination via self reflection,” in Findings
of the Association for Computational Linguistics: EMNLP 2023 ,
H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association
for Computational Linguistics, Dec. 2023, pp. 1827–1843. [Online].
Available: https://aclanthology.org/2023.findings-emnlp.123
[57] F. Corso, F. Pierri, and G. D. F. Morales, “Conspiracy theories and where
to find them on tiktok,” arXiv preprint arXiv:2407.12545 , 2024.
Evaluating open-source Large Language Models for
automated fact-checking - Supplementary Materials
Nicolò Fontana1, Francesco Corso1,2, Enrico Zuccolotto1, Francesco Pierri1
1Politecnico di Milano, Italy
2CENTAI, Italy
{nicolo.fontana, francesco.corso, francesco.pierri}@polimi.it
SUPPLEMENTARY MATERIALS
Dataset
Our analysis utilizes the Fact-Check Insights dataset, a
comprehensive resource invaluable to researchers, journalists,
technologists, and other stakeholders engaged in countering
political misinformation and falsehoods online. This dataset
comprises structured data from tens of thousands of claims
made by political figures and social media posts, meticulously
scrutinized and rated by independent fact-checker organiza-
tions such as AFP, Politifact, and Snopes.
A. Data Selection
Given the multipurpose and extensive nature of the dataset,
we undertook a preprocessing phase to align it with our
specific research requirements. We selected only a subset of
pertinent columns, as seen in Table I.
We decided to retain the fact-checking article's publication
date to maintain a temporal reference point. This approach
enables temporal analysis and enhances our ability to assess
the importance of external context in claims made after the
training data cut-off date.
B. Data Cleaning
We applied a rigorous preprocessing methodology to ensure
data accuracy and reliability. The first step verified the
language of each entry by cross-checking two sources.
We used the langdetect library to identify the
language of the text in claimReviewed . This detection
was then cross-referenced with the language extracted from
the respective website domains in the url column to ensure
consistency in language identification.
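The cross-check described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `DOMAIN_LANGUAGE` table, the function names, and the rule that any disagreement yields "unknown" are all assumptions, since the text does not say how the language was derived from the domain. The detector is passed in as a function so the sketch runs without third-party packages.

```python
from urllib.parse import urlparse

# Hypothetical domain-to-language table (illustrative entries only).
DOMAIN_LANGUAGE = {
    "www.politifact.com": "en",
    "www.snopes.com": "en",
    "factuel.afp.com": "fr",
}


def domain_language(url, mapping=DOMAIN_LANGUAGE):
    """Return the language associated with the URL's host, or None if unknown."""
    host = urlparse(url).netloc.lower()
    return mapping.get(host)


def cross_checked_language(text, url, detect_fn, mapping=DOMAIN_LANGUAGE):
    """Return a language code only when the text-based and domain-based guesses agree."""
    try:
        detected = detect_fn(text)
    except Exception:
        # Detection failed (for example on empty text): treat as unknown.
        return None
    expected = domain_language(url, mapping)
    # Assumption: a missing guess or a disagreement is reported as unknown (None).
    if detected is None or expected is None or detected != expected:
        return None
    return detected
```

With langdetect installed, its `detect` function can be passed as `detect_fn`, for example `cross_checked_language(claim, url, detect)` after `from langdetect import detect`.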
Subsequently, we refined the dataset through a series
of operations to ensure its accuracy and reliability. First,
we eliminated entries that lacked critical data, such as
the claimReviewed or datePublished fields, or cases
where the datePublished was set in the future. The ab-
sence of a claim makes fact-checking impossible, and without
a valid date, the entry becomes unreliable. Additionally, entries
that did not include an alternateName , or verdict, were
filtered out, as a verdict is essential for our analysis.
Next, we excluded entries with unknown language. If lan-
guage detection was unsuccessful, the entry was removed to
maintain data consistency and reliability across the dataset.
Finally, we addressed the issue of duplicate entries. To
preserve the integrity of the data, we carefully identified and
removed duplicates. This step was crucial in ensuring our
analysis was not skewed by repetitive information.
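The filtering steps above can be sketched on plain Python records. This is an illustration under stated assumptions: entries are dictionaries keyed by the field names used in the text, `datePublished` is already parsed into a `datetime.date`, a `language` of `None` marks failed detection, and a duplicate is taken to mean the same claim text and URL (the text does not define the duplicate criterion).

```python
from datetime import date


def clean_entries(entries, today):
    """Keep entries that pass the missing-data, date, verdict, language and duplicate checks."""
    seen = set()
    kept = []
    for entry in entries:
        claim = (entry.get("claimReviewed") or "").strip()
        published = entry.get("datePublished")  # assumed: datetime.date or None
        verdict = (entry.get("alternateName") or "").strip()
        language = entry.get("language")  # assumed: None when detection failed

        # Drop entries with no claim, no date, or a date set in the future.
        if not claim or published is None or published > today:
            continue
        # Drop entries without a verdict.
        if not verdict:
            continue
        # Drop entries whose language is unknown.
        if language is None:
            continue
        # Assumption: duplicates share the same claim text and URL.
        key = (claim.lower(), entry.get("url"))
        if key in seen:
            continue
        seen.add(key)
        kept.append(entry)
    return kept
```

Passing `today` explicitly keeps the function deterministic, which makes the future-date rule easy to test.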
This careful selection left us with a dataset of around
200,000 claims in 40 languages spanning just under 30 years, from
1996 to 2024. As shown in Figure 3, there has been a higher
concentration of claims in recent years, with a significant
increase starting in 2019. The surge in 2020 includes many
claims related to COVID-19 and a substantial number focused
on political figures. As shown in Figure 4, English
is the predominant language of fact-checkers, followed by
Arabic, Spanish, Portuguese, and Italian, reflecting the top
languages spoken in the Western world.
After selecting the English language, we extracted the text
through a scraping procedure, using the newspaper3k library
on the url column. This process created an additional field
containing the text verifying the claimReviewed . These
articles have varying lengths from 1000 to 20000 words (Fig-
ure 2). Subsequently, we removed entries with unsuccessful
scrapes. This leaves us with 60,000 claims, each linked to its
corresponding golden document.
arXiv:2503.05565v1 [cs.CY] 7 Mar 2025
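The scraping step described above can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: the function names and the `article_text` field are hypothetical, and an empty result or any raised exception is assumed to count as an unsuccessful scrape. The newspaper3k import is kept inside the extractor so that a different extractor can be substituted.

```python
def extract_with_newspaper(url):
    """Download and parse one article with newspaper3k, returning its body text."""
    from newspaper import Article  # imported lazily; requires the newspaper3k package

    article = Article(url)
    article.download()
    article.parse()
    return article.text


def scrape_articles(rows, extract=extract_with_newspaper):
    """Attach article text to each row and drop rows whose scrape failed."""
    kept = []
    for row in rows:
        try:
            text = extract(row["url"])
        except Exception:
            # Assumption: any download or parsing error is an unsuccessful scrape.
            continue
        if text and text.strip():
            # 'article_text' is a hypothetical name for the additional field.
            kept.append({**row, "article_text": text})
    return kept
```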
[Figure 1 plot: x-axis Year (2014 to 2024); y-axis Number of claims (0 to 500); series False and True.]
Fig. 1. For each year, the number of True and False claims in the used dataset.
Fig. 2. Distribution of articles length.
Fig. 3. Distribution of fact-checked claims over the years.
C. Labeling
Following the methodology outlined by Quelle [1], we
divided our dataset into two categories: true and false. The true
category includes entries deemed accurate or of higher quality,
while the false category encompasses content lacking a factual
basis, such as fake quotes, conspiracy theories, misleading
edited media, or ironic and exaggerated criticism.
Fig. 4. Number of fact-checked claims by the claim language.
Fig. 5. Verdict distribution.
To classify our entries, we manually mapped the 500 most
common verdicts into one of these two categories (a sample
is shown in Figure 5). Although this process was time-
consuming and meticulous, it resulted in a robust and rigorous
classification of our data. For example, labels like Geppetto
mark, trustworthy, mostly true, and correct attribution were
categorized as true, while labels such as false, mostly false,
Quattro Pinocchio, and unproven or legend were classified
as false. Additionally, terms like mixture were categorized as
false, given that a statement containing both true and false
elements is considered overall inaccurate.
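The manual mapping can be pictured as a lookup table from verdict strings to the binary label. The sketch below contains only the example verdicts named in the text, not the full set of mapped verdicts; the lowercase and whitespace normalization and the `None` result for unmapped verdicts are assumptions.

```python
# Illustrative subset of the verdict-to-label mapping (examples from the text only).
VERDICT_TO_LABEL = {
    "geppetto mark": True,
    "trustworthy": True,
    "mostly true": True,
    "correct attribution": True,
    "false": False,
    "mostly false": False,
    "quattro pinocchio": False,
    "unproven": False,
    "legend": False,
    "mixture": False,  # mixed true/false statements count as false
}


def label_verdict(verdict, mapping=VERDICT_TO_LABEL):
    """Return True or False for a mapped verdict, or None when the verdict is unmapped."""
    if verdict is None:
        return None
    # Assumption: verdicts are matched case-insensitively after trimming whitespace.
    return mapping.get(verdict.strip().lower())
```

Returning `None` for unmapped verdicts lets a later step drop those entries explicitly instead of silently assigning them to a class.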
D. Balancing
Due to a substantial class imbalance, 90% of claims were
categorized as false . Since fact-checkers typically verify false
claims, steps were taken to address this disparity by equalizing
the number of true and false samples. This adjustment resulted
in a much smaller but more balanced dataset.
As depicted in Figure 3, our dataset spans various years,
with a notable concentration of claims from 2013 to 2024.
This temporal distribution enhances the statistical reliability
of the dataset across different periods. Given the limited API
usage and the large number of experiments to be conducted,
we decided to select a subset of the balanced dataset. Specif-
ically, we chose 50 entries for each year from 2013 onward,
consisting of 25 true claims and 25 false ones, as shown in
Figure 1. Stratified sampling based on verdicts was employed
to ensure optimal representation of both categories.
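The per-year selection described above can be sketched with the standard library. This is a simplified illustration: it stratifies on the binary true/false label, whereas the text stratifies on verdicts, and the `year` and `label` keys, the fixed seed, and the function name are assumptions.

```python
import random
from collections import defaultdict


def sample_per_year(entries, per_class=25, first_year=2013, seed=0):
    """Draw up to `per_class` true and `per_class` false claims for each year."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    groups = defaultdict(list)
    for entry in entries:
        if entry["year"] >= first_year:
            groups[(entry["year"], entry["label"])].append(entry)

    sample = []
    for key in sorted(groups):
        pool = groups[key]
        # If a class has fewer claims than requested, take all of them.
        k = min(per_class, len(pool))
        sample.extend(rng.sample(pool, k))
    return sample
```

With the default arguments, each year contributes at most 50 entries: 25 labeled true and 25 labeled false.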
For the year 2024, instead, we included 500 samples. In
this case, the dataset could not be perfectly balanced due to
the limited number of true claims, with only 30 available.
This is particularly significant because most of the LLMs used
in this research were trained before 2024, except for Mistral
7B, which was fine-tuned in June 2024. This portion of the
dataset estimates how well these models can perform as fact-
checkers in real-time. This process culminated in creating a
comprehensive English dataset used in all the experiments for
each task.
TABLE I
DESCRIPTION OF FACT-CHECKING DATABASE FIELDS.
Field                        Description
claimReviewed                The statement or claim under examination.
datePublished                The date of the article's publication.
url                          The source URL from which the data was gathered.
reviewRating.alternateName   The verdict assigned to the claim.
author.name                  The name of the fact-checking organization conducting the analysis.
language                     The language in which the claim was evaluated.
reviewRating.author.name     The author of the claim being reviewed.
E. Mistral 7B Results
In our analysis, we initially included
Mistral-7B-Instruct-v0.3 as a representative
of smaller models within the Mistral family. However, we
observed a high percentage of faulty responses in the first
two preparatory tasks, leading us to exclude its predictions
from the main analysis. For completeness, we provide the
plots including the results obtained by Mistral 7B for all
three tasks.
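The fault percentages reported in the plots can be illustrated with the sketch below. It rests on assumptions: a response is treated as faulty when exactly one valid label cannot be extracted from it, the label set `("true", "false")` is a placeholder, and the median is taken over per-prompt fault rates. The helper names are hypothetical, and the exact fault criterion used in the experiments is not specified in this section.

```python
import re
from statistics import median


def parse_label(response, labels=("true", "false")):
    """Return the single valid label found in the response, else None."""
    pattern = r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b"
    found = set(re.findall(pattern, response.lower()))
    # Zero labels or conflicting labels both count as a fault (assumption).
    return found.pop() if len(found) == 1 else None


def fault_rate(responses, labels=("true", "false")):
    """Fraction of responses from which no unique label can be extracted."""
    if not responses:
        raise ValueError("responses must not be empty")
    faults = sum(parse_label(r, labels) is None for r in responses)
    return faults / len(responses)


def median_fault_rate(responses_by_prompt, labels=("true", "false")):
    """Median of the per-prompt fault rates for one model."""
    return median(fault_rate(rs, labels) for rs in responses_by_prompt.values())
```

Under this reading, a median fault rate of 1.0 means that for at least half of the prompts none of the model's responses could be parsed.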
[Figure 6 plot. x-axis: Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B; y-axis: F1 score (0.0 to 1.0); legend: Positive, Negative, RoBERTa positive, RoBERTa negative.]
Fig. 6. Task 1: Models’ F1 scores computed for both classes. Fine-tuned
RoBERTa is used as a reference baseline.
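For reference, a per-class F1 score of the kind plotted in these figures can be computed as in the sketch below. The function name `f1_for_class` and the label strings are illustrative, and the sketch assumes that faulty responses have already been excluded from the predictions, which is an assumption about the evaluation protocol.

```python
def f1_for_class(y_true, y_pred, target):
    """F1 score treating `target` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    if tp == 0:
        # No correct prediction for this class: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Calling the function once with the positive label and once with the negative label yields the two scores shown for each model.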
[Figure 7 plot. x-axis: F1 positive class (0.0 to 1.0); y-axis: F1 negative class (0.0 to 1.0); legend: Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B.]
Fig. 7. Task 1: Comparison between the F1 scores obtained by each model
for each prompt variation with respect to both classes. The bisector is reported
as a reference.
[Figure 8 plot. One panel per model (Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B); rows: Average, Chain-of-Thought, Few-shot, Zero-shot; horizontal axis: 0.0 to 1.0; entries are annotated with the best prompt's variant (e.g., full article, enriched, summary, self-reflection).]
Fig. 8. Task 1: For each model, for each prompt category (Zero-Shot,
Few-Shot, and Chain-of-Thought) the best prompt’s F1 score is reported.
The average F1 score for each model is also reported with 0.95 confidence
intervals.
[Figure 9 plot. x-axis: Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B; y-axis: % of Faults (0.0 to 1.0).]
Fig. 9. Task 1: Percentage of faults for each model. The median fault
percentage for each model is: 0.241 (Llama3 8B), 0.003 (Llama3 70B), 1.0
(Mistral 7B), 0.393 (Mixtral 8x7B).
[Figure 10 plot. x-axis: Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B; y-axis: F1 score (0.0 to 1.0); legend: Positive, Negative, RoBERTa positive, RoBERTa negative.]
Fig. 10. Task 2: Models’ F1 scores computed for both classes. Fine-tuned
RoBERTa is used as a reference baseline.
[Figure 11 plot. One panel per model (Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B); rows: Average, Chain-of-Thought, Few-shot, Zero-shot; horizontal axis: 0.0 to 1.0; entries are annotated with the best prompt's variant (e.g., full article, enriched, summary, self-reflection).]
Fig. 11. Task 2: For each model, for each prompt category (Zero-Shot,
Few-Shot, and Chain-of-Thought) the best prompt’s F1 score is reported.
The average F1 score for each model is also reported with 0.95 confidence
intervals.
[Figure 12 plot. x-axis: Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B; y-axis: % of Faults (0.0 to 1.0).]
Fig. 12. Task 2: The percentage of faults for each model. The median fault
percentage for each model is: 0.392 (Llama3 8B), 0.066 (Llama3 70B), 1.0
(Mistral 7B), 0.370 (Mixtral 8x7B).
[Figure 13 plot. x-axis: Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B; y-axis: F1 score (0.0 to 1.0); legend: Positive, Negative, RoBERTa positive, RoBERTa negative.]
Fig. 13. Task 3: Models’ F1 scores computed for both classes. Fine-tuned
RoBERTa is used as a reference baseline.
[Figure 14 plot. x-axis: Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B; y-axis: F1 score (0.0 to 1.0); legend: Pre 2024, 2024, RoBERTa pre2024, RoBERTa 2024.]
Fig. 14. Task 3: Models’ F1 scores computed for the positive class
distinguishing between claims dated before 2024 (thus possibly included in
the models’ training dataset) and from 2024. Fine-tuned RoBERTa is used as
a reference baseline.
[Figure 15 plot. x-axis: Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B; y-axis: F1 score (0.0 to 1.0); legend: Pre 2024, 2024, RoBERTa pre2024, RoBERTa 2024.]
Fig. 15. Task 3: Models’ F1 scores computed for the negative class
distinguishing between claims dated before 2024 (thus possibly included in
the models’ training dataset) and from 2024. Fine-tuned RoBERTa is used as
a reference baseline.
[Figure 16 plot. x-axis: Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B; y-axis: % of Faults (0.0 to 1.0).]
Fig. 16. Task 3: The percentage of faults for each model. The median fault
percentage for each model is: 0.02 (Llama3 8B), 0.024 (Llama3 70B), 0.005
(Mistral 7B), 0.051 (Mixtral 8x7B).
[Figure 17 plot. One panel per model (Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B); rows: Average, Wikipedia, Google, Zero-shot; horizontal axis: 0.0 to 1.0; entries are annotated with the best prompt's variant (e.g., summary, snippet).]
Fig. 17. Task 3: For each model, for each prompt category (Zero-Shot, with
Google contextual information, and with Wikipedia contextual information)
the best prompt’s F1 score is reported. The average F1 score for each model
is also reported with 0.95 confidence intervals.
F. Task 3 Additional Plots
[Figure 18 plot. Title: Llama3 8B - positive; x-axis: Recall (0.0 to 1.0); y-axis: Precision (0.0 to 1.0); legend: Zero-shot, Google, Wikipedia, RoBERTa, Random.]
Fig. 18. Task 3: Precision and Recall curve for the positive class for Llama3
8B. The best prompt of each category is highlighted. Fine-tuned RoBERTa
and a random classifier are used as a baseline for comparison.
[Figure 19 plot. Title: Llama3 70B - positive; x-axis: Recall (0.0 to 1.0); y-axis: Precision (0.0 to 1.0); legend: Zero-shot, Google, Wikipedia, RoBERTa, Random.]
Fig. 19. Task 3: Precision and Recall curve for the positive class for Llama3
70B. The best prompt of each category is highlighted. Fine-tuned RoBERTa
and a random classifier are used as a baseline for comparison.
[Plot: "Mixtral 8x7B - positive". x-axis: Recall; y-axis: Precision; both range from 0.0 to 1.0. Legend: Zero-shot, Google, Wikipedia, RoBERTa, Random.]
Fig. 20. Task 3: Precision and Recall curve for the positive class for Mixtral
8x7B. The best prompt of each category is highlighted. Fine-tuned RoBERTa
and a random classifier are used as a baseline for comparison.
REFERENCES
[1] D. Quelle and A. Bovet, “The perils and promises of fact-checking with
large language models,” Frontiers in Artificial Intelligence , vol. 7, Feb.
2024. [Online]. Available: http://dx.doi.org/10.3389/frai.2024.1341697
[Plot: "Llama3 8B - negative". x-axis: Recall; y-axis: Precision; both range from 0.0 to 1.0. Legend: Zero-shot, Google, Wikipedia, RoBERTa, Random.]
Fig. 21. Task 3: Precision and Recall curve for the negative class for Llama3
8B. The best prompt of each category is highlighted. Fine-tuned RoBERTa
and a random classifier are used as a baseline for comparison.
[Plot: "Llama3 70B - negative". x-axis: Recall; y-axis: Precision; both range from 0.0 to 1.0. Legend: Zero-shot, Google, Wikipedia, RoBERTa, Random.]
Fig. 22. Task 3: Precision and Recall curve for the negative class for Llama3
70B. The best prompt of each category is highlighted. Fine-tuned RoBERTa
and a random classifier are used as a baseline for comparison.
[Plot: "Mixtral 8x7B - negative". x-axis: Recall; y-axis: Precision; both range from 0.0 to 1.0. Legend: Zero-shot, Google, Wikipedia, RoBERTa, Random.]
Fig. 23. Task 3: Precision and Recall curve for the negative class for Mixtral
8x7B. The best prompt of each category is highlighted. Fine-tuned RoBERTa
and a random classifier are used as a baseline for comparison.
Page 21:
[Plot: "Llama3 8B - positive". x-axis: False Positive Rate; y-axis: True Positive Rate; both range from 0.0 to 1.0. Legend: Zero-shot, Google, Wikipedia, RoBERTa, Random.]
Fig. 24. Task 3: ROC curve for the positive class for Llama3 8B. The best
prompt of each category is highlighted. Fine-tuned RoBERTa and a random
classifier are used as a baseline for comparison.
[Plot: "Llama3 70B - positive". x-axis: False Positive Rate; y-axis: True Positive Rate; both range from 0.0 to 1.0. Legend: Zero-shot, Google, Wikipedia, RoBERTa, Random.]
Fig. 25. Task 3: ROC curve for the positive class for Llama3 70B. The best
prompt of each category is highlighted. Fine-tuned RoBERTa and a random
classifier are used as a baseline for comparison.
[Plot: "Mixtral 8x7B - positive". x-axis: False Positive Rate; y-axis: True Positive Rate; both range from 0.0 to 1.0. Legend: Zero-shot, Google, Wikipedia, RoBERTa, Random.]
Fig. 26. Task 3: ROC curve for the positive class for Mixtral 8x7B. The best
prompt of each category is highlighted. Fine-tuned RoBERTa and a random
classifier are used as a baseline for comparison.
[Plot: "Llama3 8B - negative". x-axis: False Positive Rate; y-axis: True Positive Rate; both range from 0.0 to 1.0. Legend: Zero-shot, Google, Wikipedia, RoBERTa, Random.]
Fig. 27. Task 3: ROC curve for the negative class for Llama3 8B. The best
prompt of each category is highlighted. Fine-tuned RoBERTa and a random
classifier are used as a baseline for comparison.
[Plot: "Llama3 70B - negative". x-axis: False Positive Rate; y-axis: True Positive Rate; both range from 0.0 to 1.0. Legend: Zero-shot, Google, Wikipedia, RoBERTa, Random.]
Fig. 28. Task 3: ROC curve for the negative class for Llama3 70B. The best
prompt of each category is highlighted. Fine-tuned RoBERTa and a random
classifier are used as a baseline for comparison.
[Plot: "Mixtral 8x7B - negative". x-axis: False Positive Rate; y-axis: True Positive Rate; both range from 0.0 to 1.0. Legend: Zero-shot, Google, Wikipedia, RoBERTa, Random.]
Fig. 29. Task 3: ROC curve for the negative class for Mixtral 8x7B. The best
prompt of each category is highlighted. Fine-tuned RoBERTa and a random
classifier are used as a baseline for comparison.