arXiv:2503.05565v1 [cs.CY]

Evaluating open-source Large Language Models for automated fact-checking

Authors: Nicolò Fontana, Francesco Corso, Enrico Zuccolotto, Francesco Pierri

Published: 2025-03-07

Abstract:

The increasing prevalence of online misinformation has heightened the demand for automated fact-checking solutions. Large Language Models (LLMs) have emerged as potential tools for assisting in this task, but their effectiveness remains uncertain. This study evaluates the fact-checking capabilities of various open-source LLMs, focusing on their ability to assess claims with different levels of contextual information. We conduct three key experiments: (1) evaluating whether LLMs can identify the semantic relationship between a claim and a fact-checking article, (2) assessing models' accuracy in verifying claims when given a related fact-checking article, and (3) testing LLMs' fact-checking abilities when leveraging data from external knowledge sources such as Google and Wikipedia. Our results indicate that LLMs perform well in identifying claim-article connections and verifying fact-checked stories but struggle with confirming factual news, where they are outperformed by traditional fine-tuned models such as RoBERTa. Additionally, the introduction of external knowledge does not significantly enhance LLMs' performance, calling for more tailored approaches. Our findings highlight both the potential and limitations of LLMs in automated fact-checking, emphasizing the need for further refinements before they can reliably replace human fact-checkers.

Paper Content:

Nicolò Fontana (1), Francesco Corso (1,2), Enrico Zuccolotto (1), Francesco Pierri (1)
(1) Politecnico di Milano, Italy; (2) CENTAI, Italy
{nicolo.fontana, francesco.corso, francesco.pierri}@polimi.it

Index Terms: fact-checking, large language models, prompting analysis

I. INTRODUCTION

In today’s digital era, the vast availability of online information has facilitated the rapid spread of both misinformation and disinformation.
Online social platforms, in particular, enable false narratives to gain traction, due to their high-connectivity nature [1]. This challenge is amplified when influential individuals propagate misleading content, with research suggesting that such misinformation can significantly impact critical events, including election outcomes [2], [3].

The burden of fact-checking and debunking false claims has been traditionally left to journalists. However, the rise of the Internet, combined with growing distrust in traditional media [4], has led to the emergence of independent fact-checking organizations [5]. These groups focus on verifying rumors, misconceptions, and fake news spread online. One of the most notable fact-checking initiatives, PolitiFact (footnote 1), received the Pulitzer Prize for national reporting in 2009 (footnote 2). PolitiFact introduced a rating system to evaluate claims, and many organizations have since adopted similar systems. However, these subjective and often ambiguous ratings complicate comparisons across fact-checks [6].

Footnote 1: https://www.politifact.com/
Footnote 2: https://www.pulitzer.org/prize-winners-by-year/2009

Fact-checking remains a labor-intensive process, requiring teams to spend days or even weeks verifying claims [7]. Given the overwhelming flow of information online (footnote 3), traditional methods cannot keep pace, highlighting the need for more efficient approaches.

One promising avenue is the automation of fact-checking through artificial intelligence (AI) technologies [8]. Researchers have investigated AI-driven models, such as Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs), to aid in these efforts. While these models have made significant progress, their effectiveness remains limited [9], [10].

More recently, Large Language Models (LLMs) based on transformer architectures have demonstrated significant potential [11], [12].
These models excel at generating natural language responses, answering complex questions, and producing high-quality content [13], [14], making them an accessible and versatile tool for a broad audience. Their advanced reasoning capabilities further position them as suitable candidates for fact-checking [15]. However, a major limitation of LLMs lies in the outdated nature of their training data [16], which hampers their ability to address recent misinformation effectively. To address this challenge, researchers are exploring innovative approaches to integrate real-time, external knowledge into these models. Such advancements aim to enhance their ability to provide accurate and timely responses to rapidly evolving misinformation [17].

Building on previous research, our study explores the use of diverse open LLMs to investigate whether smaller models can achieve a more favorable balance between cost-effectiveness and accuracy compared to larger models. Cost-effective methods are particularly critical for watchdog groups, non-profits, and smaller organizations that often operate with limited budgets and resources. These entities play a crucial role in combating misinformation but frequently lack the financial capacity to deploy and maintain large-scale AI systems. By identifying efficient yet accurate alternatives, our research aims to empower such organizations with accessible tools, enabling them to enhance their fact-checking capabilities without incurring prohibitive costs. Additionally, as a baseline for comparison, we employed a Small Language Model (SLM) obtained by fine-tuning RoBERTa. This reference allows us to put the LLMs’ performance into perspective.

Footnote 3: https://explodingtopics.com/blog/data-generated-per-day

We articulate our contributions into three key research questions:

1) RQ1: Can LLMs identify the connection between a claim and an article?
This research question investigates whether LLMs can accurately determine if a given claim and its paired article are contextually related, addressing the same topic. The focus is on evaluating the models’ capability to assess the relevance and alignment between the claim and the content of the article.

2) RQ2: Can LLMs judge a claim based on the related fact-checking article? This question explores whether LLMs can effectively analyze a related fact-checking article and provide a trustworthy evaluation of the claim based on the information contained within the document.

3) RQ3: Are LLMs able to retrieve contextual information and fact-check claims? This question assesses the models’ ability to verify the truthfulness of a claim when they are provided with a related article, as opposed to relying solely on their pre-trained internal knowledge. The objective is to determine whether the inclusion of new, external information enhances their accuracy in fact-checking tasks.

To address the first two questions (RQ1, RQ2), we conduct experiments using 24 different prompts to guide the models’ performance, emphasizing the role of effective prompting. For the third question (RQ3), we use neutral and straightforward prompts to evaluate various sources of external information, such as Google or Wikipedia (or their absence), and the representation format (snippet, summary, or full article) that yields the best results. In this third scenario, the model acts autonomously, conducting internet searches to gather information and refine its verdicts. For Google searches, we limit the results to a time frame before the claim to avoid bias and better simulate real-world fact-checking.
Our analysis utilizes the Fact-Check Insights dataset (footnote 4), which is a comprehensive resource containing structured data from tens of thousands of claims made by political figures and social media posts, scrutinized and rated by independent fact-checking organizations such as AFP (footnote 5), PolitiFact (footnote 6), and Snopes (footnote 7).

Footnote 4: https://www.factcheckinsights.org/download
Footnote 5: https://factcheck.afp.com
Footnote 6: https://www.politifact.com/
Footnote 7: https://www.snopes.com/

II. RELATED WORK

The study of fake news generation and dissemination has gained increasing attention, particularly with the rise of online platforms that facilitate rapid information diffusion [18]. Research in this domain has primarily focused on two key aspects: the identification of fake news and the detection of its spreaders [15].

Over time, human fact-checking has been increasingly supported by machine learning methods. Early approaches leveraged classical machine learning techniques for keyword-based text analysis using traditional neural networks [19] and applied network analysis to assess the trustworthiness of sources propagating misinformation [20]. More recently, the emergence of Large Language Models has significantly advanced fake news detection, enabling more sophisticated analyses. These include evaluating the performance of fake news detectors against synthetically generated misinformation rather than human-created content [21] and conducting sentiment-based assessments through emotional analysis of news texts [22].

Some approaches aim to enhance human fact-checking by integrating feedback from LLMs. For instance, LLMs have been used to improve document retrieval, aiding in the verification of statements [23], or to decompose claims hierarchically into sub-statements for systematic verification [24].
Conversely, other methodologies seek to achieve fully autonomous fact-checking, relying on LLMs to assess the veracity of claims using a zero-shot approach, without requiring additional training or human intervention.

A major challenge in automating fact-checking with LLMs lies in their tendency to produce biased answers [25], [26] and to hallucinate facts [27], as well as in their reliance on static training data, which may often be outdated [16]. To address this limitation, researchers have explored methods to integrate external knowledge into LLMs using frameworks such as ReAct, which combines reasoning and action within LLMs [28]. This has led to several innovative fact-checking approaches. For example, FActScore assigns factuality scores to responses based on information retrieved from Wikipedia [29], while FACTOOL employs various tools for evidence collection and reasoning to analyze claims and assign factuality labels based on supporting evidence [30]. Additionally, Toolformer enables LLMs to autonomously integrate external tools, enhancing performance across various tasks while preserving core language modeling capabilities [31]. Despite these advancements, research indicates that while LLMs generally perform well, they struggle more with verifying factual statements than identifying false ones [17].

To further improve accuracy, researchers have explored collaborative approaches where multiple models interact and reason together to reach a verdict. FactCheck-GPT, for instance, addresses factual inaccuracies by enabling multiple LLMs to debate and converge on a consensus through iterative discussions, supplemented by external searches such as Google [32]–[34].

Recently, the focus has expanded beyond mere verification to include misinformation correction. Systems like MUSE represent a significant advancement in this area, demonstrating the evolving capability of LLMs to both detect and correct misinformation in real-time environments [35].
Similarly, Verify-and-Edit enhances LLM reasoning by incorporating Chain-of-Thought (CoT) reasoning and external knowledge sources such as DrQA, Wikipedia, and Google Search to refine responses and correct factual inaccuracies [36]. The Chain-of-Verification technique further improves factual accuracy by leveraging parametric knowledge to revise LLM-generated responses [37]. Another approach, SELF-CHECKER, evaluates factuality by integrating real-time web searches (e.g., Bing) to assign factuality labels, reinforcing the role of external knowledge in automated fact-checking [38].

Contrary to prior research suggesting LLM superiority, some studies indicate that task-specific Small Language Models (SLMs), such as fine-tuned BERT, may outperform LLMs in certain fact-checking tasks. One study proposes a hybrid approach where LLMs generate rationales while SLMs handle classification, leveraging the strengths of both model types [39]. This finding aligns with our analysis, which shows that LLMs fail to achieve significant performance improvements in zero-shot fact-checking on isolated statements, even when provided with temporally contextualized information. Similarly, [40] demonstrates that LLMs can be more effective as supplementary tools to traditional detection methods. Specifically, the study analyzes feature propagation in networks of entities and concepts extracted from articles using LLM assistance.

Another relevant research direction investigates the stylistic metrics underlying written content. One study explores whether LLMs exhibit human-like planning and creativity in news article generation [41]. Additionally, building on the Undeutsch psychological theory that memories of real events differ from those of imagined ones, [42] suggests that fake and real news may have distinct stylistic patterns.
This raises the possibility that LLM-based fact-checking methods may rely more on stylistic cues than actual content analysis, particularly in low-context scenarios, an issue reflected in our final task.

Unlike other studies, we restrict the retrieval date settings in Google to obtain realistic estimates of how the LLM would perform on new data. We also use a new dataset called Fact-Check Insights, which compiles information from various sources to reduce bias. Instead of feeding articles to the LLM, we provide contextual data, such as the author and the date the claim was made. Additionally, we conduct a temporal analysis of claim accuracy to determine whether the training data’s cutoff date affects performance. Furthermore, we focus on more economical, open-source models, limiting our analysis to those with at most 70 billion parameters to better understand their strengths and weaknesses.

III. EXPERIMENTAL DESIGN

We focus exclusively on English-language claims, as most models are primarily trained in English and are expected to perform best in this language [43]. Our analysis utilizes the Fact-Check Insights dataset, a comprehensive resource available to researchers, journalists, and other stakeholders engaged in countering political misinformation and falsehoods online. This dataset comprises structured data from tens of thousands of claims made by political figures and social media posts, scrutinized and rated by independent fact-checking organizations such as AFP, PolitiFact, and Snopes.

A. Data Preprocessing

Starting from a collection of over 200K observations downloaded from Fact-Check Insights in June 2024, we removed duplicate entries, rows with incomplete or incorrect attributes, and data that were not in the English language. Next, we scraped the original fact-checking article associated with each claim in the dataset, removing those that could not be collected successfully.
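The filtering steps just described (deduplication, removal of incomplete rows, English only) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the field names `claim`, `rating`, `article_url`, and `language`, the `"en"` language code, and the choice of deduplication key are all hypothetical.

```python
from typing import Dict, Iterable, List

# Hypothetical column names; the real Fact-Check Insights schema may differ.
REQUIRED_FIELDS = ("claim", "rating", "article_url", "language")


def clean_claims(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep complete, English, non-duplicate rows, preserving input order."""
    seen = set()
    kept = []
    for row in rows:
        # Drop rows where any required attribute is missing or blank.
        if any(not (row.get(field) or "").strip() for field in REQUIRED_FIELDS):
            continue
        # Keep English-language entries only (assumed "en" code).
        if row["language"].strip().lower() != "en":
            continue
        # Drop duplicates, keyed on normalized claim text plus article URL.
        key = (row["claim"].strip().lower(), row["article_url"].strip())
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


sample_rows = [
    {"claim": "A", "rating": "false", "article_url": "u1", "language": "en"},
    {"claim": "a ", "rating": "false", "article_url": "u1", "language": "en"},  # duplicate
    {"claim": "B", "rating": "", "article_url": "u2", "language": "en"},        # incomplete
    {"claim": "C", "rating": "true", "article_url": "u3", "language": "it"},    # not English
]
cleaned = clean_claims(sample_rows)
```

Rows whose fact-checking article cannot be scraped would be dropped in a later pass, once retrieval has been attempted.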
After this process, we obtained a final dataset of 60,000 claims along with their corresponding fact-checking articles.

Following the literature [17], we grouped the claims under two macro-categories: True and False. The former includes entries deemed accurate or of higher quality, while the latter contains content lacking a factual basis, such as fake quotes, conspiracy theories, misleading edited media, or ironic and exaggerated criticism. In addition, we assigned all entries rated as a mixture of true and false content to False. For example, labels like Geppetto mark, trustworthy, mostly true, and correct attribution were categorized as True, while labels such as false, mostly false, Quattro Pinocchio, and unproven or legend were classified as False. The resulting classes are highly unbalanced, with 90% of the claims labeled as False. This is reasonable since fact-checking activities usually focus on false claims [44].

Next, we sampled 50 claims (25 True and 25 False) for each year from 2013 to 2023. For 2024, we instead sampled 500 claims, only 30 of which were True, as this was the maximum number available for that year at the time of collection. We include observations from 2024 to better test the capabilities of LLMs at judging claims that are not available in their training data, i.e., because they were published after the training cutoff date of the model.

B. Approach

To evaluate LLMs as effective tools for fact-checking, we designed three distinct tasks aimed at thoroughly assessing their capabilities in this field and addressing the key questions outlined in Section .

1) Understanding the article-statement connection: This task evaluates how accurately an LLM can answer when provided with an article-claim pair whose members are either related (i.e., the article is fact-checking the claim) or not (i.e., a random fact-checking article is picked).
2) Providing an accurate verdict based on a fact-checking article: In this task, the model is required to state the verdict on a given claim based on the associated fact-checking article.

3) Fact-Checking the claim: In this task, the LLM evaluates the truthfulness of a given statement, with or without additional contextual information, using a neutral prompt (i.e., a prompt that does not steer the model towards a certain decision).

Each task evaluates distinct aspects of the models’ performance in fact-checking scenarios. We anticipate that the first two tasks will be relatively easier for the models, as they are required to answer queries based on manually provided contextual information. These tasks test the models’ capabilities using a range of prompts, as described below. The third task instead simulates a real-world scenario where the model must retrieve relevant information to verify a given claim. To achieve this, we incorporate the ReAct framework [28], enabling models to access the internet for gathering information useful in fact-checking.

In all cases, we provide the models with two basic pieces of metadata related to the claim: the publication date and, when available (∼70% of the cases), the claim’s author. Including the date allows the model to contextualize the claim, as the validity of a statement may change over time with the emergence of new information. Knowing the claim’s author can offer insights into its credibility, as claims from unreliable sources are more likely to be false [45]. Also, fact-checking agencies such as AFP and PolitiFact highlight the importance of both the claimant’s identity and the claim’s timing in assessing its veracity.

C. Prompt Engineering

In this section, we provide a comprehensive overview of the 24 prompts evaluated in our experiments, which we categorize into three approaches: Zero-Shot, Few-Shot, and Chain-of-Thought.
Zero-Shot prompts require the model to generate responses without any prior examples, relying solely on its pre-trained knowledge [46]. Few-Shot [47] prompts provide a small number of examples to guide the model toward the desired output style and reasoning process. Chain-of-Thought [48] prompts explicitly break down the reasoning process into intermediate steps, encouraging the model to generate more structured and explainable responses. We construct these prompts by combining different modules, as detailed next.

1) Basic prompt modules: Here we detail a few basic components that are included in all prompts, in the order they appear as shown in Figure 1 and provided in the Appendix.

The Role prompt module primes the LLM to act as a fact-checker, aligning its reasoning with fact-checking principles. This framing encourages a more critical approach to evaluating claims and, as noted by [49], can improve performance.

The Task prompt module requires the LLM to generate an explanation prior to providing an answer. This methodology is supported by literature indicating that additional tokens enhance the reasoning process of models [50]. After generating the explanation, the model assigns a score (0-100) rather than a definitive verdict. This scoring mechanism allows for flexible adjustment of the model’s skepticism, as the acceptance threshold for responses can be modified. We also decided to implement a scoring mechanism instead of relying on labels because, during the setup of our experiments, we empirically found that instructing the LLM to return a factual label often led to inconsistent outputs. This approach is particularly relevant in our context, where understanding and verifying the reasoning behind a claim is crucial for ensuring the accuracy and trustworthiness of the verification process.
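The scoring mechanism above can be illustrated with a short sketch that turns a model reply into a label. It relies on details the paper gives later, in the descriptions of the JSON prompt module and of result extraction: replies are JSON, single quotes are converted to double quotes, scores of 50 or lower map to False, higher scores map to True, and out-of-range or unparseable values become None. The key names `score` and `explanation`, and both function names, are assumptions made for illustration.

```python
import json
import re
from typing import Optional


def extract_score(raw_reply: str) -> Optional[float]:
    """Return the numeric score from a JSON-like reply, or None on failure."""
    # Naive repair: turn single quotes into double quotes. This breaks on
    # apostrophes inside the explanation text, in which case parsing fails
    # and the reply is recorded as None.
    repaired = re.sub(r"'", '"', raw_reply)
    try:
        payload = json.loads(repaired)
    except json.JSONDecodeError:
        return None
    score = payload.get("score") if isinstance(payload, dict) else None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    return float(score)


def score_to_label(score: Optional[float], threshold: float = 50.0) -> Optional[str]:
    """Map a 0-100 score to 'True'/'False'; None if missing or out of range."""
    if score is None or not 0 <= score <= 100:
        return None
    return "True" if score > threshold else "False"


reply = "{'explanation': 'The article confirms the claim.', 'score': 72}"
label = score_to_label(extract_score(reply))
```

Raising `threshold` makes the pipeline more skeptical without re-querying the model, which is the flexibility the score-based design is meant to provide.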
We employ a JSON prompt module that provides a structured format for extracting expected responses, facilitating consistent and reliable evaluation across the various tasks. Adopting a similar approach to other work [50], [51], we developed a prompt that enables the LLMs to produce structured JSON outputs. Although these methods may occasionally result in null outputs (approximately 1% of the results), they are highly advantageous, as they guarantee the presence of a result at the specified position.

The Final prompt module serves as a brief reminder to the model regarding how it should structure its responses, including the required format and the necessity of always providing a score. This prompt is significant given that the articles being analyzed can be lengthy. Smaller models, in particular, struggle to adhere to the specified format when processing long texts. By including this reminder at the end, we significantly improve the model’s precision in generating responses. Hence, this structured approach ensures that the model remains focused and compliant with the expected output format.

2) Prompt modules for Tasks 1 and 2: We consider three different approaches for the first two tasks:

• Zero Shot (ZS): This prompt module requires the model to generate responses based solely on its preexisting knowledge and understanding. It does not require examples and it relies on indirect information for prediction. However, it heavily depends on the quality of semantic information and may struggle with highly diverse categories [52].

• Few Shot (FS): Unlike the ZS approach, the FS prompt module provides the model with several examples before responding. In this process, we provide one example per class to avoid the limitations of one-shot prompting, which relies on only one example. However, there is a risk of overfitting the small dataset, and the model’s performance can be susceptible to the quality and diversity of the examples provided. To minimize these problems, the instances are randomly selected in each iteration. Also, to avoid confusion, especially for smaller models, we encapsulated each sample within square brackets to emphasize that these represent distinct entities, leading to more precise and clearly defined responses from the model. The example articles were truncated to fit inside the context windows. This method assesses the model’s ability to generalize from a limited set of examples [52].

• Chain of Thought (CoT): Recent research has shown that adding a chain-of-thought paradigm can significantly boost the performance of language models across various tasks [53], [54]. In this approach, we include the phrase “Let’s think step-by-step” after the examples, recognizing that language models can act as zero-shot reasoners [55]. This technique aims to enhance reasoning skills and achieve better results.

Furthermore, we added the following approaches to obtain more sophisticated prompts:

• Enriched criteria: This approach enriches the prompt by providing a more explicit rationale behind labels. For the False category, we specify that it includes misinformation types such as fake quotes and conspiracy theories. Conversely, the True category encompasses content that demonstrates factual accuracy or contains higher-quality information. As suggested by previous studies, this enhancement is expected to lead to higher accuracy in ambiguous cases. In our context, it should assist the model in accurately labeling mixed verdicts, such as mixture [35].

• Self-Reflection: Once we receive the model’s answer, we feed the LLM with the previous prompt and its generated response and ask it to reflect on its answer. This allows the model to reason and think about its answer before providing a new response. By encouraging this reflective process, we aim to reduce instances of hallucination, where the model might generate inaccurate or fabricated information. This approach fosters critical thinking and should enhance the reliability of the model’s outputs [56].

• Summary: We also investigated whether providing a summary focused on the claim rather than an entire article could enhance the effectiveness of our approach. This strategy distills relevant information, optimizing the context window’s use. This prompt depends heavily on the large language model’s ability to generate a coherent and accurate summary.

The last two methods (Self-Reflection and Summary) are more costly because they involve calling the LLM twice and require processing longer texts compared to the first method (Enriched criteria). In contrast, the first method involves only a minimal increase in tokens, making it far more cost-efficient.

Fig. 1. Example of prompt structure with optional configurations. The black text represents the main prompt, shared across all tasks and configurations. Red placeholders indicate where the actual article and statement are inserted for each task. Blue components correspond to optional prompt modules, which are integrated into the main prompt (at the indicated position) based on the specific combination being tested. For instance, a prompt incorporating both Enrich and Chain-of-Thought will consist of the black main prompt, with the blue Enrich component placed between the Role and Task sub-prompts, and the blue Chain-of-Thought component appended at the end.

3) ReAct for Task 3: Our fact-checking methodology incorporates the ReAct framework [28] to enhance precision and efficiency in information retrieval. This method can be broken into two pieces: Reasoning (Re) and Action (Act). First, the model uses its internal knowledge to reflect and think about the user’s input, and then it identifies the steps required to solve the problem. After Reasoning, the model starts to Act following the steps identified before; in our case, it acts using an external tool to retrieve information from a source.
This framework allows the model to perform this series of Reasoning and Action multiple times, but we limited our experiment to only one iteration. The model could also conclude that it does not need to search online to produce an optimal result and, after just one reasoning step, return the answer; otherwise, it would perform one search before answering. In this framework, the LLM operates using a structured conversational ReAct JSON prompt, ensuring that each response adheres to a predefined format. The JSON responses can take one of two forms: “Final answer,” which includes the final score and explanation, or “Action,” which enables the model to perform an online search. When performing a search, the agent autonomously generates a query and calls the search tool, providing the query as metadata. After the tool retrieves relevant information, the LLM is called again to produce a final answer, incorporating the newly obtained knowledge.

In particular, we utilize two information sources: Wikipedia and Google. For Wikipedia, we leverage its API by developing a tool that, given a query generated by the LLM, returns the top 3 matches. Each match includes the title, link, and a snippet. These results are used to retrieve the snippets, while for complete content retrieval, we scrape the corresponding URLs, limiting the data to a certain length to ensure it fits within the context window. Finally, to generate summaries, we feed the scraped pages back into the model and request a summary for each page, resulting in three separate model calls for summarization.

Meanwhile, for Google, we simulate an actual user experience using the Selenium automation tool. We apply a stricter criterion, filtering results based on a time range of up to one week before the claim’s assertion date. Using Selenium, we systematically scan the first 20 search results and extract the top three relevant links. As with Wikipedia, we use the results to gather snippets.
For full content, we scrape the URLs, and in the case of summaries, we prompt the model to generate summaries from the scraped pages.

Following this process, we conduct experiments across three different settings:

1) Snippets: In this setting, we provide only short snippets of information. This approach allows for a quick, surface-level analysis and is highly cost-effective.

2) Full Article: Here, we present the complete articles retrieved from the search results. This setting enables a more in-depth content analysis but may risk overloading the model’s context window.

3) Summary: In this approach, we supply LLM-generated summaries based on the articles collected in the previous setting. These summaries should distill the essential information, facilitating efficient analysis and comparison while avoiding the context window limitation. However, this method is more expensive as it requires invoking the LLM twice.

In this task, the number of operations, the complexity, the strict formatting requirements, and the use of external tools make it more challenging for the model to follow the format consistently. As a result, we observed that approximately 2% of the outputs were invalid. All the prompts used in this study are detailed in the Appendix.

D. Experimental settings

In our study, we employed four models: two small and two large ones. Specifically, we utilized the Mistral family models, including Mistral-7B-Instruct-v0.3 and Mixtral-8x7B-Instruct-v0.1, alongside LLaMA models, which included Meta-Llama-3-8B-Instruct and Meta-Llama-3-70B-Instruct.

For interaction with these models, we rely on the Hugging Face API (footnote 8). We established fixed parameters to ensure reproducibility across all tasks and tested models. We set the temperature to a low value of 0.1 to enhance the consistency of our experiments, while the number of new tokens was fixed at 256.
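The count of 24 prompts for Tasks 1 and 2 is consistent with crossing the three base approaches (Zero-Shot, Few-Shot, Chain-of-Thought) with the three optional enhancements (Enriched criteria, Self-Reflection, Summary), each switched on or off: 3 × 2³ = 24. The text does not spell out the grid in this form, so the enumeration below is an assumption about how the settings may be combined, and all identifiers are hypothetical.

```python
from itertools import product
from typing import Dict, List, Union

BASE_APPROACHES = ("zero_shot", "few_shot", "chain_of_thought")
OPTIONAL_MODULES = ("enriched_criteria", "self_reflection", "summary")

Setting = Dict[str, Union[str, bool]]


def enumerate_prompt_settings() -> List[Setting]:
    """List every base approach crossed with every on/off choice of modules."""
    settings: List[Setting] = []
    on_off_choices = product((False, True), repeat=len(OPTIONAL_MODULES))
    for approach, flags in product(BASE_APPROACHES, on_off_choices):
        setting: Setting = {"approach": approach}
        setting.update(zip(OPTIONAL_MODULES, flags))
        settings.append(setting)
    return settings


all_settings = enumerate_prompt_settings()
```

Under this reading, every dataset entry would be run once per setting, which is why each entry is expensive to evaluate.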
To perform the experiments for Task 1 and Task 2, we structured a framework with placeholders, as presented in subsection III-C. We experimented with all possible combinations of the different prompt techniques for each dataset entry. For the Summary enhancement, we provided a summary instead of feeding the full article to the model. As mentioned before, we asked the model to generate a separate response, following the steps outlined earlier. These procedures were repeated for each model across the first two tasks.

1) Task 1: Understanding the article-statement connection: Our experiments involved feeding our dataset into the models. Each entry was resource-intensive due to its 24 settings. To ensure a balanced dataset, we divided the entries into two groups: one paired with the correct claim and the other with a random, unrelated claim. This allowed us to create a dataset with an equal number of explained and unexplained entries.

Footnote 8: https://huggingface.co/blog/inference-pro#supported-models

2) Task 2: Providing an accurate verdict based on the article's knowledge: Similarly to the previous task, this experiment involved a dataset of 1,000 samples. We provided the LLMs with the article verifying the claim and asked for their verdict, which was explicitly contained within the article. Our primary objectives were to verify the accuracy of our label mapping and to assess whether the LLMs could effectively extract critical information from the text. This task was instrumental in refining our label mapping.

3) Task 3: Fact-Checking the claim: In this task, we used the complete dataset. Unlike the previous tasks, we provided the model with claims along with their corresponding metadata and allowed it to explore freely in three settings: Wikipedia, Google, or no context. As described in previous sections, we employed a ReAct agent that required the LLM to perform queries and interact with external resources, specifically the Wikipedia API and the Selenium tool.
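The single-iteration ReAct behaviour used in Task 3 (one reasoning step, at most one search, then a final answer) can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' LangChain agent: `llm` and `search` are hypothetical callables standing in for the model and the search tool, and the JSON keys `action` and `action_input` are assumed names for the two response forms described earlier.

```python
import json
from typing import Callable, Optional

FINAL = "Final answer"  # label of the terminal response form


def run_single_iteration(claim: str,
                         llm: Callable[[str], str],
                         search: Callable[[str], str]) -> Optional[dict]:
    """Run at most one search before requiring a final answer.

    Returns the final answer payload (score and explanation), or None when
    the model still asks for an action after the single allowed search.
    """
    first = json.loads(llm(f"Claim: {claim}"))
    if first["action"] == FINAL:
        # The model decided it needs no external evidence.
        return first["action_input"]

    # The model requested a search: the query is whatever it generated.
    observation = search(first["action_input"])
    second = json.loads(llm(f"Claim: {claim}\nObservation: {observation}"))
    if second["action"] == FINAL:
        return second["action_input"]
    return None  # would be counted as an invalid output
```

Passing the model and the tool as plain callables keeps the control flow testable without network access; a stub that returns an "Action" first and a "Final answer" second exercises both branches.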
We developed an agent using the LangChain library (footnote 9) to implement this functionality, creating two specific tools: one for Wikipedia and another for Google. By manipulating the search URL, we filtered out results published outside the time range applied to the Google search (up to one week before the claim's assertion date). This approach aimed to avoid articles that explicitly fact-check the claim.

We deployed our models as autonomous agents by combining LangChain with its Hugging Face integration. To ensure objectivity in the responses, we employed a zero-shot prompt while assigning the model the role of a fact-checker. We evaluated three scenarios by providing the model with different input formats for each source of information: the snippet, the full article, or a summary generated by the LLM.

E. Extraction of results

While extracting results from the model, we encountered situations where the model returned multiple answers for a single query. In such cases, we selected the last complete answer, which was often the most comprehensive. To ensure the extracted data complied with the JSON format, we addressed formatting issues, specifically converting single quotes (') to double quotes ("). This adjustment was made using a regular expression. Once the formatting was corrected, we used the JSON structure to extract the relevant scores and explanations. This data was then compiled into a resulting dataset and subsequently fed into another program for score computation. We recorded None entries in the dataset when the extraction failed or yielded no valid response.

We fixed the threshold for a positive label at a score of 50. Scores of 50 or lower were categorized as False, while those greater than 50 were categorized as True. Any values outside the range of 0 to 100 were classified as None.

Footnote 9: https://www.langchain.com

F. Baseline

To compare our models, we fine-tuned a RoBERTa model to perform binary classification on article-claim pairs (Task 1 and Task 2) or claim-only embeddings (Task 3) using our dataset. This dataset consists of a large number of claims, each labeled as True or False, and originally paired with fact-checking articles. However, given that RoBERTa can only process sequences of up to 512 tokens, we decided to reduce the articles' length, which can exceed 20,000 characters.

For preprocessing, we tokenized the claims using the RoBERTa tokenizer with a maximum sequence length of 512 tokens. We trained the model on this dataset using a cross-entropy loss function and an AdamW optimizer, fine-tuning for three epochs. The dataset was split into training and validation sets, and model performance was evaluated using accuracy and classification metrics.

Our results indicate that the fine-tuned RoBERTa performs well in distinguishing between true and false claims. These results are included as baselines in the main graphs.

G. Evaluation metrics

To evaluate our models' performance comprehensively, we employ five distinct metrics: Recall, Precision, Accuracy, F1 score, and ROC AUC score.

• Recall (Sensitivity/True Positive Rate): Measures the proportion of actual positives that were correctly identified.
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
• Precision (Positive Predictive Value): Measures the proportion of predicted positives that are truly positive.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
• Accuracy: The overall proportion of correct predictions, both true positives and true negatives.
$$\mathrm{Accuracy} = \frac{TP + TN}{\text{Total Population}}$$
• F1 Score: The harmonic mean of precision and recall. It balances the two, which is especially useful when the class distribution is uneven, as in our case.
$$\mathrm{F1\ Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
• ROC AUC Score (Area Under the Curve): A numerical value representing the area under the ROC curve. It summarizes the model's ability to distinguish between classes.
A score of 1.0 indicates perfect classification, while 0.5 indicates random guessing.

These metrics collectively offer a detailed understanding of the models' efficacy. In addition, we compute recall, precision, and F1 score separately for each class, thereby distinguishing the model's handling of true positives/negatives from false positives/negatives. This differentiation is crucial for identifying any difficulties the model may encounter when processing different cases.

In fact-checking, disaggregating the evaluation metrics by class serves a particularly critical function. This distinction is important because the implications of misclassification can vary considerably depending on whether a true statement is incorrectly labeled as false or a false statement is incorrectly labeled as true. It also helps identify potential biases within the model, such as a predisposition towards skepticism or credulity. Furthermore, this strategy aligns closely with the practical priorities of fact-checking, where the relative importance of avoiding false positives versus false negatives can vary depending on the context.

IV. RESULTS

We present results for Llama3 8B, Llama3 70B, and Mixtral 8x7B, while the results for Mistral 7B have been relegated to the Appendix due to the exceptionally high number of faulty responses generated by this model. Interestingly, this issue was further exacerbated when a token limit was applied to the answer prompt, whereas the same modification reduced the occurrence of faulty responses in the other models.

Task 1: Understanding the article-statement connection

In Fig. 2, we show that all models outperform the fine-tuned RoBERTa, which exhibits particularly low F1 scores, especially for the negative class. The only exceptions are certain prompts for both Mixtral 8x7B and Llama3 8B.
Among the three models, Llama3 70B consistently delivers the strongest performance, reaching an F1 score of approximately 0.9, depending on the prompt configuration. In contrast, Llama3 8B and Mixtral 8x7B produce lower scores, with Mixtral 8x7B emerging as the weakest model, averaging 0.65 (Fig. 3) and occasionally dropping below the random threshold of 0.5.

As shown in Fig. 3, and consistent with findings in the literature [57], no single prompt format stands out as the best across all models. For Llama3 8B, providing the full article without relying on more complex prompting strategies yields the highest F1 scores across all prompt configurations. Meanwhile, Llama3 70B and Mixtral 8x7B achieve their best results when enriched prompts (those with more detailed class definitions) are incorporated.

Figure 4 highlights an approximately linear relationship between the F1 scores for the positive and negative classes, suggesting that the models maintain a balanced performance on this task. However, there is a slight tendency for less effective models, such as Llama3 8B and Mixtral 8x7B, to perform better on the negative class, possibly due to their smaller number of parameters.

Finally, Llama3 70B not only outperforms the other models in terms of F1 score but also generates the fewest faulty responses (i.e., instances where the model fails to provide a score or adhere to the expected output format), as illustrated in Fig. 5. In particular, it exhibits the lowest median percentage of faults: 0.003 compared to Llama3 8B's 0.241 and Mixtral 8x7B's 0.393.
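The fault rates reported here depend on the extraction procedure of Section III-E. A minimal sketch of that procedure is given below, under several assumptions: the JSON keys `score` and `explanation`, the helper names `extract_label` and `fault_rate`, and the pattern used to find candidate answers are all hypothetical, since the text only states that a regular expression converts single quotes to double quotes and that the last complete answer is kept.

```python
import json
import re
from typing import List, Optional


def extract_label(raw: str) -> Optional[bool]:
    """Map a raw model response to True, False, or None (faulty)."""
    # Candidate answers: flat {...} spans, in order of appearance.
    candidates = re.findall(r"\{[^{}]*\}", raw)
    for text in reversed(candidates):      # prefer the last complete answer
        fixed = re.sub(r"'", '"', text)    # single quotes -> double quotes
        try:
            score = float(json.loads(fixed)["score"])
        except (ValueError, KeyError, TypeError):
            continue                       # not a complete answer, try an earlier one
        if not 0 <= score <= 100:
            return None                    # out-of-range scores are invalid
        return score > 50                  # 50 or lower -> False, above 50 -> True
    return None                            # nothing could be extracted


def fault_rate(labels: List[Optional[bool]]) -> float:
    """Fraction of responses for which no valid label was extracted."""
    return sum(label is None for label in labels) / len(labels)
```

Note that the blanket quote replacement also rewrites apostrophes inside an explanation (for example "it's"), which makes that candidate unparseable; in this sketch such a candidate is skipped rather than repaired.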
[Figure 2 plot. Groups: Llama3 8B, Llama3 70B, Mixtral 8x7B; axis: F1 score (0.0 to 1.0); legend: Positive, Negative, RoBERTa positive, RoBERTa negative.]
Fig. 2. Task 1: Models' F1 scores computed for both classes. Fine-tuned RoBERTa is used as a reference baseline.

[Figure 3 plot. Panels: Llama3 8B, Llama3 70B, Mixtral 8x7B; categories: Average, Chain-of-Thought, Few-shot, Zero-shot; axis: 0.0 to 1.0.]
Fig. 3. Task 1: For each model and each prompt category (Zero-Shot, Few-Shot, and Chain-of-Thought), the best prompt's F1 score is reported. The average F1 score for each model is also reported with 0.95 confidence intervals.

[Figure 4 plot. Axes: F1 positive class and F1 negative class (both 0.0 to 1.0); series: Llama3 8B, Llama3 70B, Mixtral 8x7B.]
Fig. 4. Task 1: Comparison between the F1 scores obtained by each model for each prompt variation with respect to both classes. The bisector is reported as a reference.
[Figure 5 plot. Groups: Llama3 8B, Llama3 70B, Mixtral 8x7B; axis: % of Faults (0.0 to 1.0).]
Fig. 5. Task 1: Percentage of faults for each model. The median faults percentage for each model is: 0.241 (Llama3 8B), 0.003 (Llama3 70B), 0.393 (Mixtral 8x7B).

Task 2: Providing an accurate verdict based on a fact-checking article

In this task, the performance gap between the positive and negative classes is more pronounced, affecting not only the three evaluated models but also the fine-tuned RoBERTa. This discrepancy is primarily due to the class imbalance introduced by the 2024 samples in the dataset, which contain 470 false claims but only 30 true claims. This distribution contrasts with earlier years (2003–2013), where the dataset is balanced.

Focusing on the results for the negative class, all three models perform relatively well compared to the baseline set by the fine-tuned RoBERTa, although none fully match its performance. Among them, Llama3 70B comes closest, achieving an F1 score of approximately 0.9, compared to RoBERTa's 0.95.

The performance gap widens even further for the positive class, which contains fewer samples. This is evident both in comparison to RoBERTa's baseline and relative to the other models.
Llama3 70B remains the strongest performer, consistently achieving scores above 0.6 across all configurations and coming closest to RoBERTa's baseline score.

As observed in Task 1, no single prompting strategy proves to be the most effective across all models. Additionally, the rate of faulty responses increases for both Llama3 8B and Llama3 70B, with the latter reaching nearly 30% for some prompts. Meanwhile, Mixtral 8x7B maintains a consistently high fault rate, similar to its performance in Task 1.

[Figure 6 plot. Groups: Llama3 8B, Llama3 70B, Mixtral 8x7B; axis: F1 score (0.0 to 1.0); legend: Positive, Negative, RoBERTa positive, RoBERTa negative.]
Fig. 6. Task 2: Models' F1 scores computed for both classes. Fine-tuned RoBERTa is used as a reference baseline.
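To see why the 2024 imbalance (470 false claims, 30 true claims) separates the per-class F1 scores so sharply, a small worked example helps. The sketch below applies the precision, recall, and F1 definitions from Section III-G to each class in turn; the predictions are made up purely for illustration and are not results from the paper, and `per_class_prf` is a hypothetical helper name.

```python
from typing import List, Tuple


def per_class_prf(y_true: List[bool], y_pred: List[bool],
                  positive: bool) -> Tuple[float, float, float]:
    """Precision, recall, and F1 when `positive` is treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Class sizes from the text: 470 false claims and 30 true claims.
y_true = [False] * 470 + [True] * 30
# Illustrative predictions (assumed): 450 of the false claims and 15 of the
# true claims are classified correctly.
y_pred = [False] * 450 + [True] * 20 + [True] * 15 + [False] * 15

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
_, _, f1_true = per_class_prf(y_true, y_pred, positive=True)
_, _, f1_false = per_class_prf(y_true, y_pred, positive=False)
print(accuracy, round(f1_false, 3), round(f1_true, 3))
```

With these assumed predictions the accuracy is 0.93 and the F1 score for false claims is about 0.96, while the F1 score for true claims is only about 0.46. The 20 misclassified false claims are a small share of their own class but outnumber the 15 correctly identified true claims, which is why the minority class's precision collapses.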
[Figure 7 plot. Panels: Llama3 8B, Llama3 70B, Mixtral 8x7B; categories: Average, Chain-of-Thought, Few-shot, Zero-shot; axis: 0.0 to 1.0.]
Fig. 7. Task 2: For each model and each prompt category (Zero-Shot, Few-Shot, and Chain-of-Thought), the best prompt's F1 score is reported. The average F1 score for each model is also reported with 0.95 confidence intervals.

[Figure 8 plot. Groups: Llama3 8B, Llama3 70B, Mixtral 8x7B; axis: % of Faults (0.0 to 1.0).]
Fig. 8. Task 2: The percentage of faults for each model. The median faults percentage for each model is: 0.392 (Llama3 8B), 0.066 (Llama3 70B), 0.370 (Mixtral 8x7B).

Task 3: Fact-Checking the claim

Figure 9 shows that, under the current experimental settings, LLMs achieve high F1 scores for the negative class. However, they are consistently outperformed by the RoBERTa baseline. This suggests that while advanced models can sometimes perform well in classifying claims as True or False, simpler approaches may consistently yield better results.

A notable trend in the results is the widening gap between the F1 scores of the positive and negative classes. Specifically, the highest F1 score achieved for the positive class is lower than the lowest F1 score observed for the negative class.
This discrepancy highlights a significant challenge in the classification task. As was also the case in Task 2, this phenomenon can be attributed to dataset imbalance, which skews the model's ability to correctly predict both classes with equal proficiency.

Further analysis reveals an intriguing temporal pattern when breaking down the performance by claim publication date. As depicted in Figure 10, models exhibit substantially better performance on the positive class for claims that were published before 2024. Conversely, Figure 11 shows the opposite trend for the negative class: claims originating from 2024 tend to yield the highest performance. This suggests that temporal factors, possibly related to shifts in linguistic patterns, dataset composition, or the evolving nature of factual claims, may be influencing the model's ability to distinguish between true and false claims.

Unlike the previous tasks, one striking difference is the relatively low rate of faulty responses across all three models, as illustrated in Figure 12. This suggests that, at least within the scope of this particular task, the models are more stable and less prone to generating erroneous classifications compared to their performance in earlier experiments.

Another interesting observation emerges when analyzing the impact of external content sources on model performance. Specifically, as shown in Figure 13, when models are supplemented with information extracted from Google and Wikipedia, their classification accuracy improves. However, a crucial factor in this improvement appears to be the format in which the external information is presented. The results indicate that models perform significantly better when provided with a structured summary of the search results, rather than being fed entire web pages or isolated snippets. This suggests that concise, well-organized contextual information is more beneficial for improving model accuracy than raw, unstructured text.
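The three context formats compared here (snippets, full articles, and summaries) could be assembled along the following lines. This is a sketch under stated assumptions rather than the authors' implementation: the result fields `snippet` and `page_text`, the `summarize` callable standing in for the extra LLM call, and the `max_chars` truncation limit are all hypothetical.

```python
from typing import Callable, Dict, List


def build_context(results: List[Dict[str, str]],
                  mode: str,
                  summarize: Callable[[str], str],
                  max_chars: int = 4000) -> str:
    """Assemble the evidence passed to the model for one claim."""
    top = results[:3]  # the text keeps the top three matches per source
    if mode == "snippets":
        parts = [r["snippet"] for r in top]
    elif mode == "full":
        # Truncate each scraped page so the prompt fits the context window.
        parts = [r["page_text"][:max_chars] for r in top]
    elif mode == "summary":
        # One additional model call per page, i.e. three calls in total.
        parts = [summarize(r["page_text"][:max_chars]) for r in top]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return "\n\n".join(parts)
```

The summary mode trades extra model calls for a shorter prompt, which matches the cost and context-window trade-off described for the three settings.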
/uni0000002f/uni0000004f/uni00000044/uni00000050/uni00000044/uni00000016/uni00000003/uni0000001b/uni00000025 /uni0000002f/uni0000004f/uni00000044/uni00000050/uni00000044/uni00000016/uni00000003/uni0000001a/uni00000013/uni00000025 /uni00000030/uni0000004c/uni0000005b/uni00000057/uni00000055/uni00000044/uni0000004f/uni00000003/uni0000001b/uni0000005b/uni0000001a/uni00000025/uni00000013/uni00000011/uni00000013/uni00000013/uni00000011/uni00000015/uni00000013/uni00000011/uni00000017/uni00000013/uni00000011/uni00000019/uni00000013/uni00000011/uni0000001b/uni00000014/uni00000011/uni00000013/uni00000029/uni00000014/uni00000003/uni00000056/uni00000046/uni00000052/uni00000055/uni00000048 /uni00000033/uni00000052/uni00000056/uni0000004c/uni00000057/uni0000004c/uni00000059/uni00000048 /uni00000031/uni00000048/uni0000004a/uni00000044/uni00000057/uni0000004c/uni00000059/uni00000048 /uni00000035/uni00000052/uni00000025/uni00000028/uni00000035/uni00000037/uni00000044/uni00000003/uni00000053/uni00000052/uni00000056/uni0000004c/uni00000057/uni0000004c/uni00000059/uni00000048 /uni00000035/uni00000052/uni00000025/uni00000028/uni00000035/uni00000037/uni00000044/uni00000003/uni00000051/uni00000048/uni0000004a/uni00000044/uni00000057/uni0000004c/uni00000059/uni00000048 Fig. 9. Task 3: Models’ F1 scores computed for both classes. Fine-tuned RoBERTa is used as a reference baseline. 
/uni0000002f/uni0000004f/uni00000044/uni00000050/uni00000044/uni00000016/uni00000003/uni0000001b/uni00000025 /uni0000002f/uni0000004f/uni00000044/uni00000050/uni00000044/uni00000016/uni00000003/uni0000001a/uni00000013/uni00000025 /uni00000030/uni0000004c/uni0000005b/uni00000057/uni00000055/uni00000044/uni0000004f/uni00000003/uni0000001b/uni0000005b/uni0000001a/uni00000025/uni00000013/uni00000011/uni00000013/uni00000013/uni00000011/uni00000015/uni00000013/uni00000011/uni00000017/uni00000013/uni00000011/uni00000019/uni00000013/uni00000011/uni0000001b/uni00000014/uni00000011/uni00000013/uni00000029/uni00000014/uni00000003/uni00000056/uni00000046/uni00000052/uni00000055/uni00000048 /uni00000033/uni00000055/uni00000048/uni00000003/uni00000015/uni00000013/uni00000015/uni00000017 /uni00000015/uni00000013/uni00000015/uni00000017 /uni00000035/uni00000052/uni00000025/uni00000028/uni00000035/uni00000037/uni00000044/uni00000003/uni00000053/uni00000055/uni00000048/uni00000015/uni00000013/uni00000015/uni00000017 /uni00000035/uni00000052/uni00000025/uni00000028/uni00000035/uni00000037/uni00000044/uni00000003/uni00000015/uni00000013/uni00000015/uni00000017 Fig. 10. Task 3: Models’ F1 scores computed for the positive class distin- guishing between claims dated before 2024 (thus possibly included in the models’ training dataset) and from 2024. Fine-tuned RoBERTa is used as a reference baseline. 
Page 10: 10 /uni0000002f/uni0000004f/uni00000044/uni00000050/uni00000044/uni00000016/uni00000003/uni0000001b/uni00000025 /uni0000002f/uni0000004f/uni00000044/uni00000050/uni00000044/uni00000016/uni00000003/uni0000001a/uni00000013/uni00000025 /uni00000030/uni0000004c/uni0000005b/uni00000057/uni00000055/uni00000044/uni0000004f/uni00000003/uni0000001b/uni0000005b/uni0000001a/uni00000025/uni00000013/uni00000011/uni00000013/uni00000013/uni00000011/uni00000015/uni00000013/uni00000011/uni00000017/uni00000013/uni00000011/uni00000019/uni00000013/uni00000011/uni0000001b/uni00000014/uni00000011/uni00000013/uni00000029/uni00000014/uni00000003/uni00000056/uni00000046/uni00000052/uni00000055/uni00000048 /uni00000033/uni00000055/uni00000048/uni00000003/uni00000015/uni00000013/uni00000015/uni00000017 /uni00000015/uni00000013/uni00000015/uni00000017 /uni00000035/uni00000052/uni00000025/uni00000028/uni00000035/uni00000037/uni00000044/uni00000003/uni00000053/uni00000055/uni00000048/uni00000015/uni00000013/uni00000015/uni00000017 /uni00000035/uni00000052/uni00000025/uni00000028/uni00000035/uni00000037/uni00000044/uni00000003/uni00000015/uni00000013/uni00000015/uni00000017 Fig. 11. Task 3: Models’ F1 scores computed for the negative class distinguishing between claims dated before 2024 (thus possibly included in the models’ training dataset) and from 2024. Fine-tuned RoBERTa is used as a reference baseline. 
[Figure: models Llama3 8B, Llama3 70B, Mixtral 8x7B; axis: % of Faults (0.0 to 1.0)]
Fig. 12. Task 3: The percentage of faults for each model. The median faults percentage for each model is: 0.02 (Llama3 8B), 0.024 (Llama3 70B), 0.051 (Mixtral 8x7B).
[Figure: one panel each for Llama3 8B, Llama3 70B, and Mixtral 8x7B; categories: Average, Wikipedia, Google, Zero-shot; value axis from 0.0 to 1.0]
Fig. 13.
Task 3: For each model, for each prompt category (Zero-Shot, with Google contextual information, and with Wikipedia contextual information) the best prompt’s F1 score is reported. The average F1 score for each model is also reported with 0.95 confidence intervals.

V. CONCLUSION

Our study builds upon and expands existing research on the role of LLMs in fact-checking, offering new insights into their capabilities and limitations.

To systematically evaluate LLMs in this context, we designed our study around three key tasks. The first task assessed the models’ ability to recognize the semantic relationship between a claim and an article. The second task measured their performance in verifying a claim’s truthfulness when provided with a related fact-checking article. Finally, the third task explored the zero-shot fact-checking abilities of LLMs under varying levels of external knowledge support.

Our findings from the first task indicate that LLMs effectively identify the semantic connection between claims and articles, outperforming fine-tuned Small Language Models in this regard. This suggests a strong potential for LLMs in assisting fact-checking efforts. Additionally, we observed that prompt design plays a crucial role in model performance, highlighting the need for careful prompt engineering in fact-checking applications.

In the second task, we found that larger LLMs can accurately determine the veracity of claims when provided with a fact-checking article, particularly when evaluating fake news. However, in line with previous work [17], their performance declines significantly when verifying true news, where simpler fine-tuned Small Language Models substantially outperform them.

Similarly, in the third task, all three LLMs underperformed compared to a fine-tuned RoBERTa model when asked to assess claim veracity, regardless of whether additional external knowledge (sourced from Google or Wikipedia) was incorporated.
These results align with prior findings [39], which highlight the superior performance of fine-tuned Small Language Models over LLMs in fake news detection. This reinforces the idea that, while LLMs can be valuable tools in fact-checking, they are not yet reliable enough to fully automate the process.

Contrary to expectations from previous studies, introducing external knowledge did not enhance LLM performance. A possible explanation for this, as suggested by [42], is that fake and factual news often exhibit distinct writing styles, which may significantly hinder LLMs’ ability to differentiate between them.

Overall, our study highlights both the promise and the challenges of using LLMs for fact-checking, emphasizing the need for further advancements before they can serve as a standalone solution.

ACKNOWLEDGMENTS

The authors are thankful to Sofia Mongardi for her support in the design of this manuscript. The work in this paper was originally submitted as a Master Thesis titled “Evaluating the Effectiveness of Open Large Language Models in Fact-checking Claims”, written by Enrico Zuccolotto and supervised by Prof. Francesco Pierri. This paper is supported by the PNRR-PE-AI FAIR project funded by the NextGeneration EU program.

REFERENCES

[1] M. Del Vicario, A. Bessi, F. Zollo, F. Petroni, A. Scala, G. Caldarelli, H. E. Stanley, and W. Quattrociocchi, “The spreading of misinformation online,” Proceedings of the National Academy of Sciences, vol. 113, no. 3, pp. 554–559, Jan. 2016. [Online]. Available: https://www.pnas.org/doi/full/10.1073/pnas.1517441113
[2] H. Allcott and M. Gentzkow, “Social media and fake news in the 2016 election,” Journal of Economic Perspectives, vol. 31, no. 2, pp. 211–36, May 2017. [Online]. Available: https://www.aeaweb.org/articles?id=10.1257/jep.31.2.211
[3] D. Bär, F. Pierri, G. De Francisci Morales, and S. Feuerriegel, “Systematic discrepancies in the delivery of political ads on facebook and instagram,” PNAS Nexus, vol.
3, no. 7, p. pgae247, Jul. 2024. [Online]. Available: https://doi.org/10.1093/pnasnexus/pgae247
[4] K. Fink, “The biggest challenge facing journalism: A lack of trust,” Journalism, vol. 20, no. 1, pp. 40–43, 2019. [Online]. Available: https://doi.org/10.1177/1464884918807069
[5] C. Spivak, “The fact-checking explosion: In a bitter political landscape marked by rampant allegations of questionable credibility, more and more news outlets are launching truth-squad operations,” American Journalism Review, vol. 32, no. 4, pp. 38–44, 2010.
[6] C. Lim, “Checking how fact-checkers check,” Research & Politics, vol. 5, no. 3, 2018.
[7] G. Warren, I. Shklovski, and I. Augenstein, “Show me the work: Fact-checkers’ requirements for explainable automated fact-checking,” Feb. 2025, arXiv:2502.09083 [cs]. [Online]. Available: http://arxiv.org/abs/2502.09083
[8] Z. Guo, M. Schlichtkrull, and A. Vlachos, “A survey on automated fact-checking,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 178–206, 2022.
[9] L. Hu, S. Wei, Z. Zhao, and B. Wu, “Deep learning for fake news detection: A comprehensive survey,” AI Open, vol. 3, pp. 133–155, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666651022000134
[10] Y. Wang, S. Qian, J. Hu, Q. Fang, and C. Xu, “Fake news detection via knowledge-driven multimodal graph convolutional networks,” in Proceedings of the 2020 International Conference on Multimedia Retrieval, ser. ICMR ’20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 540–547. [Online]. Available: https://doi.org/10.1145/3372278.3390713
[11] T. Lin, Y. Wang, X. Liu, and X. Qiu, “A survey of transformers,” AI Open, vol. 3, pp. 111–132, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666651022000146
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I.
Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[13] M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. B. Shaikh, N. Akhtar, J. Wu, S. Mirjalili et al., “A survey on large language models: Applications, challenges, limitations, and practical usage,” Authorea Preprints, 2023.
[14] N. Fontana, F. Pierri, and L. M. Aiello, “Nicer than humans: How do large language models behave in the prisoner’s dilemma?” Sep. 2024, arXiv:2406.13605. [Online]. Available: http://arxiv.org/abs/2406.13605
[15] E. Papageorgiou, C. Chronis, I. Varlamis, and Y. Himeur, “A survey on the use of large language models (llms) in fake news,” Future Internet, vol. 16, no. 8, p. 298, Aug. 2024. [Online]. Available: https://www.mdpi.com/1999-5903/16/8/298
[16] E. Sanu, T. K. Amudaa, P. Bhat, G. Dinesh, A. U. Kumar Chate, and R. K. P, “Limitations of large language models,” in 2024 8th International Conference on Computational System and Information Technology for Sustainable Solutions (CSITSS), Nov. 2024, pp. 1–6. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/10817070
[17] D. Quelle and A. Bovet, “The perils and promises of fact-checking with large language models,” Frontiers in Artificial Intelligence, vol. 7, Feb. 2024. [Online]. Available: http://dx.doi.org/10.3389/frai.2024.1341697
[18] F. Pierri and S. Ceri, “False news on social media: A data-driven survey,” SIGMOD Rec., vol. 48, no. 2, pp. 18–27, Dec. 2019. [Online]. Available: https://dl.acm.org/doi/10.1145/3377330.3377334
[19] J. A. Nasir, O. S. Khan, and I. Varlamis, “Fake news detection: A hybrid cnn-rnn based deep learning approach,” International Journal of Information Management Data Insights, vol. 1, no. 1, p. 100007, Apr. 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2667096820300070
[20] A. Mewada and R. K.
Dewang, “Cipf: Identifying fake profiles on social media using a cnn-based communal influence propagation framework,” Multimedia Tools and Applications, vol. 83, no. 10, pp. 29419–29454, Mar. 2024. [Online]. Available: https://doi.org/10.1007/s11042-023-16685-z
[21] J. Su, C. Cardie, and P. Nakov, “Adapting fake news detection to the era of large language models,” in Findings of the Association for Computational Linguistics: NAACL 2024, K. Duh, H. Gomez, and S. Bethard, Eds. Mexico City, Mexico: Association for Computational Linguistics, Jun. 2024, pp. 1473–1490. [Online]. Available: https://aclanthology.org/2024.findings-naacl.95/
[22] X. Zhang, J. Cao, X. Li, Q. Sheng, L. Zhong, and K. Shu, “Mining dual emotion for fake news detection,” in Proceedings of the Web Conference 2021, ser. WWW ’21. New York, NY, USA: Association for Computing Machinery, Jun. 2021, pp. 3465–3476. [Online]. Available: https://dl.acm.org/doi/10.1145/3442381.3450004
[23] X. Zhang and W. Gao, “Reinforcement retrieval leveraging fine-grained feedback for fact checking news claims with black-box llm,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue, Eds. Torino, Italia: ELRA and ICCL, May 2024, pp. 13861–13873. [Online]. Available: https://aclanthology.org/2024.lrec-main.1209/
[24] X. Zhang and W. Gao, “Towards llm-based fact verification on news claims with a hierarchical step-by-step prompting method,” in Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), J. C. Park, Y. Arase, B. Hu, W. Lu, D. Wijaya, A. Purwarianti, and A. A. Krisnadhi, Eds. Nusa Dua, Bali: Association for Computational Linguistics, Nov. 2023, pp. 996–1011. [Online].
Available: https://aclanthology.org/2023.ijcnlp-main.64/
[25] G. Nogara, F. Pierri, S. Cresci, L. Luceri, P. Törnberg, and S. Giordano, “Toxic bias: Perspective api misreads german as more toxic,” Jul. 2024, arXiv:2312.12651 [cs]. [Online]. Available: http://arxiv.org/abs/2312.12651
[26] G. Liu, C. A. Bono, and F. Pierri, “Comparing diversity, negativity, and stereotypes in chinese-language ai technologies: an investigation of baidu, ernie and qwen,” Feb. 2025, arXiv:2408.15696 [cs]. [Online]. Available: http://arxiv.org/abs/2408.15696
[27] I. Augenstein, T. Baldwin, M. Cha, T. Chakraborty, G. L. Ciampaglia, D. Corney, R. DiResta, E. Ferrara, S. Hale, A. Halevy, E. Hovy, H. Ji, F. Menczer, R. Miguez, P. Nakov, D. Scheufele, S. Sharma, and G. Zagni, “Factuality challenges in the era of large language models and opportunities for fact-checking,” Nature Machine Intelligence, vol. 6, no. 8, pp. 852–863, Aug. 2024. [Online]. Available: https://www.nature.com/articles/s42256-024-00881-z
[28] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” International Conference on Learning Representations (ICLR), Jan. 2023. [Online]. Available: https://par.nsf.gov/biblio/10451467-react-synergizing-reasoning-acting-language-models
[29] S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi, “Factscore: Fine-grained atomic evaluation of factual precision in long form text generation,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 12076–12100. [Online]. Available: https://aclanthology.org/2023.emnlp-main.741/
[30] I.-C. Chern, S. Chern, S. Chen, W. Yuan, K. Feng, C. Zhou, J. He, G. Neubig, and P.
Liu, “Factool: Factuality detection in generative ai – a tool augmented framework for multi-task and multi-domain scenarios,” Jul. 2023, arXiv:2307.13528 [cs]. [Online]. Available: http://arxiv.org/abs/2307.13528
[31] T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” Advances in Neural Information Processing Systems, vol. 36, pp. 68539–68551, Dec. 2023. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2023/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html
[32] Y. Wang, R. Gangi Reddy, Z. M. Mujahid, A. Arora, A. Rubashevskii, J. Geng, O. Mohammed Afzal, L. Pan, N. Borenstein, A. Pillai, I. Augenstein, I. Gurevych, and P. Nakov, “Factcheck-bench: Fine-grained evaluation benchmark for automatic fact-checkers,” in Findings of the Association for Computational Linguistics: EMNLP 2024, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 14199–14230. [Online]. Available: https://aclanthology.org/2024.findings-emnlp.830
[33] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improving factuality and reasoning in language models through multiagent debate,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24, vol. 235. Vienna, Austria: JMLR.org, Jul. 2024, pp. 11733–11763.
[34] K. Kim, S. Lee, K.-H. Huang, H. P. Chan, M. Li, and H. Ji, “Can llms produce faithful explanations for fact-checking? towards faithful explainable fact-checking via multi-agent debate,” 2024. [Online]. Available: https://arxiv.org/abs/2402.07401
[35] X. Zhou, A. Sharma, A. X. Zhang, and T. Althoff, “Correcting misinformation on social media with a large language model,” 2024. [Online]. Available: https://arxiv.org/abs/2403.11169
[36] R. Zhao, X. Li, S. Joty, C. Qin, and L.
Bing, “Verify-and-edit: A knowledge-enhanced chain-of-thought framework,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 5823–5840. [Online]. Available: https://aclanthology.org/2023.acl-long.320/
[37] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, and J. Weston, “Chain-of-verification reduces hallucination in large language models,” in Findings of the Association for Computational Linguistics: ACL 2024, L.-W. Ku, A. Martins, and V. Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 3563–3578. [Online]. Available: https://aclanthology.org/2024.findings-acl.212/
[38] M. Li, B. Peng, M. Galley, J. Gao, and Z. Zhang, “Self-checker: Plug-and-play modules for fact-checking with large language models,” in Findings of the Association for Computational Linguistics: NAACL 2024, K. Duh, H. Gomez, and S. Bethard, Eds. Mexico City, Mexico: Association for Computational Linguistics, Jun. 2024, pp. 163–181. [Online]. Available: https://aclanthology.org/2024.findings-naacl.12/
[39] B. Hu, Q. Sheng, J. Cao, Y. Shi, Y. Li, D. Wang, and P. Qi, “Bad actor, good advisor: Exploring the role of large language models in fake news detection,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 20, pp. 22105–22113, Mar. 2024. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/30214
[40] X. Ma, Y. Zhang, K. Ding, J. Yang, J. Wu, and H. Fan, “On fake news detection with llm enhanced semantics mining,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 508–521. [Online].
Available: https://aclanthology.org/2024.emnlp-main.31/
[41] A. Spangher, N. Peng, S. Gehrmann, and M. Dredze, “Do llms plan like human writers? comparing journalist coverage of press releases with llms,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 21814–21828. [Online]. Available: https://aclanthology.org/2024.emnlp-main.1216/
[42] J. Wu, J. Guo, and B. Hooi, “Fake news in sheep’s clothing: Robust fake news detection against llm-empowered style attacks,” in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. KDD ’24. New York, NY, USA: Association for Computing Machinery, Aug. 2024, pp. 3367–3378. [Online]. Available: https://dl.acm.org/doi/10.1145/3637528.3671977
[43] V. D. Lai, N. Ngo, A. Pouran Ben Veyseh, H. Man, F. Dernoncourt, T. Bui, and T. H. Nguyen, “ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning,” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 13171–13189. [Online]. Available: https://aclanthology.org/2023.findings-emnlp.878
[44] L. Reynolds and K. McDonell, “Prompt programming for large language models: Beyond the few-shot paradigm,” in Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, ser. CHI EA ’21. New York, NY, USA: Association for Computing Machinery, 2021. [Online]. Available: https://doi.org/10.1145/3411763.3451760
[45] L. A. Henkel and M. E. Mattson, “Reading is believing: The truth effect and source credibility,” Consciousness and cognition, vol. 20, no. 4, pp. 1705–1721, 2011.
[46] W. Wang, V. W. Zheng, H. Yu, and C.
Miao, “A survey of zero-shot learning: Settings, methods, and applications,” ACM Trans. Intell. Syst. Technol., vol. 10, no. 2, pp. 13:1–13:37, Jan. 2019. [Online]. Available: https://dl.acm.org/doi/10.1145/3293318
[47] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
[48] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
[49] A. Kong, S. Zhao, H. Chen, Q. Li, Y. Qin, R. Sun, X. Zhou, E. Wang, and X. Dong, “Better zero-shot reasoning with role-play prompting,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), K. Duh, H. Gomez, and S. Bethard, Eds. Mexico City, Mexico: Association for Computational Linguistics, Jun. 2024, pp. 4099–4113. [Online]. Available: https://aclanthology.org/2024.naacl-long.228/
[50] K. Pelrine, A. Imouza, C. Thibault, M. Reksoprodjo, C. Gupta, J. Christoph, J.-F. Godbout, and R. Rabbany, “Towards reliable misinformation mitigation: Generalization, uncertainty, and gpt-4,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds.
Singapore: Association for Computational Linguistics, Dec. 2023, pp. 6399–6429. [Online]. Available: https://aclanthology.org/2023.emnlp-main.395/
[51] C. Shorten, C. Pierse, T. B. Smith, E. Cardenas, A. Sharma, J. Trengrove, and B. van Luijt, “Structuredrag: Json response formatting with large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2408.11061
[52] S. Schulhoff, M. Ilie, N. Balepur, K. Kahadze, A. Liu, C. Si, Y. Li, A. Gupta, H. Han, S. Schulhoff, P. S. Dulepet, S. Vidyadhara, D. Ki, S. Agrawal, C. Pham, G. Kroiz, F. Li, H. Tao, A. Srivastava, H. D. Costa, S. Gupta, M. L. Rogers, I. Goncearenco, G. Sarli, I. Galynker, D. Peskoff, M. Carpuat, J. White, S. Anadkat, A. Hoyle, and P. Resnik, “The prompt report: A systematic survey of prompting techniques,” 2024. [Online]. Available: https://arxiv.org/abs/2406.06608
[53] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, and E. Chi, “Least-to-most prompting enables complex reasoning in large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2205.10625
[54] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. Le, E. Chi, D. Zhou, and J. Wei, “Challenging big-bench tasks and whether chain-of-thought can solve them,” in Findings of the Association for Computational Linguistics: ACL 2023, A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 13003–13051. [Online]. Available: https://aclanthology.org/2023.findings-acl.824/
[55] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” Advances in neural information processing systems, vol. 35, pp. 22199–22213, 2022.
[56] Z. Ji, T. Yu, Y. Xu, N. Lee, E. Ishii, and P. Fung, “Towards mitigating LLM hallucination via self reflection,” in Findings of the Association for Computational Linguistics: EMNLP 2023, H.
Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 1827–1843. [Online]. Available: https://aclanthology.org/2023.findings-emnlp.123
[57] F. Corso, F. Pierri, and G. D. F. Morales, “Conspiracy theories and where to find them on tiktok,” arXiv preprint arXiv:2407.12545, 2024.

Evaluating open-source Large Language Models for automated fact-checking - Supplementary Materials

SUPPLEMENTARY MATERIALS

Dataset

Our analysis utilizes the Fact-Check Insights dataset, a comprehensive resource for researchers, journalists, technologists, and other stakeholders engaged in countering political misinformation and falsehoods online. This dataset comprises structured data from tens of thousands of claims made by political figures and social media posts, scrutinized and rated by independent fact-checking organizations such as AFP, Politifact, and Snopes.

A. Data Selection

Given the multipurpose and extensive nature of the dataset, we undertook a preprocessing phase to align it with our specific research requirements. We selected only a subset of pertinent columns, listed in Table I. We retained the fact-checking article’s publication date to maintain a temporal reference point. This enables temporal analysis and lets us assess the importance of external context for claims made after the training data cut-off date.

B. Data Cleaning

We applied a rigorous preprocessing methodology to ensure data accuracy and reliability. The first step verified the language of each entry through cross-validation. We used the langdetect library to identify the language of the text in claimReviewed. This detection was then cross-referenced with the language inferred from the website domain in the url column to ensure consistent language identification.

Subsequently, we refined the dataset through a series of operations to ensure its accuracy and reliability.
First, we eliminated entries that lacked critical data, such as the claimReviewed or datePublished fields, as well as entries whose datePublished was set in the future. The absence of a claim makes fact-checking impossible, and without a valid date the entry becomes unreliable. Additionally, entries that did not include an alternateName (the verdict) were filtered out, as a verdict is essential for our analysis.

Next, we excluded entries with unknown language. If language detection was unsuccessful, the entry was removed to maintain data consistency and reliability across the dataset.

Finally, we addressed the issue of duplicate entries. To preserve the integrity of the data, we identified and removed duplicates, so that our analysis was not skewed by repeated information.

This selection left us with a dataset of around 200,000 claims in 40 languages spanning just under 30 years, from 1996 to 2024. As shown in Figure 3, there is a higher concentration of claims in recent years, with a significant increase starting in 2019. The surge in 2020 includes many claims related to COVID-19 and a substantial number focused on political figures. As shown in Figure 4, English is the predominant language of fact-checkers, followed by Arabic, Spanish, Portuguese, and Italian, reflecting the top languages spoken in the Western world.

After selecting the English-language entries, we extracted the article text through a scraping procedure, using the newspaper3k library on the url column. This process created an additional field containing the text that verifies the claimReviewed. These articles vary in length from 1000 to 20000 words (Figure 2). Subsequently, we removed entries with unsuccessful scrapes.
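The filtering steps described in this subsection can be sketched as a single pass over the entries. This is a minimal illustration under stated assumptions, not the authors' pipeline: entries are plain dictionaries with flat keys, `detect_language` is a hypothetical callable standing in for the language check (the paper uses langdetect cross-referenced with the URL domain), and the duplicate key (normalized claim text plus URL) is an assumption, since the text does not say how duplicates were identified.

```python
from datetime import date

# Fields whose absence makes an entry unusable (claim, date, verdict).
REQUIRED_FIELDS = ("claimReviewed", "datePublished", "alternateName")


def clean_entries(entries, detect_language, today):
    """Return the entries that survive the filters described in the text.

    `detect_language` is a hypothetical callable that returns a language
    code, or None when detection fails.
    """
    seen = set()
    kept = []
    for entry in entries:
        # Drop entries missing the claim, the publication date, or the verdict.
        if any(not entry.get(field) for field in REQUIRED_FIELDS):
            continue
        # Drop entries dated in the future.
        if entry["datePublished"] > today:
            continue
        # Drop entries whose language cannot be identified.
        language = detect_language(entry["claimReviewed"])
        if language is None:
            continue
        # Drop duplicates; the choice of key is an assumption of this sketch.
        key = (entry["claimReviewed"].strip().lower(), entry.get("url"))
        if key in seen:
            continue
        seen.add(key)
        kept.append({**entry, "language": language})
    return kept


def toy_detector(text):
    """Stand-in detector for the example only: ASCII text counts as English."""
    return "en" if text.isascii() else None


# Toy entries (hypothetical data).
base = {
    "claimReviewed": "The moon is made of cheese.",
    "datePublished": date(2020, 1, 5),
    "alternateName": "False",
    "url": "https://example.org/a",
}
entries = [
    base,
    dict(base),                                   # exact duplicate
    {**base, "claimReviewed": ""},                # missing claim
    {**base, "datePublished": date(2999, 1, 1)},  # future date
    {**base, "alternateName": None},              # missing verdict
]
cleaned = clean_entries(entries, toy_detector, today=date(2025, 3, 7))
```

Passing the detector as a parameter keeps the sketch independent of any particular language-identification library.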
After these steps, 60,000 claims remain, each linked to its corresponding golden document.

arXiv:2503.05565v1 [cs.CY] 7 Mar 2025

[Figure: x-axis: Year (2014 to 2024); y-axis: Number of claims (0 to 500); legend: False, True]
Fig. 1. Number of true and false claims per year in the dataset used.
Fig. 2. Distribution of article lengths.
Fig. 3. Distribution of fact-checked claims over the years.
Fig. 4. Number of fact-checked claims by claim language.
Fig. 5. Verdict distribution.

C. Labeling

Following the methodology outlined by Quelle [1], we divided our dataset into two categories: true and false. The true category includes entries deemed accurate or of higher quality, while the false category encompasses content lacking a factual basis, such as fake quotes, conspiracy theories, misleadingly edited media, or ironic and exaggerated criticism.

To classify our entries, we manually mapped the 500 most common verdicts into one of these two categories (a sample is shown in Figure 5). Although this process was time-consuming and meticulous, it resulted in a robust and rigorous classification of our data.
For example, labels like Geppetto mark, trustworthy, mostly true, and correct attribution were categorized as true, while labels such as false, mostly false, Quattro Pinocchio, and unproven or legend were classified as false. Additionally, terms like mixture were categorized as false, given that a statement containing both true and false elements is considered overall inaccurate.

D. Balancing

The dataset was substantially imbalanced: 90% of claims were categorized as false, since fact-checkers typically verify false claims. We addressed this disparity by equalizing the number of true and false samples, which resulted in a much smaller but more balanced dataset. As depicted in Figure 3, our dataset spans various years, with a notable concentration of claims from 2013 to 2024. This temporal distribution enhances the statistical reliability of the dataset across different periods.

Given the limited API usage and the large number of experiments to be conducted, we selected a subset of the balanced dataset. Specifically, we chose 50 entries for each year from 2013 onward, consisting of 25 true claims and 25 false ones, as shown in Figure 1. Stratified sampling based on verdicts was employed to ensure that both categories were represented. For the year 2024, instead, we included 500 samples. In this case, the dataset could not be perfectly balanced due to the limited number of true claims, with only 30 available.

Page 17:

| Field | Description |
| --- | --- |
| claimReviewed | The statement or claim under examination. |
| datePublished | The date of the article's publication. |
| url | The source URL from which the data was gathered. |
| reviewRating.alternateName | The verdict assigned to the claim. |
| author.name | The name of the fact-checking organization conducting the analysis. |
| language | The language in which the claim was evaluated. |
| reviewRating.author.name | The author of the claim being reviewed. |

TABLE I: DESCRIPTION OF FACT-CHECKING DATABASE FIELDS.
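The labeling and per-year balancing described in Sections C and D can be illustrated with the following sketch. It is only an approximation under stated assumptions: the mapping contains just the unambiguous example verdicts quoted in the text (the full manual mapping is not reproduced in this section), and the `year` and `verdict` keys, the function names, and the fixed random seed are hypothetical choices for the example.

```python
import random
from collections import defaultdict

# Partial verdict-to-label mapping built from the examples quoted in the text.
# The complete manual mapping is not available here.
VERDICT_TO_LABEL = {
    "geppetto mark": True,
    "trustworthy": True,
    "mostly true": True,
    "correct attribution": True,
    "false": False,
    "mostly false": False,
    "quattro pinocchio": False,
    "mixture": False,  # mixed statements are treated as inaccurate overall
}


def label(verdict: str):
    """Map a raw verdict to True or False, or None if it is not in the mapping."""
    return VERDICT_TO_LABEL.get(verdict.strip().lower())


def sample_balanced(entries, per_class=25, seed=0):
    """Draw up to `per_class` true and `per_class` false claims for each year."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for entry in entries:
        lab = label(entry["verdict"])  # "verdict" and "year" are assumed keys
        if lab is not None:
            groups[(entry["year"], lab)].append(entry)
    sample = []
    for key in sorted(groups):
        pool = groups[key]
        # If a stratum is too small, take everything that is available.
        sample.extend(rng.sample(pool, min(per_class, len(pool))))
    return sample
```

Taking `min(per_class, len(pool))` mirrors the situation reported for 2024, where a stratum with too few true claims cannot be perfectly balanced.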
The 2024 samples are particularly significant because most of the LLMs used in this research were trained before 2024, except for Mistral 7B, which was fine-tuned in June 2024. This portion of the dataset therefore estimates how well these models can perform as fact-checkers in real time. This process culminated in a comprehensive English dataset used in all the experiments for each task.

E. Mistral 7B Results

In our analysis, we initially included Mistral-7B-Instruct-v0.3 as a representative of smaller models within the Mistral family. However, we observed a high percentage of faulty responses in the first two preparatory tasks, leading us to exclude its predictions from the main analysis. For completeness, we provide the plots including the results obtained by Mistral 7B for all three tasks.

[Figure 6: x-axis: model (Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B); y-axis: F1 score (0.0 to 1.0); legend: Positive, Negative, RoBERTa positive, RoBERTa negative.]
Fig. 6. Task 1: Models' F1 scores computed for both classes. Fine-tuned RoBERTa is used as a reference baseline.

[Figure 7: x-axis: F1 positive class (0.0 to 1.0); y-axis: F1 negative class (0.0 to 1.0); legend: Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B.]
Fig. 7. Task 1: Comparison between the F1 scores obtained by each model for each prompt variation with respect to both classes. The bisector is reported as a reference.
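Since Figs. 6 and 7 report a separate F1 score for each class, a short sketch of that computation may help. The function name `f1_for_class` and the use of booleans as labels are assumptions for illustration; which class counts as positive depends on the task, and the value True is used for it here only as a convention.

```python
def f1_for_class(y_true, y_pred, target):
    """F1 score obtained when `target` is treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    if tp == 0:
        # No correct prediction for this class: precision and recall are zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: four claims, one true claim is missed by the model.
gold = [True, True, False, False]
pred = [True, False, False, False]
f1_positive = f1_for_class(gold, pred, True)
f1_negative = f1_for_class(gold, pred, False)
```

Plotting `f1_positive` against `f1_negative` for every prompt variation gives one point per prompt, which is the kind of comparison Fig. 7 draws against the bisector.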
[Figure 8: one panel per model (Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B); bar categories: Average, Chain-of-Thought, Few-shot, Zero-shot; value axis: 0.0 to 1.0; bars are annotated with the F1 score and the name of the best prompt.]
Fig. 8. Task 1: For each model and each prompt category (Zero-Shot, Few-Shot, and Chain-of-Thought), the best prompt's F1 score is reported. The average F1 score for each model is also reported with 0.95 confidence intervals.

Page 18:

[Figure 9: x-axis: model (Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B); y-axis: % of Faults (0.0 to 1.0).]
Fig. 9. Task 1: Percentage of faults for each model. The median fault percentage for each model is: 0.241 (Llama3 8B), 0.003 (Llama3 70B), 1.0 (Mistral 7B), 0.393 (Mixtral 8x7B).

[Figure 10: x-axis: model (Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B); y-axis: F1 score (0.0 to 1.0); legend: Positive, Negative, RoBERTa positive, RoBERTa negative.]
Fig. 10. Task 2: Models' F1 scores computed for both classes. Fine-tuned RoBERTa is used as a reference baseline.
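The fault percentages reported per model (for example in Fig. 9) can be approximated with a sketch like the one below. The exact criterion for a faulty response is not given in this section, so the code assumes that a fault is a response from which no single true/false label can be extracted; the regular expression, the function names, and the grouping of responses by prompt are likewise assumptions.

```python
import re
from statistics import median


def parse_label(response: str):
    """Return True or False if exactly one label appears in the text, else None."""
    found = set(re.findall(r"\b(true|false)\b", response.lower()))
    if len(found) != 1:
        # No label, or both labels at once: treated as a fault (assumption).
        return None
    return found.pop() == "true"


def fault_rate(responses):
    """Fraction of responses from which no label could be extracted."""
    if not responses:
        raise ValueError("fault_rate needs at least one response")
    faults = sum(1 for r in responses if parse_label(r) is None)
    return faults / len(responses)


def median_fault_rate(responses_by_prompt):
    """Median of the per-prompt fault rates for one model."""
    return median(fault_rate(r) for r in responses_by_prompt.values())
```

Under this reading, a median of 1.0, as reported for Mistral 7B in Task 1, would mean that for at least half of the prompts no response could be parsed at all.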
[Figure 11: one panel per model (Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B); bar categories: Average, Chain-of-Thought, Few-shot, Zero-shot; value axis: 0.0 to 1.0; bars are annotated with the F1 score and the name of the best prompt.]
Fig. 11. Task 2: For each model and each prompt category (Zero-Shot, Few-Shot, and Chain-of-Thought), the best prompt's F1 score is reported. The average F1 score for each model is also reported with 0.95 confidence intervals.
[Figure 12: x-axis: model (Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B); y-axis: % of Faults (0.0 to 1.0).]
Fig. 12. Task 2: The percentage of faults for each model. The median faults percentage for each model is: 0.392 (Llama3 8B), 0.066 (Llama3 70B), 1.0 (Mistral 7B), 0.370 (Mixtral 8x7B).

[Figure 13: x-axis: model (Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B); y-axis: F1 score (0.0 to 1.0); legend: Positive, Negative, RoBERTa positive, RoBERTa negative.]
Fig. 13. Task 3: Models' F1 scores computed for both classes. Fine-tuned RoBERTa is used as a reference baseline.

[Figure 14: x-axis: model (Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B); y-axis: F1 score (0.0 to 1.0); legend: Pre 2024, 2024, RoBERTa pre2024, RoBERTa 2024.]
Fig. 14. Task 3: Models' F1 scores computed for the positive class distinguishing between claims dated before 2024 (thus possibly included in the models' training dataset) and from 2024. Fine-tuned RoBERTa is used as a reference baseline.

Page 19:

[Figure 15: x-axis: model (Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B); y-axis: F1 score (0.0 to 1.0); legend: Pre 2024, 2024, RoBERTa pre2024, RoBERTa 2024.]
Fig. 15. Task 3: Models' F1 scores computed for the negative class distinguishing between claims dated before 2024 (thus possibly included in the models' training dataset) and from 2024. Fine-tuned RoBERTa is used as a reference baseline.
[Figure 16: x-axis: model (Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B); y-axis: % of Faults (0.0 to 1.0).]
Fig. 16. Task 3: The percentage of faults for each model. The median faults percentage for each model is: 0.02 (Llama3 8B), 0.024 (Llama3 70B), 0.005 (Mistral 7B), 0.051 (Mixtral 8x7B).

[Figure 17: one panel per model (Llama3 8B, Llama3 70B, Mistral 7B, Mixtral 8x7B); bar categories: Average, Wikipedia, Google, Zero-shot; value axis: 0.0 to 1.0; bars are annotated with the F1 score and the name of the best prompt.]
Fig. 17. Task 3: For each model, for each prompt category (Zero-Shot, with Google contextual information, and with Wikipedia contextual information) the best prompt's F1 score is reported. The average F1 score for each model is also reported with 0.95 confidence intervals.

Page 20:

F. Task 3 Additional Plots

[Figure 18: panel title: Llama3 8B - positive; x-axis: Recall (0.0 to 1.0); y-axis: Precision (0.0 to 1.0); legend: Zero-shot, Google, Wikipedia, RoBERTa, Random.]
Fig. 18. Task 3: Precision and Recall curve for the positive class for Llama3 8B. The best prompt of each category is highlighted. Fine-tuned RoBERTa and a random classifier are used as a baseline for comparison.
[Figure: plot titled "Llama3 70B - positive"; x-axis: Recall; y-axis: Precision; legend: Zero-shot, Google, Wikipedia, RoBERTa, Random]
Fig. 19. Task 3: Precision and Recall curve for the positive class for Llama3 70B. The best prompt of each category is highlighted. Fine-tuned RoBERTa and a random classifier are used as a baseline for comparison.
[Figure: plot titled "Mixtral 8x7B - positive"; x-axis: Recall; y-axis: Precision; legend: Zero-shot, Google, Wikipedia, RoBERTa, Random]
Fig. 20. Task 3: Precision and Recall curve for the positive class for Mixtral 8x7B. The best prompt of each category is highlighted. Fine-tuned RoBERTa and a random classifier are used as a baseline for comparison.
REFERENCES
[1] D. Quelle and A. Bovet, “The perils and promises of fact-checking with large language models,” Frontiers in Artificial Intelligence, vol. 7, Feb. 2024. [Online]. Available: http://dx.doi.org/10.3389/frai.2024.1341697
[Figure: plot titled "Llama3 8B - negative"; x-axis: Recall; y-axis: Precision; legend: Zero-shot, Google, Wikipedia, RoBERTa, Random]
Fig. 21. Task 3: Precision and Recall curve for the negative class for Llama3 8B. The best prompt of each category is highlighted. Fine-tuned RoBERTa and a random classifier are used as a baseline for comparison.
[Figure: plot titled "Llama3 70B - negative"; x-axis: Recall; y-axis: Precision; legend: Zero-shot, Google, Wikipedia, RoBERTa, Random]
Fig. 22. Task 3: Precision and Recall curve for the negative class for Llama3 70B. The best prompt of each category is highlighted. Fine-tuned RoBERTa and a random classifier are used as a baseline for comparison.
[Figure: plot titled "Mixtral 8x7B - negative"; x-axis: Recall; y-axis: Precision; legend: Zero-shot, Google, Wikipedia, RoBERTa, Random]
Fig. 23. Task 3: Precision and Recall curve for the negative class for Mixtral 8x7B. The best prompt of each category is highlighted. Fine-tuned RoBERTa and a random classifier are used as a baseline for comparison.
Page 21:
[Figure: plot titled "Llama3 8B - positive"; x-axis: False Positive Rate; y-axis: True Positive Rate; legend: Zero-shot, Google, Wikipedia, RoBERTa, Random]
Fig. 24. Task 3: ROC curve for the positive class for Llama3 8B. The best prompt of each category is highlighted. Fine-tuned RoBERTa and a random classifier are used as a baseline for comparison.
[Figure: plot titled "Llama3 70B - positive"; x-axis: False Positive Rate; y-axis: True Positive Rate; legend: Zero-shot, Google, Wikipedia, RoBERTa, Random]
Fig. 25. Task 3: ROC curve for the positive class for Llama3 70B. The best prompt of each category is highlighted. Fine-tuned RoBERTa and a random classifier are used as a baseline for comparison.
[Figure: plot titled "Mixtral 8x7B - positive"; x-axis: False Positive Rate; y-axis: True Positive Rate; legend: Zero-shot, Google, Wikipedia, RoBERTa, Random]
Fig. 26. Task 3: ROC curve for the positive class for Mixtral 8x7B. The best prompt of each category is highlighted. Fine-tuned RoBERTa and a random classifier are used as a baseline for comparison.
[Figure: plot titled "Llama3 8B - negative"; x-axis: False Positive Rate; y-axis: True Positive Rate; legend: Zero-shot, Google, Wikipedia, RoBERTa, Random]
Fig. 27. Task 3: ROC curve for the negative class for Llama3 8B. The best prompt of each category is highlighted. Fine-tuned RoBERTa and a random classifier are used as a baseline for comparison.
[Figure: plot titled "Llama3 70B - negative"; x-axis: False Positive Rate; y-axis: True Positive Rate; legend: Zero-shot, Google, Wikipedia, RoBERTa, Random]
Fig. 28. Task 3: ROC curve for the negative class for Llama3 70B. The best prompt of each category is highlighted. Fine-tuned RoBERTa and a random classifier are used as a baseline for comparison.
[Figure: plot titled "Mixtral 8x7B - negative"; x-axis: False Positive Rate; y-axis: True Positive Rate; legend: Zero-shot, Google, Wikipedia, RoBERTa, Random]
Fig. 29. Task 3: ROC curve for the negative class for Mixtral 8x7B. The best prompt of each category is highlighted. Fine-tuned RoBERTa and a random classifier are used as a baseline for comparison.
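As a reading aid for the precision-recall and ROC plots of Task 3, the following is a minimal sketch of how a single prompt could be placed as one point on each plane, computed separately for the positive and the negative class. It assumes that each prompt yields hard binary labels; the function name `per_class_metrics`, the label encoding (1 for positive, 0 for negative), and the toy labels are illustrative assumptions and are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass
class ClassMetrics:
    precision: float
    recall: float  # identical to the true positive rate
    false_positive_rate: float
    f1: float


def per_class_metrics(y_true, y_pred, target):
    """One-vs-rest metrics, treating `target` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    tn = sum(1 for t, p in pairs if t != target and p != target)

    # Guard against empty denominators (e.g. a class that is never predicted).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return ClassMetrics(precision, recall, fpr, f1)


# Toy labels invented for illustration only (not data from the paper).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

positive = per_class_metrics(y_true, y_pred, target=1)
negative = per_class_metrics(y_true, y_pred, target=0)

# (recall, precision) is a point on a precision-recall plot;
# (false_positive_rate, recall) is a point on a ROC plot.
print("positive:", positive)
print("negative:", negative)
```

Under this reading, swapping `target` is all that distinguishes the "positive" panels from the "negative" ones: the same predictions are scored twice, once with each class treated as the class of interest.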