loader
Generating audio...
Extracting PDF content...

arxiv

Paper 2412.15264

ReXTrust: A Model for Fine-Grained Hallucination Detection in AI-Generated Radiology Reports

Authors: Romain Hardy, Sung Eun Kim, Du Hyun Ro, Pranav Rajpurkar

Published: 2024-12-17

Abstract:

The increasing adoption of AI-generated radiology reports necessitates robust methods for detecting hallucinations--false or unfounded statements that could impact patient care. We present ReXTrust, a novel framework for fine-grained hallucination detection in AI-generated radiology reports. Our approach leverages sequences of hidden states from large vision-language models to produce finding-level hallucination risk scores. We evaluate ReXTrust on a subset of the MIMIC-CXR dataset and demonstrate superior performance compared to existing approaches, achieving an AUROC of 0.8751 across all findings and 0.8963 on clinically significant findings. Our results show that white-box approaches leveraging model hidden states can provide reliable hallucination detection for medical AI systems, potentially improving the safety and reliability of automated radiology reporting.

Paper Content: on Alphaxiv
Page 1: ReXTrust: A Model for Fine-Grained Hallucination Detection in AI-Generated Radiology Reports Romain Hardy1, Sung Eun Kim1, Du Hyun Ro2, Pranav Rajpurkar1 1Department of Biomedical Informatics, Harvard Medical School 2Department of Orthopedic Surgery, Seoul National University Hospital romain hardy@fas.harvard.edu, sungeun kim2@hms.harvard.edu, franciu7@snu.ac.kr, pranav rajpurkar@hms.harvard.edu Abstract The increasing adoption of AI-generated radiology reports ne- cessitates robust methods for detecting hallucinations—false or unfounded statements that could impact patient care. We present ReXTrust, a novel framework for fine-grained hal- lucination detection in AI-generated radiology reports. Our approach leverages sequences of hidden states from large vision-language models to produce finding-level hallucina- tion risk scores. We evaluate ReXTrust on a subset of the MIMIC-CXR dataset and demonstrate superior performance compared to existing approaches, achieving an AUROC of 0.8751 across all findings and 0.8963 on clinically significant findings. Our results show that white-box approaches lever- aging model hidden states can provide reliable hallucination detection for medical AI systems, potentially improving the safety and reliability of automated radiology reporting. 1 Introduction The automated generation of radiology reports using large vision-language models (LVLMs) has recently shown promising results, offering the potential to improve workflow efficiency and standardization in clinical settings (Bannur et al. 2024; Tanida et al. 2023; Hamamci, Er, and Menze 2024; Liu et al. 2024a; Gu et al. 2024; Chen et al. 2024d; Zhou et al. 2024). However, these models sometimes gener- ate hallucinations—statements that are false, unfounded, or inconsistent with the input images. In medical settings, such hallucinations pose significant risks, as false pathological findings could lead to unnecessary interventions or missed diagnoses, directly impacting patient care. In this paper, we introduce ReXTrust, a white-box model designed to detect hallucinations in LVLM-generated radi- ology reports. ReXTrust uses a self-attention module trained on LVLM hidden states, enabling fine-grained insights into specific radiological findings and reliable overall hallucina- tion risk scores. By analyzing internal model representa- tions, ReXTrust can identify potential hallucinations during the generation process itself, rather than relying solely on post-hoc analysis. We evaluate ReXTrust on a subset of the MIMIC-CXR dataset, demonstrating its effectiveness across different medical categories and severity levels, with par- ticular attention to clinically significant findings that would impact patient care. Our contributions are threefold: 1. We develop a white-box architecture for radiology findinghallucination detection that enables both token-level at- tention analysis and finding-level prediction, making the detection process transparent to clinical users. 2. We demonstrate state-of-the-art performance in detect- ing hallucinations in radiology reports, particularly for clinically significant findings, achieving superior results compared to existing approaches while maintaining inter- pretability. 3. We provide empirical evidence that model hidden states contain reliable signals for hallucination detection, sug- gesting a path toward improved semantic fidelity in med- ical report generation systems. 2 Background and Related Work 2.1 Hallucination Detection Hallucination detection is the task of identifying AI- generated content that is false, unfounded, or inconsistent with input data. In medical report generation, hallucinations can manifest as either fictional findings or critical omissions, both of which pose significant risks to patient care. Hallucina- tions are particularly consequential in radiology, as false find- ings may both initiate unnecessary medical procedures and mask critical pathologies requiring urgent intervention. Cur- rent hallucination detection methods for LLMs and LVLMs can be categorized into three approaches based on their re- quired level of access to model parameters: black-box, gray- box, and white-box approaches. Each category presents dif- ferent tradeoffs between computational complexity, ease of implementation, and detection accuracy. Our work builds upon these foundations by introducing a white-box approach specifically designed for the radiology domain, where the stakes of hallucination detection are high. 2.2 Black-Box Methods Black-box methods operate without access to model parame- ters, relying solely on model outputs. Despite this limitation, these methods have gained prominence due to their broad applicability across proprietary models and APIs. Research in this area has pursued two main directions. The first ex- plores models’ self-evaluation capabilities, with Kadavath et al. (2022) and Lin, Hilton, and Evans (2022) demonstrat- ing that LLMs can identify their own hallucinations with reasonable accuracy when explicitly prompted. The secondarXiv:2412.15264v3 [cs.CL] 31 Jan 2025 Page 2: direction quantifies model uncertainty by analyzing output diversity. Kuhn, Gal, and Farquhar (2023) and Farquhar et al. (2024) showed that semantic entropy—the uncertainty in the meanings of model-generated text—correlates strongly with hallucination likelihood. Building on these foundations, Friel and Sanyal (2023) proposed using auxiliary LLMs as external validators, demonstrating improved detection performance across both open and closed-domain tasks. More recently, Manakul, Liusie, and Gales (2023) introduced SelfCheck GPT, which evaluates generated sequences through mutual entailment analysis across multiple generations, providing a more robust assessment of content reliability. While these methods provide valuable insights, our work demonstrates that access to model parameters can facilitate robust halluci- nation detection in the medical domain. 2.3 Gray-Box Methods Gray-box methods leverage access to token-level probability distributions, enabling more precise analysis than black-box approaches while remaining computationally efficient. Tra- ditional metrics in this category include perplexity, token entropy, and mutual information (Fomicheva et al. 2020; Van der Poel, Cotterell, and Meister 2022). However, token- level uncertainty evaluation presents inherent challenges, as large uncertainty values may reflect the presence of multiple viable continuations rather than potential fabrications. To ad- dress this limitation, Fadeeva et al. (2024) developed a claim- conditioned scoring system (CCP) that evaluates model gen- erations according to the probability that a semantically equivalent output would be generated instead, providing a more reliable measure of content veracity. While gray-box methods offer valuable insights into language model uncer- tainty, our approach moves beyond probability distributions to leverage the rich information contained in hidden states, enabling more nuanced hallucination detection capabilities. 2.4 White-Box Methods White-box methods require complete access to model weights and typically leverage intermediate representations of the model inputs. The predominant approach involves training supervised classifiers on extracted model activa- tions. For instance, Azaria and Mitchell (2023) and Su et al. (2024) employed feedforward neural networks for post-generation hallucination detection, while Alnuhait et al. (2024) developed similar architectures for pre-generation de- tection and mitigation. In a significant advancement, Kossen et al. (2024) demonstrated that a logistic regression model trained on hidden activations can effectively predict semantic entropy, suggesting that LLMs encode uncertainty informa- tion within their internal representations. Building on this in- sight, Chen et al. (2024a) introduced the EigenScore metric, which analyzes the consistency of hidden state embeddings across multiple generations to provide a robust measure of output reliability. ReXTrust extends previous white-box ap- proaches by using a self-attention mechanism that analyzes sequences of hidden states, thus achieving high fidelity on finding-level hallucination detection.2.5 Multimodal Methods Given that LVLMs incorporate language model components, hallucination detection methods developed for LLMs can be naturally extended to LVLMs (Li et al. 2024b). However, the visual component of LVLMs presents unique opportunities to validate generated content against input images, particu- larly for verifying visual claims. Recognizing this potential, Yin et al. (2023) developed Woodpecker, which employs aux- iliary object detectors and visual question-answering models to verify visual assertions in model-generated text. Similarly, Chen et al. (2024c) approached LVLM hallucination detec- tion through independent visual evidence-gathering modules that systematically confirm or contradict claims extracted from model generations. This approach was further refined by Fei et al. (2024), who employed Siamese networks to compare scene graphs extracted from model generations with those derived from input images. In the medical domain, specialized LVLM hallucination detection methods have emerged to address domain-specific challenges. Chen et al. (2024b) introduced MediHallDetec- tor for categorizing medical hallucinations, demonstrating superior consistency compared to GPT baselines when eval- uating medical content. A significant advance was made by Zhang et al. (2024) with RadFlag, an approach that lever- ages entailment relationships between findings generated at different sampling temperatures to identify potentially hallu- cinatory content in LVLM-generated medical reports. ReX- Trust builds upon and complements RadFlag by providing fine-grained insights at both the token and finding levels; our empirical results demonstrate the benefits of analyzing model hidden states over post-hoc output analysis. 3 Methodology In this section, we formalize the task of hallucination detec- tion for radiology report generation. A comparative analysis of black-box, gray-box, and white-box approaches by Liu et al. (2024b) indicated that white-box methods generally demonstrate superior performance. Motivated by this result, we develop ReXTrust as a white-box hallucination detection model. Throughout our experiments, we utilize the MedVersa model (Zhou et al. 2024) to generate and evaluate candidate reports. Given a sequence of input chest X-rays, MedVersa generates a candidate report 𝑅, which we decompose into a set of findings{𝑠𝑖}𝑛 𝑖=1. A finding is defined as a single claim in𝑅(e.g., “There is pneumonia”), which usually corresponds to a single sentence. In cases where a sentence contains mul- tiple claims, we split the sentence into individual claims. For instance, the sentence “No evidence of pneumonia, pneu- mothorax, or pleural effusion” would be split into three dis- tinct findings: “No evidence of pneumonia,” “No evidence of pneumothorax,” and “No evidence of pleural effusion.” 3.1 Model Architecture For each individual finding 𝑠𝑖, ReXTrust predicts a score in the interval [0, 1], interpreted as the probability that 𝑠𝑖con- tains hallucinatory content. ReXTrust generates these hallu- cination risk scores through an end-to-end architecture that Page 3: Figure 1: ReXTrust hallucination detection framework. The model processes MedVersa’s hidden states ℎ𝑖through a self-attention module to produce finding-level hallucination scores. processes MedVersa’s hidden activation states. The model employs a self-attention module to analyze the sequence of hidden states corresponding to each finding, enabling it to capture both local patterns and broader contextual relation- ships between tokens when computing the final risk score for𝑠𝑖. Figure 1 provides a schematic representation of this framework. For a given finding 𝑠𝑖, ReXTrust first extracts the sequence of hidden states(ℎ𝑙 𝑡)𝑡∈𝑠𝑖for all tokens 𝑡in𝑠𝑖, where𝑙de- notes the index of a specific hidden layer (we use 𝑙=16 in our implementation) and ℎ𝑙 𝑡∈R𝑑(𝑑=4096 for Med- Versa). This sequence is projected to a 1024-dimensional latent space through a linear transformation and combined with sinusoidal positional embeddings to encode relative to- ken positions. The model then processes this embedding through three 1024-dimensional linear projections to pro- duce query, key, and value vectors. These vectors are fed through an 8-headed self-attention module with dropout reg- ularization (𝑝=0.1), with each head having dimension 128. The attended outputs are mean-pooled along the sequence dimension, passed through another 1024-dimensional pro- jection, and ultimately fed to a sigmoid classification head that produces a hallucination risk score 𝑟.The advantages of the ReXTrust architecture are two-fold. First, it enables efficient training and inference while main- taining interpretability for clinical users through attention map analysis. Second, the self-attention mechanism natu- rally captures the full range of token interactions within a finding, allowing the model to simultaneously consider both local features and long-range dependencies when assessing hallucination risk. 3.2 Data We conduct our experiments using studies from the test set of the MIMIC-CXR database (Johnson et al. 2019). To ensure rigorous evaluation, we randomly partition the subjects into a training set for ReXTrust and a held-out set for final evalu- ation. The training set contains 231 subjects (corresponding to 1923 studies), while the held-out set contains 58 subjects (424 studies). We maintain strict subject-level separation be- tween all sets to prevent data leakage. For hallucination labeling at the finding level, we adopt the LLM entailment strategy proposed by Zhang et al. (2024). For each finding generated by MedVersa, we employ Ope- nAI’s gpt-4o model to classify it as either “completely en- tailed,” “partially entailed,” or “not entailed” by the corre- Page 4: sponding ground truth radiologist-written report. We con- sider a finding to be hallucinated if it is either partially en- tailed or not entailed by the ground truth report. 3.3 Model Training ReXTrust is trained using the average binary cross-entropy between predicted hallucination scores and the associated finding-level hallucination labels as the objective function. We employ a weighted random sampler to ensure balanced exposure to positive and negative examples during training, while maintaining equal class weights in the objective func- tion. The model is optimized for 5 epochs using the AdamW optimizer (Loshchilov 2017) with learning rate 1 .0×10−4on a cosine schedule, batch size 128, and standard parameters (𝛽1=0.9,𝛽2=0.999) with weight decay 0.01. We employ 5-fold cross-validation on the training set, with model per- formance monitored via the area under the receiver operating characteristic curve (AUROC) on the validation set. The final weights for ReXTrust are obtained by averaging across folds. 3.4 Finding Severity Radiological findings exhibit substantial heterogeneity in their clinical significance. For instance, a procedural obser- vation such as “Single portable view of the chest” carries minimal risk if hallucinated, whereas the false positive find- ing “There is a left apical pneumothorax” could precipitate unnecessary emergency intervention. To systematically an- alyze this variance in clinical risk, we employ gpt-4o to categorize the findings in our evaluation set into four distinct severity tiers: 1. Findings requiring immediate emergency intervention. 2. Findings warranting non-emergency clinical action. 3. Findings with minimal clinical significance. 4. Findings which do not fit into the three previous tiers. The classification is performed using a standardized prompt (detailed in Appendix A). Our analysis encompasses both the complete evaluation set and a focused evaluation of high- risk findings (tiers 1 and 2), enabling a nuanced assessment of hallucination detection performance in clinically critical scenarios. The reliability of these LLM-generated severity labels is discussed in Section 5.4. 4 Results 4.1 Discriminative Power of ReXTrust On the complete set of findings, ReXTrust achieves an AU- ROC of 0.8751 (95% CI: 0.8623, 0.8880). When evaluated exclusively on clinically relevant findings, the AUROC im- proves to 0.8963 (95% CI: 0.8824, 0.9091). Notably, ReX- Trust’s performance remains robust—and even shows im- provement—when restricted to clinically significant find- ings. This suggests that ReXTrust learns generalizable fea- tures for hallucination detection rather than overfitting to patterns specific to less critical findings.4.2 Comparison to Other Approaches We evaluate ReXTrust against a set of baseline methods spanning diverse approaches to hallucination detection. Tra- ditional uncertainty estimation methods include entropy, which measures token-level predictive uncertainty, and CCP scores (Fadeeva et al. 2024), which evaluate generations based on the probability semantically equivalent alternatives. Among general-purpose multimodal detectors, we evaluate UNIHD (Chen et al. 2024c), which employs independent evidence-gathering modules to validate generated content. Since UNIHD was not originally designed for the medical domain, we implement domain-specific adaptations that pre- serve its core mechanisms while enabling fair comparison on radiology data (detailed in Appendix B). We also compare against EigenScore (Chen et al. 2024a), a white-box method that assesses hallucination likelihood by analyzing similari- ties between hidden state activations across multiple model- generated sentences (in our case, we use high-temperature samples from MedVersa). Finally, we compare against Rad- Flag (Zhang et al. 2024), a radiology-specific method that identifies hallucinations by analyzing entailment relation- ships between findings generated at different sampling tem- peratures. Table 1 presents the comparative evaluation using the micro-averaged AUROC, area under the precision-recall curve (AUPRC), and area under the generalized risk cov- erage curve (AUGRC) metrics (Traub et al. 2024). ReXTrust demonstrates substantial improvements over all baseline methods across all metrics. The strongest competing method, RadFlag, achieves a finding-level AUROC of 0.7999 (95% CI: 0.7844, 0.8152), which is 7.52 points lower than ReX- Trust. ReXTrust’s superior performance relative to UNIHD suggests that training a supervised model directly on in- domain hallucination detection using hidden states provides stronger signals than relying on external verification mod- ules. Furthermore, its improvement over RadFlag indicates that analyzing the generation process through hidden states may be more reliable than post-hoc analysis of model out- puts. To validate our architectural choice of self-attention, we evaluate a variant of ReXTrust that processes hidden states independently for each token rather than attending to the full sequence. In this attention-free version, we generate token- level predictions that are then mean-pooled to produce a finding-level hallucination risk score. While this simpler ar- chitecture achieves high performance, it is worse than the full self-attention model. This gap suggests that ReXTrust benefits from modeling the contextual relationships between tokens, capturing dependencies that are difficult to identify through token-level analysis alone. 4.3 Performance Across Medical Categories Following the categorization approach of Zhang et al. (2024), we refine our analysis by partitioning clinically relevant find- ings into five distinct categories: Lungs, Pleural, Cardiome- diastinal, Musculoskeletal, and Medical Devices. Figure 2 presents the 95% confidence intervals for ReXTrust’s AU- GRC scores (red) on the held-out evaluation set across these Page 5: Model AUROC AUPRC AUGRC Entropy 0.7361 (0.7168, 0.7551) 0.6848 (0.6582, 0.7119) 0.2093 (0.1987, 0.2202) CCP (Fadeeva et al. 2024) 0.6535 (0.6334, 0.6734) 0.5692 (0.5430, 0.5969) 0.1751 (0.1659, 0.1860) EigenScore (Chen et al. 2024a) 0.6029 (0.5821, 0.6245) 0.5290 (0.5023, 0.5612) 0.1979 (0.1878, 0.2086) UniHD (Chen et al. 2024c) 0.6569 (0.6401, 0.6733) 0.5710 (0.5475, 0.5970) 0.1638 (0.1528, 0.1719) RadFlag (Zhang et al. 2024) 0.7999 (0.7844, 0.8152) 0.7429 (0.7184, 0.7653) 0.0988 (0.0907, 0.1067) ReXTrust (without self-attention) 0.8683 (0.8555, 0.8808) 0.8186 (0.7958, 0.8431) 0.0646 (0.0582, 0.0707) ReXTrust 0.8751 (0.8623, 0.8880) 0.8310 (0.8097, 0.8529) 0.0637 (0.0571, 0.0701) Table 1: AUROC, AUPRC, and AUGRC performance of ReXTrust on the held-out evaluation set, compared to baseline hallucination detection methods. ReXTrust significantly outperforms other methods across all metrics. Parentheses show the 95% CIs computed using a 1,000-sample bootstrap. Figure 2: ReXTrust AUGRC performance on five medical categories, compared to RadFlag. categories, compared against those of RadFlag (blue). ReX- Trust demonstrates superior selective classification perfor- mance across all categories, as indicated by its lower point estimates. To determine whether ReXTrust’s performance is statistically significantly better than RadFlag’s, we also cal- culate the confidence intervals on the difference between the AUGRC scores of the two models (Table 2). We find that none of the confidence intervals contain zero, and there- fore conclude that the selective classification performance of ReXTrust is statistically significantly better than that of RadFlag at the finding level across all categories. Table 3 evaluates ReXTrust on findings containing com- monly occurring keywords. For each keyword, we show ReX- Trust’s micro-averaged 𝐹1score on clinically relevant find- ings that contain the keyword and on which the model dis- plays high confidence (i.e., the predicted score is either in the top or bottom score quartile for that category). ReXTrust achieves impressive performance on pleural effusion, pneu- monia, and edema findings. However, it performs worse on findings related to pneumothorax, atelectasis, and tubes. WeCategory AUGRC RadFlag−AUGRC ReXTrust Medical Devices 0.0324 (0.0050, 0.0621) Cardiovascular 0.0359 (0.0151, 0.0551) Pulmonary 0.0300 (0.0140, 0.0486) Musculoskeletal 0.0528 (0.0311, 0.0803) Pleural 0.0166 (0.0082, 0.0253) Table 2: Difference in AUGRC between RadFlag and ReX- Trust on five medical categories. ReXTrust demonstrates statistically significantly better selective classification per- formance across all categories. also note that ReXTrust performs reasonably well when de- tecting hallucinations in findings containing positional key- words despite not containing explicit visual modules. Keyword 𝐹1 “pleural effusion” 0.8403 (0.7847, 0.8958) “pneumothorax” 0.6193 (0.5455, 0.6932) “consolidation” 0.7000 (0.5857, 0.8000) “pneumonia” 0.9091 (0.8182, 0.9773) “edema” 0.8857 (0.8000, 0.9571) “atelectasis” 0.6000 (0.4500, 0.7500) “tube” 0.5750 (0.4000, 0.7250) “right” 0.6981 (0.6038, 0.7830) “left” 0.6591 (0.5679, 0.7614) Table 3: Micro-averaged 𝐹1scores of ReXTrust on selected finding keywords. 5 Discussion 5.1 Qualitative Examples Figure 3 presents an analysis of ReXTrust’s performance on four representative studies from the held-out evaluation set. ReXTrust’s self-attention mechanism enables identification Page 6: of specific semantic components that contribute to the overall hallucination risk assessment. In the first case, ReXTrust as- signs high attention weights to “right” and “pneumothorax,” correctly identifying this positive finding as hallucinatory. High attention weights are not necessarily indicative of hallucination risk, however. For instance, the third model- generated finding indicates the lack of an anomaly; ReX- Trust’s attention weights are highest on the “pneumothorax” tokens, but its final prediction indicates that the finding is true. In the fourth finding, it is the converse; ReXTrust fo- cuses on the “consolidation” tokens, but the model falsely predicts that the finding is not hallucinatory. 5.2 Complementarity with RadFlag To investigate whether ReXTrust and RadFlag capture com- plementary aspects of hallucination detection, we evaluate a linear ensemble that combines their predictions with weights 0.8 and 0.2, respectively. The ensemble achieves an AUROC of 0.8822 (95% CI: 0.8699, 0.8947), an AUPRC of 0.8471 (95% CI: 0.8276, 0.8670), and an AUGRC of 0.0594 (95% CI: 0.0532, 0.0656), surpassing the performance of both in- dividual models. The complementary nature of these approaches stems from their fundamentally different detection strategies: ReXTrust analyzes model hidden states to identify potential halluci- nations during the generation process, while RadFlag em- ploys temperature-based sampling to detect inconsistencies in model outputs. Furthermore, the success of the ensem- ble suggests that ReXTrust could be extended to incorporate hidden states from outputs sampled at high temperatures. 5.3 Generalizability Across Architectures While ReXTrust was trained to detect hallucinations in MedVersa-generated reports using MedVersa hidden states, this framework can be readily adapted to other medical LLMs and LVLMs, such as RaDialog (Pellegrini et al. 2023) and LLaV A-Med (Li et al. 2024a), and Maira-2 (Bannur et al. 2024). Although the architecture of ReXTrust would need to be modified to accommodate different hidden state dimen- sions, the core principle of applying a self-attention module over sequences of hidden states is applicable. As such, mod- els like ReXTrust could serve as valuable tools for assess- ing the factual reliability of large medical language models. If hallucinations can be reliably predicted from a model’s hidden states, it suggests that modifications to the training methodology may be necessary to reduce the frequency of hallucinatory content. 5.4 Label Reliability Our study relies on LLM-generated hallucination and sever- ity labels, which inherently contain some degree of noise. Previous work by Zhang et al. (2024) evaluated similar hal- lucination labels through clinical validation, finding high but imperfect agreement between LLM and clinician assess- ments. We perform a similar evaluation of the severity labels in collaboration with an expert clinician. From a sample of 60 findings across 39 studies—with 15 findings from each of the four severity categories defined in Section 3.4—wefind that 8 of the 30 findings categorized into tiers 3 and 4 by the LLM were classified as belonging to tiers 1 and 2 by the clinician. Conversely, none of the 30 findings categorized into tiers 1 and 2 by the LLM were reclassified into tiers 3 and 4 by the clinician. These results suggest that we likely underestimate the number of clinically significant findings. However, given ReXTrust’s consistent performance across finding categories, we expect its performance on the true set of clinically significant findings to align with the global results presented in Table 1. 5.5 Limitations and Future Work We identify three primary limitations of this study. First, ReXTrust depends on supervised labels. Our work lever- ages expert-written radiology reports from MIMIC-CXR to generate binary hallucination labels for training. In scenar- ios where high-quality ground truth reports are unavailable, ReXTrust’s white-box approach may be less suitable than unsupervised alternatives. However, given that ReXTrust achieves high performance even when trained on a small dataset, it may still be useful in label-limited settings. Second, ReXTrust shows suboptimal performance on cer- tain types of radiological findings (see Table 3). While this weakness might be partially mitigated through ensembling with orthogonal approaches, as discussed in Section 5.2, fu- ture work should incorporate more explicit visual grounding tools to verify the factuality of assertions in AI-generated findings (He et al. 2024; Zou et al. 2024; Shaaban, Khan, and Yaqub 2024; Bannur et al. 2024). Third, we note that several benchmarks used for com- parison in Section 4.2 were not originally designed for ra- diology (e.g., UNIHD). Although we implemented domain- appropriate modifications to ensure fair comparison with our approach, we may underestimate their potential performance in the medical domain. Future work should explore exten- sive adaptations of general-purpose hallucination detectors to medical imaging tasks. 6 Conclusion We have presented ReXTrust, a white-box framework for de- tecting hallucinations in AI-generated radiology reports. Our approach demonstrates superior performance compared to existing hallucination detection methods, achieving an AU- ROC of 0.8751 across all findings and 0.8963 on clinically significant findings. ReXTrust’s end-to-end architecture en- ables both granular analysis of hallucination patterns and reliable overall risk assessment. Our results indicate that model hidden states contain valu- able signals about the reliability of generated content that can be effectively leveraged for hallucination detection. The robust performance of ReXTrust on clinically significant findings is particularly noteworthy, as it demonstrates the model’s ability to identify potentially harmful hallucinations that could impact patient care. Our work also identifies im- portant areas for future research, such as improving halluci- nation detection performance on specific classes of findings and developing methods to reduce dependence on super- vised labels. Nevertheless, ReXTrust represents a significant Page 7: Figure 3: Qualitative examples of ReXTrust on four studies from our held-out evaluation set. Shown on the left are examples of chest X-rays serving as inputs to MedVersa. Shown in the middle are findings generated by MedVersa, as well as token-level attention visualizations from ReXTrust. Shown on the right are the finding-level risk scores output by the classification head of ReXTrust. advance in ensuring the reliability of medical report genera- tion models. As medical LVLMs continue to evolve and see increased clinical adoption, approaches like ReXTrust will be essential for maintaining high standards of accuracy and safety in medical AI systems. A Finding Severity Prompt We present the complete prompt template used to categorize AI-generated findings into severity tiers in Figure 4. B Adaptating UNIHD to Radiology Noting that UNIHD is not adapted to radiology images and reports, we apply sensible adaptations in order to provide a fair comparison with ReXTrust. Specifically, we modify the prompts of the individual modules to include radiologi- cal terminology and examples. We also replace the authors’ Grounding Dino object detection module (Liu et al. 2023) with a custom module based on pre-trained models from Cohen et al. (2022).C Ablation Studies C.1 Hidden Layer Index We test the performance of ReXTrust as a function of the hidden layer index of MedVersa. Figure 5 shows the AUROC (blue) and AUPRC (orange) scores of ReXTrust trained on hidden states ℎ𝑙 𝑡as we let𝑙vary from 0 to 32. We find that the performance of ReXTrust saturates near layer 16. References Alnuhait, D.; Kirtane, N.; Khalifa, M.; and Peng, H. 2024. FactCheckmate: Preemptively Detecting and Mitigating Hal- lucinations in LMs. arXiv preprint arXiv:2410.02899 . Azaria, A.; and Mitchell, T. 2023. The internal state of an LLM knows when it’s lying. arXiv preprint arXiv:2304.13734 . Bannur, S.; Bouzid, K.; Castro, D. C.; Schwaighofer, A.; Bond-Taylor, S.; Ilse, M.; P ´erez-Garc ´ıa, F.; Salvatelli, V.; Sharma, H.; Meissen, F.; et al. 2024. MAIRA-2: Page 8: You are an AI radiology assistant helping process reports from chest X-rays. You are given a finding (F) from a chest X-ray report. Your job is to judge and categorize the clinical severity of the finding. For positive findings, here are the definitions for the severity categories you should use: 1. Emergency clinical consequence •Definition: Findings which could lead to outcomes requiring immediate or urgent medical intervention to prevent significant harm, deterioration, or life-threatening complications. This includes scenarios where rapid diagnosis and management are critical to stabilize the patient but excludes conditions that can be managed electively. •Examples: Tension pneumothorax, acute myocardial infarction, Endotracheal tube misplacement compromising airway function. 2. Non-emergency but actionable clinical consequence •Definition: Findings which require medical attention or follow-up but do not necessitate immediate or urgent intervention to prevent significant harm. These findings are important for patient health management and should be addressed through routine clinical care, elective procedures, or scheduled follow-ups. •Examples: Newly diagnosed malignancy without acute complications, chronic infection like tuberculosis, significant but non-acute organ abnormalities such as a thyroid nodule needing further investigation. 3. Clinically insignificant consequence •Definition: Findings that pose no health risk or adverse outcome for the patient and require no clinical intervention or monitoring. These findings are typically benign, normal variations, or imaging artifacts that do not impact patient management or treatment plans. •Examples: Normal anatomical variations (e.g., azygos fissure), incidental imaging artifacts, findings that are clinically irrelevant to patient health and do not necessitate any follow-up. 4. Other •Definition: Any findings that do not fit into the above categories. For negative findings, you should evaluate the severity of the negations of those findings. For example, if you are given the finding, “There is no pneumothorax,” you should evaluate the severity of the negation, “There is pneumothorax,” and return the corresponding severity category (which would be “emergency clinical consequence” in this case). Return a JSON of the format: {"severity category": <category>, "reason": <reason> }, where <category> is one of “emergency clinical consequence”, “non-emergency but actionable clinical consequence”, “clinically insignifi- cant consequence”, or “other”, and <reason> is a string explaining your reasoning for the severity category. Figure 4: Prompt template for radiological finding severity classification. The prompt provides detailed definitions and examples for each severity category to ensure consistent classification across all experiments. Page 9: Figure 5: AUROC and AUPRC performance of ReXTrust as a function of the layer index of the MedVersa hidden states used as inputs. Model performance saturates near layer 16. Error bars indicate 95% confidence intervals. Grounded Radiology Report Generation. arXiv preprint arXiv:2406.04449 . Chen, C.; Liu, K.; Chen, Z.; Gu, Y.; Wu, Y.; Tao, M.; Fu, Z.; and Ye, J. 2024a. INSIDE: LLMs’ Internal States Re- tain the Power of Hallucination Detection. arXiv preprint arXiv:2402.03744 . Chen, J.; Yang, D.; Wu, T.; Jiang, Y.; Hou, X.; Li, M.; Wang, S.; Xiao, D.; Li, K.; and Zhang, L. 2024b. Detect- ing and Evaluating Medical Hallucinations in Large Vision Language Models. arXiv preprint arXiv:2406.10185 . Chen, X.; Wang, C.; Xue, Y.; Zhang, N.; Yang, X.; Li, Q.; Shen, Y.; Liang, L.; Gu, J.; and Chen, H. 2024c. Unified hal- lucination detection for multimodal large language models. arXiv preprint arXiv:2402.03190 . Chen, Z.; Varma, M.; Delbrouck, J.-B.; Paschali, M.; Blanke- meier, L.; Van Veen, D.; Valanarasu, J. M. J.; Youssef, A.; Cohen, J. P.; Reis, E. P.; et al. 2024d. Chexagent: Towards a foundation model for chest x-ray interpretation. arXiv preprint arXiv:2401.12208 . Cohen, J. P.; Viviano, J. D.; Bertin, P.; Morrison, P.; Torabian, P.; Guarrera, M.; Lungren, M. P.; Chaudhari, A.; Brooks, R.; Hashir, M.; and Bertrand, H. 2022. TorchXRayVision: A library of chest X-ray datasets and models. In Medical Imaging with Deep Learning . Fadeeva, E.; Rubashevskii, A.; Shelmanov, A.; Petrakov, S.; Li, H.; Mubarak, H.; Tsymbalov, E.; Kuzmin, G.; Panchenko, A.; Baldwin, T.; et al. 2024. Fact-checking the output of large language models via token-level uncertainty quantification. arXiv preprint arXiv:2403.04696 . Farquhar, S.; Kossen, J.; Kuhn, L.; and Gal, Y. 2024. Detect- ing hallucinations in large language models using semantic entropy. Nature , 630(8017): 625–630. Fei, H.; Luo, M.; Xu, J.; Wu, S.; Ji, W.; Lee, M.-L.; and Hsu, W. 2024. Fine-grained Structural Hallucination Detec- tion for Unified Visual Comprehension and Generation inMultimodal LLM. In Proceedings of the 1st ACM Multime- dia Workshop on Multi-modal Misinformation Governance in the Era of Foundation Models , 13–22. Fomicheva, M.; Sun, S.; Yankovskaya, L.; Blain, F.; Guzm ´an, F.; Fishel, M.; Aletras, N.; Chaudhary, V.; and Specia, L. 2020. Unsupervised quality estimation for neural machine translation. Transactions of the Association for Computa- tional Linguistics , 8: 539–555. Friel, R.; and Sanyal, A. 2023. Chainpoll: A high effi- cacy method for llm hallucination detection. arXiv preprint arXiv:2310.18344 . Gu, T.; Liu, D.; Li, Z.; and Cai, W. 2024. Complex organ mask guided radiology report generation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Com- puter Vision , 7995–8004. Hamamci, I. E.; Er, S.; and Menze, B. 2024. Ct2rep: Auto- mated radiology report generation for 3d medical imaging. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention , 476–486. Springer. He, J.; Li, P.; Liu, G.; and Zhong, S. 2024. Parameter- Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding. arXiv preprint arXiv:2410.23822 . Johnson, A. E.; Pollard, T. J.; Berkowitz, S. J.; Greenbaum, N. R.; Lungren, M. P.; Deng, C.-y.; Mark, R. G.; and Horng, S. 2019. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scien- tific data , 6(1): 317. Kadavath, S.; Conerly, T.; Askell, A.; Henighan, T.; Drain, D.; Perez, E.; Schiefer, N.; Hatfield-Dodds, Z.; DasSarma, N.; Tran-Johnson, E.; et al. 2022. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221 . Kossen, J.; Han, J.; Razzak, M.; Schut, L.; Malik, S.; and Gal, Y. 2024. Semantic entropy probes: Robust and cheap hallu- cination detection in llms. arXiv preprint arXiv:2406.15927 . Kuhn, L.; Gal, Y.; and Farquhar, S. 2023. Seman- tic uncertainty: Linguistic invariances for uncertainty es- timation in natural language generation. arXiv preprint arXiv:2302.09664 . Li, C.; Wong, C.; Zhang, S.; Usuyama, N.; Liu, H.; Yang, J.; Naumann, T.; Poon, H.; and Gao, J. 2024a. Llava-med: Train- ing a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Sys- tems, 36. Li, Q.; Lyu, C.; Geng, J.; Zhu, D.; Panov, M.; and Karray, F. 2024b. Reference-free Hallucination Detection for Large Vision-Language Models. arXiv preprint arXiv:2408.05767 . Lin, S.; Hilton, J.; and Evans, O. 2022. Teaching mod- els to express their uncertainty in words. arXiv preprint arXiv:2205.14334 . Liu, C.; Tian, Y.; Chen, W.; Song, Y.; and Zhang, Y. 2024a. Bootstrapping Large Language Models for Radiology Re- port Generation. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 38, 18635–18643. Liu, L.; Pan, Y.; Li, X.; and Chen, G. 2024b. Uncertainty Es- timation and Quantification for LLMs: A Simple Supervised Approach. arXiv preprint arXiv:2404.15993 . Page 10: Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Li, C.; Yang, J.; Su, H.; Zhu, J.; et al. 2023. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 . Loshchilov, I. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 . Manakul, P.; Liusie, A.; and Gales, M. J. 2023. Self- checkgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896 . Pellegrini, C.; ¨Ozsoy, E.; Busam, B.; Navab, N.; and Keicher, M. 2023. RaDialog: A large vision-language model for radi- ology report generation and conversational assistance. arXiv preprint arXiv:2311.18681 . Shaaban, M. A.; Khan, A.; and Yaqub, M. 2024. Med- PromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis. arXiv preprint arXiv:2403.15585 . Su, W.; Wang, C.; Ai, Q.; Hu, Y.; Wu, Z.; Zhou, Y.; and Liu, Y. 2024. Unsupervised real-time hallucination detection based on the internal states of large language models. arXiv preprint arXiv:2403.06448 . Tanida, T.; M¨ uller, P.; Kaissis, G.; and Rueckert, D. 2023. Interactive and explainable region-guided radiology report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 7433–7442. Traub, J.; Bungert, T. J.; L¨ uth, C. T.; Baumgartner, M.; Maier- Hein, K. H.; Maier-Hein, L.; and Jaeger, P. F. 2024. Overcom- ing common flaws in the evaluation of selective classification systems. arXiv preprint arXiv:2407.01032 . Van der Poel, L.; Cotterell, R.; and Meister, C. 2022. Mutual information alleviates hallucinations in abstractive summa- rization. arXiv preprint arXiv:2210.13210 . Yin, S.; Fu, C.; Zhao, S.; Xu, T.; Wang, H.; Sui, D.; Shen, Y.; Li, K.; Sun, X.; and Chen, E. 2023. Woodpecker: Hal- lucination correction for multimodal large language models. arXiv preprint arXiv:2310.16045 . Zhang, S.; Sambara, S.; ; Banerjee, O.; Acosta, J.; Fahrner, J.; and Rajpurkar, P. 2024. RadFlag: A Black-Box Hallucina- tion Detection Method for Medical Vision Language Models. arXiv preprint arXiv:2411.00299 . Zhou, H.-Y.; Adithan, S.; Acosta, J. N.; Topol, E. J.; and Ra- jpurkar, P. 2024. A Generalist Learner for Multifaceted Med- ical Image Interpretation. arXiv preprint arXiv:2405.07988 . Zou, K.; Bai, Y.; Chen, Z.; Zhou, Y.; Chen, Y.; Ren, K.; Wang, M.; Yuan, X.; Shen, X.; and Fu, H. 2024. MedRG: Medical Report Grounding with Multi-modal Large Language Model. arXiv preprint arXiv:2404.06798 .

---