arxiv

Paper 2412.12042

The Impact of AI Assistance on Radiology Reporting: A Pilot Study Using Simulated AI Draft Reports

Authors: Julián N. Acosta, Siddhant Dogra, Subathra Adithan, Kay Wu, Michael Moritz, Stephen Kwak, Pranav Rajpurkar

Published: 2024-12-16

Abstract:

Radiologists face increasing workload pressures amid growing imaging volumes, creating risks of burnout and delayed reporting times. While artificial intelligence (AI) based automated radiology report generation shows promise for reporting workflow optimization, evidence of its real-world impact on clinical accuracy and efficiency remains limited. This study evaluated the effect of draft reports on radiology reporting workflows by conducting a three reader multi-case study comparing standard versus AI-assisted reporting workflows. In both workflows, radiologists reviewed the cases and modified either a standard template (standard workflow) or an AI-generated draft report (AI-assisted workflow) to create the final report. For controlled evaluation, we used GPT-4 to generate simulated AI drafts and deliberately introduced 1-3 errors in half the cases to mimic real AI system performance. The AI-assisted workflow significantly reduced average reporting time from 573 to 435 seconds (p=0.003), without a statistically significant difference in clinically significant errors between workflows. These findings suggest that AI-generated drafts can meaningfully accelerate radiology reporting while maintaining diagnostic accuracy, offering a practical solution to address mounting workload challenges in clinical practice.

Paper Content:
The Impact of AI Assistance on Radiology Reporting: A Pilot Study Using Simulated AI Draft Reports

Julián N. Acosta (1), Siddhant Dogra (2), Subathra Adithan (3), Kay Wu (4), Michael Moritz (5), Stephen Kwak (6), and Pranav Rajpurkar (1)

(1) Harvard Medical School, Boston, MA, USA
(2) NYU Langone Health, New York, NY, USA
(3) Jawaharlal Institute of Postgraduate Medical Education and Research, Puducherry, India
(4) University of Toronto, Toronto, ON, Canada
(5) Saint Louis University School of Medicine, Saint Louis, MO, USA
(6) Johns Hopkins School of Medicine, Baltimore, MD, USA
1 Introduction

Radiologists face mounting pressure as imaging volumes surge, requiring them to interpret more studies while maintaining accuracy and speed. This strain has led to concerning burnout rates [1] and potential impacts on diagnostic quality. While artificial intelligence (AI) solutions have demonstrated promise in radiology applications like worklist optimization and computer-aided detection, their integration into clinical reporting workflows remains limited.

Automated radiology report generation [4] represents a promising alternative avenue for AI to seamlessly integrate into radiologists’ workflows, providing them with AI-generated draft reports that can serve as a starting point in the reporting process. By reducing the time and effort required to generate reports from scratch, AI-assisted reporting has the potential to significantly improve radiologists’ efficiency and productivity.

Recent studies have focused on evaluating the quality of AI-generated radiology reports using various approaches, such as automated natural language and clinical accuracy metrics [10]. Additionally, some studies have conducted manual evaluations to gauge radiologists’ perceptions of AI-generated reports [6,7,5]. However, the majority of these studies have focused on evaluating the impact of AI assistance on medical image interpretation in experimental settings that differ from radiologists’ day-to-day workflows [9,2]. In real-world clinical workflows, image interpretation and report generation often occur in parallel as the radiologist evaluates each case, highlighting the need for studies that closely mimic real-world reporting workflows to fully understand the impact of AI assistance on radiologists’ performance and experience. Furthermore, most studies on report generation have focused on simpler modalities like chest X-rays [5].
However, the potential benefits and challenges of AI-assisted reporting for more complex modalities remain largely unexplored.

In this study, we evaluated the impact of AI-generated draft reports on chest CT interpretation using a crossover study design where radiologists modified either standard templates or AI-generated drafts to create final clinically accurate reports, closely mirroring clinical practice. We found that AI assistance significantly reduced reporting time while maintaining diagnostic accuracy, paving the way for larger clinical trials to comprehensively assess the impact of AI-assisted reporting in clinical practice.

Corresponding Author: Julián N. Acosta; julian_acosta@hms.harvard.edu
arXiv:2412.12042v1 [cs.HC] 16 Dec 2024

2 Methods

Study Design. We conducted a three-reader, multi-case crossover study using 20 chest CT scans from the CT-RATE dataset [3], with each case evaluated under both standard and AI-assisted conditions across different readers.

Cases. Utilizing the CT-RATE dataset, we randomly selected and curated two groups of 10 chest CT scans each, matched by patient age, sex, and proxies for case complexity, namely the number of findings and the number of impression sentences in the original reports.

Readers. Three radiology-trained readers participated in the study: one board-certified radiologist and two radiology residents. Each reader evaluated all cases, alternating between standard and AI-assisted workflows to ensure balanced exposure to both conditions.

AI Drafts. AI drafts were generated by adapting a standard CT chest negative template and incorporating specific findings from the original reports using GPT-4. For half the cases, we deliberately introduced 1-3 errors using GPT-4 to simulate common error patterns observed in AI-generated radiology reports, such as false positives and false negatives [8]. This simulation approach gave precise control over error rates, which enabled subgroup analysis by number of errors.
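The reader-by-case allocation described under Study Design and Readers can be sketched as follows. This is a minimal illustration, not the authors' procedure: the paper does not state the exact allocation rule, so the per-reader flip used here, along with the names `READERS`, `GROUP_A`, `GROUP_B`, and `assign_workflows`, are assumptions introduced for the example.

```python
from collections import Counter

# Placeholder identifiers; the study's real reader and case IDs are not given.
READERS = ["reader_1", "reader_2", "reader_3"]
GROUP_A = [f"case_{i:02d}" for i in range(1, 11)]   # first matched group of 10
GROUP_B = [f"case_{i:02d}" for i in range(11, 21)]  # second matched group of 10


def assign_workflows(readers, group_a, group_b):
    """Map (reader, case) to a workflow, flipping the AI-assisted group per reader."""
    schedule = {}
    for idx, reader in enumerate(readers):
        # Hypothetical rule: even-indexed readers get AI drafts for group A,
        # odd-indexed readers get AI drafts for group B.
        ai_group = group_a if idx % 2 == 0 else group_b
        for case in group_a + group_b:
            workflow = "ai_assisted" if case in ai_group else "standard"
            schedule[(reader, case)] = workflow
    return schedule


schedule = assign_workflows(READERS, GROUP_A, GROUP_B)
totals = Counter(schedule.values())
print(totals)
```

Under this assumed rule every reader sees 10 cases per workflow and every case is read under both conditions by different readers, giving 30 planned reads per workflow.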
Reading Workflow. Readers evaluated cases in both standard and AI-assisted workflows. In the AI-assisted workflow, readers were provided with pre-generated draft reports where any content indicating abnormal findings was automatically highlighted, regardless of the finding’s accuracy, while in the standard workflow, they used normal negative templates (Figure 1), which is standard in radiology practice. For both workflows, readers were instructed to review the imaging findings and modify the provided text as needed to ensure clinical accuracy, following standard radiology practice. The crossover design controlled for potential order effects and individual reader variability while ensuring balanced evaluation across conditions.

Figure 1 | Study overview. (A) Reader assignment and crossover between AI-assisted and unassisted workflows. (B) Unassisted workflow using standard template. (C) AI-assisted workflow with pre-generated drafts.

Figure 2 | Reporting platform. Top: normal negative template case. Bottom: AI-drafted case.

Platform Implementation. We used a custom-built platform integrating a Python Flask backend with a PostgreSQL database hosted on Google Cloud SQL and a JavaScript/CSS frontend. The platform features a login and signup page, a worklist page listing assigned studies, and a report editing page where participants can view patient data, available clinical information, and a report template or draft report (Figure 2). The platform highlights insertions into the template (positive findings) for easy review and records timestamps when a report is accessed and when it is signed. The web application incorporates the Open Health Imaging Foundation (OHIF) DICOM viewer, which opens automatically when accessing a case, with images stored and loaded via a Google Cloud Healthcare DICOM store.
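The access and signing timestamps recorded by the platform are what make a reporting-time measurement possible. A minimal sketch of that calculation is shown below; the function name `reporting_time_seconds` and the example timestamps are hypothetical and are not taken from the study's code or data.

```python
from datetime import datetime, timezone


def reporting_time_seconds(opened_at: datetime, signed_at: datetime) -> float:
    """Seconds elapsed between opening a case and signing its report."""
    if signed_at < opened_at:
        raise ValueError("signed_at precedes opened_at")
    return (signed_at - opened_at).total_seconds()


# Illustrative timestamps only; they are not taken from the study data.
opened = datetime(2024, 1, 1, 9, 0, 0, tzinfo=timezone.utc)
signed = datetime(2024, 1, 1, 9, 7, 15, tzinfo=timezone.utc)
elapsed = reporting_time_seconds(opened, signed)
print(elapsed)  # 435.0
```

Using timezone-aware timestamps avoids ambiguity if the server and database are configured for different zones; whether the actual platform does this is not stated.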
Endpoints. We collected two primary outcome measures: reporting time (measured from case opening to report signing) and error assessment (conducted by an independent, experienced radiologist who reviewed all final signed reports for clinically significant errors).

User Experience Assessment. Following completion of all cases, readers completed a post-experiment survey assessing their experience with AI-assisted reporting. The survey included Likert-scale questions evaluating ease of use, workflow integration, mental effort requirements, and likelihood to recommend the system to colleagues (scored on a 1-10 scale).

Statistical Analysis. We employed mixed-effects models to analyze both primary outcomes. For error count analysis, we fitted a Poisson mixed-effects model to account for the count nature of the data. Reporting times were analyzed using linear mixed-effects models after log transformation to meet normality assumptions. Both models incorporated fixed effects for patient demographics (age and gender) and case complexity (number of findings), with random intercepts for readers and cases to account for repeated measurements and case-specific variation. Analyses were performed in R (v4.4.2) using the lme4 package.

3 Results

Study Cases. The study included 20 cases (50% female, median age 60.0 years, IQR 33-68) and yielded a final total of 59 radiologist-signed reports after one report was excluded due to a data recording error.

Clinical Accuracy. As shown in Table 1, the AI-assisted workflow demonstrated a slightly lower mean number of clinically significant errors (0.27±0.52) compared to the standard workflow (0.38±0.78), though this difference did not reach statistical significance in our mixed model analysis.

Reporting Time. The AI-assisted workflow significantly reduced median reporting time from 573 seconds (IQR 403-895) to 435 seconds (IQR 298-716) (p=0.003), representing a 24% improvement in efficiency.
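For readers who prefer notation, the two mixed-effects models described under Statistical Analysis can be written roughly as below. The symbols are assumptions introduced here and are not the authors' notation: $E_{rc}$ and $T_{rc}$ are the error count and reporting time for reader $r$ on case $c$, $\mathrm{AI}_{rc}$ is a workflow indicator (implied by the comparison, though not listed explicitly among the covariates), and $u_r$, $v_c$ are the random intercepts for readers and cases.

```latex
\begin{align}
E_{rc} \mid u_r, v_c &\sim \mathrm{Poisson}(\lambda_{rc}), \\
\log \lambda_{rc} &= \beta_0 + \beta_1\,\mathrm{AI}_{rc} + \beta_2\,\mathrm{age}_c
  + \beta_3\,\mathrm{gender}_c + \beta_4\,\mathrm{findings}_c + u_r + v_c, \\
\log T_{rc} &= \gamma_0 + \gamma_1\,\mathrm{AI}_{rc} + \gamma_2\,\mathrm{age}_c
  + \gamma_3\,\mathrm{gender}_c + \gamma_4\,\mathrm{findings}_c + u'_r + v'_c + \varepsilon_{rc}, \\
u_r, u'_r &\sim \mathcal{N}(0, \sigma^2_{\text{reader}}), \quad
v_c, v'_c \sim \mathcal{N}(0, \sigma^2_{\text{case}}), \quad
\varepsilon_{rc} \sim \mathcal{N}(0, \sigma^2).
\end{align}
```

Because the time model is fitted on the log scale, $\exp(\gamma_1)$ in a model of this form would be read as a multiplicative ratio of reporting times between workflows rather than a difference in seconds. The variance components are shown with shared symbols for brevity; each model estimates its own.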
Specifically, Reader 1 and Reader 2 demonstrated reduced mean reporting times with AI assistance (717 to 398 seconds and 361 to 322 seconds, respectively), while Reader 3 showed an increase (947 to 1015 seconds). However, all three readers achieved reduced median reporting times (678 to 356 seconds, 354 to 312 seconds, and 904 to 879 seconds). This trend highlights the variability in individual responses but suggests an overall positive impact of AI on workflow efficiency (Figure 3).

Figure 3 | Differences in reporting times using AI-drafts by reader.

Subgroup Analysis. In exploratory analyses, we further subdivided the AI-assisted workflow into cases with and without intentionally introduced errors. Subgroup analyses comparing each AI-assisted workflow to standard reporting showed no statistically significant differences, though these analyses were limited by smaller sample sizes in the subgroups.

User Experience. Post-experiment survey results revealed unanimous positive feedback regarding system usability, with all 3 readers either agreeing or strongly agreeing that the AI-assisted reporting system was easy to use and would be well-integrated into their workflow. Regarding cognitive load, 2 of 3 readers reported that AI-assisted reporting required somewhat less mental effort compared to standard template-based reporting, while 1 of 3 indicated significantly reduced mental effort. However, when asked about likelihood to recommend the system to colleagues, responses showed some variation, with scores of 5, 9, and 10 on a 10-point scale.

4 Discussion

AI assistance improves reporting efficiency. Our pilot study suggests that AI-generated draft reports may improve reporting efficiency without compromising diagnostic accuracy, even in the presence of AI errors. The observed 24% reduction in median reporting time suggests potential for workflow optimization in clinical practice.
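The 24% figure and the per-reader pattern can be checked directly from the summary statistics quoted in the Results. The sketch below uses only those published numbers; the helper name `percent_change` and the dictionary layout are introduced here for illustration.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent (negative = faster)."""
    return (after - before) / before * 100.0


# Overall median reporting time in seconds (standard -> AI-assisted).
overall = percent_change(573, 435)

# Per-reader (standard, AI-assisted) values in seconds, as quoted in the Results.
reader_medians = {"Reader 1": (678, 356), "Reader 2": (354, 312), "Reader 3": (904, 879)}
reader_means = {"Reader 1": (717, 398), "Reader 2": (361, 322), "Reader 3": (947, 1015)}

median_changes = {r: percent_change(*v) for r, v in reader_medians.items()}
mean_changes = {r: percent_change(*v) for r, v in reader_means.items()}

print(f"overall median change: {overall:.1f}%")  # about -24.1%
for reader in reader_medians:
    print(f"{reader}: median {median_changes[reader]:+.1f}%, mean {mean_changes[reader]:+.1f}%")
```

The output makes the reader heterogeneity concrete: all three median changes are negative, while Reader 3's mean change is positive, which is the pattern the text describes.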
If replicated in larger studies, this efficiency improvement could substantially reduce radiologists’ workload during clinical shifts, allowing more time for complex cases or reducing overall work pressure.

Table 1 | Clinically significant errors in preliminary study.

Clinically significant errors    AI-draft (n=30)    Normal template (n=29)
Mean (SD)                        0.27 (0.52)        0.38 (0.78)
Median (IQR)                     0 (0-0)            0 (0-0)

Clinical accuracy remains consistent with AI assistance. The observation that accuracy remained stable despite the presence of intentional errors in some AI drafts is promising. Our exploratory analyses, though limited by sample size, showed no significant differences in error rates between standard workflow and either type of AI assistance (with or without introduced errors). While encouraging, these findings need validation in larger studies before drawing definitive conclusions about radiologists’ ability to maintain vigilance when working with AI-generated content.

Variability across readers. Individual variability in time savings among our three readers highlights the importance of understanding factors that may influence AI assistance effectiveness, such as experience level, comfort with technology, and personal workflow preferences. Although previous studies have shown remarkable heterogeneity in features influencing AI assistance effects [9], larger studies are needed to systematically investigate these individual factors and their impact on AI assistance effectiveness.

User perceptions and adoption considerations. The unanimous positive feedback regarding system usability and workflow integration from our readers is encouraging, suggesting that AI-generated preliminary reports could be a promising avenue to explore in clinical practice. The consistent reporting of reduced mental effort aligns with our efficiency findings and suggests AI assistance may help address cognitive burden.
However, the variability in readers’ willingness to recommend a system like this to colleagues reveals an important disconnect between personal usability experiences and broader implementation concerns. This disparity might reflect deeper uncertainties about AI’s role in radiology, such as concerns about over-reliance or impact on training, which should be explicitly addressed in future studies.

Limitations. Our study has several important limitations. With only three readers, our findings may not generalize to the broader radiologist population. The small sample size limited our statistical power, particularly in the exploratory subgroup analyses. The use of simulated AI drafts rather than outputs from actual AI report generation models may not fully reflect real-world performance. Additionally, the artificial setting of a controlled study may not fully reflect real-world clinical practice conditions.

Future Directions. While our pilot results are encouraging, the next crucial step is conducting a large-scale clinical trial involving multiple readers and cases, using real AI-generated draft reports rather than simulated AI drafts. Such a trial should evaluate not only efficiency and accuracy but also radiologist satisfaction, confidence, and mental effort levels, as well as the impact of different error types and frequencies in AI drafts.

Conclusion. Our findings suggest that AI assistance in radiology reporting may offer meaningful efficiency gains without compromising diagnostic accuracy, even in the presence of AI errors. However, the observed individual variability and study limitations emphasize the need for larger-scale validation before widespread clinical implementation.

Disclosures

JNA, SD, MM, and PR are part-time employees of a2z Radiology AI. PR is a co-founder of a2z Radiology AI.

References

[1] Christopher R Bailey, Allison M Bailey, Anna Sophia McKenney, and Clifford R Weiss.
Understanding and appreciating burnout in radiologists. Radiographics, 42(5):E137–E139, July 2022.
[2] Souhail Bennani, Nor-Eddine Regnard, Jeanne Ventre, Louis Lassalle, Toan Nguyen, Alexis Ducarouge, Lucas Dargent, Enora Guillo, Elodie Gouhier, Sophie-Hélène Zaimi, Emma Canniff, Cécile Malandrin, Philippe Khafagy, Hasmik Koulakian, Marie-Pierre Revel, and Guillaume Chassagnon. Using AI to improve radiologist performance in detection of abnormalities on chest radiographs. Radiology, 309(3):e230860, Dec. 2023.
[3] Ibrahim Ethem Hamamci, Sezgin Er, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Dogan, Muhammed Furkan Dasdelen, Bastian Wittmann, Enis Simsar, Mehmet Simsar, Emine Bensu Erdemir, Abdullah Alanbay, Anjany Sekuboyina, Berkan Lafci, Mehmet K Ozdemir, and Bjoern Menze. A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities. arXiv [cs.CV], Mar. 2024.
[4] Yuxiang Liao, Hantao Liu, and Irena Spasić. Deep learning approaches to automatic radiology report generation: A systematic review. Informatics in Medicine Unlocked, 39:101273, Jan. 2023.
[5] Ryutaro Tanno, David G T Barrett, Andrew Sellergren, Sumedh Ghaisas, Sumanth Dathathri, Abigail See, Johannes Welbl, Charles Lau, Tao Tu, Shekoofeh Azizi, Karan Singhal, Mike Schaekermann, Rhys May, Roy Lee, Siwai Man, Sara Mahdavi, Zahra Ahmed, Yossi Matias, Joelle Barral, S M Ali Eslami, Danielle Belgrave, Yun Liu, Sreenivasa Raju Kalidindi, Shravya Shetty, Vivek Natarajan, Pushmeet Kohli, Po-Sen Huang, Alan Karthikesalingam, and Ira Ktena. Collaboration between clinicians and vision-language models in radiology report generation. Nat. Med., pages 1–10, Nov. 2024.
[6] Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Chuck Lau, Ryutaro Tanno, Ira Ktena, Basil Mustafa, Aakanksha Chowdhery, Yun Liu, Simon Kornblith, David Fleet, Philip Mansfield, Sushant Prakash, Renee Wong, Sunny Virmani, Christopher Semturs, S Sara Mahdavi, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Joelle Barral, Dale Webster, Greg S Corrado, Yossi Matias, Karan Singhal, Pete Florence, Alan Karthikesalingam, and Vivek Natarajan. Towards generalist biomedical AI. arXiv [cs.CL], July 2023.
[7] Lin Yang, Shawn Xu, Andrew Sellergren, Timo Kohlberger, Yuchen Zhou, Ira Ktena, Atilla Kiraly, Faruk Ahmed, Farhad Hormozdiari, Tiam Jaroensri, Eric Wang, Ellery Wulczyn, Fayaz Jamil, Theo Guidroz, Chuck Lau, Siyuan Qiao, Yun Liu, Akshay Goel, Kendall Park, Arnav Agharwal, Nick George, Yang Wang, Ryutaro Tanno, David G T Barrett, Wei-Hung Weng, S Sara Mahdavi, Khaled Saab, Tao Tu, Sreenivasa Raju Kalidindi, Mozziyar Etemadi, Jorge Cuadros, Gregory Sorensen, Yossi Matias, Katherine Chou, Greg Corrado, Joelle Barral, Shravya Shetty, David Fleet, S M Ali Eslami, Daniel Tse, Shruthi Prabhakara, Cory McLean, Dave Steiner, Rory Pilgrim, Christopher Kelly, Shekoofeh Azizi, and Daniel Golden. Advancing multimodal medical capabilities of Gemini, 2024.
[8] Feiyang Yu, Mark Endo, Rayan Krishnan, Ian Pan, Andy Tsai, Eduardo Pontes Reis, Eduardo Kaiser Ururahy Nunes Fonseca, Henrique Min Ho Lee, Zahra Shakeri Hossein Abad, Andrew Y Ng, Curtis P Langlotz, Vasantha Kumar Venugopal, and Pranav Rajpurkar. Evaluating progress in automatic chest X-ray radiology report generation. Patterns (N Y), 4(9):100802, Sept. 2023.
[9] Feiyang Yu, Alex Moehring, Oishi Banerjee, Tobias Salz, Nikhil Agarwal, and Pranav Rajpurkar. Heterogeneity and predictors of the effects of AI assistance on radiologists. Nat. Med., 30(3):837–849, Mar. 2024.
[10] Brian Nlong Zhao, Xinyang Jiang, Xufang Luo, Yifan Yang, Bo Li, Zilong Wang, Javier Alvarez-Valle, Matthew P Lungren, Dongsheng Li, and Lili Qiu. Large multimodal model for real-world radiology report generation. Oct. 2023.

---