Authors: Julián N. Acosta, Siddhant Dogra, Subathra Adithan, Kay Wu, Michael Moritz, Stephen Kwak, Pranav Rajpurkar
The Impact of AI Assistance on Radiology Reporting: A Pilot
Study Using Simulated AI Draft Reports
Julián N. Acosta1, Siddhant Dogra2, Subathra Adithan3, Kay Wu4, Michael Moritz5, Stephen Kwak6
and Pranav Rajpurkar1
1Harvard Medical School, Boston, MA, USA
2NYU Langone Health, New York, NY, USA
3Jawaharlal Institute of Postgraduate Medical Education and Research, Puducherry, India
4University of Toronto, Toronto, ON, Canada
5Saint Louis University School of Medicine, Saint Louis, MO, USA
6Johns Hopkins School of Medicine, Baltimore, MD, USA
Radiologists face increasing workload pressures amid growing imaging volumes, creating risks of burnout
and delayed reporting times. While artificial intelligence (AI) based automated radiology report generation
shows promise for reporting workflow optimization, evidence of its real-world impact on clinical accuracy
and efficiency remains limited. This study evaluated the effect of draft reports on radiology reporting
workflows by conducting a three-reader, multi-case study comparing standard versus AI-assisted reporting
workflows. In both workflows, radiologists reviewed the cases and modified either a standard template
(standard workflow) or an AI-generated draft report (AI-assisted workflow) to create the final report. For
controlled evaluation, we used GPT-4 to generate simulated AI drafts and deliberately introduced 1-3 errors
in half the cases to mimic real AI system performance. The AI-assisted workflow significantly reduced
median reporting time from 573 to 435 seconds (p=0.003), without a statistically significant difference
in clinically significant errors between workflows. These findings suggest that AI-generated drafts can
meaningfully accelerate radiology reporting while maintaining diagnostic accuracy, offering a practical
solution to address mounting workload challenges in clinical practice.
1 Introduction
Radiologists face mounting pressure as imaging vol-
umes surge, requiring them to interpret more studies
while maintaining accuracy and speed. This strain
has led to concerning burnout rates [1] and potential
impacts on diagnostic quality. While artificial intel-
ligence (AI) solutions have demonstrated promise
in radiology applications like worklist optimization
and computer-aided detection, their integration into
clinical reporting workflows remains limited.
Automated radiology report generation [4] represents
a promising alternative avenue for AI to seamlessly
integrate into radiologists’ workflows, providing them
with AI-generated draft reports that can serve as a
starting point in the reporting process. By reduc-
ing the time and effort required to generate reports
from scratch, AI-assisted reporting has the potential
to significantly improve radiologists’ efficiency and
productivity.
Recent studies have focused on evaluating the quality
of AI-generated radiology reports using various
approaches, such as automated natural language
and clinical accuracy metrics [10]. Additionally,
some studies have conducted manual evaluations
to gauge radiologists’ perceptions of AI-generated
reports [6,7,5]. However, the majority of these
studies have focused on evaluating the impact of AI
assistance on medical image interpretation in experimental
settings that differ from radiologists’ day-to-day
workflows [9,2]. In real-world clinical workflows,
image interpretation and report generation often oc-
cur in parallel as the radiologist evaluates each case,
highlighting the need for studies that closely mimic
real-world reporting workflows to fully understand
the impact of AI assistance on radiologists’ perfor-
mance and experience.
Furthermore, most studies on report generation have
focused on simpler modalities like chest X-rays [5].
However, the potential benefits and challenges of
AI-assisted reporting for more complex modalities
remain largely unexplored.
In this study, we evaluated the impact of AI-generated
draft reports on chest CT interpretation
using a crossover study design in which radiologists
modified either standard templates or AI-generated
drafts to create final clinically accurate reports,
closely mirroring clinical practice. We found that
AI assistance significantly reduced reporting
time while maintaining diagnostic accuracy,
paving the way for larger clinical trials to comprehensively
assess the impact of AI-assisted reporting
in clinical practice.
Corresponding Author: Julián N. Acosta; julian_acosta@hms.harvard.edu
arXiv:2412.12042v1 [cs.HC] 16 Dec 2024
2 Methods
Study Design We conducted a three-reader,
multi-case crossover study using 20 chest CT scans
from the CT-RATE dataset [3], with each case
evaluated under both standard and AI-assisted
conditions across different readers.
Cases Utilizing the CT-RATE dataset, we
randomly selected and curated two groups of 10
chest CT scans each, matched by patient age, sex,
and proxies for case complexity, namely the number
of findings and the number of impression sentences
in the original reports.
Readers Three radiology-trained readers par-
ticipated in the study: one board-certified
radiologist and two radiology residents. Each reader
evaluated all cases, alternating between standard
and AI-assisted workflows to ensure balanced
exposure to both conditions.
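The balanced exposure described above can be illustrated with a small scheduling sketch. The exact assignment scheme is not given in the text, so the group-level counterbalancing below, the reader and case identifiers, and the function name `assign_workflows` are all assumptions made for illustration.

```python
from itertools import cycle


def assign_workflows(readers, group_a, group_b):
    """Return {reader: {case_id: workflow}} with counterbalanced case groups.

    Hypothetical scheme: successive readers alternate which case group
    they read with AI assistance, so every reader sees every case once
    and reads half of the cases in each workflow.
    """
    first_arm = cycle(["ai_assisted", "standard"])
    schedule = {}
    for reader in readers:
        arm_a = next(first_arm)
        arm_b = "standard" if arm_a == "ai_assisted" else "ai_assisted"
        plan = {case: arm_a for case in group_a}
        plan.update({case: arm_b for case in group_b})
        schedule[reader] = plan
    return schedule


# Illustrative identifiers only: three readers, two matched groups of 10 cases.
readers = ["reader_1", "reader_2", "reader_3"]
group_a = [f"A{i:02d}" for i in range(1, 11)]
group_b = [f"B{i:02d}" for i in range(1, 11)]
schedule = assign_workflows(readers, group_a, group_b)
```

Under this assumed scheme each reader contributes 10 AI-assisted and 10 standard reads, giving 30 planned reports per workflow across the three readers.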
AI Drafts AI drafts were generated by adapting
a standard CT chest negative template and incor-
porating specific findings from the original reports
using GPT-4. For half the cases, we deliberately
introduced 1-3 errors using GPT-4 to simulate
common error patterns observed in AI-generated
radiology reports, such as false positives and false
negatives [8]. This simulation approach gave
precise control over error rates and enabled subgroup
analysis by number of errors.
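As a rough illustration of the controlled error allocation, the sketch below picks half of the cases to receive errors and draws an error count between 1 and 3 for each. How cases and counts were actually chosen is not stated, so the random selection, the seed, and the name `plan_error_injection` are assumptions; the GPT-4 step that writes the errors into the draft is not shown.

```python
import random


def plan_error_injection(case_ids, seed=0):
    """Map each case to the number of errors to introduce (0 for clean drafts).

    Hypothetical allocation: half of the cases are sampled at random and
    each sampled case gets between 1 and 3 errors.
    """
    rng = random.Random(seed)
    flawed = set(rng.sample(case_ids, k=len(case_ids) // 2))
    return {case: (rng.randint(1, 3) if case in flawed else 0) for case in case_ids}


# Illustrative case identifiers only.
plan = plan_error_injection([f"case_{i:02d}" for i in range(20)], seed=42)
```

Keeping the plan as explicit data makes it straightforward to group the final reports by number of introduced errors later on.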
Reading Workflow Readers evaluated cases
in both standard and AI-assisted workflows. In
the AI-assisted workflow, readers were provided
with pre-generated draft reports where any content
indicating abnormal findings was automatically
highlighted, regardless of the finding’s accuracy,
while in the standard workflow, they used normal
negative templates (Figure 1), which is standard
in radiology practice. For both workflows, readers
were instructed to review the imaging findings
and modify the provided text as needed to ensure
clinical accuracy, following standard radiology
practice. The crossover design controlled for potential
order effects and individual reader variability
while ensuring balanced evaluation across conditions.
Figure 1 | Study overview. (A) Reader assignment and
crossover between AI-assisted and unassisted workflows. (B)
Unassisted workflow using standard template. (C) AI-assisted
workflow with pre-generated drafts.
Figure 2 | Reporting platform. Top: normal negative template
case. Bottom: AI-drafted case.
Platform Implementation We used a custom-built
platform integrating a Python Flask backend
with a PostgreSQL database hosted on Google
Cloud SQL and a JavaScript/CSS frontend. The
platform features a login and signup page, a worklist
page listing assigned studies, and a report editing
page where participants can view patient data,
available clinical information, and a report template
or draft report (Figure 2). The platform highlights
insertions into the template (positive findings)
for easy review and records timestamps at the
time of accessing and signing a report. The web application
incorporates the Open Health Imaging Foundation
(OHIF) DICOM viewer, which opens automatically
when accessing a case, with images stored and
loaded via Google Cloud Health DICOMSTORE.
Endpoints We collected two primary outcome
measures: reporting time (measured from case
opening to report signing) and error assessment
(conducted by an independent, experienced radi-
ologist who reviewed all final signed reports for
clinically significant errors).
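The reporting-time endpoint reduces to the difference between two recorded timestamps. A minimal sketch follows, assuming the platform stores ISO 8601 timestamps; the storage format, the function name `reporting_time_seconds`, and the example values are hypothetical.

```python
from datetime import datetime


def reporting_time_seconds(opened_at: str, signed_at: str) -> float:
    """Seconds elapsed between opening a case and signing its report."""
    opened = datetime.fromisoformat(opened_at)
    signed = datetime.fromisoformat(signed_at)
    if signed < opened:
        # A negative duration would indicate a data recording problem.
        raise ValueError("report signed before case was opened")
    return (signed - opened).total_seconds()


# Made-up timestamps for illustration.
elapsed = reporting_time_seconds("2024-01-01T09:00:00", "2024-01-01T09:07:15")
```

Rejecting negative durations is one simple way to surface recording errors before analysis.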
User Experience Assessment Following
completion of all cases, readers completed a
post-experiment survey assessing their experience
with AI-assisted reporting. The survey included
Likert-scale questions evaluating ease of use, work-
flow integration, mental effort requirements, and
likelihood to recommend the system to colleagues
(scored on a 1-10 scale).
Statistical Analysis We employed mixed-
effects models to analyze both primary outcomes.
For error count analysis, we fitted a Poisson
mixed-effects model to account for the count nature
of the data. Reporting times were analyzed using
linear mixed-effects models after log transformation
to meet normality assumptions. Both models
incorporated fixed effects for patient demographics
(age and gender) and case complexity (number of
findings), with random intercepts for readers and
cases to account for repeated measurements and
case-specific variation. Analyses were performed in
R (v4.4.2) using the lme4 package.
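One way to write the two models described above is given below. The notation is ours rather than the paper's: $r$ indexes readers, $c$ indexes cases, $T_{rc}$ is the reporting time, $E_{rc}$ is the count of clinically significant errors, and $\mathrm{AI}_{rc}$ equals 1 for the AI-assisted workflow and 0 otherwise. The natural logarithm and the normality of the random intercepts are assumed, as is conventional for such models.

```latex
\begin{align}
\log T_{rc} &= \beta_0 + \beta_1\,\mathrm{AI}_{rc} + \beta_2\,\mathrm{age}_c
  + \beta_3\,\mathrm{sex}_c + \beta_4\,\mathrm{findings}_c
  + u_r + v_c + \varepsilon_{rc}, \\
\log \mathbb{E}\left[E_{rc} \mid u'_r, v'_c\right] &= \gamma_0 + \gamma_1\,\mathrm{AI}_{rc}
  + \gamma_2\,\mathrm{age}_c + \gamma_3\,\mathrm{sex}_c + \gamma_4\,\mathrm{findings}_c
  + u'_r + v'_c,
\end{align}
\qquad
u_r \sim \mathcal{N}(0, \sigma_u^2),\quad
v_c \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{rc} \sim \mathcal{N}(0, \sigma^2).
```

Here $u_r$ and $v_c$ (and their primed counterparts in the Poisson model) are the random intercepts for readers and cases, and $\beta_1$ and $\gamma_1$ are the workflow effects of interest. On the log scale, $\exp(\beta_1)$ can be read as the multiplicative change in reporting time associated with AI assistance.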
3 Results
Study Cases The study included 20 cases (50%
female, median age 60.0 years, IQR 33-68), yielding a
final total of 59 radiologist-signed reports after one
report was excluded due to a data recording error.
Clinical Accuracy As shown in Table 1,
the AI-assisted workflow demonstrated a slightly
lower mean number of clinically significant errors
(0.27±0.52) compared to the standard workflow
(0.38±0.78), though this difference did not reach
statistical significance in our mixed model analysis.
Reporting Time The AI-assisted workflow
significantly reduced median reporting time from
573 seconds (IQR 403-895) to 435 seconds (IQR
298-716) (p=0.003), representing a 24% reduction
in median reporting time. Specifically, Reader 1 and Reader 2
demonstrated reduced mean reporting times with
AI assistance (717 to 398 seconds and 361 to 322
seconds, respectively), while Reader 3 showed an
increase (947 to 1015 seconds). However, all three
readers achieved reduced median reporting times
(678 to 356 seconds, 354 to 312 seconds, and 904 to
879 seconds). This trend highlights the variability in
individual responses but suggests an overall positive
impact of AI on workflow efficiency (Figure 3).
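The relative changes behind these figures can be recomputed directly from the reported values. The short sketch below does so; the helper name `percent_reduction` and the dictionary layout are ours, while the numbers are the overall medians and the per-reader means and medians quoted above.

```python
def percent_reduction(standard: float, ai_assisted: float) -> float:
    """Relative reduction in reporting time, in percent (negative = slower)."""
    return 100.0 * (standard - ai_assisted) / standard


# Overall median reporting times in seconds (standard, AI-assisted).
overall = percent_reduction(573, 435)

# Per-reader values in seconds (standard, AI-assisted), as reported.
mean_times = {"Reader 1": (717, 398), "Reader 2": (361, 322), "Reader 3": (947, 1015)}
median_times = {"Reader 1": (678, 356), "Reader 2": (354, 312), "Reader 3": (904, 879)}

mean_change = {r: percent_reduction(*t) for r, t in mean_times.items()}
median_change = {r: percent_reduction(*t) for r, t in median_times.items()}

for reader in mean_times:
    print(f"{reader}: mean {mean_change[reader]:+.1f}%, median {median_change[reader]:+.1f}%")
```

The overall figure works out to roughly 24%, while Reader 3's mean time increases by about 7% even though that reader's median falls slightly, which illustrates how a few long cases can move the mean and median in opposite directions.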
Subgroup Analysis In exploratory analyses,
we further subdivided the AI-assisted workflow into
cases with and without intentionally introduced errors.
Subgroup analyses comparing each AI-assisted
workflow to standard reporting showed no statistically
significant differences, though these analyses
were limited by smaller sample sizes in the subgroups.
Figure 3 | Differences in reporting times using AI drafts by
reader.
User Experience Post-experiment survey re-
sults revealed unanimous positive feedback regarding
system usability, with all 3 readers either agreeing
or strongly agreeing that the AI-assisted reporting
system was easy to use and would be well-integrated
into their workflow. Regarding cognitive load, 2 of 3
readers reported that AI-assisted reporting required
somewhat less mental effort compared to standard
template-based reporting, while 1 of 3 indicated
significantly reduced mental effort. However, when
asked about likelihood to recommend the system to
colleagues, responses showed some variation, with
scores of 5, 9, and 10 on a 10-point scale.
4 Discussion
AI assistance improves reporting efficiency
Our pilot study suggests that AI-generated draft
reports may improve reporting efficiency without
compromising diagnostic accuracy, even in the
presence of AI errors. The observed 24% reduction
in median reporting time suggests potential for workflow
optimization in clinical practice. If replicated
in larger studies, this efficiency improvement could
substantially reduce radiologists’ workload during
clinical shifts, allowing more time for complex cases
or reducing overall work pressure.
Table 1 | Clinically significant errors in preliminary study.
Clinically significant errors   AI draft (n=30)   Normal template (n=29)
Mean (SD)                       0.27 (0.52)       0.38 (0.78)
Median (IQR)                    0 (0-0)           0 (0-0)
Clinical accuracy remains consistent with AI
assistance The observation that accuracy remained
stable despite the presence of intentional errors
in some AI drafts is promising. Our exploratory
analyses, though limited by sample size, showed no
significant differences in error rates between standard
workflow and either type of AI assistance (with
or without introduced errors). While encouraging,
these findings need validation in larger studies before
drawing definitive conclusions about radiologists’
ability to maintain vigilance when working with
AI-generated content.
Variability across readers Individual vari-
ability in time savings among our three readers
highlights the importance of understanding factors
that may influence AI assistance effectiveness, such
as experience level, comfort with technology, and
personal workflow preferences. Although previous
studies have shown remarkable heterogeneity in
features influencing AI assistance effects [9], larger
studies are needed to systematically investigate
these individual factors and their impact on AI
assistance effectiveness.
User perceptions and adoption considerations The
unanimous positive feedback regarding
system usability and workflow integration from our
readers is encouraging, suggesting that AI-generated
preliminary reports could be a promising avenue to
explore in clinical practice. The consistent reporting
of reduced mental effort aligns with our efficiency
findings and suggests AI assistance may help
address cognitive burden. However, the variability
in readers’ willingness to recommend a system like
this to colleagues reveals an important disconnect
between personal usability experiences and broader
implementation concerns. This disparity might
reflect deeper uncertainties about AI’s role in
radiology, such as concerns about over-reliance
or impact on training, which should be explicitly
addressed in future studies.
Limitations Our study has several impor-
tant limitations. With only three readers, our
findings may not generalize to the broader
radiologist population. The small sample size
limited our statistical power, particularly in the
exploratory subgroup analyses. The use of simulated
AI drafts rather than the outputs of actual AI report
generation models may not fully reflect real-world
performance. Additionally, the artificial setting of
a controlled study may not fully reflect real-world
clinical practice conditions.
Future Directions While our pilot results
are encouraging, the next crucial step is conducting
a large-scale clinical trial involving multiple readers
and cases, using real AI-generated draft reports
rather than simulated AI drafts. Such a trial should
evaluate not only efficiency and accuracy but also
radiologist satisfaction, confidence, and mental effort
levels, as well as the impact of different error types
and frequencies in AI drafts.
Conclusion Our findings suggest that AI as-
sistance in radiology reporting may offer meaningful
efficiency gains without compromising diagnostic
accuracy, even in the presence of AI errors. However,
the observed individual variability and study limita-
tions emphasize the need for larger-scale validation
before widespread clinical implementation.
Disclosures
JNA, SD, MM, and PR are part-time employees
of a2z Radiology AI. PR is a co-founder of a2z
Radiology AI.
References
[1] Christopher R Bailey, Allison M Bailey, Anna Sophia McKenney, and Clifford R Weiss.
Understanding and appreciating burnout in radiologists. Radiographics, 42(5):E137–E139, July 2022.
[2] Souhail Bennani, Nor-Eddine Regnard, Jeanne Ventre, Louis Lassalle, Toan Nguyen, Alexis
Ducarouge, Lucas Dargent, Enora Guillo, Elodie Gouhier, Sophie-Hélène Zaimi, Emma Canniff,
Cécile Malandrin, Philippe Khafagy, Hasmik Koulakian, Marie-Pierre Revel, and Guillaume
Chassagnon. Using AI to improve radiologist performance in detection of abnormalities on
chest radiographs. Radiology, 309(3):e230860, Dec. 2023.
[3] Ibrahim Ethem Hamamci, Sezgin Er, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun,
Irem Dogan, Muhammed Furkan Dasdelen, Bastian Wittmann, Enis Simsar, Mehmet Simsar,
Emine Bensu Erdemir, Abdullah Alanbay, Anjany Sekuboyina, Berkan Lafci, Mehmet K
Ozdemir, and Bjoern Menze. A foundation model utilizing chest CT volumes and radiology
reports for supervised-level zero-shot detection of abnormalities. arXiv [cs.CV], Mar. 2024.
[4] Yuxiang Liao, Hantao Liu, and Irena Spasić. Deep learning approaches to automatic radiology
report generation: A systematic review. Informatics in Medicine Unlocked, 39:101273, Jan. 2023.
[5] Ryutaro Tanno, David G T Barrett, Andrew Sellergren, Sumedh Ghaisas, Sumanth Dathathri,
Abigail See, Johannes Welbl, Charles Lau, Tao Tu, Shekoofeh Azizi, Karan Singhal, Mike
Schaekermann, Rhys May, Roy Lee, Siwai Man, Sara Mahdavi, Zahra Ahmed, Yossi Matias,
Joelle Barral, S M Ali Eslami, Danielle Belgrave, Yun Liu, Sreenivasa Raju Kalidindi, Shravya
Shetty, Vivek Natarajan, Pushmeet Kohli, Po-Sen Huang, Alan Karthikesalingam, and Ira
Ktena. Collaboration between clinicians and vision-language models in radiology report
generation. Nat. Med., pages 1–10, Nov. 2024.
[6] Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan
Chang, Andrew Carroll, Chuck Lau, Ryutaro Tanno, Ira Ktena, Basil Mustafa, Aakanksha
Chowdhery, Yun Liu, Simon Kornblith, David Fleet, Philip Mansfield, Sushant Prakash, Renee
Wong, Sunny Virmani, Christopher Semturs, S Sara Mahdavi, Bradley Green, Ewa Dominowska,
Blaise Aguera y Arcas, Joelle Barral, Dale Webster, Greg S Corrado, Yossi Matias,
Karan Singhal, Pete Florence, Alan Karthikesalingam, and Vivek Natarajan. Towards generalist
biomedical AI. arXiv [cs.CL], July 2023.
[7] Lin Yang, Shawn Xu, Andrew Sellergren, Timo Kohlberger, Yuchen Zhou, Ira Ktena, Atilla
Kiraly, Faruk Ahmed, Farhad Hormozdiari, Tiam Jaroensri, Eric Wang, Ellery Wulczyn,
Fayaz Jamil, Theo Guidroz, Chuck Lau, Siyuan Qiao, Yun Liu, Akshay Goel, Kendall Park,
Arnav Agharwal, Nick George, Yang Wang, Ryutaro Tanno, David G T Barrett, Wei-Hung
Weng, S Sara Mahdavi, Khaled Saab, Tao Tu, Sreenivasa Raju Kalidindi, Mozziyar Etemadi,
Jorge Cuadros, Gregory Sorensen, Yossi Matias, Katherine Chou, Greg Corrado, Joelle Barral,
Shravya Shetty, David Fleet, S M Ali Eslami, Daniel Tse, Shruthi Prabhakara, Cory McLean,
Dave Steiner, Rory Pilgrim, Christopher Kelly, Shekoofeh Azizi, and Daniel Golden. Advancing
multimodal medical capabilities of Gemini, 2024.
[8] Feiyang Yu, Mark Endo, Rayan Krishnan, Ian Pan, Andy Tsai, Eduardo Pontes Reis, Eduardo
Kaiser Ururahy Nunes Fonseca, Henrique Min Ho Lee, Zahra Shakeri Hossein Abad, Andrew
Y Ng, Curtis P Langlotz, Vasantha Kumar Venugopal, and Pranav Rajpurkar. Evaluating
progress in automatic chest X-ray radiology report generation. Patterns (N Y), 4(9):100802,
Sept. 2023.
[9] Feiyang Yu, Alex Moehring, Oishi Banerjee, Tobias Salz, Nikhil Agarwal, and Pranav
Rajpurkar. Heterogeneity and predictors of the effects of AI assistance on radiologists. Nat.
Med., 30(3):837–849, Mar. 2024.
[10] Brian Nlong Zhao, Xinyang Jiang, Xufang Luo, Yifan Yang, Bo Li, Zilong Wang, Javier
Alvarez-Valle, Matthew P Lungren, Dongsheng Li, and Lili Qiu. Large multimodal model for
real-world radiology report generation. Oct. 2023.