Authors: Ferit Akaybicen, Aaron Cummings, Lota Iwuagwu, Xinyue Zhang, Modupe Adewuyi
Paper Content:
Page 1:
arXiv:2412.16341v1 [cs.LG] 20 Dec 2024A Machine Learning Approach for Emergency
Detection in Medical Scenarios Using Large
Language Models
Ferit Akaybicen1, Aaron Cummings1, Lota Iwuagwu2, Xinyue Zhang1, Modupe Adewuyi2
1Department of Computer Science, Kennesaw State University , Marietta, GA 30060
2WellStar School of Nursing, Kennesaw State University, Ken nesaw, GA 30060
Abstract —The rapid identification of medical emergencies
through digital communication channels remains a critical chal-
lenge in modern healthcare delivery, particularly with the in-
creasing prevalence of telemedicine. This paper presents a novel
approach leveraging large language models (LLMs) and promp t
engineering techniques for automated emergency detection in
medical communications. We developed and evaluated a compr e-
hensive system using multiple LLaMA model variants (1B, 3B,
and 7B parameters) to classify medical scenarios as emergen cy or
non-emergency situations. Our methodology incorporated b oth
system prompts and in-prompt training approaches, evaluat ed
across different hardware configurations. The results demo n-
strate exceptional performance, with the LLaMA 2 (7B) model
achieving 99.7% accuracy and the LLaMA 3.2 (3B) model reach-
ing 99.6% accuracy with optimal prompt engineering. Throug h
systematic testing of training examples within the prompts , we
identified that including 10 example scenarios in the model
prompts yielded optimal classification performance. Proce ssing
speeds varied significantly between platforms, ranging fro m 0.05
to 2.2 seconds per request. The system showed particular str ength
in minimizing high-risk false negatives in emergency scena rios,
which is crucial for patient safety. The code implementatio n
and evaluation framework are publicly available on GitHub,
facilitating further research and development in this cruc ial area
of healthcare technology.
Index Terms —Emergency Detection, Large Language Models,
Healthcare AI, Natural Language Processing, Prompt Engine er-
ing, Medical Informatics
I. I NTRODUCTION
Medical emergencies can occur in various settings, from
hospital wards to home care environments, and can range
from acute physical conditions to mental health crises [1].
The rapid identification of these emergencies is often hinde red
by factors such as communication barriers, lack of immediat e
medical supervision, or the inability of patients to recogn ize
the severity of their symptoms [2]. Current methods for
emergency detection in healthcare settings primarily rely on
manual monitoring, wearable devices, or alarm systems [3].
While these approaches have their merits, they also have
limitations in terms of scalability, cost-effectiveness, and the
ability to detect a wide range of emergency situations.
In the realm of healthcare, the ability to quickly and accu-
rately identify emergency situations is crucial for patien t safety
and optimal care outcomes. With the increasing prevalence
of telemedicine and digital health platforms, where health care
is provided remotely through telecommunications technolo gy,there is a growing need for automated systems that can detect
emergencies based on textual communication. This paper
presents a novel machine learning approach, which is a subse t
of artificial intelligence that enables systems to automati cally
learn and improve from experience without being explicitly
programmed, for emergency detection in medical scenarios
using large language models (LLMs) and prompt engineering
techniques. Large Language Models are sophisticated artifi cial
intelligence systems trained on vast amounts of text data,
capable of understanding and generating human-like text [4 ]
[5], while prompt engineering refers to the art and science
of crafting specific instructions or inputs to optimize thes e
models’ responses for particular tasks.
Natural Language Processing (NLP), a branch of artificial
intelligence that enables computers to understand, interp ret,
and manipulate human language, serves as the foundational
technology for our approach [5]. Through NLP, our system can
analyze and comprehend the nuances of medical communica-
tions, making it particularly valuable for emergency detec tion.
Our proposed system aims to address these challenges by
leveraging the power of LLMs and machine learning to an-
alyze text-based communications in medical contexts [6]. B y
processing and classifying phrases or messages, the system can
distinguish between emergency and non-emergency situatio ns,
potentially alerting healthcare providers or emergency se rvices
when necessary. The primary objectives of this research are
to develop a comprehensive dataset of emergency and non-
emergency phrases relevant to various medical scenarios, d e-
velop and evaluate prompt engineering approaches capable o f
accurately classifying these phrases, and assess the poten tial of
this approach for real-world applications in enhancing pat ient
safety and streamlining emergency response in healthcare
settings [1].
This paper details the methodology used to develop the
system, including data collection, model development, and
evaluation. We begin with a comprehensive literature revie w
that explores existing research in LLMs in healthcare, emer -
gency detection methods, and implementation consideratio ns.
Our methodology outlines the four main components of our
approach: data collection, model development, evaluation met-
rics, and privacy considerations. We then present the resul ts
of our experiments, demonstrating the high accuracy achiev ed
by our approach across different LLaMA model variants. In
Page 2:
the discussion section, we analyze the implications of our
findings, including the significance of our accuracy rates, t he
balance between model size and performance, and hardware
considerations. Finally, we conclude by summarizing our ke y
achievements and discussing the implications of this techn ol-
ogy for the future of healthcare and emergency management,
particularly in the context of telemedicine, where remote
patient care requires robust and reliable emergency detect ion
systems to ensure patient safety and timely intervention wh en
needed [5] [7].
II. L ITERATURE REVIEW
The rapid evolution of Large Language Models (LLMs)
in healthcare applications has created new opportunities f or
emergency detection and patient care. This literature revi ew
synthesizes current research relevant to our LLM-based eme r-
gency detection system, focusing on three key themes: LLMs
in healthcare, emergency detection methodologies, and imp le-
mentation considerations.
A. LLMs in Healthcare Applications
Recent research demonstrates the growing significance of
LLMs in healthcare settings. Rezgui [4] provides a com-
prehensive analysis of LLMs in clinical decision making,
emphasizing the critical need for continuous performance
monitoring and evaluation. This work established foundati onal
principles for our testing methodology across different mo del
configurations. He et al. [5] further illuminate the transit ion
from traditional pretrained language models to modern LLMs
in healthcare, offering crucial insights into training met hods
and optimization strategies that informed our selection of
LLaMA model variants.
The application of LLMs in specific healthcare contexts
has shown promising results. Alghamdi and Mostafa [6]
demonstrate the effectiveness of domain-specific fine-tuni ng
in their healthcare LLM agents for pilgrims, achieving a
5% performance improvement through retrieval-augmented
generation. Their findings support our approach to prompt
engineering, though our results suggest that careful promp t
design can achieve high accuracy without extensive fine-
tuning. Preiksaitis et al. [1] provide valuable context thr ough
their scoping review of LLMs in emergency medicine, iden-
tifying key themes in current research and emphasizing the
importance of prospective validation.
B. Emergency Detection Methodologies
Various approaches to emergency detection have been ex-
plored in recent literature. Kulhandjian et al. [3] achieve d
90% accuracy using CNN based classification for emergency
keyword detection, though our LLM based approach demon-
strates superior performance at 99.7% accuracy. Deb et al.’ s
work on Speech Emotion Recognition in emergency scenarios,
while focusing on audio analysis, provides valuable insigh ts
into dataset curation and emergency scenario classificatio n that
influenced our data collection methodology.Li et al. [8] conducted comprehensive benchmarking of
LLMs in evidence based medicine, finding that knowledge
guided prompting improved performance significantly. Thei r
results validate our emphasis on prompt engineering, thoug h
our study achieved higher accuracy rates in the specific cont ext
of emergency detection. Huang et al. [9] demonstrated the
feasibility of emergency event detection from social media
using BERT-Att-BiLSTM models, though our system achieves
faster processing speeds and higher accuracy rates.
Implementation Considerations Several studies address cr u-
cial implementation aspects of healthcare AI systems. Gaut am
et al. [2] highlight important security considerations for remote
patient monitoring systems, while Pamulaparthyvenkata et al.
[7] present an AI enabled distributed healthcare framework
that complements our research by addressing infrastructur e
requirements. McPeak et al.’s work on implementing LLMs
in Nigerian clinical settings provides valuable insights i nto
prompt engineering for specific healthcare contexts, suppo rting
our decision to focus on prompt design rather than model fine-
tuning.
Sathe et al. [10] offer a broader perspective on AI healthcar e
applications, particularly emphasizing ethical consider ations
that, while not directly addressed in our study, remain cru-
cial for real world implementation. Ferri et al.’s [11] Deep -
EMC2 model for emergency medical call classification, thoug h
achieving lower accuracy rates than our system, demonstrat es
the value of integrated approaches to emergency detection.
C. Research Gap and Contribution
While existing literature demonstrates various approache s to
emergency detection and LLM implementation in healthcare,
there remains a significant gap in combining these elements
for high accuracy emergency detection in medical communi-
cations. Our research addresses this gap by presenting a nov el
LLM-based approach that achieves unprecedented accuracy
rates while maintaining practical processing speeds. The l iter-
ature supports our methodological choices while highlight ing
the unique contribution of our work to the field. The reviewed
literature reveals a clear trajectory toward more sophisti cated
AI applications in healthcare, with particular emphasis on
accuracy, reliability, and practical implementation cons ider-
ations. Our research builds upon these foundations while
advancing the state-of-the-art in emergency detection thr ough
innovative use of LLMs and prompt engineering techniques.
III. M ETHODOLOGY
A. Data Collection and Preparation
To create a robust dataset for our emergency detection
system, we employed a multi-step process. We began by
accessing databases of emergency calls and medical records ,
focusing on a wide range of medical scenarios, similar to
the approach used by Deb et al. [12] in their emergency
response system development. This provided a foundation of
real-world emergency situations. The identified situation s were
then converted into concise phrases that capture the essenc e
Page 3:
of each scenario, reflecting typical language used in medica l
communications.
To expand our initial dataset, which was carefully curated
from real-world emergency call conversations and medical
communications, we employed Generative AI techniques. Thi s
AI-assisted expansion process involved multiple rounds of
generation, where each round produced contextually releva nt
emergency and non-emergency phrases. Our team meticu-
lously reviewed each generated phrase, eliminating those t hat
didn’t meet our strict medical accuracy criteria or lacked r eal-
world relevance. This iterative process of generation, rev iew,
and refinement continued until we achieved a robust and
reliable dataset. Through this rigorous validation proces s, ap-
proximately 20% of AI-generated phrases were eliminated in
each round, ensuring only the most accurate and representat ive
scenarios remained. The final dataset was carefully balance d
between emergency and non-emergency situations, and sys-
tematically divided into training, validation, and test se ts to
ensure comprehensive evaluation of our model’s performanc e.
B. Model Development and Technical Implementation
Our approach to model development focused on two distinct
testing methodologies using a large language model (LLM)
as our foundation, building upon the prompt engineering
principles demonstrated by McPeak et al. [13] in their clini cal
decision support implementation. The implementation was r e-
alized through a custom HTTP server built in Python, utilizi ng
the http.server and json packages as can be seen in Figure 1
The server was configured to run on port 9111, interfacing
with LM Studio on localhost:1234.
The complete implementation, including source code and
documentation, is available in our public GitHub repositor y
(https://github.com/FeritMelih/TBED). This repository con-
tains all necessary components for replicating our methodo l-
ogy and deploying the system.
InputHTTP
ServerOutput
LM
StudioAPI calls API calls
Fig. 1. System Architecture Overview
The first methodology implemented a system prompt - a
carefully crafted set of instructions given to the LLM that
defines the task and expected output format without includin g
specific examples. This system prompt was designed to pro-
vide clear instructions for distinguishing between emerge ncy
and non-emergency medical situations, incorporating spec ific
criteria and guidelines for classification, following simi lar
prompt engineering approaches described by Li et al. [14].
The second methodology enhanced the base approach
through in-prompt training, which is a technique where care -
fully selected examples are included directly in the promptalongside the instructions, similar to the approach demon-
strated by Naimi et al. [15] in their automated testing frame -
work. These examples serve as immediate reference points
for the model, demonstrating the desired behavior through
concrete cases. We integrated representative cases of both
emergency and non-emergency scenarios into the prompt
structure. The in-prompt examples were iteratively refined
based on initial performance assessments [13].
Initially, we had planned for a third phase involving fine-
tuning, which is the process of further training a pre-train ed
model on a specific dataset to optimize its performance for
a particular task. However, the exceptional results achiev ed
through in-prompt training made this step unnecessary. Thi s
finding aligns with recent research suggesting that well-
designed prompts with appropriate examples can achieve
comparable or superior results to fine-tuning in specific tas k
domains [14].
The technical implementation included:
•A Python-based HTTP server handling JSON requests
•Integration with LM Studio through a REST API
•Configurable settings for port, platform, and model se-
lection
•Error handling for various scenarios including invalid
JSON and server processing issues
•Platform override capabilities for flexibility in deploy-
ment
C. Evaluation Metrics
The evaluation framework was designed to provide a com-
prehensive assessment of model performance across both
testing approaches, incorporating elements from the evalu ation
methodology used by Deb et al. [12]. We implemented a
balanced testing dataset consisting of 500 emergency and 50 0
non-emergency scenarios, ensuring robust evaluation acro ss
equal class distributions. The primary metrics used for eva lu-
ation included:
1) Overall Accuracy: Calculated as the proportion of cor-
rectly classified instances across both categories, provid -
ing a high-level measure of model performance.
2) Category-Specific Metrics: For both emergency and non-
emergency classifications, we computed:
•Precision: The ratio of correct positive predictions
to total positive predictions
•Recall: The ratio of correct positive predictions to
all actual positives
•F1-Score: The harmonic mean of precision and
recall
3) Confusion Matrices: Generated to visualize the distri-
bution of correct and incorrect classifications, helping
identify specific patterns in model errors
The evaluation process was identical for both testing ap-
proaches, allowing for direct comparison of their performa nce.
Special attention was paid to false negatives in emergency
situations, as these represent the highest-risk type of mis clas-
sification in a medical context [13]. The evaluation framewo rk
Page 4:
was designed to be particularly sensitive to these critical errors,
ensuring that the model’s performance was assessed not just on
overall accuracy but also on its ability to minimize high-ri sk
misclassifications.
The results were validated through multiple test runs to
ensure consistency and reliability of the performance metr ics.
This rigorous evaluation approach provided a clear picture of
each model’s capabilities and limitations, ultimately dem on-
strating the superior performance of the in-prompt trainin g
methodology.
D. Privacy and Data Protection
The development and implementation of our emergency
detection system necessitates careful consideration of pr ivacy
and data protection, particularly given the sensitive natu re of
medical information. Our approach prioritizes patient con fi-
dentiality and data security through several key measures.
Firstly, the system operates entirely on-premises, elimin at-
ing the risks associated with cloud-based data storage and
processing. This local deployment ensures that sensitive m ed-
ical information never leaves the healthcare facility’s se cure
environment. Moreover, our system is designed with a ”no dat a
retention” policy, meaning that individual patient data is not
stored or logged after processing. This transient data hand ling
approach significantly reduces the risk of data breaches or
unauthorized access to patient information. In compliance
with HIPAA regulations, all data processing is conducted
in a manner that maintains the integrity and confidentiality
of protected health information (PHI). The system’s input i s
limited to the specific text-based communications necessar y
for emergency detection, avoiding the collection or proces sing
of extraneous personal data.
While our system does not store individual patient data, we
maintain comprehensive logs of system performance and usag e
patterns, which are anonymized and aggregated to comply
with HIPAA’s accounting of disclosures requirements. Thes e
measures collectively create a robust framework for protec ting
patient privacy while leveraging advanced AI capabilities
to improve emergency response in healthcare settings. By
prioritizing on-premises deployment, transient data proc essing,
and strict adherence to healthcare data protection regulat ions,
our system sets a high standard for responsible AI use in
sensitive medical contexts.
IV. R ESULTS
Our experimental evaluation yielded comprehensive result s
across multiple model variations and hardware configuratio ns,
demonstrating the robustness and scalability of our emerge ncy
detection approach. The testing framework encompassed thr ee
different LLaMA model variants, each evaluated on two dis-
tinct GPU platforms (M3 and RTX 4080), using LM Studio
as the implementation platform.
A. LLaMA 3.2 (3B Parameters) Performance
The base model demonstrated strong and consistent per-
formance in classifying both emergency and non-emergencyscenarios. As can be seen in Table 1, with 8 tuning messages,
the model achieved a 99.1% accuracy rate across both GPU
platforms. Performance improved to 99.6% accuracy when
using 10 tuning messages, though interestingly, increasin g to
20 tuning messages led to a slight performance degradation
(97.7%). This finding suggests an optimal sweet spot in the
number of training examples needed for effective emergency
detection.
TABLE I
CONFUSION MATRIX FOR LLAMA 3.2 (3B) M ODEL
Pred.
EmergencyPred.
Non-Emerg.
Act. Emerg. 500 0
Act. Non-Emerg. 4 496
B. LLaMA 2 (7B Parameters) Performance
The larger model achieved the best overall performance with
a remarkable 99.7% accuracy rate. Out of 500 non-emergency
scenarios, it correctly identified 499 cases (99.8% true neg ative
rate), with only one false positive. In emergency scenarios , it
misclassified only two cases as non-emergencies, resulting in
just three total misclassifications across the entire test s et. This
exceptional performance validates our approach’s capabil ity to
maintain high accuracy in critical medical classification t asks.
TABLE II
CONFUSION MATRIX FOR LLAMA 2 (7B) M ODEL
Pred.
EmergencyPred.
Non-Emerg.
Act. Emerg. 498 2
Act. Non-Emerg. 1 499
C. LLaMA 3.2 (1B Parameters) Performance
The smaller model showed significantly reduced perfor-
mance, with accuracy ranging from 64.4% to 67.7%. While
it maintained good performance in identifying true emer-
gencies, it struggled with false positives, producing 325- 356
misclassifications in this category. This finding highlight s the
importance of model capacity in achieving reliable emergen cy
detection.
TABLE III
CONFUSION MATRIX FOR LLAMA 3.2 (1B) M ODEL
Pred.
EmergencyPred.
Non-Emerg.
Act. Emerg. 500 0
Act. Non-Emerg. 356 144
D. Processing Speed and Hardware Considerations
The M3 platform demonstrated superior processing speed,
handling requests in 0.05-0.38 seconds, while the RTX 4080
required approximately 2.2 seconds per request. This per-
formance difference is particularly relevant for real-wor ld
applications where response time is crucial.
Page 5:
E. Comparative Analysis
When comparing the two larger models (3B and 7B pa-
rameters), both achieved the high accuracy necessary for
medical emergency detection, with the 7B model showing
slightly superior performance. The confusion matrices rev ealed
that false negatives (emergency situations classified as no n-
emergencies) were rare in both models, which is particularl y
important from a patient safety perspective.
LLaMA 1B LLaMA 3B LLaMA 7B60708090100
67.799.6 99.7
Model VariantAccuracy (%)
Accuracy
Fig. 2. Performance Comparison Across LLaMA Model Variants
The results strongly support our methodology’s effective-
ness, particularly the in-prompt training approach. The hi gh
accuracy rates achieved across different model sizes and
hardware configurations demonstrate the robustness of our
approach. The optimal performance achieved with the 7B
parameter model suggests that while larger models can provi de
better accuracy, even the 3B model achieves clinically ac-
ceptable performance levels, offering flexibility in deplo yment
based on available computational resources.
These findings are particularly significant given the critic al
nature of emergency detection in healthcare settings. The
combination of high accuracy, reasonable processing speed s,
and the ability to minimize high-risk false negatives makes
this approach viable for real-world implementation in medi cal
scenarios.
V. D ISCUSSION
The results of our study demonstrate exceptional promise
for LLM-based emergency detection in medical scenarios,
with several key findings that warrant detailed discussion.
The achievement of 99.7% accuracy with the LLaMA 2 (7B)
model, compared to the 99.6% accuracy with LLaMA 3.2 (3B)
using optimal tuning messages, provides compelling eviden cefor the effectiveness of our approach. These performance
levels are particularly noteworthy given the critical natu re
of emergency detection in healthcare settings, where false
negatives could have serious consequences, aligning with
similar findings in emergency detection systems [9].
The comparative performance across different model sizes
offers valuable insights into the scalability and practica l imple-
mentation considerations of our approach. While the LLaMA
2 (7B) model achieved the best results, the strong perfor-
mance of the LLaMA 3.2 (3B) model suggests that effective
emergency detection can be achieved with smaller models,
potentially enabling deployment in resource constrained e nvi-
ronments. This finding is consistent with recent benchmarki ng
studies of LLMs in medical applications [8]. However, the
significant drop in performance with the 1B model (64.4-
67.7% accuracy) establishes a clear lower bound for model
size in this application.
The relationship between the number of tuning messages
and model performance is particularly interesting. The obs er-
vation that performance peaked at 10 tuning messages (99.6%
accuracy) and slightly degraded with 20 messages (97.7%)
suggests an optimal sweet spot in prompt engineering. This
finding aligns with recent research on LLM optimization in
biomedical applications [16] and has important implicatio ns
for system implementation and maintenance, indicating tha t
more training examples don’t necessarily lead to better res ults.
6 8 10 12 14 16 18 20 229797.59898.59999.5100
Number of Tuning MessagesAccuracy (%)
LLaMA 3.2 3B
Fig. 3. Impact of Tuning Messages on Model Accuracy
The processing speed differences between hardware plat-
forms (M3: 0.05-0.38 seconds vs. RTX 4080: 2.2 seconds)
provide valuable insights for real-world deployment consi d-
erations. The sub-second response times achieved on the
M3 platform demonstrate the system’s viability for real-ti me
emergency detection, particularly crucial in healthcare s ettings
where rapid response is essential, comparable to existing
emergency medical dispatch systems [11].
Page 6:
The system’s robust performance across different hardware
configurations and model sizes indicates strong generaliza -
tion capabilities. This is crucial for practical applicati ons in
healthcare, where the system must handle a wide variety of
medical situations and communication styles [10]. The resu lts
suggest that the model has successfully learned to identify
subtle contextual cues and distinguish between truly urgen t
situations and those that may use urgent-sounding language
but do not constitute actual emergencies.
However, it is important to acknowledge certain limitation s
and areas for future research:
1) While our test set was comprehensive, real-world im-
plementation would require continuous monitoring and
validation across an even broader range of scenarios.
2) The performance gap between model sizes suggests a
need to investigate intermediate model sizes that might
offer optimal balance between accuracy and resource
requirements.
3) The system’s performance should be evaluated in dif-
ferent languages and cultural contexts, as medical com-
munication patterns can vary significantly across these
dimensions.
Future development should focus on several key areas:
•Integration testing with existing healthcare communica-
tion systems
•Investigation of model compression techniques to im-
prove processing speed while maintaining accuracy
•Development of explanation mechanisms to help health-
care providers understand the system’s classifications
•Assessment of the system’s performance with medical
terminology variations and colloquial descriptions of
symptoms
The potential applications of this technology extend beyon d
traditional healthcare settings. The high accuracy and effi cient
processing times make the system particularly valuable in
telemedicine platforms, remote patient monitoring system s,
and emergency triage services [10]. The success of our
approach with different model sizes also suggests potentia l
applications in various deployment scenarios, from resour ce-
rich hospital environments to mobile healthcare units.
From a broader perspective, our findings contribute signif-
icantly to the growing body of research on the application
of large language models in healthcare [8]. The success of
our approach, particularly the relationship between model
size, tuning message quantity, and performance, provides
valuable insights for similar healthcare-related classifi cation
tasks where high accuracy is crucial.
The trade-offs between model size, accuracy, and processin g
speed revealed in our study also have important implication s
for the broader field of medical AI applications [10], sugges t-
ing that careful consideration of these factors is essentia l for
successful real-world implementation.
VI. C ONCLUSION
This research presents a groundbreaking machine learning
approach for detecting emergencies in medical scenarios us ingnatural language processing, with results that demonstrat e
the viability of LLM-based systems for critical healthcare
applications. Through rigorous testing across multiple mo del
sizes and hardware configurations, we established that both
LLaMA 2 (7B) and LLaMA 3.2 (3B) models can achieve
the high accuracy necessary for medical emergency detectio n,
with the 7B model reaching 99.7% accuracy and the 3B model
achieving 99.6% with optimal prompt engineering.
The study’s findings regarding the relationship between
model size, tuning message quantity, and processing speed
provide valuable insights for practical implementation. T he
discovery that optimal performance can be achieved with
moderate-sized models and a relatively small number of tuni ng
messages (10) challenges the assumption that larger models
and more training data are always better. This has significan t
implications for resource allocation and system design in
healthcare settings.
The successful implementation across different hardware
platforms, with processing times ranging from 0.05 to 2.2
seconds, demonstrates the system’s adaptability to variou s
deployment scenarios. This flexibility, combined with the h igh
accuracy rates, makes the system particularly suitable for
integration into existing healthcare infrastructure, fro m sophis-
ticated hospital systems to remote telemedicine platforms .
Looking forward, this research establishes a foundation
for expanding the application of LLMs in critical medical
decision-making processes. Future developments should fo cus
on multilingual capabilities, cultural adaptations, and i ntegra-
tion with existing healthcare systems. The success of this
approach not only validates the use of LLMs for emergency
detection but also suggests promising applications in othe r
areas of healthcare where rapid, accurate classification of
medical situations is essential.
REFERENCES
[1] C. Preiksaitis, N. Ashenburg, G. Bunney et al. , “The role of large
language models in transforming emergency medicine: Scopi ng review,”
JMIR Medical Informatics , vol. 12, 2024.
[2] D. Gautam, G. Thakur, M. Obaidat, K.-F. Hsiao, and P. Kuma r, “Security
analysis and improvement of authenticated key agreement pr otocol for
remote patient monitoring iomt,” International Conference on Commu-
nications, Computing, Cybersecurity, and Informatics (CC CI), pp. 1–8,
2024.
[3] H. Kulhandjian, B. Poorman, J. Gutierrez, and M. Kulhand jian, “Ai-
powered emergency keyword detection for autonomous vehicl es,”Inter-
national Conference on Computing, Networking and Communic ations
(ICNC) , pp. 984–988, 2024.
[4] K. Rezgui, “Large language models for healthcare: Appli cations, models,
datasets, and challenges,” International Conference on Control, Decision
and Information Technologies (CoDIT) , pp. 2366–2371, 2024.
[5] K. He, R. Mao, Q. Lin et al. , “A survey of large language models for
healthcare: from data, technology, and applications to acc ountability and
ethics,” October 2023.
[6] H. Alghamdi and A. Mostafa, “Towards reliable healthcar e llm agents:
A case study for pilgrims during hajj,” Information , vol. 15, no. 7, p.
371, 2024.
[7] S. Pamulaparthyvenkata, P. Murugesan, D. Gottipalli, a nd P. Palanisamy,
“Ai-enabled distributed healthcare framework for secure a nd resilient re-
mote patient monitoring,” International Conference on Smart Electronics
and Communication (ICOSEC) , pp. 2034–2041, 2024.
[8] J. Li, Y . Deng, J. Zhu et al. , “Benchmarking large language models
in evidence-based medicine,” IEEE Journal of Biomedical and Health
Informatics , 2024.
Page 7:
[9] L. Huang, P. Shi, H. Zhu et al. , “Early detection of emergency events
from social media: a new text clustering approach,” Natural Hazards ,
vol. 111, pp. 851–875, 2022.
[10] N. Sathe, V . Deodhe, Y . Sharma, and A. Shinde, “A compreh ensive
review of ai in healthcare: Exploring neural networks in med ical
imaging, llm-based interactive response systems, nlp-bas ed ehr systems,
ethics, and beyond,” International Conference on Advanced Computing
& Communication Technologies (ICACCTech) , pp. 633–640, 2023.
[11] P. Ferri, C. S´ aez, A. F´ elix-De Castro, J. Juan-Albarr ac´ ın, V . Blanes-
Selva, P. S´ anchez-Cuesta, and J. M. Garc´ ıa-G´ omez, “Deep ensemble
multitask classification of emergency medical call inciden ts combining
multimodal data improves emergency medical dispatch,” Artificial Intel-
ligence in Medicine , vol. 117, p. 102088, 2021.
[12] P. Deb, H. Mahrin, and A. Bhuiyan, “Enhancing emergency response
through speech emotion recognition: A machine learning app roach,”
International Conference on Computer and Information Tech nology
(ICCIT) , pp. 1–5, 2023.
[13] G. McPeak, A. Sautmann, O. George et al. , “An llm’s medical testing
recommendations in a nigerian clinic: Potential and limits of prompt en-
gineering for clinical decision support,” IEEE International Conference
on Healthcare Informatics (ICHI) , pp. 586–591, 2024.
[14] Y . Li, R. Zhang, J. Liu, and Q. Lei, “A semantic controlla ble long
text steganography framework based on llm prompt engineeri ng and
knowledge graph,” IEEE Signal Processing Letters , vol. 31, pp. 2610–
2614, 2024.
[15] L. Naimi, E. Bouziane, M. Manaouch, and A. Jakimi, “A new approach
for automatic test case generation from use case diagram usi ng llms and
prompt engineering,” International Conference on Circuit, Systems and
Communication (ICCSC) , pp. 1–5, 2024.
[16] N. S. Babaiha, S. G. Rao, J. Klein, B. Schultz, M. Jacobs, and
M. Hofmann-Apitius, “Rationalism in the face of gpt hypes: B ench-
marking the output of large language models against human ex pert-
curated biomedical knowledge graphs,” Artificial Intelligence in the Life
Sciences , vol. 5, 2024.
Page 8:
This figure "fig1.png" is available in "png"
format from:
http://arxiv.org/ps/2412.16341v1