LLMs IN DISEASE DIAGNOSIS: A COMPARATIVE STUDY OF
DEEPSEEK-R1 AND O3 MINI ACROSS CHRONIC HEALTH
CONDITIONS
Gaurav Kumar Gupta
Youngstown State University
Youngstown, OH
gkgupta@student.ysu.edu
Pranal Pande
Youngstown State University
Youngstown, OH
ppande@student.ysu.edu
ABSTRACT
Large Language Models (LLMs) are revolutionizing medical diagnostics by enhancing both disease
classification and clinical decision-making. In this study, we evaluate the performance of two LLM-based
diagnostic tools, DeepSeek R1 and O3 Mini, using a structured dataset of symptoms and
diagnoses. We assessed their predictive accuracy at both the disease and category levels, as well as the
reliability of their confidence scores. DeepSeek R1 achieved a disease-level accuracy of 76% and an
overall accuracy of 82%, outperforming O3 Mini, which attained 72% and 75% respectively. Notably,
DeepSeek R1 demonstrated exceptional performance in Mental Health, Neurological Disorders,
and Oncology, where it reached 100% accuracy, while O3 Mini excelled in Autoimmune Disease
classification with 100% accuracy. Both models, however, struggled with Respiratory Disease
classification, recording accuracies of only 40% for DeepSeek R1 and 20% for O3 Mini. Additionally,
the analysis of confidence scores revealed that DeepSeek R1 provided high-confidence predictions
in 92% of cases, compared to 68% for O3 Mini. Ethical considerations regarding bias, model
interpretability, and data privacy are also discussed to ensure the responsible integration of LLMs
into clinical practice. Overall, our findings offer valuable insights into the strengths and limitations
of LLM-based diagnostic systems and provide a roadmap for future enhancements in AI-driven
healthcare.
Keywords: Large Language Models · Medical Diagnostics · Disease Classification · DeepSeek R1 · O3 Mini · Clinical
Decision Support · AI in Healthcare · Diagnostic Accuracy · Data Privacy · Ethical AI
1 Introduction
Large Language Models (LLMs) have emerged as groundbreaking advancements in artificial intelligence (AI), reshaping
various sectors—including healthcare—through their ability to process and generate human-like text. In medical
diagnostics, LLMs are being increasingly leveraged to support clinicians by interpreting complex patient data, identifying
subtle patterns, and providing diagnostic insights that may otherwise be overlooked. By processing extensive clinical
data—from electronic health records and medical literature to patient-reported symptoms—these models offer the
potential not only to classify diseases into broad categories but also to predict specific disease names, thereby refining
the diagnostic process [1, 2, 3].
The integration of LLMs into healthcare has the potential to enhance diagnostic processes by improving accuracy,
efficiency, and overall clinical decision-making. For instance, LLMs can rapidly analyze unstructured clinical narratives
to extract critical diagnostic features and uncover underlying patterns within complex patient data [4, 5]. By synthesizing
information from diverse sources, these models offer valuable insights that support clinicians in their diagnostic
reasoning, ultimately contributing to more informed and effective treatment decisions.
Recent advancements in AI have led to the development of specialized LLM-based platforms tailored for medical
applications. Among these, DeepSeek R1 and O3 Mini have attracted considerable attention for their robust performance in
both automated disease classification and disease name prediction. DeepSeek R1 is designed to capture the nuances of
clinical language, enabling it to differentiate between diseases with overlapping symptom profiles with high accuracy
[6]. Conversely, O3 Mini emphasizes computational efficiency and scalability, making it particularly suitable for rapid
diagnostic support in high-volume or resource-constrained clinical settings [7]. Notably, while preliminary findings
suggest that DeepSeek R1 may offer superior overall diagnostic accuracy and confidence levels, O3 Mini has shown
distinct strengths in certain domains, such as Autoimmune Disease prediction.
arXiv:2503.10486v1 [cs.CL] 13 Mar 2025
Despite these promising developments, the clinical deployment of LLM-based diagnostic tools requires rigorous
validation across diverse patient populations and clinical scenarios. It is also crucial to address the ethical implications
of these technologies, including potential biases in training data, the opacity of decision-making processes, and stringent
data privacy requirements [8, 9]. Ensuring compliance with data protection standards such as HIPAA and GDPR is
essential for maintaining patient trust and safeguarding sensitive information.
In this study, we evaluate the diagnostic performance of DeepSeek R1 and O3 Mini by comparing their ability to
classify diseases and predict specific disease names across multiple clinical categories. Our research aims to provide
a comprehensive assessment of these LLM-based systems, highlighting their strengths and identifying key areas for
improvement. Through this work, we contribute to the ongoing discourse on AI-driven diagnostics and offer insights
that could pave the way for more robust, interpretable, and ethically responsible diagnostic tools in healthcare.
2 Related Work
Artificial Intelligence (AI) and Large Language Models (LLMs) have been increasingly explored for their role in
disease diagnosis, leveraging deep learning techniques to analyze clinical data, electronic health records (EHRs),
and symptom descriptions. Several studies have demonstrated the effectiveness of LLMs in disease classification,
differential diagnosis, and clinical decision support [3, 4, 9, 10].
A recent scoping review by Zhou et al. (2024) provides a comprehensive analysis of LLM-based methods for disease
diagnosis, examining various disease types, clinical datasets, and evaluation techniques [1]. The study highlights the
efficacy of LLMs in diagnostic tasks and underscores the need for standardized benchmarks to ensure fair performance
evaluation. Unlike our study, which focuses on model-specific evaluations, their review broadly assesses LLM
methodologies and architectures.
Similarly, Wang et al. (2024) investigated the probabilistic medical predictions of LLMs, demonstrating how prompt
engineering can enhance the accuracy and flexibility of clinical diagnoses [2]. Their findings suggest that LLMs can
dynamically adjust diagnostic predictions based on evolving patient information. In contrast, our study evaluates the
confidence scores of specialized AI models, DeepSeek R1 and O3 Mini, across multiple disease categories, providing
a structured approach to assessing AI reliability in medical applications. Furthermore, Sun et al. (2024) introduced
a conversational AI model that mimics doctor-patient interactions using reinforcement learning to refine diagnostic
reasoning [11]. Their system optimizes follow-up questioning and achieves high performance in disease screening and
differential diagnosis. Unlike this approach, our study does not involve interactive AI responses but rather evaluates the
direct diagnostic accuracy and confidence levels of predefined AI models.
A broader survey by He et al. (2023) examined the deployment of LLMs across various healthcare domains, highlighting
key challenges such as fairness, interpretability, accountability, and ethical considerations [8]. While their work provides
a conceptual overview of LLM applications in healthcare, our study contributes by offering an empirical evaluation of
AI model performance in disease classification.
Our prior study, Digital Diagnostics: The Potential of Large Language Models in Recognizing Symptoms of Common
Illnesses [5], focused on analyzing the capabilities of LLMs in diagnosing common illnesses. In this work, we expand
upon that foundation by classifying chronic diseases, including Diabetes, Cancer, Heart Disease, Mental Health
disorders, and Autoimmune Diseases. This shift towards chronic disease classification enables a more structured
evaluation of AI model performance in handling complex, long-term medical conditions. Unlike general LLM-based
studies that assess language models on broad medical knowledge, our research specifically evaluates DeepSeek R1 and
O3 Mini for their ability to classify chronic diseases with high accuracy and confidence. We systematically assess their
performance across multiple disease categories, highlighting both their strengths and areas where improvements are
needed.
Additionally, this study introduces a confidence-based evaluation approach, ensuring that AI-driven disease classification
models not only achieve high accuracy but also provide reliable predictions that can be confidently used in real-world
healthcare applications. A key challenge identified in our research is the classification of Respiratory Diseases, an
area where both models show relatively lower performance. Addressing such limitations is crucial for enhancing AI
adoption in clinical environments.
By conducting a detailed comparison of these AI models and assessing their diagnostic reliability, this study contributes
to the ongoing research in AI-driven medical diagnostics, offering practical insights into the optimization and deployment
of AI systems in clinical settings. Our findings provide a structured understanding of how AI models perform in chronic
disease classification and suggest potential future improvements for their integration into medical practice.
3 Methodology
The dataset used in this study was obtained from authoritative medical sources, including Mayo Clinic, WebMD,
WHO, HealthLink BC, and Penn Medicine [12, 13, 14]. These sources provide verified and widely accepted medical
knowledge that serves as a foundation for clinical decision-making and disease diagnosis. The dataset was structured
to include comprehensive information on disease categories, specific diseases, and their corresponding symptoms.
Each disease was systematically mapped to a set of commonly reported symptoms based on information curated
from these medical institutions, ensuring that the dataset represented real-world symptom presentations. Our study
specifically targets the classification of chronic diseases, including Cancer, Diabetes, Heart Disease, and Mental
Health Disorders. Chronic diseases often present with overlapping symptoms, making accurate classification a
challenging yet essential task for improving medical diagnostics.
3.1 LLMs Used
To conduct this study, we employed two state-of-the-art large language models (LLMs), DeepSeek R1 and O3 Mini,
which have been developed to enhance disease classification based on symptom analysis. These models were selected
due to their advanced capabilities in processing and interpreting medical language, leveraging extensive training data
sourced from clinical texts, research publications, and structured medical databases [6, 7].
DeepSeek R1 is a sophisticated LLM designed to process medical text efficiently and derive meaningful diagnostic
insights. It has been trained on diverse medical literature and symptom-disease mappings, enabling it to predict potential
diseases with high accuracy. Its architecture allows for a deep understanding of complex symptom patterns, improving
its capability to differentiate between diseases with similar symptom presentations, which is crucial for chronic disease
diagnosis [6].
O3 Mini, on the other hand, is a lightweight LLM optimized for real-time medical decision support. While it prioritizes
computational efficiency, it still maintains strong diagnostic accuracy, making it suitable for scenarios requiring quick
disease classification with minimal resource consumption. O3 Mini integrates knowledge from structured medical
databases and symptom assessment algorithms, allowing it to provide concise and interpretable diagnostic predictions.
Its efficiency makes it particularly valuable for early detection and continuous monitoring of chronic diseases [7].
Both models were evaluated by inputting carefully curated symptom sets derived from our dataset. The LLMs processed
these inputs and generated outputs that included disease category classification, specific disease predictions, confidence
scores, suggested diagnostic steps, and a brief reasoning behind their decision. These outputs were systematically
recorded for further analysis, enabling a comprehensive assessment of each model’s performance in chronic disease
classification and diagnostic reasoning. By focusing on chronic diseases, this study aims to provide a structured
evaluation of how LLMs can enhance medical diagnostics and improve disease prediction accuracy in real-world
healthcare applications.
3.2 Data Collection and Testing Process
To evaluate the performance of the models, we compiled a dataset consisting of 50 distinct symptoms for each disease
category, sourced from reputable medical organizations [12, 13, 14]. These symptoms were carefully curated to ensure
a broad and representative coverage of each disease class, reducing potential bias and improving generalizability. The
selection process focused on ensuring that both common and less frequent symptoms were included, capturing a realistic
distribution of symptom presentations. Given our focus on chronic diseases such as Cancer, Diabetes, Heart Disease,
and Mental Health Disorders, the dataset was tailored to reflect the most clinically relevant symptomatology for these
conditions.
Once the dataset was finalized, the symptoms were formatted into structured input queries and fed into the LLMs
using a standardized prompt. The objective was to simulate real-world diagnostic scenarios where models would be
required to analyze symptom patterns and provide meaningful predictions. Each model was tasked with identifying the
correct disease category, predicting the most probable disease name, assigning a confidence score (Low/Medium/High),
suggesting appropriate next steps for diagnosis or management, and providing a concise explanation of its reasoning.
Figure 1: Data Collection Process
This approach allowed us to capture the interpretability and reliability of the models’ predictions, ensuring a robust
evaluation framework.
Figure 1 illustrates the structured process of data collection and testing. It begins with symptom collection, followed by
data structuring, input processing through the LLMs, and output evaluation. Each model processes the standardized
symptom inputs and generates diagnostic insights, which are systematically recorded for analysis. The figure outlines
key stages, including disease identification, confidence score assignment, and reasoning evaluation, highlighting the
step-by-step methodology employed in our study.
The outputs generated by DeepSeek R1 and O3 Mini were systematically recorded for further analysis. These results
were compared against the ground truth data to measure classification accuracy, confidence alignment, and diagnostic
validity. Additionally, we examined cases where the models exhibited uncertainty or misclassification, providing insights
into their limitations and areas for potential improvement. This structured testing and evaluation process enabled a
comprehensive assessment of how effectively each LLM could be leveraged for real-world disease classification and
clinical decision support.
3.3 LLM Evaluation Prompt
The LLMs were evaluated using a standardized prompt to classify diseases based on given symptoms. The
prompt used in this study was as follows:
Prompt for Models
You are an AI medical assistant specializing in disease classification. A patient presents with the following
symptoms:
- Symptoms: [Enter symptoms here]
Task:
1. Classify the disease based on the provided symptoms.
2. Identify the specific disease (predict only one).
3. Provide a confidence score (Low/Medium/High).
4. Suggest next steps (Limit: Maximum 2-3 lines).
5. Explain the reasoning behind the diagnosis (Limit: Maximum 2-3 lines).
Additional Instructions:
• If you are unable to classify the disease with confidence, request a hint.
• If a hint is needed, ask: "I need more information. Would you like me to list possible disease
categories?"
Response Format:
• Disease Classification: [Predicted category]
• Specific Diagnosis: [Exact disease name (Only one)]
• Confidence Score: [Low/Medium/High]
• Next Steps: [Short recommendation, Max 2-3 lines]
• Reasoning: [Brief explanation, Max 2-3 lines]
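As a rough illustration of how the prompt above might be filled in and how a reply in the requested response format could be read back, the following Python sketch may help. The function names `build_prompt` and `parse_response`, the comma-joined symptom list, and the sample reply in the usage note are assumptions made for illustration; the paper does not describe the tooling actually used to submit prompts or record outputs.

```python
# Hypothetical helpers: fill the study's prompt template and parse a reply
# that follows the requested "Response Format". Not the authors' code.

PROMPT_TEMPLATE = """You are an AI medical assistant specializing in disease classification. A patient presents with the following
symptoms:
- Symptoms: {symptoms}
Task:
1. Classify the disease based on the provided symptoms.
2. Identify the specific disease (predict only one).
3. Provide a confidence score (Low/Medium/High).
4. Suggest next steps (Limit: Maximum 2-3 lines).
5. Explain the reasoning behind the diagnosis (Limit: Maximum 2-3 lines).
Additional Instructions:
- If you are unable to classify the disease with confidence, request a hint.
- If a hint is needed, ask: "I need more information. Would you like me to list possible disease
categories?"
Response Format:
- Disease Classification: [Predicted category]
- Specific Diagnosis: [Exact disease name (Only one)]
- Confidence Score: [Low/Medium/High]
- Next Steps: [Short recommendation, Max 2-3 lines]
- Reasoning: [Brief explanation, Max 2-3 lines]
"""

FIELDS = (
    "Disease Classification",
    "Specific Diagnosis",
    "Confidence Score",
    "Next Steps",
    "Reasoning",
)


def build_prompt(symptoms):
    """Insert a list of symptom strings into the template (comma-joined by assumption)."""
    return PROMPT_TEMPLATE.format(symptoms=", ".join(symptoms))


def parse_response(text):
    """Map each response-format field to its value.

    Lines that do not start with a known field name are treated as a
    continuation of the previous field, since "Next Steps" and "Reasoning"
    may span 2-3 lines.
    """
    parsed = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip().lstrip("•*- ").strip()
        if not line:
            continue
        matched = False
        for field in FIELDS:
            if line.startswith(field + ":"):
                parsed[field] = line[len(field) + 1:].strip()
                current = field
                matched = True
                break
        if not matched and current is not None:
            parsed[current] += " " + line
    return parsed
```

For example, `parse_response("Specific Diagnosis: Type 2 Diabetes\nConfidence Score: High")` returns a dictionary with those two fields; a reply that asks for a hint instead of following the format simply yields an empty dictionary, which a caller can treat as "no prediction".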
3.4 Evaluation and Validation
In this study, we performed a rigorous quantitative evaluation of the LLMs’ diagnostic performance by assessing their
ability to accurately predict both the disease category and the specific disease based on a standardized set of symptoms.
The evaluation framework is based on a point-based scoring system and a series of quantitative metrics that provide a
comprehensive assessment of the model performance.
3.4.1 Scoring Criteria
For each test case, the following criteria were used:
•Disease Prediction: The LLM receives 1 point if the predicted disease exactly matches the ground truth;
otherwise, it receives 0 points.
•Category Prediction: The LLM receives 1 point if the predicted disease category corresponds to the correct
category; otherwise, it receives 0 points.
In addition, the confidence scores provided by the LLMs (categorized as High, Medium, or Low) were recorded for
each test case to evaluate the reliability of the predictions.
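The point-based criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `CaseScore`, `normalize`, and `score_case` are hypothetical, and treating "exactly matches" as a case- and whitespace-insensitive string comparison is an assumption, since the paper does not state how matches were judged.

```python
from dataclasses import dataclass

VALID_CONFIDENCE = ("High", "Medium", "Low")


@dataclass
class CaseScore:
    disease_point: int    # 1 if the predicted disease matches the ground truth, else 0
    category_point: int   # 1 if the predicted category matches the ground truth, else 0
    confidence: str       # "High", "Medium", or "Low", as reported by the model


def normalize(label):
    """Lower-case and collapse whitespace (assumed matching rule)."""
    return " ".join(label.lower().split())


def score_case(pred_disease, pred_category, confidence, true_disease, true_category):
    """Apply the 0/1 scoring criteria to a single test case."""
    level = confidence.strip().capitalize()
    if level not in VALID_CONFIDENCE:
        raise ValueError(f"unexpected confidence level: {confidence!r}")
    return CaseScore(
        disease_point=int(normalize(pred_disease) == normalize(true_disease)),
        category_point=int(normalize(pred_category) == normalize(true_category)),
        confidence=level,
    )
```

Under these assumptions, a case where the model predicts the right category but the wrong disease earns 0 disease points and 1 category point, which is how category-level accuracy can exceed disease-level accuracy.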
3.4.2 Quantitative Metrics
To quantitatively assess the performance, we defined the following metrics:
1. Disease-Level Accuracy. This metric quantifies the percentage of cases in which the LLM correctly predicted the
specific disease:
\[
\text{Disease-Level Accuracy} = \left( \frac{\sum_{i=1}^{N} \text{Point for Disease}_i}{N} \right) \times 100, \tag{1}
\]
where $N$ is the total number of test cases and $\text{Point for Disease}_i$ is 1 if the prediction for case $i$ is correct, and 0
otherwise.
2. Category-Level Accuracy. This metric measures the proportion of cases in which the LLM correctly classified the
disease into its appropriate general category:
\[
\text{Category-Level Accuracy} = \left( \frac{\sum_{i=1}^{N} \text{Point for Category}_i}{N} \right) \times 100, \tag{2}
\]
where $\text{Point for Category}_i$ is 1 if the category prediction for case $i$ is correct, and 0 otherwise.
3. Overall Accuracy. To provide a holistic view of the LLMs' performance, we combine the disease-level and
category-level accuracies:
\[
\text{Overall Accuracy} = \left( \frac{\sum_{i=1}^{N} \left( \text{Point for Disease}_i + \text{Point for Category}_i \right)}{2N} \right) \times 100. \tag{3}
\]
This metric equally weights both the specific disease and category predictions by dividing the total score by $2N$.
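Because the sum in Equation (3) splits into a disease term and a category term, the overall accuracy is simply the arithmetic mean of the two accuracies. The short derivation below makes this explicit and checks it against the percentages reported in this study (76% and 88% for DeepSeek R1, 72% and 78% for O3 Mini); it is a consistency check on the reported figures, not an additional result.

```latex
\begin{align*}
\text{Overall Accuracy}
  &= \left( \frac{\sum_{i=1}^{N} \text{Point for Disease}_i
            + \sum_{i=1}^{N} \text{Point for Category}_i}{2N} \right) \times 100 \\
  &= \frac{\text{Disease-Level Accuracy} + \text{Category-Level Accuracy}}{2}
\end{align*}

\[
\text{DeepSeek R1: } \frac{76 + 88}{2} = 82,
\qquad
\text{O3 Mini: } \frac{72 + 78}{2} = 75.
\]
```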
4. Confidence Score Distribution. This metric evaluates the reliability of the predictions by examining how frequently
each confidence level is assigned. It is computed as:
\[
\text{Confidence Score Distribution} = \frac{\text{Count of Cases at a Specific Confidence Level}}{N} \times 100, \tag{4}
\]
where the numerator represents the number of cases that received a particular confidence level (e.g., High, Medium, or
Low) and $N$ is the total number of test cases.
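The four metrics defined in this subsection reduce to a few lines of arithmetic over per-case points. The sketch below is one possible way to compute them in Python; the function names and the five-case toy data in the usage note are hypothetical and are not drawn from the study's dataset.

```python
from collections import Counter

LEVELS = ("High", "Medium", "Low")


def accuracy(points):
    """Equations (1) and (2): percentage of cases that scored 1 point."""
    if not points:
        raise ValueError("at least one test case is required")
    return 100.0 * sum(points) / len(points)


def overall_accuracy(disease_points, category_points):
    """Equation (3): total points divided by 2N, as a percentage."""
    n = len(disease_points)
    if n == 0 or n != len(category_points):
        raise ValueError("point lists must be non-empty and of equal length")
    return 100.0 * (sum(disease_points) + sum(category_points)) / (2 * n)


def confidence_distribution(levels):
    """Equation (4): percentage of cases assigned each confidence level."""
    if not levels:
        raise ValueError("at least one test case is required")
    counts = Counter(levels)
    return {level: 100.0 * counts.get(level, 0) / len(levels) for level in LEVELS}
```

With toy inputs `disease = [1, 1, 0, 1, 0]` and `category = [1, 1, 1, 1, 0]`, the disease-level accuracy is 60%, the category-level accuracy is 80%, and the overall accuracy is their mean, 70%.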
3.4.3 Implementation of the Evaluation Framework
The evaluation was conducted by inputting standardized symptom sets into the LLMs. For each test case, the
output—consisting of the disease classification, specific diagnosis, assigned confidence score, and any recommended
next steps—was recorded. The point-based scoring system was applied to each test case, and the metrics defined above
were computed over the entire test set. This systematic approach allowed for an objective comparison of the LLMs’
performance across all symptom sets and disease categories.
This quantitative evaluation framework ensures a comprehensive and objective assessment of the LLMs by focusing
on disease-level accuracy, category-level accuracy, overall accuracy, and the distribution of confidence scores. These
metrics provide valuable insights into the strengths and limitations of the LLMs, thereby supporting efforts to enhance
diagnostic reliability in clinical applications.
4 Results
This section presents the comparative performance analysis of the large language models (LLMs) DeepSeek R1 and
O3 Mini in disease classification. The analysis is structured based on multiple evaluation criteria, including accuracy
metrics, confidence score distribution, and category-specific performance. Figures 2 and 3 illustrate the key findings,
visually depicting accuracy variations across disease categories and confidence levels in model predictions.
4.1 Overall Performance Metrics
The evaluation of the models’ diagnostic capabilities was conducted at two levels: disease-level accuracy and category-
level accuracy. Disease-level accuracy measures how often a model correctly identifies a specific disease, while
category-level accuracy evaluates the ability to classify diseases into broader medical categories.
As summarized in Table 1, DeepSeek R1 demonstrated an overall disease-level accuracy of 76%, slightly outperforming
O3 Mini, which achieved 72%. Category-level accuracy followed a similar trend, with DeepSeek R1 achieving 88%
and O3 Mini attaining 78%. When combining disease and category classification accuracy, the overall accuracy of
DeepSeek R1 was 82%, whereas O3 Mini reached 75%.
The scatter plot in Figure 2 visually represents the classification accuracy of both models across different disease
categories. The results highlight that DeepSeek R1 consistently outperformed O3 Mini in most disease classifications,
particularly in Mental Health, Neurological Disorders, and Oncology (Cancer), where it reached 100% accuracy.
This suggests that DeepSeek R1 is highly proficient in diagnosing conditions that have well-defined symptomatology
and established medical literature.
Metric                     DeepSeek R1 (%)    O3 Mini (%)
Disease-Level Accuracy     76.00              72.00
Category-Level Accuracy    88.00              78.00
Overall Accuracy           82.00              75.00
Table 1: Comparison of Accuracy Metrics for DeepSeek R1 and O3 Mini.
Figure 2: Scatter Plot: Accuracy Comparison of DeepSeek R1 and O3 Mini.
4.2 Category-Specific Accuracy Analysis
A deeper examination of performance across disease categories (Figure 2) reveals distinct strengths and weaknesses.
DeepSeek R1 demonstrated superior accuracy in diagnosing Mental Health, Neurological Disorders, and Oncology
(Cancer), reaching 100% accuracy in these domains. This suggests that the model is highly effective at recognizing
symptom patterns for well-documented diseases with clear diagnostic criteria.
O3 Mini, on the other hand, exhibited a unique advantage in Autoimmune Diseases, achieving 100% accuracy
compared to DeepSeek R1’s 80%. Autoimmune diseases often have diverse and overlapping symptom presentations,
making accurate classification challenging. The performance of O3 Mini in this category suggests that it may be
particularly effective at recognizing subtle variations in symptoms that characterize these conditions.
Despite these successes, both models exhibited challenges in classifying Respiratory Diseases, where DeepSeek
R1 achieved 40% accuracy and O3 Mini trailed at 20%. This lower performance suggests that both models may
struggle with diseases that have highly overlapping symptoms, such as differentiating between asthma and chronic
obstructive pulmonary disease (COPD). Additionally, in Cardiovascular, Infectious, and Renal Disorders, both
models performed moderately, indicating areas where further model fine-tuning or the incorporation of additional
medical knowledge could improve performance.
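The category-specific figures discussed above amount to grouping the per-case points by ground-truth category and averaging within each group. A minimal Python sketch follows; the function name `per_category_accuracy` and the toy records are hypothetical, and the paper does not state how many test cases each category contained or whether the per-category figures use disease-level or category-level points, so the function accepts any 0/1 point.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Return {category: accuracy in percent} from (category, point) pairs.

    `point` is 1 for a correct prediction and 0 otherwise; `category` is the
    ground-truth category of the test case.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, point in records:
        totals[category] += 1
        correct[category] += point
    return {c: 100.0 * correct[c] / totals[c] for c in totals}
```

For instance, a category with five toy cases of which two are correct yields 40%, which shows how coarse the percentages become when a category holds only a handful of cases.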
4.3 Confidence Score Analysis
In addition to accuracy, the confidence levels of model predictions were analyzed. The confidence score distribution, as
detailed in Table 2, revealed that DeepSeek R1 provided high-confidence predictions in 92% of cases, whereas O3 Mini
exhibited high confidence in 68% of cases. Predictions classified as medium confidence accounted for 8% in DeepSeek
R1 and 32% in O3 Mini. Notably, neither model generated low-confidence predictions, which suggests that both LLMs
report a high degree of certainty in their classifications.
The heatmap in Figure 3 further highlights how confidence correlates with correctness. DeepSeek R1’s higher proportion
of high-confidence correct classifications suggests a more reliable decision-making process, reducing the likelihood
of uncertainty in clinical applications. Conversely, O3 Mini exhibited a greater proportion of medium-confidence
predictions, indicating that while it provides reasonable accuracy, it often does so with less certainty.
Confidence Level    DeepSeek R1 (%)    O3 Mini (%)
High                92.00              68.00
Medium              8.00               32.00
Low                 0.00               0.00
Table 2: Confidence Score Distribution (Percentage) for DeepSeek R1 and O3 Mini.
Figure 3: Heatmap of Confidence Score Metrics for DeepSeek R1 and O3 Mini.
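One way to examine how confidence relates to correctness is to cross-tabulate the reported confidence level against the 0/1 point for each case. The Python sketch below is an assumption about how such a table could be built; the paper does not specify the exact quantities plotted in Figure 3, and the names `confidence_by_correctness` and `error_rate` as well as the toy data are hypothetical.

```python
LEVELS = ("High", "Medium", "Low")


def confidence_by_correctness(cases):
    """Count correct and incorrect predictions at each confidence level.

    `cases` is an iterable of (confidence, point) pairs, where point is 1 for
    a correct prediction and 0 otherwise.
    """
    table = {level: {"correct": 0, "incorrect": 0} for level in LEVELS}
    for confidence, point in cases:
        if confidence not in table:
            raise ValueError(f"unexpected confidence level: {confidence!r}")
        key = "correct" if point == 1 else "incorrect"
        table[confidence][key] += 1
    return table


def error_rate(table, level):
    """Percentage of predictions at `level` that were wrong (None if no cases)."""
    row = table[level]
    total = row["correct"] + row["incorrect"]
    if total == 0:
        return None
    return 100.0 * row["incorrect"] / total
```

A high error rate among high-confidence predictions would indicate overconfidence, which the confidence distribution alone cannot reveal.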
4.4 Comparative Insights
From the overall evaluation, it is evident that DeepSeek R1 consistently outperforms O3 Mini in terms of classification
accuracy and confidence. The higher confidence levels and improved classification accuracy suggest that DeepSeek R1
may be a more suitable candidate for real-world medical applications requiring high certainty in diagnoses. However,
O3 Mini’s strong performance in Autoimmune Diseases suggests that different models may specialize in specific
medical domains.
A key area requiring further improvement is Respiratory Disease classification, where both models struggled. Future
work could involve fine-tuning these models with specialized datasets or hybrid approaches combining different LLM
architectures for better disease-specific performance.
These findings provide critical insights into the strengths and limitations of LLM-based medical diagnostics, offering a
roadmap for future advancements in AI-driven healthcare.
5 Discussion
Our evaluation of DeepSeek R1 and O3 Mini demonstrates that LLMs can provide promising diagnostic support in
disease classification, offering a new dimension to the way clinical data is processed and interpreted. The ability of
these models to analyze complex symptom patterns and generate diagnostic predictions suggests that they have the
potential to augment traditional diagnostic methods, thereby enhancing clinical decision-making. However, the study
also reveals several important considerations and challenges that must be addressed before such systems can be widely
adopted in clinical practice. Critical issues such as model interpretability, integration with existing healthcare systems,
and the need for robust validation across diverse patient populations highlight the careful and methodical approach
required for transitioning these promising tools from research to real-world application.
5.1 Ethical Concerns
The deployment of LLMs in medical diagnostics raises several ethical issues that must be addressed to ensure responsible
usage. Our results indicate that DeepSeek R1 consistently outperforms O3 Mini in most disease categories—with
overall accuracy of 82% compared to 75%—yet both models exhibit challenges in specific areas, such as Respiratory
Disease classification (40% and 20%, respectively). These discrepancies underscore the risk that biases in training
data may lead to uneven diagnostic performance across different conditions and, potentially, across diverse patient
populations [8, 15].
Fairness is a critical concern. If certain demographics or disease presentations are underrepresented, the LLMs might
perform well overall but fail for specific groups or conditions. Transparency in model development, rigorous validation
on diverse datasets, and continuous monitoring in clinical settings are essential to address such disparities [16].
Furthermore, the “black-box” nature of these models, as seen in our confidence score analysis—where DeepSeek R1
exhibited high confidence in 92% of cases versus 68% for O3 Mini—highlights the need for explainability. Clinicians
must understand the rationale behind model predictions, particularly when incorrect diagnoses occur. Establishing clear
guidelines that position LLMs as decision-support tools rather than replacements for clinical judgment will be crucial to
ensure accountability.
Patient privacy and data security are paramount in any diagnostic process, especially when leveraging LLM-based
platforms for medical decision-making. Strict adherence to data protection regulations, such as HIPAA and GDPR,
is essential to safeguard sensitive patient information. Ensuring compliance with these standards not only protects
against unauthorized access and misuse but also upholds the trust and confidentiality that are foundational to clinical
practice [5, 9].
5.2 Limitations
While the quantitative metrics are encouraging, several limitations must be acknowledged. The dataset used in this
study, although extensive, may not capture the full complexity of real-world clinical scenarios. For instance, our
analysis revealed high accuracy in domains like Mental Health, Neurological Disorders, and Oncology, yet both models
struggled with Respiratory Diseases—a reflection of potential gaps in the dataset or the inherent difficulty of classifying
conditions with overlapping symptoms.
Additionally, the exclusive reliance on textual symptom descriptions limits diagnostic precision. Clinical diagnoses
typically require multi-modal inputs such as imaging, laboratory tests, and patient histories. Our current approach,
which does not integrate these modalities, might lead to oversimplified assessments that do not fully capture the nuances
of patient presentations.
Moreover, while our evaluation framework includes quantitative metrics such as disease-level and category-level
accuracy, these measures do not entirely capture the clinical interpretability of model outputs. The confidence scores
provide a basic indication of reliability, yet the underlying decision-making process remains opaque. This limitation
could hinder clinical trust and acceptance.
5.3 Future Directions
The findings of this study suggest several avenues for future research. Enhancing model generalizability by incorporating
larger, more diverse datasets is imperative. Future work should aim to update training data continuously to reflect
evolving clinical practices and to cover a broader spectrum of patient demographics and disease presentations.
The integration of multi-modal data is another promising direction. Combining textual symptom analysis with other
sources—such as medical imaging, laboratory results, and comprehensive patient histories—could yield a more holistic
diagnostic tool, potentially addressing the observed deficiencies in areas like Respiratory Disease classification.
Improving model interpretability remains a critical focus. Future research should explore explainable AI techniques
that clarify the decision-making process behind LLM predictions. Techniques such as attention visualization or rule
extraction could help elucidate why models like DeepSeek R1 and O3 Mini assign certain confidence levels, thereby
building clinician trust.
Lastly, rigorous clinical validation through prospective studies and controlled trials is essential. Real-world evaluations
will help confirm the robustness and reliability of these LLM-based systems and determine how best to integrate them
into clinical workflows, ensuring that they complement and enhance, rather than replace, clinical expertise. Collectively,
addressing these ethical, technical, and practical challenges will be vital for harnessing the full potential of LLMs in
medical diagnostics and ultimately improving patient outcomes.
6 Conclusion
This study evaluated the diagnostic performance of DeepSeek R1 and O3 Mini in disease classification across multiple
categories using a rigorously structured dataset and a comprehensive quantitative evaluation framework. Our results
demonstrate that LLMs can provide promising diagnostic support in medical applications. Specifically, DeepSeek R1
outperformed O3 Mini in overall accuracy (82% versus 75%) and in the share of high-confidence predictions (92%
versus 68%), suggesting a stronger ability to analyze complex symptom patterns. However, O3 Mini reached 100%
accuracy in Autoimmune Disease classification, compared with 80% for DeepSeek R1, suggesting that different
models may have specialized advantages in certain clinical domains.
The study also identified critical areas for improvement, particularly in the classification of Respiratory Diseases,
where both models underperformed (40% accuracy for DeepSeek R1 and 20% for O3 Mini). These findings highlight
the need for continued refinement and adaptation of LLM-based diagnostic tools to address the inherent challenges of
overlapping symptomatology and data variability. Additionally, the importance of ethical considerations, including
transparency, fairness, and compliance with data protection regulations, was underscored as essential for the
responsible integration of LLMs into clinical practice.
In conclusion, our research contributes valuable insights to the growing body of work on LLM-based medical diagnostics.
As these models continue to evolve, ensuring rigorous clinical validation and the development of more interpretable,
equitable diagnostic tools will be paramount. This will not only enhance diagnostic accuracy but also build greater trust
among clinicians and patients, ultimately advancing global healthcare outcomes.
References
[1] X. Zhou and Y. Peng. A scoping review of LLM-based methods for disease diagnosis. arXiv preprint, 2024.
[2] L. Wang, H. Zhao, and K. Chen. Probabilistic medical predictions using large language models: Enhancing
clinical flexibility with prompt engineering. npj Digital Medicine, 7:1–15, 2024.
[3] R. Davis, E. Thompson, and S. Lee. Large language model influence on diagnostic reasoning. JAMA Network
Open, 6(5):e2310062, 2023.
[4] L. Anderson, P. Wu, and R. Zhang. Evaluation of large language models as a diagnostic aid for complex cases.
Frontiers in Medicine, 2024.
[5] Gaurav Kumar Gupta, Aditi Singh, Sijo Valayakkad Manikandan, and Abul Ehtesham. Digital diagnostics: The
potential of large language models in recognizing symptoms of common illnesses. AI, 6(1), 2025.
[6] DeepSeek AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.
https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf, 2024. Accessed: February 2024.
[7] OpenAI. OpenAI o3-mini. https://openai.com/index/openai-o3-mini, 2024. Accessed: February 2024.
[8] M. He, J. Xu, and T. Wang. A survey on large language models in healthcare: Applications, challenges, and future
directions. arXiv preprint, 2023.
[9] Z. Chen, S. Moore, and R. Patel. Current applications and challenges in large language models for patient care.
Communications Medicine, 4(56), 2024.
[10] Uttam Dhakal, Aniket Kumar Singh, Suman Devkota, Yogesh Sapkota, Bishal Lamichhane, Suprinsa Paudyal,
and Chandra Dhakal. GPT-4’s assessment of its performance in a USMLE-based case study, 2024.
[11] J. Sun, W. Li, and Y. Zhang. Conversational disease diagnosis using external planner-controlled LLMs. arXiv
preprint, 2024.
[12] HealthLink BC. HealthLink BC: Symptom checker and disease information. https://www.healthlinkbc.ca,
2024. Accessed: February 2024.
[13] Mayo Clinic. Mayo Clinic: Disease symptoms and medical information. https://www.mayoclinic.org, 2024.
Accessed: February 2024.
[14] WebMD. WebMD: Disease symptoms, diagnosis, and treatment. https://www.webmd.com, 2024. Accessed:
February 2024.
[15] Aniket Kumar Singh, Bishal Lamichhane, Suman Devkota, Uttam Dhakal, and Chandra Dhakal. Do large language
models show human-like biases? exploring confidence—competence gap in ai. Information , 15(2), 2024.
[16] D. McDuff, J. Cohn, and J. Hernandez. Large language models for differential diagnosis: A new approach for
AI-assisted healthcare. arXiv preprint, 2023.
[17] Aditi Singh, Abul Ehtesham, Gaurav Kumar Gupta, Nikhil Kumar Chatta, Saket Kumar, and Tala Talaei Khoei.
Exploring prompt engineering: A systematic review with SWOT analysis, 2024.
[18] Aditi Singh, Abul Ehtesham, Saket Kumar, Gaurav Kumar Gupta, and Tala Talaei Khoei. Encouraging responsible
use of generative AI in education: A reward-based learning approach. In Tim Schlippe, Eric C. K. Cheng, and
Tianchong Wang, editors, Artificial Intelligence in Education Technologies: New Development and Innovative
Practices, pages 404–413, Singapore, 2025. Springer Nature Singapore.
[19] Wan Hang Keith Chiu, Wei Sum Koel Ko, William Chi Shing Cho, Sin Yu Joanne Hui, Wing Chi Lawrence Chan,
and Michael D Kuo. Evaluating the diagnostic performance of large language models on complex multimodal
medical cases. J Med Internet Res, 26:e53724, May 2024.
[20] Esraa Hassan, Tarek Abd El-Hafeez, and Mahmoud Y. Shams. Optimizing classification of diseases through
language model analysis of symptoms. Scientific Reports, 14(1):1507, 2024.
[21] L. Wang, H. Zhao, and K. Chen. Probabilistic medical predictions of large language models: Enhancing clinical
flexibility with prompt engineering. npj Digital Medicine, 7:1–15, 2024.
[22] A. Kumar, V. Singh, and M. Patel. Systematic benchmarking demonstrates large language models’ potential in
diagnosing genetic diseases. medRxiv, 2024.
[23] R. Davis, E. Thompson, and S. Lee. Large language models with retrieval-augmented generation for zero-shot
disease phenotyping. arXiv preprint, 2023.
[24] J. Lee, D. Kim, and P. Brown. Large language models are clinical reasoners: Reasoning-aware diagnosis
framework with prompt-generated rationales. arXiv preprint, 2023.
[25] H. Smith, R. Gonzalez, and C. Lee. RareBench: Can LLMs serve as rare diseases specialists? arXiv preprint, 2024.
[26] Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, and Sergio Segura. o3-mini vs DeepSeek-R1:
Which one is safer?, 2025.
Abbreviations
The following abbreviations are used in this manuscript:
• AI: Artificial Intelligence
• LLM(s): Large Language Model(s)
• HIPAA: Health Insurance Portability and Accountability Act
• GDPR: General Data Protection Regulation
• EHR: Electronic Health Record
• NLP: Natural Language Processing
A Accuracy Comparison of LLMs Across Disease Categories
Disease Category            | DeepSeek R1 Accuracy (%) | O3 Mini Accuracy (%)
Autoimmune Diseases         | 80.00                    | 100.00
Cardiovascular Diseases     | 60.00                    | 60.00
Endocrine Disorders         | 80.00                    | 80.00
Gastrointestinal Disorders  | 80.00                    | 80.00
Infectious Diseases         | 60.00                    | 60.00
Mental Health Disorders     | 100.00                   | 100.00
Neurological Disorders      | 100.00                   | 100.00
Oncology (Cancer)           | 100.00                   | 80.00
Renal Disorders (Kidney)    | 60.00                    | 40.00
Respiratory Diseases        | 40.00                    | 20.00
Table 3: Accuracy Comparison of DeepSeek R1 and O3 Mini Across Disease Categories.
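As a quick consistency check on Table 3, the sketch below takes the unweighted mean of the per-category accuracies and lists the categories where the two models differ. It assumes every category carries equal weight, which the paper does not state explicitly. The names `table3` and `macro_average` are illustrative and do not come from the study's code. Under that assumption, the means come out to 76% for DeepSeek R1 and 72% for O3 Mini, matching the disease-level accuracies reported in the abstract.

```python
# Per-category accuracies (%) copied from Table 3:
# category -> (DeepSeek R1, O3 Mini)
table3 = {
    "Autoimmune Diseases": (80.0, 100.0),
    "Cardiovascular Diseases": (60.0, 60.0),
    "Endocrine Disorders": (80.0, 80.0),
    "Gastrointestinal Disorders": (80.0, 80.0),
    "Infectious Diseases": (60.0, 60.0),
    "Mental Health Disorders": (100.0, 100.0),
    "Neurological Disorders": (100.0, 100.0),
    "Oncology (Cancer)": (100.0, 80.0),
    "Renal Disorders (Kidney)": (60.0, 40.0),
    "Respiratory Diseases": (40.0, 20.0),
}


def macro_average(values):
    """Unweighted mean; assumes each category counts equally."""
    values = list(values)
    return sum(values) / len(values)


deepseek_r1_mean = macro_average(d for d, _ in table3.values())
o3_mini_mean = macro_average(o for _, o in table3.values())

# Signed gap (DeepSeek R1 minus O3 Mini) for categories where the models differ.
gaps = {name: d - o for name, (d, o) in table3.items() if d != o}

print(f"DeepSeek R1 mean accuracy: {deepseek_r1_mean:.1f}%")  # 76.0%
print(f"O3 Mini mean accuracy:     {o3_mini_mean:.1f}%")      # 72.0%
for name, gap in gaps.items():
    print(f"{name}: {gap:+.0f} percentage points")
```

The models differ in only four of the ten categories: O3 Mini leads in Autoimmune Diseases, and DeepSeek R1 leads in Oncology, Renal Disorders, and Respiratory Diseases, each by 20 percentage points.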