arxiv

Paper 2503.10486

LLMs in Disease Diagnosis: A Comparative Study of DeepSeek-R1 and O3 Mini Across Chronic Health Conditions

Authors: Gaurav Kumar Gupta, Pranal Pande

Published: 2025-03-13

Abstract:

Large Language Models (LLMs) are revolutionizing medical diagnostics by enhancing both disease classification and clinical decision-making. In this study, we evaluate the performance of two LLM-based diagnostic tools, DeepSeek R1 and O3 Mini, using a structured dataset of symptoms and diagnoses. We assessed their predictive accuracy at both the disease and category levels, as well as the reliability of their confidence scores. DeepSeek R1 achieved a disease-level accuracy of 76% and an overall accuracy of 82%, outperforming O3 Mini, which attained 72% and 75% respectively. Notably, DeepSeek R1 demonstrated exceptional performance in Mental Health, Neurological Disorders, and Oncology, where it reached 100% accuracy, while O3 Mini excelled in Autoimmune Disease classification with 100% accuracy. Both models, however, struggled with Respiratory Disease classification, recording accuracies of only 40% for DeepSeek R1 and 20% for O3 Mini. Additionally, the analysis of confidence scores revealed that DeepSeek R1 provided high-confidence predictions in 92% of cases, compared to 68% for O3 Mini. Ethical considerations regarding bias, model interpretability, and data privacy are also discussed to ensure the responsible integration of LLMs into clinical practice. Overall, our findings offer valuable insights into the strengths and limitations of LLM-based diagnostic systems and provide a roadmap for future enhancements in AI-driven healthcare.

Paper Content:

LLMs in Disease Diagnosis: A Comparative Study of DeepSeek-R1 and O3 Mini Across Chronic Health Conditions

Gaurav Kumar Gupta
Youngstown State University, Youngstown, OH
gkgupta@student.ysu.edu

Pranal Pande
Youngstown State University, Youngstown, OH
ppande@student.ysu.edu
arXiv:2503.10486v1 [cs.CL] 13 Mar 2025

Keywords: Large Language Models · Medical Diagnostics · Disease Classification · DeepSeek R1 · O3 Mini · Clinical Decision Support · AI in Healthcare · Diagnostic Accuracy · Data Privacy · Ethical AI

1 Introduction

Large Language Models (LLMs) have emerged as groundbreaking advancements in artificial intelligence (AI), reshaping various sectors, including healthcare, through their ability to process and generate human-like text. In medical diagnostics, LLMs are being increasingly leveraged to support clinicians by interpreting complex patient data, identifying subtle patterns, and providing diagnostic insights that may otherwise be overlooked. By processing extensive clinical data, from electronic health records and medical literature to patient-reported symptoms, these models offer the potential not only to classify diseases into broad categories but also to predict specific disease names, thereby refining the diagnostic process [1, 2, 3].

The integration of LLMs into healthcare has the potential to enhance diagnostic processes by improving accuracy, efficiency, and overall clinical decision-making. For instance, LLMs can rapidly analyze unstructured clinical narratives to extract critical diagnostic features and uncover underlying patterns within complex patient data [4, 5]. By synthesizing information from diverse sources, these models offer valuable insights that support clinicians in their diagnostic reasoning, ultimately contributing to more informed and effective treatment decisions.

Recent advancements in AI have led to the development of specialized LLM-based platforms tailored for medical applications. Among these, DeepSeek R1 and O3 Mini have attracted considerable attention for their robust performance in both automated disease classification and disease name prediction.
DeepSeek R1 is designed to capture the nuances of clinical language, enabling it to differentiate between diseases with overlapping symptom profiles with high accuracy [6]. Conversely, O3 Mini emphasizes computational efficiency and scalability, making it particularly suitable for rapid diagnostic support in high-volume or resource-constrained clinical settings [7]. Notably, while preliminary findings suggest that DeepSeek R1 may offer superior overall diagnostic accuracy and confidence levels, O3 Mini has shown distinct strengths in certain domains, such as Autoimmune Disease prediction.

Despite these promising developments, the clinical deployment of LLM-based diagnostic tools requires rigorous validation across diverse patient populations and clinical scenarios. It is also crucial to address the ethical implications of these technologies, including potential biases in training data, the opacity of decision-making processes, and stringent data privacy requirements [8, 9]. Ensuring compliance with data protection standards such as HIPAA and GDPR is essential for maintaining patient trust and safeguarding sensitive information.

In this study, we evaluate the diagnostic performance of DeepSeek R1 and O3 Mini by comparing their ability to classify diseases and predict specific disease names across multiple clinical categories. Our research aims to provide a comprehensive assessment of these LLM-based systems, highlighting their strengths and identifying key areas for improvement. Through this work, we contribute to the ongoing discourse on AI-driven diagnostics and offer insights that could pave the way for more robust, interpretable, and ethically responsible diagnostic tools in healthcare.

2 Related Work

Artificial Intelligence (AI) and Large Language Models (LLMs) have been increasingly explored for their role in disease diagnosis, leveraging deep learning techniques to analyze clinical data, electronic health records (EHRs), and symptom descriptions.
Several studies have demonstrated the effectiveness of LLMs in disease classification, differential diagnosis, and clinical decision support [4][9][3][10].

A recent scoping review by Zhou et al. (2024) provides a comprehensive analysis of LLM-based methods for disease diagnosis, examining various disease types, clinical datasets, and evaluation techniques [1]. The study highlights the efficacy of LLMs in diagnostic tasks and underscores the need for standardized benchmarks to ensure fair performance evaluation. Unlike our study, which focuses on model-specific evaluations, their review broadly assesses LLM methodologies and architectures.

Similarly, Wang et al. (2024) investigated the probabilistic medical predictions of LLMs, demonstrating how prompt engineering can enhance the accuracy and flexibility of clinical diagnoses [2]. Their findings suggest that LLMs can dynamically adjust diagnostic predictions based on evolving patient information. In contrast, our study evaluates the confidence scores of specialized AI models, DeepSeek R1 and O3 Mini, across multiple disease categories, providing a structured approach to assessing AI reliability in medical applications.

Furthermore, Sun et al. (2024) introduced a conversational AI model that mimics doctor-patient interactions using reinforcement learning to refine diagnostic reasoning [11]. Their system optimizes follow-up questioning and achieves high performance in disease screening and differential diagnosis. Unlike this approach, our study does not involve interactive AI responses but rather evaluates the direct diagnostic accuracy and confidence levels of predefined AI models.

A broader survey by He et al. (2023) examined the deployment of LLMs across various healthcare domains, highlighting key challenges such as fairness, interpretability, accountability, and ethical considerations [8].
While their work provides a conceptual overview of LLM applications in healthcare, our study contributes by offering an empirical evaluation of AI model performance in disease classification.

Our prior study, Digital Diagnostics: The Potential of Large Language Models in Recognizing Symptoms of Common Illnesses [5], focused on analyzing the capabilities of LLMs in diagnosing common illnesses. In this work, we expand upon that foundation by classifying chronic diseases, including Diabetes, Cancer, Heart Disease, Mental Health disorders, and Autoimmune Diseases. This shift towards chronic disease classification enables a more structured evaluation of AI model performance in handling complex, long-term medical conditions.

Unlike general LLM-based studies that assess language models on broad medical knowledge, our research specifically evaluates DeepSeek R1 and O3 Mini for their ability to classify chronic diseases with high accuracy and confidence. We systematically assess their performance across multiple disease categories, highlighting both their strengths and areas where improvements are needed. Additionally, this study introduces a confidence-based evaluation approach, ensuring that AI-driven disease classification models not only achieve high accuracy but also provide reliable predictions that can be confidently used in real-world healthcare applications.

A key challenge identified in our research is the classification of Respiratory Diseases, an area where both models show relatively lower performance. Addressing such limitations is crucial for enhancing AI adoption in clinical environments.

By conducting a detailed comparison of these AI models and assessing their diagnostic reliability, this study contributes to the ongoing research in AI-driven medical diagnostics, offering practical insights into the optimization and deployment of AI systems in clinical settings.
Our findings provide a structured understanding of how AI models perform in chronic disease classification and suggest potential future improvements for their integration into medical practice.

3 Methodology

The dataset used in this study was obtained from authoritative medical sources, including Mayo Clinic, WebMD, WHO, HealthLink BC, and Penn Medicine [12][13][14]. These sources provide verified and widely accepted medical knowledge that serves as a foundation for clinical decision-making and disease diagnosis. The dataset was structured to include comprehensive information on disease categories, specific diseases, and their corresponding symptoms. Each disease was systematically mapped to a set of commonly reported symptoms based on information curated from these medical institutions, ensuring that the dataset represented real-world symptom presentations.

Our study specifically targets the classification of chronic diseases, including Cancer, Diabetes, Heart Disease, and Mental Health Disorders. Chronic diseases often present with overlapping symptoms, making accurate classification a challenging yet essential task for improving medical diagnostics.

3.1 LLMs Used

To conduct this study, we employed two state-of-the-art large language models (LLMs), DeepSeek R1 and O3 Mini, which have been developed to enhance disease classification based on symptom analysis. These models were selected due to their advanced capabilities in processing and interpreting medical language, leveraging extensive training data sourced from clinical texts, research publications, and structured medical databases [6][7].

DeepSeek R1 is a sophisticated LLM designed to process medical text efficiently and derive meaningful diagnostic insights. It has been trained on diverse medical literature and symptom-disease mappings, enabling it to predict potential diseases with high accuracy.
Its architecture allows for a deep understanding of complex symptom patterns, improving its capability to differentiate between diseases with similar symptom presentations, which is crucial for chronic disease diagnosis [6].

O3 Mini, on the other hand, is a lightweight LLM optimized for real-time medical decision support. While it prioritizes computational efficiency, it still maintains strong diagnostic accuracy, making it suitable for scenarios requiring quick disease classification with minimal resource consumption. O3 Mini integrates knowledge from structured medical databases and symptom assessment algorithms, allowing it to provide concise and interpretable diagnostic predictions. Its efficiency makes it particularly valuable for early detection and continuous monitoring of chronic diseases [7].

Both models were evaluated by inputting carefully curated symptom sets derived from our dataset. The LLMs processed these inputs and generated outputs that included disease category classification, specific disease predictions, confidence scores, suggested diagnostic steps, and a brief reasoning behind their decision. These outputs were systematically recorded for further analysis, enabling a comprehensive assessment of each model’s performance in chronic disease classification and diagnostic reasoning. By focusing on chronic diseases, this study aims to provide a structured evaluation of how LLMs can enhance medical diagnostics and improve disease prediction accuracy in real-world healthcare applications.

3.2 Data Collection and Testing Process

To evaluate the performance of the models, we compiled a dataset consisting of 50 distinct symptoms for each disease category, sourced from reputable medical organizations [12][13][14]. These symptoms were carefully curated to ensure a broad and representative coverage of each disease class, reducing potential bias and improving generalizability.
The selection process focused on ensuring that both common and less frequent symptoms were included, capturing a realistic distribution of symptom presentations. Given our focus on chronic diseases such as Cancer, Diabetes, Heart Disease, and Mental Health Disorders, the dataset was tailored to reflect the most clinically relevant symptomatology for these conditions.

Once the dataset was finalized, the symptoms were formatted into structured input queries and fed into the LLMs using a standardized prompt. The objective was to simulate real-world diagnostic scenarios where models would be required to analyze symptom patterns and provide meaningful predictions. Each model was tasked with identifying the correct disease category, predicting the most probable disease name, assigning a confidence score (Low/Medium/High), suggesting appropriate next steps for diagnosis or management, and providing a concise explanation of its reasoning. This approach allowed us to capture the interpretability and reliability of the models’ predictions, ensuring a robust evaluation framework.

Figure 1: Data Collection Process

Figure 1 illustrates the structured process of data collection and testing. It begins with symptom collection, followed by data structuring, input processing through the LLMs, and output evaluation. Each model processes the standardized symptom inputs and generates diagnostic insights, which are systematically recorded for analysis. The figure outlines key stages, including disease identification, confidence score assignment, and reasoning evaluation, highlighting the step-by-step methodology employed in our study.

The outputs generated by DeepSeek R1 and O3 Mini were systematically recorded for further analysis. These results were compared against the ground truth data to measure classification accuracy, confidence alignment, and diagnostic validity.
Additionally, we examined cases where the models exhibited uncertainty or misclassification, providing insights into their limitations and areas for potential improvement. This structured testing and evaluation process enabled a comprehensive assessment of how effectively each LLM could be leveraged for real-world disease classification and clinical decision support.

3.3 LLM Evaluation Prompt

The LLMs were evaluated using a standardized prompt to classify diseases based on given symptoms. The prompt used in this study was as follows:

Prompt for Models

    You are an AI medical assistant specializing in disease classification.
    A patient presents with the following symptoms:
    - Symptoms: [Enter symptoms here]

    Task:
    1. Classify the disease based on the provided symptoms.
    2. Identify the specific disease (predict only one).
    3. Provide a confidence score (Low/Medium/High).
    4. Suggest next steps (Limit: Maximum 2-3 lines).
    5. Explain the reasoning behind the diagnosis (Limit: Maximum 2-3 lines).

    Additional Instructions:
    • If you are unable to classify the disease with confidence, request a hint.
    • If a hint is needed, ask: "I need more information. Would you like me to list possible disease categories?"

    Response Format:
    • Disease Classification: [Predicted category]
    • Specific Diagnosis: [Exact disease name (Only one)]
    • Confidence Score: [Low/Medium/High]
    • Next Steps: [Short recommendation, Max 2-3 lines]
    • Reasoning: [Brief explanation, Max 2-3 lines]

3.4 Evaluation and Validation

In this study, we performed a rigorous quantitative evaluation of the LLMs’ diagnostic performance by assessing their ability to accurately predict both the disease category and the specific disease based on a standardized set of symptoms. The evaluation framework is based on a point-based scoring system and a series of quantitative metrics that provide a comprehensive assessment of the model performance.
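Before points can be assigned, each free-text reply has to be turned into fields that mirror the Response Format of Section 3.3. The paper does not describe how replies were recorded, so the following is only a minimal Python sketch under stated assumptions: each field is assumed to sit on its own line as "label: value", optionally preceded by a bullet, and the names `FIELDS`, `VALID_CONFIDENCE`, and `parse_response` are hypothetical helpers introduced here for illustration.

```python
import re

# Field labels copied from the "Response Format" part of the prompt.
FIELDS = (
    "Disease Classification",
    "Specific Diagnosis",
    "Confidence Score",
    "Next Steps",
    "Reasoning",
)
VALID_CONFIDENCE = {"Low", "Medium", "High"}


def parse_response(text: str) -> dict:
    """Extract the five labelled fields from one model reply.

    Assumption (not from the paper): every field appears on its own line
    as "<label>: <value>", optionally preceded by a bullet, dash, or
    asterisk. Fields that cannot be found are returned as None.
    """
    record = {}
    for label in FIELDS:
        pattern = rf"^[ \t•\-*]*{re.escape(label)}[ \t]*:[ \t]*(.+)$"
        match = re.search(pattern, text, flags=re.MULTILINE)
        record[label] = match.group(1).strip() if match else None

    # Normalise the confidence label. Anything outside the three allowed
    # values is treated as unparseable instead of being guessed.
    confidence = record["Confidence Score"]
    if confidence is not None:
        confidence = confidence.capitalize()
        record["Confidence Score"] = (
            confidence if confidence in VALID_CONFIDENCE else None
        )
    return record
```

Returning `None` for a missing or malformed field keeps such replies visible for manual review, which matters here because the prompt allows a model to ask for a hint instead of answering.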
3.4.1 Scoring Criteria

For each test case, the following criteria were used:

• Disease Prediction: The LLM receives 1 point if the predicted disease exactly matches the ground truth; otherwise, it receives 0 points.
• Category Prediction: The LLM receives 1 point if the predicted disease category corresponds to the correct category; otherwise, it receives 0 points.

In addition, the confidence scores provided by the LLMs (categorized as High, Medium, or Low) were recorded for each test case to evaluate the reliability of the predictions.

3.4.2 Quantitative Metrics

To quantitatively assess the performance, we defined the following metrics:

1. Disease-Level Accuracy
This metric quantifies the percentage of cases in which the LLM correctly predicted the specific disease:

\[
\text{Disease-Level Accuracy} = \left( \frac{\sum_{i=1}^{N} \text{Point for Disease}_i}{N} \right) \times 100, \tag{1}
\]

where $N$ is the total number of test cases and $\text{Point for Disease}_i$ is 1 if the prediction for case $i$ is correct, and 0 otherwise.

2. Category-Level Accuracy
This metric measures the proportion of cases in which the LLM correctly classified the disease into its appropriate general category:

\[
\text{Category-Level Accuracy} = \left( \frac{\sum_{i=1}^{N} \text{Point for Category}_i}{N} \right) \times 100, \tag{2}
\]

where $\text{Point for Category}_i$ is 1 if the category prediction for case $i$ is correct, and 0 otherwise.

3. Overall Accuracy
To provide a holistic view of the LLMs’ performance, we combine the disease-level and category-level accuracies:

\[
\text{Overall Accuracy} = \left( \frac{\sum_{i=1}^{N} \left( \text{Point for Disease}_i + \text{Point for Category}_i \right)}{2N} \right) \times 100. \tag{3}
\]

This metric equally weights both the specific disease and category predictions by dividing the total score by $2N$.

4. Confidence Score Distribution
This metric evaluates the reliability of the predictions by examining how frequently each confidence level is assigned.
It is computed as:

\[
\text{Confidence Score Distribution} = \frac{\text{Count of Cases at a Specific Confidence Level}}{N} \times 100, \tag{4}
\]

where the numerator represents the number of cases that received a particular confidence level (e.g., High, Medium, or Low) and $N$ is the total number of test cases.

3.4.3 Implementation of the Evaluation Framework

The evaluation was conducted by inputting standardized symptom sets into the LLMs. For each test case, the output (the disease classification, specific diagnosis, assigned confidence score, and any recommended next steps) was recorded. The point-based scoring system was applied to each test case, and the metrics defined above were computed over the entire test set. This systematic approach allowed for an objective comparison of the LLMs’ performance across all symptom sets and disease categories.

This quantitative evaluation framework ensures a comprehensive and objective assessment of the LLMs by focusing on disease-level accuracy, category-level accuracy, overall accuracy, and the distribution of confidence scores. These metrics provide valuable insights into the strengths and limitations of the LLMs, thereby supporting efforts to enhance diagnostic reliability in clinical applications.

4 Results

This section presents the comparative performance analysis of the large language models (LLMs) DeepSeek R1 and O3 Mini in disease classification. The analysis is structured based on multiple evaluation criteria, including accuracy metrics, confidence score distribution, and category-specific performance. Figures 2 and 3 illustrate the key findings, visually depicting accuracy variations across disease categories and confidence levels in model predictions.

4.1 Overall Performance Metrics

The evaluation of the models’ diagnostic capabilities was conducted at two levels: disease-level accuracy and category-level accuracy.
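The scoring rules of Section 3.4.1 and the four metrics of Section 3.4.2 can be sketched in a few lines of Python. This is an illustration, not the authors' implementation: the `CaseResult` record, the function names, and the normalisation used in `point` (trimming whitespace and ignoring letter case before comparing) are assumptions made here, since the paper does not specify how an "exact match" was determined.

```python
from collections import Counter
from dataclasses import dataclass

CONFIDENCE_LEVELS = ("High", "Medium", "Low")


@dataclass
class CaseResult:
    """One scored test case (field names are illustrative)."""
    disease_point: int   # 1 if the predicted disease matches, else 0
    category_point: int  # 1 if the predicted category matches, else 0
    confidence: str      # "High", "Medium", or "Low"


def point(predicted: str, truth: str) -> int:
    """Return 1 for a match and 0 otherwise (Section 3.4.1).

    Assumption: labels are compared after trimming whitespace and
    ignoring letter case.
    """
    return int(predicted.strip().casefold() == truth.strip().casefold())


def disease_level_accuracy(results: list) -> float:
    """Eq. (1): share of cases with the correct specific disease."""
    return 100.0 * sum(r.disease_point for r in results) / len(results)


def category_level_accuracy(results: list) -> float:
    """Eq. (2): share of cases with the correct disease category."""
    return 100.0 * sum(r.category_point for r in results) / len(results)


def overall_accuracy(results: list) -> float:
    """Eq. (3): both points per case, divided by 2N."""
    total = sum(r.disease_point + r.category_point for r in results)
    return 100.0 * total / (2 * len(results))


def confidence_distribution(results: list) -> dict:
    """Eq. (4): percentage of cases at each confidence level."""
    counts = Counter(r.confidence for r in results)
    return {lvl: 100.0 * counts[lvl] / len(results) for lvl in CONFIDENCE_LEVELS}
```

All functions assume a non-empty list of results. Because Eq. (3) sums both points over the same N cases, Overall Accuracy is always the arithmetic mean of Eq. (1) and Eq. (2). On a toy set of four cases with disease points 1, 0, 0, 1 and category points 1, 1, 0, 1, the three accuracies come out as 50.0, 75.0, and 62.5.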
Disease-level accuracy measures how often a model correctly identifies a specific disease, while category-level accuracy evaluates the ability to classify diseases into broader medical categories. As summarized in Table 1, DeepSeek R1 achieved a disease-level accuracy of 76%, slightly outperforming O3 Mini, which achieved 72%. Category-level accuracy followed a similar trend, with DeepSeek R1 achieving 88% and O3 Mini attaining 78%. When combining disease and category classification accuracy, the overall accuracy of DeepSeek R1 was 82%, whereas O3 Mini reached 75%.

The scatter plot in Figure 2 visually represents the classification accuracy of both models across different disease categories. The results highlight that DeepSeek R1 outperformed O3 Mini in most disease classifications, particularly in Mental Health, Neurological Disorders, and Oncology (Cancer), where it reached 100% accuracy. This suggests that DeepSeek R1 is highly proficient in diagnosing conditions that have well-defined symptomatology and established medical literature.

Metric                     DeepSeek R1 (%)   O3 Mini (%)
Disease-Level Accuracy     76.00             72.00
Category-Level Accuracy    88.00             78.00
Overall Accuracy           82.00             75.00

Table 1: Comparison of Accuracy Metrics for DeepSeek R1 and O3 Mini.

Figure 2: Scatter Plot: Accuracy Comparison of DeepSeek R1 and O3 Mini.

4.2 Category-Specific Accuracy Analysis

A deeper examination of performance across disease categories (Figure 2) reveals distinct strengths and weaknesses. DeepSeek R1 demonstrated superior accuracy in diagnosing Mental Health, Neurological Disorders, and Oncology (Cancer), reaching 100% accuracy in these domains. This suggests that the model is highly effective at recognizing symptom patterns for well-documented diseases with clear diagnostic criteria.

O3 Mini, on the other hand, exhibited a unique advantage in Autoimmune Diseases, achieving 100% accuracy compared to DeepSeek R1’s 80%.
Autoimmune diseases often have diverse and overlapping symptom presentations, making accurate classification challenging. The performance of O3 Mini in this category suggests that it may be particularly effective at recognizing subtle variations in symptoms that characterize these conditions.

Despite these successes, both models exhibited challenges in classifying Respiratory Diseases, where DeepSeek R1 achieved 40% accuracy and O3 Mini trailed at 20%. This lower performance suggests that both models may struggle with diseases that have highly overlapping symptoms, such as differentiating between asthma and chronic obstructive pulmonary disease (COPD). Additionally, in Cardiovascular, Infectious, and Renal Disorders, both models performed moderately, indicating areas where further model fine-tuning or the incorporation of additional medical knowledge could improve performance.

4.3 Confidence Score Analysis

In addition to accuracy, the confidence levels of model predictions were analyzed. The confidence score distribution, as detailed in Table 2, revealed that DeepSeek R1 provided high-confidence predictions in 92% of cases, whereas O3 Mini exhibited high confidence in 68% of cases. Predictions classified as medium confidence accounted for 8% in DeepSeek R1 and 32% in O3 Mini. Notably, neither model generated low-confidence predictions, which suggests that both LLMs have a high degree of internal certainty in their classifications.

The heatmap in Figure 3 further highlights how confidence correlates with correctness. DeepSeek R1’s higher proportion of high-confidence correct classifications suggests a more reliable decision-making process, reducing the likelihood of uncertainty in clinical applications. Conversely, O3 Mini exhibited a greater proportion of medium-confidence predictions, indicating that while it provides reasonable accuracy, it often does so with less certainty.
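Both the per-category breakdown and the confidence-versus-correctness view discussed above can be derived from the same per-case records. The Python sketch below illustrates one way to do that under stated assumptions: each case is represented as a hypothetical `(true_category, point, confidence)` tuple, and the point is taken to be the disease-level point, because the paper does not say which point underlies the category figures. The function names are likewise illustrative.

```python
from collections import defaultdict

CONFIDENCE_LEVELS = ("High", "Medium", "Low")


def accuracy_by_category(cases):
    """Percentage of correct predictions within each true category.

    `cases` is an iterable of (true_category, point, confidence) tuples,
    where point is 1 for a correct prediction and 0 otherwise.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, point, _confidence in cases:
        totals[category] += 1
        correct[category] += point
    return {c: 100.0 * correct[c] / totals[c] for c in totals}


def confidence_vs_correctness(cases):
    """Count cases for every (confidence level, correct/incorrect) pair.

    A large "incorrect" count in the "High" row would indicate confident
    errors, which an accuracy figure alone cannot reveal.
    """
    table = {lvl: {"correct": 0, "incorrect": 0} for lvl in CONFIDENCE_LEVELS}
    for _category, point, confidence in cases:
        outcome = "correct" if point == 1 else "incorrect"
        table[confidence][outcome] += 1
    return table
```

Keeping the cross-tabulation as raw counts rather than percentages makes small categories easy to spot. With only a handful of cases per category, a single misclassification moves the accuracy by a large step.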
Confidence Level   DeepSeek R1 (%)   O3 Mini (%)
High               92.00             68.00
Medium             8.00              32.00
Low                0.00              0.00

Table 2: Confidence Score Distribution (Percentage) for DeepSeek R1 and O3 Mini.

Figure 3: Heatmap of Confidence Score Metrics for DeepSeek R1 and O3 Mini.

4.4 Comparative Insights

From the overall evaluation, it is evident that DeepSeek R1 outperforms O3 Mini in terms of overall classification accuracy and confidence. The higher confidence levels and improved classification accuracy suggest that DeepSeek R1 may be a more suitable candidate for real-world medical applications requiring high certainty in diagnoses. However, O3 Mini’s strong performance in Autoimmune Diseases suggests that different models may specialize in specific medical domains.

A key area requiring further improvement is Respiratory Disease classification, where both models struggled. Future work could involve fine-tuning these models with specialized datasets or hybrid approaches combining different LLM architectures for better disease-specific performance.

These findings provide critical insights into the strengths and limitations of LLM-based medical diagnostics, offering a roadmap for future advancements in AI-driven healthcare.

5 Discussion

Our evaluation of DeepSeek R1 and O3 Mini demonstrates that LLMs can provide promising diagnostic support in disease classification, offering a new dimension to the way clinical data is processed and interpreted. The ability of these models to analyze complex symptom patterns and generate diagnostic predictions suggests that they have the potential to augment traditional diagnostic methods, thereby enhancing clinical decision-making. However, the study also reveals several important considerations and challenges that must be addressed before such systems can be widely adopted in clinical practice.
Critical issues such as model interpretability, integration with existing healthcare systems, and the need for robust validation across diverse patient populations highlight the careful and methodical approach required for transitioning these promising tools from research to real-world application.

5.1 Ethical Concerns

The deployment of LLMs in medical diagnostics raises several ethical issues that must be addressed to ensure responsible usage. Our results indicate that DeepSeek R1 outperforms O3 Mini in most disease categories (overall accuracy of 82% compared to 75%), yet both models exhibit challenges in specific areas, such as Respiratory Disease classification (40% and 20%, respectively). These discrepancies underscore the risk that biases in training data may lead to uneven diagnostic performance across different conditions and, potentially, across diverse patient populations [15][8].

Fairness is a critical concern. If certain demographics or disease presentations are underrepresented, the LLMs might perform well overall but fail for specific groups or conditions. Transparency in model development, rigorous validation on diverse datasets, and continuous monitoring in clinical settings are essential to address such disparities [16].

Furthermore, the “black-box” nature of these models, as seen in our confidence score analysis (where DeepSeek R1 exhibited high confidence in 92% of cases versus 68% for O3 Mini), highlights the need for explainability. Clinicians must understand the rationale behind model predictions, particularly when incorrect diagnoses occur. Establishing clear guidelines that position LLMs as decision-support tools rather than replacements for clinical judgment will be crucial to ensure accountability.

Patient privacy and data security are paramount in any diagnostic process, especially when leveraging LLM-based platforms for medical decision-making.
Strict adherence to data protection regulations, such as HIPAA and GDPR, is essential to safeguard sensitive patient information. Ensuring compliance with these standards not only protects against unauthorized access and misuse but also upholds the trust and confidentiality that are foundational to clinical practice [9][5].

5.2 Limitations

While the quantitative metrics are encouraging, several limitations must be acknowledged. The dataset used in this study, although extensive, may not capture the full complexity of real-world clinical scenarios. For instance, our analysis revealed high accuracy in domains like Mental Health, Neurological Disorders, and Oncology, yet both models struggled with Respiratory Diseases, a reflection of potential gaps in the dataset or the inherent difficulty of classifying conditions with overlapping symptoms.

Additionally, the exclusive reliance on textual symptom descriptions limits diagnostic precision. Clinical diagnoses typically require multi-modal inputs such as imaging, laboratory tests, and patient histories. Our current approach, which does not integrate these modalities, might lead to oversimplified assessments that do not fully capture the nuances of patient presentations.

Moreover, while our evaluation framework includes quantitative metrics such as disease-level and category-level accuracy, these measures do not entirely capture the clinical interpretability of model outputs. The confidence scores provide a basic indication of reliability, yet the underlying decision-making process remains opaque. This limitation could hinder clinical trust and acceptance.

5.3 Future Directions

The findings of this study suggest several avenues for future research. Enhancing model generalizability by incorporating larger, more diverse datasets is imperative.
Future work should aim to update training data continuously to reflect evolving clinical practices and to cover a broader spectrum of patient demographics and disease presentations.

The integration of multi-modal data is another promising direction. Combining textual symptom analysis with other sources (such as medical imaging, laboratory results, and comprehensive patient histories) could yield a more holistic diagnostic tool, potentially addressing the observed deficiencies in areas like Respiratory Disease classification.

Improving model interpretability remains a critical focus. Future research should explore explainable AI techniques that clarify the decision-making process behind LLM predictions. Techniques such as attention visualization or rule extraction could help elucidate why models like DeepSeek R1 and O3 Mini assign certain confidence levels, thereby building clinician trust.

Lastly, rigorous clinical validation through prospective studies and controlled trials is essential. Real-world evaluations will help confirm the robustness and reliability of these LLM-based systems and determine how best to integrate them into clinical workflows, ensuring that they complement and enhance, rather than replace, clinical expertise.

Collectively, addressing these ethical, technical, and practical challenges will be vital for harnessing the full potential of LLMs in medical diagnostics and ultimately improving patient outcomes.

6 Conclusion

This study evaluated the diagnostic performance of DeepSeek R1 and O3 Mini in disease classification across multiple categories using a rigorously structured dataset and a comprehensive quantitative evaluation framework. Our results demonstrate that LLMs can provide promising diagnostic support in medical applications. Specifically, DeepSeek R1 outperformed O3 Mini in terms of overall accuracy and confidence levels, indicating its superior ability to analyze complex symptom patterns.
However, O3 Mini showed a notable strength in Autoimmune Disease classification, suggesting that different models may have specialized advantages in certain clinical domains. The study also identified critical areas for improvement, particularly in the classification of Respiratory Diseases, where both models underperformed. These findings highlight the need for continued refinement and adaptation of LLM-based diagnostic tools to address the inherent challenges of overlapping symptomatology and data variability. Additionally, the importance of ethical considerations, including transparency, fairness, and compliance with data protection regulations, was underscored as essential for the responsible integration of LLMs into clinical practice.

In conclusion, our research contributes valuable insights to the growing body of work on LLM-based medical diagnostics. As these models continue to evolve, ensuring rigorous clinical validation and the development of more interpretable, equitable diagnostic tools will be paramount. This will not only enhance diagnostic accuracy but also build greater trust among clinicians and patients, ultimately advancing global healthcare outcomes.

References

[1] X. Zhou and Y. Peng. A scoping review of LLM-based methods for disease diagnosis. arXiv preprint, 2024.
[2] L. Wang, H. Zhao, and K. Chen. Probabilistic medical predictions using large language models: Enhancing clinical flexibility with prompt engineering. npj Digital Medicine, 7:1–15, 2024.
[3] R. Davis, E. Thompson, and S. Lee. Large language model influence on diagnostic reasoning. JAMA Network Open, 6(5):e2310062, 2023.
[4] L. Anderson, P. Wu, and R. Zhang. Evaluation of large language models as a diagnostic aid for complex cases. Frontiers in Medicine, 2024.
[5] Gaurav Kumar Gupta, Aditi Singh, Sijo Valayakkad Manikandan, and Abul Ehtesham. Digital diagnostics: The potential of large language models in recognizing symptoms of common illnesses. AI, 6(1), 2025.
[6] DeepSeek AI.
DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf, 2024. Accessed: February 2024.
[7] OpenAI. OpenAI o3 mini. https://openai.com/index/openai-o3-mini, 2024. Accessed: February 2024.
[8] M. He, J. Xu, and T. Wang. A survey on large language models in healthcare: Applications, challenges, and future directions. arXiv preprint, 2023.
[9] Z. Chen, S. Moore, and R. Patel. Current applications and challenges in large language models for patient care. Communications Medicine, 4(56), 2024.
[10] Uttam Dhakal, Aniket Kumar Singh, Suman Devkota, Yogesh Sapkota, Bishal Lamichhane, Suprinsa Paudyal, and Chandra Dhakal. GPT-4’s assessment of its performance in a USMLE-based case study, 2024.
[11] J. Sun, W. Li, and Y. Zhang. Conversational disease diagnosis using external planner-controlled LLMs. arXiv preprint, 2024.
[12] HealthLink BC. HealthLink BC: Symptom checker and disease information. https://www.healthlinkbc.ca, 2024. Accessed: February 2024.
[13] Mayo Clinic. Mayo Clinic: Disease symptoms and medical information. https://www.mayoclinic.org, 2024. Accessed: February 2024.
[14] WebMD. WebMD: Disease symptoms, diagnosis, and treatment. https://www.webmd.com, 2024. Accessed: February 2024.
[15] Aniket Kumar Singh, Bishal Lamichhane, Suman Devkota, Uttam Dhakal, and Chandra Dhakal. Do large language models show human-like biases? Exploring confidence-competence gap in AI. Information, 15(2), 2024.
[16] D. McDuff, J. Cohn, and J. Hernandez. Large language models for differential diagnosis: A new approach for AI-assisted healthcare. arXiv preprint, 2023.
[17] Aditi Singh, Abul Ehtesham, Gaurav Kumar Gupta, Nikhil Kumar Chatta, Saket Kumar, and Tala Talaei Khoei. Exploring prompt engineering: A systematic review with SWOT analysis, 2024.
[18] Aditi Singh, Abul Ehtesham, Saket Kumar, Gaurav Kumar Gupta, and Tala Talaei Khoei.
Encouraging responsible use of generative AI in education: A reward-based learning approach. In Tim Schlippe, Eric C. K. Cheng, and Tianchong Wang, editors, Artificial Intelligence in Education Technologies: New Development and Innovative Practices, pages 404–413, Singapore, 2025. Springer Nature Singapore.
[19] Wan Hang Keith Chiu, Wei Sum Koel Ko, William Chi Shing Cho, Sin Yu Joanne Hui, Wing Chi Lawrence Chan, and Michael D Kuo. Evaluating the diagnostic performance of large language models on complex multimodal medical cases. J Med Internet Res, 26:e53724, May 2024.
[20] Esraa Hassan, Tarek Abd El-Hafeez, and Mahmoud Y. Shams. Optimizing classification of diseases through language model analysis of symptoms. Scientific Reports, 14(1):1507, 2024.
[21] L. Wang, H. Zhao, and K. Chen. Probabilistic medical predictions of large language models: Enhancing clinical flexibility with prompt engineering. npj Digital Medicine, 7:1–15, 2024.
[22] A. Kumar, V. Singh, and M. Patel. Systematic benchmarking demonstrates large language models’ potential in diagnosing genetic diseases. medRxiv, 2024.
[23] R. Davis, E. Thompson, and S. Lee. Large language models with retrieval-augmented generation for zero-shot disease phenotyping. arXiv preprint, 2023.
[24] J. Lee, D. Kim, and P. Brown. Large language models are clinical reasoners: Reasoning-aware diagnosis framework with prompt-generated rationales. arXiv preprint, 2023.
[25] H. Smith, R. Gonzalez, and C. Lee. RareBench: Can LLMs serve as rare diseases specialists? arXiv preprint, 2024.
[26] Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, and Sergio Segura. o3-mini vs DeepSeek-R1: Which one is safer?, 2025.
Abbreviations

The following abbreviations are used in this manuscript:
• AI: Artificial Intelligence
• LLM(s): Large Language Model(s)
• HIPAA: Health Insurance Portability and Accountability Act
• GDPR: General Data Protection Regulation
• EHR: Electronic Health Record
• NLP: Natural Language Processing

A Accuracy Comparison of LLMs Across Disease Categories

Disease Category              DeepSeek R1 Accuracy (%)    O3 Mini Accuracy (%)
Autoimmune Diseases           80.00                       100.00
Cardiovascular Diseases       60.00                       60.00
Endocrine Disorders           80.00                       80.00
Gastrointestinal Disorders    80.00                       80.00
Infectious Diseases           60.00                       60.00
Mental Health Disorders       100.00                      100.00
Neurological Disorders        100.00                      100.00
Oncology (Cancer)             100.00                      80.00
Renal Disorders (Kidney)      60.00                       40.00
Respiratory Diseases          40.00                       20.00

Table 3: Accuracy Comparison of DeepSeek R1 and O3 Mini Across Disease Categories.
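
The category-level figures in Table 3 can be aggregated with a few lines of arithmetic. The sketch below is a minimal illustration, not the authors' evaluation code. It assumes that every category contributes the same number of test cases, so that an aggregate figure is simply the unweighted mean of the ten category accuracies; the source does not state per-category case counts. The names `TABLE_3` and `mean_accuracy` are hypothetical and introduced only for this example.

```python
# Category accuracies (%) copied from Table 3 as (DeepSeek R1, O3 Mini).
TABLE_3 = {
    "Autoimmune Diseases": (80.0, 100.0),
    "Cardiovascular Diseases": (60.0, 60.0),
    "Endocrine Disorders": (80.0, 80.0),
    "Gastrointestinal Disorders": (80.0, 80.0),
    "Infectious Diseases": (60.0, 60.0),
    "Mental Health Disorders": (100.0, 100.0),
    "Neurological Disorders": (100.0, 100.0),
    "Oncology (Cancer)": (100.0, 80.0),
    "Renal Disorders (Kidney)": (60.0, 40.0),
    "Respiratory Diseases": (40.0, 20.0),
}


def mean_accuracy(table, column):
    """Unweighted mean of one column.

    ASSUMPTION: equal case counts per category (not stated in the source).
    """
    values = [row[column] for row in table.values()]
    return sum(values) / len(values)


deepseek_mean = mean_accuracy(TABLE_3, 0)
o3_mean = mean_accuracy(TABLE_3, 1)

# Per-category gap (DeepSeek R1 minus O3 Mini), only where the models differ.
gaps = {name: r1 - o3 for name, (r1, o3) in TABLE_3.items() if r1 != o3}

print(f"DeepSeek R1 mean category accuracy: {deepseek_mean:.1f}%")
print(f"O3 Mini mean category accuracy:     {o3_mean:.1f}%")
for name, gap in gaps.items():
    print(f"{name}: {gap:+.0f} points")
```

Under the equal-weight assumption the means come out to 76% for DeepSeek R1 and 72% for O3 Mini, which coincide with the disease-level accuracies quoted in the abstract. The two models differ in only four of the ten categories, and O3 Mini leads only in Autoimmune Diseases.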