arxiv

Paper 2503.10573

Unveiling the Mathematical Reasoning in DeepSeek Models: A Comparative Study of Large Language Models

Authors: Afrar Jahin, Arif Hassan Zidan, Yu Bao, Shizhe Liang, Tianming Liu, Wei Zhang

Published: 2025-03-13

Abstract:

With the rapid evolution of Artificial Intelligence (AI), Large Language Models (LLMs) have reshaped the frontiers of various fields, spanning healthcare, public health, engineering, science, agriculture, education, arts, humanities, and mathematical reasoning. Among these advancements, DeepSeek models have emerged as noteworthy contenders, demonstrating promising capabilities that set them apart from their peers. While previous studies have conducted comparative analyses of LLMs, few have delivered a comprehensive evaluation of mathematical reasoning across a broad spectrum of LLMs. In this work, we aim to bridge this gap by conducting an in-depth comparative study, focusing on the strengths and limitations of DeepSeek models in relation to their leading counterparts. In particular, our study systematically evaluates the mathematical reasoning performance of two DeepSeek models alongside five prominent LLMs across three independent benchmark datasets. The findings reveal several key insights: 1). DeepSeek-R1 consistently achieved the highest accuracy on two of the three datasets, demonstrating strong mathematical reasoning capabilities. 2). The distilled variant of LLMs significantly underperformed compared to its peers, highlighting potential drawbacks in using distillation techniques. 3). In terms of response time, Gemini 2.0 Flash demonstrated the fastest processing speed, outperforming other models in efficiency, which is a crucial factor for real-time applications. Beyond these quantitative assessments, we delve into how architecture, training, and optimization impact LLMs' mathematical reasoning. Moreover, our study goes beyond mere performance comparison by identifying key areas for future advancements in LLM-driven mathematical reasoning. This research enhances our understanding of LLMs' mathematical reasoning and lays the groundwork for future advancements

Paper Content:
Afrar Jahin∗, School of Computer and Cyber Sciences, Augusta University, Augusta, GA, USA, ajahin@augusta.edu
Arif Hassan Zidan∗, School of Computer and Cyber Sciences, Augusta University, Augusta, GA, USA, azidan@augusta.edu
Yu Bao, Department of Graduate Psychology, James Madison University, Harrisonburg, VA, USA, bao2yx@jmu.edu
Shizhe Liang, Institute of Plant Breeding, Genetics & Genomics, University of Georgia, Athens, GA, USA, sl73741@uga.edu
Tianming Liu, School of Computing, University of Georgia, Athens, GA, USA, tliu@uga.edu
Wei Zhang†, School of Computer and Cyber Sciences, Augusta University, Augusta, GA, USA, wzhang2@augusta.edu

∗These authors contributed equally to this work. †The corresponding author.

arXiv:2503.10573v1 [cs.LG] 13 Mar 2025

1 Introduction

As Artificial Intelligence (AI) continues to advance at an unprecedented pace, an array of powerful Large Language Models (LLMs) has emerged, including OpenAI’s GPT-4o, o1, and o3, as well as Claude 3.5, Llama 3.2, Qwen 2.5, and Gemini 2.0 [1,12,28,30,48,57,69,82,102]. Driven by cutting-edge advancements in Deep Neural Networks (DNNs), these models integrate human cognition elements to enhance problem-solving and decision-making [49]. Serving as a guiding light in natural language processing (NLP) [65], healthcare [21,60,61], clinical textual processing [22,59,104], biomedical image analyses [60], code generation [58], decision support [51], multimodal data analytics [83,90], and mathematical reasoning [2,52,72,76,91,103,107], LLMs push the boundaries of AI capabilities, approximating human-like reasoning through sophisticated statistical inference [1,48,69]. However, despite their transformative potential, these models face notable limitations. Their high computational demands pose significant barriers to broader accessibility, making large-scale implementation costly [69].
Moreover, while LLMs perform well in general contexts, they often struggle with specialized tasks, exhibiting inconsistencies in performance. Multimodal models, for example, continue to face challenges in spatial reasoning and real-world physics, while AI-assisted code generation frequently produces syntactically correct yet functionally flawed outputs, requiring human oversight [ 58,103]. These constraints underscore the ongoing need for refinement and innovation in AI research to bridge the gap between artificial and human intelligence. In particular, GPT-4o, released by OpenAI in May 2024, is a multimodal model capable of pro- cessing text, images, and voice with remarkable efficiency. Leveraging an advanced transformer architecture, it surpasses GPT-3 in critical areas such as mathematical reasoning and language comprehension [ 1,23,44,75]. With an estimated 2 trillion parameters, GPT-4o is significantly larger than its predecessors, enabling substantial improvements in performance and adaptability. Meanwhile, other cutting-edge models, such as o1 and o3, have been introduced in 2024 and 2025, respectively. In particular, o1 enhances reasoning capabilities by incorporating a Chain-of-Thought (CoT) approach [ 80], where it renders intermediate reasoning steps before reaching a final answer closely mirroring the way humans process complex problems [ 103]. More significantly, o3 advances reasoning performance through a "simulated reasoning" process, in which the model actively gen- erates and evaluates multiple solution paths [ 6]. By pausing, reflecting, and adjusting its approach before delivering a final answer, o3 exhibits human-like reasoning, allowing for more nuanced and adaptive problem-solving, particularly in highly complex scenarios [ 67]. However, none of the GPT models, including GPT-4o, o1, and o3, are open-source, as they remain proprietary. In addition, Claude 3.5, released in 2024, is built on previous versions. 
It emphasizes safety, alignment, and performance, with improvements in reasoning, language understanding, and handling complex tasks like text and code generation [30]. With 250B parameters, it surpasses earlier models in accuracy and ethical alignment. It supports up to 200K tokens for extended context, enabling better processing of larger inputs. Notably, enhanced by reinforcement learning from human feedback (RLHF) [41,85] and Constitutional AI, it reduces undesirable responses and biases and aligns better with human intent. Claude 3.5 excels in specialized areas like coding and scientific reasoning, with improved transparency and ethical safeguards [30,69].

Furthermore, Llama-3.3, presented in 2024, is the latest LLM in the Meta AI family, following Llama-1 and Llama-2. Llama-3.3 advances further with 70B parameters and a 128K-token context window, improved by grouped-query attention for better efficiency [28,63]. It excels in coding, logical problem solving, and low-resource language tasks. Unlike closed models such as the GPT series, it remains open-weight and freely accessible for research and commercial use, but is restricted to text-only input [26,63,69]. Safety measures, such as automated red-teaming and filtered training data, help minimize undesirable outputs.

Moreover, Gemini 2.0 Flash is Google’s latest multimodal LLM, building on versions 1.0 and 1.5 to offer more robust generative AI capabilities across text, images, audio, and video. Initially introduced as an experimental variant, it provides significant speed and efficiency gains over its predecessor, Gemini 1.5 Flash [43,45,82]. Notably, it outperforms Gemini 1.5 Pro on key benchmarks while operating at twice the speed.
It enables the incorporation of agentic AI and native tool use, allowing the model to call external functions (Google Search and Maps) and integrate streaming data for expanded real-time applications. By combining better performance in tasks such as math, code generation, and multilingual audio output with enhanced efficiency, Gemini 2.0 aims to deliver comprehensive, cost-effective AI solutions for both developers and end users [43,82].

Lastly, Qwen2.5, released in September 2024, is the latest iteration in the Qwen series, following Qwen2 in June 2024 and the original Qwen in August 2023. Qwen1.5 featured models up to 72B parameters, emphasizing efficiency and open-source accessibility. Qwen2 introduced improved reasoning, multilingual support, and coding capabilities, with models scaling up to 72.71B parameters [94,106].

Besides, established in 2023 as a research initiative to push the boundaries of artificial general intelligence (AGI), DeepSeek set out to overcome existing limitations by developing specialized models focused on efficiency, adaptability, and domain expertise [8,20,38,37,57,48,56,76]. In 2024, DeepSeek introduced its Mixture-of-Experts (MoE) design, an efficiency-driven architecture that leverages sparse activation to reduce computational overhead [38,56,57]. This was followed by the launch of DeepSeek Coder, a suite of code-focused models ranging from 1B to 33B parameters, designed to streamline software development workflows. Meanwhile, DeepSeek Math, trained on 120B math-related tokens, was developed to handle advanced mathematical and symbolic reasoning tasks [76]. Expanding its model portfolio, DeepSeek introduced the V2 and V3 series [56,57]. V2 implemented multi-head latent attention (MLA) alongside an MoE system with 236B parameters in total, of which only 21B were active per token, optimizing computational efficiency [56].
V3, an open-source model, further enhanced efficiency with 671B total parameters, activating only 37B per token, excelling in complex reasoning tasks while minimizing resource demands and reliance on supervised data [57]. In 2025, DeepSeek made a significant breakthrough with R1-Zero, incorporating self-verification, reflection, and extended CoTs. Notably, DeepSeek-R1 was recently presented, specifically designed for mathematical, coding, and logical problem-solving, enhancing autonomous decision-making and precision in both research and enterprise applications [37,38,76]. To extend accessibility, DeepSeek also released an open-source suite of distilled models, optimized for deployment in resource-constrained environments such as edge computing platforms and low-memory systems. These models preserve scalability and cost-effectiveness, making cutting-edge AI more accessible across diverse applications.

Notably, mathematical reasoning usually poses intricate challenges without straightforward solutions [3,55]. Unlike routine tasks that follow established frameworks, mathematical reasoning demands creativity, abstract thinking, and advanced cognitive skills. In this study, we systematically evaluate the mathematical reasoning performance of a wide array of LLMs, with a particular emphasis on DeepSeek models, to further unveil machine creativity in mathematical reasoning. Leveraging multiple independent datasets and quantitative metrics, we conduct a comprehensive empirical analysis to assess and compare their capabilities. Our findings provide a detailed evaluation of prominent LLMs in mathematical reasoning while highlighting the advancements and unique strengths of DeepSeek models.
2 Related Works

Recent research has increasingly focused on summarizing and evaluating the performance of LLMs in problem-solving and code generation, areas where general-purpose LLMs excel in text-based tasks but often struggle with mathematical precision and structured reasoning [24,36,48,77,86,97]. To address these limitations, LLM developers have prioritized enhancing reasoning capabilities and improving computational efficiency in next-generation models, aiming to bridge the gap between linguistic fluency and robust problem-solving skills.

Specifically, in 2024, Wang et al. summarized and detailed the specifications of Multimodal Large Language Models (MLLMs), which stand at the forefront of AI [83]. In brief, Wang et al. surveyed a wide array of data modalities, such as text, images, audio, and sequential data [83]. MLLMs play a crucial role in multimodal understanding tasks by integrating text and image information to achieve more intelligent and comprehensive understanding and reasoning. In this field, Wang et al. outlined the developments, advancements, and utilization of multiple MLLMs, including three highly regarded models (MiniGPT-4, InstructBLIP, and Wiki-LLaVA), along with other related MLLMs such as 3DMIT, GroundingGPT, ModaVerse, Vary-toy, LLaVA-MoLE, and CogCom [83].

In addition, at the end of 2024, Zhong et al. conducted a comprehensive evaluation of o1-preview (the early version of o1) to showcase its advancement across a diverse array of complex reasoning tasks, spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences [103]. Through extensive testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving [103]. Key findings include: 1).
83.3% success rate in solving complex competitive programming problems, surpassing human experts. 2). Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models. 3). 100% accuracy in high school-level mathematical reasoning tasks, providing detailed step-by-step solutions. 4). Advanced natural language inference capabilities across general and specialized domains like medicine. 5). Impressive performance in chip design tasks, outperforming specialized models in areas such as script generation and bug analysis. 6). Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields. 7). Comprehensive financial knowledge and strong capabilities in quantitative investing. 8). Effective performance in social media analysis, including sentiment analysis and emotion recognition. 9). Excellent performance in educational measurement and psychometrics, demonstrating a solid grasp of standard psychometric concepts that are equivalent to or beyond a first-year master’s or doctoral student’s level. Furthermore, Yang et al. investigated the capabilities of advanced LLMs, particularly the o1 model, in literary analysis [ 96]. Given the prestige of the Nobel Prize and its emphasis on cultural, historical, and linguistic depth, applying LLMs to these works offers valuable insights into both human and AI approaches to literary interpretation [ 96]. The study employed qualitative and quantitative evaluations to assess coherence, creativity, and fidelity to the text, shedding light on the strengths and limitations of AI in domains traditionally dominated by human expertise. While LLMs demonstrated remarkable analytical capabilities, particularly in structured tasks, they struggled with emotional nuance and coherence—areas where human interpretation remains unparalleled. 
This research underscores the transformative potential of human-AI collaboration in the humanities, paving the way for new opportunities in literary studies, textual analysis, and interdisciplinary research. By leveraging AI’s analytical power alongside human intuition, this study highlights promising directions for AI-assisted literary interpretation and beyond. Moreover, Xu et al. demonstrated that LLMs have made significant advancements in clinical decision-making, particularly those leveraging in-context demonstrations and specialized medical fine-tuning [ 92]. These models exhibit strong performance in medical language processing but continue to face challenges in real-time adaptability, multi-step reasoning, and handling complex medical tasks. Agent-based AI systems aim to overcome these limitations by incorporating reasoning traces, contextual tool selection, knowledge retrieval, and both short- and long-term memory. These features enable medical AI agents to manage complex clinical scenarios where decision-making requires real-time interaction with the environment [ 92]. Unlike conventional model-based approaches that treat medical queries as isolated questions, medical AI agents approach them as dynamic, multi- faceted tasks, allowing them to function more like human doctors. Xu also investigated the selection of the backbone LLM for medical AI agents, which serves as the foundation for their reasoning and action generation. Specifically, Xu et al. [ 92] examined the capabilities of the emerging o1 model and its impact on agents’ reasoning, tool-use adaptability, and real-time information retrieval across diverse clinical scenarios, including high-stakes environments such as intensive care units (ICUs). The findings highlighted o1’s potential to enhance diagnostic accuracy and consistency, paving the way for more intelligent, responsive AI tools that support improved patient outcomes and more effective decision-making in clinical practice [92]. 
Notably, Ahn et al. [3] addressed four critical aspects of LLMs in mathematical reasoning: 1). mathematical problem types and datasets, 2). techniques for enhancing LLM performance (such as prompt engineering and fine-tuning), 3). factors affecting model effectiveness (including scale and pre-training data), and 4). challenges such as brittleness and inconsistency, which often cause models to generate different answers for similar problems.

Besides, Liang et al. presented a comprehensive overview of the applications of AI in mathematical research, highlighting the transformative role AI has begun to play in this domain [55]. Traditionally, AI advancements have relied heavily on theoretical foundations provided by mathematics and statistics. However, recent developments in AI, particularly in reinforcement learning (RL) and LLMs, have demonstrated the potential for AI to contribute back to mathematics by offering flexible algorithmic frameworks and powerful inductive reasoning capabilities that support various aspects of mathematical research. This survey aimed to establish a bridge between AI and mathematics, providing insights into the mutual benefits and fostering deeper interdisciplinary understanding. In particular, Liang et al. discussed that while current AI and LLMs may struggle with complex deductive reasoning, their inherent creativity (the capability to generate outputs at high throughput based on recognition of shallow patterns) holds significant potential to support and inspire mathematical research [55]. Furthermore, Liang et al. [55] addressed the lack of cross-disciplinary communication: mathematicians may not fully comprehend the latest advances in AI, while AI researchers frequently prioritize benchmark performance over real-world applications in frontier mathematical research.
Lastly, this work seeks to close that gap, offering a detailed exploration of AI fundamentals, its strengths, and its emerging applications in the mathematical sciences. While various studies have begun exploring in-depth comparisons between DeepSeek-R1 and other leading models [ 37,38,76], few provide a comprehensive analysis of their mathematical reasoning capabilities. Thus, in this work, we introduce a holistic empirical study to systematically compare the mathematical reasoning performance of DeepSeek-R1 with its peer models. Our goal is to contribute to a deeper understanding of DeepSeek-R1’s strengths and limitations while inspiring further advancements in the future works, using a wide array of peer models, datasets, and quantitative metrics. 3 Methodology In this section, we provide a comprehensive overview of our experimental framework, detailing the datasets utilized, the quantitative evaluation metrics, as well as the overall methodology. We also present a justification for selecting these datasets as benchmarks, emphasizing their relevance, diversity, and suitability in effectively assessing LLM performance in mathematical reasoning. By establishing a consistent evaluation framework, we aim to ensure a thorough and insightful comparison of model capabilities across various reasoning tasks. 3.1 Experimental Frameworks The framework of this empirical study is illustrated in Figure 1 and consists of five key components: 1). Benchmark Data: We utilized three representative benchmark datasets [ 16,39,62] to evaluate the mathematical reasoning capabilities of multiple LLMs. 2). Zero-Shot Prompting: Zero-shot prompting was employed to design all prompts [ 42], ensuring a standardized evaluation approach. 3). 
Language Model Selection: A diverse set of representative LLMs was selected, including Google’s Gemini 2.0 Flash [45]; OpenAI’s GPT-4o [1], o1-mini [46], o1 [103], and o3-mini [6]; and DeepSeek’s DeepSeek-R1-Distill-Qwen-1.5B [37], hereafter referred to as DeepSeek-1.5B, and DeepSeek-R1 [56]. 4). Evaluation: Accuracy was used as a critical quantitative metric to assess and compare the mathematical reasoning performance of these models [17–19]. 5). Comparison Between Models: A comprehensive statistical analysis was conducted to evaluate and contrast the performance of each LLM, offering deeper insights into their mathematical reasoning capabilities. Overall, this structured framework enabled a holistic and systematic comparison of LLMs in mathematical reasoning.

Figure 1: This figure illustrates the experimental framework of this work, including five vital components.

3.2 Datasets

To evaluate the mathematical reasoning capabilities of LLMs, we employ well-established benchmark datasets that encompass diverse mathematical domains, question types, and difficulty levels. The selected datasets include Math Competition (MATH) [40], Grade School Math 8K (GSM8K) [16], and Massive Multitask Language Understanding (MMLU) [39], covering topics ranging from elementary arithmetic to advanced algebra, logic, and mathematical competitions. From each benchmark, we carefully select a representative subset of problems to ensure a balanced evaluation while maintaining the diversity and complexity of mathematical reasoning tasks. These subsets are chosen for their broad coverage of mathematical domains, variety in question formats, and representation of different complexity levels, ensuring a comprehensive assessment of LLMs’ symbolic manipulation, logical reasoning, and problem-solving abilities [35,74,101].
In total, our evaluation spans 2,178 mathematical problems across these benchmark datasets. By evaluating LLMs on these benchmarks, we identified their strengths and limitations in mathematical reasoning, aiding in the development of more robust and capable models. Table 1 provides an overview of the datasets utilized in this study.

Table 1: Overview of Benchmark Data Subsets Used for Evaluation

Dataset | Category | Question Type | Domain | Number of Problems
MATH | Mathematics competitions | Free-form numeric | Algebra, Geometry, Number Theory, Combinatorics | 262
GSM8K | Grade School Math | Word Problems | Arithmetic | 1320
MMLU | College Mathematics | Multiple-choice | Algebra, Calculus, Discrete Math | 100
MMLU | Abstract Algebra | Multiple-choice | Group Theory, Rings, Algebraic Structures | 100
MMLU | Formal Logic | Multiple-choice | Proofs, Logical Reasoning | 126
MMLU | High School Mathematics | Multiple-choice | General Mathematics | 270

3.2.1 MATH

MATH is a comprehensive benchmark dataset designed to evaluate mathematical reasoning across a wide range of problem-solving scenarios [40]. Unlike conventional mathematics problem datasets, which primarily assess algorithmic proficiency, MATH focuses on advanced reasoning and heuristic-driven problem-solving.

Specifically, MATH is derived from high-level mathematics competitions, including the AMC 10, AMC 12, and AIME [40]. These problems are inherently more complex than standard K-12 mathematics tasks, requiring the application of non-trivial problem-solving strategies, logical inference, and domain-specific heuristics rather than direct formulaic computations. Meanwhile, MATH is widely regarded as a difficult benchmark, with LLMs achieving accuracy rates between 3.0% and 6.9% in the original evaluation [40]. Despite these low overall scores, LLMs demonstrate some degree of mathematical competency, as evidenced by: Up to 15% accuracy on the easiest difficulty level [40].
The ability to generate step-by-step solutions that, while sometimes incorrect, remain coherent and contextually relevant. To gauge the dataset’s difficulty, human performance was also considered. For instance, a computer science Ph.D. student with no particular affinity for mathematics achieved approximately 40% accuracy. In addition, a three-time IMO (International Mathematical Olympiad) gold medalist attained 90% accuracy. Thus, these findings indicate that MATH is challenging for both human solvers and LLMs, making it an invaluable dataset for assessing and advancing mathematical reasoning capabilities in both LLMs and human problem solvers.

3.2.2 GSM8K

GSM8K, introduced in 2021, is a benchmark dataset consisting of about 8,500 high-quality grade-school-level math problems [16]. Designed to incorporate high linguistic diversity while relying on fundamental mathematical concepts, it presents a unique challenge for state-of-the-art LLMs. Although the underlying math is relatively simple, the diverse problem formulations create significant hurdles, preventing many models from achieving consistently high accuracy. GSM8K is structured into roughly 7,500 training problems and 1,320 testing problems, all carefully crafted by expert human problem writers. The problems primarily involve elementary arithmetic operations and typically require 2 to 8 logical steps to reach a solution. This dataset is widely used to evaluate logical reasoning and mathematical proficiency in LLMs and serves as a benchmark for various assessments, including the LLM Leaderboard. Notably, while some GSM8K problems are conceptually straightforward, they can still be challenging for even the most advanced LLMs, which often exhibit high variability in responses when the same problem is presented in slightly different ways. This highlights the ongoing difficulty of achieving robust mathematical reasoning in language models.
In this study, we selected the 1.32K-problem test set from GSM8K to systematically evaluate and compare the mathematical reasoning capabilities of each LLM.

3.2.3 MMLU

The MMLU dataset is a comprehensive benchmark comprising multiple-choice questions from a diverse range of academic disciplines. Spanning 57 distinct tasks, it covers subjects in the humanities, social sciences, hard sciences, and mathematical reasoning, reflecting the breadth of knowledge that is essential for various fields of study [39]. MMLU’s questions were manually curated by graduate and undergraduate students from publicly available sources, including practice questions from standardized exams such as the Graduate Record Examination (GRE) and the United States Medical Licensing Examination (USMLE). Additionally, it features questions designed for undergraduate courses and readers of Oxford University Press books. Notably, its mathematical reasoning component encompasses multiple subfields, categorized into areas such as "Abstract Algebra," "College Mathematics," "Formal Logic," and "High School Mathematics." For instance, questions in the "Abstract Algebra" category originate from professional mathematical practice, while the "High School Mathematics" category includes problems akin to those found in standard high school exams. In total, MMLU comprises 15,908 questions, divided into three subsets: a). A few-shot development set containing five questions per subject. b). A validation set with 1,540 questions, used for selecting hyperparameters. c). A test set with 14,079 questions, ensuring a rigorous evaluation. Each subject within MMLU includes a minimum of 100 test questions, making it a more extensive and challenging assessment than most standard exams. This broad and structured dataset serves as a critical benchmark for evaluating the reasoning and problem-solving capabilities of LLMs.
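As a quick arithmetic check on the subset sizes reported in Table 1, the per-subset counts can be tallied against the stated total of 2,178 problems. The sketch below is only illustrative: the dictionary name `subset_sizes` and its key labels are hypothetical, while the counts themselves are taken from Table 1.

```python
# Subset sizes as reported in Table 1; the key names are illustrative labels only.
subset_sizes = {
    "MATH": 262,
    "GSM8K": 1320,
    "MMLU/College Mathematics": 100,
    "MMLU/Abstract Algebra": 100,
    "MMLU/Formal Logic": 126,
    "MMLU/High School Mathematics": 270,
}

# Sum of the four MMLU mathematics-related subsets.
mmlu_total = sum(n for name, n in subset_sizes.items() if name.startswith("MMLU/"))

# Overall number of problems across all three benchmarks.
total = sum(subset_sizes.values())

print(f"MMLU subsets: {mmlu_total}")
print(f"All subsets:  {total}")
```

The four MMLU subsets contribute 596 problems, and together with MATH and GSM8K the counts add up to the 2,178 problems stated in Section 3.2.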
3.3 Evaluate LLMs in Mathematical Reasoning

As discussed previously, multiple LLMs have demonstrated remarkable abilities in mathematical reasoning [105]. To systematically evaluate the mathematical reasoning capabilities of various LLMs, we conducted experiments using API-based access to a diverse set of models. The selected models include Google’s Gemini-2.0-flash, OpenAI’s GPT-4o, o1-mini, o1, and o3-mini, and DeepSeek’s DeepSeek-1.5B and DeepSeek-R1. These models were accessed through their respective API services using secure API keys from Google, OpenAI, DeepSeek, and OpenRouter.

Figure 2: Templates used to evaluate mathematical reasoning in LLMs. The multiple-choice format (a) ensures structured decision-making with predefined options, while the open-ended format (b) assesses free-form problem-solving capabilities.

3.3.1 Prompting Strategy

The effectiveness of LLMs in mathematical reasoning is highly dependent on the design of input prompts [5,10,25]. To ensure a fair and reproducible evaluation, we employed a structured prompting strategy focusing on zero-shot prompting initially. Our methodology was designed to assess models’ inherent problem-solving abilities without additional external guidance [87].

Overall, for all evaluations, we incorporated a zero-shot prompting approach, where models were presented with mathematical problems without any prior exemplars or demonstrations. Zero-shot prompting is particularly valuable in assessing a model’s ability to generalize mathematical knowledge from its pretraining corpus and apply learned concepts to novel problems [87,98]. By not providing explicit examples, this method evaluates how well a model can infer the appropriate solution methodology based solely on its prior learning. In detail, each prompt, as shown in Figure 2, explicitly instructs models to reason through the problem before selecting an answer enclosed in the \boxed{} format.
While explicit CoT prompting was not enforced, the instruction to "follow the format and continue your reasoning until the final boxed answer" implicitly encouraged stepwise reasoning, allowing us to assess the extent to which models could self-initiate logical reasoning without explicit CoT guidance [ 73]. Empirical observations revealed that some models naturally generated intermediate reasoning steps before selecting an answer, demonstrating emergent reasoning capabilities, while others provided the final answer directly, particularly those with weaker reasoning abilities. Variations in reasoning depth were observed across different models, reflecting differences in their internal problem-solving strategies. To further investigate the impact of explicit stepwise prompting, we conducted an ablation study by modifying the prompt to include a direct instruction: "Solve this problem step by step before providing the final answer." This explicit CoT version consistently led to more structured and detailed explanations across all models, confirming that models respond differently to implicit vs. explicit reasoning cues [ 98]. The comparative analysis of both prompting strategies provided valuable insights into how structured guidance influences model reasoning and accuracy. For multiple-choice questions, we provided the set of answer choices and instructed the models to select the correct response from the given options, ensuring a structured decision-making process [ 68]. The prompt included clear formatting guidelines to standardize the response format across different models. In contrast, for open-ended problems that required free-form numeric or symbolic responses, we presented the question directly without predefined answer choices, allowing the models to generate responses based on their internal reasoning capabilities [ 81]. 
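The two prompt formats described above can be sketched roughly as follows. This is a minimal illustration, not the authors' template from Figure 2: the function name `build_prompt`, the option labels, and most of the instruction wording are assumptions; only the phrase about continuing the reasoning until the final boxed answer is quoted from the text.

```python
from typing import Optional, Sequence

# Hypothetical instruction text. Only the fragment about continuing the
# reasoning until the final boxed answer is taken from the paper.
FORMAT_INSTRUCTION = (
    "Follow the format and continue your reasoning until the final boxed answer. "
    "Put the final answer inside \\boxed{}."
)


def build_prompt(question: str, choices: Optional[Sequence[str]] = None) -> str:
    """Build a zero-shot prompt: no worked examples are ever included."""
    lines = [f"Question: {question.strip()}"]
    if choices:
        # Multiple-choice format: label the options A, B, C, ...
        for label, choice in zip("ABCDEFGHIJ", choices):
            lines.append(f"{label}. {choice}")
        lines.append("Select the correct option letter.")
    # Open-ended format: the question is followed directly by the instruction.
    lines.append(FORMAT_INSTRUCTION)
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("What is 2 + 2?", ["3", "4", "5", "6"]))
    print()
    print(build_prompt("What is 2 + 2?"))
```

An explicit CoT variant for the ablation would only need the sentence "Solve this problem step by step before providing the final answer." appended to the instruction.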
Handling the two formats separately ensured that multiple-choice tasks evaluated the models’ ability to differentiate between structured options, while open-ended problems tested their capability to derive solutions independently.

3.3.2 Sampling Settings

To ensure consistency and reliability in model responses, we employed a structured sampling strategy using temperature, top-k sampling, and top-p (nucleus) sampling [11,70], which control randomness and diversity in token selection. To ensure a standardized evaluation across different LLMs, we utilized API access through OpenRouter for DeepSeek models and the OpenAI API for GPT-4o, o1-mini, o1, and o3-mini, while maintaining their respective default generation settings. All models, including GPT-4o, the o-series models, Gemini, and DeepSeek, use a default temperature of T = 1.0 and an unrestricted probability sampling strategy (top-p = 1.0) [70], allowing the model to consider the full probability distribution for token selection. Moreover, to ensure smooth execution of API requests while avoiding rate limits imposed by different providers, we implemented an adaptive time delay between successive API calls. This approach prevented throttling issues while maintaining efficiency in large-scale evaluation. All queries were formatted in a standardized manner to ensure fair comparisons across models.

3.3.3 Evaluation Metric

To assess the correctness of model-generated responses in mathematical tasks, we employ the Exact Match metric as our primary evaluation method [78]. Given that each question in our benchmark dataset has a single correct answer and the model produces one response per query, Exact Match ensures a rigorous evaluation by comparing the extracted answer to the ground truth. Unlike other evaluation metrics such as Pass@k [13], which allows for multiple valid responses, Exact Match is particularly suited for our setup as each question has only one correct answer, making alternative correctness measures unnecessary.
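To make the relationship between the two metrics concrete, the sketch below uses the commonly cited unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The estimator and the function name `pass_at_k` are not taken from this paper; they are included only to illustrate that with a single sample per question (n = k = 1) the quantity reduces to a 0/1 correctness indicator, which is what Exact Match averages.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # One response per question: pass@1 is simply "correct or not".
    print(pass_at_k(n=1, c=1, k=1))
    print(pass_at_k(n=1, c=0, k=1))
    # With several samples the estimate becomes a fraction.
    print(pass_at_k(n=10, c=3, k=1))
```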
The model produces a single response per query, eliminating the need for multi-sample evaluation, and we extract the final answer by parsing the boxed notation (\boxed{}), ensuring a direct one-to-one comparison with the ground truth. Since mathematical evaluations require strict correctness, even minor deviations, such as additional decimal places (5.00 vs. 5) or formatting inconsistencies, must be addressed through preprocessing before applying the metric. To ensure fairness, we apply preprocessing steps to standardize model-generated answers by extracting the boxed answer, trimming extraneous characters, normalizing numerical values (e.g., $\frac{1}{2}$ to 0.5) [84], and handling decimal precision by rounding values where applicable. This ensures that the Exact Match metric strictly evaluates mathematical correctness while minimizing penalization due to trivial formatting differences.

Let $\hat{y}_i$ represent the extracted answer from the model’s output for the $i$-th question, and let $y_i$ be the corresponding ground truth answer. The Exact Match accuracy is computed as:

\[
\text{Exact Match (\%)} = \frac{\sum_{i=1}^{N} \mathbb{1}\big(\text{normalize}(\hat{y}_i) = \text{normalize}(y_i)\big)}{N} \times 100 \tag{1}
\]

where:
• $N$ is the total number of evaluated questions.
• $\mathbb{1}(\cdot)$ is the indicator function, returning 1 if the extracted model response matches the ground truth after preprocessing, and 0 otherwise.
• $\text{normalize}(\cdot)$ is a function that standardizes formatting, trims spaces, and normalizes numerical values.

Furthermore, error-handling mechanisms were incorporated to detect hallucinated responses [71,79,100], incomplete computations, or missing boxed outputs, with automated validation and manual review of flagged cases ensuring robust evaluation. Additionally, since some model responses exceeded token limits and resulted in truncated outputs, detection mechanisms were implemented to flag incomplete solutions, allowing either exclusion from evaluation or manual review when necessary.
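The pipeline described above (extract the boxed answer, normalize, compare, and flag missing or truncated outputs) might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' code: the function names, the brace-matching parser, and the handful of normalization rules are hypothetical, and numeric answers are compared as exact rationals here rather than by rounding as the text describes.

```python
import re
from fractions import Fraction
from typing import Optional, Sequence


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...}, or None if missing or truncated."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # no boxed answer at all
    depth, out = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces: probably a truncated response


# Matches \frac{a}{b} and \dfrac{a}{b} with integer numerator and denominator.
_FRAC = re.compile(r"^\\d?frac\{(-?\d+)\}\{(-?\d+)\}$")


def normalize(answer: str) -> str:
    """Illustrative normalization: strip spacing, then canonicalize numbers."""
    s = answer.strip().replace(" ", "").replace("$", "")
    s = s.replace("^\\circ", "")  # e.g. 90^\circ -> 90
    match = _FRAC.match(s)
    try:
        if match:
            value = Fraction(int(match.group(1)), int(match.group(2)))
        else:
            value = Fraction(s)  # accepts "5", "5.00", "0.5", "1/2"
    except (ValueError, ZeroDivisionError):
        return s  # symbolic answer: compare the cleaned string
    return str(value)


def exact_match(outputs: Sequence[str], references: Sequence[str]) -> float:
    """Exact Match accuracy in percent, following Equation (1)."""
    hits = 0
    for output, reference in zip(outputs, references):
        predicted = extract_boxed(output)
        if predicted is not None and normalize(predicted) == normalize(reference):
            hits += 1
    return 100.0 * hits / len(references)


if __name__ == "__main__":
    outputs = ["so \\boxed{5.00}", "thus \\boxed{\\dfrac{1}{2}}", "answer is 7", "\\boxed{4}"]
    references = ["5", "0.5", "7", "5"]
    print(exact_match(outputs, references))
```

Responses for which `extract_boxed` returns `None` are the cases that the text routes to exclusion or manual review.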
Table 2 presents the evaluation of model-generated responses against ground truth values. The Reference Answer represents the ground truth, while the Parsed Model Answer contains the extracted response from the model’s output. The Correctness column indicates whether the model’s response exactly matches the ground truth after parsing, ensuring a strict evaluation of mathematical accuracy. Discrepancies, such as format mismatches or numerical errors, directly impact correctness scores, highlighting the model’s precision in solving mathematical problems.

Table 2: Comparison of ground truth and model-generated parsed answers

| Reference answer | Parsed model answer | Correctness |
| --- | --- | --- |
| p - q | ['p - q'] | True |
| 90^\circ | ['90'] | True |
| 4 | ['5'] | False |
| \dfrac{3}{56} | ['dfrac356'] | True |
| 6 - 5i | ['6 - 5i'] | True |

By integrating automated parsing, formatting corrections, and evaluation metrics, our post-processing pipeline ensured reliable and unbiased assessment of mathematical reasoning capabilities across different LLM architectures.

4 Results

In this section, we conducted a quantitative and qualitative evaluation of the state-of-the-art LLMs mentioned earlier, using multiple public benchmark datasets for mathematical reasoning. As an initial assessment, we selected a subset of three widely recognized datasets [16,39,62], as discussed in Section 3, to systematically evaluate the performance of various cutting-edge models. Overall, our evaluation includes DeepSeek-R1 [56], DeepSeek’s most advanced reasoning model, along with its distilled variant DeepSeek-1.5B [37], which is derived from the R1 model. Additionally, we assessed Google’s latest Gemini 2.0 Flash [45] and four OpenAI models: GPT-4o, o1-mini, o1, and o3-mini [6,44,103]. Notably, while all these models are designed with reasoning capabilities, GPT-4o is the only one not explicitly optimized for providing reasoning-based CoT responses.
These selected LLMs represent the most advanced AI systems available today, continuously evolving and competing across various mathematical reasoning benchmarks.

4.1 Evaluation of LLMs using MATH

As illustrated in Table 3, the performance of various LLMs on the MATH benchmark dataset [40] exhibits significant variability, underscoring differences in their mathematical reasoning capabilities. In particular, GPT-4o [44] exhibited a significantly lower accuracy of 64.88%, lagging behind its peer models. This limitation may be attributed to deficiencies in complex reasoning capabilities or the inherent difficulty of competition-level mathematical problems in the dataset. In contrast, OpenAI’s o1 model achieved the highest accuracy (93.12%), closely followed by DeepSeek-R1 (90.45%), with both models surpassing the 90% correctness threshold, a benchmark that no other model reached. Other models demonstrated varying levels of performance: Gemini 2.0 Flash, 85.87%; o1-mini, 88.93%; o3-mini, 82.06%. However, a key observation is that the distilled variant of DeepSeek-R1 performed significantly worse, further reinforcing the hypothesis that model distillation can impair mathematical reasoning capabilities by compromising critical reasoning pathways in favor of improved computational efficiency.

Additionally, the quantitative comparisons across multiple LLMs, as presented in Figure 3, reveal consistent performance patterns across OpenAI’s models, while DeepSeek-R1 consistently outperformed its distilled variant. These findings highlight the trade-offs between model size and reasoning ability, as demonstrated by Fu et al. [33], underscoring the need for optimization techniques that enhance computational efficiency without sacrificing mathematical problem-solving proficiency.

Furthermore, Figure 4 provides an in-depth comparison of the prompt responses and mathematical reasoning of DeepSeek-R1, o1, and Gemini 2.0 Flash.
Interestingly, Gemini 2.0 Flash failed to provide a correct answer, whereas DeepSeek-R1 and o1 demonstrated more structured and logically coherent reasoning processes. This further illustrates the varying degrees of mathematical reasoning proficiency across models [103] and the importance of structured reasoning frameworks in achieving high accuracy.

Table 3: Performance comparison of various models on MATH in terms of accuracy.

| Publisher | Model Name | Accuracy (%) |
| --- | --- | --- |
| Google | Gemini 2.0 Flash | 85.87 |
| OpenAI | GPT-4o | 64.88 |
| OpenAI | o1-mini | 88.93 |
| OpenAI | o1 | 93.12 |
| OpenAI | o3-mini | 82.06 |
| DeepSeek | DeepSeek-1.5B | 65.64 |
| DeepSeek | DeepSeek-R1 | 90.45 |

Figure 3: Performance comparison of various LLMs in the MATH benchmark. The accuracy of each model is displayed above the corresponding bar, highlighting differences in mathematical problem-solving capabilities across model publishers.

Moreover, Figure 4 illustrates that DeepSeek-R1 and o1 followed nearly identical reasoning steps, correctly assigning coordinates, computing midpoints and centroids, and using determinant-based area calculations to arrive at the correct answer, 8. However, Gemini 2.0 Flash deviated from the correct reasoning path by making errors in its calculations. Specifically, despite correctly identifying some intermediate relationships, its handling of area scaling and altitude ratio calculations led to an incorrect final answer of 4 instead of the correct response of 8. This demonstrates a key limitation in Gemini 2.0 Flash’s numerical precision and reasoning accuracy for complex geometric problems compared with its peers.

Notably, the strong performance of DeepSeek-R1 in Figure 4, which relies heavily on RL methodologies [29], highlights the effectiveness of iterative, feedback-driven training paradigms in tackling mathematically intricate tasks. A crucial component in its training is Group Relative Policy Optimization (GRPO) [76], a more efficient RL approach than traditional methods.
GRPO enables DeepSeek-R1 to excel in complex domains such as mathematics, science, and coding, reinforcing the potential of advanced learning techniques in enhancing model performance [37]. These findings underscore the importance of advanced reasoning mechanisms in solving competition-level mathematical problems, suggesting that RL frameworks, exemplified by DeepSeek-R1, represent a promising direction for enhancing the mathematical reasoning capabilities of next-generation LLMs.

Figure 4: Comparative evaluation of different LLMs on a Geometric problem from the MATH benchmark. The figure showcases step-by-step reasoning from three different models and highlights correctness. The question is shown in red, while the extracted answers are boxed. Green-highlighted sections indicate correct responses, whereas red-highlighted sections denote incorrect outputs. This comparison illustrates differences in mathematical reasoning and accuracy across models.

4.2 Evaluation of LLMs using GSM8K

As discussed in Section 3, the GSM8K benchmark dataset [16] comprises approximately 1.32k test samples designed to assess math reasoning in LLMs. The zero-shot performance of various models is summarized in Table 4, presenting valuable insights into their mathematical reasoning capabilities. In particular, the results revealed that DeepSeek-R1 [37] performed on par with OpenAI’s o1, with both models achieving the highest accuracy of 96.13%, surpassing all other peers in this evaluation. Additionally, several other models demonstrated strong performance: Gemini 2.0 Flash, 95.53%; GPT-4o, 95.98%; o1-mini, 95.91%; o3-mini, 95.83%. However, a notable performance gap was observed in DeepSeek-R1’s distilled variant, which achieved only 81.12% accuracy, a significant decline compared to its full-scale counterpart.
This performance disparity underscores the trade-offs associated with model distillation, where reducing the number of parameters for computational efficiency may inadvertently weaken logical reasoning and problem-solving capabilities. While distillation enables smaller, faster models, our findings suggest that excessive parameter reduction can compromise mathematical reasoning proficiency, potentially limiting their effectiveness in complex problem-solving tasks. Similarly, Figure 5 demonstrates that DeepSeek-R1 and o1 outperformed other peer LLMs. Notably, there are no significant differences across all LLMs released via OpenAI. These results emphasize the delicate balance between model size and reasoning ability, highlighting the need for advancing distillation techniques that preserve core reasoning structures while maintaining computational efficiency.

Table 4: Performance comparison of various models on GSM8K in terms of accuracy.

| Publisher | Model Name | Accuracy (%) |
| --- | --- | --- |
| Google | Gemini 2.0 Flash | 95.53 |
| OpenAI | GPT-4o | 95.98 |
| OpenAI | o1-mini | 95.91 |
| OpenAI | o1 | 96.13 |
| OpenAI | o3-mini | 95.83 |
| DeepSeek | DeepSeek-1.5B | 81.12 |
| DeepSeek | DeepSeek-R1 | 96.13 |

Moreover, the qualitative comparisons across multiple LLMs, as illustrated in Figure 6, reveal a noteworthy disparity in reasoning capabilities among OpenAI’s o1, DeepSeek-R1, and Google’s Gemini 2.0 Flash. The analysis demonstrates that DeepSeek-R1 exhibits superior performance in mathematical reasoning compared to its counterparts. Specifically, DeepSeek-R1 generated accurate solutions and also articulated a systematic, step-by-step process to arrive at its conclusions. In contrast, both OpenAI’s o1 and Gemini 2.0 Flash produced less comprehensive pathways, which resulted in erroneous answers. These findings, as shown in Figure 6, underscore DeepSeek-R1’s robust analytical proficiency relative to contemporary models, highlighting its advanced problem-solving capabilities in complex mathematical tasks.
Figure 5: Performance comparison of various LLMs in the GSM8K benchmark. The accuracy of each model is displayed above the corresponding bar, highlighting differences in mathematical problem-solving capabilities across model publishers.

Figure 6: Comparative evaluation of different LLMs on an Arithmetic problem from the GSM8K benchmark. The figure showcases step-by-step reasoning from three different models and highlights correctness. The question is shown in red, while the extracted answers are boxed. Green-highlighted sections indicate correct responses, whereas red-highlighted sections denote incorrect outputs. This comparison illustrates differences in mathematical reasoning and accuracy across models.

4.3 Evaluation of LLMs using MMLU

Leveraging the structured categorization of the MMLU benchmark dataset [39], we conduct an in-depth comparative analysis of various LLMs across different mathematical domains. First, as shown in Table 5, DeepSeek-R1 outperformed all peer models, achieving the highest accuracy of 97.62% in Formal Logic. Other models, such as o1, o3-mini, and Gemini 2.0 Flash, also demonstrated strong performance, all surpassing 90% accuracy. However, distilled variants of DeepSeek-R1 and o1-mini failed to reach this benchmark, reinforcing the hypothesis that model distillation can degrade logical reasoning capabilities by limiting access to essential computational pathways.

Table 5: Performance comparison of various models on MMLU in terms of accuracy.

| Problem Type | Model Name | Accuracy (%) |
| --- | --- | --- |
| Abstract Algebra | Gemini 2.0 Flash | 87.00 |
| Abstract Algebra | GPT-4o | 85.00 |
| Abstract Algebra | o1-mini | 88.00 |
| Abstract Algebra | o1 | 94.00 |
| Abstract Algebra | o3-mini | 96.00 |
| Abstract Algebra | DeepSeek-1.5B | 61.00 |
| Abstract Algebra | DeepSeek-R1 | 94.00 |
| College Mathematics | Gemini 2.0 Flash | 96.00 |
| College Mathematics | GPT-4o | 79.00 |
| College Mathematics | o1-mini | 76.00 |
| College Mathematics | o1 | 99.00 |
| College Mathematics | o3-mini | 99.00 |
| College Mathematics | DeepSeek-1.5B | 65.00 |
| College Mathematics | DeepSeek-R1 | 96.00 |
| Formal Logic | Gemini 2.0 Flash | 90.48 |
| Formal Logic | GPT-4o | 82.54 |
| Formal Logic | o1-mini | 78.57 |
| Formal Logic | o1 | 95.24 |
| Formal Logic | o3-mini | 96.03 |
| Formal Logic | DeepSeek-1.5B | 47.62 |
| Formal Logic | DeepSeek-R1 | 97.62 |
| High School Mathematics | Gemini 2.0 Flash | 97.77 |
| High School Mathematics | GPT-4o | 88.15 |
| High School Mathematics | o1-mini | 98.89 |
| High School Mathematics | o1 | 99.26 |
| High School Mathematics | o3-mini | 98.52 |
| High School Mathematics | DeepSeek-1.5B | 86.30 |
| High School Mathematics | DeepSeek-R1 | 97.04 |

In addition, in Abstract Algebra, o3-mini outperformed its peers, achieving the highest accuracy of 96.00%. Abstract algebra is a particularly challenging subject, even for math graduates, as it demands a fundamental shift in mathematical thinking from traditional algebraic manipulations to analyzing complex algebraic structures such as groups, rings, and fields. Mastery of this subject requires a deep understanding of abstract concepts, rigorous proof construction, and advanced logical reasoning. The results also suggest that OpenAI’s models, particularly o3-mini, excel in abstract reasoning and symbolic manipulation, enabling them to reconstruct and process complex algebraic structures effectively. In contrast, the distilled variant of DeepSeek-R1 performed significantly worse, achieving only 61% accuracy, further validating the detrimental impact of parameter reduction on logical inference and abstract reasoning.

Furthermore, in College Mathematics, both o1 and o3-mini achieved exceptional accuracy of 99.00%, surpassing all other models. This outstanding performance can be attributed to two key factors:
a). Superior reasoning capabilities: OpenAI’s o1 and o3-mini models are designed with enhanced mathematical reasoning frameworks, allowing them to efficiently process and solve complex college-level problems.
b). Diversity of training data: The extensive dataset exposure during training may have equipped o1 and o3-mini with a broader mathematical foundation, leading to their unparalleled performance in advanced mathematical reasoning tasks.

Meanwhile, Gemini 2.0 Flash also demonstrated strong performance, achieving a comparable accuracy of 96%, reinforcing its competence in college-level mathematics. However, distillation had a clear negative impact on performance, as evidenced by:
1). o1-mini, which achieved only 76.00% accuracy, a significant drop from its full-scale counterpart.
2). DeepSeek-1.5B, which attained just 65.00% accuracy, further emphasizing the trade-off between model size and reasoning ability.

This performance gap highlights the increased complexity of college mathematics compared to high school-level math. College-level coursework usually demands: 1). A deeper understanding of abstract mathematical concepts; 2). Stronger logical reasoning abilities; 3). Proficiency in complex proofs and real-world applications. Notably, the faster pace and heavier workload of college mathematics compared to high school math further contribute to its difficulty [7].

Moreover, given the relative simplicity of High School Mathematics, o1 achieved the highest accuracy at 99.26%, slightly improving upon its results in College Mathematics. Other peer models also performed well: DeepSeek-R1, 97.04%; o1-mini, 98.89%; o3-mini, 98.52%; Gemini 2.0 Flash, 97.77%. GPT-4o showed a lower accuracy (88.15%) compared with its peers in this domain too. Once again, the distilled variant of DeepSeek-R1 struggled, achieving only 86.30% accuracy, further demonstrating the performance degradation caused by model compression.

Similarly, Figure 7 presents a comprehensive comparative analysis of various LLMs across four key mathematical domains: Abstract Algebra, College Mathematics, Formal Logic, and High School Mathematics.
Notably, OpenAI’s models demonstrate superior performance in abstract reasoning, particularly excelling in Abstract Algebra, where symbolic manipulation and deep theoretical understanding are crucial. This suggests that OpenAI’s models have strong capabilities in reconstructing and processing abstract mathematical structures. Significantly, DeepSeek-R1 outperformed its peers in Formal Logic, showcasing its exceptional logical reasoning skills and ability to handle complex rule-based problem-solving more effectively than other models. These findings further highlight the specialized strengths of different LLMs and reinforce the need for targeted model improvements to optimize performance across various mathematical domains.

Figure 7: Comparison of LLM models across different problem types in the MMLU benchmark dataset.

Figure 8 presents a comparative evaluation of different LLMs on a Formal Logic problem from the MMLU benchmark dataset [39]. The task involves identifying the correct conclusion from a given argument, with four answer choices labeled A, B, C, and D. The correct reference answer is option D, requiring models to distinguish between supporting premises and the actual conclusion. DeepSeek-R1 successfully identifies option D as the correct conclusion, offering a structured explanation by recognizing that the phrase "Because of this" serves as a linguistic connector rather than a part of the conclusion itself. In contrast, o1 and Gemini 2.0 Flash select option C, misinterpreting the logical structure of the argument and failing to differentiate between an explanatory transition and the central claim. The comparative analysis in Figure 8 further solidifies DeepSeek-R1’s exceptional logical reasoning skills and ability to handle complex rule-based problem-solving tasks.

Figure 8: Comparative evaluation of different LLMs on a Formal Logic problem from the MMLU benchmark. The figure showcases step-by-step reasoning from three different models and highlights correctness. The question is shown in red, while the extracted answers are boxed. Green-highlighted sections indicate correct responses, whereas red-highlighted sections denote incorrect outputs. This comparison illustrates differences in mathematical reasoning and accuracy across models.

Figure 9 illustrates a comparative evaluation of different LLMs in solving a problem related to abstract algebra, sourced from the MMLU benchmark dataset. The problem required reasoning about the number of elements of a given order in a group, specifically evaluating whether Statement 1 and Statement 2 hold true under group-theoretic principles. The task involved selecting the correct truth-value combination from four possible answer choices: A (True, True), B (False, False), C (True, False), and D (False, True). The reference answer, indicated in Figure 9, is option A (True, True). DeepSeek-R1 correctly identified both statements as true, providing a detailed justification using the properties of cyclic subgroups, Euler’s totient function, and disjoint subgroup structures. Similarly, o1 also arrived at the correct answer, correctly recognizing that exceeding 8 elements of order 15 necessitates at least 16 elements due to the existence of multiple distinct subgroups. However, Gemini 2.0 Flash incorrectly classified Statement 2 as false, demonstrating a reasoning flaw in its approach to subgroup structures and element counting. This comparison in Figure 9 highlights DeepSeek-R1 and o1’s superior reasoning abilities in abstract algebra, as both models successfully applied cyclic subgroup properties and disjoint set reasoning to reach the correct conclusion.

In summary, DeepSeek-R1 dominated in Formal Logic, while o3-mini excelled in Abstract Algebra.
o1 and o3-mini outperformed all models in College Mathematics, likely due to their advanced reasoning frameworks and diverse training data. Model distillation significantly reduced performance, as shown in [37], particularly in complex reasoning tasks like Abstract Algebra and College Mathematics. High school-level mathematics is relatively easier for LLMs, with multiple models surpassing 97% accuracy. These findings emphasize the trade-offs between model size, reasoning capability, and computational efficiency, highlighting the need for optimization strategies that maintain problem-solving proficiency while improving scalability.

Figure 9: Comparative evaluation of different LLMs on an Abstract Algebra problem from the MMLU benchmark. The figure showcases step-by-step reasoning from three different models and highlights correctness. The question is shown in red, while the extracted answers are boxed. Green-highlighted sections indicate correct responses, whereas red-highlighted sections denote incorrect outputs. This comparison illustrates differences in mathematical reasoning and accuracy across models.

4.4 Statistical Analyses

In this section, Table 6 presents a comparative evaluation of LLMs based on their mean accuracy and standard deviation across all three benchmark datasets. Among the evaluated models, OpenAI’s o1 achieves the highest mean accuracy (96.13%), followed closely by DeepSeek-R1 (95.21%) and o3-mini (94.57%). While Gemini 2.0 Flash attains a relatively high accuracy of 92.11%, it underperforms relative to the top-tier models. In terms of stability, as quantified by standard deviation, both o1 (2.32) and DeepSeek-R1 (2.41) exhibit the most consistent performance. We believe that DeepSeek-R1’s high consistency can be attributed to its GRPO framework [37], which utilizes a multi-agent collaborative architecture that effectively integrates diverse reasoning pathways.
This synergy helps mitigate task-specific biases and enhances consistency by reducing variability. The structured approach allows the model to dynamically prioritize high-confidence solutions while iteratively refining less certain outputs, thereby contributing to its robustness across heterogeneous datasets.

By contrast, o1-mini demonstrated significantly higher variability (standard deviation: 8.31), suggesting sensitivity to dataset-specific characteristics. Gemini 2.0 Flash (4.59) and o3-mini (5.74) exhibited moderate fluctuations in performance, likely due to limitations in their single-agent reasoning paradigms. GPT-4o also exhibited a significantly larger standard deviation (10.42) and was one of the two most inconsistent models alongside DeepSeek-1.5B. Meanwhile, DeepSeek-1.5B’s considerably lower mean accuracy (67.78%) and the highest standard deviation (12.82) reflect substantial variability in its responses, further cementing the detrimental impact of parameter reduction on mathematical reasoning.

Table 6: Performance statistics of different LLM models across all datasets, including mean accuracy (in %) and standard deviation.

| LLM Models | Gemini 2.0 Flash | GPT-4o | o1-mini | o1 | o3-mini | DeepSeek-1.5B | DeepSeek-R1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mean (%) | 92.11 | 82.59 | 87.72 | 96.13 | 94.57 | 67.78 | 95.21 |
| Standard Deviation | 4.59 | 10.42 | 8.31 | 2.32 | 5.74 | 12.82 | 2.41 |

Figure 10 presents the performance statistics of different LLMs across three datasets. The bar chart in Figure 10 illustrates the mean accuracy (%) of each model, with error bars representing the standard deviation of the accuracy percentages across datasets. The standard deviation bars indicate the variability in the performance of each model. Notably, o1 from OpenAI demonstrates the highest mean accuracy with a low variance, whereas DeepSeek-1.5B exhibits the largest variability, suggesting inconsistency in its accuracy across datasets. This visualization underscores that o1 and DeepSeek-R1 excel not only in accuracy but also in reliability for mathematical problem-solving tasks.

Figure 10: Performance comparison of five LLMs across multiple datasets, grouped by publisher. The bars represent the mean accuracy (%), while the error bars indicate the standard deviation across three benchmark datasets.

4.5 Response time analysis

Efficient response times are crucial in the deployment of LLMs, as they directly influence user experience and the practicality of real-time applications. Factors affecting inference speed include model size, computational complexity, hardware capabilities, and optimization techniques.

Figure 11: Latency Comparison across different LLMs

Notably, Figure 11 presents a comparative analysis of the average response time per query across various LLMs. The DeepSeek-R1 model exhibits the highest latency, averaging 81.0 seconds per response, which implies that DeepSeek-R1 is significantly slower than all other models. Among OpenAI’s models, o1 demonstrates a high response time of 15.0 seconds, likely due to its complex CoT process. GPT-4o also exhibits relatively high latency at 11.5 seconds, making it one of the three models that exceed the 10-second latency threshold. In contrast, o1-mini and o3-mini achieved faster inference speeds of 6.8 seconds and 5.3 seconds, respectively, outperforming the other two OpenAI models. DeepSeek-1.5B operates at 5.0 seconds, slightly outperforming OpenAI’s o3-mini model. Notably, Gemini 2.0 Flash exhibits the fastest response time at 4.2 seconds, making it the most efficient model in terms of inference speed. Importantly, understanding these response times is essential for selecting appropriate models based on application requirements, balancing the trade-offs between latency and performance.
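As a cross-check on the summary statistics in Table 6, the sketch below recomputes the mean and standard deviation for three models from the accuracies in Tables 3, 4, and 5. The text does not state how the benchmarks are aggregated; the reported values appear consistent with treating MATH, GSM8K, and the four MMLU subjects as six scores and using the population standard deviation. That reading is an inference from the numbers, not a documented procedure, and the names `SCORES`, `REPORTED`, and `summarize` are illustrative.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple

# Accuracies (%) copied from Tables 3, 4, and 5, in the order:
# MATH, GSM8K, Abstract Algebra, College Mathematics, Formal Logic,
# High School Mathematics.
SCORES = {
    "o1": [93.12, 96.13, 94.00, 99.00, 95.24, 99.26],
    "DeepSeek-R1": [90.45, 96.13, 94.00, 96.00, 97.62, 97.04],
    "DeepSeek-1.5B": [65.64, 81.12, 61.00, 65.00, 47.62, 86.30],
}

# (mean, standard deviation) as reported in Table 6.
REPORTED = {
    "o1": (96.13, 2.32),
    "DeepSeek-R1": (95.21, 2.41),
    "DeepSeek-1.5B": (67.78, 12.82),
}


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and population standard deviation of a list of accuracies."""
    return mean(scores), pstdev(scores)


if __name__ == "__main__":
    for model, scores in SCORES.items():
        m, s = summarize(scores)
        print(f"{model}: mean={m:.3f}, std={s:.3f}, reported={REPORTED[model]}")
```

Under this assumption the recomputed values agree with Table 6 to within rounding for all three models.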
5 Discussion

In this study, we conducted a comprehensive evaluation of various LLMs to assess their mathematical reasoning capabilities across multiple benchmarks [16,39,40]. The evaluated LLMs include Gemini 2.0 Flash [45], GPT-4o [44], o1-mini [46], o1 [103], o3-mini [6], DeepSeek-1.5B [57], and DeepSeek-R1 [56]. Our empirical analysis revealed that DeepSeek-R1 and o1 outperformed their peers in most mathematical domains, demonstrating strong mathematical reasoning capabilities in solving structured problems. However, their performance was not on par with o3-mini in some complex and abstract fields, such as Abstract Algebra. Additionally, our findings underscore the impact of model distillation on mathematical reasoning performance. Specifically, distilled variants showcased noticeable declines in accuracy, reinforcing concerns that distillation, while improving computational efficiency, may inadvertently weaken the LLM’s capability to handle intricate problem-solving tasks. These observations suggest that there exists a trade-off between computational efficiency and reasoning depth, which needs to be carefully managed in future model optimizations.

First, our results further validate the advanced capabilities of DeepSeek-R1, which is trained primarily with RL rather than relying on Supervised Fine-Tuning (SFT) alone [31,56,69]. DeepSeek-R1 incorporates GRPO, a specialized RL-based optimization strategy that enhances training efficiency by evaluating model actions relative to a group of sampled peers. Unlike traditional actor-critic RL approaches, which train a separate value (critic) model, GRPO removes the need for that critic by computing advantages directly from group-based scoring [56,57]. This training paradigm enables DeepSeek-R1 to balance cost and performance efficiently, leading to competitive accuracy in solving grade school-level mathematical problems.
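The group-based scoring described above can be illustrated with a small sketch. It assumes scalar rewards are already available for a group of responses sampled for the same prompt (here a hypothetical 1.0 for a correct boxed answer and 0.0 otherwise) and standardizes each reward against the group's mean and standard deviation, which is the group-relative advantage at the core of GRPO. The function name, the `eps` term, and the choice of the population standard deviation are illustrative assumptions, not details taken from this paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against its own sampled group.

    No learned value model is involved: the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four sampled answers to one question.
    rewards = [1.0, 0.0, 0.0, 1.0]
    print(group_relative_advantages(rewards))
    # A group with identical rewards carries no learning signal.
    print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Responses that score above their group's average receive positive advantages and are reinforced, while below-average responses are discouraged.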
The superior performance of DeepSeek-R1 on GSM8K highlights its strength in handling structured, multi-step arithmetic reasoning tasks. Notably, DeepSeek-R1 remains comparable to o1 and o3-mini, demonstrating that while GRPO enhances efficiency, it does not necessarily surpass models with specific CoT in all problem categories.

In addition, our analysis of DeepSeek-R1’s performance across MMLU benchmarks reveals a significant variation in mathematical reasoning capabilities across different domains. While DeepSeek-R1 excelled in Formal Logic, it struggled in more abstract mathematical areas, such as College Mathematics and Abstract Algebra. This discrepancy suggests that GRPO is particularly effective for solving problems that rely on structured logical reasoning, where a group-based approach to RL can refine logical consistency and decision-making [34]. Formal Logic problems, for instance, often involve pattern recognition and structured deduction [34], which could align well with GRPO’s framework. The group scoring mechanism in GRPO enables the model to internalize patterns commonly recognized by human reasoning, allowing it to perform well in problems that require rule-based logical deductions.

However, the performance drop in Abstract Algebra and College Mathematics suggests that DeepSeek-R1 struggles with tasks requiring highly abstract conceptualization. Abstract Algebra, in particular, represents a fundamental shift from procedural algebra to the exploration of abstract mathematical structures such as groups, rings, and fields [99]. Unlike rule-based logic problems, abstract algebra necessitates deep symbolic manipulation, theorem proofs, and the ability to navigate complex algebraic structures [99], which require advanced deductive reasoning beyond GRPO’s core optimization strategy.
Furthermore, proof-based reasoning, a crucial component of higher-level mathematics, demands meta-cognitive processes that extend beyond pattern recognition, including recursive logic application and theorem proving. The limitations of DeepSeek-R1 in generalizing across abstract mathematical problems indicate that a more versatile reasoning approach may be needed to complement GRPO-based training.

In contrast, CoT reasoning demonstrates strong performance in specific mathematical fields by leveraging step-by-step logical breakdowns to enhance problem-solving accuracy. CoT excels in tasks that require explicit intermediate reasoning steps, which helps models navigate complex mathematical derivations more effectively. However, despite these advantages, CoT remains constrained by its tendency to overfit to structured problem-solving paradigms [9, 50], leading to limited adaptability when faced with problems requiring broader conceptual generalization. Thus, while CoT provides a deeper, structured reasoning framework, it may struggle to satisfy the generalization requirements necessary for handling diverse mathematical challenges. The trade-off between GRPO’s structured logic refinement and CoT’s sequential deductive approach suggests that a hybrid reasoning mechanism, integrating the strengths of both frameworks, may be essential for developing a more versatile and mathematically proficient LLM.

In summary, our findings align with prior research [2, 4, 27, 31, 47, 52, 54, 64, 66, 72, 76, 91, 103, 105, 107], further confirming that DeepSeek-R1 demonstrates strong mathematical reasoning compared to other LLMs. However, our research extends beyond previous work by providing a granular, field-specific analysis of LLM performance across different domains.
Instead of evaluating general reasoning skills, we delve into the nuances of logical, arithmetic, and abstract reasoning capabilities, offering a holistic understanding of how different LLMs excel or struggle across mathematical disciplines.

Despite its valuable contributions, our study has several limitations:

1). Limited Model Scope: While our evaluation includes prominent LLMs, we did not test other cutting-edge models (such as Llama-3.2 / Llama-3.3, Claude, and Mistral), which could provide additional insights into alternative training methodologies.
2). Restricted Benchmark Diversity: Our assessment relies on three primary benchmark datasets, which, while comprehensive, may not fully capture the breadth of mathematical reasoning challenges in real-world applications.

More importantly, future research should expand the scope of evaluated LLMs and incorporate additional benchmarks to gain a broader understanding of model strengths and weaknesses across varied mathematical disciplines. Specifically, integrating GRPO with peer CoTs could allow models to leverage both structured logical optimization and stepwise deductive reasoning. This integration could lead to LLM swarms, where multiple reasoning frameworks synergize to enhance efficiency and accuracy. Meanwhile, our study reaffirms that distillation can significantly impair reasoning capabilities. Future efforts should explore adaptive distillation methods that preserve deep reasoning pathways while optimizing for computational efficiency. Beyond current CoT frameworks [14, 53, 88, 89, 95], investigators could explore advanced reasoning frameworks, such as Chain of Draft (CoD) [93]. Insights into how LLMs handle interconnected logical dependencies could improve abstract mathematical problem-solving [15]. Future models can strike a balance between efficiency, depth, and mathematical intelligence by advancing multi-framework LLMs and refining distillation strategies.

Meanwhile, Xia et al. [88] introduced ReasonEval, a novel methodology for evaluating the quality of reasoning steps in mathematical tasks, emphasizing the need to assess both validity and redundancy in intermediate steps. Additionally, Frieder et al. [32] explored the mathematical capabilities of ChatGPT and GPT-4, demonstrating their effectiveness in querying mathematical facts and analyzing their performance across varying levels of mathematical complexity. These studies lay the foundation for more comprehensive and rigorous evaluations in the future, advancing our understanding of LLMs in mathematical reasoning.

6 Conclusion

Overall, our study presents a comprehensive comparative analysis of LLM performance in mathematical reasoning, identifying key strengths and weaknesses across multiple domains. We highlight the effectiveness of GRPO in structured problem-solving, demonstrating its ability to enhance logical consistency and rule-based reasoning. However, we also uncover its limitations in handling abstract mathematical concepts, where deep symbolic manipulation and theorem-based reasoning remain challenging for GRPO-trained models. Furthermore, our findings underscore the trade-offs associated with model distillation, revealing its potential drawbacks in preserving mathematical reasoning capabilities. As a result, we propose new strategies for refining distillation techniques, ensuring that compressed models maintain robust problem-solving proficiency without sacrificing computational efficiency. Beyond performance evaluation, this study paves the way for future advancements in LLM-driven mathematical intelligence by outlining potential enhancements in reasoning frameworks, including hybrid models that integrate RL with structured step-by-step inference methods.
By addressing these challenges, our work contributes to the development of next-generation AI systems capable of handling complex mathematical reasoning with greater accuracy, efficiency, and adaptability.

7 Acknowledgement

The authors acknowledge the support of the Augusta University High Performance Computing Services (AUHPCS) for providing computational resources contributing to the results presented in this publication/report.

References

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157, 2024.
[3] Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges, 2024.
[4] Amogh Akella. Improving math problem solving in large language models through categorization and strategy tailoring. arXiv preprint arXiv:2411.00042, 2024.
[5] Xavier Amatriain. Prompt design and engineering: Introduction and advanced methods. arXiv preprint arXiv:2401.14423, 2024.
[6] Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, and Sergio Segura. Early external safety testing of openai’s o3-mini: Insights from the pre-deployment evaluation. arXiv preprint arXiv:2501.17749, 2025.
[7] Babette M. Benken, Jorge Ramírez, Xuhui Li, and Scott Wetendorf. Developmental mathematics success: Impact of students’ knowledge and attitudes. Journal of Developmental Education, 38:14, 2015.
[8] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
[9] Johan Boye and Birger Moell. Large language models and mathematical reasoning failures. arXiv preprint arXiv:2502.11574, 2025.
[10] William Cain. Prompting change: Exploring prompt engineering in large language model ai and its potential to transform education. TechTrends, 68(1):47–57, 2024.
[11] Nicola Cecere, Andrea Bacciu, Ignacio Fernández Tobías, and Amin Mantrach. Monte carlo temperature: a robust sampling strategy for llm’s uncertainty quantification methods. arXiv preprint arXiv:2502.18389, 2025.
[12] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. ACM transactions on intelligent systems and technology, 15(3):1–45, 2024.
[13] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[14] Qiguang Chen, Libo Qin, Jiaqi Wang, Jingxuan Zhou, and Wanxiang Che. Unlocking the capabilities of thought: A reasoning boundary framework to quantify and optimize chain-of-thought. Advances in Neural Information Processing Systems, 37:54872–54904, 2025.
[15] Zui Chen, Tianqiao Liu, Mi Tian, Weiqi Luo, Zitao Liu, et al. Advancing mathematical reasoning in language models: The impact of problem-solving data, data synthesis methods, and training stages. In The Thirteenth International Conference on Learning Representations, 2025.
[16] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[17] Google CoLab. Google colab for quantitative comparisons in gsm8k. https://colab.research.google.com/drive/1guO9lNk9WENzyin9wgjkHa_4rhrnRyuP?usp=sharing. Accessed: 2025-03-04.
[18] Google CoLab. Google colab for quantitative comparisons in math. https://colab.research.google.com/drive/1AwK-k9uVtd6aPxK-KDBmfpl727j5cxfz?usp=sharing. Accessed: 2025-03-04.
[19] Google CoLab. Google colab for quantitative comparisons in mmlu. https://colab.research.google.com/drive/1ye1Mvu5rSehQBcoUQzMhi8QYJjgqZyu7?usp=sharing. Accessed: 2025-03-04.
[20] Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024.
[21] Haixing Dai, Yiwei Li, Zhengliang Liu, Lin Zhao, Zihao Wu, Suhang Song, Ye Shen, Dajiang Zhu, Xiang Li, Sheng Li, et al. Ad-autogpt: an autonomous gpt for alzheimer’s disease infodemiology. arXiv preprint arXiv:2306.10095, 2023.
[22] Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Yihan Cao, Zihao Wu, Lin Zhao, Shaochen Xu, Fang Zeng, Wei Liu, et al. Auggpt: Leveraging chatgpt for text data augmentation. IEEE Transactions on Big Data, 2025.
[23] Robert Dale. Gpt-3: What’s it good for? Natural Language Engineering, 27(1):113–118, 2021.
[24] Sumit Kumar Dam, Choong Seon Hong, Yu Qiao, and Chaoning Zhang. A complete survey on llm-based ai chatbots. arXiv preprint arXiv:2406.16937, 2024.
[25] Paul Denny, Juho Leinonen, James Prather, Andrew Luxton-Reilly, Thezyrie Amarouche, Brett A Becker, and Brent N Reeves. Prompt problems: A new programming exercise for the generative ai era. In Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1, pages 296–302, 2024.
[26] Ricardo Dominguez-Olmedo, Moritz Hardt, and Celestine Mendler-Dünner. Questioning the survey responses of large language models. Advances in Neural Information Processing Systems, 37:45850–45878, 2025.
[27] Yucong Duan. Meta-analysis and evaluation of large models’ mathematical abilities based on dikwp semantic mathematics. Unpublished, 2025.
[28] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[29] Gabriel Dulac-Arnold, Nir Levine, Daniel J Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Mach. Learn., 110(9):2419–2468, September 2021.
[30] Maxim Enis and Mark Hopkins. From llm to nmt: Advancing low-resource machine translation with claude. arXiv preprint arXiv:2404.13813, 2024.
[31] Evgenii Evstafev. Token-hungry, yet precise: Deepseek r1 highlights the need for multi-step reasoning over speed in math. arXiv preprint arXiv:2501.18576, 2025.
[32] Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt. Advances in neural information processing systems, 36:27699–27744, 2023.
[33] Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. Specializing smaller language models towards multi-step reasoning, 2023.
[34] Bernhard Ganter and Rudolf Wille. Formal concept analysis: mathematical foundations. Springer Nature, 2024.
[35] Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? arXiv preprint arXiv:2406.04127, 2024.
[36] Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. arXiv preprint arXiv:2411.15594, 2024.
[37] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[38] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.
[39] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
[40] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
[41] Siyuan Hu, Mingyu Ouyang, Difei Gao, and Mike Zheng Shou. The dawn of gui agent: A preliminary case study with claude 3.5 computer use. arXiv preprint arXiv:2411.10323, 2024.
[42] Wenyang Hu, Yao Shu, Zongmin Yu, Zhaoxuan Wu, Xiaoqiang Lin, Zhongxiang Dai, See-Kiong Ng, and Bryan Kian Hsiang Low. Localized zeroth-order prompt optimization. Advances in Neural Information Processing Systems, 37:86309–86345, 2025.
[43] Jiaxing Huang and Jingyi Zhang. A survey on evaluation of multimodal large language models. arXiv preprint arXiv:2408.15769, 2024.
[44] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[45] Muhammad Imran and Norah Almusharraf. Google gemini as a next generation ai educational tool: a review of emerging educational technology. Smart Learning Environments, 11(1):22, 2024.
[46] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.
[47] Qile Jiang, Zhiwei Gao, and George Em Karniadakis. Deepseek vs. chatgpt: A comparative study for scientific computing and scientific machine learning tasks. arXiv preprint arXiv:2502.17764, 2025.
[48] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. Chatgpt for good? on opportunities and challenges of large language models for education. Learning and individual differences, 103:102274, 2023.
[49] Shalom Lappin. Assessing the strengths and weaknesses of large language models. Journal of Logic, Language and Information, 33(1):9–20, 2024.
[50] Chengpeng Li, Guanting Dong, Mingfeng Xue, Ru Peng, Xiang Wang, and Dayiheng Liu. Dotamath: Decomposition of thought with code assistance and self-correction for mathematical reasoning. arXiv preprint arXiv:2407.04078, 2024.
[51] Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, et al. Pre-trained language models for interactive decision-making. Advances in Neural Information Processing Systems, 35:31199–31212, 2022.
[52] Xiaoyuan Li, Wenjie Wang, Moxin Li, Junrong Guo, Yang Zhang, and Fuli Feng. Evaluating mathematical reasoning of large language models: A focus on error identification and correction. arXiv preprint arXiv:2406.00755, 2024.
[53] Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. arXiv preprint arXiv:2402.12875, 1, 2024.
[54] Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models. arXiv preprint arXiv:2502.17419, 2025.
[55] Shizhe Liang, Wei Zhang, and Tianyang Zhong. Mathematics and machine creativity: A survey on bridging mathematics with ai. arXiv preprint arXiv:2412.16543, 2024.
[56] Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024.
[57] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[58] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36:21558–21572, 2023.
[59] Zhengliang Liu, Yue Huang, Xiaowei Yu, Lu Zhang, Zihao Wu, Chao Cao, Haixing Dai, Lin Zhao, Yiwei Li, Peng Shu, et al. Deid-gpt: Zero-shot medical text de-identification by gpt-4. arXiv preprint arXiv:2303.11032, 2023.
[60] Zhengliang Liu, Hanqi Jiang, Tianyang Zhong, Zihao Wu, Chong Ma, Yiwei Li, Xiaowei Yu, Yutong Zhang, Yi Pan, Peng Shu, et al. Holistic evaluation of gpt-4v for biomedical imaging. arXiv preprint arXiv:2312.05256, 2023.
[61] Zhengliang Liu, Yiwei Li, Peng Shu, Aoxiao Zhong, Longtao Yang, Chao Ju, Zihao Wu, Chong Ma, Jie Luo, Cheng Chen, et al. Radiology-llama2: Best-in-class large language model for radiology. arXiv preprint arXiv:2309.06419, 2023.
[62] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models. CoRR, 2023.
[63] Zhenyan Lu, Xiang Li, Dongqi Cai, Rongjie Yi, Fangming Liu, Xiwen Zhang, Nicholas D Lane, and Mengwei Xu. Small language models: Survey, measurements, and insights. arXiv preprint arXiv:2409.15790, 2024.
[64] Sarah Mercer, Samuel Spillard, and Daniel P Martin. Brief analysis of deepseek r1 and it’s implications for generative ai. arXiv preprint arXiv:2502.02523, 2025.
[65] Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heintz, and Dan Roth. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys, 56(2):1–40, 2023.
[66] Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024.
[67] Gianluca Mondillo, Mariapia Masino, Simone Colosimo, Alessandra Perrotta, and Vittoria Frattolillo. Evaluating ai reasoning models in pediatric medicine: A comparative analysis of o3-mini and o3-mini-high. medRxiv, pages 2025–02, 2025.
[68] Aidar Myrzakhan, Sondos Mahmoud Bsharat, and Zhiqiang Shen. Open-llm-leaderboard: From multi-choice to open-style questions for llms evaluation, benchmark, and arena. arXiv preprint arXiv:2406.07545, 2024.
[69] Fnu Neha and Deepshikha Bhati. A survey of deepseek models. Authorea Preprints, 2025.
[70] Minh Nguyen, Andrew Baker, Clement Neo, Allen Roush, Andreas Kirsch, and Ravid Shwartz-Ziv. Turning up the heat: Min-p sampling for creative and coherent llm outputs. arXiv preprint arXiv:2407.01082, 2024.
[71] Xinzhe Ni, Yeyun Gong, Zhibin Gou, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Exploring the mystery of influential data for mathematical reasoning. arXiv preprint arXiv:2404.01067, 2024.
[72] Shuai Peng, Di Fu, Liangcai Gao, Xiuqin Zhong, Hongguang Fu, and Zhi Tang. Multimath: Bridging visual and mathematical reasoning for large language models. arXiv preprint arXiv:2409.00147, 2024.
[73] Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv preprint arXiv:2402.07927, 2024.
[74] Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, and Aviral Kumar. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold. Advances in Neural Information Processing Systems, 37:43000–43031, 2024.
[75] Sakib Shahriar, Brady D Lund, Nishith Reddy Mannuru, Muhammad Arbab Arshad, Kadhim Hayawi, Ravi Varma Kumar Bevara, Aashrith Mannuru, and Laiba Batool. Putting gpt-4o to the sword: A comprehensive evaluation of language, vision, speech, and multimodal proficiency. Applied Sciences, 14(17):7782, 2024.
[76] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[77] Zhuocheng Shen. Llm with tools: A survey. arXiv preprint arXiv:2409.18807, 2024.
[78] Ali Soroush, Benjamin S Glicksberg, Eyal Zimlichman, Yiftach Barash, Robert Freeman, Alexander W Charney, Girish N Nadkarni, and Eyal Klang. Large language models are poor medical coders—benchmarking of medical code querying. NEJM AI, 1(5):AIdbp2300040, 2024.
[79] Joseph Spracklen, Raveen Wijewickrama, AHM Sakib, Anindya Maiti, Bimal Viswanath, and Murtuza Jadliwala. We have a package for you! a comprehensive analysis of package hallucinations by code generating llms. arXiv preprint arXiv:2406.10279, 2024.
[80] Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. Chain of thoughtlessness? an analysis of cot in planning. Advances in Neural Information Processing Systems, 37:29106–29141, 2025.
[81] Zhi Rui Tam, Cheng-Kuang Wu, Chieh-Yen Lin, and Yun-Nung Chen. None of the above, less of the right: Parallel patterns between humans and llms on multi-choice questions answering. arXiv preprint arXiv:2503.01550, 2025.
[82] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
[83] Jiaqi Wang, Hanqi Jiang, Yiheng Liu, Chong Ma, Xu Zhang, Yi Pan, Mengyuan Liu, Peiran Gu, Sichen Xia, Wenjun Li, et al. A comprehensive review of multimodal large language models: Performance and challenges across different tasks. arXiv preprint arXiv:2408.01319, 2024.
[84] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
[85] Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R Bowman, He He, and Shi Feng. Language models learn to mislead humans via rlhf. arXiv preprint arXiv:2409.12822, 2024.
[86] Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Lidia Sam Chao, and Derek Fai Wong. A survey on llm-generated text detection: Necessity, methods, and future directions. Computational Linguistics, pages 1–66, 2025.
[87] Zhenyu Wu, Meng Jiang, and Chao Shen. Get an a in math: Progressive rectification prompting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19288–19296, 2024.
[88] Yu Xia, Rui Wang, Xu Liu, Mingyan Li, Tong Yu, Xiang Chen, Julian McAuley, and Shuai Li. Beyond chain-of-thought: A survey of chain-of-x paradigms for llms. arXiv preprint arXiv:2404.15676, 2024.
[89] Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, and Bo Li. Badchain: Backdoor chain-of-thought prompting for large language models. arXiv preprint arXiv:2401.12242, 2024.
[90] Zhenxiang Xiao, Yuzhong Chen, Junjie Yao, Lu Zhang, Zhengliang Liu, Zihao Wu, Xiaowei Yu, Yi Pan, Lin Zhao, Chong Ma, et al. Instruction-vit: Multi-modal prompts for instruction learning in vision transformer. Information Fusion, 104:102204, 2024.
[91] Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data. arXiv preprint arXiv:2405.14333, 2024.
[92] Shaochen Xu, Yifan Zhou, Zhengliang Liu, Zihao Wu, Tianyang Zhong, Huaqin Zhao, Yiwei Li, Hanqi Jiang, Yi Pan, Junhao Chen, et al. Towards next-generation medical agent: How o1 is reshaping decision-making in medical scenarios. arXiv preprint arXiv:2411.14461, 2024.
[93] Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He. Chain of draft: Thinking faster by writing less. arXiv preprint arXiv:2502.18600, 2025.
[94] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
[95] Wen Yang, Kai Fan, and Minpeng Liao. Markov chain of thought for efficient mathematical reasoning. arXiv preprint arXiv:2410.17635, 2024.
[96] Zhenyuan Yang, Zhengliang Liu, Jing Zhang, Cen Lu, Jiaxin Tai, Tianyang Zhong, Yiwei Li, Siyan Zhao, Teng Yao, Qing Liu, et al. Analyzing nobel prize literature with large language models. arXiv preprint arXiv:2410.18142, 2024.
[97] Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, page 100211, 2024.
[98] Xiaosong Yuan, Chen Shen, Shaotian Yan, Xiaofeng Zhang, Liang Xie, Wenxiao Wang, Renchu Guan, Ying Wang, and Jieping Ye. Instance-adaptive zero-shot chain-of-thought prompting. arXiv preprint arXiv:2409.20441, 2024.
[99] Yumiati Yumiati, Harry Dwi Putra, Saleh Haji, et al. Blended online learning: Students’ perception and its effect on learning outcomes abstract algebra. Infinity Journal, 14(1):65–84, 2025.
[100] Piotr Zablocki and Zofia Gajewska. Assessing hallucination risks in large language models through internal state analysis. Authorea Preprints, 2024.
[101] Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, William Song, Tiffany Zhao, Pranav Raja, Charlotte Zhuang, Dylan Slack, et al. A careful examination of large language model performance on grade school arithmetic. Advances in Neural Information Processing Systems, 37:46819–46836, 2024.
[102] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2), 2023.
[103] Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Yifan Zhou, Shizhe Liang, Zihao Wu, Yanjun Lyu, Peng Shu, Xiaowei Yu, et al. Evaluation of openai o1: Opportunities and challenges of agi. arXiv preprint arXiv:2409.18486, 2024.
[104] Tianyang Zhong, Yaonai Wei, Li Yang, Zihao Wu, Zhengliang Liu, Xiaozheng Wei, Wenjun Li, Junjie Yao, Chong Ma, Xiang Li, et al. Chatabl: Abductive learning via natural language interaction with chatgpt. arXiv preprint arXiv:2304.11107, 2023.
[105] Zihao Zhou, Shudong Liu, Maizhen Ning, Wei Liu, Jindong Wang, Derek F Wong, Xiaowei Huang, Qiufeng Wang, and Kaizhu Huang. Is your model really a good math reasoner? evaluating mathematical reasoning with checklist. arXiv preprint arXiv:2407.08733, 2024.
[106] Shiben Zhu, Wanqin Hu, Zhi Yang, Jiani Yan, and Fang Zhang. Qwen-2.5 outperforms other large language models in the chinese national nursing licensing examination: Retrospective cross-sectional comparative study. JMIR Medical Informatics, 13:e63731, 2025.
[107] Wenwen Zhuang, Xin Huang, Xiantao Zhang, and Jin Zeng. Math-puma: Progressive upward multimodal alignment to enhance mathematical reasoning. arXiv preprint arXiv:2408.08640, 2024.

---