Authors: Afrar Jahin, Arif Hassan Zidan, Yu Bao, Shizhe Liang, Tianming Liu, Wei Zhang
Paper Content:
Unveiling the Mathematical Reasoning in DeepSeek
Models: A Comparative Study of Large Language
Models
Afrar Jahin∗
School of Computer and Cyber Sciences
Augusta University, Augusta, GA, USA
ajahin@augusta.edu
Arif Hassan Zidan∗
School of Computer and Cyber Sciences
Augusta University, Augusta, GA, USA
azidan@augusta.edu
Yu Bao
Department of Graduate Psychology
James Madison University, Harrisonburg, VA, USA
bao2yx@jmu.edu
Shizhe Liang
Institute of Plant Breeding, Genetics & Genomics
University of Georgia, Athens, GA, USA
sl73741@uga.edu
Tianming Liu
School of Computing
University of Georgia, Athens, GA, USA
tliu@uga.edu
Wei Zhang†
School of Computer and Cyber Sciences
Augusta University, Augusta, GA, USA
wzhang2@augusta.edu
Abstract
With the rapid evolution of Artificial Intelligence (AI), Large Language Models
(LLMs) have reshaped the frontiers of various fields, spanning healthcare, public
health, engineering, science, agriculture, education, arts, humanities, and mathe-
matical reasoning. Among these advancements, DeepSeek models have emerged
as noteworthy contenders, demonstrating promising capabilities that set them apart
from their peers. While previous studies have conducted comparative analyses of
LLMs, few have delivered a comprehensive evaluation of mathematical reasoning
across a broad spectrum of LLMs. In this work, we aim to bridge this gap by
conducting an in-depth comparative study, focusing on the strengths and limita-
tions of DeepSeek models in relation to their leading counterparts. In particular,
our study systematically evaluates the mathematical reasoning performance of
two DeepSeek models alongside five prominent LLMs across three independent
benchmark datasets. The findings reveal several key insights: 1). DeepSeek-R1
consistently achieved the highest accuracy on two of the three datasets, demon-
strating strong mathematical reasoning capabilities. 2). The distilled variant of
LLMs significantly underperformed compared to its peers, highlighting potential
drawbacks in using distillation techniques. 3). In terms of response time, Gemini
2.0 Flash demonstrated the fastest processing speed, outperforming other models
in efficiency, which is a crucial factor for real-time applications. Beyond these
quantitative assessments, we delve into how architecture, training, and optimization
impact LLMs’ mathematical reasoning. Moreover, our study goes beyond
mere performance comparison by identifying key areas for future advancements in
LLM-driven mathematical reasoning. This research enhances our understanding of
LLMs’ mathematical reasoning and lays the groundwork for future advancements.
∗These authors contributed equally to this work.
†The corresponding author.
arXiv:2503.10573v1 [cs.LG] 13 Mar 2025
1 Introduction
As Artificial Intelligence (AI) continues to advance at an unprecedented pace, an array of powerful
Large Language Models (LLMs) has emerged, including OpenAI’s GPT-4o, o1, o3, Claude 3.5,
Llama 3.2, Qwen 2.5, and Gemini 2.0 [ 1,12,28,30,48,57,69,82,102]. Driven by cutting-
edge advancements in Deep Neural Networks (DNNs), these models integrate human cognition
elements to enhance problem-solving and decision-making [ 49]. Serving as a guiding light in
natural language processing (NLP) [ 65], healthcare [ 21,60,61], clinical textual processing [ 22,
59,104], biomedical image analyses [ 60], code generation [ 58], decision support [ 51], multimodal
data analytics [ 83,90], and mathematical reasoning [ 2,52,72,76,91,103,107], LLMs push the
boundaries of AI capabilities, approximating human-like reasoning through sophisticated statistical
inference [ 1,48,69]. However, despite their transformative potential, these models face notable
limitations. Their high computational demands pose significant barriers to broader accessibility,
making large-scale implementation costly [ 69]. Moreover, while LLMs perform well in general
contexts, they often struggle with specialized tasks, exhibiting inconsistencies in performance.
Multimodal models, for example, continue to face challenges in spatial reasoning and real-world
physics, while AI-assisted code generation frequently produces syntactically correct yet functionally
flawed outputs, requiring human oversight [ 58,103]. These constraints underscore the ongoing
need for refinement and innovation in AI research to bridge the gap between artificial and human
intelligence.
In particular, GPT-4o, released by OpenAI in May 2024, is a multimodal model capable of pro-
cessing text, images, and voice with remarkable efficiency. Leveraging an advanced transformer
architecture, it surpasses GPT-3 in critical areas such as mathematical reasoning and language
comprehension [ 1,23,44,75]. With an estimated 2 trillion parameters, GPT-4o is significantly
larger than its predecessors, enabling substantial improvements in performance and adaptability.
Meanwhile, other cutting-edge models, such as o1 and o3, were introduced in 2024 and 2025,
respectively. In particular, o1 enhances reasoning capabilities by incorporating a Chain-of-Thought
(CoT) approach [80], in which it generates intermediate reasoning steps before reaching a final answer,
closely mirroring the way humans process complex problems [103]. More significantly, o3 advances
reasoning performance through a "simulated reasoning" process, in which the model actively gen-
erates and evaluates multiple solution paths [ 6]. By pausing, reflecting, and adjusting its approach
before delivering a final answer, o3 exhibits human-like reasoning, allowing for more nuanced and
adaptive problem-solving, particularly in highly complex scenarios [ 67]. However, none of the GPT
models, including GPT-4o, o1, and o3, are open-source, as they remain proprietary.
In addition, Claude 3.5, released in 2024, builds on previous versions. It emphasizes safety, alignment,
and performance, with improvements in reasoning, language understanding, and handling complex
tasks like text and code generation [ 30]. With 250B parameters, it surpasses earlier models in
accuracy and ethical alignment. It supports up to 200K tokens for extended context, enabling better
processing of larger inputs. Notably, enhanced by reinforcement learning from human feedback
(RLHF) [41, 85] and Constitutional AI, it reduces undesirable responses and biases and better aligns
with human intent. Claude 3.5 excels in specialized areas like coding and scientific reasoning, with
improved transparency and ethical safeguards [30, 69].
Furthermore, Llama-3.3, presented in 2024, is the latest LLM in Meta AI’s Llama family, following
Llama-1 and Llama-2. Llama-3.3 advances further with 70B parameters and a 128K-token context
window, improved by grouped-query attention for better efficiency [28, 63]. Llama-3.3 excels in
coding, logical problem solving, and low-resource language tasks. Unlike closed models such as the GPT
series, it remains open-weight and freely accessible for research and commercial use, but is restricted
to text-only input [26, 63, 69]. Safety measures, such as automated red-teaming and filtered training
data, help minimize undesirable outputs. Moreover, Gemini 2.0 Flash is Google’s latest multimodal LLM,
building on versions 1.0 and 1.5 to offer more robust generative AI capabilities across
text, images, audio, and video.
Gemini 2.0 Flash, initially introduced as an experimental variant, provides significant speed
and efficiency gains over its predecessor, Gemini 1.5 Flash, without sacrificing quality [43, 45,
82]. Notably, it outperforms Gemini 1.5 Pro on key benchmarks while operating at twice the speed. It
enables agentic AI through native tool use, allowing the model to call external functions
(Google Search and Maps) and integrate streaming data for expanded real-time applications. By
combining better performance in tasks such as math, code generation, and multilingual audio output
with enhanced efficiency, Gemini 2.0 aims to deliver comprehensive, cost-effective AI solutions
for both developers and end users [43, 82]. Lastly, Qwen2.5, released in September 2024, is the
latest iteration in the Qwen series, following Qwen2 in June 2024 and the original Qwen in August
2023. Qwen1.5 featured models up to 72B parameters, emphasizing efficiency and open-source
accessibility. Qwen2 introduced improved reasoning, multilingual support, and coding capabilities,
with models scaling up to 72.71B parameters [94, 106].
Besides, established in 2023 as a research initiative to push the boundaries of artificial general
intelligence (AGI), DeepSeek set out to overcome existing limitations by developing specialized
models focused on efficiency, adaptability, and domain expertise [8, 20, 37, 38, 48, 56, 57, 76]. In
2024, DeepSeek introduced a Mixture-of-Experts (MoE) architecture, an efficiency-driven design that
leverages sparse activation to reduce computational overhead [38, 56, 57]. This was followed by
the launch of DeepSeek Coder, a suite of code-focused models ranging from 1B to 33B parameters,
designed to streamline software development workflows. Meanwhile, DeepSeek Math, trained on
120B math-related tokens, was developed to handle advanced mathematical and symbolic reasoning
tasks [ 76]. Expanding its model portfolio, DeepSeek introduced the V2 and V3 series [ 56,57]. V2
implemented multi-head latent attention (MLA) alongside a MoE system with 236B parameters in
total, of which only 21B were active per query, optimizing computational efficiency [ 56]. V3, an
open-source model, further enhanced efficiency with 671B total parameters, activating only 37B per
query, excelling in complex reasoning tasks while minimizing resource demands and reliance on
supervised data [57]. In 2025, DeepSeek made a significant breakthrough with R1-Zero, incorporating
self-verification, reflection, and extended CoTs. Notably, DeepSeek-R1 was recently presented,
specifically designed for mathematical, coding, and logical problem-solving, enhancing autonomous
decision-making and precision in both research and enterprise applications [37, 38, 76]. To extend
accessibility, DeepSeek released an open-source suite of distilled models, optimized for deployment
in resource-constrained environments such as edge computing platforms and low-memory systems.
These models preserve scalability and cost-effectiveness, making cutting-edge AI more accessible
across diverse applications.
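As a small arithmetic illustration of the sparse-activation figures quoted above (236B total and 21B active for V2, 671B total and 37B active for V3), the following sketch computes the share of parameters used per query. The helper name active_fraction is hypothetical, and the parameter counts are the rounded values given in the text rather than exact model sizes.

```python
# Parameter counts in billions, as quoted in the text (rounded figures).
MODELS = {
    "DeepSeek-V2": {"total_b": 236, "active_b": 21},
    "DeepSeek-V3": {"total_b": 671, "active_b": 37},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of parameters activated per query (hypothetical helper)."""
    if total_b <= 0:
        raise ValueError("total parameter count must be positive")
    return active_b / total_b


for name, params in MODELS.items():
    share = active_fraction(params["total_b"], params["active_b"])
    print(f"{name}: {share:.1%} of parameters active per query")
```

Under these rounded figures, both models activate well under a tenth of their parameters per query, which is the sense in which MoE reduces computational overhead.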
Notably, mathematical reasoning usually poses intricate challenges without straightforward solu-
tions [ 3,55]. Unlike routine tasks that follow established frameworks, mathematical reasoning
demands creativity, abstract thinking, and advanced cognitive skills. In this study, we systematically
evaluate the mathematical reasoning performance of a wide array of LLMs, with a particular emphasis
on DeepSeek models, to further unveil machine creativity in mathematical reasoning. Leveraging mul-
tiple independent datasets and quantitative metrics, we conduct a comprehensive empirical analysis to
assess and compare their capabilities. Our findings provide a detailed evaluation of prominent LLMs
in mathematical reasoning while highlighting the advancements and unique strengths of DeepSeek
models.
2 Related Works
Recent research has increasingly focused on summarizing and evaluating the performance of LLMs
in problem-solving and code generation, areas where general-purpose LLMs excel in text-based tasks
but often struggle with mathematical precision and structured reasoning [ 24,36,48,77,86,97]. To
address these limitations, LLM developers have prioritized enhancing reasoning capabilities and improving
computational efficiency in next-generation models, aiming to bridge the gap between linguistic
fluency and robust problem-solving skills.
Specifically, in 2024, Wang et al. summarized and detailed the specifications of Multimodal Large
Language Models (MLLMs), which stand at the forefront of AI [83]. In brief, Wang et al. surveyed
a wide array of data modalities, such as text, images, audio, and sequential data [83]. MLLMs
play a crucial role in multimodal understanding tasks by integrating text and image information to
achieve more intelligent and comprehensive understanding and reasoning. In this field, Wang et al.
outlined the development, advancement, and use of multiple MLLMs, including three highly regarded
models (MiniGPT-4, InstructBLIP, and Wiki-LLaVA), along with other related MLLMs
such as 3DMIT, GroundingGPT, ModaVerse, Vary-toy, LLaVA-MoLE, and CogCom [83].
In addition, at the end of 2024, Zhong et al. conducted a comprehensive evaluation of
o1-preview (the early version of o1) to showcase its capabilities across a
diverse array of complex reasoning tasks, spanning multiple domains, including computer science,
mathematics, natural sciences, medicine, linguistics, and social sciences [103]. Through extensive
testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior
performance in areas ranging from coding challenges to scientific reasoning and from language
processing to creative problem-solving [ 103]. Key findings include: 1). 83.3% success rate in solving
complex competitive programming problems, surpassing human experts. 2). Superior ability in
generating coherent and accurate radiology reports, outperforming other evaluated models. 3). 100%
accuracy in high school-level mathematical reasoning tasks, providing detailed step-by-step solutions.
4). Advanced natural language inference capabilities across general and specialized domains like
medicine. 5). Impressive performance in chip design tasks, outperforming specialized models in areas
such as script generation and bug analysis. 6). Remarkable proficiency in anthropology and geology,
demonstrating deep understanding and reasoning in these specialized fields. 7). Comprehensive
financial knowledge and strong capabilities in quantitative investing. 8). Effective performance
in social media analysis, including sentiment analysis and emotion recognition. 9). Excellent
performance in educational measurement and psychometrics, demonstrating a solid grasp of standard
psychometric concepts that are equivalent to or beyond a first-year master’s or doctoral student’s
level.
Furthermore, Yang et al. investigated the capabilities of advanced LLMs, particularly the o1 model,
in the literary analysis of Nobel Prize-winning works [96]. Given the prestige of the Nobel Prize and its emphasis on cultural, historical,
and linguistic depth, applying LLMs to these works offers valuable insights into both human and AI
approaches to literary interpretation [ 96]. The study employed qualitative and quantitative evaluations
to assess coherence, creativity, and fidelity to the text, shedding light on the strengths and limitations
of AI in domains traditionally dominated by human expertise. While LLMs demonstrated remarkable
analytical capabilities, particularly in structured tasks, they struggled with emotional nuance and
coherence—areas where human interpretation remains unparalleled. This research underscores
the transformative potential of human-AI collaboration in the humanities, paving the way for new
opportunities in literary studies, textual analysis, and interdisciplinary research. By leveraging AI’s
analytical power alongside human intuition, this study highlights promising directions for AI-assisted
literary interpretation and beyond.
Moreover, Xu et al. demonstrated that LLMs have made significant advancements in clinical
decision-making, particularly those leveraging in-context demonstrations and specialized medical
fine-tuning [ 92]. These models exhibit strong performance in medical language processing but
continue to face challenges in real-time adaptability, multi-step reasoning, and handling complex
medical tasks. Agent-based AI systems aim to overcome these limitations by incorporating reasoning
traces, contextual tool selection, knowledge retrieval, and both short- and long-term memory. These
features enable medical AI agents to manage complex clinical scenarios where decision-making
requires real-time interaction with the environment [ 92]. Unlike conventional model-based approaches
that treat medical queries as isolated questions, medical AI agents approach them as dynamic,
multi-faceted tasks, allowing them to function more like human doctors. Xu et al. also investigated the selection
of the backbone LLM for medical AI agents, which serves as the foundation for their reasoning and
action generation. Specifically, Xu et al. [ 92] examined the capabilities of the emerging o1 model
and its impact on agents’ reasoning, tool-use adaptability, and real-time information retrieval across
diverse clinical scenarios, including high-stakes environments such as intensive care units (ICUs).
The findings highlighted o1’s potential to enhance diagnostic accuracy and consistency, paving the
way for more intelligent, responsive AI tools that support improved patient outcomes and more
effective decision-making in clinical practice [92].
Notably, Ahn et al. [ 3] addressed four critical aspects of LLMs in mathematical reasoning: 1).
mathematical problem types and datasets, 2). techniques for enhancing LLM performance (such as
prompt engineering and fine-tuning), 3). factors affecting model effectiveness (including scale and
pre-training data), and 4). challenges such as brittleness and inconsistency, which often cause models
to generate different answers for similar problems.
Besides, Liang et al. presented a comprehensive overview of the applications of AI in mathematical
research, highlighting the transformative role AI has begun to play in this domain [ 55]. Traditionally,
AI advancements have heavily relied on theoretical foundations provided by mathematics and
statistics. However, recent developments in AI, particularly in reinforcement learning (RL) and
LLMs, have demonstrated the potential for AI to contribute back to mathematics by offering flexible
algorithmic frameworks and powerful inductive reasoning capabilities that support various aspects
of mathematical research. This survey aimed to establish a bridge between AI and mathematics,
providing insights into the mutual benefits and fostering deeper interdisciplinary understanding. In
particular, Liang et al. discussed that while current AI and LLMs may struggle with complex deductive
reasoning, their inherent creativity, the capability to generate outputs at high throughput based on
recognition of shallow patterns, holds significant potential to support and inspire mathematical
research [ 55]. Furthermore, Liang et al. [ 55] addressed the lack of cross-disciplinary communication:
mathematicians may not fully comprehend the latest advances in AI, while AI researchers frequently
prioritize benchmark performance over real-world applications in frontier mathematical research.
Lastly, their survey seeks to close that gap, offering a detailed exploration of AI fundamentals, its
strengths, and its emerging applications in the mathematical sciences.
While various studies have begun exploring in-depth comparisons between DeepSeek-R1 and other
leading models [37, 38, 76], few provide a comprehensive analysis of their mathematical reasoning
capabilities. Thus, in this work, we introduce a holistic empirical study to systematically compare
the mathematical reasoning performance of DeepSeek-R1 with its peer models. Our goal is to contribute
to a deeper understanding of DeepSeek-R1’s strengths and limitations while inspiring further
advancements in future work, using a wide array of peer models, datasets, and quantitative metrics.
3 Methodology
In this section, we provide a comprehensive overview of our experimental framework, detailing
the datasets utilized, the quantitative evaluation metrics, as well as the overall methodology. We
also present a justification for selecting these datasets as benchmarks, emphasizing their relevance,
diversity, and suitability in effectively assessing LLM performance in mathematical reasoning. By
establishing a consistent evaluation framework, we aim to ensure a thorough and insightful comparison
of model capabilities across various reasoning tasks.
3.1 Experimental Frameworks
The framework of this empirical study is illustrated in Figure 1 and consists of five key components:
1). Benchmark Data: We utilized three representative benchmark datasets [ 16,39,62] to evaluate
the mathematical reasoning capabilities of multiple LLMs. 2). Zero-Shot Prompting: Zero-shot
prompting was employed to design all prompts [ 42], ensuring a standardized evaluation approach. 3).
Language Model Selection: A diverse set of representative LLMs was selected, including Google’s
Gemini 2.0 Flash [ 45], OpenAI’s GPT-4o [ 1], o1-mini [ 46], o1 [ 103], o3-mini [ 6], DeepSeek’s
DeepSeek-R1-Distill-Qwen-1.5B [ 37], hereafter referred to as DeepSeek-1.5B, and DeepSeek-
R1 [56]. 4). Evaluation: Accuracy was used as a critical quantitative metric to assess and compare
the mathematical reasoning performance of these models [ 17–19]. 5). Comparison Between Models:
A comprehensive statistical analysis was conducted to evaluate and contrast the performance of each
LLM, offering deeper insights into their mathematical reasoning capabilities. Overall, this structured
framework enabled a holistic and systematic comparison of LLMs in mathematical reasoning.
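The five components above can be sketched as a small evaluation loop. This is only an illustrative outline under stated assumptions: the Problem dataclass, the evaluate and compare helpers, and the toy models are hypothetical stand-ins, and a real run would replace each toy callable with an API call to the corresponding LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Problem:
    question: str
    answer: str  # ground-truth final answer, stored as a string


def evaluate(model: Callable[[str], str], problems: List[Problem]) -> float:
    """Accuracy: share of problems whose returned answer equals the ground truth."""
    if not problems:
        raise ValueError("no problems to evaluate")
    correct = sum(model(p.question).strip() == p.answer.strip() for p in problems)
    return correct / len(problems)


def compare(models: Dict[str, Callable[[str], str]],
            problems: List[Problem]) -> Dict[str, float]:
    """Evaluate every model on the same problems so the scores are comparable."""
    return {name: evaluate(fn, problems) for name, fn in models.items()}


# Toy stand-ins for benchmark data and LLM calls (assumptions for illustration).
toy_problems = [Problem("2 + 2 = ?", "4"), Problem("3 * 5 = ?", "15")]
toy_models = {
    "always_four": lambda q: "4",
    "lookup": lambda q: {"2 + 2 = ?": "4", "3 * 5 = ?": "15"}[q],
}
print(compare(toy_models, toy_problems))
```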
3.2 Datasets
To evaluate the mathematical reasoning capabilities of LLMs, we employ well-established bench-
mark datasets that encompass diverse mathematical domains, question types, and difficulty levels.
The selected datasets include Math Competition (MATH) [ 40], Grade School Math 8K (GSM8K)
[16], and Massive Multitask Language Understanding (MMLU) [ 39], covering topics ranging from
elementary arithmetic to advanced algebra, logic, and mathematical competitions. From each bench-
mark, we carefully select a representative subset of problems to ensure a balanced evaluation while
maintaining the diversity and complexity of mathematical reasoning tasks. These subsets are chosen
for their broad coverage of mathematical domains, variety in question formats, and representation of
different complexity levels, ensuring a comprehensive assessment of LLMs’ symbolic manipulation,
logical reasoning, and problem-solving abilities [35, 74, 101]. In total, our evaluation spans 2,178
mathematical problems across the three benchmark datasets.
Figure 1: This figure illustrates the experimental framework of this work, including five vital
components.
By evaluating LLMs on these benchmarks, we identified their strengths and limitations in mathemati-
cal reasoning, aiding in the development of more robust and capable models. Table 1 provides an
overview of the datasets utilized in this study.
Table 1: Overview of Benchmark Data Subsets Used for Evaluation
Dataset | Category | Question Type | Domain | Number of Problems
MATH | Mathematics competitions | Free-form numeric | Algebra, Geometry, Number Theory, Combinatorics | 262
GSM8K | Grade School Math | Word Problems | Arithmetic | 1320
MMLU | College Mathematics | Multiple-choice | Algebra, Calculus, Discrete Math | 100
MMLU | Abstract Algebra | Multiple-choice | Group Theory, Rings, Algebraic Structures | 100
MMLU | Formal Logic | Multiple-choice | Proofs, Logical Reasoning | 126
MMLU | High School Mathematics | Multiple-choice | General Mathematics | 270
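The subset sizes in Table 1 can be checked against the stated total of 2,178 problems with a short tally. The dictionary below simply transcribes the table; the function name totals_by_dataset is a hypothetical helper introduced for this sketch.

```python
from collections import defaultdict

# (dataset, category) -> number of problems, transcribed from Table 1.
SUBSET_SIZES = {
    ("MATH", "Mathematics competitions"): 262,
    ("GSM8K", "Grade School Math"): 1320,
    ("MMLU", "College Mathematics"): 100,
    ("MMLU", "Abstract Algebra"): 100,
    ("MMLU", "Formal Logic"): 126,
    ("MMLU", "High School Mathematics"): 270,
}


def totals_by_dataset(sizes):
    """Sum the per-category counts for each benchmark dataset."""
    totals = defaultdict(int)
    for (dataset, _category), count in sizes.items():
        totals[dataset] += count
    return dict(totals)


per_dataset = totals_by_dataset(SUBSET_SIZES)
print(per_dataset)
print("All problems:", sum(per_dataset.values()))
```

The four MMLU categories together contribute 596 problems, and the grand total matches the 2,178 problems reported in the text.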
3.2.1 MATH
MATH is a comprehensive benchmark dataset designed to evaluate mathematical reasoning across a
wide range of problem-solving scenarios [ 40]. Unlike conventional mathematics problem datasets,
which primarily assess algorithmic proficiency, MATH focuses on advanced reasoning and heuristic-
driven problem-solving.
Specifically, MATH is derived from high-level mathematics competitions, including the AMC
10, AMC 12, and AIME [ 40]. These problems are inherently more complex than standard K-
12 mathematics tasks, requiring the application of non-trivial problem-solving strategies, logical
inference, and domain-specific heuristics rather than direct formulaic computations. Meanwhile,
MATH is widely regarded as a difficult benchmark, with the LLMs evaluated in [40] achieving accuracy rates between 3.0%
and 6.9%. Despite these low overall scores, those LLMs demonstrate some degree of mathematical
competency, as evidenced by up to 15% accuracy on the easiest difficulty level [40] and by the ability to
generate step-by-step solutions that, while sometimes incorrect, remain coherent and contextually
relevant. To gauge the dataset’s difficulty, human performance was also considered. For instance, a
computer science Ph.D. student with no particular affinity for mathematics achieved approximately
40% accuracy. In addition, a three-time IMO (International Mathematical Olympiad) gold medalist
attained 90% accuracy. Thus, these findings indicate that MATH is challenging for both human solvers
and LLMs, making it an invaluable dataset for assessing and advancing mathematical reasoning
capabilities in both LLMs and human problem solvers.
3.2.2 GSM8K
GSM8K, introduced in 2021, is a benchmark dataset consisting of 8,500 high-quality grade school-
level math problems [ 16]. Designed to incorporate high linguistic diversity while relying on
fundamental mathematical concepts, it presents a unique challenge for state-of-the-art LLMs. Al-
though the underlying math is relatively simple, the diverse problem formulations create significant
hurdles, preventing many models from achieving consistently high accuracy. GSM8K is structured
into 7,500 training problems and 1,320 testing problems, all carefully crafted by expert human prob-
lem writers. The problems primarily involve elementary arithmetic operations and typically require 2
to 8 logical steps to reach a solution. This dataset is widely used to evaluate logical reasoning and
mathematical proficiency in LLMs and serves as a benchmark for various assessments, including
the LLM Leaderboard. Notably, while some GSM8K problems are conceptually straightforward,
they can still be challenging for even the most advanced LLMs, which often exhibit high variability
in responses when the same problem is presented in slightly different ways. This highlights the
ongoing difficulty of achieving robust mathematical reasoning in language models. In this study, we
selected the 1.32K-problem test set from GSM8K to systematically evaluate and compare the
mathematical reasoning capabilities of each LLM.
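For scoring GSM8K responses, a ground-truth value has to be read from each reference solution. In the public release of GSM8K, each reference solution ends with a line of the form "#### <final answer>". The sketch below shows one way to read that value; the helper name gsm8k_final_answer and the toy solution string are assumptions for illustration, and this is not necessarily how the ground truth was extracted in this study.

```python
def gsm8k_final_answer(solution: str) -> str:
    """Return the text after the last '####' marker, without thousands separators."""
    marker = "####"
    if marker not in solution:
        raise ValueError("no '####' marker found in solution")
    final = solution.rsplit(marker, 1)[1].strip()
    return final.replace(",", "")


# Toy solution written for this sketch (not an actual dataset item).
toy_solution = (
    "Each box holds 12 pencils, so 3 boxes hold 3 * 12 = 36 pencils.\n"
    "#### 36"
)
print(gsm8k_final_answer(toy_solution))
```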
3.2.3 MMLU
The MMLU dataset is a comprehensive benchmark comprising multiple-choice questions from a
diverse range of academic disciplines. Spanning 57 distinct tasks, it covers subjects in the humanities,
social sciences, hard sciences, and mathematical reasoning, reflecting the breadth of knowledge
that is essential for various fields of study [ 39]. MMLU’s questions were manually curated by
graduate and undergraduate students from publicly available sources, including practice questions
from standardized exams such as the Graduate Record Examination (GRE) and the United States
Medical Licensing Examination (USMLE). Additionally, it features questions designed for under-
graduate courses and readers of Oxford University Press books. Notably, its mathematical reasoning
component encompasses multiple subfields, categorized into areas such as "Abstract Algebra,"
"College Mathematics," "Formal Logic," and "High School Mathematics." For instance, questions
in the "Abstract Algebra" category originate from professional mathematical practice, while the
"High School Mathematics" category includes problems akin to those found in standard high school
exams. In total, MMLU comprises 15,908 questions, divided into three subsets: a). A few-shot
development set containing five questions per subject. b). A validation set with 1,540 questions, used
for selecting hyperparameters. c). A test set with 14,079 questions, ensuring a rigorous evaluation.
Each subject within MMLU includes a minimum of 100 test questions, making it a more extensive
and challenging assessment than most standard exams. This broad and structured dataset serves as a
critical benchmark for evaluating the reasoning and problem-solving capabilities of LLMs.
3.3 Evaluate LLMs in Mathematical Reasoning
As discussed previously, multiple LLMs have demonstrated remarkable abilities in mathematical
reasoning [ 105]. To systematically evaluate the mathematical reasoning capabilities of various LLMs,
we conducted experiments using API-based access to a diverse set of models. The selected models
include Google’s Gemini-2.0-flash, OpenAI’s GPT-4o, o1-mini, o1, and o3-mini, and DeepSeek’s
DeepSeek-1.5B and DeepSeek-R1. These models were accessed through their respective API services
using secure API keys from Google, OpenAI, DeepSeek, and OpenRouter.
Figure 2: Templates used to evaluate mathematical reasoning in LLMs. The multiple-choice format
(a) ensures structured decision-making with predefined options, while the open-ended format (b)
assesses free-form problem-solving capabilities.
3.3.1 Prompting Strategy
The effectiveness of LLMs in mathematical reasoning is highly dependent on the design of input
prompts [ 5,10,25]. To ensure a fair and reproducible evaluation, we employed a structured prompt-
ing strategy focusing on zero-shot prompting initially. Our methodology was designed to assess
models’ inherent problem-solving abilities without additional external guidance [ 87]. Overall, for all
evaluations, we incorporated a zero-shot prompting approach, where models were presented with
mathematical problems without any prior exemplars or demonstrations. Zero-shot prompting is
particularly valuable in assessing a model’s ability to generalize mathematical knowledge from its
pretraining corpus and apply learned concepts to novel problems [ 87,98]. By not providing explicit
examples, this method evaluates how well a model can infer the appropriate solution methodology
based solely on its prior learning.
In detail, each prompt, as shown in Figure 2, explicitly instructs models to reason through
the problem before selecting an answer enclosed in the \boxed{} format. While explicit CoT
prompting was not enforced, the instruction to "follow the format and continue your reasoning until
the final boxed answer" implicitly encouraged stepwise reasoning, allowing us to assess the extent to
which models could self-initiate logical reasoning without explicit CoT guidance [ 73]. Empirical
observations revealed that some models naturally generated intermediate reasoning steps before
selecting an answer, demonstrating emergent reasoning capabilities, while others provided the final
answer directly, particularly those with weaker reasoning abilities. Variations in reasoning depth were
observed across different models, reflecting differences in their internal problem-solving strategies.
To further investigate the impact of explicit stepwise prompting, we conducted an ablation study
by modifying the prompt to include a direct instruction: "Solve this problem step by step before
providing the final answer." This explicit CoT version consistently led to more structured and detailed
explanations across all models, confirming that models respond differently to implicit vs. explicit
reasoning cues [ 98]. The comparative analysis of both prompting strategies provided valuable insights
into how structured guidance influences model reasoning and accuracy.
For multiple-choice questions, we provided the set of answer choices and instructed the models to
select the correct response from the given options, ensuring a structured decision-making process [ 68].
The prompt included clear formatting guidelines to standardize the response format across different
models. In contrast, for open-ended problems that required free-form numeric or symbolic responses,
we presented the question directly without predefined answer choices, allowing the models to
generate responses based on their internal reasoning capabilities [ 81]. This approach ensured that
multiple-choice tasks evaluated the models’ ability to differentiate between structured options, while
open-ended problems tested their capability to derive solutions independently.
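As an illustration of the prompt variants described above, the following minimal sketch assembles a zero-shot prompt for multiple-choice and open-ended questions, with an optional explicit CoT instruction. The function name `build_prompt`, the option-lettering scheme, and the overall template layout are assumptions made for this example; only the two quoted instruction sentences come from the text, and the prompt actually used in the study is the one shown in Figure 2.

```python
from typing import Optional, Sequence

# Instruction sentences quoted in the text; the surrounding template is assumed.
IMPLICIT_INSTRUCTION = (
    "Follow the format and continue your reasoning until the final boxed answer."
)
EXPLICIT_COT_INSTRUCTION = (
    "Solve this problem step by step before providing the final answer."
)


def build_prompt(
    question: str,
    choices: Optional[Sequence[str]] = None,
    explicit_cot: bool = False,
) -> str:
    """Assemble a zero-shot prompt (hypothetical layout, no worked examples)."""
    lines = [f"Question: {question}"]
    if choices:
        # Multiple-choice: list the options with letters A, B, C, ...
        for index, choice in enumerate(choices):
            lines.append(f"{chr(ord('A') + index)}. {choice}")
        lines.append("Select the correct option.")
    if explicit_cot:
        # Ablation variant: add the direct step-by-step instruction.
        lines.append(EXPLICIT_COT_INSTRUCTION)
    lines.append(IMPLICIT_INSTRUCTION)
    lines.append("Give the final answer as \\boxed{...}.")
    return "\n".join(lines)
```

Calling `build_prompt(question)` yields the open-ended, implicit-reasoning variant, while passing `choices` and `explicit_cot=True` yields the multiple-choice prompt used in an ablation-style comparison.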
3.3.2 Sampling Settings
To ensure consistency and reliability in model responses, we employed a structured sampling strategy
using temperature, top-k sampling, and top-p (nucleus) sampling [ 11,70], which control randomness
and diversity in token selection. To ensure a standardized evaluation across different LLMs, we
utilized API access through OpenRouter for DeepSeek models and OpenAI API for GPT-4o, o1-mini,
o1, and o3-mini, while maintaining their respective default generation settings. All models, including
GPT-4o and the o-series models, Gemini, and DeepSeek, use a default temperature of T = 1.0 and an
unrestricted probability sampling strategy (top-p = 1.0) [ 70], allowing the model to consider the full
probability distribution for token selection.
Moreover, to ensure smooth execution of API requests while avoiding rate limits imposed by different
providers, we implemented an adaptive time delay between successive API calls. This approach
prevented throttling issues while maintaining efficiency in large-scale evaluation. All queries were
formatted in a standardized manner to ensure fair comparisons across models.
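The adaptive delay between API calls is not specified further in the text. One common way to realize it is exponential backoff with jitter, sketched below. The names `call_with_adaptive_delay` and `RateLimitError`, as well as the delay parameters, are hypothetical; a real client would catch whatever rate-limit exception its provider library raises.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit (HTTP 429) error."""


def call_with_adaptive_delay(
    send_request: Callable[[], T],
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send_request, waiting longer after each rate-limit failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return send_request()
        except RateLimitError:
            if attempt == max_attempts:
                raise
            # Sleep for the current delay plus a little jitter, then double it.
            sleep(delay + random.uniform(0.0, 0.1 * delay))
            delay = min(2.0 * delay, max_delay)
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` argument is injectable so that the waiting behavior can be checked without real delays.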
3.3.3 Evaluation Metric
To assess the correctness of model-generated responses in mathematical tasks, we employ the Exact
Match metric as our primary evaluation method [ 78]. Given that each question in our benchmark
dataset has a single correct answer and the model produces a response per query, Exact Match ensures
a rigorous evaluation by comparing the extracted answer to the ground truth.
Unlike other evaluation metrics such as Pass@k [ 13], which credits a problem as solved if any of k sampled responses is correct,
exact match is particularly suited for our setup as each question has only one correct answer, making
alternative correctness measures unnecessary. The model produces a single response per query,
eliminating the need for multi-sample evaluation, and we extract the final answer by parsing the
boxed notation (\boxed{}), ensuring a direct one-to-one comparison with the ground truth. Since
mathematical evaluations require strict correctness, even minor deviations, such as additional decimal
places (5.00 vs. 5) or formatting inconsistencies, must be addressed through preprocessing before
applying the metric. To ensure fairness, we apply preprocessing steps to standardize model-generated
answers by extracting the boxed answer, trimming extraneous characters, normalizing numerical
values (e.g., 1/2 to 0.5) [ 84], and handling decimal precision by rounding values where applicable. This
ensures that the Exact Match metric strictly evaluates mathematical correctness while minimizing
penalization due to trivial formatting differences.
Let $\hat{y}_i$ represent the extracted answer from the model’s output for the $i$-th question, and let $y_i$ be the
corresponding ground truth answer. The Exact Match accuracy is computed as:
$$\text{Exact Match (\%)} = \frac{\sum_{i=1}^{N} \mathbb{1}\left(\text{normalize}(\hat{y}_i) = \text{normalize}(y_i)\right)}{N} \times 100 \tag{1}$$
where:
• $N$ is the total number of evaluated questions.
• $\mathbb{1}(\cdot)$ is the indicator function, returning 1 if the extracted model response matches the ground
truth after preprocessing, and 0 otherwise.
• $\text{normalize}(\cdot)$ is a function that standardizes formatting, trims spaces, and normalizes numerical values.
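The extraction and scoring steps behind Equation (1) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names and the specific normalization rules (removing spaces and degree marks, mapping `\frac`/`\dfrac` and decimals to exact rationals rather than rounded decimals) are assumptions chosen to keep the example small.

```python
from fractions import Fraction
import re
from typing import Optional, Sequence


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    index = start + len("\\boxed{")
    depth = 1
    chars = []
    while index < len(text):
        ch = text[index]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        index += 1
    return None  # unbalanced braces, e.g. a truncated response


_FRACTION = re.compile(r"^\\d?frac\{(-?\d+)\}\{(-?\d+)\}$")


def normalize(answer: str) -> str:
    """Assumed normalization: drop spaces and degree marks, unify numbers."""
    cleaned = answer.replace("^\\circ", "").replace(" ", "").strip("$")
    match = _FRACTION.match(cleaned)
    try:
        if match:
            value = Fraction(int(match.group(1)), int(match.group(2)))
        else:
            value = Fraction(cleaned)  # accepts "5", "5.00", "1/2"
    except (ValueError, ZeroDivisionError):
        return cleaned  # symbolic answers are compared as cleaned strings
    return str(value)


def exact_match(
    predictions: Sequence[Optional[str]], references: Sequence[str]
) -> float:
    """Exact Match (%) as in Equation (1); a missing answer counts as wrong."""
    hits = sum(
        pred is not None and normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * hits / len(references)
```

Under these rules, `5.00` and `5` compare equal, as do `\dfrac{1}{2}` and `0.5`, while symbolic answers such as `p - q` are compared as whitespace-free strings.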
Furthermore, error-handling mechanisms were incorporated to detect hallucinated responses [ 71,79,
100], incomplete computations, or missing boxed outputs, with automated validation and manual
review of flagged cases ensuring robust evaluation. Additionally, since some model responses
exceeded token limits and resulted in truncated outputs, detection mechanisms were implemented
to flag incomplete solutions, allowing either exclusion from evaluation or manual review when
necessary.
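A simple version of the flagging step might look like the sketch below. It assumes an OpenAI-style chat completion response in which `finish_reason` equals `"length"` when generation stops at the token limit; the function name `flag_response` and the returned labels are hypothetical, and hallucination detection is outside the scope of this example.

```python
from typing import Optional


def flag_response(text: str, finish_reason: Optional[str] = None) -> Optional[str]:
    """Return a reason for manual review, or None if the response looks complete."""
    if finish_reason == "length":
        return "truncated"  # generation stopped at the token limit
    start = text.rfind("\\boxed{")
    if start == -1:
        return "missing_boxed"
    # More opening than closing braces after \boxed{ suggests a cut-off answer.
    tail = text[start:]
    if tail.count("{") > tail.count("}"):
        return "incomplete_boxed"
    return None
```

Responses that return a label can then be excluded or routed to manual review, as described above.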
Table 2 presents the evaluation of model-generated responses against ground truth values. The
Reference Answer represents the ground truth, while the Parsed Model Answer contains the
extracted response from the model’s output. The Correctness column indicates whether the model’s
response exactly matches the ground truth after parsing, ensuring a strict evaluation of mathematical
accuracy. Discrepancies, such as format mismatches or numerical errors, directly impact correctness
scores, highlighting the model’s precision in solving mathematical problems.

Table 2: Comparison of ground truth and model-generated parsed answers

| Reference answer | Parsed model answer | Correctness |
|---|---|---|
| p - q | ['p - q'] | True |
| 90^\circ | ['90'] | True |
| 4 | ['5'] | False |
| \dfrac{3}{56} | ['dfrac356'] | True |
| 6 - 5i | ['6 - 5i'] | True |
By integrating automated parsing, formatting corrections, and evaluation metrics, our post-processing
pipeline ensured reliable and unbiased assessment of mathematical reasoning capabilities across
different LLM architectures.
4 Results
In this section, we conducted a quantitative and qualitative evaluation of the state-of-the-art LLMs
mentioned earlier, using multiple public benchmark datasets for mathematical reasoning. As an initial
assessment, we selected a subset of three widely recognized datasets [ 16,39,62], as discussed in
Section 3, to systematically evaluate the performance of various cutting-edge models. Overall, our
evaluation includes DeepSeek-R1 [ 56], DeepSeek’s most advanced reasoning model, along with
its distilled variant DeepSeek-1.5B [ 37], which is derived from the R1 model. Additionally, we
assessed Google’s latest Gemini 2.0 Flash [ 45] and four OpenAI models: GPT-4o, o1-mini, o1,
and o3-mini [ 6,44,103]. Notably, while all these models are designed with reasoning capabilities,
GPT-4o is the only one not explicitly optimized for providing reasoning-based CoT responses. These
selected LLMs represent the most advanced AI systems available today, continuously evolving and
competing across various mathematical reasoning benchmarks.
4.1 Evaluation of LLMs using MATH
As illustrated in Table 3, the performance of various LLMs on the MATH benchmark dataset [ 40]
exhibits significant variability, underscoring differences in their mathematical reasoning capabilities.
In particular, GPT-4o [ 44] exhibited a significantly lower accuracy of 64.88%, lagging behind its peer
models. This limitation may be attributed to deficiencies in complex reasoning capabilities or the
inherent difficulty of competition-level mathematical problems in the dataset. In contrast, OpenAI’s
o1 model achieved the highest accuracy (93.12%), closely followed by DeepSeek-R1 (90.45%), with
both models surpassing the 90% correctness threshold—a benchmark that no other model reached.
Other models demonstrated varying levels of performance: Gemini 2.0 Flash, 85.87%; o1-mini,
88.93%; o3-mini, 82.06%. However, a key observation is that DeepSeek-1.5B, the distilled variant of
DeepSeek-R1, performed significantly worse (65.64%), reinforcing the hypothesis that model distillation can
impair mathematical reasoning capabilities by compromising critical reasoning pathways in favor of
improved computational efficiency.
Additionally, the quantitative comparisons across multiple LLMs, as presented in Figure 3, reveal
consistent performance patterns across OpenAI’s models, while DeepSeek-R1 consistently outper-
formed its distilled variant. These findings highlight the trade-offs between model size and reasoning
ability, as demonstrated by Fu et al. [ 33], underscoring the need for optimization techniques that
enhance computational efficiency without sacrificing mathematical problem-solving proficiency.
Furthermore, Figure 4 provides an in-depth comparison of the prompt responses and mathematical
reasoning of DeepSeek-R1, o1, and Gemini 2.0 Flash. Interestingly, Gemini 2.0 Flash failed to
provide a correct answer, whereas DeepSeek-R1 and o1 demonstrated more structured and logically
coherent reasoning processes. This further illustrates the varying degrees of mathematical reasoning
proficiency across models [ 103] and the importance of structured reasoning frameworks in achieving
high accuracy.
Table 3: Performance comparison of various models on MATH in terms of accuracy.

| Publisher | Model Name | Accuracy (%) |
|---|---|---|
| Google | Gemini 2.0 Flash | 85.87 |
| OpenAI | GPT-4o | 64.88 |
| OpenAI | o1-mini | 88.93 |
| OpenAI | o1 | 93.12 |
| OpenAI | o3-mini | 82.06 |
| DeepSeek | DeepSeek-1.5B | 65.64 |
| DeepSeek | DeepSeek-R1 | 90.45 |
Figure 3: Performance comparison of various LLMs in the MATH benchmark. The accuracy of
each model is displayed above the corresponding bar, highlighting differences in mathematical
problem-solving capabilities across model publishers.
Moreover, Figure 4 illustrates that DeepSeek-R1 and o1 followed nearly identical reasoning steps,
correctly assigning coordinates, computing midpoints and centroids, and using determinant-based
area calculations to arrive at the correct answer, 8. However, Gemini 2.0 Flash deviated from the
correct reasoning path by making errors in its calculations. Specifically, despite correctly identifying
some intermediate relationships, its handling of area scaling and altitude ratio calculations led to an
incorrect final answer of 4 instead of the correct answer of 8. This demonstrates a key limitation in
Gemini 2.0 Flash’s numerical precision and reasoning accuracy for complex geometric problems
compared with its peers.
Notably, the strong performance of DeepSeek-R1 in Figure 4, which relies heavily on RL method-
ologies [ 29], highlights the effectiveness of iterative, feedback-driven training paradigms in tackling
mathematically intricate tasks. A crucial component in its training is Group Relative Policy Op-
timization (GRPO) [ 76], a more efficient RL approach than traditional methods. GRPO enables
DeepSeek-R1 to excel in complex domains such as mathematics, science, and coding, reinforcing
the potential of advanced learning techniques in enhancing model performance [ 37]. These find-
ings underscore the importance of advanced reasoning mechanisms in solving competition-level
mathematical problems, suggesting that RL frameworks, exemplified by DeepSeek-R1, represent a
promising direction for enhancing the mathematical reasoning capabilities of next-generation LLMs.
4.2 Evaluation of LLMs using GSM8K
As discussed in Section 3, the GSM8K benchmark dataset [ 16] comprises approximately 1.32k test
samples designed to assess math reasoning in LLMs. The zero-shot performance of various
models is summarized in Table 4, presenting valuable insights into their mathematical reasoning
capabilities.
Figure 4: Comparative evaluation of different LLMs on a geometry problem from the MATH
benchmark. The figure showcases step-by-step reasoning from three different models and highlights
correctness. The question is shown in red, while the extracted answers are boxed. Green-highlighted
sections indicate correct responses, whereas red-highlighted sections denote incorrect outputs. This
comparison illustrates differences in mathematical reasoning and accuracy across models.

In particular, the results revealed that DeepSeek-R1 [ 37] performed on par with OpenAI’s o1, with
both models achieving the highest accuracy of 96.13%, surpassing all other peers in this evaluation.
Additionally, several other models demonstrated strong performance: Gemini 2.0 Flash, 95.53%;
GPT-4o, 95.98%; o1-mini, 95.91%; o3-mini, 95.83%. However, a notable performance gap was
observed in DeepSeek-R1’s distilled variant, which achieved only 81.12% accuracy, a significant
decline compared to its full-scale counterpart. This performance disparity underscores the trade-offs
associated with model distillation, where reducing the number of parameters for computational
efficiency may inadvertently weaken logical reasoning and problem-solving capabilities. While
distillation enables smaller, faster models, our findings suggest that excessive parameter reduction can
compromise mathematical reasoning proficiency, potentially limiting their effectiveness in complex
problem-solving tasks. Similarly, Figure 5 demonstrates that DeepSeek-R1 and o1 outperformed
other peer LLMs. Notably, there are no significant differences across all LLMs released via OpenAI.
These results emphasize the delicate balance between model size and reasoning ability, highlight-
ing the need for advancing distillation techniques that preserve core reasoning structures while
maintaining computational efficiency.
Table 4: Performance comparison of various models on GSM8K in terms of accuracy.

| Publisher | Model Name | Accuracy (%) |
|---|---|---|
| Google | Gemini 2.0 Flash | 95.53 |
| OpenAI | GPT-4o | 95.98 |
| OpenAI | o1-mini | 95.91 |
| OpenAI | o1 | 96.13 |
| OpenAI | o3-mini | 95.83 |
| DeepSeek | DeepSeek-1.5B | 81.12 |
| DeepSeek | DeepSeek-R1 | 96.13 |
Moreover, the qualitative comparisons across multiple LLMs, as illustrated in Figure 6, reveal a
noteworthy disparity in reasoning capabilities among OpenAI’s o1, DeepSeek-R1, and Google’s
Gemini 2.0 Flash. The analysis demonstrates that DeepSeek-R1 exhibits superior performance in
mathematical reasoning compared to its counterparts. Specifically, DeepSeek-R1 generated accurate
solutions and also articulated a systematic, step-by-step process to arrive at its conclusions. In contrast,
both OpenAI’s o1 and Gemini 2.0 Flash produced less comprehensive pathways, which resulted in
erroneous answers. These findings, as shown in Figure 6, underscore DeepSeek-R1’s robust analytical
proficiency relative to contemporary models, highlighting its advanced problem-solving capabilities
in complex mathematical tasks.
Figure 5: Performance comparison of various LLMs in the GSM8K benchmark. The accuracy
of each model is displayed above the corresponding bar, highlighting differences in mathematical
problem-solving capabilities across model publishers.
Figure 6: Comparative evaluation of different LLMs on an Arithmetic problem from the GSM8K
benchmark. The figure showcases step-by-step reasoning from three different models and highlights
correctness. The question is shown in red, while the extracted answers are boxed. Green-highlighted
sections indicate correct responses, whereas red-highlighted sections denote incorrect outputs. This
comparison illustrates differences in mathematical reasoning and accuracy across models.
4.3 Evaluation of LLMs using MMLU
Leveraging the structured categorization of MMLU benchmark dataset [ 39], we conduct an in-depth
comparative analysis of various LLMs across different mathematical domains.
First, as shown in Table 5, DeepSeek-R1 outperformed all peer models, achieving the highest
accuracy of 97.62% in Formal Logic. Other models, such as o1, o3-mini, and Gemini 2.0 Flash,
also demonstrated strong performance, all surpassing 90% accuracy. However, DeepSeek-1.5B (47.62%),
the distilled variant of DeepSeek-R1, and o1-mini (78.57%) failed to reach this threshold, reinforcing
the hypothesis that model distillation can degrade logical reasoning capabilities by limiting access to
essential computational pathways.
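Table 5: Performance comparison of various models on MMLU in terms of accuracy.

| Problem Type | Model Name | Accuracy (%) |
|---|---|---|
| Abstract Algebra | Gemini 2.0 Flash | 87.00 |
| Abstract Algebra | GPT-4o | 85.00 |
| Abstract Algebra | o1-mini | 88.00 |
| Abstract Algebra | o1 | 94.00 |
| Abstract Algebra | o3-mini | 96.00 |
| Abstract Algebra | DeepSeek-1.5B | 61.00 |
| Abstract Algebra | DeepSeek-R1 | 94.00 |
| College Mathematics | Gemini 2.0 Flash | 96.00 |
| College Mathematics | GPT-4o | 79.00 |
| College Mathematics | o1-mini | 76.00 |
| College Mathematics | o1 | 99.00 |
| College Mathematics | o3-mini | 99.00 |
| College Mathematics | DeepSeek-1.5B | 65.00 |
| College Mathematics | DeepSeek-R1 | 96.00 |
| Formal Logic | Gemini 2.0 Flash | 90.48 |
| Formal Logic | GPT-4o | 82.54 |
| Formal Logic | o1-mini | 78.57 |
| Formal Logic | o1 | 95.24 |
| Formal Logic | o3-mini | 96.03 |
| Formal Logic | DeepSeek-1.5B | 47.62 |
| Formal Logic | DeepSeek-R1 | 97.62 |
| High School Mathematics | Gemini 2.0 Flash | 97.77 |
| High School Mathematics | GPT-4o | 88.15 |
| High School Mathematics | o1-mini | 98.89 |
| High School Mathematics | o1 | 99.26 |
| High School Mathematics | o3-mini | 98.52 |
| High School Mathematics | DeepSeek-1.5B | 86.30 |
| High School Mathematics | DeepSeek-R1 | 97.04 |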
In addition, in Abstract Algebra, o3-mini outperformed its peers, achieving the highest accuracy of
96.00%. Abstract algebra is a particularly challenging subject, even for math graduates, as it demands
a fundamental shift in mathematical thinking from traditional algebraic manipulations to analyzing
complex algebraic structures such as groups, rings, and fields. Mastery of this subject requires a deep
understanding of abstract concepts, rigorous proof construction, and advanced logical reasoning.
The results also suggest that OpenAI’s models, particularly o3-mini, excel in abstract reasoning
and symbolic manipulation, enabling them to reconstruct and process complex algebraic structures
effectively. In contrast, the distilled variant of DeepSeek-R1 performed significantly worse, achieving
only 61% accuracy, further validating the detrimental impact of parameter reduction on logical
inference and abstract reasoning.
Furthermore, in College Mathematics, both o1 and o3-mini achieved an exceptional accuracy of 99.00%,
surpassing all other models. This outstanding performance can be attributed
to two key factors: a). Superior reasoning capabilities: OpenAI’s o1 and o3-mini models are designed
with enhanced mathematical reasoning frameworks, allowing them to efficiently process and solve
complex college-level problems. b). Diversity of training data: The extensive dataset exposure during
training may have equipped o1 and o3-mini with a broader mathematical foundation, leading to
their unparalleled performance in advanced mathematical reasoning tasks. Meanwhile, Gemini 2.0
Flash also demonstrated strong performance, achieving a comparable accuracy of 96%, reinforcing
its competence in college-level mathematics. However, distillation had a clear negative impact on
performance, as evidenced by: 1). o1-mini, which achieved only 76.00% accuracy, a significant drop
from its full-scale counterpart. 2). DeepSeek-1.5B, which attained just 65.00% accuracy, further
emphasizing the trade-off between model size and reasoning ability.
This performance gap highlights the increased complexity of college mathematics compared to high
school-level math. College-level coursework usually demands: 1). A deeper understanding of abstract
mathematical concepts; 2). Stronger logical reasoning abilities; 3). Proficiency in complex proofs
and real-world applications. Notably, the faster pace and heavier workload of college mathematics
compared to high school math further contribute to its difficulty [7].
Moreover, given the relative simplicity of High School Mathematics, o1 achieved the highest accuracy
at 99.26%, slightly improving upon its results in College Mathematics. Other peer models also
performed well: DeepSeek-R1, 97.04%; o1-mini, 98.89%; o3-mini, 98.52%; Gemini 2.0 Flash,
97.77%. GPT-4o showed a lower accuracy (88.15%) compared with its peers in this domain too.
Once again, the distilled variant of DeepSeek-R1 struggled, achieving only 86.30% accuracy, further
demonstrating the performance degradation caused by model compression.
Similarly, Figure 7 presents a comprehensive comparative analysis of various LLMs across four key
mathematical domains: Abstract Algebra, College Mathematics, Formal Logic, and High School
Mathematics. Notably, OpenAI’s models demonstrate superior performance in abstract reasoning,
particularly excelling in Abstract Algebra, where symbolic manipulation and deep theoretical under-
standing are crucial. This suggests that OpenAI’s models have strong capabilities in reconstructing
and processing abstract mathematical structures.
Significantly, DeepSeek-R1 outperformed its peers in Formal Logic, showcasing its exceptional
logical reasoning skills and ability to handle complex rule-based problem-solving more effectively
than other models. These findings further highlight the specialized strengths of different LLMs
and underscore the need for targeted model improvements to optimize performance across various
mathematical domains.
Figure 7: Comparison of LLM models across different problem types in the MMLU benchmark
dataset.
Figure 8 presents a comparative evaluation of different LLMs on a Formal Logic problem from the
MMLU benchmark dataset [ 39]. The task involves identifying the correct conclusion from a given
argument, with four answer choices labeled A, B, C, and D. The correct reference answer is option D,
requiring models to distinguish between supporting premises and the actual conclusion. DeepSeek-
R1 successfully identifies option D as the correct conclusion, offering a structured explanation by
recognizing that the phrase "Because of this" serves as a linguistic connector rather than a part of the
conclusion itself. In contrast, o1 and Gemini 2.0 Flash select option C, misinterpreting the logical
structure of the argument and failing to differentiate between an explanatory transition and the central
claim. The comparative analysis in Figure 8 further solidifies DeepSeek-R1’s exceptional logical
reasoning skills and ability to handle complex rule-based problem-solving tasks.

Figure 8: Comparative evaluation of different LLMs on a Formal Logic problem from the MMLU
benchmark. The figure showcases step-by-step reasoning from three different models and highlights
correctness. The question is shown in red, while the extracted answers are boxed. Green-highlighted
sections indicate correct responses, whereas red-highlighted sections denote incorrect outputs. This
comparison illustrates differences in mathematical reasoning and accuracy across models.
Figure 9 illustrates a comparative evaluation of different LLMs in solving a problem related to
abstract algebra, sourced from the MMLU benchmark dataset. The problem required reasoning about
the number of elements of a given order in a group, specifically evaluating whether Statement 1
and Statement 2 hold true under group-theoretic principles. The task involved selecting the correct
truth-value combination from four possible answer choices: A (True, True), B (False, False), C (True,
False), and D (False, True). The reference answer, indicated in Figure 9, is option A (True, True).
DeepSeek-R1 correctly identified both statements as true, providing a detailed justification using the
properties of cyclic subgroups, Euler’s totient function, and disjoint subgroup structures. Similarly,
o1 also arrived at the correct answer, correctly recognizing that exceeding 8 elements of order 15
necessitates at least 16 elements due to the existence of multiple distinct subgroups. However, Gemini
2.0 Flash incorrectly classified Statement 2 as false, demonstrating a reasoning flaw in its approach
to subgroup structures and element counting. This comparison in Figure 9 highlights DeepSeek-R1
and o1’s superior reasoning abilities in abstract algebra, as both models successfully applied cyclic
subgroup properties and disjoint set reasoning to reach the correct conclusion.
In summary, DeepSeek-R1 dominated in Formal Logic, while o3-mini excelled in Abstract Alge-
bra. o1 and o3-mini outperformed all models in College Mathematics, likely due to their advanced
reasoning frameworks and diverse training data. Model distillation significantly reduced perfor-
mance as shown in [ 37], particularly in complex reasoning tasks like Abstract Algebra and College
Mathematics. High school-level mathematics is relatively easier for LLMs, with multiple models
surpassing 97% accuracy. These findings emphasize the trade-offs between model size, reasoning ca-
pability, and computational efficiency, highlighting the need for optimization strategies that maintain
problem-solving proficiency while improving scalability.
Figure 9: Comparative evaluation of different LLMs on an Abstract Algebra problem from the MMLU
benchmark. The figure showcases step-by-step reasoning from three different models and highlights
correctness. The question is shown in red, while the extracted answers are boxed. Green-highlighted
sections indicate correct responses, whereas red-highlighted sections denote incorrect outputs. This
comparison illustrates differences in mathematical reasoning and accuracy across models.
4.4 Statistical Analyses
In this section, Table 6 presents a comparative evaluation of LLMs based on their mean accuracy and
standard deviation across all three benchmark datasets. Among the evaluated models, OpenAI’s o1
achieves the highest mean accuracy (96.13%), followed closely by DeepSeek-R1 (95.21%) and o3-
mini (94.57%). While Gemini 2.0 Flash attains a relatively high accuracy of 92.11%, it underperforms
relative to the top-tier models.
In terms of stability, as quantified by standard deviation, both o1 (2.32) and DeepSeek-R1 (2.41)
exhibit the most consistent performance. We believe that DeepSeek-R1’s high consistency can
be attributed to its GRPO framework [ 37], which utilizes a multi-agent collaborative architecture that
effectively integrates diverse reasoning pathways. This synergy helps mitigate task-specific biases
and enhances consistency by reducing variability. The structured approach allows the model to
dynamically prioritize high-confidence solutions while iteratively refining less certain outputs, thereby
contributing to its robustness across heterogeneous datasets.
By contrast, o1-mini demonstrated significantly higher variability (standard deviation: 8.31), sug-
gesting sensitivity to dataset-specific characteristics. Gemini 2.0 Flash (4.59) and o3-mini (5.74)
exhibited moderate fluctuations in performance, likely due to limitations in their single-agent rea-
soning paradigms. GPT-4o also exhibited a significantly larger standard deviation (10.42) and was
one of the two most inconsistent models alongside DeepSeek-1.5B. Meanwhile, DeepSeek-1.5B’s
considerably lower mean accuracy (67.78%) and the highest standard deviation (12.82) reflect sub-
stantial variability in its responses, further cementing the detrimental impact of parameter reduction
on mathematical reasoning.
Figure 10 presents the performance statistics of different LLMs across three datasets. The bar
chart in Figure 10 illustrates the mean accuracy (%) of each model, with error bars representing
the standard deviation of the accuracy percentages across datasets. The standard deviation bars
indicate the variability in the performance of each model. Notably, o1 from OpenAI demonstrates the
highest mean accuracy with a low variance, whereas DeepSeek-1.5B exhibits the largest variability,
suggesting inconsistency in its accuracy across datasets. This visualization underscores that o1 and
DeepSeek-R1 excel not only in accuracy but also in reliability for mathematical problem-solving
tasks.

Table 6: Performance statistics of different LLM models across all datasets, including mean accuracy
(in %) and standard deviation.

| LLM Models | Gemini 2.0 Flash | GPT-4o | o1-mini | o1 | o3-mini | DeepSeek-1.5B | DeepSeek-R1 |
|---|---|---|---|---|---|---|---|
| Mean (%) | 92.11 | 82.59 | 87.72 | 96.13 | 94.57 | 67.78 | 95.21 |
| Standard Deviation | 4.59 | 10.42 | 8.31 | 2.32 | 5.74 | 12.82 | 2.41 |

Figure 10: Performance comparison of five LLMs across multiple datasets, grouped by publisher.
The bars represent the mean accuracy (%), while the error bars indicate the standard deviation across
three benchmark datasets.
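The summary statistics can be recomputed from the per-benchmark accuracies. The sketch below assumes that each model's mean and standard deviation are taken over six scores (MATH, GSM8K, and the four MMLU categories) and that the population standard deviation is used. This is an inference from the reported numbers rather than a procedure stated in the text: under these assumptions the recomputed means agree with Table 6 for all seven models, and the standard deviations agree for all models except GPT-4o, whose reported 10.42 is instead close to the sample standard deviation of the same six scores.

```python
from statistics import fmean, pstdev

# Accuracy (%) per model, in the order:
# MATH, GSM8K, Abstract Algebra, College Math, Formal Logic, High School Math
# (values copied from Tables 3, 4, and 5).
SCORES = {
    "Gemini 2.0 Flash": [85.87, 95.53, 87.00, 96.00, 90.48, 97.77],
    "GPT-4o": [64.88, 95.98, 85.00, 79.00, 82.54, 88.15],
    "o1-mini": [88.93, 95.91, 88.00, 76.00, 78.57, 98.89],
    "o1": [93.12, 96.13, 94.00, 99.00, 95.24, 99.26],
    "o3-mini": [82.06, 95.83, 96.00, 99.00, 96.03, 98.52],
    "DeepSeek-1.5B": [65.64, 81.12, 61.00, 65.00, 47.62, 86.30],
    "DeepSeek-R1": [90.45, 96.13, 94.00, 96.00, 97.62, 97.04],
}


def summarize(scores):
    """Return {model: (mean, population standard deviation)}."""
    return {name: (fmean(vals), pstdev(vals)) for name, vals in scores.items()}


for model, (mean, std) in summarize(SCORES).items():
    print(f"{model:<18} mean={mean:6.2f}  std={std:5.2f}")
```

Swapping `pstdev` for `statistics.stdev` gives the sample standard deviation, which is the quantity to compare against when checking the GPT-4o entry.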
4.5 Response time analysis
Efficient response times are crucial in the deployment of LLMs, as they directly influence user
experience and the practicality of real-time applications. Factors affecting inference speed include
model size, computational complexity, hardware capabilities, and optimization techniques.
Figure 11: Latency Comparison across different LLMs
Notably, Figure 11 presents a comparative analysis of the average response time per query across
various LLMs. The DeepSeek-R1 model exhibits the highest latency, averaging 81.0 seconds per
response, making it significantly slower than all other models. Among
OpenAI’s models, o1 demonstrates a high response time of 15.0 seconds, likely due to its complex
CoT process. GPT-4o also exhibits relatively high latency at 11.5 seconds, making it one of the three
models that exceed the 10-second latency threshold. In contrast, o1-mini and o3-mini achieved
faster inference speeds of 6.8 seconds and 5.3 seconds, respectively, outperforming the other two
OpenAI models. DeepSeek-1.5B operates at 5.0 seconds, slightly outperforming OpenAI’s o3-mini
model. Notably, Gemini 2.0 Flash exhibits the fastest response time at 4.2 seconds, making it the
most efficient model in terms of inference speed.
Importantly, understanding these response times is essential for selecting appropriate models based
on application requirements, balancing the trade-offs between latency and performance.
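To make the latency versus accuracy trade-off concrete, the following sketch filters models by a latency budget and a minimum mean accuracy, using the response times reported above and the mean accuracies from Table 6. The function name `models_within_budget` and the example thresholds are hypothetical and serve only to illustrate the selection logic.

```python
# Average response time per query in seconds, as reported in the text.
LATENCY_S = {
    "DeepSeek-R1": 81.0,
    "o1": 15.0,
    "GPT-4o": 11.5,
    "o1-mini": 6.8,
    "o3-mini": 5.3,
    "DeepSeek-1.5B": 5.0,
    "Gemini 2.0 Flash": 4.2,
}

# Mean accuracy (%) across benchmarks, from Table 6.
MEAN_ACCURACY = {
    "DeepSeek-R1": 95.21,
    "o1": 96.13,
    "GPT-4o": 82.59,
    "o1-mini": 87.72,
    "o3-mini": 94.57,
    "DeepSeek-1.5B": 67.78,
    "Gemini 2.0 Flash": 92.11,
}


def models_within_budget(max_latency_s, min_accuracy):
    """List models meeting both constraints, fastest first."""
    eligible = [
        name
        for name, latency in LATENCY_S.items()
        if latency <= max_latency_s and MEAN_ACCURACY[name] >= min_accuracy
    ]
    return sorted(eligible, key=LATENCY_S.get)


# Models above the 10-second latency threshold mentioned in the text.
slow_models = sorted(name for name, s in LATENCY_S.items() if s > 10.0)
```

With an illustrative budget of 10 seconds and 90% mean accuracy, only Gemini 2.0 Flash and o3-mini qualify; relaxing the latency budget while requiring at least 95% admits o1 and DeepSeek-R1 instead.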
5 Discussion
In this study, we conducted a comprehensive evaluation of various LLMs to assess their mathematical
reasoning capabilities across multiple benchmarks [ 16,39,40]. The evaluated LLMs include Gemini
2.0 Flash [ 45], GPT-4o [ 44], o1-mini [ 46], o1 [ 103], o3-mini [ 6], DeepSeek-1.5B [ 57], and
DeepSeek-R1 [ 56]. Our empirical analysis revealed that DeepSeek-R1 and o1 outperformed their peers in
most mathematical domains, demonstrating strong mathematical reasoning capabilities in solving
structured problems. However, DeepSeek-R1’s performance was not on par with o3-mini in some complex
and abstract fields, such as Abstract Algebra. Additionally, our findings underscore the impact of
model distillation on mathematical reasoning performance. Specifically, distilled variants showcased
noticeable declines in accuracy, reinforcing concerns that distillation, while improving computational
efficiency, may inadvertently weaken the LLM’s capability to handle intricate problem-solving
tasks. These observations suggest that there exists a trade-off between computational efficiency and
reasoning depth, which needs to be carefully managed in future model optimizations.
First, our results further validate the advanced capabilities of DeepSeek-R1, which is trained using
RL rather than Supervised Fine-Tuning (SFT) [ 31,56,69]. DeepSeek-R1 incorporates GRPO, a
specialized RL-based optimization strategy that enhances training efficiency by evaluating model
actions relative to a group of sampled peers. Unlike traditional actor-critic RL approaches such as PPO,
which train a separate value (critic) model, GRPO removes the need for that model by
calculating advantages from group-based scoring [ 56,57]. This training paradigm enables
DeepSeek-R1 to balance cost and performance efficiently, leading to competitive accuracy in solving
grade school-level mathematical problems. DeepSeek-R1’s top accuracy on GSM8K (96.13%, tied with o1)
highlights its strength in handling structured, multi-step arithmetic reasoning tasks. Notably,
DeepSeek-R1 remains comparable to o1 and o3-mini, demonstrating that while GRPO enhances
efficiency, it does not necessarily surpass models with specific CoT in all problem categories.
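The group-based scoring at the core of GRPO can be illustrated with a short sketch: for one prompt, several responses are sampled, each receives a scalar reward, and each response's advantage is its reward standardized against the mean and standard deviation of its own group. The function name, the use of the population standard deviation, the small `eps` term, and the example rewards are assumptions for illustration; the full GRPO objective (clipped policy ratio, KL regularization) is not shown.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Standardize each reward against the mean and std of its own group."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(reward - mean) / (std + eps) for reward in rewards]


# Hypothetical rewards for four sampled answers to one problem
# (1.0 = final answer correct, 0.0 = incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct samples receive positive advantages and incorrect ones negative advantages, so the baseline comes from the group itself rather than from a separately trained value model.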
In addition, our analysis of DeepSeek-R1’s performance across MMLU benchmarks reveals notable
variation in mathematical reasoning capabilities across different domains. While DeepSeek-R1
excelled in Formal Logic (97.62%), it trailed the best-performing models in more abstract mathematical
areas, such as College Mathematics (96.00% vs. 99.00%) and Abstract Algebra (94.00% vs. 96.00%).
This discrepancy suggests that GRPO is particularly effective
for solving problems that rely on structured logical reasoning, where a group-based approach to RL
can refine logical consistency and decision-making [ 34]. Formal Logic problems, for instance, often
involve pattern recognition and structured deduction [ 34], which could align well with GRPO’s frame-
work. The group scoring mechanism in GRPO enables the model to internalize patterns commonly
recognized by human reasoning, allowing it to perform well in problems that require rule-based
logical deductions. However, the performance drop in Abstract Algebra and College Mathematics
suggests that DeepSeek-R1 struggles with tasks requiring highly abstract conceptualization. Abstract
Algebra, in particular, represents a fundamental shift from procedural algebra to the exploration
of abstract mathematical structures such as groups, rings, and fields [99]. Unlike rule-based logic
problems, abstract algebra necessitates deep symbolic manipulation, theorem proving, and the ability
to navigate complex algebraic structures [99], which require advanced deductive reasoning beyond
GRPO’s core optimization strategy.
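The per-domain comparison above rests on breaking multiple-choice accuracy down by MMLU subject. A minimal sketch of such a breakdown is given below, assuming each evaluated item is recorded as a (subject, predicted choice, gold choice) triple. The helper name `accuracy_by_subject` and the toy records are hypothetical placeholders and do not reproduce any result reported in this study.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """Return the fraction of correct answers per subject.

    records: iterable of (subject, predicted_choice, gold_choice) triples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        correct[subject] += int(predicted == gold)
    return {subject: correct[subject] / total[subject] for subject in total}


# Placeholder records for illustration only; not data from the study.
toy_records = [
    ("formal_logic", "A", "A"),
    ("formal_logic", "C", "C"),
    ("abstract_algebra", "B", "D"),
    ("abstract_algebra", "A", "A"),
]
per_subject = accuracy_by_subject(toy_records)
print(per_subject)
```

Reporting accuracy per subject rather than as a single pooled number is what allows differences between, for example, Formal Logic and Abstract Algebra to become visible.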
Furthermore, proof-based reasoning, a crucial component of higher-level mathematics, demands
meta-cognitive processes that extend beyond pattern recognition, including recursive logic application
and theorem proving. DeepSeek-R1’s limited ability to generalize across abstract mathematical
problems indicates that a more versatile reasoning approach may be needed to complement
GRPO-based training. In contrast, CoT reasoning demonstrates strong performance in specific
mathematical fields by leveraging step-by-step logical breakdowns to enhance problem-solving
accuracy. CoT excels in tasks that require explicit intermediate reasoning steps, which helps models
navigate complex mathematical derivations more effectively. However, despite these advantages,
CoT remains constrained by its tendency to overfit to structured problem-solving paradigms [9, 50],
leading to limited adaptability when faced with problems requiring broader conceptual generalization.
Thus, while CoT provides a deeper, structured reasoning framework, it may struggle to satisfy the
generalization requirements necessary for handling diverse mathematical challenges. The trade-off
between GRPO’s structured logic refinement and CoT’s sequential deductive approach suggests that
a hybrid reasoning mechanism—integrating the strengths of both frameworks—may be essential for
developing a more versatile and mathematically proficient LLM.
In summary, our findings align with prior research [2, 4, 27, 31, 47, 52, 54, 64, 66, 72, 76, 91, 103, 105,
107], further confirming that DeepSeek-R1 demonstrates strong mathematical reasoning compared
to other LLMs. However, our research extends beyond previous work by providing a granular,
field-specific analysis of LLM performance across different domains. Instead of evaluating general
reasoning skills, we delve into the nuances of logical, arithmetic, and abstract reasoning capabilities,
offering a holistic understanding of how different LLMs excel or struggle across mathematical
disciplines. Despite these contributions, our study has two main limitations: 1). Limited Model
Scope: While our evaluation includes prominent LLMs, we did not test other cutting-edge models
(such as Llama-3.2 / Llama-3.3, Claude, and Mistral), which could provide additional insights into
alternative training methodologies. 2). Restricted Benchmark Diversity: Our assessment relies on
three primary benchmark datasets, which, while comprehensive, may not fully capture the breadth of
mathematical reasoning challenges in real-world applications.
Accordingly, future research should expand the scope of evaluated LLMs and incorporate
additional benchmarks to gain a broader understanding of model strengths and weaknesses across
varied mathematical disciplines. Specifically, integrating GRPO with peer CoTs could allow models
to leverage both structured logical optimization and stepwise deductive reasoning. This integration
could lead to LLM swarms, where multiple reasoning frameworks synergize to enhance efficiency
and accuracy. Meanwhile, our study reaffirms that distillation can significantly impair reasoning
capabilities. Future efforts should explore adaptive distillation methods that preserve deep reasoning
pathways while optimizing for computational efficiency. Beyond current CoT frameworks [14, 53, 88,
89, 95], investigators could provide new insights into advanced reasoning frameworks, such as Chain
of Draft (CoD) [93]. Insights into how LLMs handle interconnected logical dependencies will improve
abstract mathematical problem-solving [15]. Future models can strike a balance between efficiency,
depth, and mathematical intelligence by advancing multi-framework LLMs and refining distillation
strategies. Meanwhile, Xia et al. [88] introduced ReasonEval, a novel methodology for evaluating
the quality of reasoning steps in mathematical tasks, emphasizing the need to assess both validity
and redundancy in intermediate steps. Additionally, Frieder et al. [32] explored the mathematical
capabilities of ChatGPT and GPT-4, demonstrating their effectiveness in querying mathematical facts
and analyzing their performance across varying levels of mathematical complexity. These studies
lay the foundation for more comprehensive and rigorous evaluations in the future, advancing our
understanding of LLMs in mathematical reasoning.
6 Conclusion
Overall, our study presents a comprehensive comparative analysis of LLM performance in mathemat-
ical reasoning, identifying key strengths and weaknesses across multiple domains. We highlight the
effectiveness of GRPO in structured problem-solving, demonstrating its ability to enhance logical
consistency and rule-based reasoning. However, we also uncover its limitations in handling abstract
mathematical concepts, where deep symbolic manipulation and theorem-based reasoning remain
challenging for GRPO-trained models. Furthermore, our findings underscore the trade-offs associated
with model distillation, revealing its potential drawbacks in preserving mathematical reasoning
capabilities. As a result, we propose new strategies for refining distillation techniques, ensuring that
compressed models maintain robust problem-solving proficiency without sacrificing computational
efficiency. Beyond performance evaluation, this study paves the way for future advancements in
LLM-driven mathematical intelligence by outlining potential enhancements in reasoning frameworks,
including hybrid models that integrate RL with structured step-by-step inference methods. By addressing
these challenges, our work contributes to the development of next-generation AI systems capable
of handling complex mathematical reasoning with greater accuracy, efficiency, and adaptability.
7 Acknowledgement
The authors acknowledge the support of the Augusta University High Performance Computing
Services (AUHPCS) for providing computational resources contributing to the results presented in
this publication/report.
References
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4
technical report. arXiv preprint arXiv:2303.08774 , 2023.
[2] Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large
language models for mathematical reasoning: Progresses and challenges. arXiv preprint
arXiv:2402.00157 , 2024.
[3] Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language
models for mathematical reasoning: Progresses and challenges, 2024.
[4] Amogh Akella. Improving math problem solving in large language models through categoriza-
tion and strategy tailoring. arXiv preprint arXiv:2411.00042 , 2024.
[5] Xavier Amatriain. Prompt design and engineering: Introduction and advanced methods. arXiv
preprint arXiv:2401.14423 , 2024.
[6] Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, and Sergio Segura. Early
external safety testing of openai’s o3-mini: Insights from the pre-deployment evaluation. arXiv
preprint arXiv:2501.17749 , 2025.
[7] Babette M. Benken, Jorge Ramírez, Xuhui Li, and Scott Wetendorf. Developmental math-
ematics success: Impact of students’ knowledge and attitudes. Journal of Developmental
Education , 38:14, 2015.
[8] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui
Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language
models with longtermism. arXiv preprint arXiv:2401.02954 , 2024.
[9] Johan Boye and Birger Moell. Large language models and mathematical reasoning failures.
arXiv preprint arXiv:2502.11574 , 2025.
[10] William Cain. Prompting change: Exploring prompt engineering in large language model ai
and its potential to transform education. TechTrends , 68(1):47–57, 2024.
[11] Nicola Cecere, Andrea Bacciu, Ignacio Fernández Tobías, and Amin Mantrach. Monte carlo
temperature: a robust sampling strategy for llm’s uncertainty quantification methods. arXiv
preprint arXiv:2502.18389 , 2025.
[12] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen,
Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language
models. ACM transactions on intelligent systems and technology , 15(3):1–45, 2024.
[13] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto,
Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating
large language models trained on code. arXiv preprint arXiv:2107.03374 , 2021.
[14] Qiguang Chen, Libo Qin, Jiaqi Wang, Jingxuan Zhou, and Wanxiang Che. Unlocking the
capabilities of thought: A reasoning boundary framework to quantify and optimize chain-of-
thought. Advances in Neural Information Processing Systems , 37:54872–54904, 2025.
[15] Zui Chen, Tianqiao Liu, Mi Tian, Weiqi Luo, Zitao Liu, et al. Advancing mathematical
reasoning in language models: The impact of problem-solving data, data synthesis methods,
and training stages. In The Thirteenth International Conference on Learning Representations ,
2025.
[16] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John
Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 ,
2021.
[17] Google CoLab. Google colab for quantitative comparisons in gsm8k.
https://colab.research.google.com/drive/1guO9lNk9WENzyin9wgjkHa_4rhrnRyuP?usp=sharing.
Accessed: 2025-03-04.
[18] Google CoLab. Google colab for quantitative comparisons in math.
https://colab.research.google.com/drive/1AwK-k9uVtd6aPxK-KDBmfpl727j5cxfz?usp=sharing.
Accessed: 2025-03-04.
[19] Google CoLab. Google colab for quantitative comparisons in mmlu.
https://colab.research.google.com/drive/1ye1Mvu5rSehQBcoUQzMhi8QYJjgqZyu7?usp=sharing.
Accessed: 2025-03-04.
[20] Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li,
Wangding Zeng, Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specializa-
tion in mixture-of-experts language models. arXiv preprint arXiv:2401.06066 , 2024.
[21] Haixing Dai, Yiwei Li, Zhengliang Liu, Lin Zhao, Zihao Wu, Suhang Song, Ye Shen, Dajiang
Zhu, Xiang Li, Sheng Li, et al. Ad-autogpt: an autonomous gpt for alzheimer’s disease
infodemiology. arXiv preprint arXiv:2306.10095 , 2023.
[22] Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Yihan Cao, Zihao Wu, Lin
Zhao, Shaochen Xu, Fang Zeng, Wei Liu, et al. Auggpt: Leveraging chatgpt for text data
augmentation. IEEE Transactions on Big Data , 2025.
[23] Robert Dale. Gpt-3: What’s it good for? Natural Language Engineering , 27(1):113–118,
2021.
[24] Sumit Kumar Dam, Choong Seon Hong, Yu Qiao, and Chaoning Zhang. A complete survey
on llm-based ai chatbots. arXiv preprint arXiv:2406.16937 , 2024.
[25] Paul Denny, Juho Leinonen, James Prather, Andrew Luxton-Reilly, Thezyrie Amarouche,
Brett A Becker, and Brent N Reeves. Prompt problems: A new programming exercise for the
generative ai era. In Proceedings of the 55th ACM Technical Symposium on Computer Science
Education V. 1, pages 296–302, 2024.
[26] Ricardo Dominguez-Olmedo, Moritz Hardt, and Celestine Mendler-Dünner. Questioning
the survey responses of large language models. Advances in Neural Information Processing
Systems , 37:45850–45878, 2025.
[27] Yucong Duan. Meta-analysis and evaluation of large models’ mathematical abilities based on
dikwp semantic mathematics. Unpublished , 2025.
[28] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle,
Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd
of models. arXiv preprint arXiv:2407.21783 , 2024.
[29] Gabriel Dulac-Arnold, Nir Levine, Daniel J Mankowitz, Jerry Li, Cosmin Paduraru, Sven
Gowal, and Todd Hester. Challenges of real-world reinforcement learning: definitions, bench-
marks and analysis. Mach. Learn. , 110(9):2419–2468, September 2021.
[30] Maxim Enis and Mark Hopkins. From llm to nmt: Advancing low-resource machine translation
with claude. arXiv preprint arXiv:2404.13813 , 2024.
[31] Evgenii Evstafev. Token-hungry, yet precise: Deepseek r1 highlights the need for multi-step
reasoning over speed in math. arXiv preprint arXiv:2501.18576 , 2025.
[32] Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz,
Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt. Advances in neural
information processing systems , 36:27699–27744, 2023.
[33] Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. Specializing smaller
language models towards multi-step reasoning, 2023.
[34] Bernhard Ganter and Rudolf Wille. Formal concept analysis: mathematical foundations .
Springer Nature, 2024.
[35] Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto
Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad
Reza Ghasemi Madani, et al. Are we done with mmlu? arXiv preprint arXiv:2406.04127 ,
2024.
[36] Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li,
Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. arXiv preprint
arXiv:2411.15594 , 2024.
[37] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu,
Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in
llms via reinforcement learning. arXiv preprint arXiv:2501.12948 , 2025.
[38] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen,
Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets
programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196 , 2024.
[39] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint
arXiv:2009.03300 , 2020.
[40] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn
Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.
arXiv preprint arXiv:2103.03874 , 2021.
[41] Siyuan Hu, Mingyu Ouyang, Difei Gao, and Mike Zheng Shou. The dawn of gui agent: A
preliminary case study with claude 3.5 computer use. arXiv preprint arXiv:2411.10323 , 2024.
[42] Wenyang Hu, Yao Shu, Zongmin Yu, Zhaoxuan Wu, Xiaoqiang Lin, Zhongxiang Dai, See-
Kiong Ng, and Bryan Kian Hsiang Low. Localized zeroth-order prompt optimization. Advances
in Neural Information Processing Systems , 37:86309–86345, 2025.
[43] Jiaxing Huang and Jingyi Zhang. A survey on evaluation of multimodal large language models.
arXiv preprint arXiv:2408.15769 , 2024.
[44] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark,
AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv
preprint arXiv:2410.21276 , 2024.
[45] Muhammad Imran and Norah Almusharraf. Google gemini as a next generation ai educational
tool: a review of emerging educational technology. Smart Learning Environments , 11(1):22,
2024.
[46] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low,
Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.
arXiv preprint arXiv:2412.16720 , 2024.
[47] Qile Jiang, Zhiwei Gao, and George Em Karniadakis. Deepseek vs. chatgpt: A compar-
ative study for scientific computing and scientific machine learning tasks. arXiv preprint
arXiv:2502.17764 , 2025.
[48] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva,
Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. Chatgpt
for good? on opportunities and challenges of large language models for education. Learning
and individual differences , 103:102274, 2023.
[49] Shalom Lappin. Assessing the strengths and weaknesses of large language models. Journal of
Logic, Language and Information , 33(1):9–20, 2024.
[50] Chengpeng Li, Guanting Dong, Mingfeng Xue, Ru Peng, Xiang Wang, and Dayiheng Liu.
Dotamath: Decomposition of thought with code assistance and self-correction for mathematical
reasoning. arXiv preprint arXiv:2407.04078 , 2024.
[51] Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An
Huang, Ekin Akyürek, Anima Anandkumar, et al. Pre-trained language models for interactive
decision-making. Advances in Neural Information Processing Systems , 35:31199–31212,
2022.
[52] Xiaoyuan Li, Wenjie Wang, Moxin Li, Junrong Guo, Yang Zhang, and Fuli Feng. Evaluat-
ing mathematical reasoning of large language models: A focus on error identification and
correction. arXiv preprint arXiv:2406.00755 , 2024.
[53] Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers
to solve inherently serial problems. arXiv preprint arXiv:2402.12875 , 1, 2024.
[54] Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao,
Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A
survey of reasoning large language models. arXiv preprint arXiv:2502.17419 , 2025.
[55] Shizhe Liang, Wei Zhang, and Tianyang Zhong. Mathematics and machine creativity: A
survey on bridging mathematics with ai. arXiv preprint arXiv:2412.16543 , 2024.
[56] Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng,
Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient
mixture-of-experts language model. arXiv preprint arXiv:2405.04434 , 2024.
[57] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao,
Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv
preprint arXiv:2412.19437 , 2024.
[58] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated
by chatgpt really correct? rigorous evaluation of large language models for code generation.
Advances in Neural Information Processing Systems , 36:21558–21572, 2023.
[59] Zhengliang Liu, Yue Huang, Xiaowei Yu, Lu Zhang, Zihao Wu, Chao Cao, Haixing Dai, Lin
Zhao, Yiwei Li, Peng Shu, et al. Deid-gpt: Zero-shot medical text de-identification by gpt-4.
arXiv preprint arXiv:2303.11032 , 2023.
[60] Zhengliang Liu, Hanqi Jiang, Tianyang Zhong, Zihao Wu, Chong Ma, Yiwei Li, Xiaowei Yu,
Yutong Zhang, Yi Pan, Peng Shu, et al. Holistic evaluation of gpt-4v for biomedical imaging.
arXiv preprint arXiv:2312.05256 , 2023.
[61] Zhengliang Liu, Yiwei Li, Peng Shu, Aoxiao Zhong, Longtao Yang, Chao Ju, Zihao Wu,
Chong Ma, Jie Luo, Cheng Chen, et al. Radiology-llama2: Best-in-class large language model
for radiology. arXiv preprint arXiv:2309.06419 , 2023.
[62] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng,
Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating math reasoning in
visual contexts with gpt-4v, bard, and other large multimodal models. CoRR , 2023.
[63] Zhenyan Lu, Xiang Li, Dongqi Cai, Rongjie Yi, Fangming Liu, Xiwen Zhang, Nicholas D
Lane, and Mengwei Xu. Small language models: Survey, measurements, and insights. arXiv
preprint arXiv:2409.15790 , 2024.
[64] Sarah Mercer, Samuel Spillard, and Daniel P Martin. Brief analysis of deepseek r1 and it’s
implications for generative ai. arXiv preprint arXiv:2502.02523 , 2025.
[65] Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar
Sainz, Eneko Agirre, Ilana Heintz, and Dan Roth. Recent advances in natural language
processing via large pre-trained language models: A survey. ACM Computing Surveys , 56(2):1–
40, 2023.
[66] Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and
Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning
in large language models. arXiv preprint arXiv:2410.05229 , 2024.
[67] Gianluca Mondillo, Mariapia Masino, Simone Colosimo, Alessandra Perrotta, and Vittoria
Frattolillo. Evaluating ai reasoning models in pediatric medicine: A comparative analysis of
o3-mini and o3-mini-high. medRxiv , pages 2025–02, 2025.
[68] Aidar Myrzakhan, Sondos Mahmoud Bsharat, and Zhiqiang Shen. Open-llm-leaderboard:
From multi-choice to open-style questions for llms evaluation, benchmark, and arena. arXiv
preprint arXiv:2406.07545 , 2024.
[69] Fnu Neha and Deepshikha Bhati. A survey of deepseek models. Authorea Preprints , 2025.
[70] Minh Nguyen, Andrew Baker, Clement Neo, Allen Roush, Andreas Kirsch, and Ravid Shwartz-
Ziv. Turning up the heat: Min-p sampling for creative and coherent llm outputs. arXiv preprint
arXiv:2407.01082 , 2024.
[71] Xinzhe Ni, Yeyun Gong, Zhibin Gou, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu
Chen. Exploring the mystery of influential data for mathematical reasoning. arXiv preprint
arXiv:2404.01067 , 2024.
[72] Shuai Peng, Di Fu, Liangcai Gao, Xiuqin Zhong, Hongguang Fu, and Zhi Tang. Multimath:
Bridging visual and mathematical reasoning for large language models. arXiv preprint
arXiv:2409.00147 , 2024.
[73] Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman
Chadha. A systematic survey of prompt engineering in large language models: Techniques
and applications. arXiv preprint arXiv:2402.07927 , 2024.
[74] Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, and Aviral Kumar.
Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold.
Advances in Neural Information Processing Systems , 37:43000–43031, 2024.
[75] Sakib Shahriar, Brady D Lund, Nishith Reddy Mannuru, Muhammad Arbab Arshad, Kadhim
Hayawi, Ravi Varma Kumar Bevara, Aashrith Mannuru, and Laiba Batool. Putting gpt-4o to the
sword: A comprehensive evaluation of language, vision, speech, and multimodal proficiency.
Applied Sciences , 14(17):7782, 2024.
[76] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang,
Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical
reasoning in open language models. arXiv preprint arXiv:2402.03300 , 2024.
[77] Zhuocheng Shen. Llm with tools: A survey. arXiv preprint arXiv:2409.18807 , 2024.
[78] Ali Soroush, Benjamin S Glicksberg, Eyal Zimlichman, Yiftach Barash, Robert Freeman,
Alexander W Charney, Girish N Nadkarni, and Eyal Klang. Large language models are poor
medical coders—benchmarking of medical code querying. NEJM AI , 1(5):AIdbp2300040,
2024.
[79] Joseph Spracklen, Raveen Wijewickrama, AHM Sakib, Anindya Maiti, Bimal Viswanath,
and Murtuza Jadliwala. We have a package for you! a comprehensive analysis of package
hallucinations by code generating llms. arXiv preprint arXiv:2406.10279 , 2024.
[80] Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. Chain of thoughtlessness?
an analysis of cot in planning. Advances in Neural Information Processing Systems , 37:29106–
29141, 2025.
[81] Zhi Rui Tam, Cheng-Kuang Wu, Chieh-Yen Lin, and Yun-Nung Chen. None of the above, less
of the right: Parallel patterns between humans and llms on multi-choice questions answering.
arXiv preprint arXiv:2503.01550 , 2025.
[82] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya
Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open
models based on gemini research and technology. arXiv preprint arXiv:2403.08295 , 2024.
[83] Jiaqi Wang, Hanqi Jiang, Yiheng Liu, Chong Ma, Xu Zhang, Yi Pan, Mengyuan Liu, Peiran
Gu, Sichen Xia, Wenjun Li, et al. A comprehensive review of multimodal large language
models: Performance and challenges across different tasks. arXiv preprint arXiv:2408.01319 ,
2024.
[84] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le,
Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.
Advances in neural information processing systems , 35:24824–24837, 2022.
[85] Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R
Bowman, He He, and Shi Feng. Language models learn to mislead humans via rlhf. arXiv
preprint arXiv:2409.12822 , 2024.
[86] Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Lidia Sam Chao, and Derek Fai Wong. A
survey on llm-generated text detection: Necessity, methods, and future directions. Computa-
tional Linguistics , pages 1–66, 2025.
[87] Zhenyu Wu, Meng Jiang, and Chao Shen. Get an a in math: Progressive rectification prompting.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19288–
19296, 2024.
[88] Yu Xia, Rui Wang, Xu Liu, Mingyan Li, Tong Yu, Xiang Chen, Julian McAuley, and Shuai
Li. Beyond chain-of-thought: A survey of chain-of-x paradigms for llms. arXiv preprint
arXiv:2404.15676 , 2024.
[89] Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, and
Bo Li. Badchain: Backdoor chain-of-thought prompting for large language models. arXiv
preprint arXiv:2401.12242 , 2024.
[90] Zhenxiang Xiao, Yuzhong Chen, Junjie Yao, Lu Zhang, Zhengliang Liu, Zihao Wu, Xiaowei
Yu, Yi Pan, Lin Zhao, Chong Ma, et al. Instruction-vit: Multi-modal prompts for instruction
learning in vision transformer. Information Fusion , 104:102204, 2024.
[91] Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan,
Wenda Li, and Xiaodan Liang. Deepseek-prover: Advancing theorem proving in llms through
large-scale synthetic data. arXiv preprint arXiv:2405.14333 , 2024.
[92] Shaochen Xu, Yifan Zhou, Zhengliang Liu, Zihao Wu, Tianyang Zhong, Huaqin Zhao, Yiwei
Li, Hanqi Jiang, Yi Pan, Junhao Chen, et al. Towards next-generation medical agent: How o1
is reshaping decision-making in medical scenarios. arXiv preprint arXiv:2411.14461 , 2024.
[93] Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He. Chain of draft: Thinking faster by
writing less. arXiv preprint arXiv:2502.18600 , 2025.
[94] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan
Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint
arXiv:2412.15115 , 2024.
[95] Wen Yang, Kai Fan, and Minpeng Liao. Markov chain of thought for efficient mathematical
reasoning. arXiv preprint arXiv:2410.17635 , 2024.
[96] Zhenyuan Yang, Zhengliang Liu, Jing Zhang, Cen Lu, Jiaxin Tai, Tianyang Zhong, Yiwei Li,
Siyan Zhao, Teng Yao, Qing Liu, et al. Analyzing nobel prize literature with large language
models. arXiv preprint arXiv:2410.18142 , 2024.
[97] Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey on large
language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence
Computing , page 100211, 2024.
[98] Xiaosong Yuan, Chen Shen, Shaotian Yan, Xiaofeng Zhang, Liang Xie, Wenxiao Wang,
Renchu Guan, Ying Wang, and Jieping Ye. Instance-adaptive zero-shot chain-of-thought
prompting. arXiv preprint arXiv:2409.20441 , 2024.
[99] Yumiati Yumiati, Harry Dwi Putra, Saleh Haji, et al. Blended online learning: Students’
perception and its effect on learning outcomes abstract algebra. Infinity Journal , 14(1):65–84,
2025.
[100] Piotr Zablocki and Zofia Gajewska. Assessing hallucination risks in large language models
through internal state analysis. Authorea Preprints , 2024.
[101] Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, William Song, Tiffany
Zhao, Pranav Raja, Charlotte Zhuang, Dylan Slack, et al. A careful examination of large
language model performance on grade school arithmetic. Advances in Neural Information
Processing Systems , 37:46819–46836, 2024.
[102] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian
Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models.
arXiv preprint arXiv:2303.18223 , 1(2), 2023.
[103] Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Yifan Zhou, Shizhe Liang, Zihao
Wu, Yanjun Lyu, Peng Shu, Xiaowei Yu, et al. Evaluation of openai o1: Opportunities and
challenges of agi. arXiv preprint arXiv:2409.18486 , 2024.
[104] Tianyang Zhong, Yaonai Wei, Li Yang, Zihao Wu, Zhengliang Liu, Xiaozheng Wei, Wenjun
Li, Junjie Yao, Chong Ma, Xiang Li, et al. Chatabl: Abductive learning via natural language
interaction with chatgpt. arXiv preprint arXiv:2304.11107 , 2023.
[105] Zihao Zhou, Shudong Liu, Maizhen Ning, Wei Liu, Jindong Wang, Derek F Wong, Xiaowei
Huang, Qiufeng Wang, and Kaizhu Huang. Is your model really a good math reasoner?
evaluating mathematical reasoning with checklist. arXiv preprint arXiv:2407.08733 , 2024.
[106] Shiben Zhu, Wanqin Hu, Zhi Yang, Jiani Yan, and Fang Zhang. Qwen-2.5 outperforms other
large language models in the chinese national nursing licensing examination: Retrospective
cross-sectional comparative study. JMIR Medical Informatics , 13:e63731, 2025.
[107] Wenwen Zhuang, Xin Huang, Xiantao Zhang, and Jin Zeng. Math-puma: Progressive upward
multimodal alignment to enhance mathematical reasoning. arXiv preprint arXiv:2408.08640 ,
2024.