Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research
Junde Wu, Jiayuan Zhu, Yuyuan Liu
University of Oxford
Abstract
In this technical report, we introduce Agentic
Reasoning, a framework (work in progress) that enhances
large language model (LLM) reasoning by
integrating external tool-using agents. Un-
like conventional LLM-based reasoning ap-
proaches, which rely solely on internal infer-
ence, Agentic Reasoning dynamically engages
web search, code execution, and structured
reasoning-context memory to solve complex
problems requiring deep research and multi-
step logical deduction. Our framework intro-
duces the Mind Map agent, which constructs
a structured knowledge graph to track logical
relationships, improving deductive reasoning.
Additionally, the integration of web-search and
coding agents enables real-time retrieval and
computational analysis, enhancing reasoning
accuracy and decision-making. Evaluations
on PhD-level scientific reasoning (GPQA) and
domain-specific deep research tasks demon-
strate that our approach significantly out-
performs existing models, including leading
retrieval-augmented generation (RAG) systems
and closed-source LLMs. Moreover, our re-
sults indicate that agentic reasoning improves
expert-level knowledge synthesis, test-time
scalability, and structured problem-solving.
The code is at: https://github.com/theworldofagents/Agentic-Reasoning.
1 Introduction
Recently, large reasoning models, such as Ope-
nAI’s o1 (Jaech et al., 2024), Qwen-QwQ (Team),
and DeepSeek-R1 (Team, 2024), have demon-
strated impressive stepwise reasoning capabili-
ties over long sequences through large-scale re-
inforcement learning. These advancements provide
promising solutions to complex reasoning tasks
(Wei et al., 2022; Lewkowycz et al., 2022; OpenAI)
and have inspired foundational efforts to replicate
o1-like reasoning patterns across a broader range
of models (Qin et al., 2024; Huang et al., 2024;
Zhang et al., 2024).
DeepSeek-R1, for example, relies exclusively on
rule-based outcome rewards during training, such
as evaluating whether a mathematical solution is
correct or a piece of code executes successfully.
While this approach has yielded remarkable rea-
soning capabilities, equaling o1’s performance in
domains like math and code, it comes with notable
trade-offs. As even the authors acknowledge, this
type of training diminishes the model’s ability to
articulate its reasoning process. DeepSeek-R1’s
responses are often logical and accurate but lack
detailed explanations of transitions between ideas
or the finer connections between arguments.
Although current reasoning methods excel in
structured domains like math and code—where out-
comes are easily verifiable—applying these tech-
niques to less structured or subjective tasks remains
a significant challenge. Adapting these strategies
to areas where answers are not inherently definitive
is a key research gap. How can models be trained
to handle tasks that require judgment, interpreta-
tion, or nuanced understanding rather than binary
correctness?
Furthermore, not all problems benefit from for-
mal reasoning approaches. Many fields, such as
social sciences, ethics, or experiential disciplines,
rely on abstract concepts, conventional wisdom,
factual verification, understanding complex logical
relationships, or moral reasoning. When models
attempt to impose math- or coding-style reason-
ing onto such areas, they often produce flawed or
overly rigid results. Developing approaches that
account for these unique requirements is essential
for advancing the applicability of reasoning models
beyond their current domains.
arXiv:2502.04644v1 [cs.AI] 7 Feb 2025
Deep, thoughtful answers to open-ended questions
often require extensive research, repeated verification,
information retrieval, computational analysis,
and the organization of complex logical relationships,
all steps fundamental to human reasoning.
In this process, humans rely heavily on external
tools, such as internet searches for gathering infor-
mation, computational tools for quantitative analy-
sis, or whiteboards and Mind Maps for organizing
thoughts. This raises an intriguing question: can
large language models similarly leverage external
tools to enhance their reasoning and tackle inten-
sive knowledge work across diverse domains?
Previous efforts have attempted to integrate
search or retrieval-augmented generation (RAG)
into the reasoning process (Shao et al., 2024;
Khaliq et al., 2024; Islam et al., 2024; Li et al.,
2025), with notable examples including Gemini’s
Deep Research. However, these models are closed, and
their exact methodologies remain undisclosed. In
contrast, open-source models typically focus ex-
clusively on retrieval or web-searching during rea-
soning, leaving a significant performance gap com-
pared to their closed-source counterparts.
We introduce Agentic Reasoning, a framework
that enhances the reasoning process by integrating
external LLM-based agents as tools. This approach
enables LLMs to perform multi-step reasoning
and tackle complex problems more effectively by
delegating specific tasks to these auxiliary agents.
Through extensive experimentation with integrat-
ing various agents into the reasoning process, we
identified three essential agents that prove highly
effective for general reasoning across diverse problems.
The web-search agent retrieves relevant information
from the internet to supplement the model’s knowledge.
The code agent performs computational analyses and coding
tasks to support quantitative reasoning. Finally,
the memory agent, which we call Mind Map, con-
structs knowledge graphs based on the reasoning
context, enabling the organization of complex log-
ical relationships in a manner similar to human
mind mapping. Together, these agents enhance the
model’s ability to tackle complex problems with
greater efficiency and precision.
When integrated into current reasoning LLMs,
Agentic Reasoning transforms their problem-
solving capabilities by enabling them to plan and
execute multi-step strategies autonomously. These
models can identify and retrieve the necessary data,
adapt dynamically to real-time information, and
perform quantitative analyses to generate precise
outcomes. This framework also allows LLMs to
deliver comprehensive reports comparable to those of a research analyst or provide solutions on par
with PhD-level expertise.
We evaluated our model on general knowledge-
intensive benchmarks requiring complex reasoning
capabilities, categorized into two key areas: (1)
solving expert-level questions and (2) conducting
deep research on real-world expert-level tasks.
For expert-level questions, we tested the model
on the GPQA dataset, a PhD-level science multiple-
choice QA benchmark with questions authored by
domain experts in physics, chemistry, and biology.
Our Agentic Reasoning framework achieved im-
pressive accuracy rates: 58% in chemistry, 88% in
physics, and 79% in biology, closely rivaling the best
and newest closed reasoning model, OpenAI o1.
For real-world expert-level tasks, Agentic Reason-
ing was evaluated by domain experts, who noted
that it effectively automated several hours of chal-
lenging, manual investigation. This highlights its
potential to streamline labor-intensive processes
and enhance productivity in knowledge-intensive
domains.
Additionally, we tested the model’s scalability
in test-time reasoning using the agentic framework
as a verifier. The results showed significant im-
provements in test-time computational efficiency,
demonstrating the framework’s ability to optimize
reasoning processes. This finding suggests that the
agentic framework has strong potential to serve as
a reward model for reinforcement learning, further
advancing reasoning model training.
These results position Agentic Reasoning as
a powerful and versatile framework, capable of
tackling complex, domain-specific challenges with
depth and precision. Its ability to perform in-depth
research, navigate intricate logical structures, and
synthesize information effectively highlights its po-
tential for solving knowledge-intensive problems
and driving advancements in deep analytical explo-
ration.
2 Method
2.1 Preliminary
We consider an expert-level task that requires multi-step
complex reasoning. During reasoning, the model can call
external tools and retrieve a structured memory of its
previous reasoning. Our objective is to generate, for each
query $q$, both a logical reasoning chain $r$ and a final
answer $a$. To achieve this, the reasoning model dynamically
interacts with external tools $e$, which are generally web
search and Python coding, and retrieves structured knowledge
from an organized memory $k$ throughout the reasoning process.
Figure 1: The overall workflow of Agentic Reasoning.
Formally, we identify four primary inputs in the
problem-solving pipeline: the task instruction $o$, which
defines the overarching task objective; the query $q$, a
complex question requiring multi-step reasoning; the external
tool outputs $e$, dynamically retrieved content
from tools such as web search or coding; and the reasoning
memory $k$, which contains a structured knowledge graph.
The goal is to integrate $o$, $q$, $e$, and $k$ to generate a
coherent reasoning chain $r$ and a final answer $a$.
This process can be expressed as the mapping:
$$(o, q, e, k) \mapsto (r, a).$$
We model the generation of $r$ and $a$ using the
following joint probability formulation:
$$
P(r, a \mid o, q, e, k) =
\underbrace{\prod_{t=1}^{T_r} P(r_t \mid r_{<t}, o, q, e_{\le t}, k_{\le t})}_{\text{Reasoning Process}}
\times
\underbrace{\prod_{t=1}^{T_a} P(a_t \mid a_{<t}, r, o, q, e, k)}_{\text{Answer Generation}},
$$
where $T_r$ and $T_a$ represent the lengths (in tokens)
of the reasoning chain $r$ and the final answer $a$,
respectively. Here, $r_t$ denotes the token at position
$t$ in the reasoning sequence, with $r_{<t}$ representing
all previous tokens. The terms $e_{\le t}$ and $k_{\le t}$ indicate
all tool-generated outputs and knowledge-graph
information retrieved up to step $t$. Similarly, $a_t$ is
the token at position $t$ in the final answer, and $a_{<t}$
represents all previously generated answer tokens.
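To make the factorization concrete, the following minimal Python sketch computes the joint log-probability as the sum of per-token conditional log-probabilities, first over the reasoning chain and then over the answer. The function name `joint_log_prob` and the probability values are illustrative assumptions; in practice the per-token probabilities would come from the reasoning model's output distribution.

```python
import math
from typing import Sequence


def joint_log_prob(reasoning_token_probs: Sequence[float],
                   answer_token_probs: Sequence[float]) -> float:
    """Return log P(r, a | o, q, e, k) from per-token conditional probabilities.

    reasoning_token_probs[t] stands for P(r_t | r_<t, o, q, e_<=t, k_<=t) and
    answer_token_probs[t] stands for P(a_t | a_<t, r, o, q, e, k).
    """
    log_p_reasoning = sum(math.log(p) for p in reasoning_token_probs)
    log_p_answer = sum(math.log(p) for p in answer_token_probs)
    # A product of probabilities becomes a sum of log-probabilities.
    return log_p_reasoning + log_p_answer


# Made-up per-token probabilities (T_r = 3, T_a = 2), for illustration only.
r_probs = [0.9, 0.8, 0.5]
a_probs = [0.5, 0.5]
log_p = joint_log_prob(r_probs, a_probs)
print(math.exp(log_p))  # joint probability recovered from the log domain
```

Working in the log domain avoids numerical underflow when the reasoning chain is thousands of tokens long.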
2.2 Agentic Reasoning Pipeline
Our core idea is to enhance model reasoning
by deploying external LLM-based agents during
reasoning. The framework enables the reasoning
LLM to interact with external information
in an agentic way. During its reasoning process,
it can call external tools to help solve the
problem and use a structured memory, called
Mind Map, to store its reasoning context. At its
core, an agentic mechanism empowers the model to
determine, in real time, when additional information
is required. Whenever the model identifies that
external information is needed during its reasoning,
it proactively embeds specialized tokens into
its reasoning tokens. These tokens can be broadly
categorized as web-search tokens, coding tokens,
and Mind Map calling tokens. Together with the token,
the reasoning model also generates a precise
query as a message to interact with these external
agents, based on the reasoning context developed
so far.
Upon detecting such a token, the reasoning process
temporarily halts to extract the query and its
reasoning context. These are then dispatched to
external agents, such as search engines or the Mind
Map, to generate pertinent content. The generation
considers both the message received and the
reasoning context to ensure that the most
relevant results are returned. These results are then
reintegrated into the reasoning chain, allowing the
model to continue its inference with updated and
enriched knowledge.
This iterative retrieval-and-reasoning cycle con-
tinues as needed, enabling the model to dynami-
cally refine its conclusions until it reaches a fully
reasoned final answer.
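The cycle described in this section can be sketched as a small control loop. This is a minimal sketch under stated assumptions, not the released implementation: the tool-call syntax `<tool:NAME>query</tool>`, the `<result>` wrapper, the function names, and the scripted stand-in for the reasoning model are all hypothetical.

```python
import re
from typing import Callable, Dict

# Hypothetical tool-call syntax emitted by the reasoning model.
TOOL_CALL = re.compile(r"<tool:(\w+)>(.*?)</tool>", re.DOTALL)

Agent = Callable[[str, str], str]  # (query, reasoning_context) -> result text


def agentic_reasoning(question: str,
                      generate: Callable[[str], str],
                      agents: Dict[str, Agent],
                      max_calls: int = 8) -> str:
    """Alternate between generation and agent calls until no tool is requested."""
    context = question
    calls = 0
    while True:
        chunk = generate(context)        # model writes until it stops or calls a tool
        context += chunk
        match = TOOL_CALL.search(chunk)
        if match is None or calls >= max_calls:
            return context               # finished, or call budget exhausted
        name, query = match.group(1), match.group(2).strip()
        agent = agents.get(name)
        result = agent(query, context) if agent else f"unknown tool '{name}'"
        context += f"\n<result>{result}</result>\n"  # reintegrate the agent output
        calls += 1


# Scripted stand-in for the reasoning LLM: asks for a search once, then answers.
def scripted_model(context: str) -> str:
    if "<result>" not in context:
        return " I need a fact. <tool:search>boiling point of water in Celsius</tool>"
    return " The answer is 100."


agents = {"search": lambda query, ctx: "100"}
trace = agentic_reasoning("Q: At what temperature does water boil?", scripted_model, agents)
print(trace)
```

Passing the full context to each agent mirrors the point made above that agents see both the message and the reasoning context; the `max_calls` budget is an added safeguard, not something the text specifies.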
2.3 Mind Map Agent
We construct a Mind Map to store and structure
the real-time reasoning context of the reasoning
model. This Mind Map is built by transforming
raw reasoning chains into a structured knowledge
graph. Specifically, we use a graph-construction
LLM to extract entities from the reasoning chain
and identify semantic relationships between related
entities, following a process similar to that used in
GraphRAG (Edge et al., 2024).
The Mind Map serves two primary functions.
First, it clusters reasoning context into distinct
groups and summarizes each theme. This is
achieved by applying community clustering (Edge
et al., 2024) on the knowledge graph and using an
LLM to generate concise summaries for each group.
Second, the knowledge graph can be queried with
specific questions, such as “Who was Jason’s ma-
ternal great-grandfather?” Using standard retrieval-
augmented generation (RAG) on the knowledge
graph (Edge et al., 2024), we retrieve and return
relevant information.
These functions integrate the Mind Map into
various aspects of the Agentic Reasoning process.
It provides contextual reasoning support to exter-
nal tools, enabling them to generate more context-
aware responses (as discussed in later sections).
Additionally, when the reasoning model is uncer-
tain about its claims or loses track in an extended
reasoning process, it can query the Mind Map for
relevant information, treating it as an external tool,
and continue reasoning based on the retrieved an-
swer.
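The two Mind Map functions, grouping the reasoning context and answering relational queries, can be illustrated with a toy graph. This is a simplified sketch under loud assumptions: the triples are written by hand rather than extracted by a graph-construction LLM, connected components stand in for the community clustering used in GraphRAG, and the class name, method names, and people's names are all invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Set


class MindMap:
    """Toy knowledge graph of (head, relation, tail) triples."""

    def __init__(self) -> None:
        self.edges: Dict[str, Dict[str, str]] = defaultdict(dict)  # head -> {relation: tail}
        self.neighbors: Dict[str, Set[str]] = defaultdict(set)     # undirected adjacency

    def add(self, head: str, relation: str, tail: str) -> None:
        self.edges[head][relation] = tail
        self.neighbors[head].add(tail)
        self.neighbors[tail].add(head)

    def follow(self, start: str, relations: Sequence[str]) -> Optional[str]:
        """Answer a relational query by walking a chain of relations."""
        node: Optional[str] = start
        for relation in relations:
            node = self.edges.get(node, {}).get(relation)
            if node is None:
                return None
        return node

    def clusters(self) -> List[Set[str]]:
        """Group entities by connected component (a stand-in for community clustering)."""
        seen: Set[str] = set()
        groups: List[Set[str]] = []
        for node in list(self.neighbors):
            if node in seen:
                continue
            group: Set[str] = set()
            stack = [node]
            while stack:
                current = stack.pop()
                if current in group:
                    continue
                group.add(current)
                stack.extend(self.neighbors[current] - group)
            seen |= group
            groups.append(group)
        return groups


mind_map = MindMap()
mind_map.add("Jason", "mother", "Mary")
mind_map.add("Mary", "father", "Tom")
mind_map.add("Tom", "father", "Henry")
mind_map.add("surgeon", "father_of", "boy")

# One maternal great-grandfather: the father of Jason's mother's father.
print(mind_map.follow("Jason", ["mother", "father", "father"]))
print(mind_map.clusters())
```

In the full system each cluster would additionally be summarized by an LLM, and queries would be answered by retrieval-augmented generation over the graph rather than by an exact relation walk.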
2.4 Web-search Agent
A search agent is invoked to retrieve the most relevant
documents from the web. Rather than being incorporated
in their raw form, the web pages are
temporarily held for further processing. This ensures
that only the most pertinent information is
extracted and integrated into the main reasoning
chain, maintaining coherence and relevance.
Once the relevant web pages are retrieved by
the search agent, we use an LLM to extract a concise,
rephrased summary of the content most relevant
to the ongoing reasoning context. This agent processes
the web pages in the context of both the user
query and the reasoning context, distilling key insights
that are directly applicable to the problem at
hand. The format and length of the summary adapt
dynamically to the reasoning task. For example,
for factual queries like “What is the population
of the US in 2024?”, the result would be a simple
numerical answer. For exploratory reasoning, such as
finding a new perspective on a topic, the search
agent would provide a summarized, detailed, nuanced
viewpoint. For hypothesis validation, such as assessing
supporting evidence for an assumption, the
result would include the degree of support or contradiction
found in the retrieved web pages. This
processed snippet is then integrated into the main
reasoning process at the appropriate juncture, ensuring
that external insights enhance rather than
disrupt logical flow.
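One way to realize the task-dependent summary format is to select the summarization instruction according to the kind of query. The sketch below is an assumption-laden illustration: the three style labels, the instruction wording, and the function `build_summary_prompt` are hypothetical and only paraphrase the three cases described above.

```python
from typing import List

# Hypothetical output styles, paraphrasing the three cases in the text.
STYLE_INSTRUCTIONS = {
    "factual": "Return only the short fact or number that answers the query.",
    "exploratory": "Return a detailed, nuanced summary of the viewpoints found.",
    "validation": "State the degree of support or contradiction for the hypothesis.",
}


def build_summary_prompt(search_query: str, reasoning_context: str,
                         pages: List[str], style: str) -> str:
    """Assemble the prompt given to the summarizing LLM (illustrative wording)."""
    if style not in STYLE_INSTRUCTIONS:
        raise ValueError(f"unknown style: {style}")
    documents = "\n\n".join(f"[Page {i + 1}]\n{page}" for i, page in enumerate(pages))
    return (
        f"Reasoning so far:\n{reasoning_context}\n\n"
        f"Search query: {search_query}\n\n"
        f"Retrieved pages:\n{documents}\n\n"
        f"Instruction: {STYLE_INSTRUCTIONS[style]}"
    )


prompt = build_summary_prompt(
    "What is the population of the US in 2024?",
    "Estimating per-capita figures for the US.",
    ["(text of first retrieved page)", "(text of second retrieved page)"],
    "factual",
)
print(prompt)
```

Only the returned summary, not the raw pages, would then be spliced back into the reasoning chain.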
2.5 Coding Agent
Instead of prompting the reasoning model to gen-
erate code directly, we find it more efficient to del-
egate coding tasks to a specialized coding LLM.
The reasoning model sends the relevant context
and query message to the coding LLM, which then
writes the required code, executes it via a compiler,
and returns the results. This approach ensures that
the reasoning model remains focused on its core
reasoning process without being disrupted by cod-
ing tasks, allowing for longer and more coherent
reasoning chains. Specifically, we format the cod-
ing request as follows: "Write code to perform
<code message from reasoning model> given the
context <reasoning context from Mind Map> to
answer the query <user query>." The coding LLM
is instructed to always return its output in natural
language, ensuring seamless integration with the
reasoning model.
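The request format quoted above can be sketched as a template plus a small execution wrapper. Only the template sentence comes from the text; the function names, the stub standing in for the coding LLM, and the use of Python's `exec` are assumptions made to keep the example self-contained.

```python
import contextlib
import io
from typing import Callable

# Template quoted from the text; the three field names are our own labels.
CODING_REQUEST = (
    "Write code to perform {code_message} given the context "
    "{reasoning_context} to answer the query {user_query}."
)


def build_coding_request(code_message: str, reasoning_context: str, user_query: str) -> str:
    return CODING_REQUEST.format(
        code_message=code_message,
        reasoning_context=reasoning_context,
        user_query=user_query,
    )


def run_generated_code(code: str) -> str:
    """Execute Python source and return whatever it printed.

    exec() is used only for illustration; model-written code should be run
    in a sandbox in any real deployment.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()


def coding_agent(code_message: str, reasoning_context: str, user_query: str,
                 coding_llm: Callable[[str], str]) -> str:
    request = build_coding_request(code_message, reasoning_context, user_query)
    code = coding_llm(request)            # the coding LLM writes the program
    output = run_generated_code(code)     # the program is executed
    return f"The code printed: {output}"  # reported back in natural language


# Stub standing in for the coding LLM (hypothetical).
def fake_coding_llm(request: str) -> str:
    return "print(sum(range(1, 11)))"


result = coding_agent("the sum of 1..10", "(Mind Map context)", "What is 1 + ... + 10?",
                      fake_coding_llm)
print(result)
```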
2.6 Main Findings
Less is More. Unlike general agentic frameworks
that provide models with a large selection of exter-
nal tools, we find that just two—web search and
coding—are sufficient for most tasks, even those
requiring expert-level proficiency. Adding more
tools can degrade performance by increasing the
risk of inappropriate tool selection. Moreover, inac-
curacies in external tool outputs can negatively im-
pact the overall response quality. While additional
tools are not significantly beneficial for language-
based reasoning, they can be crucial for processing
non-text modalities such as financial data, medical
images, and genetic data. Developing specialized
tools for different data modalities could further en-
hance LLM reasoning capabilities, and we will
explore related results in future updates.
Delegating Tasks to LLM-Based Agents. Distributing
computational workloads across multiple
LLM-based agents improves efficiency. Instead
of having the main reasoning model handle all
tool-related tasks (e.g., writing code or constructing
a knowledge graph), or calling non-LLM tools
such as a bare search engine or code compiler, we
delegate these tasks to specialized LLM-based agents.
For example, a coding LLM generates code based on the
query and context from the main reasoning model,
and a knowledge-graph LLM constructs structured
representations (e.g., a Mind Map) from the reasoning
chain. This approach offers two key advantages:
1. Minimizing Disruptions. The main reasoning model can maintain longer, more coherent reasoning without being distracted by auxiliary tasks or exceeding token limits.
2. Leveraging Specialization. Different LLMs excel at different tasks; for instance, DeepSeek-R1 specializes in reasoning, while Claude-Sonnet excels at coding. By assigning tasks to the models best suited for them, we achieve higher overall performance.
Agentic Test-time Scaling? For a single question,
we find that reasoning chains that utilize more tool
calls tend to yield better results. Across different
questions, however, those requiring excessive tool usage
often indicate inherent ambiguity or inaccuracy in
the initial reasoning. This insight can be leveraged
as a test-time reasoning verifier. By selecting the
reasoning chain with the highest tool usage, we
can extend best-of-N selection or beam search,
techniques commonly used in mathematical
and coding reasoning tasks because a verifier is
easy to build there, to open-domain, knowledge-intensive
Q&A, improving accuracy and robustness.
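The tool-usage heuristic amounts to a one-line selector over sampled reasoning chains. The sketch below assumes each sampled chain has already been reduced to its final answer and a count of tool calls; the `Candidate` type, the function name, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    answer: str
    tool_calls: int  # number of agent invocations in this reasoning chain


def best_of_n_by_tool_usage(candidates: Sequence[Candidate]) -> Candidate:
    """Pick the chain with the most tool calls (ties go to the earliest sample)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: c.tool_calls)


# Three samples for the SAME question (made-up values).
samples = [
    Candidate(answer="B", tool_calls=1),
    Candidate(answer="C", tool_calls=4),
    Candidate(answer="B", tool_calls=2),
]
chosen = best_of_n_by_tool_usage(samples)
print(chosen.answer)
```

Because the observation holds only within a single question, the selector should compare samples of the same question and not rank different questions against each other.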
Figure 2: Case study on a complex medical decision-
making problem.
3 Experiments
3.1 Solving Hard Problems
We evaluate our Agentic Reasoning model on the
GPQA dataset, a PhD-level multiple-choice science
QA benchmark. The dataset consists of expert-
authored questions spanning physics, chemistry,
and biology. Our primary experiments focus on
the high-quality Diamond Set, which contains 198
questions, while Table 2 presents results on the
broader Extended Set of 546 questions, allowing
for a direct comparison with human experts.
Table 1: Performance comparison on GPQA dataset
across Physics, Chemistry, and Biology.
Method                   Phy.   Chem.   Bio.
Direct Reasoning
Qwen2.5-32B              57.0   33.3    52.6
Qwen2.5-Coder-32B        37.2   25.8    57.9
QwQ-32B                  75.6   39.8    68.4
Qwen2.5-72B              57.0   37.6    68.4
Llama3.3-70B             54.7   31.2    52.6
GPT-4o†                  59.5   40.2    61.6
o1-preview†              89.4   59.9    65.9
Retrieve/Search in Reasoning
RAG-Qwen2.5-32B          57.0   37.6    52.6
RAG-QwQ-32B              76.7   38.7    73.7
RAgent-Qwen2.5-32B       58.1   33.3    63.2
RAgent-QwQ-32B           76.7   46.2    68.4
Search-o1                77.9   47.3    78.9
Agentic Reasoning
Ours                     88.1   58.3    79.6
As shown in Table 1, large
reasoning models such as DeepSeek-R1-Lite and
QwQ-32B-Preview significantly outperform traditional
instruction-tuned LLMs. This demonstrates
the effectiveness of chain-of-thought reasoning in
solving complex, expert-level problems. Additionally,
models like RAgent-QwQ-32B and Search-o1, which
autonomously retrieve relevant information during
reasoning, outperform non-reasoning models
that simply utilize search tools. This confirms
that calling tools is uniquely beneficial for enhanc-
ing reasoning accuracy.
Agentic Reasoning, which integrates external
agents during reasoning, further improves perfor-
mance over search-enhanced models. Our model
achieves superior results on the GPQA dataset,
demonstrating the power of tool-assisted reasoning
in tackling expert-level challenges.
To illustrate the effectiveness of Agentic Reasoning,
we also present a case study on a complex medical
decision-making problem, as shown in Figure 2.
The model autonomously executes code to compute
the optimal FiO2 (Fraction of Inspired Oxygen)
for a patient, performs a web search to retrieve
the most accurate PEEP (Positive End-Expiratory
Pressure) value, and synthesizes both results to
determine the best treatment plan. This example
highlights how integrating coding and web search
enhances the model’s ability to solve real-world
medical challenges.
We further compare our model with human experts
in physics, chemistry, and biology using the
GPQA Extended Set. As shown in Table 2, our
model surpasses the in-domain human experts in
physics (75.2 vs. 57.9) and biology (72.8 vs. 68.9)
and exceeds the out-of-domain experts in every
subset, although chemists remain ahead in
chemistry (72.6 vs. 53.1).
These results highlight the model’s ability to handle
specialized scientific reasoning tasks at an expert
level.
Table 2: Performance comparison with human experts
on the GPQA extended set.
Method               Phy.   Chem.   Bio.
Human Experts
Physicists           57.9   31.6    42.0
Chemists             34.5   72.6    45.6
Biologists           30.4   28.8    68.9
Reasoning Models
QwQ-32B              61.7   36.9    61.0
RAG-QwQ-32B          64.3   38.3    66.7
Search-o1            68.7   40.7    69.5
Agentic Reasoning    75.2   53.1    72.8
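The in-domain comparison in Table 2 can be checked with a few lines of arithmetic. The sketch below copies the accuracies from the table and computes the margin between Agentic Reasoning and the experts of each subject's own field; the variable names are our own. A positive margin means the model is ahead, a negative one means the in-domain experts are ahead.

```python
# Accuracy (%) copied from Table 2.
experts = {
    "Physicists": {"Phy.": 57.9, "Chem.": 31.6, "Bio.": 42.0},
    "Chemists":   {"Phy.": 34.5, "Chem.": 72.6, "Bio.": 45.6},
    "Biologists": {"Phy.": 30.4, "Chem.": 28.8, "Bio.": 68.9},
}
agentic = {"Phy.": 75.2, "Chem.": 53.1, "Bio.": 72.8}

# Which expert group is "in-domain" for each subject.
in_domain = {"Phy.": "Physicists", "Chem.": "Chemists", "Bio.": "Biologists"}

margins = {
    subject: round(agentic[subject] - experts[group][subject], 1)
    for subject, group in in_domain.items()
}
print(margins)

# Does the model beat every out-of-domain expert group on each subject?
beats_out_of_domain = all(
    agentic[subject] > scores[subject]
    for subject, own_group in in_domain.items()
    for group, scores in experts.items()
    if group != own_group
)
print(beats_out_of_domain)
```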
3.2 Deep Research
We conduct an evaluation of Agentic Reasoning for
deep research in open-ended Q&A tasks. A group
of PhD-level experts in finance, medicine, and law
were asked to formulate 15 to 30 professional re-
search questions closely related to their respective
fields. These questions were designed to require
at least 20 minutes of in-depth research to answer
comprehensively.
We assess the accuracy and reliability of re-
ports generated by our Agentic Reasoning model,
measuring the pass rate—the percentage of re-
sponses deemed satisfactory by domain experts.
We compare this pass rate against Gemini Deep Re-
search Service (experiments with OpenAI’s Deep
Research are ongoing). As shown in Figure 3,
Agentic Reasoning outperforms
Gemini Deep Research across all three domains,
demonstrating the effectiveness of structured rea-
soning and tool-augmented frameworks in conduct-
ing deep research.
3.3 Analysis
3.3.1 Test-time Scaling
In our deep research study, we find that increased
tool usage improves performance on the same ques-
tion. As shown in Figure 3, a higher number of
tool calls by the reasoning model correlates with
an increased pass rate in deep research tasks.
However, when comparing different questions, those
requiring excessive tool usage tend to be
inherently more challenging or ambiguous,
leading to lower accuracy: within the same field,
the questions with a higher number of tool calls
ultimately achieve a lower pass rate.
Figure 3: The more the model calls agentic tools, the better
it performs. The red line denotes Gemini Deep Research.
Such observations provide a practical approach
for test-time scaling. During inference-time search
(running the same question multiple times), we can
use the frequency of tool calls as a heuristic to se-
lect better responses. A simple implementation,
such as best-of-N selection, can effectively filter
out weaker outputs. This method even outperforms
LLM-as-a-judge evaluation, which is more compu-
tationally expensive, time-consuming, and prone to
instability.
These findings suggest a promising direction
for reinforcement learning of reasoning models in
knowledge-intensive fields. By leveraging agentic
tool usage as an implicit reward signal, we can fur-
ther optimize reasoning models for more effective
tool utilization, ultimately enhancing their problem-
solving capabilities.
3.3.2 The Role of Mind Map
Figure 4: A tricky question that misleads most LLMs is
correctly answered by our model.
We find that Mind Maps are particularly effec-
tive in clarifying complex logical relationships, en-
abling the model to solve problems that often mis-
lead traditional LLMs. We highlight two key cases
where Mind Mapping maximizes its capabilities:
First, Mind Maps help correctly answer tricky
logic-based questions that frequently fool LLMs. A
well-known example is a modified riddle: "The sur-
geon, who is the boy’s father, says ’I can’t operate
on this child, he’s my son!’ Who is the surgeon to
the boy?" DeepSeek-R1 took 17 seconds to process
this question but still produced the wrong answer, a
failure also observed in models from the GPT and
Gemini series. These models often fall back on
a politically correct answer contaminated by the training corpus,
failing to recognize the obvious logical structure.
However, in our Agentic Reasoning framework, the
use of a Mind Map allows the model to explicitly
analyze the logical relationships between the enti-
ties [surgeon], [boy], and [father], leading to the
correct answer.
Second, Mind Maps enhance deductive reason-
ing in strategic games. We test our approach in
Werewolf, a classic social deduction game where
players take on hidden roles as either villagers or
werewolves. Villagers attempt to identify the were-
wolves, while werewolves deceive the group and
eliminate players without being caught. The game
alternates between "night", where werewolves se-
cretly attack, and "day", where players debate and
vote on eliminations. To evaluate our Agentic
Reasoning model, we invited seven experienced
Werewolf players (5+ years of experience) to play
against it. The model achieved an impressive 72%
win rate, significantly exceeding both the expected
statistical win rate and the performance of human
players in our experiment.
We analyzed the Mind Maps generated by the
Agentic Reasoning model over multiple rounds of
play, as shown in Figure 5. These visual structures
helped the model track the relationships between
different players based on their spoken arguments,
allowing it to more accurately identify deception
strategies, anticipate voting behaviors, and opti-
mize its own disguise tactics. This result demon-
strates that Mind Mapping is not just a tool for
logic puzzles but also a powerful strategy enhancer
in dynamic reasoning environments.
Figure 5: Mind Maps from a game of Werewolf: the
first round and the second round.
4 Conclusion
We introduced Agentic Reasoning, a framework
that enhances LLM reasoning by integrating exter-
nal agents for structured memory (Mind Map), web
search, and computational analysis. This approach
improves logical coherence, factual accuracy, and
deep research capabilities. Our evaluations show
that Agentic Reasoning outperforms existing mod-
els on expert-level QA and real-world research
tasks, demonstrating its ability to synthesize knowl-
edge effectively. The structured use of external
tools enables more interpretable and verifiable rea-
soning, paving the way for AI systems capable of
expert-level problem-solving. Future work will ex-
plore extending this framework to multimodal data
and real-time adaptability, further advancing AI’s
ability to tackle complex, real-world challenges.
References
Darren Edge, Ha Trinh, Newman Cheng, Joshua
Bradley, Alex Chao, Apurva Mody, Steven Truitt,
and Jonathan Larson. 2024. From local to global: A
graph rag approach to query-focused summarization.
arXiv preprint arXiv:2404.16130.
Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu,
Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin,
Weizhe Yuan, and Pengfei Liu. 2024. O1 replication
journey–part 2: Surpassing o1-preview through simple
distillation, big progress or bitter lesson? arXiv
preprint arXiv:2411.16489.
Shayekh Bin Islam, Md Asib Rahman, KSM Hossain,
Enamul Hoque, Shafiq Joty, and Md Rizwan Parvez.
2024. Open-rag: Enhanced retrieval-augmented
reasoning with open-source large language models.
arXiv preprint arXiv:2410.01782.
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richard-
son, Ahmed El-Kishky, Aiden Low, Alec Helyar,
Aleksander Madry, Alex Beutel, Alex Carney, et al.
2024. Openai o1 system card. arXiv preprint
arXiv:2412.16720.
M Abdul Khaliq, P Chang, M Ma, Bernhard Pflugfelder,
and F Miletić. 2024. Ragar, your falsehood radar:
Rag-augmented reasoning for political fact-checking
using multimodal large language models. arXiv
preprint arXiv:2404.12065.
Aitor Lewkowycz, Anders Andreassen, David Dohan,
Ethan Dyer, Henryk Michalewski, Vinay Ramasesh,
Ambrose Slone, Cem Anil, Imanol Schlag, Theo
Gutman-Solo, et al. 2022. Solving quantitative rea-
soning problems with language models. Advances
in Neural Information Processing Systems, 35:3843–3857.
Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang,
Yujia Zhou, Yutao Zhu, Peitian Zhang, and
Zhicheng Dou. 2025. Search-o1: Agentic search-
enhanced large reasoning models. arXiv preprint
arXiv:2501.05366.
OpenAI. Learning to reason with LLMs.
Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie
Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector
Liu, Yuanzhi Li, et al. 2024. O1 replication journey:
A strategic progress report–part 1. arXiv preprint
arXiv:2410.18982.
Yijia Shao, Yucheng Jiang, Theodore A Kanell, Pe-
ter Xu, Omar Khattab, and Monica S Lam. 2024.
Assisting in writing wikipedia-like articles from
scratch with large language models. arXiv preprint
arXiv:2402.14207.
DeepSeek Team. 2024. Deepseek-r1-lite-preview is
now live: unleashing supercharged reasoning power.
Qwen Team. Qwq: Reflect deeply on the boundaries of
the unknown, November 2024. URL https://qwenlm.github.io/blog/qwq-32b-preview.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou,
et al. 2022. Chain-of-thought prompting elicits rea-
soning in large language models. Advances in neural
information processing systems, 35:24824–24837.
Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong
Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco
Pavone, Yuqiang Li, et al. 2024. Llama-berry: Pairwise
optimization for o1-like olympiad-level mathematical
reasoning. arXiv preprint arXiv:2410.02884.