
Paper 2502.04644

Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research

Authors: Junde Wu, Jiayuan Zhu, Yuyuan Liu

Published: 2025-02-07

Abstract:

We introduce Agentic Reasoning, a framework that enhances large language model (LLM) reasoning by integrating external tool-using agents. Unlike conventional LLM-based reasoning approaches, which rely solely on internal inference, Agentic Reasoning dynamically engages web search, code execution, and structured reasoning-context memory to solve complex problems requiring deep research and multi-step logical deduction. Our framework introduces the Mind Map agent, which constructs a structured knowledge graph to track logical relationships, improving deductive reasoning. Additionally, the integration of web-search and coding agents enables real-time retrieval and computational analysis, enhancing reasoning accuracy and decision-making. Evaluations on PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks demonstrate that our approach significantly outperforms existing models, including leading retrieval-augmented generation (RAG) systems and closed-source LLMs. Moreover, our results indicate that agentic reasoning improves expert-level knowledge synthesis, test-time scalability, and structured problem-solving. The code is at: https://github.com/theworldofagents/Agentic-Reasoning.

Paper Content:
Affiliation: University of Oxford

Status: technical report, marked by the authors as work in progress (arXiv:2502.04644v1 [cs.AI], 7 Feb 2025).

1 Introduction

Recently, large reasoning models such as OpenAI's o1 (Jaech et al., 2024), Qwen-QwQ (Qwen Team, 2024), and DeepSeek-R1 (DeepSeek Team, 2024) have demonstrated impressive stepwise reasoning capabilities over long sequences through large-scale reinforcement learning. These advancements provide promising solutions to complex reasoning tasks (Wei et al., 2022; Lewkowycz et al., 2022; OpenAI) and have inspired foundational efforts to replicate o1-like reasoning patterns across a broader range of models (Qin et al., 2024; Huang et al., 2024; Zhang et al., 2024).
DeepSeek-R1, for example, relies exclusively on rule-based outcome rewards during training, such as evaluating whether a mathematical solution is correct or a piece of code executes successfully. While this approach has yielded remarkable reasoning capabilities, equaling o1's performance in domains like math and code, it comes with notable trade-offs. As even the authors acknowledge, this type of training diminishes the model's ability to articulate its reasoning process. DeepSeek-R1's responses are often logical and accurate but lack detailed explanations of transitions between ideas or the finer connections between arguments.

Although current reasoning methods excel in structured domains like math and code, where outcomes are easily verifiable, applying these techniques to less structured or subjective tasks remains a significant challenge. Adapting these strategies to areas where answers are not inherently definitive is a key research gap. How can models be trained to handle tasks that require judgment, interpretation, or nuanced understanding rather than binary correctness?

Furthermore, not all problems benefit from formal reasoning approaches. Many fields, such as social sciences, ethics, or experiential disciplines, rely on abstract concepts, conventional wisdom, factual verification, understanding complex logical relationships, or moral reasoning. When models attempt to impose math- or coding-style reasoning onto such areas, they often produce flawed or overly rigid results. Developing approaches that account for these unique requirements is essential for advancing the applicability of reasoning models beyond their current domains.

Deep, thoughtful answers to open-ended questions often require extensive research, repeated verification, information retrieval, computational analysis, and the organization of complex logical relationships, all of which are steps fundamental to human reasoning.
In this process, humans rely heavily on external tools, such as internet searches for gathering information, computational tools for quantitative analysis, or whiteboards and mind maps for organizing thoughts. This raises an intriguing question: can large language models similarly leverage external tools to enhance their reasoning and tackle intensive knowledge work across diverse domains?

Previous efforts have attempted to integrate search or retrieval-augmented generation (RAG) into the reasoning process (Shao et al., 2024; Khaliq et al., 2024; Islam et al., 2024; Li et al., 2025), with notable examples including Gemini's Deep Research. However, these models are closed, and their exact methodologies remain undisclosed. In contrast, open-source models typically focus exclusively on retrieval or web search during reasoning, leaving a significant performance gap compared to their closed-source counterparts.

We introduce Agentic Reasoning, a framework that enhances the reasoning process by integrating external LLM-based agents as tools. This approach enables LLMs to perform multi-step reasoning and tackle complex problems more effectively by delegating specific tasks to these auxiliary agents. Through extensive experimentation with integrating various agents into the reasoning process, we identified three essential agents that prove highly effective for general reasoning across diverse problems. The web-search agent retrieves relevant information from the internet to supplement the model's knowledge. The code agent performs computational analyses and coding tasks to support quantitative reasoning. Finally, the memory agent, which we call Mind Map, constructs knowledge graphs based on the reasoning context, enabling the organization of complex logical relationships in a manner similar to human mind mapping. Together, these agents enhance the model's ability to tackle complex problems with greater efficiency and precision.
When integrated into current reasoning LLMs, Agentic Reasoning transforms their problem-solving capabilities by enabling them to plan and execute multi-step strategies autonomously. These models can identify and retrieve the necessary data, adapt dynamically to real-time information, and perform quantitative analyses to generate precise outcomes. This framework also allows LLMs to deliver comprehensive reports comparable to those of a research analyst or provide solutions on par with PhD-level expertise.

We evaluated our model on general knowledge-intensive benchmarks requiring complex reasoning capabilities, categorized into two key areas: (1) solving expert-level questions and (2) conducting deep research on real-world expert-level tasks. For expert-level questions, we tested the model on the GPQA dataset, a PhD-level science multiple-choice QA benchmark with questions authored by domain experts in physics, chemistry, and biology. Our Agentic Reasoning framework achieved accuracy rates of 58% in chemistry, 88% in physics, and 79% in biology, closely rivaling the best and newest closed reasoning model, OpenAI o1. For real-world expert-level tasks, Agentic Reasoning was evaluated by domain experts, who noted that it effectively automated several hours of challenging, manual investigation. This highlights its potential to streamline labor-intensive processes and enhance productivity in knowledge-intensive domains.

Additionally, we tested the model's scalability in test-time reasoning using the agentic framework as a verifier. The results showed significant improvements in test-time computational efficiency, demonstrating the framework's ability to optimize reasoning processes. This finding suggests that the agentic framework has strong potential to serve as a reward model for reinforcement learning, further advancing reasoning model training.
These results position Agentic Reasoning as a powerful and versatile framework, capable of tackling complex, domain-specific challenges with depth and precision. Its ability to perform in-depth research, navigate intricate logical structures, and synthesize information effectively highlights its potential for solving knowledge-intensive problems and driving advancements in deep analytical exploration.

2 Method

2.1 Preliminary

We consider an expert-level task that requires multi-step complex reasoning. During reasoning, the model can draw on external tool outputs and on a structured memory of its previous reasoning. Our objective is to generate, for each query $q$, both a logical reasoning chain $r$ and a final answer $a$. To achieve this, the reasoning model dynamically interacts with external tools $e$, which are generally web search and Python coding, and retrieves structured knowledge from an organized memory $k$ throughout the reasoning process.

Figure 1: The overall workflow of Agentic Reasoning.

Formally, we identify four primary inputs in the problem-solving pipeline: the task instruction $o$, defining the overarching task objective; the query $q$, a complex question requiring multi-step reasoning; the external tool outputs $e$, dynamically retrieved content from tools such as web search or coding; and the reasoning memory $k$, containing a structured knowledge graph.

The goal is to integrate $o$, $q$, $e$, and $k$ to generate a coherent reasoning chain $r$ and a final answer $a$. This process can be expressed as the mapping

$$(o, q, e, k) \mapsto (r, a).$$

We model the generation of $r$ and $a$ using the following joint probability formulation:

$$P(r, a \mid o, q, e, k) = \underbrace{\prod_{t=1}^{T_r} P(r_t \mid r_{<t}, o, q, e_{\le t}, k_{\le t})}_{\text{Reasoning Process}} \times \underbrace{\prod_{t=1}^{T_a} P(a_t \mid a_{<t}, r, o, q, e, k)}_{\text{Answer Generation}},$$

where $T_r$ and $T_a$ represent the lengths (in tokens) of the reasoning chain $r$ and the final answer $a$, respectively. Here, $r_t$ denotes the token at position $t$ in the reasoning sequence, with $r_{<t}$ representing all previous tokens.
The terms $e_{\le t}$ and $k_{\le t}$ indicate all tool-generated outputs and knowledge-graph information retrieved up to step $t$. Similarly, $a_t$ is the token at position $t$ in the final answer, and $a_{<t}$ represents all previously generated answer tokens.

2.2 Agentic Reasoning Pipeline

Our core idea is to enhance model reasoning by deploying external LLM-based agents during the reasoning process. The framework enables the reasoning LLM to interact with external information in an agentic way. During reasoning, the model can call external tools to help solve the problem and can use a structured memory, called Mind Map, to store its reasoning context. At its core, an agentic mechanism empowers the model to determine, in real time, when additional information is required. Whenever the model identifies that external information is needed, it proactively embeds specialized tokens into its reasoning tokens. These tokens can be broadly categorized as web-search tokens, coding tokens, and Mind Map calling tokens. Together with the token, the reasoning model also generates a precise query as a message to the external agent, based on the reasoning context developed so far.

Upon detecting such a token, the reasoning process temporarily halts to extract the query and its reasoning context. These are then dispatched to external agents, such as search engines or the Mind Map, to generate pertinent content. The agent considers both the message received and the reasoning context to ensure that it returns the most relevant results. These results are then reintegrated into the reasoning chain, allowing the model to continue its inference with updated and enriched knowledge.

This iterative retrieval-and-reasoning cycle continues as needed, enabling the model to dynamically refine its conclusions until it reaches a fully reasoned final answer.
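The halt, dispatch, and reintegrate cycle described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper does not specify the special tokens, so the `<search>`, `<code>`, and `<mindmap>` tag syntax, the `run_agentic_reasoning` function, and the stub model and agent are all hypothetical.

```python
import re
from typing import Callable, Dict, List

# Hypothetical tool-call syntax: <tool>query</tool>. The actual special
# tokens used by the framework are not given in the paper.
CALL_PATTERN = re.compile(r"<(search|code|mindmap)>(.*?)</\1>", re.DOTALL)

# An agent receives (query, reasoning_context) and returns result text.
Agent = Callable[[str, str], str]


def run_agentic_reasoning(
    generate: Callable[[str], str],
    agents: Dict[str, Agent],
    prompt: str,
    max_calls: int = 8,
) -> Dict[str, object]:
    """Alternate between model generation and agent calls."""
    context = prompt
    calls: List[str] = []
    for _ in range(max_calls):
        step = generate(context)
        match = CALL_PATTERN.search(step)
        if match is None:
            # No tool token: treat this step as the final reasoning and answer.
            return {"output": context + step, "tool_calls": calls}
        tool, query = match.group(1), match.group(2).strip()
        # Halt at the call and send the query plus context to the agent.
        reasoning_so_far = context + step[: match.end()]
        result = agents[tool](query, reasoning_so_far)
        calls.append(tool)
        # Reintegrate the agent's result, then continue reasoning.
        context = reasoning_so_far + f"\n[{tool} result] {result}\n"
    return {"output": context, "tool_calls": calls}


# Stub model and agent, for illustration only.
def fake_model(context: str) -> str:
    if "[search result]" not in context:
        return "I need a fact. <search>boiling point of water at sea level</search>"
    return "The answer is 100 C."


def fake_search(query: str, reasoning_context: str) -> str:
    return "100 degrees Celsius"


result = run_agentic_reasoning(
    fake_model, {"search": fake_search}, "Q: At what temperature does water boil?\n"
)
print(result["tool_calls"])
```

The `max_calls` cap is a design choice of this sketch: it keeps a model that never stops requesting tools from looping forever. The paper itself only says the cycle continues as needed.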
2.3 Mind Map Agent

We construct a Mind Map to store and structure the real-time reasoning context of the reasoning model. This Mind Map is built by transforming raw reasoning chains into a structured knowledge graph. Specifically, we use a graph-construction LLM to extract entities from the reasoning chain and identify semantic relationships between related entities, following a process similar to that used in GraphRAG (Edge et al., 2024).

The Mind Map serves two primary functions. First, it clusters the reasoning context into distinct groups and summarizes each theme. This is achieved by applying community clustering (Edge et al., 2024) on the knowledge graph and using an LLM to generate concise summaries for each group. Second, the knowledge graph can be queried with specific questions, such as "Who was Jason's maternal great-grandfather?" Using standard retrieval-augmented generation (RAG) on the knowledge graph (Edge et al., 2024), we retrieve and return relevant information.

These functions integrate the Mind Map into various aspects of the Agentic Reasoning process. It provides contextual reasoning support to external tools, enabling them to generate more context-aware responses (as discussed in later sections). Additionally, when the reasoning model is uncertain about its claims or loses track in an extended reasoning process, it can query the Mind Map for relevant information, treating it as an external tool, and continue reasoning based on the retrieved answer.

2.4 Web-search Agent

A search agent is invoked to retrieve the most relevant documents from the web. Rather than being incorporated in their raw form, the web pages are temporarily held for further processing. This ensures that only the most pertinent information is extracted and integrated into the main reasoning chain, maintaining coherence and relevance.
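The two Mind Map functions from Section 2.3, grouping and querying, can be illustrated with a toy graph. This sketch makes several assumptions. The paper uses a graph-construction LLM to extract entities and relations, whereas here the triples are written by hand. Connected components stand in for the community clustering of Edge et al. (2024). An exact entity lookup stands in for RAG over the graph. The `MindMap` class and its methods are hypothetical names, and the example triples are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


class MindMap:
    """Toy knowledge graph over a reasoning context (hypothetical API)."""

    def __init__(self) -> None:
        self.triples: List[Triple] = []
        self.neighbors: Dict[str, Set[str]] = defaultdict(set)

    def add_relation(self, head: str, relation: str, tail: str) -> None:
        """Store one relation and link both entities as neighbors."""
        self.triples.append((head, relation, tail))
        self.neighbors[head].add(tail)
        self.neighbors[tail].add(head)

    def clusters(self) -> List[Set[str]]:
        """Group entities by connected component.

        This is a simple stand-in for community clustering, not the
        algorithm used in GraphRAG.
        """
        seen: Set[str] = set()
        groups: List[Set[str]] = []
        for start in list(self.neighbors):
            if start in seen:
                continue
            stack: List[str] = [start]
            group: Set[str] = set()
            while stack:
                node = stack.pop()
                if node in group:
                    continue
                group.add(node)
                stack.extend(self.neighbors[node] - group)
            seen |= group
            groups.append(group)
        return groups

    def query(self, entity: str) -> List[Triple]:
        """Return every stored relation that mentions the entity."""
        return [t for t in self.triples if entity in (t[0], t[2])]


# Hand-written triples standing in for LLM-extracted ones.
mind_map = MindMap()
mind_map.add_relation("surgeon", "is father of", "boy")
mind_map.add_relation("surgeon", "refuses to operate on", "boy")
mind_map.add_relation("villager A", "accuses", "villager B")

facts = mind_map.query("surgeon")
groups = mind_map.clusters()
print(facts)
print(groups)
```

In the framework described by the paper, each group would then be summarized by an LLM, and the retrieved facts would be returned to the reasoning model or passed to another agent as context.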
Once the relevant web pages are retrieved by the search agent, we use an LLM to extract a concise, rephrased summary of the content most relevant to the ongoing reasoning context. This agent processes the web pages in the context of both the user query and the reasoning context, distilling key insights that are directly applicable to the problem at hand. The format and length of the summary adapt dynamically to the reasoning task. For factual queries such as "What is the population of the US in 2024?", the result would be a simple numerical answer. For exploratory reasoning, such as finding a new perspective on a topic, the search agent would provide a summarized but detailed and nuanced viewpoint. For hypothesis validation, such as assessing supporting evidence for an assumption, the result would include the degree of support or contradiction found in the retrieved web pages. This processed snippet is then integrated into the main reasoning process at the appropriate juncture, ensuring that external insights enhance rather than disrupt the logical flow.

2.5 Coding Agent

Instead of prompting the reasoning model to generate code directly, we find it more efficient to delegate coding tasks to a specialized coding LLM. The reasoning model sends the relevant context and query message to the coding LLM, which then writes the required code, executes it via a compiler, and returns the results. This approach ensures that the reasoning model remains focused on its core reasoning process without being disrupted by coding tasks, allowing for longer and more coherent reasoning chains. Specifically, we format the coding request as follows:

"Write code to perform <code message from reasoning model> given the context <reasoning context from Mind Map> to answer the query <user query>."

The coding LLM is instructed to always return its output in natural language, ensuring seamless integration with the reasoning model.
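The request format quoted above can be filled in with ordinary string formatting. The sketch below is illustrative only. The template wording comes from the paper, while the `build_coding_request` function, its parameter names, and the example values are assumptions introduced here.

```python
# Template wording taken from the coding request format in Section 2.5.
CODING_REQUEST_TEMPLATE = (
    "Write code to perform {code_message} "
    "given the context {reasoning_context} "
    "to answer the query {user_query}."
)


def build_coding_request(
    code_message: str, reasoning_context: str, user_query: str
) -> str:
    """Fill the three slots of the coding request (hypothetical helper)."""
    return CODING_REQUEST_TEMPLATE.format(
        code_message=code_message,
        reasoning_context=reasoning_context,
        user_query=user_query,
    )


# Made-up example values, for illustration only.
request = build_coding_request(
    code_message="a linear regression of y on x",
    reasoning_context="x and y are two lists of five measurements",
    user_query="What is the slope of the trend?",
)
print(request)
```

Only the template is processed for placeholders, so braces inside the slot values (for example, a code snippet in the context) pass through unchanged.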
2.6 Main Findings

Less is More

Unlike general agentic frameworks that provide models with a large selection of external tools, we find that just two, web search and coding, are sufficient for most tasks, even those requiring expert-level proficiency. Adding more tools can degrade performance by increasing the risk of inappropriate tool selection. Moreover, inaccuracies in external tool outputs can negatively impact the overall response quality. While additional tools are not significantly beneficial for language-based reasoning, they can be crucial for processing non-text modalities such as financial data, medical images, and genetic data. Developing specialized tools for different data modalities could further enhance LLM reasoning capabilities, and we will explore related results in future updates.

Delegating Tasks to LLM-Based Agents

Distributing computational workloads across multiple LLM-based agents improves efficiency. Instead of having the main reasoning model handle all tool-related tasks (e.g., writing code or constructing a knowledge graph), or calling non-LLM tools such as a bare search engine or code compiler, we delegate these tasks to specialized LLM-based agents. For example, a coding LLM generates code based on the query and context from the main reasoning model, and a knowledge-graph LLM constructs structured representations (e.g., a Mind Map) from the reasoning chain. This approach offers two key advantages:

1. Minimizing Disruptions. The main reasoning model can maintain longer, more coherent reasoning without being distracted by auxiliary tasks or exceeding token limits.
2. Leveraging Specialization. Different LLMs excel at different tasks; for instance, DeepSeek-R1 specializes in reasoning, while Claude-Sonnet excels at coding. By assigning tasks to the models best suited for them, we achieve higher overall performance.

Agentic Test-time Scaling?
For a single question, we find that reasoning chains that use more tool calls tend to yield better results. Across different questions, however, those requiring excessive tool usage often indicate inherent ambiguity or inaccuracy in the initial reasoning. This insight can be leveraged as a test-time reasoning verifier. By selecting the reasoning chain with the highest tool usage, we can bring best-of-N selection or beam search, techniques commonly used in mathematical and coding reasoning tasks where a verifier is easy to build, to open-domain, knowledge-intensive Q&A, improving accuracy and robustness.

Figure 2: Case study on a complex medical decision-making problem.

3 Experiments

3.1 Solving Hard Problems

We evaluate our Agentic Reasoning model on the GPQA dataset, a PhD-level multiple-choice science QA benchmark. The dataset consists of expert-authored questions spanning physics, chemistry, and biology. Our primary experiments focus on the high-quality Diamond Set, which contains 198 questions, while Table 2 presents results on the broader Extended Set of 546 questions, allowing for a direct comparison with human experts.

Table 1: Performance comparison on the GPQA dataset across Physics, Chemistry, and Biology.

| Category | Method | Phy. | Chem. | Bio. |
|---|---|---|---|---|
| Direct Reasoning | Qwen2.5-32B | 57.0 | 33.3 | 52.6 |
| Direct Reasoning | Qwen2.5-Coder-32B | 37.2 | 25.8 | 57.9 |
| Direct Reasoning | QwQ-32B | 75.6 | 39.8 | 68.4 |
| Direct Reasoning | Qwen2.5-72B | 57.0 | 37.6 | 68.4 |
| Direct Reasoning | Llama3.3-70B | 54.7 | 31.2 | 52.6 |
| Direct Reasoning | GPT-4o† | 59.5 | 40.2 | 61.6 |
| Direct Reasoning | o1-preview† | 89.4 | 59.9 | 65.9 |
| Retrieve/Search in Reasoning | RAG-Qwen2.5-32B | 57.0 | 37.6 | 52.6 |
| Retrieve/Search in Reasoning | RAG-QwQ-32B | 76.7 | 38.7 | 73.7 |
| Retrieve/Search in Reasoning | RAgent-Qwen2.5-32B | 58.1 | 33.3 | 63.2 |
| Retrieve/Search in Reasoning | RAgent-QwQ-32B | 76.7 | 46.2 | 68.4 |
| Retrieve/Search in Reasoning | Search-o1 | 77.9 | 47.3 | 78.9 |
| Agentic Reasoning | Ours | 88.1 | 58.3 | 79.6 |

As shown in Table 1, large reasoning models such as DeepSeek-R1-Lite and QwQ-32B-Preview significantly outperform traditional instruction-tuned LLMs. This demonstrates the effectiveness of chain-of-thought reasoning in solving complex, expert-level problems.
Additionally, models like RAgent-QwQ-32B and Search-o1, which autonomously retrieve relevant information during reasoning, outperform non-reasoning models that simply utilize search tools. This suggests that tool calls made during reasoning are particularly beneficial for reasoning accuracy.

Agentic Reasoning, which integrates external agents during reasoning, further improves performance over search-enhanced models. Our model achieves strong results on the GPQA dataset, demonstrating the power of tool-assisted reasoning in tackling expert-level challenges.

To illustrate the effectiveness of Agentic Reasoning, we also present a case study on a complex medical decision-making problem, as shown in Figure 2. The model autonomously executes code to compute the optimal FiO2 (Fraction of Inspired Oxygen) for a patient, performs a web search to retrieve the most accurate PEEP (Positive End-Expiratory Pressure) value, and synthesizes both results to determine the best treatment plan. This example highlights how integrating coding and web search enhances the model's ability to solve real-world medical challenges.

We further compare our model with human experts in physics, chemistry, and biology using the GPQA Extended Set. As shown in Table 2, our model outperforms the in-domain human experts on the physics and biology subsets (75.2 vs. 57.9 and 72.8 vs. 68.9) and exceeds the out-of-domain experts on chemistry, although chemists remain ahead on the chemistry subset (72.6 vs. 53.1). These results highlight the model's ability to handle specialized scientific reasoning tasks at an expert level.

Table 2: Performance comparison with human experts on the GPQA Extended Set.

| Category | Method | Phy. | Chem. | Bio. |
|---|---|---|---|---|
| Human Experts | Physicists | 57.9 | 31.6 | 42.0 |
| Human Experts | Chemists | 34.5 | 72.6 | 45.6 |
| Human Experts | Biologists | 30.4 | 28.8 | 68.9 |
| Reasoning Models | QwQ-32B | 61.7 | 36.9 | 61.0 |
| Reasoning Models | RAG-QwQ-32B | 64.3 | 38.3 | 66.7 |
| Reasoning Models | Search-o1 | 68.7 | 40.7 | 69.5 |
| Reasoning Models | Agentic Reasoning | 75.2 | 53.1 | 72.8 |

3.2 Deep Research

We conduct an evaluation of Agentic Reasoning for deep research in open-ended Q&A tasks.
A group of PhD-level experts in finance, medicine, and law were asked to formulate 15 to 30 professional research questions closely related to their respective fields. These questions were designed to require at least 20 minutes of in-depth research to answer comprehensively.

We assess the accuracy and reliability of reports generated by our Agentic Reasoning model by measuring the pass rate, the percentage of responses deemed satisfactory by domain experts. We compare this pass rate against the Gemini Deep Research service (experiments with OpenAI's Deep Research are ongoing). As shown in Figure 3, Agentic Reasoning outperforms Gemini Deep Research across all three domains, demonstrating the effectiveness of structured reasoning and tool-augmented frameworks in conducting deep research.

3.3 Analysis

3.3.1 Test-time Scaling

In our deep research study, we find that increased tool usage improves performance on the same question. As shown in Figure 3, a higher number of tool calls by the reasoning model correlates with an increased pass rate in deep research tasks. However, when comparing different questions, those requiring excessive tool usage tend to be inherently more challenging or ambiguous, leading to lower accuracy: within the same field, questions with a higher number of tool calls ultimately achieve a lower pass rate.

Figure 3: The more agentic tool calls, the better the model performs. The red line denotes Gemini Deep Research.

Such observations provide a practical approach for test-time scaling. During inference-time search (running the same question multiple times), we can use the frequency of tool calls as a heuristic to select better responses. A simple implementation, such as best-of-N selection, can effectively filter out weaker outputs. This method even outperforms LLM-as-a-judge evaluation, which is more computationally expensive, time-consuming, and prone to instability.
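The best-of-N heuristic described above amounts to a one-line selection rule. The following is a minimal sketch, assuming each run of the same question is recorded with its answer and its tool-call count. The `Candidate` class, the `best_of_n_by_tool_calls` function, and the example numbers are hypothetical, and ties are broken here by taking the first such run, a choice the paper does not discuss.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    """One run of the same question (hypothetical record type)."""

    answer: str
    tool_calls: int


def best_of_n_by_tool_calls(candidates: Sequence[Candidate]) -> Candidate:
    """Pick the run with the most tool calls; the first one wins ties."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: c.tool_calls)


# Three made-up runs of one question.
runs = [
    Candidate(answer="A", tool_calls=2),
    Candidate(answer="B", tool_calls=5),
    Candidate(answer="C", tool_calls=3),
]
chosen = best_of_n_by_tool_calls(runs)
print(chosen.answer)
```

As the text notes, the comparison is only meaningful among runs of the same question. Across different questions, a high tool-call count tends to signal a harder or more ambiguous question, not a better answer.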
These findings suggest a promising direction for reinforcement learning of reasoning models in knowledge-intensive fields. By leveraging agentic tool usage as an implicit reward signal, we can further optimize reasoning models for more effective tool utilization, ultimately enhancing their problem-solving capabilities.

3.3.2 The Role of Mind Map

Figure 4: A tricky question that misleads most LLMs is correctly answered by our framework.

We find that Mind Maps are particularly effective in clarifying complex logical relationships, enabling the model to solve problems that often mislead traditional LLMs. We highlight two key cases where the Mind Map is most useful.

First, Mind Maps help correctly answer tricky logic-based questions that frequently fool LLMs. A well-known example is a modified riddle: "The surgeon, who is the boy's father, says 'I can't operate on this child, he's my son!' Who is the surgeon to the boy?" DeepSeek-R1 took 17 seconds to process this question but still produced the wrong answer, a failure also observed in models from the GPT and Gemini series. These models often fall for a politically correct response contaminated by the training corpus, failing to recognize the obvious logical structure. However, in our Agentic Reasoning framework, the use of a Mind Map allows the model to explicitly analyze the logical relationships between the entities [surgeon], [boy], and [father], leading to the correct answer.

Second, Mind Maps enhance deductive reasoning in strategic games. We test our approach in Werewolf, a classic social deduction game where players take on hidden roles as either villagers or werewolves. Villagers attempt to identify the werewolves, while werewolves deceive the group and eliminate players without being caught. The game alternates between "night", when werewolves secretly attack, and "day", when players debate and vote on eliminations.
To evaluate our Agentic Reasoning model, we invited seven experienced Werewolf players (5+ years of experience) to play against it. The model achieved a 72% win rate, significantly exceeding both the expected statistical win rate and the performance of human players in our experiment.

We analyzed the Mind Maps generated by the Agentic Reasoning model over multiple rounds of play, as shown in Figure 5. These visual structures helped the model track the relationships between different players based on their spoken arguments, allowing it to more accurately identify deception strategies, anticipate voting behaviors, and optimize its own disguise tactics. This result demonstrates that Mind Mapping is not just a tool for logic puzzles but also a powerful strategy enhancer in dynamic reasoning environments.

Figure 5: Mind Map in playing the Werewolf game, first and second rounds.

4 Conclusion

We introduced Agentic Reasoning, a framework that enhances LLM reasoning by integrating external agents for structured memory (Mind Map), web search, and computational analysis. This approach improves logical coherence, factual accuracy, and deep research capabilities. Our evaluations show that Agentic Reasoning outperforms existing models on expert-level QA and real-world research tasks, demonstrating its ability to synthesize knowledge effectively. The structured use of external tools enables more interpretable and verifiable reasoning, paving the way for AI systems capable of expert-level problem-solving. Future work will explore extending this framework to multimodal data and real-time adaptability, further advancing AI's ability to tackle complex, real-world challenges.

References

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. 2024. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130.
Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, and Pengfei Liu. 2024. O1 replication journey–part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson? arXiv preprint arXiv:2411.16489.

Shayekh Bin Islam, Md Asib Rahman, KSM Hossain, Enamul Hoque, Shafiq Joty, and Md Rizwan Parvez. 2024. Open-RAG: Enhanced retrieval-augmented reasoning with open-source large language models. arXiv preprint arXiv:2410.01782.

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. 2024. OpenAI o1 system card. arXiv preprint arXiv:2412.16720.

M Abdul Khaliq, P Chang, M Ma, Bernhard Pflugfelder, and F Miletić. 2024. RAGAR, your falsehood radar: RAG-augmented reasoning for political fact-checking using multimodal large language models. arXiv preprint arXiv:2404.12065.

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. 2022. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857.

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. 2025. Search-o1: Agentic search-enhanced large reasoning models. arXiv preprint arXiv:2501.05366.

OpenAI. Learning to reason with LLMs.

Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, et al. 2024. O1 replication journey: A strategic progress report–part 1. arXiv preprint arXiv:2410.18982.

Yijia Shao, Yucheng Jiang, Theodore A Kanell, Peter Xu, Omar Khattab, and Monica S Lam. 2024. Assisting in writing Wikipedia-like articles from scratch with large language models. arXiv preprint arXiv:2402.14207.

DeepSeek Team. 2024. DeepSeek-R1-Lite-Preview is now live: unleashing supercharged reasoning power.

Qwen Team. QwQ: Reflect deeply on the boundaries of the unknown, November 2024. URL https://qwenlm.github.io/blog/qwq-32b-preview.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, et al. 2024. LLaMA-Berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning. arXiv preprint arXiv:2410.02884.

---