Page 1:
Search-R1
Search-R1: Training LLMs to Reason and Leverage Search
Engines with Reinforcement Learning
Bowen Jin1, Hansi Zeng2, Zhenrui Yue1, Dong Wang1, Hamed Zamani2, Jiawei Han1
1Department of Computer Science, University of Illinois at Urbana-Champaign
2Center for Intelligent Information Retrieval, University of Massachusetts Amherst
{bowenj4,zhenrui3,dwang24,hanj }@illinois.edu, {hzeng, zamani }@cs.umass.edu
Abstract
Efficiently acquiring external knowledge and up-to-date information is es-
sential for effective reasoning and text generation in large language models
(LLMs). Prompting advanced LLMs with reasoning capabilities during
inference to use search engines is not optimal, since the LLM does not
learn how to interact optimally with the search engine. This paper intro-
duces SEARCH -R1, an extension of the DeepSeek-R1 model where the LLM
learns—solely through reinforcement learning (RL)—to autonomously gen-
erate (multiple) search queries during step-by-step reasoning with real-time
retrieval. SEARCH -R1 optimizes LLM rollouts with multi-turn search in-
teractions, leveraging retrieved token masking for stable RL training and a
simple outcome-based reward function. Experiments on seven question-
answering datasets show that SEARCH -R1 improves performance by 26%
(Qwen2.5-7B), 21% (Qwen2.5-3B), and 10% (LLaMA3.2-3B) over strong
baselines. This paper further provides empirical insights into RL optimiza-
tion methods, LLM choices, and response length dynamics in retrieval-
augmented reasoning. The code and model checkpoints are available at
https://github.com/PeterGriffinJin/Search-R1 .
1 Introduction
In recent years, large language models (LLMs) have demonstrated remarkable capabilities
in natural language understanding and generation (Hendrycks et al., 2020; Clark et al., 2018).
Despite these achievements, LLMs often encounter challenges when tasked with complex
reasoning (Wei et al., 2022) and retrieving up-to-date information from external sources
(Jin et al., 2024). Addressing these limitations necessitates integrating advanced reasoning
abilities (Huang & Chang, 2022) and the capability to interact effectively with search engines
(Schick et al., 2023).
Existing approaches for integrating LLMs with search engines typically fall into two cate-
gories: (1) retrieval-augmented generation (RAG) (Gao et al., 2023; Lewis et al., 2020) and
(2) treating the search engine as a tool (Yao et al., 2023; Schick et al., 2023). RAG retrieves
relevant passages based on the input query and incorporates them into the LLM’s context
for generation (Lewis et al., 2020). This allows the LLM to leverage external knowledge
when answering questions. However, RAG is constrained by retrieval inaccuracy (Jin
et al., 2024) and multi-hop retrieval capability (Yang et al., 2018). While existing works
(Trivedi et al., 2022a) propose to conduct prompting for multi-turn, multi-query retrieval,
it is not optimal, since the LLM does not learn how to interact with the search engine
during training. Alternatively, LLMs can be prompted or trained to utilize tools, including
search engines, as part of their reasoning process (Qu et al., 2025; Trivedi et al., 2022a).
However, prompting-based approaches often struggle with generalization, as certain tasks
may not have been encountered during LLM pretraining. On the other hand, training-based
approaches provide greater adaptability but rely on large-scale, high-quality annotated
trajectories of search-and-reasoning interactions, making them difficult to scale effectively
(Schick et al., 2023).
1arXiv:2503.09516v2 [cs.CL] 19 Mar 2025
Page 2:
Search-R1
Reinforcement Learning (RL) (Sutton et al., 1999; Kaelbling et al., 1996) has emerged as a
potent paradigm for enhancing the reasoning capabilities of LLMs (Guo et al., 2025; Hou
et al., 2025; Xie et al., 2025; Kumar et al., 2024). Notably, models like OpenAI-o1 (Jaech
et al., 2024) and DeepSeek-R1 (Guo et al., 2025) have leveraged RL techniques ( e.g., PPO
(Schulman et al., 2017) and GRPO (Shao et al., 2024)) to improve logical inference and
problem-solving skills by learning from experience and feedback. After RL, even when
trained solely on the outcome rewards, the models learn complex reasoning capabilities,
including self-verification (Weng et al., 2022) and self-correction (Kumar et al., 2024).
However, applying reinforcement learning (RL) to search-and-reasoning scenarios presents
three key challenges: (1) RL Framework and Stability – It remains unclear how to effectively
integrate the search engine into the LLM RL framework while ensuring stable optimization,
particularly when incorporating retrieved context. (2) Multi-T urn Interleaved Reasoning
and Search – Ideally, the LLM should be capable of iterative reasoning and search engine
calls, dynamically adjusting its retrieval strategy based on the complexity of the problem. (3)
Reward Design – Designing an effective reward function for search-and-reasoning tasks is
nontrivial, as traditional reward formulations may not generalize well to this new paradigm.
To address these challenges, we introduce SEARCH -R1, a novel reinforcement learning (RL)
framework that enables LLMs to interact with search engines in an interleaved manner with
their own reasoning. Specifically, SEARCH -R1introduces the following key innovations: (1)
We model the search engine as part of the environment, enabling rollout sequences that inter-
leave LLM token generation with search engine retrievals. SEARCH -R1 is compatible with
various RL algorithms, including PPO and GRPO, and we apply retrieved token masking
to ensure stable optimization. (2) SEARCH -R1 supports multi-turn retrieval and reasoning,
where search calls are explicitly triggered by <search> and </search> tokens. Retrieved
content is enclosed within <information> and </information> tokens, while LLM reason-
ing steps are wrapped within <think> and </think> tokens. The final answer is formatted
using <answer> and </answer> tokens, allowing for structured, iterative decision-making.
(3) We adopt a straightforward outcome-based reward function, avoiding the complexity
of process-based rewards. Our results demonstrate that this minimal reward design is
effective in search-and-reasoning scenarios. SEARCH -R1 can be viewed as an extension
of DeepSeek-R1 (Guo et al., 2025), which primarily focuses on parametric reasoning by
introducing search-augmented RL training for enhanced retrieval-driven decision-making.
In summary, our key contributions are threefold:
• We identify the challenges of applying RL to LLM reasoning with search engine calling.
•We propose SEARCH -R1, a novel reinforcement learning framework that supports LLM
rollout and RL optimization with a search engine, including retrieved token masking to
stabilize RL training, multi-turn interleaved reasoning and search to support complex
task-solving and a simple yet effective outcome reward function.
•We conduct systematic experiments to demonstrate the effectiveness of SEARCH -R1with
26%, 21%, and 10% average relative improvement with three LLMs over strong baselines.
In addition, we provide insights on RL for reasoning and search settings, including RL
methods selection, different LLM choices and response length study.
2 Related Works
2.1 Large Language Models and Retrieval
Although large language models (LLMs) (Zhao et al., 2023; Team, 2024; Achiam et al., 2023)
have demonstrated remarkable reasoning (Guo et al., 2025) and coding (Guo et al., 2024)
capabilities, they still lack domain-specific knowledge (Peng et al., 2023; Li et al., 2023) and
are prone to hallucinations (Zhang et al., 2023). To address these limitations, search engines
(Zhao et al., 2024) are widely used to provide external information. There are two primary
ways to integrate search engines with LLMs: (1) retrieval-augmented generation (RAG)
(Gao et al., 2023) and (2) treating the search engine as a tool (Schick et al., 2023). RAG (Lewis
et al., 2020; Yue et al., 2024; Xiong et al., 2025) typically follows a one-round retrieval and
2
Page 3:
Search-R1
𝑞 𝑜Reward
ModelValue
LLM
Reference
LLM𝑟𝑣
𝐴 GAE
𝑞Reward
Model
Reference
LLMGroup
Computation𝑜1
𝑜2
𝑜𝐺…𝑟1
𝑟2
𝑟𝐺…𝐴1
𝐴2
𝐴𝐺…
Policy
LLMSearch
EngineRollout ModulePolicy
LLMSearch
EngineRollout Module
⨁
KLTrained
Models
Frozen
Model
Search
EnginePPO w. Search Engine
GRPO w. Search Engine
Figure 1: Demonstration of PPO and GRPO training with the search engine (S EARCH -R1).
sequential generation pipeline, where a search engine fetches relevant information based on
the input query, which is then concatenated with the query and fed into the LLM. However,
this pipeline struggles with issues such as retrieving irrelevant information (Jin et al., 2024)
and failing to provide sufficiently useful context (Jiang et al., 2023). An alternative approach
is search-as-a-tool, where LLMs are prompted or fine-tuned to interact with the search
engine. IRCoT (Trivedi et al., 2022a) and ReAct (Yao et al., 2023) use prompting to guide
iterative reasoning and search engine calls, while Toolformer (Schick et al., 2023) leverages
supervised fine-tuning to enhance search capabilities. However, these methods rely on
high-quality labeled trajectories, which are difficult to scale. Recent work (Guo et al., 2025)
suggests that reinforcement learning can enable LLMs to develop advanced reasoning skills
using only outcome rewards, yet its potential in search engine calling scenarios remains
under-explored.
2.2 Large Language Models and Reinforcement Learning
Reinforcement learning (RL) (Kaelbling et al., 1996) is a learning paradigm where an agent
learns to make sequential decisions by interacting with an environment and receiving feed-
back in the form of rewards, aiming to maximize cumulative reward over time (Sutton et al.,
1999). RL was introduced to LLM tuning by Ouyang et al. (2022) through reinforcement
learning from human feedback (RLHF) (Kaufmann et al., 2023). This approach first trains
a reward model using human preference data (Lambert et al., 2024), which then guides
RL-based tuning of the policy LLM, typically via the Proximal Policy Optimization (PPO)
algorithm. However, PPO involves multiple rounds of LLM optimization, making it chal-
lenging to implement. To simplify RL-based tuning, direct optimization methods such
as Direct Preference Optimization (DPO) (Rafailov et al., 2023) and SimPO (Meng et al.,
2024) have been proposed. While these methods offer computational efficiency, they suffer
from off-policy issues (Pang et al., 2024) and do not consistently match the performance
of pure RL approaches. Alternative solutions include Group Relative Policy Optimization
(GRPO) (Shao et al., 2024), which eliminates the need for a critic model by estimating base-
lines from group scores, and RLOO (Ahmadian et al., 2024), which introduces a simplified
REINFORCE-style (Williams, 1992) optimization framework. Despite these advances, the
application of RL to LLM-driven search engine interactions and reasoning remains largely
unexplored.
3 Search-R1
In the following sections, we present the detailed design of SEARCH -R1, covering (1)
reinforcement learning with a search engine; (2) text generation with an interleaved multi-
turn search engine call; (3) the training template; and (4) reward model design.
3
Page 4:
Search-R1
3.1 Reinforcement Learning with a Search Engine
We formulate the reinforcement learning framework with a search engine Ras follows:
max
πθEx∼D,y∼πθ(·|x;R)
rϕ(x,y)−βDKL[πθ(y|x;R)||πref(y|x;R)],
where πθis the policy LLM, πrefis the reference LLM, rϕis the reward function and DKLis
the KL-divergence. Unlike prior LLM reinforcement learning methods that primarily rely
on the policy LLM πθ(· |x)to generate rollout sequences (Rafailov et al., 2023; Ouyang
et al., 2022), our framework explicitly incorporates retrieval interleaved reasoning via
πθ(· |x;R), which can be seen as πθ(· |x)NR, whereNdenotes interleaved retrieval-
and-reasoning. This enables more effective decision-making in reasoning-intensive tasks
that require external information retrieval. A detailed illustration of the rollout process is
provided in Section 3.2.
Our approach builds upon two well-established policy gradient RL methods: Proximal
Policy Optimization (PPO) (Schulman et al., 2017) and Group Relative Policy Optimization
(GRPO) (Shao et al., 2024; Guo et al., 2025), leveraging their respective advantages to
optimize retrieval-augmented reasoning.
Loss Masking for Retrieved Tokens. In both PPO and GRPO, the token-level loss is
computed over the entire rollout sequence. In SEARCH -R1, the rollout sequence consists of
both LLM-generated tokens and retrieved tokens from external passages. While optimizing
LLM-generated tokens enhances the model’s ability to interact with the search engine
and perform reasoning, applying the same optimization to retrieved tokens can lead to
unintended learning dynamics. To address this, we introduce loss masking for retrieved
tokens, ensuring that the policy gradient objective is computed only over LLM-generated
tokens, while excluding retrieved content from the optimization process. This approach
stabilizes training while preserving the flexibility of search-augmented generation.
PPO + Search Engine. Proximal Policy Optimization (PPO) (Schulman et al., 2017) is a
popular actor-critic reinforcement learning algorithm commonly used for fine-tuning large
language models (LLMs) during the RL stage (Ouyang et al., 2022). In our reasoning plus
search engine calling scenario, it optimizes LLMs by maximizing the following objective:
JPPO(θ) =Ex∼D,y∼πold(·|x;R)
1
∑|y|
t=1I(yt)|y|
∑
t=1:I(yt)=1minπθ(yt|x,y<t;R)
πold(yt|x,y<t;R)At, clipπθ(yt|x,y<t;R)
πold(yt|x,y<t;R), 1−ϵ, 1+ϵ
At
, (1)
where πθand πrefrepresent the current and reference policy models, respectively. The
variable xdenotes input samples drawn from the dataset D, while yrepresents the model’s
generated outputs interleaved with search engine calling results, sampled from the reference
policy πref(y|x;R)and retrieved from the search engine R.I(yt)is the token loss masking
operation. I(yt) =1 ifytis a LLM generated token and I(yt) =0 ifytis a retrieved token.
The term ϵis a clipping-related hyperparameter introduced in PPO to stabilize training.
The advantage estimate Atis computed using Generalized Advantage Estimation (GAE)
(Schulman et al., 2015), based on future rewards {r≥t}and a learned value function Vϕ.
GRPO + Search Engine. To improve policy optimization stability and avoid the need for
an additional value function approximation, Group Relative Policy Optimization (GRPO) is
introduced in Shao et al. (2024). GRPO differs from Proximal Policy Optimization (PPO)
by leveraging the average reward of multiple sampled outputs as a baseline rather than
relying on a learned value function. Specifically, for each input question x, GRPO samples a
group of responses {y1,y2,. . .,yG}from the reference policy πref. The policy model is then
optimized by maximizing the following objective function:
4
Page 5:
Search-R1
JGRPO(θ) =Ex∼D,{yi}G
i=1∼πold(·|x;R)"
1
GG
∑
i=11
∑|yi|
t=1I(yi,t)|yi|
∑
t=1:I(yi,t)=1min
πθ(yi,t|x,yi,<t;R)
πold(yi,t|x,yi,<t;R)ˆAi,t,
clip
πθ(yi,t|x,yi,<t;R)
πold(yi,t|x,yi,<t;R), 1−ϵ, 1+ϵ!
ˆAi,t!
−βDKL[πθ||πref]#
, (2)
where ϵandβare hyperparameters, and ˆAi,trepresents the advantage, which is computed
based on the relative rewards of outputs within each group. This approach avoids intro-
ducing additional complexity in the computation of ˆAi,t.I(yi,t)is the token loss masking
operation. I(yi,t) = 1 ifyi,tis a LLM generated token and I(yi,t) = 0 ifyi,tis a retrieved
token. Additionally, instead of incorporating KL divergence as a penalty within the reward
function, GRPO regularizes by directly adding the KL divergence between the trained policy
and the reference policy to the loss function. The retrieved token masking is also applied
when calculating the KL divergence loss DKL.
3.2 Text Generation with Interleaved Multi-turn Search Engine Call
In this section, we describe the rollout process for LLM response generation with interleaved
multi-turn search engine calls, formulated as: y∼πθ(· |x;R) =πθ(· |x)NR.
Our approach follows an iterative framework where the LLM alternates between text
generation and external search engine queries. Specifically, the system instruction guides
the LLM to encapsulate its search query between two designated search call tokens, <search>
and </search> , whenever an external retrieval is needed. Upon detecting these tokens in
the generated sequence, the system extracts the search query, queries the search engine, and
retrieves relevant results. The retrieved information is then enclosed within special retrieval
tokens, <information> and </information> , and appended to the ongoing rollout sequence,
serving as additional context for the next generation step. This process continues iteratively
until one of the following conditions is met: (1) the search engine call budget is exhausted,
or (2) the model generates a final response, which is enclosed between designated answer
tokens, <answer> and </answer> . The complete workflow is outlined in Algorithm 1.
3.3 Training Template
To train SEARCH -R1, we start by crafting a simple template that directs the initial LLM
to follow our predefined instructions. As shown in Table 1, this template structures the
model’s output into three parts in an iterative fashion: first, a reasoning process, then a
search engine calling function, and finally, the answer. We deliberately limit our constraints
to this structural format, avoiding any content-specific biases, such as enforcing reflective
reasoning and search engine calling or endorsing specific problem-solving approaches.
This ensures that the model’s natural learning dynamics during the RL process remain
observable and unbiased.
Answer the given question. You must conduct reasoning inside <think> and</think>
first every time you get new information. After reasoning, if you find you lack some
knowledge, you can call a search engine by <search> query </search> , and it will
return the top searched results between <information> and </information> . You
can search as many times as you want. If you find no further external knowledge
needed, you can directly provide the answer inside <answer> and </answer> without
detailed illustrations. For example, <answer> Beijing </answer> . Question: question.
Table 1: Template for SEARCH -R1. question will be replaced with the specific question
during training and inference.
5
Page 6:
Search-R1
Algorithm 1 LLM Response Rollout with Multi-Turn Search Engine Calls
Require: Input query x, policy model πθ, search engine R, maximum search budget B.
Ensure: Final response y.
1:Initialize rollout sequence y←∅
2:Initialize search call count b←0
3:while b<Bdo
4: Generate response token yt∼πθ(· |x,y)
5: // Append ytto rollout sequence y
6: y←y+yt
7: if<search> </search> detected in ytthen
8: // Extract search query q
9: q←Parse (yt,<search> ,</search> )
10: // Retrieve search results
11: d=R(q)
12: // Insert dinto y
13: y←y+<information> d</information>
14: Increment search call count b←b+1
15: end if
16: if<answer> </answer> detected in ythen
17: // Terminate rollout
18: return final generated response y
19: end if
20:end while
21:return final generated response y
3.4 Reward Modeling
The reward function serves as the primary training signal, guiding the optimization process
in reinforcement learning. To train SEARCH -R1, we adopt a rule-based reward system
that consists solely of final outcome rewards , which assess the correctness of the model’s
response. For instance, in factual reasoning tasks, correctness can be evaluated using
rule-based criteria such as exact string matching.
rϕ(x,y) =EM(apred,agold), (3)
where apred is the extracted final answer from response yand agold is the ground truth
answer. Unlike Guo et al. (2025), we do not incorporate format rewards, as our learned
model already demonstrates strong structural adherence. We leave the exploration of more
complex format rewards for future work. Furthermore, we deliberately avoid training
neural reward models for either outcome or process evaluation, following Guo et al. (2025).
This decision is motivated by the susceptibility of neural reward models to reward hacking
in large-scale reinforcement learning, as well as the additional computational cost and
complexity introduced by retraining these models.
4 Main results
4.1 Datasets
We evaluate S EARCH -R1 on seven benchmark datasets, categorized as follows:
General Question Answering : NQ (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017),
and PopQA (Mallen et al., 2022).
Multi-Hop Question Answering : HotpotQA (Yang et al., 2018), 2WikiMultiHopQA (Ho
et al., 2020), Musique (Trivedi et al., 2022b), and Bamboogle (Press et al., 2022).
These datasets encompass a diverse range of search with reasoning challenges, enabling a
comprehensive evaluation of SEARCH -R1 across both single-turn and multi-hop retrieval
scenarios.
6
Page 7:
Search-R1
4.2 Baselines
To evaluate the effectiveness of SEARCH -R1, we compare it against the following baseline
methods:
Inference without Retrieval : Direct inference and Chain-of-Thought (CoT) reasoning (Wei
et al., 2022).
Inference with Retrieval : Retrieval-Augmented Generation (RAG) (Lewis et al., 2020),
IRCoT (Trivedi et al., 2022a), and Search-o1 (Li et al., 2025).
Fine-T uning-Based Methods : Supervised fine-tuning (SFT) (Chung et al., 2024) and rein-
forcement learning-based fine-tuning without a search engine (R1) (Guo et al., 2025). For
R1, we train the LLMs with the RL methods proposed in Guo et al. (2025) with our data to
have a fair comparison with SEARCH -R1. It only contains reasoning and answer steps and
cannot call a search engine.
These baselines cover a broad spectrum of retrieval-augmented and fine-tuning approaches,
allowing for a comprehensive assessment of SEARCH -R1 in both zero-shot and learned
retrieval settings.
To make a fair comparison between different methods, we use the same retriever, knowledge
corpus, training data and LLMs. More details can be found in Section 4.3.
4.3 Experimental Setup
We conduct experiments using three types of models: Qwen-2.5-3B (Base/Instruct) and
Qwen-2.5-7B (Base/Instruct) (Yang et al., 2024), as well as Llama-3.2-3B (Base/Instruct)
(Dubey et al., 2024). For retrieval, we use the 2018 Wikipedia dump (Karpukhin et al.,
2020) as the knowledge source and E5 (Wang et al., 2022) as the retriever. To ensure fair
comparison, we follow Lin et al. (2023) and set the number of retrieved passages to three
across all retrieval-based methods.
For training, we merge the training sets of NQ and HotpotQA to form a unified dataset for
SEARCH -R1 and other fine-tuning-based baselines. Evaluation is conducted on the test or
validation sets of all seven datasets to assess both in-domain and out-of-domain performance.
Exact Match (EM) is used as the evaluation metric, following Yu et al. (2024). For inference-
style baselines, we use instruct models, as base models fail to follow instructions. For RL
tuning methods, experiments are conducted on both base and instruct models.
ForSEARCH -R1 training, in PPO Training, the policy LLM learning rate is set to 1e-6, and
value LLM learning rate to 1e-5. The Generalized Advantage Estimation (GAE) parameters
areλ=1 and γ=1. In GRPO Training, the policy LLM learning rate is set to 1e-6, with five
sampled responses per prompt. We use exact match (EM) to calculate the outcome reward.
Unless stated otherwise, PPO is used as the default RL method , and a detailed comparison
between PPO and GRPO is provided in Section 5.1.
4.4 Performance
The main results comparing SEARCH -R1 with baseline methods across the seven datasets
are presented in Table 2. From the results, we make the following key observations:
SEARCH -R1 consistently outperforms strong baseline methods. We achieve 26%, 21%,
and 10% average relative improvement with Qwen2.5-7B, Qwen2.5-3B, and LLaMA3.2-
3B, respectively. These gains hold across both in-distribution evaluation ( i.e., NQ and
HotpotQA) and out-of-distribution evaluation ( i.e., TriviaQA, PopQA, 2WikiMultiHopQA,
Musique, and Bamboogle).
SEARCH -R1 surpasses RL-based training for LLM reasoning without retrieval (R1) (Guo
et al., 2025). This aligns with expectations, as incorporating search into LLM reasoning
provides access to relevant external knowledge, improving overall performance.
7
Page 8:
Search-R1
Table 2: Main results. The best performance is set in bold, and the second best is set in
underline.
Method NQ TriviaQA PopQA HotpotQA 2wiki Musique Bamboogle Avg.
Qwen2.5-7b-Base/Instruct
Direct Inference 0.134 0.408 0.140 0.183 0.250 0.031 0.120 0.181
CoT 0.048 0.185 0.054 0.092 0.111 0.022 0.232 0.106
IRCoT 0.224 0.478 0.301 0.133 0.149 0.072 0.224 0.239
Search-o1 0.151 0.443 0.131 0.187 0.176 0.058 0.296 0.206
RAG 0.349 0.585 0.392 0.299 0.235 0.058 0.208 0.304
SFT 0.318 0.354 0.121 0.217 0.259 0.066 0.112 0.207
R1-base 0.297 0.539 0.202 0.242 0.273 0.083 0.296 0.276
R1-instruct 0.270 0.537 0.199 0.237 0.292 0.072 0.293 0.271
Search-R1-base 0.412 0.568 0.428 0.356 0.322 0.142 0.384 0.373
Search-R1-instruct 0.397 0.606 0.404 0.380 0.326 0.168 0.408 0.384
Qwen2.5-3b-Base/Instruct
Direct Inference 0.106 0.288 0.108 0.149 0.244 0.020 0.024 0.134
CoT 0.023 0.032 0.005 0.021 0.021 0.002 0.000 0.015
IRCoT 0.111 0.312 0.200 0.164 0.171 0.067 0.240 0.181
Search-o1 0.238 0.472 0.262 0.221 0.218 0.054 0.320 0.255
RAG 0.348 0.544 0.387 0.255 0.226 0.047 0.080 0.270
SFT 0.249 0.292 0.104 0.186 0.248 0.044 0.112 0.176
R1-base 0.226 0.455 0.173 0.201 0.268 0.055 0.224 0.229
R1-instruct 0.210 0.449 0.171 0.208 0.275 0.060 0.192 0.224
Search-R1-base 0.341 0.513 0.362 0.263 0.273 0.076 0.211 0.292
Search-R1-instruct 0.323 0.537 0.364 0.308 0.336 0.105 0.315 0.327
LLaMA3.2-3b-Base/Instruct
Direct Inference 0.139 0.368 0.124 0.122 0.107 0.015 0.064 0.134
CoT 0.246 0.487 0.166 0.051 0.083 0.006 0.024 0.152
IRCoT 0.363 0.566 0.428 0.238 0.236 0.072 0.208 0.301
Search-o1 0.107 0.203 0.093 0.132 0.117 0.035 0.176 0.123
RAG 0.317 0.551 0.337 0.234 0.118 0.034 0.064 0.237
SFT 0.320 0.341 0.122 0.206 0.257 0.064 0.120 0.204
R1-base 0.290 0.514 0.237 0.234 0.279 0.055 0.146 0.251
R1-instruct 0.384 0.549 0.228 0.238 0.269 0.074 0.315 0.294
Search-R1-base 0.394 0.596 0.437 0.280 0.264 0.056 0.105 0.305
Search-R1-instruct 0.357 0.578 0.378 0.314 0.233 0.090 0.306 0.322
SEARCH -R1 is effective for both base and instruction-tuned models. This demonstrates
that DeepSeek-R1-Zero-style RL with outcome-based rewards (Guo et al., 2025) can be
successfully applied to reasoning with search, extending beyond its previously established
effectiveness in pure reasoning scenarios.
SEARCH -R1 generalizes across different base LLMs, including Qwen2.5 and LLaMA3.2.
This contrasts with findings in RL for mathematical reasoning, where RL has been observed
to work effectively only for certain base LLMs (Zeng et al., 2025). Our results indicate that
search-augmented RL is more broadly applicable across model families.
5 Analysis
5.1 Different RL methods: PPO vs. GRPO
We evaluate SEARCH -R1 using both PPO and GRPO as the base RL method, conducting
experiments on LLaMA3.2-3B and Qwen2.5-3B models. The training dynamics comparison
is presented in Figure 2, revealing the following insights:
GRPO converges faster than PPO across all cases. This is because PPO relies on a critic
model, which requires several warm-up steps before effective training begins.
8
Page 9:
Search-R1
0255075100125150175200
Step0.00.10.20.30.4Train Reward
PPO
GRPO
(a) LLaMA3.2-3b-base
0255075100125150175200
Step0.150.200.250.300.350.40Train Reward
PPO
GRPO (b) LLaMA3.2-3b-it
0255075100125150175200
Step0.050.100.150.200.250.300.350.40Train Reward
PPO
GRPO (c) Qwen2.5-3b-base
0255075100125150175200
Step0.200.250.300.350.400.45Train Reward
PPO
GRPO (d) Qwen2.5-3b-it
Figure 2: Training dynamics of SEARCH -R1 with PPO and GRPO as the base RL method
across four LLMs. GRPO generally converges faster but may exhibit instability in certain
cases ( e.g., LLaMA3.2-3B-Instruct), whereas PPO provides more stable optimization but
converges at a slower rate.
Table 3: The performance results of S EARCH -R1 with PPO and GRPO on seven datasets.
Method NQ TriviaQA PopQA HotpotQA 2wiki Musique Bamboogle Avg.
Qwen2.5-3b-Base/Instruct
SEARCH -R1-base (GRPO) 0.396 0.582 0.390 0.283 0.266 0.054 0.113 0.298
SEARCH -R1-instruct (GRPO) 0.409 0.552 0.405 0.345 0.369 0.154 0.320 0.365
SEARCH -R1-base (PPO) 0.341 0.513 0.362 0.263 0.273 0.076 0.211 0.292
SEARCH -R1-instruct (PPO) 0.323 0.537 0.364 0.308 0.336 0.105 0.315 0.327
LLaMA3.2-3b-Base/Instruct
SEARCH -R1-base (GRPO) 0.431 0.612 0.458 0.300 0.297 0.067 0.104 0.324
SEARCH -R1-instruct (GRPO) 0.333 0.524 0.329 0.229 0.190 0.047 0.192 0.263
SEARCH -R1-base (PPO) 0.394 0.596 0.437 0.280 0.264 0.056 0.105 0.305
SEARCH -R1-instruct (PPO) 0.357 0.578 0.378 0.314 0.233 0.090 0.306 0.322
PPO demonstrates greater training stability. As shown in Figure 2(b), GRPO leads to
reward collapse when applied to the LLaMA3.2-3B-Instruct model, whereas PPO remains
stable across different LLM architectures.
The final training rewards of PPO and GRPO are comparable. Despite differences in con-
vergence speed and stability, both methods achieve similar final reward values, indicating
that both are viable for optimizing S EARCH -R1.
The evaluation results are presented in Table 3, revealing the following key findings:
GRPO generally outperforms PPO. Across both Qwen2.5-3B and LLaMA3.2-3B, GRPO
achieves higher average performance, demonstrating its effectiveness in optimizing retrieval-
augmented reasoning.
Instruct variants perform better than base variants. For Qwen2.5-3B, SEARCH -R1-Instruct
(GRPO) achieves the highest overall average score (0.365), outperforming all other configu-
rations. For LLaMA3.2-3B, the best-performing variant is SEARCH -R1-Base (GRPO) with an
average score of 0.324, followed closely by S EARCH -R1-Instruct (PPO) at 0.322.
5.2 Base vs. Instruct LLMs
We analyze the training dynamics of SEARCH -R1 across both base LLMs and instruction-
tuned LLMs. Experiments are conducted on three model variants: LLaMA3.2-3B, Qwen2.5-
3B, and Qwen2.5-7B. As shown in Figure 3, we observe that instruction-tuned models
converge faster and start from a higher initial performance compared to base models.
However, the final performance of both model types remains highly similar after training.
This finding suggests that while general post-training accelerates learning in reasoning-plus-
search scenarios, reinforcement learning can effectively bridge the gap over time, enabling
base models to achieve comparable performance.
9
Page 10:
Search-R1
0255075100125150175200
Step0.00.10.20.30.4Train Reward
Base
Instruct
(a) LLaMA3.2-3b-base/instruct
0255075100125150175200
Step0.050.100.150.200.250.300.350.40Train Reward
Base
Instruct (b) Qwen2.5-3b-base/instruct
0255075100125150175200
Step0.150.200.250.300.350.400.450.50Train Reward
Base
Instruct (c) Qwen2.5-7b-base/instruct
Figure 3: Study of SEARCH -R1on base and instruct LLMs. The instruction model converges
faster and starts from a better initial performance. However, the final performance of both
models is very similar.
0 50 100 150 200
Step200400600800Response Length
Response Length
0.00.10.20.30.4
Train Reward
Train Reward
(a) Response length study
0255075100125150175200
Step0.00.10.20.30.4Train Reward
w. mask
w.o. mask (b) Loss mask dynamics study
Figure 4: (a) Response Length Study: The response length exhibits a decrease-increase-
stabilize trend throughout training, aligning with the overall performance trajectory of the
LLM. (b) Retrieved Token Loss Masking Study: Implementing retrieved token masking leads
to greater LLM improvements, mitigating unintended optimization effects and ensuring
more stable training dynamics.
5.3 Response Length Study
We conduct an experiment using SEARCH -R1 with the LLaMA3.2-3b-base model, training
on NQ to analyze the dynamics of training reward and response length over the course of
training. The results are presented in Figure 4(a), revealing the following key trends:
(1) Early Stage (First 100 Steps) : The response length sharply decreases, while the training
reward exhibits a slight increase. During this phase, the base model learns to eliminate
excessive filler words and begins adapting to the task requirements.
(2) Mid Stage (100–130 Steps) : Both response length and training reward increase signifi-
cantly. At this point, the LLM learns to call the search engine, resulting in longer responses
due to retrieved passages. The training reward improves substantially, as the model becomes
more effective at leveraging search results.
(3) Late Stage (After 130 Steps) : The response length stabilizes, and the training reward
continues to increase slightly. At this stage, the model has learned to use the search engine
effectively and focuses on refining its search queries. Given that NQ is a relatively simple
task, the response length stabilizes at approximately 500 tokens, indicating convergence.
5.4 Study of Retrieved Tokens Loss Masking
In Section 3.1, we introduced token-level loss masking for retrieved tokens to prevent
unintended optimization behaviors. Here, we empirically evaluate its effectiveness by
analyzing its impact on training stability and model performance.
We conduct experiments on the LLaMA3.2-3b-base model, comparing training dynamics
with and without retrieved token loss masking. As shown in Figure 4(b), applying retrieved
token masking results in greater LLM improvements, mitigating unintended optimization
effects and ensuring more stable training dynamics.
10
Page 11:
Search-R1
Table 4: The performance of SEARCH -R1 with and without retrieved token loss masking.
The LLM trained with retrieved token loss masking achieves consistently better performance.
(LLM: LLaMA3.2-3b-base)
Method NQ TriviaQA PopQA HotpotQA 2wiki Musique Bamboogle Avg.
SEARCH -R1 w. mask 0.394 0.596 0.437 0.280 0.264 0.056 0.105 0.305
SEARCH -R1 w.o. mask 0.124 0.360 0.152 0.131 0.211 0.017 0.032 0.147
Table 5: A case study of R1 and S EARCH -R1.
Question : Curious is a women’s fragrance by a singer born in what city and state?
Ground Truth : McComb, Mississippi
R1:
<think> The singer is Beyonc ´e, who was born in Houston, Texas. </think>
<answer> Houston </answer>
SEARCH -R1:
<think> I need to find out which city and state a singer, by whom the women’s fragrance ”Curious”
was created, was born in. </think>
<search> Curious fragrance information </search>
<information> Doc 1(Title: ”Curious (fragrance)”) Curious (fragrance) Curious is a women’s
fragrance by Britney Spears for Elizabeth Arden, and is the first perfume to be endorsed by
Britney Spears, preceding ””Fantasy.”” Spears reportedly earned $52 million from the endorsement.
””Curious”” was released in September 2004, and was very successful internationally. It was the
number one fragrance of 2004 in department stores, and in 2005, ””Curious”” was honored by the
Fragrance Foundation as Best Women’s Fragrance. The fragrance is a white floral scent. Its notes
are Louisiana magnolia, golden Anjou pear, lotus flower, tuberose, star jasmine, pink cyclamen,
vanilla-infused musk, sandalwood and blonde woods. Elizabeth Arden ... </information>
<think> I need to find out which city and state Britney Spears, the singer behind ”Curious”
fragrance, was born in. </think>
<search> Britney Spears birthplace </search>
<information> Doc 1(Title: ”Britney Spears”) the best-selling female albums artist of the 2000s. In
2004, she launched a perfume brand with Elizabeth Arden, Inc., from which sales exceeded US $1.5
billion, . Spears was born in McComb, Mississippi, the second child of Lynne Irene Bridges and
James Parnell Spears. Her maternal grandmother, Lillian Portell, was English (born in London),
and one of Spears’s maternal great-great-grandfathers was Maltese. Her siblings are Bryan James
and Jamie Lynn. Britney was born in the Bible Belt, where socially conservative evangelical
Protestantism is a particularly strong religious influence. Spears was baptized into the Southern
Baptist Convention, but later in life studied </information>
<think> Since Britney Spears was born in McComb, Mississippi, I need to find out which city and
state that is. </think>
<search> McComb, Mississippi location </search>
<information> Doc 1(Title: ”McComb, Mississippi”) McComb, Mississippi McComb is a city in
Pike County, Mississippi, United States, approximately south of Jackson. As of the 2010 census,
the city had a total population of 12,790. It is the principal city of the McComb, Mississippi
Micropolitan Statistical Area. McComb was founded in 1872 after Henry Simpson McComb of the
New Orleans, Jackson and Great Northern Railroad, a predecessor of the Illinois Central Railroad
(now part of the Canadian National Railway), decided to move the railroad’s maintenance shops
away from New Orleans, Louisiana, to avoid the attractions of that city’s saloons. The railroad
purchased land in Pike County, </information>
<think> I have found that Britney Spears was born in McComb, Mississippi. Since the question
asks for the city and state, I just need to confirm the state, which I already know from the previous
search. </think>
<answer> McComb, Mississippi </answer>
The performance comparison is provided in Table 4, demonstrating that SEARCH -R1trained
with retrieved token loss masking consistently outperforms the variant without masking.
11
Page 12:
Search-R1
5.5 Case Studies
To gain deeper insights into SEARCH -R1, we conduct a case study using Qwen2.5-7B-Base,
comparing its behavior with RL without a search engine (Guo et al., 2025). The results are
presented in Table 5, revealing the following key observations:
Interleaved Reasoning and Retrieval Enhances Problem Analysis :SEARCH -R1 enables
the LLM to perform in-depth reasoning with multi-turn retrieval, whereas RL without
search relies solely on the model’s internal knowledge. By incorporating retrieved passages,
SEARCH -R1 allows the LLM to iteratively refine its reasoning, leading to more informed
and accurate responses.
Self-Verification through Iterative Retrieval : We observe that after the second retrieval
round, the LLM has already gathered sufficient information to answer the question. How-
ever, SEARCH -R1performs an additional retrieval step to self-verify its conclusion, further
reinforcing its confidence in the final response. This phenomenon aligns with findings from
LLM reasoning RL without retrieval (Guo et al., 2025), highlighting how reinforcement
learning can encourage verification-driven reasoning even in search-augmented settings.
6 Conclusion
In this work, we introduced SEARCH -R1, a novel reinforcement learning framework that
enables large language models (LLMs) to interleave self-reasoning with real-time search
engine interactions. Unlike existing retrieval-augmented generation (RAG) approaches,
which lack flexibility for multi-turn retrieval, or tool-use methods that require large-scale su-
pervised training data, SEARCH -R1optimizes LLM rollouts through reinforcement learning,
allowing autonomous query generation and strategic utilization of retrieved information.
Through extensive experiments on seven datasets, we demonstrated that SEARCH -R1 sig-
nificantly enhances LLMs’ ability to tackle complex reasoning tasks requiring real-time
external knowledge. Our analysis also provides key insights into RL training strategies for
search-augmented reasoning. Looking ahead, future work can explore expanding SEARCH -
R1to support broader search strategies, including more sophisticated reward mechanisms,
dynamic retrieval adjustments based on uncertainty, and integration with diverse infor-
mation sources beyond web search. It is also promising to investigate its applicability to
multimodal reasoning tasks.
Acknowledgments
This research was supported in part by Apple PhD Fellowship, in part by US DARPA INCAS
Program No. HR0011-21-C0165 and BRIES Program No. HR0011-24-3-0325, in part by the
Office of Naval Research contract number N000142412612, in part by NSF grant numbers IIS-
19-56151 and 2402873, in part by the Molecule Maker Lab Institute: An AI Research Institutes
program supported by NSF under Award No. 2019897 and the Institute for Geospatial
Understanding through an Integrative Discovery Environment (I-GUIDE) by NSF under
Award No. 2118329, in part by Cisco, and in part by the Center for Intelligent Information
Retrieval. Any opinions, findings, and conclusions or recommendations expressed herein
are those of the authors and do not necessarily represent the views, either expressed or
implied, of the sponsors or the U.S. Government.
References
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al.
Gpt-4 technical report. arXiv preprint arXiv:2303.08774 , 2023.
Arash Ahmadian, Chris Cremer, Matthias Gall ´e, Marzieh Fadaee, Julia Kreutzer, Olivier
Pietquin, Ahmet ¨Ust¨un, and Sara Hooker. Back to basics: Revisiting reinforce style
optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740 ,
2024.
12
Page 13:
Search-R1
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan
Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-
finetuned language models. Journal of Machine Learning Research , 25(70):1–53, 2024.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick,
and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning
challenge. arXiv preprint arXiv:1803.05457 , 2018.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle,
Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3
herd of models. arXiv preprint arXiv:2407.21783 , 2024.
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun,
Haofen Wang, and Haofen Wang. Retrieval-augmented generation for large language
models: A survey. arXiv preprint arXiv:2312.10997 , 2, 2023.
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen,
Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets
programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196 , 2024.
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu,
Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in
llms via reinforcement learning. arXiv preprint arXiv:2501.12948 , 2025.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint
arXiv:2009.03300 , 2020.
Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing
a multi-hop qa dataset for comprehensive evaluation of reasoning steps. arXiv preprint
arXiv:2011.01060 , 2020.
Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, and
Yuxiao Dong. Advancing language model reasoning through reinforcement learning and
inference scaling. arXiv preprint arXiv:2501.11651 , 2025.
Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A
survey. arXiv preprint arXiv:2212.10403 , 2022.
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low,
Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.
arXiv preprint arXiv:2412.16720 , 2024.
Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming
Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pp.
7969–7992, 2023.
Bowen Jin, Jinsung Yoon, Jiawei Han, and Sercan O Arik. Long-context llms meet rag:
Overcoming challenges for long inputs in rag. In The Thirteenth International Conference on
Learning Representations , 2024.
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large
scale distantly supervised challenge dataset for reading comprehension. arXiv preprint
arXiv:1705.03551 , 2017.
Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning:
A survey. Journal of artificial intelligence research , 4:237–285, 1996.
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov,
Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answer-
ing. In EMNLP (1) , pp. 6769–6781, 2020.
13
Page 14:
Search-R1
Timo Kaufmann, Paul Weng, Viktor Bengs, and Eyke H ¨ullermeier. A survey of reinforcement
learning from human feedback. arXiv preprint arXiv:2312.14925 , 10, 2023.
Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate
Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, et al. Training language models to
self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917 , 2024.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh,
Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural
questions: a benchmark for question answering research. Transactions of the Association for
Computational Linguistics , 7:453–466, 2019.
Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khy-
athi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. Rewardbench:
Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787 , 2024.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman
Goyal, Heinrich K ¨uttler, Mike Lewis, Wen-tau Yih, Tim Rockt ¨aschel, et al. Retrieval-
augmented generation for knowledge-intensive nlp tasks. Advances in neural information
processing systems , 33:9459–9474, 2020.
Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang,
and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. arXiv
preprint arXiv:2501.05366 , 2025.
Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen. Large language models in finance: A
survey. In Proceedings of the fourth ACM international conference on AI in finance , pp. 374–382,
2023.
Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Richard James, Pedro
Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, et al. Ra-dit: Retrieval-augmented
dual instruction tuning. In The Twelfth International Conference on Learning Representations ,
2023.
Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel
Khashabi. When not to trust language models: Investigating effectiveness and limitations
of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511 , 7, 2022.
Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a
reference-free reward. Advances in Neural Information Processing Systems , 37:124198–124235,
2024.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin,
Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language
models to follow instructions with human feedback. Advances in neural information
processing systems , 35:27730–27744, 2022.
Richard Yuanzhe Pang, Weizhe Yuan, He He, Kyunghyun Cho, Sainbayar Sukhbaatar, and
Jason Weston. Iterative reasoning preference optimization. Advances in Neural Information
Processing Systems , 37:116617–116637, 2024.
Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa,
Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. A study of generative
large language model for medical research and healthcare. NPJ digital medicine , 6(1):210,
2023.
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis.
Measuring and narrowing the compositionality gap in language models. arXiv preprint
arXiv:2210.03350 , 2022.
Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun
Xu, and Ji-Rong Wen. Tool learning with large language models: A survey. Frontiers of
Computer Science , 19(8):198343, 2025.
14
Page 15:
Search-R1
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and
Chelsea Finn. Direct preference optimization: Your language model is secretly a reward
model. Advances in Neural Information Processing Systems , 36:53728–53741, 2023.
Timo Schick, Jane Dwivedi-Yu, Roberto Dess `ı, Roberta Raileanu, Maria Lomeli, Eric Hambro,
Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models
can teach themselves to use tools. Advances in Neural Information Processing Systems , 36:
68539–68551, 2023.
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-
dimensional continuous control using generalized advantage estimation. arXiv preprint
arXiv:1506.02438 , 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal
policy optimization algorithms. arXiv preprint arXiv:1707.06347 , 2017.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang,
Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical
reasoning in open language models. arXiv preprint arXiv:2402.03300 , 2024.
Richard S Sutton, Andrew G Barto, et al. Reinforcement learning. Journal of Cognitive
Neuroscience , 11(1):126–134, 1999.
Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens
of context. arXiv preprint arXiv:2403.05530 , 2024.
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving
retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions.
arXiv preprint arXiv:2212.10509 , 2022a.
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique:
Multihop questions via single-hop question composition. Transactions of the Association for
Computational Linguistics , 10:539–554, 2022b.
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan
Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training.
arXiv preprint arXiv:2212.03533 , 2022.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V
Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language
models. Advances in neural information processing systems , 35:24824–24837, 2022.
Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu,
and Jun Zhao. Large language models are better reasoners with self-verification. arXiv
preprint arXiv:2212.09561 , 2022.
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist rein-
forcement learning. Machine learning , 8:229–256, 1992.
Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai
Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based
reinforcement learning. arXiv preprint arXiv:2502.14768 , 2025.
Guangzhi Xiong, Qiao Jin, Xiao Wang, Yin Fang, Haolin Liu, Yifan Yang, Fangyuan Chen,
Zhixing Song, Dengyu Wang, Minjia Zhang, et al. Rag-gym: Optimizing reasoning and
search agents with process supervision. arXiv preprint arXiv:2502.13957 , 2025.
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li,
Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. arXiv preprint
arXiv:2412.15115 , 2024.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdi-
nov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop
question answering. arXiv preprint arXiv:1809.09600 , 2018.
15
Page 16:
Search-R1
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and
Yuan Cao. React: Synergizing reasoning and acting in language models. In International
Conference on Learning Representations (ICLR) , 2023.
Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi,
and Bryan Catanzaro. Rankrag: Unifying context ranking with retrieval-augmented
generation in llms. Advances in Neural Information Processing Systems , 37:121156–121184,
2024.
Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin,
Dong Wang, Xuanhui Wang, and Michael Bendersky. Inference scaling for long-context
retrieval augmented generation. arXiv preprint arXiv:2410.04343 , 2024.
Weihao Zeng, Yuzhen Huang, Wei Liu, Keqing He, Qian Liu, Zejun Ma, and Junxian He. 7b
model and 8k examples: Emerging reasoning with reinforcement learning is both effective
and efficient. https://hkust-nlp.notion.site/simplerl-reason , 2025. Notion Blog.
Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo
Zhao, Yu Zhang, Yulong Chen, et al. Siren’s song in the ai ocean: a survey on hallucination
in large language models. arXiv preprint arXiv:2309.01219 , 2023.
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian
Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models.
arXiv preprint arXiv:2303.18223 , 1(2), 2023.
Wayne Xin Zhao, Jing Liu, Ruiyang Ren, and Ji-Rong Wen. Dense text retrieval based on
pretrained language models: A survey. ACM Transactions on Information Systems , 42(4):
1–60, 2024.
16