Authors: Xinjie Liu, Cyrus Neary, Kushagra Gupta, Christian Ellis, Ufuk Topcu, David Fridovich-Keil
Paper Content:
Multi-Fidelity Policy Gradient Algorithms
Xinjie Liu1,†, Cyrus Neary1,2,3,†, Kushagra Gupta1, Christian Ellis1,4, Ufuk Topcu1,
David Fridovich-Keil1
{xinjie-liu,kushagrag,utopcu,dfk}@utexas.edu cyrus.neary@mila.quebec
christian.ellis@austin.utexas.edu
1University of Texas at Austin, USA
2Mila – Quebec AI Institute, Canada
3Université de Montréal, Canada
4DEVCOM Army Research Laboratory, USA
†Equal Contribution
Abstract
Many reinforcement learning (RL) algorithms require large amounts of data, prohibiting their use
in applications where frequent interactions with operational systems are infeasible, or where
high-fidelity simulations are expensive or unavailable. Meanwhile, low-fidelity simulators (such as
reduced-order models, heuristic reward functions, or generative world models) can cheaply provide
useful data for RL training, even if they are too coarse for direct sim-to-real transfer. We propose
multi-fidelity policy gradients (MFPGs), an RL framework that mixes a small amount of data from the
target environment with a large volume of low-fidelity simulation data to form unbiased,
reduced-variance estimators (control variates) for on-policy policy gradients. We instantiate
the framework by developing multi-fidelity variants of two policy gradient algorithms:
REINFORCE and proximal policy optimization. Experimental results across a suite
of simulated robotics benchmark problems demonstrate that when target-environment
samples are limited, MFPG achieves up to 3.9× higher reward and improves training
stability when compared to baselines that only use high-fidelity data. Moreover, even
when the baselines are given more high-fidelity samples (up to 10× as many interactions
with the target environment), MFPG continues to match or outperform them.
Finally, we observe that MFPG is capable of training effective policies even when the
low-fidelity environment is drastically different from the target environment. MFPG
thus not only offers a novel paradigm for efficient sim-to-real transfer but also provides
a principled approach to managing the trade-off between policy performance and data
collection costs.
1 Introduction
Reinforcement learning (RL) algorithms offer significant capabilities in systems that work with un-
known, or difficult-to-specify, dynamics and objectives. The flexibility and performance of RL
algorithms have led to their adoption in applications as diverse as controlling plasma configurations
in nuclear fusion reactors (Degrave et al., 2022), piloting high-speed aerial vehicles (Kaufmann
et al., 2023), training reasoning capabilities in large language models (Shao et al., 2024), and search-
ing large design spaces for the automated discovery of new molecules (Bengio et al., 2021; Ghugare
et al., 2023). However, in many such applications, datasets must often be gathered from operational
systems or from high-fidelity simulations. This requirement acts as a significant barrier to the devel-
opment and deployment of RL policies: excessive interactions with operational systems are often
arXiv:2503.05696v1 [cs.LG] 7 Mar 2025
[Figure 1 diagram. MFPG Sampling Mechanism: the current policy $\pi_\theta$ generates $N_h$
correlated trajectory samples from both the High-Fidelity Env. and the Low-Fidelity Env., plus $N_l$
low-fidelity trajectory samples, with $N_l \gg N_h$. Low-Fidelity Policy Gradient Estimator:
$\frac{1}{N_l} \sum_{\tau, t} G^l_t \log \pi_\theta(a^l_t | s^l_t)$. MFPG Policy Gradient Estimator:
$\frac{1}{N_h} \sum_{\tau} \nabla_\theta [X^{\pi_\theta}_{\tau^h} + c (X^{\pi_\theta}_{\tau^l} - \hat{\mu}_l)]$.
The resulting policy gradient estimate is used to apply a policy gradient step.]
Figure 1: The proposed multi-fidelity policy gradient (MFPG) framework. At each policy update
step, MFPG combines a small amount of data from the target (high-fidelity) environment with a large
volume of low-fidelity simulation data, thereby forming an unbiased, reduced-variance estimator for
the policy gradient. For illustrative purposes, we use the REINFORCE gradient estimators in the
figure. However, we note that the same overall workflow applies to other MFPG algorithms.
either infeasible or unsafe, and generating simulated datasets for RL can be prohibitively expensive
unless the simulations are both cheap to run and carefully designed to minimize the sim-to-real gap.
On the other hand, low-fidelity simulation tools capable of cheaply generating large volumes of data
are often available. For example, reduced-order models, linearized dynamics, heuristic reward func-
tions, and generative world models all output useful information for RL, even when approximations
of the target dynamics and rewards are very coarse.
[Figure 2 plot: reward vs. gradient steps (0 to 1·10^4) for MFPG (ours), 10x less data;
Single-Fidelity (high-fid. only, value subt.); and Single-Fidelity (high-fid. only).]
Figure 2: REINFORCE on the MuJoCo Hopper task (episode length 500). Even when granted access to
10x fewer samples from the target environment per gradient step, the proposed MFPG improves learning
speed and reward performance in comparison to standard REINFORCE algorithms (with and without
state-value baseline subtraction) that only train on a single fidelity of data.
Towards enabling the training and deployment of highly performant RL policies that would otherwise
be infeasible, in this work we develop novel multi-fidelity RL algorithms. The proposed algorithms
achieve an order of magnitude improvement in sample efficiency by mixing data from the target
environment with data generated by lower-fidelity simulations. The end result not only offers a
novel paradigm for sim-to-real transfer, but also provides a principled approach to managing
the trade-off between policy performance and data collection costs.
More specifically, we present multi-fidelity policy gradients (MFPGs), a framework that leverages
data from multiple sources to construct control variates (CVs), i.e., unbiased, reduced-variance
estimators, of policy gradients (PGs) within on-policy methods. While the presented ideas and
methods can be broadly applied within RL algorithms, in this work we apply the framework to
two existing on-policy algorithms: REINFORCE
(Williams, 1992) and proximal policy optimization (PPO) (Schulman et al., 2017).
Figure 1 illustrates the proposed approach. At each policy update step, the algorithm begins by sam-
pling a relatively small number of trajectories from the target environment, which may correspond
to real-world hardware or to a high-fidelity simulation. We then propose a method for sampling tra-
jectories from the low-fidelity environments such that the resulting cumulative rewards and action
likelihoods are highly correlated with those from the high-fidelity trajectories. This method hinges
on a nuanced approach to action sampling and is critical to the success of our approach. Next,
the algorithm uses the low-fidelity environments to sample a much larger quantity of uncorrelated
trajectories, which it uses alongside the previously sampled trajectories to compute an unbiased es-
timate of the policy gradient to be applied. So long as the random variables corresponding to the
policy gradient updates between the high and low-fidelity environments are correlated, the approach
is guaranteed to reduce the variance of the policy gradient estimates.
The impact of this variance reduction is a significant overall improvement to algorithm performance
in comparison with standard RL approaches. Figure 2 illustrates that even when the proposed
multi-fidelity algorithms are trained using 10× fewer high-fidelity samples than the baseline
methods, we empirically observe that the proposed MFPG approach can still learn faster and achieve
a higher cumulative reward.
We further demonstrate the benefits of our approach through experiments on a suite of robotics
control problems, which are commonly used as benchmarks for RL algorithms. Our experimental
results demonstrate the following key points. i) The proposed multi-fidelity control variates drasti-
cally reduce the variance of policy gradient estimates. ii) When the quantity of high-fidelity samples
is limited, the proposed multi-fidelity algorithms outperform their single-fidelity counterparts in all
settings, and achieve up to 3.9× more reward. This is particularly beneficial in online RL settings,
for which existing algorithms typically require many environment samples per gradient step. iii)
The proposed algorithms continue to match or outperform their single-fidelity counterparts, even
when those baselines are allowed access to substantially more high-fidelity data, e.g., up to 10×
more total high-fidelity environment interactions throughout training. iv) The MFPG algorithms
can learn effectively even when the low-fidelity environment is drastically different from the high-
fidelity environment, and high-fidelity data is scarce. v) The proposed method to sample correlated
multi-fidelity trajectories is critical to the performance of the proposed approach.
2 Related Work
Variance Reduction in RL via control variates. The method of control variates (CVs)
is commonly used for variance reduction in Monte Carlo estimation (Owen, 2013). In
RL, particularly in PG methods (Williams, 1992), there is a long history of using CVs to
reduce variance and accelerate learning, e.g., subtracting constant reward baselines (Sutton,
1984; Williams, 1988; Dayan, 1991; Williams, 1992) or state value functions (Weaver &
Tao, 2001; Greensmith et al., 2004; Peters & Schaal, 2006; Zhao et al., 2011) from Monte Carlo
returns. Even though state values are known to not be optimal for variance reduction (Greensmith
et al., 2004), they remain central to modern algorithms, e.g., actor-critic methods (Schulman
et al., 2017). More recently, additional CV baselines have been studied, such as vector-form
CVs (Zhong et al., 2021), multi-step variance reduction along trajectories (Pankov, 2018; Cheng
et al., 2020) and their more general form (Huang & Jiang, 2020), and state-action dependent base-
lines (Gu et al., 2017; Liu et al., 2018; Grathwohl et al., 2018; Wu et al., 2018), which offer better
variance reduction under certain conditions (Tucker et al., 2018). The literature on using CVs for
policy gradients almost exclusively focuses on single-environment settings. In this work, we con-
sider the problem of fusing data from multi-fidelity environments to construct still lower-variance
estimators, cf. Section 3. Moreover, the proposed multi-fidelity approach can also be readily com-
bined with existing single-fidelity CVs for further improved performance, cf. Section 4.2.
To the best of our knowledge, the only existing work leveraging multi-fidelity CVs for RL is (Khairy
& Balaprakash, 2024). Our approach differs in two main ways. First, we propose Monte Carlo PG
and actor-critic algorithms for Markov decision processes (MDPs) with either continuous or discrete
state and action spaces, whereas Khairy & Balaprakash (2024) estimate state-action values in tabular
MDPs without function approximation. Second, we propose a novel policy reparameterization trick
that enables the sampling of correlated trajectories across multi-fidelity environments. Crucially,
this sampling scheme improves the variance reduction of the proposed MFPG algorithms, and it
directly supports continuous settings without restricting the MDPs. By contrast, Khairy &
Balaprakash (2024) require action sequences to be matched across multi-fidelity environments, which
requires additional assumptions about the MDP structure and transition dynamics.
Multi-fidelity RL and fine-tuning using high-fidelity data. Multi-fidelity methods for optimiza-
tion, uncertainty propagation, and inference are well-studied in computational science and engineer-
ing, where it is often the case that multiple models for a problem of interest are available (Peherstor-
fer et al., 2018). Such methods provide principled statistical techniques to manage the computational
costs of Monte Carlo simulations, without introducing unwanted biases. However, the adoption of
such techniques by RL algorithms is significantly more limited. Two bodies of literature which are
closely related to this work are multi-fidelity RL (Cutler et al., 2015) and fine-tuning RL policies
using limited target-domain data, e.g., in sim-to-real transfer (Smith et al., 2022). Multi-fidelity RL
aims to design training curricula, i.e., decide when to train at each simulation fidelity, in order
to train highly-performant policies for the highest-fidelity environment with minimum simulation
costs. Typically, training begins in the lowest-fidelity simulators, with proposed mechanisms to de-
cide when to transition to other fidelities based on estimated uncertainty (Suryan et al., 2017; Cutler
et al., 2015; Agrawal & McComb, 2024; Ryou et al., 2024; Bhola et al., 2023) in predicted actions,
state-action values, dynamics, rewards, and/or information gain (Marco et al., 2017). Similarly,
in sim-to-real and transfer learning settings (Zhao et al., 2020; Tang et al., 2024; Da et al., 2025),
the objective is to train a highly-performant policy for the real world using simulation data, some-
times supplemented with a small amount of real-world data. Many approaches leverage models
or value functions trained in simulation (source domain) to bootstrap real-world (target domain)
fine-tuning (Rusu et al., 2017; Arndt et al., 2020; Taylor et al., 2007; Smith et al., 2022) or guide
real-world exploration (Yin et al., 2025) to improve sample efficiency. Additionally, real-world
data can be used to refine the simulator itself (Abbeel et al., 2006; Chebotar et al., 2019; Ramos
et al., 2019). The works above introduce new training paradigms and exploration strategies with
data from different domains. In contrast, our approach proposes a new policy gradient estimator that
incorporates cross-domain data. The proposed approach can be seamlessly integrated into existing
paradigms that involve fine-tuning within target domains.
3 Multi-Fidelity Policy Gradients
Our objective is to develop RL algorithms capable of leveraging data generated by multiple dis-
tinct environments in order to efficiently learn a policy that achieves high performance in a target
environment of interest.
Modeling the multi-fidelity environments. We begin by considering a high-fidelity environment
and a low-fidelity environment, each modeled by a Markov decision process (MDP). In particular, the
high-fidelity environment is an MDP $\mathcal{M}^h = (S, A, \Delta_{s_I}, \gamma, p^h, R^h)$, which we
assume represents either an accurate simulator of the target environment, or the target environment
itself. Here, $S$ is the set of environment states, $A$ is its set of actions, $\Delta_{s_I}$ is an
initial distribution over states, $\gamma \in [0, 1]$ is a discount factor, $p^h(s' | s, a)$ is the
probability of transitioning to state $s'$ from state $s$ under action $a$, and
$r \sim R^h(s, a, s')$ defines the probability of observing a particular reward under
a given state-action-state triplet. Similarly, the low-fidelity environment is an MDP
$\mathcal{M}^l = (S, A, \Delta_{s_I}, \gamma, p^l, R^l)$, whose transition dynamics $p^l$ and reward
function $R^l$ differ from those of $\mathcal{M}^h$, and do not necessarily accurately represent the
target environment. We assume that the cost of generating sample trajectories in $\mathcal{M}^h$ is
significantly higher than that of generating trajectories in $\mathcal{M}^l$, as may be the case if
$\mathcal{M}^h$ were to represent real-world hardware while $\mathcal{M}^l$ were to represent
cheap simulations.
Our objective is to learn a stochastic policy $\pi_\theta(a | s)$, parameterized by
$\theta \in \mathbb{R}^d$, that achieves a high expected total reward in the high-fidelity
environment $\mathcal{M}^h$. Fixing a policy $\pi_\theta$ defines a distribution over trajectories in
both $\mathcal{M}^h$ and $\mathcal{M}^l$. We denote trajectories in $\mathcal{M}^h$ by
$\tau^h = s^h_0, a^h_0, r^h_0, s^h_1, a^h_1, r^h_1, \ldots, s^h_T$, where $s^h_0 \sim \Delta_{s_I}$,
$a^h_t \sim \pi_\theta(\cdot | s^h_t)$, $s^h_{t+1} \sim p^h(\cdot | s^h_t, a^h_t)$, and
$r^h_t \sim R^h(s^h_t, a^h_t, s^h_{t+1})$. Similarly, we use $\tau^l$ to denote trajectories sampled
in $\mathcal{M}^l$.
We remark that the proposed method may be extended to integrate data from an arbitrary number of
environments, each of which simulates the target environment with a different level of fidelity. One
may also define $\mathcal{M}^h$ and $\mathcal{M}^l$ to have different sets of states and actions, as
well as different initial distributions and discount factors. Furthermore, the framework can be
readily extended to infinite-horizon cases. However, none of these extensions significantly enhances
the primary conceptual and experimental contributions of this work, and so we leave them to future
research.
Policy gradient algorithms. PG algorithms aim to maximize the performance measure
$J(\theta) := \mathbb{E}_{\tau \sim \mathcal{M}(\pi_\theta)}[R(\tau)]$, i.e., the expected total reward
along trajectories $\tau$ sampled from environment $\mathcal{M}$ under policy $\pi_\theta$. They do so
by using stochastic estimates of the policy gradient $\nabla_\theta J(\theta)$ to perform
gradient ascent directly on the policy parameters (Sutton & Barto, 2018). For example, the
REINFORCE algorithm (Williams, 1988; 1992) uses Monte Carlo estimates of
$\mathbb{E}_{\tau \sim \mathcal{M}(\pi_\theta)}[\nabla_\theta X^{\pi_\theta}_\tau]$ to estimate the
policy gradient, where
$X^{\pi_\theta}_\tau := \frac{1}{T} \sum_{t=0}^{T-1} G_t \log \pi_\theta(a_t | s_t)$ is a random
variable defining the contribution of each trajectory $\tau$ to the overall policy gradient. Here,
$a_t$, $s_t$, and $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$ denote the selected action, the state, and
the reward-to-go at time $t$ in trajectory $\tau$, respectively.
Different policy gradient algorithms use different expressions for $X^{\pi_\theta}_\tau$ when
estimating the policy gradient (e.g., $G_t$ may be replaced with an advantage estimate, or
$X^{\pi_\theta}_\tau$ may be entirely replaced by a surrogate per-trajectory gradient estimate, as is
done in PPO (Schulman et al., 2017)). However, we note that the overall structure of many on-policy
algorithms remains the same: use the current policy to sample trajectories in the environment; use
the rewards, actions, and states along these trajectories to compute a variable of interest
$X^{\pi_\theta}_\tau$; finally, average the gradients of these sampled
random variables to estimate $\nabla_\theta J(\theta)$.
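As a minimal sketch of the REINFORCE quantity described above, the following Python snippet computes the reward-to-go and the per-trajectory value $X^{\pi_\theta}_\tau$ from plain lists of rewards and log-probabilities. The function names are hypothetical, and the snippet assumes the reward-to-go is discounted from the current step, $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$. In a real implementation the log-probabilities would be differentiable tensors so that $\nabla_\theta X^{\pi_\theta}_\tau$ can be obtained by automatic differentiation; here they are ordinary floats.

```python
from typing import List, Sequence


def rewards_to_go(rewards: Sequence[float], gamma: float) -> List[float]:
    """Return G_t = sum_{k=t}^{T-1} gamma^(k-t) * r_k for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_value(
    rewards: Sequence[float], log_probs: Sequence[float], gamma: float
) -> float:
    """Return X = (1/T) * sum_t G_t * log pi(a_t | s_t) for one trajectory."""
    returns = rewards_to_go(rewards, gamma)
    return sum(g * lp for g, lp in zip(returns, log_probs)) / len(rewards)
```

Averaging the gradients of `reinforce_value` over sampled trajectories would then give the Monte Carlo policy gradient estimate.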
Multi-fidelity policy gradient estimators via control variates. Because our objective is to
optimize the policy performance in the high-fidelity environment $\mathcal{M}^h$, during each step of
the policy gradient algorithm we must estimate
$\nabla_\theta \mathbb{E}_{\tau^h \sim \mathcal{M}^h(\pi_\theta)}[X^{\pi_\theta}_{\tau^h}]$ from a
potentially limited number $N_h$ of sampled high-fidelity trajectories $\tau^h$. In data-scarce
settings, existing policy gradient methods can face the challenge of high variance of the gradient
estimates (Greensmith et al., 2004). We aim to reduce the estimation variance for the PGs. In this
work, we assume that we may also sample a relatively large number $N_l \gg N_h$ of trajectories
$\tau^l$ from the low-fidelity environment $\mathcal{M}^l$. We use these low-fidelity samples to
construct a so-called control variate $X^{\pi_\theta}_{\tau^l}$, i.e., a correlated auxiliary random
variable whose expected value
$\mu_l := \mathbb{E}_{\tau^l \sim \mathcal{M}^l(\pi_\theta)}[X^{\pi_\theta}_{\tau^l}]$ is known. We
then use the control variates technique (Nelson, 1987) to construct an unbiased, reduced-variance
estimator for
$\nabla_\theta \mathbb{E}_{\tau^h \sim \mathcal{M}^h(\pi_\theta)}[X^{\pi_\theta}_{\tau^h}]$.
Specifically, we construct a new random variable
$Z^{\pi_\theta} := X^{\pi_\theta}_{\tau^h} + c (X^{\pi_\theta}_{\tau^l} - \mu_l)$ with parameter
$c \in \mathbb{R}$. We estimate $\mu_l$ using the $N_l$ low-fidelity trajectory samples, and we
provide a method to jointly sample correlated values $X^{\pi_\theta}_{\tau^h}$ and
$X^{\pi_\theta}_{\tau^l}$ (described below). By construction, this new random variable
has the property that $\mathbb{E}[Z^{\pi_\theta}] = \mathbb{E}[X^{\pi_\theta}_{\tau^h}]$. Furthermore,
depending on the value of $c$, it may have a significantly reduced variance. In particular, by
choosing
$c^* = -\mathrm{Cov}(X^{\pi_\theta}_{\tau^h}, X^{\pi_\theta}_{\tau^l}) / \mathrm{Var}(X^{\pi_\theta}_{\tau^l})$,
we obtain
$\mathrm{Var}(Z^{\pi_\theta}) = (1 - \rho^2(X^{\pi_\theta}_{\tau^h}, X^{\pi_\theta}_{\tau^l})) \mathrm{Var}(X^{\pi_\theta}_{\tau^h})$,
where $\rho(\cdot, \cdot)$ is the Pearson correlation coefficient between the two random variables.
In practice, we estimate $c^*$ using the sampled trajectories from $\mathcal{M}^h$ and
$\mathcal{M}^l$, including samples generated during previous gradient steps, e.g., by computing
an exponential moving average for $c^*$.
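The variance reduction from a control variate can be checked numerically. The sketch below is illustrative only: it uses a toy Gaussian model (an assumption, not the paper's environments) in which the high- and low-fidelity values share a common noise term, estimates $c^*$ from samples, and compares the variance of $Z$ with that of the high-fidelity variable. The helper name `optimal_coefficient` is hypothetical.

```python
import random
import statistics
from typing import Sequence


def optimal_coefficient(x_high: Sequence[float], x_low: Sequence[float]) -> float:
    """Sample estimate of c* = -Cov(X_h, X_l) / Var(X_l)."""
    mean_h = statistics.fmean(x_high)
    mean_l = statistics.fmean(x_low)
    cov = sum((h - mean_h) * (l - mean_l) for h, l in zip(x_high, x_low))
    cov /= len(x_high) - 1
    return -cov / statistics.variance(x_low)


rng = random.Random(0)
# Toy model (assumption): both variables share one noise term, so they are
# strongly correlated, and each adds a small independent perturbation.
shared = [rng.gauss(0.0, 1.0) for _ in range(20000)]
x_high = [1.0 + s + 0.3 * rng.gauss(0.0, 1.0) for s in shared]
x_low = [0.5 + s + 0.3 * rng.gauss(0.0, 1.0) for s in shared]
mu_low = 0.5  # the low-fidelity mean is known exactly in this toy model

c_star = optimal_coefficient(x_high, x_low)
z = [h + c_star * (l - mu_low) for h, l in zip(x_high, x_low)]

var_high = statistics.variance(x_high)
var_z = statistics.variance(z)
print(f"c* = {c_star:.3f}, Var(X_h) = {var_high:.3f}, Var(Z) = {var_z:.3f}")
```

In this toy model the correlation is high, so `var_z` comes out several times smaller than `var_high` while the mean of `z` stays close to the mean of `x_high`.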
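At every policy gradient step, the proposed MFPG framework thus proceeds as follows: 1) Use
policy $\pi_\theta$ to sample $N_h$ correlated trajectories from $\mathcal{M}^h$ and
$\mathcal{M}^l$, as well as $N_l$ additional trajectories from $\mathcal{M}^l$. 2) Use the sampled
trajectories to compute $\hat{\mu}_l$, $c$, and the correlated values of the random variables
$X^{\pi_\theta}_{\tau^h}$ and $X^{\pi_\theta}_{\tau^l}$. 3) Compute the sampled values of
$Z^{\pi_\theta}$. 4) Use the samples of $Z^{\pi_\theta}$ to compute an unbiased, reduced-variance
estimate of the policy gradient
$\nabla_\theta \mathbb{E}[Z^{\pi_\theta}] \approx \frac{1}{N_h} \sum_{\tau} \nabla_\theta [X^{\pi_\theta}_{\tau^h} + c (X^{\pi_\theta}_{\tau^l} - \hat{\mu}_l)]$.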
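Steps 2 to 4 can be sketched for scalar per-trajectory values as follows. This is a simplified illustration, not the authors' implementation: it averages the sampled values of $Z^{\pi_\theta}$ rather than their gradients, the function name `mfpg_estimate` and the `ema_decay` value are assumptions, and the exponential moving average is only one possible way to reuse estimates of $c^*$ from earlier gradient steps.

```python
import statistics
from typing import Optional, Sequence, Tuple


def mfpg_estimate(
    x_high: Sequence[float],
    x_low_paired: Sequence[float],
    x_low_extra: Sequence[float],
    c_prev: Optional[float] = None,
    ema_decay: float = 0.9,
) -> Tuple[float, float]:
    """Return (mean of Z, updated coefficient c) for one policy update."""
    # Step 2a: estimate the low-fidelity mean from the large extra batch.
    mu_low_hat = statistics.fmean(x_low_extra)
    # Step 2b: estimate c* = -Cov(X_h, X_l) / Var(X_l) from the paired samples.
    mean_h = statistics.fmean(x_high)
    mean_l = statistics.fmean(x_low_paired)
    cov = sum((h - mean_h) * (l - mean_l) for h, l in zip(x_high, x_low_paired))
    cov /= len(x_high) - 1
    c_batch = -cov / statistics.variance(x_low_paired)
    # Smooth c across gradient steps with an exponential moving average.
    if c_prev is None:
        c = c_batch
    else:
        c = ema_decay * c_prev + (1.0 - ema_decay) * c_batch
    # Steps 3-4: form Z for each correlated pair and average.
    z = [h + c * (l - mu_low_hat) for h, l in zip(x_high, x_low_paired)]
    return statistics.fmean(z), c
```

A call such as `mfpg_estimate(x_h, x_l, x_l_extra, c_prev=c)` would be repeated at every update, carrying `c` forward between steps.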
Sampling correlated trajectories from the multi-fidelity environments. Note that for the random
variables $X^{\pi_\theta}_{\tau^h}$ and $X^{\pi_\theta}_{\tau^l}$ to be correlated, they must share an
underlying probability space $(\Omega, \mathcal{F}, \mathbb{P})$. In other words, every outcome
$\omega \in \Omega$ from this probability space should uniquely define a high-fidelity trajectory
$\tau^h(\omega)$ and a low-fidelity trajectory $\tau^l(\omega)$, as well as the corresponding values
of the random variables $X^{\pi_\theta}_{\tau^h}(\omega)$ and $X^{\pi_\theta}_{\tau^l}(\omega)$.
Careful treatment of this joint probability space is necessary not only
for conceptual completeness, but also to implement a mechanism that samples correlated trajectories
from the distinct environments $\mathcal{M}^h$ and $\mathcal{M}^l$ under the same policy
$\pi_\theta$.
Informally, when sampling trajectories from $\mathcal{M}^h$ and $\mathcal{M}^l$, there are six sources
of stochasticity: the stochasticity introduced by the policy $\pi_\theta$, that introduced by the
initial state distribution $\Delta_{s_I}$, the stochasticity introduced by the high-fidelity and
low-fidelity transition dynamics $p^h$ and $p^l$, and finally the stochasticity introduced by the two
reward functions $R^h$ and $R^l$. Note that the stochasticity introduced by $\pi_\theta$ is
implemented by the agent, and is thus under the control of the algorithm in the
sense that the same random outcome $\omega$ may be used to generate actions from $\pi_\theta$ in both
$\mathcal{M}^h$ and $\mathcal{M}^l$. Similarly, the algorithm may fix the initial states in the
low-fidelity environment to match those observed from the high-fidelity samples. On the other hand,
the transition dynamics and environment rewards are generated by independent sources of
stochasticity in the different environments.
We accordingly define the outcome set of the probability space as
$\Omega = \Omega^{\Delta_{s_I}} \times \Omega^{\pi} \times \Omega^{p^h} \times \Omega^{p^l} \times \Omega^{R^h} \times \Omega^{R^l}$.
Here, each outcome $\omega^{\Delta_{s_I}} \in \Omega^{\Delta_{s_I}}$ defines a particular shared
initial state. Meanwhile, each policy outcome $\omega^{\pi} \in \Omega^{\pi}$ corresponds to a
sequence $\omega^{\pi} = \omega^{\pi}_0, \omega^{\pi}_1, \ldots, \omega^{\pi}_T$ that dictates the
random sequence of actions selected by the policy $\pi_\theta$ in both environments. Conceptually,
given the outcome $\omega^{\pi}_t$ at any timestep $t$ of a trajectory, the policy should
deterministically output action $a^h_t = \pi_\theta(s^h_t, \omega^{\pi}_t)$ in the high-fidelity
environment, and action $a^l_t = \pi_\theta(s^l_t, \omega^{\pi}_t)$ in the low-fidelity
environment. Note that this does not necessarily imply that $a^h_t = a^l_t$, due to a potential
difference in states $s^h_t$ and $s^l_t$ that stems from the different dynamics. Similarly, each
outcome $\omega^{p^h} \in \Omega^{p^h}$ is a sequence
$\omega^{p^h} = \omega^{p^h}_0, \omega^{p^h}_1, \ldots, \omega^{p^h}_T$ dictating the outcomes
$s^h_{t+1} = p^h(s^h_t, a^h_t, \omega^{p^h}_t)$ of the stochastic transitions in the high-fidelity
environment. However, unlike the policy outcome sequence $\omega^{\pi} \in \Omega^{\pi}$,
the transition outcome sequences $\omega^{p^h} \in \Omega^{p^h}$ and
$\omega^{p^l} \in \Omega^{p^l}$, and the reward outcome sequences
$\omega^{R^h} \in \Omega^{R^h}$ and $\omega^{R^l} \in \Omega^{R^l}$, are not shared between the high-
and low-fidelity environments.
Algorithm 1: Correlated trajectory sampling.
Input: Current policy $\pi_\theta$, multi-fidelity environments $\mathcal{M}^h$ and $\mathcal{M}^l$.
Output: Sampled trajectories $\{\tau^h_i, \tau^l_i\}_{i=1}^{N_h}$.
  TrajectoryList ← EmptyList
  for $i \in \{1, 2, \ldots, N_h\}$ do
    $\omega^{\pi}_0, \ldots, \omega^{\pi}_T \sim$ SamplePolicyOutcomes()
    $s^h_0 \sim \Delta_{s_I}$; $s^l_0 \leftarrow s^h_0$
    for $t \in \{0, \ldots, T-1\}$ do
      $a^h_t \leftarrow \pi_\theta(s^h_t, \omega^{\pi}_t)$
      $s^h_{t+1} \sim p^h(\cdot | s^h_t, a^h_t)$
      $r^h_t \sim R^h(s^h_t, a^h_t, s^h_{t+1})$
    for $t \in \{0, \ldots, T-1\}$ do
      $a^l_t \leftarrow \pi_\theta(s^l_t, \omega^{\pi}_t)$
      $s^l_{t+1} \sim p^l(\cdot | s^l_t, a^l_t)$
      $r^l_t \sim R^l(s^l_t, a^l_t, s^l_{t+1})$
    $\tau^h_i \leftarrow s^h_0, a^h_0, r^h_0, \ldots, s^h_T$
    $\tau^l_i \leftarrow s^l_0, a^l_0, r^l_0, \ldots, s^l_T$
    TrajectoryList.append($\{\tau^h_i, \tau^l_i\}$)
  return TrajectoryList
Formulating these separate outcome sets is conceptually helpful. However, practically, we
only explicitly sample values for the policy outcomes
$\omega^{\pi}_0, \omega^{\pi}_1, \ldots, \omega^{\pi}_T$ in our implementation.
Algorithm 1 outlines the steps of the sampling process. First, the pre-sampled policy outcomes
are used to roll out a high-fidelity trajectory $\tau^h$. Next, the initial state in the low-fidelity
environment is fixed to match the initial state of $\tau^h$, which effectively enforces a common
$\omega^{\Delta_{s_I}}$. Finally, the same sequence of pre-sampled policy outcomes is used to
generate the low-fidelity trajectory $\tau^l$.
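The sampling procedure can be illustrated with a small, self-contained Python sketch. Everything in it is hypothetical: the one-dimensional dynamics, the quadratic reward, the linear-Gaussian policy, and all function names are toy stand-ins chosen only to show how a shared initial state and a shared sequence of policy outcomes are reused across two environments whose transition noise is independent.

```python
import random
from typing import Callable, List, Tuple

Step = Callable[[float, float, random.Random], Tuple[float, float]]
Trajectory = Tuple[List[float], List[float], List[float]]


def policy_action(state: float, noise: float) -> float:
    """Toy reparameterized policy: deterministic given (state, noise)."""
    mean, std = -0.5 * state, 0.2
    return mean + std * noise


def high_fidelity_step(state: float, action: float, rng: random.Random):
    next_state = state + 0.10 * action + 0.01 * rng.gauss(0.0, 1.0)
    return next_state, -next_state ** 2


def low_fidelity_step(state: float, action: float, rng: random.Random):
    # Coarser model: a different gain and its own independent noise source.
    next_state = state + 0.08 * action + 0.01 * rng.gauss(0.0, 1.0)
    return next_state, -next_state ** 2


def rollout(step: Step, s0: float, noises: List[float], rng: random.Random) -> Trajectory:
    states, actions, rewards = [s0], [], []
    for noise in noises:
        action = policy_action(states[-1], noise)
        next_state, reward = step(states[-1], action, rng)
        actions.append(action)
        rewards.append(reward)
        states.append(next_state)
    return states, actions, rewards


def sample_correlated_pair(horizon: int, seed: int) -> Tuple[Trajectory, Trajectory]:
    policy_rng = random.Random(seed)
    # Pre-sample the policy outcomes once; both rollouts reuse them.
    noises = [policy_rng.gauss(0.0, 1.0) for _ in range(horizon)]
    s0 = policy_rng.uniform(-1.0, 1.0)  # shared initial state
    tau_h = rollout(high_fidelity_step, s0, noises, random.Random(seed + 1))
    tau_l = rollout(low_fidelity_step, s0, noises, random.Random(seed + 2))
    return tau_h, tau_l
```

Because the states diverge after the first transition, later actions in the two trajectories generally differ even though the policy noise is identical, which matches the remark that $a^h_t = a^l_t$ is not implied.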
Correlated action sampling via policy distribution reparameterization. In order to use the
sampled outcomes $\omega^{\pi}_t$ to deterministically select an action under the parameterized
policy $\pi_\theta$, we implement a technique inspired by the so-called reparameterization trick used
in variational autoencoders (Kingma, 2013). In particular, in continuous action spaces, we draw
$\omega^{\pi}_t \sim \mathcal{N}(0, 1)$, and the policy $\pi_\theta(s_t, \omega^{\pi}_t)$ is
trained to output a state-dependent mean and standard deviation which are used to transform
$\omega^{\pi}_t$ into an action $a_t$. Meanwhile, in discrete action spaces, we draw
$\omega^{\pi}_t \sim \mathrm{Uniform}(0, 1)$ and apply the Gumbel-Max trick (Huijben et al., 2022) to
sample $a_t$ according to the state-dependent probability distribution defined by the policy
$\pi_\theta(s_t, \omega^{\pi}_t)$.
Defining multi-fidelity variants of the REINFORCE and PPO algorithms. To this point, we have
defined a mechanism for sampling correlated trajectories $\tau^h$ and $\tau^l$ from multi-fidelity
environments $\mathcal{M}^h$ and $\mathcal{M}^l$, as well as a framework for using said trajectories
to construct reduced-variance estimators of the policy gradient from trajectory-dependent variables
$X^{\pi_\theta}_{\tau^h}$ and $X^{\pi_\theta}_{\tau^l}$. With these elements
of the MFPG framework in place, we may easily implement multi-fidelity variants of existing
policy gradient algorithms simply by replacing the way in which the value of $X^{\pi_\theta}_\tau$ is
computed from sampled trajectories $\tau$.
We implement a multi-fidelity variant of the REINFORCE algorithm (Williams, 1988; 1992)
by defining $X^{\pi_\theta}_\tau := \frac{1}{T} \sum_{t=0}^{T-1} G_t \log \pi_\theta(a_t | s_t)$, as
described above. We additionally implement a commonly used variant of the REINFORCE algorithm
(Peters & Schaal, 2006) that subtracts state values estimated by a value network from the Monte
Carlo returns to reduce variance, i.e.,
$X^{\pi_\theta}_\tau := \frac{1}{T} \sum_{t=0}^{T-1} (G_t - V_\phi(s_t)) \log \pi_\theta(a_t | s_t) = \frac{1}{T} \sum_{t=0}^{T-1} A_\phi(s_t, a_t) \log \pi_\theta(a_t | s_t)$.
Here, $V_\phi(s_t)$ and $A_\phi(s_t, a_t)$ denote value and advantage functions estimated from
samples. We use generalized advantage estimation for $A_\phi(s_t, a_t)$ (Schulman et al., 2015).
Meanwhile, we define a multi-fidelity variant of PPO (Schulman et al., 2017) by instead using
$X^{\pi_\theta}_\tau := \frac{1}{T} \sum_{t=0}^{T-1} \min\left( \frac{\pi_\theta(a_t | s_t)}{\pi_{\mathrm{old}}(a_t | s_t)} A_\phi(s_t, a_t), \; \mathrm{clip}\left( \frac{\pi_\theta(a_t | s_t)}{\pi_{\mathrm{old}}(a_t | s_t)}, 1 - \epsilon, 1 + \epsilon \right) A_\phi(s_t, a_t) \right)$.
In this case, the multi-fidelity estimate of the policy gradient is used in place of the traditional
single-fidelity variant, alongside PPO's entropy maximization term, when defining the objective
function for gradient ascent. In both of the above cases, a common advantage network $A_\phi$ is
used to compute both $X^{\pi_\theta}_{\tau^h}$ and $X^{\pi_\theta}_{\tau^l}$, and is trained using
sampled tuples $(s^h_t, a^h_t, r^h_t, s^h_{t+1})$ from the high-fidelity dataset.
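For concreteness, the PPO per-trajectory quantity can be computed as in the sketch below. It works on plain floats, the function name is hypothetical, and the default `clip_eps=0.2` is an assumed value that the text does not specify. In an MFPG setting the same function would be evaluated on both the high-fidelity and the low-fidelity trajectory of a correlated pair.

```python
import math
from typing import Sequence


def ppo_clipped_value(
    log_probs_new: Sequence[float],
    log_probs_old: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Return (1/T) * sum_t min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(log_probs_new, log_probs_old, advantages):
        # Probability ratio pi_theta(a_t | s_t) / pi_old(a_t | s_t).
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

For example, a ratio of 1.5 with advantage 1 is clipped to 1.2, while a ratio of 0.5 with advantage -1 contributes -0.8 because the minimum selects the clipped term.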
4 Experiments
Our results in a suite of simulated benchmark RL tasks validate the following five key claims. Sec-
tions 4.1 to 4.5 highlight the experimental results that demonstrate the claims C1-5 in order. We
then present a large-scale comparison of all algorithm variants and baselines in Section 4.6.
• (C1) The proposed MFPG algorithms drastically reduce the variance of policy gradient
estimators in comparison to common approaches to variance reduction, such as baseline
subtraction and increasing the number of high-fidelity samples per gradient estimate.
• (C2) When high-fidelity samples are limited, the MFPG algorithms stabilize training and
outperform the single-fidelity baselines in all settings, achieving up to 3.9× higher reward
at algorithm convergence.
• (C3) The MFPG algorithms continue to match or outperform these baseline algorithms,
even when the baselines are allowed access to substantially more high-fidelity data, i.e.,
10-20x more high-fidelity samples per gradient update and 2-10x more samples in total.
• (C4) The MFPG algorithms can learn effectively even when the low-fidelity environment
is drastically different from the high-fidelity environment, and high-fidelity data is scarce.
• (C5) The proposed policy reparameterization trick for sampling correlated trajectories
across multi-fidelity environments is crucial to the performance of the proposed approach.
Tasks. We test our approach and support the key claims above in the following simulated robotic
tasks from Gymnasium (Towers et al., 2024) and MuJoCo (Todorov et al., 2012): (i) CartPole-v1,
(ii) InvertedPendulum-v5, (iii) Swimmer-v5, (iv) Walker2d-v5, (v) Hopper-v5, (vi) HalfCheetah-v5.
Low-Fidelity Models. In the CartPole task, we construct a low-fidelity model by linearizing the
system dynamics about an equilibrium point. In all other tasks (the MuJoCo tasks), we implement
low-fidelity environments either by changing the gravitational forces relative to those in the
target (high-fidelity) environments, as is common in the literature on off-dynamics learning (Lyu
et al., 2024), or by changing the reward function. In particular, the results in Sections 4.1 to 4.3
and 4.5 use a low-fidelity model that preserves 90% of the target environment's gravity, while
Section 4.6 reports results for additional differences in gravity levels across all tasks.
Meanwhile, Section 4.4 presents results from a low-fidelity environment in which the rewards are
drastically different from those of the high-fidelity environment.
Implementation details. We implement the MFPG algorithms by building on the Stable-Baselines3
RL software library (Raffin et al., 2021). In all experiments, we use the default hyperparameters
from Stable-Baselines3 with minimal changes. The specific hyperparameters used are detailed in
the Supplementary Material. We employ the same hyperparameters between our proposed MFPG
algorithms and their single-fidelity counterparts, which we use as baselines for comparison. The
code will be released upon publication of the manuscript.
Baseline algorithms. We evaluate three variants of the proposed multi-fidelity framework, and
compare them to their single-fidelity counterparts: (i) REINFORCE (Williams, 1992), (ii) REIN-
FORCE with value function subtraction (a common method for variance reduction in policy gradient
algorithms) (Peters & Schaal, 2006), and (iii) PPO (Schulman et al., 2017).
In RL, the amount of available data influences training through two distinct mechanisms: (i) batch
size, i.e., the number of high-fidelity samples per gradient step, and (ii) the total number of high-
fidelity samples used throughout training, i.e., the batch size multiplied by the total number of gra-
dient steps. Hence, to simulate data-scarce settings, we always use a small batch size of 100 for the
proposed MFPG algorithms. Sections 4.1 to 4.3 use $N_l = 150 N_h$ to estimate the mean term $\hat{\mu}_l$ of
the CVs, while Section 4.4 uses $N_l = 100 N_h$. In all experiments, we use 6 random seeds for each
algorithm. For each of the MFPG algorithms, we compare our approach to two types of baselines:
• All algorithms have limited access to high-fidelity samples: The baselines use the same small
batch size of 100 as the MFPG algorithm and train using the same number of total high-fidelity
samples, i.e., 1 million.
• Baseline algorithms have access to significantly more high-fidelity samples: In these
experiments, the baselines are granted substantially more high-fidelity samples than our MFPG
algorithms. The baselines use a 10-20x larger batch size and train using 2-10x the total number of
high-fidelity samples. We note that in all experiments, the baselines ultimately take either the
same number of gradient steps as our approach, or twice as many. This is because the PPO
implementations use mini-batches to estimate gradients, i.e., subsets of the data batches, whereas
the REINFORCE implementations use the entire batch to estimate gradients in order to reduce variance.
4.1 MFPG Algorithms Drastically Reduce the Variance of Policy Gradient Estimators
[Figure 3: line plot of Variance against High-Fidelity Env. Steps. Legend: Single-Fid. (batch 100); Single-Fid. (batch 100, value subt.); Single-Fid. (batch 1000); Single-Fid. (batch 1000, value subt.); MFPG (ours, batch 100); MFPG (ours, batch 100, value subt.).]
Figure 3: Variance of the policy gradient estimates in the HalfCheetah-v5 task. The multi-fidelity
REINFORCE algorithm enjoys a significantly lower variance in its gradient estimates than the
single-fidelity baselines that only use high-fidelity samples.
This experiment is designed to validate claim C1: The proposed MFPG algorithms drastically reduce
the variance of policy gradient estimators in comparison to common approaches to variance reduction,
such as baseline subtraction and increasing the number of high-fidelity samples per gradient estimate.
We use the REINFORCE algorithm in the HalfCheetah task as an example. Figure 3 shows the variance of
the multi-fidelity vs. single-fidelity policy gradient estimators (more specifically, we plot the
variances of the scalar quantities $Z^{\pi_\theta}$ and $X^{\pi_\theta}_{\tau_h}$ before differentiation w.r.t. the policy parameters). The
figure includes results from several single-fidelity baseline algorithms: a baseline that has
access to the same limited number of high-fidelity samples as our MFPG algorithms (i.e., the same
batch size), a baseline with significantly more high-fidelity samples (a batch size of 1000), and
variants of these single-fidelity baselines that also use value function subtraction in an effort to reduce
estimator variance.
[Figure 4: four panels plotting Reward against High-Fidelity Env. Steps (0 to 1·10^6). Panels: (a) REINFORCE InvertedPendulum-v5; (b) REINFORCE Hopper-v5; (c) REINFORCE HalfCheetah-v5; (d) PPO Hopper-v5. Legend: MFPG (ours, value subt.); MFPG (ours); Single-Fid. (high-fid. only, value subt.); Single-Fid. (high-fid. only).]
Figure 4: Episodic reward as a function of the total elapsed steps in the high-fidelity environment.
All algorithms are limited to 100 high-fidelity samples per gradient update, and the episode length
in all tasks is $T = 500$. MFPG algorithms significantly outperform their single-fidelity counterparts.
We train a policy using the single-fidelity REINFORCE algorithm with state-value subtracted for
1 million steps and save the trained policy at 18 different checkpoints. For each of these saved
policies, we collect 200 batches of both high- and low-fidelity data, where the size of each batch
(i.e., the amount of high-fidelity data) varies between approaches as described above. We then
record the empirical mean and variance of the policy gradient estimates from these 200 batches,
for each checkpointed policy. We repeat this experiment for 6 random seeds and report aggregate
statistics in Figure 3, where each line is based on 21600 batches of policy gradient estimates.
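The measurement procedure above can be mimicked on a toy problem. The sketch below is not the paper's experiment: it replaces trajectory returns with correlated unit-variance Gaussians, assumes the multi-fidelity sample takes the form $X_h + c\,(X_l - \hat{\mu}_l)$, and uses hypothetical names (`batch_estimates`, `RHO`, `C`). It draws 200 batches with 100 high-fidelity samples each, estimates the low-fidelity mean from 150 times as many samples, and compares the empirical variances of the two estimators across batches.

```python
import numpy as np

def batch_estimates(rng, n_high, n_low, rho, c):
    """Return (multi-fidelity, single-fidelity) estimates of E[X_h] from one batch."""
    # Correlated high-/low-fidelity samples (toy stand-ins for trajectory returns).
    cov = [[1.0, rho], [rho, 1.0]]
    x_h, x_l = rng.multivariate_normal([0.0, 0.0], cov, size=n_high).T
    # Low-fidelity mean estimated from a larger, independent low-fidelity sample.
    mu_l_hat = rng.normal(0.0, 1.0, size=n_low).mean()
    z = x_h + c * (x_l - mu_l_hat)  # control-variate-corrected samples
    return z.mean(), x_h.mean()

rng = np.random.default_rng(0)
N_HIGH = 100            # high-fidelity samples per batch
N_LOW = 150 * N_HIGH    # low-fidelity samples used for the mean term
RHO = 0.9               # assumed correlation between fidelities (illustrative)
C = -RHO                # optimal coefficient when both variances equal 1
NUM_BATCHES = 200

estimates = np.array(
    [batch_estimates(rng, N_HIGH, N_LOW, RHO, C) for _ in range(NUM_BATCHES)]
)
var_mf, var_sf = estimates.var(axis=0, ddof=1)
print(f"multi-fidelity estimator variance:  {var_mf:.5f}")
print(f"single-fidelity estimator variance: {var_sf:.5f}")
```

In this toy setting the variance gap is governed by the assumed correlation, so the printed numbers say nothing about the magnitudes reported in Figure 3; the sketch only shows how such a variance comparison can be set up.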
As shown in Figure 3, our multi-fidelity approach drastically reduces the variance of policy gradient
estimators when compared to the single-fidelity baselines. Specifically, we observe that (i) this
reduction is far more substantial than that achieved by standard variance reduction techniques in RL
(namely state-value subtraction or training with larger batch sizes), and (ii) the variance reduction
occurs immediately and uniformly throughout training, whereas state-value subtraction only starts
reducing the variance after fitting a meaningful value function (around 500K steps).
4.2 MFPG Algorithms Significantly Outperform Baselines when Data is Scarce
This experiment is designed to validate claim C2: When high-fidelity samples are limited, the MFPG
algorithms stabilize training and outperform the single-fidelity baselines in all settings. Figure 4
illustrates the episodic reward achieved by the RL algorithms as a function of the total number of
steps taken in the high-fidelity environment. All algorithms use a batch size of 100 and 1 million
total high-fidelity samples throughout training.
The proposed MFPG algorithms achieve higher rewards than the baselines by a large margin:
they achieve 2 to 3.9 times higher rewards in tasks without a maximum reward (Figs. 4b to 4d)
and significantly steeper and more stable learning curves when there is a fixed maximum achievable
reward (Fig. 4a). When observing the results for the REINFORCE algorithm (Figs. 4a to 4c), we
also note that baseline subtraction only improves algorithm performance in 2 of the 3 tasks, and that
this improvement is minimal in comparison to that provided by our multi-fidelity approach.
[Figure 5: two panels plotting Reward against Gradient Steps (0 to 1·10^4). Panels: (a) REINFORCE InvertedPendulum-v5; (b) REINFORCE HalfCheetah-v5. Legend: MFPG (ours, value subt.); MFPG (ours); Single-Fid. (10x high. data, value subt.); Single-Fid. (10x high. data).]
Figure 5: Episodic reward as a function of elapsed gradient steps. The baseline algorithms have
access to 10x more high-fidelity samples than the proposed MFPG algorithms. Even when facing this
severe data disadvantage, the MFPG algorithms outperform or match the baselines’ performance.
4.3 MFPG Algorithms Outperform Baselines Even when Training with 10x Less Data
This experiment is designed to validate claim C3: The MFPG algorithms continue to match or
outperform the baseline algorithms, even when the baselines are allowed access to substantially more
high-fidelity data. In Figs. 2 and 5, the multi-fidelity REINFORCE algorithm uses a batch size of 100
and 1 million total high-fidelity samples, whereas the baseline algorithms have access to 10 times
more samples per batch and to 10 million total high-fidelity samples. As in Section 4.2, we observe
that the proposed approach achieves substantially higher rewards than the baseline in tasks without
a maximum achievable reward (Fig. 5b) and yields a much more stable learning curve when there is
a maximum achievable reward (Fig. 5a).
We benchmark the multi-fidelity PPO algorithm (batch size of 100 and 1 million total high-fidelity
samples) against results reported in Stable-Baselines3 (Stable-Baselines3), which use a batch size of
2048 and 2 million total high-fidelity samples. We provide the multi-fidelity PPO results, as well as
additional results, including results from other tasks, in Tables 1 and 2 in Section 4.6.
4.4 The MFPG Algorithms Can Remain Highly Effective Under Substantial Mismatches
Between the Low-Fidelity and High-Fidelity Environments
This experiment is designed to validate claim C4: The MFPG algorithms can learn effectively even
when the low-fidelity environment is drastically different from the high-fidelity environment, and
high-fidelity data is scarce. As discussed in Section 3, so long as there is a statistical relationship
between the random variables of interest $X^{\pi_\theta}_{\tau_h}$ and $X^{\pi_\theta}_{\tau_l}$ (i.e., $\rho^2(X^{\pi_\theta}_{\tau_h}, X^{\pi_\theta}_{\tau_l})$ is non-negligible), the
MFPG framework will reduce the variance of the policy gradient estimates (w.r.t. the high-fidelity
environment) without introducing bias. To demonstrate this point, we examine a situation in which
the low-fidelity and high-fidelity rewards are drastically different: the reward function in the low-
fidelity environment is the negative of that from the high-fidelity environment. Figure 6 illustrates
the performance of the multi-fidelity REINFORCE algorithm, and compares it to the performance
of a REINFORCE policy trained using only high-fidelity data. We also plot the performance of a
REINFORCE policy that is trained only on low-fidelity data, and then evaluated in the high-fidelity
environment. Each baseline has access to the same amount of (either high-fidelity or low-fidelity)
data as our MFPG approach.
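The role of the squared correlation can be made explicit with the standard control-variate identity. The derivation below is a sketch under two simplifying assumptions that are not taken from this section: the corrected sample has the form $Z = X_h + c\,(X_l - \mu_l)$, and the low-fidelity mean $\mu_l$ is known exactly rather than estimated. The shorthand $X_h$, $X_l$ stands for $X^{\pi_\theta}_{\tau_h}$, $X^{\pi_\theta}_{\tau_l}$.

```latex
\begin{align*}
\mathbb{E}[Z] &= \mathbb{E}[X_h] + c\,\bigl(\mathbb{E}[X_l] - \mu_l\bigr) = \mathbb{E}[X_h]
  && \text{(unbiased for any } c\text{)} \\
\mathrm{Var}(Z) &= \mathrm{Var}(X_h) + c^2\,\mathrm{Var}(X_l) + 2c\,\mathrm{Cov}(X_h, X_l) \\
c^* &= -\frac{\mathrm{Cov}(X_h, X_l)}{\mathrm{Var}(X_l)}
  && \text{(minimizer of the quadratic in } c\text{)} \\
\mathrm{Var}(Z)\big|_{c = c^*} &= \mathrm{Var}(X_h) - \frac{\mathrm{Cov}(X_h, X_l)^2}{\mathrm{Var}(X_l)}
  = \bigl(1 - \rho^2(X_h, X_l)\bigr)\,\mathrm{Var}(X_h)
\end{align*}
```

Because only $\rho^2$ appears in the final expression, a strongly negative correlation reduces variance just as much as a strongly positive one; the sign of $\rho$ is absorbed into the sign of $c^*$.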
As expected, due to the significant difference between the tasks of these two environments, the
baseline trained only on low-fidelity data is entirely ineffective: instead of learning to move forward,
the agent learns to end the episode as quickly as possible. Meanwhile, similar to what is observed in
Fig. 4b, the baseline that only uses high-fidelity data is unable to learn effectively from the limited
number of samples. However, the MFPG algorithm is able to combine these data sources, each of
which alone is not sufficient for effective training, in order to learn a highly performant policy.
[Figure 6: two panels plotting Reward against Gradient Steps (0 to 1·10^4). Panels: (a) REINFORCE Hopper-v5; (b) REINFORCE Hopper-v5 (Zoomed In). Legend: MFPG (ours); High-Fid. only; Low-Fid. only.]
Figure 6: Multi-fidelity REINFORCE performance when the low-fidelity and high-fidelity environment
rewards are drastically different. Panel (a) compares the MFPG algorithm with baselines that
train using only low- (or high-)fidelity data. Panel (b) provides a magnified view of panel (a)
to highlight the performance of the baseline that trains using only low-fidelity data. While neither
low-fidelity data nor high-fidelity data are sufficient for effective training, the MFPG algorithm is
able to combine the data sources to learn an effective policy.
[Figure 7: two panels plotting Reward against High-Fidelity Env. Steps (0 to 1·10^6). Panels: (a) REINFORCE CartPole-v1; (b) REINFORCE InvertedPendulum-v5. Legend: With Policy Reparameterization Trick; Without Policy Reparameterization Trick.]
Figure 7: Impact of the policy reparameterization trick on MFPG algorithm performance.
Intuitively, although the values of $X^{\pi_\theta}_{\tau_l}$ are entirely different from those of $X^{\pi_\theta}_{\tau_h}$, they are negatively
correlated in this example. The MFPG algorithm takes advantage of this relationship to compute
a correction for the low-fidelity estimator $\hat{\mu}_l$, using only a small number of high-fidelity samples.
Although this is an extreme example in the sense that the low- and high-fidelity tasks are polar
opposites of each other, it highlights a useful feature of the MFPG framework: the low-fidelity rewards
and dynamics might be very different from the target environment of interest (making direct
sim-to-real transfer infeasible), and yet still provide useful information for multi-fidelity policy training.
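A toy simulation illustrates why a negated low-fidelity signal is still informative. The sketch below is an assumption-laden stand-in rather than the Hopper experiment: the returns are synthetic scalars with true high-fidelity mean 1.0, the low-fidelity return is a noisy negative of the high-fidelity one, the coefficient is estimated from each batch with the usual $-\mathrm{Cov}/\mathrm{Var}$ formula, and all names (`sample_pair`, `N_HIGH`, `N_LOW`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_HIGH, N_LOW, TRIALS = 100, 15_000, 500

def sample_pair(n):
    """Toy returns: the low-fidelity return is a noisy negative of the high-fidelity one."""
    shared = rng.normal(size=n)                       # randomness common to both fidelities
    x_h = 1.0 + shared + 0.3 * rng.normal(size=n)     # high-fidelity return, true mean 1.0
    x_l = -(1.0 + shared) + 0.3 * rng.normal(size=n)  # negated low-fidelity return
    return x_h, x_l

plain, corrected, coeffs = [], [], []
for _ in range(TRIALS):
    x_h, x_l = sample_pair(N_HIGH)
    _, x_l_extra = sample_pair(N_LOW)                 # cheap extra low-fidelity samples
    mu_l_hat = x_l_extra.mean()
    # Batch estimate of the coefficient: minus covariance over low-fidelity variance.
    c = -np.cov(x_h, x_l)[0, 1] / np.var(x_l, ddof=1)
    plain.append(x_h.mean())
    corrected.append(np.mean(x_h + c * (x_l - mu_l_hat)))
    coeffs.append(c)

var_plain = np.var(plain, ddof=1)
var_corrected = np.var(corrected, ddof=1)
mean_c = float(np.mean(coeffs))
mean_corrected = float(np.mean(corrected))
print(f"mean coefficient: {mean_c:.3f}")
print(f"variance without correction: {var_plain:.5f}")
print(f"variance with correction:    {var_corrected:.5f}")
```

The estimated coefficient comes out positive here, which flips the sign of the negatively correlated low-fidelity term before it is added to the high-fidelity samples. The corrected estimate stays centered near the true high-fidelity mean while fluctuating less across trials.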
4.5 The Policy Reparameterization Trick is of Critical Importance to the MFPG Approach
This experiment is designed to validate claim C5: The proposed policy reparameterization trick
for sampling correlated trajectories across multi-fidelity environments is crucial to the perfor-
mance of the proposed approach. Figure 7 shows an ablation study in which we remove the
reparameterization-based action sampling described in Section 3. We demonstrate the importance
of this portion of the MFPG framework within both discrete (Fig. 7a) and continuous (Fig. 7b) ac-
tion spaces. Recall that this policy reparameterization trick is an important element of Algorithm 1,
which enables correlated action sampling along two or more trajectories, even when differences in
dynamics cause the policy to be evaluated on different states along those trajectories. The orange
lines in Fig. 7 illustrate our approach, while the gray lines illustrate the result of instead
independently sampling actions from the policy $\pi_\theta$. We observe that by increasing the correlation
of the sampled low- and high-fidelity trajectories, this policy reparameterization trick significantly
improves the performance of the approach.
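One common way to realize correlated action sampling is to share the underlying noise between the two fidelities: a Gaussian policy reuses the same standard-normal draw, and a categorical policy reuses the same Gumbel noise (the Gumbel-max trick). The sketch below illustrates that idea only; it is not a reproduction of Algorithm 1, and the policy outputs (means, standard deviation, logits) are made-up values standing in for a policy evaluated at slightly different high- and low-fidelity states.

```python
import numpy as np

def gaussian_action(mean, std, eps):
    """Continuous action via reparameterization: a = mean + std * eps."""
    return mean + std * eps

def gumbel_max_action(logits, gumbel_noise):
    """Discrete action via the Gumbel-max trick: argmax(logits + g)."""
    return np.argmax(logits + gumbel_noise, axis=-1)

rng = np.random.default_rng(0)
n = 5_000

# Continuous case: same noise for both fidelities vs. fresh noise for low fidelity.
eps = rng.normal(size=n)
a_h = gaussian_action(0.20, 0.5, eps)
a_l_shared = gaussian_action(0.25, 0.5, eps)
a_l_indep = gaussian_action(0.25, 0.5, rng.normal(size=n))
corr_shared = np.corrcoef(a_h, a_l_shared)[0, 1]
corr_indep = np.corrcoef(a_h, a_l_indep)[0, 1]

# Discrete case: how often the two fidelities pick the same action.
logits_h = np.array([0.0, 0.2])
logits_l = np.array([0.1, 0.1])
g = rng.gumbel(size=(n, 2))
d_h = gumbel_max_action(logits_h, g)
agree_shared = np.mean(d_h == gumbel_max_action(logits_l, g))
agree_indep = np.mean(d_h == gumbel_max_action(logits_l, rng.gumbel(size=(n, 2))))

print(f"continuous: corr shared={corr_shared:.3f}, independent={corr_indep:.3f}")
print(f"discrete:   agreement shared={agree_shared:.3f}, independent={agree_indep:.3f}")
```

With shared noise the two action streams stay closely aligned even though the policy outputs differ, whereas independent sampling leaves them essentially unrelated. Better-aligned actions are what allow the resulting low- and high-fidelity trajectories to be correlated.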
4.6 Full Results
Tables 1 and 2 show complete results for all the experimental settings. For approaches using the
same batch size and total number of high-fidelity samples, we show the final reward and the reward
at another intermediate checkpoint. For baselines granted more high-fidelity samples and larger
batch sizes, we only show the final reward since the number of gradient steps varies between meth-
ods with different batch sizes (for a fixed number of environment steps). For PPO baselines with
more high-fidelity samples, we directly use the reward results reported in Stable-Baselines3 (Stable-
Baselines3), divided by a factor of 2 since their episode length is twice as long as ours.
Note that in tasks where states vary substantially throughout an episode (e.g. HalfCheetah), it is
useful to roll out the low-fidelity trajectories only for relatively short time horizons (e.g., 15 steps),
which increases the correlation between low- and high-fidelity trajectories. Similar techniques have
been used in model-based RL (Janner et al., 2019; Levy et al., 2024). This may introduce bias,
however, into the policy gradient estimator; further analysis is an important topic of future work.
Table 1: REINFORCE: Mean Episode Reward. Numbers within parentheses denote rewards
achieved by the variant of REINFORCE with state values subtracted.

| Env. (continuous action) | Single-fid., more high-fid. data | Steps | Single-fid., limited high-fid. data | Multi-fid. (ours), limited high-fid. data, 90% gravity | Multi-fid. (ours), limited high-fid. data, 70% gravity |
|---|---|---|---|---|---|
| Inv. Pendulum | 449.87 (467.32) | 200K | 395.53 (433.93) | 485.49 (494.26) | 466.43 (483.33) |
| | | 1M | 236.94 (325.85) | 452.63 (483.26) | 448.90 (477.36) |
| Hopper | 454.95 (686.53) | 200K | 257.15 (283.12) | 777.43 (694.28) | 626.96 (615.22) |
| | | 1M | 349.90 (298.34) | 906.36 (916.40) | 670.04 (628.86) |
| Half Cheetah | 584.07 (462.18) | 200K | -134.44 (-100.98) | 216.39 (156.85) | 121.60 (71.01) |
| | | 1M | 135.72 (253.62) | 756.35 (988.14) | 591.57 (438.62) |
| Swimmer | 30.67 (37.84) | 150K | 30.83 (30.07) | 34.25 (36.19) | 32.11 (26.26) |
| | | 300K | 24.36 (31.23) | 36.60 (45.99) | 39.94 (35.17) |
| Walker | 302.15 (650.10) | 200K | 260.89 (266.25) | 292.88 (285.27) | 292.32 (286.66) |
| | | 1M | 261.63 (288.99) | 549.09 (792.84) | 463.47 (623.93) |

| Env. (discrete action) | Single-fid., more high-fid. data | Steps | Single-fid., limited high-fid. data | Multi-fid. (ours), limited high-fid. data, linearized low. model |
|---|---|---|---|---|
| Cart Pole | 433.59 (496.77) | 200K | 353.53 (403.06) | 480.36 (482.90) |
| | | 1M | 327.91 (464.53) | 490.63 (493.73) |
Table 2: PPO: Mean Episode Reward. The results of the single-fidelity baseline with more
high-fidelity data are from Stable-Baselines3 (Stable-Baselines3) divided by 2 to align episode lengths.

| Env. | Single-fid., more high-fid. data | Steps | Single-fid., limited high-fid. data | Multi-fid. (ours), limited high-fid. data, 90% gravity | Multi-fid. (ours), limited high-fid. data, 70% gravity |
|---|---|---|---|---|---|
| Inv. Pendulum | N/A | 200K | 340.25 | 496.70 | 489.10 |
| | | 1M | 265.21 | 500 | 499.67 |
| Hopper | 783.5 | 200K | 277.60 | 536.61 | 397.56 |
| | | 1M | 337.27 | 714.49 | 527.40 |
| Half Cheetah | 988.0 | 200K | 280.37 | 446.80 | 223.90 |
| | | 1M | 654.68 | 767.12 | 540.92 |
| Swimmer | N/A | 150K | 33.92 | 40.18 | 35.68 |
| | | 300K | 36.06 | 41.14 | 40.10 |
| Walker | 615.0 | 200K | 223.70 | 363.37 | 337.12 |
| | | 1M | 451.42 | 683.55 | 485.54 |
5 Conclusions
We present a multi-fidelity policy gradient (MFPG) framework, which mixes a small amount of
potentially expensive high-fidelity data with a larger volume of cheap lower-fidelity data to construct
unbiased, variance-reduced estimators for on-policy policy gradients. We use this general framework
to propose multi-fidelity variants of two policy gradient algorithms, REINFORCE and proximal
policy optimization. Through experiments in a suite of simulated robotics benchmark tasks, we
demonstrate that the proposed multi-fidelity policy gradient algorithms: i) significantly reduce the
variance of policy gradient estimates, ii) improve performance (up to 3.9× higher reward) and data
efficiency in comparison with baselines that only use high-fidelity data, and iii) learn effectively
even when the low-fidelity environment is drastically different from the high-fidelity environment.
In summary, the proposed framework offers a novel paradigm for incorporating multi-fidelity data
in policy gradient algorithms for reinforcement learning, and provides promising directions for ef-
ficient sim-to-real transfer and principled approaches to managing the trade-off between policy per-
formance and data collection costs. Future work will investigate additional techniques to improve
the performance of multi-fidelity actor-critic methods, and to deploy the proposed framework for
sim-to-real transfer on robotic hardware.
Acknowledgments
We thank Haoran Xu, Mustafa Karabag, Brett Barkley, and Jacob Levy for their helpful discus-
sions. This work was supported in part by the National Science Foundation (NSF) under Grants
2214939 and 2409535, in part by the Defense Advanced Research Projects Agency (DARPA) under
the Transfer Learning from Imprecise and Abstract Models to Autonomous Technologies (TIAMAT)
program under grant number HR0011-24-9-0431, and in part by the DEVCOM Army Research Lab-
oratory under Cooperative Agreement Numbers W911NF-23-2-0011 and W911NF-23-2-0211. The
views and conclusions contained in this document are those of the authors and should not be in-
terpreted as representing the official policies, either expressed or implied, of the U.S. Government.
The U.S. Government is authorized to reproduce and distribute reprints for Government purposes
notwithstanding any copyright notation herein.
References
Pieter Abbeel, Morgan Quigley, and Andrew Y Ng. Using inaccurate models in reinforcement
learning. In Proceedings of the 23rd international conference on Machine learning , pp. 1–8,
2006.
Akash Agrawal and Christopher McComb. Adaptive learning of design strategies over non-
hierarchical multi-fidelity models via policy alignment. arXiv preprint arXiv:2411.10841 , 2024.
Karol Arndt, Murtaza Hazara, Ali Ghadirzadeh, and Ville Kyrki. Meta reinforcement learning for
sim-to-real domain adaptation. In 2020 IEEE international conference on robotics and automa-
tion (ICRA) , pp. 2725–2731. IEEE, 2020.
Emmanuel Bengio, Moksh Jain, Maksym Korablyov, Doina Precup, and Yoshua Bengio. Flow
network based generative models for non-iterative diverse candidate generation. Advances in
Neural Information Processing Systems , 34:27381–27394, 2021.
Sahil Bhola, Suraj Pawar, Prasanna Balaprakash, and Romit Maulik. Multi-fidelity reinforce-
ment learning framework for shape optimization. Journal of Computational Physics , 482:
112018, 2023. ISSN 0021-9991. DOI: https://doi.org/10.1016/j.jcp.2023.112018. URL
https://www.sciencedirect.com/science/article/pii/S0021999123001134.
Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan Ratliff,
and Dieter Fox. Closing the sim-to-real loop: Adapting simulation randomization with real world
experience. In 2019 International Conference on Robotics and Automation (ICRA) , pp. 8973–
8979, 2019. DOI: 10.1109/ICRA.2019.8793789.
Ching-An Cheng, Xinyan Yan, and Byron Boots. Trajectory-wise control variates for variance
reduction in policy gradient methods. In Conference on Robot Learning , pp. 1379–1394. PMLR,
2020.
Mark Cutler, Thomas J. Walsh, and Jonathan P. How. Real-world reinforcement learning via
multifidelity simulators. IEEE Transactions on Robotics, 31(3):655–671, 2015. DOI:
10.1109/TRO.2015.2419431.
Longchao Da, Justin Turnau, Thirulogasankar Pranav Kutralingam, Alvaro Velasquez, Paulo
Shakarian, and Hua Wei. A survey of sim-to-real methods in rl: Progress, prospects and chal-
lenges with foundation models. arXiv preprint arXiv:2502.13187 , 2025.
Peter Dayan. Reinforcement comparison. In Connectionist Models , pp. 45–51. Elsevier, 1991.
Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco
Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Mag-
netic control of tokamak plasmas through deep reinforcement learning. Nature , 602(7897):414–
419, 2022.
Raj Ghugare, Santiago Miret, Adriana Hugessen, Mariano Phielipp, and Glen Berseth. Search-
ing for high-value molecules using reinforcement learning and transformers. arXiv preprint
arXiv:2310.02902 , 2023.
Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. Backpropagation
through the void: Optimizing control variates for black-box gradient estimation. In International
Conference on Learning Representations , 2018.
Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient
estimates in reinforcement learning. Journal of Machine Learning Research , 5(Nov):1471–1530,
2004.
Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, and Sergey Levine. Q-
prop: Sample-efficient policy gradient with an off-policy critic. In International Conference on
Learning Representations , 2017.
Jiawei Huang and Nan Jiang. From importance sampling to doubly robust policy gradient. In
International Conference on Machine Learning , pp. 4434–4443. PMLR, 2020.
Iris AM Huijben, Wouter Kool, Max B Paulus, and Ruud JG Van Sloun. A review of the gumbel-
max trick and its extensions for discrete stochasticity in machine learning. IEEE transactions on
pattern analysis and machine intelligence , 45(2):1353–1371, 2022.
Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-
based policy optimization. Advances in neural information processing systems , 32, 2019.
Elia Kaufmann, Leonard Bauersfeld, Antonio Loquercio, Matthias Müller, Vladlen Koltun, and
Davide Scaramuzza. Champion-level drone racing using deep reinforcement learning. Nature ,
620(7976):982–987, 2023.
Sami Khairy and Prasanna Balaprakash. Multi-fidelity reinforcement learning with control variates.
Neurocomputing , 597:127963, 2024.
Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 , 2013.
Jacob Levy, Tyler Westenbroek, and David Fridovich-Keil. Learning to walk from three minutes of
real-world data with semi-structured dynamics models. arXiv preprint arXiv:2410.09163 , 2024.
Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent
control variates for policy optimization via stein identity. In International Conference on Learning
Representations , 2018.
Jiafei Lyu, Kang Xu, Jiacheng Xu, Mengbei Yan, Jingwen Yang, Zongzhang Zhang, Chenjia Bai,
Zongqing Lu, and Xiu Li. Odrl: A benchmark for off-dynamics reinforcement learning. In The
Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks
Track , 2024. URL https://openreview.net/forum?id=ap4x1kArGy .
Alonso Marco, Felix Berkenkamp, Philipp Hennig, Angela P Schoellig, Andreas Krause, Stefan
Schaal, and Sebastian Trimpe. Virtual vs. real: Trading off simulations and physical experiments
in reinforcement learning with bayesian optimization. In 2017 IEEE International Conference on
Robotics and Automation (ICRA) , pp. 1557–1563. IEEE, 2017.
Barry L Nelson. On control variate estimators. Computers & Operations Research , 14(3):219–225,
1987.
Art B. Owen. Monte Carlo theory, methods and examples. https://artowen.su.domains/mc/, 2013.
Sergey Pankov. Reward-estimation variance elimination in sequential decision processes. arXiv
preprint arXiv:1811.06225 , 2018.
Benjamin Peherstorfer, Karen Willcox, and Max Gunzburger. Survey of multifidelity methods in
uncertainty propagation, inference, and optimization. Siam Review , 60(3):550–591, 2018.
Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In 2006 IEEE/RSJ International
Conference on Intelligent Robots and Systems, pp. 2219–2225, 2006. DOI:
10.1109/IROS.2006.282564.
Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah
Dormann. Stable-baselines3: Reliable reinforcement learning implementations. Journal of
Machine Learning Research, 22(268):1–8, 2021. URL
http://jmlr.org/papers/v22/20-1364.html.
Fabio Ramos, Rafael Carvalhaes Possas, and Dieter Fox. Bayessim: adaptive domain randomization
via probabilistic inference for robotics simulators. arXiv preprint arXiv:1906.01728 , 2019.
Andrei A Rusu, Matej Večerík, Thomas Rothörl, Nicolas Heess, Razvan Pascanu, and Raia Hadsell.
Sim-to-real robot learning from pixels with progressive nets. In Conference on robot learning ,
pp. 262–270. PMLR, 2017.
Gilhyun Ryou, Geoffrey Wang, and Sertac Karaman. Multi-fidelity reinforcement learning for time-
optimal quadrotor re-planning. arXiv preprint arXiv:2403.08152 , 2024.
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-
dimensional continuous control using generalized advantage estimation. arXiv preprint
arXiv:1506.02438 , 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347 , 2017.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang,
Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical
reasoning in open language models. arXiv preprint arXiv:2402.03300 , 2024.
Laura Smith, J. Chase Kew, Xue Bin Peng, Sehoon Ha, Jie Tan, and Sergey Levine. Legged
robots that keep on learning: Fine-tuning locomotion policies in the real world. In 2022
International Conference on Robotics and Automation (ICRA) , pp. 1593–1599, 2022. DOI:
10.1109/ICRA46639.2022.9812166.
Stable-Baselines3. Performance check (continuous actions), GitHub issue 48. URL
https://github.com/DLR-RM/stable-baselines3/issues/48.
Varun Suryan, Nahush Gondhalekar, and Pratap Tokekar. Multi-fidelity reinforcement learning with
gaussian processes. arXiv preprint arXiv:1712.06489 , 2017.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction . MIT press, 2018.
Richard Stuart Sutton. Temporal credit assignment in reinforcement learning . University of Mas-
sachusetts Amherst, 1984.
Chen Tang, Ben Abbatematteo, Jiaheng Hu, Rohan Chandra, Roberto Martín-Martín, and Peter
Stone. Deep reinforcement learning for robotics: A survey of real-world successes. Annual
Review of Control, Robotics, and Autonomous Systems , 8, 2024.
Matthew E Taylor, Peter Stone, and Yaxin Liu. Transfer learning via inter-task mappings for tem-
poral difference learning. Journal of Machine Learning Research , 8(9), 2007.
Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control.
In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
IEEE, 2012. DOI: 10.1109/IROS.2012.6386109.
Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu,
Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard
interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032 , 2024.
George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard Turner, Zoubin Ghahramani, and Sergey
Levine. The mirage of action-dependent baselines in reinforcement learning. In International
conference on machine learning , pp. 5015–5024. PMLR, 2018.
Lex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning.
In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pp. 538–545,
2001.
RJ Williams. Toward a theory of reinforcement-learning connectionist systems. Technical Report ,
1988.
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement
learning. Machine learning , 8:229–256, 1992.
Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M Bayen, Sham Kakade,
Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent
factorized baselines. In International Conference on Learning Representations , 2018.
Patrick Yin, Tyler Westenbroek, Simran Bagaria, Kevin Huang, Ching-an Cheng, Andrey Kobolov,
and Abhishek Gupta. Rapidly adapting policies to the real world via simulation-guided fine-
tuning. arXiv preprint arXiv:2502.02705 , 2025.
Tingting Zhao, Hirotaka Hachiya, Gang Niu, and Masashi Sugiyama. Analysis and improvement of
policy gradient estimation. Advances in Neural Information Processing Systems , 24, 2011.
Wenshuai Zhao, Jorge Peña Queralta, and Tomi Westerlund. Sim-to-real transfer in deep rein-
forcement learning for robotics: a survey. In 2020 IEEE symposium series on computational
intelligence (SSCI) , pp. 737–744. IEEE, 2020.
Yuanyi Zhong, Yuan Zhou, and Jian Peng. Coordinate-wise control variates for deep policy gradi-
ents. arXiv preprint arXiv:2107.04987 , 2021.
Supplementary Materials
The following content was not necessarily subject to peer review.
Computation of Control Variates Coefficient c∗
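We mention in Section 3 that the control variates coefficient $c^*$ is calculated as
$$c^* = -\frac{\mathrm{Cov}(X^{\pi_\theta}_{\tau_h}, X^{\pi_\theta}_{\tau_l})}{\mathrm{Var}(X^{\pi_\theta}_{\tau_l})} = -\rho(X^{\pi_\theta}_{\tau_h}, X^{\pi_\theta}_{\tau_l}) \sqrt{\frac{\mathrm{Var}(X^{\pi_\theta}_{\tau_h})}{\mathrm{Var}(X^{\pi_\theta}_{\tau_l})}},$$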
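where $\rho(\cdot,\cdot)$ denotes the Pearson correlation coefficient between the two random variables. We
compute a moving average for the coefficient $c^*$ over the collected sample batches with a
hyperparameter $\alpha_{\mathrm{ma}}$:
$$c^* = -\Bigl[\alpha_{\mathrm{ma}}\,\rho(X^{\pi_\theta}_{\tau_h}, X^{\pi_\theta}_{\tau_l})_{\mathrm{old}} + (1-\alpha_{\mathrm{ma}})\,\rho(X^{\pi_\theta}_{\tau_h}, X^{\pi_\theta}_{\tau_l})_{\mathrm{new}}\Bigr] \sqrt{\frac{\alpha_{\mathrm{ma}}\,\mathrm{Var}(X^{\pi_\theta}_{\tau_h})_{\mathrm{old}} + (1-\alpha_{\mathrm{ma}})\,\mathrm{Var}(X^{\pi_\theta}_{\tau_h})_{\mathrm{new}}}{\alpha_{\mathrm{ma}}\,\mathrm{Var}(X^{\pi_\theta}_{\tau_l})_{\mathrm{old}} + (1-\alpha_{\mathrm{ma}})\,\mathrm{Var}(X^{\pi_\theta}_{\tau_l})_{\mathrm{new}}}},$$
where we drop the control variate when the estimated correlation coefficient is below a threshold
hyperparameter detailed below.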
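A minimal sketch of this moving-average update is given below. Several details are assumptions made for illustration and are not specified in the text: the running statistics are initialized from the first batch, the threshold is applied to the absolute value of the averaged correlation (so that strongly negative correlations are kept), and dropping the control variate is represented by returning a coefficient of zero. The class name `MovingAverageCoefficient` is hypothetical.

```python
import numpy as np

class MovingAverageCoefficient:
    """Tracks moving averages of rho, Var(X_h), Var(X_l) and returns c*."""

    def __init__(self, alpha_ma=0.95, threshold=0.0):
        self.alpha_ma = alpha_ma
        self.threshold = threshold
        self.rho = None
        self.var_h = None
        self.var_l = None

    def update(self, x_h, x_l):
        # Statistics of the newest batch of paired high-/low-fidelity samples.
        rho_new = np.corrcoef(x_h, x_l)[0, 1]
        var_h_new = np.var(x_h, ddof=1)
        var_l_new = np.var(x_l, ddof=1)
        if self.rho is None:
            # Assumption: the first batch initializes the running statistics.
            self.rho, self.var_h, self.var_l = rho_new, var_h_new, var_l_new
        else:
            a = self.alpha_ma
            self.rho = a * self.rho + (1 - a) * rho_new
            self.var_h = a * self.var_h + (1 - a) * var_h_new
            self.var_l = a * self.var_l + (1 - a) * var_l_new
        # Assumption: threshold on |rho|; a zero coefficient drops the control variate.
        if abs(self.rho) < self.threshold:
            return 0.0
        return float(-self.rho * np.sqrt(self.var_h / self.var_l))

rng = np.random.default_rng(0)
tracker = MovingAverageCoefficient(alpha_ma=0.95, threshold=0.0)
x_h = rng.normal(size=100)
c_perfect = tracker.update(x_h, 2.0 * x_h)   # perfectly correlated toy batch
c_next = tracker.update(x_h, -x_h + 0.1 * rng.normal(size=100))
print(c_perfect, c_next)
```

For the first toy batch the correlation is 1 and the variance ratio is 1/4, so the coefficient is approximately -0.5. The second batch is negatively correlated, but because of the heavy weight on the old statistics the averaged coefficient moves only slightly.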
Hyperparameter Values
As mentioned in Section 4.6, for some tasks, we roll out low-fidelity trajectories for shorter horizons.
We use a shorter horizon of 15 steps in the HalfCheetah, Hopper, and Walker tasks. For all other tasks, we
do not constrain the rollout horizons. Tables 3 and 4 give all other hyperparameters for
single-fidelity/MFPG REINFORCE and PPO, respectively.
Table 3: Hyperparameters for REINFORCE (All Tasks)

| Group | Parameter | Value |
|---|---|---|
| Shared for all algorithms | optimizer | RMSProp |
| | learning rate | $7 \cdot 10^{-4}$ |
| | discount factor ($\gamma$) | 0.99 |
| | value loss coefficient (for value subtraction case) | 0.5 |
| | network architecture (multi-layer perceptron, 2 hidden layers) | [64, 64] |
| | value network for value subtraction (multi-layer perceptron, 2 hidden layers) | [64, 64] |
| | max gradient norm | 0.5 |
| Multi-Fidelity REINFORCE | batch size | 100 |
| | total high-fidelity training steps | $10^6$ |
| | $c^*$ threshold value for dropping control variate | 0 |
| | $\alpha_{\mathrm{ma}}$ | 0.95 |
| Single-Fidelity REINFORCE | batch size for more high-fidelity data case | 1000 |
| | total high-fidelity training steps for more high-fidelity data case | $10^7$ |
| | batch size for limited high-fidelity data case | 100 |
| | total high-fidelity training steps for limited high-fidelity data case | $10^6$ |
Table 4: Hyperparameters for PPO (All Tasks)

| Group | Parameter | Value |
|---|---|---|
| Shared for all algorithms | optimizer | RMSProp |
| | learning rate | $1.5 \cdot 10^{-4}$ |
| | discount factor ($\gamma$) | 0.99 |
| | GAE coefficient | 0.95 |
| | epochs | 10 |
| | minibatch size | 64 |
| | clipping range | [0.8, 1.2] |
| | value loss coefficient | 0.5 |
| | actor network architecture (multi-layer perceptron, 2 hidden layers) | [64, 64] |
| | critic network architecture (multi-layer perceptron, 2 hidden layers) | [64, 64] |
| Shared Multi-Fidelity PPO hyperparameters for all tasks | batch size | 100 |
| | total high-fidelity training steps | $10^6$ |
| Shared hyperparameters for all tasks except HalfCheetah-v5 | $c^*$ threshold value for dropping control variate | 0 |
| | $\alpha_{\mathrm{ma}}$ | 0.95 |
| HalfCheetah-v5 specific hyperparameters | $c^*$ threshold value for dropping control variate | 0.2 |
| | $\alpha_{\mathrm{ma}}$ | 0.92 |
| Single-Fidelity PPO | batch size for more high-fidelity data case | 2048 |
| | total high-fidelity training steps for more high-fidelity data case | $2 \cdot 10^7$ |
| | batch size for limited high-fidelity data case | 100 |
| | total high-fidelity training steps for limited high-fidelity data case | $10^6$ |