arxiv

Paper 2503.05696

Multi-Fidelity Policy Gradient Algorithms

Authors: Xinjie Liu, Cyrus Neary, Kushagra Gupta, Christian Ellis, Ufuk Topcu, David Fridovich-Keil

Published: 2025-03-07

Abstract:

Many reinforcement learning (RL) algorithms require large amounts of data, prohibiting their use in applications where frequent interactions with operational systems are infeasible, or high-fidelity simulations are expensive or unavailable. Meanwhile, low-fidelity simulators--such as reduced-order models, heuristic reward functions, or generative world models--can cheaply provide useful data for RL training, even if they are too coarse for direct sim-to-real transfer. We propose multi-fidelity policy gradients (MFPGs), an RL framework that mixes a small amount of data from the target environment with a large volume of low-fidelity simulation data to form unbiased, reduced-variance estimators (control variates) for on-policy policy gradients. We instantiate the framework by developing multi-fidelity variants of two policy gradient algorithms: REINFORCE and proximal policy optimization. Experimental results across a suite of simulated robotics benchmark problems demonstrate that when target-environment samples are limited, MFPG achieves up to 3.9x higher reward and improves training stability when compared to baselines that only use high-fidelity data. Moreover, even when the baselines are given more high-fidelity samples--up to 10x as many interactions with the target environment--MFPG continues to match or outperform them. Finally, we observe that MFPG is capable of training effective policies even when the low-fidelity environment is drastically different from the target environment. MFPG thus not only offers a novel paradigm for efficient sim-to-real transfer but also provides a principled approach to managing the trade-off between policy performance and data collection costs.

Paper Content:

Multi-Fidelity Policy Gradient Algorithms

Xinjie Liu (1,†), Cyrus Neary (1,2,3,†), Kushagra Gupta (1), Christian Ellis (1,4), Ufuk Topcu (1), David Fridovich-Keil (1)

{xinjie-liu,kushagrag,utopcu,dfk}@utexas.edu, cyrus.neary@mila.quebec, christian.ellis@austin.utexas.edu

(1) University of Texas at Austin, USA; (2) Mila – Quebec AI Institute, Canada; (3) Université de Montréal, Canada; (4) DEVCOM Army Research Laboratory, USA; (†) Equal Contribution

arXiv:2503.05696v1 [cs.LG] 7 Mar 2025

1 Introduction

Reinforcement learning (RL) algorithms offer significant capabilities in systems that work with unknown, or difficult-to-specify, dynamics and objectives. The flexibility and performance of RL algorithms have led to their adoption in applications as diverse as controlling plasma configurations in nuclear fusion reactors (Degrave et al., 2022), piloting high-speed aerial vehicles (Kaufmann et al., 2023), training reasoning capabilities in large language models (Shao et al., 2024), and searching large design spaces for the automated discovery of new molecules (Bengio et al., 2021; Ghugare et al., 2023). However, in many such applications, datasets must often be gathered from operational systems or from high-fidelity simulations. This requirement acts as a significant barrier to the development and deployment of RL policies: excessive interactions with operational systems are often either infeasible or unsafe, and generating simulated datasets for RL can be prohibitively expensive unless the simulations are both cheap to run and carefully designed to minimize the sim-to-real gap.

On the other hand, low-fidelity simulation tools capable of cheaply generating large volumes of data are often available. For example, reduced-order models, linearized dynamics, heuristic reward functions, and generative world models all output useful information for RL, even when approximations of the target dynamics and rewards are very coarse.

[Figure 1 diagram. Panels: "High-Fidelity Env.", "Low-Fidelity Env.", "MFPG Sampling Mechanism", "MFPG Policy Gradient Estimator", "Low-Fidelity Policy Gradient Estimator", "Apply Policy Gradient Step". The current policy $\pi_\theta$ produces $N_h$ correlated trajectory samples from both environments and $N_l$ low-fidelity trajectory samples, with $N_l \gg N_h$.]

Figure 1: The proposed multi-fidelity policy gradient (MFPG) framework. At each policy update step, MFPG combines a small amount of data from the target (high-fidelity) environment with a large volume of low-fidelity simulation data, thereby forming an unbiased, reduced-variance estimator for the policy gradient. For illustrative purposes, we use the REINFORCE gradient estimators in the figure. However, we note that the same overall workflow applies to other MFPG algorithms.

[Figure 2 plot. x-axis: Gradient Steps (0 to 1·10^4); y-axis: Reward (0 to 1,000). Legend: MFPG (ours), 10x less data; Single-Fidelity (high-fid. only, value subt.); Single-Fidelity (high-fid. only).]

Figure 2: REINFORCE on the MuJoCo Hopper task (episode length 500). Even when granted access to 10x fewer samples from the target environment per gradient step, the proposed MFPG improves learning speed and reward performance in comparison to standard REINFORCE algorithms (with and without state-value baseline subtraction) that only train on a single fidelity of data.

Towards enabling the training and deployment of highly-performant RL policies that would otherwise be infeasible, in this work we develop novel multi-fidelity RL algorithms. The proposed algorithms achieve an order of magnitude improvement in sample efficiency by mixing data from the target environment with data generated by lower fidelity simulations.
The end result not only offers a novel paradigm for sim-to-real transfer, but also provides a principled approach to managing the trade-off between policy performance and data collection costs.

More specifically, we present multi-fidelity policy gradients (MFPGs), a framework that leverages data from multiple sources to construct control variates (CVs), i.e., unbiased, reduced-variance estimators, of policy gradients (PGs) within on-policy methods. While the presented ideas and methods can be broadly applied within RL algorithms, in this work we apply the framework to two existing on-policy algorithms: REINFORCE (Williams, 1992) and proximal policy optimization (PPO) (Schulman et al., 2017).

Figure 1 illustrates the proposed approach. At each policy update step, the algorithm begins by sampling a relatively small number of trajectories from the target environment, which may correspond to real-world hardware or to a high-fidelity simulation. We then propose a method for sampling trajectories from the low-fidelity environments such that the resulting cumulative rewards and action likelihoods are highly correlated with those from the high-fidelity trajectories. This method hinges on a nuanced approach to action sampling and is critical to the success of our approach. Next, the algorithm uses the low-fidelity environments to sample a much larger quantity of uncorrelated trajectories, which it uses alongside the previously sampled trajectories to compute an unbiased estimate of the policy gradient to be applied. So long as the random variables corresponding to the policy gradient updates between the high and low-fidelity environments are correlated, the approach is guaranteed to reduce the variance of the policy gradient estimates. The impact of this variance reduction is a significant overall improvement to algorithm performance in comparison with standard RL approaches.
Figure 2 illustrates that even when the proposed multi-fidelity algorithms are trained using 10x fewer high-fidelity samples than the baseline methods, we empirically observe that the proposed MFPG approach can still learn faster and achieve a higher cumulative reward.

We further demonstrate the benefits of our approach through experiments on a suite of robotics control problems, which are commonly used as benchmarks for RL algorithms. Our experimental results demonstrate the following key points. i) The proposed multi-fidelity control variates drastically reduce the variance of policy gradient estimates. ii) When the quantity of high-fidelity samples is limited, the proposed multi-fidelity algorithms outperform their single-fidelity counterparts in all settings, and achieve up to 3.9x more reward. This is particularly beneficial in online RL settings, for which existing algorithms typically require many environment samples per gradient step. iii) The proposed algorithms continue to match or outperform their single-fidelity counterparts, even when those baselines are allowed access to substantially more high-fidelity data, e.g., up to 10x more total high-fidelity environment interactions throughout training. iv) The MFPG algorithms can learn effectively even when the low-fidelity environment is drastically different from the high-fidelity environment, and high-fidelity data is scarce. v) The proposed method to sample correlated multi-fidelity trajectories is critical to the performance of the proposed approach.

2 Related Work

Variance Reduction in RL via control variates. The method of control variates (CVs) is commonly used for variance reduction in Monte Carlo estimation (Owen, 2013).
In RL, particularly in PG methods (Williams, 1992), there is a long history of using CVs to reduce variance and accelerate learning, e.g., subtracting constant reward baselines (Sutton, 1984; Williams, 1988; Dayan, 1991; Williams, 1992) or state value functions (Weaver & Tao, 2001; Greensmith et al., 2004; Peters & Schaal, 2006; Zhao et al., 2011) from Monte Carlo returns. Even though state values are known not to be optimal for variance reduction (Greensmith et al., 2004), they are also central to modern algorithms, e.g., actor-critic methods (Schulman et al., 2017). More recently, additional CV baselines have been studied, such as vector-form CVs (Zhong et al., 2021), multi-step variance reduction along trajectories (Pankov, 2018; Cheng et al., 2020) and their more general form (Huang & Jiang, 2020), and state-action dependent baselines (Gu et al., 2017; Liu et al., 2018; Grathwohl et al., 2018; Wu et al., 2018), which offer better variance reduction under certain conditions (Tucker et al., 2018). The literature on using CVs for policy gradients almost exclusively focuses on single-environment settings. In this work, we consider the problem of fusing data from multi-fidelity environments to construct still lower-variance estimators, cf. Section 3. Moreover, the proposed multi-fidelity approach can also be readily combined with existing single-fidelity CVs for further improved performance, cf. Section 4.2.

To the best of our knowledge, the only existing work leveraging multi-fidelity CVs for RL is (Khairy & Balaprakash, 2024). Our approach differs in two main ways. First, we propose Monte Carlo PG and actor-critic algorithms for Markov decision processes (MDPs) with either continuous or discrete state and action spaces, whereas Khairy & Balaprakash (2024) estimate state-action values in tabular MDPs without function approximation.
Second, we propose a novel policy reparameterization trick that enables the sampling of correlated trajectories across multi-fidelity environments. Crucially, this sampling scheme improves the variance reduction of the proposed MFPG algorithms, and it directly supports continuous settings without restricting the MDPs. By contrast, Khairy & Balaprakash (2024) require action sequences to be matched across multi-fidelity environments, which requires additional assumptions about the MDP structure and transition dynamics.

Multi-fidelity RL and fine-tuning using high-fidelity data. Multi-fidelity methods for optimization, uncertainty propagation, and inference are well-studied in computational science and engineering, where it is often the case that multiple models for a problem of interest are available (Peherstorfer et al., 2018). Such methods provide principled statistical techniques to manage the computational costs of Monte Carlo simulations, without introducing unwanted biases. However, the adoption of such techniques by RL algorithms is significantly more limited. Two bodies of literature which are closely related to this work are multi-fidelity RL (Cutler et al., 2015) and fine-tuning RL policies using limited target-domain data, e.g., in sim-to-real transfer (Smith et al., 2022). Multi-fidelity RL aims to design training curricula, i.e., decide when to train at each simulation fidelity, in order to train highly-performant policies for the highest-fidelity environment with minimum simulation costs. Typically, training begins in the lowest-fidelity simulators, with proposed mechanisms to decide when to transition to other fidelities based on estimated uncertainty (Suryan et al., 2017; Cutler et al., 2015; Agrawal & McComb, 2024; Ryou et al., 2024; Bhola et al., 2023) in predicted actions, state-action values, dynamics, rewards, and/or information gain (Marco et al., 2017).
Similarly, in sim-to-real and transfer learning settings (Zhao et al., 2020; Tang et al., 2024; Da et al., 2025), the objective is to train a highly-performant policy for the real world using simulation data, sometimes supplemented with a small amount of real-world data. Many approaches leverage models or value functions trained in simulation (source domain) to bootstrap real-world (target domain) fine-tuning (Rusu et al., 2017; Arndt et al., 2020; Taylor et al., 2007; Smith et al., 2022) or guide real-world exploration (Yin et al., 2025) to improve sample efficiency. Additionally, real-world data can be used to refine the simulator itself (Abbeel et al., 2006; Chebotar et al., 2019; Ramos et al., 2019). The works above introduce new training paradigms and exploration strategies with data from different domains. In contrast, our approach proposes a new policy gradient estimator that incorporates cross-domain data. The proposed approach can be seamlessly integrated into existing paradigms that involve fine-tuning within target domains.

3 Multi-Fidelity Policy Gradients

Our objective is to develop RL algorithms capable of leveraging data generated by multiple distinct environments in order to efficiently learn a policy that achieves high performance in a target environment of interest.

Modeling the multi-fidelity environments. We begin by considering a high-fidelity environment and a low-fidelity environment, each modeled by a Markov decision process (MDP). In particular, the high-fidelity environment is an MDP $M_h = (S, A, \Delta_{s_I}, \gamma, p_h, R_h)$, which we assume represents either an accurate simulator of the target environment, or the target environment itself.
Here, $S$ is the set of environment states, $A$ is its set of actions, $\Delta_{s_I}$ is an initial distribution over states, $\gamma \in [0,1]$ is a discount factor, $p_h(s' \mid s, a)$ is the probability of transitioning to state $s'$ from state $s$ under action $a$, and $R_h(s, a, s')$ is the distribution from which the reward $r \sim R_h(s, a, s')$ is drawn for a given state-action-state triplet. Similarly, the low-fidelity environment is an MDP $M_l = (S, A, \Delta_{s_I}, \gamma, p_l, R_l)$, whose transition dynamics $p_l$ and reward function $R_l$ differ from those of $M_h$, and do not necessarily accurately represent the target environment. We assume that the cost of generating sample trajectories in $M_h$ is significantly higher than that of generating trajectories in $M_l$, as may be the case if $M_h$ were to represent real-world hardware while $M_l$ were to represent cheap simulations.

Our objective is to learn a stochastic policy $\pi_\theta(a \mid s)$, parameterized by $\theta \in \mathbb{R}^d$, that achieves a high expected total reward in the high-fidelity environment $M_h$. Fixing a policy $\pi_\theta$ defines a distribution over trajectories in both $M_h$ and $M_l$. We denote trajectories in $M_h$ by $\tau^h = s^h_0, a^h_0, r^h_0, s^h_1, a^h_1, r^h_1, \ldots, s^h_T$, where $s^h_0 \sim \Delta_{s_I}$, $a^h_t \sim \pi_\theta(\cdot \mid s^h_t)$, $s^h_{t+1} \sim p_h(\cdot \mid s^h_t, a^h_t)$, and $r^h_t \sim R_h(s^h_t, a^h_t, s^h_{t+1})$. Similarly, we use $\tau^l$ to denote trajectories sampled in $M_l$.

We remark that the proposed method may be extended to integrate data from an arbitrary number of environments, each of which simulates the target environment with a different level of fidelity. One may also define $M_h$ and $M_l$ to have different sets of states and actions, as well as different initial distributions and discount factors. Furthermore, the framework can be readily extended to infinite-horizon cases. However, none of these extensions significantly enhances the primary conceptual and experimental contributions of this work, and so we leave them to future research.

Policy gradient algorithms.
PG algorithms aim to maximize the performance measure $J(\theta) := \mathbb{E}_{\tau \sim M(\pi_\theta)}[R(\tau)]$, the expected total reward along trajectories $\tau$ sampled from environment $M$ under policy $\pi_\theta$. They do so by using stochastic estimates of the policy gradient $\nabla_\theta J(\theta)$ to perform gradient ascent directly on the policy parameters (Sutton & Barto, 2018). For example, the REINFORCE algorithm (Williams, 1988; 1992) uses Monte Carlo estimates of $\mathbb{E}_{\tau \sim M(\pi_\theta)}[\nabla_\theta X^{\pi_\theta}_\tau]$ to estimate the policy gradient, where $X^{\pi_\theta}_\tau := \frac{1}{T}\sum_{t=0}^{T-1} G_t \log \pi_\theta(a_t \mid s_t)$ is a random variable defining the contribution of each trajectory $\tau$ to the overall policy gradient. Here, $a_t$, $s_t$, and $G_t = \sum_{k=t}^{T-1} \gamma^{k} r_k$ denote the selected action, the state, and the reward-to-go at time $t$ in trajectory $\tau$, respectively. Different policy gradient algorithms use different expressions for $X^{\pi_\theta}_\tau$ when estimating the policy gradient (e.g., $G_t$ may be replaced with an advantage estimate, or $X^{\pi_\theta}_\tau$ may be entirely replaced by a surrogate per-trajectory gradient estimate, as is done in PPO (Schulman et al., 2017)). However, we note that the overall structure of many on-policy algorithms remains the same: use the current policy to sample trajectories in the environment; use the rewards, actions, and states along these trajectories to compute a variable of interest $X^{\pi_\theta}_\tau$; finally, average the gradients of these sampled random variables to estimate $\nabla_\theta J(\theta)$.

Multi-fidelity policy gradient estimators via control variates. Because our objective is to optimize the policy performance in the high-fidelity environment $M_h$, during each step of the policy gradient algorithm we must estimate $\nabla_\theta \mathbb{E}_{\tau^h \sim M_h(\pi_\theta)}[X^{\pi_\theta}_{\tau^h}]$ from a potentially limited number $N_h$ of sampled high-fidelity trajectories $\tau^h$. In data-scarce settings, existing policy gradient methods can suffer from high variance in the gradient estimates (Greensmith et al., 2004). We aim to reduce the estimation variance for the PGs.
In this work, we assume that we may also sample a relatively large number $N_l \gg N_h$ of trajectories $\tau^l$ from the low-fidelity environment $M_l$. We use these low-fidelity samples to construct a so-called control variate $X^{\pi_\theta}_{\tau^l}$, a correlated auxiliary random variable whose expected value $\mu_l := \mathbb{E}_{\tau^l \sim M_l(\pi_\theta)}[X^{\pi_\theta}_{\tau^l}]$ is known. We then use the control variates technique (Nelson, 1987) to construct an unbiased, reduced-variance estimator for $\nabla_\theta \mathbb{E}_{\tau^h \sim M_h(\pi_\theta)}[X^{\pi_\theta}_{\tau^h}]$.

Specifically, we construct a new random variable $Z^{\pi_\theta} := X^{\pi_\theta}_{\tau^h} + c\,(X^{\pi_\theta}_{\tau^l} - \mu_l)$ with parameter $c \in \mathbb{R}$. We estimate $\mu_l$ using the $N_l$ low-fidelity trajectory samples, and we provide a method to jointly sample correlated values $X^{\pi_\theta}_{\tau^h}$ and $X^{\pi_\theta}_{\tau^l}$ (described below). By construction, this new random variable has the property that $\mathbb{E}[Z^{\pi_\theta}] = \mathbb{E}[X^{\pi_\theta}_{\tau^h}]$. Furthermore, depending on the value of $c$, it may have a significantly reduced variance. In particular, by choosing $c^* = -\mathrm{Cov}(X^{\pi_\theta}_{\tau^h}, X^{\pi_\theta}_{\tau^l}) / \mathrm{Var}(X^{\pi_\theta}_{\tau^l})$, we obtain $\mathrm{Var}(Z^{\pi_\theta}) = (1 - \rho^2(X^{\pi_\theta}_{\tau^h}, X^{\pi_\theta}_{\tau^l}))\,\mathrm{Var}(X^{\pi_\theta}_{\tau^h})$, where $\rho(\cdot,\cdot)$ is the Pearson correlation coefficient between the two random variables. In practice, we estimate $c^*$ using the sampled trajectories from $M_h$ and $M_l$, including samples generated during previous gradient steps, e.g., by computing an exponential moving average for $c^*$.

At every policy gradient step, the proposed MFPG framework thus proceeds as follows:
1) Use policy $\pi_\theta$ to sample $N_h$ correlated trajectories from $M_h$ and $M_l$, as well as $N_l$ additional trajectories from $M_l$.
2) Use the sampled trajectories to compute $\hat{\mu}_l$, $c$, and the correlated values of the random variables $X^{\pi_\theta}_{\tau^h}$ and $X^{\pi_\theta}_{\tau^l}$.
3) Compute the sampled values of $Z^{\pi_\theta}$.
4) Use the samples of $Z^{\pi_\theta}$ to compute an unbiased, reduced-variance estimate of the policy gradient $\nabla_\theta \mathbb{E}[Z^{\pi_\theta}] \approx \frac{1}{N_h}\sum_{\tau} \nabla_\theta \big( X^{\pi_\theta}_{\tau^h} + c\,(X^{\pi_\theta}_{\tau^l} - \hat{\mu}_l) \big)$.

Sampling correlated trajectories from the multi-fidelity environments. Note that for the random variables $X^{\pi_\theta}_{\tau^h}$ and $X^{\pi_\theta}_{\tau^l}$ to be correlated, they must share an underlying probability space $(\Omega, \mathcal{F}, \mathbb{P})$.
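Returning briefly to the control-variate estimator $Z^{\pi_\theta}$ and the coefficient $c^*$ defined above, the construction can be checked numerically on synthetic scalars. The following is a minimal NumPy sketch, not the authors' implementation: the Gaussian toy data, the sample sizes, and the helper name `control_variate_estimate` are all illustrative assumptions. For brevity, $c$ is estimated from the same batch it is applied to, whereas the text above describes estimating it across gradient steps (e.g., with an exponential moving average).

```python
import numpy as np


def control_variate_estimate(x_high, x_low, mu_low_hat, c=None):
    """Return samples of Z = X_h + c * (X_l - mu_l) and the coefficient c used.

    x_high, x_low: paired (correlated) scalar samples, one pair per trajectory.
    mu_low_hat:    estimate of E[X_l] from a large, separate low-fidelity batch.
    """
    x_high = np.asarray(x_high, dtype=float)
    x_low = np.asarray(x_low, dtype=float)
    if c is None:
        # Sample estimate of c* = -Cov(X_h, X_l) / Var(X_l).
        cov = np.cov(x_high, x_low, ddof=1)
        c = -cov[0, 1] / cov[1, 1]
    z = x_high + c * (x_low - mu_low_hat)
    return z, c


# Hypothetical toy data: both "fidelities" share a common random component.
rng = np.random.default_rng(0)
n_high, n_low = 100, 15_000  # N_l >> N_h
shared = rng.normal(size=n_high)
x_high = 2.0 + shared + 0.3 * rng.normal(size=n_high)  # E[X_h] = 2.0
x_low = 1.0 + shared + 0.3 * rng.normal(size=n_high)   # E[X_l] = 1.0

# Estimate mu_l from many additional, uncorrelated low-fidelity samples
# drawn from the same marginal distribution as x_low (variance 1 + 0.09).
mu_low_hat = (1.0 + np.sqrt(1.09) * rng.normal(size=n_low)).mean()

z, c = control_variate_estimate(x_high, x_low, mu_low_hat)
rho = np.corrcoef(x_high, x_low)[0, 1]

print(f"c_hat             = {c:.3f}")
print(f"mean(X_h), mean(Z) = {x_high.mean():.3f}, {z.mean():.3f}")
print(f"Var(Z) / Var(X_h)  = {z.var(ddof=1) / x_high.var(ddof=1):.3f}")
print(f"1 - rho^2          = {1 - rho**2:.3f}")
```

When $c$ is the in-sample estimate, the empirical variance ratio coincides with $1 - \hat{\rho}^2$, which mirrors the variance identity stated above; the stronger the correlation between the paired samples, the larger the reduction.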
In other words, every outcome $\omega \in \Omega$ from this probability space should uniquely define a high-fidelity $\tau^h(\omega)$ and low-fidelity $\tau^l(\omega)$ trajectory, as well as the corresponding values of the random variables $X^{\pi_\theta}_{\tau^h}(\omega)$ and $X^{\pi_\theta}_{\tau^l}(\omega)$. Careful treatment of this joint probability space is necessary not only for conceptual completeness, but also to implement a mechanism that samples correlated trajectories from the distinct environments $M_h$ and $M_l$ under the same policy $\pi_\theta$.

Informally, when sampling trajectories from $M_h$ and $M_l$, there are six sources of stochasticity: the stochasticity introduced by the policy $\pi_\theta$, that introduced by the initial state distribution $\Delta_{s_I}$, the stochasticity introduced by the high-fidelity and low-fidelity transition dynamics $p_h$ and $p_l$, and finally the stochasticity introduced by the two reward functions $R_h$ and $R_l$. Note that the stochasticity introduced by $\pi_\theta$ is implemented by the agent, and is thus under the control of the algorithm in the sense that the same random outcome $\omega$ may be used to generate actions from $\pi_\theta$ in both $M_h$ and $M_l$. Similarly, the algorithm may fix the initial states in the low-fidelity environment to match those observed from the high-fidelity samples. On the other hand, the transition dynamics and environment rewards are generated by independent sources of stochasticity in the different environments.

We accordingly define the outcome set of the probability space as $\Omega = \Omega_{\Delta_{s_I}} \times \Omega_{\pi} \times \Omega_{p_h} \times \Omega_{p_l} \times \Omega_{R_h} \times \Omega_{R_l}$. Here, each outcome $\omega^{\Delta_{s_I}} \in \Omega_{\Delta_{s_I}}$ defines a particular shared initial state. Meanwhile, each policy outcome $\omega^{\pi} \in \Omega_{\pi}$ corresponds to a sequence $\omega^{\pi} = \omega^{\pi}_0, \omega^{\pi}_1, \ldots, \omega^{\pi}_T$ that dictates the random sequence of actions selected by the policy $\pi_\theta$ in both environments. Conceptually, given the outcome $\omega^{\pi}_t$ at any timestep $t$ of a trajectory, the policy should deterministically output action $a^h_t = \pi_\theta(s^h_t, \omega^{\pi}_t)$ in the high-fidelity environment, and action $a^l_t = \pi_\theta(s^l_t, \omega^{\pi}_t)$ in the low-fidelity environment.
Note that this does not necessarily imply that $a^h_t = a^l_t$, due to a potential difference in states $s^h_t$ and $s^l_t$ that stems from the different dynamics. Similarly, each outcome $\omega^{p_h} \in \Omega_{p_h}$ is a sequence $\omega^{p_h} = \omega^{p_h}_0, \omega^{p_h}_1, \ldots, \omega^{p_h}_T$ dictating the outcomes $s^h_{t+1} = p_h(s^h_t, a^h_t, \omega^{p_h}_t)$ of the stochastic transitions in the high-fidelity environment. However, unlike the policy outcome sequence $\omega^{\pi} \in \Omega_{\pi}$, the transition outcome sequences $\omega^{p_h} \in \Omega_{p_h}$ and $\omega^{p_l} \in \Omega_{p_l}$, and the reward outcome sequences $\omega^{R_h} \in \Omega_{R_h}$ and $\omega^{R_l} \in \Omega_{R_l}$, are not shared between the high and low fidelity environments.

Algorithm 1: Correlated trajectory sampling.
Input: Current policy $\pi_\theta$, multi-fidelity environments $M_h$ and $M_l$.
Output: Sampled trajectories $\{\tau^h_i, \tau^l_i\}_{i=1}^{N_h}$.
    TrajectoryList ← EmptyList
    for $i \in \{1, 2, \ldots, N_h\}$ do
        $\omega^{\pi}_0, \ldots, \omega^{\pi}_T \sim$ SamplePolicyOutcomes()
        $s^h_0 \sim \Delta_{s_I}$; $s^l_0 \leftarrow s^h_0$
        for $t \in \{0, \ldots, T-1\}$ do
            $a^h_t \leftarrow \pi_\theta(s^h_t, \omega^{\pi}_t)$
            $s^h_{t+1} \sim p_h(\cdot \mid s^h_t, a^h_t)$
            $r^h_t \sim R_h(s^h_t, a^h_t, s^h_{t+1})$
        for $t \in \{0, \ldots, T-1\}$ do
            $a^l_t \leftarrow \pi_\theta(s^l_t, \omega^{\pi}_t)$
            $s^l_{t+1} \sim p_l(\cdot \mid s^l_t, a^l_t)$
            $r^l_t \sim R_l(s^l_t, a^l_t, s^l_{t+1})$
        $\tau^h_i \leftarrow s^h_0, a^h_0, r^h_0, \ldots, s^h_T$
        $\tau^l_i \leftarrow s^l_0, a^l_0, r^l_0, \ldots, s^l_T$
        TrajectoryList.append($\{\tau^h_i, \tau^l_i\}$)
    return TrajectoryList

Formulating these separate outcome sets is conceptually helpful. However, practically, we only explicitly sample values for the policy outcomes $\omega^{\pi}_0, \omega^{\pi}_1, \ldots, \omega^{\pi}_T$ in our implementation. Algorithm 1 outlines the steps of the sampling process. First, the pre-sampled policy outcomes are used to roll out a high-fidelity trajectory $\tau^h$. Next, the initial state in the low-fidelity environment is fixed to match the initial state of $\tau^h$, which effectively enforces a common $\omega^{\Delta_{s_I}}$. Finally, the same sequence of pre-sampled policy outcomes is used to generate the low-fidelity trajectory $\tau^l$.

Correlated action sampling via policy distribution reparameterization.
In order to use the sampled outcomes $\omega^{\pi}_t$ to deterministically select an action under the parameterized policy $\pi_\theta$, we implement a technique inspired by the so-called reparameterization trick used in variational autoencoders (Kingma, 2013). In particular, in continuous action spaces, we draw $\omega^{\pi}_t \sim \mathcal{N}(0, 1)$, and the policy $\pi_\theta(s_t, \omega^{\pi}_t)$ is trained to output a state-dependent mean and standard deviation which are used to transform $\omega^{\pi}_t$ into an action $a_t$. Meanwhile, in discrete action spaces, we draw $\omega^{\pi}_t \sim \mathrm{Uniform}(0, 1)$ and apply the Gumbel-Max trick (Huijben et al., 2022) to sample $a_t$ according to the state-dependent probability distribution defined by the policy $\pi_\theta(s_t, \omega^{\pi}_t)$.

Defining multi-fidelity variants of the REINFORCE and PPO algorithms. To this point, we have defined a mechanism for sampling correlated trajectories $\tau^h$ and $\tau^l$ from multi-fidelity environments $M_h$ and $M_l$, as well as a framework for using said trajectories to construct reduced-variance estimators of the policy gradient from trajectory-dependent variables $X^{\pi_\theta}_{\tau^h}$ and $X^{\pi_\theta}_{\tau^l}$. With these elements of the MFPG framework in place, we may easily implement multi-fidelity variants of existing policy gradient algorithms simply by replacing the way in which the value of $X^{\pi_\theta}_\tau$ is computed from sampled trajectories $\tau$.

We implement a multi-fidelity variant of the REINFORCE algorithm (Williams, 1988; 1992) by defining $X^{\pi_\theta}_\tau := \frac{1}{T}\sum_{t=0}^{T-1} G_t \log \pi_\theta(a_t \mid s_t)$, as described above. We additionally implement a commonly used variant of the REINFORCE algorithm (Peters & Schaal, 2006) that subtracts state values estimated by a value network from the Monte Carlo returns to reduce variance, i.e., $X^{\pi_\theta}_\tau := \frac{1}{T}\sum_{t=0}^{T-1} (G_t - V_\phi(s_t)) \log \pi_\theta(a_t \mid s_t) = \frac{1}{T}\sum_{t=0}^{T-1} A_\phi(s_t, a_t) \log \pi_\theta(a_t \mid s_t)$. Here, $V_\phi(s_t)$ and $A_\phi(s_t, a_t)$ denote value and advantage functions estimated from samples. We use generalized advantage estimation for $A_\phi(s_t, a_t)$ (Schulman et al., 2015).
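To make the correlated sampling scheme of Algorithm 1 and the reparameterized action selection described above more concrete, here is a small NumPy sketch on a hypothetical one-dimensional system. Everything specific in it is an assumption made for illustration: the linear dynamics (the low-fidelity gain is 90% of the high-fidelity one), the fixed policy `a = -s + std * omega`, the reward `-s**2`, the undiscounted return, and all function names. For discrete actions it assumes one uniform draw per action when applying Gumbel-Max. It compares the correlation of high- and low-fidelity returns when the policy noise is shared versus drawn independently.

```python
import numpy as np


def gaussian_action(mean, std, omega):
    """Continuous actions: map a pre-sampled omega ~ N(0, 1) to an action."""
    return mean + std * omega


def gumbel_max_action(logits, u):
    """Discrete actions: Gumbel-Max with pre-sampled u ~ Uniform(0, 1),
    one entry per action (an assumption of this sketch)."""
    gumbel = -np.log(-np.log(u))
    return int(np.argmax(logits + gumbel))


def rollout(gain, s0, omegas, rng, policy_std=1.0, process_std=0.01):
    """Roll out the toy policy in a toy environment and return the
    undiscounted total reward. Transition noise is environment-specific."""
    s, total = s0, 0.0
    for omega in omegas:
        a = gaussian_action(-s, policy_std, omega)
        s = s + gain * a + process_std * rng.normal()
        total += -s**2
    return total


def sample_pairs(n, horizon, share_policy_noise, seed=0):
    """Sample n (high-fidelity, low-fidelity) return pairs."""
    rng = np.random.default_rng(seed)
    returns_h, returns_l = np.empty(n), np.empty(n)
    for i in range(n):
        s0 = 1.0 + 0.1 * rng.normal()        # initial state shared by both envs
        omegas_h = rng.normal(size=horizon)  # pre-sampled policy outcomes
        omegas_l = omegas_h if share_policy_noise else rng.normal(size=horizon)
        returns_h[i] = rollout(0.5, s0, omegas_h, rng)   # "high fidelity"
        returns_l[i] = rollout(0.45, s0, omegas_l, rng)  # "low fidelity"
    return returns_h, returns_l


corr_shared = np.corrcoef(*sample_pairs(2000, 20, True))[0, 1]
corr_indep = np.corrcoef(*sample_pairs(2000, 20, False))[0, 1]
print(f"shared policy noise:      corr = {corr_shared:.3f}")
print(f"independent policy noise: corr = {corr_indep:.3f}")
```

In this toy setting, reusing the same policy outcomes in both environments should yield strongly correlated returns, while independent policy noise leaves them nearly uncorrelated even though the initial state is shared. This is the property that the control-variate coefficient relies on.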
Meanwhile, we define a multi-fidelity variant of PPO (Schulman et al., 2017) by instead using $X^{\pi_\theta}_\tau := \frac{1}{T}\sum_{t=0}^{T-1} \min\Big( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)} A_\phi(s_t, a_t),\ \mathrm{clip}\big( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)}, 1-\epsilon, 1+\epsilon \big) A_\phi(s_t, a_t) \Big)$. In this case, the multi-fidelity estimate of the policy gradient is used in place of the traditional single-fidelity variant, alongside PPO's entropy maximization term, when defining the objective function for gradient ascent. In both of the above cases, a common advantage network $A_\phi$ is used to compute both $X^{\pi_\theta}_{\tau^h}$ and $X^{\pi_\theta}_{\tau^l}$, and is trained using sampled tuples $(s^h_t, a^h_t, r^h_t, s^h_{t+1})$ from the high-fidelity dataset.

4 Experiments

Our results in a suite of simulated benchmark RL tasks validate the following five key claims. Sections 4.1 to 4.5 highlight the experimental results that demonstrate the claims C1-5 in order. We then present a large-scale comparison of all algorithm variants and baselines in Section 4.6.

• (C1) The proposed MFPG algorithms drastically reduce the variance of policy gradient estimators in comparison to common approaches to variance reduction, such as baseline subtraction and increasing the number of high-fidelity samples per gradient estimate.
• (C2) When high-fidelity samples are limited, the MFPG algorithms stabilize training and outperform the single-fidelity baselines in all settings, achieving up to 3.9x higher reward at algorithm convergence.
• (C3) The MFPG algorithms continue to match or outperform these baseline algorithms, even when the baselines are allowed access to substantially more high-fidelity data, i.e., 10-20x more high-fidelity samples per gradient update and 2-10x more samples in total.
• (C4) The MFPG algorithms can learn effectively even when the low-fidelity environment is drastically different from the high-fidelity environment, and high-fidelity data is scarce.
• (C5) The proposed policy reparameterization trick for sampling correlated trajectories across multi-fidelity environments is crucial to the performance of the proposed approach.

Tasks. We test our approach and support the key claims above in the following simulated robotic tasks from Gymnasium (Towers et al., 2024) and MuJoCo (Todorov et al., 2012): (i) CartPole-v1, (ii) InvertedPendulum-v5, (iii) Swimmer-v5, (iv) Walker2d-v5, (v) Hopper-v5, (vi) HalfCheetah-v5.

Low-Fidelity Models. In the CartPole task, we construct a low-fidelity model by linearizing the system dynamics about an equilibrium point. In the remaining (MuJoCo) tasks, we implement low-fidelity environments either by making the gravitational forces differ from those in the target (high-fidelity) environments, as is common in the literature on off-dynamics learning (Lyu et al., 2024), or by changing the reward function. In particular, the results in Sections 4.1 to 4.3 and 4.5 use a low-fidelity model that preserves 90% of the target environment's gravity, while Section 4.6 reports results for additional differences in gravity levels across all tasks. Meanwhile, Section 4.4 presents results from a low-fidelity environment in which the rewards are drastically different from those of the high-fidelity environment.

Implementation details. We implement the MFPG algorithms by building on the Stable-Baselines3 RL software library (Raffin et al., 2021). In all experiments, we use the default hyperparameters from Stable-Baselines3 with minimal changes. The specific hyperparameters used are detailed in the Supplementary Material. We employ the same hyperparameters between our proposed MFPG algorithms and their single-fidelity counterparts, which we use as baselines for comparison. The code will be released upon publication of the manuscript.

Baseline algorithms.
We evaluate three variants of the proposed multi-fidelity framework, and compare them to their single-fidelity counterparts: (i) REINFORCE (Williams, 1992), (ii) REINFORCE with value function subtraction (a common method for variance reduction in policy gradient algorithms) (Peters & Schaal, 2006), and (iii) PPO (Schulman et al., 2017).

In RL, the amount of available data influences training through two distinct mechanisms: (i) batch size, i.e., the number of high-fidelity samples per gradient step, and (ii) the total number of high-fidelity samples used throughout training, i.e., the batch size multiplied by the total number of gradient steps. Hence, to simulate data-scarce settings, we always use a small batch size of 100 for the proposed MFPG algorithms. Sections 4.1 to 4.3 use $N_l = 150 N_h$ to estimate the mean term $\hat{\mu}_l$ of the CVs, while Section 4.4 uses $N_l = 100 N_h$. In all experiments, we use 6 random seeds for each algorithm. For each of the MFPG algorithms, we compare our approach to two types of baselines:

• All algorithms have limited access to high-fidelity samples: The baselines use the same small batch size of 100 as the MFPG algorithm and train using the same number of total high-fidelity samples, i.e., 1 million.
• Baseline algorithms have access to significantly more high-fidelity samples: In these experiments, the baselines are granted substantially more high-fidelity samples than our MFPG algorithms. The baselines use a 10-20x larger batch size and train using 2-10x the total number of high-fidelity samples. We note that in all experiments, the baselines ultimately take either the same number of gradient steps as our approach, or twice as many. This is because the PPO implementations use mini-batches to estimate gradients, i.e., subsets of the data batches, whereas the REINFORCE implementations use the entire batch to estimate gradients in order to reduce variance.
4.1 MFPG Algorithms Drastically Reduce the Variance of Policy Gradient Estimators

[Figure 3 plot: variance versus high-fidelity environment steps. Legend: Single-Fid. (batch 100); Single-Fid. (batch 100, value subt.); Single-Fid. (batch 1000); Single-Fid. (batch 1000, value subt.); MFPG (ours, batch 100); MFPG (ours, batch 100, value subt.).]

Figure 3: Variance of the policy gradient estimates in the HalfCheetah-v5 task. The multi-fidelity REINFORCE algorithm enjoys a significantly lower variance in its gradient estimates than the single-fidelity baselines that only use high-fidelity samples.

This experiment is designed to validate claim C1: The proposed MFPG algorithms drastically reduce the variance of policy gradient estimators in comparison to common approaches to variance reduction, such as baseline subtraction and increasing the number of high-fidelity samples per gradient estimate.

We use the REINFORCE algorithm in the HalfCheetah task as an example. Figure 3 shows the variance of the multi-fidelity vs. single-fidelity policy gradient estimators (more specifically, we plot the variances of the scalar quantities $Z^{\pi_\theta}$ and $X^{\pi_\theta}_{\tau_h}$ before differentiation w.r.t. policy parameters). The figure includes results from several single-fidelity baseline algorithms, including a baseline that has access to the same limited number of high-fidelity samples as our MFPG algorithms (i.e., the same batch size), a baseline with significantly more high-fidelity samples (a batch size of 1000), and variants of these single-fidelity baselines that also use value function subtraction in an effort to reduce estimator variance.

[Figure 4 plots: reward versus high-fidelity environment steps. Panels: (a) REINFORCE InvertedPendulum-v5, (b) REINFORCE Hopper-v5, (c) REINFORCE HalfCheetah-v5, (d) PPO Hopper-v5. Legend: MFPG (ours, value subt.); MFPG (ours); Single-Fid. (high-fid. only, value subt.); Single-Fid. (high-fid. only).]

Figure 4: Episodic reward as a function of the total elapsed steps in the high-fidelity environment. All algorithms are limited to 100 high-fidelity samples per gradient update, and the episode length in all tasks is $T = 500$. MFPG algorithms significantly outperform their single-fidelity counterparts.

We train a policy using the single-fidelity REINFORCE algorithm with state-value subtracted for 1 million steps and save the trained policy at 18 different checkpoints. For each of these saved policies, we collect 200 batches of both high- and low-fidelity data, where the size of each batch (i.e., the amount of high-fidelity data) varies between approaches as described above. We then record the empirical mean and variance of the policy gradient estimates from these 200 batches, for each checkpointed policy. We repeat this experiment for 6 random seeds and report aggregate statistics in Figure 3, where each line is based on 21,600 batches of policy gradient estimates (18 checkpoints x 200 batches x 6 seeds).

As shown in Figure 3, our multi-fidelity approach drastically reduces the variance of policy gradient estimators when compared to the single-fidelity baselines. Specifically, we observe that (i) this reduction is far more substantial than that achieved by standard variance reduction techniques in RL (namely state-value subtraction or training with larger batch sizes), and (ii) the variance reduction occurs immediately and uniformly throughout training, whereas state-value subtraction only starts reducing the variance after fitting a meaningful value function (around 500K steps).

4.2 MFPG Algorithms Significantly Outperform Baselines when Data is Scarce

This experiment is designed to validate claim C2: When high-fidelity samples are limited, the MFPG algorithms stabilize training and outperform the single-fidelity baselines in all settings.
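The mechanism behind claims C1 and C2 can be illustrated on synthetic data. The sketch below is not the authors' implementation: it assumes the standard control-variate form $Z = X_h + c\,(X_l - \hat{\mu}_l)$, which matches the sign convention of $c^*$ in the Supplementary Materials, and it replaces paired rollouts with correlated Gaussian draws. The correlation of 0.9, the means, and all variable names are illustrative choices; only the batch size of 100, the 200 batches, and the ratio $N_l = 150\,N_h$ are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n_h, n_l_factor, n_batches = 0.9, 100, 150, 200
cov = np.array([[1.0, rho], [rho, 1.0]])


def sample_pairs(n):
    """Draw n correlated (X_h, X_l) pairs; a stand-in for paired rollouts."""
    xy = rng.multivariate_normal([1.0, 0.5], cov, size=n)
    return xy[:, 0], xy[:, 1]


single, multi = [], []
for _ in range(n_batches):
    x_h, x_l = sample_pairs(n_h)
    # Cheap low-fidelity-only samples, used solely to estimate the mean of X_l.
    _, x_l_extra = sample_pairs(n_l_factor * n_h)
    mu_l_hat = x_l_extra.mean()
    # Per-batch plug-in estimate of c* = -Cov(X_h, X_l) / Var(X_l).
    c = -np.cov(x_h, x_l)[0, 1] / x_l.var(ddof=1)
    z = x_h + c * (x_l - mu_l_hat)
    single.append(x_h.mean())  # high-fidelity-only estimate
    multi.append(z.mean())     # control-variate estimate

var_single = np.var(single, ddof=1)
var_multi = np.var(multi, ddof=1)
print(f"single-fidelity variance: {var_single:.5f}")
print(f"multi-fidelity variance:  {var_multi:.5f}")
```

With known $\mu_l$ and the optimal coefficient, the variance shrinks by a factor of $1 - \rho^2$, so the benefit depends directly on how strongly the two fidelities are correlated. Estimating $c$ from the same batch, as done here for brevity, introduces a small bias that the sketch ignores.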
Figure 4 illustrates the episodic reward achieved by the RL algorithms as a function of the total number of steps taken in the high-fidelity environment. All algorithms use a batch size of 100 and 1 million total high-fidelity samples throughout training. The proposed MFPG algorithms achieve higher rewards than the baselines by a large margin: they achieve 2 to 3.9 times higher rewards in tasks without a maximum reward (Figs. 4b to 4d) and significantly steeper and more stable learning curves when there is a fixed maximum achievable reward (Fig. 4a). When observing the results for the REINFORCE algorithm (Figs. 4a to 4c), we also note that baseline subtraction only improves algorithm performance in 2 of the 3 tasks, and that this improvement is minimal in comparison to that provided by our multi-fidelity approach.

[Figure 5 plots: reward versus gradient steps. Panels: (a) REINFORCE InvertedPendulum-v5, (b) REINFORCE HalfCheetah-v5. Legend: MFPG (ours, value subt.); MFPG (ours); Single-Fid. (10x high. data, value subt.); Single-Fid. (10x high. data).]

Figure 5: Episodic reward as a function of elapsed gradient steps. The baseline algorithms have access to 10x more high-fidelity samples than the proposed MFPG algorithms. Even when facing this severe data disadvantage, the MFPG algorithms outperform or match the baselines' performance.

4.3 MFPG Algorithms Outperform Baselines Even when Training with 10x Less Data

This experiment is designed to validate claim C3: The MFPG algorithms continue to match or outperform the baseline algorithms, even when the baselines are allowed access to substantially more high-fidelity data.

In Figs. 2 and 5, the multi-fidelity REINFORCE algorithm uses a batch size of 100 and 1 million total high-fidelity samples, whereas the baseline algorithms have access to 10 times more samples per batch and to 10 million total high-fidelity samples.
As in Section 4.2, we observe that the proposed approach achieves substantially higher rewards than the baseline in tasks without a maximum achievable reward (Fig. 5b) and yields a much more stable learning curve when there is a maximum achievable reward (Fig. 5a).

We benchmark the multi-fidelity PPO algorithm (batch size of 100 and 1 million total high-fidelity samples) against results reported in Stable-Baselines3 (Stable-Baselines3), which use a batch size of 2048 and 2 million total high-fidelity samples. We provide the multi-fidelity PPO results, as well as other additional results (including results from other tasks), in Tables 1 and 2 in Section 4.6.

4.4 The MFPG Algorithms Can Remain Highly Effective Under Substantial Mismatches Between the Low-Fidelity and High-Fidelity Environments

This experiment is designed to validate claim C4: The MFPG algorithms can learn effectively even when the low-fidelity environment is drastically different from the high-fidelity environment, and high-fidelity data is scarce.

As discussed in Section 3, so long as there is a statistical relationship between the random variables of interest $X^{\pi_\theta}_{\tau_h}$ and $X^{\pi_\theta}_{\tau_l}$ (i.e., $\rho^2(X^{\pi_\theta}_{\tau_h}, X^{\pi_\theta}_{\tau_l})$ is non-negligible), the MFPG framework will reduce the variance of the policy gradient estimates (w.r.t. the high-fidelity environment) without introducing bias. To demonstrate this point, we examine a situation in which the low-fidelity and high-fidelity rewards are drastically different: the reward function in the low-fidelity environment is the negative of that from the high-fidelity environment.

Figure 6 illustrates the performance of the multi-fidelity REINFORCE algorithm and compares it to the performance of a REINFORCE policy trained using only high-fidelity data. We also plot the performance of a REINFORCE policy that is trained only on low-fidelity data and then evaluated in the high-fidelity environment.
Each baseline has access to the same amount of (either high-fidelity or low-fidelity) data as our MFPG approach. As expected, due to the significant difference between the tasks of these two environments, the baseline trained only on low-fidelity data is entirely ineffective: instead of learning to move forward, the agent learns to end the episode as quickly as possible. Meanwhile, similarly to what is observed in Fig. 4b, the baseline that only uses high-fidelity data is unable to learn effectively from the limited number of samples. However, the MFPG algorithm is able to combine these data sources, each of which alone is not sufficient for effective training, in order to learn a highly performant policy.

Intuitively, although the values of $X^{\pi_\theta}_{\tau_l}$ are entirely different from those of $X^{\pi_\theta}_{\tau_h}$, they are negatively correlated in this example. The MFPG algorithm takes advantage of this relationship to compute a correction for the low-fidelity estimator $\hat{\mu}_l$, using only a small number of high-fidelity samples. Although this is an extreme example in the sense that the low- and high-fidelity tasks are polar opposites of each other, it highlights a useful feature of the MFPG framework: the low-fidelity rewards and dynamics might be very different from the target environment of interest (making direct sim-to-real transfer infeasible), and yet still provide useful information for multi-fidelity policy training.

[Figure 6 plots: reward versus gradient steps. Panels: (a) REINFORCE Hopper-v5, (b) REINFORCE Hopper-v5 (Zoomed In). Legend: MFPG (ours); High-Fid. only; Low-Fid. only.]

Figure 6: Multi-fidelity REINFORCE performance when the low-fidelity and high-fidelity environment rewards are drastically different. Figure (a) compares the MFPG algorithm with baselines that train using only low- (or high-)fidelity data. Figure (b) provides a magnified view of Figure (a) to highlight the performance of the baseline that trains using only low-fidelity data. While neither low-fidelity data nor high-fidelity data are sufficient for effective training, the MFPG algorithm is able to combine the data sources to learn an effective policy.

[Figure 7 plots: reward versus high-fidelity environment steps. Panels: (a) REINFORCE CartPole-v1, (b) REINFORCE InvertedPendulum-v5. Legend: With Policy Reparameterization Trick; Without Policy Reparameterization Trick.]

Figure 7: Impact of the policy reparameterization trick on MFPG algorithm performance.

4.5 The Policy Reparameterization Trick is of Critical Importance to the MFPG Approach

This experiment is designed to validate claim C5: The proposed policy reparameterization trick for sampling correlated trajectories across multi-fidelity environments is crucial to the performance of the proposed approach.

Figure 7 shows an ablation study in which we remove the reparameterization-based action sampling described in Section 3. We demonstrate the importance of this portion of the MFPG framework within both discrete (Fig. 7a) and continuous (Fig. 7b) action spaces. Recall that this policy reparameterization trick is an important element of Algorithm 1, which enables correlated action sampling along two or more trajectories, even when differences in dynamics cause the policy to be evaluated on different states along those trajectories. The orange lines in Fig. 7 illustrate our approach, while the gray lines illustrate the result of instead independently sampling actions from the policy $\pi_\theta$. We observe that by increasing the correlation of the sampled low- and high-fidelity trajectories, this policy reparameterization trick significantly improves the performance of the approach.

4.6 Full Results

Tables 1 and 2 show complete results for all the experimental settings.
For approaches using the same batch size and total number of high-fidelity samples, we show the final reward and the reward at another intermediate checkpoint. For baselines granted more high-fidelity samples and larger batch sizes, we only show the final reward, since the number of gradient steps varies between methods with different batch sizes (for a fixed number of environment steps). For PPO baselines with more high-fidelity samples, we directly use the reward results reported in Stable-Baselines3 (Stable-Baselines3), divided by a factor of 2 since their episode length is twice as long as ours.

Note that in tasks where states vary substantially throughout an episode (e.g., HalfCheetah), it is useful to roll out the low-fidelity trajectories only for relatively short time horizons (e.g., 15 steps), which increases the correlation between low- and high-fidelity trajectories. Similar techniques have been used in model-based RL (Janner et al., 2019; Levy et al., 2024). However, this may introduce bias into the policy gradient estimator; further analysis is an important topic of future work.

Table 1: REINFORCE: Mean Episode Reward. Numbers within parentheses denote rewards achieved by the variant of REINFORCE with state values subtracted. The "more high-fid. data" column reports one final reward per environment.

| Env. (continuous action) | Single-fid., more high-fid. data | Steps | Single-fid., limited high-fid. data | Multi-fid. (ours), limited high-fid. data, 90% gravity | Multi-fid. (ours), limited high-fid. data, 70% gravity |
| Inv. Pendulum | 449.87 (467.32) | 200K | 395.53 (433.93) | 485.49 (494.26) | 466.43 (483.33) |
| | | 1M | 236.94 (325.85) | 452.63 (483.26) | 448.90 (477.36) |
| Hopper | 454.95 (686.53) | 200K | 257.15 (283.12) | 777.43 (694.28) | 626.96 (615.22) |
| | | 1M | 349.90 (298.34) | 906.36 (916.40) | 670.04 (628.86) |
| Half Cheetah | 584.07 (462.18) | 200K | -134.44 (-100.98) | 216.39 (156.85) | 121.60 (71.01) |
| | | 1M | 135.72 (253.62) | 756.35 (988.14) | 591.57 (438.62) |
| Swimmer | 30.67 (37.84) | 150K | 30.83 (30.07) | 34.25 (36.19) | 32.11 (26.26) |
| | | 300K | 24.36 (31.23) | 36.60 (45.99) | 39.94 (35.17) |
| Walker | 302.15 (650.10) | 200K | 260.89 (266.25) | 292.88 (285.27) | 292.32 (286.66) |
| | | 1M | 261.63 (288.99) | 549.09 (792.84) | 463.47 (623.93) |

| Env. (discrete action) | Single-fid., more high-fid. data | Steps | Single-fid., limited high-fid. data | Multi-fid. (ours), limited high-fid. data, linearized low-fid. model |
| Cart Pole | 433.59 (496.77) | 200K | 353.53 (403.06) | 480.36 (482.90) |
| | | 1M | 327.91 (464.53) | 490.63 (493.73) |

Table 2: PPO: Mean Episode Reward. The results of the single-fidelity baseline with more high-fidelity data are from Stable-Baselines3 (Stable-Baselines3), divided by 2 to align episode lengths.

| Env. | Single-fid., more high-fid. data | Steps | Single-fid., limited high-fid. data | Multi-fid. (ours), limited high-fid. data, 90% gravity | Multi-fid. (ours), limited high-fid. data, 70% gravity |
| Inv. Pendulum | N/A | 200K | 340.25 | 496.70 | 489.10 |
| | | 1M | 265.21 | 500 | 499.67 |
| Hopper | 783.5 | 200K | 277.60 | 536.61 | 397.56 |
| | | 1M | 337.27 | 714.49 | 527.40 |
| Half Cheetah | 988.0 | 200K | 280.37 | 446.80 | 223.90 |
| | | 1M | 654.68 | 767.12 | 540.92 |
| Swimmer | N/A | 150K | 33.92 | 40.18 | 35.68 |
| | | 300K | 36.06 | 41.14 | 40.10 |
| Walker | 615.0 | 200K | 223.70 | 363.37 | 337.12 |
| | | 1M | 451.42 | 683.55 | 485.54 |

5 Conclusions

We present a multi-fidelity policy gradient (MFPG) framework, which mixes a small amount of potentially expensive high-fidelity data with a larger volume of cheap lower-fidelity data to construct unbiased, variance-reduced estimators for on-policy policy gradients. We use this general framework to propose multi-fidelity variants of two policy gradient algorithms, REINFORCE and proximal policy optimization.
Through experiments in a suite of simulated robotics benchmark tasks, we demonstrate that the proposed multi-fidelity policy gradient algorithms: i) significantly reduce the variance of policy gradient estimates, ii) improve performance (up to 3.9× higher reward) and data-efficiency in comparison with baselines that only use high-fidelity data, and iii) learn effectively even when the low-fidelity environment is drastically different from the high-fidelity environment.

In summary, the proposed framework offers a novel paradigm for incorporating multi-fidelity data in policy gradient algorithms for reinforcement learning, and provides promising directions for efficient sim-to-real transfer and principled approaches to managing the trade-off between policy performance and data collection costs. Future work will investigate additional techniques to improve the performance of multi-fidelity actor-critic methods, and to deploy the proposed framework for sim-to-real transfer on robotic hardware.

Acknowledgments

We thank Haoran Xu, Mustafa Karabag, Brett Barkley, and Jacob Levy for their helpful discussions. This work was supported in part by the National Science Foundation (NSF) under Grants 2214939 and 2409535, in part by the Defense Advanced Research Projects Agency (DARPA) under the Transfer Learning from Imprecise and Abstract Models to Autonomous Technologies (TIAMAT) program under grant number HR0011-24-9-0431, and in part by the DEVCOM Army Research Laboratory under Cooperative Agreement Numbers W911NF-23-2-0011 and W911NF-23-2-0211. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

Pieter Abbeel, Morgan Quigley, and Andrew Y Ng.
Using inaccurate models in reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pp. 1–8, 2006.
Akash Agrawal and Christopher McComb. Adaptive learning of design strategies over non-hierarchical multi-fidelity models via policy alignment. arXiv preprint arXiv:2411.10841, 2024.
Karol Arndt, Murtaza Hazara, Ali Ghadirzadeh, and Ville Kyrki. Meta reinforcement learning for sim-to-real domain adaptation. In 2020 IEEE international conference on robotics and automation (ICRA), pp. 2725–2731. IEEE, 2020.
Emmanuel Bengio, Moksh Jain, Maksym Korablyov, Doina Precup, and Yoshua Bengio. Flow network based generative models for non-iterative diverse candidate generation. Advances in Neural Information Processing Systems, 34:27381–27394, 2021.
Sahil Bhola, Suraj Pawar, Prasanna Balaprakash, and Romit Maulik. Multi-fidelity reinforcement learning framework for shape optimization. Journal of Computational Physics, 482:112018, 2023. ISSN 0021-9991. DOI: https://doi.org/10.1016/j.jcp.2023.112018. URL https://www.sciencedirect.com/science/article/pii/S0021999123001134.
Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan Ratliff, and Dieter Fox. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8973–8979, 2019. DOI: 10.1109/ICRA.2019.8793789.
Ching-An Cheng, Xinyan Yan, and Byron Boots. Trajectory-wise control variates for variance reduction in policy gradient methods. In Conference on Robot Learning, pp. 1379–1394. PMLR, 2020.
Mark Cutler, Thomas J. Walsh, and Jonathan P. How. Real-world reinforcement learning via multifidelity simulators. IEEE Transactions on Robotics, 31(3):655–671, 2015. DOI: 10.1109/TRO.2015.2419431.
Longchao Da, Justin Turnau, Thirulogasankar Pranav Kutralingam, Alvaro Velasquez, Paulo Shakarian, and Hua Wei.
A survey of sim-to-real methods in rl: Progress, prospects and challenges with foundation models. arXiv preprint arXiv:2502.13187, 2025.
Peter Dayan. Reinforcement comparison. In Connectionist Models, pp. 45–51. Elsevier, 1991.
Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897):414–419, 2022.
Raj Ghugare, Santiago Miret, Adriana Hugessen, Mariano Phielipp, and Glen Berseth. Searching for high-value molecules using reinforcement learning and transformers. arXiv preprint arXiv:2310.02902, 2023.
Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations, 2018.
Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.
Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, and Sergey Levine. Q-prop: Sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations, 2017.
Jiawei Huang and Nan Jiang. From importance sampling to doubly robust policy gradient. In International Conference on Machine Learning, pp. 4434–4443. PMLR, 2020.
Iris AM Huijben, Wouter Kool, Max B Paulus, and Ruud JG Van Sloun. A review of the gumbel-max trick and its extensions for discrete stochasticity in machine learning. IEEE transactions on pattern analysis and machine intelligence, 45(2):1353–1371, 2022.
Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. Advances in neural information processing systems, 32, 2019.
Elia Kaufmann, Leonard Bauersfeld, Antonio Loquercio, Matthias Müller, Vladlen Koltun, and Davide Scaramuzza. Champion-level drone racing using deep reinforcement learning. Nature, 620(7976):982–987, 2023.
Sami Khairy and Prasanna Balaprakash. Multi-fidelity reinforcement learning with control variates. Neurocomputing, 597:127963, 2024.
Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
Jacob Levy, Tyler Westenbroek, and David Fridovich-Keil. Learning to walk from three minutes of real-world data with semi-structured dynamics models. arXiv preprint arXiv:2410.09163, 2024.
Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent control variates for policy optimization via stein identity. In International Conference on Learning Representations, 2018.
Jiafei Lyu, Kang Xu, Jiacheng Xu, Mengbei Yan, Jingwen Yang, Zongzhang Zhang, Chenjia Bai, Zongqing Lu, and Xiu Li. Odrl: A benchmark for off-dynamics reinforcement learning. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=ap4x1kArGy.
Alonso Marco, Felix Berkenkamp, Philipp Hennig, Angela P Schoellig, Andreas Krause, Stefan Schaal, and Sebastian Trimpe. Virtual vs. real: Trading off simulations and physical experiments in reinforcement learning with bayesian optimization. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1557–1563. IEEE, 2017.
Barry L Nelson. On control variate estimators. Computers & Operations Research, 14(3):219–225, 1987.
Art B. Owen. Monte Carlo theory, methods and examples. https://artowen.su.domains/mc/, 2013.
Sergey Pankov. Reward-estimation variance elimination in sequential decision processes. arXiv preprint arXiv:1811.06225, 2018.
Benjamin Peherstorfer, Karen Willcox, and Max Gunzburger. Survey of multifidelity methods in uncertainty propagation, inference, and optimization.
Siam Review, 60(3):550–591, 2018.
Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2219–2225, 2006. DOI: 10.1109/IROS.2006.282564.
Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021. URL http://jmlr.org/papers/v22/20-1364.html.
Fabio Ramos, Rafael Carvalhaes Possas, and Dieter Fox. Bayessim: adaptive domain randomization via probabilistic inference for robotics simulators. arXiv preprint arXiv:1906.01728, 2019.
Andrei A Rusu, Matej Večerík, Thomas Rothörl, Nicolas Heess, Razvan Pascanu, and Raia Hadsell. Sim-to-real robot learning from pixels with progressive nets. In Conference on robot learning, pp. 262–270. PMLR, 2017.
Gilhyun Ryou, Geoffrey Wang, and Sertac Karaman. Multi-fidelity reinforcement learning for time-optimal quadrotor re-planning. arXiv preprint arXiv:2403.08152, 2024.
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
Laura Smith, J. Chase Kew, Xue Bin Peng, Sehoon Ha, Jie Tan, and Sergey Levine. Legged robots that keep on learning: Fine-tuning locomotion policies in the real world. In 2022 International Conference on Robotics and Automation (ICRA), pp. 1593–1599, 2022. DOI: 10.1109/ICRA46639.2022.9812166.
Stable-Baselines3.
Performance check (continuous actions) github issue 48. URL https://github.com/DLR-RM/stable-baselines3/issues/48.
Varun Suryan, Nahush Gondhalekar, and Pratap Tokekar. Multi-fidelity reinforcement learning with gaussian processes. arXiv preprint arXiv:1712.06489, 2017.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
Richard Stuart Sutton. Temporal credit assignment in reinforcement learning. University of Massachusetts Amherst, 1984.
Chen Tang, Ben Abbatematteo, Jiaheng Hu, Rohan Chandra, Roberto Martín-Martín, and Peter Stone. Deep reinforcement learning for robotics: A survey of real-world successes. Annual Review of Control, Robotics, and Autonomous Systems, 8, 2024.
Matthew E Taylor, Peter Stone, and Yaxin Liu. Transfer learning via inter-task mappings for temporal difference learning. Journal of Machine Learning Research, 8(9), 2007.
Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012. DOI: 10.1109/IROS.2012.6386109.
Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032, 2024.
George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard Turner, Zoubin Ghahramani, and Sergey Levine. The mirage of action-dependent baselines in reinforcement learning. In International conference on machine learning, pp. 5015–5024. PMLR, 2018.
Lex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pp. 538–545, 2001.
RJ Williams. Toward a theory of reinforcement-learning connectionist systems. Technical Report, 1988.
Ronald J Williams.
Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992.
Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M Bayen, Sham Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. In International Conference on Learning Representations, 2018.
Patrick Yin, Tyler Westenbroek, Simran Bagaria, Kevin Huang, Ching-an Cheng, Andrey Kobolov, and Abhishek Gupta. Rapidly adapting policies to the real world via simulation-guided fine-tuning. arXiv preprint arXiv:2502.02705, 2025.
Tingting Zhao, Hirotaka Hachiya, Gang Niu, and Masashi Sugiyama. Analysis and improvement of policy gradient estimation. Advances in Neural Information Processing Systems, 24, 2011.
Wenshuai Zhao, Jorge Peña Queralta, and Tomi Westerlund. Sim-to-real transfer in deep reinforcement learning for robotics: a survey. In 2020 IEEE symposium series on computational intelligence (SSCI), pp. 737–744. IEEE, 2020.
Yuanyi Zhong, Yuan Zhou, and Jian Peng. Coordinate-wise control variates for deep policy gradients. arXiv preprint arXiv:2107.04987, 2021.

Supplementary Materials

The following content was not necessarily subject to peer review.

Computation of Control Variates Coefficient $c^*$

We mention in Section 3 that the control variates coefficient $c^*$ is calculated as

$$c^* = -\frac{\mathrm{Cov}\left(X^{\pi_\theta}_{\tau_h}, X^{\pi_\theta}_{\tau_l}\right)}{\mathrm{Var}\left(X^{\pi_\theta}_{\tau_l}\right)} = -\rho\left(X^{\pi_\theta}_{\tau_h}, X^{\pi_\theta}_{\tau_l}\right)\sqrt{\frac{\mathrm{Var}\left(X^{\pi_\theta}_{\tau_h}\right)}{\mathrm{Var}\left(X^{\pi_\theta}_{\tau_l}\right)}},$$

where $\rho(\cdot,\cdot)$ denotes the Pearson correlation coefficient between the two random variables. We compute a moving average for the coefficient $c^*$ from each collected batch of samples, with a hyperparameter $\alpha_{\mathrm{ma}}$:

$$c^* = -\left[\alpha_{\mathrm{ma}}\,\rho\left(X^{\pi_\theta}_{\tau_h}, X^{\pi_\theta}_{\tau_l}\right)_{\mathrm{old}} + (1-\alpha_{\mathrm{ma}})\,\rho\left(X^{\pi_\theta}_{\tau_h}, X^{\pi_\theta}_{\tau_l}\right)_{\mathrm{new}}\right]\sqrt{\frac{\alpha_{\mathrm{ma}}\,\mathrm{Var}\left(X^{\pi_\theta}_{\tau_h}\right)_{\mathrm{old}} + (1-\alpha_{\mathrm{ma}})\,\mathrm{Var}\left(X^{\pi_\theta}_{\tau_h}\right)_{\mathrm{new}}}{\alpha_{\mathrm{ma}}\,\mathrm{Var}\left(X^{\pi_\theta}_{\tau_l}\right)_{\mathrm{old}} + (1-\alpha_{\mathrm{ma}})\,\mathrm{Var}\left(X^{\pi_\theta}_{\tau_l}\right)_{\mathrm{new}}}},$$

where we drop the control variate when the estimated correlation coefficient is below a threshold hyperparameter detailed below.
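The moving-average update above can be sketched as a small helper that tracks the three running statistics. This is an illustrative reading of the formula, not the authors' code: the class name, the choice to initialize the running statistics from the first batch, and the use of the absolute value of the correlation in the drop rule are all assumptions (the absolute value is assumed so that strongly negative correlations, as in Section 4.4, are not discarded).

```python
import math

import numpy as np


class ControlVariateCoefficient:
    """Hypothetical tracker for the moving-average coefficient c*."""

    def __init__(self, alpha_ma=0.95, threshold=0.0):
        self.alpha_ma = alpha_ma
        self.threshold = threshold
        self.rho = None    # running Pearson correlation
        self.var_h = None  # running variance of the high-fidelity variable
        self.var_l = None  # running variance of the low-fidelity variable

    def update(self, x_h, x_l):
        """Blend the new batch statistics into the running ones and return c*."""
        rho_new = float(np.corrcoef(x_h, x_l)[0, 1])
        var_h_new = float(np.var(x_h, ddof=1))
        var_l_new = float(np.var(x_l, ddof=1))
        if self.rho is None:
            # Assumption: the first batch initializes the running statistics.
            self.rho, self.var_h, self.var_l = rho_new, var_h_new, var_l_new
        else:
            a = self.alpha_ma
            self.rho = a * self.rho + (1 - a) * rho_new
            self.var_h = a * self.var_h + (1 - a) * var_h_new
            self.var_l = a * self.var_l + (1 - a) * var_l_new
        if abs(self.rho) < self.threshold:
            return 0.0  # control variate dropped: fall back to high-fidelity only
        return -self.rho * math.sqrt(self.var_h / self.var_l)


x_h = np.array([1.0, 2.0, 3.0, 4.0])
c_scaled = ControlVariateCoefficient().update(x_h, 2.0 * x_h)
c_negated = ControlVariateCoefficient().update(x_h, -x_h)
print(c_scaled, c_negated)
```

In this toy usage, a low-fidelity variable that is twice the high-fidelity one yields a coefficient close to -0.5, and a negated low-fidelity variable yields a coefficient close to +1, which shows how the sign of $c^*$ absorbs a negative correlation.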
Hyperparameter Values

As mentioned in Section 4.6, for some tasks we roll out low-fidelity trajectories for shorter horizons. We use a horizon of 15 steps in the HalfCheetah, Hopper, and Walker tasks. For all other tasks, we do not constrain the rollout horizons. Tables 3 and 4 give all other hyperparameters for single-fidelity/MFPG REINFORCE and PPO, respectively.

Table 3: Hyperparameters for REINFORCE (All Tasks)

| Parameter | Value |
| Shared for all algorithms | |
| optimizer | RMSProp |
| learning rate | $7 \cdot 10^{-4}$ |
| discount factor ($\gamma$) | 0.99 |
| value loss coefficient (for value subtraction case) | 0.5 |
| network architecture (multi-layer perceptron, 2 hidden layers) | [64, 64] |
| value network for value subtraction (multi-layer perceptron, 2 hidden layers) | [64, 64] |
| max gradient norm | 0.5 |
| Multi-Fidelity REINFORCE | |
| batch size | 100 |
| total high-fidelity training steps | $10^6$ |
| $c^*$ threshold value for dropping control variate | 0 |
| $\alpha_{\mathrm{ma}}$ | 0.95 |
| Single-Fidelity REINFORCE | |
| batch size for more high-fidelity data case | 1000 |
| total high-fidelity training steps for more high-fidelity data case | $10^7$ |
| batch size for limited high-fidelity data case | 100 |
| total high-fidelity training steps for limited high-fidelity data case | $10^6$ |

Table 4: Hyperparameters for PPO (All Tasks)

| Parameter | Value |
| Shared for all algorithms | |
| optimizer | RMSProp |
| learning rate | $1.5 \cdot 10^{-4}$ |
| discount factor ($\gamma$) | 0.99 |
| GAE coefficient | 0.95 |
| epochs | 10 |
| minibatch size | 64 |
| clipping range | [0.8, 1.2] |
| value loss coefficient | 0.5 |
| actor network architecture (multi-layer perceptron, 2 hidden layers) | [64, 64] |
| critic network architecture (multi-layer perceptron, 2 hidden layers) | [64, 64] |
| Shared Multi-Fidelity PPO hyperparameters for all tasks | |
| batch size | 100 |
| total high-fidelity training steps | $10^6$ |
| Shared hyperparameters for all tasks except HalfCheetah-v5 | |
| $c^*$ threshold value for dropping control variate | 0 |
| $\alpha_{\mathrm{ma}}$ | 0.95 |
| HalfCheetah-v5 specific hyperparameters | |
| $c^*$ threshold value for dropping control variate | 0.2 |
| $\alpha_{\mathrm{ma}}$ | 0.92 |
| Single-Fidelity PPO | |
| batch size for more high-fidelity data case | 2048 |
| total high-fidelity training steps for more high-fidelity data case | $2 \cdot 10^7$ |
| batch size for limited high-fidelity data case | 100 |
| total high-fidelity training steps for limited high-fidelity data case | $10^6$ |
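For readers reproducing the limited-data PPO setting, the shared rows of Table 4 can be mapped onto keyword arguments of the Stable-Baselines3 `PPO` constructor. The mapping below is a guess at how the table translates, not the authors' configuration: in particular, treating the paper's batch size of 100 as `n_steps` with a single environment is an assumption, and the helper `minibatches_per_epoch` is hypothetical.

```python
# Hypothetical mapping from Table 4 to Stable-Baselines3 PPO keyword arguments.
ppo_kwargs = {
    "learning_rate": 1.5e-4,
    "gamma": 0.99,        # discount factor
    "gae_lambda": 0.95,   # GAE coefficient
    "n_epochs": 10,
    "batch_size": 64,     # SB3 names the minibatch size `batch_size`
    "n_steps": 100,       # assumed: one rollout = one batch of 100 samples
    "clip_range": 0.2,    # ratio clipped to [1 - 0.2, 1 + 0.2] = [0.8, 1.2]
    "vf_coef": 0.5,       # value loss coefficient
    "policy_kwargs": {"net_arch": {"pi": [64, 64], "vf": [64, 64]}},
    # Table 4 lists RMSProp; in SB3 this would be requested through
    # policy_kwargs["optimizer_class"], omitted here to avoid importing torch.
}


def minibatches_per_epoch(n_steps: int, batch_size: int) -> int:
    """Minibatch updates per epoch, assuming a final shorter minibatch is kept."""
    return -(-n_steps // batch_size)  # ceiling division


updates_per_rollout = ppo_kwargs["n_epochs"] * minibatches_per_epoch(
    ppo_kwargs["n_steps"], ppo_kwargs["batch_size"]
)
print(updates_per_rollout)
```

With Stable-Baselines3 and Gymnasium installed, such a dictionary would typically be passed as `PPO("MlpPolicy", env, **ppo_kwargs)`.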