This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
arXiv:2503.05573v1 [cs.RO] 7 Mar 2025
InDRiVE: Intrinsic Disagreement-based Reinforcement for Vehicle Exploration through Curiosity-Driven Generalized World Model
Feeza Khan Khanzada¹ and Jaerock Kwon²
Abstract — Model-based Reinforcement Learning (MBRL)
has emerged as a promising paradigm for autonomous driving,
where data efficiency and robustness are critical. Yet, existing
solutions often rely on carefully crafted, task-specific extrinsic
rewards, limiting generalization to new tasks or environments.
In this paper, we propose InDRiVE (Intrinsic Disagreement-
based Reinforcement for Vehicle Exploration), a method that
leverages purely intrinsic, disagreement-based rewards within a
Dreamer-based MBRL framework. By training an ensemble of
world models, the agent actively explores high-uncertainty re-
gions of environments without any task-specific feedback. This
approach yields a task-agnostic latent representation, allowing
for rapid zero-shot or few-shot fine-tuning on downstream
driving tasks such as lane following and collision avoidance.
Experimental results in both seen and unseen environments
demonstrate that InDRiVE achieves higher success rates and
fewer infractions compared to DreamerV2 and DreamerV3
baselines—despite using significantly fewer training steps. Our
findings highlight the effectiveness of purely intrinsic explo-
ration for learning robust vehicle control behaviors, paving
the way for more scalable and adaptable autonomous driving
systems.
I. INTRODUCTION
Model-Based Reinforcement Learning (MBRL) has been
making significant strides in the robotics domain, offering a
compelling alternative to model-free reinforcement learning
by focusing on building an internal model of the environ-
ment before direct interaction. This approach has shown
tremendous potential in reducing training time, creating more
generalized models, and mitigating uncertainties. However,
the reward-centric nature of traditional reinforcement learn-
ing algorithms poses a fundamental challenge to achieving
generalization, particularly in scenarios with sparse rewards.
Both model-free and MBRL algorithms often rely heavily
on task-specific rewards, which limits their adaptability and
efficiency in novel environments.
Inspired by neuroscience, curiosity-based learning offers
a promising avenue to overcome these limitations [1]. In
humans, curiosity drives learning through exploration and the
accumulation of experiences, often independent of immediate
external rewards [2]. By seeking novel states in the environ-
ment, agents can enhance their exploration capabilities using
prediction error from an inverse dynamic model as a measure
of novelty. Subsequent studies have refined and expanded on
this idea, demonstrating its efficacy in training RL models
for better generalization and uncertainty quantification [3]
[4]. Applications of intrinsic motivation have also been explored
in domains such as autonomous vehicles, particularly
for handling sparse rewards and improving exploration, as
highlighted in the related work section.
*This work was supported in part by the National Science Foundation (NSF) under Grant MRI 2214830.
¹Feeza Khan Khanzada and ²Jaerock Kwon are with the Department of Electrical and Computer Engineering, University of Michigan-Dearborn, 4901 Evergreen Rd, Dearborn, MI 48128, United States. {feezakk, jrkwon}@umich.edu
Despite the progress, there is no comprehensive study
that leverages intrinsic motivation to train an MBRL agent
for generalization across task-agnostic extrinsic reward func-
tions. Specifically, the ability to train a single agent that
can adapt to diverse tasks like lane following and collision
avoidance in a zero-shot or few-shot learning setting is
largely unexplored. From the prior work surveyed, three
major research gaps that capture our attention are:
• Although MBRL methods excel in sample efficiency
and have been validated on various tasks, they frequently
rely on domain- or task-specific reward structures
and lack evidence of extensive multi-task generalization.
• Intrinsic motivation has been shown to improve exploration,
but most applications still augment rather than
replace extrinsic rewards. There is a lack of studies
examining a complete reliance on intrinsic rewards to
build a highly adaptable world model.
• While some efforts address specific driving tasks, a
single agent that can adapt to downstream tasks like
lane following and collision avoidance in zero- or few-shot
settings remains underexplored.
Addressing these gaps, we introduce InDRiVE (Intrinsic
Disagreement-based Reinforcement for Vehicle Exploration),
which leverages a Dreamer-based MBRL agent. InDRiVE
relies solely on ensemble model disagreement for intrinsic
motivation, enabling the agent to learn a robust, task-agnostic
latent world model. Our objective is to facilitate zero-shot
or few-shot fine-tuning across diverse driving tasks, thus
minimizing training time and reducing reliance on manual
reward engineering for real-world deployment. The contributions
of this work are as follows:
• To the best of our knowledge, InDRiVE is the first
study to train an ego-vehicle exclusively with intrinsic
rewards, leveraging latent disagreement among an ensemble
of world models (based on Plan2Explore [5]). This eliminates
the need for hand-crafted task rewards, relying
solely on uncertainty-based signals to build a robust,
task-agnostic representation of the environment.
• The resulting world model supports zero-shot and
few-shot adaptation to real driving tasks (e.g., lane-following,
collision avoidance), drastically reducing
domain-specific reward engineering.
• Through purely intrinsic exploration, our approach
yields a versatile world model capable of zero-shot and
few-shot transfer to downstream driving tasks such as
lane-following and collision avoidance. This demonstrates
that the learned model is not only comprehensive
but also quickly adaptable to practical driving
objectives, significantly reducing the need for domain-specific
reward engineering.
• Our findings confirm that fully intrinsic reward mechanisms
are both viable and beneficial for high-dimensional,
safety-critical domains like autonomous
driving, paving the way for broader self-supervised
MBRL solutions.
By capitalizing on intrinsic model disagreement sig-
nals, InDRiVE achieves robust exploration, rapid adap-
tation to new tasks, and a streamlined reward design
pipeline—pointing toward more scalable, self-supervised so-
lutions for future autonomous vehicles.
II. RELATED WORK
MBRL has transitioned from a theoretical construct to
a practical solution for autonomous vehicle (AV) control,
driven by advances in model fidelity, planning algorithms,
and deep neural networks [6]. Unlike model-free methods,
which rely primarily on trial-and-error, MBRL incorporates
a learned world model of the environment to enable look-
ahead planning and improve data efficiency. In the context
of autonomous driving, these learned models can anticipate
future states and rewards, allowing for safer decision-making
and reduced real-world experimentation [7] [8]. Recent work
in simulation platforms such as CARLA [9] has demon-
strated that world-model-based planners can imagine a di-
verse range of upcoming scenarios before executing actions,
thus mitigating safety risks and addressing data scarcity
by synthesizing additional training samples [10]. Continued
innovations like latent state abstraction, uncertainty-aware
modeling, and online adaptation further reduce the gap
between purely simulated training and real-world deployment
[11] [12]. Moreover, while most prior efforts focus
on on-road driving scenarios, a recent analytical study on off-road
autonomy found that selecting the right image region-of-interest
and using a larger training dataset significantly
improve the performance of vision-based end-to-end lateral
control [13]. Such findings highlight the importance of data
representation and collection strategies, which could simi-
larly benefit model-based methods by ensuring that learned
representations capture critical environmental cues across
diverse driving conditions.
Intrinsic Motivation (IM) and curiosity-driven exploration
have emerged as essential mechanisms for guiding agents
in sparse-reward or high-dimensional environments, where
extrinsic feedback is rare or too costly to define [2] [14].
IM provides agents with self-generated reward signals that
encourage exploration, often by rewarding novelty, uncer-
tainty, or prediction error [15] [16]. Notable curiosity-based
approaches include the Intrinsic Curiosity Module (ICM) [2] and Random Network Distillation (RND) [4], both of
which incentivize agents to visit unfamiliar or surprising
states. Such methods have been successfully applied to
robotic systems and video game domains, enabling agents
to learn skills in the absence of dense external rewards [14]
[17]. However, purely intrinsic exploration can lead agents
to fixate on irrelevant or noisy events, spurring interest in
techniques that combine curiosity with additional constraints
or memory mechanisms to ensure meaningful, goal-relevant
exploration [18].
In the realm of autonomous driving, prior research has
mostly leveraged intrinsic motivation as a complementary
signal rather than a primary training objective [19] [7]
[8]. Typical reinforcement learning frameworks for driving
rely on task-specific reward functions (e.g., measuring route
progress, penalizing collisions, or encouraging lane-keeping)
[20], often augmented with a small curiosity bonus to expe-
dite convergence. While this hybrid approach can alleviate
some exploration hurdles, it still anchors the learned policy
to a particular extrinsic objective, reducing its flexibility to
generalize across tasks or conditions. Additionally, exploration
in autonomous driving requires careful consideration
of safety and real-world feasibility; purely random or naive
exploration is not viable in practice, further complicating the
application of intrinsic rewards [21].
Crucially, existing literature lacks studies that train a full
end-to-end driving policy exclusively via intrinsic rewards.
While purely curiosity-driven methods have been demon-
strated in simpler continuous-control scenarios [5] [22], no
prior work has shown that an autonomous vehicle agent can
acquire complex driving behaviors (such as collision avoid-
ance or lane-following) without relying on explicit, task-
specific feedback. This gap is particularly significant given
the potential advantages of a fully task-agnostic paradigm
in which the agent discovers relevant driving skills inde-
pendently and subsequently fine-tunes to specific tasks with
minimal overhead. Our work aims to bridge this gap by
integrating an ensemble-based model disagreement signal,
inspired by Plan2Explore [5], into a Dreamer-based agent [11], allowing
the vehicle to learn a robust world model in CARLA
solely through intrinsic exploration signals. Ultimately, this
approach seeks to demonstrate that internal disagreement
metrics can serve as a standalone training driver, paving
the way for efficient, flexible, and generalized autonomous
driving policies.
III. METHODOLOGY
In this section, we detail our InDRiVE approach, which
extends DreamerV3 with an ensemble-based intrinsic exploration
mechanism inspired by Plan2Explore [5]. The goal is to train
a robust, task-agnostic world model via curiosity-driven
exploration, then fine-tune the learned policy with minimal
additional effort for specific driving tasks in CARLA. Fig. 1
presents a high-level overview of InDRiVE along with the
latent disagreement mechanism.
[Figure 1: two panels. (a) Overview of the InDRiVE Actor Critic Policy. (b) Latent Disagreement (LD) Reward.]
Fig. 1: Overview of InDRiVE. (a) An actor-critic policy architecture incorporating latent disagreement for exploration.
LD denotes the Latent Disagreement module shown in (b). Raw images are encoded into a stochastic latent $s_t$, which is combined with the deterministic
hidden state $h_t$ to maintain temporal context. The actor-critic policy then outputs an action $a_t$ based on $[s_t, h_t]$. (b) An
ensemble of forward models predicts potential next states $\hat{s}^k_{t+1}$ for the same $(s_t, a_t)$. The variance among these predictions
yields a latent-disagreement (intrinsic) reward, which encourages the policy to explore.
A. Intrinsic Motivation and World Model
InDRiVE is an MBRL framework designed for au-
tonomous driving. It adopts the DreamerV3 architecture
[23] for its latent world model and planning capabilities
while leveraging ensemble disagreement to generate purely
intrinsic rewards during an initial exploration phase. This
approach is motivated by [5], which demonstrated that self-
supervised exploration improves sample efficiency and task
generalization. In InDRiVE, we first train the agent solely
with intrinsic rewards (no task-specific feedback), yielding
a broad coverage of driving scenarios and a capable latent
world model. Subsequently, we introduce extrinsic rewards
to fine-tune the policy for tasks such as lane following or
collision avoidance.
We formulate autonomous driving as a Markov Decision
Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma)$. States $s \in \mathcal{S}$ encapsulate
sensor observations, while actions $a \in \mathcal{A}$ correspond
to vehicular control inputs (steering, throttle, braking).
The transition model $p(s_{t+1} \mid s_t, a_t)$ governs environment
dynamics, and $r(s_t, a_t)$ provides task-dependent feedback.
In our setting, intrinsic exploration replaces task-specific
rewards during the initial training phase:
$$r^{\text{int}}_t = \text{disagreement-based curiosity signal},$$
whereas the fine-tuning phase introduces extrinsic signals:
$$r^{\text{ext}}_t = \text{task-specific reward signal}.$$
We can also combine the extrinsic and intrinsic rewards,
where the intrinsic reward can be used to augment
reward-based training:
$$r_t = \alpha\, r^{\text{ext}}_t + (1 - \alpha)\, r^{\text{int}}_t, \tag{1}$$
with $\alpha \in [0, 1]$ controlling the weighting between extrinsic
and intrinsic rewards.
We adopt a Recurrent State-Space Model (RSSM) to learn
a compact representation of high-dimensional sensory inputs
(e.g., images) and predict future observations and rewards.
Following the Dreamer framework [23], the RSSM consists
of four main components:
• Encoder $q_\phi(z_t \mid s_t)$: Converts raw observations $s_t$ into
a stochastic latent state $z_t$.
• Recurrent Core (GRU): Maintains a hidden state $h_t$,
summarizing past latent states and actions.
• Transition Model $p_\phi(z_{t+1} \mid z_t, a_t, h_t)$: Predicts the next
latent state $z_{t+1}$ given the current latent state, action $a_t$,
and hidden state $h_t$.
• Decoder $p_\phi(s_t \mid z_t)$: Reconstructs or imagines the
original observation $s_t$ from the latent state $z_t$.
Additionally, we include a reward predictor $p_\phi(r_t \mid z_t, h_t)$
to model the immediate reward, and a discount (or continuation)
predictor $p_\phi(\gamma_t \mid z_t, h_t)$ to handle episode termination.
At each time step, we thus have:
$$h_t = \mathrm{GRU}(h_{t-1}, z_{t-1}, a_{t-1}), \tag{2}$$
$$z_t \sim q_\phi(z_t \mid s_t, h_t), \tag{3}$$
with the transition prior
$$p_\phi(z_{t+1} \mid z_t, a_t, h_{t+1}). \tag{4}$$
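The data flow of Eqs. (2)-(4) can be illustrated with a scalar toy model. The following is a minimal sketch, not the InDRiVE implementation: the weights, the Gaussian means, the standard deviation of 0.1, and the helper names `gru_step`, `posterior_sample`, `prior_sample`, `filter_sequence`, and `imagine` are all assumptions made for illustration. In particular, the sketch assumes the prior of Eq. (4) depends on $(z_t, a_t)$ only through $h_{t+1}$.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(h_prev: float, z_prev: float, a_prev: float) -> float:
    """Eq. (2): h_t = GRU(h_{t-1}, z_{t-1}, a_{t-1}); scalar toy weights."""
    x = 0.8 * z_prev + 0.4 * a_prev             # fuse previous latent and action
    update = sigmoid(0.5 * h_prev + 1.0 * x)    # update gate
    reset = sigmoid(0.3 * h_prev - 0.7 * x)     # reset gate
    candidate = math.tanh(0.9 * reset * h_prev + 1.2 * x)
    return (1.0 - update) * h_prev + update * candidate


def posterior_sample(s_t: float, h_t: float, rng: random.Random) -> float:
    """Eq. (3): z_t ~ q(z_t | s_t, h_t), here a Gaussian with a toy mean."""
    return rng.gauss(0.6 * s_t + 0.4 * h_t, 0.1)


def prior_sample(h_next: float, rng: random.Random) -> float:
    """Eq. (4): assumed to see (z_t, a_t) only through h_{t+1}."""
    return rng.gauss(0.4 * h_next, 0.1)


def filter_sequence(observations, actions, seed=0):
    """Posterior pass over real data; returns the (h_t, z_t) pairs."""
    rng = random.Random(seed)
    h, z, a = 0.0, 0.0, 0.0      # initial hidden state, latent, and action
    states = []
    for s_t, a_t in zip(observations, actions):
        h = gru_step(h, z, a)    # consumes the previous latent and action
        z = posterior_sample(s_t, h, rng)
        states.append((h, z))
        a = a_t
    return states


def imagine(h, z, actions, seed=0):
    """Imagined rollout: no observations, latents come from the prior."""
    rng = random.Random(seed)
    rollout = []
    for a_t in actions:
        h = gru_step(h, z, a_t)
        z = prior_sample(h, rng)
        rollout.append((h, z))
    return rollout
```

A typical use is to call `filter_sequence` on a short observed sequence and then pass the last `(h, z)` pair to `imagine` with a list of candidate actions.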
We jointly optimize the encoder, decoder, transition, reward,
and discount networks. The training loss, inspired
by the variational Evidence Lower Bound (ELBO), can be
expressed as:
$$\mathcal{L}_{\text{model}}(\phi) = \mathbb{E}_{q_\phi}\big[-\ln p_\phi(s_t \mid z_t) - \ln p_\phi(r_t \mid z_t, h_t)\big] + \beta\, \mathbb{E}_{q_\phi}\big[D_{\mathrm{KL}}\big(q_\phi(z_t \mid s_t, h_t) \,\|\, p_\phi(z_t \mid h_t)\big)\big] + \lambda_\gamma\, \mathbb{E}_{q_\phi}\big[-\ln p_\phi(\gamma_t \mid z_t, h_t)\big], \tag{5}$$
where:
• $-\ln p_\phi(s_t \mid z_t)$ is the reconstruction loss, penalizing
the model for poor observation predictions.
• $-\ln p_\phi(r_t \mid z_t, h_t)$ is the reward prediction loss.
• $D_{\mathrm{KL}}\big(q_\phi(z_t \mid s_t, h_t) \,\|\, p_\phi(z_t \mid h_t)\big)$ is the KL divergence
between the encoder posterior and the transition prior,
encouraging compact and consistent latent states.
• $\beta$ scales or clips the KL term (the free-bits heuristic [12])
so that the model retains sufficient representational
capacity without collapsing.
• $-\ln p_\phi(\gamma_t \mid z_t, h_t)$ is an optional discount/continuation
loss (weighted by $\lambda_\gamma$) that helps the model account for
terminal states.
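To make the role of each term in Eq. (5) concrete, the sketch below evaluates the loss for a single time step with scalar Gaussian posterior and prior. It is an illustration under stated assumptions rather than the training code: the free-bits threshold `free_nats = 1.0`, the default `beta` and `lambda_gamma`, and the function names are hypothetical, and free bits is implemented here as a simple lower clip on the KL value.

```python
import math


def gaussian_kl(mu_q: float, sigma_q: float, mu_p: float, sigma_p: float) -> float:
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) in nats."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)


def model_loss(recon_nll: float, reward_nll: float, kl: float, cont_nll: float,
               beta: float = 1.0, free_nats: float = 1.0,
               lambda_gamma: float = 1.0) -> float:
    """Single-step version of Eq. (5) with a free-bits clip on the KL term.

    recon_nll  = -ln p(s_t | z_t)
    reward_nll = -ln p(r_t | z_t, h_t)
    cont_nll   = -ln p(gamma_t | z_t, h_t)
    """
    # Free bits (assumed form): KL values below the threshold are not
    # penalized further, so the latent keeps some capacity.
    clipped_kl = max(kl, free_nats)
    return recon_nll + reward_nll + beta * clipped_kl + lambda_gamma * cont_nll
```

For example, `model_loss(2.0, 0.5, gaussian_kl(1.0, 1.0, 0.0, 1.0), 0.1)` clips the KL of 0.5 nats up to the threshold before adding it to the other terms.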
We sample short latent-rollout sequences from a replay buffer
of past trajectories, optimize (5) in mini-batches, and update
the model parameters ϕvia stochastic gradient descent.
Once trained, the RSSM provides a forward model for
imagined rollouts: starting from a real or latent-encoded
state, the model predicts future states, rewards, and discounts,
thereby enabling policy learning and planning entirely within
the compact latent space.
B. Ensemble Disagreement for Intrinsic Exploration
We incorporate ensemble disagreement to drive curiosity,
building on the self-supervised exploration scheme introduced
by [3]. Specifically, we train $K$ lightweight forward
dynamics models, each predicting the next latent state $s_{t+1}$
given $(s_t, a_t)$. Let $\mu_k(s_t, a_t)$ denote the prediction of the $k$-th
model. The intrinsic reward $r^{\text{int}}_t$ is computed as the variance
of these predictions:
$$r^{\text{int}}_t = \mathrm{Var}\big[\mu_1(s_t, a_t), \mu_2(s_t, a_t), \ldots, \mu_K(s_t, a_t)\big]. \tag{6}$$
High disagreement indicates unexplored or uncertain regions,
incentivizing the policy to gather data where the world
model is less confident. As training progresses, this promotes
coverage of diverse states and reduces model uncertainty in
safety-critical scenarios.
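A minimal sketch of Eq. (6) follows. Because Eq. (6) does not specify how a vector-valued variance is reduced to a scalar, the sketch assumes the population variance is taken across the $K$ ensemble members for each latent dimension and then averaged over dimensions; the function name `latent_disagreement` is hypothetical.

```python
from statistics import pvariance
from typing import List


def latent_disagreement(predictions: List[List[float]]) -> float:
    """Intrinsic reward from K ensemble predictions of the next latent state.

    predictions[k][d] is dimension d of the k-th model's prediction
    mu_k(s_t, a_t). Returns the per-dimension variance across models,
    averaged over dimensions (an assumed reduction).
    """
    if len(predictions) < 2:
        raise ValueError("need at least two ensemble members")
    per_dim = [pvariance(values) for values in zip(*predictions)]
    return sum(per_dim) / len(per_dim)
```

When all members agree, the reward is zero; the reward grows as the predictions spread out, which is what draws the policy toward poorly modeled regions.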
C. Steering Loss Function
To encourage smooth driving behavior, we introduce a
steering loss function inspired by [24], adapted to penalize
excessively large steering angles. Let $a^{(\text{steer})}_t$ denote the
steering command at time $t$, measured in the range $[-1, 1]$
(left to right turn). We define:
$$r_{\text{steer}}(a_t) = \begin{cases} -\lambda, & \text{if } a^{(\text{steer})}_t > \delta, \\ 0, & \text{otherwise}, \end{cases} \tag{7}$$
where $\lambda > 0$ and $\delta \in (0, 1)$ is a steering-angle threshold. In
practice, we set $\lambda = 0.5$ and incorporate this penalty term
during training. This additional cost biases the policy to avoid
extreme steering angles, thereby promoting smoother, more
stable navigation without preventing necessary turns.
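The penalty of Eq. (7) can be sketched in a few lines. Two points are assumptions rather than statements from the text: the threshold value used in the example (`delta = 0.6`) is hypothetical, and the `symmetric` flag, which compares the magnitude of the steering command to the threshold so that sharp left and right turns are treated alike, is an interpretation. Setting `symmetric=False` reproduces Eq. (7) literally.

```python
def steering_penalty(steer: float, delta: float, lam: float = 0.5,
                     symmetric: bool = True) -> float:
    """Steering penalty in the spirit of Eq. (7).

    steer: steering command in [-1, 1] (left to right).
    delta: steering threshold in (0, 1).
    lam:   penalty magnitude (0.5 in the text).
    symmetric: if True, penalize |steer| > delta (assumed reading);
               if False, penalize steer > delta exactly as written.
    """
    magnitude = abs(steer) if symmetric else steer
    return -lam if magnitude > delta else 0.0
```

With `delta = 0.6`, a command of 0.3 incurs no cost, while 0.9 incurs a cost of 0.5.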
D. Training Procedure
Algorithm 1 summarizes the two-phase training pipeline:
Algorithm 1 InDRiVE Training Procedure
Require: Environment $E$ (CARLA), replay buffer $\mathcal{D}$, number of ensemble models $K$, exploration steps $N_{\text{explore}}$, fine-tuning steps $N_{\text{fine}}$.
1: Initialize parameters of DreamerV3 world model $\phi$, policy $\pi_\theta$, value network $v_\theta$, and ensemble models $\{\mu_k\}_{k=1..K}$.
2: $\mathcal{D} \leftarrow \{\}$ (empty replay buffer)
3: for step = 1 to $N_{\text{explore}}$ do
4:   Roll out policy $\pi_\theta$ in $E$ for $T$ steps to collect $\{(o_t, a_t, o_{t+1})\}$.
5:   Encode $s_t \leftarrow q_\phi(s_t \mid o_t)$.
6:   Compute disagreement $r^{\text{int}}_t$ via Eq. (6).
7:   $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_t, a_t, r^{\text{int}}_t, s_{t+1})\}$.
8:   Update DreamerV3 world model & ensemble models using ELBO-based loss (Eq. (5)).
9:   Update $\pi_\theta$, $v_\theta$ via imagination in the latent space, optimizing intrinsic returns.
10: end for
11: {Fine-Tuning for Task-Specific Rewards}
12: for step = 1 to $N_{\text{fine}}$ do
13:   Introduce extrinsic reward $r^{\text{ext}}$ for downstream task (e.g., collision avoidance).
14:   $r_t = r^{\text{ext}}_t$
15:   (zero-shot): no additional data collection.
16:   (few-shot): gather limited on-policy data to refine $\phi, \theta$.
17: end for
18: return Optimized policy $\pi_\theta$ and world model parameters $\phi$.
Phase 1: Task-Agnostic Exploration: The agent explores
the CARLA environment by maximizing the ensemble-disagreement
reward, augmented with the steering penalty
for stable control. This phase yields a broad coverage of
driving states and a well-trained world model without relying
on task-specific guidance.
Phase 2: Task-Specific Fine-Tuning: We then introduce
the extrinsic driving objective (lane following and collision
avoidance). The policy learns to balance this task reward
with the residual intrinsic signal and steering loss. In many
cases, zero-shot adaptation is possible, as the agent’s learned
representation already encodes crucial driving behaviors.
Otherwise, a small number of additional training episodes
is sufficient for few-shot adaptation, drastically reducing
total sample complexity compared to purely extrinsic-driven
training.
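The reward used in each phase can be written as a special case of Eq. (1). The sketch below assumes $\alpha = 0$ during exploration (purely intrinsic) and $\alpha = 1$ during fine-tuning (purely extrinsic, matching $r_t = r^{\text{ext}}_t$ in Algorithm 1), and it assumes the steering penalty is simply added to the blended reward. The names `blended_reward`, `phase_reward`, and `PHASE_ALPHA` are hypothetical.

```python
def blended_reward(r_ext: float, r_int: float, alpha: float) -> float:
    """Eq. (1): r_t = alpha * r_ext + (1 - alpha) * r_int."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * r_ext + (1.0 - alpha) * r_int


# Assumed per-phase weights: exploration ignores the task reward,
# fine-tuning ignores the disagreement reward.
PHASE_ALPHA = {"explore": 0.0, "finetune": 1.0}


def phase_reward(phase: str, r_ext: float, r_int: float,
                 r_steer: float = 0.0) -> float:
    """Reward for one step in the given phase, plus the steering penalty."""
    return blended_reward(r_ext, r_int, PHASE_ALPHA[phase]) + r_steer
```

An intermediate value such as `alpha = 0.8` would keep a residual intrinsic signal during fine-tuning; the sketch leaves that choice to the caller of `blended_reward`.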
Overall, this two-phase approach demonstrates how self-
supervised exploration can bootstrap a robust world model,
leading to faster and more versatile task adaptation in au-
tonomous driving. We use the CARLA simulator as our
primary testbed, taking advantage of its:
• Realistic sensor data: RGB camera, LiDAR, GPS, and
odometry information,
• Complex traffic scenarios: dynamic vehicles, pedestrians,
traffic lights, and multi-lane roads,
• Configurable weather and lighting conditions: enabling
diverse scenarios for robust exploration.
IV. EXPERIMENTAL SETUP
This section details the experimental framework for as-
sessing our proposed approach. We begin by introducing
the CARLA simulation environment and the tasks under
consideration, followed by the two-phase training procedure.
We then describe the baseline methods, hyperparameter
configurations, and the metrics used for evaluation.
A. Environment Setup
We focus on two CARLA towns, Town01 (a small town
with a river and bridges) and Town02 (a small town
with a mixture of residential and commercial buildings),
both with moderate traffic density. At each time step, the agent
receives a 128×128 semantic segmentation image, along
with throttle and steering angle information. To capture
temporal dependencies, we stack four consecutive semantic
segmentation frames as a single observation input to the
encoder. The agent outputs continuous control commands
(steering, throttle, brake).
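Frame stacking as described above can be sketched with a fixed-length queue. The class name `FrameStack` is hypothetical, frames are treated as opaque objects (for example, 128×128 segmentation arrays), and padding the stack with copies of the first frame at episode start is an assumption; the text does not state how the first three steps are handled.

```python
from collections import deque
from typing import Any, Deque, List


class FrameStack:
    """Keeps the k most recent frames as one stacked observation."""

    def __init__(self, k: int = 4) -> None:
        self.k = k
        self.frames: Deque[Any] = deque(maxlen=k)

    def reset(self, first_frame: Any) -> List[Any]:
        """Start an episode by repeating the first frame k times (assumed)."""
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame: Any) -> List[Any]:
        """Add the newest frame; the oldest one is dropped automatically."""
        self.frames.append(frame)
        return list(self.frames)
```

The list returned by `push` is ordered oldest to newest and can be concatenated along the channel axis before it is passed to the encoder.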
B. Tasks and Scenarios
We consider two representative driving tasks to demon-
strate zero-shot and few-shot performance:
• Lane Following (LF): The agent must maintain its lane
position while traveling at a safe speed.
• Collision Avoidance (CA): The agent must avoid colliding
with other vehicles and obstacles in real-time traffic
scenarios.
Episodes terminate upon any of the following events:
• Collision: The agent collides with another vehicle,
pedestrian, or static obstacle.
• Wrong Direction: The agent drives in the opposite
direction of the intended lane.
• Off-Road Driving: The agent leaves the drivable area.
• Vehicle Stall: The agent’s velocity falls below a minimal
threshold (e.g., 1 km/h) for an extended period (e.g.,
1 minute).
• Episode Completion: The agent successfully completes
the number of steps assigned for the episode without
any lane violation or collision.
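The termination rules above can be expressed as a small per-episode monitor. This is a sketch under assumptions: the `StepInfo` fields stand in for whatever signals the simulator wrapper exposes, the names `StepInfo` and `EpisodeMonitor` are hypothetical, and the stall thresholds default to the example values in the text (1 km/h for 1 minute).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepInfo:
    """Per-step signals assumed to be provided by the simulator wrapper."""
    collided: bool = False
    wrong_direction: bool = False
    off_road: bool = False
    speed_kmh: float = 0.0


class EpisodeMonitor:
    """Tracks one episode and reports why it ends, if it does."""

    def __init__(self, max_steps: int, dt_s: float,
                 stall_speed_kmh: float = 1.0,
                 stall_limit_s: float = 60.0) -> None:
        self.max_steps = max_steps
        self.dt_s = dt_s                    # simulated seconds per step
        self.stall_speed_kmh = stall_speed_kmh
        self.stall_limit_s = stall_limit_s
        self.steps = 0
        self.stalled_s = 0.0

    def update(self, info: StepInfo) -> Optional[str]:
        """Return a termination reason, or None if the episode continues."""
        self.steps += 1
        if info.collided:
            return "collision"
        if info.wrong_direction:
            return "wrong_direction"
        if info.off_road:
            return "off_road"
        # Accumulate stall time only while the vehicle stays below threshold.
        if info.speed_kmh < self.stall_speed_kmh:
            self.stalled_s += self.dt_s
        else:
            self.stalled_s = 0.0
        if self.stalled_s >= self.stall_limit_s:
            return "stall"
        if self.steps >= self.max_steps:
            return "completed"
        return None
```

Only the `"completed"` outcome counts as a success; the other reasons are infractions or failures.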
C. Two-Phase Training Procedure
The learning process is divided into two main phases: (i)
intrinsic exploration for building a general-purpose world
model, and (ii) task-specific fine-tuning that leverages this
model for downstream tasks.
1) Task-Agnostic Exploration: During Phase 1, we train
a task-agnostic InDRiVE agent solely using an intrinsic reward
derived from ensemble disagreement. Specifically, we randomize
key environment parameters (e.g., weather, traffic
density) every 10,000 steps to ensure diverse experiences,
then roll out the current policy for 1,000 steps and store all
transitions in a replay buffer. Afterward, we update the ensemble
of forward dynamics models, along with the encoder-decoder
modules and policy/value networks, in latent space
using the intrinsic reward $r^{\text{int}}_t$. This cycle of randomization,
data collection, and model updating is repeated until a
predetermined maximum of environment interactions (e.g.,
50K steps) is reached.
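The exploration cycle can be sketched as a simple schedule generator. The function name `exploration_schedule` is hypothetical, and the defaults reuse the example figures from the text (1,000-step rollouts, randomization every 10,000 steps, a 50K-step budget); the actual data collection and model updates are omitted.

```python
from typing import Iterator, Tuple


def exploration_schedule(max_steps: int = 50_000,
                         rollout_len: int = 1_000,
                         randomize_every: int = 10_000
                         ) -> Iterator[Tuple[int, bool]]:
    """Yield (start_step, randomize) for each rollout of the exploration phase.

    randomize is True when the environment parameters (e.g., weather,
    traffic density) should be re-sampled before this rollout.
    """
    step = 0
    while step < max_steps:
        yield step, step % randomize_every == 0
        step += rollout_len
```

A training loop would iterate over this schedule, re-sample the environment when the flag is set, collect `rollout_len` transitions, and then update the world model, ensemble, and policy.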
2) Task-Specific Fine-Tuning: Following intrinsic explo-
ration, we evaluate and refine the agent’s performance on
downstream tasks (e.g., lane following or collision avoid-
ance) through both zero-shot and few-shot evaluations. For
zero-shot evaluation, we freeze the world model parameters
(encoder, decoder, and ensemble) and directly test the policy
without further training, recording success rates and infractions
to gauge initial performance. For few-shot evaluation,
we introduce a task-specific extrinsic reward $r^{\text{ext}}_t$, collect a
small batch of new data, and update the policy and value
networks by applying extrinsic rewards. We then measure
the resultant performance gains to assess how effectively the
agent adapts to the target task.
D. Baseline Methods
For comparative evaluation, we focus on DreamerV2
(Task-Specific) and DreamerV3 (Task-Specific), both of
which train a world model and policy from scratch using
only task-specific rewards (e.g., for lane following or colli-
sion avoidance), without incorporating any intrinsic rewards.
These baselines thus provide a performance and sample-
efficiency benchmark for traditional, task-centric learning
approaches, enabling a clear assessment of the benefits
gained by integrating intrinsic exploration in our method.
Table I summarizes the key hyperparameters used during
the intrinsic exploration phase and the few-shot fine-tuning
phase.
TABLE I: Key Hyperparameters for InDRiVE and Fine-Tuning
Hyperparameter                    Intrinsic Exploration   Fine-Tuning
Learning rate (world model)       1×10⁻⁴                  1×10⁻⁴
Learning rate (policy/value)      1×10⁻⁴                  1×10⁻⁴
Ensemble size (K)                 8                       8 (frozen)
Batch size                        64                      64
Replay buffer size                10⁵                     10⁵
Discount factor (γ)               0.99                    0.99
Intrinsic reward weight (1 − α)   1.0                     0.0
Training steps                    50K env steps           10K env steps
E. Evaluation Metrics
We benchmark InDRiVE on multiple scenarios in the
CARLA simulator. Key metrics include:
• Success Rate (SR): Rate of successful completion of
episodes without any lane violation or collision.
• Infraction Rate (IR): Rate of rule violations (collisions,
lane departures) per episode.
• Zero-Shot/Few-Shot Adaptation: Evaluates how well
the agent performs the task with no (zero-shot) and
minimal (few-shot) additional interactions, highlighting
the benefit of curiosity-driven exploration.
By measuring performance across different tasks, towns, and
training regimes, we obtain a comprehensive view of zero-
shot and few-shot generalization in complex urban driving
scenarios.
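As a minimal sketch of how SR and IR could be computed from episode outcomes, the function below treats each episode as either a success (completed without infraction) or an infraction, so the two rates sum to 100%. This per-episode reading is an assumption, and the function name is hypothetical.

```python
from typing import List, Tuple


def success_and_infraction_rates(successes: List[bool]) -> Tuple[float, float]:
    """Return (SR, IR) in percent.

    successes[i] is True if episode i finished without a collision or
    lane violation, and False if it ended with an infraction.
    """
    if not successes:
        raise ValueError("at least one episode is required")
    n = len(successes)
    n_success = sum(successes)
    sr = 100.0 * n_success / n
    ir = 100.0 * (n - n_success) / n
    return sr, ir
```

For instance, three clean episodes out of four give an SR of 75% and an IR of 25%.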
[Figure 2: three panels, (a) Lane Following, (b) Collision Avoidance, (c) Lane Following + Collision Avoidance. x-axis: Environment Steps (480,000 to 510,000); y-axis: Reward Rate (0.0 to 1.0); curves: InDRiVE, DreamerV3, DreamerV2.]
Fig. 2: Average reward rates of InDRiVE (red), DreamerV3 (blue), and DreamerV2 (green) across three CARLA driving
tasks. The gray area after 500K steps indicates the start of InDRiVE’s finetuning phase (few-shot learning). Despite being
trained on the extrinsic reward for fewer steps (10K), InDRiVE (red) converges to near-optimal performance in all three
tasks—surpassing both Dreamer baselines—and demonstrates superior sample efficiency and training stability overall.
V. RESULTS
A. Zero-Shot Evaluation
Table II reports the zero-shot evaluation of InDRiVE, a DreamerV3-based
agent trained in Town01 using only a latent disagreement-based
intrinsic reward signal, when tested in both Town01 and
Town02. The model is trained for 500K exploration steps,
and performance is measured over 50K evaluation steps in
each town.
Overall, these results indicate that using InDRiVE in the
training phase can yield an agent capable of generalizing
from one town to another. While the transfer performance
remains slightly lower than the results observed in the
training environment, the similarity in success and collision
rates suggests that the agent’s learned exploration strategy
maintains a degree of robustness across different environ-
ments.
TABLE II: Zero-Shot Learning Evaluation of InDRiVE on Town01 & 02
Train Steps   Eval Steps   Eval Town01 (seen)      Eval Town02 (unseen)
                           SR (%) ↑   IR (%) ↓     SR (%) ↑   IR (%) ↓
500K          50K          64.52      35.48        64.06      35.94
B. Few-Shot Evaluation
Table III compares three models—InDRiVE (ours),
DreamerV3, and DreamerV2—across three driving tasks
(Lane Following, Collision Avoidance, and Lane Following +
Collision Avoidance) in two CARLA towns, Town01 (seen
during training) and Town02 (unseen). The table reports
two primary metrics: Success Rate (SR), the percentage of
episodes completed without collisions or lane departures, and
Infraction Rate (IR), the percentage of episodes in which a
collision or off-lane event occurred. Each model is described
by the number of training steps (Train) and the number of
evaluation steps (Eval).
The results highlight several points. First, InDRiVE achieves
the highest SR and lowest IR on the LF and LF + CA tasks in both towns,
while requiring notably fewer extrinsic-reward training steps (10K) compared
to DreamerV2 or DreamerV3 (510K). On the CA task, DreamerV3 attains
a higher SR than InDRiVE (73.68% vs. 66.10% in Town01 and 98.00% vs.
83.05% in Town02). In Town01,
InDRiVE’s SR ranges from 66% to 96% across tasks,
while in Town02 the performance remains high (83% to
100%), indicating strong generalization to the unseen town. By contrast,
DreamerV2 shows lower SR, particularly in Lane Following
tasks, where it struggles to stay within lanes in both
towns. DreamerV3 performs moderately well in Town01,
and its performance in the unseen Town02 is also decent,
but InDRiVE surpasses it in success rate and infraction
reduction on the LF and LF + CA tasks. Overall, these findings suggest that incorporating
intrinsic disagreement-based exploration (InDRiVE) yields
more efficient learning and robust navigation behaviors compared
to the Dreamer baselines.
Fig. 2 compares the average reward
rates of InDRiVE (red), DreamerV3 (blue), and DreamerV2
(green) over environment steps in CARLA for three tasks:
Lane Following (left), Collision Avoidance (middle), and
Lane Following + Collision Avoidance (right). The x-axis
represents the number of environment steps, while the y-axis
denotes the reward rate. Notably, InDRiVE rapidly converges
to high reward across all three tasks, whereas the Dreamer
baselines require more steps and show greater fluctuation in
reward.
VI. C ONCLUSION AND FUTURE WORK
We introduced InDRiVE, a fully intrinsic MBRL frame-
work for autonomous driving that eliminates task-specific
external rewards by relying solely on ensemble disagreement
signals for exploration. Experiments in CARLA show that
InDRiVE achieves higher success rates and fewer infrac-
tions than DreamerV2 and DreamerV3, while using fewer
training steps. Its latent representation transfers effectively to
both familiar (Town01) and unfamiliar (Town02) settings,
enabling zero-shot or few-shot adaptation to tasks like lane
following and collision avoidance. These findings highlight
the benefits of purely intrinsic exploration in uncovering
robust driving policies and underscore the potential for
reducing dependence on manual reward design. Future research
directions include exploring more complex traffic scenarios,
integrating richer sensor modalities, addressing sim-to-real
transfer, investigating continual and multi-task learning, and
evaluating alternative intrinsic reward formulations to further
enhance scalability, data efficiency, and adaptability.

TABLE III: Comparison of models on three tasks in both Town01 and Town02

                                          Town01 (seen)       Town02 (unseen)
Task     Model            Train   Eval    SR (%) ↑  IR (%) ↓  SR (%) ↑  IR (%) ↓
LF       InDRiVE (ours)   10K     50K     96.08     3.92      100.00    0.00
LF       DreamerV3        510K    50K     64.52     35.48     64.06     35.94
LF       DreamerV2        510K    50K     28.09     71.91     29.07     70.93
CA       InDRiVE (ours)   10K     50K     66.10     33.90     83.05     16.95
CA       DreamerV3        510K    50K     73.68     24.56     98.00     2.00
CA       DreamerV2        510K    50K     39.24     60.76     33.33     66.67
LF + CA  InDRiVE (ours)   10K     50K     83.02     16.98     100.00    0.00
LF + CA  DreamerV3        510K    50K     73.21     26.79     98.00     2.00
LF + CA  DreamerV2        510K    50K     30.95     69.05     28.28     71.72
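
The comparisons summarized from Table III can be re-derived with a few lines of Python. The sketch below is illustrative only and is not part of the InDRiVE code base: the dictionary layout and the helper names `sr_margin` and `rows_not_summing_to_100` are assumptions introduced here, and the numbers are copied from Table III as printed. The sum check assumes SR and IR are meant to be complementary percentages, which this section does not restate, so it should be read as a sanity check rather than a correction.

```python
# Table III values as printed: (SR %, IR %) for each task, model, and town.
TABLE_III = {
    "LF": {
        "InDRiVE":   {"Town01": (96.08, 3.92),  "Town02": (100.00, 0.00)},
        "DreamerV3": {"Town01": (64.52, 35.48), "Town02": (64.06, 35.94)},
        "DreamerV2": {"Town01": (28.09, 71.91), "Town02": (29.07, 70.93)},
    },
    "CA": {
        "InDRiVE":   {"Town01": (66.10, 33.90), "Town02": (83.05, 16.95)},
        "DreamerV3": {"Town01": (73.68, 24.56), "Town02": (98.00, 2.00)},
        "DreamerV2": {"Town01": (39.24, 60.76), "Town02": (33.33, 66.67)},
    },
    "LF + CA": {
        "InDRiVE":   {"Town01": (83.02, 16.98), "Town02": (100.00, 0.00)},
        "DreamerV3": {"Town01": (73.21, 26.79), "Town02": (98.00, 2.00)},
        "DreamerV2": {"Town01": (30.95, 69.05), "Town02": (28.28, 71.72)},
    },
}


def sr_margin(task, town, baseline):
    """SR of InDRiVE minus SR of `baseline`, in percentage points."""
    ours = TABLE_III[task]["InDRiVE"][town][0]
    theirs = TABLE_III[task][baseline][town][0]
    return round(ours - theirs, 2)


def rows_not_summing_to_100(tol=0.01):
    """List (task, model, town) entries where SR + IR differs from 100."""
    return [
        (task, model, town)
        for task, models in TABLE_III.items()
        for model, towns in models.items()
        for town, (sr, ir) in towns.items()
        if abs(sr + ir - 100.0) > tol
    ]


if __name__ == "__main__":
    for task in TABLE_III:
        for town in ("Town01", "Town02"):
            for baseline in ("DreamerV3", "DreamerV2"):
                margin = sr_margin(task, town, baseline)
                print(f"{task:8s} {town} vs {baseline}: {margin:+.2f} pp")
    print("SR + IR != 100:", rows_not_summing_to_100())
```

With these values, the margin over DreamerV2 is positive for every task and town, while the margin over DreamerV3 is negative for collision avoidance alone. The sum check also reports the DreamerV3 collision-avoidance entry for Town01 (73.68 and 24.56), which may be worth verifying against the original results.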
REFERENCES
[1] A. Aubret, L. Matignon, and S. Hassas, “An information-theoretic
perspective on intrinsic motivation in reinforcement learning: a
survey,” Entropy , vol. 25, no. 2, p. 327, Feb. 2023, arXiv:2209.08890
[cs]. [Online]. Available: http://arxiv.org/abs/2209.08890
[2] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-
driven Exploration by Self-supervised Prediction,” May 2017,
arXiv:1705.05363 [cs]. [Online]. Available: http://arxiv.org/abs/1705.
05363
[3] D. Pathak, D. Gandhi, and A. Gupta, “Self-Supervised Exploration
via Disagreement,” Jun. 2019, arXiv:1906.04161 [cs]. [Online].
Available: http://arxiv.org/abs/1906.04161
[4] Y . Burda, H. Edwards, A. Storkey, and O. Klimov, “Exploration by
Random Network Distillation,” Oct. 2018, arXiv:1810.12894 [cs].
[Online]. Available: http://arxiv.org/abs/1810.12894
[5] R. Sekar, O. Rybkin, K. Daniilidis, P. Abbeel, D. Hafner,
and D. Pathak, “Planning to Explore via Self-Supervised World
Models,” Jun. 2020, arXiv:2005.05960 [cs]. [Online]. Available:
http://arxiv.org/abs/2005.05960
[6] D. Ha and J. Schmidhuber, “Recurrent World Models Facilitate Policy
Evolution,” in Advances in Neural Information Processing Systems ,
vol. 31. Curran Associates, Inc., 2018.
[7] Y . Gao, Q. Zhang, D.-W. Ding, and D. Zhao, “Dream to Drive
With Predictive Individual World Model,” IEEE Transactions on
Intelligent Vehicles , pp. 1–16, 2024, conference Name: IEEE
Transactions on Intelligent Vehicles. [Online]. Available: https:
//ieeexplore.ieee.org/document/10547289
[8] A. Hu, G. Corrado, N. Griffiths, Z. Murez, C. Gurau, H. Yeo,
A. Kendall, R. Cipolla, and J. Shotton, “Model-Based Imitation Learn-
ing for Urban Driving,” Advances in Neural Information Processing
Systems , vol. 35, pp. 20 703–20 716, Dec. 2022.
[9] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V . Koltun,
“CARLA: An open urban driving simulator,” in Proceedings of the
1st Annual Conference on Robot Learning , 2017, pp. 1–16.
[10] B. R. Kiran, I. Sobh, V . Talpaert, P. Mannion, A. A. A. Sallab, S. Yo-
gamani, and P. P ´erez, “Deep Reinforcement Learning for Autonomous
Driving: A Survey,” IEEE Transactions on Intelligent Transportation
Systems , vol. 23, no. 6, pp. 4909–4926, Jun. 2022, conference Name:
IEEE Transactions on Intelligent Transportation Systems.[11] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to
Control: Learning Behaviors by Latent Imagination,” Mar. 2020,
arXiv:1912.01603 [cs]. [Online]. Available: http://arxiv.org/abs/1912.
01603
[12] D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba, “Mastering Atari with
Discrete World Models,” Feb. 2022, arXiv:2010.02193 [cs]. [Online].
Available: http://arxiv.org/abs/2010.02193
[13] F. K. Khanzada, B. Kwon, W. Jeong, Y . S. Cho, and J. Kwon,
“Analytical study on region of interest and dataset size of vision-based
end-to-end lateral control for off-road autonomy,” in ICRA 2024
Workshop on Resilient Off-road Autonomy , 2024. [Online]. Available:
https://openreview.net/forum?id=KaZ40iwHg7
[14] Y . Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and
A. A. Efros, “Large-Scale Study of Curiosity-Driven Learning,”
Aug. 2018, arXiv:1808.04355 [cs]. [Online]. Available: http:
//arxiv.org/abs/1808.04355
[15] J.-A. Meyer and S. W. Wilson, “A Possibility for Implementing
Curiosity and Boredom in Model-Building Neural Controllers,” in
From Animals to Animats: Proceedings of the First International
Conference on Simulation of Adaptive Behavior . MIT Press, 1991, pp.
222–227, conference Name: From Animals to Animats: Proceedings of
the First International Conference on Simulation of Adaptive Behavior.
[Online]. Available: https://ieeexplore.ieee.org/document/6294131
[16] B. C. Stadie, S. Levine, and P. Abbeel, “Incentivizing Exploration In
Reinforcement Learning With Deep Predictive Models,” Nov. 2015,
arXiv:1507.00814 [cs]. [Online]. Available: http://arxiv.org/abs/1507.
00814
[17] P.-Y . Oudeyer, F. Kaplan, and V . V . Hafner, “Intrinsic Motivation
Systems for Autonomous Mental Development,” IEEE Transactions
on Evolutionary Computation , vol. 11, no. 2, pp. 265–286, Apr. 2007,
conference Name: IEEE Transactions on Evolutionary Computation.
[Online]. Available: https://ieeexplore.ieee.org/document/4141061
[18] R. Raileanu and T. Rockt ¨aschel, “RIDE: Rewarding Impact-Driven
Exploration for Procedurally-Generated Environments,” Feb. 2020,
arXiv:2002.12292 [cs]. [Online]. Available: http://arxiv.org/abs/2002.
12292
[19] “Imagine-2-Drive: High-Fidelity World Modeling in CARLA for
Autonomous Vehicles.” [Online]. Available: https://arxiv.org/html/
2411.10171
[20] M. Toromanoff, E. Wirbel, and F. Moutarde, “End-to-End Model-
Free Reinforcement Learning for Urban Driving Using Implicit Af-
fordances,” in 2020 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR) . Seattle, WA, USA: IEEE, Jun. 2020,
pp. 7151–7160.
[21] F. Codevilla, E. Santana, A. M. L ´opez, and A. Gaidon, “Exploring
the Limitations of Behavior Cloning for Autonomous Driving,” Apr.
2019, arXiv:1904.08980 [cs]. [Online]. Available: http://arxiv.org/abs/
1904.08980
[22] Y . Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and
A. A. Efros, “Large-Scale Study of Curiosity-Driven Learning,”
Aug. 2018, arXiv:1808.04355 [cs]. [Online]. Available: http:
//arxiv.org/abs/1808.04355
[23] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap, “Mastering Diverse
Domains through World Models,” Apr. 2024, arXiv:2301.04104 [cs].
[Online]. Available: http://arxiv.org/abs/2301.04104
Page 8:
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.[24] Q. Li, X. Jia, S. Wang, and J. Yan, “Think2Drive: Efficient
Reinforcement Learning by Thinking in Latent World Model for
Quasi-Realistic Autonomous Driving (in CARLA-v2),” Jul. 2024,
arXiv:2402.16720 [cs]. [Online]. Available: http://arxiv.org/abs/2402.
16720