Paper 2503.05573

InDRiVE: Intrinsic Disagreement based Reinforcement for Vehicle Exploration through Curiosity Driven Generalized World Model

Authors: Feeza Khan Khanzada, Jaerock Kwon

Published: 2025-03-07

Abstract:

Model-based Reinforcement Learning (MBRL) has emerged as a promising paradigm for autonomous driving, where data efficiency and robustness are critical. Yet, existing solutions often rely on carefully crafted, task specific extrinsic rewards, limiting generalization to new tasks or environments. In this paper, we propose InDRiVE (Intrinsic Disagreement based Reinforcement for Vehicle Exploration), a method that leverages purely intrinsic, disagreement based rewards within a Dreamer based MBRL framework. By training an ensemble of world models, the agent actively explores high uncertainty regions of environments without any task specific feedback. This approach yields a task agnostic latent representation, allowing for rapid zero shot or few shot fine tuning on downstream driving tasks such as lane following and collision avoidance. Experimental results in both seen and unseen environments demonstrate that InDRiVE achieves higher success rates and fewer infractions compared to DreamerV2 and DreamerV3 baselines despite using significantly fewer training steps. Our findings highlight the effectiveness of purely intrinsic exploration for learning robust vehicle control behaviors, paving the way for more scalable and adaptable autonomous driving systems.

Paper Content:
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

InDRiVE: Intrinsic Disagreement-based Reinforcement for Vehicle Exploration through Curiosity-Driven Generalized World Model

Feeza Khan Khanzada¹ and Jaerock Kwon²

Abstract: Model-based Reinforcement Learning (MBRL) has emerged as a promising paradigm for autonomous driving, where data efficiency and robustness are critical. Yet, existing solutions often rely on carefully crafted, task-specific extrinsic rewards, limiting generalization to new tasks or environments. In this paper, we propose InDRiVE (Intrinsic Disagreement-based Reinforcement for Vehicle Exploration), a method that leverages purely intrinsic, disagreement-based rewards within a Dreamer-based MBRL framework. By training an ensemble of world models, the agent actively explores high-uncertainty regions of environments without any task-specific feedback. This approach yields a task-agnostic latent representation, allowing for rapid zero-shot or few-shot fine-tuning on downstream driving tasks such as lane following and collision avoidance. Experimental results in both seen and unseen environments demonstrate that InDRiVE achieves higher success rates and fewer infractions compared to DreamerV2 and DreamerV3 baselines, despite using significantly fewer training steps. Our findings highlight the effectiveness of purely intrinsic exploration for learning robust vehicle control behaviors, paving the way for more scalable and adaptable autonomous driving systems.

I. INTRODUCTION

Model-Based Reinforcement Learning (MBRL) has been making significant strides in the robotics domain, offering a compelling alternative to model-free reinforcement learning by focusing on building an internal model of the environment before direct interaction.
This approach has shown tremendous potential in reducing training time, creating more generalized models, and mitigating uncertainties. However, the reward-centric nature of traditional reinforcement learning algorithms poses a fundamental challenge to achieving generalization, particularly in scenarios with sparse rewards. Both model-free and MBRL algorithms often rely heavily on task-specific rewards, which limits their adaptability and efficiency in novel environments.

Inspired by neuroscience, curiosity-based learning offers a promising avenue to overcome these limitations [1]. In humans, curiosity drives learning through exploration and the accumulation of experiences, often independent of immediate external rewards [2]. By seeking novel states in the environment, agents can enhance their exploration capabilities using prediction error from an inverse dynamic model as a measure of novelty. Subsequent studies have refined and expanded on this idea, demonstrating its efficacy in training RL models for better generalization and uncertainty quantification [3] [4]. Applications of intrinsic motivation have also been explored in domains such as autonomous vehicles, particularly for handling sparse rewards and improving exploration, as highlighted in the related work section.

Despite the progress, there is no comprehensive study that leverages intrinsic motivation to train an MBRL agent for generalization across task-agnostic extrinsic reward functions. Specifically, the ability to train a single agent that can adapt to diverse tasks like lane following and collision avoidance in a zero-shot or few-shot learning setting is largely unexplored.

*This work was supported in part by the National Science Foundation (NSF) under Grant MRI 2214830.
¹Feeza Khan Khanzada and ²Jaerock Kwon are with the Department of Electrical and Computer Engineering, University of Michigan-Dearborn, 4901 Evergreen Rd, Dearborn, MI 48128, United States. {feezakk, jrkwon}@umich.edu

From the prior work surveyed, three major research gaps stand out:
• Although MBRL methods excel in sample efficiency and have been validated on various tasks, they frequently rely on domain- or task-specific reward structures and lack evidence of extensive multi-task generalization.
• Intrinsic motivation has been shown to improve exploration, but most applications still augment rather than replace extrinsic rewards. There is a lack of studies examining a complete reliance on intrinsic rewards to build a highly adaptable world model.
• While some efforts address specific driving tasks, a single agent that can adapt to downstream tasks like lane following and collision avoidance in zero- or few-shot settings remains underexplored.

Addressing these gaps, we introduce InDRiVE (Intrinsic Disagreement-based Reinforcement for Vehicle Exploration), which leverages a Dreamer-based MBRL agent. InDRiVE relies solely on ensemble model disagreement for intrinsic motivation, enabling the agent to learn a robust, task-agnostic latent world model. Our objective is to facilitate zero-shot or few-shot fine-tuning across diverse driving tasks, thus minimizing training time and reducing reliance on manual reward engineering for real-world deployment.

The contributions of this research are as follows:
• To the best of our knowledge, InDRiVE is the first study to train an ego-vehicle exclusively with intrinsic rewards, leveraging latent disagreement among an ensemble of world models (based on [5]). This eliminates the need for hand-crafted task rewards, relying solely on uncertainty-based signals to build a robust, task-agnostic representation of the environment.
• The resulting world model supports zero-shot and few-shot adaptation to real driving tasks (e.g., lane-following, collision avoidance), drastically reducing domain-specific reward engineering.
• Through purely intrinsic exploration, our approach yields a versatile world model capable of zero-shot and few-shot transfer to downstream driving tasks such as lane-following and collision avoidance. This demonstrates that the learned model is not only comprehensive but also quickly adaptable to practical driving objectives, significantly reducing the need for domain-specific reward engineering.
• Our findings confirm that fully intrinsic reward mechanisms are both viable and beneficial for high-dimensional, safety-critical domains like autonomous driving, paving the way for broader self-supervised MBRL solutions.

By capitalizing on intrinsic model disagreement signals, InDRiVE achieves robust exploration, rapid adaptation to new tasks, and a streamlined reward design pipeline, pointing toward more scalable, self-supervised solutions for future autonomous vehicles.

arXiv:2503.05573v1 [cs.RO] 7 Mar 2025

II. RELATED WORK

MBRL has transitioned from a theoretical construct to a practical solution for autonomous vehicle (AV) control, driven by advances in model fidelity, planning algorithms, and deep neural networks [6]. Unlike model-free methods, which rely primarily on trial-and-error, MBRL incorporates a learned world model of the environment to enable look-ahead planning and improve data efficiency. In the context of autonomous driving, these learned models can anticipate future states and rewards, allowing for safer decision-making and reduced real-world experimentation [7] [8]. Recent work in simulation platforms such as CARLA [9] has demonstrated that world-model-based planners can imagine a diverse range of upcoming scenarios before executing actions, thus mitigating safety risks and addressing data scarcity by synthesizing additional training samples [10].
Continued innovations like latent state abstraction, uncertainty-aware modeling, and online adaptation further reduce the gap between purely simulated training and real-world deployment [11] [12]. Moreover, while most prior efforts focus on on-road driving scenarios, a recent analytical study on off-road autonomy found that selecting the right image region-of-interest and using a larger training dataset significantly improves the performance of vision-based end-to-end lateral control [13]. Such findings highlight the importance of data representation and collection strategies, which could similarly benefit model-based methods by ensuring that learned representations capture critical environmental cues across diverse driving conditions.

Intrinsic Motivation (IM) and curiosity-driven exploration have emerged as essential mechanisms for guiding agents in sparse-reward or high-dimensional environments, where extrinsic feedback is rare or too costly to define [2] [14]. IM provides agents with self-generated reward signals that encourage exploration, often by rewarding novelty, uncertainty, or prediction error [15] [16]. Notable curiosity-based approaches include the Intrinsic Curiosity Module (ICM) [2] and Random Network Distillation (RND) [4], both of which incentivize agents to visit unfamiliar or surprising states. Such methods have been successfully applied to robotic systems and video game domains, enabling agents to learn skills in the absence of dense external rewards [14] [17]. However, purely intrinsic exploration can lead agents to fixate on irrelevant or noisy events, spurring interest in techniques that combine curiosity with additional constraints or memory mechanisms to ensure meaningful, goal-relevant exploration [18].

In the realm of autonomous driving, prior research has mostly leveraged intrinsic motivation as a complementary signal rather than a primary training objective [19] [7] [8].
Typical reinforcement learning frameworks for driving rely on task-specific reward functions (e.g., measuring route progress, penalizing collisions, or encouraging lane-keeping) [20], often augmented with a small curiosity bonus to expedite convergence. While this hybrid approach can alleviate some exploration hurdles, it still anchors the learned policy to a particular extrinsic objective, reducing its flexibility to generalize across tasks or conditions. Additionally, exploration in autonomous driving requires careful consideration of safety and real-world feasibility; purely random or naive exploration is not viable in practice, further complicating the application of intrinsic rewards [21].

Crucially, existing literature lacks studies that train a full end-to-end driving policy exclusively via intrinsic rewards. While purely curiosity-driven methods have been demonstrated in simpler continuous-control scenarios [5] [22], no prior work has shown that an autonomous vehicle agent can acquire complex driving behaviors (such as collision avoidance or lane-following) without relying on explicit, task-specific feedback. This gap is particularly significant given the potential advantages of a fully task-agnostic paradigm in which the agent discovers relevant driving skills independently and subsequently fine-tunes to specific tasks with minimal overhead. Our work aims to bridge this gap by integrating an ensemble-based model disagreement signal, inspired by [5], into a Dreamer-based agent [11], allowing the vehicle to learn a robust world model in CARLA solely through intrinsic exploration signals. Ultimately, this approach seeks to demonstrate that internal disagreement metrics can serve as a standalone training driver, paving the way for efficient, flexible, and generalized autonomous driving policies.

III. METHODOLOGY

In this section, we detail our InDRiVE approach, which extends DreamerV3 with an ensemble-based intrinsic exploration mechanism inspired by [5]. The goal is to train a robust, task-agnostic world model via curiosity-driven exploration, then fine-tune the learned policy with minimal additional effort for specific driving tasks in CARLA. Fig. 1 presents a high-level overview of InDRiVE along with the latent disagreement mechanism.

[Figure 1. Panel (a): Overview of the InDRiVE Actor Critic Policy. Panel (b): Latent Disagreement (LD) Reward.]
Fig. 1: Overview of InDRiVE. (a) An actor-critic policy architecture incorporating latent disagreement for exploration; LD denotes the Latent Disagreement module shown in (b). Raw images are encoded into a stochastic latent $s_t$, which is combined with the deterministic hidden state $h_t$ to maintain temporal context. The actor-critic policy then outputs an action $a_t$ based on $[s_t, h_t]$. (b) An ensemble of forward models predicts potential next states $\hat{s}^k_{t+1}$ for the same $(s_t, a_t)$. The variance among these predictions yields a latent-disagreement (intrinsic) reward, which encourages the policy to explore.

A. Intrinsic Motivation and World Model

InDRiVE is an MBRL framework designed for autonomous driving. It adopts the DreamerV3 architecture [23] for its latent world model and planning capabilities while leveraging ensemble disagreement to generate purely intrinsic rewards during an initial exploration phase.
This approach is motivated by [5], which demonstrated that self-supervised exploration improves sample efficiency and task generalization. In InDRiVE, we first train the agent solely with intrinsic rewards (no task-specific feedback), yielding a broad coverage of driving scenarios and a capable latent world model. Subsequently, we introduce extrinsic rewards to fine-tune the policy for tasks such as lane following or collision avoidance.

We formulate autonomous driving as a Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma)$. States $s \in \mathcal{S}$ encapsulate sensor observations, while actions $a \in \mathcal{A}$ correspond to vehicular control inputs (steering, throttle, braking). The transition model $p(s_{t+1} \mid s_t, a_t)$ governs environment dynamics, and $r(s_t, a_t)$ provides task-dependent feedback. In our setting, intrinsic exploration replaces task-specific rewards during the initial training phase:

$$r^{\mathrm{int}}_t = \text{disagreement-based curiosity signal},$$

whereas the fine-tuning phase introduces extrinsic signals:

$$r^{\mathrm{ext}}_t = \text{task-specific reward signal}.$$

We can also combine the extrinsic and intrinsic rewards, where the intrinsic reward augments reward-based training:

$$r_t = \alpha\, r^{\mathrm{ext}}_t + (1 - \alpha)\, r^{\mathrm{int}}_t, \tag{1}$$

with $\alpha \in [0, 1]$ controlling the weighting between extrinsic and intrinsic rewards.

We adopt a Recurrent State-Space Model (RSSM) to learn a compact representation of high-dimensional sensory inputs (e.g., images) and predict future observations and rewards. Following the Dreamer framework [23], the RSSM consists of four main components:
• Encoder $q_\phi(z_t \mid s_t)$: Converts raw observations $s_t$ into a stochastic latent state $z_t$.
• Recurrent Core (GRU): Maintains a hidden state $h_t$, summarizing past latent states and actions.
• Transition Model $p_\phi(z_{t+1} \mid z_t, a_t, h_t)$: Predicts the next latent state $z_{t+1}$ given the current latent state, action $a_t$, and hidden state $h_t$.
• Decoder $p_\phi(s_t \mid z_t)$: Reconstructs or imagines the original observation $s_t$ from the latent state $z_t$.
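
To make the weighting in Eq. (1) concrete, the following minimal Python sketch blends the two reward signals. The function name `mixed_reward` and its argument names are illustrative and do not come from the paper; only the formula itself does.

```python
def mixed_reward(r_ext: float, r_int: float, alpha: float) -> float:
    """Blend extrinsic and intrinsic rewards as in Eq. (1).

    alpha weights the extrinsic term and (1 - alpha) the intrinsic term.
    The name and signature are hypothetical, chosen for illustration.
    """
    if not (1.0 >= alpha >= 0.0):
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * r_ext + (1.0 - alpha) * r_int


# alpha = 0 keeps only the intrinsic signal; alpha = 1 keeps only the extrinsic one.
exploration_reward = mixed_reward(r_ext=1.0, r_int=0.2, alpha=0.0)
fine_tuning_reward = mixed_reward(r_ext=1.0, r_int=0.2, alpha=1.0)
```

Under this formula, $\alpha = 0$ corresponds to the purely intrinsic reward of the exploration phase, and $\alpha = 1$ to the purely extrinsic reward that Algorithm 1 uses during fine-tuning ($r_t = r^{\mathrm{ext}}_t$).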
Additionally, we include a reward predictor $p_\phi(r_t \mid z_t, h_t)$ to model the immediate reward, and a discount (or continuation) predictor $p_\phi(\gamma_t \mid z_t, h_t)$ to handle episode termination. At each time step, we thus have:

$$h_t = \mathrm{GRU}(h_{t-1}, z_{t-1}, a_{t-1}), \tag{2}$$

$$z_t \sim q_\phi(z_t \mid s_t, h_t), \tag{3}$$

with the transition prior

$$p_\phi(z_{t+1} \mid z_t, a_t, h_{t+1}). \tag{4}$$

We jointly optimize the encoder, decoder, transition, reward, and discount networks. The training loss, inspired by the variational Evidence Lower Bound (ELBO), can be expressed as:

$$\mathcal{L}_{\mathrm{model}}(\phi) = \mathbb{E}_{q_\phi}\Big[-\ln p_\phi(s_t \mid z_t) - \ln p_\phi(r_t \mid z_t, h_t)\Big] + \beta\, \mathbb{E}_{q_\phi}\Big[D_{\mathrm{KL}}\big(q_\phi(z_t \mid s_t, h_t) \,\|\, p_\phi(z_t \mid h_t)\big)\Big] + \lambda_\gamma\, \mathbb{E}_{q_\phi}\Big[-\ln p_\phi(\gamma_t \mid z_t, h_t)\Big], \tag{5}$$

where:
• $-\ln p_\phi(s_t \mid z_t)$ is the reconstruction loss, penalizing the model for poor observation predictions.
• $-\ln p_\phi(r_t \mid z_t, h_t)$ is the reward prediction loss.
• $D_{\mathrm{KL}}\big(q_\phi(z_t \mid s_t, h_t) \,\|\, p_\phi(z_t \mid h_t)\big)$ is the KL divergence between the encoder posterior and the transition prior, encouraging compact and consistent latent states.
• $\beta$ scales or clips the KL term (the free-bits heuristic [12]) so that the model retains sufficient representational capacity without collapsing.
• $-\ln p_\phi(\gamma_t \mid z_t, h_t)$ is an optional discount/continuation loss (weighted by $\lambda_\gamma$) that helps the model account for terminal states.

We sample short latent-rollout sequences from a replay buffer of past trajectories, optimize (5) in mini-batches, and update the model parameters $\phi$ via stochastic gradient descent. Once trained, the RSSM provides a forward model for imagined rollouts: starting from a real or latent-encoded state, the model predicts future states, rewards, and discounts, thereby enabling policy learning and planning entirely within the compact latent space.

B. Ensemble Disagreement for Intrinsic Exploration

We incorporate ensemble disagreement to drive curiosity, building on the self-supervised exploration scheme introduced by [3].
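
As a rough illustration of the disagreement signal this subsection describes, the sketch below scores one state-action pair by the variance across ensemble predictions of the next latent state. It is a pure-Python sketch under stated assumptions: the function name `latent_disagreement` is hypothetical, each prediction is a plain list of floats, and the per-dimension variances are averaged into one scalar (the paper specifies a variance but not how it is reduced over latent dimensions).

```python
from statistics import fmean, pvariance


def latent_disagreement(predictions: list[list[float]]) -> float:
    """Variance-based disagreement across K ensemble predictions.

    predictions[k] is the k-th model's predicted next latent vector for the
    same (state, action) pair. Averaging the per-dimension variances is an
    assumption made for this sketch.
    """
    if 2 > len(predictions):
        raise ValueError("need at least two ensemble members")
    dims = len(predictions[0])
    if any(len(p) != dims for p in predictions):
        raise ValueError("all predictions must have the same dimension")
    # Population variance of each latent dimension across ensemble members.
    per_dim = [pvariance([p[d] for p in predictions]) for d in range(dims)]
    return fmean(per_dim)


# Members that agree yield zero reward; members that differ yield a positive one.
agree = latent_disagreement([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
differ = latent_disagreement([[0.0, 0.0], [2.0, 0.0]])
```

A scalar of zero means the ensemble members agree, so the state offers no curiosity reward; larger values mark state-action pairs where the learned dynamics are still uncertain.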
Specifically, we train $K$ lightweight forward dynamics models, each predicting the next latent state $s_{t+1}$ given $(s_t, a_t)$. Let $\mu_k(s_t, a_t)$ denote the prediction of the $k$-th model. The intrinsic reward $r^{\mathrm{int}}_t$ is computed as the variance of these predictions:

$$r^{\mathrm{int}}_t = \mathrm{Var}\big(\mu_1(s_t, a_t), \mu_2(s_t, a_t), \ldots, \mu_K(s_t, a_t)\big). \tag{6}$$

High disagreement indicates unexplored or uncertain regions, incentivizing the policy to gather data where the world model is less confident. As training progresses, this promotes coverage of diverse states and reduces model uncertainty in safety-critical scenarios.

C. Steering Loss Function

To encourage smooth driving behavior, we introduce a steering loss function inspired by [24], adapted to penalize excessively large steering angles. Let $a^{(\mathrm{steer})}_t$ denote the steering command at time $t$, measured in the range $[-1, 1]$ (left to right turn). We define:

$$r_{\mathrm{steer}}(a_t) = \begin{cases} -\lambda, & \text{if } \big|a^{(\mathrm{steer})}_t\big| > \delta, \\ 0, & \text{otherwise}, \end{cases} \tag{7}$$

where $\lambda > 0$ and $\delta \in (0, 1)$ is a steering-angle threshold. In practice, we set $\lambda = 0.5$ and incorporate this penalty term during training. This additional cost biases the policy to avoid extreme steering angles, thereby promoting smoother, more stable navigation without preventing necessary turns.

D. Training Procedure

Algorithm 1 summarizes the two-phase training pipeline:

Phase 1: Task-Agnostic Exploration: The agent explores the CARLA environment by maximizing the ensemble-disagreement reward, augmented with the steering penalty for stable control. This phase yields a broad coverage of driving states and a well-trained world model without relying on task-specific guidance.

Phase 2: Task-Specific Fine-Tuning: We then introduce the extrinsic driving objective (lane following and collision avoidance). The policy learns to balance this task reward with the residual intrinsic signal and steering loss. In many cases, zero-shot adaptation is possible, as the agent's learned representation already encodes crucial driving behaviors. Otherwise, a small number of additional training episodes is sufficient for few-shot adaptation, drastically reducing total sample complexity compared to purely extrinsic-driven training.

Overall, this two-phase approach demonstrates how self-supervised exploration can bootstrap a robust world model, leading to faster and more versatile task adaptation in autonomous driving.

Algorithm 1 InDRiVE Training Procedure
Require: Environment $E$ (CARLA), replay buffer $\mathcal{D}$, number of ensemble models $K$, exploration steps $N_{\mathrm{explore}}$, fine-tuning steps $N_{\mathrm{fine}}$.
Initialize parameters of DreamerV3 world model $\phi$, policy $\pi_\theta$, value network $v_\theta$, and ensemble models $\{\mu_k\}_{k=1..K}$.
$\mathcal{D} \leftarrow \{\}$ (empty replay buffer)
for step = 1 to $N_{\mathrm{explore}}$ do
    Roll out policy $\pi_\theta$ in $E$ for $T$ steps to collect $\{(o_t, a_t, o_{t+1})\}$.
    Encode $s_t \leftarrow q_\phi(s_t \mid o_t)$.
    Compute disagreement $r^{\mathrm{int}}_t$ via Eq. (6).
    $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_t, a_t, r^{\mathrm{int}}_t, s_{t+1})\}$.
    Update DreamerV3 world model and ensemble models using ELBO-based loss (Eq. (5)).
    Update $\pi_\theta$, $v_\theta$ via imagination in the latent space, optimizing intrinsic returns.
end for
{Fine-Tuning for Task-Specific Rewards}
for step = 1 to $N_{\mathrm{fine}}$ do
    Introduce extrinsic reward $r^{\mathrm{ext}}$ for downstream task (e.g., collision avoidance).
    $r_t = r^{\mathrm{ext}}_t$
    (zero-shot): no additional data collection.
    (few-shot): gather limited on-policy data to refine $\phi$, $\theta$.
end for
return Optimized policy $\pi_\theta$ and world model parameters $\phi$.

We use the CARLA simulator as our primary testbed, taking advantage of its:
• Realistic sensor data: RGB camera, LiDAR, GPS, and odometry information,
• Complex traffic scenarios: dynamic vehicles, pedestrians, traffic lights, and multi-lane roads,
• Configurable weather and lighting conditions: enabling diverse scenarios for robust exploration.

IV. EXPERIMENTAL SETUP

This section details the experimental framework for assessing our proposed approach. We begin by introducing the CARLA simulation environment and the tasks under consideration, followed by the two-phase training procedure. We then describe the baseline methods, hyperparameter configurations, and the metrics used for evaluation.

A. Environment Setup

We focus on two CARLA towns, Town01 (a small town with a river and bridges) and Town02 (a small town with a mixture of residential and commercial buildings), with moderate traffic density. At each time step, the agent receives a 128×128 semantic segmentation image, along with throttle and steering angle information. To capture temporal dependencies, we stack four consecutive semantic segmentation frames as a single observation input to the encoder. The agent outputs continuous control commands (steering, throttle, brake).

B. Tasks and Scenarios

We consider two representative driving tasks to demonstrate zero-shot and few-shot performance:
• Lane Following (LF): The agent must maintain its lane position while traveling at a safe speed.
• Collision Avoidance (CA): The agent must avoid colliding with other vehicles and obstacles in real-time traffic scenarios.

Episodes terminate upon any of the following events:
• Collision: The agent collides with another vehicle, pedestrian, or static obstacle.
• Wrong Direction: The agent drives in the opposite direction of the intended lane.
• Off-Road Driving: The agent leaves the drivable area.
• Vehicle Stall: The agent's velocity falls below a minimal threshold (e.g., 1 km/h) for an extended period (e.g., 1 minute).
• Episode Completion: The agent successfully completes the number of steps assigned for the episode without any lane violation or collision.

C. Two-Phase Training Procedure

The learning process is divided into two main phases: (i) intrinsic exploration for building a general-purpose world model, and (ii) task-specific fine-tuning that leverages this model for downstream tasks.

1) Task-Agnostic Exploration: During Phase 1, we train a task-agnostic InDRiVE agent solely using an intrinsic reward derived from ensemble disagreement. Specifically, we randomize key environment parameters (e.g., weather, traffic density) every 10,000 steps to ensure diverse experiences, then roll out the current policy for 1,000 steps and store all transitions in a replay buffer. Afterward, we update the ensemble of forward dynamics models, along with the encoder-decoder modules and policy/value networks, in latent space using the intrinsic reward $r^{\mathrm{int}}_t$. This cycle of randomization, data collection, and model updating is repeated until a predetermined maximum of environment interactions (e.g., 50K steps) is reached.

2) Task-Specific Fine-Tuning: Following intrinsic exploration, we evaluate and refine the agent's performance on downstream tasks (e.g., lane following or collision avoidance) through both zero-shot and few-shot evaluations. For zero-shot evaluation, we freeze the world model parameters (encoder, decoder, and ensemble) and directly test the policy without further training, recording success rates and infractions to gauge initial performance. For few-shot evaluation, we introduce a task-specific extrinsic reward $r^{\mathrm{ext}}_t$, collect a small batch of new data, and update the policy and value networks by applying extrinsic rewards. We then measure the resultant performance gains to assess how effectively the agent adapts to the target task.

D. Baseline Methods

For comparative evaluation, we focus on DreamerV2 (Task-Specific) and DreamerV3 (Task-Specific), both of which train a world model and policy from scratch using only task-specific rewards (e.g., for lane following or collision avoidance), without incorporating any intrinsic rewards. These baselines thus provide a performance and sample-efficiency benchmark for traditional, task-centric learning approaches, enabling a clear assessment of the benefits gained by integrating intrinsic exploration in our method. Table I summarizes the key hyperparameters used during the intrinsic exploration phase and the few-shot fine-tuning phase.

TABLE I: Key Hyperparameters for InDRiVE and Fine-Tuning

Hyperparameter                                  Intrinsic Exploration   Fine-Tuning
Learning rate (world model)                     $1 \times 10^{-4}$      $1 \times 10^{-4}$
Learning rate (policy/value)                    $1 \times 10^{-4}$      $1 \times 10^{-4}$
Ensemble size ($K$)                             8                       8 (frozen)
Batch size                                      64                      64
Replay buffer size                              $10^5$                  $10^5$
Discount factor ($\gamma$)                      0.99                    0.99
Intrinsic reward weight ($1 - \alpha$, Eq. (1)) 1.0                     0.0
Training steps                                  50K env steps           10K env steps

E. Evaluation Metrics

We benchmark InDRiVE on multiple scenarios in the CARLA simulator. Key metrics include:
• Success Rate (SR): Rate of episodes completed without any lane violation or collision.
• Infraction Rate (IR): Rate of rule violations (collisions, lane departures) per episode.
• Zero-Shot/Few-Shot Adaptation: Evaluates how well the agent performs the task with no (zero-shot) and minimal (few-shot) additional interactions, highlighting the benefit of curiosity-driven exploration.

By measuring performance across different tasks, towns, and training regimes, we obtain a comprehensive view of zero-shot and few-shot generalization in complex urban driving scenarios.
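
The two rate metrics above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the `Episode` record and its field names are hypothetical, and the sketch assumes that wrong-direction and off-road events are both logged as a lane departure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """Hypothetical per-episode log used only for this sketch."""
    completed: bool   # ran for the assigned number of steps
    collided: bool    # hit a vehicle, pedestrian, or static obstacle
    left_lane: bool   # wrong-direction or off-road event (assumed grouping)


def success_rate(episodes: list[Episode]) -> float:
    """Percentage of episodes completed with no collision or lane departure."""
    if not episodes:
        raise ValueError("no episodes to evaluate")
    ok = sum(1 for e in episodes
             if e.completed and not (e.collided or e.left_lane))
    return 100.0 * ok / len(episodes)


def infraction_rate(episodes: list[Episode]) -> float:
    """Percentage of episodes with at least one collision or lane departure."""
    if not episodes:
        raise ValueError("no episodes to evaluate")
    bad = sum(1 for e in episodes if e.collided or e.left_lane)
    return 100.0 * bad / len(episodes)


# Three clean episodes and one collision.
log = [Episode(True, False, False)] * 3 + [Episode(False, True, False)]
sr, ir = success_rate(log), infraction_rate(log)
```

With these definitions, SR and IR sum to 100% only when every unsuccessful episode ends in an infraction; an episode that ends in a vehicle stall would count toward neither.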

[Figure 2. Three panels: (a) Lane Following, (b) Collision Avoidance, (c) Lane Following + Collision Avoidance. x-axis: Environment Steps (480,000 to 510,000); y-axis: Reward Rate (0.0 to 1.0); curves: InDRiVE, DreamerV3, DreamerV2.]
Fig. 2: Average reward rates of InDRiVE (red), DreamerV3 (blue), and DreamerV2 (green) across three CARLA driving tasks. The gray area after 500K steps indicates the start of InDRiVE's fine-tuning phase (few-shot learning). Despite being trained on the extrinsic reward for fewer steps (10K), InDRiVE (red) converges to near-optimal performance in all three tasks, surpassing both Dreamer baselines, and demonstrates superior sample efficiency and training stability overall.

V. RESULTS

A. Zero-Shot Evaluation

Table II reports the zero-shot evaluation of InDRiVE, that is, a DreamerV3 agent trained in Town01 using only a latent disagreement-based intrinsic reward signal, when tested in both Town01 and Town02. The model is trained for 500K exploration steps, and performance is measured over 50K evaluation steps in each town. Overall, these results indicate that using InDRiVE in the training phase can yield an agent capable of generalizing from one town to another. While the transfer performance remains slightly lower than the results observed in the training environment, the similarity in success and infraction rates suggests that the agent's learned exploration strategy maintains a degree of robustness across different environments.

TABLE II: Zero-Shot Learning Evaluation of InDRiVE on Town01 & 02

Train Steps   Eval Steps   Town01 (seen) SR (%) ↑   Town01 (seen) IR (%) ↓   Town02 (unseen) SR (%) ↑   Town02 (unseen) IR (%) ↓
500K          50K          64.52                    35.48                    64.06                      35.94

B. Few-Shot Evaluation

Table III compares three models (InDRiVE (ours), DreamerV3, and DreamerV2) across three driving tasks (Lane Following, Collision Avoidance, and Lane Following + Collision Avoidance) in two CARLA towns, Town01 (seen during training) and Town02 (unseen). The table reports two primary metrics: Success Rate (SR), the percentage of episodes completed without collisions or lane departures, and Infraction Rate (IR), the percentage of episodes in which a collision or off-lane event occurred. Each model is described by the number of training steps (Train) and the number of evaluation steps (Eval).

The results highlight several points. First, InDRiVE achieves higher SR and lower IR than both baselines on Lane Following and on Lane Following + Collision Avoidance in both towns, while requiring notably fewer training steps (10K) compared to DreamerV2 or DreamerV3 (510K). On Collision Avoidance alone, Table III shows DreamerV3 with the higher SR (73.68% vs. 66.10% in Town01, and 98.00% vs. 83.05% in Town02). In Town01, InDRiVE's SR ranges from 66% to 96% across tasks, while in Town02 the performance remains high (83% to 100%), indicating strong zero-shot generalization. By contrast, DreamerV2 shows lower SR, particularly in Lane Following tasks, where it struggles to stay within lanes in both towns. DreamerV3 performs moderately well in Town01, and its zero-shot performance in Town02 is also decent, but InDRiVE still surpasses it in success rate and infraction reduction on Lane Following and on the combined task. Overall, these findings suggest that incorporating intrinsic disagreement-based exploration (InDRiVE) yields more efficient learning and robust navigation behaviors compared to the Dreamer baselines.

Fig. 2 shows three plots comparing the average reward rates of InDRiVE (red), DreamerV3 (blue), and DreamerV2 (green) over environment steps in CARLA for three tasks: Lane Following (left), Collision Avoidance (middle), and Lane Following + Collision Avoidance (right). The x-axis represents the number of environment steps, while the y-axis denotes the reward rate.
Notably, InDRiVE rapidly converges to high reward across all three tasks, whereas the Dreamer baselines require more steps and show greater fluctuation in reward.

TABLE III: Comparison of models on three tasks in both Town01 and Town02 (LF = Lane Following, CA = Collision Avoidance)

| Task | Model | Train | Eval | Town01 (seen) SR (%) ↑ | Town01 (seen) IR (%) ↓ | Town02 (unseen) SR (%) ↑ | Town02 (unseen) IR (%) ↓ |
|---|---|---|---|---|---|---|---|
| LF | InDRiVE (ours) | 10K | 50K | 96.08 | 3.92 | 100.00 | 0.00 |
| LF | DreamerV3 | 510K | 50K | 64.52 | 35.48 | 64.06 | 35.94 |
| LF | DreamerV2 | 510K | 50K | 28.09 | 71.91 | 29.07 | 70.93 |
| CA | InDRiVE (ours) | 10K | 50K | 66.10 | 33.90 | 83.05 | 16.95 |
| CA | DreamerV3 | 510K | 50K | 73.68 | 24.56 | 98.00 | 2.00 |
| CA | DreamerV2 | 510K | 50K | 39.24 | 60.76 | 33.33 | 66.67 |
| LF + CA | InDRiVE (ours) | 10K | 50K | 83.02 | 16.98 | 100.00 | 0.00 |
| LF + CA | DreamerV3 | 510K | 50K | 73.21 | 26.79 | 98.00 | 2.00 |
| LF + CA | DreamerV2 | 510K | 50K | 30.95 | 69.05 | 28.28 | 71.72 |

VI. CONCLUSION AND FUTURE WORK

We introduced InDRiVE, a fully intrinsic MBRL framework for autonomous driving that eliminates task-specific external rewards by relying solely on ensemble disagreement signals for exploration. Experiments in CARLA show that InDRiVE achieves higher success rates and fewer infractions than DreamerV2 and DreamerV3, while using fewer training steps. Its latent representation transfers effectively to both familiar (Town01) and unfamiliar (Town02) settings, enabling zero-shot or few-shot adaptation to tasks like lane-following and collision avoidance. These findings highlight the benefits of purely intrinsic exploration in uncovering robust driving policies and underscore the potential for reducing dependence on manual reward design.
Future research directions include exploring more complex traffic scenarios, integrating richer sensor modalities, addressing sim-to-real transfer, investigating continual and multi-task learning, and evaluating alternative intrinsic reward formulations to further enhance scalability, data efficiency, and adaptability.
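For readers who want a concrete picture of the ensemble disagreement signal that InDRiVE relies on for exploration, the sketch below computes one common form of such a reward: the variance of the ensemble members' predictions, averaged over feature dimensions. This section does not give the paper's exact formulation, so the function name, the use of population variance, and the plain-list representation of predictions are all assumptions made for illustration.

```python
from statistics import pvariance
from typing import List, Sequence


def disagreement_reward(ensemble_predictions: Sequence[Sequence[float]]) -> float:
    """Mean per-dimension variance across ensemble predictions.

    ensemble_predictions holds one predicted feature vector per ensemble
    member (hypothetical layout). High variance means the members disagree,
    which is treated as a sign of an under-explored region.
    """
    if len(ensemble_predictions) < 2:
        raise ValueError("an ensemble needs at least two members")
    dims = len(ensemble_predictions[0])
    if dims == 0 or any(len(p) != dims for p in ensemble_predictions):
        raise ValueError("all predictions must share the same non-zero length")
    # Variance across members for each feature dimension (population variance assumed).
    per_dim_variance: List[float] = [
        pvariance([prediction[d] for prediction in ensemble_predictions])
        for d in range(dims)
    ]
    # Average over dimensions to obtain a scalar intrinsic reward.
    return sum(per_dim_variance) / dims


# Two members that agree on the second feature but not on the first.
reward_disagree = disagreement_reward([[0.0, 1.0], [2.0, 1.0]])
# Two members that agree everywhere yield zero intrinsic reward.
reward_agree = disagreement_reward([[0.5, 1.0], [0.5, 1.0]])
print(reward_disagree, reward_agree)
```

Under this form the reward shrinks toward zero as the ensemble members converge on familiar states, which is the property that pushes the agent toward high-uncertainty regions.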

---