loader
Generating audio...

arxiv

Paper 2503.10626

NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models

Authors: Mert Albaba, Chenhao Li, Markos Diomataris, Omid Taheri, Andreas Krause, Michael Black

Published: 2025-03-13

Abstract:

Acquiring physically plausible motor skills across diverse and unconventional morphologies-including humanoid robots, quadrupeds, and animals-is essential for advancing character simulation and robotics. Traditional methods, such as reinforcement learning (RL) are task- and body-specific, require extensive reward function engineering, and do not generalize well. Imitation learning offers an alternative but relies heavily on high-quality expert demonstrations, which are difficult to obtain for non-human morphologies. Video diffusion models, on the other hand, are capable of generating realistic videos of various morphologies, from humans to ants. Leveraging this capability, we propose a data-independent approach for skill acquisition that learns 3D motor skills from 2D-generated videos, with generalization capability to unconventional and non-human forms. Specifically, we guide the imitation learning process by leveraging vision transformers for video-based comparisons by calculating pair-wise distance between video embeddings. Along with video-encoding distance, we also use a computed similarity between segmented video frames as a guidance reward. We validate our method on locomotion tasks involving unique body configurations. In humanoid robot locomotion tasks, we demonstrate that 'No-data Imitation Learning' (NIL) outperforms baselines trained on 3D motion-capture data. Our results highlight the potential of leveraging generative video models for physically plausible skill learning with diverse morphologies, effectively replacing data collection with data generation for imitation learning.

Paper Content:
Page 1: NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models Mert Albaba1,2∗Chenhao Li1Markos Diomataris1,2 Omid Taheri2Andreas Krause1Michael Black2 1ETH Z¨ urich2Max Planck Institute for Intelligent Systems Figure 1. NIL Overview: First, from a single frame and a textual prompt, a pre-trained video diffusion model generates a reference video. Reinforcement learning policies are then trained to mimic the generated video and control various robots without using any external data. Abstract Acquiring physically plausible motor skills across diverse and unconventional morphologies—including humanoid robots, quadrupeds, and animals—is essen- tial for advancing character simulation and robotics. Traditional methods, such as reinforcement learning (RL) are task- and body-specific, require extensive reward function engineering, and do not generalize well. Imitation learning offers an alternative but relies heavily on high-quality expert demonstrations, which are difficult to obtain for non-human mor- phologies. Video diffusion models, on the other hand, are capable of generating realistic videos of various morphologies, from humans to ants. Leveraging this capability, we propose a data-independent approach for skill acquisition that learns 3D motor skills from2D-generated videos, with generalization capability to unconventional and non-human forms. Specifically, we guide the imitation learning process by leveraging vision transformers for video-based comparisons by calculating pair-wise distance between video embed- dings. Along with video-encoding distance, we also use a computed similarity between segmented video frames as a guidance reward. We validate our method on locomotion tasks involving unique body configurations. In humanoid robot locomotion tasks, we demonstrate that “No-data Imitation Learning” (NIL) outperforms baselines trained on 3D motion-capture data. Our results highlight the potential of leveraging generative video models for physically plausible skill learning with diverse morphologies, effectively replacing data collection with data generation for imitation learning. 1arXiv:2503.10626v1 [cs.CV] 13 Mar 2025 Page 2: 1. Introduction Learning motor skills for multiple and diverse agent morphologies, including robots or animals, is essential to advance robotics and character simulation. How- ever, enabling physically plausible skill acquisition across such a range of morphologies is a longstanding challenge. Reinforcement learning (RL) is the most common approach for training skills in a physically plausible manner. RL trains agents within a physical simulator so that the learned behaviors inherently respect phys- ical laws, an important property for both robotics and character simulation. However, RL requires substantial manual effort to engineer reward functions for each spe- cific task-body pair, and poorly specified rewards lead to unintended behavior [3]. Imitation learning (IL) circumvents meticulous reward engineering by learn- ing from expert demonstrations. However, IL requires high-quality 3D data, consisting of accurate joint po- sitions and velocities. Such a high-quality 3D data is difficult to obtain for each possible morphology, par- ticularly for non-humanoid robots and animals, where motion-capture data is scarce and expensive to collect. This gap motivates us to explore an alternative ap- proach that bypasses the need for high-quality demon- strations and generalizes to various morphologies. Re- cent advances in video diffusion models [9, 19, 37] of- fer a compelling alternative. These pretrained gener- ative models are capable of generating visually plausi- ble video for a wide variety of tasks and morphologies. Leveraging such models could eliminate the need for curated 3D data. But while the generated videos are visually plausible, they are not always physically plau- sible [24], which hinders their usability in skill-learning. Also, learning 3D skills from generated 2D videos is challenging due to the absence of 3D information and the lack of precise action annotations. In this work, we introduce No-data Imitation Learn- ing (NIL): a framework that bridges this gap by inte- grating off-the-shelf video diffusion models with video vision transformers, and learns purely from the gener- ated data. In other words, we propose a way to convert generated 2D videos into usable feedback for training 3D policies. Specifically, we generate reference videos using pretrained video diffusion models, and employ video vision transformers [4, 10] to calculate similarity between learned behaviors and generated videos. We embed both the generated video and the agent’s video rendered by simulation, into the latent space of a video vision transformer. By comparing these encodings di- rectly, we calculate a distance between videos and cre- ate a reward signal that encourages the agent to repli- cate key aspects of the reference motion while respect-ing 3D physics constraints in the simulator. Such video similarity does not provide enough guidance, since a more granular feedback is required in high-dimensional complex tasks. Therefore, we also employ image-based similarity. Specifically, we segment the agent’s body in both the generated video and the simulation-rendering, creating binary masks that isolate the agent from the background. We then compute similarity scores based on the Intersection over Union (IoU) metric between the segmentation masks. These similarity scores serve as rich reward signals for IL, and along with the video- similarity, encourage the agent to produce behaviors that closely match the generated reference video, en- abling it to learn skills even with unconventional body configurations. Overall, NIL combines pretrained video diffusion models, video vision transformers, and imitation learn- ing to enable agents to learn skills with unconven- tional body configurations. By directly measuring dis- crepancies between video encodings and segmentation masks, we provide a robust reward signal that guides the agent’s learning process within a physical simula- tor, without relying on any curated data. 1.Autonomous Expert Data Generation: NIL generates expert demonstrations on-the-fly using video diffusion models, conditioned on the agent’s initial state and a textual task description. This approach generalizes to any task-body pair by re- moving any dependency on collected data. 2.No-data Imitation Learning: NIL combines video vision transformers and image segmentation to create an informative reward signal from 2D videos for imitation learning. This provides a stable and effective learning guidance. We test our approach on locomotion tasks involving di- verse morphologies, including different robots (2-legged and 4-legged), for which collecting the expert data is difficult, and reward engineering is challenging. Our re- sults show that NIL reaches the performance of motion- capture data-based imitation learning without requir- ing any data. By leveraging the strengths of video diffusion models and imitation learning, our approach addresses critical bottlenecks in training agents for complex tasks, partic- ularly when expert data is scarce or unavailable. This work opens new avenues for research at the intersec- tion of generative modeling and reinforcement learn- ing, with potential applications in robotics, animation, and beyond. Our code and models will be available for research purposes. 2 Page 3: 2. Related Work 2.1. Imitation Learning Imitation learning (IL) has long been an attractive paradigm for training agents to mimic expert behav- ior. Early approaches like Behavioral Cloning [5] di- rectly map observations to actions but are prone to compounding errors due to distributional shifts. Ad- versarial frameworks address this issue. For example, Generative Adversarial Imitation Learning (GAIL) [17] leverages a discriminator to distinguish between ex- pert and agent-generated trajectories, providing a ro- bust learning signal. Subsequent methods, including InfoGAIL [23] and Adversarial Inverse Reinforcement Learning (AIRL) [15], refine these techniques by in- corporating reward inference and information-theoretic principles. Other work, such as the Variational Dis- criminator Bottleneck (VDB) [26] and Adversarial Mo- tion Priors [27], focus on stabilizing adversarial train- ing through regularization and by incorporating pre- trained motion models. Recently, Reinforced Imita- tion Learning (RILe) [2] combines inverse reinforce- ment learning with adversarial imitation learning to improve performance in complex settings. While adversarial approaches are state-of-the-art in imitation learning, they have a significant drawback: the discriminator tends to overfit quickly, leading to instability during training [2]. In contrast, our work removes this dependency by leveraging video diffusion models and direct video comparisons, providing a more stable and effective reward signal. 2.2. Video Diffusion Models If we had access to limitless, high-quality, 3D training data, imitation learning would work well. Since captur- ing such data is challenging, is it possible to generate it? Denoising Diffusion Probabilistic Models (DDPMs) [18] enable high-quality image synthesis, while video diffusion models [19] extend these ideas to the temporal domain, generating coherent video sequences. Models such as Make-A-Video [30] and Imagen [29] demon- strate cascaded diffusion processes capture complex motion patterns from large-scale datasets. Recent de- velopments, including Stable Video Diffusion [9] and I2VGen [37], improve video quality and controllability. Additionally, recent work such as DynamiCrafter [35] explores diffusion models tailored for dynamic scene synthesis. Despite their impressive visual performance, these models sometimes output 2D results that are vi- sually convincing but physically implausible [24], pos- ing a challenge to using them for imitation learning.2.3. Video Encoders To exploit generated 2D data, we need to be able to use it to provide a meaningful learning signal. Ro- bust video encoders are critical for extracting mean- ingful spatio-temporal features that bridge the gap be- tween 2D visual data and 3D behavioral understand- ing. Vision transformer architectures, such as ViViT [4] and TimeSFormer [7], show that transformer-based models effectively capture dynamic patterns in video data. Subsequent work, including VideoMAE-v2 [34] and VideoMamba [22], advance video representation learning by employing dual masking strategies and ef- ficient state-space models. Other approaches leverage masked autoencoding techniques [14] to learn robust video representations. In addition, dedicated trans- former architectures such as VidTr [38] and DistInit [16] further validate the potential of transformer-based models in video understanding. The advances in video understanding provide robust representations that are essential for comparing generated and simulated be- haviors in our framework. 2.4. Learning from Generated Data There is a growing body of work that leverages gener- ated or weakly supervised data to drive policy learning, thereby reducing the reliance on curated expert demon- strations. Early explorations in one-shot imitation learning [13] and meta-imitation learning [21] demon- strate that agents can learn effectively from sparse or unstructured data. More recent approaches, such as DexMV [28] and video language planning methods [11], extend these ideas to incorporate video data directly. Complementary efforts in imitation learning from hu- man videos [36] and zero-shot robotic manipulation us- ing pretrained image-editing diffusion models [8] fur- ther highlight the potential of harnessing generated data. Methods like MimicPlay [33] and learning uni- versal policies via text-guided video generation [12] ex- emplify the trend towards minimizing the dependency on meticulously collected expert data. Notably, to the best of our knowledge, all existing approaches still rely on at least some curated data for policy training, largely because challenges remain in ensuring that visually generated data is both physically plausible and sufficiently informative for direct policy learning. NIL addresses these challenges by integrat- ing video diffusion models within an imitation learning framework, thereby providing robust, on-the-fly expert demonstrations for agents with diverse morphologies, without the need for any curated or embodied data. 3 Page 4: Figure 2. NIL: No-data Imitation Learning consists two stages. Stage 1: Render the agent’s initial frame, remove the background, and generate a reference video using a pre-trained video diffusion model conditioned on the initial frame and a textual task description. Stage 2: Train a reinforcement learning agent in a physical simulation to imitate the generated video via a reward function comprising (1) video encoding similarity, (2) segmentation mask IoU, and (3) regularization for smooth behavior. 3. Method 3.1. Overview No-data Imitation Learning (NIL) aims to learn phys- ically plausible 3D motor skills from video-diffusion- model generated 2D videos. Given a skill siand an em- bodimentbj, our goal is to learn a policy πsi,bjthat en- ables a simulated agent ebjto perform si. Our method comprises two stages (Fig. 2): 1.Video Generation: A reference video Fsi,bjis gen- erated using a pre-trained video diffusion model D, conditioned on the initial 2D simulation frame e0 and a textual prompt psi,bj. 2.Policy Learning: A reward function evaluates the similarity between the generated video Fsi,bjand a rendered simulation video Esi,bj. This reward, along with regularization for smoothness, guides the optimization of πsi,bj. 3.2. Stage 1: Video Generation The video generation module uses a frozen, pre-trained video diffusion model, D, to generate a 2D video of the agent performing the skill si. The inputs to Dare: 1. the initial frame ( e0) depicting the agent with em- bodimentbjin the simulation environment, rendered from the physical simulation at a fixed starting posi-tion, and 2. a textual prompt ( psi,bj) describing the task. The textual prompt is constructed as: psi,bj= “Thebjagent issi, camera follows. ” (1) wherepbjis the name of the embodiment, e.g. Unitree H1 robot, and psidescribes the skill, e.g. walking. We use a fixed camera setup; the camera follows the agent in both the generated video and the simulation rendering. The video diffusion model generates a video: Fsi,bj∈Rn×H×W×3=D(psi,bj,e0). (2) The generated video frames are denoted as: Fsi,bj={f(si,bj) 0,f(si,bj) 1,...,f(si,bj) n−1}. (3) wherenis the number of frames, and HandWare the height and width of the frames, respectively. 3.3. Stage 2: Learning the Task-Achieving Policy Video Similarity: The similarity metric computes a reward signal by comparing the generated video Fsi,bjwith the rendered simulation video Esi,bj= {e(si,bj) 0,e(si,bj) 1,...,e(si,bj) k−1}, wherekis the length of the rendered agent video. The objective is to extract meaningful learning signals from the 2D-generated 4 Page 5: video to guide the acquisition of 3D motor skills. The computation involves three steps: 1. segmentation and masking; 2. video encoding; and 3. similarity compu- tation. Segmentation and Masking: We segment the agent from both Fsi,bjandEsi,bj. ForFsi,bj, we use the Segment Anything Model 2 (SAM) to obtain masks as MF={MF 0,MF 1,...,MF n−1}, and forEsi,bj, segmen- tation masks ME={ME 0,ME 1,...,ME k−1}are pro- vided by the simulator. Masked frames are denoted asfM,(si,bj) t ,eM,(si,bj) t . The segmented videos are thus represented as FM si,bjandEM si,bj. Video Encoding: To capture spatiotemporal dynam- ics for both the generated and rendered videos, we em- ploy a pre-trained TimeSformer encoder T(trained on Kinetics-400). For each time step t, we extract an 8- frame clip from the masked video FM si,bjas: C(si,bj) t =/braceleftigg {fM,(si,bj) 0,...,fM,(si,bj) 0,fM,(si,bj) t}, t< 7, {fM,(si,bj) t−7,...,fM,(si,bj) t}, t≥7. Each clip is passed through Tto obtain the embedding: zF,(si,bj) t =T/parenleftig C(si,bj) t/parenrightig . An analogous procedure yields zE,(si,bj) t forEM si,bj. Each 8-frame clip is then passed through the TimeS- former encoder: zF,(si,bj) t =T/parenleftig C(si,bj) t/parenrightig , zE,(si,bj) t =T/parenleftig ˜C(si,bj) t/parenrightig , wherezF,(si,bj) t, zE,(si,bj) t∈Rdare the embeddings derived from the last hidden states of the trans- former. The encoder first divides each frame into non-overlapping patches that are linearly embedded and enriched with positional encodings. Subsequent transformer blocks then apply multi-head self-attention across both spatial and temporal dimensions, allowing the network to capture dynamic interactions among patches. This hierarchical attention mechanism yields a comprehensive representation of the input clip. 3.3.1. Reward Function The reward score at each frame tis computed by com- bining video similarity, image-based similarity, and reg- ularization. a) Video Similarity: The video similar- ity at time step tis defined as the negative Euclidean (L2) distance between the corresponding embeddings of the generated and rendered videos: Sv,t=−/vextenddouble/vextenddouble/vextenddoublezF,(si,bj) t−zE,(si,bj) t/vextenddouble/vextenddouble/vextenddouble 2. b) Image-Based Similarity: We compute the In- tersection over Union (IoU) between the segmentationmasks of the generated and rendered videos: SMt=/summationtext k,lMF t(k,l)·ME t(k,l)/summationtext k,lMF t(k,l) +ME t(k,l)−MF t(k,l)·ME t(k,l) (4) wherek= 1,2,...H , andl= 1,2,...W denote pixel locations vertically and horizontally, respectively. MF t(k,l) = 1 if the pixel at the coordinate k,lstays inside the agent body mask, and 0 otherwise. The IoU score ranges between 0 and 1, with higher values indi- cating greater similarity between the masks. c) Regularization: To ensure smooth behavior, we introduce an aggregated regularization term, Pt. These regularization components are standard in robotic con- trol frameworks and ensure that the learned policy ad- heres to physically realistic constraints. Specifically, we define: Pt=PJ,t+PA,t+PV,t+PF,t+PS,t, wherePJ,tpenalizes the sum of joint torques to dis- courage aggressive actuation, PA,tpenalizes large dif- ferences between consecutive actions, PV,tdiscourages high angular joint velocities, PF,tpenalizes scenarios where a foot is in contact with the ground and moving while offering a slight reward when a foot remains off the ground for an extended duration, and PS,tenforces stability by penalizing excessive tilting of the torso. d) Combined Reward: The overall reward at each time steptis computed as a weighted sum of the video similarity, the image-based similarity, and the aggre- gated penalty: Rt=αSv,t+βSM,t+γPt, whereα,β, andγare scalar weights that balance the contributions of each term. This composite reward ef- fectively aligns the rendered simulation with the gen- erated video while promoting smooth and physically plausible agent behavior. 3.4. Policy Learning The end goal of NIL is to learn a policy, πsi,bjthat maximizes the expected cumulative imitation reward derived from the similarity scores defined in Sec. 3.3.1. In contrast to state-of-the-art imitation learning ap- proaches combining discriminators with reinforcement learning, we directly maximize the imitation reward us- ing entropy-regularized off-policy actor-critic reinforce- ment learning. This change eliminates the need for ad- versarial training and simplifies the learning process. At each time step t, the agent receives an observa- tionot∈O, which comprises joint positions and veloc- itie, and selects an action at∈ A (i.e., the torques 5 Page 6: to be applied to the joints) according to the policy πsi,bj(at|ot). The environment then provides an imi- tation reward, defined as: Rt=αSv,t+βSM,t+γPt, whereSv,tis the video similarity, SM,tis the image- based similarity, Ptis the aggregated penalty (see Sec. 3.3.1.3), and α,β, andγare scalar weights. The overall objective is to maximize the expected cumulative discounted reward: J(πsi,bj) =Eπsi,bj/bracketleftigg∞/summationdisplay t=0γtRt/bracketrightigg , withγ∈[0,1) representing the discount factor. In the entropy-regularized actor-critic framework, the policy is optimized by maximizing a soft value func- tion that includes an entropy term to encourage explo- ration: max πE(o,a)∼π/bracketleftigg∞/summationdisplay t=0γt/parenleftbig Rt+αentH/parenleftbig π(·|ot)/parenrightbig/parenrightbig/bracketrightigg , whereH/parenleftbig π(·|ot)/parenrightbig denotes the entropy of the policy at stateot, andαentis a temperature parameter control- ling the trade-off between reward maximization and exploration. Both the actor (policy network) and the critics (Q- value networks) are updated concurrently using off- policy data stored in a replay buffer. This setup allows the agent to effectively leverage the dense imitation re- ward signal—derived from the similarity metrics—to reproduce expert motion patterns as observed in the reference video Fsi,bj. Importantly, while we employ BRO [25] in our experiments, this entropy-regularized actor-critic formulation is generic and can be instanti- ated with any suitable algorithm. Temporal Alignment: The action frequency resolu- tion is different in the generated video when compared with the video rendered using the physical simulation since the simulator-rendered videos are very high reso- lution in terms of action frequency to allow fine-grained controls. Therefore, during the policy learning, to align the temporal dynamics between the simulation and the generated video, we define a mapping between simula- tion timesteps and video frames. Let kbe the num- ber of simulation timesteps corresponding to one video frame. Then, for every ksimulation steps, we compute the reward Stand assign it uniformly to those steps. 3.5. NIL No-data Video Imitation (NIL) provides a method for 3D motor skill acquisition from generated 2D videos byintegrating video diffusion models and a discriminator- free imitation learning framework. By leveraging a pretrained video diffusion model, we generate expert demonstrations in the form of 2D videos on-the-fly. These videos offer a visual reference for the desired be- havior without any need for manually collected expert data. To extract meaningful learning signals from the gen- erated videos, we employ a similarity metric that com- bines video-encoding distance and image-based seg- mentation similarity between the generated video and the rendered simulation video. This metric serves as a reward function guiding the agent’s learning pro- cess. NIL replaces the adversarial discriminator used in state-of-the-art imitation learning methods with this similarity metric, and simplifies the training process while enhancing the stability. Through imitation learning, the agent optimizes its policy to maximize the similarity score, effectively learning the desired skill by imitating the motion pat- terns depicted in the generated video. NIL’s general framework allows it to be applied across diverse and un- conventional morphologies, enabling agents to acquire complex skills without reliance on pre-collected expert demonstrations. NIL offers extensive generalization ca- pability to a wide range of tasks and embodiments. 4. Experiments In this section, we evaluate the performance of No-data Imitation Learning (NIL) for 3D motor skill acquisition using autonomously generated 2D videos. We perform three ablation studies to analyze NIL: •Reward Component Ablation: We analyze the impact of individual reward components on the learning performance. •Diffusion Model Comparison: We compare sev- eral pretrained video diffusion models to deter- mine which one provides the most effective reference demonstrations for imitation learning. •Improving Diffusion Models: We assess how in- cremental advancements in video diffusion models affect the quality of the learned behaviors. Then, we evaluate the performance of NIL in challeng- ing robotic control tasks: •Continuous Control of Various Robots: Learn- ing to walk with five different robot embodiments, each of which has different unique configurations and challenges. Baselines: Since NIL is the only method that relies solely on generated data (without any curated or same- embodiment collected data), we compare NIL against both upper and lower baselines. As upper baselines, we 6 Page 7: (a) All components (b) Without regularization (c) Without IoU (d) Without video similarity (e) Only regularization (f) Only IoU (g) Only video similarity Figure 3. Reward Components: Ablation of the reward function. (a) All components: All components are employed, and agent learns to walk well. (b) Without regularization: The resulting motion is jittery. (c) Without IoU: The learned behavior is distorted slightly. (d) Without video similarity: The walking is slower, and jittery. (e) Only regularization: Agent fails to walk straight, and employs suboptimal large leg movements. (f) Only IoU: Agent fails to walk forward continuously. (g) Only video similarity: Agent walks in a jittery way, and stops midway while walking. employ three state-of-the-art imitation learning meth- ods: AMP [27], GAIfO [32], and DRAIL [20]. As a lower baseline, we consider Behavioral Cloning (BC [5]) and Behavioral Cloning from Observations (BCO [31]). All baselines are trained using motion-capture data from [1] that is adapted to the simulation do- main, with perfect joint correspondence. These perfect joint correspondences are used to calculate a reward for the learning agent, and combined with regularization parameters. For AMP, we define velocity tracking re- ward, which rewards the agent for reaching the velocity of 1m/s. We provide details regarding quantification metrics in Supplementary Materials. 4.1. Reward Components To understand the contribution of each reward term, we train NIL on a walking task on the UnitreeH1 hu- manoid robot. We evaluate performance using two metrics: (a) the environment reward: evalautes the speed and stability of the policy and (b) the motion similarity score: quantifies how closely the learned mo- tion matches motion-capture data. Table 1 presents quantitative results, Figure 3 shows qualitative demonstrations. First, we analyze how the lack of individual components affects NIL. Overall, reg- ularization helps NIL to smoothen the learned mo- tions, while both image-based and video-based simi- larity scores helps the agent to understand essentials of walking. Second, we evaluate whether isolated components of the reward function enable imitating motions in gen- erated reference videos. With only video similarity, NIL achieves a reasonable performance, albeit fails toTable 1. Reward Ablation : We analyze effects of each reward function component on the performance of NIL. Env. Reward↑MoCap Loss↓ NIL (all components)396.1 46.4 w/o Reg. 382.4 44.9 w/o IoU Score 381.4 82.9 w/o Video Sim. 387.3 54.3 only Reg. 363.6 93.5 only IoU Score 328.4 101.6 only Video Sim. 369.6 56.8 Expert 400 0 generate visually plausible motions. In contrast, us- ing only regularization or IoU rewards results in poor- performing policies. 4.2. Diffusion Models for Imitation Learning We evaluate the impact of different video diffusion models on imitation learning. To our knowledge, this is the first study to compare various diffusion models for usability in imitation learning. We consider five open- and closed-source video diffusion models: Kling AI, Pika, Runway Gen-3, OpenAI Sora, and Stable Video Diffusion (SVD) [9]. For each model, we gen- erate reference videos for the UnitreeH1 walking task. Generated videos are provided in the Supplementary Materials. Quantitatively (see Table 2), Kling, despite exhibiting intermittent instabilities, yields the most vi- sually plausible outputs and the highest NIL perfor- mance. Interestingly, even though Pika has shown lim- itations in physical plausibility [6], it still leads to high 7 Page 8: (a) NIL trained on Kling v1.6 (b) NIL trained on Kling v1.0 Figure 4. Different versions of video diffusion models: (a) NIL learns to walk using the newest Kling version for reference video generation; (b) Reference video generated by the older version of King results in walking with an unbalanced gait. Table 2. Video Diffusion Models : We compare effec- tiveness of different video diffusion models in terms of gen- erating reference videos. ModelGenerated FramesNIL Performance Kling 150 396.1 Pika v1.5 120 385.9 Runway Gen-3 125 383.7 OpenAI Sora 150 370.8 SVD 24 366.5 Expert - 400 imitation scores. We hypothesize that visual plausibil- ity of the reference video is the most crucial property for NIL, as NIL is designed to refine physically implau- sible motions within the simulator. 4.3. Improvements in Video Diffusion Models To examine the sensitivity of NIL to advancements in video diffusion models, we compare two versions of Kling: v1.0 and v1.6. Both versions are used to gen- erate reference videos for the UnitreeH1 walking task. While the quantitative metrics are similar for both ver- sions, qualitative results (see Figure 4) reveal that the newer Kling v1.6 produces significantly more natural gaits. In contrast, the reference video from Kling v1.0 leads to an unbalanced gait with asymmetric leg move- ments. This experiment underscores that even with small improvements in video diffusion models, the perfor- mance of NIL gets better. Therefore, better video dif- fusion models would enable NIL to learn more challeng- ing tasks without using any collected/curated data.Table 3. Robotic Control: We evaluate NIL on challeng- ing tobotic locomotion tasks across multiple robots. Unitree H1TalosUnitree G1Unitree A1 NIL (ours)396.1 352.8 356.9 360.8 AMP 393.5 231.1 393.4 286.9 DRAIL 11.7 1.9 7.3 3.5 GAIfO 347.8 204.4 46.1 260.8 BCO 72.0 26.6 21.2 30.3 Expert 400 400 400 400 4.4. Continuous Control of Various Robots Finally, we test NIL on challenging locomotion tasks across multiple robotic embodiments: three humanoid platforms (Unitree H1, Talos, and Unitree G1) and a quadruped (Unitree A1). For each robot, NIL is trained using a single reference video generated by Kling AI (Pika for Unitree A1), and we compare its performance against upper baselines (AMP, DRAIL, GAIfO) as well as a lower baseline (BCO), all of which are trained with 25 curated expert demonstrations. Table 3 presents quantitative results. For the Uni- tree H1 and Unitree A1, NIL not only achieves superior quantitative performance compared to AMP, it also produces a more natural and balanced walking gait. In contrast, for Unitree G1, even though NIL obtains competitive scores, AMP generates visually more natu- ral locomotion. With the Talos platform, both NIL and AMP face significant challenges due to the robot’s com- plex morphology; however, NIL performs better and learns to move forward, albeit with less natural motion than desired. 8 Page 9: 4.5. Summary of Experiments Overall, our experimental results confirm that NIL, by leveraging generated data and a discriminator-free imi- tation reward, effectively learns task-achieving policies across diverse robotic platforms. The ablation studies underscore the importance of the reward components, while the diffusion model comparison highlights that visually plausible generation, even if not physically per- fect, is key for effective imitation. We also present that improvements in video diffusion models enables better performance of NIL. These findings demonstrate the potential of NIL as a promising alternative to conven- tional, data-intensive imitation learning approaches. Future improvements in video diffusion model could enable NIL to achieve more complex tasks, with differ- ent embodiments. 5. Discussion and Future Directions We introduce NIL as a first step towards eliminat- ing the dependency on curated expert data in imita- tion learning. By leveraging video diffusion models to generate expert demonstrations on-the-fly, NIL not only removes the platform specific data collection but also achieves competitive performance across diverse robotic platforms. One of the key insights from our study is that the performance of NIL is closely tied to the quality of the generated videos. As improve- ments in video diffusion models continue to emerge, NIL naturally benefits from these advancements, lead- ing to more realistic and robust behaviors. For exam- ple, our experiments with different versions of the Kling model clearly illustrate that even minor enhancements in video quality translate to significantly more natural gaits and smoother motion patterns. Looking forward, several directions can further en- hance the capabilities of NIL. First, integrating NIL as a pretraining step offers an exciting opportunity; the policies learned in a data-free manner can be fine-tuned using a small amount of curated data to boost motion naturalness, especially for complex morphologies where current methods face challenges. Also, extending NIL to more challenging tasks beyond locomotion is an ex- citing direction. In summary, NIL’s performance is expected to im- prove in accordance with the rapid advancements in video diffusion technology. This synergistic relation- ship opens up a broader research horizon where data efficiency and generalization are paramount. We be- lieve that our work lays a strong foundation for fu- ture research at the intersection of generative mod- eling and imitation learning, paving the way for in- creasingly sophisticated and autonomous robotic be- haviors.References [1] Firas Al-Hafez, Guoping Zhao, Jan Peters, and Da- vide Tateo. Locomujoco: A comprehensive imitation learning benchmark for locomotion. arXiv preprint arXiv:2311.02496 , 2023. 7 [2] Mert Albaba, Sammy Christen, Christoph Gebhardt, Thomas Langarek, Michael J Black, and Otmar Hilliges. Rile: Reinforced imitation learning. arXiv preprint arXiv:2406.08472 , 2024. 3 [3] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Man´ e. Con- crete problems in ai safety. arXiv preprint arXiv:1606.06565 , 2016. 2 [4] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Luˇ ci´ c, and Cordelia Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF international conference on computer vi- sion, pages 6836–6846, 2021. 2, 3 [5] Michael Bain and Claude Sammut. A framework for behavioural cloning. In Machine Intelligence 15 , pages 103–129, 1995. 3, 7 [6] Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. arXiv preprint arXiv:2406.03520 , 2024. 7 [7] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video under- standing? In ICML , page 4, 2021. 3 [8] Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639 , 2023. 3 [9] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 , 2023. 2, 3, 7 [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Trans- formers for image recognition at scale. In International Conference on Learning Representations , 2021. 2 [11] Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B Tenenbaum, et al. Video language planning. arXiv preprint arXiv:2310.10625 , 2023. 3 [12] Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text- guided video generation. Advances in Neural Informa- tion Processing Systems , 36, 2024. 3 9 Page 10: [13] Yan Duan, Marcin Andrychowicz, Bradly Stadie, Ope- nAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot im- itation learning. Advances in neural information pro- cessing systems , 30, 2017. 3 [14] Christoph Feichtenhofer, Yanghao Li, Kaiming He, et al. Masked autoencoders as spatiotemporal learners. Advances in neural information processing systems , 35: 35946–35958, 2022. 3 [15] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adverserial inverse reinforcement learning. In International Conference on Learning Representations , 2018. 3 [16] Rohit Girdhar, Du Tran, Lorenzo Torresani, and Deva Ramanan. Distinit: Learning video representations without a single labeled video. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 852–861, 2019. 3 [17] Jonathan Ho and Stefano Ermon. Generative adversar- ial imitation learning. Advances in neural information processing systems , 29, 2016. 3 [18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denois- ing diffusion probabilistic models. Advances in neural information processing systems , 33:6840–6851, 2020. 3 [19] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems , 35:8633–8646, 2022. 2, 3 [20] Chun-Mao Lai, Hsiang-Chun Wang, Ping-Chun Hsieh, Frank Wang, Min-Hung Chen, and Shao-Hua Sun. Diffusion-reward adversarial imitation learning. Ad- vances in Neural Information Processing Systems , 37: 95456–95487, 2025. 7 [21] Jiayi Li, Tao Lu, Xiaoge Cao, Yinghao Cai, and Shuo Wang. Meta-imitation learning by watching video demonstrations. In International Conference on Learning Representations , 2021. 3 [22] Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. In Euro- pean Conference on Computer Vision , pages 237–255. Springer, 2024. 3 [23] Yunzhu Li, Jiaming Song, and Stefano Ermon. In- fogail: Interpretable imitation learning from visual demonstrations. Advances in neural information pro- cessing systems , 30, 2017. 3 [24] Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models learn physical principles from watching videos? arXiv preprint arXiv:2501.09038 , 2025. 2, 3 [25] Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Mi/suppress lo´ s, and Marek Cygan. Bigger, reg- ularized, optimistic: scaling for compute and sample efficient continuous control. Advances in Neural Infor- mation Processing Systems , 37:113038–113071, 2025. 6[26] Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. Variational discriminator bottleneck: Improving imitation learning, inverse rl, and gans by constraining information flow. In Interna- tional Conference on Learning Representations , 2018. 3 [27] Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Adversarial motion pri- ors for stylized physics-based character control. ACM Transactions on Graphics (ToG) , 40(4):1–20, 2021. 3, 7 [28] Yuzhe Qin, Yueh-Hua Wu, Shaowei Liu, Hanwen Jiang, Ruihan Yang, Yang Fu, and Xiaolong Wang. Dexmv: Imitation learning for dexterous manipulation from human videos. In European Conference on Com- puter Vision , pages 570–587. Springer, 2022. 3 [29] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kam- yar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language un- derstanding. Advances in neural information process- ing systems , 35:36479–36494, 2022. 3 [30] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to- video generation without text-video data. In nterna- tional Conference on Learning Representations , 2023. 3 [31] Faraz Torabi, Garrett Warnell, and Peter Stone. Be- havioral cloning from observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence , pages 4950–4957, 2018. 7 [32] Faraz Torabi, Garrett Warnell, and Peter Stone. Gen- erative adversarial imitation from observation. arXiv preprint arXiv:1807.06158 , 2018. 7 [33] Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anandku- mar. Mimicplay: Long-horizon imitation learning by watching human play. In Conference on Robot Learn- ing, pages 201–221. PMLR, 2023. 3 [34] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Video- mae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 14549–14560, 2023. 3 [35] Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Gongye Liu, Xintao Wang, Ying Shan, and Tien-Tsin Wong. Dynamicrafter: An- imating open-domain images with video diffusion pri- ors. In European Conference on Computer Vision , pages 399–417. Springer, 2024. 3 [36] Haoyu Xiong, Quanzhou Li, Yun-Chun Chen, Homanga Bharadhwaj, Samarth Sinha, and Animesh Garg. Learning by watching: Physical imitation of manipulation skills from human videos. In 2021 10 Page 11: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages 7827–7834. IEEE, 2021. 3 [37] Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to- video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145 , 2023. 2, 3 [38] Yanyi Zhang, Xinyu Li, Chunhui Liu, Bing Shuai, Yi Zhu, Biagio Brattoli, Hao Chen, Ivan Marsic, and Joseph Tighe. Vidtr: Video transformer without con- volutions. In Proceedings of the IEEE/CVF interna- tional conference on computer vision , pages 13577– 13587, 2021. 3 11

---