Authors: Siyin Wang, Zhaoye Fei, Qinyuan Cheng, Shiduo Zhang, Panpan Cai, Jinlan Fu, Xipeng Qiu
OpenMOSS
World Modeling Makes a Better Planner:
Dual Preference Optimization for Embodied Task Planning
Siyin Wang¹,², Zhaoye Fei¹, Qinyuan Cheng¹, Shiduo Zhang¹,
Panpan Cai²,⁴, Jinlan Fu³∗, Xipeng Qiu¹,²∗
¹Fudan University, ²Shanghai Innovation Institute,
³National University of Singapore, ⁴Shanghai Jiao Tong University
Abstract
Recent advances in large vision-language models (LVLMs) have shown promise
for embodied task planning, yet they struggle with fundamental challenges like
dependency constraints and efficiency. Existing approaches either solely optimize
action selection or leverage world models during inference, overlooking the benefits
of learning to model the world as a way to enhance planning capabilities. We
propose Dual Preference Optimization (D²PO), a new learning framework that
jointly optimizes state prediction and action selection through preference learning,
enabling LVLMs to understand environment dynamics for better planning. To
automatically collect trajectories and stepwise preference data without human
annotation, we introduce a tree search mechanism for extensive exploration via
trial-and-error. Extensive experiments on VoTa-Bench demonstrate that our D²PO-based
method significantly outperforms existing methods and GPT-4o when applied
to Qwen2-VL (7B), LLaVA-1.6 (7B), and LLaMA-3.2 (11B), achieving superior
task success rates with more efficient execution paths.
1 Introduction
Figure 1: Overview of D²PO: World modeling enables better embodied task planning through
joint preference optimization of state prediction and action selection.
Embodied task planning (Singh et al., 2022; Inoue & Ohashi, 2022; Mai et al., 2023), which
enables AI systems to perform real-world tasks through physical interaction, demands both
correctness and efficiency. Incorrect or inefficient task planning not only wastes computational
resources but may also lead to unsafe operations, compromising system usability and reliability in
dynamic environments. Previous LLM-based approaches rely heavily on environment metadata
(Yao et al., 2022; Sun et al., 2023) or external object detection models (Singh et al., 2022; Song
et al., 2022), limiting their ability to operate end-to-end in real-world scenarios. Recent advances
in Large Vision-Language Models (LVLMs) (OpenAI, 2024) have opened new possibilities for
embodied intelligence, yet state-of-the-art LVLMs still struggle with fundamental issues such as
violated dependency constraints (e.g., placing objects before picking them up) and inefficient
planning (e.g., repeating unnecessary steps). These limitations stem from a critical gap: LVLMs
operate on static snapshots of the environment, lacking the ability to model the dynamic nature
of physical interactions.
∗Corresponding authors.
arXiv:2503.10480v1 [cs.CL] 13 Mar 2025
Existing approaches leverage language models for embodied task planning, including prompt-based
methods (Song et al., 2022; Shin et al., 2024; Liang et al., 2022), supervised fine-tuning (SFT)
from expert demonstrations (Wu et al., 2023; Chen et al., 2024b; Jin et al., 2023), and RL-based
optimization (Carta et al., 2023; Yang et al., 2023; Szot et al., 2023). However, these methods
primarily focus on learning direct mappings from state to action, optimizing for what to do without
considering the consequences of these actions. To model environment dynamics, some recent methods
leverage LLMs directly as world models through prompting (Hao et al., 2023; Zhou et al., 2024)
to guide the search path. However, these approaches introduce additional computational overhead
while failing to develop world modeling capabilities during training. Moreover, embodied task planning
involves generating sequential actions based on environmental context, often with multiple valid
solutions.
Humans possess an internal world model, a cognitive framework constructed in the brain to
understand, predict, and adapt to the external world. This model is developed through continuous
interaction with the environment (Johnson-Laird, 1983; Tolman, 1948; LeCun, 2022). To equip a
model with an internal world model and enable diverse, multi-solution decision-making, we
propose Dual Preference Optimization (D²PO), a framework that jointly optimizes state imagination
(state prediction) and action selection through preference learning, as shown in Figure 1. Specifically,
D²PO interacts with the environment to predict future changes, gradually forming an internal world
model. Inspired by Direct Preference Optimization (DPO) (Rafailov et al., 2023), it learns
relative preferences, thus retaining the ability to explore diverse solutions. The framework optimizes
two objectives: (1) State Prediction, where the model predicts the next state given the current state
and action, learning the consequences of actions over time; and (2) Action Selection, which improves
the model's ability to choose appropriate actions with reasoning based on the goal and interaction
history. By representing world
dynamics in natural language, we leverage the prior knowledge of large language models. More
importantly, rather than treating world modeling as a separate component, our framework uses world
modeling objectives to enhance the policy’s planning capabilities. Through this dual optimization,
the policy model naturally develops an understanding of world dynamics, leading to more informed
action selection without requiring explicit world model guidance during inference.
To automatically collect correct trajectories and stepwise preference data for training, we introduce a
tree search mechanism that systematically explores action sequences within a simulated environment.
By combining model evaluations and environmental feedback, this scalable method can automatically
generate extensive trajectories and construct preference pairs for both action selection and state
prediction. This approach eliminates the need for expert demonstrations and preference annotations,
while efficiently gathering diverse embodied interaction experiences. Extensive experiments on
VoTa-Bench, our vision-enhanced extension of the text-only LoTa-Bench (designed for LLMs) (Choi
et al., 2024), demonstrate that our method outperforms existing training approaches across multiple
evaluation settings. Our evaluation shows significant improvements in both success rate and planning
efficiency, with our 7B-parameter model surpassing GPT-4o’s performance on multiple test types,
highlighting the efficacy and potential of our approach.
Our main contributions are as follows:
•We propose to learn world modeling to enhance the model's planning abilities through our
novel Dual Preference Optimization (D²PO) framework, which jointly optimizes state
prediction and action selection through preference learning, enabling the model to learn
action consequences while improving planning.
•We introduce a tree search algorithm that automatically collects trajectories and constructs
multimodal stepwise preference data for embodied task planning via trial-and-error, elimi-
nating the need for human annotation.
•We demonstrate that auxiliary world modeling objectives significantly improve embodied
task planning with extensive experiments on VoTa-Bench. Our 7B-parameter model
achieves a relative improvement of 31.4% and 33.0% in success rate and planning efficiency
respectively compared to SFT baselines.
2 Related Work
2.1 Embodied Task Planning
Embodied task planning is a key component of Embodied AI, enabling agents to perform complex
tasks within dynamic and physical environments. Early LLM-based methods (Yao et al., 2022; Sun
et al., 2023; Zhao et al., 2023) rely purely on textual metadata from the environment, so they
struggle to adapt to the unpredictable and dynamic nature of real-world settings. Later approaches
(Singh et al., 2022; Song et al., 2022; Shin et al., 2024; Yang et al., 2024; Zhao et al., 2024; Shirai et al.,
2023) introduce cascaded visual processing through external models. However, these multi-stage
pipelines increase system complexity and potential error propagation. Notably, existing methods
(Pashevich et al., 2021; Inoue & Ohashi, 2022; Lu et al., 2023; Chen et al., 2024b; Zhao et al., 2024)
also heavily rely on manual step-by-step instructions. In contrast, we propose an end-to-end approach
using a single VLM for both direct visual processing and autonomous planning, despite the increased
modeling challenges.
Methodologically, several recent works have explored diverse prompting strategies (Song et al., 2022;
Shin et al., 2024; Liang et al., 2022) and multi-agent frameworks with specialized roles (Zhang et al.,
2023; Mai et al., 2023; Wang et al., 2024d). SFT-based approaches learn from expert demonstrations
using human or language model annotated data (Wu et al., 2023; Chen et al., 2024b; Jin et al.,
2023), or collect training data through actor-critic simulation (Li et al., 2024). Recent works explore
PPO-based optimization using designed reward templates (Carta et al., 2023) or optimizing through
environment interaction feasibility (Yang et al., 2023; Szot et al., 2023). These RL-based methods
require hand-designed rewards or separately trained reward models. Direct preference optimization (DPO)
(Rafailov et al., 2023), as an implicit reward modeling approach, has shown promise in LLM planning
(Song et al., 2024; Zhao et al., 2024). Different from existing approaches focusing on optimizing
action selection alone, we propose to leverage DPO for joint optimization of state prediction and
action selection in LVLMs.
2.2 World Model
A world model is a computational framework that predicts future states based on current states and
actions, enabling decision-making through simulated outcomes (Sutton, 1990). Traditional approaches,
based on recurrent state space models (RSSM) for low-level control, focus on learning state transitions
in a latent space rather than language modeling and rely on handcrafted reward functions (Hafner
et al., 2019; 2020; Wu et al., 2022; Hafner et al., 2023). Recent advancements have explored
integrating LLMs to leverage prior knowledge, with some using LLMs to generate symbolic plans or
code to model the world (Guan et al., 2023; Dainese et al., 2024), and others using text prompting
(Hao et al., 2023; Zhou et al., 2024). However, these methods mainly utilize world modeling during
inference, without incorporating it into the training process. In contrast, our approach jointly
optimizes state prediction and action selection with DPO during the training stage, learning world
modeling capabilities that enhance the model’s planning abilities.
2.3 Direct Preference Optimization
In the realm of preference-based learning, Direct Preference Optimization (DPO) (Rafailov et al.,
2023) offers a powerful framework for language model alignment without requiring explicit reward
modeling. Recent work has extended DPO to multimodal settings in understanding or reasoning
tasks (Yu et al., 2023; Wang et al., 2024a; Xie et al., 2024; Wang et al., 2024c; Fu et al., 2025).
However, embodied task planning differs from these tasks as it requires interaction with real-world
environments, closed-loop adaptation to current states, and long-horizon planning. Recent work
like ETO (Song et al., 2024) applied DPO in LLM-based embodied planning but primarily focused
on action optimization without considering state prediction or visual inputs. In contrast, our work
combines LVLMs with DPO to jointly optimize state prediction and action selection, leveraging
world modeling to enhance the agent’s planning capabilities in dynamic, interactive settings.
[Figure 2 diagram: (a) Data Exploration via Step-wise Tree Search (Sec 3.2), with sampling, selection, iterative expansion, and backtracking from the goal state; (b) D²PO (Sec 3.3), where preference data for action selection and state prediction trains the policy/world model.]
Figure 2: Our method consists of two dimensions: (a) Data Exploration via Step-wise Tree Search
(Sec 3.2), which collects preference data through sampling and selecting potential actions, iterative
tree expansion, and trajectory backtracking; (b) Dual Preference Optimization (D²PO) framework
(Sec 3.3) that leverages the collected preference pairs to jointly optimize action selection and state
prediction.
3 Method
3.1 Task Formulation
We model the embodied task planning problem as a Partially Observable Markov Decision Process
(POMDP), where the agent operates in a partially observable environment and generates actions
based on multimodal feedback. The POMDP is defined by the tuple
$(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{M}, \mathcal{R}, \gamma)$, where
$\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{O}$ is the observation
space, $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is the transition function
($s_t = \mathcal{T}(s_{t-1}, a_t)$), $\mathcal{M}: \mathcal{S} \to \mathcal{O}$ is the observation
function provided by the simulation environment,
$\mathcal{R}: \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the reward function, and $\gamma$ is the
constant discount factor. Due to partial observability, the agent cannot directly access the complete
state $s_t \in \mathcal{S}$, but instead receives first-person visual observations
$o_t = \mathcal{M}(s_t) \in \mathcal{O}$ from the environment.
Given a task goal $g \in \mathcal{G}$, where $\mathcal{G}$ is the space of natural language task
instructions, the agent interacts with the environment through a sequential planning process. At each
time step $t$, the agent receives an observation $o_t \in \mathcal{O}$ from the simulation environment
and maintains a history of past observations and actions $h_t = (o_0, a_1, o_1, \ldots, a_t, o_t)$.
Based on this history and the task goal, the agent’s policy $\pi_\theta$ generates an action
$a_{t+1} \sim \pi_\theta(\cdot \mid g, h_t)$, where the policy
$\pi_\theta: \mathcal{G} \times \mathcal{H} \to \mathcal{A}$ maps the current history $h_t$ and goal
$g$ to a distribution over the action space $\mathcal{A}$.
Through this interaction process, a trajectory is formed as
$e = (g, o_0, a_1, o_1, \ldots, o_{n-1}, a_n, o_n)$, where $n$ is the length of the trajectory, and
each observation $o_t$ is provided by the environment after executing action $a_t$. The task is
considered successfully completed if the final state satisfies the goal condition, with the reward
defined as $r(e) = 1$ if the goal condition is satisfied and $0$ otherwise.
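The interaction loop of Section 3.1 can be sketched in a few lines of Python. This is an illustration only: `ToyEnv`, its `reset`/`step`/`goal_satisfied` methods, the scripted policy, and the `"done"` stop signal are hypothetical stand-ins for the simulator and the LVLM policy. Only the history layout $h_t = (o_0, a_1, o_1, \ldots)$, the 25-step cap used in evaluation, and the binary trajectory reward follow the text.

```python
from typing import Callable, List, Tuple


class ToyEnv:
    """Hypothetical simulator stand-in; the task is to end up holding the egg."""

    def __init__(self) -> None:
        self.holding = None

    def reset(self) -> str:
        self.holding = None
        return "egg on counter"  # o_0 (a first-person image in the real setting)

    def step(self, action: str) -> str:
        # Transition s_t = T(s_{t-1}, a_t), then observation o_t = M(s_t).
        if action == "pick up egg":
            self.holding = "egg"
        return f"holding {self.holding}"

    def goal_satisfied(self) -> bool:
        return self.holding == "egg"


def scripted_policy(goal: str, history: List[str]) -> str:
    """Placeholder for sampling a_{t+1} ~ pi_theta(. | g, h_t)."""
    return "done" if history[-1] == "holding egg" else "pick up egg"


def rollout(env: ToyEnv, policy: Callable[[str, List[str]], str],
            goal: str, max_steps: int = 25) -> Tuple[List[str], int]:
    history = [env.reset()]                      # h_0 = (o_0)
    for _ in range(max_steps):
        action = policy(goal, history)
        if action == "done":                     # assumed stop signal
            break
        history += [action, env.step(action)]    # append a_t and o_t
    reward = 1 if env.goal_satisfied() else 0    # r(e) in {0, 1}
    return history, reward
```

Calling `rollout(ToyEnv(), scripted_policy, "pick up the egg")` yields the flattened history and a reward of 1 for this toy task.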
3.2 Data Exploration via Step-wise Tree Search
Previous training methods often rely on costly human expert annotations or GPT-4o-generated labels,
which can be both time-consuming and limited in diversity. To address these challenges, we introduce
a novel tree search framework for embodied task planning that explores the action space step-by-step
with environment interaction, eliminating the need for human expert annotation.
As shown in Figure 2(a), our framework consists of three components: action sampling and evaluation,
iterative tree expansion, and trajectory validation and backtracking. First, we sample and evaluate
potential actions at each state using a hybrid scoring mechanism. Then, we iteratively expand the
search tree by selecting and exploring promising nodes at each level, following a breadth-first strategy.
Once a goal state is reached, we backtrack through the trajectory to create preference pairs for dual
optimization of action selection and state prediction. Implementation details are provided in
Appendix B.
Action Sampling and Evaluation At each selected state node $s_t$, we sample multiple potential
actions $\{a_t^{(i)}\}_{i=1}^{K}$ using a base policy model. Actions are evaluated through a hybrid
scoring mechanism combining two components: a process reward score $r_{\text{proc}}^{(i)}$ from GPT-4o,
which evaluates how actions contribute to goal completion based on the history according to a
score-based prompt, and a binary environmental feasibility score $r_{\text{env}}^{(i)}$ indicating
action executability (1 if executable, 0 if not). These scores are normalized and combined with equal
weights into $r_{\text{total}}^{(i)} = \alpha r_{\text{proc}}^{(i)} + (1 - \alpha) r_{\text{env}}^{(i)}$,
where $\alpha = 0.5$, guiding exploration towards both goal-oriented and executable trajectories.
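As a minimal sketch, the hybrid score can be written as a single function. The text says the scores are normalized but does not give the scale of the GPT-4o process score here, so the `proc_max = 5.0` default and the clipping to [0, 1] are assumptions made for illustration.

```python
def hybrid_score(proc_score: float, executable: bool,
                 alpha: float = 0.5, proc_max: float = 5.0) -> float:
    """r_total = alpha * r_proc + (1 - alpha) * r_env, both terms in [0, 1]."""
    # Assumed normalization: divide the raw process score by its maximum.
    r_proc = min(max(proc_score / proc_max, 0.0), 1.0)
    # Binary environmental feasibility: 1 if the action can be executed.
    r_env = 1.0 if executable else 0.0
    return alpha * r_proc + (1.0 - alpha) * r_env
```

With equal weights, a non-executable action can reach at most 0.5, so any threshold above 0.5 filters out infeasible actions regardless of their process score.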
Iterative Tree Expansion Following a breadth-first strategy, actions with high scores
$r_{\text{total}}^{(i)} \geq \tau$ (where $\tau$ is a predefined threshold) are selected for expansion.
The states reached by executing the selected actions in the environment form the next level of
exploration. This step-by-step expansion ensures extensive exploration of promising solution paths at
each depth while maintaining physical feasibility.
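The breadth-first expansion with a score threshold can be sketched as below. This is a simplified illustration under stated assumptions: the toy state, the `propose`/`score`/`step`/`is_goal` callables, the scores they return, and `tau = 0.6` are all hypothetical, and the sketch stops at the first goal state instead of collecting many trajectories.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[bool, bool]  # toy state: (holding_egg, egg_in_bowl)


def tree_search(root: State,
                propose: Callable[[State], List[str]],
                score: Callable[[State, str], float],
                step: Callable[[State, str], State],
                is_goal: Callable[[State], bool],
                tau: float = 0.6, max_depth: int = 5) -> Optional[List[str]]:
    """Expand level by level; keep only actions whose score is >= tau."""
    frontier = [(root, [])]
    for _ in range(max_depth):
        next_frontier = []
        for state, path in frontier:
            for action in propose(state):
                if score(state, action) < tau:
                    continue  # prune low-scoring or non-executable actions
                child = step(state, action)
                if is_goal(child):
                    return path + [action]
                next_frontier.append((child, path + [action]))
        frontier = next_frontier
    return None


def propose(state: State) -> List[str]:
    return ["pick up egg", "put egg in bowl", "wait"]


def score(state: State, action: str) -> float:
    holding, _ = state
    if action == "pick up egg":
        return 0.3 if holding else 0.9
    if action == "put egg in bowl":
        return 1.0 if holding else 0.45  # not executable without the egg
    return 0.2


def step(state: State, action: str) -> State:
    holding, in_bowl = state
    if action == "pick up egg" and not holding:
        return (True, in_bowl)
    if action == "put egg in bowl" and holding:
        return (False, True)
    return state


def is_goal(state: State) -> bool:
    return state[1]
```

On the toy problem, `tree_search((False, False), propose, score, step, is_goal)` prunes the premature "put egg in bowl" at depth one and reaches the goal at depth two.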
Trajectory Validation and Backtracking Upon reaching a goal state, we extract the trajectory
by backtracking and construct preference pairs for both action selection and state prediction. At
each step $s_{t-1} \to a_t$ in a successful trajectory, where visual observations
$o_{t-1} = \mathcal{M}(s_{t-1})$ represent the agent’s first-person view of states as input, we
generate two types of preference pairs. For action selection, we obtain
$(g, a_{<t}, o_{<t}, (r_t^w, a_t^w), \{(r_t^j, a_t^j)\}_{j \in N(t)})$, where $(r_t^w, a_t^w)$
represents the chosen reasoning-action pair and $\{(r_t^j, a_t^j)\}_{j \in N(t)}$ are alternatives
from sibling nodes. For state prediction, we extract
$(s_{t-1}, a_t, s_t^w, \{s_t^j\}_{j \in N(t)})$, where $s_t^w$ represents the state description that
would result from executing action $a_t^w$, and $\{s_t^j\}_{j \in N(t)}$ are the corresponding state
descriptions from alternative actions.
3.3 Dual Preference Optimization (D²PO) Framework
We propose the Dual Preference Optimization (D²PO) framework (Figure 2(b)), building upon Direct
Preference Optimization (DPO) (Rafailov et al., 2023). The core idea of DPO is to directly optimize
the model using preference pairs $\{y_w, y_l\}$, where the optimization objective encourages the model
to assign a higher probability to preferred responses, $p(y_w \succ y_l)$, while maintaining proximity
to a reference model, without an additional reward model.
We extend this preference learning framework to embodied task planning by simultaneously optimizing
two critical aspects: action selection and state prediction. The action selection optimization
focuses on enhancing the policy model, enabling the agent to choose the most appropriate action
based on the current state, history, and task instruction. Meanwhile, the state prediction optimization
targets the world modeling, which learns to predict the next state resulting from the current state
and action. This dual optimization approach enhances the agent’s understanding of environment
dynamics, leading to better planning performance.
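Action Selection Given context $(g, a_{<t}, o_{<t})$, we optimize the probability of selecting
preferred reasoning-action pairs $(r_t^w, a_t^w)$ over rejected pairs $(r_t^l, a_t^l)$:
$$
\mathcal{L}_{\text{action}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(g, a_{<t}, o_{<t}, r_t^w, a_t^w, r_t^l, a_t^l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(r_t^w, a_t^w \mid g, a_{<t}, o_{<t})}{\pi_{\text{ref}}(r_t^w, a_t^w \mid g, a_{<t}, o_{<t})} - \beta \log \frac{\pi_\theta(r_t^l, a_t^l \mid g, a_{<t}, o_{<t})}{\pi_{\text{ref}}(r_t^l, a_t^l \mid g, a_{<t}, o_{<t})} \right) \right]. \tag{1}
$$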
State Prediction Given state-action pairs $(s_{t-1}, a_t)$, we optimize the prediction of preferred
outcome states $s_t^w$ after executing action $a_t$ over rejected states $s_t^l$. The states are
represented as descriptions that capture key object properties, spatial relationships, and agent
status (e.g., “the plate is on the table, and the agent is holding the cup”). This optimization
enables the model to learn the dynamic state changes induced by actions. Formally, the state
prediction objective is:
$$
\mathcal{L}_{\text{state}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(a_t, s_{t-1}, s_t^w, s_t^l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(s_t^w \mid s_{t-1}, a_t)}{\pi_{\text{ref}}(s_t^w \mid s_{t-1}, a_t)} - \beta \log \frac{\pi_\theta(s_t^l \mid s_{t-1}, a_t)}{\pi_{\text{ref}}(s_t^l \mid s_{t-1}, a_t)} \right) \right]. \tag{2}
$$
Finally, we combine both objectives in a joint optimization problem. The total loss is a weighted sum
of the action selection and state prediction losses, with the objective function defined as:
$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{action}}(\pi_\theta; \pi_{\text{ref}}) + \lambda \mathcal{L}_{\text{state}}(\pi_\theta; \pi_{\text{ref}}),
$$
where $\lambda$ is a hyperparameter controlling the balance between the two optimization objectives.
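The joint objective can be illustrated numerically with a small pure-Python sketch that takes sequence log-probabilities as inputs. The tuple layout `(logp_policy_chosen, logp_ref_chosen, logp_policy_rejected, logp_ref_rejected)` and the default `beta = 0.1` are assumptions, since no value of $\beta$ is stated in this section; a real implementation would compute these log-probabilities with the LVLM and differentiate through them.

```python
import math
from typing import List, Tuple

# (log pi_theta(y_w|x), log pi_ref(y_w|x), log pi_theta(y_l|x), log pi_ref(y_l|x))
Example = Tuple[float, float, float, float]


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pi_w: float, ref_w: float, pi_l: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Per-example preference loss, the term inside Eq. (1) and Eq. (2)."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -log_sigmoid(margin)


def d2po_loss(action_batch: List[Example], state_batch: List[Example],
              lam: float = 1.0, beta: float = 0.1) -> float:
    """L_total = L_action + lambda * L_state, using batch means."""
    l_action = sum(dpo_loss(*ex, beta=beta) for ex in action_batch) / len(action_batch)
    l_state = sum(dpo_loss(*ex, beta=beta) for ex in state_batch) / len(state_batch)
    return l_action + lam * l_state
```

When the policy equals the reference model, every margin is zero and each term equals $\log 2$; the loss falls below that value once the policy raises the chosen response relative to the rejected one.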
4 Experiment
4.1 Experimental Settings
4.1.1 VoTa-Bench
Dataset Our evaluation is based on the LoTa-Bench (Choi et al., 2024), which leverages the AI2-
THOR (Kolve et al., 2017) simulation environment and repurposes data from ALFRED (Shridhar
et al., 2019). Unlike ALFRED, which provides both task- and step-level instructions for translating
detailed step-by-step guidance into robot actions, LoTa-Bench focuses on high-level task planning
using only task-level instructions.
In this work, we extend LoTa-Bench to create a new multimodal benchmark, VoTa-Bench, to better
support LVLMs. (1) Unlike LoTa-Bench, which relies on textual descriptions, VoTa-Bench
incorporates egocentric visual information as both the initial state and the observation after each
operation, requiring the model to effectively process visual inputs. (2) For evaluation, we do not
rely on executable skills and logits computation; instead, we adopt an open-domain generation
approach, which may result in the model generating non-executable skills. (3) The original dataset’s
environments were the same as the training environments (seen scenes). We expanded the dataset by
adding new unseen environments to test the model’s generalization, resulting in 549 seen test samples
and 646 unseen test samples, covering 108 objects and 120 scenes. More details are in Appendix A.
4.1.2 Baselines
Our evaluation includes the zero-shot performance of several leading LVLMs, such as GPT-4o,
GPT-4o-mini (OpenAI, 2024), Gemini-1.5-Pro (Team, 2024), Qwen2-VL-72B (Wang et al., 2024b)
and LLaVA-1.6-34B (Liu et al., 2024).
Additionally, we validate our approach on Qwen2-VL-7B (Wang et al., 2024b), LLaVA-1.6-7B (Liu
et al., 2024), and Llama-3.2-Vision-11B (Meta, 2024). The compared methods are as follows: (1)
In-Context Learning: We provide 5-shot examples to prompt the model for generation. (2) SFT:
We fine-tune the models using our collected dataset. (3) DPO: We optimize the models using our
collected action selection data. Notably, the DPO data is collected by us and focuses solely on action
selection optimization, serving as an ablation of our D2PO method. (4) D2PO (Ours): We propose a
dual preference optimization approach, leveraging both action selection and state prediction data for
enhanced performance.
4.1.3 Evaluation Metrics
Success Rate (SR) The Success Rate (SR) measures task completion by verifying if the final state
of the environment, including object states and positions, satisfies the task’s goal conditions. For
example, in the task “Place a cold apple on the dinner table,” success is achieved only if the apple is
chilled and located on the dinner table.
Path-Length Weighted Success Rate (PL) We use the Path-Length Weighted Success Rate
(PL) (Shridhar et al., 2019) to evaluate efficiency, which adjusts SR by comparing the model’s step
sequence length to the expert demonstration. The PL score is calculated as
$\text{PL} = \text{SR} \times \frac{L^*}{\max(L^*, \hat{L})}$,
where $L^*$ is the expert’s trajectory length and $\hat{L}$ is the model’s trajectory length. This
penalizes models that take longer than the expert, ensuring both task success and efficiency are
considered. For instance, a model that takes twice as long as the expert receives half the credit.
Table 1: Performance of D²PO and baselines on VoTa-Bench (Seen). Bold values indicate the highest
performance within the same model, and our method (D²PO), including its ablation (DPO), are
highlighted in green.
Model | Examine&Light SR | Examine&Light PL | Pick&Place SR | Pick&Place PL | Stack&Place SR | Stack&Place PL | Clean&Place SR | Clean&Place PL | Heat&Place SR | Heat&Place PL | Cool&Place SR | Cool&Place PL | Overall SR | Overall PL
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
GPT-4o | 33.33 | 23.37 | 51.19 | 36.27 | 0.00 | 0.00 | 0.00 | 0.00 | 8.41 | 6.55 | 2.38 | 2.02 | 14.39 | 10.37
+ ICL | 41.67 | 30.60 | 64.29 | 45.95 | 4.17 | 1.31 | 1.79 | 1.79 | 24.30 | 23.81 | 11.90 | 11.39 | 23.50 | 18.78
GPT-4o-mini | 22.22 | 10.88 | 14.29 | 8.16 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 5.10 | 2.68
Gemini-1.5-pro | 34.72 | 29.38 | 27.38 | 12.07 | 0.00 | 0.00 | 0.00 | 0.00 | 7.48 | 7.37 | 3.17 | 1.72 | 10.93 | 6.81
Qwen2-VL (72B) | 34.72 | 21.62 | 39.29 | 21.81 | 0.00 | 0.00 | 0.00 | 0.00 | 3.97 | 3.47 | 0.79 | 0.56 | 11.66 | 7.10
LLaVA-1.6 (34B) | 12.50 | 2.09 | 7.14 | 2.67 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.73 | 0.68
Qwen2-VL (7B) | 26.39 | 8.55 | 14.29 | 8.22 | 2.08 | 0.60 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 5.83 | 2.46
+ ICL | 25.00 | 9.25 | 21.43 | 12.29 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 6.56 | 3.14
+ SFT | 70.83 | 55.24 | 69.05 | 57.74 | 6.25 | 5.38 | 26.79 | 26.04 | 58.88 | 58.34 | 31.75 | 31.11 | 44.63 | 40.33
+ DPO | 72.22 | 56.67 | 80.95 | 66.30 | 10.42 | 8.47 | 44.64 | 44.64 | 60.75 | 60.75 | 44.44 | 44.04 | 53.92 | 49.37
+ D²PO | 84.72 | 66.67 | 84.52 | 71.27 | 12.50 | 10.23 | 48.21 | 48.21 | 66.36 | 66.36 | 44.44 | 44.33 | 58.11 | 53.33
LLaVA-1.6 (7B) | 4.17 | 0.67 | 7.14 | 1.14 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.64 | 0.26
+ ICL | 1.39 | 0.22 | 4.76 | 0.76 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.91 | 0.15
+ SFT | 56.94 | 45.37 | 63.10 | 51.65 | 12.50 | 9.81 | 31.25 | 31.18 | 50.47 | 50.08 | 30.16 | 29.34 | 41.35 | 37.56
+ DPO | 66.67 | 45.77 | 72.62 | 59.17 | 20.83 | 18.20 | 44.64 | 44.64 | 44.86 | 44.86 | 43.65 | 43.07 | 49.54 | 44.38
+ D²PO | 69.44 | 52.60 | 78.57 | 65.48 | 22.92 | 19.60 | 47.32 | 47.32 | 60.75 | 60.41 | 44.44 | 44.33 | 54.83 | 50.23
LLaMA-3.2 (11B) | 12.50 | 2.00 | 4.76 | 0.86 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.37 | 0.39
+ ICL | 8.33 | 1.33 | 3.57 | 0.57 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.64 | 0.26
+ SFT | 58.33 | 44.13 | 72.62 | 47.04 | 8.33 | 6.69 | 30.36 | 26.03 | 46.73 | 46.73 | 35.71 | 31.98 | 42.99 | 35.33
+ DPO | 76.39 | 59.31 | 78.57 | 62.61 | 12.50 | 9.97 | 29.46 | 25.47 | 43.93 | 43.35 | 36.51 | 34.24 | 46.08 | 39.73
+ D²PO | 76.39 | 59.63 | 88.10 | 71.32 | 14.58 | 12.19 | 38.39 | 32.97 | 48.60 | 48.26 | 39.68 | 38.80 | 51.18 | 44.84
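The PL definition in Section 4.1.3 can be computed per episode and then averaged, as in the sketch below. The function names and the `(success, expert_len, model_len)` tuple layout are illustrative choices, and averaging per-episode scores is an assumption about how the benchmark-level number is aggregated.

```python
from typing import Iterable, Tuple


def path_length_weighted_success(success: bool, expert_len: int,
                                 model_len: int) -> float:
    """Per-episode PL = success * L* / max(L*, L_hat)."""
    return float(success) * expert_len / max(expert_len, model_len)


def mean_pl(episodes: Iterable[Tuple[bool, int, int]]) -> float:
    """Average PL over (success, expert_len, model_len) tuples."""
    scores = [path_length_weighted_success(*ep) for ep in episodes]
    return sum(scores) / len(scores)
```

A successful episode that takes 10 steps against a 5-step expert trajectory scores 0.5, a successful episode that is no longer than the expert scores 1.0, and a failed episode scores 0 regardless of length.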
4.1.4 Implementation Details
For the models Qwen2-VL-7B, LLaVA-1.6-7B, and Llama-3.2-Vision-11B, we adopt the same
training protocol. We use full-parameter tuning, first performing SFT for 3 epochs with a learning
rate of 3e−5 and a batch size of 32. Following SFT, we conduct D²PO for 1 epoch, with a learning
rate of 5e−7 and a batch size of 32. In the D²PO loss function, we set the balancing parameter λ = 1
to weigh the contributions of action selection and state prediction equally. The DPO implementation
is kept identical to the D²PO setup. Our training data consists of 4.5k SFT samples and 15k DPO
samples. Because LVLMs take images as input and generate text, we use images as state inputs and
text descriptions as outputs for state prediction. The maximum number of steps is set to 25 and the
temperature is set to 0 during evaluation.
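For reference, the hyperparameters listed above can be collected into a small configuration sketch. The class and constant names are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    epochs: int
    learning_rate: float
    batch_size: int


# Values as reported in Section 4.1.4; all names are illustrative.
SFT_STAGE = StageConfig(epochs=3, learning_rate=3e-5, batch_size=32)
D2PO_STAGE = StageConfig(epochs=1, learning_rate=5e-7, batch_size=32)
LAMBDA_STATE = 1.0        # weight of the state prediction loss
MAX_EPISODE_STEPS = 25    # evaluation step budget
EVAL_TEMPERATURE = 0.0    # greedy decoding during evaluation
```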
4.2 Main Results
Our experimental results highlight the substantial advantages of the Dual Preference Optimization
(D2PO) framework over existing baselines. Results are shown in Table 1, and we summarize the key
findings as follows:
World Modeling Enhances Planning Performance: The consistent superiority of D²PO over
standard DPO (average relative SR improvement of 9.84% across models) validates our core
hypothesis: incorporating world modeling objectives significantly enhances the model’s planning
capabilities.
Learning from Mistakes: The performance gains of DPO and D²PO over SFT (average relative
improvements of 15.95% and 27.29% in SR across models) underscore the value of learning from
both successful and unsuccessful exploration. While SFT relies solely on successful trajectories,
DPO and D²PO additionally utilize suboptimal or failed attempts, enabling the model to learn not
just what to do but also what not to do. This mirrors human learning, where mistakes often provide
critical insights into task dynamics and constraints.
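The averages quoted in these findings are relative improvements in overall SR, averaged over the three 7B/11B backbones. The short check below recomputes them from the Overall SR column of Table 1; the dictionary and function names are illustrative.

```python
# Overall SR on VoTa-Bench (Seen), copied from Table 1.
OVERALL_SR_SEEN = {
    "Qwen2-VL (7B)": {"SFT": 44.63, "DPO": 53.92, "D2PO": 58.11},
    "LLaVA-1.6 (7B)": {"SFT": 41.35, "DPO": 49.54, "D2PO": 54.83},
    "LLaMA-3.2 (11B)": {"SFT": 42.99, "DPO": 46.08, "D2PO": 51.18},
}


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of new over base, in percent."""
    return 100.0 * (new - base) / base


def mean_relative_gain(table: dict, new_key: str, base_key: str) -> float:
    """Average the per-model relative gains."""
    gains = [relative_gain(row[new_key], row[base_key]) for row in table.values()]
    return sum(gains) / len(gains)
```

Evaluating `mean_relative_gain` for D²PO over DPO, DPO over SFT, and D²PO over SFT reproduces the reported 9.84%, 15.95%, and 27.29% to within a few hundredths of a percentage point.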
Table 2: Generalization performance on VoTa-Bench (Unseen). Bold values indicate the highest
performance within the same model, and our method (D²PO), including its ablation (DPO), are
highlighted in green.
Model | Examine&Light SR | Examine&Light PL | Pick&Place SR | Pick&Place PL | Stack&Place SR | Stack&Place PL | Clean&Place SR | Clean&Place PL | Heat&Place SR | Heat&Place PL | Cool&Place SR | Cool&Place PL | Overall SR | Overall PL
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Qwen2-VL (7B) | 25.53 | 9.34 | 15.79 | 9.58 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 7.43 | 3.18
+ ICL | 26.95 | 12.20 | 3.95 | 1.69 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 6.35 | 2.86
+ SFT | 68.79 | 56.93 | 52.63 | 44.46 | 4.29 | 2.61 | 43.36 | 43.37 | 62.50 | 62.29 | 49.54 | 47.38 | 50.77 | 46.70
+ DPO | 73.76 | 60.17 | 53.95 | 46.95 | 7.14 | 5.15 | 52.21 | 52.21 | 66.18 | 66.18 | 66.97 | 66.97 | 57.59 | 53.65
+ D²PO | 77.30 | 62.67 | 56.58 | 49.56 | 11.43 | 8.66 | 55.75 | 55.75 | 72.79 | 72.79 | 68.81 | 68.51 | 61.46 | 57.16
LLaVA-1.6 (7B) | 4.26 | 0.77 | 6.58 | 1.14 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.70 | 0.30
+ ICL | 2.84 | 0.45 | 2.63 | 1.07 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.93 | 0.23
+ SFT | 64.54 | 52.41 | 57.89 | 51.39 | 4.29 | 3.00 | 42.48 | 41.61 | 56.62 | 56.16 | 44.04 | 43.51 | 48.14 | 44.33
+ DPO | 75.89 | 51.53 | 60.53 | 45.25 | 7.14 | 4.62 | 56.64 | 56.21 | 65.44 | 64.61 | 63.30 | 63.12 | 58.82 | 51.23
+ D²PO | 77.30 | 58.98 | 60.53 | 49.30 | 14.29 | 10.38 | 60.18 | 60.18 | 69.12 | 68.90 | 65.14 | 64.46 | 61.61 | 55.78
LLaMA-3.2 (11B) | 12.06 | 2.10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.63 | 0.46
+ ICL | 9.22 | 1.48 | 5.26 | 0.83 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.63 | 0.42
+ SFT | 70.92 | 58.75 | 53.95 | 46.25 | 7.14 | 4.61 | 51.33 | 50.02 | 47.06 | 46.85 | 52.29 | 50.81 | 50.31 | 46.02
+ DPO | 74.47 | 61.40 | 64.47 | 54.16 | 7.14 | 5.63 | 45.13 | 43.76 | 51.47 | 50.33 | 53.21 | 51.41 | 52.32 | 47.39
+ D²PO | 82.27 | 66.47 | 64.47 | 55.34 | 7.14 | 5.69 | 53.10 | 51.52 | 58.09 | 57.59 | 57.80 | 55.79 | 57.59 | 52.27
Surpassing the Process Reward Model through Environment Exploration: With D²PO, the 7B model
Qwen2-VL-7B outperforms GPT-4o (only 14.39% SR) by 43.72 points in SR,
despite GPT-4o serving as the process reward model. This reveals how our framework effectively
combines process guidance from larger models with environmental feedback to develop superior
planning capabilities, even when the process reward model’s direct performance on the task is limited.
Efficiency Gains from World Model Understanding: The improved path-length weighted success
rate (PL) across all tasks (average relative improvement of 11.35% in overall PL over DPO across
models) indicates that our model develops physics-aware planning capabilities. Moreover, in some
tasks where DPO and D²PO achieve similar SR, D²PO attains a higher PL, showing more efficient
action sequencing through anticipated state transitions.
4.3 Generalization: Unseen Scene
We further evaluated the generalization capabilities of our model by testing it on unseen scenes
that were not part of the training environment. As shown in Table 2, we observe that our method
consistently outperforms baseline methods in both success rate (SR) and path-length weighted success
rate (PL), with average relative improvements of 7.17% and 8.58% respectively across different
models compared to DPO. These results demonstrate that incorporating world modeling objectives
enhances the model’s planning capabilities and generalization to novel environments.
5 Further Analysis
5.1 Data Scale
To investigate the impact of the data scale on performance, we varied the SFT data from 2K to 15K
samples (with corresponding DPO data from 6K to 50K). Using Qwen2-VL-7B as the backbone
model, our results in Figure 3a show that D2PO consistently outperforms baselines across all data
scales, achieving an average improvement of 5-15% in success rate (SR) over SFT.
As the data size increases, we observed a non-monotonic trend in the performance of D2PO: initial
improvements followed by plateauing or slight decline at larger scales. This phenomenon likely stems
from the shared source with SFT data, where simply increasing DPO data may lead to overfitting.
This highlights the importance of data quality and diversity for model generalization.
[Figure 3 plots: (a) SR versus SFT/DPO data size (2k/6k, 4.5k/15k, 10k/30k, 15k/50k) for SFT, DPO, and D²PO; (b) SR for Qwen-2B, Qwen-7B, Qwen-72B, LLaVA-7B, and LLaVA-13B under SFT, DPO, and D²PO, annotated in plotted order with +23.2%, +30.2%, +31.5%, +32.6%, and +30.0%.]
(a) Impact of data scale on performance (SR). (b) Impact of model scale on performance (SR).
Figure 3: Analysis of data scale and model scale.
5.2 Model Scale
We further examined the effect of model scale on performance by conducting experiments with
models of varying sizes, ranging from 2B to 72B parameters. As shown in Figure 3b, performance
improves as the model scale increases. Notably, D2PO consistently outperforms SFT across all model
sizes, with both methods benefiting from larger model capacities. On the largest models (Qwen 72B
and LLaVA 13B), D²PO achieves approximately 30% improvement in SR over baselines.
5.3 Action-conditioned vs. Goal-directed World Modeling
[Figure 4 plot: SR for LLaMA-11B, LLaVA-7B, and Qwen-7B in two panels (Seen, Unseen); series: Goal-directed, Action-conditioned.]
Figure 4: Success rates (SR) of action-conditioned and goal-directed world models across seen and
unseen scenarios.
Inspired by recent advances in video prediction (Ren et al., 2025) that demonstrate the potential
of learning world dynamics without explicit actions, we investigate two distinct approaches
to world modeling. The conventional action-conditioned world model learns to predict the
next state based on the current state and action ($\pi(s_t \mid s_{t-1}, a_t)$), while the
goal-directed world model directly imagines future states from the history $h_{t-1}$ and goal
conditions ($\pi(s_t \mid g, h_{t-1})$). Our empirical analysis in Figure 4 reveals that
while the action-conditioned model achieves a higher success rate on seen scenarios, the
goal-directed model demonstrates superior generalization to unseen scenarios. This suggests a
fundamental trade-off: explicit action supervision helps anchor predictions in familiar contexts,
but removing such constraints enhances the model’s imaginative capacity, leading to more flexible
dynamics learning that better generalizes to novel situations.
5.4 Error Analysis
We classify error types by comparing standard trajectories with erroneous ones, noting that a single
trajectory may contain multiple types of errors simultaneously. Through analyzing error cases of
Qwen2-VL-7B in seen scenarios, Table 3 shows that our method reduces dependency errors (212 → 141),
affordance errors (144 → 128), and inefficient errors (141 → 78) relative to SFT. Details are
provided in Appendix C.
Table 3: Distribution of error types across different methods.
                     SFT    DPO    D2PO
Dependency Error     212    157    141
Affordance Error     144    141    128
Inefficient Error    141     93     78
Others                20     16     17
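To put the counts in Table 3 on a common scale, the relative reduction of each error type from SFT to D2PO can be computed directly from the table. The sketch below is illustrative only and is not part of the original analysis; the helper name `relative_reduction` is hypothetical, and the counts are copied from Table 3.

```python
# Error counts copied from Table 3 (Qwen2-VL-7B, seen scenarios).
error_counts = {
    "Dependency Error": {"SFT": 212, "DPO": 157, "D2PO": 141},
    "Affordance Error": {"SFT": 144, "DPO": 141, "D2PO": 128},
    "Inefficient Error": {"SFT": 141, "DPO": 93, "D2PO": 78},
    "Others": {"SFT": 20, "DPO": 16, "D2PO": 17},
}


def relative_reduction(baseline: int, ours: int) -> float:
    """Percentage drop in error count relative to the baseline count."""
    return (baseline - ours) / baseline * 100.0


# Relative reduction of D2PO over SFT, rounded to one decimal place.
reductions = {
    name: round(relative_reduction(counts["SFT"], counts["D2PO"]), 1)
    for name, counts in error_counts.items()
}

for name, value in reductions.items():
    print(f"{name}: {value}% fewer errors than SFT")
```

Under this calculation the largest relative drops are in inefficient errors (about 44.7%) and dependency errors (about 33.5%), with a smaller drop in affordance errors (about 11.1%).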
5.5 Case Study
To better understand our approach’s advantages in handling dependency constraints and efficiency, we
present a detailed analysis of representative cases in Appendix D. Our case studies demonstrate how
D²PO consistently produces more coherent action sequences by properly respecting dependencies
between actions and generating more efficient plans compared to SFT baselines.
6 Conclusion
Embodied task planning requires AI systems to understand environment dynamics for effective
physical interactions, yet existing approaches primarily focus on direct state-to-action mapping
without considering action consequences. In this paper, we propose learning world modeling to
enhance the model’s planning capability through Dual Preference Optimization (D2PO),
a new framework that jointly optimizes state prediction and action selection through preference
learning. To automatically construct stepwise preference data for training, we also introduce a
tree search mechanism, enabling systematic exploration and embodied experience accumulation in
simulated environments. Extensive experiments on our proposed VoTa-Bench demonstrate that our
7B parameter model significantly outperforms existing approaches, including GPT-4o, across various
evaluation metrics. These results validate that incorporating world modeling helps the model better
understand environment dynamics, leading to improved planning capabilities.
Limitations
Sim-to-Real Gap Similar to other work in embodied task planning, our current training and evaluation
are conducted in the AI2-THOR simulation environment, which may not fully capture the complexity
and uncertainty of real-world scenarios and may lead to a sim-to-real gap. Nevertheless, our
learning algorithm is designed to be environment-agnostic and independent of simulation metadata,
enabling potential deployment and optimization in real-world settings. Additionally, existing research
efforts are actively exploring methods to bridge this gap, which could further facilitate real-world
applications.
Data Collection Efficiency Given the current limitations in multimodal language models’ critique
capabilities (Chen et al., 2024a), our data collection pipeline utilizes GPT-4o as the judge model for
process rewarding, which requires additional computational resources. As vision-language models
continue to advance rapidly, and with future exploration of embodied self-rewarding mechanisms, we
believe these computational costs will be significantly reduced, making the framework more scalable
for practical applications.
Ethics Statement
Our research aims to develop robots that serve as assistive tools to augment human capabilities in daily
tasks rather than replacing human workers, creating new opportunities for human-AI collaboration in
household scenarios. To ensure responsible development and prioritize user safety, we advocate for
implementing comprehensive safety protocols and monitoring mechanisms before deploying similar
systems in real-world environments, particularly when handling potentially hazardous appliances.
References
Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, and Pierre-Yves
Oudeyer. Grounding large language models in interactive environments with online reinforcement
learning. ArXiv , abs/2302.02662, 2023. URL https://api.semanticscholar.org/CorpusID:
256615643 .
Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang,
Pan Zhou, Yao Wan, and Lichao Sun. Mllm-as-a-judge: Assessing multimodal llm-as-a-judge
with vision-language benchmark. In International Conference on Machine Learning , 2024a. URL
https://api.semanticscholar.org/CorpusID:267523079 .
Yaran Chen, Wenbo Cui, Yuanwen Chen, Mining Tan, Xinyao Zhang, Dongbin Zhao, and He Wang.
Robogpt: an intelligent agent of making embodied long-term decisions for daily instruction tasks,
2024b. URL https://arxiv.org/abs/2311.15649 .
Jae-Woo Choi, Youngwoo Yoon, Hyobin Ong, Jaehong Kim, and Minsu Jang. Lota-bench: Bench-
marking language-oriented task planners for embodied agents. ArXiv , abs/2402.08178, 2024. URL
https://api.semanticscholar.org/CorpusID:267636765 .
Nicola Dainese, Matteo Merler, Minttu Alakuijala, and Pekka Marttinen. Generating code world
models with large language models guided by monte carlo tree search. ArXiv , abs/2405.15383,
2024. URL https://api.semanticscholar.org/CorpusID:270045176 .
DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.
2025. URL https://api.semanticscholar.org/CorpusID:275789950 .
Jinlan Fu, Shenzhen Huangfu, Hao Fei, Xiaoyu Shen, Bryan Hooi, Xipeng Qiu, and See-Kiong
Ng. Chip: Cross-modal hierarchical direct preference optimization for multimodal llms. ArXiv ,
abs/2501.16629, 2025. URL https://api.semanticscholar.org/CorpusID:275932245 .
L. Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. Leveraging pre-
trained large language models to construct and utilize world models for model-based task plan-
ning. ArXiv , abs/2305.14909, 2023. URL https://api.semanticscholar.org/CorpusID:
258865907 .
Danijar Hafner, Timothy P. Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control:
Learning behaviors by latent imagination. ArXiv , abs/1912.01603, 2019. URL https://api.
semanticscholar.org/CorpusID:208547755 .
Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with
discrete world models. ArXiv , abs/2010.02193, 2020. URL https://api.semanticscholar.
org/CorpusID:222133157 .
Danijar Hafner, J. Pasukonis, Jimmy Ba, and Timothy P. Lillicrap. Mastering diverse domains
through world models. ArXiv , abs/2301.04104, 2023. URL https://api.semanticscholar.
org/CorpusID:255569874 .
Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu.
Reasoning with language model is planning with world model. ArXiv , abs/2305.14992, 2023. URL
https://api.semanticscholar.org/CorpusID:258865812 .
Yuki Inoue and Hiroki Ohashi. Prompter: Utilizing large language model prompting for a data
efficient embodied instruction following. ArXiv , abs/2211.03267, 2022. URL https://api.
semanticscholar.org/CorpusID:253383940 .
Chuhao Jin, Wenhui Tan, Jiange Yang, Bei Liu, Ruihua Song, Limin Wang, and Jianlong Fu.
Alphablock: Embodied finetuning for vision-language reasoning in robot manipulation. ArXiv ,
abs/2305.18898, 2023. URL https://api.semanticscholar.org/CorpusID:258967880 .
Philip Nicholas Johnson-Laird. Mental models: Towards a cognitive science of language, inference,
and consciousness . Number 6. Harvard University Press, 1983.
Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt
Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, Aniruddha Kembhavi, Abhinav Kumar Gupta,
and Ali Farhadi. Ai2-thor: An interactive 3d environment for visual ai. ArXiv , abs/1712.05474,
2017. URL https://api.semanticscholar.org/CorpusID:28328610 .
Yann LeCun. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open
Review , 62(1):1–62, 2022.
Boyu Li, Haobin Jiang, Ziluo Ding, Xinrun Xu, Haoran Li, Dongbin Zhao, and Zongqing Lu. Selu:
Self-learning embodied mllms in unknown environments. ArXiv , abs/2410.03303, 2024. URL
https://api.semanticscholar.org/CorpusID:273162831 .
Jacky Liang, Wenlong Huang, F. Xia, Peng Xu, Karol Hausman, Brian Ichter, Peter R. Florence,
and Andy Zeng. Code as policies: Language model programs for embodied control. 2023
IEEE International Conference on Robotics and Automation (ICRA) , pp. 9493–9500, 2022. URL
https://api.semanticscholar.org/CorpusID:252355542 .
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-
next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.
github.io/blog/2024-01-30-llava-next/ .
Guanxing Lu, Ziwei Wang, Changliu Liu, Jiwen Lu, and Yansong Tang. Thinkbot: Embodied
instruction following with thought chain reasoning. ArXiv , abs/2312.07062, 2023. URL https:
//api.semanticscholar.org/CorpusID:266174229 .
Jinjie Mai, Jun Chen, Bing chuan Li, Guocheng Qian, Mohamed Elhoseiny, and Bernard Ghanem.
Llm as a robotic brain: Unifying egocentric memory and control. ArXiv , abs/2304.09349, 2023.
URL https://api.semanticscholar.org/CorpusID:258212642 .
AI Meta. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models. Meta AI
Blog. Retrieved December , 20:2024, 2024.
OpenAI. Gpt-4o system card, 2024. URL https://arxiv.org/abs/2410.21276 .
Alexander Pashevich, Cordelia Schmid, and Chen Sun. Episodic transformer for vision-and-language
navigation. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) , pp. 15922–
15932, 2021. URL https://api.semanticscholar.org/CorpusID:234482879 .
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea
Finn. Direct preference optimization: Your language model is secretly a reward model. ArXiv ,
abs/2305.18290, 2023. URL https://api.semanticscholar.org/CorpusID:258959321 .
Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, and Xiaojie Jin.
Videoworld: Exploring knowledge learning from unlabeled videos, 2025. URL https://arxiv.
org/abs/2501.09781 .
Suyeon Shin, Sujin Jeon, Junghyun Kim, Gi-Cheon Kang, and Byoung-Tak Zhang. Socratic planner:
Inquiry-based zero-shot planning for embodied instruction following. ArXiv , abs/2404.15190,
2024. URL https://api.semanticscholar.org/CorpusID:269302975 .
Keisuke Shirai, Cristian Camilo Beltran-Hernandez, Masashi Hamaya, Atsushi Hashimoto, Shohei
Tanaka, Kento Kawaharazuka, Kazutoshi Tanaka, Yoshitaka Ushiku, and Shinsuke Mori. Vision-
language interpreter for robot task planning. 2024 IEEE International Conference on Robotics
and Automation (ICRA) , pp. 2051–2058, 2023. URL https://api.semanticscholar.org/
CorpusID:264935138 .
Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi,
Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for
everyday tasks. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) ,
pp. 10737–10746, 2019. URL https://api.semanticscholar.org/CorpusID:208617407 .
Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter
Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using
large language models. 2023 IEEE International Conference on Robotics and Automation (ICRA) ,
pp. 11523–11530, 2022. URL https://api.semanticscholar.org/CorpusID:252519594 .
Chan Hee Song, Jiaman Wu, Clay Washington, Brian M. Sadler, Wei-Lun Chao, and Yu Su. Llm-
planner: Few-shot grounded planning for embodied agents with large language models. 2023
IEEE/CVF International Conference on Computer Vision (ICCV) , pp. 2986–2997, 2022. URL
https://api.semanticscholar.org/CorpusID:254408960 .
Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. Trial and error:
Exploration-based trajectory optimization for llm agents. ArXiv , abs/2403.02502, 2024. URL
https://api.semanticscholar.org/CorpusID:268249221 .
Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. Adaplanner: Adaptive
planning from feedback with language models. ArXiv , abs/2305.16653, 2023. URL https:
//api.semanticscholar.org/CorpusID:258947337 .
Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. SIGART
Bull. , 2:160–163, 1990. URL https://api.semanticscholar.org/CorpusID:207162288 .
Andrew Szot, Max Schwarzer, Harsh Agrawal, Bogdan Mazoure, Walter Talbott, Katherine Metcalf,
Natalie Mackraz, Devon Hjelm, and Alexander Toshev. Large language models as generalizable
policies for embodied tasks. ArXiv , abs/2310.17722, 2023. URL https://api.semanticscholar.
org/CorpusID:264555578 .
Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of
context. ArXiv , abs/2403.05530, 2024. URL https://api.semanticscholar.org/CorpusID:
268297180 .
Edward C Tolman. Cognitive maps in rats and men. Psychological review , 55(4):189, 1948.
Fei Wang, Wenxuan Zhou, James Y. Huang, Nan Xu, Sheng Zhang, Hoifung Poon, and Muhao
Chen. mdpo: Conditional preference optimization for multimodal large language models. ArXiv ,
abs/2406.11839, 2024a. URL https://api.semanticscholar.org/CorpusID:270560448 .
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Ke-Yang Chen, Xuejing
Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men,
Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-
language model’s perception of the world at any resolution. ArXiv , abs/2409.12191, 2024b. URL
https://api.semanticscholar.org/CorpusID:272704132 .
Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu,
Xizhou Zhu, Lewei Lu, Yu Qiao, and Jifeng Dai. Enhancing the reasoning ability of multimodal
large language models via mixed preference optimization. ArXiv , abs/2411.10442, 2024c. URL
https://api.semanticscholar.org/CorpusID:274117026 .
Zidan Wang, Rui Shen, and Bradly C. Stadie. Wonderful team: Zero-shot physical task planning
with visual llms. 2024d. URL https://api.semanticscholar.org/CorpusID:271533474 .
Philipp Wu, Alejandro Escontrela, Danijar Hafner, Ken Goldberg, and P. Abbeel. Daydreamer:
World models for physical robot learning. In Conference on Robot Learning , 2022. URL https:
//api.semanticscholar.org/CorpusID:250088882 .
Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, and Haibin Yan. Embodied task planning with large
language models. ArXiv , abs/2307.01848, 2023. URL https://api.semanticscholar.org/
CorpusID:259342896 .
Yuxi Xie, Guanzhen Li, Xiao Xu, and Min-Yen Kan. V-dpo: Mitigating hallucination in large vision
language models via vision-guided direct preference optimization. In Conference on Empirical
Methods in Natural Language Processing , 2024. URL https://api.semanticscholar.org/
CorpusID:273821696 .
Jingkang Yang, Yuhao Dong, Shuai Liu, Bo Li, Ziyue Wang, Chencheng Jiang, Haoran Tan, Jiamu
Kang, Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. Octopus: Embodied vision-language
programmer from environmental feedback. In European Conference on Computer Vision , 2023.
URL https://api.semanticscholar.org/CorpusID:263909250 .
Yuxiao Yang, Shenao Zhang, Zhihan Liu, Huaxiu Yao, and Zhaoran Wang. Hindsight planner: A
closed-loop few-shot planner for embodied instruction following. ArXiv , abs/2412.19562, 2024.
URL https://api.semanticscholar.org/CorpusID:275119585 .
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao.
React: Synergizing reasoning and acting in language models. ArXiv , abs/2210.03629, 2022. URL
https://api.semanticscholar.org/CorpusID:252762395 .
Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu,
Hai-Tao Zheng, Maosong Sun, and Tat-Seng Chua. Rlhf-v: Towards trustworthy mllms via
behavior alignment from fine-grained correctional human feedback. 2024 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR) , pp. 13807–13816, 2023. URL https:
//api.semanticscholar.org/CorpusID:265608723 .
Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin
Shu, and Chuang Gan. Building cooperative embodied agents modularly with large language
models. ArXiv , abs/2307.02485, 2023. URL https://api.semanticscholar.org/CorpusID:
259342833 .
Qi Zhao, Haotian Fu, Chen Sun, and George Dimitri Konidaris. Epo: Hierarchical llm agents
with environment preference optimization. ArXiv , abs/2408.16090, 2024. URL https://api.
semanticscholar.org/CorpusID:272146208 .
Zirui Zhao, Wee Sun Lee, and David Hsu. Large language models as commonsense knowledge for
large-scale task planning. ArXiv , abs/2305.14078, 2023. URL https://api.semanticscholar.
org/CorpusID:258841057 .
Siyu Zhou, Tianyi Zhou, Yijun Yang, Guodong Long, Deheng Ye, Jing Jiang, and Chengqi Zhang.
Wall-e: World alignment by rule learning improves world model-based llm agents. ArXiv ,
abs/2410.07484, 2024. URL https://api.semanticscholar.org/CorpusID:273233468 .
A VoTa-Bench
A.1 Task Formulation and Comparison
Task Formulation VoTa-Bench is designed as a closed-loop task planning framework. For each
task sample, the framework consists of a natural language goal, an initial environment state detailing
object locations and states (which are used to initialize the simulator), and a goal condition specifying
the criteria for task completion.
The task execution follows an interactive closed-loop process. Initially, the model receives a goal
instruction along with an egocentric view of the environment state. Based on these inputs, the
model begins its planning process. At each step, the model plans only the next action, which is
then executed in the simulation environment. The environment provides feedback including both
the action execution status (success or failure) and an updated egocentric view of the new state. The
model incorporates this feedback to plan its next step. This interactive process continues until either
the model signals completion by outputting a “done” action or reaches the maximum allowed steps
(25).
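The closed-loop protocol described above can be sketched as a simple episode loop. This is a minimal illustration under stated assumptions, not the benchmark's actual code: `Feedback`, `run_episode`, `plan_next`, and `step` are hypothetical names, and the egocentric image is stood in for by a plain string.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

MAX_STEPS = 25  # maximum allowed steps stated in the text


@dataclass
class Feedback:
    success: bool      # action execution status (success or failure)
    observation: str   # stand-in for the updated egocentric view


# History entries pair each executed action with its execution status.
History = List[Tuple[str, bool]]


def run_episode(
    goal: str,
    initial_observation: str,
    plan_next: Callable[[str, str, History], str],  # hypothetical planner interface
    step: Callable[[str], Feedback],                # hypothetical simulator interface
    max_steps: int = MAX_STEPS,
) -> History:
    """Plan one action at a time until the model outputs "done" or the step cap is hit."""
    history: History = []
    observation = initial_observation
    for _ in range(max_steps):
        action = plan_next(goal, observation, history)
        if action == "done":
            break
        feedback = step(action)
        history.append((action, feedback.success))
        observation = feedback.observation
    return history
```

Whether the returned history counts as a success would then be decided by checking the task's goal condition against the final simulator state, which is outside this sketch.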
LoTa-Bench vs. ALFRED VoTa-Bench builds on LoTa-Bench. Although both LoTa-Bench and ALFRED
are based on the AI2-THOR simulation environment, they represent different
approaches to embodied task evaluation. LoTa-Bench focuses specifically on assessing LLMs’
planning capabilities, providing a low-level controller to handle the execution of language actions in
the simulation environment. In contrast, ALFRED evaluates models’ overall performance, including
low-level action execution, without decoupling task success metrics. This distinction is particularly
relevant in modern hierarchical systems where LLMs serve as the embodied brain for task planning,
while separate action models handle low-level execution. LoTa-Bench effectively isolates and
measures the model’s planning ability specifically. Furthermore, LoTa-Bench implements more fine-
grained step decomposition, breaking tasks into simple, executable actions, compared to ALFRED’s
higher-level planning approach (Figure 5). Another key difference lies in the instruction format:
while ALFRED provides human-written step-by-step instructions to guide task planning, LoTa-Bench
presents a greater challenge by providing only goal instructions.
Task Type          Seen Num   Seen Avg Length   Unseen Num   Unseen Avg Length   Sample Instruction
Examine & Light     72         4.00             141           4.34               Examine a vase under a tall lamp
Pick & Place        84         4.46              77           5.70               Put pencil on bureau top
Stack & Place       48        10.60              70           8.49               Put a pot with a sponge in it in the sink.
Clean & Place      112        12.66             113          12.88               Put a cleaned washcloth away in a cabinet.
Heat & Place       107        18.35             136          17.38               To heat a potato slice and put it on the table by the spoon.
Cool & Place       126        15.48             109          14.48               Chill a knife and place a chilled slice of lettuce in a sink.
Total              549        11.85             646          10.90
Table 4: Distribution of task types in VoTa-Bench. The dataset is divided into seen and unseen
environments, with statistics showing the number of samples (Num) and average action sequence
length (Avg Length) for each task type. Example instructions are provided to illustrate typical tasks.
A.2 Data Statistics
A.2.1 Tasks
Following the design of LoTa-Bench, VoTa-Bench incorporates 6 task types: Examine & Light, Pick
& Place, Stack & Place, Clean & Place, Heat & Place, and Cool & Place. Compared to LoTa-Bench’s
208 samples, we expanded the dataset to 549 samples in seen environments and further added
646 samples in unseen environments. The average action sequence length varies across different
task types, ranging from 4.00 steps for simple examination tasks to 18.35 steps for more complex
operations like Heat & Place, with an overall average of 11.85 steps in seen environments and 10.90
steps in unseen environments. More details are shown in Table 4.
[Figure 5 panels: (a) ALFRED (high-level planning) (Shridhar et al., 2019); (b) LoTa-Bench (Choi et al., 2024); (c) VoTa-Bench (ours).]
Figure 5: Comparison of ALFRED, LoTa-Bench, and VoTa-Bench in the task “Place a cold tomato
in the sink”. (a) ALFRED emphasizes high-level task planning with human-written step-by-step
instructions, breaking the task into subgoals like “Cool Tomato” (step 4). (b) LoTa-Bench provides
only goal instructions and decomposes tasks into fine-grained low-level actions (e.g., “Open Fridge”,
“PutDown Tomato”, etc.; steps 4–10) but lacks guidance from visual input, relying on predefined
executable actions and choosing actions based on maximum logits to ensure they are valid in the
simulation. (c) VoTa-Bench extends LoTa-Bench by incorporating egocentric visual observations,
requiring models to generate open-domain actions based on visual information to handle both seen
and unseen environments.
[Figure 6 panels: (a) Seen Scenes; (b) Unseen Scenes.]
Figure 6: Examples of seen and unseen scenes.
A.2.2 Actions
Based on the AI2-THOR simulator, VoTa-Bench supports eight fundamental actions that can be
combined to accomplish the above tasks:
• Find(<object>): A navigation action that enables the agent to locate and approach a specific
object. The agent needs to identify and move to the target object’s location before any
interaction can occur.
• PickUp(<object>): Allows the agent to grasp and lift an object. The precondition is that the
agent must be within the interaction range of the object and not currently holding anything.
The effect is that the agent holds the specified object.
• PutDown(<object>): Places a held object onto the last visited receptacle. The agent must be
holding the object and within range of the receptacle.
• Open(<object>): Opens containers such as cabinets, drawers, or appliances. The agent must
be within the interaction range of the target object.
• Close(<object>): Closes previously opened containers. Similar to Open, requires the agent
to be within the interaction range.
• TurnOn(<object>): Activates objects like lights or appliances. The agent must be within the
interaction range of the target object.
• TurnOff(<object>): Deactivates previously turned on objects. Requires the agent to be
within interaction range.
• Slice(<object>): Allows the agent to cut or slice certain objects. The agent must be holding
an appropriate cutting tool and be within range of the target object.
Each action can only be executed when its preconditions are met, ensuring realistic interaction
sequences. For example, interaction actions like “PickUp” can only be executed when the distance
between the agent and the target object is within a predefined threshold. If the target object is not
within visual range, the agent needs to use the “Find” action first to locate and approach the object
before interaction.
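The preconditions listed above can be illustrated with a small symbolic check. This is a simplified sketch under several assumptions, not the simulator's logic: interaction range is reduced to a single `near` field (the object most recently reached with Find), the set of cutting tools is a guess, and `AgentState` and `can_execute` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

CUTTING_TOOLS = {"Knife"}  # assumption: which held objects count as cutting tools


@dataclass
class AgentState:
    near: Optional[str] = None     # object or receptacle the agent is within range of
    holding: Optional[str] = None  # object currently held, if any


def can_execute(action: str, obj: str, state: AgentState) -> bool:
    """Return True if the action's preconditions hold in this simplified state."""
    if action == "Find":
        # Navigation has no range precondition in this sketch.
        return True
    if action == "PickUp":
        return state.near == obj and state.holding is None
    if action == "PutDown":
        # The held object is placed on the last visited receptacle.
        return state.holding == obj and state.near is not None and state.near != obj
    if action in {"Open", "Close", "TurnOn", "TurnOff"}:
        return state.near == obj
    if action == "Slice":
        return state.near == obj and state.holding in CUTTING_TOOLS
    return False  # unknown action names are rejected


state = AgentState()
print(can_execute("PickUp", "Apple", state))  # the agent has not approached the apple yet
state.near = "Apple"
print(can_execute("PickUp", "Apple", state))  # in range and hands are free
```

A dependency error in the sense used in this paper corresponds to proposing an action for which a check of this kind fails, for example PickUp before Find.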
A.2.3 Scene
VoTa-Bench environments are based on the AI2-THOR simulation platform, covering four indoor
scenes: Kitchen, Living Room, Bedroom, and Bathroom. We extend LoTa-Bench by introducing
unseen scenes for testing generalization capability.
• Seen Scene: These household environments share identical layouts with the training set.
Object positions are randomly initialized according to pre-defined commonsense distributions
in AI2-THOR.
• Unseen Scene: These household environments feature different layouts from the training
set. Object positions are randomly initialized according to pre-defined commonsense
distributions in AI2-THOR.
Figure 6 shows examples of layouts in our seen and unseen environments.
A.3 License Statement
This work builds upon ALFRED (MIT License), AI2-THOR (Apache-2.0), and LoTa-Bench (CC BY
4.0). All modifications and derived work comply with their respective licenses.
B Details of Preference Data
B.1 Data Construction Details
Our task instructions are sampled from the ALFRED dataset’s training set. This process can be
automated through defining formal goal conditions (including object relationships like <object> on
<object> and object states like “heated”), which, combined with instruction generation capabilities of
large language models, enables automated construction of large-scale instruction-goal paired datasets.
We use Qwen2-VL-7B as the policy model for data collection with a temperature setting of 0.8,
and GPT-4o (temperature = 0) is utilized as the process reward model to assess action quality (0-5).
Environmental feasibility is determined through binary scoring (0/1), indicating whether an action
can be physically executed in the environment. To ensure balanced consideration of both aspects, we
normalize the environmental score to a 0-5 scale before averaging it with the semantic score.
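The score combination described above can be written out as a short calculation. This is a sketch under one reading of the text, namely that the binary environmental score is scaled to 0 or 5 and then averaged with the 0-5 semantic score with equal weight; the function name `hybrid_score` is hypothetical.

```python
def hybrid_score(semantic_score: float, executable: bool) -> float:
    """Average the 0-5 semantic score with the environmental score rescaled to 0-5."""
    if not 0.0 <= semantic_score <= 5.0:
        raise ValueError("semantic_score is expected on a 0-5 scale")
    env_score = 5.0 if executable else 0.0  # binary 0/1 feasibility normalized to 0/5
    return (semantic_score + env_score) / 2.0


# An executable action rated 4 by the judge, and a non-executable action rated 5.
print(hybrid_score(4, True))
print(hybrid_score(5, False))
```

Under this reading a non-executable action can score at most 2.5, regardless of how the judge rates it.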
Our tree search implementation employs several key parameters to maintain efficiency while ensuring
thorough exploration. The selection threshold τ is set to 3.75, which creates a strict filtering
mechanism: actions must be both environmentally feasible and semantically meaningful to be
selected for expansion. This threshold effectively filters out non-executable actions (environmental
score = 0) and executable actions with low semantic scores (< 2). To manage computational resources
and maintain search efficiency, we sample 5 candidate actions for each state and set a maximum
search depth of 25 steps. These parameters were determined through empirical testing to balance
between exploration breadth and computational feasibility.
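The parameters above can be placed in context with a minimal search sketch. It is an assumption-laden illustration rather than the authors' implementation: the depth-first order, the `>=` comparison against τ, and the callables `propose`, `score`, and `is_goal` are all hypothetical choices. If the score is the average of a 0-5 semantic score and a 0/5 feasibility score, an executable action with semantic score s receives (s + 5) / 2, so it passes τ = 3.75 only when s is at least 2.5.

```python
from typing import Callable, List, Optional, Tuple

TAU = 3.75           # selection threshold from the text
NUM_CANDIDATES = 5   # candidate actions sampled per state
MAX_DEPTH = 25       # maximum search depth in steps

Prefix = Tuple[str, ...]  # sequence of actions taken so far


def tree_search(
    propose: Callable[[Prefix, int], List[str]],  # hypothetical policy sampler
    score: Callable[[Prefix, str], float],        # hypothetical hybrid scorer
    is_goal: Callable[[Prefix], bool],            # hypothetical goal-condition check
    tau: float = TAU,
    k: int = NUM_CANDIDATES,
    max_depth: int = MAX_DEPTH,
) -> Optional[List[str]]:
    """Depth-first search that only expands actions scoring at least tau."""
    stack: List[Prefix] = [()]
    while stack:
        prefix = stack.pop()
        if is_goal(prefix):
            return list(prefix)
        if len(prefix) >= max_depth:
            continue
        # Deduplicate sampled candidates while keeping their order, then score each once.
        scored = [(score(prefix, a), a) for a in dict.fromkeys(propose(prefix, k))]
        kept = sorted((s, a) for s, a in scored if s >= tau)
        # Push in ascending score order so the best-scored child is popped first.
        for _, action in kept:
            stack.append(prefix + (action,))
    return None
```

In a data collection setting, the candidates rejected at each node along a successful path would be natural sources of "rejected" samples, but that bookkeeping is left out of this sketch.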
The hybrid scoring mechanism integrates both semantic and environmental feasibility assessments.
For semantic evaluation, we employ GPT-4o to assess how effectively each step contributes to task
completion, using a 5-point scale rating prompt (detailed in Appendix E).
B.2 Data Sample
For each step in the correct trajectory, we can construct preference pairs for both action selection and
state prediction. For action selection, the action from the correct trajectory serves as the “chosen”
action, while other actions at the same step serve as “reject” actions. Since there may be multiple
incorrect actions when using tree-search methods, the number of DPO samples exceeds the number
of SFT samples. The example of action selection is as follows:
Input:
<image>
Please generate the plan for the next step based on the given Goal, Previous Steps, and Images. The plan should select one action and one
object from the provided list.
### Goal: Place a cooked apple inside the fridge.
### Previous Steps:
#Step 1: find apple (this action success)
#Step 2: pick up apple (this action success)
#Step 3: find microwave (this action success)
### Action List: [’find’, ’pick up’, ’put down’, ’open’, ’close’, ’slice’, ’turn on’, ’turn off’, ’done’]
Generate the next step in the format:
Reasoning:
Action:
Object:
Chosen Output :
Reasoning: To cook the apple, it needs to be placed inside the microwave. After that, it can be cooled in the fridge.
Action: open
Object: microwave
Rejected Output :
Reasoning: The user needs to put the cooked apple inside the fridge. The apple has already been found and picked up in the previous steps.
The next step requires finding the fridge.
Action: find
Object: fridge
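The pairing rule for action selection can be sketched as follows. This is an illustrative helper rather than the authors' data pipeline; `PreferencePair` and `build_action_pairs` are hypothetical names, and the prompt and outputs are abbreviated placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_action_pairs(
    prompt: str, chosen_output: str, sibling_outputs: List[str]
) -> List[PreferencePair]:
    """Pair the correct-trajectory output with every other output explored at the same step."""
    return [
        PreferencePair(prompt=prompt, chosen=chosen_output, rejected=other)
        for other in sibling_outputs
        if other != chosen_output
    ]


pairs = build_action_pairs(
    prompt="### Goal: Place a cooked apple inside the fridge. ...",
    chosen_output="Action: open\nObject: microwave",
    sibling_outputs=[
        "Action: open\nObject: microwave",
        "Action: find\nObject: fridge",
        "Action: put down\nObject: apple",
    ],
)
print(len(pairs))  # one pair per non-chosen sibling
```

Each step yields a single SFT sample (prompt plus chosen output) but one preference pair per rejected sibling, which is consistent with the DPO set being larger than the SFT set.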
For state prediction, given the current state and correct action as input, we treat the state that the
correct action transitions to as “chosen”, and the states that incorrect actions lead to as “reject” states.
We use GPT-4o to convert environment metadata (object visibility, spatial relationships, etc.) and
visual information into natural language state descriptions. Since there might be a language style gap
between GPT-4o and our model’s generations, we have the model first reproduce the current state
description as an anchor point before predicting the next state. The example of state prediction is as
follows:
Input:
<image>
Given the current state and step, describe the current state and predict what the next state should look like.
### Step: open microwave
First describe the current state, then predict the next state after the step is executed:
### Current State:
### Next State:
Chosen Output :
Current State: The microwave is on the counter, and the apple is held.
Next State: The microwave is open with a blue plate inside. The apple is held.
Rejected Output :
Current State: The microwave is on the counter, and the apple is held.
Next State: The fridge is visible with a temperature display showing -15, and the apple is held.
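A state-prediction pair like the one above could be assembled as in the sketch below. The prompt text is copied from the example; the function names and the record layout are assumptions made for illustration. Both outputs repeat the same current-state description, so the pair differs only in the predicted next state.

```python
STATE_PROMPT = (
    "<image>\n"
    "Given the current state and step, describe the current state and predict "
    "what the next state should look like.\n"
    "### Step: {step}\n"
    "First describe the current state, then predict the next state after the step is executed:\n"
    "### Current State:\n"
    "### Next State:"
)


def format_state_output(current_state: str, next_state: str) -> str:
    # The current state is reproduced first and serves as the anchor point.
    return f"Current State: {current_state}\nNext State: {next_state}"


def build_state_pairs(step, current_state, chosen_next, rejected_nexts):
    """Chosen: state reached by the correct action. Rejected: states reached by incorrect actions."""
    prompt = STATE_PROMPT.format(step=step)
    chosen = format_state_output(current_state, chosen_next)
    return [
        {"prompt": prompt, "chosen": chosen,
         "rejected": format_state_output(current_state, nxt)}
        for nxt in rejected_nexts
        if nxt != chosen_next  # skip degenerate pairs with identical outcomes
    ]


state_pairs = build_state_pairs(
    step="open microwave",
    current_state="The microwave is on the counter, and the apple is held.",
    chosen_next="The microwave is open with a blue plate inside. The apple is held.",
    rejected_nexts=[
        "The fridge is visible with a temperature display showing -15, and the apple is held.",
    ],
)
```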
B.3 Data Distribution
To obtain a balanced dataset, we processed the collected data so that sample sizes are similar across
task types; the detailed distribution is shown in Figure 7.
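The text does not say how the balancing was done. One common option, shown here purely as an assumed illustration, is to cap the number of samples per task type and downsample any type that exceeds the cap. The `task_type` field, the `cap` parameter, and the fixed seed are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_task_type(samples, cap, seed=0):
    """Downsample each task type to at most `cap` samples (assumed strategy)."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    by_type = defaultdict(list)
    for sample in samples:
        by_type[sample["task_type"]].append(sample)

    balanced = []
    for task_type in sorted(by_type):
        group = by_type[task_type]
        if len(group) > cap:
            group = rng.sample(group, cap)
        balanced.extend(group)
    return balanced


raw = (
    [{"task_type": "Pick&Place", "id": i} for i in range(10)]
    + [{"task_type": "Heat&Place", "id": i} for i in range(3)]
)
balanced = balance_by_task_type(raw, cap=4)
```

Capping leaves under-represented types untouched, so the result is only approximately balanced, as are the distributions in Figure 7.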
C Error Analysis
Error Type           SFT    DPO    D²PO
Dependency Error     212    157    141
Affordance Error     144    141    128
Inefficient Error    141     93     78
Others                20     16     17
Table 5: Distribution of Error Types Across Different Methods
To systematically analyze the error patterns, we employed DeepSeek-R1 (DeepSeek-AI, 2025) to
classify error types by comparing standard trajectories with erroneous ones. Note that a single
trajectory may contain multiple types of errors simultaneously. We categorized the errors into three
main types:
[Figure 7: two pie charts, “SFT Distribution” and “DPO Distribution”, over the task types
Examine&Light, Pick&Place, Stack&Place, Clean&Place, Heat&Place, and Cool&Place. Per-type
shares range from 15.8% to 17.2% in the SFT data and from 12.9% to 18.2% in the DPO data.]
Figure 7: Distribution of the SFT and DPO dataset across different task types.
• Dependency Error (DE): Occurs when actions are executed without meeting necessary
prerequisites, violating the logical sequence of operations.
• Affordance Error (AE): Manifests as incorrect object interaction sequences, indicating a
misunderstanding of how to properly interact with objects in the environment. This includes
both action affordance errors (using incorrect methods to interact with objects) and existence
affordance errors (attempting to interact with non-existent objects).
• Inefficient Error (IE): Involves redundant or unnecessary actions that do not contribute to
achieving the task goal efficiently.
As shown in Table 5, D²PO produces fewer errors than SFT in all three main categories: Dependency
Errors fall from 212 to 141 (about 33%), Affordance Errors from 144 to 128 (about 11%), and
Inefficient Errors from 141 to 78 (about 45%). D²PO also improves on DPO in each of these
categories (157 → 141, 141 → 128, and 93 → 78).
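The relative reductions implied by Table 5 can be checked with a few lines of arithmetic. The counts below are taken from the table; the helper name is introduced only for this sketch.

```python
# Error counts from Table 5.
ERROR_COUNTS = {
    "Dependency Error": {"SFT": 212, "DPO": 157, "D2PO": 141},
    "Affordance Error": {"SFT": 144, "DPO": 141, "D2PO": 128},
    "Inefficient Error": {"SFT": 141, "DPO": 93, "D2PO": 78},
    "Others": {"SFT": 20, "DPO": 16, "D2PO": 17},
}


def relative_reduction(baseline: int, new: int) -> float:
    """Fraction of baseline errors removed; negative if errors increased."""
    return (baseline - new) / baseline


for error_type, counts in ERROR_COUNTS.items():
    vs_sft = relative_reduction(counts["SFT"], counts["D2PO"])
    vs_dpo = relative_reduction(counts["DPO"], counts["D2PO"])
    print(f"{error_type:<18} vs SFT: {vs_sft:6.1%}   vs DPO: {vs_dpo:6.1%}")
```

Relative to SFT, the reduction is largest for Inefficient Errors and smallest for Affordance Errors. The Others row is the only one that rises from DPO to D²PO (16 to 17).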
However, our approach has limitations. Errors remain in every category even after D²PO training
(Table 5), so there is room for future work to further improve the model’s performance and to
address more complex error patterns that may emerge in other scenarios.
D Case Study
We conduct case studies to demonstrate the advantages of our proposed D²PO method over SFT in
terms of dependency and efficiency.
Dependency. As shown in Figure 8, our method models dependencies better than SFT on the task
“put washed plate inside fridge”. At step 2, SFT attempts to “pick up” without first locating an
accessible plate, while our method correctly performs “find plate” before attempting any
manipulation. Similarly, at step 4, SFT executes “put down plate” without having successfully picked
up any plate, whereas our approach ensures that the prerequisites are met. These early errors in SFT
propagate through the rest of the sequence: the repeated pick and place attempts that follow remain
invalid operations, and the task ultimately fails.
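The two failures described here are precondition violations, which the toy checker below makes concrete. It is not the classifier used in the error analysis (that was DeepSeek-R1). The two rules and both example plans are hypothetical and only mirror the pattern described for steps 2 and 4.

```python
def find_dependency_violations(plan):
    """Return (step_number, message) for each step whose precondition is unmet.

    Hypothetical rules for this sketch:
      - "pick up X" requires an earlier "find X"
      - "put down X" requires that something is currently held
    """
    located = set()
    held = None
    violations = []
    for step_number, (action, obj) in enumerate(plan, start=1):
        if action == "find":
            located.add(obj)
        elif action == "pick up":
            if obj in located:
                held = obj
            else:
                violations.append((step_number, f"pick up {obj} before find {obj}"))
        elif action == "put down":
            if held is None:
                violations.append((step_number, f"put down {obj} with nothing held"))
            else:
                held = None
    return violations


# Hypothetical plans: the first skips "find plate", the second does not.
faulty_plan = [("find", "sink"), ("pick up", "plate"), ("find", "fridge"), ("put down", "plate")]
valid_plan = [("find", "plate"), ("pick up", "plate"), ("find", "fridge"), ("put down", "plate")]

faulty_violations = find_dependency_violations(faulty_plan)
valid_violations = find_dependency_violations(valid_plan)
```

In the faulty plan the failed pick-up at step 2 leaves nothing held, so the put-down at step 4 is flagged as well. This is the same error propagation described for the SFT trajectory.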
Efficiency. Figure 9 illustrates our method’s greater efficiency on the task “place a warm plate
in the cabinet”. Although both approaches complete the task, our method requires fewer steps
because it sequences actions better. D²PO first locates the plate before operating the microwave,
following a logical and efficient order. In contrast, SFT operates the microwave before finding the
plate, leading to redundant “find plate” actions in steps 1 and 5.
[Figure 8 panels: (a) SFT Trajectory (Fail); (b) D²PO Trajectory (Success)]
Figure 8: Case Study about Dependency. This example demonstrates our method’s superiority in
dependency modeling compared to SFT. At step 2, SFT attempts “pick up” without locating an
accessible plate, while our method first performs “find plate”. Similarly, at step 4, SFT executes “put
down plate” without having picked up any plate, whereas our approach ensures the plate is properly
held before putting it down. These initial errors in SFT propagate throughout the sequence: despite
multiple pick and place attempts, they remain invalid operations, ultimately resulting in task failure.
Furthermore, SFT exhibits unnecessary repetition in steps 12-14, where it performs the same action
multiple times. This comparison highlights our method’s ability to generate more streamlined and
efficient action sequences while maintaining task success.
E Prompt Template
[Figure 9 panels: (a) SFT Trajectory (Success); (b) D²PO Trajectory (Success)]
Figure 9: Case Study about Efficiency. Even when both the SFT and D²PO methods successfully
complete the task, our approach requires fewer steps. Our method first locates the plate before
proceeding to operate the microwave, while SFT operates the microwave before finding the plate,
resulting in redundant “find plate” actions in steps 1 and 5. Additionally, SFT’s repetitive execution of
the same action in steps 12-14 further reduces efficiency. This comparison demonstrates our method’s
superior action sequencing and efficiency, even when both approaches ultimately achieve the goal.
GPT Evaluation Prompt
Please serve as an unbiased evaluator for the AI-generated next step in the task planning according to
the goal progress. The task involves robotic actions that typically follow a logical sequence of steps to
achieve a defined goal.
{example}
## Input Data:
### Goal: {goal}
### Previous Steps: {previous_steps}
---
## AI-generated Next Step to Evaluate:
Step: {step}
Execution Result: {action_ret}
After executing the step, you can see the following environment state: <image>
## Evaluation Criteria:
### Goal Progress (1-5 points):
Evaluate how effectively the step moves toward completing the task by considering:
1. **Action Sequence** - Does it follow a logical progression of actions based on the task requirements?
(e.g., preparation → execution → refinement → goal completion)
2. **Previous Actions** - How does it build on prior steps? Does it avoid unnecessary repetition or
conflicting actions?
3. **Goal State** - Does the step advance the task toward achieving the defined goal or final condition?
4. **Environment State** - Does the environment state after executing the step align with the expected
progress toward the goal?
Scoring for Goal Progress:
- **[1]:** Step moves away from the goal or makes goal completion more difficult.
- **[2]:** Step is redundant or repeats the exact same action as the immediate previous step without
progress.
- **[3]:** Step makes moderate progress toward the goal.
- **[4]:** Step makes significant progress toward the goal, aligning well with the task sequence.
- **[5]:** Step makes excellent progress, directly advancing toward goal completion.
### Examples:
- A step that repeats an action unnecessarily (e.g., "find object" followed by "find object") = [2].
- A step that logically follows the sequence (e.g., "find object" before "pick up object") = [4].
- A step that conflicts with the goal (e.g., "pick up object" followed by "put down object" without
correct location) = [1].
---
## Output Format:
### Evaluation:
Analysis: Briefly explain how the step compares to prior actions, whether it follows a logical sequence,
and how it advances the goal.
Goal Progress Score: Use the following scale format: [1], [2], [3], [4], [5].
Figure 10: Prompt Template for GPT-Evaluation during the Data Collection.
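A small sketch of how this template might be used in code: fill the input fields, then extract the bracketed score from the evaluator's reply. Only an excerpt of the template is reproduced. The function names, the regular expression, and the sample reply are assumptions for illustration, not part of the original pipeline.

```python
import re

# Excerpt of the input portion of the template in Figure 10; the full prompt
# also contains the {example} block, the evaluation criteria and the output format.
INPUT_SECTION = (
    "## Input Data:\n"
    "### Goal: {goal}\n"
    "### Previous Steps: {previous_steps}\n"
    "## AI-generated Next Step to Evaluate:\n"
    "Step: {step}\n"
    "Execution Result: {action_ret}\n"
)

# Matches the requested output format, e.g. "Goal Progress Score: [4]".
SCORE_PATTERN = re.compile(r"Goal Progress Score:\s*\[([1-5])\]")


def fill_input_section(goal, previous_steps, step, action_ret):
    """Substitute the placeholders; previous steps are joined one per line."""
    return INPUT_SECTION.format(
        goal=goal,
        previous_steps="\n".join(previous_steps),
        step=step,
        action_ret=action_ret,
    )


def parse_goal_progress_score(reply):
    """Return the score as an int from 1 to 5, or None if the reply has no valid score."""
    match = SCORE_PATTERN.search(reply)
    return int(match.group(1)) if match else None


prompt_part = fill_input_section(
    goal="Place a cooked apple inside the fridge.",
    previous_steps=["#Step 1: find apple", "#Step 2: pick up apple"],
    step="find microwave",
    action_ret="success",
)
# Hypothetical evaluator reply, used only to exercise the parser.
score = parse_goal_progress_score(
    "Analysis: The step follows logically from picking up the apple.\n"
    "Goal Progress Score: [4]"
)
```

Returning `None` for a malformed reply lets the caller decide whether to retry the evaluation or discard the step, rather than silently assigning a default score.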