KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting
for Open-Vocabulary Robotic Manipulation
Zixian Liu1∗, Mingtong Zhang2∗, and Yunzhu Li3
[Figure 1 (image omitted). Panel labels: instruction "Collect all the cubes together."; Large Vision Model; Vision Language Model; Code as Target Specification; Code Parsing; Dynamics Planning. Object categories shown: Multi-Object Interaction, Rigid Object, Rope, Granular.]
Fig. 1: KUDA is an open-vocabulary manipulation system that uses keypoints to unify the visual prompting of vision language models
(VLMs) and dynamics modeling. Taking the RGBD observation and the language instruction as inputs, KUDA samples keypoints in the
environment, then uses a VLM to generate code specifying keypoint-based target specification. These keypoints are translated into a cost
function for model-based planning with learned dynamics models, enabling open-vocabulary manipulation across various object categories.
Abstract — With the rapid advancement of large language
models (LLMs) and vision-language models (VLMs), significant
progress has been made in developing open-vocabulary robotic
manipulation systems. However, many existing approaches
overlook the importance of object dynamics, limiting their
applicability to more complex, dynamic tasks. In this work,
we introduce KUDA, an open-vocabulary manipulation system
that integrates dynamics learning and visual prompting through
keypoints, leveraging both VLMs and learning-based neural
dynamics models. Our key insight is that a keypoint-based target
specification is simultaneously interpretable by VLMs and can
be efficiently translated into cost functions for model-based
planning. Given language instructions and visual observations,
KUDA first assigns keypoints to the RGB image and queries the
VLM to generate target specifications. These abstract keypoint-
based representations are then converted into cost functions,
which are optimized using a learned dynamics model to
produce robotic trajectories. We evaluate KUDA on a range of
manipulation tasks, including free-form language instructions
across diverse object categories, multi-object interactions, and
deformable or granular objects, demonstrating the effectiveness
of our framework. The project page is available at
http://kuda-dynamics.github.io.
∗ denotes equal contribution. 1 Tsinghua University, 2 University of Illinois Urbana-Champaign, 3 Columbia University
I. INTRODUCTION
It has been a longstanding focus to create an open-vocabulary robotic system capable of executing tasks based on human language in diverse environments. However, human language is inherently abstract and often ambiguous, requiring contextual knowledge and the ability to ground language inputs in the environments where the robots operate. Recent advances in large language models (LLMs) and
vision-language models (VLMs) [1–9] have demonstrated
advanced capabilities in text and image understanding. These
developments have paved new paths for incorporating such
models into robotic systems [10–12]. However, many exist-
ing open-vocabulary robotic systems heavily rely on VLMs
and LLMs for guidance at all levels and do not provide
an explicit account of object dynamics. As a result, they
typically focus only on rigid objects and coarse-grained
manipulation, limiting their applicability to more complex
and dynamic tasks involving diverse object categories, such
as deformable objects and object piles.
On the other hand, learning-based dynamics models have
shown the ability to model the complex behaviors of real-
world objects directly from observation data [13–16]. These
models can accurately predict the future states of objects
with varying object categories and shapes, accounting for
different interactions. However, model-based planning with
dynamics models typically requires a pre-defined target state
or cost function, which cannot be directly inferred from high-
level language instructions. This raises a key question: How
can we develop open-vocabulary manipulation systems that
harness the flexibility of task specification via VLMs while
preserving the benefits of model-based planning?
Our key insight is to develop a unified keypoint representation that integrates dynamics learning and visual prompting through VLMs. Keypoints, as a visual representation, are intuitive for vision-language models to interpret and express, while also being precisely defined and easily translatable into a cost function for planning with dynamics models. To achieve this, we propose defining the objective function for planning using keypoints and employing mark-based visual prompting, inspired by [11], to enable the VLM to generate code that specifies the objective as arithmetic relationships between the visual keypoints.
arXiv:2503.10546v1 [cs.RO] 13 Mar 2025
Building on our keypoint-based target specifications, we
introduce KUDA: an open-vocabulary manipulation system
that utilizes Keypoints to Unify Dynamics learning and
visuAl prompting. KUDA employs an upstream VLM along
with a pre-trained dynamics model, using keypoints as a
shared intermediate representation. Given a language instruc-
tion for the manipulation task and the current visual observa-
tion of the experimental setup, KUDA automatically samples
and labels keypoints from the RGB image. The VLM is
then prompted to generate target specifications, which are
subsequently converted into a cost function. During robot
execution, a two-level closed-loop control mechanism en-
sures effective and robust model-based planning. Notably,
we found that incorporating few-shot examples significantly
enhances the performance of the VLM. Inspired by this,
we developed a prompt library and a retrieval mechanism
based on score matching, ensuring high-quality few-shot
examples without exceeding input token limits. In summary,
our contributions are as follows:
• We propose using keypoints as a unified intermediate representation to bridge dynamics learning and visual prompting through VLMs.
• We design a prompt retriever that automatically subsamples from our prompt library based on the task description, while ensuring it stays within the context window of the VLMs.
• We evaluate the integrated system in real-world manipulation tasks, demonstrating state-of-the-art performance on tasks involving diverse object materials, such as ropes and granular objects (Fig. 1), and covering a range of language instructions.
II. RELATED WORK
A. Grounding Language Instructions
Using human language to instruct intelligent robots has been an active research domain. However, most works [17–19] mainly focus on decomposing high-level language instructions into subtasks. Grounding ambiguous human language into structured action sequences that robots can execute remains a significant challenge [20–22]. Most existing
approaches use action primitives as the basic elements for
planning. These methods either employ classical techniques
such as lexical analysis, formal logic, and graphical mod-
els [20, 23–25], or leverage pre-trained large models to
comprehend instructions and generate task plans [22, 26–
28]. However, the reliance on pre-defined motion primitives
is often seen as a major limitation for developing a universal
manipulation system [10].
Many recent works have also focused on grounding language into lower-level actions. These approaches include language-conditioned imitation learning [29–31], synthesizing value maps or reward functions [10, 12, 32, 33], and
generating motions through visual prompting [11, 34]. In
our work, we propose a new paradigm to ground natural
language instructions into keypoint target configurations for
model-based planning with learned dynamics models.
B. Foundation Models in Robotics
The remarkable success of large language models (LLMs)
has recently attracted significant interest in the field of robotics.
Recent works have explored how to effectively integrate
these powerful models into robotic systems. Some ap-
proaches leverage LLMs for high-level task planning [35,
36], while others have demonstrated that LLMs excel at
generating code for robot control [28, 37, 38]. Another appli-
cation of LLMs is in synthesizing value functions or reward
functions [10, 33, 39]. However, these approaches typically
require converting both the task and observations into textual
form, which often leads to suboptimal performance in real-
world manipulation scenarios.
Recently, vision-language models (VLMs) have attracted
significant attention. In addition to replacing previous LLMs
with powerful VLMs, such as GPT-4V [6], to achieve better
grounding in real-world scenarios, VLMs have been applied
in various contexts. Some pre-trained VLMs have shown
superior perception capabilities [40–42], while others have
been used to generate affordances or constraints based on
visual prompts [11, 12, 43]. However, despite their capabilities, these advanced VLMs lack a good understanding of 3D spatial relationships and object dynamics, making them ineffective in more complicated manipulation tasks.
Another recent trend focuses on collecting large-scale
robotics data and training end-to-end general-purpose models
for robotic tasks [44–47]. However, many works in this
direction focus primarily on rigid object manipulation and
cannot perform complex manipulation across various object
categories due to a lack of knowledge of object dynamics.
Additionally, collecting data and training models for such
works is highly resource-intensive, and their performance
remains limited. In our work, we leverage the state-of-
the-art VLM to generate the target specifications for the
neural dynamics models, which enables a flexible and robust
framework that facilitates model-based planning.
C. Visual Prompting and In-Context Learning
Visual prompting is an emerging technique that has attracted much attention with the development of large vision-language models. This technique enhances the visual grounding capabilities of these foundation models by incorporating visual markers into observations. Yang et al. [48] demonstrated that overlaying the original image with its semantic segmentation enables GPT-4V to answer questions requiring visual grounding. Cai et al. [49] introduced ViP-LLaVA, a model capable of decoding visual prompts that include various types of visual markers. Others have used visual prompts generated by predefined rules to produce affordances for robotic planning [11, 34].
In-context learning is a relatively novel paradigm in the
vision domain. It allows a pre-trained model to efficiently
adapt to a novel downstream task using a few input-output
examples, without requiring fine-tuning. One of the earliest
works in this area is Flamingo [50], a vision-language model
(VLM) that is capable of learning from few-shot examples.
Bar et al. [51] approach the image-to-image in-context
learning problem as an image inpainting task, while Yang
et al. [52] introduce in-context learning for image editing.
Li et al. [53] offer a general method for in-context learning
in segmentation tasks. Additionally, Zhang et al. [54] provide
a comprehensive study on how to select effective examples
for visual in-context learning. In our work, we utilize a
set of well-designed prompting formulations to guide the
vision language model in generating keypoint-based target
specifications. Additionally, inspired by [55], we introduce a
prompt retriever to ensure the quality of few-shot examples.
III. METHOD
We first provide the problem formulation. Then we describe how we prompt the VLM to generate the keypoint-based target specification. Next, we present the design of the prompt retriever and the Top-K prompt library. Lastly, we specify how we obtain the cost function from the target specification and how we perform model-based planning with neural dynamics models, along with the two-level closed-loop control mechanism. The overview of our KUDA framework is shown in Fig. 2.
A. Problem Formulation
We consider a tabletop manipulation problem defined by a free-form language instruction $L$ that pertains to one or multiple objects $O$, along with a pre-trained neural dynamics model $f$. The goal is to optimize the robot's action trajectory $\tau$ to perform the manipulation task described by $L$. The instruction $L$ is typically abstract and requires interpretation that incorporates both common sense and the context of the experiment configurations. The dynamics model $f$ is able to predict how the object set $O$ will change when a specific action is applied:
$$\hat{z}_{t+1} = f(z_t, u_t), \tag{1}$$
where $z_t$ and $z_{t+1}$ represent the current state of the environment at time $t$ and the subsequent state at time $t+1$, respectively. $\hat{z}_{t+1}$ is the predicted state at time $t+1$, and $u_t$ is the applied action. The environment state describes the positions of the 3D points $o_i = (x_i, y_i, z_i)$ that represent the objects (or parts of an object) in $O$.
The key problem here is how to define the optimization objectives from the ambiguous instruction $L$. We leverage VLMs to propose a keypoint-based representation, from which we obtain our cost function. Once the cost function $C(\cdot \mid L)$ is obtained from the instruction $L$, the problem can be formulated as an optimization problem over the robot's action trajectory $\tau$:
$$\tau = \arg\min_{\tau} C(z' \mid L), \tag{2}$$
where $z'$ represents the final state of the environment after the action sequence $\tau$ has been applied.
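As a minimal sketch of Eqs. (1) and (2), the snippet below rolls a dynamics model forward over a candidate action sequence and selects, by brute force, the candidate whose final state has the lowest cost. The toy dynamics (all points translate with the action), the toy cost, and every function name are illustrative assumptions, not the learned models or the planner used in this work.

```python
import numpy as np

def rollout(f, z0, actions):
    """Apply z_{t+1} = f(z_t, u_t) along an action sequence; return the final state z'."""
    z = z0
    for u in actions:
        z = f(z, u)
    return z

def best_trajectory(f, cost, z0, candidates):
    """Brute-force stand-in for Eq. (2): keep the candidate with the lowest final cost."""
    costs = [cost(rollout(f, z0, tau)) for tau in candidates]
    best = int(np.argmin(costs))
    return candidates[best], float(costs[best])

# Toy stand-ins (assumptions): the state is N 3D points that translate with the action.
def toy_dynamics(z, u):
    return z + u  # u is a (3,) displacement broadcast over all points

target = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])

def toy_cost(z):
    return float(np.linalg.norm(z - target, axis=1).sum())

z0 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
candidates = [
    np.array([[0.5, 0.0, 0.0], [0.5, 0.0, 0.0]]),  # pushes the points onto the target
    np.array([[0.0, 0.5, 0.0], [0.0, 0.5, 0.0]]),  # pushes the points sideways
]
tau, c = best_trajectory(toy_dynamics, toy_cost, z0, candidates)
print(tau, c)
```

In practice the candidate set is not enumerated; a sampling-based optimizer refines the action sequence instead.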
B. Keypoint-Based Target Specification
Inspired by [11], we employ visual markers to enhance
the visual grounding capabilities of vision-language models
(VLMs). A key insight in our framework is that a wide range
of manipulation tasks can be effectively described by the
spatial relationships between keypoints on the objects to be
manipulated and reference points within the environment.
For instance, the task of straightening a rope on a table can
be articulated as pulling one end of the rope to the left side
of the center of the table while pulling the other end to the
right. Notably, we found that VLMs are highly proficient
at generating such spatial relationships when provided with
appropriate visual prompts.
Specifically, we first segment all semantic masks from the RGB observation utilizing Segment Anything [42]. Next, we perform farthest point sampling (FPS) on these masks to obtain keypoints and reference points, which we then label on the original image. We prompt the VLM to select keypoints and, for each keypoint $k$, to specify a target position $p$ that describes the desired final state of the object set $O$ upon task completion. The target position $p$ is determined by its spatial offset from a reference point, as provided by the VLM through a set of code assignment statements in the form of p = r + [dx, dy, dz], which denotes that the target position is equal to the reference point r's position plus an offset [dx, dy, dz]. We refer to each $(k, p)$ pair as a target specification. It is important to note that
these target specifications do not have to be strictly satisfied
after executing the task, as the vision-language model may
occasionally generate infeasible specifications due to its lack
of knowledge of object dynamics. Despite this, we found that
optimizing for approximate target specifications is generally
sufficient to successfully complete the instruction. For a more
detailed prompting process, please refer to the appendix on
our project website.
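The two steps above, sampling keypoints with FPS and resolving assignments of the form p = r + [dx, dy, dz], can be sketched as follows. This is a simplified illustration: the mask pixels, the reference point coordinates, the dictionary layout of the specifications, and the function names are all hypothetical, and the actual prompting and parsing pipeline is described in the appendix.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from all points chosen so far."""
    points = np.asarray(points, dtype=float)
    chosen = [start]
    dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dist))
        chosen.append(idx)
        # Distance of every point to its nearest already-chosen point.
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return chosen

def resolve_targets(specs, references):
    """Map {keypoint: (reference_name, offset)} to {keypoint: target position}."""
    return {
        kp: np.asarray(references[ref], dtype=float) + np.asarray(offset, dtype=float)
        for kp, (ref, offset) in specs.items()
    }

# Hypothetical mask: pixels of a rope lying along a horizontal line.
mask_pixels = np.array([[x, 50] for x in range(0, 101, 10)])
keypoint_ids = farthest_point_sampling(mask_pixels, k=3)

# Hypothetical specifications in the spirit of p = r + [dx, dy, dz].
references = {"C": [100.0, 50.0, 0.0]}
specs = {"P1": ("C", [-20, 0, 0]), "P2": ("C", [20, 0, 0])}
targets = resolve_targets(specs, references)
print(keypoint_ids, targets)
```

FPS first selects the two ends of the line and then its midpoint, which is why it tends to spread keypoints evenly over an object mask.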
C. Top-K Prompt Library
In our experiments, we noticed that providing a few
examples of similar tasks significantly improved the per-
formance of the VLM. As a result, we collected a diverse
set of examples that cover all the object categories used in
our experiments. However, existing VLMs, such as GPT-4V, often limit the number of images they can take as inputs. Additionally, images consume a substantial number of input tokens, which can be costly and affect the efficiency of the program.
[Figure 2 (image omitted). Panel labels: RGBD Observation; Large Vision Model; Vision Language Model; Prompt Library; 3D Objectives; Neural Dynamics; Model-Based Planning; Robot Execution; Low-Level Closed Loop; High-Level Closed Loop. Example instruction: "Divide the cubes of different colors into two pieces." Labeled keypoints P1 to P6 and reference point C. Example VLM output:]
    def keypoint_specification():
        # Move red cubes to the left
        p1 = C + [-20, 0, 0]
        p3 = C + [-20, 0, 0]
        p5 = C + [-20, 0, 0]
        # Move purple cubes to the right
        p2 = C + [20, 0, 0]
        p4 = C + [20, 0, 0]
        p6 = C + [20, 0, 0]
        return p1, p2, p3, p4, p5, p6
Fig. 2: Overview of KUDA. Taking the RGBD observations and a language instruction as inputs, we first utilize the large vision model to obtain the keypoints and label them on the RGB image to obtain the visual prompt (green dot C marks the center reference point). Next, the vision-language model generates code for target specifications, which are projected into 3D space to construct the 3D objectives. Lastly, we utilize the pre-trained dynamics model for model-based planning. After a certain number of actions, the VLM is re-queried with the current observation, enabling high-level closed-loop planning to correct VLM and execution errors.
To address this, we developed a prompt retriever that selects optimal examples from the prompt library through score matching. Specifically, we employ CLIP [7] to encode both the input observation $s$ and instruction $L$, as well as all the examples represented by tuples $(q_i, \mathrm{obs}_i, r_i)$ in the prompt library $P$, where $q_i$ is the text query, $\mathrm{obs}_i$ is the corresponding observation in the example, and $r_i$ is the response provided by a human expert. We then normalize these latent vectors and obtain similarities by dot product. The matching score $S_i$ is calculated as a weighted sum of the image and text similarities:
$$S_i = \frac{f_I(s)}{|f_I(s)|} \cdot \frac{f_I(\mathrm{obs}_i)}{|f_I(\mathrm{obs}_i)|} + \lambda \, \frac{f_T(L)}{|f_T(L)|} \cdot \frac{f_T(q_i)}{|f_T(q_i)|}, \tag{3}$$
where $f_I$ and $f_T$ represent the image and text encoders of CLIP, respectively, and $\lambda$ is a hyperparameter set to 0.6. The top $K$ examples are selected and incorporated into our prompt based on the score.
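The scoring rule in Eq. (3) can be sketched with plain NumPy as below. The embeddings here are small hand-made vectors standing in for CLIP image and text features (an assumption made so the example runs without any model), and the function names are hypothetical.

```python
import numpy as np

def unit(v):
    """L2-normalize vectors along the last axis."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def top_k_examples(img_query, txt_query, img_lib, txt_lib, k, lam=0.6):
    """Score each library entry as cos(image) + lam * cos(text) and return the top-k indices."""
    image_sim = unit(img_lib) @ unit(img_query)
    text_sim = unit(txt_lib) @ unit(txt_query)
    scores = image_sim + lam * text_sim
    order = np.argsort(-scores)[:k]
    return order.tolist(), scores

# Placeholder 2D "embeddings" (assumption): real CLIP features are much higher-dimensional.
img_query = [1.0, 0.0]
txt_query = [0.0, 1.0]
img_lib = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
txt_lib = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

selected, scores = top_k_examples(img_query, txt_query, img_lib, txt_lib, k=2)
print(selected, scores)
```

Because only the top K entries are kept, the number of example images sent to the VLM stays bounded regardless of how large the prompt library grows.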
D. Two-Level Closed-Loop Planning
In our framework, we propose a two-level closed-loop
planning to improve the robustness and effectiveness of the
manipulation tasks.
At the low-level closed loop of model-based planning,
once the target specifications are obtained, the next step
is to translate them into optimization objectives. We start
by projecting the keypoints and their corresponding targets
into 3D space. Next, we extract the object points $o_i$ from the objects' point clouds and align each keypoint with its nearest object point. The objective is then defined as the sum of the Euclidean distances between keypoints and their corresponding targets:
$$C(z \mid L) = \sum_i \lVert o_i - p_i \rVert_2 \quad (o_i \in z), \tag{4}$$
where $p_i$ represents the 3D target of point $o_i$, and the summation is performed over all target specifications. Using the objective defined in Eqn. 2 and the dynamics model $f$, we employ the Model Predictive Path Integral (MPPI) [56] algorithm to determine the action to be executed.
However, for specific object categories such as granular
pieces, a limited number of keypoints is insufficient to accu-
rately describe the target shape specified by the instruction.
Thus, we perform high-level closed-loop re-planning with
the VLM, re-prompting it with the current observation and
instruction to update target specifications after a series of
actions within a loop. This two-level closed-loop planning framework corrects imperfect target specifications and execution errors, keeping the system robust even in the presence of external disturbances.
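The low-level loop can be illustrated with the keypoint cost of Eq. (4) and a heavily simplified MPPI update. This sketch omits the control-cost terms of the full algorithm [56] and replaces the learned dynamics with a toy translation model; the sample count, noise scale, temperature, and all names are assumptions chosen for illustration only.

```python
import numpy as np

def keypoint_cost(z, kp_idx, targets):
    """Eq. (4): sum of Euclidean distances between keypoints o_i in z and targets p_i."""
    return float(np.linalg.norm(z[kp_idx] - targets, axis=1).sum())

def mppi_step(f, z0, cost_fn, mean, n_samples=256, sigma=0.05, temperature=0.1, rng=None):
    """One simplified MPPI update of the nominal action sequence `mean` (H x action_dim)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, sigma, size=(n_samples,) + mean.shape)
    candidates = mean[None] + noise
    costs = np.empty(n_samples)
    for j, actions in enumerate(candidates):
        z = z0
        for u in actions:          # roll the dynamics forward over the horizon
            z = f(z, u)
        costs[j] = cost_fn(z)
    # Exponentially weight low-cost rollouts and average the sampled sequences.
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    return np.tensordot(weights, candidates, axes=1)

# Toy dynamics (assumption): every point translates by the action vector.
def toy_dynamics(z, u):
    return z + u

z0 = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
kp_idx = np.array([0, 1])
targets = z0[kp_idx] + np.array([0.3, 0.0, 0.0])

def cost_fn(z):
    return keypoint_cost(z, kp_idx, targets)

rng = np.random.default_rng(0)
mean = np.zeros((3, 3))            # horizon of 3 actions, each a 3D displacement
initial_cost = cost_fn(z0)
for _ in range(20):
    mean = mppi_step(toy_dynamics, z0, cost_fn, mean, rng=rng)

z_final = z0
for u in mean:
    z_final = toy_dynamics(z_final, u)
final_cost = cost_fn(z_final)
print(initial_cost, final_cost)
```

In a receding-horizon setting, only the first action of the optimized sequence would be executed before re-planning from the new observation, which is what makes the low-level loop closed.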
IV. EXPERIMENTS
The main purpose of our experiments is to verify and
analyze the ability of our system to perform a variety of tasks
on different objects given various language instructions or
experiment configurations. We aim to answer the following
research questions: (1) How well does our system generalize
to diverse text instructions and visual scenarios? (2) How
does our framework handle complex manipulation tasks
across various object categories? (3) How does each module
contribute to the system’s failure cases?
We conduct qualitative evaluations on a diversified set
of tasks to demonstrate the effectiveness of our system. To
highlight its model-based planning capabilities with neural
dynamics models, we compare our framework against two
baselines. Additionally, we provide a component-wise ex-
periment error breakdown for a comprehensive analysis of
our framework’s effectiveness. Furthermore, as an ablation
study, we examine the impact of the hyperparameter $K$ in
the top-K prompt library on in-context learning.
To demonstrate the flexibility of our system across various objects, we train the neural dynamics models on 4 different object categories: rope, cubes, granular pieces, and a T-shaped block. The first three categories utilize graph-based neural dynamics models, whereas the T-shaped block employs a state-based neural dynamics model trained using a multilayer perceptron network.
[Figure 3 (image omitted). For each task, the columns show Initial State, Target Spec, and Robot Action. Task instructions: “Move the cubes to the sticker of the pink color”; “Move the cubes to the sticker that writes ‘pink’”; “Move the ends of the rope to form a fraction 1/9 on the table”; “Move the T shape to make a word ‘ROBOT’”; “Collect the coffee beans into the square”.]
Fig. 3: Qualitative Results of the Rollouts. We show the initial state, the target specification visualization, and the robot executions for various tasks on different objects, highlighting the effectiveness of our framework. Note that we show the granular collection task to exhibit how our VLM-level closed-loop control works across two VLM-level loops.
We compare our system with MOKA [11] and VoxPoser [10] in a tabletop environment. These two baselines also enable open-vocabulary manipulation and in-context few-shot learning. MOKA builds a framework that prompts a VLM to directly generate motion, and VoxPoser uses an LLM to synthesize 3D voxel maps as affordances. To ensure a fair comparison, the prompts and few-shot examples of these two systems are adapted to suit our tasks. To avoid possible overfitting in few-shot learning, we ensure that no example in the prompt library is exactly the same as the task. All vision language models and large language models used in our system and the two baselines are GPT-4o.
A. Qualitative Results
In Fig. 3, we present 5 tasks featuring different text
instructions and visual scenarios, along with their corre-
sponding target specifications generated by our system and
the robot’s executions. These examples clearly demonstrate
the effectiveness of our framework. In each case, our sys-
tem generates precise target specifications aligned with the
language instructions, and the robot executes the tasks effec-
tively. Notably, two tasks involving cubes start with similar
initial configurations but differ in instructions which are easy
to confuse. The VLM effectively distinguished semantics
and provided precise target specifications, demonstrating its
strong visual understanding capabilities.
The coffee bean collection task highlights the benefits
of our VLM-level closed-loop planning. Initially, the target
specifications were too sparse to fully manipulate the coffee
bean pile, leaving a few unspecified beans outside the square after multiple actions. Our system identified these errors
and corrected them in the subsequent loop. These examples
demonstrate the flexibility of our framework across a wide
variety of instructions and environment configurations.
B. Quantitative Results
We compare our system with two baselines on a total
of six tasks across the four object categories. The results
are shown in Tab. I. For each task, we evaluate the success
rate by measuring the Chamfer distance between the point
clouds of the objects after a specified number of robot
actions are applied and the corresponding target point clouds.
The quantitative results are shown as the total number of
successful trials out of a total of 10. The text instructions
for all evaluation tasks are listed in Tab. II.
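A success check based on the Chamfer distance could look like the sketch below. Several conventions for the Chamfer distance exist (squared or unsquared, summed or averaged); this version uses the mean unsquared nearest-neighbour distance in both directions, and the threshold value is a hypothetical placeholder, since the per-task thresholds are not stated here.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance a->b plus b->a."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # shape (len(a), len(b))
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def is_success(final_points, target_points, threshold):
    """A trial counts as a success if the final cloud is close enough to the target cloud."""
    return chamfer_distance(final_points, target_points) < threshold

final_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
target_points = final_points + np.array([0.0, 0.1, 0.0])

cd = chamfer_distance(final_points, target_points)
ok = is_success(final_points, target_points, threshold=0.25)  # hypothetical threshold
print(cd, ok)
```

The pairwise distance matrix is quadratic in the number of points, so for dense point clouds a spatial index would be the more practical choice.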
Methods              MOKA [11]   VoxPoser [10]   Ours
Rope Straightening   2/10        0/10            8/10
Cube Collection      0/10        3/10            6/10
Cube Movement        6/10        3/10            10/10
Granular Collection  0/10        1/10            10/10
Granular Movement    0/10        1/10            6/10
T Movement           0/10        0/10            8/10
Total                13.3%       13.3%           80.0%
TABLE I: Quantitative results of our evaluation. Our method achieved relatively high performance across all evaluation tasks compared to the two baselines, while the failures in Cube Collection and Granular Movement were primarily caused by perception.
As demonstrated in the quantitative evaluation results, our system outperforms these existing methods by a large margin on the evaluation tasks. The main reason for this performance gap is that the existing methods ignore the fine-grained representation of objects or actions and lack knowledge of object dynamics, which limits their capability to manipulate deformable objects or rigid objects of more complex shapes. In contrast, our system utilizes the keypoint-based representation for visual prompting and dynamics learning. This enables our system to predict the future state of those complex objects and perform model-based planning for manipulation.
[Figure 4 (image omitted). Flow of the 60 trials through each stage, with failure counts:
Total trials: 60
  VLM provides under-specified targets: 2
Keypoint Specification Success: 58
  Camera provides bad depth: 1
  Perception Error: 6 (objects overlap with other shapes: 3; fail to detect one of two adjacent cubes: 3)
Grounding Success: 51
  Fail to track dense object pile (coffee beans): 2
Tracking Success: 49
  Dynamics model fails to predict correct states: 1
Manipulation Success: 48]
Fig. 4: Visualizations of Error Breakdown. We provide a detailed breakdown of each failure mode, marked in red. While we achieved an 80% success rate across 60 trials for various tasks, the primary cause of failure was perception errors, accounting for 10% of all trials and 50% of the failure cases.
Task                 Instruction
Rope Straightening   “Straighten the rope.”
Cube Collection      “Move all the cubes to the pink cross.”
Cube Movement        “Move the yellow cube to the red cross.”
Granular Collection  “Collect all the coffee beans together.”
Granular Movement    “Move all the coffee beans to the red cross.”
T Movement           “Move the orange T into the pink square.”
TABLE II: The input instructions of each evaluation task.
C. Error Breakdown
We conducted a manual analysis of the failure cases
encountered during our experiments, as shown in Fig. 4.
As illustrated, 1) the perception module accounted for the
majority of errors. This includes instances where the module
failed to detect objects, particularly when they overlapped
with other shapes, such as a cross sign on the table, or
failed to distinguish between two adjacent cubes. 2) The
second most significant source of error stemmed from the
target specification and the tracking module. Errors in target
specification typically arose from the VLM providing under-
specified targets, i.e., an insufficient number of target specifi-
cations to complete the task. The tracking module commonly
failed when tracking keypoints in dense object piles, such
as coffee beans. 3) Additionally, a smaller proportion of
failures were attributed to the dynamics model and hardware,
where the dynamics model occasionally produced inaccurate
predictions, and the RGBD camera sometimes provided
inaccurate depth values.
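The percentages quoted for the error breakdown follow directly from the failure counts reported in Fig. 4, as the short calculation below shows. The dictionary keys are paraphrased labels; only the counts come from the figure.

```python
# Failure counts per cause, taken from Fig. 4 (labels paraphrased).
failures = {
    "VLM under-specified targets": 2,
    "camera bad depth": 1,
    "perception: object overlaps other shapes": 3,
    "perception: adjacent cubes not separated": 3,
    "tracking: dense object pile": 2,
    "dynamics: wrong state prediction": 1,
}
total_trials = 60

n_failures = sum(failures.values())
n_success = total_trials - n_failures
n_perception = sum(v for k, v in failures.items() if k.startswith("perception"))

success_rate = n_success / total_trials
perception_share_of_trials = n_perception / total_trials
perception_share_of_failures = n_perception / n_failures

print(n_failures, n_success)
print(success_rate, perception_share_of_trials, perception_share_of_failures)
```

The 12 failures leave 48 successes out of 60 trials, which matches the 80% total in Tab. I, and the 6 perception errors account for 10% of all trials and half of the failures.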
D. Ablation of Top-K Prompt Library
Setting        Category   Top-0   Top-1   Top-3   Top-5
Success Rate   10/10      2/10    3/10    10/10   7/10
TABLE III: The quantitative results of different K values.
In this ablation, we evaluate the success rate on the Rope Straightening task for different $K$ values, as well as a special prompting method labeled “category” where examples are manually selected by a human expert. The results are presented in Tab. III. Notably, when $K = 3$, the prompt retriever achieves performance on par with that of a human expert. However, increasing $K$ introduces less relevant examples, which lowers the overall prompt quality and results in a reduced success rate for the prompt retriever when $K = 5$.
V. CONCLUSION & LIMITATIONS
In this work, we propose KUDA, a novel open-vocabulary robotic manipulation system that unifies the visual prompting of vision language models and dynamics learning through a keypoint-based representation. Utilizing this flexible representation, KUDA leverages the vision language model to handle various high-level human language instructions, while using model-based planning with dynamics models to generate robot actions. This allows it to perform complex manipulation tasks on various object categories under different language instructions. Our experiments demonstrate the effectiveness and versatility of our framework.
However, KUDA also has several limitations. First, it uses a top-view camera to capture visual observations for the vision language model, limiting its ability to perform tasks with more complex 3D spatial relationships. Second, the dynamics models in our work are trained in simulation, leading to an inevitable sim-to-real gap and limited generalization to different object categories. We believe that these problems can eventually be resolved as the related fields develop. We hope that KUDA will inspire more future work on incorporating knowledge of dynamics into more versatile robotic systems.
VI. ACKNOWLEDGEMENT
This work is partially supported by the Toyota Research
Institute (TRI), the Sony Group Corporation, and Google.
This article solely reflects the opinions and conclusions of
its authors and should not be interpreted as necessarily
representing the official policies, either expressed or implied,
of the sponsors.
REFERENCES
[1] J. Devlin, “Bert: Pre-training of deep bidirec-
tional transformers for language understanding,” arXiv
preprint arXiv:1810.04805 , 2018.
[2] A. Radford, “Improving language understanding by
generative pre-training,” 2018.
[3] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei,
I. Sutskever, et al. , “Language models are unsupervised
multitask learners,” OpenAI blog , vol. 1, no. 8, p. 9,
2019.
[4] T. B. Brown, “Language models are few-shot learners,”
arXiv preprint arXiv:2005.14165 , 2020.
[5] A. Chowdhery, S. Narang, J. Devlin, M. Bosma,
G. Mishra, A. Roberts, P. Barham, H. W. Chung,
C. Sutton, S. Gehrmann, et al. , “Palm: Scaling language
modeling with pathways,” Journal of Machine Learning
Research , vol. 24, no. 240, pp. 1–113, 2023.
[6] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya,
F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman,
S. Anadkat, et al. , “Gpt-4 technical report,” arXiv
preprint arXiv:2303.08774 , 2023.
[7] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh,
S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark,
et al. , “Learning transferable visual models from natural
language supervision,” in International conference on
machine learning . PMLR, 2021, pp. 8748–8763.
[8] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping
language-image pre-training for unified vision-language
understanding and generation,” in International confer-
ence on machine learning . PMLR, 2022, pp. 12 888–
12 900.
[9] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrap-
ping language-image pre-training with frozen image
encoders and large language models,” in International
conference on machine learning . PMLR, 2023, pp.
19 730–19 742.
[10] W. Huang, C. Wang, R. Zhang, Y . Li, J. Wu, and L. Fei-
Fei, “V oxposer: Composable 3d value maps for robotic
manipulation with language models,” arXiv preprint
arXiv:2307.05973 , 2023.
[11] F. Liu, K. Fang, P. Abbeel, and S. Levine, “Moka:
Open-vocabulary robotic manipulation through
mark-based visual prompting,” arXiv preprint
arXiv:2403.03174 , 2024.
[12] W. Huang, C. Wang, Y . Li, R. Zhang, and L. Fei-
Fei, “Rekep: Spatio-temporal reasoning of relational
keypoint constraints for robotic manipulation,” arXiv
preprint arXiv:2409.01652 , 2024.
[13] C. Finn and S. Levine, “Deep visual foresight for plan-
ning robot motion,” in 2017 IEEE International Con-
ference on Robotics and Automation (ICRA) . IEEE,
2017, pp. 2786–2793.
[14] A. Nagabandi, K. Konolige, S. Levine, and V. Kumar,
“Deep dynamics models for learning dexterous manipulation,”
in Conference on Robot Learning. PMLR,
2020, pp. 1101–1112.
[15] P. Battaglia, R. Pascanu, M. Lai, D. Jimenez Rezende,
et al., “Interaction networks for learning about objects,
relations and physics,” Advances in neural information
processing systems, vol. 29, 2016.
[16] Y. Li, J. Wu, R. Tedrake, J. B. Tenenbaum, and A. Torralba,
“Learning particle dynamics for manipulating
rigid bodies, deformable objects, and fluids,” in ICLR,
2019.
[17] S. Karamcheti, D. Sadigh, and P. Liang, “Learning
adaptive language interfaces through decomposition,”
arXiv preprint arXiv:2010.05190 , 2020.
[18] V. Myers, B. C. Zheng, O. Mees, S. Levine, and
K. Fang, “Policy adaptation via language optimiza-
tion: Decomposing tasks for few-shot imitation,” arXiv
preprint arXiv:2408.16228 , 2024.
[19] A. Z. Ren, B. Govil, T.-Y. Yang, K. R. Narasimhan,
and A. Majumdar, “Leveraging language for accelerated
learning of tool manipulation,” in Conference on Robot
Learning . PMLR, 2023, pp. 1531–1541.
[20] T. Kollar, S. Tellex, D. Roy, and N. Roy, “Toward
understanding natural language directions,” in 2010 5th
ACM/IEEE International Conference on Human-Robot
Interaction (HRI) , 2010, pp. 259–266.
[21] D. K. Misra, J. Sung, K. Lee, and A. Saxena, “Tell
me dave: Context-sensitive grounding of natural lan-
guage to manipulation instructions,” in Proceedings of
Robotics: Science and Systems , Berkeley, USA, July
2014.
[22] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes,
B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman,
A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter,
A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth,
N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, K.-H. Lee,
S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor,
J. Quiambao, K. Rao, J. Rettinghouse, D. Reyes,
P. Sermanet, N. Sievers, C. Tan, A. Toshev, V. Vanhoucke,
F. Xia, T. Xiao, P. Xu, S. Xu, M. Yan, and
A. Zeng, “Do as I can and not as I say: Grounding
language in robotic affordances,” arXiv preprint
arXiv:2204.01691, 2022.
[23] J. Thomason, S. Zhang, R. J. Mooney, and P. Stone,
“Learning to interpret natural language commands
through human-robot dialog,” in Twenty-Fourth Inter-
national Joint Conference on Artificial Intelligence ,
2015.
[24] S. Tellex, T. Kollar, S. Dickerson, M. Walter, A. Baner-
jee, S. Teller, and N. Roy, “Understanding natural
language commands for robotic navigation and mobile
manipulation,” in Proceedings of the AAAI conference
on artificial intelligence , vol. 25, no. 1, 2011, pp. 1507–
1514.
[25] T. Kollar, S. Tellex, D. Roy, and N. Roy, “Grounding
verbs of motion in natural language commands to
robots,” in Experimental robotics: The 12th interna-
tional symposium on experimental robotics . Springer,
2014, pp. 31–47.
[26] H. Jiang, B. Huang, R. Wu, Z. Li, S. Garg, H. Nayyeri,
S. Wang, and Y. Li, “Roboexp: Action-conditioned
scene graph via interactive exploration for robotic ma-
nipulation,” arXiv preprint arXiv:2402.15487 , 2024.
[27] Y. Hu, F. Lin, T. Zhang, L. Yi, and Y. Gao, “Look
before you leap: Unveiling the power of gpt-4v
in robotic vision-language planning,” arXiv preprint
arXiv:2311.17842 , 2023.
[28] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman,
B. Ichter, P. Florence, and A. Zeng, “Code as policies:
Language model programs for embodied control,”
arXiv preprint arXiv:2209.07753, 2022.
[29] M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What
and where pathways for robotic manipulation,” in Pro-
ceedings of the 5th Conference on Robot Learning
(CoRL) , 2021.
[30] ——, “Perceiver-actor: A multi-task transformer for
robotic manipulation,” in Proceedings of the 6th Con-
ference on Robot Learning (CoRL) , 2022.
[31] E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert,
C. Lynch, S. Levine, and C. Finn, “Bc-z: Zero-shot
task generalization with robotic imitation learning,” in
Conference on Robot Learning . PMLR, 2022, pp. 991–
1002.
[32] P. Sharma, B. Sundaralingam, V. Blukis, C. Paxton,
T. Hermans, A. Torralba, J. Andreas, and D. Fox, “Cor-
recting robot plans with natural language feedback,”
RSS, 2022.
[33] W. Yu, N. Gileadi, C. Fu, S. Kirmani, K.-H. Lee,
M. Gonzalez Arenas, H.-T. Lewis Chiang, T. Erez,
L. Hasenclever, J. Humplik, B. Ichter, T. Xiao, P. Xu,
A. Zeng, T. Zhang, N. Heess, D. Sadigh, J. Tan,
Y. Tassa, and F. Xia, “Language to rewards for robotic
skill synthesis,” arXiv preprint arXiv:2306.08647, 2023.
[34] S. Nasiriany, F. Xia, W. Yu, T. Xiao, J. Liang, I. Das-
gupta, A. Xie, D. Driess, A. Wahid, Z. Xu, et al. , “Pivot:
Iterative visual prompting elicits actionable knowledge
for vlms,” arXiv preprint arXiv:2402.07872 , 2024.
[35] Y. Chen, R. Gandhi, Y. Zhang, and C. Fan, “Nl2tl:
Transforming natural languages to temporal log-
ics using large language models,” arXiv preprint
arXiv:2305.07766 , 2023.
[36] Y. Chen, J. Arkin, Y. Zhang, N. Roy, and C. Fan,
“Autotamp: Autoregressive task and motion planning
with llms as translators and checkers,” arXiv preprint
arXiv:2306.06531 , 2023.
[37] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu,
J. Tremblay, D. Fox, J. Thomason, and A. Garg, “Prog-
prompt: Generating situated robot task plans using
large language models,” in 2023 IEEE International
Conference on Robotics and Automation (ICRA) , 2023,
pp. 11 523–11 530.
[38] S. H. Vemprala, R. Bonatti, A. Bucker, and A. Kapoor,
“Chatgpt for robotics: Design principles and model
abilities,” IEEE Access , 2024.
[39] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence,
A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar,
P. Sermanet, N. Brown, T. Jackson, L. Luu, S. Levine,
K. Hausman, and B. Ichter, “Inner monologue: Embodied
reasoning through planning with language models,”
arXiv preprint arXiv:2207.05608, 2022.
[40] M. Minderer, A. Gritsenko, A. Stone, M. Neu-
mann, D. Weissenborn, A. Dosovitskiy, A. Mahendran,
A. Arnab, M. Dehghani, Z. Shen, et al. , “Simple open-
vocabulary object detection,” in European Conference
on Computer Vision . Springer, 2022, pp. 728–755.
[41] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang,
C. Li, J. Yang, H. Su, J. Zhu, et al. , “Grounding dino:
Marrying dino with grounded pre-training for open-
set object detection,” arXiv preprint arXiv:2303.05499 ,
2023.
[42] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland,
L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo,
et al., “Segment anything,” in Proceedings of
the IEEE/CVF International Conference on Computer
Vision , 2023, pp. 4015–4026.
[43] H. Huang, F. Lin, Y. Hu, S. Wang, and Y. Gao,
“Copa: General robotic manipulation through spatial
constraints of parts with foundation models,” arXiv
preprint arXiv:2403.08248 , 2024.
[44] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis,
C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog,
J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson,
S. Jesmonth, N. Joshi, R. Julian, D. Kalashnikov,
Y. Kuang, I. Leal, K.-H. Lee, S. Levine, Y. Lu,
U. Malla, D. Manjunath, I. Mordatch, O. Nachum,
C. Parada, J. Peralta, E. Perez, K. Pertsch, J. Quiambao,
K. Rao, M. Ryoo, G. Salazar, P. Sanketi, K. Sayed,
J. Singh, S. Sontakke, A. Stone, C. Tan, H. Tran,
V. Vanhoucke, S. Vega, Q. Vuong, F. Xia, T. Xiao,
P. Xu, S. Xu, T. Yu, and B. Zitkovich, “Rt-1: Robotics
transformer for real-world control at scale,” arXiv
preprint arXiv:2212.06817, 2022.
[45] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar,
X. Chen, K. Choromanski, T. Ding, D. Driess,
A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas,
K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog,
J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian,
D. Kalashnikov, Y. Kuang, I. Leal, L. Lee, T.-W. E.
Lee, S. Levine, Y. Lu, H. Michalewski, I. Mordatch,
K. Pertsch, K. Rao, K. Reymann, M. Ryoo, G. Salazar,
P. Sanketi, P. Sermanet, J. Singh, A. Singh, R. Soricut,
H. Tran, V. Vanhoucke, Q. Vuong, A. Wahid, S. Welker,
P. Wohlhart, J. Wu, F. Xia, T. Xiao, P. Xu, S. Xu, T. Yu,
and B. Zitkovich, “Rt-2: Vision-language-action models
transfer web knowledge to robotic control,” arXiv
preprint arXiv:2307.15818, 2023.
[46] A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog,
A. Irpan, A. Khazatsky, A. Rai, A. Singh, A. Brohan,
et al. , “Open x-embodiment: Robotic learning datasets
and rt-x models,” arXiv preprint arXiv:2310.08864 ,
2023.
[47] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna,
S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama,
L. Y. Chen, K. Ellis, et al., “Droid: A large-scale
in-the-wild robot manipulation dataset,” arXiv preprint
arXiv:2403.12945, 2024.
[48] J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao,
“Set-of-mark prompting unleashes extraordinary visual
grounding in gpt-4v,” arXiv preprint arXiv:2310.11441 ,
2023.
[49] M. Cai, H. Liu, S. K. Mustikovela, G. P. Meyer, Y. Chai,
D. Park, and Y. J. Lee, “Making large multimodal models
understand arbitrary visual prompts,” arXiv preprint
arXiv:2312.00784, 2023.
[50] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr,
Y. Hasson, K. Lenc, A. Mensch, K. Millican,
M. Reynolds, et al. , “Flamingo: a visual language
model for few-shot learning,” Advances in neural infor-
mation processing systems , vol. 35, pp. 23 716–23 736,
2022.
[51] A. Bar, Y. Gandelsman, T. Darrell, A. Globerson, and
A. Efros, “Visual prompting via image inpainting,”
Advances in Neural Information Processing Systems ,
vol. 35, pp. 25 005–25 017, 2022.
[52] Y. Yang, H. Peng, Y. Shen, Y. Yang, H. Hu, L. Qiu,
H. Koike, et al. , “Imagebrush: Learning visual in-
context instructions for exemplar-based image manip-
ulation,” Advances in Neural Information Processing
Systems , vol. 36, 2024.
[53] F. Li, Q. Jiang, H. Zhang, T. Ren, S. Liu, X. Zou, H. Xu,
H. Li, J. Yang, C. Li, et al. , “Visual in-context prompt-
ing,” in Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition , 2024, pp.
12 861–12 871.
[54] Y. Zhang, K. Zhou, and Z. Liu, “What makes good
examples for visual in-context learning?” Advances in
Neural Information Processing Systems , vol. 36, pp.
17 773–17 794, 2023.
[55] L. Zha, Y. Cui, L.-H. Lin, M. Kwon, M. G. Arenas,
A. Zeng, F. Xia, and D. Sadigh, “Distilling and retriev-
ing generalizable knowledge for robot manipulation via
language corrections,” in 2024 IEEE International Con-
ference on Robotics and Automation (ICRA) . IEEE,
2024, pp. 15 172–15 179.
[56] G. Williams, A. Aldrich, and E. Theodorou, “Model
predictive path integral control using covariance
variable importance sampling,” arXiv preprint
arXiv:1509.01149 , 2015.