Paper 2503.10546

KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation

Authors: Zixian Liu, Mingtong Zhang, Yunzhu Li

Published: 2025-03-13

Abstract:

With the rapid advancement of large language models (LLMs) and vision-language models (VLMs), significant progress has been made in developing open-vocabulary robotic manipulation systems. However, many existing approaches overlook the importance of object dynamics, limiting their applicability to more complex, dynamic tasks. In this work, we introduce KUDA, an open-vocabulary manipulation system that integrates dynamics learning and visual prompting through keypoints, leveraging both VLMs and learning-based neural dynamics models. Our key insight is that a keypoint-based target specification is simultaneously interpretable by VLMs and can be efficiently translated into cost functions for model-based planning. Given language instructions and visual observations, KUDA first assigns keypoints to the RGB image and queries the VLM to generate target specifications. These abstract keypoint-based representations are then converted into cost functions, which are optimized using a learned dynamics model to produce robotic trajectories. We evaluate KUDA on a range of manipulation tasks, including free-form language instructions across diverse object categories, multi-object interactions, and deformable or granular objects, demonstrating the effectiveness of our framework. The project page is available at http://kuda-dynamics.github.io.

Paper Content:
KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation

Zixian Liu1∗, Mingtong Zhang2∗, and Yunzhu Li3
∗ denotes equal contribution. 1 Tsinghua University, 2 University of Illinois Urbana-Champaign, 3 Columbia University

[Figure 1: system overview. Example instruction: "Collect all the cubes together." Stages: Large Vision Model, Vision Language Model, Code as Target Specification, Code Parsing, Dynamics Planning. Example categories: Multi-Object Interaction, Rigid Object, Rope, Granular.]

Fig. 1: KUDA is an open-vocabulary manipulation system that uses keypoints to unify the visual prompting of vision language models (VLMs) and dynamics modeling. Taking the RGBD observation and the language instruction as inputs, KUDA samples keypoints in the environment, then uses a VLM to generate code specifying keypoint-based target specifications. These keypoints are translated into a cost function for model-based planning with learned dynamics models, enabling open-vocabulary manipulation across various object categories.

I. INTRODUCTION

Creating an open-vocabulary robotic system capable of executing tasks based on human language in diverse environments has been a longstanding goal. However, human language is inherently abstract and often ambiguous, requiring contextual knowledge and the ability to ground language inputs in the environments where the robots operate. Recent advances in large language models (LLMs) and vision-language models (VLMs) [1–9] have demonstrated advanced capabilities in text and image understanding. These developments have paved new paths for incorporating such models into robotic systems [10–12]. However, many existing open-vocabulary robotic systems rely heavily on VLMs and LLMs for guidance at all levels and do not provide an explicit account of object dynamics. As a result, they typically focus only on rigid objects and coarse-grained manipulation, limiting their applicability to more complex and dynamic tasks involving diverse object categories, such as deformable objects and object piles.

On the other hand, learning-based dynamics models have shown the ability to model the complex behaviors of real-world objects directly from observation data [13–16]. These models can accurately predict the future states of objects with varying object categories and shapes, accounting for different interactions. However, model-based planning with dynamics models typically requires a pre-defined target state or cost function, which cannot be directly inferred from high-level language instructions.
This raises a key question: How can we develop open-vocabulary manipulation systems that harness the flexibility of task specification via VLMs while preserving the benefits of model-based planning?

Our key insight is to develop a unified keypoint representation that integrates dynamics learning and visual prompting through VLMs. Keypoints are intuitive for vision-language models to interpret and express, while also being precisely defined and easily translatable into a cost function for planning with dynamics models. To achieve this, we propose defining the objective function for planning using keypoints and employing mark-based visual prompting, inspired by [11], to enable the VLM to generate code that specifies the objective as arithmetic relationships between the visual keypoints.

arXiv:2503.10546v1 [cs.RO] 13 Mar 2025

Building on our keypoint-based target specifications, we introduce KUDA: an open-vocabulary manipulation system that utilizes Keypoints to Unify Dynamics learning and visuAl prompting. KUDA employs an upstream VLM along with a pre-trained dynamics model, using keypoints as a shared intermediate representation. Given a language instruction for the manipulation task and the current visual observation of the experimental setup, KUDA automatically samples and labels keypoints from the RGB image. The VLM is then prompted to generate target specifications, which are subsequently converted into a cost function. During robot execution, a two-level closed-loop control mechanism ensures effective and robust model-based planning. Notably, we found that incorporating few-shot examples significantly enhances the performance of the VLM. Inspired by this, we developed a prompt library and a retrieval mechanism based on score matching, ensuring high-quality few-shot examples without exceeding input token limits.
In summary, our contributions are as follows:
• We propose using keypoints as a unified intermediate representation to bridge dynamics learning and visual prompting through VLMs.
• We design a prompt retriever that automatically subsamples from our prompt library based on the task description, while ensuring it stays within the context window of the VLMs.
• We evaluate the integrated system in real-world manipulation tasks, demonstrating state-of-the-art performance on tasks involving diverse object materials, such as ropes and granular objects (Fig. 1), and covering a range of language instructions.

II. RELATED WORK

A. Grounding Language Instructions

Using human language to instruct intelligent robots has been an active research domain. However, most works [17–19] mainly focus on decomposing high-level language instructions into subtasks. Grounding ambiguous human language into structured action sequences that robots can execute remains a significant challenge [20–22]. Most existing approaches use action primitives as the basic elements for planning. These methods either employ classical techniques such as lexical analysis, formal logic, and graphical models [20, 23–25], or leverage pre-trained large models to comprehend instructions and generate task plans [22, 26–28]. However, the reliance on pre-defined motion primitives is often seen as a major limitation for developing a universal manipulation system [10].

Many recent works have also focused on grounding language into lower-level actions. These approaches include language-conditioned imitation learning [29–31], synthesizing value maps or reward functions [10, 12, 32, 33], and generating motions through visual prompting [11, 34]. In our work, we propose a new paradigm to ground natural language instructions into keypoint target configurations for model-based planning with learned dynamics models.

B. Foundation Models in Robotics

The remarkable success of large language models (LLMs) has recently attracted significant interest in the field of robotics. Recent works have explored how to effectively integrate these powerful models into robotic systems. Some approaches leverage LLMs for high-level task planning [35, 36], while others have demonstrated that LLMs excel at generating code for robot control [28, 37, 38]. Another application of LLMs is in synthesizing value functions or reward functions [10, 33, 39]. However, these approaches typically require converting both the task and observations into textual form, which often leads to suboptimal performance in real-world manipulation scenarios.

Recently, vision-language models (VLMs) have attracted significant attention. In addition to replacing previous LLMs with powerful VLMs, such as GPT-4V [6], to achieve better grounding in real-world scenarios, VLMs have been applied in various contexts. Some pre-trained VLMs have shown superior perception capabilities [40–42], while others have been used to generate affordances or constraints based on visual prompts [11, 12, 43]. However, despite their capabilities, these advanced VLMs lack a good understanding of 3D spatial relationships and object dynamics, making them ineffective in more complicated manipulation tasks.

Another recent trend focuses on collecting large-scale robotics data and training end-to-end general-purpose models for robotic tasks [44–47]. However, many works in this direction focus primarily on rigid object manipulation and cannot perform complex manipulation across various object categories due to a lack of knowledge of object dynamics. Additionally, collecting data and training models for such works is highly resource-intensive, and their performance remains limited.
In our work, we leverage a state-of-the-art VLM to generate the target specifications for the neural dynamics models, which enables a flexible and robust framework that facilitates model-based planning.

C. Visual Prompting and In-Context Learning

Visual prompting is an emerging technique that has attracted much attention with the development of large vision-language models. This technique enhances the visual grounding capabilities of these foundation models by incorporating visual markers into observations. Yang et al. [48] demonstrated that overlaying the original image with its semantic segmentation enables GPT-4V to answer questions requiring visual grounding. Cai et al. [49] introduced ViP-LLaVA, a model capable of decoding visual prompts that include various types of visual markers. Others have used visual prompts generated by predefined rules to produce affordances for robotic planning [11, 34].

In-context learning is a relatively novel paradigm in the vision domain. It allows a pre-trained model to efficiently adapt to a novel downstream task using a few input-output examples, without requiring fine-tuning. One of the earliest works in this area is Flamingo [50], a vision-language model (VLM) that is capable of learning from few-shot examples. Bar et al. [51] approach the image-to-image in-context learning problem as an image inpainting task, while Yang et al. [52] introduce in-context learning for image editing. Li et al. [53] offer a general method for in-context learning in segmentation tasks. Additionally, Zhang et al. [54] provide a comprehensive study on how to select effective examples for visual in-context learning. In our work, we utilize a set of well-designed prompting formulations to guide the vision language model in generating keypoint-based target specifications. Additionally, inspired by [55], we introduce a prompt retriever to ensure the quality of few-shot examples.

III. METHOD

We first provide the problem formulation. Then we describe how we prompt the VLM to generate the keypoint-based target specification. Next, we present the design of the prompt retriever and the Top-K prompt library. Lastly, we specify how we obtain the cost function from the target specification and how we perform model-based planning with neural dynamics models, along with the two-level closed-loop control mechanism. The overview of our KUDA framework is shown in Fig. 2.

A. Problem Formulation

We consider a tabletop manipulation problem defined by a free-form language instruction $L$ that pertains to one or multiple objects $O$, along with a pre-trained neural dynamics model $f$. The goal is to optimize the robot's action trajectory $\tau$ to perform the manipulation task described by $L$. The instruction $L$ is typically abstract and requires interpretation that incorporates both common sense and the context of the experiment configurations. The dynamics model $f$ is able to predict how the object set $O$ will change when a specific action is applied:

$$\hat{z}_{t+1} = f(z_t, u_t), \tag{1}$$

where $z_t$ and $z_{t+1}$ represent the current state of the environment at time $t$ and the subsequent state at time $t+1$, respectively, $\hat{z}_{t+1}$ is the predicted state at time $t+1$, and $u_t$ is the applied action. The environment state describes the positions of the 3D points $o_i = (x_i, y_i, z_i)$ that represent the objects (or parts of an object) in $O$.

The key problem here is how to define the objectives for optimization from the ambiguous instruction $L$. We leverage VLMs to propose a keypoint-based representation, from which we obtain our cost function. Once the cost function $C(\cdot \mid L)$ is obtained from the instruction $L$, the problem can be formulated as an optimization problem over the robot's action trajectory $\tau$:

$$\tau = \arg\min_{\tau} C(z' \mid L), \tag{2}$$

where $z'$ represents the final state of the environment after the action sequence $\tau$ has been applied.

B. Keypoint-Based Target Specification

Inspired by [11], we employ visual markers to enhance the visual grounding capabilities of vision-language models (VLMs). A key insight in our framework is that a wide range of manipulation tasks can be effectively described by the spatial relationships between keypoints on the objects to be manipulated and reference points within the environment. For instance, the task of straightening a rope on a table can be articulated as pulling one end of the rope to the left side of the center of the table while pulling the other end to the right. Notably, we found that VLMs are highly proficient at generating such spatial relationships when provided with appropriate visual prompts.

Specifically, we first segment all semantic masks from the RGB observation utilizing Segment Anything [42]. Next, we perform farthest point sampling (FPS) on these masks to obtain keypoints and reference points, which we then label on the original image. We prompt the VLM to select keypoints and, for each keypoint $k$, to specify a target position $p$ that describes the desired final state of the object set $O$ upon task completion. The target position $p$ is determined by its spatial offset from a reference point, as provided by the VLM through a set of code assignment statements of the form p = r + [dx, dy, dz], which denotes that the target position equals the position of the reference point r plus the offset [dx, dy, dz]. We refer to each $(k, p)$ pair as a target specification. It is important to note that these target specifications do not have to be strictly satisfied after executing the task, as the vision-language model may occasionally generate infeasible specifications due to its lack of knowledge of object dynamics. Despite this, we found that optimizing for approximate target specifications is generally sufficient to successfully complete the instruction. For a more detailed prompting process, please refer to the appendix on our project website.

C. Top-K Prompt Library

In our experiments, we noticed that providing a few examples of similar tasks significantly improved the performance of the VLM. As a result, we collected a diverse set of examples that cover all the object categories used in our experiments. However, existing VLMs, such as GPT-4V, often limit the number of images they can take as input. Additionally, images consume a substantial number of input tokens, which can be costly and affect the efficiency of the program. To address this, we developed a prompt retriever that selects optimal examples from the prompt library through score matching. Specifically, we employ CLIP [7] to encode both the input observation $s$ and instruction $L$, as well as all the examples represented by tuples $(q_i, \mathrm{obs}_i, r_i)$ in the prompt library $P$, where $q_i$ is the text query, $\mathrm{obs}_i$ is the corresponding observation in the example, and $r_i$ is the response provided by a human expert. We then normalize these latent vectors and obtain similarities by dot product. The matching score $S_i$ is calculated as a weighted sum of the image and text similarities:

$$S_i = \frac{f_I(s)}{|f_I(s)|} \cdot \frac{f_I(\mathrm{obs}_i)}{|f_I(\mathrm{obs}_i)|} + \lambda \, \frac{f_T(L)}{|f_T(L)|} \cdot \frac{f_T(q_i)}{|f_T(q_i)|}, \tag{3}$$

where $f_I$ and $f_T$ represent the image and text encoders of CLIP, respectively, and $\lambda$ is a hyperparameter set to 0.6. The top $K$ examples by score are selected and incorporated into our prompt.

[Figure 2: pipeline diagram. Example instruction: "Divide the cubes of different colors into two pieces." Stages: RGBD Observation, Large Vision Model, Vision Language Model (with Prompt Library), 3D Objectives, Neural Dynamics Model-Based Planning, Robot Execution, connected by a Low-Level Closed Loop and a High-Level Closed Loop. Keypoints P1–P6 and the reference point C are marked on the image. Code generated by the VLM in the example:]

    def keypoint_specification():
        # Move red cubes to the left
        p1 = C + [-20, 0, 0]
        p3 = C + [-20, 0, 0]
        p5 = C + [-20, 0, 0]
        # Move purple cubes to the right
        p2 = C + [20, 0, 0]
        p4 = C + [20, 0, 0]
        p6 = C + [20, 0, 0]
        return p1, p2, p3, p4, p5, p6

Fig. 2: Overview of KUDA. Taking the RGBD observations and a language instruction as inputs, we first utilize the large vision model to obtain the keypoints and label them on the RGB image to obtain the visual prompt (green dot C marks the center reference point). Next, the vision-language model generates code for target specifications, which are projected into 3D space to construct the 3D objectives. Lastly, we utilize the pre-trained dynamics model for model-based planning. After a certain number of actions, the VLM is re-queried with the current observation, enabling high-level closed-loop planning to correct VLM and execution errors.

D. Two-Level Closed-Loop Planning

In our framework, we propose two-level closed-loop planning to improve the robustness and effectiveness of the manipulation tasks. At the low-level closed loop of model-based planning, once the target specifications are obtained, the next step is to translate them into optimization objectives. We start by projecting the keypoints and their corresponding targets into 3D space. Next, we extract the object points $o_i$ from the objects' point clouds and align each keypoint with its nearest object point.
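The projection and alignment steps just described can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`) and a depth image aligned to the RGB frame, none of which are specified in the paper. The helper names `backproject`, `align_to_nearest`, and `keypoint_cost` are hypothetical, and the last one evaluates the sum-of-distances objective that the text defines next (Eq. 4).

```python
import numpy as np

def backproject(pixels, depth, fx, fy, cx, cy):
    """Lift (u, v) pixel keypoints to 3D camera-frame points (assumed pinhole model)."""
    pixels = np.asarray(pixels, dtype=int)
    u, v = pixels[:, 0], pixels[:, 1]
    z = depth[v, u]                      # depth image is indexed as [row, col]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def align_to_nearest(keypoints_3d, object_points):
    """For each 3D keypoint, return the index of its nearest object point."""
    diff = keypoints_3d[:, None, :] - object_points[None, :, :]
    dists = np.linalg.norm(diff, axis=2)  # shape: (num_keypoints, num_object_points)
    return dists.argmin(axis=1)

def keypoint_cost(object_points, indices, targets):
    """Sum of Euclidean distances between the aligned object points and their 3D targets."""
    return float(np.linalg.norm(object_points[indices] - targets, axis=1).sum())

# Toy example with made-up numbers (not data from the paper).
depth = np.full((4, 4), 2.0)
kp3d = backproject([(2, 2), (0, 0)], depth, fx=1.0, fy=1.0, cx=2.0, cy=2.0)
object_points = np.array([[0.1, 0.0, 2.0], [-4.0, -3.9, 2.0], [5.0, 5.0, 2.0]])
idx = align_to_nearest(kp3d, object_points)
targets = np.array([[0.1, 0.0, 2.0], [-4.0, -3.9, 5.0]])
cost = keypoint_cost(object_points, idx, targets)
print(idx, cost)
```

The brute-force pairwise distance matrix is fine for a handful of keypoints; for dense point clouds a spatial index such as a k-d tree would be the more usual choice.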
The objective is then defined as the sum of the Euclidean distances between keypoints and their corresponding targets:

$$C(z \mid L) = \sum_i \| o_i - p_i \|_2 \quad (o_i \in z), \tag{4}$$

where $p_i$ represents the 3D target of point $o_i$, and the summation is performed over all target specifications. Using the cost in Eqn. 4 as the objective of Eqn. 2, together with the dynamics model $f$, we employ the Model Predictive Path Integral (MPPI) [56] algorithm to determine the action to be executed.

However, for specific object categories such as granular pieces, a limited number of keypoints is insufficient to accurately describe the target shape specified by the instruction. Thus, we perform high-level closed-loop re-planning with the VLM, re-prompting it with the current observation and instruction to update the target specifications after a series of actions within a loop. This two-level closed-loop planning framework corrects imperfect target specifications and execution errors, keeping the system robust even in the presence of external disturbances.

IV. EXPERIMENTS

The main purpose of our experiments is to verify and analyze the ability of our system to perform a variety of tasks on different objects given various language instructions or experiment configurations. We aim to answer the following research questions:
(1) How well does our system generalize to diverse text instructions and visual scenarios?
(2) How does our framework handle complex manipulation tasks across various object categories?
(3) How does each module contribute to the system's failure cases?

We conduct qualitative evaluations on a diversified set of tasks to demonstrate the effectiveness of our system. To highlight its model-based planning capabilities with neural dynamics models, we compare our framework against two baselines. Additionally, we provide a component-wise experiment error breakdown for a comprehensive analysis of our framework's effectiveness.
Furthermore, as an ablation study, we examine the impact of the hyperparameter $K$ in the top-K prompt library on in-context learning.

To demonstrate the flexibility of our system across various objects, we train the neural dynamics models on 4 different object categories: rope, cubes, granular pieces, and a T-shaped block. The first three categories utilize graph-based neural dynamics models, whereas the T-shaped block employs a state-based neural dynamics model trained using a multilayer perceptron network.

[Figure 3: rollouts shown as Initial State, Target Spec, and Robot Action for five instructions: "Move the cubes to the sticker of the pink color"; "Move the cubes to the sticker that writes 'pink'"; "Move the ends of the rope to form a fraction 1/9 on the table"; "Move the T shape to make a word 'ROBOT'"; "Collect the coffee beans into the square".]

Fig. 3: Qualitative Results of the Rollouts. We show the initial state, the target specification visualization, and the robot executions for various tasks on different objects to demonstrate the performance of our framework. Note that the granular collection task is shown over two VLM-level loops to illustrate how our VLM-level closed-loop control works.

We compare our system with MOKA [11] and VoxPoser [10] in a tabletop environment. These two baselines also enable open-vocabulary manipulation and in-context few-shot learning. MOKA builds a framework that prompts a VLM to directly generate motion, and VoxPoser uses an LLM to synthesize a 3D voxel map as affordance. To ensure a fair comparison, the prompts and few-shot examples of these 2 systems are adapted to suit our tasks. To avoid possible overfitting in few-shot learning, we ensure that no example in the prompt library is exactly the same as the task.
All vision language models and large language models used in our system and the two baselines are GPT-4o.

A. Qualitative Results

In Fig. 3, we present 5 tasks featuring different text instructions and visual scenarios, along with their corresponding target specifications generated by our system and the robot's executions. In each case, our system generates precise target specifications aligned with the language instructions, and the robot executes the tasks effectively. Notably, two tasks involving cubes start with similar initial configurations but are given instructions that are easy to confuse. The VLM distinguished their semantics and provided precise target specifications, demonstrating its strong visual understanding capabilities. The coffee bean collection task highlights the benefits of our VLM-level closed-loop planning. Initially, the target specifications were too sparse to fully manipulate the coffee bean pile, leaving a few unspecified beans outside the square after multiple actions. Our system identified these errors and corrected them in the subsequent loop. These examples demonstrate the flexibility of our framework across a wide variety of instructions and environment configurations.

B. Quantitative Results

We compare our system with two baselines on a total of six tasks across the four object categories. The results are shown in Tab. I. For each task, we evaluate success by measuring the Chamfer distance between the point clouds of the objects after a specified number of robot actions are applied and the corresponding target point clouds. The quantitative results are reported as the number of successful trials out of 10. The text instructions for all evaluation tasks are listed in Tab. II.
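For readers unfamiliar with the metric, a symmetric Chamfer distance between two point clouds can be computed as in the sketch below. The paper does not state which variant (squared or unsquared, sum or mean) or which success threshold it uses, so the mean-of-nearest-neighbor-distances form and the `threshold` argument of the hypothetical `is_success` helper are assumptions made purely for illustration.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance a->b plus b->a."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def is_success(final_points, target_points, threshold):
    """Hypothetical success test: Chamfer distance below a chosen threshold."""
    return bool(chamfer_distance(final_points, target_points) < threshold)

# Toy point clouds (made-up values, not data from the paper).
final_cloud = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
target_cloud = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
cd = chamfer_distance(final_cloud, target_cloud)
print(cd, is_success(final_cloud, target_cloud, threshold=1.5))
```

Because the distance is computed in both directions, it penalizes both object points that are far from the target shape and target regions that no object point has reached.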
Methods               MOKA [11]   VoxPoser [10]   Ours
Rope Straightening    2/10        0/10            8/10
Cube Collection       0/10        3/10            6/10
Cube Movement         6/10        3/10            10/10
Granular Collection   0/10        1/10            10/10
Granular Movement     0/10        1/10            6/10
T Movement            0/10        0/10            8/10
Total                 13.3%       13.3%           80.0%

TABLE I: Quantitative results of our evaluation. Our method achieved higher performance than the two baselines on every evaluation task, while the failures in Cube Collection and Granular Movement were primarily caused by perception.

As demonstrated in the quantitative evaluation results, our system outperforms these existing methods by a large margin on the evaluation tasks. The main reason for this performance gap is that the existing methods ignore the fine-grained representation of objects or actions and lack knowledge of object dynamics, which limits their capability to manipulate deformable objects or rigid objects of more complex shapes. In contrast, our system utilizes the keypoint-based representation for visual prompting and dynamics learning. This enables our system to predict the future state of those complex objects and perform model-based planning for manipulation.

[Figure 4: failure-mode flow over all trials.
  Total trials: 60
  VLM provides under-specified targets: 2
  Keypoint Specification Success: 58
  Camera provides bad depth: 1
  Perception Error: 6 (objects overlap with other shapes: 3; fail to detect one of two adjacent cubes: 3)
  Grounding Success: 51
  Fail to track dense object pile (coffee beans): 2
  Tracking Success: 49
  Dynamics model fails to predict correct states: 1
  Manipulation Success: 48]

Fig. 4: Visualizations of Error Breakdown. We provide a detailed breakdown of each failure mode, marked in red. While we achieved an 80% success rate across 60 trials for various tasks, the primary cause of failure was perception errors, accounting for 10% of all trials and 50% of the failure cases.
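As a quick consistency check, the Total row of Table I and the perception-error shares quoted in the Fig. 4 caption can be recomputed from the reported counts. The dictionary layout and variable names below are only an illustrative convention; the numbers themselves are copied from Table I and Fig. 4.

```python
# Successful trials out of 10 per task, copied from Table I
# (order: Rope Straightening, Cube Collection, Cube Movement,
#  Granular Collection, Granular Movement, T Movement).
successes = {
    "MOKA":     [2, 0, 6, 0, 0, 0],
    "VoxPoser": [0, 3, 3, 1, 1, 0],
    "Ours":     [8, 6, 10, 10, 6, 8],
}
trials_per_task = 10

# Overall success rate per method, in percent, rounded to one decimal.
totals = {}
for method, counts in successes.items():
    n_trials = trials_per_task * len(counts)
    totals[method] = round(100.0 * sum(counts) / n_trials, 1)

# Perception-error shares for KUDA, using the count of 6 reported in Fig. 4.
total_trials = trials_per_task * len(successes["Ours"])
failures = total_trials - sum(successes["Ours"])
perception_errors = 6
share_of_trials = 100.0 * perception_errors / total_trials
share_of_failures = 100.0 * perception_errors / failures

print(totals)
print(failures, share_of_trials, share_of_failures)
```

Both baselines succeed in 8 of 60 trials, which is why their Total entries coincide even though their per-task results differ.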
Task                  Instruction
Rope Straightening    “Straighten the rope.”
Cube Collection       “Move all the cubes to the pink cross.”
Cube Movement         “Move the yellow cube to the red cross.”
Granular Collection   “Collect all the coffee beans together.”
Granular Movement     “Move all the coffee beans to the red cross.”
T Movement            “Move the orange T into the pink square.”

TABLE II: The input instructions of each evaluation task.

C. Error Breakdown

We conducted a manual analysis of the failure cases encountered during our experiments, as shown in Fig. 4. As illustrated, 1) the perception module accounted for the majority of errors. This includes instances where the module failed to detect objects, particularly when they overlapped with other shapes, such as a cross sign on the table, or failed to distinguish between two adjacent cubes. 2) The second most significant source of error stemmed from the target specification and the tracking module. Errors in target specification typically arose from the VLM providing under-specified targets, i.e., an insufficient number of target specifications to complete the task. The tracking module commonly failed when tracking keypoints in dense object piles, such as coffee beans. 3) Additionally, a smaller proportion of failures were attributed to the dynamics model and hardware, where the dynamics model occasionally produced inaccurate predictions, and the RGBD camera sometimes provided inaccurate depth values.

D. Ablation of Top-K Prompt Library

                Category   Top-0   Top-1   Top-3   Top-5
Success Rate    10/10      2/10    3/10    10/10   7/10

TABLE III: The quantitative results of different K values.

In this ablation, we evaluate the success rate on the Rope Straightening task for different $K$ values, as well as a special prompting method labeled “Category”, where examples are manually selected by a human expert. The results are presented in Tab. III. Notably, when $K = 3$, the prompt retriever achieves performance on par with that of a human expert.
However, increasing $K$ further introduces less relevant examples, which lowers the overall prompt quality and results in a reduced success rate for the prompt retriever when $K = 5$.

V. CONCLUSION & LIMITATIONS

In this work, we propose KUDA, a novel open-vocabulary robotic manipulation system that unifies the visual prompting of vision language models and dynamics learning through a keypoint-based representation. Using this flexible representation, KUDA leverages the vision language model to handle various high-level human language instructions, while using model-based planning with dynamics models to generate robot actions. This allows it to perform complex manipulation tasks on various object categories under different language instructions. Our experiments demonstrate the effectiveness and versatility of our framework.

However, KUDA also has several limitations. First, it uses a top-view camera to capture visual observations for the vision language model, limiting its ability to perform tasks with more complex 3D spatial relationships. Second, the dynamics models in our work are trained in simulation, leading to an inevitable sim-to-real gap and limited generalization to different object categories. We believe that future progress in related fields will eventually resolve these problems. We hope that KUDA will inspire more future work on incorporating knowledge of dynamics into more versatile robotic systems.

VI. ACKNOWLEDGEMENT

This work is partially supported by the Toyota Research Institute (TRI), the Sony Group Corporation, and Google. This article solely reflects the opinions and conclusions of its authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.

REFERENCES

[1] J. Devlin, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[2] A. Radford, “Improving language understanding by generative pre-training,” 2018.
[3] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[4] T. B. Brown, “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
[5] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., “Palm: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
[6] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[7] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
[8] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International conference on machine learning. PMLR, 2022, pp. 12888–12900.
[9] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in International conference on machine learning. PMLR, 2023, pp. 19730–19742.
[10] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei, “Voxposer: Composable 3d value maps for robotic manipulation with language models,” arXiv preprint arXiv:2307.05973, 2023.
[11] F. Liu, K. Fang, P. Abbeel, and S. Levine, “Moka: Open-vocabulary robotic manipulation through mark-based visual prompting,” arXiv preprint arXiv:2403.03174, 2024.
[12] W. Huang, C. Wang, Y. Li, R. Zhang, and L. Fei-Fei, “Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,” arXiv preprint arXiv:2409.01652, 2024.
[13] C. Finn and S. Levine, “Deep visual foresight for planning robot motion,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 2786–2793.
[14] A. Nagabandi, K. Konolige, S. Levine, and V. Kumar, “Deep dynamics models for learning dexterous manipulation,” in Conference on Robot Learning. PMLR, 2020, pp. 1101–1112.
[15] P. Battaglia, R. Pascanu, M. Lai, D. Jimenez Rezende, et al., “Interaction networks for learning about objects, relations and physics,” Advances in neural information processing systems, vol. 29, 2016.
[16] Y. Li, J. Wu, R. Tedrake, J. B. Tenenbaum, and A. Torralba, “Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids,” in ICLR, 2019.
[17] S. Karamcheti, D. Sadigh, and P. Liang, “Learning adaptive language interfaces through decomposition,” arXiv preprint arXiv:2010.05190, 2020.
[18] V. Myers, B. C. Zheng, O. Mees, S. Levine, and K. Fang, “Policy adaptation via language optimization: Decomposing tasks for few-shot imitation,” arXiv preprint arXiv:2408.16228, 2024.
[19] A. Z. Ren, B. Govil, T.-Y. Yang, K. R. Narasimhan, and A. Majumdar, “Leveraging language for accelerated learning of tool manipulation,” in Conference on Robot Learning. PMLR, 2023, pp. 1531–1541.
[20] T. Kollar, S. Tellex, D. Roy, and N. Roy, “Toward understanding natural language directions,” in 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2010, pp. 259–266.
[21] D. K. Misra, J. Sung, K. Lee, and A. Saxena, “Tell me dave: Context-sensitive grounding of natural language to manipulation instructions,” in Proceedings of Robotics: Science and Systems, Berkeley, USA, July 2014.
[22] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J.
Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, K.- H. Lee, S. Levine, Y . Lu, L. Luu, C. Parada, P. Pas- tor, J. Quiambao, K. Rao, J. Rettinghouse, D. Reyes, P. Sermanet, N. Sievers, C. Tan, A. Toshev, V . Van- houcke, F. Xia, T. Xiao, P. Xu, S. Xu, M. Yan, and A. Zeng, “Do as i can and not as i say: Grounding language in robotic affordances,” in arXiv preprint arXiv:2204.01691 , 2022. [23] J. Thomason, S. Zhang, R. J. Mooney, and P. Stone, “Learning to interpret natural language commands through human-robot dialog,” in Twenty-Fourth Inter- national Joint Conference on Artificial Intelligence , 2015. [24] S. Tellex, T. Kollar, S. Dickerson, M. Walter, A. Baner- jee, S. Teller, and N. Roy, “Understanding natural language commands for robotic navigation and mobile manipulation,” in Proceedings of the AAAI conference on artificial intelligence , vol. 25, no. 1, 2011, pp. 1507– 1514. [25] T. Kollar, S. Tellex, D. Roy, and N. Roy, “Grounding verbs of motion in natural language commands to robots,” in Experimental robotics: The 12th interna- tional symposium on experimental robotics . Springer, 2014, pp. 31–47. [26] H. Jiang, B. Huang, R. Wu, Z. Li, S. Garg, H. Nayyeri, Page 8: S. Wang, and Y . Li, “Roboexp: Action-conditioned scene graph via interactive exploration for robotic ma- nipulation,” arXiv preprint arXiv:2402.15487 , 2024. [27] Y . Hu, F. Lin, T. Zhang, L. Yi, and Y . Gao, “Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning,” arXiv preprint arXiv:2311.17842 , 2023. [28] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, “Code as policies: Language model programs for embodied control,” in arXiv preprint arXiv:2209.07753 , 2022. [29] M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What and where pathways for robotic manipulation,” in Pro- ceedings of the 5th Conference on Robot Learning (CoRL) , 2021. 
[30] ——, “Perceiver-actor: A multi-task transformer for robotic manipulation,” in Proceedings of the 6th Con- ference on Robot Learning (CoRL) , 2022. [31] E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn, “Bc-z: Zero-shot task generalization with robotic imitation learning,” in Conference on Robot Learning . PMLR, 2022, pp. 991– 1002. [32] P. Sharma, B. Sundaralingam, V . Blukis, C. Paxton, T. Hermans, A. Torralba, J. Andreas, and D. Fox, “Cor- recting robot plans with natural language feedback,” RSS, 2022. [33] W. Yu, N. Gileadi, C. Fu, S. Kirmani, K.-H. Lee, M. Gonzalez Arenas, H.-T. Lewis Chiang, T. Erez, L. Hasenclever, J. Humplik, B. Ichter, T. Xiao, P. Xu, A. Zeng, T. Zhang, N. Heess, D. Sadigh, J. Tan, Y . Tassa, and F. Xia, “Language to rewards for robotic skill synthesis,” Arxiv preprint arXiv:2306.08647 , 2023. [34] S. Nasiriany, F. Xia, W. Yu, T. Xiao, J. Liang, I. Das- gupta, A. Xie, D. Driess, A. Wahid, Z. Xu, et al. , “Pivot: Iterative visual prompting elicits actionable knowledge for vlms,” arXiv preprint arXiv:2402.07872 , 2024. [35] Y . Chen, R. Gandhi, Y . Zhang, and C. Fan, “Nl2tl: Transforming natural languages to temporal log- ics using large language models,” arXiv preprint arXiv:2305.07766 , 2023. [36] Y . Chen, J. Arkin, Y . Zhang, N. Roy, and C. Fan, “Autotamp: Autoregressive task and motion planning with llms as translators and checkers,” arXiv preprint arXiv:2306.06531 , 2023. [37] I. Singh, V . Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg, “Prog- prompt: Generating situated robot task plans using large language models,” in 2023 IEEE International Conference on Robotics and Automation (ICRA) , 2023, pp. 11 523–11 530. [38] S. H. Vemprala, R. Bonatti, A. Bucker, and A. Kapoor, “Chatgpt for robotics: Design principles and model abilities,” IEEE Access , 2024. [39] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Flo- rence, A. Zeng, J. Tompson, I. Mordatch, Y . 
Chebotar, P. Sermanet, N. Brown, T. Jackson, L. Luu, S. Levine,K. Hausman, and B. Ichter, “Inner monologue: Embod- ied reasoning through planning with language models,” inarXiv preprint arXiv:2207.05608 , 2022. [40] M. Minderer, A. Gritsenko, A. Stone, M. Neu- mann, D. Weissenborn, A. Dosovitskiy, A. Mahendran, A. Arnab, M. Dehghani, Z. Shen, et al. , “Simple open- vocabulary object detection,” in European Conference on Computer Vision . Springer, 2022, pp. 728–755. [41] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, et al. , “Grounding dino: Marrying dino with grounded pre-training for open- set object detection,” arXiv preprint arXiv:2303.05499 , 2023. [42] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.- Y . Lo, et al. , “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 4015–4026. [43] H. Huang, F. Lin, Y . Hu, S. Wang, and Y . Gao, “Copa: General robotic manipulation through spatial constraints of parts with foundation models,” arXiv preprint arXiv:2403.08248 , 2024. [44] A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Her- zog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, K.-H. Lee, S. Levine, Y . Lu, U. Malla, D. Manjunath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsch, J. Quiambao, K. Rao, M. Ryoo, G. Salazar, P. Sanketi, K. Sayed, J. Singh, S. Sontakke, A. Stone, C. Tan, H. Tran, V . Vanhoucke, S. Vega, Q. Vuong, F. Xia, T. Xiao, P. Xu, S. Xu, T. Yu, and B. Zitkovich, “Rt-1: Robotics transformer for real-world control at scale,” in arXiv preprint arXiv:2212.06817 , 2022. [45] A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. 
Her- zog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y . Lu, H. Michalewski, I. Mordatch, K. Pertsch, K. Rao, K. Reymann, M. Ryoo, G. Salazar, P. Sanketi, P. Sermanet, J. Singh, A. Singh, R. Soricut, H. Tran, V . Vanhoucke, Q. Vuong, A. Wahid, S. Welker, P. Wohlhart, J. Wu, F. Xia, T. Xiao, P. Xu, S. Xu, T. Yu, and B. Zitkovich, “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” in arXiv preprint arXiv:2307.15818 , 2023. [46] A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Singh, A. Brohan, et al. , “Open x-embodiment: Robotic learning datasets and rt-x models,” arXiv preprint arXiv:2310.08864 , 2023. [47] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. , “Droid: A large-scale in- Page 9: the-wild robot manipulation dataset,” arXiv preprint arXiv:2403.12945 , 2024. [48] J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao, “Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v,” arXiv preprint arXiv:2310.11441 , 2023. [49] M. Cai, H. Liu, S. K. Mustikovela, G. P. Meyer, Y . Chai, D. Park, and Y . J. Lee, “Making large multimodal mod- els understand arbitrary visual prompts,” arXiv preprint arXiv:2312.00784 , 2023. [50] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. , “Flamingo: a visual language model for few-shot learning,” Advances in neural infor- mation processing systems , vol. 35, pp. 23 716–23 736, 2022. [51] A. Bar, Y . Gandelsman, T. Darrell, A. Globerson, and A. Efros, “Visual prompting via image inpainting,” Advances in Neural Information Processing Systems , vol. 35, pp. 25 005–25 017, 2022. [52] Y . Yang, H. Peng, Y . Shen, Y . Yang, H. Hu, L. Qiu, H. Koike, et al. 
, “Imagebrush: Learning visual in- context instructions for exemplar-based image manip- ulation,” Advances in Neural Information Processing Systems , vol. 36, 2024. [53] F. Li, Q. Jiang, H. Zhang, T. Ren, S. Liu, X. Zou, H. Xu, H. Li, J. Yang, C. Li, et al. , “Visual in-context prompt- ing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 12 861–12 871. [54] Y . Zhang, K. Zhou, and Z. Liu, “What makes good examples for visual in-context learning?” Advances in Neural Information Processing Systems , vol. 36, pp. 17 773–17 794, 2023. [55] L. Zha, Y . Cui, L.-H. Lin, M. Kwon, M. G. Arenas, A. Zeng, F. Xia, and D. Sadigh, “Distilling and retriev- ing generalizable knowledge for robot manipulation via language corrections,” in 2024 IEEE International Con- ference on Robotics and Automation (ICRA) . IEEE, 2024, pp. 15 172–15 179. [56] G. Williams, A. Aldrich, and E. Theodorou, “Model predictive path integral control using covariance variable importance sampling,” arXiv preprint arXiv:1509.01149 , 2015.

---