arxiv

Paper 2201.07207

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

Authors: Wenlong Huang, Pieter Abbeel, Deepak Pathak, Igor Mordatch

Published: 2022-01-18

Abstract:

Can world knowledge learned by large language models (LLMs) be used to act in interactive environments? In this paper, we investigate the possibility of grounding high-level tasks, expressed in natural language (e.g. "make breakfast"), to a chosen set of actionable steps (e.g. "open fridge"). While prior work focused on learning from explicit step-by-step examples of how to act, we surprisingly find that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks into mid-level plans without any further training. However, the plans produced naively by LLMs often cannot map precisely to admissible actions. We propose a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions. Our evaluation in the recent VirtualHome environment shows that the resulting method substantially improves executability over the LLM baseline. The conducted human evaluation reveals a trade-off between executability and correctness but shows a promising sign towards extracting actionable knowledge from language models. Website at https://huangwl18.github.io/language-planner

Paper Content:

Page 1: Wenlong Huang (UC Berkeley), Pieter Abbeel (UC Berkeley), Deepak Pathak* (Carnegie Mellon University), Igor Mordatch* (Google)
[Figure 1: Executability vs. semantic correctness of generated plans (left), sample plans by different models (right), and example environment execution (bottom). Large models can produce action plans indistinguishable from those by humans, but frequently are not executable in the environment. Using our techniques, we can significantly improve executability, albeit at the cost of correctness. Left panel axes: Executability (x) and Correctness (y); plotted points: GPT-2 0.1B, GPT-2 1.5B, GPT-3 13B, Codex 12B, GPT-3 175B, Translated Codex 12B (Ours), Translated GPT-3 175B (Ours), and Human. Right panel: sample plans for the tasks “Brush teeth” and “Throw away paper”. Bottom panel: execution of the task “Get Glass of Milk” (Walk to Kitchen, Open Fridge, Grab Milk, Close Fridge).]
More samples can be found in Appendix A.5.

*Equal advising. Correspondence to Wenlong Huang <wenlong.huang@berkeley.edu>. Code and videos at https://huangwl18.github.io/language-planner

arXiv:2201.07207v2 [cs.LG] 8 Mar 2022

Page 2: Contents

1 Introduction (p. 3)
2 Evaluation Framework (p. 4)
  2.1 Evaluated Environment: VirtualHome (p. 4)
  2.2 Metrics (p. 5)
3 Method (p. 6)
  3.1 Querying LLMs for Action Plans (p. 6)
  3.2 Admissible Action Parsing by Semantic Translation (p. 6)
  3.3 Autoregressive Trajectory Correction (p. 7)
  3.4 Dynamic Example Selection for Improved Knowledge Extraction (p. 7)
4 Results (p. 8)
  4.1 Do LLMs contain actionable knowledge for high-level tasks? (p. 8)
  4.2 How executable are the LLM action plans? (p. 9)
  4.3 Can LLM action plans be made executable by proposed procedure? (p. 9)
5 Analysis and Discussions (p. 10)
  5.1 Ablation of design decisions (p. 10)
  5.2 Are the generated action plans grounded in the environment? (p. 10)
  5.3 Effect of Different Translation LMs (p. 11)
  5.4 Can LLMs generate actionable programs by following step-by-step instructions? (p. 11)
  5.5 Analysis of program length (p. 12)
6 Related Works (p. 12)
7 Conclusion, Limitations & Future Work (p. 13)
A Appendix (p. 18)
  A.1 Hyperparameter Search (p. 18)
  A.2 Details of Human Evaluations (p. 19)
  A.3 All Evaluated Tasks (p. 20)
  A.4 Natural Language Templates for All Atomic Actions (p. 21)
  A.5 Random Samples of Action Plans (p. 22)

Page 3: 1 Introduction

Large language models (LLMs) have made impressive advances in language generation and understanding in recent years [10, 39, 40, 5]. See [4] for a recent summary of their capabilities and impacts. Being trained on large corpora of human-produced language, these models are thought to contain a lot of information about the world [42, 23, 3] - albeit in linguistic form.

We ask whether we can use such knowledge contained in LLMs not just for linguistic tasks, but to make goal-driven decisions that can be enacted in interactive, embodied environments. But we are not simply interested in whether we can train models on a dataset of demonstrations collected for some specific environment; we are instead interested in whether LLMs already contain information necessary to accomplish goals without any additional training.

More specifically, we ask whether world knowledge about how to perform high-level tasks (such as “make breakfast”) can be expanded to a series of groundable actions (such as “open fridge”, “grab milk”, “close fridge”, etc.) that can be executed in the environment. For our investigation, we use the recently proposed VirtualHome environment [38]. It can simulate a large variety of realistic human activities in a household environment and supports the ability to perform them via embodied actions defined with a verb-object syntax. However, due to the open-ended nature of the tasks, it is difficult to autonomously evaluate their success. We rely on human evaluation (conducted on Mechanical Turk) to decide whether sequences of actions meaningfully accomplish posed tasks.
We find that large GPT-3 [5] and Codex [7] models, when prompted with a single fixed example of a task description and its associated sequence of actions, can produce very plausible action plans for the task we’re interested in. Such completions reflect the information already stored in the model; no model fine-tuning is involved. Additionally, we only observe this effect in the larger models. Unfortunately, despite their semantic correctness, the produced action plans are often not executable in the environment. Produced actions may not map precisely to admissible actions, or may contain various linguistic ambiguities.

We propose several tools to improve executability of the model’s outputs. First, we enumerate all admissible actions and map the model’s output phrases to the most semantically similar admissible action (we use a similarity measure between sentence embeddings produced by a RoBERTa model [27] in this work, but other choices are possible). Second, we use the model to autoregressively generate actions in a plan by conditioning on past actions that have been made admissible via the technique above. Such on-the-fly correction can keep generation anchored to admissible actions. Third, we provide weak supervision to the model by prompting the model with a known task example similar to the query task. This is somewhat reminiscent of prompt tuning approaches but does not require access to gradients or internals of the model.

Using the above tools to bias model generation, we find that we improve executability of action plans from 18% to 79% (see Figure 1) without any invasive modifications to model parameters or any extra gradient or internal information beyond what is returned from the model’s forward pass. This is advantageous because it does not require any modifications to the model training procedure and can fit within existing model serving pipelines.
However, we do find some drop in the correctness of the action sequences generated with the above tools (as judged by humans), indicating a promising step, but one that requires more research on the topic.

To summarize, our paper’s contributions are as follows:
• We show that without any training, large language models can be prompted to generate plausible goal-driven action plans, but such plans are frequently not executable in interactive environments.
• We propose several tools to improve executability of the model generation without invasive probing or modifications to the model.
• We conduct a human evaluation of multiple techniques and models and report on the trade-offs between executability and semantic correctness.

Page 4:

[Figure 2: We investigate the possibility of extracting actionable knowledge from pre-trained large language models (LLMs). We first show the surprising finding that pre-trained causal LLMs can decompose high-level tasks into sensible mid-level action plans (left). To make the plans executable, we propose to translate each step into an admissible action via another pre-trained masked LLM (middle). The translated action is appended to the prompt used for generating the remaining steps (right). All models are kept frozen without additional training. Panel titles: Zero-Shot Planning via Causal LLM; Translation to Admissible Action; Step-By-Step Autoregressive Generation. The illustrated prompt uses the example task “Shave” and the query task “Apply lotion”.]
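The prompt layout that Figure 2 describes (one example task with its plan, followed by the query task and any steps already translated) can be sketched as below. This is a minimal illustration, not the authors' code: the function names `format_example` and `build_prompt` are hypothetical, and the exact separators (a blank line between tasks, a trailing `Step N:` cue) are assumptions modeled on the "Task: / Step N:" layout of the paper's sample plans.

```python
from typing import List, Sequence


def format_example(task: str, steps: Sequence[str]) -> str:
    """Render one task and its steps in a 'Task: / Step N:' layout (assumed format)."""
    lines: List[str] = [f"Task: {task}"]
    lines += [f"Step {i}: {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(lines)


def build_prompt(
    example_task: str,
    example_steps: Sequence[str],
    query_task: str,
    done_steps: Sequence[str] = (),
) -> str:
    """Prepend one demonstration to the query task.

    `done_steps` holds steps already made admissible; they are appended so that
    later steps are conditioned on them. The prompt ends with the cue for the
    next step, which the planning model is expected to complete.
    """
    example = format_example(example_task, example_steps)
    query = format_example(query_task, done_steps)
    next_index = len(done_steps) + 1
    return f"{example}\n\n{query}\nStep {next_index}:"


prompt = build_prompt(
    "Shave",
    ["Grab razor", "Switch on razor", "Put razor on face"],
    "Apply lotion",
)
print(prompt)
```

Calling `build_prompt` again with `done_steps=["Pour lotion into right hand"]` would yield a prompt ending in `Step 2:`, which mirrors the step-by-step generation shown in the right panel of Figure 2.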
2 Evaluation Framework

Simulating open-ended tasks that resemble naturalistic human activities requires an environment to support a rich set of diverse interactions, rendering most existing embodied environments unsuitable for our investigation. One exception is VirtualHome [38], which we evaluate on as it models complex human activities, though only in a household setting. To measure the correctness of the generated action plans, which is inherently difficult to evaluate computationally for these open-ended tasks, we conduct a human evaluation similar to Puig et al. [38]. We note that since no further training is involved throughout our investigations, the observations and findings presented in this paper should also translate to similar embodied environments, likely even beyond the household domain.

2.1 Evaluated Environment: VirtualHome

Preliminaries. In VirtualHome, activities are expressed as programs. Each program consists of a sequence of textual action steps, where each step is written as: [action] <arg> (idx). Each action refers to one of the 42 atomic actions supported in VirtualHome, such as “walk” and “open”. The full list of atomic actions can be found in Appendix A.4. Different actions take in different numbers of arg, such as “bedroom” and “fridge”, that are necessary for specifying an interaction. Associated with each arg is a unique id specifying the corresponding node in the environment graph, in case multiple instances of the same object class are present in the graph. For the sake of simplicity, we omit the id in the remaining discussions of this paper and allow automatic assignment by the environment. An example program is shown below for the task “Relax on sofa”:

    [WALK] <living_room> (1)
    [WALK] <television> (1)
    [FIND] <television> (1)
    [SWITCHON] <television> (1)
    [FIND] <sofa> (1)
    [SIT] <sofa> (1)
    [TURNTO] <television> (1)
    [WATCH] <television> (1)

Evaluated Tasks. We use the ActivityPrograms knowledge base collected by Puig et al. [38] for evaluation.
It contains 2821 different entries annotated by Amazon Mechanical Turk (MTurk) workers. Each entry contains 1) a high-level task name (e.g. “Watch TV”), 2) detailed instructions expressed in natural language to complete the task (e.g. “Sit on my couch directly opposite my TV, switch on my TV with the remote control and watch”), and 3) an executable program containing all necessary steps for a robotic agent (example above). We omit the use of detailed instructions (2) as we desire direct extraction of executable programs (3) from only high-level task names (1). There are 292 distinct high-level tasks in the knowledge base, from which we randomly sample 88 held-out tasks for evaluation. The remaining 204 tasks are used as the demonstration set from which we are allowed to select example(s) for prompting language models, or in the case of supervised fine-tuning baselines, they are used to fine-tune pre-trained language models.

Page 5:

Algorithm 1: Generating Action Plans from Pre-Trained Language Models

    Notation Summary:
      LM_P: text completion language model (also referred to as Planning LM)
      LM_T: text embedding language model (also referred to as Translation LM)
      {(T_i, E_i)}_{i=1}^{N}: demonstration set, where T is task name and E is example plan for T
      C: cosine similarity function
      P: mean token log probability under LM_P

    Input: query task name Q, e.g. “make breakfast”
    Output: action plan consisting of admissible env actions, e.g. “open fridge”

    Extract most similar example (T*, E*) whose T* maximizes C(LM_T(T), LM_T(Q))
    Initialize prompt with (T* + E* + Q)
    while max step is not reached do
        Sample LM_P with current prompt to obtain k single-step action phrases
        for each sample â and each admissible env action a_e do
            Calculate ranking score by C(LM_T(â), LM_T(a_e)) + β · P(â)
        end for
        Append highest-scoring env action a_e* to prompt
        Append a_e* to output
        if > 50% samples are 0-length or highest score < ε then
            break
        end if
    end while
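The inner loop of Algorithm 1 (score every sampled phrase against every admissible action, then decide whether to stop) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for the Translation LM, the sampled phrases and their token log probabilities are passed in as plain data instead of coming from a real Planning LM, and the values used for the weighting coefficient `beta` and the threshold `epsilon` are arbitrary. The sketch checks the termination conditions before choosing an action, following the prose description of early termination; Algorithm 1 as printed lists the check after the append.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Dict[str, float]


def toy_embed(phrase: str) -> Vector:
    """Bag-of-words vector. ASSUMPTION: a stand-in for the Translation LM embedding."""
    return {word: float(n) for word, n in Counter(phrase.lower().split()).items()}


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity C between two sparse vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def mean_log_prob(token_log_probs: Sequence[float]) -> float:
    """P: mean token log probability of one sampled phrase."""
    return sum(token_log_probs) / len(token_log_probs)


def next_action(
    samples: List[Tuple[str, List[float]]],
    admissible: Sequence[str],
    beta: float,
    epsilon: float,
) -> Optional[str]:
    """Return the best admissible action for this step, or None to terminate."""
    # Termination 1: more than half of the sampled phrases are empty.
    empty = sum(1 for text, _ in samples if not text.strip())
    if 2 * empty > len(samples):
        return None

    # Embeddings of admissible actions do not depend on the samples,
    # so a real system could compute them once and reuse them.
    bank = [(action, toy_embed(action)) for action in admissible]

    best_action: Optional[str] = None
    best_score = -math.inf
    for text, log_probs in samples:
        if not text.strip() or not log_probs:
            continue
        query = toy_embed(text)
        likelihood = mean_log_prob(log_probs)
        for action, vector in bank:
            score = cosine(query, vector) + beta * likelihood
            if score > best_score:
                best_action, best_score = action, score

    # Termination 2: even the best match scores below the threshold.
    if best_score < epsilon:
        return None
    return best_action


samples = [
    ("walk to the kitchen", [-0.2, -0.4, -0.1, -0.3]),
    ("go kitchen", [-2.0, -3.0]),
    ("", []),
]
admissible = ["walk to kitchen", "open fridge"]
chosen = next_action(samples, admissible, beta=0.3, epsilon=0.5)
print(chosen)
```

A bag-of-words embedding only rewards shared words, so it cannot resolve cases such as mapping "open TV" to "switch on TV"; that is exactly the kind of semantic match a learned sentence embedding is meant to supply.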
2.2 Metrics

A program that commands the agent to wander around in a household environment is highly executable but is mostly not correct. On the other hand, a program composed of natural language instructions annotated by humans is likely correct but cannot be executed, because its format is ambiguous and may lack necessary common-sense actions (e.g. fridge must be opened before an agent can grab things from it). We thus consider two axes for evaluation: executability and correctness.

Executability. Executability measures whether an action plan can be correctly parsed and satisfies the common-sense constraints of the environment. To be correctly parsed, an action plan must be syntactically correct and contain only allowed actions and recognizable objects. To satisfy the common-sense constraints, each action step must not violate the set of its pre-conditions (e.g. the agent cannot grab milk from the fridge before opening it) and post-conditions (e.g. the state of the fridge changes from “closed” to “open” after the agent opens it). We report the average executability across all 88 tasks and all 7 VirtualHome scenes.

Correctness. Unlike most embodied environments where the completion of a task can be easily judged, the ambiguous and multimodal nature of natural language task specification makes it impractical to obtain a gold-standard measurement of correctness¹. Therefore, we conduct human evaluations for the main methods. For the remaining analysis, we rely on a match-based metric that measures how similar a generated program is to human annotations. Specifically, we follow Puig et al. [38] and calculate the longest common subsequence (LCS) between two programs, normalized by the maximum length of the two. In the presence of multiple human-written programs for a single task, we take the maximum LCS across them.
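The match-based metric just described can be sketched as a standard dynamic-programming LCS followed by the two normalization rules from the text. The sketch assumes that a program is a list of step strings and that two steps match only when the strings are identical; the function names are hypothetical.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two step sequences."""
    prev = [0] * (len(b) + 1)
    for step_a in a:
        curr = [0]
        for j, step_b in enumerate(b, start=1):
            if step_a == step_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(generated: Sequence[str], reference: Sequence[str]) -> float:
    """LCS length divided by the length of the longer program."""
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 0.0
    return lcs_length(generated, reference) / longest


def lcs_score(generated: Sequence[str], references: Sequence[Sequence[str]]) -> float:
    """Maximum normalized LCS over all human-written references for a task."""
    return max(normalized_lcs(generated, ref) for ref in references)


generated = ["walk to kitchen", "open fridge", "grab milk", "close fridge"]
reference = ["walk to kitchen", "find fridge", "open fridge", "grab milk"]
score = normalized_lcs(generated, reference)
print(score)
```

In this made-up example three steps appear in the same order in both programs and the longer program has four steps, so the score is 3/4.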
However, we note that the majority of the tasks only have one human annotation, but there are often many plausible ways to complete a certain task, making this metric imperfect at evaluating program correctness². Although correlation between the two is shown by Puig et al. [38], we consider it only as a proxy metric in replacement of unscalable human evaluation.

Footnote 1: One approach could be measuring the similarity of the final environment state produced by executing predicted and human-written programs, but the initial state must be kept fixed for each task, which is not appropriate for many tasks due to their open-ended nature.

Page 6: 3 Method

In this section, we investigate the possibility of extracting actionable knowledge from pre-trained language models without further training. We first give an overview of the common approach to query large language models (LLMs) and how it may be used for embodied agents in Section 3.1. Then we describe an inference-time procedure that addresses several deficiencies of the LLM baseline and offers better executability in embodied environments. We break down the proposed procedure into three individual components, discussed in Sections 3.2, 3.3, and 3.4 respectively. Pseudo-code is in Algorithm 1.

Since LMs excel at dealing with natural language text instead of the specific format required by VirtualHome as described in Section 2.1, we only expose natural language text to LMs. To do this, we define a bi-directional mapping for each atomic action that converts between the natural language format and the program format. For instance, “walk to living room” is mapped to [WALK] <living_room> (1). The full list of the mappings is in Appendix A.4.

3.1 Querying LLMs for Action Plans

Previous works have shown that large language models pre-trained on a colossal amount of data would internalize rich world knowledge that can be probed to perform various downstream tasks [39, 5].
Notably, autoregressive LLMs can even perform in-context learning, an ability to solve tasks using only contextual information without gradient updates [5]. Contextual information is given as part of the input prompt and LMs are asked to complete the remaining text. It often consists of natural language instructions and/or a number of examples containing the desired input/output pairs.

We adopt the same approach to query LLMs to generate action plans for high-level tasks. Specifically, we prepend one example high-level task and its annotated action plan from the demonstration set to the query task, as shown in Figure 2. To obtain text completion results, we sample from an autoregressive LLM using temperature sampling and nucleus sampling [18]. We refer to this LM as Planning LM and the approach using this LM for plan generation as Vanilla <LM>, where <LM> is replaced by a specific language model such as GPT-3.

To improve the generation quality, we follow Chen et al. [7] to sample multiple outputs for each query. However, unlike Chen et al. [7], who investigate program synthesis and can choose the sample with the highest unit test pass rate, we only consider the setting where one sample is allowed to be evaluated for each task. This is because repetitive trial-and-error is equivalent to probing the environment for privileged information, which should not be considered viable in our setting. For Vanilla <LM>, to choose the best action plan X among k samples (X_1, X_2, ..., X_k), each consisting of n_i tokens X_i = (x_{i,1}, x_{i,2}, ..., x_{i,n_i}), we select the sample with the highest mean log probability as follows:

$$\operatorname*{argmax}_{X_i} \left( P_\theta(X_i) := \frac{1}{n_i} \sum_{j=1}^{n_i} \log p_\theta(x_{i,j} \mid x_{i,<j}) \right), \quad \text{where } \theta \text{ parameterizes the Planning LM.} \tag{1}$$

3.2 Admissible Action Parsing by Semantic Translation

One issue arises when naively following the above approach to generate action plans: the plan expressed in free-form language often cannot be mapped to unambiguous actionable steps and thus is not executable by a robotic agent.
Many reasons can cause such failures: 1) the output does not follow pre-defined mappings of any atomic action (e.g. “I first walk to the bedroom” is not of the format “walk to <PLACE>”), 2) the output may refer to atomic actions and objects using words unrecognizable by the environment (e.g. “microwave the chocolate milk” where “microwave” and “chocolate milk” cannot be mapped to a precise action and object), or 3) the output contains lexically ambiguous words (e.g. “open TV” should instead be “switch on TV”).

Footnote 2: Although LCS has a mathematical range of [0, 1], we measure the LCS between different human-written programs for the same task and find an empirical maximum of 0.489.

Page 7:

Instead of developing a set of rules to transform the free-form text into admissible action steps, we propose to again leverage world knowledge learned by language models to semantically translate the action. For each admissible environment action a_e, we calculate its semantic distance to the predicted action phrase â by cosine similarity:

$$C(f(\hat{a}), f(a_e)) := \frac{f(\hat{a}) \cdot f(a_e)}{\lVert f(\hat{a}) \rVert \, \lVert f(a_e) \rVert}, \quad \text{where } f \text{ is an embedding function.} \tag{2}$$

To embed the output action phrase and environment actions, we use a BERT-style LM [10, 27] pre-trained with the Sentence-BERT [41] objective, to which we refer as Translation LM³. The action embedding is obtained by mean-pooling the last layer hidden states across all tokens in that action phrase. While the set of admissible actions in our environment is discrete and possible to exhaustively enumerate, sampling or projection can be employed in larger discrete or continuous action spaces.

3.3 Autoregressive Trajectory Correction

Translating each step of the program after the entire program has been synthesized lacks consideration of the achievability of individual steps and is subject to compounding errors. In practice, LLMs might output compounded instructions for a single step, even though it cannot be completed using one admissible action in the environment.
To this end, we can instead interleave plan generation and action translation to allow for automatic trajectory correction. At each step, we first query Planning LM to generate k samples for a single action (â_1, â_2, ..., â_k). For each sample â, we consider both its semantic soundness and its achievability in the environment. Specifically, we aim to find an admissible environment action a_e by modifying the ranking scheme described in Equation 1 as follows:

$$\operatorname*{argmax}_{a_e} \left[ \max_{\hat{a}} C(f(\hat{a}), f(a_e)) + \beta \cdot P_\theta(\hat{a}) \right], \quad \text{where } \beta \text{ is a weighting coefficient.} \tag{3}$$

Then we append the translated environment action a_e to the unfinished text completion. This way all subsequent steps will be conditioned on admissible actions instead of free-form action phrases generated by Planning LM. Furthermore, we can use Translation LM to detect out-of-distribution actions, those outside the capabilities of a robot, and terminate a program early instead of mapping to a faulty action. This can be achieved by setting a threshold ε such that if $\max_{\hat{a}, a_e} C(f(\hat{a}), f(a_e)) + \beta \cdot P_\theta(\hat{a}) < \epsilon$ at step t, the program is terminated early. Since we now sample Planning LM for individual steps instead of an entire sequence, another termination condition we consider is when > 50% of current-step samples are 0-length (excluding leading or trailing non-English text tokens).

3.4 Dynamic Example Selection for Improved Knowledge Extraction

So far in the text, we always give the same example in the prompt for all query tasks. However, consider the task of “ordering pizza”. Prompting LLMs with this task may give the assumption that the agent is initialized in front of a computer, and the LLMs may guide the agent to search for a pizza store and click “checkout my cart”. Although these are reasonable and feasible in the real world, such an assumption cannot always be made as these interactions may not be supported in simulated environments.
In fact, the closest series of actions that human experts give in VirtualHome may be “walking to a computer”, “switching on the computer”, and “typing the keyboard”. Without being fine-tuned on these data, LLMs would often fail at these tasks.

To provide weak supervision at inference time, we propose to select the most similar task T and its example plan E from the demonstration set to be used as the example in the prompt. Specifically, we re-use the same Translation LM introduced in Section 3.2 and select (T*, E*) whose high-level task name T* maximizes C(f(T), f(Q)), where Q is the query task. This approach bears resemblance to several recent works [37, 13, 26, 43]. An example is shown in Figure 2 where “Shave” is the most similar to the query task “Apply lotion”.

FINAL METHOD. Combining the various improvements discussed above, we refer to the final method as Translated <LM>, where <LM> is replaced by the specific language model used, such as GPT-3.

Footnote 3: Note that this is a different LM than the GPT-style Planning LM. Using a single LM for both purposes could as well be possible and likely more efficient, but we leave such investigation to future works.

Page 8:

[Figure 3: Visualization of VirtualHome programs generated by our approach. The top row shows the execution of the task “Complete Amazon Turk Surveys” (Walk to Home Office, Sit on Chair, Switch on Computer, Look at Computer), and the bottom row shows the task “Get Glass of Milk” (Walk to Kitchen, Open Fridge, Grab Milk, Close Fridge). We show LLMs not only can generate sensible action plans given only high-level tasks but also contain the actionable knowledge that can be extracted for grounding in embodied environments.]

4 Results

In this section, we first show that language models can generate sensible action plans for many high-level tasks, even without any additional training.
Then we highlight its inadequacy when naively applied to embodied environments and demonstrate how this can be improved by again leveraging world knowledge learned by LLMs. Visualization of generated programs is shown in Figure 3.

Sampling from LMs. Pre-trained LMs are sensitive to sampling parameters and the specific example given in the prompt. For all evaluated methods, we perform hyperparameter search over various sampling parameters, and for methods using a fixed prompt example, we report metrics averaged across three randomly chosen examples. To select the best run for each method, we rank the runs by the sum of LCS and executability, each normalized by human-expert scores. Further details are in Appendix A.1.

Model Choices. For Planning LM, we evaluate a representative set of causal language models. For Translation LM, we mainly use Sentence-RoBERTa-355M and provide relevant ablations in Section 5.3. GPT-3 and Codex are accessed using the OpenAI API, and the remaining models are accessed through open-source packages, Hugging Face Transformers [55] and SentenceTransformers [41], all without additional training (except for the fine-tuning baseline).

4.1 Do LLMs contain actionable knowledge for high-level tasks?

We first investigate whether LLMs can generate sensible action plans expressed in free-form language. We use the approach described in Section 3.1 to query pre-trained LLMs. To evaluate the correctness of generated action plans, we conduct human evaluations. For each model, we ask 10 human annotators to determine, by answering “Yes” or “No”, whether each task can be completed using provided action steps. To provide a reference of how humans might rate the action plans provided by other humans, we also ask annotators to rate the human-written action plans included in the VirtualHome dataset for the same set of tasks.
In contrast to the free-form text output by LLMs, humans wrote the plans using a graphical programming interface that enforces strict syntax and a chosen set of atomic action vocabulary, which limit the expressivity and the completeness of their answers⁴. More details of our human evaluation procedure can be found in Appendix A.2.

We show the human evaluation results in Figure 1, where the y-axis shows correctness averaged across all tasks and all annotators. Surprisingly, when LLMs are large enough and without imposed syntactic constraints, they can generate highly realistic action plans whose correctness, as deemed by human annotators, even surpasses human-written action plans. We also observe some level of correctness for smaller models such as GPT-2. However, inspection of its produced output indicates that it often generates shorter plans by ignoring common-sense actions or by simply rephrasing the given task (e.g. the task “Go to sleep” produces only a single step “Go to bed”). These failure modes sometimes mislead human annotators to mark them correct as the annotators may ignore common-sense actions in their judgment as well, resulting in a higher correctness rate than the quality of the output shows.

Footnote 4: Puig et al. [38] also conduct a human evaluation on 100 randomly sampled human-written programs and show that 64% of them are complete (i.e. contain all necessary steps). Readers are encouraged to refer to Puig et al. [38] for a more comprehensive analysis of the dataset.

Page 9:

    Language Model            Executability   LCS       Correctness
    Vanilla GPT-2 117M        18.66%          3.19%     15.81% (4.90%)
    Vanilla GPT-2 1.5B        39.40%          7.78%     29.25% (5.28%)
    Vanilla Codex 2.5B        17.62%          15.57%    63.08% (7.12%)
    Vanilla GPT-Neo 2.7B      29.92%          11.52%    65.29% (9.08%)
    Vanilla Codex 12B         18.07%          16.97%    64.87% (5.41%)
    Vanilla GPT-3 13B         25.87%          13.40%    49.44% (8.14%)
    Vanilla GPT-3 175B        7.79%           17.82%    77.86% (6.42%)
    Human                     100.00%         N/A       70.05% (5.44%)
    Fine-tuned GPT-3 13B      66.07%          34.08%    64.92% (5.96%)
    OUR FINAL METHODS
    Translated Codex 12B      78.57%          24.72%    54.88% (5.90%)
    Translated GPT-3 175B     73.05%          24.09%    66.13% (8.38%)

Table 1: Human-evaluated correctness and evaluation results in VirtualHome. Although action plans generated by large language models can match or even surpass human-written plans in correctness measure, they are rarely executable. By translating the naive action plans, we show an important step towards grounding LLMs in embodied environments, but we observe room to achieve this without trading executability for correctness. We also observe a failure mode among smaller models that leads to high executability. For the correctness measure, the standard error of the mean across 10 human annotators is reported in parentheses.

4.2 How executable are the LLM action plans?

We analyze the executability of LLM plans by evaluating them in all 7 household scenes in VirtualHome. As shown in Table 1, we find action plans generated naively by LLMs are generally not very executable. Although smaller models seem to have higher executability, we find that the majority of these executable plans are produced by ignoring the queried task and repeating the given example of a different task. This is validated by the fact that smaller models have lower LCS than larger models despite having high executability, showing that this failure mode is prevalent among smaller models. In contrast, larger models do not suffer severely from this failure mode. Yet as a result of being more expressive, their generated programs are substantially less executable.

4.3 Can LLM action plans be made executable by proposed procedure?

We evaluate the effectiveness of our proposed procedure of action translation. We first create a bank of all allowed 47522 action steps in the environment, including all possible combinations of atomic actions and allowed arguments/objects.
Then we use an off-the-shelf Sentence-RoBERTa [27, 41] as Translation LM to create embeddings for actions and output text. For better computational efficiency, we pre-compute the embeddings for all allowed actions, leaving minor computation overhead for our procedure over the baseline methods at inference time. As shown in Table 1, executability of generated programs is significantly improved. Furthermore, we also observe improved LCS because the translated action steps precisely follow the program syntax and thus are more similar to the plans produced by human experts. Sample output is shown in Figure 1 and a larger random subset of generated samples can be found in Appendix A.5.

To validate their correctness, we again perform human evaluations using the same procedure from Section 4.1. Results are shown in Table 1. We find that despite being more similar to human-written plans as they follow strict syntax, the programs are deemed less correct by humans compared to their vanilla counterparts. By examining the output, we observe two main sources of errors. First, we find Translation LM is poor at mapping compounded instructions to a succinct admissible action, e.g. “brush teeth with toothbrush and toothpaste”. Second, we find that the generated programs are sometimes terminated too early. This is partly due to the imperfect expressivity of the environment; certain necessary actions or objects are not implemented to fully achieve some tasks, so Translation LM cannot map to a sufficiently similar action. This is also reflected by our human evaluation results of the programs written by other humans, as only 70% of the programs are considered complete.

5 Analysis and Discussions

5.1 Ablation of design decisions

We perform ablation studies for the three components of our proposed procedure, described in Section 3.2, 3.3, and 3.4 respectively.
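The action translation step evaluated in Section 4.3 can be pictured as a nearest-neighbour lookup over a pre-computed bank of admissible actions. The following is only a minimal sketch under stated assumptions: the tiny template and object lists are hypothetical, and the bag-of-words `embed` function is a stand-in for the Sentence-RoBERTa embeddings used in the paper, chosen so that the example runs without external dependencies. It also omits the weighting between semantic similarity and Planning LM likelihood (the beta coefficient in Appendix A.1).

```python
import itertools
import math
from collections import Counter

# Hypothetical miniature vocabulary; the real bank has 47522 action steps.
TEMPLATES = ["walk to {}", "find {}", "grab {}", "open {}", "switch on {}"]
OBJECTS = ["fridge", "toothbrush", "television", "kitchen"]


def build_action_bank(templates, objects):
    """Enumerate every combination of atomic action template and object."""
    return [t.format(o) for t, o in itertools.product(templates, objects)]


def embed(text):
    """Stand-in embedding: bag-of-words token counts (NOT Sentence-RoBERTa)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def translate(step, bank, bank_embeddings):
    """Map a free-form step to the most similar admissible action."""
    query = embed(step)
    scores = [cosine(query, e) for e in bank_embeddings]
    best = max(range(len(bank)), key=scores.__getitem__)
    return bank[best], scores[best]


bank = build_action_bank(TEMPLATES, OBJECTS)
# Embeddings of admissible actions are computed once and reused per query.
bank_embeddings = [embed(action) for action in bank]

action, score = translate("open the fridge", bank, bank_embeddings)
print(action, round(score, 3))
```

Pre-computing `bank_embeddings` mirrors the efficiency argument in the text: at inference time only the generated step needs to be embedded and compared against the stored vectors.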
As shown in Table 2, leaving out any of the three components leads to decreased performance in both executability and LCS. An exception is Translated GPT-3 w/o Trajectory Correction, where we observe a slight improvement in LCS at the expense of a considerable drop in executability. Among the three proposed components, leaving out action translation leads to the most significant executability drop, showing the importance of action translation in extracting executable action plans from LLMs.

Methods                        Executability   LCS
Translated Codex 12B           78.57%          24.72%
- w/o Action Translation       31.49%          22.53%
- w/o Dynamic Example          50.86%          22.84%
- w/o Trajectory Correction    55.19%          24.43%
Translated GPT-3 175B          73.05%          24.09%
- w/o Action Translation       36.04%          24.31%
- w/o Dynamic Example          60.82%          22.92%
- w/o Trajectory Correction    40.10%          24.98%

Table 2: Ablation of three proposed techniques.

5.2 Are the generated action plans grounded in the environment?

Since successful execution of correct action plans directly measures grounding, we calculate the percentage of generated action plans that are both correct and executable. We deem an action plan to be correct if 70% or more human annotators decide it is correct. Human-written plans are 100% executable, of which 65.91% are deemed correct. Results for LMs are shown in Figure 4.

Although smaller LMs such as GPT-2 can generate highly executable action plans as shown in Table 1, these executable plans are mostly not correct, as they often repeat the given example or do not contain all necessary steps. Increasing model parameters can lead to some improvement in generating plans that are both executable and correct, yet this improvement scales poorly with the parameter count. In the meantime, action translation offers a promising way towards grounding actionable knowledge by producing executable and correct plans, though a large gap remains to be closed to reach human-level performance (65.91%).
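The combined metric of Section 5.2 can be sketched as follows. This is an illustrative reconstruction, not the authors' evaluation code: the `PlanResult` record, the function names, and the toy votes are all hypothetical. Only the rule itself (a plan counts as correct when at least 70% of annotators mark it correct, and must also be executable) comes from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlanResult:
    """Hypothetical record for one generated plan."""
    task: str
    executable: bool
    votes: List[bool]  # one "Yes"/"No" judgment per human annotator


def is_correct(votes: List[bool], threshold: float = 0.7) -> bool:
    """A plan is deemed correct if >= threshold of annotators say so."""
    return len(votes) > 0 and sum(votes) / len(votes) >= threshold


def executable_and_correct_rate(results: List[PlanResult],
                                threshold: float = 0.7) -> float:
    """Percentage of plans that are both executable and deemed correct."""
    if not results:
        return 0.0
    hits = sum(1 for r in results
               if r.executable and is_correct(r.votes, threshold))
    return 100.0 * hits / len(results)


# Toy data: 10 annotators per plan, made up for illustration only.
results = [
    PlanResult("Go to sleep", True, [True] * 8 + [False] * 2),      # counts
    PlanResult("Make popcorn", True, [True] * 5 + [False] * 5),     # not correct
    PlanResult("Wash face", False, [True] * 10),                    # not executable
    PlanResult("Read newspaper", True, [True] * 7 + [False] * 3),   # counts
]
rate = executable_and_correct_rate(results)
print(f"{rate:.2f}% executable and correct")
```

As a plausibility check on the reported numbers, with 88 evaluated tasks the 65.91% figure for human-written plans is consistent with 58 plans meeting the threshold (58/88 ≈ 0.6591).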
Figure 4: Percentage of both executable and correct action plans generated by LMs. Bars (% of Executable & Correct Plans): Translated GPT-3 175B 35.23%; Translated Codex 12B 27.27%; Vanilla GPT-3 175B 6.82%; Vanilla GPT-3 12B 2.27%; Vanilla Codex 12B 4.55%; Vanilla GPT-2 1.5B 1.14%; Vanilla GPT-2 0.1B 0.0%.

5.3 Effect of Different Translation LMs

In this section, we study the effect of using different Translation LMs. We compare two size variants of Sentence BERT and Sentence RoBERTa [10, 27, 41] trained on the STS benchmark [6] and a baseline using averaged GloVe embeddings [35]. Results are shown in Table 3. Notably, we do not observe significant differences in executability and LCS across different variants of BERT and RoBERTa. We hypothesize that this is because any language model trained on reasonably large datasets should be capable of the single-step action phrase translation considered in this work. However, simply using averaged GloVe embeddings leads to significantly reduced performance.

Translation LM              Parameter Count   Executability   LCS
CODEX 12B AS PLANNING LM
Avg. GloVe embeddings       -                 46.92%          9.71%
Sentence Bert (base)        110M              73.21%          24.10%
Sentence Bert (large)       340M              75.16%          20.79%
Sentence RoBERTa (base)     125M              74.35%          22.82%
Sentence RoBERTa (large)    325M              78.57%          24.72%
GPT-3 175B AS PLANNING LM
Avg. GloVe embeddings       -                 47.40%          12.16%
Sentence Bert (base)        110M              77.60%          24.49%
Sentence Bert (large)       340M              67.86%          21.24%
Sentence RoBERTa (base)     125M              72.73%          23.64%
Sentence RoBERTa (large)    325M              73.05%          24.09%

Table 3: Effect of different Translation LMs on executability and LCS.

5.4 Can LLMs generate actionable programs by following step-by-step instructions?

Prior works often focus on translating step-by-step instructions into executable programs. Specifically, instead of only providing a high-level task name, how-to instructions are also provided, as shown in Figure 5.
Although this setting is easier as it does not require rich prior knowledge, how-to instructions can help resolve much ambiguity of exactly how to perform a high-level task when multiple solutions are possible. To investigate whether pre-trained LLMs are capable of doing this without additional training, we include these instructions in the prompt and evaluate LLMs with the proposed procedure. We compare to a supervised baseline from VirtualHome that trains an LSTM [17] from scratch on human-annotated data. Since the code to train the baseline is not publicly released and a different train/test split is likely used, we only show results reported in Puig et al. [38] as a crude reference. We also cannot compare executability as it is not reported. Results are shown in Table 4. Surprisingly, without being fine-tuned on any domain data, Translated Codex/GPT-3 can attain LCS close to supervised methods while generating highly executable programs.

Task: Read book
Description: Walk to home office, turn on light, grab a book, sit in chair, start to read the book.
Step 1: Walk to home office
Step 2: Walk to light
Step 3: Find light
Step 4: Switch on light
Step 5: Find novel
Step 6: Grab novel
Step 7: Find chair
Step 8: Sit on chair
Step 9: Read novel

Task: Find dictionary
Description: Move towards the bookshelf, scan the bookshelf for the dictionary, when the dictionary is found, pick up the dictionary.

Figure 5: An example prompt containing step-by-step instructions.

Methods                  Executability   LCS
Translated Codex 12B     78.57%          32.87%
Translated GPT-3 175B    74.15%          31.05%
Supervised LSTM          -               34.00%

Table 4: Executability and LCS when conditioned on step-by-step instructions.

5.5 Analysis of program length

Shorter programs have a natural advantage of being more executable as they need to satisfy fewer pre/post-conditions, albeit being prone to incompleteness.
To validate that the proposed approach does not simply generate very short programs, we calculate the average program length across the 88 evaluated tasks. Results are shown in Table 5. Mirroring the observations made in Section 4.1 and Section 4.2, we find smaller LMs such as GPT-2 tend to generate shorter programs than larger models do while frequently repeating the given executable example. In contrast, larger models like Codex and GPT-3 can generate more expressive programs with high realism, yet consequently, they often suffer from low executability. We show that the proposed procedure can find an appropriate balance and is capable of generating programs that are highly executable while maintaining reasonable expressiveness as measured by program length.

Methods                  Executability   Average Length
Vanilla GPT-2 1.5B       39.40%          4.24
Vanilla Codex 12B        18.07%          7.22
Vanilla GPT-3 175B       7.79%           9.716
Translated Codex 12B     78.57%          7.13
Translated GPT-3 175B    73.05%          7.36
Human                    100.00%         9.66

Table 5: Average executability & program length of different methods.

6 Related Works

Large-scale natural language modeling has witnessed rapid advances since the inception of the Transformer architecture [53]. It has been shown by recent works that large language models (LLMs) pre-trained on large unstructured text corpora not only can perform strongly on various down-stream NLP tasks [10, 39, 40, 5] but the learned representations can also be used to model relations of entities [23], retrieve matching visual features [19], synthesize code from docstrings [15, 7], solve math problems [8, 46], and even serve as valuable priors when applied to diverse tasks from different modalities [28, 52]. Notably, by pre-training on large-scale data, these models can also internalize an implicit knowledge base containing rich information about the world from which factual answers (e.g. “Dante was born in <PLACE>”) can be extracted [36, 21, 9, 50, 42].
Compared to prior works in single-step knowledge extraction, we aim to extract sequential action plans to complete open-ended human activities while satisfying various constraints of an interactive environment.

Many prior works have looked into grounding natural language in embodied environments. A series of them parse language instructions into formal logic or rely mainly on lexical analysis to resolve various linguistic ambiguities for embodied agents [2, 33, 34, 51]. However, they often require many hand-designed rules or scale inadequately to more complex tasks and environments. Recently, many efforts have been put into creating more realistic environments with the goal of furthering advances in this area [38, 47, 48, 22, 44, 1]. At the same time, by leveraging the better representation power of neural architectures, a number of works have looked into creating instruction-following agents that can perform manipulation [29, 30], navigation [11, 54, 31], or both [49, 16, 12]. Recent works also use language as hierarchical abstractions to plan actions using imitation learning [45] and to guide exploration in reinforcement learning [32].

Notably, many prior works do not leverage full-blown pre-trained LLMs; most investigate smaller LMs that require considerable domain-specific data for fine-tuning to obtain reasonable performance. Perhaps more importantly, few works have evaluated LLMs in an embodiment setting that realizes the full potential of the actionable knowledge these models already contain by pre-training on large-scale unstructured text: the tasks evaluated are often generated from a handful of templates, which do not resemble the highly diverse activities that humans perform in daily lives [14, 20]. The development of the VirtualHome environment [38] enables such a possibility. However, relevant works [38, 25] rely on human-annotated data and perform supervised training from scratch.
Due to the lack of rich world knowledge, these models can only generate action plans given detailed instructions of how to act or video demonstrations. Concurrent work by Li et al. [24] validates a similar hypothesis that LMs contain rich actionable knowledge. They fine-tune GPT-2 with demonstrations to incorporate environment context and to predict actions in VirtualHome, and evaluate on tasks that are generated from pre-defined predicates. In contrast, we investigate existing knowledge in LLMs without any additional training and evaluate on human activity tasks expressed in free-form language.

7 Conclusion, Limitations & Future Work

In this work, we investigate actionable knowledge already contained in pre-trained LLMs without any additional training. We present several techniques to extract this knowledge to perform common-sense grounding by planning actions for complex human activities. Despite promising findings, there remain several limitations of this work, which we discuss as follows:

Drop in Correctness. Although our approach can significantly improve executability of the generated plans, we observe a considerable drop in correctness. In addition to the errors caused by the proposed action translation (discussed in Section 4.3), this is partially attributed to the limited expressivity of VirtualHome, as it may not support all necessary actions to fully complete all evaluated tasks (correctness is judged by humans). This is also reflected by the fact that Vanilla LMs can even surpass human-written plans, which are restricted by environment expressivity.

Mid-Level Grounding. Instead of grounding the LLM generation to low-level actions by using downstream data from a specific environment, we focus on high-level to mid-level grounding such that we evaluate raw knowledge of LLMs as closely and broadly as possible.
Hence, we only consider the most prominent challenge in mid-level grounding: that the generated plans must satisfy all common-sense constraints (characterized by the executability metric). As a result, we assume there is a low-level controller that can execute these mid-level actions (such as “grab cup”), and we do not investigate the usefulness of LLMs for low-level sensorimotor behavior grounding. To perform sensorimotor grounding, such as navigation and interaction mask prediction, domain-specific data and fine-tuning are likely required.

Ignorant of Environment Context. We do not incorporate observation context or feedback into our models. To some extent, we approach LLMs in the same way as VirtualHome asks human annotators to write action plans for a given human activity by imagination, in which case humans similarly do not observe environment context. Similar to human-written plans, we assume the plans generated by LMs only refer to one instance of each object class. As a result, successful plan generation for tasks like “stack two plates on the right side of a cup” is not possible.

Evaluation Protocol. We measure quality of plans by a combination of executability and correctness instead of one straightforward metric. To the best of our knowledge, there is no known way to computationally assess the semantic correctness of the plans due to the tasks’ open-ended and multi-modal nature. Prior work also adopts a similar combination of metrics [38]. We report the two metrics individually to shed light on the deficiencies of existing LLMs, which we hope could provide insights for future works. To provide a holistic view, we report results combining the two metrics in Section 5.2.

We believe addressing each of these shortcomings will lead to exciting future directions. We also hope these findings can inspire future investigations into using pre-trained LMs for goal-driven decision-making problems and grounding the learned knowledge in embodied environments.
Acknowledgment

We would like to thank OpenAI for providing academic access to the OpenAI API and Luke Metz for valuable feedback and discussions. This work was supported in part by Berkeley Deep Drive, NSF IIS-2024594, and GoodAI Research Award.

References

[1] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018.
[2] Yoav Artzi and Luke Zettlemoyer. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62, 2013.
[3] BIG-bench collaboration. Beyond the imitation game: Measuring and extrapolating the capabilities of language models. In preparation, 2021. URL https://github.com/google/BIG-bench/.
[4] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. On the opportunities and risks of foundation models, 2021.
[5] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[6] Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.
[7] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[8] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[9] Joe Davison, Joshua Feldman, and Alexander M Rush. Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1173–1178, 2019.
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[11] Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and-language navigation. arXiv preprint arXiv:1806.02724, 2018.
[12] Justin Fu, Anoop Korattikara, Sergey Levine, and Sergio Guadarrama. From language to goals: Inverse reinforcement learning for vision-based instruction following. arXiv preprint arXiv:1902.07742, 2019.
[13] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723, 2020.
[14] Brent Harrison and Mark O Riedl. Learning from stories: using crowdsourced narratives to train virtual agents. In Twelfth Artificial Intelligence and Interactive Digital Entertainment Conference, 2016.
[15] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938, 2021.
[16] Felix Hill, Sona Mokra, Nathaniel Wong, and Tim Harley. Human instruction-following with deep reinforcement learning via transfer-learning from text. arXiv preprint arXiv:2005.09382, 2020.
[17] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[18] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
[19] Gabriel Ilharco, Rowan Zellers, Ali Farhadi, and Hannaneh Hajishirzi. Probing text models for common ground with visual representations. arXiv e-prints, pages arXiv–2005, 2020.
[20] Peter A Jansen. Visually-grounded planning without vision: Language models infer detailed plans from high-level instructions. arXiv preprint arXiv:2009.14259, 2020.
[21] Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438, 2020.
[22] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474, 2017.
[23] Belinda Z Li, Maxwell Nye, and Jacob Andreas. Implicit representations of meaning in neural language models. arXiv preprint arXiv:2106.00737, 2021.
[24] Shuang Li, Xavier Puig, Yilun Du, Clinton Wang, Ekin Akyurek, Antonio Torralba, Jacob Andreas, and Igor Mordatch. Pre-trained language models for interactive decision-making. arXiv preprint arXiv:2202.01771, 2022.
[25] Yuan-Hong Liao, Xavier Puig, Marko Boben, Antonio Torralba, and Sanja Fidler. Synthesizing environment-aware activities via activity sketches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6291–6299, 2019.
[26] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for gpt-3? arXiv preprint arXiv:2101.06804, 2021.
[27] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[28] Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Pretrained transformers as universal computation engines. arXiv preprint arXiv:2103.05247, 2021.
[29] Corey Lynch and Pierre Sermanet. Grounding language in play. arXiv preprint arXiv:2005.07648, 2020.
[30] Corey Lynch and Pierre Sermanet. Language conditioned imitation learning over unstructured data. Proceedings of Robotics: Science and Systems. doi, 10, 2021.
[31] Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, and Dhruv Batra. Improving vision-and-language navigation with image-text pairs from the web. In European Conference on Computer Vision, pages 259–274. Springer, 2020.
[32] Suvir Mirchandani, Siddharth Karamcheti, and Dorsa Sadigh. Ella: Exploration through learned language abstraction. arXiv preprint arXiv:2103.05825, 2021.
[33] Dipendra Misra, Kejia Tao, Percy Liang, and Ashutosh Saxena. Environment-driven lexicon induction for high-level instructions. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 992–1002, 2015.
[34] Dipendra K Misra, Jaeyong Sung, Kevin Lee, and Ashutosh Saxena. Tell me dave: Context-sensitive grounding of natural language to manipulation instructions. The International Journal of Robotics Research, 35(1-3):281–300, 2016.
[35] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
[36] Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. Language models as knowledge bases? arXiv preprint arXiv:1909.01066, 2019.
[37] Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. Synchromesh: Reliable code generation from pre-trained language models. arXiv preprint arXiv:2201.11227, 2022.
[38] Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. Virtualhome: Simulating household activities via programs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8494–8502, 2018.
[39] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[40] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
[41] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.
[42] Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910, 2020.
[43] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. arXiv preprint arXiv:2112.08633, 2021.
[44] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019.
[45] Pratyusha Sharma, Antonio Torralba, and Jacob Andreas. Skill induction and planning with latent language. arXiv preprint arXiv:2110.01517, 2021.
[46] Jianhao Shen, Yichun Yin, Lin Li, Lifeng Shang, Xin Jiang, Ming Zhang, and Qun Liu. Generate & rank: A multi-task framework for math word problems. arXiv preprint arXiv:2109.03034, 2021.
[47] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10740–10749, 2020.
[48] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020.
[49] Alessandro Suglia, Qiaozi Gao, Jesse Thomason, Govind Thattai, and Gaurav Sukhatme. Embodied bert: A transformer model for embodied, language-guided visual task completion. arXiv preprint arXiv:2108.04927, 2021.
[50] Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. olmpics-on what language model pre-training captures. Transactions of the Association for Computational Linguistics, 8:743–758, 2020.
[51] Moritz Tenorth, Daniel Nyga, and Michael Beetz. Understanding and executing instructions for everyday manipulation tasks from the world wide web. In 2010 ieee international conference on robotics and automation, pages 1486–1491. IEEE, 2010.
[52] Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. arXiv preprint arXiv:2106.13884, 2021.
[53] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
[54] Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6629–6638, 2019.
[55] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.

A Appendix

A.1 Hyperparameter Search

For each evaluated method, we perform grid search over the following hyperparameters:

Name                 Description                                                                                       Search Values
epsilon (ε)          Out-of-distribution early termination threshold                                                   {0, 0.4, 0.8}
temperature          Sampling parameter adjusting relative token probabilities                                         {0.1, 0.3, 0.6}
k                    Number of samples generated by Planning LM                                                        {1, 10}
beta (β)             Weighting coefficient in action translation to trade off semantic and translation correctness     {0.3}
frequency_penalty    OpenAI API only; penalize new tokens based on their existing frequency in the text so far         {0.1, 0.3, 0.6, 0.9}
presence_penalty     OpenAI API only; penalize new tokens based on whether they appear in the text so far              {0.3, 0.5, 0.8}
repetition_penalty   Hugging Face Transformers only; penalize new tokens based on whether repeating existing text      {1.0, 1.2, 1.5, 1.8}

For methods that use a fixed example across evaluated tasks, we search over the following three randomly chosen examples:

Example 1
Task: Use computer
Step 1: Walk to home office
Step 2: Walk to chair
Step 3: Find chair
Step 4: Sit on chair
Step 5: Find computer
Step 6: Switch on computer
Step 7: Turn to computer
Step 8: Look at computer
Step 9: Find keyboard
Step 10: Type on keyboard

Example 2
Task: Relax on sofa
Step 1: Walk to home office
Step 2: Walk to couch
Step 3: Find couch
Step 4: Sit on couch
Step 5: Find pillow
Step 6: Lie on couch

Example 3
Task: Read book
Step 1: Walk to home office
Step 2: Walk to novel
Step 3: Find novel
Step 4: Grab novel
Step 5: Find chair
Step 6: Sit on chair
Step 7: Read novel

A.2 Details of Human Evaluations

Human evaluations are conducted on Amazon Mechanical Turk.
For each method, we generate action plans for all 88 high-level tasks. To account for the expressivity of the VirtualHome environment [38], we include action plans written by human experts from the VirtualHome dataset as references in our human evaluations. The evaluations are conducted in the form of questionnaires containing all action plans, whose order is randomly shuffled and whose corresponding methods are unknown to the annotators. Human annotators are required to answer all the questions in the questionnaire. For each question, the annotators need to answer either “Yes” or “No”, indicating whether they believe the action plan completes the task. For each method, we report the correctness percentage averaged across the 10 participating human annotators and all 88 tasks. We further report the standard error of the mean across human annotators. A screenshot of the interface can be found in Figure 6.

Figure 6: Screenshot of human evaluation interface, conducted as a Google Forms questionnaire.

Page 20:
A.3 All Evaluated Tasks

The evaluated tasks are part of the ActivityPrograms dataset collected by Puig et al. [38]. Some of the task names may contain misspelling(s).

1. Apply lotion
2. Arrange folders
3. Breakfast
4. Browse internet
5. Brush teeth
6. Change clothes
7. Change sheets and pillow cases
8. Collect napkin rings
9. Complete surveys on amazon turk
10. Compute
11. Decorate it
12. Do homework
13. Do work
14. Draft home
15. Draw picture
16. Dry soap bottles
17. Dust
18. Eat cereal
19. Eat cheese
20. Eat snacks and drink tea
21. Empty dishwasher and fill dishwasher
22. Entertain
23. Feed me
24. Find dictionary
25. Fix snack
26. Get glass of milk
27. Give milk to cat
28. Go to sleep
29. Grab things
30. Hand washing
31. Hang keys
32. Hang pictures
33. Iron shirt
34. Keep cats inside while door is open
35. Keep cats out of room
36. Leave home
37. Listen to music
38. Look at mirror
39. Look at painting
40. Make bed
41. Make popcorn
42. Organize closet
43. Organize pantry
44. Paint ceiling
45. Pay bills
46. Pick up toys
47. Play musical chairs
48. Prepare pot of boiling water
49. Push all chairs in
50. Push in desk chair
51. Put alarm clock in bedroom
52. Put away groceries
53. Put away toys
54. Put clothes away
55. Put mail in mail organizer
56. Put on your shoes
57. Put out flowers
58. Put up decoration
59. Read
60. Read newspaper
61. Read on sofa
62. Read to child
63. Read yourself to sleep
64. Receive credit card
65. Restock
66. Scrubbing living room tile floor is once week activity for me
67. Style hair
68. Switch on lamp
69. Take jacket off
70. Take shoes off
71. Tale off shoes
72. Throw away paper
73. Try yourself off
74. Turn off TV
75. Turn on TV with remote
76. Turn on radio
77. Type up document
78. Unload various items from pockets and place them in bowl on table
79. Use laptop
80. Vacuum
81. Walk to room
82. Wash dirty dishes
83. Wash face
84. Wash monitor
85. Wash teeth
86. Watch horror movie
87. Wipe down sink
88. Write book

Page 21:
A.4 Natural Language Templates for All Atomic Actions

VirtualHome requires action steps specified in a specific format, yet language models are trained to deal with mostly natural language. We thus define a natural language template for each atomic action and only expose the converted natural language text in all operations involving language models, i.e. autoregressive generation and action translation. After we obtain an entire generated program expressed in natural language, such as those in Figure 1 and Figure 2, we then convert each action step to the VirtualHome syntax. The full list of the atomic actions and their natural language templates can be found below.
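To make the conversion step described in A.4 concrete, the following is a minimal Python sketch, not the authors' implementation. It covers only a hand-picked subset of the templates. The pattern list, the function name to_virtualhome, and the choice to join multi-word object names with underscores are assumptions made for illustration; the instance index (1) is simply copied from the syntax listed in this appendix.

```python
import re

# Hypothetical subset of the natural language templates, written as regexes.
# Order matters: more specific phrases ("put back X", "put on X") are tried
# before the two-argument pattern "put X on Y".
TEMPLATES = [
    (r"put back (.+)", "PUTOBJBACK"),
    (r"put on (.+)", "PUTON"),
    (r"put (.+) on (.+)", "PUTBACK"),
    (r"put (.+) in (.+)", "PUTIN"),
    (r"pour (.+) into (.+)", "POUR"),
    (r"walk to (.+)", "WALK"),
    (r"switch on (.+)", "SWITCHON"),
    (r"sit on (.+)", "SIT"),
    (r"find (.+)", "FIND"),
    (r"grab (.+)", "GRAB"),
    (r"sleep", "SLEEP"),
]


def to_virtualhome(step):
    """Convert one natural-language step to VirtualHome-style syntax.

    Returns None when no template in the (partial) list matches.
    """
    text = step.strip().lower()
    for pattern, action in TEMPLATES:
        match = re.fullmatch(pattern, text)
        if match:
            # Assumption: multi-word objects are joined with underscores.
            args = " ".join(
                "<{}> (1)".format(arg.replace(" ", "_")) for arg in match.groups()
            )
            return "[{}] {}".format(action, args).strip()
    return None


if __name__ == "__main__":
    for step in ["Walk to home office", "Put food on plate", "Sleep", "Get in bed"]:
        print(step, "->", to_virtualhome(step))
```

A step such as "Get in bed" matches no template and returns None, which is the kind of free-form output that the action translation procedure is meant to map onto admissible actions.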
Atomic Action in VirtualHome Syntax | Natural Language Template
[CLOSE] <arg1> (1) | close <arg1>
[CUT] <arg1> (1) | cut <arg1>
[DRINK] <arg1> (1) | drink <arg1>
[DROP] <arg1> (1) | drop <arg1>
[EAT] <arg1> (1) | eat <arg1>
[FIND] <arg1> (1) | find <arg1>
[GRAB] <arg1> (1) | grab <arg1>
[GREET] <arg1> (1) | greet <arg1>
[LIE] <arg1> (1) | lie on <arg1>
[LOOKAT] <arg1> (1) | look at <arg1>
[MOVE] <arg1> (1) | move <arg1>
[OPEN] <arg1> (1) | open <arg1>
[PLUGIN] <arg1> (1) | plug in <arg1>
[PLUGOUT] <arg1> (1) | plug out <arg1>
[POINTAT] <arg1> (1) | point at <arg1>
[POUR] <arg1> (1) <arg2> (1) | pour <arg1> into <arg2>
[PULL] <arg1> (1) | pull <arg1>
[PUSH] <arg1> (1) | push <arg1>
[PUTBACK] <arg1> (1) <arg2> (1) | put <arg1> on <arg2>
[PUTIN] <arg1> (1) <arg2> (1) | put <arg1> in <arg2>
[PUTOBJBACK] <arg1> (1) | put back <arg1>
[PUTOFF] <arg1> (1) | take off <arg1>
[PUTON] <arg1> (1) | put on <arg1>
[READ] <arg1> (1) | read <arg1>
[RELEASE] | release
[RINSE] <arg1> (1) | rinse <arg1>
[RUN] <arg1> (1) | run to <arg1>
[SCRUB] <arg1> (1) | scrub <arg1>
[SIT] <arg1> (1) | sit on <arg1>
[SLEEP] | sleep
[SQUEEZE] <arg1> (1) | squeeze <arg1>
[STANDUP] | stand up
[SWITCHOFF] <arg1> (1) | switch off <arg1>
[SWITCHON] <arg1> (1) | switch on <arg1>
[TOUCH] <arg1> (1) | touch <arg1>
[TURNTO] <arg1> (1) | turn to <arg1>
[TYPE] <arg1> (1) | type on <arg1>
[WAKEUP] | wake up
[WALK] <arg1> (1) | walk to <arg1>
[WASH] <arg1> (1) | wash <arg1>
[WATCH] <arg1> (1) | watch <arg1>
[WIPE] <arg1> (1) | wipe <arg1>

Page 22:
A.5 Random Samples of Action Plans

For each task, three plans are listed: Human, Vanilla GPT3-175B, and Translated GPT3-175B (Ours).

Task: Go to sleep
Human: Step 1: Walk to bedroom Step 2: Walk to bed Step 3: Find bed Step 4: Turn to bed Step 5: Look at bed Step 6: Sit on bed Step 7: Sleep
Vanilla GPT3-175B: Step 1: Walk to bedroom Step 2: Walk to bed Step 3: Find pajamas Step 4: Put on pajamas Step 5: Find slippers Step 6: Put on slippers Step 7: Get in bed Step 8: Sleep
Translated GPT3-175B (Ours): Step 1: Walk to bedroom Step 2: Walk to bed Step 3: Lie on bed Step 4: Sleep

Task: Browse internet
Human: Step 1: Walk to home office Step 2: Walk to chair Step 3: Find chair Step 4: Sit on chair Step 5: Find computer Step 6: Switch on computer Step
7: Turn to computer Step 8: Look at computer Step 9: Find keyboard Step 10: Type on keyboard
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to computer Step 3: Find computer Step 4: Grab computer Step 5: Find chair Step 6: Sit on chair Step 7: Browse internet
Translated GPT3-175B (Ours): Step 1: Walk to home office Step 2: Walk to desk Step 3: Find chair Step 4: Sit on chair Step 5: Find computer Step 6: Switch on computer Step 7: Find mouse Step 8: Grab mouse Step 9: Turn to computer Step 10: Point at computer

Task: Wash teeth
Human: Step 1: Stand up Step 2: Walk to bathroom Step 3: Walk to sink Step 4: Find sink Step 5: Turn to sink Step 6: Look at sink Step 7: Find teeth Step 8: Wash teeth
Vanilla GPT3-175B: Step 1: Walk to bathroom Step 2: Walk to sink Step 3: Turn on faucet Step 4: Grab toothbrush Step 5: Put toothpaste on toothbrush Step 6: Put toothbrush in mouth Step 7: Brush teeth Step 8: Rinse mouth out with water Step 9: Turn off faucet Step 10: Walk back to bedroom
Translated GPT3-175B (Ours): Step 1: Walk to bathroom Step 2: Walk to sink Step 3: Find faucet Step 4: Switch on faucet Step 5: Walk to toothbrush Step 6: Grab toothbrush

Task: Brush teeth
Human: Step 1: Walk to bathroom Step 2: Walk to tooth paste Step 3: Find tooth paste Step 4: Grab tooth paste Step 5: Find toothbrush Step 6: Grab toothbrush
Vanilla GPT3-175B: Step 1: Walk to bathroom Step 2: Walk to sink Step 3: Turn on water Step 4: Grab toothbrush Step 5: Put toothpaste on toothbrush Step 6: Put toothbrush in mouth Step 7: Brush teeth
Translated GPT3-175B (Ours): Step 1: Find toothbrush Step 2: Turn to toothbrush Step 3: Point at toothbrush Step 4: Wash toothbrush

Task: Vacuum
Human: Step 1: Walk to home office Step 2: Walk to computer Step 3: Find vacuum cleaner Step 4: Switch on vacuum cleaner Step 5: Switch off vacuum cleaner
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to vacuum Step 3: Find vacuum Step 4: Grab vacuum Step 5: Find chair Step 6: Sit on chair Step 7: Vacuum
Translated GPT3-175B (Ours): Step 1: Walk to home office Step
2: Walk to dresser Step 3: Open dresser Step 4: Find vacuum cleaner Step 5: Grab vacuum cleaner Step 6: Pull vacuum cleaner Step 7: Plug in vacuum cleaner Step 8: Switch on vacuum cleaner

Page 23:
Task: Change sheets and pillow cases
Human: Step 1: Find pillow Step 2: Turn to pillow Step 3: Look at pillow Step 4: Grab pillow
Vanilla GPT3-175B: Step 1: Walk to bedroom Step 2: Walk to bed Step 3: Change sheets and pillow cases Step 4: Walk to kitchen Step 5: Walk to sink Step 6: Wash sheets and pillow cases Step 7: Walk to bedroom Step 8: Put sheets and pillow cases on bed Step 9: Walk to living room Step 10: Sit on couch
Translated GPT3-175B (Ours): Step 1: Walk to bedroom Step 2: Walk to bed Step 3: Find sheets Step 4: Grab sheets Step 5: Find pillow Step 6: Grab pillow

Task: Wash dirty dishes
Human: Step 1: Walk to dining room Step 2: Walk to sink Step 3: Find sponge Step 4: Grab sponge Step 5: Find faucet Step 6: Switch on faucet Step 7: Put sponge on sink Step 8: Grab sponge Step 9: Find dish soap Step 10: Grab dish soap Step 11: Pour dish soap into sponge Step 12: Put back dish soap Step 13: Find plate Step 14: Grab plate Step 15: Scrub plate Step 16: Rinse plate Step 17: Find dishrack Step 18: Put plate on dishrack Step 19: Find plate Step 20: Grab plate Step 21: Scrub plate Step 22: Rinse plate Step 23: Put plate on dishrack Step 24: Rinse sponge Step 25: Put back sponge Step 26: Switch off faucet
Vanilla GPT3-175B: Step 1: Walk to kitchen Step 2: Walk to sink Step 3: Wash dishes Step 4: Dry dishes Step 5: Put dishes away
Translated GPT3-175B (Ours): Step 1: Walk to dining room Step 2: Walk to sink Step 3: Find faucet Step 4: Switch on faucet Step 5: Walk to plate Step 6: Grab plate

Page 24:
Task: Feed me
Human: Step 1: Walk to dining room Step 2: Walk to plate Step 3: Find plate Step 4: Grab plate Step 5: Find food Step 6: Grab food Step 7: Put food on plate Step 8: Find food Step 9: Grab food Step 10: Put food on plate Step 11: Find fork Step 12:
Grab fork Step 13: Eat plate Step 14: Eat plate
Vanilla GPT3-175B: Step 1: Walk to kitchen Step 2: Walk to refrigerator Step 3: Open refrigerator Step 4: Grab food Step 5: Find plate Step 6: Put food on plate Step 7: Find fork Step 8: Put fork on plate Step 9: Find spoon Step 10: Put spoon on plate Step 11: Find cup Step 12: Pour water in cup Step 13: Walk to table Step 14: Sit at table Step 15: Eat food
Translated GPT3-175B (Ours): Step 1: Walk to dining room Step 2: Walk to table Step 3: Find table Step 4: Turn to table Step 5: Find chair Step 6: Sit on chair Step 7: Find food Step 8: Grab food Step 9: Find plate Step 10: Put food on plate

Task: Breakfast
Human: Step 1: Walk to dining room Step 2: Walk to food Step 3: Grab food Step 4: Walk to bedroom Step 5: Walk to television Step 6: Switch on television Step 7: Walk to bed Step 8: Lie on bed Step 9: Turn to television Step 10: Watch television Step 11: Eat food
Vanilla GPT3-175B: Step 1: Walk to kitchen Step 2: Find cereal Step 3: Pour cereal into bowl Step 4: Pour milk into bowl Step 5: Walk to living room Step 6: Sit on couch Step 7: Eat cereal
Translated GPT3-175B (Ours): Step 1: Walk to dining room Step 2: Walk to freezer Step 3: Find freezer Step 4: Open freezer Step 5: Find bread Step 6: Grab bread Step 7: Close freezer Step 8: Find toaster Step 9: Plug in toaster Step 10: Put bread on plate Step 11: Switch on toaster

Task: Read
Human: Step 1: Find novel Step 2: Grab novel Step 3: Walk to bedroom Step 4: Walk to chair Step 5: Sit on chair Step 6: Read novel
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to novel Step 3: Find novel Step 4: Grab novel Step 5: Find chair Step 6: Sit on chair Step 7: Read novel
Translated GPT3-175B (Ours): Step 1: Walk to home office Step 2: Walk to light Step 3: Find light Step 4: Switch on light Step 5: Find novel Step 6: Grab novel Step 7: Find chair Step 8: Sit on chair Step 9: Read novel

Task: Style hair
Human: Step 1: Find towel Step 2: Turn to towel Step 3: Point at towel Step 4: Wash towel Step 5: Turn to towel Step 6: Look at towel Step 7:
Squeeze towel Step 8: Walk to bathroom
Vanilla GPT3-175B: Step 1: Walk to bathroom Step 2: Walk to mirror Step 3: Find hairbrush Step 4: Grab hairbrush Step 5: Find chair Step 6: Sit on chair Step 7: Style hair
Translated GPT3-175B (Ours): Step 1: Walk to bathroom Step 2: Walk to mirror Step 3: Find hair Step 4: Grab hair Step 5: Find comb Step 6: Grab comb Step 7: Find brush Step 8: Grab brush

Page 25:
Task: Pick up toys
Human: Step 1: Walk to bedroom Step 2: Walk to toy Step 3: Find toy Step 4: Grab toy Step 5: Walk to filing cabinet Step 6: Open filing cabinet Step 7: Put toy on filing cabinet Step 8: Find toy Step 9: Grab toy Step 10: Walk to filing cabinet Step 11: Put toy on filing cabinet Step 12: Find toy Step 13: Grab toy Step 14: Walk to filing cabinet Step 15: Put toy on filing cabinet Step 16: Close filing cabinet
Vanilla GPT3-175B: Step 1: Walk to living room Step 2: Walk to toy box Step 3: Find toy box Step 4: Grab toy box Step 5: Find chair Step 6: Sit on chair Step 7: Pick up toys
Translated GPT3-175B (Ours): Step 1: Find toy Step 2: Turn to toy Step 3: Point at toy Step 4: Wash toy

Task: Decorate it
Human: Step 1: Walk to home office Step 2: Walk to centerpiece Step 3: Find centerpiece Step 4: Grab centerpiece Step 5: Walk to table Step 6: Put centerpiece on table Step 7: Find drawing Step 8: Grab drawing Step 9: Find wall Step 10: Turn to wall Step 11: Put drawing on wall Step 12: Find drawing Step 13: Grab drawing Step 14: Turn to wall Step 15: Put drawing on wall
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to novel Step 3: Find novel Step 4: Grab novel Step 5: Find chair Step 6: Sit on chair Step 7: Read novel Step 8: Decorate it
Translated GPT3-175B (Ours): Step 1: Walk to table Step 2: Walk to centerpiece Step 3: Look at centerpiece Step 4: Grab centerpiece Step 5: Point at centerpiece

Task: Put away groceries
Human: Step 1: Walk to dining room Step 2: Walk to freezer Step 3: Find freezer Step 4: Open freezer Step 5: Find food Step 6: Grab food Step 7: Put food in freezer Step 8: Close
freezer
Vanilla GPT3-175B: Step 1: Walk to kitchen Step 2: Walk to refrigerator Step 3: Open refrigerator Step 4: Grab milk and eggs Step 5: Close refrigerator Step 6: Walk to counter Step 7: Put milk and eggs on counter Step 8: Walk to cabinet Step 9: Open cabinet Step 10: Put away milk and eggs Step 11: Close cabinet Step 12: Walk to refrigerator Step 13: Grab orange juice and bread Step 14: Close refrigerator
Translated GPT3-175B (Ours): Step 1: Walk to kitchen Step 2: Walk to fridge Step 3: Open fridge Step 4: Put food in fridge Step 5: Close fridge Step 6: Walk to dining room Step 7: Walk to livingroom Step 8: Walk to bedroom Step 9: Walk to bathroom

Page 26:
Task: Draw picture
Human: Step 1: Walk to home office Step 2: Walk to chair Step 3: Find chair Step 4: Sit on chair Step 5: Find desk Step 6: Open desk Step 7: Find check Step 8: Grab check Step 9: Put check on desk Step 10: Find pencil Step 11: Grab pencil Step 12: Close desk
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to pencils Step 3: Find pencils Step 4: Grab pencils Step 5: Find paper Step 6: Sit on chair Step 7: Draw picture
Translated GPT3-175B (Ours): Step 1: Walk to home office Step 2: Walk to wall Step 3: Walk to drawing Step 4: Find drawing Step 5: Turn to drawing Step 6: Look at drawing Step 7: Touch drawing Step 8: Pull drawing Step 9: Push drawing

Task: Entertain
Human: Step 1: Walk to home office Step 2: Walk to cheese Step 3: Find cheese Step 4: Find food Step 5: Find couch Step 6: Sit on couch Step 7: Find wine glass Step 8: Grab wine glass Step 9: Drink wine glass
Vanilla GPT3-175B: Step 1: Walk to living room Step 2: Walk to entertainment center Step 3: Find remote Step 4: Grab remote Step 5: Find chair Step 6: Sit on chair Step 7: Entertain self
Translated GPT3-175B (Ours): Step 1: Walk to livingroom Step 2: Walk to couch Step 3: Find couch Step 4: Sit on couch

Task: Type up document
Human: Step 1: Walk to home office Step 2: Walk to chair Step 3: Find chair Step 4: Sit on chair Step 5: Find computer Step 6: Switch on
computer Step 7: Turn to computer Step 8: Look at computer Step 9: Find check Step 10: Grab check Step 11: Read check Step 12: Find coffee cup Step 13: Grab coffee cup Step 14: Drink coffee cup Step 15: Put back coffee cup Step 16: Find keyboard Step 17: Type on keyboard
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to keyboard Step 3: Find keyboard Step 4: Grab keyboard Step 5: Find chair Step 6: Sit on chair Step 7: Type document
Translated GPT3-175B (Ours): Step 1: Walk to home office Step 2: Walk to computer Step 3: Find computer Step 4: Switch on computer Step 5: Find chair Step 6: Sit on chair Step 7: Turn to computer Step 8: Look at computer Step 9: Find keyboard Step 10: Grab keyboard Step 11: Type on keyboard Step 12: Stand up Step 13: Find printer Step 14: Switch on printer Step 15: Find printing paper

Task: Put on your shoes
Human: Step 1: Walk to home office Step 2: Walk to shoes Step 3: Find shoes Step 4: Grab shoes Step 5: Find chair Step 6: Sit on chair Step 7: Find feet both Step 8: Put shoes on feet both
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to shoes Step 3: Find shoes Step 4: Grab shoes Step 5: Find chair Step 6: Sit on chair Step 7: Put on shoes
Translated GPT3-175B (Ours): Step 1: Find shoes Step 2: Turn to shoes Step 3: Point at shoes Step 4: Wash shoes

Page 27:
Task: Watch horror movie
Human: Step 1: Walk to home office Step 2: Walk to remote control Step 3: Find remote control Step 4: Grab remote control Step 5: Find couch Step 6: Sit on couch Step 7: Find television Step 8: Switch on television Step 9: Turn to television Step 10: Watch television
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to horror movie Step 3: Find horror movie Step 4: Grab horror movie Step 5: Find chair Step 6: Sit on chair Step 7: Watch horror movie
Translated GPT3-175B (Ours): Step 1: Find television Step 2: Switch on television Step 3: Find cd player Step 4: Switch on cd player Step 5: Switch off cd player Step 6: Find remote control Step
7: Switch on remote control Step 8: Switch off remote control

Task: Turn on radio
Human: Step 1: Walk to home office Step 2: Walk to stereo Step 3: Find stereo Step 4: Find electrical outlet Step 5: Turn to electrical outlet Step 6: Plug in stereo Step 7: Switch on electrical outlet Step 8: Switch on stereo
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to radio Step 3: Find radio Step 4: Grab radio Step 5: Find chair Step 6: Sit on chair Step 7: Turn on radio
Translated GPT3-175B (Ours): Step 1: Walk to home office Step 2: Walk to chair Step 3: Find chair Step 4: Sit on chair Step 5: Find remote control Step 6: Grab remote control Step 7: Find radio Step 8: Point at radio Step 9: Switch on radio Step 10: Put back remote control

Task: Look at mirror
Human: Step 1: Walk to bathroom Step 2: Walk to mirror Step 3: Find mirror Step 4: Turn to mirror Step 5: Look at mirror
Vanilla GPT3-175B: Step 1: Walk to bathroom Step 2: Walk to mirror Step 3: Look in mirror
Translated GPT3-175B (Ours): Step 1: Walk to bedroom Step 2: Walk to mirror Step 3: Find mirror Step 4: Turn to mirror Step 5: Look at mirror Step 6: Find face Step 7: Turn to face Step 8: Look at face Step 9: Find hair Step 10: Turn to hair Step 11: Look at hair Step 12: Find dress Step 13: Turn to dress Step 14: Look at dress

Task: Leave home
Human: Step 1: Walk to home office Step 2: Walk to shoes Step 3: Find shoes Step 4: Grab shoes Step 5: Put on shoes
Vanilla GPT3-175B: Step 1: Walk to front door Step 2: Open front door Step 3: Walk to car Step 4: Open car door Step 5: Get in car Step 6: Close car door Step 7: Drive to work
Translated GPT3-175B (Ours): Step 1: Find shoes Step 2: Grab shoes Step 3: Put on shoes

Task: Walk to room
Human: Step 1: Walk to home office Step 2: Walk to home office
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to novel Step 3: Find novel Step 4: Grab novel Step 5: Find chair Step 6: Sit on chair Step 7: Read novel
Translated GPT3-175B (Ours): Step 1: Walk to dining room Step 2: Walk to livingroom Step 3: Walk to bedroom

Page 28:
Task: Hang pictures
Human: Step 1: Walk to bedroom Step 2: Walk to drawing Step 3: Find drawing Step 4: Grab drawing Step 5: Find wall Step 6: Put drawing on wall Step 7: Turn to drawing Step 8: Look at drawing Step 9: Find drawing Step 10: Grab drawing Step 11: Put drawing on wall Step 12: Turn to drawing Step 13: Look at drawing
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to box of pictures Step 3: Find box of pictures Step 4: Grab box of pictures Step 5: Walk to wall Step 6: Hang pictures on wall
Translated GPT3-175B (Ours): Step 1: Walk to home office Step 2: Walk to table Step 3: Find keys Step 4: Grab keys Step 5: Walk to wall Step 6: Find hanger Step 7: Put keys on hanger

Task: Wipe down sink
Human: Step 1: Walk to dining room Step 2: Walk to rag Step 3: Find rag Step 4: Grab rag Step 5: Walk to sink Step 6: Wipe sink Step 7: Put back rag
Vanilla GPT3-175B: Step 1: Walk to kitchen Step 2: Walk to sink Step 3: Grab sponge Step 4: Wipe down sink Step 5: Rinse sponge Step 6: Put sponge away Step 7: Wash hands
Translated GPT3-175B (Ours): Step 1: Walk to bathroom Step 2: Walk to cleaning solution Step 3: Find cleaning solution Step 4: Grab cleaning solution Step 5: Find sink Step 6: Pour cleaning solution into sink Step 7: Find faucet Step 8: Switch on faucet Step 9: Scrub sink Step 10: Wash sink

Task: Take shoes off
Human: Step 1: Walk to home office Step 2: Walk to shoes Step 3: Find shoes Step 4: Find shoes Step 5: Take off shoes Step 6: Find dresser Step 7: Open dresser Step 8: Grab shoes Step 9: Put shoes in dresser Step 10: Close dresser
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to shoes Step 3: Find shoes Step 4: Grab shoes Step 5: Find chair Step 6: Sit on chair Step 7: Take shoes off
Translated GPT3-175B (Ours): Step 1: Walk to home office Step 2: Walk to shoes Step 3: Find shoes

Task: Give milk to cat
Human: Step 1: Walk to home office Step 2: Walk to cat Step 3: Find cat Step 4: Turn to cat Step 5: Point at cat Step 6: Pull cat Step 7: Find milk Step 8: Grab milk Step 9: Find
plate Step 10: Pour milk into plate Step 11: Touch cat Step 12: Scrub cat
Vanilla GPT3-175B: Step 1: Walk to kitchen Step 2: Walk to fridge Step 3: Open fridge Step 4: Grab milk Step 5: Walk to cat Step 6: Pour milk into bowl Step 7: Put bowl on floor Step 8: Walk away from cat
Translated GPT3-175B (Ours): Step 1: Find cat Step 2: Touch cat Step 3: Open fridge Step 4: Touch milk Step 5: Pour milk into bowl

Page 29:
Task: Grab things
Human: Step 1: Walk to novel Step 2: Turn to novel Step 3: Grab novel
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to novel Step 3: Find novel Step 4: Grab novel Step 5: Find chair Step 6: Sit on chair Step 7: Read novel Step 8: Grab things
Translated GPT3-175B (Ours): Step 1: Find toy Step 2: Turn to toy Step 3: Point at toy Step 4: Wash toy

Task: Use laptop
Human: Step 1: Walk to home office Step 2: Walk to couch Step 3: Find couch Step 4: Sit on couch Step 5: Find laptop Step 6: Switch on laptop Step 7: Turn to laptop Step 8: Look at laptop Step 9: Find keyboard Step 10: Type on keyboard Step 11: Turn to keyboard Step 12: Look at keyboard Step 13: Type on keyboard Step 14: Turn to laptop Step 15: Look at laptop
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to laptop Step 3: Find laptop Step 4: Grab laptop Step 5: Find chair Step 6: Sit on chair Step 7: Use laptop
Translated GPT3-175B (Ours): Step 1: Walk to home office Step 2: Walk to chair Step 3: Find chair Step 4: Sit on chair Step 5: Find laptop Step 6: Switch on laptop Step 7: Find mouse Step 8: Grab mouse Step 9: Find mousepad Step 10: Put mouse on mousepad Step 11: Turn to laptop Step 12: Point at laptop

Task: Organize pantry
Human: Step 1: Walk to dining room Step 2: Walk to pantry Step 3: Find food Step 4: Grab food Step 5: Turn to food Step 6: Look at food Step 7: Find garbage can Step 8: Put food on garbage can Step 9: Find vegetable Step 10: Grab vegetable Step 11: Turn to vegetable Step 12: Look at vegetable Step 13: Put vegetable on garbage can Step 14: Find dry pasta Step 15: Grab dry pasta Step
16: Turn to dry pasta Step 17: Look at dry pasta Step 18: Put back dry pasta Step 19: Find food Step 20: Turn to food Step 21: Look at food Step 22: Push food Step 23: Find noodles Step 24: Grab noodles Step 25: Turn to noodles Step 26: Look at noodles Step 27: Put back noodles
Vanilla GPT3-175B: Step 1: Walk to kitchen Step 2: Walk to pantry Step 3: Find pantry Step 4: Grab pantry Step 5: Find chair Step 6: Sit on chair Step 7: Organize pantry
Translated GPT3-175B (Ours): Step 1: Walk to pantry Step 2: Find pantry

Page 30:
Task: Change clothes
Human: Step 1: Walk to bedroom Step 2: Walk to dresser Step 3: Open dresser Step 4: Find dress Step 5: Grab dress Step 6: Put on dress Step 7: Close dresser
Vanilla GPT3-175B: Step 1: Walk to bedroom Step 2: Walk to closet Step 3: Find clothes Step 4: Grab clothes Step 5: Find chair Step 6: Sit on chair Step 7: Change clothes
Translated GPT3-175B (Ours): Step 1: Walk to bedroom Step 2: Walk to dresser Step 3: Find dresser Step 4: Open dresser Step 5: Find pants Step 6: Grab pants Step 7: Put on pants Step 8: Find shirt Step 9: Grab shirt Step 10: Put on shirt Step 11: Close dresser

Task: Put up decoration
Human: Step 1: Walk to home office Step 2: Walk to drawing Step 3: Find drawing Step 4: Find centerpiece Step 5: Grab centerpiece Step 6: Walk to wall Step 7: Grab drawing Step 8: Put drawing on wall Step 9: Find centerpiece Step 10: Grab centerpiece Step 11: Walk to table Step 12: Put centerpiece on table
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to decoration Step 3: Find decoration Step 4: Grab decoration Step 5: Find chair Step 6: Sit on chair Step 7: Put up decoration
Translated GPT3-175B (Ours): Step 1: Walk to livingroom Step 2: Walk to table

Task: Make bed
Human: Step 1: Walk to bedroom Step 2: Walk to bed Step 3: Find sheets Step 4: Grab sheets Step 5: Pull sheets
Vanilla GPT3-175B: Step 1: Walk to bedroom Step 2: Walk to bed Step 3: Find sheets Step 4: Find blanket Step 5: Find pillow Step 6: Put sheets on bed Step 7: Put blanket on bed
Step 8: Put pillow on bed
Translated GPT3-175B (Ours): Step 1: Stand up Step 2: Find sheets Step 3: Find blanket Step 4: Find pillow Step 5: Find comforter Step 6: Find bed

Task: Scrubbing living room tile floor is once week activity for me
Human: Step 1: Find mop Step 2: Turn to mop Step 3: Point at mop Step 4: Wash mop Step 5: Turn to mop Step 6: Look at mop
Vanilla GPT3-175B: Step 1: Walk to living room Step 2: Walk to tile floor Step 3: Find scrub brush Step 4: Grab scrub brush Step 5: Scrub tile floor Step 6: Put down scrub brush Step 7: Walk to kitchen
Translated GPT3-175B (Ours): Step 1: Walk to livingroom Step 2: Walk to mop Step 3: Find mop Step 4: Grab mop Step 5: Walk to bathroom Step 6: Find bucket Step 7: Switch on faucet

Task: Read yourself to sleep
Human: Step 1: Walk to bedroom Step 2: Walk to novel Step 3: Find novel Step 4: Grab novel Step 5: Find bed Step 6: Sit on bed Step 7: Find bookmark Step 8: Grab bookmark Step 9: Read novel Step 10: Put bookmark on novel Step 11: Sleep
Vanilla GPT3-175B: Step 1: Walk to bedroom Step 2: Walk to bed Step 3: Find book Step 4: Grab book Step 5: Find chair Step 6: Sit on chair Step 7: Read book Step 8: Sleep
Translated GPT3-175B (Ours): Step 1: Walk to bedroom Step 2: Walk to bed Step 3: Lie on bed Step 4: Sleep Step 5: Read book

Page 31:
Task: Hand washing
Human: Step 1: Walk to bathroom Step 2: Walk to sink Step 3: Find faucet Step 4: Switch on faucet Step 5: Find soap Step 6: Grab soap Step 7: Find hands both Step 8: Scrub hands both Step 9: Put back soap Step 10: Wash hands both Step 11: Switch off faucet Step 12: Find towel Step 13: Grab towel Step 14: Wipe hands both
Vanilla GPT3-175B: Step 1: Walk to bathroom Step 2: Walk to sink Step 3: Turn on faucet Step 4: Grab soap Step 5: Wash hands Step 6: Rinse hands Step 7: Dry hands
Translated GPT3-175B (Ours): Step 1: Walk to bathroom Step 2: Walk to sink Step 3: Find faucet Step 4: Turn to faucet Step 5: Find hands
both Step 6: Wash hands both

Task: Throw away paper
Human: Step 1: Walk to home office Step 2: Walk to desk Step 3: Find desk Step 4: Turn to desk Step 5: Find chair Step 6: Sit on chair Step 7: Find check Step 8: Grab check Step 9: Squeeze check Step 10: Stand up Step 11: Walk to trashcan Step 12: Put check on trashcan
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to wastebasket Step 3: Find wastebasket Step 4: Grab wastebasket Step 5: Walk to desk Step 6: Drop paper in wastebasket
Translated GPT3-175B (Ours): Step 1: Walk to home office Step 2: Walk to table Step 3: Find table Step 4: Turn to table Step 5: Find paper Step 6: Grab paper Step 7: Walk to trashcan Step 8: Open trashcan Step 9: Put paper on trashcan Step 10: Close trashcan

Task: Tale off shoes
Human: Step 1: Walk to home office Step 2: Walk to shoes Step 3: Find shoes Step 4: Find shoes Step 5: Take off shoes Step 6: Grab shoes Step 7: Find mat Step 8: Put shoes on mat
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to novel Step 3: Find novel Step 4: Grab novel Step 5: Find chair Step 6: Sit on chair Step 7: Read novel
Translated GPT3-175B (Ours): Step 1: Walk to home office Step 2: Walk to shoes Step 3: Find shoes

Task: Hang keys
Human: Step 1: Walk to bedroom Step 2: Walk to dresser Step 3: Find purse Step 4: Grab purse Step 5: Open purse Step 6: Find keys Step 7: Grab keys Step 8: Find bowl Step 9: Put keys on bowl Step 10: Close purse Step 11: Put back purse
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to key hook Step 3: Find key hook Step 4: Grab key hook Step 5: Find keys Step 6: Hang keys on key hook Step 7: Walk to front door Step 8: Leave house
Translated GPT3-175B (Ours): Step 1: Walk to home office Step 2: Walk to table Step 3: Find keys Step 4: Grab keys Step 5: Walk to wall Step 6: Find hanger Step 7: Put keys on hanger

Page 32:
Task: Make popcorn
Human: Step 1: Walk to dining room Step 2: Walk to food Step 3: Find food Step 4: Grab food Step 5: Walk to microwave Step 6: Turn to microwave Step 7: Open
microwave Step 8: Put food in microwave Step 9: Close microwave Step 10: Switch on microwave
Vanilla GPT3-175B: Step 1: Walk to kitchen Step 2: Walk to microwave Step 3: Open microwave Step 4: Put bag of popcorn in microwave Step 5: Close microwave Step 6: Push buttons on microwave Step 7: Wait for popcorn to pop Step 8: Get bowl out of cupboard Step 9: Get butter out of fridge Step 10: Get salt out of cupboard Step 11: Pour popcorn into bowl Step 12: Add butter and salt to bowl Step 13:
Translated GPT3-175B (Ours): Step 1: Walk to kitchen Step 2: Walk to microwave Step 3: Find microwave Step 4: Open microwave

Task: Receive credit card
Human: Step 1: Walk to home office Step 2: Walk to envelope Step 3: Find envelope Step 4: Grab envelope Step 5: Walk to home office Step 6: Walk to scissors Step 7: Find scissors Step 8: Open envelope Step 9: Find creditcard Step 10: Grab creditcard
Vanilla GPT3-175B: Step 1: Walk to mailbox Step 2: Open mailbox Step 3: Grab credit card Step 4: Walk to home office Step 5: Walk to desk Step 6: Put credit card on desk Step 7: Sit at desk Step 8: Look at credit card
Translated GPT3-175B (Ours): Step 1: Walk to home office Step 2: Walk to cabinet Step 3: Find cabinet Step 4: Open cabinet

Task: Read to child
Human: Step 1: Walk to bedroom Step 2: Walk to child Step 3: Find child Step 4: Greet child Step 5: Walk to filing cabinet Step 6: Open filing cabinet Step 7: Find novel Step 8: Grab novel Step 9: Walk to chair Step 10: Pull chair Step 11: Sit on chair Step 12: Turn to novel Step 13: Look at novel Step 14: Turn to novel Step 15: Point at novel Step 16: Turn to novel Step 17: Point at novel Step 18: Read novel Step 19: Read novel Step 20: Read novel
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to novel Step 3: Find novel Step 4: Grab novel Step 5: Find chair Step 6: Sit on chair Step 7: Read novel
Translated GPT3-175B (Ours): Step 1: Walk to child Step 2: Read book

Page 33:
Task: Eat cheese
Human: Step 1: Walk to dining room Step 2: Walk to freezer Step 3: Open
freezer Step 4: Find cheese Step 5: Grab cheese Step 6: Eat cheese
Vanilla GPT3-175B: Step 1: Walk to kitchen Step 2: Walk to refrigerator Step 3: Open refrigerator Step 4: Grab cheese Step 5: Eat cheese
Translated GPT3-175B (Ours): Step 1: Walk to dining room Step 2: Walk to table Step 3: Find table Step 4: Turn to table Step 5: Find chair Step 6: Sit on chair Step 7: Find food Step 8: Grab food Step 9: Find plate Step 10: Put food on plate

Task: Unload various items from pockets and place them in bowl on table
Human: Step 1: Walk to home office Step 2: Walk to table Step 3: Find bowl Step 4: Grab bowl Step 5: Put bowl on table Step 6: Find pants Step 7: Find bills Step 8: Grab bills Step 9: Find phone Step 10: Grab phone
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to bowl on table Step 3: Find bowl on table Step 4: Grab bowl on table Step 5: Unload various items from pockets and place them in bowl on table
Translated GPT3-175B (Ours): Step 1: Find bowl Step 2: Turn to bowl Step 3: Point at bowl Step 4: Wash bowl

Task: Switch on lamp
Human: Step 1: Walk to bedroom Step 2: Walk to floor lamp Step 3: Find floor lamp Step 4: Switch on floor lamp
Vanilla GPT3-175B: Step 1: Walk to lamp Step 2: Switch on lamp Step 3: Walk to chair Step 4: Sit on chair Step 5: Read book
Translated GPT3-175B (Ours): Step 1: Walk to bedroom Step 2: Walk to floor lamp Step 3: Find floor lamp Step 4: Turn to floor lamp Step 5: Switch on floor lamp Step 6: Find bed Step 7: Lie on bed

Task: Take jacket off
Human: Step 1: Walk to bedroom Step 2: Walk to jacket Step 3: Find jacket Step 4: Find jacket Step 5: Take off jacket Step 6: Find dresser Step 7: Open dresser Step 8: Close dresser
Vanilla GPT3-175B: Step 1: Walk to home office Step 2: Walk to jacket Step 3: Find jacket Step 4: Grab jacket Step 5: Find chair Step 6: Sit on chair Step 7: Take jacket off
Translated GPT3-175B (Ours): Step 1: Walk to home office Step 2: Walk to jacket Step 3:
Find jacket
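The sampled plans above show that vanilla generations often contain steps with no counterpart among the atomic actions (for example "Get in bed"). As a rough illustration only, the Python sketch below splits a plan written in the "Task: ... Step 1: ... Step 2: ..." form and flags steps whose leading verb phrase is not in a small list. The list ADMISSIBLE_PREFIXES and the helper names are hypothetical, and the list is a subset of the A.4 templates. A step that passes this check is not guaranteed to be executable in VirtualHome, since objects and preconditions are not checked.

```python
import re

# Hypothetical subset of verb phrases taken from the A.4 templates (not the full list).
ADMISSIBLE_PREFIXES = (
    "walk to", "find", "grab", "sit on", "lie on", "put on", "switch on", "sleep",
)


def parse_plan(text):
    """Split 'Task: X Step 1: a Step 2: b' into (task, [a, b])."""
    head, *rest = re.split(r"\s*Step \d+:\s*", text.strip())
    task = re.sub(r"^Task:\s*", "", head).strip()
    steps = [step.strip() for step in rest if step.strip()]
    return task, steps


def unmatched_steps(steps):
    """Return the steps that do not start with any known verb phrase."""
    def matches(step):
        lowered = step.lower()
        return any(
            lowered == prefix or lowered.startswith(prefix + " ")
            for prefix in ADMISSIBLE_PREFIXES
        )
    return [step for step in steps if not matches(step)]


if __name__ == "__main__":
    # Vanilla GPT3-175B plan for "Go to sleep", copied from the samples above.
    plan = (
        "Task: Go to sleep Step 1: Walk to bedroom Step 2: Walk to bed "
        "Step 3: Find pajamas Step 4: Put on pajamas Step 5: Find slippers "
        "Step 6: Put on slippers Step 7: Get in bed Step 8: Sleep"
    )
    task, steps = parse_plan(plan)
    print(task, len(steps), unmatched_steps(steps))
```

A prefix check of this kind only approximates admissibility; the paper's method instead translates each generated step to its most similar admissible action.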