Authors: Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, Sercan Ö. Arık
Learn-by-interact: A Data-Centric Framework
for Self-Adaptive Agents in Realistic
Environments
Hongjin Su1,2,*, Ruoxi Sun1, Jinsung Yoon1, Pengcheng Yin1, Tao Yu2 and Sercan Ö. Arık1
1 Google, 2 The University of Hong Kong
Autonomous agents powered by large language models (LLMs) have the potential to enhance human
capabilities, assisting with digital tasks from sending emails to performing data analysis. The abilities
of existing LLMs at such tasks are often hindered by the lack of high-quality agent data from the
corresponding environments they interact with. We propose Learn-by-interact, a data-centric
framework to adapt LLM agents to any given environment without human annotations.
Learn-by-interact synthesizes trajectories of agent-environment interactions based on documentation, and
constructs instructions by summarizing or abstracting the interaction histories, a process called backward
construction. We assess the quality of our synthetic data by using them in both training-based scenarios
and training-free in-context learning (ICL), where we craft innovative retrieval approaches optimized
for agents. Extensive experiments on SWE-bench, WebArena, OSWorld and Spider2-V spanning across
realistic coding, web, and desktop environments show the effectiveness of Learn-by-interact in
various downstream agentic tasks: baseline results are improved by up to 12.2% for ICL with
Claude-3.5 and 19.5% for training with Codestral-22B. We further demonstrate the critical role of backward
construction, which provides up to 14.0% improvement for training. Our ablation studies demonstrate
the efficiency provided by our synthesized data in ICL and the superiority of our retrieval pipeline
over alternative approaches like conventional retrieval-augmented generation (RAG). We expect that
Learn-by-interact will serve as a foundation for agent data synthesis as LLMs are increasingly
deployed in real-world environments.
1 Introduction
Pre-trained large language models (LLMs) offer great potential for assisting humans with various tasks
in digital settings, such as editing images, performing data analysis, resolving software engineering
issues, and navigating commercial platforms (Jimenez et al., 2023; Xie et al., 2023, 2024; Yao et al.,
2022a). By streamlining these, LLM agents can greatly enhance human efficiency and productivity,
allowing users to shift their focus toward higher-level, creative, and strategic endeavors. To explore
this potential, many benchmarks (Cao et al., 2024; Jimenez et al., 2023; Koh et al., 2024; Xie et al.,
2024; Zhou et al., 2023b) and agentic frameworks (Chen et al., 2024a; Gur et al., 2023; Yang et al.,
2024, 2023; Zhan and Zhang, 2023) have been established based on realistic digital environments,
spanning web applications, code development, desktop computing, etc. However, LLMs often fall
short of expected performance in these tasks, consistently displaying a significant gap compared to
human capabilities. As a result, they remain less practical and reliable for real-world applications.
Efficient adaptation to new environments can be a key part of the performance improvements.
Prior works have explored various prompt-based approaches (Gur et al., 2023; Yang et al., 2024;
Yao et al., 2022b; Zhan and Zhang, 2023), which are constrained by the capabilities of the underlying
foundation models. Other studies on training LLMs with human-labeled examples (Chen et al., 2023,
2024b; Li et al., 2020), on the other hand, come with the fundamental limitation of high annotation
costs when new environments are considered. In particular, annotating agentic data can be quite
difficult and expensive due to long-trajectory interactions with environments and the specific domain
expertise required. Few works have explored fully autonomous data construction pipelines towards
self-adaptive agents that can efficiently learn new environments (Aksitov et al., 2023; Gulcehre et al.,
2023).
Corresponding author(s): hjsu@cs.hku.hk
* This work was done while Hongjin was a student researcher at Google Cloud AI Research.
arXiv:2501.10893v1 [cs.LG] 18 Jan 2025
[Figure 1 diagram. Panels: resources (tutorials and documentation, FAQ) and environments (code,
software, web); agent-environment interaction for a self-instruct task ("Upload CSV file in Google
Drive to BigQuery") whose trajectory (Obs0, ..., Obs13) contains a wrong prediction at Act3;
backward construction of new instructions for the sub-trajectory (Obs1, Act2, Obs2) and for the full
trajectory ("Link CSV file in Google Cloud Storage to BigQuery"); synthesized data, filtered data,
training, and in-context learning with agentic retrieval (model-based and observation-based).]
Figure 1 | Overview of the data synthesis and adaptation processes. Given an environment and
standard resources, we first leverage self-instruct to create a diverse set of instructions. LLMs are then
employed to complete these tasks, resulting in long trajectories of agent-environment interactions. We
construct task instructions using LLMs for each sub-trajectory, a process called backward construction.
The synthesized data are then filtered and used for both training and in-context learning, where we
design agentic retrieval to retrieve demonstration examples based on information at each step, using
both model-based and observation-based approaches. See Appendix E for the complete data synthesis
example and Algorithm 2 for more details on agentic retrieval.
In this paper, we introduce Learn-by-interact, a data-centric framework for LLMs to
self-adapt to new environments, utilizing agent data synthesis via interactions. Intuitively, the effects of
actions executed in environments (e.g., the next webpage after clicking a button) serve as
informative demonstrations that help LLMs in future navigation. Inspired by this, we design
Learn-by-interact, which first uses self-instruct (Wang et al., 2022b) to develop a variety of task instructions,
referring to standard resources such as documentation and tutorials for a given environment. This
covers the most important scenarios that human users are interested in and avoids intensive prompt
engineering to control the distribution and diversity of the generated data. We then collect diverse
trajectories from interactions between LLMs and environments, as illustrated in Fig. 1. However, given
the low performance of LLMs in existing agentic benchmarks (Cao et al., 2024; Xie et al., 2024), it is
likely that a large percentage of synthesized trajectories would not match the instructions. To
tackle this challenge, we construct new instructions by summarizing or abstracting each sub-trajectory,
leveraging the strong summarization capabilities of LLMs (Liu et al., 2023; Pu et al., 2023). We call this
process backward construction. After obtaining synthesized instruction-trajectory pairs and filtering
low-quality ones, we apply them to both training and ICL, where we craft innovative retrieval pipelines
optimized for agents. Specifically, the approach comprises two components: (1) a model-based
approach where LLMs generate queries guided by instructions, interaction histories, and current
observations, followed by retrieval models selecting demonstration examples from synthesized data;
and (2) an observation-based approach that identifies examples in which the current observation
appears in trajectories, signaling that the current state was encountered during the data synthesis
process.
Our comprehensive evaluations across four challenging benchmarks: SWE-bench (Jimenez et al.,
2023), WebArena (Zhou et al., 2023b), OSWorld (Xie et al., 2024), and Spider2-V (Cao et al., 2024),
highlight the efficacy of the data generated by Learn-by-interact. With ICL, both
Gemini-1.5-pro (Reid et al., 2024) and Claude-3.5-sonnet (Anthropic, 2024) show consistent and remarkable
improvements: for OSWorld (Xie et al., 2024), our generated data nearly doubles Claude-3.5-sonnet’s
baseline performance, increasing it from 12.4% to 22.5%. This enables Learn-by-interact
to achieve best-in-class performance on all four leaderboards. Furthermore, substantial
improvements are observed by training models of varying sizes and architectures with our synthesized
data. As an example, Codestral-22B’s (Team, 2024b) performance on WebArena increases significantly
from 4.7% to 24.2% after training. These results underscore the high quality of the generated agentic
data and the broad applicability across diverse environments.
Our extensive ablation studies reveal that backward construction not only increases the quantity
of the synthesized data, but also improves its overall quality (§3.5). With data synthesized by
Learn-by-interact, we observe significant improvements in both performance and efficiency during
LLM inference (§4.1). Our empirical results demonstrate the superiority of the agentic retrieval
in ICL (§4.2). We anticipate that this research will spark innovative developments in enhancing
agentic performance using LLMs and contribute to their wider adoption in real-world application
scenarios.
2 Learn-by-interact
We introduce the proposed Learn-by-interact framework to synthesize agent data in an
autonomous way by leveraging interactions between LLMs and environments. We first formalize
the canonical agentic tasks (§2.1), and introduce the detailed synthesis (§2.2) and filtering (§2.3)
procedures. We then describe the application of the synthesized data in adapting LLMs in both
training-free and training-based settings (§2.4).
2.1 Task formulation
Given an environment $E$ and a task instruction $I$, the objective of an agent $A$ is to achieve the target
$G$ through multi-step interactions with $E$. At each step $i$, $A$ predicts the next action $a_i$ based on
the instruction $I$ and the previous history $H = (o_0, a_1, o_1, a_2, \ldots, o_{i-1})$, which is then executed in the
environment $E$ to get a new observation $o_i$. The interaction terminates when $A$ predicts the action
$\mathit{stop}$ or the maximum number of steps $m$ is reached.
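The interaction loop in this formulation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyEnv`, `run_episode`, `toy_policy` and the `STOP` constant are hypothetical names introduced here, and the policy is a hand-written placeholder for the LLM agent $A$.

```python
from typing import Callable, List

STOP = "stop"  # hypothetical name for the terminating action


class ToyEnv:
    """Hypothetical stand-in for an environment E: a counter that actions can raise."""

    def __init__(self) -> None:
        self.value = 0

    def observe(self) -> str:
        return f"value={self.value}"

    def step(self, action: str) -> None:
        if action == "increment":
            self.value += 1


def run_episode(env: ToyEnv, instruction: str,
                policy: Callable[[str, List[str]], str], max_steps: int) -> List[str]:
    """Roll out (o_0, a_1, o_1, ..., a_n, o_n) until `stop` or m steps are reached."""
    history = [env.observe()]                   # o_0
    for _ in range(max_steps):                  # at most m actions
        action = policy(instruction, history)   # a_i predicted from I and H
        if action == STOP:
            break
        env.step(action)                        # execute a_i in E
        history += [action, env.observe()]      # record a_i and the new o_i
    return history


def toy_policy(instruction: str, history: List[str]) -> str:
    """Placeholder for the LLM: increment until the last observation is value=2."""
    return STOP if history[-1] == "value=2" else "increment"


trajectory = run_episode(ToyEnv(), "Raise the counter to 2", toy_policy, max_steps=10)
```

The returned list alternates observations and actions, which is the layout assumed for trajectories in the rest of this section.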
2.2 Agentic data synthesis
The essential idea of Learn-by-interact is manifested in synthesizing environment-specific
agent data with zero human effort. In Algorithm 1, we show the overall process with pseudo-code.
Given an environment for a downstream application (e.g., Visual Studio Code), we first leverage
commonly-accessible resources such as documentation to generate diverse task instructions using
self-instruct (Wang et al., 2022b) (line 5). These resources are usually created by human experts to
address common concerns and provide usage suggestions, e.g., how to navigate a website or operate
a piece of software. Intuitively, such references often cover representative use cases of an application.
Therefore, the task instructions generated conditioned on them can cover the most popular scenarios
in the domain and avoid potentially unreasonable cases that may be of less value.
Algorithm 1 Agent data synthesis
 1: Input: LLM: large language model; E: environment; Doc: standard resources like documentation;
    N: the number of instructions to generate per document; F: data filter.
 2: Initialization: D = []: synthesized data.
 3: for d in Doc do
 4:     // self-instruct to generate N task instructions
 5:     Instructions = LLM(d, N)
 6:     for I in Instructions do
 7:         E.reset()
 8:         T = []  // initialize interaction trajectory
 9:         while not E.finished() do
10:             o = E.get_observation()
11:             a = LLM(I, T, o)
12:             T += [o, a]
13:         end while
14:         T.append(E.get_observation())
15:         // backward construction
16:         for i in range(0, len(T) - 1, 2) do
17:             for j in range(i + 2, len(T), 2) do
18:                 T' = T[i:j]
19:                 I' = LLM(T')
20:                 D.append([I', T'])
21:             end for
22:         end for
23:     end for
24: end for
25: D = F(D)  // filter low-quality data
26: Return: D
For each generated task, LLMs then aim
to solve it, which results in a long trajectory
$T = (o_0, a_1, o_1, \ldots, a_n, o_n)$ (lines 9-14 in Algorithm 1). To address the potential misalignment
between the instruction $I$ and the generated trajectory $T$, we introduce a novel mechanism,
backward construction, to construct instructions based on trajectories (lines 15-22 in Algorithm 1).
Specifically, for each sub-trajectory
$T' = (o_i, a_{i+1}, o_{i+1}, \ldots, a_j, o_j)$, $0 \le i < j \le n$, we
obtain two types of new instructions: (1) summaries of trajectory steps; and (2) abstractions
of the trajectory purpose. In Fig. 1, the sub-trajectory (Obs1, Act2, Obs2) is summarized into
a new task instruction that asks the agent to replicate Act2. The abstraction of the full trajectory
updates the original task objective, which is no longer aligned with the generated trajectory due
to the wrong prediction in Act3. Overall, the Learn-by-interact pipeline offers two
notable advantages: (1) It corrects the potential misalignment between instructions and predicted
trajectories by updating task objectives, which enhances the data quality as verified by
the experimental results in §3.5. (2) It maximizes the utility of each generated trajectory by crafting
new instructions for each sub-trajectory. This results in a quadratic increase in the number of
synthesized examples with respect to the number of steps in each generated trajectory. For a given
target dataset size, backward construction substantially decreases the number of necessary interactions,
which is particularly valuable in scenarios where such interactions are challenging and costly to obtain,
such as robotics (Keipour, 2022).
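The enumeration of sub-trajectories can be illustrated with a short Python sketch. It follows the definition $T' = (o_i, \ldots, o_j)$ given in the text, so every slice starts and ends on an observation. The function names and the `toy_summarizer` placeholder are assumptions made for illustration; in the paper the new instruction is written by an LLM.

```python
from typing import Callable, List, Tuple

Trajectory = List[str]  # alternating [o_0, a_1, o_1, ..., a_n, o_n]


def backward_construct(
    trajectory: Trajectory,
    write_instruction: Callable[[Trajectory], str],
) -> List[Tuple[str, Trajectory]]:
    """Pair every contiguous sub-trajectory (o_i, ..., o_j), i < j, with a new instruction."""
    examples = []
    last = len(trajectory) - 1                  # index of o_n
    for start in range(0, last, 2):             # even indices hold observations
        for end in range(start + 2, last + 1, 2):
            sub = trajectory[start:end + 1]     # slice includes the final observation
            examples.append((write_instruction(sub), sub))
    return examples


def toy_summarizer(sub: Trajectory) -> str:
    """Placeholder for the LLM call that summarizes or abstracts a sub-trajectory."""
    return "Replicate the following: " + ", then ".join(sub[1::2])


full = ["Obs0", "Act1", "Obs1", "Act2", "Obs2", "Act3", "Obs3"]  # n = 3 actions
data = backward_construct(full, toy_summarizer)
num_actions = len(full) // 2
```

For a trajectory with $n$ actions this produces $n(n+1)/2$ instruction-trajectory pairs (six for the three-action example), which is the quadratic growth described above.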
2.3 Filtering
To further enhance the data quality, we design the following criteria to filter inferior synthesized data:
(1) Remove duplicate states: We remove duplicate $(a_i, o_i)$ from $T'$ if $(a_i, o_i) = (a_{i-1}, o_{i-1})$, which is
potentially introduced by an invalid action or an environment error (inactivity). (2) LLM committee
check: We feed the generated instruction-trajectory pair $(I', T')$ into a committee of LLMs, and only
classify it as high quality if all LLMs consider the trajectory coherent, natural, reasonable and aligned
with the instruction. The listed criteria are all fully autonomous and canonically applicable for filtering
data synthesized in general agent scenarios. Additionally, we employ iterative prompting to augment
LLMs with high-quality examples to enhance their capabilities in data generation. See Table 31 for
our prompts used in LLM committee check.
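The two criteria can be sketched as follows. This is an illustrative outline rather than the authors' code: the `Judge` callables are hypothetical stand-ins for the committee LLMs, and applying de-duplication before the committee check is an assumption, since the text does not specify an order.

```python
from typing import Callable, List, Tuple

Trajectory = List[str]                       # [o_0, a_1, o_1, ..., a_n, o_n]
Judge = Callable[[str, Trajectory], bool]    # hypothetical committee member


def drop_duplicate_states(trajectory: Trajectory) -> Trajectory:
    """Remove (a_i, o_i) when it equals the preceding pair (a_{i-1}, o_{i-1})."""
    kept = trajectory[:1]                    # always keep o_0
    previous = None
    for k in range(1, len(trajectory) - 1, 2):
        pair = (trajectory[k], trajectory[k + 1])
        if pair != previous:
            kept.extend(pair)
        previous = pair
    return kept


def committee_filter(
    examples: List[Tuple[str, Trajectory]], judges: List[Judge]
) -> List[Tuple[str, Trajectory]]:
    """Keep an example only if every judge accepts the cleaned pair (unanimous vote)."""
    kept = []
    for instruction, trajectory in examples:
        cleaned = drop_duplicate_states(trajectory)
        if all(judge(instruction, cleaned) for judge in judges):
            kept.append((instruction, cleaned))
    return kept


# Toy judges: real ones would prompt LLMs about coherence and alignment.
judges = [lambda ins, t: bool(ins.strip()), lambda ins, t: len(t) >= 3]
examples = [
    ("Open the dataset page",
     ["home", "click dataset", "dataset page", "click dataset", "dataset page"]),
    ("", ["home", "click", "page"]),
]
filtered = committee_filter(examples, judges)
```

The unanimity requirement (`all(...)`) mirrors the rule that a pair is kept only if all committee members consider it high quality.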
2.4 Adaptation
Algorithm 2 ICL with agentic retrieval
 1: Input: LLM: large language model; E: environment; D: synthesized data; BM25: BM25 retrieval
    model; RM: dense retriever; I: task instruction; m1: maximum number of examples from
    observation-based retrieval; m2: maximum number of examples from model-based retrieval.
 2: Initialization: H = []: interaction history; R: retrieved examples.
 3: while not E.finished() do
 4:     o = E.get_observation()
 5:     // observation-based retrieval
 6:     R = BM25(o, D, m1)
 7:     // model-based retrieval
 8:     q = LLM(I, H, o)
 9:     R += RM(q, D, m2, R)
10:     a = LLM(I, H, o, R)
11:     H += [o, a]
12: end while
After obtaining the synthesized data $D$, we
apply it to both ICL and training. Given the
unique characteristics of multi-round interactions with environments in agent settings, we
design agentic retrieval (pseudo-code in Algorithm 2) to maximize the effectiveness of the
synthesized data. Specifically, we propose two
retrieval pipelines: observation-based (lines 5-6) and model-based retrieval (lines 7-9). In
observation-based retrieval, we compare the
current observation $o$ to the trajectory of each
example $e$ in the synthesized data, where
$e = [I', [o_0, a_1, o_1, \ldots, a_n, o_n]]$. If $o$ matches one of
the observations in $e$, i.e., $o = o_i$, then we consider
$e$ a helpful example for the current task.
For the model-based retrieval, we leverage LLMs
to first write queries based on the instruction, the interaction history and the current observation
(line 8), and then employ retrieval models to retrieve non-duplicate examples (line 9). LLMs are
then augmented with the retrieved examples to predict the next action (line 10). Refer to Tables 32 to
35 for prompts to write queries and predict actions.
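One retrieval step of this procedure can be sketched in Python as below. The sketch makes several simplifying assumptions that are not from the paper: observation matching is exact string equality (Algorithm 2 uses BM25 for this step), `token_overlap` is a toy scorer standing in for the dense retriever, and `write_query` is a placeholder for the LLM that writes queries. All function names are hypothetical.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, List[str]]  # (instruction I', trajectory [o_0, a_1, ..., o_n])


def observation_based(obs: str, data: List[Example], m1: int) -> List[Example]:
    """Keep up to m1 examples whose trajectory contains the current observation."""
    hits = [ex for ex in data if obs in ex[1][0::2]]   # even indices are observations
    return hits[:m1]


def token_overlap(query: str, text: str) -> int:
    """Toy relevance score standing in for a dense retriever."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def model_based(query: str, data: List[Example], m2: int,
                already: List[Example]) -> List[Example]:
    """Keep up to m2 top-scoring examples that were not retrieved already."""
    pool = [ex for ex in data
            if ex not in already and token_overlap(query, ex[0]) > 0]
    ranked = sorted(pool, key=lambda ex: token_overlap(query, ex[0]), reverse=True)
    return ranked[:m2]


def agentic_retrieve(instruction: str, history: List[str], obs: str,
                     data: List[Example],
                     write_query: Callable[[str, List[str], str], str],
                     m1: int = 5, m2: int = 5) -> List[Example]:
    """Combine observation-based and model-based retrieval for one step."""
    retrieved = observation_based(obs, data, m1)
    query = write_query(instruction, history, obs)     # an LLM call in the paper
    retrieved += model_based(query, data, m2, retrieved)
    return retrieved


data = [
    ("Create a BigQuery table",
     ["Dataset page", "click create table", "Table creation page"]),
    ("Upload a CSV file to Google Drive",
     ["Drive home", "click upload", "Upload dialog"]),
    ("Rename a dataset", ["BigQuery home", "open menu", "Rename dialog"]),
]
picked = agentic_retrieve("Link CSV file to BigQuery", [], "Dataset page", data,
                          write_query=lambda i, h, o: i, m1=1, m2=1)
```

Because retrieval is repeated at every step with the latest history and observation, the demonstrations can change as the agent moves through the environment, unlike retrieval keyed on the instruction alone.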
Apart from using the synthesized data as demonstration examples in ICL, we further utilize
them to fine-tune models. For a given generated example, we convert it to the format of action
prediction (Table 32), and prepare input-output pairs for supervised fine-tuning. More details on the
experimental settings can be found in §3.3.
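A possible conversion of one synthesized example into supervised fine-tuning pairs is sketched below. The exact prompt format is defined in Table 32 and is not reproduced here, so the template string, the choice of one pair per action, and the name `to_sft_pairs` are all assumptions made for illustration.

```python
from typing import Dict, List


def to_sft_pairs(instruction: str, trajectory: List[str]) -> List[Dict[str, str]]:
    """Build one (input, output) pair per action: predict a_k from I' and the history before it."""
    pairs = []
    for k in range(1, len(trajectory), 2):        # odd indices hold actions
        history = "\n".join(trajectory[:k])       # everything observed and done before a_k
        prompt = f"Instruction: {instruction}\nHistory:\n{history}\nNext action:"
        pairs.append({"input": prompt, "output": trajectory[k]})
    return pairs


pairs = to_sft_pairs("Create a table", ["Obs0", "Act1", "Obs1", "Act2", "Obs2"])
```

Each pair asks the model to predict the next action from the instruction and the preceding history, matching the action-prediction setting of §2.1.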
3 Experiments
3.1 Baselines
We compare ICL with agentic retrieval to the following prompt-based approaches.
• Baseline: The vanilla prediction pipeline in each benchmark that includes the task instruction,
interaction history and the state observation in the prompt. See more implementation details in
Appendix A.
• RAG: The conventional RAG pipeline that first retrieves from the resources like documentation
based on the instruction, and augments LLMs with the retrieved content.
• Data distill: We follow the same pipeline to synthesize data in Algorithm 1 except backward
construction (replace lines 15-22 with D.append(I, T)), and follow Algorithm 2 during the evaluation.
• Reflexion (Shinn et al., 2024): A general framework to reinforce language agents through linguistic
feedback from both executors and LLMs.
• Language Agent Tree Search (LATS) (Zhou et al., 2023a): It integrates combinatorial tree
search into expanding ReAct (Yao et al., 2022b) and combines agent online reasoning, acting and
planning throughout the trajectory.
For the training-based evaluation, we primarily compare to data distillation, which also constructs
data from scratch and requires no human effort to annotate seed or preference data. Additionally, we
include the model performance before training as another baseline.
                     SWE-bench   WebArena   OSWorld   Spider2-V
Documents                6,464      3,578     7,362      11,231
Raw trajectories         4,568      3,967     1,125       1,226
Examples                41,237     32,319    19,688      21,525
Filtered examples       10,232     10,456    11,782      10,169
Table 1 | Statistics for the number of crawled documents, generated raw trajectories, examples
(instruction-trajectory pairs) and examples after filtering.
3.2 Datasets
We consider four agentic datasets that involve multi-round interactions with realistic environments.
They span diverse domains of code, web, computer desktop and professional software. Appendix B
illustrates details of each dataset with examples.
• SWE-bench (Jimenez et al., 2023) is an evaluation benchmark on realistic software engineering
problems drawn from real GitHub issues. We use the verified version by default throughout the
experiments.
• WebArena (Zhou et al., 2023b) evaluates agent capabilities to perform tasks in web environments
such as e-commerce, social forum discussion, and beyond.
• OSWorld (Xie et al., 2024) is an integrated environment for assessing open-ended computer tasks,
which involve diverse applications like Terminal, Chrome, etc.
• Spider2-V (Cao et al., 2024) is a multimodal agent benchmark focusing on professional data science
and engineering workflows, which includes BigQuery, Airbyte and more.
3.3 Settings
We synthesize one separate set of environment-specific data for each evaluated benchmark. Through-
out the data synthesis process, we employ the Claude-3.5-sonnet (Anthropic, 2024) as the generator
model and both Gemini-1.5-pro (Reid et al., 2024) and Claude-3.5-sonnet as the LLM committee for
filtering low-quality data. For each document, we sample three task instructions from LLMs. The
statistics for generated raw trajectories, examples before and after filtering are shown in Table 1. In
Appendix D, we list document sources used for each benchmark. During ICL, we retrieve examples
until the maximum context length of the LLM is reached and set an upper bound of 5 for both model-based
and observation-based retrieval ($m_1 = 5$, $m_2 = 5$ in Algorithm 2). We leverage Gemini-1.5-pro (Reid et al., 2024) and
Claude-3.5-sonnet (Anthropic, 2024)¹, Codegemma-7B (Team, 2024a) and Codestral-22B (Team,
2024b) in the ICL evaluation, and tune Codegemma-7B and Codestral-22B with LoRA (Hu et al.,
2021) to evaluate the data quality as training sources. By default, we do not include retrieval content
in evaluating the trained model to avoid the confusion in understanding the effectiveness of our
synthesized data in training. We include more detailed hyper-parameter settings (both existing
approaches and Learn-by-interact ) and machine information in Appendix C.
¹ In the subsequent descriptions, Gemini refers to Gemini-1.5-pro, and Claude refers to Claude-3.5-sonnet.
                      Gemini-1.5-pro                     Claude-3.5-sonnet
Approach              SWE    Web    OS     Spider2-V     SWE    Web    OS     Spider2-V
Existing approaches
Baseline              13.3   17.9   4.9    8.3           51.2   35.8   12.4   8.4
RAG                   13.7   19.5   5.1    9.1           51.8   36.9   12.8   9.2
Data distill          14.0   19.8   5.7    9.1           54.0   39.2   12.9   9.7
Reflexion             14.3   20.2   5.7    9.3           54.4   40.4   15.6   10.5
LATS                  15.3   21.0   6.5    11.3          55.2   41.3   16.8   11.2
Ours
Learn-by-interact     18.7   25.6   10.3   16.4          60.0   48.0   22.5   16.6
Δ over baseline       +5.4   +7.7   +5.4   +8.1          +8.8   +12.2  +10.1  +8.2
Table 2 | Comparison of Learn-by-interact to other existing training-free approaches. SWE
refers to SWE-bench, Web refers to WebArena and OS refers to OSWorld. The best results are
highlighted in bold.
3.4 Evaluation
We follow the default evaluation metrics designed by the original benchmarks. On SWE-bench (Jimenez
et al., 2023), we apply the generated patch program to the repository codebase, and measure the
agent performance by execution accuracy (pass@1). On WebArena (Zhou et al., 2023b), we employ
both LLM-based fuzzy match and string match that checks keywords in predictions. Slightly different
from the original work that uses gpt-4-0613 as the LLM judge, we use Claude-3.5-sonnet as a similar
replacement. On OSWorld (Xie et al., 2024), we leverage the sample-specific evaluation scripts to
assess the functional correctness of the task completion, which process environment states and
check if agents finish the task as expected. On Spider2-V (Cao et al., 2024), we utilize file-based
comparison, information-based validation and execution-based verification to determine whether a task is
successfully completed. All performance numbers throughout the paper are percentages
of resolved instances, with % omitted for brevity.
3.5 Results
3.5.1 Training-free Evaluation
We first consider Learn-by-interact in the training-free setting, where the proposed methods
can be applied to the commercial LLMs even with prediction-only API access.
The results in Table 2 show only marginal improvement of RAG over the baseline, which suggests
limited effectiveness of simply concatenating standard resources to LLM prompts. By retrieving
examples from distilled data, we observe better performance compared to RAG, but still no more than
2% improvement over the baseline, which indicates that the distilled data tend to be noisy in the
setting with multi-round agent-environment interactions. This highlights the critical role of backward
construction, which corrects the misalignment between instructions and trajectories by curating new
task objectives.
Both Reflexion and LATS consistently improve over the baseline across the four benchmarks, which
demonstrates their general applicability to agent tasks. Using the data synthesized by
Learn-by-interact, we see a significant performance gain compared to all other frameworks with both
Gemini and Claude. For example, on OSWorld, augmenting Claude with synthesized environment-
specific data almost doubles the result compared to the baseline. This signifies the high quality of the
generated data and the effectiveness of the Learn-by-interact framework.
                      Before tuning                       After tuning
                      Codegemma-7B    Codestral-22B       Codegemma-7B    Codestral-22B
Approach              Web    OS       Web    OS           Web    OS       Web    OS
Existing approaches
Baseline              3.3    0.0      4.7    2.2          -      -        -      -
Data distill          4.2    0.0      5.8    2.7          6.2    1.4      10.2   5.4
Ours
Learn-by-interact     7.6    3.5      9.9    5.4          14.6   6.5      24.2   11.7
Δ over baseline       +4.3   +3.5     +5.2   +3.2         +11.3  +6.5     +19.5  +9.5
Table 3 | Downstream task performance of models trained from data generated by Learn-by-interact
and data distillation. We include the model results before training, where the synthesized data
is used as demonstration examples, and after training, where the synthesized data is used to train
models.
3.5.2 Training-based Evaluation
We consider the data synthesized by Learn-by-interact in the scenario of LLM tuning, which is
applicable to the LLMs with access to weight updates.
The results presented in Table 3 reveal that Learn-by-interact substantially surpasses both
the baseline and data distillation, suggesting its capacity to generate high-quality training data that
enables language models to learn and adapt efficiently. We discover that utilizing our synthesized data
for model training yields better results compared to using it as in-context learning (ICL) examples.
A notable instance is in WebArena, where Codestral-22B’s performance jumps from 4.7% to 24.2%
when trained on our synthesized data, while only showing a 5.2% improvement in the ICL scenario.
Remarkably, the Codestral-22B model trained with our synthesized data even outperforms Gemini
when the latter uses our data as demonstration examples.
4 Analysis
4.1 Inference Efficiency
We compare the efficiency of different pipelines at inference. We analyze the trade-off between
downstream task performance and the required computational costs. We focus on measuring the
number of LLM calls and consumed tokens per example, which are averaged across four evaluated
datasets (§3.2) using Claude-3.5-sonnet. As illustrated in Fig. 2, while Reflexion and LATS demonstrate
enhanced performance, this comes at the cost of significantly increased computational resources
during inference. Specifically, LATS achieves an average improvement of 2.5%, albeit at the cost
of requiring nearly four times more tokens per instance compared to the baseline. In contrast,
Learn-by-interact exhibits superior performance while utilizing fewer LLM calls and slightly
more tokens compared to the baseline. Thanks to the rich environment information stored in the
examples of synthesized data, LLMs can potentially make better decisions and thus finish the task
in fewer steps. This removes the performance-efficiency trade-off during inference at the cost of
data synthesis in advance and suggests that Learn-by-interact is particularly well-suited for
real-world deployment that demands both low latency and high performance.
[Figure 2 plot: three panels (Performance, LLM calls, Consumed tokens) comparing Baseline, RAG,
Data distill, Reflexion, LATS and Learn-by-interact.]
Figure 2 | Evaluation performance, the number of LLM calls and consumed tokens (per example)
of various training-free pipelines during inference, which are averaged across four benchmarks:
SWE-bench, WebArena, OSWorld and Spider2-V.
                      Gemini-1.5-pro                     Claude-3.5-sonnet
Retrieval             SWE    Web    OS     Spider2-V     SWE    Web    OS     Spider2-V
No retrieval          13.3   17.9   4.9    8.3           51.2   35.8   12.4   8.4
Instruction-based     14.7   21.6   7.0    10.2          52.4   36.6   15.0   9.6
Observation-based     16.3   23.5   8.7    14.6          53.6   42.5   17.2   10.5
Model-based           17.0   24.3   9.5    15.4          57.8   44.8   20.3   13.7
Ours                  18.7   25.6   10.3   16.4          60.0   48.0   22.5   16.6
Table 4 | Model performance based on different retrieval paradigms. Observation-based and Model-
based retrieval prove to be particularly effective in agent tasks, whose combination (ours) gives the
best results.
4.2 The Impact of Retrieval
As mentioned in §2.4, we employ both model-based and observation-based retrieval in our evaluation
with ICL. We analyze their effectiveness by incorporating only one of them (skip lines 5-6 in
Algorithm 2 for model-based retrieval only and skip lines 7-9 for observation-based retrieval only).
In addition, we compare to two baselines: (1) no retrieval: LLMs predict each action in the zero-shot
setting; and (2) instruction-based: only use instructions to retrieve synthesized data and apply the
same demonstration examples in every action prediction throughout the trajectory.
The results presented in Table 4 illustrate how various retrieval methods impact LLMs when
using the synthetic data as the retrieval source. Despite having access to the same example pool
(except the baseline without retrieval), there are notable differences in performance across
different retrieval strategies, highlighting the crucial role of agentic retrieval in effectively
utilizing synthesized data. Conventional Retrieval-Augmented Generation (RAG) methods, which only
employ instructions for retrieval, show the least improvement across four benchmarks and two
LLMs. In contrast, the observation-based approach proves particularly effective for agent-based tasks,
significantly outperforming the instruction-based retrieval, for instance, achieving a 4.4% absolute
improvement on Spider2-V when using Gemini. By leveraging task instructions, interaction history
and the current observation, model-based retrieval demonstrates even better results compared to
using the observation-based version. Ultimately, the most impressive scores are achieved by combining
both model-based and observation-based retrieval, which results in our agentic retrieval pipeline.
These findings underscore the importance of carefully designing retrieval pipelines to maximize the
potential of synthetic data and LLMs in agent scenarios.
                      Claude-3.5-sonnet                  Codestral-22B
Granularity           SWE    Web    OS     Spider2-V     Web    OS
Baseline              51.2   35.8   12.4   8.4           4.6    2.2
Short                 54.2   39.4   17.9   10.8          13.5   4.9
Medium                53.6   38.8   16.6   9.7           12.6   4.0
Long                  52.2   37.6   15.2   9.2           10.6   3.4
Short+Medium          54.6   41.2   18.8   11.3          14.6   5.7
Short+Long            54.0   40.5   17.8   10.7          14.4   5.3
Medium+Long           53.8   38.6   17.2   10.4          13.2   4.5
Short+Medium+Long     55.0   42.0   19.8   12.3          15.4   6.3
Table 5 | Effectiveness of synthetic data with various granularity. In general, short-trajectory data is
more advantageous to both training and ICL, while mixing all of short, medium and long-trajectory
data provides the best performance.
4.3 Data granularity
As mentioned in §2.2, we synthesize data by taking contiguous sub-trajectories from the full generation
paths of LLMs, i.e., $T' = T[i:j]$, which results in trajectories of diverse lengths in the synthesized
data. We divide the synthetic data into three groups: (1) trajectory steps $< 5$ (short); (2) $5 \le$
trajectory steps $< 10$ (medium); (3) trajectory steps $\ge 10$ (long), and leverage each group and their
combinations in both the training-free and the training-based process. To ensure a fair comparison, we
constrain the data size in each group and combined group to 200M tokens², utilizing Su et al. (2022)
for sub-sampling. Table 5 presents the results. In both training-free and training-based evaluation,
LLMs derive greater advantages from short-trajectory data, as demonstrated by its consistently
superior performance compared to medium and long-trajectory data with Claude-3.5-sonnet and
Codestral-22B. This can be attributed to the versatility of short-trajectory data, which usually serves
as a sub-step or a partial workflow in downstream tasks. The combination of any two data groups
proves more effective than relying on a single group, showcasing the complementary nature of diverse
data sets. For instance, in WebArena with Codestral-22B, incorporating examples with both short and
medium-length trajectories shows additional improvement over using either one exclusively. This
underscores the value of considering the trajectory length as a unique dimension of agentic data
synthesis.
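The grouping by trajectory length can be sketched as below. The thresholds (fewer than 5, 5 to 9,
and 10 or more steps) come from the text; the function names, the choice to count one action as
one step, and the minimum window length of 1 are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def sub_trajectories(steps: Sequence[str], min_len: int = 1) -> List[Tuple[str, ...]]:
    """Enumerate all contiguous windows T[i:j] with at least min_len steps."""
    n = len(steps)
    return [tuple(steps[i:j]) for i in range(n) for j in range(i + min_len, n + 1)]


def length_group(num_steps: int) -> str:
    """Map a step count to the short / medium / long groups used in Table 5."""
    if num_steps < 5:
        return "short"
    if num_steps < 10:
        return "medium"
    return "long"


def group_by_length(trajectories: Sequence[Sequence[str]]) -> Dict[str, List[Sequence[str]]]:
    """Bucket trajectories into the three length groups."""
    groups: Dict[str, List[Sequence[str]]] = {"short": [], "medium": [], "long": []}
    for trajectory in trajectories:
        groups[length_group(len(trajectory))].append(trajectory)
    return groups


full_path = [f"action_{i}" for i in range(12)]  # hypothetical 12-step generation path
groups = group_by_length(sub_trajectories(full_path))
sizes = {name: len(items) for name, items in groups.items()}
```

A single long generation path yields many more short windows than long ones, which is one reason
the comparison in the text fixes the token budget per group rather than the number of examples.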
² We use the number of tokens to measure the data size because a long-trajectory example may contain
more information than a short one.
4.4 Scaling Laws
We examine how the model performance improves as the synthetic data size scales up. Figure 3
presents two sets of results: training-free (where Claude, Gemini, Codegemma and Codestral use
retrieval augmentation without training) and training-based (where fine-tuned Codegemma and
Codestral models are evaluated without retrieval). All results are averaged across WebArena and
OSWorld due to computational resource constraints. The findings indicate that both learning
paradigms benefit from larger data, suggesting the synthetic data is diverse and high-quality. In
the training-free evaluation, more substantial improvements are observed for larger models (Claude
and Gemini) compared to smaller ones (Codegemma and Codestral), possibly due to their enhanced
in-context learning abilities. Our analysis also reveals that for a given amount of synthetic data,
fine-tuning smaller models is more effective than using the data as demonstration examples during
evaluation.
[Figure 3: performance (y-axis) against synthesized data size from 0 to 10k (x-axis) for
Claude-3.5-sonnet, Gemini-1.5-pro, Codegemma-7B, Codegemma-7B-trained, Codestral-22B and
Codestral-22B-trained.]
Figure 3|Scaling laws for the synthesized data. Compared to in-context learning, tuning achieves
more significant improvements as the data scales up. The performance is averaged across WebArena
and OSWorld.
5 Related work
Various agents based on LLMs have been devel-
oped (Huang et al., 2022; Shinn et al., 2024; Wang et al., 2023a, 2024a, 2023b; Zhang et al., 2024).
ReAct (Yao et al., 2022b) proposes to synergize reasoning and acting in LLMs. By integrating Monte
Carlo Tree Search (Coulom, 2006; Kocsis and Szepesvári, 2006), Zhou et al. (2023a) leverages LLM-
powered value functions and self-reflection (Madaan et al., 2024) to encourage proficient exploration
and decision-making. However, it comes with increased computational costs and relies on the premise
that the environment allows for state reversals. In contrast, Learn-by-interact removes such
assumptions and improves both agent efficiency and performance by synthesizing high-quality data
in advance.
Another line of research to improve agent models relies on training on human-labeled examples
(Chen et al., 2024b; Deng et al., 2024; Wang et al., 2022a; Yin et al., 2023; Zeng et al., 2023) or
data distilled from LLMs like GPT-4 (Chen et al., 2023; Zhao et al., 2024). AgentGen (Hu et al., 2024)
explores automatic synthesis of both environments and tasks and then leverages FastDownward³ to
generate trajectory data. AgentTuning (Zeng et al., 2023) utilizes both existing datasets and
self-instruct (Wang et al., 2022b) to derive instructions and then samples trajectories from GPT-4
(Achiam et al., 2023). In contrast, Learn-by-interact focuses on realistic environments and generates
tasks and trajectories using backward construction. Some other researchers are also exploring ways
to use data more efficiently with reinforcement learning (Ball et al., 2023; Nachum et al., 2018;
Schwarzer et al., 2020, 2021; Thomas and Brunskill, 2016). Gulcehre et al. (2023) suggests that
data created by an LLM’s policy can enhance the model itself via offline reinforcement learning
algorithms. Aksitov et al. (2023) takes this further by combining with ReAct (Yao et al., 2022b) to
train agent models iteratively on experience trajectories. These typically require a reward model as
the scoring function or LLM/execution-generated feedback to enhance data quality. Our work, however,
takes a different approach by employing the backward construction to improve the data quality by
aligning instructions and trajectories.
³ https://www.fast-downward.org/
6 Conclusion
We introduce Learn-by-interact, a data-centric framework to adapt LLM agents to any given
environments without human annotations. Based on commonly-accessible resources like documentation,
LLMs propose downstream tasks and complete them with multi-round interactions with environments.
We address the misalignment between instructions and trajectories by updating objectives with new
instructions derived from trajectories. Additionally, we design innovative retrieval approaches that
leverage agent instructions, interaction histories, and current observations to retrieve synthesized
examples. Through extensive experiments, we demonstrate that the synthetic data from
Learn-by-interact significantly enhances model performance with both ICL and training. Compared with
other leading approaches in agent tasks, Learn-by-interact shows much better performance
with lower latency and computational costs, which makes it particularly suitable for large-scale
deployment. Further analysis has also shown the superiority of Learn-by-interact over the classical
RAG. In future work, we plan to explore multi-modal settings and train general agent models widely
applicable in realistic environments. We anticipate that Learn-by-interact will inspire future
research to push the state-of-the-art in this direction.
7 Limitations
Although Learn-by-interact effectively synthesizes high-quality agentic data with trajectories,
it requires a lot of LLM calls in generation and filtering. We hope that future works will explore more
efficient approaches to complete annotations without sacrificing quality. Additionally, Learn-by-
interact leverages the environment-related resources to generate instructions. In some scenarios,
however, these resources may be incomplete or not available.
References
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt,
S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 , 2023.
R. Aksitov, S. Miryoosefi, Z. Li, D. Li, S. Babayan, K. Kopparapu, Z. Fisher, R. Guo, S. Prakash,
P. Srinivasan, et al. Rest meets react: Self-improvement for multi-step reasoning llm agent. arXiv
preprint arXiv:2312.10003 , 2023.
Anthropic. Introducing claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/
claude-3-5-sonnet .
P. J. Ball, L. Smith, I. Kostrikov, and S. Levine. Efficient online reinforcement learning with offline
data. In International Conference on Machine Learning , pages 1577–1594. PMLR, 2023.
R. Cao, F. Lei, H. Wu, J. Chen, Y. Fu, H. Gao, X. Xiong, H. Zhang, Y. Mao, W. Hu, et al. Spider2-v:
How far are multimodal agents from automating data science and engineering workflows? arXiv
preprint arXiv:2407.10956 , 2024.
B. Chen, C. Shu, E. Shareghi, N. Collier, K. Narasimhan, and S. Yao. Fireact: Toward language agent
fine-tuning. arXiv preprint arXiv:2310.05915 , 2023.
D. Chen, S. Lin, M. Zeng, D. Zan, J.-G. Wang, A. Cheshkov, J. Sun, H. Yu, G. Dong, A. Aliev, et al.
Coder: Issue resolving with multi-agent and task graphs. arXiv preprint arXiv:2406.01304 , 2024a.
Z. Chen, K. Liu, Q. Wang, W. Zhang, J. Liu, D. Lin, K. Chen, and F. Zhao. Agent-flan: Designing data
and methods of effective agent tuning for large language models. arXiv preprint arXiv:2403.12881 ,
2024b.
R. Coulom. Efficient selectivity and backup operators in monte-carlo tree search. In International
conference on computers and games , pages 72–83. Springer, 2006.
X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su. Mind2web: Towards a
generalist agent for the web. Advances in Neural Information Processing Systems , 36, 2024.
A. Drouin, M. Gasse, M. Caccia, I. H. Laradji, M. Del Verme, T. Marty, D. Vazquez, N. Chapados,
and A. Lacoste. WorkArena: How capable are web agents at solving common knowledge work
tasks? In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp,
editors, Proceedings of the 41st International Conference on Machine Learning , volume 235 of
Proceedings of Machine Learning Research , pages 11642–11662. PMLR, 21–27 Jul 2024. URL
https://proceedings.mlr.press/v235/drouin24a.html .
C. Gulcehre, T. L. Paine, S. Srinivasan, K. Konyushkova, L. Weerts, A. Sharma, A. Siddhant, A. Ahern,
M. Wang, C. Gu, et al. Reinforced self-training (rest) for language modeling. arXiv preprint
arXiv:2308.08998 , 2023.
I. Gur, H. Furuta, A. Huang, M. Safdari, Y. Matsuo, D. Eck, and A. Faust. A real-world webagent with
planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856 ,
2023.
E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank
adaptation of large language models. arXiv preprint arXiv:2106.09685 , 2021.
M. Hu, P. Zhao, C. Xu, Q. Sun, J. Lou, Q. Lin, P. Luo, S. Rajmohan, and D. Zhang. Agentgen: Enhancing
planning abilities for large language model based agent via environment and task generation. arXiv
preprint arXiv:2408.00764 , 2024.
W. Huang, P. Abbeel, D. Pathak, and I. Mordatch. Language models as zero-shot planners: Extracting
actionable knowledge for embodied agents. In International conference on machine learning , pages
9118–9147. PMLR, 2022.
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can
language models resolve real-world github issues? arXiv preprint arXiv:2310.06770 , 2023.
A. Keipour. Physical interaction and manipulation of the environment using aerial robots. arXiv
preprint arXiv:2207.02856 , 2022.
L. Kocsis and C. Szepesvári. Bandit based monte-carlo planning. In European conference on machine
learning , pages 282–293. Springer, 2006.
J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. C. Lim, P.-Y. Huang, G. Neubig, S. Zhou, R. Salakhutdinov,
and D. Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. arXiv
e-prints, pages arXiv–2401, 2024.
Y. Li, J. He, X. Zhou, Y. Zhang, and J. Baldridge. Mapping natural language instructions to mobile ui
action sequences. arXiv preprint arXiv:2005.03776 , 2020.
Y. Liu, K. Shi, K. S. He, L. Ye, A. R. Fabbri, P. Liu, D. Radev, and A. Cohan. On learning to summarize
with large language models as references. arXiv preprint arXiv:2305.14239 , 2023.
A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye,
Y. Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information
Processing Systems , 36, 2024.
O. Nachum, S. S. Gu, H. Lee, and S. Levine. Data-efficient hierarchical reinforcement learning.
Advances in neural information processing systems , 31, 2018.
X. Pu, M. Gao, and X. Wan. Summarization is (almost) dead. arXiv preprint arXiv:2309.09558 , 2023.
M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou,
O. Firat, J. Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of
tokens of context. arXiv preprint arXiv:2403.05530 , 2024.
M. Schwarzer, A. Anand, R. Goel, R. D. Hjelm, A. Courville, and P. Bachman. Data-efficient reinforce-
ment learning with self-predictive representations. arXiv preprint arXiv:2007.05929 , 2020.
M. Schwarzer, N. Rajkumar, M. Noukhovitch, A. Anand, L. Charlin, R. D. Hjelm, P. Bachman, and
A. C. Courville. Pretraining representations for data-efficient reinforcement learning. Advances in
Neural Information Processing Systems , 34:12686–12699, 2021.
N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with
verbal reinforcement learning. Advances in Neural Information Processing Systems , 36, 2024.
H. Su, J. Kasai, C. H. Wu, W. Shi, T. Wang, J. Xin, R. Zhang, M. Ostendorf, L. Zettlemoyer, N. A.
Smith, et al. Selective annotation makes language models better few-shot learners. arXiv preprint
arXiv:2209.01975 , 2022.
C. Team. Codegemma: Open code models based on gemma. arXiv preprint arXiv:2406.11409 , 2024a.
T. M. A. Team. Codestral: Hello, world!, 2024b. URL https://mistral.ai/news/codestral/ .
P. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In
International Conference on Machine Learning , pages 2139–2148. PMLR, 2016.
G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar. Voyager: An
open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291 , 2023a.
R. Wang, P. Jansen, M.-A. Côté, and P. Ammanabrolu. Scienceworld: Is your agent smarter than a
5th grader? arXiv e-prints , pages arXiv–2203, 2022a.
X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji. Executable code actions elicit better
llm agents. arXiv preprint arXiv:2402.01030 , 2024a.
X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li,
R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, B. Hui, J. Lin, R. Brennan, H. Peng,
H. Ji, and G. Neubig. OpenHands: An Open Platform for AI Software Developers as Generalist
Agents, 2024b. URL https://arxiv.org/abs/2407.16741 .
Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self-instruct: Aligning
language models with self-generated instructions. arXiv preprint arXiv:2212.10560 , 2022b.
Z. Wang, S. Cai, G. Chen, A. Liu, X. Ma, and Y. Liang. Describe, explain, plan and select: Interac-
tive planning with large language models enables open-world multi-task agents. arXiv preprint
arXiv:2302.01560 , 2023b.
T. Xie, F. Zhou, Z. Cheng, P. Shi, L. Weng, Y. Liu, T. J. Hua, J. Zhao, Q. Liu, C. Liu, et al. Openagents:
An open platform for language agents in the wild. arXiv preprint arXiv:2310.10634 , 2023.
T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, et al. Osworld:
Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv
preprint arXiv:2404.07972 , 2024.
J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe-agent: Agent-
computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793 ,
2024.
Z. Yang, J. Liu, Y. Han, X. Chen, Z. Huang, B. Fu, and G. Yu. Appagent: Multimodal agents as
smartphone users. arXiv preprint arXiv:2312.13771 , 2023.
S. Yao, H. Chen, J. Yang, and K. Narasimhan. Webshop: Towards scalable real-world web interaction
with grounded language agents. Advances in Neural Information Processing Systems , 35:20744–
20757, 2022a.
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. React: Synergizing reasoning
and acting in language models. arXiv preprint arXiv:2210.03629 , 2022b.
D. Yin, F. Brahman, A. Ravichander, K. Chandu, K.-W. Chang, Y. Choi, and B. Y. Lin. Lumos: Learning
agents with unified data, modular design, and open-source llms. arXiv preprint arXiv:2311.05657 ,
2023.
A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y. Dong, and J. Tang. Agenttuning: Enabling generalized
agent abilities for llms. arXiv preprint arXiv:2310.12823 , 2023.
Z. Zhan and A. Zhang. You only look at screens: Multimodal chain-of-action agents. arXiv preprint
arXiv:2309.11436 , 2023.
J. Zhang, Y. Yu, M. Liao, W. Li, J. Wu, and Z. Wei. Ui-hawk: Unleashing the screen stream understanding
for gui agents. arXiv preprint , 2024.
Z. Zhao, K. Ma, W. Chai, X. Wang, K. Chen, D. Guo, Y. Zhang, H. Wang, and G. Wang. Do we
really need a complex agent system? distill embodied agent into a single model. arXiv preprint
arXiv:2404.04619 , 2024.
A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y.-X. Wang. Language agent tree search
unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406 , 2023a.
S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, et al. Webarena:
A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854 ,
2023b.
A Baseline implementations
We follow the existing frameworks to set up baselines in each benchmark. In SWE-bench (Jimenez
et al., 2023), we follow CodeAct (Wang et al., 2024b), where LLMs interact with environments to
solve problems. In WebArena (Zhou et al., 2023b), we follow the implementation in Drouin et al.
(2024), which concatenates task objectives, action space descriptions, general instructions (e.g.,
output formats) and webpage observations in the prompt, and asks LMs to predict the next action. By
default, we use the accessibility tree⁴ as the observation space. In OSWorld (Xie et al., 2024) and
Spider2-V (Cao et al., 2024), we follow the original prompt style designed by the benchmark, which
also concatenates task objectives, action space descriptions, general instructions and computer
observations in the prompt. By default, we use the accessibility tree as the observation space for
OSWorld, and use the set-of-mark for Spider2-V due to the significant information loss of the
accessibility tree in the original benchmark. See the example in Tables 18 and 19 for more details.
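The prompt layout described above can be sketched as a simple concatenation. The four components
come from the text; the section titles, their order of presentation and the function name
`build_prompt` are illustrative assumptions, not the exact templates of the benchmarks.

```python
def build_prompt(
    objective: str,
    action_space: str,
    general_instructions: str,
    observation: str,
) -> str:
    """Concatenate the four prompt components into one text prompt."""
    sections = [
        ("Task objective", objective),
        ("Action space", action_space),
        ("General instructions", general_instructions),
        ("Current observation", observation),
    ]
    # Each section is rendered as a titled block; blocks are separated by a blank line.
    return "\n\n".join(f"{title}:\n{body.strip()}" for title, body in sections)


prompt = build_prompt(
    objective="Tell me the total cost of my latest cancelled order?",
    action_space="click [id]; type [id] [text]; stop [answer]",  # hypothetical description
    general_instructions="Output exactly one action per turn.",  # hypothetical wording
    observation="[4918] button 'My Account'",  # hypothetical accessibility-tree line
)
```

The model is then asked to predict the next action from this prompt, and the observation block is
refreshed after each environment step.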
B Dataset examples
In Tables 8 to 17, we provide one example for each dataset with the full instruction and the
interaction history with the environment.
C Experimental settings
For RAG, we retrieve documents until the maximum context length of the LLM is reached, with an upper
bound of 50 documents. The retrieved documents remain unchanged throughout the agent interaction
trajectory because only instructions are used as the query for retrieval. For Reflexion (Shinn et al.,
2024), we use a maximum of 3 trials. In LATS (Zhou et al., 2023a), we set the number of generated
actions to 5, the depth limit to 15 and the value function weight to 0.8, following the original
setting in the paper with WebShop (Yao et al., 2022a), which is also a website-based agent task. By
default, we use https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings
as the dense retriever for model-based retrieval. We use temperature 0 throughout the experiments to
ensure better reproducibility. During training, we use a batch size of 128, a learning rate of
0.00002, a warmup ratio of 0.03 and a maximum length of 8192, and tune the model for 3 epochs. All
experiments are conducted on H100 machines with 80GB memory.
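For reference, the settings listed above can be collected in a small configuration sketch. The
values are taken from the text; the class and field names, the helper `schedule`, the rounding of
warmup steps and the 10,000-example dataset size in the usage line are assumptions for illustration
only.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainingConfig:
    batch_size: int = 128
    learning_rate: float = 2e-5  # 0.00002 in the text
    warmup_ratio: float = 0.03
    max_length: int = 8192
    num_epochs: int = 3


@dataclass(frozen=True)
class BaselineConfig:
    temperature: float = 0.0
    rag_max_documents: int = 50
    reflexion_max_trials: int = 3
    lats_num_generated_actions: int = 5
    lats_depth_limit: int = 15
    lats_value_function_weight: float = 0.8


def schedule(config: TrainingConfig, num_examples: int) -> Tuple[int, int]:
    """Return (total optimizer steps, warmup steps), rounding warmup down (assumption)."""
    steps_per_epoch = math.ceil(num_examples / config.batch_size)
    total_steps = steps_per_epoch * config.num_epochs
    return total_steps, int(total_steps * config.warmup_ratio)


# Hypothetical dataset size, used only to show how the warmup ratio turns into steps.
total_steps, warmup_steps = schedule(TrainingConfig(), num_examples=10_000)
```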
D Document sources
We use all the non-repeated Python files in SWE-bench-Verified (Jimenez et al., 2023) as the document
sources. Although we may not always find abundant documentation and tutorials for each environment,
we believe that documentation in the same domain still has good coverage of frequent operations. For
example, one subset of WebArena (Zhou et al., 2023b) focuses on the navigation of the shopping
website OneStopMarket, for which we use the Amazon documentation as a replacement. Regardless of
the shopping website, the frequent tasks usually include order change, product search, delivery
checking, etc. Therefore, we use other documentation in the same domain to sample task instructions
when the exact version for the target environment is not available.
⁴ https://developer.mozilla.org/en-US/docs/Glossary/Accessibility_tree
Concretely, we use the following sources for WebArena:
•https://docs.gitlab.com/ee/tutorials/
•https://support.google.com/maps
•https://www.amazon.com/hz/contact-us/foresight/hubgateway
•https://support.reddithelp.com/hc/en-us/articles
The following sources are used for OSWorld:
•https://support.google.com/chrome/?hl=en
•https://www.gimp.org/tutorials/
•https://books.libreoffice.org/en/CG72/CG72.html
•https://books.libreoffice.org/en/WG73/WG73.html
•https://ubuntu.com/tutorials/command-line-for-beginners
•https://support.mozilla.org/en-US/products/thunderbird
•https://wiki.videolan.org/Documentation:Documentation
•https://code.visualstudio.com/docs
The following sources are used for Spider2-V:
•https://docs.getdbt.com/
•https://release-1-7-2.dagster.dagster-docs.io/
•https://docs.astronomer.io/
•https://docs.airbyte.com/
•https://airbyte.com/tutorials/
•https://airbyte-public-api-docs.s3.us-east-2.amazonaws.com/rapidoc-api-docs.html
•https://superset.apache.org/docs/
•https://www.metabase.com/docs/v0.49/
•https://www.metabase.com/learn/
•https://docs.snowflake.com/en/
•https://cloud.google.com/bigquery/docs/
•https://jupyterlab.readthedocs.io/en/4.1.x/
E Synthesized data examples
In Tables 20 to 26, we provide a complete example of data synthesis. To begin with, an LLM
generates instructions based on standard resources like tutorials, documentation and FAQs: Upload
CSV data in Google Drive to BigQuery. (See the prompt in Table 29.) It then attempts to solve the
task by predicting actions and collecting feedback from environments (interactions). This produces a
long trajectory showing how LLMs try to achieve the goal.
However, it is not guaranteed that the trajectory successfully achieves the target. In our example,
the LLM makes a wrong prediction in Action 4. It selects the table source Google Cloud Storage,
while the correct action should select “Drive” to align with the instruction, which requires
uploading CSV data in Google Drive. This results in wrong actions in the subsequent predictions, and
the generated trajectory is not aligned with the initial instruction, which leads to noisy data in
this case.
Number of synthesized examples 0 5k 10k
Task instructions generated based on environments 35.8 37.3 39.6
Task instructions generated based on related resources 35.8 43.2 48.0
Table 6|The comparison of Learn-by-interact to the version without relying on existing
resources.
Instead of using the original instruction-trajectory pairs for downstream training and in-context
learning, we fix the mentioned misalignment by crafting new instructions for each sub-trajectory
(backward construction). Concretely, we feed the generated trajectory into LLM prompts, and ask the
LLM to summarize the trajectory or propose a new task based on it. For example, the LLM updates the
task objective to “Link CSV file in Google Cloud Storage to BigQuery” after observing the trajectory,
which aligns the task instruction with the trajectory. Additionally, we generate new
instructions for each sub-trajectory, which increases the utility of a generated full trajectory.
For instance, based on the sub-trajectory (Observation 0, Action 1, Observation 1), the LLM generates
a new instruction: When is dataset “demo” created? In Tables 27 and 28, we list more generated
instructions based on sub-trajectories.
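Backward construction as described above can be sketched as follows, assuming a hypothetical
`relabel` callable that stands in for the LLM prompt that summarizes a sub-trajectory into a new
instruction. The data layout (action, resulting observation) and all names are illustrative
assumptions; the original instruction is deliberately not used when relabeling.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Step = Tuple[str, str]  # (action, observation returned by the environment after the action)


def backward_construct(
    initial_observation: str,
    steps: Sequence[Step],
    relabel: Callable[[str, List[Step]], str],
    min_len: int = 1,
) -> List[Dict[str, object]]:
    """Create one (new instruction, sub-trajectory) pair per contiguous window T[i:j]."""
    examples: List[Dict[str, object]] = []
    n = len(steps)
    for i in range(n):
        # The window starts from the observation that precedes its first action.
        start_observation = initial_observation if i == 0 else steps[i - 1][1]
        for j in range(i + min_len, n + 1):
            window = list(steps[i:j])
            examples.append(
                {
                    "instruction": relabel(start_observation, window),
                    "start_observation": start_observation,
                    "steps": window,
                }
            )
    return examples


def stub_relabel(start_observation: str, window: List[Step]) -> str:
    """Placeholder for the LLM call: names the actions instead of summarizing them."""
    return "Task covering: " + ", ".join(action for action, _ in window)


trajectory = [("a1", "o1"), ("a2", "o2"), ("a3", "o3")]  # hypothetical 3-step trajectory
examples = backward_construct("o0", trajectory, stub_relabel)
```

In the BigQuery example, the window covering all recorded actions would receive the updated
objective about Google Cloud Storage, while the one-step window (Observation 0, Action 1,
Observation 1) would receive the question about the dataset “demo”.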
F Case study on filtered examples
In Tables 36-45, we demonstrate representative synthesized examples that fail to meet our designed
criteria. The example in Tables 36-41 is filtered because the trajectory takes a detour in
accomplishing the goal, i.e. Actions 1-6 are not necessary. The example in Tables 42-45 is filtered
because it goes back and forth between states, i.e. it repeats the actions of clicking "My Orders"
and clicking "View Order". We filter these low-quality examples to avoid their negative influence on
downstream applications.
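To make the back-and-forth failure mode concrete, here is a simple rule-based check. It is a
swapped-in heuristic for illustration only and not the filtering criteria used in the paper: it
flags a trajectory when the same transition between two different actions occurs more than
`max_repeats` times. The function name and threshold are assumptions.

```python
from collections import Counter
from typing import Sequence


def has_back_and_forth(actions: Sequence[str], max_repeats: int = 1) -> bool:
    """True if any adjacent pair of distinct actions occurs more than max_repeats times."""
    transitions = Counter(zip(actions, actions[1:]))
    return any(
        first != second and count > max_repeats
        for (first, second), count in transitions.items()
    )


looping = [
    'click "My Orders"',
    'click "View Order"',
    'click "My Orders"',
    'click "View Order"',
]
direct = ['click "My Account"', 'click "My Orders"', "stop"]
flags = (has_back_and_forth(looping), has_back_and_forth(direct))
```

A check like this cannot detect the detour case (unnecessary but non-repeating actions), which
needs a judgment about whether each action contributes to the goal.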
G Synthesized data from environments
We compare Learn-by-interact with the version without relying on existing resources in
WebArena. Except for sampling task instructions from LLMs based on given environments, we follow
the same procedures in Learn-by-interact to synthesize 10k examples. The results of in-context
learning with Claude-3.5-sonnet are shown in Table 6.
However, we note the following potential concerns regarding the version without relying on
existing resources: the distribution and the diversity of the generated data are hard to control.
Without conditioning on prior documents, one will need intensive prompt engineering to guide LLMs
in generating diverse task instructions. On the other hand, the related resources are usually crafted
by experts or written by real users, which cover most important scenarios that people who interact
with the environment are interested in.
H Cross-website generalization
To evaluate the generalization capabilities of Learn-by-interact in WebArena, we consider the
content management systems (CMS) as a held-out test set and leverage the synthetic data from the
remaining websites as the training data. To ensure a fair comparison and avoid the influence of
discrepancies in the training set size, we downsample the original data that covers all the WebArena
domains so that both sets contain the same number of instances. Following the same evaluation
pipelines in the paper, we assess the performance of in-context learning with Claude-3.5-sonnet and
training with Codestral-22B.
Model Claude-3.5-sonnet Codestral-22B
Baseline 26.0 3.3
Learn-by-interact with synthetic data that excludes CMS 28.3 12.6
Learn-by-interact with all WebArena data that contains CMS 29.2 17.6
Table 7|Results for cross-website generalization.
From the results in Table 7, we observe that, even without specifically using the data sampled from
the CMS environment, Learn-by-interact demonstrates significant improvements over the baseline in
both training and in-context learning. This indicates that the proposed approach holds the potential
for cross-website generalization and is likely to achieve better performance when utilizing data from
more websites.
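The size-matched comparison in this experiment can be sketched as below. The procedure (drop the
held-out site, then downsample the all-site data to the same number of instances) follows the text;
the record layout with a `site` field, the function name, the seed and the toy data are assumptions.

```python
import random
from typing import Dict, List, Sequence, Tuple

Example = Dict[str, object]


def held_out_split(
    examples: Sequence[Example], held_out_site: str, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Return (data without the held-out site, size-matched sample of all data)."""
    without_site = [ex for ex in examples if ex["site"] != held_out_site]
    rng = random.Random(seed)  # fixed seed so the downsampling is repeatable
    all_sites_sample = rng.sample(list(examples), k=len(without_site))
    return without_site, all_sites_sample


# Hypothetical toy dataset: 20 synthesized examples per site.
sites = ["shopping", "cms", "gitlab", "maps", "reddit"]
data = [{"site": site, "id": f"{site}-{i}"} for site in sites for i in range(20)]
excluded, matched = held_out_split(data, held_out_site="cms")
```

Matching the two set sizes isolates the effect of seeing in-domain (CMS) data from the effect of
simply having more examples.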
Instruction
Tell me the total cost of my latest cancelled order?
Observation 0
Action 1
click [4918] // click the button ‘My Account’
Table 8|WebArena example
Observation 1
Action 2
Stop: The total cost of the latest cancelled order is $365.42
Table 9|WebArena example, cont.
Instruction
Could you assist me in adding a new layer and naming it ‘Square’?
Observation 0 (Interface of the software GIMP)
Action 1
import pyautogui
pyautogui.hotkey(‘shift’, ‘ctrl’, ‘n’) // shortcut to initialize a new layer.
Observation 1 (Interface of the software GIMP)
Table 10|OSWorld example
Action 2
import pyautogui
pyautogui.typewrite(‘Square’) // change the layer name to Square.
Observation 2 (Interface of the software GIMP)
Action 3
import pyautogui
pyautogui.click(1096, 892) // click OK to save the change.
Observation 3 (Interface of the software GIMP)
Table 11|OSWorld example cont.
Instruction
I have established a connection from Faker to local .csv file. Could you help me
change the running schedule? I hope it can be replicated at 18:00 pm every
day.
Observation 0 (Interface of the software Airbyte)
Action 1
import pyautogui
pyautogui.click(550,280) // click the connection row with the name “Sample
Data (Faker)→Local CSV"
Observation 1 (Interface of the software Airbyte)
Table 12|Spider2-V example
Action 2
import pyautogui
pyautogui.click(500,300) // click the button Replication
Observation 2 (Interface of the software Airbyte)
Action 3
import pyautogui
pyautogui.click(500,450) // click the panel Configuration
Observation 3 (Interface of the software Airbyte)
Table 13|Spider2-V example, cont.
Action 4
import pyautogui
pyautogui.click(1270,430) // reveal the dropdown menu of the schedule type
Observation 4 (Interface of the software Airbyte)
Action 5
import pyautogui
pyautogui.click(1200,565) // select the schedule type Cron
Observation 5 (Interface of the software Airbyte)
Table 14|Spider2-V example, cont.
Action 6
import pyautogui
pyautogui.click(1020,480) // focus on the content of the Cron expression
Observation 6 (Interface of the software Airbyte)
Action 7
import pyautogui
pyautogui.hotkey(“ctrl",“a") // select all content in the textbox
Observation 7 (Interface of the software Airbyte)
Table 15|Spider2-V example, cont.
Action 8
import pyautogui
pyautogui.press(“delete") // delete the default content
Observation 8 (Interface of the software Airbyte)
Action 9
import pyautogui
pyautogui.write(“0 0 18 * * ?") // update the content to align 18:00 pm in the
instruction
Observation 9 (Interface of the software Airbyte)
Table 16|Spider2-V example, cont.
Action 10
import pyautogui
pyautogui.click(1450,900) // click the button save changes
Observation 10 (Interface of the software Airbyte)
Table 17|Spider2-V example, cont.
Screenshot
Set-of-mark
Table 18|Observation space of Spider2-V.
[208, 13]menu Chromium Web Browser ""
[1463, 13]menu System ""
[35, 65]push-button Chromium Web Browser ""
[753, 81]label Please download waiting software updates. ""
[135, 109]label Home
[35, 133]push-button Terminal ""
[35, 201]push-button Visual Studio Code ""
[35, 269]push-button Files ""
[35, 337]push-button Text Editor ""
[953, 370]label Updated software is available for this computer. Do you want to install it now?
[35, 405]push-button LibreOffice Calc ""
[951, 463]table-cell Security updates
[1191, 463]table-cell 638.8 MB
[35, 473]push-button LibreOffice Writer ""
[963, 486]table-cell LibreOffice
[1191, 486]table-cell 23.4 MB
[963, 509]table-cell LibreOffice Calc
[1191, 509]table-cell 8.7 MB
[923, 524]toggle-button Details of updates ""
[963, 532]table-cell LibreOffice Draw
[1191, 532]table-cell 3.0 MB
[35, 541]push-button Document Viewer ""
[963, 555]table-cell LibreOffice Impress
[1191, 555]table-cell 1.3 MB
[963, 578]table-cell LibreOffice Math
[1191, 578]table-cell 673 kB
[35, 612]push-button Software Updater ""
[935, 660]label 1157.8 MB will be downloaded.
[35, 680]push-button Trash ""
[671, 702]push-button Settings... ""
[1054, 702]push-button Cancel ""
[1176, 702]push-button Install Now ""
[35, 884]toggle-button Show Applications ""
Table 19|Observation space of Spider2-V. The accessibility tree suffers from significant information
loss. Compared to the screenshot and set-of-mark shown in Table 18, the presented accessibility tree
fails to retrieve webpage information, and only shows the details of the desktop icons in the left panel.
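To make the accessibility-tree format concrete, the sketch below parses lines of the form `[x, y]role name` into records and looks up a button by name. The `Node` type, the regular expression, and the helper names are illustrative assumptions based only on the listing above, not part of Spider2-V or of the paper's code.

```python
import re
from typing import List, NamedTuple, Optional


class Node(NamedTuple):
    x: int
    y: int
    role: str
    name: str


# "[x, y]" followed by a role token and the remaining text.
LINE = re.compile(r"^\[(\d+),\s*(\d+)\](\S+)\s+(.*)$")


def parse_node(line: str) -> Optional[Node]:
    """Parse one accessibility-tree line; return None if it does not match."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    x, y, role, rest = match.groups()
    if rest.endswith('""'):  # drop the trailing empty-string field
        rest = rest[:-2]
    return Node(int(x), int(y), role, rest.strip())


def find_by_name(nodes: List[Node], role: str, name: str) -> Optional[Node]:
    """Return the first node with the given role and exact name."""
    for node in nodes:
        if node.role == role and node.name == name:
            return node
    return None


lines = [
    '[1054, 702]push-button Cancel ""',
    '[1176, 702]push-button Install Now ""',
    "[135, 109]label Home",
]
nodes = [n for n in map(parse_node, lines) if n is not None]
target = find_by_name(nodes, "push-button", "Install Now")
print(target)
```

The coordinates of a node found this way could then be passed to a click action. Elements missing from the tree, such as the webpage content mentioned in the caption, cannot be located this way.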
Instruction
Upload CSV data in Google Drive to BigQuery.
Observation 0 (Bigquery Interface)
Action 1
import pyautogui
pyautogui.doubleClick(332, 447)  # double click the dataset demo.
Observation 1 (Bigquery Interface)
Table 20|Example of data synthesis - Bigquery
Action 2
import pyautogui
pyautogui.doubleClick(1002, 321)  # double click the button CREATE TABLE.
Observation 2 (Bigquery Interface)
Action 3
import pyautogui
pyautogui.click(1458, 279)  # click the dropdown menu to select table source.
Observation 3 (Bigquery Interface)
Table 21|Example of data synthesis - Bigquery
Action 4
import pyautogui
# wrong prediction: click the table source Google Cloud Storage.
# The correct prediction should select Drive to align with the instruction.
pyautogui.click(1302, 331)
Observation 4 (Bigquery Interface)
Action 5
import pyautogui
pyautogui.click(1389, 340)  # click BROWSE to find file.
Observation 5 (Bigquery Interface)
Table 22|Example of data synthesis - Bigquery
Action 6
import pyautogui
pyautogui.click(1341, 282)  # click to find files under directory.
Observation 6 (Bigquery Interface)
Action 7
import pyautogui
pyautogui.doubleClick(1203, 284)  # select the CSV file.
Observation 7 (Bigquery Interface)
Table 23|Example of data synthesis - Bigquery
Action 8
import pyautogui
pyautogui.click(1000, 915)  # click to select file.
Observation 8 (Bigquery Interface)
Action 9
import pyautogui
pyautogui.click(1000, 668)  # click to focus on the text box Table.
Observation 9 (Bigquery Interface)
Table 24|Example of data synthesis - Bigquery
Action 10
import pyautogui
pyautogui.write("test")  # name the table "test".
Observation 10 (Bigquery Interface)
Action 11
import pyautogui
pyautogui.click(1425, 726)  # click to reveal dropdown menu of the table type.
Observation 11 (Bigquery Interface)
Table 25|Example of data synthesis - Bigquery
Action 12
import pyautogui
pyautogui.click(1297, 801)  # select the table source external table.
Observation 12 (Bigquery Interface)
Action 13
import pyautogui
pyautogui.click(303, 914)  # click CREATE TABLE.
Observation 13 (Bigquery Interface)
Table 26|Example of data synthesis - Bigquery
Sub-trajectory: Observation 0 → Action 1 → Observation 1
Type: New task
Instruction: When is dataset "demo" created?

Sub-trajectory: Observation 1 → Action 2 → Observation 2
Type: Replicate trajectory
Instruction: Replicate the following: We are currently at the Google Cloud Console interface, specifically focused on a BigQuery project. The browser window displays details of a dataset named "demo" within a BigQuery project. The interface provides information about the dataset, including its creation date, last modified time, data location (US), and other properties like default table expiry and rounding mode. On the left side of the screen, there’s a navigation panel showing the Explorer view with the "demo" dataset selected. The top of the screen shows the Google Cloud header with project selection and search functionality. The overall layout is characteristic of a cloud-based data management platform, with options to create tables, share data, and manage dataset properties. After taking the action to click the CREATE TABLE button, we go to the user interface for creating a table. The screen displays a form titled "Create table" with various fields and options. The source section allows selecting a table to create from, while the destination section includes fields for project, dataset, and table name. There’s also a schema section and partition and cluster settings. The interface is part of the Google Cloud Console, as evident from the sidebar on the left showing different Cloud services and project navigation.

Sub-trajectory: Observation 4 → Action 5 → Observation 5 → Action 6 → Observation 6 → Action 7 → Observation 7 → Action 8 → Observation 8
Type: New task
Instruction: Select test.csv in the bucket test-1616 in Google Cloud Storage as the table source.
Table 27|Instructions generated from the trajectory in Tables 20 to 26
Sub-trajectory: Observation 8 → Action 9 → Observation 9 → Action 10 → Observation 10
Type: Replicate trajectory
Instruction: Replicate the following: We are in the interface for creating a table in Google Cloud’s BigQuery service. The page is divided into several sections. At the top, it indicates the user is creating a table from a Google Cloud Storage source, with a CSV file selected. The destination section shows the project ID and allows input for the dataset and table name. The destination table is empty. The table type is set to "Native table". At the bottom, there’s an option for schema detection, with buttons to create the table or cancel the operation. The left side of the screen displays a navigation menu for the Google Cloud Console, including options like Explorer and various project-related items. The overall layout suggests this is part of a larger cloud data management and analysis platform. After we click on the text box Table, we select and focus on the text box. We then type "test" into the box, which gives the table a name. Except for the text box we are working on, the other parts of the webpage have not changed after clicking and typing.

Sub-trajectory: Observation 0 → Action 1 → Observation 1 → Action 2 → ...... → Observation 13
Type: New task
Instruction: Link CSV file in Google Cloud Storage to BigQuery
Table 28|Instructions generated from the trajectory in Tables 20 to 26
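Tables 27 and 28 pair contiguous windows of one trajectory with newly written instructions. A minimal sketch of how such windows could be enumerated is given below. It assumes a trajectory is stored as observations o_0..o_n and actions a_1..a_n. The function name and list layout are illustrative and not taken from the authors' code.

```python
from typing import List, Tuple


def sub_trajectories(
    observations: List[str], actions: List[str]
) -> List[Tuple[int, int, List[str]]]:
    """Enumerate contiguous windows o_i, a_{i+1}, o_{i+1}, ..., a_j, o_j.

    observations holds o_0..o_n and actions holds a_1..a_n, so actions[k - 1]
    is the action that leads to observations[k].
    """
    if len(observations) != len(actions) + 1:
        raise ValueError("need exactly one more observation than actions")
    n = len(actions)
    windows = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            steps = [observations[i]]
            for k in range(i + 1, j + 1):
                steps.append(actions[k - 1])
                steps.append(observations[k])
            windows.append((i, j, steps))
    return windows


# The BigQuery trajectory in Tables 20 to 26 has Observations 0..13 and Actions 1..13.
obs = [f"Observation {k}" for k in range(14)]
acts = [f"Action {k}" for k in range(1, 14)]
windows = sub_trajectories(obs, acts)
print(len(windows))  # 91 windows, since 13 * 14 / 2 = 91
```

Each window can then be paired with a generated instruction, such as a new task or a replication summary. The full window (0, 13) corresponds to the last row of Table 28.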
{Documentation}
Based on the tutorial, exemplify 3 tasks that users frequently perform.
Use the following format to output:
...
...
Table 29|Self-instruct prompts to propose instructions based on tutorials, documentation and FAQs.
Prompt 1
Below is a trajectory to complete a task.
Observation:
{Observation 𝑖}
Action:
{Action 𝑖+1}
Observation:
{Observation 𝑖+1}
Action:
{Action 𝑖+2}
...
Action:
{Action 𝑗−1}
Observation:
{Observation 𝑗}
Please write a reasonable task instruction that is completed by the trajectory.
Wrap the instruction with ```.
Prompt 2
Below is a trajectory to complete a task.
Observation:
{Observation 𝑖}
Action:
{Action 𝑖+1}
Observation:
{Observation 𝑖+1}
Action:
{Action 𝑖+2}
...
Action:
{Action 𝑗−1}
Observation:
{Observation 𝑗}
Please summarize the trajectory about each observation and changes after each
action.
Wrap the summarization with ```.
Table 30|Prompts to summarize (sub-)trajectories or propose new tasks based on the (sub-)trajectories.
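The two prompts share one template and differ only in the final request. The sketch below fills that template from a list of alternating observation and action strings, then extracts the fence-wrapped part of a model reply. All function names are hypothetical, and the reply string is a made-up stand-in for an LLM response.

```python
import re
from typing import List, Optional

FENCE = "`" * 3  # the triple-backtick wrapper, built in code to keep this listing well-formed


def format_trajectory(steps: List[str]) -> str:
    """Label alternating steps as Observation / Action, starting with an observation."""
    lines = []
    for idx, step in enumerate(steps):
        lines.append("Observation:" if idx % 2 == 0 else "Action:")
        lines.append(step)
    return "\n".join(lines)


def build_prompt(steps: List[str], request: str, wrapped_noun: str) -> str:
    """Assemble a Table 30 style prompt from a (sub-)trajectory."""
    return "\n".join([
        "Below is a trajectory to complete a task.",
        format_trajectory(steps),
        request,
        f"Wrap the {wrapped_noun} with {FENCE}.",
    ])


def extract_wrapped(response: str) -> Optional[str]:
    """Return the text between the first pair of fences, or None if absent."""
    pattern = re.escape(FENCE) + r"(.*?)" + re.escape(FENCE)
    match = re.search(pattern, response, flags=re.DOTALL)
    return match.group(1).strip() if match else None


prompt = build_prompt(
    ["{Observation 0}", "{Action 1}", "{Observation 1}"],
    "Please write a reasonable task instruction that is completed by the trajectory.",
    "instruction",
)
reply = "Sure.\n" + FENCE + "\nLink CSV file in Google Cloud Storage to BigQuery\n" + FENCE
print(extract_wrapped(reply))
```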
Task instruction:
{instruction}
Below is the trajectory to complete the task.
Observation:
{Observation 𝑖}
Action:
{Action 𝑖+1}
Observation:
{Observation 𝑖+1}
Action:
{Action 𝑖+2}
...
Action:
{Action 𝑗−1}
Observation:
{Observation 𝑗}
Here are the criteria to indicate a good pair of the instruction and the trajectory:
1. The instruction and the trajectory are aligned, which means the trajectory
successfully accomplishes the goal in the instruction.
2. The trajectory is coherent, indicating that each action is logical based on its
previous observation and the actions do not contradict each other based
on the task instruction.
3. The trajectory is natural, meaning that the trajectory closely mimics real-
world interactions and a human user would possibly perform it when engaging
in the environment.
4. The trajectory is reasonable, indicating that the trajectory finishes the task
instruction using a reasonable solution, e.g., not using an over-complicated
method, not over-simplifying the problem, not going back and forth in states, etc.
Please answer yes if the task instruction and the trajectory satisfy all the
criteria, otherwise, answer with no.
Table 31|LLM prompts to filter low-quality data
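The prompt above asks for a yes/no verdict, so a filtering step has to map each reply to a keep-or-drop decision. The sketch below shows one way this could be done. The `judge` callable stands in for an LLM call, and both `judge` and `stub_judge` are hypothetical. The rule of accepting only replies whose first word is "yes" is an assumption, not something the paper specifies.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, List[str]]  # (instruction, trajectory steps)


def parse_verdict(response: str) -> bool:
    """Accept a pair only if the reply's first word is 'yes' (assumed rule)."""
    words = response.strip().lower().split()
    return bool(words) and words[0].strip(".,!:;\"'") == "yes"


def filter_pairs(pairs: List[Pair], judge: Callable[[str, List[str]], str]) -> List[Pair]:
    """Keep the pairs that the judge answers 'yes' for."""
    return [(ins, steps) for ins, steps in pairs if parse_verdict(judge(ins, steps))]


def stub_judge(instruction: str, steps: List[str]) -> str:
    """Toy stand-in for the LLM: rejects trajectories that revisit a step."""
    if len(steps) == len(set(steps)):
        return "Yes."
    return "No, the trajectory goes back and forth."


pairs = [
    ("Name the table", ["o8", "a9", "o9"]),
    ("Check the order", ["o1", "a2", "o1"]),
]
kept = filter_pairs(pairs, stub_judge)
print([instruction for instruction, _ in kept])
```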
SYSTEM MESSAGE:
{system message}
OBJECTIVE:
{task instruction}
INTERACTION HISTORY:
{interaction history}
OBSERVATIONS:
{observations}
Your REASONING and ACTION in the format:
REASON:
Your reason to choose a specific action.
ACTION:
Your action
Table 32|Model inference prompts without external knowledge
SYSTEM MESSAGE:
{system message}
ADDITIONAL INFORMATION FOR REFERENCE:
{external knowledge}
OBJECTIVE:
{task instruction}
INTERACTION HISTORY:
{interaction history}
OBSERVATIONS:
{observations}
Your REASONING and ACTION in the format:
REASON:
Your reason to choose a specific action.
ACTION:
Your action
Table 33|Model inference prompts with external knowledge
REASON:
...
ACTION:
...
Table 34|Expected model outputs
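Tables 32 and 33 differ only in the optional reference section, and Table 34 fixes the output format. The sketch below assembles either prompt and splits a reply into its reason and action parts. The function names and the sample reply are illustrative assumptions.

```python
import re
from typing import Optional, Tuple

TAIL = (
    "Your REASONING and ACTION in the format:\n"
    "REASON:\nYour reason to choose a specific action.\n"
    "ACTION:\nYour action"
)


def build_inference_prompt(
    system_message: str,
    instruction: str,
    history: str,
    observations: str,
    external_knowledge: Optional[str] = None,
) -> str:
    """Build the Table 32 prompt, or the Table 33 prompt when knowledge is given."""
    sections = [("SYSTEM MESSAGE", system_message)]
    if external_knowledge is not None:
        sections.append(("ADDITIONAL INFORMATION FOR REFERENCE", external_knowledge))
    sections += [
        ("OBJECTIVE", instruction),
        ("INTERACTION HISTORY", history),
        ("OBSERVATIONS", observations),
    ]
    body = "\n".join(f"{title}:\n{text}" for title, text in sections)
    return body + "\n" + TAIL


OUTPUT = re.compile(r"REASON:\s*(.*?)\s*ACTION:\s*(.*)", re.DOTALL)


def parse_output(text: str) -> Optional[Tuple[str, str]]:
    """Split a reply into (reason, action); None if the format is not followed."""
    match = OUTPUT.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


reply = "REASON:\nThe dataset page is open.\nACTION:\npyautogui.click(1002, 321)"
print(parse_output(reply))
```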
SYSTEM MESSAGE:
{system message}
Here is the final goal we want to achieve:
{task instruction}
To achieve the goal, we have done the following:
{interaction history}
Now, we have observed:
{observations}
To better finish the task, write a query to ask for useful information, e.g., what
kind of examples or interaction history will be helpful to predict the next action.
Table 35|Model prompts to write query for retrieval
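Once the model has written a query, it must be matched against the synthesized examples. The sketch below ranks candidate instructions by token overlap (Jaccard similarity) with the query. This scoring rule is a simple stand-in chosen for illustration and is not the retriever used in the paper. The query and the example pool are also made up.

```python
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenization (a deliberately crude choice)."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_examples(query: str, examples: List[str], top_k: int = 2) -> List[str]:
    """Return the top_k examples most similar to the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        examples,
        key=lambda example: jaccard(query_tokens, tokenize(example)),
        reverse=True,
    )
    return ranked[:top_k]


pool = [
    "Link CSV file in Google Cloud Storage to BigQuery",
    "Sum numbers in the first column",
    "Create table in BigQuery from Drive CSV",
]
print(rank_examples("create table from csv in bigquery", pool, top_k=2))
```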
Instruction
Sum numbers in the first column.
Observation 0 (Interface of the software LibreOffice Calc)
Action 1
import pyautogui
pyautogui.click(543, 126)  # click Tools.
Observation 1
Table 36|OSWorld example (filtered)
Action 2
import pyautogui
pyautogui.click(543, 580)  # click Solver.
Observation 2
Action 3
import pyautogui
pyautogui.click(772, 892)  # click Close.
Observation 3
Table 37|OSWorld example (filtered) cont.
Action 4
import pyautogui
pyautogui.click(520, 126)  # click Data.
Observation 4
Action 5
import pyautogui
pyautogui.moveTo(520, 562)  # move to Calculate.
Observation 5
Table 38|OSWorld example (filtered) cont.
Action 6
import pyautogui
pyautogui.click(498, 126)  # click Sheet.
Observation 6
Action 7
import pyautogui
pyautogui.click(152, 591)  # click cell A4.
Observation 7
Table 39|OSWorld example (filtered) cont.
Action 8
import pyautogui
pyautogui.click(480, 302)  # click the formula box.
Observation 8
Action 9
import pyautogui
pyautogui.write("=SUM(A1:A3)")  # type the formula.
Observation 9
Table 40|OSWorld example (filtered) cont.
Action 10
import pyautogui
pyautogui.press("enter")
Observation 10
Table 41|OSWorld example (filtered) cont.
Instruction
What are items ordered in the latest cancelled order?
Observation 0
Action 1
click [4918] // click the button "My Account"
Observation 1
Table 42|WebArena example (filtered)
Action 2
click [4922] // click the button "My Orders".
Observation 2
Action 3
click [6357] // click the button "View Order".
Observation 3
Table 43|WebArena example (filtered) cont.
Action 4
click [4922] // click the button "My Orders".
Observation 4
Action 5
click [6357] // click the button "View Order".
Observation 5
Table 44|WebArena example (filtered) cont.
Action 6
click [4922] // click the button "My Orders".
Observation 6
Action 7
click [6357] // click the button "View Order".
Observation 7
Table 45|WebArena example (filtered) cont.
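The filtered WebArena example repeats the same pair of clicks several times. This matches the "going back and forth in states" case named in the filtering criteria of Table 31. The paper's filter is LLM-based. The rule-based check below is only an illustrative heuristic for spotting such loops, and its function name is hypothetical.

```python
from typing import List


def longest_cycle_repeats(actions: List[str]) -> int:
    """Return the largest number of back-to-back repetitions of any contiguous block."""
    n = len(actions)
    best = 1
    for size in range(1, n // 2 + 1):
        for start in range(0, n - 2 * size + 1):
            block = actions[start:start + size]
            repeats = 1
            pos = start + size
            # Count how many times the block is immediately repeated.
            while actions[pos:pos + size] == block:
                repeats += 1
                pos += size
            best = max(best, repeats)
    return best


# Action ids from Tables 42 to 45.
webarena_actions = [
    "click [4918]",
    "click [4922]", "click [6357]",
    "click [4922]", "click [6357]",
    "click [4922]", "click [6357]",
]
print(longest_cycle_repeats(webarena_actions))  # 3: the pair of clicks occurs three times in a row
```

A threshold on this count could flag looping trajectories cheaply. It would not catch the other criteria, such as instruction alignment, which still need a judgment like the one in Table 31.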