Open AGI Codes | Your Codes Reflect!

arxiv

Paper 2501.04227

Agent Laboratory: Using LLM Agents as Research Assistants

Authors: Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, Emad Barsoum

Published: 2025-01-08

Abstract:

Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages (literature review, experimentation, and report writing) to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluating the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) The generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) Human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.

Paper Content: on Alphaxiv

Page 1: 2025-1-9 Agent Laboratory: Using LLM Agents as Research Assistants

Samuel Schmidgall (1, 2), Yusheng Su (1), Ze Wang (1), Ximeng Sun (1), Jialian Wu (1), Xiaodong Yu (1), Jiang Liu (1), Zicheng Liu (1), and Emad Barsoum (1)

(1) AMD, (2) Johns Hopkins University
https://AgentLaboratory.github.io

Figure 1 | Agent Laboratory takes as input a human research idea and a set of notes, provides this to a pipeline of specialized LLM-driven agents, and produces a research report and code repository.

Corresponding author(s): Samuel Schmidgall (sschmi46@jhu.edu)
arXiv:2501.04227v1 [cs.HC] 8 Jan 2025

1. Introduction

Scientists frequently face constraints that limit the number of research ideas they can explore at any given time, resulting in ideas being prioritized based on predicted impact. While this process helps determine which concepts are worth investing time in and how best to allocate limited resources effectively, many high-quality ideas remain unexplored. If the process of exploring ideas had fewer limitations, researchers would be able to investigate multiple concepts simultaneously, increasing the likelihood of scientific discovery.

In an effort to achieve this, recent work has explored the capability of LLMs to perform research ideation and automated paper generation, where LLM agents perform the role of human scientists (Baek et al. (2024); Ghafarollahi & Buehler (2024b); Lu et al. (2024a); Swanson et al. (2024)). The work of Baek et al. (2024) introduces ResearchAgent, which automatically generates research ideas, methods, and experiment designs, iteratively refining them through feedback from multiple reviewing agents that mirror peer discussions and leverage human-aligned evaluation criteria to improve the outputs. Lu et al. (2024a) explores fully automated paper generation, where The AI Scientist framework generates novel research ideas, writes code, conducts experiments, and creates a full scientific paper with an automated peer-review system to evaluate the work. Even though these works demonstrate that current LLMs can generate ideas judged to be more novel than those produced by human experts, Si et al. (2024) indicates that LLMs still exhibit weaknesses in feasibility and implementation details, suggesting a complementary rather than replacement role for LLMs in research. Therefore, we aim to design an autonomous agent pipeline that can assist humans toward implementing their own research ideas.

In this work, we introduce Agent Laboratory, an autonomous pipeline for accelerating the individual’s ability to perform machine learning research. Unlike previous approaches, where agents participate in their own research ideation independent of human input (Baek et al. (2024); Lu et al. (2024b)), Agent Laboratory is designed to assist human scientists in executing their own research ideas using language agents. Agent Laboratory takes as input a human research idea and outputs a research report and code repository produced by autonomous language agents, allowing various levels of human involvement, where feedback can be provided at a frequency based on user preference. A detailed list of our contributions is provided below:

1. We introduce Agent Laboratory, an open-source LLM agent framework for accelerating the individual’s ability to perform research in machine learning. In order to accommodate all users, Agent Laboratory is compute flexible, where various levels of compute can be allocated based on the individual’s access to compute resources (e.g., CPU, GPU, memory) and model inference budget.
2. Human evaluators rated papers generated using Agent Laboratory across experimental quality, report quality, and usefulness, showing that while the o1-preview backend was perceived as the most useful, o1-mini achieved the highest experimental quality scores, and gpt-4o was behind in all metrics.
3. NeurIPS-style evaluations showed that o1-preview performed best among backends, particularly in clarity and soundness, according to human reviewers. However, a clear gap emerged between human and automated evaluations, with automated scores significantly overestimating quality (6.1/10 vs. 3.8/10 overall). Similar discrepancies were seen across clarity and contribution metrics, suggesting the need for human feedback to complement automated evaluations for more accurate assessments of research quality.
4. Co-pilot mode in Agent Laboratory was evaluated on custom and preselected topics, showing higher overall scores compared to autonomous mode. Co-pilot papers also saw trade-offs in experimental quality and usefulness, reflecting challenges in aligning agent outputs with researcher intent.
5. The co-pilot feature in Agent Laboratory is overall found to have high utility and usability when rated by human users, with most participants deciding to continue usage after their experience.
6. Detailed cost and inference time statistics, as well as the breakdown of cost per paper phase, are presented for different model back-ends, demonstrating that Agent Laboratory offers automatic research at a greatly reduced price compared with other works (only $2.33 USD per paper with a gpt-4o backend).
7. State-of-the-art performance on a subset of MLE-Bench challenges using the proposed mle-solver, achieving higher consistency and scoring compared to other solvers, and earning more medals, including gold and silver, than MLAB, OpenHands, and AIDE.

We hope that this work takes a step toward accelerating scientific discovery in machine learning, allowing researchers to allocate more effort toward creative ideation and experiment design rather than low-level coding and writing.

2. Background & Related Work

Large language models. The research agents in this paper are built on autoregressive large language models (LLMs), which are trained on extensive text corpora to predict conditional probabilities of token sequences, $p(x_t \mid x_{<t}; \theta)$, and generate text completions through sampling, where $x_t \sim \mathrm{softmax}(W \cdot h_t)$, with $h_t$ as the hidden state and $W$ as the learned weight matrix mapping to token probabilities.
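To make the sampling step concrete, here is a minimal sketch, assuming a toy four-word vocabulary and hand-picked logits that stand in for the product of W and h_t. The vocabulary, logit values, and function names are illustrative assumptions, not taken from the paper.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, vocab, rng):
    """Draw the next token from softmax(logits); logits plays the role of W·h_t."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Hypothetical toy vocabulary and logits, for illustration only.
vocab = ["the", "agent", "writes", "code"]
logits = [2.0, 1.0, 0.5, -1.0]

rng = random.Random(0)  # seeded so repeated runs draw the same token
probs = softmax(logits)
token = sample_next_token(logits, vocab, rng)
```

Generating a full completion would repeat this draw once per position, feeding each sampled token back into the model to obtain the next hidden state.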
LLMs utilize transformer architectures (Vaswani (2017)) to capture long-range dependencies in text. These models, such as Claude (Anthropic (2024)), Llama (Dubey et al. (2024); Touvron et al. (2023a,b)), and ChatGPT (Achiam et al. (2023); Hurst et al. (2024); OpenAI (2022)), leverage vast datasets and scaling techniques, thus enabling them to perform a wide array of language-based tasks, such as translation, summarization, and reasoning, by generalizing patterns learned during pretraining to novel inputs (Brown (2020)).

LLM Agents. While LLMs demonstrate strong understanding and reasoning abilities, they face challenges when executing tasks in real-world scenarios. To overcome these limitations, their capabilities are extended through structured frameworks, enabling them to autonomously and semi-autonomously perform task execution (Chen et al. (2023b); Li et al. (2023); Qian et al. (2024); Wu et al. (2023)). These systems, referred to as agents, utilize techniques such as chain-of-thought prompting (Wei et al. (2022)), iterative refinement (Shinn et al. (2024)), self-improvement (Huang et al. (2022)), and external tool integration to execute complex workflows (Hao et al. (2024); Qin et al. (2023); Schick et al. (2023)). LLM agents have made remarkable progress in solving tasks of real-world significance, such as software engineering (Jimenez et al. (2023); Wang et al. (2024b); Yang et al. (2024)), cybersecurity (Abramovich et al. (2024); Fang et al. (2024); Wan et al. (2024)), and medical diagnosis (McDuff et al. (2023); Schmidgall et al. (2024); Tu et al. (2024)). There has also been progress in applying LLM agents to embodied problems such as autonomous robotics (Black et al. (2024); Brohan et al. (2022, 2023); Kim et al. (2024)), web tasks (Deng et al. (2024); Gur et al. (2023); He et al. (2024); Putta et al. (2024); Shi et al. (2017)), and game playing (AL et al. (2024); Feng et al. (2024); Wang et al. (2023)).
For a broader overview of LLM agents, refer to Wang et al. (2024a).

Automated machine learning. Automated machine learning is an area of active research, with many approaches focused on using Kaggle, an online platform for machine learning competitions, as a benchmark for evaluating agent performance. Notable efforts include MLE-Bench (Chan et al. (2024)), DS-bench (Jing et al. (2024)), and MLAgentBench (Huang et al. (2024)), which propose using 75, 74, and 6 Kaggle challenges respectively as benchmarks to measure the abilities of ML agents in tasks such as data preparation, model development, and submission. Several ML “solvers” which can solve ML challenges have been introduced, such as AIDE (Schmidt et al. (2024)), CodeActAgent (referred to as “OpenHands”) (Wang et al. (2024b)), and ResearchAgent (referred to as “MLAB”) from MLAgentBench (Huang et al. (2024)), which automate feature implementation, bug fixing, and code refactoring with a high success rate. Agent K (Grosnit et al. (2024)) demonstrates the ability to solve Kaggle challenges at a human level with a challenge URL provided as input.

AI in Scientific Discovery. AI has been used to support scientific discovery across numerous disciplines for decades. For instance, AI has been used for discovery in mathematics (Romera-Paredes et al. (2024)), material science (Merchant et al. (2023); Pyzer-Knapp et al. (2022); Szymanski et al. (2023)), chemistry (Hayes et al. (2024); Jumper et al. (2021)), algorithm discovery (Fawzi et al. (2022)), and computational biology (Ding et al. (2024)). These approaches position AI as a tool rather than as an agent performing research autonomously.

LLMs for research related tasks. LLMs have demonstrated strong capabilities in diverse research-related tasks, such as code generation (Chen et al. (2021); Nijkamp et al. (2022)), end-to-end software development (Hai et al. (2024); Phan et al. (2024); Qian et al. (2023, 2024)), code generation for discovery (Chen et al. (2024b); Ghafarollahi & Buehler (2024a); Gu et al. (2024); Guo et al. (2024); Hu et al. (2024b); Ifargan et al. (2024); Majumder et al. (2024)), research question-answering (Chen et al. (2024a); Lála et al. (2023); Lin et al. (2024); Song et al. (2024)), research ideation (Baek et al. (2024); Ghafarollahi & Buehler (2024b); Li et al. (2024a); Si et al. (2024)), automated paper reviewing (D’Arcy et al. (2024); Liang et al. (2024); Lu et al. (2024b); Weng et al. (2024)), literature search (Ajith et al. (2024); Kang & Xiong (2024); Li et al. (2024b); Press et al. (2024)), and predicting the outcome of experiments (Ashokkumar et al. (2024); Lehr et al. (2024); Luo et al. (2024); Manning et al. (2024); Zhang et al. (2024)). Although LLMs have made notable progress in solving the aforementioned tasks, ideation has struggled to progress, with some work showing that LLM ideation leads to greater novelty than humans (Si et al. (2024)), while others show reduced creativity (Chakrabarty et al. (2024)) and greater homogenization effects (Anderson et al. (2024); Zhou et al. (2024)) that may limit creative discovery without human guidance. Additionally, research on human-AI collaboration has reached mixed conclusions about idea novelty (Ashkinaze et al. (2024); Liu et al. (2024); Padmakumar & He (2024)). These findings suggest that, with the current LLMs, the strongest research systems would combine human-guided ideation with LLM-based workflows.

LLMs for autonomous research. Recent advancements in automated scientific workflows have focused on leveraging LLMs to emulate the process of research. Swanson et al. (2024) introduces a team of LLM agents working as scientists alongside a human researcher who provides high-level feedback, with the end result being novel nanobody binders aimed at addressing recent variants of SARS-CoV-2. ChemCrow (M. Bran et al. (2024)) and Coscientist (Boiko et al. (2023)) demonstrate the ability for autonomous ideation and experimentation in chemistry. ResearchAgent (Baek et al. (2024)) automates research idea generation, experiment design, and iterative refinement using feedback from reviewing agents aligned with human evaluation criteria. The AI Scientist (Lu et al. (2024a)) extends this automation to encompass end-to-end scientific discovery, including coding, experiment execution, and automated peer review for manuscript generation. Despite these advancements, studies like Si et al. (2024) highlight limitations in the feasibility and implementation details of LLM ideation, indicating a complementary rather than replacement role for LLMs in autonomous research.

Figure 2 | Agent Laboratory Workflow. This image illustrates the three primary phases of Agent Laboratory: Literature Review, Experimentation, and Report Writing, each featuring distinct tasks, tools, and human-agent roles. The pipeline integrates human input with LLM-driven agents, such as the PhD and Postdoc agents, which handle literature reviews, experimental planning, data preparation, and result interpretation. Specialized tools like mle-solver for experimentation and paper-solver for report generation automate tedious research tasks, enabling collaboration between human researchers and AI to produce high-quality research outputs.

3. Agent Laboratory

Overview. Agent Laboratory begins with the independent collection and analysis of relevant research papers, progresses through collaborative planning and data preparation, and results in automated experimentation and comprehensive report generation. As shown in Figure 2, the overall workflow consists of three primary phases: (1) Literature Review, (2) Experimentation, and (3) Report Writing. In this section, we will introduce these phases in detail along with the corresponding involved agents.
Furthermore, in Section 4, we will conduct qualitative and quantitative analyses to demonstrate the strengths of Agent Laboratory and its ability to generate

3.1. Literature Review

Literature Review. The literature review phase involves gathering and curating relevant research papers for the given research idea to provide references for subsequent stages. During this process, the PhD agent utilizes the arXiv API to retrieve related papers and performs three main actions: summary, full text, and add paper. The summary action retrieves abstracts of the top 20 papers relevant to the initial query produced by the agent. The full text action extracts the complete content of specific papers, and the add paper action incorporates selected summaries or full texts into the curated review. This process is iterative rather than a single-step operation, as the agent performs multiple queries, evaluates the relevance of each paper based on its content, and refines the selection to build a comprehensive review. Once the specified number of relevant texts (N=max) is reached via the add paper command, the curated review is finalized for use in subsequent phases.

3.2. Experimentation

Plan Formulation. The plan formulation phase focuses on creating a detailed, actionable research plan based on the literature review and research goal. During this phase, the PhD and Postdoc agents collaborate through dialogue to specify how to achieve the research objective, detailing experimental components needed to complete the specified research idea such as which machine learning models to implement, which datasets to use, and the high-level steps of the experiment. Once a consensus is reached, the Postdoc agent submits this plan using the plan command, which serves as a set of instructions for subsequent subtasks.

Data Preparation.
The goal of the data preparation phase is to write code that prepares data for running experiments, using the instructions from the plan formulation stage as a guideline. The ML Engineer agent executes code using the Python command and observes any printed output. The ML Engineer has access to HuggingFace datasets, searchable via the search HF command. After agreeing on the finalized data preparation code, the SW Engineer agent submits it using the submit code command. Before the final submission proceeds, the code is first passed through a Python compiler to ensure that there are no compilation issues. This process is repeated until the code is bug-free.

Running Experiments. In the running experiments phase, the ML Engineer agent focuses on implementing and executing the experimental plan formulated earlier. This is facilitated by mle-solver, a specialized module designed to generate, test, and refine machine learning code autonomously. mle-solver begins by producing initial code based on the research plan and insights from the literature review. For the first mle-solver step, no program exists yet, so the solver must generate a file from scratch, which is then used as the top-scoring program. The following processes describe the workflow of the mle-solver:

A. Command Execution. During the command execution phase, an initial program is sampled from a maintained set of top-performing programs, which is represented by a single file during initialization. The mle-solver iteratively refines this program through two operations, REPLACE and EDIT, to better align the output with experimental objectives. The EDIT operation identifies a range of lines, substituting the code between the specified line numbers with newly generated code. In contrast, the REPLACE operation generates a completely new Python file.

B. Code Execution. After a code command is executed, the new program is passed through a compiler to check for runtime errors.
If it successfully compiles, a score is returned and the list of top programs is updated if the score is higher than those of the existing programs. If the code does not compile, the agent attempts to repair the code for $N_{rep}$ tries ($N_{rep} = 3$ in our experiments) before returning an error and moving on to a new code replacement.

C. Program Scoring. If a program compiles successfully, it is sent to a scoring function which determines whether it is better than previously implemented experiment code. In order to obtain a program score, we implement a scoring function that uses an LLM reward model to assess the effectiveness of the ML code generated by mle-solver. The reward model, invoked as an LM, scores the program on a scale from 0 to 1 considering the outlined research plan, the produced code, and the observed output to determine how accurately the program adheres to the initial goals. A score of 1 is given for results with high alignment, with lower scores reflecting how closely the output and code match the planning goals. This process is similar to existing methods for LLM reasoning tree search (Yao et al. (2024)), where instead of a series of reasoning steps being traversed using self-evaluated LLM scoring, the set of possible programs is traversed (via EDIT and REPLACE commands) and the resulting program outcome is self-evaluated to determine if a program is worth building on. This is similar to the Solution Space Search of AIDE (Schmidt et al. (2024)); however, their method was specifically designed for Kaggle competitions and simply extracts the accuracy rather than scoring the research code and outcomes.

D. Self Reflection. Whether the code succeeds or fails, a self-reflection is produced based on the experimental results or the encountered error signal (Renze & Guven (2024); Shinn et al. (2024)). Here, the mle-solver is prompted to reflect on the outcome of its actions. If the program failed to compile, the solver reflects on how to fix this issue in the next iterations. If it successfully compiles and returns a score, the solver will reflect on how to increase this score. These reflections are generated to improve future performance, ensuring that the system learns from errors, improving the quality and robustness of the generated code over iterative cycles.

E. Performance Stabilization. To prevent performance drift, two mechanisms are implemented: top program sampling and batch-parallelization. In top program sampling, a collection of the highest-scoring programs is maintained, and one program is randomly sampled before executing a command, ensuring diversity while retaining quality. For batch-parallelization, each solver step involves making N modifications simultaneously, with the top modification selected to replace the lowest-scoring program in the top collection. These strategies use high-entropy sampling to modify the code, resulting in a balance between exploration of new solutions and refinement of existing ones in order to maintain stable code modifications.

Figure 3 | Overview of the mle-solver workflow. This diagram details the iterative process used by the MLE-Solver to autonomously generate machine learning code. Beginning with external resources, the workflow integrates command execution (A), where new code is generated, followed by code execution (B) to compile and repair issues if needed. Program scoring (C) evaluates the generated code using a reward function, while self-reflection (D) helps refine future iterations based on results. Performance stabilization (E) ensures consistent outcomes by maintaining a pool of top-performing programs and iterative optimization.

Results Interpretation. The goal of the results interpretation phase is to derive meaningful insights from experimental outcomes to inform the final report. The PhD and Postdoc agents discuss their understanding of the experimental results produced by mle-solver. Once they agree on a meaningful interpretation that could contribute to a compelling academic paper, the Postdoc agent submits it using the interpretation command, forming the basis for the report writing phase.

3.3. Report Writing

Figure 4 | Graphical outline of paper-solver. This diagram showcases the step-by-step process of generating and refining academic research reports using the Paper-Solver tool. The workflow starts with the creation of an initial report scaffold (A) by iteratively generating LaTeX-based sections, followed by updates to ensure structural completeness. (B) Research is performed through an Arxiv tool during relevant sections. In the Report Editing phase (C), the language model applies targeted edits to improve the document, with LaTeX compilation verifying the integrity of changes. Finally, the completed report undergoes a reward-based evaluation during the Paper Review phase (D), ensuring alignment with academic standards and research goals.

Report Writing. In the report writing phase, the PhD and Professor agents synthesize the research findings into a comprehensive academic report. This process is facilitated by a specialized module called paper-solver, which iteratively generates and refines the report. The paper-solver aims to act as a report generator, positioning the work that has been produced by previous stages of Agent Laboratory. paper-solver does not aim to entirely replace the academic paper-writing process, but rather to summarize the research that has been produced in a human-readable format so that the researcher using Agent Laboratory understands what has been accomplished.
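Looking back at the mle-solver steps A to E in Section 3.2, the loop can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose`, `repair`, and `score` are hypothetical stand-ins for the LLM calls, candidate programs report their outcome through a variable named `result`, and the pool is updated by keeping the highest-scoring entries. Self-reflection (D) and batch-parallelization are omitted for brevity.

```python
import random


def apply_edit(program, start, end, new_lines):
    """EDIT: replace lines start..end (0-indexed, inclusive) with new_lines."""
    lines = program.split("\n")
    return "\n".join(lines[:start] + new_lines + lines[end + 1:])


def run_program(program):
    """Run a candidate and return (ok, result_or_error).

    A real system would sandbox this; exec only keeps the sketch short.
    """
    scope = {}
    try:
        exec(program, scope)
        return True, scope.get("result")
    except Exception as err:  # syntax and runtime errors both land here
        return False, repr(err)


def solver_step(pool, propose, repair, score, rng, n_rep=3, max_pool=2):
    """One solver step over a pool of (score, program) pairs."""
    _, parent = rng.choice(pool)        # top program sampling
    candidate = propose(parent)         # stand-in for an LLM EDIT or REPLACE
    ok, out = run_program(candidate)
    tries = 0
    while not ok and tries < n_rep:     # at most n_rep repair attempts
        candidate = repair(candidate, out)
        ok, out = run_program(candidate)
        tries += 1
    if not ok:
        return pool                     # give up; the pool is unchanged
    new_entry = (score(candidate, out), candidate)
    ranked = sorted(pool + [new_entry], key=lambda e: e[0], reverse=True)
    return ranked[:max_pool]            # keep only the highest scorers


# Hypothetical deterministic stand-ins for the LLM calls.
propose = lambda p: apply_edit(p, 0, 0, ["x = 4 /"])        # deliberately broken
repair = lambda p, err: apply_edit(p, 0, 0, ["x = 4 / 2"])  # fixes the syntax
score = lambda p, out: min(out / 4, 1.0)                    # toy reward, at most 1

pool = [(0.25, "x = 1\nresult = x")]
pool = solver_step(pool, propose, repair, score, random.Random(0))
```

In this toy run the first proposal fails to parse, one repair attempt fixes it, and the repaired program enters the pool ahead of the initial one. A REPLACE operation would simply return a whole new program string from `propose` instead of calling `apply_edit`.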
The paper-solver output follows the standard structure of an academic paper, ensuring it meets conference submission requirements (for the paper scoring phase) while being clear and methodical. The following processes describe the workflow of paper-solver:

A. Initial Report Scaffold. The first task of the paper-solver is to generate an initial scaffold for the research paper. This scaffold outlines the document structure, dividing it into eight standardized sections: Abstract, Introduction, Background, Related Work, Methods, Experimental Setup, Results, and Discussion. During scaffold creation, placeholders are inserted for each section to categorize future content. This process establishes the framework for subsequent detailed text generation. The scaffold includes necessary formatting for LaTeX compilation, allowing the generated paper to be directly reviewed and refined. Special care is taken to ensure the scaffold aligns with academic conventions, such as appropriate section titles and placeholders that guide content development.

B. Arxiv Research. During the scaffold building phase, we allow the paper-solver access to arXiv, which is accessible through the same interface as the earlier literature review phase. arXiv access allows the solver to explore related literature on the subject it is writing about and to find papers to refer to, although its use is not enforced. We note that the agent still has access to the original literature search, but has the opportunity to expand it based on the literature needed to write a particular paper section.

C. Report Editing. Once the scaffold is built, the paper-solver uses specialized commands to iteratively refine the generated paper. The primary command available for this stage is the EDIT command, which allows precise line-by-line modifications to the LaTeX code.
This command enables dynamic adjustments to the content, ensuring alignment with the research plan, the clarity of arguments, and compliance with formatting standards. Before integrating edits, the system compiles the LaTeX to verify error-free functionality, thereby maintaining document integrity. Through iterative editing, the solver ensures the paper achieves the desired level of quality, cohesiveness, and depth required for academic acceptance.

D. Paper Review. For obtaining scores for papers during the paper-solver iterations, we leverage an adapted version of the automated review system developed in Lu et al. (2024b). This system works by using an LLM-based agent to simulate the scientific paper review process following the NeurIPS conference guidelines. When evaluated on 500 ICLR 2022 papers from the OpenReview dataset, the automated reviewer achieved human-level accuracy (65% compared to 66% for human reviewers) and surpassed human performance in F1 score (0.57 vs. 0.49) after calibration. An example review from one of our papers by o1-mini is provided below.
Example Review (o1-mini | Word Order Sensitivity)

"Strengths": [
    "Comprehensive experimental design and methodology.",
    "Use of a well-known dataset (RACE) for evaluation.",
    "Empirical validation of bias mitigation strategies.",
    "Clear presentation of results and analysis."],
"Weaknesses": [
    "Limited exploration of additional bias mitigation techniques.",
    "Lack of in-depth discussion on limitations and societal impacts.",
    "The originality could be enhanced by exploring novel strategies."],
"Originality": 3,
"Quality": 4,
"Clarity": 3,
"Significance": 3,
"Questions": [
    "Have you considered exploring additional bias mitigation techniques beyond majority voting and entropy-based thresholding?",
    "Can you provide more details on the potential societal impacts of the model’s sensitivity to option order?",
    "What are the limitations of the current study, and how might they be addressed in future work?"],
"Limitations": [
    "The study is limited to the RACE dataset and may not generalize to other datasets.",
    "The bias mitigation strategies, while effective, do not completely eliminate sensitivity to option order."],
"Ethical Concerns": false,
"Soundness": 3,
"Presentation": 3,
"Contribution": 3,
"Overall": 7,
"Confidence": 4,
"Decision": "Accept"

Paper Refinement. In the paper refinement phase, the PhD agent decides whether to revise the paper or to declare it complete. The process begins with a set of three reviewer agents generating reviews that mimic feedback from NeurIPS peer reviewers, evaluating the report based on criteria such as originality, quality, clarity, and significance. Based on these scores, the PhD agent then decides whether to finalize the project or revisit earlier subtasks (such as planning, experimentation, or results interpretation) to address the feedback.
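As a rough illustration of how three reviews in the format shown above might be combined, the sketch below parses hypothetical review JSON and applies a fixed score threshold. The threshold rule is an assumption made purely for illustration: in Agent Laboratory the decision is made by the PhD agent itself, and the review values here are invented.

```python
import json
from statistics import mean

# Hypothetical reviews using field names from the example review; values are invented.
raw_reviews = [
    '{"Overall": 7, "Soundness": 3, "Decision": "Accept"}',
    '{"Overall": 5, "Soundness": 2, "Decision": "Reject"}',
    '{"Overall": 6, "Soundness": 3, "Decision": "Accept"}',
]


def summarize(reviews_json):
    """Parse each review and aggregate the fields used for the decision."""
    reviews = [json.loads(r) for r in reviews_json]
    return {
        "overall": mean(r["Overall"] for r in reviews),
        "accepts": sum(r["Decision"] == "Accept" for r in reviews),
    }


def next_action(summary, threshold=6.0):
    """Assumed rule: finalize when the mean overall score reaches the threshold."""
    return "finalize" if summary["overall"] >= threshold else "revise"


summary = summarize(raw_reviews)
action = next_action(summary)
```

A "revise" outcome would correspond to sending the project back to an earlier subtask along with the reviewers' written weaknesses and questions.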
This allows the agents to refine the research report until it meets sufficiently high standards, effectively simulating the real-world academic revision process.

3.3.1. Autonomous versus Co-Pilot Mode

There are two ways in which Agent Laboratory can be operated: autonomous mode and co-pilot mode. In autonomous mode, there is no human involvement other than providing the initial research idea for the agents to work on. Each subtask moves on to the next sequentially upon completion. In co-pilot mode, in addition to providing the research idea, a human reviews the work produced by the agents at a checkpoint at the end of each subtask (e.g., the literature review summary or the generated report). The human reviewer can either proceed to the next subtask or ask the agent to repeat the subtask, providing high-level notes to help the agent improve on its next attempt. For example, if the literature review phase did not include a specific paper or the experiments did not include a desired technique, the human reviewer would instruct the agent to include it.

4. Results

In this section, we present our main findings on the efficacy of Agent Laboratory in producing research. We begin by asking how human evaluators perceive papers generated by Agent Laboratory running in end-to-end autonomous mode across five topics. Next, we examine human evaluation of Agent Laboratory in collaborative co-pilot mode, both when the researcher may choose any topic and when the topic comes from our preselected set. We then provide a detailed runtime analysis including cost, average time, and success rate by model. Finally, we conclude with an evaluation of the mle-solver in isolation on MLE-Bench, a set of real-world Kaggle challenges. The details of all surveys are provided in Appendix C.

4.1. Evaluation of quality by language model

Our first experiment aims to evaluate how human-evaluated quality varies across three axes: experiment quality, report quality, and usefulness. Human participants evaluated papers generated with three different LLM backends: gpt-4o (Hurst et al. (2024)), o1-mini, and o1-preview (OpenAI (2024)). Research questions were selected from a set of 5 templates:

1. Do language models exhibit cognitive biases, such as confirmation bias or anchoring bias?
2. Are image transformers more or less sensitive to pixel noise than convolutional networks?
3. Do language models improve accuracy on MedQA when asked to perform differential diagnosis?
4. Are language models sensitive to word order in multiple choice benchmarks?
5. Does gender role play affect the accuracy of language models on answering math questions?

Figure 5 | The average human-evaluated scores of papers generated by Agent Laboratory in autonomous mode, by research question (left column) and LLM backend (top row). The bottom row shows the average score across all topics by LLM backend.

These 5 questions across 3 LLM backends resulted in a total of 15 papers being written autonomously by Agent Laboratory without any human involvement. We then recruited 10 volunteer PhD students to review 3 randomly assigned papers each. These researchers rated the experimental quality, report quality, and usefulness of the generated outputs on a scale of 1 to 5. The goal of this evaluation is to understand how the quality of the produced research differs across the three LLM backends, and to understand the usefulness of Agent Laboratory in autonomous mode. The evaluation questions were as follows:

• Experimental Quality: What is your perception of the quality of the experimental results presented in this report?
• Report Quality: What is your perception of the quality of the research report writing presented in this report?
• Usefulness: What is your perception of the usefulness of an AI assistant tool that can generate the presented report autonomously?

The results of this evaluation indicate variability in performance across different Agent Laboratory LLM backends (Figure 5). gpt-4o consistently achieved lower scores, with an average experimental quality rating of 2.6/5, a report quality rating of 3.0/5, and a usefulness rating of 4.0/5. In contrast, o1-mini generally outperformed gpt-4o in experimental quality, with an average score of 3.2/5 (+0.6), while maintaining similar levels of report quality and usefulness at 3.2/5 (+0.2) and 4.3/5 (+0.3), respectively. o1-preview demonstrated the highest usefulness and report quality, averaging 4.4/5 (+0.4 from gpt-4o and +0.1 from o1-mini) and 3.4/5 (+0.4 from gpt-4o and +0.2 from o1-mini) respectively, though its experimental quality rating was slightly lower than that of o1-mini at 2.9/5 (+0.3 from gpt-4o and -0.3 from o1-mini). While all backends perform comparably in terms of report and experimental quality, the o1-preview model was rated as the most useful for research assistance, suggesting that its outputs were better aligned with the expectations and needs of researchers.

Our results also show that quality varies with the selected topic. The highest average report quality (3.8/5) and usefulness (4.5/5) were obtained for the word order topic, and the highest average experiment quality (3.2/5) for the cognitive bias topic. Interestingly, word order also has the lowest experiment quality at 2.7/5, along with the image noise topic.
The image noise topic showed high variance across LLM backends, with an experiment quality score of 1.5/5 for gpt-4o and 4.0/5 for o1-mini (a 2.5 point difference) and a usefulness score of 2.5/5 for gpt-4o and 4.5/5 for o1-mini (a 2.0 point difference).

In summary, the evaluation of quality across LLM backends demonstrates clear differences in experimental quality, report quality, and usefulness. While o1-preview is consistently rated as the most useful for research assistance, o1-mini achieves the highest experimental quality scores, and gpt-4o is generally outperformed in all areas. Topic-specific trends suggest that the performance of Agent Laboratory may vary across different areas of machine learning research and across backend models.

4.1.1. Human reviewer scores by language model

In addition to evaluating paper quality, we also asked human reviewers to assess papers generated by Agent Laboratory according to NeurIPS-style criteria, including quality, significance, clarity, soundness, presentation, and contribution, as shown in Figure 6. We evaluated the same papers analyzed in Section 4.1 using these metrics and compared the results. The average human scores for the three backends revealed differences in performance, with average overall ratings of 3.5/10 with gpt-4o, 3.8/10 with o1-mini, and 4.0/10 with o1-preview.

First, when evaluating quality, we find that reviewers rated gpt-4o the lowest at 1.8/4, while o1-mini achieved the highest score of 2.3/4, demonstrating relatively better technical soundness. In terms of significance, all three backends received similar scores between 2.2–2.5/4, indicating a modest contribution to advancing research goals. Clarity scores showed slight variability, with gpt-4o receiving 2.6/4 and o1-mini falling slightly lower at 2.1/4 (-0.5), reflecting differences in how well the papers were written.
The soundness of the generated outputs, which assesses the robustness of claims, was rated highest for o1-preview at 2.2/4, with o1-mini and gpt-4o at 1.8/4 (-0.4) and 1.7/4, respectively. Presentation and contribution ratings followed similar trends, with the overall contribution score averaging 2.1/4 across models, highlighting a need for improvement in the originality of the outputs.

These scores show a general trend where human reviewers identified o1-preview as producing slightly better-rounded outputs compared to other backends, though significant gaps remain in technical and methodological aspects across all models. We note that the average score of an accepted paper at NeurIPS is 5.9. In this regard, papers produced in autonomous mode are, on average, below the acceptance threshold for top ML conferences. These results demonstrate that, in autonomous mode, Agent Laboratory needs refinement to meet human expectations for high-quality, impactful research papers.

Automated Reviews versus Human Reviews. We also explore to what extent the automated reviewer scores align with those of human reviewers. The alignment is illustrated using both tabular data (for all scores) and violin plots (for overall scores) in Figure 6. Our findings suggest that automated reviewers demonstrate notable discrepancies across all metrics compared with human evaluators, with a tendency to strongly overestimate the contribution of self-evaluated work. While the automated reviewers gave an average overall score of 6.1/10, above average for a NeurIPS paper, human reviewers provided a much lower average of 3.8/10 (-2.3 points). Similar gaps are observed for all specific criteria, such as clarity and contribution, where automated reviewers rated clarity at 3.6/4 on average compared to 2.4/4 by human evaluators. This pattern holds for all criteria.

Figure 6 | Scores from NeurIPS-style evaluation of generated papers, including the criteria: quality, significance, clarity, soundness, presentation, and contribution. (top) Split-violin plot comparing the overall score distribution of automated reviewers (LLM scores, left half of violin) and human reviewers (right half of violin). Human scores are not predictive of automated reviewer scores and are 2.3 points lower on average. (middle) Automated reviewer scores across NeurIPS-style criteria. (bottom) Human reviewer scores across NeurIPS-style criteria.

Previous work demonstrates high alignment between automated reviewers (Lu et al. (2024b)) and ICLR scores from OpenReview. However, with actual humans rating the generated papers, we find that automated reviews do not align closely with human reviews, and that the human scores are far from the average score of an accepted paper at NeurIPS 2024, which stands at 5.85∗ (our scores were 2.05 points lower on average). Our results demonstrate that it is important for human evaluations to be provided alongside automated reviewer scores in future work in order to obtain a better understanding of the quality of generated papers.

4.2. Evaluation of co-pilot quality

We next evaluate the use of Agent Laboratory in co-pilot mode, where a human researcher provides feedback at the end of each subtask (see Section 3.3.1 for more details). We evaluate performance across two measures: (1) the quality of Agent Laboratory as a tool for assisting researchers and (2) the quality of the generated papers. We first ask researchers to co-pilot Agent Laboratory on a topic of their choice without limitations. We then ask researchers to select a topic from the 5 topics introduced in Section 4.1, resulting in a total of 2 papers per researcher, which we refer to as custom and preselected papers respectively.
After their papers are generated, we ask researchers to rate their experience using Agent Laboratory during the process of generating custom and preselected papers. We then ask them to self-evaluate the generated papers according to NeurIPS-style criteria. Finally, we ask external researchers to evaluate these papers, comparing performance with Agent Laboratory in autonomous mode. All experiments used an o1-mini backbone for all phases except the literature review.

4.2.1. Quality as a tool

The evaluation of Agent Laboratory as a research tool focuses on understanding its effectiveness in assisting researchers in co-pilot mode. After generating their papers, participants were asked to reflect on their experiences and assess the tool’s utility, usability, and overall satisfaction. We begin our evaluation by asking the following questions:

• Utility: How useful is Agent Laboratory for assisting your research?
• Continuation: How likely are you to continue using Agent Laboratory for research?
• Satisfaction: How much did you enjoy using Agent Laboratory?
• Usability: How easy was it for you to build a project using Agent Laboratory?

Each question is answered with a score from 1 to 5, where 1 indicates the lowest agreement and 5 indicates the highest. We find that the overall scores across all experiments are 3.5/5 for utility, 3.75/5 for continuation, 3.63/5 for satisfaction, and 4.0/5 for usability (Figure 7). We also break down average scores by custom and preselected topics. For custom experiments, we find overall scores of 3.75/5 for utility, 4.0/5 for continuation, 3.75/5 for satisfaction, and 3.75/5 for usability. For preselected topics, we find overall scores of 3.25/5 for utility, 3.5/5 for continuation, 3.5/5 for satisfaction, and 4.25/5 for usability. Ratings for preselected topics are lower than for custom topics across all measures except usability, which was 0.5 points higher.
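The comparison between custom and preselected topics is simple arithmetic on the reported means, and a short sketch makes the direction of each difference explicit. The scores below are copied from the text; the dictionary names and the helper `score_deltas` are hypothetical, and the overall value is computed as a plain mean of the two conditions, which assumes each condition received the same number of responses.

```python
# Mean co-pilot survey scores (1-5 scale) as reported in Section 4.2.1.
custom = {"utility": 3.75, "continuation": 4.0, "satisfaction": 3.75, "usability": 3.75}
preselected = {"utility": 3.25, "continuation": 3.5, "satisfaction": 3.5, "usability": 4.25}


def score_deltas(a, b):
    """Return a[k] - b[k] for every survey question k."""
    return {k: a[k] - b[k] for k in a}


# Positive values mean custom topics were rated higher than preselected ones.
deltas = score_deltas(custom, preselected)

# Assumption: equal response counts, so the overall score is the plain mean.
overall = {k: (custom[k] + preselected[k]) / 2 for k in custom}

for question in custom:
    print(f"{question:13s} custom - preselected = {deltas[question]:+.2f}, "
          f"overall = {overall[question]:.3f}")
```

Under these assumptions, usability is the only question where the difference is negative, that is, where preselected topics were rated higher than custom ones.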
From preselected to custom, utility and continuation increased by 0.5 points and satisfaction increased by 0.25 points.

We also evaluated across the same questions reported in Section 4.1. We report an average experimental quality rating of 2.38/5, a report quality rating of 3.13/5, and a usefulness rating of 3.75/5. Custom topics scored higher on report quality, with a rating of 3.5/5 (+0.75), and on usefulness, with a rating of 4.0/5 (+0.5). For experiment quality, preselected topics scored 0.25 points higher, with a score of 2.5/5. Scores across all metrics were lower than the corresponding o1-mini autonomous evaluation results. While report quality was rated only 0.07 points lower, usefulness was rated 0.55 points lower and experiment quality 0.82 points lower.

∗ https://papercopilot.com/statistics/neurips-statistics/neurips-2024-statistics

Figure 7 | Co-pilot evaluation.

Finally, we included an optional question for participants to provide feedback: "How could Agent Laboratory be improved for your research?" For both custom and preselected topics we received a 75% response rate. This feedback included suggestions for improving the Agent Laboratory interface (e.g., adding a GUI, better inspection of intermediate results), adding the option to incorporate more figures into the paper, and improving the literature review phase.

We find that, compared to reviews of Agent Laboratory in autonomous mode from Section 4.1, human co-pilots rated report quality, usefulness, and experiment quality lower. From feedback provided by researchers, we find the reduction in scores is due to difficulty guiding the agents to execute their exact vision for the project. We discuss these limitations in greater detail in Section 5.

4.2.2. Evaluation of co-pilot generated papers

To assess the quality of papers generated by Agent Laboratory in co-pilot mode, we conduct evaluations using two approaches: (1) researchers self-assessed their generated papers based on NeurIPS-style criteria, and (2) external researchers provided evaluations of the same papers. This section aims to understand differences in scores between self-assessment and external assessment, as well as how these assessments compare to Agent Laboratory in fully autonomous mode. We use the same NeurIPS criteria introduced in Section 4.1.1.

Self-evaluation. From the results of the self-evaluation (Figure 7), we found that the average overall score increased relative to the evaluations of papers generated in autonomous mode, with autonomous papers having an overall average of 3.8/10 and co-pilot papers 4.13/10 (+0.33). The co-pilot average even exceeds that of the best autonomous backend, o1-preview, which averaged 4.0/10. Across individual criteria, scores increased for quality (+0.13), clarity (+0.48), soundness (+0.35), and presentation (+0.33), but decreased for significance (-0.3) and contribution (-0.1).

External evaluation. We compare scores provided through self-evaluation with those provided by a set of external evaluators on the same papers (Figure 7). We find that average scores improve across most criteria in the external assessments, with an overall average of 4.38/10, up from 4.13/10 in self-evaluations. The largest improvements were observed in quality (+0.62), significance (+0.25), and overall (+0.25) scores, suggesting that external reviewers perceived the generated papers to be of higher quality and more significant than did the researchers who produced them.
However, clarity scores decreased (-0.25), indicating potential issues in the articulation of ideas that might have been overlooked during self-assessment. Presentation scores did not change (+0.0), and soundness (+0.13) and contribution (+0.13) increased only slightly. Notably, the external evaluations also reinforce differences between scores for preselected and custom topics. Unlike with the self-evaluated papers, papers on preselected topics were rated slightly higher overall, with improvements observed across several metrics, particularly in quality (+0.5) and significance (+0.5). These findings suggest that researchers evaluating their own papers perceive the work produced on their custom topic as higher quality than the work produced on preselected topics, whereas external evaluators find the opposite to be true.

Comparison with autonomous mode. Comparing scores by external evaluators on autonomous and co-pilot papers (Figure 7), we find that the largest improvements were seen for quality (+0.75), soundness (+0.48), and the overall score (+0.58). Moderate gains were also observed in clarity (+0.23) and presentation (+0.33). In contrast, some metrics showed minimal or no improvement. Significance declined slightly (-0.05), and contribution increased only marginally (+0.03). Our results suggest that papers generated with human involvement are overall evaluated more highly than autonomously generated papers, with much of the focus of human involvement going toward making the paper more presentable (presentation and clarity), while there was less emphasis on improving experimental results (significance and contribution). Finally, we note that co-pilot overall scores, which average 4.38, are still 1.47 points below the average score of 5.85 for an accepted paper at NeurIPS 2024.
Increasing the overall score to match conference standards will likely require improving the contribution and significance of the paper results, which are consistently lower than other evaluation metrics.

4.3. Runtime statistics

Runtime statistics for Agent Laboratory are detailed to provide insight into the computational efficiency and monetary costs associated with different phases of its workflow. In this evaluation, both the time required per phase (measured in seconds) and the costs incurred (calculated in USD) were analyzed to better understand the performance of three model backends: gpt-4o, o1-mini, and o1-preview. These measurements were recorded for each subtask: Literature Review, Plan Formulation, Data Preparation, Running Experiments, Results Interpretation, Report Writing, and Report Refinement.

Figure 8 | Performance and Cost Evaluation. This table summarizes the runtime statistics, cost, and success rates of Agent Laboratory across its workflow phases using three different model backends: gpt-4o, o1-mini, and o1-preview. The metrics include average cost per phase (in USD), average time per phase (in seconds), and success rates for each phase.

Inference time. Across all models, gpt-4o exhibited the fastest execution times, completing the entire workflow in 1165.4 seconds, approximately 3.2x faster than o1-mini and 5.3x faster than o1-preview, which required 3616.8 seconds and 6201.3 seconds, respectively. In most subtasks, gpt-4o demonstrated superior speed, particularly in the Running Experiments and Report Writing phases, where its times were significantly shorter than those of o1-mini and o1-preview. For instance, in Running Experiments, gpt-4o averaged 417.8 seconds, while o1-mini and o1-preview took 2082.5 seconds and 4036.2 seconds, respectively.
Similarly, for Report Writing, gpt-4o completed the task in 572.5 seconds, compared to 827.7 seconds for o1-mini and 1854.2 seconds for o1-preview.

Inference cost. Monetary costs per workflow were also substantially lower for gpt-4o, which averaged just $2.33 for the entire process. This is significantly more cost effective than previous autonomous research workflows (Lu et al. (2024b)), which cost around $15 (6.4x more expensive) to complete using gpt-4o. Other models in our workflow are less cost efficient, such as o1-mini at $7.51 and o1-preview at $13.10, the latter being over 5.6x more expensive than gpt-4o. Among the individual subtasks, gpt-4o consistently had the lowest costs. For example, its costs for Data Preparation and Report Writing were $0.09 and $1.73, respectively, compared to $3.03 and $2.58 for o1-mini, and $0.30 and $9.58 for o1-preview.

Figure 9 | Average score of four methods (MLAB, OpenHands, AIDE, and mle-solver) on a subset of MLE-Bench.

Phase-level Observations. At the phase level, Literature Review was notably efficient for all models in terms of time and cost, with gpt-4o completing it in 92.9 seconds at a cost of $0.12. Meanwhile, o1-mini completed this phase faster (56.8 seconds) but at a slightly higher cost ($0.16). For Plan Formulation, gpt-4o was both the fastest (23.3 seconds) and the cheapest ($0.03), followed closely by o1-preview in cost ($0.04) but not in speed (33.1 seconds). The most expensive phase across models was Report Writing, where costs were driven by the increased computational resources required for writing a long document. o1-preview incurred particularly high costs in this phase ($9.58) despite producing comparable outputs in terms of task success rates.

Success Rates. Overall, every model exhibits reasonably high reliability, with o1-preview achieving the highest average subtask success rate (95.7%) for the entire workflow.
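The average subtask success rates in this paragraph can be approximately reproduced from the per-phase figures given in this subsection. The sketch below makes two assumptions that the text does not state outright: the 60%, 70%, and 80% figures for the literature review are read as success rates, and every phase without an explicitly stated rate is taken to be 100%. The names `STATED` and `mean_success_rate` are hypothetical.

```python
# The seven workflow subtasks listed in Section 4.3.
PHASES = [
    "Literature Review", "Plan Formulation", "Data Preparation",
    "Running Experiments", "Results Interpretation",
    "Report Writing", "Report Refinement",
]

# Per-phase success rates (%) stated in the text.
# Assumption: any phase not listed here succeeded 100% of the time.
STATED = {
    "gpt-4o": {"Literature Review": 60, "Data Preparation": 100},
    "o1-mini": {"Literature Review": 70, "Data Preparation": 80},
    "o1-preview": {"Literature Review": 80, "Data Preparation": 90},
}


def mean_success_rate(stated, phases=PHASES, default=100):
    """Unweighted mean success rate over all phases, filling gaps with `default`."""
    rates = [stated.get(phase, default) for phase in phases]
    return sum(rates) / len(rates)


averages = {model: mean_success_rate(rates) for model, rates in STATED.items()}
for model, avg in averages.items():
    print(f"{model}: {avg:.2f}%")
```

Under these assumptions the means come out to 94.29%, 92.86%, and 95.71%, each within 0.1 percentage points of the reported 94.3%, 92.8%, and 95.7%, which supports reading the literature review figures as success rates.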
Both gpt-4o and o1-mini followed closely at 94.3% and 92.8%. While most tasks had a 100% success rate for each model, the literature review phase had a high rate of failure, with success rates of only 60%, 70%, and 80% for gpt-4o, o1-mini, and o1-preview respectively. The Data Preparation phase showed minor challenges, with o1-mini recording an 80% success rate, compared to 100% for gpt-4o and 90% for o1-preview.

4.4. Evaluating mle-solver on MLE-Bench

Evaluating the entire Agent Laboratory workflow does not provide much information about the ability of mle-solver specifically to solve individual ML problems. In order to evaluate mle-solver more objectively, we use a subset of 10 ML challenges from MLE-Bench (Chan et al. (2024)). MLE-Bench is a benchmark designed to assess the capability of agents in handling real-world ML tasks on Kaggle competitions. This benchmark compares agent performance with human baselines, scores agents with Kaggle’s medal system, and incorporates mechanisms to mitigate contamination and plagiarism risks. We include all challenges focusing on text and tabular data from the low complexity category of MLE-Bench. We provide the following as input to mle-solver: the Kaggle dataset description, distilled knowledge from Kaggle notebooks, and an accessible train and dev set. Instead of using an LLM scoring function, the mle-solver score is evaluated on the dev set, which is a 20% random sample taken from the original training set, with the remaining 80% serving as the training set. All data (dev, test, train) is placed into arrays using the numpy library instead of providing file locations, in order to better emulate the data preparation phase. Once all mle-solver steps have concluded, the final code with the highest score is evaluated on the actual Kaggle test set and a benchmark score is recorded.
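As an illustration of the data setup described above, the following is a minimal sketch of an 80/20 random train/dev split over numpy arrays. It is not the authors' code: the function name `train_dev_split`, the fixed seed, and the toy arrays are all assumptions made for the example.

```python
import numpy as np


def train_dev_split(X, y, dev_fraction=0.2, seed=0):
    """Randomly hold out `dev_fraction` of the rows as a dev set.

    Returns (X_train, y_train, X_dev, y_dev) as numpy arrays.
    """
    rng = np.random.default_rng(seed)          # seeded for reproducibility
    perm = rng.permutation(len(X))             # random ordering of row indices
    n_dev = int(round(len(X) * dev_fraction))  # size of the held-out dev set
    dev_idx, train_idx = perm[:n_dev], perm[n_dev:]
    return X[train_idx], y[train_idx], X[dev_idx], y[dev_idx]


# Toy stand-in for an original Kaggle training set: 50 rows, 2 features.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)
X_train, y_train, X_dev, y_dev = train_dev_split(X, y)
print(X_train.shape, X_dev.shape)
```

Indexing both arrays with the same permutation keeps features and labels aligned, and because the dev indices are drawn from the same permutation as the training indices, the two sets are disjoint by construction.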
We compare average scores across several runs with three other methods: MLAB (Huang et al. (2024), gpt-4o backend), OpenHands (Wang et al. (2024b), gpt-4o backend), and AIDE (Schmidt et al. (2024), o1-preview backend). While mle-solver submitted valid solutions for all MLE-Bench challenges within two hours, prior methods often failed to submit, complicating scoring. We thus calculated average scores for the other methods by excluding invalid submissions and averaging the valid ones. We find that Agent Laboratory’s mle-solver is more consistently high scoring than other solvers, with mle-solver obtaining four medals (two gold, one silver, and one bronze) compared with OpenHands (gpt-4o) obtaining two medals (two gold), AIDE (o1-preview) obtaining two medals (one gold, one bronze), and MLAB obtaining zero medals. Additionally, mle-solver obtained above median human performance on six out of ten benchmarks, with AIDE obtaining five out of ten, OpenHands two out of ten, and MLAB zero out of ten. A detailed overview is provided in Figure 9.

5. Limitations

While our results suggest that Agent Laboratory demonstrates strong performance as a research tool, we now turn to a discussion of limitations that could inform future work. While some of these are also limitations of LLMs themselves, others are not, and we nonetheless provide a thorough and critical discussion of our work. We hope that progress in autonomous research will address these limitations.

5.1. Workflow limitations

Challenges with self-evaluation. The paper-solver is evaluated for quality using LLM-emulated NeurIPS reviewers. This has two limitations: (1) while the reviewing agents were shown to have high alignment with real reviewers (Lu et al. (2024b)), qualitatively, research reports from Agent Laboratory are less satisfying than research papers from The AI Scientist (Lu et al. (2024b)), with ours having lower quality figures, despite Agent Laboratory papers obtaining higher scores overall.
(2) The research reports produced by Agent Laboratory are not meant to replace the paper writing process done by humans, as was the case in The AI Scientist; rather, they are meant to provide a report that helps the human understand what has been accomplished, so that they can scale up the experiment and write their own research report. However, we nonetheless use NeurIPS reviewer scores as the heuristic for the quality of our presented paper-solver, which aims to evaluate the reports from the perspective of a complete research paper. Additionally, in contrast with Lu et al. (2024b), other work demonstrates that LLMs perform less reliably for self-evaluation compared with human reviewers, with lower agreement scores (53.3% vs. 56.1%). Although LLMs demonstrate reasonable consistency, this may stem from reliance on superficial patterns rather than robust evaluation criteria, resulting in discrepancies between LLM and human rankings. This limits LLMs in subjective tasks like research idea evaluation, which is the foundation of mle-solver and paper-solver.

Challenges with automated structure. There are also some limitations that arise from the structure enforced in the workflow. For example, paper-solver is encouraged to organize the paper into a relatively fixed structure (abstract, introduction, etc.), which disallows unique paper organizations and section orders. Another limitation is that mle-solver and paper-solver are limited to generating only two figures for the paper. This can be solved in future work by allowing all of the figures generated by the mle-solver (without restriction) to be incorporated into paper-solver by detecting image files and providing those paths to the solver. Agent Laboratory is also not able to manage repository-level code on its own; rather, the appropriate files are provided to it at each necessary step, and files are saved based on which phase produced them.
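The figure-collection idea mentioned above (detecting image files and passing their paths to the solver) could be sketched roughly as follows. This is a hypothetical helper, not part of Agent Laboratory: the function name `find_figure_paths`, the set of file extensions, and the idea of scanning a single output directory are all assumptions.

```python
from pathlib import Path

# Assumption: figures are saved with one of these common extensions.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".pdf", ".svg"}


def find_figure_paths(output_dir):
    """Return sorted paths of all image files found under `output_dir`.

    Subdirectories are searched recursively and extensions are matched
    case-insensitively.
    """
    root = Path(output_dir)
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
```

The resulting list of paths could then be included in the prompt or context given to the report-writing step, so that every generated figure is available rather than a fixed number.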
Enabling flexible repository-level file modification and execution is a clear next step for future work.

Challenges with hallucination. While uncommon, we also found that some of the research papers, particularly those from lower performing models such as gpt-4o, contained hallucinations regarding experimental results that did not occur, such as the following example from a gpt-4o paper on the topic of "Are image transformers more or less sensitive to noise than convolutional networks?": “Hyperparameter optimization played a crucial role in achieving these results. The learning rate was set at 0.001, with a batch size of 32, and the number of reasoning steps $L = \{l_1, l_2, \ldots, l_n\}$ varied between 5 to 10, depending on the complexity of the query. The model was trained over 50 epochs, with early stopping criteria applied to prevent overfitting.” While hallucination is more generally a problem with LLMs themselves, future work must appropriately address these challenges in order to prevent misinformation from being propagated when using automated research tools.

5.2. Common failure modes

In addition to the limitations outlined in Section 5.1, we also outline common failure modes observed during the runtime of Agent Laboratory. The most common failure modes are listed below:

• Many of the more capable models (gpt-4o, o1-mini, o1-preview) struggled with instruction-following during the literature review phase, and had a tendency to repeatedly use the summarize command until the maximum number of phase steps had been reached, leading to a termination.
• Retrieved papers during the literature review phase were observed to reach the maximum token limit for some models.
• When generating figures for the paper using mle-solver, the figure legends, titles, or often
• Experiments run by mle-solver sometimes obtain 0% accuracy for all tested methods, which is not corrected by the agent by the time mle-solver runs out of solving steps.
• mle-solver has a tendency to edit line 0 more than other lines in the code, causing the replace command to more often lead to successful code compiles.
• Printed output from the data preparation or experimental results can lead to the LLMs reaching their token limit.
• mle-solver often generated the python exit() command, which terminated the entire process. This had to be detected and removed manually.
• mle-solver has been observed to run system commands on the host computer using the subprocess.run() command. While nothing problematic has been observed, safeguards should be implemented around this.
• paper-solver often struggles to search for relevant papers using the arXiv engine. Before a limit on search attempts was enforced, it could take up to 100 tries for a successful search query to return any papers. A limit of 5 was put in place thereafter to prevent this cycle.

5.3. Ethical considerations

Agent Laboratory offers potential to accelerate the field of machine learning research by automating time-intensive tasks and enabling researchers to focus on ideation and experimental design. However, its capabilities also bring ethical challenges that require careful consideration. The ability to autonomously generate research code, reports, and experiment plans may inadvertently lower the barriers to producing substandard or misleading scientific outputs. This could overwhelm peer review systems and jeopardize the integrity of academic discourse. Furthermore, the automated processes may reflect or even amplify biases inherent in the underlying datasets or algorithms, leading to skewed outcomes in research findings. Transparent disclosure of AI involvement in research outputs is important in order to mitigate such risks and maintain accountability.
There are additional concerns about potential misuse of Agent Laboratory for unethical purposes, such as developing harmful technologies or generating content that bypasses ethical oversight. For instance, the misuse of autonomous research agents in fields like cybersecurity could lead to the automated creation of malware (Begou et al. (2023); Francia et al. (2024); Happe & Cito (2023); Xu et al. (2024)) or in environmental studies, it may generate biased analyses that downplay climate risks or overstate the benefits of certain interventions. Moreover, as the platform matures, the risk of its misuse increases if safeguards are not implemented to ensure alignment with ethical research standards (Jiao et al. (2024); Watkins (2024)). Thus, while Agent Laboratory demonstrates immense promise for accelerating scientific discovery, there is a need for robust governance mechanisms to ensure that the underlying LLMs produce content that aligns with ethical principles and societal values.

6. Discussion

In this paper, we introduce Agent Laboratory, an open-source LLM agent framework for accelerating the individual’s ability to perform research in machine learning. Unlike fully automated research pipelines that attempt to conceive their own research directions, Agent Laboratory is designed as a co-pilot, enabling a more human-centric mode of scientific exploration. Because of this, we present results from human-centered experiments. Our initial evaluations focused on the quality of generated papers in autonomous mode, assessing human evaluations of experimental and report quality, usefulness, as well as reviewer scores based on standard academic criteria across different language models. We also assessed the effectiveness of Agent Laboratory in co-pilot mode, comparing its performance with autonomous mode, receiving positive feedback from researchers.
The findings of this work highlight the variability in performance across LLM backends, with the o1-preview model being rated most useful, while o1-mini demonstrated the highest experimental quality. Autonomous mode outputs, although generally well-received, revealed gaps when evaluated against human expectations for high-quality research papers, particularly in terms of clarity and soundness. We also find that automated reviewer scores do not predict human reviewer scores, demonstrating the importance of human evaluations in automated research. Integrating human feedback in co-pilot mode overall produced higher-quality outputs than autonomous mode, with higher scores across most metrics. The co-pilot feature in Agent Laboratory is overall found to have high utility and usability when rated by human users, with most participants deciding to continue usage after their experience. Finally, runtime and cost analyses demonstrated the efficiency of the framework, with the gpt-4o backend offering the fastest execution and lowest costs. In addition, evaluations of the mle-solver on MLE-Bench demonstrate improved ability to solve general ML problems over previous methods.

Agent Laboratory builds upon an emerging trend in the use of language agents for science, where previous works have shown the potential of LLMs to generate research ideas (Baek et al. (2024); Li et al. (2024a); Si et al. (2024)), implement machine learning projects (Chan et al. (2024); Huang et al. (2024); Jing et al. (2024)), and even produce scientific papers (Lu et al. (2024b)). While many of these prior efforts leverage LLMs as tools to be applied at discrete stages, Agent Laboratory integrates these processes into a single, continuous pipeline that can scale and adapt to the researcher’s desired level of interaction and compute availability.
This allows human researchers to focus more on conceptual design and critical thinking, while Agent Laboratory handles more tedious tasks, such as preprocessing data and coding. We overcome the limitations of prior work, such as The AI Scientist (Lu et al. (2024b)) which does not have human-computer interaction, Virtual Lab (Swanson et al. (2024)) which does not have access to up-to-date knowledge, does not generate research papers, and was only demonstrated for nanobody design, as well as ChemCrow (M. Bran et al. (2024)) and Coscientist (Boiko et al. (2023)) which cannot solve open-ended research problems. However, as was outlined in Limitations (Section 5), there are many areas for improvement in our approach which can be addressed in future work.

A valuable direction for future research could involve a longitudinal study comparing researchers’ outcomes when conducting studies with and without Agent Laboratory, as the human evaluations in this work provide only a snapshot of its utility. Studies of this kind have been conducted with other workflow automation tools, such as GitHub Copilot (Dohmke et al. (2023); Ziegler et al. (2024)), and have demonstrated promising potential for improving productivity. Such a study would help to better understand the long-term impact of Agent Laboratory on research efficiency and its role in improving scientific discovery. It may also be worth exploring automatic agent workflow (Hong et al. (2023); Li et al. (2024c); Zhuge et al. (2024)) and agent generation techniques (Chen et al. (2023a); Hu et al. (2024a)) to optimize the Agent Laboratory workflow.

Conclusion

In conclusion, Agent Laboratory stands as a promising step toward more efficient, human-centered research workflows that leverage the power of LLMs. By integrating specialized autonomous agents guided by human oversight, our approach can help researchers spend less time on repetitive tasks and more time on the creative, conceptual aspects of their work.
We hope that Agent Laboratory may ultimately serve as a tool to enable scientific discovery.

References

Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E Jimenez, Farshad Khorrami, et al. Enigma: Enhanced interactive generative model agent for ctf challenges. arXiv preprint arXiv:2409.16165, 2024.
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, and Tianyu Gao. Litsearch: A retrieval benchmark for scientific literature search. arXiv preprint arXiv:2407.18940, 2024.
Altera AL, Andrew Ahn, Nic Becker, Stephanie Carroll, Nico Christie, Manuel Cortes, Arda Demirci, Melissa Du, Frankie Li, Shuying Luo, et al. Project sid: Many-agent simulations toward ai civilization. arXiv preprint arXiv:2411.00114, 2024.
Barrett R Anderson, Jash Hemant Shah, and Max Kreminski. Homogenization effects of large language models on human creative ideation. In Proceedings of the 16th Conference on Creativity & Cognition, pp. 413–425, 2024.
AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 1, 2024.
Joshua Ashkinaze, Julia Mendelsohn, Li Qiwei, Ceren Budak, and Eric Gilbert. How ai ideas affect the creativity, diversity, and evolution of human ideas: Evidence from a large, dynamic experiment. arXiv preprint arXiv:2401.13481, 2024.
Ashwini Ashokkumar, Luke Hewitt, Isaias Ghezae, and Robb Willer. Predicting results of social science experiments using large language models. Technical report, Working Paper, 2024.
Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. Researchagent: Iterative research idea generation over scientific literature with large language models. arXiv preprint arXiv:2404.07738, 2024.
Nils Begou, Jérémy Vinoy, Andrzej Duda, and Maciej Korczyński. Exploring the dark side of ai: Advanced phishing attack design and deployment using chatgpt. In 2023 IEEE Conference on Communications and Network Security (CNS), pp. 1–6. IEEE, 2023.
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. 𝜋0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023.
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Tuhin Chakrabarty, Philippe Laban, Divyansh Agarwal, Smaranda Muresan, and Chien-Sheng Wu. Art or artifice? large language models and the false promise of creativity. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–34, 2024.
Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. Mle-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024.
Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F Karlsson, Jie Fu, and Yemin Shi. Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288, 2023a.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. In The Twelfth International Conference on Learning Representations, 2023b.
Xiuying Chen, Tairan Wang, Taicheng Guo, Kehan Guo, Juexiao Zhou, Haoyang Li, Mingchen Zhuge, Jürgen Schmidhuber, Xin Gao, and Xiangliang Zhang. Scholarchemqa: Unveiling the power of language models in chemical research question answering. arXiv preprint arXiv:2407.16931, 2024a.
Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, et al. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery. arXiv preprint arXiv:2410.05080, 2024b.
Mike D’Arcy, Tom Hope, Larry Birnbaum, and Doug Downey. Marg: Multi-agent review generation for scientific papers. arXiv preprint arXiv:2401.04259, 2024.
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2024.
Ning Ding, Shang Qu, Linhai Xie, Yifei Li, Zaoqu Liu, Kaiyan Zhang, Yibai Xiong, Yuxin Zuo, Zhangren Chen, Ermo Hua, et al. Automating exploratory proteomics research via language models. arXiv preprint arXiv:2411.03743, 2024.
Thomas Dohmke, Marco Iansiti, and Greg Richards. Sea change in software development: Economic and productivity analysis of the ai-powered developer lifecycle. arXiv preprint arXiv:2306.15033, 2023.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, and Daniel Kang. Llm agents can autonomously hack websites. arXiv preprint arXiv:2402.06664, 2024.
Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J R Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, 2022.
Xidong Feng, Yicheng Luo, Ziyan Wang, Hongrui Tang, Mengyue Yang, Kun Shao, David Mguni, Yali Du, and Jun Wang. Chessgpt: Bridging policy learning and language modeling. Advances in Neural Information Processing Systems, 36, 2024.
Jerson Francia, Derek Hansen, Ben Schooley, Matthew Taylor, Shydra Murray, and Greg Snow. Assessing ai vs human-authored spear phishing sms attacks: An empirical study using the trapd method. arXiv preprint arXiv:2406.13049, 2024.
Alireza Ghafarollahi and Markus J Buehler. Protagents: protein discovery via large language model multi-agent collaborations combining physics and machine learning. Digital Discovery, 2024a.
Alireza Ghafarollahi and Markus J Buehler. Sciagents: Automating scientific discovery through multi-agent intelligent graph reasoning. arXiv preprint arXiv:2409.05556, 2024b.
Antoine Grosnit, Alexandre Maraval, James Doran, Giuseppe Paolo, Albert Thomas, Refinath Shahul Hameed Nabeezath Beevi, Jonas Gonzalez, Khyati Khandelwal, Ignacio Iacobacci, Abdelhakim Benechehab, et al. Large language models orchestrating structured reasoning achieve kaggle grandmaster level. arXiv preprint arXiv:2411.03562, 2024.
Ken Gu, Ruoxi Shang, Ruien Jiang, Keying Kuang, Richard-John Lin, Donghe Lyu, Yue Mao, Youran Pan, Teng Wu, Jiaqian Yu, et al. Blade: Benchmarking language model agents for data-driven science. arXiv preprint arXiv:2408.09667, 2024.
Siyuan Guo, Cheng Deng, Ying Wen, Hechang Chen, Yi Chang, and Jun Wang. Ds-agent: Automated data science by empowering large language models with case-based reasoning. arXiv preprint arXiv:2402.17453, 2024.
Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856, 2023.
Nam Le Hai, Dung Manh Nguyen, and Nghi DQ Bui. Repoexec: Evaluate code generation with a repository-level executable benchmark. arXiv preprint arXiv:2406.11927, 2024.
Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings. Advances in Neural Information Processing Systems, 36, 2024.
Andreas Happe and Jürgen Cito. Getting pwn’d by ai: Penetration testing with large language models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 2082–2086, 2023.
Tomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model. bioRxiv, pp. 2024–07, 2024.
Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919, 2024.
Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. arXiv preprint arXiv:2408.08435, 2024a.
Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, et al. Infiagent-dabench: Evaluating agents on data analysis tasks. arXiv preprint arXiv:2401.05507, 2024b.
Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022.
Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. In Forty-first International Conference on Machine Learning, 2024.
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay, and Roy Kishony. Autonomous llm-driven research from data to human-verifiable research papers. arXiv preprint arXiv:2404.17605, 2024.
Junfeng Jiao, Saleh Afroogh, Yiming Xu, and Connor Phillips. Navigating llm ethics: Advancements, challenges, and future directions. arXiv preprint arXiv:2406.18841, 2024.
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.
Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, and Dong Yu. Dsbench: How far are data science agents to becoming data science experts? arXiv preprint arXiv:2409.07703, 2024.
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
Hao Kang and Chenyan Xiong. Researcharena: Benchmarking llms’ ability to collect and organize information as research agents. arXiv preprint arXiv:2406.10291, 2024.
Ji Woong Kim, Tony Z Zhao, Samuel Schmidgall, Anton Deguet, Marin Kobilarov, Chelsea Finn, and Axel Krieger. Surgical robot transformer (srt): Imitation learning for surgical tasks. In 8th Annual Conference on Robot Learning, 2024.
Jakub Lála, Odhran O’Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G Rodriques, and Andrew D White. Paperqa: Retrieval-augmented generative agent for scientific research. arXiv preprint arXiv:2312.07559, 2023.
Steven A Lehr, Aylin Caliskan, Suneragiri Liyanage, and Mahzarin R Banaji. Chatgpt as research scientist: Probing gpt’s capabilities as a research librarian, research ethicist, data generator, and data predictor. Proceedings of the National Academy of Sciences, 121(35):e2404328121, 2024.
Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008, 2023.
Long Li, Weiwen Xu, Jiayan Guo, Ruochen Zhao, Xingxuan Li, Yuqian Yuan, Boqiang Zhang, Yuming Jiang, Yifei Xin, Ronghao Dang, et al. Chain of ideas: Revolutionizing research via novel idea development with llm agents. arXiv preprint arXiv:2410.13185, 2024a.
Sihang Li, Jin Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, and Hengxing Cai. Scilitllm: How to adapt llms for scientific literature understanding. arXiv preprint arXiv:2408.15545, 2024b.
Zelong Li, Shuyuan Xu, Kai Mei, Wenyue Hua, Balaji Rama, Om Raheja, Hao Wang, He Zhu, and Yongfeng Zhang. Autoflow: Automated workflow generation for large language model agents. arXiv preprint arXiv:2407.12821, 2024c.
Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Yi Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Scott Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis. NEJM AI, 1(8):AIoa2400196, 2024.
Xinna Lin, Siqi Ma, Junjie Shan, Xiaojing Zhang, Shell Xu Hu, Tiannan Guo, Stan Z Li, and Kaicheng Yu. Biokgbench: A knowledge graph checking benchmark of ai agent for biomedical science. arXiv preprint arXiv:2407.00466, 2024.
Yiren Liu, Si Chen, Haocong Cheng, Mengxia Yu, Xiao Ran, Andrew Mo, Yiliu Tang, and Yun Huang. How ai processing delays foster creativity: Exploring research question co-creation with an llm-based agent. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–25, 2024.
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024a.
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024b.
Xiaoliang Luo, Akilles Rechardt, Guangzhi Sun, Kevin K Nejad, Felipe Yáñez, Bati Yilmaz, Kangjoo Lee, Alexandra O Cohen, Valentina Borghesani, Anton Pashkov, et al. Large language models surpass human experts in predicting neuroscience results. Nature Human Behaviour, pp. 1–11, 2024.
Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Augmenting large language models with chemistry tools. Nature Machine Intelligence, pp. 1–11, 2024.
Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. Discoverybench: Towards data-driven discovery with large language models. arXiv preprint arXiv:2407.01725, 2024.
Benjamin S Manning, Kehang Zhu, and John J Horton. Automated social science: Language models as scientist and subjects. Technical report, National Bureau of Economic Research, 2024.
Daniel McDuff, Mike Schaekermann, Tao Tu, Anil Palepu, Amy Wang, Jake Garrison, Karan Singhal, Yash Sharma, Shekoofeh Azizi, Kavita Kulkarni, et al. Towards accurate differential diagnosis with large language models. arXiv preprint arXiv:2312.00164, 2023.
Amil Merchant, Simon Batzner, Samuel S Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery. Nature, 624(7990):80–85, 2023.
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
OpenAI. Introducing chatgpt. https://openai.com/index/chatgpt/, November 2022. Blog post.
OpenAI. Introducing openai o1-preview, September 2024. URL https://openai.com/index/introducing-openai-o1-preview/. Accessed: 2024-09.
Vishakh Padmakumar and He He. Does writing with language models reduce content diversity? In The Twelfth International Conference on Learning Representations, 2024.
Huy Nhat Phan, Tien N Nguyen, Phong X Nguyen, and Nghi DQ Bui. Hyperagent: Generalist software engineering agents to solve coding tasks at scale. arXiv preprint arXiv:2409.16299, 2024.
Ori Press, Andreas Hochlehnert, Ameya Prabhu, Vishaal Udandarao, Ofir Press, and Matthias Bethge. Citeme: Can language models accurately cite scientific claims? arXiv preprint arXiv:2407.12861, 2024.
Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents. arXiv preprint arXiv:2408.07199, 2024.
Edward O Pyzer-Knapp, Jed W Pitera, Peter WJ Staar, Seiji Takeda, Teodoro Laino, Daniel P Sanders, James Sexton, John R Smith, and Alessandro Curioni. Accelerating materials discovery using artificial intelligence, high performance computing and robotics. npj Computational Materials, 8(1):84, 2022.
Chen Qian, Yufan Dang, Jiahao Li, Wei Liu, Zihao Xie, Yifei Wang, Weize Chen, Cheng Yang, Xin Cong, Xiaoyin Che, et al. Experiential co-learning of software-developing agents. arXiv preprint arXiv:2312.17025, 2023.
Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15174–15186, 2024.
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023.
Matthew Renze and Erhan Guven. Self-reflection in llm agents: Effects on problem-solving performance. arXiv preprint arXiv:2405.06682, 2024.
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Yacmpz84TH.
Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical environments. arXiv preprint arXiv:2405.07960, 2024.
Dominik Schmidt, Zhengyao Jiang, and Yuxiang Unknown. Introducing weco aide, 2024. URL https://www.weco.ai/blog/technical-report.
Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pp. 3135–3144. PMLR, 2017.
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can llms generate novel research ideas? a large-scale human study with 100+ nlp researchers. arXiv preprint arXiv:2409.04109, 2024.
Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, et al. Cs-bench: A comprehensive benchmark for large language models towards computer science mastery. arXiv preprint arXiv:2406.08587, 2024.
Kyle Swanson, Wesley Wu, Nash L Bulaong, John E Pak, and James Zou. The virtual lab: Ai agents design new sars-cov-2 nanobodies with experimental validation. bioRxiv, pp. 2024–11, 2024.
Nathan J Szymanski, Bernardus Rendy, Yuxing Fei, Rishi E Kumar, Tanjin He, David Milsted, Matthew J McDermott, Max Gallant, Ekin Dogus Cubuk, Amil Merchant, et al. An autonomous laboratory for the accelerated synthesis of novel materials. Nature, 624(7990):86–91, 2023.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
Tao Tu, Anil Palepu, Mike Schaekermann, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Nenad Tomasev, et al. Towards conversational diagnostic ai. arXiv preprint arXiv:2401.05654, 2024.
A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
Shengye Wan, Cyrus Nikolaidis, Daniel Song, David Molnar, James Crnkovich, Jayson Grace, Manish Bhatt, Sahana Chennabasappa, Spencer Whitman, Stephanie Ding, et al. Cyberseceval 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models. arXiv preprint arXiv:2408.01605, 2024.
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024a.
Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Opendevin: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024b.
Ryan Watkins. Guidance for researchers and peer-reviewers on the ethical use of large language models (llms) in scientific research workflows. AI and Ethics, 4(4):969–974, 2024.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang. Cycleresearcher: Improving automated research via automated review. arXiv preprint arXiv:2411.00816, 2024.
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.
Jiacen Xu, Jack W Stokes, Geoff McDonald, Xuesong Bai, David Marshall, Siyue Wang, Adith Swaminathan, and Zhou Li. Autoattacker: A large language model guided system to implement automatic cyber-attacks. arXiv preprint arXiv:2403.01038, 2024.
John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793, 2024.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.
Xingjian Zhang, Yutong Xie, Jin Huang, Jinge Ma, Zhaoying Pan, Qijia Liu, Ziyang Xiong, Tolga Ergen, Dongsub Shim, Honglak Lee, et al. Massw: A new dataset and benchmark tasks for ai-assisted scientific workflows. arXiv preprint arXiv:2406.06357, 2024.
Yilun Zhou, Caiming Xiong, Silvio Savarese, and Chien-Sheng Wu. Shared imagination: Llms hallucinate alike. arXiv preprint arXiv:2407.16604, 2024.
Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Gptswarm: Language agents as optimizable graphs. In Forty-first International Conference on Machine Learning, 2024.
Albert Ziegler, Eirini Kalliamvakou, X Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. Measuring github copilot’s impact on productivity. Communications of the ACM, 67(3):54–63, 2024.

A. Agent Laboratory configuration

A.1. Hyperparameters

Table 1 | Hyperparameters for Agent Laboratory.

Category            | Hyperparameter                | Value
Literature Review   | Number of Paper Summaries     | 5
                    | Full Text History Decay Steps | 3
                    | Agent temperature             | 0.8
Data Preparation    | Experiment Timeout            | 120s
Running Experiments | mle-solver steps              | 3
                    | Code repair attempts          | 2
                    | Maximum top codes             | 2
                    | Error history length          | 5
                    | Code history length           | 2
                    | Number of comparison trials   | 2
                    | Experiment Timeout            | 600s
                    | Score generation temperature  | 0.6
                    | Repair temperature            | 0.8
                    | Initial code temperature      | 1.0
                    | Solver temperature            | 1.0
Paper Writing       | paper-solver steps            | 5
                    | Maximum top papers            | 1
                    | Paper history length          | 10
                    | Number of Reviewers           | 1
                    | Number of comparison trials   | 2
                    | Solver temperature            | 1.0
                    | Initial paper temperature     | 0.8
Paper Refinement    | Number of Reviewers           | 3

A.2. Hardware

All experiments in this paper were run on a 2023 MacBook Pro with an Apple M3 Max processor and 36 GB of memory.

B. Prompts

B.1. Base Inference Prompt

Base System Prompt
You are {self.role_description()}
Task instructions: {self.phase_prompt(phase)}
{self.command_descriptions(phase)}

Base Prompt
{context_prompt}
History: {history_str}
Current Step #{step} Phase: {phase}
{complete_str}
[Objective] Your goal is to perform research on the following topic: {research_topic}
Feedback: {feedback}
Notes: {notes_str}
Your previous command was: {self.prev_comm}. Make sure your new output is different.
Please produce a single command below:

Phase Notes (notes_str)
Notes for the task objective: {phase_notes}

Complete String
The complete string is normally empty. However, once the number of steps reaches 70% of the way toward completion, the following is appended to the base prompt to encourage the agent to produce a submission.

Complete String (complete_str)
You must finish this task and submit as soon as possible!

History Line
Step #{step}, Phase: {phase}, Feedback: {feedback}, Your response: {model_resp}

B.2. Context Prompts

Context Prompt
{sr_str} {context_prompt}

Context Prompt Second Round String (sr_str)
The following are results from the previous experiments Previous Experiment code: {self.prev_results_code} Previous Results: {self.prev_exp_results} Previous Interpretation of results: {self.prev_interpretation} Previous Report: {self.prev_report} {self.reviewer_response}

Context Prompt Plan Formulation
Current Literature Review: {self.lit_review_summary}

Context Prompt Data Preparation
Current Literature Review: {self.lit_review_summary} Current Plan: {self.plan}

Context Prompt Results Interpretation
Current Literature Review: {lit_review_sum} Current Plan: {self.plan} Current Dataset code: {self.dataset_code} Current Experiment code: {self.results_code} Current Results: {self.exp_results}

Context Prompt Report Refinement
Current Literature Review: {lit_review_sum} Current Plan: {self.plan} Current Dataset code: {self.dataset_code} Current Experiment code: {self.results_code} Current Results: {self.exp_results} Current Interpretation of results: {self.interpretation}

B.3. Agent Phase Descriptions

B.3.1. PhD Student phase

PhD Literature Review Phase Prompt
Your goal is to perform a literature review for the presented task and add papers to the literature review.
You have access to arXiv and can perform two search operations: (1) finding many different paper summaries from a search query and (2) getting a single full paper text for an arXiv paper.

PhD Plan Formulation Phase Prompt
You are a PhD student being directed by a postdoc who will help you come up with a good plan, and you interact with them through dialogue. Your goal is to produce plans that would make good experiments for the given topic. You should aim for a very simple experiment that showcases your plan, not a complex one. You should integrate the provided literature review and come up with plans on how to expand and build on these works for the given topic. Your plans should provide a clear outline for how to achieve the task, including what machine learning models to use and implement, what types of datasets should be searched for and used to train the model, and the exact details of the experiment.

PhD Data Preparation Phase Prompt
You are a PhD student directing a machine learning engineer, where the machine learning engineer will be writing the code, and you can interact with them through dialogue. Your goal is to help the ML engineer produce code that prepares the data for the provided experiment. You should aim for very simple code to prepare the data, not complex code. You should integrate the provided literature review and the plan and come up with code to prepare data for this experiment.

PhD Results Interpretation Phase Prompt
You are a PhD student being directed by a postdoc who will help you come up with an interpretation for results from an experiment, and you interact with them through dialogue. Your goal is to interpret results from experiments that were previously run. You should read through the code and look at the results to understand what occurred. You should then discuss with the postdoc your interpretation and use their feedback to improve your thoughts.
You should integrate the provided literature review, code, and plans to come up with an exciting interpretation that could make a compelling paper. Your plans should provide a clear outline that can be used to write an academic paper. Your interpretation should include numbers, relevant metrics to the experiment (e.g., accuracy or loss) and measures of significance. You must propagate this information accurately. You must submit the interpretation during this phase in a reasonable amount of time. Do not delay the submission.

PhD Report Refinement Phase Prompt
You are a PhD student who has submitted their paper to an ML conference called ICLR. Your goal was to write a research paper and get high scores from the reviewers so that it get accepted to the conference.

B.4. Machine Learning Engineer Phase Descriptions

ML Engineer Data Preparation Phase Prompt
You are a machine learning engineer being directed by a PhD student who will help you write the code, and you can interact with them through dialogue. Your goal is to produce code that prepares the data for the provided experiment. You should aim for simple code to prepare the data, not complex code. You should integrate the provided literature review and the plan and come up with code to prepare data for this experiment.

B.5. Postdoc Phase Descriptions

Postdoc Plan Formulation Prompt
You are directing a PhD student to help them come up with a good plan, and you interact with them through dialogue. Your goal is to produce plans that would make good experiments for the given topic. You should aim for a very simple experiment that showcases your plan, not a complex one.
You should integrate the provided literature review and come up with plans on how to expand and build on these works for the given topic. Your plans should provide a clear outline for how to achieve the task, including what machine learning models to use and implement, what types of datasets should be searched for and used to train the model, and the exact details of the experiment.

Postdoc Results Interpretation Phase Prompt
You are directing a PhD student to help them come up with an interpretation for results from an experiment, and you interact with them through dialogue. Your goal is to interpret results from experiments that were previously run. You should read through the code and look at the results to understand what occurred. You should then discuss with the PhD student how they can interpret the results and give their feedback to improve their thoughts. You should integrate the provided literature review, code, and plans to come up with an exciting interpretation that could make a compelling paper. Your plans should provide a clear outline that can be used to write an academic paper. Your interpretation should include numbers, relevant metrics to the experiment (e.g., accuracy or loss) and measures of significance. You must propagate this information accurately. You must also complete this in a reasonable amount of time and then submit your results.

B.6. Agent Command Description

B.6.1. PhD Student Command Description

PhD Student Literature Review Command Prompt
To collect paper summaries, use the following command: ```SUMMARY SEARCH QUERY ``` where SEARCH QUERY is a string that will be used to find papers with semantically similar content and SUMMARY is just the word SUMMARY.
To get the full paper text for an arXiv paper, use the following command: ```FULL_TEXT arXiv paper ID ``` where arXiv paper ID is the ID of the arXiv paper (which can be found by using the SUMMARY command), and FULL_TEXT is just the word FULL_TEXT. Make sure to read the full text using the FULL_TEXT command before adding it to your list of relevant papers. If you believe a paper is relevant to the research project proposal, you can add it to the official review after reading using the following command: ```ADD_PAPER arXiv_paper_ID PAPER_SUMMARY ``` where arXiv_paper_ID is the ID of the arXiv paper, PAPER_SUMMARY is a brief summary of the paper, and ADD_PAPER is just the word ADD_PAPER. You can only add one paper at a time. Make sure to use ADD_PAPER when you see a relevant paper. DO NOT use SUMMARY too many times. You can only use a single command per inference turn. Do not use more than one command per inference. If you use multiple commands, then only one of them will be executed, not both. Make sure to extensively discuss the experimental results in your summary. When performing a command, make sure to include the three ticks ( ```) at the top and bottom ```COMMAND text ```where COMMAND is the specific command you want to run (e.g., ADD_PAPER, FULL_TEXT, SUMMARY). Do not use the word COMMAND make sure to use the actual command, e.g., your command should look exactly like this: ```ADD_PAPER text ```(where the command could be from ADD_PAPER, FULL_TEXT, SUMMARY)

PhD Student Plan Formulation Command Prompt
You can produce dialogue using the following command: ```DIALOGUE dialogue here ``` where ’dialogue here’ is the actual dialogue you will send and DIALOGUE is just the word DIALOGUE.
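The command format used in the prompts above (a block fenced by three backticks whose first word is the command) lends itself to a small parser. The following is a minimal sketch under stated assumptions, not the framework's actual parser: the names `parse_first_command` and `LIT_REVIEW_COMMANDS`, the regular expression, and the choice to run the first recognised command when several appear are all hypothetical. The prompts only state that a single command is executed per turn.

```python
import re
from typing import Optional, Sequence, Tuple

# Command words taken from the PhD student prompts above; the tuple itself is hypothetical.
LIT_REVIEW_COMMANDS = ("SUMMARY", "FULL_TEXT", "ADD_PAPER", "DIALOGUE")

# Three backticks, built programmatically so this example stays readable inside a fence.
FENCE = "`" * 3

# Opening fence, command word, optional newline, lazily matched body, closing fence.
_BLOCK = re.compile(FENCE + r"\s*([A-Za-z_]+)[ \t]*\n?(.*?)" + FENCE, re.DOTALL)


def parse_first_command(
    response: str, allowed: Sequence[str] = LIT_REVIEW_COMMANDS
) -> Optional[Tuple[str, str]]:
    """Return (command, argument) for the first recognised fenced command, else None.

    Assumption: when a response holds several commands, the first recognised
    one wins. Unknown command words, including the literal placeholder
    COMMAND that the prompts warn against, are skipped.
    """
    for match in _BLOCK.finditer(response):
        name, body = match.group(1), match.group(2).strip()
        if name in allowed:
            return name, body
    return None
```

For an ADD_PAPER command, the returned argument could then be split once on whitespace to separate the arXiv ID from the summary text.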
PhD Student Data Preparation Command Prompt
You can produce dialogue using the following command: ```DIALOGUE dialogue here ``` where ’dialogue here’ is the actual dialogue you will send and DIALOGUE is just the word DIALOGUE. When you and the ML engineer have finalized your dataset preparation code and are ready to submit the final code, please use the following command: ```SUBMIT_CODE code here ``` where ’code here’ is the finalized code you will send and SUBMIT_CODE is just the word SUBMIT_CODE. The submitted code must have a HuggingFace dataset import and must use an external HuggingFace dataset. If your code returns any errors, they will be provided to you, and you are also able to see print statements. Make sure function variables are created inside the function or passed as a function parameter. DO NOT CREATE A MAIN FUNCTION. Make sure to submit code in a reasonable amount of time. Do not make the code too complex, try to make it simple. Do not take too long to submit code. Submit the code early. You should submit the code ASAP. You can only use a single command per inference turn. Do not use more than one command per inference. If you use multiple commands, then only one of them will be executed, not both. When performing a command, make sure to include the three ticks ( ```) at the top and bottom ```COMMAND text ```where COMMAND is the specific command you want to run (e.g., SUBMIT_CODE, DIALOGUE).

PhD Student Results Interpretation Command Prompt
You can produce dialogue using the following command: ```DIALOGUE dialogue here ``` where ’dialogue here’ is the actual dialogue you will send and DIALOGUE is just the word DIALOGUE. When performing a command, make sure to include the three ticks ( ```) at the top and bottom ```COMMAND text ```where COMMAND is the specific command you want to run (e.g., DIALOGUE).

B.6.2. ML Engineer Agent Command Description

ML Engineer Data Preparation Command Prompt
You can produce code using the following command: ```python code here ``` where code here is the actual code you will execute in a Python terminal, and python is just the word python. If your code returns any errors, they will be provided to you, and you are also able to see print statements. You will receive all print statement results from the code. Make sure function variables are created inside the function or passed as a function parameter. You can produce dialogue using the following command: ```DIALOGUE dialogue here ``` where dialogue here is the actual dialogue you will send, and DIALOGUE is just the word DIALOGUE. You also have access to HuggingFace datasets. You can search the datasets repository using the following command: ```SEARCH_HF search query here ```where search query here is the query used to search HuggingFace datasets, and SEARCH_HF is the word SEARCH_HF. This will return a list of HuggingFace dataset descriptions which can be loaded into Python using the datasets library. Your code MUST use an external HuggingFace directory. You MUST use a HuggingFace dataset in your code. DO NOT CREATE A MAIN FUNCTION. Try to make the code very simple. You can only use a SINGLE command per inference turn. Do not use more than one command per inference. If you use multiple commands, then only one of them will be executed, NOT BOTH. When performing a command, make sure to include the three ticks ( ```) at the top and bottom ```COMMAND text ```where COMMAND is the specific command you want to run (e.g., python, DIALOGUE, SEARCH_HF).

B.6.3. Postdoc Agent Command Description

Postdoc Plan Formulation Command Prompt
You can produce dialogue using the following command: ```DIALOGUE dialogue here ``` where dialogue here is the actual dialogue you will send and DIALOGUE is just the word DIALOGUE.
When you believe a good plan has been arrived at between you and the PhD student you can use the following command to end the dialogue and submit the plan ```PLAN plan here ``` where plan here is the actual plan to be transmitted and PLAN is just the word PLAN. Plan here should provide a clear outline for how to achieve the task, including what machine learning models to use and implement, what types of datasets should be searched for and used to train the model, and the exact details of the experiment. You can only use a SINGLE command per inference turn. Do not use more than one command per inference. If you use multiple commands, then only one of them will be executed, NOT BOTH. Make sure not to produce too much dialogue and to submit an plan in reasonable time. When performing a command, make sure to include the three ticks ( ```) at the top and bottom ```COMMAND text ```where COMMAND is the specific command you want to run (e.g., PLAN, DIALOGUE).

Postdoc Results Interpretation Command Prompt
When you believe a good interpretation has been arrived at between you and the PhD student you can use the following command to end the dialogue and submit the plan ```INTERPRETATION interpretation here ``` where interpretation here is the actual interpretation to be transmitted and INTERPRETATION is just the word INTERPRETATION. Please provide an INTERPRETATION in a reasonable amount of time. You can produce dialogue using the following command: ```DIALOGUE dialogue here ``` where dialogue here is the actual dialogue you will send and DIALOGUE is just the word DIALOGUE. You must submit the interpretation during this phase in a reasonable amount of time. Do not delay the submission. When performing a command, make sure to include the three ticks ( ```) at the top and bottom ```COMMAND text ```where COMMAND is the specific command you want to run (e.g., INTERPRETATION, DIALOGUE).

B.7. Agent Role Description

B.7.1. PhD Student Role Description

PhD Student Role Prompt
You are a computer science PhD student at a top university.

B.7.2. Machine Learning Engineer Role Description

Machine Learning Engineer Role Prompt
You are a machine learning engineer working at a top university.

B.7.3. Professor Agent

Professor Role Prompt
You are a computer science professor at a top university.

B.7.4. Postdoc Agent Role Description

Postdoc Role Prompt
You are a computer science postdoctoral student at a top university.

B.8. mle-solver Prompts

B.8.1. Tools

mle-solver Replace Tool
============= REWRITE CODE EDITING TOOL =============
You also have access to a code replacing tool. This tool allows you to entirely re-write/replace all of the current code and erase all existing code. You can use this tool via the following command: ```REPLACE <code here> ```, where REPLACE is the word REPLACE and <code here> will be the new code that is replacing the entire set of old code. This tool is useful if you want to make very significant changes, such as entirely changing the model, or the learning process. Before changing the existing code to be your new code, your new code will be tested and if it returns an error it will not replace the existing code. Try limiting the use of rewriting and aim for editing the code more.

mle-solver Edit Tool
============= CODE EDITING TOOL =============
You also have access to a code editing tool. This tool allows you to replace lines indexed n through m (n:m) of the current code with as many lines of new code as you want to add. This removal is inclusive meaning that line n and m and everything between n and m is removed. This will be the primary way that you interact with code.
You can edit code using the following command: ```EDIT N M <new lines to replace old lines> ```EDIT is the word EDIT, N is the first line index you want to replace and M the the last line index you want to replace (everything inbetween will also be removed), and <new lines to replace old lines> will be the new code that is replacing the old code. Before changing the existing code to be your new code, your new code will be tested and if it returns an error it will not replace the existing code. Your changes should significantly change the functionality of the code.

Professor Agent Scoring System Prompt
You are a professor agent who is serving as an expert reward model that can read a research plan, research code, and code output and are able to determine how well a model followed the plan, built the code, and got the proper output scored from 0 to 1 as a float. You must structure your score exactly in the following way: ```SCORE <score here> ```where SCORE is just the word score, <score here> is a floating point number between 0 and 1 representing how well the model followed the plan, built the code, and got the proper output

Professor Agent Scoring Prompt
Outlined in the following text is the research plan that the machine learning engineer was tasked with building: {outlined_plan} The following text is the research code that the model produced: {code} The following is the output from the model: {code_return}

Code Repair Tool System Prompt
You are an automated code repair tool. Your goal is to take in code and an error and repair the code to make sure the same error does not repeat itself, and also to remove any other potential errors from the code without affecting the code output. Your output should match the original code as closely as possible. You must wrap the code in the following ```python <code here> ``` Do not forget the opening ```python and the closing ```.
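The EDIT tool described earlier in this subsection replaces an inclusive range of lines and keeps the result only if the new code passes a test. A minimal sketch of those semantics follows. The function names are hypothetical and 0-based line indexing is an assumption, since the mle-solver prompt does not state the index base. The `compiles` check is a syntax-only stand-in: the prompt says the new code is tested and rejected if it returns an error, which implies actually running it.

```python
from typing import Callable, List


def apply_edit(lines: List[str], n: int, m: int, new_lines: List[str]) -> List[str]:
    """Replace lines n..m (inclusive; 0-based by assumption) with new_lines."""
    if not (0 <= n <= m < len(lines)):
        raise IndexError(f"invalid edit range {n}:{m} for {len(lines)} lines")
    # Everything from n through m is removed; new_lines may be any length.
    return lines[:n] + list(new_lines) + lines[m + 1:]


def compiles(lines: List[str]) -> bool:
    """Stand-in acceptance check: syntax only (the described system runs the code)."""
    try:
        compile("\n".join(lines), "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def try_edit(
    lines: List[str],
    n: int,
    m: int,
    new_lines: List[str],
    accept: Callable[[List[str]], bool] = compiles,
) -> List[str]:
    """Return the edited code if it passes `accept`; otherwise keep the old code."""
    try:
        candidate = apply_edit(lines, n, m, new_lines)
    except IndexError:
        return lines
    return candidate if accept(candidate) else lines
```

Passing the acceptance check as a parameter keeps the replace-only-if-it-works rule separate from how the candidate is tested, so a sandboxed execution step could be swapped in for `compiles`.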
Code Repair Tool Prompt
Provided here is the error: {error} Provided below is the code: {code}

Initial Code Generation Prompt
{err_hist} You should now use ```REPLACE to create initial code to solve the challenge. Now please enter the ```REPLACE command below:

Initial Code Generation Error Prompt (err_hist)
The following is a history of your previous errors {errs} \nDO NOT REPEAT THESE.

Here the string errs is the concatenation of at most five previous errors (i.e., all errors until their number reaches five, then only five).

Initial Code Generation Error Prompt (err)
The following was the previous command generated: {model_resp}. This was the error return {cmd_str}. You should make sure not to repeat this error and to solve the presented problem.

mle-solver System Prompt
{self.role_description()}. The following are your task instructions: {self.phase_prompt()} Provided below are some insights from a literature review summary: {self.insights} {self.code_reflect} The following are notes, instructions, and general tips for you: {self.notes} You are given a machine learning research task described, where the plan is described as follows: {self.plan} {self.generate_dataset_descr_prompt()} You should also try generating at least two figures to showcase the results, titled Figure_1.png and Figure_2.png Your method MUST not get 0% accuracy. If it does, you have done something wrong and must correct this. Make sure to check your accuracy calculation is correct. Your goal is to solve the research plan as well as possible. You will receive a score after you write the code and should aim to maximize the score by following the plan instructions and writing high quality code. Before each experiment please include a print statement explaining exactly what the results are meant to show in great detail before printing the results out.
The following are commands you have access to: {self.command_descriptions()}. You should try to have a diversity of command responses if appropriate. Do not repeat the same commend too many times. Please consider looking through your history and not repeating commands too many times.

mle-solver Role Description (role_description)
You are an expert machine learning engineer working at a top university to write code to solve machine learning research challenges using your machine learning expertise.

mle-solver Command Description (command_description)
You also have access to tools which can be interacted with using the following structure: ```COMMAND <command information here> ```, where COMMAND is whichever command you want to run (e.g., EDIT, REPLACE...), <command information here> is information used for the command, such as code to run or a search query, and ```are meant to encapsulate the command. ```must be included as part of the command both at the beginning and at the end of the code. DO NOT FORGOT TO HAVE ```AT THE TOP AND BOTTOM OF CODE. and this structure must be followed to execute a command correctly. YOU CAN ONLY EXECUTE A SINGLE COMMAND AT A TIME! Do not try to perform multiple commands EVER only one. Make sure to import everything that you are using. Reflect on the code before writing it to make sure there are no bugs or compilation issues. YOU MUST USE COMMANDS PROPERLY. Do not use the word COMMAND for the command that is incorrect. You must use an actual command (e.g., EDIT, REPLACE...) NOT THE WORD COMMAND. Do not make this mistake. Under no circumstances should you use tensorflow or keras. Only use pytorch for scikitlearn for deep learning.

mle-solver Phase Prompt (phase_prompt)
You are an ML engineer and you will be writing the code for a research project. Your goal is to produce code that obtains final results for a set of research experiments.
You should aim for simple code to collect all results, not complex code. You should integrate the provided literature review and the plan to make sure you are implementing everything outlined in the plan. The dataset code will be added to the beginning of your code always, so this does not need to be rewritten. Make sure you do not write functions, only loose code. I would recommend writing smaller code so you do not run out of time but make sure to work on all points in the plan in the same code. You code should run every experiment outlined in the plan for a single code. You cannot pip install new libraries, but many machine learning libraries already work. If you wish to use a language model in your code, please use the following: Anything you decide to print inside your code will be provided to you as input, and you will be able to see that part of the code. Using print statements is useful for figuring out what is wrong and understanding your code better

Code Execution Error Prompt
The following is the code that was executed:{code} The following error was returned:{error} Reflect on why this error occurred and how you can modify the code to prevent it in the future. Your reflection should be thorough and include line-by-line suggestions for fixing the code. Do not provide entirely new code, just suggestions for edits.

Code Execution Success Prompt
The following is the code that was executed:{code} The code executed successfully and produced a valid result. Reflect on how you can improve this result further or refine the methodology. Provide detailed suggestions without rewriting the entire code.

Reflective Feedback Prompt
Please reflect on ideas for how to improve your current code.
Examine the provided code and think very specifically (with precise ideas) on how to improve performance, which methods to use, how to improve generalization on the test set with line-by-line examples below:

Reflective Feedback System Prompt
Please reflect on the following sets of code: {code_strs} and come up with generalizable insights that will help you improve your performance on this benchmark.

B.9. paper-solver Prompts

paper-solve Replacement Tool
============= PAPER REPLACING TOOL =============
You also have access to a paper replacing tool. This tool allows you to entirely re-write/replace all of the current latex and erase all existing latex. You can use this tool via the following command: ```REPLACE <latex here> ```, where REPLACE is the word REPLACE and <latex here> will be the new latex that is replacing the entire set of old latex. This tool is useful if you want to make very significant changes, such as entirely changing the model, or the learning process. Before changing the existing latex to be your new latex, your new latex will be tested and if it returns an error it will not replace the existing latex. Try limiting the use of rewriting and aim for editing the latex more.

paper-solve Edit Tool
============= PAPER EDITING TOOL =============
You also have access to a paper editing tool. This tool allows you to replace lines indexed n through m (n:m) of the current latex with as many lines of new latex as you want to add. This removal is inclusive meaning that line n and m and everything between n and m is removed. This will be the primary way that you interact with latex.
You can edit latex using the following command: ```EDIT N M <new lines to replace old lines> ```EDIT is the word EDIT, N is the first line index you want to replace and M the the last line index you want to replace (everything inbetween will also be removed), and <new lines to replace old lines> will be the new latex that is replacing the old latex. Before changing the existing latex to be your new latex, your new latex will be tested and if it returns an error it will not replace the existing latex. Your changes should significantly change the latex. You should write new paragraphs and update old ones. Try using the edit command often. Make sure to generate lots of text. You should also avoid editing lines 0 0, and should edit the main text of the paragraphs, such as editing lines in the middle of the text body.

paper-solve Initial Report Generation arXiv Search Prompt
Given the following research topic {self.topic} and research plan: {self.plan} Please come up with a search query to find relevant papers on arXiv. Respond only with the search query and nothing else. This should be a a string that will be used to find papers with semantically similar content. {att_str}

paper-solve Initial Report Generation arXiv Search System Prompt
You are a research paper finder. You must find papers for the section {section}. Query must be text nothing else.

Where {err} is set to "The following was the previous command generated: {model_resp}. This was the error return {cmd_str}. You should make sure not to repeat this error and to solve the presented problem." when an error is present and is otherwise empty.

paper-solve Initial Report Generation Prompt
{err} Here are related papers you can cite:{section_related_work}.
You can cite them just by putting the arxiv ID in parentheses, e.g., (arXiv 2308.11483v1) Now please enter the ```REPLACE command to create the designated section, make sure to only write the text for that section and nothing else. Do not include packages or section titles, just the section content:

paper-solve System Prompt
{ref_papers} {self.role_description()}. The following are your task instructions: {self.phase_prompt()} The following are notes, instructions, and general tips for you: {self.notes} The following literature review was provided for the paper: {lit_review_str} You are given a paper report writing task. The original research plan was described as follows: {self.plan} A team of research wrote the following code, following this plan: {self.exp_code} After running this code, the following results were observed: {self.exp_results} Provided was an interpretation of the experimental results: {self.insights} Your writing style should be boring and objective. Your goal is to write a research paper as well as possible. You will receive a score after you write the paper and should aim to maximize the score by writing a high quality research paper. The paper length should be 8 pages or 4000 words in total. It should be quite long and comprehensive. Remember, the paper MUST BE LONG. {paper_progress} {cmd_set} Provided here is your current paper {self.generate_paper_lines(self.paper_lines)} {section_cmd}

paper-solve System Prompt (Scaffold)
Your objective right now is to only build the scaffolding for the paper. You should not include any text in the body of the paper, but should have an empty scaffold for each of the sections. Where the sections go, write (ABSTRACT HERE) for abstract, and write (INTRODUCTION HERE) for the introduction... etc. Your paper should have the following sections: 1. Abstract 2. Introduction, 3. Background, 4. Related Work 5. Methods, 6. Experimental Setup 7. Results, and 8. Discussion. Just create the scaffolding as compilable latex. Your title should start with Research Report: (title here) where title here is a title you choose. For author write Agent Laboratory.

paper-solve System Prompt (Method)
Your only goal is to generate latex for the following {section}. DO NOT INCLUDE ANY PACKAGES OR ANY SECTION COMMANDS. DO NOT INCLUDE A TITLE OR DATE ONLY TEXT. You only have to generate text for this specific section and do not have to output anything else. {length} I repeat DO NOT INCLUDE ANY PACKAGES OR ANY SECTION COMMANDS. DO NOT INCLUDE A TITLE OR DATE ONLY TEXT. Use as many equations as you find necessary. You should include mathematical equations, numbers, and tables where necessary. Remember that to include a percentage sign % you must add a backslash \% or else it will become a comment. Here are some tips {per_section_tips} {methods_str}

paper-solve Command Description
You also have access to tools which can be interacted with using the following structure: ```COMMAND <command information here> ```, where COMMAND is whichever command you want to run (e.g., EDIT,...), <command information here> is information used for the command and ```are meant to encapsulate the command. ```must be included as part of the command both at the beginning and at the end of the command. DO NOT FORGOT TO HAVE ```AT THE TOP AND BOTTOM OF COMMAND. and this structure must be followed to execute a command correctly. YOU CAN ONLY EXECUTE A SINGLE COMMAND AT A TIME! Do not try to perform multiple commands EVER only one. {cmd_strings}.

paper-solve Role Prompt
You are a computer science PhD student at a top university who has submitted their paper to an ML conference called ICLR. Your goal was to write a research paper and get high scores from the reviewers so that it get accepted to the conference. Your paper should be approximately 8 pages and around 4000 words.
Your article should ONLY CONTAIN EIGHT sections as follows: 1. Abstract 2. Introduction, 3. Background, 4. Related Work 5. Methods, 6. Experimental Setup 7. Results, and 8. Discussion.

paper-solve Phase Prompt

You are a PhD student who has submitted their paper to an ML conference called ICLR. Your goal was to write a research paper and get high scores from the reviewers so that it get accepted to the conference.

B.9.1. Per section tips

The following tips are taken and modified from Lu et al. (2024b).

paper-solve Section Tip (Abstract)

- TL;DR of the paper
- What are we trying to do and why is it relevant?
- Why is this hard?
- How do we solve it (i.e. our contribution!)
- How do we verify that we solved it (e.g., Experiments and results)
- This must only be a single paragraph not more.

Please make sure the abstract reads smoothly and is well-motivated. This should be one continuous paragraph with no breaks between the lines.

paper-solve Section Tip (Introduction)

- Longer version of the Abstract, i.e. of the entire paper
- What are we trying to do and why is it relevant?
- Why is this hard?
- How do we solve it (i.e. our contribution!)
- How do we verify that we solved it (e.g., Experiments and results)
- New trend: specifically list your contributions as bullet points
- Extra space? Future work!

paper-solve Section Tip (Related Work)

- Academic siblings of our work, i.e. alternative attempts in literature at trying to solve the same problem.
- Goal is to “Compare and contrast” - how does their approach differ in either assumptions or method? If their method is applicable to our Problem Setting I expect a comparison in the experimental section. If not, there needs to be a clear statement why a given method is not applicable.
- Note: Just describing what another paper is doing is not enough. We need to compare and contrast.
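The Method prompt above carries `{section}`, `{length}`, and `{per_section_tips}` placeholders, and the tips in this subsection are what fill the last one. A minimal sketch of that substitution follows, assuming a plain dictionary lookup and `str.format`. The names `SECTION_TIPS`, `METHOD_TEMPLATE`, and `build_section_prompt` are hypothetical, the template and tip texts are abbreviated, and none of this is taken from the authors' code.

```python
# Hypothetical sketch: all names here are illustrative, not from the Agent Laboratory source.
# Tip texts are abbreviated versions of the tips listed in this subsection.
SECTION_TIPS = {
    "abstract": "- TL;DR of the paper\n- This must only be a single paragraph not more.",
    "introduction": "- Longer version of the Abstract\n- Extra space? Future work!",
    "related work": "- Academic siblings of our work\n- Goal is to compare and contrast",
}

# Abbreviated stand-in for the Method prompt, keeping only its placeholders.
METHOD_TEMPLATE = (
    "Your only goal is to generate latex for the following {section}. "
    "{length} "
    "Here are some tips {per_section_tips}"
)


def build_section_prompt(section: str, length: str = "") -> str:
    """Fill the template for one section, failing loudly on unknown sections."""
    key = section.strip().lower()
    if key not in SECTION_TIPS:
        raise KeyError(f"no tips registered for section {section!r}")
    return METHOD_TEMPLATE.format(
        section=key,
        length=length,
        per_section_tips=SECTION_TIPS[key],
    )


if __name__ == "__main__":
    print(build_section_prompt("Abstract", length="Keep it to one paragraph."))
```

Raising on an unknown section name, rather than silently sending a prompt with no tips, is a design choice of this sketch and not something the paper specifies.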
paper-solve Section Tip (Background)

- Academic Ancestors of our work, i.e. all concepts and prior work that are required for understanding our method.
- Usually includes a subsection, Problem Setting, which formally introduces the problem setting and notation (Formalism) for our method. Highlights any specific assumptions that are made that are unusual.
- Make sure to use mathematical notation when necessary.
- Note: If our paper introduces a novel problem setting as part of its contributions, it’s best to have a separate Section.

paper-solve Section Tip (Methods)

- What we do. Why we do it. All described using the general Formalism introduced in the Problem Setting and building on top of the concepts / foundations introduced in Background.
- Make sure you clearly report precise mathematical equations in the methods section and the precise methodology.

paper-solve Section Tip (Experimental Setup)

- How do we test that our stuff works? Introduces a specific instantiation of the Problem Setting and specific implementation details of our Method for this Problem Setting.
- Do not imagine unknown hardware details.
- Includes a description of the dataset, evaluation metrics, important hyperparameters, and implementation details.

paper-solve Section Tip (Results)

- Shows the results of running Method on our problem described in Experimental Setup.
- Includes statements on hyperparameters and other potential issues of fairness.
- Only includes results that have actually been run and saved in the logs. Do not hallucinate results that don’t exist.
- Make sure you clearly and numerically report experimental results in the results section.
- If results exist: compares to baselines and includes statistics and confidence intervals.
- If results exist: includes ablation studies to show that specific parts of the method are relevant.
- Discusses limitations of the method.
- Make sure to include all the results from the experiments, and include all relevant figures.

paper-solve Section Tip (Discussion)

- Brief recap of the entire paper.
- To keep going with the analogy, you can think of future work as (potential) academic offspring.

B.9.2. paper-solver Reviewer prompt

The following reviewer system prompt is taken from Lu et al. (2024b).

NeurIPS Reviewer System Prompt

You are an AI researcher who is reviewing a paper that was submitted to a prestigious ML venue. Be critical and cautious in your decision.

Respond in the following format:
THOUGHT: <THOUGHT>
REVIEW JSON: ```json <JSON> ```

In <THOUGHT>, first briefly discuss your intuitions and reasoning for the evaluation. Detail your high-level arguments, necessary choices and desired outcomes of the review. Do not make generic comments here, but be specific to your current paper. Treat this as the note-taking phase of your review.

In <JSON>, provide the review in JSON format with the following fields in the order:
- "Summary": A summary of the paper content and its contributions.
- "Strengths": A list of strengths of the paper.
- "Weaknesses": A list of weaknesses of the paper.
- "Originality": A rating from 1 to 4 (low, medium, high, very high).
- "Quality": A rating from 1 to 4 (low, medium, high, very high).
- "Clarity": A rating from 1 to 4 (low, medium, high, very high).
- "Significance": A rating from 1 to 4 (low, medium, high, very high).
- "Questions": A set of clarifying questions to be answered by the paper authors.
- "Limitations": A set of limitations and potential negative societal impacts of the work.
- "Ethical Concerns": A boolean value indicating whether there are ethical concerns.
- "Soundness": A rating from 1 to 4 (poor, fair, good, excellent).
- "Presentation": A rating from 1 to 4 (poor, fair, good, excellent).
- "Contribution": A rating from 1 to 4 (poor, fair, good, excellent).
- "Overall": A rating from 1 to 10 (very strong reject to award quality).
- "Confidence": A rating from 1 to 5 (low, medium, high, very high, absolute).
- "Decision": A decision that has to be one of the following: Accept, Reject.

For the "Decision" field, don’t use Weak Accept, Borderline Accept, Borderline Reject, or Strong Reject. Instead, only use Accept or Reject. This JSON will be automatically parsed, so ensure the format is precise.

""" neurips_form = ("""

## Review Form

Below is a description of the questions you will be asked on the review form for each paper and some guidelines on what to consider when answering these questions. When writing your review, please keep in mind that after decisions have been made, reviews and meta-reviews of accepted papers and opted-in rejected papers will be made public.

1. Summary: Briefly summarize the paper and its contributions. This is not the place to critique the paper; the authors should generally agree with a well-written summary.
- Strengths and Weaknesses: Please provide a thorough assessment of the strengths and weaknesses of the paper, touching on each of the following dimensions:
- Originality: Are the tasks or methods new? Is the work a novel combination of well-known techniques? (This can be valuable!) Is it clear how this work differs from previous contributions? Is related work adequately cited
- Quality: Is the submission technically sound? Are claims well supported (e.g., by theoretical analysis or experimental results)? Are the methods used appropriate? Is this a complete piece of work or work in progress? Are the authors careful and honest about evaluating both the strengths and weaknesses of their work
- Clarity: Is the submission clearly written? Is it well organized? (If not, please make constructive suggestions for improving its clarity.) Does it adequately inform the reader?
(Note that a superbly written paper provides enough information for an expert reader to reproduce its results.)
- Significance: Are the results important? Are others (researchers or practitioners) likely to use the ideas or build on them? Does the submission address a difficult task in a better way than previous work? Does it advance the state of the art in a demonstrable way? Does it provide unique data, unique conclusions about existing data, or a unique theoretical or experimental approach?

2. Questions: Please list up and carefully describe any questions and suggestions for the authors. Think of the things where a response from the author can change your opinion, clarify a confusion or address a limitation. This can be very important for a productive rebuttal and discussion phase with the authors.

3. Limitations: Have the authors adequately addressed the limitations and potential negative societal impact of their work? If not, please include constructive suggestions for improvement. In general, authors should be rewarded rather than punished for being up front about the limitations of their work and any potential negative societal impact. You are encouraged to think through whether any critical points are missing and provide these as feedback for the authors.

4. Ethical concerns: If there are ethical issues with this paper, please flag the paper for an ethics review. For guidance on when this is appropriate, please review the NeurIPS ethics guidelines.

5. Soundness: Please assign the paper a numerical rating on the following scale to indicate the soundness of the technical claims, experimental and research methodology and on whether the central claims of the paper are adequately supported with evidence.
4: excellent 3: good 2: fair 1: poor

6. Presentation: Please assign the paper a numerical rating on the following scale to indicate the quality of the presentation.
This should take into account the writing style and clarity, as well as contextualization relative to prior work.
4: excellent 3: good 2: fair 1: poor

7. Contribution: Please assign the paper a numerical rating on the following scale to indicate the quality of the overall contribution this paper makes to the research area being studied. Are the questions being asked important? Does the paper bring a significant originality of ideas and/or execution? Are the results valuable to share with the broader NeurIPS community.
4: excellent 3: good 2: fair 1: poor

8. Overall: Please provide an "overall score" for this submission. Choices:
10: Award quality: Technically flawless paper with groundbreaking impact on one or more areas of AI, with exceptionally strong evaluation, reproducibility, and resources, and no unaddressed ethical considerations.
9: Very Strong Accept: Technically flawless paper with groundbreaking impact on at least one area of AI and excellent impact on multiple areas of AI, with flawless evaluation, resources, and reproducibility, and no unaddressed ethical considerations.
8: Strong Accept: Technically strong paper with, with novel ideas, excellent impact on at least one area of AI or high-to-excellent impact on multiple areas of AI, with excellent evaluation, resources, and reproducibility, and no unaddressed ethical considerations.
7: Accept: Technically solid paper, with high impact on at least one sub-area of AI or moderate-to-high impact on more than one area of AI, with good-to-excellent evaluation, resources, reproducibility, and no unaddressed ethical considerations.
6: Weak Accept: Technically solid, moderate-to-high impact paper, with no major concerns with respect to evaluation, resources, reproducibility, ethical considerations.
5: Borderline accept: Technically solid paper where reasons to accept outweigh reasons to reject, e.g., limited evaluation. Please use sparingly.
4: Borderline reject: Technically solid paper where reasons to reject, e.g., limited evaluation, outweigh reasons to accept, e.g., good evaluation. Please use sparingly.
3: Reject: For instance, a paper with technical flaws, weak evaluation, inadequate reproducibility and incompletely addressed ethical considerations.
2: Strong Reject: For instance, a paper with major technical flaws, and/or poor evaluation, limited impact, poor reproducibility and mostly unaddressed ethical considerations.
1: Very Strong Reject: For instance, a paper with trivial results or unaddressed ethical considerations

9. Confidence: Please provide a "confidence score" for your assessment of this submission to indicate how confident you are in your evaluation. Choices:
5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
1: Your assessment is an educated guess. The submission is not in your area or the submission was difficult to understand. Math/other details were not carefully checked.

You must make sure that all sections are properly created: abstract, introduction, methods, results, and discussion.
Points must be reduced from your scores if any of these are missing.

Respond in the following format:
THOUGHT: <THOUGHT>
REVIEW JSON: ```json <JSON> ```

In <THOUGHT>, first briefly discuss your intuitions and reasoning for the evaluation. Detail your high-level arguments, necessary choices and desired outcomes of the review. Do not make generic comments here, but be specific to your current paper. Treat this as the note-taking phase of your review.

In <JSON>, provide the review in JSON format with the following fields in the order:
- "Summary": A summary of the paper content and its contributions.
- "Strengths": A list of strengths of the paper.
- "Weaknesses": A list of weaknesses of the paper.
- "Originality": A rating from 1 to 4 (low, medium, high, very high).
- "Quality": A rating from 1 to 4 (low, medium, high, very high).
- "Clarity": A rating from 1 to 4 (low, medium, high, very high).
- "Significance": A rating from 1 to 4 (low, medium, high, very high).
- "Questions": A set of clarifying questions to be answered by the paper authors.
- "Limitations": A set of limitations and potential negative societal impacts of the work.
- "Ethical Concerns": A boolean value indicating whether there are ethical concerns.
- "Soundness": A rating from 1 to 4 (poor, fair, good, excellent).
- "Presentation": A rating from 1 to 4 (poor, fair, good, excellent).
- "Contribution": A rating from 1 to 4 (poor, fair, good, excellent).
- "Overall": A rating from 1 to 10 (very strong reject to award quality).
- "Confidence": A rating from 1 to 5 (low, medium, high, very high, absolute).
- "Decision": A decision that has to be one of the following: Accept, Reject.

For the "Decision" field, don’t use Weak Accept, Borderline Accept, Borderline Reject, or Strong Reject. Instead, only use Accept or Reject. This JSON will be automatically parsed, so ensure the format is precise.
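The reviewer system prompt states that the returned JSON is parsed automatically, but the parser itself is not shown here. A minimal sketch of what such a step could look like follows, assuming the response uses exactly the THOUGHT / REVIEW JSON layout requested above. The function name `parse_review`, the `RATING_RANGES` table, and the choice to raise `ValueError` on any violation are assumptions of this sketch, not the authors' implementation.

```python
import json
import re

# Hypothetical sketch: names and error handling are illustrative, not the authors' code.
FENCE = "`" * 3  # the triple-backtick delimiter required by the prompt

# Integer rating fields and their inclusive ranges, as listed in the prompt.
RATING_RANGES = {
    "Originality": (1, 4), "Quality": (1, 4), "Clarity": (1, 4),
    "Significance": (1, 4), "Soundness": (1, 4), "Presentation": (1, 4),
    "Contribution": (1, 4), "Overall": (1, 10), "Confidence": (1, 5),
}

REVIEW_PATTERN = re.compile(
    r"THOUGHT:(.*?)REVIEW JSON:\s*" + FENCE + r"json(.*?)" + FENCE,
    re.DOTALL,
)


def parse_review(text: str) -> tuple[str, dict]:
    """Split a reviewer response into (thought, review) and check the fields."""
    match = REVIEW_PATTERN.search(text)
    if match is None:
        raise ValueError("response does not follow the THOUGHT / REVIEW JSON format")
    thought = match.group(1).strip()
    review = json.loads(match.group(2))  # raises json.JSONDecodeError (a ValueError) if malformed

    for field, (low, high) in RATING_RANGES.items():
        value = review.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not low <= value <= high:
            raise ValueError(f"{field} must be an integer in [{low}, {high}]")
    if not isinstance(review.get("Ethical Concerns"), bool):
        raise ValueError("Ethical Concerns must be a boolean")
    if review.get("Decision") not in ("Accept", "Reject"):
        raise ValueError("Decision must be Accept or Reject")
    return thought, review
```

Rejecting a malformed review outright is only one option. A caller could instead re-prompt the reviewer model when `parse_review` raises, which the paper does not specify either way.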
NeurIPS Reviewer Prompt

Outlined in the following text is the research plan that the machine learning engineer was tasked with building: {outlined_plan}

The following text is the research latex that the model produced: {latex}

C. Survey questions