arxiv

Paper 2502.13069

Interactive Agents to Overcome Ambiguity in Software Engineering

Authors: Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, Graham Neubig

Published: 2025-02-18

Abstract:

AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions. Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes, safety risks due to tool misuse, and wasted computational resources. In this work, we study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance across three key steps: (a) leveraging interactivity to improve performance in ambiguous scenarios, (b) detecting ambiguity, and (c) asking targeted questions. Our findings reveal that models struggle to distinguish between well-specified and underspecified instructions. However, when models interact for underspecified inputs, they effectively obtain vital information from the user, leading to significant improvements in performance and underscoring the value of effective interaction. Our study highlights critical gaps in how current state-of-the-art models handle ambiguity in complex software engineering tasks and structures the evaluation into distinct steps to enable targeted improvements.

Paper Content:
Affiliation (all authors): Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA. Correspondence to: Sanidhya Vijayvargiya <sanidhyv@cs.cmu.edu>.
Code and data can be accessed at https://github.com/sani903/InteractiveSWEAgents
arXiv:2502.13069v1 [cs.AI] 18 Feb 2025

1. Introduction

Large Language Models (LLMs) are increasingly used as chatbots in task-oriented workflows to improve productivity (Peng et al., 2023; Brynjolfsson et al., 2023), with the user providing a task instruction which the model completes. Due to the interactive nature of chatbots, the performance depends on the information provided in the user's prompt. Users often provide non-descriptive instructions, which poses critical challenges in successfully completing the task (Chowdhury et al., 2024). The ambiguity can lead not only to erroneous outcomes, but also to significant safety issues (Kim et al., 2024; Karli & Fitzgerald, 2023).

Figure 1. Interactive agents mitigate resource wastage and reduce misalignment in ambiguous settings.

This ambiguity can lead to more severe consequences in task automation scenarios, where AI agents are equipped with powerful tools (Wang et al., 2024b; Lu et al., 2024; Huang et al., 2024; Zhou et al., 2024a). In software engineering settings, agents must navigate complex codebases, make architectural decisions, and modify critical systems, all while operating with potentially incomplete or ambiguous instructions. When human developers face such ambiguity, they engage in clarifying dialogue to gather missing context (Testoni & Fernández, 2024; Purver, 2004). However, current AI systems often proceed with incomplete understanding, leading to costly mistakes and misaligned solutions, as demonstrated in Figure 1.

In this work, we systematically evaluate the interaction capabilities of commonly used open and proprietary LLMs when addressing underspecified instructions in agentic code settings (§2). We examine three research questions to address the problem for code generation.

1. Interactive problem solving: Can LLMs appropriately leverage interaction with the user to improve performance in ambiguous settings?
2. Detection of ambiguity: Can LLMs identify whether a given task description is underspecified and ask clarifying questions?
3. Question quality: Can LLMs generate meaningful and targeted questions that gather the necessary information to complete the task?
We evaluate the research questions separately to ensure independence between them. We use the GitHub issues from SWE-Bench Verified (Chowdhury et al., 2024) to simulate well-specified inputs, and the summarized variants of the same GitHub issues as underspecified inputs for the experiments. A simulated user (Xu et al., 2024; Zhou et al., 2024b), equipped with the full, well-specified issue, simulates real conversations where the user has additional context, which is provided only when prompted with the appropriate questions. This multi-stage approach allows for targeted improvements in individual aspects, offering a pathway to enhance overall system performance.

Through our evaluations across the different settings, we find that interactivity can boost performance on underspecified inputs by up to 74% over the non-interactive settings, but the performance varies between models (§3). LLMs default to non-interactive behavior without explicit encouragement, and even with it, they struggle to distinguish between underspecified and well-specified inputs. Claude Sonnet 3.5 is the only evaluated LLM that achieves notable accuracy (84%) in making this distinction. Prompt engineering offers limited improvement, and its effectiveness varies across models (§4). When interacting, LLMs generally pose questions capable of extracting relevant details, but some models, such as Llama 3.1 70B, fail to obtain sufficient specificity (§5).

In summary, this study underscores the importance of interactivity in LLMs for agentic workflows, particularly in real-world tasks where prompt quality varies significantly.

2. Method

2.1. Dataset

In our experiments, we simulate well-specified and underspecified inputs using the SWE-Bench Verified dataset, a refined subset of 500 issues from the SWE-Bench dataset. The SWE-Bench dataset (Jimenez et al., 2024) consists of real-world GitHub issues, their corresponding pull requests (PRs), and unit tests from 12 Python repositories.
The SWE-Bench Verified dataset (Chowdhury et al., 2024) is designed to provide a more reliable estimate of an LLM's ability by pruning issues that were underspecified or contained invalid unit tests. The LLM's task is to modify the repository, as it stood when the issue was created, so that the issue is resolved. The test cases are used to verify the patch generated by the agent.

Given that the Verified subset contains only sufficiently specified issues, we assume that these issues do not require disambiguation. Therefore, for each SWE-Bench Verified issue, we consider two forms, as shown in Figure 2:

1. Fully specified issue: The original and detailed GitHub issue.
2. Underspecified issue: A summarized version generated using GPT-4o, where the model is asked to preserve specific terminology but reduce the amount of detailed content (complete prompt in §A.1.3).

2.2. Agentic Framework

Agent Environment. The OpenHands (Wang et al., 2024b) agentic framework equips the LLM with an interactive environment that extends its capabilities beyond static code generation. The agent operates within a structured execution environment where it can iteratively refine code, plan tasks, and run commands using integrated tools. It can edit files, break down complex instructions into executable steps, and execute both Bash and Python scripts within a secure sandbox. This controlled environment enables the agent to analyze execution outputs, detect and debug errors, and refine its approach based on observed results, ensuring adaptability and correctness in solving complex programming tasks.

Selected Models. We use Claude Sonnet 3.5 (Anthropic, 2024b) as one of the proprietary models due to its superior performance on SWE-Bench.
Claude Haiku 3.5 (Anthropic, 2024a) is included as the second proprietary model to investigate the impact of model parameterization, as both models likely share similar training methodologies but differ significantly in the number of parameters. Additionally, we evaluate Llama 3.1 70B-Instruct (Llama team, 2024) and Deepseek-v2 (DeepSeek-AI, 2024) as two open-weight frontier models.

User Proxy. We employ GPT-4o (Ahmad & OpenAI, 2024) as a user proxy to simulate user-agent interactions (Xu et al., 2024; Zhou et al., 2024a). The user proxy is provided with the fully specified version of the task, allowing the coding agent to extract the necessary information through interaction. It is instructed to respond based solely on the information available in the full issue and will reply with "I don't have that information" if relevant details are missing. This approach prevents the user proxy from hallucinating incorrect information and encourages clear, negative responses when needed. The full prompt is shown in §A.1.2.

Figure 2. The three settings in order: Full, Hidden, and Interaction.

2.3. Study Design

We use three distinct settings to evaluate models across the 500 issues from SWE-Bench Verified, shown in Figure 2 and described below.

• Full setting: This is the traditional SWE-Bench setting for resolving GitHub issues. The coding agent is provided with the fully specified task and the interaction is disabled. It represents the agent's performance in an unambiguous scenario, where the agent has access to full information, simulating ideal conditions.
• Hidden setting: A summarized version of the issue is provided to the coding agent with the user-agent interaction disabled to mimic the lack of detail that can occur in task descriptions. We do not give any interaction-related instructions, and all models default to non-interactive behavior. Specific details are hidden from the coding agent.
• Interaction setting: The coding agent receives a summarized task, while the user proxy model gets the fully specified task. Interaction is enabled through prompting, allowing the agent to query the proxy for specific details. The models do not interact with the user without an explicit prompt. In addition to the full issue, the proxy has access to file locations that need modification and can provide them when queried. This setup allows us to evaluate which models proactively seek navigational information and examine how this interaction influences the success of the solution process across models.

3. RQ1: Interactive Problem Solving

Effectively addressing ambiguity requires a model to integrate information from user interactions to form a clear plan and successfully solve the task. Our first experiment holistically evaluates the model's ability to leverage interaction and improve performance. The model must not only process the initial task description, but also query users to extract relevant details while filtering out irrelevant information.

3.1. Experimental Setup

The hypothesis of the experiment is that different language models will exhibit varying performance with interaction based on their incorporation of the provided information, leading to different levels of improvement over the Hidden setting. We evaluate the models across the three settings and conduct two Wilcoxon signed-rank tests with a significance level of 0.05 to determine significant performance differences between the Hidden and Interaction settings, and between the Interaction and Full settings, for every model. Here, we modify the prompt to make interaction with the user compulsory in the Interaction setting (footnote 2). Ideally, the Interaction setting should approach the performance of the Full setting. The coding agent has a maximum of 30 turns to generate a solution patch.

Footnote 2: Without compulsory interaction, the model defaults to non-interactive behavior for most issues, as seen in the Hidden setting. Full prompt in §A.1.2.

3.2. Leveraging Interaction in Ambiguity

In this experiment, each model is tested in the Hidden, Interaction, and Full settings to evaluate its ability to leverage interaction and optimize performance on underspecified issues. The results, as shown in Figure 3, confirm the expected increase in resolve rates as more information becomes available to the agent. While the difference between the Hidden and Interaction settings is significant for every model (Table 1), emphasizing the impact of interaction on the trajectory, the performance gap between the Interaction and Full settings is also significant across all models, highlighting the unrealized potential. Specifically, for the Hidden vs. Interaction settings, proprietary models show stronger evidence of a significant difference. These results suggest that the ability to leverage interaction varies across models, with proprietary models demonstrating greater effectiveness in utilizing interaction compared to open-weight models.

Figure 3. Resolve rates (in %) across different settings: Hidden (underspecified issues), Interaction (underspecified issues with user interaction), and Full (fully specified issues).

Using interaction, the Claude Sonnet and Haiku agents recreate 80% of the performance in the Full setting. However, with Deepseek and Llama 3.1, the relative performance is lower, at 59% and 54%, respectively. Claude Sonnet 3.5's high resolve rate in the Hidden setting is likely due to its superior programming acumen. The performance is surprising, as a human would be able to decipher little about the expectations given the summarized issue.
Better programming models can potentially extract more information from the stack trace by reproducing the error themselves. We observe that the Claude Haiku model achieves a performance relative to the Full setting similar to that of the Claude Sonnet model, despite having inferior coding abilities. Thus, there is no direct correlation between the number of parameters or coding ability and a model's ability to leverage interaction. This hints towards better training practices that can lead to better integration of the new information.

| Model | Comparison | p-value |
| --- | --- | --- |
| Llama 3.1 70B | Hidden vs. Interaction | 0.0023 |
| Llama 3.1 70B | Interaction vs. Full | 3.87e-06 |
| Claude Haiku 3.5 | Hidden vs. Interaction | 2.18e-14 |
| Claude Haiku 3.5 | Interaction vs. Full | 1.65e-09 |
| Claude Sonnet 3.5 | Hidden vs. Interaction | 8.55e-19 |
| Claude Sonnet 3.5 | Interaction vs. Full | 1.28e-12 |
| Deepseek-v2 | Hidden vs. Interaction | 0.0023 |
| Deepseek-v2 | Interaction vs. Full | 2.87e-07 |

Table 1. Wilcoxon signed-rank test results for the Hidden vs. Interaction settings and for the Interaction vs. Full settings for each model.

This experiment highlights the importance of interaction in mitigating ambiguity. Since many real-world software engineering problems are underspecified (Chowdhury et al., 2024), interactive systems are essential for ensuring alignment and reducing safety risks. However, current models default to non-interactive behavior even when faced with ambiguity and struggle to match the performance seen in well-specified settings. While interactive trajectories show statistically significant gains over non-interactive approaches for ambiguous inputs (Table 1), they still fall significantly short of the Full setting, indicating strong potential for improvement.

3.3. Impact of Interaction Details on Model Performance

In the Interaction setting of the previous experiment, the information gained can be broadly categorized into two types: informational, which relates to the expected behavior or nature of the error, and navigational, which pertains to the locations of the files to modify.
While informational details are typically obtained in nearly every interaction, the models request navigational details less frequently. We measure the resolve rates separately for instances where the model asks for navigational details and where it does not, examining the impact on performance when models must rely only on informational details versus when navigational details are also accessible. As seen in Table 2, requesting navigational details improves performance across all models by providing cues beyond described behavior and errors. However, some models rely too heavily on this information and struggle when it is missing. Smaller models like Llama 3.1 and Deepseek-v2 request file locations more often but underperform without them. Claude models, particularly Sonnet, better leverage informational cues, achieving higher resolve rates even without navigational details. Deepseek, by contrast, performs worse than its Hidden setting when file locations are absent, highlighting its dependence. This reliance leads to wasted turns searching for errors instead of identifying them efficiently. Llama 3.1 performs better than Hidden without file locations but gains little when they are provided, likely due to poor detail extraction (§5).

| Model | Navigational Info (%) | Resolve Rate With (%) | Resolve Rate Without (%) |
| --- | --- | --- | --- |
| Claude Sonnet 3.5 | 8.96 | 59.52 | 37.94 |
| Claude Haiku 3.5 | 24.67 | 36.94 | 24.78 |
| Deepseek-v2 | 30.70 | 13.19 | 4.62 |
| Llama 3.1 70B | 30.28 | 6.34 | 4.28 |

Table 2. The percentage of issues where navigational information was acquired in the Interaction setting, along with the corresponding resolve rates with and without navigational information. Navigational information refers to file locations requiring modification, helping to avoid tedious code exploration. The resolve rates offer insight into how the information obtained during interaction impacts overall performance.
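As a rough cross-check on Table 2, the two conditional resolve rates can be combined into the overall Interaction resolve rate they imply for each model. The sketch below assumes the "with" and "without" rates are computed over complementary subsets of the same issues. The function names are illustrative and not taken from the paper's code, and the derived numbers are not reported in the paper.

```python
# Values copied from Table 2, all in percent:
# (share of issues with navigational info, resolve rate with it, resolve rate without it)
TABLE_2 = {
    "Claude Sonnet 3.5": (8.96, 59.52, 37.94),
    "Claude Haiku 3.5": (24.67, 36.94, 24.78),
    "Deepseek-v2": (30.70, 13.19, 4.62),
    "Llama 3.1 70B": (30.28, 6.34, 4.28),
}


def implied_overall_rate(nav_share_pct, rate_with_pct, rate_without_pct):
    """Weighted average of the two conditional resolve rates (result in %).

    Assumption: the two rates cover complementary subsets of the issues.
    """
    share = nav_share_pct / 100.0
    return share * rate_with_pct + (1.0 - share) * rate_without_pct


def navigational_gap(rate_with_pct, rate_without_pct):
    """Percentage-point gap between resolve rates with and without file locations."""
    return rate_with_pct - rate_without_pct


if __name__ == "__main__":
    for model, (share, with_nav, without_nav) in TABLE_2.items():
        overall = implied_overall_rate(share, with_nav, without_nav)
        gap = navigational_gap(with_nav, without_nav)
        print(f"{model}: implied overall {overall:.2f}%, gap {gap:.2f} points")
```

The gap is a descriptive difference only: models choose when to ask for file locations, so it does not isolate the causal effect of navigational information.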
Ideally, LLMs should generalize across diverse interaction types, as users may not always provide specific details, improving robustness in real-world software engineering tasks.

Takeaway: Interaction has significant potential to improve model performance in ambiguous tasks, but models, particularly the weaker open-weight models, struggle to leverage it effectively. Proprietary models like Claude Sonnet 3.5 and Haiku 3.5 achieve nearly 80% of their Full setting performance, with Haiku improving by 74% over its Hidden setting performance through effective integration of both informational and navigational cues. The lack of correlation between model size and its ability to utilize interaction suggests that better training practices play a more crucial role. In contrast, models like Deepseek-v2 and Llama 3.1 show limited gains, primarily due to their challenges in utilizing broader informational cues, which hinders their adaptability in ambiguous tasks.

4. RQ2: Ambiguity Detection

In real-world LLM and agent applications, task descriptions and prompts can vary in quality (Chowdhury et al., 2024). To detect ambiguity, a model must recognize unclear expectations or identify missing key information in its planned approach. However, interacting unnecessarily when sufficient information is already available can introduce inefficiencies and place an undue burden on the user. Here, we examine the capabilities of LLMs to detect ambiguous instructions in software engineering contexts.

4.1. Experimental Setup

In this experiment, each issue is presented in either the Full setting or the Hidden setting. The objective is to identify patterns in how models choose to interact based on the input type.
Ideally, the model should have a high interaction rate for the summarized inputs and a negligible interaction rate for the well-specified inputs.

In the instructions that outline the task, we present the agent with an option to interact during its solution trajectory and design three instructions with varying levels of encouragement to interact with the user. We track the input type the model chooses to interact with. The instructions, listed in order of increasing encouragement to interact, are: Neutral, where the agent is told it can ask questions if anything is unclear; Moderate Encouragement, where the agent is told to carefully check that all necessary information is available and only proceed after everything is clear; and Strong Encouragement, where the agent is told that asking questions is critical to task success (full prompts in §A).

4.2. Effect of Different Prompts

Experiments to detect ambiguity demonstrate that, using prompt engineering, we can control the level of interaction with the user, as shown in Table 3. However, this interactivity must be requested explicitly: without any mention of interaction in the prompt, the models almost never interact for any of the summarized issue inputs. The Claude Sonnet model performs best with Moderate Encouragement, achieving the highest overall accuracy of 84% across all variations. Its counterpart from the same model family, Claude Haiku, is hesitant to interact even with Strong Encouragement. The Claude models show a drop in accuracy in cases where interaction is not needed as their overall interaction increases, indicating that the interaction fails to target underspecified inputs effectively. For the Deepseek model, we observe that the Neutral prompt gives the best results, as interactivity surprisingly decreases with more encouragement.
The accuracy in both the cases where interaction was desired and not desired is around 70%, which shows that the model is capable of distinguishing between well-specified and underspecified issues to some extent. The Llama model displays a greater, but arbitrary, tendency to interact across all prompts than other models.

| Model | Neutral: Accuracy ↑ | Neutral: FPR ↓ | Neutral: FNR ↓ | Moderate: Accuracy ↑ | Moderate: FPR ↓ | Moderate: FNR ↓ | Strong: Accuracy ↑ | Strong: FPR ↓ | Strong: FNR ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Sonnet 3.5 | 0.60 | 0.00 | 0.81 | 0.84 | 0.24 | 0.09 | 0.76 | 0.36 | 0.10 |
| Claude Haiku 3.5 | 0.54 | 0.00 | 0.97 | 0.57 | 0.02 | 0.90 | 0.63 | 0.06 | 0.66 |
| Deepseek-v2 | 0.69 | 0.30 | 0.31 | 0.57 | 0.08 | 0.83 | 0.51 | 0.04 | 0.94 |
| Llama 3.1 70B | 0.48 | 0.46 | 0.57 | 0.47 | 0.95 | 0.09 | 0.52 | 0.93 | 0.06 |

Table 3. Model performance in ambiguity detection across prompts with varying levels of interaction encouragement (Neutral, Moderate Encouragement, Strong Encouragement). FPR refers to cases where the model interacted unnecessarily, while FNR refers to cases where it failed to interact when needed. A model that reliably distinguishes between underspecified and well-specified issues should have high accuracy, low FPR, and low FNR.

4.3. Detection across Models

While interaction levels can be adjusted with prompting, both summarized issues and full issues have equal probability of being selected for interaction as interactivity increases, particularly with smaller models. Despite the stark difference in the language and detail of summarized issues and fully specified issues, the models, except Claude Sonnet, fail to reliably distinguish them, indicating that LLMs struggle to detect ambiguity even in obvious cases. All models, including Claude Sonnet, show large changes in ambiguity detection behavior with prompt variations. Interestingly, Sonnet outperforms Haiku, likely due to its more extensive instruction tuning or human-feedback training, which helps it better follow instructions and achieve the desired interactive trajectory.
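The metrics reported in Table 3 can be sketched as follows. The sketch assumes each trial is recorded as a pair (needs_interaction, interacted), where an underspecified (Hidden) issue needs interaction and a fully specified (Full) issue does not. The record format and names are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Iterable, NamedTuple, Tuple


class DetectionMetrics(NamedTuple):
    accuracy: float
    fpr: float  # interacted although the issue was fully specified
    fnr: float  # did not interact although the issue was underspecified


def detection_metrics(trials: Iterable[Tuple[bool, bool]]) -> DetectionMetrics:
    """Compute accuracy, FPR, and FNR from (needs_interaction, interacted) pairs."""
    tp = fp = tn = fn = 0
    for needs_interaction, interacted in trials:
        if needs_interaction and interacted:
            tp += 1
        elif needs_interaction:
            fn += 1
        elif interacted:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no trials given")
    # Report 0.0 when one input type is absent from the sample.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return DetectionMetrics((tp + tn) / total, fpr, fnr)


if __name__ == "__main__":
    # Toy data: 4 underspecified issues (3 trigger questions) and
    # 4 fully specified issues (1 triggers an unnecessary question).
    toy = ([(True, True)] * 3 + [(True, False)]
           + [(False, True)] + [(False, False)] * 3)
    print(detection_metrics(toy))
```

If the two input types are equally frequent, accuracy equals 1 − (FPR + FNR)/2. The rows of Table 3 are consistent with this: for Claude Sonnet 3.5 under Moderate Encouragement, 1 − (0.24 + 0.09)/2 ≈ 0.84.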
Surprisingly, even Deepseek adapts better to the task than Haiku.

Takeaway: Prompt engineering can influence model interactivity but fails to consistently improve ambiguity detection across models. When interaction is not explicitly prompted, models default to non-interactive behavior. Claude Sonnet shows some ability to distinguish ambiguous inputs, but other models, including Claude Haiku and Llama 3.1, struggle even with clear cues. This inconsistency reveals that models are not inherently equipped to detect underspecified tasks. Improving ambiguity detection requires dedicated training, not just prompt modifications.

5. RQ3: Question Quality

To gather missing information from underspecified inputs, the quality of an agent's questions is crucial. While §3 evaluates task completion, model performance in that experiment is also influenced by coding ability. Here, we focus solely on the quality of the questions posed by the agent to the user, measuring how effectively models extract relevant information under the assumption that users have the necessary details.

5.1. Experimental Setup

In this experiment, we evaluate the quality of the interactions between the agent and the user in the Interaction setting. We measure the novelty and detail level of the information obtained from the user's answers to evaluate the quality, quantifying the new knowledge relative to the existing understanding of the agent. We employ two techniques to quantify the information obtained.

1. Cosine Distance: We compute the cosine distance, $1 - \cos(E_{\text{before}}, E_{\text{after}})$, between the embedding of the summarized task, $E_{\text{before}}$, and the embedding of the cumulative knowledge after interaction with the user, $E_{\text{after}}$, using a text embedding model. Lower distances indicate redundant user input, while higher values show meaningful information gain. We use OpenAI's text-embedding-3-small as our embedding model.
2. LLM-as-judge (GPT-4o): The model scores the user answers on a scale of 1 to 5, where a higher score corresponds to more new and detailed information in the user's response, such as specific files causing errors or function behavior. The prompt to the model includes the summarized issue, agent questions, and user responses for better context.

5.2. Information Gain from Interaction

For the quantitative evaluation of question quality, both the cosine distance and the LLM-as-judge methods suggest a similar result: the Llama model performs significantly worse than the other models, whereas the other models achieve very similar information gains, as seen in Figure 5.

The Llama model has an average cosine distance of 0.101 when the embedding of the summarized issue is compared to the embedding of the user response appended to the summarized issue. Deepseek achieves the highest cosine distance of 0.142, while the Claude Sonnet and Haiku models achieve very similar cosine distances of 0.136 and 0.135.

Using LLM as a judge, we evaluate the specificity of the details present in the answers. Here again, the Llama 3.1 model achieves a significantly worse average score of 3.58 than the other models, which see similar performance of around 4 out of 5.

Figure 4. Agent questions and user responses to the same underspecified input are shown for Llama 3.1 70B, Deepseek-v2, and Claude Haiku 3.5. The examples highlight specific interaction patterns and differences in handling ambiguity. The corresponding model inputs are detailed in Table 5.

Figure 5. Information gain measured using (a) cosine distance scores and (b) LLM-as-judge scores. Both panels compare Llama 3.1, Deepseek-v2, Claude Haiku, and Claude Sonnet; the y-axis is Cosine Distance (0.0 to 0.3) in (a) and LLM-as-Judge Score (1 to 5) in (b).

5.3. Qualitative Analysis of Questions

The quantitative results can be further supported by a qualitative evaluation of the questions. Sample question-answer pairs reflecting common trends are shown in Figure 4. The Llama model asks fewer questions per message than the other models, as seen in Table 4, and often poses overly general questions like "Are there any existing workarounds or temporary fixes?". These template-like questions are unproductive and less likely to gather useful information.

Deepseek, on the other hand, asks the most questions per message, allowing it to extract more information. Some of its questions, such as "Are there any existing tests or examples that demonstrate the issue?", aim to extract edge cases, documentation, or tests; while common across multiple issues, they are reasonable and yield valuable details. Most of its questions, however, are very specific and detailed, querying the expected behavior. Often, due to the specificity of the question, the user might not have the required information.

Claude Sonnet asks fewer questions than Deepseek, likely because it explores the codebase first. Its questions do not have easily discernible patterns and match the Deepseek model in specificity. The Haiku model, in contrast, follows a consistent template, typically asking three questions regardless of the input, although sub-questions may be present. Haiku's questions are more keyword-driven based on the input, while Sonnet's are based on a deeper understanding of the issue and codebase.

| Model | Avg. Number of Questions |
| --- | --- |
| Claude Sonnet 3.5 | 3.80 |
| Claude Haiku 3.5 | 3.49 |
| Deepseek-v2 | 4.57 |
| Llama 3.1 70B | 2.61 |

Table 4. The average number of questions asked by different models in an interaction.

Takeaway: Models that balance specificity and question quantity, like the Claude models, achieve greater information gain and superior interaction quality compared to models that ask too few, too many, or templated questions.
While Deepseek benefits from asking numerous detailed questions, it risks overwhelming the user. In contrast, Llama underperforms due to its reliance on generic or irrelevant questions.

6. Limitations

Our study benefits from including both open-weight and proprietary models, as well as models from the same family with different parameterizations, enhancing the generalizability of the findings. However, certain design decisions may affect the experiments.

Ambiguity detection is limited to the first three turns, as LLMs struggle to interact meaningfully if they do not engage early. To assess question quality, we measure changes in the latent vector to capture the information gained, assuming equal importance for all new information, though models may prioritize different details in their solution. The resolve rates in the overall problem solving experiment reflect real-life conditions, where incorrect code is unacceptable, regardless of how close the generated patch is to the solution. However, data leakage could enable some models to make correct assumptions in underspecified settings, inflating resolve rates. Additionally, the user proxy may be more interactive than real-world users, as LLMs are tuned to be helpful. We address this by limiting the number of interaction turns and focusing interactions on the task with detailed system prompts.

7. Related Work

Code Generation Benchmarks. In code generation tasks, ambiguous user instructions hinder the evaluation of code suggestions generated by the model. Since the cause of ambiguity is missing details, clarifying questions become necessary (Mu et al., 2023). Interactive, test-driven workflows mitigate this ambiguity by first generating test cases aligned with user expectations, which users validate before code generation (Lahiri et al., 2023).
Extensions of this approach employ runtime techniques to generate, mutate, and rank candidate code suggestions and test cases based on user feedback (Fakhoury et al., 2024). Although effective, these workflows can burden users, highlighting the need to limit intervention to essential cases.

Interactive ML Systems. In task-oriented settings, ambiguity between generated outputs and user expectations remains a challenge. AmbigNLG addresses this by introducing a taxonomy of instruction ambiguities and applying targeted disambiguation based on the identified ambiguity type (Niwa & Iso, 2024). These ambiguities include unclear output lengths, mandatory keywords, and contextual nuances in instructions. NoisyToolBench (Wang et al., 2024a) offers a dataset for evaluating LLM tool use with ambiguous instructions, though it focuses on simpler tasks. Reinforcement learning frameworks like ReHAC balance user interaction by modeling optimal intervention points (Feng et al., 2024), but more effective strategies are needed for complex, multi-step workflows.

LLMs and Ambiguity. Current state-of-the-art LLMs are not inherently trained to handle ambiguity through user interaction (Zhang et al., 2024), but their instruction tuning enables improved performance with prompt engineering (White et al., 2023). Ambiguity detection has been tackled with uncertainty estimation to measure the utility of seeking clarification (Zhang & Choi, 2023; Park et al., 2024). Meanwhile, the quality of clarifying questions and the resulting performance remain critical to overall success (Rao & Daumé III, 2018; Pyatkin et al., 2023; Kuhn et al., 2023). Despite advances, state-of-the-art techniques such as few-shot prompting and Chain-of-Thought reasoning offer limited relief in ambiguous scenarios (Zhang et al., 2024). Self-disambiguation uses the internal knowledge of a model to reduce query ambiguity (Keluskar et al., 2024; Sterner, 2022; Sumanathilaka et al., 2024).
For example, Alignment with Perceived Ambiguity (APA) employs self-disambiguation to quantify perceived ambiguity using information gain, improving the model’s processing of such inputs (Kim et al., 2024). Although inference-only methods are cost-effective, they are less robust than training-based approaches for handling ambiguity.

8. Conclusion

This work evaluates proprietary and open-weight models in agentic frameworks for handling ambiguity in software engineering. In code generation, to effectively integrate new information into the solution, an agent must detect ambiguity and ask targeted questions. Our key findings are:

• Given an underspecified input, Claude Sonnet 3.5 and Claude Haiku 3.5 with interaction can achieve 80% of their performance with a well-specified input. In contrast, open-weight models struggle: Deepseek relies on navigational cues to locate relevant files, while Llama 3.1 70B extracts limited information from the user.
• LLMs do not interact unless explicitly prompted, and their ambiguity detection is highly sensitive to prompt variations. Only Claude Sonnet 3.5 reaches an accuracy as high as 84% in distinguishing between well-specified and underspecified input.
• Claude Sonnet 3.5, Haiku 3.5, and Deepseek effectively extract new, detailed user information, whereas Llama 3.1 struggles to ask the right questions.

Despite these advances, a gap remains between resolve rates for underspecified vs. fully specified issues. Open-weight models need better interaction strategies to improve resolution, while proprietary models, particularly Claude Haiku 3.5, require stronger prompting to engage interactively. This work establishes the current state-of-the-art in handling ambiguity through interaction, breaking the resolution process into multiple steps.
Impact Statement

This paper examines the ability of AI agents to handle ambiguous and underspecified instructions in software engineering. Our findings highlight key limitations in the resolution of AI-driven tasks in real-world applications, necessitating the development of more interactive, adaptable, and effective systems. By emphasizing AI interactivity, our work contributes to the responsible development of AI technologies, fostering safer and more efficient automation in software engineering.

References

Ahmad, L. and OpenAI. Gpt-4o system card, October 2024.

Anthropic. Claude 3.5 haiku, 10 2024a. URL https://www.anthropic.com/claude/haiku. Accessed on January 9, 2025.

Anthropic. Introducing claude 3.5 sonnet, 6 2024b. URL https://www.anthropic.com/news/claude-3-5-sonnet. Accessed on January 8, 2025.

Brynjolfsson, E., Li, D., and Raymond, L. R. Generative ai at work. Working Paper 31161, National Bureau of Economic Research, April 2023. URL http://www.nber.org/papers/w31161.

Chowdhury, N., Aung, J., Shern, C. J., Jaffe, O., Sherburn, D., Starace, G., Mays, E., Dias, R., Aljubeh, M., Glaese, M., Jimenez, C. E., Yang, J., Liu, K., and Madry, A. Introducing SWE-bench verified, 2024. URL https://openai.com/index/introducing-swe-bench-verified/. Accessed on December 10, 2024.

DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024.

Fakhoury, S., Naik, A., Sakkas, G., Chakraborty, S., and Lahiri, S. K. Llm-based test-driven interactive code generation: User study and empirical evaluation. IEEE Transactions on Software Engineering, 50(9):2254–2268, September 2024. ISSN 2326-3881. doi: 10.1109/tse.2024.3428972. URL http://dx.doi.org/10.1109/TSE.2024.3428972.

Feng, X., Chen, Z.-Y., Qin, Y., Lin, Y., Chen, X., Liu, Z., and Wen, J.-R. Large language model-based human-agent collaboration for complex task solving, 2024.
URL https://arxiv.org/abs/2402.12914.

Huang, D., Zhang, J. M., Luck, M., Bu, Q., Qing, Y., and Cui, H. Agentcoder: Multi-agent-based code generation with iterative testing and optimisation, 2024. URL https://arxiv.org/abs/2312.13010.

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. Swe-bench: Can language models resolve real-world github issues?, 2024. URL https://arxiv.org/abs/2310.06770.

Karli, U. B. and Fitzgerald, T. Extended abstract: Resolving ambiguities in LLM-enabled human-robot collaboration. In 2nd Workshop on Language and Robot Learning: Language as Grounding, 2023. URL https://openreview.net/forum?id=LtwuJx83Rc.

Keluskar, A., Bhattacharjee, A., and Liu, H. Do llms understand ambiguity in text? a case study in open-world question answering, 2024. URL https://arxiv.org/abs/2411.12395.

Kim, H. J., Kim, Y., Park, C., Kim, J., Park, C., Yoo, K. M., goo Lee, S., and Kim, T. Aligning language models to explicitly handle ambiguity, 2024. URL https://arxiv.org/abs/2404.11972.

Kuhn, L., Gal, Y., and Farquhar, S. Clam: Selective clarification for ambiguous questions with generative language models, 2023. URL https://arxiv.org/abs/2212.07769.

Lahiri, S. K., Fakhoury, S., Naik, A., Sakkas, G., Chakraborty, S., Musuvathi, M., Choudhury, P., von Veh, C., Inala, J. P., Wang, C., and Gao, J. Interactive code generation via test-driven user-intent formalization, 2023. URL https://arxiv.org/abs/2208.05950.

Llama team. The llama 3 herd of models. https://ai.meta.com/research/publications/the-llama-3-herd-of-models/, July 2024. Accessed on January 9, 2025.

Lu, C., Lu, C., Lange, R. T., Foerster, J., Clune, J., and Ha, D. The ai scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292.

Mu, F., Shi, L., Wang, S., Yu, Z., Zhang, B., Wang, C., Liu, S., and Wang, Q. Clarifygpt: Empowering llm-based code generation with intention clarification, 2023.
URL https://arxiv.org/abs/2310.10996.

Niwa, A. and Iso, H. Ambignlg: Addressing task ambiguity in instruction for nlg, 2024. URL https://arxiv.org/abs/2402.17717.

Park, J., Lim, S., Lee, J., Park, S., Chang, M., Yu, Y., and Choi, S. Clara: Classifying and disambiguating user commands for reliable interactive robotic agents, 2024. URL https://arxiv.org/abs/2306.10376.

Peng, S., Kalliamvakou, E., Cihon, P., and Demirer, M. The impact of ai on developer productivity: Evidence from github copilot, 2023. URL https://arxiv.org/abs/2302.06590.

Purver, M. R. J. The theory and use of clarification requests in dialogue. PhD thesis, University of London King’s College, 2004.

Pyatkin, V., Hwang, J. D., Srikumar, V., Lu, X., Jiang, L., Choi, Y., and Bhagavatula, C. Clarifydelphi: Reinforced clarification questions with defeasibility rewards for social and moral situations, 2023. URL https://arxiv.org/abs/2212.10409.

Rao, S. and Daumé III, H. Learning to ask good questions: Ranking clarification questions using neural expected value of perfect information. In Gurevych, I. and Miyao, Y. (eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2737–2746, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1255. URL https://aclanthology.org/P18-1255.

Sterner, B. Explaining ambiguity in scientific language. Synthese, 200(5):354, 2022.

Sumanathilaka, T. G. D. K., Micallef, N., and Hough, J. Can llms assist with ambiguity? a quantitative evaluation of various large language models on word sense disambiguation, 2024. URL https://arxiv.org/abs/2411.18337.

Testoni, A. and Fernández, R. Asking the right question at the right time: Human and model uncertainty guidance to ask clarification questions. arXiv preprint arXiv:2402.06509, 2024.
Wang, W., Shi, J., Wang, C., Lee, C., Yuan, Y., tse Huang, J., and Lyu, M. R. Learning to ask: When llms meet unclear instruction, 2024a. URL https://arxiv.org/abs/2409.00557.

Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Singh, J., Tran, H. H., Li, F., Ma, R., Zheng, M., Qian, B., Shao, Y., Muennighoff, N., Zhang, Y., Hui, B., Lin, J., Brennan, R., Peng, H., Ji, H., and Neubig, G. Openhands: An open platform for ai software developers as generalist agents, 2024b. URL https://arxiv.org/abs/2407.16741.

White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., and Schmidt, D. C. A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:2302.11382, 2023.

Xu, F. F., Song, Y., Li, B., Tang, Y., Jain, K., Bao, M., Wang, Z. Z., Zhou, X., Guo, Z., Cao, M., Yang, M., Lu, H. Y., Martin, A., Su, Z., Maben, L., Mehta, R., Chi, W., Jang, L., Xie, Y., Zhou, S., and Neubig, G. Theagentcompany: Benchmarking llm agents on consequential real world tasks, 2024. URL https://arxiv.org/abs/2412.14161.

Zhang, M. J. Q. and Choi, E. Clarify when necessary: Resolving ambiguity through interaction with lms, 2023. URL https://arxiv.org/abs/2311.09469.

Zhang, T., Qin, P., Deng, Y., Huang, C., Lei, W., Liu, J., Jin, D., Liang, H., and Chua, T.-S. Clamber: A benchmark of identifying and clarifying ambiguous information needs in large language models, 2024. URL https://arxiv.org/abs/2405.12063.

Zhou, X., Kim, H., Brahman, F., Jiang, L., Zhu, H., Lu, X., Xu, F., Lin, B. Y., Choi, Y., Mireshghallah, N., Le Bras, R., and Sap, M. Haicosystem: An ecosystem for sandboxing safety risks in human-ai interactions. arXiv, 2024a. URL http://arxiv.org/abs/2409.16427.

Zhou, X., Zhu, H., Mathur, L., Zhang, R., Yu, H., Qi, Z., Morency, L.-P., Bisk, Y., Fried, D., Neubig, G., and Sap, M. Sotopia: Interactive evaluation for social intelligence in language agents, 2024b.
URL https://arxiv.org/abs/2310.11667.

A. Appendix

A.1. Experimental Design

A.1.1. Full Setting

In addition to the fully-specified GitHub issue from SWE-Bench Verified, we also include hints from the dataset, which contain the conversation between developers regarding the issue. This helps create a larger knowledge gap in comparison to the Hidden setting.

Prompt for Full Setting

I’ve uploaded a Python code repository in the directory /workspace/{workspace_dirname}. Consider the following PR description:

<pr_description>
{instance.full_issue}
</pr_description>

Here are some additional hints:

<hints>
{instance.hints_text}
</hints>

Can you help me implement the necessary changes to the repository so that the requirements specified in the PR description are met? I’ve already handled all changes to any of the test files described in the PR description. This means you DON’T need to modify the testing logic or any of the tests! Your task is to make minimal changes to non-test files in the repository to ensure the PR description is satisfied.

Follow these steps to resolve the issue:
1. As a first step, explore the repo to familiarize yourself with its structure.
2. Create a script to reproduce the error and execute it with python <filename.py> using the BashTool to confirm the error.
3. Edit the source code in the repo to resolve the issue.
4. Rerun your reproduce script to confirm the error is fixed.
5. Consider edge cases and make sure your fix handles them as well.
Your thinking should be thorough, and it’s fine if it’s very long.

A.1.2. Interaction Setting

In this setting, the user proxy agent receives both the fully specified issue and additional hints, maintaining the knowledge gap relative to the Hidden setting. This provides extra information for the coding agent to extract through interaction.
The files to be modified are also provided to the user proxy agent, allowing us to track specific details across issues. Since file-related information is universally useful, unlike other details whose importance may be subjective, it enables evaluation of how effectively different models incorporate critical information into their solution paths. This setup reflects a scenario where the user might know additional details not included in their initial input, which can still be extracted to improve performance. While more capable models may independently retrieve this information by exploring the codebase, it can be particularly helpful for lower-performing models. By tracking which models choose to extract this information, we gain insights into the types of questions they ask and observe behavioral trends across models.

Prompt for Interaction Setting with Mandatory Interaction

I’ve uploaded a Python code repository in the directory /workspace/{workspace_dirname}. Consider the following PR description:

<pr_description>
{instance.summarized_issue}
</pr_description>

Can you help me implement the necessary changes to the repository so that the requirements specified in the PR description are met? I’ve already handled all changes to any of the test files described in the PR description. This means you DON’T need to modify the testing logic or any of the tests! Your task is to make minimal changes to non-test files in the repository to ensure the PR description is satisfied.

I have not provided all the necessary details about the issue and I have some hidden details that are helpful. Please ask me specific questions using non-code commands to gather the relevant information that I have to help you solve the issue. Ensure you have all the details you require to solve the issue. You have a limited number of turns.
Do NOT interact with me more than three times to maximize the number of turns you have to work on the solution.

Follow these steps to resolve the issue:
1. As a first step, look at the issue and ask me questions to get all the necessary details about the issue. You can also ask me questions if you run into a problem in later steps.
2. Then, it might be a good idea to explore the repo to familiarize yourself with its structure.
3. Create a script to reproduce the error and execute it with python <filename.py> using the BashTool to confirm the error.
4. Edit the source code in the repo to resolve the issue.
5. Rerun your reproduce script to confirm the error is fixed.
6. Think about edge cases and make sure your fix handles them as well.
Your thinking should be thorough, and it’s fine if it’s very long.

Prompt to User Proxy

You are a GitHub user reporting an issue. Here are the details of your issue and environment:

Issue: {issue}
Hints: {hints}
Files relative to your current directory: {files}

Your task is to respond to questions from a coder who is trying to solve your issue. The coder has a summarized version of the issue you have. Follow these rules:
1. If the coder asks a question that is directly related to the information in the issue you have, provide that information.
2. Always stay in character as a user reporting an issue, not as an AI assistant.
3. Keep your responses concise and to the point.
4. The coder has limited turns to solve the issue. Do not interact with the coder beyond 3 turns.
Respond with "I don’t have that information" if the question is unrelated or you’re unsure.

A.1.3. Hidden Setting

Prompt for Hidden Setting

I’ve uploaded a Python code repository in the directory /workspace/{workspace_dirname}.
Consider the following PR description:

<pr_description>
{instance.summarized_issue}
</pr_description>

Can you help me implement the necessary changes to the repository so that the requirements specified in the PR description are met? I’ve already taken care of all changes to any of the test files described in the PR description. This means you DON’T need to modify the testing logic or any of the tests! Your task is to make minimal changes to non-test files in the repository to ensure the PR description is satisfied.

Follow these steps to resolve the issue:
1. As a first step, it might be a good idea to explore the repo to familiarize yourself with its structure.
2. Create a script to reproduce the error and execute it with python <filename.py> using the BashTool to confirm the error.
3. Edit the source code in the repo to resolve the issue.
4. Rerun your reproduce script to confirm the error is fixed.
5. Consider edge cases and make sure your fix handles them as well.
Your thinking should be thorough, and it’s fine if it’s very long.

Prompt For Summarizing GitHub Issues

I have several issues from GitHub related to code specifications. Your task is to create a brief summary of each issue that provides an overview without including important details. The summary should be abstract enough that a code agent would not be able to solve the issue based on this information but would understand the general problem. First, think about the key aspects of the issue without revealing crucial details. Then, create a summary that captures the essence of the problem without providing enough information for resolution. Use the <summary> and </summary> tags around your generated summary. The output should be in the form: <summary> ... </summary>

Here is the issue: {issue}

A.2. Statistical Methods

A.2.1. Wilcoxon Signed-Rank Test

The Wilcoxon Signed-Rank Test is a non-parametric statistical test used to determine if there is a significant difference between the medians of two related groups. Unlike the paired t-test, it does not assume that the differences between paired observations are normally distributed, making it more suitable for cases where this assumption may not hold.

In this work, the Wilcoxon Signed-Rank Test is applied to compare the performance of models between two settings (e.g., Hidden vs. Interaction, Interaction vs. Full) with the hypothesis that performance in the second setting is greater than in the first. Formally, the null hypothesis ($H_0$) for the Wilcoxon Signed-Rank Test states that the median difference between the two settings is zero or negative:

$$H_0: \tilde{d} \le 0$$

where $\tilde{d}$ represents the median of the paired differences. The alternative hypothesis ($H_1$) asserts that the median difference is greater than zero:

$$H_1: \tilde{d} > 0$$

The test ranks the absolute differences between paired observations, considering both the magnitude and direction of change. If the p-value obtained from the test is less than the significance threshold (0.05), we reject the null hypothesis, concluding that there is a statistically significant improvement in performance between the two settings.

A.3. Ambiguity Detection prompts

• Neutral: Ensure you have all the necessary information to proceed. If any part of the issue is unclear or lacks critical details, ask concise, targeted questions to clarify. If everything is clear, you can move ahead without asking unnecessary questions.
• Moderate Encouragement: Before attempting a solution, carefully check whether all key information is provided. If there’s any ambiguity or missing details that could impact your work, don’t hesitate to ask questions. Your goal is to gather the information needed for an accurate and efficient solution.
Only skip asking questions when you are absolutely sure all details are complete.
• Strong Encouragement: Your success depends on having all relevant details to solve the issue effectively. Whenever you encounter unclear or missing information, proactively ask questions to fill those gaps. Even minor ambiguities can affect the outcome, so always prioritize clarifying questions. Avoid questions only when you are 100% certain no further clarification is needed.

A.4. Question Quality Analysis

$$\text{Cosine Distance}(P, Q) = 1 - \frac{P \cdot Q}{\|P\| \, \|Q\|} \tag{1}$$

where:
• $P = \{p_1, p_2, \ldots, p_N\}$ represents the embedding vector of the updated knowledge ($E_{\text{after}}$).
• $Q = \{q_1, q_2, \ldots, q_N\}$ represents the embedding vector of the initial knowledge ($E_{\text{before}}$).
• $N = 1536$ is the dimensionality of the embedding space.

Model: Llama 3.1 70B
Summarized Issue: The InheritDocstrings metaclass is encountering a problem where it fails to handle properties correctly due to its reliance on a function-checking mechanism that does not recognize properties as functions.
Selected Questions:
1. Are there any existing workarounds or temporary fixes for this problem that I should be aware of?
2. Are there any specific properties or use cases that are causing the issue, or is it a more general problem with the InheritDocstrings metaclass?
Corresponding Answers:
1. I don’t have that information.
2. This seems to be a more general problem with how the InheritDocstrings metaclass handles properties, not specific to any properties or use cases.

Model: Deepseek chat
Summarized Issue: The InheritDocstrings metaclass is encountering a problem where it fails to handle properties correctly due to its reliance on a function-checking mechanism that does not recognize properties as functions.
Selected Questions:
1. Can you provide more details about the InheritDocstrings metaclass? Specifically, what is its purpose, and how is it currently implemented in the repository?
2. Are there any existing tests or examples that demonstrate the issue? If so, can you provide the names of these tests or examples?
Corresponding Answers:
1. The InheritDocstrings metaclass is intended to ensure that docstrings are inherited from the parent class when the docstring is not explicitly defined in the child class. It currently uses inspect.isfunction, which doesn’t work with properties since they are not functions.
2. I don’t have that information.

Model: Claude Haiku 3.5
Summarized Issue: The InheritDocstrings metaclass is encountering a problem where it fails to handle properties correctly due to its reliance on a function-checking mechanism that does not recognize properties as functions.
Selected Questions:
1. Could you provide more details about where the InheritDocstrings metaclass is located in the repository?
2. When you say the metaclass "fails to handle properties correctly", could you elaborate on what specific behavior is incorrect?
Corresponding Answers:
1. The InheritDocstrings metaclass is located in the file astropy/utils/misc.py.
2. The incorrect behavior is related to docstring inheritance. Specifically, properties that are redefined in a subclass without a docstring do not inherit the docstring from the base class.

Model: Claude Sonnet 3.5
Summarized Issue: The issue involves a method that is expected to create a header from a byte string, but it does not accept Python 3 bytes as input. The method works with Python 3’s unicode strings, assuming ASCII encoding, but this behavior is inconsistent with its counterpart method that handles both text and binary file modes.
Selected Questions:
1. Which specific method in the codebase is causing the issue with byte string input? Please provide the class name and method name.
2. Can you share an example of the byte string input that should work but currently fails?
Corresponding Answers:
1. The specific method causing the issue with byte string input is Header.fromstring.
2. I don’t have a specific example of the byte string input, but it should be a byte string that contains the entire header data.

Table 5. Examples of questions and their responses across different models, chosen to reflect general observed patterns.
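As an illustration of the cosine distance in Equation (1) of A.4, the following minimal sketch computes it with the Python standard library only. The function name and the three-dimensional toy vectors are assumptions made for this example; the paper's vectors have N = 1536 dimensions, and the embedding model that produces them is not shown here.

```python
import math


def cosine_distance(p, q):
    """Return 1 - (p . q) / (||p|| * ||q||) for two equal-length vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1 - dot / (norm_p * norm_q)


# Toy stand-ins for E_before and E_after (hypothetical values, not paper data).
e_before = [0.2, 0.1, 0.7]
e_after = [0.3, 0.4, 0.5]
print(cosine_distance(e_after, e_before))
```

Under this reading, a larger distance between the knowledge embeddings before and after the interaction suggests that the answers added more new information, which matches how the Limitations section describes measuring changes in the latent vector.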

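The one-sided Wilcoxon signed-rank test described in A.2.1 can be sketched as follows. This is not the authors' implementation, which the paper does not specify: the sketch assumes that zero differences are dropped, that tied absolute differences receive average ranks, and that the p-value comes from a normal approximation with a tie correction. The function name and the per-issue outcome lists are made up for illustration.

```python
import math
from collections import Counter


def wilcoxon_signed_rank_greater(first, second):
    """One-sided Wilcoxon signed-rank test for paired samples.

    Tests H1: the median of (second - first) is greater than zero.
    Returns (w_plus, p_value); the p-value uses a normal approximation
    with a tie correction (an assumption of this sketch).
    """
    # Zero differences carry no direction, so they are dropped here.
    diffs = [b - a for a, b in zip(first, second) if b != a]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # no evidence in either direction

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1

    # W+ is the sum of the ranks that belong to positive differences.
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)

    mean = n * (n + 1) / 4
    tie_sizes = Counter(abs(d) for d in diffs).values()
    variance = n * (n + 1) * (2 * n + 1) / 24 - sum(t**3 - t for t in tie_sizes) / 48
    z = (w_plus - mean) / math.sqrt(variance)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return w_plus, p_value


# Made-up per-issue outcomes (1 = resolved, 0 = not resolved), not paper data.
hidden = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
interaction = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
w_plus, p_value = wilcoxon_signed_rank_greater(hidden, interaction)
print(f"W+ = {w_plus}, reject H0 at 0.05: {p_value < 0.05}")
```

With binary resolved/unresolved outcomes, every non-zero difference has the same magnitude, so the tie correction does most of the work; for small samples an exact p-value, as offered by statistical libraries, may be preferable to the normal approximation used here.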
---