Authors: Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, Graham Neubig
Paper Content:
Interactive Agents to Overcome Ambiguity in Software Engineering
Sanidhya Vijayvargiya¹, Xuhui Zhou¹, Akhila Yerukola¹, Maarten Sap¹, Graham Neubig¹
Abstract
AI agents are increasingly being deployed to auto-
mate tasks, often based on ambiguous and under-
specified user instructions. Making unwarranted
assumptions and failing to ask clarifying ques-
tions can lead to suboptimal outcomes, safety
risks due to tool misuse, and wasted computa-
tional resources. In this work, we study the ability
of LLM agents to handle ambiguous instructions
in interactive code generation settings by evaluat-
ing proprietary and open-weight models on their
performance across three key steps: (a) leveraging
interactivity to improve performance in ambigu-
ous scenarios, (b) detecting ambiguity, and (c)
asking targeted questions. Our findings reveal
that models struggle to distinguish between well-
specified and underspecified instructions. How-
ever, when models interact for underspecified
inputs, they effectively obtain vital information
from the user, leading to significant improvements
in performance and underscoring the value of ef-
fective interaction. Our study highlights critical
gaps in how current state-of-the-art models handle
ambiguity in complex software engineering tasks
and structures the evaluation into distinct steps to
enable targeted improvements (see footnote 1).
1. Introduction
Large Language Models (LLMs) are increasingly used as
chatbots in task-oriented workflows to improve productiv-
ity (Peng et al., 2023; Brynjolfsson et al., 2023), with the
user providing a task instruction which the model com-
pletes. Due to the interactive nature of chatbots, the perfor-
mance depends on the information provided in the user’s
prompt. Users often provide non-descriptive instructions,
which poses critical challenges in successfully completing
the task (Chowdhury et al., 2024). The ambiguity can lead
not only to erroneous outcomes, but also to significant safety
issues (Kim et al., 2024; Karli & Fitzgerald, 2023).
¹Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA. Correspondence to: Sanidhya Vijayvargiya <sanidhyv@cs.cmu.edu>.
Footnote 1: Code and data can be accessed at https://github.com/sani903/InteractiveSWEAgents
Figure 1. Interactive agents mitigate resource wastage and reduce
misalignment in ambiguous settings.
This ambiguity can lead to more severe consequences in task
automation scenarios, where AI agents are equipped with
powerful tools (Wang et al., 2024b; Lu et al., 2024; Huang
et al., 2024; Zhou et al., 2024a). In software engineering
settings, agents must navigate complex codebases, make
architectural decisions, and modify critical systems—all
while operating with potentially incomplete or ambiguous
instructions. When human developers face such ambiguity, they engage in clarifying dialogue to gather missing
context (Testoni & Fernández, 2024; Purver, 2004). However, current AI systems often proceed with incomplete
understanding, leading to costly mistakes and misaligned
solutions, as demonstrated in Figure 1.
arXiv:2502.13069v1 [cs.AI] 18 Feb 2025
In this work, we systematically evaluate the interaction capabilities of commonly used open and proprietary LLMs when
addressing underspecified instructions in agentic code set-
tings (§2). We examine three research questions to address
the problem for code generation.
1. Interactive problem solving: Can LLMs appropriately leverage interaction with the user to improve performance in ambiguous settings?
2. Detection of ambiguity: Can LLMs identify whether a given task description is underspecified and ask clarifying questions?
3. Question quality: Can LLMs generate meaningful and targeted questions that gather the necessary information to complete the task?
We evaluate the research questions separately to ensure independence between them. We use the GitHub issues from
SWE-Bench Verified (Chowdhury et al., 2024) to simulate
well-specified inputs, and the summarized variants of the
same GitHub issues as underspecified inputs for the experiments. A simulated user (Xu et al., 2024; Zhou et al., 2024b),
equipped with the full, well-specified issue, simulates real
conversations where the user has additional context, which
is provided only when prompted with the appropriate ques-
tions. This multi-stage approach allows for targeted im-
provements in individual aspects, offering a pathway to
enhance overall system performance.
Through our evaluations across the different settings, we
find that interactivity can boost performance on underspeci-
fied inputs by up to 74% over the non-interactive settings but
the performance varies between models (§3). LLMs default
to non-interactive behavior without explicit encouragement,
and even with it, they struggle to distinguish between under-
specified and well-specified inputs. Claude Sonnet 3.5 is the
only evaluated LLM that achieves notable accuracy (84%)
in making this distinction. Prompt engineering offers lim-
ited improvement, and its effectiveness varies across mod-
els (§4). When interacting, LLMs generally pose questions
capable of extracting relevant details, but some models, such
as Llama 3.1 70B, fail to obtain sufficient specificity (§5).
In summary, this study underscores the importance of in-
teractivity in LLMs for agentic workflows, particularly in
real-world tasks where prompt quality varies significantly.
2. Method
2.1. Dataset
In our experiments, we simulate well-specified and under-
specified inputs using the SWE-Bench Verified dataset, a
refined subset of 500 issues from the SWE-Bench dataset.
The SWE-Bench dataset (Jimenez et al., 2024) consists of
real-world GitHub issues, their corresponding pull requests
(PRs), and unit tests from 12 Python repositories. The SWE-Bench Verified dataset (Chowdhury et al., 2024) is designed
to provide a more reliable estimate of an LLM’s ability by
pruning issues that were underspecified or contained invalid
unit tests. The task of an LLM is to modify the state of the
repository at the time of creation of the issue and resolve it.
The test cases are used to verify the patch generated by the
agent.
Given that the Verified subset contains only sufficiently
specified issues, we assume that these issues do not require
disambiguation. Therefore, for each SWE-Bench Verified
issue, we consider two forms, as shown in Figure 2:
1. Fully specified issue: The original and detailed GitHub issue.
2. Underspecified issue: A summarized version generated using GPT-4o, where the model is asked to preserve specific terminology but reduce the amount of detailed content (complete prompt in §A.1.3).
2.2. Agentic Framework
Agent Environment The OpenHands (Wang et al., 2024b)
agentic framework equips the LLM with an interactive en-
vironment that extends its capabilities beyond static code
generation. The agent operates within a structured execu-
tion environment where it can iteratively refine code, plan
tasks, and run commands using integrated tools. It has the
ability to edit files, break down complex instructions into
executable steps, and execute both Bash and Python scripts
within a secure sandbox. This controlled environment en-
ables the agent to analyze execution outputs, detect and
debug errors, and refine its approach based on observed
results, ensuring adaptability and correctness in solving
complex programming tasks.
Selected Models We use Claude Sonnet 3.5 (Anthropic,
2024b) as one of the proprietary models due to its superior
performance on SWE-Bench. Claude Haiku 3.5 (Anthropic,
2024a) is included as the second proprietary model to in-
vestigate the impact of model parameterization, as both
models likely share similar training methodologies but dif-
fer significantly in the number of parameters. Additionally,
we evaluate Llama 3.1 70B-Instruct (Llama team, 2024)
and Deepseek-v2 (DeepSeek-AI, 2024) as two open-weight
frontier models.
User Proxy We employ GPT-4o (Ahmad & OpenAI,
2024) as a user proxy to simulate user-agent interactions (Xu
et al., 2024; Zhou et al., 2024a). The user proxy is provided
with the fully specified version of the task, allowing the
coding agent to extract the necessary information through
interaction. It is instructed to respond based solely on the
information available in the full issue and will reply with
“I don’t have that information” if relevant details are missing. This approach prevents the user proxy from hallucinating incorrect information and encourages clear, negative
responses when needed. The full prompt is shown in §A.1.2.
Figure 2. The three settings in order: Full, Hidden, and Interaction
2.3. Study Design
We use three distinct settings to evaluate models across the
500 issues from SWE-Bench Verified, as shown in Figure 2 and
described below.
• Full setting: This is the traditional SWE-Bench setting for resolving GitHub issues. The coding agent is provided with the fully specified task and the interaction is disabled. It represents the agent’s performance in an unambiguous scenario, where the agent has access to full information, simulating ideal conditions.
• Hidden setting: A summarized version of the issue is provided to the coding agent with the user-agent interaction disabled to mimic the lack of detail that can occur in task descriptions. We do not give any interaction-related instructions, and all models default to non-interactive behavior. Specific details are hidden from the coding agent.
• Interaction setting: The coding agent receives a summarized task, while the user proxy model gets the fully specified task. Interaction is enabled through prompting, allowing the agent to query the proxy for specific details. The models do not interact with the user without an explicit prompt. In addition to the full issue, the proxy has access to file locations that need modification and can provide them when queried. This setup allows us to evaluate which models proactively seek navigational information and examine how this interaction influences the success of the solution process across models.
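The three settings differ only in what the agent sees, whether it may ask questions, and what the proxy can reveal. The following is a minimal sketch of that configuration, not the authors' implementation: the names `Setting`, `agent_input`, and `proxy_reply` are hypothetical, and a plain dictionary lookup stands in for the GPT-4o user proxy.

```python
from dataclasses import dataclass

# Fallback reply quoted in the paper's description of the user proxy.
FALLBACK = "I don't have that information"


@dataclass(frozen=True)
class Setting:
    """Hypothetical description of one evaluation setting."""
    name: str
    agent_sees_full_issue: bool
    interaction_enabled: bool
    proxy_knows_file_locations: bool


FULL = Setting("Full", True, False, False)
HIDDEN = Setting("Hidden", False, False, False)
INTERACTION = Setting("Interaction", False, True, True)


def agent_input(setting, full_issue, summary):
    """Return the task text the coding agent receives in this setting."""
    return full_issue if setting.agent_sees_full_issue else summary


def proxy_reply(setting, known_facts, topic):
    """Toy stand-in for the user proxy: answer only from known facts.

    `known_facts` is an assumed dict built from the full issue (and, in the
    Interaction setting, the file locations); anything else gets FALLBACK.
    """
    if not setting.interaction_enabled:
        raise RuntimeError(f"interaction is disabled in the {setting.name} setting")
    return known_facts.get(topic, FALLBACK)
```

For example, `proxy_reply(INTERACTION, {"file": "a.py"}, "deadline")` returns the fallback string, mirroring the instruction that the proxy should not invent details absent from the full issue.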
3. RQ1: Interactive Problem Solving
Effectively addressing ambiguity requires a model to inte-
grate information from user interactions to form a clear plan
and successfully solve the task. Our first experiment holis-
tically evaluates the model’s ability to leverage interaction
and improve performance. The model must not only process
the initial task description, but also query users to extract
relevant details while filtering out irrelevant information.
3.1. Experimental Setup
The hypothesis of the experiment is that different language
models will exhibit varying performance with interaction
based on their incorporation of the provided information,
leading to different levels of improvement over the Hidden
setting. We evaluate the models across the three settings
and conduct two Wilcoxon signed-rank tests with a significance level of 0.05 to determine significant performance
differences between the Hidden and Interaction settings, and
between the Interaction and Full settings for every model.
Here, we modify the prompt to make interaction with the
user compulsory in the Interaction setting (see footnote 2). Ideally, the Interaction setting should approach the performance of the
Full setting. The coding agent has a maximum of 30 turns to
generate a solution patch.
Footnote 2: Without compulsory interaction, the model defaults to non-interactive behavior for most issues, as seen in the Hidden setting. Full prompt in §A.1.2.
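As an illustration of the statistical comparison, the sketch below implements a two-sided Wilcoxon signed-rank test with the normal approximation and tie correction, using only the standard library. It assumes the paired samples are per-issue resolve outcomes (1 = resolved, 0 = not) in two settings; the paper does not state the exact pairing, and the example data are invented. In practice a library routine such as SciPy's `scipy.stats.wilcoxon` would normally be used.

```python
import math
from itertools import groupby


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Zero differences are dropped and tied absolute differences receive
    average ranks. Returns (W_plus, z, p_value).
    """
    if len(x) != len(y):
        raise ValueError("paired samples must have equal length")
    diffs = [b - a for a, b in zip(x, y) if b != a]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0  # no informative pairs

    ordered = sorted(diffs, key=abs)
    w_plus = 0.0      # sum of ranks of positive differences
    tie_term = 0.0    # sum of t^3 - t over tie groups
    position = 0      # number of differences already ranked
    for _, group in groupby(ordered, key=abs):
        group = list(group)
        t = len(group)
        avg_rank = position + (t + 1) / 2
        w_plus += avg_rank * sum(1 for d in group if d > 0)
        tie_term += t ** 3 - t
        position += t

    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, z, p_value


# Invented per-issue outcomes for ten issues (illustration only).
hidden = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
interaction = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
w_plus, z, p_value = wilcoxon_signed_rank(hidden, interaction)
significant = p_value < 0.05
```

With binary outcomes every non-zero difference has the same magnitude, so the tie correction matters; omitting it would understate the test statistic.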
3.2. Leveraging Interaction in Ambiguity
In this experiment, each model is tested in the Hidden, Interaction, and Full settings to evaluate its ability to leverage
interaction and optimize performance on underspecified is-
sues. The results, as shown in Figure 3, confirm the expected
increase in resolve rates as more information becomes avail-
able to the agent. While the difference between the Hid-
den and Interaction settings is significant for every model
(Table 1), emphasizing the impact of interaction on the tra-
jectory, the performance gap between the Interaction and
Full settings is also significant across all models, highlight-
ing the unrealized potential. Specifically, for the Hidden
vs. Interaction settings, proprietary models show stronger
evidence of a significant difference. These results suggest
that the ability to leverage interaction varies across models,
with proprietary models demonstrating greater effectiveness
in utilizing interaction compared to open-weight models.
Figure 3. Resolve rates (in %) across different settings: Hidden
(underspecified issues), Interaction (underspecified issues with
user interaction), and Full (fully specified issues)
Using interaction, the Claude Sonnet and Haiku agents recre-
ate 80% of the performance in the Full setting. However,
with Deepseek and Llama 3.1, the relative performance is
lower, at 59% and 54%, respectively. Claude Sonnet 3.5’s
high resolve rate in the Hidden setting is likely due to its su-
perior programming acumen. The performance is surprising,
as a human would be able to decipher little about the expec-
tations given the summarized issue. Better programming
models can potentially extract more information from the
stack trace by reproducing the error themselves. We observe
that the Claude Haiku model achieves a performance rela-
tive to the Full setting similar to that of the Claude Sonnet
model, despite having inferior coding abilities. Thus, there
is no direct correlation between the number of parameters or
coding ability and a model’s ability to leverage interaction.
This hints towards better training practices that can lead to better integration of the new information.
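The two ratios used in this discussion can be made explicit. The sketch below assumes that "recreating X% of the Full performance" means the Interaction resolve rate divided by the Full resolve rate, and that an improvement "over the Hidden setting" is a relative gain; the function names and the numbers in the usage note are hypothetical and are not values from Figure 3.

```python
def relative_performance(interaction_rate, full_rate):
    """Fraction of the Full-setting resolve rate recovered with interaction."""
    if full_rate <= 0:
        raise ValueError("full_rate must be positive")
    return interaction_rate / full_rate


def relative_gain(hidden_rate, interaction_rate):
    """Relative improvement of the Interaction setting over the Hidden setting."""
    if hidden_rate <= 0:
        raise ValueError("hidden_rate must be positive")
    return (interaction_rate - hidden_rate) / hidden_rate
```

For instance, with made-up resolve rates of 20% (Hidden), 35% (Interaction), and 50% (Full), `relative_gain(20.0, 35.0)` is 0.75 and `relative_performance(35.0, 50.0)` is 0.7.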
Model              Comparison             p-value
Llama 3.1 70B      Hidden vs Interaction  0.0023
Llama 3.1 70B      Interaction vs Full    3.87e-06
Claude Haiku 3.5   Hidden vs Interaction  2.18e-14
Claude Haiku 3.5   Interaction vs Full    1.65e-09
Claude Sonnet 3.5  Hidden vs Interaction  8.55e-19
Claude Sonnet 3.5  Interaction vs Full    1.28e-12
Deepseek-v2        Hidden vs Interaction  0.0023
Deepseek-v2        Interaction vs Full    2.87e-07
Table 1. Wilcoxon signed-rank test results for the Hidden vs Interaction settings and for the Interaction vs Full settings for each model
This experiment highlights the importance of interaction
in mitigating ambiguity. Since many real-world software
engineering problems are underspecified (Chowdhury et al.,
2024), interactive systems are essential for ensuring align-
ment and reducing safety risks. However, current models
default to non-interactive behavior even when faced with
ambiguity and struggle to match the performance seen in
well-specified settings. While interactive trajectories show statistically significant performance gains over non-interactive approaches for ambiguous inputs (Table 1), the gap to the Full setting also remains significant, indicating strong potential for improvement.
3.3. Impact of Interaction Details on Model
Performance
In the Interaction setting of the previous experiment, the in-
formation gained can be broadly categorized into two types:
informational , which relates to the expected behavior or
nature of the error, and navigational , which pertains to the
locations of the files to modify. While informational details
are typically obtained in nearly every interaction, the models
request navigational details less frequently. We measure the
resolve rates separately for instances where the model asks
for navigational details and when it does not, examining
the impact on performance when models must rely only on
informational details versus when navigational details are
also accessible.
As seen in Table 2, requesting navigational details improves
performance across all models by providing cues beyond de-
scribed behavior and errors. However, some models rely too
heavily on this information and struggle when it’s missing.
Smaller models like Llama 3.1 and Deepseek-v2 request
file locations more often but underperform without them.
Claude models, particularly Sonnet, better leverage infor-
mational cues, achieving higher resolve rates even without
navigational details. Deepseek, by contrast, performs worse
than its Hidden setting when file locations are absent, highlighting its dependence. This reliance leads to wasted turns
searching for errors instead of identifying them efficiently.
Llama 3.1 performs better than Hidden without file locations but gains little when they are provided, likely due to
poor detail extraction (§5). Ideally, LLMs should
generalize across diverse interaction types, as users may
not always provide specific details, improving robustness in
real-world software engineering tasks.
Model              Navigational Info (%)  Resolve Rate With (%)  Resolve Rate Without (%)
Claude Sonnet 3.5   8.96                  59.52                  37.94
Claude Haiku 3.5   24.67                  36.94                  24.78
Deepseek-v2        30.70                  13.19                   4.62
Llama 3.1 70B      30.28                   6.34                   4.28
Table 2. The percentage of issues where navigational information was acquired in the Interaction setting, along with the corresponding
resolve rates with and without navigational information. Navigational information refers to file locations requiring modification, helping
to avoid tedious code exploration. The resolve rates offer insight into how the information obtained during interaction impacts overall
performance.
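To relate the two conditional resolve rates in Table 2 to a single figure, one can weight them by how often navigational information was obtained. The sketch below rests on an assumption that is not stated in the paper: that the "Navigational Info (%)" column is the share of all issues and that the two resolve rates are conditional on that split. The resulting blended values are derived arithmetic, not reported results.

```python
# (navigational info %, resolve rate with %, resolve rate without %) from Table 2.
TABLE_2 = {
    "Claude Sonnet 3.5": (8.96, 59.52, 37.94),
    "Claude Haiku 3.5": (24.67, 36.94, 24.78),
    "Deepseek-v2": (30.70, 13.19, 4.62),
    "Llama 3.1 70B": (30.28, 6.34, 4.28),
}


def blended_resolve_rate(nav_pct, rate_with, rate_without):
    """Weighted average of the two conditional resolve rates (in %)."""
    share = nav_pct / 100.0
    return share * rate_with + (1.0 - share) * rate_without


def navigational_uplift(rate_with, rate_without):
    """Percentage-point difference between the two conditional rates."""
    return rate_with - rate_without


blended = {model: blended_resolve_rate(*row) for model, row in TABLE_2.items()}
uplift = {model: navigational_uplift(row[1], row[2]) for model, row in TABLE_2.items()}
```

Under this reading, the percentage-point uplift is largest for Claude Sonnet 3.5 and smallest for Llama 3.1 70B, which matches the observation that Llama gains little even when file locations are provided.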
Takeaway: Interaction has significant potential to improve
model performance in ambiguous tasks, but models, particularly the weaker open-weight models, struggle to leverage it effectively. Proprietary models like Claude Sonnet
3.5 and Haiku 3.5 achieve nearly 80% of their Full setting
performance, with Haiku improving by 74% over its Hidden
setting performance through effective integration of both
informational and navigational cues. The lack of correlation
between model size and its ability to utilize interaction sug-
gests that better training practices play a more crucial role.
In contrast, models like Deepseek-v2 and Llama 3.1 show
limited gains, primarily due to their challenges in utilizing
broader informational cues, which hinders their adaptability
in ambiguous tasks.
4. RQ2: Ambiguity Detection
In real-world LLM and agent applications, task descrip-
tions and prompts can vary in quality (Chowdhury et al.,
2024). To detect ambiguity, a model must recognize un-
clear expectations or identify missing key information in its
planned approach. However, interacting unnecessarily when
sufficient information is already available can introduce in-
efficiencies and place an undue burden on the user. Here,
we examine the capabilities of LLMs to detect ambiguous
instructions in software engineering contexts.
4.1. Experimental Setup
In this experiment, each issue is presented in either the Full
setting or the Hidden setting. The objective is to identify
patterns in how models choose to interact based on the input
type. Ideally, the model should have a high interaction rate
for the summarized inputs and a negligible interaction rate
for the well-specified inputs.
In the instructions that outline the task, we present the
agent with an option to interact during its solution trajectory
and design three instructions with varying levels of encouragement to interact with the user. We track the input type the
model chooses to interact with. The instructions, listed in
order of increasing encouragement to interact, are: Neutral,
where the agent is told it can ask questions if anything is
unclear; Moderate Encouragement, where the agent is told
to carefully check that all necessary information is available
and only proceed after everything is clear; and Strong Encouragement, where the agent is told that asking questions
is critical to task success (full prompts in §A).
4.2. Effect of Different Prompts
Experiments to detect ambiguity demonstrate that, using
prompt engineering, we can control the level of interaction
with the user, as shown in Table 3. However, this interactivity
must be requested explicitly in the prompt: without any
mention of interaction, the models almost never interact on
any of the summarized issue inputs.
The Claude Sonnet model performs best with Moderate
Encouragement, achieving the highest overall accuracy of
84% across all variations. Its counterpart from the same
model family, Claude Haiku, is hesitant to interact even
with Strong Encouragement. The Claude models show a
drop in accuracy in cases where interaction is not needed
as their overall interaction increases, indicating that the
interaction fails to target underspecified inputs effectively.
For the Deepseek model, we observe that the Neutral prompt
gives the best results as interactivity surprisingly decreases
with more encouragement. The accuracy in both the cases
where interaction was desired and not desired is around 70%,
which shows that the model is capable of distinguishing
between well-specified and underspecified issues to some
extent. The Llama model displays a greater, but arbitrary,
tendency to interact across all prompts than other models.
4.3. Detection across Models
While interaction levels can be adjusted with prompting,
both summarized issues and full issues have equal probability of being selected for interaction as interactivity increases,
particularly with smaller models. Despite the stark difference in the language and detail of summarized issues and
fully specified issues, the models, except Claude Sonnet, fail
to reliably distinguish them, indicating that LLMs struggle
to detect ambiguity even in obvious cases. All models, including Claude Sonnet, show large changes in their ambiguity
detection behavior with prompt variations. Interestingly,
Sonnet outperforms Haiku, likely due to its more extensive
instruction tuning or human feedback training, which helps
it better follow instructions and achieve the desired interactive trajectory. Surprisingly, even Deepseek adapts better to
the task than Haiku.
                   Neutral                Moderate Encouragement Strong Encouragement
Model              Accuracy↑ FPR↓  FNR↓   Accuracy↑ FPR↓  FNR↓   Accuracy↑ FPR↓  FNR↓
Claude Sonnet 3.5  0.60      0.00  0.81   0.84      0.24  0.09   0.76      0.36  0.10
Claude Haiku 3.5   0.54      0.00  0.97   0.57      0.02  0.90   0.63      0.06  0.66
Deepseek-v2        0.69      0.30  0.31   0.57      0.08  0.83   0.51      0.04  0.94
Llama 3.1 70B      0.48      0.46  0.57   0.47      0.95  0.09   0.52      0.93  0.06
Table 3. Model performance in ambiguity detection across prompts with varying levels of interaction encouragement. FPR refers to
cases where the model interacted unnecessarily, while FNR refers to cases where it failed to interact when needed. A model that reliably
distinguishes between underspecified and well-specified issues should have high accuracy, low FPR, and low FNR.
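The accuracy, FPR, and FNR reported in Table 3 follow from treating "underspecified" as the positive class and "the agent interacted" as the prediction. The sketch below shows one way to compute them; the function name and the record format are assumptions made for illustration, and the example records are invented.

```python
def detection_metrics(records):
    """Compute (accuracy, FPR, FNR) for ambiguity detection.

    `records` is an iterable of (is_underspecified, agent_interacted) pairs.
    FPR: share of well-specified issues where the agent interacted anyway.
    FNR: share of underspecified issues where the agent did not interact.
    """
    tp = fp = tn = fn = 0
    for is_underspecified, agent_interacted in records:
        if is_underspecified and agent_interacted:
            tp += 1
        elif is_underspecified:
            fn += 1
        elif agent_interacted:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no records given")
    accuracy = (tp + tn) / total
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return accuracy, fpr, fnr


# Invented example: 10 underspecified and 10 well-specified issues.
example = (
    [(True, True)] * 9 + [(True, False)] * 1
    + [(False, True)] * 2 + [(False, False)] * 8
)
accuracy, fpr, fnr = detection_metrics(example)
```

When the two input types are equally frequent, accuracy equals 1 − (FPR + FNR) / 2, so a model cannot score well by interacting on everything or on nothing.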
Takeaway: Prompt engineering can influence model interac-
tivity but fails to consistently improve ambiguity detection
across models. When interaction is not explicitly prompted,
models default to non-interactive behavior. Claude Sonnet
shows some ability to distinguish ambiguous inputs, but
other models, including Claude Haiku and Llama 3.1, strug-
gle even with clear cues. This inconsistency reveals that
models are not inherently equipped to detect underspecified
tasks. Improving ambiguity detection requires dedicated
training, not just prompt modifications.
5. RQ3: Question Quality
To gather missing information from underspecified inputs,
the quality of an agent’s questions is crucial. While §3
evaluates task completion, model performance in that
experiment is also influenced by coding ability. Here, we
focus solely on the quality of the questions posed by the
agent to the user, measuring how effectively models extract
relevant information under the assumption that users have
the necessary details.
5.1. Experimental Setup
In this experiment, we evaluate the quality of the inter-
actions between the agent and the user in the Interaction
setting. We measure the novelty and detail level of the in-
formation obtained from the user’s answers to evaluate the
quality, quantifying the new knowledge relative to the existing understanding of the agent. We employ two techniques
to quantify the information obtained.
1. Cosine Distance: We compute the cosine distance $1 - \cos(E_{\text{before}}, E_{\text{after}})$ between the embedding of the summarized task, $E_{\text{before}}$, and the embedding of the cumulative knowledge after interaction with the user, $E_{\text{after}}$, using a text embedding model. Lower distances indicate redundant user input, while higher values show meaningful information gain. We use OpenAI’s text-embedding-3-small as our embedding model.
2. LLM-as-judge (GPT-4o): The model scores the user answers on a scale of 1 to 5, where a higher score corresponds to more new and detailed information in the user’s response, such as specific files causing errors or function behavior. The prompt to the model includes the summarized issue, agent questions, and user responses for better context.
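The cosine-distance measure can be sketched in a few lines. The example below assumes the two embeddings are already available as plain lists of floats; it does not call any embedding service, and the short vectors in the usage note are toy values rather than real text-embedding-3-small outputs.

```python
import math


def cosine_distance(before, after):
    """Return 1 - cos(before, after) for two equal-length vectors."""
    if len(before) != len(after):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(before, after))
    norm_before = math.sqrt(sum(a * a for a in before))
    norm_after = math.sqrt(sum(b * b for b in after))
    if norm_before == 0.0 or norm_after == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_before * norm_after)


# Toy vectors: the summarized task alone, and the summarized task plus
# the user's answers (here simply a vector pointing in a new direction).
e_before = [1.0, 0.0, 0.0]
e_after = [1.0, 1.0, 0.0]
information_gain = cosine_distance(e_before, e_after)
```

A distance near 0 means the user's answers left the embedding almost unchanged, while larger values indicate that the interaction added content not present in the summarized task.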
5.2. Information Gain from Interaction
For the quantitative evaluation of the quality of the question,
both the cosine distance and the LLM-as-judge methods
suggest a similar result, in which the Llama model performs
significantly worse than the other models, whereas the other
models achieve very similar information gains, as seen in
Figure 5.
The Llama model has an average cosine distance of 0.101
when the embedding of the summarized issue is compared
to the embedding of the user response appended to the sum-
marized issue. Deepseek achieves the highest cosine dis-
tance of 0.142, while the Claude Sonnet and Haiku models
achieve very similar cosine distances of 0.136 and 0.135.
Using LLM as a judge, we evaluate the specificity of the
details present in the answers. Here again, the Llama 3.1
model achieves a significantly worse average score of 3.58
than the other models which see similar performance of
around 4 out of 5.
Figure 4. Agent questions and user responses to the same underspecified input are shown for Llama 3.1 70B, Deepseek-v2, and Claude
Haiku 3.5. The examples highlight specific interaction patterns and differences in handling ambiguity. The corresponding model inputs
are detailed in Table 5.
Figure 5. Information gain measured using (a) cosine distance scores and (b) LLM-as-judge scores (scale of 1 to 5) for Llama 3.1, Deepseek-v2, Claude Haiku, and Claude Sonnet
5.3. Qualitative Analysis of Questions
The quantitative results can be further supported by a quali-
tative evaluation of the questions. Sample question-answer
pairs reflecting common trends are shown in Figure 4. The
Llama model asks fewer questions per message on average than the other
models, as seen in Table 4, and often poses overly general questions such as “Are
there any existing workarounds or temporary fixes?”. These
template-like questions are unproductive and less likely to
gather useful information.
Deepseek, on the other hand, asks the most questions per
message, allowing it to extract more information. Some of its questions, such as “Are there any existing tests or examples that
demonstrate the issue?”, aim to extract edge cases, documentation, or tests; although they recur across multiple issues,
they are reasonable and yield valuable details. Most of its
questions, however, are very specific and detailed, querying
the expected behavior. Because of this specificity,
the user may often lack the required information.
Claude Sonnet asks fewer questions than Deepseek, likely
because it explores the codebase first. The questions do not
have easily discernible patterns and match the Deepseek
model in specificity. The Haiku model, in contrast, follows
a consistent template, typically asking three questions re-
gardless of the input, although sub-questions may be present.
Haiku’s questions are more keyword-driven based on the
input, while Sonnet’s are based on a deeper understanding
of the issue and codebase.
Model              Avg. Number of Questions
Claude Sonnet 3.5  3.80
Claude Haiku 3.5   3.49
Deepseek-v2        4.57
Llama 3.1 70B      2.61
Table 4. The average number of questions asked by different models in an interaction.
Takeaway: Models that balance specificity and question
quantity, like the Claude models, achieve greater informa-
tion gain and superior interaction quality compared to mod-
els that ask too few, too many, or templated questions. While
Deepseek benefits from asking numerous detailed questions,
it risks overwhelming the user. In contrast, Llama underper-
forms due to its reliance on generic or irrelevant questions.
6. Limitations
Our study benefits from including both open-weight and
proprietary models, as well as models from the same family
with different parameterizations, enhancing the generaliz-
ability of the findings. However, certain design decisions
may affect the experiments.
Ambiguity detection is limited to the first three turns, as
LLMs struggle to interact meaningfully if they do not en-
gage early. To assess question quality, we measure changes
in the latent vector to capture the information gained, as-
suming equal importance for all new information—though
models may prioritize different details in their solution. The
resolve rates in the overall problem solving experiment re-
flect real-life conditions, where incorrect code is unaccept-
able, regardless of how close the generated patch is to the
solution. However, data leakage could enable some models
to make correct assumptions in underspecified settings, in-
flating resolve rates. Additionally, the user proxy may be
more interactive than real-world users, as LLMs are tuned
to be helpful. We address this by limiting the number of
interaction turns and focusing interactions on the task with
detailed system prompts.
7. Related Work
Code Generation Benchmarks In code generation tasks,
ambiguous user instructions hinder the evaluation of code
suggestions generated by the model. Since the cause of
ambiguity is missing details, clarifying questions become
necessary (Mu et al., 2023). Interactive, test-driven workflows mitigate this ambiguity by first generating test cases
aligned with user expectations, which users validate before
code generation (Lahiri et al., 2023). Extensions of this
approach employ runtime techniques to generate, mutate,
and rank candidate code suggestions and test cases based on
user feedback (Fakhoury et al., 2024). Although effective,
these workflows can burden users, highlighting the need to
minimize intervention to essential cases.
Interactive ML Systems In task-oriented settings, am-
biguity between generated outputs and user expectations
remains a challenge. AmbigNLG addresses this by intro-
ducing a taxonomy of instruction ambiguities and applying
targeted disambiguation based on the identified ambiguity
type (Niwa & Iso, 2024). These ambiguities include unclear
output lengths, mandatory keywords, and contextual nu-
ances in instructions. NoisyToolBench (Wang et al., 2024a)
offers a dataset for evaluating LLM tool use with ambigu-
ous instructions, though it focuses on simpler tasks. Rein-
forcement learning frameworks like ReHAC balance user
interaction by modeling optimal intervention points (Feng
et al., 2024), but more effective strategies are needed for
complex, multi-step workflows.
LLMs and Ambiguity The current state-of-the-art LLMs
are not inherently trained to handle ambiguity through user
interaction (Zhang et al., 2024), but their instruction tuning enables improved performance with prompt engineering (White et al., 2023). Ambiguity detection has been
tackled with uncertainty estimation to measure the utility
of seeking clarification (Zhang & Choi, 2023; Park et al.,
2024). Meanwhile, the quality of clarifying questions and
the resulting performance remain critical to overall success
(Rao & Daumé III, 2018; Pyatkin et al., 2023; Kuhn
et al., 2023). Despite advances, state-of-the-art techniques
such as few-shot prompting and Chain-of-Thought reason-
ing offer limited relief in ambiguous scenarios (Zhang et al.,
2024). Self-disambiguation uses the internal knowledge of
a model to reduce query ambiguity (Keluskar et al., 2024;
Sterner, 2022; Sumanathilaka et al., 2024). For example,
Alignment with Perceived Ambiguity (APA) employs self-
disambiguation to quantify perceived ambiguity using in-
formation gain, improving the model’s processing of such
inputs (Kim et al., 2024). Although inference-only methods
are cost-effective, they are less robust than training-based
approaches for handling ambiguity.
8. Conclusion
This work evaluates proprietary and open-weight models
in agentic frameworks for handling ambiguity in software
engineering. In code generation, to effectively integrate
new information into the solution, an agent must detect
ambiguity and ask targeted questions. Our key findings are:
• Given an underspecified input, Claude Sonnet 3.5 and
Claude Haiku 3.5 with interaction can achieve 80% of
their performance with a well-specified input. In con-
trast, open-weight models struggle: Deepseek relies on
navigational cues to locate relevant files, while Llama
3.1 70B extracts limited information from the user.
• LLMs do not interact unless explicitly prompted,
and their ambiguity detection is highly sensitive to
prompt variations. Only Claude Sonnet 3.5 achieves
high accuracy (84%) in distinguishing between
well-specified and underspecified input.
• Claude Sonnet 3.5, Haiku 3.5, and Deepseek effectively
extract new, detailed user information, whereas
Llama 3.1 struggles to ask the right questions.
Despite these advances, a gap remains between resolve rates
for underspecified vs. fully specified issues. Open-weight
models need better interaction strategies to improve resolu-
tion, while proprietary models, particularly Claude Haiku
3.5, require stronger prompting to engage interactively. This
work establishes the current state-of-the-art in handling am-
biguity through interaction, breaking the resolution process
into multiple steps.
Impact Statement
This paper examines the ability of AI agents to handle am-
biguous and underspecified instructions in software engi-
neering. Our findings highlight key limitations in the res-
olution of AI-driven tasks in real-world applications, ne-
cessitating the development of more interactive, adaptable,
and effective systems. By emphasizing AI interactivity, our
work contributes to the responsible development of AI tech-
nologies, fostering safer and more efficient automation in
software engineering.
References
Ahmad, L. and OpenAI. Gpt-4o system card, October 2024.
Anthropic. Claude 3.5 haiku, 10 2024a. URL https://
www.anthropic.com/claude/haiku . Accessed
on January 9, 2025.
Anthropic. Introducing claude 3.5 sonnet, 6 2024b.
URL https://www.anthropic.com/news/
claude-3-5-sonnet . Accessed on January 8, 2025.
Brynjolfsson, E., Li, D., and Raymond, L. R. Generative
ai at work. Working Paper 31161, National Bureau of
Economic Research, April 2023. URL http://www.
nber.org/papers/w31161 .
Chowdhury, N., Aung, J., Shern, C. J., Jaffe, O., Sher-
burn, D., Starace, G., Mays, E., Dias, R., Aljubeh,
M., Glaese, M., Jimenez, C. E., Yang, J., Liu,
K., and Madry, A. Introducing SWE-bench veri-
fied, 2024. URL https://openai.com/index/
introducing-swe-bench-verified/ . Ac-
cessed on December 10, 2024.
DeepSeek-AI. Deepseek-v2: A strong, economical, and
efficient mixture-of-experts language model, 2024.
Fakhoury, S., Naik, A., Sakkas, G., Chakraborty, S., and
Lahiri, S. K. Llm-based test-driven interactive code
generation: User study and empirical evaluation. IEEE
Transactions on Software Engineering , 50(9):2254–2268,
September 2024. ISSN 2326-3881. doi: 10.1109/tse.2024.
3428972. URL http://dx.doi.org/10.1109/
TSE.2024.3428972 .
Feng, X., Chen, Z.-Y., Qin, Y., Lin, Y., Chen, X., Liu, Z.,
and Wen, J.-R. Large language model-based human-
agent collaboration for complex task solving, 2024. URL
https://arxiv.org/abs/2402.12914 .
Huang, D., Zhang, J. M., Luck, M., Bu, Q., Qing, Y., and
Cui, H. Agentcoder: Multi-agent-based code genera-
tion with iterative testing and optimisation, 2024. URL
https://arxiv.org/abs/2312.13010.
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press,
O., and Narasimhan, K. Swe-bench: Can language
models resolve real-world github issues?, 2024. URL
https://arxiv.org/abs/2310.06770 .
Karli, U. B. and Fitzgerald, T. Extended abstract: Re-
solving ambiguities in LLM-enabled human-robot col-
laboration. In 2nd Workshop on Language and Robot
Learning: Language as Grounding , 2023. URL https:
//openreview.net/forum?id=LtwuJx83Rc .
Keluskar, A., Bhattacharjee, A., and Liu, H. Do llms under-
stand ambiguity in text? a case study in open-world ques-
tion answering, 2024. URL https://arxiv.org/
abs/2411.12395 .
Kim, H. J., Kim, Y., Park, C., Kim, J., Park, C., Yoo, K. M.,
goo Lee, S., and Kim, T. Aligning language models
to explicitly handle ambiguity, 2024. URL https://
arxiv.org/abs/2404.11972 .
Kuhn, L., Gal, Y., and Farquhar, S. Clam: Selective clarification
for ambiguous questions with generative language
models, 2023. URL https://arxiv.org/
abs/2212.07769 .
Lahiri, S. K., Fakhoury, S., Naik, A., Sakkas, G.,
Chakraborty, S., Musuvathi, M., Choudhury, P., von Veh,
C., Inala, J. P., Wang, C., and Gao, J. Interactive code
generation via test-driven user-intent formalization, 2023.
URL https://arxiv.org/abs/2208.05950 .
Llama team. The llama 3 herd of models. https:
//ai.meta.com/research/publications/
the-llama-3-herd-of-models/ , July 2024.
Accessed on January 9, 2025.
Lu, C., Lu, C., Lange, R. T., Foerster, J., Clune, J., and
Ha, D. The ai scientist: Towards fully automated
open-ended scientific discovery, 2024. URL https:
//arxiv.org/abs/2408.06292 .
Mu, F., Shi, L., Wang, S., Yu, Z., Zhang, B., Wang, C., Liu,
S., and Wang, Q. Clarifygpt: Empowering llm-based
code generation with intention clarification, 2023. URL
https://arxiv.org/abs/2310.10996 .
Niwa, A. and Iso, H. Ambignlg: Addressing task ambiguity
in instruction for nlg, 2024. URL https://arxiv.
org/abs/2402.17717 .
Park, J., Lim, S., Lee, J., Park, S., Chang, M., Yu, Y., and
Choi, S. Clara: Classifying and disambiguating user
commands for reliable interactive robotic agents, 2024.
URL https://arxiv.org/abs/2306.10376 .
Peng, S., Kalliamvakou, E., Cihon, P., and Demirer, M. The
impact of ai on developer productivity: Evidence from
github copilot, 2023. URL https://arxiv.org/
abs/2302.06590 .
Purver, M. R. J. The theory and use of clarification requests
in dialogue . PhD thesis, University of London King’s
College, 2004.
Pyatkin, V., Hwang, J. D., Srikumar, V., Lu, X., Jiang,
L., Choi, Y., and Bhagavatula, C. Clarifydelphi: Reinforced
clarification questions with defeasibility rewards
for social and moral situations, 2023. URL https:
//arxiv.org/abs/2212.10409 .
Rao, S. and Daumé III, H. Learning to ask good questions:
Ranking clarification questions using neural expected
value of perfect information. In Gurevych, I.
and Miyao, Y. (eds.), Proceedings of the 56th Annual
Meeting of the Association for Computational Linguis-
tics (Volume 1: Long Papers) , pp. 2737–2746, Mel-
bourne, Australia, July 2018. Association for Compu-
tational Linguistics. doi: 10.18653/v1/P18-1255. URL
https://aclanthology.org/P18-1255 .
Sterner, B. Explaining ambiguity in scientific language.
Synthese , 200(5):354, 2022.
Sumanathilaka, T. G. D. K., Micallef, N., and Hough, J.
Can llms assist with ambiguity? a quantitative evaluation
of various large language models on word sense disam-
biguation, 2024. URL https://arxiv.org/abs/
2411.18337 .
Testoni, A. and Fernández, R. Asking the right question
at the right time: Human and model uncertainty
guidance to ask clarification questions. arXiv preprint
arXiv:2402.06509 , 2024.
Wang, W., Shi, J., Wang, C., Lee, C., Yuan, Y., tse Huang,
J., and Lyu, M. R. Learning to ask: When llms meet un-
clear instruction, 2024a. URL https://arxiv.org/
abs/2409.00557 .
Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M.,
Pan, J., Song, Y., Li, B., Singh, J., Tran, H. H., Li, F.,
Ma, R., Zheng, M., Qian, B., Shao, Y., Muennighoff, N.,
Zhang, Y., Hui, B., Lin, J., Brennan, R., Peng, H., Ji, H.,
and Neubig, G. Openhands: An open platform for ai
software developers as generalist agents, 2024b. URL
https://arxiv.org/abs/2407.16741 .
White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert,
H., Elnashar, A., Spencer-Smith, J., and Schmidt, D. C.
A prompt pattern catalog to enhance prompt engineering
with chatgpt. arXiv preprint arXiv:2302.11382 , 2023.
Xu, F. F., Song, Y., Li, B., Tang, Y., Jain, K., Bao, M.,
Wang, Z. Z., Zhou, X., Guo, Z., Cao, M., Yang, M., Lu,
H. Y., Martin, A., Su, Z., Maben, L., Mehta, R., Chi, W.,
Jang, L., Xie, Y., Zhou, S., and Neubig, G. Theagentcompany:
Benchmarking llm agents on consequential
real world tasks, 2024. URL https://arxiv.org/
abs/2412.14161 .
Zhang, M. J. Q. and Choi, E. Clarify when necessary:
Resolving ambiguity through interaction with lms, 2023.
URL https://arxiv.org/abs/2311.09469 .
Zhang, T., Qin, P., Deng, Y., Huang, C., Lei, W., Liu,
J., Jin, D., Liang, H., and Chua, T.-S. Clamber: A
benchmark of identifying and clarifying ambiguous in-
formation needs in large language models, 2024. URL
https://arxiv.org/abs/2405.12063 .
Zhou, X., Kim, H., Brahman, F., Jiang, L., Zhu, H., Lu, X.,
Xu, F., Lin, B. Y., Choi, Y., Mireshghallah, N., Le Bras,
R., and Sap, M. Haicosystem: An ecosystem for sandbox-
ing safety risks in human-ai interactions. arXiv , 2024a.
URL http://arxiv.org/abs/2409.16427 .
Zhou, X., Zhu, H., Mathur, L., Zhang, R., Yu, H., Qi, Z.,
Morency, L.-P., Bisk, Y., Fried, D., Neubig, G., and Sap,
M. Sotopia: Interactive evaluation for social intelligence
in language agents, 2024b. URL https://arxiv.
org/abs/2310.11667 .
A. Appendix
A.1. Experimental Design
A.1.1. FULL SETTING
In addition to the fully-specified GitHub issue from SWE-Bench Verified, we also include hints from the dataset, which
contains the conversation between developers regarding the issue. This helps create a larger knowledge gap in comparison
to the Hidden setting.
Prompt for Full Setting
I’ve uploaded a Python code repository in the directory /workspace/{workspace_dirname}. Consider the
following PR description: <pr_description> {instance.full_issue} </pr_description>
Here are some additional hints: <hints> {instance.hints_text} </hints>
Can you help me implement the necessary changes to the repository so that the requirements specified in the PR
description are met?
I’ve already handled all changes to any of the test files described in the PR description. This means you DON’T need
to modify the testing logic or any of the tests!
Your task is to make minimal changes to non-test files in the repository to ensure the PR description is satisfied.
Follow these steps to resolve the issue:
1. As a first step, explore the repo to familiarize yourself with its structure.
2. Create a script to reproduce the error and execute it with python <filename.py> using the BashTool to
confirm the error.
3. Edit the source code in the repo to resolve the issue.
4. Rerun your reproduce script to confirm the error is fixed.
5. Consider edge cases and make sure your fix handles them as well.
Your thinking should be thorough, and it’s fine if it’s very long.
A.1.2. INTERACTION SETTING
In this setting, the user proxy agent receives both the fully specified issue and additional hints, maintaining the knowledge
gap relative to the Hidden setting. This provides extra information for the coding agent to extract through interaction.
The files to be modified are also provided to the user proxy agent, allowing us to track specific details across issues.
Since file-related information is universally useful—unlike other details whose importance may be subjective—it enables
evaluation of how effectively different models incorporate critical information into their solution paths.
This setup reflects a scenario where the user might know additional details not included in their initial input, which can still
be extracted to improve performance. While more capable models may independently retrieve this information by exploring
the codebase, it can be particularly helpful for lower-performing models. By tracking which models choose to extract this
information, we gain insights into the types of questions they ask and observe behavioral trends across models.
Prompt for Interaction Setting with Mandatory Interaction
I’ve uploaded a Python code repository in the directory /workspace/{workspace_dirname}. Consider the
following PR description: <pr_description> {instance.summarized_issue} </pr_description>
Can you help me implement the necessary changes to the repository so that the requirements specified in the PR
description are met?
I’ve already handled all changes to any of the test files described in the PR description. This means you DON’T
need to modify the testing logic or any of the tests!
Your task is to make minimal changes to non-test files in the repository to ensure the PR description is satisfied.
I have not provided all the necessary details about the issue and I have some hidden details that are helpful. Please
ask me specific questions using non-code commands to gather the relevant information that I have to help you solve
the issue. Ensure you have all the details you require to solve the issue.
You have a limited number of turns. Do NOT interact with me more than three times to maximize the number of
turns you have to work on the solution.
Follow these steps to resolve the issue:
1. As a first step, look at the issue and ask me questions to get all the necessary details about the issue. You can
also ask me questions if you run into a problem in later steps.
2. Then, it might be a good idea to explore the repo to familiarize yourself with its structure.
3. Create a script to reproduce the error and execute it with python <filename.py> using the BashTool to
confirm the error.
4. Edit the source code in the repo to resolve the issue.
5. Rerun your reproduce script to confirm the error is fixed.
6. Think about edge cases and make sure your fix handles them as well.
Your thinking should be thorough, and it’s fine if it’s very long.
Prompt to User Proxy
You are a GitHub user reporting an issue. Here are the details of your issue and environment:
Issue: {issue}
Hints: {hints}
Files relative to your current directory: {files}
Your task is to respond to questions from a coder who is trying to solve your issue. The coder has a summarized
version of the issue you have. Follow these rules:
1. If the coder asks a question that is directly related to the information in the issue you have, provide that information.
2. Always stay in character as a user reporting an issue, not as an AI assistant.
3. Keep your responses concise and to the point.
4. The coder has limited turns to solve the issue. Do not interact with the coder beyond 3 turns.
Respond with I don’t have that information if the question is unrelated or you’re unsure.
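The turn cap in the user proxy prompt can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the paper's implementation: the class `UserProxy`, the callable `respond` (a stand-in for whatever LLM call backs the proxy), and the choice to return the fallback string once the budget is spent are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Fallback wording taken from the user proxy prompt above.
FALLBACK = "I don't have that information"

# Abbreviated version of the proxy prompt; only the placeholders matter here.
PROXY_TEMPLATE = (
    "You are a GitHub user reporting an issue.\n"
    "Issue: {issue}\n"
    "Hints: {hints}\n"
    "Files relative to your current directory: {files}\n"
)


@dataclass
class UserProxy:
    issue: str
    hints: str
    files: str
    # Hypothetical stand-in for an LLM call: (system_prompt, question) -> answer.
    respond: Callable[[str, str], str]
    max_turns: int = 3
    turns_used: int = 0

    def system_prompt(self) -> str:
        """Fill the template with the fully specified issue, hints, and files."""
        return PROXY_TEMPLATE.format(
            issue=self.issue, hints=self.hints, files=self.files
        )

    def ask(self, question: str) -> str:
        """Answer a question while the turn budget lasts."""
        if self.turns_used >= self.max_turns:
            # Assumption: once the budget is spent, the proxy simply declines.
            return FALLBACK
        self.turns_used += 1
        return self.respond(self.system_prompt(), question)
```

Enforcing the cap in code rather than relying only on the prompt is one design option; the prompt text itself asks both parties to respect the three-turn limit.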
A.1.3. HIDDEN SETTING
Prompt for Hidden Setting
I’ve uploaded a Python code repository in the directory /workspace/{workspace_dirname}. Consider the
following PR description: <pr_description> {instance.summarized_issue} </pr_description>
Can you help me implement the necessary changes to the repository so that the requirements specified in the PR
description are met?
I’ve already taken care of all changes to any of the test files described in the PR description. This means you DON’T
need to modify the testing logic or any of the tests!
Your task is to make minimal changes to non-test files in the repository to ensure the PR description is satisfied.
Follow these steps to resolve the issue:
1. As a first step, it might be a good idea to explore the repo to familiarize yourself with its structure.
2. Create a script to reproduce the error and execute it with python <filename.py> using the BashTool to
confirm the error.
3. Edit the source code in the repo to resolve the issue.
4. Rerun your reproduce script to confirm the error is fixed.
5. Consider edge cases and make sure your fix handles them as well.
Your thinking should be thorough, and it’s fine if it’s very long.
Prompt For Summarizing GitHub Issues
I have several issues from GitHub related to code specifications. Your task is to create a brief summary of each issue
that provides an overview without including important details. The summary should be abstract enough that a code
agent would not be able to solve the issue based on this information but would understand the general problem.
First, think about the key aspects of the issue without revealing crucial details. Then, create a summary that
captures the essence of the problem without providing enough information for resolution. Use the <summary> and
</summary> tags around your generated summary.
The output should be in the form: <summary> ... </summary>
Here is the issue: {issue}
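Because the summarization prompt asks for output wrapped in <summary> tags, the summary has to be pulled out of the raw model response. A minimal sketch of that step follows; the function name `extract_summary` is hypothetical, and returning `None` when the tags are missing is an assumption rather than something the paper specifies.

```python
import re
from typing import Optional

# Non-greedy match so only the first <summary>...</summary> pair is captured;
# DOTALL lets the summary span multiple lines.
_SUMMARY_RE = re.compile(r"<summary>(.*?)</summary>", re.DOTALL)


def extract_summary(model_output: str) -> Optional[str]:
    """Return the text inside the first <summary> tag pair, or None if absent."""
    match = _SUMMARY_RE.search(model_output)
    return match.group(1).strip() if match else None
```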
A.2. Statistical Methods
A.2.1. WILCOXON SIGNED-RANK TEST
The Wilcoxon Signed-Rank Test is a non-parametric statistical test used to determine if there is a significant difference
between the medians of two related groups. Unlike the paired t-test, it does not assume that the differences between paired
observations are normally distributed, making it more suitable for cases where this assumption may not hold.
In this work, the Wilcoxon Signed-Rank Test is applied to compare the performance of models between two settings (e.g.,
Hidden vs. Interaction, Interaction vs. Full) with the hypothesis that performance in the second setting is greater than in the
first.
Formally, the null hypothesis ($H_0$) for the Wilcoxon Signed-Rank Test states that the median difference between the two
settings is zero or negative:
$$H_0: \tilde{d} \le 0$$
where $\tilde{d}$ represents the median of the paired differences. The alternative hypothesis ($H_1$) asserts that the median difference
is greater than zero:
$$H_1: \tilde{d} > 0$$
The test ranks the absolute differences between paired observations, considering both the magnitude and direction of change.
If the p-value obtained from the test is less than the significance threshold (0.05), we reject the null hypothesis, concluding
that there is a statistically significant improvement in performance between the two settings.
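The one-sided test described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the analysis code used in the paper: the function name `wilcoxon_signed_rank_greater` is hypothetical, zero differences are dropped, tied absolute differences receive average ranks, and the p-value uses a normal approximation without tie or continuity correction. In practice a library routine such as `scipy.stats.wilcoxon(x, y, alternative="greater")` would be the more usual choice.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank_greater(
    first: Sequence[float], second: Sequence[float]
) -> Tuple[float, float]:
    """One-sided test of H1: median(second - first) > 0.

    Returns (W+, p), where W+ is the sum of ranks of positive differences
    and p is a normal-approximation p-value.
    """
    # Paired differences; zero differences carry no sign and are dropped.
    diffs = [b - a for a, b in zip(first, second) if b != a]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank absolute differences, assigning average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = float(sum(r for r, d in zip(ranks, diffs) if d > 0))

    # Under H0, W+ has mean n(n+1)/4 and variance n(n+1)(2n+1)/24.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return w_plus, p_value
```

Passing the Hidden-setting scores as `first` and the Interaction-setting scores as `second` matches the direction of the hypothesis stated above; the null hypothesis is rejected when the returned p-value falls below 0.05.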
A.3. Ambiguity Detection prompts
• Neutral: Ensure you have all the necessary information to proceed. If any part of the issue is unclear or lacks critical
details, ask concise, targeted questions to clarify. If everything is clear, you can move ahead without asking unnecessary
questions.
• Moderate Encouragement: Before attempting a solution, carefully check whether all key information is provided.
If there’s any ambiguity or missing details that could impact your work, don’t hesitate to ask questions. Your goal
is to gather the information needed for an accurate and efficient solution. Only skip asking questions when you are
absolutely sure all details are complete.
• Strong Encouragement: Your success depends on having all relevant details to solve the issue effectively. Whenever
you encounter unclear or missing information, proactively ask questions to fill those gaps. Even minor ambiguities
can affect the outcome, so always prioritize clarifying questions. Avoid questions only when you are 100% certain no
further clarification is needed.
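One way to score an agent's behavior under these prompts is sketched below. It rests on an assumption that is not taken from the paper's evaluation code: an agent that asks at least one question is treated as having judged the input underspecified, and an agent that proceeds without asking is treated as having judged it well-specified. The function name `detection_accuracy` is hypothetical.

```python
from typing import List, Tuple


def detection_accuracy(runs: List[Tuple[bool, bool]]) -> float:
    """Fraction of runs where asking behavior matches the ground truth.

    Each run is (is_underspecified, agent_asked_question). A run counts as
    correct when the agent asked on an underspecified input or stayed silent
    on a well-specified one.
    """
    if not runs:
        raise ValueError("no runs to score")
    correct = sum(1 for truth, asked in runs if truth == asked)
    return correct / len(runs)
```

Scoring each prompt variant separately with such a function would expose the sensitivity to prompt wording, since the same model can over-ask under one variant and under-ask under another.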
A.4. Question Quality Analysis
$$\text{Cosine Distance}(P, Q) = 1 - \frac{P \cdot Q}{\|P\| \, \|Q\|} \tag{1}$$
where:
• $P = \{p_1, p_2, \ldots, p_N\}$ represents the embedding vector of the updated knowledge ($E_{\text{after}}$).
• $Q = \{q_1, q_2, \ldots, q_N\}$ represents the embedding vector of the initial knowledge ($E_{\text{before}}$).
• $N = 1536$ is the dimensionality of the embedding space.
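Equation (1) translates directly into code. The sketch below is a plain-Python illustration; the function name `cosine_distance` is hypothetical, and the embedding vectors are assumed to be supplied by the caller as equal-length, non-zero sequences of floats (how they are produced is outside this sketch).

```python
import math
from typing import Sequence


def cosine_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Compute 1 - (P . Q) / (||P|| ||Q||) for two equal-length vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_p * norm_q)
```

A distance near 0 means the knowledge after interaction is nearly identical to the knowledge before it, while larger values indicate that the answers added information not present in the initial input.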
Model: Llama 3.1 70B
Summarized Issue: The InheritDocstrings metaclass is encountering a problem where it fails to handle properties correctly due to its reliance on a function-checking mechanism that does not recognize properties as functions.
Selected Questions:
1. Are there any existing workarounds or temporary fixes for this problem that I should be aware of?
2. Are there any specific properties or use cases that are causing the issue, or is it a more general problem with the InheritDocstrings metaclass?
Corresponding Answers:
1. I don’t have that information.
2. This seems to be a more general problem with how the InheritDocstrings metaclass handles properties, not specific to any properties or use cases.

Model: Deepseek chat
Summarized Issue: The InheritDocstrings metaclass is encountering a problem where it fails to handle properties correctly due to its reliance on a function-checking mechanism that does not recognize properties as functions.
Selected Questions:
1. Can you provide more details about the InheritDocstrings metaclass? Specifically, what is its purpose, and how is it currently implemented in the repository?
2. Are there any existing tests or examples that demonstrate the issue? If so, can you provide the names of these tests or examples?
Corresponding Answers:
1. The InheritDocstrings metaclass is intended to ensure that docstrings are inherited from the parent class when the docstring is not explicitly defined in the child class. It currently uses inspect.isfunction, which doesn’t work with properties since they are not functions.
2. I don’t have that information.

Model: Claude Haiku 3.5
Summarized Issue: The InheritDocstrings metaclass is encountering a problem where it fails to handle properties correctly due to its reliance on a function-checking mechanism that does not recognize properties as functions.
Selected Questions:
1. Could you provide more details about where the InheritDocstrings metaclass is located in the repository?
2. When you say the metaclass “fails to handle properties correctly”, could you elaborate on what specific behavior is incorrect?
Corresponding Answers:
1. The InheritDocstrings metaclass is located in the file astropy/utils/misc.py.
2. The incorrect behavior is related to docstring inheritance. Specifically, properties that are redefined in a subclass without a docstring do not inherit the docstring from the base class.

Model: Claude Sonnet 3.5
Summarized Issue: The issue involves a method that is expected to create a header from a byte string, but it does not accept Python 3 bytes as input. The method works with Python 3’s unicode strings, assuming ASCII encoding, but this behavior is inconsistent with its counterpart method that handles both text and binary file modes.
Selected Questions:
1. Which specific method in the codebase is causing the issue with byte string input? Please provide the class name and method name.
2. Can you share an example of the byte string input that should work but currently fails?
Corresponding Answers:
1. The specific method causing the issue with byte string input is Header.fromstring.
2. I don’t have a specific example of the byte string input, but it should be a byte string that contains the entire header data.

Table 5. Examples of questions and their responses across different models chosen to reflect general observed patterns.