Authors: Benyamin Tabarsi, Heidi Reichert, Ally Limke, Sandeep Kuttal, Tiffany Barnes
LLMs’ Reshaping of People, Processes, Products, and Society in Software
Development: A Comprehensive Exploration with Early Adopters
BENYAMIN TABARSI∗ and HEIDI REICHERT∗, North Carolina State University, USA
ALLY LIMKE, North Carolina State University, USA
SANDEEP KUTTAL, North Carolina State University, USA
TIFFANY BARNES, North Carolina State University, USA
Large language models (LLMs) like OpenAI ChatGPT, Google Gemini, and GitHub Copilot are rapidly gaining traction in the software
industry, but their full impact on software engineering remains insufficiently explored. Despite their growing adoption, there is a
notable lack of formal, qualitative assessments of how LLMs are applied in real-world software development contexts. To fill this gap,
we conducted semi-structured interviews with sixteen early-adopter professional developers to explore their use of LLMs throughout
various stages of the software development life cycle. Our investigation examines four critical dimensions: people (how LLMs affect
individual developers and teams); process (how LLMs alter software engineering workflows); product (LLM impact on software quality
and innovation); and society (the broader socioeconomic and ethical implications of LLM adoption). Thematic analysis of our data reveals
that while LLMs have not fundamentally revolutionized the development process, they have substantially enhanced routine coding
tasks, including code generation, refactoring, and debugging. Developers who were LLM early-adopters report the most effective
outcomes when providing LLMs with clear, well-defined problem statements, indicating that LLMs excel with decomposed problems
and specific requirements. Furthermore, these early adopters identified that LLMs offer significant value for personal and professional
development, aiding in the learning of new languages and concepts. Early adopters, highly skilled both in software engineering and in
how LLMs work, identified early and persisting challenges for software engineering, such as inaccuracies in generated content and
the need for careful manual review before integrating LLM outputs into production environments. Our study provides a nuanced
understanding of how LLMs are shaping the current and future landscape of software development, highlighting their practical
benefits, limitations, and potential ongoing implications.
CCS Concepts: • Human-centered computing → User studies; Empirical studies in HCI.
Additional Key Words and Phrases: LLM, ChatGPT, Gemini, Copilot Chat, interview study, professional developers
ACM Reference Format:
Benyamin Tabarsi, Heidi Reichert, Ally Limke, Sandeep Kuttal, and Tiffany Barnes. 2025. LLMs’ Reshaping of People, Processes,
Products, and Society in Software Development: A Comprehensive Exploration with Early Adopters. 1, 1 (March 2025), 44 pages.
https://doi.org/10.1145/nnnnnnn.nnnnnnn
∗Both authors contributed equally to this research.
Authors’ Contact Information: Benyamin Tabarsi, btaghiz@ncsu.edu; Heidi Reichert, hreiche@ncsu.edu, North Carolina State University, Raleigh, North
Carolina, USA; Ally Limke, North Carolina State University, USA; Sandeep Kuttal, North Carolina State University, Raleigh, North Carolina, USA; Tiffany
Barnes, North Carolina State University, Raleigh, North Carolina, USA.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on
servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
©2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Manuscript submitted to ACM
arXiv:2503.05012v1 [cs.SE] 6 Mar 2025
1 Introduction
Large language models (LLMs), trained on extensive datasets, are generative AI systems that produce human-like text
based on their architecture and training data [70, 98]. Publicly available LLMs such as OpenAI ChatGPT¹, Google Gemini²,
and GitHub Copilot Chat³ represent transformative AI, reshaping workflows across multiple domains [35, 56, 100].
Within software development, specific LLMs are trained on vast repositories of code; for instance, the AI model behind
GitHub Copilot was originally OpenAI’s Codex [58] and has since been refined to enhance contextual filtering [107].
These models enable developers to generate, refactor, and debug code through natural language prompts, streamlining
traditionally labor-intensive processes. Such tools are increasingly central to enhancing developer productivity, fostering
collaboration, and accelerating innovation in software engineering [15, 79].
The adoption of LLMs in software development has sparked considerable discourse on their broader implications.
For example, Stack Overflow’s annual developer surveys of 2023 and 2024 highlight how developers are rapidly
integrating AI tools into their workflows, reflecting shifting perceptions about automation and its role in programming
[59, 64]. Discussions on LinkedIn [7, 22], the OpenAI community forum [16], and other
platforms further emphasize the potential of LLMs to redefine software creation, engineering processes, and developer
collaboration.
Several studies have examined LLM usage in software engineering through literature reviews [36], case studies [9],
empirical studies of forums [16], and comparison of general LLM tools and LLM-based agents [40]. However, there is
a lack of formal, qualitative assessments based on interviews with software developers on how LLMs are utilized in
real-world software engineering settings. Those that exist primarily focus on more specific dimensions of their work,
such as perceptions of security [42], trustworthiness [77], and user study evaluations of new tools [76].
To address this gap, we conducted interviews with 16 early-adopter software industry professionals who took the
initiative to become educated about LLMs at their inception and began actively incorporating LLM-based AI tools into
their daily workflows between November 2022 and April 2023. The insights of these early adopters help frame new and
ongoing issues that are important for software engineering processes, tools, and practitioners to maintain awareness
of as LLMs evolve. Our investigation is organized around four critical dimensions: people, examining how LLMs
influence individual developers and teams; process, analyzing changes in software engineering workflows; product,
evaluating how LLMs contribute to software quality and innovation; and society, exploring the broader socioeconomic
and ethical implications. By addressing these dimensions for both early and ongoing revisions of LLMs, this research
provides a holistic perspective on the evolving intersection of LLMs and software development, offering insights into
the opportunities and challenges these technologies bring to the field.
•People – RQ1: How do LLMs affect developers?
In this “people” dimension, we aimed to discern the advantages and disadvantages LLMs offer professional
software developers in their work. While existing studies provide valuable insights into specific tasks where LLMs
excel, they often fail to capture the broader, qualitative experiences of developers across varying experience levels.
Our research fills this gap by exploring the nuanced ways LLMs influence different development tasks, offering a
richer understanding of their practical utility and limitations. LLMs best serve developers when assisting with
tasks they like the least, such as automating repetitive coding tasks and summarizing information. Additionally,
LLMs helped developers learn more effectively by explaining code, personalizing learning, and generating new
¹ https://openai.com/chatgpt
² https://gemini.google.com/
³ https://docs.github.com/en/copilot/github-copilot-chat
ideas. However, they are prone to hallucinations and inaccuracies, making it crucial for developers to manually
review and adapt the generated content.
•Process – RQ2: How have LLMs influenced software development processes?
We explored the “process” dimension by investigating how LLMs have impacted the software development life
cycle (SDLC), considering both positive and negative effects. Previous research has provided useful insights into
isolated phases of the SDLC, but few studies examine the end-to-end impact of LLMs across all development
stages. Our study fills this gap by systematically analyzing developers’ experiences using LLMs throughout the
entire SDLC. We found that LLMs are particularly effective for ideation, testing, and debugging tasks, but less
useful for generating requirements or reviewing code, especially in collaborative environments. Developers
adapted their strategies by using a combination of broad and specific queries, experimenting with context
addition and removal. While LLM-generated code often needed manual review before integration, it also offered
learning opportunities and inspired innovative solutions.
•Product – RQ3: How has the use of LLMs influenced the software products created?
In this “product” dimension, we focused on understanding the impact of LLMs on the code and software
products generated by developers. Most existing studies rely on predefined metrics to evaluate code quality,
often neglecting developers’ subjective perceptions of LLM-generated code. Our research addresses this gap
by examining developers’ confidence in the code’s accuracy, readability, and complexity. We found that LLMs
performed well with smaller, routine tasks like generating unit tests and documentation, but struggled with
complex, novel code. Concerns arose around over-engineered code, security, and the sensitivity of the data
used for training LLMs. Developers took responsibility for reviewing outputs, carefully balancing their trust in
LLM-generated content with their own quality assurance processes.
•Society – RQ4: How may the software industry and education be affected by LLMs?
To understand the broader societal implications, we explored developers’ perceptions of LLMs’ impact on the
software industry and educational training. Much existing research focuses on theoretical implications, but
few studies engage developers directly. Our research fills this gap by capturing their nuanced views on the
opportunities and challenges of LLMs in professional workflows. While some of our participants noted that LLMs
could replace certain roles, they generally believed developers’ roles were safe, as LLMs are tools to support
rather than replace human decision-making. Concerns were raised about entry-level positions and the interview
process, with some arguing for integrating LLMs into CS curricula, while others called for revising assignments
to prevent easy LLM solutions. The lack of formal guidelines and training for LLM use in workplaces was noted,
highlighting the need for structured support to maximize LLMs’ effectiveness and ethical use.
Our paper makes the following contributions:
•We provide one of the first comprehensive qualitative analyses of developers’ experiences with LLM-based tools
in real-world software development settings, offering new insights into their practical utility and limitations.
•We uncover the strategies that developers use to integrate LLM tools into their workflows, highlighting both
effective practices and common pitfalls, which can guide future adoption and optimization of these tools in
professional environments.
•We offer corroborating evidence for prior studies on the technical impact of LLMs on software development,
while also introducing new insights into the broader socio-economic and educational implications of these tools,
including their potential effects on job roles and educational curricula.
•We analyze the impact of LLMs across all phases of the software development lifecycle (SDLC), providing a
deeper understanding of how these tools influence technical tasks like coding, testing, and debugging, as well as
human-centered activities like learning, collaboration, and decision-making.
The rest of the paper is structured as follows: Section 2 details the methodology used in the paper; Section 3 presents
our results in detail for all four research questions; Section 4 explores the implications of our results; Section 5 discusses
related work in the literature; Section 6 considers limitations; and Section 8 concludes the paper.
2 Methods
2.1 Participants and Demographics
We recruited 16 participants from tech meetup groups in the southeastern US, the researchers’ personal contacts, and
LinkedIn. Snowball sampling was also employed, where participants invited individuals from their networks to join
the study [ 85]. The target population for this study was professional software engineers and developers. Inclusion
criteria required participants to have at least two months of experience using ChatGPT or a similar LLM-based chatbot
for programming in their work. Initially, our call for participants attracted a number of graduate students. Although
many had prior experience as developers or had worked in development roles at their universities, we determined that
this group would not fully address our research questions. As a result, we revised our recruitment materials to focus
specifically on full-time developers. The demographic details of all 16 participants are provided in Table 2.
2.2 Interview Questions Formulation
We structured our interview questions around four key themes in software engineering: People, Processes, Products,
and Society. These themes were derived from existing literature, which highlighted the critical aspects of each category.
Using this foundational framework, we formulated interview questions aimed at capturing relevant insights within
each theme. An iterative approach was employed in developing these questions to ensure their clarity and relevance.
To refine and ground our interview approach, we conducted several pilot interviews. One pilot interview was with a
professional developer who regularly uses ChatGPT for programming, and the other three were with students who
self-identified as developers and used ChatGPT in their work, though they were not employed as full-time professionals.
These pilot interviews served as a testing phase to assess the appropriateness of the interview questions, determine the
time required for each interview, and evaluate the overall structure. While the student interviews were excluded from
our analysis, the professional developer’s interview was included due to the rich data it provided.
Following the pilot interviews, we reviewed the data and feedback, ultimately excluding the student interviews from
the preliminary analysis due to concerns about their alignment with the study’s objectives. Based on the insights gained,
we refined the questions, reworded them for clarity, checked for grammatical accuracy, and ensured the wording was
universally understood. The final version of the questions used for the interviews can be found in Table 1. Additionally,
we reduced the interview duration to ensure the questions could be answered efficiently without sacrificing the depth
of the responses.
2.3 Procedure
2.3.1 Interview. We invited interested participants via email and asked them to complete consent forms. Participants
used a Google Calendar link to schedule their own research sessions, which lasted approximately 70 minutes. While all
Table 1. A list of the interview questions we asked participants.
How do you use ChatGPT in your everyday work?
How often (on average) – multiple times a day? Daily? Weekly? Monthly?
Can you tell me about a time when you used ChatGPT to help you write a program?
How about a time when it did not work out well?
Were there any times when you were surprised by what you could do with ChatGPT and code?
How has ChatGPT changed your software engineering process?
Has it changed how you gather requirements? How?
Has it changed how you break tasks into parts that can be solved by ChatGPT? How?
Has it changed how you write code? How?
Has it changed your testing process? How?
Has it changed your code review process? How?
How do you normally evaluate the code generated by ChatGPT?
How do you do testing?
How do you determine if the code does what you asked for?
Do you read the code?
How do you check the code quality, efficiency, complexity?
What about security aspects?
How much do you trust the code provided by ChatGPT?
How is it different from evaluating human-written code?
How do you integrate the ChatGPT code results into your codebase, if at all?
What steps do you take before integrating the output of ChatGPT into your code?
Or adapting/modifying the ChatGPT code to make it useful?
How often do you throw away the code, use, or reuse the code given by ChatGPT?
How secure do you believe the code given by ChatGPT generally is?
How will ChatGPT impact the skills and jobs in the software industry?
How do you think CS degree programs should adjust to prepare for this shift?
What skills must a person have to use ChatGPT like you do?
For example, how do you formulate a question to ChatGPT?
How do you structure your queries to get the desired answer?
How broad or specific should your questions be?
How much context will you provide to ChatGPT?
How do follow-up questions impact the accuracy of your answer?
How many queries/reformulations
Do you or your company have any guidelines, formal or informal, about how developers should use ChatGPT?
participants consented to audio recording, not all agreed to screen or video capture. All interviews took place between
March 1, 2023, and July 7, 2023.
The study consisted of semi-structured interviews, featuring a core set of predetermined questions, while allowing
flexibility for follow-up questions based on the conversation [ 55]. At the start of each session, participants were briefed
on the research purpose, followed by an interview of about 60 minutes. Two researchers were present: one conducted
the interview, while the other took notes. Both researchers asked clarifying or follow-up questions based on participant
responses.
During the interviews, participants were asked to discuss their use of LLMs in their daily work, how they evaluated
LLM-generated code, and how they integrated it into existing codebases. We also explored their opinions on how LLMs
Table 2. Demographics of participants P1-P16 (P#) showing the company size (number of employees), gender, race/ethnicity, years of
development experience, and LLM tool(s) used.
P#  | Company size            | Gender     | Trans | Race / ethnicity          | Experience | LLM used
P1  | Small (10-19)           | Non-binary | No    | Hispanic, White           | 3+ years   | GPT
P2  | Medium (50-249)         | Man        | No    | Asian                     | 1-2 years  | GPT
P3  | Large enterprise (250+) | Man        | No    | Hispanic, Asian           | 3+ years   | GPT
P4  | Micro (5-9)             | Man        | No    | Hispanic, White           | 3+ years   | GPT
P5  | Large enterprise (250+) | Man        | No    | White                     | 3+ years   | GPT
P6  | Micro (1-4)             | Man        | No    | Black or African American | 3+ years   | Copilot
P7  | Large enterprise (250+) | Man        | No    | Asian                     | 3+ years   | GPT
P8  | Medium (50-249)         | Man        | No    | Asian                     | 3+ years   | GPT
P9  | Large enterprise (250+) | Man        | Yes   | White                     | 3+ years   | GPT
P10 | Small (20-49)           | Man        | No    | Asian                     | 3+ years   | GPT
P11 | Micro (1-4)             | Man        | No    | Asian                     | 3+ years   | GPT
P12 | Large enterprise (250+) | Non-binary | Yes   | White                     | 3+ years   | Bard
P13 | Large enterprise (250+) | Woman      | No    | Asian                     | 1-2 years  | GPT
P14 | Micro (5-9)             | Man        | No    | White                     | 1-2 years  | GPT
P15 | Medium (50-249)         | Man        | No    | White                     | 3+ years   | GPT, Copilot
P16 | Large enterprise (250+) | Woman      | No    | Black or African American | 1-2 years  | GPT
might impact the skills and job landscape in the software industry, with specific follow-up questions regarding the skills
needed to effectively use LLMs. Additionally, participants were asked to describe real projects they had worked on. A
list of the pre-prepared questions is provided in Table 1. While the questions we asked explicitly named ChatGPT,
we stated during the interviews that interviewees should also answer the questions based on their usage of similar
LLM-based chatbots or other LLM-based tools.
2.3.2 Demographic Survey. After completing the interview, participants were asked to complete a brief survey on
Qualtrics, which collected demographic information. Specifically, participants were asked to provide details about the
size of their company, as well as their gender, ethnicity, and race. This demographic data was gathered to explore
potential differences in LLM usage across various groups. Based on the information provided during the interviews, we
also estimated the number of years of experience each participant had in the software development field. These results
are presented in Table 2. Our participants were 43.75% White, 43.75% Asian, and 12.5% Black. 75% were men, 12.5%
were women, and 12.5% were non-binary. 75% of our participants had at least 3 years of development experience, with
only four participants having more limited experience and two participants expressing they had begun their careers
within the past year.
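The reported percentages can be rechecked directly from Table 2. The sketch below is only an illustration: the rows are transcribed from Table 2, while the helper names (`race_only`, `percentages`) are hypothetical. It also assumes that "Hispanic" is treated as an ethnicity recorded alongside a single race, which is our reading of the table rather than something the survey description states.

```python
from collections import Counter

# Rows transcribed from Table 2: (participant, gender, race/ethnicity, experience).
PARTICIPANTS = [
    ("P1", "Non-binary", "Hispanic, White", "3+ years"),
    ("P2", "Man", "Asian", "1-2 years"),
    ("P3", "Man", "Hispanic, Asian", "3+ years"),
    ("P4", "Man", "Hispanic, White", "3+ years"),
    ("P5", "Man", "White", "3+ years"),
    ("P6", "Man", "Black or African American", "3+ years"),
    ("P7", "Man", "Asian", "3+ years"),
    ("P8", "Man", "Asian", "3+ years"),
    ("P9", "Man", "White", "3+ years"),
    ("P10", "Man", "Asian", "3+ years"),
    ("P11", "Man", "Asian", "3+ years"),
    ("P12", "Non-binary", "White", "3+ years"),
    ("P13", "Woman", "Asian", "1-2 years"),
    ("P14", "Man", "White", "1-2 years"),
    ("P15", "Man", "White", "3+ years"),
    ("P16", "Woman", "Black or African American", "1-2 years"),
]


def race_only(label):
    """Drop 'Hispanic' from a label (assumption: it is an ethnicity, not a race)."""
    parts = [part.strip() for part in label.split(",")]
    races = [part for part in parts if part != "Hispanic"]
    return races[0]


def percentages(values):
    """Return each distinct value's share of the total, in percent."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: 100.0 * count / total for key, count in counts.items()}


gender_pct = percentages(row[1] for row in PARTICIPANTS)
race_pct = percentages(race_only(row[2]) for row in PARTICIPANTS)
experience_pct = percentages(row[3] for row in PARTICIPANTS)

print(gender_pct)
print(race_pct)
print(experience_pct)
```

Under that assumption, the tallies agree with the figures stated in the text for gender, race, and years of experience.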
2.4 Analysis
All interviews were recorded, automatically transcribed, and then reviewed and corrected by the researchers, who
rewatched the video recordings to ensure accuracy.
Our analysis followed an inductive approach, similar to that of Silva et al. in their study of AR activists [ 89]. Initially,
we created open codes based on three transcripts that we identified as thematically rich and highly relevant to our
research questions. After discussing and refining these codes, we compiled a preliminary codebook with descriptions
for each tag. The two researchers who conducted the interviews, along with one researcher who did not participate in
the interviews, independently coded the remaining 13 transcripts. They cross-referenced these codes with the interview
video recordings when needed. Following independent coding, the researchers held discussions to reconcile their codes,
developing a unified set of agreed-upon tags that contributed to an evolving codebook. The codebook grew iteratively
as new themes emerged from the diversity of participant responses.
Subsequently, three researchers (two involved in coding, one external) grouped the tags according to each research
question, then clustered them into mid-level themes. These were further divided into lower-level themes, which served
as the foundation for our written analysis. This process took approximately two weeks, consisting of multiple intensive
coding sessions. Throughout the analysis, the researchers continued to refine the codes and themes through ongoing
discussions and by sharing draft results. The final coding resulted in 361 total codes, categorized into 138 low-level
themes. For each theme, we also quantified the number of quotes tagged, which we present in our results. Participant
quotes are referenced as PX, where X denotes the participant’s interview order. Note that P1 was technically a pilot
interview, but answered all questions that were asked of the other participants.
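The quantification step can be pictured as a simple tally over tagged quotes. The sketch below is a hypothetical illustration, not the tooling used in the study: the function name `tally_themes` and the sample records are invented for the example. It counts, for each theme, the number of tagged quotes and the number of distinct interviews in which the theme appears.

```python
from collections import defaultdict


def tally_themes(tagged_quotes):
    """Count quotes and distinct interviews per theme.

    tagged_quotes: iterable of (participant_id, theme) pairs, one per tagged quote.
    """
    quote_counts = defaultdict(int)
    interviews = defaultdict(set)
    for participant, theme in tagged_quotes:
        quote_counts[theme] += 1
        interviews[theme].add(participant)
    return {
        theme: {"quotes": quote_counts[theme], "interviews": len(interviews[theme])}
        for theme in quote_counts
    }


# Hypothetical sample records for illustration only (not data from the study).
sample = [
    ("P14", "Reducing Mundane Tasks"),
    ("P14", "Reducing Mundane Tasks"),
    ("P1", "Reducing Mundane Tasks"),
    ("P5", "Translating Code"),
]

summary = tally_themes(sample)
print(summary)
```

Keeping the two counts separate matters because a single talkative participant can contribute many quotes to one theme, whereas the results report how many interviews a theme appeared in.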
3 Results
In this section, we present key themes found in our analysis regarding how LLMs impact people (RQ1), processes (RQ2),
products (RQ3), and society (RQ4). We organized this section to outline the positive and negative aspects of
LLMs.
3.1 RQ1: How do LLMs affect developers?
The key themes that emerged in understanding how LLMs affect developers’ daily tasks are as follows:
3.1.1 Boosting Developers’ Productivity.
•Reducing Mundane Tasks: Codes related to LLMs’ effectiveness in simplifying mundane tasks appeared in
fourteen interviews, aligned with previous research describing how developers used ChatGPT to automate
tedious tasks [38]⁴. A notable example was provided by P14, who shared, “I really like that the repetitive, boring
tasks — like looking for a comma — that I don’t need to do those, and I can focus more on building things. ”
Thirteen participants highlighted LLMs’ role in expediting the software engineering process and saving time. For
instance, P1 noted the potential benefits of integrating ChatGPT into developers’ workflows: “ It might make you
aware of things that you might be missing [...] maybe speed up some small pieces of a software engineer’s process. ”
These findings align with those of other researchers [6, 67, 78].
Similarly, nine participants highlighted how LLMs enhance efficiency and make developers more productive. For
example, P15 explained how ChatGPT writes functions, supporting focus on higher-level tasks: “ [Suppose that] I
want to do something simple [...] I don’t write those functions anymore. I always have ChatGPT do it because I know
that it’s going to come up with something close to what I was going to do, but I didn’t actually have to do it. So it’s
kind of like a pair-programmer for me, so I can stick to the higher-level stuff that would take more understanding of
the infrastructure. ”
•Streamlining Search Experiences: Ten participants highlighted the productivity advantage of LLMs in reducing
search time for solutions and drew parallels between LLMs and search engines. For example, P6 shared, “ [...]
instead of going out directly to Google or Stack Overflow, my first line of question is ChatGPT [...] I get answers a
little bit faster and more contextualized [with ChatGPT]. ” In another case, P14 mentioned how LLMs transformed
⁴ Master’s thesis
Fig. 1. Findings related to RQ1, categorized by their uses (indicated by a checkmark), existing challenges (indicated by a stop sign),
and challenges mentioned by our participants that have been largely addressed in newer versions of LLMs (indicated by a blue circle).
his search style: “Now, I haven’t been on Google probably for 3 weeks [...] I don’t need to click on each of the links
from Google to find the solution [...] [ChatGPT] gives me a summarized version from possibly three, four different
sources.” In a related finding, a study by Xu et al. revealed that individuals who used ChatGPT to search for
answers spent less time than their counterparts who used Google [105]⁵.
•Providing Boilerplate Code and Templates: Ten participants shared that they had used LLMs to create
generic or standardized code and templates. P4 described this capability as “ one of the best things I get from
[ChatGPT].” P12, a consistent user of Bard⁶, expressed, “You can take [Bard’s response] as an idea, like a first draft,
and then iterate on it – that’s the experience I have working with it from the coding side.” Similarly, P5 mentioned,
“I would ask it particular questions regarding a specific front-end component I’m developing, and I use it to point me
in the right direction with code that I can start off with. And then, based on my business need, I would customize it to
fit the specific requirement at the time. ”
•Translating Code: Code translation was referenced in three interviews. P5 described his experience, stating,
“[If] I want [the code] to be styled in a particular way, or in a certain language, or if I want to transpile it to another
⁵ Pre-published via arXiv.
⁶ Now known as Gemini, as noted in Section 2.
language, [ChatGPT has] been able to do that pretty well. ” This finding is in line with prior research on code
translation as one of the capabilities of AI coding tools [45].
•Accelerating Learning: Two participants noted that ChatGPT reduces the learning curve and speeds up learning.
Given the integral role of continuous learning in a developer’s responsibilities, LLMs’ ability to facilitate this
is highly beneficial. For instance, P14 highlighted, “ I feel it’s really good at explaining. Like, I don’t understand
quantum computing at all, but I ask [ChatGPT] to explain the principles to me as if I’m five [years old] or as if I’m a
JavaScript developer [...] It helps me understand the context without needing to delve into all the underlying physics.
It definitely helps me learn new things faster. ” Prior research has also highlighted the positive impact of ChatGPT
on accelerating the learning process [52].
•Simplifying Set-ups: Two participants mentioned using LLMs for set-ups and installations, which are an
inevitable part of a developer’s work and can consume significant time. In one instance, P4 stated, “ It saves me
the four hours of headache of setting up. It would also help with setting up a new [development] environment — I’m
gonna go to React, gotta get this web app building and running. I’ll just copy and paste the terminal errors, and it
does a surprisingly good job of telling me how to get through the dev environment issues. ”
•Supplementing Tutorials and Documentation: One participant, P16, highlighted the common issue of online
tutorials and documentation being incomplete or outdated, potentially creating roadblocks in developers’ projects.
She suggested that ChatGPT could alleviate this challenge: “ I’m working with a platform that I’m not familiar with,
and the documentation of that platform is not clear or may skip steps. I notice now that at work, documentations do
skip steps, and ChatGPT does a really good job at filling in the gaps when I ask my question: ‘Hey, this is what I’m
supposed to be getting, but I’m getting something else. Do you know why?’ And then again, ChatGPT sometimes
does give wrong answers, but it shoots out answers that I may not have considered. ”
•Improving Recall of Syntax/Implementation: One participant highlighted the challenge of forgetting syntax
or implementation details in a developer’s knowledge base. P11 shared, “ I used to use MySQL a long time ago, and
then I haven’t used MySQL in a while, so I didn’t remember certain ways of doing things. In one case, I needed to do
a left join to find things that were in one table but not another. [ChatGPT] was helpful. ”
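The query pattern P11 describes, a left join used to find rows present in one table but absent from another, is often called an anti-join. The sketch below is an illustration under stated assumptions: it runs on Python's built-in `sqlite3` module rather than MySQL so that it is self-contained, and the `customers` and `orders` tables and the function name are hypothetical. The same `LEFT JOIN ... WHERE ... IS NULL` form is standard SQL and should carry over to MySQL.

```python
import sqlite3


def customers_without_orders(conn):
    """Return customers that have no matching row in the orders table."""
    query = """
        SELECT c.id, c.name
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.id
        WHERE o.customer_id IS NULL  -- unmatched left rows get NULLs on the right side
        ORDER BY c.id
    """
    return conn.execute(query).fetchall()


# Hypothetical in-memory schema and data for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL);
    INSERT INTO customers (id, name) VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders (id, customer_id) VALUES (10, 1), (11, 3), (12, 1);
    """
)

missing = customers_without_orders(conn)
print(missing)
```

The left join keeps every row from `customers`; filtering on a NULL join key then leaves only the customers with no order, which is the "in one table but not another" case from the quote.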
Challenges:
•Programming Language Mix-Ups: Three participants noted instances where LLMs’ responses were not in
the expected language. For instance, P8 recounted an experience of asking ChatGPT to refactor code in Java:
“When you paste your code [on ChatGPT], it even does not understand that code is [in] Java, C#, JavaScript, or
whatever...Maybe the premium version doesn’t have that issue. But at least the [free] version that they have is silly
in those cases. ” Consequently, developers need to spend time converting the code to the target language.
•Contradictory Answers: One participant, P1, observed that ChatGPT’s answers sometimes contradicted its
earlier responses and that it remained stubborn. P1 stated, “[The impact of follow-up questions on the accuracy of the
answers] depends. These language models have a tendency to be stubborn.” Additionally, P1 cited an example
supposedly from the literature where ChatGPT’s responses to an imaginary scenario were contradictory, and it
refused to correct its mistakes⁷. Despite such occasional unreliability, prior research indicates that individuals
are not deterred from using ChatGPT for this reason [5].
7The scenario in question concerns asking ChatGPT what it would do if it were invisible, with contradictions arising when it responds that it would
access inaccessible areas; note that the researchers were unable to find the source of this reference
•Slow Response Generation: Different versions of LLMs may exhibit varying response times that impact the
user’s experience. P16 compared GPT-3.5 and GPT-4, noting significant differences in speed that resulted in
frustration: “ GPT-4 is really slow and very limited, so right now, I use GPT-3.5. I am excited for GPT-4 [to become
faster] because it obviously holds more information. But I tried it out, and I caught myself trying to rip off my hair
because it wasn’t understanding me the way that GPT-3.5 understood me. And then, when you’re limited — I think
it’s 50 prompts in four hours or something like that. And I’m having conversations with ChatGPT [that are] one
thread with 20 messages. [...] 50 [prompts] every four hours is not efficient. ”
•Struggles with Unstructured Data Analysis: One participant, P3, discussed a challenge with unstructured
input for ChatGPT. As he described, “ I tried asking it to perform some analysis on unstructured data I received
that I was going to run through Splunk, which is a machine data reading tool. I asked it to test efficacy. I provided it
with some dummy data and asked for it to create a few reports. It seems like it only read the variable names and
decided what those reports would look like, but the actual visualizations it created and the insights it provided were
not accurate. This might have been due to the nature of the data, but I also noticed similar issues discussed on Reddit
regarding Kaggle data analysis. ” Such inaccuracies can cost developers time to correct or leave the effort
spent on the LLM wasted.
3.1.2 Facilitating Developers’ Learning .
•Personalizing Education and Skill Enhancement: Learning new technologies, software, and information
from LLMs emerged as a prominent theme, appearing in fourteen interviews. A key benefit of LLMs is their ability
to offer personalized and interactive learning. Whether LLMs introduced unfamiliar libraries to a programmer or
assisted developers in learning specific languages for job interviews, they proved to be an invaluable resource.
For example, P7 highlighted the constant need to learn new things in the industry: “ We still need to read a lot of
things like documentations or ...some new technologies. So ChatGPT is a good resource...to learn something I never
heard before. So, for example, if there is a question [and] I’m not sure which tool I should use, I could probably just
ask the open question from ChatGPT. ”
•Explaining Code: LLMs’ ability to provide explanations and examples was noted in eleven
interviews. For instance, P7 highlighted how a code example from ChatGPT not only offered guidance but also
inspired him to clean his code: “ Instead of looking at the [Golang8] code myself — because the source code has
so many packages, I don’t know where to look into — I just asked ChatGPT to find me an example. In this case, I
can quickly notice this is [what] the source code is doing. I can probably do something similar to make the code
readable. I mean, it has good readability and is also as beautiful as possible. ” Related to LLMs’ streamlining of the
searching process, P13 shared how ChatGPT was more useful than other tools due to its ability to explain code
more efficiently: “ [My] first step tends to be to go to ChatGPT, give it the snippet of my code, [...] and then ask it
those specific questions because I think it puts you quite ahead in understanding and making progress with things. ”
•Providing a Broad Scope of Knowledge and Diverse Datasets: The extensive knowledge and diverse datasets
of LLMs appeared in four interviews as an element that advanced developers’ knowledge and understanding.
For example, P8 shared an example in which ChatGPT was able to create data transfer objects (DTOs) based on
requirements related to a task in the financial domain; the LLM was able to intuit that the DTOs would require a
first name, last name, SSN, PIN number, and a card number: “ That was really surprising to me that [ChatGPT]
knows that. ”
8Go, also known as Golang, is a programming language designed by Google [60].
•Demonstrating Best Coding Practices: Two participants highlighted how LLMs helped them learn how to
write code more efficiently and elegantly. By observing the generated code and incorporating LLMs’ suggestions,
they were able to refine their coding style and produce more readable code. P16 shared that ChatGPT helped
her learn how to write code more efficiently: “ I think it showed me better [...] shorthand code. [...] I just kind of
learned how to use less words, [and] code more efficiently. I would say sometimes it may not be pretty, but from what
I know, plus what I’ve learned from it, I’m able to kind of combine it and mix it and make my code more elegant and
readable and just better-looking code. ”
Challenges .
•Hallucinations and Incomplete Responses: LLMs occasionally produce erroneous or fictitious content, a
major issue that has been outlined in several academic studies [ 3,96]. This challenge surfaced in six interviews.
For instance, P1 recounted instances where ChatGPT fabricated responses: “ There were some things that were
surprising in a bad way, like it would make up papers. It would hallucinate paper names and authors. ” Another
participant, P11, highlighted ChatGPT’s failure to provide sources for its responses, noting the importance of
human expertise and of using reliable, up-to-date information: “ [People’s] knowledge is probably more up to date.
No, nobody is always 100% up-to-date, but you can trust it a little bit more, or if they have experience with the library,
then I would trust a person’s experience more than whatever ChatGPT is drawing from. [...] ChatGPT really does not
cite its information. I think maybe Bard started citing stuff. Citations are very helpful. ” Additionally, P5 mentioned
encountering disrupted answers midway, possibly due to server connectivity issues: “ I have run into issues where
I tried to generate like a lot of code, and then it would just stop halfway through because of a slow connection, or just
the plan that I have for it, and then I would ask it to finish, and then it just sort of forgets what it was doing. ”
•Limited Knowledge and Datasets: In contrast to the broad knowledge discussed earlier, five participants
shared that LLMs have limited knowledge and datasets. For instance, P9 shared that despite extensive
efforts in feeding information to ChatGPT and careful prompt engineering, ChatGPT failed to provide an answer:
“There are definitely times when I spent maybe more than an hour trying to ask ChatGPT to help me debug my
JavaScript code. But it turns out that no matter how well I try to re-prompt, using the prompt engineering techniques
that I picked up from Dr. Andrew Ng’s course on DeepLearning.AI9, [or] even if I tried all the techniques available,
it’s still going in loops. ChatGPT is still unable to pick up any specific bugs, which [are] helpful for me to overcome
the issue. In that case, I still have a JavaScript subcontractor that I hire on an hourly basis to help me when I’m really,
really, really stuck. So there are times that I still need to get him to fill me [in on] that. ”
•Struggles with Novel Ideas and Logical Prompts: In four interviews, participants highlighted LLMs’ limita-
tions in generating code for novel ideas and handling prompts requiring complex logical reasoning. For example,
as P11 observed, “ ChatGPT isn’t good at logical things or counting and math, right? But I think it’s able to usually
generate some reasonable code for that. ”
•Impediment of Developers’ Learning: Another concern raised in four interviews was LLMs’ probable adverse
effect on learning, particularly for junior developers. This was primarily due to users’ potential inability to parse
the correctness of ChatGPT’s answers. For example, P10 mentioned, “ I think ChatGPT is not mature enough yet
for everyday uses, especially for junior developers. If they want to start to do something, I don’t think they should use
ChatGPT because it might lead to misinterpretation of a lot of documentation and lead to something else. Because a
lot of time, ChatGPT’s tone, the way it replies, has certainties, say, here are all the solutions [...] and that is incorrect.
I have encountered so many times that it’s answering itself in circles. ”
9https://www.deeplearning.ai/
•Safeguard Failures and Harmful Effects: In two interviews, participants reported incidents where these
LLMs’ safeguards against disseminating harmful information could be bypassed. One participant, P3, detailed an
instance of impersonation that led to offensive and discriminatory content. In another case, P4 described how he
was able to write a program that caused his computer to restart: “ [I]t will gladly do harmful side effects. But I had
to go out of my way to try to get it to [do] it. This was just during a live demo of using GPT-3, and I had it restart my
computer. ” These examples underscore the need for ongoing enhancements in LLMs’ capabilities to ensure their
effectiveness as a knowledge-enhancing tool.
3.1.3 Supporting Developers’ Personal Growth .
•Enhancing Reassurance, Confidence, and Independence: One theme that emerged in six interviews was
LLMs’ role in providing developers with reassurance, which improved their confidence and independence. P2,
a newer professional, shared his experience of comparing his answers to ChatGPT’s for corroboration: “ It has
made the [software engineering] process easier definitely. And the best part is I kind of get the reassurance of a lot of
things — that if I’m doing it this way, I can ask it, and I can just check if ChatGPT is giving me a similar answer [...],
then I’m not doing anything which is majorly wrong. There will always be a few things here and there, but most of
the parts would be correct. That’s what I get from it. ”
•Improving Access to Information: A positive view of LLMs’ ability to ease access to information emerged in
three interviews. P15 shared his belief that ChatGPT would democratize information access worldwide: “ Now we
have this tool that everybody in the world can use, that’s very, very inexpensive or free, that allows everybody, from
all ages, from five years old to a hundred years old, to learn any topic that they want. I think a lot of countries are
gonna be able to grow and thrive, knowing that they can now figure out how to grow these things, or what’s messed
up with their economy, or analyzing these things [...] You can’t say, ‘Oh, I grew up in a poor neighborhood. I couldn’t
get the same information as the person who went to an Ivy League College, ’ because that’s no longer the case. Now it
comes down to whether or not you yourself have the ambition to learn the things that you say, that you would learn
if education was free because now it’s free. ”
•Enhancing Job Satisfaction: One participant, P14, highlighted ChatGPT’s role in his increased job satisfaction.
As a non-native English speaker, he shared how ChatGPT helped him in areas he struggled with, thus allowing
him to focus on what he cared about: “ I feel like I’m very bad at writing essays in English. I feel like [ChatGPT] can
write better essays than I can, and work my ideas better than I can in English. So it can help me in that way, that I
don’t need to get my skills on some level. So I can write some okay essays or emails [...] and I can focus more time
on doing something that I enjoy and going deeper there, which it encourages me. And if I can focus more time [on]
doing something that makes me happy and that I’m interested in, I can learn those things way faster and be way
happier than if I have to learn something that I don’t really care about. ”
Challenges .
•Inability to Replace Human Decisions: Six participants noted LLMs’ inability to replace human interactions
and decisions. For instance, P12 emphasized that software engineering involves more than just coding, highlight-
ing the importance of human communication and collaboration: “ Despite what [LLMs] can do, at the end of the
day, it can’t replace — like, what’s difficult about being a software engineer isn’t coding particularly. Yeah, that’s
part of it, [but] I think what’s difficult about being a software engineer, at least with what I do, is the communication
that happens between teams, between coworkers, the email threads, and the chat threads that exist. ”
Similarly, P5 stressed the ongoing necessity of human involvement in software development, especially in
decision-making and understanding business requirements: “ I still believe that you’ll still need a human person
running the shots and doing the code. You still need someone to run ideas off of and to handle all of these human
elements because you’d still need to take in the business case from a person and from the whole other team, and that’s
a whole other topic and conversation. But there still is room in this world for the human interaction and human
developer, in my perspective. ”
•Concerns about Dependency: Two participants mentioned their preference for solving problems on their own
before asking LLMs. This was primarily done to avoid over-reliance—as P16 shared, “ I try my best to at least
spend some time on my own to figure it out because I don’t want to be too dependent on [ChatGPT]. ”
•Slower Implementation than Humans: Differences between the time it took humans and LLMs to create code
were noted in one interview. P12 highlighted the efficiency of human problem-solving in cases where ChatGPT
fails to understand or provide accurate solutions: “ At some point, I’ll be like, ‘Okay, it just doesn’t understand me. ’
And so I’ll give up and then just do it myself. Honestly, sometimes I feel like doing it myself can often be the easier
answer — like, the actual, quickest path to getting what I want sometimes when it starts making these mistakes
initially. ”
3.1.4 Assisting Developers’ Non-Technical Tasks .
•Consulting and Decision-Making: Eleven participants discussed using LLMs for consultation
or direction. For example, P7 highlighted their utility in uncertain situations: “ If there is a question, [like] I’m
not sure which tool I should use, I could probably just ask an open question to the ChatGPT like, ’Hey, could you
give me some directions or some potential solutions to this situation?’ And ChatGPT could probably show me some
high-level ways [...] But with some traditional search engines, like Google, it’s kind of hard because if I don’t ask a
specific question, or if I don’t ask for some specific tool, they cannot give me a suggestion or they give me some way
unrelated suggestions. ” In another case, P8 mentioned ChatGPT’s impact on resolving team debates: “ [A]lways in
technical teams, there exists debates on choosing options, options A and B — both are correct, but which one is better?
In this way, we have a judge; we have someone that tells us which approach is better. In this way, I think it changed
the whole software engineering [process] for me, that whenever we have a discussion with our colleagues in our team,
in some cases, at least we have someone [ChatGPT] who says the final word. ”
•Summarizing Text and Documentation: Nine participants mentioned the use of LLMs for summarizing
different textual data, such as articles, papers, and documentation. With regard to reading documentation, P7
shared, “ When I look into some documentations, instead of reading the whole thing — because I probably don’t need
all of these things, I only need some piece of the information — I can ask ChatGPT, ‘Hey, read through this link and
give me the summary. ’ ”
•Supporting Internal Communication: The use of LLMs to facilitate internal communication by composing
documents like quarterly updates, aiding in presentations by providing outlines and content suggestions, and
assisting in explaining complex tasks to team members emerged in five interviews. For instance, P7 stated, “ Since
English is not my first language, [it] previously took me a lot of time to write that. But with ChatGPT, I can just
show it this is what I’ve done. ‘Could you generate this kind of solution, this kind of document for me?’ ” P4 shared
how he used ChatGPT to create presentations and demonstrations: “ I also do a lot of presentations. So, [ChatGPT
helps with] communicating what I’m building and what our team is working on internally to other teams, how to
present that information, help me with slides, help me with what would make it a good demo, video, et cetera. It’s
able to just give me outlines for that type of stuff! ”
Fig. 2. Findings related to RQ2, excluding those themes found related to the implementation phase, which can be found in Figure 3.
Challenges .
•Limited Summarizing and Explanation Capabilities: Two participants expressed dissatisfaction with LLMs’
ability to summarize and explain. One participant, P3, touched on this concept, as well as ChatGPT’s hallucinations
(discussed in Section 3.1.2): “ I’m a little bit wary of asking it for expert questions. [...] [I] asked for it to review a
paper for me, and it ended up not actually reading the paper, [and] making generalizations [...] I even asked for it to
make a work cited, and it gave me cited resources that didn’t actually exist. ”
3.2 RQ2: How have LLMs influenced software development processes?
To answer RQ2, we loosely organize our findings based on the software development life cycle [ 87] and the agile
development methodology [ 30]. The key themes that emerged in understanding how LLMs affect developers’ software
development processes are as follows:
3.2.1 Requirements and Planning. We use a common definition of requirements, defining it as a software capability
that must be met by a system or system component in order to satisfy a specification [ 46]. Similarly, we define planning
as the process of collecting requirements from stakeholders, scheduling, and resource estimation/allocation [87].
•Discovering Missing Components: Two participants utilized LLMs for uncovering missing components, as
P15 shared: “ I have an idea of what I think it should be, [...] and then I put it in ChatGPT and say, ‘What are some
gotchas, or what are some things that I’m missing? Or what kind of questions should I be asking in order to fill in
any blanks that I might have?’ So [ChatGPT is] kind of my consultant, as if it was a more senior developer than me,
or more a manager than me, or something like that, to where I’m going to get some feedback. ”
•Prototyping: Prototyping was another task that two participants found LLMs to be proficient in. As P12 shared,
“[ChatGPT] can [...] kind of prototype out what I want to build faster than I could probably do it myself. And then on,
when I think it’s looking right or something, I’ll go actually try to implement it. ”
•Refining Requirements: LLMs’ assistance in refining requirements, especially for independent contractors,
appeared in two interviews. For these participants, LLMs seemed to be particularly helpful in this
area. P9 shared, “ Say, if I got a new contract job that I’m scoping to try to help my clients to try to refine the technical
requirements, and especially so on the domains I’m not familiar with. So, most recently, one of the contracts I got was
to try to create a music visualizer [...] I have never previously dealt with creating apps with specifically musicians
before. So while I was talking to them getting requirements at the same time, I would have the ChatGPT window on
the side. ” P10 shared that ChatGPT helped him research and cut down discovery time when interacting with
stakeholders, and that this was important for him throughout the software engineering process: “ Because I’m
a kind of a one-man shop currently, I really need someone or something to help me to kind of prop up all those
processes. ”
Challenges .
•Inability to Replace Human Involvement: Twelve participants stated that LLMs did not have an impact
on their requirements gathering. Participants mentioned that they received requirements and plans from their
supervisors or superiors, who work directly with client needs, and these can’t be generated by LLMs. As P2
shared, “ [Requirements gathering] hasn’t changed because most of the work that I’m doing is with client developers
and product managers from the client side. So we need to talk to them for the requirements we need to specify — like,
we need to gather all the requirements from them. And because our projects are very client-specific, I cannot just go
on the Internet to see what they would expect. We need to talk to them to figure out the requirements. ”
•Limited Requirements Detailing: One participant, P3, shared how ChatGPT cannot help to generate detailed
requirements: “ I feel as though the requirements it gives me are very common sense, if that makes sense. But it
doesn’t really get into the usability aspects of it, or it doesn’t get into like some of the more fine-tuned [requirements],
like how should something be done. ”
•Struggles with Multi-Criteria Decision-Making: We found in one of our interviews that multi-criteria
decision-making can be challenging for LLMs. P16 shared, “ So I was looking at flight from [point A to point B].
The flight price with [one airline] went up by a lot of money, and I did not want to pay that much, so I was willing to
take different flights, like, you know, fly to [point C], and then [point B], see whatever is cheaper [...] and then I said,
‘Okay, so ChatGPT is not good with math or logistics, ’ and I did have to do it manually. ”
3.2.2 Design and Ideation. We consider design to be the process of finding solutions and creating a more detailed
technical plan as a result of the requirements-finding process [ 87], where ideation is a critical component in designing.
Decomposition is also part of the design process, and Chattopadhyay et al. have noted how developers employ particular
strategies in order to decompose their tasks into smaller units [14].
•Increasing Problem Decomposition: Eight participants acknowledged that LLMs encouraged or necessitated
problem decomposition into smaller, more manageable components. This is because LLMs can only take in a
limited amount of information at a given time. Consequently, participants needed to either decompose their
problems for the LLMs or provide their already decomposed problems. P4 shared that for him, ChatGPT “ might
force me to break [problems] down [...] I guess it might be a good forcing function to make me break them down in
the subtasks and subrequirements [...] If I can’t explain it to ChatGPT, then it might be an indicator that I don’t yet
understand it. ”
•Utilizing for Ideation: Six participants acknowledged using LLMs for ideation, leveraging their extensive
knowledge bases and generative capabilities (further discussed in Section 3.1.2). Participants highlighted LLMs’
ability to generate diverse ideas, albeit with varying levels of accuracy, which made them useful as a source of
ideas even when their output could not be taken at face value. P12 shared that for them, “ Probably
the ideation stage is where [Bard] shines the most, because you can’t trust it to be accurate or exactly what you want,
but you can trust it to generate some cool content for you, some cool code ideas, some cool writing ideas. ”
Challenges .
•Inapplicable for Decomposition: Seven participants shared that LLMs have not affected their problem
decomposition, viewing it as inherently more reliant on individual reasoning than on anything LLMs can do. P13, for instance,
stated, “ I don’t think [it has impacted decomposition]. I like planning them myself and then going to ChatGPT for
more fine-grained planning about how I should actually implement the code.” P16 shared that she intentionally
did not use ChatGPT for “ super big things that need to be broken up, ” instead only using it for smaller tasks that
she believed would be more likely to be answerable. Other participants alluded to only using LLMs for simple
tasks and thus avoided struggling with the decomposition process.
•Stubborn Responses: Six participants noted that LLMs were stubborn, which could limit their usefulness while
designing. P1 noted that “ These language models...have a tendency to be stubborn, ” and P12 stated that ChatGPT
tends to “ stay with the previous output a lot. ” Another participant, P11, remarked on ChatGPT’s struggles to
adapt to new input and modify its outputs: “ [ChatGPT] doesn’t do a good job of changing itself. ” However, P3,
who did not believe ChatGPT was stubborn, explained that he held this view because he recognized its inherent
limitations with regard to complex tasks: “ I’ve never really thought [ChatGPT is] stubborn [...] I guess I’ve kind of
learned through searching online [...] to get it to do like what you want. And then realizing it doesn’t do it, [...] it’s
not that it doesn’t want to, it just can’t. ”
3.2.3 Implementation. The implementation process with LLMs parallels code reuse. Therefore, we adopted Rosson
and Carroll’s classification [ 84], dividing the process into three distinct phases: (1) Finding Context, which entails prompt
engineering and the process of locating pertinent responses; (2) Evaluating Context, which centers on evaluating the
accuracy and usability of the generated responses; and (3) Integration, which pertains to incorporating the generated
text into an individual’s own code or within an existing codebase. We note that the following bullet points’ titles are
framed from the perspectives of the developer/prompter, rather than the perspectives/capabilities of the LLMs.
(1) Finding Context.
•Varying Prompt Specificity to Impact Answers: Developers’ diverse preferences for broad prompts, specific
prompts, or adjusting query specificity to influence LLMs’ output emerged in fifteen interviews.
Broad questions and vague prompts are favored when participants seek varied answers, want to avoid imposing
assumptions, or want low-effort responses. P15 shared
that while he originally had used specific prompts, he found that using broader prompts resulted in improved
outcomes: “ [I]t started to give me actually better results, because it knew things that I didn’t know [...] When I have
a more broad [prompt], then it’s able to kind of formulate its own deduction and get to the problem that it thinks
that it’s solving. ” P4 similarly shared that, while he had originally used more specific prompts, he found vague
prompts to ultimately be more productive: “ I’ve been moving to shorter and shorter, more vague prompts. So I used
to be, you know, trying to do the few-shot learning approach. [...] I never do that anymore. It is such a waste of time. ”
Fig. 3. Findings related to the implementation phase of software development.
Specific prompts are preferred when seeking particular answers, especially in software engineering, or when
participants have a clear understanding of what they are asking for. As P2 stated, “ I think it should be very specific
[...] If you are going into other [non-software] domains, where there is no one correct answer, you can go for broad
questions. But for software development, I think you need to be very specific. ” P16 mentioned, “ I think it depends on
how well you understand the question you’re asking, because I’ve asked ChatGPT very specific questions, but those
were specific and to the point. ”
Four participants acknowledged that there is not a single correct strategy for choosing the level of specificity.
They emphasized the importance of adaptability and experimentation, suggesting that users can benefit from
trying both vague and specific prompts. P7 shared two use cases demonstrating the value of both prompting
strategies: “ Sometimes I want the question be really generic. For example, I want to design a system. [...] I want
to look into all those possibilities. In that case, I want to give some general question like, ‘Could you give me some
high-level architecture for this system?’ But sometimes I want some answer [to be] really specific. For example, when
I read a book, there’s a sentence I don’t really understand why the logic [is] this way. I don’t want [ChatGPT] to give
me an approximate answer. I want [a definite] answer. ” Being able to determine when to be more or less specific is
seen as a valuable skill when using LLMs.
•Enhancing Accuracy and Clarity with Follow-Up Queries: Fifteen participants shared their strategies
of prompting iteratively until getting the desired response. Seven participants shared that follow-up queries
improve accuracy, and two shared that they improve clarity. As P16 stated, “ I think the follow-up questions allow
the users to get clarity. ” She elaborated on the importance of follow-ups: “ The follow-up questions actually might
be more important than [the initial question], more probably equally important. But I get more of my answers from
the follow-up question. ”
•Unique Prompting Strategies: Prompt engineering as a discipline has become increasingly popular, with
courses and guides from all corners of the Internet, including from OpenAI itself [ 69]. Fourteen participants
acknowledged the crucial role of prompt engineering in eliciting accurate and relevant responses from LLMs. P9
shared that he could find a good answer “ [a]s long as the prompt is crafted carefully, and the problem is common
enough to find a solution. ” This ties into results discussed in Section 3.1.2, showing that ChatGPT performs best
when it encounters less novel problems.
Participants demonstrated a willingness to try different prompting strategies, with inspiration from online
resources and previous interactions with LLMs. P10 shared his unique strategy of adding and removing context
via particular prompting by using ’+’ and ’-’ within his prompts, as well as labeling points (e.g., A, B, C) to refer
back to them later.
•Generalizing Prompts and Code for Security: Thirteen participants employed generalized prompts or
used case-specific information rather than project-specific details when interacting with LLMs. P2 shared
that, “ Because I cannot put all the information into ChatGPT, because it’s very client-specific information, it’s
confidential information [...] I just give it use cases that are similar. ” By focusing on broader contexts or specific use
cases, participants aimed to minimize the risk of exposing confidential project information while still obtaining
relevant responses from LLMs. This approach allowed participants to strike a balance between leveraging LLMs’
capabilities and safeguarding sensitive data. P5 felt more secure in using this approach: “ I think that what makes
me comfortable about [ChatGPT’s security] is that [...] the prompts that I ask are sort of general or generalized [...]
and really can’t be tied to any particular person’s identity, or any sensitive piece of information. ”
•Providing Examples : Nine participants practiced few-shot prompting with ChatGPT. Few-shot prompting,
which has been shown in prior research to have mixed success, entails providing examples to an LLM tool in
order to receive a particular output [ 83]. For example, P11 shared that, “ If you’re able to give it an example of
what you want, then it can do a better job. ”
•Applying Contextual Input for Improved Responses: Nine participants highlighted the significance of
providing context when interacting with LLMs, noting that more context generally leads to better responses.
Examples include P1’s observation that “ The more context, the better, ” and P15’s explanation that “ The more
information that you give it, the better [of] a response it’s going to be. ” Two participants discussed how ChatGPT’s
ability to retain context distinguishes it from traditional search engines. They noted that this feature simplifies
the research process and allows for more effective interactions, as ChatGPT can keep track of previous questions
and responses, enabling a more seamless exchange of information. As P14 stated, “ It takes a mindset from the
Googling part, and the biggest mindset shift is that it keeps context that I can build up on the question. ”
•Starting a New Thread for Fresh Answers: Nine participants stated their preference for beginning a new
thread when they were dissatisfied with the answers received in a previous interaction with ChatGPT or Gemini.
This approach allowed them to refresh the context and seek new responses without the constraints or biases
of previous chats. P15 detailed his experience with ChatGPT: “ [I] have it continue refining what I want to do.
Sometimes, it’ll get to the point where it kind of [stops] working out. Maybe the context gets a little skewed the
further you get down in the chat. So then I’ll just take whatever it had there and then my problem, and then start a
new chat. So, recreate the context. And this is where I’m at and trying to do this, and then I can kind of start over
and and refresh where it’s at. ” Some participants mentioned that when they chose to open a new thread, they often
improved the specificity or clarity of their prompts. P16 shared that when she continuously gets an incorrect
answer, “ I open a new thread because sometimes I’m like, I learned my lesson; I learned. I know what [ChatGPT is]
going to say now. So, let me open a new thread and start all over. And I’m going to be more specific, and maybe we
can solve this together. ”
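The few-shot practice reported under Providing Examples can be illustrated with a minimal sketch. It is not taken from any participant's workflow: the helper name build_few_shot_prompt, the "Input:/Output:" layout, and the camelCase-to-snake_case task are all assumptions chosen for illustration.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a plain-text few-shot prompt.

    examples is a list of (input_text, output_text) pairs shown before
    the real query. The "Input:/Output:" layout is an illustrative
    assumption, not a format prescribed by any LLM vendor.
    """
    parts = [instruction.strip(), ""]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")
    # The final item leaves the output blank for the model to complete.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


# Hypothetical task: two worked examples, then the case we want answered.
prompt = build_few_shot_prompt(
    "Convert each function name from camelCase to snake_case.",
    [
        ("getUserName", "get_user_name"),
        ("parseHttpResponse", "parse_http_response"),
    ],
    "loadConfigFile",
)
print(prompt)
```

The resulting string would be pasted into a chat interface or sent through an API client. As P11 observed, the approach presupposes that the developer already knows what a good example looks like.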
Challenges .
•Struggles with Integration and Adjustments: Five participants observed that LLMs faced challenges when
integrating context and making adjustments or changes. Instances were shared where LLMs struggled to modify
specific components of their outputs without affecting the entirety of the generated code. P1 shared an incident
in which ChatGPT struggled to change minor components in its output: “ It wouldn’t change just that piece of the
code. It would change all of the code. ” Participants also expressed concerns about LLMs’ abilities to retain the
original context over multiple prompts or interactions. One participant, P8, shared that in software engineering
(not other disciplines), “ [W]hen I’m asking more questions, and I give [ChatGPT] more context, I confuse it more. ”
He described situations where ChatGPT would forget previous information or modifications, leading to the need
to start fresh threads to maintain clarity and accuracy. P8 also noted that higher specificity in prompts
could sometimes result in ChatGPT making assumptions or becoming overwhelmed with additional information.
Additionally, P3 shared that ChatGPT often struggled to integrate non-functional requirements: “ [ChatGPT]
usually kind of messes up on those, which is why I avoid [including] non-functional aspects [when prompting]. ”
•Limitations of Context Window: Five participants highlighted the limitations of LLMs’ context windows. They
noted instances where LLMs ran out of space to generate code or lost context when prompted multiple times.
P15 echoed this sentiment of ChatGPT forgetting its original context and how this factors into him opening new
threads: “ [A]t some point further down the line, it’s going to forget certain things that were really, really important to
consider. And so sometimes I’ll say, ‘This is really important to remember. ’ [...] It’s not always perfect, but that’s kind
of why I take the context of everything I learned now up to this point, plus my original message, and then readjust
and do a new chat, just so [ChatGPT] can kind of get a fresh start at looking at it. ” This limitation could lead to
frustration and decreased effectiveness in generating accurate responses, as P12 stated: “ I’m probably better off
doing it myself. At that point, it’s probably going to be a waste of time to continually trying to re-prompt it and
re-prompt it. ”
•Necessity of Follow-up Queries: Two participants shared that they often relied on follow-ups. This was in
part because LLMs did not always get things right the first time. P9 stated, “ Very rarely, [...] I can just get the code
that I want in one try, even though there are specific prompt engineering templates that I follow. ” As mentioned,
participants often use broad prompts and then provide specific follow-up questions, demonstrating the iterative
process of prompting. One participant, P14, shared that “ If it’s an easy function, [the task requires] one [follow-up].
And if it’s a very difficult problem, sometimes it takes like 50. ”
•Challenges of Providing Context: While acknowledging the benefits of context, one participant noted the
challenges of providing sufficient context, especially when they are unsure of what specific examples to provide.
P11 mentioned, “ As far as context, [ChatGPT] benefits a lot from examples [...] If you can give it an example of what
you want, then it can do a better job — but often you don’t know what the example would be, because you’re asking
it to do something you don’t know how to do. ”
(2) Evaluating Context.
•Verifying through Reading Code: Ten participants shared that they rely on reading their code as a primary
method of review. This involves going through the code manually, line by line, to understand its logic, structure,
and implementation details. Reading allows developers to identify potential issues, errors, or areas for improve-
ment within their codebase. P9 noted that reading could be a sufficient check for a simple piece of code: “ For me,
as a professional developer, it is still my job to validate [that] this piece of code is going to work, and it’s going to
integrate well into my existing codebase. So to do that, if it’s something that’s simple enough, I can just mentally do
a quick check — me reading it line by line — and do a quick check mentally, I can do that. ”
•Verifying through Output: Similar to testing, ten participants shared that they validate their code by
checking the output after reading through the code. This involves examining the results of the code execution,
which can provide insights into its correctness and functionality. Methods for output verification include checking
console logs, manual testing, and comparing the generated output to expectations. As P8 shared, “ If that output
would be contradicted with what I expect, I discard it. ”
•Verifying through Manual Testing: Seven participants also engaged in manual testing practices, such as
verifying code templates and solutions, playing around with isolated examples, and using developer tools or
read–eval–print loop (REPL) tools to test code functionality before integration into their codebases. P10 shared
that, for his process, “ I verify its templates, and I verify its solutions. I put that in a inbox to test it. ” Additionally,
checking console logs, verifying expected function return values, and monitoring IDE warnings or errors were
common practices among participants to ensure the correctness and integrity of ChatGPT-generated code. As
P13 noted, “ [I see] if my editor is like giving any warnings or errors. ”
•Verifying through External Tests: Four participants employed external testing frameworks and tools, such as
Pi Test (P9) and LangChain (P4), to evaluate the generated code. These tools facilitated automated testing, bug
detection, and even collaboration between different agents to review and improve the code.
•Verifying through External Sources: Two participants, one more experienced and one less, shared that
they seek validation of their generated code quality from external sources such as the Internet. They look up
documentation, search for common solutions on platforms like Stack Overflow, and review discussions to ensure
that their implementation aligns with established practices and expectations. P13, a less-experienced developer,
described it as an additional step she took for validating even after the code seemed to be functional: “ I always
take [the generated code] with a certain pinch of salt, and even though the code seems to work when I plug it in, I
still always try and, you know, do a web search. [...] I always go back and do a web search to ensure that, yeah, this
is one of the common solutions to the problem, and it’s a good one. Basically reading all of the comments, like all of
the discussions that happen on Stack Overflow, I think are really helpful to verify that my implementation is in line
with how people would expect it to be done. ”
•Clarifying Code through Generated Explanations: Two participants shared that they ask LLMs to explain
their own generated code as part of their evaluation process; as P4 shared, he “ [asked ChatGPT] to explain
itself. ” This involves requesting explanations for specific lines or segments of code in order to gain a deeper
understanding of its logic, functionality, and rationale.
•Assessing Visually and Logically: Two developers, who both worked in front-end development, noted that
they visually assessed the quality of their code. P5 noted that he could both read and visually check the quality of
his code: “ I first run it in my head to see if it flows logically. [...] So when I look at the code or look at the class, and
I’d say, ‘Okay, that’s generally what I’m going for, based off of what I know. ’ [...] I’d either load up my own personal
test environment and like a web browser, or I just copy and paste the code, and just see if it gives the result that I’m
looking for. ”
•Judging Quality through Readability: One participant, P7, assessed readability and adherence to
established principles in order to judge the quality of the generated code. He shared
that he “ judge[d] whether [the] code is good or not ” by evaluating the readability of the code through three strategies:
applying the SOLID principles [ 90]; referring to the concepts he had learned from Robert C. Martin’s
Clean Code: A Handbook of Agile Software Craftsmanship [53]; and cross-referencing code against the
programming language’s official documentation.
•Improving via "Self"-Correction: One participant, P12, shared his experience in which ChatGPT could
recognize and rectify errors it had made in its code without explicit user guidance: “ Sometimes [ChatGPT] literally
will write something, or it’ll write some code, or even English text, and [you’ll] be like, ‘Hey? You made a mistake
here. What was your mistake?’ And it will just correct it. You don’t even have to tell it what the mistake is. It’ll just
be like, ‘Oh, my bad, I did this, this, this, and this wrong. Here, let me go fix it. ’ And it’ll write it up for you. ” This
self-correction capability contributes to the reliability and usability of LLMs like ChatGPT in code generation
tasks.
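The output-based verification described above (running generated code and discarding it when the result contradicts expectations, as P8 put it) can be sketched as a small check. Everything named here is hypothetical: check_outputs is an illustrative helper, and generated_slugify stands in for a snippet pasted from an LLM.

```python
def check_outputs(func, cases):
    """Return the cases where func's result differs from the expectation.

    cases is a list of (args_tuple, expected_value) pairs.
    """
    failures = []
    for args, expected in cases:
        actual = func(*args)
        if actual != expected:
            failures.append((args, expected, actual))
    return failures


# Stand-in for a function pasted from an LLM response (hypothetical).
def generated_slugify(title):
    return "-".join(title.lower().split())


# Expectations written by the developer before trusting the snippet.
cases = [
    (("Hello World",), "hello-world"),
    (("  Leading and trailing  ",), "leading-and-trailing"),
    (("Already-slugged",), "already-slugged"),
]

failures = check_outputs(generated_slugify, cases)
print("discard" if failures else "keep", failures)
```

A check of this kind only covers the cases the developer thought to write down, which is consistent with participants pairing it with line-by-line reading rather than relying on it alone.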
Challenges .
•Increased Skepticism: The existence of generated code from ChatGPT made five participants more skeptical
and vigilant about evaluating the quality and reliability of the code. P1 shared that the existence of generated
code made them more likely to check and test code: “ [I]n a sense, yes, it has changed testing. I’m a bit more
skeptical. ”
•Necessity of Evaluation: ChatGPT has previously been found to have varying quality with regard to its
generated code [ 50,51]. Seven participants emphasized the importance of thoroughly checking the generated
code for accuracy and relevance. P4 mused on the general quality of generated code: “ [Generated code is] not
good enough that I don’t have to read it yet, although that might be nice one day— might put me out of a job,
though. ” Participants advised caution and suggested not solely relying on LLMs’ outputs without verification. P6
shared, “ Be wary of what you get [...] Don’t give up on checking the responses. ” P1 particularly speculated on the
hypothetical case in which ChatGPT’s generated code could pass tests but fail to fulfill the intended requirements.
This underscores the necessity of thorough review processes, which should similarly be in place for reviewing
human-created code; as P16 shared, “ [ChatGPT]’s not going to get it right every single time, and we need human
overview on that [...] And when it comes to human code. I think it’s the same thing. ” Two participants shared that
they paid more attention to reviewing the quality of generated code compared to human-written code, indicating
a higher level of scrutiny for ChatGPT’s outputs.
•Lack of Self-Evaluation: One participant, P11, mentioned ChatGPT’s lack of self-evaluation as a limitation,
which, in his view, makes it inferior to even junior developers. He stated, “ [...] It’s not evaluating its own code. It doesn’t actually
have the ability to do those things that I would do to recode. I just can’t trust it in a certain way; I know it’s not
making certain decisions based on facts or logic. ” While this was one participant’s view, as noted in the above
section on Evaluating Context, another participant (P12) commented that ChatGPT could correct itself.
•Lack of Contextual Clarity: One participant, P4, emphasized the importance of context in understanding
code, suggesting that humans have an advantage in having and being able to provide contextual information,
especially about how and why code was created: “ [I]f a human hands me a piece of code, I can ask them questions
about it. If someone handed me code that was written by ChatGPT, I don’t know what I would want. [...] would I
want the prompt from the code? Because I’m missing the context now of how and why the code was generated.[...] If
it’s a really small piece of code, I can probably just read it and figure it out and either choose to throw it away or go
ask ChatGPT, ‘Hey, will you validate this code for me?’ I think if a human gave me ChatGPT code with the prompt
that generated it, I would feel a lot better. ”
(3) Integration. Here, we examine integration across two dimensions. First, we investigate the practices employed
by developers to incorporate the generated code into their work. Second, we assess the degree to which developers
actually utilized the generated code.
•Modifying Before Integration: Fifteen participants reported that they modified the generated code before
integrating it into their projects. They described refining the code to align with their specific requirements and
overcome potential errors. P13 shared her process: “ Once I got that overall approach, I refined it in a few places,
and then I sent it back to ChatGPT [...] ’Kind of give me boilerplate code for that. ’ So I think it did a really good job
at giving me that boilerplate code. And then I just had to do a lot of refinements within it to overcome certain errors. ”
P2 noted the need to adapt the logic of LLMs into the structure of existing projects and compared it to integrating
code from other sources: “ [Integrating ChatGPT’s code results into my code base is] similar to what I used to do
with Stack Overflow. It will definitely give me a snippet, but variable names and all of that will be very different. The
code structure will be different. So I’ll just take the logic, but follow the structure which is already the pre-existing
code base. ”
•Copy-Pasting Code: Twelve participants reported simply copy-pasting the generated code into their develop-
ment environments. They found this process straightforward and efficient, allowing them to quickly incorporate
the generated solutions into their projects. P7 stated, “ Maybe there [aren’t] too [many] steps. Just asking the
question, and if the answer looks good to me, I would just copy-paste. ”
•Rewriting Code: Three participants mentioned that they preferred to rewrite the generated code manually
in their development environments. They did this to gain a deeper understanding of the code and to mitigate
potential errors that might arise from directly copy-pasting the generated solutions, giving them greater control
over the code integration process. As P9 shared, “ I re-implement [the code] in Python myself instead of just taking
the output directly and pasting it there. ”
•Using Generated Code: Participants exhibited varying degrees of frequency in discarding generated code. Six
participants rarely discarded code, two disposed of it about half the time, and eight often discarded it. Those who
seldom discarded code pointed out that, with well-crafted prompts and in certain programming languages, LLMs
are capable of producing viable code, albeit with some limitations; as P13 shared, “ I would say [I discard code] not
very often. It seems to be useful in the context that I use it. I think I’ve been frequently throwing out code given by
ChatGPT only in cases of SQL. ”
•Discarding Generated Code: Eight participants shared that they threw away a lot (greater than 50%) of their
generated code, while two shared that they threw away about half of their generated code. Frequent discards and
reduced usage of generated code were more prevalent among those eight who used LLMs for ideation, consulting,
or learning. P4 shared his reasons for throwing away code, along with the value that he still received from
discarded code: “ It’s good at getting me from zero to something, but I iterate a lot on it. I throw away a lot of code,
and sometimes I choose not to do what it says to do, like maybe it’s using some design pattern, or it’s just being too
clever with the code. I don’t generally have a problem with with how it’s doing stuff. It just might be different than
how I would do it. For the type of stuff I’m building, I value being familiar with my own way more so than even code
quality at times. [...] I think it’s just the nature of the iterative prototyping that I do, that I throw away a lot of code,
no matter what. But that doesn’t mean I don’t get value from the ChatGPT code. I think it still teaches me a lot of
stuff. ”
Challenges .
•Friction with Copy-Pasting: One participant, P4, expressed his frustration with the copy-pasting process due
to the need to switch between applications or rely on third-party plugins: “ [E]very time you want to interact with
ChatGPT, there is friction just because of the UI. You have to switch to the web application, or you have to get one of
these unofficial third party plugins [...] So right now I copy-paste, and I find that annoying, but the value is still high
enough. So it’s like, what is the cost versus value gained. ”
3.2.4 Testing and Code Review .
•Generating Unit Tests: Seven participants had used LLMs for generating tests, particularly for unit tests due to
their relatively basic and formulaic nature. As P9 shared, “ Sometimes I use ChatGPT to create simple unit tests for
me, instead of me writing from scratch again. But then, I only save those for smaller functions after I’ve done my
refactoring. ” One participant, P15, noted that using ChatGPT to generate tests encouraged him to incorporate
more testing into his coding practices: “ [ChatGPT] does write unit tests for me. So where I would not normally have
them, if it’s able to write unit test for them, then I’ll have it do that. We don’t spend a lot of time creating tests as
much as we should. But when when it comes to certain functions, like even just the basic functions [...] usually [the
generated tests] are pretty good when it comes to just a small function. ”
•Simulating Code Reviews: Two participants, who worked independently as contractors or in small teams, used
LLMs for code review due to the lack of colleagues to review their code; an additional participant used ChatGPT
to review chunks of his code. For the two solo developers, using LLMs for code review served as a way to ensure
their code met professional standards and to compensate for the absence of traditional code review processes
within their teams. As P9 shared: “ [R]ight now, I’m working as a contractor solo. I don’t have the privilege of getting
somebody to do code reviews for me. So I would say in terms of code quality, that really helped me to maintain [...] a
professional coding level. ”
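To make concrete the kind of "simple unit tests" for "smaller functions" that P9 and P15 describe, the following is a minimal sketch using Python's standard unittest module. The clamp function and its three tests are invented for illustration and do not come from any participant.

```python
import unittest


def clamp(value, low, high):
    """Hypothetical small function of the kind suited to generated tests."""
    return max(low, min(value, high))


class ClampTests(unittest.TestCase):
    """Formulaic cases: inside the range, below it, and above it."""

    def test_value_inside_range_is_unchanged(self):
        self.assertEqual(clamp(5, 0, 10), 5)

    def test_value_below_range_is_raised_to_low(self):
        self.assertEqual(clamp(-3, 0, 10), 0)

    def test_value_above_range_is_lowered_to_high(self):
        self.assertEqual(clamp(42, 0, 10), 10)


# Run the suite programmatically so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClampTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests of this shape need no knowledge of the surrounding system, which may explain why participants found them a good fit for LLMs while reporting difficulty with larger, system-wide tests.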
Challenges .
•Inapplicable for Code Review: Eight participants indicated that they did not utilize ChatGPT for code reviews
or that it had minimal impact on their code review process. P15, despite being a big proponent of using ChatGPT,
expressed hesitance toward using it for code reviews: “ I don’t use it for code review, mainly because I need to
understand what the code is doing myself [...] It’s better for me to know each step of what it’s trying to do, because I
need to know ... how it’s going to affect the rest of the system. So where it’s like, ‘Oh, we deleted this function chapter.
[It]’s gonna say it’s fine, but in reality, that function’s being used in many places. [...] That type of thing [the LLM]
wouldn’t be able to do. It just doesn’t have the big picture. ” P4 expressed significant concerns regarding using LLMs
for code reviews and preferred human reviewers: “ It would be silly to have [an LLM] replace a human, because
one of the main benefits of doing a code review, at least in the teams that I work on, is transferring the knowledge
amongst the developers. So using [an LLM] for [code reviews] would be a great way [sic] to remove the entire purpose
of the code review. ”
•Struggles with Complex Tests: Two participants noted that LLMs struggled with generating larger or more
system-wide tests. P2 noted that he did not consider using ChatGPT for system-wide testing due to security
considerations. P15 shared further that ChatGPT lacked the context necessary to generate larger tests: “ It gets a
little bit harder if you were talking about [...] an entire web page [...] because it just doesn’t know what people are
supposed to be doing [or] how it should create the test. ”
3.2.5 Debugging, Refactoring, and Documentation .
•Improving Debugging: LLM tools have shown promise in helping individuals debug code [ 34,92]. Ten partici-
pants mentioned having used LLMs to debug their own code. Participants like P5 primarily utilized LLMs as an
immediate debugging tool, relying on it to assist in troubleshooting specific issues with their code, especially in
front-end development; as he shared, “ I use [ChatGPT] as my immediate sort of debugging tool. So I would ask
it particular questions [on a] front-end component that I’m trying to develop, and I use it to try to point me in the
right direction. ” For participants like P6, LLMs were especially useful in understanding bugs and their causes: “ I
always have questions about why something is failing, what something is doing. So [Copilot Chat] does help on a
day-to-day basis with programs. ”
Participants highlighted LLMs’ efficacy in significantly reducing debugging time by providing quick solutions
or guiding them in the right direction. P5 shared his experiences: “ [ChatGPT] helps to dramatically shorten the
whole debugging process. If it doesn’t give you the answer — that is, if it doesn’t give you the answer on the first try,
[...] it helps to put me in some right directions to where I can do some further research or ask it more questions. ”
•Reducing Syntax-Based Errors: Two participants emphasized LLMs’ utility in resolving syntax-related errors,
such as missing punctuation or braces. P14 stated that, “ [W]henever I’m missing a comma somewhere or a brace.
[...] I just paste it into ChatGPT and say, ‘Fix the syntax mistake. ’ ” He then shared how ChatGPT quickly identifies
and fixes such errors, saving him from spending hours debugging simple syntax mistakes.
•Performing Refactoring: We define refactoring as transforming the code in such a way that its functionality and
behavior are preserved while its maintainability or comprehensibility improves [ 29]. Two participants shared
that they used LLMs for refactoring purposes, aiming to enhance maintainability and comprehensibility while
preserving functionality. P8 shared his process: “ I have a method, a class, or something that I want to verify, to
read, and I’m going to get an idea of how I can do that — refactor in a better way. [...] [if] I agree with the refactoring
that [ChatGPT] gave to me, I’ll paste it in my codebase. Otherwise, I try to only get the idea and implement it myself. ”
Additionally, one participant utilized LLMs to condense code, particularly by converting code blocks into more
concise versions. P7 noted that he did so, sharing an example prompt and ChatGPT response: “ ’This is the code I
[am] writing for a for loop. Can you just convert it to the stream version?’ And then [ChatGPT] just gave me this one
line of code. So I feel this is pretty useful. ”
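The condensing refactor P7 describes (a for loop rewritten as a one-line stream expression) has a close analogue in Python, where an explicit loop can be replaced by a list comprehension. The sketch below is an assumption-laden illustration rather than P7's actual code: both function names and the sample data are hypothetical, and the final assertion reflects the definition of refactoring above, namely that behavior is preserved.

```python
def squares_of_evens_loop(numbers):
    """Original form: explicit loop with an accumulator."""
    result = []
    for n in numbers:
        if n % 2 == 0:
            result.append(n * n)
    return result


def squares_of_evens_concise(numbers):
    """Condensed form: a single comprehension with the same behavior."""
    return [n * n for n in numbers if n % 2 == 0]


# A refactoring is only acceptable if both versions agree on the same input.
sample = [1, 2, 3, 4, 5, 6]
assert squares_of_evens_loop(sample) == squares_of_evens_concise(sample)
print(squares_of_evens_concise(sample))
```

Comparing the two versions on shared inputs mirrors P8's practice of pasting a refactoring only when he agrees with it.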
Challenges .
•Debugging Difficulties: One participant, P9, shared a negative experience in which he used ChatGPT
to debug a piece of code he had found online. He emphasized that his lack of experience
with the code left him unable to properly evaluate ChatGPT’s output, leaving him
frustrated: “ I was using ChatGPT to help me debug and help me revise it. But since I don’t understand the code
perfectly, and relying too much on ChatGPT at that point, it was giving me the incorrect prompt, which I didn’t
know until two or three hours later [...] I thought, ‘There’s something wrong with my input, ’ and but it turns out
it’s not. It’s actually ChatGPT — it was mixing up Python’s syntax in there. It was actually using syntax from all
languages, but that almost looked like Python. ”
Fig. 4. Findings related to RQ3, emphasizing the pros and cons of LLMs with regard to their code-related output.
3.3 RQ3: How has the use of LLMs influenced the software products created?
The key themes that emerged in understanding how LLMs affect the artifacts (i.e., code and software products) are as
follows:
3.3.1 Quality and Complexity of Generated Code .
•Producing Clean, Readable Code: In Section 3.1.2, we presented how some participants used LLMs to keep their
code clean and neat, incorporate ideal design patterns, and understand industry best practices. Six participants
shared that they found LLMs’ generated code to be clean, readable, and systematic. They likened its readability to
that of standard Stack Overflow answers or official documentation for programming languages. Five participants,
such as P4, appreciated the ease of understanding the syntax, which facilitated quick comprehension of the code:
“[ChatGPT is] generally pretty good about not generating difficult to read code or overly complex [code]. And just
asking it to improve the code in terms of readability generally works for small snippets. I’ve had good success with
that.”
•Generating Code with Reasonable Complexity: Five participants stated that the code actually had fair to
good complexity for the tasks that they were working on. They acknowledged that while LLMs may not always
produce the most efficient solutions, the complexity generally aligned well with their needs.
Particularly for common tasks or well-known problems, participants generally found the complexity to be
sufficient. P11 stated that he found ChatGPT to have good complexity for more common tasks, but not for more
novel tasks: “ Every now and then I’ll put in a code challenge, or an interview kind of problem, and [...] I think I
think it does better on those, because other people have posted about those kinds of problems. And then you can ask
it, ‘Oh, can you do this more efficiently? Can you do whatever?’ And it’ll talk to you and explain to you. But again,
like I said, my suspicion is that it’s only because those are things that people talk about more. If I try to give it more
unique data structure problems from my actual experience, it doesn’t do as well. ”
•Conducting Complexity Analysis: Four participants utilized LLMs as tools for checking the time and space
complexity of code snippets and explaining the benefits of different approaches. This served as a learning
opportunity, particularly for participants who were newer to computing. P16 shared that she used ChatGPT
to learn about code complexity, “ [S]ometimes, if I don’t know, I’m just like [to ChatGPT], ‘Oh, what’s the time
complexity, and the space complexity of this problem? And explain why and explain the benefits of using this method
versus that method by its complexity. And [ChatGPT] does that sometimes. It’s like, ‘This saves space, but this saves
time, and you can like solve this problem [...]’ ”
Challenges .
•Struggling with Up-to-Date Information: Two participants noted that ChatGPT’s generated code sometimes
lacked up-to-date information, particularly when dealing with rapidly evolving technologies or APIs. This
resulted in outdated or irrelevant suggestions, as P6 shared: “ Sometimes it doesn’t work out well, in the sense that
it just gives you suggestions that are a little outdated about the use of a specific API. ”
•Over-Engineered, Complex Code: Five participants raised concerns about LLMs over-engineering solutions,
adding unnecessary complexity and inefficiency to simple problems. They highlighted instances where LLMs
introduced extraneous modules or convoluted solutions. For instance, as P8 complained, “ One complaint that I
have about ChatGPT’s output is over-engineering. [...] In some of cases, I’m feeling that sometimes that I’m asking
it, ‘Okay, write a simple method for, I don’t know, multiplying two numbers together. ’ [...] Sometimes it does the
over-engineering for those cases. ” P3 additionally shared this sentiment: “ It can build some things, but you know,
even that [...] sometimes becomes more complicated than just coding it yourself, and you can’t create complex
functionality. ” For more novel or unique challenges, some participants (like P9 and P11) observed limitations in
ChatGPT’s ability to produce code with optimal complexity.
3.3.2 Optimal Use Cases .
•Excelling at Small Tasks: Nine participants identified LLMs’ strength in handling small tasks, particularly
those that involve routine or standard procedures like the boilerplate code referenced in Section 3.1.1. They
found it most effective for tasks that could be decomposed to a low level, such as writing small code snippets
or implementing basic functionalities. For ChatGPT, P4 shared that “ I’ve kind of isolated its use cases to helping
me improve code at the function level, like small snippets of code, and helping me go from zero to something ”; this
sentiment is connected to his relationship with frequently discarding code as part of an iterative process, as
presented in Section 3.2.3.
Challenges .
•Better Text than Code Generations: Two participants observed that LLMs seemed to perform better at
generating textual content than code, possibly due to their training data composition [ 103]. P8 shared that, in his
experience, “ [ChatGPT] works better for text, not the codebase. ”
3.3.3 Security .
•Providing Sufficient Security: Thirteen participants expressed that they considered LLMs’ code to be sufficiently
secure for their purposes, primarily because they were not using it in production environments or for critical
applications. It was also mentioned that LLMs’ code is as secure as any code publicly available. Seven participants
shared that this was because of their specific use case, like P2, who noted that “ I’ve never had to use it for any
security aspects. ” P14 additionally used a VPN, and P4 and P15 shared that they believed their reading of the
generated code was another layer of added protection against security concerns (P4: “ It’s not like I’m not reading
the code. So if it’s trying to like, wipe my hard drive, or leak customer data — There’s no scenario where that could
happen. ”)
Challenges .
•Concerns Sending Data to LLMs: Eight participants shared that they specifically did not provide LLMs with
identifying, confidential, or otherwise proprietary information in their prompts. P3 expressed concerns over the
lack of “ trust and safety controls. ” A few participants noted concerns about their data being used by companies
like OpenAI. P4 stated, “ There is, of course, concern with what code am I sending to OpenAI. ” P12 shared related
concerns: “ If those interactions with Bard or ChatGPT get logged and used for training data in the future, it could be
that those models start outputting production code or like, our own internal code. And that’s something we want to
avoid. ” It is worth noting that, since April 2023, ChatGPT has included an option to prevent a user’s queries from
being used to train or improve the model [ 20], a feature that both P14 and P15 said they used to
reduce their security concerns. At present, Google Gemini additionally allows users some control over
how their data is used by Google [62], and Copilot allows users to opt out of sharing their data [61].
•Concerns using Data from LLMs: Eight participants emphasized developers’ responsibility to ensure code
correctness, especially regarding the possibility of copying and using code without review. Two participants
explicitly stated that LLM code should not be copied without review due to security concerns; P11 shared that, “ I
think it’s dangerous, right? Like, if out of our work on a team, somebody would just copy-paste ChatGPT [code], you
know, I’ll probably be annoyed by it. ” P1 further expressed concerns over malicious actors potentially injecting
poisoned code that is later used to train ChatGPT, which could result in the tool generating exploitable
code in the future.
We note that both this and the above developer-identified challenge are clearly informed by applying knowledge
of how LLMs work to determine possible risks of sharing data with LLMs and using LLM outputs in production
code.
3.4 RQ4: How may the software industry and education be affected by LLMs?
The key themes that emerged in understanding how LLMs impact two areas of society—the software industry and CS
education—are as follows:
3.4.1 Industry .
•Comparing LLMs to Existing Entities or Roles: We found nine interviews in which LLMs were fulfilling
roles commonly filled by other people or tools, including as a pair-programmer, assistant or secretary, junior
developer, rubber duck, or simply a tool. For instance, P10, whose colleagues had all been laid off, mentioned,
“I’ve been working solo by myself for a few months now. [ChatGPT] is my junior who [is] trying to help me. ” He
further added, “ I still write [the code] the way I do. [ChatGPT] just has some extra eyes that [are] helping me, to
guide me through [the coding process] or to give me some recommendations overall. ” P5 also mentioned, “ [ChatGPT]
is like the rubber ducky that we would have, except now it produces answers, and it talks to me, and it gives me all of
the advice that I would need. ” These results suggest that developers are in the process of figuring out the future of
their careers and how LLMs may ultimately fit in.
•Minimal Impact on the Job Market: Nine participants expressed that jobs in the software development field
would remain largely unaffected. They argued that software development involves more than just code writing,
and LLMs lack the capability to fully substitute human developers. As P3 noted, “ I don’t really see it replacing
people because I think if you’re at the point, at least with the current version of it, that it could do your job, you
probably weren’t as good of a programmer in the first place. ”
Fig. 5. Findings related to RQ4, split horizontally by findings on industry (above) and CS education (below) and vertically by
opportunities (left) and challenges (right).
•Widespread Use Among Peers: Four participants observed widespread use of ChatGPT among their peers. P7,
for example, pointed out, “ [...] maybe 90%, maybe 80% of my team started using ChatGPT from January or February
[2023]. Maybe because we’re a tech company, people know this quicker than the others. And now I think 100% of
people are starting to use it, just depending on the use case. ” This suggests a growing acceptance and adoption of
LLMs within professional environments, particularly within tech companies.
•Lowering Entry Barriers: Three participants mentioned LLMs could lower the barrier for entry-level positions
by providing assistance and resources. P6 shared his hopes for LLMs being able to answer novice programmers’
questions: “ [ChatGPT] has lowered the barrier to entry in terms of coding. So you can really go and ask and stuff,
and depending on what your level of expertise is or what your level of questioning is, it will provide you with the
level of answers. So you don’t have to rely on books or universities and then going through the [usual] courses. Not
that there is anything wrong with them, but it’s a different way of approaching programming. ”
Challenges .
•Changing the General Job Market: Nine participants viewed LLMs as technologies that would change and
re-purpose jobs in general. Although they acknowledged LLMs’ capacity to diminish some jobs, they noted their
potential to create new job opportunities as well. For example, P10 noted, “ [ChatGPT] is a very powerful tool, and
it’s going to kill a lot of wild white-collar jobs for sure, but more jobs will be created. [The issue] is just how fast that
those jobs will be created and whether the people that lost their job will get trained. ”
•Absence and Necessity of Guidelines: Eight participants emphasized the importance of establishing guidelines
for LLM use, especially for larger companies or those dealing with security-sensitive information. They
highlighted the need for clear rules to ensure the secure usage of LLMs. P9, for example, noted, “ I think 100%
[that companies should have guidelines], especially when you’re dealing with things that are sensitive; for example,
financial institution or medical [data]. And I actually do think that instead of them trying to use third-party LLMs
from, for example, OpenAI, they should start building their own models and have them housed within their own PVC
cloud so that it’s secure enough for their standard instead of trying to buy this off-shelf. ” P7 highlighted the utility
of more specific guidelines, noting that some companies should provide prompt templates. As he noted, “ I think
that would be good [to have guidelines] because I’ve heard there are some question templates to let ChatGPT give us
what we want. ”
Of these eight participants, five noted that their companies lacked specific guidelines for LLM use. However,
they noted that existing rules, such as not sharing sensitive information, apply to using LLMs. P5 shared, “ We’re
a pretty small company, so we don’t have too [many] regulations against it, and a lot of people work really like
me and the guy that I’m working under. He uses it fairly heavily as well. So, outside of making sure that we don’t
provide any sensitive data or information to ChatGPT, we’re pretty much just okay to use it .” Similarly, P2 stated,
“We had these guidelines even before ChatGPT; the guidelines were that we are not allowed to share any client-specific
information with anyone. ”
•Lowering the Need for Certain Roles: Five participants highlighted LLMs’ potential to decrease the number
of developers required for certain tasks and roles, such as user interface professionals or data scientists. P8
mentioned, “ I’m not sure [if] it’s changed a lot of things in software engineering, but in the other fields, like data
science, I’m seeing a [change] because, in those fields, there are people who don’t have a lot of deep knowledge about
computer science. [...] For example, they have a piece of Python code, and they want to do some logic. They have a
matrix, and they want to transpose it. [So] they can use ChatGPT very easily, and it does that for them. ”
•Undermining Entry-Level Positions: LLMs’ potential to undermine the demand for entry-level roles emerged
three times in our study. P6, for instance, mentioned, “ I do find the level of code generated is sometimes almost as
good as a junior software developer. So I think that really up the bar of hiring for junior software developers. When
I graduated, ChatGPT wasn’t around, and I probably didn’t have to expect to know that much. ” P15 also stated
that newer developers should broaden their skill set in order to stay ahead of technological advancements and
mitigate their negative impacts: “ As long as you have a bigger picture of things, and you’re able to engineer things
more thoroughly, you can beat the pace of ChatGPT. ”
•Exploiting LLMs in Job Interviews: Two participants raised concerns about candidates using LLMs to pass
interviews without genuine knowledge or skills, so formulaic interview techniques might need to change. P1,
for example, mentioned that people could very easily use ChatGPT to pass, “ without actually knowing how to
pass the interview. I don’t think that necessarily makes them a bad software engineer, but I think it can break this
formulaic interview a little bit. ”
3.4.2 Education .
•Encouraging as Learning Tools: Ten participants shared that CS curricula should integrate LLMs into education
rather than banning them outright. P5 shared that, “ If universities are trying to prepare students for work in the
industry, I would say that exposing the students to how to properly use ChatGPT to create impactful works in
code and program would certainly be beneficial in the professional corporate setting. But, you still need to have
the foundational knowledge of your data structures, algorithms, how to design code databases, SDLC, all of those
different core topics. Those are still needed within the curriculum. ” Participants highlighted LLMs’ potential to
assist students in problem-solving, generating personalized problem sets, and improving prompt engineering
skills.
•Teaching Prompt Engineering: Four participants emphasized the importance of teaching students prompt
engineering. For instance, P13 mentioned, “ I think students should be introduced to these tools and how they can
prompt better. For instance, one of my instructors at university, the first thing that they covered in our first semester,
was how to Google. I think this is going to be something along the same lines as how to prompt. ”
•Impracticality of Bans: Three participants viewed the issue from another perspective: the impracticality of
banning LLMs. For instance, P2 stated, “ You cannot run away from it because everyone has access [...] Students are going
to use it, no matter [what]. ” Participants emphasized the potential benefits of LLMs in supplementing learning.
As P16 shared, “ I think degree programs should encourage students to use ChatGPT in the right way. ” Participants
felt this could be particularly important for underprivileged students who may lack access to traditional tutoring
or resources.
Challenges .
•Emphasizing Fundamental Concepts: Six participants stressed that foundational knowledge, theoretical
understanding, and fundamental concepts should be prioritized over the integration of ChatGPT. They believed
that topics like software architecture, algorithm design, and problem-solving skills should take precedence. For
instance, P1 mentioned, “ I think [universities] should be doing more of what they should have been doing in the first
place, which is not necessarily focusing on specific implementations and specific kind of optimal algorithms, but
rather on the bigger, more difficult-to-grasp ideas that have to do with architecting software and that have to do with
thinking about what it takes to get from a business idea to an actual product. ”
Concerns over LLMs being just the latest tool or technology that may become obsolete also appeared. For example,
P6 mentioned, “ One thing remains constant [in CS education] that you are studying and you’re understanding and
you’re learning the programming languages and you’re learning the technology that will be obsolete by the time
you get to the market for a job. I don’t think that the programs should [change based] on what’s the latest and the
greatest, ChatGPT, Anthropic AI, this AI, that AI, or this fancy new programming language. I think that’s the wrong
way to go. I think, in general, the program should be focused on more fundamentals because those largely remain the
same. ”
•Needing to Adapt: Six participants acknowledged the flip side of the impracticality of banning LLM usage,
suggesting ways that computing education programs should adapt to the new reality with LLMs. Four participants
expressed that the ease of plagiarism with LLMs encourages students to use them for ready-made solutions,
introducing a need for change. Two felt cheating was going to happen, with P16 suggesting that cheating is
inherent to certain individuals: “ People who cheat, they’re going to cheat anyway, and that’s gonna get caught up
with them. That’s a character trait. ” P8 mentioned, “ I think [students] don’t write any code in their homework based
on the [access to] ChatGPT because they already have whatever they want. ”
Two participants suggested that instructors mitigate the impact of cheating concerns by redesigning assignments
to ensure they require a genuine understanding of concepts through critical thinking and problem-solving skills.
P13 suggested comparing student work to LLM answers: “ The professors would have to run it through GPT to see
how it’s performing, and see what kind of variations of answers it’s generating. ” P2 additionally shared, “ You can
design the assignments in such a way that even if [students] use [ChatGPT], they need to understand the concepts. ”
Two more participants mentioned the importance of preparing students for coding without LLM assistance.
Although P5 was generally a proponent of integrating ChatGPT into the curriculum, he stressed the importance
of independence: “ You need to be able to create code on your own, but also collaborate with others as well. ”
4 Discussions
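As outlined in Section 3, we have highlighted the advantages and challenges of current LLMs across four key dimensions.
Based on these findings, we discuss implications for future software developers, for software engineering tools, and for
ongoing LLM development.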
4.1 Implications for Future Software Developers
4.1.1 Educating future software developers on prompt engineering. Our participants emphasized the importance of
prompt engineering as a key technique for effectively utilizing LLMs. This underscores the need for workshops and
training sessions to focus on prompt engineering, particularly on strategies such as maintaining specificity and brevity,
formulating problems clearly, and considering linguistic nuances for optimal results.
To support this learning, developers can access formal resources, such as OpenAI’s comprehensive guide on prompt
engineering [ 68] or DeepLearning.AI’s free¹⁰ course on prompt engineering practices for developers [ 31]. These
resources provide valuable insights and practical techniques to enhance their expertise in working with LLMs.
4.1.2 Educating on problem decomposition. The ability to break down complex problems into manageable components
is crucial for maximizing the effectiveness of LLMs. Our findings indicate that LLMs perform optimally when presented
with tasks that are clear, concise, and limited in scope. By developing strong problem decomposition skills, developers
can better align their challenges with the capabilities of LLMs, significantly improving the efficiency and accuracy of
their solutions. Mastering this skill empowers developers to leverage LLMs as powerful tools in addressing a wide
range of software development tasks.
4.1.3 Setting realistic expectations for LLM use. Developers must recognize the inherent limitations of LLMs, including
issues such as unreliable responses [ 104], hallucinations, knowledge gaps, and difficulties in maintaining consistent
contextual understanding [ 37]. Although LLM accuracy and effectiveness continue to improve, their performance
remains variable across different domains. Understanding which domains to trust and the extent of reliability is essential
for using these tools effectively.
One practical strategy is to explicitly identify the software engineering tasks for which LLMs excel or underperform,
as outlined in this paper (see Section 3). Additionally, equipping developers with a deeper understanding of how
LLMs function can bridge gaps in expectations and capabilities. Structured courses or workshops tailored to software
developers’ needs could offer a robust foundation, similar to existing workshops designed for educators [82].
These educational initiatives not only enhance developers’ ability to use LLMs effectively but also provide the added
benefit of deepening their technical knowledge, enabling them to adapt to and capitalize on future advancements in
generative AI.
4.1.4 Adopting LLMs as Surrogate Team Members. Freelance developers, or those who primarily work independently,
encounter a distinct set of challenges compared to their counterparts in collaborative environments. Insights from two
study participants (P9 and P10) indicate that LLMs provide specific advantages uniquely suited to solo developers, which
may not be as pronounced for other developer types. Based on our findings, freelance or solo developers can utilize
LLMs as surrogate team members, serving as virtual colleagues, subcontractors, or assistants. LLMs can support these
developers by refining and clarifying requirements, conducting research, discovering information during stakeholder
interactions, and simulating code reviews to uphold professional coding standards. Moreover, LLMs can reduce—though
not entirely eliminate—the need for subcontractors, thereby enhancing the efficiency and self-sufficiency of independent
developers.
¹⁰ At the time of writing.
4.1.5 Encouraging effective use of LLMs for programming. Despite their limitations, LLMs offer considerable advantages
to developers who understand how to harness their capabilities effectively. As indicated by our findings, developers can
use LLMs to enhance various aspects of their programming practices, including: (1) Refactoring Code: Improving the
structure and organization of existing code for maintainability and performance. (2) Learning and Applying Design
Patterns: Gaining insights into widely recognized solutions to common programming challenges. (3) Enhancing Code
Readability: Producing more understandable and clean code, which facilitates collaboration and long-term project
sustainability. (4) Automating Boilerplate and Repetitive Tasks: Quickly generating routine code components to save
time and focus on more complex challenges. These benefits not only streamline workflows but also provide opportunities
for developers to adopt and reinforce best practices. Our findings align with similar research (e.g., [ 18,48,54,76])
highlighting how LLMs can optimize developer productivity and programming quality. By using LLMs strategically,
developers can maximize their efficiency and improve both individual and team outcomes.
4.1.6 Emphasizing secure practices. Developers must remain vigilant about the proprietary and security aspects of
the code they share with LLMs, as emphasized by Wang et al. [ 101]. Inputted data such as code, prompts, and other
personal information may be collected and utilized by companies like OpenAI and Google to improve their services¹¹.
This makes it crucial for developers to ensure compliance with contractual agreements and address licensing issues to
minimize risks when using these tools.
Another critical concern is that LLMs often do not provide the source of the code they generate [23]¹², leaving the
origins and security of the produced code uncertain. This ambiguity poses potential risks, particularly in production
environments or sensitive projects. Developers engaged in research or ideation may perceive fewer security challenges,
but adhering to robust security protocols is essential for all.
Key strategies to mitigate these risks include: (1) Sanitizing Generated Code: Carefully reviewing and cleaning LLM
outputs to prevent vulnerabilities or unintended exposures. (2) Maintaining Data Integrity: Ensuring the confidentiality
and safety of proprietary code and sensitive data. (3) Upholding Security Standards: Consistently applying established
security practices to avoid compromising systems or violating regulations. By proactively addressing these issues,
developers can maintain secure workflows and responsibly integrate LLM tools into their practices.
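As one possible illustration of the data-confidentiality strategy above (not a practice reported by our participants), a developer could pass prompts through a small redaction step before sending them to a hosted LLM. The sketch below is a minimal example under stated assumptions: the pattern names, the placeholder strings, and the `redact` helper are hypothetical, and the two patterns cover only e-mail addresses and simple `key = value` style secrets, so real use would need to follow an organization's own data-classification rules.

```python
import re

# Hypothetical patterns for illustration only; they are deliberately narrow
# and will not catch every kind of sensitive value.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_PATTERN = re.compile(
    r"(?i)\b(api[_-]?key|token|password)\b(\s*[=:]\s*)(['\"]?)[^\s'\"]+\3"
)


def redact(text: str) -> str:
    """Replace e-mail addresses and simple secret assignments with placeholders."""
    text = EMAIL_PATTERN.sub("<EMAIL>", text)
    # Keep the variable name and separator so the prompt stays readable,
    # but drop the secret value (and its surrounding quotes, if any).
    text = SECRET_PATTERN.sub(lambda m: f"{m.group(1)}{m.group(2)}<SECRET>", text)
    return text


if __name__ == "__main__":
    snippet = 'api_key = "sk-12345"  # contact bob@example.com'
    print(redact(snippet))
```

A step like this reduces, but does not remove, the risk of leaking proprietary data; manual review of what is shared remains necessary.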
4.2 Software Engineering Tools and Designs for Developers
4.2.1 Educating developers on the benefits of LLM-powered IDE extensions. Integrating LLMs into widely used IDEs
such as Visual Studio Code (VS Code) can significantly optimize development workflows and reduce friction. Extensions
like GitHub Copilot, Amazon Q, and IBM watsonx are already helping developers streamline different tasks such as
debugging, code generation, and documentation writing. Some participants in this study mentioned the challenges of
translating LLM results into their own context, highlighting the benefits of a more streamlined user experience.
¹¹ At the time of writing, both OpenAI [ 63] and Gemini [ 62] state that they collect and use personal data, including prompts and log data, as the default
setting to improve their model and services.
¹² Pre-published via arXiv.
4.2.2 Tailoring answers for specific industries. Developers may benefit from LLM tools that provide tailored responses
based on organization-specific or proprietary documents and data. These tools could be trained on or fine-tuned with
project documentation to assist with legacy or brownfield development. Alternatively, they could use advanced AI
techniques, such as retrieval-augmented generation (RAG) [ 47], which would dynamically incorporate relevant artifacts
like project documents, code repositories, and communication threads into the generative process to deliver more
contextually aware answers.
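To make the retrieval-augmented idea concrete, the following minimal sketch ranks a few project documents against a developer's question and places the best matches into the prompt. It is an illustration under stated assumptions, not the approach of [ 47] or of any specific tool: the document names and contents in `PROJECT_DOCS` are hypothetical, similarity is a simple bag-of-words cosine score rather than learned embeddings, and the final call to an LLM is left out.

```python
import math
import re
from collections import Counter

# Hypothetical project artifacts; a real system would index actual
# documentation, code, and communication threads.
PROJECT_DOCS = {
    "billing.md": "The billing service retries failed payment requests three times.",
    "auth.md": "The auth service issues session tokens that expire after one hour.",
    "deploy.md": "Deployments run through the staging cluster before production.",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: dict[str, str], k: int = 2) -> list[str]:
    """Return the names of the k documents most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda name: cosine(query_vec, Counter(tokenize(documents[name]))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, documents: dict[str, str], k: int = 2) -> str:
    """Assemble a prompt that grounds the question in the retrieved documents."""
    context = "\n".join(f"[{name}] {documents[name]}" for name in retrieve(query, documents, k))
    return (
        "Answer using only the project context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

For example, `build_prompt("How long do session tokens last?", PROJECT_DOCS, k=1)` would include only the `auth.md` entry as context, since it is the only document sharing terms with the question.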
4.2.3 Tailoring answers for specific individuals. As a developer works over time with an LLM, contextual grounding
methods can improve the models’ understanding of developers’ specific needs. Persistent context memory, for instance,
could allow models to “remember” quirks and details about the user and retain them as context for future responses.
For instance, an LLM could detect that a developer prioritizes attributes like performance over readability; as another
example, an LLM could identify areas in which a developer frequently asks for explanations and proactively provide
more detail in those areas.
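A very simple form of persistent context memory can be sketched as a small store of developer preferences that is saved between sessions and prepended to each new prompt. The class below is a hypothetical illustration, assuming preferences are short free-text notes kept in a local JSON file; the `PreferenceMemory` name, its methods, and the storage format are our own assumptions, and how such notes would be detected automatically is outside the scope of the sketch.

```python
import json
from pathlib import Path


class PreferenceMemory:
    """Stores short notes about a developer and adds them to future prompts."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        # Load notes saved by an earlier session, if the file already exists.
        self.notes: list[str] = (
            json.loads(self.path.read_text()) if self.path.exists() else []
        )

    def remember(self, note: str) -> None:
        """Record a new preference and persist it to disk."""
        if note not in self.notes:
            self.notes.append(note)
            self.path.write_text(json.dumps(self.notes))

    def wrap(self, prompt: str) -> str:
        """Prefix the prompt with all remembered preferences."""
        if not self.notes:
            return prompt
        preferences = "\n".join(f"- {note}" for note in self.notes)
        return f"Known developer preferences:\n{preferences}\n\n{prompt}"
```

With this design, a note such as "Prefers performance over readability" recorded in one session would be included automatically in prompts issued in later sessions.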
4.3 Developments in LLMs and Ongoing Limitations
Since the interviews were conducted in the spring and summer of 2023, the landscape of LLMs has notably evolved. The
popular GPT-3.5 model, which was the most common model among our participants, has been replaced by GPT-4o
for OpenAI’s free-tier users. For premium subscribers, more advanced models, including GPT-4o,
OpenAI o1, and OpenAI o1-mini, are accessible. Additionally, GitHub Copilot, the coding assistant used by two of our
participants, is now also powered by GPT-4o. Table 3 provides an overview of the OpenAI and Google models’ features
and performance.¹³ OpenAI and Google were selected because they developed the models discussed in this paper. Other state-of-the-art
models, such as Llama by Meta and Claude by Anthropic, also demonstrate significant advancements but are outside
the scope of this discussion.
Recent LLM developments address a few of the limitations (see the blue bullet points in Figure 1) found during the
interviews. For example, larger context windows now enable improved summarization, and response latency has been
reduced in models like GPT-4o compared to GPT-4. Issues such as hallucinations, contradictory answers, mixing up
programming languages, and struggles with unstructured data have been mitigated to some extent through enhanced
reasoning abilities and larger context windows.
Despite these improvements, many challenges persist (see Challenges under different subsections in Section
3). For example, real-time browsing has been integrated into the chat interface of both GPT-4o and Gemini 1.5 to
address outdated responses and lack of references. However, outputs still rely heavily on static training data, which makes them
difficult to trace to a source. This is especially problematic in rapidly evolving areas like programming libraries, where
outdated information harms accuracy.
Beyond existing and future advancements, there are two considerations that are worth noting:
(1) While flagship models like GPT-4o and o1 have addressed some limitations, smaller or locally deployable models
(e.g., Llama 3.2 1B) that appeal to those prioritizing privacy or cost-efficiency continue to face challenges such as
constrained context windows and reasoning capabilities.
(2) Many limitations discussed in this paper are systemic and intrinsic to LLMs, such as their potential to impede
critical learning processes if misused or their inability to replicate nuanced human interactions.
¹³ Two metrics, Quality Index and Latency, are based on evaluations from Artificial Analysis [ 1], an independent team focused on benchmarking and
evaluating AI models. These metrics may not reflect the official evaluations of corresponding companies.
| Developer | Model | Context Window | Max Output Tokens | Knowledge Cut-Off | Quality Index [1] | Latency (s) [1] | Developer’s Description of Model |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI | GPT-4o | 128k tokens | 128k tokens | Oct 2023 | 77 | 0.43 | High-intelligence flagship model for complex, multi-step tasks. GPT-4o is cheaper and faster than GPT-4 Turbo. |
| OpenAI | GPT-4o mini | 128k tokens | 16k tokens | Oct 2023 | 72 | 0.41 | Affordable and intelligent small model for fast, lightweight tasks. GPT-4o mini is cheaper and more capable than GPT-3.5 Turbo. |
| OpenAI | o1 | 128k tokens | 32k tokens | Oct 2023 | 85 | 24.60 | Reasoning model designed to solve hard problems across domains. Trained with reinforcement learning to perform complex reasoning. |
| OpenAI | o1-mini | 128k tokens | 64k tokens | Oct 2023 | 82 | 8.93 | Faster and cheaper reasoning model particularly good at coding, math, and science. Trained with reinforcement learning to perform complex reasoning. |
| OpenAI | GPT-3.5 | 16k tokens | 4k tokens | Sep 2021 | 53 | 0.38 | Understands and generates natural language or code and has been optimized for chat but works well for non-chat tasks as well. |
| Google | Gemini 1.5 Pro | 2M tokens | 8k tokens | Sep 2024 | 80 | 1.06 | Complex reasoning tasks requiring more intelligence. |
| Google | Gemini 1.5 Flash | 1M tokens | 8k tokens | Sep 2024 | 68 | 0.24 | Fast and versatile performance across a diverse variety of tasks. |
Table 3. Current Status of Language Models Referenced in the Paper. Quality Index represents the average performance across various
evaluations of model intelligence, including benchmarks like MMLU, GPQA, and HumanEval. Latency denotes the time to the first
token received after an API request, measured in seconds.
Hence, informed deployment of LLM technologies is essential to leverage their benefits while minimizing any
potential harm.
5 Related Works
This paper examines the role of LLMs in software engineering, focusing on their impact on developers, the development
life cycle, products developed, and societal implications. Due to the emerging nature of research in this field, we include
some non-peer-reviewed studies and provide transparency through footnotes, acknowledging that, as noted by Fan et
al. [26], formal literature surveys can no longer capture all relevant work, making our review not exhaustive.
The most notable work on LLMs in the Software Engineering (SE) domain is a systematic review by Hou et al., which
analyzed 395 papers on their impact, focusing on optimization techniques, applications, and potential use cases [ 36]. In
contrast, our study provides a qualitative analysis based on interviews with full-time developers.
Several interview-based studies have explored developers’ use of generative AI. Klemmer et al. found that while AI
assistants like Copilot and ChatGPT were widely used for security-critical tasks, developers lacked trust in these tools
and double-checked their work [ 42]. Similarly, Rabani et al. reported that developers noted ChatGPT’s inaccuracy and
the need for debugging [ 77]. Mendes et al. explored developers’ views on intelligent assistants, noting benefits like
faster development and improved code but also challenges like poor accuracy and distractions [ 54]. Empirical studies,
including those by Rasnayaka et al., assessed LLMs’ usefulness in programming projects, finding they help with code
generation and debugging but highlighting a learning curve and no significant difference in software quality between
AI-assisted and non-assisted teams [81].
Our study extends and corroborates these findings by exploring the broader interplay of LLMs across the four
dimensions of our research questions (RQs). Accordingly, the remainder of the related work is structured around these
dimensions.
5.1 People - Software Developers using LLMs
Recent studies, particularly pre-published ones, highlight the growing adoption of LLMs among professional developers.
These studies, mainly empirical, include interviews and thematic analyses of developers’ written responses. Feng et
al. analyzed social media posts and found that developers use ChatGPT for code debugging, interview preparation,
and solving academic assignments, with fear being the predominant emotion related to code generation [ 28]. Süße
et al. identified fourteen coping patterns in a case study of AI-powered chatbots in software development, with one
pattern considering the AI as a virtual colleague [ 94]. Nam et al. found that LLMs, when used within programming
environments, improve task completion by providing contextualized queries, offering a more effective alternative to
simple web searches [65]. Peng et al. discovered that GitHub Copilot helped developers complete tasks 55.8% faster than
a control group [75], whereas Vaithilingam et al. noted that although Copilot did not significantly improve task completion
time, it served as a useful starting point despite challenges in debugging [99]. Kuhail et al. surveyed 99 developers,
finding that ChatGPT improved productivity by helping generate generic code, explain complex code, and find sources
[44]. Siddiq et al.’s study on DevGPT usage also highlighted ChatGPT’s role in helping developers understand libraries
and frameworks, and engage in networking and messaging [88]. Stack Overflow’s 2024 survey found that 63.2% of
professional developers were using AI tools in the development process, while 13.5% planned to adopt them and 23.4%
did not wish to use them [64]. The Stack Overflow 2023 survey highlighted AI usage for tasks
like code writing, debugging, documentation, and testing [59], and the 2024 survey noted that professional developers
believed that AI tools could increase productivity, speed up learning, improve efficiency and code accuracy, and make
workloads more manageable [64].
Further interview studies reveal how developers incorporate LLMs in their daily work. Pinto et al. found that LLMs,
built on GPT-4, reduced repetitive tasks and helped contextualize code, though they had technical limitations like
poor UI and inaccurate suggestions [ 76]. In a 2024 study by Kimbel et al., 17 participants across sectors reported using
ChatGPT for content generation, information retrieval, brainstorming, and programming but noted challenges with
accuracy [41]¹⁴. Coutinho et al. found that AI tools helped software professionals save and organize their time, but
with reliability issues in generated content [18].
While these studies provide valuable insights, they fall short in addressing critical gaps. Most focus narrowly on
either professional or novice developers, specific tasks, or quantitative metrics like task completion time. To date,
little qualitative research has comprehensively examined the experiences of professional developers across varying
experience levels, exploring not only the tasks where LLMs excel but also those where they fall short. Our study
addresses this gap by investigating the nuanced ways LLMs influence different development tasks, offering a richer
understanding of their practical utility and limitations.
¹⁴ Pre-published via ResearchGate.
5.2 Processes - Software Development Life Cycle Process
Emerging research into the impact of LLMs, particularly ChatGPT and GitHub Copilot, on the Software Development
Life Cycle (SDLC) has identified significant effects, though more research is needed to fully understand their scope
and quality. Prior to ChatGPT, earlier models like BERT were employed for specific coding tasks, such as vulnerability
detection with 99.3% accuracy [ 4], while Text-to-Text Transfer Transformers demonstrated promise for code completion
[17]. GitHub Copilot shifted the focus from code writing to code evaluation, with studies noting that developers often
spent more time evaluating AI-generated suggestions than writing code itself [ 12]. While Copilot provides significant
value for experts, it poses risks for novices, who may struggle to identify or correct buggy or non-optimal code [57].
Recent studies have explored ChatGPT and GPT models across various SDLC phases. For instance, ChatGPT
has proven useful in debugging [ 93] and generating software requirements, though the outputs are typically less
detailed than those produced by humans [ 10]. It has also been employed in refining requirements [ 2]. Krishna et al.
demonstrated that advanced models like CodeLlama and GPT-4 could generate Software Requirement Specifications
(SRS) at a level comparable to an entry-level engineer [43]¹⁵. While LLMs have shown potential to assist with planning,
design, implementation, and testing, they still require human supervision, particularly during the coding phase [ 74].
Furthermore, Sridhara et al. found that while LLMs excel at tasks like refactoring, they struggle with more nuanced
activities such as code reviews and vulnerability detection [91]¹⁶.
Studies also highlight LLM contributions to software testing. Gu [ 32] and Tang [ 97] found that LLMs could outperform
traditional tools in test coverage, but challenges like prompt design and accuracy persist. Frameworks for evaluating
LLM-generated code have been developed, including Yeo et al.’s work on prompt engineering [106], Liu et al.’s framework
for error identification [49], and Hou et al.’s analysis of metrics like MRR, BLEU, and ROUGE to assess LLM performance
in software tasks [36].
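Of the metrics named above, MRR (mean reciprocal rank) is the simplest to state: for each query, take the reciprocal of the rank at which the first relevant candidate appears, then average over all queries. The Python sketch below illustrates that standard definition only. The function names and the toy ranked lists are hypothetical and are not taken from Hou et al. [36] or from any evaluated system.

```python
from typing import Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant candidate, or 0.0 if none appears."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    queries: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """Average the reciprocal rank over all (ranked list, relevant set) pairs."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical data: ranked candidates returned for three queries.
toy_queries = [
    (["a", "b", "c"], {"a"}),  # first relevant item at rank 1 -> 1.0
    (["x", "y", "z"], {"y"}),  # first relevant item at rank 2 -> 0.5
    (["p", "q", "r"], {"s"}),  # no relevant item returned     -> 0.0
]
mrr = mean_reciprocal_rank(toy_queries)
print(f"MRR = {mrr:.3f}")  # prints "MRR = 0.500"
```

BLEU and ROUGE, by contrast, score n-gram overlap between generated text and a reference, and are usually computed with an established library rather than by hand.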
Finally, the advent of LLM-based agents marks a shift toward a more AI-enhanced SDLC. Jin et al. demonstrated that
LLM-based agents, equipped with capabilities like autonomous reasoning and tool usage, handle complex tasks more
efficiently than traditional LLMs [40]¹⁷. This shift signals a transformation in software engineering workflows, where
AI roles extend from copilots to supervisors, driving new paradigms in the SDLC process [71, 72].
While prior research has provided valuable insights, it remains fragmented and often focuses narrowly on specific
SDLC phases or tasks. Studies rarely examine the end-to-end impact of LLMs on all SDLC activities or provide
comprehensive guidance for integrating LLMs into these processes. Our research fills this gap by systematically
investigating developers’ experiences using LLMs across the entire SDLC. We uniquely organize our findings according
to the SDLC steps and make actionable recommendations, identifying tasks where LLMs excel and those where they
under-perform. This comprehensive approach advances understanding of LLMs’ role in software development and
offers practical strategies for their effective use.
¹⁵ Pre-published via arXiv.
¹⁶ Pre-published via arXiv.
¹⁷ Pre-published via arXiv.
5.3 Products - Artifacts
The use of LLMs in code generation has led to growing interest in understanding the quality of generated code. Studies
have focused on readability [ 19], complexity [ 86], correctness [ 49], and security [ 33]. For example, Nascimento et
al. showed that ChatGPT outperforms beginner programmers but not experts [ 66], while Fan et al. identified shared
common mistakes between human-written and generated code [ 27]. Stack Overflow’s 2024 Survey found that almost
half of professional developers believed that AI tools were bad at handling complex tasks; additionally, developers were
split on the trustworthiness of AI tools, with newer developers trusting AI accuracy more than professionals [64].
Despite significant advances, much of the existing research emphasizes objective evaluations of code quality, often
relying on predefined metrics or benchmarks. These studies rarely delve into developers’ subjective perceptions of
the quality and trustworthiness of artifacts generated by LLMs, particularly in professional settings where these tools
are actively integrated into workflows. To the best of our knowledge, limited research has explored how professional
developers assess and trust the quality of artifacts produced by LLMs. Our work addresses this gap by investigating
developers’ perceptions, their confidence in these outputs, and the factors influencing their trust in LLM-generated
artifacts. This perspective is critical to understanding the practical integration of LLMs into professional software
development and guiding improvements in LLM capabilities.
5.4 Society: LLMs in Industry and Education
5.4.1 Software Industry. LLMs are poised to significantly impact various sectors, including software development, due
to their widespread adoption and diverse capabilities [ 8,24]. A 2018 National Bureau of Economic Research report by
Bessen suggests AI could reshape the labor market by replacing, shifting, or creating jobs depending on demand [ 11].
With LLMs capable of generating code, concerns have emerged about their potential to assist or replace developers.
Carleton et al.’s 2021 report advocates for AI’s collaboration with software engineers to enhance productivity and
reliability [ 13]. Kuhail et al. found that over two-thirds of developers surveyed did not foresee an immediate job security
threat from AI, though many recognized a partial risk [ 44]. Rashid speculated that AI could both replace some roles
and open new opportunities, helping developers become more efficient [ 80]. Demirci et al. observed a 21% decrease in
job postings related to coding after ChatGPT’s introduction, while jobs requiring manual labor saw less impact [ 21].
Additionally, research by Winter et al. showed developers prefer working alongside tools rather than having their tasks
entirely replaced, suggesting that AI could support, but not fully replace, developers’ work [102].
As companies increasingly release policies on generative AI use, a 2024 interview study of 17 professionals revealed
that many still lack formal guidelines for integrating AI tools like ChatGPT into their workflows [41].
Despite these advances, much of the existing research focuses on broad market trends, theoretical implications, or
objective measures of AI’s impact. Few studies have directly interviewed developers to understand their perspectives on
how LLMs are shaping their industry. Our research addresses this gap by exploring developers’ nuanced views on the
opportunities and challenges of integrating LLMs into professional workflows. By centering the voices of developers,
we provide actionable insights into the practical and ethical considerations of adopting LLMs at scale.
5.4.2 Education. Research on the integration of LLMs into education, particularly in computing, has gained attention.
Tanay et al. (2024) observed that upper-level computing students using LLMs in software engineering projects improved
efficiency in obtaining information and completing tasks, though concerns were raised about potential negative
impacts on learning outcomes [ 95]. Essel et al. (2024) found that ChatGPT enhanced undergraduate students’ critical,
reflective, and creative thinking skills [ 25]. This has sparked discussions on adapting the computer science curriculum
to incorporate LLMs. Ozkaya proposed that students should learn to collaborate with LLMs and AI applications,
especially for legacy systems [ 73]. Additionally, Jeuring, Groot, and Keuning highlighted the strong correlation between
computational thinking (CT) skills and effective co-development with ChatGPT, suggesting the continued relevance of
these skills for both students and future professionals [39].
While these studies offer insights into how LLMs may enhance student learning, there is limited research on
developers’ perspectives regarding the role of LLMs in computing education. Our study addresses this gap by engaging
developers in a discussion about how LLMs could reshape education in the field. By focusing on their experiences and
views, we provide valuable insights into how LLMs can influence both the learning and teaching of computing skills in
professional settings.
6 Threats to validity
As with any qualitative study, our research faces several threats to validity, including:
6.1 Internal Validity:
The study’s timeline coincided with frequent updates to tools such as GPT-3.5 and Bard (now Gemini), creating a
significant internal validity threat. The rapid evolution of these tools may have influenced participants’ experiences
in uncontrolled ways. As participants interacted with different versions of the tools, they might have encountered
different performance levels, which could skew the results. Consequently, findings related to GPT-3.5 or earlier versions
of Bard may not accurately reflect the current version of these tools, especially with the emergence of newer iterations
like GPT-4o and OpenAI o1.
Additionally, the absence of pretests to assess participants’ baseline knowledge of how they used LLMs may impact
the study’s internal validity. Although post-survey questions regarding industry experience partially mitigated this issue,
the lack of a uniform measure of skill at the beginning of the study means that differences in participants’ performance
could be attributed to their varying levels of expertise rather than the tools themselves.
6.2 External Validity:
The geographic limitation of participants, with the majority based in the US, poses a significant threat to external validity.
This limitation restricts the generalizability of the findings, as developers in other regions or cultural contexts may
encounter different experiences and challenges. Factors such as localized programming practices, language preferences,
and access to specific LLM features can all influence how these tools perform in non-US settings.
Moreover, the lack of diversity in the participant pool, predominantly composed of White or Asian males with
over 3 years of experience, further exacerbates the threat to external validity. The experiences and challenges faced
by underrepresented groups—including women, non-binary individuals, developers with disabilities, and those from
various racial, ethnic, and experience backgrounds—may differ significantly from those of the current sample. Although
our study was announced openly on LinkedIn, we recognize the need to implement additional strategies in future
research to ensure a more diverse and representative participant pool.
6.3 Construct Validity:
The introduction of newer models, such as GPT-4o, along with the resolution of issues related to GPT-3.5 and Bard,
poses a significant threat to the temporal relevance of the findings. As LLM tools continue to evolve, the capabilities
and limitations of earlier versions may no longer accurately represent the current state of technology. Consequently,
some results from the study could become outdated, potentially undermining the applicability and relevance of the
findings to contemporary tool usage. This emphasizes the necessity of considering the dynamic nature of LLM tools
when interpreting the study’s conclusions. Additionally, the limited number of participants using tools like Google
Bard and GitHub Copilot Chat presents a construct validity threat. The conclusions drawn about these tools may lack
robustness and comprehensiveness due to insufficient data.
6.4 Conclusion Validity:
The relatively small sample size of 16 participants and 16 hours of interviews introduces a threat to conclusion validity.
This limitation restricts the statistical power to draw meaningful generalizations from the findings. Additionally,
the small dataset heightens the risk of random variations influencing the results, potentially leading to unreliable
conclusions. As a result, the study’s ability to make robust claims about the broader population of developers and their
experiences with LLM tools may be compromised.
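To make the concern about random variation concrete, consider how wide the uncertainty is around any proportion observed among sixteen participants. The sketch below computes a 95% Wilson score interval for a binomial proportion. The count of 8 out of 16 is a hypothetical illustration, not a result from this study, and the helper name `wilson_interval` is likewise an assumption.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical: 8 of 16 participants report some behavior.
low, high = wilson_interval(8, 16)
print(f"95% interval: {low:.2f} to {high:.2f}")
```

Under these assumptions the interval spans roughly 0.28 to 0.72, so an observed 50% among sixteen participants is compatible with anything from about a quarter to nearly three quarters of the wider population.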
7 Conclusion
It has been clear since LLMs’ inception that they would impact software development in different ways. Our study
aimed to examine the effects of LLMs on software developers, their processes, products, and society at large. Through
sixteen interviews with early-adopter developers, we explored their self-reported day-to-day activities, perceptions,
and experiences with LLMs. In our qualitative analysis of their responses, we found that:
• RQ1: People: LLMs provide developers with numerous benefits, including enhanced productivity, improved
efficiency, time savings, streamlined searching, access to templates, and accelerated learning. However, developers
also face challenges, such as occasionally unreliable LLM responses.
• RQ2: Processes: In the SDLC, LLMs showed minimal impacts on gathering requirements, planning, and
refactoring. However, they had mostly positive impacts on ideation, test generation, debugging, and documentation.
Developers used various strategies for prompt engineering and evaluating LLM-generated code, such as entering
vague prompts or conducting mental checks.
• RQ3: Product: LLMs generate readable code and are effective for simple tasks, but they exhibit varying quality
across different questions and encounter difficulties with complex tasks.
• RQ4: Society: There is a need for formal, proactive guidelines for software developers on the usage of LLMs in
the workplace, particularly to promote the ethical and safe use of generative artificial intelligence. Additionally,
there is a predicted shift in entry-level positions due to LLMs, and LLMs are perceived as being likely to alter or
repurpose development-related jobs, rather than eliminating them entirely. Finally, developers hold the ultimate
responsibility for the code they deploy, regardless of its source or the process used to create it.
The sixteen interviewed developers show an advanced understanding of how LLMs work and their data sources.
This understanding influences the decisions that developers make about when, how, and why to use LLMs. Their
insights can be used to craft best practices for LLM use in computing education and workforce settings. In addition,
they demonstrate that LLMs offer, in general, far more opportunities than they do challenges. Despite concerns over
code quality and limitations, our participants deemed LLMs to be sufficiently useful for developers. Consequently, our
findings suggest that LLMs can be a valuable asset in a professional developer’s toolbox. As one participant expressed,
“[ChatGPT] is like calculators being invented. We’re going to ban them for a while, and we’re going to tell [students] no, no,
no, don’t use it. And then, once they graduate, they will use it every day.”
8 Acknowledgments
This material is based upon work supported by the Air Force Office of Scientific Research (AFOSR) under award number
FA9550-21-1-0108 and the National Science Foundation (NSF) under award numbers IIS-2313890, CCF-2006977, and IIS-1917885.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do
not necessarily reflect the views of the AFOSR or NSF.
References
[1][n. d.]. LLM Performance Leaderboard - a Hugging Face Space by ArtificialAnalysis. https://huggingface.co/spaces/ArtificialAnalysis/LLM-
Performance-Leaderboard
[2]Aakash Ahmad, Muhammad Waseem, Peng Liang, Mahdi Fahmideh, Mst Shamima Aktar, and Tommi Mikkonen. 2023. Towards human-bot
collaborative software architecting with chatgpt. In Proceedings of the 27th International Conference on Evaluation and Assessment in Software
Engineering . 279–285.
[3]Hussam Alkaissi and Samy I. McFarlane. 2023. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus 15, 2 (Feb. 2023).
https://doi.org/10.7759/cureus.35179 Publisher: Cureus.
[4] Mansour Alqarni and Akramul Azim. 2022. Low level source code vulnerability detection using advanced bert language model. In Proceedings of
the Canadian Conference on Artificial Intelligence (May 27, 2022). https://caiac.pubpub.org/pub/gdhb8oq4
[5]Ilaria Amaro, Attilio Della Greca, Rita Francese, Genoveffa Tortora, and Cesare Tucci. 2023. AI Unreliable Answers: A Case Study on ChatGPT. In
Artificial Intelligence in HCI . Springer, Cham, 23–40. https://doi.org/10.1007/978-3-031-35894-4_2 ISSN: 1611-3349.
[6] Mattias Andersson and Tom Marshall Olsson. 2023. ChatGPT as a Supporting Tool for System Developers . Ph. D. Dissertation.
[7]Alwin Augustin. 2023. How LLMs Influence Software Engineering and Development. https://www.linkedin.com/pulse/how-llms-influence-
software-engineering-development-alwin-augustin#:~:text=Overall%2C%20LLMs%20have%20the%20potential,efficient%2C%20effective%2C%
20and%20innovative.
[8]Ömer Aydin and Enis Karaarslan. 2023. Is ChatGPT leading generative AI? What is beyond expectations? Academic Platform Journal of Engineering
and Smart Systems 11, 3 (2023), 118–134.
[9]Lenz Belzner, Thomas Gabor, and Martin Wirsing. 2023. Large language model assisted software engineering: prospects, challenges, and a case
study. In International Conference on Bridging the Gap between AI and Reality . Springer, 355–374.
[10] Leila Bencheikh and Niklas Höglund. 2023. Exploring the Efficacy of ChatGPT in Generating Requirements: An Experimental Study . Ph. D. Dissertation.
https://gupea.ub.gu.se/handle/2077/77957 Accepted: 2023-08-03T12:28:26Z.
[11] James Bessen. 2018. AI and jobs: The role of demand . Technical Report. National Bureau of Economic Research.
[12] Christian Bird, Denae Ford, Thomas Zimmermann, Nicole Forsgren, Eirini Kalliamvakou, Travis Lowdermilk, and Idan Gazit. 2023. Taking
Flight with Copilot: Early insights and opportunities of AI-powered pair-programming tools. Queue 20, 6 (Jan. 2023), 10:35–10:57.
https://doi.org/10.1145/3582083
[13] Anita Carleton, Forrest Shull, and Erin Harper. 2022. Architecting the future of software engineering. Computer 55, 9 (2022), 89–93.
[14] Souti Chattopadhyay, Nicholas Nelson, Yenifer Ramirez Gonzalez, Annel Amelia Leon, Rahul Pandita, and Anita Sarma. 2019. Latent patterns in
activities: A field study of how developers manage context. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE) . IEEE,
373–383.
[15] Ping Chen and Syazwina Binti Alias. 2024. Opportunities and Challenges in the Cultivation of Software Development Professionals in the Context
of Large Language Models. In Proceedings of the 2024 International Symposium on Artificial Intelligence for Education . 259–267.
[16] Xiang Chen, Chaoyang Gao, Chunyang Chen, Guangbei Zhang, and Yong Liu. 2024. An Empirical Study on Challenges for LLM Developers. arXiv
preprint arXiv:2408.05002 (2024).
[17] Matteo Ciniselli, Nathan Cooper, Luca Pascarella, Antonio Mastropaolo, Emad Aghajani, Denys Poshyvanyk, Massimiliano Di Penta, and Gabriele
Bavota. 2021. An empirical study on the usage of transformer models for code completion. IEEE Transactions on Software Engineering 48, 12 (2021),
4818–4837.
[18] Mariana Coutinho, Lorena Marques, Anderson Santos, Marcio Dahia, Cesar França, and Ronnie de Souza Santos. 2024. The Role of Generative AI
in Software Development Productivity: A Pilot Case Study. In Proceedings of the 1st ACM International Conference on AI-Powered Software . 131–138.
[19] Carlos Dantas, Adriano Rocha, and Marcelo Maia. 2023. Assessing the Readability of ChatGPT Code Snippet Recommendations: A Comparative
Study. In Proceedings of the XXXVII Brazilian Symposium on Software Engineering (SBES ’23) . Association for Computing Machinery, New York, NY,
USA, 283–292. https://doi.org/10.1145/3613372.3613413
[20] Jeffrey Dastin and Anna Tong. 2023. OpenAI rolls out ’incognito mode’ on ChatGPT | Reuters. https://www.reuters.com/technology/openai-rolls-
out-incognito-mode-chatgpt-2023-04-25/.
[21] Ozge Demirci, Jonas Hannane, and Xinrong Zhu. 2023. Who is AI Replacing? The Impact of ChatGPT on Online Freelancing Platforms.
(October 15, 2023).
[22] Ditstek Innovations Pvt. Ltd. (DITS). 2024. How are LLMs Reshaping Software Development? https://www.linkedin.com/pulse/how-llms-
reshaping-software-development-ditstek-innovations-8gtac/
[23] Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Weikang Zhou, Muling Wu, Mingxu Chai, Jessica Fan, Caishuang Huang, Yunbo Tao, et al.
2024. What’s Wrong with Your Code Generated by Large Language Models? An Extensive Study. arXiv preprint arXiv:2407.06153 (2024).
[24] Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. 2023. Gpts are gpts: An early look at the labor market impact potential of large
language models. arXiv preprint arXiv:2303.10130 (2023).
[25] Harry Barton Essel, Dimitrios Vlachopoulos, Albert Benjamin Essuman, and John Opuni Amankwa. 2024. ChatGPT effects on cognitive skills
of undergraduate students: Receiving instant responses from AI-based conversational large language models (LLMs). Computers and Education:
Artificial Intelligence 6 (2024), 100198.
[26] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large language models for
software engineering: Survey and open problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering
(ICSE-FoSE) . IEEE, 31–53.
[27] Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated repair of programs from large language
models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) . IEEE, 1469–1481.
[28] Yunhe Feng, Sreecharan Vanam, Manasa Cherukupally, Weijian Zheng, Meikang Qiu, and Haihua Chen. 2023. Investigating Code Generation
Performance of Chat-GPT with Crowdsourcing Social Data. In Proceedings of the 47th IEEE Computer Software and Applications Conference . 1–10.
[29] Martin Fowler. 2018. Refactoring . Addison-Wesley Professional.
[30] Martin Fowler, Jim Highsmith, et al. 2001. The agile manifesto. Software development 9, 8 (2001), 28–35.
[31] Isa Fulford and Andrew Ng. 2024. ChatGPT Prompt Engineering for Developers. https://www.deeplearning.ai/short-courses/chatgpt-prompt-
engineering-for-developers/
[32] Qiuhan Gu. 2023. LLM-Based Code Generation Method for Golang Compiler Testing. In Proceedings of the 31st ACM Joint European Software
Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023) . Association for Computing Machinery, New
York, NY, USA, 2201–2203. https://doi.org/10.1145/3611643.3617850
[33] Maanak Gupta, Charankumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. 2023. From ChatGPT to ThreatGPT: Impact of Generative
AI in Cybersecurity and Privacy. IEEE Access 11 (2023), 80218–80245. https://doi.org/10.1109/ACCESS.2023.3300381 Conference Name: IEEE
Access.
[34] Md Asraful Haque and Shuai Li. 2023. The Potential Use of ChatGPT for Debugging and Bug Fixing. EAI Endorsed Transactions on AI and Robotics
2, 1 (2023), e4–e4.
[35] Madison Hoff, Aaron Mok, and Jacob Zinkula. 2023. 4 white-collar jobs most at risk of getting replaced by Ai like chatgpt. https://www.
businessinsider.com/chatgpt-white-collar-jobs-at-risk-artificial-intelligence-ai-2023-2
[36] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2023. Large language
models for software engineering: A systematic literature review. arXiv preprint arXiv:2308.10620 (2023).
[37] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large Language
Models for Software Engineering: A Systematic Literature Review. ACM Trans. Softw. Eng. Methodol. (Sept. 2024). https://doi.org/10.1145/3695988
Just Accepted.
[38] Adam Hörnemalm. 2023. ChatGPT as a Software Development Tool . Ph. D. Dissertation.
[39] Johan Jeuring, Roel Groot, and Hieke Keuning. 2023. What Skills Do You Need When Developing Software Using ChatGPT? (Discussion Paper). In
Proceedings of the 23rd Koli Calling International Conference on Computing Education Research . 1–6.
[40] Haolin Jin, Linghan Huang, Haipeng Cai, Jun Yan, Bo Li, and Huaming Chen. 2024. From llms to llm-based agents for software engineering: A
survey of current, challenges and future. arXiv preprint arXiv:2408.02479 (2024).
[41] Angelika Kimbel, Magdalena Glas, and Günther Pernul. 2024. Security and Privacy Perspectives on Using ChatGPT at the Workplace: An Interview
Study. (2024).
[42] Jan H Klemmer, Stefan Albert Horstmann, Nikhil Patnaik, Cordelia Ludden, Cordell Burton Jr, Carson Powers, Fabio Massacci, Akond Rahman,
Daniel Votipka, Heather Richter Lipford, et al. 2024. Using AI Assistants in Software Development: A Qualitative Study on Security Practices and
Concerns. arXiv preprint arXiv:2405.06371 (2024).
[43] Madhava Krishna, Bhagesh Gaur, Arsh Verma, and Pankaj Jalote. 2024. Using LLMs in Software Requirements Specifications: An Empirical
Evaluation. arXiv preprint arXiv:2404.17842 (2024).
[44] Mohammad Amin Kuhail, Sujith Samuel Mathew, Ashraf Khalil, Jose Berengueres, and Syed Jawad Hussain Shah. 2024. “Will I Be Replaced?”
Assessing ChatGPT’s Effect on Software Development and Programmer Perceptions of AI Tools. Science of Computer Programming (2024), 103111.
[45] Sam Lau and Philip Guo. 2023. From "Ban it till we understand it" to "Resistance is futile": How university programming instructors plan to adapt
as more students use AI code generation and explanation tools such as ChatGPT and GitHub Copilot. In Proceedings of the 2023 ACM Conference on
International Computing Education Research-Volume 1 . 106–121.
[46] Dean Leffingwell and Don Widrig. 2000. Managing software requirements: a unified approach . Addison-Wesley Professional.
[47] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim
Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33
(2020), 9459–9474.
[48] Ze Shi Li, Nowshin Nawar Arony, Ahmed Musa Awon, Daniela Damian, and Bowen Xu. 2024. AI Tool Use and Adoption in Software Development
by Individuals and Organizations: A Grounded Theory Study. arXiv preprint arXiv:2406.17325 (2024).
[49] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous
Evaluation of Large Language Models for Code Generation. Advances in Neural Information Processing Systems 36 (Dec. 2023), 21558–21572.
https://proceedings.neurips.cc/paper_files/paper/2023/hash/43e9d647ccd3e4b7b5baab53f0368686-Abstract-Conference.html
[50] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is your code generated by chatgpt really correct? rigorous evaluation of
large language models for code generation. Advances in Neural Information Processing Systems 36 (2024).
[51] Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D Le, and David Lo. 2023. Refining ChatGPT-generated
code: Characterizing and mitigating code quality issues. ACM Transactions on Software Engineering and Methodology (2023).
[52] Kamil Malinka, Martin Peresíni, Anton Firc, Ondrej Hujnák, and Filip Janus. 2023. On the Educational Impact of ChatGPT: Is Artificial Intelligence
Ready to Obtain a University Degree?. In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1
(ITiCSE 2023) . Association for Computing Machinery, New York, NY, USA, 47–53. https://doi.org/10.1145/3587102.3588827
[53] Robert C Martin. 2009. Clean code: a handbook of agile software craftsmanship . Pearson Education.
[54] Wendy Mendes, Samara Souza, and Cleidson De Souza. 2024. "You’re on a bicycle with a little motor": Benefits and Challenges of Using AI Code
Assistants. In Proceedings of the 2024 IEEE/ACM 17th International Conference on Cooperative and Human Aspects of Software Engineering . 144–152.
[55] Jeremy Miles and Paul Gilbert. 2005. A Handbook of Research Methods for Clinical and Health Psychology . Oxford University Press. Google-Books-ID:
kmZ3Yt5pY0YC.
[56] Aaron Mok and Jacob Zinkula. 2023. ChatGPT may be coming for our jobs. Here are the 10 roles that AI is most likely to replace.
https://www.businessinsider.com/chatgpt-jobs-at-risk-replacement-artificial-intelligence-ai-labor-trends-2023-02#:~:text=Experts%20say%20ChatGPT%20and%20related,career%2C%20mid%2Dability%20work.
[57] Arghavan Moradi Dakhel, Vahid Majdinasab, Amin Nikanjam, Foutse Khomh, Michel C. Desmarais, and Zhen Ming (Jack) Jiang. 2023. GitHub
Copilot AI pair programmer: Asset or Liability? Journal of Systems and Software 203 (Sept. 2023), 111734. https://doi.org/10.1016/j.jss.2023.111734
[58] n.a. 2021. OpenAI Codex. https://openai.com/index/openai-codex/
[59] n.a. 2023. Stack Overflow 2024 Developer Survey.
[60] n.a. 2024. Build simple, secure, scalable systems with Go. https://go.dev/
[61] n.a. 2024. FAQ for optional data sharing for Copilot AI features in Dynamics 365 and Power Platform.
https://learn.microsoft.com/en-us/power-platform/faqs-copilot-data-sharing
[62] n.a. 2024. Gemini Apps Privacy Hub. https://support.google.com/gemini/answer/13594961?hl=en
[63] n.a. 2024. Privacy Policy. https://openai.com/policies/row-privacy-policy/
[64] n.a. 2024. Stack Overflow 2023 Developer Survey.
[65] Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. 2024. Using an LLM to Help With Code Understanding. In
2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE). IEEE Computer Society, 881–881.
[66] Nathalia Nascimento, Paulo Alencar, and Donald Cowan. 2023. Artificial Intelligence vs. Software Engineers: An Empirical Study on Performance
and Efficiency using ChatGPT. In Proceedings of the 33rd Annual International Conference on Computer Science and Software Engineering (CASCON
’23). IBM Corp., USA, 24–33.
[67] Nikolaos Nikolaidis, Karolos Flamos, Daniel Feitosa, Alexander Chatzigeorgiou, and Apostolos Ampatzoglou. 2023. The End of an Era: Can AI
Subsume Software Developers? Evaluating ChatGPT and Copilot Capabilities Using LeetCode Problems. https://doi.org/10.2139/ssrn.4422122
[68] OpenAI. [n. d.]. Prompt Engineering. https://platform.openai.com
[69] OpenAI. n.d.. Prompt engineering. https://platform.openai.com/docs/guides/prompt-engineering.
[70] OpenAI. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023).
[71] Ipek Ozkaya. 2022. A Paradigm Shift in Automating Software Engineering Tasks: Bots. IEEE Software 39, 5 (Sept. 2022), 4–8.
https://doi.org/10.1109/MS.2022.3167801
[72] Ipek Ozkaya. 2023. Application of Large Language Models to Software Engineering Tasks: Opportunities, Risks, and Implications. IEEE Software 40,
3 (May 2023), 4–8. https://doi.org/10.1109/MS.2023.3248401
[73] Ipek Ozkaya. 2023. Application of large language models to software engineering tasks: Opportunities, risks, and implications. IEEE Software 40, 3
(2023), 4–8.
[74] Zeynep Özpolat, Özal Yıldırım, and Murat Karabatak. 2023. Artificial Intelligence-Based Tools in Software Development Processes: Application
of ChatGPT. European Journal of Technique (EJT) 13, 2 (2023), 229–240.
[75] Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The Impact of AI on Developer Productivity: Evidence from GitHub Copilot.
arXiv:2302.06590 [cs.SE]
[76] Gustavo Pinto, Cleidson De Souza, Thayssa Rocha, Igor Steinmacher, Alberto Souza, and Edward Monteiro. 2024. Developer Experiences with a
Contextualized AI Coding Assistant: Usability, Expectations, and Outcomes. In Proceedings of the IEEE/ACM 3rd International Conference on AI
Engineering-Software Engineering for AI. 81–91.
[77] Zeinab Sadat Rabani, Hanieh Khorashadizadeh, Shirin Abdollahzade, Sven Groppe, and Javad Ghofrani. 2023. Developers’ Perspective on
Trustworthiness of Code Generated by ChatGPT: Insights from Interviews. In International Conference on Applied Machine Learning and Data
Analytics. Springer, 215–229.
[78] Wahyu Rahmaniar. 2023. ChatGPT for Software Development: Opportunities and Challenges. https://doi.org/10.36227/techrxiv.23993583.v1
[79] Nitin Liladhar Rane, Abhijeet Tawde, Saurabh P Choudhary, and Jayesh Rane. 2023. Contribution and performance of ChatGPT and other Large
Language Models (LLM) for scientific and research advancements: a double-edged sword. International Research Journal of Modernization in
Engineering Technology and Science 5, 10 (2023), 875–899.
[80] Mahdiyah Rashid. 2023. How is the development and deployment of AI models like Chat GPT affecting the job market
and what are the implications for workers in various industries? International Education and Research Journal (2023).
[81] Sanka Rasnayaka, Guanlin Wang, Ridwan Shariffdeen, and Ganesh Neelakanta Iyer. 2024. An empirical study on usage and perceptions of LLMs in a
software engineering project. arXiv preprint arXiv:2401.16186 (2024).
[82] Heidi Reichert, Benyamin T. Tabarsi, Zifan Zang, Cheri Fennell, Indira Bhandari, David Robinson, Madeline Drayton, Catherine Crofton, Matthew
Lococo, Dongkuan Xu, and Tiffany Barnes. 2024. Empowering Secondary School Teachers: Creating, Executing, and Evaluating a Transformative
Professional Development Course on ChatGPT. In 2024 IEEE Frontiers in Education Conference (FIE). IEEE, forthcoming.
[83] Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of
the 2021 CHI Conference on Human Factors in Computing Systems. 1–7.
[84] Mary Beth Rosson and John M. Carroll. 1996. The Reuse of Uses in Smalltalk Programming. ACM Trans. Comput.-Hum. Interact. 3, 3 (Sept. 1996),
219–253. https://doi.org/10.1145/234526.234530
[85] Georgia Robins Sadler, Hau-Chen Lee, Rod Seung-Hwan Lim, and Judith Fullerton. 2010. Research Article: Recruitment of hard-to-reach population
subgroups via adaptations of the snowball sampling strategy. Nursing & Health Sciences 12, 3 (2010), 369–374.
https://doi.org/10.1111/j.1442-2018.2010.00541.x
[86] Fardin Ahsan Sakib, Saadat Hasan Khan, and A. H. M. Rezaul Karim. 2023. Extending the Frontier of ChatGPT: Code Generation and Debugging.
https://arxiv.org/abs/2307.08260v1
[87] Amazon Web Services. n.d.. What is SDLC? - Software Development Lifecycle Explained - AWS. https://aws.amazon.com/what-is/sdlc/.
[88] Mohammed Latif Siddiq, Lindsay Roney, Jiahao Zhang, and Joanna Cecilia Da Silva Santos. 2024. Quality Assessment of ChatGPT Generated Code
and their Use by Developers. In Proceedings of the 21st International Conference on Mining Software Repositories. 152–156.
[89] Rafael M. L. Silva, Erica Principe Cruz, Daniela K. Rosner, Dayton Kelly, Andrés Monroy-Hernández, and Fannie Liu. 2022. Understanding AR
Activism: An Interview Study with Creators of Augmented Reality Experiences for Social Change. In Proceedings of the 2022 CHI Conference on Human
Factors in Computing Systems (CHI ’22). Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3491102.3517605
[90] Harmeet Singh and Syed Imtiyaz Hassan. 2015. Effect of solid design principles on quality of software: An empirical assessment. International
Journal of Scientific & Engineering Research 6, 4 (2015), 1321–1324.
[91] Giriprasad Sridhara, Sourav Mazumdar, et al. 2023. ChatGPT: A study on its utility for ubiquitous software engineering tasks. arXiv preprint
arXiv:2305.16837 (2023).
[92] Nigar M Shafiq Surameery and Mohammed Y Shakor. 2023. Use chat gpt to solve programming bugs. International Journal of Information Technology
& Computer Engineering (IJITC) ISSN: 2455-5290 3, 01 (2023), 17–22.
[93] Nigar M. Shafiq Surameery and Mohammed Y. Shakor. 2023. Use Chat GPT to Solve Programming Bugs. International Journal of Information
Technology & Computer Engineering (IJITC) ISSN: 2455-5290 3, 01 (Jan. 2023), 17–22. https://doi.org/10.55529/ijitc.31.17.22
[94] Thomas Süße, Maria Kobert, Simon Grapenthin, and Bernd-Friedrich Voigt. 2023. AI-Powered Chatbots and the Transformation of Work: Findings
from a Case Study in Software Development and Software Engineering. In Working Conference on Virtual Enterprises. Springer, 689–705.
[95] Ben Arie Tanay, Lexy Arinze, Siddhant S Joshi, Kirsten A Davis, and James C Davis. 2024. An Exploratory Study on Upper-Level Computing
Students’ Use of Large Language Models as Tools in a Semester-Long Project. arXiv preprint arXiv:2403.18679 (2024).
[96] Liyan Tang, Zhaoyi Sun, Betina Idnay, Jordan G. Nestor, Ali Soroush, Pierre A. Elias, Ziyang Xu, Ying Ding, Greg Durrett, Justin F. Rousseau,
Chunhua Weng, and Yifan Peng. 2023. Evaluating large language models on medical evidence summarization. npj Digital Medicine 6, 1 (Aug. 2023),
158. https://doi.org/10.1038/s41746-023-00896-7
[97] Yutian Tang, Zhijie Liu, Zhichao Zhou, and Xiapu Luo. 2023. ChatGPT vs SBST: A Comparative Assessment of Unit Test Suite Generation.
https://arxiv.org/abs/2307.00588v1
[98] Helen Toner. 2023. What Are Generative AI, Large Language Models, and Foundation Models? | Center for Security and Emerging Technology.
https://cset.georgetown.edu/article/what-are-generative-ai-large-language-models-and-foundation-models/.
[99] Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. 2022. Expectation vs. experience: Evaluating the usability of code generation tools
powered by large language models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. 1–7.
[100] Pranshu Verma and Gerrit De Vynck. 2023. ChatGPT took their jobs. Now they walk dogs and fix air conditioners.
https://www.washingtonpost.com/technology/2023/06/02/ai-taking-jobs/
[101] Jiexin Wang, Liuwen Cao, Xitong Luo, Zhiping Zhou, Jiayuan Xie, Adam Jatowt, and Yi Cai. 2023. Enhancing Large Language Models for Secure
Code Generation: A Dataset-driven Study on Vulnerability Mitigation. https://arxiv.org/abs/2310.16263v1
[102] Emily Winter, David Bowes, Steve Counsell, Tracy Hall, Sæmundur Haraldsson, Vesna Nowack, and John Woodward. 2022. How do developers
really feel about bug fixing? directions for automatic program repair. IEEE Transactions on Software Engineering (2022).
[103] Tianyu Wu, Shizhu He, Jingping Liu, Siqi Sun, Kang Liu, Qing-Long Han, and Yang Tang. 2023. A brief overview of ChatGPT: The history, status
quo and potential future development. IEEE/CAA Journal of Automatica Sinica 10, 5 (2023), 1122–1136.
[104] Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In
Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (MAPS 2022). Association for Computing Machinery, New
York, NY, USA, 1–10. https://doi.org/10.1145/3520312.3534862
[105] Ruiyun Xu, Yue (Katherine) Feng, and Hailiang Chen. 2023. ChatGPT vs. Google: A Comparative Study of Search Performance and User Experience.
SSRN Electronic Journal (2023). https://doi.org/10.2139/ssrn.4498671
[106] Sangyeop Yeo, Yu-Seung Ma, Sang Cheol Kim, Hyungkook Jun, and Taeho Kim. 2024. Framework for evaluating code generation
ability of large language models. ETRI Journal 46, 1 (2024), 106–117. https://doi.org/10.4218/etrij.2023-0357
[107] Shuyin Zhao. 2024. Smarter, more efficient coding: GitHub Copilot goes beyond Codex with improved AI model.
https://github.blog/news-insights/product-news/smarter-more-efficient-coding-github-copilot-goes-beyond-codex-with-improved-ai-model/