arxiv

Paper 2503.05012

LLMs' Reshaping of People, Processes, Products, and Society in Software Development: A Comprehensive Exploration with Early Adopters

Authors: Benyamin Tabarsi, Heidi Reichert, Ally Limke, Sandeep Kuttal, Tiffany Barnes

Published: 2025-03-06

Abstract:

Large language models (LLMs) like OpenAI ChatGPT, Google Gemini, and GitHub Copilot are rapidly gaining traction in the software industry, but their full impact on software engineering remains insufficiently explored. Despite their growing adoption, there is a notable lack of formal, qualitative assessments of how LLMs are applied in real-world software development contexts. To fill this gap, we conducted semi-structured interviews with sixteen early-adopter professional developers to explore their use of LLMs throughout various stages of the software development life cycle. Our investigation examines four dimensions: people - how LLMs affect individual developers and teams; process - how LLMs alter software engineering workflows; product - LLM impact on software quality and innovation; and society - the broader socioeconomic and ethical implications of LLM adoption. Thematic analysis of our data reveals that while LLMs have not fundamentally revolutionized the development process, they have substantially enhanced routine coding tasks, including code generation, refactoring, and debugging. Developers reported the most effective outcomes when providing LLMs with clear, well-defined problem statements, indicating that LLMs excel with decomposed problems and specific requirements. Furthermore, these early-adopters identified that LLMs offer significant value for personal and professional development, aiding in learning new languages and concepts. Early-adopters, highly skilled in software engineering and how LLMs work, identified early and persisting challenges for software engineering, such as inaccuracies in generated content and the need for careful manual review before integrating LLM outputs into production environments. Our study provides a nuanced understanding of how LLMs are shaping the landscape of software development, with their benefits, limitations, and ongoing implications.

Paper Content:

LLMs’ Reshaping of People, Processes, Products, and Society in Software Development: A Comprehensive Exploration with Early Adopters

BENYAMIN TABARSI∗ and HEIDI REICHERT∗, North Carolina State University, USA
ALLY LIMKE, North Carolina State University, USA
SANDEEP KUTTAL, North Carolina State University, USA
TIFFANY BARNES, North Carolina State University, USA

Large language models (LLMs) like OpenAI ChatGPT, Google Gemini, and GitHub Copilot are rapidly gaining traction in the software industry, but their full impact on software engineering remains insufficiently explored. Despite their growing adoption, there is a notable lack of formal, qualitative assessments of how LLMs are applied in real-world software development contexts. To fill this gap, we conducted semi-structured interviews with sixteen early-adopter professional developers to explore their use of LLMs throughout various stages of the software development life cycle. Our investigation examines four critical dimensions: people - how LLMs affect individual developers and teams; process - how LLMs alter software engineering workflows; product - LLM impact on software quality and innovation; and society - the broader socioeconomic and ethical implications of LLM adoption. Thematic analysis of our data reveals that while LLMs have not fundamentally revolutionized the development process, they have substantially enhanced routine coding tasks, including code generation, refactoring, and debugging. Developers who were LLM early adopters report the most effective outcomes when providing LLMs with clear, well-defined problem statements, indicating that LLMs excel with decomposed problems and specific requirements. Furthermore, these early adopters identified that LLMs offer significant value for personal and professional development, aiding in the learning of new languages and concepts.
Early adopters, highly skilled both in software engineering and in how LLMs work, identified early and persisting challenges for software engineering, such as inaccuracies in generated content and the need for careful manual review before integrating LLM outputs into production environments. Our study provides a nuanced understanding of how LLMs are shaping the current and future landscape of software development, highlighting their practical benefits, limitations, and potential ongoing implications.

CCS Concepts: • Human-centered computing → User studies; Empirical studies in HCI.

Additional Key Words and Phrases: LLM, ChatGPT, Gemini, Copilot Chat, interview study, professional developers

ACM Reference Format: Benyamin Tabarsi, Heidi Reichert, Ally Limke, Sandeep Kuttal, and Tiffany Barnes. 2025. LLMs’ Reshaping of People, Processes, Products, and Society in Software Development: A Comprehensive Exploration with Early Adopters. 1, 1 (March 2025), 44 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

∗ Both authors contributed equally to this research.

Authors’ Contact Information: Benyamin Tabarsi, btaghiz@ncsu.edu; Heidi Reichert, hreiche@ncsu.edu, North Carolina State University, Raleigh, North Carolina, USA; Ally Limke, North Carolina State University, USA; Sandeep Kuttal, North Carolina State University, Raleigh, North Carolina, USA; Tiffany Barnes, North Carolina State University, Raleigh, North Carolina, USA.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Request permissions from permissions@acm.org. ©2025 Copyright held by the owner/author(s). Publication rights licensed to ACM. Manuscript submitted to ACM. arXiv:2503.05012v1 [cs.SE] 6 Mar 2025

1 Introduction

Large language models (LLMs), trained on extensive datasets, are generative AI systems that produce human-like text based on their architecture and training data [70, 98]. Publicly available LLMs such as OpenAI ChatGPT¹, Google Gemini², and GitHub Copilot Chat³ represent transformative AI, reshaping workflows across multiple domains [35, 56, 100]. Within software development, specific LLMs are trained on vast repositories of code; for instance, the AI model behind GitHub Copilot was originally OpenAI’s Codex [58] and has since been refined to enhance contextual filtering [107]. These models enable developers to generate, refactor, and debug code through natural language prompts, streamlining traditionally labor-intensive processes. Such tools are increasingly central to enhancing developer productivity, fostering collaboration, and accelerating innovation in software engineering [15, 79].

The adoption of LLMs in software development has sparked considerable discourse on their broader implications. For example, Stack Overflow’s annual developer surveys of 2023 and 2024 highlight how developers are rapidly integrating AI tools into their workflows, reflecting shifting perceptions about automation and its role in programming [59, 64]. Discussions on LinkedIn [7, 22], the OpenAI community forum [16], and other platforms further emphasize the potential of LLMs to redefine software creation, engineering processes, and developer collaboration.
Several studies have examined LLM usage in software engineering through literature reviews [ 36], case studies [ 9], empirical studies of forums [ 16], and comparison of general LLM tools and LLM-based agents [ 40]. However, there is a lack of formal, qualitative assessments based on interviews with software developers on how LLMs are utilized in real-world software engineering settings. Those that exist primarily focus on more specific dimensions of their work, such as perceptions of security [42], trustworthiness [77], and user study evaluations of new tools [76]. To address this gap, we conducted interviews with 16 early-adopter software industry professionals who took the initiative to become educated about LLMs at their inception and began actively incorporating LLM-based AI tools into their daily workflows between November 2022 and April 2023. The insights of these early adopters help frame new and ongoing issues that are important for software engineering processes, tools, and practitioners to maintain awareness of as LLMs evolve. Our investigation is organized around four critical dimensions: people , examining how LLMs influence individual developers and teams; process , analyzing changes in software engineering workflows; product , evaluating how LLMs contribute to software quality and innovation; and society , exploring the broader socioeconomic and ethical implications. By addressing these dimensions for both early and ongoing revisions of LLMs, this research provides a holistic perspective on the evolving intersection of LLMs and software development, offering insights into the opportunities and challenges these technologies bring to the field. •People – RQ1: How do LLMs affect developers? In this “people” dimension, we aimed to discern the advantages and disadvantages LLMs offer professional software developers in their work. 
While existing studies provide valuable insights into specific tasks where LLMs excel, they often fail to capture the broader, qualitative experiences of developers across varying experience levels. Our research fills this gap by exploring the nuanced ways LLMs influence different development tasks, offering a richer understanding of their practical utility and limitations. LLMs best serve developers when assisting with tasks they like the least, such as automating repetitive coding tasks and summarizing information. Additionally, LLMs helped developers learn more effectively by explaining code, personalizing learning, and generating new ideas. However, they are prone to hallucinations and inaccuracies, making it crucial for developers to manually review and adapt the generated content.
•Process – RQ2: How have LLMs influenced software development processes? We explored the “process” dimension by investigating how LLMs have impacted the software development life cycle (SDLC), considering both positive and negative effects. Previous research has provided useful insights into isolated phases of the SDLC, but few studies examine the end-to-end impact of LLMs across all development stages. Our study fills this gap by systematically analyzing developers’ experiences using LLMs throughout the entire SDLC. We found that LLMs are particularly effective for ideation, testing, and debugging tasks, but less useful for generating requirements or reviewing code, especially in collaborative environments. Developers adapted their strategies by using a combination of broad and specific queries, experimenting with context addition and removal. While LLM-generated code often needed manual review before integration, it also offered learning opportunities and inspired innovative solutions.

¹ https://openai.com/chatgpt
² https://gemini.google.com/
³ https://docs.github.com/en/copilot/github-copilot-chat
•Product – RQ3: How has the use of LLMs influenced the software products created? In this “product” dimension, we focused on understanding the impact of LLMs on the code and software products generated by developers. Most existing studies rely on predefined metrics to evaluate code quality, often neglecting developers’ subjective perceptions of LLM-generated code. Our research addresses this gap by examining developers’ confidence in the code’s accuracy, readability, and complexity. We found that LLMs performed well with smaller, routine tasks like generating unit tests and documentation, but struggled with complex, novel code. Concerns arose around over-engineered code, security, and the sensitivity of the data used for training LLMs. Developers took responsibility for reviewing outputs, carefully balancing their trust in LLM-generated content with their own quality assurance processes. •Society – RQ4: How may the software industry and education be affected by LLMs? To understand the broader societal implications, we explored developers’ perceptions of LLMs’ impact on the software industry and educational training. Much existing research focuses on theoretical implications, but few studies engage developers directly. Our research fills this gap by capturing their nuanced views on the opportunities and challenges of LLMs in professional workflows. While some of our participants noted that LLMs could replace certain roles, they generally believed developers’ roles were safe, as LLMs are tools to support rather than replace human decision-making. Concerns were raised about entry-level positions and the interview process, with some arguing for integrating LLMs into CS curricula, while others called for revising assignments to prevent easy LLM solutions. The lack of formal guidelines and training for LLM use in workplaces was noted, highlighting the need for structured support to maximize LLMs’ effectiveness and ethical use. 
Our paper makes the following contributions:
•We provide one of the first comprehensive qualitative analyses of developers’ experiences with LLM-based tools in real-world software development settings, offering new insights into their practical utility and limitations.
•We uncover the strategies that developers use to integrate LLM tools into their workflows, highlighting both effective practices and common pitfalls, which can guide future adoption and optimization of these tools in professional environments.
•We offer corroborating evidence for prior studies on the technical impact of LLMs on software development, while also introducing new insights into the broader socio-economic and educational implications of these tools, including their potential effects on job roles and educational curricula.
•We analyze the impact of LLMs across all phases of the software development lifecycle (SDLC), providing a deeper understanding of how these tools influence technical tasks like coding, testing, and debugging, as well as human-centered activities like learning, collaboration, and decision-making.

The rest of the paper is structured as follows: Section 2 details the methodology used in the paper; Section 3 presents our results in detail for all four research questions; Section 4 explores the implications of our results; Section 5 discusses related work in the literature; Section 6 considers limitations; and Section 8 concludes the paper.

2 Methods

2.1 Participants and Demographics

We recruited 16 participants from tech meetup groups in the southeastern US, the researchers’ personal contacts, and LinkedIn. Snowball sampling was also employed, where participants invited individuals from their networks to join the study [85]. The target population for this study was professional software engineers and developers.
Inclusion criteria required participants to have at least two months of experience using ChatGPT or a similar LLM-based chatbot for programming in their work. Initially, our call for participants attracted a number of graduate students. Although many had prior experience as developers or had worked in development roles at their universities, we determined that this group would not fully address our research questions. As a result, we revised our recruitment materials to focus specifically on full-time developers. The demographic details of all 16 participants are provided in Table 2. 2.2 Interview Questions Formulation We structured our interview questions around four key themes in software engineering: People, Processes, Products, and Society. These themes were derived from existing literature, which highlighted the critical aspects of each category. Using this foundational framework, we formulated interview questions aimed at capturing relevant insights within each theme. An iterative approach was employed in developing these questions to ensure their clarity and relevance. To refine and ground our interview approach, we conducted several pilot interviews. One pilot interview was with a professional developer who regularly uses ChatGPT for programming, and the other three were with students who self-identified as developers and used ChatGPT in their work, though they were not employed as full-time professionals. These pilot interviews served as a testing phase to assess the appropriateness of the interview questions, determine the time required for each interview, and evaluate the overall structure. While the student interviews were excluded from our analysis, the professional developer’s interview was included due to the rich data it provided. Following the pilot interviews, we reviewed the data and feedback, ultimately excluding the student interviews from the preliminary analysis due to concerns about their alignment with the study’s objectives. 
Based on the insights gained, we refined the questions, reworded them for clarity, checked for grammatical accuracy, and ensured the wording was universally understood. The final version of the questions used for the interviews can be found in Table 1. Additionally, we reduced the interview duration to ensure the questions could be answered efficiently without sacrificing the depth of the responses.

2.3 Procedure

2.3.1 Interview. We invited interested participants via email and asked them to complete consent forms. Participants used a Google Calendar link to schedule their own research sessions, which lasted approximately 70 minutes. While all participants consented to audio recording, not all agreed to screen or video capture. All interviews took place between March 1, 2023, and July 7, 2023. The study consisted of semi-structured interviews, featuring a core set of predetermined questions, while allowing flexibility for follow-up questions based on the conversation [55]. At the start of each session, participants were briefed on the research purpose, followed by an interview of about 60 minutes. Two researchers were present: one conducted the interview, while the other took notes. Both researchers asked clarifying or follow-up questions based on participant responses. During the interviews, participants were asked to discuss their use of LLMs in their daily work, how they evaluated LLM-generated code, and how they integrated it into existing codebases. We also explored their opinions on how LLMs might impact the skills and job landscape in the software industry, with specific follow-up questions regarding the skills needed to effectively use LLMs. Additionally, participants were asked to describe real projects they had worked on. A list of the pre-prepared questions is provided in Table 1. While the questions we asked explicitly named ChatGPT, we stated during the interviews that interviewees should also answer the questions based on their usage of similar LLM-based chatbots or other LLM-based tools.

Table 1. A list of the interview questions we asked participants.
How do you use ChatGPT in your everyday work?
How often (on average) – multiple times a day? Daily? Weekly? Monthly?
Can you tell me about a time when you used ChatGPT to help you write a program?
How about a time when it did not work out well?
Were there any times when you were surprised by what you could do with ChatGPT and code?
How has ChatGPT changed your software engineering process?
Has it changed how you gather requirements? How?
Has it changed how you break tasks into parts that can be solved by ChatGPT? How?
Has it changed how you write code? How?
Has it changed your testing process? How?
Has it changed your code review process? How?
How do you normally evaluate the code generated by ChatGPT?
How do you do testing?
How do you determine if the code does what you asked for?
Do you read the code?
How do you check the code quality, efficiency, complexity?
What about security aspects?
How much do you trust the code provided by ChatGPT?
How is it different from evaluating human-written code?
How do you integrate the ChatGPT code results into your codebase, if at all?
What steps do you take before integrating the output of ChatGPT into your code? Or adapting/modifying the ChatGPT code to make it useful?
How often do you throw away the code, use, or reuse the code given by ChatGPT?
How secure do you believe the code given by ChatGPT generally is?
How will ChatGPT impact the skills and jobs in the software industry?
How do you think CS degree programs should adjust to prepare for this shift?
What skills must a person have to use ChatGPT like you do?
For example, how do you formulate a question to ChatGPT?
How do you structure your queries to get the desired answer?
How broad or specific should your questions be?
How much context will you provide to ChatGPT?
How do follow-up questions impact the accuracy of your answer?
How many queries/reformulations
Do you or your company have any guidelines, formal or informal, about how developers should use ChatGPT?

Table 2. Demographics of participants P1-P16 (P#) showing the company size (number of employees), gender, race/ethnicity, years of development experience, and LLM tool(s) used.
P# | Company size | Gender | Trans | Race / ethnicity | Experience | LLM used
P1 | Small (10-19) | Non-binary | No | Hispanic, White | 3+ years | GPT
P2 | Medium (50-249) | Man | No | Asian | 1-2 years | GPT
P3 | Large enterprise (250+) | Man | No | Hispanic, Asian | 3+ years | GPT
P4 | Micro (5-9) | Man | No | Hispanic, White | 3+ years | GPT
P5 | Large enterprise (250+) | Man | No | White | 3+ years | GPT
P6 | Micro (1-4) | Man | No | Black or African American | 3+ years | Copilot
P7 | Large enterprise (250+) | Man | No | Asian | 3+ years | GPT
P8 | Medium (50-249) | Man | No | Asian | 3+ years | GPT
P9 | Large enterprise (250+) | Man | Yes | White | 3+ years | GPT
P10 | Small (20-49) | Man | No | Asian | 3+ years | GPT
P11 | Micro (1-4) | Man | No | Asian | 3+ years | GPT
P12 | Large enterprise (250+) | Non-binary | Yes | White | 3+ years | Bard
P13 | Large enterprise (250+) | Woman | No | Asian | 1-2 years | GPT
P14 | Micro (5-9) | Man | No | White | 1-2 years | GPT
P15 | Medium (50-249) | Man | No | White | 3+ years | GPT, Copilot
P16 | Large enterprise (250+) | Woman | No | Black or African American | 1-2 years | GPT

2.3.2 Demographic Survey. After completing the interview, participants were asked to complete a brief survey on Qualtrics, which collected demographic information. Specifically, participants were asked to provide details about the size of their company, as well as their gender, ethnicity, and race.
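The group shares implied by Table 2 can be tallied mechanically from its rows. The following is a minimal Python sketch, not part of the original study: the rows are transcribed by hand from Table 2, the names `PARTICIPANTS` and `shares` are hypothetical, and it assumes that "Hispanic, White" and "Hispanic, Asian" entries are counted under White and Asian respectively, since the paper does not state how such entries were counted.

```python
from collections import Counter

# Rows transcribed by hand from Table 2: (participant, gender, race, experience).
# Assumption: "Hispanic, White" / "Hispanic, Asian" are tallied as "White" / "Asian".
PARTICIPANTS = [
    ("P1", "Non-binary", "White", "3+ years"),
    ("P2", "Man", "Asian", "1-2 years"),
    ("P3", "Man", "Asian", "3+ years"),
    ("P4", "Man", "White", "3+ years"),
    ("P5", "Man", "White", "3+ years"),
    ("P6", "Man", "Black or African American", "3+ years"),
    ("P7", "Man", "Asian", "3+ years"),
    ("P8", "Man", "Asian", "3+ years"),
    ("P9", "Man", "White", "3+ years"),
    ("P10", "Man", "Asian", "3+ years"),
    ("P11", "Man", "Asian", "3+ years"),
    ("P12", "Non-binary", "White", "3+ years"),
    ("P13", "Woman", "Asian", "1-2 years"),
    ("P14", "Man", "White", "1-2 years"),
    ("P15", "Man", "White", "3+ years"),
    ("P16", "Woman", "Black or African American", "1-2 years"),
]


def shares(field_index):
    """Return the percentage of participants for each distinct value of one column."""
    counts = Counter(row[field_index] for row in PARTICIPANTS)
    return {value: 100 * n / len(PARTICIPANTS) for value, n in counts.items()}


gender_shares = shares(1)
race_shares = shares(2)
experience_shares = shares(3)

if __name__ == "__main__":
    for label, result in [
        ("Gender", gender_shares),
        ("Race", race_shares),
        ("Experience", experience_shares),
    ]:
        print(label, result)
```

Under that counting assumption, the tallies come to 7 White, 7 Asian, and 2 Black or African American participants out of 16, with 12 men, 2 women, and 2 non-binary participants, and 12 participants with at least 3 years of experience.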
This demographic data was gathered to explore potential differences in LLM usage across various groups. Based on the information provided during the interviews, we also estimated the number of years of experience each participant had in the software development field. These results are presented in Table 2. Our participants were 43.75% White, 43.75% Asian, and 12.5% Black. 75% were men, 12.5% were women, and 12.5% were non-binary. 75% of our participants had at least 3 years of development experience, with only four participants having more limited experience and two participants expressing they had begun their careers within the past year.

2.4 Analysis

All interviews were recorded, automatically transcribed, and then reviewed and corrected by the researchers, who rewatched the video recordings to ensure accuracy. Our analysis followed an inductive approach, similar to that of Silva et al. in their study of AR activists [89]. Initially, we created open codes based on three transcripts that we identified as thematically rich and highly relevant to our research questions. After discussing and refining these codes, we compiled a preliminary codebook with descriptions for each tag. The two researchers who conducted the interviews, along with one researcher who did not participate in the interviews, independently coded the remaining 13 transcripts. They cross-referenced these codes with the interview video recordings when needed. Following independent coding, the researchers held discussions to reconcile their codes, developing a unified set of agreed-upon tags that contributed to an evolving codebook. The codebook grew iteratively as new themes emerged from the diversity of participant responses. Subsequently, three researchers (two involved in coding, one external) grouped the tags according to each research question, then clustered them into mid-level themes.
These were further divided into lower-level themes, which served as the foundation for our written analysis. This process took approximately two weeks, consisting of multiple intensive coding sessions. Throughout the analysis, the researchers continued to refine the codes and themes through ongoing discussions and by sharing draft results. The final coding resulted in 361 total codes, categorized into 138 low-level themes. For each theme, we also quantified the number of quotes tagged, which we present in our results. Participant quotes are referenced as PX, where X denotes the participant’s interview order. Note that P1 was technically a pilot interview, but answered all questions that were asked of the other participants.

3 Results

In this section, we present key themes found in our analysis regarding how LLMs impact people (RQ1), processes (RQ2), products (RQ3), and society (RQ4). We organized this section to outline the positive and negative aspects of LLMs.

3.1 RQ1: How do LLMs affect developers?

The key themes that emerged in understanding how LLMs affect developers’ daily tasks are as follows:

3.1.1 Boosting Developers’ Productivity.
•Reducing Mundane Tasks: Codes related to LLMs’ effectiveness in simplifying mundane tasks appeared in fourteen interviews, aligning with previous research describing how developers used ChatGPT to automate tedious tasks [38]⁴. A notable example was provided by P14, who shared, “I really like that the repetitive, boring tasks, like looking for a comma, that I don’t need to do those, and I can focus more on building things.” Thirteen participants highlighted LLMs’ role in expediting the software engineering process and saving time. For instance, P1 noted the potential benefits of integrating ChatGPT into developers’ workflows: “It might make you aware of things that you might be missing [...] maybe speed up some small pieces of a software engineer’s process.
” These findings align with those of other researchers [6, 67, 78]. Similarly, nine participants highlighted how LLMs enhance efficiency and make developers more productive. For example, P15 explained how ChatGPT writes functions, supporting focus on higher-level tasks: “[Suppose that] I want to do something simple [...] I don’t write those functions anymore. I always have ChatGPT do it because I know that it’s going to come up with something close to what I was going to do, but I didn’t actually have to do it. So it’s kind of like a pair-programmer for me, so I can stick to the higher-level stuff that would take more understanding of the infrastructure.”
•Streamlining Search Experiences: Ten participants highlighted the productivity advantage of LLMs in reducing search time for solutions and drew parallels between LLMs and search engines. For example, P6 shared, “[...] instead of going out directly to Google or Stack Overflow, my first line of question is ChatGPT [...] I get answers a little bit faster and more contextualized [with ChatGPT].” In another case, P14 mentioned how LLMs transformed his search style: “Now, I haven’t been on Google probably for 3 weeks [...] I don’t need to click on each of the links from Google to find the solution [...] [ChatGPT] gives me a summarized version from possibly three, four different sources.” In a related finding, a study by Xu et al. revealed that individuals who used ChatGPT to search for answers spent less time than their counterparts who used Google [105]⁵.

[Fig. 1. Findings related to RQ1, categorized by their uses (indicated by a checkmark), existing challenges (indicated by a stop sign), and challenges mentioned by our participants that have been largely addressed in newer versions of LLMs (indicated by a blue circle).]

⁴ Master’s thesis.
•Providing Boilerplate Code and Templates: Ten participants shared that they had used LLMs to create generic or standardized code and templates. P4 described this capability as “one of the best things I get from [ChatGPT].” P12, a consistent user of Bard⁶, expressed, “You can take [Bard’s response] as an idea, like a first draft, and then iterate on it - that’s the experience I have working with it from the coding side.” Similarly, P5 mentioned, “I would ask it particular questions regarding a specific front-end component I’m developing, and I use it to point me in the right direction with code that I can start off with. And then, based on my business need, I would customize it to fit the specific requirement at the time.”
•Translating Code: Code translation was referenced in three interviews. P5 described his experience, stating, “[If] I want [the code] to be styled in a particular way, or in a certain language, or if I want to transpile it to another language, [ChatGPT has] been able to do that pretty well.” This finding is in line with prior research on code translation as one of the capabilities of AI coding tools [45].

⁵ Pre-published via arXiv.
⁶ Now known as Gemini, as noted in Section 2.

•Accelerating Learning: Two participants noted that ChatGPT reduces the learning curve and speeds up learning. Given the integral role of continuous learning in a developer’s responsibilities, LLMs’ ability to facilitate this is highly beneficial. For instance, P14 highlighted, “I feel it’s really good at explaining. Like, I don’t understand quantum computing at all, but I ask [ChatGPT] to explain the principles to me as if I’m five [years old] or as if I’m a JavaScript developer [...] It helps me understand the context without needing to delve into all the underlying physics. It definitely helps me learn new things faster.
” Prior research has also highlighted the positive impact of ChatGPT on accelerating the learning process [52].
•Simplifying Set-ups: Two participants mentioned using LLMs for set-ups and installations, which are an inevitable part of a developer’s work and can consume significant time. In one instance, P4 stated, “It saves me the four hours of headache of setting up. It would also help with setting up a new [development] environment - I’m gonna go to React, gotta get this web app building and running. I’ll just copy and paste the terminal errors, and it does a surprisingly good job of telling me how to get through the dev environment issues.”
•Supplementing Tutorials and Documentation: One participant, P16, highlighted the common issue of online tutorials and documentation being incomplete or outdated, potentially creating roadblocks in developers’ projects. She suggested that ChatGPT could alleviate this challenge: “I’m working with a platform that I’m not familiar with, and the documentation of that platform is not clear or may skip steps. I notice now that at work, documentations do skip steps, and ChatGPT does a really good job at filling in the gaps when I ask my question: ’Hey, this is what I’m supposed to be getting, but I’m getting something else. Do you know why?’ And then again, ChatGPT sometimes does give wrong answers, but it shoots out answers that I may not have considered.”
•Improving Recall of Syntax/Implementation: One participant highlighted the challenge of forgetting syntax or implementation details in a developer’s knowledge base. P11 shared, “I used to use MySQL a long time ago, and then I haven’t used MySQL in a while, so I didn’t remember certain ways of doing things. In one case, I needed to do a left join to find things that were in one table but not another. [ChatGPT] was helpful.”

Challenges:
•Programming Language Mix-Ups: Three participants noted instances where LLMs’ responses were not in the expected language.
For instance, P8 recounted an experience of asking ChatGPT to refactor code in Java: “When you paste your code [on ChatGPT], it even does not understand that code is [in] Java, C#, JavaScript, or whatever... Maybe the premium version doesn’t have that issue. But at least the [free] version that they have is silly in those cases.” Consequently, developers need to spend time converting the code to the target language.

• Contradictory Answers: One participant, P1, observed that ChatGPT’s answers sometimes contradicted its earlier responses and remained stubborn. He stated, “[The impact of follow-up questions on the accuracy of the answers] depends. These language models have a tendency to be stubborn.” Additionally, he cited an example supposedly from the literature where ChatGPT’s responses to an imaginary scenario were contradictory, and it refused to correct its mistakes (footnote 7). Despite such occasional unreliability, prior research indicates that individuals are not deterred from using ChatGPT for this reason [5].

Footnote 7: The scenario in question concerns asking ChatGPT what it would do if it were invisible, with contradictions arising when it responds that it would access inaccessible areas; note that the researchers were unable to find the source of this reference.

• Slow Response Generation: Different versions of LLMs may exhibit varying response times that impact the user’s experience. P16 compared GPT-3.5 and GPT-4, noting significant differences in speed that resulted in frustration: “GPT-4 is really slow and very limited, so right now, I use GPT-3.5. I am excited for GPT-4 [to become faster] because it obviously holds more information. But I tried it out, and I caught myself trying to rip off my hair because it wasn’t understanding me the way that GPT-3.5 understood me.
And then, when you’re limited — I think it’s 50 prompts in four hours or something like that. And I’m having conversations with ChatGPT [that are] one thread with 20 messages. [...] 50 [prompts] every four hours is not efficient.”

• Struggles with Unstructured Data Analysis: One participant, P3, discussed a challenge with unstructured input for ChatGPT. As he described, “I tried asking it to perform some analysis on unstructured data I received that I was going to run through Splunk, which is a machine data reading tool. I asked it to test efficacy. I provided it with some dummy data and asked for it to create a few reports. It seems like it only read the variable names and decided what those reports would look like, but the actual visualizations it created and the insights it provided were not accurate. This might have been due to the nature of the data, but I also noticed similar issues discussed on Reddit regarding Kaggle data analysis.” Such inaccuracies could consume developers’ time to correct or result in wasted efforts when using LLMs.

3.1.2 Facilitating Developers’ Learning.

• Personalizing Education and Skill Enhancement: Learning new technologies, software, and information from LLMs emerged as a prominent theme, appearing in fourteen interviews. A key benefit of LLMs is their ability to offer personalized and interactive learning. Whether LLMs introduced unfamiliar libraries to a programmer or assisted developers in learning specific languages for job interviews, they proved to be an invaluable resource. For example, P7 highlighted the constant need to learn new things in the industry: “We still need to read a lot of things like documentations or ... some new technologies. So ChatGPT is a good resource ... to learn something I never heard before. So, for example, if there is a question [and] I’m not sure which tool I should use, I could probably just ask the open question from ChatGPT.”

• Explaining Code: We found LLMs’ abilities to provide explanations and examples to be noted in eleven interviews. For instance, P7 highlighted how a code example from ChatGPT not only offered guidance but also inspired him to clean his code: “Instead of looking at the [Golang (footnote 8)] code myself — because the source code has so many packages, I don’t know where to look into — I just asked ChatGPT to find me an example. In this case, I can quickly notice this is [what] the source code is doing. I can probably do something similar to make the code readable. I mean, it has good readability and is also as beautiful as possible.” Related to LLMs’ streamlining of the searching process, P13 shared how ChatGPT was more useful than other tools due to its ability to explain code more efficiently: “[My] first step tends to be to go to ChatGPT, give it the snippet of my code, [...] and then ask it those specific questions because I think it puts you quite ahead in understanding and making progress with things.”

• Providing a Broad Scope of Knowledge and Diverse Datasets: The extensive knowledge and diverse datasets of LLMs appeared in four interviews as an element that advanced developers’ knowledge and understanding. For example, P8 shared an example in which ChatGPT was able to create data transfer objects (DTOs) based on requirements related to a task in the financial domain; the LLM was able to intuit that the DTOs would require a first name, last name, SSN, PIN number, and a card number: “That was really surprising to me that [ChatGPT] knows that.”

Footnote 8: Go, also known as Golang, is a programming language designed by Google [60].

• Demonstrating Best Coding Practices: Two participants highlighted how LLMs helped them learn how to write code more efficiently and elegantly.
By observing the generated code and incorporating LLMs’ suggestions, they were able to refine their coding style and produce more readable code. P16 shared that ChatGPT helped her learn how to write code more efficiently: “I think it showed me better [...] shorthand code. [...] I just kind of learned how to use less words, [and] code more efficiently. I would say sometimes it may not be pretty, but from what I know, plus what I’ve learned from it, I’m able to kind of combine it and mix it and make my code more elegant and readable and just better-looking code.”

Challenges.

• Hallucinations and Incomplete Responses: LLMs occasionally produce erroneous or fictitious content, a major issue that has been outlined in several academic studies [3, 96]. This challenge surfaced in six interviews. For instance, P1 recounted instances where ChatGPT fabricated responses: “There were some things that were surprising in a bad way, like it would make up papers. It would hallucinate paper names and authors.” Another participant, P11, highlighted ChatGPT’s failure to provide sources for its responses, noting the importance of human expertise and of using reliable, up-to-date information: “[People’s] knowledge is probably more up to date. No, nobody is always 100% up-to-date, but you can trust it a little bit more, or if they have experience with the library, then I would trust a person’s experience more than whatever ChatGPT is drawing from. [...] ChatGPT really does not cite its information. I think maybe Bard started citing stuff. Citations are very helpful.” Additionally, P5 mentioned encountering disrupted answers midway, possibly due to server connectivity issues: “I have run into issues where I tried to generate like a lot of code, and then it would just stop halfway through because of a slow connection, or just the plan that I have for it, and then I would ask it to finish, and then it just sort of forgets what it was doing.”

• Limited Knowledge and Datasets: Contrary to the broad knowledge discussed earlier, five participants shared that LLMs have limited knowledge and datasets. For instance, P9 shared that despite extensive efforts in feeding information to ChatGPT and careful prompt engineering, ChatGPT failed to provide an answer: “There are definitely times when I spent maybe more than an hour trying to ask ChatGPT to help me debug my JavaScript code. But it turns out that no matter how well I try to re-prompt, using the prompt engineering techniques that I picked up from Dr. Andrew Ng’s course on DeepLearning.AI [footnote 9], [or] even if I tried all the techniques available, it’s still going in loops. ChatGPT is still unable to pick up any specific bugs, which [are] helpful for me to overcome the issue. In that case, I still have a JavaScript subcontractor that I hire on an hourly basis to help me when I’m really, really, really stuck. So there are times that I still need to get him to fill me [in on] that.”

• Struggles with Novel Ideas and Logical Prompts: In four interviews, participants highlighted LLMs’ limitations in generating code for novel ideas and handling prompts requiring complex logical reasoning. For example, as P11 observed, “ChatGPT isn’t good at logical things or counting and math, right? But I think it’s able to usually generate some reasonable code for that.”

• Impediment of Developers’ Learning: Another concern raised in four interviews was LLMs’ probable adverse effect on learning, particularly for junior developers. This was primarily due to users’ potential inability to parse the correctness of ChatGPT’s answers. For example, P10 mentioned, “I think ChatGPT is not mature enough yet for everyday uses, especially for junior developers. If they want to start to do something, I don’t think they should use ChatGPT because it might lead to misinterpretation of a lot of documentation and lead to something else.
Because a lot of time, ChatGPT’s tone, the way it replies, has certainties, say, here are all the solutions [...] and that is incorrect. I have encountered so many times that it’s answering itself in circles.”

Footnote 9: https://www.deeplearning.ai/

• Safeguard Failures and Harmful Effects: In two interviews, participants reported incidents where these LLMs’ safeguards against disseminating harmful information could be bypassed. One participant, P3, detailed an instance of impersonation that led to offensive and discriminatory content. In another case, P4 described how he was able to write a program that caused his computer to restart: “[I]t will gladly do harmful side effects. But I had to go out of my way to try to get it to [do] it. This was just during a live demo of using GPT-3, and I had it restart my computer.” These examples underscore the need for ongoing enhancements in LLMs’ capabilities to ensure their effectiveness as a knowledge-enhancing tool.

3.1.3 Supporting Developers’ Personal Growth.

• Enhancing Reassurance, Confidence, and Independence: One theme that emerged in six interviews was LLMs’ role in providing developers with reassurance, which improved their confidence and independence. P2, a newer professional, shared his experience of comparing his answers to ChatGPT’s for corroboration: “It has made the [software engineering] process easier definitely. And the best part is I kind of get the reassurance of a lot of things — that if I’m doing it this way, I can ask it, and I can just check if ChatGPT is giving me a similar answer [...], then I’m not doing anything which is majorly wrong. There will always be a few things here and there, but most of the parts would be correct. That’s what I get from it.”

• Improving Access to Information: A positive view of LLMs’ ability to ease access to information emerged in three interviews.
P15 shared his belief that ChatGPT would democratize information access worldwide: “Now we have this tool that everybody in the world can use, that’s very, very inexpensive or free, that allows everybody, from all ages, from five years old to a hundred years old, to learn any topic that they want. I think a lot of countries are gonna be able to grow and thrive, knowing that they can now figure out how to grow these things, or what’s messed up with their economy, or analyzing these things [...] You can’t say, ‘Oh, I grew up in a poor neighborhood. I couldn’t get the same information as the person who went to an Ivy League College,’ because that’s no longer the case. Now it comes down to whether or not you yourself have the ambition to learn the things that you say, that you would learn if education was free because now it’s free.”

• Enhancing Job Satisfaction: One participant, P14, highlighted ChatGPT’s role in his increased job satisfaction. As a non-native English speaker, he shared how ChatGPT helped him in areas he struggled with, thus allowing him to focus on what he cared about: “I feel like I’m very bad at writing essays in English. I feel like [ChatGPT] can write better essays than I can, and work my ideas better than I can in English. So it can help me in that way, that I don’t need to get my skills on some level. So I can write some okay essays or emails [...] and I can focus more time on doing something that I enjoy and going deeper there, which it encourages me. And if I can focus more time [on] doing something that makes me happy and that I’m interested in, I can learn those things way faster and be way happier than if I have to learn something that I don’t really care about.”

Challenges.

• Inability to Replace Human Decisions: Six participants noted LLMs’ inability to replace human interactions and decisions.
For instance, P12 emphasized that software engineering involves more than just coding, highlighting the importance of human communication and collaboration: “Despite what [LLMs] can do, at the end of the day, it can’t replace — like, what’s difficult about being a software engineer isn’t coding particularly. Yeah, that’s part of it, [but] I think what’s difficult about being a software engineer, at least with what I do, is the communication that happens between teams, between coworkers, the email threads, and the chat threads that exist.” Similarly, P5 stressed the ongoing necessity of human involvement in software development, especially in decision-making and understanding business requirements: “I still believe that you’ll still need a human person running the shots and doing the code. You still need someone to run ideas off of and to handle all of these human elements because you’d still need to take in the business case from a person and from the whole other team, and that’s a whole other topic and conversation. But there still is room in this world for the human interaction and human developer, in my perspective.”

• Concerns about Dependency: Two participants mentioned their preference for solving problems on their own before asking LLMs. This was primarily done to avoid over-reliance—as P16 shared, “I try my best to at least spend some time on my own to figure it out because I don’t want to be too dependent on [ChatGPT].”

• Slower Implementation than Humans: Differences between the time it took humans and LLMs to create code were noted in one interview. P12 highlighted the efficiency of human problem-solving in cases where ChatGPT fails to understand or provide accurate solutions: “At some point, I’ll be like, ‘Okay, it just doesn’t understand me.’ And so I’ll give up and then just do it myself.
Honestly, sometimes I feel like doing it myself can often be the easier answer — like, the actual, quickest path to getting what I want sometimes when it starts making these mistakes initially.”

3.1.4 Assisting Developers’ Non-Technical Tasks.

• Consulting and Decision-Making: There were eleven participants discussing the use of LLMs for consultation or direction. For example, P7 highlighted their utility in uncertain situations: “If there is a question, [like] I’m not sure which tool I should use, I could probably just ask an open question to the ChatGPT like, ‘Hey, could you give me some directions or some potential solutions to this situation?’ And ChatGPT could probably show me some high-level ways [...] But with some traditional search engines, like Google, it’s kind of hard because if I don’t ask a specific question, or if I don’t ask for some specific tool, they cannot give me a suggestion or they give me some way unrelated suggestions.” In another case, P8 mentioned ChatGPT’s impact on resolving team debates: “[A]lways in technical teams, there exists debates on choosing options, options A and B — both are correct, but which one is better? In this way, we have a judge; we have someone that tells us which approach is better. In this way, I think it changed the whole software engineering [process] for me, that whenever we have a discussion with our colleagues in our team, in some cases, at least we have someone [ChatGPT] who says the final word.”

• Summarizing Text and Documentation: Nine participants mentioned the use of LLMs for summarizing different textual data, such as articles, papers, and documentation. With regard to reading documentation, P7 shared, “When I look into some documentations, instead of reading the whole thing — because I probably don’t need all of these things, I only need some piece of the information — I can ask ChatGPT, ‘Hey, read through this link and give me the summary.’”

• Supporting Internal Communication: The use of LLMs to facilitate internal communication by composing documents like quarterly updates, aiding in presentations by providing outlines and content suggestions, and assisting in explaining complex tasks to team members emerged in five interviews. For instance, P7 stated, “Since English is not my first language, [it] previously took me a lot of time to write that. But with ChatGPT, I can just show it this is what I’ve done. ‘Could you generate this kind of solution, this kind of document for me?’” P4 shared how he used ChatGPT to create presentations and demonstrations: “I also do a lot of presentations. So, [ChatGPT helps with] communicating what I’m building and what our team is working on internally to other teams, how to present that information, help me with slides, help me with what would make it a good demo, video, et cetera. It’s able to just give me outlines for that type of stuff!”

Fig. 2. Findings related to RQ2, excluding those themes found related to the implementation phase, which can be found in Figure 3.

Challenges.

• Limited Summarizing and Explanation Capabilities: Two participants expressed dissatisfaction with LLMs’ ability to summarize and explain. One participant, P3, touched on this concept, as well as ChatGPT’s hallucinations (discussed in Section 3.1.2): “I’m a little bit wary of asking it for expert questions. [...] [I] asked for it to review a paper for me, and it ended up not actually reading the paper, [and] making generalizations [...] I even asked for it to make a work cited, and it gave me cited resources that didn’t actually exist.”

3.2 RQ2: How have LLMs influenced software development processes?

To answer RQ2, we loosely organize our findings based on the software development life cycle [87] and the agile development methodology [30].
The key themes that emerged in understanding how LLMs affect developers’ software development processes are as follows:

3.2.1 Requirements and Planning. We use a common definition of a requirement: a software capability that must be met by a system or system component in order to satisfy a specification [46]. Similarly, we define planning as the process of collecting requirements from stakeholders, scheduling, and resource estimation/allocation [87].

• Discovering Missing Components: Two participants utilized LLMs for uncovering missing components, as P15 shared: “I have an idea of what I think it should be, [...] and then I put it in ChatGPT and say, ‘What are some gotchas, or what are some things that I’m missing? Or what kind of questions should I be asking in order to fill in any blanks that I might have?’ So [ChatGPT is] kind of my consultant, as if it was a more senior developer than me, or more a manager than me, or something like that, to where I’m going to get some feedback.”

• Prototyping: Prototyping was another task that two participants found LLMs to be proficient in. As P12 shared, “[ChatGPT] can [...] kind of prototype out what I want to build faster than I could probably do it myself. And then on, when I think it’s looking right or something, I’ll go actually try to implement it.”

• Refining Requirements: LLMs’ assistance in refining requirements, especially for independent contractors, appeared in two interviews. For these participants, LLMs seemed to be particularly helpful in this area. P9 shared, “Say, if I got a new contract job that I’m scoping to try to help my clients to try to refine the technical requirements, and especially so on the domains I’m not familiar with. So, most recently, one of the contracts I got was to try to create a music visualizer [...]
I have never previously dealt with creating apps with specifically musicians before. So while I was talking to them getting requirements at the same time, I would have the ChatGPT window on the side.” P10 shared that ChatGPT helped him research and cut down discovery time when interacting with stakeholders, and that this was important for him throughout the software engineering process: “Because I’m a kind of a one-man shop currently, I really need someone or something to help me to kind of prop up all those processes.”

Challenges.

• Inability to Replace Human Involvement: Twelve participants stated that LLMs did not have an impact on their requirements gathering. Participants mentioned that they received requirements and plans from their supervisors or superiors, who work directly with client needs, and these can’t be generated by LLMs. As P2 shared, “[Requirements gathering] hasn’t changed because most of the work that I’m doing is with client developers and product managers from the client side. So we need to talk to them for the requirements we need to specify — like, we need to gather all the requirements from them. And because our projects are very client-specific, I cannot just go on the Internet to see what they would expect. We need to talk to them to figure out the requirements.”

• Limited Requirements Detailing: One participant, P3, shared how ChatGPT cannot help to generate detailed requirements: “I feel as though the requirements it gives me are very common sense, if that makes sense. But it doesn’t really get into the usability aspects of it, or it doesn’t get into like some of the more fine-tuned [requirements], like how should something be done.”

• Struggles with Multi-Criteria Decision-Making: We found in one of our interviews that multi-criteria decision-making can be challenging for LLMs. P16 shared, “So I was looking at flight from [point A to point B].
The flight price with [one airline] went up by a lot of money, and I did not want to pay that much, so I was willing to take different flights, like, you know, fly to [point C], and then [point B], see whatever is cheaper [...] and then I said, ‘Okay, so ChatGPT is not good with math or logistics,’ and I did have to do it manually.”

3.2.2 Design and Ideation. We consider design to be the process of finding solutions and creating a more detailed technical plan as a result of the requirements-finding process [87], where ideation is a critical component in designing. Decomposition is also part of the design process, and Chattopadhyay et al. have noted how developers employ particular strategies in order to decompose their tasks into smaller units [14].

• Increasing Problem Decomposition: Eight participants acknowledged that LLMs encouraged or necessitated problem decomposition into smaller, more manageable components. This is because LLMs can only take in a limited amount of information at a given time. Consequently, participants needed to either decompose their problems for the LLMs or provide their already decomposed problems. P4 shared that for him, ChatGPT “might force me to break [problems] down [...] I guess it might be a good forcing function to make me break them down in the subtasks and subrequirements [...] If I can’t explain it to ChatGPT, then it might be an indicator that I don’t yet understand it.”

• Utilizing for Ideation: Six participants acknowledged using LLMs for ideation, leveraging their extensive knowledge bases and generative capabilities. Participants highlighted LLMs’ abilities to generate diverse ideas, albeit with varying levels of accuracy. Because of LLMs’ broad bases of knowledge and generative capabilities (further discussed in Section 3.1.2), they could be relied on as a source of ideas.
P12 shared that for them, “Probably the ideation stage is where [Bard] shines the most, because you can’t trust it to be accurate or exactly what you want, but you can trust it to generate some cool content for you, some cool code ideas, some cool writing ideas.”

Challenges.

• Inapplicable for Decomposition: Seven participants shared that LLMs have not affected their problem decomposition, viewing it as inherently more reliant on individual reasoning than on anything LLMs can do. P13, for instance, stated, “I don’t think [it has impacted decomposition]. I like planning them myself and then going to ChatGPT for more fine-grained planning about how I should actually implement the code.” P16 shared that she intentionally did not use ChatGPT for “super big things that need to be broken up,” instead only using it for smaller tasks that she believed would be more likely to be answerable. Other participants alluded to only using LLMs for simple tasks and thus avoided struggling with the decomposition process.

• Stubborn Responses: Six participants noted that LLMs were stubborn, which could limit their usefulness while designing. P1 noted that “These language models ... have a tendency to be stubborn,” and P12 stated that ChatGPT tends to “stay with the previous output a lot.” Another participant, P11, remarked on ChatGPT’s struggles to adapt to new input and modify its outputs: “[ChatGPT] doesn’t do a good job of changing itself.” However, P3, who did not believe ChatGPT was stubborn, shared that his opinion was such because he recognized its inherent limitations with regard to complex tasks: “I’ve never really thought [ChatGPT is] stubborn [...] I guess I’ve kind of learned through searching online [...] to get it to do like what you want. And then realizing it doesn’t do it, [...] it’s not that it doesn’t want to, it just can’t.”

3.2.3 Implementation. The implementation process with LLMs parallels code reuse.
Therefore, we adopted Rosson and Carroll’s classification [84], dividing the process into three distinct phases: (1) Finding Context, which entails prompt engineering and the process of locating pertinent responses; (2) Evaluating Context, which centers on evaluating the accuracy and usability of the generated responses; and (3) Integration, which pertains to incorporating the generated text into an individual’s own code or within an existing codebase. We note that the following bullet points’ titles are framed from the perspectives of the developer/prompter, rather than the perspectives/capabilities of the LLMs.

(1) Finding Context.

• Varying Prompt Specificity to Impact Answers: Developers’ diverse preferences for broad prompts, specific prompting, or adjusting query specificity to influence LLMs’ output emerged in fifteen interviews. Broad questions are favored when seeking varied answers, avoiding assumptions, or seeking low-effort responses. Participants used broad questions and vague prompts when they were looking for varied answers. P15 shared that while he originally had used specific prompts, he found that using broader prompts resulted in improved outcomes: “[I]t started to give me actually better results, because it knew things that I didn’t know [...] When I have a more broad [prompt], then it’s able to kind of formulate its own deduction and get to the problem that it thinks that it’s solving.” P4 similarly shared that, while he had originally used more specific prompts, he found vague prompts to ultimately be more productive: “I’ve been moving to shorter and shorter, more vague prompts. So I used to be, you know, trying to do the few-shot learning approach. [...] I never do that anymore. It is such a waste of time.”

Fig. 3. Findings related to the implementation phase of software development.
Specific prompts are preferred when seeking particular answers, especially in software engineering, or when participants have a clear understanding of what they are asking for. As P2 stated, “I think it should be very specific [...] If you are going into other [non-software] domains, where there is no one correct answer, you can go for broad questions. But for software development, I think you need to be very specific.” P16 mentioned, “I think it depends on how well you understand the question you’re asking, because I’ve asked ChatGPT very specific questions, but those were specific and to the point.” Four participants acknowledged that there is not a single correct strategy for choosing the level of specificity. They emphasized the importance of adaptability and experimentation, suggesting that users can benefit from trying both vague and specific prompts. P7 shared two use cases demonstrating the value of both prompting strategies: “Sometimes I want the question be really generic. For example, I want to design a system. [...] I want to look into all those possibilities. In that case, I want to give some general question like, ‘Could you give me some high-level architecture for this system?’ But sometimes I want some answer [to be] really specific. For example, when I read a book, there’s a sentence I don’t really understand why the logic [is] this way. I don’t want [ChatGPT] to give me an approximate answer. I want [a definite] answer.” Being able to determine when to be more or less specific is seen as a valuable skill when using LLMs.

• Enhancing Accuracy and Clarity with Follow-Up Queries: Fifteen participants shared their strategies of prompting iteratively until getting the desired response. Seven participants shared that follow-up queries improve accuracy, and two shared that they improve clarity. As P16 stated, “I think the follow-up questions allow the users to get clarity.” She elaborated on the importance of follow-ups: “The follow-up questions actually might be more important than [the initial question], more probably equally important. But I get more of my answers from the follow-up question.”

• Unique Prompting Strategies: Prompt engineering as a discipline has become increasingly popular, with courses and guides from all corners of the Internet, including from OpenAI itself [69]. Fourteen participants acknowledged the crucial role of prompt engineering in eliciting accurate and relevant responses from LLMs. P9 shared that he could find a good answer “[a]s long as the prompt is crafted carefully, and the problem is common enough to find a solution.” This ties into results discussed in Section 3.1.2, showing that ChatGPT performs best when it encounters less novel problems. Participants demonstrated a willingness to try different prompting strategies, with inspiration from online resources and previous interactions with LLMs. P10 shared his unique strategy of adding and removing context via particular prompting by using ‘+’ and ‘-’ within his prompts, as well as labeling points (e.g., A, B, C) to refer back to them later.

• Generalizing Prompts and Code for Security: Thirteen participants employed generalized prompts or used case-specific information rather than project-specific details when interacting with LLMs. P2 shared that, “Because I cannot put all the information into ChatGPT, because it’s very client-specific information, it’s confidential information [...] I just give it use cases that are similar.” By focusing on broader contexts or specific use cases, participants aimed to minimize the risk of exposing confidential project information while still obtaining relevant responses from LLMs.
This approach allowed participants to strike a balance between leveraging LLMs’ capabilities and safeguarding sensitive data. P5 felt more secure in using this approach: “ I think that what makes me comfortable about [ChatGPT’s security] is that [...] the prompts that I ask are sort of general or generalized [...] and really can’t be tied to any particular person’s identity, or any sensitive piece of information. ” •Providing Examples : Nine participants practiced few-shot prompting with ChatGPT. Few-shot prompting, which has been shown in prior research to have mixed success, entails providing examples to an LLM tool in order to receive a particular output [ 83]. For example, P11 shared that, “ If you’re able to give it an example of what you want, then it can do a better job. ” •Applying Contextual Input for Improved Responses: Nine participants highlighted the significance of providing context when interacting with LLMs, i.e., more context generally leads to better responses from LLMs. Examples include P1’s observation that “ The more context, the better, ” and P15’s explanation that “ The more information that you give it, the better [of] a response it’s going to be. ” Two participants discussed how ChatGPT’s ability to retain context distinguishes it from traditional search engines. They noted that this feature simplifies the research process and allows for more effective interactions, as ChatGPT can keep track of previous questions and responses, enabling a more seamless exchange of information. As P14 stated, “ It takes a mindset from the Googling part, and the biggest mindset shift is that it keeps context that I can build up on the question. ” •Starting a New Thread for Fresh Answers: Nine participants stated their preference for beginning a new thread when they were dissatisfied with the answers received in a previous interaction with ChatGPT or Gemini. 
This approach allowed them to refresh the context and seek new responses without the constraints or biases of previous chats. P15 detailed his experience with ChatGPT: “ [I] have it continue refining what I want to do. Sometimes, it’ll get to the point where it kind of [stops] working out. Maybe the context gets a little skewed the further you get down in the chat. So then I’ll just take whatever it had there and then my problem, and then start a new chat. So, recreate the context. And this is where I’m at and trying to do this, and then I can kind of start over and and refresh where it’s at. ” Some participants mentioned that when they chose to open a new thread, they often improved the specificity or clarity of their prompts. P16 shared that when she continuously gets an incorrect answer, “ I open a new thread because sometimes I’m like, I learned my lesson; I learned. I know what [ChatGPT is] going to say now. So, let me open a new thread and start all over. And I’m going to be more specific, and maybe we can solve this together. ” Challenges . •Struggles with Integration and Adjustments: Five participants observed that LLMs faced challenges when integrating context and making adjustments or changes. Instances were shared where LLMs struggled to modify specific components of their outputs without affecting the entirety of the generated code. P1 shared an incident in which ChatGPT struggled to change minor components in its output: “ It wouldn’t change just that piece of the code. It would change all of the code. ” Participants also expressed concerns about LLMs’ abilities to retain the original context over multiple prompts or interactions. One participant, P8, shared that in software engineering (not other disciplines), “ [W]hen I’m asking more questions, and I give [ChatGPT] more context, I confuse it more. 
” He described situations where ChatGPT would forget previous information or modifications, leading to the need to start fresh threads to maintain clarity and accuracy. Similarly, P8 noted that higher specificity in prompts could sometimes result in ChatGPT making assumptions or becoming overwhelmed with additional information. Additionally, P3 shared that ChatGPT often struggled to integrate non-functional requirements: “ [ChatGPT] usually kind of messes up on those, which is why I avoid [including] non-functional aspects [when prompting]. ” •Limitations of Context Window: Five participants highlighted the limitations of LLMs’ context windows. They noted instances where LLMs ran out of space to generate code or lost context when prompted multiple times. P15 echoed this sentiment of ChatGPT forgetting its original context and how this factors into him opening new threads: “ [A]t some point further down the line, it’s going to forget certain things that were really, really important to consider. And so sometimes I’ll say, ‘This is really important to remember. ’ [...] It’s not always perfect, but that’s kind of why I take the context of everything I learned now up to this point, plus my original message, and then readjust and do a new chat, just so [ChatGPT] can kind of get a fresh start at looking at it. ” This limitation could lead to frustration and decreased effectiveness in generating accurate responses, as P12 stated: “ I’m probably better off doing it myself. At that point, it’s probably going to be a waste of time to continually trying to re-prompt it and re-prompt it. ” •Necessity of Follow-up Queries: Two participants shared that they often relied on follow-ups. This was in part because LLMs did not always get things right the first time. P9 stated, “ Very rarely, [...] I can just get the code that I want in one try, even though there are specific prompt engineering templates that I follow. 
” As mentioned, participants often use broad prompts and then provide specific follow-up questions, demonstrating the iterative process of prompting. One participant, P14, shared that “ If it’s an easy function, [the task requires] one [follow-up]. And if it’s a very difficult problem, sometimes it takes like 50. ” •Challenges of Providing Context: While acknowledging the benefits of context, one participant noted the challenges of providing sufficient context, especially when they are unsure of what specific examples to provide. P11 mentioned, “ As far as context, [ChatGPT] benefits a lot from examples [...] If you can give it an example of what you want, then it can do a better job — but often you don’t know what the example would be, because you’re asking it to do something you don’t know how to do. ” (2) Evaluating Context. •Verifying through Reading Code: Ten participants shared that they rely on reading their code as a primary method of review. This involves going through the code manually, line by line, to understand its logic, structure, and implementation details. Reading allows developers to identify potential issues, errors, or areas for improvement within their codebase. P9 noted that reading could be a sufficient check for a simple piece of code: “ For me, as a professional developer, it is still my job to validate [that] this piece of code is going to work, and it’s going to integrate well into my existing codebase. So to do that, if it’s something that’s simple enough, I can just mentally do a quick check — me reading it line by line — and do a quick check mentally, I can do that. ” •Verifying through Output: Similar to testing, ten participants shared that they engage in validating their code by checking the output after reading through the code. 
This involves examining the results of the code execution, which can provide insights into its correctness and functionality. Methods for output verification include checking console logs, manual testing, and comparing the generated output to expectations. As P8 shared, “ If that output would be contradicted with what I expect, I discard it. ” •Verifying through Manual Testing: Seven participants also engaged in manual testing practices, such as verifying code templates and solutions, playing around with isolated examples, and using developer tools or read–eval–print loop (REPL) tools to test code functionality before integration into their codebases. P10 shared that, for his process,“ I verify its templates, and I verify its solutions. I put that in a inbox to test it. ” Additionally, checking console logs, verifying expected function return values, and monitoring IDE warnings or errors were common practices among participants to ensure the correctness and integrity of ChatGPT-generated code. As P13 noted, “ [I see] if my editor is like giving any warnings or errors. ” •Verifying through External Tests: Four participants employed external testing frameworks and tools, such as Pi Test (P9) and LangChain (P4), to evaluate the generated code. These tools facilitated automated testing, bug detection, and even collaboration between different agents to review and improve the code. •Verifying through External Sources: Two participants, one more experienced and one less, shared that they seek validation of their generated code quality from external sources such as the Internet. They look up documentation, search for common solutions on platforms like Stack Overflow, and review discussions to ensure that their implementation aligns with established practices and expectations. 
P13, a less-experienced developer, described it as an additional step she took for validating even after the code seemed to be functional: “ I always take [the generated code] with a certain pinch of salt, and even though the code seems to work when I plug it in, I still always try and, you know, do a web search. [...] I always go back and do a web search to ensure that, yeah, this is one of the common solutions to the problem, and it’s a good one. Basically reading all of the comments, like all of the discussions that happen on Stack Overflow, I think are really helpful to verify that my implementation is in line with how people would expect it to be done. ” •Clarifying Code through Generated Explanations: Two participants shared that they ask LLMs to explain their own generated code as part of their evaluation process; as P4 shared, he “ [asked ChatGPT] to explain itself. ” This involves requesting explanations for specific lines or segments of code in order to gain a deeper understanding of its logic, functionality, and rationale. •Assessing Visually and Logically: Two developers, who both worked in front-end development, noted that they visually assessed the quality of their code. P5 noted that he could both read and visually check the quality of his code: “ I first run it in my head to see if it flows logically. [...] So when I look at the code or look at the class, and I’d say, ‘Okay, that’s generally what I’m going for, based off of what I know. ’ [...] I’d either load up my own personal test environment and like a web browser, or I just copy and paste the code, and just see if it gives the result that I’m looking for. ” •Judging Quality through Readability: One participant, P7, shared how he used the process of assessing readability and adherence to established principles in order to judge the quality of the generated code. 
He shared that he “ judge[d] whether it code is good or not ” by evaluating the readability of the code through three strategies: applying the SOLID principles [ 90]; referring to the concepts he had learned through RC Martin’s clean-code manual Clean code: a handbook of agile software craftsmanship [53]; and cross-referencing code against the programming language’s official documentation. •Improving via "Self"-Correction: One participant, P12, shared his experience in which ChatGPT could recognize and rectify errors it had made in its code without explicit user guidance: “ Sometimes [ChatGPT] literally will write something, or it’ll write some code, or even English text, and [you’ll] be like, ‘Hey? You made a mistake here. What was your mistake?’ And it will just correct it. You don’t even have to tell it what the mistake is. It’ll just be like, ‘Oh, my bad, I did this, this, this, and this wrong. Here, let me go fix it. ’ And it’ll write it up for you. ” This self-correction capability contributes to the reliability and usability of LLMs like ChatGPT in code generation tasks. Challenges . •Increased Skepticism: The existence of generated code from ChatGPT made five participants more skeptical and vigilant about evaluating the quality and reliability of the code. P1 shared that the existence of generated code made them more likely to check and test code: “ [I]n a sense, yes, it has changed testing. I’m a bit more skeptical. ” •Necessity of Evaluation: ChatGPT has previously been found to have varying quality with regard to its generated code [ 50,51]. Seven participants emphasized the importance of thoroughly checking the generated code for accuracy and relevance. P4 mused on the general quality of generated code: “ [Generated code is] not good enough that I don’t have to read it yet, although that might be nice one day— might put me out of a job, though. ” Participants advised caution and suggested not solely relying on LLMs’ outputs without verification. 
P6 shared, “ Be wary of what you get [...] Don’t give up on checking the responses. ” P1 particularly speculated on the hypothetical case in which ChatGPT’s generated code could pass tests but fail to fulfill the intended requirements. This underscores the necessity of thorough review processes, which should similarly be in place for reviewing human-created code; as P16 shared, “ [ChatGPT]’s not going to get it right every single time, and we need human overview on that [...] And when it comes to human code. I think it’s the same thing. ” Two participants shared that they paid more attention to reviewing the quality of generated code compared to human-written code, indicating a higher level of scrutiny for ChatGPT’s outputs. •Lack of Self-Evaluation: One participant, P11, mentioned ChatGPT’s lack of self-evaluation as a limitation, which makes it inferior to even junior developers. He stated, “ [...] It’s not evaluating its own code. It doesn’t actually have the ability to do those things that I would do to recode. I just can’t trust it in a certain way; I know it’s not making certain decisions based on facts or logic. ” While this was one participant’s view, as noted in the above section on Evaluating Context, another participant (P12) commented that ChatGPT could correct itself. •Lack of Contextual Clarity: One participant, P4, emphasized the importance of context in understanding code, suggesting that humans have an advantage in having and being able to provide contextual information, especially about how and why code was created: “ [I]f a human hands me a piece of code, I can ask them questions about it. If someone handed me code that was written by ChatGPT, I don’t know what I would want. [...] would I want the prompt from the code? Because I’m missing the context now of how and why the code was generated.[...] 
If it’s a really small piece of code, I can probably just read it and figure it out and either choose to throw it away or go ask ChatGPT, ‘Hey, will you validate this code for me?’ I think if a human gave me ChatGPT code with the prompt that generated it, I would feel a lot better. ” (3) Integration. Here, we examine integration across two dimensions. First, we investigate the practices employed by developers to incorporate the generated code into their work. Second, we assess the degree to which developers actually utilized the generated code. •Modifying Before Integration: Fifteen participants reported that they modified the generated code before integrating it into their projects. They described refining the code to align with their specific requirements and overcome potential errors. P13 shared her process: “ Once I got that overall approach, I refined it in a few places, and then I sent it back to ChatGPT [...] ’Kind of give me boilerplate code for that. ’ So I think it did a really good job at giving me that boilerplate code. And then I just had to do a lot of refinements within it to overcome certain errors. ” P2 noted the need to adapt the logic of LLMs into the structure of existing projects and compared it to integrating code from other sources: “ [Integrating ChatGPT’s code results into my code base is] similar to what I used to do with Stack Overflow. It will definitely give me a snippet, but variable names and all of that will be very different. The code structure will be different. So I’ll just take the logic, but follow the structure which is already the pre-existing code base. ” •Copy-Pasting Code: Twelve participants reported simply copy-pasting the generated code into their development environments. 
They found this process straightforward and efficient, allowing them to quickly incorporate the generated solutions into their projects. P7 stated, “ Maybe there [aren’t] too [many] steps. Just asking the question, and if the answer looks good to me, I would just copy-paste. ” •Rewriting Code: Three participants mentioned that they preferred to rewrite the generated code manually in their development environments. They did this to gain a deeper understanding of the code and to mitigate potential errors that might arise from directly copy-pasting the generated solutions, giving them greater control over the code integration process. As P9 shared, “ I re-implement [the code] in Python myself instead of just taking the output directly and pasting it there. ” •Using Generated Code: Participants varied in how often they discarded generated code. Six participants rarely discarded code, two disposed of it about half the time, and eight often discarded it. Those who seldom discarded code pointed out that, with well-crafted prompts and in certain programming languages, LLMs are capable of producing viable code, albeit with some limitations; as P13 shared, “ I would say [I discard code] not very often. It seems to be useful in the context that I use it. I think I’ve been frequently throwing out code given by ChatGPT only in cases of SQL. ” •Discarding Generated Code: Eight participants shared that they threw away a lot (greater than 50%) of their generated code, while two shared that they threw away about half of their generated code. Frequent discards and reduced usage of generated code were more prevalent among those eight who used LLMs for ideation, consulting, or learning. P4 shared his reasons for throwing away code, along with the value that he still received from discarded code: “ It’s good at getting me from zero to something, but I iterate a lot on it. 
I throw away a lot of code, and sometimes I choose not to do what it says to do, like maybe it’s using some design pattern, or it’s just being too clever with the code. I don’t generally have a problem with with how it’s doing stuff. It just might be different than how I would do it. For the type of stuff I’m building, I value being familiar with my own way more so than even code quality at times. [...] I think it’s just the nature of the iterative prototyping that I do, that I throw away a lot of code, no matter what. But that doesn’t mean I don’t get value from the ChatGPT code. I think it still teaches me a lot of stuff. ” Challenges . •Friction with Copy-Pasting: One participant, P4, expressed his frustration with the copy-pasting process due to the need to switch between applications or rely on third-party plugins: “ [E]very time you want to interact with ChatGPT, there is friction just because of the UI. You have to switch to the web application, or you have to get one of these unofficial third party plugins [...] So right now I copy-paste, and I find that annoying, but the value is still high enough. So it’s like, what is the cost versus value gained. ” 3.2.4 Testing and Code Review . •Generating Unit Tests: Seven participants had used LLMs for generating tests, particularly for unit tests due to their relatively basic and formulaic nature. As P9 shared, “ Sometimes I use ChatGPT to create simple unit tests for me, instead of me writing from scratch again. But then, I only save those for smaller functions after I’ve done my refactoring. ” One participant, P15, noted that using ChatGPT to generate tests encouraged him to incorporate more testing into his coding practices: “ [ChatGPT] does write unit tests for me. So where I would not normally have them, if it’s able to write unit test for them, then I’ll have it do that. 
We don’t spend a lot of time creating tests as much as we should. But when when it comes to certain functions, like even just the basic functions [...] usually [the generated tests] are pretty good when it comes to just a small function. ” •Simulating Code Reviews: Two participants, who worked independently as contractors or in small teams, used LLMs for code review due to the lack of colleagues to review their code; an additional participant used ChatGPT to review chunks of his code. For the two solo developers, using LLMs for code review served as a way to ensure their code met professional standards and to compensate for the absence of traditional code review processes within their teams. As P9 shared: “ [R]ight now, I’m working as a contractor solo. I don’t have the privilege of getting somebody to do code reviews for me. So I would say in terms of code quality, that really helped me to maintain [...] a professional coding level. ” Challenges . •Inapplicable for Code Review: Eight participants indicated that they did not utilize ChatGPT for code reviews or that it had minimal impact on their code review process. P15, despite being a big proponent of using ChatGPT, expressed hesitance toward using it for code reviews: “ I don’t use it for code review, mainly because I need to understand what the code is doing myself [...] It’s better for me to know each step of what it’s trying to do, because I need to know ... how it’s going to affect the rest of the system. So where it’s like, ‘Oh, we deleted this function chapter. [It]’s gonna say it’s fine, but in reality, that function’s being used in many places. [...] That type of thing [the LLM] wouldn’t be able to do. It just doesn’t have the big picture. 
” P4 expressed significant concerns regarding using LLMs for code reviews and preferred human reviewers: “ It would be silly to have [an LLM] replace a human, because one of the main benefits of doing a code review, at least in the teams that I work on, is transferring the knowledge amongst the developers. So using [an LLM] for [code reviews] would be a great way [sic] to remove the entire purpose of the code review. ” •Struggles with Complex Tests: Two participants noted that LLMs struggled with generating larger or more system-wide tests. P2 noted that he did not consider using ChatGPT for system-wide testing due to security considerations. P15 shared further that ChatGPT lacked the context necessary to generate larger tests: “ It gets a little bit harder if you were talking about [...] an entire web page [...] because it just doesn’t know what people are supposed to be doing [or] how it should create the test. ” 3.2.5 Debugging, Refactoring, and Documentation . •Improving Debugging: LLM tools have shown promise in helping individuals debug code [ 34,92]. Ten participants mentioned having used LLMs to debug their own code. Participants like P5 primarily utilized LLMs as an immediate debugging tool, relying on it to assist in troubleshooting specific issues with their code, especially in front-end development; as he shared, “ I use [ChatGPT] as my immediate sort of debugging tool. So I would ask it particular questions [on a] front-end component that I’m trying to develop, and I use it to try to point me in the right direction. ” For participants like P6, LLMs were especially useful in understanding bugs and their causes: “ I always have questions about why something is failing, what something is doing. So [Copilot Chat] does help on a day-to-day basis with programs. 
” Participants highlighted LLMs’ efficacy in significantly reducing debugging time by providing quick solutions or guiding them in the right direction. P5 shared his experiences: “ [ChatGPT] helps to dramatically shorten the whole debugging process. If it doesn’t give you the answer — that is, if it doesn’t give you the answer on the first try, [...] it helps to put me in some right directions to where I can do some further research or ask it more questions. ” •Reducing Syntax-Based Errors: Two participants emphasized LLMs’ utility in resolving syntax-related errors, such as missing punctuation or braces. P14 stated that, “ [W]henever I’m missing a comma somewhere or a brace. [...] I just paste it into ChatGPT and say, ‘Fix the syntax mistake. ’ ” He then shared how ChatGPT quickly identifies and fixes such errors, saving him from spending hours debugging simple syntax mistakes. •Performing Refactoring: We define refactoring as transforming the code in such a way that its functionality and behavior are preserved while its maintainability or comprehensibility improves [ 29]. Two participants shared that they used LLMs for refactoring purposes, aiming to enhance maintainability and comprehensibility while preserving functionality. P8 shared his process: “ I have a method, a class, or something that I want to verify, to read, and I’m going to get an idea of how I can do that — refactor in a better way. [...] [if] I agree with the refactoring that [ChatGPT] gave to me, I’ll paste it in my codebase. Otherwise, I try to only get the idea and implement it myself. ” Additionally, one participant utilized LLMs to condense code, particularly by converting code blocks into more concise versions. P7 noted that he did so, sharing an example prompt and ChatGPT response: “ ’This is the code I [am] writing for a for loop. Can you just convert it to the stream version?’ And then [ChatGPT] just gave me this one line of code. So I feel this is pretty useful. ” Challenges . 
•Debugging Difficulties: One participant, P9, shared a negative experience in which he used ChatGPT to debug a piece of code he had found online. He emphasized that his lack of experience with the code left him unable to properly evaluate ChatGPT’s output, leaving him frustrated: “ I was using ChatGPT to help me debug and help me revise it. But since I don’t understand the code perfectly, and relying too much on ChatGPT at that point, it was giving me the incorrect prompt, which I didn’t know until two or three hours later [...] I thought, ‘There’s something wrong with my input, ’ and but it turns out it’s not. It’s actually ChatGPT — it was mixing up Python’s syntax in there. It was actually using syntax from all languages, but that almost looked like Python. ” Fig. 4. Findings related to RQ3, emphasizing the pros and cons of LLMs with regard to their code-related output. 3.3 RQ3: How has the use of LLMs influenced the software products created? The key themes that emerged in understanding how LLMs affect the artifacts (i.e., code and software products) are as follows: 3.3.1 Quality and Complexity of Generated Code . •Producing Clean, Readable Code: In Section 3.1.2, we presented how some participants used LLMs to keep their code clean and neat, incorporate ideal design patterns, and understand industry best practices. Six participants shared that they found LLMs’ generated code to be clean, readable, and systematic. They likened its readability to that of standard Stack Overflow answers or official documentation for programming languages. Five participants, such as P4, appreciated the ease of understanding the syntax, which facilitated quick comprehension of the code: “[ChatGPT is] generally pretty good about not generating difficult to read code or overly complex [code]. 
And just asking it to improve the code in terms of readability generally works for small snippets. I’ve had good success with that.” •Generating Code with Reasonable Complexity: Five participants stated that the code actually had fair to good complexity for the tasks that they were working on. They acknowledged that while LLMs may not always produce the most efficient solutions, the complexity generally aligned well with their needs. Particularly for common tasks or well-known problems, participants generally found the complexity to be sufficient. P11 stated that he found ChatGPT to have good complexity for more common tasks, but not for more novel tasks: “ Every now and then I’ll put in a code challenge, or an interview kind of problem, and [...] I think I think it does better on those, because other people have posted about those kinds of problems. And then you can ask it, ‘Oh, can you do this more efficiently? Can you do whatever?’ And it’ll talk to you and explain to you. But again, like I said, my suspicion is that it’s only because those are things that people talk about more. If I try to give it more unique data structure problems from my actual experience, it doesn’t do as well. ” •Conducting Complexity Analysis: Four participants utilized LLMs as tools for checking the time and space complexity of code snippets and explaining the benefits of different approaches. This served as a learning opportunity, particularly for participants who were newer to computing. P16 shared that she used ChatGPT to learn about code complexity, “ [S]ometimes, if I don’t know, I’m just like [to ChatGPT], ‘Oh, what’s the time complexity, and the space complexity of this problem? And explain why and explain the benefits of using this method versus that method by its complexity. And [ChatGPT] does that sometimes. 
It’s like, ‘This saves space, but this saves time, and you can like solve this problem [...]’ ” Challenges . •Struggling with Up-to-Date Information: Two participants noted that ChatGPT’s generated code sometimes lacked up-to-date information, particularly when dealing with rapidly evolving technologies or APIs. This resulted in outdated or irrelevant suggestions, as P6 shared: “ Sometimes it doesn’t work out well, in the sense that it just gives you suggestions that are a little outdated about the use of a specific API. ” •Over-Engineered, Complex Code: Five participants raised concerns about LLMs over-engineering solutions, adding unnecessary complexity and inefficiency to simple problems. They highlighted instances where LLMs introduced extraneous modules or convoluted solutions. For instance, as P8 complained, “ One complaint that I have about ChatGPT’s output is over-engineering. [...] In some of cases, I’m feeling that sometimes that I’m asking it, ‘Okay, write a simple method for, I don’t know, multiplying two numbers together. ’ [...] Sometimes it does the over-engineering for those cases. ” P3 additionally shared this sentiment: “ It can build some things, but you know, even that [...] sometimes becomes more complicated than just coding it yourself, and you can’t create complex functionality. ” For more novel or unique challenges, some participants (like P9 and P11) observed limitations in ChatGPT’s ability to produce code with optimal complexity. 3.3.2 Optimal Use Cases . •Excelling at Small Tasks: Nine participants identified LLMs’ strength in handling small tasks, particularly those that involve routine or standard procedures like the boilerplate code referenced in Section 3.1.1. They found it most effective for tasks that could be decomposed to a low level, such as writing small code snippets or implementing basic functionalities. 
For ChatGPT, P4 shared that “I’ve kind of isolated its use cases to helping me improve code at the function level, like small snippets of code, and helping me go from zero to something”; this sentiment is connected to his practice of frequently discarding code as part of an iterative process, as presented in Section 3.2.3.

Challenges.

• Better Text than Code Generations: Two participants observed that LLMs seemed to perform better at generating textual content than code, possibly due to their training data composition [103]. P8 shared that, in his experience, “[ChatGPT] works better for text, not the codebase.”

3.3.3 Security.

• Providing Sufficient Security: Thirteen participants expressed that they considered LLMs’ code to be sufficiently secure for their purposes, primarily because they were not using it in production environments or for critical applications. It was also mentioned that LLMs’ code is as secure as any code publicly available. Seven participants shared that this was because of their specific use case, like P2, who noted that “I’ve never had to use it for any security aspects.” P14 additionally used a VPN, and P4 and P15 shared that they believed their reading of the generated code was another layer of added protection against security concerns (P4: “It’s not like I’m not reading the code. So if it’s trying to like, wipe my hard drive, or leak customer data - There’s no scenario where that could happen.”)

Challenges.

• Concerns Sending Data to LLMs: Eight participants shared that they specifically did not provide LLMs with identifying, confidential, or otherwise proprietary information in their prompts. P3 expressed concerns over the lack of “trust and safety controls.” A few participants noted concerns about their data being used by companies like OpenAI. P4 stated, “There is, of course, concern with what code am I sending to OpenAI.
” P12 shared related concerns: “If those interactions with Bard or ChatGPT get logged and used for training data in the future, it could be that those models start outputting production code or like, our own internal code. And that’s something we want to avoid.” It is worth noting that, since April 2023, ChatGPT has included an option to prevent a user’s queries from being used to train or improve the model [20], a feature that both P14 and P15 mentioned using in order to reduce their security concerns. At present, Google Gemini additionally allows users some control over how their data is used by Google [62], and Copilot allows users to opt out of sharing their data [61].

• Concerns Using Data from LLMs: Eight participants emphasized developers’ responsibility to ensure code correctness, especially regarding the possibility of copying and using code without review. Two participants explicitly stated that LLM code should not be copied without review due to security concerns; P11 shared that, “I think it’s dangerous, right? Like, if out of our work on a team, somebody would just copy-paste ChatGPT [code], you know, I’ll probably be annoyed by it.” P1 further expressed concerns over malicious actors injecting poisoned code that is later used to further train ChatGPT, which could result in the tool generating exploitable code in the future.

We note that both this and the above developer-identified challenge are clearly informed by applying knowledge of how LLMs work to determine possible risks of sharing data with LLMs and using LLM outputs in production code.

3.4 RQ4: How may the software industry and education be affected by LLMs?

The key themes that emerged in understanding how LLMs impact two areas of society, the software industry and CS education, are as follows:

3.4.1 Industry.
• Comparing LLMs to Existing Entities or Roles: In nine interviews, participants described LLMs as fulfilling roles commonly filled by other people or tools, including pair programmer, assistant or secretary, junior developer, rubber duck, or simply a tool. For instance, P10, whose colleagues had all been laid off, mentioned, “I’ve been working solo by myself for a few months now. [ChatGPT] is my junior who [is] trying to help me.” He further added, “I still write [the code] the way I do. [ChatGPT] just has some extra eyes that [are] helping me, to guide me through [the coding process] or to give me some recommendations overall.” P5 also mentioned, “[ChatGPT] is like the rubber ducky that we would have, except now it produces answers, and it talks to me, and it gives me all of the advice that I would need.” These results suggest that developers are in the process of figuring out the future of their careers and how LLMs may ultimately fit in.

• Minimal Impact on the Job Market: Nine participants expressed that jobs in the software development field would remain largely unaffected. They argued that software development involves more than just code writing, and LLMs lack the capability to fully substitute human developers. As P3 noted, “I don’t really see it replacing people because I think if you’re at the point, at least with the current version of it, that it could do your job, you probably weren’t as good of a programmer in the first place.”

Fig. 5. Findings related to RQ4, split horizontally by findings on industry (above) and CS education (below) and vertically by opportunities (left) and challenges (right).

• Widespread Use Among Peers: Four participants observed widespread use of ChatGPT among their peers. P7, for example, pointed out, “[...] maybe 90%, maybe 80% of my team started using ChatGPT from January or February [2023].
Maybe because we’re a tech company, people know this quicker than the others. And now I think 100% of people are starting to use it, just depending on the use case.” This suggests a growing acceptance and adoption of LLMs within professional environments, particularly within tech companies.

• Lowering Entry Barriers: Three participants mentioned LLMs could lower the barrier for entry-level positions by providing assistance and resources. P6 shared his hopes for LLMs being able to answer novice programmers’ questions: “[ChatGPT] has lowered the barrier to entry in terms of coding. So you can really go and ask and stuff, and depending on what your level of expertise is or what your level of questioning is, it will provide you with the level of answers. So you don’t have to rely on books or universities and then going through the [usual] courses. Not that there is anything wrong with them, but it’s a different way of approaching programming.”

Challenges.

• Changing the General Job Market: Nine participants viewed LLMs as technologies that would change and re-purpose jobs in general. Although they acknowledged LLMs’ capacity to diminish some jobs, they noted their potential to create new job opportunities as well. For example, P10 noted, “[ChatGPT] is a very powerful tool, and it’s going to kill a lot of wild white-collar jobs for sure, but more jobs will be created. [The issue] is just how fast that those jobs will be created and whether the people that lost their job will get trained.”

• Absence and Necessity of Guidelines: Eight participants emphasized the importance of establishing guidelines for LLM use, especially for larger companies or those dealing with security-sensitive information. They highlighted the need for clear rules to ensure the secure usage of LLMs. P9, for example, noted, “I think 100% [that companies should have guidelines], especially when you’re dealing with things that are sensitive; for example, financial institution or medical [data]. And I actually do think that instead of them trying to use third-party LLMs from, for example, OpenAI, they should start building their own models and have them housed within their own PVC cloud so that it’s secure enough for their standard instead of trying to buy this off-shelf.” P7 highlighted the utility of more specific guidelines, noting that some companies should provide prompt templates. As he noted, “I think that would be good [to have guidelines] because I’ve heard there are some question templates to let ChatGPT give us what we want.” Of these eight participants, five noted that their companies lacked specific guidelines for LLM use. However, they noted that existing rules, such as not sharing sensitive information, apply to using LLMs. P5 shared, “We’re a pretty small company, so we don’t have too [many] regulations against it, and a lot of people work really like me and the guy that I’m working under. He uses it fairly heavily as well. So, outside of making sure that we don’t provide any sensitive data or information to ChatGPT, we’re pretty much just okay to use it.” Similarly, P2 stated, “We had these guidelines even before ChatGPT; the guidelines were that we are not allowed to share any client-specific information with anyone.”

• Lowering the Need for Certain Roles: Five participants highlighted LLMs’ potential to decrease the number of developers required for certain tasks and roles, such as user interface professionals or data scientists. P8 mentioned, “I’m not sure [if] it’s changed a lot of things in software engineering, but in the other fields, like data science, I’m seeing a [change] because, in those fields, there are people who don’t have a lot of deep knowledge about computer science. [...]
For example, they have a piece of Python code, and they want to do some logic. They have a matrix, and they want to transpose it. [So] they can use ChatGPT very easily, and it does that for them. ” •Undermining Entry-Level Positions: LLMs’ potential to undermine the demand for entry-level roles emerged three times in our study. P6, for instance, mentioned, “ I do find the level of code generated is sometimes almost as good as a junior software developer. So I think that really up the bar of hiring for junior software developers. When I graduated, ChatGPT wasn’t around, and I probably didn’t have to expect to know that much. ” P15 also stated that newer developers should broaden their skill set in order to stay ahead of technological advancements and mitigate their negative impacts:“ As long as you have a bigger picture of things, and you’re able to engineer things more thoroughly, you can beat the pace of ChatGPT. ” •Exploiting LLMs in Job Interviews: Two participants raised concerns about candidates using LLMs to pass interviews without genuine knowledge or skills, so formulaic interview techniques might need to change. P1, for example, mentioned that people could very easily use ChatGPT to pass, “ without actually knowing how to pass the interview. I don’t think that necessarily makes them a bad software engineer, but I think it can break this formulaic interview a little bit. ” 3.4.2 Education . •Encouraging as Learning Tools: Ten participants shared that CS curricula should integrate LLMs into education rather than banning them outright. P5 shared that, “ If universities are trying to prepare students for work in the industry, I would say that exposing the students to how to properly use ChatGPT to create impactful works in code and program would certainly be beneficial in the professional corporate setting. 
But, you still need to have the foundational knowledge of your data structures, algorithms, how to design code databases, SDLC, all of those different core topics. Those are still needed within the curriculum.” Participants highlighted LLMs’ potential to assist students in problem-solving, generating personalized problem sets, and improving prompt engineering skills.

• Teaching Prompt Engineering: Four participants emphasized the importance of teaching students prompt engineering. For instance, P13 mentioned, “I think students should be introduced to these tools and how they can prompt better. For instance, one of my instructors at university, the first thing that they covered in our first semester, was how to Google. I think this is going to be something along the same lines as how to prompt.”

• Impracticality of Bans: Three participants viewed the issue from another perspective: the impracticality of banning it. For instance, P2 stated, “You cannot run away from it because everyone has access [...] Students are going to use it, no matter [what].” Participants emphasized the potential benefits of LLMs in supplementing learning. As P16 shared, “I think degree programs should encourage students to use ChatGPT in the right way.” Participants felt this could be particularly important for underprivileged students who may lack access to traditional tutoring or resources.

Challenges.

• Emphasizing Fundamental Concepts: Six participants stressed that foundational knowledge, theoretical understanding, and fundamental concepts should be prioritized over the integration of ChatGPT. They believed that topics like software architecture, algorithm design, and problem-solving skills should take precedence.
For instance, P1 mentioned, “ I think [universities] should be doing more of what they should have been doing in the first place, which is not necessarily focusing on specific implementations and specific kind of optimal algorithms, but rather on the bigger, more difficult-to-grasp ideas that have to do with architecting software and that have to do with thinking about what it takes to get from a business idea to an actual product. ” Concerns over LLMs being just the latest tool or technology that may become obsolete also appeared. For example, P6 mentioned, “ One thing remains constant [in CS education] that you are studying and you’re understanding and you’re learning the programming languages and you’re learning the technology that will be obsolete by the time you get to the market for a job. I don’t think that the programs should [change based] on what’s the latest and the greatest, ChatGPT, Anthropic AI, this AI, that AI, or this fancy new programming language. I think that’s the wrong way to go. I think, in general, the program should be focused on more fundamentals because those largely remain the same. ” •Needing to Adapt: Six participants acknowledged the flip side of the impracticality of banning LLM usage, suggesting ways that computing education programs should adapt to the new reality with LLMs. Four participants expressed that the ease of plagiarism with LLMs encourages students to use them for ready-made solutions, introducing a need for change. Two felt cheating was going to happen, with P16 suggesting that cheating is inherent to certain individuals: “ People who cheat, they’re going to cheat anyway, and that’s gonna get caught up with them. That’s a character trait. ” P8 mentioned, “ I think [students] don’t write any code in their homework based on the [access to] ChatGPT because they already have whatever they want. 
” Two participants suggested that instructors mitigate the impact of cheating concerns by redesigning assignments to ensure they require a genuine understanding of concepts through critical thinking and problem-solving skills. P13 suggested comparing student work to LLM answers: “The professors would have to run it through GPT to see how it’s performing, and see what kind of variations of answers it’s generating.” P2 additionally shared, “You can design the assignments in such a way that even if [students] use [ChatGPT], they need to understand the concepts.”

Two more participants mentioned the importance of preparing students for coding without LLM assistance. Although P5 was generally a proponent of integrating ChatGPT into the curriculum, he stressed the importance of independence: “You need to be able to create code on your own, but also collaborate with others as well.”

4 Discussions

As outlined in Section 3, we have highlighted the advantages and challenges of current LLMs across four key dimensions. Based on our findings:

4.1 Implications for Future Software Developers

4.1.1 Educating future software developers on prompt engineering. Our participants emphasized the importance of prompt engineering as a key technique for effectively utilizing LLMs. This underscores the need for workshops and training sessions to focus on prompt engineering, particularly on strategies such as maintaining specificity and brevity, formulating problems clearly, and considering linguistic nuances for optimal results. To support this learning, developers can access formal resources, such as OpenAI’s comprehensive guide on prompt engineering [68] or DeepLearning.AI’s free (see footnote 10) course on prompt engineering practices for developers [31]. These resources provide valuable insights and practical techniques to enhance their expertise in working with LLMs.

4.1.2 Educating on problem decomposition.
The ability to break down complex problems into manageable components is crucial for maximizing the effectiveness of LLMs. Our findings indicate that LLMs perform optimally when presented with tasks that are clear, concise, and limited in scope. By developing strong problem decomposition skills, developers can better align their challenges with the capabilities of LLMs, significantly improving the efficiency and accuracy of their solutions. Mastering this skill empowers developers to leverage LLMs as powerful tools in addressing a wide range of software development tasks. 4.1.3 Setting realistic expectations for LLM use. Developers must recognize the inherent limitations of LLMs, including issues such as unreliable responses [ 104], hallucinations, knowledge gaps, and difficulties in maintaining consistent contextual understanding [ 37]. Although LLM accuracy and effectiveness continue to improve, their performance remains variable across different domains. Understanding which domains to trust and the extent of reliability is essential for using these tools effectively. One practical strategy is to explicitly identify the software engineering tasks for which LLMs excel or underperform, as outlined in this paper (see Section 3). Additionally, equipping developers with a deeper understanding of how LLMs function can bridge gaps in expectations and capabilities. Structured courses or workshops tailored to software developers’ needs could offer a robust foundation, similar to existing workshops designed for educators[82]. These educational initiatives not only enhance developers’ ability to use LLMs effectively but also provide the added benefit of deepening their technical knowledge, enabling them to adapt to and capitalize on future advancements in generative AI. 4.1.4 Adopting LLMs as Surrogate Team Members. 
Freelance developers, or those who primarily work independently, encounter a distinct set of challenges compared to their counterparts in collaborative environments. Insights from two study participants (P9 and P10) indicate that LLMs provide specific advantages uniquely suited to solo developers, which may not be as pronounced for other developer types. Based on our findings, freelance or solo developers can utilize LLMs as surrogate team members, serving as virtual colleagues, subcontractors, or assistants. LLMs can support these developers by refining and clarifying requirements, conducting research, discovering information during stakeholder interactions, and simulating code reviews to uphold professional coding standards. Moreover, LLMs can reduce, though not entirely eliminate, the need for subcontractors, thereby enhancing the efficiency and self-sufficiency of independent developers.

Footnote 10: At the time of writing.

4.1.5 Encouraging effective use of LLMs for programming. Despite their limitations, LLMs offer considerable advantages to developers who understand how to harness their capabilities effectively. As indicated by our findings, developers can use LLMs to enhance various aspects of their programming practices, including:
(1) Refactoring Code: Improving the structure and organization of existing code for maintainability and performance.
(2) Learning and Applying Design Patterns: Gaining insights into widely recognized solutions to common programming challenges.
(3) Enhancing Code Readability: Producing more understandable and clean code, which facilitates collaboration and long-term project sustainability.
(4) Automating Boilerplate and Repetitive Tasks: Quickly generating routine code components to save time and focus on more complex challenges.
These benefits not only streamline workflows but also provide opportunities for developers to adopt and reinforce best practices. Our findings align with similar research (e.g., [18, 48, 54, 76]) highlighting how LLMs can optimize developer productivity and programming quality. By using LLMs strategically, developers can maximize their efficiency and improve both individual and team outcomes.

4.1.6 Emphasizing secure practices. Developers must remain vigilant about the proprietary and security aspects of the code they share with LLMs, as emphasized by Wang et al. [101]. Inputted data such as code, prompts, and other personal information may be collected and utilized by companies like OpenAI and Google to improve their services (see footnote 11). This makes it crucial for developers to ensure compliance with contractual agreements and address licensing issues to minimize risks when using these tools. Another critical concern is that LLMs often do not provide the source of the code they generate [23] (see footnote 12), leaving the origins and security of the produced code uncertain. This ambiguity poses potential risks, particularly in production environments or sensitive projects. Developers engaged in research or ideation may perceive fewer security challenges, but adhering to robust security protocols is essential for all. Key strategies to mitigate these risks include:
(1) Sanitizing Generated Code: Carefully reviewing and cleaning LLM outputs to prevent vulnerabilities or unintended exposures.
(2) Maintaining Data Integrity: Ensuring the confidentiality and safety of proprietary code and sensitive data.
(3) Upholding Security Standards: Consistently applying established security practices to avoid compromising systems or violating regulations.
By proactively addressing these issues, developers can maintain secure workflows and responsibly integrate LLM tools into their practices.
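Strategy (2) above, keeping confidential material out of prompts, can be partly automated before any text leaves the developer's machine. The following is a minimal Python sketch under stated assumptions, not a method reported by the participants or the authors: the two patterns, the placeholder tokens, and the function name `redact_prompt` are all hypothetical, and a real project would need rules tuned to its own secrets and identifiers. Pattern matching of this kind catches only obvious cases and does not replace manual review.

```python
import re
from typing import Tuple

# Hypothetical redaction rules: (compiled pattern, replacement string).
# These are illustrative only and will miss many kinds of sensitive data.
REDACTION_RULES = [
    # Assignments such as `API_KEY = "..."` or `password: ...`; keep the name, hide the value.
    (re.compile(r"((?:api[_-]?key|token|password|secret)\s*[:=]\s*)(\S+)", re.IGNORECASE),
     r"\g<1><SECRET>"),
    # Email addresses.
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<EMAIL>"),
]


def redact_prompt(text: str) -> Tuple[str, int]:
    """Return the text with matches replaced by placeholders, plus the number of redactions."""
    total = 0
    for pattern, replacement in REDACTION_RULES:
        text, count = pattern.subn(replacement, text)
        total += count
    return text, total


if __name__ == "__main__":
    snippet = 'API_KEY = "sk-12345"  # ask bob@example.com'
    cleaned, count = redact_prompt(snippet)
    print(cleaned)
    print(f"{count} item(s) redacted")
```

Returning the redaction count alongside the cleaned text lets a caller decide whether to send the prompt at all, for example by refusing when anything was redacted and asking the developer to review the snippet first.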
4.2 Software Engineering Tools and Designs for Developers

4.2.1 Educating developers on the benefits of LLM-powered IDE extensions. Integrating LLMs into widely used IDEs such as Visual Studio Code (VS Code) can significantly optimize development workflows and reduce friction. Extensions like GitHub Copilot, Amazon Q, and IBM watsonx are already helping developers streamline different tasks such as debugging, code generation, and documentation writing. Some participants in this study mentioned the challenges of translating LLM results into their own context, highlighting the benefits of a more streamlined user experience.

Footnote 11: At the time of writing, both OpenAI [63] and Gemini [62] state that they collect and use personal data, including prompts and log data, as the default setting to improve their model and services.
Footnote 12: Pre-published via arXiv.

4.2.2 Tailoring answers for specific industries. Developers may benefit from LLM tools that provide tailored responses based on organization-specific or proprietary documents and data. These tools could be trained on or fine-tuned with project documentation to assist with legacy or brownfield development. Alternatively, they could use advanced AI techniques, such as retrieval-augmented generation (RAG) [47], which would dynamically incorporate relevant artifacts like project documents, code repositories, and communication threads into the generative process to deliver more contextually aware answers.

4.2.3 Tailoring answers for specific individuals. As a developer works over time with an LLM, contextual grounding methods can improve the model’s understanding of the developer’s specific needs. Persistent context memory, for instance, could allow models to “remember” quirks and details about the user and retain them as context for future responses.
For instance, an LLM could detect that a developer prioritizes attributes like performance over readability; as another example, an LLM could identify areas in which a developer frequently asks for explanations and proactively provide more detail in those areas.

4.3 Developments in LLMs and Ongoing Limitations

Since the interviews were conducted during the Spring and Summer of 2023, the landscape of LLMs has notably evolved. The popular GPT-3.5 model, which was the most common model among our participants, has been replaced by GPT-4o for OpenAI’s free-tier users. For premium subscribers, more advanced models, including GPT-4o, OpenAI o1, and OpenAI o1-mini, are accessible. Additionally, GitHub Copilot, the coding assistant used by two of our participants, is now also powered by GPT-4o. Table 3 provides an overview of the OpenAI and Google models’ features and performance (see footnote 13). OpenAI and Google were selected because they are the developers of the models discussed in this paper. Other state-of-the-art models, such as Llama by Meta and Claude by Anthropic, also demonstrate significant advancements but are outside the scope of this discussion.

Recent LLM developments address a few of the limitations (see the blue bullet points in Figure 1) found during the interviews. For example, larger context windows now enable improved summarization, and response latency has been reduced in models like GPT-4o compared to GPT-4. Issues such as hallucinations, contradictory answers, mixing up programming languages, and struggles with unstructured data have been mitigated to some extent through enhanced reasoning abilities and a larger context window.

Despite these improvements, many challenges still persist (see Challenges under different subsections in Section 3). For example, real-time browsing has been integrated into the chat interface of both GPT-4o and Gemini 1.5 to address outdated responses and lack of references. However, outputs rely heavily on static training data, making them non-referenceable.
This is especially problematic for rapidly evolving information such as programming libraries, where outdated information harms accuracy.

Beyond existing and future advancements, there are two considerations that are worth noting:
(1) While flagship models like GPT-4o and o1 have addressed some limitations, smaller or locally deployable models (e.g., Llama 3.2 1B) that appeal to those prioritizing privacy or cost-efficiency continue to face challenges such as constrained context windows and reasoning capabilities.
(2) Many limitations discussed in this paper are systemic and intrinsic to LLMs, such as their potential to impede critical learning processes if misused or their inability to replicate nuanced human interactions.
Hence, informed deployment of LLM technologies is essential to leverage their benefits while minimizing any potential harm.

Footnote 13: Two metrics, Quality Index and Latency, are based on evaluations from Artificial Analysis [1], an independent team focused on benchmarking and evaluating AI models. These metrics may not reflect the official evaluations of corresponding companies.

| Developer | Model | Context Window | Max Output Tokens | Knowledge Cut-Off | Quality Index [1] | Latency [1] (s) | Developer’s Description of Model |
|---|---|---|---|---|---|---|---|
| OpenAI | GPT-4o | 128k tokens | 128k tokens | Oct 2023 | 77 | 0.43 | High-intelligence flagship model for complex, multi-step tasks. GPT-4o is cheaper and faster than GPT-4 Turbo. |
| OpenAI | GPT-4o mini | 128k tokens | 16k tokens | Oct 2023 | 72 | 0.41 | Affordable and intelligent small model for fast, lightweight tasks. GPT-4o mini is cheaper and more capable than GPT-3.5 Turbo. |
| OpenAI | o1 | 128k tokens | 32k tokens | Oct 2023 | 85 | 24.60 | Reasoning model designed to solve hard problems across domains. Trained with reinforcement learning to perform complex reasoning. |
| OpenAI | o1-mini | 128k tokens | 64k tokens | Oct 2023 | 82 | 8.93 | Faster and cheaper reasoning model particularly good at coding, math, and science. Trained with reinforcement learning to perform complex reasoning. |
| OpenAI | GPT-3.5 | 16k tokens | 4k tokens | Sep 2021 | 53 | 0.38 | Understands and generates natural language or code and has been optimized for chat but works well for non-chat tasks as well. |
| Google | Gemini 1.5 Pro | 2M tokens | 8k tokens | Sep 2024 | 80 | 1.06 | Complex reasoning tasks requiring more intelligence. |
| Google | Gemini 1.5 Flash | 1M tokens | 8k tokens | Sep 2024 | 68 | 0.24 | Fast and versatile performance across a diverse variety of tasks. |

Table 3. Current Status of Language Models Referenced in the Paper. Quality Index represents the average performance across various evaluations of model intelligence, including benchmarks like MMLU, GPQA, and HumanEval. Latency denotes the time to the first token received after an API request, measured in seconds.

5 Related Works

This paper examines the role of LLMs in software engineering, focusing on their impact on developers, the development life cycle, products developed, and societal implications. Due to the emerging nature of research in this field, we include some non-peer-reviewed studies and provide transparency through footnotes, acknowledging that, as noted by Fan et al. [26], formal literature surveys can no longer capture all relevant work, making our review not exhaustive.

The most notable work on LLMs in the Software Engineering (SE) domain is a systematic review by Hou et al., which analyzed 395 papers on their impact, focusing on optimization techniques, applications, and potential use cases [36]. In contrast, our study provides a qualitative analysis based on interviews with full-time developers. Several interview-based studies have explored developers’ use of generative AI. Klemmer et al.
found that while AI assistants like Copilot and ChatGPT were widely used for security-critical tasks, developers lacked trust in these tools and double-checked their work [42]. Similarly, Rabani et al. reported that developers noted ChatGPT’s inaccuracy and the need for debugging [77]. Mendes et al. explored developers’ views on intelligent assistants, noting benefits like faster development and improved code but also challenges like poor accuracy and distractions [54]. Empirical studies, including those by Rasnayaka et al., assessed LLMs’ usefulness in programming projects, finding they help with code generation and debugging but highlighting a learning curve and no significant difference in software quality between AI-assisted and non-assisted teams [81]. Our study extends and corroborates these findings by exploring the broader interplay of LLMs across the four dimensions of our research questions (RQs). Accordingly, the remainder of the related work is structured around these dimensions.

5.1 People - Software Developers using LLMs

Recent studies, particularly pre-published ones, highlight the growing adoption of LLMs among professional developers. These studies, mainly empirical, include interviews and thematic analyses of developers’ written responses. Feng et al. analyzed social media posts and found that developers use ChatGPT for code debugging, interview preparation, and solving academic assignments, with fear being the predominant emotion related to code generation [28]. Süße et al. identified fourteen coping patterns in a case study of AI-powered chatbots in software development, with one pattern considering the AI as a virtual colleague [94]. Nam et al. found that LLMs, when used within programming environments, improve task completion by providing contextualized queries, offering a more effective alternative to simple web searches [65]. Peng et al.
discovered that GitHub Copilot helped developers complete tasks 55.8% faster than a control group [75], whereas Vaithilingam et al. noted that although Copilot did not significantly improve task completion time, it served as a useful starting point despite challenges in debugging [99]. Kuhail et al. surveyed 99 developers, finding that ChatGPT improved productivity by helping generate generic code, explain complex code, and find sources [44]. Siddiq et al.’s study on DevGPT usage also highlighted ChatGPT’s role in helping developers understand libraries and frameworks, and engage in networking and messaging [88]. Stack Overflow’s 2024 survey found that 63.2% of professional developers were using AI tools in the development process, with 13.5% sharing that they planned to and 23.4% sharing that they did not wish to use these tools [64]. The Stack Overflow 2023 survey highlighted AI usage for tasks like code writing, debugging, documentation, and testing [59], and the 2024 survey noted that professional developers believed that AI tools could increase productivity, speed up learning, improve efficiency and code accuracy, and make workload more manageable [64].

Further interview studies reveal how developers incorporate LLMs in their daily work. Pinto et al. found that a contextualized AI coding assistant built on GPT-4 reduced repetitive tasks and helped contextualize code, though it had technical limitations like poor UI and inaccurate suggestions [76]. In a 2024 study by Kimbel et al., 17 participants across sectors reported using ChatGPT for content generation, information retrieval, brainstorming, and programming but noted challenges with accuracy [41] (footnote 14). Coutinho et al. found that AI tools helped software professionals save and organize their time, but with reliability issues in generated content [18].

While these studies provide valuable insights, they fall short in addressing critical gaps.
Most focus narrowly on either professional or novice developers, specific tasks, or quantitative metrics like task completion time. To date, little qualitative research has comprehensively examined the experiences of professional developers across varying experience levels, exploring not only the tasks where LLMs excel but also those where they fall short. Our study addresses this gap by investigating the nuanced ways LLMs influence different development tasks, offering a richer understanding of their practical utility and limitations.

Footnote 14: Pre-published via ResearchGate.

5.2 Processes - Software Development Life Cycle Process

Emerging research into the impact of LLMs, particularly ChatGPT and GitHub Copilot, on the Software Development Life Cycle (SDLC) has identified significant effects, though more research is needed to fully understand their scope and quality. Prior to ChatGPT, earlier models like BERT were employed for specific coding tasks, such as vulnerability detection with 99.3% accuracy [4], while Text-to-Text Transfer Transformers demonstrated promise for code completion [17]. GitHub Copilot shifted the focus from code writing to code evaluation, with studies noting that developers often spent more time evaluating AI-generated suggestions than writing code themselves [12]. While Copilot provides significant value for experts, it poses risks for novices, who may struggle to identify or correct buggy or non-optimal code [57].

Recent studies have explored ChatGPT and GPT models across various SDLC phases. For instance, ChatGPT has proven useful in debugging [93] and generating software requirements, though the outputs are typically less detailed than those produced by humans [10]. It has also been employed in refining requirements [2]. Krishna et al.
demonstrated that advanced models like CodeLlama and GPT-4 could generate Software Requirement Specifications (SRS) at a level comparable to an entry-level engineer [43] (footnote 15). While LLMs have shown potential to assist with planning, design, implementation, and testing, they still require human supervision, particularly during the coding phase [74]. Furthermore, Sridhara et al. found that while LLMs excel at tasks like refactoring, they struggle with more nuanced activities such as code reviews and vulnerability detection [91] (footnote 16).

Studies also highlight LLM contributions to software testing. Gu [32] and Tang [97] found that LLMs could outperform traditional tools in test coverage, but challenges like prompt design and accuracy persist. Frameworks for evaluating LLM-generated code have been developed, including Yeo et al.’s work on prompt engineering [106], Liu et al.’s framework for error identification [49], and Hou et al.’s analysis of metrics like MRR, BLEU, and ROUGE to assess LLM performance in software tasks [36].

Finally, the advent of LLM-based agents marks a shift toward a more AI-enhanced SDLC. Jin et al. demonstrated that LLM-based agents, equipped with capabilities like autonomous reasoning and tool usage, handle complex tasks more efficiently than traditional LLMs [40] (footnote 17). This shift signals a transformation in software engineering workflows, where AI roles extend from copilots to supervisors, driving new paradigms in the SDLC process [71, 72].

While prior research has provided valuable insights, it remains fragmented and often focuses narrowly on specific SDLC phases or tasks. Studies rarely examine the end-to-end impact of LLMs on all SDLC activities or provide comprehensive guidance for integrating LLMs into these processes. Our research fills this gap by systematically investigating developers’ experiences using LLMs across the entire SDLC.
We uniquely organize our findings according to the SDLC steps and make actionable recommendations, identifying tasks where LLMs excel and those where they under-perform. This comprehensive approach advances understanding of LLMs’ role in software development and offers practical strategies for their effective use.

Footnote 15: Pre-published via arXiv.
Footnote 16: Pre-published via arXiv.
Footnote 17: Pre-published via arXiv.

5.3 Products - Artifacts

The use of LLMs in code generation has led to growing interest in understanding the quality of generated code. Studies have focused on readability [19], complexity [86], correctness [49], and security [33]. For example, Nascimento et al. showed that ChatGPT outperforms beginner programmers but not experts [66], while Fan et al. identified common mistakes shared by human-written and generated code [27]. Stack Overflow’s 2024 Survey found that almost half of professional developers believed that AI tools were bad at handling complex tasks; additionally, developers were split on the trustworthiness of AI tools, with newer developers trusting AI accuracy more than professionals [64].

Despite significant advances, much of the existing research emphasizes objective evaluations of code quality, often relying on predefined metrics or benchmarks. These studies rarely delve into developers’ subjective perceptions of the quality and trustworthiness of artifacts generated by LLMs, particularly in professional settings where these tools are actively integrated into workflows. To the best of our knowledge, limited research has explored how professional developers assess and trust the quality of artifacts produced by LLMs. Our work addresses this gap by investigating developers’ perceptions, their confidence in these outputs, and the factors influencing their trust in LLM-generated artifacts.
This perspective is critical to understanding the practical integration of LLMs into professional software development and guiding improvements in LLM capabilities.

5.4 Society: LLMs in Industry and Education

5.4.1 Software Industry. LLMs are poised to significantly impact various sectors, including software development, due to their widespread adoption and diverse capabilities [8, 24]. A 2018 National Bureau of Economic Research report by Bessen suggests AI could reshape the labor market by replacing, shifting, or creating jobs depending on demand [11]. With LLMs capable of generating code, concerns have emerged about their potential to assist or replace developers. Carleton et al.’s 2021 report advocates for AI’s collaboration with software engineers to enhance productivity and reliability [13]. Kuhail et al. found that over two-thirds of developers surveyed did not foresee an immediate job security threat from AI, though many recognized a partial risk [44]. Rashid speculated that AI could both replace some roles and open new opportunities, helping developers become more efficient [80]. Demirci et al. observed a 21% decrease in job postings related to coding after ChatGPT’s introduction, while jobs requiring manual labor saw less impact [21]. Additionally, research by Winter et al. showed developers prefer working alongside tools rather than having their tasks entirely replaced, suggesting that AI could support, but not fully replace, developers’ work [102]. As companies increasingly release policies on generative AI use, a 2024 interview study of 17 professionals revealed that many still lack formal guidelines for integrating AI tools like ChatGPT into their workflows [41].

Despite these advances, much of the existing research focuses on broad market trends, theoretical implications, or objective measures of AI’s impact. Few studies have directly interviewed developers to understand their perspectives on how LLMs are shaping their industry.
Our research addresses this gap by exploring developers’ nuanced views on the opportunities and challenges of integrating LLMs into professional workflows. By centering the voices of developers, we provide actionable insights into the practical and ethical considerations of adopting LLMs at scale.

5.4.2 Education. Research on the integration of LLMs into education, particularly in computing, has gained attention. Tanay et al. (2024) observed that LLMs helped upper-level computing students obtain information and complete tasks more efficiently in software engineering projects, though concerns were raised about potential negative impacts on learning outcomes [95]. Essel et al. (2024) found that ChatGPT enhanced undergraduate students’ critical, reflective, and creative thinking skills [25]. This has sparked discussions on adapting the computer science curriculum to incorporate LLMs. Ozkaya proposed that students should learn to collaborate with LLMs and AI applications, especially for legacy systems [73]. Additionally, Jeuring, Groot, and Keuning highlighted the strong correlation between computational thinking (CT) skills and effective co-development with ChatGPT, suggesting the continued relevance of these skills for both students and future professionals [39].

While these studies offer insights into how LLMs may enhance student learning, there is limited research on developers’ perspectives regarding the role of LLMs in computing education. Our study addresses this gap by engaging developers in a discussion about how LLMs could reshape education in the field. By focusing on their experiences and views, we provide valuable insights into how LLMs can influence both the learning and teaching of computing skills in professional settings.
6 Threats to validity

As with any qualitative study, our research faces several threats to validity, including:

6.1 Internal Validity: The study’s timeline coincided with frequent updates to tools such as GPT-3.5 and Bard (now Gemini), creating a significant internal validity threat. The rapid evolution of these tools may have influenced participants’ experiences in uncontrolled ways. As participants interacted with different versions of the tools, they might have encountered different performance levels, which could skew the results. Consequently, findings related to GPT-3.5 or earlier versions of Bard may not accurately reflect the current version of these tools, especially with the emergence of newer iterations like GPT-4o and OpenAI o1. Additionally, the absence of pretests to assess participants’ baseline knowledge and use of LLMs may impact the study’s internal validity. Although post-survey questions regarding industry experience partially mitigated this issue, the lack of a uniform measure of skill at the beginning of the study means that differences in participants’ performance could be attributed to their varying levels of expertise rather than the tools themselves.

6.2 External Validity: The geographic limitation of participants, with the majority based in the US, poses a significant threat to external validity. This limitation restricts the generalizability of the findings, as developers in other regions or cultural contexts may encounter different experiences and challenges. Factors such as localized programming practices, language preferences, and access to specific LLM features can all influence how these tools perform in non-US settings. Moreover, the lack of diversity in the participant pool, predominantly composed of White or Asian males with over 3 years of experience, further exacerbates the threat to external validity.
The experiences and challenges faced by underrepresented groups (including women, non-binary individuals, developers with disabilities, and those from various racial, ethnic, and experience backgrounds) may differ significantly from those of the current sample. Although our study was announced openly on LinkedIn, we recognize the need to implement additional strategies in future research to ensure a more diverse and representative participant pool.

6.3 Construct Validity: The introduction of newer models, such as GPT-4o, along with the resolution of issues related to GPT-3.5 and Bard, poses a significant threat to the temporal relevance of the findings. As LLM tools continue to evolve, the capabilities and limitations of earlier versions may no longer accurately represent the current state of technology. Consequently, some results from the study could become outdated, potentially undermining the applicability and relevance of the findings to contemporary tool usage. This emphasizes the necessity of considering the dynamic nature of LLM tools when interpreting the study’s conclusions. Additionally, the limited number of participants using tools like Google Bard and GitHub Copilot Chat presents a construct validity threat. The conclusions drawn about these tools may lack robustness and comprehensiveness due to insufficient data.

6.4 Conclusion Validity: The relatively small sample size of 16 participants and 16 hours of interviews introduces a threat to conclusion validity. This limitation restricts the statistical power to draw meaningful generalizations from the findings. Additionally, the small dataset heightens the risk of random variations influencing the results, potentially leading to unreliable conclusions. As a result, the study’s ability to make robust claims about the broader population of developers and their experiences with LLM tools may be compromised.
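To make the sample-size concern in Section 6.4 concrete, the following minimal sketch compares the uncertainty around a proportion estimated from 16 respondents with the same proportion estimated from a sample ten times larger. It is an illustration only: the study is qualitative and reports no such proportions, so the counts (8 of 16, 80 of 160), the function name `wilson_interval`, and the choice of the Wilson score interval at roughly 95% confidence are all assumptions introduced here.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence under a normal approximation.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts, not data from the study:
# half of the respondents agree with some statement.
low_16, high_16 = wilson_interval(8, 16)      # sample the size of this study
low_160, high_160 = wilson_interval(80, 160)  # sample ten times larger

print(f"n=16:  [{low_16:.2f}, {high_16:.2f}]  width={high_16 - low_16:.2f}")
print(f"n=160: [{low_160:.2f}, {high_160:.2f}]  width={high_160 - low_160:.2f}")
```

Under these assumptions, the interval for 16 respondents is several times wider than the one for 160, which is one way to read the caution above about generalizing counts from a small interview sample to the broader developer population.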
7 Conclusion

It has been clear since LLMs’ inception that they would impact software development in different ways. Our study aimed to examine the effects of LLMs on software developers, their processes, products, and society at large. Through sixteen interviews with early-adopter developers, we explored their self-reported day-to-day activities, perceptions, and experiences with LLMs. In our qualitative analysis of their responses, we found that:

• RQ1: People: LLMs provide developers with numerous benefits, including enhanced productivity, improved efficiency, time savings, streamlined searching, access to templates, and accelerated learning. However, developers also face challenges, such as occasional unreliable LLM responses.
• RQ2: Processes: In the SDLC, LLMs showed minimal impacts on gathering requirements, planning, and refactoring. However, they had mostly positive impacts on ideation, test generation, debugging, and documentation. Developers used various strategies for prompt engineering and evaluating LLM-generated code, such as entering vague prompts or conducting mental checks.
• RQ3: Product: LLMs generate readable code and are effective for simple tasks, but they exhibit varying quality across different questions and encounter difficulties with complex tasks.
• RQ4: Society: There is a need for formal, proactive guidelines for software developers on the usage of LLMs in the workplace, particularly to promote the ethical and safe use of generative artificial intelligence. Additionally, there is a predicted shift in entry-level positions due to LLMs, and LLMs are perceived as being likely to alter or repurpose development-related jobs, rather than eliminating them entirely. Finally, developers hold the ultimate responsibility for the code they deploy, regardless of its source or the process used to create it.

The sixteen interviewed developers show an advanced understanding of how LLMs work and their data sources.
This understanding influences the decisions that developers make about when, how, and why to use LLMs. Their insights can be used to craft best practices for LLM use in computing education and workforce settings. In addition, they demonstrate that LLMs offer, in general, far more opportunities than they do challenges. Despite concerns over code quality and limitations, our participants deemed LLMs to be sufficiently useful for developers. Consequently, our findings suggest that LLMs can be a valuable asset in a professional developer’s toolbox. As one participant expressed, “[ChatGPT] is like calculators being invented. We’re going to ban them for a while, and we’re going to tell [students] no, no, no, don’t use it. And then, once they graduate, they will use it every day.”

8 Acknowledgments

This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-21-1-0108 and National Science Foundation (NSF) under award numbers IIS-2313890, CCF-2006977, and IIS-1917885. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the AFOSR or NSF.

References

[1] [n. d.]. LLM Performance Leaderboard - a Hugging Face Space by ArtificialAnalysis. https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard
[2] Aakash Ahmad, Muhammad Waseem, Peng Liang, Mahdi Fahmideh, Mst Shamima Aktar, and Tommi Mikkonen. 2023. Towards human-bot collaborative software architecting with chatgpt. In Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering. 279–285.
[3] Hussam Alkaissi and Samy I. McFarlane. 2023. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus 15, 2 (Feb. 2023). https://doi.org/10.7759/cureus.35179
[4] Mansour Alqarni and Akramul Azim. 2022. Low level source code vulnerability detection using advanced bert language model. In Proceedings of the Canadian Conference on Artificial Intelligence (May 27, 2022). https://caiac.pubpub.org/pub/gdhb8oq4
[5] Ilaria Amaro, Attilio Della Greca, Rita Francese, Genoveffa Tortora, and Cesare Tucci. 2023. AI Unreliable Answers: A Case Study on ChatGPT. In Artificial Intelligence in HCI. Springer, Cham, 23–40. https://doi.org/10.1007/978-3-031-35894-4_2 ISSN: 1611-3349.
[6] Mattias Andersson and Tom Marshall Olsson. 2023. ChatGPT as a Supporting Tool for System Developers. Ph. D. Dissertation.
[7] Alwin Augustin. 2023. How LLMs Influence Software Engineering and Development. https://www.linkedin.com/pulse/how-llms-influence-software-engineering-development-alwin-augustin#:~:text=Overall%2C%20LLMs%20have%20the%20potential,efficient%2C%20effective%2C%20and%20innovative.
[8] Ömer Aydin and Enis Karaarslan. 2023. Is ChatGPT leading generative AI? What is beyond expectations? Academic Platform Journal of Engineering and Smart Systems 11, 3 (2023), 118–134.
[9] Lenz Belzner, Thomas Gabor, and Martin Wirsing. 2023. Large language model assisted software engineering: prospects, challenges, and a case study. In International Conference on Bridging the Gap between AI and Reality. Springer, 355–374.
[10] Leila Bencheikh and Niklas Höglund. 2023. Exploring the Efficacy of ChatGPT in Generating Requirements: An Experimental Study. Ph. D. Dissertation. https://gupea.ub.gu.se/handle/2077/77957
[11] James Bessen. 2018. AI and jobs: The role of demand. Technical Report. National Bureau of Economic Research.
[12] Christian Bird, Denae Ford, Thomas Zimmermann, Nicole Forsgren, Eirini Kalliamvakou, Travis Lowdermilk, and Idan Gazit. 2023. Taking Flight with Copilot: Early insights and opportunities of AI-powered pair-programming tools. Queue 20, 6 (Jan. 2023), 10:35–10:57.
https://doi.org/10.1145/3582083
[13] Anita Carleton, Forrest Shull, and Erin Harper. 2022. Architecting the future of software engineering. Computer 55, 9 (2022), 89–93.
[14] Souti Chattopadhyay, Nicholas Nelson, Yenifer Ramirez Gonzalez, Annel Amelia Leon, Rahul Pandita, and Anita Sarma. 2019. Latent patterns in activities: A field study of how developers manage context. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 373–383.
[15] Ping Chen and Syazwina Binti Alias. 2024. Opportunities and Challenges in the Cultivation of Software Development Professionals in the Context of Large Language Models. In Proceedings of the 2024 International Symposium on Artificial Intelligence for Education. 259–267.
[16] Xiang Chen, Chaoyang Gao, Chunyang Chen, Guangbei Zhang, and Yong Liu. 2024. An Empirical Study on Challenges for LLM Developers. arXiv preprint arXiv:2408.05002 (2024).
[17] Matteo Ciniselli, Nathan Cooper, Luca Pascarella, Antonio Mastropaolo, Emad Aghajani, Denys Poshyvanyk, Massimiliano Di Penta, and Gabriele Bavota. 2021. An empirical study on the usage of transformer models for code completion. IEEE Transactions on Software Engineering 48, 12 (2021), 4818–4837.
[18] Mariana Coutinho, Lorena Marques, Anderson Santos, Marcio Dahia, Cesar França, and Ronnie de Souza Santos. 2024. The Role of Generative AI in Software Development Productivity: A Pilot Case Study. In Proceedings of the 1st ACM International Conference on AI-Powered Software. 131–138.
[19] Carlos Dantas, Adriano Rocha, and Marcelo Maia. 2023. Assessing the Readability of ChatGPT Code Snippet Recommendations: A Comparative Study. In Proceedings of the XXXVII Brazilian Symposium on Software Engineering (SBES ’23). Association for Computing Machinery, New York, NY, USA, 283–292. https://doi.org/10.1145/3613372.3613413
[20] Jeffrey Dastin and Anna Tong. 2023. OpenAI rolls out ’incognito mode’ on ChatGPT | Reuters.
https://www.reuters.com/technology/openai-rolls-out-incognito-mode-chatgpt-2023-04-25/
[21] Ozge Demirci, Jonas Hannane, and Xinrong Zhu. 2023. Who is AI Replacing? The Impact of ChatGPT on Online Freelancing Platforms. (October 15, 2023).
[22] Ditstek Innovations Pvt. Ltd. (DITS). 2024. How are LLMs Reshaping Software Development? https://www.linkedin.com/pulse/how-llms-reshaping-software-development-ditstek-innovations-8gtac/
[23] Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Weikang Zhou, Muling Wu, Mingxu Chai, Jessica Fan, Caishuang Huang, Yunbo Tao, et al. 2024. What’s Wrong with Your Code Generated by Large Language Models? An Extensive Study. arXiv preprint arXiv:2407.06153 (2024).
[24] Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. 2023. Gpts are gpts: An early look at the labor market impact potential of large language models. arXiv preprint arXiv:2303.10130 (2023).
[25] Harry Barton Essel, Dimitrios Vlachopoulos, Albert Benjamin Essuman, and John Opuni Amankwa. 2024. ChatGPT effects on cognitive skills of undergraduate students: Receiving instant responses from AI-based conversational large language models (LLMs). Computers and Education: Artificial Intelligence 6 (2024), 100198.
[26] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large language models for software engineering: Survey and open problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). IEEE, 31–53.
[27] Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated repair of programs from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1469–1481.
[28] Yunhe Feng, Sreecharan Vanam, Manasa Cherukupally, Weijian Zheng, Meikang Qiu, and Haihua Chen. 2023. Investigating Code Generation Performance of Chat-GPT with Crowdsourcing Social Data. In Proceedings of the 47th IEEE Computer Software and Applications Conference. 1–10.
[29] Martin Fowler. 2018. Refactoring. Addison-Wesley Professional.
[30] Martin Fowler, Jim Highsmith, et al. 2001. The agile manifesto. Software development 9, 8 (2001), 28–35.
[31] Isa Fulford and Andrew Ng. 2024. ChatGPT Prompt Engineering for Developers. https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/
[32] Qiuhan Gu. 2023. LLM-Based Code Generation Method for Golang Compiler Testing. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023). Association for Computing Machinery, New York, NY, USA, 2201–2203. https://doi.org/10.1145/3611643.3617850
[33] Maanak Gupta, Charankumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. 2023. From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy. IEEE Access 11 (2023), 80218–80245. https://doi.org/10.1109/ACCESS.2023.3300381
[34] Md Asraful Haque and Shuai Li. 2023. The Potential Use of ChatGPT for Debugging and Bug Fixing. EAI Endorsed Transactions on AI and Robotics 2, 1 (2023), e4–e4.
[35] Madison Hoff, Aaron Mok, and Jacob Zinkula. 2023. 4 white-collar jobs most at risk of getting replaced by Ai like chatgpt. https://www.businessinsider.com/chatgpt-white-collar-jobs-at-risk-artificial-intelligence-ai-2023-2
[36] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2023. Large language models for software engineering: A systematic literature review. arXiv preprint arXiv:2308.10620 (2023).
[37] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large Language Models for Software Engineering: A Systematic Literature Review. ACM Trans. Softw. Eng. Methodol. (Sept. 2024). https://doi.org/10.1145/3695988 Just Accepted.
[38] Adam Hörnemalm. 2023. ChatGPT as a Software Development Tool. Ph. D. Dissertation.
[39] Johan Jeuring, Roel Groot, and Hieke Keuning. 2023. What Skills Do You Need When Developing Software Using ChatGPT? (Discussion Paper). In Proceedings of the 23rd Koli Calling International Conference on Computing Education Research. 1–6.
[40] Haolin Jin, Linghan Huang, Haipeng Cai, Jun Yan, Bo Li, and Huaming Chen. 2024. From llms to llm-based agents for software engineering: A survey of current, challenges and future. arXiv preprint arXiv:2408.02479 (2024).
[41] Angelika Kimbel, Magdalena Glas, and Günther Pernul. 2024. Security and Privacy Perspectives on Using ChatGPT at the Workplace: An Interview Study. (2024).
[42] Jan H Klemmer, Stefan Albert Horstmann, Nikhil Patnaik, Cordelia Ludden, Cordell Burton Jr, Carson Powers, Fabio Massacci, Akond Rahman, Daniel Votipka, Heather Richter Lipford, et al. 2024. Using AI Assistants in Software Development: A Qualitative Study on Security Practices and Concerns. arXiv preprint arXiv:2405.06371 (2024).
[43] Madhava Krishna, Bhagesh Gaur, Arsh Verma, and Pankaj Jalote. 2024. Using LLMs in Software Requirements Specifications: An Empirical Evaluation. arXiv preprint arXiv:2404.17842 (2024).
[44] Mohammad Amin Kuhail, Sujith Samuel Mathew, Ashraf Khalil, Jose Berengueres, and Syed Jawad Hussain Shah. 2024. “Will I Be Replaced?” Assessing ChatGPT’s Effect on Software Development and Programmer Perceptions of AI Tools. Science of Computer Programming (2024), 103111.
[45] Sam Lau and Philip Guo. 2023.
From "Ban it till we understand it" to "Resistance is futile": How university programming instructors plan to adapt as more students use AI code generation and explanation tools such as ChatGPT and GitHub Copilot. In Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1. 106–121.
[46] Dean Leffingwell and Don Widrig. 2000. Managing software requirements: a unified approach. Addison-Wesley Professional.
[47] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
[48] Ze Shi Li, Nowshin Nawar Arony, Ahmed Musa Awon, Daniela Damian, and Bowen Xu. 2024. AI Tool Use and Adoption in Software Development by Individuals and Organizations: A Grounded Theory Study. arXiv preprint arXiv:2406.17325 (2024).
[49] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. Advances in Neural Information Processing Systems 36 (Dec. 2023), 21558–21572. https://proceedings.neurips.cc/paper_files/paper/2023/hash/43e9d647ccd3e4b7b5baab53f0368686-Abstract-Conference.html
[50] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2024).
[51] Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D Le, and David Lo. 2023. Refining ChatGPT-generated code: Characterizing and mitigating code quality issues.
ACM Transactions on Software Engineering and Methodology (2023).
[52] Kamil Malinka, Martin Peresíni, Anton Firc, Ondrej Hujnák, and Filip Janus. 2023. On the Educational Impact of ChatGPT: Is Artificial Intelligence Ready to Obtain a University Degree?. In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1 (ITiCSE 2023). Association for Computing Machinery, New York, NY, USA, 47–53. https://doi.org/10.1145/3587102.3588827
[53] Robert C Martin. 2009. Clean code: a handbook of agile software craftsmanship. Pearson Education.
[54] Wendy Mendes, Samara Souza, and Cleidson De Souza. 2024. "You’re on a bicycle with a little motor": Benefits and Challenges of Using AI Code Assistants. In Proceedings of the 2024 IEEE/ACM 17th International Conference on Cooperative and Human Aspects of Software Engineering. 144–152.
[55] Jeremy Miles and Paul Gilbert. 2005. A Handbook of Research Methods for Clinical and Health Psychology. Oxford University Press. Google-Books-ID: kmZ3Yt5pY0YC.
[56] Aaron Mok and Jacob Zinkula. 2023. Chatgpt may be coming for our jobs. Here are the 10 roles that AI is most likely to replace. https://www.businessinsider.com/chatgpt-jobs-at-risk-replacement-artificial-intelligence-ai-labor-trends-2023-02#:~:text=Experts%20say%20ChatGPT%20and%20related,career%2C%20mid%2Dability%20work.
[57] Arghavan Moradi Dakhel, Vahid Majdinasab, Amin Nikanjam, Foutse Khomh, Michel C. Desmarais, and Zhen Ming (Jack) Jiang. 2023. GitHub Copilot AI pair programmer: Asset or Liability? Journal of Systems and Software 203 (Sept. 2023), 111734. https://doi.org/10.1016/j.jss.2023.111734
[58] n.a. 2021. OpenAI Codex. https://openai.com/index/openai-codex/
[59] n.a. 2023. Stack Overflow 2023 Developer Survey.
[60] n.a. 2024. Build simple, secure, scalable systems with Go. https://go.dev/
[61] n.a. 2024. FAQ for optional data sharing for Copilot AI features in Dynamics 365 and Power Platform.
https://learn.microsoft.com/en-us/power- platform/faqs-copilot-data-sharing [62] n.a. 2024. Gemini Apps Privacy Hub. https://support.google.com/gemini/answer/13594961?hl=en [63] n.a. 2024. Privacy Policy. https://openai.com/policies/row-privacy-policy/ [64] n.a. 2024. Stack Overflow 2023 Developer Survey. [65] Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. 2024. Using an LLM to Help With Code Understanding. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE) . IEEE Computer Society, 881–881. [66] Nathalia Nascimento, Paulo Alencar, and Donald Cowan. 2023. Artificial Intelligence vs. Software Engineers: An Empirical Study on Performance and Efficiency using ChatGPT. In Proceedings of the 33rd Annual International Conference on Computer Science and Software Engineering (CASCON ’23). IBM Corp., USA, 24–33. [67] Nikolaos Nikolaidis, Karolos Flamos, Daniel Feitosa, Alexander Chatzigeorgiou, and Apostolos Ampatzoglou. 2023. The End of an Era: Can Ai Subsume Software Developers? Evaluating Chatgpt and Copilot Capabilities Using Leetcode Problems. https://doi.org/10.2139/ssrn.4422122 [68] OpenAI. [n. d.]. Prompt Engineering. https://platform.openai.com [69] OpenAI. n.d.. Prompt engineering. https://platform.openai.com/docs/guides/prompt-engineering. [70] R OpenAI. 2023. GPT-4 technical report. arXiv (2023), 2303–08774. [71] Ipek Ozkaya. 2022. A Paradigm Shift in Automating Software Engineering Tasks: Bots. IEEE Software 39, 5 (Sept. 2022), 4–8. https://doi.org/10. 1109/MS.2022.3167801 Conference Name: IEEE Software. [72] Ipek Ozkaya. 2023. Application of Large Language Models to Software Engineering Tasks: Opportunities, Risks, and Implications. IEEE Software 40, 3 (May 2023), 4–8. https://doi.org/10.1109/MS.2023.3248401 [73] Ipek Ozkaya. 2023. Application of large language models to software engineering tasks: Opportunities, risks, and implications. IEEE Software 40, 3 (2023), 4–8. 
[74] Zeynep Özpolat, Özal YILDIRIM, and Murat Karabatak. 2023. Artificial Intelligence-Based Tools in Software Development Processes: Application of ChatGPT. European Journal of Technique (EJT) 13, 2 (2023), 229–240. [75] Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. arXiv:2302.06590 [cs.SE] [76] Gustavo Pinto, Cleidson De Souza, Thayssa Rocha, Igor Steinmacher, Alberto Souza, and Edward Monteiro. 2024. Developer Experiences with a Contextualized AI Coding Assistant: Usability, Expectations, and Outcomes. In Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering-Software Engineering for AI . 81–91. [77] Zeinab Sadat Rabani, Hanieh Khorashadizadeh, Shirin Abdollahzade, Sven Groppe, and Javad Ghofrani. 2023. Developers’ Perspective on Trustworthiness of Code Generated by ChatGPT: Insights from Interviews. In International Conference on Applied Machine Learning and Data Analytics . Springer, 215–229. Manuscript submitted to ACM Page 43: LLMs’ Impacts on Software Development 43 [78] Wahyu Rahmaniar. 2023. ChatGPT for Software Development: Opportunities and Challenges . https://doi.org/10.36227/techrxiv.23993583.v1 [79] Nitin Liladhar Rane, Abhijeet Tawde, Saurabh P Choudhary, and Jayesh Rane. 2023. Contribution and performance of ChatGPT and other Large Language Models (LLM) for scientific and research advancements: a double-edged sword. International Research Journal of Modernization in Engineering Technology and Science 5, 10 (2023), 875–899. [80] Mahdiyah Rashid. 2023. HOW IS THE DEVELOPMENT AND DEPLOYMENT OF AI MODELS LIKE CHAT GPT AFFECTING THE JOB MARKET AND WHAT ARE THE IMPLICATIONS FOR WORKERS IN VARIOUS INDUSTRIES? International Education and Research Journal (2023). [81] Sanka Rasnayaka, Guanlin Wang, Ridwan Shariffdeen, and Ganesh Neelakanta Iyer. 2024. An empirical study on usage and perceptions of llms in a software engineering project. 
arXiv preprint arXiv:2401.16186 (2024). [82] Heidi Reichert, Benyamin T. Tabarsi, Zifan Zang, Cheri Fennell, Indira Bhandari, David Robinson, Madeline Drayton, Catherine Crofton, Matthew Lococo, Dongkuan Xu, and Tiffany Barnes. 2024. Empowering Secondary School Teachers: Creating, Executing, and Evaluating a Transformative Professional Development Course on ChatGPT. In 2024 IEEE Frontiers in Education Conference (FIE) . IEEE, forthcoming. [83] Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems . 1–7. [84] Mary Beth Rosson and John M. Carroll. 1996. The Reuse of Uses in Smalltalk Programming. ACM Trans. Comput.-Hum. Interact. 3, 3 (sep 1996), 219–253. https://doi.org/10.1145/234526.234530 [85] Georgia Robins Sadler, Hau-Chen Lee, Rod Seung-Hwan Lim, and Judith Fullerton. 2010. Research Article: Recruitment of hard-to-reach population subgroups via adaptations of the snowball sampling strategy. Nursing & Health Sciences 12, 3 (2010), 369–374. https://doi.org/10.1111/j.1442- 2018.2010.00541.x _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1442-2018.2010.00541.x. [86] Fardin Ahsan Sakib, Saadat Hasan Khan, and A. H. M. Rezaul Karim. 2023. Extending the Frontier of ChatGPT: Code Generation and Debugging. https://arxiv.org/abs/2307.08260v1 [87] Amazon Web Services. n.d.. What is SDLC? - Software Development Lifecycle Explained - AWS. https://aws.amazon.com/what-is/sdlc/. [88] Mohammed Latif Siddiq, Lindsay Roney, Jiahao Zhang, and Joanna Cecilia Da Silva Santos. 2024. Quality Assessment of ChatGPT Generated Code and their Use by Developers. In Proceedings of the 21st International Conference on Mining Software Repositories . 152–156. [89] Rafael M. L. Silva, Erica Principe Cruz, Daniela K. Rosner, Dayton Kelly, Andrés Monroy-Hernández, and Fannie Liu. 2022. 
Understanding AR Activism: An Interview Study with Creators of Augmented Reality Experiences for Social Change. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI ’22) . Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3491102.3517605 [90] Harmeet Singh and Syed Imtiyaz Hassan. 2015. Effect of solid design principles on quality of software: An empirical assessment. International Journal of Scientific & Engineering Research 6, 4 (2015), 1321–1324. [91] Giriprasad Sridhara, Sourav Mazumdar, et al .2023. Chatgpt: A study on its utility for ubiquitous software engineering tasks. arXiv preprint arXiv:2305.16837 (2023). [92] Nigar M Shafiq Surameery and Mohammed Y Shakor. 2023. Use chat gpt to solve programming bugs. International Journal of Information Technology & Computer Engineering (IJITC) ISSN: 2455-5290 3, 01 (2023), 17–22. [93] Nigar M. Shafiq Surameery and Mohammed Y. Shakor. 2023. Use Chat GPT to Solve Programming Bugs. International Journal of Information Technology & Computer Engineering (IJITC) ISSN : 2455-5290 3, 01 (Jan. 2023), 17–22. https://doi.org/10.55529/ijitc.31.17.22 Number: 01. [94] Thomas Süße, Maria Kobert, Simon Grapenthin, and Bernd-Friedrich Voigt. 2023. AI-Powered Chatbots and the Transformation of Work: Findings from a Case Study in Software Development and Software Engineering. In Working Conference on Virtual Enterprises . Springer, 689–705. [95] Ben Arie Tanay, Lexy Arinze, Siddhant S Joshi, Kirsten A Davis, and James C Davis. 2024. An Exploratory Study on Upper-Level Computing Students’ Use of Large Language Models as Tools in a Semester-Long Project. arXiv preprint arXiv:2403.18679 (2024). [96] Liyan Tang, Zhaoyi Sun, Betina Idnay, Jordan G. Nestor, Ali Soroush, Pierre A. Elias, Ziyang Xu, Ying Ding, Greg Durrett, Justin F. Rousseau, Chunhua Weng, and Yifan Peng. 2023. Evaluating large language models on medical evidence summarization. npj Digital Medicine 6, 1 (Aug. 
2023), 158. https://doi.org/10.1038/s41746-023-00896-7 [97] Yutian Tang, Zhijie Liu, Zhichao Zhou, and Xiapu Luo. 2023. ChatGPT vs SBST: A Comparative Assessment of Unit Test Suite Generation. https://arxiv.org/abs/2307.00588v1 [98] Helen Toner. 2023. What Are Generative AI, Large Language Models, and Foundation Models? | Center for Security and Emerging Technology. https://cset.georgetown.edu/article/what-are-generative-ai-large-language-models-and-foundation-models/. [99] Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. 2022. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In Chi conference on human factors in computing systems extended abstracts . 1–7. [100] Pranshu Verma and Gerrit De Vynck. 2023. ChatGPT took their jobs. Now they walk dogs and fix air conditioners. https://www.washingtonpost. com/technology/2023/06/02/ai-taking-jobs/ [101] Jiexin Wang, Liuwen Cao, Xitong Luo, Zhiping Zhou, Jiayuan Xie, Adam Jatowt, and Yi Cai. 2023. Enhancing Large Language Models for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation. https://arxiv.org/abs/2310.16263v1 [102] Emily Winter, David Bowes, Steve Counsell, Tracy Hall, Sæmundur Haraldsson, Vesna Nowack, and John Woodward. 2022. How do developers really feel about bug fixing? directions for automatic program repair. IEEE Transactions on Software Engineering (2022). [103] Tianyu Wu, Shizhu He, Jingping Liu, Siqi Sun, Kang Liu, Qing-Long Han, and Yang Tang. 2023. A brief overview of ChatGPT: The history, status quo and potential future development. IEEE/CAA Journal of Automatica Sinica 10, 5 (2023), 1122–1136. Manuscript submitted to ACM Page 44: 44 Benyamin Tabarsi, Heidi Reichert, Ally Limke, Sandeep Kuttal, and Tiffany Barnes [104] Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. 
In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (MAPS 2022) . Association for Computing Machinery, New York, NY, USA, 1–10. https://doi.org/10.1145/3520312.3534862 [105] Ruiyun Xu, Yue (Katherine) Feng, and Hailiang Chen. 2023. ChatGPT vs. Google: A Comparative Study of Search Performance and User Experience. SSRN Electronic Journal (2023). https://doi.org/10.2139/ssrn.4498671 [106] Sangyeop Yeo, Yu-Seung Ma, Sang Cheol Kim, Hyungkook Jun, and Taeho Kim. 2024. Framework for evaluating code gener- ation ability of large language models. ETRI Journal 46, 1 (2024), 106–117. https://doi.org/10.4218/etrij.2023-0357 _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.4218/etrij.2023-0357. [107] Shuyin Zhao. 2024. Smarter, more efficient coding: GitHub Copilot goes beyond Codex with improved AI model. https://github.blog/news- insights/product-news/smarter-more-efficient-coding-github-copilot-goes-beyond-codex-with-improved-ai-model/ Manuscript submitted to ACM