Paper 2412.20545

The Impact of Prompt Programming on Function-Level Code Generation

Authors: Ranim Khojah, Francisco Gomes de Oliveira Neto, Mazen Mohamad, Philipp Leitner

Published: 2024-12-29

Abstract:

Large Language Models (LLMs) are increasingly used by software engineers for code generation. However, limitations of LLMs such as irrelevant or incorrect code have highlighted the need for prompt programming (or prompt engineering) where engineers apply specific prompt techniques (e.g., chain-of-thought or input-output examples) to improve the generated code. Despite this, the impact of different prompt techniques -- and their combinations -- on code generation remains underexplored. In this study, we introduce CodePromptEval, a dataset of 7072 prompts designed to evaluate five prompt techniques (few-shot, persona, chain-of-thought, function signature, list of packages) and their effect on the correctness, similarity, and quality of complete functions generated by three LLMs (GPT-4o, Llama3, and Mistral). Our findings show that while certain prompt techniques significantly influence the generated code, combining multiple techniques does not necessarily improve the outcome. Additionally, we observed a trade-off between correctness and quality when using prompt techniques. Our dataset and replication package enable future research on improving LLM-generated code and evaluating new prompt techniques.

Paper Content:

Ranim Khojah (1), Francisco Gomes de Oliveira Neto (1), Mazen Mohamad (1, 2), Philipp Leitner (1)
(1) Chalmers University of Technology and University of Gothenburg, (2) RISE Research Institutes of Sweden
Gothenburg, Sweden
khojah@chalmers.se, francisco.gomes@cse.gu.se, mazen.mohamad@ri.se, philipp.leitner@chalmers.se

I. INTRODUCTION

With the widespread adoption of Large Language Models (LLMs) in software engineering, researchers and practitioners have uncovered their significant potential, particularly for code-related tasks, such as code generation and completion [1, 2].
However, this adoption has also revealed several limitations of LLMs that can hinder developers’ productivity [3] and cause frustrations [4], preventing them from fully leveraging the benefits of LLMs in their coding process. Such limitations are related to hallucinations, misunderstanding the intent or purpose of the code, or simply generating incorrect code [5]. These limitations are inherent to the design of LLMs, and are unlikely to “resolve themselves” entirely with future model generations. Therefore, researchers started proposing ways to mitigate these limitations by adapting how users interact with the LLMs. The interactions typically start with a natural language prompt that specifies what the LLM is expected to output. To ensure that the LLM generates accurate, relevant, and high-quality outputs, users employ a structured approach to construct prompts, which is often referred to as prompt programming.

To implement prompt programming, various prompt techniques can be used to guide the LLM on how to achieve the expected results [6, 7, 8]. For example, few-shot learning involves providing the LLM with a few input-output examples to guide the function logic, while adding context about the packages used can give the model additional information on what helper functions to use. However, such prompt techniques were evaluated based on the output accuracy for natural language generation tasks [9, 8] and are not well-studied for code generation, more specifically, function synthesis (generating function-level code), which is one of the most common use cases among software engineers [3]. Furthermore, evaluating the accuracy of code generation is not sufficient, since other aspects of the code are important for software engineers, such as maintainability and adherence to best practices.

Prompt techniques can also be combined [6], but to the best of our knowledge, no work evaluates the impact of multiple combinations of prompt techniques in one prompt.
For instance, it is unclear whether applying a certain prompt technique can cancel out, hinder, or even enhance the impact of an existing prompt technique in the prompt. Therefore, in this study, we design a full factorial experiment on five common prompt techniques for function generation along with all the possible combinations of these techniques, which sums up to 32 unique combinations of prompt techniques.

To perform a comprehensive evaluation of the impact of different prompt techniques on code generation, we construct our dataset CodePromptEval, which consists of 221 code-generation prompts from CoderEval [10] that we extend with 32 possible variations for each prompt (that is, combinations of prompt techniques). This results in a total of 7072 datapoints. We use CodePromptEval to generate functions with three popular LLMs (GPT-4o, Llama3, and Mistral), then evaluate the generated functions based on correctness, as well as quality and similarity to ground truth (e.g., in terms of naming style and structure). In particular, we investigate the following research questions.

RQ1: How do different LLMs perform on CodePromptEval?

Initially, we study the performance of different current-generation LLMs (GPT-4o, Llama3, and Mistral) on our CodePromptEval dataset. We particularly look at the correctness of LLM-generated code as measured using the existing test cases in the CoderEval benchmark as a ground truth. We observe that the performance of all three evaluated LLMs is comparable, with a difference of around 5 percentage points between the best model, GPT-4o, and the worst, Mistral.

RQ2: To what extent do different prompting techniques (and combinations of them) impact the code generation of LLMs?

arXiv:2412.20545v1 [cs.SE] 29 Dec 2024

We now turn to the central research question of this paper. Using a full factorial experiment design, we compare how different prompt techniques (e.g., few-shot, providing a persona, etc.) impact the generated code in three dimensions: correctness, similarity to ground truth, and code quality.

RQ2.1: How do prompt techniques impact the correctness of the code?

To evaluate correctness, we test the functions, then measure the Pass@k scores for each combination of prompt techniques. We also perform statistical tests to identify the (combinations of) prompt techniques that impact the test results. We found that including only a function signature or few-shot examples has a significant positive impact on correctness. We further observe that combining prompt techniques does not lead to significantly better results.

RQ2.2: How do prompt techniques impact the similarity of the code to a human-written baseline?

We also study how similar generated solutions are to the (human-written) baseline. We find that including a persona, chain-of-thought, or signature increases the overall similarity to the baseline for some LLMs, while few-shot reduces only the lexical similarity. Note that generating code that is similar to an “expected” solution may be good or bad depending on context: on the one hand, code that is close to the baseline may be easy to fix even if it is not passing the test cases; on the other, “different” can be particularly valuable if the goal is to brainstorm approaches, e.g., if used in “exploration mode” [11].

RQ2.3: How do prompt techniques impact the quality of the code?

Finally, we study code quality as measured through the presence of code smells and the (cyclomatic and cognitive) complexity of the code. We find that including a signature or few-shot examples leads to functions with higher complexity and more code smells. Interestingly, adding a relevant persona (“as a software developer who follows best coding practices ...”) indeed has a small positive effect on the code quality, but at the expense of slightly lower correctness.
Overall, we conclude that the impact of prompt programming techniques is not dramatic for models that are current-generation at the time of writing. Most combinations of prompt techniques do not lead to statistically significant improvements (nor regressions) in correctness, similarity or quality. Providing type information for the function that is to be generated, either explicitly through a signature, or implicitly via few-shot examples, has the clearest effect. Some prompt techniques have a positive impact on correctness, and others on quality. However, the obvious idea of combining them usually improves neither.

II. RELATED WORK

Existing research on LLMs in software engineering has shown the potential of LLMs to support software engineers in various tasks, including requirements elicitation, software testing and documentation [12, 13]. However, the main focus is directed towards code-related tasks [3]. This is also reflected in the interest among software organizations that, at the time of writing, leverage LLMs mostly for code generation, code completion and code summarization [14]. However, the increased adoption of LLMs for code-related tasks has unveiled risks and limitations, such as hallucinations, inaccuracies, and potential vulnerabilities [15, 16]. Researchers have proposed the concept of prompt programming (or prompt engineering) in order to minimize the model’s limitations and trigger the LLM to output a more desirable response by using prompt techniques and providing relevant contextual information [17, 6].

Therefore, a new line of research emerged focusing on finding prompt techniques that can improve the performance of LLMs in various tasks. White et al. [6] propose different prompt patterns and techniques depending on the software-related task. However, the impact of these techniques on the LLM output can be unstable and inconsistent.
Wang et al. [18] show that prompt techniques can be sensitive to the specific task as well as the LLM (e.g., GPT-3.5 vs. GPT-4o). Other studies also show that the few-shot prompt technique [19] is effective, especially with the right structure [7], type [20] or order [21] of the examples (shots). Reynolds and McDonell [9] highlight how few-shot examples can hurt the performance of the model and limit its search for a plausible solution in translation tasks. In contrast, we found that few-shot significantly improved the performance of the LLMs, suggesting that prompt techniques have varied impact depending on the task and the domain.

For code-related tasks, prompt techniques were shown to have a positive effect on code generation in the domain of education [22]. Furthermore, researchers proposed ways and contextual information as prompt techniques to apply to the prompt and enhance code-related tasks [23, 8], e.g., incorporating dataflow information to improve code summarization [8]. Other prompt techniques used by Dong et al. [24] included self-collaboration, where the LLM is prompted several times to take different personas, e.g., first as a requirements engineer, then a software developer, then a tester, and only then return the code that resulted from the “collaboration” among the three personas. In our study, we focus on three common prompt techniques, namely, few-shot learning, automatic chain-of-thought [25], and persona [26], and we propose two pieces of contextual information as additional prompt techniques that are easily accessed by the developers, i.e., the imported packages and the signature of the function.

To evaluate LLMs on code generation tasks, the most common metric is Pass@k [27], where k=1 is used to measure the rate of passed functions that the LLM generated on the first attempt [28] (e.g., by running a test suite).
CodeBLEU [29] is another popular metric, commonly used in studies to measure the human-likeness of generated code [30, 31]. Li et al. [32] conduct a manual human evaluation of their proposed prompt technique “AceCoder” based on correctness, presence of code smells, and maintainability. We provide a systematic and automated approach to evaluate generated code based on correctness, maintainability, and similarity to the ground truth.

[Fig. 1. The process we follow to evaluate the code generated using different prompts by different LLMs. Stages shown: construct CodePromptEval (230 CoderEval datapoints, 221 after excluding prompts with failing ground truth, 32 prompt combinations, 221x32 prompts); run LLMs (API-based and self-deployed, 7072 prompts and 7072 functions); evaluate generated functions (run tests, measure CodeBLEU similarity, run Pylint, compute cyclomatic and cognitive complexity); analyze the evaluation results (descriptive statistics per LLM for RQ1; descriptive statistics per combination, multi-regression, and Wilcoxon test with effect size for RQ2).]

III. METHODOLOGY

Figure 1 shows our approach to evaluate the impact of commonly-used prompting techniques on the code generated by LLMs. On a high level, we create prompt templates that combine prompting techniques (e.g., CoT, few-shot, etc.) and apply each prompt template to 221 tasks from the CoderEval benchmark [10]. We evaluate three different LLMs (two open-weight and one proprietary), leading to 7072 generated functions per LLM.
To understand the impact of each prompting technique and answer RQ2, we evaluate all functions in terms of correctness, similarity to the baseline of the benchmark, and quality using statistical analysis.

We follow a full factorial experiment design and evaluate the code generation functionality of LLMs by varying two levels (present/absent) of five factors in a prompt, that is, the five prompt techniques: (1) few-shot learning, (2) Chain-of-Thought (CoT), (3) persona, (4) function signature, and (5) the list of packages. Therefore, we have 32 (2^5) treatments in our experiment. Note that the absence of all of these techniques counts as zero-shot, where only the generation instruction is present without any other prompt technique. We do not treat zero-shot as a factor since it cannot logically be combined with other prompt techniques (e.g., combining zero-shot with persona would simply default to persona). Instead, we use zero-shot as a baseline for prompt technique comparisons.

A. Prompt technique combinations

Prompt programming is the act of constructing a prompt using natural language to ensure that the model provides the intended response or to improve the performance of the model [9]. Based on observations from our previous work [3] and recommendations in literature [6] and from LLM providers such as OpenAI (footnote 1) and Microsoft (footnotes 2 and 3), we decided on five prompt techniques to apply when prompting LLMs in our study. Examples for all prompt techniques are provided in Figure 2.

• Few-shot learning can be achieved by providing shots (or examples) to an LLM in order to enable learning new examples without the need to fine-tune the LLM [19]. We use two input-output examples explained in natural language. We do not consider a varying number of shots.
• Automatic Chain-of-Thought (CoT) allows the LLM to break down the prompt by asking it to “think” step by step and following the steps when solving the problem [25].
• Persona allows the LLM to play a specific role and consider its perspective when solving a problem [33, 34]. For the persona, we use the role of a software developer who focuses on practices and standards that software developers follow.
• Signature is a line of code that includes the signature of the function to generate. The signature includes the function name, the input parameters, and (optionally) the output.
• Packages is a list of libraries and files that exist in the environment in which the code runs. This includes local packages and external libraries.

Footnote 1: https://platform.openai.com/docs/guides/prompt-engineering
Footnote 2: https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/advanced-prompt-engineering
Footnote 3: https://microsoft.github.io/prompt-engineering/

B. CodePromptEval

To evaluate the different combinations of prompt techniques, we construct CodePromptEval, a dataset that includes 221 function-level code generation tasks, where each generation task is implemented using 32 different prompt variations. Each of these 32 prompts applies a unique combination of prompt techniques, resulting in a total of 7072 prompts (221 tasks × 32 variations).

To create our dataset, we initially start with the CoderEval Python dataset [10]. This dataset consists of 230 datapoints from 43 Python projects. Each datapoint consists of a prompt, a Python function (human-written baseline), and the corresponding tests (in the form of unit tests or a main class). We first set up different virtual environments for functions from different projects, then we test the functions using the provided tests, and eliminate nine datapoints where the baseline does not pass the tests. This resulted in 221 datapoints that are the foundation for our own CodePromptEval dataset.
Then, we ensure that the prompts are “pure” from any prompt technique that may be implicitly applied (e.g., providing examples), by going through the prompts manually and removing any elements that do not describe the purpose of the code. We then treat this prompt as a zero-shot prompt.

The next step was to prepare prompt templates by defining how each prompt technique will be implemented and mapping relevant information to prompt techniques. In particular, for each datapoint, we extract the signature of the function and the list of used packages (represented as imports at the beginning of the class). For chain-of-thought, we adapted the template recommended by Zhuosheng et al. [25]. To construct the persona, we defined a persona description of a software developer who follows best coding practices for maintainability. To implement the few-shot prompt technique, the first three authors of this paper manually constructed two input-output examples for each prompt following the template “If the input is X, then the output is Y”. We also create corresponding tests to ensure that the input and output are correct.

Finally, we define 32 prompt variations that we list in Table I. Each variation represents a prompt that applies a unique combination of prompt techniques.

TABLE I. THE 32 COMBINATIONS OF PROMPT TECHNIQUES THAT WE CONSIDER IN OUR FULL FACTORIAL EXPERIMENT (✓ = technique applied, - = not applied).

ID    Few-shot  CoT  Persona  Packages  Signature
P1    -         -    ✓        ✓         ✓
P2    -         -    ✓        ✓         -
P3    -         -    ✓        -         ✓
P4    -         -    ✓        -         -
P5    -         -    -        ✓         ✓
P6    -         -    -        ✓         -
P7    -         -    -        -         ✓
P8    -         -    -        -         -
P9    -         ✓    ✓        ✓         ✓
P10   -         ✓    ✓        ✓         -
P11   -         ✓    ✓        -         ✓
P12   -         ✓    ✓        -         -
P13   -         ✓    -        ✓         ✓
P14   -         ✓    -        ✓         -
P15   -         ✓    -        -         ✓
P16   -         ✓    -        -         -
P17   ✓         -    ✓        ✓         ✓
P18   ✓         -    ✓        ✓         -
P19   ✓         -    ✓        -         ✓
P20   ✓         -    ✓        -         -
P21   ✓         -    -        ✓         ✓
P22   ✓         -    -        ✓         -
P23   ✓         -    -        -         ✓
P24   ✓         -    -        -         -
P25   ✓         ✓    ✓        ✓         ✓
P26   ✓         ✓    ✓        ✓         -
P27   ✓         ✓    ✓        -         ✓
P28   ✓         ✓    ✓        -         -
P29   ✓         ✓    -        ✓         ✓
P30   ✓         ✓    -        ✓         -
P31   ✓         ✓    -        -         ✓
P32   ✓         ✓    -        -         -
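The full factorial design can be illustrated with a short Python sketch. It is not taken from the replication package: the names `build_table` and `FACTORS` are hypothetical, and the iteration order is simply chosen so that the generated IDs line up with the layout of Table I.

```python
from itertools import product

# The five factors of the experiment, in the column order of Table I.
FACTORS = ("few_shot", "cot", "persona", "packages", "signature")

def build_table():
    """Enumerate all 2^5 present/absent combinations, keyed by a P<n> ID.

    Assumption: few-shot and CoT vary slowest (absent before present),
    while persona/packages/signature vary fastest (present before absent),
    which mirrors the row order of Table I.
    """
    outer = product((False, True), repeat=2)   # (few_shot, cot)
    inner = product((True, False), repeat=3)   # (persona, packages, signature)
    table = {}
    for idx, (slow, fast) in enumerate(product(outer, inner), start=1):
        table[f"P{idx}"] = dict(zip(FACTORS, slow + fast))
    return table

table = build_table()
n_tasks = 221                            # CoderEval tasks kept in the dataset
total_prompts = n_tasks * len(table)     # 221 x 32

def active(pid):
    """Return the set of techniques switched on for a given prompt ID."""
    return {name for name, on in table[pid].items() if on}
```

Under these assumptions, `active("P8")` is the empty set (the zero-shot baseline) and `active("P25")` contains all five techniques.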
For example, P7 is a prompt that provides the code signature, but uses no other prompt programming technique, whereas P28 combines few-shot learning with CoT and the usage of the persona “software developer”. P8 is the zero-shot baseline, where no prompt technique is used and the model is only provided with the programming task. P25 is the case where all prompt techniques are used in conjunction.

We then map each variation from Table I to the relevant information and templates for prompt techniques (e.g., imported libraries for packages), then we combine them with the 221 prompts from CoderEval, leading to 7072 concrete prompts (221 prompts times 32 variations). Our dataset, the virtual environments, and the few-shot examples and tests are provided in our replication package [35].

[Fig. 2. Example prompt in CodePromptEval. The labeled parts of the example prompt are:
Constraint: Respond with a Python function in one code block.
Persona: As a software developer who follows best coding practices for maintainability such as avoiding code smells and writing simple and clean code,
Few-shot examples: For example, if the input is 4523605 and 3600, then the output is "01:15:23.605+01:00", and if the input is 4523605 and None then the output is "01:15:23.605".
Chain-of-Thought: Think carefully and logically, explaining your answer step by step.
Prompt description: Convert nanoseconds to a time in fixed format.
Signature: The function signature is: def hydrate_time(nanoseconds, tz=None)
Packages: The function has access (but does not necessarily use) the following packages: time pytz datetime.]

We illustrate an example prompt with all prompting techniques (P25) in Figure 2. The prompt description, signature, and packages are extracted from CoderEval, while we construct the few-shot examples, persona, and chain-of-thought texts as a part of CodePromptEval. We also append a constraint at the beginning of each prompt to ensure that the output has a block of Python code with a self-contained function.
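To make the template idea concrete, the following sketch assembles a prompt from the parts shown in Figure 2. The function `build_prompt` and its parameters are hypothetical, and the exact ordering of the parts is an assumption: the paper states only that the constraint comes first, the persona is placed near the beginning, and the few-shot examples follow the purpose of the code.

```python
# Template texts as they appear in the example prompt of Figure 2.
CONSTRAINT = "Respond with a Python function in one code block."
PERSONA = ("As a software developer who follows best coding practices for "
           "maintainability such as avoiding code smells and writing simple "
           "and clean code,")
COT = "Think carefully and logically, explaining your answer step by step."

def build_prompt(task, *, persona=False, cot=False, few_shot=None,
                 signature=None, packages=None):
    """Join the selected prompt parts in one fixed (assumed) order."""
    parts = [CONSTRAINT]
    if persona:
        parts.append(PERSONA)
    if cot:
        parts.append(COT)
    parts.append(task)                      # purpose of the code
    if few_shot:                            # list of (input, output) strings
        examples = ", and ".join(
            f"if the input is {inp}, then the output is {out}"
            for inp, out in few_shot)
        parts.append(f"For example, {examples}.")
    if signature:
        parts.append(f"The function signature is: {signature}")
    if packages:
        parts.append("The function has access (but does not necessarily use) "
                     "the following packages: " + " ".join(packages) + ".")
    return " ".join(parts)

# All five techniques switched on, using the values from Figure 2.
example = build_prompt(
    "Convert nanoseconds to a time in fixed format.",
    persona=True, cot=True,
    few_shot=[("4523605 and 3600", '"01:15:23.605+01:00"'),
              ("4523605 and None", '"01:15:23.605"')],
    signature="def hydrate_time(nanoseconds, tz=None)",
    packages=["time", "pytz", "datetime"],
)
zero_shot = build_prompt("Convert nanoseconds to a time in fixed format.")
```

Keeping every technique behind its own keyword argument means that each of the 32 combinations is a different call to the same function, so the wording of a technique never varies between treatments.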
If different prompt techniques are combined, we apply them in a fixed order (as given in Figure 2). This order ensures the sentence flows naturally. For example, the common practice is to place the persona at the beginning, and the few-shot examples after the purpose of the code. While it is possible to experiment with different orders of prompt techniques in a prompt, we consider this outside the scope of this study.

C. Code Generation

We focus on LLMs with a decoder-only transformer architecture, which is at the time of writing the preferred architecture to use in code generation tasks [36]. Therefore, we select the following LLMs for our study: GPT-4o, Llama3-70B-Instruct, and Mistral-Small-Instruct-2409 (22B). We also collect data for two previous-generation LLMs (GPT-3.5-turbo and Llama2-7B-Instruct), but omit discussing the results for these older models for reasons of brevity in this paper. The collected data for these models is still available in our replication package [35].

We run all 7072 prompts on the selected LLMs with a temperature of 0.2, which has been commonly used for code generation tasks [27, 37]. For the API-based GPT models, we send requests to the external API and store the responses. We host the remaining models on the Alvis cluster, a NAISS resource (National Academic Infrastructure for Supercomputing in Sweden) dedicated to Artificial Intelligence and Machine Learning research (footnote 4), using models downloaded from Huggingface (footnote 5). Running the self-hosted LLMs on Alvis required around 1600 GPU hours using Nvidia A100 GPUs. For the GPT models, we use the OpenAI API, which is billed based on the tokens that are processed. To run GPT-4o and GPT-3.5-turbo on all the prompts in CodePromptEval, we provide around 1.1 million input tokens and generate approximately 3.7 million output tokens.

Footnote 4: https://www.c3se.chalmers.se/about/Alvis/
Footnote 5: https://huggingface.co

D. Evaluating the LLM-generated functions

After generating 7072 code solutions, we evaluate them based on three main aspects following our research questions (correctness, similarity, and quality). We use different tests to measure statistically significant differences for the measures below, hence we detail the choice of statistical methods in their corresponding results sections.

Correctness: To evaluate their correctness, we run the generated functions against their corresponding tests in CoderEval. There are two types of tests in CoderEval: Python unit tests, and a main function with different statements and conditions that set a boolean variable isT (is True) to False when at least one of the conditions does not hold. To ensure consistency and instrumentation of our experiment, we ensure that an AssertionError is thrown when needed, by adding the statement assert isT at the end of tests that take the form of a main function. Furthermore, as some of the LLM-generated functions can be erroneous and get stuck in an infinite loop, we wrap the tests with error-handling constructs (a try/except) and set a timeout of 60 seconds per function. Then we collect the test results and the error messages when applicable.

Similarity: To assess the LLM-generated code’s similarity to the ground truth obtained from CoderEval (human-written functions), we measure the CodeBLEU score [29]. CodeBLEU combines four dimensions of similarity to capture alignment in the syntax, semantics, and data flow between the generated functions and the ground truth. The different dimensions are N-gram match (BLEU), weighted N-gram match (weighted-BLEU), Abstract Syntax Tree (AST) match, and data flow match. For more specific insights, we also measure the syntactic similarity (coding style and variable naming) using BLEU and weighted-BLEU, and semantic similarity (i.e., structural and algorithmic design) using AST match.
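The correctness procedure described above can be sketched in a few lines of Python. This is an illustration rather than the authors' harness: instead of an in-process try/except, it swaps in a separate interpreter per test so that the timeout can also stop infinite loops, and the helper names `make_script` and `run_test_script` are hypothetical.

```python
import subprocess
import sys

def make_script(function_src, test_src):
    """Concatenate a generated function with a main-style test.

    The test is expected to set a boolean `isT`; the appended assert turns
    a False value into an AssertionError, as described in the text.
    """
    return f"{function_src}\n\n{test_src}\n\nassert isT\n"

def run_test_script(script, timeout_s=60.0):
    """Run the script in a fresh interpreter; return (passed, error_type)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run on timeout.
        return False, "Timeout"
    if proc.returncode == 0:
        return True, None
    # The last traceback line names the uncaught exception,
    # e.g. "AssertionError" or "TypeError: unsupported operand ...".
    lines = proc.stderr.strip().splitlines()
    last = lines[-1] if lines else "UnknownError"
    return False, last.split(":", 1)[0].strip()

# Toy task (not from CoderEval): one correct, one wrong, one looping function.
TEST = "isT = add(1, 2) == 3"
good = make_script("def add(a, b):\n    return a + b", TEST)
wrong = make_script("def add(a, b):\n    return a - b", TEST)
stuck = make_script("def add(a, b):\n    while True:\n        pass", TEST)

results = {
    "good": run_test_script(good, timeout_s=10),
    "wrong": run_test_script(wrong, timeout_s=10),
    "stuck": run_test_script(stuck, timeout_s=1),
}
```

Recording only the exception class keeps the result easy to aggregate by error type; the full stderr text could be stored alongside it when the message itself is needed.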
Quality: Regarding code quality, we focus on measures that are related to maintainability [38] and we only measure them for functions that pass their tests. In other words, we measure the quality only for functionally correct functions. We use Pylint (footnote 6) to generate a report with identified code smells in the generated functions. Moreover, we compute the code complexity for both the LLM-generated functions and the equivalent ground truth (i.e., the human-written functions in CoderEval) to compare both results and see how the different prompts have an impact on the code quality. Code complexity refers to how detailed and interconnected different parts of the code are, which can make the code harder to understand and test. To get an overview of the complexity of the generated functions, we measure McCabe’s cyclomatic complexity via the Radon Python package (footnote 7) and cognitive complexity [39] via the cognitive-complexity Python package (footnote 8).

IV. CODEPROMPTEVAL OVERVIEW

In this section, we provide an overview of the aggregate results from running three LLMs (GPT-4o, Llama-3, and Mistral) on the CodePromptEval dataset. This section answers RQ1 in our study.

Note that these results are not an assessment of the capabilities of these models when used with an “ideal” prompt, but an aggregation over all prompt technique combinations in our study. That is, the following results should be read as an overview of CodePromptEval, and not as a judgment of which LLM performs best in general. Detailed drill-downs assessing the performance of individual (combinations of) prompt techniques will be presented in Section V.

TABLE II. OVERVIEW OF PASSED AND FAILED FUNCTIONS PER LLM. THE TOTAL NUMBER OF FUNCTIONS PER LLM IS 7072.

LLM                    # Passed Functions   # Failed Functions
GPT-4o                 3707 (52.42%)        3365 (47.58%)
Llama3-70B-Instruct    3564 (50.40%)        3508 (49.60%)
Mistral-22B-Instruct   3335 (47.16%)        3737 (52.84%)

A high-level results summary is shown in Table II.
There are a total of 7072 generation tasks in the dataset. All three models are able to solve (generate functions that pass all tests) approximately half of the tasks. Mistral performs worst in our study, solving 3335 (47.16%) of tasks, and GPT-4o does best, solving 3707 (52.42%), outperforming the worst model by approximately 5 percentage points.

To get a better idea of whether these results are impacted by the code level of the function, we use the code levels defined by Yu et al. [10] that are based on the nature of dependencies of the function. The code levels are: self-contained (no need to import), standard library runnable (no need to install), public library runnable (uses libraries available on PyPI), class runnable (uses code outside the function, but within the class), file runnable (uses code outside the class, but within the file), and project runnable (uses code in other files). Code levels provide a rough indication of the “difficulty” of a generation task, based on what kind of dependencies the LLM needs to correctly incorporate.

Footnote 6: https://pypi.org/project/pylint/
Footnote 7: https://pypi.org/project/radon/
Footnote 8: https://pypi.org/project/cognitive-complexity/

[Fig. 3. Passed and failed functions per LLM for each code level. The total number of functions per LLM is 7072. Panels: GPT-4o, Llama3-70B, Mistral-22B; x-axis: percentage of functions; bars: class runnable, file runnable, project runnable, public library runnable, self-contained, standard library runnable; colors: test result (failed, passed).]

We looked into the code levels that passing and failing functions belong to (see Figure 3).
Unsurprisingly, the fail rate for all models increases as tasks get more difficult (i.e., by construction, class runnable tasks tend to be substantially more challenging than self-contained ones, and all models struggle much more with solving them correctly). Pass rates for the easiest type of task (standard library runnable) are close to 90% for all models, going down to as low as 31% to 41% for the most challenging tasks (class runnable). We observe that, overall, all three models perform comparably on most code levels, with the exception of self-contained tests (where GPT-4o outperforms the other models by a larger margin of 10 to 12 percentage points). This difference explains most of the slightly higher overall performance of GPT-4o. We have also confirmed these differences using a Chi-square statistical test and Cohen’s ω as effect size (ω = 0.34, a medium effect) [40].

[Fig. 4. Frequency of error types among failing tests for functions generated by GPT-4o, Llama3-70B, and Mistral-22B. Panels: GPT-4o, Llama3-70B, Mistral-22B; x-axis: count; categories: IndexError, SyntaxError, KeyError, ValueError, NameError, AttributeError, ImportError, TypeError, AssertionError.]

Finally, we report what error types led to the failing tests shown in Table II and Figure 3. We report the error types based on the Python exception that is first thrown when running the tests. The results of the error types are visualized in Figure 4. The most common error type for failed tests across the LLMs is AssertionError, indicating that the LLM generated a Python function that did not exhibit precisely the expected functionality (as defined through unseen tests). However, other errors are also frequently encountered, such as TypeErrors (an operation is performed on a value of an inappropriate type, indicating that the LLM misjudged the runtime type of a Python object), AttributeErrors (an invalid attribute reference is made), and ImportErrors (a faulty import of a module or object).
Other errors, such as NameErrors or IndexErrors, exist but are rare. While there are differences between the LLMs, they are relatively minor and not systematic. The most notable difference is that Mistral tends to generate functions leading to an ImportError or NameError more frequently than the other LLMs, whereas AttributeErrors are less frequent in Mistral-generated code.

Key Findings (RQ1): Overall, we observed that GPT-4o slightly outperforms the other LLMs in the study. However, in general, results are consistent between current-generation LLMs. Depending on task difficulty, all LLMs can solve between 31% and 89% of tasks. AssertionErrors and TypeErrors are the most common causes of failed tests.

V. PROMPT TECHNIQUE COMPARISON

We now turn towards RQ2, and describe the results of a statistical analysis examining how the different prompt techniques applied in each prompt impact the function regarding (i) correctness, (ii) similarity to the ground truth, and (iii) quality.

A. Correctness

A central question for assessing the value of prompt techniques is how likely a technique (or combination of techniques) is to lead to code that (i) does not throw errors and (ii) passes the tests. To measure the correctness for different prompts, we use the well-established Pass@k metric [27]. This metric measures the likelihood that the LLM generates a correct solution within k attempts. When k = 1, the metric becomes a measure of accuracy, as it calculates the number of passed functions divided by the total number of functions. In our study, we run the functions generated by the 32 combinations of prompt techniques on the tests provided by the CoderEval benchmark. Then, we collect the test results (pass or fail) and measure Pass@1 accordingly. Figure 5 shows Pass@1 results for all combinations of techniques (see Table I) for GPT-4o. Given that results between different models appear to be very consistent (see also Section V), we focus our discussion on one example model.
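The Pass@k definition above can be illustrated with a minimal sketch. It assumes the unbiased estimator that is commonly used with this metric, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct; the text itself only spells out the k = 1 case, and the function names and the sample outcomes below are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n generated samples of which c pass all tests.

    Assumption: the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_1(results: list[bool]) -> float:
    """Pass@1 with one sample per task: passed functions / all functions."""
    return sum(results) / len(results)


# Hypothetical outcomes for eight tasks (True = all tests passed).
outcomes = [True, False, True, True, False, True, False, True]
print(pass_at_1(outcomes))        # share of passing functions
print(pass_at_k(n=10, c=3, k=1))  # 3 of 10 samples correct, one draw
```

With a single generated function per task, as in the setup described above, `pass_at_1` is simply the fraction of functions that pass their tests.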
Results for the other models can be found in Appendix A. It is evident from Figure 5 that the most important technique when it comes to correctness is the presence of a function signature. Combining the signature with other techniques, such as few-shot or chain-of-thought, is sometimes helpful to further increase the likelihood of a generated function being correct (albeit by a very small margin, e.g., adding chain-of-thought and package information to the signature only leads to an improvement of 0.4 percentage points). The best combination, with a Pass@1 of 57.9%, is the combination of signature and few-shot. We achieved the worst results in terms of correctness when using chain-of-thought alone, with a Pass@1 of 46.2%. It is surprising to note that the impact of prompt engineering techniques is overall lower than we would have expected: the difference between the best and worst combinations is merely 11.7 percentage points, i.e., prompt programming seems to have a noticeable impact in only a little over one in ten generation tasks.

[Figure 5: y-axis: combination of prompt techniques; x-axis: Pass@1 (0.0 to 0.6).]
Fig. 5. Pass@1 results of the different (combinations of) prompt techniques exemplified for GPT-4o.

Our findings also indicate that sometimes the addition of more information in the prompt leads to worse performance.
For example, using only few-shot and signature performs better than when packages are also provided. Further, it is evident that techniques can interact in non-obvious ways. For example, both package information and CoT alone led to the worst Pass@1 results in the experiment. However, if these techniques are used in conjunction with a function signature, Pass@1 improves marginally over using only the signature in isolation.

To further investigate these interactions between factors in our experiment, we conducted a multi-linear regression analysis. Figure 6 shows the five prompt techniques in the study, their interactions, and their effect on the test result (pass or fail). For instance, “CoT:Persona” describes whether the impact on test results comes from the interaction of CoT and Persona in a prompt, regardless of whether that prompt includes other prompt techniques. Similarly, “Sig.” (signature) refers to all prompts that include at least the signature (including, for example, P23, the combination of few-shot and signature), and is not limited to prompts that only specify the signature. The multi-linear regression results in a coefficient estimate and a p-value for each factor and for the possible interactions among the factors. The coefficient reflects the impact on the test results; positive and negative coefficients refer to positive and negative impacts, respectively. The p-value indicates how significant the impact is.

[Figure 6: y-axis: prompt techniques and their interactions; x-axis: coefficient estimate; color: LLM (GPT-4o, Llama3, Mistral); shade: significance level (<0.001, <0.01, <0.05, not significant).]
Fig. 6. Multi-linear regression results for the test results (pass or fail). Each point visualizes the coefficient estimate for the corresponding combination. The darker colors represent more conservative significance levels (α). Zero-shot is not depicted, as it cannot be combined with other techniques.

In line with our previous findings, we observe that the presence of a signature and few-shot in a prompt (regardless of whether they are combined with other prompt techniques) affects the test results positively, with positive, high coefficient estimates (meaning there is a significant positive impact on correctness), albeit with different statistical significance levels for different LLMs. Interestingly, few-shot does not have a statistically significant impact in the case of the Mistral model. The remaining main factors (packages, chain-of-thought, and persona) do not have a statistically significant impact on any of the three models. However, it is interesting to observe that the impact of a persona as well as of chain-of-thought even trends negative (coefficient estimate smaller than 0).

Key Findings (RQ2.1): The presence of a signature or few-shot has the clearest positive impact on correctness. The other prompt techniques in the study do not have a statistically significant impact on correctness. However, in general, the difference between “good” and “bad” prompt techniques is surprisingly low. Adding additional information to a prompt sometimes leads to worse performance.

Digging deeper into what causes generated functions to fail, Figure 7 displays the percentages of error types encountered for each combination of prompt techniques. We show the four most common error types (AssertionError, TypeError, AttributeError, and ImportError) using the GPT-4o model (see Appendix C for the other models).
For example, 51% of the failed functions of prompts with few-shot, packages, and signature throw an AssertionError. The combinations of prompt techniques are ordered from the fewest errors (at the top) to the most errors (at the bottom).

[Figure 7: heatmap; rows: combinations of prompt techniques; columns: error type (AssertionError, TypeError, AttributeError, ImportError, Other); cells: percentage of failed functions.]
Fig. 7. The percentages of error types that we observed in failed functions generated by different combinations of prompt techniques (GPT-4o).

In general, we see that the prompts that result in the fewest errors (among the first rows in the heatmap) are combinations that include a signature. On the other end, the prompts with the most errors lack few-shot examples. Taking a closer look at the different error types, we observe that while AssertionErrors generally occur at a higher rate than the other error types across all prompt techniques, they are particularly frequent in prompts that include the function signature. In contrast, the absence of the function signature often leads to TypeErrors, primarily because the LLM misjudges the expected number or order of positional arguments when generating functions. ImportErrors were observed more frequently in prompts that did not employ few-shot prompting. Interestingly, in a subset of cases, ImportErrors occurred even when packages were explicitly specified. To investigate this, we manually inspected five random prompts where packages were specified but still resulted in ImportErrors. We found that when the prompt indicated the use of a package that is local or unfamiliar to the LLM, the LLM hallucinated and attempted to import non-existent functions from the specified packages.

We note that these findings are consistent for GPT-4o and Llama3, while the errors of code generated by Mistral lacked any clear patterns for the above-mentioned error types. However, we observed a trend of a higher rate of AttributeErrors in Mistral when the signature is included in the prompt. For the other two LLMs (GPT-4o and Llama3), we did not observe any consistent patterns among the prompts that triggered AttributeErrors. Overall, we emphasize that encountering a certain error does not necessarily mean that the function is free from the other error types, as the program terminates at the first error thrown.
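The classification by first thrown exception can be sketched as follows. This is a minimal illustration rather than the study's actual test harness: the helper `first_error_type`, the sample functions, and the test `check_add` are all hypothetical.

```python
def first_error_type(func, test):
    """Run one test against func; return the name of the first exception, or None."""
    try:
        test(func)
    except Exception as exc:  # AssertionError is a subclass of Exception
        return type(exc).__name__
    return None


# Hypothetical generated functions for a task "add two numbers".
def generated_wrong_arity(a):      # misjudged number of positional arguments
    return a


def generated_wrong_result(a, b):  # runs without error, but wrong behavior
    return a - b


def generated_correct(a, b):
    return a + b


# Hypothetical unseen test: the call happens first, the assertion afterwards.
def check_add(func):
    assert func(2, 3) == 5


for candidate in (generated_wrong_arity, generated_wrong_result, generated_correct):
    print(candidate.__name__, first_error_type(candidate, check_add))
```

Because execution stops at the first exception, the wrong-arity function is recorded as a TypeError and never reaches the assertion, which mirrors the observation about missing signatures above.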
Assertions, however, are evaluated after the function has successfully been executed, so an AssertionError indeed indicates that no other errors have occurred. Further, AssertionErrors are qualitatively different from other error types, as they do not indicate a fundamentally broken function, but rather that the LLM misunderstood (or could not correctly guess from context) some assumptions about the functionality of the code that is to be generated.

Key Findings (RQ2.1): Including the signature or few-shot examples in prompts generally reduces errors, particularly TypeErrors, AttributeErrors, and ImportErrors. Providing package information can naturally reduce ImportErrors but may cause hallucinated imports if the package is unfamiliar to the LLM.

B. Similarity

Beyond correctness, we believe that another important question is how similar generated functions are to the human-written baseline. We use the CodeBLEU score [29] to measure how similar the generated function is to the baseline in terms of syntax, semantics, and logic structure. CodeBLEU is a composite score that integrates four scores with different weights (by default, 25% for each): the n-gram match (BLEU [41]), a weighted BLEU, the Abstract Syntax Tree (AST) node match, and the dataflow match as a semantic similarity measure. In our analysis, we remove the signature of the generated function and the ground truth before measuring the similarity to avoid any bias toward the signature prompt technique. Consequently, we can no longer measure the dataflow similarity, as the functions are no longer parsable (i.e., parameters and their dataflow in the function can no longer be identified). Therefore, we set the weight for the dataflow match to 0, and to 1/3 for each of the three other match scores. In Figure 8, we see that the overall CodeBLEU score is low for all approaches (varying between 12.2% and 17.9% for Llama3), indicating that generated solutions differ substantially from how humans have solved the same tasks.
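The reweighting described above amounts to a weighted sum of the component scores. The sketch below is an illustration only: `combine_codebleu` is a hypothetical helper (not the API of any CodeBLEU package), and the component scores are made-up values.

```python
from math import isclose

# Component order: n-gram, weighted n-gram, AST match, dataflow match.
DEFAULT_WEIGHTS = (0.25, 0.25, 0.25, 0.25)   # default stated in the text
STUDY_WEIGHTS = (1 / 3, 1 / 3, 1 / 3, 0.0)   # dataflow weight set to 0


def combine_codebleu(ngram, weighted_ngram, ast_match, dataflow, weights):
    """Weighted sum of the four component scores (hypothetical helper)."""
    if len(weights) != 4 or not isclose(sum(weights), 1.0):
        raise ValueError("expected four weights that sum to 1")
    components = (ngram, weighted_ngram, ast_match, dataflow)
    return sum(w * s for w, s in zip(weights, components))


# Made-up component scores for one generated function; the dataflow score is
# unavailable once signatures are stripped, so it is passed as 0.0.
scores = dict(ngram=0.10, weighted_ngram=0.14, ast_match=0.24, dataflow=0.0)
print(combine_codebleu(**scores, weights=STUDY_WEIGHTS))
print(combine_codebleu(**scores, weights=DEFAULT_WEIGHTS))
```

With the dataflow weight at 0, the composite is simply the mean of the three remaining components, so a missing dataflow score does not pull the overall score down.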
Results for the other models are in Appendix D. We observe that using any prompt technique increases similarity (i.e., zero-shot has the lowest similarity to the baseline for all three models). Consistent with the correctness results, combinations that include a signature lead to higher similarity. This is unsurprising, given that a predefined signature restricts the solution space for the LLM (which can be desirable or unwanted depending on context). Combining more techniques indeed seems to generally increase similarity.

[Figure 8: y-axis: combination of prompt techniques; x-axis: average CodeBLEU (%).]
Fig. 8. The average CodeBLEU scores for the functions generated by each combination (Llama3).

To better understand the impact of the prompt techniques and their interactions on the code similarity, we now perform a general multi-linear regression to see how the different prompt techniques and their interactions can impact the CodeBLEU and the three constituent dimensions we use (n-gram, weighted n-gram, and AST match scores). N-gram and weighted n-gram matches can give us insights into how similar the generated function is to the baseline on a lexical level, i.e., the set of words such as coding style, variable naming, and/or in-line documentation in the function.
The AST match focuses on the syntactic structure and code constructs in the function such as nested statements independently of their naming (e.g., loops, conditionals, operations). Figure 9 shows the results of our multi-linear regression. We see that, regardless of the test results, the presence of a signature or persona in a prompt can significantly increase the CodeBLEU score. Chain-of-thought (CoT) seems to also positively impact the CodeBLEU score, but only for Llama3. The impact of the signature was shared across all dimensions of CodeBLEU. The impact of the persona and CoT was only observed for the lexical similarity, which explains the slightly lower p-values compared to the signature. In an interesting case, few-shot seems to lower the lexical similarity but increase the syntactic similarity, which results in no overall impact on the CodeBLEU. The impact of few-shot suggests that while it generates functions that may use different variable naming and style (lower lexical similarity), the structure and logic are close to the baseline (higher syntactic similarity).

Key Findings (RQ2.2): The signature, persona, and chain-of-thought increase the overall similarity of the function to the baseline (i.e., code written by humans). Few-shot increases the syntactic similarity (structure of the function) and decreases the lexical similarity (variable naming).

C. Quality

Using prompt techniques that yield correct functions does not necessarily mean that these functions are maintainable and of good quality. Hence, we now turn to an assessment of the quality of the generated code. In our experiment, we focus on code smells and complexity as proxies of code quality. For this analysis, we only evaluate functions that pass their tests (see Section V-A). We do not believe that assessing the quality of functionally incorrect implementations is fruitful because refactoring must be done on a working piece of code and preserve its behavior [42].
For code smells, we run Pylint on the generated functions for each prompt in CodePromptEval, using the code smell IDs defined by Pylint. Then, we group the code smells for prompts that share the same prompt techniques, and finally, we filter out code smells that make up less than 5% of the total number of code smells for the group.

TABLE III. LIST OF CODE SMELLS IN GENERATED FUNCTIONS.

| Category | Code Smell ID | Definition |
| --- | --- | --- |
| Error | E0602 | Usage of an undefined variable. |
| Warning | W0212 | Access to a protected class member. |
| Warning | W0611 | Import statement not used. |
| Warning | W0613 | Function argument is not used. |
| Warning | W0621 | Redefines name from outer scope. |
| Refactoring | R0903 | Insufficient public methods in a class. |
| Refactoring | R1705 | Unnecessary "else" after "return". |
| Convention | C0301 | Line exceeds the character limit. |
| Convention | C0103 | Violating UPPER_CASE naming style. |
| Convention | C0115 | Class lacks a descriptive docstring. |
| Convention | C0116 | Function lacks a descriptive docstring. |
| Convention | C0304 | File missing a final newline. |

We find 12 code smells that fulfill these criteria for all LLMs (see Table III). Most identified code smells are warning and convention code smells, but there are also one error and two refactoring smells. From this list, we decided to remove C0304, as it is present in all generated functions across all LLMs and is mostly an artifact of our generation pipeline. In Figure 10, we show what percentage of functions have at least one instance of each code smell. For reasons of brevity, we focus on GPT-4o (results for Llama3 and Mistral are in Appendix B). We note that 72% of the functions generated using the few-shot technique contain C0116 code smells, indicating that these functions lacked a descriptive docstring (in contrast to only 17% of functions generated by chain-of-thought combined with a persona and package information).
In general, we observe that prompts that apply the few-shot and signature prompt techniques generate functions with more code smells and, more specifically, more warning and error code smells compared to other prompts. On the other hand, we observed that CoT, persona, and package lead to functions with fewer code smells, unless these prompt techniques are combined with few-shot and/or signature, in which case the percentage of code smells increases.

This is interesting, as we have seen that few-shot and signature are the techniques with the clearest positive impact on correctness (see Section V-A). In part, this discrepancy could be explained by solutions for challenging tasks that LLMs only solve correctly when provided examples or a signature (recall that, for this analysis, we have only investigated functions that pass all tests; hence, some challenging functions have an analyzable solution for signature and few-shot, but not for other techniques). However, we note that the differences in Figure 10 are too large to be entirely explained in this way. Consequently, we conclude that CoT, persona, and package information indeed seem to systematically lead to fewer code smells.

[Figure 9: four panels (CodeBLEU; BLEU (n-gram), 33.33%; BLEU-weighted, 33.33%; Syntactic Similarity (AST), 33.33%); y-axis: prompt techniques and interactions; x-axis: coefficient estimate; color: LLM (GPT-4o, Llama3, Mistral); shade: significance level (<0.001, <0.01, <0.05, not significant).]
Fig. 9. The coefficient estimates from the multi-linear regression of prompt technique combinations that significantly impact the CodeBLEU and its components: n-gram match (BLEU), weighted n-gram match (BLEU-weighted), and AST match scores. The combinations of prompt techniques that do not have a significant impact on the CodeBLEU or any of its components using any model are omitted for brevity. A complete plot can be found in Appendix D.

[Figure 10: heatmap; rows: combinations of prompt techniques; columns: code smell ID (C0116, C0301, C0115, R0903, W0613, W0611, R1705, E0602, W0621, W0212, C0103, W1203); cells: percentage of functions.]
Fig. 10. Percentages of functions generated by GPT-4o that have different code smells. Empty fields indicate that no smell of this type is found. Note that functions can have instances of multiple types of smells.
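The grouping and 5% filtering step used for the code smell analysis in this section can be sketched as below. This assumes the Pylint findings have already been collected as (combination, smell ID) pairs; the helper name and the sample findings are hypothetical, and running Pylint itself is left out.

```python
from collections import Counter, defaultdict


def common_smells_by_group(findings, min_share=0.05):
    """Group smell IDs by prompt-technique combination and drop rare smells.

    A smell is dropped when it makes up less than `min_share` of all smell
    occurrences in its group.
    """
    grouped = defaultdict(Counter)
    for combination, smell_id in findings:
        grouped[combination][smell_id] += 1

    kept = {}
    for combination, counts in grouped.items():
        total = sum(counts.values())
        kept[combination] = {
            smell: n for smell, n in counts.items() if n / total >= min_share
        }
    return kept


# Hypothetical findings: (prompt-technique combination, Pylint smell ID).
findings = (
    [("Few-shot", "C0116")] * 24
    + [("Few-shot", "W0613")] * 15
    + [("Few-shot", "E0602")] * 1      # 1 of 40 = 2.5%, filtered out
    + [("CoT, Persona", "C0301")] * 3
)
print(common_smells_by_group(findings))
```

Filtering per group rather than globally means a smell can be kept for one combination and dropped for another, so comparisons across combinations are best read as shares within each group.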
Key Finding (RQ2.3): While using CoT, persona, or package information leads to fewer correct solutions, these techniques lead to higher-quality code in terms of code smells.

We now turn towards the cyclomatic and cognitive complexity and compare the complexity of generated solutions to the complexity of the human-written baseline. In Table IV, we show the p-values resulting from the paired Wilcoxon test to assess the statistical significance of differences between the cyclomatic complexity of the generated functions and the ground truth. We only show the (combinations of) prompt techniques that had a significant impact on the cyclomatic complexity for at least two LLMs (α = 0.05). We use the Vargha-Delaney A12 measure [43] to understand the nature of the impact (reduces or increases complexity) and to quantify the effect size (Negligible (A12 ≥ 0.45), Small (0.36 ≤ A12 < 0.45), Medium (0.29 < A12 < 0.36), or Large (A12 ≤ 0.29)). Vargha-Delaney A12 is a probability measure (that was later adopted as an effect size measure), which describes the probability that a value in one level (generated function complexity) is greater than a corresponding value in another level (ground truth complexity). If A12 is less than 0.50, the values of the first level tend to be lower than those of the second level, and the lower the score is, the larger the effect size. This allows us to see whether the prompts generate functions with a significantly lower or higher complexity than the ground truth, or with a comparable complexity when no significance is observed. Complete complexity results are in Appendix E.

Similar to the code smells results, we see that CoT, persona, and packages reduce the complexity in comparison to the baseline. A zero-shot prompt also leads to lowered complexity. However, all reductions have (at most) a small effect size.
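The A12 measure can be computed directly from its definition as a probability. The sketch below is a plain pairwise implementation with ties counted as one half; the function names and the sample complexity values are hypothetical. The effect labels cover only the reduction direction discussed in the text, and the medium band is read here as 0.29 < A12 < 0.36, which is an assumption.

```python
def vargha_delaney_a12(first, second):
    """Probability that a value from `first` exceeds a value from `second`.

    Ties count as one half, so 0.5 means no tendency either way.
    """
    greater = sum(1 for x in first for y in second if x > y)
    ties = sum(1 for x in first for y in second if x == y)
    return (greater + 0.5 * ties) / (len(first) * len(second))


def effect_label(a12):
    """Label a reduction (A12 at or below 0.5) using the bands from the text.

    Values above 0.5 (an increase) are not handled by this sketch.
    """
    if a12 >= 0.45:
        return "negligible"
    if a12 >= 0.36:
        return "small"
    if a12 > 0.29:  # assumed reading of the medium band
        return "medium"
    return "large"


# Hypothetical cyclomatic complexities: generated functions vs. ground truth.
generated = [1, 2, 2, 3]
ground_truth = [2, 3, 3, 4]
a12 = vargha_delaney_a12(generated, ground_truth)
print(a12, effect_label(a12))
print(effect_label(0.403))  # an A12 value of the size reported in Table IV
```

The pairwise loop is quadratic in the sample size, which is acceptable for a few thousand functions; a rank-based formulation would be the usual choice for larger samples.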
These small effect sizes can be explained by the low cyclomatic complexity of all LLM tasks: in general, only minor simplifications are even possible for the generally rather simple code snippets.

For cognitive complexity (see Table V), we observe larger differences among the LLMs than between combinations of prompt techniques in general. There was only one combination of prompt techniques that reduced the cognitive complexity across all three LLMs with a small effect size. GPT-4o seems to generate functions with no or negligible differences to the ground truth. Mistral can reduce the cognitive complexity with a small effect size when the prompt does not include few-shot examples and applies a persona, packages, or CoT. In contrast, there are no clear trends or patterns among the prompts for Llama3; rather, most of the prompt techniques seem to reduce the cognitive complexity with a small effect size. We conclude that Llama3 appears to lead to simpler solutions than the other models, particularly GPT-4o.

It is interesting to observe that no combination of prompt techniques leads to more complex solutions than the baseline: generated solutions are always slightly simpler or comparably complex. Viewed positively, this may indicate that LLMs generate rather clean code. However, a more negative interpretation may also be that the generated code does not cover some complex corner cases that human-written solutions account for (which may not be covered by CoderEval tests).

TABLE IV. CYCLOMATIC COMPLEXITY ANALYSIS USING WILCOXON TEST AND A12 VARGHA-DELANEY FOR THE EFFECT SIZE (N: NEGLIGIBLE, S: SMALL, M: MEDIUM, L: LARGE). ↓ INDICATES A REDUCTION IN COMPLEXITY, ∅ INDICATES NO STATISTICAL DIFFERENCE.

| Combinations | GPT-4o p-value | GPT-4o A12 | GPT-4o Effect | Llama3 p-value | Llama3 A12 | Llama3 Effect | Mistral p-value | Mistral A12 | Mistral Effect |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Package | 0.0012 | 0.403 | ↓(S) | 0.0008 | 0.402 | ↓(S) | 0.0407 | 0.417 | ↓(S) |
| CoT, Package | 0.0366 | 0.446 | ↓(S) | 0.0147 | 0.440 | ↓(S) | 0.0211 | 0.410 | ↓(S) |
| Zero-shot | 0.0143 | 0.433 | ↓(S) | 0.0012 | 0.392 | ↓(S) | 0.0402 | 0.407 | ↓(S) |
| CoT, Persona | 0.0576 | 0.462 | ∅ | 0.0100 | 0.432 | ↓(S) | 0.0062 | 0.433 | ↓(S) |
| Persona, Sig. | 0.0999 | 0.466 | ∅ | 0.0358 | 0.436 | ↓(S) | 0.0080 | 0.439 | ↓(S) |
| Persona | 0.0857 | 0.464 | ∅ | 0.0086 | 0.411 | ↓(S) | 0.0259 | 0.436 | ↓(S) |
| Sig. | 0.1593 | 0.453 | ∅ | 0.0094 | 0.425 | ↓(S) | 0.0157 | 0.434 | ↓(S) |
| Package, Sig. | 0.0426 | 0.457 | ↓(N) | 0.0156 | 0.417 | ↓(S) | 0.0175 | 0.448 | ↓(S) |
| Persona, Package | 0.0283 | 0.438 | ↓(S) | 0.0039 | 0.404 | ↓(S) | 0.0661 | 0.460 | ∅ |
| CoT | 0.0191 | 0.447 | ↓(S) | 0.0559 | 0.448 | ∅ | 0.0095 | 0.417 | ↓(S) |

TABLE V. COGNITIVE COMPLEXITY ANALYSIS USING WILCOXON TEST AND A12 VARGHA-DELANEY FOR THE EFFECT SIZE (N: NEGLIGIBLE, S: SMALL, M: MEDIUM, L: LARGE). ↓ INDICATES A REDUCTION IN COMPLEXITY, ∅ INDICATES NO STATISTICAL DIFFERENCE.

| Combinations | GPT-4o p-value | GPT-4o A12 | GPT-4o Effect | Llama3 p-value | Llama3 A12 | Llama3 Effect | Mistral p-value | Mistral A12 | Mistral Effect |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Package, Sig. | 0.0365 | 0.449 | ↓(S) | 0.0001 | 0.382 | ↓(S) | 0.0175 | 0.448 | ↓(S) |
| Few-shot, Persona | 0.0364 | 0.442 | ↓(S) | 0.0327 | 0.412 | ↓(S) | 0.2305 | 0.447 | ∅ |
| CoT, Persona | 0.3678 | 0.501 | ∅ | 0.0026 | 0.419 | ↓(S) | 0.0062 | 0.433 | ↓(S) |
| CoT, Package | 0.3300 | 0.485 | ∅ | 0.0256 | 0.444 | ↓(S) | 0.0211 | 0.410 | ↓(S) |
| CoT | 0.3609 | 0.496 | ∅ | 0.0449 | 0.445 | ↓(S) | 0.0095 | 0.417 | ↓(S) |
| Persona, Sig. | 0.0883 | 0.456 | ∅ | 0.0025 | 0.416 | ↓(S) | 0.0079 | 0.439 | ↓(S) |
| Persona | 0.1156 | 0.493 | ∅ | 0.0032 | 0.393 | ↓(S) | 0.0259 | 0.436 | ↓(S) |
| Package | 0.0777 | 0.446 | ∅ | 0.0002 | 0.371 | ↓(S) | 0.0407 | 0.417 | ↓(S) |
| Sig. | 0.5717 | 0.485 | ∅ | 0.0003 | 0.398 | ↓(S) | 0.0157 | 0.434 | ↓(S) |
| Zero-shot | 0.1708 | 0.486 | ∅ | 0.0016 | 0.382 | ↓(S) | 0.0402 | 0.407 | ↓(S) |

Key Findings (RQ2.3): There are noticeable differences among models with regard to the complexity of the code they produce. Llama3 appears to produce simpler solutions systematically.
There were no cases of increased complexity: LLM solutions were comparably complex to human-written code, or simpler.

VI. DISCUSSION

In this section, we discuss the key lessons learned from this study, the implications of our findings for software engineering practitioners and researchers, as well as validity threats.

A. Lessons learned

L1: The differences in the results of prompt techniques are not dramatic: We carefully designed a full factorial experiment to evaluate not only prompt techniques but also combinations of them in a prompt. Our analyses revealed that, while there was an impact of some prompt techniques on the generated functions, the results for most of the prompt techniques were not that different. For example, the difference in the Pass@1 rates for the prompts with the highest and lowest rates is only around 10-12 percentage points (see Figure 5), and the effect sizes of the complexities are mostly small or insignificant (see Table IV). These insights align with other studies that evaluate prompt techniques on code summarization [18] and generation [22], where the performance results of different prompt techniques such as CoT, few-shot, self-collaboration, among others, are also comparable. In contrast, we see clearer differences in the performance results of some prompt techniques when using benchmarks for math-related tasks or general question-answering [25, 44]. We conclude that a strong emphasis on prompt programming is not necessary in the context of function-level code generation using current-generation models.

L2: Providing information about the interface via few-shot or signature is useful, but limits the “creativity” of the LLM: In our correctness and similarity results, the signature and few-shot prompt techniques stood out among other prompt techniques.
In general, we believe that while they are two different prompt techniques, they can provide similar context about the expected functional interface in terms of positional arguments and expected output. This was also revealed through our general multi-linear regression results in Figures 6 and 9, where we see that having either signature or few-shot examples significantly impacts the code’s correctness or similarity, but their interaction or combination does not help. In relation to previous work by Ahmed et al. [45, 8], we observe a similar pattern where contextual information about the parameters and other identifiers can improve the code summarization. However, providing this information limits the solution space for the LLM (i.e., it restricts the potential for “creativity”), which may not always be desired.

L3: There is a trade-off between correctness and maintainability when choosing prompt techniques: Our analysis revealed contrasting results: prompts with few-shot examples or function signatures improved correctness but increased complexity and the number of code smells, while prompts that employed persona, CoT, or package had lower passing rates but significantly enhanced code maintainability (see Tables IV and V for complexity and Figure 10 for code smells). While previous research suggests that the use of a persona in the prompt does not improve the outcome [46] but can improve the personalization and user experience [26], we believe that this only applies to simple personas such as “software developer”. However, our results indicate that personas can be more beneficial when used as a way to induce additional quality requirements, e.g., “software developer who writes clean and simple code”.
Recent work has also shown that personas can be beneficial for code generation when used in more complex approaches such as self-collaboration, where multiple personas (e.g., a requirement engineer, a software tester, and a developer) are used together to iteratively construct the code in a systematic way [24].

B. Implications

I1: Researchers should prioritize refining prompts for more effective prompt programming experiments: We shed light on two components of experiments in prompt programming: the generation tasks and the prompts. Based on observations in our previous work [3], we note that developers often use LLMs for more complex tasks than those in common datasets such as HumanEval [47] or CoderEval [10]. Although we were able to analyze and compare different prompt techniques, we believe that a dataset with functions that are more representative of the large systems and projects that developers typically work with is needed. Further, prompts in common benchmarks often lack a consistent format or level of detail. For instance, the prompts in CoderEval are based on functions’ docstrings rather than actual prompts. Sclar et al. [48] show that LLMs, regardless of their sizes and number of parameters, are highly sensitive to small prompt changes such as prompt formatting. We observed similar behavior when experimenting with the template “The function uses the following packages” for the packages prompt technique and found that it caused errors related to using the wrong packages. We traced the issue back to the prompt itself and realized that the packages listed were not necessarily used by the function but existed in its class. When we modified the template to “The function has access to (but does not necessarily use) the following packages,” we mitigated the issue. This pre-processing of the prompt technique templates is another aspect of prompt programming recommended by Obrien et al. [49].
We found value in inspecting and refining prompts and creating our own few-shot examples, which increased our confidence in the dataset’s reliability and stability. Therefore, we encourage researchers to invest in similar efforts.

I2: Software developers should avoid overusing prompt techniques. While we saw that prompt techniques can be beneficial for certain criteria (e.g., signature for correctness and persona for quality), we also saw that combining them does not necessarily yield better results. In fact, some cases showed that including an additional prompt technique can cancel out the impact of the existing prompt techniques. For instance, in the code smells results in Figure 10, we show how adding few-shot examples to CoT and persona can increase the code smells by more than one-third. Previous work has also shown how few-shot examples can hurt LLM performance if not carefully engineered by humans [44].

I3: Different LLMs have different sensitivity levels to the prompt techniques. We argue that a single prompt technique does not have the same impact on the different aspects (correctness, similarity, or quality) of the generated code across all LLMs. We have seen that the three LLMs demonstrated different sensitivity levels to the prompt. For instance, we saw how the similarity scores of Llama3-generated functions were significantly impacted by CoT, while CoT showed no effect for GPT-4o and Mistral (see Figure 9). The error types for Mistral in Figure 16 did not seem to be as strongly impacted by prompt techniques as they were for the other LLMs. Note that these differences between the models do not necessarily come from the model size and the number of parameters it was trained on, but rather from the underlying architecture it uses (which aligns with findings by Wang et al. [18]).
This implies that when a software company integrates an LLM into its processes and provides employee training, it should develop specific guidelines tailored to the LLM, including recommendations for prompt techniques that align with the model’s characteristics. For example, if an LLM generally returns simple functions, prompt techniques that impact complexity may not be needed.

I4: Determining the purpose of the code generation is essential for the use of prompt programming. Depending on whether the intended use of the LLM is to support human developers or to completely automate code generation, prompt programming has different significance. While the use of some prompt techniques has significantly reduced the number of errors and code smells, we saw that many of these issues can be easily fixed by human developers, arguably requiring less time and effort than applying an additional prompt technique. For instance, the absence of a signature in a prompt causes TypeErrors when the LLM misjudges the number of arguments (see Figure 7). This bug may be more easily fixed by a human than by re-prompting the LLM with the correct signature. On the other hand, prompt programming can be more valuable when the purpose is to automate code generation and return correct and maintainable code without the need for human intervention, especially to apply simple modifications or refactoring actions.

C. Threats to validity

External validity. The main threats to external validity in this study are associated with the prompt techniques, the LLMs, and the benchmark we utilized. There are many possible prompt techniques that can potentially impact code generation, such as self-collaboration [31], tree-of-thought [50], or providing the whole class as context. However, we decided to select common prompt techniques that can be practically applied by a typical software developer in most code generation tasks.
Another important question is whether the use of more powerful LLMs can result in different findings and eliminate the need for prompt programming. We used three current-generation LLMs during the study, including GPT-4o (200B parameters) and Llama3 (70B parameters). Our replication package [35] also includes results for older LLMs (GPT-3.5, Llama2), showing that the prompt techniques affecting code generation in this study similarly impact older models.

Internal validity. Regarding internal validity, we acknowledge that LLMs can be sensitive to the format or structure of the prompt [48], or even the order of the few-shot examples [21]. We used a fixed order of the prompt techniques that we believe represents a natural sentence flow, and we made sure to use it consistently across all prompts. In addition, the manual creation of few-shot examples for our dataset may have introduced a degree of subjectivity. We therefore involved the first three authors in the process, allowing them to discuss possible examples and select the two most representative ones.

Construct validity. The representativeness of the benchmark (including generation tasks and functions) is an important aspect of construct validity. While current benchmarks often include functions that are not as complex as real-life tasks, we used the CoderEval dataset, which is based on large open-source projects, to minimize this threat. However, there remains the question of whether CoderEval fully captures the complexity and diversity of real-world development tasks.

Conclusion validity. For conclusion validity, we focused on three key criteria: similarity, correctness, and quality. While others, such as efficiency, could be considered, we argue that these suffice for our research questions. Robustness is ensured by using multiple metrics for each criterion.

VII.
CONCLUSION

In this study, we have investigated the impact of different prompt techniques on code generation, specifically function synthesis, along three quality dimensions (correctness, similarity to a human-written baseline, and code quality). We studied five prompt techniques, namely few-shot learning, automatic chain-of-thought, providing a persona, providing a signature, and listing packages. We conducted a full factorial analysis of these five factors using the CodePromptEval dataset, which we developed based on CoderEval. We studied three current-generation LLMs, namely GPT-4o, Llama3, and Mistral.

Our key lessons learned were that the impact of prompt techniques on correctness, similarity, and quality was not as large as might be expected. Most combinations of prompt techniques do not lead to statistically significant improvements (or regressions) in correctness, quality, or similarity. Providing type information for the function that is to be generated, either explicitly through a signature or implicitly via few-shot examples, has the clearest positive effect, particularly on correctness. Some prompt techniques have a positive impact on correctness, and others on quality. However, the obvious idea of combining them usually improves neither.

A possible future extension of our research is to evaluate to what extent our findings generalize to other code generation tasks (e.g., line completion, program repair, or the generation of full applications). It is plausible that some of the prompt techniques that did not show a meaningful positive impact on correctness in our experiments (e.g., chain-of-thought) turn out to be more relevant if the generation task is more complex. Additionally, there are other quality metrics, such as performance or energy efficiency, which should be studied in future work, particularly given that recent work indicates that AI-generated code frequently exhibits performance regressions [51].
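The full factorial design mentioned in the conclusion can be sketched as follows. Five binary factors (each technique is either present or absent) give 2^5 = 32 combinations, including the baseline with no technique. If every generation task is paired with every combination, the 7072 prompts of CodePromptEval would correspond to 7072 / 32 = 221 tasks; this is an inference from the reported totals rather than a figure stated in this section. The function name and factor labels below are illustrative and are not taken from the replication package.

```python
from itertools import product

# Illustrative labels for the five prompt techniques studied (names are assumptions).
TECHNIQUES = ("few-shot", "cot", "persona", "signature", "packages")


def full_factorial(factors=TECHNIQUES):
    """Return every on/off combination of the factors as a tuple of enabled names."""
    combos = []
    for mask in product((False, True), repeat=len(factors)):
        combos.append(tuple(name for name, on in zip(factors, mask) if on))
    return combos


combinations = full_factorial()
n_combinations = len(combinations)       # 2 ** 5 combinations
# Assumes exactly one prompt per (task, combination) pair.
tasks_implied = 7072 // n_combinations
```

The empty tuple in `combinations` stands for the baseline prompt that uses none of the five techniques.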
ACKNOWLEDGEMENTS

This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. Additionally, LLM executions were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.

REFERENCES

[1] S. I. Ross, F. Martinez, S. Houde, M. Muller, and J. D. Weisz, “The Programmer’s Assistant: Conversational Interaction with a Large Language Model for Software Development,” in Proceedings of the 28th International Conference on Intelligent User Interfaces, IUI ’23, (New York, NY, USA), pp. 491–514, Association for Computing Machinery, 2023.
[2] Z. Zeng, H. Tan, H. Zhang, J. Li, Y. Zhang, and L. Zhang, “An extensive study on pre-trained models for program understanding and generation,” in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2022, (New York, NY, USA), pp. 39–51, Association for Computing Machinery, 2022.
[3] R. Khojah, M. Mohamad, P. Leitner, and F. G. de Oliveira Neto, “Beyond Code Generation: An Observational Study of ChatGPT Usage in Software Engineering Practice,” Proc. ACM Softw. Eng., vol. 1, July 2024.
[4] J. D. Weisz, M. Muller, S. I. Ross, F. Martinez, S. Houde, M. Agarwal, K. Talamadupula, and J. T. Richards, “Better Together? An Evaluation of AI-Supported Code Translation,” in Proceedings of the 27th International Conference on Intelligent User Interfaces, IUI ’22, (New York, NY, USA), pp. 369–391, Association for Computing Machinery, 2022.
[5] S. Zheng, J. Huang, and K. C.-C. Chang, “Why Does ChatGPT Fall Short in Providing Truthful Answers?,” 2023.
[6] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C. Schmidt, “A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT,” 2023.
[7] A. J. Fiannaca, C. Kulkarni, C. J.
Cai, and M. Terry, “Programming without a Programming Language: Challenges and Opportunities for Designing Developer Tools for Prompt Programming,” in Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, CHI EA ’23, (New York, NY, USA), Association for Computing Machinery, 2023.
[8] T. Ahmed, K. S. Pai, P. Devanbu, and E. Barr, “Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization),” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ICSE ’24, (New York, NY, USA), Association for Computing Machinery, 2024.
[9] L. Reynolds and K. McDonell, “Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm,” in Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, CHI EA ’21, (New York, NY, USA), Association for Computing Machinery, 2021.
[10] H. Yu, B. Shen, D. Ran, J. Zhang, Q. Zhang, Y. Ma, G. Liang, Y. Li, Q. Wang, and T. Xie, “CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ICSE ’24, (New York, NY, USA), Association for Computing Machinery, 2024.
[11] S. Barke, M. B. James, and N. Polikarpova, “Grounded Copilot: How Programmers Interact with Code-Generating Models,” Proc. ACM Program. Lang., vol. 7, Apr. 2023.
[12] K. Ronanki, C. Berger, and J. Horkoff, “Investigating ChatGPT’s Potential to Assist in Requirements Elicitation Processes,” in 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 354–361, 2023.
[13] J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, “Software Testing With Large Language Models: Survey, Landscape, and Vision,” IEEE Transactions on Software Engineering, vol. 50, no. 4, pp. 911–936, 2024.
[14] H. Li, C.-P. Bezemer, and A. E.
Hassan, “Software Engineering and Foundation Models: Insights from Industry Blogs Using a Jury of Foundation Models,” 2024.
[15] R. Tóth, T. Bisztray, and L. Erdődi, “LLMs in Web Development: Evaluating LLM-Generated PHP Code Unveiling Vulnerabilities and Limitations,” in International Conference on Computer Safety, Reliability, and Security, pp. 425–437, Springer, 2024.
[16] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation,” in Advances in Neural Information Processing Systems (A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, eds.), vol. 36, pp. 21558–21572, Curran Associates, Inc., 2023.
[17] P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, and A. Chadha, “A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications,” 2024.
[18] G. Wang, Z. Sun, Z. Gong, S. Ye, Y. Chen, Y. Zhao, Q. Liang, and D. Hao, “Do Advanced Language Models Eliminate the Need for Prompt Engineering in Software Engineering?,” arXiv preprint arXiv:2411.02093, 2024.
[19] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language Models are Few-Shot Learners,” 2020.
[20] K. Margatina, T. Schick, N. Aletras, and J. Dwivedi-Yu, “Active Learning Principles for In-Context Learning with Large Language Models,” in Findings of the Association for Computational Linguistics: EMNLP 2023 (H. Bouamor, J. Pino, and K. Bali, eds.), (Singapore), pp. 5011–5034, Association for Computational Linguistics, Dec. 2023.
[21] Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P.
Stenetorp, “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity,” May 2022.
[22] T. Wang, N. Zhou, and Z. Chen, “Enhancing Computer Programming Education with LLMs: A Study on Effective Prompt Engineering for Python Code Generation,” 2024.
[23] D. Shrivastava, H. Larochelle, and D. Tarlow, “Repository-Level Prompt Generation for Large Language Models of Code,” in Proceedings of the 40th International Conference on Machine Learning (A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, eds.), vol. 202 of Proceedings of Machine Learning Research, pp. 31693–31715, PMLR, 23–29 Jul 2023.
[24] Y. Dong, X. Jiang, Z. Jin, and G. Li, “Self-Collaboration Code Generation via ChatGPT,” ACM Trans. Softw. Eng. Methodol., vol. 33, Sept. 2024.
[25] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic Chain of Thought Prompting in Large Language Models,” 2022.
[26] Y.-M. Tseng, Y.-C. Huang, T.-Y. Hsiao, W.-L. Chen, C.-W. Huang, Y. Meng, and Y.-N. Chen, “Two Tales of Persona in LLMs: A Survey of Role-Playing and Personalization,” 2024.
[27] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, “Evaluating Large Language Models Trained on Code,” 2021.
[28] Y. Dong, X. Jiang, Z. Jin, and G. Li, “Self-Collaboration Code Generation via ChatGPT,” ACM Trans. Softw. Eng. Methodol., vol. 33, Sept.
2024.
[29] S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan, M. Zhou, A. Blanco, and S. Ma, “CodeBLEU: a Method for Automatic Evaluation of Code Synthesis,” 2020.
[30] Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8696–8708, 2021.
[31] X. Jiang, Y. Dong, L. Wang, Z. Fang, Q. Shang, G. Li, Z. Jin, and W. Jiao, “Self-Planning Code Generation with Large Language Models,” ACM Trans. Softw. Eng. Methodol., vol. 33, Sept. 2024.
[32] J. Li, Y. Zhao, Y. Li, G. Li, and Z. Jin, “AceCoder: An Effective Prompting Technique Specialized in Code Generation,” ACM Trans. Softw. Eng. Methodol., vol. 33, Nov. 2024.
[33] J. Wei, S. Kim, H. Jung, and Y.-H. Kim, “Leveraging Large Language Models to Power Chatbots for Collecting User Self-Reported Data,” Apr. 2024.
[34] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton, “Program Synthesis with Large Language Models,” 2021.
[35] R. Khojah, F. G. de Oliveira Neto, M. Mohamad, and P. Leitner, “CodePromptEval,” Dec. 2024. https://github.com/icetlab/CodePromptEval.
[36] J. Jiang, F. Wang, J. Shen, S. Kim, and S. Kim, “A Survey on Large Language Models for Code Generation,” 2024.
[37] S. Thakur, B. Ahmad, H. Pearce, B. Tan, B. Dolan-Gavitt, R. Karri, and S. Garg, “VeriGen: A Large Language Model for Verilog Code Generation,” ACM Trans. Des. Autom. Electron. Syst., vol. 29, Apr. 2024.
[38] I. Heitlager, T. Kuipers, and J. Visser, “A Practical Model for Measuring Maintainability,” in 6th International Conference on the Quality of Information and Communications Technology (QUATIC 2007), pp. 30–39, 2007.
[39] G. A.
Campbell, “Cognitive complexity: an overview and evaluation,” in Proceedings of the 2018 International Conference on Technical Debt, TechDebt ’18, (New York, NY, USA), pp. 57–58, Association for Computing Machinery, 2018.
[40] J. Cohen, Statistical power analysis for the behavioral sciences. Routledge, 2013.
[41] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a Method for Automatic Evaluation of Machine Translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002.
[42] M. Fowler, Refactoring: improving the design of existing code. Addison-Wesley Professional, 2018.
[43] A. Vargha and H. D. Delaney, “A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong,” Journal of Educational and Behavioral Statistics, vol. 25, no. 2, pp. 101–132, 2000.
[44] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large Language Models are Zero-Shot Reasoners,” in Advances in Neural Information Processing Systems (S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, eds.), vol. 35, pp. 22199–22213, Curran Associates, Inc., 2022.
[45] T. Ahmed and P. Devanbu, “Multilingual training for software engineering,” in Proceedings of the 44th International Conference on Software Engineering, ICSE ’22, (New York, NY, USA), pp. 1443–1455, Association for Computing Machinery, 2022.
[46] M. Zheng, J. Pei, L. Logeswaran, M. Lee, and D. Jurgens, “When ‘A Helpful Assistant’ Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models,” in Findings of the Association for Computational Linguistics: EMNLP 2024 (Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, eds.), (Miami, Florida, USA), pp. 15126–15154, Association for Computational Linguistics, Nov. 2024.
[47] B. Athiwaratkun, S. K. Gouda, Z. Wang, X. Li, Y. Tian, M. Tan, W. U. Ahmad, S. Wang, Q. Sun, M. Shang, S. K. Gonugondla, H. Ding, V. Kumar, N. Fulton, A.
Farahani, S. Jain, R. Giaquinto, H. Qian, M. K. Ramanathan, R. Nallapati, B. Ray, P. Bhatia, S. Sengupta, D. Roth, and B. Xiang, “Multi-lingual Evaluation of Code Generation Models,” 2023.
[48] M. Sclar, Y. Choi, Y. Tsvetkov, and A. Suhr, “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting,” in The Twelfth International Conference on Learning Representations, 2023.
[49] D. OBrien, S. Biswas, S. M. Imtiaz, R. Abdalkareem, E. Shihab, and H. Rajan, “Are Prompt Engineering and TODO Comments Friends or Foes? An Evaluation on GitHub Copilot,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ICSE ’24, (New York, NY, USA), Association for Computing Machinery, 2024.
[50] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan, “Tree of Thoughts: Deliberate Problem Solving with Large Language Models,” in Advances in Neural Information Processing Systems (A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, eds.), vol. 36, pp. 11809–11822, Curran Associates, Inc., 2023.
[51] S. Li, Y. Cheng, J. Chen, J. Xuan, S. He, and W. Shang, “Assessing the Performance of AI-Generated Code: A Case Study on GitHub Copilot,” in 35th IEEE International Symposium on Software Reliability Engineering, ISSRE, IEEE, 2024.

APPENDIX

A. Pass@k rates

This section presents the Pass@1 results to evaluate the correctness of the generated code by Llama3 (Figure 11) and Mistral (Figure 12).
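For reference, Pass@k is commonly computed with the unbiased estimator introduced by Chen et al. [27]: for a task with n generated samples of which c pass the tests, pass@k = 1 − C(n − c, k) / C(n, k), averaged over all tasks. The sketch below assumes this estimator; with one sample per task and k = 1 it reduces to the fraction of tasks whose generated function passes. Whether the replication package uses exactly this formulation is an assumption, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples, c of them correct, budget k."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 whenever every size-k draw must contain a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over tasks; results is a non-empty list of (n, c) pairs."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

With `k = 1` the per-task value is simply `c / n`, so a task with a single passing sample scores 1.0 and a task with a single failing sample scores 0.0.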
[Figure 11: Pass@1 (x-axis) per combination of prompt techniques (plot not reproduced).]
Fig. 11. The Pass@1 results of the different prompt technique combinations for Llama3.

[Figure 12: Pass@1 (x-axis) per combination of prompt techniques (plot not reproduced).]
Fig. 12. The Pass@1 results of the different prompt technique combinations for Mistral.

B. Code smells

This section presents the percentages of code smells in functions generated by Llama3 (Figure 14) and Mistral (Figure 13).
[Figure 13: percentage of functions per Code Smell ID (C0116, C0301, C0115, W0621, R0903, W0611, W0613, R1705, C0103, E0602, W0612, W1203, W1309) for each combination of prompt techniques (plot not reproduced).]
Fig. 13. The percentages of functions that had different code smells (as Lint IDs) that were generated by the combinations of prompt techniques (Mistral).
[Figure 14: percentage of functions per Code Smell ID (C0116, C0301, E0602, C0115, W0613, R0903, W0611, R1705, W0612, W0212) for each combination of prompt techniques (plot not reproduced).]
Fig. 14. The percentages of functions that had different code smells (as Lint IDs) that were generated by the combinations of prompt techniques (Llama3).

C. Error types

This section presents the percentages of error types across prompt techniques for functions generated by GPT-4o (Figure 15) and Mistral (Figure 16).
[Figure 15: percentage of each error type (AssertionError, TypeError, AttributeError, ImportError, Other) for each combination of prompt techniques (plot not reproduced).]
Fig. 15. The percentages of error types that we observed in failed functions generated by different combinations of prompt techniques (Llama3).
[Figure 16: percentage of each error type (AssertionError, TypeError, AttributeError, ImportError, Other) for each combination of prompt techniques (plot not reproduced).]
Fig. 16. The percentages of error types that we observed in failed functions generated by different combinations of prompt techniques (Mistral).

D.
CodeBLEU scores

This section presents the CodeBLEU results of code generated by GPT-4o (Figure 18) and Mistral (Figure 19), as well as the general multi-variate regression results that show the impact of prompt techniques on CodeBLEU and its three dimensions (Figure 17).

[Figure 17: coefficient estimates (x-axis) per combination of prompt techniques, in four panels: CodeBLEU; BLEU (n-gram), 33.33%; BLEU-weighted, 33.33%; Syntactic Similarity (AST), 33.33%. Legend: LLM (GPT-4o, Llama3, Mistral) and significance level (<0.001, <0.01, <0.05, not significant). Plot not reproduced.]
Fig. 17. The coefficient estimates from the multi-linear regression of selected combinations of prompt techniques that significantly impact CodeBLEU and its components: n-gram match (BLEU), weighted n-gram match (BLEU-weighted), and AST match scores, which together contribute to the CodeBLEU score.
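The panel titles of Figure 17 indicate that each of the three components contributes one third (33.33%) of the CodeBLEU score. A minimal sketch of that combination, assuming equal weights and component scores that have already been computed on a 0 to 1 scale, is given below. Note that CodeBLEU as defined by Ren et al. [29] also includes a data-flow match term, which does not appear among the components listed here. The function and parameter names are hypothetical.

```python
def combine_codebleu(bleu, weighted_bleu, ast_match, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of three similarity components, each expected in [0, 1].

    The equal default weights mirror the 33.33% shown in the figure's panel
    titles; they are an assumption about how the components are combined.
    """
    components = (bleu, weighted_bleu, ast_match)
    for value in components:
        if not 0.0 <= value <= 1.0:
            raise ValueError("component scores must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * v for w, v in zip(weights, components))
```

Because the weights are equal, a change in any single component moves the combined score by one third of that change.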
[Figure 18: average CodeBLEU (%) (x-axis) per combination of prompt techniques (plot not reproduced).]
Fig. 18. The average CodeBLEU scores for the functions generated by each combination (GPT-4o).

[Figure 19: average CodeBLEU (%) (x-axis) per combination of prompt techniques (plot not reproduced).]
Fig. 19. The average CodeBLEU scores for the functions generated by each combination (Mistral).

E. Complexity

This section provides the complete results for the cyclomatic complexity (Table VI) and cognitive complexity (Table VII).
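Tables VI and VII report the Vargha-Delaney A12 effect size [43]. As a rough illustration, A12 estimates the probability that a value drawn from one group exceeds a value drawn from the other, with ties counted as one half. A value of 0.5 indicates no difference, and values below 0.5 mean the first group tends to be smaller (in the tables, lower complexity). The sketch below is a direct pairwise implementation with illustrative names. Which group the paper treats as the first one, and the cut-offs behind the N/S/M/L labels, are not stated in this section, so both are left out.

```python
def vargha_delaney_a12(first, second):
    """Probability that a value from `first` exceeds one from `second` (ties count 0.5)."""
    if not first or not second:
        raise ValueError("both groups must be non-empty")
    greater = 0
    ties = 0
    # Compare every pair of observations across the two groups.
    for x in first:
        for y in second:
            if x > y:
                greater += 1
            elif x == y:
                ties += 1
    return (greater + 0.5 * ties) / (len(first) * len(second))
```

The pairwise loop is quadratic in the group sizes, which is acceptable for a few hundred functions per group; a rank-based formulation would be preferable for much larger samples.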
Page 18:

TABLE VI
Cyclomatic complexity analysis using the Wilcoxon test and the A12 Vargha-Delaney effect size (N: negligible, S: small, M: medium, L: large). ↓ indicates a reduction in complexity, ∅ indicates no statistical difference. We have not observed cases of increasing complexity.

| Combinations | GPT-4o p-value | GPT-4o A12 | GPT-4o Effect | Llama3 p-value | Llama3 A12 | Llama3 Effect | Mistral p-value | Mistral A12 | Mistral Effect |
|---|---|---|---|---|---|---|---|---|---|
| Package | 0.0012 | 0.403 | ↓(S) | 0.0008 | 0.402 | ↓(S) | 0.0407 | 0.417 | ↓(S) |
| CoT, Package | 0.0366 | 0.446 | ↓(S) | 0.0147 | 0.440 | ↓(S) | 0.0211 | 0.410 | ↓(S) |
| Zero-shot | 0.0143 | 0.433 | ↓(S) | 0.0012 | 0.392 | ↓(S) | 0.0402 | 0.407 | ↓(S) |
| CoT, Persona | 0.0576 | 0.462 | ∅ | 0.0100 | 0.432 | ↓(S) | 0.0062 | 0.433 | ↓(S) |
| Persona, Sig. | 0.0999 | 0.466 | ∅ | 0.0358 | 0.436 | ↓(S) | 0.0080 | 0.439 | ↓(S) |
| Persona | 0.0857 | 0.464 | ∅ | 0.0086 | 0.411 | ↓(S) | 0.0259 | 0.436 | ↓(S) |
| Sig. | 0.1593 | 0.453 | ∅ | 0.0094 | 0.425 | ↓(S) | 0.0157 | 0.434 | ↓(S) |
| Package, Sig. | 0.0426 | 0.457 | ↓(N) | 0.0156 | 0.417 | ↓(S) | 0.0175 | 0.448 | ↓(S) |
| Persona, Package | 0.0283 | 0.438 | ↓(S) | 0.0039 | 0.404 | ↓(S) | 0.0661 | 0.460 | ∅ |
| CoT | 0.0191 | 0.447 | ↓(S) | 0.0559 | 0.448 | ∅ | 0.0095 | 0.417 | ↓(S) |
| CoT, Persona, Package | 0.0859 | 0.452 | ∅ | 0.0002 | 0.410 | ↓(S) | 0.1732 | 0.466 | ∅ |
| Few-shot, Persona, Package | 0.7039 | 0.494 | ∅ | 0.0243 | 0.438 | ↓(S) | 0.5887 | 0.493 | ∅ |
| Few-shot, Persona | 0.3185 | 0.483 | ∅ | 0.0094 | 0.430 | ↓(S) | 0.2305 | 0.447 | ∅ |
| Few-shot | 0.1646 | 0.472 | ∅ | 0.0381 | 0.432 | ↓(S) | 0.4433 | 0.453 | ∅ |
| Few-shot, CoT, Persona, Sig. | 0.3691 | 0.520 | ∅ | 0.0295 | 0.457 | ↓(N) | 0.3559 | 0.486 | ∅ |
| CoT, Persona, Package, Sig. | 0.1051 | 0.463 | ∅ | 0.0292 | 0.457 | ↓(N) | 0.3423 | 0.493 | ∅ |
| Few-shot, Persona, Sig. | 0.4960 | 0.511 | ∅ | 0.0421 | 0.455 | ↓(N) | 0.8256 | 0.486 | ∅ |
| CoT, Persona, Sig. | 0.3594 | 0.496 | ∅ | 0.2051 | 0.459 | ∅ | 0.0812 | 0.457 | ∅ |
| CoT, Package, Sig. | 0.3130 | 0.483 | ∅ | 0.1568 | 0.481 | ∅ | 0.1667 | 0.475 | ∅ |
| CoT, Sig. | 0.1784 | 0.474 | ∅ | 0.0869 | 0.460 | ∅ | 0.1002 | 0.465 | ∅ |
| Persona, Package, Sig. | 0.1254 | 0.467 | ∅ | 0.1296 | 0.448 | ∅ | 0.1592 | 0.485 | ∅ |
| Few-shot, CoT, Persona, Package, Sig. | 0.9261 | 0.502 | ∅ | 0.1803 | 0.482 | ∅ | 0.7346 | 0.521 | ∅ |
| Few-shot, CoT, Persona, Package | 0.7083 | 0.512 | ∅ | 0.0900 | 0.466 | ∅ | 0.2862 | 0.471 | ∅ |
| Few-shot, CoT, Persona | 0.2819 | 0.458 | ∅ | 0.1312 | 0.458 | ∅ | 0.0501 | 0.425 | ∅ |
| Few-shot, CoT, Package, Sig. | 0.3642 | 0.522 | ∅ | 0.3053 | 0.478 | ∅ | 0.3436 | 0.542 | ∅ |
| Few-shot, CoT, Package | 0.7715 | 0.485 | ∅ | 0.3289 | 0.469 | ∅ | 0.0653 | 0.446 | ∅ |
| Few-shot, CoT, Sig. | 0.5092 | 0.500 | ∅ | 0.3245 | 0.474 | ∅ | 0.6845 | 0.497 | ∅ |
| Few-shot, CoT | 0.6889 | 0.478 | ∅ | 0.3901 | 0.468 | ∅ | 0.0852 | 0.439 | ∅ |
| Few-shot, Persona, Package, Sig. | 0.8748 | 0.506 | ∅ | 0.0788 | 0.458 | ∅ | 0.6931 | 0.524 | ∅ |
| Few-shot, Package, Sig. | 0.4958 | 0.501 | ∅ | 0.3440 | 0.467 | ∅ | 0.9435 | 0.510 | ∅ |
| Few-shot, Package | 0.3267 | 0.474 | ∅ | 0.0551 | 0.443 | ∅ | 0.8160 | 0.509 | ∅ |
| Few-shot, Sig. | 0.3447 | 0.515 | ∅ | 0.1005 | 0.449 | ∅ | 0.8524 | 0.497 | ∅ |

TABLE VII
Cognitive complexity analysis using the Wilcoxon test and the A12 Vargha-Delaney effect size (N: negligible, S: small, M: medium, L: large). ↓ indicates a reduction in complexity, ∅ indicates no statistical difference. We have not observed cases of increasing complexity.

| Combinations | GPT-4o p-value | GPT-4o A12 | GPT-4o Effect | Llama3 p-value | Llama3 A12 | Llama3 Effect | Mistral p-value | Mistral A12 | Mistral Effect |
|---|---|---|---|---|---|---|---|---|---|
| Package, Sig. | 0.0365 | 0.449 | ↓(S) | 0.0001 | 0.382 | ↓(S) | 0.0175 | 0.448 | ↓(S) |
| Few-shot, Persona | 0.0364 | 0.442 | ↓(S) | 0.0327 | 0.412 | ↓(S) | 0.2305 | 0.447 | ∅ |
| CoT, Persona | 0.3678 | 0.501 | ∅ | 0.0026 | 0.419 | ↓(S) | 0.0062 | 0.433 | ↓(S) |
| CoT, Package | 0.3300 | 0.485 | ∅ | 0.0256 | 0.444 | ↓(S) | 0.0211 | 0.410 | ↓(S) |
| CoT | 0.3609 | 0.496 | ∅ | 0.0449 | 0.445 | ↓(S) | 0.0095 | 0.417 | ↓(S) |
| Persona, Sig. | 0.0883 | 0.456 | ∅ | 0.0025 | 0.416 | ↓(S) | 0.0079 | 0.439 | ↓(S) |
| Persona | 0.1156 | 0.493 | ∅ | 0.0032 | 0.393 | ↓(S) | 0.0259 | 0.436 | ↓(S) |
| Package | 0.0777 | 0.446 | ∅ | 0.0002 | 0.371 | ↓(S) | 0.0407 | 0.417 | ↓(S) |
| Sig. | 0.5717 | 0.485 | ∅ | 0.0003 | 0.398 | ↓(S) | 0.0157 | 0.434 | ↓(S) |
| Zero-shot | 0.1708 | 0.486 | ∅ | 0.0016 | 0.382 | ↓(S) | 0.0402 | 0.407 | ↓(S) |
| Few-shot, CoT, Persona, Sig. | 0.5113 | 0.479 | ∅ | 0.0435 | 0.444 | ↓(S) | 0.3559 | 0.486 | ∅ |
| Few-shot, Persona, Package | 0.1139 | 0.449 | ∅ | 0.0378 | 0.428 | ↓(S) | 0.5887 | 0.493 | ∅ |
| Few-shot, Persona, Sig. | 0.3812 | 0.470 | ∅ | 0.0035 | 0.421 | ↓(S) | 0.8256 | 0.486 | ∅ |
| Few-shot, Sig. | 0.7730 | 0.488 | ∅ | 0.0244 | 0.417 | ↓(S) | 0.8524 | 0.497 | ∅ |
| Few-shot | 0.2544 | 0.444 | ∅ | 0.0475 | 0.408 | ↓(S) | 0.4433 | 0.453 | ∅ |
| CoT, Persona, Package, Sig. | 0.2312 | 0.490 | ∅ | 0.0144 | 0.444 | ↓(S) | 0.3423 | 0.493 | ∅ |
| Persona, Package, Sig. | 0.3364 | 0.478 | ∅ | 0.0095 | 0.418 | ↓(S) | 0.1592 | 0.485 | ∅ |
| Persona, Package | 0.2733 | 0.484 | ∅ | 0.0006 | 0.395 | ↓(S) | 0.0661 | 0.460 | ∅ |
| CoT, Sig. | 0.9386 | 0.522 | ∅ | 0.0439 | 0.446 | ↓(S) | 0.1002 | 0.465 | ∅ |
| CoT, Persona, Package | 0.0496 | 0.461 | ↓(N) | 0.0007 | 0.420 | ↓(S) | 0.1732 | 0.466 | ∅ |
| CoT, Package, Sig. | 0.8144 | 0.508 | ∅ | 0.0254 | 0.464 | ↓(N) | 0.1667 | 0.475 | ∅ |
| CoT, Persona, Sig. | 0.6917 | 0.510 | ∅ | 0.1106 | 0.459 | ∅ | 0.0812 | 0.457 | ∅ |
| Few-shot, CoT, Persona, Package, Sig. | 0.5436 | 0.485 | ∅ | 0.1174 | 0.463 | ∅ | 0.7346 | 0.521 | ∅ |
| Few-shot, CoT, Persona, Package | 0.2607 | 0.478 | ∅ | 0.0662 | 0.461 | ∅ | 0.2862 | 0.471 | ∅ |
| Few-shot, CoT, Persona | 0.0531 | 0.443 | ∅ | 0.1155 | 0.460 | ∅ | 0.0501 | 0.425 | ∅ |
| Few-shot, CoT, Package, Sig. | 0.6145 | 0.509 | ∅ | 0.4893 | 0.467 | ∅ | 0.3436 | 0.542 | ∅ |
| Few-shot, CoT, Package | 0.2378 | 0.467 | ∅ | 0.4649 | 0.464 | ∅ | 0.0653 | 0.446 | ∅ |
| Few-shot, CoT, Sig. | 0.3291 | 0.487 | ∅ | 0.3538 | 0.462 | ∅ | 0.6845 | 0.497 | ∅ |
| Few-shot, CoT | 0.1763 | 0.443 | ∅ | 0.7870 | 0.465 | ∅ | 0.0852 | 0.439 | ∅ |
| Few-shot, Persona, Package, Sig. | 0.7159 | 0.482 | ∅ | 0.0690 | 0.436 | ∅ | 0.6931 | 0.524 | ∅ |
| Few-shot, Package, Sig. | 0.6934 | 0.511 | ∅ | 0.3366 | 0.437 | ∅ | 0.9435 | 0.510 | ∅ |
| Few-shot, Package | 0.1084 | 0.454 | ∅ | 0.0966 | 0.428 | ∅ | 0.8160 | 0.509 | ∅ |
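The A12 columns in Tables VI and VII can be read with the standard definition of the Vargha-Delaney statistic: the probability that a value drawn from one sample exceeds a value drawn from the other, counting ties as one half. The sketch below implements only that definition in plain Python. It assumes the first argument holds the complexity values for a prompt combination and the second the values for the comparison sample, so that results below 0.5 correspond to the ↓ (reduction) direction. The function name and the data are hypothetical, and the Wilcoxon test and the N/S/M/L thresholds used in the tables are not reproduced here.

```python
from itertools import product


def vargha_delaney_a12(treatment, baseline):
    """Estimate P(treatment > baseline) + 0.5 * P(treatment == baseline).

    Returns 0.5 when neither sample tends to be larger; values below 0.5
    mean the treatment values tend to be smaller than the baseline values.
    """
    if not treatment or not baseline:
        raise ValueError("both samples must be non-empty")
    wins = 0
    ties = 0
    # Compare every treatment value with every baseline value.
    for t, b in product(treatment, baseline):
        if t > b:
            wins += 1
        elif t == b:
            ties += 1
    return (wins + 0.5 * ties) / (len(treatment) * len(baseline))


# Made-up cyclomatic complexity values for illustration only.
with_technique = [2, 3, 3, 4, 5]
comparison = [3, 4, 4, 5, 6]
print(vargha_delaney_a12(with_technique, comparison))  # 0.26
```

Swapping the two arguments yields one minus the original value, which is why the direction of comparison has to be fixed before interpreting an A12 below or above 0.5.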