Authors: Ranim Khojah, Francisco Gomes de Oliveira Neto, Mazen Mohamad, Philipp Leitner
Paper Content:
The Impact of Prompt Programming on
Function-Level Code Generation
Ranim Khojah¹, Francisco Gomes de Oliveira Neto¹, Mazen Mohamad¹,², Philipp Leitner¹
¹ Chalmers University of Technology and University of Gothenburg, ² RISE Research Institutes of Sweden
Gothenburg, Sweden
khojah@chalmers.se, francisco.gomes@cse.gu.se, mazen.mohamad@ri.se, philipp.leitner@chalmers.se
Abstract —Large Language Models (LLMs) are increasingly
used by software engineers for code generation. However, lim-
itations of LLMs such as irrelevant or incorrect code have
highlighted the need for prompt programming (or prompt
engineering) where engineers apply specific prompt techniques
(e.g., chain-of-thought or input-output examples) to improve the
generated code. Despite this, the impact of different prompt tech-
niques — and their combinations — on code generation remains
underexplored. In this study, we introduce CodePromptEval, a
dataset of 7072 prompts designed to evaluate five prompt tech-
niques (few-shot, persona, chain-of-thought, function signature,
list of packages) and their effect on the correctness, similarity,
and quality of complete functions generated by three LLMs
(GPT-4o, Llama3, and Mistral). Our findings show that while
certain prompt techniques significantly influence the generated
code, combining multiple techniques does not necessarily improve
the outcome. Additionally, we observed a trade-off between cor-
rectness and quality when using prompt techniques. Our dataset
and replication package enable future research on improving
LLM-generated code and evaluating new prompt techniques.
I. INTRODUCTION
With the widespread adoption of Large Language Models
(LLMs) in software engineering, researchers and practitioners
have uncovered their significant potential, particularly for
code-related tasks, such as code generation and completion
[1, 2]. However, this adoption has also revealed several limita-
tions of LLMs that can hinder developers’ productivity [3] and
cause frustrations [4], preventing them from fully leveraging
the benefits of LLMs in their coding process. Such limitations
are related to hallucinations, misunderstanding the intent or
purpose of the code, or simply generating incorrect code [5].
These limitations are inherent to the design of LLMs, and
are unlikely to “resolve themselves” entirely with future model
generations. Therefore, researchers started proposing ways
to mitigate these limitations by adapting how users interact
with the LLMs. The interactions typically start with a natural
language prompt that specifies what the LLM is expected to
output. To ensure that the LLM generates accurate, relevant,
and high-quality outputs, users employ a structured approach
to construct prompts, which is often referred to as prompt
programming.
To implement prompt programming, various prompt tech-
niques can be used to guide the LLM on how to achieve
the expected results [6, 7, 8]. For example, few-shot learning
involves providing the LLM with a few input-output examples
to guide the function logic, while adding context about the
packages used can give the model additional information on
what helper functions to use.
However, such prompt techniques were evaluated based on
the output accuracy for natural language generation tasks [9, 8]
and are not well-studied for code generation, more specifically,
function synthesis (generating function-level code), which is
one of the most common use cases among software engineers
[3]. Furthermore, evaluating the accuracy of code generation
is not sufficient, since other aspects of the code are important
for software engineers, such as maintainability and adherence
to best practices. Prompt techniques can also be combined [6],
but to the best of our knowledge, no work evaluates the impact
of multiple combinations of prompt techniques in one prompt.
For instance, whether applying a certain prompt technique can
cancel out, hinder, or even enhance the impact of an existing
prompt technique in the prompt.
Therefore, in this study, we design a full factorial
experiment on five common prompt techniques for function
generation along with all the possible combinations of these
prompts, which sums up to 32 unique combinations of prompt
techniques. To perform a comprehensive evaluation of the
impact of different prompt techniques on code generation,
we construct our dataset CodePromptEval which consists of
221 code-generation prompts from CoderEval [10], that we
extend with 32 possible variations for each prompt (that is,
combinations of prompt techniques). This results in a total
of 7072 datapoints. We use CodePromptEval to generate
functions with three popular LLMs (GPT-4o, Llama3, and
Mistral), then evaluate the generated functions based on
correctness, as well as quality and similarity to ground truth
(e.g., in terms of naming style and structure). Particularly, we
investigate the following research questions.
RQ1: How do different LLMs perform on CodePromptEval?
Initially, we study the performance of different current-
generation LLMs (GPT-4o, Llama3, and Mistral) on our
CodePromptEval dataset. We particularly look at the
correctness of LLM-generated code as measured using
existing test cases in the CoderEval benchmark as a ground
truth. We observe that the performance of all three evaluated
LLMs is comparable, with a difference of around 5 percentage
points between the best model, GPT-4o, and the worst, Mistral.
RQ2: To what extent do different prompting techniques (and
combinations of them) impact the code generation of LLMs?
arXiv:2412.20545v1 [cs.SE] 29 Dec 2024
We now turn to the central research question of this
paper. Using a full factorial experiment design, we compare
how different prompt techniques (e.g., few-shot, providing a
persona, etc.) impact the generated code in three dimensions:
correctness, similarity to ground truth, and code quality.
RQ2.1: How do prompt techniques impact the
correctness of the code?
To evaluate correctness, we test the functions, then
measure the Pass@k scores for each combination of
prompt techniques. We also perform statistical tests
to identify the (combinations of) prompt techniques
that impact the test results. We found that including
only a function signature or few-shot examples has a
significant positive impact on correctness. We further
observe that combining prompt techniques does not
lead to significantly better results.
RQ2.2: How do prompt techniques impact the
similarity of the code to a human-written baseline?
We also study how similar generated solutions are to
the (human-written) baseline. We find that including
a persona, chain-of-thought, or signature increases
the overall similarity to the baseline for some LLMs,
while few-shot reduces only the lexical similarity.
Note that generating code that is similar to an
“expected” solution may be good or bad depending
on context — on the one hand, code that is close
to the baseline may be easy to fix even if it is not
passing the test cases; on the other, “different” can
be particularly valuable if the goal is to brainstorm
approaches, e.g., if used in “exploration mode” [11].
RQ2.3: How do prompt techniques impact the
quality of the code?
Finally, we study code quality as measured through
the presence of code smells and the (cyclomatic
and cognitive) complexity of the code. We find
that including a signature or few-shot examples
leads to functions with higher complexity and more
code smells. Interestingly, adding a relevant persona
(“as a software developer who follows best coding
practices ...”) indeed has a small positive effect on
the code quality, but at the expense of slightly lower
correctness.
Overall, we conclude that, for models that are current-generation
at the time of writing, the impact of prompt programming
techniques is not dramatic. Most combinations of prompt
techniques do not lead to statistically significant improvements
(nor regressions) in correctness, similarity, or quality. Providing
type information for the function that is to be generated, either
explicitly through a signature or implicitly via few-shot examples,
has the clearest effect. Some prompt techniques have a
positive impact on correctness, and others on quality. However,
the obvious idea of combining them usually improves neither.
II. RELATED WORK
Existing research on LLMs in software engineering has
shown the potential of LLMs to support software engineers
in various tasks, including requirements elicitation, software
testing and documentation [12, 13]. However, the main focus
is directed towards code-related tasks [3]. This is also reflected
in the interest among software organizations that, at the time
of writing, leverage LLMs mostly for code generation, code
completion and code summarization [14]. However, the in-
creased adoption of LLMs for code-related tasks has unveiled
risks and limitations, such as hallucinations, inaccuracies, and
potential vulnerabilities [15, 16]. Researchers have proposed
the concept of prompt programming (or prompt engineering)
to mitigate the model’s limitations and steer the LLM towards
a more desirable response by using prompt techniques and
providing relevant contextual information [17, 6].
Therefore, a new line of research emerged focusing on
finding prompt techniques that can improve the performance
of LLMs in various tasks. White et al. [6] propose different
prompt patterns and techniques depending on the software-
related task. However, the impact of these techniques on the
LLM output can be unstable and inconsistent. Wang et al. [18]
show that prompt techniques can be sensitive to the specific
task as well as the LLM (e.g., GPT-3.5 vs. GPT-4o). Other
studies also show that the few-shot prompt technique [19] is
effective, especially with the right structure [7], type [20] or
order [21] of the examples (shots). Reynolds and McDonell
[9] highlight how few-shot examples can hurt the performance
of the model and limit its search for a plausible solution in
translation tasks. In contrast, we found that few-shot significantly
improved the performance of the LLMs, suggesting
that prompt techniques have varied impact depending on the
task and the domain.
For code-related tasks, prompt techniques were shown to
have a positive effect on code generation in the domain of
education [22]. Furthermore, researchers proposed ways to add
contextual information to the prompt as a prompt technique
to enhance code-related tasks [23, 8], e.g., incorporating
dataflow information to improve code summarization
[8]. Other prompt techniques used by Dong et al. [24] included
self-collaboration, where the LLM is prompted several times
to take different personas, e.g., first as a requirements engineer,
then a software developer, then a tester, and only then returns
the code that resulted from the “collaboration” among the three
personas. In our study, we focus on three common prompt
techniques, namely, few-shot learning, automatic chain-of-thought
[25], and persona [26], and propose two pieces
of contextual information as additional prompt techniques
that are easily accessed by developers, i.e., the imported
packages and the signature of the function.
To evaluate LLMs on code generation tasks, the most
common metric is Pass@k [27], where k=1 is used to measure
the rate of passed functions that the LLM generated on the first
attempt [28] (e.g., by running a test suite). CodeBLEU [29] is
another popular metric, commonly used in studies to measure
the human-likeness of generated code [30, 31]. Li et al. [32]
conduct a manual human evaluation of their proposed prompt
technique “AceCoder” based on correctness, presence of code
smells, and maintainability. We provide a systematic and
automated approach to evaluate generated code based on
correctness, maintainability, and similarity to the ground truth.

[Figure 1: flowchart. CoderEval (230 datapoints) → exclude prompts with failing ground truth (221 datapoints) → create prompt templates for few-shot, CoT, persona, signature, and packages (32 prompt combinations) → construct CodePromptEval (221 × 32 = 7072 prompts) → run LLMs, API-based and self-deployed (7072 functions) → evaluate generated functions by running tests, measuring CodeBLEU similarity, running Pylint, and computing cyclomatic and cognitive complexity → analyze the evaluation results with descriptive statistics per LLM (RQ1: performance and common error types) and, per combination, descriptive statistics, multi-regression, and Wilcoxon tests with effect sizes (RQ2: correctness, similarity, quality).]
Fig. 1. The process we follow to evaluate the code generated using different prompts by different LLMs.
III. METHODOLOGY
Figure 1 shows our approach to evaluate the impact of
commonly-used prompting techniques on the code generated
by LLMs. On a high level, we create prompt templates that
combine prompting techniques (e.g., CoT, few-shot, etc.) and
apply each prompt template to 221 tasks from the CoderEval
benchmark [10]. We evaluate three different LLMs (two open-
weight and one proprietary), leading to 7072 generated func-
tions per LLM. To understand the impact of each prompting
technique and answer RQ2, we evaluate all functions in terms
of correctness, similarity to the baseline of the benchmark, and
quality using statistical analysis.
We follow a full factorial experiment design and evaluate
the code generation functionality of LLMs by varying two
levels (present/absent) of five factors in a prompt, that is, the
five prompt techniques: (1) few-shot learning, (2) Chain-of-
Thought (CoT), (3) persona, (4) function signature, and (5) the
list of packages. Therefore, we have 32 ($2^5$) treatments in our
experiment. Note that the absence of all of these techniques
counts as zero-shot, where only the generation instruction is
present without any other prompt technique. We do not treat
zero-shot as a factor since it cannot logically be combined
with other prompt techniques (e.g., combining zero-shot with
persona would simply default to persona). Instead, we use
zero-shot as a baseline for prompt technique comparisons.
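
To make the design concrete, the 32 treatments can be enumerated as the Cartesian product of five present/absent factors. The following is a minimal sketch, not the authors' tooling: the names `TECHNIQUES`, `LEVELS`, and `enumerate_treatments` are hypothetical, and the level order per factor is an assumption chosen only so that the generated IDs line up with the rows of Table I.

```python
from itertools import product

# Factor order follows the columns of Table I.
TECHNIQUES = ("few_shot", "cot", "persona", "packages", "signature")

# Level order per factor (assumption): picked so that the enumeration
# reproduces the row order P1 ... P32 of Table I.
LEVELS = (
    (False, True),  # few-shot
    (False, True),  # chain-of-thought
    (True, False),  # persona
    (True, False),  # packages
    (True, False),  # signature
)


def enumerate_treatments():
    """Return a dict mapping 'P1'..'P32' to a {technique: present?} dict."""
    treatments = {}
    for i, levels in enumerate(product(*LEVELS), start=1):
        treatments[f"P{i}"] = dict(zip(TECHNIQUES, levels))
    return treatments


treatments = enumerate_treatments()
# The zero-shot baseline is the single treatment with every factor absent.
zero_shot = [pid for pid, t in treatments.items() if not any(t.values())]
print(len(treatments), zero_shot)
```

Encoding each treatment as a dictionary of booleans keeps the factorial structure explicit, which is convenient when the same flags later serve as predictors in a statistical analysis.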
A. Prompt technique combinations
Prompt programming is the act of constructing a prompt
using natural language to ensure that the model provides the
intended response or to improve the performance of the model
[9]. Based on observations from our previous work [3] and
recommendations in the literature [6] and from LLM providers
such as OpenAI¹ and Microsoft²,³, we decided on five prompt
techniques to apply when prompting LLMs in our study.
Examples for all prompt techniques will be provided later in
Figure 2.
• Few-shot learning can be achieved by providing shots (or
examples) to an LLM in order to enable learning from
examples without the need to fine-tune the LLM [19]. We use
two input-output examples explained in natural language.
We do not consider a varying number of shots.
• Automatic Chain-of-Thought (CoT) allows the LLM to break
down the prompt by asking it to “think” step by step and
to follow these steps when solving the problem [25].
• Persona allows the LLM to play a specific role and consider
its perspective when solving a problem [33, 34]. For the
persona, we use the role of a software developer who focuses
on practices and standards that software developers follow.
• Signature is a line of code that includes the signature of
the function to generate. The signature includes the function
name, the input parameters, and (optionally) the output.
• Packages is a list of libraries and files that exist in the
environment in which the code runs. This includes local
packages and external libraries.
B. CodePromptEval
To evaluate the different combinations of prompt tech-
niques, we construct CodePromptEval – a dataset that includes
221 function-level code generation tasks, where each genera-
tion task is implemented using 32 different prompt variations.
Each of these 32 prompts applies a unique combination of
prompt techniques, resulting in a total of 7072 prompts (221
tasks × 32 variations).
To create our dataset, we initially start with the CoderEval
Python dataset [10]. This dataset consists of 230 datapoints
from 43 Python projects. Each datapoint consists of a prompt,
a Python function (human-written baseline), and the corresponding
tests (in the form of unit tests or a main class). We
first set up different virtual environments for functions from
different projects, then we test the functions using the provided
tests, and eliminate nine datapoints where the baseline does
not pass the tests. This resulted in 221 datapoints that will be
the foundation for our own CodePromptEval dataset.
¹ https://platform.openai.com/docs/guides/prompt-engineering
² https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/advanced-prompt-engineering
³ https://microsoft.github.io/prompt-engineering/
Then, we ensure that the prompts are free of any
prompt technique that may be implicitly applied (e.g., provided
examples) by going through the prompts manually and
removing any elements that do not describe the purpose of
the code. We then treat the resulting prompt as a zero-shot prompt.
The next step was to prepare prompt templates by defining
how each prompt technique will be implemented and mapping
relevant information to prompt techniques. In particular, for
each datapoint, we extract the signature of the function and the
list of used packages (represented as imports at the beginning
of the class). For chain-of-thought, we adapted the template
recommended by Zhuosheng et al. [25]. To construct the
persona, we defined a persona description of a software de-
veloper who follows best coding practices for maintainability.
To implement the few-shot prompt technique, the first three
authors of this paper manually constructed two input-output
examples for each prompt following the template “If the input
is X, then the output is Y”. We also create corresponding tests
to ensure that the input and output are correct.
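
As an illustration of the few-shot step, the sketch below renders input-output pairs with the “If the input is X, then the output is Y” template and checks the pairs against a reference implementation. It is only a sketch under stated assumptions: `format_few_shot`, `examples_hold`, and the toy task `clamp` are hypothetical and not taken from CodePromptEval, and the exact sentence joining is a guess based on the example prompt in Figure 2.

```python
def format_few_shot(examples):
    """Render (inputs, output) pairs as one natural-language sentence."""
    parts = []
    for inputs, output in examples:
        shown = " and ".join(repr(value) for value in inputs)
        parts.append(f"if the input is {shown}, then the output is {output!r}")
    return "For example, " + ", and ".join(parts) + "."


def examples_hold(func, examples):
    """Check that every example agrees with the reference implementation."""
    return all(func(*inputs) == output for inputs, output in examples)


# Hypothetical toy task standing in for a human-written baseline.
def clamp(value, low, high):
    return max(low, min(value, high))


EXAMPLES = [((5, 0, 3), 3), ((-2, 0, 3), 0)]

print(format_few_shot(EXAMPLES))
print(examples_hold(clamp, EXAMPLES))
```

Verifying the examples against the baseline before they enter a prompt guards against a manually written example that contradicts the ground truth.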
TABLE I
THE 32 COMBINATIONS OF PROMPT TECHNIQUES THAT WE CONSIDER IN OUR FULL FACTORIAL EXPERIMENT (✓ = TECHNIQUE APPLIED, - = NOT APPLIED).
ID    Few-shot  CoT  Persona  Packages  Signature
P1    -         -    ✓        ✓         ✓
P2    -         -    ✓        ✓         -
P3    -         -    ✓        -         ✓
P4    -         -    ✓        -         -
P5    -         -    -        ✓         ✓
P6    -         -    -        ✓         -
P7    -         -    -        -         ✓
P8    -         -    -        -         -
P9    -         ✓    ✓        ✓         ✓
P10   -         ✓    ✓        ✓         -
P11   -         ✓    ✓        -         ✓
P12   -         ✓    ✓        -         -
P13   -         ✓    -        ✓         ✓
P14   -         ✓    -        ✓         -
P15   -         ✓    -        -         ✓
P16   -         ✓    -        -         -
P17   ✓         -    ✓        ✓         ✓
P18   ✓         -    ✓        ✓         -
P19   ✓         -    ✓        -         ✓
P20   ✓         -    ✓        -         -
P21   ✓         -    -        ✓         ✓
P22   ✓         -    -        ✓         -
P23   ✓         -    -        -         ✓
P24   ✓         -    -        -         -
P25   ✓         ✓    ✓        ✓         ✓
P26   ✓         ✓    ✓        ✓         -
P27   ✓         ✓    ✓        -         ✓
P28   ✓         ✓    ✓        -         -
P29   ✓         ✓    -        ✓         ✓
P30   ✓         ✓    -        ✓         -
P31   ✓         ✓    -        -         ✓
P32   ✓         ✓    -        -         -

Finally, we define 32 prompt variations that we list in
Table I. Each variation represents a prompt that applies a
unique combination of prompt techniques. For example, P7
is a prompt that provides the code signature, but uses no
other prompt programming technique, whereas P28 combines
few-shot learning with CoT and the usage of the persona
“software developer”. P8 is the zero-shot baseline, where no
prompt technique is used and the model is only provided
with the programming task. P25 is the case where all prompt
techniques are used in conjunction.
We then map each variation from Table I to the relevant
information and templates for prompt techniques (e.g., im-
ported libraries for packages), then we combine them with
the 221 prompts from CoderEval, leading to 7072 concrete
prompts (221 prompts times 32 variations). Our dataset, the
virtual environments, and the few-shot examples and tests are
provided in our replication package [35].
[Figure 2: example prompt, with each part labeled by the prompt technique it implements.]
Constraint: Respond with a Python function in one code block.
Persona: As a software developer who follows best coding practices for maintainability such as avoiding code smells and writing simple and clean code,
Prompt description: Convert nanoseconds to a time in fixed format.
Few-shot examples: For example, if the input is 4523605 and 3600, then the output is "01:15:23.605+01:00", and if the input is 4523605 and None then the output is "01:15:23.605".
Chain-of-Thought: Think carefully and logically, explaining your answer step by step.
Signature: The function signature is: def hydrate_time(nanoseconds, tz=None)
Packages: The function has access (but does not necessarily use) the following packages: time pytz datetime.
Fig. 2. Example prompt in CodePromptEval.
We illustrate an example prompt with all prompting tech-
niques ( P25) in Figure 2. The prompt description, signature,
and packages are extracted from CoderEval, while we con-
struct the few-shot examples, persona, and chain-of-thought
texts as a part of CodePromptEval. We also prepend a constraint
to each prompt to ensure that the output has
a block of Python code with a self-contained function.
If different prompt techniques are combined, we apply them
in a fixed order (as given in Figure 2). This order ensures the
sentence flows naturally. For example, the common practice
is to place the persona at the beginning, and the few-shot
examples after the purpose of the code. While it is possible
to experiment with different orders of prompt techniques in a
prompt, we consider this outside the scope of this study.
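
The fixed-order assembly can be sketched as a small function that appends each enabled technique to the zero-shot prompt. This is an illustrative sketch rather than the authors' generator: `build_prompt` and the `task` dictionary layout are hypothetical, the template strings are copied from the example prompt in Figure 2, and the relative position of the chain-of-thought, signature, and packages parts is an assumption (the text only fixes that the constraint and persona come first and that the few-shot examples follow the purpose of the code).

```python
CONSTRAINT = "Respond with a Python function in one code block."
PERSONA = (
    "As a software developer who follows best coding practices for "
    "maintainability such as avoiding code smells and writing simple "
    "and clean code,"
)
COT = "Think carefully and logically, explaining your answer step by step."


def build_prompt(task, few_shot=False, cot=False, persona=False,
                 packages=False, signature=False):
    """Assemble one prompt variation in a fixed order (order is assumed)."""
    parts = [CONSTRAINT]
    if persona:
        parts.append(PERSONA)
    parts.append(task["description"])  # always present: the zero-shot core
    if few_shot:
        parts.append(task["few_shot"])
    if cot:
        parts.append(COT)
    if signature:
        parts.append("The function signature is:\n" + task["signature"])
    if packages:
        parts.append(
            "The function has access (but does not necessarily use) the "
            "following packages: " + " ".join(task["packages"]) + "."
        )
    return "\n".join(parts)


# Task data taken from the example prompt in Figure 2.
task = {
    "description": "Convert nanoseconds to a time in fixed format.",
    "few_shot": (
        'For example, if the input is 4523605 and 3600, then the output is '
        '"01:15:23.605+01:00", and if the input is 4523605 and None then '
        'the output is "01:15:23.605".'
    ),
    "signature": "def hydrate_time(nanoseconds, tz=None)",
    "packages": ["time", "pytz", "datetime"],
}

zero_shot_prompt = build_prompt(task)
full_prompt = build_prompt(task, few_shot=True, cot=True, persona=True,
                           packages=True, signature=True)
print(full_prompt)
```

Calling `build_prompt` once per task and treatment would yield 221 × 32 = 7072 prompts, with the call without flags corresponding to the zero-shot baseline.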
C. Code Generation
We focus on LLMs with a decoder-only transformer ar-
chitecture, which is at the time of writing the preferred
architecture to use in code generation tasks [36]. Therefore,
we select the following LLMs for our study: GPT-4o, Llama3-
70B-Instruct, and Mistral-Small-Instruct-2409 (22B). We also
collect data for two previous-generation LLMs (GPT-3.5-
turbo and Llama2-7B-Instruct), but omit discussing the results
for these older models for reasons of brevity in this paper.
The collected data for these models is still available in our
replication package [35].
We run all 7072 prompts on the selected LLMs with
0.2 temperature, which has been commonly used for code
generation tasks [27, 37]. For the API-based GPT models, we
send requests to the external API and store the responses. We
host the remaining models on the Alvis cluster, a NAISS resource
(National Academic Infrastructure for Supercomputing
in Sweden) dedicated to Artificial Intelligence and Machine
Learning research⁴, using models downloaded from Hugging Face⁵.
Running the self-hosted LLMs on Alvis required around
1600 GPU hours using Nvidia A100 GPUs. For the GPT
models, we use the OpenAI API, which is billed based on the
tokens that are processed. To run GPT-4o and GPT-3.5-turbo on
all the prompts in CodePromptEval, we provide around 1.1
million input tokens and generate approximately 3.7 million
output tokens.
D. Evaluating the LLM-generated functions
After generating 7072 code solutions, we evaluate them
based on three main aspects following our research questions
(correctness, similarity, and quality). We use different tests to
measure statistically significant differences for the measures
below, hence we detail the choice of statistical methods in
their corresponding results sections.
Correctness: To evaluate their correctness, we run the
generated functions against their corresponding tests in
CoderEval. There are two types of tests in CoderEval: Python
unit tests, and a main function with different statements and
conditions that set a boolean variable isT (is True) to False
when at least one of the conditions does not hold. To ensure
consistent instrumentation of our experiment, we make sure
that an AssertionError is thrown when needed by adding
an assert statement (assert isT) at the end of tests that take
the form of a main function. Furthermore, as some of the
LLM-generated functions can be erroneous and get stuck
in an infinite loop, we wrap the tests with error-handling
constructs (a try/except) and set a timeout of 60 seconds
per function. Then we collect the test results and the error
messages when applicable.
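
One way to realize such a harness is sketched below. Note the substitution: instead of the in-process try/except wrapper described above, this sketch runs each test script in a separate interpreter via `subprocess`, which makes the timeout straightforward to enforce. The function name `run_test` and the returned `(passed, error_type)` pair are hypothetical and not part of the replication package.

```python
import subprocess
import sys


def run_test(test_source, timeout_s=60):
    """Run a test script in a fresh interpreter.

    Returns (passed, error_type), where error_type is None on success,
    "Timeout" if the time limit is hit, or the exception name otherwise.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", test_source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Covers generated functions that are stuck in an infinite loop.
        return False, "Timeout"
    if proc.returncode == 0:
        return True, None
    # The last traceback line has the form "ErrorType: message" (or just
    # "ErrorType" for a bare assert), so the text before ":" is the type.
    lines = proc.stderr.strip().splitlines()
    last_line = lines[-1] if lines else "UnknownError"
    return False, last_line.split(":", 1)[0]


# A main-function style test ends with `assert isT`, as described above.
print(run_test("isT = (1 + 1 == 3)\nassert isT"))
```

Running each test in its own process also isolates the harness from generated code that crashes the interpreter, at the cost of some start-up overhead per test.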
Similarity: To assess the LLM-generated code’s similarity
to the ground truth obtained from CoderEval (human-written
functions), we measure the CodeBLEU score [29]. CodeBLEU
combines four dimensions of similarity to capture alignment
in the syntax, semantics, and data flow between the generated
functions and the ground truth. The different dimensions are
N-gram match (BLEU), weighted N-gram matches (weighted-
BLEU), Abstract Syntax Tree (AST) match, and data flow
match. For more specific insights, we also measure the syntactic
similarity (coding style and variable naming) using BLEU
and weighted-BLEU, and semantic similarity (i.e., structural
and algorithmic design) using AST match.
⁴ https://www.c3se.chalmers.se/about/Alvis/
⁵ https://huggingface.co
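
As a rough illustration of how CodeBLEU aggregates its four dimensions, the score is a weighted sum of the component scores. The sketch below only shows this aggregation step, assuming the four component scores (each in [0, 1]) have already been computed by some CodeBLEU implementation. The function name `code_bleu` is hypothetical, and the equal weights are an assumption, since the weights used in this study are not stated here.

```python
def code_bleu(ngram, weighted_ngram, ast_match, dataflow_match,
              weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four CodeBLEU component scores into one score.

    The default equal weights are an assumption for illustration.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    components = (ngram, weighted_ngram, ast_match, dataflow_match)
    return sum(w * c for w, c in zip(weights, components))


# Hypothetical component scores for one generated function.
print(code_bleu(0.2, 0.4, 0.6, 0.8))
```

Reporting the n-gram components and the AST component separately, as done for the syntactic and semantic similarity above, amounts to inspecting individual terms of this sum.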
Quality: Regarding code quality, we focus on measures
that are related to maintainability [38] and we only measure
them for functions that pass their tests. In other words, we
measure the quality only for functionally correct functions. We
use Pylint⁶ to generate a report with identified code smells
in the generated functions. Moreover, we compute the code
complexity for both the LLM-generated functions and the
equivalent ground truth (i.e., the human-written functions in
CoderEval) to compare both results and see how the different
prompts have an impact on the code quality. Code complexity
refers to how detailed and interconnected different parts of the
code are, which can make the code harder to understand and
test. To get an overview of the complexity of the generated
functions, we measure McCabe’s cyclomatic complexity via
the Radon Python package⁷ and cognitive complexity [39] via
the cognitive-complexity Python package⁸.
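
To give an intuition for what cyclomatic complexity measures, the following sketch counts decision points in a function using only the standard-library `ast` module. It is a simplified approximation written for illustration, not Radon's implementation, and its counts may differ from Radon's in edge cases; the name `approx_cyclomatic_complexity` is hypothetical.

```python
import ast

# Node types that each open one additional path through the code.
_BRANCH_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While,
                 ast.ExceptHandler, ast.IfExp)


def approx_cyclomatic_complexity(source):
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two short-circuit decision points.
            complexity += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # One for the implicit loop, plus one per filter condition.
            complexity += 1 + len(node.ifs)
    return complexity


SOURCE = """
def sum_even_positive(xs):
    total = 0
    for x in xs:
        if x > 0 and x % 2 == 0:
            total += x
    return total
"""
print(approx_cyclomatic_complexity(SOURCE))
```

A straight-line function scores 1, and every loop, branch, or boolean operator raises the score, which is why generated functions with more special-case handling tend to score higher.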
IV. CODEPROMPTEVAL OVERVIEW
In this section, we provide an overview of the aggregate
results from running three LLMs (GPT-4o, Llama3, and
Mistral) on the CodePromptEval dataset. This section answers
RQ1 in our study.
Note that these results are not an assessment of the capa-
bilities of these models when used with an “ideal” prompt,
but an aggregation over all prompt technique combinations
in our study. That is, the following results should be read as
an overview of CodePromptEval, and not as a judgment of
which LLM performs best in general. Detailed drill-downs
assessing the performance of individual (combinations of)
prompt techniques will be presented in Section V.
TABLE II
OVERVIEW OF PASSED AND FAILED FUNCTIONS PER LLM. THE TOTAL NUMBER OF FUNCTIONS PER LLM IS 7072.
LLM                   # Passed Functions   # Failed Functions
GPT-4o                3707 (52.42%)        3365 (47.58%)
Llama3-70B-Instruct   3564 (50.40%)        3508 (49.60%)
Mistral-22B-Instruct  3335 (47.16%)        3737 (52.84%)
A high-level results summary is shown in Table II. There
are a total of 7072 generation tasks in the dataset. All three
models are able to solve (generate functions that pass all tests)
approximately half of the tasks. Mistral performs worst in our
study, solving 3335 (47.16%) of tasks, and GPT-4o does best
solving 3707 (52.42%), outperforming the worst model by
approximately 5 percentage points.
To get a better idea of whether these results are impacted by
the code level of the function, we use the code levels defined
by Yu et al. [10], which are based on the nature of the dependencies
of the function. The code levels are: self-contained (no need
to import), standard library runnable (no need to install),
public library runnable (uses libraries available on PyPI), class
runnable (uses code outside the function, but within the class),
file runnable (uses code outside the class, but within the file),
and project runnable (uses code in other files). Code levels
provide a rough indication of the “difficulty” of a generation
task, based on what kind of dependencies the LLM needs to
correctly incorporate.
⁶ https://pypi.org/project/pylint/
⁷ https://pypi.org/project/radon/
⁸ https://pypi.org/project/cognitive-complexity/
[Figure 3: stacked bar chart with one panel per LLM (GPT-4o, Llama3-70B, Mistral-22B); y-axis: code level (class runnable, file runnable, project runnable, public library runnable, self-contained, standard library runnable); x-axis: percentage of functions; fill: test result (failed or passed).]
Fig. 3. Passed and failed functions per LLM for each code level. The total
number of functions per LLM is 7072.
We looked into the code levels that passing and failing
functions belong to (see Figure 3). Unsurprisingly, the fail
rate for all models increases as tasks get more difficult (i.e., by
construction, class runnable tasks tend to be substantially more
challenging than self-contained ones, and all models struggle
much more with solving them correctly). Pass rates for the
easiest type of task (standard library runnable) are close to
90% for all models, going down to as low as 31% to 41%
for the most challenging tasks (class runnable). We observe
that, overall, all three models perform comparably on most
code levels, with the exception of self-contained tests (where
GPT-4o outperforms the other models by a larger margin of 10
to 12 percentage points). This difference explains most of the
slightly higher overall performance of GPT-4o. We have also
confirmed these differences using a Chi-square statistical test
with Cohen’s ω as effect size (ω = 0.34, a medium effect) [40].
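
For readers unfamiliar with the effect size, Cohen's ω for a contingency table is the square root of the Chi-square statistic divided by the total number of observations. The sketch below computes both in plain Python. The helper names are hypothetical, and the pass/fail counts from Table II serve purely as example input: the snippet does not reproduce the analysis behind the reported ω = 0.34, which was computed by the authors on their own breakdown of the data.

```python
from math import sqrt


def chi_square(table):
    """Pearson Chi-square statistic and sample size for a 2D count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def cohens_omega(table):
    """Cohen's omega effect size: sqrt(chi2 / N)."""
    chi2, n = chi_square(table)
    return sqrt(chi2 / n)


# Rows: GPT-4o, Llama3-70B, Mistral-22B; columns: passed, failed (Table II).
table_ii = [[3707, 3365], [3564, 3508], [3335, 3737]]
print(round(cohens_omega(table_ii), 3))
```

By Cohen's conventional thresholds, values around 0.1, 0.3, and 0.5 are read as small, medium, and large effects respectively.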
[Figure 4: bar charts with one panel per LLM (GPT-4o, Llama3-70B, Mistral-22B); y-axis: error type (IndexError, SyntaxError, KeyError, ValueError, NameError, AttributeError, ImportError, TypeError, AssertionError); x-axis: count (0 to 1500).]
Fig. 4. Frequency of error types among failing tests for functions generated
by GPT-4o, Llama3-70B, and Mistral-22B.
Finally, we report what error types led to the failing tests
shown in Table II and Figure 3. We report the error types
based on the Python exception that is first thrown when
running the tests. The results of the error types are visualized
in Figure 4. The most common error type for failed tests across
the LLMs is AssertionError, indicating that the LLM
generated a Python function that did not exhibit precisely the
expected functionality (as defined through unseen tests). However,
other error types are also frequently encountered, such as TypeErrors
(an operation is performed on a value of an inappropriate type,
indicating that the LLM misjudged the runtime type of a
Python object), AttributeErrors (an invalid attribute reference
is made), and ImportErrors (a faulty import of
a module or object). Other errors, such as NameErrors
or IndexErrors, exist but are rare. While there are differences
between the LLMs, they are relatively minor and
not systematic. The most notable difference is that Mistral
tends to generate functions leading to an ImportError or
NameError more frequently than the other LLMs, whereas
AttributeErrors are less frequent in Mistral-generated
code.
Key Findings (RQ1): Overall, we observed that GPT-4o
slightly outperforms the other LLMs in the study. However,
in general, results are consistent between current-generation
LLMs. Depending on task difficulty, all LLMs can solve
between 31% and 89% of tasks. AssertionErrors and TypeErrors are
the most common causes of failed tests.
V. PROMPT TECHNIQUE COMPARISON
We now turn towards RQ2, and describe the results of a
statistical analysis examining how the different prompt tech-
niques applied in each prompt impact the function regarding (i)
correctness, (ii) similarity to the ground truth, and (iii) quality.
A. Correctness
A central question for assessing the value of prompt tech-
niques is how likely a (combination of) techniques is to lead
to code that (i) does not throw errors and (ii) passes the tests.
To measure the correctness for different prompts, we use the
well-established Pass@k metric [27]. This metric measures
the likelihood that the LLM generates a correct solution
within k attempts. When k = 1, the metric becomes
a measure of accuracy, as it calculates the number of passed
functions divided by the total number of functions. In our
study, we run the functions generated by the 32 combinations
of prompt techniques on the tests provided by the CoderEval
benchmark. Then, we collect the test results (pass or fail) and
measure Pass@1 accordingly. Figure 5 shows Pass@1 results
for all combinations of techniques (see Table I) for GPT-4o.
Given that results between different models appear to be very
consistent (see also Section V), we focus our discussion on
one example model. However, results for the other models
can be found in Appendix A.
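To make the metric concrete, the following is a minimal sketch of how Pass@1 can be computed from a list of pass/fail outcomes. The function names are illustrative and not part of the replication package. The second function shows the unbiased Pass@k estimator that is commonly used with this metric when n samples are drawn per task, which is an assumption beyond what the text above describes.

```python
from math import comb


def pass_at_1(results):
    """Fraction of generated functions that passed their tests.

    `results` is a sequence of booleans, one per generated function.
    """
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate for one task (assumed estimator).

    n: number of samples generated for the task
    c: number of those samples that pass the tests
    k: number of attempts considered
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k samples contain a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With a single sample per task (n = 1, k = 1), the estimator reduces to 1 for a passing function and 0 for a failing one, so averaging it over tasks gives the same value as `pass_at_1`.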
It is evident from Figure 5 that the most important technique
when it comes to correctness is the presence of a function
signature. Combining the signature with other techniques,
such as few-shot or chain-of-thought, is sometimes helpful to
further increase the likelihood of a generated function being
correct (albeit by a very small margin, e.g., adding chain-
of-thought and package information to the signature only
leads to an improvement of 0.4 percentage points). The best
combination, with a Pass@1 of 57.9%, is the combination
of signature and few-shot.
Fig. 5. Pass@1 results of the different (combinations of) prompt techniques
exemplified for GPT-4o. (Horizontal axis: Pass@1.)
We achieved the worst results in
terms of correctness when using chain-of-thought alone, with
a Pass@1 of 46.2%. It is surprising to note that the impact of
prompt engineering techniques is overall lower than we would
have expected — the difference between the best and worst
combinations is merely 11.7 percentage points, i.e., prompt
programming seems to have a noticeable impact in only a
little over one in ten generation tasks.
Our findings also indicate that sometimes the addition of
more information in the prompt leads to worse performance.
For example, using only few-shot and signature performs
better than when packages are also provided. Further, it is evident
that techniques can interact in non-obvious ways. For example,
both package information and CoT alone led to the worst
Pass@1 results in the experiment. However, if these techniques
are used in conjunction with a function signature, Pass@1
improves marginally over using only the signature in isolation.
To further investigate these interactions between factors
in our experiment, we conducted a multi-linear regression
analysis. Figure 6 shows the five prompt techniques in the
study, their interactions, and their effect on the test result (pass
or fail). For instance, “CoT:Persona” describes if the impact
on test results comes from the interaction of CoT and Persona
in a prompt, regardless of whether that prompt includes other
prompt techniques. Similarly, “Sig.” (signature) refers to all
prompts that include at least the signature (including, for
example, P23, the combination of few-shot and signature), and
is not limited to prompts that only specify the signature.
The multi-linear regression results in a coefficient estimate
and a p-value for each factor and possible interactions among
the factors. The coefficient reflects the impact on the test
results; positive and negative coefficients refer to positive
and negative impacts, respectively. The p-value indicates how
significant the impact is.
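The way main factors and interaction terms are encoded can be illustrated with a short sketch. It assumes each technique is a 0/1 indicator and each interaction term is the product of its indicators. The factor names follow the labels in Figure 6, while the function name `design_row` is hypothetical and the actual regression setup in the replication package may differ.

```python
from itertools import combinations

# Factor labels as they appear in Figure 6.
FACTORS = ["Fewshot", "CoT", "Persona", "Pkg.", "Sig."]


def design_row(techniques):
    """Map the techniques used in one prompt to 0/1 regressors.

    Returns one entry per main factor and per interaction term
    (e.g. "CoT:Persona"). A term is 1 only if every technique in
    it is present in the prompt.
    """
    present = set(techniques)
    row = {}
    for size in range(1, len(FACTORS) + 1):
        for combo in combinations(FACTORS, size):
            row[":".join(combo)] = int(all(f in present for f in combo))
    return row
```

For a prompt that uses few-shot and signature, the terms "Fewshot", "Sig.", and "Fewshot:Sig." are set to 1 and all other terms are 0. This mirrors the reading above: "Sig." is active for every prompt that contains a signature, not only for prompts that contain nothing else.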
Fig. 6. Multi-linear regression results for the test results (pass or fail). Each
point visualizes the coefficient estimate for the corresponding combination.
The darker colors represent more conservative significance levels (α). Zero-shot
is not depicted, as it cannot be combined with other techniques.
(Horizontal axis: coefficient estimate. Legend: significance level <0.001, <0.01, <0.05, or not significant; LLM: GPT-4o, Llama3, or Mistral.)
In line with our previous findings, we observe that the
presence of a signature or few-shot examples in a prompt (regardless
of whether they are combined with other prompt techniques)
affects the test results positively (albeit with different statistical
significance levels for different LLMs), with high positive
coefficient estimates (meaning there is a significant positive
impact on correctness). Interestingly, few-shot does not have a
statistically significant impact in the case of the Mistral model.
The remaining main factors (packages, chain-of-thought,
and persona) do not have a statistically significant impact on
any of the three models. However, it is interesting to observe
that the impact of a persona as well as chain-of-thought even
trends to the negative (coefficient estimate smaller than 0).
Key Findings (RQ2.1): The presence of a signature or few-
shot has the clearest positive impact on correctness. The other
prompt techniques in the study do not have a statistically
significant impact on correctness. However, in general, the
difference between “good” and “bad” prompt techniques is
surprisingly low. Adding additional information to a prompt
sometimes leads to worse performance.
Digging deeper into what causes generated functions to
fail, Figure 7 displays the percentages of error types
encountered for each combination of prompt techniques. We
show the four most common error types (AssertionError,
TypeError, AttributeError, and ImportError)
using the GPT-4o model (see Appendix C for the other models).
For example, 51% of the failed functions of prompts with few-
shot, packages, and signature throw an AssertionError .
The combinations of prompt techniques are ordered from
the fewest errors (at the top) to the most errors (at the bottom).
Fig. 7. The percentages of error types that we observed in failed functions
generated by different combinations of prompt techniques (GPT-4o).
(Heatmap columns, by error type: AssertionError, TypeError, AttributeError, ImportError, Other.)
In general, we see that the prompts that result in the fewest
errors (among the first rows in the heatmap) are
combinations that include a signature. On the other end, the
prompts with the most errors lack few-shot examples.
Taking a closer look at the different error types, we observe
that while AssertionErrors generally occur at a higher
rate than the other error types across all prompt techniques,
they are particularly more frequent in prompts that include
the function signature. In contrast, the absence of the function
signature often leads to TypeErrors , primarily because the
LLM misjudges the expected number or order of positional
arguments when generating functions.
ImportErrors were observed more frequently in
prompts that did not employ few-shot prompting. Interestingly,
in a subset of cases, ImportErrors occurred even when
packages were explicitly specified. To investigate this, we
manually inspected five random prompts where packages were
specified but still resulted in ImportErrors . We found that
when the prompt indicated the use of a package that is local or
unfamiliar to the LLM, the LLM hallucinated and attempted
to import non-existent functions from the specified packages.
We note that these findings are consistent for GPT-4o
and Llama3, while the errors of code generated by Mistral
lacked any clear patterns for the above-mentioned error
types. However, we observed a trend of a higher rate of
AttributeErrors in Mistral when the signature is in-
cluded in the prompt. For the other two LLMs (GPT-4o and
Llama3), we did not observe any consistent patterns among
the prompts that triggered AttributeErrors .
Overall, we emphasize that encountering a certain error
does not necessarily mean that the function is free from the
other error types, as the program terminates at the first error
thrown. However, assertions are evaluated after the function
has successfully been executed, so an AssertionError
indeed indicates that no other errors have occurred. Further,
AssertionErrors are qualitatively different from other
error types, as they do not indicate a fundamentally broken
function, but rather that the LLM misunderstood (or could
not correctly guess from context) some assumptions about the
functionality of the code that is to be generated.
Key Findings (RQ2.1): Including the signature or few-shot examples
in prompts generally reduces errors, particularly TypeErrors,
AttributeErrors, and ImportErrors. Providing package
information can naturally reduce ImportErrors but may cause
hallucinated imports if the packages are unfamiliar to the LLM.
B. Similarity
Beyond correctness, we believe that another important ques-
tion is how similar generated functions are to the human-
written baseline. We use the CodeBLEU score [29] to measure
how similar the generated function is to the baseline in terms
of syntax, semantics, and logic structure. CodeBLEU is a
composite score that integrates four component scores: the n-gram match
(BLEU [41]), a weighted n-gram match (weighted BLEU), the Abstract Syntax Tree
(AST) node match, and the dataflow match as a semantic
similarity measure. Each component has its own weight (by default, 25%
for each of the four scores). In our analysis, we remove the
signature of the generated function and the ground truth before
measuring the similarity to avoid any bias toward the signature
prompt technique. Consequently, we can no longer measure
the dataflow similarity as the functions are no longer parsable
(i.e., identifying parameters and their dataflow in the function).
Therefore, we set the weight for the dataflow match to 0, and
1/3 for the three other match scores.
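The re-weighting described above amounts to a weighted sum of the component scores. The sketch below illustrates this under the assumption that the four component scores have already been computed. The function and constant names are hypothetical and do not correspond to a specific CodeBLEU implementation.

```python
# Default CodeBLEU weights: 25% for each of the four components.
DEFAULT_WEIGHTS = (0.25, 0.25, 0.25, 0.25)

# Weights used once the signature is removed: dataflow match is
# dropped and the remaining three components share the weight equally.
NO_DATAFLOW_WEIGHTS = (1 / 3, 1 / 3, 1 / 3, 0.0)


def composite_code_bleu(ngram, weighted_ngram, ast_match, dataflow,
                        weights=DEFAULT_WEIGHTS):
    """Combine the four component scores into a single score.

    All component scores are expected in [0, 1]; the weights must sum to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    components = (ngram, weighted_ngram, ast_match, dataflow)
    return sum(w * s for w, s in zip(weights, components))
```

With `NO_DATAFLOW_WEIGHTS`, the dataflow score has no influence on the result, so any placeholder value can be passed for it.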
In Figure 8, we see that the overall CodeBLEU score is
low for all approaches (varying between 12.2% and 17.9%
for Llama3), indicating that generated solutions differ substantially
from how humans have solved the same tasks. Results
for the other models are in Appendix D. We observe that using
any prompt technique increases similarity (i.e., zero-shot has
the lowest similarity to the baseline for all three models).
Consistent with the correctness results, combinations that include a
signature lead to higher similarity. This is unsurprising, given
that a predefined signature restricts the solution space for
the LLM (which can be desirable or unwanted depending
on context). Combining more techniques indeed seems to
generally increase similarity.
Fig. 8. The average CodeBLEU scores for the functions generated by each
combination (Llama3). (Horizontal axis: average CodeBLEU.)
To better understand the impact of the prompt techniques
and their interactions on the code similarity, we now perform a
general multi-linear regression to see how the different prompt
techniques and their interactions can impact the CodeBLEU
and the three constituent dimensions we use (n-gram, weighted
n-gram, and AST match scores).
N-gram and weighted N-gram matches can give us insights
into how similar the generated function is to the baseline on
a lexical level, i.e., in the words used, which reflect coding style,
variable naming, and/or in-line documentation in the function.
The AST match focuses on the syntactic structure and code
constructs in the function such as nested statements indepen-
dently of their naming (e.g., loops, conditionals, operations).
Figure 9 shows the results of our multi-linear regression.
We see that, regardless of the test results, the presence of a
signature or persona in a prompt can significantly increase
the CodeBLEU score. Chain-of-thought (CoT) seems to also
positively impact the CodeBLEU score, but only for Llama3.
The impact of the signature was shared across all dimensions
of CodeBLEU. The impact of the persona and CoT was
only observed for the lexical similarity, which explains the
slightly lower p-values compared to the signature. In an
interesting case, few-shot seems to lower the lexical similarity
but increase the syntactic similarity, which results in no overall
impact on the CodeBLEU. The impact of few-shot suggests
that while it generates functions that may use different variable
naming and style (lower lexical similarity), the structure and
logic are close to the baseline (higher syntactic similarity).
Key Findings (RQ2.2): The signature, persona, and chain-of-
thought increase the overall similarity of the function to the
baseline (i.e., code written by humans). Few-shot increases the
syntactic similarity (structure of the function) and decreases the
lexical similarity (variable naming).
C. Quality
Using prompt techniques that yield correct functions does
not necessarily mean that these functions are maintainable and
of good quality. Hence, we now turn to an assessment of the
quality of the generated code. In our experiment, we focus on
code smells and complexity as proxies of code quality. For this
analysis, we only evaluate functions that pass their tests (see
Section V-A). We do not believe that assessing the quality
of functionally incorrect implementations is fruitful because
refactoring must be done on a working piece of code and
preserve its behavior [42].
For code smells, we run Pylint on the generated functions
for each prompt in CodePromptEval, using the code smell IDs
defined by Pylint. Then, we group the code smells for prompts
that share the same prompt techniques, and finally, we filter out
code smells that make up less than 5% of the total number of
code smells for the group.
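The grouping and filtering step can be sketched as follows. This is an illustrative reconstruction that assumes the Pylint message IDs for one group of prompts have already been collected into a flat list. The function name and the way IDs are collected are assumptions, not the pipeline used in the study.

```python
from collections import Counter


def frequent_smells(smell_ids, min_share=0.05):
    """Keep only smells that make up at least `min_share` of a group.

    smell_ids: Pylint message IDs (e.g. "C0116") reported for all
    functions generated by prompts sharing the same techniques.
    Returns a dict mapping each retained ID to its count.
    """
    counts = Counter(smell_ids)
    total = sum(counts.values())
    return {sid: n for sid, n in counts.items() if n / total >= min_share}
```

For example, in a group with 100 reported smells, an ID that occurs only three times accounts for 3% of the total and is therefore dropped.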
TABLE III
LIST OF CODE SMELLS IN GENERATED FUNCTIONS.
Category     Code Smell ID  Definition
Error        E0602          Usage of an undefined variable.
Warning      W0212          Access to a protected class member.
Warning      W0611          Import statement not used.
Warning      W0613          Function argument is not used.
Warning      W0621          Redefines name from outer scope.
Refactoring  R0903          Insufficient public methods in a class.
Refactoring  R1705          Unnecessary "else" after "return".
Convention   C0301          Line exceeds the character limit.
Convention   C0103          Violating UPPER_CASE naming style.
Convention   C0115          Class lacks a descriptive docstring.
Convention   C0116          Function lacks a descriptive docstring.
Convention   C0304          File missing a final newline.
We find 12 code smells that fulfill these criteria for all LLMs
(see Table III). Most identified code smells are warning and
convention code smells, but there is also one error and two
refactoring smells. From this list, we decided to remove C0304
as it is present in all generated functions across all LLMs and
is mostly an artifact of our generation pipeline.
In Figure 10, we show what percentage of functions have at
least one instance of each code smell. For reasons of brevity,
we focus on GPT-4o (results for Llama3 and Mistral are in Appendix B).
We note that 72% of the functions generated using the
few-shot technique contain C0116 code smells, indicating that
these functions lacked a descriptive docstring (in contrast to
only 17% of functions generated by chain-of-thought com-
bined with a persona and package information). In general,
we observe that prompts that apply the few-shot and signature
prompt techniques generate functions with more code smells
and, more specifically, warning and error code smells com-
pared to other prompts.
Fig. 9. The coefficient estimates from the multi-linear regression of prompt technique combinations that significantly impact the CodeBLEU and its components
n-gram match (BLEU), weighted n-gram match (BLEU-weighted), and AST match scores. The combinations of prompt techniques that do not have a significant
impact on the CodeBLEU or any of its components using any model are omitted for brevity. A complete plot can be found in Appendix D.
(Panels: CodeBLEU; BLEU (n-gram), weight 33.33%; BLEU-weighted, weight 33.33%; Syntactic Similarity (AST), weight 33.33%. Rows: CoT:Persona, Fewshot:CoT, Fewshot:CoT:Persona, Fewshot, CoT, Persona, Sig. Horizontal axis: coefficient estimate. Legend: LLM (GPT-4o, Llama3, Mistral) and significance level (<0.001, <0.01, <0.05, not significant).)
Fig. 10. Percentages of functions generated by GPT-4o that have different
code smells. Empty fields indicate that no smell of this type is found. Note
that functions can have instances of multiple types of smells.
(Heatmap columns, by code smell ID: C0116, C0301, C0115, R0903, W0613, W0611, R1705, E0602, W0621, W0212, C0103, W1203.)
On the other hand, we observed that CoT, persona, and
package information lead to functions with fewer code smells, unless
these prompt techniques are combined with few-shot and/or
signature, in which case the percentage of code smells increases. This is
interesting, as we have seen that few-shot and signature are the
techniques with the clearest positive impact on correctness (see
Section V-A). In part, this discrepancy could be explained by
solutions for challenging tasks that LLMs only solve correctly
when provided examples or a signature (recall that, for this
analysis, we have only investigated functions that pass all
tests; hence, some challenging functions have an analyzable
solution for signature and few-shot, but not other techniques).
However, we note that the differences in Figure 10 are too
large to be entirely explained in this way. Consequently, we
conclude that CoT, persona, and package information indeed
seem to systematically lead to fewer code smells.
Key Finding (RQ2.3): While using CoT, persona, or package
information leads to fewer correct solutions, these techniques
lead to higher-quality code in terms of code smells.
We now turn towards the cyclomatic and cognitive com-
plexity and compare the complexity of generated solutions to
the complexity of the human-written baseline. In Table IV, we
show the p-values resulting from the paired Wilcoxon test to
assess the statistical significance of differences between the
cyclomatic complexity of the generated functions and ground
truth. We only show the (combinations of) prompt techniques
that had a significant impact on the cyclomatic complexity for
at least two LLMs ( α= 0.05).
We use the Vargha-Delaney A12 measure [43] to understand the
nature of the impact (reduces or increases complexity) and
to quantify the effect size (Negligible (A12 ≥ 0.45), Small
(0.36 ≤ A12 < 0.45), Medium (0.29 < A12 < 0.36), or Large
(A12 ≤ 0.29)). Vargha-Delaney A12 is a probability measure
(that was later adopted as an effect size measure), which
describes the probability that a value from one level (generated function
complexity) is greater than a corresponding value from another
level (ground truth complexity). If the A12 is less than 0.50,
it means that the values of the first level tend to be lower than those of the
second level, and the lower the score is, the larger the effect
size. This allows us to see if the prompts generate functions
with a significantly lower or higher complexity than the ground
truth, or with a comparable complexity when no significance
is observed. Complete complexity results are in Appendix E.
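As an illustration of the measure, the following minimal sketch computes A12 by counting pairwise comparisons (ties count as one half) and maps the result to the effect size labels using the thresholds stated above. The function names are illustrative. The classification only covers the direction reported in this study (A12 below 0.50, i.e., reduced complexity), which is an assumption of the sketch.

```python
def vargha_delaney_a12(generated, ground_truth):
    """Probability that a value from `generated` exceeds one from
    `ground_truth`, counting ties as one half."""
    if not generated or not ground_truth:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in generated for y in ground_truth if x > y)
    ties = sum(1 for x in generated for y in ground_truth if x == y)
    return (greater + 0.5 * ties) / (len(generated) * len(ground_truth))


def effect_size_label(a12):
    """Label a reduction (A12 < 0.50) with the thresholds from the text:
    N (negligible), S (small), M (medium), or L (large)."""
    if a12 >= 0.45:
        return "N"
    if a12 >= 0.36:
        return "S"
    if a12 > 0.29:
        return "M"
    return "L"
```

Two identical samples yield an A12 of 0.50, i.e., no tendency in either direction. Values such as 0.457 and 0.446 fall on different sides of the 0.45 threshold and are labeled negligible and small, respectively.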
Similar to the code smells results, we see that CoT, persona,
and packages reduce the complexity in comparison to the
baseline. A zero-shot prompt also leads to lowered complexity.
However, all reductions have (at most) a small effect size.
This can be explained by the low cyclomatic complexity of
all LLM tasks — in general, only minor simplifications are
even possible to the generally rather simple code snippets.
For cognitive complexity (see Table V), we observe larger
differences among the LLMs than between combinations of
prompt techniques in general. There was only one combination
of prompt techniques that reduced the cognitive complexity
across all three LLMs with a small effect size. GPT-4o seems
to generate functions with no or negligible differences to the
ground truth. Mistral can reduce the cognitive complexity with
a small effect size when the prompt does not include
few-shot examples and applies a persona, packages, or CoT.
In contrast, there are no clear trends or patterns among the
prompts in Llama3; rather, most of the prompt techniques seem
to reduce the cognitive complexity with a small effect size. We
conclude that Llama3 appears to lead to simpler solutions than
the other models, particularly GPT-4o.
It is interesting to observe that no combination of prompt
techniques leads to more complex solutions than the baseline
— generated solutions are always slightly simpler or com-
parably complex. Viewed positively, this may indicate that
LLMs generate rather clean code. However, a more negative
interpretation may also be that the generated code does not
cover some complex corner cases that human-written solutions
account for (which may not be covered by CoderEval tests).
TABLE IV
CYCLOMATIC COMPLEXITY ANALYSIS USING WILCOXON TEST AND A12 VARGHA DELANEY FOR THE EFFECT SIZE (N - NEGLIGIBLE, S - SMALL, M - MEDIUM, L - LARGE). ↓ INDICATES A REDUCTION IN COMPLEXITY, ∅ INDICATES NO STATISTICAL DIFFERENCE.
                   GPT-4o                  Llama3                  Mistral
Combinations       p-value  A12    Effect  p-value  A12    Effect  p-value  A12    Effect
Package            0.0012   0.403  ↓(S)    0.0008   0.402  ↓(S)    0.0407   0.417  ↓(S)
CoT, Package       0.0366   0.446  ↓(S)    0.0147   0.440  ↓(S)    0.0211   0.410  ↓(S)
Zero-shot          0.0143   0.433  ↓(S)    0.0012   0.392  ↓(S)    0.0402   0.407  ↓(S)
CoT, Persona       0.0576   0.462  ∅       0.0100   0.432  ↓(S)    0.0062   0.433  ↓(S)
Persona, Sig.      0.0999   0.466  ∅       0.0358   0.436  ↓(S)    0.0080   0.439  ↓(S)
Persona            0.0857   0.464  ∅       0.0086   0.411  ↓(S)    0.0259   0.436  ↓(S)
Sig.               0.1593   0.453  ∅       0.0094   0.425  ↓(S)    0.0157   0.434  ↓(S)
Package, Sig.      0.0426   0.457  ↓(N)    0.0156   0.417  ↓(S)    0.0175   0.448  ↓(S)
Persona, Package   0.0283   0.438  ↓(S)    0.0039   0.404  ↓(S)    0.0661   0.460  ∅
CoT                0.0191   0.447  ↓(S)    0.0559   0.448  ∅       0.0095   0.417  ↓(S)
TABLE V
COGNITIVE COMPLEXITY ANALYSIS USING WILCOXON TEST AND A12 VARGHA DELANEY FOR THE EFFECT SIZE (N - NEGLIGIBLE, S - SMALL, M - MEDIUM, L - LARGE). ↓ INDICATES A REDUCTION IN COMPLEXITY, ∅ INDICATES NO STATISTICAL DIFFERENCE.
                   GPT-4o                  Llama3                  Mistral
Combinations       p-value  A12    Effect  p-value  A12    Effect  p-value  A12    Effect
Package, Sig.      0.0365   0.449  ↓(S)    0.0001   0.382  ↓(S)    0.0175   0.448  ↓(S)
Few-shot, Persona  0.0364   0.442  ↓(S)    0.0327   0.412  ↓(S)    0.2305   0.447  ∅
CoT, Persona       0.3678   0.501  ∅       0.0026   0.419  ↓(S)    0.0062   0.433  ↓(S)
CoT, Package       0.3300   0.485  ∅       0.0256   0.444  ↓(S)    0.0211   0.410  ↓(S)
CoT                0.3609   0.496  ∅       0.0449   0.445  ↓(S)    0.0095   0.417  ↓(S)
Persona, Sig.      0.0883   0.456  ∅       0.0025   0.416  ↓(S)    0.0079   0.439  ↓(S)
Persona            0.1156   0.493  ∅       0.0032   0.393  ↓(S)    0.0259   0.436  ↓(S)
Package            0.0777   0.446  ∅       0.0002   0.371  ↓(S)    0.0407   0.417  ↓(S)
Sig.               0.5717   0.485  ∅       0.0003   0.398  ↓(S)    0.0157   0.434  ↓(S)
Zero-shot          0.1708   0.486  ∅       0.0016   0.382  ↓(S)    0.0402   0.407  ↓(S)
Key Findings (RQ2.3): There are noticeable differences among
models with regard to the complexity of the code they produce.
Llama3 appears to produce simpler solutions systematically.
There were no cases of increased complexity — LLM solutions
were comparably complex to human-written code, or simpler.
VI. DISCUSSION
In this section, we discuss the key lessons learned from this
study, the implications of our findings for software engineering
practitioners and researchers, as well as validity threats.
A. Lessons learned
L1: The differences in the results of prompt techniques
are not dramatic: We carefully designed a full factorial
experiment to evaluate not only prompt techniques but also
combinations of them in a prompt. Our analyses revealed
that, while there was an impact of some prompt techniques
on the generated functions, the results for most of the prompt
techniques were not that different. For example, the difference
in the Pass@1 rates for the prompts with the highest and lowest
rates is only around 10-12 percentage points (see Figure 5),
and the effect sizes of the complexities are mostly small or
insignificant (see Table IV). These insights align with other
studies that evaluate prompt techniques on code summariza-
tion [18] and generation [22], where the performance results
of different prompt techniques such as CoT, few-shot, self-collaboration,
among others, are also comparable. In contrast, we see clearer differences in the performance results of some
prompt techniques when using benchmarks for math-related
tasks or general question-answering [25, 44]. We conclude that
a strong emphasis on prompt programming is not necessary
in the context of function-level code generation using current-
generation models.
L2: Providing information about the interface via few-
shot or signature is useful, but limits the “creativity” of the
LLM: In our correctness and similarity results, the signature
and few-shot prompt techniques stood out among other prompt
techniques. In general, we believe that while they are two
different prompt techniques, they can provide similar context
about the expected functional interface in terms of positional
arguments and expected output. This was also revealed through
our general multi-linear regression results in Figures 6 and 9,
where we see that having either signature or few-shot examples
significantly impacts the code’s correctness or similarity, but
their interaction or combination does not help. In relation to
previous work by Ahmed et al. [45, 8], we observe a similar
pattern where contextual information about the parameters
and other identifiers can improve the code summarization.
However, providing this information limits the solution space
for the LLM (i.e., it restricts the potential for “creativity”),
which may not always be desired.
L3: There is a trade-off between correctness and main-
tainability when choosing prompt techniques: Our analysis
revealed contrasting results: prompts with few-shot examples
or function signatures improved correctness but increased
complexity and number of code smells, while prompts that
employed persona, CoT or package had lower passing rates
but significantly enhanced code maintainability (see Tables
IV and V for complexity and Figure 10 for code smells). While
previous research suggests that the use of a persona in the
prompt does not improve the outcome [46] but can improve the
personalization and user experience [26], we believe that this
only applies to simple personas such as “software developer”.
However, our results indicate that personas can be more
beneficial when used as a way to induce additional quality
requirements, e.g., “software developer who writes clean and
simple code”. Recent work has also shown that personas
can be beneficial for code generation when used in more
complex approaches such as self-collaboration where multiple
personas (e.g., requirement engineer, software tester, and a
developer) are used together to iteratively construct the code
in a systematic way [24].
B. Implications
I1: Researchers should prioritize refining prompts for
more effective prompt programming experiments: We shed
light on two components of experiments in prompt pro-
gramming: the generation tasks, and the prompts. Based on
observations in our previous work [3], we note that developers
often use LLMs for more complex tasks than those in common
datasets such as HumanEval [47] or CoderEval [10]. Although
we were able to analyze and compare different prompt techniques,
we believe that a dataset with functions that are more representative
of the large systems and projects that developers
typically work with is needed.
Further, prompts in common benchmarks often lack a
consistent format or level of detail. For instance, the prompts
in CoderEval are based on functions’ docstrings rather than
actual prompts. Sclar et al. [48] show that LLMs, regardless
of their sizes and number of parameters, are highly sensitive to
small prompt changes such as prompt formatting. We observed
similar behavior when experimenting with the template “The
function uses the following packages” for the packages prompt
technique and found that it caused errors related to using the
wrong packages. We traced the issue back to the prompt itself
and realized that the packages listed were not necessarily used
by the function but existed in its class. When we modified the
template to “The function has access to (but does not necessar-
ily use) the following packages, ” we mitigated the issue. This
pre-processing of the prompt technique templates is another
aspect of prompt programming recommended by OBrien et
al. [49]. We found value in inspecting and refining prompts
and creating our own few-shot examples, which increased our
confidence in the dataset’s reliability and stability. Therefore,
we encourage researchers to invest in similar efforts.
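The template refinement described above can be sketched as follows. This is a minimal illustration, not the code from our replication package: the helper name `add_packages`, the trailing `: {packages}.` suffix, and the way the sentence is appended to the task description are assumptions; only the two template sentences are taken from the text.

```python
# Hypothetical helper for illustration only. The two template sentences are
# quoted from the text; the ": {packages}." suffix and the layout are assumed.
ORIGINAL_TEMPLATE = "The function uses the following packages: {packages}."
REFINED_TEMPLATE = (
    "The function has access to (but does not necessarily use) "
    "the following packages: {packages}."
)


def add_packages(task_description, packages, template=REFINED_TEMPLATE):
    """Append the packages sentence to a task description.

    If no packages are known, the description is returned unchanged so that
    the prompt does not contain an empty list.
    """
    if not packages:
        return task_description
    sentence = template.format(packages=", ".join(packages))
    return f"{task_description}\n{sentence}"


if __name__ == "__main__":
    print(add_packages("Parse a date string.", ["re", "datetime"]))
```

Keeping the template as a parameter makes it straightforward to re-run the same tasks with either wording and compare the resulting errors.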
I2: Software developers should avoid overusing prompt
techniques. While we saw that prompt techniques can be
beneficial for certain criteria (e.g., signature for correctness
and persona for quality), we also saw that combining them
does not necessarily yield better results. In fact, some cases
showed that including an additional prompt technique can
cancel out the impact of the existing prompt techniques. For
instance, in the code smell results in Figure 10, we show
how adding few-shot examples to CoT and persona
can increase the code smells by more than one-third. Previous
work has also shown how few-shot examples can hurt the LLM
performance if not carefully engineered by humans [44].
I3: Different LLMs have different sensitivity levels to the
prompt techniques. We argue that a single prompt technique
does not have the same impact on the different aspects (cor-
rectness, similarity, or quality) of the generated code across
all LLMs. We have seen that the three LLMs demonstrated
different sensitivity levels to the prompt. For instance, we
saw how the similarity scores of Llama3-generated functions
were significantly impacted by CoT, while it showed no effect
for GPT-4o and Mistral (see Figure 9). Likewise, the error types for
Mistral in Figure 16 did not seem to be as strongly impacted
by prompt techniques as those of the other LLMs. Note that
these differences between the models do not necessarily stem from
the model size or the number of parameters, but rather from
the underlying architecture (which aligns
with findings by Wang et al. [18]). This implies that when
a software company integrates an LLM into its processes
and provides employee training, it should develop specific
guidelines tailored to the LLM, including recommendations for
prompt techniques that align with the model’s characteristics.
For example, if an LLM generally returns simple functions, prompt
techniques that impact complexity may not be needed.
I4: Determining the purpose of the code generation is
essential for the use of prompt programming. Depending
on whether the intended use of the LLM is to support human
developers or to completely automate code generation, prompt
programming has different significance. While the use of some
prompt techniques has significantly minimized the number of
errors and code smells, we saw that many of these issues can
be easily fixed by human developers, arguably requiring less
time and effort than applying an additional prompt technique.
For instance, the absence of a signature in a prompt causes
TypeErrors when the LLM misjudges the number of arguments
(see Figure 7). Such a bug may be fixed more easily
by a human developer than by re-prompting the LLM with the
correct signature. On the other hand, prompt programming
can be more valuable when the purpose is to automate code
generation and return correct and maintainable code without
the need for human intervention, especially to apply simple
modifications or refactoring actions.
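To make the signature example concrete, the following sketch shows how a missing signature can surface as a TypeError in a unit test. The functions `reference_join` and `generated_join` and the test harness are hypothetical and were written for this illustration; they are not taken from CoderEval or from our generated outputs.

```python
# Hypothetical ground-truth function: expects two arguments.
def reference_join(parts, sep):
    return sep.join(parts)


# Hypothetical LLM output produced without a signature in the prompt:
# the model guessed a single parameter and hard-coded the separator.
def generated_join(parts):
    return ",".join(parts)


def run_unit_test(candidate):
    """Call the candidate the way the benchmark test would and report the outcome."""
    try:
        assert candidate(["a", "b"], "-") == "a-b"
        return "pass"
    except TypeError as exc:
        # Raised before the body runs, because the arity does not match.
        return f"TypeError: {exc}"
    except AssertionError:
        return "AssertionError"


if __name__ == "__main__":
    print(run_unit_test(reference_join))
    print(run_unit_test(generated_join))
```

In such a case the repair is a one-line change to the parameter list, which illustrates why a developer may prefer to fix the function by hand.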
C. Threats to validity
External validity. The main threats to external validity
in this study are associated with the prompt techniques, the
LLMs and the benchmark we utilized. There are many possible
prompt techniques that can potentially impact code generation,
such as self-collaboration [31], tree-of-thought [50], or provid-
ing the whole class as context. However, we decided to select
common prompt techniques that can be practically applied
by a typical software developer in most code generation
tasks. Another important question is whether the use of more
powerful LLMs can result in different findings and eliminate
the need for prompt programming. We used three current-
generation LLMs during the study, including GPT-4o (200B
parameters) and Llama3 (70B parameters). Our replication
package [35] also includes results for older LLMs (GPT-3.5,
Llama2), showing that the prompt techniques affecting code
generation in this study similarly impact older models.
Internal validity. Regarding internal validity, we acknowl-
edge that LLMs can be sensitive to format or structure of
the prompt [48], or even the order of the few-shot examples
[21]. We used a fixed order of the prompt techniques that we
believe represents a natural sentence flow, and we ensured
to use it consistently across all prompts. In addition, the
manual creation of few-shot examples for our dataset may
have introduced a degree of subjectivity. We therefore involved
the first three authors in the process, allowing them to discuss
possible examples and select the two most representative ones.
Construct validity. The representativeness of the bench-
mark (including generation tasks and functions) is an impor-
tant aspect of construct validity. While current benchmarks
often include functions that are not as complex as real-
life tasks, we used the CoderEval dataset based on large
open-source projects to minimize this threat. However, there
remains the question of whether CoderEval fully captures the
complexity and diversity of real-world development tasks.
Conclusion validity. For conclusion validity, we focused on
three key criteria: similarity, correctness, and quality. While
others, like efficiency, could be considered, we argue these
suffice for our research questions. Robustness is ensured with
multiple metrics for each criterion.
VII. CONCLUSION
In this study, we have investigated the impact of different
prompt techniques on code generation, specifically function
synthesis, along three quality dimensions (correctness, similar-
ity to a human-written baseline, and code quality). We studied
five prompt techniques, namely few-shot learning, automatic
chain-of-thought, providing a persona, providing a signature,
and listing packages. We conducted a full factorial analysis of
these five factors using the CodePromptEval dataset, which we
developed based on CoderEval. We studied three current-
generation LLMs, namely GPT-4o, Llama3, and Mistral.
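As a rough illustration of the full factorial design, the sketch below enumerates every on/off combination of the five prompt techniques. The technique labels and the function name `factorial_design` are chosen for this example. The figure of 221 prompts per combination is our own arithmetic from the 7072 prompts reported in the abstract (7072 / 32), not a number stated in this section.

```python
from itertools import product

# Labels chosen for this illustration.
TECHNIQUES = ["few-shot", "chain-of-thought", "persona", "signature", "packages"]


def factorial_design(techniques):
    """Return every on/off combination of the given techniques as tuples."""
    return [
        tuple(name for name, enabled in zip(techniques, flags) if enabled)
        for flags in product([False, True], repeat=len(techniques))
    ]


combos = factorial_design(TECHNIQUES)

if __name__ == "__main__":
    # 2^5 = 32 combinations, including the empty tuple (no technique applied).
    print(len(combos))
    # Prompts per combination implied by the 7072 prompts in the abstract.
    print(7072 // len(combos))
```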
Our key lessons learned were that the impact of prompt
techniques on correctness, similarity, and quality was not as
large as might be expected. Most combinations of prompt
techniques do not lead to statistically significant improvements
(or regressions) in correctness, quality, or similarity. Providing
type information for the function that is to be generated,
either explicitly through a signature, or implicitly via few-shot
examples, has the clearest positive effect, particularly on
correctness. Some prompt techniques have a positive impact
on correctness, and others on quality. However, the obvious
idea of combining them usually improves neither.
A possible future extension of our research is to evaluate to
what extent our findings generalize to other code generation
tasks (e.g., line completion, program repair, or the generation
of full applications). It is plausible that some of the prompt
techniques that did not show a meaningful positive impact on
correctness in our experiments (e.g., chain-of-thought) turn out
to be more relevant if the generation task is more complex. Additionally,
there are other quality metrics, such as performance
or energy efficiency, which should be studied in future work,
particularly given that recent work indicates that AI-generated
code frequently exhibits performance regressions [51].
ACKNOWLEDGEMENTS
This work was partially supported by the Wallenberg AI,
Autonomous Systems and Software Program (WASP) funded
by the Knut and Alice Wallenberg Foundation. Additionally,
LLM executions were enabled by resources provided by
the National Academic Infrastructure for Supercomputing in
Sweden (NAISS), partially funded by the Swedish Research
Council through grant agreement no. 2022-06725.
REFERENCES
[1] S. I. Ross, F. Martinez, S. Houde, M. Muller, and J. D. Weisz,
“The Programmer’s Assistant: Conversational Interaction with a Large
Language Model for Software Development,” in Proceedings of the 28th
International Conference on Intelligent User Interfaces, IUI ’23, (New
York, NY, USA), p. 491–514, Association for Computing Machinery,
2023.
[2] Z. Zeng, H. Tan, H. Zhang, J. Li, Y. Zhang, and L. Zhang, “An extensive
study on pre-trained models for program understanding and generation,”
in Proceedings of the 31st ACM SIGSOFT International Symposium on
Software Testing and Analysis, ISSTA 2022, (New York, NY, USA),
p. 39–51, Association for Computing Machinery, 2022.
[3] R. Khojah, M. Mohamad, P. Leitner, and F. G. de Oliveira Neto,
“Beyond Code Generation: An Observational Study of ChatGPT Usage
in Software Engineering Practice,” Proc. ACM Softw. Eng., vol. 1, July
2024.
[4] J. D. Weisz, M. Muller, S. I. Ross, F. Martinez, S. Houde, M. Agarwal,
K. Talamadupula, and J. T. Richards, “Better Together? An Evaluation
of AI-Supported Code Translation,” in Proceedings of the 27th International
Conference on Intelligent User Interfaces, IUI ’22, (New York,
NY, USA), p. 369–391, Association for Computing Machinery, 2022.
[5] S. Zheng, J. Huang, and K. C.-C. Chang, “Why Does ChatGPT Fall
Short in Providing Truthful Answers?,” 2023.
[6] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar,
J. Spencer-Smith, and D. C. Schmidt, “A Prompt Pattern Catalog to
Enhance Prompt Engineering with ChatGPT,” 2023.
[7] A. J. Fiannaca, C. Kulkarni, C. J. Cai, and M. Terry, “Programming
without a Programming Language: Challenges and Opportunities for
Designing Developer Tools for Prompt Programming,” in Extended
Abstracts of the 2023 CHI Conference on Human Factors in Computing
Systems, CHI EA ’23, (New York, NY, USA), Association for Computing
Machinery, 2023.
[8] T. Ahmed, K. S. Pai, P. Devanbu, and E. Barr, “Automatic Semantic
Augmentation of Language Model Prompts (for Code Summarization),”
in Proceedings of the IEEE/ACM 46th International Conference on
Software Engineering, ICSE ’24, (New York, NY, USA), Association
for Computing Machinery, 2024.
[9] L. Reynolds and K. McDonell, “Prompt Programming for Large Lan-
guage Models: Beyond the Few-Shot Paradigm,” in Extended Abstracts
of the 2021 CHI Conference on Human Factors in Computing Systems,
CHI EA ’21, (New York, NY, USA), Association for Computing
Machinery, 2021.
[10] H. Yu, B. Shen, D. Ran, J. Zhang, Q. Zhang, Y. Ma, G. Liang, Y. Li,
Q. Wang, and T. Xie, “CoderEval: A Benchmark of Pragmatic Code
Generation with Generative Pre-trained Models,” in Proceedings of the
IEEE/ACM 46th International Conference on Software Engineering,
ICSE ’24, (New York, NY, USA), Association for Computing Machinery,
2024.
[11] S. Barke, M. B. James, and N. Polikarpova, “Grounded Copilot: How
Programmers Interact with Code-Generating Models,” Proc. ACM Program.
Lang., vol. 7, Apr. 2023.
[12] K. Ronanki, C. Berger, and J. Horkoff, “Investigating ChatGPT’s
Potential to Assist in Requirements Elicitation Processes,” in 2023
49th Euromicro Conference on Software Engineering and Advanced
Applications (SEAA), pp. 354–361, 2023.
[13] J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, “Software
Testing With Large Language Models: Survey, Landscape, and Vision,”
IEEE Transactions on Software Engineering, vol. 50, no. 4, pp. 911–936,
2024.
[14] H. Li, C.-P. Bezemer, and A. E. Hassan, “Software Engineering and
Foundation Models: Insights from Industry Blogs Using a Jury of
Foundation Models,” 2024.
[15] R. Tóth, T. Bisztray, and L. Erdődi, “LLMs in Web Development:
Evaluating LLM-Generated PHP Code Unveiling Vulnerabilities and
Limitations,” in International Conference on Computer Safety, Reliability,
and Security, pp. 425–437, Springer, 2024.
[16] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is Your Code Generated
by ChatGPT Really Correct? Rigorous Evaluation of Large Language
Models for Code Generation,” in Advances in Neural Information
Processing Systems (A. Oh, T. Naumann, A. Globerson, K. Saenko,
M. Hardt, and S. Levine, eds.), vol. 36, pp. 21558–21572, Curran
Associates, Inc., 2023.
[17] P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, and A. Chadha, “A
Systematic Survey of Prompt Engineering in Large Language Models:
Techniques and Applications,” 2024.
[18] G. Wang, Z. Sun, Z. Gong, S. Ye, Y. Chen, Y. Zhao, Q. Liang,
and D. Hao, “Do Advanced Language Models Eliminate the Need
for Prompt Engineering in Software Engineering?,” arXiv preprint
arXiv:2411.02093, 2024.
[19] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss,
G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler,
J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray,
B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever,
and D. Amodei, “Language Models are Few-Shot Learners,” 2020.
[20] K. Margatina, T. Schick, N. Aletras, and J. Dwivedi-Yu, “Active Learn-
ing Principles for In-Context Learning with Large Language Models,”
in Findings of the Association for Computational Linguistics: EMNLP
2023 (H. Bouamor, J. Pino, and K. Bali, eds.), (Singapore), pp. 5011–
5034, Association for Computational Linguistics, Dec. 2023.
[21] Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stenetorp, “Fantastically
Ordered Prompts and Where to Find Them: Overcoming Few-Shot
Prompt Order Sensitivity,” May 2022.
[22] T. Wang, N. Zhou, and Z. Chen, “Enhancing Computer Programming
Education with LLMs: A Study on Effective Prompt Engineering for
Python Code Generation,” 2024.
[23] D. Shrivastava, H. Larochelle, and D. Tarlow, “Repository-Level Prompt
Generation for Large Language Models of Code,” in Proceedings of
the 40th International Conference on Machine Learning (A. Krause,
E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, eds.),
vol. 202 of Proceedings of Machine Learning Research, pp. 31693–
31715, PMLR, 23–29 Jul 2023.
[24] Y. Dong, X. Jiang, Z. Jin, and G. Li, “Self-Collaboration Code Generation
via ChatGPT,” ACM Trans. Softw. Eng. Methodol., vol. 33, Sept.
2024.
[25] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic Chain of Thought
Prompting in Large Language Models,” 2022.
[26] Y.-M. Tseng, Y.-C. Huang, T.-Y. Hsiao, W.-L. Chen, C.-W. Huang,
Y. Meng, and Y.-N. Chen, “Two Tales of Persona in LLMs: A Survey
of Role-Playing and Personalization,” 2024.
[27] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan,
H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri,
G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan,
S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian,
C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis,
E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak,
J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse,
A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford,
M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew,
D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, “Evaluating
Large Language Models Trained on Code,” 2021.
[28] Y. Dong, X. Jiang, Z. Jin, and G. Li, “Self-Collaboration Code Generation
via ChatGPT,” ACM Trans. Softw. Eng. Methodol., vol. 33, Sept.
2024.
[29] S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan,
M. Zhou, A. Blanco, and S. Ma, “CodeBLEU: a Method for Automatic
Evaluation of Code Synthesis,” 2020.
[30] Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “CodeT5: Identifier-aware
Unified Pre-trained Encoder-Decoder Models for Code Understanding
and Generation,” in Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, pp. 8696–8708, 2021.
[31] X. Jiang, Y. Dong, L. Wang, Z. Fang, Q. Shang, G. Li, Z. Jin, and
W. Jiao, “Self-Planning Code Generation with Large Language Models,”
ACM Trans. Softw. Eng. Methodol., vol. 33, Sept. 2024.
[32] J. Li, Y. Zhao, Y. Li, G. Li, and Z. Jin, “AceCoder: An Effective
Prompting Technique Specialized in Code Generation,” ACM Trans.
Softw. Eng. Methodol., vol. 33, Nov. 2024.
[33] J. Wei, S. Kim, H. Jung, and Y.-H. Kim, “Leveraging Large Language
Models to Power Chatbots for Collecting User Self-Reported Data,” Apr.
2024.
[34] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan,
E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton, “Program Synthesis
with Large Language Models,” 2021.
[35] R. Khojah, F. G. de Oliveira Neto, M. Mohamad, and P. Leitner, “Code-
PromptEval,” Dec. 2024. https://github.com/icetlab/CodePromptEval.
[36] J. Jiang, F. Wang, J. Shen, S. Kim, and S. Kim, “A Survey on Large
Language Models for Code Generation,” 2024.
[37] S. Thakur, B. Ahmad, H. Pearce, B. Tan, B. Dolan-Gavitt, R. Karri,
and S. Garg, “VeriGen: A Large Language Model for Verilog Code
Generation,” ACM Trans. Des. Autom. Electron. Syst., vol. 29, Apr. 2024.
[38] I. Heitlager, T. Kuipers, and J. Visser, “A Practical Model for Measuring
Maintainability,” in 6th International Conference on the Quality of
Information and Communications Technology (QUATIC 2007), pp. 30–
39, 2007.
[39] G. A. Campbell, “Cognitive complexity: an overview and evaluation,”
in Proceedings of the 2018 International Conference on Technical
Debt, TechDebt ’18, (New York, NY, USA), p. 57–58, Association for
Computing Machinery, 2018.
[40] J. Cohen, Statistical power analysis for the behavioral sciences. Routledge,
2013.
[41] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a Method for
Automatic Evaluation of Machine Translation,” in Proceedings of the
40th annual meeting of the Association for Computational Linguistics,
pp. 311–318, 2002.
[42] M. Fowler, Refactoring: improving the design of existing code. Addison-Wesley
Professional, 2018.
[43] A. Vargha and H. D. Delaney, “A Critique and Improvement of the
CL Common Language Effect Size Statistics of McGraw and Wong,”
Journal of Educational and Behavioral Statistics, vol. 25, no. 2, pp. 101–
132, 2000.
[44] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large
Language Models are Zero-Shot Reasoners,” in Advances in Neural
Information Processing Systems (S. Koyejo, S. Mohamed, A. Agarwal,
D. Belgrave, K. Cho, and A. Oh, eds.), vol. 35, pp. 22199–22213, Curran
Associates, Inc., 2022.
[45] T. Ahmed and P. Devanbu, “Multilingual training for software en-
gineering,” in Proceedings of the 44th International Conference on
Software Engineering, ICSE ’22, (New York, NY, USA), p. 1443–1455,
Association for Computing Machinery, 2022.
[46] M. Zheng, J. Pei, L. Logeswaran, M. Lee, and D. Jurgens, “When
‘A Helpful Assistant’ Is Not Really Helpful: Personas in System
Prompts Do Not Improve Performances of Large Language Models,”
in Findings of the Association for Computational Linguistics: EMNLP
2024 (Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, eds.), (Miami, Florida,
USA), pp. 15126–15154, Association for Computational Linguistics,
Nov. 2024.
[47] B. Athiwaratkun, S. K. Gouda, Z. Wang, X. Li, Y. Tian, M. Tan, W. U.
Ahmad, S. Wang, Q. Sun, M. Shang, S. K. Gonugondla, H. Ding,
V. Kumar, N. Fulton, A. Farahani, S. Jain, R. Giaquinto, H. Qian, M. K.
Ramanathan, R. Nallapati, B. Ray, P. Bhatia, S. Sengupta, D. Roth, and
B. Xiang, “Multi-lingual Evaluation of Code Generation Models,” 2023.
[48] M. Sclar, Y. Choi, Y. Tsvetkov, and A. Suhr, “Quantifying Language
Models’ Sensitivity to Spurious Features in Prompt Design or: How
I learned to start worrying about prompt formatting,” in The Twelfth
International Conference on Learning Representations, 2023.
[49] D. OBrien, S. Biswas, S. M. Imtiaz, R. Abdalkareem, E. Shihab, and
H. Rajan, “Are Prompt Engineering and TODO Comments Friends
or Foes? An Evaluation on GitHub Copilot,” in Proceedings of the
IEEE/ACM 46th International Conference on Software Engineering,
ICSE ’24, (New York, NY, USA), Association for Computing Machinery,
2024.
[50] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and
K. Narasimhan, “Tree of Thoughts: Deliberate Problem Solving with
Large Language Models,” in Advances in Neural Information Processing
Systems (A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and
S. Levine, eds.), vol. 36, pp. 11809–11822, Curran Associates, Inc.,
2023.
[51] S. Li, Y. Cheng, J. Chen, J. Xuan, S. He, and W. Shang, “Assessing
the Performance of AI-Generated Code: A Case Study on GitHub
Copilot,” in 35th IEEE International Symposium on Software Reliability
Engineering, ISSRE, IEEE, 2024.
APPENDIX
A. Pass@k rates
This section presents the Pass@1 results to evaluate the
correctness of the generated code by Llama3 (Figure 11) and
Mistral (Figure 12).
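For readers unfamiliar with the metric, the sketch below implements the standard unbiased pass@k estimator introduced with HumanEval [27], where n is the number of samples generated for a task and c the number that pass the tests. It is a generic illustration: this section does not state how many samples per prompt were drawn, and the function names are ours. With one sample per task (n = k = 1), the estimator reduces to the fraction of tasks whose generated function passes.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one task: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed, k: evaluation budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical outcomes for four tasks with one sample each.
    print(mean_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 0)], k=1))
```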
[Figure 11: chart of Pass@1 (x-axis) for each prompt technique combination; the extracted bar labels range from 44.3% to 55.2%.]
Fig. 11. The Pass@1 results of the different prompt technique combinations
for Llama3.
[Figure 12: chart of Pass@1 (x-axis) for each prompt technique combination; the extracted bar labels range from 43% to 51.1%.]
Fig. 12. The Pass@1 results of the different prompt technique combinations
for Mistral.
B. Code smells
This section presents the percentages of code smells in
functions generated by Llama3 (Figure 14) and Mistral (Figure
13).
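The percentages in these figures can be read as the share of generated functions in which a given lint message appears at least once. The sketch below shows one way to compute such shares; it is an illustration under assumptions, not our analysis script. In particular, we assume here that the IDs are Pylint message IDs (e.g., C0116 for a missing function docstring, C0301 for an overly long line, W0613 for an unused argument), and the input data are invented.

```python
from collections import Counter


def smell_percentages(smells_per_function):
    """Map each lint ID to the percentage of functions that exhibit it.

    smells_per_function: one list of lint IDs per generated function.
    Repeated IDs within the same function are counted once.
    """
    total = len(smells_per_function)
    if total == 0:
        return {}
    counts = Counter(
        smell for smells in smells_per_function for smell in set(smells)
    )
    return {smell: 100.0 * n / total for smell, n in counts.items()}


if __name__ == "__main__":
    # Invented lint results for four generated functions.
    example = [["C0116", "C0301"], ["C0116", "C0116"], [], ["W0613"]]
    print(smell_percentages(example))
```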
[Figure 13: percentage of functions per prompt technique combination and Code Smell ID (C0116, C0301, C0115, W0621, R0903, W0611, W0613, R1705, C0103, E0602, W0612, W1203, W1309); individual cell values are not recoverable from the extraction.]
Fig. 13. The percentages of functions that had different code smells (as Lint
IDs) that were generated by the combinations of prompt techniques (Mistral).
[Figure 14: percentage of functions per prompt technique combination and Code Smell ID (C0116, C0301, E0602, C0115, W0613, R0903, W0611, R1705, W0612, W0212); individual cell values are not recoverable from the extraction.]
Fig. 14. The percentages of functions that had different code smells (as Lint
IDs) that were generated by the combinations of prompt techniques (Llama3).
C. Error types
This section presents the percentages of error types across
prompt techniques for functions generated by Llama3 (Figure
15) and Mistral (Figure 16).
[Figure 15: stacked percentages of error types (AssertionError, TypeError, AttributeError, ImportError, Other) for each zero-shot and few-shot prompt technique combination; individual values are not recoverable from the extraction.]
Fig. 15. The percentages of error types that we observed in failed functions
generated by different combinations of prompt techniques (Llama3).
[Figure 16: stacked percentages of error types (AssertionError, TypeError, AttributeError, ImportError, Other) for each zero-shot and few-shot prompt technique combination; individual values are not recoverable from the extraction.]
Fig. 16. The percentages of error types that we observed in failed functions
generated by different combinations of prompt techniques (Mistral).
D. CodeBLEU scores
This section presents the CodeBLEU results of code generated
by GPT-4o (Figure 18) and Mistral (Figure 19), as well
as the general multi-variate regression results that show the impact
of prompt techniques on CodeBLEU and its three dimensions
(Figure 17).
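The relation between the score and its three dimensions can be sketched as a weighted sum. The equal weights below follow the 33.33% shares shown in the panel titles of Figure 17; the function name and the input values are hypothetical. Note that the original CodeBLEU formulation [29] also defines a data-flow component, which is not among the panels listed in that figure.

```python
def combined_similarity(ngram, weighted_ngram, ast_match,
                        weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of n-gram, weighted n-gram, and AST match scores.

    All component scores are expected in [0, 1]; weights must sum to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    components = (ngram, weighted_ngram, ast_match)
    return sum(w * score for w, score in zip(weights, components))


if __name__ == "__main__":
    # Invented component scores for a single generated function.
    print(combined_similarity(0.10, 0.12, 0.23))
```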
[Figure 17: four panels (CodeBLEU; BLEU (n-gram), 33.33%; BLEU-weighted, 33.33%; Syntactic Similarity (AST), 33.33%) showing the coefficient estimate (x-axis, approximately -0.005 to 0.015) for each prompt technique combination, by LLM (GPT-4o, Llama3, Mistral) and significance level (<0.001, <0.01, <0.05, not significant).]
Fig. 17. The coefficient estimates from the multi-linear regression of selected combinations of prompt techniques that significantly impact CodeBLEU
and its components: the n-gram match (BLEU), weighted n-gram match (BLEU-weighted), and AST match scores, which together contribute to the CodeBLEU score.
[Figure 18: chart of average CodeBLEU (x-axis) for each prompt technique combination; the extracted bar labels range from 13.4% to 16.1%.]
Fig. 18. The average CodeBLEU scores for the functions generated by each
combination (GPT-4o).
E. Complexity
This section provides the complete results for the cyclomatic
complexity (Table VI) and cognitive complexity (Table VII).
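Tables VI and VII report the Vargha-Delaney A12 statistic [43] as the effect size. The sketch below computes A12 from two samples using its standard definition; it is an illustration rather than our analysis code, and the sample values are invented. We assume here that the first sample holds the complexity values for a prompt technique combination and the second those of the comparison group, so that values below 0.5 correspond to lower complexity. The thresholds used to label effects as negligible, small, medium, or large are not restated here.

```python
def vargha_delaney_a12(treatment, control):
    """Probability that a random treatment value exceeds a random control value.

    Ties count as one half. A12 = 0.5 means no difference; A12 < 0.5 means
    treatment values tend to be smaller than control values.
    """
    if not treatment or not control:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in treatment for y in control if x > y)
    ties = sum(1 for x in treatment for y in control if x == y)
    return (greater + 0.5 * ties) / (len(treatment) * len(control))


if __name__ == "__main__":
    # Invented cyclomatic complexity values for two groups of functions.
    with_technique = [2, 3, 3, 4]
    comparison = [3, 4, 5, 5]
    print(vargha_delaney_a12(with_technique, comparison))
```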
[Figure 19: chart of average CodeBLEU (x-axis) for each prompt technique combination; the extracted bar labels range from 13% to 15.8%.]
Fig. 19. The average CodeBLEU scores for the functions generated by each
combination (Mistral).
Page 18:
TABLE VI
CYCLOMATIC COMPLEXITY ANALYSIS USING THE WILCOXON TEST AND THE VARGHA-DELANEY A12 FOR THE EFFECT SIZE (N - NEGLIGIBLE, S - SMALL, M -
MEDIUM, L - LARGE). ↓ INDICATES A REDUCTION IN COMPLEXITY, ∅ INDICATES NO STATISTICAL DIFFERENCE. WE HAVE NOT OBSERVED CASES OF
INCREASING COMPLEXITY.
Combinations GPT-4o Llama3 Mistral
p-value A12 Effect p-value A12 Effect p-value A12 Effect
Package 0.0012 0.403 ↓(S) 0.0008 0.402 ↓(S) 0.0407 0.417 ↓(S)
CoT, Package 0.0366 0.446 ↓(S) 0.0147 0.440 ↓(S) 0.0211 0.410 ↓(S)
Zero-shot 0.0143 0.433 ↓(S) 0.0012 0.392 ↓(S) 0.0402 0.407 ↓(S)
CoT, Persona 0.0576 0.462 ∅ 0.0100 0.432 ↓(S) 0.0062 0.433 ↓(S)
Persona, Sig. 0.0999 0.466 ∅ 0.0358 0.436 ↓(S) 0.0080 0.439 ↓(S)
Persona 0.0857 0.464 ∅ 0.0086 0.411 ↓(S) 0.0259 0.436 ↓(S)
Sig. 0.1593 0.453 ∅ 0.0094 0.425 ↓(S) 0.0157 0.434 ↓(S)
Package, Sig. 0.0426 0.457 ↓(N) 0.0156 0.417 ↓(S) 0.0175 0.448 ↓(S)
Persona, Package 0.0283 0.438 ↓(S) 0.0039 0.404 ↓(S) 0.0661 0.460 ∅
CoT 0.0191 0.447 ↓(S) 0.0559 0.448 ∅ 0.0095 0.417 ↓(S)
CoT, Persona, Package 0.0859 0.452 ∅ 0.0002 0.410 ↓(S) 0.1732 0.466 ∅
Few-shot, Persona, Package 0.7039 0.494 ∅ 0.0243 0.438 ↓(S) 0.5887 0.493 ∅
Few-shot, Persona 0.3185 0.483 ∅ 0.0094 0.430 ↓(S) 0.2305 0.447 ∅
Few-shot 0.1646 0.472 ∅ 0.0381 0.432 ↓(S) 0.4433 0.453 ∅
Few-shot, CoT, Persona, Sig. 0.3691 0.520 ∅ 0.0295 0.457 ↓(N) 0.3559 0.486 ∅
CoT, Persona, Package, Sig. 0.1051 0.463 ∅ 0.0292 0.457 ↓(N) 0.3423 0.493 ∅
Few-shot, Persona, Sig. 0.4960 0.511 ∅ 0.0421 0.455 ↓(N) 0.8256 0.486 ∅
CoT, Persona, Sig. 0.3594 0.496 ∅ 0.2051 0.459 ∅ 0.0812 0.457 ∅
CoT, Package, Sig. 0.3130 0.483 ∅ 0.1568 0.481 ∅ 0.1667 0.475 ∅
CoT, Sig. 0.1784 0.474 ∅ 0.0869 0.460 ∅ 0.1002 0.465 ∅
Persona, Package, Sig. 0.1254 0.467 ∅ 0.1296 0.448 ∅ 0.1592 0.485 ∅
Few-shot, CoT, Persona, Package, Sig. 0.9261 0.502 ∅ 0.1803 0.482 ∅ 0.7346 0.521 ∅
Few-shot, CoT, Persona, Package 0.7083 0.512 ∅ 0.0900 0.466 ∅ 0.2862 0.471 ∅
Few-shot, CoT, Persona 0.2819 0.458 ∅ 0.1312 0.458 ∅ 0.0501 0.425 ∅
Few-shot, CoT, Package, Sig. 0.3642 0.522 ∅ 0.3053 0.478 ∅ 0.3436 0.542 ∅
Few-shot, CoT, Package 0.7715 0.485 ∅ 0.3289 0.469 ∅ 0.0653 0.446 ∅
Few-shot, CoT, Sig. 0.5092 0.500 ∅ 0.3245 0.474 ∅ 0.6845 0.497 ∅
Few-shot, CoT 0.6889 0.478 ∅ 0.3901 0.468 ∅ 0.0852 0.439 ∅
Few-shot, Persona, Package, Sig. 0.8748 0.506 ∅ 0.0788 0.458 ∅ 0.6931 0.524 ∅
Few-shot, Package, Sig. 0.4958 0.501 ∅ 0.3440 0.467 ∅ 0.9435 0.510 ∅
Few-shot, Package 0.3267 0.474 ∅ 0.0551 0.443 ∅ 0.8160 0.509 ∅
Few-shot, Sig. 0.3447 0.515 ∅ 0.1005 0.449 ∅ 0.8524 0.497 ∅
TABLE VII
COGNITIVE COMPLEXITY ANALYSIS USING THE WILCOXON TEST AND THE VARGHA-DELANEY A12 FOR THE EFFECT SIZE (N - NEGLIGIBLE, S - SMALL, M -
MEDIUM, L - LARGE). ↓ INDICATES A REDUCTION IN COMPLEXITY, ∅ INDICATES NO STATISTICAL DIFFERENCE. WE HAVE NOT OBSERVED CASES OF
INCREASING COMPLEXITY.
Combinations GPT-4o Llama3 Mistral
p-value A12 Effect p-value A12 Effect p-value A12 Effect
Package, Sig. 0.0365 0.449 ↓(S) 0.0001 0.382 ↓(S) 0.0175 0.448 ↓(S)
Few-shot, Persona 0.0364 0.442 ↓(S) 0.0327 0.412 ↓(S) 0.2305 0.447 ∅
CoT, Persona 0.3678 0.501 ∅ 0.0026 0.419 ↓(S) 0.0062 0.433 ↓(S)
CoT, Package 0.3300 0.485 ∅ 0.0256 0.444 ↓(S) 0.0211 0.410 ↓(S)
CoT 0.3609 0.496 ∅ 0.0449 0.445 ↓(S) 0.0095 0.417 ↓(S)
Persona, Sig. 0.0883 0.456 ∅ 0.0025 0.416 ↓(S) 0.0079 0.439 ↓(S)
Persona 0.1156 0.493 ∅ 0.0032 0.393 ↓(S) 0.0259 0.436 ↓(S)
Package 0.0777 0.446 ∅ 0.0002 0.371 ↓(S) 0.0407 0.417 ↓(S)
Sig. 0.5717 0.485 ∅ 0.0003 0.398 ↓(S) 0.0157 0.434 ↓(S)
Zero-shot 0.1708 0.486 ∅ 0.0016 0.382 ↓(S) 0.0402 0.407 ↓(S)
Few-shot, CoT, Persona, Sig. 0.5113 0.479 ∅ 0.0435 0.444 ↓(S) 0.3559 0.486 ∅
Few-shot, Persona, Package 0.1139 0.449 ∅ 0.0378 0.428 ↓(S) 0.5887 0.493 ∅
Few-shot, Persona, Sig. 0.3812 0.470 ∅ 0.0035 0.421 ↓(S) 0.8256 0.486 ∅
Few-shot, Sig. 0.7730 0.488 ∅ 0.0244 0.417 ↓(S) 0.8524 0.497 ∅
Few-shot 0.2544 0.444 ∅ 0.0475 0.408 ↓(S) 0.4433 0.453 ∅
CoT, Persona, Package, Sig. 0.2312 0.490 ∅ 0.0144 0.444 ↓(S) 0.3423 0.493 ∅
Persona, Package, Sig. 0.3364 0.478 ∅ 0.0095 0.418 ↓(S) 0.1592 0.485 ∅
Persona, Package 0.2733 0.484 ∅ 0.0006 0.395 ↓(S) 0.0661 0.460 ∅
CoT, Sig. 0.9386 0.522 ∅ 0.0439 0.446 ↓(S) 0.1002 0.465 ∅
CoT, Persona, Package 0.0496 0.461 ↓(N) 0.0007 0.420 ↓(S) 0.1732 0.466 ∅
CoT, Package, Sig. 0.8144 0.508 ∅ 0.0254 0.464 ↓(N) 0.1667 0.475 ∅
CoT, Persona, Sig. 0.6917 0.510 ∅ 0.1106 0.459 ∅ 0.0812 0.457 ∅
Few-shot, CoT, Persona, Package, Sig. 0.5436 0.485 ∅ 0.1174 0.463 ∅ 0.7346 0.521 ∅
Few-shot, CoT, Persona, Package 0.2607 0.478 ∅ 0.0662 0.461 ∅ 0.2862 0.471 ∅
Few-shot, CoT, Persona 0.0531 0.443 ∅ 0.1155 0.460 ∅ 0.0501 0.425 ∅
Few-shot, CoT, Package, Sig. 0.6145 0.509 ∅ 0.4893 0.467 ∅ 0.3436 0.542 ∅
Few-shot, CoT, Package 0.2378 0.467 ∅ 0.4649 0.464 ∅ 0.0653 0.446 ∅
Few-shot, CoT, Sig. 0.3291 0.487 ∅ 0.3538 0.462 ∅ 0.6845 0.497 ∅
Few-shot, CoT 0.1763 0.443 ∅ 0.7870 0.465 ∅ 0.0852 0.439 ∅
Few-shot, Persona, Package, Sig. 0.7159 0.482 ∅ 0.0690 0.436 ∅ 0.6931 0.524 ∅
Few-shot, Package, Sig. 0.6934 0.511 ∅ 0.3366 0.437 ∅ 0.9435 0.510 ∅
Few-shot, Package 0.1084 0.454 ∅ 0.0966 0.428 ∅ 0.8160 0.509 ∅
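The Effect columns in Tables VI and VII combine a p-value with the Vargha-Delaney A12 statistic. The sketch below illustrates one way such a label could be derived; it is not the analysis script behind the tables. Several points are assumptions: that A12 is computed as the probability that a complexity value from the prompt-technique sample exceeds one from the comparison sample (so A12 < 0.5 reads as a reduction), that significance is judged at alpha = 0.05, and that the magnitude cutoffs are the commonly cited Vargha-Delaney guideline (0.56, 0.64, 0.71, mirrored around 0.5). The tables may use different boundaries, so the N/S labels produced here need not match every row. The p-value is taken as an input; the Wilcoxon test itself is not implemented. All function names are hypothetical.

```python
from itertools import product


def vargha_delaney_a12(treatment, baseline):
    """Estimate P(treatment value > baseline value), counting ties as one half."""
    if not treatment or not baseline:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x, y in product(treatment, baseline) if x > y)
    ties = sum(1 for x, y in product(treatment, baseline) if x == y)
    return (greater + 0.5 * ties) / (len(treatment) * len(baseline))


def magnitude(a12, cutoffs=(0.06, 0.14, 0.21)):
    """Map the distance of A12 from 0.5 to N, S, M, or L.

    The default cutoffs are an assumption (guideline values 0.56/0.64/0.71),
    not thresholds confirmed by the paper.
    """
    distance = abs(a12 - 0.5)
    small, medium, large = cutoffs
    if distance < small:
        return "N"
    if distance < medium:
        return "S"
    if distance < large:
        return "M"
    return "L"


def effect_label(p_value, a12, alpha=0.05):
    """Return '∅' when not significant, else a direction arrow plus magnitude."""
    if p_value >= alpha:
        return "∅"
    direction = "↓" if a12 < 0.5 else "↑"
    return f"{direction}({magnitude(a12)})"


if __name__ == "__main__":
    # Toy complexity values, not data from the study.
    with_technique = [1, 2, 2, 3]
    comparison = [2, 3, 4, 4]
    a12 = vargha_delaney_a12(with_technique, comparison)
    print(a12, effect_label(0.01, a12))
```

Reading a table row this way, an A12 of 0.403 means that a function generated with the technique is estimated to be more complex than a comparison function in roughly 40% of pairings, with ties split evenly.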