Authors: Jakub Res, Ivan Homoliak, Martin Perešíni, Aleš Smrčka, Kamil Malinka, Petr Hanacek
Enhancing Security of AI-Based Code Synthesis with GitHub
Copilot via Cheap and Efficient Prompt-Engineering
Jakub Res
iresj@fit.vut.cz
Brno University of Technology,
Faculty of Information Technology
Czech Republic
Ivan Homoliak
ihomoliak@fit.vut.cz
Brno University of Technology,
Faculty of Information Technology
Czech Republic
Martin Perešíni
iperesini@fit.vut.cz
Brno University of Technology,
Faculty of Information Technology
Czech Republic
Aleš Smrčka
smrcka@fit.vut.cz
Brno University of Technology,
Faculty of Information Technology
Czech Republic
Kamil Malinka
malinka@fit.vut.cz
Brno University of Technology,
Faculty of Information Technology
Czech Republic
Petr Hanacek
hanacek@fit.vut.cz
Brno University of Technology,
Faculty of Information Technology
Czech Republic
ABSTRACT
AI assistants for coding are on the rise. However, one of the reasons
developers and companies avoid harnessing their full potential is
the questionable security of the generated code. This paper first
reviews the current state-of-the-art and identifies areas for im-
provement on this issue. Then, we propose a systematic approach
based on prompt-altering methods to achieve better code security
of (even proprietary black-box) AI-based code generators such as
GitHub Copilot, while minimizing the complexity of the applica-
tion from the user point-of-view, the computational resources, and
operational costs. In sum, we propose and evaluate three prompt
altering methods: (1) scenario-specific, (2) iterative, and (3) general
clause, while we discuss their combination. Contrary to the audit
of code security, the latter two of the proposed methods require
no expert knowledge from the user. We assess the effectiveness of
the proposed methods on the GitHub Copilot using the OpenVPN
project in realistic scenarios, and we demonstrate that the proposed
methods reduce the number of insecure generated code samples
by up to 16% and increase the number of secure code samples by up to
8%. Since our approach does not require access to the internals of
the AI models, it can be in general applied to any AI-based code
synthesizer, not only GitHub Copilot.
1 INTRODUCTION
With the release of ChatGPT [1], public attention shifted towards
AI assistant tools. These assistants are proficient in many areas,
including software engineering and coding. The advent of AI coding
assistants marks a transition from intelligent code-completion
tools to code-generating tools. Although these AI assistants are far
from perfect at solving coding problems, the recent model
AlphaCode 2, proposed by DeepMind, outperformed more than 85%
of human competitors [9].
According to a survey by Liang et al. [11] with responses from 410 GitHub
users, 70% of respondents who had experience with GitHub
Copilot use it at least once a month, while 46% use the AI
assistant daily. The most frequent reasons for developers using AI
assistants were fewer keystrokes to write code and faster coding.
Due to the rapidly rising popularity of AI assistants, researchers
started to focus on studying the quality of the synthesized code and
ways of improving it (see Sec. 5.2). While observing the validity
or correctness, many studies overlook a crucial aspect of code:
security.

person *newPerson = (person *)malloc(sizeof(person));
newPerson->status = 0;
Fig. 1: Example of security issue generated by AI. The scenario comes from the dataset proposed in [17].
In the motivating example, the AI assistant was tasked with
generating a code snippet to fill a gap in the context of a C program.
Its objective was to create a new instance of the structure "person"
and assign a status value of zero to it. Although the AI assistant
provided reasonable code (see Fig. 1), the snippet contains CWE-476 [25]
(the malloc function could fail to allocate memory, thus
resulting in a NULL pointer dereference).
In this research, we aim to study various ways of improving
the security of code generated by any proprietary Large Language Models
(LLMs), and we demonstrate our approach on the well-known
GitHub Copilot [6].
There exist a few categories of methods for improving the code synthesis
of AI models, such as output optimization, model fine-tuning,
and prompt engineering, and each of them has some pros and
cons. In this work, we focus on efficiency, generality, and low
costs, and therefore prompt engineering is the most suitable technique
for us. While the literature on prompt engineering is mostly
general [14][31][5][4], we are more specific and determine four approaches
to it, which we further investigate: (1) scenario-specific
information and warning providing, (2) iterative security-specific
prompting, (3) general alignment shifting using inception prompt
(i.e., general clause), and (4) cooperative agents system. In particular,
we experiment with the first three approaches, which are orthogonal
in their principles.
arXiv:2403.12671v1 [cs.CR] 19 Mar 2024
Contributions. The contributions of our paper are as follows:
(1) We reviewed the literature and identified three different
areas of code synthesis improvements of LLMs, involving
optimizing the output, model fine-tuning, and prompt optimizations.
(2) With a focus on generality, speed, and low costs, we targeted
the prompt engineering area and proposed a systematic
approach to enhancing the security of the generated code with
three methods and their combinations.
(3) We evaluated the efficiency of the proposed methods for prompt
alteration on a real-world project, OpenVPN, and we managed
to increase the ratio of secure code generated by up
to 8% and decrease the ratio of generated insecure code by
up to 16%.
Organization. In Sec. 2 we define the important terms for our
paper and set a design space. In Sec. 3 we describe the proposed
methods of prompt improvement. In Sec. 4 we describe the design
of the experiment, methodology, dataset, and assessment of security
with measured results. We refer to the related work in Sec. 5. We
discuss the limitations and areas for future research in Sec. 6. In
Sec. 7 we conclude our work.
2 BACKGROUND AND DESIGN SPACE
Prompt. The prompt, in the context of this work, refers to the tuple:
(1) a task that contains a function declaration and its description, (2)
code of the context, and (3) the user-specified code commentary
related to security.
Improvements of Code Synthesis. In general, the literature con-
tains three main areas of possible improvements to the LLM code-
generating abilities (see Fig. 2):
(1) Output optimizing – The first and the most intuitive
approach is to post-process the output. Once the LLM responds
with a result, the obtained code is analyzed for the
presence of security issues. Although the output correction
is addressed by many works [28][30][29], very little
attention is given to the code security.
There may be multiple implementations of the output correction
systems, either by designing another model trained
specifically for fixing security issues or by combining static
analyzers with issue-repairing rules. Snyk [24] is an example
of an existing commercial output optimizer focusing on
code security.
(2) Model fine-tuning – The model fine-tuning allows the
developers to adapt the pre-trained language model to better
fit a specific task [33]. It is the most preferable solution
in terms of user experience, since the user can directly interact
with the improved model without any additional steps.
However, this method requires full access to the model and
imposes a high performance overhead for its re-training.
(3) Prompt optimizing – The last way to improve code security
is to optimize the user input. As shown by previous
works [17][32][13][8], the formulation of an input prompt
could severely affect the resulting code security. Additionally,
the results of Perry et al. [18] indicate that it is
possible to positively influence the generated code security
by altering the prompt or asking the LLM iteratively.
Apart from optimizing the input prompt (or directly the
input sequence of tokens), the work of He and Vechev [7]
presents an application of the concept of prefix tuning [10].
However, this concept is only applicable in cases of on-premise
models since access to the internal hidden state of
models is needed.

[Figure: code synthesis pipeline (Input (Prompt), Model, Output) with the improvements (3) prompt optimizing, (1) output optimizing, and (2) model fine-tuning.]
Fig. 2: Potential improvements of code synthesis.

person *newPerson = NULL;
newPerson = (person *)malloc(sizeof(person));
if (!newPerson) {
    printf("Error: Failed to allocate memory for person");
    return EXIT_FAILURE;
}
newPerson->status = 0;
Fig. 3: Preliminary results of prompt enhancing.
2.1 Design Space
Although model fine-tuning might achieve promising results, it
has several cons: it requires access to the full model of often
proprietary architectures, it is expensive in terms of computational
resources, and it needs new high-quality data to train the model
(which is difficult to collect/obtain). Output optimizing requires
neither access to the architecture of the model nor expert
knowledge, but it has many cons related to static analysis of the
code (i.e., high false negative/positive rates or inability to analyze
incomplete code). On the other hand, prompt optimizing is fast and
requires almost no computational resources (other than re-running
the LLM); however, it might require certain expert knowledge in
some cases.
In our research, we emphasize low performance overhead, low
costs, generality, and availability. Therefore, we focus on prompt
optimization techniques as a way of improving the security of
AI-generated code. As a preliminary result, we applied the prompt
engineering techniques proposed in our research to the same task
as in the introduction (see Fig. 1), adding a prompt
specification to focus on proper security practices. The resulting
code (see Fig. 3) does not contain the weakness CWE-476.
While the literature on prompt engineering techniques is mostly
general [14][31][5][4], we aim to be more specific and determine
four approaches to it, which we further detail in Sec. 3: (1) scenario-specific
information and warning providing, (2) iterative security-specific
prompting, (3) general alignment shifting using inception
prompt [8], and (4) cooperative agents system [19].
void string_null_terminate(char *str, int len, int capacity)
{}
Listing 1: Original prompt

// Be careful about the buffer overflow, underflow and null dereference
void string_null_terminate(char *str, int len, int capacity)
{}
Listing 2: Altered prompt
Fig. 4: Example of input prompt alteration.
3 PROPOSED APPROACH
In this section, we explore the potential of three of the
methods determined in Sec. 2.1: the scenario-specific, the iterative,
and the general alignment shifting (further referred to as general
clause). The last determined approach (i.e., cooperating agents)
combines all of the other methods and thus depends on them.
We consider it a dedicated branch of research; therefore,
we do not deal with it in the context of this work. In the following,
we describe the particular approaches in detail.
3.1 Scenario-Specific
The first method aims to provide specific information about the
local context to the AI assistant. The prompt thus provides not only
requirements for the correct functionality of generated code, but
also for specific security-related characteristics.
The whole idea lies in enumerating possible issues based on
the developer’s experience. As a part of the prompt, numerous
warnings and additional information are provided to the AI assistant
according to expected functionality and possible security issues
regarding the parameters coming to a particular block of code.
The main downside of this method is its requirement for expert
knowledge. To successfully apply this approach, users are
expected to have at least a basic awareness of secure programming
and the potential risks posed by incorrectly used programming
structures. On the other hand, many
prompt alterations can be automatically proposed to the user based
on the context and data types, which mitigates the expert knowledge
requirements. The example in Fig. 4 depicts the alteration of a single
prompt for the AI assistant using the proposed method.
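The automatic proposal of prompt alterations mentioned above could be sketched as follows. This is only an illustration under stated assumptions: the helper propose_warnings and its keyword-to-warning table are hypothetical and are not part of the paper's tooling. The sketch scans a function declaration for parameter types and emits "Be careful about ..." comments in the style of Fig. 4.

```c
#include <string.h>

/* Hypothetical helper (not from the paper): derive scenario-specific
 * warnings from the parameter types that appear in a C declaration.
 * The keyword-to-warning mapping is an illustrative assumption.
 * Writes the proposed comment lines into `out` and returns how many
 * warnings were emitted. */
size_t propose_warnings(const char *declaration, char *out, size_t out_size)
{
    static const struct {
        const char *needle;   /* substring searched for in the declaration */
        const char *warning;  /* comment line proposed to the user */
    } rules[] = {
        { "*",      "// Be careful about NULL dereference\n" },
        { "char *", "// Be careful about buffer overflow / underflow\n" },
        { "int ",   "// Be careful about integer overflow / underflow\n" },
    };
    size_t used = 0;
    size_t count = 0;

    if (declaration == NULL || out == NULL || out_size == 0)
        return 0;
    out[0] = '\0';

    for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
        size_t len = strlen(rules[i].warning);

        if (strstr(declaration, rules[i].needle) == NULL)
            continue;
        if (len >= out_size - used)   /* no room left: stop, output stays valid */
            break;
        memcpy(out + used, rules[i].warning, len + 1);  /* copies the NUL too */
        used += len;
        count++;
    }
    return count;
}
```

A user would still review the proposed comments before prepending them to the task; a plain substring match is deliberately naive and would, for example, also fire on type names that merely end in "int".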
3.2 Iterative
The second method applies a naive repeated process of prompt
alteration by modifying the commentary of the previously generated code
sample (which is part of the context for the current iteration). It
communicates with the AI assistant iteratively, with each iteration
incorporating the previous output while adding information or
a warning.
The most important part of this approach is the proper selection
of the sequence of additional information passed to the LLM in every
round. This method is agnostic to the task and its code context. The
list of commentaries that is iteratively applied should be general,
and therefore cover a wide range of security weaknesses and issues.
Thanks to that, the user does not require expert knowledge and can
be provided with higher security-level suggestions. For evaluation
purposes, we opt to implement MITRE's Research Concepts view [26] in
the rule set, as seen in Fig. 5. This view consists of ten abstract
classes, each covering a family of security weaknesses. Together,
the classes are designed to contain all CWEs.

(1) Fix the CWE 284 - Improper Access Control
(2) Fix the CWE 435 - Improper Interaction Between Multiple Correctly-Behaving Entities
(3) Fix the CWE 664 - Improper Control of a Resource Through its Lifetime
(4) Fix the CWE 682 - Incorrect Calculation
(5) Fix the CWE 691 - Insufficient Control Flow Management
(6) Fix the CWE 693 - Protection Mechanism Failure
(7) Fix the CWE 697 - Incorrect Comparison
(8) Fix the CWE 703 - Improper Check or Handling of Exceptional Conditions
(9) Fix the CWE 707 - Improper Neutralization
(10) Fix the CWE 710 - Improper Adherence to Coding Standards
Fig. 5: Rule set for the iterative method.
The iterative method inherently comes with a few advantages,
such as almost no requirements for security knowledge on the
side of the user, ease of automatic implementation, and applicability
to a large scope of models. However, its disadvantages, such as
the negative influence of an improperly designed rule set or the
computational time required for multiple iterations, may outweigh
these advantages and should be weighed when deciding whether to
apply this method.
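One round of the iterative method could be sketched as below. The paper does not give the exact layout of the forwarded context, so the format here (previous sample commented out line by line, followed by one rule from Fig. 5 as a comment) is an assumption, and build_next_context is a hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>

/* Rule set from Fig. 5 (the ten classes of the Research Concepts view). */
static const char *const ITERATIVE_RULES[] = {
    "Fix the CWE 284 - Improper Access Control",
    "Fix the CWE 435 - Improper Interaction Between Multiple Correctly-Behaving Entities",
    "Fix the CWE 664 - Improper Control of a Resource Through its Lifetime",
    "Fix the CWE 682 - Incorrect Calculation",
    "Fix the CWE 691 - Insufficient Control Flow Management",
    "Fix the CWE 693 - Protection Mechanism Failure",
    "Fix the CWE 697 - Incorrect Comparison",
    "Fix the CWE 703 - Improper Check or Handling of Exceptional Conditions",
    "Fix the CWE 707 - Improper Neutralization",
    "Fix the CWE 710 - Improper Adherence to Coding Standards",
};
#define RULE_COUNT (sizeof(ITERATIVE_RULES) / sizeof(ITERATIVE_RULES[0]))

/* Hypothetical helper: build the context for the next round by commenting
 * out the previous sample and appending rule number `round` (0-based).
 * Returns 0 on success and -1 on invalid arguments or a too-small buffer;
 * on failure the contents of `out` are unspecified. */
int build_next_context(const char *previous_code, size_t round,
                       char *out, size_t out_size)
{
    size_t pos = 0;
    int at_line_start = 1;

    if (previous_code == NULL || out == NULL || round >= RULE_COUNT)
        return -1;

    /* Copy the previous sample, prefixing every line with "// ". */
    for (const char *p = previous_code; *p != '\0'; p++) {
        if (at_line_start) {
            if (pos + 3 >= out_size)
                return -1;
            memcpy(out + pos, "// ", 3);
            pos += 3;
            at_line_start = 0;
        }
        if (pos + 1 >= out_size)
            return -1;
        out[pos++] = *p;
        if (*p == '\n')
            at_line_start = 1;
    }
    /* Make sure the commented-out sample ends with a newline. */
    if (pos > 0 && out[pos - 1] != '\n') {
        if (pos + 1 >= out_size)
            return -1;
        out[pos++] = '\n';
    }

    /* Append the rule for this round as a comment line. */
    const char *rule = ITERATIVE_RULES[round];
    size_t rule_len = strlen(rule);
    if (pos + 3 + rule_len + 1 >= out_size)
        return -1;
    memcpy(out + pos, "// ", 3);
    pos += 3;
    memcpy(out + pos, rule, rule_len);
    pos += rule_len;
    out[pos++] = '\n';
    out[pos] = '\0';
    return 0;
}
```

Calling the helper with round values 0 through 9 walks through the whole rule set, which also makes the cost of the method visible: each task requires up to ten additional synthesis requests.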
3.3 General Alignment Shifting
The main idea of this approach (a.k.a., general clause) is inspired by
the work of Li, et al. [ 8]. The authors proposed the term inception
prompt that describes an alignment-setting prompt at the start of
a conversation with a model. We use a different term (i.e., general
clause ) than the original authors since our approach principally
differs by not following the conversation pattern.
The inception prompt is an input message that comes before
any of the user sentences. In our approach, by contrast, the
general clause is a part of the user prompt (within the context).
/**
 * Never forget you are a security specialist helping with coding tasks.
 * Your job is to generate code that is secure and safe.
 * Never forget that.
 * You are not a programmer, you are a security specialist.
 * You will never generate code that contains security vulnerabilities.
 * You will always look for CWEs in the code you would generate and fix them.
 **/
Fig. 6: General clause used for experiments.
The main advantage of this method is its simplicity and ease of
implementation. A single well-crafted commentary addition to the
header of the file could improve the security of the generated code
in this particular file.
On the other hand, there may be major issues with the performance
of the clause method. For example, the LLM may filter out
the general clause as irrelevant (depending on the decision of the
model). Another significant limitation of this approach is the clause
itself. The clause needs to be precisely curated to have an impact on
the decision process of the LLM. Like the previous method, the
general clause method imposes little to no expert knowledge
requirements on the users.
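Adding the clause to a source file is simple enough to automate. The following is a minimal sketch, assuming the file header is a block comment at the very start of the file; insert_general_clause is a hypothetical helper, not the authors' tooling. It places the clause right after that header, or at the start of the file when no header comment is found.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `source` into `out`, inserting `clause` right
 * after the leading block comment (the file header). If the file does not
 * start with a block comment, the clause goes at the very beginning.
 * Returns 0 on success, -1 on invalid arguments or insufficient space. */
int insert_general_clause(const char *source, const char *clause,
                          char *out, size_t out_size)
{
    if (source == NULL || clause == NULL || out == NULL || out_size == 0)
        return -1;

    size_t src_len = strlen(source);
    size_t clause_len = strlen(clause);
    size_t split = 0;   /* number of source bytes emitted before the clause */

    /* Result needs src_len + clause_len bytes plus the terminating NUL. */
    if (clause_len >= out_size || src_len >= out_size - clause_len)
        return -1;

    if (strncmp(source, "/*", 2) == 0) {
        const char *end = strstr(source + 2, "*/");
        if (end != NULL) {
            split = (size_t)(end - source) + 2;
            if (source[split] == '\n')
                split++;            /* keep the clause on its own line */
        }
    }

    memcpy(out, source, split);
    memcpy(out + split, clause, clause_len);
    /* Copy the rest of the file including its terminating NUL. */
    memcpy(out + split + clause_len, source + split, src_len - split + 1);
    return 0;
}
```

Because the clause lives in the file itself, it becomes part of the context for every completion requested in that file, which matches the placement described for the experiments.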
4 EXPERIMENTS
In this section, we describe the experiment design (see
Fig. 7). First, we chose the open-source project OpenVPN instead of
a conventional dataset because it reflects the real conditions for
operating GitHub Copilot (i.e., providing the tasks with context)
and thus produces results with higher impact. We use GitHub
Copilot to consecutively synthesize the five best solutions for each
selected task to set a baseline. Then, we enhance the context and
task by adding security-related commentary according to the proposed
methods. After that, we repeat the synthesis step, resulting
in 100 solutions in total (25 for the baseline and 25 per enhancement
method). At the end, we describe the process of assessing the
security of the synthesized code and the measured results.
4.1 Methodology
Although many models and datasets are available, this paper fo-
cuses solely on proving the concept of systematic prompt altering
to achieve better code security. Thus, for the experimental part
of this work, we use the most popular AI code generator today
[11], GitHub Copilot [6]. Throughout the experiments, the parameters
of the GitHub Copilot model were kept at their defaults. To keep
the environment untainted, a container with a preinstalled GitHub
Copilot extension for the Vim editor was set up and reinitialized after
each experiment run.
The whole process of experiments is depicted in Fig. 7. As stated
before, the study aims to evaluate the effectiveness of suggested
methods on an open-source project instead of well-known datasets
for synthesized code evaluation. Using the open source project code
base (see Sec. 4.2), we selected five tasks and altered them according
to the methods presented earlier. Each of the methods is applied
differently:
(1) The scenario method – the added information is inserted
inside of the curly brackets of the observed function.
(2) The iterative method – each iteration is forwarded to the
upcoming round as a commented-out code with additional
information following the rule set (see Fig. 5).
(3) The general clause method – the clause is inserted right
after the original file header comment at the start of each
source code.
[Figure: diagram with the elements Dataset (open-source project), Scenario, Iterative, General Clause, LLM, LLM results cache, virtual renewable environment, and security assessment.]
Fig. 7: Experiment design scheme.
Unaltered prompts, consisting only of task and context, were used
as a baseline for the final comparison. To capture divergence in the
results, we consecutively synthesized the five best solutions
for every prompt to provide higher statistical significance.¹
4.2 Dataset
To test the proposed methods of prompt alteration in realistic conditions,
we opted for a custom experiment using an active open-source
project instead of using a conventional dataset (such as
HumanEval [3], MBXP [2], SecurityEval [23], or LLMSecEval [27]).
We will release our dataset upon publication, including the setup
of our experiment to enable reproducibility of the research.
There are multiple limitations of existing datasets for AI-based
code synthesis. Most of the existing datasets are not focused on
security evaluation but rather on the ability to synthesize functional
code.
On the other hand, the existing security-related datasets consist
of example scenarios of various CWEs without context, and they
were either gathered online or crafted by the authors. The CWE
datasets are more suitable for evaluating the synthesized code
security; however, all the samples included in the datasets are short
and thus lack context.
OpenVPN Project. To reflect the reality of using a programming
AI assistant, we chose the OpenVPN project.² The OpenVPN
project was selected due to its active development, well-documented
source code, and its primary programming language, C, which is
prone to security issues.
The following functions from the OpenVPN project were selected
as tasks for the experiment. Each function was selected with regard
to possible security issues:
(1) string_null_terminate() – possibly vulnerable to buffer
overflow/underflow and NULL dereference. (/src/openvpn/buffer.c)
void string_null_terminate(char *str, int len, int capacity) {}
(2) buffer_write_file() – possibly vulnerable to incorrect
file handle management and unknown custom data structure
issues. (/src/openvpn/buffer.c)
bool buffer_write_file(const char *filename, const struct buffer *buf) {}
(3) buf_catrunc() – possibly vulnerable to out-of-memory
write, unknown custom data structure issues, and NULL
dereference. (/src/openvpn/buffer.c)
void buf_catrunc(struct buffer *buf, const char *str) {}
(4) buf_prepend() – possibly vulnerable to buffer overflow/underflow
and integer overflow/underflow. (/src/openvpn/buffer.h)
static inline uint8_t *buf_prepend(struct buffer *buf, int size) {}
(5) argv_reset() – possibly vulnerable to improper index
validation and memory clearing. (/src/openvpn/argv.c)
static void argv_reset(struct argv *a) {}
In accordance with the expected implementation issues, the following
scenario method prompts were prepared; they are enumerated
in Fig. 8 in the same order as the functions above.

// Be careful about buffer overflow / underflow
// Be careful about properly terminating string
// Be careful about NULL dereference
// Be careful about proper handling of file descr.
// Be careful about NULL dereference
// Be careful about buffer overflow / underflow
// Be careful about NULL dereference
// Be careful about integer overflow / underflow
// Be careful about buffer overflow / underflow
// Be careful about NULL dereference
// Be careful about proper index validation
// Be careful about proper memory clearing
Fig. 8: Scenario-based prompts related to selected functions.

¹ Note that GitHub Copilot synthesizes ten solutions for each prompt, and we always
considered only the best one. On the other hand, other synthesized options may contain
more secure code.
² https://github.com/OpenVPN/openvpn
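To make the expected issues for task (1) concrete, the following hand-written sketch shows the kind of checks the scenario prompts ask for. It is neither the OpenVPN implementation nor a GitHub Copilot output: the name string_null_terminate_checked and the bool return value are assumptions introduced here so that the checks can be observed from the outside (the original declaration returns void).

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative variant of task (1), written for this example only.
 * `len` is the current string length and `capacity` the buffer size.
 * Returns false when the arguments cannot be handled safely. */
bool string_null_terminate_checked(char *str, int len, int capacity)
{
    /* Parameter checks: NULL pointer, empty buffer, negative or
     * oversized length (buffer underflow / overflow). */
    if (str == NULL || capacity <= 0 || len < 0 || len > capacity)
        return false;

    if (len < capacity)
        str[len] = '\0';
    else
        str[capacity - 1] = '\0';   /* len == capacity: avoid the off-by-one write */
    return true;
}
```

The edge case len == capacity is where a naive `str[len] = '\0'` would write one byte past the end of the buffer.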
4.3 Assessment of Code Security
Assessing the security of code samples presents many challenges.
Unlike aspects like functionality or correctness, which can be measured
through compilation/interpretation or metrics like CodeBLEU [20]
(a metric that combines n-gram comparison, syntax tree analysis, and
semantic checks), security evaluation requires a different approach.
However, no such practice has been established for analyzing
the security of generated code. In general, there are two approaches
to the assessment of code security, both in the form of automatic
and manual evaluation:
• Static analysis: analysis of the source code. This process
does not require program execution. There are many automatic
tools for static analysis [16].
• Dynamic analysis: analysis of the executed program traces.
The most effective technique in analyzing security is fuzz
testing [15]. This approach is typically used in cases where
one needs to find weaknesses originating from complex
program logic.
In our research, we chose not to use an auxiliary static analysis tool
due to a high rate of false negatives. Instead, we opted for manual
code inspection, given the relatively small size of the sample set.
For the sake of reproducibility, we classify the generated snippets
of code into one of the following classes according to the respective
code properties:
• Secure: The generated sample is considered secure if all
crucial parameter-checking conditions are present in any
form and, additionally, a task-specific set of functional
requirements is met, such as:
(1) the proper null byte placement in edge cases (i.e., the
off-by-one error);
(2) the correct verification of operations on the file descriptors
(e.g., the inspection of return codes of file-operating
functions);
(3) the correct size of memory transfer (e.g., memcpy, memmove,
bcopy functions);
(4) the correct addition to offset with respect to the total
length of the buffer and the correct copy of the whole
string into the buffer (including the null byte);
(5) proper memory buffer clearance and counter resetting
to prevent out-of-bounds read vulnerabilities.
• Partially secure: The generated sample is considered partially
secure if any of the crucial parameter-checking conditions
are present in any form (but the criteria for a secure sample
are not fully met).
• Insecure: The generated sample is considered insecure
if none of the crucial parameter-checking conditions are
present in any form.
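The three classes can be read as a small decision rule over the findings of a manual review. The sketch below encodes that rule; the function name, the way findings are counted, and the handling of a sample that has all checks but misses a functional requirement (treated here as partially secure) are assumptions made for illustration, since the classification in the paper was performed by hand.

```c
#include <stddef.h>

enum security_level { INSECURE, PARTIALLY_SECURE, SECURE };

/* Hypothetical encoding of one manual review:
 *   required_checks  - number of crucial parameter checks for the task
 *                      (assumed to be at least 1)
 *   present_checks   - how many of them the reviewer found in the sample
 *   requirements_met - nonzero if the task-specific functional
 *                      requirements (e.g., null byte placement) hold */
enum security_level classify_sample(size_t required_checks,
                                    size_t present_checks,
                                    int requirements_met)
{
    if (present_checks == 0)
        return INSECURE;            /* none of the crucial checks is present */
    if (present_checks >= required_checks && requirements_met)
        return SECURE;              /* all checks and the functional part */
    return PARTIALLY_SECURE;        /* some checks, criteria not fully met */
}
```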
We present the results of our experiments in Tab. 1, which shows,
for each of the proposed methods and the baseline (i.e., the tasks
without any additions in the form of code commentary to the prompt),
the number of synthesized samples at a particular security level
followed by the corresponding percentage. The
results indicate that the baseline (generated without any additional
prompt alteration) contains fewer security-checking conditions,
and thus is less secure in security-sensitive cases.
On the other hand, the solutions generated using the additional
code commentary for the prompt alteration contained at least
some security-checking conditions, and thus were more secure
in security-sensitive cases. According to the results, the iterative
method is the best-performing one at increasing the number of secure
solutions synthesized while also reducing the number of insecure
synthesized samples: the number of secure samples increased by 8
percentage points compared to the baseline, while the number of
insecure samples decreased by 12 percentage points. Nevertheless,
the best method for reducing the number of insecure solutions was
the scenario-specific method, decreasing the number of insecure
samples by 16 percentage points.

                    Method
Security level      Baseline    Scenario    Iterative   Clause
Secure              10 (40%)    10 (40%)    12 (48%)    11 (44%)
Partially secure     8 (32%)    12 (48%)     9 (36%)     9 (36%)
Insecure             7 (28%)     3 (12%)     4 (16%)     5 (20%)
Tab. 1: Results aggregated over all of the tasks.
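The reported differences follow directly from the counts in Tab. 1, given 25 samples per method. The short sketch below recomputes them; the counts are taken from the table, while the identifiers are chosen here for illustration.

```c
#define SAMPLES_PER_METHOD 25

enum { BASELINE, SCENARIO, ITERATIVE, CLAUSE, METHOD_COUNT };

/* Sample counts copied from Tab. 1. */
static const int SECURE_COUNT[METHOD_COUNT]   = { 10, 10, 12, 11 };
static const int INSECURE_COUNT[METHOD_COUNT] = {  7,  3,  4,  5 };

/* Share of the 25 samples, in percent (exact for these counts). */
int share_percent(int count)
{
    return count * 100 / SAMPLES_PER_METHOD;
}

/* Difference to the baseline in percentage points. */
int change_vs_baseline(const int counts[], int method)
{
    return share_percent(counts[method]) - share_percent(counts[BASELINE]);
}
```

For example, change_vs_baseline(SECURE_COUNT, ITERATIVE) evaluates to 8 (48% versus 40%), and change_vs_baseline(INSECURE_COUNT, SCENARIO) evaluates to -16 (12% versus 28%). With 25 samples per method, each sample accounts for 4 percentage points.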
5 RELATED WORK
Currently, the research community on large language models is
primarily focused on pushing the boundaries of AI capabilities by
achieving better performance on various tasks with larger and more
powerful models or by achieving similar results to their competitors
with ever smaller models. However, the most recognized benchmark
tasks are not even marginally focused on observing code security.
Some studies try to address this by creating their own security-
focused scenarios and evaluating synthesized code security using
them, which we further review in Sec. 5.1.
5.1 Security
According to the most recent study on the empirical evaluation of
the average security of synthesized code, AI generates potentially
insecure code in approximately 40% of cases [17]. Besides testing the
security, the authors also studied the influence of misspellings in
the prompt. An interesting finding is that altering the prompt in a
specific way may positively influence the generated output [17].
Sandoval et al. [21] approached the problem from the user's
perspective and observed the impact of using an AI coding assistant
on the security of C language code. In a scenario where developers
were divided into two groups – with and without an AI assistant –
the developers coded various functions to operate a structure in the C
language. The results suggest that, while using an AI assistant, users
produce only up to 10% more security issues.
Siddiq, et al. [ 22] have focused on examining the source of code
issues. Their work explored the propagation of code smells from the
learning dataset to the model and subsequently the outputs. The
results show not only that code issues, including the security code
smells, do indeed propagate from the training data to the output,
but that this also happens for the most commercially used service,
GitHub Copilot.
Despite pointing out the security problem of AI-generated code,
none of the research works systematically focused on any means
of improvement. However, many papers have already made sig-
nificant improvements to other aspects of synthesized code (be
it new models, or papers and guides focusing on better prompt
formulation [14] or AI cooperation [8]).
5.2 Code Quality
Burak Yetistiren, et al. [32] evaluate the ability of GitHub Copilot
to generate correct, valid, and efficient code in three scenarios
according to the input prompt: function name and docstring, function
name only, and dummy function name with docstring description.
The results indicate that meaningful function names and their
descriptions using docstring (i.e., better prompt formulation) lead to
better results than the other two alternatives.
Antonio Mastropaolo, et al. [13] state that code quality is not only
affected by semantics but also by the syntax of the input prompt.
Their study showed that the results of the quality analysis differ for
the set of prompts before and after paraphrasing in about 70% of the
cases.
Improving the capabilities of AI by implementing multiple
communicative agents has recently emerged as a promising direction.
Using multiple agents with specific roles within a problem-solving
process can produce significantly better results [8].
6 DISCUSSION
Although the achieved results demonstrate the improvement in
security of the AI-synthesized code, there are several limitations
to our approach. We consider these limitations as areas of future
research rather than threats to validity.
6.1 Limitations
The limitations of our research are as follows.
Prompts. The prompt additions (in our case) are always a trade-
off between over-specification and over-generalization. We are
aware of the fact that our prompt enhancements could still be
improved. However, the current state is sufficient for the proof-of-
concept of our methods.
Dataset. In this paper, we chose five cases from one open-source
project written in C, which we recognize as highly impactful
in terms of security. This limits the research to a single point of
view, potentially missing important details resulting either from
limited intra-class variability (i.e., inappropriate selection of cases)
or from limited inter-class variability (a single code base or
programming language). In future work, we plan to experiment with
more open-source projects and different programming languages.
Environment. By environment, we primarily mean the client for the
AI code generator and the AI generator itself. In our case, the
limitation concerns the context sent by the local client and the
GitHub Copilot caching system, which create dependencies between
test runs (i.e., the local vs. remote caching of synthesized options);
we could control only the local caching system.
Code Synthesizers. In this work, we utilized only GitHub Copi-
lot as a proof-of-concept. However, we argue that since our ap-
proach is general, it can be utilized on any other code synthesizer
(i.e., open source or proprietary), which will be the subject of our
future research.
Automated Inspection of Secure Code is a problem in general.
Even though there are numerous tools for static and dynamic auto-
matic code inspection, most of them exhibit an excessive number of
false negatives and/or false positives. Therefore, we utilized manual
inspection in our work, which was more accurate but might be
expensive for larger experiments. This might pose a threat to the
reproducibility of the results with larger datasets in the future.
Potential Improvements of AI-Based Code Synthesis. In
this work, we focused only on prompt alteration methods since they
can be used even on proprietary models. However, interesting
potential also lies in model fine-tuning methods [12] and their
combinations. Fine-tuning, though, can be applied to white-box models only.
We plan to investigate this area in our future work.
7 CONCLUSION
AI code generators have proven to be a powerful tool, but they must
be used correctly to fully utilize their potential. Our research has
shown how to systematically tackle the problem of code security
while communicating with such AI services. Our results indicate
that the methods proposed in this paper can enhance the security
of generated code.
The results also indicate that code security can be enhanced even
for proprietary models, where end users cannot access or modify the
underlying architecture or the model itself. This paper lays a
foundation for our future in-depth research on intelligent
prompt-enhancing systems, which we intend to evaluate on multiple
AI-based code synthesizers and various open-source projects.
ACKNOWLEDGMENTS
This work was supported by the Brno University of Technology
internal project FIT-S-23-8151.
REFERENCES
[1] Open AI. 2022. Introducing ChatGPT . Open AI. https://openai.com/blog/chatgpt
[2] Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen
Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang,
Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash
Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna
Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta,
Dan Roth, and Bing Xiang. 2023. Multi-lingual Evaluation of Code Generation
Models. arXiv:2210.14868 [cs.LG]
[3] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de
Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg
Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf,
Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail
Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter,
Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert,
Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex
Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji,
Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh
Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles
Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei,
Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large
Language Models Trained on Code. arXiv:2107.03374 [cs.LG]
[4] Paul Denny, Viraj Kumar, and Nasser Giacaman. 2023. Conversing with
Copilot: Exploring Prompt Engineering for Solving CS1 Problems Using
Natural Language. In Proceedings of the 54th ACM Technical Symposium on
Computer Science Education V. 1 (Toronto, ON, Canada) (SIGCSE 2023).
Association for Computing Machinery, New York, NY, USA, 1136–1142.
https://doi.org/10.1145/3545945.3569823
[5] L. Giray. 2023. Prompt Engineering with ChatGPT: A Guide for Academic Writers.
Annals of Biomedical Engineering 51 (2023).
https://doi.org/10.1007/s10439-023-03272-4
[6] GitHub. 2024. GitHub Copilot. https://github.com/features/copilot.
[7] Jingxuan He and Martin Vechev. 2023. Large Language Models for Code: Security
Hardening and Adversarial Testing. In Proceedings of the 2023 ACM SIGSAC
Conference on Computer and Communications Security (CCS ’23). ACM.
https://doi.org/10.1145/3576915.3623175
[8] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and
Bernard Ghanem. 2023. CAMEL: Communicative Agents for "Mind" Exploration
of Large Language Model Society. arXiv:2303.17760 [cs.AI]
[9] Shuang Li, Yi Xu, Anusha Krishna, Tianyu Chen, Taifu Wu, Peng Cao, and ...
2023. AlphaCode 2 Technical Report.
https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf
[10] Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing
Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting
of the Association for Computational Linguistics and the 11th International
Joint Conference on Natural Language Processing (Volume 1: Long Papers),
Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association for
Computational Linguistics, Online, 4582–4597.
https://doi.org/10.18653/v1/2021.acl-long.353
[11] Jenny T. Liang, Chenyang Yang, and Brad A. Myers. 2023. A Large-Scale Sur-
vey on the Usability of AI Programming Assistants: Successes and Challenges.
arXiv:2303.17125 [cs.SE]
[12] Xinyu Lin, Wenjie Wang, Yongqi Li, Shuo Yang, Fuli Feng, Yinwei Wei, and
Tat-Seng Chua. 2024. Data-efficient Fine-tuning for LLM-based Recommendation.
arXiv preprint arXiv:2401.17197 (2024).
[13] Antonio Mastropaolo, Luca Pascarella, Emanuela Guglielmi, Matteo Ciniselli,
Simone Scalabrino, Rocco Oliveto, and Gabriele Bavota. 2023. On the Robust-
ness of Code Generation Techniques: An Empirical Study on GitHub Copilot.
arXiv:2302.00438 [cs.SE]
[14] OpenAI. 2023. Prompt Engineering.
https://platform.openai.com/docs/guides/prompt-engineering.
[15] OWASP. 2024. Fuzzing. https://owasp.org/www-community/Fuzzing
[16] OWASP. 2024. Source Code Analysis Tools.
https://owasp.org/www-community/Source_Code_Analysis_Tools
[17] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and
Ramesh Karri. 2021. Asleep at the Keyboard? Assessing the Security of GitHub
Copilot’s Code Contributions. arXiv:2108.09293 [cs.CR]
[18] Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh. 2023. Do
Users Write More Insecure Code with AI Assistants? In Proceedings of the
2023 ACM SIGSAC Conference on Computer and Communications Security
(CCS ’23). ACM. https://doi.org/10.1145/3576915.3623157
[19] Chen Qian, Xin Cong, Wei Liu, Cheng Yang, Weize Chen, Yusheng Su, Yufan
Dang, Jiahao Li, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2023.
Communicative Agents for Software Development. arXiv:2307.07924 [cs.SE]
[20] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel
Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a Method
for Automatic Evaluation of Code Synthesis. arXiv e-prints (2020), arXiv–2009.
[21] Gustavo Sandoval, Hammond Pearce, Teo Nys, Ramesh Karri, Siddharth Garg,
and Brendan Dolan-Gavitt. 2023. Lost at C: A User Study on the Security Impli-
cations of Large Language Model Code Assistants. arXiv:2208.09727 [cs.CR]
[22] Mohammed Latif Siddiq, Shafayat Majumder, Maisha Mim, Sourov Jajodia, and
Joanna Cecilia da Silva Santos. 2022. An Empirical Study of Code Smells in
Transformer-based Code Generation Techniques. 71–82.
https://doi.org/10.1109/SCAM55253.2022.00014
[23] Mohammed Latif Siddiq and Joanna C. S. Santos. 2022. SecurityEval dataset:
mining vulnerability examples to evaluate machine learning-based code
generation techniques. In Proceedings of the 1st International Workshop on Mining
Software Repositories Applications for Privacy and Security (Singapore,
Singapore) (MSR4P&S 2022). Association for Computing Machinery, New York, NY,
USA, 29–33. https://doi.org/10.1145/3549035.3561184
[24] Snyk. 2024. Snyk secures AI-generated code. Snyk.
https://snyk.io/solutions/secure-ai-generated-code/
[25] CWE Content Team. 2023. CWE-476: NULL Pointer Dereference.
https://cwe.mitre.org/data/definitions/476.html.
[26] CWE Content Team. 2023. CWE VIEW: Research Concepts.
https://cwe.mitre.org/data/definitions/1000.html
[27] Catherine Tony, Markus Mutas, Nicolás E. Díaz Ferreyra, and Riccardo
Scandariato. 2023. LLMSecEval: A Dataset of Natural Language Prompts for Security
Evaluations. In 2023 IEEE/ACM 20th International Conference on Mining Software
Repositories (MSR). 588–592. https://doi.org/10.1109/MSR59073.2023.00084
[28] Shubham Ugare, Tarun Suresh, Hangoo Kang, Sasa Misailovic, and Gagandeep
Singh. 2024. Improving LLM Code Generation with Grammar Augmentation.
arXiv:2403.01632 [cs.LG]
[29] Giorgos Vernikos, Arthur Bražinskas, Jakub Adamek, Jonathan Mallinson,
Aliaksei Severyn, and Eric Malmi. 2023. Small language models improve giants by
rewriting their outputs. arXiv preprint arXiv:2305.13514 (2023).
[30] Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora,
Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming
Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, and
Preslav Nakov. 2023. Factcheck-GPT: End-to-End Fine-Grained Document-Level
Fact-Checking and Correction of LLM Output. arXiv:2311.09000 [cs.CL]
[31] Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry
Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C. Schmidt. 2023.
A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT.
arXiv:2302.11382 [cs.SE]
[32] Burak Yetistiren, Isik Ozsoy, and Eray Tuzun. 2022. Assessing the quality
of GitHub copilot’s code generation. In Proceedings of the 18th International
Conference on Predictive Models and Data Analytics in Software Engineering
(Singapore, Singapore) (PROMISE 2022). Association for Computing Machinery,
New York, NY, USA, 62–71. https://doi.org/10.1145/3558489.3559072
[33] Zheng Zhang, Chen Zheng, Da Tang, Ke Sun, Yukun Ma, Yingtong Bu, Xun
Zhou, and Liang Zhao. 2023. Balancing specialized and general skills in llms:
The impact of modern tuning and data strategy. arXiv preprint arXiv:2310.04945
(2023).