arxiv

Paper 2403.12671 (arXiv:2403.12671v1 [cs.CR])

Enhancing Security of AI-Based Code Synthesis with GitHub Copilot via Cheap and Efficient Prompt-Engineering

Authors: Jakub Res, Ivan Homoliak, Martin Perešíni, Aleš Smrčka, Kamil Malinka, Petr Hanacek

Published: 2024-03-19

Abstract:

AI assistants for coding are on the rise. However, one of the reasons developers and companies avoid harnessing their full potential is the questionable security of the generated code. This paper first reviews the current state of the art and identifies areas for improvement on this issue. Then, we propose a systematic approach based on prompt-altering methods to achieve better code security of (even proprietary black-box) AI-based code generators such as GitHub Copilot, while minimizing the complexity of the application from the user's point of view, the computational resources, and operational costs. In sum, we propose and evaluate three prompt-altering methods: (1) scenario-specific, (2) iterative, and (3) general clause, and we discuss their combination. Contrary to the audit of code security, the latter two of the proposed methods require no expert knowledge from the user. We assess the effectiveness of the proposed methods on GitHub Copilot using the OpenVPN project in realistic scenarios, and we demonstrate that the proposed methods reduce the number of insecure generated code samples by up to 16% and increase the number of secure code samples by up to 8%. Since our approach does not require access to the internals of the AI models, it can in general be applied to any AI-based code synthesizer, not only GitHub Copilot.

Paper Content:

Jakub Res (iresj@fit.vut.cz), Ivan Homoliak (ihomoliak@fit.vut.cz), Martin Perešíni (iperesini@fit.vut.cz), Aleš Smrčka (smrcka@fit.vut.cz), Kamil Malinka (malinka@fit.vut.cz), Petr Hanacek (hanacek@fit.vut.cz)
Brno University of Technology, Faculty of Information Technology, Czech Republic

1 INTRODUCTION

With the release of ChatGPT [1], public attention shifted towards AI assistant tools. These assistants are proficient in many areas, including software engineering and coding. The advent of AI coding assistants marks a transition from intelligent code-completion tools to code-generating tools. Although these AI assistants are far from perfect at solving coding problems, a recent model, AlphaCode 2, proposed by DeepMind, scored better than over 85% of human competitors [9]. According to Liang et al. [11], in a survey with responses from 410 GitHub users, 70% of respondents who had experience with GitHub Copilot use it at least once a month, while 46% use the AI assistant daily. The most frequent reasons for developers using AI assistants were fewer keystrokes to write code and faster coding.

Due to the rapidly rising popularity of AI assistants, researchers started to focus on studying the quality of the synthesized code and ways of improving it (see Sec. 5.2). While observing validity or correctness, many studies overlook a crucial aspect of code: security.

In the motivating example, the AI assistant was tasked with generating a code snippet to fill a gap in the context of a C program. Its objective was to create a new instance of the structure "person" and assign a status value of zero to it. Although the AI assistant provided reasonable code (see Fig. 1), the snippet contains CWE-476 [25] (the malloc function could fail to allocate memory, thus resulting in a NULL pointer dereference).

    person *newPerson = (person *) malloc(sizeof(person));
    newPerson->status = 0;

Fig. 1: Example of a security issue generated by AI. The scenario comes from the dataset proposed in [17].
In this research, we aim to study various ways of improving the security of code generated by any proprietary Large Language Model (LLM), and we demonstrate our approach on the well-known GitHub Copilot [6].

There exist a few categories of techniques for improving the code synthesis of AI models, such as output optimization, model fine-tuning, and prompt engineering, and each of them has some pros and cons. In this work, we focus on efficiency, generality, and low costs, and therefore prompt engineering is the most suitable technique for us. While the literature on prompt engineering is mostly general [14][31][5][4], we are more specific and determine four approaches to it, which we further investigate: (1) scenario-specific information and warning providing, (2) iterative security-specific prompting, (3) general alignment shifting using an inception prompt (i.e., general clause), and (4) a cooperative agents system. In particular, we experiment with the former three approaches, which are orthogonal in their principles.

Contributions. The contributions of our paper are as follows:

(1) We reviewed the literature and identified three different areas of code synthesis improvements of LLMs, involving optimizing the output, model fine-tuning, and prompt optimizations.
(2) With the focus on generality, speed, and low costs, we aimed at the prompt engineering area, and we proposed a systematic approach to enhancing the security of the generated code with three methods and their combinations.
(3) We evaluated the efficiency of the proposed methods for prompt alteration on the real-world project OpenVPN, and we managed to increase the ratio of secure generated code by up to 8% and decrease the ratio of insecure generated code by up to 16%.

Organization. In Sec. 2 we define the important terms for our paper and set a design space. In Sec. 3 we describe the proposed methods of prompt improvement. In Sec. 4 we describe the design of the experiment, methodology, dataset, and assessment of security with measured results. We refer to the related work in Sec. 5. We discuss the limitations and areas for future research in Sec. 6. In Sec. 7 we conclude our work.

2 BACKGROUND AND DESIGN SPACE

Prompt. The prompt, in the context of this work, refers to the tuple: (1) a task that contains a function declaration and its description, (2) the code of the context, and (3) the user-specified code commentary related to security.

Improvements of Code Synthesis. In general, the literature contains three main areas of possible improvements to the LLM code-generating abilities (see Fig. 2):

(1) Output optimizing – The first and most intuitive approach is to post-process the output. Once the LLM responds with a result, the obtained code is analyzed for the presence of security issues. Although output correction is addressed by many works [28][30][29], very little attention is given to code security. There may be multiple implementations of output correction systems, either by designing another model trained specifically for fixing security issues or by combining static analyzers with issue-repairing rules. Snyk [24] is an example of an existing commercial output optimizer focusing on code security.
(2) Model fine-tuning – Model fine-tuning allows developers to adapt the pre-trained language model to better fit a specific task [33]. It is the most preferable solution in terms of user experience since the user can directly interact with the improved model without any additional steps. However, this method requires full access to the model and imposes a high performance overhead for its re-training.
(3) Prompt optimizing – The last way to improve code security is to optimize the user input. As shown by previous works [17][32][13][8], the formulation of an input prompt could severely affect the resulting code security. Additionally, the results of Neil Perry et al. [18] indicate that it is possible to positively influence the generated code security by altering the prompt or asking the LLM iteratively. Apart from optimizing the input prompt (or directly the input sequence of tokens), the work of He and Vechev [7] presents an application of the concept of prefix tuning [10]. However, this concept is only applicable in cases of on-premise models since access to the internal hidden state of models is needed.

[Figure 2: code synthesis pipeline (Input (Prompt), Model, Output) annotated with the improvements (3) prompt optimizing, (1) output optimizing, and (2) model fine-tuning.]
Fig. 2: Potential improvements of code synthesis.

    person *newPerson = NULL;
    newPerson = (person *) malloc(sizeof(person));
    if (!newPerson) {
        printf("Error: Failed to allocate memory for person");
        return EXIT_FAILURE;
    }
    newPerson->status = 0;

Fig. 3: Preliminary results of prompt enhancing.

2.1 Design Space

Although model fine-tuning might achieve promising results, it has several cons: it requires access to the full model of often proprietary architectures, it is expensive in terms of computation resources, and it needs high-quality new data to train the model (which is difficult to collect/obtain). Output optimizing requires neither access to the architecture of the model nor expert knowledge, but it has many cons related to static analysis of the code (i.e., high false negative/positive rates or inability to analyze incomplete code). On the other hand, prompt optimizing is fast and requires almost no computational resources (other than re-running the LLM); however, it might require certain expert knowledge in some cases.

In our research, we emphasized low performance overhead, low costs, generality, and availability. Therefore, we focus on prompt optimization techniques as a way of improving the security of AI-generated code.
Preliminary results of the prompt engineering techniques proposed in our research were obtained on the same task as in the introduction (see Fig. 1), but with an additional prompt specification to focus on proper security practices (see Fig. 3): the generated code does not contain the weakness CWE-476.

While the literature on prompt engineering techniques is mostly general [14][31][5][4], we aim to be more specific and determine four approaches to it, which we further detail in Sec. 3: (1) scenario-specific information and warning providing, (2) iterative security-specific prompting, (3) general alignment shifting using an inception prompt [8], and (4) a cooperative agents system [19].

    void string_null_terminate(char *str, int len, int capacity) {}

Listing 1: Original prompt

    // Be careful about the buffer overflow, underflow and null dereference
    void string_null_terminate(char *str, int len, int capacity) {}

Listing 2: Altered prompt

Fig. 4: Example of input prompt alteration.

3 PROPOSED APPROACH

In this section, we aim to explore the potential of three of the methods determined in Sec. 2.1: the scenario-specific, the iterative, and the general alignment shifting (further referred to as the general clause). The last determined approach (i.e., cooperating agents) combines all of the other methods and is thus dependent on them. We consider it a dedicated branch of research; therefore, we do not deal with it in the context of this work. In the following, we describe the particular approaches in detail.

3.1 Scenario-Specific

The first method aims to provide specific information about the local context to the AI assistant. The prompt thus provides not only requirements for the correct functionality of the generated code, but also requirements for specific security-related characteristics.
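To make the warnings in Listing 2 concrete, the following is a hand-written sketch of a body that addresses each of them. It is neither output from GitHub Copilot nor the OpenVPN implementation. The name string_null_terminate_checked and the int return value are assumptions introduced for illustration (the prompt declares the function as void), as is the policy of rejecting invalid arguments rather than aborting.

```c
#include <stddef.h>

/*
 * Hypothetical variant of string_null_terminate() from Listing 2.
 * Returns 0 on success and -1 when the arguments are rejected.
 * The return code is an illustrative assumption.
 */
int string_null_terminate_checked(char *str, int len, int capacity)
{
    /* NULL dereference: refuse a missing buffer. */
    if (str == NULL) {
        return -1;
    }
    /* Underflow: a negative length or a buffer with no room for '\0'. */
    if (len < 0 || capacity <= 0) {
        return -1;
    }
    /* Overflow: the claimed length must fit into the buffer. */
    if (len > capacity) {
        return -1;
    }

    if (len < capacity) {
        str[len] = '\0';
    } else {
        /* len == capacity: terminate inside the buffer (off-by-one edge case). */
        str[capacity - 1] = '\0';
    }
    return 0;
}
```

Each check maps to one phrase of the altered prompt: the NULL test to "null dereference", the sign tests to "underflow", and the bound test together with the len == capacity branch to "buffer overflow".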
The whole idea lies in enumerating possible issues based on the developer's experience. As a part of the prompt, numerous warnings and additional information are provided to the AI assistant according to the expected functionality and possible security issues regarding the parameters coming to a particular block of code.

The main downside of this method is the expert knowledge requirement. To successfully apply this approach, users are expected to have at least a basic awareness of secure programming and the potential risks posed by incorrectly used programming structures. On the other hand, many prompt alterations can be automatically proposed to the user based on the context and data types, which mitigates the expert knowledge requirements on the user. The example in Fig. 4 depicts the alteration of a single prompt for the AI assistant using the proposed method.

3.2 Iterative

The second method applies a naive repeated process to prompt alteration by modifying the commentary of the previously generated code sample (which is part of the context for the current iteration). It communicates with the AI assistant iteratively, with each iteration incorporating the previous output while adding information or a warning. The most important part of this approach is the proper selection of the sequence of additional information passed to the LLM in every round. This method is agnostic to the task and its code context. The list of commentaries that is iteratively applied should be general, and therefore cover a wide range of security weaknesses and issues. Thanks to that, the user does not require expert knowledge and can be provided with higher security-level suggestions.

(1) Fix the CWE 284 - Improper Access Control
(2) Fix the CWE 435 - Improper Interaction Between Multiple Correctly-Behaving Entities
(3) Fix the CWE 664 - Improper Control of a Resource Through its Lifetime
(4) Fix the CWE 682 - Incorrect Calculation
(5) Fix the CWE 691 - Insufficient Control Flow Management
(6) Fix the CWE 693 - Protection Mechanism Failure
(7) Fix the CWE 697 - Incorrect Comparison
(8) Fix the CWE 703 - Improper Check or Handling of Exceptional Conditions
(9) Fix the CWE 707 - Improper Neutralization
(10) Fix the CWE 710 - Improper Adherence to Coding Standards

Fig. 5: Rule set for the iterative method.

For evaluation purposes, we opt to implement MITRE's Research Concepts view [26] into the rule set, as seen in Fig. 5. This view consists of ten abstract classes, each covering a family of security weaknesses. Together, the classes are designed to contain all CWEs.

The iterative method inherently comes with a few advantages, such as almost no requirements for security knowledge on the side of the user, ease of automatic implementation, and applicability to a large scope of models. However, the disadvantages, such as the negative influence of an improperly designed rule set or the computational time required for multiple iterations, may outweigh the positive properties when the application of this method is considered in the future.

3.3 General Alignment Shifting

The main idea of this approach (a.k.a. general clause) is inspired by the work of Li et al. [8]. The authors proposed the term inception prompt, which describes an alignment-setting prompt at the start of a conversation with a model.
We use a different term (i.e., general clause) than the original authors since our approach principally differs by not following the conversation pattern. The inception prompt is an input message that comes before any of the user sentences. In our approach, by contrast, the general clause is a part of the user prompt (within the context).

    /**
     * Never forget you are a security specialist helping with coding tasks.
     * Your job is to generate code that is secure and safe.
     * Never forget that.
     * You are not a programmer, you are a security specialist.
     * You will never generate code that contains security vulnerabilities.
     * You will always look for CWEs in the code you would generate and fix them.
     **/

Fig. 6: General clause used for experiments.

The main advantage of this method is its simplicity and ease of implementation. A single well-crafted commentary addition to the header of the file could improve the security of the generated code in this particular file.

On the other hand, there may be major issues with the performance of the clause method. For example, the LLM may filter out the general clause as irrelevant (depending on the decision of the model). Another significant limitation of this approach is the clause itself: it needs to be precisely curated to have an impact on the decision process of the LLM. Like the previous method, the general clause method imposes little to no expert knowledge requirements on the users.

4 EXPERIMENTS

In this section, we describe the experiment design (see Fig. 7). First, we chose the open-source project OpenVPN instead of a conventional dataset because it reflects the real conditions for operating GitHub Copilot (i.e., providing the tasks with context) and thus produces results with higher impact. We use GitHub Copilot to consecutively synthesize the five best solutions for each selected task to set a baseline.
Then, we enhance the context and task by adding security-related commentary according to the proposed methods. After that, we repeat the synthesis step, resulting in 100 solutions in total (25 for the baseline and 25 per enhancement method). At the end, we describe the process of assessing the security of the synthesized code and the measured results.

4.1 Methodology

Although many models and datasets are available, this paper focuses solely on proving the concept of systematic prompt altering to achieve better code security. Thus, for the experimental part of this work, we use the most popular AI code generator today [11], GitHub Copilot [6]. Throughout the experiments, the parameters of the GitHub Copilot model were kept at their defaults. For an untainted environment, a container with a preinstalled GitHub Copilot extension for the Vim editor was set up and reinitialized after each experiment run. The whole process of the experiments is depicted in Fig. 7.

As stated before, the study aims to evaluate the effectiveness of the suggested methods on an open-source project instead of well-known datasets for synthesized code evaluation. Using the open-source project code base (see Sec. 4.2), we selected five tasks and altered them according to the methods presented earlier. Each of the methods is applied differently:

(1) The scenario method – the added information is inserted inside the curly brackets of the observed function.
(2) The iterative method – each iteration is forwarded to the upcoming round as commented-out code with additional information following the rule set (see Fig. 5).
(3) The general clause method – the clause is inserted right after the original file header comment at the start of each source file.

[Figure 7: diagram with the blocks Dataset (open-source project), Scenario, Iterative, General Clause, LLM, LLM results cache, and Security assessment, placed inside a virtual renewable environment.]
Fig. 7: Experiment design scheme.

Unaltered prompts, consisting only of task and context, were used as a baseline for the final comparison.
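The iterative alteration in item (2) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: the function name build_iterative_prompt, the buffer-based interface, and the exact prompt layout (previous output as // comments, then one rule from Fig. 5, then the task declaration) are hypothetical. The call to the code generator itself is omitted.

```c
#include <stddef.h>
#include <string.h>

/* The ten rules of Fig. 5, applied one per iteration. */
static const char *const RULES[] = {
    "Fix the CWE 284 - Improper Access Control",
    "Fix the CWE 435 - Improper Interaction Between Multiple Correctly-Behaving Entities",
    "Fix the CWE 664 - Improper Control of a Resource Through its Lifetime",
    "Fix the CWE 682 - Incorrect Calculation",
    "Fix the CWE 691 - Insufficient Control Flow Management",
    "Fix the CWE 693 - Protection Mechanism Failure",
    "Fix the CWE 697 - Incorrect Comparison",
    "Fix the CWE 703 - Improper Check or Handling of Exceptional Conditions",
    "Fix the CWE 707 - Improper Neutralization",
    "Fix the CWE 710 - Improper Adherence to Coding Standards",
};
#define RULE_COUNT (sizeof RULES / sizeof RULES[0])

/* Append n bytes of s to out, always leaving room for the terminator. */
static int append(char *out, size_t cap, size_t *used, const char *s, size_t n)
{
    if (n >= cap - *used) {
        return -1; /* would not fit */
    }
    memcpy(out + *used, s, n);
    *used += n;
    out[*used] = '\0';
    return 0;
}

/*
 * Build the prompt for the next round (hypothetical layout):
 * previous solution commented out, one rule as a comment, then the task.
 * Returns 0 on success, -1 on invalid arguments or insufficient capacity.
 */
int build_iterative_prompt(const char *prev_code, const char *rule,
                           const char *task, char *out, size_t cap)
{
    size_t used = 0;
    const char *line = prev_code;

    if (!prev_code || !rule || !task || !out || cap == 0) {
        return -1;
    }
    out[0] = '\0';

    /* Comment out the previous output line by line. */
    while (*line != '\0') {
        size_t n = strcspn(line, "\n");
        if (append(out, cap, &used, "// ", 3) != 0 ||
            append(out, cap, &used, line, n) != 0 ||
            append(out, cap, &used, "\n", 1) != 0) {
            return -1;
        }
        line += n;
        if (*line == '\n') {
            line++;
        }
    }

    /* Add the security rule as commentary, followed by the task itself. */
    if (append(out, cap, &used, "// ", 3) != 0 ||
        append(out, cap, &used, rule, strlen(rule)) != 0 ||
        append(out, cap, &used, "\n", 1) != 0 ||
        append(out, cap, &used, task, strlen(task)) != 0 ||
        append(out, cap, &used, "\n", 1) != 0) {
        return -1;
    }
    return 0;
}
```

A driver would loop over RULES, send each built prompt to the code generator, and feed the returned solution back in as prev_code for the next round. Because the rule set is fixed and independent of the task, no security expertise is needed to run the loop.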
To capture divergence in common results, we consecutively synthesized the five best solutions for every prompt to provide higher statistical significance. (Note that GitHub Copilot synthesizes ten solutions for each prompt, and we always considered only the best one. On the other hand, other synthesized options may contain more secure code.)

4.2 Dataset

To test the proposed methods of prompt alteration in realistic conditions, we opted for a custom experiment using an active open-source project instead of using a conventional dataset (such as HumanEval [3], MBXP [2], SecurityEval [23], or LLMSecEval [27]). We will release our dataset upon publication, including the setup of our experiment, to enable reproducibility of the research.

There are multiple limitations of existing datasets for AI-based code synthesis. Most of the existing datasets are not focused on security evaluation but rather on the ability to synthesize functional code. On the other hand, the existing security-related datasets consist of example scenarios of various CWEs without context, and they were either gathered online or crafted by the authors. The CWE datasets are more suitable for evaluating the security of synthesized code; however, all the samples included in them are short and thus lack context.

OpenVPN Project. To reflect the reality of using a programming AI assistant, we chose the OpenVPN project (https://github.com/OpenVPN/openvpn). It was selected due to its active development, well-documented source code, and primary programming language, C, which is prone to security issues. The following functions from the OpenVPN project were selected as tasks for the experiment. Each function was selected with regard to possible security issues:

(1) string_null_terminate() – possibly vulnerable to buffer overflow/underflow and NULL dereference. (/src/openvpn/buffer.c)

    void string_null_terminate(char *str, int len, int capacity) {}

(2) buffer_write_file() – possibly vulnerable to incorrect file handle management and unknown custom data structure issues. (/src/openvpn/buffer.c)

    bool buffer_write_file(const char *filename, const struct buffer *buf) {}

(3) buf_catrunc() – possibly vulnerable to out-of-memory write, unknown custom data structure issues, and NULL dereference. (/src/openvpn/buffer.c)

    void buf_catrunc(struct buffer *buf, const char *str) {}

(4) buf_prepend() – possibly vulnerable to buffer overflow/underflow and integer overflow/underflow. (/src/openvpn/buffer.h)

    static inline uint8_t *buf_prepend(struct buffer *buf, int size) {}

(5) argv_reset() – possibly vulnerable to improper index validation and memory clearing. (/src/openvpn/argv.c)

    static void argv_reset(struct argv *a) {}

In accordance with the expected implementation issues, the following scenario method prompts were prepared; they are enumerated in Fig. 8 in the same order as the functions above.

    // Be careful about buffer overflow / underflow
    // Be careful about properly terminating string
    // Be careful about NULL dereference
    // Be careful about proper handling of file descr.
    // Be careful about NULL dereference
    // Be careful about buffer overflow / underflow
    // Be careful about NULL dereference
    // Be careful about integer overflow / underflow
    // Be careful about buffer overflow / underflow
    // Be careful about NULL dereference
    // Be careful about proper index validation
    // Be careful about proper memory clearing

Fig. 8: Scenario-based prompts related to selected functions.

4.3 Assessment of Code Security

Assessing the security of code samples presents many challenges.
Unlike aspects such as functionality or correctness, which can be measured through compilation/interpretation or metrics like CodeBLEU [20] (a metric that combines n-gram comparison, syntax tree analysis, and semantic checks), security evaluation requires a different approach. However, no such practice has been established for analyzing the security of generated code. In general, there are two approaches to the assessment of code security, both in the form of automatic and manual evaluation:

• Static analysis: analysis of the source code. This process does not require program execution. There are many automatic tools for static analysis [16].
• Dynamic analysis: analysis of the executed program traces. The most effective technique in analyzing security is fuzz testing [15]. This approach is typically used in cases where one needs to find weaknesses originating from complex program logic.

In our research, we chose not to use an auxiliary static analysis tool due to a high rate of false negatives. Instead, we opted for manual code inspection, given the relatively small size of the sample set.
For the sake of reproducibility, we classify the generated snippets of code into one of the following classes according to the respective code properties:

• Secure: The generated sample is considered secure if all crucial parameter-checking conditions are present in any form and, additionally, a task-specific set of functional requirements is met, such as:
    (1) the proper null byte placement in edge cases (i.e., the off-by-one error);
    (2) the correct verification of operations on the file descriptors (e.g., the inspection of return codes of file-operating functions);
    (3) the correct size of memory transfer (e.g., memcpy, memmove, bcopy functions);
    (4) the correct addition to offset with respect to the total length of the buffer and the correct copy of the whole string into the buffer (including the null byte);
    (5) proper memory buffer clearance and counter resetting to prevent out-of-bounds read vulnerabilities.
• Partially secure: The generated sample is considered partially secure if any of the crucial parameter-checking conditions are present in any form.
• Insecure: The generated sample is considered insecure if none of the crucial parameter-checking conditions are present in any form.

We present the results of our experiments in Tab. 1. For each of the proposed methods and for the baseline (i.e., the tasks without any additions in the form of code commentary to the prompt), it shows the number of synthesized samples with a particular security level in the first column and the corresponding percentage in the second. The results indicate that the baseline (generated without any additional prompt alteration) contains fewer security-checking conditions and thus is less secure in security-sensitive cases. On the other hand, the tasks generated using the additional code commentary for the prompt alteration contained at least some security-checking conditions and thus were more secure in security-sensitive cases.
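The three classes above, together with the count and percentage pairs reported in Tab. 1, can be sketched as two small functions. This is an illustrative reading of the criteria, not the authors' procedure (the inspection in the paper was manual). The names classify_sample and percentage are hypothetical, and treating a sample that has every check but fails a functional requirement as partially secure is an assumption, since the text does not spell out that case.

```c
/* Security classes used for the manual assessment. */
enum security_level { INSECURE = 0, PARTIALLY_SECURE = 1, SECURE = 2 };

/*
 * checks_present:  number of crucial parameter-checking conditions found
 * checks_required: number of crucial conditions expected for the task
 * functional_ok:   nonzero if the task-specific functional requirements hold
 */
enum security_level classify_sample(int checks_present, int checks_required,
                                    int functional_ok)
{
    if (checks_present <= 0) {
        return INSECURE;          /* none of the conditions are present */
    }
    if (checks_present >= checks_required && functional_ok) {
        return SECURE;            /* all conditions and requirements met */
    }
    return PARTIALLY_SECURE;      /* at least one condition is present */
}

/*
 * Share of `count` samples among `total`, in percent.
 * Returns -1.0 for inconsistent input.
 */
double percentage(int count, int total)
{
    if (total <= 0 || count < 0 || count > total) {
        return -1.0;
    }
    return 100.0 * count / total;
}
```

With 25 samples per method, for example, the 7 insecure baseline samples in Tab. 1 correspond to percentage(7, 25), i.e., 28%.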
According to the results, the iterative method is the best-performing one for increasing the number of secure solutions synthesized while also reducing the number of insecure synthesized samples: the share of secure samples increased by 8 percentage points compared to the baseline (from 40% to 48%), while the share of insecure samples decreased by 12 percentage points (from 28% to 16%). Nevertheless, the best method for reducing the number of insecure solutions was the scenario-specific method, which decreased the share of insecure samples by 16 percentage points (from 28% to 12%).

Security level     | Baseline  | Scenario  | Iterative | Clause
Secure             | 10   40%  | 10   40%  | 12   48%  | 11   44%
Partially secure   |  8   32%  | 12   48%  |  9   36%  |  9   36%
Insecure           |  7   28%  |  3   12%  |  4   16%  |  5   20%

Tab. 1: Results aggregated over all of the tasks.

5 RELATED WORK

Currently, the research community on large language models is primarily focused on pushing the boundaries of AI capabilities by achieving better performance on various tasks with larger and more powerful models or by achieving similar results to their competitors with ever smaller models. However, the most recognized benchmark tasks are not even marginally focused on observing code security. Some studies try to address this by creating their own security-focused scenarios and evaluating synthesized code security using them, which we further review in Sec. 5.1.

5.1 Security

According to the most recent study on the empirical evaluation of the average security of synthesized code, AI generates potentially insecure code in approximately 40% of cases [17]. Besides testing the security, the authors also studied the influence of misspellings in the prompt. An interesting finding is that altering the prompt in a specific way may positively influence the generated output [17].

Sandoval et al. [21] approached the problem from the user's perspective and observed the impact of using an AI coding assistant on the security of C language code.
In a scenario where developers were divided into two groups (with and without an AI assistant), the developers coded various functions to operate on a structure in the C language. The results suggest that, while using an AI assistant, users produce only up to 10% more security issues.

Siddiq et al. [22] have focused on examining the source of code issues. Their work explored the propagation of code smells from the training dataset to the model and subsequently to the outputs. The results show not only that code issues, including security code smells, do indeed propagate from the training data to the output, but that this also happens for the most commercially used service, GitHub Copilot.

Despite pointing out the security problem of AI-generated code, none of the research works systematically focused on any means of improvement. However, many papers have already made significant improvements to other aspects of synthesized code (be it new models, or papers and guides focusing on better prompt formulation [14] or AI cooperation [8]).

5.2 Code Quality

Burak Yetistiren et al. [32] evaluate the ability of GitHub Copilot to generate correct, valid, and efficient code in three scenarios according to the input prompt: function name and docstring, function name only, and dummy function name with docstring description. The results indicate that meaningful function names and their descriptions using docstrings (i.e., better prompt formulation) lead to better results than the other two alternatives.

Antonio Mastropaolo et al. [13] state that code quality is affected not only by the semantics but also by the syntax of the input prompt. Their study showed that the results of the quality analysis differ for the set of prompts before and after paraphrasing in about 70% of the cases.

Improving the capabilities of AI by implementing multiple communicative agents has recently emerged as a promising direction.
Using multiple agents with specific roles within a problem-solving process can produce significantly better results [8].

6 DISCUSSION

Although the achieved results demonstrate an improvement in the security of AI-synthesized code, there are several limitations to our approach. We consider these limitations areas of future research rather than threats to validity.

6.1 Limitations

The limitations of our research are as follows.

Prompts. The prompt additions (in our case) are always a trade-off between over-specification and over-generalization. We are aware that our prompt enhancements could still be improved. However, the current state is sufficient for a proof-of-concept of our methods.

Dataset. In this paper, we chose five cases from one open-source project written in C, which we recognize as highly impactful in terms of security. This limits the research to a single point of view, potentially missing important details resulting either from limited intra-class variability (i.e., inappropriate selection of cases) or from limited inter-class variability (a single code base or programming language). In future work, we plan to experiment with more open-source projects and different programming languages.

Environment primarily refers to the client of the AI code generator and the AI generator itself. In our case, the limitation concerns the context sent by the local client and the GitHub Copilot caching system, which creates dependencies between test runs (i.e., local vs. remote caching of synthesized options); we could control only the local caching system.

Code Synthesizers. In this work, we utilized only GitHub Copilot as a proof-of-concept. However, we argue that since our approach is general, it can be utilized with any other code synthesizer (i.e., open source or proprietary), which will be the subject of our future research.

Automated Inspection of Secure Code is a problem in general.
Even though there are numerous tools for static and dynamic automatic code inspection, most of them exhibit an excessive number of false negatives and/or false positives. Therefore, we utilized manual inspection in our work, which was more accurate but might be expensive for larger experiments. This might pose a threat to the reproducibility of the results with larger datasets in the future.

Potential Improvements of AI-Based Code Synthesis. In this work, we focused only on prompt alteration methods since they can be used even on proprietary models. However, interesting potential also lies in model fine-tuning methods [12] and their combinations. Fine-tuning, however, can be applied to white-box models only. We plan to investigate this area in our future work.

7 CONCLUSION

AI code generators have proven to be a powerful tool, but they must be used correctly to fully utilize their potential. Our research has shown how to systematically tackle the problem of code security while communicating with such AI services. Our results indicate that the methods proposed in this paper can enhance the security of generated code. The results also indicate that the performance in terms of code security can be enhanced even for proprietary models, where end users cannot access or modify the underlying architecture or the model itself. This paper lays a foundation for our future in-depth research on intelligent prompt-enhancing systems, which we intend to evaluate on multiple AI-based code synthesizers and various open-source projects.

ACKNOWLEDGMENTS

This work was supported by the Brno University of Technology internal project FIT-S-23-8151.

REFERENCES

[1] Open AI. 2022. Introducing ChatGPT. Open AI.
https://openai.com/blog/chatgpt
[2] Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, and Bing Xiang. 2023. Multi-lingual Evaluation of Code Generation Models. arXiv:2210.14868 [cs.LG]
[3] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]
[4] Paul Denny, Viraj Kumar, and Nasser Giacaman. 2023. Conversing with Copilot: Exploring Prompt Engineering for Solving CS1 Problems Using Natural Language. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1 (Toronto ON, Canada) (SIGCSE 2023). Association for Computing Machinery, New York, NY, USA, 1136–1142. https://doi.org/10.1145/3545945.3569823
[5] L. Giray. 2023. Prompt Engineering with ChatGPT: A Guide for Academic Writers. Annals of Biomedical Engineering 51 (2023). https://doi.org/10.1007/s10439-023-03272-4
[6] GitHub. 2024. GitHub Copilot. https://github.com/features/copilot
[7] Jingxuan He and Martin Vechev. 2023. Large Language Models for Code: Security Hardening and Adversarial Testing. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS ’23). ACM. https://doi.org/10.1145/3576915.3623175
[8] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. arXiv:2303.17760 [cs.AI]
[9] Shuang Li, Yi Xu, Anusha Krishna, Tianyu Chen, Taifu Wu, Peng Cao, and ... 2023. AlphaCode 2 Technical Report. https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf
[10] Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association for Computational Linguistics, Online, 4582–4597. https://doi.org/10.18653/v1/2021.acl-long.353
[11] Jenny T. Liang, Chenyang Yang, and Brad A. Myers. 2023. A Large-Scale Survey on the Usability of AI Programming Assistants: Successes and Challenges. arXiv:2303.17125 [cs.SE]
[12] Xinyu Lin, Wenjie Wang, Yongqi Li, Shuo Yang, Fuli Feng, Yinwei Wei, and Tat-Seng Chua. 2024. Data-efficient Fine-tuning for LLM-based Recommendation. arXiv preprint arXiv:2401.17197 (2024).
[13] Antonio Mastropaolo, Luca Pascarella, Emanuela Guglielmi, Matteo Ciniselli, Simone Scalabrino, Rocco Oliveto, and Gabriele Bavota. 2023. On the Robustness of Code Generation Techniques: An Empirical Study on GitHub Copilot. arXiv:2302.00438 [cs.SE]
[14] OpenAI. 2023. Prompt Engineering. https://platform.openai.com/docs/guides/prompt-engineering
[15] OWASP. 2024. Fuzzing. https://owasp.org/www-community/Fuzzing
[16] OWASP. 2024. Source Code Analysis Tools. https://owasp.org/www-community/Source_Code_Analysis_Tools
[17] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2021. Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions. arXiv:2108.09293 [cs.CR]
[18] Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh. 2023. Do Users Write More Insecure Code with AI Assistants? In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS ’23). ACM. https://doi.org/10.1145/3576915.3623157
[19] Chen Qian, Xin Cong, Wei Liu, Cheng Yang, Weize Chen, Yusheng Su, Yufan Dang, Jiahao Li, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2023. Communicative Agents for Software Development. arXiv:2307.07924 [cs.SE]
[20] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv e-prints (2020), arXiv–2009.
[21] Gustavo Sandoval, Hammond Pearce, Teo Nys, Ramesh Karri, Siddharth Garg, and Brendan Dolan-Gavitt. 2023. Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants. arXiv:2208.09727 [cs.CR]
[22] Mohammed Latif Siddiq, Shafayat Majumder, Maisha Mim, Sourov Jajodia, and Joanna Cecilia da Silva Santos. 2022. An Empirical Study of Code Smells in Transformer-based Code Generation Techniques. 71–82. https://doi.org/10.1109/SCAM55253.2022.00014
[23] Mohammed Latif Siddiq and Joanna C. S. Santos. 2022. SecurityEval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques. In Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security (Singapore, Singapore) (MSR4P&S 2022). Association for Computing Machinery, New York, NY, USA, 29–33. https://doi.org/10.1145/3549035.3561184
[24] Snyk. 2024. Snyk secures AI-generated code. Snyk. https://snyk.io/solutions/secure-ai-generated-code/
[25] CWE Content Team. 2023. CWE-476: NULL Pointer Dereference. https://cwe.mitre.org/data/definitions/476.html
[26] CWE Content Team. 2023. CWE VIEW: Research Concepts. https://cwe.mitre.org/data/definitions/1000.html
[27] Catherine Tony, Markus Mutas, Nicolás E. Díaz Ferreyra, and Riccardo Scandariato. 2023. LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations. In 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR). 588–592. https://doi.org/10.1109/MSR59073.2023.00084
[28] Shubham Ugare, Tarun Suresh, Hangoo Kang, Sasa Misailovic, and Gagandeep Singh. 2024. Improving LLM Code Generation with Grammar Augmentation. arXiv:2403.01632 [cs.LG]
[29] Giorgos Vernikos, Arthur Bražinskas, Jakub Adamek, Jonathan Mallinson, Aliaksei Severyn, and Eric Malmi. 2023. Small language models improve giants by rewriting their outputs. arXiv preprint arXiv:2305.13514 (2023).
[30] Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, and Preslav Nakov. 2023. Factcheck-GPT: End-to-End Fine-Grained Document-Level Fact-Checking and Correction of LLM Output. arXiv:2311.09000 [cs.CL]
[31] Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C. Schmidt. 2023. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv:2302.11382 [cs.SE]
[32] Burak Yetistiren, Isik Ozsoy, and Eray Tuzun. 2022. Assessing the quality of GitHub copilot’s code generation. In Proceedings of the 18th International Conference on Predictive Models and Data Analytics in Software Engineering (Singapore, Singapore) (PROMISE 2022). Association for Computing Machinery, New York, NY, USA, 62–71. https://doi.org/10.1145/3558489.3559072
[33] Zheng Zhang, Chen Zheng, Da Tang, Ke Sun, Yukun Ma, Yingtong Bu, Xun Zhou, and Liang Zhao. 2023. Balancing specialized and general skills in llms: The impact of modern tuning and data strategy. arXiv preprint arXiv:2310.04945 (2023).
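
Supplementary note on Tab. 1. The changes quoted alongside Tab. 1 can be re-derived from the raw sample counts. The following Python sketch is not part of the original study; it only recomputes each method's shares and their differences from the baseline, expressed in percentage points. The names TABLE_1, shares, and change_vs_baseline are illustrative assumptions; only the counts themselves are taken from Tab. 1.

```python
# Sample counts per security level, copied from Tab. 1 (25 samples per method).
# The dictionary layout and all names below are illustrative, not the authors' tooling.
TABLE_1 = {
    "Baseline":  {"Secure": 10, "Partially secure": 8,  "Insecure": 7},
    "Scenario":  {"Secure": 10, "Partially secure": 12, "Insecure": 3},
    "Iterative": {"Secure": 12, "Partially secure": 9,  "Insecure": 4},
    "Clause":    {"Secure": 11, "Partially secure": 9,  "Insecure": 5},
}


def shares(counts):
    """Return each security level's share of the method's samples, in percent."""
    total = sum(counts.values())
    return {level: 100 * n / total for level, n in counts.items()}


def change_vs_baseline(table, baseline="Baseline"):
    """Return the percentage-point change of every method against the baseline."""
    base = shares(table[baseline])
    return {
        method: {
            level: round(share - base[level], 1)
            for level, share in shares(counts).items()
        }
        for method, counts in table.items()
        if method != baseline
    }


if __name__ == "__main__":
    for method, delta in change_vs_baseline(TABLE_1).items():
        print(method, delta)
```

For the iterative method this gives +8 points for secure and -12 points for insecure samples, and for the scenario-specific method -16 points for insecure samples, which matches the values stated in the text.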