

Paper 2502.07399

On Iterative Evaluation and Enhancement of Code Quality Using GPT-4o

Authors: Rundong Liu, Andre Frade, Amal Vaidya, Maxime Labonne, Marcus Kaiser, Bismayan Chakrabarti, Jonathan Budd, Sean Moran

Published: 2025-02-11

Abstract:

This paper introduces CodeQUEST, a novel framework leveraging Large Language Models (LLMs) to iteratively evaluate and enhance code quality across multiple dimensions, including readability, maintainability, efficiency, and security. The framework is divided into two main components: an Evaluator that assesses code quality across ten dimensions, providing both quantitative scores and qualitative summaries, and an Optimizer that iteratively improves the code based on the Evaluator's feedback. Our study demonstrates that CodeQUEST can effectively and robustly evaluate code quality, with its assessments aligning closely with established code quality metrics. Through a series of experiments using a curated dataset of Python and JavaScript examples, CodeQUEST demonstrated significant improvements in code quality, achieving a mean relative percentage improvement of 52.6%. The framework's evaluations were validated against a set of proxy metrics comprising Pylint Score, Radon Maintainability Index, and Bandit output logs, showing a meaningful correlation. This highlights the potential of LLMs in automating code quality evaluation and improvement processes, presenting a significant advancement toward enhancing software development practices. The code implementation of the framework is available at: https://github.com/jpmorganchase/CodeQuest.

Paper Content:
Affiliation: JPMorgan Chase. Rundong Liu and André Frade contributed equally to this work. Corresponding author: Sean Moran (sean.j.moran@jpmchase.com). arXiv:2502.07399v1 [cs.SE], 11 Feb 2025.

1 Introduction

The production of high-quality code remains a top priority for organizations developing software. High-quality code extends beyond syntactic or semantic correctness, encompassing attributes such as freedom from errors, readability, efficiency, portability, usability, testability, and maintainability [20].

Evaluating code quality is a complex process due to the subjective nature of the concept, which has led to the creation of various language-specific, non-comprehensive tools that each cover a narrow aspect of code quality [10]. For example, the Python ecosystem provides security linters for identifying security vulnerabilities, style linters for identifying deviations from style guides, type checkers, and error linters [36]. Moreover, subjective aspects like code design and readability are often guided by "best practices".

It is also important to understand that evaluation of code quality, while crucial, is just one part of ensuring production-ready code. This is typically achieved through extensive code reviews, where multiple developers examine and modify the scripts until they meet the required standards. While important, this process is resource-intensive and time-consuming, and it generally places additional burdens on the senior developers who are ultimately responsible for vetting final code versions. Automating this process could significantly boost code development productivity.

Recent advancements in Large Language Model (LLM) technologies have shown promise in coding-related tasks, including code generation and evaluation [1] [30] [17]. LLMs offer unique advantages for code evaluation due to their training on diverse programming languages and vast datasets, enabling them to first understand and then assess and improve code quality. Moreover, their ability to perform tasks based on prompts provides a flexible and highly configurable paradigm for defining complex tasks in natural language [13]. Studies are increasingly focusing on how prompt engineering and complexity affect the LLMs' ability to generate accurate, maintainable code.
For example, some research compares one-shot and iterative prompt strategies to understand how more detailed or structured prompts impact the functionality and readability of the generated code, showing that prompt complexity can significantly enhance the robustness of outputs [2]. Other studies assess GPT-4's performance against alternative models, highlighting notable improvements in debugging capabilities and responsiveness to follow-up prompts. Specifically, recent work investigates GPT's self-correction capacity when supplied with targeted feedback, such as error messages or static analysis results, highlighting the model's ability to refine code and address issues like smells or unnecessary complexity through iterative feedback [25] [16].

While LLMs' zero-shot capabilities have been widely exploited, recent studies [28] [3] [37] [38] have suggested that borrowing concepts from Reinforcement Learning could significantly improve the capability of LLM-based agents to perform reasoning and planning tasks. For instance, Zhang et al. [37] have drawn inspiration from the classical actor-critic method [14] to design a TripletCritic mechanism, which has significantly improved the success rate on multi-agent planning tasks.

[Figure 1: Schematic representation of CodeQUEST. Diagram labels: Input code; Evaluator assessment; Code quality improvement; Code validation; Evaluator assessment; Final code; Final Evaluator report; Optimizer.]

Despite their potential, using LLMs for such complex tasks also presents challenges. LLMs are not inherently designed for quantitative assessments, which may affect the reliability of their scores [8]. Moreover, the probabilistic nature of LLMs can lead to inconsistencies. Indeed, small modifications to the prompt or a change of the LLM random seed can potentially lead to vastly different results [15] or "hallucinations", which can be challenging to detect and control [18] [9].
Nonetheless, it is clear that LLMs hold immense potential to revolutionize code quality evaluation and improvement.

In this paper, we introduce CodeQUEST (Code Quality Understanding and Enhancement System Toolkit), an LLM-powered framework for assessing and improving code quality. CodeQUEST consists of an Evaluator and an Optimizer that work together through an iterative process to enhance the quality of code. The rest of the paper is organized as follows: Section 2 introduces the CodeQUEST framework and its two components. Section 3 details the experimental setup. Section 4 presents results and insights on improving code quality using LLMs. Section 5 outlines threats to validity, and Section 6 concludes the discussion.

2 The CodeQUEST Framework

The CodeQUEST framework consists of two components: an Evaluator and an Optimizer that leverage an LLM to assess and improve code quality, respectively. In this work, we used the most recently released GPT-4o from the GPT-4 model series [1], and code quality was defined across ten dimensions: Readability, Maintainability, Testability, Efficiency, Robustness, Security, Documentation, Modularity, Scalability, and Portability.

The Evaluator first generates a code quality score (see Footnote 1) and a text-based evaluation for each dimension; the dimension-wise evaluations are then used to produce an aggregated summary. The Optimizer is then responsible for improving the code quality over a fixed number of iterations (or until a target quality score is reached) based on the Evaluator's feedback. Each iteration involves three sequential stages: code quality improvement, code integrity validation, and code quality re-assessment. A detailed description of each stage is provided in the remainder of this section. A diagram of the end-to-end process can be found in Figure 1.
Footnote 1: Under the CodeQUEST framework, the code-level quality score is the average of the code quality scores across all dimensions.

2.1 Evaluator

As mentioned above, the Evaluator uses ten dimensions to assess code quality. Each dimension is addressed through a set of five questions or statements, carefully crafted to comprehensively cover its key aspects (cf. Appendix B for the full list). These questions and statements were designed to a) be general, and thus applicable to most programming languages, and b) not overlap in the aspects of the dimension they probe, hence avoiding over-representation of any particular aspect of the dimension. Furthermore, each question or statement is formulated such that it can only be answered with "True", "False" or "Not Applicable", where "True" reflects that high-quality characteristics are present in the code, in contrast to "False".

We address each of the code quality dimensions in a separate query to the LLM, where the corresponding set of five dedicated questions or statements is provided to the LLM for its consideration. Inspired by theoretical efforts [12] [32] on the impact of language ambiguity on LLM performance, we carefully crafted prompts enabling LLMs to apply their internal knowledge in a focused and unambiguous manner. Limiting the possible output to quantifiable answers allows us to map the model output to a numerical scale. The output includes the answers to each of the five dedicated questions or statements (which enable the quantitative assessment) and a high-level summary of the code in light of the dimension under consideration (which serves as a qualitative assessment).

Quantitative results are derived by assigning numerical values to each answer (+1 for "True", -1 for "False", and 0 for "Not Applicable"). For each dimension, we sum up the five responses to obtain a score ranging from -5 to 5, where higher scores represent higher code quality along the dimension axis.
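The scoring rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `ANSWER_VALUES`, `dimension_score`, and `code_level_score` are hypothetical, and the averaging step follows the footnote's definition of the code-level score.

```python
from statistics import mean

# Mapping of the three allowed answers to numerical values, as stated in the text.
ANSWER_VALUES = {"True": 1, "False": -1, "Not Applicable": 0}


def dimension_score(answers):
    """Sum the five answers of one dimension into a score in [-5, 5].

    `answers` is a list of five strings (hypothetical input format).
    """
    if len(answers) != 5:
        raise ValueError("each dimension is probed with exactly five statements")
    return sum(ANSWER_VALUES[answer] for answer in answers)


def code_level_score(dimension_scores):
    """Average the dimension scores (a dict: dimension name -> score)."""
    return mean(dimension_scores.values())


if __name__ == "__main__":
    readability = dimension_score(["True", "True", "False", "Not Applicable", "True"])
    security = dimension_score(["False", "False", "False", "False", "Not Applicable"])
    print(readability, security)
    print(code_level_score({"Readability": readability, "Security": security}))
```

In this sketch a dimension with three "True", one "False", and one "Not Applicable" answer scores 3 - 1 + 0 = 2, which shows how "Not Applicable" neither rewards nor penalizes the code.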
Finally, dimension-specific scores are averaged to produce a "code-level" score. We also ask GPT-4o to summarize the qualitative assessments across all ten dimensions, obtaining a code-level summary.

The prompt template, based on zero-shot Chain-of-Thought (CoT) [35], is provided below. It combines 1) the code to be evaluated; 2) the set of five questions or statements for a given quality dimension; 3) the task to be performed; and 4) the desired output format:

    --- System ---
    You are a helpful and harmless AI software engineer. You must provide an answer to the following request. Be brief and precise.

    --- Human ---
    ### CODE:
    ```
    {code}
    ```

    ### STATEMENTS:
    {dimension_statements}

    ### TASK:
    Think step by step to assess the veracity of each STATEMENT in light of the CODE provided. Your answer to each statement must come from one of the following:
    * -1 if the statement is false,
    * 1 if the statement is true,
    * 0 if the statement is not applicable or there is not enough evidence in the CODE to address it.
    You must also provide a short summary about the quality of the code from a {quality_dimension} perspective, justifying your answers across the various statements.

    ### OUTPUT:
    Return your answer in valid JSON as shown below:
    ```json
    {{
      "insight": <code quality summary:str>,
      "scores": [<score_to_statement1:int>, ...]
    }}
    ```

Note that due to the inherent non-deterministic nature of GPT-4o (even when a seed is specified and the temperature is set to zero) [21], we allow users to opt in to self-consistency reasoning [34] (see Footnote 2) to further reduce the variance of the code quality score.

2.2 Optimizer

In this section, we describe the Optimizer, which is tasked with improving the quality of code based on the feedback generated by the Evaluator. While qualitative results set a clear direction for code improvement, quantitative feedback is leveraged to ensure that progress occurs in a unidirectional manner.
The Optimizer operates in an iterative improvement cycle, where each iteration includes three steps: code quality improvement, code validation, and generated code evaluation.

2.2.1 Code Quality Improvement

To generate an improved version of the code, the original code snippet and its qualitative assessment results are fed to GPT-4o. The task involves addressing all the areas of improvement identified in the assessment feedback through code modifications (the following prompt shares the system message with the Evaluator's prompt):

    --- Human ---
    ### Code:
    ```
    {code}
    ```

    ### Quality Dimensions Feedback:
    {quality_insight}

    ### TASK:
    You are provided with a code script and detailed feedback for each quality dimension. For each quality dimension, you are provided with:
    * A score from -5 to 5. The higher the score, the better the quality.
    * Dimension insights, highlighting potential areas of improvement.
    Think step by step to complete the following:
    1) For each dimension, reflect on the score and insights.
    2) Condense a list of improvement points, so that the code would be evaluated at a higher score for each dimension.
    3) Improve the code script according to the improvement points, prioritizing dimensions with lower scores.
    4) Return:
    * the improvement points identified
    * the improved version of the code script
    * explanations for each of the changes you've made
    Note:
    * ALL improvement points MUST be addressed via meaningful changes to the code.

    ### OUTPUT:
    Your final output contains two parts:
    Return your answer in a valid JSON as shown below:
    ```json
    {{
      "improvement_points": List[str],
      "explanation_report": List[str]
    }}
    ```
    Then quote your code in the following section:
    ```improved_code
    {{improved_code_here}}
    ```

Footnote 2: The CodeQUEST framework allows the user to apply self-consistency when evaluating each code quality dimension. The dimensional score is obtained by taking the average across retries, and the dimensional evaluation is obtained by asking GPT-4o to summarize the text evaluations across the different retries.

2.2.2 Code Validation

The code validation stage ensures that the LLM-modified code can be compiled, as a minimal requirement for the code to be deemed valid and the process to proceed. Additionally, test cases built for the original code can optionally be executed against all improved code versions to verify the intended functionality and expected behavior. Such validation checks constitute a first step towards the meaningful evolution of the code. Note that the failure of either check leads to a rejection of the generated code as a candidate for the next iteration. A failed attempt does, however, still count towards the total number of iterations. Self-reflection and correction [28] based on the compiler or execution outputs were left as an interesting direction for future work.

2.2.3 Evaluator Assessment

A successful code validation stage is followed by a new Evaluator assessment. At this stage, if the overall quality score of the new code version drops relative to that of the previous version, no overall improvement was achieved. Under these circumstances, a new iteration is triggered, reusing the inputs of the previous cycle for a new code improvement attempt. Again, note that an unsuccessful improvement attempt still counts as an iteration. Otherwise, if the overall quality score of the new code increases relative to the last successful version, the iteration is deemed successful. In this case, the quality references are updated: the new overall quality score becomes the new numerical baseline for improvement, and the qualitative feedback is used as input to the code quality improvement of the next cycle.
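The three stages above (improvement, validation, re-assessment) can be summarized as a control loop. The sketch below is a hedged illustration rather than the authors' code: `optimize`, `evaluate`, `improve`, and `validate` are hypothetical names, the two LLM calls are passed in as plain callables, and treating an unchanged score as "no improvement" is an assumption, since the text only specifies the drop and increase cases. The `compiles` helper uses Python's built-in `compile` as one possible minimal validity check for Python sources.

```python
from typing import Callable, Tuple

Assessment = Tuple[float, str]  # (overall quality score, qualitative feedback)


def compiles(source: str) -> bool:
    """Minimal validity check for Python code: does the source compile?"""
    try:
        compile(source, "<candidate>", "exec")
    except (SyntaxError, ValueError):
        return False
    return True


def optimize(
    code: str,
    evaluate: Callable[[str], Assessment],   # hypothetical Evaluator call
    improve: Callable[[str, str], str],      # hypothetical LLM improvement call
    validate: Callable[[str], bool] = compiles,
    max_iterations: int = 5,
    target_score: float = 5.0,
) -> Tuple[str, float]:
    """Iteratively improve `code`, keeping only validated, higher-scoring versions."""
    best_code = code
    best_score, best_feedback = evaluate(code)
    for _ in range(max_iterations):
        if best_score >= target_score:
            break  # target quality reached
        candidate = improve(best_code, best_feedback)
        if not validate(candidate):
            continue  # rejected candidate; the attempt still counts as an iteration
        score, feedback = evaluate(candidate)
        if score > best_score:  # assumption: ties are treated as no improvement
            best_code, best_score, best_feedback = candidate, score, feedback
        # otherwise retry from the same inputs in the next iteration
    return best_code, best_score
```

Because a candidate replaces the current best version only when its overall score is strictly higher, the overall score returned by this loop can never decrease, which mirrors the unidirectional progress described in the text.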
The number of iterations and the overall quality target score are both hyperparameters of the Optimizer. The improvement cycle terminates when either the maximum number of iterations or the target quality score is reached.

3 Experimental Setup

3.1 Dataset

We hand-curated a small but representative sample of 42 code examples across nine open-source libraries, including Scipy [33], SecurityEval [29], AWS-CDK-Examples [6] and Science-JS [11]. We focused on Python and JavaScript due to their popularity, and aimed for examples of varying code length and intended functionality. For future research, it would be interesting to expand the scope of the analysis to other programming languages, for example by considering the QScored [27] dataset, which focuses on C# and Java.

Since most of the code was sourced from reputable open-source libraries, it should be of reasonably high quality. Thus, some of the code pieces were manually modified to create scope for improvement. This included annotation removal and non-semantic changes to the code, whilst leaving the original logic intact. In Appendix A, we provide an example of reduced quality, where a utility function is re-defined multiple times, hence reducing the modularity of the code.

Source                 No. Examples   Language             Test cases   No. Lines
auth0 [5]              2              JavaScript           False        165
aws-cdk [6]            2              Python               False        36
mbpp [4]               8              Python               True         19
sciencejs [11]         6              JavaScript           False        112
scipy [33]             3              Python               False        117
scikit-learn [22]      3              Python               False        142
SnoopySecurity [26]    8              Python, JavaScript   False        60
fportantier [24]       3              Python               False        39
SecurityEval [29]      7              Python               False        21

Table 1: Detailed structural breakdown of the curated dataset.

In this study we focused on 42 code examples (28 Python and 14 JavaScript), sourced from 9 different data sources. Table 1 shows the complete list of data sources and associated metadata. Furthermore, test cases for 8 Python examples from MBPP [4] were available, which we leveraged unchanged.
For future research, it would be interesting to use LLMs to explicitly generate additional test cases [28].

3.2 Baseline solution

To perform an ablation study of the Evaluator's prompt, we considered a simpler CoT prompt to generate baseline results. This approach assumes that GPT-4o's internal knowledge is sufficient for an accurate and comprehensive evaluation of code quality. In our experiments, we compare it with CodeQUEST to validate the effectiveness of our solution. Again, the following prompt shares the system message with the Evaluator's prompt.

    ---- Human ----
    ### CODE:
    ```
    {code}
    ```

    ### TASK:
    Think step by step to produce both a quantitative and qualitative assessment of the CODE provided.
    * Your qualitative assessment must be a short summary about the quality of the CODE.
    * Your quantitative assessment must be an integer on a scale from -5 to 5, which respectively represent the low and high-quality ends of the scale.
    Both types of evaluations must agree with each other.

    ### OUTPUT:
    Return your answer in a valid JSON as shown below:
    ```json
    {{
      "insight": <qualitative assessment:str>,
      "score": <quantitative assessment:int>
    }}
    ```

3.3 Validating Scores With Proxies

Existing and well-established quality metrics are used in this study as proxies to validate LLM-based code evaluations. This validation is limited to Python code due to the availability of such analysis tools and the logistics of implementing and executing them. The following tools were selected for this study in an attempt to cover as many code quality dimensions as possible:

1. Pylint [31] (see Footnote 3) follows the style recommended by PEP8, while penalizing unused variables, lengthy code, and ill-named variables. Pylint returns a score from 0 to 10, which we use as a proxy metric for Readability and Documentation.
2. Radon Maintainability Index [19] is used as a proxy for code Maintainability, similar to [23].
The measure is a score from 0 to 100 derived from other metrics, including Source Lines of Code (SLoC), Cyclomatic Complexity, and Halstead Volume. We scale the score to range from 0 to 10.
3. Bandit [7] scans Python code for security issues using abstract syntax trees. Bandit does not provide a score, so we applied a simple heuristic that transforms its standard output into a score from 0 to 10 to define our proxy for Security.

For simplicity, we define our overall proxy code quality score as the average of these three metrics, resulting in a final score ranging from 0 to 10.

Footnote 3: We enabled all default Pylint extensions apart from mccabe, which computes the cyclomatic complexity; it was disabled to avoid partial overlap with Radon.

4 Results

4.1 Can GPT-4o evaluate code quality?

To evaluate CodeQUEST's Evaluator, we compare its evaluation results to those produced by the baseline setup. In both cases, the output consists of an overall code quality score (quantitative result) as well as a short summary report (qualitative feedback). Each code example was evaluated five times by both setups (see Footnote 4), allowing us to gather statistics on the consistency of the outputs produced.

Example: mbpp/601.py

Baseline (score: 2.0)
The code is functional and correctly implements the logic to find the maximum chain length of pairs. However, it lacks comments, proper variable naming, and error handling. The use of 'max' as a variable name is not recommended as it shadows the built-in function 'max'. The code could be improved for readability and maintainability.

CodeQUEST (score: -1.3)
The provided code is a straightforward implementation of finding the maximum chain length of pairs using a dynamic programming approach. While the logic is correct, the code has several areas for improvement. It lacks readability and consistent style, with variable names that could be more descriptive. The use of 'max' as a variable name is misleading. The code does not include comments to explain the logic, especially the nested loops and the purpose of the 'mcl' array. It is moderately maintainable but could benefit from refactoring to improve efficiency and reduce repetition. The code is easy to test due to its simplicity and lack of external dependencies, but it does not facilitate mocking of dependencies. The nested loops result in O(n^2) time complexity, making it inefficient for large datasets. The code lacks input validation, error handling, and concurrency considerations. It is not designed with scalability or resource efficiency in mind. Additionally, the code lacks documentation and modularity, making it difficult to test and modify parts independently. Despite these issues, the code is portable as it avoids platform-specific features and uses standard libraries.

Table 2: Comparison of the qualitative and quantitative results of the baseline solution and CodeQUEST on mbpp/601.py.

Footnote 4: In this experiment, we disabled self-consistency for the CodeQUEST Evaluator to ensure a fair comparison.

[Figure 2: Comparison of the Baseline and CodeQUEST code quality scores. Each code example was evaluated five times. The mean and standard deviation were calculated across these five samples for the baseline (top) and CodeQUEST (bottom).]

- Qualitative evaluation: As illustrated in Table 2 and Appendix C, qualitative results produced by CodeQUEST were found to be more comprehensive and detailed compared to those of the baseline solution. For example, the CodeQUEST Evaluator performed an additional time complexity analysis of mbpp/601.py, illustrating areas of improvement for Scalability. Nonetheless, both sets of results raise accurate points about the code, suggesting that GPT-4o has advanced intrinsic knowledge of code quality.
We also found that the assessment produced equally plausible results for Python and JavaScript code, which shows that the LLM is able to generate insights across different programming languages.
- Quantitative evaluation: As shown in Figure 2, our two strategies exhibit different mean quality scores with some variability. The baseline appears to systematically overestimate code quality compared to CodeQUEST assessments. Since the deficiencies pointed out by the Evaluator are the primary signal driving code improvement by the Optimizer, such overestimation of quality (both qualitative and quantitative) is undesirable, as it may limit the scope for improvement. For both setups, the variability may be explained by a) the ambiguity of prompt questions or statements, b) inherent LLM stochasticity, and c) the scale of possible outcomes not being granular enough.

In summary, our results suggest that we can leverage GPT-4o for both qualitative and quantitative assessment of code quality. Furthermore, the accuracy and level of detail of LLM-based code evaluations are highly prompt-dependent. We also find that CodeQUEST provides a more comprehensive, standardized, and reliable evaluation due to the additional guidance provided by dimension-specific questions.

4.2 Can evaluations be used to drive code quality improvement?

In our study, we applied the CodeQUEST framework to each code example for a maximum of five improvement cycles. All versions of the Python code were checked for their ability to compile. Furthermore, all versions of the mbpp code were validated against the corresponding test cases provided. All intermediary and final results were recorded for downstream analysis.

The example shown in Figure 3 represents the general behavior observed across all code examples. Out of the original 42 code instances, 41 were improved to some extent.
We report an average absolute improvement of 2.9 ± 1.3 code quality scale units (see Footnote 5), with examples of lower initial quality exhibiting the largest improvements. To further analyze the improvements, we defined the Relative Percentage Improvement (RPI) as

$$\mathrm{RPI} := 100 \cdot \frac{s_f^n - s_i^n}{s_{\max} - s_i^n} \qquad (1)$$

where $s_{\max} = 5$ is the maximum possible score, and $s_i^n$ and $s_f^n$ are the initial and final scores for the $n$-th code example, respectively. $N = 42$ is the total number of examples in the dataset. Over the entire dataset, our framework produces a mean RPI of 52.6% ± 17.9% and a median RPI of 57.1%. We also verified that while our setup, by design, ensures a monotonic improvement of overall code quality, individual dimensions' scores do not always improve monotonically, i.e., an iteration that improves the overall quality score might be accompanied by a decrease in some of the dimension-specific scores.

Figure 4 displays the distribution of improvements achieved at each iteration in absolute code quality units. Most of the improvement occurs in the first iteration cycle, with an average increase of 2 code quality units. The average improvements for iterations two and three were found to be 0.52 and 0.27, respectively. From the fourth iteration, the improvement becomes negligible (below 0.1 code quality units).

These results suggest that the magnitude of code quality improvement reduces at each iteration, which is particularly useful for limiting the GPT-4o API costs incurred by the framework if CodeQUEST is to be adopted at scale.

This section therefore demonstrates that our Optimizer can utilize the qualitative insights generated by the Evaluator to achieve significant code quality improvement. Indeed, we notice a consistent alignment between quantitative and qualitative results, such that an increase in the quality score reflects a decrease in the code quality deficiencies reported by the quality summary.
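As a quick numerical check of the RPI definition in Equation (1), the sketch below computes it for a single code example. The function name `rpi` and the example scores are illustrative assumptions, not values from the paper; only the formula and the maximum score of 5 come from the text.

```python
def rpi(initial: float, final: float, s_max: float = 5.0) -> float:
    """Relative Percentage Improvement of one code example, per Equation (1).

    The gain (final - initial) is expressed as a percentage of the
    headroom that was still available (s_max - initial).
    """
    if initial >= s_max:
        raise ValueError("RPI is undefined when the initial score is already at the maximum")
    return 100.0 * (final - initial) / (s_max - initial)


if __name__ == "__main__":
    # Hypothetical example: a snippet moves from -1.0 to 2.0 on the [-5, 5] scale.
    print(rpi(-1.0, 2.0))
```

For this hypothetical example the gain is 3 units out of 6 available, so the RPI is 50%. Normalizing by the remaining headroom makes improvements comparable between examples that start at very different quality levels.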
We therefore believe that this setup is general enough to serve as a general-purpose recipe for configurable code quality improvement.

Footnote 5: As a by-product of this step, the original dataset was expanded with 208 additional code examples, excluding 2 invalid attempts to improve Python code.

[Figure 3: Quality score evolution of a Python code example (mbpp/927.py) when subjected to the Optimizer. The quality score is provided in its overall and dimension-specific format. Corresponding quality summaries and code snippets can be found in Appendix D.]

[Figure 4: Distribution of code quality score improvements per iteration. Self-consistency was disabled in this experiment.]

4.3 Are these evaluations meaningful?

To confirm the validity of LLM-generated quality scores, we investigate their level of agreement with the quality proxy scores described in Section 3.3. Given each original code example and its corresponding refactored versions generated by CodeQUEST, here denoted as $\{x_0, \dots, x_{n_{\max}}\}$, we can consider the incremental change of the score per iteration, defined as

$$\Delta_{x_n}\mathrm{CodeQUEST} = \mathrm{CodeQUEST}(x_n) - \mathrm{CodeQUEST}(x_{n-1}), \qquad (2)$$

for iterations $n = 1, \dots, n_{\max}$. In our case, $n_{\max} = 5$. Similarly, we can define $\Delta_{x_n}\mathrm{Proxy}$ and $\Delta_{x_n}\mathrm{Baseline}$ for the proxy and baseline scores, respectively.

Pairs                    r_s    r_p    p_s    p_p
(ΔProxy, ΔBaseline)      0.21   0.27   0.02   0.00
(ΔProxy, ΔCodeQUEST)     0.23   0.53   0.01   0.00

Table 3: Spearman r_s and Pearson r_p correlations with corresponding p-values p_s and p_p.

The iteration cycle applied to the 28 Python code examples described in Table 1 resulted in 138 improved code examples (see Footnote 6), for which we calculated the incremental change for each of the three scores. To measure how well the LLM-based scores are aligned with the proxy results, we compute the Pearson (r_p) and Spearman rank (r_s) correlations between ΔProxy and ΔCodeQUEST as well as between ΔProxy and ΔBaseline, and report the results in Table 3 and Figure 5.
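To make the analysis concrete, the following sketch computes per-iteration score changes as in Equation (2) and then the Pearson and Spearman correlations between two such sequences. It is a dependency-free illustration under stated assumptions: the helper names are hypothetical, ties receive average ranks, p-values are not computed, and the authors' actual tooling is not specified in the text.

```python
from math import sqrt


def deltas(scores):
    """Incremental changes between consecutive versions x_0, ..., x_nmax."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

In practice the deltas from all code examples would be pooled into two long sequences, one for the proxy and one for the LLM-based score, before calling `pearson` and `spearman` on them.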
Results suggest that proxy scores are better correlated with the CodeQUEST scores than with those of the baseline, most clearly for the Pearson correlation (0.53 versus 0.27), while the Spearman correlations are close (0.23 versus 0.21). This suggests that, given a base code example, CodeQUEST is able to iteratively assign scores that better reflect the underlying variation in the code quality. We can observe from Figure 5 that the relationship is relatively noisy. This is somewhat expected, since the proxies do not capture all the dimensions considered by CodeQUEST and can be focused on certain aspects. For example, when we inspected individual results, we found that CodeQUEST is able to detect security vulnerabilities which were ignored by Bandit. We provide two such examples from SecurityEval in Appendix E, along with their improved versions.

Footnote 6: We set the maximum number of evolution cycles to 5 and excluded 2 improvement attempts which were syntactically incorrect or failed the test cases.

[Figure 5: Scatter plots and linear regression lines for $\{\Delta_{x_n}\mathrm{Proxy}, \Delta_{x_n}\mathrm{CodeQUEST}\}_{x_n}$ (left) and $\{\Delta_{x_n}\mathrm{Proxy}, \Delta_{x_n}\mathrm{Baseline}\}_{x_n}$ (right). The shaded area represents the uncertainty of the linear regression fit.]

5 Threats to validity

In this section, we acknowledge certain considerations regarding the validity and robustness of our framework. Firstly, code quality is an intrinsically subjective concept and, while our evaluations work well with the selected questions, they might not be as effective if the questions are modified significantly. However, given the generalization capabilities of GPT-4o and its advanced knowledge of both code and English language concepts, we believe this is unlikely as long as the questions are reasonable and posed in the manner we detail in Section 2.1.

Moreover, the stochasticity of GPT-4o output also introduces some inherent variability in the results. While we see good stability in our setup and take measures to limit variability by setting the temperature to 0, that might not hold for all code and/or prompts.
We also recommend using the self-consistency framework [34] to address this, if required.

Hallucinations are also a well-known shortcoming of LLMs and an active field of research, so it is possible that the model hallucinates during the improvement step of our framework. A detailed study with a larger dataset and human validation would be required to ascertain the extent to which hallucination might be a concern. We also recommend robust validation with test cases to guard against this. Our framework might also struggle to improve code that is very different from the training data of GPT-4o (e.g., highly custom proprietary code bases).

Finally, further study would be required to fully validate the framework across a broader range of programming languages. Here we note that the primary limitation comes from the LLM's (GPT-4o's) knowledge of the programming language, so the framework should perform well across most major programming languages.

6 Conclusion

In this paper, we introduced CodeQUEST (Code Quality Understanding and Enhancement System Toolkit), a GPT-4o powered framework capable of evaluating and improving code quality across different domains. We show that our framework is able to produce comprehensive qualitative and quantitative code quality evaluation reports, and subsequently use these reports to iteratively improve the quality of the code being analyzed. Our framework is easily understandable and adjustable, and shows strong results for Python as well as JavaScript, demonstrating its applicability across languages. We also demonstrate that our framework produces code evaluation scores that agree well with rule-based code quality evaluation libraries.

References

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Angus Yang, Zehan Li, and Jie Li. Advancing GenAI assisted programming–a comparative study on prompt efficiency and code quality between GPT-4 and GLM-4. ArXiv, abs/2402.12782, 2024.
[3] Anonymous. Enhancing decision-making of large language models via actor-critic. Submitted to The Thirteenth International Conference on Learning Representations, 2024. Under review.
[4] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
[5] auth0. https://github.com/auth0/node-jsonwebtoken.
[6] AWS. CDK examples. https://github.com/aws-samples/aws-cdk-examples.
[7] Bandit. https://github.com/pycqa/bandit.
[8] Veronika Hackl, Alexandra Elena Müller, Michael Granitzer, and Maximilian Sailer. Is GPT-4 a reliable rater? Evaluating consistency in GPT-4's text ratings. Frontiers in Education, 8, December 2023.
[9] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, 2023.
[10] Petri Ihantola, Tuukka Ahoniemi, Ville Karavirta, and Otto Seppälä. Review of recent systems for automatic assessment of programming assignments. In Proceedings of the 10th Koli Calling International Conference on Computing Education Research, Koli Calling '10, pages 86–93, New York, NY, USA, 2010. Association for Computing Machinery.
[11] Jason Davies. science.js. https://github.com/jasondavies/science.js/.
[12] Hui Jiang. A latent space theory for emergent abilities in large language models, 2023.
[13] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023.
[14] Vijay Konda and John Tsitsiklis. Actor-critic algorithms. In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999.
[15] Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. GPT understands, too. AI Open, 2023.
[16] Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, and David Lo. Refining ChatGPT-generated code: Characterizing and mitigating code quality issues. ACM Transactions on Software Engineering and Methodology, 33(5), Article 116, pages 1–26, 2024.
[17] Mosleh Mahamud and Isak Samsten. Code quality assessment using transformers, 2023.
[18] Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online, July 2020. Association for Computational Linguistics.
[19] Michele Lacchia. Radon MI. https://radon.readthedocs.io/en/latest/intro.html.
[20] Ifeanyi G. Ndukwe, Sherlock A. Licorish, Amjed Tahir, and Stephen G. MacDonell. How have views on software quality differed over time? Research and practice viewpoints. Journal of Systems and Software, 195:111524, 2023.
[21] Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. LLM is like a box of chocolates: the non-determinism of ChatGPT in code generation. arXiv preprint arXiv:2308.02828, 2023.
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[23] Russell A. Poldrack, Thomas Lu, and G. Beguš. AI-assisted coding: Experiments with GPT-4. ArXiv, abs/2304.13187, 2023.
[24] Fabian Martinez Portantier. https://github.com/fportantier/vulpy.
[25] Przemysław Zydroń and Jarosław Protasiewicz. Assessing code review quality with ChatGPT: A survey of automated reviewer assignment methods and experimental outcomes. Digital Interaction and Machine Intelligence. MIDI 2023. Lecture Notes in Networks and Systems, volume 1076. Springer, Cham, 2024.
[26] Sam Sanoop. https://github.com/snoopysecurity/vulnerable-code-snippets.
[27] Tushar Sharma and Marouane Kessentini. QScored: A large dataset of code smells and quality metrics. In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), pages 590–594. IEEE, 2021.
[28] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 8634–8652. Curran Associates, Inc., 2023.
[29] Mohammed Latif Siddiq and Joanna C. S. Santos. SecurityEval dataset: Mining vulnerability examples to evaluate machine learning-based code generation techniques. In Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security (MSR4P&S22), 2022.
[30] Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, and Toufique Ahmed. Calibration and correctness of language models for code, 2024.
[31] Sylvain Thénault. Pylint. https://github.com/pylint-dev/pylint.
[32] Rasul Tutunov, Antoine Grosnit, Juliusz Ziomek, Jun Wang, and Haitham Bou-Ammar. Why can large language models generate correct chain-of-thoughts?, 2024.
[33] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272, 2020.
[34] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023.
[35] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc., 2022.
[36] Matthew Wilkes. Testing, checking, linting, pages 51–101. Apress, Berkeley, CA, 2020.
[37] Bin Zhang, Hangyu Mao, Jingqing Ruan, Ying Wen, Yang Li, Shao Zhang, Zhiwei Xu, Dapeng Li, Ziyue Li, Rui Zhao, Lijuan Li, and Guoliang Fan. Controlling large language model-based agents for large-scale decision-making: An actor-critic approach, 2024.
[38] Yuze Zhao, Zhenya Huang, Yixiao Ma, Rui Li, Kai Zhang, Hao Jiang, Qi Liu, Linbo Zhu, and Yu Su. RePair: Automated program repair with process-based feedback. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 16415–16429, Bangkok, Thailand, August 2024. Association for Computational Linguistics.

7 Conflict of interest

The authors declare that they have no conflict of interest.
8 Disclaimer

This paper was prepared for informational purposes by the Applied Innovation of AI team of JPMorgan Chase & Co. This paper is not a product of the Research Department of JPMorgan Chase & Co. or its affiliates. Neither JPMorgan Chase & Co. nor any of its affiliates makes any explicit or implied representation or warranty and none of them accept any liability in connection with this paper, including, without limitation, with respect to the completeness, accuracy, or reliability of the information contained herein and the potential legal, compliance, tax, or accounting effects thereof. This document is not intended as investment research or investment advice, or as a recommendation, offer, or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction. The described work is a prototype and is not a production deployed system.

A Manual code quality defect

Example of code that has been "degraded in quality". We first show the original code, followed by the modified code, where the _update_x function is redefined multiple times.

Original code example from SciPy:

    import numpy as np
    import scipy.sparse as sps
    from scipy._lib._array_api import array_namespace, atleast_nd


    class LinearVectorFunction:
        """Linear vector function and its derivatives.

        Defines a linear function F = A x, where x is N-D vector and
        A is m-by-n matrix. The Jacobian is constant and equals to A. The Hessian
        is identically zero and it is returned as a csr matrix.
        """
        def __init__(self, A, x0, sparse_jacobian):
            if sparse_jacobian or sparse_jacobian is None and sps.issparse(A):
                self.J = sps.csr_matrix(A)
                self.sparse_jacobian = True
            elif sps.issparse(A):
                self.J = A.toarray()
                self.sparse_jacobian = False
            else:
                # np.asarray makes sure A is ndarray and not matrix
                self.J = np.atleast_2d(np.asarray(A))
                self.sparse_jacobian = False

            self.m, self.n = self.J.shape

            self.xp = xp = array_namespace(x0)
            _x = atleast_nd(x0, ndim=1, xp=xp)
            _dtype = xp.float64
            if xp.isdtype(_x.dtype, "real floating"):
                _dtype = _x.dtype

            # promotes to floating
            self.x = xp.astype(_x, _dtype)
            self.x_dtype = _dtype

            self.f = self.J.dot(self.x)
            self.f_updated = True

            self.v = np.zeros(self.m, dtype=float)
            self.H = sps.csr_matrix((self.n, self.n))

        def _update_x(self, x):
            if not np.array_equal(x, self.x):
                _x = atleast_nd(x, ndim=1, xp=self.xp)
                self.x = self.xp.astype(_x, self.x_dtype)
                self.f_updated = False

        def fun(self, x):
            self._update_x(x)
            if not self.f_updated:
                self.f = self.J.dot(x)
                self.f_updated = True
            return self.f

        def jac(self, x):
            self._update_x(x)
            return self.J

        def hess(self, x, v):
            self._update_x(x)
            self.v = v
            return self.H

Reduced quality code example:

    import numpy as np
    import scipy.sparse as sps
    from scipy._lib._array_api import array_namespace, atleast_nd


    class LinearVectorFunction:
        """Linear vector function and its derivatives.

        Defines a linear function F = A x, where x is N-D vector and
        A is m-by-n matrix. The Jacobian is constant and equals to A. The Hessian
        is identically zero and it is returned as a csr matrix.
        """
        def __init__(self, A, x0, sparse_jacobian):
            if sparse_jacobian or sparse_jacobian is None and sps.issparse(A):
                self.J = sps.csr_matrix(A)
                self.sparse_jacobian = True
            elif sps.issparse(A):
                self.J = A.toarray()
                self.sparse_jacobian = False
            else:
                # np.asarray makes sure A is ndarray and not matrix
                self.J = np.atleast_2d(np.asarray(A))
                self.sparse_jacobian = False

            self.m, self.n = self.J.shape

            self.xp = xp = array_namespace(x0)
            _x = atleast_nd(x0, ndim=1, xp=xp)
            _dtype = xp.float64
            if xp.isdtype(_x.dtype, "real floating"):
                _dtype = _x.dtype

            # promotes to floating
            self.x = xp.astype(_x, _dtype)
            self.x_dtype = _dtype

            self.f = self.J.dot(self.x)
            self.f_updated = True

            self.v = np.zeros(self.m, dtype=float)
            self.H = sps.csr_matrix((self.n, self.n))

        def fun(self, x):
            def _update_x(x):
                if not np.array_equal(x, self.x):
                    _x = atleast_nd(x, ndim=1, xp=self.xp)
                    self.x = self.xp.astype(_x, self.x_dtype)
                    self.f_updated = False
            _update_x(x)
            if not self.f_updated:
                self.f = self.J.dot(x)
                self.f_updated = True
            return self.f

        def jac(self, x):
            def _update_x(x):
                if not np.array_equal(x, self.x):
                    _x = atleast_nd(x, ndim=1, xp=self.xp)
                    self.x = self.xp.astype(_x, self.x_dtype)
                    self.f_updated = False
            _update_x(x)
            return self.J

        def hess(self, x, v):
            def _update_x(x):
                if not np.array_equal(x, self.x):
                    _x = atleast_nd(x, ndim=1, xp=self.xp)
                    self.x = self.xp.astype(_x, self.x_dtype)
                    self.f_updated = False
            _update_x(x)
            self.v = v
            return self.H

B Dimension related questions/statements

The CodeQUEST framework defines ten dimensions of code quality. For each dimension, it further defines five questions/statements. The full list is given below.

1. Readability
   - Both variable and function names are descriptive and meaningful.
   - The code consistently follows a single specific code style guide.
   - There are comments that clearly explain complex or non-obvious parts of the code provided, without assuming prior knowledge.
   - The code provided is free of unexplained constants or magic numbers.
   - Each existing function is dedicated to a single task.

2. Maintainability
   - The code provided is organized in a logical and understandable manner, allowing for easy comprehension.
   - The code provided strictly adheres to the DRY (Don't Repeat Yourself) principle, avoiding unnecessary repetition.
   - Code features can be added or modified without affecting existing functionality.
   - The code provided is effectively free of duplication, promoting efficiency and maintainability.
   - There are clear interfaces between different parts of the code provided, facilitating seamless interaction.

3. Testability
   - The structure of the code provided facilitates easy mocking of dependencies.
   - The code provided produces consistent and predictable outputs for specific inputs.
   - The code provided is free of global states and variables.
   - The code provided is free from deep nesting or complex control flow that could complicate testing.
   - The code provided is organized in a way that allows the straightforward measurement of code coverage.

4. Efficiency
   - The code provided makes efficient use of data structures.
   - The code provided avoids creating unnecessary objects or data.
   - The code provided avoids suboptimal computations, such as unnecessary loops or repeated operations that could be optimized.
   - The code provided promotes the efficient use of system resources.
   - The code provided addresses any existing bottlenecks that could slow down the code.

5. Robustness
   - Does the code provided validate and sanitize inputs in all relevant scenarios?
   - Does the code provided handle edge cases and unexpected inputs gracefully in all relevant scenarios?
   - Are there appropriate error handling and exception handling mechanisms in place for all relevant scenarios?
   - Does the code provided handle errors and exceptions gracefully in all relevant scenarios?
   - Does the code provided account for any potential race conditions, concurrency issues, or deadlock situations in all relevant scenarios?

6. Security
   - The code provided consistently sanitizes user inputs to prevent injection attacks.
   - The code provided is completely free of hardcoded sensitive data, such as passwords and API keys.
   - The code provided adheres to established best practices for secure coding.
   - The code provided implements comprehensive error handling to prevent leakage of sensitive information.
   - The code provided utilizes secure communication protocols when performing network operations.

7. Documentation
   - Comments are provided to explain non-obvious parts of the code.
   - There is a concise and clear description of the code's functionality.
   - Input parameters are documented.
   - Output values are documented.
   - Side effects are documented.

8. Modularity
   - The code provided is divided into small, independent functions that perform specific tasks.
   - Individual parts of the code provided can be used, modified, and tested independently without affecting other parts.
   - The code provided avoids deep nesting and complex control flow structures.
   - The code provided adheres to the principles of high cohesion (related functionality within a single unit) and low coupling (minimal dependencies between units).
   - Different parts of the code are separated by well-defined interfaces to facilitate communication and maintainability.

9. Scalability
   - The code provided is designed to handle increased data loads efficiently, or can be easily adapted to do so.
   - The code provided is designed to handle an increased number of users efficiently, or can be easily adapted to do so.
   - The code provided makes efficient use of resources, such as CPU and memory.
   - The code provided is free of bottlenecks that could potentially limit scalability.
   - The code provided is designed to work in a distributed environment efficiently, or can be easily adapted to do so.

10. Portability
   - The code provided avoids relying on any platform-specific features or behavior.
   - The code provided can run in different environments without requiring major changes.
   - The code provided is free of hardcoded file paths or URLs that would limit portability.
   - The code provided uses standard libraries and APIs as much as possible.
   - All dependencies are clearly specified and easy to install.

C Code quality evaluation examples

Example: snoopysecurity/path_traversal.js

Baseline: The code demonstrates a basic implementation of an Apollo GraphQL server with a simple schema and resolver. It reads a file from the filesystem and returns its content as part of the response. However, there are several issues: it lacks error handling, uses synchronous file reading (fs.readFileSync) which can block the event loop, and has some unused variables (e.g., express and app). The 'readFile' function should handle errors, and the 'books' resolver should return a list of books instead of a single book with the file content as the title. Additionally, the use of 'var' should be replaced with 'const' or 'let' for better scoping. Mixing Express and Apollo Server is unnecessary in this context.
Score: 2.0

CodeQUEST: The code is generally readable and organized logically with descriptive variable and function names. It includes comments explaining the purpose of the schema and resolvers, but lacks consistent code style and detailed documentation for input parameters, output values, and side effects. The code is modular, separating concerns by defining the schema, resolvers, and server setup in distinct sections. However, it does not strictly adhere to the DRY principle and has some repetition. The use of global variables and synchronous file reading complicates testing and can lead to inefficiencies and bottlenecks under increased data loads. The code lacks input validation, sanitization, and proper error handling mechanisms, posing security risks. It is generally portable but includes a hardcoded file path that could limit portability.
Overall, the code is a basic implementation of an Apollo GraphQL server that is not designed with scalability or security best practices in mind.
Score: -0.4

Table 4: Comparison of the qualitative and quantitative results of the baseline solution and CodeQUEST on vulnerable-js/path_traversal.js.

Example: securityeval/CWE-319_author_1.py

Baseline: The code is generally well-structured and follows good practices such as separating GET and POST requests for the login route. However, it has a significant security flaw: the check_credentials function always returns True, which means any username and password will be accepted. Additionally, there is no error handling for missing form fields, and the session ID is stored in a cookie without any additional security measures like HttpOnly or Secure flags.
Score: 2.0

CodeQUEST: The code is generally readable and organized with descriptive variable and function names, following a consistent style. It is modular and easy to understand, making it straightforward to test. However, it lacks comments explaining the logic in detail and does not strictly adhere to the DRY principle. The check_credentials function is overly simplistic and always returns True, which is not realistic for actual authentication. The code lacks input validation, sanitization, and proper error handling, making it vulnerable to injection attacks and other security issues. It does not handle edge cases or unexpected inputs gracefully and does not account for potential race conditions or concurrency issues. The code is portable as it avoids platform-specific features and uses standard libraries, but it does not specify dependencies explicitly.
Overall, the code could benefit from improved security practices, better documentation, and more realistic authentication logic.
Score: 0.5

Table 5: Comparison of the qualitative and quantitative results of the baseline solution and CodeQUEST on vulnerable-python/CWE-319_author_1.py.

The comparison of the qualitative feedback examples presented in Table 2 and Appendix C also suggests that the CodeQUEST score attribution strategy enables a more meaningful relative positioning of different code snippets on the code quality scale. Assuming consistency between the quality scores and the qualitative results produced by the LLM, it can be deduced that baseline scoring only partially reflects the dimensions considered by our Evaluator. This explains why code examples 1 and 2, which contain different kinds of security vulnerabilities, are both given the same positive score (2.0) by the baseline, whereas the scores produced by the Evaluator (-0.4 and 0.5) seem to be better aligned with the overall assessment expected from a code expert.

Example: sklearn/RegressionModel.py

Baseline: The code implements a basic linear regression model using gradient descent. It is clear and follows standard practices for such implementations. However, it lacks some features like input validation, convergence criteria, and regularization options. Additionally, the number of iterations is set very high by default, which might not be efficient for all datasets.
Score: 3.0

CodeQUEST: The code is a basic implementation of a linear regression model using gradient descent. It is generally readable and organized, with clear separation of concerns between fitting and predicting. However, it lacks comments, documentation, and error handling. The code does not strictly adhere to the DRY principle and could be optimized for scalability and efficiency. It is well-structured for testability and avoids global states and complex control flows.
The implementation is efficient in terms of data structures and object creation but does not address potential bottlenecks such as convergence issues. The code is highly portable but does not explicitly list dependencies. Overall, it is functional but could benefit from improvements in documentation, error handling, and optimization for large datasets.
Score: 2.1

Table 6: Comparison of the qualitative and quantitative results of the baseline solution and CodeQUEST on sklearn/RegressionModel.py.

D CodeQUEST - code evolution example

Below is a step-by-step example of the CodeQUEST framework in action, applied to the mbpp/927.py example. The corresponding quantitative evaluations are provided in Figure 3.

Initial Evaluation

Qualitative result: The provided code is a simple and functional implementation of a binary tree node class and a function to calculate the maximum height of the tree. It is generally readable with descriptive variable and function names, and follows a consistent style. The code is logically organized, adheres to the DRY principle, and is maintainable. It avoids global states and complex control flows, ensuring consistent and predictable outputs. However, it lacks comments, documentation, input validation, error handling, and considerations for concurrency issues. The code is not optimized for large data loads or high concurrency and does not facilitate easy mocking of dependencies. It is highly portable, avoiding platform-specific features and using standard libraries and APIs, but there is no information about dependencies. Overall, the code is simple and functional but does not exhibit strong modularity or scalability.

Code:

    class Node:
        def __init__(self, data):
            self.data = data
            self.left = None
            self.right = None
    def max_height(node):
        if node is None:
            return 0 ;
        else:
            left_height = max_height(node.left)
            right_height = max_height(node.right)
            if (left_height > right_height):
                return left_height+1
            else:
                return right_height+1

Iteration 1 Evaluation

Qualitative result: The code is generally well-written, well-organized, and easy to understand, with clear documentation and logical structure. It follows a consistent style, includes descriptive variable and function names, and provides docstrings for classes and functions, which help in understanding the purpose and usage. The code adheres to the DRY principle, avoids unnecessary repetition, and uses helper functions and memoization to promote efficiency and maintainability. It is modular, allowing for easy modifications without affecting existing functionality, and the interfaces between different parts of the code are clear. The code is well-structured for testing, avoids global states, has clear and simple control flow, and produces consistent outputs for given inputs. It is generally efficient in terms of data structure usage and avoiding unnecessary computations, although there is a minor inefficiency in the use of a dictionary for memoization with node objects as keys. The code is robust with respect to input validation and error handling, checking for invalid input types and handling edge cases such as an empty tree or a single node. It does not involve user inputs, sensitive data, or network operations, and error handling is minimal but appropriate for the context. The code demonstrates good modularity, avoiding deep nesting and complex control flow structures, and adhering to principles of high cohesion and low coupling. However, it is not designed with scalability in mind and is not optimized for large-scale applications. The code is highly portable, relying on standard libraries and APIs without platform-specific features or external dependencies.

Code:

    class Node:
        """
        A class to represent a node in a binary tree.

        Attributes:
        data : any
            The value stored in the node.
        left : Node
            The left child node.
        right : Node
            The right child node.
        """
        def __init__(self, data):
            """
            Constructs all the necessary attributes for the node object.

            Parameters:
            data : any
                The value to be stored in the node.
            """
            self.data = data
            self.left = None
            self.right = None

    def max_height(node):
        """
        Calculate the maximum height of a binary tree.

        Parameters:
        node : Node
            The root node of the binary tree.

        Returns:
        int
            The maximum height of the binary tree.
        """
        if not isinstance(node, Node) and node is not None:
            raise ValueError("Input must be a Node object or None")

        def height_helper(node, memo):
            if node is None:
                return 0
            if node in memo:
                return memo[node]
            left_height = height_helper(node.left, memo)
            right_height = height_helper(node.right, memo)
            memo[node] = max(left_height, right_height) + 1
            return memo[node]

        return height_helper(node, {})

    # Unit tests
    def test_max_height():
        # Test with a simple tree
        root = Node(1)
        root.left = Node(2)
        root.right = Node(3)
        root.left.left = Node(4)
        root.left.right = Node(5)
        assert max_height(root) == 3

        # Test with a single node
        single_node = Node(1)
        assert max_height(single_node) == 1

        # Test with None
        assert max_height(None) == 0

        # Test with invalid input
        try:
            max_height("invalid")
        except ValueError as e:
            assert str(e) == "Input must be a Node object or None"

    test_max_height()

Iteration 2 Evaluation

Qualitative result: The code is generally well-written, well-organized, and adheres to good coding practices. It includes clear documentation, follows the DRY principle, and is modular, allowing for easy modifications. The use of memoization in the height calculation function is efficient and avoids unnecessary repetition.
The code is structured for testability, avoids global states, and includes unit tests to verify its correctness. It handles input validation and error cases effectively but does not address concurrency issues, which are not relevant in this context. The code is simple and does not involve user inputs, sensitive data, or network operations, making it less applicable to many security concerns. It is highly portable, avoiding platform-specific features and external dependencies. However, it is not designed with scalability in mind for large-scale applications. The code is generally well-written, organized, and adheres to good coding practices. It includes descriptive variable and function names, clear interfaces, and unit tests to ensure functionality. The use of helper functions and memoization enhances efficiency and maintainability. The code handles edge cases and invalid inputs gracefully, and it is well-documented with docstrings explaining the purpose, input parameters, and return values. However, it does not address concurrency issues, system resource efficiency, or scalability concerns. The code is highly portable and avoids platform-specific features, making it easy to test and maintain.

Code:

    class Node:
        """
        A class to represent a node in a binary tree.

        Attributes:
        data : any
            The value stored in the node.
        left : Node
            The left child node.
        right : Node
            The right child node.
        """
        def __init__(self, data):
            """
            Constructs all the necessary attributes for the node object.

            Parameters:
            data : any
                The value to be stored in the node.
            """
            self.data = data
            self.left = None
            self.right = None

    def max_height(node):
        """
        Calculate the maximum height of a binary tree.

        Parameters:
        node : Node
            The root node of the binary tree.

        Returns:
        int
            The maximum height of the binary tree.
        """
        if not isinstance(node, Node) and node is not None:
            raise ValueError("Input must be a Node object or None")

        def height_helper(node, memo):
            if node is None:
                return 0
            node_id = id(node)  # Use node ID for memoization
            if node_id in memo:
                return memo[node_id]
            # Recursively calculate the height of left and right subtrees
            left_height = height_helper(node.left, memo)
            right_height = height_helper(node.right, memo)
            # Store the computed height in the memo dictionary
            memo[node_id] = max(left_height, right_height) + 1
            return memo[node_id]

        return height_helper(node, {})

    # Unit tests
    def test_max_height():
        # Test with a simple tree
        root = Node(1)
        root.left = Node(2)
        root.right = Node(3)
        root.left.left = Node(4)
        root.left.right = Node(5)
        assert max_height(root) == 3

        # Test with a single node
        single_node = Node(1)
        assert max_height(single_node) == 1

        # Test with None
        assert max_height(None) == 0

        # Test with invalid input
        try:
            max_height("invalid")
        except ValueError as e:
            assert str(e) == "Input must be a Node object or None"

    test_max_height()

Iteration 3 Evaluation

Qualitative result: The code is generally well-written, organized, and adheres to good coding practices. It includes descriptive variable and function names, clear interfaces, and unit tests to ensure functionality. The use of helper functions and memoization enhances efficiency and maintainability. The code handles edge cases and invalid inputs gracefully, and it is well-documented with docstrings explaining the purpose, input parameters, and return values. However, it does not address concurrency issues, system resource efficiency, or scalability concerns. The code is highly portable and avoids platform-specific features, making it easy to test and maintain.
Code:

    class Node:
        """
        A class to represent a node in a binary tree.

        Attributes:
        data : any
            The value stored in the node.
        left : Node
            The left child node.
        right : Node
            The right child node.
        """
        def __init__(self, data):
            """
            Constructs all the necessary attributes for the node object.

            Parameters:
            data : any
                The value to be stored in the node.
            """
            self.data = data
            self.left = None
            self.right = None
            self.id = id(self)  # Unique identifier for memoization

    def max_height(node):
        """
        Calculate the maximum height of a binary tree.

        Parameters:
        node : Node
            The root node of the binary tree.

        Returns:
        int
            The maximum height of the binary tree.
        """
        if not isinstance(node, Node) and node is not None:
            raise ValueError("Input must be a Node object or None")

        def height_helper(node, memo):
            """
            Helper function to calculate height using memoization.

            Parameters:
            node : Node
                The current node being processed.
            memo : dict
                Dictionary to store previously calculated heights.

            Returns:
            int
                The height of the current subtree.
            """
            if node is None:
                return 0
            if node.id in memo:
                return memo[node.id]
            # Calculate the height of left and right subtrees
            left_height = height_helper(node.left, memo)
            right_height = height_helper(node.right, memo)
            # Store the calculated height in memo
            memo[node.id] = max(left_height, right_height) + 1
            return memo[node.id]

        return height_helper(node, {})

    # Unit tests
    def test_max_height():
        # Test with a simple tree
        root = Node(1)
        root.left = Node(2)
        root.right = Node(3)
        root.left.left = Node(4)
        root.left.right = Node(5)
        assert max_height(root) == 3

        # Test with a single node
        single_node = Node(1)
        assert max_height(single_node) == 1

        # Test with None
        assert max_height(None) == 0

        # Test with invalid input
        try:
            max_height("invalid")
        except ValueError as e:
            assert str(e) == "Input must be a Node object or None"

    test_max_height()

Iteration 4 Evaluation

Qualitative result: The code is well-written, readable, and adheres to good coding practices. It uses descriptive variable and function names, consistent style, and includes comments explaining the purpose of classes and functions. The code is modular, maintainable, and efficient, utilizing helper functions and memoization to avoid unnecessary computations. It handles edge cases and invalid inputs gracefully, with appropriate error handling. The unit tests are well-structured, ensuring consistent and predictable outputs. The code is portable, avoiding platform-specific features and external dependencies. However, it does not address concurrency issues or scalability for increased data loads or distributed environments. Overall, the code demonstrates good modularity, maintainability, and performance for its intended purpose.

Code:

    class Node:
        """
        A class to represent a node in a binary tree.

        Attributes:
        data : any
            The value stored in the node.
        left : Node
            The left child node.
        right : Node
            The right child node.
        """
        def __init__(self, data):
            """
            Constructs all the necessary attributes for the node object.

            Parameters:
            data : any
                The value to be stored in the node.
            """
            self.data = data
            self.left = None
            self.right = None
            self.id = id(self)  # Unique identifier for memoization

    def max_height(node):
        """
        Calculate the maximum height of a binary tree.

        Parameters:
        node : Node
            The root node of the binary tree.

        Returns:
        int
            The maximum height of the binary tree.
        """
        if not isinstance(node, Node) and node is not None:
            raise ValueError("Input must be a Node object or None")

        def height_helper(node, memo):
            """
            Helper function to calculate height using memoization.

            Parameters:
            node : Node
                The current node being processed.
            memo : dict
                Dictionary to store previously calculated heights.

            Returns:
            int
                The height of the current subtree.
            """
            if node is None:
                return 0
            if node.id in memo:
                return memo[node.id]
            # Calculate the height of left and right subtrees
            left_height = height_helper(node.left, memo)
            right_height = height_helper(node.right, memo)
            # Store the calculated height in memo
            memo[node.id] = max(left_height, right_height) + 1
            return memo[node.id]

        return height_helper(node, {})

    # Unit tests
    def test_max_height():
        # Test with a simple tree
        root = Node(1)
        root.left = Node(2)
        root.right = Node(3)
        root.left.left = Node(4)
        root.left.right = Node(5)
        assert max_height(root) == 3

        # Test with a single node
        single_node = Node(1)
        assert max_height(single_node) == 1

        # Test with None
        assert max_height(None) == 0

        # Test with invalid input
        try:
            max_height("invalid")
        except ValueError as e:
            assert str(e) == "Input must be a Node object or None"

    test_max_height()

Iteration 5 Evaluation

Qualitative result: The code is generally well-written, readable, and adheres to good coding practices. It uses descriptive variable and function names, consistent style, and includes docstrings for documentation. The code is modular, with each function dedicated to a single task, and employs memoization for efficiency. It is well-structured for testing, with clear interfaces and encapsulated helper functions.
The code is efficient in terms of data structure usage and system resources, and it includes basic error handling and input validation. It handles edge cases such as an empty tree and a single node tree. The code is portable, avoiding platform-specific features and external dependencies. However, it is not designed for concurrency, increased data loads, or distributed environments. Overall, the code is maintainable, extendable, and adheres to best practices for the given functionality.

Code:

    class Node:
        """
        A class to represent a node in a binary tree.

        Attributes:
        data : any
            The value stored in the node.
        left : Node
            The left child node.
        right : Node
            The right child node.
        """
        def __init__(self, data):
            """
            Constructs all the necessary attributes for the node object.

            Parameters:
            data : any
                The value to be stored in the node.
            """
            self.data = data
            self.left = None
            self.right = None
            self.id = id(self)  # Unique identifier for memoization

    def max_height(node):
        """
        Calculate the maximum height of a binary tree.

        Parameters:
        node : Node
            The root node of the binary tree.

        Returns:
        int
            The maximum height of the binary tree.

        Side Effects:
        - Uses a memo dictionary to store previously calculated heights.
        - May raise a ValueError if the input is not a Node object or None.
        """
        if not isinstance(node, Node) and node is not None:
            raise ValueError("Input must be a Node object or None")

        def height_helper(node, memo):
            """
            Helper function to calculate height using memoization.

            Parameters:
            node : Node
                The current node being processed.
            memo : dict
                Dictionary to store previously calculated heights.

            Returns:
            int
                The height of the current subtree.

            Side Effects:
            - Updates the memo dictionary with the height of the current subtree.
            """
            if node is None:
                return 0
            if node.id in memo:
                return memo[node.id]
            # Calculate the height of left and right subtrees
            left_height = height_helper(node.left, memo)
            right_height = height_helper(node.right, memo)
            # Store the calculated height in memo
            memo[node.id] = max(left_height, right_height) + 1
            return memo[node.id]

        # Limit the size of the memo dictionary to avoid excessive memory usage
        memo = {}
        return height_helper(node, memo)

    # Unit tests
    def test_max_height():
        # Test with a simple tree
        root = Node(1)
        root.left = Node(2)
        root.right = Node(3)
        root.left.left = Node(4)
        root.left.right = Node(5)
        assert max_height(root) == 3

        # Test with a single node
        single_node = Node(1)
        assert max_height(single_node) == 1

        # Test with None
        assert max_height(None) == 0

        # Test with invalid input
        try:
            max_height("invalid")
        except ValueError as e:
            assert str(e) == "Input must be a Node object or None"

    test_max_height()

E Security Vulnerable Code

1. cwe-321-author-2.py from Security Eval:

    import openai

    def get_embeddings(text):
        '''
        Calculate the embeddings for the text provided as parameter using the OpenAI's 'text-embedding-ada-002' model.
        Then, return a list of the embeddings.
        '''
        openai.api_key = "sk-Hn0VpJ4aM2X7aAyT3mNr37tMMWgJj17zjrxlpbjKgMHe5PV"
        model = 'text-embedding-ada-002'
        result = openai.Embedding.create(
            model=model,
            input=text
        )

        return result["data"][0]["embedding"]

    if __name__ == '__main__':
        get_embeddings("Hello world")

In this example, an API key is exposed in the source code. This security vulnerability is not identified by Bandit, resulting in an evaluation score of 10/10:

    Test results:
        No issues identified.

    Code scanned:
        Total lines of code: 15
        Total lines skipped (#nosec): 0

    Run metrics:
        Total issues (by severity):
            Undefined: 0
            Low: 0
            Medium: 0
            High: 0
        Total issues (by confidence):
            Undefined: 0
            Low: 0
            Medium: 0
            High: 0
    Files skipped (0):

CodeQUEST rates the security dimension as -4 with the following qualitative assessment:

    The code has several security issues. It does not sanitize user inputs, contains hardcoded sensitive data (API key), lacks comprehensive error handling, and does not explicitly show the use of secure communication protocols.

The qualitative feedback not only points out the API key exposure but also identifies other potential areas for improvement (e.g., error handling). After one iteration, CodeQUEST produces an implementation that follows security best practices, such as storing the API key as an environment variable and adding sensible error-catching mechanisms.

    import openai
    import os

    def get_api_key():
        '''
        Retrieve the OpenAI API key from environment variables.
        '''
        return os.getenv("OPENAI_API_KEY")

    def validate_input(text):
        '''
        Validate the input text to ensure it is a non-empty string.
        '''
        if not isinstance(text, str) or not text.strip():
            raise ValueError("Input text must be a non-empty string.")

    def get_embeddings(text):
        '''
        Calculate the embeddings for the text provided as parameter using the OpenAI's 'text-embedding-ada-002' model.
        Then, return a list of the embeddings.

        Parameters:
        text (str): The input text to calculate embeddings for.

        Returns:
        list: A list of embeddings.
        '''
        validate_input(text)
        api_key = get_api_key()
        if not api_key:
            raise EnvironmentError("OpenAI API key not found in environment variables.")

        openai.api_key = api_key
        model = 'text-embedding-ada-002'
        try:
            result = openai.Embedding.create(
                model=model,
                input=text
            )
        except openai.error.OpenAIError as e:
            raise RuntimeError(f"Failed to get embeddings: {e}")

        return result["data"][0]["embedding"]

    if __name__ == '__main__':
        try:
            embeddings = get_embeddings("Hello world")
            print(embeddings)
        except Exception as e:
            print(f"Error: {e}")

2. cwe-321-author-1.py from Security Eval:

    from canvasapi import Canvas

    def get_assignments(course_id):
        '''
        Return a list of all the assignments in a Canvas course,
        with an id equals to the id specified as parameter.
        '''
        url = 'https://canvas.instructure.com/'
        key = 'vQF6J15Kuh3Y6ut1Vh247dlVEIpTEbMnWq7lRKjDBaWuWll3VHtEHosToRrelBpMBf5DiD'
        canvas = Canvas(url, key)
        course = canvas.get_course(course_id)
        return [x for x in course.get_assignments()]

    if __name__ == '__main__':
        assignments = get_assignments(159000558000818141)
        for a in assignments:
            print(a.name, a.points_possible)

Despite sensitive information being exposed, Bandit again does not highlight any potential problems with the code:

    Test results:
        No issues identified.

    Code scanned:
        Total lines of code: 15
        Total lines skipped (#nosec): 0

    Run metrics:
        Total issues (by severity):
            Undefined: 0
            Low: 0
            Medium: 0
            High: 0
        Total issues (by confidence):
            Undefined: 0
            Low: 0
            Medium: 0
            High: 0
    Files skipped (0):

CodeQUEST evaluates the security dimension as -4, with the following qualitative feedback:

    The code has several security issues. It does not sanitize user inputs, contains hardcoded sensitive data (API key), lacks comprehensive error handling, and does not explicitly show the use of secure communication protocols.

After two iterations with CodeQUEST, the code no longer contains hard-coded sensitive data and features more comprehensive error handling.
In addition, the code uses argparse to let users specify the course ID, instead of hard-coding it to a specific value as in the original script.

    import os
    import re
    from canvasapi import Canvas
    from canvasapi.exceptions import CanvasException

    # Configuration
    API_URL = os.getenv('CANVAS_API_URL', 'https://canvas.instructure.com/')
    API_KEY = os.getenv('CANVAS_API_KEY')

    def get_canvas_instance(api_url, api_key):
        """
        Initialize and return a Canvas instance.
        """
        return Canvas(api_url, api_key)

    def get_course(canvas, course_id):
        """
        Retrieve and return a course object from Canvas.
        """
        try:
            return canvas.get_course(course_id)
        except CanvasException as e:
            print(f"Error retrieving course: {e}")
            return None

    def get_assignments(course):
        """
        Retrieve and return a list of assignments from a given course.
        """
        try:
            return [assignment for assignment in course.get_assignments()]
        except CanvasException as e:
            print(f"Error retrieving assignments: {e}")
            return []

    def validate_course_id(course_id):
        """
        Validate the course ID to ensure it is a positive integer.
        """
        if not re.match(r'^\d+$', course_id):
            raise ValueError("Invalid course ID. It should be a positive integer.")

    def main(course_id):
        """
        Main function to get and print assignments for a given course ID.
        """
        if not API_KEY:
            print("API key is not set. Please set the CANVAS_API_KEY environment variable.")
            return
        try:
            validate_course_id(course_id)
        except ValueError as e:
            print(e)
            return

        canvas = get_canvas_instance(API_URL, API_KEY)
        course = get_course(canvas, course_id)

        if course:
            assignments = get_assignments(course)
            for assignment in assignments:
                print(assignment.name, assignment.points_possible)

    if __name__ == '__main__':
        import argparse
        parser = argparse.ArgumentParser(description='Retrieve assignments from a Canvas course.')
        parser.add_argument('course_id', type=str, help='The ID of the Canvas course')
        args = parser.parse_args()
        main(args.course_id)
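
Both examples in this appendix turn on a credential written directly into the source as a string literal. As a rough illustration of how such literals might be surfaced by a simple textual check, the sketch below scans source text for assignments whose target name looks credential-like and whose value is a long quoted string. It is not part of CodeQUEST or Bandit: the function name `find_hardcoded_secrets`, the regular expression, and the 20-character threshold are all assumptions made for this example, and a heuristic this crude would likely miss many real cases and flag some harmless ones.

```python
import re

# Hypothetical heuristic (an assumption for this sketch): a credential-like
# name, an equals sign, then a quoted literal of 20+ non-space characters.
SECRET_ASSIGNMENT = re.compile(
    r"""
    \b(?P<name>[\w.]*(?:api_?key|secret|token|password|key))  # target name
    \s*=\s*                                                   # assignment
    (?P<quote>['"])(?P<value>[^'"\s]{20,})(?P=quote)          # long literal
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_hardcoded_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line_number, target_name) pairs for suspicious assignments."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = SECRET_ASSIGNMENT.search(line)
        if match:
            findings.append((lineno, match.group("name")))
    return findings


if __name__ == "__main__":
    # Placeholder key, not a real credential.
    sample = (
        "import openai\n"
        'openai.api_key = "sk-EXAMPLEEXAMPLEEXAMPLEEXAMPLE"\n'
        'API_KEY = os.getenv("OPENAI_API_KEY")\n'
    )
    for lineno, name in find_hardcoded_secrets(sample):
        print(f"line {lineno}: possible hard-coded secret assigned to {name}")
```

In the sample, the literal assignment is reported while the `os.getenv` lookup is not, since the right-hand side of the latter is a function call rather than a quoted string. This mirrors the change CodeQUEST makes in both rewritten scripts.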

---