Authors: Rundong Liu, Andre Frade, Amal Vaidya, Maxime Labonne, Marcus Kaiser, Bismayan Chakrabarti, Jonathan Budd, Sean Moran
Paper Content:
On Iterative Evaluation and Enhancement of
Code Quality Using GPT-4o
Rundong Liu∗, André Frade∗, Amal Vaidya, Maxime Labonne,
Marcus Kaiser, Bismayan Chakrabarti, Jonathan Budd,
Sean Moran†
JPMorgan Chase
Abstract
This paper introduces CodeQUEST, a novel framework leveraging
Large Language Models (LLMs) to iteratively evaluate and enhance code
quality across multiple dimensions, including readability, maintainabil-
ity, efficiency, and security. The framework is divided into two main
components: an Evaluator that assesses code quality across ten dimen-
sions, providing both quantitative scores and qualitative summaries, and
an Optimizer that iteratively improves the code based on the Evalua-
tor’s feedback. Our study demonstrates that CodeQUEST can effec-
tively and robustly evaluate code quality, with its assessments aligning
closely with established code quality metrics. Through a series of experi-
ments using a curated dataset of Python and JavaScript examples, Code-
QUEST demonstrated significant improvements in code quality, achieving
a mean relative percentage improvement of 52.6%. The framework’s evaluations
were validated against a set of proxy metrics comprising Pylint
Score, Radon Maintainability Index, and Bandit output logs, showing
a meaningful correlation. This highlights the potential of LLMs in au-
tomating code quality evaluation and improvement processes, present-
ing a significant advancement toward enhancing software development
practices. The code implementation of the framework is available at
https://github.com/jpmorganchase/CodeQuest.
1 Introduction
The production of high-quality code remains a top priority for organizations
developing software. High-quality code extends beyond syntactic or semantic
correctness, encompassing attributes such as freedom from errors, readability,
efficiency, portability, usability, testability, and maintainability [20].
∗These authors contributed equally to this work
†Corresponding author: sean.j.moran@jpmchase.com
arXiv:2502.07399v1 [cs.SE] 11 Feb 2025
Evaluating code quality is a complex process due to the subjective nature
of the concept, leading to the creation of various language-specific and non-
comprehensive tools that cover very specific aspects under the code quality
concept umbrella [10]. For example, the Python ecosystem provides security
linters for identifying security vulnerabilities, style linters for identifying devia-
tions from style guides, type checks, and error linters [36]. Moreover, subjective
aspects like code design and readability are often guided by “best practices”. It
is also important to understand that evaluation of code quality, while crucial,
is just one part of ensuring production-ready code. This is typically achieved
through extensive code reviews, where multiple developers examine and mod-
ify the scripts until they meet the required standards. While important, this
process is resource intensive, time-consuming, and it generally places additional
burdens on senior developers that are ultimately responsible for vetting final
code versions. Automating this process could significantly boost code develop-
ment productivity.
Recent advancements in Large Language Model (LLM) technologies have
shown promise in coding-related tasks, including code generation and evalu-
ation [1] [30] [17]. LLMs offer unique advantages for code evaluation due to
their training on diverse programming languages and vast datasets, enabling
them to first understand and then assess as well as improve code quality. More-
over, their ability to perform tasks based on prompts provides a flexible and
highly configurable paradigm for easily defining complex tasks in natural lan-
guage [13]. Studies are increasingly focusing on how prompt engineering and
complexity affect the LLMs’ ability to generate accurate, maintainable code.
For example, some research compares one-shot and iterative prompt strategies
to understand how more detailed or structured prompts impact the functional-
ity and readability of the generated code, showing that prompt complexity can
significantly enhance the robustness of outputs [2]. Other studies assess GPT-
4’s performance against alternative models, highlighting notable improvements
in debugging capabilities and responsiveness to follow-up prompts. Specifically,
recent work investigates GPT’s self-correction capacity when supplied with tar-
geted feedback, such as error messages or static analysis results, highlighting
the model’s ability to refine code and address issues like smells or unnecessary
complexity through iterative feedback [25] [16].
While LLMs’ zero-shot capabilities have been widely exploited, recent studies
[28] [3] [37] [38] have suggested that borrowing concepts from Reinforcement
Learning could significantly improve the capability of LLM-based agents to perform
reasoning and planning tasks. For instance, Zhang et al. [37] have drawn
inspiration from the classical actor-critic method [14] to design a TripleCritic
mechanism, which has significantly improved the success rate on multi-agent
planning tasks.
Despite their potential, using LLMs for such complex tasks also presents
challenges. LLMs are not inherently designed for quantitative assessments,
which may affect the reliability of their scores [8]. Moreover, the probabilistic
nature of LLMs can lead to inconsistencies. Indeed, small modifications to
the prompt or a change of the LLM random seed can potentially lead to vastly
different results [15] or “hallucinations” which can be challenging to detect and
control [18] [9]. Nonetheless, it is clear that LLMs hold immense potential to
revolutionize code quality evaluation and improvement.
Figure 1: Schematic representation of CodeQUEST. (Diagram labels: Input code;
Evaluator assessment; Optimizer, comprising code quality improvement, code
validation and Evaluator assessment; Final code; Final Evaluator report.)
In this paper, we introduce CodeQUEST (Code Quality Understanding and
Enhancement System Toolkit), an LLM-powered framework for assessing and
improving code quality. CodeQUEST consists of an Evaluator and an Opti-
mizer that work together through an iterative process to enhance the quality
of code. The rest of the paper is organized as follows: Section 2 introduces the
CodeQUEST framework and its two components. Section 3 details the experi-
mental setup. Section 4 presents results and insights on improving code quality
using LLMs. Section 5 outlines threats to validity, and Section 6 concludes the
discussion.
2 The CodeQUEST Framework
The CodeQUEST framework consists of two components: an Evaluator and an
Optimizer that leverage an LLM to assess and improve code quality, respectively.
In this work, we used the most recently released GPT-4o from the GPT-4 model
series [1], and code quality was defined across ten dimensions: Readability,
Maintainability, Testability, Efficiency, Robustness, Security, Documentation,
Modularity, Scalability, and Portability.
The Evaluator first generates a code quality score¹ and a text-based evaluation
for each dimension, which is used to produce an aggregated summary
of dimension-wise evaluations. The Optimizer is then responsible for improving
the code quality over a fixed number of iterations (or until a target quality score
is reached) based on the Evaluator’s feedback. Each iteration involves three se-
quential stages: code quality improvement, code integrity validation, and code
quality re-assessment. A detailed description of each stage is provided in this
section below. A diagram of the end to end process can be found in Figure 1.
¹ Under the CodeQUEST framework, a code-level quality score is calculated by
averaging the code quality scores of the individual dimensions.
2.1 Evaluator
As mentioned above, the Evaluator uses ten dimensions to assess code quality.
Each dimension is addressed through a set of five questions or statements,
carefully crafted to comprehensively cover its key aspects (see Appendix B for
the full list). These questions and statements were designed to a) be general,
and thus applicable to most programming languages, and b) not overlap in the
aspects of the dimension they probe, hence avoiding over-representation of any
particular aspect. Furthermore, each question or statement is formulated such
that it can only be answered with “True”, “False” or “Not Applicable”, where
“True” reflects that high-quality characteristics are present in the code, in contrast
to “False”. We address each of the code quality dimensions in a separate
query to the LLM, in which the corresponding set of five dedicated questions or
statements is provided to the LLM for its consideration.
Inspired by theoretical efforts [12] [32] on the impact of language ambiguity
on LLM performance, we carefully crafted prompts enabling LLMs to apply
their internal knowledge in a focused and unambiguous manner. Limiting the
possible output to quantifiable answers allows us to map the model output to a
numerical scale. The output includes the answers to each of the five dedicated
questions or statements (which enable the quantitative assessment) and a high-
level summary of the code in light of the dimension under consideration (which
serves as a qualitative assessment).
Quantitative results are derived by assigning numerical values to each answer
(+1 for “True”, -1 for “False”, and 0 for “Not Applicable”). For each dimension,
we sum up the five responses to obtain a score ranging from -5 to 5, where
higher scores represent higher code quality along the dimension axis. Finally,
dimension-specific scores are averaged to obtain a “code-level” score. We also
ask GPT-4o to summarize the qualitative assessments across all ten dimensions,
obtaining a code-level summary.
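The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the function names and the sample answers are hypothetical, and only the +1/-1/0 mapping, the per-dimension sum and the final average come from the text.

```python
from statistics import mean

# Numerical value assigned to each permitted answer (Section 2.1).
ANSWER_VALUES = {"True": 1, "False": -1, "Not Applicable": 0}


def dimension_score(answers):
    """Sum the five mapped answers for one dimension; the result lies in [-5, 5]."""
    if len(answers) != 5:
        raise ValueError("each dimension is probed with exactly five statements")
    return sum(ANSWER_VALUES[a] for a in answers)


def code_level_score(answers_by_dimension):
    """Average the dimension-specific scores into one code-level score."""
    return mean(dimension_score(a) for a in answers_by_dimension.values())


# Hypothetical answers for two of the ten dimensions, for illustration only.
example = {
    "Readability": ["True", "True", "False", "True", "Not Applicable"],
    "Security": ["False", "False", "True", "Not Applicable", "False"],
}
print(dimension_score(example["Readability"]))  # 1 + 1 - 1 + 1 + 0 = 2
print(code_level_score(example))                # (2 + (-2)) / 2 = 0
```

Because “Not Applicable” contributes 0, a dimension with little evidence in the code is pulled towards the middle of the scale rather than towards either end.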
The prompt template, based on zero-shot Chain-of-Thought (CoT) [35], is
provided below. It combines 1) the code to be evaluated; 2) the set of five ques-
tions or statements for a given quality dimension; 3) the task to be performed,
and 4) the desired output format:
--- System ---
You are a helpful and harmless AI software engineer.
You must provide an answer to the following request.
Be brief and precise.
--- Human ---
### CODE:
```
{code}
```
### STATEMENTS:
{dimension_statements}
### TASK:
Think step by step to assess the veracity of each STATEMENT
in light of the CODE provided.
Your answer to each statement must come from one of the following:
* -1 if the statement is false,
* 1 if the statement is true,
* 0 if the statement is not applicable or there is not enough
evidence in the CODE to address it.
You must also provide a short summary about the quality of the
code from a {quality_dimension} perspective, justifying your
answers across the various statements.
### OUTPUT:
Return your answer in valid JSON as shown below:
```json
{{ "insight": <code quality summary:str>,
"scores": [<score_to_statement1:int>, ...]
}}
```
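As a rough sketch of how such a template might be used, the snippet below fills an abridged copy of the Human message and parses a reply in the requested JSON format. It rests on two assumptions: that the doubled braces in the OUTPUT section indicate `str.format`-style templating, and that the model wraps its JSON in a fenced block as instructed. The helper names and the sample reply are hypothetical, and no LLM call is made.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically for easy embedding

# Abridged copy of the Human message; the full TASK text is elided here.
HUMAN_TEMPLATE = (
    "### CODE:\n" + FENCE + "\n{code}\n" + FENCE + "\n"
    "### STATEMENTS:\n{dimension_statements}\n"
    "### TASK:\n(task text as above, for the {quality_dimension} perspective)\n"
)


def build_prompt(code, statements, dimension):
    """Fill the three placeholders of the template."""
    return HUMAN_TEMPLATE.format(
        code=code,
        dimension_statements="\n".join(statements),
        quality_dimension=dimension,
    )


def parse_reply(reply):
    """Extract the JSON object from the reply and validate the five scores."""
    match = re.search(FENCE + r"json\s*(\{.*?\})\s*" + FENCE, reply, re.DOTALL)
    payload = json.loads(match.group(1) if match else reply)
    scores = payload["scores"]
    if len(scores) != 5 or any(s not in (-1, 0, 1) for s in scores):
        raise ValueError("expected five scores drawn from -1, 0 and 1")
    return payload["insight"], scores


prompt = build_prompt("def f(x):\n    return x", ["The code is documented."], "Documentation")

# Hypothetical model reply: free-form reasoning followed by the JSON block.
reply = (
    "Step-by-step reasoning...\n"
    + FENCE + 'json\n{"insight": "Readable but undocumented.", '
    + '"scores": [1, 1, -1, 0, 1]}\n' + FENCE
)
insight, scores = parse_reply(reply)
print(sum(scores))  # dimension score: 1 + 1 - 1 + 0 + 1 = 2
```

Validating the score list at parse time means a malformed reply is rejected before it can distort the dimension score.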
Note that due to the inherent non-deterministic nature of GPT-4o (even
when a seed is specified and the temperature is set to zero) [21], we allow users
to opt in to self-consistency reasoning [34]² to further reduce the variance of the
code quality score.
² The CodeQUEST framework allows users to apply self-consistency when evaluating
each code quality dimension: a dimensional score is obtained by averaging the
scores across retries, and a dimensional evaluation is obtained by asking GPT-4o
to summarize the text evaluations across the different retries.
2.2 Optimizer
In this section, we describe the Optimizer, which is tasked with improving the
quality of code based on the feedback generated by the Evaluator. While qualitative
results set a clear direction for code improvement, quantitative feedback
is leveraged to ensure that progress occurs in a unidirectional manner. The
Optimizer operates in an iterative improvement cycle, where each iteration includes
three steps: code quality improvement, code validation, and generated
code evaluation.
2.2.1 Code Quality Improvement
To generate an improved version of the code, the original code snippet and its
qualitative assessment results are fed to GPT-4o. The task involves addressing
all the areas of improvement identified in the assessment feedback through
code modifications (the following prompt shares the system message with the
Evaluator’s prompt):
--- Human ---
### Code:
```
{code}
```
### Quality Dimensions Feedback:
{quality_insight}
### TASK:
You are provided with a code script and detailed feedback
for each quality dimension.
For each quality dimension, you are provided with:
* A score from -5 to 5.The higher the score,
the better the quality.
* Dimension insights,highlighting potential areas
of improvement.
Think step by step to complete the following:
1) For each dimension, reflect on the score and insights.
2) Condense a list of improvement points, so that the code
would be evaluated at a higher score for each dimension.
3) Improve the code script according to the improvement
points, prioritizing dimensions with lower scores.
4) Return:
* the improvement points identified
* the improved version of the code script
* explanations for each of the changes you’ve made
Note:
* ALL improvement points MUST be addressed via meaningful
changes to the code.
### OUTPUT:
Your final output contains two parts:
Return your answer in a valid JSON as shown below:
```json
{{
"improvement_points": List[str],
"explanation_report": List[str]
}}
```
Then quote your code in the following section:
```improved_code
{{improved_code_here}}
```
2.2.2 Code Validation
The code validation stage ensures that the LLM-modified code can be compiled,
as a minimal requirement for the code to be deemed valid and the process to
proceed. Additionally, test cases built for the original code can optionally be
executed against all improved code versions to ensure the intended functionality
and expected behavior. Such validation checks constitute a first step
towards the meaningful evolution of the code.
Note that the failure of either check leads to a rejection of the generated
code as a candidate for the next iteration. A failed attempt does, however, still
count as an iteration towards the total number of iterations.
Self-reflection and correction [28] based on the compiler or execution outputs
were left as an interesting direction for future work.
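For Python candidates, the two checks described in this subsection could look roughly like the sketch below. It is one possible reading, not the framework's actual code: the function names are hypothetical, "can be compiled" is interpreted as "parses as Python source", and the tests are assumed to be plain `assert` statements of the kind shipped with MBPP.

```python
import ast


def compiles(source):
    """Minimal validity check: does the candidate parse as Python source?"""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def passes_tests(source, test_source):
    """Run assert-style tests written for the original code against a candidate.

    Caution: exec runs arbitrary code; LLM output should be executed in a sandbox.
    """
    namespace = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        exec(compile(test_source, "<tests>", "exec"), namespace)
        return True
    except Exception:
        return False


def validate(source, test_source=None):
    """Reject the candidate if either check fails; tests are optional."""
    if not compiles(source):
        return False
    return test_source is None or passes_tests(source, test_source)


tests = "assert add(2, 3) == 5"
print(validate("def add(a, b):\n    return a + b\n", tests))  # True
print(validate("def add(a, b) return a + b"))                 # False: syntax error
print(validate("def add(a, b):\n    return a - b\n", tests))  # False: test fails
```

Running the syntax check first keeps the more expensive, and riskier, test execution for candidates that are at least well-formed.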
2.2.3 Evaluator Assessment
A successful code validation stage is followed by a new Evaluator assessment.
At this stage, a drop in the overall quality score of the new code version relative
to that of the previous version indicates that no overall improvement was
achieved. Under these circumstances, a new iteration is triggered, using the
inputs of the previous cycle for a new code improvement attempt. Again, note
that an unsuccessful improvement attempt still counts as an iteration.
Otherwise, if the overall quality score of the new code increases relative to
the last successful version, the iteration is deemed successful. In this case, the
quality references are updated: the new overall quality score becomes the new
numerical baseline for improvement and the qualitative feedback is used as input
to the code quality improvement of the next cycle.
The number of iterations and the overall quality target score are both hyperparameters
of the Optimizer. The improvement cycle terminates when either
the maximum number of iterations or the target quality score is reached.
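The control flow of Sections 2.2.1 to 2.2.3 can be summarized by the following sketch. The `evaluate`, `improve` and `validate` callables are hypothetical stand-ins for the Evaluator, the improvement prompt and the validation stage. Treating an unchanged score as "no improvement" is an assumption, since the text only distinguishes drops from increases.

```python
def optimize(code, evaluate, improve, validate, max_iterations=5, target_score=5.0):
    """Iterate improve -> validate -> re-evaluate, keeping only strict improvements.

    evaluate(code) -> (score, feedback); improve(code, feedback) -> new code;
    validate(code) -> bool. All three are caller-supplied stand-ins.
    """
    best_code = code
    best_score, feedback = evaluate(code)
    history = [best_score]
    for _ in range(max_iterations):      # every attempt consumes one iteration
        if best_score >= target_score:
            break
        candidate = improve(best_code, feedback)
        if not validate(candidate):
            continue                     # rejected: retry from the same inputs
        score, new_feedback = evaluate(candidate)
        if score > best_score:           # assumption: ties are not improvements
            best_code, best_score, feedback = candidate, score, new_feedback
            history.append(best_score)
    return best_code, best_score, history


# Toy stand-ins so the loop can be exercised without an LLM.
def toy_evaluate(code):
    return float(min(5, code.count("# ok"))), "add more comments"


def toy_improve(code, feedback):
    return code + "\n# ok"


final_code, final_score, history = optimize(
    "x = 1", toy_evaluate, toy_improve, lambda c: True, target_score=2.0
)
print(history)  # [0.0, 1.0, 2.0]
```

Because a rejected candidate leaves `best_code` and `feedback` untouched, the next attempt starts from the same inputs, matching the retry behavior described above.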
3 Experimental Setup
3.1 Dataset
We hand-curated a small but representative sample of 42 code examples across
nine open-source libraries, including Scipy [33], SecurityEval [29], AWS-CDK-
Examples [6] and Science-JS [11]. We focused on Python and JavaScript due to
their popularity, and aimed for examples of varying code length and intended
functionality. For future research, it would be interesting to expand the scope
of the analysis to other programming languages, for example considering the
QScored [27] dataset focused on C# and Java.
The majority of the code sourced from reputable open-source libraries should
be of reasonably high quality. Thus, some of the code pieces were manually
modified to enable scope for improvement. This included annotation removal
and non-semantic changes to the code, whilst leaving the original logic intact. In
Appendix A, we provide an example of reduced quality, where a utility function
is re-defined multiple times, hence reducing the modularity of the code.
Sources              No. Examples  Language            Test cases  No. Lines
auth0 [5]            2             JavaScript          False       165
aws-cdk [6]          2             Python              False       36
mbpp [4]             8             Python              True        19
sciencejs [11]       6             JavaScript          False       112
scipy [33]           3             Python              False       117
scikit-learn [22]    3             Python              False       142
SnoopySecurity [26]  8             Python, JavaScript  False       60
fportantier [24]     3             Python              False       39
SecurityEval [29]    7             Python              False       21
Table 1: Detailed structural breakdown of the curated dataset.
In this study we focused on 42 code examples (28 Python and 14 JavaScript),
sourced from 9 different data sources. Table 1 shows the complete list of data
sources and associated metadata. Furthermore, test cases for 8 Python exam-
ples from MBPP [4] were available, which we leveraged unchanged. For future
research, it would be interesting to use LLMs to explicitly generate additional
test cases [28].
3.2 Baseline solution
To perform an ablation study of the Evaluator’s prompt, we considered a simpler
CoT prompt to generate baseline results. This approach assumes that GPT-4o’s
internal knowledge is sufficient for an accurate and comprehensive evaluation of
code quality. In our experiments, we compare it with CodeQUEST to validate
the effectiveness of our solution. Again, the following prompt shares the system
message with the Evaluator’s prompt.
---- Human ----
### CODE:
```
{code}
```
### TASK:
Think step by step to produce both a quantitative and
qualitative assessment of the CODE provided.
* Your qualitative assessment must be a short summary
about the quality of the CODE.
* Your quantitative assessment must be an integer on
a scale from -5 to 5, which respectively represent the
low and high-quality ends of the scale.
Both types of evaluations must agree with each other.
### OUTPUT:
Return your answer in a valid JSON as shown below:
```json
{{
"insight": <qualitative assessment:str>,
"score": <quantitative assessment:int>
}}
```
3.3 Validating Scores With Proxies
Existing and well-established quality metrics are used in this study as a proxy to
validate LLM-based code evaluations. This validation is limited to Python code
due to the logistics of availability, implementation, and execution of such ana-
lytical tools. The following code quality evaluation capabilities were selected for
this study as an attempt to cover as many code quality dimensions as possible:
1. Pylint³ [31] checks code against the style recommended by PEP 8, while penalizing
unused variables, lengthy code, and ill-named variables. Pylint returns a
score from 0 to 10, which we use as a proxy metric for Readability and
Documentation.
2. Radon Maintainability Index [19] is used as a proxy for code Maintainability,
similar to [23]. The measure is a score from 0 to 100 derived
from other metrics, including Source Lines of Code (SLoC), Cyclomatic
Complexity, and Halstead Volume. We scale the score to range from 0 to
10.
3. Bandit [7] scans Python code for security issues using abstract
syntax trees. Bandit does not provide a score, so we applied a simple
heuristic that transforms the execution standard output into a score from
0 to 10 to define our proxy for Security.
For simplicity, we define our overall proxy code quality score as the average
of these three metrics, resulting in a final score ranging from 0 to 10.
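The aggregation can be illustrated with a short sketch. The function name is hypothetical, and the Bandit-derived score is taken as an input because the heuristic that produces it is not specified here. Only the rescaling of the Maintainability Index and the plain average follow from the text.

```python
def proxy_score(pylint_score, radon_mi, bandit_score):
    """Average three proxy metrics on a common 0-10 scale.

    pylint_score: Pylint rating, 0-10.
    radon_mi: Radon Maintainability Index, 0-100 (rescaled here to 0-10).
    bandit_score: result of some heuristic mapping Bandit output to 0-10;
        the heuristic itself is not specified and is left to the caller.
    """
    checks = (
        ("pylint_score", pylint_score, 10),
        ("radon_mi", radon_mi, 100),
        ("bandit_score", bandit_score, 10),
    )
    for name, value, upper in checks:
        if not 0 <= value <= upper:
            raise ValueError(f"{name} must lie in [0, {upper}]")
    return (pylint_score + radon_mi / 10 + bandit_score) / 3


print(proxy_score(7.5, 60.0, 9.0))  # (7.5 + 6.0 + 9.0) / 3 = 7.5
```

An unweighted mean gives each of the three proxies equal influence on the overall proxy score.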
³ We enabled all default Pylint extensions apart from mccabe, which computes the
cyclomatic complexity, to avoid partial overlap with Radon.
4 Results
4.1 Can GPT-4o evaluate code quality?
To evaluate CodeQUEST’s Evaluator, we compare its evaluation results to
those produced by the baseline setup. In both cases, the output consists of
an overall code quality score (quantitative result) as well as a short summary
report (qualitative feedback). Each code example was evaluated five
times by both setups⁴, allowing us to gather statistics on the consistency of the
outputs produced.
Example: mbpp/601.py
Baseline (score: 2.0)
The code is functional and correctly implements the logic to find the maximum
chain length of pairs. However, it lacks comments, proper variable naming, and
error handling. The use of ’max’ as a variable name is not recommended as it
shadows the built-in function ’max’. The code could be improved for readability
and maintainability.
CodeQUEST (score: -1.3)
The provided code is a straightforward implementation of finding the maximum
chain length of pairs using a dynamic programming approach. While the logic is
correct, the code has several areas for improvement. It lacks readability and
consistent style, with variable names that could be more descriptive. The use of
’max’ as a variable name is misleading. The code does not include comments to
explain the logic, especially the nested loops and the purpose of the ’mcl’ array.
It is moderately maintainable but could benefit from refactoring to improve
efficiency and reduce repetition. The code is easy to test due to its simplicity
and lack of external dependencies, but it does not facilitate mocking of
dependencies. The nested loops result in O(n^2) time complexity, making it
inefficient for large datasets. The code lacks input validation, error handling,
and concurrency considerations. It is not designed with scalability or resource
efficiency in mind. Additionally, the code lacks documentation and modularity,
making it difficult to test and modify parts independently. Despite these issues,
the code is portable as it avoids platform-specific features and uses standard
libraries.
Table 2: Comparison of the qualitative and quantitative results with the baseline
solution and CodeQUEST on mbpp/601.py.
⁴ In this experiment, we disabled self-consistency for the CodeQUEST Evaluator to
ensure a fair comparison.
Figure 2: Comparison of the Baseline and CodeQUEST code quality scores.
Each code example was evaluated five times. The mean and standard deviation
were calculated across these five samples for the baseline (top) and CodeQUEST
(bottom).
- Qualitative evaluation: As illustrated in Table 2 and Appendix C,
qualitative results produced by CodeQUEST were found to be more comprehensive
and detailed than those of the baseline solution. For
example, the CodeQUEST Evaluator performed an additional time complexity
analysis of mbpp/601.py, identifying areas of improvement for Scalability.
Nonetheless, both sets of results raise accurate points about the
code, giving us confidence that GPT-4o has advanced
intrinsic knowledge of code quality. We also found that the assessment
produced equally plausible results for Python and JavaScript code, which
shows that the LLM is able to generate insights across different programming
languages.
- Quantitative evaluation: As shown in Figure 2, our two strategies exhibit
different mean quality scores with some variability. The baseline
appears to systematically overestimate code quality compared to CodeQUEST
assessments. Since the deficiencies pointed out by the Evaluator
are the primary signal for driving code improvement by the Optimizer,
such overestimation of quality (both qualitative and quantitative) is undesirable,
as it may reduce the scope for improvement. For both setups, the
variability may be explained by a) the ambiguity of prompt questions or
statements, b) inherent LLM stochasticity, and c) the scale of possible
outcomes not being granular enough.
In summary, our results suggest that we can leverage GPT-4o for
both qualitative and quantitative assessment of code quality. Furthermore, the
accuracy and level of detail of LLM-based code evaluations are highly prompt-dependent.
We also find that CodeQUEST provides a more comprehensive,
standardized, and reliable evaluation due to the additional guidance enabled by
dimension-specific questions.
4.2 Can evaluations be used to drive code quality improvement?
In our study, we applied the CodeQUEST framework to each code example for
a maximum of five improvement cycles. All versions of the Python code were
checked for their ability to compile. Furthermore, all versions of the mbpp code
were validated against the corresponding test cases provided. All intermediary
and final results were recorded for downstream analysis.
The example shown in Figure 3 represents the general behavior observed
across all code examples. Out of the original 42 code instances, 41 were improved
to some extent. We report an average absolute improvement of 2.9 ± 1.3 in
code quality scale units⁵, with examples of lower initial quality exhibiting the
largest improvements.
To further analyze the improvements, we defined the Relative Percentage
Improvement (RPI) as
$$\mathrm{RPI} := 100 \, \frac{s_f^n - s_i^n}{s_{\max} - s_i^n} \qquad (1)$$
where $s_{\max} = 5$ is the maximum possible score, and $s_i^n$ and $s_f^n$ are the initial
and final scores for the $n$th code example, respectively. $N = 42$ is the total
number of examples in the dataset. Over the entire dataset, our framework
produces a mean RPI of 52.6% ± 17.9% and a median RPI of 57.1%. We
also verified that while our setup, by design, ensures a monotonic improvement
of overall code quality, the monotonic improvement of individual dimensions’
scores does not always occur, i.e., an overall improvement for the quality score
produced by an iteration might be accompanied by a decrease in some of the
dimension-specific scores.
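As a worked illustration of Equation (1), the sketch below computes the RPI for a few score pairs. The pairs are hypothetical and are not taken from the experimental data; the function name is likewise illustrative.

```python
from statistics import mean, median

S_MAX = 5.0  # maximum possible code-level score


def rpi(initial, final, s_max=S_MAX):
    """Relative Percentage Improvement: share of the available headroom gained."""
    if initial >= s_max:
        raise ValueError("RPI is undefined when the initial score is already maximal")
    return 100.0 * (final - initial) / (s_max - initial)


# Hypothetical (initial, final) score pairs, for illustration only.
pairs = [(-1.0, 2.0), (1.0, 4.0), (3.0, 3.0)]
values = [rpi(i, f) for i, f in pairs]
print(values)          # [50.0, 75.0, 0.0]
print(median(values))  # 50.0
print(mean(values))
```

The first two pairs both gain 3 absolute units, yet the second has a higher RPI because it started with less headroom (4 units rather than 6). This normalization is what makes improvements comparable across examples of different initial quality.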
Figure 4 displays the distribution of improvements achieved for each iteration
in absolute code quality units. We can observe that most of the improvement
occurs in the first iteration cycle, with an average increase of 2 code quality units.
The average improvements for iterations two and three were found to be 0.52 and
0.27, respectively. From the fourth iteration onwards, the improvement becomes
negligible (below 0.1 code quality units).
These results suggest that the magnitude of code quality improvement diminishes
with each iteration. This is particularly useful for limiting the GPT-4o
API costs incurred by the framework if CodeQUEST is adopted at scale, since
most of the gain is captured within the first few iterations.
This section therefore demonstrates that our Optimizer can utilize
the qualitative insights generated by the Evaluator to achieve significant
code quality improvement. Indeed, we notice a consistent alignment between
quantitative and qualitative results, such that an increase in the quality score
reflects a decrease in the code quality deficiencies reported by the quality summary.
We therefore believe that this setup is general enough to serve as a
general-purpose recipe for configurable code quality improvement.
⁵ As a by-product of this step, the original dataset was expanded with 208 additional
code examples, excluding 2 invalid attempts to improve Python code.
Figure 3: Quality score evolution of a Python code example (mbpp/927.py)
when subjected to the Optimizer. The quality score is provided in its overall and
dimension-specific format. Corresponding quality summaries and code snippets
can be found in Appendix D.
Figure 4: Distribution of code quality score improvements per iteration. (We
disabled self-consistency in the experiment)
4.3 Are these evaluations meaningful?
To confirm the validity of LLM-generated quality scores, we investigate their
level of agreement with the quality proxy scores described in Section 3.3. Given
each original code example and its corresponding refactored versions generated
by CodeQUEST, here denoted as $\{x_0, \dots, x_{n_{\max}}\}$, we can consider the
incremental change of the score per iteration, defined as
$$\Delta_{x_n}\mathrm{CodeQUEST} = \mathrm{CodeQUEST}(x_n) - \mathrm{CodeQUEST}(x_{n-1}), \qquad (2)$$
for iterations $n = 1, \dots, n_{\max}$. In our case, $n_{\max} = 5$. Similarly, we can define
$\Delta_{x_n}\mathrm{Proxy}$ and $\Delta_{x_n}\mathrm{Baseline}$ for the proxy and baseline scores, respectively.
Pairs                   r_s   r_p   p_s   p_p
(∆Proxy, ∆Baseline)     0.21  0.27  0.02  0.00
(∆Proxy, ∆CodeQUEST)    0.23  0.53  0.01  0.00
Table 3: Spearman $r_s$ and Pearson $r_p$ correlation with corresponding p-values
$p_s$ and $p_p$.
The iteration cycle applied to the 28 Python code examples described in Table 1
resulted in 138 improved code examples⁶ for which we calculated the incremental
change for each of the three scores. To measure how well the LLM-based
scores align with the proxy results, we compute the Pearson $r_p$ and Spearman
rank $r_s$ correlations between ∆Proxy and ∆CodeQUEST as well as ∆Proxy and
∆Baseline and report the results in Table 3 and Figure 5.
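To make the procedure concrete, the sketch below computes the per-iteration deltas of Equation (2) and the two correlation coefficients using only the standard library. The score trajectories are hypothetical, the helper names are illustrative, and p-values are omitted for brevity.

```python
from math import sqrt


def deltas(scores):
    """Incremental change per iteration: score(x_n) - score(x_{n-1})."""
    return [b - a for a, b in zip(scores, scores[1:])]


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical score trajectories for one code example (x_0 ... x_3).
proxy = [4.0, 5.0, 5.5, 6.5]
codequest = [-1.0, 1.0, 1.5, 3.5]
d_proxy, d_cq = deltas(proxy), deltas(codequest)
print(d_proxy, d_cq)  # [1.0, 0.5, 1.0] [2.0, 0.5, 2.0]
print(round(pearson(d_proxy, d_cq), 6), round(spearman(d_proxy, d_cq), 6))
```

In practice, library routines such as `scipy.stats.pearsonr` and `scipy.stats.spearmanr` return each coefficient together with its p-value. In the experiment the deltas from all examples are pooled before the correlation is computed, whereas this sketch uses a single trajectory.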
Results suggest that proxy scores are better correlated with
the CodeQUEST scores than with those of the baseline, most clearly for the
Pearson correlation (0.53 versus 0.27), while the Spearman correlations are
closer (0.23 versus 0.21). This suggests that,
given a base code example, CodeQUEST is able to iteratively assign scores that
better reflect the underlying variation in the code quality. We can observe from
Figure 5 that the relationship is relatively noisy. This is somewhat expected
since the proxies do not capture all the dimensions considered by CodeQUEST
and can be focused on certain aspects. For example, when we inspected individual
results, we found that CodeQUEST is able to detect security vulnerabilities
that were ignored by Bandit. We provide two such examples from SecurityEval
in Appendix E, along with their improved versions.
⁶ We set the maximum number of evolution cycles to 5 and excluded 2 improvement
attempts that were syntactically incorrect or failed the test cases.
Figure 5: Scatter plots and linear regression lines for
$\{\Delta_{x_n}\mathrm{Proxy}, \Delta_{x_n}\mathrm{CodeQUEST}\}_{x_n}$ (left) and
$\{\Delta_{x_n}\mathrm{Proxy}, \Delta_{x_n}\mathrm{Baseline}\}_{x_n}$ (right). The
shaded area represents the uncertainty of the linear regression fit.
5 Threats to validity
In this section, we acknowledge certain considerations regarding
the validity and robustness of our framework. Firstly, code quality is an intrinsically
subjective concept and, while our evaluations work well with the selected
questions, they might not be as effective if the questions are modified significantly.
However, given
the generalization capabilities of GPT-4o and its advanced knowledge of both
code and English language concepts, we believe this is unlikely as long as the
questions are reasonable and posed in the manner we detail in Section 2.1.
Moreover, the stochasticity of GPT-4o output also introduces some inherent
variability in the results. While we see good stability in our setup and take measures
to limit this variability by setting the temperature to 0, that might not hold for all code
and/or prompts. We also recommend using the self-consistency framework [34]
to address this, if required.
Hallucinations are also a well-known shortcoming of LLMs and this is an
active field of research, thus it is possible that the model hallucinates during
the improvement step of our framework. A detailed study with a larger dataset
and human validation would be required to ascertain the extent to which hal-
lucination might be a concern. We also recommend robust validation with test
cases to guard against this. Our framework might also struggle to improve code
that is very different from the training dataset of GPT-4o (e.g., highly custom
proprietary code bases).
Finally, further study would be required to fully validate the framework
across a broader range of coding languages. Here we note that the primary
limitation comes from the LLM’s (GPT-4o’s) knowledge of the programming
language, so the framework should be performant across most major program-
ming languages.
6 Conclusion
In this paper, we introduced CodeQUEST (Code Quality Understanding and
Enhancement System Toolkit), a GPT-4o powered framework capable of evaluating
and improving code quality across different domains. We show that our
framework is able to produce comprehensive qualitative and quantitative code
quality evaluation reports, and subsequently use these reports to iteratively
improve the quality of the code being analyzed. Our framework is easily
understandable and adjustable, and shows impressive results for Python as well
as JavaScript, indicating its applicability across languages. We also
demonstrate that our framework produces code evaluation scores that agree well
with rule-based code-quality evaluation libraries.
References
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya,
Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Alt-
man, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint
arXiv:2303.08774 , 2023.
[2] Jie Li Angus Yang, Zehan Li. Advancing genai assisted programming–a
comparative study on prompt efficiency and code quality between gpt-4
and glm-4. ArXiv , abs/2402.12782, 2024.
[3] Anonymous. Enhancing decision-making of large language models via
actor-critic. In Submitted to The Thirteenth International Conference on
Learning Representations , 2024. under review.
[4] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk
Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc
Le, et al. Program synthesis with large language models. arXiv preprint
arXiv:2108.07732 , 2021.
[5] auth0. https://github.com/auth0/node-jsonwebtoken.
[6] AWS. Cdk examples https://github.com/aws-samples/aws-cdk-examples.
[7] Bandit. https://github.com/pycqa/bandit.
[8] Veronika Hackl, Alexandra Elena Müller, Michael Granitzer, and Maximilian
Sailer. Is gpt-4 a reliable rater? evaluating consistency in gpt-4’s text
ratings. Frontiers in Education, 8, December 2023.
[9] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Hao-
tian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and
Ting Liu. A survey on hallucination in large language models: Principles,
taxonomy, challenges, and open questions, 2023.
[10] Petri Ihantola, Tuukka Ahoniemi, Ville Karavirta, and Otto Seppälä. Review
of recent systems for automatic assessment of programming assignments.
In Proceedings of the 10th Koli Calling International Conference on
Computing Education Research , Koli Calling ’10, page 86–93, New York,
NY, USA, 2010. Association for Computing Machinery.
[11] Davies Jason. Science-js https://github.com/jasondavies/science.js/.
[12] Hui Jiang. A latent space theory for emergent abilities in large language
models, 2023.
[13] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and
Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023.
[14] Vijay Konda and John Tsitsiklis. Actor-critic algorithms. In S. Solla,
T. Leen, and K. Müller, editors, Advances in Neural Information Processing
Systems , volume 12. MIT Press, 1999.
[15] Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin
Yang, and Jie Tang. Gpt understands, too. AI Open , 2023.
[16] Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamtha-
vorn, Li Li, Xuan-Bach D. Le, and David Lo. Refining chatgpt-generated
code: Characterizing and mitigating code quality issues. ACM Transactions
on Software Engineering and Methodology, 33(5), Article 116, pages 1-26, 2024.
[17] Mosleh Mahamud and Isak Samsten. Code quality assessment using trans-
formers, 2023.
[18] Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On
faithfulness and factuality in abstractive summarization. In Dan Jurafsky,
Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of
the 58th Annual Meeting of the Association for Computational Linguis-
tics, pages 1906–1919, Online, July 2020. Association for Computational
Linguistics.
[19] Lacchia Michele. Radon mi https://radon.readthedocs.io/en/latest/intro.html.
[20] Ifeanyi G. Ndukwe, Sherlock A. Licorish, Amjed Tahir, and Stephen G.
MacDonell. How have views on software quality differed over time? research
and practice viewpoints. Journal of Systems and Software , 195:111524,
2023.
[21] Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. Llm is like
a box of chocolates: the non-determinism of chatgpt in code generation.
arXiv preprint arXiv:2308.02828 , 2023.
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Pas-
sos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-
learn: Machine learning in Python. Journal of Machine Learning Research ,
12:2825–2830, 2011.
[23] Russel A. Poldrack, Thomas Lu, and G. Beguš. Ai-assisted coding:
Experiments with gpt-4. ArXiv, abs/2304.13187, 2023.
[24] Fabian Martinez Portantier. https://github.com/fportantier/vulpy.
[25] Jarosław Protasiewicz Przemysław Zydroń. Assessing code review quality
with chatgpt: A survey of automated reviewer assignment methods and ex-
perimental outcomes. Digital Interaction and Machine Intelligence. MIDI
2023. Lecture Notes in Networks and Systems. Springer, Cham , 1076, 2024.
[26] Sam Sanoop. https://github.com/snoopysecurity/vulnerable-code-
snippets.
[27] Tushar Sharma and Marouane Kessentini. Qscored: A large dataset of
code smells and quality metrics. In 2021 IEEE/ACM 18th international
conference on mining software repositories (MSR) , pages 590–594. IEEE,
2021.
[28] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan,
and Shunyu Yao. Reflexion: language agents with verbal reinforcement
learning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and
S. Levine, editors, Advances in Neural Information Processing Systems ,
volume 36, pages 8634–8652. Curran Associates, Inc., 2023.
[29] Mohammed Latif Siddiq and Joanna C. S. Santos. Securityeval dataset:
Mining vulnerability examples to evaluate machine learning-based code
generation techniques. In Proceedings of the 1st International Workshop
on Mining Software Repositories Applications for Privacy and Security
(MSR4P&S22), 2022.
[30] Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel,
Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, and
Toufique Ahmed. Calibration and correctness of language models for code,
2024.
[31] Sylvain Thénault. Pylint: https://github.com/pylint-dev/pylint.
[32] Rasul Tutunov, Antoine Grosnit, Juliusz Ziomek, Jun Wang, and Haitham
Bou-Ammar. Why can large language models generate correct chain-of-
thoughts?, 2024.
[33] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland,
Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren
Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett,
Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson,
Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng,
Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert
Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M.
Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and
SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific
Computing in Python. Nature Methods , 17:261–272, 2020.
[34] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sha-
ran Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency
improves chain of thought reasoning in language models. In The Eleventh
International Conference on Learning Representations , 2023.
[35] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter,
Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting
elicits reasoning in large language models. In S. Koyejo, S. Mohamed,
A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural
Information Processing Systems , volume 35, pages 24824–24837. Curran
Associates, Inc., 2022.
[36] Matthew Wilkes. Testing, checking, linting , pages 51–101. Apress, Berkeley,
CA, 2020.
[37] Bin Zhang, Hangyu Mao, Jingqing Ruan, Ying Wen, Yang Li, Shao Zhang,
Zhiwei Xu, Dapeng Li, Ziyue Li, Rui Zhao, Lijuan Li, and Guoliang
Fan. Controlling large language model-based agents for large-scale decision-
making: An actor-critic approach, 2024.
[38] Yuze Zhao, Zhenya Huang, Yixiao Ma, Rui Li, Kai Zhang, Hao Jiang,
Qi Liu, Linbo Zhu, and Yu Su. RePair: Automated program repair with
process-based feedback. In Lun-Wei Ku, Andre Martins, and Vivek Sriku-
mar, editors, Findings of the Association for Computational Linguistics:
ACL 2024 , pages 16415–16429, Bangkok, Thailand, August 2024. Associa-
tion for Computational Linguistics.
7 Conflict of interest
The authors declare that they have no conflict of interest.
8 Disclaimer
This paper was prepared for informational purposes by the Applied Innovation of
AI team of JPMorgan Chase & Co. This paper is not a product of the Research
Department of JPMorgan Chase & Co. or its affiliates. Neither JPMorgan
Chase & Co. nor any of its affiliates makes any explicit or implied representa-
tion or warranty and none of them accept any liability in connection with this
paper, including, without limitation, with respect to the completeness, accu-
racy, or reliability of the information contained herein and the potential legal,
compliance, tax, or accounting effects thereof. This document is not intended
as investment research or investment advice, or as a recommendation, offer, or
solicitation for the purchase or sale of any security, financial instrument, finan-
cial product or service, or to be used in any way for evaluating the merits of
participating in any transaction. The described work is a prototype and is not
a production deployed system.
A Manual code quality defect
Example of code that has been “degraded in quality”. We first show the original
code, followed by the modified code, in which the _update_x function is
redefined multiple times.
Original code example from SciPy:
class LinearVectorFunction:
    """Linear vector function and its derivatives.

    Defines a linear function F = A x, where x is N-D vector and
    A is m-by-n matrix. The Jacobian is constant and equals to A. The Hessian
    is identically zero and it is returned as a csr matrix.
    """
    def __init__(self, A, x0, sparse_jacobian):
        if sparse_jacobian or sparse_jacobian is None and sps.issparse(A):
            self.J = sps.csr_matrix(A)
            self.sparse_jacobian = True
        elif sps.issparse(A):
            self.J = A.toarray()
            self.sparse_jacobian = False
        else:
            # np.asarray makes sure A is ndarray and not matrix
            self.J = np.atleast_2d(np.asarray(A))
            self.sparse_jacobian = False

        self.m, self.n = self.J.shape

        self.xp = xp = array_namespace(x0)
        _x = atleast_nd(x0, ndim=1, xp=xp)
        _dtype = xp.float64
        if xp.isdtype(_x.dtype, "real floating"):
            _dtype = _x.dtype

        # promotes to floating
        self.x = xp.astype(_x, _dtype)
        self.x_dtype = _dtype

        self.f = self.J.dot(self.x)
        self.f_updated = True

        self.v = np.zeros(self.m, dtype=float)
        self.H = sps.csr_matrix((self.n, self.n))

    def _update_x(self, x):
        if not np.array_equal(x, self.x):
            _x = atleast_nd(x, ndim=1, xp=self.xp)
            self.x = self.xp.astype(_x, self.x_dtype)
            self.f_updated = False

    def fun(self, x):
        self._update_x(x)
        if not self.f_updated:
            self.f = self.J.dot(x)
            self.f_updated = True
        return self.f

    def jac(self, x):
        self._update_x(x)
        return self.J

    def hess(self, x, v):
        self._update_x(x)
        self.v = v
        return self.H
Reduced quality code example:
import numpy as np
import scipy.sparse as sps
from scipy._lib._array_api import array_namespace, atleast_nd

class LinearVectorFunction:
    """Linear vector function and its derivatives.

    Defines a linear function F = A x, where x is N-D vector and
    A is m-by-n matrix. The Jacobian is constant and equals to A. The Hessian
    is identically zero and it is returned as a csr matrix.
    """
    def __init__(self, A, x0, sparse_jacobian):
        if sparse_jacobian or sparse_jacobian is None and sps.issparse(A):
            self.J = sps.csr_matrix(A)
            self.sparse_jacobian = True
        elif sps.issparse(A):
            self.J = A.toarray()
            self.sparse_jacobian = False
        else:
            # np.asarray makes sure A is ndarray and not matrix
            self.J = np.atleast_2d(np.asarray(A))
            self.sparse_jacobian = False

        self.m, self.n = self.J.shape

        self.xp = xp = array_namespace(x0)
        _x = atleast_nd(x0, ndim=1, xp=xp)
        _dtype = xp.float64
        if xp.isdtype(_x.dtype, "real floating"):
            _dtype = _x.dtype

        # promotes to floating
        self.x = xp.astype(_x, _dtype)
        self.x_dtype = _dtype

        self.f = self.J.dot(self.x)
        self.f_updated = True

        self.v = np.zeros(self.m, dtype=float)
        self.H = sps.csr_matrix((self.n, self.n))

    def fun(self, x):
        def _update_x(x):
            if not np.array_equal(x, self.x):
                _x = atleast_nd(x, ndim=1, xp=self.xp)
                self.x = self.xp.astype(_x, self.x_dtype)
                self.f_updated = False
        _update_x(x)
        if not self.f_updated:
            self.f = self.J.dot(x)
            self.f_updated = True
        return self.f

    def jac(self, x):
        def _update_x(x):
            if not np.array_equal(x, self.x):
                _x = atleast_nd(x, ndim=1, xp=self.xp)
                self.x = self.xp.astype(_x, self.x_dtype)
                self.f_updated = False
        _update_x(x)
        return self.J

    def hess(self, x, v):
        def _update_x(x):
            if not np.array_equal(x, self.x):
                _x = atleast_nd(x, ndim=1, xp=self.xp)
                self.x = self.xp.astype(_x, self.x_dtype)
                self.f_updated = False
        _update_x(x)
        self.v = v
        return self.H
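The kind of defect injected in the reduced quality example can also be detected mechanically. As an illustrative sketch (not part of the CodeQUEST pipeline), the snippet below uses Python's `ast` module to count how often a function name is defined anywhere in a module, including nested definitions. The helper name `repeated_function_names` and the shortened sample source are assumptions made for this example.

```python
import ast
from collections import Counter


def repeated_function_names(source: str) -> dict:
    """Map each function name defined more than once to its definition count."""
    tree = ast.parse(source)
    counts = Counter(
        node.name
        for node in ast.walk(tree)  # visits nested definitions as well
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    )
    return {name: n for name, n in counts.items() if n > 1}


# Shortened stand-in for the reduced quality example.
sample = '''
class LinearVectorFunction:
    def fun(self, x):
        def _update_x(x):
            pass
        _update_x(x)

    def jac(self, x):
        def _update_x(x):
            pass
        _update_x(x)

    def hess(self, x, v):
        def _update_x(x):
            pass
        _update_x(x)
'''

print(repeated_function_names(sample))
```

Name counts alone cannot tell genuine duplication from legitimate reuse of a name in different scopes, so a check like this is best read as a hint rather than a verdict.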
B Dimension related questions/statements
The CodeQUEST Framework defines ten dimensions of code quality. For each
dimension, it further defines five questions/statements. The full list is given
below.
1. Readability
- Both variable and function names are descriptive and meaningful.
- The code consistently follows a single specific code style guide.
- There are comments that clearly explain complex or non-obvious
parts of the code provided, without assuming prior knowledge.
- The code provided is free of unexplained constants or magic numbers.
- Each existing function is dedicated to a single task.
2. Maintainability
- The code provided is organized in a logical and understandable man-
ner, allowing for easy comprehension.
- The code provided strictly adheres to the DRY (Don’t Repeat Yourself)
principle, avoiding unnecessary repetition.
- Code features can be added or modified without affecting existing
functionality.
- The code provided is effectively free of duplication, promoting effi-
ciency and maintainability.
- There are clear interfaces between different parts of the code pro-
vided, facilitating seamless interaction.
3. Testability
- The structure of the code provided facilitates easy mocking of depen-
dencies.
- The code provided produces consistent and predictable outputs for
specific inputs.
- The code provided is free of global states and variables.
- The code provided is free from deep nesting or complex control flow,
that could complicate testing.
- The code provided is organized in a way that allows the straightfor-
ward measurement of code coverage.
4. Efficiency
- The code provided makes efficient use of data structures.
- The code provided avoids creating unnecessary objects or data.
- The code provided avoids suboptimal computations, such as unnec-
essary loops or repeated operations that could be optimized.
- The code provided promotes the efficient use of system resources.
- The code provided addresses any existing bottlenecks that could slow
down the code.
5. Robustness
- Does the code provided validate and sanitize inputs in all relevant
scenarios?
- Does the code provided handle edge cases and unexpected inputs
gracefully in all relevant scenarios?
- Are there appropriate error handling and exception handling mech-
anisms in place for all relevant scenarios?
- Does the code provided handle errors and exceptions gracefully in all
relevant scenarios?
- Does the code provided account for any potential race conditions,
concurrency issues, or deadlock situations in all relevant scenarios?
6. Security
- The code provided consistently sanitizes user inputs to prevent in-
jection attacks.
- The code provided is completely free of hardcoded sensitive data,
such as passwords and API keys.
- The code provided adheres to established best practices for secure
coding.
- The code provided implements comprehensive error handling to pre-
vent leakage of sensitive information.
- The code provided utilizes secure communication protocols when per-
forming network operations.
7. Documentation
- Comments are provided to explain non-obvious parts of the code.
- There is a concise and clear description of the code’s functionality.
- Input parameters are documented.
- Output values are documented.
- Side effects are documented.
8. Modularity
- The code provided is divided into small, independent functions that
perform specific tasks.
- Individual parts of the code provided can be used, modified, and
tested independently without affecting other parts.
- The code provided avoids deep nesting and complex control flow
structures.
- The code provided adheres to the principles of high cohesion (re-
lated functionality within a single unit) and low coupling (minimal
dependencies between units).
- Different parts of the code are separated by well-defined interfaces to
facilitate communication and maintainability.
9. Scalability
- The code provided is designed to handle increased data loads efficiently,
or can be easily adapted to do so.
- The code provided is designed to handle an increased number of users
efficiently, or can be easily adapted to do so.
- The code provided makes efficient use of resources, such as CPU and
memory.
- The code provided is free of bottlenecks that could potentially limit
scalability.
- The code provided is designed to work in a distributed environment
efficiently, or can be easily adapted to do so.
10. Portability
- The code provided avoids relying on any platform-specific features or
behavior.
- The code provided can run in different environments without requir-
ing major changes.
- The code provided is free of hardcoded file paths or URLs that would
limit portability.
- The code provided uses standard libraries and APIs as much as pos-
sible.
- All dependencies are clearly specified and easy to install.
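To illustrate how answers to these statements could be turned into numbers, the sketch below assumes that each statement is answered with True, False, or Not applicable, mapped to +1, -1, and 0 respectively, that a dimension score is the sum over its five statements, and that the overall score is the mean over dimensions. This mapping and the function names are assumptions made for illustration; the scoring rule actually used by the Evaluator is the one described in the main text.

```python
from statistics import mean
from typing import Dict, List

# Assumed answer-to-points mapping (illustrative only).
POINTS = {"True": 1, "False": -1, "Not applicable": 0}


def dimension_score(answers: List[str]) -> int:
    """Sum the points of the answers given for one dimension."""
    return sum(POINTS[a] for a in answers)


def overall_score(answers_by_dimension: Dict[str, List[str]]) -> float:
    """Average the dimension scores into a single quality score."""
    return mean(dimension_score(a) for a in answers_by_dimension.values())


# Hypothetical answers for two of the ten dimensions.
example = {
    "Readability": ["True", "True", "False", "True", "True"],
    "Security": ["False", "Not applicable", "False", "False", "Not applicable"],
}

print(dimension_score(example["Readability"]))
print(dimension_score(example["Security"]))
print(overall_score(example))
```

Under these assumptions a dimension score lies between -5 and 5, and so does the overall score.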
C Code quality evaluation examples
| Example | snoopysecurity/path_traversal.js | Score |
|---|---|---|
| Baseline | The code demonstrates a basic implementation of an Apollo GraphQL server with a simple schema and resolver. It reads a file from the filesystem and returns its content as part of the response. However, there are several issues: it lacks error handling, uses synchronous file reading (fs.readFileSync) which can block the event loop, and has some unused variables (e.g., express and app). The ’readFile’ function should handle errors, and the ’books’ resolver should return a list of books instead of a single book with the file content as the title. Additionally, the use of ’var’ should be replaced with ’const’ or ’let’ for better scoping. Mixing Express and Apollo Server is unnecessary in this context. | 2.0 |
| CodeQUEST | The code is generally readable and organized logically with descriptive variable and function names. It includes comments explaining the purpose of the schema and resolvers, but lacks consistent code style and detailed documentation for input parameters, output values, and side effects. The code is modular, separating concerns by defining the schema, resolvers, and server setup in distinct sections. However, it does not strictly adhere to the DRY principle and has some repetition. The use of global variables and synchronous file reading complicates testing and can lead to inefficiencies and bottlenecks under increased data loads. The code lacks input validation, sanitization, and proper error handling mechanisms, posing security risks. It is generally portable but includes a hardcoded file path that could limit portability. Overall, the code is a basic implementation of an Apollo GraphQL server that is not designed with scalability or security best practices in mind. | -0.4 |

Table 4: Comparison of the qualitative and quantitative results with the baseline
solution and CodeQUEST on vulnerable-js/path_traversal.js
| Example | securityeval/CWE-319_author_1.py | Score |
|---|---|---|
| Baseline | The code is generally well-structured and follows good practices such as separating GET and POST requests for the login route. However, it has a significant security flaw: the check_credentials function always returns True, which means any username and password will be accepted. Additionally, there is no error handling for missing form fields, and the session ID is stored in a cookie without any additional security measures like HttpOnly or Secure flags. | 2.0 |
| CodeQUEST | The code is generally readable and organized with descriptive variable and function names, following a consistent style. It is modular and easy to understand, making it straightforward to test. However, it lacks comments explaining the logic in detail and does not strictly adhere to the DRY principle. The check_credentials function is overly simplistic and always returns True, which is not realistic for actual authentication. The code lacks input validation, sanitization, and proper error handling, making it vulnerable to injection attacks and other security issues. It does not handle edge cases or unexpected inputs gracefully and does not account for potential race conditions or concurrency issues. The code is portable as it avoids platform-specific features and uses standard libraries, but it does not specify dependencies explicitly. Overall, the code could benefit from improved security practices, better documentation, and more realistic authentication logic. | 0.5 |

Table 5: Comparison of the qualitative and quantitative results with the baseline
solution and CodeQUEST on vulnerable-python/CWE-319_author_1.py
The comparison of the qualitative feedback examples presented in Table 2
and Appendix C also suggests that the CodeQUEST score attribution strategy
enables a more meaningful relative positioning of different code snippets on the
code quality scale. Assuming consistency between the quality scores and the
qualitative results produced by the LLM, it can be deduced that baseline
scoring only partially reflects the dimensions considered by our Evaluator.
This explains why code examples 1 and 2, which contain security vulnerabilities
of different kinds, are both given the same positive score (2.0) by the
baseline, whereas the scores produced by the Evaluator seem better aligned with
the overall assessment expected from a code expert.
| Example | sklearn/RegressionModel.py | Score |
|---|---|---|
| Baseline | The code implements a basic linear regression model using gradient descent. It is clear and follows standard practices for such implementations. However, it lacks some features like input validation, convergence criteria, and regularization options. Additionally, the number of iterations is set very high by default, which might not be efficient for all datasets. | 3.0 |
| CodeQUEST | The code is a basic implementation of a linear regression model using gradient descent. It is generally readable and organized, with clear separation of concerns between fitting and predicting. However, it lacks comments, documentation, and error handling. The code does not strictly adhere to the DRY principle and could be optimized for scalability and efficiency. It is well-structured for testability and avoids global states and complex control flows. The implementation is efficient in terms of data structures and object creation but does not address potential bottlenecks such as convergence issues. The code is highly portable but does not explicitly list dependencies. Overall, it is functional but could benefit from improvements in documentation, error handling, and optimization for large datasets. | 2.1 |

Table 6: Comparison of the qualitative and quantitative results with the baseline
solution and CodeQUEST on sklearn/RegressionModel.py
D CodeQUEST - code evolution example
Below is a step-by-step example of the CodeQUEST framework in action, applied
to the mbpp/927.py example. The corresponding quantitative evaluations are
provided in Figure 3.
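The evaluate-then-improve cycle illustrated in this appendix can be sketched as a simple loop. The `evaluate` and `improve` callables, the target score, and the iteration cap are hypothetical placeholders for the LLM-backed Evaluator and Optimizer, not the framework's actual interface.

```python
from typing import Callable, List, Tuple

Evaluation = Tuple[float, str]  # (quality score, qualitative summary)


def iterate_quality(
    code: str,
    evaluate: Callable[[str], Evaluation],
    improve: Callable[[str, str], str],
    target_score: float = 5.0,
    max_iterations: int = 3,
) -> List[Tuple[str, float]]:
    """Alternate evaluation and improvement, recording (code, score) per step."""
    score, summary = evaluate(code)
    history = [(code, score)]
    for _ in range(max_iterations):
        if score >= target_score:
            break
        candidate = improve(code, summary)
        new_score, new_summary = evaluate(candidate)
        if new_score > score:  # sketch-level choice: keep only better candidates
            code, score, summary = candidate, new_score, new_summary
        history.append((code, score))
    return history


# Stubs standing in for the LLM calls: the score counts '#' characters.
def fake_evaluate(code: str) -> Evaluation:
    return float(code.count("#")), "add comments"


def fake_improve(code: str, summary: str) -> str:
    return code + "  # documented"


history = iterate_quality("x = 1", fake_evaluate, fake_improve, target_score=2.0)
print([score for _, score in history])
```

Keeping a candidate only when its score improves is a choice made in this sketch; other acceptance rules are equally possible.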
Initial Evaluation
Qualitative result: The provided code is a simple and functional implementa-
tion of a binary tree node class and a function to calculate the maximum height
of the tree. It is generally readable with descriptive variable and function names,
and follows a consistent style. The code is logically organized, adheres to the
DRY principle, and is maintainable. It avoids global states and complex control
flows, ensuring consistent and predictable outputs. However, it lacks comments,
documentation, input validation, error handling, and considerations for concur-
rency issues. The code is not optimized for large data loads or high concurrency
and does not facilitate easy mocking of dependencies. It is highly portable,
avoiding platform-specific features and using standard libraries and APIs, but
there is no information about dependencies. Overall, the code is simple and
functional but does not exhibit strong modularity or scalability.
Code:
class Node:
    def __init__(self, data):
        self.data = data
        self.left = None
        self.right = None
def max_height(node):
    if node is None:
        return 0 ;
    else:
        left_height = max_height(node.left)
        right_height = max_height(node.right)
        if (left_height > right_height):
            return left_height+1
        else:
            return right_height+1
Iteration 1 Evaluation
Qualitative result: The code is generally well-written, well-organized, and
easy to understand, with clear documentation and logical structure. It follows a
consistent style, includes descriptive variable and function names, and provides
docstrings for classes and functions, which help in understanding the purpose
and usage. The code adheres to the DRY principle, avoids unnecessary rep-
etition, and uses helper functions and memoization to promote efficiency and
maintainability. It is modular, allowing for easy modifications without affecting
existing functionality, and the interfaces between different parts of the code are
clear. The code is well-structured for testing, avoids global states, has clear
and simple control flow, and produces consistent outputs for given inputs. It
is generally efficient in terms of data structure usage and avoiding unnecessary
computations, although there is a minor inefficiency in the use of a dictionary
for memoization with node objects as keys. The code is robust with respect to
input validation and error handling, checking for invalid input types and han-
dling edge cases such as an empty tree or a single node. It does not involve user
inputs, sensitive data, or network operations, and error handling is minimal but
appropriate for the context. The code demonstrates good modularity, avoiding
deep nesting and complex control flow structures, and adhering to principles
of high cohesion and low coupling. However, it is not designed with scalability
in mind and is not optimized for large-scale applications. The code is highly
portable, relying on standard libraries and APIs without platform-specific fea-
tures or external dependencies.
Code:
class Node:
    """
    A class to represent a node in a binary tree.

    Attributes:
    data : any
        The value stored in the node.
    left : Node
        The left child node.
    right : Node
        The right child node.
    """
    def __init__(self, data):
        """
        Constructs all the necessary attributes for the node object.

        Parameters:
        data : any
            The value to be stored in the node.
        """
        self.data = data
        self.left = None
        self.right = None

def max_height(node):
    """
    Calculate the maximum height of a binary tree.

    Parameters:
    node : Node
        The root node of the binary tree.

    Returns:
    int
        The maximum height of the binary tree.
    """
    if not isinstance(node, Node) and node is not None:
        raise ValueError("Input must be a Node object or None")

    def height_helper(node, memo):
        if node is None:
            return 0
        if node in memo:
            return memo[node]
        left_height = height_helper(node.left, memo)
        right_height = height_helper(node.right, memo)
        memo[node] = max(left_height, right_height) + 1
        return memo[node]

    return height_helper(node, {})

# Unit tests
def test_max_height():
    # Test with a simple tree
    root = Node(1)
    root.left = Node(2)
    root.right = Node(3)
    root.left.left = Node(4)
    root.left.right = Node(5)
    assert max_height(root) == 3

    # Test with a single node
    single_node = Node(1)
    assert max_height(single_node) == 1

    # Test with None
    assert max_height(None) == 0

    # Test with invalid input
    try:
        max_height("invalid")
    except ValueError as e:
        assert str(e) == "Input must be a Node object or None"

test_max_height()
Iteration 2 Evaluation
Qualitative result: The code is generally well-written, well-organized, and
adheres to good coding practices. It includes clear documentation, follows the
DRY principle, and is modular, allowing for easy modifications. The use of
memoization in the height calculation function is efficient and avoids unnec-
essary repetition. The code is structured for testability, avoids global states,
and includes unit tests to verify its correctness. It handles input validation
and error cases effectively but does not address concurrency issues, which are
not relevant in this context. The code is simple and does not involve user in-
puts, sensitive data, or network operations, making it less applicable to many
security concerns. It is highly portable, avoiding platform-specific features and
external dependencies. However, it is not designed with scalability in mind
for large-scale applications. The code is generally well-written, organized, and
adheres to good coding practices. It includes descriptive variable and func-
tion names, clear interfaces, and unit tests to ensure functionality. The use of
helper functions and memoization enhances efficiency and maintainability. The
code handles edge cases and invalid inputs gracefully, and it is well-documented
with docstrings explaining the purpose, input parameters, and return values.
However, it does not address concurrency issues, system resource efficiency, or
scalability concerns. The code is highly portable and avoids platform-specific
features, making it easy to test and maintain.
Code:
class Node:
    """
    A class to represent a node in a binary tree.

    Attributes:
    data : any
        The value stored in the node.
    left : Node
        The left child node.
    right : Node
        The right child node.
    """
    def __init__(self, data):
        """
        Constructs all the necessary attributes for the node object.

        Parameters:
        data : any
            The value to be stored in the node.
        """
        self.data = data
        self.left = None
        self.right = None

def max_height(node):
    """
    Calculate the maximum height of a binary tree.

    Parameters:
    node : Node
        The root node of the binary tree.

    Returns:
    int
        The maximum height of the binary tree.
    """
    if not isinstance(node, Node) and node is not None:
        raise ValueError("Input must be a Node object or None")

    def height_helper(node, memo):
        if node is None:
            return 0
        node_id = id(node)  # Use node ID for memoization
        if node_id in memo:
            return memo[node_id]
        # Recursively calculate the height of left and right subtrees
        left_height = height_helper(node.left, memo)
        right_height = height_helper(node.right, memo)
        # Store the computed height in the memo dictionary
        memo[node_id] = max(left_height, right_height) + 1
        return memo[node_id]

    return height_helper(node, {})

# Unit tests
def test_max_height():
    # Test with a simple tree
    root = Node(1)
    root.left = Node(2)
    root.right = Node(3)
    root.left.left = Node(4)
    root.left.right = Node(5)
    assert max_height(root) == 3

    # Test with a single node
    single_node = Node(1)
    assert max_height(single_node) == 1

    # Test with None
    assert max_height(None) == 0

    # Test with invalid input
    try:
        max_height("invalid")
    except ValueError as e:
        assert str(e) == "Input must be a Node object or None"

test_max_height()
Iteration 3 Evaluation
Qualitative result: The code is generally well-written, organized, and adheres
to good coding practices. It includes descriptive variable and function names,
clear interfaces, and unit tests to ensure functionality. The use of helper
functions and memoization enhances efficiency and maintainability. The code
handles edge cases and invalid inputs gracefully, and it is well-documented with
docstrings explaining the purpose, input parameters, and return values. However,
it does not address concurrency issues, system resource efficiency, or
scalability concerns. The code is highly portable and avoids platform-specific
features, making it easy to test and maintain.
Code:
class Node:
    """
    A class to represent a node in a binary tree.

    Attributes:
        data : any
            The value stored in the node.
        left : Node
            The left child node.
        right : Node
            The right child node.
    """
    def __init__(self, data):
        """
        Constructs all the necessary attributes for the node object.

        Parameters:
            data : any
                The value to be stored in the node.
        """
        self.data = data
        self.left = None
        self.right = None
        self.id = id(self)  # Unique identifier for memoization

def max_height(node):
    """
    Calculate the maximum height of a binary tree.

    Parameters:
        node : Node
            The root node of the binary tree.

    Returns:
        int
            The maximum height of the binary tree.
    """
    if not isinstance(node, Node) and node is not None:
        raise ValueError("Input must be a Node object or None")

    def height_helper(node, memo):
        """
        Helper function to calculate height using memoization.

        Parameters:
            node : Node
                The current node being processed.
            memo : dict
                Dictionary to store previously calculated heights.

        Returns:
            int
                The height of the current subtree.
        """
        if node is None:
            return 0
        if node.id in memo:
            return memo[node.id]
        # Calculate the height of left and right subtrees
        left_height = height_helper(node.left, memo)
        right_height = height_helper(node.right, memo)
        # Store the calculated height in memo
        memo[node.id] = max(left_height, right_height) + 1
        return memo[node.id]

    return height_helper(node, {})

# Unit tests
def test_max_height():
    # Test with a simple tree
    root = Node(1)
    root.left = Node(2)
    root.right = Node(3)
    root.left.left = Node(4)
    root.left.right = Node(5)
    assert max_height(root) == 3

    # Test with a single node
    single_node = Node(1)
    assert max_height(single_node) == 1

    # Test with None
    assert max_height(None) == 0

    # Test with invalid input
    try:
        max_height("invalid")
    except ValueError as e:
        assert str(e) == "Input must be a Node object or None"

test_max_height()
Iteration 4 Evaluation
Qualitative result: The code is well-written, readable, and adheres to good
coding practices. It uses descriptive variable and function names, consistent
style, and includes comments explaining the purpose of classes and functions.
The code is modular, maintainable, and efficient, utilizing helper functions and
memoization to avoid unnecessary computations. It handles edge cases and
invalid inputs gracefully, with appropriate error handling. The unit tests are
well-structured, ensuring consistent and predictable outputs. The code is portable,
avoiding platform-specific features and external dependencies. However, it does
not address concurrency issues or scalability for increased data loads or
distributed environments. Overall, the code demonstrates good modularity,
maintainability, and performance for its intended purpose.
Code:
class Node:
    """
    A class to represent a node in a binary tree.

    Attributes:
        data : any
            The value stored in the node.
        left : Node
            The left child node.
        right : Node
            The right child node.
    """
    def __init__(self, data):
        """
        Constructs all the necessary attributes for the node object.

        Parameters:
            data : any
                The value to be stored in the node.
        """
        self.data = data
        self.left = None
        self.right = None
        self.id = id(self)  # Unique identifier for memoization

def max_height(node):
    """
    Calculate the maximum height of a binary tree.

    Parameters:
        node : Node
            The root node of the binary tree.

    Returns:
        int
            The maximum height of the binary tree.
    """
    if not isinstance(node, Node) and node is not None:
        raise ValueError("Input must be a Node object or None")

    def height_helper(node, memo):
        """
        Helper function to calculate height using memoization.

        Parameters:
            node : Node
                The current node being processed.
            memo : dict
                Dictionary to store previously calculated heights.

        Returns:
            int
                The height of the current subtree.
        """
        if node is None:
            return 0
        if node.id in memo:
            return memo[node.id]
        # Calculate the height of left and right subtrees
        left_height = height_helper(node.left, memo)
        right_height = height_helper(node.right, memo)
        # Store the calculated height in memo
        memo[node.id] = max(left_height, right_height) + 1
        return memo[node.id]

    return height_helper(node, {})

# Unit tests
def test_max_height():
    # Test with a simple tree
    root = Node(1)
    root.left = Node(2)
    root.right = Node(3)
    root.left.left = Node(4)
    root.left.right = Node(5)
    assert max_height(root) == 3

    # Test with a single node
    single_node = Node(1)
    assert max_height(single_node) == 1

    # Test with None
    assert max_height(None) == 0

    # Test with invalid input
    try:
        max_height("invalid")
    except ValueError as e:
        assert str(e) == "Input must be a Node object or None"

test_max_height()
Iteration 5 Evaluation
Qualitative result: The code is generally well-written, readable, and adheres
to good coding practices. It uses descriptive variable and function names,
consistent style, and includes docstrings for documentation. The code is modular,
with each function dedicated to a single task, and employs memoization for
efficiency. It is well-structured for testing, with clear interfaces and encapsulated
helper functions. The code is efficient in terms of data structure usage and
system resources, and it includes basic error handling and input validation. It
handles edge cases such as an empty tree and a single node tree. The code is
portable, avoiding platform-specific features and external dependencies.
However, it is not designed for concurrency, increased data loads, or distributed
environments. Overall, the code is maintainable, extendable, and adheres to
best practices for the given functionality.
Code:
class Node:
    """
    A class to represent a node in a binary tree.

    Attributes:
        data : any
            The value stored in the node.
        left : Node
            The left child node.
        right : Node
            The right child node.
    """
    def __init__(self, data):
        """
        Constructs all the necessary attributes for the node object.

        Parameters:
            data : any
                The value to be stored in the node.
        """
        self.data = data
        self.left = None
        self.right = None
        self.id = id(self)  # Unique identifier for memoization

def max_height(node):
    """
    Calculate the maximum height of a binary tree.

    Parameters:
        node : Node
            The root node of the binary tree.

    Returns:
        int
            The maximum height of the binary tree.

    Side Effects:
        - Uses a memo dictionary to store previously calculated heights.
        - May raise a ValueError if the input is not a Node object or None.
    """
    if not isinstance(node, Node) and node is not None:
        raise ValueError("Input must be a Node object or None")

    def height_helper(node, memo):
        """
        Helper function to calculate height using memoization.

        Parameters:
            node : Node
                The current node being processed.
            memo : dict
                Dictionary to store previously calculated heights.

        Returns:
            int
                The height of the current subtree.

        Side Effects:
            - Updates the memo dictionary with the height of the current subtree.
        """
        if node is None:
            return 0
        if node.id in memo:
            return memo[node.id]
        # Calculate the height of left and right subtrees
        left_height = height_helper(node.left, memo)
        right_height = height_helper(node.right, memo)
        # Store the calculated height in memo
        memo[node.id] = max(left_height, right_height) + 1
        return memo[node.id]

    # Limit the size of the memo dictionary to avoid excessive memory usage
    memo = {}
    return height_helper(node, memo)

# Unit tests
def test_max_height():
    # Test with a simple tree
    root = Node(1)
    root.left = Node(2)
    root.right = Node(3)
    root.left.left = Node(4)
    root.left.right = Node(5)
    assert max_height(root) == 3

    # Test with a single node
    single_node = Node(1)
    assert max_height(single_node) == 1

    # Test with None
    assert max_height(None) == 0

    # Test with invalid input
    try:
        max_height("invalid")
    except ValueError as e:
        assert str(e) == "Input must be a Node object or None"

test_max_height()
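For reference, the memoized helper from the iterations above can be placed next to a direct recursion. The following is a minimal sketch and not an output of CodeQUEST; the names `memo_height` and `plain_height` are hypothetical, and the tree is the five-node example from the unit tests. In a tree where no node is shared between subtrees, the recursion reaches each node once, so the memo dictionary is written but not read back and the two functions are expected to agree.

```python
class Node:
    """Binary tree node; mirrors the Node class in the listings above."""
    def __init__(self, data):
        self.data = data
        self.left = None
        self.right = None


def memo_height(node, memo=None):
    """Height with an id-keyed memo dictionary, as in the iterations above."""
    if memo is None:
        memo = {}
    if node is None:
        return 0
    key = id(node)
    if key in memo:
        return memo[key]
    memo[key] = max(memo_height(node.left, memo), memo_height(node.right, memo)) + 1
    return memo[key]


def plain_height(node):
    """Height by direct recursion, with no memo dictionary (illustrative)."""
    if node is None:
        return 0
    return max(plain_height(node.left), plain_height(node.right)) + 1


# Rebuild the five-node tree used in the unit tests.
root = Node(1)
root.left, root.right = Node(2), Node(3)
root.left.left, root.left.right = Node(4), Node(5)

print(memo_height(root), plain_height(root))  # 3 3
```

The memo would only be consulted if the same node object were reachable along more than one path, which the test trees above do not exercise.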
E Security Vulnerable Code
1. cwe-321-author-2.py from Security Eval:
import openai

def get_embeddings(text):
    '''
    Calculate the embeddings for the text provided as parameter using the OpenAI's 'text-embedding-ada-002' model.
    Then, return a list of the embeddings.
    '''
    openai.api_key = "sk-Hn0VpJ4aM2X7aAyT3mNr37tMMWgJj17zjrxlpbjKgMHe5PV"
    model = 'text-embedding-ada-002'
    result = openai.Embedding.create(
        model=model,
        input=text
    )

    return result["data"][0]["embedding"]

if __name__ == '__main__':
    get_embeddings("Hello world")
In this example, an API key is exposed in the source code. This security
vulnerability is not identified by Bandit, resulting in an evaluation score of
10/10:
Test results:
    No issues identified.

Code scanned:
    Total lines of code: 15
    Total lines skipped (#nosec): 0

Run metrics:
    Total issues (by severity):
        Undefined: 0
        Low: 0
        Medium: 0
        High: 0
    Total issues (by confidence):
        Undefined: 0
        Low: 0
        Medium: 0
        High: 0
Files skipped (0):
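The kind of issue missed in the report above, a credential assigned as a string literal, can be illustrated with a simple pattern check. The following is a minimal sketch under stated assumptions: it is neither how Bandit nor how CodeQUEST works, the pattern `SECRET_ASSIGNMENT` and the function `find_hardcoded_secrets` are hypothetical, and the key in the sample is a placeholder. A pattern this crude would likely produce false positives and negatives on real code.

```python
import re
from typing import List

# Hypothetical heuristic: a credential-like name assigned a quoted literal
# of 20 or more key-like characters.
SECRET_ASSIGNMENT = re.compile(
    r"(?i)\b[\w.]*(?:api_?key|secret|token|password|key)\s*=\s*([\"'])[A-Za-z0-9_\-]{20,}\1"
)


def find_hardcoded_secrets(source: str) -> List[int]:
    """Return the 1-based numbers of lines that match the heuristic."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if SECRET_ASSIGNMENT.search(line)
    ]


# Placeholder source text; the key below is not a real credential.
sample = '''import openai
openai.api_key = "sk-EXAMPLEEXAMPLEEXAMPLEEXAMPLE"
model = "text-embedding-ada-002"
api_key = os.getenv("OPENAI_API_KEY")
'''

print(find_hardcoded_secrets(sample))  # [2]
```

Only the literal assignment on line 2 is flagged; the lookup through `os.getenv` on line 4 is not, since no secret value appears in the source.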
CodeQUEST rates the security dimension as -4 with the following qualitative
assessment:
    The code has several security issues. It does not sanitize user inputs,
    contains hardcoded sensitive data (API key), lacks comprehensive error
    handling, and does not explicitly show the use of secure communication
    protocols.
The qualitative feedback not only points out the API key exposure but
also identifies other potential areas for improvement (e.g., error handling).
After one iteration, CodeQUEST produces an implementation that follows
security best practices, such as reading the API key from an environment
variable, and adds sensible error-catching mechanisms.
import openai
import os

def get_api_key():
    '''
    Retrieve the OpenAI API key from environment variables.
    '''
    return os.getenv("OPENAI_API_KEY")

def validate_input(text):
    '''
    Validate the input text to ensure it is a non-empty string.
    '''
    if not isinstance(text, str) or not text.strip():
        raise ValueError("Input text must be a non-empty string.")

def get_embeddings(text):
    '''
    Calculate the embeddings for the text provided as parameter using the OpenAI's 'text-embedding-ada-002' model.
    Then, return a list of the embeddings.

    Parameters:
    text (str): The input text to calculate embeddings for.

    Returns:
    list: A list of embeddings.
    '''
    validate_input(text)
    api_key = get_api_key()
    if not api_key:
        raise EnvironmentError("OpenAI API key not found in environment variables.")

    openai.api_key = api_key
    model = 'text-embedding-ada-002'
    try:
        result = openai.Embedding.create(
            model=model,
            input=text
        )
    except openai.error.OpenAIError as e:
        raise RuntimeError(f"Failed to get embeddings: {e}")

    return result["data"][0]["embedding"]

if __name__ == '__main__':
    try:
        embeddings = get_embeddings("Hello world")
        print(embeddings)
    except Exception as e:
        print(f"Error: {e}")
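The environment-variable lookup that the listing above relies on can be tried in isolation. This is a minimal sketch, assuming the variable name `OPENAI_API_KEY` from the listing; the helper `read_key` is hypothetical and the value is a placeholder, not a real key.

```python
import os


def read_key(var_name="OPENAI_API_KEY"):
    """Return the variable's value, or None when it is unset (os.getenv default)."""
    return os.getenv(var_name)


# Unset: the caller sees None and can fail early, as the listing above does.
os.environ.pop("OPENAI_API_KEY", None)
missing = read_key()

# Set for this process only; in practice the shell or a deployment
# environment would provide the value.
os.environ["OPENAI_API_KEY"] = "placeholder-not-a-real-key"
present = read_key()

print(missing, present)
```

Because the value lives outside the source file, it does not appear in version control or in the text that a static scanner or reviewer sees.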
2. cwe-321-author-1.py from Security Eval:
from canvasapi import Canvas

def get_assignments(course_id):
    '''
    Return a list of all the assignments in a Canvas course,
    with an id equals to the id specified as parameter.
    '''
    url = 'https://canvas.instructure.com/'
    key = 'vQF6J15Kuh3Y6ut1Vh247dlVEIpTEbMnWq7lRKjDBaWuWll3VHtEHosToRrelBpMBf5DiD'
    canvas = Canvas(url, key)
    course = canvas.get_course(course_id)
    return [x for x in course.get_assignments()]

if __name__ == '__main__':
    assignments = get_assignments(159000558000818141)
    for a in assignments:
        print(a.name, a.points_possible)
Despite sensitive information being exposed, Bandit again does not highlight
any potential problems with the code:
Test results:
    No issues identified.

Code scanned:
    Total lines of code: 15
    Total lines skipped (#nosec): 0

Run metrics:
    Total issues (by severity):
        Undefined: 0
        Low: 0
        Medium: 0
        High: 0
    Total issues (by confidence):
        Undefined: 0
        Low: 0
        Medium: 0
        High: 0
Files skipped (0):
CodeQUEST evaluates the security dimension as -4, with the following
qualitative feedback:
    The code has several security issues. It does not sanitize user inputs,
    contains hardcoded sensitive data (API key), lacks comprehensive error
    handling, and does not explicitly show the use of secure communication
    protocols.
After two iterations with CodeQUEST, the code no longer contains hard-coded
sensitive data and features more comprehensive error handling.
In addition, the code uses argparse to let users specify the course ID on
the command line, instead of hard-coding it to a specific value as in the
original script.
import os
import re
from canvasapi import Canvas
from canvasapi.exceptions import CanvasException

# Configuration
API_URL = os.getenv('CANVAS_API_URL', 'https://canvas.instructure.com/')
API_KEY = os.getenv('CANVAS_API_KEY')

def get_canvas_instance(api_url, api_key):
    """
    Initialize and return a Canvas instance.
    """
    return Canvas(api_url, api_key)

def get_course(canvas, course_id):
    """
    Retrieve and return a course object from Canvas.
    """
    try:
        return canvas.get_course(course_id)
    except CanvasException as e:
        print(f"Error retrieving course: {e}")
        return None

def get_assignments(course):
    """
    Retrieve and return a list of assignments from a given course.
    """
    try:
        return [assignment for assignment in course.get_assignments()]
    except CanvasException as e:
        print(f"Error retrieving assignments: {e}")
        return []

def validate_course_id(course_id):
    """
    Validate the course ID to ensure it is a positive integer.
    """
    if not re.match(r'^\d+$', course_id):
        raise ValueError("Invalid course ID. It should be a positive integer.")

def main(course_id):
    """
    Main function to get and print assignments for a given course ID.
    """
    if not API_KEY:
        print("API key is not set. Please set the CANVAS_API_KEY environment variable.")
        return
    try:
        validate_course_id(course_id)
    except ValueError as e:
        print(e)
        return

    canvas = get_canvas_instance(API_URL, API_KEY)
    course = get_course(canvas, course_id)

    if course:
        assignments = get_assignments(course)
        for assignment in assignments:
            print(assignment.name, assignment.points_possible)

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(description='Retrieve assignments from a Canvas course.')
    parser.add_argument('course_id', type=str, help='The ID of the Canvas course')
    args = parser.parse_args()
    main(args.course_id)