DynaCode: A Dynamic Complexity-Aware Code Benchmark for
Evaluating Large Language Models in Code Generation
Wenhao Hu1, Jinhao Duan2, Chunchen Wei1, Li Zhang2,
Yue Zhang2, Kaidi Xu2*
1 University of Electronic Science and Technology of China
2 Drexel University
Abstract
The rapid advancement of large language mod-
els (LLMs) has significantly improved their per-
formance in code generation tasks. However,
existing code benchmarks remain static, con-
sisting of fixed datasets with predefined prob-
lems. This makes them vulnerable to mem-
orization during training, where LLMs recall
specific test cases instead of generalizing to
new problems, leading to data contamination
and unreliable evaluation results. To address
these issues, we introduce DynaCode, a dy-
namic, complexity-aware benchmark that over-
comes the limitations of static datasets. Dyna-
Code evaluates LLMs systematically using a
complexity-aware metric, incorporating both
code complexity and call-graph structures. Dy-
naCode achieves large-scale diversity, gener-
ating up to 189 million unique nested code
problems across four distinct levels of code
complexity, referred to as units, and 16 types of
call graphs. Results on 12 recent LLMs show an
average performance drop of 16.8% to 45.7%
compared to MBPP+, a static code generation
benchmark, with performance progressively de-
creasing as complexity increases. This demon-
strates DynaCode’s ability to effectively dif-
ferentiate LLMs. Additionally, by leveraging
call graphs, we gain insights into LLM behav-
ior, particularly their preference for handling
subfunction interactions within nested code.
1 Introduction
The performance of Large Language Models
(LLMs) in code generation has garnered significant
attention (Hou et al., 2024). With their powerful
language comprehension abilities, LLMs are now
capable of autonomously generating high-quality
code and, to some extent, addressing complex pro-
gramming challenges. These advancements have
not only accelerated the software development pro-
cess but have also had a profound impact on enhanc-
ing developer productivity (Ziegler et al., 2022).
*Corresponding author: Kaidi Xu <kx46@drexel.edu>.
[Figure 1 (bar chart): Pass@1 score (%) on MBPP, MBPP+, and DynaCode. Meta-Llama-3-8B-Instruct: 64.6%, 54.8%, and 8.4% (drops of 56.2 and 46.4 points from MBPP and MBPP+ to DynaCode). DeepSeek-V3: 87.6%, 73.0%, and 52.1% (drops of 35.5 and 20.9 points).]
Figure 1: Data contamination on the popular benchmarks
MBPP and MBPP+. Meta-Llama-3-8B-Instruct
exhibits a significant performance drop from MBPP and
MBPP+ to DynaCode.
However, as LLMs advance, reliable benchmarks
for code generation have become increasingly cru-
cial for evaluating and selecting suitable LLMs.
Currently, the evaluation of LLMs’ code generation
capabilities primarily relies on standardized bench-
marks such as HumanEval (Chen et al., 2021),
MBPP (Austin et al., 2021), CodeXGLUE (Lu
et al., 2021), and ClassEval (Du et al., 2023). These
benchmarks provide an initial reference for assess-
ing the performance of LLMs in code generation
by evaluating the functionality and correctness of
the generated code. Moreover, recent work such as
EvalPlus (Liu et al., 2024c), BigCodeBench (Zhuo
et al., 2024), CRUXEval (Gu et al., 2024), and
EvoEval (Xia et al., 2024) aims to improve evaluation
quality by expanding test cases and by using
techniques such as prompt transformation to rewrite
prompts into forms better suited to precise evaluation.
Despite these developments, existing bench-
marks exhibit two notable limitations:
Data Contamination. Existing benchmarks are
static and small-scale, making them easily accessible
during training and allowing models to memorize
test cases instead of generalizing to unseen
problems. Meta-Llama-3-8B-Instruct (Meta AI,
2024a) and Phi-2 (Microsoft, 2024) have been reported
to exhibit data contamination (Zhang et al.,
2024a), suggesting that these models may “memorize”
specific test cases or code snippets, which could
compromise the accuracy and fairness of the evaluation
process.
arXiv:2503.10452v1 [cs.CL] 13 Mar 2025
Uncontrollable Complexity. Existing bench-
marks lack systematic complexity control, making
it challenging to evaluate LLM performance across
different task complexities. While some works (Yu
et al., 2024; Liu et al., 2024d) define code com-
plexity using simple metrics such as lines of code
and time complexity, these measures fail to cap-
ture deep nesting and complex execution depen-
dencies, thereby leaving a critical gap in assessing
real-world code generation capabilities.
To address these limitations, we propose Dyna-
Code, a novel dynamic evaluation framework that
automatically creates Python code benchmarks by
classifying code problems based on complexity and
forming nested problems using call graphs. This
benchmark provides a more comprehensive and
fair evaluation of code generated by LLMs. Specif-
ically, DynaCode categorizes code problems into
multiple code problem units and, for each unit, con-
structs call-graph structures of varying complexity.
In doing so, it establishes complexity-aware met-
rics along two dimensions: code complexity and
call-graph complexity. Compared to traditional
static benchmarks, DynaCode offers a significantly
more diverse and complex evaluation. As shown
in Figure 1, Meta-Llama-3-8B-Instruct (Meta AI,
2024a), a model that exhibits data contamination,
shows a larger performance drop from MBPP and
MBPP+ to DynaCode. DynaCode can generate approximately
189 million unique code generation
tasks across 4 units of code complexity and 16
call-graph structures. By assessing LLMs from
both code complexity and call-graph complexity
perspectives, DynaCode provides a structured and
scalable evaluation framework while mitigating
data contamination. To further investigate LLM
limitations, we analyzed 4279 error examples, cat-
egorizing errors into 3 distinct types. Our analysis
reveals that LLMs perform well on sequential call
graphs but struggle with complex, multi-branch
dependencies, highlighting their difficulty in han-
dling deeply nested execution flows and long-range
function interactions. In summary, our major contributions are as follows:
• We propose a dynamic evaluation strategy that simulates the actual execution process by combining multiple code problems, thereby enabling a fairer and more comprehensive evaluation.
• We design complexity-aware metrics combining code complexity and call graphs, integrating static analysis and dynamic execution to create a multidimensional complexity evaluation system with categorized benchmarks.
• We introduce DynaCode, a new code generation benchmark, and evaluate multiple LLMs, providing a thorough analysis of its practical utility.
2 Related Works
2.1 Dynamic Evaluation
Recently, growing interest in dynamic evaluation
methods has emerged to address data contami-
nation. Several works have focused on differ-
ent aspects of this challenge. For example, Dy-
Val (Zhu et al., 2023) utilizes a graph-based ap-
proach to dynamically generate evaluation samples
with controllable complexity; NPHardEval (Fan
et al., 2023) generates new evaluation samples
for NP-hard mathematical problems; DyVal2 (Zhu
et al., 2024) leverages a probing agent based on
LLMs to transform existing problems into new
ones, while a judgment agent verifies the generated
evaluation samples. Additionally, Benchmark Self-
Evolving (Wang et al., 2024) modifies the context
or the problem itself, along with its correspond-
ing answers, to reconstruct existing benchmark in-
stances into new variants for dynamic evaluation.
DARG (Zhang et al., 2024b) also utilizes LLMs
to build reasoning graphs for problems and applies
fine-grained graph perturbations across various di-
mensions. However, these methods mainly focus
on reasoning domains like logic and mathemat-
ics and may not extend well to code generation.
Moreover, they rely on LLMs as agents to refine
benchmarks, introducing evaluation instability and
extra costs. To address these limitations, this pa-
per proposes a dynamic evaluation strategy tailored
for code generation, which automatically creates
benchmarks and offers a more detailed and com-
prehensive assessment.
2.2 Coding benchmark for LLMs
The rapid development of LLMs has driven the
continuous evolution of code generation bench-
marks. Benchmarks such as HumanEval (Chen
et al., 2021) and MBPP (Austin et al., 2021) pri-
marily evaluate LLMs on simple, isolated Python
functions, offering an initial evaluation of code generation
capabilities. As LLMs have improved, new
benchmarks have been continuously proposed, ex-
panding to address higher levels of difficulty (Zhuo
et al., 2024) and a wider range of programming
languages (Zheng et al., 2023; Khan et al., 2024;
Cassano et al., 2023; Ding et al., 2023). These
benchmarks also tackle more complex tasks such
as program repair and code reasoning (Liu et al.,
2024b; Gu et al., 2024; Jain et al., 2024). In addi-
tion, new benchmarks like SWE-Bench (Jimenez
et al., 2023) and EvoCodeBench (Li et al., 2024)
focus on real-world tasks and code evolution, push-
ing forward the performance evaluation of LLMs
in practical applications. However, existing bench-
marks remain static and fixed, lacking systematic
complexity control in generated code problems.
They typically assess LLMs based on overall per-
formance, failing to provide granular insights into
varying code complexities. We improve this by
proposing a complexity-aware metric, allowing for
a more precise and systematic evaluation of LLMs’
performance.
3 Methodology
In this section, we present the construction approach
and process of DynaCode, as shown in Figure 2.
Specifically, Sec. 3.1 introduces the dynamic
evaluation strategy based on call graphs and
explains how call graphs capture the relationships and
dependencies within a program. Sec. 3.2
introduces a complexity matrix that measures
code and call-graph complexity, which is essential for evaluating
tasks of varying difficulty and capturing program
interactions. Finally, Sec. 3.3 describes the process of
generating benchmarks in DynaCode.
3.1 Dynamic Evaluation Strategy Based on
Call Graphs
To comprehensively evaluate the performance of
LLMs in code generation tasks, this section intro-
duces a dynamic evaluation strategy based on call
graphs. This strategy measures execution perfor-
mance of generated code from two dimensions:
static complexity classification and dynamic code execution behavior. Specifically, this section is
divided into two parts: code complexity classifi-
cation and call graph construction. These parts
describe how code problems are classified based
on code complexity and how the call graphs are
constructed.
3.1.1 Code Complexity Classification
Current LLMs generally perform relatively well
on code problems involving basic syntax and sim-
ple logical structures, but their performance is
inconsistent on problems that involve more con-
trol flow branches, nested structures, and recursive
calls (Jiang et al., 2025; Beger and Dutta, 2025). To
evaluate the code generation capabilities of LLMs
more thoroughly, DynaCode first classifies the com-
plexity of the ground truth code of existing code
problems, as shown in Figure 2(a). Methods for
classifying code problem complexity include lines
of code, cyclomatic complexity (McCabe, 1976),
and Halstead complexity (Halstead, 1977). Given
that LLMs often struggle with code generation
tasks involving complex control flow and branch-
ing logic, we use cyclomatic complexity to assess
code difficulty. It measures the number of inde-
pendent paths in the control flow, capturing the
complexity of branches and loops. Specifically, for
each problem $p_i$ in the set $P$ of code problems,
we use the static code analysis tool Radon (Rubik,
2014) to calculate the cyclomatic complexity of the
corresponding ground truth code,
denoted as $\nu_{p_i}$, where $i$ is the index of the
problem in the set $P$. The cyclomatic complexity
is calculated using the following formula:
$$\nu_{p_i} = E_{p_i} - N_{p_i} + 2P_{p_i}, \tag{1}$$
where $E_{p_i}$ is the number of edges in the control
flow graph, $N_{p_i}$ is the number of nodes, and $P_{p_i}$ is
the number of connected components. Control flow
graphs and cyclomatic complexity computations
are shown in Appendix A.
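As a rough illustration of Eq. (1), the sketch below approximates cyclomatic complexity with Python's standard `ast` module by counting decision points and adding one, which matches E − N + 2P for a single structured function. It is a simplified stand-in and not Radon's implementation; the function name `approx_cyclomatic_complexity` and the set of counted node types are assumptions made for this example.

```python
import ast

# Node types treated as one decision point each (an assumption of this sketch).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def approx_cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity as (number of decision points) + 1."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` contributes two additional paths.
            complexity += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # One loop plus one path per `if` filter in the comprehension.
            complexity += 1 + len(node.ifs)
    return complexity


# Ground-truth function that appears in Figure 2 of the paper.
EXAMPLE = '''
def count_binary_seq(n):
    nCr = 1
    res = 1
    for r in range(1, n + 1):
        nCr = (nCr * (n + 1 - r)) / r
        res += nCr * nCr
    return res
'''

print(approx_cyclomatic_complexity(EXAMPLE))  # prints 2 (one loop + 1)
```

Exact values from Radon may differ for constructs this sketch does not count, so the numbers here are only indicative.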
Based on the calculated cyclomatic complexity
values, code problems are classified into different
complexity units. We define the code complexity as
$U_j$, where $j \in \{1, 2, \dots, n\}$, representing different
cyclomatic complexity units. The set of problems
at complexity level $U_j$ is denoted by:
$$U_j = \{p_i \mid \alpha_{j-1} \le \nu_i \le \alpha_j\}, \tag{2}$$
where:
1. $p_i$ represents a problem with cyclomatic complexity $\nu_i$ (shorthand for $\nu_{p_i}$ in Eq. 1),
2. $\alpha_{j-1}$ and $\alpha_j$ are predefined complexity thresholds for the $j$-th unit.
Thus, for each $j$, $U_j$ is the set of problems
$p_i$ whose cyclomatic complexity $\nu_i$ falls
within the interval $[\alpha_{j-1}, \alpha_j]$, i.e., $p_i \in U_j$ if and
only if $\alpha_{j-1} \le \nu_i \le \alpha_j$.
[Figure 2 (diagram): four panels. (a) Code Complexity Classification: code problems (e.g., "Write a function to remove odd characters in a string.", tested with assert remove_odd("python")==("yhn")) and their ground-truth functions (e.g., count_binary_seq) are grouped by code complexity. (b) Call Graph Construction: functions f1 to f4 are combined into nested code, e.g., main(o) computes o1 = function1(o) and returns (function2(o1), function3(o1), function4(o1)); the graph features are max path length, branch count, and edge count. (c) Complexity-aware Metrics. (d) Benchmark Generation: input/output types are examined, the ground truth is run, and bad generations are discarded.]
Figure 2: Overview of our proposed DynaCode. (a) Classification of code complexity, resulting in code problems
with varying levels of complexity. (b) Construction of function call graphs, categorized based on graph features.
(c) Integration of code complexity and graph complexity to form two-dimensional complexity-aware metrics. (d)
Benchmark generation process.
The choice of thresholds $\alpha_0, \alpha_1, \dots, \alpha_n$ is based
on the distribution of cyclomatic complexity values
in the code dataset. These thresholds are selected to
capture progressively increasing complexity, allow-
ing for a systematic evaluation of LLMs’ perfor-
mance as task difficulty escalates.
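The unit assignment in Eq. (2) can be sketched as a threshold lookup. The threshold values below are hypothetical (the paper's actual thresholds are not given in this section), and because the intervals are closed on both ends, this sketch assumes a value that sits exactly on a boundary goes to the lower unit.

```python
from bisect import bisect_left
from typing import Sequence


def assign_unit(nu: int, thresholds: Sequence[int]) -> int:
    """Return j with thresholds[j-1] <= nu <= thresholds[j] (lowest j on ties)."""
    if not thresholds[0] <= nu <= thresholds[-1]:
        raise ValueError(f"complexity {nu} is outside the threshold range")
    return max(1, bisect_left(thresholds, nu))


# Hypothetical thresholds alpha_0..alpha_4 defining four units.
ALPHAS = [1, 2, 4, 7, 15]
print([assign_unit(nu, ALPHAS) for nu in (1, 2, 3, 5, 9)])  # prints [1, 1, 2, 3, 4]
```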
3.1.2 Call Graph Construction
A fundamental requirement of execution-based
evaluation is to analyze the execution behavior of
the generated code (Chen et al., 2021). However,
in static benchmarks, where code problems often
involve isolated functions, models may memorize
specific solutions instead of generalizing, leading
to data contamination. To tackle this issue, we
use a call-graph structure to construct nested code,
as shown in Figure 2(b). After classifying the
complexity of code problems, we aim to construct
nested problems and their corresponding nested
codes with different structures at the same complexity
unit to enhance the diversity of code generation
evaluations. Specifically, we treat each code problem and its ground truth code as nodes in
the call graph to form nested problems and nested
codes. The call graph is defined as a directed graph
$G = (V, E)$, where the node set $V$ represents the
functions in the code, and the edge set $E$ represents
the call relationships between the functions. In order
to combine code problems into nested problems
and corresponding codes, we define the call graph
as follows:
1. There are exactly $K$ distinct call-graph structures, where each structure $G_k = (V_k, E_k)$, for $k \in \{1, 2, \dots, K\}$, corresponds to a unique function call pattern.
2. The number of root nodes is $|V_{\text{root}}| = 1$, with the root node denoted by $v_1$.
3. The graph is acyclic: for any pair of nodes $u, v \in V$, if there is a directed path from $u$ to $v$, then there is no directed path from $v$ to $u$, i.e., $G$ is a Directed Acyclic Graph.
These call-graph structures range from simple lin-
ear calls to more complex configurations involv-
ing various branches and call relationships, effec-
tively increasing the logical complexity of the code.
Through this diverse set of call-graph structures,
DynaCode is able to simulate a variety of real-
world scenarios, assessing the ability of LLMs to
generate code of varying complexity. Detailed il-
lustrations of all call-graph structures are shown in
Appendix B.
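The structural constraints on a call graph (a single root and no cycles) can be checked mechanically. The following is a minimal sketch that assumes a call graph is stored as a dictionary from caller to callees; the representation and the name `is_valid_call_graph` are illustrative choices, not part of DynaCode.

```python
from collections import deque
from typing import Dict, List

CallGraph = Dict[str, List[str]]  # caller -> list of callees


def is_valid_call_graph(graph: CallGraph) -> bool:
    """Check for exactly one root (a node nobody calls) and no directed cycles."""
    nodes = set(graph) | {c for callees in graph.values() for c in callees}
    indegree = {n: 0 for n in nodes}
    for callees in graph.values():
        for c in callees:
            indegree[c] += 1
    roots = [n for n in nodes if indegree[n] == 0]
    if len(roots) != 1:
        return False
    # Kahn's algorithm: the graph is acyclic iff every node can be removed.
    queue = deque(roots)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for c in graph.get(node, []):
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    return visited == len(nodes)


star = {"f1": ["f2", "f3", "f4"]}             # the shape drawn in Figure 2(b)
cyclic = {"f1": ["f2"], "f2": ["f3"], "f3": ["f2"]}
print(is_valid_call_graph(star), is_valid_call_graph(cyclic))  # prints True False
```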
To more precisely evaluate the code generation
capability of LLMs, DynaCode categorizes the
complexity of different call-graph structures to fur-
ther assess LLM performance within the same unit.
The complexity of the call graph is influenced by
these key features:
1. Maximum path length $L_{\max}(G)$: The longest path from the root node $v_1$ to any other node in $G$.
2. Branch count $B(G)$: The total number of branching points in $G$, indicating how many functions each function calls.
3. Edge count $|E|$: The total number of directed edges in the graph, which quantifies the interdependencies between functions.
To measure the complexity of the graph $G$, we
define a comprehensive feature metric $\mathcal{M}$:
$$\mathcal{M}(G) = L_{\max}(G) \times B(G) \times |E|. \tag{3}$$
The $\mathcal{M}$ feature, computed as the product of the
maximum path length, branch count, and edge
count, reflects the overall complexity
of the call graph. Based on this feature,
we categorize the complexity of call graphs into
different levels according to predefined thresholds.
Let $\beta_0, \beta_1, \dots, \beta_m$ be the thresholds. Then, the
classification is defined as follows:
$$L_j = \{G \mid \beta_{j-1} \le \mathcal{M}(G) \le \beta_j\}, \tag{4}$$
where $L_j$ represents the set of call graphs whose
$\mathcal{M}$ values fall within the interval $[\beta_{j-1}, \beta_j]$,
corresponding to the $j$-th level of graph complexity.
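To make Eq. (3) concrete, the sketch below computes the three features for a call graph stored as a caller-to-callees dictionary. Two conventions are assumptions of this example rather than statements from the paper: path length is measured in edges, and a branching point is a function that calls at least two other functions.

```python
from functools import lru_cache
from typing import Dict, List

CallGraph = Dict[str, List[str]]  # caller -> list of callees


def graph_complexity(graph: CallGraph, root: str) -> int:
    """Compute M(G) = L_max(G) * B(G) * |E| for an acyclic call graph."""

    @lru_cache(maxsize=None)
    def longest_from(node: str) -> int:
        # Longest path below `node`, measured in edges (assumption).
        return max((1 + longest_from(c) for c in graph.get(node, [])), default=0)

    l_max = longest_from(root)
    # Assumption: a branching point is a function with two or more callees.
    branches = sum(1 for callees in graph.values() if len(callees) >= 2)
    edges = sum(len(callees) for callees in graph.values())
    return l_max * branches * edges


star = {"f1": ["f2", "f3", "f4"]}
two_layer = {"f1": ["f2", "f3"], "f2": ["f4", "f5"]}
print(graph_complexity(star, "f1"), graph_complexity(two_layer, "f1"))  # prints 3 16
```

Under this branch-count convention a purely sequential chain scores 0, so the values should be read as illustrative; a different counting convention would shift the numbers but not the idea of ranking graphs by the product.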
3.2 Complexity-aware metrics
By combining code complexity classification with
call-graph complexity classification, we propose
a comprehensive complexity measurement matrix,
as illustrated in Figure 2(c). This matrix provides
a two-dimensional framework for evaluating code
complexity, integrating both the internal logical
structure of functions and their interrelationships
within the call graph. Its goal is to enable a holistic
assessment of code complexity by considering two
critical aspects: the inherent complexity of func-
tion logic and the complexity of its interactions
within the overall code structure. The matrix is
represented as:
$$C = \{c_{\xi,\eta} \mid 1 \le \xi \le n,\ 1 \le \eta \le m\}, \tag{5}$$
where $c_{\xi,\eta}$ represents the complexity value at
the intersection of code complexity unit $\xi$ and
call-graph complexity level $\eta$. Specifically,
$\xi \in \{1, 2, \dots, n\}$ corresponds to the different units of
code complexity, and $\eta \in \{1, 2, \dots, m\}$ corresponds
to the levels of complexity associated with
the call-graph structure. The matrix $C$ allows for
a detailed, two-dimensional assessment of code
complexity by evaluating how the internal logic of
functions interacts with the overall structure of the
code, which can significantly influence the perfor-
mance and maintainability of generated code.
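One way to use the two-dimensional matrix in practice is to report a pass rate per cell (unit, level). The sketch below assumes evaluation results are available as (unit, level, passed) records; this record format and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rate_matrix(
    records: Iterable[Tuple[int, int, bool]]
) -> Dict[Tuple[int, int], float]:
    """Map each cell (unit xi, level eta) to the fraction of problems passed."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for unit, level, ok in records:
        total[(unit, level)] += 1
        passed[(unit, level)] += int(ok)
    return {cell: passed[cell] / total[cell] for cell in total}


# Hypothetical results: two problems in cell (1, 1), one in cell (2, 3).
records = [(1, 1, True), (1, 1, False), (2, 3, True)]
print(pass_rate_matrix(records))  # prints {(1, 1): 0.5, (2, 3): 1.0}
```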
3.3 Benchmark Generation
Through the steps outlined above, we can construct
a systematic code generation benchmark. Our ap-
proach improves upon existing code problems by
dynamically combining them to generate Dyna-
Code, which incorporates both a complexity frame-
work and randomness into the code problems. The
benchmark generation process is illustrated in Fig-
ure 2(d). The specific procedure for generating
DynaCode is as follows:
Problem Collection. We begin by collecting ex-
isting code problems to form our DynaCode unit
functions. To mitigate data contamination risks, we
actively source recently published code problems
from the web and incorporate them into our bench-
mark. This approach allows us to continuously
update the code problem set with new releases,
thereby alleviating data contamination.
Code Complexity Classification. For each unit
function, we evaluate its complexity using cyclo-
matic complexity and categorize it accordingly,
forming multiple code problem units. Simulta-
neously, we employ the Monkeytype (Facebook,
2017) tool to generate corresponding input and out-
put data for each code problem, discarding any data
that cannot be integrated into valid nested codes.
Problem Combination. Then, for each unit of
code problems, we merge the problems into new
nested problems by aligning their input and output
types according to the call-graph structure. Concur-
rently, we automatically assemble the correspond-
ing code. The automatically generated problem
prompt is presented in Appendix E.
Testcase Generation. After obtaining code that
conforms to the call-graph structure, we inject the
input values from the root node of the call graph
into the entire nested code and execute it in batches.
If any execution errors occur, the generation is classified
as a bad generation. We then filter out these
bad generations to retain valid nested code, nested
problems, and their corresponding test cases. The
valid nested code is presented in Appendix G.
Type            Unit 1       Unit 2       Unit 3        Unit 4
Base               153          100           76            76
Level 1         88,339       10,553        7,931        23,996
Level 2      3,300,172      157,950      107,578       585,379
Level 3     27,771,290      518,215      332,610     3,428,967
Level 4    133,358,131    2,528,807    1,928,288    15,114,935
Table 1: The number of problems contained in different
units and the corresponding number of benchmarks that
can be generated in combination.
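The combination and test-case steps can be sketched for the star-shaped graph of Figure 2(b), where the root's output feeds every child. This is a simplified illustration: `make_nested` and `build_test_cases` are hypothetical helpers, the child functions are arbitrary stand-ins, and the sketch adopts one reading of the filtering rule, namely that any exception on any root input marks the whole generation as bad.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple


def make_nested(root: Callable, children: Sequence[Callable]) -> Callable:
    """Compose unit functions following the graph f1 -> (f2, f3, ...)."""
    def main(o):
        o1 = root(o)
        return tuple(child(o1) for child in children)
    return main


def build_test_cases(
    main: Callable, inputs: Sequence[Any]
) -> Optional[List[Tuple[Any, Any]]]:
    """Run the nested code on each root input; return None for a bad generation."""
    cases = []
    for x in inputs:
        try:
            cases.append((x, main(x)))
        except Exception:
            return None  # any execution error discards the generation
    return cases


def remove_odd(s: str) -> str:
    # Unit function from Figure 2: remove_odd("python") == "yhn".
    return s[1::2]


main = make_nested(remove_odd, [len, str.upper])
print(build_test_cases(main, ["python", "abcd"]))
# prints [('python', (3, 'YHN')), ('abcd', (2, 'BD'))]
print(build_test_cases(main, ["python", 42]))  # prints None (42 is not sliceable)
```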
4 Evaluation and Analysis
4.1 Experimental Setup
Data. We selected MBPP+, processed
by EvalPlus (Liu et al., 2024c), as our unit function
set. MBPP+ is an extended version of
MBPP (Austin et al., 2021), enhanced and refined
to provide more comprehensive test cases and solutions.
To address potential data contamination
in the unit functions, we also curated a set of
recent programming problems from LeetCode1.
These problems, along with their corresponding
solutions collected using official test cases, were
integrated into our evaluation. As a demonstration,
we introduced 22 new code problems into
Unit 3 and 18 new code problems into Unit 4. Together,
these two sources form our unit
function set, ensuring both diversity and relevance
to real-world scenarios.
Evaluation metric. Following previous
work (Chen et al., 2021), we use Pass@1 as the
evaluation metric, which measures the percentage
of problems solved correctly on the first attempt
without further corrections. Since our benchmark
introduces progressively increasing complexity
levels, we use Pass@1 as a unified evaluation
metric to ensure consistent comparisons with
MBPP, MBPP+, and across different complexity
levels.
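For reference, Pass@1 is the k = 1 case of the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021): with n samples per problem of which c are correct, pass@k = 1 − C(n−c, k)/C(n, k). The sketch below implements that estimator and the single-sample benchmark average; the helper name `benchmark_pass_at_1` is an assumption for this example.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(first_try_results: Sequence[bool]) -> float:
    """Pass@1 over a benchmark when one sample is drawn per problem."""
    scores = [pass_at_k(1, int(ok), 1) for ok in first_try_results]
    return sum(scores) / len(scores)


print(benchmark_pass_at_1([True, False, True, True]))  # prints 0.75
```

With a single deterministic sample per problem (temperature 0), the estimator reduces to the fraction of problems whose first generation passes all tests.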
Evaluated LLMs. We selected a range of mainstream
LLMs to evaluate the effectiveness of DynaCode
in code generation tasks. The evaluated
models include GPT-4o (Achiam et al., 2023),
GPT-3.5-Turbo (OpenAI, 2023), DeepSeek-V3 (Liu
et al., 2024a), Qwen2.5-Coder-32B-Instruct (Hui
et al., 2024), WizardLM-2-8x22B (Xu et al., 2023),
Mixtral-8x22B-Instruct-v0.1 (Jiang et al., 2024),
Phind-CodeLlama-34B-v2 (Roziere et al., 2023),
starcoder2-15b-instruct-v0.1 (Wei et al., 2024),
codegemma-7b-it (Team et al., 2024),
Meta-Llama-3.1-405B-Instruct (Meta AI, 2024b),
Meta-Llama-3.3-70B-Instruct (Meta AI, 2024d), and
Meta-Llama-3.1-8B-Instruct (Meta AI, 2024c). To ensure
the stability and consistency of the evaluation
results, we set the temperature to 0 to eliminate
randomness during the generation
process. Our prompts are shown in Appendix E.
1 LeetCode, https://leetcode.com/
Benchmark Statistics. We have compiled statis-
tics on the number of problems that can be gener-
ated from the selected dataset after applying the
dynamic evaluation strategy, as shown in Table 1.
Specifically, Table 4 provides further details on the
number of problems generated for each call graph,
facilitating an understanding of the scale and distri-
bution of problems across different structures.
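As a quick arithmetic check, the Level 1 to Level 4 counts in Table 1 can be summed to recover the headline figure of roughly 189 million generated problems. The snippet only re-adds the numbers printed in the table.

```python
# Counts copied from Table 1; each list holds Unit 1 to Unit 4.
GENERATED = {
    "Level 1": [88_339, 10_553, 7_931, 23_996],
    "Level 2": [3_300_172, 157_950, 107_578, 585_379],
    "Level 3": [27_771_290, 518_215, 332_610, 3_428_967],
    "Level 4": [133_358_131, 2_528_807, 1_928_288, 15_114_935],
}

per_level = {level: sum(counts) for level, counts in GENERATED.items()}
total = sum(per_level.values())
print(per_level["Level 4"])  # prints 152930161
print(f"{total:,}")          # prints 189,263,141
```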
4.2 Benchmark Results
Model Performance. The results of our bench-
mark are summarized in Table 2. Compared to tra-
ditional benchmarks like MBPP and MBPP+, Dy-
naCode shows a more pronounced decline in model
performance as code complexity increases. For example,
GPT-4o achieves a Pass@1 score of 87.6%
on MBPP and 72.2% on MBPP+, but drops significantly
to 55.4% on DynaCode. A similar trend is
observed with GPT-3.5-Turbo, which drops from
82.5% on MBPP to 29.3% on DynaCode. These
results highlight the increasing complexity in our
benchmark, confirming its ability to assess models
on more complex, real-world code generation tasks.
Specific performance variations are shown for se-
lected models in Figure 3. Our complexity-aware
metrics effectively differentiate model capabilities,
as evidenced by consistent performance degrada-
tion across complexity levels. Models like GPT-
4o show better resilience to increasing complexity,
while others like WizardLM-2-8x22B struggle as
complexity rises. This demonstrates DynaCode’s
robustness in evaluating not just basic code gen-
eration but also handling nested, multi-function
code structures. The performance trends across all
models and complexity scenarios are depicted in
Figure 9, further validating our benchmark’s con-
tribution in revealing challenges faced by LLMs in
complex code generation tasks.
Effect of the Problem Sizes. To further inves-
tigate our benchmark, we conducted a study on
the effect of the number of dynamically gener-
ated problems on the evaluation performance of
the benchmark. For each unit and each call graph
                                                     DynaCode
Model                         Params  MBPP  MBPP+  Unit 1       Unit 2       Unit 3       Unit 4       Average
GPT-4o                        -       87.6  72.2   74.4 (±1.6)  48.7 (±1.4)  56.2 (±1.4)  42.3 (±0.3)  55.4 (±0.9)
GPT-3.5-Turbo                 -       82.5  69.7   34.9 (±1.0)  30.5 (±1.6)  25.6 (±0.6)  26.1 (±1.2)  29.3 (±0.4)
DeepSeek-V3                   236B    87.6  73.0   65.9 (±0.8)  41.6 (±2.3)  53.6 (±0.7)  47.3 (±1.5)  52.1 (±0.4)
Qwen2.5-Coder-32B-Instruct    32B     90.5  77.0   59.3 (±1.6)  33.6 (±1.6)  44.1 (±1.1)  36.0 (±0.8)  43.2 (±0.2)
WizardLM-2-8x22B              176B    71.7  60.8   35.4 (±0.8)  27.5 (±0.9)  25.1 (±0.8)  12.8 (±1.1)  25.2 (±0.6)
Mixtral-8x22B-Instruct-v0.1   176B    73.8  64.3   35.4 (±0.8)  27.2 (±1.3)  25.0 (±0.5)  12.8 (±0.5)  25.1 (±0.7)
Phind-CodeLlama-34B-v2        34B     85.4  69.6   40.9 (±0.9)  29.5 (±1.8)  45.2 (±1.5)  25.4 (±0.2)  35.3 (±0.6)
starcoder2-15b-instruct-v0.1  15B     78.0  65.1   40.7 (±1.0)  29.4 (±1.4)  45.0 (±1.3)  25.3 (±0.5)  35.1 (±0.4)
codegemma-7b-it               7B      70.4  56.9   4.6 (±0.4)   2.7 (±0.2)   1.1 (±0.2)   3.0 (±0.4)   2.9 (±0.2)
Meta-Llama-3.1-405B-Instruct  405B    88.4  73.0   49.7 (±0.9)  40.0 (±1.6)  47.6 (±1.1)  26.9 (±1.2)  41.0 (±0.8)
Meta-Llama-3.3-70B-Instruct   70B     89.2  75.1   36.0 (±1.4)  27.5 (±1.4)  54.9 (±1.1)  31.2 (±0.8)  37.4 (±0.7)
Meta-Llama-3.1-8B-Instruct    8B      68.3  55.6   14.1 (±1.0)  9.7 (±0.6)   8.4 (±1.0)   7.4 (±0.7)   9.9 (±0.8)
Table 2: Pass@1 results on MBPP, MBPP+, and our DynaCode across varying code complexities. For DynaCode,
results are reported for different function complexity levels. To ensure robustness, all experiments were conducted
three times with 5 different random seeds, and the average results are presented.
[Figure 3 (line plots): four panels, one per model (GPT-4o, GPT-3.5-Turbo, WizardLM-2-8x22B, Meta-Llama-3.1-405B-Instruct); x-axis: Level (1 to 4); y-axis: Pass@1 Score (%); one line per unit (Unit 1 to Unit 4).]
Figure 3: Comparison of average Pass@1 scores across 4 LLMs (GPT-4o, GPT-3.5-Turbo, WizardLM-2-8x22B,
Meta-Llama-3.1-405B-Instruct) at different complexity levels.
[Figure 4 (line plot): x-axis: Number of Problems (25 to 150); y-axis: Average Pass@1 score (%); one line per unit (Unit 1 to Unit 4).]
Figure 4: Experiment on the number of problems. For
each graph in every unit, a corresponding number of
problems is generated.
type, we generated different numbers of problems
($\{25, 50, 75, 100, 125, 150\}$), and the results
are shown in Figure 4. The experiments indicate
that when the number of problems is greater than or
equal to 75, the evaluation results meet our require-
ments for stability and reliability. Therefore, we set
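[Figure 5 (bar chart): Pass@1 score (%) on MBPP+ and DynaCode under three settings: w/o finetune, finetuned on MBPP+/DynaCode, and finetuned on unit functions. GPT-3.5-Turbo: MBPP+ rises from 69.7% to 88.6%; DynaCode rises from 32.6% to 36.0%, and is 15.2% when finetuned on unit functions. Meta-Llama-3.1-8B-Instruct: MBPP+ rises from 55.6% to 98.1%; DynaCode rises from 10.6% to 23.6%, and is 6.9% when finetuned on unit functions.]
Figure 5: Pass@1 scores of GPT-3.5-Turbo and Meta-Llama-3.1-8B-Instruct
before and after fine-tuning on
MBPP+, DynaCode, and DynaCode unit functions.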
100 for other experiments. Despite the variations
in the number of generated problems, DynaCode
can systematically evaluate code generation tasks
at various complexity levels and provide reliable
assessments for each complexity scenario. Detailed
results for each number are provided in Figure 10.
4.3 Experimental Insights
DynaCode Limits Memorization for More Re-
liable Evaluation. A key challenge in evaluating
[Figure 6 (line plot): x-axis: index of call graph (1 to 16); y-axis: Pass@1 Score (%); one line per model (GPT-4o, GPT-3.5-Turbo, Meta-Llama-3.1-405B-Instruct, WizardLM-2-8x22B).]
Figure 6: Comparative performance evaluation of 4 LLMs (GPT-4o, GPT-3.5-Turbo, Meta-Llama-3.1-405B-Instruct,
and WizardLM-2-8x22B) across different call graphs. Models exhibit better performance on sequential call graphs
{G1, G2, G3, G4, G8}. Details of the call graphs are provided in Appendix B.
Ability                  Error Type                                                            Unit 1       Unit 2       Unit 3        Unit 4
Problem Understanding    AssertionError, ValueError, RecursionError, ZeroDivisionError         64.1% (615)  79.9% (832)  88.2% (1027)  88.8% (988)
Code Pattern Generation  SyntaxError, IndentationError                                         6.6% (63)    0.4% (4)     0.2% (2)      1.5% (17)
Context Management       NameError, AttributeError, TypeError, IndexError, UnboundLocalError   29.4% (282)  19.7% (205)  11.7% (136)   9.4% (105)
Other                    RuntimeError, OverflowError                                           -            -            -             0.3% (3)
Table 3: Error analysis of GPT-3.5-Turbo on DynaCode. The model was evaluated across 4 units of increasing
complexity, with 100 questions per graph. The table summarizes the distribution and frequency of errors categorized
by problem understanding, code pattern generation, and context management.
LLMs is data contamination, where models may
memorize training data instead of generalizing ef-
fectively. To explore this issue, we fine-tuned a
commercial LLM, GPT-3.5-Turbo, and an open-source
LLM, Meta-Llama-3.1-8B-Instruct, on both
the MBPP+ and DynaCode datasets, and then evaluated
their performance, as shown in Figure 5. To
ensure a fair comparison and avoid catastrophic
forgetting, we use the following fine-tuning strategy:
for GPT-3.5-Turbo, we trained on MBPP+
for 5 epochs, on DynaCode unit functions for
5 epochs, and on DynaCode for 1 epoch, while
keeping the total fine-tuning steps fixed at 1890;
for Meta-Llama-3.1-8B-Instruct, we trained on
MBPP+ for 10 epochs, on DynaCode unit functions
for 10 epochs, and on DynaCode for 2 epochs,
while keeping the total fine-tuning steps fixed at
3780. The results show a significant improvement
in GPT-3.5-Turbo’s Pass@1 score on MBPP+,
which increased from 69.7% to 88.6%, while on
DynaCode, the improvement was much smaller,
rising from 32.6% to 36.0%. A similar pattern was
observed for Meta-Llama-3.1-8B-Instruct, where
the model’s performance on MBPP+ surged from
55.6% to 98.1%, but only improved slightly from
10.6% to 23.6% on DynaCode. To further investigate
the impact of data contamination, we introduced
a fine-tuning setting using only the unit functions
from DynaCode, and then evaluated the models
on the full DynaCode benchmark. Under this
setting, GPT-3.5-Turbo’s Pass@1 score dropped to
15.2%, while Meta-Llama-3.1-8B-Instruct’s score
fell to 6.9%, both below the scores obtained without
fine-tuning. The small gains from fine-tuning on
DynaCode itself, together with the drop after fine-tuning
on the unit functions alone, suggest that memorizing
the underlying problems does not carry over to the nested
problems, implying that DynaCode’s dynamic evaluation strategy
effectively mitigates data contamination. As a result,
DynaCode offers a more reliable and comprehensive
assessment of LLMs’ true code generation
capabilities, ensuring that the evaluation is based
on the model’s generalization ability rather than its
capacity to memorize specific training examples.
LLMs Are Good at Sequential Execution. We
evaluated 4 LLMs on different call graphs, aver-
aging their Pass@1 scores across all 4 units, as
shown in Figure 6. The results reveal a clear trend:
LLMs perform significantly better on sequentially
structured graphs {G1, G2, G3, G4, G8}, which in-
volve linear execution with minimal branching.
This suggests that LLMs excel at straightforward
function compositions and stepwise computations.
However, as call graph complexity increases, par-
ticularly in multi-layered, multi-branch structures
{G9, G10, G11, G12, G13, G14, G15, G16}, model
performance drops considerably. This indicates
that LLMs struggle with parallel function dependencies
and managing execution across interdependent
subfunctions. Among the evaluated models,
GPT-4o consistently outperforms others across
all graphs, showing greater resilience to structural
complexity. Nevertheless, even GPT-4o exhibits a
notable decline in high-complexity graphs, high-
lighting a fundamental limitation in LLMs’ ability
to generate and manage deeply nested, interdepen-
dent code structures.
LLMs Struggle with Problem Understanding
as Complexity Increases. We evaluate the code
generation capabilities of GPT-3.5-Turbo on Dyna-
Code by analyzing the distribution and frequency
of errors across 4 units, classifying them into Prob-
lem Understanding, Code Pattern Generation, and
Context Management abilities based on common
code errors. The results are summarized in Table 3.
The findings show that the error rate for Problem
Understanding increases progressively from 64.1%
in unit 1 to 88.8% in unit 4, indicating that the increasing
complexity of the code problems makes
it more difficult for GPT-3.5-Turbo to correctly un-
derstand the problem requirements. In contrast, the
error rates for Code Pattern Generation and Con-
text Management show a decreasing trend across
units. However, this reduction does not imply an
improvement in capabilities. Instead, it reflects a
shift in the error distribution: as the task complex-
ity increases, the model often fails at the Problem
Understanding stage, which prevents it from producing
code that would otherwise expose errors in syntax
or context management. The error classification
details are shown in Appendix F.
5 Conclusion
We present DynaCode, a dynamic, complexity-
aware benchmark designed to systematically evalu-
ate LLMs on code generation tasks. By integrating
code complexity with call-graph structures, Dy-
naCode generates nested code problems that cap-
ture varying levels of code complexity and inter-
function dependencies. We evaluated 12 of the
latest LLMs, and the results reveal significant per-
formance drops as both unit and graph complexity
increase, highlighting DynaCode’s ability to sys-
tematically assess LLMs. Notably, DynaCode also
addresses the challenge of data contamination and
demonstrates its capacity to mitigate this limitation.
In summary, DynaCode provides a new perspec-
tive for examining and analyzing LLMs, offering
a scalable framework for more comprehensive and
reliable LLM evaluation.
Limitations
DynaCode primarily focuses on relatively simple
call-graph structures, with a maximum node count of 5.
While this ensures manageable complexity for cur-
rent LLMs, it is possible that more advanced LLMs
in the future may learn to handle such call patterns.
We will extend DynaCode to include more com-
plex call-graph structures in future work, further
challenging LLMs and enhancing the benchmark’s
scalability.
References
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama
Ahmad, Ilge Akkaya, Florencia Leoni Aleman,
Diogo Almeida, Janko Altenschmidt, Sam Altman,
Shyamal Anadkat, et al. 2023. Gpt-4 technical report.
arXiv preprint arXiv:2303.08774 .
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten
Bosma, Henryk Michalewski, David Dohan, Ellen
Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021.
Program synthesis with large language models. arXiv
preprint arXiv:2108.07732 .
Claas Beger and Saikat Dutta. 2025. Coconut: Struc-
tural code understanding does not fall out of a tree.
arXiv preprint arXiv:2501.16456 .
Federico Cassano, John Gouwar, Daniel Nguyen, Syd-
ney Nguyen, Luna Phipps-Costin, Donald Pinckney,
Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson,
Molly Q Feldman, et al. 2023. Multipl-e: a scal-
able and polyglot approach to benchmarking neural
code generation. IEEE Transactions on Software
Engineering , 49(7):3675–3691.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming
Yuan, Henrique Ponde de Oliveira Pinto, Jared Ka-
plan, Harri Edwards, Yuri Burda, Nicholas Joseph,
Greg Brockman, Alex Ray, Raul Puri, Gretchen
Krueger, Michael Petrov, Heidy Khlaaf, Girish Sas-
try, Pamela Mishkin, Brooke Chan, Scott Gray,
Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz
Kaiser, Mohammad Bavarian, Clemens Winter,
Philippe Tillet, Felipe Petroski Such, Dave Cummings,
Matthias Plappert, Fotios Chantzis, Elizabeth Barnes,
Ariel Herbert-Voss, William Hebgen
Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie
Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain,
William Saunders, Christopher Hesse, Andrew N.
Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan
Morikawa, Alec Radford, Matthew Knight, Miles
Brundage, Mira Murati, Katie Mayer, Peter Welinder,
Bob McGrew, Dario Amodei, Sam McCandlish, Ilya
Sutskever, and Wojciech Zaremba. 2021. Evaluating
large language models trained on code.
Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Han-
tian Ding, Ming Tan, Nihal Jain, Murali Krishna
Ramanathan, Ramesh Nallapati, Parminder Bhatia,
Dan Roth, and Bing Xiang. 2023. Crosscodeeval:
A diverse and multilingual benchmark for cross-file
code completion. In Thirty-seventh Conference on
Neural Information Processing Systems Datasets and
Benchmarks Track .
Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang,
Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng
Sha, Xin Peng, and Yiling Lou. 2023. Classe-
val: A manually-crafted benchmark for evaluat-
ing llms on class-level code generation. Preprint ,
arXiv:2308.01861.
Inc. Facebook. 2017. Monkeytype: A python annota-
tion tool for adding type hints to your code.
Lizhou Fan, Wenyue Hua, Lingyao Li, Haoyang Ling,
and Yongfeng Zhang. 2023. Nphardeval: Dynamic
benchmark on reasoning ability of large language
models via complexity classes. arXiv preprint
arXiv:2312.14890 .
Alex Gu, Baptiste Rozière, Hugh Leather, Armando
Solar-Lezama, Gabriel Synnaeve, and Sida I Wang.
2024. Cruxeval: A benchmark for code reason-
ing, understanding and execution. arXiv preprint
arXiv:2401.03065 .
Maurice H Halstead. 1977. Elements of Software Sci-
ence (Operating and programming systems series) .
Elsevier Science Inc.
Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong
Wang, Li Li, Xiapu Luo, David Lo, John Grundy,
and Haoyu Wang. 2024. Large language models for
software engineering: A systematic literature review.
ACM Trans. Softw. Eng. Methodol. , 33(8).
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Day-
iheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang,
Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-coder
technical report. arXiv preprint arXiv:2409.12186 .
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia
Yan, Tianjun Zhang, Sida Wang, Armando Solar-
Lezama, Koushik Sen, and Ion Stoica. 2024. Live-
codebench: Holistic and contamination free eval-
uation of large language models for code. arXiv
preprint arXiv:2403.07974 .
Albert Q Jiang, Alexandre Sablayrolles, Antoine
Roux, Arthur Mensch, Blanche Savary, Chris Bam-
ford, Devendra Singh Chaplot, Diego de las Casas,
Emma Bou Hanna, Florian Bressand, et al. 2024.
Mixtral of experts. arXiv preprint arXiv:2401.04088 .
Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang,
Hongyu Zhang, Ruoming Jin, and Qiang Guan. 2025.
Can large language models understand intermediate
representations? arXiv preprint arXiv:2502.06854 .
Carlos E Jimenez, John Yang, Alexander Wettig,
Shunyu Yao, Kexin Pei, Ofir Press, and Karthik
Narasimhan. 2023. Swe-bench: Can language mod-
els resolve real-world github issues? arXiv preprint
arXiv:2310.06770.
Mohammad Abdullah Matin Khan, M Saiful Bari,
Do Long, Weishi Wang, Md Rizwan Parvez, and
Shafiq Joty. 2024. Xcodeeval: An execution-based
large scale multilingual multitask benchmark for
code understanding, generation, translation and re-
trieval. In Proceedings of the 62nd Annual Meeting
of the Association for Computational Linguistics (Vol-
ume 1: Long Papers) , pages 6766–6805.
Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, and
Zhi Jin. 2024. Evocodebench: An evolving code
generation benchmark aligned with real-world code
repositories. arXiv preprint arXiv:2404.00599 .
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang,
Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi
Deng, Chenyu Zhang, Chong Ruan, et al. 2024a.
Deepseek-v3 technical report. arXiv preprint
arXiv:2412.19437 .
Changshu Liu, Shizhuo Dylan Zhang, Ali Reza
Ibrahimzada, and Reyhaneh Jabbarvand. 2024b.
Codemind: A framework to challenge large lan-
guage models for code reasoning. arXiv preprint
arXiv:2402.09664 .
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Ling-
ming Zhang. 2024c. Is your code generated by chat-
gpt really correct? rigorous evaluation of large lan-
guage models for code generation. In Proceedings
of the 37th International Conference on Neural In-
formation Processing Systems , NIPS ’23, Red Hook,
NY, USA. Curran Associates Inc.
Zhijie Liu, Yutian Tang, Xiapu Luo, Yuming Zhou, and
Liang Feng Zhang. 2024d. No need to lift a finger
anymore? assessing the quality of code generation by
chatgpt. IEEE Transactions on Software Engineer-
ing.
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey
Svyatkovskiy, Ambrosio Blanco, Colin Clement,
Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021.
Codexglue: A machine learning benchmark dataset
for code understanding and generation. arXiv
preprint arXiv:2102.04664 .
Thomas J McCabe. 1976. A complexity measure. IEEE
Transactions on software Engineering , (4):308–320.
Meta AI. 2024a. Meta-llama-3-8b-instruct.
https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct.
Meta AI. 2024b. Meta-llama-3.1-405b-instruct.
https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct.
Meta AI. 2024c. Meta-llama-3.1-8b-instruct.
https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct.
Meta AI. 2024d. Meta-llama-3.3-70b-instruct.
https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct.
Microsoft. 2024. phi-2. https://huggingface.co/microsoft/phi-2.
OpenAI. 2023. Gpt-3.5-turbo. https://openai.com/.
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten
Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi,
Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023.
Code llama: Open foundation models for code. arXiv
preprint arXiv:2308.12950 .
Rubik. 2014. Radon: A tool to compute complexity
metrics for python source code.
CodeGemma Team, Heri Zhao, Jeffrey Hui, Joshua
Howland, Nam Nguyen, Siqi Zuo, Andrea Hu,
Christopher A Choquette-Choo, Jingyue Shen, Joe
Kelley, et al. 2024. Codegemma: Open code models
based on gemma. arXiv preprint arXiv:2406.11409 .
Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu
Wei, and Xuanjing Huang. 2024. Benchmark self-
evolving: A multi-agent framework for dynamic llm
evaluation. arXiv preprint arXiv:2402.11443 .
Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng
Ding, Naman Jain, Zachary Mueller, Harm de Vries,
Leandro Von Werra, Arjun Guha, and Lingming
Zhang. 2024. Selfcodealign: Self-alignment for code
generation. arXiv preprint arXiv:2410.24198 .
Chunqiu Steven Xia, Yinlin Deng, and Lingming Zhang.
2024. Top leaderboard ranking = top coding pro-
ficiency, always? evoeval: Evolving coding bench-
marks via llm. arXiv preprint .
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng,
Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin
Jiang. 2023. Wizardlm: Empowering large lan-
guage models to follow complex instructions. arXiv
preprint arXiv:2304.12244 .
Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang,
Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang,
and Tao Xie. 2024. Codereval: A benchmark of prag-
matic code generation with generative pre-trained
models. In Proceedings of the 46th IEEE/ACM Inter-
national Conference on Software Engineering , pages
1–12.
Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson,
Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja,
Dylan Slack, Qin Lyu, et al. 2024a. A careful exami-
nation of large language model performance on grade
school arithmetic. arXiv preprint arXiv:2405.00332 .
Zhehao Zhang, Jiaao Chen, and Diyi Yang. 2024b.
Darg: Dynamic evaluation of large language mod-
els via adaptive reasoning graph. arXiv preprint
arXiv:2406.17271 .
Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan
Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang,
Yang Li, et al. 2023. Codegeex: A pre-trained model
for code generation with multilingual benchmarking
on humaneval-x. In Proceedings of the 29th ACM
SIGKDD Conference on Knowledge Discovery and
Data Mining , pages 5673–5684.
Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang
Gong, Diyi Yang, and Xing Xie. 2023. Dyval: Graph-
informed dynamic evaluation of large language models.
arXiv preprint arXiv:2309.17167.
Kaijie Zhu, Jindong Wang, Qinlin Zhao, Ruochen Xu,
and Xing Xie. 2024. Dyval 2: Dynamic evaluation of
large language models by meta probing agents. arXiv
preprint arXiv:2402.14865 .
Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu,
Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani
Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al.
2024. Bigcodebench: Benchmarking code genera-
tion with diverse function calls and complex instruc-
tions. arXiv preprint arXiv:2406.15877 .
Albert Ziegler, Eirini Kalliamvakou, X. Alice Li, An-
drew Rice, Devon Rifkin, Shawn Simister, Ganesh
Sittampalam, and Edward Aftandilian. 2022. Pro-
ductivity assessment of neural code completion. In
Proceedings of the 6th ACM SIGPLAN International
Symposium on Machine Programming , MAPS 2022,
pages 21–29, New York, NY, USA. Association for
Computing Machinery.
Appendix
A Cyclomatic Complexity Details
We present the cyclomatic complexity calculations
for Sequence, If-Else, While, and While-Not
control structures, illustrating their impact on code
complexity in Figure 8. Cyclomatic complexity
quantifies the number of independent execution
paths in a program, making it a useful metric for
assessing the logical complexity of different con-
trol flows. Simple structures like Sequence have
a lower complexity, as they follow a single execution
path, whereas If-Else and While introduce
branching and loops, increasing the number of pos-
sible execution paths. While-Not adds additional
conditional constraints, further influencing com-
plexity. Understanding these calculations helps in
evaluating how LLMs handle different levels of
logical complexity in generated code.
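To make the path-counting idea concrete, the sketch below approximates cyclomatic complexity as one plus the number of decision points found in a Python syntax tree. It is a simplified stand-in built on the standard `ast` module and is not necessarily the calculation used to build DynaCode; the function name `cyclomatic_complexity` and the set of counted node types are assumptions made for illustration.

```python
import ast

# Node types treated as decision points in this simplified counter (an assumption).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    complexity = 1  # a straight-line program has exactly one path
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds one extra branch per additional operand
            complexity += len(node.values) - 1
    return complexity


SEQUENCE = "x = 1\ny = x + 1\n"
IF_ELSE = "if x > 0:\n    y = 1\nelse:\n    y = 2\n"
WHILE = "while x > 0:\n    x -= 1\n"

for label, code in [("Sequence", SEQUENCE), ("If-Else", IF_ELSE), ("While", WHILE)]:
    print(label, cyclomatic_complexity(code))
```

Under this counter, the sequence snippet scores 1, while the if-else and while snippets each score 2, matching the intuition that every branch or loop adds one independent path.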
B Call Graph Details
In our experiments, we set the maximum number
of nodes in the call graph to 5. This configura-
tion enables the generation of 16 distinct call-graph
structures, each corresponding to a unique func-
tion call pattern. Every call graph is modeled as
a directed acyclic graph with a single root node,
ensuring a unified entry point for function calls.
Figure 7 illustrates the detailed configurations of all
16 call-graph structures. These structures provide
a comprehensive testbed for evaluating the perfor-
mance of LLMs in generating code with complex
nested function calls, thereby enhancing our assess-
ment of their ability to handle varying levels of
logical and structural complexity.
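One reading that happens to reproduce the count of 16 is that the structures are the non-isomorphic rooted trees with 2 to 5 nodes (a rooted tree is a special case of a single-root directed acyclic graph). The source does not state this, so the sketch below is only a consistency check under that assumption; the tuple encoding and the names `grow` and `rooted_trees` are hypothetical.

```python
def grow(tree):
    """Yield every canonical tree obtained by attaching one new leaf to `tree`.

    A tree is encoded as a sorted tuple of its child subtrees; a leaf is ().
    """
    # Attach the new leaf directly under the root.
    yield tuple(sorted(tree + ((),)))
    # Or attach it somewhere inside one of the existing child subtrees.
    for i, child in enumerate(tree):
        for grown_child in grow(child):
            yield tuple(sorted(tree[:i] + (grown_child,) + tree[i + 1:]))


def rooted_trees(num_nodes):
    """Return the set of non-isomorphic rooted trees with `num_nodes` nodes."""
    trees = {()}  # the single-node tree
    for _ in range(num_nodes - 1):
        # Sorting children gives a canonical form, so the set removes duplicates.
        trees = {bigger for tree in trees for bigger in grow(tree)}
    return trees


counts = {n: len(rooted_trees(n)) for n in range(2, 6)}
print(counts, sum(counts.values()))
```

The per-size counts are 1, 2, 4, and 9, which sum to 16.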
C Benchmark Statistics Details
We report the details of benchmark statistics in
Table 4, which shows the number of problems con-
tained in different units and the corresponding num-
ber of benchmarks that can be generated through
combinations. We also compare the total number
of problems in DynaCode with those in other code
generation benchmarks, as shown in Table 5.
Type      Unit 1       Unit 2     Unit 3     Unit 4
Base      153          100        76         76
Graph 1   2,617        697        568        1,026
Graph 2   48,638       5,718      4,133      13,517
Graph 3   37,084       4,138      3,230      9,453
Graph 4   896,634      43,051     26,648     166,466
Graph 5   710,792      34,101     19,768     120,555
Graph 6   1,342,944    64,915     48,304     242,107
Graph 7   349,802      15,883     12,858     56,251
Graph 8   15,938,962   288,317    150,618    1,931,587
Graph 9   12,796,855   241,443    108,883    1,423,255
Graph 10  37,303,818   695,172    601,704    4,189,592
Graph 11  24,051,797   458,375    320,522    2,881,061
Graph 12  11,832,328   229,898    181,992    1,497,380
Graph 13  19,069,749   362,085    237,205    2,083,414
Graph 14  710,792      34,101     19,768     120,555
Graph 15  2,426,580    42,459     39,678     239,392
Graph 16  36,998,540   695,172    600,528    4,177,666
Unit Sum  164,517,932  3,215,525  2,376,407  19,153,277
Total     189,263,141
Table 4: The number of problems contained in different
units and the corresponding number of benchmarks that
can be generated in combination.
D Model Performance Details
D.1 Overall Results
Figure 9 presents the performance details of all
evaluated models on DynaCode across varying levels
of code complexity. The evaluation reveals
consistent performance degradation across all models
as code complexity increases, underscoring the
benchmark’s ability to challenge models beyond
basic code generation.
Models like GPT-4o and DeepSeek-V3 exhibit
relatively robust performance, maintaining higher
Pass@1 scores even at higher complexity levels. In
contrast, models such as WizardLM-2-8x22B and
codegemma-7b-it show a steep decline in perfor-
mance, struggling significantly with more complex,
nested code structures. These trends align with
the observations discussed in the main benchmark
results, further validating the robustness of Dyna-
Code in differentiating model capabilities across
diverse complexity scenarios.
D.2 Details of the Number
In Figure 10, we present the detailed results of
GPT-3.5-Turbo’s performance for each unit, with the
number of problems N set to {25, 50, 75, 100, 125, 150}
for each graph. The results indicate that smaller N
values, such as 25 and 50, lead to greater performance
variability across units, suggesting sensitivity to
insufficient data. As N increases, especially beyond
75, the performance stabilizes, and fluctuations
diminish. Based on this observation, N = 100 was
selected for subsequent experiments, balancing
computational efficiency with evaluation robustness.
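A rough way to see why small samples fluctuate more is the binomial standard error of a Pass@1 estimate, sqrt(p(1 - p) / N), where N is the number of problems per graph. The sketch below assumes independent pass/fail outcomes with a common success probability p, which the source does not state; the value p = 0.326 is GPT-3.5-Turbo's overall DynaCode Pass@1 reported earlier, used here purely as an illustration.

```python
import math


def pass_at_1_standard_error(p: float, n: int) -> float:
    """Binomial standard error of a Pass@1 estimate from n independent problems."""
    return math.sqrt(p * (1.0 - p) / n)


P_ASSUMED = 0.326  # illustrative success probability (an assumption)

for n in (25, 50, 75, 100, 125, 150):
    se_points = 100 * pass_at_1_standard_error(P_ASSUMED, n)
    print(f"N = {n:3d}: standard error is about {se_points:.1f} percentage points")
```

Because the error shrinks with the square root of N, moving from 25 to 100 problems halves it, while going from 100 to 150 yields a much smaller reduction, which is consistent with the stabilization described above.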
[Figure 7: sixteen call-graph diagrams, labeled (1) to (16).]
Figure 7: Illustration of 16 distinct call-graph structures used in our experiments, each modeled as a directed acyclic
graph with up to 5 nodes and a single root.
[Figure 8: diagram relating each control structure (Sequence, if/else, while, while not) to its cyclomatic complexity.]
Figure 8: Cyclomatic complexity calculations for Sequence,
If-Else, While, and While-Not control structures,
illustrating their impact on code complexity.
Benchmark     Number of Problems
HumanEval     164
HumanEval+    164
MBPP          974
MBPP+         378
BigCodeBench  1,140
DynaCode      189,263,141
Table 5: Comparison of the number of problems between
DynaCode and other code generation benchmarks.
E Prompt
In our DynaCode, we dynamically generate a unified
prompt by hard-coding the combination of
individual code problems. Initially, a call graph is
constructed by selecting candidate functions based
on their input and output types. The graph is then
completed by traversing it and randomly assigning
nodes while ensuring that the data flow remains
consistent. Specifically, the output of a parent
function becomes the input for its child function. Each
node is assigned a unique prompt number, and the
individual prompts stored in our dataset are
concatenated in sequence to form the final prompt.
Table 6 presents an example from G8, illustrat-
ing this process in practice. In addition to the se-
quential prompt assembly, a main function is auto-
matically generated to call the individual functions
in the correct order, with explicit rules that govern
the data transfer between them. This hard-coded
dynamic composition strategy not only ensures log-
ical consistency among the code fragments but also
enhances the robustness and adaptability of our
code generation system.
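The assembly steps described in this section can be sketched for the simplest case, a linear chain such as the one in Table 6. The code below is a hypothetical illustration, not the DynaCode implementation: the `Unit` record, the type labels, and the functions `sample_chain` and `build_prompt` are assumptions, and the example assertions that Table 6 includes in each prompt are omitted for brevity.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    prompt: str
    in_type: str   # assumed type label of the function input
    out_type: str  # assumed type label of the function output


# Toy pool paraphrasing the five problems in Table 6; the type labels are assumptions.
POOL = [
    Unit("generate_fibonacci", "Write a python function to generate the first n Fibonacci numbers.", "int", "list"),
    Unit("square_numbers", "Write a python function to square each number in a given list.", "list", "list"),
    Unit("sum_numbers", "Write a python function to find the sum of all numbers in a list.", "list", "int"),
    Unit("square_root", "Write a python function to find the square root of a number.", "int", "float"),
    Unit("is_integer", "Write a python function to check if a number is an integer.", "float", "bool"),
]


def sample_chain(pool, length, rng):
    """Randomly pick `length` units so each output type matches the next input type."""
    def extend(chain):
        if len(chain) == length:
            return chain
        candidates = [u for u in pool if u not in chain
                      and (not chain or u.in_type == chain[-1].out_type)]
        rng.shuffle(candidates)
        for unit in candidates:
            found = extend(chain + [unit])
            if found:
                return found
        return None  # dead end: backtrack
    return extend([])


def build_prompt(chain):
    """Concatenate numbered prompts and the data-flow rules for a linear chain."""
    n = len(chain)
    lines = [f"Here are {n} prompts that are used to generate {n} functions respectively."]
    for i, unit in enumerate(chain, start=1):
        lines += [f"PROMPT {i}:", '"""', unit.prompt, '"""']
    lines.append(f"Please write the above {n} functions respectively and write "
                 f"a new function named main to call the above {n} functions.")
    lines.append("When calling these functions, please follow the following rules:")
    lines.append(f"The input of the main function equals the input of PROMPT 1: {chain[0].name}.")
    for i in range(1, n):
        lines.append(f"The output of function PROMPT {i}: {chain[i - 1].name} "
                     f"serves as the input of PROMPT {i + 1}: {chain[i].name}.")
    lines.append(f"The main function returns the output of the PROMPT {n}: {chain[-1].name}.")
    return "\n".join(lines)


chain = sample_chain(POOL, 5, random.Random(0))
print(build_prompt(chain))
```

With this toy pool, only one ordering satisfies the type constraints, so the sampled chain matches the order used in Table 6; a larger pool would yield many distinct chains for the same graph.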
F Error Classification Details
We present the classification of error types in Dy-
naCode along with their corresponding functional
details, as shown in Table 7.
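The mapping in Table 7 lends itself to a simple lookup keyed by exception name. The sketch below is an assumed illustration of how a failing candidate could be assigned to a capability; the function `classify_failure` is hypothetical, and a real evaluation harness would run untrusted code in a sandbox rather than with a bare `exec`.

```python
from typing import Optional

# Exception name -> associated capability, following Table 7.
CAPABILITY_BY_ERROR = {
    "AssertionError": "Problem Understanding",
    "ValueError": "Problem Understanding",
    "RecursionError": "Problem Understanding",
    "ZeroDivisionError": "Problem Understanding",
    "SyntaxError": "Code Pattern Generation",
    "IndentationError": "Code Pattern Generation",
    "NameError": "Context Management",
    "AttributeError": "Context Management",
    "TypeError": "Context Management",
    "IndexError": "Context Management",
    "UnboundLocalError": "Context Management",
    "OverflowError": "Other",
    "RuntimeError": "Other",
}


def classify_failure(code: str) -> Optional[str]:
    """Run `code` and return the capability tied to the raised error, or None on success."""
    try:
        # compile() surfaces SyntaxError and IndentationError before execution.
        exec(compile(code, "<candidate>", "exec"), {})
    except Exception as exc:
        # Look up the exact class name so subclasses such as RecursionError
        # are not folded into their parent (RuntimeError).
        return CAPABILITY_BY_ERROR.get(type(exc).__name__, "Other")
    return None


print(classify_failure("assert sum([1, 2]) == 4"))
print(classify_failure("def f(:\n    pass"))
print(classify_failure("total = undefined_name + 1"))
```

The three calls illustrate one failure from each of the Problem Understanding, Code Pattern Generation, and Context Management groups.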
G Examples in DynaCode
We provide a valid nested code example for the
prompt in Table 6, as shown in Figure 11. As
observed, the flow from generate_fibonacci to
is_integer represents a complete workflow, which
demonstrates that the call graph-based dynamic
evaluation strategy can assess the ability of LLMs
to handle multi-step tasks and the dependencies
between different steps by simulating a real-world
task flow. Our call-graph structure offers a vari-
ety of workflow configurations, which can help us
understand how well the LLM performs on tasks
requiring structured thinking and logical progres-
sion, making it a powerful tool for evaluating its
capabilities in multi-step tasks.
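For readers without access to Figure 11, the following is one possible solution to the prompt in Table 6. It is a sketch written for this discussion, not the code shown in the figure; only the function names and the example values come from Table 6.

```python
import math


def generate_fibonacci(n):
    """Return the first n Fibonacci numbers, starting from 0."""
    sequence = []
    a, b = 0, 1
    for _ in range(n):
        sequence.append(a)
        a, b = b, a + b
    return sequence


def square_numbers(numbers):
    """Square each number in the list."""
    return [x * x for x in numbers]


def sum_numbers(numbers):
    """Sum all numbers in the list."""
    return sum(numbers)


def square_root(x):
    """Return the square root of a non-negative number."""
    return math.sqrt(x)


def is_integer(x):
    """Check whether a number has no fractional part."""
    return float(x).is_integer()


def main(n):
    """Chain the five functions in the order required by the prompt rules."""
    fibonacci = generate_fibonacci(n)
    squared = square_numbers(fibonacci)
    total = sum_numbers(squared)
    root = square_root(total)
    return is_integer(root)


print(main(5))
```

Each intermediate value feeds the next call, so an error in any single function propagates to the final result, which is what makes the nested task harder than its parts.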
[Figure 9: twelve panels, one per model (GPT-4o, GPT-3.5-Turbo, DeepSeek-V3, Qwen2.5-Coder-32B-Instruct, starcoder2-15b-instruct-v0.1, Phind-CodeLlama-34B-v2, Meta-Llama-3.1-405B-Instruct, Llama-3.3-70B-Instruct, WizardLM-2-8x22B, codegemma-7b-it, Mixtral-8x22B-Instruct-v0.1, Meta-Llama-3.1-8B-Instruct); x-axis: Level (1 to 4); y-axis: Pass@1 Score (%); legend: Unit 1 to Unit 4.]
Figure 9: Performance details of various models on DynaCode across different complexity levels. Each subfigure
illustrates the Pass@1 score trends for units 1 to 4, highlighting how model performance degrades as code complexity
increases.
[Figure 10: six panels, Number = 25, 50, 75, 100, 125, 150; x-axis: Level (1 to 4); y-axis: Pass@1 Score (%); legend: Unit 1 to Unit 4.]
Figure 10: Performance comparison of GPT-3.5-Turbo on DynaCode with varying numbers of dynamically
generated problems {25, 50, 75, 100, 125, 150}.
Python Code Prompts
Here are 5 prompts that are used to generate 5 functions respectively.
PROMPT 1:
"""
Write a python function to generate the first n Fibonacci numbers.
assert generate_fibonacci(5) == [0, 1, 1, 2, 3]
"""
PROMPT 2:
"""
Write a python function to square each number in a given list.
assert square_numbers([0, 1, 1, 2, 3]) == [0, 1, 1, 4, 9]
"""
PROMPT 3:
"""
Write a python function to find the sum of all numbers in a list.
assert sum_numbers([0, 1, 1, 4, 9]) == 15
"""
PROMPT 4:
"""
Write a python function to find the square root of a number.
assert square_root(15) == 3.872983346207417
"""
PROMPT 5:
"""
Write a python function to check if a number is an integer.
assert is_integer(3.872983346207417) == False
"""
Please write the above 5 functions respectively and write a new function named main to call the above 5 functions.
When calling these functions, please follow the following rules:
The input of the main function equals the input of PROMPT 1: generate_fibonacci.
The output of function PROMPT 1: generate_fibonacci serves as the input of PROMPT 2: square_numbers.
The output of function PROMPT 2: square_numbers serves as the input of PROMPT 3: sum_numbers.
The output of function PROMPT 3: sum_numbers serves as the input of PROMPT 4: square_root.
The output of function PROMPT 4: square_root serves as the input of PROMPT 5: is_integer.
The main function returns the output of the PROMPT 5: is_integer.
Table 6: Example of a dynamically generated prompt in DynaCode for call graph G8.
Associated Capability    Error Type         Reason for Categorization
Problem Understanding    AssertionError     Failure to satisfy the expected logic of the problem, indicating misunderstanding of requirements.
Problem Understanding    ValueError         Input data out of expected range, implying insufficient understanding of problem constraints.
Problem Understanding    RecursionError     Incorrect handling of recursion logic, showing failure in comprehending recursive termination.
Problem Understanding    ZeroDivisionError  Failure to account for boundary conditions, such as division by zero.
Code Pattern Generation  SyntaxError        Violation of syntax rules, directly reflecting the inability to generate structurally correct code.
Code Pattern Generation  IndentationError   Incorrect indentation, indicating issues in formatting and structural management of code.
Context Management       NameError          Use of undefined variables, highlighting issues in variable scope management.
Context Management       AttributeError     Accessing nonexistent attributes, suggesting misunderstanding of object structures.
Context Management       TypeError          Incorrect data type handling, indicating flaws in type inference and function parameter management.
Context Management       IndexError         Indexing beyond valid range, showing poor management of data structures like lists or arrays.
Context Management       UnboundLocalError  Use of uninitialized local variables, indicating incorrect variable lifecycle management.
Other                    OverflowError      Numeric overflow due to improper handling of large values or operations.
Other                    RuntimeError       Errors during execution, often caused by incorrect function calls or resource issues.
Table 7: Categorization of error types and their corresponding capabilities in DynaCode.
Figure 11: Example of a nested code workflow from Table 6, demonstrating how the call graph-based evaluation
assesses LLMs’ ability to handle multi-step tasks and dependencies.