arxiv

Paper 2503.10452

DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation

Authors: Wenhao Hu, Jinhao Duan, Chunchen Wei, Li Zhang, Yue Zhang, Kaidi Xu

Published: 2025-03-13

Abstract:

The rapid advancement of large language models (LLMs) has significantly improved their performance in code generation tasks. However, existing code benchmarks remain static, consisting of fixed datasets with predefined problems. This makes them vulnerable to memorization during training, where LLMs recall specific test cases instead of generalizing to new problems, leading to data contamination and unreliable evaluation results. To address these issues, we introduce DynaCode, a dynamic, complexity-aware benchmark that overcomes the limitations of static datasets. DynaCode evaluates LLMs systematically using a complexity-aware metric, incorporating both code complexity and call-graph structures. DynaCode achieves large-scale diversity, generating up to 189 million unique nested code problems across four distinct levels of code complexity, referred to as units, and 16 types of call graphs. Results on 12 latest LLMs show an average performance drop of 16.8% to 45.7% compared to MBPP+, a static code generation benchmark, with performance progressively decreasing as complexity increases. This demonstrates DynaCode's ability to effectively differentiate LLMs. Additionally, by leveraging call graphs, we gain insights into LLM behavior, particularly their preference for handling subfunction interactions within nested code.

Paper Content:
DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation

Wenhao Hu (1), Jinhao Duan (2), Chunchen Wei (1), Li Zhang (2), Yue Zhang (2), Kaidi Xu (2,*)
(1) University of Electronic Science and Technology of China
(2) Drexel University

1 Introduction

The performance of Large Language Models (LLMs) in code generation has garnered significant attention (Hou et al., 2024). With their powerful language comprehension abilities, LLMs are now capable of autonomously generating high-quality code and, to some extent, addressing complex programming challenges.
These advancements have not only accelerated the software development process but have also had a profound impact on enhancing developer productivity (Ziegler et al., 2022).

*Corresponding author: Kaidi Xu <kx46@drexel.edu>.
arXiv:2503.10452v1 [cs.CL] 13 Mar 2025

[Figure 1: bar chart of Pass@1 score (%) on MBPP, MBPP+, and DynaCode. Meta-Llama-3-8B-Instruct: 64.6%, 54.8%, 8.4% (drops of 56.2 and 46.4 points from MBPP and MBPP+ to DynaCode). DeepSeek-V3: 87.6%, 73.0%, 52.1% (drops of 35.5 and 20.9 points).]
Figure 1: Data contamination on the popular benchmarks MBPP and MBPP+. Meta-Llama-3-8B-Instruct exhibits a significant performance drop from MBPP and MBPP+ to DynaCode.

However, as LLMs advance, reliable benchmarks for code generation have become increasingly crucial for evaluating and selecting suitable LLMs. Currently, the evaluation of LLMs' code generation capabilities primarily relies on standardized benchmarks such as HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), CodeXGLUE (Lu et al., 2021), and ClassEval (Du et al., 2023). These benchmarks provide an initial reference for assessing the performance of LLMs in code generation by evaluating the functionality and correctness of the generated code. Moreover, recent work such as EvalPlus (Liu et al., 2024c), BigCodeBench (Zhuo et al., 2024), CRUXEval (Gu et al., 2024), and EvoEval (Xia et al., 2024) aims to enhance evaluation quality by expanding test cases and employing techniques like prompt transformation to convert prompts into more appropriate ones for more precise evaluation.

Despite these developments, existing benchmarks exhibit two notable limitations:

Data Contamination. Existing benchmarks are static and small-scale, making them easily accessible during training and allowing models to memorize test cases instead of generalizing to unseen problems.
Meta-Llama-3-8B-Instruct (Meta AI, 2024a) and Phi-2 (Microsoft, 2024) have been reported to exhibit data contamination (Zhang et al., 2024a), suggesting that these models may "memorize" specific test cases or code snippets, which could compromise the accuracy and fairness of the evaluation process.

Uncontrollable Complexity. Existing benchmarks lack systematic complexity control, making it challenging to evaluate LLM performance across different task complexities. While some works (Yu et al., 2024; Liu et al., 2024d) define code complexity using simple metrics such as lines of code and time complexity, these measures fail to capture deep nesting and complex execution dependencies, thereby leaving a critical gap in assessing real-world code generation capabilities.

To address these limitations, we propose DynaCode, a novel dynamic evaluation framework that automatically creates Python code benchmarks by classifying code problems based on complexity and forming nested problems using call graphs. This benchmark provides a more comprehensive and fair evaluation of code generated by LLMs. Specifically, DynaCode categorizes code problems into multiple code problem units and, for each unit, constructs call-graph structures of varying complexity. In doing so, it establishes complexity-aware metrics along two dimensions: code complexity and call-graph complexity.

Compared to traditional static benchmarks, DynaCode offers a significantly more diverse and complex evaluation. As shown in Figure 1, Meta-Llama-3-8B-Instruct (Meta AI, 2024a), a model that exhibits data contamination, shows a larger performance drop from MBPP and MBPP+ to DynaCode than DeepSeek-V3. DynaCode generates approximately 189 million unique code generation tasks across 4 units of code complexity and 16 call-graph structures.
By assessing LLMs from both code complexity and call-graph complexity perspectives, DynaCode provides a structured and scalable evaluation framework while mitigating data contamination. To further investigate LLM limitations, we analyzed 4279 error examples, categorizing errors into 3 distinct types. Our analysis reveals that LLMs perform well on sequential call graphs but struggle with complex, multi-branch dependencies, highlighting their difficulty in handling deeply nested execution flows and long-range function interactions.

In summary, our major contributions are listed as follows:

• We propose a dynamic evaluation strategy that simulates the actual execution process by combining multiple code problems, thereby enabling a fairer and more comprehensive evaluation.
• We design complexity-aware metrics combining code complexity and call graphs, integrating static analysis and dynamic execution to create a multidimensional complexity evaluation system with categorized benchmarks.
• We introduce DynaCode, a new code generation benchmark, and evaluate multiple LLMs, providing a thorough analysis of its practical utility.

2 Related Works

2.1 Dynamic Evaluation

Recently, growing interest in dynamic evaluation methods has emerged to address data contamination. Several works have focused on different aspects of this challenge. For example, DyVal (Zhu et al., 2023) utilizes a graph-based approach to dynamically generate evaluation samples with controllable complexity; NPHardEval (Fan et al., 2023) generates new evaluation samples for NP-hard mathematical problems; DyVal2 (Zhu et al., 2024) leverages a probing agent based on LLMs to transform existing problems into new ones, while a judgment agent verifies the generated evaluation samples.
Additionally, Benchmark Self-Evolving (Wang et al., 2024) modifies the context or the problem itself, along with its corresponding answers, to reconstruct existing benchmark instances into new variants for dynamic evaluation. DARG (Zhang et al., 2024b) also utilizes LLMs to build reasoning graphs for problems and applies fine-grained graph perturbations across various dimensions. However, these methods mainly focus on reasoning domains like logic and mathematics and may not extend well to code generation. Moreover, they rely on LLMs as agents to refine benchmarks, introducing evaluation instability and extra costs. To address these limitations, this paper proposes a dynamic evaluation strategy tailored for code generation, which automatically creates benchmarks and offers a more detailed and comprehensive assessment.

2.2 Coding benchmark for LLMs

The rapid development of LLMs has driven the continuous evolution of code generation benchmarks. Benchmarks such as HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) primarily evaluate LLMs on simple, isolated Python functions, offering an evaluation of code generation capabilities. As LLMs have improved, new benchmarks have been continuously proposed, expanding to address higher levels of difficulty (Zhuo et al., 2024) and a wider range of programming languages (Zheng et al., 2023; Khan et al., 2024; Cassano et al., 2023; Ding et al., 2023). These benchmarks also tackle more complex tasks such as program repair and code reasoning (Liu et al., 2024b; Gu et al., 2024; Jain et al., 2024). In addition, new benchmarks like SWE-Bench (Jimenez et al., 2023) and EvoCodeBench (Li et al., 2024) focus on real-world tasks and code evolution, pushing forward the performance evaluation of LLMs in practical applications. However, existing benchmarks remain static and fixed, lacking systematic complexity control in generated code problems.
They typically assess LLMs based on overall performance, failing to provide granular insights into varying code complexities. We improve this by proposing a complexity-aware metric, allowing for a more precise and systematic evaluation of LLMs' performance.

3 Methodology

In this section, we present the construction approach and process of DynaCode, as shown in Figure 2. Specifically, we first introduce the dynamic evaluation strategy based on call graphs in Sec. 3.1, explaining how they capture the relationships and dependencies within a program. In Sec. 3.2, we introduce a complexity matrix that measures code and graph complexity, essential for evaluating tasks of varying difficulty and capturing program interactions. Finally, we describe the process of generating benchmarks in DynaCode in Sec. 3.3.

3.1 Dynamic Evaluation Strategy Based on Call Graphs

To comprehensively evaluate the performance of LLMs in code generation tasks, this section introduces a dynamic evaluation strategy based on call graphs. This strategy measures execution performance of generated code from two dimensions: static complexity classification and dynamic code execution behavior. Specifically, this section is divided into two parts: code complexity classification and call graph construction. These parts describe how code problems are classified based on code complexity and how the call graphs are constructed.

3.1.1 Code Complexity Classification

Current LLMs generally perform relatively well on code problems involving basic syntax and simple logical structures, but their performance is inconsistent on problems that involve more control flow branches, nested structures, and recursive calls (Jiang et al., 2025; Beger and Dutta, 2025). To evaluate the code generation capabilities of LLMs more thoroughly, DynaCode first classifies the complexity of the ground truth code of existing code problems, as shown in Figure 2(a).
Methods for classifying code problem complexity include lines of code, cyclomatic complexity (McCabe, 1976), and Halstead complexity (Halstead, 1977). Given that LLMs often struggle with code generation tasks involving complex control flow and branching logic, we use cyclomatic complexity to assess code difficulty. It measures the number of independent paths in the control flow, capturing the complexity of branches and loops. Specifically, for each problem $p_i$ in the set $P$ of code problems, we use the static code analysis tool Radon (Rubik, 2014) to calculate the cyclomatic complexity of the corresponding ground truth code, denoted as $\nu_{p_i}$, where the subscript $p_i$ identifies the problem in the set $P$. The cyclomatic complexity is calculated using the following formula:

$$\nu_{p_i} = E_{p_i} - N_{p_i} + 2P_{p_i}, \tag{1}$$

where $E_{p_i}$ is the number of edges in the control flow graph, $N_{p_i}$ is the number of nodes, and $P_{p_i}$ is the number of connected components. Control flow graphs and cyclomatic complexity computations are shown in Appendix A.

Based on the calculated cyclomatic complexity values, code problems are classified into different complexity units. We define the code complexity as $U_j$, where $j \in \{1, 2, \ldots, n\}$, representing different cyclomatic complexity units. The set of problems at complexity level $U_j$ is denoted by:

$$U_j = \{p_i \mid \alpha_{j-1} \le \nu_{p_i} \le \alpha_j\}, \tag{2}$$

where:

1. $p_i$ represents a problem with cyclomatic complexity $\nu_{p_i}$,
2. $\alpha_{j-1}$ and $\alpha_j$ are predefined complexity thresholds for the $j$-th unit.

Thus, for each $j$, $U_j$ is the set of problems $p_i$ such that their cyclomatic complexity $\nu_{p_i}$ falls within the interval $[\alpha_{j-1}, \alpha_j]$, i.e., $p_i \in U_j$ if and only if $\alpha_{j-1} \le \nu_{p_i} \le \alpha_j$.

The choice of thresholds $\alpha_1, \alpha_2, \ldots, \alpha_n$ is based on the distribution of cyclomatic complexity values in the code dataset. These thresholds are selected to capture progressively increasing complexity, allowing for a systematic evaluation of LLMs' performance as task difficulty escalates.

[Figure 2: four panels titled (a) Code Complexity Classification, (b) Call Graph Construction, (c) Complexity-aware Metrics, and (d) Benchmark Generation. Panel (a) shows an example problem ("Write a function to remove odd characters in a string", with the test assert remove_odd("python")==("yhn")). Panel (b) shows a nested function main(o) in which the output of function1 is passed to function2, function3, and function4, whose outputs are returned together, and lists the graph features: max path length, branch count, edge count. Panel (d) shows ground truth execution, input/output type checks, and the filtering of bad generations.]
Figure 2: Overview of our proposed DynaCode. (a) Classification of code complexity, resulting in code problems with varying levels of complexity. (b) Construction of function call graphs, categorized based on graph features. (c) Integration of code complexity and graph complexity to form two-dimensional complexity-aware metrics. (d) Benchmark generation process.

3.1.2 Call Graph Construction

A fundamental requirement of execution-based evaluation is to analyze the execution behavior of the generated code (Chen et al., 2021). However, in static benchmarks, where code problems often involve isolated functions, models may memorize specific solutions instead of generalizing, leading to data contamination. To tackle this issue, we use a call-graph structure to construct nested code, as shown in Figure 2(b). After classifying the complexity of code problems, we aim to construct nested problems and their corresponding nested codes with different structures at the same complexity unit to enhance the diversity of code generation evaluations. Specifically, we treat each code problem and its ground truth code as nodes in the call graph to form nested problems and nested codes. The call graph is defined as a directed graph $G = (V, E)$, where the node set $V$ represents the functions in the code, and the edge set $E$ represents the call relationships between the functions.
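To make the idea of treating unit problems as call-graph nodes concrete, the following minimal Python sketch runs a tree-shaped call graph in which each function's output is passed to the functions it calls, mirroring the main(o) example in Figure 2(b). The helper `run_nested`, the dictionary encoding of the graph, and the three downstream unit functions are illustrative assumptions rather than DynaCode's actual implementation; only `remove_odd` follows the example problem shown in Figure 2.

```python
from typing import Any, Callable, Dict, List, Tuple


# Hypothetical unit functions standing in for benchmark problems.
def remove_odd(s: str) -> str:
    """Keep the characters at even 1-based positions: "python" -> "yhn"."""
    return s[1::2]


def count_chars(s: str) -> int:
    return len(s)


def to_upper(s: str) -> str:
    return s.upper()


def reverse(s: str) -> str:
    return s[::-1]


def run_nested(
    call_graph: Dict[str, List[str]],
    functions: Dict[str, Callable[[Any], Any]],
    node: str,
    value: Any,
) -> Tuple[Any, ...]:
    """Apply `node` to `value`, feed its output to every callee,
    and return the outputs of the leaf functions in depth-first order."""
    output = functions[node](value)
    callees = call_graph.get(node, [])
    if not callees:
        return (output,)
    results: Tuple[Any, ...] = ()
    for callee in callees:
        results += run_nested(call_graph, functions, callee, output)
    return results


# Assumed encoding of the Figure 2(b) shape: f1 calls f2, f3, and f4.
CALL_GRAPH = {"f1": ["f2", "f3", "f4"]}
FUNCTIONS = {"f1": remove_odd, "f2": count_chars, "f3": to_upper, "f4": reverse}

print(run_nested(CALL_GRAPH, FUNCTIONS, "f1", "python"))  # (3, 'YHN', 'nhy')
```

The sketch only works because every callee accepts the output type of its caller (here, a string); a combination whose types do not line up would raise an error at execution time and would have to be discarded.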
In order to combine code problems into nested problems and corresponding codes, we define the call graph as follows:

1. There are exactly $K$ distinct call-graph structures, where each structure $G_k = (V_k, E_k)$, for $k \in \{1, 2, \ldots, K\}$, corresponds to a unique function call pattern.
2. The number of root nodes is $|V_{root}| = 1$, with the root node denoted by $v_1$.
3. The graph is acyclic: for any pair of nodes $u, v \in V$, if $u \to v$, then there is no $v \to u$, i.e., $G$ is a Directed Acyclic Graph.

These call-graph structures range from simple linear calls to more complex configurations involving various branches and call relationships, effectively increasing the logical complexity of the code. Through this diverse set of call-graph structures, DynaCode is able to simulate a variety of real-world scenarios, assessing the ability of LLMs to generate code of varying complexity. Detailed illustrations of all call-graph structures are shown in Appendix B.

To more precisely evaluate the code generation capability of LLMs, DynaCode categorizes the complexity of different call-graph structures to further assess LLM performance within the same unit. The complexity of the call graph is influenced by these key features:

1. Maximum path length $L_{max}(G)$: The longest path from the root node $v_1$ to any other node in $G$.
2. Branch count $B(G)$: The total number of branching points in $G$, indicating how many functions each function calls.
3. Edge count $|E|$: The total number of directed edges in the graph, which quantifies the interdependencies between functions.

To measure the complexity of the graph $G$, we define a comprehensive feature metric $M$:

$$M(G) = L_{max}(G) \times B(G) \times |E|. \tag{3}$$

The $M$ feature, computed as the product of the maximum path length, branch count, and edge count, comprehensively reflects the overall complexity of the call graph. Based on this feature, we categorize the complexity of call graphs into different levels according to predefined thresholds. Let $\beta_0, \beta_1, \ldots, \beta_m$ be the thresholds.
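As a worked illustration of Eq. (3), the sketch below computes $M(G)$ for a call graph stored as an adjacency dictionary, before any thresholds are applied. Two readings are assumptions here rather than statements from the paper: path length is counted in edges, and a branching point is taken to be any node that calls at least two functions. The function name `call_graph_metric` and the two example graphs are hypothetical.

```python
from typing import Dict, List


def call_graph_metric(edges: Dict[str, List[str]], root: str) -> int:
    """Return M(G) = L_max(G) * B(G) * |E| for an acyclic call graph.

    `edges` maps each caller to the list of functions it calls.
    Assumptions: path length is measured in edges, and a branching
    point is a node with two or more callees.
    """

    def longest_path(node: str) -> int:
        callees = edges.get(node, [])
        if not callees:
            return 0
        return 1 + max(longest_path(callee) for callee in callees)

    l_max = longest_path(root)
    branch_count = sum(1 for callees in edges.values() if len(callees) >= 2)
    edge_count = sum(len(callees) for callees in edges.values())
    return l_max * branch_count * edge_count


# One function fanning out to three callees.
fan_out = {"f1": ["f2", "f3", "f4"]}
# A branch at the root plus one extra level of nesting.
deeper = {"f1": ["f2", "f3"], "f2": ["f4"]}

print(call_graph_metric(fan_out, "f1"))  # 1 * 1 * 3 = 3
print(call_graph_metric(deeper, "f1"))   # 2 * 1 * 3 = 6
```

Under this reading of "branching point", a purely sequential chain has a branch count of zero and therefore a metric of zero; the paper's exact counting convention may differ, which would shift the values but not the product structure of the metric.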
Given these thresholds, the classification is defined as follows:

$$L_j = \{G \mid \beta_{j-1} \le M(G) \le \beta_j\}, \tag{4}$$

where $L_j$ represents the set of call graphs whose $M$ values fall within the interval $[\beta_{j-1}, \beta_j]$, corresponding to the $j$-th level of graph complexity.

3.2 Complexity-aware metrics

By combining code complexity classification with call-graph complexity classification, we propose a comprehensive complexity measurement matrix, as illustrated in Figure 2(c). This matrix provides a two-dimensional framework for evaluating code complexity, integrating both the internal logical structure of functions and their interrelationships within the call graph. Its goal is to enable a holistic assessment of code complexity by considering two critical aspects: the inherent complexity of function logic and the complexity of its interactions within the overall code structure. The matrix is represented as:

$$C = \{c_{\xi,\eta} \mid 1 \le \xi \le n,\ 1 \le \eta \le m\}, \tag{5}$$

where $c_{\xi,\eta}$ represents the complexity value at the intersection of code complexity unit $\xi$ and call-graph complexity level $\eta$. Specifically, $\xi \in \{1, 2, \ldots, n\}$ corresponds to the different units of code complexity, and $\eta \in \{1, 2, \ldots, m\}$ corresponds to the levels of complexity associated with the call-graph structure. The matrix $C$ allows for a detailed, two-dimensional assessment of code complexity by evaluating how the internal logic of functions interacts with the overall structure of the code, which can significantly influence the performance and maintainability of generated code.

3.3 Benchmark Generation

Through the steps outlined above, we can construct a systematic code generation benchmark. Our approach improves upon existing code problems by dynamically combining them to generate DynaCode, which incorporates both a complexity framework and randomness into the code problems. The benchmark generation process is illustrated in Figure 2(d). The specific procedure for generating DynaCode is as follows:

Problem Collection.
We begin by collecting existing code problems to form our DynaCode unit functions. To mitigate data contamination risks, we actively source recently published code problems from the web and incorporate them into our benchmark. This approach allows us to continuously update the code problem set with new releases, thereby alleviating data contamination.

Code Complexity Classification. For each unit function, we evaluate its complexity using cyclomatic complexity and categorize it accordingly, forming multiple code problem units. Simultaneously, we employ the Monkeytype (Facebook, 2017) tool to generate corresponding input and output data for each code problem, discarding any data that cannot be integrated into valid nested codes.

Problem Combination. Then, for each unit of code problems, we merge the problems into new nested problems by aligning their input and output types according to the call-graph structure. Concurrently, we automatically assemble the corresponding code. The automatically generated problem prompt is presented in Appendix E.

Testcase Generation. After obtaining code that conforms to the call-graph structure, we inject the input values from the root node of the call graph into the entire nested code and execute it in batches. If any execution errors occur, the generation is classified as a bad generation. We then filter out these bad generations to retain valid nested code, nested problems, and their corresponding test cases. The valid nested code is presented in Appendix G.

| Type | Unit 1 | Unit 2 | Unit 3 | Unit 4 |
| Base | 153 | 100 | 76 | 76 |
| Level 1 | 88,339 | 10,553 | 7,931 | 23,996 |
| Level 2 | 3,300,172 | 157,950 | 107,578 | 585,379 |
| Level 3 | 27,771,290 | 518,215 | 332,610 | 3,428,967 |
| Level 4 | 133,358,131 | 2,528,807 | 1,928,288 | 15,114,935 |

Table 1: The number of problems contained in different units and the corresponding number of benchmarks that can be generated in combination.

4 Evaluation and Analysis

4.1 Experimental Setup

Data.
We selected MBPP+, processed by EvalPlus (Liu et al., 2024c), as our unit function set. MBPP+ is an extended version of MBPP (Austin et al., 2021), enhanced and refined to provide more comprehensive test cases and solutions. To address potential data contamination in the unit functions, we also curated a set of the latest programming problems from LeetCode (footnote 1). These problems, along with their corresponding solutions collected using official test cases, were integrated into our evaluation. As a demonstration, we introduced 22 new code problems into Unit 3 and 18 new code problems into Unit 4. The combination of these two sources forms our unit function set, ensuring both diversity and relevance to real-world scenarios.

Footnote 1: LeetCode, https://leetcode.com/

Evaluation metric. Following previous work (Chen et al., 2021), we use Pass@1 as the evaluation metric, which measures the percentage of problems solved correctly on the first attempt without further corrections. Since our benchmark introduces progressively increasing complexity levels, we use Pass@1 as a unified evaluation metric to ensure consistent comparisons with MBPP and MBPP+ and across different complexity levels.

Evaluated LLMs. We selected a range of mainstream LLMs to evaluate the effectiveness of DynaCode in code generation tasks. The evaluated models include GPT-4o (Achiam et al., 2023), GPT-3.5-turbo (OpenAI, 2023), DeepSeek-V3 (Liu et al., 2024a), Qwen2.5-Coder-32B-Instruct (Hui et al., 2024), WizardLM-2-8x22B (Xu et al., 2023), Mixtral-8x22B-Instruct-v0.1 (Jiang et al., 2024), Phind-CodeLlama-34B-v2 (Roziere et al., 2023), starcoder2-15b-instruct-v0.1 (Wei et al., 2024), codegemma-7b-it (Team et al., 2024), Meta-Llama-3.1-405B-Instruct (Meta AI, 2024b), Meta-Llama-3.3-70B-Instruct (Meta AI, 2024d), and Meta-Llama-3.1-8B-Instruct (Meta AI, 2024c).
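With a single generation per problem, Pass@1 reduces to the fraction of problems whose generated code passes all of its test cases. The sketch below illustrates that computation; the helper names `passes_tests` and `pass_at_1` and the two toy candidate solutions are hypothetical, and running generated code with `exec` as done here is only acceptable inside a sandbox.

```python
from typing import List, Tuple


def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Run one candidate solution and its assertion-based tests.

    Any exception (failed assertion, syntax error, runtime error)
    counts as a failure. Illustrative only: real harnesses isolate
    execution and enforce time limits.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)
        exec(test_src, namespace)
    except Exception:
        return False
    return True


def pass_at_1(samples: List[Tuple[str, str]]) -> float:
    """Percentage of (candidate, tests) pairs that pass on the first attempt."""
    if not samples:
        raise ValueError("at least one sample is required")
    solved = sum(passes_tests(candidate, tests) for candidate, tests in samples)
    return 100.0 * solved / len(samples)


# Two toy generations for the same problem: one correct, one wrong.
TEST = 'assert remove_odd("python") == "yhn"'
SAMPLES = [
    ("def remove_odd(s):\n    return s[1::2]\n", TEST),
    ("def remove_odd(s):\n    return s[::2]\n", TEST),
]

print(pass_at_1(SAMPLES))  # 50.0
```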
To ensure the stability and consistency of the evaluation results, we set the temperature to 0 to eliminate any randomness introduced during the generation process. Our prompts are shown in Appendix E.

Benchmark Statistics. We have compiled statistics on the number of problems that can be generated from the selected dataset after applying the dynamic evaluation strategy, as shown in Table 1. Specifically, Table 4 provides further details on the number of problems generated for each call graph, facilitating an understanding of the scale and distribution of problems across different structures.

4.2 Benchmark Results

Model Performance. The results of our benchmark are summarized in Table 2. Compared to traditional benchmarks like MBPP and MBPP+, DynaCode shows a more pronounced decline in model performance as code complexity increases. For example, GPT-4o achieves a Pass@1 score of 87.6% on MBPP and 72.2% on MBPP+, but drops significantly to 55.4% on DynaCode. A similar trend is observed with GPT-3.5-Turbo, which drops from 82.5% on MBPP to 29.3% on DynaCode. These results highlight the increasing complexity in our benchmark, confirming its ability to assess models on more complex, real-world code generation tasks. Specific performance variations are shown for selected models in Figure 3. Our complexity-aware metrics effectively differentiate model capabilities, as evidenced by consistent performance degradation across complexity levels. Models like GPT-4o show better resilience to increasing complexity, while others like WizardLM-2-8x22B struggle as complexity rises. This demonstrates DynaCode's robustness in evaluating not just basic code generation but also the handling of nested, multi-function code structures. The performance trends across all models and complexity scenarios are depicted in Figure 9, further validating our benchmark's contribution in revealing challenges faced by LLMs in complex code generation tasks.

Effect of the Problem Sizes.
To further investigate our benchmark, we conducted a study on the effect of the number of dynamically generated problems on the evaluation performance of the benchmark. For each unit and each call graph type, we generated different numbers of problems {25, 50, 75, 100, 125, 150}, and the results are shown in Figure 4. The experiments indicate that when the number of problems is greater than or equal to 75, the evaluation results meet our requirements for stability and reliability. Therefore, we set the number of problems to 100 for other experiments. Despite the variations in the number of generated problems, DynaCode can systematically evaluate code generation tasks at various complexity levels and provide reliable assessments for each complexity scenario. Detailed results for each number are provided in Figure 10.

| Model | Params | MBPP | MBPP+ | DynaCode Unit 1 | DynaCode Unit 2 | DynaCode Unit 3 | DynaCode Unit 4 | DynaCode Average |
| GPT-4o | - | 87.6 | 72.2 | 74.4 (±1.6) | 48.7 (±1.4) | 56.2 (±1.4) | 42.3 (±0.3) | 55.4 (±0.9) |
| GPT-3.5-Turbo | - | 82.5 | 69.7 | 34.9 (±1.0) | 30.5 (±1.6) | 25.6 (±0.6) | 26.1 (±1.2) | 29.3 (±0.4) |
| DeepSeek-V3 | 236B | 87.6 | 73.0 | 65.9 (±0.8) | 41.6 (±2.3) | 53.6 (±0.7) | 47.3 (±1.5) | 52.1 (±0.4) |
| Qwen2.5-Coder-32B-Instruct | 32B | 90.5 | 77.0 | 59.3 (±1.6) | 33.6 (±1.6) | 44.1 (±1.1) | 36.0 (±0.8) | 43.2 (±0.2) |
| WizardLM-2-8x22B | 176B | 71.7 | 60.8 | 35.4 (±0.8) | 27.5 (±0.9) | 25.1 (±0.8) | 12.8 (±1.1) | 25.2 (±0.6) |
| Mixtral-8x22B-Instruct-v0.1 | 176B | 73.8 | 64.3 | 35.4 (±0.8) | 27.2 (±1.3) | 25.0 (±0.5) | 12.8 (±0.5) | 25.1 (±0.7) |
| Phind-CodeLlama-34B-v2 | 34B | 85.4 | 69.6 | 40.9 (±0.9) | 29.5 (±1.8) | 45.2 (±1.5) | 25.4 (±0.2) | 35.3 (±0.6) |
| starcoder2-15b-instruct-v0.1 | 15B | 78.0 | 65.1 | 40.7 (±1.0) | 29.4 (±1.4) | 45.0 (±1.3) | 25.3 (±0.5) | 35.1 (±0.4) |
| codegemma-7b-it | 7B | 70.4 | 56.9 | 4.6 (±0.4) | 2.7 (±0.2) | 1.1 (±0.2) | 3.0 (±0.4) | 2.9 (±0.2) |
| Meta-Llama-3.1-405B-Instruct | 405B | 88.4 | 73.0 | 49.7 (±0.9) | 40.0 (±1.6) | 47.6 (±1.1) | 26.9 (±1.2) | 41.0 (±0.8) |
| Meta-Llama-3.3-70B-Instruct | 70B | 89.2 | 75.1 | 36.0 (±1.4) | 27.5 (±1.4) | 54.9 (±1.1) | 31.2 (±0.8) | 37.4 (±0.7) |
| Meta-Llama-3.1-8B-Instruct | 8B | 68.3 | 55.6 | 14.1 (±1.0) | 9.7 (±0.6) | 8.4 (±1.0) | 7.4 (±0.7) | 9.9 (±0.8) |

Table 2: Pass@1 results (%) on MBPP, MBPP+, and our DynaCode across varying code complexities. For DynaCode, results are reported for different function complexity levels. To ensure robustness, all experiments were conducted three times with 5 different random seeds, and the average results are presented.

[Figure 3: four panels (GPT-4o, GPT-3.5-Turbo, WizardLM-2-8x22B, Meta-Llama-3.1-405B-Instruct), each plotting Pass@1 score (%) against call-graph level (1 to 4), with one line per unit (Unit 1 to Unit 4).]
Figure 3: Comparison of average Pass@1 scores across 4 LLMs (GPT-4o, GPT-3.5-Turbo, WizardLM-2-8x22B, Meta-Llama-3.1-405B-Instruct) at different complexity levels.

[Figure 4: line plot of average Pass@1 score (%) against number of problems (25 to 150), with one line per unit (Unit 1 to Unit 4).]
Figure 4: Experiment on the number of problems. For each graph in every unit, a corresponding number of problems is generated.

[Figure 5: bar chart of Pass@1 score (%) with three conditions: without fine-tuning, fine-tuned on MBPP+/DynaCode, and fine-tuned on unit functions. GPT-3.5-Turbo: MBPP+ 69.7% and 88.6%; DynaCode 32.6%, 36.0%, and 15.2%. Meta-Llama-3.1-8B-Instruct: MBPP+ 55.6% and 98.1%; DynaCode 10.6%, 23.6%, and 6.9%.]
Figure 5: Pass@1 scores of GPT-3.5-Turbo and Meta-Llama-3.1-8B-Instruct before and after fine-tuning on MBPP+ and DynaCode, and DynaCode unit functions.

[Figure 6: line plot of Pass@1 score (%) against call graph index (1 to 16) for GPT-4o, GPT-3.5-Turbo, Meta-Llama-3.1-405B-Instruct, and WizardLM-2-8x22B.]
Figure 6: Comparative performance evaluation of 4 LLMs (GPT-4o, GPT-3.5-Turbo, Meta-Llama-3.1-405B-Instruct, and WizardLM-2-8x22B) across different call graphs. Models exhibit better performance on sequential call graphs {G1, G2, G3, G4, G8}. Details of the call graphs are provided in Appendix B.

| Ability | Error Type | Unit 1 | Unit 2 | Unit 3 | Unit 4 |
| Problem Understanding | AssertionError, ValueError, RecursionError, ZeroDivisionError | 64.1% (615) | 79.9% (832) | 88.2% (1027) | 88.8% (988) |
| Code Pattern Generation | SyntaxError, IndentationError | 6.6% (63) | 0.4% (4) | 0.2% (2) | 1.5% (17) |
| Context Management | NameError, AttributeError, TypeError, IndexError, UnboundLocalError | 29.4% (282) | 19.7% (205) | 11.7% (136) | 9.4% (105) |
| Other | RuntimeError, OverflowError | - | - | - | 0.3% (3) |

Table 3: Error analysis of GPT-3.5-Turbo on DynaCode. The model was evaluated across 4 units of increasing complexity, with 100 questions per graph. The table summarizes the distribution and frequency of errors categorized by problem understanding, code pattern generation, and context management.

4.3 Experimental Insights

DynaCode Limits Memorization for More Reliable Evaluation. A key challenge in evaluating LLMs is data contamination, where models may memorize training data instead of generalizing effectively. To explore this issue, we fine-tuned a commercial LLM (GPT-3.5-Turbo) and an open-source LLM (Meta-Llama-3.1-8B-Instruct) on both the MBPP+ and DynaCode datasets, and then evaluated their performance as shown in Figure 5. To ensure a fair comparison and avoid catastrophic forgetting, we propose a fine-tuning strategy as follows: for GPT-3.5-Turbo, we trained on MBPP+ for 5 epochs, on DynaCode unit functions for 5 epochs, and on DynaCode for 1 epoch, while keeping the total fine-tuning steps fixed at 1890; for Meta-Llama-3.1-8B-Instruct, we trained on MBPP+ for 10 epochs, on DynaCode unit functions for 10 epochs, and on DynaCode for 2 epochs, while keeping the total fine-tuning steps fixed at 3780. The results show a significant improvement in GPT-3.5-Turbo's Pass@1 score on MBPP+, which increased from 69.7% to 88.6%, while on DynaCode, the improvement was much smaller, rising from 32.6% to 36.0%.
A similar pattern was observed for Meta-Llama-3.1-8B-Instruct, where the model’s performance on MBPP+ surged from 55.6% to 98.1%, but improved far less on DynaCode, from 10.6% to 23.6%. To further investigate the impact of data contamination, we introduced a fine-tuning setting using only the unit functions from DynaCode, and then evaluated the models on the full DynaCode benchmark. Under this setting, GPT-3.5-Turbo’s Pass@1 score dropped to 15.2%, while Meta-Llama-3.1-8B-Instruct’s score fell to 6.9%. The much smaller gains on DynaCode than on MBPP+, together with the drop after fine-tuning on unit functions alone, suggest that memorizing training examples does little to help on DynaCode, implying that DynaCode’s dynamic evaluation strategy effectively mitigates data contamination. As a result, DynaCode offers a more reliable and comprehensive assessment of LLMs’ true code generation capabilities, ensuring that the evaluation is based on the model’s generalization ability rather than its capacity to memorize specific training examples.

LLMs Are Good at Sequential Execution. We evaluated 4 LLMs on different call graphs, averaging their Pass@1 scores across all 4 units, as shown in Figure 6. The results reveal a clear trend: LLMs perform significantly better on sequentially structured graphs {G1, G2, G3, G4, G8}, which involve linear execution with minimal branching. This suggests that LLMs excel at straightforward function compositions and stepwise computations. However, as call graph complexity increases, particularly in multi-layered, multi-branch structures {G9, G10, G11, G12, G13, G14, G15, G16}, model performance drops considerably. This indicates that LLMs struggle with parallel function dependencies and with managing execution across interdependent subfunctions. Among the evaluated models, GPT-4o consistently outperforms others across all graphs, showing greater resilience to structural complexity.
Nevertheless, even GPT-4o exhibits a notable decline on high-complexity graphs, highlighting a fundamental limitation in LLMs’ ability to generate and manage deeply nested, interdependent code structures.

LLMs Struggle with Problem Understanding as Complexity Increases. We evaluate the code generation capabilities of GPT-3.5-Turbo on DynaCode by analyzing the distribution and frequency of errors across 4 units, classifying them into Problem Understanding, Code Pattern Generation, and Context Management abilities based on common code errors. The results are summarized in Table 3. The findings show that the share of errors attributed to Problem Understanding increases progressively from 64.1% in unit 1 to 88.8% in unit 4, indicating that the increasing complexity of the code problems makes it more difficult for GPT-3.5-Turbo to correctly understand the problem requirements. In contrast, the shares for Code Pattern Generation and Context Management show a decreasing trend across units. However, this reduction does not imply an improvement in capabilities. Instead, it reflects a shift in the error distribution: as task complexity increases, the model more often fails at the Problem Understanding stage, before errors in syntax or context management can surface. The error classification details are shown in Appendix F.

5 Conclusion

We present DynaCode, a dynamic, complexity-aware benchmark designed to systematically evaluate LLMs on code generation tasks. By integrating code complexity with call-graph structures, DynaCode generates nested code problems that capture varying levels of code complexity and inter-function dependencies. We evaluated 12 of the latest LLMs, and the results reveal significant performance drops as both unit and graph complexity increase, highlighting DynaCode’s ability to systematically assess LLMs.
Notably, DynaCode also addresses the challenge of data contamination and demonstrates its capacity to mitigate this limitation. In summary, DynaCode provides a new perspective for examining and analyzing LLMs, offering a scalable framework for more comprehensive and reliable LLM evaluation.

Limitations

DynaCode primarily focuses on relatively simple call-graph structures, with a maximum node count of 5. While this ensures manageable complexity for current LLMs, it is possible that more advanced LLMs in the future may learn to handle such call patterns. We will extend DynaCode to include more complex call-graph structures in future work, further challenging LLMs and enhancing the benchmark’s scalability.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.

Claas Beger and Saikat Dutta. 2025. Coconut: Structural code understanding does not fall out of a tree. arXiv preprint arXiv:2501.16456.

Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, et al. 2023. Multipl-e: a scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, 49(7):3675–3691.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code.

Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang. 2023. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. Preprint, arXiv:2308.01861.

Inc. Facebook. 2017. Monkeytype: A python annotation tool for adding type hints to your code.

Lizhou Fan, Wenyue Hua, Lingyao Li, Haoyang Ling, and Yongfeng Zhang. 2023. Nphardeval: Dynamic benchmark on reasoning ability of large language models via complexity classes. arXiv preprint arXiv:2312.14890.

Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. 2024. Cruxeval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065.

Maurice H Halstead. 1977. Elements of Software Science (Operating and programming systems series). Elsevier Science Inc.

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review. ACM Trans. Softw. Eng. Methodol., 33(8).

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186.

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088.

Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, and Qiang Guan. 2025. Can large language models understand intermediate representations? arXiv preprint arXiv:2502.06854.

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770.

Mohammad Abdullah Matin Khan, M Saiful Bari, Do Long, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. 2024. Xcodeeval: An execution-based large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6766–6805.
Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, and Zhi Jin. 2024. Evocodebench: An evolving code generation benchmark aligned with real-world code repositories. arXiv preprint arXiv:2404.00599.

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024a. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.

Changshu Liu, Shizhuo Dylan Zhang, Ali Reza Ibrahimzada, and Reyhaneh Jabbarvand. 2024b. Codemind: A framework to challenge large language models for code reasoning. arXiv preprint arXiv:2402.09664.

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024c. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA. Curran Associates Inc.

Zhijie Liu, Yutian Tang, Xiapu Luo, Yuming Zhou, and Liang Feng Zhang. 2024d. No need to lift a finger anymore? assessing the quality of code generation by chatgpt. IEEE Transactions on Software Engineering.

Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664.

Thomas J McCabe. 1976. A complexity measure. IEEE Transactions on software Engineering, (4):308–320.

Meta AI. 2024a. Meta-llama-3-8b-instruct. https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct.

Meta AI. 2024b. Meta-llama-3.1-405b-instruct. https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct.

Meta AI. 2024c. Meta-llama-3.1-8b-instruct. https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct.

Meta AI. 2024d. Meta-llama-3.3-70b-instruct. https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct.

Microsoft. 2024. phi-2. https://huggingface.co/microsoft/phi-2.

OpenAI. 2023. Gpt-3.5-turbo. https://openai.com/.

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

Rubik. 2014. Radon: A tool to compute complexity metrics for python source code.

CodeGemma Team, Heri Zhao, Jeffrey Hui, Joshua Howland, Nam Nguyen, Siqi Zuo, Andrea Hu, Christopher A Choquette-Choo, Jingyue Shen, Joe Kelley, et al. 2024. Codegemma: Open code models based on gemma. arXiv preprint arXiv:2406.11409.

Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei, and Xuanjing Huang. 2024. Benchmark self-evolving: A multi-agent framework for dynamic llm evaluation. arXiv preprint arXiv:2402.11443.

Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro Von Werra, Arjun Guha, and Lingming Zhang. 2024. Selfcodealign: Self-alignment for code generation. arXiv preprint arXiv:2410.24198.

Chunqiu Steven Xia, Yinlin Deng, and Lingming Zhang. 2024. Top leaderboard ranking = top coding proficiency, always? evoeval: Evolving coding benchmarks via llm. arXiv preprint.

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.

Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, pages 1–12.

Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, et al. 2024a. A careful examination of large language model performance on grade school arithmetic. arXiv preprint arXiv:2405.00332.

Zhehao Zhang, Jiaao Chen, and Diyi Yang. 2024b. Darg: Dynamic evaluation of large language models via adaptive reasoning graph. arXiv preprint arXiv:2406.17271.

Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, et al. 2023. Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5673–5684.

Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, and Xing Xie. 2023. Dyval: Graph-informed dynamic evaluation of large language models. arXiv preprint arXiv:2309.17167.

Kaijie Zhu, Jindong Wang, Qinlin Zhao, Ruochen Xu, and Xing Xie. 2024. Dyval 2: Dynamic evaluation of large language models by meta probing agents. arXiv preprint arXiv:2402.14865.

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. 2024. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877.

Albert Ziegler, Eirini Kalliamvakou, X. Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. 2022. Productivity assessment of neural code completion. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, MAPS 2022, page 21–29, New York, NY, USA. Association for Computing Machinery.

Appendix

A Cyclomatic Complexity Details

We present the cyclomatic complexity calculations for Sequence, If-Else, While, and While-Not control structures, illustrating their impact on code complexity in Figure 8.
Cyclomatic complexity quantifies the number of independent execution paths in a program, making it a useful metric for assessing the logical complexity of different control flows. Simple structures like Sequence have a lower complexity, as they follow a single execution path, whereas If-Else and While introduce branching and loops, increasing the number of possible execution paths. While-Not adds additional conditional constraints, further influencing complexity. Understanding these calculations helps in evaluating how LLMs handle different levels of logical complexity in generated code.

B Call Graph Details

In our experiments, we set the maximum number of nodes in the call graph to 5. This configuration enables the generation of 16 distinct call-graph structures, each corresponding to a unique function call pattern. Every call graph is modeled as a directed acyclic graph with a single root node, ensuring a unified entry point for function calls. Figure 7 illustrates the detailed configurations of all 16 call-graph structures. These structures provide a comprehensive testbed for evaluating the performance of LLMs in generating code with complex nested function calls, thereby enhancing our assessment of their ability to handle varying levels of logical and structural complexity.

C Benchmark Statistics Details

We report the details of benchmark statistics in Table 4, which shows the number of problems contained in different units and the corresponding number of benchmarks that can be generated through combinations. We also compare the total number of problems in DynaCode with those in other code generation benchmarks, as shown in Table 5.

D Model Performance Details

D.1 Overall Results

Figure 9 presents the performance details of all evaluated models on DynaCode across varying levels of code complexity.
The evaluation reveals consistent performance degradation across all models as code complexity increases, underscoring the benchmark’s ability to challenge models beyond basic code generation. Models like GPT-4o and DeepSeek-V3 exhibit relatively robust performance, maintaining higher Pass@1 scores even at higher complexity levels. In contrast, models such as WizardLM-2-8x22B and codegemma-7b-it show a steep decline in performance, struggling significantly with more complex, nested code structures. These trends align with the observations discussed in the main benchmark results, further validating the robustness of DynaCode in differentiating model capabilities across diverse complexity scenarios.

Type | Unit 1 | Unit 2 | Unit 3 | Unit 4
Base | 153 | 100 | 76 | 76
Graph 1 | 2,617 | 697 | 568 | 1,026
Graph 2 | 48,638 | 5,718 | 4,133 | 13,517
Graph 3 | 37,084 | 4,138 | 3,230 | 9,453
Graph 4 | 896,634 | 43,051 | 26,648 | 166,466
Graph 5 | 710,792 | 34,101 | 19,768 | 120,555
Graph 6 | 1,342,944 | 64,915 | 48,304 | 242,107
Graph 7 | 349,802 | 15,883 | 12,858 | 56,251
Graph 8 | 15,938,962 | 288,317 | 150,618 | 1,931,587
Graph 9 | 12,796,855 | 241,443 | 108,883 | 1,423,255
Graph 10 | 37,303,818 | 695,172 | 601,704 | 4,189,592
Graph 11 | 24,051,797 | 458,375 | 320,522 | 2,881,061
Graph 12 | 11,832,328 | 229,898 | 181,992 | 1,497,380
Graph 13 | 19,069,749 | 362,085 | 237,205 | 2,083,414
Graph 14 | 710,792 | 34,101 | 19,768 | 120,555
Graph 15 | 2,426,580 | 42,459 | 39,678 | 239,392
Graph 16 | 36,998,540 | 695,172 | 600,528 | 4,177,666
Unit Sum | 164,517,932 | 3,215,525 | 2,376,407 | 19,153,277
Total | 189,263,141

Table 4: The number of problems contained in different units and the corresponding number of benchmarks that can be generated in combination.

D.2 Details of the Number

In Figure 10, we present the detailed results of GPT-3.5-Turbo’s performance for each unit, with problem quantities set to {25, 50, 75, 100, 125, 150} for each graph. The results indicate that smaller N values, such as 25 and 50, lead to greater performance variability across units, suggesting sensitivity to insufficient data.
As N increases, especially beyond 75, the performance stabilizes, and fluctuations diminish. Based on this observation, 100 was selected for subsequent experiments, balancing computational efficiency with evaluation robustness.

Figure 7: Illustration of 16 distinct call-graph structures used in our experiments, each modeled as a directed acyclic graph with up to 5 nodes and a single root. (Panels numbered (1) to (16).)

Figure 8: Cyclomatic complexity calculations for Sequence, If-Else, While, and While-Not control structures, illustrating their impact on code complexity. (Columns: control structure, cyclomatic complexity.)

Benchmark | Number of Problems
HumanEval | 164
HumanEval+ | 164
MBPP | 974
MBPP+ | 378
BigCodeBench | 1,140
DynaCode | 189,263,141

Table 5: Comparison of the number of problems between DynaCode and other code generation benchmarks.

E Prompt

In our DynaCode, we dynamically generate a unified prompt by hard-coding the combination of individual code problems. Initially, a call graph is constructed by selecting candidate functions based on their input and output types. The graph is then completed by traversing it and randomly assigning nodes while ensuring that the data flow remains consistent. Specifically, the output of a parent function becomes the input for its child function. Each node is assigned a unique prompt number, and the individual prompts stored in our dataset are concatenated in sequence to form the final prompt. Table 6 presents an example from G8, illustrating this process in practice. In addition to the sequential prompt assembly, a main function is automatically generated to call the individual functions in the correct order, with explicit rules that govern the data transfer between them.
This hard-coded dynamic composition strategy not only ensures logical consistency among the code fragments but also enhances the robustness and adaptability of our code generation system.

F Error Classification Details

We present the classification of error types in DynaCode along with their corresponding functional details, as shown in Table 7.

G Examples in DynaCode

We provide a valid nested code example for the prompt in Table 6, as shown in Figure 11. As observed, the flow from generate_fibonacci to is_integer represents a complete workflow, which demonstrates that the call graph-based dynamic evaluation strategy can assess the ability of LLMs to handle multi-step tasks and the dependencies between different steps by simulating a real-world task flow. Our call-graph structure offers a variety of workflow configurations, which can help us understand how well the LLM performs on tasks requiring structured thinking and logical progression, making it a powerful tool for evaluating its capabilities in multi-step tasks.

Figure 9: Performance details of various models on DynaCode across different complexity levels. Each subfigure illustrates the Pass@1 score trends for units 1 to 4, highlighting how model performance degrades as code complexity increases. (Panels: GPT-4o, GPT-3.5-Turbo, DeepSeek-V3, Qwen2.5-Coder-32B-Instruct, starcoder2-15b-instruct-v0.1, Phind-CodeLlama-34B-v2, Meta-Llama-3.1-405B-Instruct, Llama-3.3-70B-Instruct, WizardLM-2-8x22B, codegemma-7b-it, Mixtral-8x22B-Instruct-v0.1, Meta-Llama-3.1-8B-Instruct; x-axis: Level, 1 to 4; y-axis: Pass@1 Score (%); series: Unit 1 to Unit 4.)
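The composition strategy described in Appendix E can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `UnitProblem` record, the `is_type_consistent` and `build_chain_prompt` helpers, and the restriction to a purely sequential call graph are all hypothetical choices made for the example. Only the wording of the generated rules follows Table 6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitProblem:
    """Hypothetical record for one stored unit problem (not from the paper)."""
    name: str           # function name, e.g. "square_numbers"
    prompt: str         # natural-language task description
    input_type: type    # type the function accepts
    output_type: type   # type the function returns


def is_type_consistent(chain):
    """Check that each parent's output type matches its child's input type."""
    return all(p.output_type is c.input_type for p, c in zip(chain, chain[1:]))


def build_chain_prompt(chain):
    """Concatenate unit prompts and data-flow rules for a sequential call graph."""
    if not chain or not is_type_consistent(chain):
        raise ValueError("chain is empty or its data flow is inconsistent")
    n = len(chain)
    lines = [f"Here are {n} prompts that are used to generate {n} functions respectively."]
    # Each node gets a unique prompt number, in call order.
    for i, problem in enumerate(chain, start=1):
        lines.append(f'PROMPT {i}: """ {problem.prompt} """')
    lines.append(
        f"Please write the above {n} functions respectively and write a new "
        f"function named main to call the above {n} functions."
    )
    lines.append(
        f"The input of the main function equals the input of PROMPT 1: {chain[0].name}."
    )
    # The output of a parent function becomes the input of its child.
    for i, (parent, child) in enumerate(zip(chain, chain[1:]), start=1):
        lines.append(
            f"The output of function PROMPT {i}: {parent.name} serves as "
            f"the input of PROMPT {i + 1}: {child.name}."
        )
    lines.append(
        f"The main function returns the output of the PROMPT {n}: {chain[-1].name}."
    )
    return "\n".join(lines)


chain = [
    UnitProblem("generate_fibonacci",
                "Write a python function to generate the first n Fibonacci numbers.",
                int, list),
    UnitProblem("square_numbers",
                "Write a python function to square each number in a given list.",
                list, list),
    UnitProblem("sum_numbers",
                "Write a python function to find the sum of all numbers in a list.",
                list, int),
]
prompt = build_chain_prompt(chain)
```

Branching call graphs would need a rule per edge rather than per consecutive pair; the sequential case is shown only because it is the simplest one to check by eye.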
Figure 10: Performance comparison of GPT-3.5-Turbo on DynaCode with varying numbers of dynamically generated problems {25, 50, 75, 100, 125, 150}. (One panel per number of problems; x-axis: Level, 1 to 4; y-axis: Pass@1 Score (%); series: Unit 1 to Unit 4.)

Python Code Prompts

Here are 5 prompts that are used to generate 5 functions respectively.

PROMPT 1:
"""
Write a python function to generate the first n Fibonacci numbers.
assert generate_fibonacci(5) == [0, 1, 1, 2, 3]
"""

PROMPT 2:
"""
Write a python function to square each number in a given list.
assert square_numbers([0, 1, 1, 2, 3]) == [0, 1, 1, 4, 9]
"""

PROMPT 3:
"""
Write a python function to find the sum of all numbers in a list.
assert sum_numbers([0, 1, 1, 4, 9]) == 15
"""

PROMPT 4:
"""
Write a python function to find the square root of a number.
assert square_root(15) == 3.872983346207417
"""

PROMPT 5:
"""
Write a python function to check if a number is an integer.
assert is_integer(3.872983346207417) == False
"""

Please write the above 5 functions respectively and write a new function named main to call the above 5 functions. When calling these functions, please follow the following rules:
The input of the main function equals the input of PROMPT 1: generate_fibonacci.
The output of function PROMPT 1: generate_fibonacci serves as the input of PROMPT 2: square_numbers.
The output of function PROMPT 2: square_numbers serves as the input of PROMPT 3: sum_numbers.
The output of function PROMPT 3: sum_numbers serves as the input of PROMPT 4: square_root.
The output of function PROMPT 4: square_root serves as the input of PROMPT 5: is_integer.
The main function returns the output of the PROMPT 5: is_integer.

Table 6: Example of a dynamically generated prompt in DynaCode for call graph G8.
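For reference, one response that would satisfy the prompt in Table 6 might look like the sketch below. It is written for illustration and is not a reproduction of the paper's Figure 11; the function names and the data flow come from the prompt itself, while the bodies are one possible implementation. Treating a float with no fractional part as an integer in `is_integer` is an assumption, since the prompt only pins down the non-integer case.

```python
import math


def generate_fibonacci(n):
    """Return the first n Fibonacci numbers, starting from 0."""
    sequence = []
    a, b = 0, 1
    for _ in range(n):
        sequence.append(a)
        a, b = b, a + b
    return sequence


def square_numbers(numbers):
    """Square each number in the list."""
    return [x * x for x in numbers]


def sum_numbers(numbers):
    """Sum all numbers in the list."""
    return sum(numbers)


def square_root(number):
    """Return the square root of a non-negative number."""
    return math.sqrt(number)


def is_integer(number):
    """True for ints and for floats with no fractional part (an assumption)."""
    return isinstance(number, int) or (
        isinstance(number, float) and number.is_integer()
    )


def main(n):
    """Chain the five functions in the order fixed by the prompt's rules."""
    fib = generate_fibonacci(n)
    squared = square_numbers(fib)
    total = sum_numbers(squared)
    root = square_root(total)
    return is_integer(root)
```

Because each call feeds the next, a mistake in any single function propagates to the final result, which is why a nested problem of this kind is stricter than grading the five functions in isolation.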
Figure 11: Example of a nested code workflow from Table 6, demonstrating how the call graph-based evaluation assesses LLMs’ ability to handle multi-step tasks and dependencies.

Associated Capability | Error Type | Reason for Categorization
Problem Understanding | AssertionError | Failure to satisfy the expected logic of the problem, indicating misunderstanding of requirements.
Problem Understanding | ValueError | Input data out of expected range, implying insufficient understanding of problem constraints.
Problem Understanding | RecursionError | Incorrect handling of recursion logic, showing failure in comprehending recursive termination.
Problem Understanding | ZeroDivisionError | Failure to account for boundary conditions, such as division by zero.
Code Pattern Generation | SyntaxError | Violation of syntax rules, directly reflecting the inability to generate structurally correct code.
Code Pattern Generation | IndentationError | Incorrect indentation, indicating issues in formatting and structural management of code.
Context Management | NameError | Use of undefined variables, highlighting issues in variable scope management.
Context Management | AttributeError | Accessing nonexistent attributes, suggesting misunderstanding of object structures.
Context Management | TypeError | Incorrect data type handling, indicating flaws in type inference and function parameter management.
Context Management | IndexError | Indexing beyond valid range, showing poor management of data structures like lists or arrays.
Context Management | UnboundLocalError | Use of uninitialized local variables, indicating incorrect variable lifecycle management.
Other | OverflowError | Numeric overflow due to improper handling of large values or operations.
Other | RuntimeError | Errors during execution, often caused by incorrect function calls or resource issues.

Table 7: Categorization of error types and their corresponding capabilities in DynaCode.
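The categorization in Table 7 amounts to a lookup from Python exception names to capabilities, which the following sketch makes concrete. The mapping is taken from Table 7; everything else (the helper names `categorize`, `error_distribution`, and `run_and_label`, and the use of a bare `exec` to run candidate code) is a hypothetical simplification for illustration. Model-generated code should only be executed inside a proper sandbox, and the paper does not describe its harness at this level of detail.

```python
from collections import Counter

# Exception name -> capability, following Table 7.
ERROR_CATEGORIES = {
    "AssertionError": "Problem Understanding",
    "ValueError": "Problem Understanding",
    "RecursionError": "Problem Understanding",
    "ZeroDivisionError": "Problem Understanding",
    "SyntaxError": "Code Pattern Generation",
    "IndentationError": "Code Pattern Generation",
    "NameError": "Context Management",
    "AttributeError": "Context Management",
    "TypeError": "Context Management",
    "IndexError": "Context Management",
    "UnboundLocalError": "Context Management",
    "OverflowError": "Other",
    "RuntimeError": "Other",
}


def categorize(error_name):
    """Map an exception name to its capability; unknown names are flagged."""
    return ERROR_CATEGORIES.get(error_name, "Uncategorized")


def error_distribution(error_names):
    """Return {capability: (count, percentage of all errors)}."""
    counts = Counter(categorize(name) for name in error_names)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: (n, 100.0 * n / total) for cat, n in counts.items()}


def run_and_label(source, test):
    """Run candidate code followed by a test; return the exception name or None.

    Illustration only: real evaluation should isolate untrusted code.
    """
    try:
        exec(compile(source + "\n" + test, "<candidate>", "exec"), {})
    except Exception as exc:
        return type(exc).__name__
    return None
```

Reporting each category as a share of all errors, as `error_distribution` does, matches how Table 3 presents its percentages alongside raw counts.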

---