Authors: Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, Junyang Lin
2025-01-06
CodeElo: Benchmarking Competition-level Code Generation of
LLMs with Human-comparable Elo Ratings
Shanghaoran Quan Jiaxi Yang Bowen Yu Bo Zheng Dayiheng Liu An Yang
Xuancheng Ren Bofei Gao Yibo Miao Yunlong Feng Zekun Wang
Jian Yang Zeyu Cui Yang Fan Yichang Zhang Binyuan Hui* Junyang Lin*
Qwen Team, Alibaba Group
{quanshanghaoran,binyuan.hby,junyang.ljy}@alibaba-inc.com
https://CodeElo-bench.github.io
https://hf.co/datasets/Qwen/CodeElo
Abstract
With the increasing code reasoning capabilities of existing large language models (LLMs)
and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing
need to develop more challenging and comprehensive benchmarks that effectively
test their sophisticated competition-level coding abilities. Existing benchmarks, like
LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack
of support for special judges, and misaligned execution environments. To bridge this gap,
we introduce CodeElo, a standardized competition-level code generation benchmark
that effectively addresses all these challenges for the first time. The CodeElo benchmark is
mainly based on the official CodeForces¹ platform and tries to align with the platform as
much as possible. We compile the recent six months of contest problems on CodeForces
with detailed information such as contest divisions, problem difficulty ratings, and
problem algorithm tags. We introduce a unique judging method in which problems are
submitted directly to the platform and develop a reliable Elo rating calculation system
that aligns with the platform and is comparable with human participants but has lower
variance. By testing on CodeElo, we provide the Elo ratings of 30 existing popular
open-source and 3 proprietary LLMs for the first time. The results show that o1-mini
and QwQ-32B-Preview stand out significantly, achieving Elo ratings of 1578 and 1261,
respectively, while other models struggle even with the easiest problems, placing in
the lowest 25 percent among all human participants. Detailed analysis experiments are
also conducted to provide insights into performance across algorithms and comparisons
between using C++ and Python, which can suggest directions for future studies.
[Bar chart: one bar per tested model, ordered from o1-mini (1578) and QwQ-32B-Preview (1261) down to DS-Coder-1.3B-Instruct (21); axis: Elo Rating, 0 to 1600; dashed lines mark the 10th, 20th, 60th, and 90th percentiles of human participants.]
Figure 1: The Elo rating leaderboard. The test results may be slightly lower than the actual capabilities,
as we limit the number of submissions for each problem to eight. The green dashed lines
represent the Elo ratings of human participants at the corresponding percentiles.
*Corresponding authors: Binyuan Hui, Junyang Lin
¹ https://codeforces.com
arXiv:2501.01257v2 [cs.CL] 3 Jan 2025
1 Introduction
With the increasing capabilities of existing LLMs and breakthroughs in reasoning models like OpenAI
o1 and o3, there is a growing need to develop benchmarks that effectively test their sophisticated
reasoning abilities. Math and coding are two natural evaluation domains for this purpose, as they provide
accurate and easily verifiable feedback. While math offers hard benchmarks such as AIME (MAA,
2024), Omni-MATH (Gao et al., 2024), and LiveAoPSBench (Anonymous, 2024), coding still lacks
a suitable benchmark for measuring LLMs’ reasoning skills. We find that the
CodeForces platform is suitable: the reports of OpenAI o1 (OpenAI, 2024b), o3 (OpenAI, 2024c),
and DeepSeek r1 (DeepSeek, 2024) all test hard coding problems on CodeForces. However, there is still no
standardized CodeForces test suite, which leads to duplicated effort in compiling problems and building
evaluation pipelines, and to potentially misaligned settings and results. This highlights the need for the
research community to standardize a CodeForces evaluation benchmark that assesses the competition-level
coding capabilities of LLMs and reflects their corresponding sophisticated reasoning abilities.
We find that existing competition-level code benchmarks like LiveCodeBench (Jain et al., 2024),
USACO (Shi et al., 2024), and CodeContests (Li et al., 2022) cannot meet the above requirement because of
limitations that stem from the unique nature of competition code evaluation. (1) Firstly,
unlike typical math and general coding problems, competition code problems often require extensive,
human-designed, robust test cases to validate solution correctness. Although it is easy to scrape
problems from the websites, the competition code platforms, or so-called online judges, often hide their
test cases. Consequently, existing benchmarks can only rely on publicly available or self-generated test
cases that are often small and weak, leading to high false-positive rates in judgment². (2) Secondly, to
truly evaluate LLM capabilities and provide human-comparable Elo ratings, it is necessary to test on
all problems in contests just as human participants do. However, not all problems can be judged by
directly comparing the output with the correct answer: about 30% of problems do not have unique
correct outputs and require specific judging code, called special judges³, to evaluate correctness.
These special judges are often unavailable and difficult to write for some specific problems. (3)
Thirdly, unlike in general code testing, execution time is a critical factor in competitive
coding, since problems often have strict constraints on running time and algorithmic time complexity, but
offline testing runs at different speeds on different machines, leading to potentially
misaligned evaluation results. Hence, there remains a lack of comprehensive standardized benchmarks
for competition-level coding.
In this work, for the first time, we present the benchmark CodeElo and the corresponding evaluation
method, which achieves zero false positives, supports special judges, and reaches an aligned execution
environment to ensure correct judgments and provide human-comparable, standardized Elo
ratings. We have compiled our test problems from CodeForces and categorized them by contest divisions,
problem difficulty ratings, and problem algorithm tags for more comprehensive evaluation and analysis.
To achieve correct judgments, we introduce a simple and effective method: using a bot to
automatically submit model solutions to CodeForces and receive test results. Because the judgment
comes from the platform, we can evaluate problems just like human participants, achieving zero false
positives without needing access to the full test set, whose test cases are often created adversarially and
are robust. Likewise, because we rely on the platform's judgments, we support special judges, which former
benchmarks did not provide, and the execution environment is fully aligned since all submissions run on the
same platform as those of human participants. Based on the evaluation results and available user data
on CodeForces, we also introduce an Elo rating calculation system that estimates the models’ expected
Elo rating based on their performance. This system is aligned with the CodeForces platform but has a
much lower variance, as detailed in Section 3.3.2.
We tested 30 existing popular open-source LLMs and 3 proprietary LLMs, with the leaderboard shown
in Figure 1. From the results, we find the OpenAI o1-mini model stands out by achieving the best
performance with an Elo rating of 1578, surpassing nearly 90 percent of human participants. The o1-like
reasoning model, QwQ-32B-Preview, also achieves great performance and stands out among open-source
models, with an Elo rating of 1261, placing it in about the 60th percentile. This suggests that increasing
the length of the chain-of-thought (CoT) is a promising way to improve the models’ reasoning ability.
On the other hand, most models struggle to pass the easiest problems and fall within the lowest 10th
percentile of Elo ratings among human participants, and several large and prominent open-source LLMs
fall in the range of the 10th to 20th percentiles. Through more detailed experiments and analysis, we
find that models are good at problems involving math and implementation, but struggle with dynamic
programming (dp) and trees. We also find that for most models, the best-performing language is C++,
instead of Python, which is most frequently used by LLMs and is also the main test language in previous
benchmarks. We hope CodeElo can pave the way for testing advanced LLMs’ code reasoning
capabilities and provide insights to improve such abilities in LLMs.
² Passing all tests in CodeContests might still result in a wrong answer (~4%) or a time limit error (~42%), according
to https://alphacode.deepmind.com/
³ Special judges are programs used to determine whether solutions are accepted for problems that do not have a
unique correct answer. A detailed discussion is provided in Appendix F.
To summarize, our benchmark has the following main contributions, with a more detailed discussion on
them available in Section 6.1.
•We provide a set of Codeforces problems with detailed information like contest divisions, problem
ratings, and problem algorithm tags.
•We introduce a unique evaluation method in which problems are submitted directly to the platform,
achieving zero false positives, special judge support and a fully aligned execution environment for the
first time.
•We are the first to provide standardized human-comparable Elo ratings that fairly judge the models’
competition-level code generation for the existing popular open-source and proprietary LLMs.
• A detailed analysis of experimental results provides insights that can suggest future studies.
2 Related Work
Before our work, there were several competitive code benchmarks. Here, we list six representative ones
that are most relevant to our work:
•APPS (Hendrycks et al., 2021): Proposed in 2021/05, APPS curated problems from Codewars, AtCoder,
Kattis, and CodeForces.
•CodeContests (Li et al., 2022): Proposed in 2022/03, CodeContests includes problems, solutions, and
test cases sourced from the CodeForces platform.
•xCodeEval (Khan et al., 2023): Proposed in 2023/03, xCodeEval is a large-scale multilingual multitask
code benchmark that also includes problems from CodeForces.
•TACO (Li et al., 2023): Proposed in 2023/12, TACO compiles problems from CodeContests and APPS,
adding new problems gathered from several competitive coding websites.
•LiveCodeBench (Jain et al., 2024): Proposed in 2024/03, LiveCodeBench primarily contains scraped
coding problems and test cases from LeetCode and AtCoder, with minimal content from CodeForces.
It avoids contamination by re-scraping new problems every month and releasing online updates.
•USACO (Shi et al., 2024): Proposed in 2024/04, USACO features hundreds of problems and test cases
from the USA Computing Olympiad. It updates its benchmark through new released versions.
We include a comparison against these benchmarks in Table 1. We find that these benchmarks all source
problems from open-access coding competition websites and conduct offline evaluations. While scraping
problems is simple, most online judges hide their test cases. To address this, existing benchmarks attempt
to generate test cases; however, these are often not as robust as the original ones (which are often designed
in an adversarial manner and require a lot of labor from high-level participants), leading to many false-
positive judgments. Additionally, these benchmarks do not support special judges. Another limitation is
that they require execution on an individual’s machine. Since runtime is a critical factor in algorithm
competitions, differing machine performance can affect results. Furthermore, none of these benchmarks
offer human-comparable standardized Elo ratings. These characteristics highlight the uniqueness and
significant advantages of our benchmark.
| Benchmark | Problem Difficulty | Updates | Zero False Positive? | Special Judge? | Aligned Execution Environment? | Standardized Elo Rating? |
|---|---|---|---|---|---|---|
| APPS | ⋆⋆ | No updates | ✗ | ✗ | ✗ | ✗ |
| CodeContests | ⋆⋆⋆ | No updates | ✗ | ✗ | ✗ | ✗ |
| TACO | ⋆⋆ | No updates | ✗ | ✗ | ✗ | ✗ |
| xCodeEval | ⋆ | No updates | ✗ | ✗ | ✗ | ✗ |
| USACO | ⋆⋆ | Offline | ✓ | ✗ | ✗ | ✗ |
| LiveCodeBench | ⋆ | Online | ✗ | ✗ | ✗ | ✗ |
| CodeElo | ⋆⋆⋆ | Online | ✓ | ✓ | ✓ | ✓ |
Table 1: Comparison between CodeElo and other competition code benchmarks.
Apart from competition code benchmarks, there are also some other more popular general code bench-
marks, like HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021) and BigCodeBench (Zhuo et al.,
2024). These benchmarks differ from competition code benchmarks in three significant ways: 1) While
competition problems require analyzing the problem and writing complete code with a focus on algorithm
design, general benchmarks usually require writing only a small function to test specific functionality.
2) The problems in our test set are much harder than those in general benchmarks, often needing
sophisticated reasoning while general benchmarks do not. 3) Execution time is a crucial constraint, as we
not only judge the output for correctness but also impose demands on algorithmic complexity, whereas
general benchmarks mainly focus on the former.
3 CodeElo Benchmark
Our benchmark is primarily based on the well-known coding competition platform, CodeForces. In this
section, we discuss the development of our benchmark, detailing the processes of problem collection,
classification and categorization, and the evaluation method.
3.1 Problem Collection
Our testing problems are sourced from the official CodeForces platform. We gathered all problems from
the rated contests held since the platform’s inception (but we only tested on recently held contests to
avoid data contamination).
An example problem can be found in Appendix Figure 4. The scraped problems were originally in raw
HTML format. We parsed out different sections of each problem, including the problem description,
input format, output format, examples, notes, and so on, to allow for more flexible restructuring of the
prompts during testing. By default, however, the problems are displayed in their original HTML format,
which preserves critical information and structure, including the formatting of text and formulas, with
minimal format translation. While these HTML formats might be challenging for humans to process without
rendering, they can be easily processed by LLMs (Gur et al., 2022).
3.2 Classification and Categories
CodeElo includes several distinct classifications. These classifications are sourced and collected
from the CodeForces platform, and we integrate them into our benchmark for more comprehensive
analysis and detailed evaluations of various LLMs.
Contest Difficulty Division CodeForces categorizes competitions into four divisions (Div. for short)
based on their levels of difficulty: Div. 1, 2, 3, and 4, with Div. 1 being the most challenging and Div. 4 the
least. Additionally, there are special divisions that combine Div. 1 and 2, as well as some global rounds
whose difficulty averages between Div. 1 and 2. We classify these as Div. 1 + 2. It’s important to note that
each contest includes multiple problems, and the division just represents an overall average difficulty, so
hard contests can also contain easier problems.
Problem Difficulty Rating The problems also have a rating attribute, but please note that this is a
different concept from the ratings of participants. The problem rating indicates the level of difficulty of
a problem. Specifically, a problem rating of x signifies that competitors with a rating of x have a 50%
probability of passing the problem the first time they see it. The problem ratings are derived from
actual human performance in contests and represent a statistically calculated value, making them relatively
accurate.
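To make the 50% reading concrete, here is a minimal sketch that assumes the solve probability follows the standard Elo logistic curve with a scale of 400. The text above only states the equal-rating case; the extension to unequal ratings and the function name `solve_probability` are illustrative assumptions.

```python
def solve_probability(participant_rating: float, problem_rating: float) -> float:
    """Chance of solving a problem, ASSUMING the standard Elo logistic curve.

    Only the equal-rating case (0.5) is stated in the text; the curve shape
    for other rating gaps is an assumption made for illustration.
    """
    return 1.0 / (1.0 + 10 ** ((problem_rating - participant_rating) / 400))


if __name__ == "__main__":
    # Equal ratings give the 50% probability described above.
    print(solve_probability(1500, 1500))
    # A problem rated 400 points above the participant is much less likely to be solved.
    print(solve_probability(1500, 1900))
```

Under this assumption, a 400-point gap corresponds to odds of 1 to 10 against the participant.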
Problem Algorithm Tags Problems are categorized using algorithm tags, which identify the types of
algorithms required to solve them. On average, a problem is associated with 3.9 tags, as it may need
multiple steps or belong to different algorithmic categories. We have a total of 35 tags, and nearly 90% of
the occurrences are covered by the top 16 tags. Note that these tags are not visible to humans during
contests and also not visible to the models being tested, thus they cannot serve as hints for solutions.
3.3 Evaluation Method
3.3.1 Solution Submitting and Judgment
Unlike other benchmarks that require testing on one’s own machine, which may cause execution environ-
ment misalignment, we parse the solution code block and directly submit it to the CodeForces official
platform after obtaining the model’s output to achieve absolutely accurate feedback. This is especially
important in competitive coding problems since execution time is a crucial metric in judging a solution,
and different environments may vary in this aspect. The proxy judging settings are also advantageous
because they not only allow us to bypass the need for complete test cases but also support the evaluation
of problems that require special judges, thereby enabling us to better assess the model’s capabilities and
provide more accurate results.
We use an automatic submission bot to help submit the problems and obtain judging results. Since the
results come directly from the official platform, we can directly parse the status to see if it is an "Accept."
Unlike some other competitive coding benchmarks like APPS, we consider a solution accepted only when
it passes all the test cases; partially passing a proportion of test cases does not count towards scoring,
aligned with the criteria on the platform.
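The two client-side steps described above, pulling the solution out of the model output and applying all-or-nothing acceptance, could be sketched roughly as follows. This is not the authors' submission bot: the helper names, the choice of taking the last fenced block, and the status string being matched are all assumptions made for illustration.

```python
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built this way so the listing stays fence-safe
# Match an opening fence (with optional language tag), the body, and a closing fence.
CODE_BLOCK = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_solution(model_output: str) -> Optional[str]:
    """Return the body of the LAST fenced code block, or None if there is none.

    Taking the last block is an assumption: the prompt asks for the complete
    code "in the end", after any outlined reasoning.
    """
    blocks = CODE_BLOCK.findall(model_output)
    return blocks[-1].strip() if blocks else None


def is_accepted(status: str) -> bool:
    """All-or-nothing judgment: partial passes do not count.

    The exact status text returned by the platform is assumed here.
    """
    return status.strip().lower().startswith("accept")
```

A status such as "Wrong answer on test 3" is treated the same as any other failure, which mirrors the rule that passing only a portion of the test cases earns nothing.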
3.3.2 Elo Rating Calculation System
We use an Elo rating calculation method similar to the official CodeForces platform Elo rating calculation
system⁴ to obtain a standardized Elo rating. This rating reflects an individual’s competitive programming
ability and is comparable among humans, among models, and between humans and models. Unlike
CodeForces, which updates a participant’s rating after each contest to maintain it
online, we treat each contest as independent for simplicity and accuracy. Thus, we calculate the model’s
expected rating $r$ for each contest individually.
Specifically, let’s assume there are $n$ human participants in a contest with ratings $r_{(i)}$ for $i = 1, 2, \ldots, n$.
For ease of mathematical representation, assume participants are ranked from best to worst in terms of
performance in this contest. Position the model’s performance within this ranking, denoting the model’s
rank as $m$ (where $1 \le m \le n+1$). Suppose the model’s expected rating is $r$. According to the definition
of the Elo rating (Elo and Sloan, 1978), we have the following equation:
$$ m = \sum_{i=1}^{n} \frac{1}{1 + 10^{(r - r_{(i)})/400}} $$
Since the expression on the right side of the equation is monotonically decreasing with respect to $r$, we
can easily use binary search to determine the exact value of $r$.
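The binary search over the equation above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the search bounds of ±10000, and the fixed iteration count are all choices made here.

```python
from typing import Sequence


def expected_rank(r: float, human_ratings: Sequence[float]) -> float:
    """Right-hand side of the equation; it decreases as r grows."""
    return sum(1.0 / (1.0 + 10 ** ((r - ri) / 400)) for ri in human_ratings)


def expected_rating(m: float, human_ratings: Sequence[float],
                    lo: float = -10000.0, hi: float = 10000.0,
                    iters: int = 100) -> float:
    """Binary-search the rating r whose expected rank equals the observed rank m.

    The bounds and iteration count are assumptions; ranks with no finite
    solution simply saturate at one of the bounds.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if expected_rank(mid, human_ratings) > m:
            lo = mid  # expected rank too large, so the rating must be higher
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    humans = [1200, 1400, 1600, 1800]
    # A better (smaller) rank should map to a higher expected rating.
    print(expected_rating(1, humans), expected_rating(3, humans))
```

For a hypothetical contest with four participants all rated 1500, a model ranked second lands at an expected rating of 1500, since each term of the sum then equals one half.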
As demonstrated in Appendix C, we have analyzed that this calculation method maintains the same
expected rating as the official platform’s method but with significantly lower variance.
Advantages of Elo Ratings Over Other Metrics While many existing benchmarks use pass@n (with n
typically being 1 or other values) as the evaluation metric, our benchmark adopts a more advanced Elo
rating system. This system accounts for multiple attempts, providing a more comprehensive analysis
than simply considering pass@1, and takes sampling diversity into account. Moreover, it
balances pass attempts by penalizing failed attempts made before the successful one, which the
traditional pass@n metric does not do. In addition, passing more challenging problems earns higher scores
in rating calculations, encouraging models to tackle more difficult tasks, a feature not
supported by most previous benchmarks.
4 Evaluation on Existing LLMs
4.1 Experiment Setup
Based on our observation that most models struggle with even the simplest problems in Div. 1 contests,
we decided to discard these contests and focus solely on testing the remaining divisions in the most
recent contests. We gathered contests held between May 4, 2024, and November 4, 2024, totaling 54
contests or 387 problems. We present basic statistics for these contests in Table 2.
For each problem, we allow each tested model up to eight attempts. Given that the inference times for
each model are significantly lower than those for humans and can be omitted, we assume that models will
respond and submit solutions within the first minute, so no time penalties are applied. However, penalty
points will still be counted for any failed attempts made before a successful submission, which is aligned
with the platform. Each problem has a specific score, and we calculate the final scores and penalties of
the tested models to compare and rank them against human participants. All conditions not mentioned
are kept as similar as possible to the official platform settings.
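As a rough illustration of the scoring rule described above (no time decay, but a deduction for each failed attempt before the first success, with at most eight attempts), consider the sketch below. The deduction of 50 points per failed attempt and the floor of 30% of the maximum score are assumptions for illustration; the actual values follow the platform's rules for each contest type, and some divisions use a different penalty scheme.

```python
from typing import Sequence

MAX_ATTEMPTS = 8  # attempt budget per problem, as stated in the setup


def problem_score(max_points: float, verdicts: Sequence[bool],
                  penalty_per_fail: float = 50.0,
                  floor_ratio: float = 0.3) -> float:
    """Score one problem from per-attempt pass/fail flags in submission order.

    penalty_per_fail and floor_ratio are ILLUSTRATIVE assumptions, not values
    taken from the paper.
    """
    for failed_before, passed in enumerate(verdicts[:MAX_ATTEMPTS]):
        if passed:
            raw = max_points - penalty_per_fail * failed_before
            return max(raw, floor_ratio * max_points)
    return 0.0  # never accepted within the attempt budget


if __name__ == "__main__":
    # Two failures, then a pass on a hypothetical 500-point problem.
    print(problem_score(500, [False, False, True]))
```

Failed attempts that come after an accepted submission, or on a problem that is never solved, do not change the result in this sketch.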
⁴ https://codeforces.com/blog/entry/20762
| Div. | Count | Avg. Problems | Avg. Ratings | Rating Requirement |
|---|---|---|---|---|
| 1 | 3 | 7.0 | 2533 | ≥1900 |
| 1+2 | 8 | 9.1 | 2106 | All |
| 2 | 33 | 6.5 | 1779 | ≤2100 |
| 3 | 10 | 7.5 | 1436 | ≤1600 |
| 4 | 3 | 8.3 | 1276 | ≤1400 |
Table 2: Basic statistics of different contest divisions.
We evaluated 30 popular open-source models using vLLM for inference and 3 proprietary models via
API calls (detailed in the model cards found in Appendix A). All tested models followed the same
chain-of-thought prompting:
You are a coding expert. Given a competition-level coding problem, you need to write a
C++ program to solve it. You may start by outlining your thought process. In the end,
please provide the complete code in a code block enclosed with ``` ```.
Here C++ was chosen as the test language because it generally elicits the best performance from models
on competition code problems. A detailed analysis of language use is provided in Section 5.2.
4.2 Elo Ratings
We present the standardized Elo rating leaderboard in Figure 1. Our analysis reveals that o1-mini, with a
rating of 1578, and QwQ-32B-Preview, with a rating of 1261, stand out significantly among proprietary
and open-source models, respectively. This demonstrates the substantial advantage of long CoT o1-like
reasoning models in tackling difficult competition code problems. Additionally, we observe a clear
trend that larger models tend to outperform smaller ones, and all open-source models that achieved
a rating above 500 are 30B+ models. However, their ratings remain below 700, which corresponds to
approximately the lowest 20th percentile among all human participants in Table 6.
4.3 Main Results
We present performance details of all tested proprietary and open-source models in Table 3. For a clearer
comparison, we categorize open-source models by their sizes and classify mixture-of-experts models by
treating their size as the square root of the product of their activated and total parameters.
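The size rule for mixture-of-experts models is a geometric mean, which the short sketch below works through. The bucket thresholds are read off the group headings of Table 3 (1B+, 6B+, 13B+, 30B+); the function names are hypothetical.

```python
import math


def effective_size_b(activated_b: float, total_b: float) -> float:
    """Geometric mean of activated and total parameters, in billions."""
    return math.sqrt(activated_b * total_b)


def size_bucket(size_b: float) -> str:
    """Map a size in billions to the size groups used in Table 3."""
    for threshold, label in [(30, "30B+"), (13, "13B+"), (6, "6B+"), (1, "1B+")]:
        if size_b >= threshold:
            return label
    return "<1B"


if __name__ == "__main__":
    # Mixtral-8x7B-Instruct-v0.1 is listed as 8/56B in Table 3.
    size = effective_size_b(8, 56)
    print(round(size, 1), size_bucket(size))
```

With 8B activated and 56B total parameters the effective size is about 21B, which matches the placement of Mixtral-8x7B-Instruct-v0.1 in the 13B+ group.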
Elo Ratings Across Contest Divisions From a contest perspective, we observe that contests in easier
divisions often result in higher performance ratings. However, we note that this trend may not always
hold true if the model’s capacity is larger. The best results are typically achieved when models are tested
in contests that match their skill level (refer to rating requirements in Table 2 for approximate contest
ratings based on Elo ratings). Consequently, most models perform optimally in Div. 4. However, o1-mini
demonstrates the best performance in Div. 3, consistent with its higher rating and the rating requirements
for that division. From a model perspective, we find that superior models consistently outperform
inferior ones across different contest divisions.
Pass Rate Across Problem Difficulty Levels We flatten the problems in contests and divide them by
problem ratings: Easy ([800, 1000)), Medium ([1000, 1300)), and Hard ([1300, 3500]). We record the model
pass rate at these difficulty levels and find that even the easy category is challenging for most models
and demonstrates the best differentiation. The medium level effectively distinguishes several advanced
models. Although the minimum problem rating in the hard level is only 1300, all models, except for
o1-mini and QwQ-32B-Preview, are struggling to achieve a pass. These results also indicate significant
room for improvement in existing models and highlight the extensibility of our benchmark.
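The bucketing by problem rating can be sketched as below. The interval boundaries come from the text; how a problem counts as "passed" for the reported pass rate is not spelled out here, so the sketch simply takes one boolean per problem as an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def difficulty_level(problem_rating: int) -> str:
    """Easy: [800, 1000), Medium: [1000, 1300), Hard: [1300, 3500]."""
    if 800 <= problem_rating < 1000:
        return "Easy"
    if 1000 <= problem_rating < 1300:
        return "Medium"
    if 1300 <= problem_rating <= 3500:
        return "Hard"
    raise ValueError(f"rating {problem_rating} is outside [800, 3500]")


def pass_rate_by_level(results: Iterable[Tuple[int, bool]]) -> Dict[str, float]:
    """Percentage of passed problems per level from (rating, passed) pairs.

    What counts as 'passed' is left to the caller (an assumption of this sketch).
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for rating, ok in results:
        level = difficulty_level(rating)
        total[level] += 1
        passed[level] += int(ok)
    return {level: 100.0 * passed[level] / total[level] for level in total}
```

For example, `pass_rate_by_level([(800, True), (900, False), (1300, False)])` reports 50% for Easy and 0% for Hard, with no Medium entry because no Medium problem was supplied.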
Pass@n We also display the pass@n metrics for different values of n (n = 1, 2, 4, 8) for each model. Our
findings indicate that most models increase their pass rate consistently as the number of samplings rises.
Some models may have poor pass@1 results but improve significantly by pass@8, showcasing the model’s
ability to produce diverse effective explorations. Moreover, we find that pass@n often correlates with Elo
ratings, but they may not always align due to their differing calculation methods.
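One simple reading of pass@n, sketched below, is the share of problems with at least one accepted submission among the first n attempts. Whether the reported numbers use this empirical form or an unbiased estimator is not stated here, so treat the definition and the data layout as assumptions.

```python
from typing import Dict, Sequence


def pass_at_n(attempts: Dict[str, Sequence[bool]], n: int) -> float:
    """Percentage of problems solved within the first n attempts.

    `attempts` maps a problem id to its per-attempt pass/fail flags in
    submission order (a hypothetical layout chosen for this sketch).
    """
    if not attempts:
        return 0.0
    solved = sum(any(flags[:n]) for flags in attempts.values())
    return 100.0 * solved / len(attempts)


if __name__ == "__main__":
    log = {
        "A": [True] + [False] * 7,          # solved on the first try
        "B": [False, False, True] + [False] * 5,
        "C": [False] * 8,                   # never solved
        "D": [False] * 7 + [True],          # solved on the last try
    }
    for n in (1, 2, 4, 8):
        print(n, pass_at_n(log, n))
```

By construction this quantity never decreases as n grows, which is consistent with the pattern across the pass@1 to pass@8 columns.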
Columns: Model, Size, Elo Rating [Overall (percentile), Div. 1+2, Div. 2, Div. 3, Div. 4], Pass Rate for [Easy, Medium, Hard], Pass@ [1, 2, 4, 8]
Proprietary LLMs
ChatGPT-4o-latest-2024-11-20 🔒 668 (22.2) 586 507 1111 1149 36.54 14.0 0.83 9.3 10.8 14.57 16.83
Claude-3-5-Sonnet-2024-10-22 🔒 710 (24.1) 430 616 1092 1124 46.47 11.0 0.97 11.81 13.82 15.58 16.08
o1-mini 🔒 1578 (89.2) 1197 1541 1906 1792 74.52 42.75 11.71 26.88 33.92 39.7 39.95
1B+ Open-source LLMs
DS-Coder-1.3B-Instruct 1.3B 21 (0.0) 0 0 0 378 3.37 0.0 0.0 0.75 0.75 0.75 0.75
Qwen2.5-Coder-1.5B-Instruct 1.5B 93 (0.0) 0 48 179 514 6.73 0.0 0.0 1.26 1.76 2.51 2.51
Qwen2.5-Coder-3B-Instruct 3B 160 (0.0) 0 74 398 676 10.9 0.5 0.0 2.26 2.51 4.02 4.77
6B+ Open-source LLMs
Mistral-7B-Instruct-v0.2 7B 49 (0.0) 0 0 146 378 6.25 0.0 0.0 1.26 1.26 1.26 1.26
DS-V2-Lite-Chat 2.4/16B 60 (0.0) 0 16 151 378 4.01 0.0 0.0 1.01 1.26 1.76 1.76
OpenCoder-8B-Instruct 8B 152 (0.0) 0 70 372 667 8.17 0.5 0.0 1.01 2.51 3.77 4.52
DS-Coder-6.7B-Instruct 6.7B 155 (0.0) 0 97 319 606 10.1 0.25 0.0 1.76 2.26 3.27 4.52
Ministral-8B-Instruct-2410 8B 219 (0.0) 0 118 548 745 13.94 0.5 0.05 2.51 3.52 4.77 6.28
Llama-3.1-8B-Instruct 8B 223 (0.0) 0 207 325 585 12.18 0.25 0.0 2.26 2.76 4.52 6.53
DS-V2-Lite-Instruct 2.4/16B 254 (0.0) 187 155 446 851 16.51 3.5 0.05 3.02 4.27 5.78 6.78
Yi-Coder-9B-Chat 9B 296 (0.0) 108 228 560 606 14.26 1.75 0.09 2.76 4.02 6.28 7.29
Qwen2.5-7B-Instruct 7B 315 (0.0) 123 242 581 676 17.63 1.5 0.09 4.27 5.53 6.78 7.79
Qwen2.5-Coder-7B-Instruct 7B 397 (11.6) 143 334 647 842 19.55 3.0 0.14 4.52 6.03 8.29 10.05
13B+ Open-source Models
Mixtral-8x7B-Instruct-v0.1 8/56B 98 (0.0) 0 30 226 644 5.29 0.25 0.05 1.26 1.51 2.26 3.52
Starcoder2-15B-Instruct-v0.1 15B 129 (0.0) 0 48 333 641 5.93 0.0 0.0 1.76 2.76 3.27 3.52
Codestral-22B-v0.1 22B 385 (10.2) 57 345 586 926 20.03 2.25 0.14 3.52 4.77 6.78 10.3
Qwen2.5-14B-Instruct 14B 414 (12.9) 497 277 752 606 23.4 1.5 0.32 5.03 6.03 7.79 11.31
Qwen2.5-Coder-14B-Instruct 14B 424 (13.5) 123 292 876 1067 25.64 5.75 0.32 6.78 8.04 9.55 12.06
30B+ Open-source Models
CodeLlama-70B-Instruct 70B 200 (0.0) 0 111 539 507 8.97 0.75 0.05 1.76 2.76 5.03 5.78
DS-Coder-33B-Instruct 33B 207 (0.0) 0 113 498 746 13.46 1.5 0.0 3.02 4.27 4.52 6.28
Mixtral-8x22B-Instruct-v0.1 22/176B 295 (0.0) 75 241 510 680 14.42 0.5 0.05 3.27 4.27 5.78 7.04
DS-V2-Chat 21/236B 318 (0.0) 59 253 588 737 16.83 2.25 0.0 3.77 4.77 6.53 9.05
Llama-3.1-70B-Instruct 70B 478 (15.0) 255 361 886 933 25.32 3.0 0.46 5.03 7.29 10.55 12.56
Qwen2.5-32B-Instruct 32B 513 (15.5) 350 366 923 1146 28.85 6.5 0.46 5.53 8.29 10.8 13.07
DS-Coder-V2-Instruct 21/236B 532 (15.7) 420 400 853 1179 29.33 7.5 0.37 6.53 8.79 11.81 14.32
Qwen2.5-Coder-32B-Instruct 32B 575 (16.8) 206 416 1166 1222 29.49 7.75 0.46 6.03 8.54 12.06 16.58
DS-V2.5 21/236B 629 (20.4) 594 483 958 1221 33.65 10.0 0.65 8.79 11.56 13.32 15.58
Mistral-Large-Instruct-2411 123B 631 (20.5) 632 449 1049 1226 35.58 9.5 0.65 8.29 11.81 13.07 16.33
Qwen2.5-72B-Instruct 72B 634 (20.7) 439 498 1033 1255 35.26 12.0 0.97 9.3 11.06 13.32 16.58
QwQ-32B-Preview 32B 1261 (63.6) 1071 1169 1566 1700 57.21 21.75 4.54 18.59 23.12 29.4 32.91
Table 3: Main results of different LLMs on CodeElo. The number in parentheses after the overall Elo
rating shows the percentile rank among human participants. The underlined numbers represent the best
scores within the same model size range.
5 Analysis Experiments
5.1 Performance Across Algorithms
Each problem is associated with tags that suggest potential algorithms or methods for solving it. We
have identified 16 tags that appear frequently, each corresponding to at least 30 problems tested. Note
that a single problem may be associated with multiple tags, as it may require multiple steps or fall
under different algorithmic categories. The performance of different models on these tagged problems is
summarized in Table 4.
We observe significant variation in model performance across different algorithms. Models demonstrate
strong performance in areas such as math (Ma.), implementation (Im.), and sorting (So.), achieving the
highest pass rates. However, they struggle with dp (DP), dfs and similar (DFS), and trees (Tr.), with many
models failing to solve even a single problem under these algorithms.
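Because a problem can carry several tags, a per-tag pass rate counts each problem once under every tag it has. The sketch below shows that bookkeeping; the input layout and the one-boolean-per-problem notion of "passed" are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def pass_rate_by_tag(problems: Iterable[Tuple[Sequence[str], bool]]) -> Dict[str, float]:
    """Percentage of passed problems per tag from (tags, passed) pairs.

    A problem with k tags contributes to k different tag totals.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for tags, ok in problems:
        for tag in set(tags):  # guard against a tag listed twice on one problem
            total[tag] += 1
            passed[tag] += int(ok)
    return {tag: 100.0 * passed[tag] / total[tag] for tag in total}


if __name__ == "__main__":
    sample = [
        (["math", "implementation"], True),
        (["math", "dp"], False),
        (["dp", "trees"], False),
    ]
    print(pass_rate_by_tag(sample))
```

One consequence of this counting is that the per-tag rates are not independent: a single hard problem tagged with both dp and trees lowers both columns at once.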
5.2 Comparison between C++ and Python
An interesting observation is that while Python is the most commonly used programming language
for most models, the best performance on CodeElo is achieved when models use C++ as the coding
language. This performance surpasses even the scenarios where models select the language on their own.
We conducted experiments where models received prompts without a specified programming language,
allowing them to choose freely. We found that under CodeElo, nearly all models defaulted to
using Python over 95% of the time, with only occasional use of C++, Java, and others. In contrast,
human participants in code competitions predominantly use C++, with rates close to 80%⁵. We further
⁵ We randomly selected 250 human submissions and found 201 used C++, and this ratio will be even higher among
proficient competitors.
Model Size Gr. Ma. Im. BF. DP DS. CA. BS. So. Gr. DFS NT. Tr. Co. TP. Bi.
Proprietary LLMs
ChatGPT-4o-latest-2024-11-20 🔒 5.60 9.07 12.80 9.53 2.17 1.59 6.39 4.17 14.58 1.82 0.00 4.83 0.00 4.28 6.07 2.57
Claude-3-5-Sonnet-2024-10-22 🔒 9.40 12.02 15.97 10.35 0.00 2.50 7.07 5.25 17.50 0.78 0.80 5.11 0.00 3.62 7.50 2.94
o1-mini 🔒 25.83 31.11 31.94 23.98 10.65 14.77 22.15 22.38 34.58 13.54 11.44 22.73 4.55 25.00 19.29 20.59
1B+ Open-source LLMs
DS-Coder-1.3B-Instruct 1.3B 0.00 0.55 2.08 0.00 0.00 0.00 0.00 0.00 1.46 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Qwen2.5-Coder-1.5B-Instruct 1.5B 0.06 1.10 3.27 0.51 0.00 0.00 0.00 0.62 2.08 0.00 0.00 0.57 0.00 0.00 0.00 0.00
Qwen2.5-Coder-3B-Instruct 3B 0.54 2.06 3.97 1.74 0.00 0.23 0.68 0.77 3.96 0.00 0.00 1.42 0.00 0.00 1.07 0.00
6B+ Open-source LLMs
Mistral-7B-Instruct-v0.2 7B 0.00 1.03 3.17 0.72 0.00 0.00 0.00 0.00 3.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00
DS-V2-Lite-Chat 2.4/16B 0.06 0.69 2.28 0.20 0.00 0.00 0.14 0.00 1.67 0.00 0.00 0.00 0.00 0.00 0.00 0.00
OpenCoder-8B-Instruct 8B 0.54 1.24 4.07 0.51 0.00 0.11 0.95 0.15 2.29 0.00 0.00 0.28 0.00 0.33 0.36 0.00
DS-Coder-6.7B-Instruct 6.7B 0.60 1.79 4.17 1.23 0.00 0.00 0.54 0.62 2.92 0.00 0.00 1.42 0.00 0.33 0.36 0.00
Ministral-8B-Instruct-2410 8B 1.01 2.40 5.36 1.84 0.00 0.34 0.27 1.08 3.96 0.00 0.00 1.99 0.00 0.00 0.71 0.00
Llama-3.1-8B-Instruct 8B 0.65 2.61 4.76 1.13 0.00 0.00 1.49 0.62 3.54 0.00 0.00 1.14 0.00 0.33 0.36 0.00
DS-V2-Lite-Instruct 2.4/16B 1.73 3.78 6.85 3.48 0.11 0.68 1.77 1.08 5.21 0.00 0.00 2.27 0.00 1.97 1.43 0.00
Yi-Coder-9B-Chat 9B 1.67 2.82 5.85 2.15 0.43 0.23 1.90 1.23 5.62 0.00 0.00 1.42 0.00 0.33 0.00 0.37
Qwen2.5-7B-Instruct 7B 1.49 3.78 5.36 2.97 0.00 0.80 1.49 1.54 5.00 0.26 0.27 2.27 0.00 0.00 2.14 0.00
Qwen2.5-Coder-7B-Instruct 7B 2.14 3.98 6.55 3.38 0.11 1.02 2.04 1.85 7.29 0.00 0.00 2.27 0.00 0.33 1.07 0.37
13B+ Open-source Models
Mixtral-8x7B-Instruct-v0.1 8/56B 0.06 1.17 2.18 0.51 0.00 0.00 0.27 0.93 1.25 0.00 0.00 0.57 0.00 0.00 0.00 0.00
Starcoder2-15B-Instruct-v0.1 15B 0.36 0.96 2.78 0.61 0.00 0.00 0.68 0.00 2.29 0.00 0.00 0.57 0.00 0.00 0.00 0.00
Codestral-22B-v0.1 22B 2.32 3.71 9.03 2.77 0.00 0.23 2.45 0.77 6.46 0.00 0.27 1.70 0.28 0.33 0.36 0.37
Qwen2.5-14B-Instruct 14B 3.21 5.43 7.94 3.38 0.65 1.14 2.31 2.01 7.29 0.26 0.27 2.56 0.00 0.66 1.43 0.37
Qwen2.5-Coder-14B-Instruct 14B 3.33 5.63 9.13 5.74 1.20 1.36 2.45 2.47 9.58 1.04 0.27 2.84 0.00 0.99 2.86 0.00
30B+ Open-source Models
CodeLlama-70B-Instruct 70B 0.48 1.65 3.87 0.92 0.00 0.34 0.41 0.62 2.92 0.00 0.00 0.85 0.00 0.66 0.71 0.00
DS-Coder-33B-Instruct 33B 1.37 2.40 5.36 1.33 0.33 0.11 0.82 0.77 3.54 0.00 0.00 1.99 0.00 0.00 0.71 0.37
Mixtral-8x22B-Instruct-v0.1 22/176B 1.55 3.09 5.56 1.95 0.00 0.11 1.90 1.08 7.29 0.00 0.00 0.85 0.00 0.00 0.71 0.37
DS-V2-Chat 21/236B 1.61 3.57 6.35 2.77 0.11 0.68 1.90 1.23 4.58 0.00 0.00 2.27 0.00 0.00 0.00 0.00
Llama-3.1-70B-Instruct 70B 2.98 5.98 10.02 4.00 0.33 0.80 2.72 2.78 9.17 0.00 0.00 2.27 0.00 0.66 2.86 1.10
Qwen2.5-32B-Instruct 32B 3.75 6.59 9.72 4.51 0.87 1.59 3.67 3.86 10.83 0.78 0.00 2.84 0.00 0.66 1.79 1.47
DS-Coder-V2-Instruct 21/236B 3.81 6.94 11.21 4.82 1.09 1.14 3.94 2.93 8.75 2.08 0.00 2.84 0.00 2.63 5.00 1.10
Qwen2.5-Coder-32B-Instruct 32B 4.05 7.01 9.62 6.35 1.52 1.59 4.76 4.01 11.04 1.30 0.27 3.41 0.00 1.32 5.00 0.74
DS-V2.5 21/236B 5.18 8.24 13.10 6.05 1.30 1.70 4.89 2.62 12.71 2.08 0.00 2.56 0.00 3.62 3.57 1.84
Mistral-Large-Instruct-2411 123B 6.01 8.17 11.61 6.05 1.63 2.16 4.48 4.78 13.96 1.04 0.00 2.84 0.00 0.99 7.50 2.94
Qwen2.5-72B-Instruct 72B 6.79 9.00 12.40 7.68 1.41 1.48 7.34 3.70 16.88 2.60 1.33 3.12 0.00 1.32 5.71 1.84
QwQ-32B-Preview 32B 15.00 21.70 19.64 15.37 3.37 8.18 15.35 8.80 23.96 4.17 3.19 9.66 0.57 14.14 6.43 8.09
Table 4: Pass rate (pass@1) on major problem categories that have at least 30 problems tested. The
abbreviations "Gr.", "Ma.", "Im.", "BF.", "DP", "DS.", "CA.", "BS.", "So.", "Gr.", "DFS", "NT.", "Tr.", "Co.",
"TP.", and "Bi." stand for greedy, math, implementation, brute force, dp, data structures, constructive
algorithms, binary search, sortings, graphs, dfs and similar, number theory, trees, combinatorics, two
pointers, and bitmasks, respectively.
investigated the performance difference of models when constrained to use either C++ or Python.
We instructed several popular LLMs to use a specific language by naming it in the prompt, and the results
are displayed in Figure 2. All tested models achieved higher ratings when using C++. This
aligns with human experience: competition-level code problems often impose limits on algorithm
running time, and C++ is more efficient at meeting them. These findings expose a shortcoming in
current model training and suggest how to improve models' language selection:
models should be trained to use C++ when runtime efficiency is critical to
performance. They also indicate that models should be tested in C++ to elicit
their best performance. This insight contrasts with existing competition-level code benchmarks such as APPS and
LiveCodeBench, which predominantly assess performance using Python.
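The two prompting conditions described above (free language choice versus a fixed language) can be sketched as follows. This is a minimal illustration only: the function name `build_prompt` and the instruction wording are hypothetical and are not the prompts used in the experiments.

```python
from typing import Optional


def build_prompt(problem_statement: str, language: Optional[str] = None) -> str:
    """Build a prompt; when `language` is given, constrain the answer to it.

    The instruction wording is an illustrative assumption, not the paper's prompt.
    """
    if language is None:
        # Free-choice condition: the model picks the language itself.
        instruction = "Write a program that solves the following problem."
    else:
        # Constrained condition: the language is named explicitly.
        instruction = f"Write a {language} program that solves the following problem."
    return f"{instruction}\n\n{problem_statement}"


statement = "Given an integer n, print n squared."
free_choice_prompt = build_prompt(statement)
cpp_prompt = build_prompt(statement, language="C++")
python_prompt = build_prompt(statement, language="Python")
```

Keeping everything except the language clause identical makes the language the only variable that differs between the compared runs.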
[Figure 2: bar chart; y-axis: Elo rating (0 to 1200); bars for C++ and Python for QwQ-32B-Preview, Mistral-Large-Instruct-2411, Qwen2.5-72B-Instruct, DS-Coder-V2-Instruct, and Llama-3.1-8B-Instruct.]
Figure 2: The Elo ratings using C++ and Python as programming languages.
5.3 Rating Variance
The variance in ratings is a crucial aspect of the benchmark. In Figure 3, we present violin plots of the
ratings of several models across all tested contests. We observe that most models exhibit a per-contest
standard deviation between 300 and 500. This fluctuation can be attributed to the models' limited ability:
because they pass very few problems, solving just one additional problem significantly boosts their rating
in a contest. Increasing the number of tested contests reduces the standard deviation of the average. In our
experiments, testing across 54 contests reduces the standard deviation of the overall average rating to
around 50, which we consider acceptable, and the violin plots effectively show how the models compare.
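As a quick sanity check of these numbers, the standard deviation of an average of k independent contest ratings is the per-contest standard deviation divided by the square root of k. The sketch below assumes independent contests (the same assumption made in Appendix C); the function name `std_of_mean` is ours.

```python
import math


def std_of_mean(per_contest_std: float, num_contests: int) -> float:
    """Standard deviation of the average of `num_contests` independent ratings."""
    return per_contest_std / math.sqrt(num_contests)


# Per-contest standard deviations in the reported 300 to 500 range, over 54 contests.
for sigma in (300, 400, 500):
    print(sigma, round(std_of_mean(sigma, 54), 1))
```

For 54 contests this gives roughly 41 to 68, which is consistent with the reported value of around 50.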
[Figure 3: violin plots; y-axis: Elo rating; models shown: o1-mini, QwQ-32B-Preview, Qwen2.5-32B-Instruct, Qwen2.5-14B-Instruct, and Qwen2.5-7B-Instruct.]
Figure 3: Violin plots of Elo ratings across tested contests.
6 Discussion
6.1 Contributions
Because our work differs in nature from other competition code benchmarks, it is important to
clearly outline the contributions we make to the NLP community.
1. Similar to other benchmarks, we provide high-quality test problems. However, unlike others that also
include CodeForces problems, our dataset offers a full set of problems that can be updated online and
includes detailed information such as contest divisions, problem difficulty ratings, and problem tags.
This allows for more comprehensive evaluations than existing benchmarks.
2. Our approach uses a unique evaluation method in which problems are submitted
directly to the platform. Our earlier analysis shows that this method is crucial for competition
code problems, as many of them require special judges and have inaccessible hidden test cases.
Unlike general code generation tasks, program efficiency is a critical metric in judging competition
code problems. Evaluating them independently can lead to a misaligned execution environment and
inaccurate results. Our method provides an easy way to address these issues and yields
more accurate evaluations that reflect the models' true capabilities.
3. We are the first to provide human-comparable Elo ratings for existing models. This allows us to assess
the progress of current AI models in relation to human performance. Our findings suggest that o1-like
reasoning models show promise in improving coding capabilities, given their significant advantage
over other advanced models on our benchmark. We also provide detailed results across different
contest divisions, problem difficulties, and algorithm tags, which can help developers
target and improve specific abilities.
4. We offer insights that previous work has overlooked or failed to achieve. For example, we find that
while Python is the most familiar language for existing LLMs, they often perform better when answering
in C++ under CODE ELO. This highlights a limitation of previous competition code benchmarks
that evaluate solely in Python, as they may not fully elicit the models' best performance. It also
suggests that model developers should consider training their models to output C++ code when
efficiency is a critical factor.
6.2 Limitations
We believe our benchmark has the following two limitations:
• One limitation is that we allow only eight submissions per problem. In practice, users can make
additional submissions as long as the penalty scores remain lower than the passing scores. This
constraint might make the tested Elo ratings slightly lower than the actual ratings, as models
might solve more problems with more attempts. However, the value was chosen carefully
to stay close to actual performance while avoiding excessive submissions that could
burden the platform and affect other users' access, since the marginal gain from larger submission
budgets, such as 32 tries or more, diminishes quickly. We therefore adopt this setting and encourage
others to use the same setting for a fair comparison.
• Another limitation is that we rely on interaction with the CodeForces platform to conduct the judging
process, whereas previous benchmarks typically required only offline testing. This reliance is necessary
because accurate feedback depends on hidden test cases and special judges that we cannot access
directly. In other words, if we had all the hidden test cases and special judges, we could
conduct the evaluation independently. However, these resources are kept hidden by the platform and are
extremely difficult to access, so we cannot provide them either.
7 Conclusion
In this work, we propose the CODE ELO benchmark, a collection of CodeForces problems with detailed
problem information and a unique judging method that submits solutions to the CodeForces
platform and receives the judgment status, achieving zero false positives, special judge support, and an
aligned execution environment for the first time. Based on the judgments from the platform, we developed
an Elo rating calculation system that aligns with the platform but has lower variance. Testing 30
open-source and 3 proprietary LLMs, we find that o1-mini and QwQ-32B-Preview stand out significantly
among all models, while most models struggle even with the easiest problems and fall in the lowest 20
percent of all human participants. Further analysis shows that LLM performance differs
across algorithm tags and that the best performance is achieved in C++ rather than Python,
in contrast to the Python-centric evaluation of prior benchmarks. We have made the CODE ELO benchmark
publicly available and hope it can pave the way for the NLP community to test LLMs' sophisticated
reasoning abilities on code and provide insights for future studies.
8 Ethical Statement
Our benchmark relies on the CodeForces platform to conduct judgments. We strictly adhere to the
Codeforces Terms and Conditions⁶ throughout all experiments and emphasize that others should do
the same. This benchmark is for academic purposes only and should be used sparingly to
avoid impacting user access to the platform. It is designed for virtual participation only and cannot be
used for in-contest testing, in accordance with the CodeForces Rule Restricting the Use of AI⁷. Due to
ethical considerations, we will conduct a comprehensive risk assessment and seek permission from the
CodeForces platform before open-sourcing the entire submission and evaluation scaffold, which is not
included in this version of the paper. Until then, we recommend that others independently reproduce
our proposed method to conduct evaluations. We would also like to express our sincere gratitude
to Mike Mirzayanov for creating the remarkable CodeForces platform.
⁶ https://codeforces.com/terms
⁷ https://codeforces.com/blog/entry/133941
References
01.AI. Meet yi-coder: A small but mighty llm for code, September 2024. URL
https://01-ai.github.io/blog.html?post=en/2024-09-05-A-Small-but-Mighty-LLM-for-Code.md .
Anonymous. AoPS dataset: Leveraging online olympiad-level math problems for LLMs training
and contamination-resistant evaluation. Submitted to The Thirteenth International Conference
on Learning Representations, 2024. URL https://openreview.net/forum?id=Bgz3okeZ7H . Under
review.
Anthropic. Anthropic: Introducing claude 3.5 sonnet, 2024. URL
https://www.anthropic.com/news/claude-3-5-sonnet .
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen
Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv
preprint arXiv:2108.07732, 2021.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan,
Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models
trained on code. arXiv preprint arXiv:2107.03374, 2021.
DeepSeek. Deepseek-r1-lite-preview is now live: unleashing supercharged reasoning power!, 2024. URL
https://api-docs.deepseek.com/news/news1120 .
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha
Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv
preprint arXiv:2407.21783, 2024.
Arpad E Elo and Sam Sloan. The rating of chessplayers: Past and present. 1978.
Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang
Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large
language models. arXiv preprint arXiv:2410.07985, 2024.
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi,
Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of
code intelligence. arXiv preprint arXiv:2401.14196, 2024.
Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery,
Sharan Narang, Noah Fiedel, and Aleksandra Faust. Understanding html with large language models.
arXiv preprint arXiv:2210.03945, 2022.
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns,
Samir Puranik, Horace He, Dawn Song, et al. Measuring coding challenge competence with apps.
arXiv preprint arXiv:2105.09938, 2021.
Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J Yang, JH Liu,
Chenchen Zhang, Linzheng Chai, et al. Opencoder: The open cookbook for top-tier code large language
models. arXiv preprint arXiv:2411.04905, 2024.
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen
Yu, Keming Lu, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024.
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow,
Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276,
2024.
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-
Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of
large language models for code. arXiv preprint arXiv:2403.07974, 2024.
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego
de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b.
arXiv preprint arXiv:2310.06825, 2023.
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford,
Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of
experts. arXiv preprint arXiv:2401.04088, 2024.
Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez,
and Shafiq Joty. xcodeeval: A large scale multilingual multitask benchmark for code understanding,
generation, translation and retrieval. arXiv preprint arXiv:2303.03004, 2023.
Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li.
Taco: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852, 2023.
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles,
James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with
alphacode. Science, 378(6624):1092–1097, 2022.
Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan,
Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts
language model. arXiv preprint arXiv:2405.04434, 2024.
MAA. American invitational mathematics examination - aime. In American Invitational
Mathematics Examination - AIME 2024, February 2024. URL
https://maa.org/math-competitions/american-invitational-mathematics-examination-aime .
OpenAI. Openai o1-mini, September 2024a. URL
https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/ .
OpenAI. Introducing openai o1-preview, September 2024b. URL
https://openai.com/index/introducing-openai-o1-preview/ .
OpenAI. Day 12 of shipmas: New frontier models o3 and o3-mini announcement, December 2024c. URL
https://community.openai.com/t/day-12-of-shipmas-new-frontier-models-o3-and-o3-mini-announcement/1061818 .
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi,
Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code. arXiv
preprint arXiv:2308.12950, 2023.
Quan Shi, Michael Tang, Karthik Narasimhan, and Shunyu Yao. Can language models solve olympiad
programming? arXiv preprint arXiv:2404.10952, 2024.
Mistral AI Team. Codestral: Hello, world!, May 2024a. URL https://mistral.ai/news/codestral/ .
Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown, November 2024b. URL
https://qwenlm.github.io/blog/qwq-32b-preview/ .
Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries,
Leandro Von Werra, Arjun Guha, and Lingming Zhang. Selfcodealign: Self-alignment for code
generation. arXiv preprint arXiv:2410.24198, 2024.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li,
Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo
Gao, Shirong Ma, et al. Deepseek-coder-v2: Breaking the barrier of closed-source models in code
intelligence. arXiv preprint arXiv:2406.11931, 2024.
Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani
Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Benchmarking code generation
with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877, 2024.
A Model Cards
We list and cite all the tested models in Table 5.
Short Name Citation HuggingFace Endpoint
Claude-3-5-Sonnet-2024-10-22 Anthropic (2024) -
ChatGPT-4o-latest-2024-11-20 Hurst et al. (2024) -
o1-mini OpenAI (2024a) -
Qwen2.5-Coder-1.5B-Instruct Hui et al. (2024) Qwen/Qwen2.5-Coder-1.5B-Instruct
Qwen2.5-Coder-3B-Instruct Hui et al. (2024) Qwen/Qwen2.5-Coder-3B-Instruct
Qwen2.5-Coder-7B-Instruct Hui et al. (2024) Qwen/Qwen2.5-Coder-7B-Instruct
Qwen2.5-Coder-14B-Instruct Hui et al. (2024) Qwen/Qwen2.5-Coder-14B-Instruct
Qwen2.5-Coder-32B-Instruct Hui et al. (2024) Qwen/Qwen2.5-Coder-32B-Instruct
Qwen2.5-7B-Instruct Yang et al. (2024) Qwen/Qwen2.5-7B-Instruct
Qwen2.5-14B-Instruct Yang et al. (2024) Qwen/Qwen2.5-14B-Instruct
Qwen2.5-32B-Instruct Yang et al. (2024) Qwen/Qwen2.5-32B-Instruct
Qwen2.5-72B-Instruct Yang et al. (2024) Qwen/Qwen2.5-72B-Instruct
QwQ-32B-Preview Team (2024b) Qwen/QwQ-32B-Preview
DS-Coder-1.3B-Instruct Guo et al. (2024) deepseek-ai/deepseek-coder-1.3b-instruct
DS-Coder-6.7B-Instruct Guo et al. (2024) deepseek-ai/deepseek-coder-6.7b-instruct
DS-Coder-33B-Instruct Guo et al. (2024) deepseek-ai/deepseek-coder-33b-instruct
DS-Coder-V2-Lite-Instruct Zhu et al. (2024) deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
DS-Coder-V2-Instruct Zhu et al. (2024) deepseek-ai/DeepSeek-Coder-V2-Instruct
DS-V2-Lite-Chat Liu et al. (2024) deepseek-ai/DeepSeek-V2-Lite-Chat
DS-V2-Chat Liu et al. (2024) deepseek-ai/DeepSeek-V2-Chat
DS-V2.5 Liu et al. (2024) deepseek-ai/DeepSeek-V2.5
CodeLlama-70B-Instruct Roziere et al. (2023) meta-llama/CodeLlama-70b-Instruct-hf
Llama-3.1-8B-Instruct Dubey et al. (2024) meta-llama/Llama-3.1-8B-Instruct
Llama-3.1-70B-Instruct Dubey et al. (2024) meta-llama/Llama-3.1-70B-Instruct
Codestral-22B-v0.1 Team (2024a) mistralai/Codestral-22B-v0.1
Mistral-7B-Instruct-v0.2 Jiang et al. (2023) mistralai/Mistral-7B-Instruct-v0.2
Ministral-8B-Instruct-2410 Jiang et al. (2023) mistralai/Ministral-8B-Instruct-2410
Mistral-Large-Instruct-2411 Jiang et al. (2023) mistralai/Mistral-Large-Instruct-2411
Mixtral-8x7B-Instruct-v0.1 Jiang et al. (2024) mistralai/Mixtral-8x7B-Instruct-v0.1
Mixtral-8x22B-Instruct-v0.1 Jiang et al. (2024) mistralai/Mixtral-8x22B-Instruct-v0.1
OpenCoder-8B-Instruct Huang et al. (2024) infly/OpenCoder-8B-Instruct
Yi-Coder-9B-Chat 01.AI (2024) 01-ai/Yi-Coder-9B-Chat
Starcoder2-15B-Instruct-v0.1 Wei et al. (2024) bigcode/starcoder2-15b-instruct-v0.1
Table 5: Model cards.
B Decoding Hyperparameters
All proprietary models use API calls with default parameters. For open-source models, the inference
settings are temperature=0.7, top_p=0.8, top_k=20, and repetition_penalty=1.1. The maximum number of
output tokens is set to 4,096 for all models, except for QwQ-32B-Preview, which is set to 32,768 tokens.
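For reproduction purposes, the open-source decoding settings above can be captured in a small configuration object. This is a sketch only: the class `DecodingConfig`, the field name `max_new_tokens`, and the helper `config_for` are hypothetical names of ours, and the mapping onto a particular inference library's parameters is left to the reader.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecodingConfig:
    """Decoding settings reported for the open-source models."""
    temperature: float = 0.7
    top_p: float = 0.8
    top_k: int = 20
    repetition_penalty: float = 1.1
    max_new_tokens: int = 4096  # hypothetical field name for the output-token limit


# Models with a larger output budget than the default of 4,096 tokens.
LONG_OUTPUT_MODELS = {"QwQ-32B-Preview": 32768}


def config_for(model_name: str) -> DecodingConfig:
    """Return the decoding settings for an open-source model."""
    return DecodingConfig(max_new_tokens=LONG_OUTPUT_MODELS.get(model_name, 4096))


print(asdict(config_for("QwQ-32B-Preview")))
```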
C Analysis of Our Elo Rating Calculation System
In Section 3.3.2, we present our method for calculating Elo ratings for each contest. Although there are
slight differences between our system and the original CodeForces Elo rating calculation, we provide a
proof to demonstrate that our ratings are equivalent to the original. For simplicity, we do not consider
the differences between divisions here. In fact, the following analysis will always hold under the same
divisions, and since all LLMs attend the same set of contests, it will be fair.
For any specific model, we calculate the expected rating after each contest and then average these ratings.
We consider each contest to be independent, and since the ratings are standardized,
we assume that the ratings of any specific model in different contests are independent and identically
distributed (IID). Let the expected rating be $r$ and let the variance be $\mathrm{Var}(r)$. For a total of $k$ contests, we
calculate the average rating. Since the ratings of a specific model in all contests are IID, the average
also has an expected value of $r$ and a variance of $\mathrm{Var}(r)/k$. It is evident that
the calculated rating converges as $k$ approaches infinity.
In the original calculation from the platform, a historical rating list is maintained for each individual
after each contest, denoted as $r_i$, with an initial value of $r_0 = 0$. After calculating the expected rating
$E(r_i)$ in the $i$-th contest based on performance, the rating is updated by moving halfway from the
current rating towards the expected rating. Namely, the new rating is calculated using the formula
$$r_i = \frac{r_{i-1} + E(r_i)}{2}.$$
Note that each contest is independent, and $E(r_i)$ shares the same distribution as $r$. Unrolling this
recursion shows that the expected value of $r_i$ converges to $r$, while its variance $V$ converges to the
fixed point of $V = V/4 + \mathrm{Var}(r)/4$, that is, to $\mathrm{Var}(r)/3$, as the number of contests approaches infinity.
These results indicate that we can achieve the same expected rating as CodeForces while significantly
reducing the variance by increasing the number of contests.
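The comparison between the two schemes can be checked numerically. The sketch below propagates the mean and variance exactly under the IID assumption stated above: for the halfway update, the mean follows $m_i = (m_{i-1} + r)/2$ and the variance follows $V_i = (V_{i-1} + \mathrm{Var}(r))/4$, whose fixed point is $\mathrm{Var}(r)/3$. The function names and the numeric values of `r` and `var` are illustrative assumptions, not figures from the experiments.

```python
def averaged_stats(r: float, var: float, k: int):
    """Mean and variance of the plain average of k IID contest ratings."""
    return r, var / k


def halfway_update_stats(r: float, var: float, k: int, r0: float = 0.0):
    """Mean and variance of r_k under r_i = (r_{i-1} + x_i) / 2 with IID x_i."""
    mean, variance = r0, 0.0
    for _ in range(k):
        mean = (mean + r) / 2            # E[r_i] = (E[r_{i-1}] + r) / 2
        variance = (variance + var) / 4  # Var[r_i] = (Var[r_{i-1}] + var) / 4
    return mean, variance


# Illustrative values: true rating 1000, per-contest standard deviation 400.
r, var, k = 1000.0, 400.0 ** 2, 54
print(averaged_stats(r, var, k))        # variance shrinks as var / k
print(halfway_update_stats(r, var, k))  # variance levels off near var / 3
```

Both estimators approach the same expected rating, but only the averaged one has a variance that keeps shrinking as more contests are added.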
D Human-comparable Elo Rating
One advantage of our benchmark is that we provide standardized Elo ratings that are comparable with
those of human participants. We present each percentile of ratings among all human participants in
Table 6, based on publicly available user ratings from the CodeForces platform.
Percentile Rating Percentile Rating Percentile Rating Percentile Rating
1 348 26 740 51 1088 76 1390
2 351 27 754 52 1103 77 1398
3 353 28 767 53 1118 78 1405
4 356 29 781 54 1133 79 1411
5 359 30 794 55 1147 80 1418
6 362 31 806 56 1162 81 1427
7 366 32 819 57 1176 82 1437
8 371 33 832 58 1191 83 1448
9 376 34 845 59 1205 84 1462
10 383 35 858 60 1218 85 1478
11 391 36 872 61 1231 86 1497
12 401 37 886 62 1243 87 1518
13 415 38 900 63 1254 88 1543
14 437 39 913 64 1265 89 1571
15 478 40 927 65 1276 90 1603
16 559 41 942 66 1288 91 1624
17 577 42 956 67 1301 92 1648
18 591 43 971 68 1313 93 1678
19 605 44 985 69 1325 94 1712
20 621 45 1000 70 1338 95 1751
21 639 46 1014 71 1352 96 1812
22 662 47 1029 72 1365 97 1916
23 687 48 1044 73 1370 98 2019
24 707 49 1058 74 1375 99 2157
25 724 50 1073 75 1382 100 4009
Table 6: Percentiles of ratings among all human participants, calculated based on publicly available user
ratings from the CodeForces platform, collected in November 2024.
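To place a model's Elo rating among human participants, one can look it up in the percentile table. The sketch below uses only every tenth row of Table 6 to stay short; the function name `decile_at_or_below` is ours, and a full lookup would include all 100 rows.

```python
from bisect import bisect_right

# Every tenth row of Table 6: (percentile, rating at that percentile).
DECILES = [(10, 383), (20, 621), (30, 794), (40, 927), (50, 1073),
           (60, 1218), (70, 1338), (80, 1418), (90, 1603), (100, 4009)]
THRESHOLDS = [rating for _, rating in DECILES]


def decile_at_or_below(rating: float) -> int:
    """Largest tabulated percentile whose rating is <= `rating` (0 if none)."""
    idx = bisect_right(THRESHOLDS, rating)  # number of thresholds <= rating
    return DECILES[idx - 1][0] if idx else 0


print(decile_at_or_below(1100))  # a rating of 1100 clears the 50th-percentile row
```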
E Problem Demonstration
A problem demonstration in CODE ELO is shown in Figure 4.
[Figure 4: screenshot of a CodeForces problem page with regions labeled 1 to 8.]
Figure 4: An example of a problem in CODE ELO. Each problem contains: 1) title, 2) time limit, 3)
memory limit, 4) problem description, 5) input format, 6) output format, 7) test case examples, and 8)
note (optional). This problem can be found at https://codeforces.com/contest/2034/problem/E .
F Special Judge
In some code competition problems, a given input might have multiple valid outputs, all of which are
considered correct (note that the input and outputs here refer to test cases, not the problem statement
and model responses). In such situations, a dedicated program is needed to verify the validity of the
outputs instead of simply comparing them against a reference output; this is known as a special judge.
It resembles a logical unit test, but since the problems are more complex, creating a special judge is also
more challenging. Figure 5 shows an example.
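The difference between exact-match judging and a special judge can be sketched with a toy task. The rule used here (the output must differ from the input in exactly one position, and the new character must already occur in the input) is an assumption chosen for illustration; it is not the actual checker of any CodeForces problem, and both function names are ours.

```python
def exact_match_judge(contestant_output: str, reference_output: str) -> bool:
    """Naive judging: accept only the single stored reference output."""
    return contestant_output.strip() == reference_output.strip()


def special_judge(test_input: str, contestant_output: str) -> bool:
    """Toy special judge: validate the output against the rule itself.

    Assumed rule: same length as the input, exactly one position changed,
    and the replacement character already occurs in the input.
    """
    s, out = test_input.strip(), contestant_output.strip()
    if len(out) != len(s):
        return False
    changed = [i for i, (a, b) in enumerate(zip(s, out)) if a != b]
    return len(changed) == 1 and out[changed[0]] in s


# "abb" and "acc" are both valid under the toy rule, but exact matching
# against a stored reference of "abb" wrongly rejects "acc".
print(special_judge("abc", "acc"), exact_match_judge("acc", "abb"))
```

The special judge needs the test input as well as the contestant's output, which is why it cannot be replaced by a comparison against a stored answer file.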
While most competition problems have a single correct output for any given input and do not need
a special judge, a substantial proportion of problems do require one. We conducted an empirical
study and found that 30 out of 100 randomly selected problems required special judges. Previous
competition-level code benchmarks could not handle these situations and therefore did not accurately
assess the full capabilities of models. Our evaluation method accommodates these
types of problems. Similarly, we also support interactive problems⁸, which were not supported in earlier
benchmarks. Supporting these kinds of problems is crucial for thoroughly evaluating a model's abilities
and obtaining human-comparable Elo ratings.
[Figure 5: problem statement screenshot, annotated "A clear indication that needs a special judge".]
Figure 5: An example of a problem (the examples and note parts are omitted) that needs a special judge
because there can be multiple valid outputs for the same input (input and outputs refer to test cases, not
to the problem and solutions). For example, given the input "abc", acceptable outputs could include "abb", "acc", "aac",
and any other string derived from "abc" except itself. We therefore cannot simply compare the output with a
predetermined correct solution for this problem. CODE ELO addresses this by evaluating
the code submissions directly on the official platform, making it the first benchmark to support this kind of problem.
The complete problem can be found at https://codeforces.com/contest/2047/problem/B .
⁸ An example of interactive problems can be found at https://codeforces.com/contest/2036/problem/G .