Authors: Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zihan Qiu
Paper Content:
Page 1:
2025-01-06
Qwen2.5 Technical Report
Qwen Team
https://huggingface.co/Qwen
https://modelscope.cn/organization/qwen
https://github.com/QwenLM/Qwen2.5
Abstract
In this report, we introduce Qwen2.5, a comprehensive series of large language models
(LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has
been significantly improved during both the pre-training and post-training stages. In
terms of pre-training, we have scaled the high-quality pre-training datasets from the
previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for
common sense, expert knowledge, and reasoning capabilities. In terms of post-training,
we implement intricate supervised finetuning with over 1 million samples, as well as
multistage reinforcement learning, including offline learning DPO and online learning
GRPO. Post-training techniques significantly enhance human preference, and notably
improve long text generation, structural data analysis, and instruction following.
To handle diverse and varied use cases effectively, we present Qwen2.5 LLM series in rich
configurations. The open-weight offerings include base models and instruction-tuned
models in sizes of 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters. Quantized versions
of the instruction-tuned models are also provided. Over 100 models can be accessed
from Hugging Face Hub, ModelScope, and Kaggle. In addition, for hosted solutions, the
proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-
Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio.
Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks eval-
uating language understanding, reasoning, mathematics, coding, human preference
alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms
a number of open and proprietary models and demonstrates competitive performance to
the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times
larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while perform-
ing competitively against GPT-4o-mini and GPT-4o respectively. Additionally, as the
foundation, Qwen2.5 models have been instrumental in training specialized models such
as Qwen2.5-Math (Yang et al., 2024b), Qwen2.5-Coder (Hui et al., 2024), QwQ (Qwen
Team, 2024d), and multimodal models.
3T7T18T
tokensMathMBPPBBHMMLU
Qwen1.5-72B
Qwen2-72B
Qwen2.5-72B
Figure 1: In the iterative development of the Qwen series, data scaling has played a crucial role. Qwen 2.5,
which leverages 18 trillion tokens for pre-training, has demonstrated the most advanced capabilities
within the Qwen series, especially in terms of domain expertise, underscoring the importance of scale
together with mixture in enhancing the model’s capabilities.
1arXiv:2412.15115v2 [cs.CL] 3 Jan 2025
Page 2:
1 Introduction
The sparks of artificial general intelligence (AGI) are increasingly visible through the fast development
of large foundation models, notably large language models (LLMs) (Brown et al., 2020; OpenAI, 2023;
2024a; Gemini Team, 2024; Anthropic, 2023a;b; 2024; Bai et al., 2023; Yang et al., 2024a; Touvron et al.,
2023a;b; Dubey et al., 2024). The continuous advancement in model and data scaling, combined with
the paradigm of large-scale pre-training followed by high-quality supervised fine-tuning (SFT) and
reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022), has enabled large language
models (LLMs) to develop emergent capabilities in language understanding, generation, and reasoning.
Building on this foundation, recent breakthroughs in inference time scaling, particularly demonstrated
by o1 (OpenAI, 2024b), have enhanced LLMs’ capacity for deep thinking through step-by-step reasoning
and reflection. These developments have elevated the potential of language models, suggesting they may
achieve significant breakthroughs in scientific exploration as they continue to demonstrate emergent
capabilities indicative of more general artificial intelligence.
Besides the fast development of model capabilities, the recent two years have witnessed a burst of open
(open-weight) large language models in the LLM community, for example, the Llama series (Touvron
et al., 2023a;b; Dubey et al., 2024), Mistral series (Jiang et al., 2023a; 2024a), and our Qwen series (Bai
et al., 2023; Yang et al., 2024a; Qwen Team, 2024a; Hui et al., 2024; Qwen Team, 2024c; Yang et al.,
2024b). The open-weight models have democratized the access of large language models to common
users and developers, enabling broader research participation, fostering innovation through community
collaboration, and accelerating the development of AI applications across diverse domains.
Recently, we release the details of our latest version of the Qwen series, Qwen2.5. In terms of the open-
weight part, we release pre-trained and instruction-tuned models of 7 sizes, including 0.5B, 1.5B, 3B, 7B,
14B, 32B, and 72B, and we provide not only the original models in bfloat16 precision but also the quantized
models in different precisions. Specifically, the flagship model Qwen2.5-72B-Instruct demonstrates
competitive performance against the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is
around 5 times larger. Additionally, we also release the proprietary models of Mixture-of-Experts (MoE,
Lepikhin et al., 2020; Fedus et al., 2022; Zoph et al., 2022), namely Qwen2.5-Turbo and Qwen2.5-Plus1,
which performs competitively against GPT-4o-mini and GPT-4o respectively.
In this technical report, we introduce Qwen2.5, the result of our continuous endeavor to create better
LLMs. Below, we show the key features of the latest version of Qwen:
•Better in Size : Compared with Qwen2, in addition to 0.5B, 1.5B, 7B, and 72B models, Qwen2.5
brings back the 3B, 14B, and 32B models, which are more cost-effective for resource-limited
scenarios and are under-represented in the current field of open foundation models. Qwen2.5-
Turbo and Qwen2.5-Plus offer a great balance among accuracy, latency, and cost.
•Better in Data : The pre-training and post-training data have been improved significantly. The
pre-training data increased from 7 trillion tokens to 18 trillion tokens, with focus on knowledge,
coding, and mathematics. The pre-training is staged to allow transitions among different mixtures.
The post-training data amounts to 1 million examples, across the stage of supervised finetuning
(SFT, Ouyang et al., 2022), direct preference optimization (DPO, Rafailov et al., 2023), and group
relative policy optimization (GRPO, Shao et al., 2024).
•Better in Use : Several key limitations of Qwen2 in use have been eliminated, including larger
generation length (from 2K tokens to 8K tokens), better support for structured input and output,
(e.g., tables and JSON), and easier tool use. In addition, Qwen2.5-Turbo supports a context length
of up to 1 million tokens.
2 Architecture & Tokenizer
Basically, the Qwen2.5 series include dense models for opensource, namely Qwen2.5-0.5B / 1.5B / 3B
/ 7B / 14B / 32B / 72B, and MoE models for API service, namely Qwen2.5-Turbo and Qwen2.5-Plus.
Below, we provide details about the architecture of models.
For dense models, we maintain the Transformer-based decoder architecture (Vaswani et al., 2017; Radford
et al., 2018) as Qwen2 (Yang et al., 2024a). The architecture incorporates several key components:
Grouped Query Attention (GQA, Ainslie et al., 2023) for efficient KV cache utilization, SwiGLU activation
function (Dauphin et al., 2017) for non-linear activation, Rotary Positional Embeddings (RoPE, Su
1Qwen2.5-Turbo is identified as qwen-turbo-2024-11-01 and Qwen2.5-Plus is identified as qwen-plus-2024-xx-xx
(to be released) in the API.
2
Page 3:
Table 1: Model architecture and license of Qwen2.5 open-weight models.
Models Layers Heads (Q / KV) Tie Embedding Context / Generation Length License
0.5B 24 14 / 2 Yes 32K / 8K Apache 2.0
1.5B 28 12 / 2 Yes 32K / 8K Apache 2.0
3B 36 16 / 2 Yes 32K / 8K Qwen Research
7B 28 28 / 4 No 128K / 8K Apache 2.0
14B 48 40 / 8 No 128K / 8K Apache 2.0
32B 64 40 / 8 No 128K / 8K Apache 2.0
72B 80 64 / 8 No 128K / 8K Qwen
et al., 2024) for encoding position information, QKV bias (Su, 2023) in the attention mechanism and
RMSNorm (Jiang et al., 2023b) with pre-normalization to ensure stable training.
Building upon the dense model architectures, we extend it to MoE model architectures. This is achieved
by replacing standard feed-forward network (FFN) layers with specialized MoE layers, where each layer
comprises multiple FFN experts and a routing mechanism that dispatches tokens to the top-K experts.
Following the approaches demonstrated in Qwen1.5-MoE (Yang et al., 2024a), we implement fine-grained
expert segmentation (Dai et al., 2024) and shared experts routing (Rajbhandari et al., 2022; Dai et al., 2024).
These architectural innovations have yielded substantial improvements in model performance across
downstream tasks.
For tokenization, we utilize Qwen’s tokenizer (Bai et al., 2023), which implements byte-level byte-pair
encoding (BBPE, Brown et al., 2020; Wang et al., 2020; Sennrich et al., 2016) with a vocabulary of 151,643
regular tokens. We have expanded the set of control tokens from 3 to 22 compared to previous Qwen
versions, adding two new tokens for tool functionality and allocating the remainder for other model
capabilities. This expansion establishes a unified vocabulary across all Qwen2.5 models, enhancing
consistency and reducing potential compatibility issues.
3 Pre-training
Our language model pre-training process consists of several key components. First, we carefully curate
high-quality training data through sophisticated filtering and scoring mechanisms, combined with
strategic data mixture. Second, we conduct extensive research on hyperparameter optimization to
effectively train models at various scales. Finally, we incorporate specialized long-context pre-training
to enhance the model’s ability to process and understand extended sequences. Below, we detail our
approaches to data preparation, hyperparameter selection, and long-context training.
3.1 Pre-training Data
Qwen2.5 demonstrates significant enhancements in pre-training data quality compared to its predecessor
Qwen2. These improvements stem from several key aspects:
(1)Better data filtering . High-quality pre-training data is crucial for model performance, mak-
ing data quality assessment and filtering a critical component of our pipeline. We leverage
Qwen2-Instruct models as data quality filters that perform comprehensive, multi-dimensional
analysis to evaluate and score training samples. The filtering method represents a significant
advancement over our previous approach used for Qwen2, as it benefits from Qwen2’s expanded
pre-training on a larger multilingual corpus. The enhanced capabilities enable more nuanced
quality assessment, resulting in both improved retention of high-quality training data and more
effective filtering of low-quality samples across multiple languages.
(2)Better math and code data . During the pre-training phase of Qwen2.5, we incorporate training
data from Qwen2.5-Math (Yang et al., 2024b) and Qwen2.5-Coder (Hui et al., 2024). This data
integration strategy proves highly effective, as these specialized datasets are instrumental in
achieving state-of-the-art performance on mathematical and coding tasks. By leveraging these
high-quality domain-specific datasets during pre-training, Qwen2.5 inherits strong capabilities
in both mathematical reasoning and code generation.
(3)Better synthetic data . To generate high-quality synthetic data, particularly in mathematics, code,
and knowledge domains, we leverage both Qwen2-72B-Instruct (Yang et al., 2024a) and Qwen2-
Math-72B-Instruct (Qwen Team, 2024c). The quality of this synthesized data is further enhanced
through rigorous filtering using our proprietary general reward model and the specialized
Qwen2-Math-RM-72B (Qwen Team, 2024c) model.
3
Page 4:
(4)Better data mixture . To optimize the pre-training data distribution, we employ Qwen2-Instruct
models to classify and balance content across different domains. Our analysis revealed that
domains like e-commerce, social media, and entertainment are significantly overrepresented
in web-scale data, often containing repetitive, template-based, or machine-generated content.
Conversely, domains such as technology, science, and academic research, while containing higher-
quality information, are traditionally underrepresented. Through strategic down-sampling of
overrepresented domains and up-sampling of high-value domains, we ensure a more balanced
and information-rich training dataset that better serves our model’s learning objectives.
Building on these techniques, we have developed a larger and higher-quality pre-training dataset,
expanding from the 7 trillion tokens used in Qwen2 (Yang et al., 2024a) to 18 trillion tokens.
3.2 Scaling Law for Hyper-parameters
We develop scaling laws for hyper-parameter based on the pre-training data of Qwen2.5 (Hoffmann et al.,
2022; Kaplan et al., 2020). While previous studies (Dubey et al., 2024; Almazrouei et al., 2023; Hoffmann
et al., 2022) primarily used scaling laws to determine optimal model sizes given compute budgets, we
leverage them to identify optimal hyperparameters across model architectures. Specifically, our scaling
laws help determine key training parameters like batch size Band learning rate µfor both dense models
and MoE models of varying sizes.
Through extensive experimentation, we systematically study the relationship between model architecture
and optimal training hyper-parameters. Specifically, we analyze how the optimal learning rate µopt
and batch size Boptvary with model size Nand pre-training data size D. Our experiments cover a
comprehensive range of architectures, including dense models with 44M to 14B parameters and MoE
models with 44M to 1B activated parameters, trained on datasets ranging from 0.8B to 600B tokens.
Using these optimal hyper-parameter predictions, we then model the final loss as a function of model
architecture and training data scale.
Additionally, we leverage scaling laws to predict and compare the performance of MoE models with
varying parameter counts against their dense counterparts. This analysis guides our hyper-parameter
configuration for MoE models, enabling us to achieve performance parity with specific dense model
variants (such as Qwen2.5-72B and Qwen2.5-14B) through careful tuning of both activated and total
parameters.
3.3 Long-context Pre-training
For optimal training efficiency, Qwen2.5 employs a two-phase pre-training approach: an initial phase
with a 4,096-token context length, followed by an extension phase for longer sequences. Following
the strategy used in Qwen2, we extend the context length from 4,096 to 32,768 tokens during the final
pre-training stage for all model variants except Qwen2.5-Turbo. Concurrently, we increase the base
frequency of RoPEfrom 10,000 to 1,000,000 using the ABF technique (Xiong et al., 2023).
For Qwen2.5-Turbo, we implement a progressive context length expansion strategy during training,
advancing through four stages: 32,768 tokens, 65,536 tokens, 131,072 tokens, and ultimately 262,144
tokens, with a RoPE base frequency of 10,000,000. At each stage, we carefully curate the training data
to include 40% sequences at the current maximum length and 60% shorter sequences. This progressive
training methodology enables smooth adaptation to increasing context lengths while maintaining the
model’s ability to effectively process and generalize across sequences of varying lengths.
To enhance our models’ ability to process longer sequences during inference, we implement two key
strategies: YARN (Peng et al., 2023) and Dual Chunk Attention (DCA, An et al., 2024). Through these
innovations, we achieve a four-fold increase in sequence length capacity, enabling Qwen2.5-Turbo to
handle up to 1 million tokens and other models to process up to 131,072 tokens. Notably, these approaches
not only improve the modeling of long sequences by reducing perplexity but also maintain the models’
strong performance on shorter sequences, ensuring consistent quality across varying input lengths.
4 Post-training
Qwen 2.5 introduces two significant advancements in its post-training design compared to Qwen 2:
(1)Expanded Supervised Fine-tuning Data Coverage: The supervised fine-tuning process leverages
a massive dataset comprising millions of high-quality examples. This expansion specifically
addresses key areas where the previous model showed limitations, such as long-sequence
4
Page 5:
generation, mathematical problem-solving, coding, instruction-following, structured data under-
standing, logical reasoning, cross-lingual transfer, and robust system instruction.
(2)Two-stage Reinforcement Learning: The reinforcement learning (RL) process in Qwen 2.5 is
divided into two distinct stages: Offline RL and Online RL.
•Offline RL: This stage focuses on developing capabilities that are challenging for the reward
model to evaluate, such as reasoning, factuality, and instruction-following. Through meticu-
lous construction and validation of training data, we ensure that the Offline RL signals are
both learnable and reliable (Xiang et al., 2024), enabling the model to acquire those complex
skills effectively.
•Online RL: The Online RL phase leverages the reward model’s ability to detect nuances in
output quality, including truthfulness, helpfulness, conciseness, relevance, harmlessness
and debiasing. It enables the model to generate responses that are precise, coherent, and
well-structured while maintaining safety and readability. As a result, the model’s outputs
consistently meet human quality standards and expectations.
4.1 Supervised Fine-tuning
In this section, we detail the key enhancements made during the SFT phase of Qwen2.5, focusing on
several critical areas:
(1)Long-sequence Generation: Qwen2.5 is capable of generating high-quality content with an
output context length of up to 8,192 tokens, a significant advancement over the typical post-
training response length, which often remains under 2,000 tokens. To address this gap, we
develop long-response datasets (Quan et al., 2024). We employ back-translation techniques to
generate queries for long-text data from pre-training corpora, impose output length constraints,
and use Qwen2 to filter out low-quality paired data.
(2)Mathematics: We introduce the chain-of-thought data of Qwen2.5-Math (Yang et al., 2024b),
which encompasses a diverse range of query sources, including public datasets, K-12 problem
collections, and synthetic problems. To ensure high-quality reasoning, we employ rejection
sampling (Yuan et al., 2023) along with reward modeling and annotated answers for guidance,
producing step-by-step reasoning process.
(3)Coding: To enhance coding capabilities, we incorporate the instruction tuning data of Qwen2.5-
Coder (Hui et al., 2024). We use multiple language-specific agents into a collaborative framework,
generating diverse and high-quality instruction pairs across nearly 40 programming languages.
We expand our instruction dataset by synthesizing new examples from code-related Q&A
websites and gathering algorithmic code snippets from GitHub. A comprehensive multilingual
sandbox is used to perform static code checking and validate code snippets through automated
unit testing, ensuring code quality and correctness (Dou et al., 2024; Yang et al., 2024c).
(4)Instruction-following: To ensure high-quality instruction-following data, we implement a
rigorous code-based validation framework. In this approach, LLMs generate both instructions
and corresponding verification code, along with comprehensive unit tests for cross-validation.
Through execution feedback-based rejection sampling, we carefully curate the training data used
for Supervised Fine-Tuning, thereby guaranteeing the model’s faithful adherence to intended
instructions (Dong et al., 2024).
(5)Structured Data Understanding: We develop a comprehensive structured understanding dataset
that encompasses both traditional tasks, such as tabular question-answering, fact verification,
error correction, and structural understanding, as well as complex tasks involving structured
and semi-structured data. By incorporating reasoning chains into the model’s responses, we
significantly enhance its ability to infer information from structured data, thereby improving its
performance across these diverse tasks. This approach not only broadens the scope of the dataset
but also deepens the model’s capacity to reason and derive meaningful insights from complex
data structures.
(6)Logical Reasoning: To enhance the model’s logical reasoning capabilities, we introduce a diverse
set of 70,000 new queries spanning various domains. These queries encompass multiple-choice
questions, true / false questions, and open-ended questions. The model is trained to approach
problems systematically, employing a range of reasoning methods such as deductive reason-
ing, inductive generalization, analogical reasoning, causal reasoning, and statistical reasoning.
Through iterative refinement, we systematically filter out data containing incorrect answers or
flawed reasoning processes. This process progressively strengthens the model’s ability to reason
logically and accurately, ensuring robust performance across different types of reasoning tasks.
5
Page 6:
(7)Cross-Lingual Transfer: To facilitate the transfer of the model’s general capabilities across lan-
guages, we employ a translation model to convert instructions from high-resource languages
into various low-resource languages, thereby generating corresponding response candidates. To
ensure the accuracy and consistency of these responses, we evaluate the semantic alignment be-
tween each multilingual response and its original counterpart. This process preserves the logical
structure and stylistic nuances of the original responses, thereby maintaining their integrity and
coherence across different languages.
(8)Robust System Instruction: We construct hundreds of general system prompts to improve the
diversity of system prompts in post-training, ensuring consistency between system prompts and
conversations. Evaluations with different system prompts show that the model maintains good
performance (Lu et al., 2024b) and reduced variance, indicating improved robustness.
(9)Response Filtering: To evaluate the quality of responses, we employ multiple automatic an-
notation methods, including a dedicated critic model and a multi-agent collaborative scoring
system. Responses are subjected to rigorous assessment, and only those deem flawless by all
scoring systems are retained. This comprehensive approach ensures that our outputs maintain
the highest quality standards.
Ultimately, we construct a dataset of over 1 million SFT examples. The model is fine-tuned for two epochs
with a sequence length of 32,768 tokens. To optimize learning, the learning rate is gradually decreased
from 7 ×10−6to 7×10−7. To address overfitting, we apply a weight decay of 0.1, and gradient norms
are clipped at a maximum value of 1.0.
4.2 Offline Reinforcement Learning
Compared to Online Reinforcement Learning (RL), Offline RL enables the pre-preparation of training
signals, which is particularly advantageous for tasks where standard answers exist but are challenging to
evaluate using reward models. In this study, we focus on objective query domains such as mathematics,
coding, instruction following, and logical reasoning, where obtaining accurate evaluations can be complex.
In the previous phase, we extensively employ strategies like execution feedback and answer matching to
ensure the quality of responses. For the current phase, we reuse that pipeline, employing the SFT model
to resample responses for a new set of queries. Responses that pass our quality checks are used as positive
examples, while those that fail are treated as negative examples for Direct Preference Optimization (DPO)
training (Rafailov et al., 2023). To further enhance the reliability and accuracy of the training signals, we
make use of both human and automated review processes (Cao et al., 2024). This dual approach ensures
that the training data is not only learnable but also aligned with human expectations. Ultimately, we
construct a dataset consisting of approximately 150,000 training pairs. The model is then trained for one
epoch using the Online Merging Optimizer (Lu et al., 2024a), with a learning rate of 7 ×10−7.
4.3 Online Reinforcement Learning
To develop a robust reward model for online RL, we adhere to a set of carefully defined labeling criteria.
Those criteria ensure that the responses generated by the model are not only high-quality but also aligned
with ethical and user-centric standards (Wang et al., 2024a). The specific guidelines for data labeling are
as follows:
•Truthfulness: Responses must be grounded in factual accuracy, faithfully reflecting the pro-
vided context and instructions. The model should avoid generating information that is false or
unsupported by the given data.
•Helpfulness: The model’s output should be genuinely useful, addressing the user’s query
effectively while providing content that is positive, engaging, educational, and relevant. It
should follow the given instructions precisely and offer value to the user.
•Conciseness: Responses should be succinct and to the point, avoiding unnecessary verbosity.
The goal is to convey information clearly and efficiently without overwhelming the user with
excessive detail.
•Relevance: All parts of the response should be directly related to the user’s query, dialogue
history, and the assistant’s context. The model should tailor its output to ensure it is perfectly
aligned with the user’s needs and expectations.
•Harmlessness: The model must prioritize user safety by avoiding any content that could lead
to illegal, immoral, or harmful behavior. It should promote ethical conduct and responsible
communication at all times.
6
Page 7:
•Debiasing: The model should produce responses that are free from bias, including but not
limited to gender, race, nationality, and politics. It should treat all topics equally and fairly,
adhering to widely accepted moral and ethical standards.
The queries utilized to train the reward model are drawn from two distinct datasets: publicly available
open-source data and a proprietary query set characterized by higher complexity. Responses are gener-
ated from checkpoints of the Qwen models, which have been fine-tuned using different methods—SFT,
DPO, and RL—at various stages of training. To introduce diversity, those responses are sampled at
different temperature settings. Preference pairs are created through both human and automated labeling
processes, and the training data for DPO is also integrated into this dataset.
In our online reinforcement learning (RL) framework, we employ Group Relative Policy Optimization
(GRPO, Shao et al., 2024). The query set utilized for training the reward model is identical to the one used
in the RL training phase. The sequence in which queries are processed during training is determined by
the variance of their response scores, as evaluated by the reward model. Specifically, queries with higher
variance in response scores are prioritized to ensure more effective learning. We sample 8 responses
for each query. All models are trained with a 2048 global batch size and 2048 samples in each episode,
considering a pair of queries and responses as a sample.
4.4 Long Context Fine-tuning
To further extend the context length of Qwen2.5-Turbo, we introduce longer SFT examples during
post-training, enabling it to better align with human preference in long queries.
In the SFT phase, we employ a two-stage approach. In the first stage, the model is fine-tuned exclusively
using short instructions, each containing up to 32,768 tokens. This stage uses the same data and training
steps as those employed for the other Qwen2.5 models, ensuring strong performance on short tasks.
In the second stage, the fine-tuning process combines both short instructions (up to 32,768 tokens)
and long instructions (up to 262,144 tokens). This hybrid approach effectively enhances the model’s
instruction-following ability in long context tasks while maintaining its performance on short tasks.
During the RL stage, we use a training strategy similar to that used for the other Qwen2.5 models,
focusing solely on short instructions. This design choice is driven by two primary considerations: first,
RL training is computationally expensive for long context tasks; second, there is currently a scarcity of
reward models that provide suitable reward signals for long context tasks. Additionally, we find that
adopting RL on short instructions alone can still significantly enhance the model’s alignment with human
preferences in long context tasks.
5 Evaluation
The base models produced by pre-training and the instruction-tuned models produced by post-training
are evaluated accordingly with a comprehensive evaluation suite, including both commonly-used open
benchmarks and skill-oriented in-house datasets. The evaluation suite is designed to be primarily
automatic with minimal human interaction.
To prevent test data leakage, we exclude potentially contaminated data using n-gram matching when
constructing the pre-training and post-training datasets. Following the criteria used in Qwen2, a training
sequence stis removed from the training data if there exists a test sequence sesuch that the length of the
longest common subsequence (LCS) between tokenized stand sesatisfies both |LCS (st,se)| ≥13 and
|LCS (st,se)| ≥0.6×min (|st|,|se|).
5.1 Base Models
We conduct comprehensive evaluations of the base language models of the Qwen2.5 series. The evaluation
of base models primarily emphasizes their performance in natural language understanding, general
question answering, coding, mathematics, scientific knowledge, reasoning, and multilingual capabilities.
The evaluation datasets include:
General Tasks MMLU (Hendrycks et al., 2021a) (5-shot), MMLU-Pro (Wang et al., 2024b) (5-shot),
MMLU-redux (Gema et al., 2024) (5-shot), BBH (Suzgun et al., 2023) (3-shot), ARC-C (Clark et al.,
2018) (25-shot), TruthfulQA (Lin et al., 2022a) (0-shot), Winogrande (Sakaguchi et al., 2021) (5-shot),
HellaSwag (Zellers et al., 2019) (10-shot).
7
Page 8:
Table 2: Performance of the 70B+ base models and Qwen2.5-Plus.
Datasets Llama-3-70B Mixtral-8x22B Llama-3-405B Qwen2-72B Qwen2.5-72B Qwen2.5-Plus
General Tasks
MMLU 79.5 77.8 85.2 84.2 86.1 85.4
MMLU-Pro 52.8 51.6 61.6 55.7 58.1 64.0
MMLU-redux 75.0 72.9 - 80.5 83.9 82.8
BBH 81.0 78.9 85.9 82.4 86.3 85.8
ARC-C 68.8 70.7 - 68.9 72.4 70.9
TruthfulQA 45.6 51.0 - 54.8 60.4 55.3
WindoGrande 85.3 85.0 86.7 85.1 83.9 85.5
HellaSwag 88.0 88.7 - 87.3 87.6 89.2
Mathematics & Science Tasks
GPQA 36.3 34.3 - 37.4 45.9 43.9
TheoremQA 32.3 35.9 - 42.8 42.4 48.5
MATH 42.5 41.7 53.8 50.9 62.1 64.4
MMLU-stem 73.7 71.7 - 79.6 82.7 81.2
GSM8K 77.6 83.7 89.0 89.0 91.5 93.0
Coding Tasks
HumanEval 48.2 46.3 61.0 64.6 59.1 59.1
HumanEval+ 42.1 40.2 - 56.1 51.2 52.4
MBPP 70.4 71.7 73.0 76.9 84.7 79.7
MBPP+ 58.4 58.1 - 63.9 69.2 66.9
MultiPL-E 46.3 46.7 - 59.6 60.5 61.0
Multilingual Tasks
Multi-Exam 70.0 63.5 - 76.6 78.7 78.5
Multi-Understanding 79.9 77.7 - 80.7 89.6 89.2
Multi-Mathematics 67.1 62.9 - 76.0 76.7 82.4
Multi-Translation 38.0 23.3 - 37.8 39.0 40.4
Mathematics & Science Tasks GPQA (Rein et al., 2023) (5-shot), Theorem QA (Chen et al., 2023a)
(5-shot), GSM8K (Cobbe et al., 2021) (4-shot), MATH (Hendrycks et al., 2021b) (4-shot).
Coding Tasks HumanEval (Chen et al., 2021) (0-shot), HumanEval+ (Liu et al., 2023)(0-shot), MBPP (Austin
et al., 2021) (0-shot), MBPP+ (Liu et al., 2023) (0-shot), MultiPL-E (Cassano et al., 2023) (0-shot) (Python,
C++, JAVA, PHP , TypeScript, C#, Bash, JavaScript).
Multilingual Tasks We group them into four categories: (a) Exam: M3Exam (5-shot, we only choose
examples that require no image), IndoMMLU (Koto et al., 2023) (3-shot), ruMMLU (Fenogenova et al.,
2024) (5-shot), and translated MMLU (Chen et al., 2023b) (5-shot on Arabic, Spanish, French, Portuguese,
German, Italian, Japanese, and Korean); (b) Understanding: BELEBELE (Bandarkar et al., 2023) (5-shot),
XCOPA (Ponti et al., 2020) (5-shot), XWinograd (Muennighoff et al., 2023) (5-shot), XStoryCloze (Lin
et al., 2022b) (0-shot) and PAWS-X (Yang et al., 2019) (5-shot); (c) Mathematics: MGSM (Goyal et al., 2022)
(8-shot CoT); and (d) Translation: Flores-101 (Goyal et al., 2022) (5-shot).
For base models, we compare Qwen2.5 models with Qwen2 models and other leading open-weight
models in terms of scales of parameters.
Qwen2.5-72B & Qwen2.5-Plus We compare the base models of Qwen2.5-72B and Qwen2.5-Plus to other
leading open-weight base models: Llama3-70B (Dubey et al., 2024), Llama3-405B (Dubey et al., 2024),
Mixtrail-8x22B (Jiang et al., 2024a), and our previous 72B version, the Qwen2-72B (Yang et al., 2024a).
The Qwen2.5-72B base model significantly outperforms its peers in the same category across a wide
range of tasks. It achieves results comparable to Llama-3-405B while utilizing only one-fifth of the pa-
rameters. Furthermore, when compared to its predecessor, Qwen2-72B, the Qwen2.5-72B shows marked
improvements in nearly all benchmark evaluations, particularly excelling in general tasks, mathematics,
and coding challenges. With significantly lower training and inference costs, Qwen2.5-Plus achieves
very competitive performance results compared to Qwen2.5-72B and Llama3-405B, outperforming other
baseline models on the Hellaswag, TheoremQA, MATH, GSM8K, MultiPL-E, Multi-Mathematics, and
Multi-Translation. Moreover, Qwen2.5-Plus achieves 64.0 on MMLU-Pro, which is 5.9 points higher than
Qwen2.5-72B.
Qwen2.5-14B/32B & Qwen2.5-T urbo The evaluation of the Qwen2.5-Turbo, Qwen2.5-14B, and 32B
models is compared against baselines of similar sizes. These baselines include Yi-1.5-34B (Young et al.,
8
Page 9:
Table 3: Performance of the 14B-30B+ base models and Qwen2.5-T urbo.
Datasets Qwen1.5-32B Gemma2-27B Yi-1.5-34B Qwen2.5-T urbo Qwen2.5-14B Qwen2.5-32B
General Tasks
MMLU 74.3 75.2 77.2 79.5 79.7 83.3
MMLU-pro 44.1 49.1 48.3 55.6 51.2 55.1
MMLU-redux 69.0 - 74.1 77.1 76.6 82.0
BBH 66.8 74.9 76.4 76.1 78.2 84.5
ARC-C 63.6 71.4 65.6 67.8 67.3 70.4
TruthfulQA 57.4 40.1 53.9 56.3 58.4 57.8
Winogrande 81.5 59.7 84.9 81.1 81.0 82.0
Hellaswag 85.0 86.4 85.9 85.0 84.3 85.2
Mathematics & Science Tasks
GPQA 30.8 34.9 37.4 41.4 32.8 48.0
Theoremqa 28.8 35.8 40.0 42.1 43.0 44.1
MATH 36.1 42.7 41.7 55.6 55.6 57.7
MMLU-stem 66.5 71.0 72.6 77.0 76.4 80.9
GSM8K 78.5 81.1 81.7 88.3 90.2 92.9
Coding Tasks
HumanEval 43.3 54.9 46.3 57.3 56.7 58.5
HumanEval+ 40.2 46.3 40.2 51.2 51.2 52.4
MBPP 64.2 75.7 65.5 76.2 76.7 84.5
MBPP+ 53.9 60.2 55.4 63.0 63.2 67.2
MultiPL-E 38.5 48.0 39.5 53.9 53.5 59.4
Multilingual Tasks
Multi-Exam 61.6 65.8 58.3 70.3 70.6 75.4
Multi-Understanding 76.5 82.2 73.9 85.3 85.9 88.4
Multi-Mathematics 56.1 61.6 49.3 71.3 68.5 73.7
Multi-Translation 33.5 38.7 30.0 36.8 36.2 37.3
Table 4: Performance of the 7B+ base models.
Datasets Mistral-7B Llama3-8B Gemma2-9B Qwen2-7B Qwen2.5-7B
General Tasks
MMLU 64.2 66.6 71.3 70.3 74.2
MMLU-pro 30.9 35.4 44.7 40.1 45.0
MMLU-redux 58.1 61.6 67.9 68.1 71.1
BBH 56.1 57.7 68.2 62.3 70.4
ARC-C 60.0 59.3 68.2 60.6 63.7
TruthfulQA 42.2 44.0 45.3 54.2 56.4
Winogrande 78.4 77.4 79.5 77.0 75.9
HellaSwag 83.3 82.1 81.9 80.7 80.2
Mathematics & Science Tasks
GPQA 24.7 25.8 32.8 30.8 36.4
TheoremQA 19.2 22.1 28.9 29.6 36.0
MATH 10.2 20.5 37.7 43.5 49.8
MMLU-stem 50.1 55.3 65.1 64.2 72.3
GSM8K 36.2 55.3 70.7 80.2 85.4
Coding Tasks
HumanEval 29.3 33.5 37.8 51.2 57.9
HumanEval+ 24.4 29.3 30.5 43.3 50.6
MBPP 51.1 53.9 62.2 64.2 74.9
MBPP+ 40.9 44.4 50.6 51.9 62.9
MultiPL-E 29.4 22.6 34.9 41.0 50.3
Multilingual Tasks
Multi-Exam 47.1 52.3 61.2 59.2 59.4
Multi-Understanding 63.3 68.6 78.3 72.0 79.3
Multi-Mathematics 26.3 36.3 53.0 57.5 57.8
Multi-Translation 23.3 31.9 36.5 31.5 32.4
9
Page 10:
Table 5: Performance of the smaller base models.
Datasets Qwen2-0.5B Qwen2.5-0.5B Qwen2-1.5B Qwen2.5-1.5B Gemma2-2.6B Qwen2.5-3B
General Tasks
MMLU 44.3 47.5 55.9 60.9 52.2 65.6
MMLU-pro 14.7 15.7 21.6 28.5 23.0 34.6
MMLU-redux 40.7 45.1 51.8 58.5 50.9 63.7
BBH 18.2 20.3 36.5 45.1 41.9 56.3
ARC-C 31.0 35.6 43.7 54.7 55.7 56.5
TruthfulQA 39.7 40.2 45.9 46.6 36.2 48.9
Winogrande 56.9 56.3 65.0 65.0 71.5 71.1
Hellaswag 49.1 52.1 67.0 67.9 74.6 74.6
Mathematics & Science Tasks
GPQA 29.8 24.8 20.7 24.2 25.3 26.3
TheoremQA 9.6 16.0 14.8 22.1 15.9 27.4
MATH 11.2 19.5 21.6 35.0 18.3 42.6
MMLU-STEM 27.5 39.8 42.7 54.8 45.8 62.5
GSM8K 36.4 41.6 46.9 68.5 30.3 79.1
Coding Tasks
HumanEval 22.6 30.5 34.8 37.2 19.5 42.1
HumanEval+ 18.9 26.8 29.9 32.9 15.9 36.0
MBPP 33.1 39.3 46.9 60.2 42.1 57.1
MBPP+ 27.6 33.8 37.6 49.6 33.6 49.4
MultiPL-E 16.3 18.9 27.9 33.1 17.6 41.2
Multilingual Tasks
Multi-Exam 29.4 30.8 43.1 47.9 38.1 54.6
Multi-Understanding 40.4 41.0 50.7 65.1 46.8 76.6
Multi-Mathematics 7.8 13.5 21.3 37.5 18.2 48.9
Multi-Translation 14.1 15.3 23.8 25.0 26.9 29.3
2024), Gemma2-27B (Gemma Team et al., 2024), and Qwen1.5-32B (Qwen Team, 2024b). The results
are shown in Table 3. The Qwen2.5-14B model demonstrates a solid performance across various tasks,
particularly excelling in general tasks like MMLU and BBH, where it achieves scores of 79.7 and 78.2,
outcompeting competitors of larger sizes. Meanwhile, Qwen2.5-32B, in particular, showcases exceptional
capabilities, often surpassing larger models of similar model sizes. Notably, it outperforms its predecessor
Qwen1.5-32B significantly, especially in challenging areas such as mathematics and coding, with notable
scores of 57.7 in MATH and 84.5 in MBPP . For Qwen2.5-Turbo, although its training cost and inference cost
are significantly smaller than those of Qwen2.5-14B, it achieves comparable results, where its MMLU-Pro
score is even better than that of Qwen2.5-32B.
Qwen2.5-7B For 7B-level models, we focus on comparing Qwen2.5-7B with other leading 7B+ models,
including Mistral-7B (Jiang et al., 2023a), Llama3-8B (Dubey et al., 2024), Gemma2-9B (Gemma Team et al.,
2024), and our predecessor, Qwen2-7B (Yang et al., 2024a). The results can be found in Table 4. Note that
the non-embedding parameters of Qwen2-7B and Qwen2.5-7B are only 6.5B, while that of Gemma2-9B
is 8.2B. The Qwen2.5-7B model surpasses its predecessors and counterparts in numerous benchmarks,
despite having fewer non-embedding parameters. It demonstrates significant improvements across
various tasks, achieving 74.2 on general benchmarks like MMLU (Hendrycks et al., 2021a), 49.8 on math
challenges such as MATH (Hendrycks et al., 2021b), and 57.9 on coding tasks like HumanEval (Chen
et al., 2021).
Qwen2.5-0.5B/1.5B/3B For edge-side models, we compare Qwen2.5-0.5B, 1.5B, and 3B against estab-
lished baselines: Qwen2-0.5B/1.5B (Yang et al., 2024a) and Gemma2-2.6B (Gemma Team et al., 2024). The
results are given in Table 5. Qwen2.5-0.5B, 1.5B, and 3B continue to maintain strong performance across
nearly all benchmarks. Notably, the Qwen2.5-0.5B model outperforms the Gemma2-2.6B on various math
and coding tasks.
5.2 Instruction-tuned Model
To critically evaluate instruction-tuned models, we adopt a multifaceted approach. Foundational skills
and human preferences are assessed using open datasets and benchmarks. Additionally, our detailed
in-house evaluations delve deeper into the models’competencies in key areas and multilingualism. A
particular focus is placed on assessing long-context capability. The subsequent sections outline the
evaluation methods and present the results.
10
Page 11:
Table 6: Performance of the 70B+ Instruct models and Qwen2.5-Plus.
Datasets Llama-3.1-70B Llama-3.1-405B Qwen2-72B Qwen2.5-72B Qwen2.5-Plus
General Tasks
MMLU-Pro 66.4 73.3 64.4 71.1 72.5
MMLU-redux 83.0 86.2 81.6 86.8 86.3
LiveBench 0831 46.6 53.2 41.5 52.3 54.6
Mathematics & Science Tasks
GPQA 46.7 51.1 42.4 49.0 49.7
MATH 68.0 73.8 69.0 83.1 84.7
GSM8K 95.1 96.8 93.2 95.8 96.0
Coding Tasks
HumanEval 80.5 89.0 86.0 86.6 87.8
MBPP 84.2 84.5 80.2 88.2 85.5
MultiPL-E 68.2 73.5 69.2 75.1 77.0
LiveCodeBench 32.1 41.6 32.2 55.5 51.4
Alignment Tasks
IFEval 83.6 86.0 77.6 84.1 86.3
Arena-Hard 55.7 69.3 48.1 81.2 81.4
MTbench 8.79 9.08 9.12 9.35 9.30
Table 7: Performance of the 14B-30B+ instruction-tuned models and Qwen2.5-T urbo.
Datasets Qwen2-57BA14B Gemma2-27B GPT4o-mini Qwen2.5-T urbo Qwen2.5-14B Qwen2.5-32B
General Tasks
MMLU-Pro 52.8 55.5 63.1 64.5 63.7 69.0
MMLU-redux 72.6 75.7 81.5 81.7 80.0 83.9
LiveBench 0831 31.1 39.6 43.3 42.3 44.4 50.7
Mathematics & Science Tasks
GPQA 34.3 38.4 40.2 42.3 45.5 49.5
MATH 49.1 54.4 70.2 81.1 80.0 83.1
GSM8K 85.3 90.4 93.2 93.8 94.8 95.9
Coding Tasks
HumanEval 79.9 78.7 88.4 86.6 83.5 88.4
MBPP 70.9 81.0 85.7 82.8 82.0 84.0
MultiPL-E 66.4 67.4 75.0 73.7 72.8 75.4
LiveCodeBench 22.5 - 40.7 37.8 42.6 51.2
Alignment Tasks
IFEval 59.9 77.1 80.4 76.3 81.0 79.5
Arena-Hard 17.8 57.5 74.9 67.1 68.3 74.5
MTbench 8.55 9.10 - 8.81 8.88 9.20
5.2.1 Open Benchmark Evaluation
To comprehensively evaluate the quality of instruction-tuned models, we compile automatic and human
evaluation to assess the capabilities and human preference. For the evaluation of basic capabilities, we
apply similar datasets in the pre-trained model evaluation, which target on natural language understand-
ing, coding, mathematics, and reasoning. Specifically, we evaluate on MMLU-Pro, MMLU-redux and
LiveBench 0831 (White et al., 2024) for general evaluation, GPQA, GSM8K and MATH for science and
mathematics, HumanEval, MBPP , MultiPL-E and LiveCodeBench 2305-2409 (Jain et al., 2024) for coding,
IFEval (Zhou et al., 2023)2for instruction following. Additionally, we assess the performance of human
preference alignment and instruction following by evaluating on benchmarks including MT-Bench (Zheng
et al., 2023) and Arena-Hard (Li et al., 2024).
Qwen2.5-72B-Instruct & Qwen2.5-Plus As shown in Table 6, we compare Qwen2.5-72B-Instruct and
Qwen2.5-Plus to other leading open-weight instrution-tuned models: Llama3.1-70B-Instruct (Dubey
2For simplicity, we report the results of the subset strict-prompt .
11
Page 12:
Table 8: Performance of the 7B+ instruction-tuned models.
Datasets Gemma2-9B Llama3.1-8B Qwen2-7B Qwen2.5-7B
General Tasks
MMLU-Pro 52.1 48.3 44.1 56.3
MMLU-redux 72.8 67.2 67.3 75.4
LiveBench 0831 30.6 26.7 29.2 35.9
Mathematics & Science Tasks
GPQA 32.8 32.8 34.3 36.4
MATH 44.3 51.9 52.9 75.5
GSM8K 76.7 84.5 85.7 91.6
Coding Tasks
HumanEval 68.9 72.6 79.9 84.8
MBPP 74.9 69.6 67.2 79.2
MultiPL-E 53.4 50.7 59.1 70.4
LiveCodeBench 18.9 8.3 23.9 28.7
Alignment Tasks
IFEval 70.1 75.9 54.7 71.2
Arena-Hard 41.6 27.8 25.0 52.0
MTbench 8.49 8.23 8.26 8.75
Table 9: Performance comparison of 2B-4B instruction-tuned models.
Datasets Gemma2-2B Phi3.5-Mini MiniCPM3-4B Qwen2.5-3B
Non-Emb Params 2.0B 3.6B 4.0B 2.8B
General Tasks
MMLU-Pro 26.7 47.5 43.0 43.7
MMLU-redux 51.9 67.7 59.9 64.4
LiveBench 0831 20.1 27.4 27.6 26.8
Mathematics & Science Tasks
GPQA 29.3 27.2 31.3 30.3
MATH 26.6 48.5 46.6 65.9
GSM8K 63.2 86.2 81.1 86.7
Coding Tasks
HumanEval 68.9 72.6 74.4 74.4
MBPP 74.9 63.2 72.5 72.7
MultiPL-E 30.5 47.2 49.1 60.2
LiveCodeBench 5.8 15.8 23.8 19.9
Alignment Tasks
IFEval 51.0 52.1 68.4 58.2
et al., 2024), Llama3.1-405B-Instruct (Dubey et al., 2024), and our previous 72B version, Qwen2-72B-
Instruct (Yang et al., 2024a). The Qwen2.5-72B-Instruct model delivers exceptional performance, even
surpassing the larger Llama-3.1-405B-Instruct in several critical benchmarks including MMLU-redux,
MATH, MBPP , MultiPL-E, LiveCodeBench, Arena-Hard and MTBench. Moreover, Qwen2.5-Plus outper-
forms Qwen2.5-72B-Instruct on 9 out of 13 benchmarks.
Qwen2.5-14B/32B-Instruct & Qwen2.5-T urbo The performance of the Qwen2.5-Turbo, Qwen2.5-14B-
Instruct, and Qwen2.5-32B-Instruct models is evaluated and compared against baselines of similar sizes.
The baselines include GPT4o-mini, Gemma2-27B-IT (Gemma Team et al., 2024), and Qwen2-57BA14B-
Instruct (Yang et al., 2024a). The results are summarized in Table 7. The Qwen2.5-32B-Instruct model
exhibits superior performance across most tasks when compared to other models of similar size. Notably,
our open-weight Qwen2.5-14B-Instruct model delivers competitive results across all benchmarks, rivaling
those of GPT-4o-mini. Despite its significantly lower training and inference costs, the Qwen2.5-Turbo
model outperforms Qwen2.5-14B-Instruct on eight out of ten benchmarks. This demonstrates that
Qwen2.5-Turbo achieves remarkable efficiency and effectiveness, making it a compelling choice for
resource-constrained environments.
12
Page 13:
Table 10: Performance comparison of 0.5B-1.5B instruction-tuned models.
Datasets Qwen2-0.5B Qwen2.5-0.5B Qwen2-1.5B Qwen2.5-1.5B
General Tasks
MMLU-Pro 14.4 15.0 22.9 32.4
MMLU-redux 12.9 24.1 41.2 50.7
LiveBench 7.4 12.6 12.4 18.8
Mathematics & Science Tasks
GPQA 23.7 29.8 21.2 29.8
MATH 13.9 34.4 25.3 55.2
GSM8K 40.1 49.6 61.6 73.2
Coding Tasks
HumanEval 31.1 35.4 42.1 61.6
MBPP 39.7 49.6 44.2 63.2
MultiPL-E 20.8 28.5 38.5 50.4
LiveCodeBench 1.6 5.1 4.5 14.8
Alignment Tasks
IFEval 14.6 27.9 29.0 42.5
Table 11: Performance Comparison on our in-house English automatic evaluation benchmark.
Models IF Knowledge Comprehension Coding Math Reasoning
Proprietary LLMs
GPT-4o-2024-08-06 83.28 68.08 76.51 58.05 52.36 66.45
GPT-4o-2024-11-20 80.06 65.25 79.07 60.19 49.74 67.07
Claude3.5-sonnet-2024-10-22 84.22 74.61 79.02 67.17 48.67 70.20
Qwen2 Series
Qwen2-0.5B-Instruct 18.33 18.59 30.64 5.42 13.16 32.03
Qwen2-1.5B-Instruct 29.42 29.23 45.81 17.02 20.34 38.86
Qwen2-7B-Instruct 50.47 44.79 58.04 43.04 38.31 50.25
Qwen2-72B-Instruct 76.08 59.49 72.19 48.95 48.07 60.33
Llama-3.1 Series
Llama-3.1-70B-Instruct 81.33 63.42 69.29 55.96 48.00 63.18
Llama-3.1-405B-Instruct 83.33 67.10 75.55 58.14 47.09 64.74
Qwen2.5 Series
Qwen2.5-0.5B-Instruct 33.35 30.29 29.78 15.41 26.29 36.13
Qwen2.5-1.5B-Instruct 40.25 41.19 47.69 26.19 40.99 42.23
Qwen2.5-3B-Instruct 60.60 46.11 57.98 41.43 49.38 49.80
Qwen2.5-7B-Instruct 70.01 52.74 62.69 48.41 56.93 54.69
Qwen2.5-14B-Instruct 74.17 59.78 69.11 52.68 59.68 62.51
Qwen2.5-Turbo 72.76 58.56 68.70 54.48 57.77 61.06
Qwen2.5-32B-Instruct 76.79 64.08 71.28 58.90 60.97 65.49
Qwen2.5-72B-Instruct 82.65 66.09 74.43 60.41 59.73 65.90
Qwen2.5-Plus 83.18 68.41 79.35 59.58 62.52 66.92
Other Instruction-tuned Models As illustrated in Table 8, the Qwen2.5-7B-Instruct model significantly
outperforms its competitors, Gemma2-9B-IT and Llama3.1-8B-Instruct, across all tasks except IFEval.
Notably, Qwen2.5-7B-Instruct exhibits clear advantages in mathematics (MATH: 75.5) and coding (Hu-
manEval: 84.8). For the edge-side instruction models, the Qwen2.5-3B-Instruct model, despite having
fewer parameters than both the Phi3.5-mini-instruct (Abdin et al., 2024) and MiniCPM3-4B-Instruct (Hu
et al., 2024) models, surpasses them in mathematics and coding tasks, as shown in Table 9. Additionally,
it delivers competitive results in language understanding. The Qwen2.5-1.5B-Instruct and Qwen2.5-0.5B-
Instruct models have also seen substantial performance improvements over their previous versions, as
detailed in Table 10. These enhancements make them particularly well-suited for edge-side applications
in highly resource-constrained environments.
13
Page 14:
Table 12: Performance Comparison on our in-house Chinese automatic evaluation benchmark.
Models IF Knowledge Comprehension Coding Math Reasoning
Proprietary LLMs
GPT-4o-2024-08-06 42.50 68.55 80.11 61.53 61.74 56.88
GPT-4o-2024-11-20 42.71 71.29 83.04 62.39 66.04 62.04
Claude3.5-sonnet-2024-10-22 49.25 72.09 82.16 66.00 63.71 66.60
Qwen2 Series
Qwen2-0.5B-Instruct 4.69 40.43 39.13 9.85 14.07 32.73
Qwen2-1.5B-Instruct 6.81 51.54 46.89 14.14 24.57 35.19
Qwen2-7B-Instruct 16.83 65.95 60.30 37.05 50.52 44.96
Qwen2-72B-Instruct 31.98 74.96 75.49 41.57 65.55 58.19
Llama-3.1 Series
Llama-3.1-70B-Instruct 28.96 57.41 67.24 54.82 41.18 52.42
Llama-3.1-405B-Instruct 30.39 63.79 72.27 60.73 46.05 55.88
Qwen2.5 Series
Qwen2.5-0.5B-Instruct 6.12 39.13 42.97 9.60 24.03 33.72
Qwen2.5-1.5B-Instruct 7.38 48.68 49.69 22.96 37.30 39.17
Qwen2.5-3B-Instruct 16.50 57.18 62.55 29.88 51.64 39.57
Qwen2.5-7B-Instruct 26.64 65.77 67.55 39.56 61.06 49.70
Qwen2.5-14B-Instruct 26.87 70.28 76.96 49.78 67.01 56.41
Qwen2.5-Turbo 32.94 72.93 74.37 51.92 66.08 53.30
Qwen2.5-32B-Instruct 32.64 74.70 79.46 54.45 67.86 60.19
Qwen2.5-72B-Instruct 37.22 75.86 78.85 56.71 68.39 63.02
Qwen2.5-Plus 46.15 72.07 82.64 58.48 69.96 62.98
5.2.2 In-house Automatic Evaluation
Despite the availability of several open benchmark datasets for evaluation, we believe that these are
insufficient to fully capture the capabilities of LLMs. To address this, we have developed a series
of in-house datasets designed to assess various aspects of model performance, including knowledge
understanding, text generation, coding, and more. These evaluations are conducted in both Chinese
and English. In addition, we have specifically evaluated the multilingual performance of instruction-
tuned models. The results are summarized in Table 11 for English, Table 12 for Chinese, Table 13 for
multilingualism of 70B+ Instruct models, and Table 14 for 7B-14B models, respectively.
English & Chinese Evaluation We compare the performance of Qwen2.5-Instruct models against
several leading language models, including GPT-4, Claude3.5-sonnet, Qwen2, and Llama-3.1, across both
English and Chinese languages. Our analysis focuses on model size and its impact on performance, as
well as how our latest Qwen2.5 series compares to previous iterations and competing models. For smaller
models, we observe that the Qwen2.5-0.5B model achieves performance that is on par with or even
surpasses the Qwen2-1.5B model. This indicates that the Qwen2.5 series has optimized parameter usage,
enabling mid-sized models to achieve similar performance levels to larger models from the previous
generation. The Qwen2.5-3B model demonstrates performance that is comparable to the Qwen2-7B
model. Notably, the Qwen2.5-32B model exhibits a remarkable improvement over the Qwen2-72B model.
Our flagship model, Qwen2.5-72B, further narrows the gap between Qwen and state-of-the-art models
like GPT-4 and Claude3.5-sonnet. In particular, Qwen2.5-72B matches or exceeds the performance
of Llama-3.1-405B in all metrics except for instruction following. This achievement underscores the
competitiveness of Qwen2.5-72B in a wide range of language processing tasks, while also identifying
areas for future improvement. Qwen2.5-Plus addresses the previous shortcomings in Chinese instruction
following and further enhances its advantages in other areas.
Multilingual Evaluation To comprehensively evaluate the multilingual capabilities of instruction-tuned
models, we followed P-MMEval (Zhang et al., 2024) and extended several benchmarks as follows: (1)
IFEval (Multilingual): We expanded the IFEval benchmark, originally in English, to include multilingual
examples. To ensure language neutrality, we removed instances that contained language-specific content
(e.g., ”start with letter A”). (2) Knowledge Utilization: to assess the knowledge utilization abilities
of the Qwen2.5 series models across multiple languages, we employed five MMLU-like benchmarks
(multiple-choice format). These benchmarks include: AMMLU (Arabic), JMMLU (Japanese), KMMLU
(Korean), IndoMMLU (Indonesian), and TurkishMMLU (Turkish). Additionally, we evaluated the models’
performance on the translated version of the MMLU benchmark (okapi MMLU), which has been adapted
14
Page 15:
Table 13: Performance of the 70B+ Instruct models on Multilingual Tasks.
Datasets Qwen2-72B Llama3.1-70B Qwen2.5-32B Mistral-Large GPT4o-mini Qwen2.5-72B
Instruction Following
IFEval (multilingual) 79.69 80.47 82.68 82.69 85.03 86.98
Knowledge
AMMLU (Arabic) 68.85 70.08 70.44 69.24 69.73 72.44
JMMLU (Japanese) 77.37 73.89 76.55 75.77 73.74 80.56
KMMLU (Korean) 57.04 53.23 60.75 56.42 56.77 61.96
IndoMMLU (Indonesian) 66.31 67.50 66.42 63.21 67.75 69.25
TurkishMMLU (Turkish) 69.22 66.89 72.41 64.78 71.19 76.12
okapi MMLU (translated) 77.84 76.49 77.16 78.37 73.44 79.97
Math Reasoning
MGSM8K (extended) 82.72 73.31 87.15 89.01 87.36 88.16
Cultural Nuances
BLEnD 25.90 30.49 27.88 33.47 35.91 32.48
Table 14: Performance of the 7B-14B Instruct models on Multilingual Tasks.
Datasets Qwen2-7B Llama3.1-8B Qwen2.5-7B Gemma2-9B Qwen2.5-14B
Instruction Following
IFEval (multilingual) 51.43 60.68 74.87 77.47 77.08
Knowledge
AMMLU (Arabic) 54.87 54.28 59.78 60.26 66.81
JMMLU (Japanese) 57.71 53.26 61.88 64.59 72.78
KMMLU (Korean) 43.96 42.28 46.59 46.24 59.71
IndoMMLU (Indonesian) 54.05 53.92 56.42 61.73 65.09
TurkishMMLU (Turkish) 49.27 45.61 54.28 55.44 66.85
okapi MMLU (translated) 60.47 55.18 66.98 46.72 72.12
Math Reasoning
MGSM8K (extended) 56.13 66.05 66.11 78.37 82.27
Cultural Nuances
BLEnD 22.49 19.47 23.66 28.31 26.99
into multiple languages from its original English form. (3) MGSM8K (Extended): Building upon the
original MGSM8K benchmark, we extended the language support to include Arabic (ar), Korean (ko),
Portuguese (pt), and Vietnamese (vi). (4) Cultural Nuances: To evaluate the models’ ability to capture
cultural nuances, we utilized the BLEnD benchmark (Myung et al., 2024). This benchmark is specifically
designed to test LLMs on their understanding of cultural subtleties.
Qwen2.5 exhibits competitive performance in instruction following, multilingual knowledge, and mathe-
matical reasoning, aligning well with models of comparable size. Although it shows notable improve-
ments in capturing cultural nuances relative to its predecessor, Qwen2, there remains potential for further
refinement in this domain.
5.2.3 Reward Model
The reward model serves as the cornerstone for guiding RL processes, and thus we conduct a separate
evaluation of the reward model used in the Qwen2.5 series. Our assessment benchmarks encompass
Reward Bench (Lambert et al., 2024), RMB (Zhou et al., 2024), PPE (Frick et al., 2024b), and an internally
collected out-of-domain Chinese human preference benchmark (Human-Preference-Chinese) to provide
a comprehensive analysis. For comparison, we included baseline models such as Nemotron-4-340B-
Reward (Adler et al., 2024), Llama-3.1-Nemotron-70B-Reward (Wang et al., 2024c), and Athene-RM-
70B (Frick et al., 2024a). The results are shown in Table 15. Overall, our findings indicate that Llama-3.1-
Nemotron-70B-Reward excels on the Reward Bench, while Athene-RM-70B performs best on the RMB
benchmark. The Qwen2.5-RM-72B, leads in both the PPE and Human-Preference-Chinese evaluations,
ranking second only to Athene-RM-70B on the RMB and achieving a performance level comparable to
15
Page 16:
Table 15: Performance comparison across multiple RM benchmarks.
MetricNemotron-4-340B-
RewardLlama-3.1-Nemotron-
70B-RewardAthene-RM
-70BQwen2.5-RM
-72B
Reward Bench
Chat 95.80 97.50 98.32 97.21
Chat Hard 87.10 85.70 70.61 78.73
Safety 91.50 95.10 92.10 92.71
Reasoning 93.60 98.10 92.19 97.65
Score 92.00 94.10 88.32 91.59
RMB
Helpfulness (BoN) 48.85 61.02 67.24 65.72
Helpfulness (Pairwise) 68.70 75.28 80.82 78.83
Harmlessness (BoN) 50.92 52.00 67.02 56.35
Harmlessness (Pairwise) 70.84 69.96 80.83 73.94
Overall 59.83 64.57 73.98 68.71
PPE
Human Preference 59.28 64.32 66.48 64.80
IFEval 62.66 63.40 62.15 67.97
GPQA 56.56 59.14 59.26 59.80
MATH 65.12 69.73 79.14 81.48
MBPP-Plus 49.15 55.62 67.97 64.34
MMLU-Pro 69.69 70.20 76.95 75.66
Objective-Avg 60.64 63.62 69.09 69.85
Human-Preference-Chinese
Accuracy 50.46 59.95 61.11 61.27
Nemotron-4-340B-Reward on the Reward Bench, albeit slightly behind Llama-3.1-Nemotron-70B-Reward.
Due to the lack of evaluation methods for reward models, current reward models are typically evaluated
using Reward Bench. However, our evaluation results from multiple RM benchmarks suggest that over-
optimization on a specific benchmark may trigger Goodhart’s law (Hoskin, 1996), resulting in degraded
performance on other benchmarks and potentially impacting downstream alignment performance. This
highlights the need for comprehensive evaluation of reward models across diverse benchmarks rather
than relying solely on a single benchmark.
More importantly, through iterative experimentation, we have also come to recognize a critical limitation:
current reward model evaluation benchmarks do not accurately predict the performance of the RL models
trained under their guidance. In other words, a higher score on RM benchmarks does not necessarily
correlate with superior performance of the resulting RL model. This insight underscores the need for
further research into more predictive evaluation methods for reward models.
5.2.4 Long Context Capabilities
We utilize three benchmarks to evaluate long context capabilities of Qwen2.5 models: RULER (Hsieh
et al., 2024), LV-Eval (Yuan et al., 2024), and Longbench-Chat (Bai et al., 2024). In LV-Eval, we adopt
keyword recall as the reported score to mitigate the high rate of false negatives present in the original
metrics.
The results are shown in Table 16 and Table 17. We can observe that the Qwen2.5 models, after equipping
length extrapolation techniques (i.e., DCA + YARN), have demonstrated strong long context processing
capabilities on the three datasets. Among them, Qwen2.5-72B-Instruct has shown the strongest perfor-
mance across all context lengths, significantly outperforming existing open-weight long-context models
as well as the proprietary models like GPT-4o-mini and GPT-4.
Furthermore, as shown in Figure 2, Qwen2.5-Turbo achieves 100% accuracy in the 1M-token passkey
retrieval task, demonstrating its exceptional ability to capture detailed information from ultra-long
contexts. We develop a sparse attention mechanism based on Minference (Jiang et al., 2024b) to signifi-
cantly enhance inference speed, which is critical for user experience when processing long contexts. For
sequences of 1M tokens, this approach reduces the computational load of the attention mechanism by
12.5 times. Figure 3 illustrates the time to first token (TTFT) of Qwen2.5-Turbo across various hardware
configurations, where our method achieves a 3.2 to 4.3 times speedup.
16
Page 17:
Table 16: Performance of Qwen2.5 Models on RULER. YARN+DCA does not change the model behavior
within 32K tokens.
Model Claimed
LengthRULER
Avg. 4K 8K 16K 32K 64K 128K
GLM4-9b-Chat-1M 1M 89.9 94.7 92.8 92.1 89.9 86.7 83.1
Llama-3-8B-Instruct-Gradient-1048k 1M 88.3 95.5 93.8 91.6 87.4 84.7 77.0
Llama-3.1-70B-Instruct 128K 89.6 96.5 95.8 95.4 94.8 88.4 66.6
GPT-4o-mini 128K 87.3 95.0 92.9 92.7 90.2 87.6 65.8
GPT-4 128K 91.6 96.6 96.3 95.2 93.2 87.0 81.2
Qwen2.5-7B-Instruct 128K 85.4 96.7 95.1 93.7 89.4 82.3 55.1
w/o DCA + YARN 80.1 96.7 95.1 93.7 89.4 74.5 31.4
Qwen2.5-14B-Instruct 128K 91.4 97.7 96.8 95.9 93.4 86.7 78.1
w/o DCA + YARN 86.5 97.7 96.8 95.9 93.4 82.3 53.0
Qwen2.5-32B-Instruct 128K 92.9 96.9 97.1 95.5 95.5 90.3 82.0
w/o DCA + YARN 88.0 96.9 97.1 95.5 95.5 85.3 57.7
Qwen2.5-72B-Instruct 128K 95.1 97.7 97.2 97.7 96.5 93.0 88.4
w/o DCA + YARN 90.8 97.7 97.2 97.7 96.5 88.5 67.0
Qwen2.5-T urbo 1M 93.1 97.5 95.7 95.5 94.8 90.8 84.5
Table 17: Performance of Qwen2.5 Models on LV-Eval and LongBench-Chat. YARN+DCA does not
change the model behavior within 32k tokens.
Model Claimed
LengthLV-EvalLongBench-
Chat 16k 32k 64k 128k 256k
GLM4-9B-Chat-1M 1M 46.4 43.2 42.9 40.4 37.0 7.82
Llama-3-8B-Instruct-Gradient-1048k 1M 31.7 31.8 28.8 26.3 21.1 6.20
Llama-3.1-70B-Instruct 128k 48.6 47.4 42.9 26.2 N/A 6.80
GPT-4o-mini 128k 52.9 48.1 46.0 40.7 N/A 8.48
Qwen2.5-7B-Instruct 128k 55.9 49.7 48.0 41.1 36.9 7.42
w/o DCA + YARN 55.9 49.7 33.1 13.6 0.5 -
Qwen2.5-14B-Instruct 128k 53.0 50.8 46.8 43.6 39.4 8.04
w/o DCA + YARN 53.0 50.8 37.0 18.4 0.8 -
Qwen2.5-32B-Instruct 128k 56.0 53.6 48.8 45.3 41.0 8.70
w/o DCA + YARN 56.0 53.6 40.1 20.5 0.7 -
Qwen2.5-72B-Instruct 128k 60.4 57.5 53.9 50.9 45.2 8.72
w/o DCA + YARN 60.4 57.5 47.4 27.0 2.4 -
Qwen2.5-T urbo 1M 53.4 50.0 45.4 43.9 38.0 8.34
Context Length (# Tokens)Document
DepthTop of
Document
Bottom of
DocumentTesting Qwen2.5 -Turbo via “Passkey Retrieval”
Retrieve Hidden Number from Irrelevant Sentences across Context Lengths and Document Depth
100%
Accuracy of
Retrieval
50%
Accuracy of
Retrieval
0%
Accuracy of
Retrieval
Figure 2: Performance of Qwen2.5-T urbo on Passkey Retrieval Task with 1M Token Lengths.
17
Page 18:
4.3x
3.1x
1.7x3.2x
2.4x
1.4x
5.6x
4.1x
2.3x4x
2.9x
1.7xFigure 3: TTFT (Time To First Token) of Qwen2.5-T urbo and Qwen2.5-7B with Full Attention and Our
Method.
6 Conclusion
Qwen2.5 represents a significant advancement in large language models (LLMs), with enhanced pre-
training on 18 trillion tokens and sophisticated post-training techniques, including supervised fine-tuning
and multi-stage reinforcement learning. These improvements boost human preference alignment, long
text generation, and structural data analysis, making Qwen2.5 highly effective for instruction-following
tasks. Available in various configurations, Qwen2.5 offers both open-weight from 0.5B to 72B parameters
and proprietary models including cost-effective MoE variants like Qwen2.5-Turbo and Qwen2.5-Plus.
Empirical evaluations show that Qwen2.5-72B-Instruct matches the performance of the state-of-the-art
Llama-3-405B-Instruct, despite being six times smaller. Qwen2.5 also serves as a foundation for specialized
models, demonstrating its versatility for domain-specific applications. We believe that Qwen2.5’s robust
performance, flexible architecture, and broad availability make it a valuable resource for both academic
research and industrial applications, positioning it as a key player of future innovations.
In the future, we will focus on advancing robust foundational models. First, we will iteratively refine both
base and instruction-tuned large language models (LLMs) by incorporating broader, more diverse, higher-
quality data. Second, we will also continue to develop multimodal models. Our goal is to integrate
various modalities into a unified framework. This will facilitate seamless, end-to-end information
processing across textual, visual, and auditory domains. Third, we are committed to enhancing the
reasoning capabilities of our models. This will be achieved through strategic scaling of inference compute
resources. These efforts aim to push the boundaries of current technological limitations and contribute to
the broader field of artificial intelligence.
7 Authors
Core Contributors: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu,
Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang,
Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le
Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia,
18
Page 19:
Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui,
Zhenru Zhang, Zihan Qiu
Contributors: Biao Sun, Bin Luo, Bin Zhang, Binghai Wang, Chaojie Yang, Chang Si, Cheng Chen,
Chengpeng Li, Chujie Zheng, Fan Hong, Guanting Dong, Guobin Zhao, Hangrui Hu, Hanyu Zhao, Hao
Lin, Hao Xiang, Haoyan Huang, Humen Zhong, Jialin Wang, Jialong Tang, Jiandong Jiang, Jianqiang
Wan, Jianxin Ma, Jianyuan Zeng, Jie Zhang, Jin Xu, Jinkai Wang, Jinzheng He, Jun Tang, Ke Yi, Keqin
Chen, Langshi Chen, Le Jiang, Lei Zhang, Liang Chen, Man Yuan, Mingkun Yang, Minmin Sun, Na Ni,
Nuo Chen, Peng Wang, Peng Zhu, Pengcheng Zhang, Pengfei Wang, Qiaoyu Tang, Qing Fu, Rong Zhang,
Ru Peng, Ruize Gao, Shanghaoran Quan, Shen Huang, Shuai Bai, Shuang Luo, Sibo Song, Song Chen,
Tao He, Ting He, Wei Ding, Wei Liao, Weijia Xu, Wenbin Ge, Wenbiao Yin, Wenyuan Yu, Xianyan Jia,
Xianzhong Shi, Xiaodong Deng, Xiaoming Huang, Ximing Zhou, Xinyu Wang, Xipin Wei, Xuejing Liu,
Yang Liu, Yang Yao, Yang Zhang, Yibo Miao, Yidan Zhang, Yikai Zhu, Yinger Zhang, Yong Jiang, Yong Li,
Yongan Yue, Yuanzhi Zhu, Yunfei Chu, Zekun Wang, Zhaohai Li, Zheren Fu, Zhi Li, Zhibo Yang, Zhifang
Guo, Zhipeng Zhang, Zhiying Xu, Zile Qiao, Ziye Meng
References
Marah I Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla,
Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat S. Behl, Alon Benhaim, Misha Bilenko, Johan
Bjorck, S ´ebastien Bubeck, Martin Cai, Caio C ´esar Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary,
Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Dan Iter, Amit
Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie
Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud
Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Chen Liang, Weishung Liu, Eric
Lin, Zeqi Lin, Piyush Madan, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun
Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset,
Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital
Shah, Ning Shang, Hiteshi Sharma, Xia Song, Masahiro Tanaka, Xin Wang, Rachel Ward, Guanhua
Wang, Philipp Witte, Michael Wyatt, Can Xu, Jiahang Xu, Sonali Yadav, Fan Yang, Ziyi Yang, Donghan
Yu, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan
Zhang, and Xiren Zhou. Phi-3 technical report: A highly capable language model locally on your
phone. CoRR , abs/2404.14219, 2024.
Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared
Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier
Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman,
Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, Robert Hero, Jining Huang, Vibhu
Jawa, Joseph Jennings, Aastha Jhunjhunwala, John Kamalu, Sadaf Khan, Oleksii Kuchaiev, Patrick
LeGresley, Hui Li, Jiwei Liu, Zihan Liu, Eileen Peters Long, Ameya Mahabaleshwarkar, Somshubra
Majumdar, James Maki, Miguel Martinez, Maer Rodrigues de Melo, Ivan Moshkov, Deepak Narayanan,
Sean Narenthiran, Jesus Navarro, Phong Nguyen, Osvald Nitski, Vahid Noroozi, Guruprasad Nutheti,
Christopher Parisien, Jupinder Parmar, Mostofa Patwary, Krzysztof Pawelec, Wei Ping, Shrimai
Prabhumoye, Rajarshi Roy, Trisha Saar, Vasanth Rao Naik Sabavat, Sanjeev Satheesh, Jane Polak
Scowcroft, Jason D. Sewall, Pavel Shamis, Gerald Shen, Mohammad Shoeybi, Dave Sizer, Misha
Smelyanskiy, Felipe Soares, Makesh Narsimhan Sreedhar, Dan Su, Sandeep Subramanian, Shengyang
Sun, Shubham Toshniwal, Hao Wang, Zhilin Wang, Jiaxuan You, Jiaqi Zeng, Jimmy Zhang, Jing Zhang,
Vivienne Zhang, Yian Zhang, and Chen Zhu. Nemotron-4 340B technical report. CoRR , abs/2406.11704,
2024.
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr ´on, and Sumit
Sanghai. GQA: Training generalized multi-query Transformer models from multi-head checkpoints. In
EMNLP , pp. 4895–4901. Association for Computational Linguistics, 2023.
Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru,
M´erouane Debbah, ´Etienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Maz-
zotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. The Falcon series of open language
models. CoRR , abs/2311.16867, 2023.
Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong.
Training-free long-context scaling of large language models. CoRR , abs/2402.17463, 2024.
Anthropic. Introducing Claude, 2023a. URL https://www.anthropic.com/index/introducing-claude .
Anthropic. Claude 2. Technical report, Anthropic, 2023b. URL https://www-files.anthropic.com/pro
duction/images/Model-Card-Claude-2.pdf .
19
Page 20:
Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Technical report, Anthropic, AI, 2024. URL
https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model Card Claud
e3.pdf .
Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan,
Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V . Le, and Charles Sutton. Program synthesis with large
language models. CoRR , abs/2108.07732, 2021.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei
Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu,
Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong
Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang,
Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan
Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang
Zhu. Qwen technical report. CoRR , abs/2309.16609, 2023.
Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. LongAlign:
A recipe for long context alignment of large language models. In EMNLP (Findings) , pp. 1376–1395.
Association for Computational Linguistics, 2024.
Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa,
Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. The Belebele benchmark:
A parallel reading comprehension dataset in 122 language variants. CoRR , abs/2308.16884, 2023.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario
Amodei. Language models are few-shot learners. In NeurIPS , 2020.
Boxi Cao, Keming Lu, Xinyu Lu, Jiawei Chen, Mengjie Ren, Hao Xiang, Peilin Liu, Yaojie Lu, Ben He,
Xianpei Han, Le Sun, Hongyu Lin, and Bowen Yu. Towards scalable automated alignment of LLMs: A
survey. CoRR , abs/2406.01252, 2024.
Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney,
Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q. Feldman, Arjun Guha, Michael Greenberg,
and Abhinav Jangda. MultiPL-E: A scalable and polyglot approach to benchmarking neural code
generation. IEEE Trans. Software Eng. , 49(7):3675–3691, 2023.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pond ´e de Oliveira Pinto, Jared Kaplan,
Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen
Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick
Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe
Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes,
Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor
Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr,
Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles
Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish,
Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. CoRR ,
abs/2107.03374, 2021.
Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia.
TheoremQA: A theorem-driven question answering dataset. In EMNLP , pp. 7889–7901. Association for
Computational Linguistics, 2023a.
Zhihong Chen, Shuo Yan, Juhao Liang, Feng Jiang, Xiangbo Wu, Fei Yu, Guiming Hardy Chen, Junying
Chen, Hongbo Zhang, Li Jianquan, Wan Xiang, and Benyou Wang. MultilingualSIFT: Multilingual
supervised instruction fine-tuning, 2023b. URL https://github.com/FreedomIntelligence/Multili
ngualSIFT .
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind
Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR ,
abs/1803.05457, 2018.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias
Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman.
Training verifiers to solve math word problems. CoRR , abs/2110.14168, 2021.
20
Page 21:
Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding
Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and
Wenfeng Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language
models. CoRR , abs/2401.06066, 2024.
Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated
convolutional networks. In ICML , volume 70 of Proceedings of Machine Learning Research , pp. 933–941.
PMLR, 2017.
Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou.
Self-play with execution feedback: Improving instruction-following capabilities of large language
models. CoRR , abs/2406.13542, 2024.
Shihan Dou, Jiazheng Zhang, Jianxiang Zang, Yunbo Tao, Haoxiang Jia, Shichun Liu, Yuming Yang,
Shenxi Wu, Shaoqing Zhang, Muling Wu, et al. Multi-programming language sandbox for llms. CoRR ,
abs/2410.23074, 2024.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha
Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn,
Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston
Zhang, Aur ´elien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozi `ere, Bethany Biron, Binh
Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell,
Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus
Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv
Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin,
Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang,
Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Gr ´egoire Mialon, Guan
Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov,
Imanol Arrieta Ibarra, Isabel M. Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan
Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock,
Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu,
Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia,
Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, and
et al. The Llama 3 herd of models. CoRR , abs/2407.21783, 2024.
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter
models with simple and efficient sparsity. J. Mach. Learn. Res. , 23:120:1–120:39, 2022.
Alena Fenogenova, Artem Chervyakov, Nikita Martynov, Anastasia Kozlova, Maria Tikhonova, Albina
Akhmetgareeva, Anton A. Emelyanov, Denis Shevelev, Pavel Lebedev, Leonid Sinev, Ulyana Isaeva, Ka-
terina Kolomeytseva, Daniil Moskovskiy, Elizaveta Goncharova, Nikita Savushkin, Polina Mikhailova,
Denis Dimitrov, Alexander Panchenko, and Sergey Markov. MERA: A comprehensive LLM evaluation
in russian. CoRR , abs/2401.04531, 2024.
Evan Frick, Peter Jin, Tianle Li, Karthik Ganesan, Jian Zhang, Jiantao Jiao, and Banghua Zhu. Athene-70b:
Redefining the boundaries of post-training for open models, July 2024a. URL https://nexusflow.ai/b
logs/athene .
Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Jiantao Jiao,
Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. How to evaluate reward models for RLHF. CoRR ,
abs/2410.14872, 2024b.
Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino,
Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we
done with mmlu? CoRR , abs/2406.04127, 2024.
Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.
Technical report, Google, 2024. URL https://storage.googleapis.com/deepmind-media/gemini/gemi
niv15report.pdf .
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju,
L´eonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ram ´e, et al. Gemma 2: Improving
open language models at a practical size. CoRR , abs/2408.00118, 2024.
Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana
Krishnan, Marc’Aurelio Ranzato, Francisco Guzm ´an, and Angela Fan. The Flores-101 evaluation
benchmark for low-resource and multilingual machine translation. Trans. Assoc. Comput. Linguistics , 10:
522–538, 2022.
21
Page 22:
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob
Steinhardt. Measuring massive multitask language understanding. In ICLR . OpenReview.net, 2021a.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song,
and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In NeurIPS
Datasets and Benchmarks , 2021b.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford,
Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-
optimal large language models. CoRR , abs/2203.15556, 2022.
Keith Hoskin. The “awful idea of accountability”: Inscribing people into the measurement of objects.
Accountability: Power, ethos and the technologies of managing , 1996.
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang,
and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? CoRR ,
abs/2404.06654, 2024.
Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang
Huang, Weilin Zhao, Xinrong Zhang, Zhen Leng Thai, Kai Zhang, Chongyi Wang, Yuan Yao, Chenyang
Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu,
and Maosong Sun. MiniCPM: Unveiling the potential of small language models with scalable training
strategies. CoRR , abs/2404.06395, 2024.
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen
Yu, Keming Lu, et al. Qwen2.5-Coder technical report. CoRR , abs/2409.12186, 2024.
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-
Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of
large language models for code. CoRR , abs/2403.07974, 2024.
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego
de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L ´elio Renard
Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth ´ee
Lacroix, and William El Sayed. Mistral 7B. CoRR , abs/2310.06825, 2023a.
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford,
Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel,
Guillaume Bour, Guillaume Lample, L ´elio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux,
Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Th ´eophile Gervet,
Thibaut Lavril, Thomas Wang, Timoth ´ee Lacroix, and William El Sayed. Mixtral of experts. CoRR ,
abs/2401.04088, 2024a.
Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han,
Amir H Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Minference 1.0: Accelerating
pre-filling for long-context llms via dynamic sparse attention. arXiv preprint arXiv:2407.02490 , 2024b.
Zixuan Jiang, Jiaqi Gu, Hanqing Zhu, and David Z. Pan. Pre-RMSNorm and Pre-CRMSNorm Transform-
ers: Equivalent and efficient pre-LN Transformers. CoRR , abs/2305.14858, 2023b.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott
Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. CoRR ,
abs/2001.08361, 2020.
Fajri Koto, Nurul Aisyah, Haonan Li, and Timothy Baldwin. Large language models only pass primary
school exams in Indonesia: A comprehensive test on IndoMMLU. In EMNLP , pp. 12359–12374.
Association for Computational Linguistics, 2023.
Nathan Lambert, Valentina Pyatkin, Jacob Daniel Morrison, Lester James Validad Miranda, Bill Yuchen
Lin, Khyathi Raghavi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith,
and Hanna Hajishirzi. RewardBench: Evaluating reward models for language modeling. CoRR ,
abs/2403.13787, 2024.
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim
Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation
and automatic sharding. CoRR , abs/2006.16668, 2020.
22
Page 23:
Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez,
and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder
pipeline. CoRR , abs/2406.11939, 2024.
Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human
falsehoods. In ACL (1) , pp. 3214–3252. Association for Computational Linguistics, 2022a.
Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott,
Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura,
Vishrav Chaudhary, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona T. Diab,
Veselin Stoyanov, and Xian Li. Few-shot learning with multilingual generative language models. In
EMNLP , pp. 9019–9052. Association for Computational Linguistics, 2022b.
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT
really correct? Rigorous evaluation of large language models for code generation. In NeurIPS , 2023.
Keming Lu, Bowen Yu, Fei Huang, Yang Fan, Runji Lin, and Chang Zhou. Online merging optimizers for
boosting rewards and mitigating tax in alignment. CoRR , abs/2405.17931, 2024a.
Keming Lu, Bowen Yu, Chang Zhou, and Jingren Zhou. Large language models are superpositions of all
characters: Attaining arbitrary role-play via self-alignment. CoRR , abs/2401.12474, 2024b.
Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao,
M. Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev,
Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and
Colin Raffel. Crosslingual generalization through multitask finetuning. In ACL (1) , pp. 15991–16111.
Association for Computational Linguistics, 2023.
Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsuvas Borkakoty,
Eunsu Kim, Carla P ´erez-Almendros, Abinew Ali Ayele, V ´ıctor Guti ´errez-Basulto, Yazm ´ın Ib ´a˜nez-
Garc ´ıa, Hwaran Lee, Shamsuddeen Hassan Muhammad, Ki-Woong Park, Anar Sabuhi Rzayev, Nina
White, Seid Muhie Yimam, Mohammad Taher Pilehvar, Nedjma Ousidhoum, Jos ´e Camacho-Collados,
and Alice Oh. Blend: A benchmark for llms on everyday knowledge in diverse cultures and languages.
CoRR , abs/2406.09948, 2024.
OpenAI. GPT4 technical report. CoRR , abs/2303.08774, 2023.
OpenAI. Hello GPT-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/ .
OpenAI. Learning to reason with LLMs, 2024b. URL https://openai.com/index/learning-to-reaso
n-with-llms/ .
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong
Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke
Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe.
Training language models to follow instructions with human feedback. In NeurIPS , 2022.
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window
extension of large language models. CoRR , abs/2309.00071, 2023.
Edoardo Maria Ponti, Goran Glavas, Olga Majewska, Qianchu Liu, Ivan Vulic, and Anna Korhonen.
XCOPA: A multilingual dataset for causal commonsense reasoning. In EMNLP (1) , pp. 2362–2376.
Association for Computational Linguistics, 2020.
Shanghaoran Quan, Tianyi Tang, Bowen Yu, An Yang, Dayiheng Liu, Bofei Gao, Jianhong Tu, Yichang
Zhang, Jingren Zhou, and Junyang Lin. Language models can self-lengthen to generate long texts.
CoRR , abs/2410.23933, 2024.
Qwen Team. Code with CodeQwen1.5, 2024a. URL https://qwenlm.github.io/blog/codeqwen1.5/ .
Qwen Team. Introducing Qwen1.5, 2024b. URL https://qwenlm.github.io/blog/qwen1.5/ .
Qwen Team. Introducing Qwen2-Math, 2024c. URL https://qwenlm.github.io/blog/qwen2-math/ .
Qwen Team. QwQ: Reflect deeply on the boundaries of the unknown, 2024d. URL https://qwenlm.git
hub.io/blog/qwq-32b-preview/ .
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understand-
ing by generative pre-training. Technical report, OpenAI, 2018.
23
Page 24:
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn.
Direct preference optimization: Your language model is secretly a reward model. In NeurIPS , 2023.
Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad
Awan, Jeff Rasley, and Yuxiong He. DeepSpeed-MoE: Advancing mixture-of-experts inference and
training to power next-generation AI scale. In ICML , volume 162 of Proceedings of Machine Learning
Research , pp. 18332–18346. PMLR, 2022.
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani,
Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark.
CoRR , abs/2311.12022, 2023.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial
winograd schema challenge at scale. Commun. ACM , 64(9):99–106, 2021.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with
subword units. In ACL (1) . The Association for Computer Linguistics, 2016.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and
Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.
CoRR , abs/2402.03300, 2024.
Jianlin Su. The magical effect of the Bias term: RoPE + Bias = better length extrapolation, 2023. URL
https://spaces.ac.cn/archives/9577 .
Jianlin Su, Murtadha H. M. Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer:
Enhanced Transformer with rotary position embedding. Neurocomputing , 568:127063, 2024.
Mirac Suzgun, Nathan Scales, Nathanael Sch ¨arli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung,
Aakanksha Chowdhery, Quoc V . Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench
tasks and whether chain-of-thought can solve them. In ACL (Findings) , pp. 13003–13051. Association
for Computational Linguistics, 2023.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth ´ee Lacroix,
Baptiste Rozi `ere, Naman Goyal, Eric Hambro, Faisal Azhar, Aur ´elien Rodriguez, Armand Joulin,
Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models.
CoRR , abs/2302.13971, 2023a.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-
Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian
Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui
Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev,
Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu,
Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew
Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael
Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang
Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan
Narang, Aur ´elien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open
foundation and fine-tuned chat models. CoRR , abs/2307.09288, 2023b.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS , pp. 5998–6008, 2017.
Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu
Zhou, Chenyu Shi, et al. Secrets of RLHF in large language models part II: Reward modeling. CoRR ,
abs/2401.06080, 2024a.
Changhan Wang, Kyunghyun Cho, and Jiatao Gu. Neural machine translation with byte-level subwords.
InAAAI , pp. 9154–9160. AAAI Press, 2020.
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren,
Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang
Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding
benchmark. CoRR , abs/2406.01574, 2024b.
Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii
Kuchaiev, and Yi Dong. HelpSteer2-Preference: Complementing ratings with preferences. CoRR ,
abs/2410.01257, 2024c.
24
Page 25:
Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-
Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie
Neiswanger, and Micah Goldblum. LiveBench: A challenging, contamination-free LLM benchmark.
CoRR , abs/2406.19314, 2024.
Hao Xiang, Bowen Yu, Hongyu Lin, Keming Lu, Yaojie Lu, Xianpei Han, Le Sun, Jingren Zhou, and
Junyang Lin. Aligning large language models via self-steering optimization. CoRR , abs/2410.17131,
2024.
Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi
Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad,
Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang,
and Hao Ma. Effective long-context scaling of foundation models. CoRR , abs/2309.16039, 2023.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li,
Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang,
Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He,
Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang,
Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu,
Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang,
Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu,
Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, and Zhihao Fan. Qwen2 technical report. CoRR ,
abs/2407.10671, 2024a.
An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu,
Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model
via self-improvement. CoRR , abs/2409.12122, 2024b.
Jian Yang, Jiaxi Yang, Ke Jin, Yibo Miao, Lei Zhang, Liqun Yang, Zeyu Cui, Yichang Zhang, Binyuan Hui,
and Junyang Lin. Evaluating and aligning codellms on human preference. CoRR , abs/2412.05210,
2024c.
Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. PAWS-X: A cross-lingual adversarial dataset
for paraphrase identification. In EMNLP/IJCNLP (1) , pp. 3685–3690. Association for Computational
Linguistics, 2019.
Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng
Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming
Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi
Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. Yi: Open
foundation models by 01.AI. CoRR , abs/2403.04652, 2024.
Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao,
Dahua Lin, Boxun Li, Guohao Dai, Shengen Yan, and Yu Wang. LV-Eval: A balanced long-context
benchmark with 5 length levels up to 256K. CoRR , abs/2402.05136, 2024.
Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling
relationship on learning mathematical reasoning with large language models. CoRR , abs/2308.01825,
2023.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine
really finish your sentence? In ACL (1) , pp. 4791–4800. Association for Computational Linguistics,
2019.
Yidan Zhang, Boyi Deng, Yu Wan, Baosong Yang, Haoran Wei, Fei Huang, Bowen Yu, Junyang Lin, and
Jingren Zhou. P-MMEval: A parallel multilingual multitask benchmark for consistent evaluation of
LLMs. CoRR , abs/2411.09116, 2024.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin,
Zhuohan Li, Dacheng Li, Eric P . Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging
LLM-as-a-judge with MT-Bench and Chatbot Arena. In NeurIPS , 2023.
Enyu Zhou, Guodong Zheng, Bing Wang, Zhiheng Xi, Shihan Dou, Rong Bao, Wei Shen, Limao Xiong,
Jessica Fan, Yurong Mou, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang. RMB: Comprehensively
benchmarking reward models in LLM alignment. CoRR , abs/2410.09893, 2024.
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and
Le Hou. Instruction-following evaluation for large language models. CoRR , abs/2311.07911, 2023.
25
Page 26:
Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William
Fedus. ST-MoE: Designing stable and transferable sparse expert models. CoRR , abs/2202.08906, 2022.
26