Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts
Weigao Sun1†, Disen Lan1,2, Tong Zhu3, Xiaoye Qu1, Yu Cheng4
1 Shanghai AI Laboratory, 2 South China University of Technology, 3 Soochow University,
4 The Chinese University of Hong Kong
Abstract
Linear Sequence Modeling (LSM) methods, such as linear attention, state space models, and linear RNNs, together with Mixture-of-Experts (MoE), have recently emerged as significant architectural improvements. In this paper, we introduce Linear-MoE, a production-level system for modeling and training large-scale models that integrate LSM with MoE. Linear-MoE leverages the advantages of both LSM modules for linear-complexity sequence modeling and MoE layers for sparse activation, aiming to offer high performance with efficient training. The Linear-MoE system comprises: 1) a Modeling subsystem, which provides a unified framework supporting all instances of LSM, and 2) a Training subsystem, which facilitates efficient training by incorporating various advanced parallelism technologies, particularly Sequence Parallelism designed for Linear-MoE models. Additionally, we explore hybrid models that combine Linear-MoE layers with standard Transformer-MoE layers, together with their Sequence Parallelism, to further enhance model flexibility and performance. Evaluations on two model series, A0.3B-2B and A1B-7B, demonstrate that Linear-MoE achieves efficiency gains while maintaining competitive performance on various benchmarks, showcasing its potential as a next-generation foundational model architecture. Code: https://github.com/OpenSparseLLMs/Linear-MoE
1 Introduction
Mixture-of-Experts (MoE) (Jacobs et al., 1991; Qu et al., 2024) architectures have gained widespread adoption in cutting-edge models within industry, with prominent examples including Gemini-1.5 (Reid et al., 2024) and the reported use of MoE in GPT-4 (Chintala, 2023). Other notable large models incorporating MoE techniques include Mixtral (Jiang et al., 2024), DeepSeek V2 (Liu et al., 2024), Qwen2 (Yang et al., 2024a), JetMoE (Shen et al., 2024), Jamba (Team et al., 2024), and OLMoE (Muennighoff et al., 2024).
†Project lead. Corresponding to Weigao Sun (sunweigao@outlook.com). Work done during Disen Lan's internship at Shanghai AI Laboratory.
Most advances in MoE research primarily concentrate
on modifying the routing mechanism or expert
layers, while typically keeping the attention layers
unchanged (Zhu et al., 2024b). These attention
layers commonly rely on the softmax self-attention
mechanism introduced in the Transformer archi-
tecture (Vaswani et al., 2017). The softmax-based
self-attention has proven to be highly effective for
sequence modeling tasks across various data types.
However, a significant limitation of this mecha-
nism is its computational complexity, which grows
quadratically with the input sequence length. This
complexity can lead to substantial computational
costs, especially during training, making it a challenge
for models that need to handle long sequences
efficiently.
Linear sequence modeling (LSM) has recently
gained significant attention due to its impressive ef-
ficiency in both training and inference. These meth-
ods function similarly to recurrent neural networks
(RNNs) with matrix-valued hidden states, allowing
them to achieve linear-time training and constant-
memory inference. This efficiency is largely due
to the fact that LSM techniques bypass the com-
putation of attention scores and eliminate the need
for maintaining a key-value (KV) cache. There are
three primary approaches to linear sequence mod-
eling: linear attention (Katharopoulos et al., 2020),
state space models (SSM) (Gu and Dao, 2023; Dao
and Gu, 2024; Hu et al., 2024; Waleffe et al., 2024),
and linear RNN (Peng et al., 2023, 2024; Qin et al.,
2024d). Linear attention is a variation of the tra-
ditional softmax attention mechanism, replacing
the exponential kernel with a simpler dot product
between key and query vectors, which enables the
arXiv:2503.05447v1 [cs.LG] 7 Mar 2025
use of the right-product kernel trick to reduce com-
putational complexity. SSM approaches, such as
Mamba and Mamba2, stem from control theory
and represent sequence modeling as dynamic sys-
tems. Meanwhile, linear RNN methods address
the limitations of traditional RNNs in modeling
long contexts by enabling parallel training of RNN
models. These different methods, linear attention,
SSM, and linear RNN, share a common mathemat-
ical foundation and exhibit similar performance on
sequence modeling tasks (Dao and Gu, 2024; Peng
et al., 2024; Qin et al., 2024b; Yang et al., 2024c).
In fact, they all employ a unified recurrence framework expressed as $\mathbf{M}_s = \mathbf{M}_{s-1} + \widehat{\mathbf{M}}_s$, where $\mathbf{M}_s$ denotes the memory state and $\widehat{\mathbf{M}}_s$ represents the incremental memory update at the $s$-th token.
In this paper, we introduce Linear-MoE, a
production-level system designed for modeling and
training of large-scale MoE models with LSM mod-
ules integrated. The Linear-MoE system is com-
posed of two key subsystems: Modeling and Train-
ing. The Modeling subsystem provides a unified
linear sequence modeling framework for Linear-
MoE models. It supports three main types of
LSM methods: linear attention, state space model
(SSM), and linear RNN. For each type, multiple
instances are implemented under a unified formulation. The Training subsystem is designed to achieve efficient training of Linear-MoE models on modern accelerators. In addition to supporting state-of-the-art training techniques, we incorporate a specialized Sequence Parallelism (SP) technique for LSM modules, which is particularly effective for handling extremely long input sequences on the Linear-MoE architecture. Importantly, the system is designed to be extensible, enabling more advanced sequence modeling methods or training techniques to be integrated in the future. Furthermore, we also explore efficient modeling and training for hybrid
Linear-MoE models, which combine Linear-MoE
layers with standard Transformer-MoE layers. For
hybrid models, we introduce an SP method that em-
ploys distinct computational and communication
strategies tailored to the different types of layers.
Our contributions can be summarized as follows:
• Production-level System. We introduce Linear-MoE, a production-level system designed for efficient modeling and training of large-scale MoE models with LSM modules integrated.
• Modeling & Training Subsystems. The Linear-MoE system is composed of two subsystems: Modeling and Training. We provide a unified linear sequence modeling formulation to support various LSM modules with MoE layers, as well as state-of-the-art training techniques for efficient large-scale model training, especially on long-context inputs.
• Experimental Validation. In empirical studies, we pretrain two series of Linear-MoE models from scratch on the public SlimPajama corpus. Extensive experiments validate the training and inference efficiency of our system framework, as well as the performance of the Linear-MoE architecture on various downstream tasks.
2 Linear-MoE System
2.1 Modeling
2.1.1 Unified Linear Sequence Modeling
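The standard softmax attention (Vaswani et al., 2017), commonly used in transformer models, has a parallel computation form during training that can typically be expressed as:
$$\mathbf{O} = \mathrm{Softmax}(\mathbf{Q}\mathbf{K}^\top)\mathbf{V}. \tag{1}$$
Here, the matrices $\mathbf{Q}, \mathbf{K}, \mathbf{V}, \mathbf{O} \in \mathbb{R}^{N \times d}$ correspond to the query, key, value, and output matrices, respectively. The matrices $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are linear projections of the input matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$, defined as $\mathbf{Q} = \mathbf{X}\mathbf{W}_Q$, $\mathbf{K} = \mathbf{X}\mathbf{W}_K$, and $\mathbf{V} = \mathbf{X}\mathbf{W}_V$, where $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d \times d}$ are learnable weight matrices. Here, $N$ and $d$ represent the sequence length and hidden dimension.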
Linear Attention (Katharopoulos et al., 2020), as one of the representative LSM methods, has emerged as a viable alternative to traditional softmax attention by implementing two primary modifications. First, it eliminates the $\mathrm{Softmax}(\cdot)$ operation, instead embedding it within a kernel feature map. Second, it leverages the associative property of matrix multiplication, reconfiguring $(\mathbf{Q}\mathbf{K}^\top)\mathbf{V}$ into $\mathbf{Q}(\mathbf{K}^\top\mathbf{V})$. These changes reduce both the computational and memory complexity from $O(N^2 d)$ to $O(N d^2)$. This approach is frequently referred to as the right-product kernel trick, as it prioritizes the matrix product on the right side.
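The associativity argument above can be checked numerically. The following is a minimal NumPy sketch, assuming the unmasked (non-causal) case with an identity feature map; the sizes `N` and `d` are illustrative values and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 256, 16  # illustrative sequence length and hidden dimension
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

# Left-product order: materializes an N x N matrix, roughly O(N^2 d) work.
out_left = (Q @ K.T) @ V

# Right-product order: materializes a d x d matrix first, roughly O(N d^2) work.
out_right = Q @ (K.T @ V)

# Largest elementwise gap between the two orders (floating-point noise only).
max_abs_diff = float(np.max(np.abs(out_left - out_right)))
```

Both orders give the same output up to floating-point error; only the size of the intermediate matrix changes, which is what makes the right-product order attractive when $N \gg d$.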
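Table 1: Instances of Linear Sequence Modeling Methods. All instances listed follow the unified formulation in Eq. (5). Here, $a \in \mathbb{R}$, $a_s \in \mathbb{R}$, $\mathbf{a}_s \in \mathbb{R}^{d}$, $\mathbf{A} \in \mathbb{R}^{d \times d}$, and $\mathbf{A}_s \in \mathbb{R}^{d \times d}$ represent a fixed constant, a time-dependent scalar, a time-dependent vector, a time-independent matrix, and a time-dependent matrix, respectively. Note that the same notation may denote different variables in different instances.

| LSM Method | Instance | Recurrent Update Rule | Parameter |
| --- | --- | --- | --- |
| Linear Attention | BLA | $\mathbf{M}_s = \mathbf{M}_{s-1} + \mathbf{k}_s^\top \mathbf{v}_s$ | \ |
| Linear Attention | Lightning Attn | $\mathbf{M}_s = a\mathbf{M}_{s-1} + \mathbf{k}_s^\top \mathbf{v}_s$ | $a \in \mathbb{R}$ |
| Linear Attention | RetNet | $\mathbf{M}_s = a\mathbf{M}_{s-1} + \mathbf{k}_s^\top \mathbf{v}_s$ | $a \in \mathbb{R}$ |
| Linear Attention | GLA | $\mathbf{M}_s = \mathrm{diag}\{\mathbf{a}_s\}\mathbf{M}_{s-1} + \mathbf{k}_s^\top \mathbf{v}_s$ | $\mathbf{a}_s \in \mathbb{R}^{d}$ |
| Linear Attention | DeltaNet | $\mathbf{M}_s = (\mathbf{I} - a_s \mathbf{k}_s^\top \mathbf{k}_s)\mathbf{M}_{s-1} + b_s \mathbf{k}_s^\top \mathbf{v}_s$ | $a_s, b_s \in \mathbb{R}$ |
| Linear Attention | Rebased | $\mathbf{M}_s = \mathbf{M}_{s-1} + \phi(\mathbf{k}_s)^\top \mathbf{v}_s$ | \ |
| Linear Attention | GFW | $\mathbf{M}_s = \mathbf{A}_s \odot \mathbf{M}_{s-1} + \mathbf{k}_s^\top \mathbf{v}_s$ | $\mathbf{A}_s \in \mathbb{R}^{d \times d}$ |
| Linear Attention | GateLoop | $\mathbf{M}_s = \mathbf{A}_s \odot \mathbf{M}_{s-1} + \mathbf{k}_s^\top \mathbf{v}_s$ | $\mathbf{A}_s \in \mathbb{R}^{d \times d}$ |
| Linear Attention | Gated DeltaNet | $\mathbf{M}_s = a_s(\mathbf{I} - \mathbf{k}_s^\top \mathbf{k}_s)\mathbf{M}_{s-1} + b_s \mathbf{k}_s^\top \mathbf{v}_s$ | $a_s, b_s \in \mathbb{R}$ |
| Linear Attention | TTT | $\mathbf{M}_s = \mathbf{M}_{s-1} + b_s \nabla l(\mathbf{M}_{s-1}; \mathbf{k}_s, \mathbf{v}_s)$ | $b_s \in \mathbb{R}$ |
| Linear Attention | Titans | $\mathbf{M}_s = a_s \mathbf{M}_{s-1} + b_s \nabla l(\mathbf{M}_{s-1}; \mathbf{k}_s, \mathbf{v}_s)$ | $a_s, b_s \in \mathbb{R}$ |
| SSM* | S4 | $\mathbf{M}_s = \exp(-(\mathbf{a}\mathbf{1}^\top)\mathbf{A}) \odot \mathbf{M}_{s-1} + (\mathbf{a}\mathbf{1}^\top)\mathbf{b}^\top \mathbf{v}_s$ | $\mathbf{a}, \mathbf{b} \in \mathbb{R}^{d}$, $\mathbf{A} \in \mathbb{R}^{d \times d}$ |
| SSM* | Mamba | $\mathbf{M}_s = \exp(-(\mathbf{a}_s\mathbf{1}^\top)\mathbf{A}_s) \odot \mathbf{M}_{s-1} + (\mathbf{a}_s\mathbf{1}^\top)\mathbf{k}_s^\top \mathbf{v}_s$ | $\mathbf{a}_s \in \mathbb{R}^{d}$, $\mathbf{A}_s \in \mathbb{R}^{d \times d}$ |
| SSM* | Mamba2 | $\mathbf{M}_s = \exp(-a b_s) \odot \mathbf{M}_{s-1} + b_s \mathbf{k}_s^\top \mathbf{v}_s$ | $a, b_s \in \mathbb{R}$ |
| Linear RNN | HGRN2 | $\mathbf{M}_s = \mathrm{diag}\{\mathbf{a}_s\}\mathbf{M}_{s-1} + (1 - \mathbf{a}_s)^\top \mathbf{v}_s$ | $\mathbf{a}_s \in \mathbb{R}^{d}$ |
| Linear RNN | RWKV6 | $\mathbf{M}_s = \mathrm{diag}\{\mathbf{a}_s\}\mathbf{M}_{s-1} + \mathbf{k}_s^\top \mathbf{v}_s$ | $\mathbf{a}_s \in \mathbb{R}^{d}$ |
| Linear RNN | RWKV7 | $\mathbf{M}_s = \mathrm{diag}\{\mathbf{a}_s\}\mathbf{M}_{s-1} + \nabla l(\mathbf{M}_{s-1}; \mathbf{k}_s, \mathbf{v}_s)$ | $\mathbf{a}_s \in \mathbb{R}^{d}$ |

*For both S4 and Mamba, the Euler Discretization (Gu et al., 2020) is applied, such that $\bar{\mathbf{B}} = \Delta\mathbf{B}$, and the unprojected $\mathbf{x}_s$ is denoted as $\mathbf{v}_s$ for consistency with other formulas.

During inference, both softmax self-attention and linear attention handle a single token at each iteration. Given the $s$-th token $\mathbf{x}_s \in \mathbb{R}^{1 \times d}$, softmax self-attention computes the output $\mathbf{o}_s$ in a way that requires the storage of an expanding set of keys $\{\mathbf{k}_1, \cdots, \mathbf{k}_s\}$ and values $\{\mathbf{v}_1, \cdots, \mathbf{v}_s\}$, i.e., the KV cache, which leads to a significant memory burden when dealing with long input sequences:
$$\mathbf{q}_s, \mathbf{k}_s, \mathbf{v}_s = \mathbf{x}_s\mathbf{W}_Q, \mathbf{x}_s\mathbf{W}_K, \mathbf{x}_s\mathbf{W}_V, \qquad \mathbf{o}_s = \frac{\sum_{i=1}^{s} \exp(\mathbf{q}_s \mathbf{k}_i^\top)\mathbf{v}_i}{\sum_{i=1}^{s} \exp(\mathbf{q}_s \mathbf{k}_i^\top)}. \tag{2}$$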
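Linear attention replaces the term $\exp(\mathbf{q}_s \mathbf{k}_i^\top)$ with a kernel $k(\mathbf{x}, \mathbf{y})$ with an associated feature map $\phi$, i.e., $k(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle$. This simplifies the calculation of $\mathbf{o}_s$ as
$$\mathbf{o}_s = \frac{\sum_{i=1}^{s} \phi(\mathbf{q}_s)\phi(\mathbf{k}_i)^\top \mathbf{v}_i}{\sum_{i=1}^{s} \phi(\mathbf{q}_s)\phi(\mathbf{k}_i)^\top}. \tag{3}$$
Letting $\mathbf{M}_s = \sum_{i=1}^{s} \phi(\mathbf{k}_i)^\top \mathbf{v}_i$ and $\mathbf{z}_s = \sum_{i=1}^{s} \phi(\mathbf{k}_i)^\top$, where $\mathbf{M}_s \in \mathbb{R}^{d \times d}$ and $\mathbf{z}_s \in \mathbb{R}^{d \times 1}$, we can rewrite Eq. (3) as an RNN:
$$\mathbf{M}_s = \mathbf{M}_{s-1} + \phi(\mathbf{k}_s)^\top \mathbf{v}_s, \qquad \mathbf{z}_s = \mathbf{z}_{s-1} + \phi(\mathbf{k}_s)^\top, \qquad \mathbf{o}_s = \frac{\phi(\mathbf{q}_s)\mathbf{M}_s}{\phi(\mathbf{q}_s)\mathbf{z}_s}. \tag{4}$$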
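To make the equivalence between Eq. (3) and Eq. (4) concrete, the sketch below compares a token-by-token recurrence against the causal parallel form. It is a small NumPy illustration under one assumption: the feature map is taken to be $\phi(x) = \mathrm{elu}(x) + 1$, a common positive choice that the text above does not prescribe. The function names are hypothetical.

```python
import numpy as np

def phi(x):
    # Assumed feature map: elu(x) + 1, which is strictly positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention_recurrent(Q, K, V):
    """Eq. (4): constant-size state (M, z), one token per step."""
    N, d = Q.shape
    M = np.zeros((d, V.shape[1]))  # memory state M_s
    z = np.zeros(d)                # normalizer state z_s
    out = np.zeros_like(V)
    for s in range(N):
        q, k = phi(Q[s]), phi(K[s])
        M += np.outer(k, V[s])     # M_s = M_{s-1} + phi(k_s)^T v_s
        z += k                     # z_s = z_{s-1} + phi(k_s)^T
        out[s] = (q @ M) / (q @ z)
    return out

def linear_attention_parallel(Q, K, V):
    """Eq. (3) for all positions at once, using a causal (lower-triangular) mask."""
    scores = np.tril(phi(Q) @ phi(K).T)
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 8)) for _ in range(3))
out_rec = linear_attention_recurrent(Q, K, V)
out_par = linear_attention_parallel(Q, K, V)
```

The recurrent version keeps only `M` and `z` between steps, so its memory does not grow with the number of tokens processed, in contrast to the KV cache needed by Eq. (2).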
Follow-up studies on SSM (e.g., Mamba2) and linear RNNs (e.g., RWKV6, HGRN2) have demonstrated their similarity with linear attention (Dao and Gu, 2024; Peng et al., 2024). In fact, recent studies (Qin et al., 2024b; Yang et al., 2024c) have suggested that linear attention, state space, and linear RNN sequence modeling methods can be expressed within a unified recurrence framework as:
$$\widehat{\mathbf{M}}_s = f(\mathbf{k}_s^\top, \mathbf{v}_s), \qquad \mathbf{M}_s = \mathbf{\Theta}_s \diamond \mathbf{M}_{s-1} + \widehat{\mathbf{M}}_s. \tag{5}$$
In this formulation, $\widehat{\mathbf{M}}_s \in \mathbb{R}^{d \times d}$ represents the memory state corresponding to the $s$-th input, which is a function of $\mathbf{k}_s^\top$ and $\mathbf{v}_s$. $\mathbf{\Theta}_s$ denotes a coefficient matrix that may be time-varying or constant (and can also be a vector or scalar). The operator "$\diamond$" can denote either standard matrix multiplication or the Hadamard product. We collect recent LSM method instances which follow the unified formulation in Eq. (5) and list them in Table 1.
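As an illustration of how one recurrence can cover several instances, the sketch below implements Eq. (5) with a pluggable transition for three Table 1 rows (BLA, a RetNet-style scalar decay, and a GLA-style diagonal gate). It assumes $f(\mathbf{k}_s^\top, \mathbf{v}_s) = \mathbf{k}_s^\top \mathbf{v}_s$ and an unnormalized readout $\mathbf{o}_s = \mathbf{q}_s \mathbf{M}_s$; the function name, the decay value, and the random gates are hypothetical and not taken from the Linear-MoE code.

```python
import numpy as np

def unified_recurrence(Q, K, V, transition):
    """M_s = transition(s, M_{s-1}) + k_s^T v_s, with readout o_s = q_s M_s."""
    N, d = K.shape
    M = np.zeros((d, V.shape[1]))
    out = np.zeros((N, V.shape[1]))
    for s in range(N):
        M = transition(s, M) + np.outer(K[s], V[s])
        out[s] = Q[s] @ M
    return out

rng = np.random.default_rng(0)
N, d = 32, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
a = 0.9                                        # fixed scalar decay (illustrative)
gates = rng.uniform(0.5, 1.0, size=(N, d))     # per-token gate vectors a_s (illustrative)

out_bla = unified_recurrence(Q, K, V, lambda s, M: M)                      # Theta_s = I
out_ret = unified_recurrence(Q, K, V, lambda s, M: a * M)                  # Theta_s = a
out_gla = unified_recurrence(Q, K, V, lambda s, M: gates[s][:, None] * M)  # diag{a_s} M
```

Only the `transition` argument changes between instances, which mirrors how Table 1 differs mainly in the handling of $\mathbf{M}_{s-1}$.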
2.1.2 Linear-MoE Architecture
The Linear-MoE architecture is relatively straightforward, consisting of $L\times$ stacked Linear-MoE blocks, as depicted in Fig. 1. Each Linear-MoE block includes an LSM layer and an MoE layer, with a normalization layer preceding each. The LSM layer serves as a generalized structure that supports various LSM methods, specifically, linear attention, SSM, and linear RNN, each encompassing multiple instance methods. Table 1 provides an overview of these LSM method instances, unified under a common recurrence framework. This framework highlights key distinctions between various LSM instances, primarily in their handling of the prior-step memory state $\mathbf{M}_{s-1}$ and the computation of the incremental memory state $\widehat{\mathbf{M}}_s$. For the MoE layers, we retain the standard mechanisms of sparse expert activation and routing, as employed in SOTA open-source MoE models. These mechanisms are essential for maintaining an optimal balance between model performance and computational efficiency.

Figure 1: Linear-MoE Architecture. In each Linear-MoE block, there is both an LSM layer and an MoE layer, with each layer preceded by its own normalization layer. The LSM layer is designed as a flexible abstraction of LSM methods, including linear attention, SSM, and linear RNN, which follow a unified recurrence framework.
In this paper, we refer to models composed ex-
clusively of Linear-MoE layers as pure Linear-
MoE models. These models achieve high effi-
ciency during both training and inference, benefit-
ing from the LSM modules embedded in each layer.
However, despite these advantages, empirical re-
search (Lieber et al., 2024; Ren et al., 2024; Waleffe
et al., 2024) has shown that models relying solely
on LSM modules tend to underperform on tasks re-
quiring strong recall capabilities, such as in-context
learning (e.g., five-shot MMLU (Hendrycks et al.,
2020), Phone-book lookup (Jelassi et al., 2024),
Needle In A Haystack (Briakou et al., 2023)) and
long-context reasoning. In such cases, a hybrid ar-
chitecture that interleaves linear transformer layers
with standard transformer layers has proven effec-
tive in improving model performance on recall-
intensive tasks (Yang et al., 2024b; MiniMax et al.,
2025; Lan et al., 2025).
Based on these prior findings, we propose a hybrid Linear-MoE architecture that combines Linear-MoE layers with standard (MoE) transformer layers. A practical approach for constructing these hybrid models is to periodically substitute certain Linear-MoE layers with standard MoE transformer layers within the model. For instance, in a 4-layer hybrid Linear-MoE model, denoted by "L" for Linear-MoE layers and "N" for normal transformer layers, configurations such as "LLLL" or "LNLN" may be used, depending on the desired ratio of normal transformer layers, which can be adjusted based on user preference.
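The periodic substitution described above can be sketched as a small pattern generator. This is a hypothetical helper for illustration only: the function name and the `normal_every` parameter are assumptions and do not correspond to a configuration option confirmed by the source.

```python
def hybrid_layer_pattern(num_layers, normal_every):
    """Build an 'L'/'N' layer string.

    'L' marks a Linear-MoE layer and 'N' a normal transformer layer.
    Every `normal_every`-th layer is 'N'; 0 yields a pure Linear-MoE stack.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if normal_every < 0:
        raise ValueError("normal_every must be non-negative")
    return "".join(
        "N" if normal_every and (i + 1) % normal_every == 0 else "L"
        for i in range(num_layers)
    )

pattern_pure = hybrid_layer_pattern(4, 0)      # no normal layers
pattern_half = hybrid_layer_pattern(4, 2)      # every second layer is normal
pattern_quarter = hybrid_layer_pattern(4, 4)   # only the last layer is normal
```

A smaller `normal_every` raises the share of standard attention layers, trading some of the linear-complexity benefit for stronger recall.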
2.2 Training
2.2.1 Sequence Parallelism on Linear-MoE
The existing methods, LASP (Sun et al., 2024) and
its improved version LASP-2 (Sun et al., 2025), are
designed specifically to leverage the right-product-
first property of linear attention techniques for ef-
ficient sequence parallelism (SP). LASP employs
a point-to-point ring-style communication pattern,
facilitating the exchange of incremental memory
states across devices. This communication pattern
is particularly effective for managing dependen-
cies while minimizing the data transferred between
devices, enhancing the scalability of SP. LASP-
2 further refines this approach by replacing the
ring-style communication with an all-gather col-
lective communication operation, streamlining the
entire communication process. This modification
not only simplifies the communication structure
but also improves the parallelism of computation
and communication.
In this work, we extend the capabilities of the LASP series to the Linear-MoE system, allowing for efficient SP training on LSM modules, particularly when dealing with extremely long sequences across large-scale distributed clusters. This extension significantly enhances the scalability and efficiency of training large-scale Linear-MoE models with long-context sequences on extensive compute resources. A detailed breakdown of the SP algorithm on Linear-MoE, with and without masking, is provided in Appendix A.3.

Figure 2: Sequence Parallelism Approach on Hybrid Linear-MoE models. We exemplify the parallelism on the hybrid layers of LSM and standard attention with both TP and SP (both have a dimension of 2). The communication operations colored in yellow and green are for TP and SP, respectively. AG/RS: all-gather in forward and reduce-scatter in backward, RS/AG: reduce-scatter in forward and all-gather in backward, AG/No: all-gather in forward and no-op in backward, No/AG: no-op in forward and all-gather in backward. Note that the SP communication operations for linear attention operate on the memory state $\mathbf{M}_s \in \mathbb{R}^{d \times d}$, while for standard attention, they operate on states $\mathbf{K}_s, \mathbf{V}_s \in \mathbb{R}^{C \times d}$.
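The idea of exchanging memory states rather than keys and values can be simulated in a single process. The sketch below is a simplified NumPy illustration in the spirit of the all-gather-based approach, assuming basic linear attention with causal masking, an identity feature map, and no normalization; the "all-gather" is mimicked by a Python list, and the function name is hypothetical. It is not the algorithm from Appendix A.3.

```python
import numpy as np

def sp_linear_attention(Q, K, V, sp_size):
    """Simulate sequence-parallel causal linear attention over `sp_size` chunks."""
    N, d = Q.shape
    if N % sp_size != 0:
        raise ValueError("sequence length must be divisible by sp_size")
    Qc, Kc, Vc = (np.split(X, sp_size) for X in (Q, K, V))

    # Step 1: each simulated rank computes its local d x d memory state.
    local_states = [k.T @ v for k, v in zip(Kc, Vc)]

    # Step 2: stand-in for one all-gather, after which every rank holds all states.
    gathered = local_states

    outputs = []
    for t in range(sp_size):
        # Sum of states from all earlier chunks (zero for the first rank).
        prefix = sum(gathered[:t], np.zeros((d, V.shape[1])))
        intra = np.tril(Qc[t] @ Kc[t].T) @ Vc[t]  # causal part inside the chunk
        inter = Qc[t] @ prefix                    # contribution of earlier chunks
        outputs.append(intra + inter)
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 8)) for _ in range(3))
out_sp = sp_linear_attention(Q, K, V, sp_size=4)
out_ref = np.tril(Q @ K.T) @ V  # single-device causal reference
```

Each simulated rank contributes one $d \times d$ matrix to the exchange, so the per-rank message size is independent of the chunk length.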
2.2.2 Hybrid Model Sequence Parallelism
Hybrid linear sequence modeling models, which
combine linear transformer layers (leveraging
LSM methods for token mixing) with standard
transformer layers (utilizing conventional self-
attention for token mixing), have demonstrated
substantial improvements in handling long-context
tasks (Lieber et al., 2024; Ren et al., 2024; Waleffe
et al., 2024). This hybrid model is particularly ben-
eficial for tasks with high recall requirements, in-
cluding five-shot MMLU (Hendrycks et al., 2020),
Phone-book lookup (Jelassi et al., 2024), and Needle
In A Haystack (Briakou et al., 2023). Our
proposed hybrid Linear-MoE models also aim to
enhance performance in areas where pure Linear-
MoE models have shown limitations, specifically
on tasks where recall precision is critical.
Applying SP on pure Linear-MoE models is
straightforward, as this form of SP operates exclu-
sively on the LSM modules, leaving the MoE layers
unaffected. In hybrid Linear-MoE models, how-
ever, implementing SP becomes more complex due
to the interleaving of distinct sequence modeling
layers. To effectively optimize SP for these hybrid models, we introduce an integrated approach that incorporates SP across both the Linear-MoE and standard transformer layers, thus enhancing overall efficiency. We illustrate the approach in Fig. 2 and explain it below.
On LSM Module. The SP for LSM modules is implemented via a single collective communication operation on the memory state $\mathbf{M}_s \in \mathbb{R}^{d \times d}$. This approach ensures that the communication complexity does not depend on either the sequence or sub-sequence length; rather, it scales only linearly with the SP size $T$, thereby maintaining efficiency in distributed setups.
On Standard Attention Module. Context parallelism (CP) is an SP technique used in Megatron-LM that divides input data and activations along the sequence dimension, specifically designed for standard softmax attention. Traditional CP implementations in Megatron-LM rely on a ring-like communication-computation overlap (Liu et al., 2023). In contrast, our approach for standard attention modules adopts the all-gather-based strategy used in the pretraining of Llama3 (Dubey et al., 2024). Rather than utilizing a ring strategy, we perform all-gather communication for the $\mathbf{K}_s$ and $\mathbf{V}_s$ tensors across devices, followed by local computation of the attention output on each device's chunk of $\mathbf{Q}_s$. While all-gather communication theoretically has higher latency than ring-based methods, it offers greater flexibility and adaptability for handling different attention masks, such as document-level masks, making it well suited to varying attention patterns. Moreover, the latency of all-gather is minimized since the $\mathbf{K}_s$ and $\mathbf{V}_s$ tensors are notably smaller than the $\mathbf{Q}_s$ tensor, especially with grouped query attention (Ainslie et al., 2023). Consequently, the computational time for generating attention output significantly outweighs the cost of all-gather communication.
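The all-gather strategy for standard attention can likewise be simulated on one process. In the NumPy sketch below, each simulated device keeps only its own query chunk, "gathers" the full keys and values by concatenation, and applies a causal mask offset by its chunk position. The attention follows Eq. (1) without a scaling factor, and all function names are hypothetical; this is an illustration, not the Megatron-Core or Linear-MoE implementation.

```python
import numpy as np

def softmax_rows(x):
    # Numerically stable row-wise softmax; masked entries are -inf and become 0.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(Q, K, V):
    """Single-device causal softmax attention used as the reference."""
    N = Q.shape[0]
    future = np.triu(np.ones((N, N), dtype=bool), k=1)
    return softmax_rows(np.where(future, -np.inf, Q @ K.T)) @ V

def cp_attention_allgather(Q, K, V, cp_size):
    """Simulated context parallelism: local Q chunk, gathered K and V."""
    N = Q.shape[0]
    if N % cp_size != 0:
        raise ValueError("sequence length must be divisible by cp_size")
    C = N // cp_size
    Qc, Kc, Vc = (np.split(X, cp_size) for X in (Q, K, V))
    # Stand-in for all-gather: every device ends up with the full K and V.
    K_full, V_full = np.concatenate(Kc), np.concatenate(Vc)
    cols = np.arange(N)[None, :]
    outputs = []
    for r in range(cp_size):
        rows = np.arange(r * C, (r + 1) * C)[:, None]  # global positions of local queries
        scores = np.where(cols > rows, -np.inf, Qc[r] @ K_full.T)
        outputs.append(softmax_rows(scores) @ V_full)
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((32, 8)) for _ in range(3))
out_cp = cp_attention_allgather(Q, K, V, cp_size=4)
out_full = causal_attention(Q, K, V)
```

Because every device sees the full keys and values, a different mask (for example a document-level one) only changes the `np.where` condition, which is the flexibility the paragraph above refers to.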
2.2.3 Hybrid Parallelism
SP in Linear-MoE allows for a flexible choice of
sequence parallel size, which can be set to any
factor that evenly divides the total number of
distributed nodes (i.e., the world size). This flexi-
bility enables splitting input data across both batch
and sequence dimensions, creating a combined ap-
proach known as data-sequence hybrid parallelism.
Standard data parallelism techniques, such as Dis-
tributed Data Parallel (DDP) (Li et al., 2020), can
integrate seamlessly with SP in Linear-MoE. Addi-
tionally, the sharded data parallelism method, like
Distributed Optimizer (Korthikanti et al., 2022) in
Megatron-Core, is also compatible.
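A small sketch of how the world size might be split between data and sequence parallelism is given below. It assumes only DP and SP are active (no TP, PP, or EP) and that each SP group consists of consecutive ranks; both the helper names and the rank ordering are illustrative assumptions, not the layout used by Megatron-Core.

```python
def valid_sp_sizes(world_size):
    """All sequence-parallel sizes that evenly divide the world size."""
    return [s for s in range(1, world_size + 1) if world_size % s == 0]

def data_sequence_layout(world_size, sp_size):
    """Return (dp_size, sp_groups) for a data-sequence hybrid layout.

    Assumption: ranks inside one SP group are consecutive.
    """
    if sp_size not in valid_sp_sizes(world_size):
        raise ValueError("sp_size must evenly divide world_size")
    dp_size = world_size // sp_size
    sp_groups = [list(range(g * sp_size, (g + 1) * sp_size)) for g in range(dp_size)]
    return dp_size, sp_groups

sizes = valid_sp_sizes(8)
dp_size, sp_groups = data_sequence_layout(world_size=8, sp_size=2)
```

With 8 devices and an SP size of 2, each sequence is split across 2 devices and 4 such groups process different batch shards in parallel.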
Furthermore, the system provides support for
Tensor Parallelism (TP), Pipeline Parallelism (PP),
and Expert Parallelism (EP) specifically tailored
for Linear-MoE models. In the case of TP, its appli-
cation to Linear-MoE models is direct and efficient,
as detailed in §A.2. Regarding PP and EP, these
parallelism techniques operate on Linear-MoE in
much the same way as their original versions since
they are not involved in the inner computations of
the LSM modules but rather work at the level of
complete Linear-MoE blocks or MoE layers. More-
over, TP, PP, and EP can be combined with DP and
SP as introduced earlier, enhancing flexibility and
scalability for large distributed setups.
2.2.4 Variable Length
During pretraining, batches generally consist of se-
quences with a uniform length. However, in the
finetuning phase or during inference, the model
often encounters batches containing sequences of
different lengths. A common approach to handle
this variation is to right-pad each sequence in the
batch so that all match the length of the longest
sequence in that batch. While straightforward, this
padding strategy can lead to inefficiencies, particu-
larly when sequence lengths vary greatly within a
batch. For standard transformers, more advanced
methods have been introduced to address this issue.
These methods include techniques like distribut-
ing workloads across GPUs to avoid padding alto-
gether (Zeng et al., 2022; Zhai et al., 2023), or pack-
ing multiple sequences into a single batch while
adjusting the attention mask as needed (Ding et al.,
2024; Pouransari et al., 2024). In Linear-MoE, han-
dling variable-length sequences is simplified by
processing the entire batch as one continuous long
sequence, effectively managing varying sequence
lengths without the need for padding.
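The padding-free handling can be sketched as follows. This NumPy example packs variable-length sequences into one continuous sequence with cumulative offsets and runs a basic linear attention recurrence over it. Resetting the memory state at sequence boundaries is an assumption made here so that sequences do not mix; the text above does not spell out the boundary handling, and the function names are hypothetical.

```python
import numpy as np

def pack_sequences(seqs):
    """Concatenate (len_i, d) arrays and return cumulative offsets."""
    lengths = [len(s) for s in seqs]
    cu_seqlens = np.concatenate([[0], np.cumsum(lengths)])
    return np.concatenate(seqs), cu_seqlens

def padded_token_count(seqs):
    """Tokens processed when every sequence is right-padded to the longest one."""
    return len(seqs) * max(len(s) for s in seqs)

def packed_linear_attention(q, k, v, cu_seqlens):
    """Basic linear attention over a packed batch (unnormalized readout)."""
    starts = set(cu_seqlens[:-1].tolist())
    M = np.zeros((k.shape[1], v.shape[1]))
    out = np.zeros_like(v)
    for s in range(len(q)):
        if s in starts:
            M = np.zeros_like(M)  # assumption: reset state at each sequence start
        M += np.outer(k[s], v[s])
        out[s] = q[s] @ M
    return out

rng = np.random.default_rng(0)
lengths, d = [3, 7, 5], 4
qs = [rng.standard_normal((n, d)) for n in lengths]
ks = [rng.standard_normal((n, d)) for n in lengths]
vs = [rng.standard_normal((n, d)) for n in lengths]
q, cu_seqlens = pack_sequences(qs)
k, _ = pack_sequences(ks)
v, _ = pack_sequences(vs)
out = packed_linear_attention(q, k, v, cu_seqlens)
packed_tokens, padded_tokens = len(q), padded_token_count(qs)
```

In this toy batch the packed form processes 15 tokens, whereas right-padding to the longest sequence would process 21.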
Figure 3: Linear-MoE System Implementation. The Linear-MoE system is composed of two main subsystems: Modeling and Training. It is developed in a non-intrusive manner, utilizing the latest version of Megatron-Core. All components within the system are designed with extensibility in mind, encompassing the LSM modules, base models, examples, and training technologies. This design allows for future enhancements and extensions of the system.
2.3 Implementation
The implementation of the Linear-MoE system is
based on Megatron-Core, an open-source library
developed on PyTorch that incorporates optimized
GPU techniques and advanced system-level en-
hancements. As depicted in Fig. 3, the Linear-MoE
system consists of both modeling and training sub-
systems, facilitating adaptable model building and
efficient training specifically for Linear-MoE mod-
els. Leveraging the capabilities of Megatron-Core,
the Linear-MoE library is fully compatible with all
NVIDIA Tensor Core GPUs, including support for
FP8 acceleration on NVIDIA Hopper architectures.
The Linear-MoE design approach aims to min-
imize any invasive changes to Megatron-Core’s
source code. Rather than adding new modules
directly, Linear-MoE operates independently, al-
lowing users to benefit from the latest LLM prac-
tices without disruptions due to updates or changes
within Megatron-Core.
2.3.1 Modeling Subsystem
Linear-MoE abstracts its LSM modules into modu-
lar and composable APIs, providing model devel-
opers and researchers with extensive flexibility to
design and train large-scale custom Linear-MoE
models on accelerated computing platforms. The
system includes essential building blocks, such as
core components for LSM mechanisms, MoE lay-
ers and Linear-MoE blocks, normalization tech-
niques, and embedding methods. To enhance adapt-
ability, LSM mechanisms are organized into three
main categories: linear attention, SSM, and linear
RNN, with multiple instances available in each. For
linear attention, options include basic linear atten-
tion (BLA), Lightning Attention, Retention, GLA,
DeltaNet, Based, and Rebased; for SSM, we pro-
vide Mamba2, the leading SSM model at present;
and for linear RNN, options include HGRN2 and
RWKV6. As LSM techniques evolve, Linear-MoE
will continue to incorporate more LSM methods
to ensure users have access to the latest advance-
ments.
Additionally, Linear-MoE offers vital compo-
nents such as a model library, tokenizers, model
converters, usage examples, and a set of supportive
toolkits. The model library includes instances of
Linear-MoE models that are adapted from state-
of-the-art open-source MoE architectures, includ-
ing Qwen2 MoE, DeepSeekV2 MoE, and Mix-
tral MoE. These adapted instances are designated
as Linear-MoE-Qwen2, Linear-MoE-DeepSeekV2,
and Linear-MoE-Mixtral, respectively. These mod-
els are implemented following Megatron-Core for-
mat, with the standard attention layers replaced by
LSM-based token mixing layers, while maintaining
the original embedding, normalization, and expert
layers unchanged.
2.3.2 Training Subsystem
Advanced parallelism techniques, encompassing
tensor, sequence, pipeline, context, and MoE ex-
pert parallelism, are seamlessly incorporated into
the Linear-MoE system through its design on top of
the Megatron-Core library. This non-intrusive inte-
gration allows Linear-MoE to leverage the robust
training capabilities of Megatron-Core, supporting
large-scale model training across both standard at-
tention layers and MoE expert layers. However, the
inherent parallelism mechanisms, such as TP and
SP, were not originally optimized for LSM mod-
ules. Additionally, Megatron-Core does not fully
support efficient SP for hybrid models containing
both LSM modules and standard attention layers.
To address these gaps, we elaborate on our TP and
SP approaches specifically designed for LSM mod-
ules and hybrid models, as discussed in §2.2.
Further capabilities, including mixed precision,
activation recomputation, distributed optimizer,
distributed checkpointing, and CPU offloading, are also inherited from Megatron-Core, enhancing model training flexibility and efficiency. Linear-MoE also supports 8-bit floating point (FP8) precision on Hopper GPUs, benefiting from the integration of NVIDIA's Transformer Engine (Micikevicius et al., 2022). This feature optimizes
memory usage and accelerates performance during
both training and inference stages.
To enhance the training speed of MoE layers, we
incorporate MegaBlocks (Gale et al., 2023) into our
Linear-MoE system. MegaBlocks is designed to
optimize MoE training on GPUs by reconfiguring
MoE computations using block-sparse operations
and developing new block-sparse GPU kernels that
effectively manage the inherent dynamism of MoE.
In addition, we also integrate the Grouped GEMM
library into Linear-MoE, which introduces grouped
GEMM kernels in PyTorch, thereby accelerating
the computational processes involved in training
MoE models.
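For readers unfamiliar with the term, the computation that a grouped GEMM performs for MoE layers can be written as a plain NumPy reference: tokens are sorted by their assigned expert, and each contiguous group is multiplied by that expert's weight matrix. This sketch assumes top-1 routing with random assignments and uses hypothetical names; it shows the semantics only and is neither the Grouped GEMM library API nor the MegaBlocks block-sparse formulation.

```python
import numpy as np

def grouped_gemm_reference(x_sorted, weights, tokens_per_expert):
    """Multiply each contiguous token group by its own expert weight matrix."""
    outputs, start = [], 0
    for w, n in zip(weights, tokens_per_expert):
        outputs.append(x_sorted[start:start + n] @ w)  # may be an empty group
        start += n
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
num_tokens, d_in, d_out, num_experts = 10, 4, 3, 4
x = rng.standard_normal((num_tokens, d_in))
W = rng.standard_normal((num_experts, d_in, d_out))
expert_ids = rng.integers(0, num_experts, size=num_tokens)  # stand-in for a router

order = np.argsort(expert_ids, kind="stable")              # group tokens by expert
counts = np.bincount(expert_ids, minlength=num_experts)    # tokens per expert
y_sorted = grouped_gemm_reference(x[order], W, counts)

y = np.empty_like(y_sorted)
y[order] = y_sorted                                        # restore original token order
```

An optimized kernel fuses the per-expert loop into a single launch; the group sizes differ from step to step, which is the dynamism the paragraph above mentions.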
2.3.3 Evaluation Module
In order to facilitate the evaluation on mainstream
benchmarks, we have developed offline text gen-
eration of Linear-MoE models within the system.
Based on this, mature evaluation frameworks such
as OpenCompass (Contributors, 2023) and LM-
Evaluation-Harness (Gao et al., 2023), are readily
available for conducting evaluation tasks on Linear-
MoE models. Furthermore, the system facilitates
seamless bidirectional conversion between model
weights from HuggingFace and Megatron-Core.
This functionality enables users to easily leverage
pretrained models from HuggingFace for contin-
ued pretraining or fine-tuning within the Megatron-
Core environment. Additionally, it allows for the
assessment of model performance by using Hug-
gingFace’s evaluation and inference pipelines on
models trained within the Megatron-Core frame-
work.
3 Empirical Study
3.1 Experiment Setup
Models and Dataset. We conduct experiments
on two Linear-MoE model series: A0.3B-2B and
A1B-7B. A0.3B-2B denotes a Linear-MoE model
containing a total of 2 billion parameters, with 0.3
billion parameters activated. The same applies
for the A1B-7B model. Each series consists of
several model instances, each incorporating a distinct instance of the LSM module. The specific LSM module instances used in our experiments include: basic linear attention (BLA) (Katharopoulos et al., 2020), Retentive Network (Retention) (Sun et al., 2023), Gated Linear Attention (GLA) (Yang et al., 2023), DeltaNet (Schlag et al., 2021), Mamba2 (Dao and Gu, 2024), HGRN2 (Qin et al., 2024d), and RWKV6 (Peng et al., 2023, 2024), all implemented in Triton. These model instances are evaluated against models with the standard attention implementation in Megatron-Core (referred to as Baseline) and the FlashAttention-2 (Dao, 2023) implemented in Transformer Engine (in CUDA).

| Models | A0.3B-2B | A1B-7B |
| --- | --- | --- |
| Hidden Dimension | 1024 | 2048 |
| FFN Dimension | 896 | 1024 |
| Num of Heads | 8 | 16 |
| Num of Layers | 12 | 16 |
| Num of Act Experts | 8 | 8 |
| Num of Experts | 64 | 64 |
| LR | 1e-4 | 1e-5 |
| Minimum LR | 1e-5 | 1e-6 |
| LR Scheduler | Cosine | Cosine |
| Seq Length | 2048 | 2048 |
| Training Tokens | 15B | 100B |

Table 2: Linear-MoE Family Models and Training Configurations. A0.3B-2B indicates that the Linear-MoE model has a total of 2 billion parameters, with 0.3 billion parameters activated. The same for A1B-7B.

[Figure 4 plot: x-axis Seq Length × Batch Size (2K × 8, 4K × 4, 8K × 2, 16K × 1); y-axis Throughput (Tokens/s); series: Baseline, FlashAttn2, BLA, Retention, GLA, DeltaNet, Mamba2, HGRN2, RWKV6.]
Figure 4: Training Throughput (Tokens/s). As sequence length increases, the throughput of Baseline declines significantly, whereas LSM models maintain stable training efficiency.
To implement the Linear-MoE model instances,
we utilize the Qwen2 MoE architecture (Yang et al.,
2024a) as the base model. All models are pre-
trained from scratch on a portion of the SlimPajama
dataset (Soboleva et al., 2023). Although this dataset originally contains 627 billion tokens, we restrict our experiments to the first two chunks of the dataset, totaling approximately 100 billion tokens. The Qwen2 tokenizer is employed throughout the training processes.
Training Configurations. Table 2 details the
training configurations for both Linear-MoE model
series. We employ the Adam optimizer (Kingma
and Ba, 2014) along with parallelism techniques,
including TP and EP. Each pretraining run is per-
formed on a node with eight A100 80G GPUs.
3.2 Training and Inference Efficiency
We perform experiments to evaluate the training
efficiency of the Linear-MoE system, focusing on
throughput and GPU memory requirements using
eight A100 GPUs. For training the sparse MoE
models, we set the EP size to 8. During the experi-
ments, we maintain a total of 16K input tokens per
iteration, while varying the input sequence lengths
across {2K, 4K, 8K, 16K} with corresponding
batch sizes of {8, 4, 2, 1}. As illustrated in Table
3 and Fig. 4, we observe that the standard atten-
tion Baseline shows a significant quadratic increase
in memory usage and a decline in throughput as
the input sequence lengths grow. FlashAttention-2
also shows noticeable changes in both memory
footprint and throughput once the sequence
length reaches 16K. In contrast, the Linear-MoE
models, which incorporate LSM, exhibit relatively
stable GPU memory consumption and consistent
throughput as the sequence length increases while
the total number of input tokens remains fixed.
We also perform experiments to compare the
inference efficiency of Linear-MoE (using Basic
LA) with the Baseline (using FlashAttention-2).
The results, shown in Table 5, reveal that Linear-
MoE offers a significant speed advantage when
the decoding length exceeds 16K. Additionally, its
memory usage remains constant, which is a key
benefit resulting from the adoption of LSM.
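To make the constant-memory observation concrete, the following minimal sketch compares how the per-layer decoding state scales for a softmax-attention KV cache and for a linear-attention memory state. The head count, head dimension, and 2-byte element width are illustrative assumptions, not the configuration of the evaluated models, and both helper functions are hypothetical.

```python
def kv_cache_bytes(seq_len, num_heads, head_dim, bytes_per_elem=2):
    # Softmax attention stores one key and one value vector per decoded token,
    # so the cache grows linearly with the decoding length.
    return 2 * seq_len * num_heads * head_dim * bytes_per_elem


def linear_state_bytes(num_heads, head_dim, bytes_per_elem=2):
    # Linear attention keeps one (head_dim x head_dim) state per head,
    # independent of how many tokens have been decoded.
    return num_heads * head_dim * head_dim * bytes_per_elem


NUM_HEADS, HEAD_DIM = 16, 128  # hypothetical values for illustration only

for n in (1024, 16384, 131072):
    kv = kv_cache_bytes(n, NUM_HEADS, HEAD_DIM)
    st = linear_state_bytes(NUM_HEADS, HEAD_DIM)
    print(f"{n:>7} tokens: KV cache {kv / 2**20:8.1f} MiB, "
          f"linear state {st / 2**20:6.1f} MiB")
```

Under these assumptions the KV cache doubles whenever the decoding length doubles, whereas the linear-attention state stays the same size.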
Furthermore, to highlight the efficiency benefits
of the Linear-MoE training subsystem, we conduct
ablation studies on MoE optimization techniques
and parallelism training methods. The results of
these experiments are presented in Table 4. It is
evident that the implementation of MoE optimiza-
tion techniques, specifically Grouped GEMM and
MegaBlocks, significantly reduces the elapsed time
for each iteration. Additionally, the various paral-
lelism training techniques each demonstrate their
own advantages in terms of memory footprint and
overall training efficiency.
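As a worked reading of the ablation numbers, the sketch below derives speedups and trade-offs directly from the values reported in Table 4. Only the numbers come from the table; the variable names are our own.

```python
# Time per iteration (ms) from Table 4, MoE optimization ablation.
BASELINE_MS = 1565.6  # Megatron-Core MoE without optimizations
moe_opt_ms = {"Grouped GEMM": 455.4, "MegaBlocks": 348.8}
speedups = {name: BASELINE_MS / ms for name, ms in moe_opt_ms.items()}

# (EP, TP, PP) -> (memory per GPU in GB, time per iteration in ms), Table 4.
parallel = {
    (1, 1, 1): (35.28, 1565.6),
    (8, 1, 1): (22.98, 739.4),
    (1, 8, 1): (10.04, 6879.0),
    (1, 1, 8): (8.89, 1820.2),
    (2, 2, 2): (12.90, 1684.9),
}
fastest = min(parallel, key=lambda cfg: parallel[cfg][1])
leanest = min(parallel, key=lambda cfg: parallel[cfg][0])

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x faster per iteration than the Baseline")
print("fastest (EP, TP, PP):", fastest)
print("lowest memory (EP, TP, PP):", leanest)
```

In this table, pure expert parallelism gives the shortest iteration time, while pure pipeline parallelism gives the smallest per-GPU memory footprint.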
| Seq Length × Batch Size | 2K×8 Mem. | 2K×8 Thpt. | 4K×4 Mem. | 4K×4 Thpt. | 8K×2 Mem. | 8K×2 Thpt. | 16K×1 Mem. | 16K×1 Thpt. |
|---|---|---|---|---|---|---|---|---|
| Baseline | 40.74 | 102.14 | 41.42 | 88.60 | 42.93 | 66.17 | 47.08 | 49.39 |
| FlashAttn-2 | 38.96 | 103.91 | 39.10 | 101.78 | 39.57 | 105.08 | 41.51 | 96.16 |
| Basic LA | 42.69 | 115.16 | 43.85 | 119.72 | 42.71 | 112.66 | 43.00 | 114.67 |
| Retention | 42.71 | 117.85 | 42.66 | 119.11 | 42.73 | 119.16 | 42.65 | 118.19 |
| GLA | 43.87 | 113.29 | 43.73 | 118.77 | 43.63 | 116.34 | 43.60 | 110.87 |
| DeltaNet | 43.33 | 116.95 | 43.34 | 120.27 | 43.31 | 117.43 | 43.32 | 109.72 |
| Mamba2 | 45.63 | 105.99 | 45.94 | 108.13 | 47.16 | 102.51 | 44.97 | 106.84 |
| HGRN2 | 46.03 | 92.58 | 46.14 | 95.74 | 45.56 | 97.98 | 44.97 | 96.02 |
| RWKV6 | 47.11 | 137.62 | 47.12 | 136.73 | 47.11 | 135.60 | 47.12 | 134.51 |

Table 3: Quantitative Training Efficiency Results. We experiment on 8 A100 GPUs and report the max allocated
GPU memory (GB) and throughput (×10^3 tokens/s) of A0.3B-2B model instances with varying input sequence
lengths and batch sizes.
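The fixed-token setup and the throughput trends can be checked with a few lines of arithmetic. The sketch below uses only the sequence-length and batch-size settings and three rows of throughput values from Table 3; the "retained throughput" ratio is our own summary statistic, not a metric defined in the paper.

```python
# (sequence length, batch size) settings used in the efficiency experiments.
configs = [(2048, 8), (4096, 4), (8192, 2), (16384, 1)]
tokens_per_iter = [seq * batch for seq, batch in configs]

# Throughput (x10^3 tokens/s) at 2K, 4K, 8K, 16K, copied from Table 3.
throughput = {
    "Baseline": [102.14, 88.60, 66.17, 49.39],
    "FlashAttn-2": [103.91, 101.78, 105.08, 96.16],
    "Basic LA": [115.16, 119.72, 112.66, 114.67],
}

# Fraction of the 2K throughput that is still achieved at 16K.
retained = {name: vals[-1] / vals[0] for name, vals in throughput.items()}

print("tokens per iteration:", tokens_per_iter)
for name, r in retained.items():
    print(f"{name}: keeps {100 * r:.1f}% of its 2K throughput at 16K")
```

Every setting processes the same 16K tokens per iteration, so differences in throughput reflect the sequence length rather than the amount of data.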
[Figure 5 plot. x-axis: Decoding Length (1K to 128K); left y-axis: Latency Time (seconds); right y-axis: GPU Memory Usage (GB); series: Baseline w/ FlashAttn2 (time, memory, OOM marker) and Basic Linear Attn (time, memory).]
Figure 5: Inference Efficiency of A0.3B-2B Model Instances. We vary the decoding length from 1K to 128K
with a fixed batch size of 16 on a single A800 80GB GPU to evaluate the Baseline w/ FlashAttention-2 and the
Linear-MoE w/ Basic Linear Attention in terms of inference latency time and GPU memory usage.
3.3 Training Loss and Evaluation
To evaluate the overall training performance of the
Linear-MoE models, we pretrain the A0.3B-2B and
A1B-7B model instances using 15B and 100B to-
kens, respectively. We test both pure and hybrid
model configurations; for the hybrid models, we in-
corporate one quarter of standard transformer MoE
layers throughout the architecture. For instance, in
the 12-layer A0.3B-2B model, the hybrid config-
uration follows the pattern "LLLNLLLNLLLN",
while the 16-layer A1B-7B model adopts the pat-
tern "LLLNLLLNLLLNLLLN".
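The layer layout of the hybrid models can be written as a small helper. This is a minimal sketch: `hybrid_pattern` is a hypothetical function, and reading "L" as a Linear-MoE layer and "N" as a standard Transformer-MoE layer is our interpretation of the pattern strings above.

```python
def hybrid_pattern(num_layers, period=4):
    """Return a layer layout string in which every `period`-th layer is 'N'.

    'L' marks a Linear-MoE layer and 'N' a standard Transformer-MoE layer
    (assumed meaning of the letters).
    """
    return "".join(
        "N" if (i + 1) % period == 0 else "L" for i in range(num_layers)
    )


print(hybrid_pattern(12))  # layout for the 12-layer A0.3B-2B model
print(hybrid_pattern(16))  # layout for the 16-layer A1B-7B model
```

With `period=4`, one quarter of the layers are standard Transformer-MoE layers, matching the ratio described in the text.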
The training loss curves for the A0.3B-2B model
instances, which include both pure and hybrid
Linear-MoE models, are presented in Fig. 6. The
results demonstrate that the pure Linear-MoE ar-
chitecture achieves competitive convergence per-
formance compared to the standard attention Base-
line. Moreover, the hybrid models exhibit more sta-
ble convergence and consistent performance when
compared with the Baseline. Additional experiment
results, such as benchmark evaluations and
training loss curves of the A1B-7B models, can be
found in Appendix A.4. Both the A0.3B-2B and
A1B-7B Linear-MoE model series show competitive
performance on various benchmarks, and
hybrid models usually perform better
than the pure linear models.
4 Conclusion
In this paper, we introduced Linear-MoE, a novel
production-level system designed to integrate LSM
with MoE, aiming to advance both the efficiency
and scalability of existing large models. By com-
bining linear-complexity sequence modeling capa-
bilities of LSM with sparsely activated MoE layers,
Linear-MoE achieves high performance while ad-
dressing computational and memory constraints
common in large model training and deployment.
The two subsystems, Modeling and Training, provide
a flexible and extensible framework that supports
diverse LSM methods and advanced parallelism
techniques, including sequence parallelism
designed for handling long input sequences efficiently.
We also explored hybrid models that further
enhance adaptability by incorporating standard
Transformer layers. Our experimental results
demonstrate that Linear-MoE achieves significant
efficiency gains while maintaining strong performance
across various benchmarks. These findings
highlight the potential of Linear-MoE as the next
generation of foundation model architecture.

[Figure 6 plots. Two panels, each with x-axis: Samples (×10^6) and y-axis: Loss. Left panel curves: Baseline, BLA, Retention, GLA, DeltaNet, Mamba2. Right panel curves: Baseline, BLA(H), Retention(H), GLA(H), DeltaNet(H), Mamba2(H).]
Figure 6: Training Loss Curves of A0.3B-2B Model Instances. Left: pure Linear-MoE models; Right: hybrid
Linear-MoE models. Linear-MoE shows competitive training convergence performance compared to the standard
attention Baseline.

| MoE Optimization | Memory (GB) | Time/Iter (ms) |
|---|---|---|
| Baseline | 35.28 | 1565.6 |
| Grouped GEMM | 35.01 | 455.4 |
| MegaBlocks | 36.90 | 348.8 |

| EP | TP | PP | Memory (GB) | Time/Iter (ms) |
|---|---|---|---|---|
| 1 | 1 | 1 | 35.28 | 1565.6 |
| 8 | 1 | 1 | 22.98 | 739.4 |
| 1 | 8 | 1 | 10.04 | 6879.0 |
| 1 | 1 | 8 | 8.89 | 1820.2 |
| 2 | 2 | 2 | 12.90 | 1684.9 |

Table 4: Above: MoE Optimization. Below: Distributed
training efficiency under different parallelism
settings. We report the memory usage per GPU
(GB) and elapsed time per iteration (ms) while training
the A0.3B-2B model with a sequence length of 2048
and a batch size of 4, using a node equipped with 8
A100 GPUs. The Baseline refers to the MoE implementation
in Megatron-Core, which is used without any
optimizations.
Limitations
Despite the promising results demonstrated in this
paper, there are several limitations to the Linear-
MoE framework. First, while the system success-
fully combines LSM with MoE to enhance effi-
ciency, the integration of different LSM methods
and MoE layers may introduce complexity in hyperparameter
tuning, which could impact model performance under certain
configurations. Additionally, the scalability of Linear-MoE in extremely
large-scale settings, such as beyond the model sizes
tested in our experiments (A0.3B-2B and A1B-7B),
remains an area for further investigation. Moreover,
while the system supports various parallelism tech-
niques, their effectiveness on diverse hardware ar-
chitectures, particularly in resource-constrained en-
vironments, needs more comprehensive evaluation.
Therefore, future work should focus on further opti-
mizing the system for a broader set of use cases and
exploring additional hybrid modeling strategies.
References
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury
Zemlyanskiy, Federico Lebrón, and Sumit Sanghai.
2023. GQA: Training generalized multi-query trans-
former models from multi-head checkpoints. arXiv
preprint arXiv:2305.13245 .
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng
Gao, and Yejin Choi. 2020. Piqa: Reasoning about
physical commonsense in natural language. In Thirty-
Fourth AAAI Conference on Artificial Intelligence .
Eleftheria Briakou, Colin Cherry, and George Foster.
2023. Searching for needles in a haystack: On the
role of incidental bilingualism in palm’s translation
capability. arXiv preprint arXiv:2305.10266 .
Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang,
Sunghun Kim, and Jiayi Huang. 2024. A survey on
mixture of experts. arXiv preprint arXiv:2407.06204 .
Soumith Chintala. 2023. Gpt-4 moe.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,
Ashish Sabharwal, Carissa Schoenick, and Oyvind
Tafjord. 2018. Think you have solved question
answering? try arc, the ai2 reasoning challenge.
Preprint , arXiv:1803.05457.
OpenCompass Contributors. 2023. Opencompass:
A universal evaluation platform for foundation
models. https://github.com/open-compass/
opencompass .
Tri Dao. 2023. Flashattention-2: Faster attention with
better parallelism and work partitioning. arXiv
preprint arXiv:2307.08691 .
Tri Dao and Albert Gu. 2024. Transformers are
ssms: Generalized models and efficient algorithms
through structured state space duality. arXiv preprint
arXiv:2405.21060 .
Hantian Ding, Zijian Wang, Giovanni Paolini, Varun
Kumar, Anoop Deoras, Dan Roth, and Stefano Soatto.
2024. Fewer truncations improve language modeling.
arXiv preprint arXiv:2404.10830 .
Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu,
and Yu Cheng. 2025. Mom: Linear sequence
modeling with mixture-of-memories. Preprint ,
arXiv:2502.13685.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey,
Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman,
Akhil Mathur, Alan Schelten, Amy Yang, Angela
Fan, et al. 2024. The llama 3 herd of models. arXiv
preprint arXiv:2407.21783 .
William Fedus, Barret Zoph, and Noam Shazeer. 2022.
Switch transformers: Scaling to trillion parameter
models with simple and efficient sparsity. Journal of
Machine Learning Research , 23(120):1–39.
Trevor Gale, Deepak Narayanan, Cliff Young, and Matei
Zaharia. 2023. MegaBlocks: Efficient Sparse Train-
ing with Mixture-of-Experts. Proceedings of Ma-
chine Learning and Systems , 5.
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman,
Sid Black, Anthony DiPofi, Charles Foster, Laurence
Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li,
Kyle McDonell, Niklas Muennighoff, Chris Ociepa,
Jason Phang, Laria Reynolds, Hailey Schoelkopf,
Aviya Skowron, Lintang Sutawika, Eric Tang, An-
ish Thite, Ben Wang, Kevin Wang, and Andy Zou.
2023. A framework for few-shot language model
evaluation.
Albert Gu and Tri Dao. 2023. Mamba: Linear-time
sequence modeling with selective state spaces. arXiv
preprint arXiv:2312.00752 .
Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and
Christopher Ré. 2020. Hippo: Recurrent mem-
ory with optimal polynomial projections. Advances
in neural information processing systems , 33:1474–
1487.
Albert Gu, Karan Goel, Ankit Gupta, and Christopher
Ré. 2022a. On the parameterization and initialization
of diagonal state space models. Advances in Neural
Information Processing Systems , 35:35971–35983.
Albert Gu, Karan Goel, and Christopher Ré. 2022b.
Efficiently modeling long sequences with structured
state spaces. In The International Conference on
Learning Representations (ICLR) .
Ankit Gupta, Albert Gu, and Jonathan Berant. 2022. Di-
agonal state spaces are as effective as structured state
spaces. Advances in Neural Information Processing
Systems , 35:22982–22994.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou,
Mantas Mazeika, Dawn Song, and Jacob Steinhardt.
2020. Measuring massive multitask language under-
standing. arXiv preprint arXiv:2009.03300 .
Jiaxi Hu, Disen Lan, Ziyu Zhou, Qingsong Wen, and
Yuxuan Liang. 2024. Time-ssm: Simplifying and uni-
fying state space models for time series forecasting.
Preprint , arXiv:2405.16312.
Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan,
and Geoffrey E. Hinton. 1991. Adaptive mixtures of
local experts. Neural Computation , page 79–87.
Samy Jelassi, David Brandfonbrener, Sham M Kakade,
and Eran Malach. 2024. Repeat after me: Trans-
formers are better than state space models at copying.
arXiv preprint arXiv:2402.01032 .
Albert Q Jiang, Alexandre Sablayrolles, Antoine
Roux, Arthur Mensch, Blanche Savary, Chris Bam-
ford, Devendra Singh Chaplot, Diego de las Casas,
Emma Bou Hanna, Florian Bressand, et al. 2024.
Mixtral of experts. arXiv preprint arXiv:2401.04088 .
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pap-
pas, and François Fleuret. 2020. Transformers are
rnns: Fast autoregressive transformers with linear
attention. In International Conference on Machine
Learning , pages 5156–5165. PMLR.
Diederik P Kingma and Jimmy Ba. 2014. Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980 .
Vijay Korthikanti, Jared Casper, Sangkug Lym,
Lawrence McAfee, Michael Andersch, Mohammad
Shoeybi, and Bryan Catanzaro. 2022. Reducing ac-
tivation recomputation in large transformer models.
Preprint , arXiv:2205.05198.
Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, and
Yu Cheng. 2025. Liger: Linearizing large lan-
guage models to gated recurrent structures. Preprint ,
arXiv:2503.01496.
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu,
Dehao Chen, Orhan Firat, Yanping Huang, Maxim
Krikun, Noam Shazeer, and Zhifeng Chen. 2020.
Gshard: Scaling giant models with conditional com-
putation and automatic sharding. arXiv preprint
arXiv:2006.16668 .
Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman
Goyal, and Luke Zettlemoyer. 2021. Base layers:
Simplifying training of large, sparse models. In In-
ternational Conference on Machine Learning , pages
6265–6274. PMLR.
Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang,
Hai Zhao, Yeyun Gong, Nan Duan, and Timothy
Baldwin. 2023. Cmmlu: Measuring massive mul-
titask language understanding in chinese. Preprint ,
arXiv:2306.09212.
Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar,
Pieter Noordhuis, Teng Li, Adam Paszke, Jeff
Smith, Brian Vaughan, Pritam Damania, and Soumith
Chintala. 2020. Pytorch distributed: Experiences
on accelerating data parallel training. Preprint ,
arXiv:2006.15704.
Opher Lieber, Barak Lenz, Hofit Bata, Gal Co-
hen, Jhonathan Osin, Itay Dalmedigos, Erez
Safahi, Shaked Meirom, Yonatan Belinkov, Shai
Shalev-Shwartz, et al. 2024. Jamba: A hybrid
transformer-mamba language model. arXiv preprint
arXiv:2403.19887 .
Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang,
Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong
Ruan, Damai Dai, Daya Guo, et al. 2024.
Deepseek-v2: A strong, economical, and efficient
mixture-of-experts language model. arXiv preprint
arXiv:2405.04434 .
Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023.
Ring attention with blockwise transformers for near-
infinite context. Preprint , arXiv:2310.01889.
Zhenyi Lu, Chenghao Fan, Wei Wei, Xiaoye Qu, Dan-
gyang Chen, and Yu Cheng. 2025. Twin-merging:
Dynamic integration of modular expertise in model
merging. Advances in Neural Information Process-
ing Systems , 37:78905–78935.
Paulius Micikevicius, Dusan Stosic, Neil Burgess, Mar-
ius Cornea, Pradeep Dubey, Richard Grisenthwaite,
Sangwon Ha, Alexander Heinecke, Patrick Judd,
John Kamalu, et al. 2022. Fp8 formats for deep
learning. arXiv preprint arXiv:2209.05433 .
MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji
Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Con-
gchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin
Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai
Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao
Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan,
Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han,
Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng,
Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi,
Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei
Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang,
Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi,
Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang,
Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu
Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xi-
aodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min,
Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu,
Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxi-
ang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan
Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin
Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying,
Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang
Yu, Zhuo Jiang, and Zijia Wu. 2025. Minimax-01:
Scaling foundation models with lightning attention.
Preprint , arXiv:2501.08313.
Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld,
Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi,
Pete Walsh, Oyvind Tafjord, Nathan Lambert, et al.
2024. Olmoe: Open mixture-of-experts language
models. arXiv preprint arXiv:2409.02060 .
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Al-
balak, Samuel Arcadinho, Stella Biderman, Huanqi
Cao, Xin Cheng, Michael Chung, Leon Derczynski,
Xingjian Du, Matteo Grella, Kranthi Gv, Xuzheng
He, Haowen Hou, Przemyslaw Kazienko, Jan Ko-
con, Jiaming Kong, Bartłomiej Koptyra, Hayden
Lau, Jiaju Lin, Krishna Sri Ipsit Mantri, Ferdinand
Mom, Atsushi Saito, Guangyu Song, Xiangru Tang,
Johan Wind, Stanisław Woźniak, Zhenyuan Zhang,
Qinghua Zhou, Jian Zhu, and Rui-Jie Zhu. 2023.
RWKV: Reinventing RNNs for the transformer era.
In Findings of the Association for Computational
Linguistics: EMNLP 2023 , pages 14048–14077, Sin-
gapore. Association for Computational Linguistics.
Bo Peng, Daniel Goldstein, Quentin Anthony, Alon
Albalak, Eric Alcaide, Stella Biderman, Eugene
Cheah, Teddy Ferdinan, Haowen Hou, Przemysław
Kazienko, et al. 2024. Eagle and finch: Rwkv with
matrix-valued states and dynamic recurrence. arXiv
preprint arXiv:2404.05892 .
Hadi Pouransari, Chun-Liang Li, Jen-Hao Rick Chang,
Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal
Shankar, and Oncel Tuzel. 2024. Dataset decom-
position: Faster llm training with variable sequence
length curriculum. arXiv preprint arXiv:2405.13226 .
Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun,
Xuyang Shen, Xiaodong Han, Yunshen Wei, Bao-
hong Lv, Xiao Luo, Yu Qiao, and Yiran Zhong.
2024a. Transnormerllm: A faster and better large lan-
guage model with improved transnormer. Preprint ,
arXiv:2307.14995.
Zhen Qin, Xuyang Shen, Weigao Sun, Dong Li, Stan
Birchfield, Richard Hartley, and Yiran Zhong. 2024b.
Unlocking the secrets of linear complexity sequence
model from a unified perspective. arXiv preprint
arXiv:2405.17383 .
Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weix-
uan Sun, and Yiran Zhong. 2024c. Lightning
attention-2: A free lunch for handling unlimited
sequence lengths in large language models. arXiv
preprint arXiv:2401.04658 .
Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen,
Dong Li, Weigao Sun, and Yiran Zhong. 2024d.
Hgrn2: Gated linear rnns with state expansion. arXiv
preprint arXiv:2404.07904 .
Zhen Qin, Songlin Yang, and Yiran Zhong. 2024e. Hi-
erarchically gated recurrent neural network for se-
quence modeling. Advances in Neural Information
Processing Systems , 36.
Xiaoye Qu, Daize Dong, Xuyang Hu, Tong Zhu,
Weigao Sun, and Yu Cheng. 2024. Llama-moe
v2: Exploring sparsity of llama from perspective of
mixture-of-experts with post-training. arXiv preprint
arXiv:2411.15708 .
Machel Reid, Nikolay Savinov, Denis Teplyashin,
Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste
Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Fi-
rat, Julian Schrittwieser, et al. 2024. Gemini 1.5: Un-
locking multimodal understanding across millions of
tokens of context. arXiv preprint arXiv:2403.05530 .
Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen
Liang, and Weizhu Chen. 2024. Samba: Sim-
ple hybrid state space models for efficient unlim-
ited context language modeling. arXiv preprint
arXiv:2406.07522 .
Stephen Roller, Sainbayar Sukhbaatar, Jason Weston,
et al. 2021. Hash layers for large sparse models.
Advances in Neural Information Processing Systems ,
34:17555–17566.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhaga-
vatula, and Yejin Choi. 2019. Winogrande: An ad-
versarial winograd schema challenge at scale. arXiv
preprint arXiv:1907.10641 .
Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber.
2021. Linear transformers are secretly fast weight
programmers. In International Conference on Ma-
chine Learning .
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz,
Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff
Dean. 2017. Outrageously large neural networks:
The sparsely-gated mixture-of-experts layer. arXiv
preprint arXiv:1701.06538 .
Yikang Shen, Zhen Guo, Tianle Cai, and Zengyi Qin.
2024. Jetmoe: Reaching llama2 performance with
0.1 m dollars. arXiv preprint arXiv:2404.07413 .
Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Ja-
cob R Steeves, Joel Hestness, and Nolan Dey. 2023.
SlimPajama: A 627B token cleaned and deduplicated
version of RedPajama.
Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, and
Yu Cheng. 2025. Lasp-2: Rethinking sequence par-
allelism for linear attention and its hybrid. arXiv
preprint arXiv:2502.07563 .
Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen,
Yu Qiao, and Yiran Zhong. 2024. Linear at-
tention sequence parallelism. arXiv preprint
arXiv:2404.02882 .
Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma,
Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu
Wei. 2023. Retentive network: A successor to trans-
former for large language models. arXiv preprint
arXiv:2307.08621.
Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman,
Avshalom Manevich, Barak Peleg, Ben Aviram, Chen
Almagor, Clara Fridman, Dan Padnos, et al. 2024.
Jamba-1.5: Hybrid transformer-mamba models at
scale. arXiv preprint arXiv:2408.12570 .
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. Advances in neural information processing
systems , 30.
Roger Waleffe, Wonmin Byeon, Duncan Riach, Bran-
don Norick, Vijay Korthikanti, Tri Dao, Albert
Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak
Narayanan, et al. 2024. An empirical study of
mamba-based language models. arXiv preprint
arXiv:2406.07887 .
Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun,
and Damai Dai. 2024. Auxiliary-loss-free load bal-
ancing strategy for mixture-of-experts. Preprint ,
arXiv:2408.15664.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng,
Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan
Li, Dayiheng Liu, Fei Huang, Guanting Dong, Hao-
ran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian
Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin
Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang
Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang,
Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng
Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin,
Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu,
Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng,
Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin
Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang
Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu
Cui, Zhenru Zhang, and Zhihao Fan. 2024a. Qwen2
technical report. arXiv preprint arXiv:2407.10671 .
Songlin Yang, Jan Kautz, and Ali Hatamizadeh. 2024b.
Gated delta networks: Improving mamba2 with delta
rule. Preprint , arXiv:2412.06464.
Songlin Yang, Bailin Wang, Yikang Shen, Rameswar
Panda, and Yoon Kim. 2023. Gated linear attention
transformers with hardware-efficient training. arXiv
preprint arXiv:2312.06635 .
Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen,
and Yoon Kim. 2024c. Parallelizing linear transform-
ers with the delta rule over sequence length. arXiv
preprint arXiv:2406.06484 .
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali
Farhadi, and Yejin Choi. 2019. Hellaswag: Can a
machine really finish your sentence? In Proceedings
of the 57th Annual Meeting of the Association for
Computational Linguistics .
Jinle Zeng, Min Li, Zhihua Wu, Jiaqi Liu, Yuang Liu,
Dianhai Yu, and Yanjun Ma. 2022. Boosting dis-
tributed training performance of the unpadded bert
model. arXiv preprint arXiv:2208.08124 .
Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying
Jia, Shang Zhang, Zizhong Chen, Xin Liu, and Yibo
Zhu. 2023. Bytetransformer: A high-performance
transformer boosted for variable-length inputs. In
2023 IEEE International Parallel and Distributed
Processing Symposium (IPDPS) , pages 344–355.
IEEE.
Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang,
Leyang Cui, Yiqiao Wang, Bolun Wang, Wei Bi
Freda Shi, Bailin Wang, Peng Zhou, and Guohong Fu.
2024. Gated slot attention for efficient linear-time se-
quence modeling. arXiv preprint arXiv:2409.07146 .
Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping
Huang, Vincent Zhao, Andrew M Dai, Quoc V Le,
James Laudon, et al. 2022. Mixture-of-experts with
expert choice routing. Advances in Neural Informa-
tion Processing Systems , 35:7103–7114.
Tong Zhu, Daize Dong, Xiaoye Qu, Jiacheng Ruan,
Wenliang Chen, and Yu Cheng. 2024a. Dynamic data
mixing maximizes instruction tuning for mixture-of-
experts. arXiv preprint arXiv:2406.11256 .
Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan,
Jingqi Tong, Conghui He, and Yu Cheng. 2024b.
Llama-moe: Building mixture-of-experts from
llama with continual pre-training. arXiv preprint
arXiv:2406.16554.
A Appendix
A.1 Related Work
A.1.1 Mixture-of-Experts
MoE (Jacobs et al., 1991; Cai et al., 2024; Lu et al.,
2025) is gaining increasing attention in the devel-
opment of large language models (LLMs) due to
its ability to scale model size while maintaining
computational efficiency. Its key strength lies in
the sparse activation of experts and routing mech-
anisms, enabling a better balance between model
performance and training cost. The effectiveness
of MoE in modern deep learning was first demon-
strated in Shazeer et al. (2017), where an MoE layer
was introduced between LSTM layers, resulting in
state-of-the-art performance on language modeling
and machine translation benchmarks. Following
this, the MoE layer was incorporated into the Trans-
former architecture, replacing the feed-forward net-
work (FFN) layers. GShard (Lepikhin et al., 2020)
applied MoE to Transformers, significantly im-
proving machine translation across 100 languages.
Switch Transformers (Fedus et al., 2022) further
scaled model size to trillions of parameters, us-
ing a simplified and efficient MoE layer design.
However, training MoE models often leads to load
imbalance, where only a few experts are heavily
utilized, leaving others underutilized (Lewis et al.,
2021; Wang et al., 2024; Zhu et al., 2024a; Du et al.,
2025). To address this, several strategies have been
developed to optimize MoE training. These include
the BASE layer (Lewis et al., 2021), the HASH
layer (Roller et al., 2021), and Expert Choice (Zhou
et al., 2022), all of which aim to maximize model
capacity utilization. MoE architectures have been
widely adopted in industry-leading models, such
as Gemini-1.5 (Reid et al., 2024) and reportedly
GPT-4 (Chintala, 2023). Other notable examples
of LLMs incorporating MoE techniques include
Mixtral (Jiang et al., 2024), DeepSeek V2 (Liu
et al., 2024), Qwen2 (Yang et al., 2024a), Jet-
MoE (Shen et al., 2024), Jamba (Team et al., 2024),
and OLMoE (Muennighoff et al., 2024). Despite
the advances in MoE, most research has focused
on improving FFN layers and routers, while atten-
tion mechanisms have remained largely unchanged.
There is still much room for exploring how to en-
hance the efficiency of MoE models by evolving
their attention layers.
A.1.2 Linear Sequence Modeling
Linear Attention. Linear attention encompasses
a set of techniques aimed at calculating atten-
tion outputs using the "right-product kernel trick,"
which first computes key-value products, thereby
avoiding the quadratic complexity associated with
query-key computations. Vanilla linear atten-
tion (Katharopoulos et al., 2020) replaces the
Softmax attention (Vaswani et al., 2017) with ker-
nel methods, reducing the computational complex-
ity to linear in relation to sequence length. Building
on this, various extensions of linear attention have
emerged. For example, TransNormerLLM (Qin
et al., 2024a) introduces Lightning Attention, an op-
timized linear attention mechanism that speeds up
processing by enhancing IO operations. Lightning
Attention-2 (Qin et al., 2024c) further improves this
by separately handling inter- and intra-block com-
putations to fully exploit the advantages of linear at-
tention on autoregressive tasks. RetNet (Sun et al.,
2023) combines a retention mechanism with atten-
tion, offering both parallel training and linear-time
inference. Gated Linear Attention (GLA) (Yang
et al., 2023) introduces a data-dependent gating
mechanism and presents a hardware-efficient algo-
rithm for training. DeltaNet (Schlag et al., 2021),
along with its parallelized version (Yang et al.,
2024c), applies a delta-rule-like update to improve
performance in long-context scenarios. More re-
cently, Gated Slot Attention (GSA) (Zhang et al.,
2024), inspired by GLA, introduces a bounded-
memory slot control mechanism within the gated
linear attention framework, further boosting perfor-
mance in tasks requiring strong recall abilities.
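The "right-product kernel trick" can be verified numerically. The NumPy sketch below omits the kernel feature map and normalization for brevity (an intentional simplification) and checks that reordering the matrix products leaves the output unchanged, both without masking and in the causal case, where the right product becomes a running key-value state.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 4  # sequence length and head dimension (illustrative sizes)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Non-causal case: same result, different cost.
left_product = (Q @ K.T) @ V   # forms an n x n matrix: O(n^2 d)
right_product = Q @ (K.T @ V)  # forms a d x d matrix:  O(n d^2)

# Causal case: the d x d state S accumulates key-value outer products,
# so each step costs O(d^2) regardless of the position t.
S = np.zeros((d, d))
causal_recurrent = np.empty_like(V)
for t in range(n):
    S += np.outer(K[t], V[t])
    causal_recurrent[t] = Q[t] @ S

# Reference: masked quadratic form with a lower-triangular 0/1 mask.
causal_quadratic = np.tril(Q @ K.T) @ V

print(np.allclose(left_product, right_product))
print(np.allclose(causal_recurrent, causal_quadratic))
```

The recurrent form is what gives linear attention its constant-size state during decoding.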
State Space Model. SSM provides a robust
framework for modeling sequences as
dynamical systems, and has proven effective
for linear sequence modeling.
Models such as S4 (Gu et al., 2022b) and
its subsequent variants (Gu et al., 2022a; Gupta
et al., 2022) have achieved notable success, par-
ticularly in long-range synthetic tasks. A recent
example is Mamba (Gu and Dao, 2023), a represen-
tative SSM model that introduces a state selection
mechanism. Mamba addresses the limitation of
static dynamics in previous methods, arguing that
they do not account for input-specific context se-
lection within the hidden state, which is critical for
tasks like language modeling. Mamba has shown
superior performance compared to Transformers
across various model sizes and scales. Mamba has
been further refined in its successor, Mamba2 (Dao
and Gu, 2024), which integrates a linear attention-
like mechanism that improves hardware efficiency
during training. Similar to how linear attention
uses outer products to expand the state, Mamba2
leverages a state-space duality that enables paral-
lel attention-style computation while maintaining
recurrent inference capabilities.
Linear RNN. Traditional RNNs struggle with
long-context sequence modeling, largely due to
their sequential nature during training, which limits
their ability to benefit from scaling laws (Sun et al.,
2023). To mitigate these issues, Linear RNNs intro-
duce parallel training capabilities, achieving com-
petitive performance with Transformers of com-
parable size. RWKV (Peng et al., 2023, 2024)
is an example of a large language model based
on linear RNNs, designed to effectively manage
long-term dependencies. Furthermore, HGRN (Qin
et al., 2024e) emphasizes the importance of data-
dependent decay mechanisms in enhancing linear
RNN performance, showing how tuning decay pa-
rameters can improve learning in long-context sce-
narios. The upgraded HGRN2 (Qin et al., 2024d)
builds on this by introducing a state expansion
mechanism that leverages outer product opera-
tions, allowing for better scalability and improved
sequence modeling over extended inputs. Both
RWKV and HGRN models aim to address the
limitations of traditional RNNs for efficient long-
sequence modeling.
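A short sketch may help show why a linear recurrence with decay admits parallel training. It uses a generic gated recurrence h_t = f_t ⊙ h_{t-1} + (1 − f_t) ⊙ x_t with randomly drawn gates; this form is an assumption chosen for illustration and is not the exact RWKV or HGRN update. The closed form based on cumulative products is numerically fragile for long sequences, so it is a demonstration rather than a practical kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 12, 3
f = rng.uniform(0.5, 1.0, size=(T, d))  # decay gates (random stand-ins)
x = rng.standard_normal((T, d))
b = (1.0 - f) * x                        # gated input term

# Sequential evaluation: one step after another, as in a classic RNN.
h = np.zeros(d)
sequential = []
for t in range(T):
    h = f[t] * h + b[t]
    sequential.append(h)
sequential = np.stack(sequential)

# Closed form: h_t = P_t * sum_{s<=t} b_s / P_s with P_t = prod_{r<=t} f_r.
# Both cumprod and cumsum are scan operations that can run in parallel.
P = np.cumprod(f, axis=0)
parallel = P * np.cumsum(b / P, axis=0)

print(np.allclose(sequential, parallel))
```

Because the recurrence is linear in the hidden state, all time steps can be obtained from scans instead of a strictly serial loop.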
A.2 Tensor Parallelism on Linear-MoE
The core computation mechanism of LSM modules
can be abstracted in the following general form:
$$O = \phi(Q)\left(\phi(K)^{\top} V\right), \qquad Q = X W_Q,\; K = X W_K,\; V = X W_V, \tag{6}$$
where TP is applied by splitting the matrix multiplications
as follows:
$$\begin{aligned}
Q &= [\phi(X W_Q^1),\, \phi(X W_Q^2)], \\
K &= [\phi(X W_K^1),\, \phi(X W_K^2)], \\
V &= X [W_V^1,\, W_V^2], \\
O &= [O_1,\, O_2],
\end{aligned} \tag{7}$$
where the weight matrices $W_Q$, $W_K$, and $W_V$ are
divided along their columns, producing an output
matrix $O$ that is also split along columns.
The split output $[O_1, O_2]$ is then multiplied by
an output linear weight matrix that is split along its
rows, resulting in:
$$O = [O_1,\, O_2]\,[W_O^1,\, W_O^2]^{\top} = O_1 W_O^1 + O_2 W_O^2, \tag{8}$$
which produces a unified output.
As with TP in standard attention, TP for LSM
modules introduces an all-reduce collective com-
munication operation during both the forward and
backward passes. In practical terms, this all-reduce
operation is implemented via two separate steps:
all-gather and reduce-scatter, which together func-
tionally achieve the same result as a single all-
reduce.
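The column-split and row-split scheme of Eqs. (6) to (8) can be simulated on a single process. The NumPy sketch below makes several assumptions that the text leaves open: the feature map is taken to be elu(x) + 1, each TP shard holds exactly one head so that shards never mix, and the all-reduce is replaced by a plain sum over per-rank partial outputs.

```python
import numpy as np


def phi(x):
    # Hypothetical feature map elu(x) + 1; the text leaves phi abstract.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def lsm(X, WQ, WK, WV):
    # Eq. (6) restricted to one head (one TP shard).
    Q, K, V = phi(X @ WQ), phi(X @ WK), X @ WV
    return Q @ (K.T @ V)


rng = np.random.default_rng(0)
n, d, h, ranks = 6, 8, 4, 2  # tokens, model dim, head dim, TP size
X = rng.standard_normal((n, d))
WQ, WK, WV = (rng.standard_normal((d, ranks * h)) for _ in range(3))
WO = rng.standard_normal((ranks * h, d))
shards = [slice(r * h, (r + 1) * h) for r in range(ranks)]

# Single-device reference: compute every head, concatenate, then project.
O_cat = np.concatenate(
    [lsm(X, WQ[:, s], WK[:, s], WV[:, s]) for s in shards], axis=1
)
Y_ref = O_cat @ WO

# TP simulation: rank r owns column shard r of W_Q, W_K, W_V (Eq. 7)
# and row shard r of W_O (Eq. 8); partial outputs are then summed.
partials = [lsm(X, WQ[:, s], WK[:, s], WV[:, s]) @ WO[s, :] for s in shards]
Y_tp = np.sum(partials, axis=0)  # stands in for the all-reduce

print(np.allclose(Y_ref, Y_tp))
```

Only the final sum requires communication across ranks, which corresponds to the single all-reduce in the forward pass described above.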
A.3 Sequence Parallelism on Linear-MoE
Algorithm 1 SP on Linear-MoE w/o Masking
1: Input: input sequence $X$, distributed world size $W$, sequence parallel size $T = W$.
2: Distribute $X = [X_t]_{1}^{T}$.
3: for chunk $t \in \{1, \cdots, T\}$ on ranks $\{1, \cdots, W\}$ in parallel do
4:   Calculate $Q_t = X_t W_Q$, $K_t = X_t W_K$, $V_t = X_t W_V$.
5:   Compute $M_t = K_t^{\top} V_t$.
6:   Communicate $[M_t]_{1}^{T} = \mathrm{AllGather}([M_t]_{1}^{T})$.
7:   Compute $M_{1:T} = \mathrm{Sum}([M_t]_{1}^{T})$.
8:   Compute $O_t = Q_t M_{1:T}$.
9: end for
10: return $O = [O_t]_{1}^{T}$.
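The chunked computation of Algorithm 1, together with its masked counterpart (Algorithm 2), can be sanity-checked against a single-device reference. The sketch below simulates the $T$ ranks with a Python list; the AllGather is represented simply by every chunk having access to the list `M` of local states. The function names are hypothetical, and the mask is taken to be a 0/1 lower-triangular matrix, which is the form required when it is applied through an elementwise product.

```python
import numpy as np

def project(Xt, Wq, Wk, Wv):
    return Xt @ Wq, Xt @ Wk, Xt @ Wv

def sp_unmasked(X, Wq, Wk, Wv, T):
    # Each of the T simulated ranks holds one chunk X_t (len(X) must divide by T).
    qkv = [project(Xt, Wq, Wk, Wv) for Xt in np.split(X, T, axis=0)]
    M = [K.T @ V for _, K, V in qkv]       # local states M_t = K_t^T V_t
    M_total = sum(M)                        # AllGather, then Sum over all chunks
    return np.concatenate([Q @ M_total for Q, _, _ in qkv], axis=0)

def sp_masked(X, Wq, Wk, Wv, T):
    qkv = [project(Xt, Wq, Wk, Wv) for Xt in np.split(X, T, axis=0)]
    C = X.shape[0] // T
    psi = np.tril(np.ones((C, C)))          # 0/1 causal mask within a chunk
    M = [K.T @ V for _, K, V in qkv]        # gathered local states
    outs = []
    for t, (Q, K, V) in enumerate(qkv):
        intra = ((Q @ K.T) * psi) @ V                # within-chunk causal part
        M_prefix = sum(M[:t], np.zeros_like(M[0]))   # states of earlier chunks only
        outs.append(intra + Q @ M_prefix)            # O_t = O_intra + O_inter
    return np.concatenate(outs, axis=0)

def full_unmasked(X, Wq, Wk, Wv):
    Q, K, V = project(X, Wq, Wk, Wv)
    return Q @ (K.T @ V)

def full_masked(X, Wq, Wk, Wv):
    Q, K, V = project(X, Wq, Wk, Wv)
    n = X.shape[0]
    return ((Q @ K.T) * np.tril(np.ones((n, n)))) @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))
Wq, Wk, Wv = (rng.standard_normal((4, 4)) for _ in range(3))
print(np.allclose(sp_unmasked(X, Wq, Wk, Wv, T=4), full_unmasked(X, Wq, Wk, Wv)))
print(np.allclose(sp_masked(X, Wq, Wk, Wv, T=4), full_masked(X, Wq, Wk, Wv)))
```

In both cases the only data exchanged between chunks is the $d \times d$ state $M_t$, whose size does not depend on the sequence length.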
Algorithm 2 SP on Linear-MoE w/ Masking
1: Input: input sequence $X$, distributed world size $W$, sequence parallel size $T = W$.
2: Distribute $X = [X_t]_{1}^{T}$.
3: Initialize mask matrix $\Psi$, where $\Psi_{ij} = 1$ if $i \ge j$ and $\Psi_{ij} = 0$ if $i < j$.
4: for chunk $t \in \{1, \cdots, T\}$ on ranks $\{1, \cdots, W\}$ in parallel do
5:   Calculate $Q_t = X_t W_Q$, $K_t = X_t W_K$, $V_t = X_t W_V$.
6:   Compute $M_t = K_t^{\top} V_t$.
7:   Communicate $[M_t]_{1}^{T} = \mathrm{AllGather}([M_t]_{1}^{T})$.
8:   Compute $O_{t,\mathrm{intra}} = [(Q_t K_t^{\top}) \odot \Psi] V_t$.
9:   Compute prefix sum $M_{1:t-1} = \mathrm{PrefixSum}([M_t]_{1}^{t-1})$.
10:  Compute $O_{t,\mathrm{inter}} = Q_t M_{1:t-1}$.
11:  Compute $O_t = O_{t,\mathrm{intra}} + O_{t,\mathrm{inter}}$.
12: end for
13: return $O = [O_t]_{1}^{T}$.
A.4 Additional Experiments
[Figure: training loss (y-axis) versus number of samples (x-axis, ×1e7) for Mamba2, Mamba2(H), and GLA.]
Figure 7: Training Loss Curves of A1B-7B Model Instances.
A.5 Datasets and Benchmarks
We pretrain all the models on a portion of the
SlimPajama dataset which is sampled to approxi-
mately 100 billion tokens.
•SlimPajama (Soboleva et al., 2023) is a
high-quality, optimized subset of the Red-
Pajama dataset, designed for large-scale lan-
guage model training. It includes diverse text
sources such as Common Crawl, Wikipedia,
books, and GitHub code, with a primary focus
on English. The dataset is cleaned, dedupli-
cated, and optimized for efficiency and perfor-
mance.
| Scale | Model | LSM Instance | PIQA (acc ↑) | Hella. (acc_norm ↑) | Wino. (acc ↑) | ARC-e (acc ↑) | ARC-c (acc_norm ↑) | MMLU (acc, 5-shot ↑) | Avg. ↑ | Avg. (no MMLU) ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| A0.3B-2B, 15B Tokens | Baseline | Attention | 55.77 | 27.10 | 50.83 | 33.04 | 23.21 | 23.24 | 35.53 | 37.99 |
| A0.3B-2B, 15B Tokens | Pure | BLA | 64.42 | 33.41 | 49.01 | 48.15 | 24.32 | 26.32 | 40.94 | 43.86 |
| A0.3B-2B, 15B Tokens | Pure | Retention | 62.08 | 29.14 | 50.75 | 42.72 | 21.50 | 23.12 | 39.60 | 43.39 |
| A0.3B-2B, 15B Tokens | Pure | GLA | 65.56 | 35.29 | 50.67 | 47.81 | 23.04 | 24.85 | 41.20 | 44.47 |
| A0.3B-2B, 15B Tokens | Pure | Mamba2 | 66.97 | 37.79 | 50.20 | 49.12 | 24.74 | 25.85 | 42.45 | 45.76 |
| A0.3B-2B, 15B Tokens | Pure | HGRN2 | 52.50 | 26.37 | 49.01 | 24.83 | 27.65 | 25.10 | 34.24 | 36.07 |
| A0.3B-2B, 15B Tokens | Hybrid | BLA | 66.76 | 37.16 | 49.96 | 49.62 | 24.74 | 25.64 | 42.31 | 45.65 |
| A0.3B-2B, 15B Tokens | Hybrid | Retention | 66.21 | 36.06 | 51.54 | 47.18 | 24.91 | 23.71 | 41.60 | 45.18 |
| A0.3B-2B, 15B Tokens | Hybrid | GLA | 67.71 | 38.62 | 49.72 | 50.51 | 26.02 | 25.05 | 42.94 | 46.52 |
| A0.3B-2B, 15B Tokens | Hybrid | Mamba2 | 66.38 | 38.81 | 51.30 | 50.17 | 24.91 | 24.61 | 42.70 | 46.31 |
| A0.3B-2B, 15B Tokens | Hybrid | HGRN2 | 66.27 | 36.79 | 51.46 | 48.82 | 25.43 | 23.19 | 41.99 | 45.75 |

Table 5: A0.3B-2B Evaluation Results on Language Modeling Benchmarks (No Data Corruption). All models
are pretrained from scratch on the same 15B subset of the SlimPajama dataset with the Qwen2 tokenizer. No
benchmark data corruption in the pretraining dataset. The A0.3B-2B hybrid models have a stack as "LLLNLLLNLLLN",
where "L" represents the Linear-MoE layer, and "N" represents the normal MoE transformer layer.

| Scale | Model | LSM Instance | PIQA (acc ↑) | Hella. (acc_norm ↑) | Wino. (acc ↑) | ARC-e (acc ↑) | ARC-c (acc_norm ↑) | MMLU (acc, 5-shot ↑) | Avg. ↑ | Avg. (no MMLU) ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| A1B-7B, 100B Tokens | Pure | BLA | 66.65 | 37.74 | 50.12 | 50.80 | 24.23 | 23.71 | 42.21 | 45.91 |
| A1B-7B, 100B Tokens | Pure | GLA | 68.17 | 43.51 | 51.22 | 52.48 | 25.09 | 24.83 | 44.22 | 48.09 |
| A1B-7B, 100B Tokens | Pure | Mamba2 | 69.21 | 41.86 | 51.46 | 52.86 | 25.17 | 23.66 | 44.04 | 48.11 |

Table 6: A1B-7B Evaluation Results on Language Modeling Benchmarks (No Data Corruption). All models
are pretrained from scratch on the same 100B subset of the SlimPajama dataset with the Qwen2 tokenizer. No
benchmark data corruption in the pretraining dataset.

For the benchmark, we tested on these tasks:
•PIQA (Bisk et al., 2020): A dataset focused on
physical commonsense reasoning in English
with 3084 test samples. The text consists of
everyday tasks and scenarios, requiring models
to determine the most practical way to
perform an action. The data is sourced from
crowdsourced descriptions, reflecting a broad
range of common human experiences.
•ARC-Easy & ARC-Challenge (Clark et al.,
2018): A set of multiple-choice science questions
in English, sourced from standardized
exams and educational materials, with 2376
and 1172 test samples, respectively. The dataset
represents the domain of elementary and high
school science, with questions authored by
educators and test designers. ARC-Easy includes
straightforward questions, while ARC-Challenge
contains more difficult ones that
require advanced reasoning.
•HellaSwag (Zellers et al., 2019): An English-
language dataset designed for commonsense
reasoning, where models must choose the
most plausible continuation of a sentence. The
text is derived from activity descriptions (e.g.,
WikiHow), covering everyday scenarios. The
dataset was constructed adversarially to be
challenging for language models. It has 10003
test samples.
•WinoGrande (Sakaguchi et al., 2019): A
large-scale English dataset for commonsense
reasoning, based on the Winograd Schema
Challenge with 1267 test samples. It tests pro-
noun resolution in ambiguous contexts, with
sentences sourced and refined through crowd-
sourcing. The dataset aims to reduce annota-
tion biases by diversifying sentence structures
and topics.
•MMLU (Li et al., 2023): The MMLU
(Massive Multitask Language Understanding)
dataset is a comprehensive benchmark designed
to evaluate AI models’ general knowledge
across a wide range of subjects and
languages. It comprises 57 distinct cate-
gories, spanning elementary-level knowledge
to advanced professional topics such as law,
physics, history, and computer science. The
dataset has been translated into 14 languages
using professional human translators, ensur-
ing high-quality and accurate translations.
This multilingual approach aims to improve
the inclusivity and effectiveness of AI models
across different linguistic communities.
All datasets used in this work are publicly avail-
able and have been released by their original cre-
ators, who are responsible for ensuring privacy pro-
tection. These datasets are used in accordance with
their respective licenses and intended purposes. No
modifications or derivative datasets have been cre-
ated.