Paper 2503.05447

Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts

Authors: Weigao Sun, Disen Lan, Tong Zhu, Xiaoye Qu, Yu Cheng

Published: 2025-03-07

Abstract:

Linear Sequence Modeling (LSM) methods such as linear attention, state space models, and linear RNNs, together with Mixture-of-Experts (MoE), have recently emerged as significant architectural improvements. In this paper, we introduce Linear-MoE, a production-level system for modeling and training large-scale models that integrate LSM with MoE. Linear-MoE leverages the advantages of both LSM modules for linear-complexity sequence modeling and MoE layers for sparse activation, aiming to offer high performance with efficient training. The Linear-MoE system comprises: 1) a Modeling subsystem, which provides a unified framework supporting all instances of LSM, and 2) a Training subsystem, which facilitates efficient training by incorporating various advanced parallelism technologies, particularly Sequence Parallelism designed for Linear-MoE models. Additionally, we explore hybrid models that combine Linear-MoE layers with standard Transformer-MoE layers, together with their Sequence Parallelism, to further enhance model flexibility and performance. Evaluations on two model series, A0.3B-2B and A1B-7B, demonstrate that Linear-MoE achieves efficiency gains while maintaining competitive performance on various benchmarks, showcasing its potential as a next-generation foundational model architecture. Code: https://github.com/OpenSparseLLMs/Linear-MoE.

Paper Content:

Weigao Sun1†, Disen Lan1,2, Tong Zhu3, Xiaoye Qu1, Yu Cheng4
1Shanghai AI Laboratory, 2South China University of Technology, 3Soochow University, 4The Chinese University of Hong Kong
†Project lead. Corresponding author: Weigao Sun (sunweigao@outlook.com). Work done during Disen Lan’s internship at Shanghai AI Laboratory.

1 Introduction

Mixture-of-Experts (MoE) (Jacobs et al., 1991; Qu et al., 2024) architectures have gained widespread adoption in cutting-edge models within industry, with prominent examples including Gemini-1.5 (Reid et al., 2024) and the reported use of MoE in GPT-4 (Chintala, 2023). Other notable large models incorporating MoE techniques include Mixtral (Jiang et al., 2024), DeepSeek V2 (Liu et al., 2024), Qwen2 (Yang et al., 2024a), JetMoE (Shen et al., 2024), Jamba (Team et al., 2024), and OLMoE (Muennighoff et al., 2024).

Most advances in MoE research concentrate on modifying the routing mechanism or expert layers, while typically keeping the attention layers unchanged (Zhu et al., 2024b). These attention layers commonly rely on the softmax self-attention mechanism introduced in the Transformer architecture (Vaswani et al., 2017). Softmax-based self-attention has proven to be highly effective for sequence modeling tasks across various data types. However, a significant limitation of this mechanism is its computational complexity, which grows quadratically with the input sequence length. This complexity can lead to substantial computational costs, especially during training, making it challenging for models that need to handle long sequences efficiently.

Linear sequence modeling (LSM) has recently gained significant attention due to its impressive efficiency in both training and inference. These methods function similarly to recurrent neural networks (RNNs) with matrix-valued hidden states, allowing them to achieve linear-time training and constant-memory inference. This efficiency is largely due to the fact that LSM techniques bypass the computation of attention scores and eliminate the need for maintaining a key-value (KV) cache. There are three primary approaches to linear sequence modeling: linear attention (Katharopoulos et al., 2020), state space models (SSM) (Gu and Dao, 2023; Dao and Gu, 2024; Hu et al., 2024; Waleffe et al., 2024), and linear RNN (Peng et al., 2023, 2024; Qin et al., 2024d).
Linear attention is a variation of the traditional softmax attention mechanism, replacing the exponential kernel with a simpler dot product between key and query vectors, which enables the use of the right-product kernel trick to reduce computational complexity. SSM approaches, such as Mamba and Mamba2, stem from control theory and represent sequence modeling as dynamic systems. Meanwhile, linear RNN methods address the limitations of traditional RNNs in modeling long contexts by enabling parallel training of RNN models. These different methods, linear attention, SSM, and linear RNN, share a common mathematical foundation and exhibit similar performance on sequence modeling tasks (Dao and Gu, 2024; Peng et al., 2024; Qin et al., 2024b; Yang et al., 2024c). In fact, they all employ a unified recurrence framework expressed as $M_s = M_{s-1} + \widehat{M}_s$, where $M_s$ denotes the memory state and $\widehat{M}_s$ represents the incremental memory update at the $s$-th token.

arXiv:2503.05447v1 [cs.LG] 7 Mar 2025

In this paper, we introduce Linear-MoE, a production-level system designed for modeling and training large-scale MoE models with LSM modules integrated. The Linear-MoE system is composed of two key subsystems: Modeling and Training. The Modeling subsystem provides a unified linear sequence modeling framework for Linear-MoE models. It supports three main types of LSM methods: linear attention, state space model (SSM), and linear RNN. For each type, multiple instances are implemented under a unified formulation. The Training subsystem is designed to achieve efficient training of Linear-MoE models on modern accelerators. In addition to supporting state-of-the-art training techniques, we incorporate a specialized Sequence Parallelism (SP) technique for LSM modules, which is particularly effective for handling extremely long input sequences on the Linear-MoE architecture.
Importantly, the system is designed to be extensible, enabling more advanced sequence modeling methods or training techniques to be integrated in the future. Furthermore, we also explore efficient modeling and training for hybrid Linear-MoE models, which combine Linear-MoE layers with standard Transformer-MoE layers. For hybrid models, we introduce an SP method that employs distinct computational and communication strategies tailored to the different types of layers.

Our contributions can be summarized as follows:

• Production-level System. We introduce Linear-MoE, a production-level system designed for efficient modeling and training of large-scale MoE models with LSM modules integrated.
• Modeling & Training Subsystems. The Linear-MoE system is composed of two subsystems: Modeling and Training. We provide a unified linear sequence modeling formulation to support various LSM modules with MoE layers, as well as state-of-the-art training techniques for efficient large-scale model training, especially on long-context inputs.
• Experimental Validation. In empirical studies, we pretrain two series of Linear-MoE models from scratch on the public SlimPajama corpus. Extensive experiments validate the training and inference efficiency of our system framework, as well as the performance of the Linear-MoE architecture on various downstream tasks.

2 Linear-MoE System

2.1 Modeling

2.1.1 Unified Linear Sequence Modeling

The standard softmax attention (Vaswani et al., 2017), commonly used in transformer models, has a parallel computation form during training that can typically be expressed as:

$$O = \mathrm{Softmax}(QK^\top)V. \tag{1}$$

Here, the matrices $Q, K, V, O \in \mathbb{R}^{N \times d}$ correspond to the query, key, value, and output matrices, respectively. The matrices $Q$, $K$, and $V$ are linear projections of the input matrix $X \in \mathbb{R}^{N \times d}$, defined as $Q = XW_Q$, $K = XW_K$, and $V = XW_V$, where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ are learnable weight matrices. Here, $N$ and $d$ represent the sequence length and hidden dimension.
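
As a rough illustration of Eq. (1), the sketch below computes softmax attention on a toy input using plain Python lists. The helper names (`matmul`, `softmax_rows`, `softmax_attention`) and the toy numbers are assumptions made for this example. Like Eq. (1), it applies no scaling and no causal mask. The point to notice is the intermediate $N \times N$ score matrix, which is where the quadratic cost in sequence length comes from.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(scores):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    result = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

def softmax_attention(x, w_q, w_k, w_v):
    """Eq. (1): O = Softmax(Q K^T) V, with Q, K, V projected from X."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # N x N matrix: the quadratic term
    return matmul(softmax_rows(scores), v), scores

# Toy input: N = 3 tokens, d = 2 hidden dimension (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, scores = softmax_attention(x, identity, identity, identity)
print(len(scores), "x", len(scores[0]), "score matrix")  # prints "3 x 3 score matrix"
```

Doubling $N$ here quadruples the number of entries in `scores`, while the output stays $N \times d$.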
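
Linear Attention (Katharopoulos et al., 2020), as one of the representative LSM methods, has emerged as a viable alternative to traditional softmax attention by implementing two primary modifications. First, it eliminates the $\mathrm{Softmax}(\cdot)$ operation, instead embedding it within a kernel feature map. Second, it leverages the associative property of matrix multiplication, reconfiguring $(QK^\top)V$ into $Q(K^\top V)$. These changes reduce both the computational and memory complexity from $O(N^2 d)$ to $O(N d^2)$. This approach is frequently referred to as the right-product kernel trick, as it prioritizes the matrix product on the right side.

During inference, both softmax self-attention and linear attention handle a single token at each iteration. Given the $s$-th token $x_s \in \mathbb{R}^{1 \times d}$, softmax self-attention requires storing an expanding set of keys $\{k_1, \cdots, k_s\}$ and values $\{v_1, \cdots, v_s\}$, i.e., the KV cache, which leads to a significant memory burden when dealing with long input sequences:

$$q_s, k_s, v_s = x_s W_Q,\; x_s W_K,\; x_s W_V, \qquad o_s = \frac{\sum_{i=1}^{s} \exp(q_s k_i^\top)\, v_i}{\sum_{i=1}^{s} \exp(q_s k_i^\top)}. \tag{2}$$

Linear attention replaces the term $\exp(q_s k_i^\top)$ with a kernel $k(x, y)$ with an associated feature map $\phi$, i.e., $k(x, y) = \langle \phi(x), \phi(y) \rangle$. This simplifies the calculation of $o_s$ as

$$o_s = \frac{\sum_{i=1}^{s} \phi(q_s)\phi(k_i)^\top v_i}{\sum_{i=1}^{s} \phi(q_s)\phi(k_i)^\top}. \tag{3}$$

Letting $M_s = \sum_{i=1}^{s} \phi(k_i)^\top v_i$ and $z_s = \sum_{i=1}^{s} \phi(k_i)^\top$, where $M_s \in \mathbb{R}^{d \times d}$ and $z_s \in \mathbb{R}^{d \times 1}$, we can rewrite Eq. (3) as an RNN:

$$M_s = M_{s-1} + \phi(k_s)^\top v_s, \qquad z_s = z_{s-1} + \phi(k_s)^\top, \qquad o_s = \frac{\phi(q_s) M_s}{\phi(q_s) z_s}. \tag{4}$$

Table 1: Instances of Linear Sequence Modeling Methods. All instances listed follow the unified formulation in Eq. (5). Here, $a \in \mathbb{R}$, $a_s \in \mathbb{R}$, $\mathbf{a}_s \in \mathbb{R}^d$, $A \in \mathbb{R}^{d \times d}$, and $A_s \in \mathbb{R}^{d \times d}$ represent a fixed constant, a time-dependent scalar, a time-dependent vector, a time-independent matrix, and a time-dependent matrix, respectively. Note that the same notation may denote different variables in different instances.

LSM Method | Instance | Recurrent Update Rule | Parameter
Linear Attention | BLA | $M_s = M_{s-1} + k_s^\top v_s$ | \
Linear Attention | Lightning Attn | $M_s = a M_{s-1} + k_s^\top v_s$ | $a \in \mathbb{R}$
Linear Attention | RetNet | $M_s = a M_{s-1} + k_s^\top v_s$ | $a \in \mathbb{R}$
Linear Attention | GLA | $M_s = \mathrm{diag}\{\mathbf{a}_s\} M_{s-1} + k_s^\top v_s$ | $\mathbf{a}_s \in \mathbb{R}^d$
Linear Attention | DeltaNet | $M_s = (I - a_s k_s^\top k_s) M_{s-1} + b_s k_s^\top v_s$ | $a_s, b_s \in \mathbb{R}$
Linear Attention | Rebased | $M_s = M_{s-1} + \phi(k_s)^\top v_s$ | \
Linear Attention | GFW | $M_s = A_s \odot M_{s-1} + k_s^\top v_s$ | $A_s \in \mathbb{R}^{d \times d}$
Linear Attention | GateLoop | $M_s = A_s \odot M_{s-1} + k_s^\top v_s$ | $A_s \in \mathbb{R}^{d \times d}$
Linear Attention | Gated DeltaNet | $M_s = a_s (I - k_s^\top k_s) M_{s-1} + b_s k_s^\top v_s$ | $a_s, b_s \in \mathbb{R}$
Linear Attention | TTT | $M_s = M_{s-1} + b_s \nabla l(M_{s-1}; k_s, v_s)$ | $b_s \in \mathbb{R}$
Linear Attention | Titans | $M_s = a_s M_{s-1} + b_s \nabla l(M_{s-1}; k_s, v_s)$ | $a_s, b_s \in \mathbb{R}$
SSM* | S4 | $M_s = \exp(-(\mathbf{a}\mathbf{1}^\top) A) \odot M_{s-1} + (\mathbf{a}\mathbf{1}^\top) \mathbf{b}^\top v_s$ | $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d$, $A \in \mathbb{R}^{d \times d}$
SSM* | Mamba | $M_s = \exp(-(\mathbf{a}_s\mathbf{1}^\top) A_s) \odot M_{s-1} + (\mathbf{a}_s\mathbf{1}^\top) k_s^\top v_s$ | $\mathbf{a}_s \in \mathbb{R}^d$, $A_s \in \mathbb{R}^{d \times d}$
SSM* | Mamba2 | $M_s = \exp(-a b_s) \odot M_{s-1} + b_s k_s^\top v_s$ | $a, b_s \in \mathbb{R}$
Linear RNN | HGRN2 | $M_s = \mathrm{diag}\{\mathbf{a}_s\} M_{s-1} + (1 - \mathbf{a}_s)^\top v_s$ | $\mathbf{a}_s \in \mathbb{R}^d$
Linear RNN | RWKV6 | $M_s = \mathrm{diag}\{\mathbf{a}_s\} M_{s-1} + k_s^\top v_s$ | $\mathbf{a}_s \in \mathbb{R}^d$
Linear RNN | RWKV7 | $M_s = \mathrm{diag}\{\mathbf{a}_s\} M_{s-1} + \nabla l(M_{s-1}; k_s, v_s)$ | $\mathbf{a}_s \in \mathbb{R}^d$

*For both S4 and Mamba, the Euler Discretization (Gu et al., 2020) is applied, such that $\bar{B} = \Delta B$, and the unprojected $x_s$ is denoted as $v_s$ for consistency with other formulas.

Follow-up studies on SSM (e.g., Mamba2) and linear RNNs (e.g., RWKV6, HGRN2) have demonstrated their similarity with linear attention (Dao and Gu, 2024; Peng et al., 2024). In fact, recent studies (Qin et al., 2024b; Yang et al., 2024c) have suggested that linear attention, state space, and linear RNN sequence modeling methods can be expressed within a unified recurrence framework as:

$$\widehat{M}_s = f(k_s^\top, v_s), \qquad M_s = \Theta_s \diamond M_{s-1} + \widehat{M}_s. \tag{5}$$

In this formulation, $\widehat{M}_s \in \mathbb{R}^{d \times d}$ represents the memory state corresponding to the $s$-th input, which is a function of $k_s^\top$ and $v_s$.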
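
To make the step from Eq. (3) to Eq. (4) concrete, the following sketch evaluates both forms on the same toy inputs and checks that they agree. It is only an illustration: the function names are hypothetical, the inputs are made up, and the feature map $\phi(x) = \mathrm{elu}(x) + 1$ is the choice used by Katharopoulos et al. (2020), picked here as an assumption because this section does not fix a particular $\phi$.

```python
import math

def phi(vec):
    """Feature map elu(x) + 1, which is positive everywhere (illustrative choice)."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]

def attention_direct(qs, ks, vs):
    """Eq. (3): for each token s, sum over every earlier token i <= s."""
    outputs = []
    for s in range(len(qs)):
        fq = phi(qs[s])
        num = [0.0] * len(vs[0])
        den = 0.0
        for i in range(s + 1):
            weight = sum(a * b for a, b in zip(fq, phi(ks[i])))  # phi(q_s) phi(k_i)^T
            num = [n + weight * v for n, v in zip(num, vs[i])]
            den += weight
        outputs.append([n / den for n in num])
    return outputs

def attention_recurrent(qs, ks, vs):
    """Eq. (4): carry a fixed-size memory state M and a normalizer z."""
    d_k, d_v = len(ks[0]), len(vs[0])
    M = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        for a in range(d_k):
            z[a] += fk[a]                      # z_s = z_{s-1} + phi(k_s)^T
            for b in range(d_v):
                M[a][b] += fk[a] * v[b]        # M_s = M_{s-1} + phi(k_s)^T v_s
        fq = phi(q)
        num = [sum(fq[a] * M[a][b] for a in range(d_k)) for b in range(d_v)]
        den = sum(fq[a] * z[a] for a in range(d_k))
        outputs.append([n / den for n in num])
    return outputs

# Toy inputs: 3 tokens with d = 2 (made-up numbers).
qs = [[0.1, -0.4], [0.7, 0.2], [-0.3, 0.5]]
ks = [[0.5, 0.1], [-0.2, 0.9], [0.3, -0.6]]
vs = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
direct = attention_direct(qs, ks, vs)
recurrent = attention_recurrent(qs, ks, vs)
max_gap = max(abs(a - b) for ra, rb in zip(direct, recurrent) for a, b in zip(ra, rb))
```

The recurrent version never stores past keys or values. Its state is one $d \times d$ matrix plus one $d$-vector regardless of how many tokens have been processed, which is the constant-memory inference property described above.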
Here, $\Theta_s$ denotes a coefficient matrix that may be time-varying or constant (and can also be a vector or scalar). The operator "$\diamond$" can denote either standard matrix multiplication or the Hadamard product. We collect recent LSM method instances that follow the unified formulation in Eq. (5) and list them in Table 1.

2.1.2 Linear-MoE Architecture

The Linear-MoE architecture is relatively straightforward, consisting of $L$ stacked Linear-MoE blocks, as depicted in Fig. 1. Each Linear-MoE block includes an LSM layer and an MoE layer, with a normalization layer preceding each. The LSM layer serves as a generalized structure that supports various LSM methods, specifically linear attention, SSM, and linear RNN, each encompassing multiple instance methods. Table 1 provides an overview of these LSM method instances, unified under a common recurrence framework. This framework highlights key distinctions between various LSM instances, primarily in their handling of the prior-step memory state $M_{s-1}$ and the computation of the incremental memory state $\widehat{M}_s$. For the MoE layers, we retain the standard mechanisms of sparse expert activation and routing, as employed in SOTA open-source MoE models. These mechanisms are essential for maintaining an optimal balance between model performance and computational efficiency.

(Diagram labels: Input Embedding; $L \times$ blocks of Norm, LSM Layer, Norm, MoE Layer; inside the LSM Layer: Linear projections to Q, K, V, Matmul, Memory State, Norm, Linear; inside the MoE Layer: Router and Experts 1 to 4.)
Figure 1: Linear-MoE Architecture. In each Linear-MoE block, there is both an LSM layer and an MoE layer, with each layer preceded by its own normalization layer. The LSM layer is designed as a flexible abstraction of LSM methods, including linear attention, SSM, and linear RNN, which follow a unified recurrence framework.

In this paper, we refer to models composed exclusively of Linear-MoE layers as pure Linear-MoE models.
These models achieve high efficiency during both training and inference, benefiting from the LSM modules embedded in each layer. However, despite these advantages, empirical research (Lieber et al., 2024; Ren et al., 2024; Waleffe et al., 2024) has shown that models relying solely on LSM modules tend to underperform on tasks requiring strong recall capabilities, such as in-context learning (e.g., five-shot MMLU (Hendrycks et al., 2020), Phone-book lookup (Jelassi et al., 2024), Needle In A Haystack (Briakou et al., 2023)) and long-context reasoning. In such cases, a hybrid architecture that interleaves linear transformer layers with standard transformer layers has proven effective in improving model performance on recall-intensive tasks (Yang et al., 2024b; MiniMax et al., 2025; Lan et al., 2025).

Based on this prior, we propose a hybrid Linear-MoE architecture that combines Linear-MoE layers with standard (MoE) transformer layers. A practical approach for constructing these hybrid models is to periodically substitute certain Linear-MoE layers with standard MoE transformer layers within the model. For instance, in a 4-layer hybrid Linear-MoE model, with "L" denoting Linear-MoE layers and "N" denoting normal transformer layers, configurations such as "LLLL" or "LNLN" may be used, depending on the desired ratio of normal transformer layers, which can be adjusted based on user preference.

2.2 Training

2.2.1 Sequence Parallelism on Linear-MoE

The existing methods, LASP (Sun et al., 2024) and its improved version LASP-2 (Sun et al., 2025), are designed specifically to leverage the right-product-first property of linear attention techniques for efficient sequence parallelism (SP). LASP employs a point-to-point ring-style communication pattern, facilitating the exchange of incremental memory states across devices.
This communication pattern is particularly effective for managing dependencies while minimizing the data transferred between devices, enhancing the scalability of SP. LASP-2 further refines this approach by replacing the ring-style communication with an all-gather collective communication operation, streamlining the entire communication process. This modification not only simplifies the communication structure but also improves the parallelism of computation and communication.

(Diagram labels: Seq Chunk 0 and Seq Chunk 1 distributed over GPU0 to GPU3; an LSM Module exchanging M and a Standard Attention Module exchanging K / V; communication operations AG / RS, RS / AG, AG / No, No / AG.)
Figure 2: Sequence Parallelism Approach on Hybrid Linear-MoE models. We exemplify the parallelism on the hybrid layers of LSM and standard attention with both TP and SP (both have a dimension of 2). The communication operations colored in yellow and green are for TP and SP, respectively. AG/RS: all-gather in forward and reduce-scatter in backward; RS/AG: reduce-scatter in forward and all-gather in backward; AG/No: all-gather in forward and no-op in backward; No/AG: no-op in forward and all-gather in backward. Note that the SP communication operations for linear attention operate on the memory state $M_s \in \mathbb{R}^{d \times d}$, while for standard attention, they operate on states $K_s, V_s \in \mathbb{R}^{C \times d}$.

In this work, we extend the capabilities of the LASP series to the Linear-MoE system, allowing for efficient SP training on LSM modules, particularly when dealing with extremely long sequences across large-scale distributed clusters. This extension significantly enhances the scalability and efficiency of training large-scale Linear-MoE models with long-context sequences on extensive compute resources.
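
The all-gather pattern attributed to LASP-2 above can be illustrated with a single-process simulation. This is a sketch under several assumptions, not the authors' implementation: devices are simulated by slicing the sequence into `sp_size` chunks, a shared Python list stands in for the all-gather collective, the feature map is the identity, the normalizer is omitted (the basic linear attention rule $M_s = M_{s-1} + k_s^\top v_s$ from Table 1), and all function names are hypothetical. Each simulated device contributes one $d \times d$ memory state, sums the states of the chunks before it, and combines that prefix with a local causal computation.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def causal_linear_attention(q, k, v):
    """Reference on one sequence: O = [(Q K^T) with a lower-triangular mask] V."""
    scores = matmul(q, transpose(k))
    masked = [[s if j <= i else 0.0 for j, s in enumerate(row)]
              for i, row in enumerate(scores)]
    return matmul(masked, v)

def sequence_parallel_attention(q, k, v, sp_size):
    """Simulate SP: each chunk plays the role of one device."""
    n = len(q)
    if n % sp_size != 0:
        raise ValueError("sequence length must be divisible by sp_size")
    c = n // sp_size
    chunks = [(q[t * c:(t + 1) * c], k[t * c:(t + 1) * c], v[t * c:(t + 1) * c])
              for t in range(sp_size)]
    # Local step: every device forms its own d x d memory state K_t^T V_t.
    local_states = [matmul(transpose(kt), vt) for _, kt, vt in chunks]
    # Communication step: a single all-gather; the shared list stands in for it.
    gathered = local_states
    d_k, d_v = len(k[0]), len(v[0])
    outputs = []
    for t, (qt, kt, vt) in enumerate(chunks):
        prefix = [[0.0] * d_v for _ in range(d_k)]
        for state in gathered[:t]:           # states of all earlier chunks
            prefix = add(prefix, state)
        intra = causal_linear_attention(qt, kt, vt)  # tokens inside this chunk
        inter = matmul(qt, prefix)                   # contribution of earlier chunks
        outputs.extend(add(intra, inter))
    return outputs

# Toy inputs: 4 tokens, d = 2, split across 2 simulated devices.
q = [[1.0, 0.0], [0.5, 1.0], [0.0, 2.0], [1.0, 1.0]]
k = [[1.0, 1.0], [0.0, 1.0], [2.0, 0.0], [1.0, -1.0]]
v = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0], [1.0, 1.0]]
reference = causal_linear_attention(q, k, v)
parallel = sequence_parallel_attention(q, k, v, sp_size=2)
```

In this simulation the data exchanged per device is one $d \times d$ matrix, independent of the chunk length, which is why the communicated volume does not grow with sequence length.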
A detailed breakdown of the SP algorithm on Linear-MoE, with and without masking, is provided in Appendix A.3.

2.2.2 Hybrid Model Sequence Parallelism

Hybrid linear sequence modeling models, which combine linear transformer layers (leveraging LSM methods for token mixing) with standard transformer layers (utilizing conventional self-attention for token mixing), have demonstrated substantial improvements in handling long-context tasks (Lieber et al., 2024; Ren et al., 2024; Waleffe et al., 2024). This hybrid model is particularly beneficial for tasks with high recall requirements, including five-shot MMLU (Hendrycks et al., 2020), Phone-book lookup (Jelassi et al., 2024), and Needle In A Haystack (Briakou et al., 2023). Our proposed hybrid Linear-MoE models also aim to enhance performance in areas where pure Linear-MoE models have shown limitations, specifically on tasks where recall precision is critical.

Applying SP on pure Linear-MoE models is straightforward, as this form of SP operates exclusively on the LSM modules, leaving the MoE layers unaffected. In hybrid Linear-MoE models, however, implementing SP becomes more complex due to the interleaving of distinct sequence modeling layers. To effectively optimize SP for these hybrid models, we introduce an integrated approach that incorporates SP across both the Linear-MoE and standard transformer layers, thus enhancing overall efficiency. We illustrate the approach in Fig. 2 and explain it below:

On LSM Module. The SP for LSM modules is implemented via a single collective communication operation on the memory state $M_s \in \mathbb{R}^{d \times d}$. This approach ensures that the communication complexity does not depend on either the sequence or sub-sequence length; rather, it scales only linearly with the SP size $T$, thereby maintaining efficiency in distributed setups.

On Standard Attention Module.
Context parallelism (CP) is an SP technique used in Megatron-LM that divides input data and activations along the sequence dimension, specifically designed for standard softmax attention. Traditional CP implementations in Megatron-LM rely on a ring-like communication-computation overlap (Liu et al., 2023). In contrast, our approach for standard attention modules adopts the all-gather-based strategy used in the pretraining of Llama3 (Dubey et al., 2024). Rather than utilizing a ring strategy, we perform all-gather communication for the $K_s$ and $V_s$ tensors across devices, followed by local computation of attention output on each device’s chunk of $Q_s$. While all-gather communication theoretically has higher latency than ring-based methods, it offers greater flexibility and adaptability for handling different attention masks, such as document-level masks, making it well suited to varying attention patterns. Moreover, the latency of all-gather is minimized since the $K_s$ and $V_s$ tensors are notably smaller than the $Q_s$ tensor, especially with grouped query attention (Ainslie et al., 2023). Consequently, the computational time for generating attention output significantly outweighs the cost of all-gather communication.

2.2.3 Hybrid Parallelism

SP in Linear-MoE allows for a flexible choice of sequence parallel size, which can be set to any factor of the total number of distributed nodes (i.e., the world size). This flexibility enables splitting input data across both batch and sequence dimensions, creating a combined approach known as data-sequence hybrid parallelism. Standard data parallelism techniques, such as Distributed Data Parallel (DDP) (Li et al., 2020), can integrate seamlessly with SP in Linear-MoE. Additionally, sharded data parallelism methods, such as the Distributed Optimizer (Korthikanti et al., 2022) in Megatron-Core, are also compatible.
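
One way to picture data-sequence hybrid parallelism is as a partition of the ranks into sequence-parallel groups and data-parallel groups. The sketch below is a minimal illustration under assumed conventions: the function name `build_groups` is hypothetical, and the rank layout (contiguous ranks share an SP group) is an assumption for the example rather than the ordering Megatron-Core or Linear-MoE actually uses.

```python
def build_groups(world_size, sp_size):
    """Split ranks 0..world_size-1 into SP groups and DP groups.

    Assumed layout: ranks inside one SP group are contiguous and hold different
    chunks of the same sequences; ranks inside one DP group hold the same chunk
    position of different sequences.
    """
    if sp_size <= 0 or world_size % sp_size != 0:
        raise ValueError("sp_size must evenly divide world_size")
    dp_size = world_size // sp_size
    sp_groups = [list(range(i * sp_size, (i + 1) * sp_size)) for i in range(dp_size)]
    dp_groups = [list(range(j, world_size, sp_size)) for j in range(sp_size)]
    return sp_groups, dp_groups

def local_shard_shape(global_batch, seq_len, world_size, sp_size):
    """Per-rank (batch, tokens) when the batch is split over DP and the sequence over SP."""
    dp_size = world_size // sp_size
    if global_batch % dp_size != 0 or seq_len % sp_size != 0:
        raise ValueError("batch and sequence length must divide evenly")
    return global_batch // dp_size, seq_len // sp_size

sp_groups, dp_groups = build_groups(world_size=8, sp_size=4)
# sp_groups == [[0, 1, 2, 3], [4, 5, 6, 7]]
# dp_groups == [[0, 4], [1, 5], [2, 6], [3, 7]]
shard = local_shard_shape(global_batch=4, seq_len=16384, world_size=8, sp_size=4)
# shard == (2, 4096)
```

With eight ranks and an SP size of 4, each rank in this layout sees 2 sequences and 4096 tokens of each, so the split happens along both the batch and the sequence dimension.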

Furthermore, the system provides support for Tensor Parallelism (TP), Pipeline Parallelism (PP), and Expert Parallelism (EP) specifically tailored for Linear-MoE models. In the case of TP, its application to Linear-MoE models is direct and efficient, as detailed in §A.2. Regarding PP and EP, these parallelism techniques operate on Linear-MoE in much the same way as their original versions, since they are not involved in the inner computations of the LSM modules but rather work at the level of complete Linear-MoE blocks or MoE layers. Moreover, TP, PP, and EP can be combined with DP and SP as introduced earlier, enhancing flexibility and scalability for large distributed setups.

2.2.4 Variable Length

During pretraining, batches generally consist of sequences with a uniform length. However, in the finetuning phase or during inference, the model often encounters batches containing sequences of different lengths. A common approach to handle this variation is to right-pad each sequence in the batch so that all match the length of the longest sequence in that batch. While straightforward, this padding strategy can lead to inefficiencies, particularly when sequence lengths vary greatly within a batch. For standard transformers, more advanced methods have been introduced to address this issue. These methods include techniques like distributing workloads across GPUs to avoid padding altogether (Zeng et al., 2022; Zhai et al., 2023), or packing multiple sequences into a single batch while adjusting the attention mask as needed (Ding et al., 2024; Pouransari et al., 2024). In Linear-MoE, handling variable-length sequences is simplified by processing the entire batch as one continuous long sequence, effectively managing varying sequence lengths without the need for padding.
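
The difference between right-padding and treating the batch as one continuous sequence can be sketched as follows. The helper names and the cumulative-offset list are assumptions for illustration (the offsets follow a common convention for marking sequence boundaries in packed batches). The text above does not specify how Linear-MoE treats boundaries inside the LSM kernels, so this example only shows the data layout and the padding it avoids.

```python
def pad_batch(seqs, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]

def pack_batch(seqs):
    """Concatenate all sequences into one long sequence.

    Returns the flat token list and cumulative offsets; sequence i occupies
    flat[offsets[i]:offsets[i + 1]].
    """
    flat, offsets = [], [0]
    for s in seqs:
        flat.extend(s)
        offsets.append(len(flat))
    return flat, offsets

# Toy batch of token ids with lengths 3, 1, and 2 (made-up values).
batch = [[1, 2, 3], [4], [5, 6]]
padded = pad_batch(batch)
flat, offsets = pack_batch(batch)
padded_slots = sum(len(row) for row in padded)   # 9 slots, 3 of them padding
packed_slots = len(flat)                         # 6 slots, no padding
# offsets == [0, 3, 4, 6]
```

In this toy batch a third of the padded slots carry no information. The gap widens as sequence lengths within a batch become more uneven.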

(Diagram labels. Modeling: MoE Base Models with Tokenizers (Qwen2, DeepSeek-V2, Mixtral); Checkpoint Convertor (HuggingFace → Megatron-Core, Megatron-Core → HuggingFace); LSM Modules (Linear Attention: BLA, Lightning Attn, Retention, GLA, DeltaNet, Based, Rebased, …; SSM: Mamba2, …; Linear RNN: HGRN2, RWKV6, …); Examples (Pretraining, Finetuning, Evaluating, Configs). Training: Megatron-Core (Data Parallel, Tensor Parallel, Pipeline Parallel, Expert Parallel, LASP, Context Parallel, Distributed Optimizer, CPU Offloading, Transformer Engine, Mixed Precision, Activation Checkpointing); Grouped GEMM; MegaBlocks; ….)
Figure 3: Linear-MoE System Implementation. The Linear-MoE system is composed of two main subsystems: Modeling and Training. It is developed in a non-intrusive manner, utilizing the latest version of Megatron-Core. All components within the system are designed with extensibility in mind, encompassing the LSM modules, base models, examples, and training technologies. This design allows for future enhancements and extensions of the system.

2.3 Implementation

The implementation of the Linear-MoE system is based on Megatron-Core, an open-source library developed on PyTorch that incorporates optimized GPU techniques and advanced system-level enhancements. As depicted in Fig. 3, the Linear-MoE system consists of both modeling and training subsystems, facilitating adaptable model building and efficient training specifically for Linear-MoE models. Leveraging the capabilities of Megatron-Core, the Linear-MoE library is fully compatible with all NVIDIA Tensor Core GPUs, including support for FP8 acceleration on NVIDIA Hopper architectures.

The Linear-MoE design approach aims to minimize any invasive changes to Megatron-Core’s source code. Rather than adding new modules directly, Linear-MoE operates independently, allowing users to benefit from the latest LLM practices without disruptions due to updates or changes within Megatron-Core.

2.3.1 Modeling Subsystem

Linear-MoE abstracts its LSM modules into modular and composable APIs, providing model developers and researchers with extensive flexibility to design and train large-scale custom Linear-MoE models on accelerated computing platforms. The system includes essential building blocks, such as core components for LSM mechanisms, MoE layers and Linear-MoE blocks, normalization techniques, and embedding methods. To enhance adaptability, LSM mechanisms are organized into three main categories: linear attention, SSM, and linear RNN, with multiple instances available in each. For linear attention, options include basic linear attention (BLA), Lightning Attention, Retention, GLA, DeltaNet, Based, and Rebased; for SSM, we provide Mamba2, the leading SSM model at present; and for linear RNN, options include HGRN2 and RWKV6. As LSM techniques evolve, Linear-MoE will continue to incorporate more LSM methods to ensure users have access to the latest advancements.

Additionally, Linear-MoE offers vital components such as a model library, tokenizers, model converters, usage examples, and a set of supportive toolkits. The model library includes instances of Linear-MoE models that are adapted from state-of-the-art open-source MoE architectures, including Qwen2 MoE, DeepSeekV2 MoE, and Mixtral MoE. These adapted instances are designated as Linear-MoE-Qwen2, Linear-MoE-DeepSeekV2, and Linear-MoE-Mixtral, respectively. These models are implemented following the Megatron-Core format, with the standard attention layers replaced by LSM-based token mixing layers, while keeping the original embedding, normalization, and expert layers unchanged.

2.3.2 Training Subsystem

Advanced parallelism techniques, encompassing tensor, sequence, pipeline, context, and MoE expert parallelism, are seamlessly incorporated into the Linear-MoE system through its design on top of the Megatron-Core library.
This non-intrusive integration allows Linear-MoE to leverage the robust training capabilities of Megatron-Core, supporting large-scale model training across both standard attention layers and MoE expert layers. However, the inherent parallelism mechanisms, such as TP and SP, were not originally optimized for LSM modules. Additionally, Megatron-Core does not fully support efficient SP for hybrid models containing both LSM modules and standard attention layers. To address these gaps, we elaborate on our TP and SP approaches specifically designed for LSM modules and hybrid models, as discussed in §2.2.

Further capabilities, including mixed precision, activation recomputation, distributed optimizer, distributed checkpointing, and CPU offloading, are also inherited from Megatron-Core, enhancing model training flexibility and efficiency. Linear-MoE also supports 8-bit floating point (FP8) precision on Hopper GPUs, benefiting from the integration of NVIDIA’s Transformer Engine (Micikevicius et al., 2022). This feature optimizes memory usage and accelerates performance during both training and inference stages.

To enhance the training speed of MoE layers, we incorporate MegaBlocks (Gale et al., 2023) into our Linear-MoE system. MegaBlocks is designed to optimize MoE training on GPUs by reconfiguring MoE computations using block-sparse operations and developing new block-sparse GPU kernels that effectively manage the inherent dynamism of MoE. In addition, we also integrate the Grouped GEMM library into Linear-MoE, which introduces grouped GEMM kernels in PyTorch, thereby accelerating the computational processes involved in training MoE models.

2.3.3 Evaluation Module

In order to facilitate the evaluation on mainstream benchmarks, we have developed offline text generation of Linear-MoE models within the system.
Based on this, mature evaluation frameworks such as OpenCompass (Contributors, 2023) and LM-Evaluation-Harness (Gao et al., 2023) are readily available for conducting evaluation tasks on Linear-MoE models. Furthermore, the system facilitates seamless bidirectional conversion between model weights from HuggingFace and Megatron-Core. This functionality enables users to easily leverage pretrained models from HuggingFace for continued pretraining or fine-tuning within the Megatron-Core environment. Additionally, it allows for the assessment of model performance by using HuggingFace’s evaluation and inference pipelines on models trained within the Megatron-Core framework.

3 Empirical Study

3.1 Experiment Setup

Models and Dataset. We conduct experiments on two Linear-MoE model series: A0.3B-2B and A1B-7B. A0.3B-2B denotes a Linear-MoE model containing a total of 2 billion parameters, with 0.3 billion parameters activated. The same applies for the A1B-7B model. Each series consists of several model instances, each incorporating a distinct instance of the LSM module. The specific LSM module instances used in our experiments include: basic linear attention (BLA) (Katharopoulos et al., 2020), Retentive Network (Retention) (Sun et al., 2023), Gated Linear Attention (GLA) (Yang et al., 2023), DeltaNet (Schlag et al., 2021), Mamba2 (Dao and Gu, 2024), HGRN2 (Qin et al., 2024d), and RWKV6 (Peng et al., 2023, 2024), all implemented in Triton. These model instances are evaluated against models with the standard attention implementation in Megatron-Core (referred to as Baseline) and the FlashAttention-2 (Dao, 2023) implementation in Transformer Engine (in CUDA).

Models | A0.3B-2B | A1B-7B
Hidden Dimension | 1024 | 2048
FFN Dimension | 896 | 1024
Num of Heads | 8 | 16
Num of Layers | 12 | 16
Num of Act Experts | 8 | 8
Num of Experts | 64 | 64
LR | 1e-4 | 1e-5
Minimum LR | 1e-5 | 1e-6
LR Scheduler | Cosine | Cosine
Seq Length | 2048 | 2048
Training Tokens | 15B | 100B

Table 2: Linear-MoE Family Models and Training Configurations. A0.3B-2B indicates that the Linear-MoE model has a total of 2 billion parameters, with 0.3 billion parameters activated. The same applies for A1B-7B.

(Plot: x-axis Seq Length × Batch Size at 2K × 8, 4K × 4, 8K × 2, 16K × 1; y-axis Throughput (Tokens/s), ticks from 60K to 140K; series: Baseline, FlashAttn2, BLA, Retention, GLA, DeltaNet, Mamba2, HGRN2, RWKV6.)
Figure 4: Training Throughput (Tokens/s). As sequence length increases, the throughput of Baseline declines significantly, whereas LSM models maintain stable training efficiency.

To implement the Linear-MoE model instances, we utilize the Qwen2 MoE architecture (Yang et al., 2024a) as the base model. All models are pretrained from scratch on a portion of the SlimPajama dataset (Soboleva et al., 2023). This dataset originally contains 627 billion tokens; we restrict our experiments to the first two chunks of the dataset, totaling approximately 100 billion tokens. The Qwen2 tokenizer is employed throughout the training processes.

Training Configurations. Table 2 details the training configurations for both Linear-MoE model series. We employ the Adam optimizer (Kingma and Ba, 2014) along with parallelism techniques, including TP and EP. Each pretraining run is performed on a node with eight A100 80G GPUs.

3.2 Training and Inference Efficiency

We perform experiments to evaluate the training efficiency of the Linear-MoE system, focusing on throughput and GPU memory requirements using eight A100 GPUs. For training the sparse MoE models, we set the EP size to 8. During the experiments, we maintain a total of 16K input tokens per iteration, while varying the input sequence lengths across {2K, 4K, 8K, 16K} with corresponding batch sizes of {8, 4, 2, 1}. As illustrated in Table 3 and Fig. 4, we observe that the standard attention Baseline shows a significant quadratic increase in memory usage and a decline in throughput as the input sequence lengths grow.
FlashAttention-2 also demonstrates notable variations in both memory footprint and throughput when the sequence length reaches 16K. In contrast, the Linear-MoE models, which incorporate LSM, exhibit relatively stable GPU memory consumption and consistent throughput as the sequence length increases while the number of input tokens remains fixed.

We also perform experiments to compare the inference efficiency of Linear-MoE (using Basic LA) with the Baseline (using FlashAttention-2). The results, shown in Fig. 5, reveal that Linear-MoE offers a significant speed advantage when the decoding length exceeds 16K. Additionally, its memory usage remains constant, which is a key benefit resulting from the adoption of LSM.

Furthermore, to highlight the efficiency benefits of the Linear-MoE training subsystem, we conduct ablation studies on MoE optimization techniques and parallelism training methods. The results of these experiments are presented in Table 4. The MoE optimization techniques, specifically Grouped GEMM and MegaBlocks, significantly reduce the elapsed time for each iteration. Additionally, the various parallelism training techniques each demonstrate their own advantages in terms of memory footprint and overall training efficiency.

Model         2K×8 Mem.  2K×8 Thpt.  4K×4 Mem.  4K×4 Thpt.  8K×2 Mem.  8K×2 Thpt.  16K×1 Mem.  16K×1 Thpt.
Baseline      40.74      102.14      41.42      88.60       42.93      66.17       47.08       49.39
FlashAttn-2   38.96      103.91      39.10      101.78      39.57      105.08      41.51       96.16
Basic LA      42.69      115.16      43.85      119.72      42.71      112.66      43.00       114.67
Retention     42.71      117.85      42.66      119.11      42.73      119.16      42.65       118.19
GLA           43.87      113.29      43.73      118.77      43.63      116.34      43.60       110.87
DeltaNet      43.33      116.95      43.34      120.27      43.31      117.43      43.32       109.72
Mamba2        45.63      105.99      45.94      108.13      47.16      102.51      44.97       106.84
HGRN2         46.03      92.58       46.14      95.74       45.56      97.98       44.97       96.02
RWKV6         47.11      137.62      47.12      136.73      47.11      135.60      47.12       134.51

Table 3: Quantitative Training Efficiency Results.
We experiment on 8 A100 GPUs and report the max allocated GPU memory (GB) and throughput (×10^3 tokens/s) of A0.3B-2B model instances with varying input sequence lengths and batch sizes (column headers give Seq Length × Batch Size).

[Figure 5 plot. x-axis: Decoding Length (1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K); left y-axis: Latency Time (seconds); right y-axis: GPU Memory Usage (GB); series: Baseline w/ FlashAttn2 Time, Basic Linear Attn Time, Baseline w/ FlashAttn2 Memory, Basic Linear Attn Memory, Baseline w/ FlashAttn2 OOM.]

Figure 5: Inference Efficiency of A0.3B-2B Model Instances. We vary the decoding length from 1K to 128K with a fixed batch size of 16 on a single A800 80GB GPU to evaluate the Baseline w/ FlashAttention-2 and the Linear-MoE w/ Basic Linear Attention in terms of inference latency and GPU memory usage.

3.3 Training Loss and Evaluation

To evaluate the overall training performance of the Linear-MoE models, we pretrain the A0.3B-2B and A1B-7B model instances using 15B and 100B tokens, respectively. We test both pure and hybrid model configurations; for the hybrid models, one quarter of the layers throughout the architecture are standard Transformer MoE layers. For instance, in the 12-layer A0.3B-2B model, the hybrid configuration follows the pattern "LLLNLLLNLLLN", while the 16-layer A1B-7B model adopts the pattern "LLLNLLLNLLLNLLLN".

The training loss curves for the A0.3B-2B model instances, which include both pure and hybrid Linear-MoE models, are presented in Fig. 6. The results demonstrate that the pure Linear-MoE architecture achieves competitive convergence performance compared to the standard attention Baseline. Moreover, the hybrid models exhibit more stable convergence and consistent performance when compared with the Baseline. Additional experiment results, such as benchmark evaluations and training loss curves of A1B-7B models, can be found in Appendix A.4.
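The hybrid layouts quoted above can be generated by a small helper. This is a minimal sketch, assuming that every fourth layer is a standard ("N") layer and all others are Linear-MoE ("L") layers; the function name `hybrid_pattern` and its `period` parameter are hypothetical and not taken from the Linear-MoE code base.

```python
def hybrid_pattern(num_layers, period=4):
    """Build a layer-type string: 'L' = Linear-MoE layer, 'N' = standard MoE Transformer layer.

    Assumption: every `period`-th layer is 'N'. This reproduces the two
    patterns quoted in the text; other placements are possible.
    """
    if num_layers % period != 0:
        raise ValueError("num_layers must be a multiple of period")
    return "".join("N" if (i + 1) % period == 0 else "L" for i in range(num_layers))

pattern_12 = hybrid_pattern(12)   # 12-layer A0.3B-2B model
pattern_16 = hybrid_pattern(16)   # 16-layer A1B-7B model
print(pattern_12)
print(pattern_16)
```

With `period=4`, exactly one quarter of the layers are standard layers, matching the ratio stated in the text.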
Both the A0.3B-2B and A1B-7B Linear-MoE model series show competitive performance on various benchmarks, and the hybrid models usually perform better than the pure linear models.

[Figure 6 plots. x-axis: Samples (×10^6); y-axis: Loss. Left panel series: Baseline, BLA, Retention, GLA, DeltaNet, Mamba2. Right panel series: Baseline, BLA(H), Retention(H), GLA(H), DeltaNet(H), Mamba2(H).]

Figure 6: Training Loss Curves of A0.3B-2B Model Instances. Left: pure Linear-MoE models; Right: hybrid Linear-MoE models. Linear-MoE shows competitive training convergence performance compared to the standard attention Baseline.

MoE Optimization   Memory (GB)   Time/Iter (ms)
Baseline           35.28         1565.6
Grouped GEMM       35.01         455.4
MegaBlocks         36.90         348.8

EP   TP   PP   Memory (GB)   Time/Iter (ms)
1    1    1    35.28         1565.6
8    1    1    22.98         739.4
1    8    1    10.04         6879.0
1    1    8    8.89          1820.2
2    2    2    12.90         1684.9

Table 4: Above: MoE Optimization. Below: Distributed training efficiency under different parallelism settings. We report the memory usage per GPU (GB) and elapsed time per iteration (ms) while training the A0.3B-2B model with a sequence length of 2048 and a batch size of 4, using a node equipped with 8 A100 GPUs. The Baseline refers to the MoE implementation in Megatron-Core, which is used without any optimizations.

4 Conclusion

In this paper, we introduced Linear-MoE, a novel production-level system designed to integrate LSM with MoE, aiming to advance both the efficiency and scalability of existing large models. By combining the linear-complexity sequence modeling capabilities of LSM with sparsely activated MoE layers, Linear-MoE achieves high performance while addressing computational and memory constraints common in large model training and deployment. The dual subsystems, Modeling and Training, provide a flexible and extensible framework that supports diverse LSM methods and advanced parallelism techniques, including specific sequence parallelism for handling long input sequences efficiently. We also explored hybrid models that further enhance adaptability by incorporating standard Transformer layers. Our experimental results demonstrate that Linear-MoE achieves significant efficiency gains while maintaining strong performance across various benchmarks. These findings highlight the potential of Linear-MoE as the next generation of foundation model architecture.

Limitations

Despite the promising results demonstrated in this paper, there are several limitations to the Linear-MoE framework. First, while the system successfully combines LSM with MoE to enhance efficiency, the integration of different LSM methods and MoE layers may introduce complexity in hyperparameter tuning, which could impact model performance under certain configurations. Additionally, the scalability of Linear-MoE in extremely large-scale settings, such as beyond the model sizes tested in our experiments (A0.3B-2B and A1B-7B), remains an area for further investigation. Moreover, while the system supports various parallelism techniques, their effectiveness on diverse hardware architectures, particularly in resource-constrained environments, needs more comprehensive evaluation. Therefore, future work should focus on further optimizing the system for a broader set of use cases and exploring additional hybrid modeling strategies.

References

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. Piqa: Reasoning about physical commonsense in natural language.
In Thirty-Fourth AAAI Conference on Artificial Intelligence.

Eleftheria Briakou, Colin Cherry, and George Foster. 2023. Searching for needles in a haystack: On the role of incidental bilingualism in palm’s translation capability. arXiv preprint arXiv:2305.10266.

Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. 2024. A survey on mixture of experts. arXiv preprint arXiv:2407.06204.

Soumith Chintala. 2023. Gpt-4 moe.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. Preprint, arXiv:1803.05457.

OpenCompass Contributors. 2023. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass

Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.

Tri Dao and Albert Gu. 2024. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060.

Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, and Stefano Soatto. 2024. Fewer truncations improve language modeling. arXiv preprint arXiv:2404.10830.

Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, and Yu Cheng. 2025. Mom: Linear sequence modeling with mixture-of-memories. Preprint, arXiv:2502.13685.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39.

Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. 2023.
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts. Proceedings of Machine Learning and Systems, 5.

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2023. A framework for few-shot language model evaluation.

Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. 2020. Hippo: Recurrent memory with optimal polynomial projections. Advances in Neural Information Processing Systems, 33:1474–1487.

Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. 2022a. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983.

Albert Gu, Karan Goel, and Christopher Ré. 2022b. Efficiently modeling long sequences with structured state spaces. In The International Conference on Learning Representations (ICLR).

Ankit Gupta, Albert Gu, and Jonathan Berant. 2022. Diagonal state spaces are as effective as structured state spaces. Advances in Neural Information Processing Systems, 35:22982–22994.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

Jiaxi Hu, Disen Lan, Ziyu Zhou, Qingsong Wen, and Yuxuan Liang. 2024. Time-ssm: Simplifying and unifying state space models for time series forecasting. Preprint, arXiv:2405.16312.

Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. Adaptive mixtures of local experts. Neural Computation, page 79–87.
Samy Jelassi, David Brandfonbrener, Sham M Kakade, and Eran Malach. 2024. Repeat after me: Transformers are better than state space models at copying. arXiv preprint arXiv:2402.01032.

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2022. Reducing activation recomputation in large transformer models. Preprint, arXiv:2205.05198.

Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, and Yu Cheng. 2025. Liger: Linearizing large language models to gated recurrent structures. Preprint, arXiv:2503.01496.

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.

Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. 2021. Base layers: Simplifying training of large, sparse models. In International Conference on Machine Learning, pages 6265–6274. PMLR.

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. 2023. Cmmlu: Measuring massive multitask language understanding in chinese. Preprint, arXiv:2306.09212.
Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. 2020. Pytorch distributed: Experiences on accelerating data parallel training. Preprint, arXiv:2006.15704.

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. 2024. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887.

Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. 2024. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.

Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023. Ring attention with blockwise transformers for near-infinite context. Preprint, arXiv:2310.01889.

Zhenyi Lu, Chenghao Fan, Wei Wei, Xiaoye Qu, Dangyang Chen, and Yu Cheng. 2025. Twin-merging: Dynamic integration of modular expertise in model merging. Advances in Neural Information Processing Systems, 37:78905–78935.

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. 2022. Fp8 formats for deep learning. arXiv preprint arXiv:2209.05433.
MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, and Zijia Wu. 2025. Minimax-01: Scaling foundation models with lightning attention. Preprint, arXiv:2501.08313.

Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, et al. 2024. Olmoe: Open mixture-of-experts language models. arXiv preprint arXiv:2409.02060.

Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, Xingjian Du, Matteo Grella, Kranthi Gv, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartłomiej Koptyra, Hayden Lau, Jiaju Lin, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, Johan Wind, Stanisław Woźniak, Zhenyuan Zhang, Qinghua Zhou, Jian Zhu, and Rui-Jie Zhu. 2023. RWKV: Reinventing RNNs for the transformer era.
In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077, Singapore. Association for Computational Linguistics.

Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Teddy Ferdinan, Haowen Hou, Przemysław Kazienko, et al. 2024. Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence. arXiv preprint arXiv:2404.05892.

Hadi Pouransari, Chun-Liang Li, Jen-Hao Rick Chang, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, and Oncel Tuzel. 2024. Dataset decomposition: Faster llm training with variable sequence length curriculum. arXiv preprint arXiv:2405.13226.

Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Xiao Luo, Yu Qiao, and Yiran Zhong. 2024a. Transnormerllm: A faster and better large language model with improved transnormer. Preprint, arXiv:2307.14995.

Zhen Qin, Xuyang Shen, Weigao Sun, Dong Li, Stan Birchfield, Richard Hartley, and Yiran Zhong. 2024b. Unlocking the secrets of linear complexity sequence model from a unified perspective. arXiv preprint arXiv:2405.17383.

Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. 2024c. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models. arXiv preprint arXiv:2401.04658.

Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. 2024d. Hgrn2: Gated linear rnns with state expansion. arXiv preprint arXiv:2404.07904.

Zhen Qin, Songlin Yang, and Yiran Zhong. 2024e. Hierarchically gated recurrent neural network for sequence modeling. Advances in Neural Information Processing Systems, 36.

Xiaoye Qu, Daize Dong, Xuyang Hu, Tong Zhu, Weigao Sun, and Yu Cheng. 2024. Llama-moe v2: Exploring sparsity of llama from perspective of mixture-of-experts with post-training. arXiv preprint arXiv:2411.15708.
Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.

Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. 2024. Samba: Simple hybrid state space models for efficient unlimited context language modeling. arXiv preprint arXiv:2406.07522.

Stephen Roller, Sainbayar Sukhbaatar, Jason Weston, et al. 2021. Hash layers for large sparse models. Advances in Neural Information Processing Systems, 34:17555–17566.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Winogrande: An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641.

Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. 2021. Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

Yikang Shen, Zhen Guo, Tianle Cai, and Zengyi Qin. 2024. Jetmoe: Reaching llama2 performance with 0.1 m dollars. arXiv preprint arXiv:2404.07413.

Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. 2023. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama.

Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, and Yu Cheng. 2025. Lasp-2: Rethinking sequence parallelism for linear attention and its hybrid. arXiv preprint arXiv:2502.07563.

Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, and Yiran Zhong. 2024. Linear attention sequence parallelism. arXiv preprint arXiv:2404.02882.
Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. 2023. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621.

Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, et al. 2024. Jamba-1.5: Hybrid transformer-mamba models at scale. arXiv preprint arXiv:2408.12570.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, et al. 2024. An empirical study of mamba-based language models. arXiv preprint arXiv:2406.07887.

Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. 2024. Auxiliary-loss-free load balancing strategy for mixture-of-experts. Preprint, arXiv:2408.15664.

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zhihao Fan. 2024a. Qwen2 technical report. arXiv preprint arXiv:2407.10671.

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. 2024b. Gated delta networks: Improving mamba2 with delta rule. Preprint, arXiv:2412.06464.
Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. 2023. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635.

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. 2024c. Parallelizing linear transformers with the delta rule over sequence length. arXiv preprint arXiv:2406.06484.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Jinle Zeng, Min Li, Zhihua Wu, Jiaqi Liu, Yuang Liu, Dianhai Yu, and Yanjun Ma. 2022. Boosting distributed training performance of the unpadded bert model. arXiv preprint arXiv:2208.08124.

Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, and Yibo Zhu. 2023. Bytetransformer: A high-performance transformer boosted for variable-length inputs. In 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 344–355. IEEE.

Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Wei Bi, Freda Shi, Bailin Wang, Peng Zhou, and Guohong Fu. 2024. Gated slot attention for efficient linear-time sequence modeling. arXiv preprint arXiv:2409.07146.

Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. 2022. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35:7103–7114.

Tong Zhu, Daize Dong, Xiaoye Qu, Jiacheng Ruan, Wenliang Chen, and Yu Cheng. 2024a. Dynamic data mixing maximizes instruction tuning for mixture-of-experts. arXiv preprint arXiv:2406.11256.

Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, and Yu Cheng. 2024b. Llama-moe: Building mixture-of-experts from llama with continual pre-training.
arXiv preprint arXiv:2406.16554.

A Appendix

A.1 Related Work

A.1.1 Mixture-of-Experts

MoE (Jacobs et al., 1991; Cai et al., 2024; Lu et al., 2025) is gaining increasing attention in the development of large language models (LLMs) due to its ability to scale model size while maintaining computational efficiency. Its key strength lies in the sparse activation of experts and routing mechanisms, enabling a better balance between model performance and training cost. The effectiveness of MoE in modern deep learning was first demonstrated in Shazeer et al. (2017), where an MoE layer was introduced between LSTM layers, resulting in state-of-the-art performance on language modeling and machine translation benchmarks. Following this, the MoE layer was incorporated into the Transformer architecture, replacing the feed-forward network (FFN) layers. GShard (Lepikhin et al., 2020) applied MoE to Transformers, significantly improving machine translation across 100 languages. Switch Transformers (Fedus et al., 2022) further scaled model size to trillions of parameters, using a simplified and efficient MoE layer design. However, training MoE models often leads to load imbalance, where only a few experts are heavily utilized, leaving others underutilized (Lewis et al., 2021; Wang et al., 2024; Zhu et al., 2024a; Du et al., 2025). To address this, several strategies have been developed to optimize MoE training. These include the BASE layer (Lewis et al., 2021), the HASH layer (Roller et al., 2021), and Expert Choice (Zhou et al., 2022), all of which aim to maximize model capacity utilization. MoE architectures have been widely adopted in industry-leading models, such as Gemini-1.5 (Reid et al., 2024) and reportedly GPT-4 (Chintala, 2023).
Other notable examples of LLMs incorporating MoE techniques include Mixtral (Jiang et al., 2024), DeepSeek V2 (Liu et al., 2024), Qwen2 (Yang et al., 2024a), JetMoE (Shen et al., 2024), Jamba (Team et al., 2024), and OLMoE (Muennighoff et al., 2024). Despite the advances in MoE, most research has focused on improving FFN layers and routers, while attention mechanisms have remained largely unchanged. There is still much room for exploring how to enhance the efficiency of MoE models by evolving their attention layers.

A.1.2 Linear Sequence Modeling

Linear Attention. Linear attention encompasses a set of techniques aimed at calculating attention outputs using the "right-product kernel trick," which first computes key-value products, thereby avoiding the quadratic complexity associated with query-key computations. Vanilla linear attention (Katharopoulos et al., 2020) replaces the Softmax attention (Vaswani et al., 2017) with kernel methods, reducing the computational complexity to linear in the sequence length. Building on this, various extensions of linear attention have emerged. For example, TransNormerLLM (Qin et al., 2024a) introduces Lightning Attention, an optimized linear attention mechanism that speeds up processing by improving IO efficiency. Lightning Attention-2 (Qin et al., 2024c) further improves this by separately handling inter- and intra-block computations to fully exploit the advantages of linear attention on autoregressive tasks. RetNet (Sun et al., 2023) combines a retention mechanism with attention, offering both parallel training and linear-time inference. Gated Linear Attention (GLA) (Yang et al., 2023) introduces a data-dependent gating mechanism and presents a hardware-efficient algorithm for training. DeltaNet (Schlag et al., 2021), along with its parallelized version (Yang et al., 2024c), applies a delta-rule-like update to improve performance in long-context scenarios.
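The right-product trick described above rests on the associativity of matrix multiplication once the softmax is removed. The following is a minimal sketch under simplifying assumptions: no feature map, no normalization, and no causal mask; the helper names (`matmul`, `transpose`, `left_product`, `right_product`) are illustrative only and use plain Python lists rather than any particular library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def left_product(q, k, v):
    """(Q K^T) V: forms an N x N matrix first, so cost is quadratic in N."""
    return matmul(matmul(q, transpose(k)), v)

def right_product(q, k, v):
    """Q (K^T V): forms a d x d matrix first, so cost is linear in N."""
    return matmul(q, matmul(transpose(k), v))

# Toy inputs: N = 4 tokens, d = 2 features (integers keep the arithmetic exact).
Q = [[1, 0], [0, 1], [1, 1], [2, 1]]
K = [[1, 2], [0, 1], [1, 0], [1, 1]]
V = [[1, 0], [2, 1], [0, 3], [1, 1]]

out_left = left_product(Q, K, V)
out_right = right_product(Q, K, V)
print(out_left == out_right)
```

Both orderings give the same output; only the size of the intermediate matrix differs (N × N versus d × d), which is where the linear-complexity claim comes from.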
More recently, Gated Slot Attention (GSA) (Zhang et al., 2024), inspired by GLA, introduces a bounded-memory slot control mechanism within the gated linear attention framework, further boosting performance in tasks requiring strong recall abilities.

State Space Model. SSMs provide a robust framework for modeling sequences as dynamical systems and have proven effective in the field of linear sequence modeling. Models such as S4 (Gu et al., 2022b) and its subsequent variants (Gu et al., 2022a; Gupta et al., 2022) have achieved notable success, particularly in long-range synthetic tasks. A recent example is Mamba (Gu and Dao, 2023), a representative SSM model that introduces a state selection mechanism. Mamba addresses the limitation of static dynamics in previous methods, arguing that they do not account for input-specific context selection within the hidden state, which is critical for tasks like language modeling. Mamba has shown superior performance compared to Transformers across various model sizes and scales. Mamba has been further refined in its successor, Mamba2 (Dao and Gu, 2024), which integrates a linear attention-like mechanism that improves hardware efficiency during training. Similar to how linear attention uses outer products to expand the state, Mamba2 leverages a state-space duality that enables parallel attention-style computation while maintaining recurrent inference capabilities.

Linear RNN. Traditional RNNs struggle with long-context sequence modeling, largely due to their sequential nature during training, which limits their ability to benefit from scaling laws (Sun et al., 2023). To mitigate these issues, Linear RNNs introduce parallel training capabilities, achieving competitive performance with Transformers of comparable size. RWKV (Peng et al., 2023, 2024) is an example of a large language model based on linear RNNs, designed to effectively manage long-term dependencies.
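The idea that one layer can be evaluated either recurrently (for inference) or in a parallel, attention-style form (for training) can be illustrated on a toy causal linear recurrence. This is a sketch under stated assumptions, not any model's actual kernel: it uses a single fixed scalar `decay`, whereas the models cited above use their own, often data-dependent, gating, and the function names are hypothetical.

```python
def recurrent_form(q, k, v, decay=1.0):
    """Token-by-token update: S_t = decay * S_{t-1} + k_t^T v_t, o_t = q_t S_t."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]     # fixed-size d_k x d_v state
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        state = [[decay * state[i][j] + k_t[i] * v_t[j] for j in range(d_v)]
                 for i in range(d_k)]
        outputs.append([sum(q_t[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs

def parallel_form(q, k, v, decay=1.0):
    """Masked attention-style form: o_t = sum_{s<=t} decay^(t-s) (q_t . k_s) v_s."""
    d_v = len(v[0])
    outputs = []
    for t, q_t in enumerate(q):
        o_t = [0.0] * d_v
        for s in range(t + 1):                    # causal mask: only s <= t
            score = decay ** (t - s) * sum(a * b for a, b in zip(q_t, k[s]))
            for j in range(d_v):
                o_t[j] += score * v[s][j]
        outputs.append(o_t)
    return outputs

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]]
V = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]

rec = recurrent_form(Q, K, V, decay=0.5)
par = parallel_form(Q, K, V, decay=0.5)
max_gap = max(abs(a - b) for r, p in zip(rec, par) for a, b in zip(r, p))
print(max_gap < 1e-9)
```

The recurrent form keeps a state whose size is independent of the sequence length, while the parallel form touches every earlier position; both produce the same outputs up to floating-point error.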
Furthermore, HGRN (Qin et al., 2024e) emphasizes the importance of data-dependent decay mechanisms in enhancing linear RNN performance, showing how tuning decay parameters can improve learning in long-context scenarios. The upgraded HGRN2 (Qin et al., 2024d) builds on this by introducing a state expansion mechanism that leverages outer product operations, allowing for better scalability and improved sequence modeling over extended inputs. Both RWKV and HGRN models aim to address the limitations of traditional RNNs for efficient long-sequence modeling.

A.2 Tensor Parallelism on Linear-MoE

The core computation mechanism of LSM modules can be abstracted in the following general form:

$$O = \phi(Q)\left(\phi(K)^{\top} V\right), \quad Q = X W_Q,\ K = X W_K,\ V = X W_V, \tag{6}$$

where TP is applied by splitting the matrix multiplications as follows:

$$Q = [\phi(X W_Q^1), \phi(X W_Q^2)], \quad K = [\phi(X W_K^1), \phi(X W_K^2)], \quad V = X [W_V^1, W_V^2], \quad O = [O_1, O_2], \tag{7}$$

where the weight matrices $W_Q$, $W_K$, and $W_V$ are divided along their columns, producing an output matrix $O$ that is also split along columns.

The split output $[O_1, O_2]$ is then multiplied by an output linear weight matrix that is split along its rows, resulting in:

$$O = [O_1, O_2] [W_O^1, W_O^2]^{\top} = O_1 W_O^1 + O_2 W_O^2, \tag{8}$$

which produces a unified output.

As with TP in standard attention, TP for LSM modules introduces an all-reduce collective communication operation during both the forward and backward passes. In practical terms, this all-reduce operation is implemented via two separate steps: all-gather and reduce-scatter, which together functionally achieve the same result as a single all-reduce.

A.3 Sequence Parallelism on Linear-MoE

Algorithm 1 SP on Linear-MoE w/o Masking
1: Input: input sequence $X$, distributed world size $W$, sequence parallel size $T = W$.
2: Distribute $X = [X_t]_1^T$.
3: for chunk $t \in \{1, \cdots, T\}$ on ranks $\{1, \cdots, W\}$ in parallel do
4:   Calculate $Q_t = X_t W_Q$, $K_t = X_t W_K$, $V_t = X_t W_V$.
5:   Compute $M_t = K_t^{\top} V_t$.
6:   Communicate $[M_t]_1^T = \mathrm{AllGather}([M_t]_1^T)$.
7:   Compute $M_{1:T} = \mathrm{Sum}([M_t]_1^T)$.
8:   Compute $O_t = Q_t M_{1:T}$.
9: end for
10: return $O = [O_t]_1^T$.

Algorithm 2 SP on Linear-MoE w/ Masking
1: Input: input sequence $X$, distributed world size $W$, sequence parallel size $T = W$.
2: Distribute $X = [X_t]_1^T$.
3: Initialize mask matrix $\Psi$, where $\Psi_{ij} = 1$ if $i \geq j$ and $\Psi_{ij} = -\infty$ if $i < j$.
4: for chunk $t \in \{1, \cdots, T\}$ on ranks $\{1, \cdots, W\}$ in parallel do
5:   Calculate $Q_t = X_t W_Q$, $K_t = X_t W_K$, $V_t = X_t W_V$.
6:   Compute $M_t = (K_t)^{\top} V_t$.
7:   Communicate $[M_t]_1^T = \mathrm{AllGather}([M_t]_1^T)$.
8:   Compute $O_{t,\mathrm{intra}} = [(Q_t K_t^{\top}) \odot \Psi] V_t$.
9:   Compute prefix sum $M_{1:t-1} = \mathrm{PrefixSum}([M_t]_1^{t-1})$.
10:  Compute $O_{t,\mathrm{inter}} = Q_t M_{1:t-1}$.
11:  Compute $O_t = O_{t,\mathrm{intra}} + O_{t,\mathrm{inter}}$.
12: end for
13: return $O = [O_t]_1^T$.

A.4 Additional Experiments

[Figure 7 plot. x-axis: Samples (×10^7); y-axis: Loss; series: Mamba2, Mamba2(H), GLA.]

Figure 7: Training Loss Curves of A1B-7B Model Instances.

A.5 Datasets and Benchmarks

We pretrain all the models on a portion of the SlimPajama dataset which is sampled to approximately 100 billion tokens.

• SlimPajama (Soboleva et al., 2023) is a high-quality, optimized subset of the RedPajama dataset, designed for large-scale language model training. It includes diverse text sources such as Common Crawl, Wikipedia, books, and GitHub code, with a primary focus on English. The dataset is cleaned, deduplicated, and optimized for efficiency and performance.

For the benchmark, we tested on these tasks:

• PiQA (Bisk et al., 2020): A dataset focused on physical commonsense reasoning in English with 3084 test samples. The text consists of everyday tasks and scenarios, requiring models to determine the most practical way to perform an action. The data is sourced from crowdsourced descriptions, reflecting a broad range of common human experiences.

• ARC-Easy & ARC-Challenge (Clark et al., 2018): A set of multiple-choice science questions in English, sourced from standardized exams and educational materials with 2376 and 1172 test samples.
The dataset represents the domain of elementary and high school science, with questions authored by educators and test designers.

| Scale | Model | LSM Instance | PIQA (acc ↑) | Hella. (acc_norm ↑) | Wino. (acc ↑) | ARC-e (acc ↑) | ARC-c (acc_norm ↑) | MMLU (acc, 5-shot ↑) | Avg. ↑ | Avg. (no MMLU) ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| A0.3B-2B, 15B Tokens | Baseline | Attention | 55.77 | 27.10 | 50.83 | 33.04 | 23.21 | 23.24 | 35.53 | 37.99 |
| A0.3B-2B, 15B Tokens | Pure | BLA | 64.42 | 33.41 | 49.01 | 48.15 | 24.32 | 26.32 | 40.94 | 43.86 |
| A0.3B-2B, 15B Tokens | Pure | Retention | 62.08 | 29.14 | 50.75 | 42.72 | 21.50 | 23.12 | 39.60 | 43.39 |
| A0.3B-2B, 15B Tokens | Pure | GLA | 65.56 | 35.29 | 50.67 | 47.81 | 23.04 | 24.85 | 41.20 | 44.47 |
| A0.3B-2B, 15B Tokens | Pure | Mamba2 | 66.97 | 37.79 | 50.20 | 49.12 | 24.74 | 25.85 | 42.45 | 45.76 |
| A0.3B-2B, 15B Tokens | Pure | HGRN2 | 52.50 | 26.37 | 49.01 | 24.83 | 27.65 | 25.10 | 34.24 | 36.07 |
| A0.3B-2B, 15B Tokens | Hybrid | BLA | 66.76 | 37.16 | 49.96 | 49.62 | 24.74 | 25.64 | 42.31 | 45.65 |
| A0.3B-2B, 15B Tokens | Hybrid | Retention | 66.21 | 36.06 | 51.54 | 47.18 | 24.91 | 23.71 | 41.60 | 45.18 |
| A0.3B-2B, 15B Tokens | Hybrid | GLA | 67.71 | 38.62 | 49.72 | 50.51 | 26.02 | 25.05 | 42.94 | 46.52 |
| A0.3B-2B, 15B Tokens | Hybrid | Mamba2 | 66.38 | 38.81 | 51.30 | 50.17 | 24.91 | 24.61 | 42.70 | 46.31 |
| A0.3B-2B, 15B Tokens | Hybrid | HGRN2 | 66.27 | 36.79 | 51.46 | 48.82 | 25.43 | 23.19 | 41.99 | 45.75 |

Table 5: A0.3B-2B Evaluation Results on Language Modeling Benchmarks (No Data Corruption). All models are pretrained from scratch on the same 15B subset of the SlimPajama dataset with the Qwen2 tokenizer. No benchmark data corruption in the pretraining dataset. The A0.3B-2B hybrid models have a stack as "LLLNLLLNLLLN", where "L" represents the Linear-MoE layer, and "N" represents the normal MoE transformer layer.

| Scale | Model | LSM Instance | PIQA (acc ↑) | Hella. (acc_norm ↑) | Wino. (acc ↑) | ARC-e (acc ↑) | ARC-c (acc_norm ↑) | MMLU (acc, 5-shot ↑) | Avg. ↑ | Avg. (no MMLU) ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| A1B-7B, 100B Tokens | Pure | BLA | 66.65 | 37.74 | 50.12 | 50.80 | 24.23 | 23.71 | 42.21 | 45.91 |
| A1B-7B, 100B Tokens | Pure | GLA | 68.17 | 43.51 | 51.22 | 52.48 | 25.09 | 24.83 | 44.22 | 48.09 |
| A1B-7B, 100B Tokens | Pure | Mamba2 | 69.21 | 41.86 | 51.46 | 52.86 | 25.17 | 23.66 | 44.04 | 48.11 |

Table 6: A1B-7B Evaluation Results on Language Modeling Benchmarks (No Data Corruption). All models are pretrained from scratch on the same 15B subset of the SlimPajama dataset with the Qwen2 tokenizer. No benchmark data corruption in the pretraining dataset.
ARC-Easy includes straightforward questions, while ARC-Challenge contains more difficult ones that require advanced reasoning.
• HellaSwag (Zellers et al., 2019): An English-language dataset designed for commonsense reasoning, where models must choose the most plausible continuation of a sentence. The text is derived from activity descriptions (e.g., WikiHow), covering everyday scenarios. The dataset was constructed adversarially to be challenging for language models. It has 10003 test samples.
• WinoGrande (Sakaguchi et al., 2019): A large-scale English dataset for commonsense reasoning, based on the Winograd Schema Challenge with 1267 test samples. It tests pronoun resolution in ambiguous contexts, with sentences sourced and refined through crowdsourcing. The dataset aims to reduce annotation biases by diversifying sentence structures and topics.
• MMLU (Li et al., 2023): The MMLU (Massive Multitask Language Understanding) dataset is a comprehensive benchmark designed to evaluate AI models' general knowledge across a wide range of subjects and languages. It comprises 57 distinct categories, spanning elementary-level knowledge to advanced professional topics such as law, physics, history, and computer science. The dataset has been translated into 14 languages using professional human translators, ensuring high-quality and accurate translations. This multilingual approach aims to improve the inclusivity and effectiveness of AI models across different linguistic communities.

All datasets used in this work are publicly available and have been released by their original creators, who are responsible for ensuring privacy protection. These datasets are used in accordance with their respective licenses and intended purposes. No modifications or derivative datasets have been created.
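
As an illustration of Algorithm 2 in A.3 (SP with masking), the following single-process sketch simulates the chunked computation and checks it against full causal linear attention. It is not the Linear-MoE implementation. Several things are assumptions made for the sketch: the ranks are simulated by loop iterations, AllGather is replaced by a plain Python list, the feature map $\phi$ is taken to be the identity, and masked entries are set to 0 rather than $-\infty$, since the mask is applied with an elementwise product. All function names are hypothetical.

```python
# Single-process simulation of Algorithm 2 (SP with masking).
# Assumptions: phi is the identity, masked entries are 0, and the
# "all-gathered" memory states are simply kept in a Python list.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def causal_mask(a):
    # Keep entry (i, j) when i >= j, zero it otherwise.
    return [[v if j <= i else 0.0 for j, v in enumerate(row)] for i, row in enumerate(a)]

def sp_masked_linear_attention(x, w_q, w_k, w_v, num_chunks):
    n = len(x)
    assert n % num_chunks == 0, "sequence length must divide evenly into chunks"
    c = n // num_chunks
    chunks = [x[t * c:(t + 1) * c] for t in range(num_chunks)]  # distribute X
    q = [matmul(xt, w_q) for xt in chunks]                      # Q_t = X_t W_Q
    k = [matmul(xt, w_k) for xt in chunks]                      # K_t = X_t W_K
    v = [matmul(xt, w_v) for xt in chunks]                      # V_t = X_t W_V
    m = [matmul(transpose(kt), vt) for kt, vt in zip(k, v)]     # M_t = K_t^T V_t
    d_k, d_v = len(w_k[0]), len(w_v[0])
    out = []
    for t in range(num_chunks):
        # Intra-chunk part: masked attention inside chunk t.
        intra = matmul(causal_mask(matmul(q[t], transpose(k[t]))), v[t])
        # Inter-chunk part: sum of memory states of all earlier chunks.
        prefix = [[0.0] * d_v for _ in range(d_k)]
        for s in range(t):
            prefix = add(prefix, m[s])
        inter = matmul(q[t], prefix)
        out.extend(add(intra, inter))
    return out

def reference(x, w_q, w_k, w_v):
    # Full causal linear attention on the unsplit sequence.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    return matmul(causal_mask(matmul(q, transpose(k))), v)

# Small example: 4 tokens, model width 2, split into 2 chunks.
x = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [2.0, 2.0]]
w_q = [[1.0, 0.0], [1.0, 1.0]]
w_k = [[0.0, 1.0], [2.0, 1.0]]
w_v = [[1.0, -1.0], [0.0, 2.0]]
chunked = sp_masked_linear_attention(x, w_q, w_k, w_v, num_chunks=2)
full = reference(x, w_q, w_k, w_v)
print(max(abs(a - b) for rc, rf in zip(chunked, full) for a, b in zip(rc, rf)))
```

The sketch shows why only the $d_k \times d_v$ memory states $M_t$ need to be exchanged between chunks: each chunk handles its own causal interactions locally and receives everything from earlier chunks through the prefix sum, so the communicated volume does not grow with the chunk length.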