Authors: Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli
Page 1:
Smarter, Better, Faster, Longer : A Modern Bidirectional Encoder for Fast,
Memory Efficient, and Long Context Finetuning and Inference
Benjamin Warner1†Antoine Chaffin2†Benjamin Clavié1†
Orion Weller3Oskar Hallström2Said Taghadouini2
Alexis Gallagher1Raja Biswas1Faisal Ladhak4*Tom Aarsen5
Nathan Cooper1Griffin Adams1Jeremy Howard1Iacopo Poli2
1Answer.AI2LightOn3Johns Hopkins University4NVIDIA5HuggingFace
†: core authors, *: work done while at Answer.AI
Correspondence: {bw,bc}@answer.ai, antoine.chaffin@lighton.ai
Abstract
Encoder-only transformer models such as
BERT offer a great performance-size tradeoff
for retrieval and classification tasks with re-
spect to larger decoder-only models. Despite
being the workhorse of numerous production
pipelines, there have been limited Pareto im-
provements to BERT since its release. In this
paper, we introduce ModernBERT, bringing
modern model optimizations to encoder-only
models and representing a major Pareto im-
provement over older encoders. Trained on
2 trillion tokens with a native 8192 sequence
length, ModernBERT models exhibit state-of-
the-art results on a large pool of evaluations
encompassing diverse classification tasks and
both single and multi-vector retrieval on dif-
ferent domains (including code). In addition
to strong downstream performance, Modern-
BERT is also the most speed and memory effi-
cient encoder and is designed for inference on
common GPUs.
1 Introduction
After the release of BERT (Devlin et al., 2019),
encoder-only transformer-based (Vaswani et al.,
2017) language models dominated most appli-
cations of modern Natural Language Processing
(NLP). Despite the rising popularity of Large Lan-
guage Models (LLMs) such as GPT (Radford et al.,
2018, 2019; Brown et al., 2020), Llama (Touvron
et al., 2023; Dubey et al., 2024), and Qwen (Bai
et al., 2023; Yang et al., 2024), encoder-only
models remain widely used in a variety of non-
generative downstream applications.
The encoder’s popularity is largely due to their
modest inference requirements, enabling them to
efficiently process corpora of documents at scale
for retrieval and quickly perform discriminative
tasks. Encoder models offer a compelling trade-
off in quality versus size, making them a popular
https://github.com/AnswerDotAI/ModernBERToption against encoder-decoder and decoder-only
language models when dealing with substantial
amounts of data (Penedo et al., 2024).
Encoder models are particularly popular in In-
formation Retrieval (IR) applications, e.g., seman-
tic search, with notable progress on leveraging en-
coders for this task (Karpukhin et al., 2020; Khat-
tab and Zaharia, 2020). While LLMs have taken
the spotlight in recent years, they have also moti-
vated a renewed interest in encoder-only models
for IR. Indeed, encoder-based semantic search is
a core component of Retrieval-Augmented Gener-
ation (RAG) pipelines (Lewis et al., 2020), where
encoder models are used to retrieve and feed LLMs
with context relevant to user queries.
Encoder-only models are also still frequently
used for a variety of discriminative tasks such as
classification (Tunstall et al., 2022) or Natural En-
tity Recognition (NER) (Zaratiana et al., 2024),
where they often match the performance of special-
ized LLMs. Here again, they can be used in con-
junction with LLMs, for example detecting toxic
prompts (Ji et al., 2023; Jiang et al., 2024b) and pre-
venting responses, or routing queries in an agentic
framework (Yao et al., 2023; Schick et al., 2023).
Surprisingly, these pipelines currently rely on
older models, and quite often on the original BERT
itself as their backbone (Wang et al., 2022; Xiao
et al., 2023), without leveraging improvements de-
veloped in recent years. Practitioners face many
drawbacks: sequence lengths limited to 512 tokens,
suboptimal model design (Anthony et al., 2024)
and vocabulary sizes (Karpathy, 2023), and gen-
erally inefficient architectures, whether in terms
of downstream performance or computational ef-
ficiency. Finally, training data is limited in vol-
ume and restricted to narrow domains (especially
lacking code data) or lacking knowledge of recent
events.
Recent modernization efforts have only partially
addressed the shortcomings of encoder-only mod-
1arXiv:2412.13663v2 [cs.CL] 19 Dec 2024
Page 2:
els due to limited breadth. MosaicBERT (Portes
et al., 2023), CrammingBERT (Geiping and Gold-
stein, 2023), and AcademicBERT (Izsak et al.,
2021) focused on matching BERT performance
with better training efficiency. NomicBERT (Nuss-
baum et al., 2024) and GTE-en-MLM (Zhang et al.,
2024) (developed concurrently to this work) intro-
duced longer-context encoder models focused on
retrieval applications, but did not optimize for effi-
ciency or classification performance, and re-used
older training data mixtures which is especially
apparent in programming-related tasks.
Contributions We present ModernBERT, a mod-
ernized encoder-only transformer model, with an
improved architecture designed to increase down-
stream performance and efficiency, especially over
longer sequence lengths. We also bring encoder-
only models to modern, larger data scales, by
training on 2 trillion tokens, with a data mix-
ture including code data. We release two mod-
els,ModernBERT-base andModernBERT-large ,
which reach state-of-the-art overall performance
against all existing encoder models on a wide vari-
ety of downstream tasks. These results are achieved
with considerably higher inference efficiency, pro-
cessing sequences of 8192 tokens almost two times
faster than previous models.
To support future research on encoder-only mod-
els, we release FlexBERT1, our modular architec-
ture framework allowing easy experimentation, and
inspired by Pythia (Biderman et al., 2023), all in-
termediate training checkpoints (further detailed in
Section 2.2.2).
2 Methods
2.1 Architectural Improvements
Our model architecture extends the standard trans-
former architecture (Vaswani et al., 2017) by incor-
porating extensively tested recent advances (Sec-
tion 2.1.1). We introduce additional efficiency-
oriented modifications, through both architectural
and implementation improvements (Section 2.1.2)
and a GPU optimized model design (Section 2.1.3).
All of our architectural decisions were informed by
ablations, which we detail in Appendix D.
2.1.1 Modern Transformer
Bias Terms Following (Dayma et al., 2021), we
disable bias terms in all linear layers except for the
1FlexBERT is built on top of a revised Mo-
saicBERT (Portes et al., 2023) codebase.final decoder linear layer2. We also disable all bias
terms in Layer Norms (Xu et al., 2019). These two
changes allow us to spend more of our parameter
budget in linear layers.
Positional Embeddings We use rotary posi-
tional embeddings (RoPE) (Su et al., 2024) instead
of absolute positional embeddings. This choice is
motivated by the proven performance of RoPE in
short- and long-context language models (Black
et al., 2022; Dubey et al., 2024; Gemma et al.,
2024), efficient implementations in most frame-
works, and ease of context extension.
Normalization We use a pre-normalization
block (Xiong et al., 2020) with the standard layer
normalization (Lei Ba et al., 2016), which is known
to help stabilize training (Xiong et al., 2020). Sim-
ilar to CrammingBERT (Geiping and Goldstein,
2023) which also uses pre-normalization, we add
a LayerNorm after the embedding layer. To avoid
repetition, we remove the first LayerNorm in the
first attention layer.
Activation We adopt GeGLU (Shazeer, 2020),
a Gated-Linear Units (GLU)-based (Dauphin et al.,
2017) activation function built on top of the origi-
nal BERT’s GeLU (Hendrycks and Gimpel, 2016)
activation function. This is in line with recent work
showing consistent empirical improvements when
using GLU variants (Shazeer, 2020; Geiping and
Goldstein, 2023).
2.1.2 Efficiency Improvements
Alternating Attention Following recent work on
efficient long context models (Gemma et al., 2024),
attention layers in ModernBERT alternate between
global attention, where every token within a se-
quence attends to every other token, and local atten-
tion, where tokens only attend to each other within
a small sliding window (Beltagy et al., 2020). In
ModernBERT, every third layer employs global
attention with a RoPE theta of 160,000 and the
remaining layers use a 128 token, local sliding win-
dow attention with a RoPE theta of 10,000.
Unpadding ModernBERT follows Mo-
saicBERT (Portes et al., 2023) and GTE (Zhang
et al., 2024) in employing unpadding (Zeng et al.,
2022) for both training and inference. Encoder-
only language models typically use padding tokens
to ensure a uniform sequence length in a batch,
2While many efficient BERT training recipes disable the
bias term in the decoder, e.g. Geiping and Goldstein (2023),
we hypothesized a decoder bias might help alleviate weight
tying’s negative effects (Gao et al., 2019; Welch et al., 2020).
2
Page 3:
wasting compute on semantically empty tokens.
Unpadding avoids this inefficiency by removing
padding tokens, concatenating all sequences
from a minibatch into a single sequence, and
processing it as a batch of one. Prior unpadding
implementations unpad and repad sequences
internally for different model layers, wasting
compute and memory bandwidth. We use Flash
Attention’s variable length attention and RoPE
implementations, allowing jagged attention masks
and RoPE applications on one unpadded sequence.
ModernBERT unpads inputs before the token
embedding layer and optionally repads model
outputs leading to a 10-to-20 percent performance
improvement over other unpadding methods.
Flash Attention Flash Attention (Dao et al.,
2022) is a core component of modern transformer-
based models, providing memory and compute ef-
ficient attention kernels. At the start of this work,
Flash Attention 3 (Shah et al., 2024), the most
recent iteration for Nvidia H100 GPUs, did not
include support for sliding window attention. Mod-
ernBERT uses a mixture of Flash Attention 3 for
global attention layers and Flash Attention 2 (Dao,
2023) for local attention layers.
torch.compile We leverage PyTorch’s built-in
compiling (Ansel et al., 2024) to improve the train-
ing efficiency by compiling all compatible modules.
This yields a 10 percent improvement in throughput
with negligible compilation overhead.
2.1.3 Model Design
At the same parameter count, models with more
narrow layers ( Deep & Narrow ) have different
learning patterns than models with fewer wide lay-
ers (Shallow & Wide ) (Nguyen et al., 2021). Tay
et al. (2022) and (Liu et al., 2024) have shown
thatDeep & Narrow language models have bet-
ter downstream performance than their shallower
counterparts, at the expense of slower inference.
Anthony et al. (2024) highlighted that large
runtime gains can be unlocked by designing mod-
els in a hardware-aware way, which had previ-
ously been anecdotally observed by many prac-
titioners (Shoeybi et al., 2019; Karpathy, 2023;
Black et al., 2022). ModernBERT was designed
through many small-scale ablations to maximize
the utilization of a basket of common GPUs3, while
3Which, at the time of this work, are server GPUs:
NVIDIA T4, A10, L4, A100, and H100 and consumer GPUs:
NVIDIA RTX 3090 and 4090. Prioritization was given to
inference GPUs (excluding A100 & H100).aiming to be as Deep & Narrow as possible without
a significant inference slowdown.
ModernBERT has 22 and 28 layers for the base
and large models, for a total parameter count of 149
and 395 million, respectively, striking the balance
between downstream performance and hardware
efficiency. ModernBERT base has a hidden size of
768 with a GLU expansion of 2,304, while large
has a hidden size of 1,024 and GLU expansion
of 5,248. These ratios allow optimal tiling across
tensor cores and the most efficient tiling across the
differing number of streaming multiprocessors on
our target basket of GPUs. More details on model
design are provided in Appendix B.
2.2 Training
2.2.1 Data
Mixture Both ModernBERT models are trained on
2 trillion tokens of primarily English data from a
variety of data sources, including web documents,
code, and scientific literature, following common
modern data mixtures. We choose the final data
mixture based on a series of ablations.
Tokenizer Unlike the majority of recent en-
coders which reuse the original BERT tok-
enizer (Nussbaum et al., 2024; Portes et al., 2023;
Zhang et al., 2024), we opt to use a modern BPE
tokenizer. We use a modified version of the OLMo
tokenizer (Groeneveld et al., 2024) which provides
better token efficiency and performance on code-
related tasks. The ModernBERT tokenizer uses the
same special tokens (e.g., [CLS] and[SEP] ) and
templating as the original BERT model (Devlin
et al., 2019), facilitating backwards compatibility.
To ensure optimal GPU utilization (Anthony et al.,
2024; Karpathy, 2023), the vocabulary is set to
50,368, a multiple of 64 and includes 83 unused
tokens to support downstream applications.
Sequence Packing In order to avoid high
minibatch-size variance within our training batches
as a result of unpadding, we adopt sequence pack-
ing (Raffel et al., 2020; Krell et al., 2022) with
a greedy algorithm, which resulted in a sequence
packing efficiency of over 99 percent, ensuring
batch size uniformity.
2.2.2 Training Settings
MLM We follow the Masked Language Modeling
(MLM) setup used by MosaicBERT (Portes et al.,
2023). We remove the Next-Sentence Prediction
objective which introduces noticeable overhead for
no performance improvement (Liu et al., 2019a;
3
Page 4:
Izsak et al., 2021), and use a masking rate of 30
percent, as the original rate of 15 percent has since
been shown to be sub-optimal (Wettig et al., 2023).
Optimizer We use the StableAdamW opti-
mizer (Wortsman et al., 2023), which improves
upon AdamW (Loshchilov and Hutter, 2019) by
adding Adafactor-style (Shazeer and Stern, 2018)
update clipping as a per-parameter learning rate
adjustment. StableAdamW’s learning rate clipping
outperformed standard gradient clipping on down-
stream tasks and led to more stable training. Hy-
perparameters details are given in Appendix A.
Learning Rate Schedule During pretraining,
we use a modified trapezoidal Learning Rate
(LR) schedule (Xing et al., 2018), also known as
Warmup-Stable-Decay (WSD) (Zhai et al., 2022;
Hu et al., 2024). After a short LR warmup, the
trapezoidal schedule holds the LR constant for the
majority of training, followed by a short LR de-
cay. This schedule has been shown to match the
performance of cosine scheduling (Hägele et al.,
2024; Hallström et al., 2024) with the benefit of
enabling continual training on any checkpoint with-
out cold restart issues (Ash and Adams, 2019). Un-
like most trapezoidal schedules, we use a 1−sqrt
LR decay (Hägele et al., 2024), as we found it to
outperform linear and cosine decay.
We trained ModernBERT-base at a constant LR
of 8e-4 for 1.7 trillion tokens following a 3 billion
token warmup. After a 2 billion token warmup,
we trained ModernBERT-large at a LR of 5e-4 for
900 billion tokens. We rolled back and restarted
training at 5e-5 for the remaining 800 billion tokens
after large’s loss plateaued for a few hundred billion
tokens at 5e-4.
Batch Size Schedule Batch size scheduling
starts with smaller gradient accumulated batches,
increasing over time to the full batch size. In abla-
tions, this schedule accelerated training progress.
We warmup the batch size from 768 to 4,608 over
50 billion tokens and from 448 to 4,928 over 10
billion tokens, for ModernBERT-base and -large,
respectively, with an uneven token schedule so each
batch size has the same number of update steps.
Details are provided in Appendix A.1.
Weight Initialization and Tiling We initialize
ModernBERT-base with random weights following
the Megatron initialization (Shoeybi et al., 2019).
For ModernBERT-large, we follow the Phi model
family (Li et al., 2023; Javaheripi et al., 2023)4and
4As detailed in their 2023 NeurIPS presentation.initialize -large’s weights from ModernBERT-base.
In ablation runs, this consistently matched Phi’s
improved training results and greatly speed up the
initial loss decrease of our model training5. Details
are provided in Appendix A.2.
Context Length Extension After training on 1.7
trillion tokens at a 1024 sequence length and RoPE
theta of 10,000, we extend the native context length
of ModernBERT to 8192 tokens by increasing the
global attention layer’s RoPE theta to 160,000 and
train for an additional 300 billion tokens. We first
train at a constant lower learning rate6of 3e-4 for
250 billion tokens on an 8192 token mixture of the
original pretraining dataset sampled following Fu
et al. (2024). Next, we upsample higher-quality
sources following Gao et al. (2024) and conduct
the decay phase with a 1−sqrtLR schedule over
50 billion tokens. This context extension process
yielded the most balanced model on downstream
tasks, as most of our ablations using only one of
these strategies resulted in a performance loss on
either retrieval or classification tasks.
3 Downstream Evaluation
We performed an extensive set of evaluations,
across a large range of tasks, aiming to demon-
strate the versatility of ModernBERT in common
scenarios.
For all tasks, ModernBERT is evaluated against
existing encoders of similar size. The BASE size,
conventionally defined as under 150 million pa-
rameters, includes BERT-base (Devlin et al., 2019),
DeBERTa-v3-base (He et al., 2023), RoBERTa-
base (Liu et al., 2019a), as well as the more re-
cent 8192 context NomicBERT (Nussbaum et al.,
2024) and GTE-en-MLM-base (Zhang et al., 2024).
The LARGE size, conventionally defined as above
300 million and under 500 million parameters, in-
cludes BERT-large-uncased (Devlin et al., 2019),
DeBERTa-v3-large (He et al., 2023) and RoBERTa-
large (Liu et al., 2019a) and GTE-en-MLM-
large (Zhang et al., 2024).
3.1 Evaluation Setting
3.1.1 Natural Language Understanding
The General Language Understanding Evaluation
(GLUE) benchmark (Wang et al., 2018) is the
standard Natural Language Understanding (NLU)
5This initialization reduced the amount of batch size and
LR warmup needed for ModernBERT-large
6We only lowered the LR for ModernBERT-base, as large
already decreased LR during the 1024 token training phase.
4
Page 5:
benchmark for encoder models, aiming to measure
how well a model performs across a range of sen-
tence or sentence-pair understanding tasks, such as
sentiment detection (Liu et al., 2019b) or language
entailment, through tasks such as MNLI (Williams
et al., 2018). Although GLUE is often regarded
as saturated by the best-performing models, such
as large language models (Zhao et al., 2023), it
remains one of the most commonly used evaluation
suites for smaller encoder-based models, and pro-
vides a good impression of a model’s performance
on common classification tasks (Portes et al., 2023;
Zhang et al., 2024; He et al., 2023).
We follow the practice of previous studies (De-
vlin et al., 2019; Liu et al., 2019a; He et al.,
2023) and conduct a hyperparameter search on each
GLUE subset (detailed in Appendix E.1) in order
to provide values comparable to other models.7
3.1.2 Text Retrieval
Information Retrieval (IR) is one of the most com-
mon applications of encoder-only models,8where
they are used to represent documents and queries
in semantic search (Karpukhin et al., 2020). This
domain has recently seen considerable growth and
interest following the spread of LLMs where se-
mantic search powered by lightweight models is
used to provide relevant context to LLMs as part of
Retrieval-Augmented Generation pipelines.
We evaluate models in both the single-vector
Dense Passage Retrieval (DPR) (Karpukhin et al.,
2020) setting and the multi-vector ColBERT (Khat-
tab and Zaharia, 2020) setting.
We report retrieval results on the popular BEIR
evaluation suite (Thakur et al., 2021), the com-
mon standard for evaluating retrieval performance
across a variety of tasks and domains, using the
nDCG@10 metric. For each setting detailed below,
we conduct a learning rate sweep based on results
over a subset of the BEIR benchmarks to select the
final model, detailed in Appendix E.2.
Single vector retrieval One of the most com-
mon approaches to neural retrieval using encoders
is DPR (Karpukhin et al., 2020), where a single-
vector is used to represent an entire document. The
7As (Zhang et al., 2024) do not explicitly mention a param-
eter sweep, we initially ran the same hyperparameter sweep
as we did for ModernBERT, but observed inconsistencies in
the results. To avoid under-representing GTE-en-MLM’s ca-
pabilities, we choose to use their reported GLUE results.
8At the time of this paper’s writing, over half of the 100
most downloaded models on the HuggingFace Model Hub
were encoder-based retrieval models.similarity between a query and a document can then
be computed through distance operations, such as
cosine similarity. Models are finetuned using con-
trastive learning to create representations which
are close if a document is relevant to a query, and
distant if not (van den Oord et al., 2018).
We train every base model using the MS-
MARCO (Bajaj et al., 2016) dataset with mined
hard negatives (Xuan et al., 2020) on 1.25M sam-
ples with a batch size of 16 and learning rate
warmup for 5% of the training using sentence-
transformers (Reimers and Gurevych, 2019).
Multi vector retrieval Multi-vector retrieval,
championed by ColBERT (Khattab and Zaharia,
2020), seeks to mitigate lost information from com-
pressing an entire sequence into a single vector.
In multi-vector retrieval, each document is repre-
sented by all of its individual token vectors, and
the similarity between a query and a document is
computed using the MaxSim9operator.
We adopt the training setup of JaCol-
BERTv2.5 (Clavié, 2024), an update on the
ColBERTv2 (Santhanam et al., 2022) training
procedure, with a batch size of 16 and a 5%
learning rate warmup. We train all models by
distilling the knowledge of a teacher model by
using the KL-Divergence between the normalized
teacher and student scores. Models are trained
on 810k samples from MS-Marco (Bajaj et al.,
2016) and teacher scores from BGE-M3 (Chen
et al., 2024), using the PyLate library (Chaffin and
Sourty, 2024).
3.1.3 Long-Context Text Retrieval
With a native 8192 context length, ModernBERT
improves long-context performance over most ex-
isting encoders. However, there are relatively
few standardized long-context benchmarks for
encoder-only models, and most benchmarks, such
as Needle-in-a-haystack (Kamradt, 2023) and
RULER (Hsieh et al., 2024) are geared towards gen-
erative tasks. Given this limitation, we demonstrate
improved long-context performance on the English
subset of MLDR (Chen et al., 2024), a long-context
retrieval benchmark comprised of over 200,000
long documents. We evaluate three settings:
Single Vector – Out-Of-Domain Models are
trained on short-context MS-MARCO as described
above, and is evaluated on long context MLDR
without any further fine-tuning.
9The sum for every query token of its similarity with the
most similar document token
5
Page 6:
IR (DPR) IR (ColBERT) NLU Code
Model BEIR MLDR OOD MLDR ID BEIR MLDR OOD GLUE CSN SQABaseBERT 38.9 23.9 32.2 49.0 28.1 84.7 41.2 59.5
RoBERTa 37.7 22.9 32.8 48.7 28.2 86.4 44.3 59.6
DeBERTaV3 20.2 5.4 13.4 47.1 21.9 88.1 17.5 18.6
NomicBERT 41.0 26.7 30.3 49.9 61.3 84.0 41.6 61.4
GTE-en-MLM 41.4 34.3 44.4 48.2 69.3 85.6 44.9 71.4
ModernBERT 41.6 27.4 44.0 51.3 80.2 88.4 56.4 73.6LargeBERT 38.9 23.3 31.7 49.5 28.5 85.2 41.6 60.8
RoBERTa 41.4 22.6 36.1 49.8 28.8 88.9 47.3 68.1
DeBERTaV3 25.6 7.1 19.2 46.7 23.0 91.4 21.2 19.7
GTE-en-MLM 42.5 36.4 48.9 50.7 71.3 87.6 40.5 66.9
ModernBERT 44.0 34.3 48.6 52.4 80.4 90.4 59.5 83.9
Table 1: Results for all models across an overview of all tasks. CSN refers to CodeSearchNet and SQA to StackQA.
MLDR IDrefers to in-domain (fine-tuned on the training set) evaluation, and MLDR OODto out-of-domain.
Single Vector – In Domain Models trained
on MS-MARCO are further fine-tuned on long-
context MLDR training set before being evaluated.
Multi-Vector – Out-Of-Domain Due to its
token-level MaxSim mechanism, ColBERT mod-
els are able to generalize to long-context without
any specific training (Bergum, 2024). We directly
evaluate the best checkpoints from Section 3.1.2
without any further fine-tuning on MLDR.
3.1.4 Code Retrieval
Fueled by increasingly good code completion mod-
els (Jiang et al., 2024a), downstream applications
have quickly grown in popularity following the
emergence of code assistants.10Encoder-only mod-
els are used to process and retrieve large quantities
of code-related information under resource con-
straints, increasing the importance of measuring
and improving code capabilities of encoder models
(Li et al., 2024). Unlike most previous encoders
which were largely trained only on textual data (De-
vlin et al., 2019; Liu et al., 2019a; Portes et al.,
2023; Zhang et al., 2024; Nussbaum et al., 2024),
ModernBERT is pre-trained on code and uses a
code-aware tokenizer11.
To measure programming-related performance,
we evaluate all models on CodeSearchNet (Hu-
sain et al., 2019), a code-to-text benchmark where
the model must identify relevant docstring or com-
ments for code blocks, and StackOverflow-QA (Li
10Spearheaded by GitHub Copilot in 2021
11Avoiding issues such as the ones seen in T5 (Raffel et al.,
2020), whose vocabulary did not include curly braces.et al., 2024), where the model must identify rel-
evant responses to StackOverflow questions, in a
"hybrid" setting where documents contain both text
and code. The latter benchmark also leverages long-
context capabilities, as its queries and documents
respectively contain 1,400 and 1,200 words on aver-
age, leading to average token counts of over 2000.
We evaluate these benchmarks using the CoIR
(CodeIR) framework (Li et al., 2024), as single-
vector retrieval tasks. All models are trained by
re-using the best hyper-parameters identified in
Section 3.1.2.
3.2 Downstream Results and Discussion
Aggregated results for all evaluations are presented
in Table 1. For BEIR and GLUE, the two common
evaluation suites, we follow existing practice in
reporting the average results. Detailed results are
provided in Appendix E.
In terms of downstream performance, Modern-
BERT is the strongest overall model at both the
BASE and LARGE model sizes. ModernBERT rep-
resents a Pareto improvement on all tasks over the
original BERT and RoBERTA models, with better
performance on every evaluation category.
Short-Context Retrieval On BEIR, both vari-
ants of ModernBERT outperform existing encoders
in both the DPR and ColBERT settings, including
the recent GTE-en-MLM and NomicBERT mod-
els designed to serve as better backbones for re-
trieval (Zhang et al., 2024; Nussbaum et al., 2024).
While ModernBERT-base only narrowly edges
out GTE-en-MLM-base on DPR evaluations,
6
Page 7:
Short Long
Model Params BS Fixed Variable BS Fixed VariableBaseBERT 110M 1096 180.4 90.2 – – –
RoBERTa 125M 664 179.9 89.9 – – –
DeBERTaV3 183M 236 70.2 35.1 – – –
NomicBERT 137M 588 117.1 58.5 36 46.1 23.1
GTE-en-MLM 137M 640 123.7 61.8 38 46.8 23.4
GTE-en-MLM xformers 137M 640 122.5 128.6 38 47.5 67.3
ModernBERT 149M 1604 148.1 147.3 98 123.7 133.8LargeBERT 330M 792 54.4 27.2 – – –
RoBERTa 355M 460 42.0 21.0 – – –
DeBERTaV3 434M 134 24.6 12.3 – – –
GTE-en-MLM 435M 472 38.7 19.3 28 16.2 8.1
GTE-en-MLM xformers 435M 472 38.5 40.4 28 16.5 22.8
ModernBERT 395M 770 52.3 52.9 48 46.8 49.8
Table 2: Memory (max batch size, BS) and Inference (in thousands of tokens per second) efficiency results on an
NVIDIA RTX 4090, averaged over 10 runs. Dashes indicate unsupported configurations.
ModernBERT-large increases its lead despite hav-
ing comparatively fewer parameters at 395M to
GTE-en-MLM-large’s 435M.
Long-Context Retrieval - Single Vector In the
DPR setting, ModernBERT achieves impressive
performance on MLDR, a long-context text re-
trieval task. However, these results also highlight
an interesting phenomenon: without long-context
finetuning ModernBERT outperforms both shorter-
context models and the long-context NomicBERT
but performs noticeably worse than GTE-en-MLM.
The performance gap narrows considerably when
evaluated in-domain, with both models performing
similarly. This suggests that ModernBERT can ef-
fectively process long context sequences as a dense
encoder but may require more adapted tuning. We
plan to explore multiple potential explanations for
this phenomenon in future work, including the im-
pact of local attention or GTE-en-MLM having
spent a larger part of its pretraining compute bud-
get on longer sequence lengths (Zhang et al., 2024).
Long-Context Retrieval - Multi-Vector In
the ColBERT setting, long-context models (GTE-
en-MLM, NomicBERT, and ModernBERT) all
outperform short-context models by at least 40
NDCG@10 points without requiring any specific
finetuning. These results confirm the findings of
Bergum (2024), who showed that ColBERT models
are particularly well-suited to long-context retrieval
tasks. Among the long-context models, Modern-
BERT outperforms other long-context models, withat least a 9 NDCG@10 point lead on both model
sizes. We theorize that these sizable gains could
be explained by our long pretraining ensuring few,
if any, tokens are under-trained, as well as a po-
tentially synergistic effect of local attention with
ColBERT-style retrieval, but leave further explo-
ration of this phenomenon to future work.
Natural Language Understanding Both Mod-
ernBERT models demonstrate exceptional NLU
results, as measured by GLUE. ModernBERT-
base surpasses all existing base models, includ-
ing DeBERTaV3-base, becoming the first MLM-
trained model to do so. This is surprising,
as DeBERTaV3 was trained with the Replaced-
Token-Detection objective, which was previously
thought to yield stronger downstream NLU per-
formance (Clark et al., 2020; He et al., 2023).
ModernBERT-large is the second-best large en-
coder on GLUE, almost matching DeBERTaV3-
large with one-tenth fewer parameters while pro-
cessing tokens in half the time (see Section 4).
Code On programming tasks, in both code-to-
text (CodeSearchNet) and longer-context hybrid
settings (StackQA), ModernBERT outperforms all
other models. This result was expected, as it is the
only evaluated encoder to be trained on a data mix-
ture including programming data. These results,
combined with ModernBERT’s strong showings
on other tasks, indicates that ModernBERT has im-
proved understanding of code at no detriment to its
ability to process natural text.
7
Page 8:
4 Efficiency
4.1 Evaluation Setting
To measure inference efficiency across multiple
sequence lengths, we create 4 synthetic sets of
8192 documents12. The first two document sets
are fixed-length: in fixed short-context , all docu-
ments contain 512 tokens and in fixed long-context
all documents contain 8192 tokens13. To account
for the impact of unpadding, we also create two
varying-length document sets, where the number
of tokens in each set are defined by a normal dis-
tribution centered on half the maximum sequence
length, 256 and 4096 tokens, respectively. Full data
statistics are provided in Appendix F.
We then evaluate all models based on the number
of tokens they can process per second, averaged
over ten runs. All efficiency evaluations are ran
on a single NVIDIA RTX 4090, one of the target
GPUs of ModernBERT outlined in Section 2.1.3
We evaluate the GTE-en-MLM models under two
settings: out-of-the box, and with the use of the
xformers (Lefaudeux et al., 2022) library, which en-
ables efficiency enhancements such as unpadding.
4.2 Results
All tokens-per-second efficiency results are pre-
sented in Table 2, with absolute run-times provided
in Appendix F. ModernBERT stands out as the
most efficient model overall. On short context, it
processes fixed-length 512 token inputs faster than
all other recent encoders, although slower than the
original BERT and RoBERTa models14. On long-
context, ModernBERT is faster than all competing
encoders, processing documents 2.65 and 3 times
faster than the next-fastest encoder at the BASE and
LARGE sizes, respectively. ModernBERT-large’s
processing speed at length 8192 (46,801 tokens
per second) is closer to that of GTE-en-MLM base
(47,507 tokens per second) than it is to GTE-en-
MLM-large (16,532 tokens per second).
On variable-length inputs, both GTE-en-MLM
and ModernBERT models are considerably faster
than all other models, largely due to unpadding.
However, ModernBERT remains noticeably more
efficient than GTE-en-MLM, processing 14.5-30.9
12Many common benchmarks are biased towards low and
uniform sequence lengths, which is unrepresentative of many
real-world situations.
13512 being the maximum length of most existing encoders,
while 8192 is the maximum length of all long-context ones.
14This is partially due to the relatively low parameter count
of BERT and RoBERTa compared to more recent encoders.percent more tokens per second at low context
lengths and 98.8-118.8 percent more at longer con-
text lengths, thanks to its use of local attention.
ModernBERT is the overall most memory effi-
cient model on both model sizes. ModernBERT-
base is able to process batch sizes twice as
large as every other model on both input lengths.
ModernBERT-large is slightly less memory effi-
cient than the original BERT-large on short-context
inputs, but can process batches at least 60 percent
bigger than every other large model.
5 Conclusion
We present ModernBERT, an open family of
encoder-only models which set a new state of the
art over existing encoder models on a wide range
of classification and retrieval tasks. We show that
encoders benefit from both recent pretraining data
scales and architecture improvements from autore-
gressive LLMs.
ModernBERT has a native sequence length of
8,192 tokens and incorporates recent architecture
improvements, such as GeGLU layers, RoPE po-
sitional embeddings, and alternating local-global
attention. ModernBERT is the first open model
to feature entire model unpadding and is the first
encoder designed in a hardware-aware way to max-
imize inference efficiency.
ModernBERT pushes the encoder state of the art
forward across a wide range of benchmarks. On
GLUE, ModernBERT-base is the first encoder to
beat DeBERTaV3-base since its release in 2021.
ModernBERT is in a class of its own in code and
ColBERT-style long-context retrieval benchmarks,
scoring at least 6.85 and 9.1 percentage points
higher than the closest model, respectively, while
remaining state-of-the-art on short-context retrieval
in both single and multi-vector settings.
At the same time, ModernBERT processes short
context inputs twice as fast as DeBERTaV3 and
long-context inputs two times faster than the next
fastest model with best-in-class memory efficiency.
ModernBERT is a generational leap over the
original encoder models, with notable performance
improvements over BERT and RoBERTa on both
classification and retrieval tasks. ModernBERT is
one of the few encoders to support long-context and
programming applications, while simultaneously
setting a new record in encoder inference efficiency.
8
Page 9:
6 Limitations
Language This study focuses exclusively on the
English language, and trains on a very large num-
ber of tokens. As such, a major limitation of our
work is that it is not directly applicable to other
languages, and potentially even less-so to lower
resources languages.
Biases Our model is trained largely on web data,
as a result, all of its representations are subject to
the biases present in such data.
Harmful Content Generation The MLM objec-
tive gives the model some ability to generate text
by suggesting a given token to replace the [MASK]
token (Samuel, 2024), which could result in the
generation of harmful content. However, Modern-
BERT is not, primarily, a generative model, and as
such, has not been trained to and therefore cannot
generate longer sequences of text. As a result, it is
considerably less likely to be at risk of generating
harmful content of any kind.
MLM-only objective Given the strong results of
DeBERTav3 on classification tasks but weak ones
on retrieval, it seems that a training leveraging both
MLM and RTD might be better suited to achieve
best results on classification. Extending our work
to RTD is thus a promising line of research.
Scaling Besides the architectural modifications,
a key aspect of our studies is data scaling. How-
ever, other scaling axes, notably in terms of model
parameters are left unexplored.
7 Acknowledgements
The authors would like to acknowledge & thank the
many people who assisted, supported, or offered
insights useful for the completion of this project.
We are particularly thankful for the one-off im-
plementation or evaluation work conducted by
Jack Cook, Mark Tenenholtz, Johno Whitaker, and
Wayde Gilliam. We also extend similar thanks to
Zach Nussbaum for assisting in resolving issues we
encountered with NomicBERT during evaluation.
We would like to acknowledge Enrico Shippole,
Daniel Han, Colin Raffel, Pierre-Carl Langlais,
Omar Khattab, Urchade Zaratiana, Aurélien Lac,
Amélie Chatelain, and Raphaël Sourty, for their
helpful contributions to discussions.
We also thank Weights&Biases for providing
free access to their platform, in particular Morgan
McGuire and Thomas Capelle for their support.
We thank HuggingFace’s Arthur Zucker, CyrilVallez, and Pedro Cuenca for assisting with day-
one HuggingFace support.
Finally, we acknowledge Orange Business Cloud
Avenue as compute provider and their hardware
support throughout the project and thank LightOn
for sponsoring the compute.
8 Contribution Statement
BW, AC, and BC jointly led the project and con-
tributed to all parts of it.
BW worked on all aspects of the project and con-
tributed to all major decisions. He led model de-
sign, model training, implemented the majority of
the model architecture, and assisted with data se-
lection, elevations, and paper writing.
AC co-initiated the project and worked on all as-
pects of it, including project coordination. Notably,
he contributed to monitoring training runs and co-
led ablations, final evaluations and paper writing.
BC initiated the project and worked on all aspects
of it. He contributed to model design and co-led
final evaluations, led paper writing, and contributed
to the context extension data processing.
OW led and conducted the majority of the data se-
lection, processing, and discussion, for all stages
of training. He also contributed valuable inputs
throughout all stages of the project.
OH and ST contributed to a majority of the stages
of the project, in particular model architecture and
training, with both discussions, implementations
and paper writing. Other contributions include pre-
training monitoring, final traditional evaluations,
and ablations. ST specifically worked on adapting
the RoPE kernel for unpadded sequences and run-
ning the final GLUE benchmarks. OH additionally
conducted a thorough investigation into complex
issues that arose during training.
RB contributed greatly to the initial evaluation
work, focusing on ablations and in-training evals.
AG and FL contributed to training efficiency, espe-
cially in implementing sequence packing.
AG and GA contributed to model evaluations, es-
pecially in long context evaluations.
TA contributed to discussions throughout the
project and assisted in integrating the original re-
search implementation with open source software.
NC contributed to context extension data mixtures,
and provided insight into model training and on
improving the quality of code data.
IP and JH provided guidance and support through-
out the project, especially on key decisions.
9
Page 10:
References
Jason Ansel, Edward Yang, Horace He, Natalia
Gimelshein, Animesh Jain, Michael V oznesensky,
Bin Bao, Peter Bell, David Berard, Evgeni Burovski,
et al. 2024. Pytorch 2: Faster machine learning
through dynamic python bytecode transformation and
graph compilation. In Proceedings of the 29th ACM
International Conference on Architectural Support
for Programming Languages and Operating Systems ,
volume 2, pages 929–947.
Quentin Anthony, Jacob Hatef, Deepak Narayanan,
Stella Biderman, Stas Bekman, Junqi Yin, Aamir
Shafi, Hari Subramoni, and Dhabaleswar Panda.
2024. The case for co-designing model architectures
with hardware. Preprint , arXiv:2401.14489.
Jordan T. Ash and Ryan P. Adams. 2019. On the
difficulty of warm-starting neural network training.
CoRR , abs/1910.08475.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang,
Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei
Huang, et al. 2023. Qwen technical report. arXiv
preprint arXiv:2309.16609 .
Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng,
Jianfeng Gao, Xiaodong Liu, Rangan Majumder,
Andrew McNamara, Bhaskar Mitra, Tri Nguyen,
et al. 2016. Ms marco: A human generated ma-
chine reading comprehension dataset. arXiv preprint
arXiv:1611.09268 .
Iz Beltagy, Matthew E. Peters, and Arman Cohan.
2020. Longformer: The long-document transformer.
Preprint , arXiv:2004.05150.
Jo Kristian Bergum. 2024. Announcing vespa long-
context ColBERT. Vespa Blog .
Stella Biderman, Hailey Schoelkopf, Quentin Gregory
Anthony, Herbie Bradley, Kyle O’Brien, Eric Hal-
lahan, Mohammad Aflah Khan, Shivanshu Purohit,
USVSN Sai Prashanth, Edward Raff, et al. 2023.
Pythia: A suite for analyzing large language mod-
els across training and scaling. In International
Conference on Machine Learning , pages 2397–2430.
PMLR.
Sidney Black, Stella Biderman, Eric Hallahan, Quentin
Anthony, Leo Gao, Laurence Golding, Horace He,
Connor Leahy, Kyle McDonell, Jason Phang, et al.
2022. Gpt-neox-20b: An open-source autoregres-
sive language model. In Proceedings of BigScience
Episode# 5–Workshop on Challenges & Perspectives
in Creating Large Language Models , pages 95–136.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie
Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, Sandhini Agarwal, Ariel Herbert-V oss,
Gretchen Krueger, Tom Henighan, Rewon Child,
Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei.
2020. Language models are few-shot learners. In Ad-
vances in Neural Information Processing Systems 33:
Annual Conference on Neural Information Process-
ing Systems 2020, NeurIPS 2020, December 6-12,
2020, virtual .
Antoine Chaffin and Raphaël Sourty. 2024. Pylate:
Flexible training and retrieval for late interaction
models.
Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun
Luo, Defu Lian, and Zheng Liu. 2024. M3-
embedding: Multi-linguality, multi-functionality,
multi-granularity text embeddings through self-
knowledge distillation. In Findings of the Asso-
ciation for Computational Linguistics, ACL 2024,
Bangkok, Thailand and virtual meeting, August 11-
16, 2024 , pages 2318–2335. Association for Compu-
tational Linguistics.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin,
Maarten Bosma, Gaurav Mishra, Adam Roberts,
Paul Barham, Hyung Won Chung, Charles Sutton,
Sebastian Gehrmann, Parker Schuh, Kensen Shi,
Sasha Tsvyashchenko, Joshua Maynez, Abhishek
Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vin-
odkumar Prabhakaran, Emily Reif, Nan Du, Ben
Hutchinson, Reiner Pope, James Bradbury, Jacob
Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin,
Toju Duke, Anselm Levskaya, Sanjay Ghemawat,
Sunipa Dev, Henryk Michalewski, Xavier Garcia,
Vedant Misra, Kevin Robinson, Liam Fedus, Denny
Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim,
Barret Zoph, Alexander Spiridonov, Ryan Sepassi,
David Dohan, Shivani Agrawal, Mark Omernick, An-
drew M. Dai, Thanumalayan Sankaranarayana Pil-
lai, Marie Pellat, Aitor Lewkowycz, Erica Moreira,
Rewon Child, Oleksandr Polozov, Katherine Lee,
Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark
Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy
Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov,
and Noah Fiedel. 2023. Palm: Scaling language mod-
eling with pathways. J. Mach. Learn. Res. , 24:240:1–
240:113.
Kevin Clark, Minh-Thang Luong, Quoc V . Le, and
Christopher D. Manning. 2020. ELECTRA: pre-
training text encoders as discriminators rather than
generators. In 8th International Conference on
Learning Representations, ICLR 2020, Addis Ababa,
Ethiopia, April 26-30, 2020 . OpenReview.net.
Benjamin Clavié. 2024. Jacolbertv2.5: Optimis-
ing multi-vector retrievers to create state-of-the-
art japanese retrievers with constrained resources.
Preprint , arXiv:2407.20750.
Tri Dao. 2023. Flashattention-2: Faster attention with
better parallelism and work partitioning. In The
Twelfth International Conference on Learning Repre-
sentations .
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and
Christopher Ré. 2022. Flashattention: Fast and
10
Page 11:
memory-efficient exact attention with io-awareness.
Advances in Neural Information Processing Systems ,
35:16344–16359.
Yann N. Dauphin, Angela Fan, Michael Auli, and David
Grangier. 2017. Language modeling with gated con-
volutional networks. In Proceedings of the 34th In-
ternational Conference on Machine Learning , vol-
ume 70 of Proceedings of Machine Learning Re-
search , pages 933–941. PMLR.
Boris Dayma, Suraj Patil, Pedro Cuenca, Khalid Saiful-
lah, Tanishq Abraham, Phúc Lê Kh ˘ac, Luke Melas,
and Ritobrata Ghosh. 2021. Dall·e mini.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: pre-training of
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, NAACL-HLT 2019, Minneapolis, MN, USA,
June 2-7, 2019, Volume 1 (Long and Short Papers) ,
pages 4171–4186. Association for Computational
Linguistics.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey,
Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman,
Akhil Mathur, Alan Schelten, Amy Yang, Angela
Fan, et al. 2024. The llama 3 herd of models. arXiv
preprint arXiv:2407.21783 .
Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Han-
naneh Hajishirzi, Yoon Kim, and Hao Peng. 2024.
Data engineering for scaling language models to 128k
context. Preprint , arXiv:2402.10171.
Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-
Yan Liu. 2019. Representation degeneration prob-
lem in training natural language generation models.
ArXiv , abs/1907.12009.
Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi
Chen. 2024. How to train long-context language
models (effectively). Preprint , arXiv:2410.02660.
Jonas Geiping and Tom Goldstein. 2023. Cramming:
Training a language model on a single GPU in one
day. In International Conference on Machine Learn-
ing, ICML 2023, 23-29 July 2023, Honolulu, Hawaii,
USA, volume 202 of Proceedings of Machine Learn-
ing Research , pages 11117–11143. PMLR.
Team Gemma, Morgane Riviere, Shreya Pathak,
Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupati-
raju, Léonard Hussenot, Thomas Mesnard, Bobak
Shahriari, Alexandre Ramé, et al. 2024. Gemma 2:
Improving open language models at a practical size.
arXiv preprint arXiv:2408.00118 .
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bha-
gia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh
Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang,
et al. 2024. Olmo: Accelerating the science of lan-
guage models. arXiv preprint arXiv:2402.00838 .Alexander Hägele, Elie Bakouch, Atli Kosson,
Loubna Ben Allal, Leandro von Werra, and Mar-
tin Jaggi. 2024. Scaling laws and compute-optimal
training beyond fixed training durations. CoRR ,
abs/2405.18392.
Oskar Hallström, Said Taghadouini, Clément Thiriet,
and Antoine Chaffin. 2024. Passing the torch: Train-
ing a mamba model for smooth handover.
Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2023.
Debertav3: Improving deberta using electra-style
pre-training with gradient-disentangled embedding
sharing. In The Eleventh International Conference
on Learning Representations, ICLR 2023, Kigali,
Rwanda, May 1-5, 2023 . OpenReview.net.
Dan Hendrycks and Kevin Gimpel. 2016. Gaus-
sian error linear units (gelus). arXiv preprint
arXiv:1606.08415 .
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shan-
tanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang,
and Boris Ginsburg. 2024. Ruler: What’s the real
context size of your long-context language models?
arXiv preprint arXiv:2404.06654 .
Shengding Hu, Yuge Tu, Xu Han, Chaoqun He,
Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang,
Yuxiang Huang, Weilin Zhao, Xinrong Zhang,
Zhen Leng Thai, Kai Zhang, Chongyi Wang, Yuan
Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu
Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li,
Zhiyuan Liu, and Maosong Sun. 2024. Minicpm: Un-
veiling the potential of small language models with
scalable training strategies. CoRR , abs/2404.06395.
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis
Allamanis, and Marc Brockschmidt. 2019. Code-
searchnet challenge: Evaluating the state of semantic
code search. arXiv preprint arXiv:1909.09436 .
Alexander Hägele, Elie Bakouch, Atli Kosson,
Loubna Ben Allal, Leandro V on Werra, and Mar-
tin Jaggi. 2024. Scaling laws and compute-optimal
training beyond fixed training durations. Preprint ,
arXiv:2405.18392.
Peter Izsak, Moshe Berchansky, and Omer Levy. 2021.
How to train BERT with an academic budget. In Pro-
ceedings of the 2021 Conference on Empirical Meth-
ods in Natural Language Processing , pages 10644–
10652, Online and Punta Cana, Dominican Republic.
Association for Computational Linguistics.
Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jy-
oti Aneja, Sebastien Bubeck, Caio César Teodoro
Mendes, Weizhu Chen, Allie Del Giorno, Ronen
Eldan, Sivakanth Gopi, et al. 2023. Phi-2: The sur-
prising power of small language models. Microsoft
Research Blog , 1(3):3.
Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi
Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou
Wang, and Yaodong Yang. 2023. Beavertails: To-
wards improved safety alignment of llm via a human-
preference dataset. arXiv preprint arXiv:2307.04657 .
11
Page 12:
Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim,
and Sunghun Kim. 2024a. A survey on large lan-
guage models for code generation. arXiv preprint
arXiv:2406.00515 .
Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger,
Faeze Brahman, Sachin Kumar, Niloofar Mireshghal-
lah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha
Dziri. 2024b. Wildteaming at scale: From in-the-
wild jailbreaks to (adversarially) safer language mod-
els.Preprint , arXiv:2406.18510.
Gregory Kamradt. 2023. Needle In A Haystack - pres-
sure testing LLMs. Github .
Andrej Karpathy. 2023. The most dramatic optimization
to nanogpt so far ( 25% speedup) is to simply increase
vocab size from 50257 to 50304 (nearest multiple of
64).
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick
S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen,
and Wen-tau Yih. 2020. Dense passage retrieval for
open-domain question answering. In Proceedings of
the 2020 Conference on Empirical Methods in Nat-
ural Language Processing, EMNLP 2020, Online,
November 16-20, 2020 , pages 6769–6781. Associa-
tion for Computational Linguistics.
Omar Khattab and Matei Zaharia. 2020. Colbert: Ef-
ficient and effective passage search via contextual-
ized late interaction over BERT. In Proceedings of
the 43rd International ACM SIGIR conference on
research and development in Information Retrieval,
SIGIR 2020, Virtual Event, China, July 25-30, 2020 ,
pages 39–48. ACM.
Mario Michael Krell, Matej Kosec, Sergio P. Perez, and
Andrew Fitzgibbon. 2022. Efficient sequence pack-
ing without cross-contamination: Accelerating large
language models without impacting performance.
Preprint , arXiv:2107.02027.
Benjamin Lefaudeux, Francisco Massa, Diana
Liskovich, Wenhan Xiong, Vittorio Caggiano,
Sean Naren, Min Xu, Jieru Hu, Marta Tintore,
Susan Zhang, Patrick Labatut, Daniel Haziza,
Luca Wehrstedt, Jeremy Reizenstein, and Grigory
Sizov. 2022. xformers: A modular and hackable
transformer modelling library. https://github.
com/facebookresearch/xformers .
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin-
ton. 2016. Layer normalization. ArXiv e-prints ,
pages arXiv–1607.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio
Petroni, Vladimir Karpukhin, Naman Goyal, Hein-
rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock-
täschel, et al. 2020. Retrieval-augmented genera-
tion for knowledge-intensive nlp tasks. Advances in
Neural Information Processing Systems (NeurIPS) ,
33:9459–9474.
Xiangyang Li, Kuicai Dong, Yi Quan Lee, Wei Xia,
Yichun Yin, Hao Zhang, Yong Liu, Yasheng Wang,and Ruiming Tang. 2024. Coir: A comprehensive
benchmark for code information retrieval models.
arXiv preprint arXiv:2407.02883 .
Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del
Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023.
Textbooks are all you need ii: phi-1.5 technical report.
Preprint , arXiv:2309.05463.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. 2019a.
Roberta: A robustly optimized BERT pretraining
approach. CoRR , abs/1907.11692.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. 2019b.
Roberta: A robustly optimized BERT pretraining
approach. CoRR , abs/1907.11692.
Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen
Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong,
Ernie Chang, Yangyang Shi, Raghuraman Krish-
namoorthi, Liangzhen Lai, and Vikas Chandra. 2024.
Mobilellm: Optimizing sub-billion parameter lan-
guage models for on-device use cases. Preprint ,
arXiv:2402.14905.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled
weight decay regularization. In International Confer-
ence on Learning Representations .
The Mosaic ML Team. 2021. composer. https://
github.com/mosaicml/composer/ .
Thao Nguyen, Maithra Raghu, and Simon Kornblith.
2021. Do wide and deep networks learn the same
things? uncovering how neural network representa-
tions vary with width and depth. In International
Conference on Learning Representations .
Zach Nussbaum, John X. Morris, Brandon Duderstadt,
and Andriy Mulyar. 2024. Nomic embed: Training
a reproducible long context text embedder. CoRR ,
abs/2402.01613.
Guilherme Penedo, Hynek Kydlí ˇcek, Loubna Ben al-
lal, Anton Lozhkov, Margaret Mitchell, Colin Raffel,
Leandro V on Werra, and Thomas Wolf. 2024. The
fineweb datasets: Decanting the web for the finest
text data at scale. Preprint , arXiv:2406.17557.
Jacob Portes, Alexander Trott, Sam Havens, Daniel
King, Abhinav Venigalla, Moin Nadeem, Nikhil Sar-
dana, Daya Khudia, and Jonathan Frankle. 2023. Mo-
saicbert: A bidirectional encoder optimized for fast
pretraining. In Advances in Neural Information Pro-
cessing Systems 36: Annual Conference on Neural
Information Processing Systems 2023, NeurIPS 2023,
New Orleans, LA, USA, December 10 - 16, 2023 .
Rushi Qiang, Ruiyi Zhang, and Pengtao Xie. 2024.
Bilora: A bi-level optimization framework for
overfitting-resilient low-rank adaptation of large pre-
trained models. CoRR , abs/2403.13037.
12
Page 13:
Alec Radford, Karthik Narasimhan, Tim Salimans, and
Ilya Sutskeve. 2018. Improving language understand-
ing by generative pre-training. In OpenAI Tech Re-
port.
Alec Radford, Jeff Wu, Rewon Child, David Luan,
Dario Amodei, and Ilya Sutskever. 2019. Language
models are unsupervised multitask learners. OpenAI
Blog .
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie
Millican, Jordan Hoffmann, Francis Song, John
Aslanides, Sarah Henderson, Roman Ring, Susan-
nah Young, Eliza Rutherford, Tom Hennigan, Ja-
cob Menick, Albin Cassirer, Richard Powell, George
van den Driessche, Lisa Anne Hendricks, Mari-
beth Rauh, Po-Sen Huang, Amelia Glaese, Jo-
hannes Welbl, Sumanth Dathathri, Saffron Huang,
Jonathan Uesato, John Mellor, Irina Higgins, Anto-
nia Creswell, Nat McAleese, Amy Wu, Erich Elsen,
Siddhant Jayakumar, Elena Buchatskaya, David Bud-
den, Esme Sutherland, Karen Simonyan, Michela Pa-
ganini, Laurent Sifre, Lena Martens, Xiang Lorraine
Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena
Gribovskaya, Domenic Donato, Angeliki Lazaridou,
Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsim-
poukelli, Nikolai Grigorev, Doug Fritz, Thibault Sot-
tiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong,
Daniel Toyama, Cyprien de Masson d’Autume, Yujia
Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin,
Aidan Clark, Diego de Las Casas, Aurelia Guy,
Chris Jones, James Bradbury, Matthew Johnson,
Blake Hechtman, Laura Weidinger, Iason Gabriel,
William Isaac, Ed Lockhart, Simon Osindero, Laura
Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub,
Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Ko-
ray Kavukcuoglu, and Geoffrey Irving. 2022. Scaling
language models: Methods, analysis & insights from
training gopher. Preprint , arXiv:2112.11446.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Wei Li, and Peter J Liu. 2020. Exploring the lim-
its of transfer learning with a unified text-to-text
transformer. Journal of machine learning research ,
21(140):1–67.
Nils Reimers and Iryna Gurevych. 2019. Sentence-bert:
Sentence embeddings using siamese bert-networks.
InProceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing . Associa-
tion for Computational Linguistics.
David Samuel. 2024. Berts are generative in-context
learners. CoRR , abs/2406.04823.
Keshav Santhanam, Omar Khattab, Jon Saad-Falcon,
Christopher Potts, and Matei Zaharia. 2022. Col-
bertv2: Effective and efficient retrieval via
lightweight late interaction. In Proceedings of the
2022 Conference of the North American Chapter of
the Association for Computational Linguistics: Hu-
man Language Technologies, NAACL 2022, Seattle,
WA, United States, July 10-15, 2022 , pages 3715–
3734. Association for Computational Linguistics.Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta
Raileanu, Maria Lomeli, Eric Hambro, Luke Zettle-
moyer, Nicola Cancedda, and Thomas Scialom. 2023.
Toolformer: Language models can teach themselves
to use tools. In Advances in Neural Information Pro-
cessing Systems 36: Annual Conference on Neural
Information Processing Systems 2023, NeurIPS 2023,
New Orleans, LA, USA, December 10 - 16, 2023 .
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay
Thakkar, Pradeep Ramani, and Tri Dao. 2024.
Flashattention-3: Fast and accurate attention with
asynchrony and low-precision. arXiv preprint
arXiv:2407.08608 .
Noam Shazeer. 2020. Glu variants improve transformer.
arXiv preprint arXiv:2002.05202 .
Noam Shazeer and Mitchell Stern. 2018. Adafactor:
Adaptive learning rates with sublinear memory cost.
InProceedings of the 35th International Conference
on Machine Learning , volume 80 of Proceedings
of Machine Learning Research , pages 4596–4604.
PMLR.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri,
Patrick LeGresley, Jared Casper, and Bryan Catan-
zaro. 2019. Megatron-lm: Training multi-billion
parameter language models using model parallelism.
arXiv preprint arXiv:1909.08053 .
Jianlin Su, Murtadha H. M. Ahmed, Yu Lu, Shengfeng
Pan, Wen Bo, and Yunfeng Liu. 2024. Roformer: En-
hanced transformer with rotary position embedding.
Neurocomputing , 568:127063.
Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus,
Samira Abnar, Hyung Won Chung, Sharan Narang,
Dani Yogatama, Ashish Vaswani, and Donald Met-
zler. 2022. Scale efficiently: Insights from pretrain-
ing and finetuning transformers. In International
Conference on Learning Representations (ICLR) 22 .
Nandan Thakur, Nils Reimers, Andreas Rücklé, Ab-
hishek Srivastava, and Iryna Gurevych. 2021. BEIR:
A heterogeneous benchmark for zero-shot evaluation
of information retrieval models. In Proceedings of
the Neural Information Processing Systems Track on
Datasets and Benchmarks 1, NeurIPS Datasets and
Benchmarks 2021, December 2021, virtual .
Hugo Touvron, Louis Martin, Kevin Stone, Peter Al-
bert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
Bhosale, et al. 2023. Llama 2: Open founda-
tion and fine-tuned chat models. arXiv preprint
arXiv:2307.09288 .
Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke
Bates, Daniel Korat, Moshe Wasserblat, and Oren
Pereg. 2022. Efficient few-shot learning without
prompts. arXiv preprint .
Aäron van den Oord, Yazhe Li, and Oriol Vinyals. 2018.
Representation learning with contrastive predictive
coding. CoRR , abs/1807.03748.
13
Page 14:
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Advances in Neural Information Pro-
cessing Systems 30: Annual Conference on Neural
Information Processing Systems 2017, December 4-9,
2017, Long Beach, CA, USA , pages 5998–6008.
Ellen V oorhees, Tasmeer Alam, Steven Bedrick, Dina
Demner-Fushman, William R Hersh, Kyle Lo, Kirk
Roberts, Ian Soboroff, and Lucy Lu Wang. 2021.
Trec-covid: constructing a pandemic information re-
trieval test collection. In ACM SIGIR Forum , vol-
ume 54, pages 1–12. ACM New York, NY , USA.
Alex Wang, Amanpreet Singh, Julian Michael, Felix
Hill, Omer Levy, and Samuel Bowman. 2018. GLUE:
A multi-task benchmark and analysis platform for nat-
ural language understanding. In Proceedings of the
2018 EMNLP Workshop BlackboxNLP: Analyzing
and Interpreting Neural Networks for NLP , pages
353–355, Brussels, Belgium. Association for Com-
putational Linguistics.
Liang Wang, Nan Yang, Xiaolong Huang, Binxing
Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder,
and Furu Wei. 2022. Text embeddings by weakly-
supervised contrastive pre-training. arXiv preprint
arXiv:2212.03533 .
Benjamin Warner. 2023. optim ¯ı: Fast, modern, memory
efficient, and low precision pytorch optimizers.
Charles Welch, Rada Mihalcea, and Jonathan K. Kum-
merfeld. 2020. Improving low compute language
modeling with in-domain embedding initialisation.
Preprint , arXiv:2009.14109.
Alexander Wettig, Tianyu Gao, Zexuan Zhong, and
Danqi Chen. 2023. Should you mask 15% in masked
language modeling? Preprint , arXiv:2202.08005.
Adina Williams, Nikita Nangia, and Samuel Bowman.
2018. A broad-coverage challenge corpus for sen-
tence understanding through inference. In Proceed-
ings of the 2018 Conference of the North American
Chapter of the Association for Computational Lin-
guistics: Human Language Technologies, Volume 1
(Long Papers) , pages 1112–1122.
Mitchell Wortsman, Tim Dettmers, Luke Zettle-
moyer, Ari Morcos, Ali Farhadi, and Ludwig
Schmidt. 2023. Stable and low-precision training
for large-scale vision-language models. Preprint ,
arXiv:2304.13013.
Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas
Muennighoff. 2023. C-pack: Packaged resources
to advance general chinese embedding. Preprint ,
arXiv:2309.07597.
Chen Xing, Devansh Arpit, Christos Tsirigotis, and
Yoshua Bengio. 2018. A walk with sgd. Preprint ,
arXiv:1802.08770.Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng,
Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan
Lan, Liwei Wang, and Tie-Yan Liu. 2020. On layer
normalization in the transformer architecture. In Pro-
ceedings of the 37th International Conference on
Machine Learning, ICML 2020, 13-18 July 2020, Vir-
tual Event , volume 119 of Proceedings of Machine
Learning Research , pages 10524–10533. PMLR.
Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao,
and Junyang Lin. 2019. Understanding and improv-
ing layer normalization. Advances in neural informa-
tion processing systems , 32.
Hong Xuan, Abby Stylianou, Xiaotong Liu, and Robert
Pless. 2020. Hard negative examples are hard, but
useful. In Computer Vision–ECCV 2020: 16th Euro-
pean Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part XIV 16 , pages 126–142. Springer.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng,
Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan
Li, Dayiheng Liu, Fei Huang, et al. 2024. Qwen2
technical report. arXiv preprint arXiv:2407.10671 .
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak
Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023.
React: Synergizing reasoning and acting in language
models. In The Eleventh International Conference
on Learning Representations, ICLR 2023, Kigali,
Rwanda, May 1-5, 2023 . OpenReview.net.
Urchade Zaratiana, Nadi Tomeh, Pierre Holat, and
Thierry Charnois. 2024. Gliner: Generalist model for
named entity recognition using bidirectional trans-
former. In Proceedings of the 2024 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies (Volume 1: Long Papers) , pages 5364–5376.
Jinle Zeng, Min Li, Zhihua Wu, Jiaqi Liu, Yuang Liu,
Dianhai Yu, and Yanjun Ma. 2022. Boosting dis-
tributed training performance of the unpadded bert
model. arXiv preprint arXiv:2208.08124 .
Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby,
and Lucas Beyer. 2022. Scaling vision transformers.
InIEEE/CVF Conference on Computer Vision and
Pattern Recognition, CVPR 2022, New Orleans, LA,
USA, June 18-24, 2022 , pages 1204–1213. IEEE.
Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie,
Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang,
Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie
Li, and Min Zhang. 2024. mgte: Generalized long-
context text representation and reranking models for
multilingual text retrieval. In Proceedings of the 2024
Conference on Empirical Methods in Natural Lan-
guage Processing: EMNLP 2024 - Industry Track,
Miami, Florida, USA, November 12-16, 2024 , pages
1393–1412. Association for Computational Linguis-
tics.
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang,
Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen
Zhang, Junjie Zhang, Zican Dong, et al. 2023. A
14
Page 15:
survey of large language models. arXiv preprint
arXiv:2303.18223 .
A Training Settings
Detailed training settings can be found in Table 3.
During training we used MNLI as a live evalu-
ation, along with validation loss and token accu-
racy metrics on a 500 million randomly sampled
sequences from the source datasets.
We use Composer (Mosaic ML Team, 2021) as
our training framework and optim ¯ı (Warner, 2023)
for our optimizer implementations.
A.1 Batch Size Schedule
Batch size warmup is a common-knowledge trick
to speed up model training when working with
medium to large batch sizes. Instead of "wasting" a
full batch on updating the suboptimal initial weight
distribution, we update the model weights on a
gradually increasing batch size. Batch size warmup
is usually longer than learning rate warmup, and
can be thought of as providing a higher initial learn-
ing rate with a mini-learning rate decay to the de-
fined learning rate schedule. We warmup Mod-
ernBERT’s batch size from 768 to 4,608 over 50
billion tokens and from 448 to 4,928 over 10 billion
tokens, for -base and -large, respectively, with an
uneven token schedule so each batch size has the
same number of update steps.
A.2 Weight Tiling
Following the Phi family of models (Li et al.,
2023; Javaheripi et al., 2023), we initialized
ModernBERT-large directly from ModernBERT-
base’s pretraining weights using center tiling and
Gopher layer scaling (Rae et al., 2022). Since
Base’s weight matrices are smaller than Large’s,
we centered Base’ weights, accounting for each
token embedding and attention head, then filled
rest the of the weights using wraparound. Like Phi,
we tested center initialization with random edge
values and tiling from an edge, but both of these un-
derperformed center tiling with wraparound. This
weight initialization strategy greatly accelerates
ModernBERT-large’s initial training.
A.3 Weight Decay
We did not apply weight decay to the bias terms
or normalization layers. Instead of PyTorch-style
decoupled weight decay, we applied fully decou-
pled weight decay following Loshchilov and Hutter
(2019).A.4 Final Checkpoints
Inspired by recent work showing that checkpoint
averaging yields stronger final models (Dubey
et al., 2024; Clavié, 2024), we selected our final
checkpoints by experimenting with various aver-
aging methods and evaluating them on a subset
of evaluation tasks. In no cases did Exponen-
tial Moving Average during annealing, as used by
Dubey et al. (2024), result in stronger performance.
ModernBERT-base is the result of averaging the 3
best performing annealing checkpoints with the fi-
nal one. Averaging did not yield successful results
on the large size, ModernBERT-Large model is the
best performing annealing checkpoint.
B Model Design
From Anthony et al. (2024), in addition to setting
attention heads as multiples of 64 and setting the
embedding matrix as a power of 2 or multiple of
64, there are three model design choices to max-
imize performance (assuming float16 or bfloat16
computation):
•Tensor Core Requirement : Weight matrix
dimensions should be divisible by 64
•Tile Quantization : Weight matrix is divisible
into 128 × 256 blocks.
•Wave Quantization : Number of blocks is
divisible by the number of streaming multi-
processors (SM).
Given that we wanted to target good performance
across multiple GPUs with a wide variety of SM
counts, wave quantization is an impossible ask. So
we selected a basket of GPUs (NVIDIA T4, A10,
L4, RTX 3090, RTX 4090, A100, and H100) and
calculated the approximate SM utilization for each
by dividing the modulus blocks by the number of
SMs. This appeared to be a decent performance
heuristic in our spot checking. We then designed
our models to maximize performance on the basket
of GPUs, putting more weight on inference GPUs.
C Training Log
C.1 Sampling Issue
Our first pretraining run of ModernBERT-base
ended in disaster as the loss exhibited a slow see-
saw pattern before slowly diverging. Despite us-
ing PyTorch’s distributed random sampler, train-
ing metrics suggested that the model was training
15
Page 16:
Pretraining Phase Context Extension: Phase One Context Extension: Phase Two
Base Large Base Large Base Large
Training Tokens 1.719 trillion 250 billion 50 billion
Max Sequence Length 1,024 8,192 8,192
Batch Size 4,608 4,928 72 77 72 78
Warmup (tokens) 50 billion 10 billion - - - -
Microbatch Size 96 56 12 7 12 6
Learning Rate 8e-4 5e-4, 5e-5 3e-4 5e-5 3e-4 5e-5
Schedule Trapezoidal - - 1-sqrt
Warmup (tokens) 3 billion 2 billion - - - -
Decay (tokens) - - - - 50 billion
Weight Decay 1e-5 1e-5, 1e-6 1e-5 1e-6 1e-5 1e-6
Total Time (hours) 194.2 425.3 39.9 80.7 11.5 21.7
Training Time (hours) 191.1 420.4 36.3 75.1 7.5 15.3
Model Initialization Megatron From Base - - - -
Dropout (attn out) 0.1
Dropout (all other layers) 0.0
Optimizer StableAdamW
Betas (0.90, 0.98)
Epsilon 1e-06
Training Hardware 8x H100
Training Strategy Distributed DataParallel
Software Libraries PyTorch 2.4.0, Cuda 12.4.0, Composer 0.24.1, Flash Attention 2.6.3, FA3 commit 32792d3
Table 3: ModernBERT training settings. Dropout and below are shared across all phases.
Base Large
V ocabulary 50,368 50,368
Unused Tokens 83 83
Layers 22 28
Hidden Size 768 1024
Transformer Block Pre-Norm Pre-Norm
Activation Function GeLU GeLU
Linear Bias False False
Attention Multi-head Multi-head
Attention Heads 12 16
Global Attention Every three layers Every three layers
Local Attention Window 128 128
Intermediate Size 1,152 2,624
GLU Expansion 2,304 5,248
Normalization LayerNorm LayerNorm
Norm Epsilon 1e-5 1e-5
Norm Bias False False
RoPE theta 160,000 160,000
Local Attn RoPE theta 10,000 10,000
Table 4: ModernBERT model design
on the dataset in a non-random order. Like the
Olmo authors15, we determined that the PyTorch
random sampler returns sequentially biased sam-
ples when the number of samples is somewhere
between 500 million and 1 billion samples16. We
resolved this issue by replacing the PyTorch sam-
15We found a comment and GitHub issue about this in the
Olmo codebase after resolving the issue ourselves.
16We did not conduct a rigorous statistical analysis to deter-
mine exactly when this happens.pler with NumPy’s PCG64DXSM random sampler.
C.2 Large Rollback
We rolled back and restarted ModernBERT-large
training at a lower learning rate of 5e-5 and lower
weight decay of 1e-6 for the last 800 billion to-
kens. Prior to restarting training, large’s training
loss, validation metrics, and live evaluations on
MNLI had plateaued for a few hundred billion to-
kens at the higher 5e-4 learning rate. In contrast,
16
Page 17:
ModernBERT-base showed a continuous, but di-
minishing, improvement on training loss, valida-
tion metrics, and live evaluations through the entire
1.719 trillion token training phase. This highlights
one of the risks of training with a constant learning
rate, other learning rate schedules can mitigate se-
lecting a too high learning rate (or too small batch
size) by lowering the learning rate throughout train-
ing.
D Architecture ablations
To select the updates to add in the ModernBERT
architecture, we performed different ablations, ex-
cept where stated, most ablations where ran at the
8-20 billion token scale:
•We compared two GLU layers, GeGLU and
SwiGLU. We find close to no difference be-
tween the two and choose to use GeGLU lay-
ers.
•Using different percentage of the head dimen-
sion for the RoPE dimension (50, 75, 100).
Lower percentages gave slightly better results.
However, the observed difference was min-
imal. As the ablations were conducted at a
considerably smaller scale than the final train-
ing, we choose to err on the side of caution
and opt to keep the dimension at 100 % to
avoid potentially hindering the capabilities of
the fully trained models.
•Both LayerNorm and RMSNorm yielded very
similar results. While RMSNorm is theo-
retically faster, at the time this work was
conducted, PyTorch did not have a native
RMSNorm implementation, leading to eager-
mode RMSNorm being the default implemen-
tation used for many users. To ensure Modern-
BERT has the highest possible out-of-the-box
efficiency, we choose to use LayerNorm in the
final models.
•We investigated using parallel attention to
compute the MLP and attention matrices at
the same time, which has been shown to in-
crease processing speeds for larger model
sizes (Chowdhery et al., 2023). However, for
models within our targe sizes and pre-training
sequence length, the speed-up we observed
was minimal while we encountered signifi-
cant degradation in downstream performance.
As such, we do not use parallel attention. It ishowever possible that larger encoders and/or
larger sequence lengths might see a different
trade-off.
•We explored the use of alternating global/local
attention, with global attention every 3 layers
and local attention over a 128 token sliding
window otherwise. This setup yielded identi-
cal downstream performance when compared
to the use of global attention in every layer,
even at 100 billion tokens, while resulting in
major speedups.
•We experimented with multiple tokenizers, be-
fore selecting our final one, based on a mod-
ified OLMo (Groeneveld et al., 2024) tok-
enizer, which performed the best out of the
recent tokenizers evaluated. Tokenizers from
the BERT and RoBERTa generation of en-
coder models had competitive downstream
performance on MNLI, but we theorized that
their lack of recent training data and lack of
code support would hinder downstream appli-
cations. Interestingly, we observed significant
downstream performance degradation when
using the Llama 2 (Touvron et al., 2023) tok-
enizer.
E Extended results
E.1 Full GLUE results
The results for all the models each GLUE subsets
are presented in Table 5. The values for prior mod-
els are extracted from the literature. As mentioned
in Section 3.1.1, we follow standard practice (Liu
et al., 2019a; Portes et al., 2023; He et al., 2023)
and conduct an hyperparameter search on each
subset. More specifically, we perform a sweep
over learning rates in [1e−5,3e−5,5e−5,8e−5],
weight decay in [1e−6,5e−6,8e−6,1e−5], and
number of epochs in [1,2,3]for tasks in SST-2,
MNLI, and RTE, and [2,5,10]for tasks in QNLI,
QQP, CoLA, MRPC, and STS-B. The final values
are detailed in Table 6. Early stopping is used for
all the fine-tuning runs which reduces the overall
fine-tuning time considerably. RTE MRPC and
STS-B checkpoints are trained starting from the
MNLI checkpoint.
E.2 Full BEIR results
In the main body, we only report the average
score over the 15 very diverse datasets of BEIR.
We report the results on every subsets for both
17
Page 18:
Single Sentence Paraphrase and Similarity Natural Language Inference
Model Params Seq. CoLA SST-2 MRPC STS-B QQP MNLI QNLI RTEBaseBERTβ110M 512 59.0 93.1 89.5 89.4 91.4 85.4 91.6 78.2
RoBERTaα125M 512 63.6 94.8 90.2 91.2 91.9 87.6 92.8 78.7
DeBERTav3ϵ183M 512 69.2 95.6 89.5 91.6 92.4 90.0 94.0 83.8
MosaicBERT-128β137M 128 58.2 93.5 89.0 90.3 92.0 85.6 91.4 83.0
NomicBERT-2048γ137M 2048 50.0 93.0 88.0 90.0 92.0 86.0 92.0 82.0
GTE-en-MLMδ137M 8192 57.0 93.4 92.1 90.2 88.8 86.7 91.9 84.8
ModernBERT 149M 8192 65.1 96.0 92.2 91.8 92.1 89.1 93.9 87.4LargeBERTβ330M 512 56.2 93.3 87.8 90.6 90.9 86.3 92.8 83.8
RoBERTaα355M 512 68.0 96.4 90.9 92.4 92.2 90.2 94.7 86.6
DeBERTav3ζ434M 512 75.3 96.9 92.2 93.0 93.3 91.8 96.0 92.7
GTE-en-MLMδ434M 8192 60.4 95.1 93.5 91.4 89.2 89.2 93.9 88.1
ModernBERT 395M 8192 71.4 97.1 91.7 92.8 92.7 90.8 95.2 92.1
Table 5: GLUE (Wang et al., 2018) dev set scores.αtaken from Table 8 of (Liu et al., 2019a),βtaken from Table
S3 of (Portes et al., 2023),γfrom Table 2 of (Nussbaum et al., 2024),δfrom Table 21 of (Zhang et al., 2024),ϵ
from Table 2 of (Qiang et al., 2024) andζfrom Table 3 of (He et al., 2023)
Base Large
Task LR WD Ep LR WD Ep
CoLA 8e−5 1e−6 53e−5 8e−6 5
MNLI 5e−5 5e−6 13e−5 1e−5 1
MRPC 5e−5 5e−6108e−5 5e−6 2
QNLI 8e−5 5e−6 23e−5 5e−6 2
QQP 5e−5 5e−6105e−5 8e−6 2
RTE 5e−5 1e−5 35e−5 8e−6 3
SST-2 8e−5 1e−5 21e−5 1e−6 3
STSB 8e−5 5e−6108e−5 1e−510
Table 6: Fine-tuning hyperparameters for ModernBERT on GLUE tasks. LR: Learning Rate, WD: Weight Decay,
Ep: Epochs.
single and multi-vector retrieval in Table 7 and
Table 8 respectively. For both settings and for
every model, we perform a sweep for learning
rates in [1e−5,2e−5,3e−5,5e−5,8e−5,1e−4]
and choose the model obtaining the best average
result over a subset of datasets composed of NFCor-
pus, SciFact, TREC-Covid and FiQA as the final
model. Best learning rates for every setting are
reported in Table 9. Although ModernBERT show-
case strong results across the board, it should be
noted that an important factor in its performance is
TREC-COVID (V oorhees et al., 2021), potentially
showcasing the benefits of ModernBERT being
trained with a more recent knowledge cutoff than
most existing encoders. However, NomicBERT
and GTE have also been trained on updated data,
so the cutoff cannot be the only factor affecting the
performance.F Efficiency
Full statistics of the synthetic datasets used to eval-
uate the efficiency of the models in Section 4 are
given in Table 10. The detailed runtimes, alongside
with the maximum batch size for every model is
detailed in Table 11.
The high maximum batch-size achieved by Mod-
ernBERT models, considerably higher than any
other models, highlight the strong memory effi-
ciency of the model at both sizes. Inversely, it
is worth noting that while DeBERTaV3 has com-
petitive GLUE performance, it stands out as par-
ticularly inefficient, both in its memory use and
processing speed. Indeed, on both model sizes, De-
BERTaV3’s memory use is 5-to-7 times higher than
ModernBERT’s, and it processes inputs, two times
slower even in the most favorable scenario where
all sequences are at the maximum possible length,
thus negating any advantage from unpadding.
18
Page 19:
Model NFCorpus SciFact TREC-Covid FiQA ArguAna Climate-FEVER DBPedia FEVER HotpotQA MSMARCO NQ Quora SciDocs Touche2020 CQADupstack Avg.BaseBERT 24.3 51.3 49.5 22.8 31.6 21.9 28.2 64.1 47.9 58.5 37.9 83.1 12.9 20.4 28.5 38.9
RoBERTa 20.4 45.6 52.2 26.1 35.2 22.3 23.1 60.2 45.0 56.0 34.7 84.0 11.4 21.1 28.8 37.7
DeBERTaV3 8.0 22.6 48.4 11.5 26.1 9.7 5.3 17.3 8.0 25.2 12.5 74.7 5.4 14.2 14.2 20.2
NomicBERT 25.7 52.0 63.0 23.5 35.5 22.9 30.3 65.0 48.0 60.6 42.6 84.5 12.6 19.0 29.2 41.0
GTE-en-MLM 26.3 54.1 49.7 30.1 35.7 24.5 28.9 66.5 49.9 63.1 41.7 85.2 14.1 19.1 32.5 41.4
ModernBERT 23.7 57.0 72.1 28.8 35.7 23.6 23.8 59.9 46.1 61.6 39.5 85.9 12.5 20.8 33.1 41.6LargeBERT 23.3 50.7 48.9 24.0 35.2 22.1 27.2 61.7 45.9 59.8 39.5 83.6 13.0 19.5 28.9 38.9
RoBERTa 23.9 53.4 55.0 33.4 37.6 23.5 25.4 65.2 47.1 60.4 43.3 85.8 13.7 21.1 33.0 41.4
DeBERTaV3 9.6 31.2 56.6 15.8 26.3 14.4 6.8 29.4 15.3 32.4 21.5 79.1 7.0 18.8 19.9 25.6
GTE-en-MLM 27.7 57.6 48.4 34.0 35.3 24.0 27.0 65.4 50.8 64.1 44.9 85.3 15.6 21.4 35.5 42.5
ModernBERT 26.2 60.4 74.1 33.1 38.2 20.5 25.1 62.7 49.2 64.9 45.5 86.5 13.8 23.1 36.5 44.0
Table 7: BEIR (Thakur et al., 2021) nDCG@10 scores for single-vector retrieval models.
Model NFCorpus SciFact TREC-Covid FiQA ArguAna Climate-FEVER DBPedia FEVER HotpotQA MSMARCO NQ Quora SciDocs Touche2020 CQADupstack Avg.BaseBERT 34.2 71.5 69.9 35.0 49.9 19.2 42.4 83.1 69.8 45.4 55.4 84.1 14.7 27.0 34.2 49.0
RoBERTa 33.7 70.8 69.8 37.4 48.9 18.9 39.3 81.2 66.1 43.7 56.3 83.6 14.8 31.7 34.4 48.7
DeBERTaV3 31.9 68.5 75.5 35.5 46.5 18.3 35.6 78.1 65.3 39.5 50.4 83.7 14.6 31.1 32.3 47.1
NomicBERT 35.5 72.2 73.5 35.9 44.8 19.0 43.6 83.9 71.1 46.3 58.5 84.0 15.1 31.3 33.9 49.9
GTE-en-MLM 35.1 71.5 69.4 36.0 48.5 17.4 41.2 79.9 67.0 44.4 52.8 85.2 15.0 25.4 34.6 48.2
ModernBERT 35.2 73.0 80.5 38.0 49.1 22.2 42.0 85.8 70.4 45.4 57.1 86.3 16.0 33.9 35.1 51.3LargeBERT 34.6 72.9 68.8 35.5 48.3 19.7 42.4 83.6 70.7 45.9 57.2 84.8 15.2 28.9 34.9 49.5
RoBERTa 35.0 72.3 74.4 38.7 50.0 19.6 41.0 82.0 66.2 44.7 57.5 85.9 15.3 27.9 36.0 49.8
DeBERTaV3 31.7 70.2 73.3 35.0 46.2 18.0 36.5 79.0 63.2 39.4 51.6 81.1 14.1 28.6 33.1 46.7
GTE-en-MLM 35.2 72.4 67.2 39.6 50.3 20.8 44.4 82.5 72.0 47.0 60.1 86.4 15.9 30.9 35.4 50.7
ModernBERT 36.0 73.2 81.3 40.3 50.3 22.3 44.1 85.8 72.5 46.0 59.9 86.1 16.9 34.6 35.9 52.4
Table 8: BEIR (Thakur et al., 2021) nDCG@10 scores for multi-vector retrieval models.
Model Single-vector (DPR) Multi-vector (ColBERT)BaseBERT 5×10−58×10−5
RoBERTa 3×10−58×10−5
DeBERTaV3 8×10−55×10−5
NomicBERT 5×10−51×10−4
GTE-en-MLM 5×10−58×10−5
ModernBERT 8×10−51×10−4LargeBERT 3×10−51×10−4
RoBERTa 3×10−51×10−5
DeBERTaV3 8×10−51×10−5
GTE-en-MLM 3×10−53×10−5
ModernBERT 1×10−43×10−5
Table 9: Learning rate used for reported results on BEIR (Thakur et al., 2021) for both single and multi vector
retrieval
Short Long
Fixed Variable Fixed Variable
Total Token Count 4,194,304 2,096,510 67,108,864 33,604,913
Standard deviation 0 64 0 1,024
Average Length 512 256 8,192 4,102
Longest sequence 512 476 8,192 7,624
Shortest sequence 512 32 8,192 171
Number of sequences 8,192 8,192 8,192 8,192
Table 10: Token statistics for the synthetic datasets used in efficiency evaluations.
G Licensing
We release the ModernBERT model architecture,
model weights, and training codebase under the
Apache 2.0 license.
19
Page 20:
Short Long
Model Params BS Fixed Variable BS Fixed VariableBaseBERT 110M 1096 23.3 ± 0.02 – – – –
RoBERTa 125M 664 23.3 ± 0.19 – – – –
DeBERTaV3 183M 236 59.7 ± 0.11 – – – –
NomicBERT 137M 588 35.8 ± 0.01 – 36 1455.5 ± 0.31 –
GTE-en-MLM 137M 640 33.9 ± 1.21 – 38 1434.7 ± 3.69 –
GTE-en-MLM xformers 137M 640 34.2 ± 0.10 16.3 ± 0.04 38 1412.6 ± 3.19 499.2 ± 0.11
ModernBERT 149M 1604 28.3 ± 0.55 14.2 ± 0.01 98 542.4 ± 0.20 251.2 ± 0.32LargeBERT 330M 792 77.1 ± 1.50 – – – –
RoBERTa 355M 460 99.8 ± 1.79 – – – –
DeBERTaV3 434M 134 170.8 ± 0.06 – – – –
GTE-en-MLM 435M 472 108.4 ± 0.07 – 28 4144.7 ± 0.05 –
GTE-en-MLM xformers 435M 472 109.0 ± 0.14 51.9 ± 0.02 28 4059.1 ± 4.55 1476.3 ± 0.94
ModernBERT 395M 770 80.1 ± 1.65 39.6 ± 0.02 48 1433.9 ± 0.99 674.9 ± 0.15
Table 11: Inference runtime for all models. Bold indicates the best for the column within two SDs.
20