Authors: Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, Gabriel Hautreux, João Alves, Kevin El-Haddad, Manuel Faysse, Maxime Peyrard, Nuno M. Guerreiro, Patrick Fernandes, Ricardo Rei, Pierre Colombo
EuroBERT: Scaling Multilingual Encoders
for European Languages
Nicolas Boizard (1,3), Hippolyte Gisserot-Boukhlef (2,3), Duarte M. Alves (4,5),
André Martins† (4,5,6), Ayoub Hammal† (7,8,9), Caio Corro† (8,10,11), Céline Hudelot† (3),
Emmanuel Malherbe† (2), Etienne Malaboeuf† (12), Fanny Jourdan† (13), Gabriel Hautreux† (12),
João Alves† (6), Kevin El-Haddad† (1,17), Manuel Faysse† (3,14), Maxime Peyrard† (8,15),
Nuno M. Guerreiro† (3,4,5,6), Patrick Fernandes† (4,5,18), Ricardo Rei† (6),
Pierre Colombo⋆ (3,16)
(1) Diabolocom, (2) Artefact, (3) MICS, CentraleSupélec, Université Paris-Saclay, (4) Instituto
Superior Técnico & Universidade de Lisboa (Lisbon ELLIS Unit), (5) Instituto de
Telecomunicações, (6) Unbabel, (7) Université Paris-Saclay, (8) CNRS, (9) LISN, (10) INSA Rennes,
(11) IRISA, (12) CINES, (13) IRT Saint Exupéry, (14) Illuin Technology, (15) Université Grenoble
Alpes, Grenoble INP, LIG, (16) Equall, (17) ISIA Lab, (18) Carnegie Mellon University
Equal contribution, † Ordered alphabetically by first name, ⋆ Senior advisor
General-purpose multilingual vector representations, used in retrieval, regression and
classification, are traditionally obtained from bidirectional encoder models. Despite
their wide applicability, encoders have been recently overshadowed by advances in
generative decoder-only models. However, many innovations driving this progress are
not inherently tied to decoders. In this paper, we revisit the development of multilingual
encoders through the lens of these advances, and introduce EuroBERT, a family of
multilingual encoders covering European and widely spoken global languages. Our
models outperform existing alternatives across a diverse range of tasks, spanning
multilingual capabilities, mathematics, and coding, and natively supporting sequences
of up to 8,192 tokens. We also examine the design decisions behind EuroBERT, offering
insights into our dataset composition and training pipeline. We publicly release the
EuroBERT models, including intermediate training checkpoints, together with our
training framework.
Contact firstname.lastname@centralesupelec.com, duartemalves@tecnico.ulisboa.pt
Website https://huggingface.co/EuroBERT
Date March 10, 2025
1 Introduction
Many important tasks in NLP, including information retrieval, classification, or regression,
are built upon general-purpose vector representations. These representations are tradition-
ally obtained from bidirectional encoder models, which aggregate information from the left
and right contexts of each token (Devlin et al., 2019; Conneau et al., 2020; He et al., 2023). In
contrast, recent advances in generative modeling have shifted the research community’s
attention towards unidirectional architectures (Bai et al., 2023; Llama Team, 2024; OLMo
et al., 2025). Notably, these efforts have identified several key performance drivers that
span architectural advances, data improvements, and increased scale. Yet, despite no appar-
ent barrier to transferring these insights to bidirectional architectures, little effort has been
devoted towards this objective, forcing practitioners to depend on older models.
In this paper, we introduce a refreshed recipe for training general-purpose multilingual
encoders, resulting in the EuroBERT family. Our models incorporate recent architectural
advances from decoder models (§2.1), and are trained on a 5T-token multilingual dataset,
covering European and widely spoken global languages, along with mathematics and
code (§2.2). We adopt a masked language modeling objective, and employ a two-phase
training pipeline, adjusting the data distribution in the second training phase to improve
downstream performance (§2.3).
arXiv:2503.05500v1 [cs.CL] 7 Mar 2025
[Figure 1 plot: Pareto performance panels comparing EuroBERT-210m, EuroBERT-610m, and EuroBERT-2.1B with mGTE-MLM, mDeBERTa, XLM-R-base, XLM-R-large, and XLM-R-XL.]
Figure 1: Pareto performance plots for key multilingual tasks, including retrieval (MIRACL),
classification (XNLI), and regression (SeaHorse). The blue dotted line represents the Pareto
frontier achieved by the EuroBERT model family.
We extensively evaluate the EuroBERT models, comparing to several similarly sized alternatives
across a suite of tasks representative of real-world encoder applications (§3). Our models
match or exceed the performance of alternative models, such as XLM-RoBERTa (Conneau
et al., 2020) and mGTE-MLM-base (Zhang et al., 2024), on multilingual retrieval, classification
and regression tasks (Figure 1), and outperform them on tasks related to code and
mathematics (Figure 2).
We also examine the impact of our design choices through systematic ablations on several
components of our annealing recipe (§4). We explore the choice of masking ratio, showing
that while higher masking ratios benefit retrieval tasks, lower ratios improve sentence
classification. Additionally, we highlight that including data for code and mathematics can
improve multilingual retrieval, but degrades classification accuracy. Contrary to expecta-
tions, we also find that, when using model-based filters for data selection, mixing data from
lower and higher quality thresholds can improve both retrieval and classification.
Accompanying this work, we release the EuroBERT family, comprising three models with
210m, 610m and 2.1B parameters. To facilitate future research, we also release intermediate
training checkpoints, as well as our training framework.
2 EuroBERT: A Refreshed Multilingual Encoder
The EuroBERT models are available in three sizes (210m, 610m, and 2.1B parameters) and
closely follow the Llama 3 architecture (Llama Team, 2024) (§2.1). They are trained on a large
multilingual corpus, which includes datasets of code and mathematics (§2.2). The training
pipeline has two stages, pre-training and annealing, and employs the masked language
modeling (MLM) objective (§2.3).
2.1 Architecture
The EuroBERT models are based on a standard dense transformer (Vaswani et al., 2017),
with several architectural changes. Similarly to Llama 2 (Touvron et al., 2023), we remove
all biases. Additionally, we incorporate grouped query attention (Ainslie et al., 2023), swish
gated linear units (Shazeer, 2020), root mean square layer normalization (Zhang & Sennrich,
2019), and rotary position embeddings (Su et al., 2024).¹
¹ We provide more architecture details in Appendix A.
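To make two of the components named above concrete, the following is a minimal pure-Python sketch of root mean square layer normalization and a bias-free swish gated linear unit feed-forward block. It is an illustration under stated assumptions, not code from the EuroBERT training framework: the function names, the `eps` value, and the use of plain lists instead of tensors are all hypothetical choices.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMS layer normalization: divide x by its root mean square, then apply a learned gain.

    No mean subtraction and no bias term. `eps` is an assumed stabilizer value.
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]


def silu(v):
    """Swish (SiLU) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector, without any bias."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Bias-free gated feed-forward block: W_down (silu(W_gate x) * (W_up x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```

After `rms_norm` with a unit gain, the mean of the squared outputs is approximately 1, which is the property the normalization is meant to enforce.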
[Figure 2 plot: Pareto performance panels comparing EuroBERT-210m, EuroBERT-610m, and EuroBERT-2.1B with ModernBERT-base, ModernBERT-large, mGTE-MLM, mDeBERTa, XLM-R-base, XLM-R-large, and XLM-R-XL.]
Figure 2: Pareto performance plots for code-related (CodeSearchNet) and math-related
(MathShepherd) tasks. The blue dotted line represents the Pareto frontier achieved by the
EuroBERT model family.
2.2 Dataset
To train EuroBERT, we construct a multilingual 5T-token corpus (4.8T tokens for pre-training
and 200B for annealing), which includes 15 languages: English, French, German,
Spanish, Chinese, Italian, Russian, Polish, Portuguese, Japanese, Vietnamese, Dutch, Arabic,
Turkish, and Hindi.² Following prior work on curriculum learning, we adjust the data
distribution to emphasize higher-quality datasets during annealing.
Pre-training mixture. We use FineWeb (Penedo et al., 2024) for English, and Cul-
turaX (Nguyen et al., 2024) for multilingual data. We also incorporate parallel data, which
can improve cross-lingual transfer (Conneau & Lample, 2019; Reid & Artetxe, 2022; 2023),
by concatenating to-English and from-English translation pairs, separated by a special
<|parallel_sep|> token. Finally, we incorporate 38 programming languages from The Stack
v2 and Proof-Pile-2, which we found to improve multilingual information retrieval (§4).
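The parallel-data formatting described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: only the `<|parallel_sep|>` token comes from the text, while the function names, the absence of whitespace around the separator, and the choice to emit both directions for every pair are assumptions.

```python
PARALLEL_SEP = "<|parallel_sep|>"  # special separator token named in the text


def format_parallel_pair(source, target):
    """Concatenate a source sentence and its translation around the separator."""
    return f"{source}{PARALLEL_SEP}{target}"


def build_parallel_examples(pairs):
    """Turn (english, other_language) pairs into training strings.

    Assumption: each pair is emitted in both the to-English and the
    from-English direction.
    """
    examples = []
    for english, other in pairs:
        examples.append(format_parallel_pair(other, english))  # to-English
        examples.append(format_parallel_pair(english, other))  # from-English
    return examples
```

For instance, `build_parallel_examples([("Hello", "Bonjour")])` yields one string per translation direction.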
Annealing mixture. For annealing, we classified data not seen during pre-training into
four quality levels using the EuroLLM classifier (Martins et al., 2024). We then selected
data above the third threshold, representing a mixture of medium and high quality data.
Contrary to our expectations, we found that including data of lower quality improved
performance (§4). Additionally, we adjusted the data distribution based on multiple
ablations (§4). Specifically, we decreased the proportion of English while proportionally
increasing the remaining languages. We also decreased the amount of code and math data
while increasing parallel data.³
2.3 Training Recipe
Masked language modeling. We choose to pre-train EuroBERT models with a 50% mask-
ing ratio, following the insights from Wettig et al. (2023), which find that masking 15% and
30% of tokens is sub-optimal, and that larger models benefit from higher masking ratios.
For the subsequent annealing phase, however, we lower the masking ratio to 10% based on
downstream evaluations (§4), aligning with the findings from Yang et al. (2023) and Ankner
et al. (2024).
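As a rough illustration of how the masking ratio enters the objective, the sketch below masks a fixed fraction of token positions and records the original tokens as prediction targets. It is a simplified sketch under assumptions: the `[MASK]` symbol, the exact-count sampling of positions, and always substituting the mask token (rather than any mixed replacement scheme) are hypothetical choices not specified in the text.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical mask symbol


def mask_tokens(tokens, mask_ratio, rng):
    """Return (masked_tokens, labels) for masked language modeling.

    labels[i] holds the original token at masked positions and None elsewhere,
    so the loss is computed only on masked positions.
    """
    if not tokens:
        return [], []
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK_TOKEN if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


# Usage: the two ratios mentioned in the text.
rng = random.Random(0)
tokens = list("abcdefghijklmnopqrst")  # 20 toy tokens
pretrain_masked, _ = mask_tokens(tokens, 0.50, rng)  # pre-training ratio
anneal_masked, _ = mask_tokens(tokens, 0.10, rng)    # annealing ratio
```

With 20 tokens, the 50% setting hides 10 positions and the 10% setting hides 2, which shows how much less of the input the model has to reconstruct during annealing.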
² These languages were selected to balance European and widely spoken global languages, and
ensure representation across diverse alphabets and language families.
³ We provide further details on our pretraining and annealing datasets in Appendix C.
Benchmark       mDeBERTa   mGTE       XLM-RoBERTa                      EuroBERT
                280m       305m       280m       560m       3.5B       210m       610m       2.1B
Retrieval
MIRACL          43.7 (5)   93.8 (2)   89.5 (4)   91.6 (3)   92.6 (3)   95.1 (1)   95.0 (1)   94.8 (1)
MLDR            20.0 (6)   73.2 (2)   58.7 (5)   65.2 (4)   70.0 (3)   73.4 (2)   75.8 (1)   72.9 (2)
CC-News         15.8 (7)   71.5 (4)   60.4 (6)   72.1 (4)   80.9 (1)   69.0 (4)   76.6 (2)   76.9 (2)
Wikipedia       58.9 (5)   94.6 (3)   91.7 (4)   93.6 (3)   96.7 (1)   95.6 (2)   96.6 (1)   96.6 (1)
Classification
XNLI            82.0 (4)   78.4 (6)   76.6 (7)   84.1 (3)   86.1 (2)   79.9 (5)   84.7 (2)   86.8 (1)
PAWS-X          91.9 (2)   89.8 (3)   88.9 (4)   92.4 (2)   92.9 (1)   89.9 (3)   92.2 (2)   93.0 (1)
QAM             72.1 (4)   68.6 (5)   67.9 (5)   73.7 (2)   75.4 (1)   69.6 (4)   72.9 (3)   73.3 (3)
AmazonReviews   63.7 (2)   62.7 (3)   62.7 (3)   64.5 (1)   64.7 (1)   63.0 (2)   64.0 (2)   64.5 (1)
MassiveIntent   87.3 (2)   87.5 (2)   87.2 (2)   88.8 (1)   88.5 (1)   87.2 (2)   87.8 (2)   88.2 (2)
Regression
WMT             43.0 (2)   37.7 (6)   34.2 (7)   39.0 (4)   44.4 (1)   40.5 (4)   41.1 (4)   38.8 (5)
SeaHorse        30.2 (8)   55.8 (4)   31.7 (7)   34.3 (6)   35.7 (5)   59.3 (3)   61.8 (2)   64.2 (1)
SummEval        26.0 (7)   43.7 (3)   26.9 (6)   38.6 (4)   30.3 (6)   46.3 (3)   57.4 (2)   59.4 (1)
Table 1: Multilingual evaluation results. Scores represent NDCG@10 for retrieval, accuracy
for classification, and Spearman rank correlation for regression tasks, averaged across
European languages. Values in parentheses (highlighted in red in the original layout) indicate
model rankings for each task, determined through pairwise statistical tests with a 95%
confidence level.
Hyperparameters. We employed the Warmup-Stable-Decay (WSD) scheduler (Shen et al.,
2024), with a linear warm-up phase of 2,000 steps, a constant learning rate of $1 \times 10^{-4}$
during pre-training, and a cosine scheduler decaying to 0 during the annealing phase.
During pre-training, we packed sentences to 2,048 tokens and used a Rotary Position
Embedding (RoPE) theta of 10,000. In the annealing phase, we increased the RoPE theta
to 250,000 and randomly cropped our training documents to lengths between 12 and 8,192
tokens. We adopted this approach because, due to pre-processing constraints, our training
data had already been segmented into fixed-length documents, making standard
variable-length training infeasible. Therefore, we introduced random cropping of these fixed-length
sequences as an approximation of variable-length training. Surprisingly, we found that this
approach outperforms training only on fixed lengths (§4), further highlighting the necessity
for variable length documents during long context training (Gao et al., 2024).
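The learning rate schedule described in this paragraph can be sketched as a single function of the training step. This is an illustrative reading of the text rather than the released scheduler: the peak rate and warm-up length come from the paragraph, whereas `pretrain_steps` and `anneal_steps` are hypothetical arguments, since the step counts of each phase are not given here.

```python
import math


def wsd_learning_rate(step, pretrain_steps, anneal_steps, peak_lr=1e-4, warmup_steps=2000):
    """Warmup-Stable-Decay schedule.

    - linear warm-up to `peak_lr` over `warmup_steps`
    - constant `peak_lr` for the remainder of pre-training
    - cosine decay from `peak_lr` to 0 over the annealing phase
    `pretrain_steps` (assumed > warmup_steps) and `anneal_steps` are hypothetical.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < pretrain_steps:
        return peak_lr
    progress = min(1.0, (step - pretrain_steps) / anneal_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

Halfway through annealing the rate is half of the peak value, and it reaches 0 at the last annealing step.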
Infrastructure. We trained the EuroBERT family on Adastra, using 92 MI250X GPUs
for EuroBERT-210m, 384 MI250X GPUs for EuroBERT-610m, and 96 MI300A GPUs for
EuroBERT-2.1B, for a total of 200k GPU hours. Our training framework incorporates
FlashAttention (Dao, 2023), fused cross-entropy from LigerKernel (Hsu et al., 2024),
torch.compile (Ansel et al., 2024), and hybrid sharding with Fully Sharded Data Paral-
lel (Zhao et al., 2023).
3 Evaluation
3.1 Evaluation Setup
Datasets and tasks. We select a suite of tasks to cover various real-world use cases for
encoders. For multilingual tasks, we evaluate retrieval performance using MIRACL (Zhang
et al., 2023), MLDR (Chen et al., 2024), WikipediaRetrieval,⁴ and CC-News (de Gibert et al.,
2024). We assess classification with XNLI (Conneau et al., 2018), PAWS-X (Yang et al., 2019),
QAM (Liang et al., 2020), AmazonReviews (Keung et al., 2020) and MassiveIntent (Keung
et al., 2020). Additionally, we measure sequence regression performance on the WMT (Bojar
et al., 2017; 2018; Barrault et al., 2019; 2020; Akhbardeh et al., 2021; Kocmi et al., 2022)
quality estimation task, as well as on summary evaluation using SeaHorse (Clark et al.,
2023) and SummEval (Fabbri et al., 2021). For code-related tasks, we evaluate retrieval
on CodeSearchNet (Husain et al., 2019) and DupStackMath (Hoogeveen et al., 2015), and
classification on CodeDefect (Zhou et al., 2019) and CodeComplexity (Jeon et al., 2023).
Finally, in the mathematical domain, we test retrieval on the MathFormula (Drechsel et al.,
2025) task, and process reward modeling on MathShepherd (Wang et al., 2024b). We believe
that our chosen suite is representative of many practical applications of encoder models.⁵
⁴ https://huggingface.co/datasets/Samoed/WikipediaRetrievalMultilingual
[Figure 3 plot: two panels. MLDR: NDCG@10 by document length bucket ((1K, 5K], (5K, 9K], (9K, 17K], (17K, 26K], (26K, 151K]). SeaHorse: Spearman correlation by document length bucket ((69, 2K], (2K, 2K], (2K, 3K], (3K, 4K], (4K, 27K]). Models: XLM-RoBERTa 280m, 560m, and 3.5B; mGTE 305m; EuroBERT 210m, 610m, and 2.1B.]
Figure 3: Long context analysis. We examine how context length affects model performance
in the XLM-RoBERTa and EuroBERT families across two long-context tasks (MLDR and
SeaHorse).
Baselines. We compare EuroBERT with the multilingual encoders XLM-RoBERTa (Conneau
et al., 2020; Goyal et al., 2021), mGTE-MLM-base (Zhang et al., 2024),⁶ and mDeBERTa-v3-base.
Additionally, for code and mathematical tasks, we compare with the English-only
ModernBERT (Warner et al., 2024) models.
Fine-tuning. We follow a standardized fine-tuning protocol to ensure fair model comparison.
For sequence classification tasks, models are trained for 10,000 steps on the corresponding
training split using the standard cross-entropy loss, a batch size of 32, a 10% warm-up
ratio, and a linear learning rate decay. For small datasets requiring multiple epochs, we
apply early stopping with a patience of one epoch based on validation performance. To
account for model specificities, we fine-tune using 10 logarithmically spaced learning rates
($1 \times 10^{-5}$ to $1 \times 10^{-4}$), selecting the one that achieves the highest validation metric.⁷ For
sequence regression tasks, we use the same setting but replace the loss with the mean squared
error (MSE). For long-context summarization datasets (SeaHorse and SummEval), fine-tuning is
limited to 5,000 steps to reduce computational cost. For retrieval tasks, models are fine-tuned for
1,000 steps on MS-MARCO (Bajaj et al., 2016),⁸ using InfoNCE loss (Oord et al., 2018) with
in-batch negatives and cosine similarity as the similarity metric. All other hyperparameters
are aligned with those used in classification and regression tasks.
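To illustrate the retrieval objective mentioned above, here is a minimal pure-Python sketch of an InfoNCE loss with in-batch negatives and cosine similarity. It is a sketch under assumptions and not the fine-tuning code used for the reported results: the `temperature` value is hypothetical (the text does not give one), embeddings are assumed to be non-zero lists of floats, and `queries[i]` is assumed to be paired with `documents[i]`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    For query i, documents[i] is the positive and every other document in
    the batch serves as a negative. `temperature` is an assumed value.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine_similarity(query, doc) / temperature for doc in documents]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]  # negative log-softmax of the positive
    return total / len(queries)
```

The loss is close to 0 when each query is most similar to its own document, and grows when another document in the batch scores higher than the positive.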
⁵ Additional information on evaluation tasks is given in Appendix D.
⁶ Since the EuroBERT models are general-purpose encoders, we evaluate them against the
pre-trained mGTE-MLM-base variant, which, similarly, was not optimized for retrieval tasks.
⁷ Additional fine-tuning hyperparameters, such as AdamW parameters ($\beta_1$, $\beta_2$, $\epsilon$, and weight
decay), are set according to the values reported in the original source papers. For EuroBERT models,
we maintain the same settings as those used during pre-training and annealing.
⁸ Since many retrieval datasets lack dedicated training splits, we use MS-MARCO, an English-only
dataset. This choice also allows us to assess cross-lingual generalization.
Benchmark       ModernBERT          mDeBERTa  mGTE      XLM-RoBERTa                   EuroBERT
                150m      395m      280m      305m      280m      560m      3.5B      210m      610m      2.1B
Code
CodeSearchNet   53.9 (5)  65.8 (3)  2.8       34.0 (7)  23.0 (8)  40.8 (6)  54.1 (5)  58.9 (4)  69.9 (2)  72.6 (1)
DupStackMath    39.7 (4)  45.5 (2)  10.2 (7)  37.5 (4)  29.3 (6)  36.9 (5)  42.9 (3)  41.7 (3)  46.0 (2)  48.3 (1)
CodeComplexity  86.1 (3)  88.6 (3)  73.9 (5)  74.5 (5)  74.1 (5)  83.6 (4)  84.3 (4)  91.9 (2)  94.2 (1)  95.2 (1)
CodeDefect      65.8 (3)  67.0 (2)  64.7 (3)  63.5 (4)  61.9 (4)  54.3 (5)  65.8 (3)  69.5 (1)  69.0 (1)  67.7 (2)
Math
MathFormula     89.6 (5)  91.9 (2)  85.2 (7)  83.4 (8)  83.1 (8)  81.4      89.1 (6)  91.5 (3)  92.6 (1)  91.0 (4)
MathShepherd    77.7 (4)  83.6 (2)  75.1 (5)  77.2 (4)  71.9 (6)  67.6 (7)  82.5 (3)  84.0 (2)  87.3 (1)  86.8 (1)
Table 2: Evaluation results for code and math domains. Scores are reported as NDCG@10
for retrieval tasks (CodeSearchNet, DupStackMath, MathFormula) and accuracy for
classification tasks (CodeComplexity, CodeDefect, MathShepherd). Values in parentheses are
per-task model rankings, as in Table 1.
Evaluation metrics. We report accuracy for sequence classification, Spearman rank cor-
relation for regression, and NDCG@10 for retrieval tasks. We also follow Freitag et al.
(2023), and group systems into language-specific clusters based on statistically significant
performance gaps at 95% confidence thresholds. We then compute system-level rankings
using a normalized Borda count (Colombo et al., 2022), defined as the average over the
obtained per-language clusters.
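For readers unfamiliar with the retrieval metric, NDCG@10 can be sketched as below. This is a generic textbook formulation, not the evaluation code behind the reported numbers: it assumes a linear gain (the relevance label itself) and, for simplicity, computes the ideal ordering from the relevance labels of the ranked list passed in, whereas a full evaluation would take the ideal ordering over all judged documents for the query.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant document: define the score as 0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

A ranking that places the single relevant document first scores 1.0; placing it second lowers the score to 1 / log2(3), roughly 0.63.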
3.2 Results
Table 1 reports the results for all multilingual tasks, aggregated over the European
languages seen during training.⁹
The EuroBERT family exhibits strong multilingual performance across domains and
tasks. EuroBERT-2.1B, our largest model, achieves the highest performance among all
systems, ranking first on 7 of 12 benchmarks. Importantly, it outperforms the largest system,
XLM-RoBERTa-XL. Additionally, EuroBERT-610m is competitive with XLM-RoBERTa-XL, a
model 5 times its size, on most multilingual tasks, and surpasses it on code and mathematics.
Similarly, the smaller EuroBERT-210m is competitive with XLM-RoBERTa-Large, which has
twice the number of parameters, and globally outperforms all similarly sized systems.
EuroBERT is effective at document ranking. Across domains, EuroBERT consistently
ranks high for retrieval tasks. Notably, the 210m and 610m parameter models outperform
all models of comparable sizes, and are competitive with the larger XLM-RoBERTa-XL.
For sentence classification, EuroBERT models achieve results on par with similarly sized
models. On sentence classification, no model significantly outperforms all others. During
the development of EuroBERT, we found that several design decisions lead to a trade-off
between retrieval and classification capabilities (§4). We highlight, however, that EuroBERT-2.1B
is still among the highest ranking systems, and that the smaller models in the family
are competitive with models of comparable size.
EuroBERT can function as an evaluation metric. In translation evaluation, while there is
a performance gap to the larger XLM-RoBERTa, the EuroBERT models remain competitive
with the other alternatives. In the future, we would like to explore other training signals to
further enhance cross-lingual capabilities of EuroBERT. In contrast, for summary evaluation,
EuroBERT models consistently outperform competitors of any size, making them a reliable
choice for training a metric on this type of task.
⁹ More detailed results are provided in Appendix D.
[Figure 4 plot: five panels showing the performance delta on XNLI and MIRACL as data subset ratios vary. English: 41%, 26%, 17%. Math: 8%, 4%, 2%. Code: 8%, 6%, 2%. Parallel: 5%, 8%. IFT: 1%, 0%.]
Figure 4: Impact of data subset ratios on model performance on XNLI and MIRACL. The
first vertical axis of each subplot denotes the reference data mix named reference in Table C.
EuroBERT maintains performance at longer context lengths. Figure 3 compares the
long context performance of EuroBERT and XLM-RoBERTa. While both models achieve
similar performance for shorter inputs, EuroBERT maintains performance at longer contexts,
whereas XLM-RoBERTa suffers notable degradation.
The EuroBERT family excels in tasks related to code and mathematics. Table 2 reports
the results on tasks related to code and mathematics. On these tasks, all EuroBERT models
consistently surpass other systems. Similarly to retrieval tasks, the EuroBERT-210m reflects
most of the performance of the larger models in the family, and ranks above all baselines,
highlighting its capabilities at a smaller scale. Additionally, the larger EuroBERT-2.1B
achieves the highest performance among all systems.
4 Training Recipe Analysis
We measure the impact of various design decisions made during the development of
EuroBERT with extensive ablations. Following Blakeney et al. (2024) and Llama Team (2024),
we perform multiple annealing runs on 40B tokens, each varying a different component of
our recipe, and measure the performance on the XNLI and MIRACL validation sets, the
former representing multilingual classification and the latter multilingual retrieval.¹⁰
Balancing the language distribution can enhance performance. The left-most plot in
Figure 4 reports the retrieval and classification performance as the proportion of English is
reduced and re-distributed between other languages. Remarkably, the retrieval performance
consistently decreases, suggesting that increasing multilingual data may not lead to an
increase in multilingual performance.
Including math and code improves multilingual retrieval, but degrades multilingual clas-
sification. The second and third plots in Figure 4 show MIRACL performance dropping
and XNLI accuracy rising as the proportions for math and code data decrease. This outcome
underscores the specific trade-offs encountered during model development. In future work,
we aim to investigate how to better balance task performance during pre-training.
Increasing parallel data yields performance gains. The fourth plot in Figure 4 presents the
XNLI and MIRACL performance when increasing the amount of parallel data. Similar to
Anil et al. (2023); Briakou et al. (2023); Alves et al. (2024), we find it increases performance
on both benchmarks.
Adding instruction fine-tuning data degrades model performance. The right-most plot
in Figure 4 analyses the impact of adding instructions during annealing, which can improve
performance for decoder language models. In contrast to decoders, it leads to worse
performance when training an encoder model.
¹⁰ We follow the evaluation procedure from §3, but instead test on the validation splits.
[Figure 5 plot: three panels showing the performance delta on XNLI and MIRACL. Sentence Length: 2k, [12, 2k], [12, 8k]. Mask Ratio: 50%, 30%, 10%. Quality Buckets: 4, 3, 3+4.]
Figure 5: Impact of hyperparameter choice on model performance on XNLI and MIRACL.
The first vertical axis of each subplot denotes the reference data mix.
Model-based quality filters can lead to worse results. Contrary to initial expectations,
using only the highest-quality data bucket during the annealing phase did not result
in better performance on XNLI and MIRACL. Instead, as illustrated in the right plot of
Figure 5, mixing the buckets with quality levels 3 and 4 leads to the best performance on
XNLI, while the data bucket of quality 3 achieved the best results for MIRACL.
A reduced masking ratio during annealing enhances classification performance. Similar
to previous research (Yang et al., 2023; Ankner et al., 2024), which advocates lowering the
masking ratio in later training, we also find that reducing it to 10% during the annealing
phase improves EuroBERT’s performance on XNLI, though it leads to a modest decline in
MIRACL scores.
Impact of variable sentence length on model performance. The first plot in Figure 5
examines the impact of variable sentence lengths during annealing. Compared to the fixed
packed sentence lengths employed in pre-training, variable sentence lengths significantly
boost XNLI performance and moderately improve MIRACL performance. This improvement
remains stable, without degradation, when the maximum context length is extended to 8,192 tokens.
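The random cropping used to approximate variable-length training can be sketched as follows. This is a minimal illustration under assumptions, not the released data pipeline: the bounds of 12 and 8,192 tokens come from the paper, while sampling the crop length and start offset uniformly, and returning short documents unchanged, are hypothetical choices.

```python
import random


def random_crop(token_ids, rng, min_len=12, max_len=8192):
    """Return a random contiguous crop of a fixed-length document.

    The crop length is drawn uniformly from [min_len, max_len] (capped at the
    document length); documents no longer than min_len are returned unchanged.
    """
    upper = min(max_len, len(token_ids))
    if upper <= min_len:
        return list(token_ids)
    length = rng.randint(min_len, upper)               # inclusive bounds
    start = rng.randint(0, len(token_ids) - length)    # uniform start offset
    return list(token_ids[start:start + length])
```

Applying `random_crop` to every fixed-length sequence in a batch yields inputs of mixed lengths, which is the effect the ablation on sentence length measures.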
Based on this analysis, we decided to create our final annealing dataset by selecting data
above the third threshold. We reduced the proportion of English to 26% while proportionally
increasing the share of the remaining languages. To balance retrieval and classification
performance, we allocated 6% and 4% of the data mix to math and code, respectively.
Additionally, we increased the proportion of parallel data to 6%, using remaining data not
seen during pre-training, and removed the instruction data. We finally lowered the masking
ratio to 10% and performed annealing with random sentence lengths of up to 8,192 tokens.
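The final annealing mixture stated in this paragraph can be written down as sampling weights. The sketch below is illustrative only: the four percentages are taken from the text, but treating the remaining share as a single "other_languages" bucket is an assumption, as are the bucket names and the per-example sampling procedure.

```python
import random

# Proportions stated in the text for the final annealing mix.
ANNEALING_MIX = {
    "english": 0.26,
    "math": 0.06,
    "code": 0.04,
    "parallel": 0.06,
}
# Assumption: the remainder goes to the non-English languages as one bucket.
ANNEALING_MIX["other_languages"] = 1.0 - sum(ANNEALING_MIX.values())


def sample_sources(n, rng, mix=ANNEALING_MIX):
    """Draw n data-source names according to the mixture weights."""
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


# Usage: check the empirical share of English in a large sample.
draws = sample_sources(10_000, random.Random(0))
english_share = draws.count("english") / len(draws)
```

Instruction data has no entry in the dictionary, matching its removal from the final mix.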
5 Related Work
Encoder models and learning objectives. Encoder-only models have consistently demon-
strated strong performance in non-generative NLP tasks, such as classification (Acheampong
et al., 2021; Ma et al., 2019) and retrieval (Karpukhin et al., 2020; Wang et al., 2024a), lever-
aging their ability to effectively represent sequences while maintaining relatively compact
model sizes. Traditional encoder architectures, such as BERT (Devlin et al., 2019) and
RoBERTa (Liu et al., 2019), rely on the masked language modeling (MLM) objective, which,
combined with bidirectional attention, enables them to learn rich contextual representations
well-suited for sequence-level understanding. In contrast, DeBERTa (He et al., 2023)
uses replaced token detection (RTD) as an alternative pre-training objective, which
improves efficiency and achieves strong results in classification tasks. In this paper, we
chose to use the masked language modeling objective because initial evaluations of existing
models showed more balanced results across all tasks.
Multilingual encoders. Expanding on these monolingual architectures, multilingual en-
coder models such as mBERT (Devlin et al., 2019), XLM-RoBERTa (Conneau et al., 2020),
and mDeBERTa (He et al., 2023) have extended pre-training benefits to diverse languages,
enhancing cross-lingual understanding. However, training these models has also highlighted
the so-called “curse of multilinguality” (Conneau et al., 2020; Chang et al., 2024),
demonstrating the necessity of scaling in terms of training data, model size, and context
length to maintain competitive performance across languages.
Scaling encoder models. Initial efforts to scale encoder models focused on increasing
model size and supported languages, as seen in larger variants like XLM-RoBERTa-XL and
XLM-RoBERTa-XXL (Goyal et al., 2021), which demonstrated the benefits of scaling for
multilingual performance. However, more recent advancements in encoder architectures
have moved beyond mere size increases, incorporating sophisticated design improvements.
Notably, concurrent work on modern pre-trained encoders, such as ModernBERT (Warner
et al., 2024) and mGTE (Zhang et al., 2024), introduces innovations like grouped query
attention (Ainslie et al., 2023), RoPE embeddings (Su et al., 2024), GLU activations (Shazeer,
2020), RMS normalization (Zhang & Sennrich, 2019), and extended context support. In line
with these advancements and inspired by recent progress in decoder scaling (Brown et al.,
2020; Yang et al., 2024; DeepSeek-AI et al., 2024), our work revisits the classical encoder
pre-training paradigm. Specifically, we increase the MLM masking ratio, scale training
across multiple languages with up to 5 trillion tokens, and integrate recent architectural
improvements.
6 Conclusion
We propose a recipe for training general-purpose multilingual encoders, creating the Eu-
roBERT family. We incorporate recent architectural advances from decoder models, and
train on a multilingual dataset containing European and globally spoken languages, to-
gether with code and mathematics. Our models outperform existing alternatives on a
comprehensive suite of tasks covering multilingual capabilities, mathematics and code.
We also extensively analyze the design decisions behind EuroBERT’s dataset and training
pipeline. Alongside this paper, we release all models in the EuroBERT family, including
intermediate training checkpoints, and our training framework to facilitate future research.
Acknowledgments
We sincerely thank the ADASTRA supercomputer (CINES) for its technical support and
high-performance computing (HPC) resources, provided through grants C1615122 and
GDA2401. We also appreciate the support of the French government through the France
2030 program as part of the ArGiMi project. This work was also supported by the EU’s Hori-
zon Europe Research and Innovation Actions (UTTER, contract 101070631), by the project
DECOLLAGE (ERC-2022-CoG 101088763), by the Portuguese Recovery and Resilience
Plan through project C645008882-00000055 (Center for Responsible AI), and by FCT/MECI
through national funds and when applicable co-funded EU funds under UID/50008: Instituto
de Telecomunicações. Duarte was also partially supported by the DataIA Institute,
whose contributions facilitated the completion of this work.
References
Francisca Adoma Acheampong, Henry Nunoo-Mensah, and Wenyu Chen. Transformer
models for text-based emotion detection: a review of bert-based approaches. Artificial
Intelligence Review, 54(8), 2021. URL https://dl.acm.org/doi/abs/10.1007/s10462-021-09958-2.
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron,
and Sumit Sanghai. GQA: Training generalized multi-query transformer models from
multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in
Natural Language Processing, Singapore, December 2023. Association for Computational
Linguistics. URL https://aclanthology.org/2023.emnlp-main.298/.
Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen
Chatterjee, Vishrav Chaudhary, Marta R. Costa-jussa, Cristina España-Bonet, Angela
Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz,
Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck,
Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi,
Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata,
Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo,
Marco Turchi, Valentin Vydrin, and Marcos Zampieri. Findings of the 2021 conference
on machine translation (WMT21). In Proceedings of the Sixth Conference on Machine
Translation, Online, November 2021. Association for Computational Linguistics. URL
https://aclanthology.org/2021.wmt-1.1/.
Duarte Miguel Alves, José Pombal, Nuno M Guerreiro, Pedro Henrique Martins, João
Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, et al.
Tower: An open multilingual large language model for translation-related tasks. In First
Conference on Language Modeling, 2024. URL https://arxiv.org/abs/2402.17733.
Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexan-
dre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu,
Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav
Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan
Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob
Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks,
Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha
Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin,
Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus
Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand,
Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz,
Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim
Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li,
Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick
Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam
Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie
Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter,
Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby,
Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter,
Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang,
John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui
Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov,
and Yonghui Wu. Palm 2 technical report. arXiv preprint arXiv:2305.10403 , 2023. URL
https://arxiv.org/abs/2305.10403 .
Zachary Ankner, Naomi Saphra, Davis Blalock, Jonathan Frankle, and Matthew Leavitt.
Dynamic masking rate schedules for MLM pretraining. In Proceedings of the 18th Conference
of the European Chapter of the Association for Computational Linguistics (Volume 2: Short
Papers) , St. Julian’s, Malta, March 2024. Association for Computational Linguistics. URL
https://aclanthology.org/2024.eacl-short.42/ .
Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael
Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al. PyTorch 2: Faster
machine learning through dynamic Python bytecode transformation and graph
compilation. In Proceedings of the 29th ACM International Conference on Architectural
Support for Programming Languages and Operating Systems, Volume 2, 2024. URL
https://dl.acm.org/doi/abs/10.1145/3620665.3640366 .
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge,
Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu,
Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng
Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Sheng-
guang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao,
Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang,
Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen
technical report, 2023. URL https://arxiv.org/abs/2309.16609 .
Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan
Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. Ms marco: A human
generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268 , 2016.
URL https://arxiv.org/abs/1611.09268 .
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette
Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz,
Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. Findings of the 2019
conference on machine translation (WMT19). In Proceedings of the Fourth Conference on
Machine Translation (Volume 2: Shared Task Papers, Day 1) , Florence, Italy, August 2019.
Association for Computational Linguistics. URL https://aclanthology.org/W19-5301/ .
Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian
Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric
Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto
Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos
Zampieri. Findings of the 2020 conference on machine translation (WMT20). In Proceed-
ings of the Fifth Conference on Machine Translation , Online, November 2020. Association for
Computational Linguistics. URL https://aclanthology.org/2020.wmt-1.1/ .
Cody Blakeney, Mansheej Paul, Brett W Larsen, Sean Owen, and Jonathan Frankle. Does
your data spark joy? performance gains from domain upsampling at the end of training.
In First Conference on Language Modeling, 2024.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow,
Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz,
Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. Findings
of the 2017 conference on machine translation (WMT17). In Proceedings of the Second
Conference on Machine Translation , Copenhagen, Denmark, September 2017. Association
for Computational Linguistics. URL https://aclanthology.org/W17-4717/ .
Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias
Huck, Philipp Koehn, and Christof Monz. Findings of the 2018 conference on machine
translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared
Task Papers , Belgium, Brussels, October 2018. Association for Computational Linguistics.
URL https://aclanthology.org/W18-6401/ .
Eleftheria Briakou, Colin Cherry, and George Foster. Searching for needles in a haystack:
On the role of incidental bilingualism in PaLM's translation capability. In Proceedings
of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers) , Toronto, Canada, July 2023. Association for Computational Linguistics. URL
https://aclanthology.org/2023.acl-long.524/ .
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya
Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are
few-shot learners. In Advances in Neural Information Processing Systems , volume 33. Curran
Associates, Inc., 2020. URL
https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf .
Tyler A. Chang, Catherine Arnett, Zhuowen Tu, and Ben Bergen. When is multilingual-
ity a curse? language modeling for 250 high- and low-resource languages. In Pro-
ceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , Mi-
ami, Florida, USA, November 2024. Association for Computational Linguistics. URL
https://aclanthology.org/2024.emnlp-main.236/ .
Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-
embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings
through self-knowledge distillation. In Findings of the Association for Computational Lin-
guistics: ACL 2024 , Bangkok, Thailand, August 2024. Association for Computational
Linguistics. URL https://aclanthology.org/2024.findings-acl.137/ .
Elizabeth Clark, Shruti Rijhwani, Sebastian Gehrmann, Joshua Maynez, Roee Aharoni,
Vitaly Nikolaev, Thibault Sellam, Aditya Siddhant, Dipanjan Das, and Ankur Parikh.
SEAHORSE: A multilingual, multifaceted dataset for summarization evaluation. In
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing ,
Singapore, December 2023. Association for Computational Linguistics. URL
https://aclanthology.org/2023.emnlp-main.584 .
Pierre Colombo, Nathan Noiry, Ekhine Irurozki, and Stephan Clémençon. What
are the best systems? New perspectives on NLP benchmarking. In Advances
in Neural Information Processing Systems, volume 35. Curran Associates, Inc., 2022. URL
https://proceedings.neurips.cc/paper_files/paper/2022/file/ac4920f4085b5662133dd751493946a6-Paper-Conference.pdf .
Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining.
In Advances in Neural Information Processing Systems, volume 32. Curran Associates,
Inc., 2019. URL
https://proceedings.neurips.cc/paper_files/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf .
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger
Schwenk, and Veselin Stoyanov. XNLI: Evaluating cross-lingual sentence representations.
In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,
Brussels, Belgium, October-November 2018. Association for Computational Linguistics.
URL https://aclanthology.org/D18-1269/ .
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume
Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin
Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of
the 58th Annual Meeting of the Association for Computational Linguistics , Online, July 2020.
Association for Computational Linguistics. URL
https://aclanthology.org/2020.acl-main.747/ .
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning.
In The Twelfth International Conference on Learning Representations, 2023. URL
https://arxiv.org/abs/2307.08691 .
Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde,
Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey
Kutuzov, Sampo Pyysalo, Stephan Oepen, and Jörg Tiedemann. A new massive
multilingual dataset for high-performance language technologies. In Proceedings of the
2024 Joint International Conference on Computational Linguistics, Language Resources and
Evaluation (LREC-COLING 2024) , Torino, Italia, May 2024. ELRA and ICCL. URL
https://aclanthology.org/2024.lrec-main.100/ .
DeepSeek-AI, Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi
Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun
Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying
He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y. K. Li, Wenfeng
Liang, Fangyun Lin, A. X. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu
Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui
Qu, Tongzheng Ren, Zehui Ren, Chong Ruan, Zhangli Sha, Zhihong Shao, Junxiao Song,
Xuecheng Su, Jingxiang Sun, Yaofeng Sun, Minghui Tang, Bingxuan Wang, Peiyi Wang,
Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Y. Wu, Xin Xie, Zhenda Xie, Ziwei Xie,
Yiliang Xiong, Hanwei Xu, R. X. Xu, Yanhong Xu, Dejian Yang, Yuxiang You, Shuiping Yu,
Xingkai Yu, B. Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang,
Minghua Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan
Zhou, Shunfeng Zhou, Qihao Zhu, and Yuheng Zou. Deepseek llm: Scaling open-source
language models with longtermism, 2024. URL https://arxiv.org/abs/2401.02954 .
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of
deep bidirectional transformers for language understanding. In Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and Short Papers) , Minneapolis, Minnesota,
June 2019. Association for Computational Linguistics. URL
https://aclanthology.org/N19-1423/ .
Jonathan Drechsel, Anja Reusch, and Steffen Herbold. MAMUT: A novel framework for
modifying mathematical formulas for the generation of specialized datasets for language
model training, 2025. URL https://arxiv.org/abs/2502.20855 .
Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher,
and Dragomir Radev. SummEval: Re-evaluating summarization evaluation. Transactions
of the Association for Computational Linguistics, 9, 2021. URL
https://aclanthology.org/2021.tacl-1.24/ .
Markus Freitag, Nitika Mathur, Chi-kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian
Thompson, Tom Kocmi, Frederic Blain, Daniel Deutsch, Craig Stewart, Chrysoula Zerva,
Sheila Castilho, Alon Lavie, and George Foster. Results of wmt23 metrics shared task:
Metrics might be guilty but references are not innocent. In Proceedings of the Eighth Confer-
ence on Machine Translation , Singapore, December 2023. Association for Computational
Linguistics. URL https://aclanthology.org/2023.wmt-1.51 .
Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. How to train long-context
language models (effectively). arXiv preprint arXiv:2410.02660, 2024. URL
https://arxiv.org/abs/2410.02660 .
Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, and Alexis Conneau. Larger-scale
transformers for multilingual masked language modeling. In Proceedings of the 6th Work-
shop on Representation Learning for NLP (RepL4NLP-2021) , Online, August 2021. Association
for Computational Linguistics. URL https://aclanthology.org/2021.repl4nlp-1.4/ .
Pengcheng He, Jianfeng Gao, and Weizhu Chen. Debertav3: Improving deberta using
electra-style pre-training with gradient-disentangled embedding sharing. In The Eleventh
International Conference on Learning Representations, 2023. URL
https://arxiv.org/abs/2111.09543 .
Doris Hoogeveen, Karin M Verspoor, and Timothy Baldwin. CQADupStack: A benchmark
data set for community question-answering research. In Proceedings of the 20th
Australasian Document Computing Symposium, 2015. URL
https://dl.acm.org/doi/abs/10.1145/2838931.2838934 .
Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven
Shimizu, Shivam Sahni, Haowen Ning, and Yanning Chen. Liger kernel: Efficient triton
kernels for llm training, 2024. URL https://arxiv.org/abs/2410.10989 .
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt.
Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint
arXiv:1909.09436 , 2019. URL https://arxiv.org/abs/1909.09436 .
Mingi Jeon, Seung-yeop Baik, Joonghyuk Hahn, Yo-Sub Han, and Sang-Ki Ko. Deep
learning-based source code complexity prediction. 2023. URL
https://openreview.net/forum?id=9irBKvxsw9 .
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov,
Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question an-
swering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP) , Online, November 2020. Association for Computational Linguistics.
URL https://aclanthology.org/2020.emnlp-main.550/ .
Phillip Keung, Yichao Lu, György Szarvas, and Noah A. Smith. The multilingual Amazon
reviews corpus. In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP) , Online, November 2020. Association for Computational
Linguistics. URL https://aclanthology.org/2020.emnlp-main.369/ .
Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark
Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca
Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki
Nakazawa, Michal Novák, Martin Popel, and Maja Popović. Findings of the 2022
conference on machine translation (WMT22). In Proceedings of the Seventh Conference on Machine
Translation (WMT) , Abu Dhabi, United Arab Emirates (Hybrid), December 2022. Associa-
tion for Computational Linguistics. URL https://aclanthology.org/2022.wmt-1.1/ .
Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun
Shou, Daxin Jiang, Guihong Cao, Xiaodong Fan, Ruofei Zhang, Rahul Agrawal, Edward
Cui, Sining Wei, Taroon Bharti, Ying Qiao, Jiun-Hung Chen, Winnie Wu, Shuguang
Liu, Fan Yang, Daniel Campos, Rangan Majumder, and Ming Zhou. XGLUE: A new
benchmark dataset for cross-lingual pre-training, understanding and generation. In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP) , Online, November 2020. Association for Computational Linguistics. URL
https://aclanthology.org/2020.emnlp-main.484/ .
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert
pretraining approach, 2019. URL https://arxiv.org/abs/1907.11692 .
AI @ Meta Llama Team. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/
2407.21783 .
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International
Conference on Learning Representations , 2019.
Xiaofei Ma, Peng Xu, Zhiguo Wang, Ramesh Nallapati, and Bing Xiang. Domain adap-
tation with BERT-based domain classification and data selection. In Proceedings of
the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019) ,
Hong Kong, China, November 2019. Association for Computational Linguistics. URL
https://aclanthology.org/D19-6109/ .
Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo
Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz
Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, and
André F. T. Martins. EuroLLM: Multilingual language models for Europe, 2024. URL
https://arxiv.org/abs/2409.16235 .
Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck
Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. CulturaX: A cleaned, enormous,
and multilingual dataset for large language models in 167 languages. In Proceedings
of the 2024 Joint International Conference on Computational Linguistics, Language Resources
and Evaluation (LREC-COLING 2024) , Torino, Italia, May 2024. ELRA and ICCL. URL
https://aclanthology.org/2024.lrec-main.377/ .
Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita
Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk,
Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark,
Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh,
Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison,
Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam
Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer,
Ali Farhadi, Noah A. Smith, and Hannaneh Hajishirzi. 2 olmo 2 furious, 2025. URL
https://arxiv.org/abs/2501.00656 .
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive
predictive coding. arXiv preprint arXiv:1807.03748, 2018. URL
https://arxiv.org/abs/1807.03748 .
Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret
Mitchell, Colin A Raffel, Leandro Von Werra, and Thomas Wolf. The
FineWeb datasets: Decanting the web for the finest text data at scale. In Advances
in Neural Information Processing Systems, volume 37. Curran Associates, Inc., 2024. URL
https://proceedings.neurips.cc/paper_files/paper/2024/file/370df50ccfdf8bde18f8f9c2d9151bda-Paper-Datasets_and_Benchmarks_Track.pdf .
Machel Reid and Mikel Artetxe. PARADISE: Exploiting parallel data for multilingual
sequence-to-sequence pretraining. In Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies , Seattle,
United States, July 2022. Association for Computational Linguistics. URL
https://aclanthology.org/2022.naacl-main.58/ .
Machel Reid and Mikel Artetxe. On the role of parallel data in cross-lingual transfer learning.
In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada,
July 2023. Association for Computational Linguistics. URL
https://aclanthology.org/2023.findings-acl.372/ .
Noam Shazeer. GLU variants improve transformer, 2020. URL
https://arxiv.org/abs/2002.05202 .
Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya
Prasad, Adriana Meza Soria, David D. Cox, and Rameswar Panda. Power scheduler:
A batch size and token number agnostic learning rate scheduler, 2024. URL
https://arxiv.org/abs/2408.13359 .
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu.
RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568,
2024. ISSN 0925-2312. URL
https://www.sciencedirect.com/science/article/pii/S0925231223011864 .
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas
Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude
Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman
Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas,
Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura,
Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning
Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew
Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva,
Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor,
Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang,
Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic,
Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat
models, 2023. URL https://arxiv.org/abs/2307.09288 .
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural
Information Processing Systems, 2017. URL
https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf .
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Ma-
jumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training,
2024a. URL https://arxiv.org/abs/2212.03533 .
Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and
Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human
annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers) , Bangkok, Thailand, August 2024b. Association for
Computational Linguistics. URL https://aclanthology.org/2024.acl-long.510/ .
Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said
Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper,
Griffin Adams, Jeremy Howard, and Iacopo Poli. Smarter, better, faster, longer: A modern
bidirectional encoder for fast, memory efficient, and long context finetuning and inference,
2024. URL https://arxiv.org/abs/2412.13663 .
Alexander Wettig, Tianyu Gao, Zexuan Zhong, and Danqi Chen. Should you mask 15% in
masked language modeling? In Proceedings of the 17th Conference of the European Chapter of
the Association for Computational Linguistics , Dubrovnik, Croatia, May 2023. Association
for Computational Linguistics. URL https://aclanthology.org/2023.eacl-main.217/ .
Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus
for sentence understanding through inference. In Proceedings of the 2018 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers) , New Orleans, Louisiana, June 2018. Association for
Computational Linguistics. URL https://aclanthology.org/N18-1101/ .
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li,
Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong
Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin
Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin
Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui
Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao
Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu
Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang,
Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, and Zhihao
Fan. Qwen2 technical report, 2024. URL https://arxiv.org/abs/2407.10671 .
Dongjie Yang, Zhuosheng Zhang, and Hai Zhao. Learning better masking for better lan-
guage model pre-training. In Proceedings of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers) , Toronto, Canada, July 2023. Association
for Computational Linguistics. URL https://aclanthology.org/2023.acl-long.400/ .
Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. PAWS-X: A cross-lingual adversar-
ial dataset for paraphrase identification. In Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP) , Hong Kong, China, November 2019. Association
for Computational Linguistics. URL https://aclanthology.org/D19-1382/ .
Biao Zhang and Rico Sennrich. Root mean square layer normalization. In Advances
in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL
https://proceedings.neurips.cc/paper_files/paper/2019/file/1e8a19426224ca89e83cef47f1e7f53b-Paper.pdf .
Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin,
Baosong Yang, Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, and Min Zhang. mGTE:
Generalized long-context text representation and reranking models for multilingual text
retrieval. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language
Processing: Industry Track , Miami, Florida, US, November 2024. Association for Computa-
tional Linguistics. URL https://aclanthology.org/2024.emnlp-industry.103/ .
Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-
Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. MIRACL: A
multilingual retrieval dataset covering 18 diverse languages. Transactions of the Association
for Computational Linguistics , 11, 2023. URL https://aclanthology.org/2023.tacl-1.63/ .
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less
Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: Experiences on
scaling fully sharded data parallel. Proceedings of the VLDB Endowment , 16(12), 2023. URL
https://dl.acm.org/doi/abs/10.14778/3611540.3611569 .
Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. Devign: Effective
vulnerability identification by learning comprehensive program semantics via graph
neural networks. In Advances in Neural Information Processing Systems , volume 32. Curran
Associates, Inc., 2019. URL
https://proceedings.neurips.cc/paper_files/paper/2019/file/49265d2447bc3bbfe9e76306ce40a31f-Paper.pdf .
A EuroBERT Model Architecture
EuroBERT consists of three variants: EuroBERT-210m with 210 million parameters,
EuroBERT-610m with 610 million parameters and EuroBERT-2.1B with 2.1 billion param-
eters. These variants strike a balance between traditional encoder model sizes and the
benefits of parameter scaling. We report the architectural details of each model in Table 3.
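| Model Size | 210m | 610m | 2.1B |
|---|---|---|---|
| Layers | 12 | 26 | 32 |
| Embedding Dimension | 768 | 1,152 | 2,304 |
| FFN Dimension | 3,072 | 4,096 | 6,144 |
| Attention Heads | 12 | 18 | 18 |
| Key/Value Heads | 12 | 6 | 6 |
| Layer Normalization | RMSNorm | RMSNorm | RMSNorm |
| RMSNorm $\epsilon$ | $1 \times 10^{-5}$ | $1 \times 10^{-5}$ | $1 \times 10^{-5}$ |
| Activation Function | SwiGLU | SwiGLU | SwiGLU |
| Vocabulary Size | 128,000 | 128,000 | 128,000 |
| Positional Embeddings | RoPE | RoPE | RoPE |
| RoPE $\theta$ | 250,000 | 250,000 | 250,000 |
| Tokenizer | LLaMA 3 | LLaMA 3 | LLaMA 3 |

Table 3: Summary of architectural hyperparameters for EuroBERT models of different
sizes.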
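
As a rough sanity check on Table 3, the sketch below derives the per-head dimension, the
number of query heads sharing each key/value head, and an approximate parameter count
for each size. It is not the authors' code: it assumes tied input/output embeddings,
bias-free linear layers, a three-matrix SwiGLU feed-forward block, and two RMSNorm layers
per block plus a final one. The `EncoderConfig` class and its members are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Architectural values copied from Table 3 (hypothetical container)."""
    layers: int
    hidden: int
    ffn: int
    heads: int
    kv_heads: int
    vocab: int = 128_000

    @property
    def head_dim(self) -> int:
        # Dimension of each attention head.
        return self.hidden // self.heads

    @property
    def queries_per_kv(self) -> int:
        # Number of query heads that share one key/value head.
        return self.heads // self.kv_heads

    def approx_params(self) -> int:
        # Assumptions: no biases, tied embeddings, SwiGLU with three matrices.
        kv_dim = self.kv_heads * self.head_dim
        attention = 2 * self.hidden * self.hidden + 2 * self.hidden * kv_dim
        feed_forward = 3 * self.hidden * self.ffn
        norms = 2 * self.hidden
        per_layer = attention + feed_forward + norms
        embeddings = self.vocab * self.hidden
        return embeddings + self.layers * per_layer + self.hidden


CONFIGS = {
    "210m": EncoderConfig(layers=12, hidden=768, ffn=3072, heads=12, kv_heads=12),
    "610m": EncoderConfig(layers=26, hidden=1152, ffn=4096, heads=18, kv_heads=6),
    "2.1B": EncoderConfig(layers=32, hidden=2304, ffn=6144, heads=18, kv_heads=6),
}

for name, cfg in CONFIGS.items():
    millions = cfg.approx_params() / 1e6
    print(name, cfg.head_dim, cfg.queries_per_kv, f"{millions:.0f}M")
```

Under these assumptions the estimates land within about 1% of the nominal model sizes,
which suggests the table values are internally consistent.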
B Training Framework
We trained the EuroBERT family on Adastra, using 192 MI250X GPUs for EuroBERT-210m
over 15k hours, 384 MI250X GPUs for EuroBERT-610m over 92k hours, and 96 MI300A
GPUs for EuroBERT-2.1B over 106k hours. The training recipe for the EuroBERT models
consists of two main stages, applied uniformly to all model sizes: (1) the pre-training
phase, and (2) the annealing phase, which includes context extension.
Training Hyperparameters. We initialized the linear and embedding layers with values
drawn from a normal distribution with a mean of 0 and a standard deviation of 0.2. For
stability, we increased the epsilon value of AdamW (Loshchilov & Hutter, 2019) from its
default to $1 \times 10^{-5}$ and set $\beta_1 = 0.9$, $\beta_2 = 0.95$, with a weight decay
of 0.1. We employed the Warmup-Stable-Decay (WSD) scheduler (Shen et al., 2024), with a
warmup phase of 2,000 steps followed by a constant learning rate (LR) of $1 \times 10^{-4}$
throughout pre-training. To obtain a similar effective batch size of about $9 \times 10^{6}$
tokens across EuroBERT models, we applied gradient accumulation with 5 accumulation
steps for the 2.1 billion parameter model. Detailed hyperparameter choices are reported in
Table 4. We found this training recipe highly stable, with no loss spikes and no need for
intervention to address training divergence (Figure 6).
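
The schedule described above can be sketched as a small function. This is an illustration
rather than the training code: the linear warmup shape is an assumption, the cosine decay
to zero follows the annealing row of Table 4, and `stable_steps` and `decay_steps` are
hypothetical arguments because the text does not give the phase lengths in steps.

```python
import math

PEAK_LR = 1e-4        # constant LR during the stable phase (from the text)
WARMUP_STEPS = 2_000  # warmup length (from the text)

# AdamW settings reported in the text, gathered as keyword arguments.
ADAMW_KWARGS = {
    "lr": PEAK_LR,
    "betas": (0.9, 0.95),
    "eps": 1e-5,
    "weight_decay": 0.1,
}


def wsd_lr(step: int, stable_steps: int, decay_steps: int) -> float:
    """Learning rate at `step` for a warmup-stable-decay schedule.

    Assumptions: linear warmup, cosine decay to zero, decay_steps > 0.
    """
    if step < WARMUP_STEPS:
        # Warmup: ramp linearly from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    if step < WARMUP_STEPS + stable_steps:
        # Stable phase: hold the peak learning rate.
        return PEAK_LR
    # Decay phase: cosine annealing from the peak down to zero.
    progress = min(1.0, (step - WARMUP_STEPS - stable_steps) / decay_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


# Hypothetical phase lengths, for illustration only.
for step in (0, 1_000, 2_000, 50_000, 107_000, 112_000):
    print(step, wsd_lr(step, stable_steps=100_000, decay_steps=10_000))
```

The entries of `ADAMW_KWARGS` match the keyword names of `torch.optim.AdamW`, so the
dictionary could be passed to it directly.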
Infrastructure, scaling, and efficiency. Training large language models (LLMs) at scale is
a resource-intensive process that demands specialized hardware and an optimized
codebase to manage computational resources effectively. We trained the EuroBERT family on
the Adastra French supercomputer cluster, leveraging AMD GPUs: 192 MI250 GPUs for
EuroBERT-210m, 384 MI250 GPUs for EuroBERT-610m, and 96 MI300A GPUs for
EuroBERT-2.1B. However, most open-source pre-training frameworks are designed for NVIDIA
hardware, presenting significant compatibility challenges. To overcome this, we developed a
custom codebase tailored to training our models on both AMD and NVIDIA GPUs, available at
https://github.com/Nicolas-BZRD/EuroBERT .
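| Parameter | 210m | 610m | 2.1B |
|---|---|---|---|
| Pre-training | | | |
| LR | 1e-4 | 1e-4 | 1e-4 |
| LR Scheduler | WSD | WSD | WSD |
| Warmup Steps | 2,000 | 2,000 | 2,000 |
| Context Length | 2,048 | 2,048 | 2,048 |
| Annealing | | | |
| LR | 1e-4 to 0 | 1e-4 to 0 | 1e-4 to 0 |
| LR Scheduler | Cosine | Cosine | Cosine |
| Context Length | 8,192 | 8,192 | 8,192 |
| Optimizer | | | |
| Optimizer | AdamW | AdamW | AdamW |
| Beta1 | 0.9 | 0.9 | 0.9 |
| Beta2 | 0.95 | 0.95 | 0.95 |
| Epsilon (eps) | 1e-5 | 1e-5 | 1e-5 |
| Weight Decay | 0.1 | 0.1 | 0.1 |
| Clip Grad Norm | 1.0 | 1.0 | 1.0 |
| Training Setup | | | |
| Per-GPU Batch Size | 24 | 12 | 10 |
| Gradient Accumulation Steps | 1 | 1 | 5 |
| GPUs | 192 | 384 | 96 |
| Tokens/Step | 9,437,184 | 9,437,184 | 9,830,400 |

Table 4: Training hyperparameters for EuroBERT models (210m, 610m, 2.1B). The optimizer
and Tokens/Step remain consistent across both pre-training and annealing phases.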
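
The Tokens/Step row of Table 4 can be reproduced from the other rows. The sketch below
assumes that tokens per step equal per-GPU batch size × gradient accumulation steps ×
number of GPUs × context length, using the 2,048-token pre-training context; the function
and variable names are illustrative.

```python
CONTEXT_LENGTH = 2_048  # pre-training context length from Table 4

# (per-GPU batch size, gradient accumulation steps, number of GPUs) from Table 4
SETUPS = {
    "210m": (24, 1, 192),
    "610m": (12, 1, 384),
    "2.1B": (10, 5, 96),
}


def tokens_per_step(per_gpu_batch: int, grad_accum: int, gpus: int,
                    context: int = CONTEXT_LENGTH) -> int:
    """Tokens consumed by one optimizer step across all GPUs."""
    return per_gpu_batch * grad_accum * gpus * context


for name, setup in SETUPS.items():
    print(name, f"{tokens_per_step(*setup):,}")
# 210m 9,437,184
# 610m 9,437,184
# 2.1B 9,830,400
```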
Figure 6: Pre-training Loss for all EuroBERT models on a logarithmic scale.
Built on PyTorch, this codebase includes several optimizations to increase training
throughput. Specifically, we highlight FlashAttention (Dao, 2023), fused cross-entropy from
LigerKernel (Hsu et al., 2024), torch.compile (Ansel et al., 2024), and hybrid sharding with
Fully Sharded Data Parallel (FSDP) (Zhao et al., 2023), which shards the model, gradients,
and optimizer states within each node while replicating them across nodes. We achieved a
training throughput of 1.2M tokens/s on 96 MI300A GPUs.
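The hybrid sharding layout can be pictured as two families of process groups: one that
shards state inside a node and one that replicates it across nodes. The sketch below is a
pure-Python illustration, not the authors' code. It assumes ranks are numbered contiguously
within each node, and `gpus_per_node` is a hypothetical parameter whose value is not given
in the text. PyTorch FSDP exposes a strategy of this kind as `ShardingStrategy.HYBRID_SHARD`.

```python
def hybrid_shard_groups(world_size, gpus_per_node):
    """Return (shard_groups, replica_groups) for a hybrid sharding layout.

    shard_groups:   one group per node; parameters, gradients, and optimizer
                    states are split across the ranks of a group.
    replica_groups: one group per local GPU index; the ranks of a group hold
                    the same shard and synchronize it across nodes.
    Assumes ranks are assigned contiguously per node (illustrative choice).
    """
    if world_size % gpus_per_node != 0:
        raise ValueError("world_size must be a multiple of gpus_per_node")
    num_nodes = world_size // gpus_per_node
    shard_groups = [
        list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        for node in range(num_nodes)
    ]
    replica_groups = [
        list(range(local, world_size, gpus_per_node))
        for local in range(gpus_per_node)
    ]
    return shard_groups, replica_groups
```

With 8 ranks and 4 GPUs per node, ranks 0 to 3 form one shard group and ranks 0 and 4 hold
replicas of the same shard. Keeping the sharding traffic inside a node is the usual
motivation for this layout, since intra-node links are typically faster than the
inter-node network.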
C Data Mix
Source Subset Tokens (M) Mix (%) Source Subset Tokens (M) Mix (%)
FineWeb English 2,002,327 41.34 The-Stack v2 JavaScript 58,440 1.21
CulturaX French 295,113 6.09 The-Stack v2 PHP 25,620 0.53
CulturaX German 291,514 6.02 The-Stack v2 C# 24,842 0.51
CulturaX Spanish 290,489 6.00 The-Stack v2 Python 21,521 0.44
CulturaX Chinese 238,467 4.92 The-Stack v2 Java 20,950 0.43
CulturaX Italian 120,128 2.48 The-Stack v2 Go 14,766 0.30
CulturaX Russian 116,797 2.41 The-Stack v2 TypeScript 11,307 0.23
CulturaX Portuguese 112,321 2.32 The-Stack v2 HTML 7,962 0.16
CulturaX Japanese 112,242 2.32 The-Stack v2 Lua 7,733 0.16
CulturaX Polish 111,659 2.31 The-Stack v2 Ruby 5,524 0.11
CulturaX Turkish 53,126 1.10 The-Stack v2 Vue 5,411 0.11
CulturaX Arabic 52,413 1.08 The-Stack v2 R 5,287 0.11
CulturaX Vietnamese 50,661 1.05 The-Stack v2 Shell 4,793 0.10
CulturaX Dutch 50,646 1.05 The-Stack v2 Swift 3,766 0.08
CulturaX Hindi 25,544 0.53 The-Stack v2 reStructuredText 3,761 0.08
Unbabel parallel es↔en 50,613 1.05 The-Stack v2 JSON 3,586 0.07
Unbabel parallel fr↔en 44,891 0.93 The-Stack v2 Rust 3,152 0.07
Unbabel parallel de↔en 30,541 0.63 The-Stack v2 YAML 2,716 0.06
Unbabel parallel it↔en 18,702 0.39 The-Stack v2 Dart 2,678 0.06
Unbabel parallel ru↔en 13,808 0.29 The-Stack v2 RMarkdown 2,058 0.04
Unbabel parallel nl↔en 12,666 0.26 The-Stack v2 HCL 1,423 0.03
Unbabel parallel pl↔en 7,280 0.15 The-Stack v2 PowerShell 1,027 0.02
Unbabel parallel ar↔en 6,414 0.13 The-Stack v2 VBA 1,027 0.02
Unbabel parallel zh↔en 6,206 0.13 The-Stack v2 AsciiDoc 970 0.02
Unbabel parallel cs↔en 5,458 0.11 The-Stack v2 Groovy 540 0.01
Unbabel parallel hu↔en 4,599 0.09 The-Stack v2 CUDA 406 0.01
Unbabel parallel vi↔en 3,395 0.07 The-Stack v2 Dockerfile 281 0.01
Unbabel parallel tr↔en 2,975 0.06 The-Stack v2 Cython 103 0.01
Unbabel parallel ja↔en 2,687 0.06 The-Stack v2 COBOL 96 0.01
Unbabel parallel hi↔en 1,136 0.02 The-Stack v2 GraphQL 83 0.01
Proof-pile-2 ArXiv 121,503 2.51 The-Stack v2 HTTP 82 0.01
Proof-pile-2 Open-Web-Math 54,168 1.12 The-Stack v2 ABAP 71 0.01
Proof-pile-2 Algebraic-stack 35,985 0.74 The-Stack v2 RDoc 16 0.01
The-Stack v2 C++ 120,085 2.48 The-Stack v2 Metal 8 0.01
The-Stack v2 SQL 75,348 1.56 The-Stack v2 AppleScript 7 0.01
The-Stack v2 C 59,404 1.23 Total 4,843,357 100
Table 5: MLM pre-training data, with a total of 4.8 trillion tokens (according to EuroBERT’s
tokenizer). We report the list of all dataset names and subsets used, including the number
of tokens selected and their proportion in the final data mix.
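The Mix (%) column of Table 5 appears to be each subset's token count divided by the
reported total. The sketch below illustrates that reading. The function name is
illustrative, and the two-decimal rounding is an assumption.

```python
TOTAL_TOKENS_M = 4_843_357  # "Total" row of Table 5, in millions of tokens


def mix_percent(tokens_m, total_m=TOTAL_TOKENS_M):
    """Share of the data mix, in percent, for a subset of `tokens_m` million tokens."""
    return round(100 * tokens_m / total_m, 2)


if __name__ == "__main__":
    print("FineWeb English:", mix_percent(2_002_327))
    print("CulturaX French:", mix_percent(295_113))
```

The smallest subsets are listed at 0.01 in the table although their exact share is lower,
which suggests a display floor. This sketch does not model that.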
Mix en fr de nl hi it ja pl pt ru es ar zh tr Code Math Parallel IFT
Reference 46.3 5.8 5.7 1.0 0.3 1.5 0.8 1.0 1.4 1.0 5.7 0.4 4.7 1.0 8.7 8.2 5.2 1.2
English 26% 26.0 6.0 6.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 6.0 4.0 6.0 4.0 4.0 4.0 5.0 1.0
English 17% 17.0 6.0 6.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 6.0 5.0 6.0 5.0 4.0 4.0 5.0 1.0
Math 4% 46.3 5.8 5.7 1.0 0.3 1.5 0.8 1.0 1.4 1.0 5.7 0.4 4.7 1.0 8.7 4.0 5.2 1.2
Math 2% 46.3 5.8 5.7 1.0 0.3 1.5 0.8 1.0 1.4 1.0 5.7 0.4 4.7 1.0 8.7 2.0 5.2 1.2
Code 8% 46.3 5.8 5.7 1.0 0.3 1.5 0.8 1.0 1.4 1.0 5.7 0.4 4.7 1.0 6.0 8.2 5.2 1.2
Code 4% 46.3 5.8 5.7 1.0 0.3 1.5 0.8 1.0 1.4 1.0 5.7 0.4 4.7 1.0 4.0 8.2 5.2 1.2
Code 2% 46.3 5.8 5.7 1.0 0.3 1.5 0.8 1.0 1.4 1.0 5.7 0.4 4.7 1.0 2.0 8.2 5.2 1.2
Parallel 8% 46.3 5.8 5.7 1.0 0.3 1.5 0.8 1.0 1.4 1.0 5.7 0.4 4.7 1.0 8.7 8.2 8.0 1.2
IFT 0% 46.3 5.8 5.7 1.0 0.3 1.5 0.8 1.0 1.4 1.0 5.7 0.4 4.7 1.0 8.7 8.2 5.2 0.0
Table 6: Data mix employed in the ablation study measuring the importance of different
data subsets in the EuroBERT annealing phase.
We curate our multilingual corpus for MLM pre-training using a variety of freely available
and cleaned datasets (Table 5). This data mix predominantly consists of web-based data,
with the FineWeb dataset (Penedo et al., 2024) serving as the primary English corpus.
We used the CulturaX dataset (Nguyen et al., 2024) for multilingual text. We selected 14
languages (French, German, Spanish, Chinese, Italian, Russian, Polish, Portuguese, Japanese,
Vietnamese, Dutch, Arabic, Turkish, and Hindi) to create a corpus of European and widely
spoken global languages, representing a broad range of alphabets and cultures.
In addition, we incorporated parallel data, which has been shown to improve cross-lingual
transfer learning (Reid & Artetxe, 2022; 2023). Specifically, we concatenated, in random order,
pairs of English sentences and their translations into another language. Each pair was
separated by a <|parallel_sep|> token and masked uniformly, which creates asymmetric
masking between the two sentences. This asymmetry encourages the model to use its
bidirectional attention mechanism to recover masked tokens, leveraging both the original
sentence and its translation.
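The construction of a parallel training example can be sketched as follows. This is a
minimal illustration under stated assumptions, not the authors' implementation: tokens are
plain strings, the mask token string and the masking probability are hypothetical, and the
separator is assumed never to be masked.

```python
import random

PARALLEL_SEP = "<|parallel_sep|>"  # separator token named in the text
MASK = "<|mask|>"                  # hypothetical mask token string


def build_parallel_example(en_tokens, xx_tokens, mask_prob, rng):
    """Concatenate a sentence pair in random order and mask tokens uniformly.

    Returns (tokens, labels): labels[i] holds the original token where
    tokens[i] was masked, and None elsewhere.
    """
    pair = [list(en_tokens), list(xx_tokens)]
    rng.shuffle(pair)  # random order of the two sentences
    tokens = pair[0] + [PARALLEL_SEP] + pair[1]
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Assumption: the separator itself is never masked.
        if token != PARALLEL_SEP and rng.random() < mask_prob:
            labels[i] = token
            tokens[i] = MASK
    return tokens, labels
```

Because every position is masked independently with the same probability, a token masked
in one sentence often has its counterpart visible in the other, which is the asymmetry the
paragraph above describes.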
Finally, we enriched the dataset with code in 38 programming languages from The Stack v2
and with scientific data from Proof-Pile-2 (upsampled 5×). A detailed summary of the
dataset composition is available in Table 5.
D Evaluation
D.1 Dataset Details
This appendix offers additional details on the datasets used for evaluation. Table 7 presents
the language coverage of all evaluation datasets, and below are additional specifications on
the evaluated tasks.
European Languages Extra-European Languages Code Math
Task en de es fr it nl pl pt ar hi ja ru tr vi zh
Sequence Classification
XNLI ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
PAWS-X ✓ ✓ ✓ ✓
QAM ✓ ✓ ✓
AmazonReviews ✓ ✓ ✓ ✓ ✓ ✓
MassiveIntent ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
CodeDefect ✓ ✓
CodeComplexity ✓ ✓
MathShepherd ✓ ✓
Sequence Regression
WMT ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
SeaHorse ✓ ✓ ✓ ✓ ✓
SummEval ✓ ✓ ✓ ✓ ✓
Information Retrieval
MIRACL ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
MLDR ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Wikipedia ✓ ✓ ✓ ✓ ✓ ✓
CC-News ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
CodeSearchNet ✓ ✓
DupStackMath ✓ ✓
MathFormula ✓ ✓
Table 7: Language coverage across evaluation datasets.
Classification datasets:
•XNLI (Conneau et al., 2018) — Cross-lingual natural language inference task extending
MNLI (Williams et al., 2018), which consists of classifying sentence pairs as entailment,
contradiction, or neutral. In our experiments, we train multilingually rather than
cross-lingually to assess models’ capacity for multilingual fine-tuning.
•PAWS-X (Yang et al., 2019) — Paraphrase identification task aimed at determining
whether two sentences convey the same meaning. Fine-tuning is performed cross-
lingually, with training on the English subset and evaluation across all available lan-
guages.
•QAM (Liang et al., 2020) — NLU task aimed at verifying whether a question-passage
pair forms a valid question-answer pair. Fine-tuning is performed cross-lingually.
•AmazonReviews (Keung et al., 2020) — A sentiment analysis task consisting in esti-
mating the satisfaction level of multilingual Amazon product reviews on a 1-to-5 scale.
Fine-tuning is performed on all available languages.
•MassiveIntent (Keung et al., 2020) — A multilingual classification task consisting in
assigning sentences to one of 60 topic categories. Fine-tuning is performed on all
available languages.
•CodeDefect (Zhou et al., 2019) — A binary classification task aimed at identifying
whether a given code snippet contains a defect.
•CodeComplexity (Jeon et al., 2023) — Computational analysis task consisting in estimat-
ing the order of complexity of a code-formulated computer science problem.
•MathShepherd (Wang et al., 2024b) — Binary classification task aimed at determining
whether a step-by-step math rationale is correct given a problem prompt. Since longer
rationales are more error-prone, we filter the dataset to retain only 3-step rationales to
prevent models from overfitting to answer length. As the dataset lacks a validation split,
we allocate half of the test set for validation.
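The MathShepherd preprocessing described above (keeping 3-step rationales and reserving
half of the test set for validation) can be sketched as follows. The `steps` field, the
random shuffling, and the seed are assumptions for illustration. The text does not specify
how steps are delimited or how the two halves are chosen.

```python
import random


def keep_three_step(examples, n_steps=3):
    """Keep only examples whose rationale has exactly `n_steps` steps.

    Each example is assumed to be a dict with a hypothetical "steps" list.
    """
    return [example for example in examples if len(example["steps"]) == n_steps]


def split_test_in_half(test_examples, seed=0):
    """Shuffle the test set and return (validation, test) halves."""
    shuffled = list(test_examples)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]
```

Fixing the rationale length removes length as a signal, so a classifier has to rely on the
content of the steps rather than on how long the answer is.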
Regression datasets:
•WMT (Bojar et al., 2017; 2018; Barrault et al., 2019; 2020; Akhbardeh et al., 2021; Kocmi
et al., 2022) — Regression task consisting in predicting translation quality given a source
sentence. As the original test set covers only three language pairs, we construct validation
and test sets by sampling 5% of the training set for each, ensuring broader language
coverage in evaluation.
•SeaHorse (Clark et al., 2023) — Multilingual summarization evaluation task, where each
text-summary pair is annotated across 6 binary evaluation dimensions. The final score is
obtained by averaging these labels, yielding a continuous value between 0 and 1.
•SummEval (Fabbri et al., 2021) — Initially English-only summarization evaluation
task, later extended to other languages, including German12, French13, Spanish14, and
Turkish15. Each summary is assessed across 4 dimensions, averaged to produce a
continuous score.
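Two of the preprocessing steps in the list above are simple enough to sketch: the SeaHorse
target (mean of 6 binary labels) and the WMT validation/test sampling (5% of the training
set for each). Function names, the seed, and the choice to remove sampled items from the
training set are assumptions. The text does not say whether sampling is done per language
pair.

```python
import random


def seahorse_score(binary_labels):
    """Average 6 binary annotation labels into a score in [0, 1]."""
    if len(binary_labels) != 6 or any(label not in (0, 1) for label in binary_labels):
        raise ValueError("expected six binary labels")
    return sum(binary_labels) / 6


def sample_validation_and_test(train_examples, fraction=0.05, seed=0):
    """Return (validation, test, remaining_train), sampling `fraction` for each.

    Assumption: sampled examples are removed from the training set.
    """
    shuffled = list(train_examples)
    random.Random(seed).shuffle(shuffled)
    n = int(len(shuffled) * fraction)
    return shuffled[:n], shuffled[n:2 * n], shuffled[2 * n:]
```

A summary judged positively on four of the six dimensions would receive a score of 4/6
under this averaging.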
Retrieval datasets:
•MS-MARCO (Bajaj et al., 2016) — English-only retrieval dataset used for fine-tuning,
where each anchor-positive pair includes a mined hard negative, forming a triplet
structure.
•MIRACL (Zhang et al., 2023) — Multilingual retrieval dataset. We use the semi-
supervised version with labeled positive pairs provided by SentenceTransformers16
as the primary data source. Anchors serve as queries, and the corpus consists of all
positive documents in the dataset. Since only a single data split is available, we create
validation and test sets by partitioning 50% of the original split for each, using queries
as the split key to ensure no data leakage.
•MLDR (Chen et al., 2024) — Long-context multilingual retrieval dataset. As with
MIRACL, we use the triplet version provided by SentenceTransformers and apply the
same validation-test split strategy.
•Wikipedia17— Multilingual information retrieval dataset. Since only a single data split
is available, we split the queries evenly into validation and test sets.
•CC-News (de Gibert et al., 2024) — Highly multilingual retrieval dataset. As with
MIRACL, we use the SentenceTransformers dataset version as the primary data source
and apply the same test-validation split method.
•CodeSearchNet (Husain et al., 2019) — Code retrieval dataset with comment-code
query-positive pairs (SentenceTransformers version), processed similarly to the previous
datasets.
•DupStackMath (Hoogeveen et al., 2015) — Code retrieval dataset with queries, a corpus,
and relevant documents, processed the same way as the above datasets.
•MathFormula (Drechsel et al., 2025) — Mathematical retrieval dataset consisting of
pairs of equivalent formulas. The original dataset contains formula pairs labeled as true
12 https://huggingface.co/datasets/sproos/summeval-de
13 https://huggingface.co/datasets/sproos/summeval-fr
14 https://huggingface.co/datasets/sproos/summeval-es
15 https://huggingface.co/datasets/sproos/summeval-tr
16 https://huggingface.co/collections/sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552
17 https://huggingface.co/datasets/Samoed/WikipediaRetrievalMultilingual
or false based on their equivalence, spanning 71 well-known mathematical formulas.
To construct the retrieval dataset, we retain only the equivalent (positive) formula
pairs. Because the dataset is large, we sample 100 positive pairs per formula type for
both the validation and test sets. The final dataset is processed following the same
methodology as the other pair-based datasets.
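Several retrieval datasets above are split 50/50 into validation and test sets with queries
as the split key. A minimal sketch of such a split is given below. The (query, document)
tuple layout, the shuffling, and the seed are assumptions, and the function name is
illustrative.

```python
import random


def split_by_query(pairs, seed=0):
    """Split (query, document) pairs into validation and test sets by query.

    All pairs that share a query land in the same split, so no query is seen
    in both validation and test.
    """
    queries = sorted({query for query, _ in pairs})
    random.Random(seed).shuffle(queries)
    validation_queries = set(queries[: len(queries) // 2])
    validation = [pair for pair in pairs if pair[0] in validation_queries]
    test = [pair for pair in pairs if pair[0] not in validation_queries]
    return validation, test
```

Splitting on the query rather than on individual pairs matters when a query has several
positive documents: a pair-level split could place the same query in both sets.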
D.2 Detailed Results
This appendix presents detailed results for the multilingual evaluation tasks in our bench-
mark, including per-language scores as well as averages across European languages and all
languages.
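The Euro and World columns in the tables below are consistent with unweighted means over
the languages a dataset covers. The sketch below reproduces the XLM-RoBERTa-base MIRACL row
of Table 8 under that assumption. The language grouping follows the table headers, and the
one-decimal rounding is an assumption.

```python
from statistics import mean

# Languages listed under "European Languages" in the table headers.
EUROPEAN = ("en", "de", "es", "fr", "it", "nl", "pl", "pt")


def euro_world_averages(scores):
    """Return (Euro, World) averages for a {language: score} mapping.

    Languages without a score (shown as dashes in the tables) are simply
    absent from the mapping and do not enter either average.
    """
    euro = [value for lang, value in scores.items() if lang in EUROPEAN]
    world = list(scores.values())
    return round(mean(euro), 1), round(mean(world), 1)


# XLM-RoBERTa-base on MIRACL, copied from Table 8.
MIRACL_XLMR_BASE = {
    "en": 88.0, "es": 88.6, "fr": 91.9,
    "ar": 84.6, "hi": 77.9, "ja": 85.7, "ru": 82.5, "zh": 83.8,
}
```

Calling `euro_world_averages(MIRACL_XLMR_BASE)` gives the pair reported in that row, 89.5
and 85.4.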
MIRACL
European Languages Extra-European Languages Average
Model en de es fr it nl pl pt ar hi ja ru tr vi zh Euro World
XLM-RoBERTa-base 88.0 — 88.6 91.9 — — — — 84.6 77.9 85.7 82.5 — — 83.8 89.5 85.4
XLM-RoBERTa-large 89.7 — 91.6 93.5 — — — — 89.0 81.1 91.1 88.6 — — 90.3 91.6 89.4
XLM-RoBERTa-XL 90.4 — 93.0 94.5 — — — — 91.5 85.1 92.6 92.1 — — 91.9 92.6 91.4
mDeBERTa-v3-base 45.3 — 39.6 46.2 — — — — 34.4 36.0 33.7 29.6 — — 35.3 43.7 37.5
mGTE-MLM-base 91.4 — 94.6 95.2 — — — — 91.6 85.3 91.5 88.3 — — 91.4 93.8 91.2
ModernBERT-base — — — — — — — — — — — — — — — — —
ModernBERT-large — — — — — — — — — — — — — — — — —
EuroBERT-210m 94.1 — 95.4 95.8 — — — — 90.0 83.2 90.8 85.7 — — 90.9 95.1 90.8
EuroBERT-610m 93.6 — 95.1 96.3 — — — — 91.8 88.3 92.4 90.7 — — 92.4 95.0 92.6
EuroBERT-2.1B 94.2 — 95.0 95.3 — — — — 93.0 87.1 93.4 91.5 — — 94.1 94.8 92.9
MLDR
European Languages Extra-European Languages Average
Model en de es fr it nl pl pt ar hi ja ru tr vi zh Euro World
XLM-RoBERTa-base 59.4 56.7 58.0 64.5 53.4 — — 60.3 44.4 56.4 50.3 43.6 — — 53.8 58.7 54.6
XLM-RoBERTa-large 63.4 61.1 66.9 71.1 61.9 — — 67.1 51.5 60.3 54.9 51.5 — — 58.9 65.2 60.8
XLM-RoBERTa-XL 68.9 66.1 72.7 73.5 67.5 — — 71.0 56.5 61.8 62.6 60.8 — — 63.5 70.0 65.9
mDeBERTa-v3-base 18.8 24.3 15.4 23.9 18.2 — — 19.0 12.3 20.5 17.4 13.2 — — 18.1 20.0 18.3
mGTE-MLM-base 63.5 68.7 79.5 78.2 71.4 — — 78.1 55.7 66.2 62.4 60.8 — — 60.9 73.2 67.8
ModernBERT-base — — — — — — — — — — — — — — — — —
ModernBERT-large — — — — — — — — — — — — — — — — —
EuroBERT-210m 67.2 68.1 78.2 80.0 68.9 — — 77.9 52.1 51.3 60.8 59.1 — — 56.4 73.4 65.4
EuroBERT-610m 72.5 69.5 80.3 79.8 73.9 — — 79.0 55.5 60.9 61.6 62.5 — — 59.0 75.8 68.6
EuroBERT-2.1B 72.5 65.4 77.6 77.6 69.2 — — 75.0 53.0 58.1 61.5 59.3 — — 57.6 72.9 66.1
Wikipedia
European Languages Extra-European Languages Average
Model en de es fr it nl pl pt ar hi ja ru tr vi zh Euro World
XLM-RoBERTa-base 94.7 91.2 — — 91.5 90.2 — 90.8 — 87.4 — — — — — 91.7 91.0
XLM-RoBERTa-large 95.4 93.5 — — 93.4 93.4 — 92.5 — 90.7 — — — — — 93.6 93.1
XLM-RoBERTa-XL 97.9 96.5 — — 96.6 96.3 — 96.0 — 94.5 — — — — — 96.7 96.3
mDeBERTa-v3-base 66.1 60.4 — — 53.6 57.6 — 56.5 — 51.3 — — — — — 58.9 57.6
mGTE-MLM-base 96.7 93.9 — — 94.7 93.9 — 93.6 — 92.0 — — — — — 94.6 94.1
ModernBERT-base — — — — — — — — — — — — — — — — —
ModernBERT-large — — — — — — — — — — — — — — — — —
EuroBERT-210m 97.7 94.7 — — 94.5 95.6 — 95.7 — 88.6 — — — — — 95.6 94.4
EuroBERT-610m 98.3 95.9 — — 96.0 96.6 — 96.1 — 92.6 — — — — — 96.6 95.9
EuroBERT-2.1B 99.0 96.1 — — 96.0 96.0 — 95.9 — 92.0 — — — — — 96.6 95.8
CC-News
European Languages Extra-European Languages Average
Model en de es fr it nl pl pt ar hi ja ru tr vi zh Euro World
XLM-RoBERTa-base 69.9 57.8 55.8 57.5 57.2 66.8 55.1 63.1 72.2 31.1 75.7 76.2 47.8 61.5 77.1 60.4 61.6
XLM-RoBERTa-large 77.3 68.6 69.1 70.1 70.8 75.9 69.6 75.3 82.8 51.0 82.3 83.3 61.1 73.7 81.2 72.1 72.8
XLM-RoBERTa-XL 84.4 77.4 79.7 79.1 79.6 83.8 79.5 84.0 88.1 59.5 87.4 88.8 72.1 82.9 86.8 80.9 80.9
mDeBERTa-v3-base 25.0 15.7 11.9 12.4 13.1 20.4 12.4 15.8 23.1 4.7 33.6 23.6 10.7 20.4 34.8 15.8 18.5
mGTE-MLM-base 76.1 68.7 72.8 70.1 68.4 76.5 65.1 74.3 79.6 32.5 85.1 83.3 56.7 72.3 88.2 71.5 71.3
ModernBERT-base — — — — — — — — — — — — — — — — —
ModernBERT-large — — — — — — — — — — — — — — — — —
EuroBERT-210m 80.0 66.9 69.2 69.7 65.9 73.5 57.8 69.1 76.7 17.0 82.4 79.7 52.1 57.4 90.9 69.0 67.2
EuroBERT-610m 84.0 72.9 76.4 75.5 73.8 79.6 70.9 79.9 84.0 49.7 84.9 85.1 62.4 66.7 88.1 76.6 75.6
EuroBERT-2.1B 85.8 73.1 77.1 76.9 73.8 79.0 70.3 79.7 84.2 49.9 88.2 86.5 60.7 63.0 89.9 76.9 75.9
Table 8: Detailed results on multilingual retrieval tasks (NDCG@10, in %).
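For reference, NDCG@10 can be computed as in the sketch below. It uses the common
linear-gain form with a log2 position discount. The exact variant used for Table 8 (gain
function, tie handling) is not stated here, so this is an assumption.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(
        relevance / math.log2(rank + 1)
        for rank, relevance in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: relevance grades of the retrieved documents, in rank order.
    all_relevances:    grades of every relevant document for the query, used to
                       build the ideal ranking.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal
```

With a single relevant document per query, the score is 1 when that document is ranked
first and decays as 1/log2(rank + 1) when it appears lower in the top 10.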
XNLI
European Languages Extra-European Languages Average
Model en de es fr it nl pl pt ar hi ja ru tr vi zh Euro World
XLM-RoBERTa-base 80.0 74.8 76.4 75.1 — — — — 71.5 69.1 — 74.4 72.4 74.0 73.2 76.6 74.1
XLM-RoBERTa-large 87.1 83.0 83.5 82.9 — — — — 80.4 77.8 — 81.2 80.6 80.2 80.6 84.1 81.7
XLM-RoBERTa-XL 89.0 85.5 85.3 84.5 — — — — 81.5 80.5 — 83.1 82.4 83.3 82.3 86.1 83.7
mDeBERTa-v3-base 84.9 81.2 81.1 81.0 — — — — 77.6 76.1 — 79.1 77.7 78.1 78.4 82.0 79.5
mGTE-MLM-base 81.1 76.9 78.5 77.2 — — — — 73.6 71.3 — 75.4 72.9 75.9 75.5 78.4 75.8
ModernBERT-base — — — — — — — — — — — — — — — — —
ModernBERT-large — — — — — — — — — — — — — — — — —
EuroBERT-210m 83.5 77.8 79.4 78.9 — — — — 74.3 70.6 — 76.6 74.2 75.1 75.3 79.9 76.6
EuroBERT-610m 87.8 82.9 84.6 83.6 — — — — 79.5 76.7 — 82.0 80.3 80.8 80.7 84.7 81.9
EuroBERT-2.1B 89.6 85.5 86.4 85.8 — — — — 82.8 79.9 — 83.3 83.0 82.3 82.3 86.8 84.1
PAWS-X
European Languages Extra-European Languages Average
Model en de es fr it nl pl pt ar hi ja ru tr vi zh Euro World
XLM-RoBERTa-base 93.8 86.4 87.5 88.0 — — — — — — — — — — — 88.9 88.9
XLM-RoBERTa-large 95.5 91.0 91.4 91.8 — — — — — — — — — — — 92.4 92.4
XLM-RoBERTa-XL 95.8 91.9 91.7 92.3 — — — — — — — — — — — 92.9 92.9
mDeBERTa-v3-base 95.7 90.2 90.4 91.3 — — — — — — — — — — — 91.9 91.9
mGTE-MLM-base 94.7 87.5 88.2 88.8 — — — — — — — — — — — 89.8 89.8
ModernBERT-base — — — — — — — — — — — — — — — — —
ModernBERT-large — — — — — — — — — — — — — — — — —
EuroBERT-210m 95.6 86.5 88.7 88.9 — — — — — — — — — — — 89.9 89.9
EuroBERT-610m 95.6 90.0 91.3 92.0 — — — — — — — — — — — 92.2 92.2
EuroBERT-2.1B 96.2 91.6 91.8 92.5 — — — — — — — — — — — 93.0 93.0
QAM
European Languages Extra-European Languages Average
Model en de es fr it nl pl pt ar hi ja ru tr vi zh Euro World
XLM-RoBERTa-base 69.0 67.4 — 67.3 — — — — — — — — — — — 67.9 67.9
XLM-RoBERTa-large 73.3 74.5 — 73.3 — — — — — — — — — — — 73.7 73.7
XLM-RoBERTa-XL 75.3 76.8 — 74.1 — — — — — — — — — — — 75.4 75.4
mDeBERTa-v3-base 71.2 73.5 — 71.5 — — — — — — — — — — — 72.1 72.1
mGTE-MLM-base 69.4 68.1 — 68.4 — — — — — — — — — — — 68.6 68.6
ModernBERT-base — — — — — — — — — — — — — — — — —
ModernBERT-large — — — — — — — — — — — — — — — — —
EuroBERT-210m 71.6 69.0 — 68.2 — — — — — — — — — — — 69.6 69.6
EuroBERT-610m 73.6 73.3 — 71.7 — — — — — — — — — — — 72.9 72.9
EuroBERT-2.1B 74.7 72.8 — 72.5 — — — — — — — — — — — 73.3 73.3
AmazonReviews
European Languages Extra-European Languages Average
Model en de es fr it nl pl pt ar hi ja ru tr vi zh Euro World
XLM-RoBERTa-base 64.9 65.0 60.6 60.0 — — — — — — 59.0 — — — 56.8 62.7 61.1
XLM-RoBERTa-large 66.9 67.0 62.4 61.5 — — — — — — 62.1 — — — 58.5 64.5 63.1
XLM-RoBERTa-XL 67.1 67.8 62.4 61.5 — — — — — — 63.7 — — — 59.2 64.7 63.6
mDeBERTa-v3-base 66.4 66.1 61.6 60.6 — — — — — — 60.4 — — — 57.7 63.7 62.1
mGTE-MLM-base 65.0 65.3 60.9 59.5 — — — — — — 61.2 — — — 57.4 62.7 61.5
ModernBERT-base — — — — — — — — — — — — — — — — —
ModernBERT-large — — — — — — — — — — — — — — — — —
EuroBERT-210m 65.9 65.4 60.5 60.2 — — — — — — 60.4 — — — 57.7 63.0 61.7
EuroBERT-610m 66.7 66.4 61.6 61.2 — — — — — — 61.7 — — — 58.1 64.0 62.6
EuroBERT-2.1B 66.5 67.8 62.8 60.9 — — — — — — 62.4 — — — 59.0 64.5 63.2
MassiveIntent
European Languages Extra-European Languages Average
Model en de es fr it nl pl pt ar hi ja ru tr vi zh Euro World
XLM-RoBERTa-base 89.1 86.1 87.0 86.6 87.5 87.5 86.8 87.3 79.0 86.2 86.5 87.3 85.6 86.7 86.0 87.2 86.3
XLM-RoBERTa-large 90.3 87.7 88.0 89.0 88.5 89.1 88.8 88.8 83.5 88.1 88.9 89.0 87.8 88.9 87.1 88.8 88.2
XLM-RoBERTa-XL 89.9 87.6 88.2 88.6 88.7 88.3 87.9 88.6 81.8 88.0 88.4 89.2 87.8 88.4 86.9 88.5 87.9
mDeBERTa-v3-base 88.1 86.4 86.9 87.3 87.6 88.0 87.0 86.9 79.8 86.0 87.3 87.3 85.9 86.5 85.9 87.3 86.5
mGTE-MLM-base 89.0 86.3 87.4 87.9 87.2 87.9 86.3 87.8 80.7 86.6 87.9 87.8 86.4 87.4 86.3 87.5 86.9
ModernBERT-base — — — — — — — — — — — — — — — — —
ModernBERT-large — — — — — — — — — — — — — — — — —
EuroBERT-210m 89.0 86.0 86.9 86.9 87.0 87.1 86.8 87.9 81.2 86.9 87.4 87.2 85.8 85.0 86.0 87.2 86.5
EuroBERT-610m 89.2 86.6 87.4 87.6 88.1 88.2 87.3 87.8 82.7 87.3 88.3 88.2 86.8 86.1 87.0 87.8 87.2
EuroBERT-2.1B 88.9 87.2 88.0 88.7 87.9 88.2 88.1 88.2 83.2 87.6 89.0 88.1 87.1 85.4 87.0 88.2 87.5
Table 9: Detailed results on multilingual sequence classification tasks (accuracy, in %).
SeaHorse
European Languages Extra-European Languages Average
Model en de es fr it nl pl pt ar hi ja ru tr vi zh Euro World
XLM-RoBERTa-base 36.8 32.8 25.5 — — — — — — — — 34.5 43.8 32.0 — 31.7 34.2
XLM-RoBERTa-large 39.2 35.5 28.1 — — — — — — — — 39.3 51.0 35.0 — 34.3 38.0
XLM-RoBERTa-XL 40.5 36.1 30.5 — — — — — — — — 42.2 53.1 35.4 — 35.7 39.6
mDeBERTa-v3-base 36.4 31.1 23.1 — — — — — — — — 31.3 44.0 18.6 — 30.2 30.7
mGTE-MLM-base 50.8 55.8 60.7 — — — — — — — — 67.6 65.8 59.5 — 55.8 60.0
ModernBERT-base — — — — — — — — — — — — — — — — —
ModernBERT-large — — — — — — — — — — — — — — — — —
EuroBERT-210m 54.2 58.5 65.1 — — — — — — — — 68.2 69.3 60.2 — 59.3 62.6
EuroBERT-610m 56.1 59.6 69.7 — — — — — — — — 71.8 73.2 60.7 — 61.8 65.2
EuroBERT-2.1B 59.0 62.5 71.2 — — — — — — — — 74.4 75.3 61.9 — 64.2 67.4
SummEval
European Languages Extra-European Languages Average
Model en de es fr it nl pl pt ar hi ja ru tr vi zh Euro World
XLM-RoBERTa-base 22.7 26.9 29.9 28.1 — — — — — — — — 28.1 — — 26.9 27.1
XLM-RoBERTa-large 43.2 37.3 33.6 40.2 — — — — — — — — 38.6 — — 38.6 38.6
XLM-RoBERTa-XL 29.8 36.3 19.8 35.1 — — — — — — — — 32.6 — — 30.3 30.7
mDeBERTa-v3-base 28.8 25.5 27.0 22.7 — — — — — — — — 25.4 — — 26.0 25.9
mGTE-MLM-base 49.0 41.6 44.4 39.5 — — — — — — — — 34.7 — — 43.7 41.9
ModernBERT-base — — — — — — — — — — — — — — — — —
ModernBERT-large — — — — — — — — — — — — — — — — —
EuroBERT-210m 57.5 36.9 44.4 46.4 — — — — — — — — 30.1 — — 46.3 43.1
EuroBERT-610m 57.4 52.4 62.4 57.5 — — — — — — — — 44.8 — — 57.4 54.9
EuroBERT-2.1B 65.5 58.8 51.3 61.9 — — — — — — — — 51.6 — — 59.4 57.8
Table 10: Detailed results on multilingual summary evaluation tasks (Spearman rank
correlation, in %).
European Pairs Extra-European Pairs Average
en-xx xx-en Other en-xx xx-en
Model en-de en-pl de-en pl-en de-fr fr-de en-ja en-ru en-tr en-zh ja-en ru-en tr-en zh-en Euro World
XLM-RoBERTa-base 45.3 53.6 26.7 16.3 31.4 32.0 47.0 56.5 61.5 42.5 10.4 21.8 40.7 25.3 34.2 36.5
XLM-RoBERTa-large 50.7 66.0 30.7 15.8 41.1 29.8 52.1 61.2 66.2 47.2 11.0 24.8 45.4 29.0 39.0 40.8
XLM-RoBERTa-XL 55.1 66.9 35.9 18.7 46.7 43.3 56.3 64.9 64.8 52.2 13.1 27.0 47.7 30.6 44.4 44.5
mDeBERTa-v3-base 53.5 63.1 33.1 20.1 45.5 42.6 53.0 61.9 67.9 48.5 10.5 25.2 47.9 29.5 43.0 43.0
mGTE-MLM-base 48.6 55.2 30.4 18.5 37.5 35.9 48.3 57.2 59.7 45.5 10.6 23.4 41.5 27.3 37.7 38.5
ModernBERT-base — — — — — — — — — — — — — — — —
ModernBERT-large — — — — — — — — — — — — — — — —
EuroBERT-210m 52.9 58.4 33.2 17.5 40.6 40.3 51.1 57.9 57.3 48.3 14.3 26.7 44.3 30.8 40.5 41.0
EuroBERT-610m 52.9 61.1 32.4 18.2 42.6 39.2 51.3 59.4 62.3 48.6 12.2 26.6 44.1 29.7 41.1 41.5
EuroBERT-2.1B 49.1 57.8 29.8 19.3 38.3 38.5 47.8 56.9 56.5 45.0 10.7 23.5 41.3 27.5 38.8 38.7
Table 11: Detailed results on the WMT task (Spearman rank correlation, in %).
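Tables 10 and 11 report Spearman rank correlation. A minimal pure-Python sketch of the
metric follows: values are converted to ranks (ties receive their average rank) and the
Pearson correlation of the ranks is returned. Function names are illustrative, and the
evaluation code may instead rely on a library routine.

```python
import math


def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (norm_x * norm_y)


def spearman(predictions, targets):
    """Spearman rank correlation between predictions and targets."""
    return pearson(average_ranks(predictions), average_ranks(targets))
```

Because only ranks enter the computation, the metric rewards a model for ordering examples
correctly even if its predicted scores are on a different scale from the human
annotations.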