arxiv

Paper 2503.05500

EuroBERT: Scaling Multilingual Encoders for European Languages

Authors: Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, Gabriel Hautreux, João Alves, Kevin El-Haddad, Manuel Faysse, Maxime Peyrard, Nuno M. Guerreiro, Patrick Fernandes, Ricardo Rei, Pierre Colombo

Published: 2025-03-07

Abstract:

General-purpose multilingual vector representations, used in retrieval, regression and classification, are traditionally obtained from bidirectional encoder models. Despite their wide applicability, encoders have been recently overshadowed by advances in generative decoder-only models. However, many innovations driving this progress are not inherently tied to decoders. In this paper, we revisit the development of multilingual encoders through the lens of these advances, and introduce EuroBERT, a family of multilingual encoders covering European and widely spoken global languages. Our models outperform existing alternatives across a diverse range of tasks, spanning multilingual capabilities, mathematics, and coding, and natively supporting sequences of up to 8,192 tokens. We also examine the design decisions behind EuroBERT, offering insights into our dataset composition and training pipeline. We publicly release the EuroBERT models, including intermediate training checkpoints, together with our training framework.

Paper Content:

Authors and affiliations: Nicolas Boizard (1,3), Hippolyte Gisserot-Boukhlef (2,3), Duarte M. Alves (4,5), André Martins† (4,5,6), Ayoub Hammal† (7,8,9), Caio Corro† (8,10,11), Céline Hudelot† (3), Emmanuel Malherbe† (2), Etienne Malaboeuf† (12), Fanny Jourdan† (13), Gabriel Hautreux† (12), João Alves† (6), Kevin El-Haddad† (1,17), Manuel Faysse† (3,14), Maxime Peyrard† (8,15), Nuno M. Guerreiro† (3,4,5,6), Patrick Fernandes† (4,5,18), Ricardo Rei† (6), Pierre Colombo⋆ (3,16)

1 Diabolocom; 2 Artefact; 3 MICS, CentraleSupélec, Université Paris-Saclay; 4 Instituto Superior Técnico & Universidade de Lisboa (Lisbon ELLIS Unit); 5 Instituto de Telecomunicações; 6 Unbabel; 7 Université Paris-Saclay; 8 CNRS; 9 LISN; 10 INSA Rennes; 11 IRISA; 12 CINES; 13 IRT Saint Exupéry; 14 Illuin Technology; 15 Université Grenoble Alpes, Grenoble INP, LIG; 16 Equall; 17 ISIA Lab; 18 Carnegie Mellon University

Equal contribution, † Ordered alphabetically by the first name, ⋆ Senior advisor
Contact: firstname.lastname@centralesupelec.com, duartemalves@tecnico.ulisboa.pt
Website: https://huggingface.co/EuroBERT
Date: March 10, 2025
arXiv:2503.05500v1 [cs.CL] 7 Mar 2025

1 Introduction

Many important tasks in NLP, including information retrieval, classification, or regression, are built upon general-purpose vector representations. These representations are traditionally obtained from bidirectional encoder models, which aggregate information from the left and right contexts of each token (Devlin et al., 2019; Conneau et al., 2020; He et al., 2023). In contrast, recent advances in generative modeling have shifted the research community’s attention towards unidirectional architectures (Bai et al., 2023; Llama Team, 2024; OLMo et al., 2025). Notably, these efforts have identified several key performance drivers that span architectural advances, data improvements, and increased scale. Yet, despite no apparent barrier to transferring these insights to bidirectional architectures, little effort has been devoted towards this objective, forcing practitioners to depend on older models.

In this paper, we introduce a refreshed recipe for training general-purpose multilingual encoders, resulting in the EuroBERT family. Our models incorporate recent architectural advances from decoder models (§2.1), and are trained on a 5T-token multilingual dataset, covering European and widely spoken global languages, along with mathematics and code (§2.2). We adopt a masked language modeling objective, and employ a two-phase training pipeline, adjusting the data distribution in the second training phase to improve downstream performance (§2.3).

Figure 1: Pareto performance plots for key multilingual tasks, including retrieval (MIRACL), classification (XNLI), and regression (SeaHorse). The blue dotted line represents the Pareto frontier achieved by the EuroBERT model family.

We extensively evaluate the EuroBERT models, comparing to several similarly sized alternatives across a suite of tasks representative of real-world encoder applications (§3). Our models match or exceed the performance of alternative models, such as XLM-RoBERTa (Conneau et al., 2020) and mGTE-MLM-base (Zhang et al., 2024), on multilingual retrieval, classification and regression tasks (Figure 1), and outperform them on tasks related to code and mathematics (Figure 2).

We also examine the impact of our design choices through systematic ablations on several components of our annealing recipe (§4). We explore the choice of masking ratio, showing that while higher masking ratios benefit retrieval tasks, lower ratios improve sentence classification. Additionally, we highlight that including data for code and mathematics can improve multilingual retrieval, but degrades classification accuracy. Contrary to expectations, we also find that, when using model-based filters for data selection, mixing data from lower and higher quality thresholds can improve both retrieval and classification.

Accompanying this work, we release the EuroBERT family, comprising three models with 210m, 610m and 2.1B parameters. To facilitate future research, we also release intermediate training checkpoints, as well as our training framework.

2 EuroBERT: A Refreshed Multilingual Encoder

The EuroBERT models are available in three sizes (210m, 610m, and 2.1B parameters) and closely follow the Llama 3 architecture (Llama Team, 2024) (§2.1). They are trained on a large multilingual corpus, which includes datasets of code and mathematics (§2.2). The training pipeline has two stages, pre-training and annealing, and employs the masked language modeling (MLM) objective (§2.3).
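As an illustration of the MLM objective used in both training stages, the sketch below masks a chosen fraction of a token sequence and builds the corresponding prediction targets. The two ratios in the usage lines (50% and 10%) are the pre-training and annealing values reported in §2.3. Everything else is an assumption made for the example and is not taken from the paper: the function name mask_tokens, the mask token id of 0, uniform sampling of positions, replacing every selected token with the mask token, and the -100 label for unmasked positions (the default ignore_index of PyTorch's cross-entropy loss).

```python
import random

# Assumption: label value that a loss function would skip at unmasked positions.
IGNORE_INDEX = -100


def mask_tokens(token_ids, mask_ratio, mask_id, seed=0):
    """Mask a fraction `mask_ratio` of positions, chosen uniformly at random.

    Returns (inputs, labels): inputs carry `mask_id` at masked positions,
    labels carry the original id there and IGNORE_INDEX everywhere else.
    """
    rng = random.Random(seed)
    n_mask = round(len(token_ids) * mask_ratio)
    masked = set(rng.sample(range(len(token_ids)), n_mask))
    inputs = [mask_id if i in masked else tok for i, tok in enumerate(token_ids)]
    labels = [tok if i in masked else IGNORE_INDEX for i, tok in enumerate(token_ids)]
    return inputs, labels


tokens = list(range(100, 120))  # 20 hypothetical token ids
pretrain_inputs, pretrain_labels = mask_tokens(tokens, 0.5, mask_id=0)  # pre-training ratio
anneal_inputs, anneal_labels = mask_tokens(tokens, 0.1, mask_id=0)      # annealing ratio
```

With 20 tokens, the first call masks 10 positions and the second masks 2, so the annealing stage asks the model to predict far fewer tokens from a much more complete context.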
2.1 Architecture

The EuroBERT models are based on a standard dense transformer (Vaswani et al., 2017), with several architectural changes. Similarly to Llama 2 (Touvron et al., 2023), we remove all biases. Additionally, we incorporate grouped query attention (Ainslie et al., 2023), swish gated linear units (Shazeer, 2020), root mean square layer normalization (Zhang & Sennrich, 2019), and rotary position embeddings (Su et al., 2024).[1]

[1] We provide more architecture details in Appendix A.

Figure 2: Pareto performance plots for code-related (CodeSearchNet) and math-related (MathShepherd) tasks. The blue dotted line represents the Pareto frontier achieved by the EuroBERT model family.

2.2 Dataset

To train EuroBERT, we construct a multilingual 5T-token corpus (4.8T tokens for pre-training and 200B for annealing), which includes 15 languages: English, French, German, Spanish, Chinese, Italian, Russian, Polish, Portuguese, Japanese, Vietnamese, Dutch, Arabic, Turkish, and Hindi.[2] Following prior work on curriculum learning, we adjust the data distribution to emphasize higher-quality datasets during annealing.

Pre-training mixture. We use FineWeb (Penedo et al., 2024) for English, and CulturaX (Nguyen et al., 2024) for multilingual data. We also incorporate parallel data, which can improve cross-lingual transfer (Conneau & Lample, 2019; Reid & Artetxe, 2022; 2023), by concatenating to-English and from-English translation pairs, separated by a special <|parallel_sep|> token. Finally, we incorporate 38 programming languages from The Stack v2 and Proof-Pile-2, which we found to improve multilingual information retrieval (§4).

Annealing mixture.
For annealing, we classified data not seen during pre-training into four quality levels using the EuroLLM classifier (Martins et al., 2024). We then selected data above the third threshold, representing a mixture of medium and high quality data. Contrary to our expectations, we found that including data of lower quality improved performance (§4). Additionally, we adjusted the data distribution based on multiple ablations (§4). Specifically, we decreased the proportion of English while proportionally increasing the remaining languages. We also decreased the amount of code and math data while increasing parallel data.[3]

2.3 Training Recipe

Masked language modeling. We choose to pre-train EuroBERT models with a 50% masking ratio, following the insights from Wettig et al. (2023), which find that masking 15% and 30% of tokens is sub-optimal, and that larger models benefit from higher masking ratios. For the subsequent annealing phase, however, we lower the masking ratio to 10% based on downstream evaluations (§4), aligning with the findings from Yang et al. (2023) and Ankner et al. (2024).

Hyperparameters. We employed the Warmup-Stable-Decay (WSD) scheduler (Shen et al., 2024), with a linear warm-up phase of 2,000 steps, a constant learning rate of 1 × 10^-4 during pre-training, and a cosine scheduler decaying to 0 during the annealing phase. During pre-training, we packed sentences to 2,048 tokens and used a Rotary Position Embedding (RoPE) theta value of 10,000. In the annealing phase, we increased the RoPE theta to 250,000 and randomly cropped our training documents to lengths between 12 and 8,192 tokens. We adopted this approach because, due to pre-processing constraints, our training data had already been segmented into fixed-length documents, making standard variable-length training infeasible. Therefore, we introduced random cropping of these fixed-length sequences as an approximation of variable-length training. Surprisingly, we found that this approach outperforms training only on fixed lengths (§4), further highlighting the necessity for variable length documents during long context training (Gao et al., 2024).

[2] These languages were selected to balance European and widely spoken global languages, and ensure representation across diverse alphabets and language families.

[3] We provide further details on our pretraining and annealing datasets in Appendix C.

Benchmark       mDeBERTa     mGTE         XLM-RoBERTa  XLM-RoBERTa  XLM-RoBERTa  EuroBERT     EuroBERT     EuroBERT
                280m         305m         280m         560m         3.5B         210m         610m         2.1B
Retrieval
MIRACL          43.7 (5)     93.8 (2)     89.5 (4)     91.6 (3)     92.6 (3)     95.1 (1)     95.0 (1)     94.8 (1)
MLDR            20.0 (6)     73.2 (2)     58.7 (5)     65.2 (4)     70.0 (3)     73.4 (2)     75.8 (1)     72.9 (2)
CC-News         15.8 (7)     71.5 (4)     60.4 (6)     72.1 (4)     80.9 (1)     69.0 (4)     76.6 (2)     76.9 (2)
Wikipedia       58.9 (5)     94.6 (3)     91.7 (4)     93.6 (3)     96.7 (1)     95.6 (2)     96.6 (1)     96.6 (1)
Classification
XNLI            82.0 (4)     78.4 (6)     76.6 (7)     84.1 (3)     86.1 (2)     79.9 (5)     84.7 (2)     86.8 (1)
PAWS-X          91.9 (2)     89.8 (3)     88.9 (4)     92.4 (2)     92.9 (1)     89.9 (3)     92.2 (2)     93.0 (1)
QAM             72.1 (4)     68.6 (5)     67.9 (5)     73.7 (2)     75.4 (1)     69.6 (4)     72.9 (3)     73.3 (3)
AmazonReviews   63.7 (2)     62.7 (3)     62.7 (3)     64.5 (1)     64.7 (1)     63.0 (2)     64.0 (2)     64.5 (1)
MassiveIntent   87.3 (2)     87.5 (2)     87.2 (2)     88.8 (1)     88.5 (1)     87.2 (2)     87.8 (2)     88.2 (2)
Regression
WMT             43.0 (2)     37.7 (6)     34.2 (7)     39.0 (4)     44.4 (1)     40.5 (4)     41.1 (4)     38.8 (5)
SeaHorse        30.2 (8)     55.8 (4)     31.7 (7)     34.3 (6)     35.7 (5)     59.3 (3)     61.8 (2)     64.2 (1)
SummEval        26.0 (7)     43.7 (3)     26.9 (6)     38.6 (4)     30.3 (6)     46.3 (3)     57.4 (2)     59.4 (1)

Table 1: Multilingual evaluation results. Scores represent NDCG@10 for retrieval, accuracy for classification, and Spearman rank correlation for regression tasks, averaged across European languages. Values in parentheses (red-highlighted in the original) indicate model rankings for each task, determined through pairwise statistical tests with a 95% confidence level.

Infrastructure.
We trained the EuroBERT family on Adastra, using 92 MI250X GPUs for EuroBERT-210M, 384 MI250X GPUs for EuroBERT-610M, and 96 MI300A GPUs for EuroBERT-2.1B, for a total of 200k GPU hours. Our training framework incorporates FlashAttention (Dao, 2023), fused cross-entropy from LigerKernel (Hsu et al., 2024), torch.compile (Ansel et al., 2024), and hybrid sharding with Fully Sharded Data Parallel (Zhao et al., 2023).

3 Evaluation

3.1 Evaluation Setup

Datasets and tasks. We select a suite of tasks to cover various real-world use cases for encoders. For multilingual tasks, we evaluate retrieval performance using MIRACL (Zhang et al., 2023), MLDR (Chen et al., 2024), WikipediaRetrieval[4], and CC-News (de Gibert et al., 2024). We assess classification with XNLI (Conneau et al., 2018), PAWS-X (Yang et al., 2019), QAM (Liang et al., 2020), AmazonReviews (Keung et al., 2020) and MassiveIntent (Keung et al., 2020). Additionally, we measure sequence regression performance on the WMT (Bojar et al., 2017; 2018; Barrault et al., 2019; 2020; Akhbardeh et al., 2021; Kocmi et al., 2022) quality estimation task, as well as on summary evaluation using SeaHorse (Clark et al., 2023) and SummEval (Fabbri et al., 2021). For code-related tasks, we evaluate retrieval on CodeSearchNet (Husain et al., 2019) and DupStackMath (Hoogeveen et al., 2015), and classification on CodeDefect (Zhou et al., 2019) and CodeComplexity (Jeon et al., 2023). Finally, in the mathematical domain, we test retrieval on the MathFormula (Drechsel et al., 2025) task, and process reward modeling on MathShepherd (Wang et al., 2024b). We believe that our chosen suite is representative of many practical applications of encoder models.[5]

Figure 3: Long context analysis. We examine how context length affects model performance in the XLM-RoBERTa and EuroBERT families across two long-context tasks (MLDR and SeaHorse). [Two panels: MLDR (NDCG@10 against document length) and SeaHorse (Spearman against document length), for XLM-RoBERTa 280m, 560m and 3.5B, mGTE 305m, and EuroBERT 210m, 610m and 2.1B.]

Baselines. We compare EuroBERT with the multilingual encoders XLM-RoBERTa (Conneau et al., 2020; Goyal et al., 2021), mGTE-MLM-base (Zhang et al., 2024)[6] and mDeBERTa-v3-base. Additionally, for code and mathematical tasks, we compare with the English-only ModernBERT (Warner et al., 2024) models.

Fine-tuning. We follow a standardized fine-tuning protocol to ensure fair model comparison. For sequence classification tasks, models are trained for 10,000 steps on the corresponding training split using the standard cross-entropy loss, a batch size of 32, a 10% warm-up ratio, and a linear learning rate decay. For small datasets requiring multiple epochs, we apply early stopping with a patience of one epoch based on validation performance. To account for model specificities, we fine-tune using 10 logarithmically spaced learning rates (1 × 10^-5 to 1 × 10^-4), selecting the one that achieves the highest validation metric.[7] For sequence regression tasks, we use the same setting but replace the loss with the MSE. For long-context summarization datasets (SeaHorse and SummEval), fine-tuning is limited to 5,000 steps to reduce computational cost. For retrieval tasks, models are fine-tuned for 1,000 steps on MS-MARCO (Bajaj et al., 2016),[8] using InfoNCE loss (Oord et al., 2018) with in-batch negatives and cosine similarity as the similarity metric. All other hyperparameters are aligned with those used in classification and regression tasks.

[4] https://huggingface.co/datasets/Samoed/WikipediaRetrievalMultilingual

[5] Additional information on evaluation tasks is given in Appendix D.
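Two pieces of the fine-tuning protocol above lend themselves to a small worked example: the grid of 10 logarithmically spaced learning rates between 1 × 10^-5 and 1 × 10^-4, and the InfoNCE loss with in-batch negatives and cosine similarity used for retrieval. The sketch below is a plain-Python illustration, not the authors' training code. The function names are hypothetical, and the temperature of 0.05 is an assumption, since the text does not report one.

```python
import math

# 10 logarithmically spaced learning rates from 1e-5 to 1e-4 (both included).
learning_rates = [1e-5 * 10 ** (i / 9) for i in range(10)]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_in_batch(queries, documents, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    documents[i] is the positive for queries[i]; every other document in
    the batch serves as a negative. `temperature` is an assumed value.
    """
    losses = []
    for i, query in enumerate(queries):
        logits = [cosine(query, doc) / temperature for doc in documents]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two query/document embedding pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_in_batch(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce_in_batch(queries, [[0.0, 1.0], [1.0, 0.0]])
```

When each query is closest to its own document the loss is near zero, and when the positives are swapped it grows to roughly 1/temperature, which is the behaviour the in-batch negatives are meant to enforce.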
[6] Since the EuroBERT models are general-purpose encoders, we evaluate them against the pre-trained mGTE-MLM-base variant, which, similarly, was not optimized for retrieval tasks.

[7] Additional fine-tuning hyperparameters, such as AdamW parameters (β1, β2, ε, and weight decay), are set according to the values reported in the original source papers. For EuroBERT models, we maintain the same settings as those used during pre-training and annealing.

[8] Since many retrieval datasets lack dedicated training splits, we use MS-MARCO, an English-only dataset. This choice also allows us to assess cross-lingual generalization.

Benchmark       ModernBERT   ModernBERT   mDeBERTa     mGTE         XLM-RoBERTa  XLM-RoBERTa  XLM-RoBERTa  EuroBERT     EuroBERT     EuroBERT
                150m         395m         280m         305m         280m         560m         3.5B         210m         610m         2.1B
Code
CodeSearchNet   53.9 (5)     65.8 (3)     2.8 (?)      34.0 (7)     23.0 (8)     40.8 (6)     54.1 (5)     58.9 (4)     69.9 (2)     72.6 (1)
DupStackMath    39.7 (4)     45.5 (2)     10.2 (7)     37.5 (4)     29.3 (6)     36.9 (5)     42.9 (3)     41.7 (3)     46.0 (2)     48.3 (1)
CodeComplexity  86.1 (3)     88.6 (3)     73.9 (5)     74.5 (5)     74.1 (5)     83.6 (4)     84.3 (4)     91.9 (2)     94.2 (1)     95.2 (1)
CodeDefect      65.8 (3)     67.0 (2)     64.7 (3)     63.5 (4)     61.9 (4)     54.3 (5)     65.8 (3)     69.5 (1)     69.0 (1)     67.7 (2)
Math
MathFormula     89.6 (5)     91.9 (2)     85.2 (7)     83.4 (8)     83.1 (8)     81.4 (?)     89.1 (6)     91.5 (3)     92.6 (1)     91.0 (4)
MathShepherd    77.7 (4)     83.6 (2)     75.1 (5)     77.2 (4)     71.9 (6)     67.6 (7)     82.5 (3)     84.0 (2)     87.3 (1)     86.8 (1)

Table 2: Evaluation results for code and math domains. Scores are reported as NDCG@10 for retrieval tasks (CodeSearchNet, DupStackMath, MathFormula) and accuracy for classification tasks (CodeComplexity, CodeDefect, MathShepherd). Values in parentheses are model rankings; (?) marks a rank that is not recoverable from the extracted text.

Evaluation metrics. We report accuracy for sequence classification, Spearman rank correlation for regression, and NDCG@10 for retrieval tasks. We also follow Freitag et al. (2023), and group systems into language-specific clusters based on statistically significant performance gaps at 95% confidence thresholds. We then compute system-level rankings using a normalized Borda count (Colombo et al., 2022), defined as the average over the obtained per-language clusters.
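For readers unfamiliar with the retrieval metric, NDCG@10 can be computed for a single query as sketched below. This is a minimal illustration under stated assumptions, not the evaluation code used in the paper: it uses the linear-gain form of DCG (some implementations use 2^rel - 1 instead), it assumes the list passed in contains the relevance labels of all judged documents for the query in ranked order, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking.

    `ranked_relevances` holds relevance labels in the order the system
    ranked the documents; the ideal ranking sorts them in decreasing order.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant document for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


perfect = ndcg_at_k([1, 0, 0])      # relevant document ranked first
second_place = ndcg_at_k([0, 1, 0])  # relevant document ranked second
```

The benchmark score is then the mean of this per-query value, which the tables report averaged across languages.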
3.2 Results

Table 1 reports the results for all multilingual tasks, aggregated over the European languages seen during training.[9]

The EuroBERT family exhibits strong multilingual performance across domains and tasks. EuroBERT-2.1B, our largest model, achieves the highest performance among all systems, ranking first on 7 of 12 benchmarks. Importantly, it outperforms the largest system, XLM-RoBERTa-XL. Additionally, EuroBERT-610m is competitive with XLM-RoBERTa-XL, a model 5 times its size, on most multilingual tasks, and surpasses it on code and mathematics. Similarly, the smaller EuroBERT-210m is competitive with XLM-RoBERTa-Large, which has twice the number of parameters, and globally outperforms all similarly sized systems.

EuroBERT is effective at document ranking. Across domains, EuroBERT consistently ranks high for retrieval tasks. Notably, the 210m and 610m parameter models outperform all models of comparable sizes, and are competitive with the larger XLM-RoBERTa-XL.

For sentence classification, EuroBERT models achieve results on par with similarly sized models. On sentence classification, no model significantly outperforms all others. During the development of EuroBERT, we found that several design decisions lead to a trade-off between retrieval and classification capabilities (§4). We highlight, however, that EuroBERT-2.1B is still among the highest ranking systems, and that the smaller models in the family are competitive with models of comparable size.

EuroBERT can function as an evaluation metric. In translation evaluation, while there is a performance gap to the larger XLM-RoBERTa, the EuroBERT models remain competitive with the other alternatives. In the future, we would like to explore other training signals to further enhance cross-lingual capabilities of EuroBERT.
In contrast, for summary evaluation, EuroBERT models consistently outperform competitors of any size, making them a reliable choice for training a metric on this type of task.

[9] More detailed results are provided in Appendix D.

Figure 4: Impact of data subset ratios on model performance on XNLI and MIRACL. The first vertical axis of each subplot denotes the reference data mix named reference in Table C. [Five panels showing the performance delta for each data subset ratio: English (41%, 26%, 17%), Math (8%, 4%, 2%), Code (8%, 6%, 2%), Parallel (5%, 8%), IFT (1%, 0%).]

EuroBERT maintains performance at longer context lengths. Figure 3 compares the long context performance of EuroBERT and XLM-RoBERTa. While both models achieve similar performance for shorter inputs, EuroBERT maintains performance at longer contexts, whereas XLM-RoBERTa suffers notable degradation.

The EuroBERT family excels in tasks related to code and mathematics. Table 2 reports the results on tasks related to code and mathematics. On these tasks, all EuroBERT models consistently surpass other systems. As in retrieval tasks, EuroBERT-210m retains most of the performance of the larger models in the family, and ranks above all baselines, highlighting its capabilities at a smaller scale. Additionally, the larger EuroBERT-2.1B achieves the highest performance among all systems.

4 Training Recipe Analysis

We measure the impact of various design decisions made during the development of EuroBERT with extensive ablations. Following Blakeney et al. (2024) and Llama Team (2024), we perform multiple annealing runs on 40B tokens, each varying a different component of our recipe, and measure the performance on the XNLI and MIRACL validation sets, the former representing multilingual classification and the latter multilingual retrieval.[10]

Balancing the language distribution can enhance performance.
The left-most plot in Figure 4 reports the retrieval and classification performance as the proportion of English is reduced and redistributed among the other languages. Remarkably, the retrieval performance consistently decreases, suggesting that increasing multilingual data may not lead to an increase in multilingual performance.

Including math and code improves multilingual retrieval, but degrades multilingual classification. The second and third plots in Figure 4 show MIRACL performance dropping and XNLI accuracy rising as the proportions for math and code data decrease. This outcome underscores the specific trade-offs encountered during model development. In future work, we aim to investigate how to better balance task performance during pre-training.

Increasing parallel data yields performance gains. The fourth plot in Figure 4 presents the XNLI and MIRACL performance when increasing the amount of parallel data. Similar to Anil et al. (2023), Briakou et al. (2023), and Alves et al. (2024), we find it increases performance on both benchmarks.

Adding instruction fine-tuning data degrades model performance. The right-most plot in Figure 4 analyses the impact of adding instructions during annealing, which can improve performance for decoder language models. In contrast to decoders, it leads to worse performance when training an encoder model.

[10] We follow the evaluation procedure from §3, but instead test on the validation splits.

Figure 5: Impact of hyperparameter choice on model performance on XNLI and MIRACL. The first vertical axis of each subplot denotes the reference data mix. [Three panels showing the performance delta for each setting: Sentence Length (2k, [12, 2k], [12, 8k]), Mask Ratio (50%, 30%, 10%), Quality Buckets (4, 3, 3+4).]

Model-based quality filters can lead to worse results.
Contrary to initial expectations, using only the highest-quality data bucket during the annealing phase did not result in better performance on XNLI and MIRACL. Instead, as illustrated in the right plot of Figure 5, mixing the buckets with quality levels 3 and 4 leads to the best performance on XNLI, while the data bucket of quality 3 achieved the best results for MIRACL.

A reduced masking ratio during annealing enhances classification performance. Similar to previous research (Yang et al., 2023; Ankner et al., 2024), which advocates lowering the masking ratio in later training, we also find that reducing it to 10% during the annealing phase improves EuroBERT’s performance on XNLI, though it leads to a modest decline in MIRACL scores.

Impact of variable sentence length on model performance. The first plot in Figure 5 examines the impact of variable sentence lengths during annealing. Compared to the fixed packed sentence lengths employed in pretraining, variable sentence lengths significantly boost XNLI performance and moderately improve MIRACL performance. This improvement remains stable, without degradation, when the maximum context length is extended to 8,192 tokens.

Based on this analysis, we decided to create our final annealing dataset by selecting data above the third threshold. We reduced the proportion of English to 26% while proportionally increasing the share of the remaining languages. To balance retrieval and classification performance, we allocated 6% and 4% of the data mix to math and code, respectively. Additionally, we increased the proportion of parallel data to 6%, using remaining data not seen during pre-training, and removed the instruction data. We finally lowered the masking ratio to 10% and performed annealing with random sentence lengths of up to 8,192 tokens.

5 Related Work

Encoder models and learning objectives.
Encoder-only models have consistently demonstrated strong performance in non-generative NLP tasks, such as classification (Acheampong et al., 2021; Ma et al., 2019) and retrieval (Karpukhin et al., 2020; Wang et al., 2024a), leveraging their ability to effectively represent sequences while maintaining relatively compact model sizes. Traditional encoder architectures, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), rely on the masked language modeling (MLM) objective, which, combined with bidirectional attention, enables them to learn rich contextual representations well-suited for sequence-level understanding. In contrast, DeBERTa (He et al., 2023) introduces replaced token detection (RTD) as an alternative pre-training objective, which improves efficiency and achieves strong results in classification tasks. In this paper, we chose to use the masked language modeling objective because initial evaluations of existing models showed more balanced results across all tasks.

Multilingual encoders. Expanding on these monolingual architectures, multilingual encoder models such as mBERT (Devlin et al., 2019), XLM-RoBERTa (Conneau et al., 2020), and mDeBERTa (He et al., 2023) have extended pre-training benefits to diverse languages, enhancing cross-lingual understanding. However, training these models has also highlighted the so-called "curse of multilinguality" (Conneau et al., 2020; Chang et al., 2024), demonstrating the necessity of scaling in terms of training data, model size, and context length to maintain competitive performance across languages.

Scaling encoder models. Initial efforts to scale encoder models focused on increasing model size and supported languages, as seen in larger variants like XLM-RoBERTa-XL and XLM-RoBERTa-XXL (Goyal et al., 2021), which demonstrated the benefits of scaling for multilingual performance.
However, more recent advancements in encoder architectures have moved beyond mere size increases, incorporating sophisticated design improvements. Notably, concurrent work on modern pre-trained encoders, such as ModernBERT (Warner et al., 2024) and mGTE (Zhang et al., 2024), introduces innovations like grouped query attention (Ainslie et al., 2023), RoPE embeddings (Su et al., 2024), GLU activations (Shazeer, 2020), RMS normalization (Zhang & Sennrich, 2019), and extended context support. In line with these advancements and inspired by recent progress in decoder scaling (Brown et al., 2020; Yang et al., 2024; DeepSeek-AI et al., 2024), our work revisits the classical encoder pre-training paradigm. Specifically, we increase the MLM masking ratio, scale training across multiple languages with up to 5 trillion tokens, and integrate recent architectural improvements.

6 Conclusion

We propose a recipe for training general-purpose multilingual encoders, creating the EuroBERT family. We incorporate recent architectural advances from decoder models, and train on a multilingual dataset containing European and globally spoken languages, together with code and mathematics. Our models outperform existing alternatives on a comprehensive suite of tasks covering multilingual capabilities, mathematics and code. We also extensively analyze the design decisions behind EuroBERT’s dataset and training pipeline. Alongside this paper, we release all models in the EuroBERT family, including intermediate training checkpoints, and our training framework to facilitate future research.

Acknowledgments

We sincerely thank the ADASTRA supercomputer (CINES) for its technical support and high-performance computing (HPC) resources, provided through grants C1615122 and GDA2401. We also appreciate the support of the French government through the France 2030 program as part of the ArGiMi project.
This work was also supported by the EU’s Horizon Europe Research and Innovation Actions (UTTER, contract 101070631), by the project DECOLLAGE (ERC-2022-CoG 101088763), by the Portuguese Recovery and Resilience Plan through project C645008882-00000055 (Center for Responsible AI), and by FCT/MECI through national funds and when applicable co-funded EU funds under UID/50008: Instituto de Telecomunicações. Duarte was also partially supported by the DataIA Institute, whose contributions facilitated the completion of this work.

References

Francisca Adoma Acheampong, Henry Nunoo-Mensah, and Wenyu Chen. Transformer models for text-based emotion detection: a review of bert-based approaches. Artificial Intelligence Review, 54(8), 2021. URL https://dl.acm.org/doi/abs/10.1007/s10462-021-09958-2.

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, December 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.emnlp-main.298/.

Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-jussa, Cristina España-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, and Marcos Zampieri. Findings of the 2021 conference on machine translation (WMT21).
In Proceedings of the Sixth Conference on Machine Translation, Online, November 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.wmt-1.1/.
Duarte Miguel Alves, José Pombal, Nuno M Guerreiro, Pedro Henrique Martins, João Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, et al. Tower: An open multilingual large language model for translation-related tasks. In First Conference on Language Modeling, 2024. URL https://arxiv.org/abs/2402.17733.
Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A.
Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023. URL https://arxiv.org/abs/2305.10403.
Zachary Ankner, Naomi Saphra, Davis Blalock, Jonathan Frankle, and Matthew Leavitt. Dynamic masking rate schedules for MLM pretraining. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), St. Julian’s, Malta, March 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.eacl-short.42/.
Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al. Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024. URL https://dl.acm.org/doi/abs/10.1145/3620665.3640366.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report, 2023. URL https://arxiv.org/abs/2309.16609.
Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016. URL https://arxiv.org/abs/1611.09268.
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), Florence, Italy, August 2019. Association for Computational Linguistics.
URL https://aclanthology.org/W19-5301/.
Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, Online, November 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.wmt-1.1/.
Cody Blakeney, Mansheej Paul, Brett W Larsen, Sean Owen, and Jonathan Frankle. Does your data spark joy? performance gains from domain upsampling at the end of training. In First Conference on Language Modeling, 2024.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. URL https://aclanthology.org/W17-4717/.
Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Belgium, Brussels, October 2018. Association for Computational Linguistics. URL https://aclanthology.org/W18-6401/.
Eleftheria Briakou, Colin Cherry, and George Foster. Searching for needles in a haystack: On the role of incidental bilingualism in PaLM's translation capability.
In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, July 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.acl-long.524/.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Tyler A. Chang, Catherine Arnett, Zhuowen Tu, and Ben Bergen. When is multilinguality a curse? language modeling for 250 high- and low-resource languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, November 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.emnlp-main.236/.
Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, August 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.findings-acl.137/.
Elizabeth Clark, Shruti Rijhwani, Sebastian Gehrmann, Joshua Maynez, Roee Aharoni, Vitaly Nikolaev, Thibault Sellam, Aditya Siddhant, Dipanjan Das, and Ankur Parikh.
SEAHORSE: A multilingual, multifaceted dataset for summarization evaluation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, December 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.emnlp-main.584.
Pierre Colombo, Nathan Noiry, Ekhine Irurozki, and Stephan Clémençon. What are the best systems? new perspectives on nlp benchmarking. In Advances in Neural Information Processing Systems, volume 35. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/ac4920f4085b5662133dd751493946a6-Paper-Conference.pdf.
Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf.
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. URL https://aclanthology.org/D18-1269/.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, July 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.acl-main.747/.
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2023. URL https://arxiv.org/abs/2307.08691.
Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, and Jörg Tiedemann. A new massive multilingual dataset for high-performance language technologies. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italia, May 2024. ELRA and ICCL. URL https://aclanthology.org/2024.lrec-main.100/.
DeepSeek-AI, Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y. K. Li, Wenfeng Liang, Fangyun Lin, A. X. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Tongzheng Ren, Zehui Ren, Chong Ruan, Zhangli Sha, Zhihong Shao, Junxiao Song, Xuecheng Su, Jingxiang Sun, Yaofeng Sun, Minghui Tang, Bingxuan Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Y. Wu, Xin Xie, Zhenda Xie, Ziwei Xie, Yiliang Xiong, Hanwei Xu, R. X. Xu, Yanhong Xu, Dejian Yang, Yuxiang You, Shuiping Yu, Xingkai Yu, B. Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghua Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, and Yuheng Zou. Deepseek llm: Scaling open-source language models with longtermism, 2024. URL https://arxiv.org/abs/2401.02954.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding.
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. URL https://aclanthology.org/N19-1423/.
Jonathan Drechsel, Anja Reusch, and Steffen Herbold. MAMUT: A novel framework for modifying mathematical formulas for the generation of specialized datasets for language model training, 2025. URL https://arxiv.org/abs/2502.20855.
Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9, 2021. URL https://aclanthology.org/2021.tacl-1.24/.
Markus Freitag, Nitika Mathur, Chi-kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian Thompson, Tom Kocmi, Frederic Blain, Daniel Deutsch, Craig Stewart, Chrysoula Zerva, Sheila Castilho, Alon Lavie, and George Foster. Results of wmt23 metrics shared task: Metrics might be guilty but references are not innocent. In Proceedings of the Eighth Conference on Machine Translation, Singapore, December 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.wmt-1.51.
Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. How to train long-context language models (effectively). arXiv preprint arXiv:2410.02660, 2024. URL https://arxiv.org/abs/2410.02660.
Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, and Alexis Conneau. Larger-scale transformers for multilingual masked language modeling. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), Online, August 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.repl4nlp-1.4/.
Pengcheng He, Jianfeng Gao, and Weizhu Chen.
Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. In The Eleventh International Conference on Learning Representations, 2023. URL https://arxiv.org/abs/2111.09543.
Doris Hoogeveen, Karin M Verspoor, and Timothy Baldwin. Cqadupstack: A benchmark data set for community question-answering research. In Proceedings of the 20th Australasian document computing symposium, 2015. URL https://dl.acm.org/doi/abs/10.1145/2838931.2838934.
Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, and Yanning Chen. Liger kernel: Efficient triton kernels for llm training, 2024. URL https://arxiv.org/abs/2410.10989.
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019. URL https://arxiv.org/abs/1909.09436.
Mingi Jeon, Seung-yeop Baik, Joonghyuk Hahn, Yo-Sub Han, and Sang-Ki Ko. Deep learning-based source code complexity prediction. 2023. URL https://openreview.net/forum?id=9irBKvxsw9.
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, November 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.emnlp-main.550/.
Phillip Keung, Yichao Lu, György Szarvas, and Noah A. Smith. The multilingual Amazon reviews corpus. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, November 2020.
Association for Computational Linguistics. URL https://aclanthology.org/2020.emnlp-main.369/.
Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Novák, Martin Popel, and Maja Popović. Findings of the 2022 conference on machine translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT), Abu Dhabi, United Arab Emirates (Hybrid), December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.wmt-1.1/.
Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, Xiaodong Fan, Ruofei Zhang, Rahul Agrawal, Edward Cui, Sining Wei, Taroon Bharti, Ying Qiao, Jiun-Hung Chen, Winnie Wu, Shuguang Liu, Fan Yang, Daniel Campos, Rangan Majumder, and Ming Zhou. XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, November 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.emnlp-main.484/.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019. URL https://arxiv.org/abs/1907.11692.
AI @ Meta Llama Team. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
Xiaofei Ma, Peng Xu, Zhiguo Wang, Ramesh Nallapati, and Bing Xiang. Domain adaptation with BERT-based domain classification and data selection.
In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), Hong Kong, China, November 2019. Association for Computational Linguistics. URL https://aclanthology.org/D19-6109/.
Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, and André F. T. Martins. Eurollm: Multilingual language models for europe, 2024. URL https://arxiv.org/abs/2409.16235.
Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italia, May 2024. ELRA and ICCL. URL https://aclanthology.org/2024.lrec-main.377/.
Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, and Hannaneh Hajishirzi. 2 olmo 2 furious, 2025. URL https://arxiv.org/abs/2501.00656.
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
URL https://arxiv.org/abs/1807.03748.
Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In Advances in Neural Information Processing Systems, volume 37. Curran Associates, Inc., 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/370df50ccfdf8bde18f8f9c2d9151bda-Paper-Datasets_and_Benchmarks_Track.pdf.
Machel Reid and Mikel Artetxe. PARADISE: Exploiting parallel data for multilingual sequence-to-sequence pretraining. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, United States, July 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.naacl-main.58/.
Machel Reid and Mikel Artetxe. On the role of parallel data in cross-lingual transfer learning. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.findings-acl.372/.
Noam Shazeer. Glu variants improve transformer, 2020. URL https://arxiv.org/abs/2002.05202.
Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox, and Rameswar Panda. Power scheduler: A batch size and token number agnostic learning rate scheduler, 2024. URL https://arxiv.org/abs/2408.13359.
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568, 2024. ISSN 0925-2312. URL https://www.sciencedirect.com/science/article/pii/S0925231223011864.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training, 2024a. URL https://arxiv.org/abs/2212.03533.
Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations.
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, August 2024b. Association for Computational Linguistics. URL https://aclanthology.org/2024.acl-long.510/.
Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, and Iacopo Poli. Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference, 2024. URL https://arxiv.org/abs/2412.13663.
Alexander Wettig, Tianyu Gao, Zexuan Zhong, and Danqi Chen. Should you mask 15% in masked language modeling? In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.eacl-main.217/.
Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, June 2018. Association for Computational Linguistics. URL https://aclanthology.org/N18-1101/.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, and Zhihao Fan. Qwen2 technical report, 2024. URL https://arxiv.org/abs/2407.10671.
Dongjie Yang, Zhuosheng Zhang, and Hai Zhao. Learning better masking for better language model pre-training. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, July 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.acl-long.400/.
Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, November 2019. Association for Computational Linguistics. URL https://aclanthology.org/D19-1382/.
Biao Zhang and Rico Sennrich. Root mean square layer normalization. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/1e8a19426224ca89e83cef47f1e7f53b-Paper.pdf.
Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, and Min Zhang. mGTE: Generalized long-context text representation and reranking models for multilingual text retrieval. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, Miami, Florida, US, November 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.emnlp-industry.103/.
Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. MIRACL: A multilingual retrieval dataset covering 18 diverse languages. Transactions of the Association for Computational Linguistics, 11, 2023. URL https://aclanthology.org/2023.tacl-1.63/.
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: Experiences on scaling fully sharded data parallel. Proceedings of the VLDB Endowment, 16(12), 2023. URL https://dl.acm.org/doi/abs/10.14778/3611540.3611569.
Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/49265d2447bc3bbfe9e76306ce40a31f-Paper.pdf.

A EuroBERT Model Architecture

EuroBERT consists of three variants: EuroBERT-210m with 210 million parameters, EuroBERT-610m with 610 million parameters and EuroBERT-2.1B with 2.1 billion parameters.
These variants strike a balance between traditional encoder model sizes and the benefits of parameter scaling. We report the architectural details of each model in Table 3.

| Model Size | 210m | 610m | 2.1B |
|---|---|---|---|
| Layers | 12 | 26 | 32 |
| Embedding Dimension | 768 | 1,152 | 2,304 |
| FFN Dimension | 3,072 | 4,096 | 6,144 |
| Attention Heads | 12 | 18 | 18 |
| Key/Value Heads | 12 | 6 | 6 |
| Layer Normalization | RMSNorm | RMSNorm | RMSNorm |
| RMSNorm ϵ | 1×10⁻⁵ | 1×10⁻⁵ | 1×10⁻⁵ |
| Activation Function | SwiGLU | SwiGLU | SwiGLU |
| Vocabulary Size | 128,000 | 128,000 | 128,000 |
| Positional Embeddings | RoPE | RoPE | RoPE |
| RoPE θ | 250,000 | 250,000 | 250,000 |
| Tokenizer | LLaMA 3 | LLaMA 3 | LLaMA 3 |

Table 3: Summary of architectural hyperparameters for EuroBERT models of different sizes.

B Training Framework

We trained the EuroBERT family on Adastra, utilizing 192 MI250X GPUs for EuroBERT-210m over 15k hours, 384 MI250X GPUs for EuroBERT-610m over 92k hours, and 96 MI300A GPUs for EuroBERT-2.1B over 106k hours. The training recipe for the EuroBERT models consists of two main stages, applied uniformly to all model sizes: (1) the pre-training phase and (2) the annealing phase, which includes context extension.

Training Hyperparameters. We initialized the linear and embedding layers with values drawn from a normal distribution with a mean of 0 and a standard deviation of 0.2. For stability, we increased the default epsilon value of AdamW (Loshchilov & Hutter, 2019) to 1×10⁻⁵ and set β1 = 0.9 and β2 = 0.95, with a weight decay of 0.1. The Warmup-Stable-Decay (WSD) scheduler (Shen et al., 2024) was employed, featuring a warmup phase of 2,000 steps and a constant learning rate (LR) of 1×10⁻⁴ throughout training. To achieve a similar effective batch size of about 9×10⁶ tokens across EuroBERT models, gradient accumulation was applied for the 2.1-billion-parameter model and set to 5. Detailed hyperparameter choices are reported in Table 4. We find this training recipe highly stable, with no loss spikes or need for intervention to address model training divergence (Figure 6).

Infrastructure, scaling, and efficiency.
Training large language models (LLMs) at scale is a resource-intensive process that demands specialized hardware and an optimized codebase to effectively manage computational resources. We trained the EuroBERT family on the Adastra French supercomputer cluster, leveraging AMD GPUs: 192 MI250 GPUs for EuroBERT-210m, 384 MI250 GPUs for EuroBERT-610m, and 96 MI300A GPUs for EuroBERT-2.1B. However, most open-source pre-training frameworks are designed for NVIDIA hardware, presenting significant compatibility challenges. To overcome this, we developed a custom codebase tailored to training our models on AMD and NVIDIA GPUs (https://github.com/Nicolas-BZRD/EuroBERT). Built on PyTorch, this codebase includes several optimizations to increase training throughput. Specifically, we highlight FlashAttention (Dao, 2023), fused cross-entropy from LigerKernel (Hsu et al., 2024), torch.compile (Ansel et al., 2024), and hybrid sharding with Fully Sharded Data Parallel (FSDP) (Zhao et al., 2023), splitting model, gradients and optimizer states within the same node while replicating them across nodes. We achieved a training throughput of 1.2M tokens/s on 96 MI300A.

| Group | Parameter | 210m | 610m | 2.1B |
|---|---|---|---|---|
| Pre-training | LR | 1e-4 | 1e-4 | 1e-4 |
| Pre-training | LR Scheduler | WSD | WSD | WSD |
| Pre-training | Warmup Steps | 2,000 | 2,000 | 2,000 |
| Pre-training | Context Length | 2,048 | 2,048 | 2,048 |
| Annealing | LR | 1e-4 to 0 | 1e-4 to 0 | 1e-4 to 0 |
| Annealing | LR Scheduler | Cosine | Cosine | Cosine |
| Annealing | Context Length | 8,192 | 8,192 | 8,192 |
| Optimizer | Optimizer | AdamW | AdamW | AdamW |
| Optimizer | Beta1 | 0.9 | 0.9 | 0.9 |
| Optimizer | Beta2 | 0.95 | 0.95 | 0.95 |
| Optimizer | Epsilon (eps) | 1e-5 | 1e-5 | 1e-5 |
| Optimizer | Weight Decay | 0.1 | 0.1 | 0.1 |
| Optimizer | Clip Grad Norm | 1.0 | 1.0 | 1.0 |
| Training Setup | Per-GPU Batch Size | 24 | 12 | 10 |
| Training Setup | Gradient Accumulation Steps | 1 | 1 | 5 |
| Training Setup | GPUs | 192 | 384 | 96 |
| Training Setup | Tokens/Step | 9,437,184 | 9,437,184 | 9,830,400 |

Table 4: Training hyperparameters for EuroBERT models (210m, 610m, 2.1B). The optimizer and Tokens/Step remain consistent across both pre-training and annealing phases.

Figure 6: Pre-training Loss for all EuroBERT models on a logarithmic scale.
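The batch-size and learning-rate settings in Table 4 can be checked with a few lines of arithmetic. The sketch below is illustrative and is not taken from the EuroBERT codebase: the function names, the linear warmup shape, and the `total_steps` argument of the annealing schedule are assumptions, and the tokens-per-step product assumes that the per-GPU batch size counts sequences at the 2,048-token pre-training context length, which is the reading that reproduces the Tokens/Step row.

```python
import math


def tokens_per_step(per_gpu_batch, grad_accum, num_gpus, context_length=2048):
    """Tokens consumed per optimizer step (assumed: batch counts full-length sequences)."""
    return per_gpu_batch * grad_accum * num_gpus * context_length


def pretraining_lr(step, peak_lr=1e-4, warmup_steps=2000):
    """Warmup then constant LR; the linear warmup shape is an assumption."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


def annealing_lr(step, total_steps, peak_lr=1e-4):
    """Cosine decay from peak_lr to 0 over total_steps (total_steps is hypothetical)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # (per-GPU batch size, gradient accumulation steps, GPUs) from Table 4
    setups = {"210m": (24, 1, 192), "610m": (12, 1, 384), "2.1B": (10, 5, 96)}
    for name, (batch, accum, gpus) in setups.items():
        print(name, tokens_per_step(batch, accum, gpus))
    print(pretraining_lr(0), pretraining_lr(5000), annealing_lr(50, 100))
```

Under these assumptions the 210m and 610m configurations both give 9,437,184 tokens per step and the 2.1B configuration gives 9,830,400, matching Table 4.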
C Data Mix

Source            Subset            Tokens (M)   Mix (%)
FineWeb           English           2,002,327    41.34
CulturaX          French            295,113      6.09
CulturaX          German            291,514      6.02
CulturaX          Spanish           290,489      6.00
CulturaX          Chinese           238,467      4.92
CulturaX          Italian           120,128      2.48
CulturaX          Russian           116,797      2.41
CulturaX          Portuguese        112,321      2.32
CulturaX          Japanese          112,242      2.32
CulturaX          Polish            111,659      2.31
CulturaX          Turkish           53,126       1.10
CulturaX          Arabic            52,413       1.08
CulturaX          Vietnamese        50,661       1.05
CulturaX          Dutch             50,646       1.05
CulturaX          Hindi             25,544       0.53
Unbabel parallel  es↔en             50,613       1.05
Unbabel parallel  fr↔en             44,891       0.93
Unbabel parallel  de↔en             30,541       0.63
Unbabel parallel  it↔en             18,702       0.39
Unbabel parallel  ru↔en             13,808       0.29
Unbabel parallel  nl↔en             12,666       0.26
Unbabel parallel  pl↔en             7,280        0.15
Unbabel parallel  ar↔en             6,414        0.13
Unbabel parallel  zh↔en             6,206        0.13
Unbabel parallel  cs↔en             5,458        0.11
Unbabel parallel  hu↔en             4,599        0.09
Unbabel parallel  vi↔en             3,395        0.07
Unbabel parallel  tr↔en             2,975        0.06
Unbabel parallel  ja↔en             2,687        0.06
Unbabel parallel  hi↔en             1,136        0.02
Proof-pile-2      Arxiv             121,503      2.51
Proof-pile-2      Open-Web-Math     54,168       1.12
Proof-pile-2      Algebraic-stack   35,985       0.74
The-Stack v2      C++               120,085      2.48
The-Stack v2      SQL               75,348       1.56
The-Stack v2      C                 59,404       1.23
The-Stack v2      JavaScript        58,440       1.21
The-Stack v2      PHP               25,620       0.53
The-Stack v2      C#                24,842       0.51
The-Stack v2      Python            21,521       0.44
The-Stack v2      Java              20,950       0.43
The-Stack v2      Go                14,766       0.30
The-Stack v2      TypeScript        11,307       0.23
The-Stack v2      HTML              7,962        0.16
The-Stack v2      Lua               7,733        0.16
The-Stack v2      Ruby              5,524        0.11
The-Stack v2      Vue               5,411        0.11
The-Stack v2      R                 5,287        0.11
The-Stack v2      Shell             4,793        0.10
The-Stack v2      Swift             3,766        0.08
The-Stack v2      reStructuredText  3,761        0.08
The-Stack v2      JSON              3,586        0.07
The-Stack v2      Rust              3,152        0.07
The-Stack v2      YAML              2,716        0.06
The-Stack v2      Dart              2,678        0.06
The-Stack v2      RMarkdown         2,058        0.04
The-Stack v2      HCL               1,423        0.03
The-Stack v2      PowerShell        1,027        0.02
The-Stack v2      VBA               1,027        0.02
The-Stack v2      AsciiDoc          970          0.02
The-Stack v2      Groovy            540          0.01
The-Stack v2      CUDA              406          0.01
The-Stack v2      Dockerfile        281          0.01
The-Stack v2      Cython            103          0.01
The-Stack v2      COBOL             96           0.01
The-Stack v2      GraphQL           83           0.01
The-Stack v2      HTTP              82           0.01
The-Stack v2      ABAP              71           0.01
The-Stack v2      RDoc              16           0.01
The-Stack v2      Metal             8            0.01
The-Stack v2      AppleScript       7            0.01
Total                               4,843,357    100

Table 5: MLM pre-training data, with a total of 4.8 trillion tokens (according to EuroBERT's tokenizer). We report the list of all dataset names and subsets used, including the number of tokens selected and their proportion in the final data mix.

Mix          en    fr   de   nl   hi   it   ja   pl   pt   ru   es   ar   zh   tr   Code  Math  Parallel  IFT
Reference    46.3  5.8  5.7  1.0  0.3  1.5  0.8  1.0  1.4  1.0  5.7  0.4  4.7  1.0  8.7   8.2   5.2       1.2
English 26%  26.0  6.0  6.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  6.0  4.0  6.0  4.0  4.0   4.0   5.0       1.0
English 17%  17.0  6.0  6.0  5.0  5.0  5.0  5.0  5.0  5.0  5.0  6.0  5.0  6.0  5.0  4.0   4.0   5.0       1.0
Math 4%      46.3  5.8  5.7  1.0  0.3  1.5  0.8  1.0  1.4  1.0  5.7  0.4  4.7  1.0  8.7   4.0   5.2       1.2
Math 2%      46.3  5.8  5.7  1.0  0.3  1.5  0.8  1.0  1.4  1.0  5.7  0.4  4.7  1.0  8.7   2.0   5.2       1.2
Code 8%      46.3  5.8  5.7  1.0  0.3  1.5  0.8  1.0  1.4  1.0  5.7  0.4  4.7  1.0  6.0   8.2   5.2       1.2
Code 4%      46.3  5.8  5.7  1.0  0.3  1.5  0.8  1.0  1.4  1.0  5.7  0.4  4.7  1.0  4.0   8.2   5.2       1.2
Code 2%      46.3  5.8  5.7  1.0  0.3  1.5  0.8  1.0  1.4  1.0  5.7  0.4  4.7  1.0  2.0   8.2   5.2       1.2
Parallel 8%  46.3  5.8  5.7  1.0  0.3  1.5  0.8  1.0  1.4  1.0  5.7  0.4  4.7  1.0  8.7   8.2   8.0       1.2
IFT 0%       46.3  5.8  5.7  1.0  0.3  1.5  0.8  1.0  1.4  1.0  5.7  0.4  4.7  1.0  8.7   8.2   5.2       0.0

Table 6: Data mix employed in the ablation study measuring the importance of different data subsets in the EuroBERT annealing phase.

We curate our multilingual corpus for MLM pre-training using a variety of freely available and cleaned datasets (Table 5). This data mix predominantly consists of web-based data, with the FineWeb dataset (Penedo et al., 2024) serving as the primary English corpus.
We used the CulturaX dataset (Nguyen et al., 2024) for multilingual text. We selected 14 languages (French, German, Spanish, Chinese, Italian, Russian, Polish, Portuguese, Japanese, Vietnamese, Dutch, Arabic, Turkish, and Hindi) to create a corpus of European and the most widely spoken languages, representing a broad range of alphabets and cultures. In addition, we incorporated parallel data, which has been shown to improve cross-lingual transfer learning (Reid & Artetxe, 2022; 2023). Specifically, we concatenated, in random order, pairs of English sentences and their translations into another language. These sentence pairs were separated by a <|parallel_sep|> token and uniformly masked, creating an asymmetric masking between sentences. This asymmetric masking encourages the model to use its bidirectional attention mechanism to decode masked tokens, leveraging both the original language and its translation.

Finally, we enriched the dataset with 38 programming languages from The Stack v2 and science data from Proof-Pile-2 (upsampled 5×). A detailed summary of the dataset composition is available in Table 5.

D Evaluation

D.1 Dataset Details

This appendix offers additional details on the datasets used for evaluation. Table 7 presents the language coverage of all evaluation datasets, and below are additional specifications on the evaluated tasks.
Task | European Languages (en de es fr it nl pl pt) | Extra-European Languages (ar hi ja ru tr vi zh) | Code | Math
Sequence Classification
XNLI ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
PAWS-X ✓ ✓ ✓ ✓
QAM ✓ ✓ ✓
AmazonReviews ✓ ✓ ✓ ✓ ✓ ✓
MassiveIntent ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
CodeDefect ✓ ✓
CodeComplexity ✓ ✓
MathShepherd ✓ ✓
Sequence Regression
WMT ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
SeaHorse ✓ ✓ ✓ ✓ ✓
SummEval ✓ ✓ ✓ ✓ ✓
Information Retrieval
MIRACL ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
MLDR ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Wikipedia ✓ ✓ ✓ ✓ ✓ ✓
CC-News ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
CodeSearchNet ✓ ✓
DupStackMath ✓ ✓
MathFormula ✓ ✓

Table 7: Language coverage across evaluation datasets.

Classification datasets:

• XNLI (Conneau et al., 2018): Cross-lingual natural language inference task extending MNLI (Williams et al., 2018), which consists of classifying sentence pairs into entailment, contradiction, or neutral. In our experiments, we train multilingually instead of cross-lingually to assess models' capacity for multilingual fine-tuning.
• PAWS-X (Yang et al., 2019): Paraphrase identification task aimed at determining whether two sentences convey the same meaning. Fine-tuning is performed cross-lingually, with training on the English subset and evaluation across all available languages.
• QAM (Liang et al., 2020): NLU task aimed at verifying whether a question-passage pair forms a valid question-answer pair. Fine-tuning is performed cross-lingually.
• AmazonReviews (Keung et al., 2020): A sentiment analysis task that consists of estimating the satisfaction level of multilingual Amazon product reviews on a 1-to-5 scale. Fine-tuning is performed on all available languages.
• MassiveIntent (Keung et al., 2020): A multilingual classification task that consists of assigning sentences to one of 60 topic categories. Fine-tuning is performed on all available languages.
• CodeDefect (Zhou et al., 2019): A binary classification task aimed at identifying whether a given code snippet contains a defect.
• CodeComplexity (Jeon et al., 2023): Computational analysis task consisting in estimating the order of complexity of a code-formulated computer science problem.
• MathShepherd (Wang et al., 2024b): Binary classification task aimed at determining whether a step-by-step math rationale is correct given a problem prompt. Since longer rationales are more error-prone, we filter the dataset to retain only 3-step rationales to prevent models from overfitting answer length. As the dataset lacks a validation split, we allocate half of the test set for validation.

Regression datasets:

• WMT (Bojar et al., 2017; 2018; Barrault et al., 2019; 2020; Akhbardeh et al., 2021; Kocmi et al., 2022): Regression task consisting in predicting translation quality given a source sentence. As the original test set covers only three language pairs, we construct validation and test sets by sampling 5% of the training set for each, ensuring broader language coverage in evaluation.
• SeaHorse (Clark et al., 2023): Multilingual summarization evaluation task, where each text-summary pair is annotated across 6 binary evaluation dimensions. The final score is obtained by averaging these labels, yielding a continuous value between 0 and 1.
• SummEval (Fabbri et al., 2021): Initially English-only summarization evaluation task, later extended to other languages, including German¹², French¹³, Spanish¹⁴, and Turkish¹⁵. Each summary is assessed across 4 dimensions, averaged to produce a continuous score.

Retrieval datasets:

• MS-MARCO (Bajaj et al., 2016): English-only retrieval dataset used for fine-tuning, where each anchor-positive pair includes a mined hard negative, forming a triplet structure.
• MIRACL (Zhang et al., 2023): Multilingual retrieval dataset. We use the semi-supervised version with labeled positive pairs provided by SentenceTransformers¹⁶ as the primary data source.
Anchors serve as queries, and the corpus consists of all positive documents in the dataset. Since only a single data split is available, we create validation and test sets by partitioning 50% of the original split for each, using queries as the split key to ensure no data leakage.
• MLDR (Chen et al., 2024): Long-context multilingual retrieval dataset. As with MIRACL, we use the triplet version provided by SentenceTransformers and apply the same validation-test split strategy.
• Wikipedia¹⁷: Multilingual information retrieval dataset. Since only a single data split is available, we partition the queries 50/50 into validation and test sets.
• CC-News (de Gibert et al., 2024): Highly multilingual retrieval dataset. As with MIRACL, we use the SentenceTransformers dataset version as the primary data source and apply the same validation-test split method.
• CodeSearchNet (Husain et al., 2019): Code retrieval dataset with comment-code query-positive pairs (SentenceTransformers version), processed similarly to the previous datasets.
• DupStackMath (Hoogeveen et al., 2015): Mathematical retrieval dataset with queries, a corpus, and relevant documents, processed the same way as the above datasets.
• MathFormula (Drechsel et al., 2025): Mathematical retrieval dataset consisting of pairs of equivalent formulas. The original dataset contains formula pairs labeled as true or false based on their equivalence, spanning 71 well-known mathematical formulas.

¹² https://huggingface.co/datasets/sproos/summeval-de
¹³ https://huggingface.co/datasets/sproos/summeval-fr
¹⁴ https://huggingface.co/datasets/sproos/summeval-es
¹⁵ https://huggingface.co/datasets/sproos/summeval-tr
¹⁶ https://huggingface.co/collections/sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552
¹⁷ https://huggingface.co/datasets/Samoed/WikipediaRetrievalMultilingual
To construct the retrieval dataset, we extract only equivalent formula pairs, retaining positive instances. Due to the dataset's large size, we sample 100 positive pairs per formula type for both validation and test sets. The final dataset is processed following the same methodology as other pair-based datasets.

D.2 Detailed Results

This appendix presents detailed results for the multilingual evaluation tasks in our benchmark, including per-language scores as well as averages across European languages and all languages.

MIRACL
                   European Languages                              Extra-European Languages                  Average
Model              en    de    es    fr    it    nl    pl    pt    ar    hi    ja    ru    tr    vi    zh    Euro  World
XLM-RoBERTa-base   88.0  -     88.6  91.9  -     -     -     -     84.6  77.9  85.7  82.5  -     -     83.8  89.5  85.4
XLM-RoBERTa-large  89.7  -     91.6  93.5  -     -     -     -     89.0  81.1  91.1  88.6  -     -     90.3  91.6  89.4
XLM-RoBERTa-XL     90.4  -     93.0  94.5  -     -     -     -     91.5  85.1  92.6  92.1  -     -     91.9  92.6  91.4
mDeBERTa-v3-base   45.3  -     39.6  46.2  -     -     -     -     34.4  36.0  33.7  29.6  -     -     35.3  43.7  37.5
mGTE-MLM-base      91.4  -     94.6  95.2  -     -     -     -     91.6  85.3  91.5  88.3  -     -     91.4  93.8  91.2
ModernBERT-base    -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
ModernBERT-large   -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
EuroBERT-210m      94.1  -     95.4  95.8  -     -     -     -     90.0  83.2  90.8  85.7  -     -     90.9  95.1  90.8
EuroBERT-610m      93.6  -     95.1  96.3  -     -     -     -     91.8  88.3  92.4  90.7  -     -     92.4  95.0  92.6
EuroBERT-2.1B      94.2  -     95.0  95.3  -     -     -     -     93.0  87.1  93.4  91.5  -     -     94.1  94.8  92.9

MLDR
                   European Languages                              Extra-European Languages                  Average
Model              en    de    es    fr    it    nl    pl    pt    ar    hi    ja    ru    tr    vi    zh    Euro  World
XLM-RoBERTa-base   59.4  56.7  58.0  64.5  53.4  -     -     60.3  44.4  56.4  50.3  43.6  -     -     53.8  58.7  54.6
XLM-RoBERTa-large  63.4  61.1  66.9  71.1  61.9  -     -     67.1  51.5  60.3  54.9  51.5  -     -     58.9  65.2  60.8
XLM-RoBERTa-XL     68.9  66.1  72.7  73.5  67.5  -     -     71.0  56.5  61.8  62.6  60.8  -     -     63.5  70.0  65.9
mDeBERTa-v3-base   18.8  24.3  15.4  23.9  18.2  -     -     19.0  12.3  20.5  17.4  13.2  -     -     18.1  20.0  18.3
mGTE-MLM-base      63.5  68.7  79.5  78.2  71.4  -     -     78.1  55.7  66.2  62.4  60.8  -     -     60.9  73.2  67.8
ModernBERT-base    -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
ModernBERT-large   -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
EuroBERT-210m      67.2  68.1  78.2  80.0  68.9  -     -     77.9  52.1  51.3  60.8  59.1  -     -     56.4  73.4  65.4
EuroBERT-610m      72.5  69.5  80.3  79.8  73.9  -     -     79.0  55.5  60.9  61.6  62.5  -     -     59.0  75.8  68.6
EuroBERT-2.1B      72.5  65.4  77.6  77.6  69.2  -     -     75.0  53.0  58.1  61.5  59.3  -     -     57.6  72.9  66.1

Wikipedia
                   European Languages                              Extra-European Languages                  Average
Model              en    de    es    fr    it    nl    pl    pt    ar    hi    ja    ru    tr    vi    zh    Euro  World
XLM-RoBERTa-base   94.7  91.2  -     -     91.5  90.2  -     90.8  -     87.4  -     -     -     -     -     91.7  91.0
XLM-RoBERTa-large  95.4  93.5  -     -     93.4  93.4  -     92.5  -     90.7  -     -     -     -     -     93.6  93.1
XLM-RoBERTa-XL     97.9  96.5  -     -     96.6  96.3  -     96.0  -     94.5  -     -     -     -     -     96.7  96.3
mDeBERTa-v3-base   66.1  60.4  -     -     53.6  57.6  -     56.5  -     51.3  -     -     -     -     -     58.9  57.6
mGTE-MLM-base      96.7  93.9  -     -     94.7  93.9  -     93.6  -     92.0  -     -     -     -     -     94.6  94.1
ModernBERT-base    -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
ModernBERT-large   -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
EuroBERT-210m      97.7  94.7  -     -     94.5  95.6  -     95.7  -     88.6  -     -     -     -     -     95.6  94.4
EuroBERT-610m      98.3  95.9  -     -     96.0  96.6  -     96.1  -     92.6  -     -     -     -     -     96.6  95.9
EuroBERT-2.1B      99.0  96.1  -     -     96.0  96.0  -     95.9  -     92.0  -     -     -     -     -     96.6  95.8

CC-News
                   European Languages                              Extra-European Languages                  Average
Model              en    de    es    fr    it    nl    pl    pt    ar    hi    ja    ru    tr    vi    zh    Euro  World
XLM-RoBERTa-base   69.9  57.8  55.8  57.5  57.2  66.8  55.1  63.1  72.2  31.1  75.7  76.2  47.8  61.5  77.1  60.4  61.6
XLM-RoBERTa-large  77.3  68.6  69.1  70.1  70.8  75.9  69.6  75.3  82.8  51.0  82.3  83.3  61.1  73.7  81.2  72.1  72.8
XLM-RoBERTa-XL     84.4  77.4  79.7  79.1  79.6  83.8  79.5  84.0  88.1  59.5  87.4  88.8  72.1  82.9  86.8  80.9  80.9
mDeBERTa-v3-base   25.0  15.7  11.9  12.4  13.1  20.4  12.4  15.8  23.1  4.7   33.6  23.6  10.7  20.4  34.8  15.8  18.5
mGTE-MLM-base      76.1  68.7  72.8  70.1  68.4  76.5  65.1  74.3  79.6  32.5  85.1  83.3  56.7  72.3  88.2  71.5  71.3
ModernBERT-base    -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
ModernBERT-large   -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
EuroBERT-210m      80.0  66.9  69.2  69.7  65.9  73.5  57.8  69.1  76.7  17.0  82.4  79.7  52.1  57.4  90.9  69.0  67.2
EuroBERT-610m      84.0  72.9  76.4  75.5  73.8  79.6  70.9  79.9  84.0  49.7  84.9  85.1  62.4  66.7  88.1  76.6  75.6
EuroBERT-2.1B      85.8  73.1  77.1  76.9  73.8  79.0  70.3  79.7  84.2  49.9  88.2  86.5  60.7  63.0  89.9  76.9  75.9

Table 8: Detailed results on multilingual retrieval tasks (NDCG@10, in %).

XNLI
                   European Languages                              Extra-European Languages                  Average
Model              en    de    es    fr    it    nl    pl    pt    ar    hi    ja    ru    tr    vi    zh    Euro  World
XLM-RoBERTa-base   80.0  74.8  76.4  75.1  -     -     -     -     71.5  69.1  -     74.4  72.4  74.0  73.2  76.6  74.1
XLM-RoBERTa-large  87.1  83.0  83.5  82.9  -     -     -     -     80.4  77.8  -     81.2  80.6  80.2  80.6  84.1  81.7
XLM-RoBERTa-XL     89.0  85.5  85.3  84.5  -     -     -     -     81.5  80.5  -     83.1  82.4  83.3  82.3  86.1  83.7
mDeBERTa-v3-base   84.9  81.2  81.1  81.0  -     -     -     -     77.6  76.1  -     79.1  77.7  78.1  78.4  82.0  79.5
mGTE-MLM-base      81.1  76.9  78.5  77.2  -     -     -     -     73.6  71.3  -     75.4  72.9  75.9  75.5  78.4  75.8
ModernBERT-base    -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
ModernBERT-large   -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
EuroBERT-210m      83.5  77.8  79.4  78.9  -     -     -     -     74.3  70.6  -     76.6  74.2  75.1  75.3  79.9  76.6
EuroBERT-610m      87.8  82.9  84.6  83.6  -     -     -     -     79.5  76.7  -     82.0  80.3  80.8  80.7  84.7  81.9
EuroBERT-2.1B      89.6  85.5  86.4  85.8  -     -     -     -     82.8  79.9  -     83.3  83.0  82.3  82.3  86.8  84.1

PAWS-X
                   European Languages                              Extra-European Languages                  Average
Model              en    de    es    fr    it    nl    pl    pt    ar    hi    ja    ru    tr    vi    zh    Euro  World
XLM-RoBERTa-base   93.8  86.4  87.5  88.0  -     -     -     -     -     -     -     -     -     -     -     88.9  88.9
XLM-RoBERTa-large  95.5  91.0  91.4  91.8  -     -     -     -     -     -     -     -     -     -     -     92.4  92.4
XLM-RoBERTa-XL     95.8  91.9  91.7  92.3  -     -     -     -     -     -     -     -     -     -     -     92.9  92.9
mDeBERTa-v3-base   95.7  90.2  90.4  91.3  -     -     -     -     -     -     -     -     -     -     -     91.9  91.9
mGTE-MLM-base      94.7  87.5  88.2  88.8  -     -     -     -     -     -     -     -     -     -     -     89.8  89.8
ModernBERT-base    -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
ModernBERT-large   -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
EuroBERT-210m      95.6  86.5  88.7  88.9  -     -     -     -     -     -     -     -     -     -     -     89.9  89.9
EuroBERT-610m      95.6  90.0  91.3  92.0  -     -     -     -     -     -     -     -     -     -     -     92.2  92.2
EuroBERT-2.1B      96.2  91.6  91.8  92.5  -     -     -     -     -     -     -     -     -     -     -     93.0  93.0

QAM
                   European Languages                              Extra-European Languages                  Average
Model              en    de    es    fr    it    nl    pl    pt    ar    hi    ja    ru    tr    vi    zh    Euro  World
XLM-RoBERTa-base   69.0  67.4  -     67.3  -     -     -     -     -     -     -     -     -     -     -     67.9  67.9
XLM-RoBERTa-large  73.3  74.5  -     73.3  -     -     -     -     -     -     -     -     -     -     -     73.7  73.7
XLM-RoBERTa-XL     75.3  76.8  -     74.1  -     -     -     -     -     -     -     -     -     -     -     75.4  75.4
mDeBERTa-v3-base   71.2  73.5  -     71.5  -     -     -     -     -     -     -     -     -     -     -     72.1  72.1
mGTE-MLM-base      69.4  68.1  -     68.4  -     -     -     -     -     -     -     -     -     -     -     68.6  68.6
ModernBERT-base    -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
ModernBERT-large   -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
EuroBERT-210m      71.6  69.0  -     68.2  -     -     -     -     -     -     -     -     -     -     -     69.6  69.6
EuroBERT-610m      73.6  73.3  -     71.7  -     -     -     -     -     -     -     -     -     -     -     72.9  72.9
EuroBERT-2.1B      74.7  72.8  -     72.5  -     -     -     -     -     -     -     -     -     -     -     73.3  73.3

AmazonReviews
                   European Languages                              Extra-European Languages                  Average
Model              en    de    es    fr    it    nl    pl    pt    ar    hi    ja    ru    tr    vi    zh    Euro  World
XLM-RoBERTa-base   64.9  65.0  60.6  60.0  -     -     -     -     -     -     59.0  -     -     -     56.8  62.7  61.1
XLM-RoBERTa-large  66.9  67.0  62.4  61.5  -     -     -     -     -     -     62.1  -     -     -     58.5  64.5  63.1
XLM-RoBERTa-XL     67.1  67.8  62.4  61.5  -     -     -     -     -     -     63.7  -     -     -     59.2  64.7  63.6
mDeBERTa-v3-base   66.4  66.1  61.6  60.6  -     -     -     -     -     -     60.4  -     -     -     57.7  63.7  62.1
mGTE-MLM-base      65.0  65.3  60.9  59.5  -     -     -     -     -     -     61.2  -     -     -     57.4  62.7  61.5
ModernBERT-base    -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
ModernBERT-large   -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
EuroBERT-210m      65.9  65.4  60.5  60.2  -     -     -     -     -     -     60.4  -     -     -     57.7  63.0  61.7
EuroBERT-610m      66.7  66.4  61.6  61.2  -     -     -     -     -     -     61.7  -     -     -     58.1  64.0  62.6
EuroBERT-2.1B      66.5  67.8  62.8  60.9  -     -     -     -     -     -     62.4  -     -     -     59.0  64.5  63.2

MassiveIntent
                   European Languages                              Extra-European Languages                  Average
Model              en    de    es    fr    it    nl    pl    pt    ar    hi    ja    ru    tr    vi    zh    Euro  World
XLM-RoBERTa-base   89.1  86.1  87.0  86.6  87.5  87.5  86.8  87.3  79.0  86.2  86.5  87.3  85.6  86.7  86.0  87.2  86.3
XLM-RoBERTa-large  90.3  87.7  88.0  89.0  88.5  89.1  88.8  88.8  83.5  88.1  88.9  89.0  87.8  88.9  87.1  88.8  88.2
XLM-RoBERTa-XL     89.9  87.6  88.2  88.6  88.7  88.3  87.9  88.6  81.8  88.0  88.4  89.2  87.8  88.4  86.9  88.5  87.9
mDeBERTa-v3-base   88.1  86.4  86.9  87.3  87.6  88.0  87.0  86.9  79.8  86.0  87.3  87.3  85.9  86.5  85.9  87.3  86.5
mGTE-MLM-base      89.0  86.3  87.4  87.9  87.2  87.9  86.3  87.8  80.7  86.6  87.9  87.8  86.4  87.4  86.3  87.5  86.9
ModernBERT-base    -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
ModernBERT-large   -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
EuroBERT-210m      89.0  86.0  86.9  86.9  87.0  87.1  86.8  87.9  81.2  86.9  87.4  87.2  85.8  85.0  86.0  87.2  86.5
EuroBERT-610m      89.2  86.6  87.4  87.6  88.1  88.2  87.3  87.8  82.7  87.3  88.3  88.2  86.8  86.1  87.0  87.8  87.2
EuroBERT-2.1B      88.9  87.2  88.0  88.7  87.9  88.2  88.1  88.2  83.2  87.6  89.0  88.1  87.1  85.4  87.0  88.2  87.5

Table 9: Detailed results on multilingual sequence classification tasks (accuracy, in %).

SeaHorse
                   European Languages                              Extra-European Languages                  Average
Model              en    de    es    fr    it    nl    pl    pt    ar    hi    ja    ru    tr    vi    zh    Euro  World
XLM-RoBERTa-base   36.8  32.8  25.5  -     -     -     -     -     -     -     -     34.5  43.8  32.0  -     31.7  34.2
XLM-RoBERTa-large  39.2  35.5  28.1  -     -     -     -     -     -     -     -     39.3  51.0  35.0  -     34.3  38.0
XLM-RoBERTa-XL     40.5  36.1  30.5  -     -     -     -     -     -     -     -     42.2  53.1  35.4  -     35.7  39.6
mDeBERTa-v3-base   36.4  31.1  23.1  -     -     -     -     -     -     -     -     31.3  44.0  18.6  -     30.2  30.7
mGTE-MLM-base      50.8  55.8  60.7  -     -     -     -     -     -     -     -     67.6  65.8  59.5  -     55.8  60.0
ModernBERT-base    -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
ModernBERT-large   -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
EuroBERT-210m      54.2  58.5  65.1  -     -     -     -     -     -     -     -     68.2  69.3  60.2  -     59.3  62.6
EuroBERT-610m      56.1  59.6  69.7  -     -     -     -     -     -     -     -     71.8  73.2  60.7  -     61.8  65.2
EuroBERT-2.1B      59.0  62.5  71.2  -     -     -     -     -     -     -     -     74.4  75.3  61.9  -     64.2  67.4

SummEval
                   European Languages                              Extra-European Languages                  Average
Model              en    de    es    fr    it    nl    pl    pt    ar    hi    ja    ru    tr    vi    zh    Euro  World
XLM-RoBERTa-base   22.7  26.9  29.9  28.1  -     -     -     -     -     -     -     -     28.1  -     -     26.9  27.1
XLM-RoBERTa-large  43.2  37.3  33.6  40.2  -     -     -     -     -     -     -     -     38.6  -     -     38.6  38.6
XLM-RoBERTa-XL     29.8  36.3  19.8  35.1  -     -     -     -     -     -     -     -     32.6  -     -     30.3  30.7
mDeBERTa-v3-base   28.8  25.5  27.0  22.7  -     -     -     -     -     -     -     -     25.4  -     -     26.0  25.9
mGTE-MLM-base      49.0  41.6  44.4  39.5  -     -     -     -     -     -     -     -     34.7  -     -     43.7  41.9
ModernBERT-base    -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
ModernBERT-large   -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
EuroBERT-210m      57.5  36.9  44.4  46.4  -     -     -     -     -     -     -     -     30.1  -     -     46.3  43.1
EuroBERT-610m      57.4  52.4  62.4  57.5  -     -     -     -     -     -     -     -     44.8  -     -     57.4  54.9
EuroBERT-2.1B      65.5  58.8  51.3  61.9  -     -     -     -     -     -     -     -     51.6  -     -     59.4  57.8

Table 10: Detailed results on multilingual summary evaluation tasks (Spearman rank correlation, in %).

WMT
                   European Pairs                            Extra-European Pairs                                    Average
                   en-xx         xx-en         Other         en-xx                       xx-en
Model              en-de  en-pl  de-en  pl-en  de-fr  fr-de  en-ja  en-ru  en-tr  en-zh  ja-en  ru-en  tr-en  zh-en  Euro   World
XLM-RoBERTa-base   45.3   53.6   26.7   16.3   31.4   32.0   47.0   56.5   61.5   42.5   10.4   21.8   40.7   25.3   34.2   36.5
XLM-RoBERTa-large  50.7   66.0   30.7   15.8   41.1   29.8   52.1   61.2   66.2   47.2   11.0   24.8   45.4   29.0   39.0   40.8
XLM-RoBERTa-XL     55.1   66.9   35.9   18.7   46.7   43.3   56.3   64.9   64.8   52.2   13.1   27.0   47.7   30.6   44.4   44.5
mDeBERTa-v3-base   53.5   63.1   33.1   20.1   45.5   42.6   53.0   61.9   67.9   48.5   10.5   25.2   47.9   29.5   43.0   43.0
mGTE-MLM-base      48.6   55.2   30.4   18.5   37.5   35.9   48.3   57.2   59.7   45.5   10.6   23.4   41.5   27.3   37.7   38.5
ModernBERT-base    -      -      -      -      -      -      -      -      -      -      -      -      -      -      -      -
ModernBERT-large   -      -      -      -      -      -      -      -      -      -      -      -      -      -      -      -
EuroBERT-210m      52.9   58.4   33.2   17.5   40.6   40.3   51.1   57.9   57.3   48.3   14.3   26.7   44.3   30.8   40.5   41.0
EuroBERT-610m      52.9   61.1   32.4   18.2   42.6   39.2   51.3   59.4   62.3   48.6   12.2   26.6   44.1   29.7   41.1   41.5
EuroBERT-2.1B      49.1   57.8   29.8   19.3   38.3   38.5   47.8   56.9   56.5   45.0   10.7   23.5   41.3   27.5   38.8   38.7

Table 11: Detailed results on the WMT task (Spearman rank correlation, in %).
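For readers less familiar with the two metrics reported in Tables 8, 10, and 11, the following plain-Python sketch shows how NDCG@10 and the Spearman rank correlation are commonly computed. It is an illustration under stated assumptions, not the evaluation code used to produce the tables: the function names are hypothetical, the NDCG variant uses linear gains, and unless `all_relevances` is supplied the ideal ranking is built only from the documents in the ranked list.

```python
import math


def ndcg_at_k(ranked_relevances, all_relevances=None, k=10):
    """NDCG@k with linear gains for one query.

    `ranked_relevances` holds the relevance of each retrieved document in
    ranked order. `all_relevances` (optional, hypothetical argument) holds
    the relevance of every judged document for the query; when omitted,
    the ideal ranking is computed from the retrieved documents only.
    """
    dcg = sum(rel / math.log2(i + 2)
              for i, rel in enumerate(ranked_relevances[:k]))
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = sorted(pool, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman(predictions, targets):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(predictions), average_ranks(targets)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

A regression model whose predicted scores order the examples exactly as the reference scores do obtains a Spearman correlation of 1 regardless of the scale of its outputs, which is why the metric suits quality-estimation tasks such as WMT, SeaHorse, and SummEval.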