
arxiv

Paper 2503.03008

One Model to Train them All: Hierarchical Self-Distillation for Enhanced Early Layer Embeddings

Authors: Andrea Gurioli, Federico Pennino, João Monteiro, Maurizio Gabbrielli

Published: 2025-03-04

Abstract:

Deploying language models often requires handling model size vs. performance trade-offs to satisfy downstream latency constraints while preserving the model's usefulness. Model distillation is commonly employed to reduce model size while maintaining acceptable performance. However, distillation can be inefficient since it involves multiple training steps. In this work, we introduce MODULARSTARENCODER, a modular multi-exit encoder with 1B parameters, useful for multiple tasks within the scope of code retrieval. MODULARSTARENCODER is trained with a novel self-distillation mechanism that significantly improves lower-layer representations, allowing different portions of the model to be used while still maintaining a good trade-off in terms of performance. Our architecture focuses on enhancing text-to-code and code-to-code search by systematically capturing syntactic and semantic structures across multiple levels of representation. Specific encoder layers are targeted as exit heads, allowing higher layers to guide earlier layers during training. This self-distillation effect improves intermediate representations, increasing retrieval recall at no extra training cost. In addition to the multi-exit scheme, our approach integrates a repository-level contextual loss that maximally utilizes the training context window, further enhancing the learned representations. We also release a new dataset constructed via code translation, seamlessly expanding traditional text-to-code benchmarks with code-to-code pairs across diverse programming languages. Experimental results highlight the benefits of self-distillation through multi-exit supervision.

Paper Content:
Andrea Gurioli (1), Federico Pennino (1), João Monteiro (2), Maurizio Gabbrielli (1)
(1) University of Bologna, (2) Autodesk
Correspondence: andrea.gurioli5@unibo.it, joao.monteiro@autodesk.com
1 Introduction

Large language models (LLMs) have significantly impacted the field of natural language processing, demonstrating remarkable performance across various applications (Niu et al., 2023). However, the amount of computation required to operate state-of-the-art models poses significant challenges for their large-scale deployment.

To mitigate these challenges, the research community has explored several strategies to reduce the operational cost of LLMs without sacrificing their effectiveness. A prominent technique in model compression is quantization (Jacob et al., 2017; Lin et al., 2023; Egiazarian et al., 2024), which reduces the numerical precision of the model's parameters. Quantization effectively decreases memory requirements and enhances inference speed, facilitating the deployment of large language models in resource-constrained environments. Concurrently, knowledge distillation has emerged as a powerful technique whereby a smaller "student" model is trained to emulate the behavior of a larger "teacher" model, as evidenced by works such as DistilBERT (Sanh et al., 2019) and TinyBERT (Jiao et al., 2019). Additionally, pruning methods selectively eliminate less influential weights or neurons, further reducing model complexity while aiming to preserve performance (Han et al., 2015).

Recent efforts have increasingly focused on developing efficient architectures requiring fewer parameters. Model families such as LLaMA (Dubey et al., 2024), Qwen (Hui et al., 2024), Mistral (Jiang et al., 2023), and SmolLM (Allal et al., 2025) exemplify a paradigm shift towards smaller, more accessible architectures. These model families are deployed at various resolutions, ranging from lightweight variants optimized for heavily resource-constrained environments to larger versions that retain competitive performance.
In parallel, advancements in dynamic inference strategies have introduced mechanisms that further optimize computational efficiency. Techniques like multi-exit networks enable early predictions at intermediate layers, reducing unnecessary computations. For instance, early-exit architectures such as BranchyNet (Teerapittayanon et al., 2017) dynamically balance computation and accuracy by allowing predictions before full model execution. Similarly, Matryoshka representation learning (Kusupati et al., 2022) extends this idea to embeddings, introducing a loss function that yields multi-granular representations. This approach allows downstream tasks to adjust computational complexity by pruning embedding dimensionality, further contributing to efficient model deployment.

arXiv:2503.03008v1 [cs.CL] 4 Mar 2025

Figure 1: Overview of our multi-exit self-distillation encoder, shown here with exit heads at selected layers (e.g., Layers 4, 9, 18, 27, and 36). Each exit head predicts an output embedding and adds a "layer loss" contribution weighted by a coefficient $\alpha_i$, summed into the overall objective $\mathcal{L}$.

Building on these principles, we propose MODULARSTARENCODER, a modular multi-exit encoder architecture that integrates a novel intra-model self-distillation mechanism. In our design, specific intermediate layers are supervised by both the primary task loss and auxiliary distillation losses on specific exit heads, encouraging lower layers to learn better representations by mimicking the outputs of higher layers.
We apply a shared embedding head, comprising a masked language modeling head and an in-context classification head, across a chosen subset of layers. We then fine-tune the model with a different projection head for each exit point. We reach state-of-the-art results on multiple retrieval tasks (such as code-to-code and text-to-code) by fine-tuning one single modular model that can be sliced depending on the end user's computational constraints.

Our contributions are as follows:

• We introduce a self-distillation framework that enables training multiple model resolutions within a unified layer stack, reducing redundancy and improving scalability. We believe this approach can significantly affect LLM training pipelines that depend on multiple model distillations.

• We train and release MODULARSTARENCODER, which consists of a pre-trained [1] and a fine-tuned [2] encoder. The former is a modular pre-trained encoder with up to 1 billion parameters and five exit points, allowing users to perform multiple-exit fine-tuning depending on downstream tasks. The latter is a fine-tuned encoder for various retrieval tasks. We allow the user to choose either the entire model with 1 billion parameters or a model size that fits their memory and computational limitations.

• We release SYNTHCODE2CODE2NL, a new dataset [3] constructed via code translation, expanding popular text-to-code datasets across diverse programming languages with code-to-code pairs. SYNTHCODE2CODE2NL comprises 1 071 367 triplets of natural language, code, and code.

2 Methodology

2.1 Dataset

In the pre-training phase, we leveraged The Stack V2 (Lozhkov et al., 2024), a large open-source code dataset structured by repository.
[1] https://huggingface.co/modularStarEncoder/ModularStarEncoder
[2] https://huggingface.co/modularStarEncoder/ModularStarEncoder-finetuned
[3] https://huggingface.co/datasets/modularStarEncoder/SynthCode2Code2NL

Table 1: SYNTHCODE2CODE2NL details: average character count and sample size per language for the CodeSearchNet (CSN) dataset and the synthesized portion obtained through translation.

Language | CSN samples | CSN avg. char | Synth. samples | Synth. avg. char
English | 1 071 367 | 180 | - | -
PHP | 280 706 | 514 | 116 967 | 579
Python | 274 454 | 474 | 117 374 | 518
Go | 234 089 | 350 | 124 125 | 541
Java | 282 118 | 505 | 116 098 | 707
C++ | - | - | 141 956 | 938
Ruby | - | - | 158 494 | 456
C | - | - | 136 365 | 1029
JavaScript | - | - | 159 988 | 557

Prompt: Translate this ''' print("Hello World") ''' from Python to Rust. Here is the translated code '''
Qwen2.5-Coder-7B-Instruct output: fn main() { println!("Hello World!"); }

Figure 2: Prompt provided to Qwen2.5-Coder-7B-Instruct for translating a given code snippet (print("Hello World") in the example) from a source programming language (Python) to a target one (Rust).

For the fine-tuning stage, we created SYNTHCODE2CODE2NL, a dataset that supports text-to-code and code-to-code search. Using the popular CodeSearchNet (Husain et al., 2019) as a seed dataset and selecting popular programming languages (Python, Java, Go, and PHP), we augmented it by transpiling the available code snippets into other languages.

To generate semantically similar code snippets for code-to-code search, we translated each snippet into a different language randomly sampled from Go, Ruby, Python, Java, C++, PHP, C, and JavaScript. We prompted the Qwen2.5-Coder-7B-Instruct model with the source code, the name of the source language, and the name of the target language (see Figure 2). During code translation, we choose the token with the highest probability as output (greedy search) to prevent semantic discrepancies.
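The translation setup described above (a random target language plus a fixed prompt template) can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, and the exact delimiters and line breaks of the prompt are inferred from Figure 2 rather than confirmed by the authors.

```python
import random

# Target languages listed in the text for the synthesized code column.
TARGET_LANGUAGES = ["Go", "Ruby", "Python", "Java", "C++", "PHP", "C", "JavaScript"]


def pick_target_language(source_language, rng):
    """Sample a target language that differs from the source language."""
    candidates = [lang for lang in TARGET_LANGUAGES if lang != source_language]
    return rng.choice(candidates)


def build_translation_prompt(code, source_language, target_language):
    """Assemble a prompt shaped like the one shown in Figure 2.

    Assumption: the triple-quote delimiters and line breaks mirror the
    figure; the authors' exact template is not given in the text.
    """
    return (
        f"Translate this\n'''\n{code}\n'''\n"
        f"from {source_language} to {target_language}.\n"
        f"Here is the translated code\n'''\n"
    )


rng = random.Random(0)  # fixed seed so the sketch is reproducible
source = "Python"
target = pick_target_language(source, rng)
prompt = build_translation_prompt('print("Hello World")', source, target)
print(prompt)
```

The prompt would then be passed to the translation model with greedy decoding, as stated above; the generation call itself is omitted because it depends on the serving stack.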
This process yielded pairs of code snippets in distinct languages tied to the same natural language description. As a result, every sample in the fine-tuning dataset includes a natural language description and two code snippets from distinct languages. SYNTHCODE2CODE2NL contains 1 071 367 samples. The first code column contains snippets taken directly from CodeSearchNet (Python, Java, PHP, and Go). The third column, synthesized via code translation, includes Go, Ruby, JavaScript, Python, C++, PHP, C, and Java code snippets.

After a manual inspection, we discovered that both columns contained code snippets that differed only in identifiers or function arguments. Several tasks were semantically identical but paraphrased with different parameter requirements (e.g., two paraphrases of the same task asked for opening a socket on a different port). During the preprocessing phase of SYNTHCODE2CODE2NL, motivated by the dataset's redundancy and by preliminary experiments showing that deduplication benefits the model's performance, we near-deduplicated the dataset using both the CodeSearchNet code column and the synthesized code column. For near-deduplication, we relied on Locality Sensitive Hashing (LSH) with a Jaccard similarity threshold of 0.7 and 256 permutations, analyzing character-level 5-grams.

Table 1 shows the average number of characters per language in SYNTHCODE2CODE2NL. We emphasize that synthesized data is significantly longer than human-written code and might differ stylistically from it; we discuss this further in Section 5. Appendix A provides examples of code translation.
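To make the near-deduplication step more concrete, the following is a minimal MinHash-LSH sketch in pure Python. It uses the parameters stated above (256 permutations, Jaccard threshold 0.7, character-level 5-grams). The banding configuration (32 bands of 8 rows), the hash functions, and the keep-first policy are assumptions made for illustration, not the authors' implementation.

```python
import hashlib
import random
from collections import defaultdict

NUM_PERM = 256        # permutations, as stated in the text
THRESHOLD = 0.7       # Jaccard threshold, as stated in the text
NGRAM = 5             # character-level 5-grams, as stated in the text
BANDS, ROWS = 32, 8   # assumed banding (32 * 8 = 256); not given in the paper
PRIME = (1 << 61) - 1

_rng = random.Random(42)
_COEFFS = [(_rng.randrange(1, PRIME), _rng.randrange(0, PRIME)) for _ in range(NUM_PERM)]


def shingles(text):
    """Set of character-level 5-grams of a snippet."""
    return {text[i:i + NGRAM] for i in range(max(1, len(text) - NGRAM + 1))}


def base_hash(shingle):
    """Stable 64-bit hash (Python's built-in str hash is salted per process)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash(text):
    """MinHash signature: for each permutation, the minimum hashed shingle."""
    hashed = [base_hash(s) for s in shingles(text)]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in _COEFFS]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / NUM_PERM


def near_deduplicate(snippets):
    """Return indices of kept snippets (first member of each near-duplicate group)."""
    buckets = defaultdict(list)   # (band index, band values) -> kept indices
    signatures, kept = {}, []
    for idx, text in enumerate(snippets):
        sig = minhash(text)
        bands = [(b, tuple(sig[b * ROWS:(b + 1) * ROWS])) for b in range(BANDS)]
        candidates = {j for band in bands for j in buckets[band]}
        if any(estimated_jaccard(sig, signatures[j]) >= THRESHOLD for j in candidates):
            continue  # near-duplicate of an already kept snippet
        signatures[idx] = sig
        kept.append(idx)
        for band in bands:
            buckets[band].append(idx)
    return kept


snippets = [
    'def open_socket(host):\n    """Open a TCP connection to the given host."""\n'
    "    return socket.create_connection((host, 8080))",
    'def open_socket(host):\n    """Open a TCP connection to the given host."""\n'
    "    return socket.create_connection((host, 9090))",
    "func Sum(xs []int) int {\n    total := 0\n    for _, x := range xs {\n"
    "        total += x\n    }\n    return total\n}",
]
kept = near_deduplicate(snippets)
print(kept)
```

The first two snippets differ only in the port number, mirroring the socket example in the text, so the second one is expected to be flagged as a near-duplicate of the first. Banding only proposes candidate pairs; the final decision uses the estimated Jaccard similarity against the 0.7 threshold.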
2.2 Architecture

We updated the first version of StarEncoder (Li et al., 2023) by enabling longer code snippets (up to 2 048 tokens of context length), increasing the model size from ≈125M to ≈1B parameters, and utilizing state-of-the-art methodologies (Warner et al., 2024; Lozhkov et al., 2024), resulting in MODULARSTARENCODER. We built MODULARSTARENCODER on top of StarCoder-2 (Lozhkov et al., 2024), applying several modifications to the model. We reduced its size from 15B to 1B parameters. Our architecture comprises 36 hidden layers and adopts Grouped Query Attention (GQA) (Ainslie et al., 2023) with 16 attention heads and 4 key-value heads.

Figure 3: On the left side, the illustration of the in-context loss framework, where samples from different repositories are concatenated. Positive examples share the same repository context, while negative examples come from different repositories. On the right side, the in-context loss framework pseudocode.

Diagram labels: Positive example (Sample 1 repo 1, Sample 2 repo 1, Sample 3 repo 1); Negative example (Sample 1 repo 1, Sample 1 repo 2, Sample 2 repo 1, Sample 2 repo 2, Sample 3 repo 1); samples are joined with SEP and CLS tokens.

Pseudocode from Figure 3:

    def build_examples(repo_1, repo_2):
        is_negative = random(0, 1)
        input = empty_string
        while not empty(repo_1):
            input += <sep_token>
            # Positive case
            pos_sample = sample_random_snippet(repo_1)
            if len(input) + len(pos_sample) > context_length - 1:
                break
            input += pos_sample
            if is_negative and not empty(repo_2):
                # Negative (snippets from different repos)
                input += <sep_token>
                neg_sample = sample_random_snippet(repo_2)
                if len(input) + len(neg_sample) > context_length - 1:
                    break
                input += neg_sample
            input += <cls_token>
        return input
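The pseudocode of Figure 3 relies on undefined helpers, so a runnable variant may help clarify the intended behavior. The sketch below works on token lists and rests on several assumptions that are not stated in the paper: the special-token strings are placeholders, snippets are sampled without replacement, a SEP token is only added when the following snippet fits, a single CLS token closes each example (as the diagram suggests), and the returned label reflects whether a snippet from the second repository was actually inserted.

```python
import random

SEP, CLS = "<sep>", "<cls>"  # placeholder special tokens (assumed names)


def build_example(repo_1, repo_2, context_length, rng):
    """Concatenate tokenized snippets into one in-context classification example.

    repo_1 and repo_2 are lists of snippets; each snippet is a list of tokens.
    Returns (tokens, is_negative).
    """
    repo_1, repo_2 = list(repo_1), list(repo_2)  # do not mutate the inputs
    want_negative = rng.random() < 0.5           # 50% negatives, as in the text
    tokens = []
    used_repo_2 = False

    def try_append(repo):
        """Pop a random snippet and append SEP + snippet if it still fits."""
        snippet = repo.pop(rng.randrange(len(repo)))
        # Keep one position free for the closing CLS token.
        if len(tokens) + 1 + len(snippet) > context_length - 1:
            return False
        tokens.append(SEP)
        tokens.extend(snippet)
        return True

    while repo_1:
        if not try_append(repo_1):
            break
        if want_negative and repo_2:
            if not try_append(repo_2):
                break
            used_repo_2 = True
    tokens.append(CLS)  # single CLS per example (assumption based on the diagram)
    return tokens, used_repo_2


rng = random.Random(0)
repo_a = [["def", "f", "(", ")", ":"], ["return", "1"], ["x", "=", "2"]]
repo_b = [["func", "g", "(", ")"], ["var", "y"]]
tokens, is_negative = build_example(repo_a, repo_b, context_length=16, rng=rng)
print(len(tokens), is_negative)
```

Packing several snippets into one input is what raises the average input length and reduces padding; the classification target for the ICC head is the returned label.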
MODULARSTARENCODER relies upon Rotary Positional Encoding (RoPE) (Su et al., 2021) with a base period $\theta = 10^{-6}$ and features a hidden dimensionality of 1024 with an intermediate size of 12 288.

We followed Devlin et al. (2019) and replaced the causal self-attention in StarCoder-2 with bidirectional self-attention. Aiming for modularity, we also replaced sliding window attention with full attention. This step was taken to avoid the limited receptive field of sliding window mechanisms (Zhu et al., 2021). Finally, our implementation integrates FlashAttention V2 (Dao, 2023) for faster inference. Table 2 summarizes the architectural details.

2.3 Pre-training

We pre-trained MODULARSTARENCODER with a batch size of 3.99M tokens for 245 000 training steps, processing ≈1T tokens. We conducted pre-training and fine-tuning on 512 NVIDIA Ampere (64GB) GPUs using the Leonardo supercomputer (Turisini et al., 2023), requiring 450 000 GPU working hours.

To enable both token-level and snippet-level embeddings after pre-training, we employed a multi-objective pre-training strategy that combined two losses, as detailed in Section 2.3.1 and Section 2.3.2. The pre-training was performed on The Stack V2, whose context length analysis revealed an average of ≈630 tokens per code snippet. As described in Section 2.3.1, we concatenated multiple snippets to facilitate our multi-loss methodology, allowing our in-context classification loss to expand the average context window to ≈1300 tokens, reaching the maximum context length 20% of the time.

We used the AdamW optimizer with β1 set to 0.9, β2 to 0.95, ϵ to 1e-6, and a weight decay of 1e-1. We initialized the learning rate at 6.24e-4 and decreased it using a multi-step learning rate scheduler (Bi et al., 2024) with 4 000 warmup steps.
The learning rate was reduced at 120 000, 185 000, 220 000, 230 000, and 240 000 training steps, applying a decay factor of 0.36, and from step 185 000 onward, further reduced by factors of 0.1, 0.031, 0.01, and 0.001. Table 2 summarizes the hyperparameters for architecture, pre-training, and fine-tuning.

Table 2: Hyperparameters for Architecture, Pre-training, and Fine-tuning

Hyperparameter | Value
Architecture |
Model size | 1B parameters
Precision | bfloat16
Hidden layers | 36
Attention heads | 16
Hidden dimensionality | 1024
Positional encoding | RoPE ($\theta = 10^{-6}$)
Context length | 2048
Attention mechanism | Grouped-Query Attention
Attention pattern | Bi-directional
Pre-training |
Batch size | 3.99M tokens
Pretraining steps | 245 000
Pretraining tokens | 1T
Loss function | MLM + In-Context loss
Multi-layer loss | yes
Optimizer | AdamW
Weight decay | 1e-1
Initial learning rate | 6.24e-4
Learning rate schedule | Multi-step
Warmup steps | 4 000
Fine-tuning |
Dataset size | 635 404 samples
Fine-tuning steps | 20 000
Loss function | CLIP loss
Multi-layer loss | yes
Batch size | 2 048
Learning rate | 1.0e-5
Temperature parameter | 10.0
Hardware (pre-training + fine-tuning) |
GPUs | 512 NVIDIA Ampere (64GB)
Overall training hours | 450 000

2.3.1 Masked Language Modeling and In-Context Classification

The training objectives of BERT (Devlin et al., 2019), specifically Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), have become a de facto standard. However, the NSP loss constrains the context window length to the sentence length, leading to too many padding tokens and redundant computation (Zeng et al., 2022), and has been shown not to yield significant benefits after fine-tuning (Warner et al., 2024; Aroca-Ouellette and Rudzicz, 2020). Given that the average number of tokens per data sample in The Stack V2 is 630, a large context window of 2048 results in substantial padding, making long-context training inefficient. While Wang et al.
(2023) demonstrated the advantages of training LLMs with multiple objectives, we revisited the NSP loss and introduced an in-context classification (ICC) objective. We hypothesize that predicting whether multiple code snippets belong to the same context (in our case, the same repository) can enhance semantic search performance while allowing efficient concatenation of multiple code fragments. Our final training objective is the summation of two losses, (1) the MLM loss and (2) the ICC loss: $\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{ICC}}$.

In $\mathcal{L}_{\mathrm{MLM}}$, a certain percentage of tokens are randomly masked and predicted using a classification head. Following Zhang et al. (2024), we adopt a 15% masking rate with the standard 80-10-10 token replacement strategy (Devlin et al., 2019). The secondary objective, $\mathcal{L}_{\mathrm{ICC}}$, determines whether randomly concatenated inputs (separated by a <SEP> token) originate from the same repository (see Figure 3). Each concatenated sample has a 50% probability of containing source code from different repositories. This approach increases input density (reducing padding by expanding the average input length from 630 to 1 300 tokens) and potentially enhances cross-language understanding. Since repositories are inherently modular and often contain files written in multiple languages, learning from repository-level context may improve inter-language generalization.

2.3.2 Multi-layer Loss

To achieve layer-wise modularity in transformer architectures, we apply the previously introduced loss (Section 2.3.1) across a selected set of layers, sharing classification heads (masked language modeling and in-context classification) while incorporating a positional embedding of the layer index. The total loss is computed as the sum of individual layer losses, weighted by a factor $\alpha_i$ to prioritize deeper layers: $\mathcal{L} = \sum_{i \in \iota} \alpha_i \mathcal{L}_i$, where $\alpha_i = i/|I|$, $I = \{1, \dots, 36\}$ represents all layers, and the selected subset $\iota = \{4, 9, 18, 27, 36\}$ defines the layers where the loss is applied.
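The weighting scheme of the multi-layer loss can be illustrated with a few lines of Python. The weights follow directly from the definition above; the per-exit loss values are hypothetical numbers chosen only to show the arithmetic.

```python
NUM_LAYERS = 36                    # |I|: all layers of the encoder
EXIT_LAYERS = (4, 9, 18, 27, 36)   # subset of layers where the loss is applied


def layer_weight(layer_index, num_layers=NUM_LAYERS):
    """alpha_i = i / |I|: deeper exits receive a larger weight."""
    return layer_index / num_layers


def multi_layer_loss(layer_losses):
    """Weighted sum over exits, L = sum_i alpha_i * L_i.

    `layer_losses` maps an exit layer index to its scalar loss
    (during pre-training, L_i = L_MLM + L_ICC computed at that exit).
    """
    return sum(layer_weight(i) * loss for i, loss in layer_losses.items())


weights = {i: layer_weight(i) for i in EXIT_LAYERS}
# Hypothetical per-exit losses, made up for illustration only.
example_losses = {4: 2.0, 9: 1.6, 18: 1.2, 27: 1.0, 36: 0.9}
total = multi_layer_loss(example_losses)
print(weights)
print(total)
```

With this definition the last layer has weight 1 and the "tiny" exit at layer 4 has weight 1/9, so shallow exits contribute to the objective without dominating it.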
The selected subset was chosen to enable four model variants equally spaced in depth (9, 18, 27, 36), along with an additional "tiny" version (4) to assess model performance with a lower parameter count. This approach allows for flexible model deployment, enabling adaptive layer pruning while maintaining performance trade-offs.

2.4 Fine-tuning

Following Su et al. (2023), we fine-tune a single model for both text-to-code and code-to-code retrieval using instruction prompting. The optimization objective combines CLIP loss (Radford et al., 2021) with the multi-layer loss (Section 2.3.2). To enhance representation learning, we replace the single-head projection of the multi-layer loss with five distinct projection heads, applied at different exit points of the pre-trained model (layers 4, 9, 18, 27, and 36). We used a batch of 2 048 elements, ensuring that text-to-code and code-to-code pairs were equally distributed across the batch.

Table 3: Performance of different models on text-to-code with CodeSearchNet using CodeXGLUE. Results for CodeT5+, UniXcoder, and ModernBERT are those reported in the respective papers (Wang et al., 2023; Guo et al., 2022; Warner et al., 2024). Per-language columns report MRR on CodeSearchNet.

Model | Ruby | JS | Go | Python | Java | PHP | avg. MRR | avg. NDCG
MODULARSTARENCODER | 74.1 | 74.0 | 82.5 | 92.5 | 78.7 | 84.5 | 81.0 | 84.2
Codet5+ 770M | 78.0 | 71.3 | 92.7 | 75.8 | 76.2 | 70.1 | 77.4 | -
OpenAI text-embedding-3-large | 84.7 | 85.3 | 95.9 | 99.8 | 90.1 | 95.6 | 91.9 | 93.3
Unixcoder | 74.0 | 68.4 | 91.5 | 72.0 | 72.6 | 67.6 | 74.4 | -
ModernBERT-large | - | - | - | - | - | - | - | 59.5

We performed data augmentation by randomly replacing frequently occurring words (appearing more than twice and having at least three characters) with random strings. We applied the augmentation exclusively to code snippets in 30% of cases, leaving natural language descriptions unchanged. After conducting a grid search, we selected 1.0e-5 as the learning rate, kept constant throughout the fine-tuning process, and set the temperature parameter to 10.0.
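For readers unfamiliar with the CLIP objective, the following dependency-free sketch shows a symmetric contrastive loss over a batch of aligned (query, code) embedding pairs. It is a generic illustration, not the authors' training code. In particular, it assumes that cosine similarities are multiplied by the temperature value 10.0 (the paper reports the value but not how it is applied), and it covers a single exit head; in the multi-exit setting, one such loss per exit would be combined using the layer weights of Section 2.3.2.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_loss(query_embs, code_embs, temperature=10.0):
    """Symmetric contrastive loss over a batch of aligned pairs.

    Pair i is (query_embs[i], code_embs[i]); every other row in the batch
    acts as an in-batch negative. Assumption: cosine similarities are
    multiplied by `temperature`, i.e. it is used as a logit scale.
    """
    q = [l2_normalize(v) for v in query_embs]
    c = [l2_normalize(v) for v in code_embs]
    n = len(q)
    logits = [[temperature * sum(a * b for a, b in zip(qi, cj)) for cj in c] for qi in q]
    # Query -> code direction: the correct code for row i sits in column i.
    loss_q2c = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Code -> query direction: same similarity matrix, transposed.
    transposed = [list(col) for col in zip(*logits)]
    loss_c2q = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return (loss_q2c + loss_c2q) / 2


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_loss(queries, [[0.9, 0.1], [0.1, 0.9]])
shuffled = clip_loss(queries, [[0.1, 0.9], [0.9, 0.1]])
print(aligned, shuffled)
```

The loss is small when each query is most similar to its own code snippet and grows when the pairing is shuffled, which is the behavior the retrieval fine-tuning relies on.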
2.5 Evaluation

We evaluated the fine-tuned MODULARSTARENCODER on both text-to-code and code-to-code retrieval tasks using CodeXGLUE (Lu et al., 2021), which comprises several benchmarking datasets. For text-to-code retrieval, we employed the CodeSearchNet dataset, where the goal is to retrieve the most relevant code snippet given a natural language query. Specifically, the query corresponds to a documentation comment, and the model is tasked with ranking the correct code snippet among 999 distractor snippets (Husain et al., 2019). This setup assesses the model's ability to learn meaningful cross-modal representations between code and natural language.

For code-to-code retrieval, we relied on two datasets from CodeXGLUE: the Code Translation (CT) benchmark and POJ-104. The Code Translation dataset consists of semantically equivalent code snippets in different programming languages, and we framed the task as cross-language code retrieval rather than translation. In this setting, given a Java code snippet as a query, the model retrieves the corresponding C# implementation, testing its capability to capture cross-lingual semantic similarities between functionally equivalent programs.

In contrast, with the POJ-104 dataset, we evaluate the model on intra-language semantic search (POJ-104 contains only C++ snippets), where programs solve the same problem but with different implementations. This setup evaluates the model's capacity to generalize across structural variations while preserving semantic equivalence.

Table 4: Performance of different models on Code Translation (CT) and POJ104 for code-to-code search with the CodeXGLUE dataset.
Model | CT (MRR) | POJ104 (mAP)
MODULARSTARENCODER | 98.9 | 56.5
Codet5+ 110M-embedding | 98.4 | 24.5
OpenAI text-embedding-3-large | 98.8 | 82.9
Unixcoder | 97.6 | 41.0
ModernBERT-large | 93.1 | 27.3

3 Results and Discussion

3.1 Benchmarks

Table 3 presents the results for the CodeSearchNet text-to-code (t2c) task in terms of Mean Reciprocal Rank (MRR) for each language, average NDCG, and average MRR. Results for UniXcoder, ModernBERT, and CodeT5+ are reported from the original papers (Guo et al., 2022; Warner et al., 2024; Wang et al., 2023). On CodeSearchNet, MODULARSTARENCODER achieves an MRR of 81.0 and an NDCG of 84.2, outperforming CodeT5+ (770M) (Wang et al., 2023), UniXcoder (Guo et al., 2022), and ModernBERT-large (Warner et al., 2024). The only encoder that surpasses MODULARSTARENCODER is OpenAI's text-embedding-3-large.

Table 4 presents results on both the POJ104 and CT datasets, reported respectively as mean average precision for POJ104 (C++ to C++ retrieval) and as MRR for code translation (Java to C# retrieval). MODULARSTARENCODER reaches the best performance on the CT benchmark. We decided to replicate the benchmarking for all models in a zero-shot setting for code-to-code tasks because our model does not include the POJ104 and code translation datasets in its training set.

Referring to Table 4, on the POJ104 dataset in the zero-shot setting, MODULARSTARENCODER achieves an mAP of 56.5, which is state of the art among open-source models; however, it is significantly behind OpenAI text-embedding-3-large. We underscore that a direct comparison with OpenAI text-embedding-3-large remains challenging because it is closed-source, and details such as model size, training methodology, or potential data contamination are undisclosed.

Table 5: Performance comparison of MODULARSTARENCODER layers and baseline fine-tuned models on the CodeSearchNet benchmark. The table displays the retrieval performance measured by Mean Reciprocal Rank (MRR). We refer to MODULARSTARENCODER, fine-tuned with multiple exit points simultaneously, as self-distilled. The models not marked as self-distilled are the baselines, fine-tuned individually for each exit point.

Model | Size | Ruby | Javascript | Go | Python | Java | PHP | avg. MRR
Layer-4 | ≈160M | 59.5 | 61.3 | 72.1 | 86.2 | 68.2 | 75.5 | 70.5
Layer-4 (self-distilled) | ≈160M | 62.2 | 64.7 | 74.8 | 88.1 | 71.4 | 78.0 | 73.2
Layer-9 | ≈300M | 64.9 | 65.7 | 74.3 | 87.3 | 72.0 | 78.8 | 73.8
Layer-9 (self-distilled) | ≈300M | 67.6 | 69.4 | 78.9 | 90.2 | 75.5 | 82.3 | 77.3
Layer-18 | ≈550M | 73.8 | 73.5 | 82.4 | 92.1 | 78.4 | 84.0 | 80.7
Layer-18 (self-distilled) | ≈550M | 74.1 | 74.0 | 82.5 | 92.5 | 78.7 | 84.5 | 81.0
Layer-27 | ≈800M | 72.3 | 71.8 | 80.8 | 90.8 | 76.9 | 82.3 | 79.1
Layer-27 (self-distilled) | ≈800M | 73.2 | 73.3 | 81.7 | 92.1 | 77.8 | 83.8 | 80.3
Layer-36 | ≈1B | 72.3 | 72.9 | 80.7 | 91.5 | 77.1 | 82.9 | 79.5
Layer-36 (self-distilled) | ≈1B | 73.5 | 72.6 | 80.5 | 91.4 | 76.9 | 82.7 | 79.6

3.2 Ablation Study

We conducted an ablation study by fine-tuning each exit point individually (also starting from the pre-trained MODULARSTARENCODER) and pruning the subsequent layers (e.g., for the baseline on layer 18, we retain only the first 18 layers and fine-tune the model using just one projection head on that layer). Finally, we compared the sliced models with the corresponding results (self-distilled) of the model fine-tuned with the multi-layer loss (MODULARSTARENCODER). MODULARSTARENCODER consistently outperforms the single-exit baselines, indicating that lower-level layers benefit from training signals propagated from deeper layers. This behavior is highlighted in Table 5, where MODULARSTARENCODER, indicated as self-distilled, outperforms the single-exit baseline in average MRR at every exit point. This finding underscores a promising new direction in self-distillation for large-scale code and text models, enabling high performance even in more compact configurations.
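Since the comparisons in Tables 3 and 5 are expressed as Mean Reciprocal Rank, a small reference computation may be useful. The sketch below assumes that each query comes with a list of similarity scores over its candidate pool (the correct snippet plus the distractors) and that ties are resolved in favor of the correct candidate; the tie-handling choice is an assumption, not something the paper specifies.

```python
def reciprocal_rank(scores, correct_index):
    """1 / rank of the correct candidate when sorted by descending score."""
    target = scores[correct_index]
    # Rank = 1 + number of candidates scored strictly higher than the target.
    rank = 1 + sum(1 for s in scores if s > target)
    return 1.0 / rank


def mean_reciprocal_rank(all_scores, correct_indices):
    """Average reciprocal rank over all queries."""
    pairs = list(zip(all_scores, correct_indices))
    return sum(reciprocal_rank(s, i) for s, i in pairs) / len(pairs)


# Two toy queries with three candidates each (the benchmark uses 1 000).
scores_per_query = [
    [0.9, 0.2, 0.1],  # correct candidate (index 0) is ranked first
    [0.3, 0.8, 0.5],  # correct candidate (index 2) is ranked second
]
mrr = mean_reciprocal_rank(scores_per_query, [0, 2])
print(mrr)  # 0.75
```

In the CodeSearchNet setup described in Section 2.5, each score list would contain 1 000 entries (the correct snippet and 999 distractors), and the tables report the resulting value multiplied by 100.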
Moreover, Figure 4 illustrates that MODULARSTARENCODER maintains robust performance from layers 18 to 36, allowing users to scale down the network to match their memory, computational, or latency constraints while preserving strong retrieval accuracy.

4 Related work

Since the introduction of ELMo (Peters et al., 2018), deep contextual information has improved the generation of embeddings for textual retrieval and classification, reaching state-of-the-art results in several tasks. BERT (Devlin et al., 2019) followed those findings, adapting the Transformer architecture (Vaswani et al., 2017) to enable a bi-directional representation with two training objectives, namely the masked language modeling and the next sentence prediction losses. Lan et al. (2019) and Liu et al. (2019) adapted the BERT architecture to obtain an enhanced pre-trained model by removing or modifying the NSP loss and by focusing on pre-training data or hyperparameter optimization. More recently, ModernBERT (Warner et al., 2024) bridged the gap with the advancements of modern decoders (Jiang et al., 2023; Hui et al., 2024; Dubey et al., 2024; Touvron et al., 2023; Lozhkov et al., 2024), which rely upon models with an increased number of parameters, trained on more tokens and capable of handling longer contextual information.

In code representation, large language models must be adapted by training them on a curated corpus focused on software and by leveraging code's syntactic and semantic structures, which differ significantly from natural language. Feng et al. (2020) adapted the BERT architecture to produce semantically meaningful embeddings for source code, resulting in CodeBERT.
This was accomplished by including more source code in the training set and focusing on a training loss that can leverage bimodal (natural language and code) contextual information (Clark et al., 2020). GraphCodeBERT enhanced CodeBERT (Feng et al., 2020) representations by incorporating data flow graphs, capturing dependencies between variables and operations, and improving tasks like code summarization and clone detection. UniXcoder (Guo et al., 2022) extended this by introducing a unified encoder-decoder framework, integrating abstract syntax trees (ASTs) and data flow information. Wang et al. (2023) expanded these findings with CodeT5+, stressing how multiple losses that leverage code semantics impact the model pretraining. The work incorporated text-code contrastive learning, text-code matching, and text-code causal LM for better code understanding and generation.

Figure 4: Performance Comparison Across Layers: The graph illustrates the MRR and the Recall@1 for different layers, comparing baseline models and a self-distilled model. Panels: (a) MRR, (b) Recall@1; x-axis: Layer. Gains annotated on the self-distilled curve, in plot order: (a) +2.72%, +3.47%, +0.40%, +1.18%, +0.00%; (b) +3.11%, +4.36%, +0.58%, +1.46%, +0.27%.

When trying to achieve better performance, research has shifted toward models with a high number of parameters. While this trend appears effective from a performance perspective, end users may face computational or memory limitations, as LLMs vary from millions to billions of parameters. Sanh et al. (2019) pioneered the introduction of knowledge distillation, using a "teacher" model that guides a smaller model to emulate its behavior. This methodology has been widely adopted and improved upon recently (DeepSeek-AI et al., 2025; Hui et al., 2024), becoming a standard for obtaining high-performing smaller LLMs.
Our work differs from previous work by adapting a modern architecture (Lozhkov et al., 2024) to an encoder-only code model and introducing a novel "self-distillation" mechanism. We replace the next sentence prediction loss with an in-context classification loss focused on the repository level and expand the context to 2048 tokens. Our novel self-distillation mechanism improves low-level layers, resulting in a modular transformer architecture without additional teacher models or further data for distillation.

5 Conclusion

In this work, we introduced MODULARSTARENCODER, a modular multi-exit encoder architecture designed to improve efficiency and scalability in code retrieval tasks. By integrating an intra-model self-distillation mechanism, our approach enables multiple resolution models to be trained within a unified layer stack, reducing redundancy while maintaining high retrieval performance. Our evaluation on CodeXGLUE demonstrates that MODULARSTARENCODER achieves state-of-the-art results among open-source models, outperforming prior baselines across text-to-code and code-to-code retrieval tasks. Ablations further highlighted the benefits of self-distillation, showing that lower layers gain representational strength from deeper layers, leading to superior performance compared to single-exit models.

Beyond performance gains, MODULARSTARENCODER offers practical benefits by providing multiple exit points, allowing users to balance computational efficiency and accuracy based on resource constraints. The results suggest that self-distillation provides a promising direction for efficient large-scale encoders, reducing deployment costs without sacrificing effectiveness.

Finally, we release in open access SYNTHCODE2CODE2NL and both the pre-trained and fine-tuned MODULARSTARENCODER models.
Acknowledgments

We acknowledge ISCRA for awarding this project access to the LEONARDO supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CINECA (Italy).

Limitations

Due to our dependence on multiple GPUs, we encountered significant computational constraints. Parameter grid searches with smaller and embryonic models were the only way to extrapolate the best hyperparameter setup. The best hyperparameters for smaller models can differ from those for larger ones; thus, we faced a limitation in finding an optimal training setup. Ablating both the in-context classification and the multi-layer loss in a real scenario was impossible, as we depended on smaller models to understand their effect on performance. Therefore, computational resources pose a significant constraint in this work, and we emphasize that this factor limits the possibility of replicating the experiments.

Here, we highlight potential threats to the validity of the research process, focusing on both external and internal factors.

External validity. When synthesizing the SYNTHCODE2CODE2NL code, we rely on code translation; we understand that synthesized data adheres to stylistic writing patterns distinct from those of humans. We tested the model's performance on standard benchmarks. However, the impact of utilizing code snippets as synthetic data in training large language models for generalization over human text-to-code and code-to-code search is still not fully understood.

Internal validity. The ablation study focused on fine-tuning the model with and without the multi-layer loss. However, this comparison does not account for how the model behaves when starting from a model not pre-trained with the multi-layer loss. Although our experiments present promising results, further inspection is necessary to better understand this phenomenon.

References

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023.
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv e-prints, arXiv:2305.13245.
Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, et al. 2025. Smollm2: When smol goes big–data-centric training of a small language model. arXiv preprint arXiv:2502.02737.
Stéphane Aroca-Ouellette and Frank Rudzicz. 2020. On losses for modern language models. arXiv preprint arXiv:2010.01694.
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y. K. Li, Wenfeng Liang, Fangyun Lin, Alex X. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Tongzheng Ren, Zehui Ren, Chong Ruan, Zhangli Sha, Zhihong Shao, Junxiao Song, Xuecheng Su, Jingxiang Sun, Yaofeng Sun, Minghui Tang, Bingxuan Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Y. Wu, Xin Xie, Zhenda Xie, Ziwei Xie, Yiliang Xiong, Hanwei Xu, R. X. Xu, Yanhong Xu, Dejian Yang, Yuxiang You, Shuiping Yu, Xingkai Yu, B. Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghua Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, and Yuheng Zou. 2024. Deepseek LLM: scaling open-source language models with longtermism. CoRR, abs/2401.02954.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Tri Dao.
2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv e-prints, arXiv:2307.08691.
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X.
Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. Preprint, arXiv:2501.12948.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Grégoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel M. Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, et al. 2024. The llama 3 herd of models. CoRR, abs/2407.21783.
Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. 2024. Extreme Compression of Large Language Models via Additive Quantization. arXiv e-prints, arXiv:2401.06118.
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020.
CodeBERT: A Pre-Trained Model for Programming and Natural Languages. arXiv e-prints, arXiv:2002.08155.
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. Codebert: A pre-trained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 1536–1547. Association for Computational Linguistics.
Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. Unixcoder: Unified cross-modal pre-training for code representation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 7212–7225. Association for Computational Linguistics.
Song Han, Jeff Pool, John Tran, and William J. Dally. 2015. Learning both Weights and Connections for Efficient Neural Networks. arXiv e-prints, arXiv:1506.02626.
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-coder technical report. CoRR, abs/2409.12186.
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Codesearchnet challenge: Evaluating the state of semantic code search. CoRR, abs/1909.09436.
Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2017. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. arXiv e-prints, arXiv:1712.05877.
Albert Q.
Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv e-prints, arXiv:2310.06825.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for Natural Language Understanding. arXiv e-prints, arXiv:1909.10351.
Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, and Ali Farhadi. 2022. Matryoshka representation learning. In Advances in Neural Information Processing Systems, volume 35, pages 30233–30249. Curran Associates, Inc.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv e-prints, arXiv:1909.11942.
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. 2023. StarCoder: may the source be with you! arXiv e-prints, arXiv:2305.06161.
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2023. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv e-prints, arXiv:2306.00978.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. 2024. StarCoder 2 and The Stack v2: The Next Generation. arXiv e-prints, arXiv:2402.19173.
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual.
Changan Niu, Chuanyi Li, Vincent Ng, Dongxiao Chen, Jidong Ge, and Bin Luo. 2023. An empirical comparison of pre-trained models of source code. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 2136–2148. IEEE.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018.
Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv e-prints, arXiv:1910.01108.
Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. 2023. One embedder, any task: Instruction-finetuned text embeddings. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 1102–1121. Association for Computational Linguistics.
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv e-prints, arXiv:2104.09864.
Surat Teerapittayanon, Bradley McDanel, and H. T. Kung. 2017. BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks. arXiv e-prints, arXiv:1709.01686.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
Matteo Turisini, Giorgio Amati, and Mirko Cestari. 2023. LEONARDO: A pan-european pre-exascale supercomputer for HPC and AI applications. CoRR, abs/2307.16885.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. arXiv e-prints, arXiv:1706.03762.
Yue Wang, Hung Le, Akhilesh Gotmare, Nghi D. Q. Bui, Junnan Li, and Steven C. H. Hoi. 2023. Codet5+: Open code large language models for code understanding and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 1069–1088. Association for Computational Linguistics.
Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, and Iacopo Poli. 2024. Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. CoRR, abs/2412.13663.
Jinle Zeng, Min Li, Zhihua Wu, Jiaqi Liu, Yuang Liu, Dianhai Yu, and Yanjun Ma. 2022. Boosting distributed training performance of the unpadded BERT model. CoRR, abs/2208.08124.
Dejiao Zhang, Wasi Ahmad, Ming Tan, Hantian Ding, Ramesh Nallapati, Dan Roth, Xiaofei Ma, and Bing Xiang. 2024. Code Representation Learning At Scale. arXiv e-prints, arXiv:2402.01935.
Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, and Bryan Catanzaro. 2021. Long-short transformer: Efficient transformers for language and vision. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 17723–17736.

A Synthetic dataset

SYNTHCODE2CODE2NL is a fine-tuning dataset designed for text-to-code and code-to-code search, built by augmenting CODESEARCHNET (Husain et al., 2019) with transpiled code snippets across multiple languages (Python, Java, Go, PHP, Ruby, C++, C, JavaScript). The dataset underwent a preprocessing phase, including deduplication based on the original and synthesized code columns. Near-deduplication was performed using Locality Sensitive Hashing (LSH) with a Jaccard similarity threshold of 0.7 over character-level 5-grams to remove semantically identical snippets differing only in identifiers or function arguments.

For code-to-code search, we translated each snippet into a randomly sampled target language using the QWEN2.5-CODER-7B-INSTRUCT model with greedy search to ensure consistency.
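The near-deduplication criterion can be illustrated with a small plain-Python sketch. Note the swap: it computes the exact Jaccard similarity over character-level 5-grams and compares every snippet against the ones already kept, whereas the pipeline described above uses LSH, which approximates this comparison without examining all pairs. The function names, the greedy keep-first policy, and the treatment of the 0.7 threshold as inclusive are assumptions made for illustration.

```python
def char_ngrams(text, n=5):
    """Set of character-level n-grams of a string."""
    if len(text) < n:
        return {text}  # short strings are treated as a single gram
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_deduplicate(snippets, threshold=0.7, n=5):
    """Greedy brute-force near-deduplication.

    A snippet is kept only if its similarity to every previously kept
    snippet is below the threshold. This is quadratic in the number of
    snippets; LSH would be used to avoid that cost at dataset scale.
    """
    kept, kept_grams = [], []
    for snippet in snippets:
        grams = char_ngrams(snippet, n)
        if all(jaccard(grams, g) < threshold for g in kept_grams):
            kept.append(snippet)
            kept_grams.append(grams)
    return kept

# Toy usage: the second snippet differs from the first by one character.
original = "def add_numbers(first, second): return first + second"
unique = near_deduplicate([original, original + "#", "print('hello world')"])
```

Character-level grams make the comparison insensitive to small edits such as a renamed identifier, which is the kind of duplicate the threshold is meant to catch.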
Each dataset entry consists of a natural language description and two code snippets in different languages. SYNTHCODE2CODE2NL contains 1,071,367 samples, with original code from CODESEARCHNET (Python, Java, PHP, Go) and translated code (Go, Ruby, JavaScript, Python, C++, PHP, C, Java). Figures 5, 6, and 7 show examples of code translation.

Go

    // readNextChunk creates an io.LimitedReader for the next chunk of data,
    // and returns io.EOF if the last chunk has been read.
    func readNextChunk(r wire.SyncScanner) (io.Reader, error) {
        status, err := r.ReadStatus("read-chunk")
        if err != nil {
            if wire.IsAdbServerErrorMatching(err, readFileNotFoundPredicate) {
                return nil, errors.Errorf(errors.FileNoExistError, "no such file or directory")
            }
            return nil, err
        }

        switch status {
        case wire.StatusSyncData:
            return r.ReadBytes()
        case wire.StatusSyncDone:
            return nil, io.EOF
        default:
            return nil, errors.Errorf(
                errors.AssertionError,
                "expected chunk id '%s' or '%s', but got '%s'",
                wire.StatusSyncData, wire.StatusSyncDone, []byte(status),
            )
        }
    }

Ruby

    def read_next_chunk(r)
      status, err = r.read_status("read-chunk")
      if err
        if AdbServerErrorMatching(err, readFileNotFoundPredicate)
          return nil, FileNoExistError.new("no such file or directory")
        end
        return nil, err
      end

      case status
      when wire.StatusSyncData
        return r.read_bytes
      when wire.StatusSyncDone
        return nil, EOFError.new
      else
        return nil, AssertionError.new(
          "expected chunk id '#{wire.StatusSyncData}' or '#{wire.StatusSyncDone}', but got '#{status}'"
        )
      end
    end

Figure 5: Example of a code snippet translation from Go to Ruby.
Java

    public void get(URI uri, Credentials credentials) throws IOException {
        get(uri, null, credentials);
    }

C

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char *username;
        char *password;
    } Credentials;

    typedef struct {
        char *scheme;
        char *host;
        int port;
        char *path;
    } URI;

    void get(URI *uri, Credentials *credentials) {
        // Implementation of the get function
        printf("Fetching resource from %s://%s:%d%s\n", uri->scheme, uri->host, uri->port, uri->path);
        if (credentials != NULL) {
            printf("Using credentials: %s:%s\n", credentials->username, credentials->password);
        }
    }

    int main() {
        URI uri = {"http", "example.com", 80, "/index.html"};
        Credentials credentials = {"user", "pass"};
        get(&uri, &credentials);
        return 0;
    }

Figure 6: Example of a code snippet translation from Java to C.

Python

    def toString(self):
        result = []
        k, v = self.optimalRepr()
        longest = reduce(lambda x, y: x if x > len(y) else len(y), k, 0)
        for ind in range(len(k)):
            result.append("%s : %s" % (k[ind].ljust(longest), v[ind]))
        return "\n".join(result)

PHP

    public function toString() {
        /**
         * Return a printable view of the dictionary
         */
        $result = [];
        list($k, $v) = $this->optimalRepr();
        $longest = array_reduce($k, function($x, $y) {
            return $x > strlen($y) ? $x : strlen($y);
        }, 0);
        for ($ind = 0; $ind < count($k); $ind++) {
            $result[] = sprintf("%s : %s", ltrim($k[$ind], ' '), str_pad($v[$ind], $longest, ' ', STR_PAD_LEFT));
        }
        return implode("\n", $result);
    }

Figure 7: Example of a code snippet translation from Python to PHP.
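As a complement to the examples above, the structure of a dataset entry and the choice of a translation target can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the released schema or pipeline: the class name, the field names, and the uniform sampling are hypothetical. The language list and the requirement that the two snippets be in different languages come from the appendix text.

```python
import random
from dataclasses import dataclass

# Target languages listed in the appendix for translated code.
TARGET_LANGUAGES = ["Go", "Ruby", "JavaScript", "Python", "C++", "PHP", "C", "Java"]

@dataclass
class Entry:
    """One dataset entry; all field names are hypothetical."""
    description: str          # natural language description
    code: str                 # original snippet
    code_language: str        # language of the original snippet
    translated_code: str      # snippet produced by the translation model
    translated_language: str  # language of the translated snippet

def sample_target_language(source_language, rng):
    """Pick a target language at random, excluding the source language.

    Uniform sampling is an assumption made for this sketch.
    """
    candidates = [lang for lang in TARGET_LANGUAGES if lang != source_language]
    return rng.choice(candidates)

# Toy usage with a seeded generator so runs are repeatable.
rng = random.Random(0)
target = sample_target_language("Python", rng)
entry = Entry(
    description="Return a printable view of the dictionary",
    code="def toString(self): ...",
    code_language="Python",
    translated_code="",  # would be filled in by the translation model
    translated_language=target,
)
```

Excluding the source language from the candidates guarantees that each entry pairs two snippets in different languages, as the appendix requires.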

---