One Model to Train them All: Hierarchical Self-Distillation for Enhanced
Early Layer Embeddings
Andrea Gurioli¹, Federico Pennino¹, Joao Monteiro², Maurizio Gabbrielli¹
¹University of Bologna, ²Autodesk
Correspondence: andrea.gurioli5@unibo.it, joao.monteiro@autodesk.com
Abstract
Deploying language models often requires han-
dling model size vs. performance trade-offs to
satisfy downstream latency constraints while
preserving the model’s usefulness. Model
distillation is commonly employed to reduce
model size while maintaining acceptable perfor-
mance. However, distillation can be inefficient
since it involves multiple training steps. In
this work, we introduce ModularStarEncoder,
a modular multi-exit encoder with 1B
parameters, useful for multiple tasks within the
scope of code retrieval. ModularStarEncoder
is trained with a novel self-distillation
mechanism that significantly improves lower-
layer representations—allowing different por-
tions of the model to be used while still main-
taining a good trade-off in terms of perfor-
mance. Our architecture focuses on enhancing
text-to-code and code-to-code search by sys-
tematically capturing syntactic and semantic
structures across multiple levels of representa-
tion. Specific encoder layers are targeted as
exit heads, allowing higher layers to guide ear-
lier layers during training. This self-distillation
effect improves intermediate representations,
increasing retrieval recall at no extra training
cost. In addition to the multi-exit scheme, our
approach integrates a repository-level contex-
tual loss that maximally utilizes the training
context window, further enhancing the learned
representations. We also release a new dataset
constructed via code translation, seamlessly ex-
panding traditional text-to-code benchmarks
with code-to-code pairs across diverse program-
ming languages. Experimental results highlight
the benefits of self-distillation through multi-
exit supervision.
1 Introduction
Large language models (LLMs) have significantly
impacted the field of natural language processing,
demonstrating remarkable performance across various
applications (Niu et al., 2023). However, the amount
of computation required to operate state-of-the-art
models poses significant challenges for
their large-scale deployment.
To mitigate these challenges, the research com-
munity has explored several model strategies to
reduce the operational cost of LLMs without sacri-
ficing their effectiveness. A prominent technique
in model compression is quantization (Jacob et al.,
2017; Lin et al., 2023; Egiazarian et al., 2024),
which involves the reduction of numerical precision
in the model’s parameters. Quantization effectively
decreases memory requirements and enhances in-
ference speed, facilitating the deployment of large
language models in resource-constrained environ-
ments. Concurrently, knowledge distillation has
emerged as a powerful technique whereby a smaller
“student” model is trained to emulate the behav-
ior of a larger “teacher” model, as evidenced by
works such as DistilBERT (Sanh et al., 2019) and
TinyBERT (Jiao et al., 2019). Additionally, pruning
methods selectively eliminate less influential
weights or neurons, further reducing model com-
plexity and aiming to preserve performance (Han
et al., 2015).
Recent efforts have increasingly focused on de-
veloping efficient architectures requiring fewer pa-
rameters. Model families such as LLaMA (Dubey
et al., 2024), Qwen (Hui et al., 2024), Mistral (Jiang
et al., 2023), and SmolLM (Allal et al., 2025) ex-
emplify a paradigm shift towards smaller, more
accessible architectures. These model families
are deployed at various resolutions—ranging from
lightweight variants optimized for heavily resource-
constrained environments to larger versions that
retain competitive performance.
In parallel, advancements in dynamic inference
strategies have introduced mechanisms that fur-
ther optimize computational efficiency. Techniques
like multi-exit networks enable early predictions
at intermediate layers, reducing unnecessary computations.
For instance, early-exit architectures
such as BranchyNet (Teerapittayanon et al., 2017)
dynamically balance computation and accuracy
by allowing predictions before full model execution.
Similarly, Matryoshka representation learning
(Kusupati et al., 2022) extends this idea to embeddings,
introducing a loss function that yields
multi-granular representations. This approach allows
downstream tasks to adjust computational
complexity by pruning embedding dimensionality,
further contributing to efficient model deployment.
arXiv:2503.03008v1 [cs.CL] 4 Mar 2025
[Figure 1 diagram: a stack of encoder layers (Layer 1 to Layer 36); the exit heads each emit an output embedding and a layer loss.]
Figure 1: Overview of our multi-exit self-distillation encoder, shown here with exit heads at selected layers (e.g.,
Layers 4, 9, 18, 27, and 36). Each exit head predicts an output embedding and adds a "layer loss" contribution,
weighted by a coefficient $\alpha_i$, which is summed into the overall objective $\mathcal{L}$.
Building on these principles, we propose ModularStarEncoder,
a modular multi-exit encoder
architecture that integrates a novel intra-model
self-distillation mechanism. In our design, specific
intermediate layers are supervised by both the primary
task loss and auxiliary distillation losses on
specific exit heads, encouraging lower layers to
learn better representations by mimicking the outputs
of higher layers. We apply a shared embedding head,
comprising a masked language modeling
head and an in-context classification head, across
a chosen subset of layers. We then fine-tune the
model with different projection heads for each exit
point. We reach state-of-the-art results on multiple
retrieval tasks (such as code-to-code and text-to-code)
by fine-tuning one single modular model that
can be sliced depending on the end user's
computational constraints.
Our contributions are as follows:
• We introduce a self-distillation framework that
enables training multiple model resolutions
within a unified layer stack, reducing redundancy
and improving scalability. We believe
this approach can significantly affect LLM
training pipelines that depend on multiple
model distillations.
• We train and release ModularStarEncoder,
which consists of a pre-trained¹ and
fine-tuned² encoder. The former is a modular
pre-trained encoder with up to 1 billion
parameters and five exit points, allowing users
to perform multiple exit fine-tuning depending
on downstream tasks. The latter is a fine-tuned
encoder for various retrieval tasks. We allow
the user to choose either the entire model with
1 billion parameters or a model size that fits
their memory and computational limitations.
• We release SynthCode2Code2NL, a new
dataset³ constructed via code translation,
expanding popular text-to-code datasets across
diverse programming languages with code-to-code
pairs. SynthCode2Code2NL comprises
1 071 367 triplets of natural language,
code, and code.
2 Methodology
2.1 Dataset
In the pre-training phase, we leveraged The Stack
V2 (Lozhkov et al., 2024), a large open-source code
dataset structured by repository.
¹ https://huggingface.co/modularStarEncoder/ModularStarEncoder
² https://huggingface.co/modularStarEncoder/ModularStarEncoder-finetuned
³ https://huggingface.co/datasets/modularStarEncoder/SynthCode2Code2NL
Table 1: SynthCode2Code2NL details: average character count and sample size per language for the
CodeSearchNet (CSN) dataset and the synthesized portion obtained through translation.
Language   | CSN samples | CSN avg. char | Synth. samples | Synth. avg. char
English    | 1 071 367   | 180           | -              | -
PHP        | 280 706     | 514           | 116 967        | 579
Python     | 274 454     | 474           | 117 374        | 518
Go         | 234 089     | 350           | 124 125        | 541
Java       | 282 118     | 505           | 116 098        | 707
C++        | -           | -             | 141 956        | 938
Ruby       | -           | -             | 158 494        | 456
C          | -           | -             | 136 365        | 1029
JavaScript | -           | -             | 159 988        | 557
Prompt: Translate this ''' print("Hello World") ''' from Python to Rust. Here is the translated code '''
Qwen2.5-Coder-7B-Instruct output:
fn main() {
    println!("Hello World!");
}
Figure 2: Prompt provided to Qwen2.5-Coder-7B-Instruct for translating a given code snippet (print("Hello
World") in the example) from a source programming language (Python) to a target one (Rust).
For the fine-tuning stage, we created SynthCode2Code2NL,
a dataset that supports text-to-code
and code-to-code search. Using the popular
CodeSearchNet (Husain et al., 2019) as a
seed dataset and selecting popular programming
languages (Python, Java, Go, and PHP), we augmented
it by transpiling the available code snippets
into other languages.
To generate semantically similar code snippets
for code-to-code search, we translated each snippet
into a different language randomly sampled
from Go, Ruby, Python, Java, C++, PHP, C, and
JavaScript. We prompted the Qwen2.5-Coder-7B-Instruct
model with the source code, the
name of the source language, and the name of the
target language (see fig. 2). During code translation,
we chose the token with the highest probability
as output (greedy search) to prevent semantic
discrepancies.
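The prompt construction can be sketched as follows. The template wording is transcribed from Figure 2; the function names, the exclusion of the source language when sampling a target, and the seed are illustrative assumptions. The call to Qwen2.5-Coder-7B-Instruct (with greedy decoding) is deliberately left out.

```python
import random

# Target languages listed in the text.
TARGET_LANGUAGES = ["Go", "Ruby", "Python", "Java", "C++", "PHP", "C", "JavaScript"]


def pick_target_language(source_language, rng):
    """Sample a target language different from the source (assumed reading)."""
    candidates = [lang for lang in TARGET_LANGUAGES if lang != source_language]
    return rng.choice(candidates)


def build_translation_prompt(code, source_language, target_language):
    """Fill the prompt template transcribed from Figure 2."""
    return (
        f"Translate this ''' {code} ''' "
        f"from {source_language} to {target_language}.\n"
        "Here is the translated code '''"
    )


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
target = pick_target_language("Python", rng)
prompt = build_translation_prompt('print("Hello World")', "Python", target)
print(prompt)
```

The returned string would then be passed to the instruction-tuned model, taking the highest-probability token at each step.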
This process yielded pairs of code snippets in
distinct languages tied to the same natural language
description. As a result, every sample in
the fine-tuning dataset includes a natural language
description and two code snippets from distinct
languages. SynthCode2Code2NL contains
1 071 367 samples. The first code column holds
code snippets taken directly from CodeSearchNet,
covering Python, Java, PHP, and Go.
The third column, artificially synthesized via code
translation, includes Go, Ruby, JavaScript, Python,
C++, PHP, C, and Java code snippets. After a manual
inspection, we discovered that both columns
contained code snippets that differed only in
identifiers or function arguments. Several tasks were
semantically identical but paraphrased with different
parameter requirements (e.g., two identical
paraphrased tasks asked for opening a socket on a
different port). During the preprocessing phase of
SynthCode2Code2NL, motivated by the dataset's
redundancy and by preliminary experiments showing
that deduplication benefits the model's performance, we
near-deduplicated the dataset using both the
CodeSearchNet code column and the synthesized code
column. For near-deduplication,
we relied on Locality Sensitive Hashing (LSH) with
a Jaccard similarity threshold of 0.7 and 256
permutations, analyzing character-level 5-grams.
Table 1 shows the average number of characters per
language in SynthCode2Code2NL. We emphasize
that synthesized data is significantly longer
than human-written code and might have stylistic
differences compared to human code; we discuss
this further in section 5. Appendix A provides
examples of code translation.
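A minimal sketch of the similarity test behind this near-deduplication step is given below, using only the Python standard library. It computes MinHash signatures with 256 seeded hash functions over character-level 5-grams and compares the estimated Jaccard similarity with the 0.7 threshold. The hash construction (salted BLAKE2b) and the function names are assumptions, and the LSH banding that avoids comparing all pairs is omitted; the actual tooling used by the authors is not stated in the text.

```python
import hashlib

NUM_PERM = 256   # number of permutations reported in the text
THRESHOLD = 0.7  # Jaccard similarity threshold reported in the text
NGRAM = 5        # character-level 5-grams


def char_ngrams(text, n=NGRAM):
    """Set of character n-grams (shingles) of a snippet."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def _hash(shingle, seed):
    """One member of a family of hash functions, selected by `seed`."""
    salt = seed.to_bytes(4, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(shingles, num_perm=NUM_PERM):
    """For each hash function, keep the minimum hash value over all shingles."""
    return [min(_hash(s, seed) for s in shingles) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(text_a, text_b, threshold=THRESHOLD):
    sig_a = minhash_signature(char_ngrams(text_a))
    sig_b = minhash_signature(char_ngrams(text_b))
    return estimated_jaccard(sig_a, sig_b) >= threshold


snippet_a = "def open_socket(port=8080): return connect(host, port)"
snippet_b = "def open_socket(port=9090): return connect(host, port)"
snippet_c = "for i in range(10): print(i * 2)"
print(jaccard(char_ngrams(snippet_a), char_ngrams(snippet_b)))
print(is_near_duplicate(snippet_a, snippet_c))
```

The first two snippets mirror the socket example in the text: they differ only in the port argument, so their shingle sets overlap heavily.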
[Figure 3, pseudocode panel]
def build_examples(repo_1, repo_2):
    is_negative = random(0, 1)
    input = empty_string
    while not empty(repo_1):
        input += <sep_token>
        # Positive case
        pos_sample = sample_random_snippet(repo_1)
        if len(input) + len(pos_sample) > context_length - 1:
            break
        input += pos_sample
        if is_negative and not empty(repo_2):
            # Negative (snippets from different repos)
            input += <sep_token>
            neg_sample = sample_random_snippet(repo_2)
            if len(input) + len(neg_sample) > context_length - 1:
                break
            input += neg_sample
    input += <cls_token>
    return input
[Figure 3, illustration panel: a positive example concatenates Sample 1, Sample 2, and Sample 3 from repo 1; a negative example interleaves samples from repo 1 and repo 2 (Sample 1 repo 1, Sample 1 repo 2, Sample 2 repo 1, Sample 2 repo 2, Sample 3 repo 1). Samples are delimited by SEP tokens and each sequence carries a CLS token.]
Figure 3: On the left side, an illustration of the in-context loss framework, where samples from different repositories
are concatenated. Positive examples share the same repository context, while negative examples come from different
repositories. On the right side, the in-context loss framework pseudocode.
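The pseudocode of Figure 3 can be turned into a runnable sketch along the following lines. Snippets are represented as lists of tokens, the special-token strings are placeholders, and a snippet is assumed to be removed from its repository once sampled (which is what makes the loop terminate). Unlike the pseudocode, the separator is appended only when the snippet fits. Note that, as in the pseudocode, a sample flagged as negative may end up containing only one repository if the first foreign snippet does not fit.

```python
import random

SEP, CLS = "<sep>", "<cls>"  # placeholder special tokens (assumed names)


def build_example(repo_1, repo_2, context_length, rng):
    """Concatenate snippets (lists of tokens) into one in-context sample.

    Returns (tokens, is_negative). Works on copies, so inputs are not modified.
    """
    repo_1, repo_2 = list(repo_1), list(repo_2)
    is_negative = rng.random() < 0.5  # 50% chance of mixing repositories
    tokens = []

    def try_append(repo):
        snippet = repo.pop(rng.randrange(len(repo)))
        # +1 for the separator; one slot stays reserved for the final CLS token.
        if len(tokens) + 1 + len(snippet) > context_length - 1:
            return False
        tokens.append(SEP)
        tokens.extend(snippet)
        return True

    while repo_1:
        if not try_append(repo_1):
            break
        if is_negative and repo_2 and not try_append(repo_2):
            break
    tokens.append(CLS)
    return tokens, is_negative


rng = random.Random(0)
repo_a = [["a"] * 5, ["b"] * 5, ["c"] * 5]
repo_b = [["x"] * 5, ["y"] * 5]
tokens, is_negative = build_example(repo_a, repo_b, context_length=32, rng=rng)
print(is_negative, len(tokens))
```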
2.2 Architecture
We updated the first version of StarEncoder (Li
et al., 2023) by enabling longer code snippets (up
to 2 048 tokens as context length), increasing the
model size from ≈125M to ≈1B parameters, and
utilizing state-of-the-art methodologies (Warner
et al., 2024; Lozhkov et al., 2024), resulting in
ModularStarEncoder.
We built ModularStarEncoder on top of
StarCoder-2 (Lozhkov et al., 2024), applying
several modifications to the model. We reduced its
size from 15B to 1B parameters. Our architecture
comprises 36 hidden layers and adopts Grouped
Query Attention (GQA) (Ainslie et al., 2023) with
16 attention heads and 4 key-value heads.
ModularStarEncoder relies upon Rotary Positional
Encoding (RoPE) (Su et al., 2021) with a base
period $\theta = 10^{-6}$ and features a hidden dimensionality
of 1024 with an intermediate size of 12 288.
We followed Devlin et al. (2019) and replaced
the causal self-attention in StarCoder-2 with
bidirectional self-attention. Aiming for modularity,
we also replaced sliding window attention with full
attention. This step was taken to avoid the limited
receptive field of sliding window mechanisms
(Zhu et al., 2021). Finally, our implementation
integrates FlashAttention V2 (Dao, 2023)
for faster inference. Table 2 summarizes the
architectural details.
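As a rough sanity check on these numbers, the parameter count of a truncated stack can be estimated from the stated hyperparameters. The sketch below assumes a two-matrix (non-gated) feed-forward block as in StarCoder-2 and a vocabulary of 49 152 tokens, neither of which is stated in the text; biases, normalization weights, and output heads are ignored. Under these assumptions the estimates land close to the approximate sizes listed in Table 5.

```python
def approx_params(num_layers, hidden=1024, intermediate=12_288,
                  num_heads=16, num_kv_heads=4, vocab_size=49_152):
    """Rough parameter count for the first `num_layers` layers plus embeddings.

    vocab_size and the two-matrix MLP are assumptions, not values from the text.
    """
    head_dim = hidden // num_heads          # 64
    kv_dim = head_dim * num_kv_heads        # 256: keys/values shared across head groups
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # Q and O, plus K and V
    mlp = 2 * hidden * intermediate         # up-projection and down-projection
    embeddings = vocab_size * hidden
    return num_layers * (attention + mlp) + embeddings


for exit_layer in (4, 9, 18, 27, 36):
    print(exit_layer, round(approx_params(exit_layer) / 1e6), "M parameters")
```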
2.3 Pre-training
We pre-trained ModularStarEncoder with
a batch size of 3.99M tokens for 245 000 training
steps, processing ≈1T tokens. We conducted
pre-training and fine-tuning on 512 NVIDIA Ampere
(64GB) GPUs using the Leonardo supercomputer
(Turisini et al., 2023), requiring 450 000 GPU
working hours.
To enable both token-level and snippet-level
embeddings after pre-training, we employed a
multi-objective pre-training strategy that combined two
losses, as detailed in section 2.3.1 and section 2.3.2.
The pre-training was performed on The Stack V2,
whose context length analysis revealed an average
of ≈630 tokens per code snippet. As described in
section 2.3.1, we concatenated multiple snippets to
facilitate our multi-loss methodology, allowing our
in-context classification loss to expand the average
context window to ≈1 300 tokens, reaching the
maximum context length 20% of the time.
We used the AdamW optimizer with β1 set to
0.9, β2 to 0.95, ϵ to 1e-6, and a weight decay of
1e-1. We initialized the learning rate at 6.24e-4 and
decreased it using a multi-step learning rate scheduler
(Bi et al., 2024) with 4 000 warmup steps. The
learning rate was reduced at 120 000, 185 000,
220 000, 230 000, and 240 000 training steps, applying
a decay factor of 0.36 and, from step 185 000 onward,
further reductions by factors of 0.1, 0.031, 0.01,
and 0.001. Table 2 summarizes the hyperparameters
for architecture, pre-training, and fine-tuning.
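One possible reading of this schedule is sketched below. It assumes that the listed factors are multipliers of the initial learning rate that take effect at the corresponding milestones (0.36 at step 120 000, then 0.1, 0.031, 0.01, and 0.001), and that warmup is linear; the text does not state either detail explicitly, so this is an interpretation rather than the authors' exact scheduler.

```python
PEAK_LR = 6.24e-4
WARMUP_STEPS = 4_000
# (milestone step, multiplier of the peak learning rate): assumed reading of the text
MILESTONES = [(120_000, 0.36), (185_000, 0.1), (220_000, 0.031),
              (230_000, 0.01), (240_000, 0.001)]


def learning_rate(step):
    """Learning rate at a given training step under the assumed schedule."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS  # linear warmup (assumed)
    factor = 1.0
    for milestone, multiplier in MILESTONES:
        if step >= milestone:
            factor = multiplier  # the latest milestone reached wins
    return PEAK_LR * factor


for step in (0, 4_000, 120_000, 185_000, 220_000, 230_000, 240_000):
    print(step, learning_rate(step))

# Token budget implied by the reported batch size and step count.
print(3.99e6 * 245_000)  # about 0.98e12 tokens, i.e. roughly 1T
```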
2.3.1 Masked Language Modeling and
In-Context Classification
The training objectives of BERT (Devlin et al., 2019),
specifically Masked Language Modeling (MLM)
and Next Sentence Prediction (NSP), have become
a de facto standard. However, the NSP loss
constrains the context window length to the sentence
length, leading to too many padding tokens and
redundant computation (Zeng et al., 2022), and has
been shown to not yield significant benefits after
fine-tuning (Warner et al., 2024; Aroca-Ouellette
and Rudzicz, 2020). Given that the average number
of tokens per data sample in The Stack V2 is 630, a
large context window of 2048 results in substantial
padding, making long-context training inefficient.
While Wang et al. (2023) demonstrated the advantages
of training LLMs with multiple objectives, we
revisited the NSP loss and introduced an in-context
classification (ICC) objective. We hypothesize that
predicting whether multiple code snippets belong
to the same context (in our case, the same repository)
can enhance semantic search performance
while allowing efficient concatenation of multiple
code fragments. Our final training objective is the
sum of two losses, (1) the MLM loss and (2) the
ICC loss: $\mathcal{L} = \mathcal{L}_{MLM} + \mathcal{L}_{ICC}$.
Table 2: Hyperparameters for Architecture, Pre-training,
and Fine-tuning
Hyperparameter | Value
Architecture
Model size | 1B parameters
Precision | bfloat16
Hidden layers | 36
Attention heads | 16
Hidden dimensionality | 1024
Positional encoding | RoPE ($\theta = 10^{-6}$)
Context length | 2048
Attention mechanism | Grouped-Query Attention
Attention pattern | Bi-directional
Pre-training
Batch size | 3.99M tokens
Pretraining steps | 245 000
Pretraining tokens | 1T
Loss function | MLM + In-Context loss
Multi-layer loss | yes
Optimizer | AdamW
Weight decay | 1e-1
Initial learning rate | 6.24e-4
Learning rate schedule | Multi-step
Warmup steps | 4 000
Fine-tuning
Dataset size | 635 404 samples
Fine-tuning steps | 20 000
Loss function | CLIP loss
Multi-layer loss | yes
Batch size | 2 048
Learning rate | 1.0e-5
Temperature parameter | 10.0
Hardware (Pre-training + fine-tuning)
GPUs | 512 NVIDIA Ampere (64GB)
Overall training hours | 450 000 GPU hours
In $\mathcal{L}_{MLM}$, a certain percentage of tokens are
randomly masked and predicted using a classification
head. Following Zhang et al. (2024), we
adopt a 15% masking rate with the standard 80-10-10
token replacement strategy (Devlin et al.,
2019). The secondary objective, $\mathcal{L}_{ICC}$, determines
whether randomly concatenated inputs (separated
by a <SEP> token) originate from the same
repository (see fig. 3). Each concatenated sample
has a 50% probability of containing source
code from different repositories. This approach
increases input density (reducing padding by
expanding the average input length from 630 to 1 300
tokens) and potentially enhances cross-language
understanding. Since repositories are inherently
modular and often contain files written in multiple
languages, learning from repository-level context
may improve inter-language generalization.
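The 15% masking rate with the 80-10-10 replacement strategy can be illustrated with a short sketch. Token ids are plain integers here; the mask id, vocabulary size, and the use of -100 as the "ignore" label are illustrative assumptions rather than details from the text.

```python
import random

IGNORE_INDEX = -100  # label for positions excluded from the loss (assumed convention)


def mask_tokens(token_ids, mask_id, vocab_size, rng, mask_rate=0.15):
    """Return (inputs, labels) after BERT-style 80-10-10 corruption."""
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for pos, token in enumerate(token_ids):
        if rng.random() >= mask_rate:
            continue                                  # position not selected
        labels[pos] = token                           # the original token is the target
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = mask_id                     # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[pos] = rng.randrange(vocab_size)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels


rng = random.Random(0)
token_ids = list(range(10, 1010))
inputs, labels = mask_tokens(token_ids, mask_id=1, vocab_size=2000, rng=rng)
selected = sum(label != IGNORE_INDEX for label in labels)
print(selected / len(token_ids))  # close to the 15% masking rate
```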
2.3.2 Multi-layer Loss
To achieve layer-wise modularity in transformer
architectures, we apply the previously introduced
loss (section 2.3.1) across a selected set of layers,
sharing the classification heads (masked language
modeling and in-context classification) while
incorporating a positional embedding of the layer
index. The total loss is computed as the sum of the
individual layer losses, each weighted by a factor
$\alpha_i$ that prioritizes deeper layers:
$\mathcal{L} = \sum_{i \in \iota} \alpha_i \mathcal{L}_i$, where
$\alpha_i = i/|I|$, $I = \{1, \dots, 36\}$ represents all layers,
and the selected subset $\iota = \{4, 9, 18, 27, 36\}$
defines the layers where the loss is applied. The
subset was chosen to enable four model
variants equally spaced in depth (9, 18, 27, 36),
along with an additional "tiny" version (4) to assess
model performance with a smaller number of
parameters. This approach allows for flexible model
deployment, enabling adaptive layer pruning while
maintaining performance trade-offs.
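As a small worked example of the weighting, the sketch below computes the layer coefficients i/36 for the five exit layers and combines per-layer losses into the total. The per-layer loss values are made-up placeholders used only to show the arithmetic.

```python
NUM_LAYERS = 36
EXIT_LAYERS = (4, 9, 18, 27, 36)


def layer_weight(layer_index, num_layers=NUM_LAYERS):
    """Coefficient of an exit layer: its index divided by the total depth."""
    return layer_index / num_layers


def multi_layer_loss(layer_losses):
    """Weighted sum of per-exit losses; `layer_losses` maps layer index to loss."""
    return sum(layer_weight(i) * layer_losses[i] for i in EXIT_LAYERS)


weights = {i: round(layer_weight(i), 3) for i in EXIT_LAYERS}
print(weights)  # {4: 0.111, 9: 0.25, 18: 0.5, 27: 0.75, 36: 1.0}

placeholder_losses = {i: 1.0 for i in EXIT_LAYERS}  # hypothetical values
total = multi_layer_loss(placeholder_losses)
print(total)  # 94/36, roughly 2.61
```

With equal per-layer losses, the last layer contributes nine times as much as layer 4, which is how the objective prioritizes deeper exits.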
2.4 Fine-tuning
Following Su et al. (2023), we fine-tune a single
model for both text-to-code and code-to-code
retrieval using instruction prompting. The optimization
objective combines CLIP loss (Radford et al.,
2021) with a multi-layer loss (see section 2.3.2).
To enhance representation learning, we replace
the single-head projection of the multi-layer loss
with five distinct projection heads, applied at different
exit points of the pre-trained model (layers 4, 9,
18, 27, and 36). We used a batch size of 2 048 elements,
ensuring that text-to-code and code-to-code pairs were
equally distributed across the batch.
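For reference, a CLIP-style loss for a single exit head can be sketched in plain Python as a symmetric cross-entropy over the similarity matrix of a batch. The sketch assumes L2-normalized embeddings and treats the reported temperature of 10.0 as a multiplier on cosine similarities; the text does not say whether it multiplies or divides, so this is an assumption. In the multi-exit setting, one such loss per exit head would be combined with the layer weights of section 2.3.2.

```python
import math


def diagonal_cross_entropy(logits):
    """Mean negative log-softmax of the diagonal (matching pair) in each row."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_norm = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_norm - row[i]
    return total / len(logits)


def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def clip_loss(queries, codes, scale=10.0):
    """Symmetric contrastive loss; queries[i] and codes[i] form a positive pair."""
    q = [normalize(v) for v in queries]
    c = [normalize(v) for v in codes]
    logits = [[scale * sum(a * b for a, b in zip(qi, cj)) for cj in c] for qi in q]
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = clip_loss(aligned, aligned)
loss_swapped = clip_loss(aligned, swapped)
print(loss_aligned, loss_swapped)
```

Correctly paired embeddings give a loss near zero, while mismatched pairs are penalized heavily.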
Table 3: Performance of different models on text-to-code with CodeSearchNet using CodeXGLUE. We report the
results presented in the CodeT5+, UniXcoder, and ModernBERT papers (Wang et al., 2023; Guo et al., 2022; Warner et al.,
2024).
CodeSearchNet
Model | Ruby | JS | Go | Python | Java | PHP | avg. MRR | avg. NDCG
ModularStarEncoder | 74.1 | 74.0 | 82.5 | 92.5 | 78.7 | 84.5 | 81.0 | 84.2
CodeT5+ 770M | 78.0 | 71.3 | 92.7 | 75.8 | 76.2 | 70.1 | 77.4 | -
OpenAI text-embedding-3-large | 84.7 | 85.3 | 95.9 | 99.8 | 90.1 | 95.6 | 91.9 | 93.3
UniXcoder | 74.0 | 68.4 | 91.5 | 72.0 | 72.6 | 67.6 | 74.4 | -
ModernBERT-large | - | - | - | - | - | - | - | 59.5
We performed data augmentation by randomly
replacing frequently occurring words (appearing
more than twice and having at least three characters)
with random strings. We applied the augmentation
exclusively to code snippets in 30% of cases,
leaving natural language descriptions unchanged.
After conducting a grid search, we selected 1.0e-5
as the learning rate, kept constant throughout the
fine-tuning process, and set the temperature parameter
to 10.0.
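The word-replacement augmentation described in this section could look roughly like the sketch below. Several details are assumptions because the text does not specify them: what counts as a word (an identifier-like regular expression here), the chance that an eligible word is replaced (`replace_prob`), and the form of the random string (lowercase letters of the same length). Language keywords are not excluded in this sketch.

```python
import random
import re
import string
from collections import Counter

WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")  # identifier-like words (assumed definition)


def augment_code(code, rng, apply_prob=0.3, replace_prob=0.5):
    """Randomly rename frequent words in a code snippet."""
    if rng.random() >= apply_prob:  # augment only about 30% of snippets
        return code
    counts = Counter(WORD.findall(code))
    # Eligible words: appearing more than twice and at least three characters long.
    eligible = [w for w, n in counts.items() if n > 2 and len(w) >= 3]
    mapping = {
        w: "".join(rng.choice(string.ascii_lowercase) for _ in range(len(w)))
        for w in eligible
        if rng.random() < replace_prob
    }
    # Every occurrence of a chosen word receives the same random string.
    return WORD.sub(lambda m: mapping.get(m.group(0), m.group(0)), code)


rng = random.Random(0)
snippet = "total = total + value\ntotal = total * 2\n"
print(augment_code(snippet, rng, apply_prob=1.0, replace_prob=1.0))
```

Natural language descriptions would be passed through unchanged, as stated in the text.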
2.5 Evaluation
We evaluated the fine-tuned ModularStarEncoder
on both text-to-code and code-to-code
retrieval tasks using CodeXGLUE (Lu et al., 2021),
which comprises several benchmarking datasets.
For text-to-code retrieval, we employed the
CodeSearchNet dataset, where the goal is to retrieve
the most relevant code snippet given a natural
language query. Specifically, the query corresponds to
a documentation comment, and the model is tasked
with ranking the correct code snippet among 999
distractor snippets (Husain et al., 2019). This setup
assesses the model's ability to learn meaningful
cross-modal representations between code and
natural language.
For code-to-code retrieval, we relied on two
datasets from CodeXGLUE: the Code Translation
(CT) benchmark and POJ-104. The Code Translation
dataset consists of semantically equivalent
code snippets in different programming languages,
and we framed the task as cross-language code
retrieval rather than translation. In this setting, given
a Java code snippet as a query, the model retrieves
the corresponding C# implementation, testing its
capability to capture cross-lingual semantic
similarities between functionally equivalent programs.
In contrast, with the POJ-104 dataset, we
evaluate the model on intra-language semantic
search (POJ-104 contains only C++ snippets),
where programs solve the same problem but with
different implementations. This setup evaluates
the model's capacity to generalize across structural
variations while preserving semantic equivalence.
Table 4: Performance of different models on Code Translation
(CT) and POJ104 for code-to-code search with
the CodeXGLUE dataset.
Model | CT (MRR) | POJ104 (mAP)
ModularStarEncoder | 98.9 | 56.5
CodeT5+ 110M-embedding | 98.4 | 24.5
OpenAI text-embedding-3-large | 98.8 | 82.9
UniXcoder | 97.6 | 41.0
ModernBERT-large | 93.1 | 27.3
3 Results and Discussion
3.1 Benchmarks
Table 3 presents the results for the CodeSearchNet
text-to-code (t2c) task in terms of Mean Reciprocal
Rank (MRR) for each language, average NDCG, and
average MRR. Results for UniXcoder, ModernBERT,
and CodeT5+ are taken from the original papers
(Guo et al., 2022; Warner et al., 2024; Wang
et al., 2023). On CodeSearchNet, ModularStarEncoder
achieves an MRR of 81.0 and
an NDCG of 84.2, outperforming CodeT5+ (770M) (Wang
et al., 2023), UniXcoder (Guo et al.,
2022), and ModernBERT-large (Warner et al.,
2024). The only encoder that surpasses
ModularStarEncoder is OpenAI's
text-embedding-3-large.
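For readers unfamiliar with the metric, MRR averages the reciprocal of the rank at which the correct candidate appears for each query. The sketch below uses made-up similarity scores for three queries with three candidates each (the benchmark itself ranks one correct snippet against 999 distractors); ties are resolved optimistically, which is an assumption of this sketch.

```python
def reciprocal_rank(scores, correct_index):
    """1 / rank of the correct candidate when sorted by descending score."""
    correct_score = scores[correct_index]
    rank = 1 + sum(1 for s in scores if s > correct_score)
    return 1.0 / rank


def mean_reciprocal_rank(all_scores, correct_indices):
    """Average reciprocal rank over all queries."""
    values = [reciprocal_rank(s, i) for s, i in zip(all_scores, correct_indices)]
    return sum(values) / len(values)


# Hypothetical query-to-candidate similarity scores.
all_scores = [
    [0.9, 0.1, 0.3],  # correct candidate 0 is ranked 1st
    [0.2, 0.8, 0.5],  # correct candidate 2 is ranked 2nd
    [0.4, 0.6, 0.7],  # correct candidate 0 is ranked 3rd
]
correct_indices = [0, 2, 0]
mrr = mean_reciprocal_rank(all_scores, correct_indices)
print(mrr)  # (1 + 1/2 + 1/3) / 3, roughly 0.611
```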
Table 4 presents results on the CT and POJ104
datasets, reported as MRR for code
translation (Java to C# retrieval) and mean average
precision (mAP) for POJ104 (C++ to C++ retrieval).
ModularStarEncoder achieves the highest MRR on CT
and the highest mAP among open-source models on
POJ104. We replicated the
benchmarking for all models in a zero-shot setting
for code-to-code tasks because our model does not
include POJ104 or the code translation dataset
in its training set.
Referring to Table 4, on the POJ104 dataset
in the zero-shot setting, ModularStarEncoder achieves
an mAP of 0.57 (56.5 in Table 4), which is state of the art among
open-source models but significantly
behind OpenAI text-embedding-3-large. We underscore
that a direct comparison with OpenAI
text-embedding-3-large remains challenging because it
is closed-source, and details such as model size,
training methodology, or potential data contamination
are undisclosed.
Table 5: Performance comparison of ModularStarEncoder layers and baseline fine-tuned models on the
CodeSearchNet benchmark. The table displays the overall retrieval performance measured by Mean Reciprocal
Rank (MRR). We refer to ModularStarEncoder, fine-tuned with multiple exit points simultaneously, as
self-distilled. The models not marked as self-distilled are the baselines, fine-tuned individually for each exit point.
CodeSearchNet
Model | Size | Ruby | Javascript | Go | Python | Java | PHP | avg. MRR
Layer-4 | ≈160M | 59.5 | 61.3 | 72.1 | 86.2 | 68.2 | 75.5 | 70.5
Layer-4 (self-distilled) | ≈160M | 62.2 | 64.7 | 74.8 | 88.1 | 71.4 | 78.0 | 73.2
Layer-9 | ≈300M | 64.9 | 65.7 | 74.3 | 87.3 | 72.0 | 78.8 | 73.8
Layer-9 (self-distilled) | ≈300M | 67.6 | 69.4 | 78.9 | 90.2 | 75.5 | 82.3 | 77.3
Layer-18 | ≈550M | 73.8 | 73.5 | 82.4 | 92.1 | 78.4 | 84.0 | 80.7
Layer-18 (self-distilled) | ≈550M | 74.1 | 74.0 | 82.5 | 92.5 | 78.7 | 84.5 | 81.0
Layer-27 | ≈800M | 72.3 | 71.8 | 80.8 | 90.8 | 76.9 | 82.3 | 79.1
Layer-27 (self-distilled) | ≈800M | 73.2 | 73.3 | 81.7 | 92.1 | 77.8 | 83.8 | 80.3
Layer-36 | ≈1B | 72.3 | 72.9 | 80.7 | 91.5 | 77.1 | 82.9 | 79.5
Layer-36 (self-distilled) | ≈1B | 73.5 | 72.6 | 80.5 | 91.4 | 76.9 | 82.7 | 79.6
3.2 Ablation Study
We conducted an ablation study by fine-tuning
each exit point individually (also starting from the
pre-trained ModularStarEncoder) and pruning
the subsequent layers (e.g., for the baseline on
layer 18, we retain only the first 18 layers and
fine-tune the model using just one projection
head on that layer). Finally, we compared the
sliced models with the corresponding results
(self-distilled) of the model fine-tuned with the
multi-layer loss (ModularStarEncoder). As shown in
Table 5, ModularStarEncoder, indicated as
self-distilled, outperforms the single-exit baselines
in average MRR at every exit point, with the largest
gains at layers 4 and 9 and only a marginal
difference at layer 36. This indicates that lower-level
layers benefit from training signals propagated from
deeper layers. This finding underscores
a promising new direction in self-distillation
for large-scale code and text models, enabling
high performance even in more compact
configurations. Moreover, Figure 4 illustrates that
ModularStarEncoder maintains robust performance
from layers 18 to 36, allowing users to scale down
the network to match their memory, computational,
or latency constraints while preserving strong
retrieval accuracy.
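To make the deployment trade-off concrete, the sketch below picks an exit point under a parameter budget, using the approximate sizes and the self-distilled average MRR values from Table 5. The selection rule (best average MRR among the exits that fit) and the function name are illustrative assumptions, not part of the released model's API.

```python
# (exit layer, approx. parameters in millions, avg. MRR of the self-distilled model),
# values copied from Table 5.
EXITS = [
    (4, 160, 73.2),
    (9, 300, 77.3),
    (18, 550, 81.0),
    (27, 800, 80.3),
    (36, 1000, 79.6),
]


def choose_exit(max_params_millions):
    """Return the exit with the best avg. MRR that fits the parameter budget."""
    feasible = [e for e in EXITS if e[1] <= max_params_millions]
    if not feasible:
        raise ValueError("no exit point fits the given budget")
    return max(feasible, key=lambda e: e[2])


for budget in (200, 600, 2000):
    layer, size, mrr = choose_exit(budget)
    print(f"budget {budget}M -> exit at layer {layer} (~{size}M params, avg. MRR {mrr})")
```

On these numbers, the exit at layer 18 is selected even when the full model would fit, since it has the highest average MRR in Table 5.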
4 Related work
Since the introduction of ELMo (Peters et al.,
2018), deep contextual information has improved
the embeddings generated for textual retrieval and
classification, reaching state-of-the-art results in
several tasks. BERT (Devlin et al., 2019) followed
those findings, adapting the Transformer architecture
(Vaswani et al., 2017) to enable a bi-directional
representation with two different training objectives,
namely the masked language modeling and
the next sentence prediction losses. Lan et al.
(2019) and Liu et al. (2019) adapted the BERT
architecture to obtain an enhanced pre-trained model
by removing or modifying the NSP loss and focusing on
pre-training data or hyperparameter optimization.
More recently, ModernBERT (Warner et al., 2024)
bridged the gap with the advancements of modern
decoders (Jiang et al.,
2023; Hui et al., 2024; Dubey et al., 2024; Touvron
et al., 2023; Lozhkov et al., 2024),
which rely upon models with an increased number of
parameters, trained on more tokens, and
capable of handling longer contextual information.
In code representation, large language models
must be adapted by training them on a curated corpus
focused on software and by leveraging code's
syntactic and semantic structures, which differ
significantly from natural language. Feng et al. (2020)
adapted the BERT architecture to produce semantically
meaningful embeddings for source code,
resulting in CodeBERT. This was accomplished
by including more source code in the training set
and focusing on a training loss that can leverage
bimodal (natural language and code) contextual
information (Clark et al., 2020). GraphCodeBERT
enhanced CodeBERT (Feng et al., 2020) representations
by incorporating data flow graphs, capturing
dependencies between variables and operations,
and improving tasks like code summarization and
clone detection. UniXcoder (Guo et al., 2022)
extended this by introducing a unified encoder-decoder
framework, integrating abstract syntax
trees (ASTs) and data flow information. Wang et al.
(2023) expanded these findings with CodeT5+,
stressing how multiple losses that leverage code
semantics impact model pretraining. The work
incorporated text-code contrastive learning,
text-code matching, and text-code causal LM for better
code understanding and generation.
[Figure 4 plots: (a) MRR and (b) Recall@1 as a function of the exit layer, for the baseline models and the self-distilled model. Annotated gains of the self-distilled model at the five exit layers, in plotted order: +2.72%, +3.47%, +0.40%, +1.18%, +0.00% (MRR) and +3.11%, +4.36%, +0.58%, +1.46%, +0.27% (Recall@1).]
Figure 4: Performance Comparison Across Layers: The graph illustrates the MRR and the Recall@1 for different
layers, comparing baseline models and a self-distilled model.
When trying to achieve better performance, re-
search has shifted toward models with a high num-
ber of parameters. While this trend appears ef-
fective from a performance perspective, end users
may face computational or memory limitations as
LLMs vary from millions to billions of parame-
ters. Sanh et al. (2019) pioneered the introduction
of knowledge distillation, using a “teacher” model
that guides a smaller model to emulate its behav-
ior. This methodology has been widely adopted
and improved upon recently (DeepSeek-AI et al.,
2025; Hui et al., 2024), becoming a standard for
obtaining high-performing smaller LLMs.
Our work differs from previous work by adapting
a modern architecture (Lozhkov et al., 2024) to
a code encoder-only model and introducing
a novel "self-distillation" mechanism. We replace
the next sentence prediction loss with an in-context
classification focused on the repository level and
expand the context to 2048 tokens. Our novel
self-distillation mechanism improves low-level layers,
resulting in a modular transformer architecture
without additional teacher models or further data
for distillation.
5 Conclusion
In this work, we introduced ModularStarEncoder,
a modular multi-exit encoder architecture
designed to improve efficiency and scalability in
code retrieval tasks. By integrating an intra-model
self-distillation mechanism, our approach enables
multiple resolution models to be trained within
a unified layer stack, reducing redundancy while
maintaining high retrieval performance. Our evaluation
on CodeXGLUE demonstrates that ModularStarEncoder
achieves state-of-the-art results
among open-source models, outperforming
prior baselines across text-to-code and code-to-code
retrieval tasks. Ablations further highlighted
the benefits of self-distillation, showing that lower
layers gain representational strength from deeper
layers, leading to superior performance compared
to single-exit models.
Beyond performance gains, ModularStarEncoder
offers practical benefits
by providing multiple exit points, allowing users
to balance computational efficiency and accuracy
based on resource constraints. The results suggest
that self-distillation provides a promising direction
for efficient large-scale encoders, reducing
deployment costs without sacrificing effectiveness.
Finally, we release in open access our
SynthCode2Code2NL dataset and both the pre-trained and
fine-tuned ModularStarEncoder models.
Acknowledgments
We acknowledge ISCRA for awarding this project
access to the LEONARDO supercomputer, owned
by the EuroHPC Joint Undertaking, hosted by
CINECA (Italy).
Limitations
Training required multiple GPUs, which imposed
significant computational constraints. Hyperparameter
grid searches were feasible only with smaller,
preliminary models, from which we extrapolated
the final setup. The best hyperparameters for
smaller models can differ from those for larger
ones, so the training setup we adopted may not be
optimal. For the same reason, ablating both the
in-context classification and the multi-layer loss at
full scale was not possible, and we relied on
smaller models to assess their effect.
Computational resources are therefore a significant
constraint in this work, and we emphasize that
this factor makes the experiments harder to
replicate.
Here, we highlight potential threats to the va-
lidity of the research process, focusing on both
external and internal factors.
External validity When synthesizing the
SYNTHCODE2CODE2NL code, we rely on code
translation, and we acknowledge that synthesized
code follows stylistic patterns distinct from those of
human-written code. We tested the model's performance on
standard benchmarks. However, the impact of
training on synthetic code snippets on
generalization to human-written text-to-code and
code-to-code search is still
not fully understood.
Internal validity The ablation study focused on
fine-tuning the model with and without the multi-layer
loss. However, this comparison does not account
for how the model behaves when fine-tuning starts from
a checkpoint that was not pre-trained with the multi-layer
loss. Although our experiments present promising results,
further investigation is necessary to better understand
this phenomenon.
References
Joshua Ainslie, James Lee-Thorp, Michiel de Jong,
Yury Zemlyanskiy, Federico Lebrón, and Sumit Sang-
hai. 2023. GQA: Training Generalized Multi-Query
Transformer Models from Multi-Head Checkpoints.
arXiv e-prints , arXiv:2305.13245.
Loubna Ben Allal, Anton Lozhkov, Elie Bak-
ouch, Gabriel Martín Blázquez, Guilherme Penedo,
Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček,
Agustín Piqueres Lajarín, Vaibhav Srivastav, et al.
2025. Smollm2: When smol goes big–data-centric
training of a small language model. arXiv preprint
arXiv:2502.02737 .
Stéphane Aroca-Ouellette and Frank Rudzicz. 2020. On
losses for modern language models. arXiv preprint
arXiv:2010.01694 .
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen,
Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong,
Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun
Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong
Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie
Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi
Li, Yao Li, Y. K. Li, Wenfeng Liang, Fangyun Lin,
Alex X. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin
Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo,
Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Jun-
jie Qiu, Hui Qu, Tongzheng Ren, Zehui Ren, Chong
Ruan, Zhangli Sha, Zhihong Shao, Junxiao Song,
Xuecheng Su, Jingxiang Sun, Yaofeng Sun, Minghui
Tang, Bingxuan Wang, Peiyi Wang, Shiyu Wang,
Yaohui Wang, Yongji Wang, Tong Wu, Y. Wu, Xin
Xie, Zhenda Xie, Ziwei Xie, Yiliang Xiong, Hanwei
Xu, R. X. Xu, Yanhong Xu, Dejian Yang, Yuxiang
You, Shuiping Yu, Xingkai Yu, B. Zhang, Haowei
Zhang, Lecong Zhang, Liyue Zhang, Mingchuan
Zhang, Minghua Zhang, Wentao Zhang, Yichao
Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou,
Shunfeng Zhou, Qihao Zhu, and Yuheng Zou. 2024.
Deepseek LLM: scaling open-source language mod-
els with longtermism. CoRR , abs/2401.02954.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and
Christopher D. Manning. 2020. ELECTRA: pre-
training text encoders as discriminators rather than
generators. In 8th International Conference on
Learning Representations, ICLR 2020, Addis Ababa,
Ethiopia, April 26-30, 2020 . OpenReview.net.
Tri Dao. 2023. FlashAttention-2: Faster Attention with
Better Parallelism and Work Partitioning. arXiv e-
prints , arXiv:2307.08691.
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang,
Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu,
Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang,
Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong
Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue,
Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu,
Chenggang Zhao, Chengqi Deng, Chenyu Zhang,
Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji,
Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo,
Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang,
Han Bao, Hanwei Xu, Haocheng Wang, Honghui
Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li,
Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang
Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L.
Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai
Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai
Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong
Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan
Zhang, Minghua Zhang, Minghui Tang, Meng Li,
Miaojun Wang, Mingming Li, Ning Tian, Panpan
Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen,
Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan,
Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen,
Shanghao Lu, Shangyan Zhou, Shanhuang Chen,
Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng
Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing
Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun,
T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu,
Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao
Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan
Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin
Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li,
Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin,
Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxi-
ang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang,
Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang
Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng
Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi,
Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang,
Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo,
Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yu-
jia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You,
Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu,
Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu,
Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan,
Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean
Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao,
Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zi-
jia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song,
Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu
Zhang, and Zhen Zhang. 2025. Deepseek-r1: Incen-
tivizing reasoning capability in llms via reinforce-
ment learning. Preprint , arXiv:2501.12948.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: pre-training of
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, NAACL-HLT 2019, Minneapolis, MN, USA,
June 2-7, 2019, Volume 1 (Long and Short Papers) ,
pages 4171–4186. Association for Computational
Linguistics.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey,
Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman,
Akhil Mathur, Alan Schelten, Amy Yang, Angela
Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang,
Archi Mitra, Archie Sravankumar, Artem Korenev,
Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien
Rodriguez, Austen Gregerson, Ava Spataru, Bap-
tiste Rozière, Bethany Biron, Binh Tang, Bobbie
Chern, Charlotte Caucheteux, Chaya Nayak, Chloe
Bi, Chris Marra, Chris McConnell, Christian Keller,
Christophe Touret, Chunyang Wu, Corinne Wong,
Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Al-
lonsius, Daniel Song, Danielle Pintz, Danny Livshits,
David Esiobu, Dhruv Choudhary, Dhruv Mahajan,
Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes,
Egor Lakomkin, Ehab AlBadawy, Elina Lobanova,
Emily Dinan, Eric Michael Smith, Filip Radenovic,
Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Geor-
gia Lewis Anderson, Graeme Nail, Grégoire Mialon,
Guan Pang, Guillem Cucurell, Hailey Nguyen, Han-
nah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov,
Imanol Arrieta Ibarra, Isabel M. Kloumann, Ishan
Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan
Geffert, Jana Vranes, Jason Park, Jay Mahadeokar,
Jeet Shah, Jelmer van der Linde, Jennifer Billock,
Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi,
Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu,
Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph
Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia,
Kalyan Vasuden Alwala, Kartikeya Upasani, Kate
Plawiak, Ke Li, Kenneth Heafield, Kevin Stone,
et al. 2024. The llama 3 herd of models. CoRR ,
abs/2407.21783.
Vage Egiazarian, Andrei Panferov, Denis Kuznedelev,
Elias Frantar, Artem Babenko, and Dan Alistarh.
2024. Extreme Compression of Large Language
Models via Additive Quantization. arXiv e-prints ,
arXiv:2401.06118.
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan,
Xiaocheng Feng, Ming Gong, Linjun Shou, Bing
Qin, Ting Liu, Daxin Jiang, and Ming Zhou.
2020. CodeBERT: A Pre-Trained Model for Pro-
gramming and Natural Languages. arXiv e-prints ,
arXiv:2002.08155.
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xi-
aocheng Feng, Ming Gong, Linjun Shou, Bing Qin,
Ting Liu, Daxin Jiang, and Ming Zhou. 2020. Code-
bert: A pre-trained model for programming and nat-
ural languages. In Findings of the Association for
Computational Linguistics: EMNLP 2020, Online
Event, 16-20 November 2020 , volume EMNLP 2020
of Findings of ACL , pages 1536–1547. Association
for Computational Linguistics.
Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming
Zhou, and Jian Yin. 2022. Unixcoder: Unified cross-
modal pre-training for code representation. In Pro-
ceedings of the 60th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 1: Long
Papers), ACL 2022, Dublin, Ireland, May 22-27,
2022 , pages 7212–7225. Association for Computa-
tional Linguistics.
Song Han, Jeff Pool, John Tran, and William J. Dally.
2015. Learning both Weights and Connections
for Efficient Neural Networks. arXiv e-prints ,
arXiv:1506.02626.
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Day-
iheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang,
Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang,
Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and
Junyang Lin. 2024. Qwen2.5-coder technical report.
CoRR , abs/2409.12186.
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis
Allamanis, and Marc Brockschmidt. 2019. Code-
searchnet challenge: Evaluating the state of semantic
code search. CoRR , abs/1909.09436.
Benoit Jacob, Skirmantas Kligys, Bo Chen, Meng-
long Zhu, Matthew Tang, Andrew Howard, Hartwig
Adam, and Dmitry Kalenichenko. 2017. Quantiza-
tion and Training of Neural Networks for Efficient
Integer-Arithmetic-Only Inference. arXiv e-prints ,
arXiv:1712.05877.
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men-
sch, Chris Bamford, Devendra Singh Chaplot, Diego
de las Casas, Florian Bressand, Gianna Lengyel, Guil-
laume Lample, Lucile Saulnier, Lélio Renard Lavaud,
Marie-Anne Lachaux, Pierre Stock, Teven Le Scao,
Thibaut Lavril, Thomas Wang, Timothée Lacroix,
and William El Sayed. 2023. Mistral 7B. arXiv
e-prints , arXiv:2310.06825.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao
Chen, Linlin Li, Fang Wang, and Qun Liu. 2019.
TinyBERT: Distilling BERT for Natural Language
Understanding. arXiv e-prints , arXiv:1909.10351.
Aditya Kusupati, Gantavya Bhatt, Aniket Rege,
Matthew Wallingford, Aditya Sinha, Vivek Ramanu-
jan, William Howard-Snyder, Kaifeng Chen, Sham
Kakade, Prateek Jain, and Ali Farhadi. 2022. Ma-
tryoshka representation learning. In Advances in
Neural Information Processing Systems , volume 35,
pages 30233–30249. Curran Associates, Inc.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman,
Kevin Gimpel, Piyush Sharma, and Radu Soricut.
2019. ALBERT: A Lite BERT for Self-supervised
Learning of Language Representations. arXiv e-
prints , arXiv:1909.11942.
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas
Muennighoff, Denis Kocetkov, Chenghao Mou, Marc
Marone, Christopher Akiki, Jia Li, Jenny Chim,
Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo,
Thomas Wang, Olivier Dehaene, Mishig Davaadorj,
Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko,
Nicolas Gontier, Nicholas Meade, Armel Zebaze,
Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu,
Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo
Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp
Patel, Dmitry Abulkhanov, Marco Zocca, Manan
Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhat-
tacharyya, Wenhao Yu, Swayam Singh, Sasha Luc-
cioni, Paulo Villegas, Maxim Kunakov, Fedor Zh-
danov, Manuel Romero, Tony Lee, Nadav Timor,
Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf,
Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jen-
nifer Robinson, Carolyn Jane Anderson, Brendan
Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel
Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos
Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Ar-
jun Guha, Leandro von Werra, and Harm de Vries.
2023. StarCoder: may the source be with you! arXiv
e-prints , arXiv:2305.06161.
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-
Ming Chen, Wei-Chen Wang, Guangxuan Xiao,
Xingyu Dang, Chuang Gan, and Song Han. 2023.
AWQ: Activation-aware Weight Quantization for
LLM Compression and Acceleration. arXiv e-prints ,
arXiv:2306.00978.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. 2019.
Roberta: A robustly optimized BERT pretraining
approach. CoRR , abs/1907.11692.
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Fed-
erico Cassano, Joel Lamy-Poirier, Nouamane Tazi,
Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei,
Tianyang Liu, Max Tian, Denis Kocetkov, Arthur
Zucker, Younes Belkada, Zijian Wang, Qian Liu,
Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-
Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue
Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade,
Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su,
Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai,
Niklas Muennighoff, Xiangru Tang, Muhtasham
Oblokulov, Christopher Akiki, Marc Marone, Cheng-
hao Mou, Mayank Mishra, Alex Gu, Binyuan Hui,
Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas
Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten
Scholak, Sebastien Paquet, Jennifer Robinson, Car-
olyn Jane Anderson, Nicolas Chapados, Mostofa Pat-
wary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz
Ferrandis, Lingming Zhang, Sean Hughes, Thomas
Wolf, Arjun Guha, Leandro von Werra, and Harm
de Vries. 2024. StarCoder 2 and The Stack v2: The
Next Generation. arXiv e-prints , arXiv:2402.19173.
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey
Svyatkovskiy, Ambrosio Blanco, Colin B. Clement,
Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Li-
dong Zhou, Linjun Shou, Long Zhou, Michele Tu-
fano, Ming Gong, Ming Zhou, Nan Duan, Neel Sun-
daresan, Shao Kun Deng, Shengyu Fu, and Shujie
Liu. 2021. Codexglue: A machine learning bench-
mark dataset for code understanding and generation.
In Proceedings of the Neural Information Process-
ing Systems Track on Datasets and Benchmarks 1,
NeurIPS Datasets and Benchmarks 2021, December
2021, virtual .
Changan Niu, Chuanyi Li, Vincent Ng, Dongxiao Chen,
Jidong Ge, and Bin Luo. 2023. An empirical compar-
ison of pre-trained models of source code. In 2023
IEEE/ACM 45th International Conference on Soft-
ware Engineering (ICSE) , pages 2136–2148. IEEE.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt
Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. 2018. Deep contextualized word repre-
sentations. In Proceedings of the 2018 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, Volume 1 (Long Papers) , pages 2227–2237,
New Orleans, Louisiana. Association for Computa-
tional Linguistics.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas-
try, Amanda Askell, Pamela Mishkin, Jack Clark,
Gretchen Krueger, and Ilya Sutskever. 2021. Learn-
ing transferable visual models from natural language
supervision. In Proceedings of the 38th International
Conference on Machine Learning, ICML 2021, 18-24
July 2021, Virtual Event , volume 139 of Proceedings
of Machine Learning Research , pages 8748–8763.
PMLR.
Victor Sanh, Lysandre Debut, Julien Chaumond, and
Thomas Wolf. 2019. DistilBERT, a distilled version
of BERT: smaller, faster, cheaper and lighter. arXiv
e-prints , arXiv:1910.01108.
Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang,
Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A.
Smith, Luke Zettlemoyer, and Tao Yu. 2023. One
embedder, any task: Instruction-finetuned text em-
beddings. In Findings of the Association for Com-
putational Linguistics: ACL 2023, Toronto, Canada,
July 9-14, 2023 , pages 1102–1121. Association for
Computational Linguistics.
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha,
Bo Wen, and Yunfeng Liu. 2021. RoFormer: En-
hanced Transformer with Rotary Position Embedding.
arXiv e-prints , arXiv:2104.09864.
Surat Teerapittayanon, Bradley McDanel, and H. T.
Kung. 2017. BranchyNet: Fast Inference via Early
Exiting from Deep Neural Networks. arXiv e-prints ,
arXiv:1709.01686.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Al-
bert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-
Ferrer, Moya Chen, Guillem Cucurull, David Esiobu,
Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller,
Cynthia Gao, Vedanuj Goswami, Naman Goyal, An-
thony Hartshorn, Saghar Hosseini, Rui Hou, Hakan
Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa,
Isabel Kloumann, Artem Korenev, Punit Singh Koura,
Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Di-
ana Liskovich, Yinghai Lu, Yuning Mao, Xavier Mar-
tinet, Todor Mihaylov, Pushkar Mishra, Igor Moly-
bog, Yixin Nie, Andrew Poulton, Jeremy Reizen-
stein, Rashi Rungta, Kalyan Saladi, Alan Schelten,
Ruan Silva, Eric Michael Smith, Ranjan Subrama-
nian, Xiaoqing Ellen Tan, Binh Tang, Ross Tay-
lor, Adina Williams, Jian Xiang Kuan, Puxin Xu,
Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan,
Melanie Kambadur, Sharan Narang, Aurélien Ro-
driguez, Robert Stojnic, Sergey Edunov, and Thomas
Scialom. 2023. Llama 2: Open foundation and fine-
tuned chat models. CoRR , abs/2307.09288.
Matteo Turisini, Giorgio Amati, and Mirko Cestari.
2023. LEONARDO: A pan-european pre-exascale
supercomputer for HPC and AI applications. CoRR ,
abs/2307.16885.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. 2017. Attention Is All
You Need. arXiv e-prints , arXiv:1706.03762.
Yue Wang, Hung Le, Akhilesh Gotmare, Nghi D. Q. Bui,
Junnan Li, and Steven C. H. Hoi. 2023. Codet5+:
Open code large language models for code under-
standing and generation. In Proceedings of the 2023
Conference on Empirical Methods in Natural Lan-
guage Processing, EMNLP 2023, Singapore, Decem-
ber 6-10, 2023 , pages 1069–1088. Association for
Computational Linguistics.
Benjamin Warner, Antoine Chaffin, Benjamin Clavié,
Orion Weller, Oskar Hallström, Said Taghadouini,
Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom
Aarsen, Nathan Cooper, Griffin Adams, Jeremy
Howard, and Iacopo Poli. 2024. Smarter, better,
faster, longer: A modern bidirectional encoder for
fast, memory efficient, and long context finetuning
and inference. CoRR , abs/2412.13663.
Jinle Zeng, Min Li, Zhihua Wu, Jiaqi Liu, Yuang Liu,
Dianhai Yu, and Yanjun Ma. 2022. Boosting dis-
tributed training performance of the unpadded BERT
model. CoRR , abs/2208.08124.
Dejiao Zhang, Wasi Ahmad, Ming Tan, Hantian Ding,
Ramesh Nallapati, Dan Roth, Xiaofei Ma, and Bing
Xiang. 2024. Code Representation Learning At Scale.
arXiv e-prints , arXiv:2402.01935.
Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad
Shoeybi, Tom Goldstein, Anima Anandkumar, and
Bryan Catanzaro. 2021. Long-short transformer: Ef-
ficient transformers for language and vision. In Ad-
vances in Neural Information Processing Systems 34:
Annual Conference on Neural Information Process-
ing Systems 2021, NeurIPS 2021, December 6-14,
2021, virtual , pages 17723–17736.
A Synthetic dataset
SYNTHCODE2CODE2NL is a fine-tuning dataset
designed for text-to-code and code-to-code search,
built by augmenting CODESEARCHNET (Husain
et al., 2019) with transpiled code snippets across
multiple languages (Python, Java, Go, PHP, Ruby,
C++, C, JavaScript). The dataset underwent a pre-
processing phase, including deduplication based on
the original and synthesized code columns. Near-
deduplication was performed using Locality Sensi-
tive Hashing (LSH) with a Jaccard similarity thresh-
old of 0.7 over character-level 5-grams to remove
semantically identical snippets differing only in
identifiers or function arguments.
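To make the near-deduplication criterion concrete, the following minimal Python sketch computes the Jaccard similarity over character-level 5-grams and drops any snippet that reaches the 0.7 threshold against an already kept snippet. It is an illustration under stated assumptions, not the pipeline used to build the dataset: it compares pairs exactly instead of using the LSH approximation, it treats a similarity equal to the threshold as a duplicate, and the function names (char_ngrams, jaccard, near_dedup) are hypothetical.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Return the set of character-level n-grams of `text`."""
    if len(text) < n:
        # Assumption: a string shorter than n is its own single shingle.
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_dedup(snippets: list[str], threshold: float = 0.7, n: int = 5) -> list[str]:
    """Keep a snippet only if its similarity to every kept snippet is below `threshold`.

    Exact pairwise comparison (quadratic cost); LSH would replace this loop
    with approximate candidate retrieval on large corpora.
    """
    kept: list[str] = []
    kept_grams: list[set[str]] = []
    for snippet in snippets:
        grams = char_ngrams(snippet, n)
        if all(jaccard(grams, other) < threshold for other in kept_grams):
            kept.append(snippet)
            kept_grams.append(grams)
    return kept
```

The exact comparison scales quadratically with the number of snippets, which is why an approximate scheme such as LSH is the practical choice for a corpus of this size.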
For code-to-code search, we translated each snip-
pet into a randomly sampled target language us-
ing the QWEN2.5-CODER-7B-INSTRUCT model
with greedy search to ensure consistency. Each
dataset entry consists of a natural language descrip-
tion and two code snippets in different languages.
SYNTHCODE2CODE2NL contains 1,071,367 sam-
ples, with original code from CODESEARCHNET
(Python, Java, PHP, Go) and translated code (Go,
Ruby, JavaScript, Python, C++, PHP, C, Java).
Figures 5, 6, and 7 show examples of code translation.
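The target-language sampling step can be sketched as follows. This is a minimal illustration, assuming the target is drawn uniformly from the listed translated languages and always differs from the source language (consistent with each entry holding two snippets in different languages). The function names, the entry field names, and the prompt wording are hypothetical, and the call to the translation model is omitted.

```python
import random

# Translated-code languages listed in this appendix.
TARGET_LANGUAGES = ["Go", "Ruby", "JavaScript", "Python", "C++", "PHP", "C", "Java"]


def sample_target_language(source: str, rng: random.Random) -> str:
    """Draw a target language uniformly, excluding the source language (assumption)."""
    candidates = [lang for lang in TARGET_LANGUAGES if lang != source]
    return rng.choice(candidates)


def build_translation_request(description: str, code: str, source: str,
                              rng: random.Random) -> dict:
    """Assemble one hypothetical translation request for a dataset entry.

    The prompt text is illustrative only; the translated snippet would be
    filled in by the translation model, which is not called here.
    """
    target = sample_target_language(source, rng)
    return {
        "description": description,
        "source_language": source,
        "source_code": code,
        "target_language": target,
        "prompt": f"Translate the following {source} code to {target}:\n{code}",
    }


# Usage: a seeded generator keeps the sampled targets reproducible.
rng = random.Random(0)
request = build_translation_request(
    "Add two numbers.", "def add(a, b):\n    return a + b", "Python", rng
)
```

Seeding the generator makes the language assignment reproducible, which complements the deterministic output obtained from greedy search.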
Go
// readNextChunk creates an io.LimitedReader for the next chunk of data,
// and returns io.EOF if the last chunk has been read.
func readNextChunk(r wire.SyncScanner) (io.Reader, error) {
    status, err := r.ReadStatus("read-chunk")
    if err != nil {
        if wire.IsAdbServerErrorMatching(err, readFileNotFoundPredicate) {
            return nil, errors.Errorf(errors.FileNoExistError, "no such file or directory")
        }
        return nil, err
    }
    switch status {
    case wire.StatusSyncData:
        return r.ReadBytes()
    case wire.StatusSyncDone:
        return nil, io.EOF
    default:
        return nil, errors.Errorf(
            errors.AssertionError,
            "expected chunk id '%s' or '%s', but got '%s'",
            wire.StatusSyncData, wire.StatusSyncDone, []byte(status),
        )
    }
}
Ruby
def read_next_chunk(r)
  status, err = r.read_status("read-chunk")
  if err
    if AdbServerErrorMatching(err, readFileNotFoundPredicate)
      return nil, FileNoExistError.new("no such file or directory")
    end
    return nil, err
  end
  case status
  when wire.StatusSyncData
    return r.read_bytes
  when wire.StatusSyncDone
    return nil, EOFError.new
  else
    return nil, AssertionError.new(
      "expected chunk id '#{wire.StatusSyncData}' or '#{wire.StatusSyncDone}', but got '#{status}'"
    )
  end
end
Figure 5: Example of a code snippet translation from Go to Ruby.
Java
public void get(URI uri, Credentials credentials) throws IOException {
    get(uri, null, credentials);
}
C
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef struct {
    char *username;
    char *password;
} Credentials;
typedef struct {
    char *scheme;
    char *host;
    int port;
    char *path;
} URI;
void get(URI *uri, Credentials *credentials) {
    // Implementation of the get function
    printf("Fetching resource from %s://%s:%d%s\n", uri->scheme, uri->host, uri->port, uri->path);
    if (credentials != NULL) {
        printf("Using credentials: %s:%s\n", credentials->username, credentials->password);
    }
}
int main() {
    URI uri = {"http", "example.com", 80, "/index.html"};
    Credentials credentials = {"user", "pass"};
    get(&uri, &credentials);
    return 0;
}
Figure 6: Example of a code snippet translation from Java to C.
Python
def toString(self):
    result = []
    k, v = self.optimalRepr()
    longest = reduce(lambda x, y: x if x > len(y) else len(y), k, 0)
    for ind in range(len(k)):
        result.append("%s : %s" % (k[ind].ljust(longest), v[ind]))
    return "\n".join(result)
PHP
public function toString() {
    /**
     * Return a printable view of the dictionary
     */
    $result = [];
    list($k, $v) = $this->optimalRepr();
    $longest = array_reduce($k, function($x, $y) {
        return $x > strlen($y) ? $x : strlen($y);
    }, 0);
    for ($ind = 0; $ind < count($k); $ind++) {
        $result[] = sprintf("%s : %s", ltrim($k[$ind], ' '), str_pad($v[$ind], $longest, ' ', STR_PAD_LEFT));
    }
    return implode("\n", $result);
}
Figure 7: Example of a code snippet translation from Python to PHP.