Authors: Kshitij Ambilduke, Ben Peters, Sonal Sannigrahi, Anil Keshwani, Tsz Kin Lam, Bruno Martins, Marcely Zanon Boito, André F. T. Martins
From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM
Kshitij Ambilduke*†¹, Ben Peters*², Sonal Sannigrahi*²,³, Anil Keshwani†⁴,
Tsz Kin Lam⁵, Bruno Martins³,⁶, Marcely Zanon Boito‡⁷, André F. T. Martins‡²,³,⁸,⁹
¹Paris-Saclay University ²Instituto de Telecomunicações
³Instituto Superior Técnico, Universidade de Lisboa ⁴Sapienza University of Rome
⁵University of Edinburgh ⁶INESC-ID ⁷NAVER LABS Europe ⁸ELLIS Unit Lisbon ⁹Unbabel
Correspondence: benzurdopeters@gmail.com
Abstract
Large language models (LLMs) have shown
remarkable performance and generalization ca-
pabilities across multiple languages and tasks,
making them very attractive targets for multimodality integration (e.g., images or speech).
In this work, we extend an existing LLM to
the speech modality via speech discretization
and continued pre-training. In particular, we
are interested in multilingual LLMs, such as
TOWER, as their pre-training setting allows
us to treat discretized speech input as an additional translation language. The resulting
open-source model, SPIRE, is able to transcribe and translate English speech input while
maintaining TOWER’s original performance on
translation-related tasks, showcasing that dis-
cretized speech input integration as an addi-
tional language is feasible during LLM adapta-
tion. We make our code and models available
to the community.
1 Introduction
Large language models (LLMs) have demonstrated
remarkable success across various text-based natu-
ral language processing tasks (Achiam et al., 2023;
Touvron et al., 2023; Jiang et al., 2024; Yang et al.,
2024; Alves et al., 2024; Martins et al., 2024), moti-
vating research into extending them to other modali-
ties. This has led to the development of multimodal
LLMs (MLLMs) capable of processing speech, au-
dio, images and video (Team et al., 2023; Driess
et al., 2023; Rubenstein et al., 2023; Liu et al., 2023;
Tang et al., 2023; Défossez et al., 2024; Hu et al.,
2024; Laurençon et al., 2024; Huang et al., 2024;
Nguyen et al., 2025). For speech-LLM integration,
a straightforward approach is to link the output of an
automatic speech recognition (ASR) system to a
text-only LLM (Huang et al., 2024). However, this
solution fails to fully leverage the LLM’s superior
language modeling capabilities for disambiguating
transcripts.
*Equal contribution.
†Work begun during an internship at Instituto de Telecomunicações.
‡Equal contribution.
More popular are solutions that investigate equip-
ping LLMs with speech processing capabilities
through modality projection (Shu et al., 2023; Rad-
hakrishnan et al., 2023; Wu et al., 2023a; Tang
et al., 2023; Xue et al., 2024; Hu et al., 2024).
Typically, a speech foundation model generates
speech representations that are mapped to the em-
bedding space of the LLM. The speech model is
then fine-tuned along with a projector on speech-to-
text tasks to equip the LLM with speech processing
capabilities. In this setting, key challenges include
prompt overfitting and high training costs, as tun-
ing these MLLMs requires the adaptation of the
speech projector module on vast amounts of raw
speech data (Tang et al., 2023; Hu et al., 2024).
An alternative approach for MLLMs is the use
of speech discretization, where continuous speech
features are transformed prior to training into se-
quences of “discrete speech units” (DSUs), which
can be processed similarly to text (Chou et al.,
2023a; Zhang et al., 2023; Rubenstein et al., 2023;
Chang et al., 2024; Défossez et al., 2024; Trinh
et al., 2024; Maiti et al., 2024; Nguyen et al.,
2025). This approach simplifies training by elimi-
nating the need for additional parameters beyond
extended embedding matrices. Finally, while both
projector-based and discretization-based MLLMs
have shown promising results on text-to-speech
and/or speech-to-text tasks, their development has
prioritized speech-centric tasks at the expense
of textual performance. Currently, limited re-
search has focused on integrating speech while pre-
serving the LLM’s original capabilities in textual
tasks (Chou et al., 2023b; Huang et al., 2024).
In this work we present SPIRE, a speech-augmented LLM built on top of the open multilingual model TOWER (Alves et al., 2024). SPIRE
can process English speech and perform ASR
and speech translation (ST) while maintaining
TOWER’s strong performance on machine translation (MT). SPIRE encodes speech via HuBERT-based (Hsu et al., 2021) k-means clustering, as
in previous work (Zhang et al., 2023; Rubenstein
et al., 2023; Chang et al., 2024). We perform training in two stages: continued pre-training (CPT)
and instruction tuning (IT). For the CPT stage, we
use a mixture of ASR data and a small fraction of
TOWER’s text CPT data. For IT, we again leverage
TOWER’s task-specific MT data, as well as additional English ASR and ST data. SPIRE is trained
using approximately 42.5K hours of speech. Figure 1 illustrates our training process.
Figure 1: Illustration of the model training method for SPIRE BASE and SPIRE FULL.
arXiv:2503.10620v1 [cs.CL] 13 Mar 2025
We make the following contributions:
• We present a pipeline for integrating speech as
an additional modality into an existing LLM,
enabling it to transcribe and translate English
speech while preserving its original MT capabilities;
• We compare speech integration at two stages,
namely CPT and IT, demonstrating that both
stages are essential for achieving optimal performance on speech tasks;
• We reach ASR and ST results that are close
to those of strong speech-centric models
trained on larger amounts of data, namely Whisper-large-v3 (Radford et al., 2023) and SeamlessM4T (Barrault et al., 2023), while outperforming SeamlessM4T on MT;
• We provide a reproducible pipeline to the community: all our models, datasets and scripts
are made available.¹
¹ https://github.com/utter-project/SpireLM
2 Related Work
Speech-to-Text Models An increasing number
of studies have explored integrating speech into
LLMs (Zhang et al., 2023; Rubenstein et al., 2023;
Hassid et al., 2024). For discrete speech input, Hassid et al. (2024) demonstrate the benefits
of initializing a speech LLM from a text-based
LLM. SpeechGPT (Zhang et al., 2023) applies
IT on speech-to-text cross-modal ASR, text-to-
speech (TTS), and text-based question answering.
AudioPaLM (Rubenstein et al., 2023) is trained in
a multi-task fashion, similarly to SpeechGPT, but
on multilingual input. Recently, VoxtLM (Maiti
et al., 2024) was trained jointly on DSUs and text
data for ASR, TTS, and open-ended speech/text
generation. Our work is most similar to that of
Spirit-LM (Nguyen et al., 2025), which adapts an
LLM with an interleaved mixture of DSU and text
data which requires an expensive DSU-to-transcript
step to create. In contrast, we adopt a more cost-
effective input representation that can be extended
to any language, regardless of the availability of a
speech aligner. Our focus is on successfully incor-
porating speech input while preserving the origi-
nal competence of the model, so that the resulting
model can successfully perform both speech-to-text
and text-only tasks. None of the aforementioned
models are trained to preserve the original model’s
performance in text tasks.
Adapting LLMs Previous approaches involve
training from scratch with task- and domain-
specific data (Singhal et al., 2023; Lewkowycz
et al., 2022), performing CPT with a diverse
training data mix designed to broadly extend the
model’s knowledge (Wu et al., 2023b), or performing
IT on use-case-specific data (Chen et al., 2023). Recent work has explored combining the latter two
approaches (Xu et al., 2024a; Alves et al., 2024;
Wei et al., 2021; Roziere et al., 2023). In our approach to integrating DSUs into TOWER, we take
inspiration from Alves et al. (2024) in adopting a
two-step CPT+IT process. Our work differs in that
we focus on adding the speech modality, whereas
Alves et al. (2024) focused on increasing the multi-
lingual capabilities of an LLM.
Continuous and Discrete Speech Represen-
tations Self-supervised speech representation
models produce contextualized high-dimensional
speech vectors directly from raw audio (Hsu et al.,
2021; Baevski et al., 2020; Chen et al., 2022),
largely outperforming statistical speech features on
downstream tasks (Yang et al., 2021). These con-
tinuous representations can be used to derive DSUs
that capture both linguistic content and prosody
through clustering (Borsos et al., 2023; Kharitonov
et al., 2022). DSUs provide better alignment with
textual data, facilitating the transfer of successful
training settings from the text domain (Cui et al.,
2024). Building on Lakhotia et al. (2021), which
demonstrated that HuBERT (Hsu et al., 2021) is
a powerful feature extractor, several studies have
adopted this approach, incorporating a k-means
clustering step for discretization (Zhang et al.,
2023; Rubenstein et al., 2023; Lam et al., 2024;
Chang et al., 2024; Nguyen et al., 2025). Xu et al.
(2024b) study the optimal settings to obtain DSUs
in terms of cluster size and feature extraction layer.
We use their findings to inform our initial choices.
3 SPIRE: A Speech-to-Text LLM
Our goal is to equip an LLM with speech capabilities while preserving its preexisting text capabilities. As our starting point, we select TOWER (Alves
et al., 2024) as our LLM, which was developed
from Llama-2 (Touvron et al., 2023) with a two-step approach: CPT on a mixture of monolingual
and parallel data (TOWER BASE), followed by IT
on translation-related tasks (TOWER INSTRUCT).
We use a similar approach to extend TOWER to
speech. First, we perform CPT with a combination
of text-only and aligned speech-to-text datasets, followed by IT using both text-only general-purpose
and task-specific data curated in TOWER BLOCKS,²
alongside task-specific speech-to-text datasets. We
name our model SPIRE.
3.1 Speech Discretization
To more easily transfer the training set-up of
TOWER, we use DSUs as opposed to an auxiliary speech encoder. For all speech datasets that
were used, we follow recent discretization methodology (Zhang et al., 2023; Rubenstein et al., 2023;
Chang et al., 2024) to produce DSUs by first extracting continuous speech representations for our
speech data from the 22nd layer of an English
HuBERT-large model, and then using k-means clustering (K = 5000) to produce centroids that are
used to convert our continuous speech representation into a discrete sequence of cluster IDs.³
We train our k-means model on a collection of
235K audio files (approximately 720 hours), drawn
from three speech corpora: CoVoST-2 (Wang et al.,
2021b), VoxPopuli (Wang et al., 2021a), and Multilingual Librispeech (MLS; Pratap et al., 2020).
The CoVoST subset consists of 62K audio files
from 10,049 speakers, with a maximum of 8 audio
files per speaker. The VoxPopuli subset includes
65K audio files from 639 speakers, capped at 250
audio files per speaker. Finally, the MLS subset
contains 107K audio files from 5,490 speakers.
² https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2
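The discretization step in Section 3.1 can be illustrated with a minimal sketch. It assumes frame-level features and k-means centroids are already available as arrays; random toy arrays stand in for HuBERT features and for the K = 5000 centroids. The helper names (`assign_dsus`, `collapse_repeats`, `to_tokens`) are hypothetical, and collapsing consecutive repeats is only one plausible reading of the "deduplicated DSUs" mentioned in the caption of Table 2. Joining tokens without a separator is likewise an assumption.

```python
import numpy as np


def assign_dsus(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Return, for each frame, the index of its nearest centroid."""
    # Squared Euclidean distance expands to ||f||^2 - 2 f.c + ||c||^2; the
    # ||f||^2 term is constant per frame, so it can be dropped for the argmin.
    scores = -2.0 * features @ centroids.T + (centroids ** 2).sum(axis=1)
    return scores.argmin(axis=1)


def collapse_repeats(ids) -> list:
    """Collapse runs of identical consecutive cluster IDs (assumed dedup step)."""
    out = []
    for i in ids:
        i = int(i)
        if not out or out[-1] != i:
            out.append(i)
    return out


def to_tokens(ids) -> str:
    """Render cluster IDs as <extra_id_k> vocabulary items."""
    return "".join(f"<extra_id_{i}>" for i in ids)


rng = np.random.default_rng(0)
centroids = rng.normal(size=(50, 16))  # toy stand-in for the K = 5000 centroids
true_ids = [3, 3, 7, 7, 7, 1]
# Toy "speech frames": centroids plus a small amount of noise.
frames = centroids[true_ids] + 0.01 * rng.normal(size=(len(true_ids), 16))

ids = assign_dsus(frames, centroids)
units = collapse_repeats(ids)
print(to_tokens(units))  # <extra_id_3><extra_id_7><extra_id_1>
```

In a real pipeline the `frames` array would come from the chosen HuBERT layer and `centroids` from the trained k-means model; only the nearest-centroid lookup is shown here.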
3.2 SPIRE BASE
SPIRE BASE is trained from TOWER BASE-7B
using both text-only and aligned speech-to-text
datasets. Following previous work, we incorporate a fraction of TOWER’s original training data to
preserve its original performance (Scialom et al.,
2022; de Masson D’Autume et al., 2019).
3.2.1 Data
We use a mixture of monolingual and parallel
text in Chinese (zh), Dutch (nl), English (en),
French (fr), German (de), Italian (it), Korean (ko),
Portuguese (pt), Russian (ru), and Spanish (es), that
was sourced from the TOWER training data, as well
as English ASR data sourced from popular open-
source ASR datasets, as reported in Table 2. Both
speech and text data are downsampled to create a
6B token data mixture (5B speech; 1B text), measured by the model tokenizer.⁴ Note that the 5B
speech tokens include both DSUs (4.4B tokens)
and their text transcriptions (0.6B tokens).
Text Data The monolingual text data split corresponds to data from mC4 (Raffel et al., 2019), a
multilingual web-crawled corpus from which we uniformly sample across all languages. The
parallel data split includes uniformly sampled instances to and from English (en↔xx) for the 10
languages, sourced from various public sources.
Further details can be found in Alves et al. (2024).
³ Optimizing the layer selection for feature extraction is a
complex research problem (Pasad et al., 2023; Mousavi et al.,
2024). In this work we follow the insights from Gow-Smith
et al. (2023) and Xu et al. (2024b).
⁴ Preliminary experiments on the data mixture led to this
particular choice.
Speech Data We collect 35K hours of speech
data from SPGI Speech (O’Neill et al., 2021), GigaSpeech (Chen et al., 2021), MLS, and VoxPopuli. The transcription normalization process is explained in Appendix A.1.
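As a quick consistency check, the per-dataset hours reported in Table 2 can be summed to recover the roughly 35K hours used for CPT and the approximately 42.5K hours quoted for training overall. The snippet below is only an arithmetic sketch: the numbers are copied from Table 2, and it assumes the pseudo-ST rows are counted as separate hours even though they reuse audio from ASR corpora.

```python
# Hours, in thousands, copied from Table 2.
cpt_hours = {
    "SPGI Speech": 5.1,
    "GigaSpeech": 9.9,
    "MLS": 19.2,
    "VoxPopuli": 0.5,
}
it_hours = {
    "CV (ASR)": 0.8,
    "Europarl-ST": 1.0,
    "FLEURS": 0.09,
    "CoVoST-2": 0.09,
    "SPGI Speech (pseudo-ST)": 2.8,
    "GigaSpeech (pseudo-ST)": 1.3,
    "CV (pseudo-ST)": 1.7,
}

cpt_total = round(sum(cpt_hours.values()), 2)
it_total = round(sum(it_hours.values()), 2)
total = round(cpt_total + it_total, 2)

print(f"CPT: {cpt_total}K h, IT: {it_total}K h, total: {total}K h")
# CPT: 34.7K h, IT: 7.78K h, total: 42.48K h
```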
3.2.2 CPT Setup
We train SPIRE BASE using the MegatronLLM
codebase (Cano et al., 2023) on 8 A100-80GB
GPUs for 6 days. We use the same hyperparame-
ters as TOWER , except for the effective batch size,
which in our case is 2,304. To incorporate the
DSUs in the CPT stage, we extend the model’s original vocabulary by 5000 types, e.g., <extra_id_x>.
This allows us to have a vocabulary that can encode
both text in subword units and speech in DSUs. For
the extended vocabulary, we initialize new embeddings from a multivariate Gaussian distribution.
The mean of this distribution is set to the average
of the original embeddings, while the covariance is
derived from the empirical covariance of the original embeddings, scaled by a factor of 1 × 10⁻⁵
(Hewitt, 2021).
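The embedding initialization described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the training code: the function name `extend_embeddings` is hypothetical, a small random matrix stands in for the original embedding table, and only 5 new rows are drawn instead of 5000.

```python
import numpy as np


def extend_embeddings(old_emb: np.ndarray, n_new: int,
                      scale: float = 1e-5, seed: int = 0) -> np.ndarray:
    """Append n_new rows sampled from N(mean, scale * cov) of the old rows."""
    rng = np.random.default_rng(seed)
    mean = old_emb.mean(axis=0)                    # average original embedding
    cov = np.cov(old_emb, rowvar=False) * scale    # scaled empirical covariance
    new_rows = rng.multivariate_normal(mean, cov, size=n_new)
    return np.vstack([old_emb, new_rows])


# Toy stand-in for the original embedding matrix (vocab_size x hidden_dim).
old = np.random.default_rng(1).normal(size=(1000, 8))
extended = extend_embeddings(old, n_new=5)

print(extended.shape)  # (1005, 8)
```

Because the covariance is scaled down so strongly, the new rows start very close to the mean embedding, which keeps the model's initial predictions over the original vocabulary nearly unchanged.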
3.3 SPIRE FULL
The SPIRE FULL model is obtained by perform-
ing instruction tuning on SPIRE BASE using task-
specific text-only and aligned speech-to-text data.
3.3.1 Data
We use a mixture of text and speech instructions
for ASR, MT, and ST. The prompt formats used
during training are shown in Table 1.
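To make the IT prompt layouts of Table 1 concrete, the following sketch builds them as plain strings. The function names are hypothetical, and the exact whitespace, the language label strings, the lack of a separator between DSU tokens, and the surrounding chat template are all assumptions that are not specified here.

```python
from typing import List


def dsu_string(ids: List[int]) -> str:
    """Render DSU cluster IDs as <extra_id_k> tokens."""
    return "".join(f"<extra_id_{i}>" for i in ids)


def asr_prompt(ids: List[int], transcript: str = "") -> str:
    """ASR layout: speech units followed by the English transcript."""
    return f"Speech: {dsu_string(ids)}\nEnglish: {transcript}"


def direct_st_prompt(ids: List[int], target_lang: str, translation: str = "") -> str:
    """Direct ST layout: speech units followed by the translation."""
    return f"Speech: {dsu_string(ids)}\n{target_lang}: {translation}"


def multiturn_st_prompt(ids: List[int], transcript: str,
                        target_lang: str, translation: str = "") -> str:
    """Multi-turn ST layout: transcript first, then the translation."""
    return (f"Speech: {dsu_string(ids)}\n"
            f"English: {transcript}\n"
            f"{target_lang}: {translation}")


print(multiturn_st_prompt([12, 407, 3], "hello world", "German", "hallo Welt"))
```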
Text Data We use TOWER BLOCKS (Alves et al.,
2024), which includes high quality translation bi-
texts between English and the other languages sup-
ported by TOWER . It also includes instructions
for the translation-related tasks of named entity
recognition and automatic post-editing.
ASR Data We use 0.8K hours of ASR data from
CommonVoice version 17 (CV; Ardila et al., 2020).
The down-sampling strategy is described in Appendix A.1.
Table 1: Prompt formats used at different training stages.

ASR (CPT)
    Speech: <extra_id_i> ··· <extra_id_j>
    English: {TRANSCRIPT}
MT (CPT)
    Source_lang: Source-sentence
    Target_lang: {TRANSLATION}
ASR (IT)
    Speech: <extra_id_i> ··· <extra_id_j>
    English: {TRANSCRIPT}
Direct ST (IT)
    Speech: <extra_id_i> ··· <extra_id_j>
    TARGET_LANG: {TRANSLATION}
Multi-turn ST (IT)
    Speech: <extra_id_i> ··· <extra_id_j>
    English: {TRANSCRIPT}
    TARGET_LANG: {TRANSLATION}

ST Data In our IT set, we use 842 hours of
speech across three ST training sets: FLEURS (all
nine language pairs; we filter out examples whose
transcriptions overlap with the FLORES devtest
set), Europarl-ST (Iranzo-Sánchez et al., 2020) (en→{de, es, fr, it, nl, pt}), and CoVoST-2 (en→zh).
Since this amounts to far less data for ST than what
is available for ASR, and since en→{ko, ru} have
only examples from the tiny FLEURS set, we augment our speech collection with pseudo-labeled
data. This has been shown to be effective for other
ST systems (Barrault et al., 2023). From each of the CV,
SPGI, and GigaSpeech datasets, we select 300K
ASR samples to be pseudo-labeled. These examples are translated to all nine target languages using TowerInstruct-13B.⁵ Although this produces
a very large ST corpus, not all predictions are of
high quality, so we filter out examples for which the
transcript-translation combination has a COMET-QE⁶ (Rei et al., 2022b) score under 85. Finally,
for each language pair, we sample 60K examples from each of the three datasets
to be used in direct ST prompts and another 60K
examples to be used in multi-turn prompts. This process results in 180K direct ST prompts and 180K
multi-turn prompts for each language pair.⁷
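The filter-then-sample procedure for pseudo-labeled ST data can be sketched as follows. This is a toy illustration, not the released pipeline: the record fields (`id`, `qe`) and the scores are made up, the function name is hypothetical, and drawing the direct and multi-turn subsets without overlap is an assumption based on the phrase "another 60K".

```python
import random
from typing import Dict, List, Tuple


def filter_and_sample(examples: List[Dict], threshold: float = 85.0,
                      n_direct: int = 2, n_multi: int = 2,
                      seed: int = 0) -> Tuple[List[Dict], List[Dict]]:
    """Drop low-QE examples, then draw two disjoint random subsets."""
    kept = [ex for ex in examples if ex["qe"] >= threshold]  # remove scores under threshold
    rng = random.Random(seed)
    rng.shuffle(kept)
    direct = kept[:n_direct]
    multi = kept[n_direct:n_direct + n_multi]
    return direct, multi


# Hypothetical quality-estimation scores on a 0-100 scale.
scores = [91.2, 84.9, 88.0, 70.1, 95.5, 85.0, 86.3]
pool = [{"id": i, "qe": s} for i, s in enumerate(scores)]

direct, multi = filter_and_sample(pool)
print(len(direct), len(multi))  # 2 2
```

In the setting described above, `n_direct` and `n_multi` would each be 60K per dataset and language pair, and the scores would come from a quality-estimation model rather than a hand-written list.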
Table 2: Statistics for the speech data used for training.
Numbers of hours are approximated from the number
of deduplicated DSUs.

Dataset        Task        Phase   # DSUs   # Hours
SPGI Speech    ASR         CPT     645M     5.1K
GigaSpeech     ASR         CPT     1.2B     9.9K
MLS            ASR         CPT     2.4B     19.2K
VoxPopuli      ASR         CPT     69M      0.5K
CV             ASR         IT      105M     0.8K
Europarl-ST    ST          IT      122M     1.0K
FLEURS         ST          IT      11M      0.09K
CoVoST-2       ST          IT      12M      0.09K
SPGI Speech    Pseudo-ST   IT      350M     2.8K
GigaSpeech     Pseudo-ST   IT      161M     1.3K
CV             Pseudo-ST   IT      212M     1.7K

3.3.2 IT Training Setup
Similar to TOWER, we use the chatml template
(OpenAI, 2023) to format our instructions in dialogue form. We train models using Axolotl⁸ on 4
H100-80GB GPUs for 2.7 days. We use a learning
rate of 7 × 10⁻⁶ and a cosine scheduler with 100
warm-up steps. We train for 4 epochs with an effective batch size of 576 and a weight decay of 0.01.
We impose a maximum sequence length of 4096
and use the AdamW optimizer (Loshchilov and
Hutter, 2019). Other hyperparameters are derived
from TOWER INSTRUCT (Alves et al., 2024).
⁵ https://huggingface.co/Unbabel/TowerInstruct-13B-v0.1
⁶ https://huggingface.co/Unbabel/wmt22-cometkiwi-da
⁷ Due to our aggressive filtering, we were left with slightly
fewer examples for en→zh.
⁸ https://github.com/axolotl-ai-cloud/axolotl
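For readers unfamiliar with the schedule named in Section 3.3.2, a cosine learning-rate schedule with warm-up can be sketched as below. The peak rate (7 × 10⁻⁶) and the 100 warm-up steps come from the text; the linear shape of the warm-up, the decay to zero, and the total step count used in the example are assumptions, and `lr_at` is a hypothetical helper rather than part of any training library.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 7e-6,
          warmup_steps: int = 100, min_lr: float = 0.0) -> float:
    """Learning rate at a given step: linear warm-up, then cosine decay."""
    if step < warmup_steps:
        # Assumed linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for s in (0, 50, 100, 550, 1000):
    print(s, lr_at(s, total))
```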
4 Experiments
We evaluate our models across three tasks: ASR,
MT, and ST. First, we present our results for
ASR (§4.1), confirming the competitive perfor-
mance of SPIRE in the speech domain. We then
present MT results (§4.2), demonstrating that the
speech performance does not come at the expense
of the original model’s MT performance. Finally,
we present results for ST (§4.3) to investigate
model performance on a task that requires both
ASR and MT capabilities.
Our Models Across experiments, we compare
all models shown in Table 3. These models vary
in which base model they were trained on, whether
they underwent speech-centric CPT, and on what
set of instructions IT was performed. Apart from
SPIRE models, we report a few ablation studies in
the IT stage where:
• i) no CPT was performed (TOWER FULL);
• ii) no data from TOWER BLOCKS was seen
during IT (SPIRE NOBLOCKS), and
• iii) pseudo-labeled ST data and FLEURS were
omitted (SPIRE NOPSEUDO).
When reporting results, we highlight toplines
when possible.

Table 3: Our models and their variants, along with their
base models. The CPT and IT columns indicate which
data was seen during training.

                                 CPT             IT
Model            Base Model      Speech  Text    Speech  Pseudo  Text
TOWER FULL       TowerBase-7B    ✗       ✗       ✓       ✓       ✓
SPIRE BASE       SpireBase       ✓       ✓       ✗       ✗       ✗
SPIRE FULL       SpireBase       ✓       ✓       ✓       ✓       ✓
SPIRE Variants
SPIRE NOBLOCKS   SpireBase       ✓       ✓       ✓       ✓       ✗
SPIRE NOPSEUDO   SpireBase       ✓       ✓       ✓       ✗       ✓

Evaluation Setup Across models and tasks, we
perform inference with greedy decoding with a
maximum of 256 generated tokens. For the TOWER
and SPIRE models, we decode with vllm. However,
since vllm does not support all of our baselines, we
use alternative libraries (namely, transformers)
where necessary. Unless specified otherwise, we
use zero-shot prompts for all models and tasks.
4.1 ASR Experiments
Datasets and Metrics We evaluate ASR perfor-
mance across multiple test sets, in order to cover
a variety of recording styles: Librispeech (LS)
test-clean and test-other (Panayotov et al., 2015),
FLEURS (Conneau et al., 2023), and VoxPopuli.⁹
We report the Word Error Rate (WER) between
the hypotheses and gold transcripts, after Whisper
normalization (Radford et al., 2023).
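For reference, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words. The sketch below computes a corpus-level WER in pure Python; it assumes both sides have already been normalized (the Whisper normalizer itself is not reproduced), and pooling errors over the whole test set is an assumption about how scores are aggregated. The function names are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def corpus_wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Total word errors over total reference words, as a percentage."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    if words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / words


refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sit on mat", "hello world"]
print(corpus_wer(refs, hyps))  # 25.0
```

The example has one substitution and one deletion against eight reference words, hence 25.0.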
Baselines We report results for the following
models:
• Whisper (Radford et al., 2023) is an encoder-decoder transformer model trained on over 5
million hours of labeled data; it performs multilingual ASR and to-English ST. We report
results for the 74M parameter Whisper-base
and the 1.5B parameter Whisper-large-v3 versions.
• SeamlessM4T (Barrault et al., 2023) is an
encoder-decoder transformer trained on 406K
hours of speech that performs ASR, ST and
MT across 100 languages. We report results
for the 2.3B parameter SeamlessM4T-large-v2
version of the model.
• Spirit-LM (Nguyen et al., 2025) is the most
similar work to ours. It is a decoder-only
model, trained from Llama-2 on 307B tokens
of text, 458K hours of unlabeled speech, and
111K hours of labeled speech. Unfortunately,
despite the availability of their inference code,
we were unable to reproduce its reported performance on speech tasks.
• HuBERT-large+CTC is a CTC-based ASR
model trained using the same speech representation model we use for DSU generation,
and using the same ASR data from the IT
stage (Section 3.3.1). This model allows us
to compare our IT-only TOWER FULL model
against a model which has access to continuous speech representations.¹⁰
⁹ For CPT models, LS is an in-domain evaluation because
its training set is part of MLS.

Table 4: WER on various ASR test sets.

                      LibriSpeech
Model                 Clean    Other    FLEURS   VoxPopuli
Baselines
HuBERT-large+CTC      4.3      7.6      11.4     14.7
Spirit-LM             6.0*     11.0*    -        -
SeamlessM4T           2.6      4.9      8.1      7.5
Whisper-base          5.0      11.9     12.1     9.8
Whisper-large-v3      1.8      3.7      5.8      9.2
Our models
TOWER FULL            9.5      13.8     14.3     40.7
SPIRE NOBLOCKS        4.1      7.4      10.4     15.8
SPIRE NOPSEUDO        3.9      7.3      11.1     14.3
SPIRE BASE            28.9     56.3     11.0     13.7
SPIRE FULL            4.2      7.1      10.7     15.8
* We were unable to reproduce Spirit-LM’s ASR performance; therefore, we
report their self-reported LS results using ten-shot prompts.

Results Our results are presented in Table 4.
SPIRE FULL’s performance demonstrates that performing both the CPT and IT stages is an effective
strategy to give speech capabilities to a text LLM.
Notably, SPIRE FULL outperforms the HuBERT-large+CTC baseline on three out of four datasets,
an impressive result given that the CTC model has
a helpful inductive bias and access to continuous
features, both of which SPIRE FULL lacks.
The performance gap between SPIRE FULL and
TOWER FULL (5.3 points in LS test-clean) demonstrates that combining CPT and IT is more effective
than using IT alone. We further observe that while
TOWER FULL obtains better results than SPIRE BASE on LS, it performs worse than HuBERT-large+CTC,
showing that the CPT stage is crucial in outperforming a model that has access to continuous
features. Additionally, the minimal difference between SPIRE NOBLOCKS and SPIRE FULL in the
IT stage suggests that incorporating textual tasks
does not negatively impact ASR performance. For
SPIRE BASE, it is surprising that FLEURS and VoxPopuli results were somewhat strong in zero-shot settings, given that non-instruction-tuned models often struggle on out-of-domain data without in-context learning examples.¹¹
Finally, although SPIRE FULL cannot match the
performance of SeamlessM4T or Whisper-large-v3, both of which were trained on far more speech
data, it does exceed the performance of Whisper-base on both sections of LS and FLEURS. It also
outperforms Spirit-LM on LS, which is notable
because both models are derived from Llama-2 and
make use of HuBERT DSU tokens.
¹⁰ The hyperparameters for this ASR model are described
in Appendix B.

Table 5: COMET-22 (C22) and spBLEU (spB) on the
FLORES devtest set between English and the other
languages supported by TOWER and SPIRE.

                       en→xx            xx→en
Model                  C22     spB      C22     spB
Baselines
SeamlessM4T            87.22   39.0     87.42   39.9
TOWER BASE-7B          87.38   37.8     88.02   41.7
TOWER INSTRUCT-7B      88.45   38.8     88.27   42.0
Our models
TOWER FULL             88.57   39.4     88.17   41.7
SPIRE NOBLOCKS         82.98   34.2     85.93   36.1
SPIRE NOPSEUDO         88.40   38.9     88.22   42.0
SPIRE BASE             87.41   37.4     87.97   41.4
SPIRE FULL             88.54   39.3     88.21   41.8

Finally, although SPIRE FULL cannot match the
performance of SeamlessM4T or Whisper-large-
v3, both of which were trained on far more speech
data, it does exceed the performance of Whisper-
base on both sections of LS and FLEURS. It also
outperforms Spirit-LM on LS, which is notable
because both models are derived from Llama-2 and
make use of HuBERT DSU tokens.
4.2 MT Experiments
Having demonstrated that our CPT and IT ap-
proaches work well for ASR, we now turn to MT.
The key question is whether SPIRE can maintain
TOWER’s strong performance on MT, despite its
speech-centric CPT. We report performance for
translation-related tasks in Appendix C.
Datasets and Metrics We use two datasets for
MT: FLORES-200 (Team et al., 2024), which covers all of our models’ languages, and the WMT23
test set (Kocmi et al., 2023), which covers en↔{de,
ru, zh}. We report COMET-22 (Rei et al., 2022a)
and spBLEU¹² (Papineni et al., 2002) scores via
the SacreBLEU toolkit (Post, 2018).
¹¹ We also tried prompting SPIRE BASE with few-shot examples, but the results were significantly worse. This may
be due to the length of the DSU sequences, which likely led
to in-context examples that were too long for the model to
handle effectively.

Table 6: COMET-22 (C22) and spBLEU (spB) on the WMT23 test set.

                     en→de         en→ru         en→zh         de→en         ru→en         zh→en
Model                C22    spB    C22    spB    C22    spB    C22    spB    C22    spB    C22    spB
Baselines
SeamlessM4T          77.76  27.8   83.22  34.2   80.14  29.7   78.69  26.6   80.58  32.5   76.96  23.8
TOWER BASE-7B        79.96  36.1   83.08  34.2   83.49  33.3   83.56  41.1   80.06  32.7   78.48  23.5
TOWER INSTRUCT-7B    82.34  38.8   84.66  34.9   85.09  35.3   84.95  45.1   82.94  36.7   80.14  26.1
Our models
TOWER FULL           82.63  39.2   84.55  34.5   85.39  37.2   84.65  45.2   82.41  35.6   79.68  25.9
SPIRE NOBLOCKS       67.97  24.4   73.86  26.6   77.80  29.6   73.24  28.7   78.09  29.1   73.01  17.6
SPIRE NOPSEUDO       82.18  38.5   84.31  34.7   85.31  37.6   85.04  45.5   82.56  36.2   79.91  26.2
SPIRE BASE           79.88  34.7   83.04  33.7   83.85  32.4   83.19  40.5   80.20  32.4   78.65  23.1
SPIRE FULL           82.50  39.5   84.60  34.9   85.37  37.3   85.24  45.2   82.58  36.4   79.92  26.3

Baselines We compare the SPIRE models against
the text-to-text translation performance of SeamlessM4T. Additionally, we report the performance
of TOWER BASE-7B and TOWER INSTRUCT-7B as
toplines.
Results Our results show that even after the
speech-centric CPT and mixed speech and text IT
stage, the SPIRE models retain TOWER’s performance on both FLORES (Table 5) and WMT23
(Table 6). This indicates that neither CPT nor IT on
speech data negatively impacts the model’s ability to
perform MT. This holds both for CPT-only models,
where SPIRE BASE achieves performance comparable to TOWER BASE on both datasets, and for IT
models, where SPIRE FULL and TOWER FULL both
perform slightly better than TOWER INSTRUCT on
en→xx, which is possibly an artifact of the large
number of multi-turn en→xx ST instructions in
their IT set. Notably, our strongest SPIRE model
also outperforms SeamlessM4T by both metrics on
all WMT23 language pairs, and for both en→xx
and xx→en on FLORES.
4.3 ST Experiments
As SPIRE has shown success at both ASR and MT,
we now investigate its performance on ST.
Datasets For ST, we evaluate our models on
FLEURS (Conneau et al., 2023), covering ST between en and all TOWER languages, and CoVoST-2 (Wang et al., 2021b) for en↔{de, zh}. We report
the same metrics as for MT.
¹² nrefs:1|case:mixed|eff:no|tok:flores200|smooth:exp|version:2.5.1
ST approaches We evaluate ST performance on
both direct ST and self-cascaded ST,¹³ in which
each model transcribes the audio before translating
its own output to the target language (i.e., ASR
followed by MT). To assess the impact of ASR
error propagation, we also report MT results given
gold transcriptions.
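The self-cascade can be pictured as two calls to the same model, where the second call consumes the output of the first. The sketch below uses a stub in place of a real model; `self_cascade` and `toy_generate` are hypothetical names, and the prompt strings follow the general layout of Table 1 under the assumption that the MT step uses a source-language/target-language prefix.

```python
from typing import Callable, List, Tuple


def dsu_string(ids: List[int]) -> str:
    """Render DSU cluster IDs as <extra_id_k> tokens."""
    return "".join(f"<extra_id_{i}>" for i in ids)


def self_cascade(ids: List[int], target_lang: str,
                 generate: Callable[[str], str]) -> Tuple[str, str]:
    """Transcribe the speech, then translate the model's own transcript."""
    # Step 1 (ASR): the model transcribes the discretized speech.
    transcript = generate(f"Speech: {dsu_string(ids)}\nEnglish:").strip()
    # Step 2 (MT): the model translates its own transcript, errors included.
    translation = generate(f"English: {transcript}\n{target_lang}:").strip()
    return transcript, translation


def toy_generate(prompt: str) -> str:
    """Stand-in for a real model call; returns canned outputs."""
    return " hello world" if prompt.startswith("Speech:") else " hallo welt"


print(self_cascade([5, 9, 9, 2], "German", toy_generate))
# ('hello world', 'hallo welt')
```

Replacing the ASR step with the reference transcript gives the gold-transcription setting, which isolates the effect of ASR error propagation.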
Results Our ST results on FLEURS and CoVoST-2 are presented in Table 7. Among our models, the ones that were trained on large quantities of pseudo-labeled ST data (TOWER FULL,
SPIRE NOBLOCKS, and SPIRE FULL) achieve far
higher scores on direct ST than the one that did not
(SPIRE NOPSEUDO). This indicates that merely
performing CPT with ASR and MT data is not
enough to achieve generalization to the task of
direct ST, even if the model excels at both ASR
and MT. Indeed, we also attempted direct ST with
SPIRE BASE and it failed to produce output in
the target language, even when given few-shot
prompts.
We also observe that SPIRE NOBLOCKS performs nearly as well at direct ST as SPIRE FULL, even though its MT performance is much
poorer (see Tables 5 and 6), showing that, surprisingly, competence at MT is not very helpful
for direct ST. SPIRE FULL achieves the best self-cascaded performance by a significant margin for
both datasets, outperforming both models that were
trained with fewer speech samples (TOWER FULL
and SPIRE NOPSEUDO), and the model that was not
tuned on MT (SPIRE NOBLOCKS). This suggests
that, unlike direct ST, both ASR and MT competencies are necessary for a strong self-cascade performance. Although SPIRE FULL does not reach the direct ST performance of SeamlessM4T, it manages
to achieve competitive performance despite using
far less ST data (Barrault et al., 2023). Our self-cascading experiments additionally demonstrate
that SPIRE FULL maintains greater robustness to its
own outputs than SeamlessM4T.
¹³ We also tried inference with the multi-turn prompt format
shown in Table 1, but results were similar to a self-cascade.
Reporting the self-cascade enables comparison to models that
do not support the multi-turn format, i.e., SeamlessM4T.

Table 7: ST results on FLEURS and CoVoST-2 for en→xx reporting COMET-22 (C22) and spBLEU (spB) using
direct ST (Direct), self-cascaded ST (Self-Cascade), and MT from gold transcriptions (Gold). Scores are averaged
over all language pairs.

                   FLEURS                                        CoVoST-2
                   Direct        Self-Cascade  Gold              Direct        Self-Cascade  Gold
Model              C22    spB    C22    spB    C22    spB        C22    spB    C22    spB    C22    spB
Baselines
SeamlessM4T        84.63  33.7   75.45  22.4   86.79  38.7       84.79  36.8   72.36  19.4   86.55  39.0
Our Models
TOWER FULL         79.10  26.1   83.42  31.9   88.27  39.2       71.52  20.1   74.17  25.8   87.14  38.5
SPIRE NOBLOCKS     81.11  27.1   79.46  28.9   82.44  34.0       74.02  23.2   68.09  22.8   69.31  26.8
SPIRE NOPSEUDO     62.80  11.7   83.79  32.2   88.10  38.7       59.88  6.8    78.15  29.7   87.10  39.0
SPIRE FULL         81.33  27.1   85.21  33.7   88.36  39.2       74.25  23.2   78.78  30.0   87.11  38.5

5 Conclusion
In this work we presented SPIRE, a simple and effective recipe for adapting a text-based, translation-
specialist LLM to the speech modality while pre-
serving the original performance on text-based
tasks. We investigated the impact of speech inte-
gration on two stages of LLM adaptation, CPT and
IT, finding that both contribute to the final model’s
performance on speech tasks. Our results demon-
strate that we are able to successfully integrate a
new modality without compromising the original
model’s capabilities. SPIRE achieves competitive
performance on ASR, while its MT abilities remain
on par with the original TOWER model. Finally,
for the ST task, we find that leveraging ASR
and MT data does not directly transfer to ST performance. Nonetheless, the model achieves promising
performance with both direct and self-cascaded ST.
As future work, we intend to extend this recipe
to multilingual settings by replacing our English
HuBERT speech component with the multilingual
mHuBERT-147 (Boito et al., 2024). We also plan
to leverage the flexibility of DSU modeling to investigate the integration of speech generation tasks.
To benefit the research community, we only use
publicly available and licensed data to train our
models, making our results reproducible.
Limitations
The downstream tasks we evaluate on are re-
stricted to MT and ASR/ST, which provide an
idea of the model performance but do not give
us the full picture. We plan to address this by
utilizing the LM-harness evaluation (Gao et al.,
2024) to evaluate on a suite of text-based bench-
marks such as MMLU (Multitask Language Under-
standing) (Hendrycks et al., 2021b,a), Arc (Com-
monsense Reasoning) (Clark et al., 2018), Bele-
bele (Reading Comprehension) (Bandarkar et al.,
2024), and HellaSwag (Sentence Completion)
(Zellers et al., 2019). Lastly, our model handles
speech and text on the input side but is currently
limited to generating only text.
Acknowledgments
This work was supported by EU’s Horizon
Europe Research and Innovation Actions (UT-
TER, contract 101070631), by UK Research
and Innovation (UKRI) under the UK govern-
ment’s Horizon Europe funding guarantee (grant
number 10039436: UTTER), by the project
DECOLLAGE (ERC-2022-CoG 101088763),
by the Portuguese Recovery and Resilience
Plan through project C645008882-00000055
(Center for Responsible AI), by Fundação
para a Ciência e Tecnologia (FCT) through
the project with reference UIDB/50021/2020
(DOI:10.54499/UIDB/50021/2020), and by
FCT/MECI through national funds and when
applicable co-funded EU funds under UID/50008:
Instituto de Telecomunicações. This work was per-
formed using HPC resources from GENCI–IDRIS
(Grant 2023-AD011014668R1). We thank Duarte
Alves and Giuseppe Attanasio for their insightful
comments.
References
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama
Ahmad, Ilge Akkaya, Florencia Leoni Aleman,
Diogo Almeida, Janko Altenschmidt, Sam Altman,
Shyamal Anadkat, et al. 2023. Gpt-4 technical report.
arXiv preprint arXiv:2303.08774 .
Duarte Miguel Alves, José Pombal, Nuno M Guerreiro,
Pedro Henrique Martins, João Alves, Amin Farajian,
Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta
Agrawal, Pierre Colombo, José G. C. de Souza, and
Andre Martins. 2024. Tower: An open multilingual
large language model for translation-related tasks. In
First Conference on Language Modeling .
Rosana Ardila, Megan Branson, Kelly Davis, Michael
Kohler, Josh Meyer, Michael Henretty, Reuben
Morais, Lindsay Saunders, Francis Tyers, and Gre-
gor Weber. 2020. Common voice: A massively-
multilingual speech corpus. In Proceedings of The
12th Language Resources and Evaluation Confer-
ence, pages 4218–4222, Marseille, France. European
Language Resources Association.
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed,
and Michael Auli. 2020. wav2vec 2.0: A framework
for self-supervised learning of speech representations.
Advances in neural information processing systems ,
33:12449–12460.
Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel
Artetxe, Satya Narayan Shukla, Donald Husa, Naman
Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and
Madian Khabsa. 2024. The belebele benchmark: a
parallel reading comprehension dataset in 122 lan-
guage variants. In Proceedings of the 62nd Annual
Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers) , pages 749–775,
Bangkok, Thailand. Association for Computational
Linguistics.
Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli,
David Dale, Ning Dong, Paul-Ambroise Duquenne,
Hady Elsahar, Hongyu Gong, Kevin Heffernan, John
Hoffman, et al. 2023. Seamlessm4t-massively mul-
tilingual & multimodal machine translation. arXiv
preprint arXiv:2308.11596 .
Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos,
Laurent Besacier, and Ioan Calapodescu. 2024.
mHuBERT-147: A Compact Multilingual HuBERT
Model. In Interspeech 2024 .
Zalán Borsos, Raphaël Marinier, Damien Vincent,
Eugene Kharitonov, Olivier Pietquin, Matt Shar-
ifi, Dominik Roblek, Olivier Teboul, David Grang-
ier, Marco Tagliasacchi, et al. 2023. Audiolm: a
language modeling approach to audio generation.
IEEE/ACM transactions on audio, speech, and lan-
guage processing , 31:2523–2533.
Alejandro Hernández Cano, Matteo Pagliardini, An-
dreas Köpf, Kyle Matoba, Amirkeivan Mohtashami,
Xingyao Wang, Olivia Simin Fan, Axel Marmet,
Deniz Bayazit, Igor Krawczuk, Zeming Chen,
Francesco Salvi, Antoine Bosselut, and Martin Jaggi.
2023. epfllm megatron-llm.
Xuankai Chang, Brian Yan, Kwanghee Choi, Jee-Weon
Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jia-
tong Shi, Jinchuan Tian, Shinji Watanabe, et al. 2024.
Exploring speech recognition, translation, and under-
standing with discrete speech units: A comparative
study. In ICASSP 2024-2024 IEEE International
Conference on Acoustics, Speech and Signal Process-
ing (ICASSP) , pages 11481–11485. IEEE.
Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu
Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel
Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev
Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei
Zou, Xiangang Li, Xuchen Yao, Yongqing Wang,
Zhao You, and Zhiyong Yan. 2021. GigaSpeech: An
Evolving, Multi-Domain ASR Corpus with 10,000
Hours of Transcribed Audio. In Proc. Interspeech
2021 , pages 3670–3674.
Sanyuan Chen, Chengyi Wang, Zhengyang Chen,
Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki
Kanda, Takuya Yoshioka, and Xiong Xiao. 2022.
Wavlm: Large-scale self-supervised pre-training for
full stack speech processing. IEEE Journal of Se-
lected Topics in Signal Processing , 16(6):1505–1518.
Zeming Chen, Alejandro Hernández Cano, Angelika
Romanou, Antoine Bonnet, Kyle Matoba, Francesco
Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf,
Amirkeivan Mohtashami, et al. 2023. Meditron-70b:
Scaling medical pretraining for large language mod-
els. arXiv preprint arXiv:2311.16079 .
Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu,
Karen Livescu, Arun Babu, Alexis Conneau, Alexei
Baevski, and Michael Auli. 2023a. Toward joint
language modeling for speech units and text. arXiv
preprint arXiv:2310.08715 .
Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu,
Karen Livescu, Arun Babu, Alexis Conneau, Alexei
Baevski, and Michael Auli. 2023b. Toward joint
language modeling for speech units and text. In
Findings of the Association for Computational Lin-
guistics: EMNLP 2023 , pages 6582–6593, Singapore.
Association for Computational Linguistics.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,
Ashish Sabharwal, Carissa Schoenick, and Oyvind
Tafjord. 2018. Think you have solved question
answering? try arc, the ai2 reasoning challenge.
arXiv:1803.05457v1 .
Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang,
Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara
Rivera, and Ankur Bapna. 2023. Fleurs: Few-shot
learning evaluation of universal representations of
speech. In 2022 IEEE Spoken Language Technology
Workshop (SLT) , pages 798–805. IEEE.
Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng,
Guangyan Zhang, Qichao Wang, Yiwen Guo, and Ir-
win King. 2024. Recent advances in speech language
models: A survey. arXiv preprint arXiv:2410.03751 .
Cyprien de Masson D’Autume, Sebastian Ruder, Ling-
peng Kong, and Dani Yogatama. 2019. Episodic
memory in lifelong language learning. Advances in
Neural Information Processing Systems , 32.
Alexandre Défossez, Laurent Mazaré, Manu Orsini,
Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard
Grave, and Neil Zeghidour. 2024. Moshi: a speech-
text foundation model for real-time dialogue. arXiv
preprint arXiv:2410.00037 .
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch,
Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid,
Jonathan Tompson, Quan Vuong, Tianhe Yu, et al.
2023. Palm-e: An embodied multimodal language
model. arXiv preprint arXiv:2303.03378 .
Besnik Fetahu, Zhiyu Chen, Sudipta Kar, Oleg
Rokhlenko, and Shervin Malmasi. 2023. Multi-
CoNER v2: a large multilingual dataset for fine-
grained and noisy named entity recognition. In Find-
ings of the Association for Computational Linguis-
tics: EMNLP 2023 , pages 2027–2051, Singapore.
Association for Computational Linguistics.
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman,
Sid Black, Anthony DiPofi, Charles Foster, Laurence
Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li,
Kyle McDonell, Niklas Muennighoff, Chris Ociepa,
Jason Phang, Laria Reynolds, Hailey Schoelkopf,
Aviya Skowron, Lintang Sutawika, Eric Tang, An-
ish Thite, Ben Wang, Kevin Wang, and Andy Zou.
2024. A framework for few-shot language model
evaluation.
Edward Gow-Smith, Alexandre Berard, Marcely
Zanon Boito, and Ioan Calapodescu. 2023. NAVER
LABS Europe’s multilingual speech translation sys-
tems for the IWSLT 2023 low-resource track. In
Proceedings of the 20th International Conference on
Spoken Language Translation (IWSLT 2023) , pages
144–158, Toronto, Canada (in-person and online).
Association for Computational Linguistics.
Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat,
Alexis Conneau, Felix Kreuk, Jade Copet, Alexan-
dre Defossez, Gabriel Synnaeve, Emmanuel Dupoux,
et al. 2024. Textually pretrained speech language
models. Advances in Neural Information Processing
Systems , 36.
Dan Hendrycks, Collin Burns, Steven Basart, Andrew
Critch, Jerry Li, Dawn Song, and Jacob Steinhardt.
2021a. Aligning ai with shared human values. Pro-
ceedings of the International Conference on Learning
Representations (ICLR) .
Dan Hendrycks, Collin Burns, Steven Basart, Andy
Zou, Mantas Mazeika, Dawn Song, and Jacob Stein-
hardt. 2021b. Measuring massive multitask language
understanding. Proceedings of the International Con-
ference on Learning Representations (ICLR) .
John Hewitt. 2021. Initializing new word
embeddings for pretrained language mod-
els. https://nlp.stanford.edu/~johnhew/vocab-expansion.html.
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai,
Kushal Lakhotia, Ruslan Salakhutdinov, and Abdel-
rahman Mohamed. 2021. Hubert: Self-supervised
speech representation learning by masked prediction
of hidden units. IEEE/ACM transactions on audio,
speech, and language processing , 29:3451–3460.
Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Ling-
wei Meng, Hongkun Hao, Jing Pan, Xunying Liu,
Jinyu Li, Sunit Sivasankaran, et al. 2024. Wavllm:
Towards robust and adaptive speech large language
model. arXiv preprint arXiv:2404.00656 .
Rongjie Huang, Mingze Li, Dongchao Yang, Jia-
tong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu,
Zhiqing Hong, Jiawei Huang, Jinglin Liu, et al. 2024.
Audiogpt: Understanding and generating speech, mu-
sic, sound, and talking head. In Proceedings of
the AAAI Conference on Artificial Intelligence , vol-
ume 38, pages 23802–23804.
Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerda,
Javier Jorge, Nahuel Roselló, Adria Giménez, Al-
bert Sanchis, Jorge Civera, and Alfons Juan. 2020.
Europarl-st: A multilingual corpus for speech transla-
tion of parliamentary debates. In ICASSP 2020-2020
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP) , pages 8229–8233.
IEEE.
Albert Q Jiang, Alexandre Sablayrolles, Antoine
Roux, Arthur Mensch, Blanche Savary, Chris Bam-
ford, Devendra Singh Chaplot, Diego de las Casas,
Emma Bou Hanna, Florian Bressand, et al. 2024.
Mixtral of experts. arXiv preprint arXiv:2401.04088 .
Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi
Adi, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen,
Morgane Riviere, Abdelrahman Mohamed, Em-
manuel Dupoux, and Wei-Ning Hsu. 2022. Text-free
prosody-aware generative spoken language modeling.
In Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics (Volume
1: Long Papers) , pages 8666–8681, Dublin, Ireland.
Association for Computational Linguistics.
Tom Kocmi, Eleftherios Avramidis, Rachel Bawden,
Ondřej Bojar, Anton Dvorkovich, Christian Fed-
ermann, Mark Fishel, Markus Freitag, Thamme
Gowda, Roman Grundkiewicz, Barry Haddow,
Philipp Koehn, Benjamin Marie, Christof Monz,
Makoto Morishita, Kenton Murray, Makoto Nagata,
Toshiaki Nakazawa, Martin Popel, Maja Popović,
and Mariya Shmatova. 2023. Findings of the 2023
conference on machine translation (WMT23): LLMs
are here but not quite there yet. In Proceedings of the
Eighth Conference on Machine Translation , pages
1–42, Singapore. Association for Computational Lin-
guistics.
Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu,
Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh
Nguyen, Jade Copet, Alexei Baevski, Abdelrahman
Mohamed, and Emmanuel Dupoux. 2021. On gen-
erative spoken language modeling from raw audio.
Transactions of the Association for Computational
Linguistics , 9:1336–1354.
Tsz Kin Lam, Alexandra Birch, and Barry Haddow.
2024. Compact speech translation models via
discrete speech units pretraining. arXiv preprint
arXiv:2402.19333 .
Hugo Laurençon, Léo Tronchon, Matthieu Cord, and
Victor Sanh. 2024. What matters when build-
ing vision-language models? arXiv preprint .
ArXiv:2405.02246.
Aitor Lewkowycz, Anders Andreassen, David Dohan,
Ethan Dyer, Henryk Michalewski, Vinay Ramasesh,
Ambrose Slone, Cem Anil, Imanol Schlag, Theo
Gutman-Solo, et al. 2022. Solving quantitative rea-
soning problems with language models. Advances
in Neural Information Processing Systems , 35:3843–
3857.
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae
Lee. 2023. Visual Instruction Tuning (LLaVA).
arXiv preprint . ArXiv:2304.08485 [cs].
Ilya Loshchilov and Frank Hutter. 2019. Decoupled
weight decay regularization. In International Confer-
ence on Learning Representations .
Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon
Jung, Xuankai Chang, and Shinji Watanabe. 2024.
Voxtlm: Unified decoder-only models for consoli-
dating speech recognition, synthesis and speech, text
continuation tasks. In ICASSP 2024-2024 IEEE Inter-
national Conference on Acoustics, Speech and Signal
Processing (ICASSP) , pages 13326–13330. IEEE.
Pedro Henrique Martins, Patrick Fernandes, João Alves,
Nuno M Guerreiro, Ricardo Rei, Duarte M Alves,
José Pombal, Amin Farajian, Manuel Faysse, Ma-
teusz Klimaszewski, et al. 2024. Eurollm: Multi-
lingual language models for europe. arXiv preprint
arXiv:2409.16235 .
Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca
Della Libera, Artem Ploujnikov, Cem Subakan, and
Mirco Ravanelli. 2024. How should we extract
discrete audio tokens from self-supervised models?
arXiv preprint arXiv:2406.10735 .
Tu Anh Nguyen, Benjamin Muller, Bokai Yu,
Marta R Costa-Jussa, Maha Elbayad, Sravya Popuri,
Christophe Ropers, Paul-Ambroise Duquenne, Robin
Algayres, Ruslan Mavlyutov, et al. 2025. Spirit-
lm: Interleaved spoken and written language model.
Transactions of the Association for Computational
Linguistics , 13:30–52.
Patrick K O’Neill, Vitaly Lavrukhin, Somshubra Ma-
jumdar, Vahid Noroozi, Yuekai Zhang, Oleksii
Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko,
Keenan Freyberg, Michael D Shulman, et al. 2021.
Spgispeech: 5,000 hours of transcribed financial au-
dio for fully formatted end-to-end speech recognition.
arXiv preprint arXiv:2104.02014 .
OpenAI. 2023. URL https://github.com/openai/openai-python/blob/release-v0.28.1/chatml.md.
Vassil Panayotov, Guoguo Chen, Daniel Povey, and
Sanjeev Khudanpur. 2015. Librispeech: an asr cor-
pus based on public domain audio books. In 2015
IEEE international conference on acoustics, speech
and signal processing (ICASSP) , pages 5206–5210.
IEEE.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. Bleu: a method for automatic evalu-
ation of machine translation. In Proceedings of the
40th Annual Meeting of the Association for Compu-
tational Linguistics , pages 311–318, Philadelphia,
Pennsylvania, USA. Association for Computational
Linguistics.
Ankita Pasad, Bowen Shi, and Karen Livescu. 2023.
Comparative layer-wise analysis of self-supervised
speech models. In ICASSP 2023-2023 IEEE Interna-
tional Conference on Acoustics, Speech and Signal
Processing (ICASSP) , pages 1–5. IEEE.
Matt Post. 2018. A call for clarity in reporting BLEU
scores. In Proceedings of the Third Conference on
Machine Translation: Research Papers , pages 186–
191, Brussels, Belgium. Association for Computa-
tional Linguistics.
Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel
Synnaeve, and Ronan Collobert. 2020. MLS: A
Large-Scale Multilingual Dataset for Speech Re-
search. In Proc. Interspeech 2020 , pages 2757–2761.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock-
man, Christine McLeavey, and Ilya Sutskever. 2023.
Robust speech recognition via large-scale weak su-
pervision. In International conference on machine
learning , pages 28492–28518. PMLR.
Srijith Radhakrishnan, Chao-Han Yang, Sumeer Khan,
Rohit Kumar, Narsis Kiani, David Gomez-Cabrero,
and Jesper Tegnér. 2023. Whispering LLaMA: A
cross-modal generative error correction framework
for speech recognition. In Proceedings of the 2023
Conference on Empirical Methods in Natural Lan-
guage Processing , pages 10007–10016, Singapore.
Association for Computational Linguistics.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Wei Li, and Peter J. Liu. 2019. Exploring the limits
of transfer learning with a unified text-to-text trans-
former. arXiv e-prints .
Ricardo Rei, José G. C. de Souza, Duarte Alves,
Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova,
Alon Lavie, Luisa Coheur, and André F. T. Martins.
2022a. COMET-22: Unbabel-IST 2022 submission
for the metrics shared task. In Proceedings of the
Seventh Conference on Machine Translation (WMT) ,
pages 578–585, Abu Dhabi, United Arab Emirates
(Hybrid). Association for Computational Linguistics.
Ricardo Rei, Marcos Treviso, Nuno M. Guerreiro,
Chrysoula Zerva, Ana C Farinha, Christine Maroti,
José G. C. de Souza, Taisiya Glushkova, Duarte
Alves, Luisa Coheur, Alon Lavie, and André F. T.
Martins. 2022b. CometKiwi: IST-unbabel 2022 sub-
mission for the quality estimation shared task. In
Proceedings of the Seventh Conference on Machine
Translation (WMT) , pages 634–645, Abu Dhabi,
United Arab Emirates (Hybrid). Association for Com-
putational Linguistics.
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten
Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi,
Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023.
Code llama: Open foundation models for code. arXiv
preprint arXiv:2308.12950 .
Paul K Rubenstein, Chulayuth Asawaroengchai,
Duc Dung Nguyen, Ankur Bapna, Zalán Borsos,
Félix de Chaumont Quitry, Peter Chen, Dalia El
Badawy, Wei Han, Eugene Kharitonov, et al. 2023.
Audiopalm: A large language model that can speak
and listen. arXiv preprint arXiv:2306.12925 .
Thomas Scialom, Tuhin Chakrabarty, and Smaranda
Muresan. 2022. Fine-tuned language models are
continual learners. In Proceedings of the 2022 Con-
ference on Empirical Methods in Natural Language
Processing , pages 6107–6122, Abu Dhabi, United
Arab Emirates. Association for Computational Lin-
guistics.
Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang,
Ruihua Zhang, Daochen Shi, Qiqi Xiang, and Yemin
Shi. 2023. Llasm: Large language and speech model.
arXiv preprint arXiv:2308.15930 .
Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mah-
davi, Jason Wei, Hyung Won Chung, Nathan Scales,
Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl,
et al. 2023. Large language models encode clinical
knowledge. Nature , 620(7972):172–180.
Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao
Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao
Zhang. 2023. Salmonn: Towards generic hearing
abilities for large language models. arXiv preprint
arXiv:2310.13289 .
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-
Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan
Schalkwyk, Andrew M Dai, Anja Hauth, Katie
Millican, et al. 2023. Gemini: a family of
highly capable multimodal models. arXiv preprint
arXiv:2312.11805 .
NLLB Team, Marta R Costa-jussà, James Cross, Onur
Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Hef-
fernan, Elahe Kalbassi, Janice Lam, Daniel Licht,
et al. 2022. No language left behind: Scaling
human-centered machine translation (2022). URL
https://arxiv.org/abs/2207.04672 .
NLLB Team et al. 2024. Scaling neural machine trans-
lation to 200 languages. Nature , 630(8018):841.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Al-
bert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton
Ferrer, Moya Chen, Guillem Cucurull, David Esiobu,
Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller,
Cynthia Gao, Vedanuj Goswami, Naman Goyal, An-
thony Hartshorn, Saghar Hosseini, Rui Hou, Hakan
Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa,
Isabel Kloumann, Artem Korenev, Punit Singh Koura,
Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Di-
ana Liskovich, Yinghai Lu, Yuning Mao, Xavier Mar-
tinet, Todor Mihaylov, Pushkar Mishra, Igor Moly-
bog, Yixin Nie, Andrew Poulton, Jeremy Reizen-
stein, Rashi Rungta, Kalyan Saladi, Alan Schelten,
Ruan Silva, Eric Michael Smith, Ranjan Subrama-
nian, Xiaoqing Ellen Tan, Binh Tang, Ross Tay-
lor, Adina Williams, Jian Xiang Kuan, Puxin Xu,
Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan,
Melanie Kambadur, Sharan Narang, Aurelien Ro-
driguez, Robert Stojnic, Sergey Edunov, and Thomas
Scialom. 2023. Llama 2: Open foundation and fine-
tuned chat models. Preprint , arXiv:2307.09288.
Viet Anh Trinh, Rosy Southwell, Yiwen Guan, Xinlu
He, Zhiyong Wang, and Jacob Whitehill. 2024. Dis-
crete multimodal transformers with a pretrained large
language model for mixed-supervision speech pro-
cessing. arXiv preprint arXiv:2406.06582 .
Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu,
Chaitanya Talnikar, Daniel Haziza, Mary Williamson,
Juan Pino, and Emmanuel Dupoux. 2021a. VoxPop-
uli: A large-scale multilingual speech corpus for rep-
resentation learning, semi-supervised learning and
interpretation. In Proceedings of the 59th Annual
Meeting of the Association for Computational Lin-
guistics and the 11th International Joint Conference
on Natural Language Processing (Volume 1: Long
Papers) , pages 993–1003, Online. Association for
Computational Linguistics.
Changhan Wang, Anne Wu, Jiatao Gu, and Juan Pino.
2021b. Covost 2 and massively multilingual speech
translation. Interspeech 2021 .
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu,
Adams Wei Yu, Brian Lester, Nan Du, Andrew M
Dai, and Quoc V Le. 2021. Finetuned language mod-
els are zero-shot learners. In International Confer-
ence on Learning Representations .
T Wolf. 2019. Huggingface’s transformers: State-of-
the-art natural language processing. arXiv preprint
arXiv:1910.03771 .
Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yi-
meng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu,
Bo Ren, Linquan Liu, et al. 2023a. On decoder-only
architecture for speech-to-text and large language
model integration. In 2023 IEEE Automatic Speech
Recognition and Understanding Workshop (ASRU) ,
pages 1–8. IEEE.
Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski,
Mark Dredze, Sebastian Gehrmann, Prabhanjan Kam-
badur, David Rosenberg, and Gideon Mann. 2023b.
Bloomberggpt: A large language model for finance.
arXiv preprint arXiv:2303.17564 .
Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Has-
san Awadalla. 2024a. A paradigm shift in machine
translation: Boosting translation performance of
large language models. In The Twelfth International
Conference on Learning Representations .
Yaoxun Xu, Shi-Xiong Zhang, Jianwei Yu, Zhiyong
Wu, and Dong Yu. 2024b. Comparing discrete and
continuous space llms for speech recognition. In
Proc. Interspeech 2024 .
Hongfei Xue, Wei Ren, Xuelong Geng, Kun Wei, Long-
hao Li, Qijie Shao, Linju Yang, Kai Diao, and Lei
Xie. 2024. Ideal-llm: Integrating dual encoders and
language-adapted llm for multilingual speech-to-text.
arXiv preprint arXiv:2409.11214 .
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui,
Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu,
Fei Huang, Haoran Wei, et al. 2024. Qwen2.5 tech-
nical report. arXiv preprint arXiv:2412.15115 .
Shu-Wen Yang, Po-Han Chi, Yung-Sung Chuang,
Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin,
Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting
Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko tik
Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-
Wen Li, Shinji Watanabe, Abdelrahman Mohamed,
and Hung yi Lee. 2021. SUPERB: Speech Process-
ing Universal PERformance Benchmark. In Proc.
Interspeech 2021 , pages 1194–1198.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali
Farhadi, and Yejin Choi. 2019. HellaSwag: Can a ma-
chine really finish your sentence? In Proceedings of
the 57th Annual Meeting of the Association for Com-
putational Linguistics , pages 4791–4800, Florence,
Italy. Association for Computational Linguistics.
Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan,
Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. 2023.
SpeechGPT: Empowering large language models
with intrinsic cross-modal conversational abilities.
In Findings of the Association for Computational
Linguistics: EMNLP 2023, pages 15757–15773, Sin-
gapore. Association for Computational Linguistics.
A Data
A.1 Speech Data Preprocessing
Normalization In order to make transcripts con-
sistent across the different datasets, the following
normalization is applied:
• GigaSpeech (CPT): we lower-case
the text and replace the punctuation tags
<COMMA>, <PERIOD>, <QUESTIONMARK>, and
<EXCLAMATIONPOINT> with their corresponding
punctuation marks.
• MLS (CPT): we apply a tail-end normaliza-
tion step that uniformly samples at most 13
transcriptions per speaker. This gives us a more
balanced distribution of speakers.
• CV (IT): we subsample CommonVoice
to ensure a minimum duration of 3 seconds
per sample. To enhance transcript diversity,
we limit each transcript to 4 unique speakers.
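The first two normalization steps can be sketched as follows. This is a minimal illustration, not the released preprocessing code: the function names, the choice to attach each punctuation mark to the preceding word, and the fixed random seed are assumptions made for the example.

```python
import random
from collections import defaultdict

# GigaSpeech punctuation tags and the marks assumed to replace them.
PUNCT_TAGS = {
    "<COMMA>": ",",
    "<PERIOD>": ".",
    "<QUESTIONMARK>": "?",
    "<EXCLAMATIONPOINT>": "!",
}


def normalize_gigaspeech(text):
    """Lower-case a transcript and turn punctuation tags into punctuation."""
    out = []
    for tok in text.split():
        if tok in PUNCT_TAGS:
            # Assumption: the mark is attached to the preceding word.
            if out:
                out[-1] += PUNCT_TAGS[tok]
            else:
                out.append(PUNCT_TAGS[tok])
        else:
            out.append(tok.lower())
    return " ".join(out)


def cap_per_speaker(samples, max_per_speaker=13, seed=0):
    """Keep at most max_per_speaker uniformly sampled items per speaker.

    samples is an iterable of (speaker_id, transcript) pairs.
    """
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for speaker, transcript in samples:
        by_speaker[speaker].append(transcript)
    kept = []
    for speaker, items in by_speaker.items():
        if len(items) > max_per_speaker:
            # Uniform sampling without replacement within one speaker.
            items = rng.sample(items, max_per_speaker)
        kept.extend((speaker, t) for t in items)
    return kept
```

Capping the number of transcriptions per speaker trims only the heaviest contributors, which is why it flattens the speaker distribution without discarding low-resource speakers.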
Deduplication As in previous work (Zhang et al.,
2023; Rubenstein et al., 2023; Chang et al., 2024),
we merge consecutive repeated DSU tokens into a
single token to reduce sequence length.
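As an illustration of this step, the sketch below collapses runs of identical consecutive units. The function name and the integer unit IDs are hypothetical; only the merging rule comes from the text above.

```python
from itertools import groupby


def deduplicate_dsus(units):
    """Collapse each run of identical consecutive units into one unit."""
    return [unit for unit, _ in groupby(units)]


# Hypothetical unit IDs: repeats are merged, but a unit that reappears
# later (5 here) is kept, since only consecutive repeats are collapsed.
print(deduplicate_dsus([5, 5, 5, 12, 12, 5, 7, 7]))  # [5, 12, 5, 7]
```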
B CTC-based ASR model
We train a CTC-based ASR model using the Hug-
gingFace Transformers library (Wolf, 2019), lever-
aging the ASR data from the IT stage (Common-
Voice, Table 2) as training data. Our ASR model
consists of the HuBERT-Large¹⁴ speech representa-
tion model, followed by three hidden layers and a
vocabulary projection layer. We train for 50 epochs
with a dropout of 0.3 and a learning rate of 1e-4
with a warm-up ratio of 0.15. The best checkpoint
is selected using CER scores. This was obtained at
step 220K (at epoch 12.8).
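Checkpoint selection by CER can be sketched as below. This is an assumed, self-contained formulation (corpus-level CER computed with a plain Levenshtein distance over characters); the function names and the checkpoint dictionary are hypothetical, and the actual evaluation may have used a library implementation and different text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters.

    Assumes the references contain at least one character overall.
    """
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return errors / chars


def best_checkpoint(dev_outputs, references):
    """Return the checkpoint name whose dev-set hypotheses have the lowest CER.

    dev_outputs maps a checkpoint name to its list of hypotheses.
    """
    return min(dev_outputs, key=lambda name: cer(references, dev_outputs[name]))
```

As a sanity check on the reported numbers, step 220K at epoch 12.8 corresponds to roughly 220,000 / 12.8 ≈ 17.2K optimizer steps per epoch.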
C Translation-related Tasks
TOWER was evaluated on translation-related tasks
in addition to MT. We follow their evaluation setup
and use the Tower-eval suite¹⁵ (Alves et al., 2024)
to evaluate SPIRE models on APE and NER. For
a detailed description of the tasks, we refer the
readers to Alves et al. (2024). Briefly, APE mea-
sures final translation quality on WMT23 after post-
editing with NLLB-3.3B (Team et al., 2022), and
NER measures entity recognition on the MultiCoNER
2023 (Fetahu et al., 2023) test set. We report COMET-
22 for APE and sequence F1 score for NER. We
evaluate APE for en ↔ {de, ru, zh} and NER for
{de, en, es, fr, it, pt, zh}.
¹⁴ https://huggingface.co/facebook/hubert-large-ll60k
¹⁵ https://github.com/deep-spin/tower-eval
Table 8 reports our results. We achieve compara-
ble scores to TOWERINSTRUCT across both tasks
and all language directions considered. Notably,
we outperform TOWER FULL in all settings, which
may be suggestive of the benefit of including text
data in the CPT stage.
                      APE                NER
                      en→xx    xx→en     Multilingual
TOWERINSTRUCT-7B      83.08    80.29     71.56
TOWER FULL            82.65    79.90     65.07
SPIRE FULL            83.13    80.08     67.10
Table 8: Results on APE and NER, reporting COMET-22 (↑)
and sequence F1 score (↑), respectively.