arxiv

Paper 2503.10620

From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM

Authors: Kshitij Ambilduke, Ben Peters, Sonal Sannigrahi, Anil Keshwani, Tsz Kin Lam, Bruno Martins, Marcely Zanon Boito, André F. T. Martins

Published: 2025-03-13

Abstract:

Large language models (LLMs) have shown remarkable performance and generalization capabilities across multiple languages and tasks, making them very attractive targets for multi-modality integration (e.g., images or speech). In this work, we extend an existing LLM to the speech modality via speech discretization and continued pre-training. In particular, we are interested in multilingual LLMs, such as TOWER, as their pre-training setting allows us to treat discretized speech input as an additional translation language. The resulting open-source model, SPIRE, is able to transcribe and translate English speech input while maintaining TOWER's original performance on translation-related tasks, showcasing that discretized speech input integration as an additional language is feasible during LLM adaptation. We make our code and models available to the community.

Paper Content:

Kshitij Ambilduke*†1, Ben Peters*2, Sonal Sannigrahi*2,3, Anil Keshwani†4, Tsz Kin Lam5, Bruno Martins3,6, Marcely Zanon Boito‡7, André F. T. Martins‡2,3,8,9

1 Paris-Saclay University; 2 Instituto de Telecomunicações; 3 Instituto Superior Técnico, Universidade de Lisboa; 4 Sapienza University of Rome; 5 University of Edinburgh; 6 INESC-ID; 7 NAVER LABS Europe; 8 ELLIS Unit Lisbon; 9 Unbabel

Correspondence: benzurdopeters@gmail.com

1 Introduction

Large language models (LLMs) have demonstrated remarkable success across various text-based natural language processing tasks (Achiam et al., 2023; Touvron et al., 2023; Jiang et al., 2024; Yang et al., 2024; Alves et al., 2024; Martins et al., 2024), motivating research into extending them to other modalities.
This has led to the development of multimodal LLMs (MLLMs) capable of processing speech, audio, images and video (Team et al., 2023; Driess et al., 2023; Rubenstein et al., 2023; Liu et al., 2023; Tang et al., 2023; Défossez et al., 2024; Hu et al., 2024; Laurençon et al., 2024; Huang et al., 2024; Nguyen et al., 2025). For speech-LLM integration, an effortless approach is to link the output of an automatic speech recognition (ASR) system to a text-only LLM (Huang et al., 2024). However, this solution fails to fully leverage the LLM's superior language modeling capabilities for disambiguating transcripts.

* Equal contribution. † Work begun during an internship at Instituto de Telecomunicações. ‡ Equal contribution.

More popular are solutions that investigate equipping LLMs with speech processing capabilities through modality projection (Shu et al., 2023; Radhakrishnan et al., 2023; Wu et al., 2023a; Tang et al., 2023; Xue et al., 2024; Hu et al., 2024). Typically, a speech foundation model generates speech representations that are mapped to the embedding space of the LLM. The speech model is then fine-tuned along with a projector on speech-to-text tasks to equip the LLM with speech processing capabilities. In this setting, key challenges include prompt overfitting and high training costs, as tuning these MLLMs requires the adaptation of the speech projector module on vast amounts of raw speech data (Tang et al., 2023; Hu et al., 2024).

An alternative approach for MLLMs is the use of speech discretization, where continuous speech features are transformed prior to training into sequences of "discrete speech units" (DSUs), which can be processed similarly to text (Chou et al., 2023a; Zhang et al., 2023; Rubenstein et al., 2023; Chang et al., 2024; Défossez et al., 2024; Trinh et al., 2024; Maiti et al., 2024; Nguyen et al., 2025). This approach simplifies training by eliminating the need for additional parameters beyond extended embedding matrices.

Finally, while both projector-based and discretization-based MLLMs have shown promising results on text-to-speech and/or speech-to-text tasks, their development has prioritized speech-centric tasks at the expense of textual performance. Currently, limited research has focused on integrating speech while preserving the LLM's original capabilities in textual tasks (Chou et al., 2023b; Huang et al., 2024).

In this work we present SPIRE, a speech-augmented LLM built on top of the open multilingual model TOWER (Alves et al., 2024). SPIRE can process English speech and perform ASR and speech translation (ST) while maintaining TOWER's strong performance on machine translation (MT). SPIRE encodes speech via HuBERT-based (Hsu et al., 2021) k-means clustering, as in previous work (Zhang et al., 2023; Rubenstein et al., 2023; Chang et al., 2024). We perform training in two stages: continued pre-training (CPT) and instruction tuning (IT). For the CPT stage, we use a mixture of ASR data and a small fraction of TOWER's text CPT data. For IT, we again leverage TOWER's task-specific MT data, as well as additional English ASR and ST data. SPIRE is trained using approximately 42.5K hours of speech. Figure 1 illustrates our training process.

arXiv:2503.10620v1 [cs.CL] 13 Mar 2025

Figure 1: Illustration of the model training method for SPIRE BASE and SPIRE FULL.

We make the following contributions:

• We present a pipeline for integrating speech as an additional modality into an existing LLM, enabling it to transcribe and translate English speech while preserving its original MT capabilities;
• We compare speech integration at two stages, namely CPT and IT, demonstrating that both stages are essential for achieving optimal performance on speech tasks;
• We reach ASR and ST results that are close to those of strong speech-centric models trained on larger amounts of data, namely Whisper-large-v3 (Radford et al., 2023) and SeamlessM4T (Barrault et al., 2023), while outperforming SeamlessM4T on MT;
• We provide a reproducible pipeline to the community: all our models, datasets and scripts are made available.[1]

[1] https://github.com/utter-project/SpireLM

2 Related Work

Speech-to-Text Models. An increasing number of studies have explored integrating speech into LLMs (Zhang et al., 2023; Rubenstein et al., 2023; Hassid et al., 2024). For discrete speech input, Hassid et al. (2024) demonstrate the benefits of initializing a speech LLM from a text-based LLM. SpeechGPT (Zhang et al., 2023) applies IT on speech-to-text cross-modal ASR, text-to-speech (TTS), and text-based question answering. AudioPaLM (Rubenstein et al., 2023) is trained in a multi-task fashion, similarly to SpeechGPT, but on multilingual input. Recently, VoxtLM (Maiti et al., 2024) was trained jointly on DSUs and text data for ASR, TTS, and open-ended speech/text generation. Our work is most similar to that of Spirit-LM (Nguyen et al., 2025), which adapts an LLM with an interleaved mixture of DSU and text data that requires an expensive DSU-to-transcript step to create. In contrast, we adopt a more cost-effective input representation that can be extended to any language, regardless of the availability of a speech aligner.
Our focus is on successfully incorporating speech input while preserving the original competence of the model, so that the resulting model can successfully perform both speech-to-text and text-only tasks. None of the aforementioned models are trained to preserve the original model's performance in text tasks.

Adapting LLMs. Previous approaches involve training from scratch with task- and domain-specific data (Singhal et al., 2023; Lewkowycz et al., 2022), performing CPT with a diverse training data mix designed to broadly extend the model's knowledge (Wu et al., 2023b), or doing IT on use-case-specific data (Chen et al., 2023). Recent work has explored combining the latter two approaches (Xu et al., 2024a; Alves et al., 2024; Wei et al., 2021; Roziere et al., 2023). In our approach to integrating DSUs into TOWER, we take inspiration from Alves et al. (2024) in adopting a two-step CPT+IT process. Our work differs in that we focus on adding the speech modality, whereas Alves et al. (2024) focused on increasing the multilingual capabilities of an LLM.

Continuous and Discrete Speech Representations. Self-supervised speech representation models produce contextualized high-dimensional speech vectors directly from raw audio (Hsu et al., 2021; Baevski et al., 2020; Chen et al., 2022), largely outperforming statistical speech features on downstream tasks (Yang et al., 2021). These continuous representations can be used to derive DSUs that capture both linguistic content and prosody through clustering (Borsos et al., 2023; Kharitonov et al., 2022). DSUs provide better alignment with textual data, facilitating the transfer of successful training settings from the text domain (Cui et al., 2024). Building on Lakhotia et al. (2021), which demonstrated that HuBERT (Hsu et al., 2021) is a powerful feature extractor, several studies have adopted this approach, incorporating a k-means clustering step for discretization (Zhang et al., 2023; Rubenstein et al., 2023; Lam et al., 2024; Chang et al., 2024; Nguyen et al., 2025). Xu et al. (2024b) study the optimal settings to obtain DSUs in terms of cluster size and feature extraction layer. We use their findings to inform our initial choices.

3 SPIRE: A Speech-to-Text LLM

Our goal is to equip an LLM with speech capabilities while preserving its preexisting text capabilities. As our starting point, we select TOWER (Alves et al., 2024) as our LLM, which was developed from Llama-2 (Touvron et al., 2023) with a two-step approach: CPT on a mixture of monolingual and parallel data (TOWER BASE), followed by IT on translation-related tasks (TOWER INSTRUCT). We use a similar approach to extend TOWER to speech. First, we perform CPT with a combination of text-only and aligned speech-to-text datasets, followed by IT using both text-only general-purpose and task-specific data curated in TOWER BLOCKS,[2] alongside task-specific speech-to-text datasets. We name our model SPIRE.

3.1 Speech Discretization

To more easily transfer the training set-up of TOWER, we use DSUs as opposed to an auxiliary speech encoder.
For all speech datasets that were used, we follow recent discretization methodology (Zhang et al., 2023; Rubenstein et al., 2023; Chang et al., 2024) to produce DSUs by first extracting continuous speech representations for our speech data from the 22nd layer of an English HuBERT-large model, and then using k-means clustering (K = 5000) to produce centroids that are used to convert our continuous speech representation into a discrete sequence of cluster IDs.[3] We train our k-means model on a collection of 235K audio files (approximately 720 hours), drawn from three speech corpora: CoVoST-2 (Wang et al., 2021b), VoxPopuli (Wang et al., 2021a), and Multilingual Librispeech (MLS; Pratap et al., 2020). The CoVoST subset consists of 62K audio files from 10,049 speakers, with a maximum of 8 audio files per speaker. The VoxPopuli subset includes 65K audio files from 639 speakers, capped at 250 audio files per speaker. Finally, the MLS subset contains 107K audio files from 5,490 speakers.

[2] https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2

3.2 SPIRE BASE

SPIRE BASE is trained from TOWER BASE-7B using both text-only and aligned speech-to-text datasets. Following previous work, we incorporate a fraction of TOWER's original training data to preserve its original performance (Scialom et al., 2022; de Masson D'Autume et al., 2019).

3.2.1 Data

We use a mixture of monolingual and parallel text in Chinese (zh), Dutch (nl), English (en), French (fr), German (de), Italian (it), Korean (ko), Portuguese (pt), Russian (ru), and Spanish (es), that was sourced from the TOWER training data, as well as English ASR data sourced from popular open-source ASR datasets, as reported in Table 2. Both speech and text data are downsampled to create a 6B token data mixture (5B speech; 1B text), measured by the model tokenizer.[4] Note that the 5B speech tokens include both DSUs (4.4B tokens) and their text transcriptions (0.6B tokens).
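
The discretization step of Section 3.1 can be illustrated with a minimal sketch. It assumes that frame-level features and k-means centroids are already available (in the paper these come from layer 22 of HuBERT-large with K = 5000; here they are tiny toy vectors). The function names are hypothetical. Collapsing consecutive repeated cluster IDs is an assumption based on the "deduplicated DSUs" mentioned in the Table 2 caption. The token naming follows the `<extra_id_x>` example given in the CPT setup.

```python
from typing import List, Sequence


def nearest_centroid(frame: Sequence[float], centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    best_id, best_dist = 0, float("inf")
    for idx, centroid in enumerate(centroids):
        dist = sum((a - b) ** 2 for a, b in zip(frame, centroid))
        if dist < best_dist:
            best_id, best_dist = idx, dist
    return best_id


def deduplicate(ids: Sequence[int]) -> List[int]:
    """Collapse runs of identical consecutive cluster IDs into a single ID (assumed step)."""
    out: List[int] = []
    for cluster_id in ids:
        if not out or out[-1] != cluster_id:
            out.append(cluster_id)
    return out


def frames_to_dsu_tokens(frames: Sequence[Sequence[float]],
                         centroids: Sequence[Sequence[float]]) -> List[str]:
    """Map continuous frames to DSU vocabulary items such as <extra_id_17>."""
    ids = [nearest_centroid(frame, centroids) for frame in frames]
    return [f"<extra_id_{cluster_id}>" for cluster_id in deduplicate(ids)]


# Toy data: 3 centroids in 2 dimensions stand in for 5000 centroids over HuBERT features.
toy_centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
toy_frames = [[0.1, -0.1], [0.2, 0.0], [0.9, 1.2], [4.8, 5.1], [5.0, 5.0], [0.0, 0.1]]
tokens = frames_to_dsu_tokens(toy_frames, toy_centroids)
print(" ".join(tokens))
```

Because the result is an ordinary token sequence, it can be concatenated with text in the same training example, which is what lets the speech input be treated like an additional language.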

Text Data. The monolingual text data split corresponds to data from mC4 (Raffel et al., 2019), a multilingual web-crawled corpus which we uniformly sample from across all languages. The parallel data split includes uniformly sampled instances to and from English (en↔xx) for the 10 languages, sourced from various public sources. Further details can be found in Alves et al. (2024).

[3] Optimizing the layer selection for feature extraction is a complex research problem (Pasad et al., 2023; Mousavi et al., 2024). In this work we follow the insights from Gow-Smith et al. (2023) and Xu et al. (2024b).
[4] Preliminary experiments on the data mixture led to this particular choice.

Speech Data. We collect 35K hours of speech data from SPGI Speech (O'Neill et al., 2021), GigaSpeech (Chen et al., 2021), MLS, and VoxPopuli. The transcription normalization process is explained in Appendix A.1.

3.2.2 CPT Setup

We train SPIRE BASE using the MegatronLLM codebase (Cano et al., 2023) on 8 A100-80GB GPUs for 6 days. We use the same hyperparameters as TOWER, except for the effective batch size, which in our case is 2,304. To incorporate the DSUs in the CPT stage, we extend the model's original vocabulary by 5000 types, e.g., <extra_id_x>. This allows us to have a vocabulary that can encode both text in subword units and speech in DSUs. For the extended vocabulary, we initialize new embeddings from a multivariate Gaussian distribution. The mean of this distribution is set to the average of the original embeddings, while the covariance is derived from the empirical covariance of the original embeddings, scaled by a factor of 1×10^-5 (Hewitt, 2021).

3.3 SPIRE FULL

The SPIRE FULL model is obtained by performing instruction tuning on SPIRE BASE using task-specific text-only and aligned speech-to-text data.

3.3.1 Data

We use a mixture of text and speech instructions for ASR, MT, and ST. The prompt formats used during training are shown in Table 1.
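
The embedding initialization described in Section 3.2.2 can be written compactly as follows. The notation is introduced here for illustration only: e_1, ..., e_V denote the V original embedding vectors, and the 1/V normalization of the empirical covariance is an assumption, since the text does not say whether 1/V or 1/(V-1) is used.

```latex
\mu = \frac{1}{V}\sum_{i=1}^{V} e_i, \qquad
\Sigma = \frac{1}{V}\sum_{i=1}^{V} (e_i - \mu)(e_i - \mu)^{\top}, \qquad
e_{\text{new}} \sim \mathcal{N}\!\left(\mu,\; 10^{-5}\,\Sigma\right)
```

With such a small scaling factor, each of the 5000 new DSU embeddings starts very close to the mean of the existing embeddings.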

Table 1: Prompt formats used at different training stages.

ASR (CPT)
  Speech: <extra_id_i> ··· <extra_id_j>
  English: {TRANSCRIPT}

MT (CPT)
  Source_lang: Source-sentence
  Target_lang: {TRANSLATION}

ASR (IT)
  Speech: <extra_id_i> ··· <extra_id_j>
  English: {TRANSCRIPT}

Direct ST (IT)
  Speech: <extra_id_i> ··· <extra_id_j>
  TARGET_LANG: {TRANSLATION}

Multi-turn ST (IT)
  Speech: <extra_id_i> ··· <extra_id_j>
  English: {TRANSCRIPT}
  TARGET_LANG: {TRANSLATION}

Text Data. We use TOWER BLOCKS (Alves et al., 2024), which includes high quality translation bitexts between English and the other languages supported by TOWER. It also includes instructions for the translation-related tasks of named entity recognition and automatic post-editing.

ASR Data. We use 0.8K hours of ASR data from CommonVoice version 17 (CV; Ardila et al., 2020). The down-sampling strategy is described in Appendix A.1.

ST Data. In our IT set, we use 842 hours of speech across three ST training sets: FLEURS (all nine language pairs; we filter out examples whose transcriptions overlap with the FLORES devtest set), Europarl-ST (Iranzo-Sánchez et al., 2020) (en→{de, es, fr, it, nl, pt}), and CoVoST-2 (en→zh). Since this amounts to far less data for ST than what is available for ASR, and since en→{ko, ru} have only examples from the tiny FLEURS set, we augment our speech collection with pseudo-labeled data. This has been shown to be effective for other ST systems (Barrault et al., 2023). For the CV, SPGI, and GigaSpeech datasets, we select 300K ASR samples to be pseudo-labeled each. These examples are translated to all nine target languages using TowerInstruct-13B.[5] Although this produces a very large ST corpus, not all predictions are of high quality, so we filter out examples for which the transcript-translation combination has a COMET-QE[6] (Rei et al., 2022b) score under 85.
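
The quality filter and the two ST prompt formats can be sketched as below. This is a hedged illustration, not the released pipeline: the class and function names are hypothetical, the quality-estimation score is assumed to be precomputed on a 0-100 scale, and the line breaks and the absence of separators between DSU tokens are assumptions, since Table 1 does not specify them. The chatml wrapping used for instruction tuning is left out.

```python
from dataclasses import dataclass
from typing import List

QE_THRESHOLD = 85.0  # from the text: examples scoring under 85 are filtered out


@dataclass
class PseudoSTExample:
    dsu_ids: List[int]   # deduplicated cluster IDs for the utterance
    transcript: str      # English ASR reference
    translation: str     # pseudo-label produced by the MT model
    target_lang: str     # e.g. "German"
    qe_score: float      # assumed precomputed quality-estimation score (0-100)


def dsu_string(ids: List[int]) -> str:
    """Render cluster IDs as vocabulary items; joining without spaces is an assumption."""
    return "".join(f"<extra_id_{i}>" for i in ids)


def keep(example: PseudoSTExample) -> bool:
    """Keep an example unless its score is under the threshold."""
    return example.qe_score >= QE_THRESHOLD


def direct_st_prompt(ex: PseudoSTExample) -> str:
    """Direct ST (IT) format: speech followed by the translation."""
    return f"Speech: {dsu_string(ex.dsu_ids)}\n{ex.target_lang}: {ex.translation}"


def multi_turn_st_prompt(ex: PseudoSTExample) -> str:
    """Multi-turn ST (IT) format: speech, then transcript, then translation."""
    return (f"Speech: {dsu_string(ex.dsu_ids)}\n"
            f"English: {ex.transcript}\n"
            f"{ex.target_lang}: {ex.translation}")


# Toy examples with made-up scores, for illustration only.
examples = [
    PseudoSTExample([12, 7, 4031], "good morning", "guten Morgen", "German", 91.2),
    PseudoSTExample([5, 5000 - 1], "see you soon", "bis später", "German", 72.0),
]
kept = [ex for ex in examples if keep(ex)]
print(direct_st_prompt(kept[0]))
print(multi_turn_st_prompt(kept[0]))
```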
Finally, for each language pair, we sample 60K examples to be used in direct ST prompts and another 60K samples to be used in multi-turn prompts. This process results in 180K direct ST prompts and 180K multi-turn prompts for each language pair.[7]

[5] https://huggingface.co/Unbabel/TowerInstruct-13B-v0.1
[6] https://huggingface.co/Unbabel/wmt22-cometkiwi-da
[7] Due to our aggressive filtering, we were left with slightly fewer examples for en→zh.

3.3.2 IT Training Setup

Similar to TOWER, we use the chatml template (OpenAI, 2023) to format our instructions in dialogue form. We train models using Axolotl[8] on 4 H100-80GB GPUs for 2.7 days. We use a learning rate of 7×10^-6 and a cosine scheduler with 100 warm-up steps. We train for 4 epochs with an effective batch size of 576 and a weight decay of 0.01. We impose a maximum sequence length of 4096 and use the AdamW optimizer (Loshchilov and Hutter, 2019). Other hyperparameters are derived from TOWER INSTRUCT (Alves et al., 2024).

[8] https://github.com/axolotl-ai-cloud/axolotl

Table 2: Statistics for the speech data used for training. Numbers of hours are approximated from the number of deduplicated DSUs.

Dataset        Task        Phase   # DSUs   # Hours
SPGI Speech    ASR         CPT     645M     5.1K
GigaSpeech     ASR         CPT     1.2B     9.9K
MLS            ASR         CPT     2.4B     19.2K
VoxPopuli      ASR         CPT     69M      0.5K
CV             ASR         IT      105M     0.8K
Europarl-ST    ST          IT      122M     1.0K
FLEURS         ST          IT      11M      0.09K
CoVoST-2       ST          IT      12M      0.09K
SPGI Speech    Pseudo-ST   IT      350M     2.8K
GigaSpeech     Pseudo-ST   IT      161M     1.3K
CV             Pseudo-ST   IT      212M     1.7K

4 Experiments

We evaluate our models across three tasks: ASR, MT, and ST. First, we present our results for ASR (§4.1), confirming the competitive performance of SPIRE in the speech domain. We then present MT results (§4.2), demonstrating that the speech performance does not come at the expense of the original model's MT performance. Finally, we present results for ST (§4.3) to investigate model performance on a task that requires both ASR and MT capabilities.

Our Models. Across experiments, we compare all models shown in Table 3. These models vary in which base model they were trained on, whether they underwent speech-centric CPT, and on what set of instructions IT was performed. Apart from SPIRE models, we report a few ablation studies in the IT stage where:

• i) no CPT was performed (TOWER FULL);
• ii) no data from TOWER BLOCKS was seen during IT (SPIRE NOBLOCKS), and
• iii) pseudo-labeled ST data and FLEURS were omitted (SPIRE NOPSEUDO).

When reporting results, we highlight toplines when possible.

Table 3: Our models and their variants, along with their base models. The CPT and IT columns indicate which data was seen during training.

                                  CPT             IT
Model            Base Model       Speech  Text    Speech  Pseudo  Text
TOWER FULL       TowerBase-7B     ✗       ✗       ✓       ✓       ✓
SPIRE BASE       SpireBase        ✓       ✓       ✗       ✗       ✗
SPIRE FULL       SpireBase        ✓       ✓       ✓       ✓       ✓
SPIRE Variants
SPIRE NOBLOCKS   SpireBase        ✓       ✓       ✓       ✓       ✗
SPIRE NOPSEUDO   SpireBase        ✓       ✓       ✓       ✗       ✓

Evaluation Setup. Across models and tasks, we perform inference with greedy decoding with a maximum of 256 generated tokens. For the TOWER and SPIRE models, we decode with vllm. However, since vllm does not support all of our baselines, we use alternative libraries (namely, transformers) where necessary. Unless specified otherwise, we use zero-shot prompts for all models and tasks.

4.1 ASR Experiments

Datasets and Metrics. We evaluate ASR performance across multiple test sets, in order to cover a variety of recording styles: Librispeech (LS) test-clean and test-other (Panayotov et al., 2015), FLEURS (Conneau et al., 2023), and VoxPopuli.[9] We report the Word Error Rate (WER) between the hypotheses and gold transcripts, after Whisper normalization (Radford et al., 2023).

Baselines. We report results for the following models:

• Whisper (Radford et al., 2023) is an encoder-decoder transformer model trained on over 5 million hours of labeled data; it performs multilingual ASR and to-English ST. We report results for the 74M parameter Whisper-base and the 1.5B parameter Whisper-large-v3 versions.
• SeamlessM4T (Barrault et al., 2023) is an encoder-decoder transformer trained on 406K hours of speech that performs ASR, ST and MT across 100 languages. We report results for the 2.3B parameter SeamlessM4T-large-v2 version of the model.
• Spirit-LM (Nguyen et al., 2025) is the most similar work to ours. It is a decoder-only model, trained from Llama-2 on 307B tokens of text, 458K hours of unlabeled speech, and 111K hours of labeled speech. Unfortunately, despite the availability of their inference code, we were unable to reproduce its reported performance on speech tasks.
• HuBERT-large+CTC is a CTC-based ASR model trained using the same speech representation model we use for DSU generation, and using the same ASR data from the IT stage (Section 3.3.1). This model allows us to compare our IT-only TOWER FULL model against a model which has access to continuous speech representations.[10]

[9] For CPT models, LS is an in-domain evaluation because its training set is part of MLS.

Table 4: WER on various ASR test sets.

                     LibriSpeech
Model                Clean    Other    FLEURS   VoxPopuli
Baselines
HuBERT-large+CTC     4.3      7.6      11.4     14.7
Spirit-LM            6.0*     11.0*    -        -
SeamlessM4T          2.6      4.9      8.1      7.5
Whisper-base         5.0      11.9     12.1     9.8
Whisper-large-v3     1.8      3.7      5.8      9.2
Our models
TOWER FULL           9.5      13.8     14.3     40.7
SPIRE NOBLOCKS       4.1      7.4      10.4     15.8
SPIRE NOPSEUDO       3.9      7.3      11.1     14.3
SPIRE BASE           28.9     56.3     11.0     13.7
SPIRE FULL           4.2      7.1      10.7     15.8

* We were unable to reproduce Spirit-LM's ASR performance; therefore, we report their self-reported LS results using ten-shot prompts.

Results. Our results are presented in Table 4. SPIRE FULL's performance demonstrates that performing both the CPT and IT stages is an effective strategy to give speech capabilities to a text LLM.
Notably, SPIRE FULL outperforms the HuBERT-large+CTC baseline on three out of four datasets, an impressive result given that the CTC model has a helpful inductive bias and access to continuous features, both of which SPIRE FULL lacks.

The performance gap between SPIRE FULL and TOWER FULL (5.3 points in LS test-clean) demonstrates that combining CPT and IT is more effective than using IT alone. We further observe that while TOWER FULL obtains better results than SPIRE BASE, it performs worse than HuBERT-large+CTC, showing that the CPT stage is crucial in outperforming a model that has access to continuous features. Additionally, the minimal difference between SPIRE NOBLOCKS and SPIRE FULL in the IT stage suggests that incorporating textual tasks does not negatively impact ASR performance. For SPIRE BASE, it is surprising that FLEURS and VoxPopuli results were somewhat strong in zero-shot settings, given that non-instruction-tuned models often struggle out-of-domain without in-context learning examples.[11]

[10] The hyperparameters for this ASR model are described in Appendix B.

Table 5: COMET-22 (C22) and spBLEU (spB) on the FLORES devtest set between English and the other languages supported by TOWER and SPIRE.

                       en→xx            xx→en
Model                  C22     spB      C22     spB
Baselines
SeamlessM4T            87.22   39.0     87.42   39.9
TOWER BASE-7B          87.38   37.8     88.02   41.7
TOWER INSTRUCT-7B      88.45   38.8     88.27   42.0
Our models
TOWER FULL             88.57   39.4     88.17   41.7
SPIRE NOBLOCKS         82.98   34.2     85.93   36.1
SPIRE NOPSEUDO         88.40   38.9     88.22   42.0
SPIRE BASE             87.41   37.4     87.97   41.4
SPIRE FULL             88.54   39.3     88.21   41.8

Finally, although SPIRE FULL cannot match the performance of SeamlessM4T or Whisper-large-v3, both of which were trained on far more speech data, it does exceed the performance of Whisper-base on both sections of LS and FLEURS. It also outperforms Spirit-LM on LS, which is notable because both models are derived from Llama-2 and make use of HuBERT DSU tokens.
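
For readers unfamiliar with the metric behind Table 4, the following is a minimal sketch of word error rate. It makes two assumptions that the text does not confirm: errors are pooled over the whole test set (corpus-level WER), and a simple lowercase-and-split function stands in for Whisper normalization, which is considerably more elaborate. All function names are illustrative.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def simple_normalize(text: str) -> List[str]:
    """Stand-in normalizer (NOT Whisper normalization): lowercase and split on whitespace."""
    return text.lower().split()


def corpus_wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Total word errors divided by total reference words, as a percentage."""
    errors = 0
    ref_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = simple_normalize(ref)
        errors += edit_distance(ref_tokens, simple_normalize(hyp))
        ref_words += len(ref_tokens)
    if ref_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / ref_words


refs = ["The cat sat on the mat", "hello world"]
hyps = ["the cat sat on mat", "hello word"]
print(corpus_wer(refs, hyps))  # one deletion + one substitution over 8 reference words
```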

4.2 MT Experiments

Having demonstrated that our CPT and IT approaches work well for ASR, we now turn to MT. The key question is whether SPIRE can maintain TOWER's strong performance on MT, despite its speech-centric CPT. We report performance for translation-related tasks in Appendix C.

Datasets and Metrics. We use two datasets for MT: FLORES-200 (Team et al., 2024), which covers all of our models' languages, and the WMT23 test set (Kocmi et al., 2023), which covers en↔{de, ru, zh}. We report COMET-22 (Rei et al., 2022a) and spBLEU[12] (Papineni et al., 2002) scores via the SacreBLEU toolkit (Post, 2018).

[11] We also tried prompting SPIRE BASE with few-shot examples, but the results were significantly worse. This may be due to the length of the DSU sequences, which likely led to in-context examples that were too long for the model to handle effectively.

Table 6: COMET-22 (C22) and spBLEU (spB) on the WMT23 test set.

                     en→de         en→ru         en→zh         de→en         ru→en         zh→en
Model                C22    spB    C22    spB    C22    spB    C22    spB    C22    spB    C22    spB
Baselines
SeamlessM4T          77.76  27.8   83.22  34.2   80.14  29.7   78.69  26.6   80.58  32.5   76.96  23.8
TOWER BASE-7B        79.96  36.1   83.08  34.2   83.49  33.3   83.56  41.1   80.06  32.7   78.48  23.5
TOWER INSTRUCT-7B    82.34  38.8   84.66  34.9   85.09  35.3   84.95  45.1   82.94  36.7   80.14  26.1
Our models
TOWER FULL           82.63  39.2   84.55  34.5   85.39  37.2   84.65  45.2   82.41  35.6   79.68  25.9
SPIRE NOBLOCKS       67.97  24.4   73.86  26.6   77.80  29.6   73.24  28.7   78.09  29.1   73.01  17.6
SPIRE NOPSEUDO       82.18  38.5   84.31  34.7   85.31  37.6   85.04  45.5   82.56  36.2   79.91  26.2
SPIRE BASE           79.88  34.7   83.04  33.7   83.85  32.4   83.19  40.5   80.20  32.4   78.65  23.1
SPIRE FULL           82.50  39.5   84.60  34.9   85.37  37.3   85.24  45.2   82.58  36.4   79.92  26.3

Baselines. We compare the SPIRE models against the text-to-text translation performance of SeamlessM4T. Additionally, we report the performance of TOWER BASE-7B and TOWER INSTRUCT-7B as toplines.

Results. Our results showcase that even after the speech-centric CPT and mixed speech and text IT stage, the SPIRE models retain TOWER's performance on both FLORES (Table 5) and WMT23 (Table 6). This indicates that neither CPT nor IT on speech data negatively impacts the model's ability to perform MT. This is true both for CPT-only models, where SPIRE BASE achieves performance comparable to TOWER BASE on both datasets, and for IT models, where SPIRE FULL and TOWER FULL both perform slightly better than TOWER INSTRUCT on en→xx, which is possibly an artifact of the large number of multi-turn en→xx ST instructions in their IT set. Notably, our strongest SPIRE model also outperforms SeamlessM4T by both metrics on all WMT23 language pairs, and for both en→xx and xx→en on FLORES.

4.3 ST Experiments

As SPIRE has shown success at both ASR and MT, we now investigate its performance on ST.

Datasets. For ST, we evaluate our models on FLEURS (Conneau et al., 2023), covering ST between en and all TOWER languages, and CoVoST-2 (Wang et al., 2021b) for en↔{de, zh}. We report the same metrics as for MT.

[12] nrefs:1|case:mixed|eff:no|tok:flores200|smooth:exp|version:2.5.1

ST approaches. We evaluate ST performance on both direct ST and self-cascaded ST,[13] in which each model transcribes the audio before translating its own output to the target language (i.e., ASR followed by MT). To assess the impact of ASR error propagation, we also report MT results given gold transcriptions.

Results. Our ST results on FLEURS and CoVoST-2 are presented in Table 7. Among our models, the ones that were trained on large quantities of pseudo-labeled ST data (TOWER FULL, SPIRE NOBLOCKS, and SPIRE FULL) achieve far higher scores on direct ST than the one that was not (SPIRE NOPSEUDO). This indicates that merely performing CPT with ASR and MT data is not enough to achieve generalization to the task of direct ST, even if the model excels at both ASR and MT.
Indeed, we also attempted direct ST with SPIRE BASE and it failed to produce output in the target language, even when given few-shot prompts.

[13] We also tried inference with the multi-turn prompt format shown in Table 1, but results were similar to a self-cascade. Reporting the self-cascade enables comparison to models that do not support the multi-turn format, i.e. SeamlessM4T.

Table 7: ST results on FLEURS and CoVoST-2 for en→xx reporting COMET-22 (C22) and spBLEU (spB) using direct ST (direct), self-cascaded ST (self-cascade), and MT from gold transcriptions (gold). Scores are averaged over all language pairs.

FLEURS
                   Direct         Self-Cascade   Gold
Model              C22    spB     C22    spB     C22    spB
Baselines
SeamlessM4T        84.63  33.7    75.45  22.4    86.79  38.7
Our models
TOWER FULL         79.10  26.1    83.42  31.9    88.27  39.2
SPIRE NOBLOCKS     81.11  27.1    79.46  28.9    82.44  34.0
SPIRE NOPSEUDO     62.80  11.7    83.79  32.2    88.10  38.7
SPIRE FULL         81.33  27.1    85.21  33.7    88.36  39.2

CoVoST-2
                   Direct         Self-Cascade   Gold
Model              C22    spB     C22    spB     C22    spB
Baselines
SeamlessM4T        84.79  36.8    72.36  19.4    86.55  39.0
Our models
TOWER FULL         71.52  20.1    74.17  25.8    87.14  38.5
SPIRE NOBLOCKS     74.02  23.2    68.09  22.8    69.31  26.8
SPIRE NOPSEUDO     59.88  6.8     78.15  29.7    87.10  39.0
SPIRE FULL         74.25  23.2    78.78  30.0    87.11  38.5

We also observe that SPIRE NOBLOCKS performs nearly as well at direct ST as SPIRE FULL, even though its MT performance is much poorer (see Tables 5 and 6), showing that, surprisingly, competence at MT is not very helpful for direct ST. SPIRE FULL achieves the best self-cascaded performance by a significant margin for both datasets, outperforming both models that were trained with fewer speech samples (TOWER FULL and SPIRE NOPSEUDO), and the model that was not tuned on MT (SPIRE NOBLOCKS). This suggests that, unlike direct ST, both ASR and MT competencies are necessary for a strong self-cascade performance. Although SPIRE FULL does not reach the direct ST performance of SeamlessM4T, it manages to achieve competitive performance despite using far less ST data (Barrault et al., 2023).
Our self-cascading experiments additionally demonstrate that SPIRE FULL maintains greater robustness to its own outputs than SeamlessM4T.

5 Conclusion

In this work we presented SPIRE, a simple and effective recipe for adapting a text-based, translation-specialist LLM to the speech modality while preserving the original performance on text-based tasks. We investigated the impact of speech integration on two stages of LLM adaptation, CPT and IT, finding that both contribute to the final model's performance on speech tasks. Our results demonstrate that we are able to successfully integrate a new modality without compromising the original model's capabilities. SPIRE achieves competitive performance on ASR, while its MT abilities remain on par with the original TOWER model. Finally, for the ST task, we find that leveraging ASR and MT data does not directly transfer to ST performance. Nonetheless, the model achieves promising performance with both direct and self-cascaded ST.

As future work, we intend to extend this recipe to multilingual settings by replacing our English HuBERT speech component with the multilingual mHuBERT-147 (Boito et al., 2024). We also plan to leverage the flexibility of DSU modeling to investigate the integration of speech generation tasks. To benefit the research community, we only use publicly available and licensed data to train our models, making our results reproducible.

Limitations

The downstream tasks we evaluate on are restricted to MT and ASR/ST, which provide an idea of the model performance but do not give us the full picture. We plan to address this by utilizing the LM-harness evaluation (Gao et al., 2024) to evaluate on a suite of text-based benchmarks such as MMLU (Multitask Language Understanding) (Hendrycks et al., 2021b,a), Arc (Commonsense Reasoning) (Clark et al., 2018), Belebele (Reading Comprehension) (Bandarkar et al., 2024), and HellaSwag (Sentence Completion) (Zellers et al., 2019).
Lastly, our model handles speech and text on the input side but is currently limited to generating only text.

Acknowledgments

This work was supported by EU’s Horizon Europe Research and Innovation Actions (UTTER, contract 101070631), by UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee (grant number 10039436: UTTER), by the project DECOLLAGE (ERC-2022-CoG 101088763), by the Portuguese Recovery and Resilience Plan through project C645008882-00000055 (Center for Responsible AI), by Fundação para a Ciência e Tecnologia (FCT) through the project with reference UIDB/50021/2020 (DOI:10.54499/UIDB/50021/2020), and by FCT/MECI through national funds and when applicable co-funded EU funds under UID/50008: Instituto de Telecomunicações. This work was performed using HPC resources from GENCI–IDRIS (Grant 2023-AD011014668R1). We thank Duarte Alves and Giuseppe Attanasio for their insightful comments.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.

Duarte Miguel Alves, José Pombal, Nuno M Guerreiro, Pedro Henrique Martins, João Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, Pierre Colombo, José G. C. de Souza, and Andre Martins. 2024. Tower: An open multilingual large language model for translation-related tasks. In First Conference on Language Modeling.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. Common voice: A massively-multilingual speech corpus. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France. European Language Resources Association.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020.
wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460.

Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. 2024. The belebele benchmark: a parallel reading comprehension dataset in 122 language variants. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 749–775, Bangkok, Thailand. Association for Computational Linguistics.

Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, et al. 2023. Seamlessm4t-massively multilingual & multimodal machine translation. arXiv preprint arXiv:2308.11596.

Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, and Ioan Calapodescu. 2024. mHuBERT-147: A Compact Multilingual HuBERT Model. In Interspeech 2024.

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al. 2023. Audiolm: a language modeling approach to audio generation. IEEE/ACM transactions on audio, speech, and language processing, 31:2523–2533.

Alejandro Hernández Cano, Matteo Pagliardini, Andreas Köpf, Kyle Matoba, Amirkeivan Mohtashami, Xingyao Wang, Olivia Simin Fan, Axel Marmet, Deniz Bayazit, Igor Krawczuk, Zeming Chen, Francesco Salvi, Antoine Bosselut, and Martin Jaggi. 2023. epfllm megatron-llm.

Xuankai Chang, Brian Yan, Kwanghee Choi, Jee-Weon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, et al. 2024. Exploring speech recognition, translation, and understanding with discrete speech units: A comparative study.
In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 11481–11485. IEEE.

Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, and Zhiyong Yan. 2021. GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio. In Proc. Interspeech 2021, pages 3670–3674.

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, and Xiong Xiao. 2022. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518.

Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. 2023. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079.

Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, and Michael Auli. 2023a. Toward joint language modeling for speech units and text. arXiv preprint arXiv:2310.08715.

Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, and Michael Auli. 2023b. Toward joint language modeling for speech units and text. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6582–6593, Singapore. Association for Computational Linguistics.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1.
Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. 2023. Fleurs: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805. IEEE.

Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, and Irwin King. 2024. Recent advances in speech language models: A survey. arXiv preprint arXiv:2410.03751.

Cyprien de Masson D’Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. 2019. Episodic memory in lifelong language learning. Advances in Neural Information Processing Systems, 32.

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. 2024. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037.

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. 2023. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.

Besnik Fetahu, Zhiyu Chen, Sudipta Kar, Oleg Rokhlenko, and Shervin Malmasi. 2023. MultiCoNER v2: a large multilingual dataset for fine-grained and noisy named entity recognition. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2027–2051, Singapore. Association for Computational Linguistics.

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2024. A framework for few-shot language model evaluation.

Edward Gow-Smith, Alexandre Berard, Marcely Zanon Boito, and Ioan Calapodescu. 2023.
NAVER LABS Europe’s multilingual speech translation systems for the IWSLT 2023 low-resource track. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 144–158, Toronto, Canada (in-person and online). Association for Computational Linguistics.

Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, et al. 2024. Textually pretrained speech language models. Advances in Neural Information Processing Systems, 36.

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2021a. Aligning ai with shared human values. Proceedings of the International Conference on Learning Representations (ICLR).

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021b. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR).

John Hewitt. 2021. Initializing new word embeddings for pretrained language models. https://nlp.stanford.edu/~johnhew/vocab-expansion.html.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460.

Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, et al. 2024. Wavllm: Towards robust and adaptive speech large language model. arXiv preprint arXiv:2404.00656.

Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, et al. 2024. Audiogpt: Understanding and generating speech, music, sound, and talking head.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 23802–23804.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerda, Javier Jorge, Nahuel Roselló, Adria Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-st: A multilingual corpus for speech translation of parliamentary debates. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229–8233. IEEE.

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088.

Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Morgane Riviere, Abdelrahman Mohamed, Emmanuel Dupoux, and Wei-Ning Hsu. 2022. Text-free prosody-aware generative spoken language modeling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8666–8681, Dublin, Ireland. Association for Computational Linguistics.

Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Philipp Koehn, Benjamin Marie, Christof Monz, Makoto Morishita, Kenton Murray, Makoto Nagata, Toshiaki Nakazawa, Martin Popel, Maja Popović, and Mariya Shmatova. 2023. Findings of the 2023 conference on machine translation (WMT23): LLMs are here but not quite there yet. In Proceedings of the Eighth Conference on Machine Translation, pages 1–42, Singapore. Association for Computational Linguistics.

Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, and Emmanuel Dupoux. 2021. On generative spoken language modeling from raw audio.
Transactions of the Association for Computational Linguistics, 9:1336–1354.

Tsz Kin Lam, Alexandra Birch, and Barry Haddow. 2024. Compact speech translation models via discrete speech units pretraining. arXiv preprint arXiv:2402.19333.

Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. 2024. What matters when building vision-language models? arXiv preprint. ArXiv:2405.02246.

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. 2022. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning (LLaVA). arXiv preprint. ArXiv:2304.08485 [cs].

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, and Shinji Watanabe. 2024. Voxtlm: Unified decoder-only models for consolidating speech recognition, synthesis and speech, text continuation tasks. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 13326–13330. IEEE.

Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M Guerreiro, Ricardo Rei, Duarte M Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, et al. 2024. Eurollm: Multilingual language models for europe. arXiv preprint arXiv:2409.16235.

Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, and Mirco Ravanelli. 2024. How should we extract discrete audio tokens from self-supervised models? arXiv preprint arXiv:2406.10735.

Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R Costa-Jussa, Maha Elbayad, Sravya Popuri, Christophe Ropers, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, et al. 2025.
Spirit-lm: Interleaved spoken and written language model. Transactions of the Association for Computational Linguistics, 13:30–52.

Patrick K O’Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D Shulman, et al. 2021. Spgispeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition. arXiv preprint arXiv:2104.02014.

OpenAI. 2023. URL https://github.com/openai/openai-python/blob/release-v0.28.1/chatml.md.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Ankita Pasad, Bowen Shi, and Karen Livescu. 2023. Comparative layer-wise analysis of self-supervised speech models. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. 2020. MLS: A Large-Scale Multilingual Dataset for Speech Research. In Proc. Interspeech 2020, pages 2757–2761.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision.
In International conference on machine learning, pages 28492–28518. PMLR.

Srijith Radhakrishnan, Chao-Han Yang, Sumeer Khan, Rohit Kumar, Narsis Kiani, David Gomez-Cabrero, and Jesper Tegnér. 2023. Whispering LLaMA: A cross-modal generative error correction framework for speech recognition. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10007–10016, Singapore. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.

Ricardo Rei, José G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André F. T. Martins. 2022a. COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Ricardo Rei, Marcos Treviso, Nuno M. Guerreiro, Chrysoula Zerva, Ana C Farinha, Christine Maroti, José G. C. de Souza, Taisiya Glushkova, Duarte Alves, Luisa Coheur, Alon Lavie, and André F. T. Martins. 2022b. CometKiwi: IST-unbabel 2022 submission for the quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 634–645, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

Paul K Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, et al. 2023.
Audiopalm: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925.

Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. 2022. Fine-tuned language models are continual learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6107–6122, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, and Yemin Shi. 2023. Llasm: Large language and speech model. arXiv preprint arXiv:2308.15930.

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. 2023. Large language models encode clinical knowledge. Nature, 620(7972):172–180.

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. 2023. Salmonn: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289.

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

NLLB Team, Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, et al. 2022. No language left behind: Scaling human-centered machine translation (2022). URL https://arxiv.org/abs/2207.04672.

NLLB Team et al. 2024. Scaling neural machine translation to 200 languages. Nature, 630(8018):841.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.

Viet Anh Trinh, Rosy Southwell, Yiwen Guan, Xinlu He, Zhiyong Wang, and Jacob Whitehill. 2024. Discrete multimodal transformers with a pretrained large language model for mixed-supervision speech processing. arXiv preprint arXiv:2406.06582.

Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. 2021a. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 993–1003, Online. Association for Computational Linguistics.

Changhan Wang, Anne Wu, Jiatao Gu, and Juan Pino. 2021b.
Covost 2 and massively multilingual speech translation. Interspeech 2021.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

T Wolf. 2019. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, et al. 2023a. On decoder-only architecture for speech-to-text and large language model integration. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE.

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023b. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564.

Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. 2024a. A paradigm shift in machine translation: Boosting translation performance of large language models. In The Twelfth International Conference on Learning Representations.

Yaoxun Xu, Shi-Xiong Zhang, Jianwei Yu, Zhiyong Wu, and Dong Yu. 2024b. Comparing discrete and continuous space llms for speech recognition. In Proc. Interspeech 2024.

Hongfei Xue, Wei Ren, Xuelong Geng, Kun Wei, Longhao Li, Qijie Shao, Linju Yang, Kai Diao, and Lei Xie. 2024. Ideal-llm: Integrating dual encoders and language-adapted llm for multilingual speech-to-text. arXiv preprint arXiv:2409.11214.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

Shu-Wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T.
Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, and Hung-yi Lee. 2021. SUPERB: Speech Processing Universal PERformance Benchmark. In Proc. Interspeech 2021, pages 1194–1198.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. 2023. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15757–15773, Singapore. Association for Computational Linguistics.

A Data

A.1 Speech Data Preprocessing

Normalization: In order to make transcripts consistent across the different datasets, the following normalization is applied:

• GigaSpeech (CPT): we lower-case the text and replace the punctuation tags <COMMA>, <PERIOD>, <QUESTIONMARK>, and <EXCLAMATIONPOINT> with their corresponding punctuation marks.
• MLS (CPT): we apply a tail-end normalization step which uniformly samples transcriptions so that each speaker has at most 13. This gives us a better distribution of speakers.
• CV (IT): we subsampled from CommonVoice to ensure a minimum duration of 3 seconds per sample. To enhance transcript diversity, we limit each transcript to 4 unique speakers.

Deduplication: As in previous work (Zhang et al., 2023; Rubenstein et al., 2023; Chang et al., 2024), we merge consecutive repeated DSU tokens into a single token to reduce sequence length.
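The preprocessing steps of A.1 can be illustrated with a minimal Python sketch. This is not the released preprocessing code. The function names, the (speaker, transcript) data layout, the fixed random seed, and the choice to attach punctuation to the preceding word are all assumptions made for illustration. Only the tag list, the cap of 13 transcriptions per speaker, and the merging of consecutive repeated DSU tokens come from the description above.

```python
import random
from collections import defaultdict
from itertools import groupby

# Punctuation tags named in A.1, mapped to the marks they stand for.
GIGASPEECH_TAGS = {
    "<COMMA>": ",",
    "<PERIOD>": ".",
    "<QUESTIONMARK>": "?",
    "<EXCLAMATIONPOINT>": "!",
}


def normalize_gigaspeech(text):
    """Replace punctuation tags, then lower-case the transcript."""
    for tag, mark in GIGASPEECH_TAGS.items():
        # Assumption: the mark is attached to the preceding word.
        text = text.replace(" " + tag, mark).replace(tag, mark)
    # Lower-casing happens last so the upper-case tags are still matched.
    return text.lower()


def cap_per_speaker(samples, max_per_speaker=13, seed=0):
    """Uniformly sample at most max_per_speaker items per speaker.

    samples is an iterable of (speaker_id, transcript) pairs; this
    layout is hypothetical.
    """
    by_speaker = defaultdict(list)
    for speaker, item in samples:
        by_speaker[speaker].append(item)
    rng = random.Random(seed)
    kept = []
    for speaker, items in by_speaker.items():
        if len(items) > max_per_speaker:
            items = rng.sample(items, max_per_speaker)
        kept.extend((speaker, item) for item in items)
    return kept


def dedup_units(units):
    """Merge each run of consecutive repeated DSU ids into one id."""
    return [unit for unit, _run in groupby(units)]


if __name__ == "__main__":
    print(normalize_gigaspeech("HELLO <COMMA> WORLD <PERIOD>"))
    print(dedup_units([5, 5, 5, 12, 12, 5, 7, 7]))
```

Note that deduplication only merges adjacent repeats: a unit that reappears later in the sequence is kept, so the order of distinct speech events is preserved while the sequence becomes shorter.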
B CTC-based ASR model

We train a CTC-based ASR model using the HuggingFace Transformers library (Wolf, 2019), leveraging the ASR data from the IT stage (CommonVoice, Table 2) as training data. Our ASR model consists of the HuBERT-Large^14 speech representation model, followed by three hidden layers and a vocabulary projection layer. We train for 50 epochs with a dropout of 0.3 and a learning rate of 1e-4 with a warm-up ratio of 0.15. The best checkpoint is selected using CER scores; it was obtained at step 220K (epoch 12.8).

C Translation-related Tasks

TOWER was evaluated on translation-related tasks in addition to MT. We follow their evaluation setup and use the Tower-eval suite^15 (Alves et al., 2024) to evaluate SPIRE models on APE and NER. For a detailed description of the tasks, we refer the readers to Alves et al. (2024). Briefly, APE measures final translation quality on WMT23 after post-editing with NLLB-3.3B (Team et al., 2022), and NER measures entity recognition on the MultiCoNER 2023 (Fetahu et al., 2023) test set. We report COMET-22 for APE and sequence F1 score for NER. We evaluate APE for en↔{de, ru, zh} and NER for {de, en, es, fr, it, pt, zh}.

Table 8 reports our results. We achieve comparable scores to TOWER INSTRUCT across both tasks and all language directions considered. Notably, we outperform TOWER FULL in all settings, which may be suggestive of the benefit of including text data in the CPT stage.

14 https://huggingface.co/facebook/hubert-large-ll60k
15 https://github.com/deep-spin/tower-eval

                      APE               NER
                      en→xx    xx→en    Multilingual
TOWER INSTRUCT-7B     83.08    80.29    71.56
TOWER FULL            82.65    79.90    65.07
SPIRE FULL            83.13    80.08    67.10

Table 8: Results on APE and NER reporting COMET-22 (↑) and sequence F1 score (↑) respectively.
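Returning to the checkpoint selection described in Appendix B, the following minimal Python sketch shows how a character error rate (CER) can be computed and used to pick a checkpoint. It assumes a corpus-level CER (total character edits divided by total reference characters) and a hypothetical dictionary mapping training steps to validation CER. The source does not specify which CER implementation or aggregation was used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total character edits over total reference length."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


def best_checkpoint(cer_by_step):
    """Return the training step with the lowest validation CER.

    cer_by_step is a hypothetical {step: cer} mapping.
    """
    return min(cer_by_step, key=cer_by_step.get)


if __name__ == "__main__":
    print(cer(["abcd"], ["abed"]))
    print(best_checkpoint({100: 0.30, 200: 0.10, 300: 0.20}))
```

Because lower CER is better, the selection takes the minimum over steps. This is consistent with the best checkpoint being found well before the final epoch.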