arxiv

Paper 2412.08821

Large Concept Models: Language Modeling in a Sentence Representation Space

Authors: LCM team, Loïc Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R. Costa-jussà, David Dale, Hady Elsahar, Kevin Heffernan, João Maria Janeiro, Tuan Tran, Christophe Ropers, Eduardo Sánchez, Robin San Roman, Alexandre Mourachko, Safiyyah Saleem, Holger Schwenk

Published: 2024-12-11

Abstract:

LLMs have revolutionized the field of artificial intelligence and have emerged as the de-facto tool for many tasks. The current established technology of LLMs is to process input and generate output at the token level. This is in sharp contrast to humans who operate at multiple levels of abstraction, well beyond single words, to analyze information and to generate creative content. In this paper, we present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept. Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow. Hence, we build a "Large Concept Model". In this study, as proof of feasibility, we assume that a concept corresponds to a sentence, and use an existing sentence embedding space, SONAR, which supports up to 200 languages in both text and speech modalities. The Large Concept Model is trained to perform autoregressive sentence prediction in an embedding space. We explore multiple approaches, namely MSE regression, variants of diffusion-based generation, and models operating in a quantized SONAR space. These explorations are performed using 1.6B parameter models and training data in the order of 1.3T tokens. We then scale one architecture to a model size of 7B parameters and training data of about 2.7T tokens. We perform an experimental evaluation on several generative tasks, namely summarization and a new task of summary expansion. Finally, we show that our model exhibits impressive zero-shot generalization performance to many languages, outperforming existing LLMs of the same size. The training code of our models is freely available.

LCM team, Loïc Barrault∗, Paul-Ambroise Duquenne∗, Maha Elbayad∗, Artyom Kozhevnikov∗, Belen Alastruey†, Pierre Andrews†, Mariano Coria†, Guillaume Couairon+†, Marta R. Costa-jussà†, David Dale†, Hady Elsahar†, Kevin Heffernan†, João Maria Janeiro†, Tuan Tran†, Christophe Ropers†, Eduardo Sánchez†, Robin San Roman†, Alexandre Mourachko‡, Safiyyah Saleem‡, Holger Schwenk‡
FAIR at Meta

∗ Core contributors, alphabetical order
† Contributors to data preparation, LCM extensions and evaluation, alphabetical order
‡ Research and project management, alphabetical order
+ Initial work while at FAIR at Meta, new affiliation: INRIA, France

Date: December 12, 2024
Correspondence: Holger Schwenk at schwenk@meta.com
Training code: https://github.com/facebookresearch/large_concept_model
arXiv:2412.08821v2 [cs.CL] 15 Dec 2024

1 Introduction

Large Language Models (LLMs) are dominating current research in natural language processing, and with their recent extension to more modalities, namely images, video and speech, they seem to be considered the de-facto technique to follow to approach human intelligence. LLMs indeed achieve impressive performance on a large variety of tasks, such as providing detailed answers to general knowledge questions, helping with long document analysis, drafting different types of messages, and writing or debugging code. Building an LLM from scratch requires access to enormous computational resources to process ever larger amounts of data and train models, the size of which now exceeds four hundred billion parameters. Knowledge acquisition in LLMs is heavily data-driven, and extending them to more languages or modalities usually requires injecting additional (synthetic) data to cover them.

The landscape of available LLMs can be structured into open models such as Llama (The Llama3 team, 2024), Mistral (Jiang et al., 2024), Bloom (BigScience Workshop, 2023) or Falcon (Almazrouei et al., 2023), on the one hand, and closed models such as Gemini (Gemini Team Google, 2024), GPT (OpenAI, 2024) or Claude (Anthropic, 2024), on the other.
It is striking that all these models are based on the same underlying architecture: a transformer-based, decoder-only language model, which is pretrained to predict the next token, given a long context of preceding tokens. Despite the undeniable success of LLMs and continued progress, all current LLMs miss a crucial characteristic of human intelligence: explicit reasoning and planning at multiple levels of abstraction. The human brain does not operate at the word level only. We usually have a top-down process to solve a complex task or compose a long document: we first plan the overall structure at a higher level, and then, step by step, add details at lower levels of abstraction. One may argue that LLMs are implicitly learning a hierarchical representation, but we stipulate that models with an explicit hierarchical architecture are better suited to create coherent long-form output.

Imagine a researcher giving a fifteen-minute talk. In such a situation, researchers do not usually prepare detailed speeches by writing out every single word they will pronounce. Instead, they outline a flow of higher-level ideas they want to communicate. Should they give the same talk multiple times, the actual words being spoken may differ, and the talk could even be given in different languages, but the flow of higher-level abstract ideas will remain the same. Similarly, when writing a research paper or essay on a specific topic, humans usually start by preparing an outline that structures the whole document into sections, which they then refine iteratively. Humans also detect and remember dependencies between the different parts of a longer document at an abstract level. If we expand on our previous research writing example, keeping track of dependencies means that we need to provide results for each of the experiments mentioned in the introduction. Finally, when processing and analyzing information, humans rarely consider every single word in a large document.
Instead, we use a hierarchical approach: we remember which part of a long document we should search to find a specific piece of information.

To the best of our knowledge, this explicit hierarchical structure of information processing and generation, at an abstract level, independent of any instantiation in a particular language or modality, cannot be found in any of the current LLMs. In this work, we present a new approach which moves away from processing at the token level and closer to (hierarchical) reasoning in an abstract embedding space. This abstract embedding space is designed to be independent of the language or modality in which the content is expressed; in other words, we aim to model the underlying reasoning process at a purely semantic level, not its instantiation in a specific language. In order to verify our approach, we limit our study to two levels of abstraction: subword tokens and concepts. We define a concept as an abstract atomic idea. In practice, a concept would often correspond to a sentence in a text document, or an equivalent speech utterance. We posit that a sentence, rather than a single word, is an appropriate unit to achieve language independence. This is in sharp contrast to current LLM techniques, which are heavily English-centric and token-based.

Figure 1 - Left: visualization of reasoning in an embedding space of concepts (task of summarization). Right: fundamental architecture of a Large Concept Model (LCM). ⋆: concept encoder and decoder are frozen.

Our fundamental idea could be based on any fixed-size sentence embedding space for which an encoder and decoder are available. In particular, we could aim to train a new embedding space specifically optimized for our reasoning architecture. In this work, we chose an existing and freely available sentence embedding, named SONAR (Duquenne et al., 2023b). SONAR supports text input and output in 200 languages, speech input in 76 languages, and speech output in English.
We discuss the constraints and impact of this choice in Section 2.1, and share some ideas on alternative embedding spaces in Section 6.

Figure 1-left visualizes reasoning in an embedding space with the example of a summarization task, which is materialized by a function on the embedding space, mapping five concept representations into two. Figure 1-right summarizes the overall architecture and processing flow. The input is first segmented into sentences, and each one is encoded with SONAR to obtain a sequence of concepts, i.e., sentence embeddings. This sequence of concepts is then processed by a Large Concept Model (LCM) to generate a new sequence of concepts at the output. Finally, the generated concepts are decoded by SONAR into a sequence of subwords. The encoder and decoder are fixed and are not trained. It is important to highlight that the unchanged sequence of concepts at the output of the LCM can be decoded into other languages or modalities without performing the whole reasoning process again. In the same spirit, a particular reasoning operation such as summarization can be performed in a zero-shot setting on input in any language or modality, since it solely operates on concepts. To summarize, the LCM neither has information on the input language or modality nor generates output in a particular language or modality. We explore multiple architectures to train the LCM, in particular several variants of diffusion. Finally, we envision an additional level of abstraction beyond concepts which could correspond to a short description of a paragraph or small section. In Section 4.3 we report initial ideas on how conditioning and predicting such higher-level representations can improve consistency of output generated by an LCM.

To some extent, the LCM architecture resembles the JEPA approach (LeCun, 2022) that also aims to predict the representation of the next observation in an embedding space.
However, unlike JEPA, which places more emphasis on learning a representation space in a self-supervised way, the LCM focuses on accurate prediction in the existing embedding space.

The main characteristics of our generic Large Concept Model approach are as follows:

• Reasoning at an abstract language- and modality-agnostic level beyond tokens:
  – We model the underlying reasoning process, not its instantiation in a particular language.
  – The LCM can be trained, i.e., acquire knowledge, on all languages and modalities at once, promising scalability in an unbiased way.
• Explicit hierarchical structure:
  – Better readability of long-form output by a human.
  – Facilitates local interactive edits by a user.
• Handling of long context and long-form output:
  – The complexity of a vanilla transformer model increases quadratically with the sequence length. This makes handling of large context windows challenging and several techniques have been developed to alleviate this problem, e.g., sparse attention (Child et al., 2019) or LSH attention (Kitaev et al., 2020). Our LCM operates on sequences which are at least an order of magnitude shorter. [1]
• Unparalleled zero-shot generalization:
  – Independently of the language or modality the LCM is pre-trained and fine-tuned on, it can be applied to any language and modality supported by the SONAR encoders, without the need for additional data or fine-tuning. We report results for multiple languages in the text modality.
• Modularity and extensibility:
  – Unlike multimodal LLMs that can suffer from modality competition (Aghajanyan et al., 2023; Chameleon team, 2024), concept encoders and decoders can be independently developed and optimized without any competition or interference.
  – New languages or modalities can be easily added to an existing system.

The goal of this paper is to provide a proof of concept of this high-level vision of an alternative architecture to current best practice in language modeling.
In the next section we present the main design principles of our models and discuss several variants to build and train a Large Concept Model. We discuss several designs to implement diffusion approaches with concept embeddings and carefully study noise scheduling. This section is completed by a compute complexity comparison with token-based LLMs. Section 3 is dedicated to the analysis of a larger 7B parameter model. We discuss challenges when instruction fine-tuning this model on multiple generative tasks, and provide a comparison with existing LLMs of comparable size. The paper concludes with a discussion of related work, the current limitations and perspectives of our approach.

To foster research in this area, we make our LCM training code [2] as well as SONAR encoders and decoders [3] for up to 200 languages and multiple modalities freely available.

[1] We assume an average sentence length of 10–20 tokens.
[2] https://github.com/facebookresearch/large_concept_model
[3] https://github.com/facebookresearch/SONAR

2 Main Design Principles

In this section, we outline the main design principles of the LCM. We first describe the SONAR embedding space with its encoders and decoders. Then, we discuss details of data preparation, namely sentence segmentation, i.e., how we split long documents into sentences. Finally, we describe in detail the different versions of LCMs introduced in this work.

2.1 The SONAR embedding space

The motivation of this work is to perform reasoning at a higher conceptual level than tokens. This requires an embedding space which is highly semantic. We chose SONAR (Duquenne et al., 2023b) since it achieves the best performance on several semantic similarity metrics such as xsim or xsim++ (Chen et al., 2023b), and it was successfully used in large-scale bitext mining for translation (Seamless Communication et al., 2023b).
The SONAR text embedding space was trained as an encoder/decoder architecture, with a fixed-size bottleneck instead of cross-attention (see Figure 2). The criterion combines a machine translation objective for 200 languages into and out of English, denoising auto-encoding and an explicit MSE loss at the embedding bottleneck layer. Once the text embedding space was trained, a teacher-student approach was applied to extend the SONAR space to the speech modality. More details on the architecture and training procedure can be found in Duquenne et al. (2023b), and detailed speech recognition and translation results in the appendix of Seamless Communication et al. (2023a).

Our LCM operates directly on SONAR concept embeddings; hence, it can perform reasoning on all supported languages and modalities. Table 1 compares the language coverage of the LCM with that of several LLMs. The LCM supports substantially more languages than other models, in particular many low-resource languages. In addition to the text modality, SONAR supports 76 languages for speech input and speech output in English. We have also developed an experimental encoder for American Sign Language (ASL). All these encoders and decoders are freely available. [4] Exact listings of the supported languages can be found in the SONAR GitHub repository.

[4] https://github.com/facebookresearch/SONAR

Figure 2 - Encoder/decoder bottleneck architecture to train the SONAR text embeddings (right part of figure). Teacher-student approach to extend SONAR to the speech modality (left part).

               Text             Speech           Image            Video
Model          Input   Output   Input   Output   Input   Output   Input   Output
Gemini         47      47       62      ✓        ✓       ✓        ✓       ✗
GPT            85      85       ✓       ✓        ✓       ✓        ?       ✗
Claude         37      37       ✓       ✓        ✓       ✓        ✗       ✗
Bloom          46      46       ✗       ✗        ✓       ✓        ✗       ✗
Llama3-400B    8       8        34      ✗        ✓       ✓        ✗       ✗
LCM-SONAR      200     200      76      1        ✗       ✗        (ASL)   ✗

Table 1 - Comparison of language and modality coverage for several LLMs and our LCM operating on the SONAR embedding space.
SONAR has experimental support for American Sign Language (ASL), which is not used in this paper.

2.2 Data preparation

To train and evaluate the LCM, we need to convert raw text datasets into a sequence of SONAR embeddings, each one corresponding to a sentence. Dealing with large text corpora presents several practical limitations. First, the precise segmentation of a text into sentences can be challenging due to the presence of errors, specific formatting issues or any other sources of noise. This requires us to apply robust automatic text segmentation techniques. Second, some sentences (even well-formed ones) can be very long and complex, which might negatively impact the quality of the encoded SONAR embeddings. This is particularly prevalent for texts in the scientific domain. In the following, we discuss strategies for sentence segmentation and how they affect the SONAR encoding.

Sentence segmentation analysis

We have identified two potential sentence segmentation techniques; as we are exploring multilingual data, we focus on sentence segmenters with a large language coverage:

1. SpaCy segmenter (SpaCy) (Honnibal et al., 2020) is a well-established multilingual NLP toolkit that provides a rule-based approach to sentence segmentation. SpaCy is thoroughly tested for high-resource languages.
2. Segment any Text (SaT) (Minixhofer et al., 2023; Frohmann et al., 2024) offers a suite of models and adapters that predict sentence boundaries at the token level. SaT is designed to be resilient to perturbations, in particular by avoiding over-reliance on punctuation and capitalization. This is valuable in domains where these conventional markers are often missing. The quality of SaT's segmentation, however, depends on the choice of an "appropriate" split probability threshold.

We additionally customize both methods by incorporating a maximum sentence length cap in characters. We refer to these extensions as SpaCy Capped and SaT Capped.
Long sentences are broken down into smaller, logically coherent fragments using a rule-based approach based on punctuation marks for SpaCy. For SaT, we leverage the provided splitting probability estimates to identify the next best potential split.

To measure the efficacy of a given segmenter, we evaluate the quality of the reconstructed sentences with AutoBLEU. It is defined as a BLEU score (Papineni et al., 2002) comparing the decoded text from a SONAR vector after encoding a segment to the reference segment. A good segmentation will yield segments that can be encoded and then decoded without loss of signal, and thus score a higher AutoBLEU.

For this analysis, we sample 10k documents from our pretraining datasets, representing approximately 500k sentences. The documents are processed with each segmenter, the sentences are encoded then decoded, and the AutoBLEU score is calculated. We stratified the results based on the lengths of the original sentences.

Figure 3 - Segmenters quality. Average Auto-BLEU scores (y-axis) for different sentence segmentation methods depending on sentence length in characters (x-axis), for both out-of-the-box (left: SaT, SpaCy) and capped implementations (right: SaT Capped, SpaCy Capped).

As illustrated in Figure 3 and with a capping at 200 characters, the SaT Capped method demonstrates a slight but consistent advantage over SpaCy Capped. Both out-of-the-box segmenters, however, exhibit significant under-performance across all sentence lengths. This lower performance is especially pronounced for sentences exceeding 250 characters, underscoring the limitations of using the segmenters without capping.

Accordingly, we prepare the LCM training data with SaT Capped. We discuss in Appendix A technical and engineering challenges faced when handling large amounts of SONAR embeddings.
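As a rough illustration of the capping step described in this section, the sketch below splits an over-long sentence at punctuation marks so that no fragment exceeds a character cap. It is a minimal sketch under stated assumptions and not the authors' implementation: the helper name `cap_sentence`, the punctuation set, the greedy packing and the whitespace fallback are all hypothetical choices. The SaT Capped variant would instead pick split points from the model's split probabilities, which is not shown here.

```python
import re


def cap_sentence(sentence: str, max_chars: int = 200) -> list[str]:
    """Greedily split `sentence` into fragments of at most `max_chars` characters.

    Hypothetical helper: pieces are cut after ',', ';' or ':' and packed
    left to right; a piece that is still too long is cut at whitespace.
    """
    sentence = sentence.strip()
    if len(sentence) <= max_chars:
        return [sentence]
    # Split on whitespace that follows a punctuation mark (mark stays on the left piece).
    pieces = re.split(r"(?<=[,;:])\s+", sentence)
    fragments: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            fragments.append(current)
            current = ""
        # Fallback for a single piece longer than the cap: cut at the last space.
        while len(piece) > max_chars:
            cut = piece.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars  # no usable whitespace: hard cut
            fragments.append(piece[:cut].strip())
            piece = piece[cut:].strip()
        current = piece
    if current:
        fragments.append(current)
    return fragments
```

Each returned fragment could then be encoded and decoded with SONAR and scored against itself to obtain an AutoBLEU value, stratified by fragment length as in the analysis above.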
2.3 Large Concept Model variants

The design of the LCM is driven by the need to conditionally generate a continuous sentence embedding. This contrasts with how current LLMs work, i.e., estimating a probability distribution over a vocabulary of discrete tokens. A straightforward way of solving the task is to train a transformer model to generate an embedding with the objective of minimizing the MSE loss (see Section 2.3.1). However, a given context may have many plausible, yet semantically different, continuations. The model should thus be able to learn a conditional probability distribution over the continuous embedding of the next sentence.

There is a large body of work in computer vision aiming to learn such conditional probability distributions over continuous data (Dhariwal and Nichol, 2021; Rombach et al., 2021). Models like Dall-E 3 (Betker et al., 2023) or Imagen Video (Ho et al., 2022) use a diffusion process to generate an image or video from a text prompt. Many different real images may satisfy the same input prompt, hence the model has to learn a probability distribution over continuous pixel data. This motivates the exploration of diffusion models for sentence embedding generation. Two variants are presented in Sections 2.3.3 and 2.3.4. Another prevalent take on continuous data generation consists of quantizing said data and then modeling the resulting discrete units; we explore LCM modeling with quantization in Section 2.3.5.

Figure 4 - The Base-LCM. Illustration of the Base-LCM. At its core is a standard decoder-only Transformer surrounded by a PreNet and a PostNet.

2.3.1 Base-LCM

Our baseline architecture for next-concept prediction is a standard decoder-only Transformer that transduces a sequence of preceding concepts (i.e., sentence embeddings) into a sequence of future ones.
As illustrated in Figure 4, the Base-LCM is equipped with a "PostNet" and a "PreNet". The PreNet normalizes the input SONAR embeddings and maps them to the model's hidden dimension $d_{\text{model}}$. The PostNet maps the output hidden states back to the SONAR dimension and denormalizes them.

$$\mathrm{PreNet}(x) = \mathrm{normalize}(x)\, W_{\text{pre}}^{\top} + b_{\text{pre}}, \tag{1}$$

$$\mathrm{PostNet}(x) = \mathrm{denormalize}\left(x\, W_{\text{post}}^{\top} + b_{\text{post}}\right), \tag{2}$$

where $W_{\text{post}} \in \mathbb{R}^{d_{\text{SONAR}} \times d_{\text{model}}}$, $b_{\text{post}} \in \mathbb{R}^{d_{\text{SONAR}}}$, $W_{\text{pre}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{SONAR}}}$ and $b_{\text{pre}} \in \mathbb{R}^{d_{\text{model}}}$.

In order to learn the map "normalize" and its inverse "denormalize", we fit a robust scaler to a set of randomly sampled SONAR vectors from different corpora and domains of text data. This scaler removes the median statistics and scales the data according to the interquartile range (IQR), so that $\mu$ below is the median and $\sigma$ the IQR:

$$\mathrm{normalize}(x) = \frac{x - \mu}{\sigma}, \qquad \mathrm{denormalize}(x) = \mu + \sigma x. \tag{4}$$

The Base-LCM is trained on the semi-supervised task of next concept prediction, that is, the model predicts the next concept $\hat{x}_n$ and its parameters $\theta$ are optimized to regress the ground truth next concept ($x_n$).

$$\hat{x}_n = f(x_{<n}; \theta), \qquad \mathrm{MSE}(\hat{x}_n, x_n) = \lVert \hat{x}_n - x_n \rVert^2. \tag{5}$$

Given a data distribution $q$ of documents (sequences of concepts), the training loss is evaluated as:

$$\mathcal{L}_{\text{Base-LCM}}(\theta) = \mathbb{E}_{x \sim q}\left[\sum_{n=1}^{|x|} \mathrm{MSE}\left(f(x_{<n}; \theta), x_n\right)\right]. \tag{6}$$

In order to enable the generation of variable length documents at inference time, we suffix training documents with the sentence "End of text.". Like any other sentence in the document, this special suffix is encoded with SONAR. This means that $x_{|x|} = \vec{\mathrm{eot}} := \mathrm{encode}(\text{"End of text."})$. During inference, we implement two main early stopping mechanisms: the first one measures the similarity of the generated embedding $\hat{x}_n$ to $\vec{\mathrm{eot}}$ and stops if the cosine similarity exceeds a threshold $s_{\text{eot}}$. The second mechanism compares the newly generated embedding $\hat{x}_n$ to the previous generation $\hat{x}_{n-1}$ and stops if their cosine similarity is higher than a threshold $s_{\text{prev}}$. We set both $s_{\text{eot}}$ and $s_{\text{prev}}$ to 0.9.

2.3.2 Diffusion-based LCM

Diffusion-based LCMs are generative latent variable models that learn a model distribution $p_\theta$ approximating a data distribution $q$.
Similar to the Base-LCM, we model the diffusion LCMs as auto-regressive models that generate concepts in a document, one at a time. The model distribution is thus expressed at each position $n$ of the sequence as $p_\theta(x_n \mid x_{<n})$, i.e., the generation of the next concept is conditioned on the preceding context.

In what follows we use a superscript for the denoising/diffusion step ($t \in [0,1]$) and a subscript ($n$) for indexing the sequence of concepts. For a given $n$, we abbreviate the conditional model distribution $p_\theta(x_n^0 \mid x_{<n}^0)$ as $p_\theta(x^0)$, and the conditional data distribution $q(x_n^0 \mid x_{<n}^0)$ as $q(x^0)$.

Diffusion models involve two processes: a forward noising process and a reverse denoising process (Ho et al., 2020; Song et al., 2020):

Forward process and noise schedule

The forward process is a Gaussian diffusion process characterized by the marginal distribution $q(x^t \mid x^0)$, given for every timestep $t \in [0,1]$ as:

$$q(x^t \mid x^0) := \mathcal{N}(\alpha_t x^0, \sigma_t^2 I). \tag{7}$$

With the reparameterization trick, we can sample from this marginal distribution via:

$$x^t = \alpha_t x^0 + \sigma_t \epsilon \quad \text{where } \epsilon \sim \mathcal{N}(0, I). \tag{8}$$

We use a variance-preserving forward process (Karras et al., 2022) for which we have:

$$\alpha_t^2 = \mathrm{sigmoid}(\lambda_t), \qquad \sigma_t^2 = \mathrm{sigmoid}(-\lambda_t) = 1 - \mathrm{sigmoid}(\lambda_t), \qquad \lambda_t = \log\left(\alpha_t^2 / \sigma_t^2\right), \tag{9}$$

where $\lambda_t$ is the log signal-to-noise ratio (log-SNR) for timestep $t$.

The noise schedule is a strictly monotonically decreasing function $f_\lambda$ that maps from the timestep $t \in [0,1]$ to a log-SNR level: $\lambda_t = f_\lambda(t)$.

It is common in previous work to also define the noise schedule based on a discrete variance schedule $(\beta_0, \ldots, \beta_T)$. This stems from the formulation of the forward process as a discrete-time Markov chain that gradually adds Gaussian noise to the data according to said variance schedule:

$$q(x^{1 \ldots T} \mid x^0) := \prod_{t=1}^{T} q(x^t \mid x^{t-1}), \qquad q(x^t \mid x^{t-1}) := \mathcal{N}\left(x^t; \sqrt{1 - \beta_t}\, x^{t-1}, \beta_t I\right), \tag{10}$$

where, to simplify the notation, $x^t$ is short for $x^{t/T}$ as the timesteps are now discretized.

From the variance schedule $(\beta_t)_t$, the noise schedule can be expressed as:

$$\alpha_t^2 = \prod_{s=1}^{t} (1 - \beta_s). \tag{11}$$

Following Kingma and Gao (2024), for any given noise schedule, we visualize the distribution over noise levels $p(\lambda) = -dt/d\lambda$ in order to characterize how much time we are spending at every noise level during training.

In this work, we consider three types of noise schedules:

Cosine. The schedule formulated in Nichol and Dhariwal (2021) as:

$$\alpha_t^2 = f(t)/f(0), \quad \text{where } f(t) = \cos^2\left(\frac{t + s}{1 + s} \cdot \frac{\pi}{2}\right), \quad \text{where } s = 0.008. \tag{12}$$

Quadratic. The schedule introduced in Ho et al. (2020) where the variances $(\beta_t)_t$ are set to constants increasing quadratically from $\beta_0$ to $\beta_1$:

$$\beta_{t/T} = \left(\sqrt{\beta_0} + \frac{t}{T} \cdot \left(\sqrt{\beta_1} - \sqrt{\beta_0}\right)\right)^2. \tag{13}$$

Sigmoid. We introduce in this work the sigmoid schedule as a means to study the impact of the SNR distribution on the training of our models. The schedule is parametrized by two hyper-parameters $(\gamma, \delta)$ and is defined as:

$$\alpha_t^2 = f(t)/f(0), \quad \text{where } f(t) = \mathrm{sigmoid}\left(\delta - \gamma\, \mathrm{logit}(t)\right), \tag{14}$$

where "sigmoid" is the sigmoid function $\mathrm{sigmoid}: x \mapsto e^x/(e^x + 1)$ and "logit" its inverse function $\mathrm{logit}: x \mapsto \log(x/(1 - x))$. The hyper-parameter $\gamma$ controls the scale of the log-SNR distribution $p(\lambda)$ and $\delta$ its center (see Figure 5).

In all our experiments, we follow Lin et al. (2024) and rescale the variance schedule $(\beta_1, \ldots, \beta_T)$ to enforce zero terminal SNR, i.e., $\beta_T = 1$.

Reverse process and objective function

The joint distribution of the diffusion model $p_\theta(x^{0 \ldots 1})$ is called the reverse process and is defined as a Markov chain with learned Gaussian transitions starting at $p(x^1) = \mathcal{N}(0, I)$. In its discretized form:

$$p_\theta(x^{0:T}) := p(x^T) \prod_{t=1}^{T} p_\theta(x^{t-1} \mid x^t), \qquad p_\theta(x^{t-1} \mid x^t) := \mathcal{N}\left(x^{t-1}; \mu_\theta(x^t, t), \Sigma_\theta(x^t, t)\right), \tag{15}$$

where $\mu_\theta$ and $\Sigma$ are predicted statistics. $\Sigma$ is set to the constant $\sigma_t^2 I$ (matching the transitions of the forward process). $\mu_\theta$ can be decomposed into a linear combination of $x^t$ and a noise approximation model $\epsilon_\theta$. This prediction method is dubbed $\epsilon$-prediction (Ho et al., 2020; Nichol and Dhariwal, 2021; Nichol et al., 2022).
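To make the schedule definitions of this section concrete, the sketch below evaluates the cosine schedule of Equation (12) and the sigmoid schedule of Equation (14), converts $\alpha_t^2$ to a log-SNR using Equation (9), and draws a noised sample with Equation (8). It is a plain-Python illustration rather than the authors' code: the function names and the clipping constants `eps` are assumptions, $f(0)$ of the sigmoid schedule is taken as its limit 1 (valid for $\gamma > 0$), and the zero-terminal-SNR rescaling is omitted.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logit(x: float) -> float:
    return math.log(x / (1.0 - x))


def alpha2_cosine(t: float, s: float = 0.008) -> float:
    """Equation (12): alpha_t^2 = f(t) / f(0) with a squared-cosine f."""
    def f(u: float) -> float:
        return math.cos((u + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return f(t) / f(0.0)


def alpha2_sigmoid(t: float, gamma: float, delta: float, eps: float = 1e-6) -> float:
    """Equation (14), assuming gamma > 0 so that f(0) tends to 1."""
    t = min(max(t, eps), 1.0 - eps)  # logit is undefined at 0 and 1
    return sigmoid(delta - gamma * logit(t))


def log_snr(alpha2: float, eps: float = 1e-12) -> float:
    """Equation (9): lambda_t = log(alpha_t^2 / sigma_t^2), variance-preserving."""
    alpha2 = min(max(alpha2, eps), 1.0 - eps)
    return math.log(alpha2 / (1.0 - alpha2))


def noise_sample(x0: list[float], alpha2: float, rng: random.Random) -> list[float]:
    """Equation (8): x^t = alpha_t * x^0 + sigma_t * eps with sigma_t^2 = 1 - alpha_t^2."""
    alpha, sigma = math.sqrt(alpha2), math.sqrt(1.0 - alpha2)
    return [alpha * xi + sigma * rng.gauss(0.0, 1.0) for xi in x0]
```

Under these assumptions the log-SNR of the sigmoid schedule is $\lambda_t = \delta - \gamma\,\mathrm{logit}(t)$, so at $t = 0.5$ it equals $\delta$; this is one way to see why $\delta$ acts as the center and $\gamma$ as the scale of $p(\lambda)$.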
In this work we adopt $x^0$-prediction, i.e., we predict the noiseless state and optimize the simple reconstruction loss:

$$\mathcal{L}(\theta) := \mathbb{E}_{t \sim \mathcal{U}(0,1)}\left[\omega(t)\, \mathcal{L}(t, \theta)\right], \qquad \mathcal{L}(t, \theta) := \mathbb{E}_{x^0, \epsilon}\left[\left\lVert x^0 - \mu_\theta(\alpha_t x^0 + \sigma_t \epsilon, t)\right\rVert_2^2\right]. \tag{16}$$

Different weighting strategies for the reconstruction loss were proposed in the literature (Ho et al., 2020; Salimans and Ho, 2022; Hang et al., 2023). In this work, we default to the simple reconstruction loss ($\omega(t) = 1, \forall t$) and we experiment with a clamped-SNR weighting strategy:

$$\omega(t) = \max\left(\min\left(\exp(\lambda_t), \lambda_{\max}\right), \lambda_{\min}\right), \qquad \lambda_t = \log\left(\alpha_t^2 / \sigma_t^2\right), \tag{17}$$

which is a generalization of Salimans and Ho (2022)'s truncated-SNR weighting and Hang et al. (2023)'s min-SNR strategy where the SNR is clamped between a min- and max-value $\lambda_{\min}$ and $\lambda_{\max}$.

Figure 5 - Noise schedules. Illustrations of the different noise schedules explored in this work. Our default schedule is cosine. Quadratic-1 is characterized by ($\beta_0 = 0.001$, $\beta_T = 0.0012$) and Quadratic-2 by ($\beta_0 = 0.02$, $\beta_T = 0.022$). For each schedule we visualize the curve of $(\alpha_t)_t$ (see Equation (8)), the curve of the log-SNR and the associated distribution over noise levels $p(\lambda)$ (Kingma and Gao, 2024). Rows compare Cosine with Sigmoid(1.5, -2) and Sigmoid(1.5, -1); Cosine with Sigmoid(0.8, -1) and Sigmoid(3.5, 0); and Cosine with Quadratic-1 and Quadratic-2.

Additionally, we consider a weighting strategy that factors in the quality of the sample $x^0$. We use as sample weight a scalar $\omega(x^0) \in [0,1]$ correlated with the sample's fragility score, i.e., how easy it is to reconstruct a noised sample (see Section 2.5.2).
Fragile samples will be assigned a smaller weight and thus contribute less to the objective function. Lfragility (θ):=Et∼U(0,1),x0,ϵh ω(x0) x0−µθ(αtx0+σtϵ, t) 2 2i , (18) ω(x0) = sigmoid( afragility( x0) +b), (19) where a <0andbare hyper-parameters to tune. Classifier-free diffusion guidance for the LCM Classifier-free diffusion guidance (Ho and Salimans, 2022) consists of jointly training a conditional and an unconditional diffusion model. The resulting conditional and unconditional score estimates are combined at inference time to achieve a trade-off between sample quality and diversity. This combined score is defined as follows: ∇xlogγp(x|y) = (1 −γ)∇xlogp(x) +γ∇xlogp(x|y), (20) 11 Page 12: where yis the conditioning variable, in our case the sequence of preceding embeddings (x1, . . .xn−1) when denoising xn. The hyper-parameter γcontrols the contribution of the conditional score; For γ= 0, this is equivalent to an unconditional model, and for γ= 1, it is a fully conditional model. In practice for vision models, γis set to a value greater than 1, thus amplifying the signal from the conditioning model. Inference At inference time, the reverse process is applied. xTis obtained by sampling a random noise from p(xT) =N(0,I), and is then iteratively denoised by taking steps in the direction of the score function ( i.e.,the direction in which the log-likelihood increases fastest). Additional noise is added during the process in order to avoid falling down into modes of the distribution. Practically, we start from xT∼ N (0, σ2 initI)and find that the quality of the sampled output is sensitive to the initial noise scale σinit. Although we train the model on a large number of discretized timesteps, e.g.,T=100, we only generate with a smaller number of steps, e.g.,S=40, at inference via accelerated generation processes (Song et al., 2020). We select the sample steps following the trailing method of Lu et al. 
(2022) as it is found to be more efficient for smaller steps $S$ (Lin et al., 2024). That is, we generate along the sampled steps $(\tau_1, \ldots, \tau_S) = \mathrm{round}(\mathrm{flip}(\mathrm{arange}(T, 0, -T/S)))$. During inference, we employ the classifier-free guidance rescaling technique of Lin et al. (2024), proven to alleviate the image over-exposure problem encountered in image synthesis diffusion models as the terminal SNR approaches zero. We denote with $g_{\text{scale}}$ and $g_{\text{rescale}}$ the guidance scale and guidance rescale factors used at inference. Following Ning et al. (2023), we perform Epsilon-scaling at inference time as it is shown to alleviate the exposure bias problem in diffusion models. In its simplified version, this is a training-free method that consists of scaling down the over-predicted magnitude of error by a scalar $\lambda_{\text{eps}}$.

We describe in Section 2.3.3 and Section 2.3.4 two variants of diffusion LCM: One-Tower and Two-Tower.

[Figure 6: left panel, One-Tower (PreNet, Transformer Decoder, PostNet, repeated for the denoising steps); right panel, Two-Tower (contextualizer: PreNet, Transformer Decoder, PostNet; denoiser: PreNet, $L_d$ blocks of self-attention, cross-attention and feed-forward modulated by the timestep $t$, PostNet, repeated for the denoising steps).]

Figure 6 - Inference with diffusion-based LCMs. On the left-hand side, an illustration of the One-Tower LCM and on the right-hand side an illustration of the Two-Tower LCM.

2.3.3 One-Tower Diffusion LCM

This model, depicted in the left panel of Figure 6, consists of a single transformer backbone whose task is to predict the clean next sentence embedding $\mathbf{x}_n^0$ given a noisy input $\mathbf{x}_n^t$, conditioned on previous clean sentence embeddings $\mathbf{x}_{<n}^0$.

[Figure 7: interleaved clean embeddings $\mathbf{x}_i^0$ and noisy embeddings $\mathbf{x}_i^{t_i}$ with timesteps $t_0, t_1, t_2$, passed through PreNet, Transformer Decoder and PostNet to predict $\hat{\mathbf{x}}_0^0, \hat{\mathbf{x}}_1^0, \hat{\mathbf{x}}_2^0$.]

Figure 7 - Training of One-Tower diffusion LCM. Interleaving the clean and noisy embeddings and sampling different diffusion timesteps allows for efficient training.

During training, self-attention can be dropped with a certain probability for unconditional training.
This enables classifier-free guidance at inference time (see Section 2.3.2 for details). Each input embedding is concatenated with the corresponding diffusion timestep embedding. The learned position embeddings are added to the input vectors prior to being fed to the LCM. The backbone utilizes causal multi-head self-attention.

For efficient training, the model is trained to predict each and every sentence in a document at once. As depicted in Figure 7, during the diffusion process, the model attends to the clean sentences in the context using causal multi-head attention layers. The input is specially prepared by interleaving the noisy (blue) and clean (light blue) sentence embeddings, and the attention mask is prepared accordingly to only attend to the clean sentence embeddings (gray arrows).

2.3.4 Two-Tower Diffusion LCM

This model, depicted in the right panel of Figure 6, separates the encoding of the preceding context from the diffusion of the next embedding. A first model, labeled contextualizer, takes as input the context vectors $\mathbf{x}_{<n}$ and encodes them causally, i.e., we apply a decoder-only Transformer with causal self-attention. The outputs of the contextualizer are then fed to a second model dubbed denoiser, which predicts the clean next sentence embedding $\mathbf{x}_n^0$ by iteratively denoising the latent $\mathbf{x}_n^1 \sim \mathcal{N}(0, \mathbf{I})$. The denoiser consists of a stack of Transformer blocks, each with a cross-attention block to attend over the encoded context. Both the denoiser and the contextualizer share the same Transformer hidden dimension $d_{\text{model}}$. Each block of each Transformer layer in the denoiser (including the cross-attention layer) is modulated with adaptive layer norm (AdaLN, Perez et al. (2018); Peebles and Xie (2023)). The AdaLN modulator of Two-Tower regresses channel-wise scale ($\gamma$), shift ($\beta$) and residual gates ($\alpha$) from the embedding of the current diffusion timestep $t$:
$$[\beta, \gamma, \alpha] = \mathrm{SiLU}(\mathrm{embed}(t))\,\mathbf{W}^{t} + \mathbf{b}, \tag{21}$$

$$\mathbf{y} = \mathbf{x} + \alpha\,\mathrm{Block}\big((1+\gamma)\,\mathbf{x} + \beta\big). \tag{22}$$

Following Peebles and Xie (2023) and Goyal (2017), we initialize each residual block in a Transformer layer ("Block") with the identity function by initializing $\mathbf{W}$ and $\mathbf{b}$ in Equation (21) to zero. The diffusion timestep $t$ is embedded using a 256-dimensional frequency embedding (Dhariwal and Nichol, 2021; Peebles and Xie, 2023) followed by a two-layer MLP with SiLU as activation function. "embed" maps to the denoiser's hidden dimension $d_{\text{model}}$. The self-attention layers in the denoiser attend only to the current position, i.e., we do not attend to the preceding noised context. The self-attention layers were kept for consistency with a standard Transformer block and for the possible extension of denoising multiple vectors at once.

[Figure 8: left panel, the contextualizer (PreNet, Transformer Decoder, PostNet) maps $\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2$ to $s_0, s_1, s_2$, which the denoiser (PreNet, $L_d$ modulated blocks, PostNet) uses to denoise $\mathbf{x}_0^{t_0}, \mathbf{x}_1^{t_1}, \mathbf{x}_2^{t_2}$ in parallel; right panel, the cross-attention mask with rows $h_0, \ldots, h_4$ and columns $(0, s_0, \ldots, s_4)$, where one highlighted row attends to the zero vector only.]

Figure 8 - Training Two-Tower diffusion LCM. On the left panel, a Two-Tower forward pass at training time in order to denoise multiple embeddings in parallel. On the right panel, a visualization of the denoiser's cross-attention masks with the red highlighted row signaling a sample dropped to train the denoiser unconditionally. $(h_1, \ldots, h_4)$ denotes the sequence of intermediate representations in the denoiser right before the cross-attention layer.

Two-Tower training. At training time, Two-Tower's parameters are optimized for the next-sentence prediction task on unsupervised sequences of embeddings. The causal embeddings from the contextualizer are shifted by one position in the denoiser and a causal mask is used in its cross-attention layers. A zero vector is prepended to the context vectors to enable the prediction of the first position in the sequence (see Figure 8).
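The shifted, causal cross-attention mask described for Two-Tower training, together with the unconditional rows shown in Figure 8, can be sketched as follows. This is a minimal illustration in plain Python rather than the authors' implementation: the function name `cross_attention_mask` and the list-of-lists layout are assumptions made for the example, and `p_cfg` stands for the row-dropping rate.

```python
import random


def cross_attention_mask(num_positions, p_cfg=0.0, rng=None):
    """Build a num_positions x (num_positions + 1) 0/1 cross-attention mask.

    Column 0 is the prepended zero vector; column j + 1 is the contextualizer
    output s_j. Row i (denoiser position i) may attend to the zero vector and
    to s_0 .. s_{i-1}. With probability p_cfg a row is dropped: it keeps only
    the zero vector, so that position is denoised unconditionally.
    """
    rng = rng or random.Random()
    mask = []
    for i in range(num_positions):
        dropped = rng.random() < p_cfg
        visible = 0 if dropped else i  # number of context vectors s_j visible
        row = [1] + [1 if j < visible else 0 for j in range(num_positions)]
        mask.append(row)
    return mask


causal = cross_attention_mask(5)                 # no rows dropped
unconditional = cross_attention_mask(5, p_cfg=1.0)  # every row dropped
```

With five positions and no dropping, the last row reads `1 1 1 1 1 0`: position 4 sees the zero vector and $s_0, \ldots, s_3$ but never its own contextualizer output.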
To train the model both conditionally and unconditionally in preparation for inference with classifier-free guidance scaling, we drop random rows from the cross-attention mask with a rate of $p_{\text{cfg}}$ and denoise the corresponding positions with only the zero vector as context.

2.3.5 Quantized LCM

Two major approaches currently stand to deal with continuous data generation in the image or speech generation fields: one is diffusion modeling, the other is learning a quantization of the data before modeling on top of these discrete units.

In addition, the text modality remains discrete, and despite dealing with continuous representations in the SONAR space, all possible text sentences (of less than a given number of characters) are a cloud of points rather than a real continuous distribution in the SONAR space. These considerations motivate the exploration of quantization of SONAR representations and then modeling on these discrete units to address the next sentence prediction task. Finally, following such an approach enables the natural use of temperature, top-p or top-k sampling to control the level of randomness and diversity in the sampling of the next sentence representation.

In this section, we learn residual quantizers for the SONAR space, and then build a Quantized Large Concept Model based on these discrete units. We tried to come up with an architecture as close as possible to the diffusion LCM models, to be able to compare approaches.

Quantization of SONAR space. We use Residual Vector Quantization (RVQ; Zeghidour et al. (2021)) as a coarse-to-fine quantization technique to discretize SONAR representations. Vector quantization maps continuous input embeddings to the nearest entry in a learnt codebook. RVQ iteratively quantizes residual errors from previous quantizations using an additional codebook for each iteration. We use the FAISS implementation (Douze et al., 2024), which performs iterative k-means clustering of residuals.
We use the Improved Residual Vector Quantization (IRVQ) method from Liu et al. (2015), with a beam size of 1 for memory efficiency. We trained the RVQ codebooks on 15 million English sentences extracted from Common Crawl using $n_{\text{codebooks}} = 64$ quantizers with $n_{\text{units-per-codebook}} = 8192$ units per codebook.

One property of RVQ is that the cumulative sum of centroid embeddings of the first codebooks is an intermediate coarse approximation of the input SONAR vectors. In that way, we can report the evolution of auto-encoding BLEU scores with the increasing number of codebooks used to quantize SONAR embeddings, before using the SONAR text decoder to decode quantized embeddings. We notice in Figure 9 that auto-encoding BLEU consistently improves as the number of codebooks increases, reaching around 70% of the auto-encoding BLEU score achieved with continuous SONAR embeddings when using all 64 codebooks.

[Figure 9: auto-encoding BLEU score (y-axis) against the number of codebooks (x-axis) for the base decoder, the finetuned decoder and the unquantized topline.]

Figure 9 - Auto-encoding BLEU scores on FLORES devtest set, encoding sentences with SONAR encoder, quantizing with a varying number of codebooks, dequantizing and decoding with SONAR decoder.

Finetuning the SONAR decoder on quantized representations. We fine-tuned the SONAR decoder on quantized representations, using 1.2M English sentences, to adjust it to the space created by the quantizers. To make the decoder more robust against residual representations from intermediate codebooks, we randomly select a codebook number $k \in [\tfrac{2}{3} \cdot n_{\text{codebooks}},\, n_{\text{codebooks}}]$ during fine-tuning, with probability $p = 0.3$, and use the quantized representation with codebooks up to $k$. Figure 9 shows the improvement in auto-encoding performance when the decoder is adapted to quantized representations.

Quant-LCM architecture. In the same spirit as diffusion LCM, we aim at coarse-to-fine generation of SONAR embeddings conditioned on left-context sentences.
However, we do not follow a denoising task as in diffusion modeling, but an iterative generation of SONAR embeddings based on intermediate quantized representations instead. In order to generate a SONAR embedding conditioned on left-context sentences, the Quant-LCM model starts with the intermediate representation as a vector filled with zeros. We iteratively add to this intermediate representation the predicted residual centroid embeddings. In that way, the predicted SONAR embeddings are iteratively refined based on the growing cumulative sum of centroid embeddings of the first codebooks, until all codebooks have been seen. We used the One-Tower architecture for Quant-LCM experiments even though it could be trained with the Two-Tower architecture too. Compared to the diffusion LCM, noisy input representations are replaced with intermediate quantized representations, and diffusion timestep embeddings as input are replaced by codebook index embeddings.

Discrete targets. Following previous work on modeling discrete units from residual quantizers (Wang et al., 2023; Rubenstein et al., 2023; Lee et al., 2022), a Quant-LCM can be trained to predict the unit from the next codebook, parameterized with a softmax output layer. For parameter efficiency, we do not use $n_{\text{codebooks}} \cdot n_{\text{units-per-codebook}}$ unique indices as discrete targets, which would imply $n_{\text{codebooks}} \cdot n_{\text{units-per-codebook}}$ output dimensions, but only $n_{\text{units-per-codebook}}$ output dimensions while inputting the information of the codebook index to the model. At training time, similarly to diffusion LCM training, we randomly sample a codebook index $k$ between 1 and $n_{\text{codebooks}}$, and compute the cumulative sum of centroid embeddings of the first $k-1$ codebooks as input. We use the unit from codebook $k$ of the target embedding as target index for cross entropy loss computation.
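To make the coarse-to-fine construction concrete, the sketch below runs residual quantization on a toy 2-D vector and builds one training pair as described for the discrete-target setup: the input is the cumulative sum of the first $k-1$ selected centroids and the target is the unit index from codebook $k$. It is a plain-Python illustration under stated assumptions, not the FAISS-based pipeline used in the paper: the helper names and the three tiny codebooks are hypothetical, and greedy nearest-centroid search is used in place of IRVQ training.

```python
import math


def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )


def rvq_encode(x, codebooks):
    """Quantize x coarse-to-fine: each codebook encodes the residual left so far."""
    residual, units = list(x), []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        units.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return units


def partial_sum(units, codebooks, k):
    """Cumulative sum of the centroids selected in the first k codebooks."""
    total = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(units[:k], codebooks[:k]):
        total = [t + c for t, c in zip(total, codebook[idx])]
    return total


def training_example(x, codebooks, k):
    """(input, target) pair for codebook k, with k counted from 1."""
    units = rvq_encode(x, codebooks)
    return partial_sum(units, codebooks, k - 1), units[k - 1]


# Toy 2-D codebooks (hypothetical values, far smaller than 64 x 8192).
CODEBOOKS = [
    [(1.0, -1.0), (-1.0, 1.0)],
    [(0.25, 0.25), (-0.25, -0.25)],
    [(0.0, 0.1), (0.05, -0.05)],
]
x = (1.3, -0.8)
units = rvq_encode(x, CODEBOOKS)
# Reconstruction error after using 0, 1, 2 and 3 codebooks.
errors = [math.dist(x, partial_sum(units, CODEBOOKS, k)) for k in range(len(CODEBOOKS) + 1)]
```

On this toy input the reconstruction error shrinks with every added codebook, which mirrors the trend reported for auto-encoding BLEU in Figure 9.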
At inference time, we iteratively predict the unit from the next codebook, get the corresponding centroid embedding and add it to the current intermediate representation as an additional predicted residual embedding. Finally, we also enable classifier-free guidance on logits at inference time (Gafni et al., 2022) by randomly dropping left-context conditioning during training, as previously described in Section 2.3.3. This modeling approach with discrete targets is dubbed Quant-LCM-d in the following sections. The improved SONAR decoder for quantized representations is used to bridge the compression gap coming from SONAR quantization in the following ablation studies when using Quant-LCM-d.

Continuous targets. We also explored a modeling approach that predicts continuous target SONAR vectors based on left-context sentences and the intermediate quantized representation of the target vector, minimizing the Mean Squared Error between prediction and target embeddings. At inference time, we can either iteratively add the closest centroid embedding based on the predicted residual $\hat{\mathbf{r}}$ or sample a centroid $\mathbf{c}_i$ from the following distribution:

$$p(\mathbf{c}_i \mid \hat{\mathbf{r}}) = \frac{e^{-\beta \cdot \|\mathbf{c}_i - \hat{\mathbf{r}}\|_2}}{\sum_k e^{-\beta \cdot \|\mathbf{c}_k - \hat{\mathbf{r}}\|_2}}, \tag{23}$$

where $\beta$ is a temperature hyper-parameter. This modeling approach with continuous targets is denoted with Quant-LCM-c in the following sections.

2.4 Ablations

In this section, we delineate the ablation experiments conducted to evaluate the aforementioned LCM designs. We compare all the variants of LCMs introduced above, namely, Base-LCM, One-Tower, Two-Tower and Quant-LCM.

2.4.1 Experimental setup

For our ablation study and for the sake of reproducibility, we pre-train our models on the Fineweb-edu dataset (Lozhkov et al., 2024). All models are configured to have approximately 1.6B trainable parameters and are pre-trained on Meta's Research Super Cluster (RSC, Lee and Sengupta (2022)) for 250k optimization steps spanning 32 A100 GPUs with a total batch size of 229k concepts.
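The centroid sampling rule of Equation (23) amounts to a softmax over negative scaled distances. The sketch below is one way to compute it in plain Python; it reads the norm in Equation (23) as the Euclidean distance, and the function names and the max-subtraction for numerical stability are choices made for this illustration rather than details taken from the paper.

```python
import math
import random


def centroid_probabilities(r_hat, centroids, beta):
    """p(c_i | r_hat) proportional to exp(-beta * ||c_i - r_hat||_2)."""
    logits = [-beta * math.dist(c, r_hat) for c in centroids]
    top = max(logits)  # subtracting the max leaves the softmax unchanged
    weights = [math.exp(logit - top) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_centroid(r_hat, centroids, beta, rng=random):
    """Draw a centroid index from the distribution above."""
    probs = centroid_probabilities(r_hat, centroids, beta)
    return rng.choices(range(len(centroids)), weights=probs, k=1)[0]


centroids = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
r_hat = (0.9, 0.0)
probs = centroid_probabilities(r_hat, centroids, beta=1.0)
```

With $\beta = 0$ the distribution is uniform, and as $\beta$ grows it concentrates on the closest centroid, which recovers the deterministic "add the closest centroid" option.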
Model architectures. The Base-LCM has 32 layers and a model dimension $d_{\text{model}} = 2048$ with 16 attention heads. It uses rotary position embeddings (RoPE, Su et al. (2024)), applies pre-normalization using RMSNorm (Zhang and Sennrich, 2019), uses the SwiGLU activation function (Shazeer, 2020) and is trained with a dropout rate of $p = 0.1$.

The One-Tower diffusion LCM is made of 32 transformer blocks, each made of a self-attention layer with 32 attention heads and followed by a feed-forward neural network with inner size 8192. It has a dimension $d_{\text{model}}$ of 2048 and uses learned position embeddings. The noise scheduler is set with $T = 100$ diffusion timesteps. During training, self-attention is dropped with a probability of 0.15 for unconditional training, enabling classifier-free guidance at inference time.

The Two-Tower diffusion LCM has 5 layers in its contextualizer and 13 layers in its denoiser. Similar to the Base-LCM, it has 16 attention heads, a model dimension $d_{\text{model}} = 2048$, and uses SwiGLU activations and RMSNorm in both contextualizer and denoiser. The contextualizer uses RoPE for embedding positions whereas the denoiser is without positional embeddings. We use by default the cosine noise schedule with $T = 100$ and train with a dropout rate of $p = 0.1$. For training the model unconditionally we use a cross-attention mask dropout of rate 0.15 (see Section 2.3.4). The pre-training documents are wrapped at 128 sentences. Unless otherwise mentioned, we decode with $S = 40$ sample steps with a guidance scale $g_{\text{scale}} = 3$, a guidance rescaling factor of $g_{\text{rescale}} = 0.7$, an initial noise scale $\sigma_{\text{init}} = 0.6$ and epsilon-scaling with $\lambda_{\text{eps}} = 1.00045$.

The Quant-LCM follows exactly the same architecture as the One-Tower diffusion LCM, except for Quant-LCM-d, which differs only by its output dimension, set to $n_{\text{units-per-codebook}} = 8192$ for softmax computation.
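For the default decoding configuration ($T = 100$ training timesteps, $S = 40$ sample steps), the trailing step selection of Section 2.3.2, $\mathrm{round}(\mathrm{flip}(\mathrm{arange}(T, 0, -T/S)))$, can be evaluated as in the sketch below. The function name `trailing_steps` is hypothetical, and the rounding of exact halves follows Python's built-in `round` (round-half-to-even), which is an assumption about a detail the text does not spell out.

```python
def trailing_steps(T, S):
    """Sample steps round(flip(arange(T, 0, -T/S))), in increasing order."""
    descending = [T - i * T / S for i in range(S)]  # T, T - T/S, ..., T/S
    return [round(v) for v in reversed(descending)]  # halves round to even


steps = trailing_steps(100, 40)
```

The selected steps are spaced by $T/S = 2.5$ and always include the final timestep $T$, so sampling starts from the noisiest state seen in training.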
For single sentence prediction tasks, we use top-k = 1 and $g_{\text{scale}} = 2$ for Quant-LCM-d and top-k = 1 with $g_{\text{scale}} = 3$ for Quant-LCM-c, while for multi-sentence generation tasks we used a temperature of 1, top-k = 3, $g_{\text{scale}} = 1$ for Quant-LCM-d and a temperature of 0.005, top-k = 5, $g_{\text{scale}} = 1.5$ for Quant-LCM-c, as higher guidance or lower temperature setups led to repeated generated sentences.

| Dataset | #Docs | #Llama2 tokens Q1 | Q2 | Q3 | #Sentences Q1 | Q2 | Q3 | Total sentences |
|---|---|---|---|---|---|---|---|---|
| ROC-stories (dev) | 2000 | 48 | 57 | 64 | 5 | 5 | 5 | 10K |
| ROC-stories (test) | 1871 | 50 | 56 | 62 | 5 | 5 | 5 | 9.4K |
| C4 (dev) | 1000 | 136 | 288 | 577 | 6 | 12 | 24 | 20.6K |
| C4 (test) | 1000 | 133 | 282 | 599 | 6 | 11 | 25 | 21.9K |
| Wikipedia-en (dev) | 1000 | 146 | 332 | 736 | 5 | 10 | 23 | 21.1K |
| Wikipedia-en (test) | 1000 | 147 | 312 | 673 | 5 | 9 | 21 | 19.1K |
| Gutenberg (dev) | 55 | 10297 | 15934 | 22259 | 328 | 530 | 687 | 216.2K |
| Gutenberg (test) | 61 | 10752 | 15204 | 23414 | 325 | 457 | 735 | 562.2K |

Table 2 - Statistics of the pre-training evaluation datasets. For each subset we report the number of documents, the total number of sentences and document length quartiles in sentences and in Llama2 tokens for reference.

Pre-training evaluation. Pre-trained token-level language models are typically evaluated with perplexity: a measure of how well each next token is predicted given a teacher-forced (i.e., ground truth) prefix of the document. In a similar spirit, we evaluate pre-trained LCMs in a teacher-forced mode. But as they cannot produce the probability explicitly, we resort to a custom set of metrics of the quality of next sentence prediction.

Each pre-trained model is initially evaluated on the quality of its predicted next sentence $\hat{\mathbf{x}}_n$ given a ground truth context $\mathbf{x}_{<n}$. Practically, for a given document $\mathbf{x}_{1:N}$, we run the LCM inference in teacher-forcing mode and evaluate the following metrics:

• L2 distance ($\ell_2$). Euclidean distance in the SONAR space between the predicted embedding $\hat{\mathbf{x}}_n$ and the ground truth continuation $\mathbf{x}_n$: $\ell_2 := \|\hat{\mathbf{x}}_n - \mathbf{x}_n\|_2$.
• Round-trip L2 distance ($\ell_{2\text{-r}}$).
Euclidean distance in the SONAR space between the re-encoded sentence generated from the predicted embedding and the ground truth continuation $\mathbf{x}_n$, $\ell_{2\text{-r}} := \|\mathrm{encode}(\mathrm{decode}(\hat{\mathbf{x}}_n)) - \mathbf{x}_n\|_2$. Since an LCM can predict an embedding outside of the distribution of real embeddings (obtained by encoding natural sentences), the SONAR decoder might shift these embeddings to the nearest plausible embeddings subspace. The $\ell_{2\text{-r}}$ metric is introduced to capture the shift in embeddings after decoding them into text and then re-embedding them in the SONAR space. The more the generated embeddings are out-of-distribution, the higher the delta between $\ell_{2\text{-r}}$ and $\ell_2$ would be.
• Contrastive accuracy (CA). The ratio of embeddings in a batch that are further away (in terms of $\ell_2$) from the predicted embedding $\hat{\mathbf{x}}_n$ than the ground truth $\mathbf{x}_n$ (for each $n$, we exclude $\mathbf{x}_n$ and its two neighboring ground truth embeddings from the comparison). This metric naturally assigns a higher penalty for large $\ell_2$ values in the regions with high density of sentence embeddings.
• Paraphrasing (PAR). The maximum cosine similarity (CS) between the generated embedding $\hat{\mathbf{x}}_n$ and the context embeddings $\mathbf{x}_{<n}$, normalized by the score of the ground truth sentence. Thus, $\mathrm{PAR} = \max_{m<n} \mathrm{CS}(\hat{\mathbf{x}}_n, \mathbf{x}_m) / \max_{m<n} \mathrm{CS}(\mathbf{x}_n, \mathbf{x}_m)$. The goal of this metric is to capture whether the model is simply copying or paraphrasing a sentence from the context more ($> 1$) or less ($< 1$) than the reference sentence.
• Mutual information (MI). This metric of text coherence evaluates the mutual information between the next predicted sentence $\hat{s}_n = \mathrm{decode}(\hat{\mathbf{x}}_n)$ and the previous $k = 10$ ground truth sentences by computing the difference between the unconditional perplexity of $\hat{s}_n$ and its perplexity conditioned on the prompt: $\mathrm{MI} = \frac{1}{|\hat{s}_n|}\big(\log p_{\mathrm{LM}}(\hat{s}_n) - \log p_{\mathrm{LM}}(\hat{s}_n \mid s_{n-k:n-1})\big)$. We estimate the perplexity with a small language model, GPT-2 (Radford et al., 2019).
We prepend a newline symbol to $\hat{s}_n$, so that a probability could be assigned to its first token, and we compute the average mutual information per token by normalizing it with $|\hat{s}_n|$, the length of $\hat{s}_n$ in tokens. When averaging MI over a dataset, $|\hat{s}_n|$ are used as weights.

Pre-training evaluation data. The pre-training evaluation is performed on sampled subsets from four corpora covering different domains: ROC-stories (Mostafazadeh et al., 2016), C4 (Raffel et al., 2019), Wikipedia-en (English Wikipedia dump) and Gutenberg. We sample two distinct subsets (dev and test) from each corpus; we use the dev split for tuning inference hyper-parameters and report the results on the test splits. The statistics of the evaluation corpora are presented in Table 2.

The results of the pre-training evaluation are presented in Table 3. First, diffusion-based LCM and Quant-LCM variants have similar $\ell_2$ and $\ell_{2\text{-r}}$ scores despite an important difference in their learning objectives. The only model that shows a substantially lower $\ell_2$ score is the Base-LCM. This is expected since Base-LCM effectively optimizes the $\ell_2$ score during training. Yet, its $\ell_{2\text{-r}}$ score is not improved compared to other models.
This could be explained by the fact that when many plausible next sentence continuations are possible, Base-LCM generates their average in SONAR space (instead of sampling one of the plausible modes), which may not correspond to any relevant point in SONAR space. This hypothesis is also highlighted by the poor Base-LCM performance in terms of CA and MI scores.

We do not notice any significant difference in CA scores between diffusion LCMs and Quant-LCM variants. MI scores, on the contrary, are consistently higher for diffusion-based models compared to Quant-LCM. At the same time, diffusion LCMs tend to paraphrase the context more in the generated embeddings, which also correlates with an increased MI score. Still, Quant-LCM variants significantly outperform Base-LCM on the MI metric.

| Model | ROC-stories $\ell_2$ | $\ell_{2\text{-r}}$ | PAR | CA | MI | C4 $\ell_2$ | $\ell_{2\text{-r}}$ | PAR | CA | MI |
|---|---|---|---|---|---|---|---|---|---|---|
| Base-LCM | 0.177 | 0.237 | 1.847 | 72.4% | 0.062 | 0.204 | 0.261 | 1.964 | 69.1% | -0.105 |
| One-Tower | 0.236 | 0.236 | 1.939 | 80.2% | 0.977 | 0.279 | 0.273 | 2.239 | 77.1% | 1.110 |
| Two-Tower | 0.233 | 0.231 | 2.088 | 80.6% | 1.137 | 0.265 | 0.261 | 2.265 | 75.4% | 1.134 |
| Quant-LCM-c | 0.236 | 0.237 | 1.683 | 76.0% | 0.610 | 0.279 | 0.283 | 2.013 | 77.2% | 0.715 |
| Quant-LCM-d | 0.240 | 0.246 | 1.871 | 81.1% | 0.682 | 0.270 | 0.282 | 1.808 | 75.0% | 0.359 |

| Model | Wikipedia-en $\ell_2$ | $\ell_{2\text{-r}}$ | PAR | CA | MI | Gutenberg $\ell_2$ | $\ell_{2\text{-r}}$ | PAR | CA | MI |
|---|---|---|---|---|---|---|---|---|---|---|
| Base-LCM | 0.229 | 0.283 | 1.770 | 69.6% | 0.071 | 0.207 | 0.264 | 1.780 | 67.8% | -0.184 |
| One-Tower | 0.324 | 0.311 | 2.087 | 80.9% | 1.202 | 0.284 | 0.281 | 2.051 | 75.1% | 0.725 |
| Two-Tower | 0.307 | 0.297 | 2.079 | 78.8% | 1.307 | 0.267 | 0.267 | 2.077 | 73.0% | 0.684 |
| Quant-LCM-c | 0.306 | 0.317 | 1.842 | 79.5% | 0.744 | 0.269 | 0.281 | 1.774 | 72.1% | 0.419 |
| Quant-LCM-d | 0.295 | 0.311 | 1.592 | 76.0% | 0.323 | 0.276 | 0.290 | 1.599 | 72.0% | 0.153 |

Table 3 - Comparing architectures. Pre-training evaluation results on the four selected corpora. For each subset, we report $\ell_2$ (L2 distance in SONAR space), $\ell_{2\text{-r}}$ (round-trip L2 distance after decoding and re-encoding the generated embeddings), PAR (similarity to preceding embeddings) and CA (contrastive accuracy).
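For reference, the embedding-space metrics reported in Table 3 ($\ell_2$, CA and PAR) follow directly from their definitions in the pre-training evaluation paragraph. The sketch below computes them on plain Python sequences; the function names are hypothetical, and `others` is assumed to already exclude the ground truth $\mathbf{x}_n$ and its two neighbors, as the CA definition requires. The round-trip distance and MI are omitted because they need the SONAR encoder and decoder and a language model.

```python
import math


def l2(a, b):
    """Euclidean distance between two embeddings."""
    return math.dist(a, b)


def cosine(a, b):
    """Cosine similarity CS(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def contrastive_accuracy(pred, target, others):
    """Share of `others` lying further from pred than the ground truth does."""
    ref = l2(pred, target)
    return sum(l2(pred, o) > ref for o in others) / len(others)


def paraphrasing(pred, target, context):
    """PAR = max_m CS(pred, x_m) / max_m CS(target, x_m) over the context."""
    best_pred = max(cosine(pred, c) for c in context)
    best_target = max(cosine(target, c) for c in context)
    return best_pred / best_target


ca = contrastive_accuracy((0, 0), (1, 0), [(2, 0), (0.5, 0), (0, 3), (-4, 0)])
par = paraphrasing((1, 0), (0, 1), [(1, 0), (1, 1)])
```

In the toy call, three of the four other embeddings are further from the prediction than the ground truth, and the prediction copies a context vector exactly, so PAR exceeds 1.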
Now comparing the different variants, Quant-LCM-c outperforms the Quant-LCM-d modeling variant: one hypothesis is that predicting codebook indices with a cross-entropy loss is harder than the MSE objective, where Quant-LCM-c can more easily learn a combination of left-context vectors for the next sentence embedding.

For diffusion LCMs, we do not observe any consistent difference between One-Tower and Two-Tower when looking across all metrics and datasets. Note that overall, to tackle the next sentence prediction task in the SONAR space, diffusion-based methods give clearly better results compared to all other models.

Instruction-tuning evaluation. Subsequently, the pre-trained models are instruction-tuned on the stories subset of Cosmopedia (Ben Allal et al., 2024) and are evaluated on a held-out subset of Cosmopedia itself. We aim with this finetuning to evaluate the ability of the models to follow instructions and generate consistent stories.

For the sake of comparison, we trained a small Llama (Touvron et al., 2023) on the same training data (Fineweb-edu) and finetuned it on Cosmopedia. This model has 24 transformer layers, each with 16 attention heads and a model dimension of 2048, for a total of 1.4B parameters. This model will be referred to as smaLlama.

We evaluate the following metrics:

• ROUGE-L (R-L). ROUGE-L (F-measure) (Lin, 2004) between the generated and reference stories.
• Coherence (Coherence). This reference-free metric is computed with a bidirectional transformer model fine-tuned by Jwalapuram et al. (2022) to assign higher scores to positive "natural" documents than to negative examples with permuted sentences. For reporting, we normalize it with a sigmoid (with a temperature of 3.0, empirically set to make the scores of "certainly incoherent" documents close to 0 and those of "certainly coherent" documents close to 1).
| Model | R-L ↑ | Coherence ↑ |
|---|---|---|
| Base-LCM | 23.69 | 0.482 |
| One-Tower | 33.40 | 0.968 |
| Two-Tower | 33.64 | 0.938 |
| Quant-LCM-c | 30.87 | 0.847 |
| Quant-LCM-d | 28.01 | 0.704 |
| smaLlama | 34.88 | 0.984 |

Table 4 - Comparing architectures. Instruction-tuning evaluation results. For each model we score the generated stories on the held-out test prompts and report R-L (ROUGE-L) scores.

The scores in terms of R-L and Coherence of the finetuning evaluation are presented in Table 4. Those quantitative results are in line with the pre-training evaluation ones. Both R-L and Coherence scores correlate with the model ordering based on MI scores: namely, Quant-LCM is outperformed by diffusion-based models, and both outperform Base-LCM by a large margin.

We also note that smaLlama outperforms the LCMs on this downstream task on both metrics. It is well known that LLMs produce very fluent outputs, which explains the higher ROUGE-L score. We also note that One-Tower and Two-Tower produce coherent outputs, on par with the smaLlama outputs.

2.4.2 Importance of the diffusion inference hyper-parameters

In this section we study the effect of different inference hyper-parameters on the quality of the generated text. To this end, we generate outputs for the C4 test split with the Two-Tower LCM model above, while varying the following hyper-parameters: the guidance scale $g_{\text{scale}}$, the initial noise scale $\sigma_{\text{init}}$, and the number of inference sample steps $S$. We score the generations following the same protocol above and report the results in Figure 10. We note that as we increase the guidance scale, the mutual information between the prefix and the generated suffix increases, and so does paraphrasing, as we are paying more attention to the conditioning context variables. In the opposite direction of the mutual information, the $\ell_2$ distance from the ground truth continuation increases as the model prefers to stay close to the prefix. Regarding the initial noise scale, we observe that values between 0.5 and 0.7 achieve the best MI score.
In particular, the generated individual sentences are usually longer with a higher $\sigma_{\text{init}}$. The $\ell_2$ distance, on the other hand, does not reflect this trend. Lastly, we increase the number of inference steps and measure the mutual information (MI) of the generated texts. With more inference steps, we can improve the prefix-suffix mutual information, but there are diminishing returns to increasing the inference cost, with little qualitative improvement.

[Figure 10: three panels plotting mutual information (MI), with $\ell_2$ distance on a secondary axis in the first two, against the guidance scale, the initial noise scale $\sigma_{\text{init}}$, and the number of inference sample steps $S$ (the latter for $\lambda_{\text{eps}} \in \{1, 1.00045, 1.0025\}$).]

Figure 10 - Importance of inference hyper-parameters. The first panel shows the quality of the generated output measured with MI and $\ell_2$ as we vary the guidance scale $g_{\text{scale}}$ with fixed $\sigma_{\text{init}} = 0.6$ and $S = 40$. The second panel varies the initial noise scale $\sigma_{\text{init}}$ with fixed guidance $g_{\text{scale}} = 3$ and $S = 40$. The third panel varies the inference steps $S$ while holding the guidance scale $g_{\text{scale}} = 1.5$ and $\sigma_{\text{init}} = 0.6$ fixed. We consider 3 values for $\lambda_{\text{eps}}$ to see the impact of epsilon-scaling in the regime of large inference steps.

2.4.3 Studying the noise schedules

In this section, we compare different Two-Tower diffusion LCMs trained with different noise schedules, namely:

Cosine. Our default cosine noise schedule.
Quadratic. The first quadratic schedule (Quadratic-1) has $\beta_0 = 0.001$ and $\beta_T = 0.0012$, whereas the second quadratic schedule (Quadratic-2) has $\beta_0 = 0.02$ and $\beta_T = 0.022$.
Sigmoid. Four different sigmoid schedules with $(\delta, \gamma) \in \{(1.5, -1), (1.5, -2), (0.8, -1), (3.5, 0)\}$.

All schedules are set with the default $T = 100$. For the exact description of each noise schedule refer to Section 2.3.2 and Figure 5.
We selected the quadratic schedule as a commonly used schedule for reference with two configurations, Quadratic-1 closer to the cosine schedule and Quadratic-2 with more weight given to lower log-SNR. The selected sigmoid schedules with $\delta = 1.5$ are configured with $\gamma = -1$ and $\gamma = -2$, $\gamma = -2$ being slightly shifted to the left on the log-SNR distribution, i.e., more weight to lower log-SNR regimes. We then change the $\delta$ parameter of the sigmoid schedule to choose ($\delta = 0.8$, $\gamma = -1$) for a peaked distribution of log-SNR around -1 and ($\delta = 3.5$, $\gamma = 0$) for a flat distribution over noise levels.

We follow the experimental setup described in Section 2.4.1 and report the results of the pre-training evaluation in Table 5. Both quadratic schedules achieve a better MI score while the wide sigmoid schedule $(\delta, \gamma) = (3.5, 0)$ achieves the highest accuracy CA on both C4 and Wikipedia-en. To further understand the differences between those schedules we re-evaluate on C4 while varying the guidance scale $g_{\text{scale}}$. The results in Figure 11 confirm that the wide sigmoid schedule ($\delta = 3.5$, $\gamma = 0$) has a much higher contrastive accuracy across the board (rightmost panel) while closely matching the cosine schedule in terms of mutual information (MI). This schedule, being trained on a wider spectrum of log-SNR, learns to contrast more than to regress. Contrast this behavior with the peaked sigmoid schedule ($\delta = 0.8$, $\gamma = -1$), where the model focuses on regressing to the target (lower $\ell_2$), akin to a Base-LCM, resulting in a lower contrastive accuracy.
| Model | C4 $\ell_2$ | $\ell_{2\text{-r}}$ | PAR | CA | MI | Wikipedia-en $\ell_2$ | $\ell_{2\text{-r}}$ | PAR | CA | MI |
|---|---|---|---|---|---|---|---|---|---|---|
| Cosine | 0.265 | 0.261 | 2.265 | 75.4% | 1.134 | 0.307 | 0.297 | 2.079 | 78.8% | 1.307 |
| Quadratic-1 | 0.268 | 0.264 | 2.341 | 75.7% | 1.252 | 0.309 | 0.300 | 2.202 | 79.1% | 1.409 |
| Quadratic-2 | 0.270 | 0.265 | 2.320 | 76.2% | 1.252 | 0.312 | 0.303 | 2.185 | 79.7% | 1.399 |
| Sigmoid(1.5, -1) | 0.257 | 0.259 | 2.226 | 74% | 1.083 | 0.298 | 0.292 | 2.110 | 77% | 1.271 |
| Sigmoid(1.5, -2) | 0.277 | 0.267 | 2.291 | 77.2% | 1.173 | 0.321 | 0.303 | 2.179 | 80.3% | 1.308 |
| Sigmoid(0.8, -1) | 0.252 | 0.255 | 2.053 | 70.6% | 0.936 | 0.285 | 0.283 | 1.883 | 71.7% | 1.127 |
| Sigmoid(3.5, 0) | 0.307 | 0.265 | 2.347 | 80.3% | 1.154 | 0.347 | 0.303 | 2.187 | 83.7% | 1.288 |

Table 5 - Comparing noise schedules. Results of the pre-training evaluation on two corpora, C4 and Wikipedia-en.

[Figure 11: three panels plotting mutual information (MI), L2 distance ($\ell_2$) and contrastive accuracy (CA) against the guidance scale for the Cosine, Quadratic-1, Quadratic-2, Sigmoid(3.5) and Sigmoid(0.8) schedules.]

Figure 11 - Comparing noise schedules. The prefix-suffix mutual information (MI), the $\ell_2$ distance and contrastive accuracy (CA) scores of evaluated C4 documents while varying the guidance scale ($g_{\text{scale}}$) under different schedules.

2.4.4 Studying the loss weighting strategies

In this section we compare the baseline Two-Tower diffusion LCM trained with the simplified objective (i.e., $\omega(t) = 1, \forall t$) to models trained with the clamped-SNR weighting strategy. We consider two sets of $(\lambda_{\min}, \lambda_{\max})$: $(\lambda_{\min}, \lambda_{\max}) = (0, 10)$ and $(\lambda_{\min}, \lambda_{\max}) = (0.001, 5)$. All models in this section are trained with the cosine noise schedule.

Fragility as a sample weighting strategy. As introduced in Equation (19), we train in this section a Two-Tower LCM model with loss terms weighted by the sample's fragility:
$$\omega(\mathbf{x}_0) = \mathrm{sigmoid}\big(a\,\mathrm{fragility}(\mathbf{x}_0) + b\big). \tag{24}$$

Given that estimating the fragility score of each sample as defined in Section 2.5.2 can be very costly, we resort to training a simple MLP (3 layers) on 50M sampled sentences to approximate these fragility scores. This model is referred to as $F$. Henceforth, the sample weight is:

$$\omega(\mathbf{x}_0) = \mathrm{sigmoid}\big(a\,F(\mathbf{x}) + b\big). \tag{25}$$

The two hyper-parameters $(a, b)$ are chosen so that extremely fragile sentences contribute less to the loss ($\omega(\mathbf{x}_0) \approx 0$), and so that sample weights increase smoothly as sample robustness improves. Figure 12 plots the sample weights distribution evaluated on a pre-training dataset with $a = -4$ and $b = 3.5$.

[Figure 12: density (y-axis) of the sample weight (x-axis, from 0 to 1).]

Figure 12 - Resulting distribution of fragility sample weights $\omega(\mathbf{x}_0)$ with $\omega(\mathbf{x}_0) = \mathrm{sigmoid}(-4\,F(\mathbf{x}) + 3.5)$.

We follow the experimental setup described in Section 2.4.1 and report the results of the pre-training evaluation in Table 6. We observe that weighting with clamped SNRs does not improve the quality of generated texts as measured with our pre-training evaluation metrics. When it comes to the fragility-aware weighting strategy, we observe an improvement in the contrastive accuracy of the model. In the remainder of this work we default to the simplified training objective ($\omega(t) = 1, \forall t$).

| Model | C4 $\ell_2$ | $\ell_{2\text{-r}}$ | PAR | CA | MI | Wikipedia-en $\ell_2$ | $\ell_{2\text{-r}}$ | PAR | CA | MI |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline $\omega(t) = 1$ | 0.265 | 0.261 | 2.265 | 75.4% | 1.134 | 0.307 | 0.297 | 2.079 | 78.8% | 1.307 |
| SNR (0, 10) | 0.280 | 0.264 | 2.334 | 74.8% | 1.107 | 0.320 | 0.296 | 2.102 | 77.9% | 1.212 |
| SNR (0.001, 5) | 0.266 | 0.261 | 2.269 | 73.4% | 1.094 | 0.304 | 0.291 | 2.007 | 76.6% | 1.295 |
| Fragility | 0.2815 | 0.273 | 2.306 | 76.5% | 1.103 | 0.321 | 0.308 | 2.118 | 80.7% | 1.193 |

Table 6 - Comparing weighting strategies. Results of the pre-training evaluation on two corpora, C4 and Wikipedia-en.
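The two weighting schemes compared in this section reduce to a few lines each. The sketch below evaluates the clamped-SNR weight of Equation (17) and the fragility-based sample weight of Equation (25); the function names and the scalar interface are assumptions made for illustration, and the defaults $a = -4$, $b = 3.5$ are the values quoted for Figure 12.

```python
import math


def clamped_snr_weight(alpha_t, sigma_t, lam_min, lam_max):
    """Equation (17): clamp SNR = exp(lambda_t) = alpha_t^2 / sigma_t^2."""
    snr = alpha_t ** 2 / sigma_t ** 2
    return max(min(snr, lam_max), lam_min)


def fragility_weight(fragility_score, a=-4.0, b=3.5):
    """Equation (25): sigmoid(a * F(x) + b); a < 0 so fragile samples weigh less."""
    return 1.0 / (1.0 + math.exp(-(a * fragility_score + b)))


balanced = clamped_snr_weight(math.sqrt(0.5), math.sqrt(0.5), 0.0, 10.0)  # SNR = 1
low_noise = clamped_snr_weight(0.999, 0.01, 0.0, 10.0)                    # clamped above
high_noise = clamped_snr_weight(0.001, 1.0, 0.001, 5.0)                   # clamped below
```

With these defaults the sample weight crosses 0.5 at a predicted fragility of 0.875 and decreases monotonically as the predicted fragility grows.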
2.5 Analysis

2.5.1 Inference efficiency of LCMs

We compare in this section the inference computational cost of the Two-Tower LCM to that of a vanilla LLM as a function of the total length in tokens of the prompt and output combined. We chose the theoretical number of FLOPs, independent of any specific optimization. These optimizations are generic to the transformer architecture and also apply to our LCM. We include in this comparison two configurations of LCMs: the 1.6B used in the previous ablation studies and a 7B model we scale up to in the following sections. For both LCMs, we estimate the inference cost with inference sample steps S = 40.

Given the quadratic complexity of the attention mechanism in transformers, the complexity sharply increases with the context size (see upper right corner of Figure 13’s left panel). The complexity of the LCM depends on how the context is sentencized: a context length of 200 tokens split into 10 sentences (20 tokens each) will incur a higher cost than the same 200 tokens split into 5 sentences (40 tokens each). We account for this by computing the cost on a range of sentence lengths but report the total context size on the x-axis (context size = sentence length × number of sentences). The LCM shows substantially better scalability with respect to increasing context size.

Figure 13 - Theoretical inference FLOPs of LCMs and LLMs. We evaluate the inference FLOPs for different text lengths (in Llama2 tokens) with a variable average sentence length. Only extremely short sentences (≤ 10 tokens) favor LLMs. [Two panels; x-axis: total context size in tokens; y-axis: inference FLOPs; curves: Llama2-7b, Two-Tower LCM (7B), Two-Tower LCM (1.6B).]
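To make the effect of sentencization concrete, here is a toy count of causal self-attention pairs. It is only a proxy for the quadratic attention term and is not the FLOP accounting behind Figure 13: it ignores the SONAR encoder and decoder, the feed-forward layers and the S = 40 sampling steps. The function names are made up for this illustration.

```python
def causal_attention_pairs(sequence_length: int) -> int:
    """Number of (query, key) pairs under a causal mask: n * (n + 1) / 2."""
    return sequence_length * (sequence_length + 1) // 2

def lcm_positions(total_tokens: int, tokens_per_sentence: int) -> int:
    """Sequence length seen by a sentence-level model: one position per sentence."""
    # Ceiling division so that a trailing partial sentence still counts.
    return -(-total_tokens // tokens_per_sentence)

total_tokens = 200
token_level = causal_attention_pairs(total_tokens)
ten_sentences = causal_attention_pairs(lcm_positions(total_tokens, 20))
five_sentences = causal_attention_pairs(lcm_positions(total_tokens, 40))

print(token_level, ten_sentences, five_sentences)  # 20100 55 15
```

Under this simplification, the same 200 tokens yield more sentence-level positions, and hence more attention pairs, when split into 10 sentences than into 5, and far fewer than the token-level sequence in either case.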
The inference computational cost of the LCM includes the three steps of (1) encoding into SONAR, (2) LCM prediction in the sentence space, then (3) decoding with a SONAR decoder. The inference cost of LCMs varies significantly depending on the average length in tokens per sentence. For extremely short sentences (less than 10 tokens), an LLM is more computationally efficient (see lower left corner of Figure 13’s right panel).

2.5.2 Fragility of SONAR space

When we perform modeling in a latent space, we primarily rely on the induced geometry (L2-distance). However, the homogeneous Euclidean geometry of any latent representation will not perfectly match the underlying text semantics. This is evidenced by the fact that a small perturbation in the embedding space may result in a drastic loss of semantic information after decoding. We dub such embeddings “fragile”. For this reason, we aim to quantify the fragility of semantic embeddings (namely SONAR codes) to understand the quality of the LCM training data and how this fragility can hinder the LCM training dynamics. Given a text fragment w and its SONAR code x = encode(w), we define the fragility of w as:

    \mathrm{fragility}(w) := -\mathbb{E}_{\alpha \sim U([0,1]),\ \epsilon \sim \mathcal{N}(0, I)} \left[ \mathrm{score}(w, w_{\alpha,\epsilon}) \right],    (26)
    x_{\alpha,\epsilon} = \mathrm{denormalize}\left( \sqrt{1-\alpha}\, \mathrm{normalize}(x) + \sqrt{\alpha}\, \epsilon \right),    (27)
    w_{\alpha,\epsilon} = \mathrm{decode}(x_{\alpha,\epsilon}),    (28)

where normalize and denormalize are the normalization and denormalization operators introduced in Equation (4) with the goal of making x’s coordinates scale-independent. The “encode” operation maps text fragments into SONAR space, and the “decode” operation produces a text fragment from a given vector in the SONAR space. For each α in [0, 1], x_{α,ϵ} is the perturbed version of x where a noise vector of variance α is linearly combined with x. The perturbed vector is then decoded into a text fragment w_{α,ϵ}. This perturbation is similar to the variance-preserving noising used in diffusion LCMs (see Section 2.3.2).
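The perturbation step of Equation (27) can be sketched in a few lines. This is a minimal illustration under stated assumptions: Equation (4) is not reproduced in this section, so `normalize` and `denormalize` are assumed to be a per-coordinate standardization with placeholder statistics, and the vector is a short toy list rather than a real SONAR embedding.

```python
import math
import random

def normalize(x, mean, std):
    """Assumed form of the normalization: per-coordinate standardization."""
    return [(xi - m) / s for xi, m, s in zip(x, mean, std)]

def denormalize(z, mean, std):
    """Inverse of `normalize` above."""
    return [m + s * zi for zi, m, s in zip(z, mean, std)]

def perturb(x, alpha, mean, std, rng):
    """Mix the normalized vector with Gaussian noise of variance alpha."""
    z = normalize(x, mean, std)
    noisy = [
        math.sqrt(1.0 - alpha) * zi + math.sqrt(alpha) * rng.gauss(0.0, 1.0)
        for zi in z
    ]
    return denormalize(noisy, mean, std)

rng = random.Random(0)
dim = 8                                  # toy dimension for illustration
mean, std = [0.5] * dim, [2.0] * dim     # placeholder statistics
x = [rng.gauss(0.5, 2.0) for _ in range(dim)]

unchanged = perturb(x, 0.0, mean, std, rng)   # alpha = 0 returns x (up to rounding)
pure_noise = perturb(x, 1.0, mean, std, rng)  # alpha = 1 discards x entirely
```

The weights sqrt(1 − α) and sqrt(α) keep the variance of a unit-variance normalized vector unchanged, which is the sense in which the noising is variance-preserving.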
The “score” operator in Equation (26) is set to be a semantic similarity metric comparing the perturbed text w_{α,ϵ} to the original w. We considered the following options:

• Auto-Encoding BLEU. score(w, w_{α,ϵ}) = BLEU(w, w_{α,ϵ}).
• External cosine similarity. Provided an external text encoder (read: unrelated to SONAR) that encodes a text fragment w into encode_ext(w), score(w, w_{α,ϵ}) = CS(encode_ext(w), encode_ext(w_{α,ϵ})), where CS is the cosine similarity measure. Compared to Auto-Encoding BLEU, this method is typically more robust to paraphrasing.

Finetuned robust decoder. To serve as a testbed for our fragility analysis, a new SONAR decoder for English text is finetuned on a sample of our pre-training data. In order to improve the decoder’s robustness to imperfectly generated embeddings from the LCM, we follow Equation (27) and add random noise vectors to SONAR embeddings during training. As reported in Table 7, the finetuned SONAR decoder exhibits stronger performance across a range of public corpora.

Model                     Flores  CNN DailyMail  Gutenberg  C4
Base SONAR decoder        79.5    75.9           70.5       75.7
Finetuned SONAR decoder   88.0    87.6           85.6       87.5

Table 7 - Comparing SONAR decoders. Raw reconstruction performance of our base SONAR decoder vs. the new decoder trained with noised embeddings. Scores are Auto-Encoding BLEU on random subsets of 10k sentences from each dataset, except for Flores where we use the entire dev split.

Fragility study. We sample 50M random text fragments, and for each sample we generate 9 perturbations corresponding to different noise levels α ∈ [0.1, 0.2, ..., 0.9]. For the external cosine similarity metric we use mGTE as external encoder (Zhang et al., 2024).
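The study setup above amounts to averaging a similarity score over the nine noise levels and negating it. The sketch below shows that loop with the external-cosine-similarity score. Everything prefixed with `toy_` is a stand-in invented so that the example runs on its own: it is not SONAR, not mGTE, and not the paper's decoder.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def estimate_fragility(text, perturb_and_decode, score, alphas):
    """Negative mean score over the noise levels, one perturbation per level."""
    scores = [score(text, perturb_and_decode(text, alpha)) for alpha in alphas]
    return -sum(scores) / len(scores)

ALPHAS = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9

def toy_perturb_and_decode(text, alpha):
    # Stand-in for encode -> perturb -> decode: drop more words as alpha grows.
    words = text.split()
    keep = max(1, round(len(words) * (1.0 - alpha)))
    return " ".join(words[:keep])

def toy_encode_ext(text):
    # Stand-in for an external encoder: bag-of-words counts over a tiny vocabulary.
    vocab = ["the", "cat", "sat", "on", "mat"]
    words = text.split()
    return [float(words.count(w)) for w in vocab]

def toy_score(original, perturbed):
    return cosine_similarity(toy_encode_ext(original), toy_encode_ext(perturbed))

fragility = estimate_fragility(
    "the cat sat on the mat", toy_perturb_and_decode, toy_score, ALPHAS
)
```

With a similarity score in [0, 1], the resulting fragility lies in [−1, 0], and values closer to 0 indicate a more fragile fragment.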
Figure 14 - Fragility scores. Auto-Encoding BLEU and external cosine similarity. In the left-hand panel as a function of the text length (α-averaged) and in the right-hand panel as a function of the noise variance α. [Curves in both panels: BLEU and CosSim, each for the Base and the Finetuned decoder; left x-axis: sentence length in characters; right x-axis: α.]

We depict in the right panel of Figure 14 the curves of both score functions with respect to the noise level α. We observe that BLEU scores decrease faster than the cosine similarity. Most importantly, fragility scores are sensitive to the choice of the decoder. In particular, both Auto-Encoding BLEU and cosine similarity scores decrease at a markedly slower rate for the Finetuned decoder than for the Base one as the amount of noise increases. We note also that the overall score distribution (after averaging over all α), shown in Figure 15, exhibits a large spread of fragility scores across SONAR samples.

Figure 15 - Auto-Encoding BLEU scores and cosine similarity (α-averaged) distributions. [Density plot; x-axis: score value; curves: BLEU and CosSim, each for the Base and the Finetuned decoder.]

One factor that can explain such a discrepancy is the text length. Compared to the Auto-Encoding BLEU metric (which drops only by 1–2% for long sentences), fragility is more sensitive to the length of sentences and drops faster for both similarity metrics. This shows that using a max sentence length over 250 can be extremely challenging for SONAR and the LCM model.
On the other hand, even if short sentences are on average more robust, splitting a long sentence in the wrong place may result in shorter but more fragile sub-sentences. Taking a closer look at the 5% most fragile embeddings, we notice that they are very noisy. Typically, they correspond to hyperlinks, references, unique ids, code-switched or numerical entries. These are likely artifacts that the SONAR models were not exposed to during training or where the SONAR tokenizer fails. Fragility can thus be used to filter out hard samples from the training data. We also observe that short but complex technical phrases can be more fragile than common language phrases of similar length.

3 Scaling the model to 7B

This section describes our effort to scale our model to 7B parameters and compare the performance against other approaches such as token-based LLMs on more challenging tasks such as summarization and summary expansion (detailed in Section 3.1).

Based on the results in Section 2.4 where the two diffusion-based LCMs (One-Tower and Two-Tower) outperform the other variants, we decided to scale a diffusion model to 7B parameters. We chose to scale Two-Tower given its smaller memory footprint, particularly when processing long contexts with a shallower contextualizer tower.

The large 7B Two-Tower diffusion LCM has 5 layers in its contextualizer and 14 layers in its denoiser. Its dimension has been extended to dmodel = 4096. Each self-attention layer has 32 attention heads. All other parameters are kept the same as for the 1.6B Two-Tower model. The model is pre-trained on a dataset of 2.3B documents, representing 2.7T tokens and 142.4B concepts/sentences. We pre-trained this model on Meta’s RSC for 124k optimization steps spanning 256 A100 GPUs with a total batch size of 1M concepts. We further extend the context length of this model to cover 2048 concepts instead of the 128 concepts in the ablation experiments.
We trained using the AdamW optimizer with (β1, β2) = (0.9, 0.95), ϵ = 1e-5 and a weight decay of 0.1. We use a cosine learning rate schedule, with warm-up of 10,000 steps up to LR = 3e-4. To improve training stability we clip gradients at a maximum norm of g = 10. We subsequently finetune the 7B Two-Tower LCM on publicly available instruction tuning datasets following Chung et al. (2024). Each sample consists of a prompt and an answer and we back-propagate on answer sentences only. Each answer sequence is suffixed with the phrase “End of response.” to teach the model when to stop generating. The finetuning data totals 389M sentences, of which 53M are answers (i.e., targets). For supervised finetuning, we use a cosine learning rate schedule with an initial rate of LR = 3e-5 and finetune the model for 7 epochs with a batch size of 262K sentences (prompts and answers combined). We will refer to the pre-trained model as Two-Tower-7B and the finetuned model as Two-Tower-7B-IT.

3.1 Evaluation Tasks and Data

This section describes the tasks on which we are evaluating and benchmarking our proposed model. We detail datasets, baselines and metrics. For each task, the dataset was processed with the same sentence splitter and SONAR encoder as used in the LCM training.

3.1.1 Metrics

As longform text generation is the main challenge for LCM, our benchmarking is mainly focused on generative tasks, which are notoriously more difficult to evaluate automatically. Therefore, we evaluate them with multiple automatic metrics, chosen to focus on complementary aspects of generation quality. All metrics used in this section are summarized in Table 8. For summarization and summary expansion (defined below), we report the traditional reference-based Rouge-L metric (Lin, 2004). As summarization models have a tendency to copy content from the source or from their own generated prefix, we report two additional word-based metrics.
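The two word-based metrics, OVL-3 (share of source word 3-grams found in the output) and REP-4 (share of duplicated word 4-grams in the output), can be sketched as below. This is one plausible reading rather than the paper's evaluation code: whitespace tokenization, counting distinct n-grams for the overlap, and reporting fractions rather than percentages are all assumptions, and the function names are made up.

```python
def word_ngrams(text: str, n: int) -> list:
    """All word n-grams of a whitespace-tokenized text, in order."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def ovl(source: str, output: str, n: int = 3) -> float:
    """Share of distinct source n-grams that also occur in the output."""
    source_ngrams = set(word_ngrams(source, n))
    if not source_ngrams:
        return 0.0
    output_ngrams = set(word_ngrams(output, n))
    return len(source_ngrams & output_ngrams) / len(source_ngrams)

def rep(output: str, n: int = 4) -> float:
    """Share of n-grams in the output that duplicate another n-gram."""
    ngrams = word_ngrams(output, n)
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

overlap_example = ovl("a b c d e", "a b c x")   # 1 of 3 source 3-grams is reused
repeat_example = rep("a b c d a b c d")         # 1 of 5 4-grams is a duplicate
```

A low overlap value points to a more abstractive output, and a low repetition value to less self-copying.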
To evaluate how much content is directly copied from the source, we report the proportion of word 3-grams of the source that are present in the output (OVL-3). To evaluate repetitiveness of the generated texts, we report the portion of duplicated word 4-grams in the output (REP-4).

To complement word-based metrics with summarization-focused neural evaluation, we use two metrics introduced by Clark et al. (2023): average probabilities of the SEAHORSE classifiers for Q4 (whether all the information in the summary is fully attributable to the source), denoted as SH-4 in the following, and Q5 (whether the summary captures the main ideas of the source), denoted as SH-5.

As a metric of the overall fluency of the generated sentences, we report an average probability that the sentence is linguistically acceptable, as predicted by a classifier trained by Krishna et al. (2020) on the CoLA dataset (Warstadt et al., 2019), further referred to as CoLA. To evaluate the local coherence of the generated text, we report the average cosine similarity between each n-th and (n+2)-th sentence (Parola et al., 2023).

Task               Area               Metric  Description                          Reference
Summarization      Target similarity  R-L     ROUGE-L                              Lin (2004)
                   Source similarity  OVL-3   N-grams overlap (N=3)
                   Grammaticality     REP-4   Portion of duplicated N-grams (N=4)  Welleck et al. (2019)
                   Fluency            CoLA    Sentence fluency classifier score    Krishna et al. (2020)
                   Attribution        SH-4    Seahorse-Large-Q4 score              Clark et al. (2023)
                   Semantic coverage  SH-5    Seahorse-Large-Q5 coverage score     Clark et al. (2023)
Summary Expansion  Grammaticality     REP-4   (see above)                          Welleck et al. (2019)
                   Fluency            CoLA    (see above)                          Krishna et al. (2020)

Table 8 - Summary of automatic metrics used in different tasks in Section 3.1. Order mostly follows paper’s narrative.

3.1.2 Summarization

Task and datasets. When considering a relatively long document, a summarization task can be described as the act of generating a much shorter corresponding document that includes the essential
information contained in the long document and the same logical structure linking the various pieces of essential information. Summarization techniques can range from more extractive to more abstractive. Extractive techniques attempt to preserve in the summary the same vocabulary as that found in the long document, thereby shortening the long document by removing details and superfluous wording. Abstractive techniques, on the other hand, attempt to produce the summary by rephrasing the essential pieces of information found in the long document. Our work focuses more on abstractive summarization, as this type of summarization cannot be performed without some form of understanding and reasoning.

We use the CNN DailyMail (Hermann et al., 2015) and XSum (Narayan et al., 2018) datasets. We also report results on the challenging LCFO corpus which takes long documents as input, approx. 5k words (Costa-jussà et al., 2024). The task is to provide abstractive summaries with lengths representing 20%, 10%, and 5% of the input document. Detailed statistics are provided in Table 9.

                        #Llama2 Tokens                     #Sentences
Dataset        #Docs    Q1         Q2         Q3           Q1      Q2      Q3
CNN DailyMail  11.5k    605/61     892/78     1266/97      10/3    14/4    21/4
XSum           11.3k    273/25     445/30     735/35       25/1    30/1    35/1
LCFO.5%        249      6559/341   7214/378   7916/418     209/12  295/15  527/18
LCFO.10%       249      6559/654   7214/718   7916/796     209/22  295/27  527/32
LCFO.20%       249      6559/1276  7214/1403  7916/1524    209/41  295/48  527/59

Table 9 - Statistics of the test split of evaluation benchmarks. For each subset we report the number of documents and statistics of document and summary length in terms of sentences and Llama2 tokens. Each table cell shows “document/summary” length quartiles.

Baselines. For CNN DailyMail and XSum, we compare against several baselines of different architectures (encoder-decoder transformer, decoder-only LLMs) that are known to perform well on summarization tasks. For encoder-decoder transformer models, we use T5 (Raffel et al., 2020).
For decoder-only LLMs, we choose Gemma-7B, Llama-3.1-8B and Mistral-7B-v0.3. We chose the published instruction-tuned models so as to compare against the LCM under the same training regime and at a similar size (7B). Note that while T5 is much smaller than the LCM, this is compensated by using models that are fine-tuned explicitly on the target evaluation dataset.

CNN DailyMail
Model               Paradigm  R-L(↑)  OVL-3(↑)  REP-4(↓)  CoLA(↑)  SH-4(↑)  SH-5(↑)
Ground truth        -         100.00  0.170     0.684     0.850    0.683    0.586
T5-3B               SFT       37.56   0.174     0.854     0.946    0.773    0.503
Gemma-7B-IT         IFT       31.14   0.245     1.032     0.963    0.740    0.560
Mistral-7B-v0.3-IT  IFT       36.06   0.200     0.780     0.972    0.780    0.676
Llama-3.1-8B-IT     IFT       34.97   0.248     0.928     0.973    0.763    0.692
Two-Tower-7B-IT     IFT       36.47   0.177     0.757     0.767    0.723    0.459

XSum
Model               Paradigm  R-L(↑)  OVL-3(↑)  REP-4(↓)  CoLA(↑)  SH-4(↑)  SH-5(↑)
Ground truth        -         100.00  0.108     0.399     0.987    0.352    0.418
T5-3B               -         17.11   0.221     0.671     0.939    0.680    0.450
Gemma-7B-IT         IFT       18.20   0.177     0.620     0.769    0.546    0.446
Mistral-7B-v0.3-IT  IFT       21.22   0.162     0.480     0.922    0.633    0.621
Llama-3.1-8B-IT     IFT       20.35   0.186     0.501     0.941    0.687    0.658
Two-Tower-7B-IT     IFT       23.71   0.106     0.464     0.683    0.358    0.284

Table 10 - Performance on the CNN DailyMail and XSum summarization tasks.

Summarization results. Table 10 contains the results of different baselines and our LCM model for summarization (CNN DailyMail and XSum). We can notice that the LCM produces competitive Rouge-L scores when compared to a specifically tuned LLM (T5-3B) and even surpasses the instruct-finetuned LLMs. Our model tends to generate more abstractive summaries rather than extractive ones, as shown by the lower OVL-3 scores. The LCM produces fewer repetitions compared to LLMs, and more importantly, the repetition rate is closer to the ground truth one. The LCM generates globally less fluent summaries according to the CoLA classifier. However, we can remark that even the human generated ground truth gets a lower score compared to the LLM.
A similar behavior is observed for the source attribution (SH-4) and semantic coverage (SH-5). This may be explained by model-based metrics that are more biased towards LLM generated content.

Long-context summarization results. Table 11 presents the results for long-context summarization (LCFO.5%, LCFO.10% and LCFO.20%). This is a challenging task for most of the models. For example, Mistral-7B-v0.3-IT seems to be unable to follow the length instruction of the summary: it always generates summaries whose length is about 50% of the source. Mistral-7B-v0.3-IT also has the highest SH-4 score, i.e., source attribution. The summaries generated by Gemma-7B-IT tend to be longer than requested, while Llama-3.1-8B-IT generates summaries whose length is the closest to the requested size.

The LCM has only seen a limited amount of long documents in the pretraining and fine-tuning data. Nevertheless, it performs well for this task. It outperforms Mistral-7B-v0.3-IT and Gemma-7B-IT in the metric Rouge-L for the 5 and 10% conditions, and is close to Gemma-7B-IT for the 20% condition. We also observe that the LCM yields high SH-4 scores for all conditions, i.e., the summaries can be attributed to the source.

Finally, we observe that Llama-3.1-8B-IT performs substantially better than the other LLMs, according to Rouge-L, while all LLMs have similar performance on the CNN DailyMail and XSum summarization tasks. This could be explained by training data contamination for Llama-3.1-8B-IT, or by the fact that the other two LLMs struggle to handle the long input context.
LCFO.5%
Method              WR     R-L(↑)  OVL-3(↑)  REP-4(↓)  CoLA(↑)  SH-4(↑)  SH-5(↑)
Gemma-7B-IT         0.107  25.21   0.151     4.711     0.688    0.357    0.174
Mistral-7B-v0.3-IT  0.512  21.36   0.532     5.997     0.854    0.656    0.296
Llama-3.1-8B-IT     0.076  37.67   0.190     2.767     0.931    0.488    0.314
Two-Tower-7B-IT     0.060  26.88   0.162     2.473     0.796    0.628    0.196

LCFO.10%
Method              WR     R-L(↑)  OVL-3(↑)  REP-4(↓)  CoLA(↑)  SH-4(↑)  SH-5(↑)
Gemma-7B-IT         0.150  29.25   0.164     6.427     0.667    0.377    0.194
Mistral-7B-v0.3-IT  0.549  25.00   0.537     6.289     0.848    0.660    0.306
Llama-3.1-8B-IT     0.128  42.85   0.243     3.804     0.907    0.486    0.310
Two-Tower-7B-IT     0.089  29.38   0.202     3.00      0.791    0.623    0.183

LCFO.20%
Method              WR     R-L(↑)  OVL-3(↑)  REP-4(↓)  CoLA(↑)  SH-4(↑)  SH-5(↑)
Gemma-7B-IT         0.257  33.32   0.201     9.188     0.603    0.425    0.239
Mistral-7B-v0.3-IT  0.493  28.82   0.527     5.806     0.858    0.658    0.293
Llama-3.1-8B-IT     0.179  46.92   0.272     4.783     0.888    0.485    0.315
Two-Tower-7B-IT     0.140  31.74   0.253     3.664     0.779    0.613    0.187

Table 11 - Performance on the long-context summarization task of LCFO. WR is the word count ratio between the generated text and the source document.

4 Large Concept Model Extensions

In this section, we explore several extensions of the Large Concept Model. First, we evaluate the LCM on the new task of summary expansion, i.e., given a summary, create a longer text. We then showcase the good zero-shot generalization performance of the LCM. Finally, we explore an approach to add higher level information beyond sentences.

4.1 Summary Expansion

Task and datasets. When considering a short and concise document that has similar properties to those of a summary (i.e., mainly a stand-alone document that abstracts from details), a summary expansion task can be described as the act of generating a much longer document that preserves the essential elements found in the corresponding short document, as well as the logical structure that connects such elements.
As this is a more freely generative task, an additional requirement to be taken into consideration is that of coherence (for example, the detailed information included in one generated sentence should not contradict that included in another sentence). The summary expansion task presented here consists in taking summaries as inputs, from CNN DailyMail and XSum, and generating a long document. Note that the goal is not to recreate the factual information of the initial document, but rather to evaluate the capability of the model to extend the input text in a meaningful and fluent way. We use similar baselines and metrics as in Section 3.1.2.

CNN DailyMail
Method              WR    R-L(↑)  OVL-3(↑)  REP-4(↓)  CoLA(↑)
Gemma-7B-IT         6.8   35.54   0.801     2.104     0.951
Mistral-7B-v0.3-IT  6.4   34.24   0.817     2.063     0.959
Llama-3.1-8B-IT     8.5   37.76   0.822     2.582     0.844
Two-Tower-7B-IT     6.3   30.85   0.726     2.911     0.474

XSum
Method              WR    R-L(↑)  OVL-3(↑)  REP-4(↓)  CoLA(↑)
Gemma-7B-IT         19.5  17.89   0.963     10.238    0.116
Mistral-7B-v0.3-IT  1.6   29.31   0.893     2.268     0.939
Llama-3.1-8B-IT     19.8  28.84   0.915     2.543     0.898
Two-Tower-7B-IT     7.1   23.82   0.561     1.542     0.603

Table 12 - Performance on the summary expansion tasks of CNN DailyMail and XSum, evaluated with the metrics described in Table 8. WR is the word count ratio between the hypothesis and the source summary.

Results. Table 12 shows the results of the summary expansion for CNN DailyMail and XSum. First of all, regarding the word count ratio, we can see different behaviours for the two corpora. For CNN DailyMail, the models tend to generate texts that are 6 times larger than the input. Llama-3.1-8B-IT produces even longer outputs (factor 8 instead of 6). But for XSum, while Gemma-7B-IT and Llama-3.1-8B-IT generate very long texts (almost 20 times longer than the prompt), the LCM generates output of approximately the same length ratio as for CNN DailyMail. Only Mistral-7B-v0.3-IT fails to generate long outputs for this corpus.
Then, we clearly see a different trend compared to the summarization task. The LLMs get higher Rouge-L scores compared to the LCM. As mentioned above, the goal of this task is not to recreate the original document. However, the R-L score tells us how much of the content of the full document can be recreated. Contrary to the LLMs, our model tends to generate different sentences compared to the original document from which the summary has been created (with an exception for Gemma-7B-IT on XSum). This is expected since our model generates embeddings that are then processed by a decoder trained on a translation task that tends to paraphrase the initial content. However, the CoLA results show that this comes along with lower fluency, especially for CNN DailyMail.

4.2 Zero-shot generalization performance

SONAR is a semantic space that can represent 200 languages. In this paper, all experiments presented so far have been done on English text. In this section, we explore the capability of our proposed LCM approach to process other languages in a zero-shot fashion by leveraging SONAR’s ability to represent multilingual data.

We use the XLSum (Hasan et al., 2021) corpus, a large scale multilingual abstractive news summarization benchmark covering 45 languages. We score model outputs using the multilingual rouge scoring scripts released with the benchmark [5]. Note that Rouge-L scores are heavily dependent on language-specific text tokenization and stemming. Unless provided in the aforementioned scripts, we tokenize the model outputs and references with the default tokenization from Lin (2004). Languages like Korean, Telugu and Tamil can benefit from a more appropriate stemming and tokenization.

We compare the LCM performance with Llama-3.1-8B-IT which officially supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
According to The Llama3 team (2024), the model has seen many additional languages during pretraining, but was instruction finetuned on those eight languages only. The LCM, on the other hand, has never seen any language other than English, and we do not use the English XLSum training data either.

[5] https://github.com/csebuetnlp/xl-sum/tree/master/multilingual_rouge_scoring

Figure 16 - Rouge-L scores on XLSum for Llama-3.1-8B-IT and Two-Tower-7B-IT. [Bar chart; y-axis: XLSum Rouge-L; officially supported Llama3 languages are marked. Languages on the x-axis, in plotted order: Vietnamese, English, Pashto, Swahili, Burmese, French, Hausa, Urdu, Hindi, Indonesian, Portuguese, Arabic, Spanish, Welsh, Somali, Turkish, Russian, Persian, Japanese, Igbo, Chinese simplified, Kirundi, Thai, Tigrinya, Serbian Cyrillic, Scottish Gaelic, Chinese traditional, Yoruba, Ukrainian, Azerbaijani, Amharic, Marathi, Tamil, Sinhala, Oromo, Telugu, Korean, Punjabi, Bengali, Gujarati, Kyrgyz, Nepali.]

We report Rouge-L scores for 42 languages in Figure 16. Three languages were excluded since they are currently not supported by SONAR: Pidgin, Serbian in Latin script and Uzbek in Cyrillic script. The LCM substantially outperforms Llama-3.1-8B-IT on English (23.5 compared to 20.7 Rouge-L) and on the average over all six languages officially supported by both models and included in XLSum (20.2 versus 19.7 Rouge-L) [6]. We also observe that the LCM generalizes very well to many other languages, in particular low-resource languages like Southern Pashto, Burmese, Hausa or Welsh which all have Rouge-L scores greater than 20. Other well performing low-resource languages are Somali, Igbo or Kirundi. Finally, the LCM obtains a Rouge-L score of 30.4 on Vietnamese. Overall, these results highlight the impressive zero-shot generalization performance of the LCM to languages it has never seen.

4.3 Exploring explicit planning

When writing long-form text, it is often practical to first think about how to structure our narrative.
Indeed, Dijk (1977) states that all texts have an innate macrostructure, i.e., a global discourse structure spanning the entirety of the text. In general scientific discourse, one such macrostructure is that of problem-solving (Heffernan and Teufel, 2022), where an author must first motivate and describe the problem, before proposing an appropriate solution. Composing such a macrostructure is no easy task. Yet in the realm of LLMs, there have been many recent efforts to help guide model generation using similar structures. One such popular approach is that of creating outlines, which provide a high-level overview of the desired narrative (Li et al., 2024). An alternative approach to outlines is creating summaries of future targets, which the model can then expand upon (Sun et al., 2022).

[6] English, French, Hindi, Portuguese, Spanish and Thai.

Given the nature of the LCM operating at the concept level, it naturally creates long-form output. Therefore, it is important to ensure that the model is capable of creating coherent generations given the multitudes of possibilities for next concept prediction. In order to address this, we envision an explicit capability for planning. Similar to creating summaries, we propose a complementary planning model which creates a high-level overview of what should be generated next, given the prior context. The proposed plan could span multiple concepts, such as a paragraph. The LCM is conditioned on this plan, before it then generates the subsequent output sequence.

Operationally the model predicts auto-regressively a sequence of concepts followed by a break concept, which represents a natural topic cadence such as a paragraph break. Once the break concept is predicted, the large planning model (LPM) generates a plan in order to condition the LCM for prediction of the subsequent sequence. The model then continues generation as usual conditioned on both the prior concepts and the proposed plan.
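The operational description above can be read as a simple control loop: generate concepts until a break concept appears, ask the planner for a new plan, then continue. The sketch below illustrates only that control flow. Concepts are represented as plain strings, the `BREAK` marker, the function names and the toy callables are hypothetical, and starting without a plan is an assumption since the text does not say how the first segment is conditioned.

```python
from typing import Callable, List, Optional, Sequence

Concept = str          # stand-in; in the LCM a concept is a sentence embedding
BREAK = "<break>"      # hypothetical marker for the break concept

def generate_with_plans(
    prompt: Sequence[Concept],
    next_concept: Callable[[Sequence[Concept], Optional[Concept]], Concept],
    make_plan: Callable[[Sequence[Concept]], Concept],
    max_concepts: int,
) -> List[Concept]:
    """Autoregressive generation that re-plans after every break concept."""
    context = list(prompt)
    plan: Optional[Concept] = None  # assumption: no plan before the first break
    for _ in range(max_concepts):
        concept = next_concept(context, plan)
        context.append(concept)
        if concept == BREAK:
            plan = make_plan(context)  # planner conditions the next segment
    return context[len(prompt):]

# Toy callables so the loop can be run; they do not model anything real.
def toy_next_concept(context, plan):
    if (len(context) + 1) % 3 == 0:
        return BREAK
    return f"sentence {len(context)} (plan: {plan})"

def toy_make_plan(context):
    return f"plan#{list(context).count(BREAK)}"

output = generate_with_plans(["intro"], toy_next_concept, toy_make_plan, max_concepts=6)
```

In this toy run each break triggers a fresh plan, and the sentences that follow carry the plan they were conditioned on.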
An overview of the approach is shown in Figure 17.

Figure 17 - LCM conditioned on both the prior context and a high-level plan for the next sequence.

Although we envision a separate (but complementary) model for planning, we present here an initial experiment using a simplified single-model approach where the LCM is trained in a multitask setting to also predict both the break concepts and plans. Additionally, instead of the idealized plan spanning multiple concepts (such as a paragraph), we use a single concept in order to capture what should come next (i.e. a plan concept). We call this simplified multitask approach a Large Planning Concept Model (LPCM).

Methodology. In order to evaluate this single-model approach, we perform an ablation study. As a baseline method, we train a One-Tower LCM (cf. Section 2.3.3) without any visibility to break or plan concepts. We then subsequently train an LPCM with the same number of parameters as the baseline. Both models were trained on the same data [7] for the same number of steps.

Data preprocessing. In order to represent break concepts, we begin by first segmenting the data into paragraphs. Given that most real world datasets lack paragraph structure (or it is not easy to recover), we apply the Segment Any Text (Frohmann et al., 2024) paragraph splitting API [8]. We additionally force each paragraph to be less than 10 sentences, and merge small (e.g. one sentence) consecutive paragraphs together. In order to represent plan concepts, we generate a synthetic high-level topic description for each preceding segmented paragraph using an existing open-sourced LLM, namely Llama-3.1-8B-IT, which offers a good trade-off between the generated topic quality and the generation speed. The system prompt used to generate these topic descriptions is listed in Appendix C. In total we process approximately 320M paragraphs with topic descriptions, spanning 1.5B segmented concepts (i.e. approximately 30B tokens).

Metrics.
We focus on coherence as our main measure of evaluation. Previous ablations (cf. Section 2.4) used the coherence metric introduced by Jwalapuram et al. (2022). However, we explore here LLM-as-a-judge as an alternative. Specifically, we use Llama-3.1-8B-IT in order to evaluate the coherence of the generated model outputs, which is prompted to return an overall coherence score between [0, 5]. The prompt used is listed in Appendix D. In order to validate this prompt, we evaluated it against a dataset of human judgements introduced by Jwalapuram et al. (2022), and observed that it reached an agreement [9] with human annotators which improves upon their coherence model. We therefore choose this metric for our evaluation. To be consistent across both model results, we do not include the special break or plan concepts generated by the LPCM when calculating coherence scores.

[7] Compared to previous experiments, we use a different data mixture more favorable for long-form generation.
[8] https://github.com/segment-any-text/wtpsplit

Model     Llama-3.1-8B-IT (↑)
LPCM      2.82 ± 0.62
Baseline  2.74 ± 0.70

Table 13 - LPCM ablation coherence score results.

Results. We provide the results of our ablation experiment in Table 13. Results are reported over a held-out subset of Cosmopedia (Ben Allal et al., 2024) following instruction fine-tuning, similar to previous ablations (cf. Section 2.4). We observe that the LPCM achieves significantly [10] higher coherence scores (significance was measured using a paired t-test) than the baseline One-Tower LCM. This finding suggests that the LPCM is capable of producing significantly more coherent outputs than the LCM as a result of the additional structure coming from predicted plan concepts, helping the LPCM produce a more coherent narrative, which is essential for the objective of generating long-form output.
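For readers unfamiliar with the paired t-test mentioned above, the following sketch computes its test statistic from per-document score pairs. The scores are made up for illustration and are not results from the paper; the paper does not describe its exact testing procedure beyond naming the test, and turning the statistic into a p-value additionally needs the Student t distribution (for example via `scipy.stats.ttest_rel`).

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Mean difference divided by its standard error (sample std / sqrt(n)).
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1

# Made-up coherence scores for the same six documents (illustration only).
lpcm_scores = [3.0, 2.5, 3.5, 2.0, 3.0, 3.5]
baseline_scores = [2.5, 2.5, 3.0, 2.0, 2.5, 3.5]

t_stat, dof = paired_t_statistic(lpcm_scores, baseline_scores)
```

Pairing matters because both models are scored on the same held-out documents, so the test looks at per-document differences rather than at the two overlapping score distributions.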
[9] Krippendorff's α = 0.48
[10] Significance observed at the 99.9% level.

5 Related work

5.1 Sentence representations

Multilingual sentence representations

Learning effective sentence embeddings has been a well-studied subject in recent years. Significant progress has been made in this field, largely due to the capabilities of transformer-based language models which, by learning contextual representations for individual tokens (Devlin et al., 2018; Conneau et al., 2020), are able to effectively learn the semantics of language. However, such models are not optimal for creating sentence representations. Subsequent approaches built upon these initial works and aimed at learning general sentence representations by leveraging dual encoder architectures (Guo et al., 2018; Reimers and Gurevych, 2019; Ni et al., 2021). These architectures encode the source and target into a common embedding space, and use a distance-based metric to create an alignment loss that pulls semantically equivalent sentences together. Such architectures have been extended to leverage multilingual data to create general, aligned embedding spaces across languages (Feng et al., 2020; Janeiro et al., 2024; Sturua et al., 2024). Initial approaches leveraged the contrastive loss to align translations across languages (Feng et al., 2020; Yang et al., 2019), using only translation data to train. Other architectural changes, namely using token-level objectives combined with the sentence-level objectives, have proven useful to improve the quality of multilingual sentence representations based on translation data only (Li et al., 2023; Janeiro et al., 2024). Recent approaches explore using data from other tasks, besides translation data, to increase the generality of the sentence representations (Wang et al., 2024b; Mohr et al., 2024).
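The distance-based alignment objective used by the dual encoder architectures above can be illustrated with a small sketch of an in-batch contrastive loss over paired source and target embeddings. This is a generic formulation under our own assumptions (cosine similarity, a temperature of 0.05, the other targets in the batch as negatives, hypothetical function names); it is not the exact loss of any of the cited models.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def in_batch_contrastive_loss(src, tgt, temperature=0.05):
    """Mean cross-entropy of picking tgt[i] as the match for src[i].

    src and tgt are equally long lists of embedding vectors; pair i is a
    translation (positive), every other tgt[j] in the batch is a negative.
    The temperature value is an illustrative assumption.
    """
    if len(src) != len(tgt) or not src:
        raise ValueError("need a non-empty batch of aligned pairs")
    src = [_normalize(v) for v in src]
    tgt = [_normalize(v) for v in tgt]
    total = 0.0
    for i, s in enumerate(src):
        logits = [_dot(s, t) / temperature for t in tgt]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(src)


if __name__ == "__main__":
    # Toy 2-d "embeddings": well-aligned pairs give a loss close to zero.
    source = [[1.0, 0.0], [0.0, 1.0]]
    target = [[1.0, 0.0], [0.0, 1.0]]
    print(in_batch_contrastive_loss(source, target))
```

Minimizing this quantity pulls each source embedding towards its own translation and pushes it away from the other sentences in the batch, which is one common way to obtain an aligned embedding space.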
Other approaches change their embeddings per task, either with task-specific prompts (Wang et al., 2024b; Su et al., 2022; Lee et al., 2024b) or with task-specific parameters (Sturua et al., 2024).

Another successful line of work to create general-purpose, multilingual sentence representations is to leverage the translation objective. LASER (Artetxe and Schwenk, 2019) and SONAR (Duquenne et al., 2023b) leverage an encoder-decoder architecture, with a fixed-size sentence representation between the encoder and the decoder, trained with a translation objective. SONAR is initialized from the NLLB-200 model (NLLB Team et al., 2022), and covers 200 languages, making it one of the open-source models with the widest language coverage. SONAR also provides open-source speech encoders aligned to its sentence encoders for 73 languages (Seamless Communication et al., 2023a), aligned through a teacher-student approach. SONAR has been used as the basis for several works (Seamless Communication et al., 2023a,b; Chen et al., 2023a), and its speech decoders have been extended to keep the expressiveness of the original speech (Duquenne et al., 2023a).

Joint speech/text sentence representations

There has been a large body of research on unsupervised representation learning for monolingual (Baevski et al., 2020) and multilingual speech (Babu et al., 2022), and more recently on w2v-bert (Chung et al., 2021), which combines contrastive learning and masked language modeling to learn self-supervised representations from speech. Other works explored multilingual and multimodal (speech/text) pre-training methods, including mSLAM (Bapna et al., 2022). Finally, Duquenne et al. (2021), followed by Khurana et al. (2022), introduced multilingual and multimodal sentence embeddings, extending a pre-existing multilingual text sentence embedding space to the speech modality with a distillation approach. Duquenne et al.
(2022, 2023c) also showed that it is possible to efficiently decode multilingual speech sentence embeddings with decoders trained on text sentence embeddings into different languages, to perform zero-shot speech translation.

LLM-based sentence representations

Several text representation methods have been proposed which are based on existing LLMs. Wang et al. (2024a) proposed extracting text embeddings from the last token of LLMs fine-tuned with instructions on contrastive data. Lee et al. (2024a) improved text embedding capabilities of fine-tuned LLMs by removing the causal attention mask and applying extra nonlinear layers before pooling the token embeddings. Embeddings as a service are supported by some commercial LLM providers, for example, Mistral-embed [11]. Such embeddings proved competitive on retrieval benchmarks; however, to the best of our knowledge, their applicability to reconstructing the texts back from the embedding space has not been demonstrated.

5.2 Multilingual LLMs

Most of the leading LLMs have been trained on texts in several languages. Table 1 summarizes the coverage of several of them. Nevertheless, the pretraining data of these LLMs seems to be mainly English texts. For example, The Llama3 team (2024) mentions that pretraining data contains significantly more English texts, requiring continued pre-training with multilingual data, out of which 34.6% is translated reasoning data.

There are also several efforts to train LLMs optimized on specific languages, e.g. LeoLM for German [12], Fauno for Italian (Bacciu et al., 2024), ALLaM for Arabic (Bari et al., 2024), and several models for Chinese: ErniBot [13], Tongyi Qianwen [14], or ChatGLM (Team GLM et al., 2024).

[11] https://docs.mistral.ai/capabilities/embeddings/
[12] https://laion.ai/blog/leo-lm/

Some adaptations of LLMs to a massive number of languages also exist.
LOLA (Srivastava et al., 2024) is a recent mixture-of-experts LLM supporting 160 languages, and EMMA-500 (Ji et al., 2024) adapts LLaMA2 to 546 languages. However, such models typically face a trade-off between language coverage and other capabilities. For example, the Aya model (Üstün et al., 2024), following instructions in 101 languages, was superseded by Aya-23 (Aryabumi et al., 2024) that exchanged some breadth for depth, focusing on 23 languages only. The LCM architecture, combining a language-agnostic model for knowledge and reasoning with potentially language-specialized encoders and decoders, is expected to exhibit this trade-off to a lesser extent.

5.3 Alternative LLM architectures

Predicting the next state in the embedding space is a core idea of the Joint Embedding Predictive Architecture (JEPA) proposed by LeCun (2022). This idea has been implemented for images (I-JEPA by Assran et al. (2024)) and video (V-JEPA by Bardes et al. (2024)) as a self-supervised approach to learning representations. For language, equivalent models have not yet been explored.

Sentence embeddings for language modeling. For text completion, Ippolito et al. (2020) proposed a sentence-level language model operating by choosing the next sentence from a finite set of candidates. Their model demonstrated success in selecting appropriate continuations for short stories, but it has not been scaled to longer inputs or to fully generative outputs. Golestani et al. (2021) studied a similar problem in the even more restrictive sentence ordering setting, but with a more thorough study of architectural choices. The INSET architecture (Huang et al., 2020) solves the sentence infilling task by combining a denoising autoencoder, which encodes sentences into fixed-size vectors and decodes them back, with a bidirectional transformer that predicts the embedding of a missing sentence. Marfurt and Henderson (2021) and Cornille et al.
(2024) used predicted next sentence embeddings in a fully generative setting, for summarization and generic language modeling, respectively. However, their architectures considered sentence-level connections only as an addition to the token-level connections across sentences, not as their replacement. In a recent work of An et al. (2024), the SentenceVAE architecture performs language modeling on the sentence level using a sentence encoder to prepare the inputs and a sentence decoder to produce the outputs. However, its input and output embedding spaces are not tied, so inference is only possible by decoding each predicted sentence into text and then re-encoding it before adding it to the context.

Language modeling with diffusion. A series of more recent works tried adapting diffusion modeling, originally developed for continuous data, to the discrete text domain. The PLANNER architecture (Zhang et al., 2023) consists of a variational autoencoder for paragraphs and a diffusion model trained to predict latent autoencoder representations conditional on the textual context or on the class label. Lovelace et al. (2024) augmented a decoder-only language model with an encoded semantic proposal of the continuation text, with an easily guidable diffusion model predicting the embedding of the next proposal. The TEncDM model (Shabalin et al., 2024) performs diffusion in the space of contextual token embeddings, which are then decoded non-autoregressively.

[13] http://research.baidu.com/Blog/index-view?id=183
[14] https://www.alibabacloud.com/en/solutions/generative-ai

Some applications of diffusion to sequence modeling have targeted the planning capabilities of the sequence models. Semformer (Yin et al., 2024) proposed training transformer language models to plan several steps ahead by including special planning tokens, the representations of which are trained to be informative about the future tokens. Ye et al.
(2024) applied discrete diffusion to language models as an alternative to autoregressive generation, more suitable for tasks that require multi-step planning. Ubukata et al. (2024) give an overview of applications of diffusion for planning tasks, but most of them are not concerned with the language domain.

Overall, while many of the previous works used hidden representations for language modeling or related tasks, all of them either relied on token-level inputs or outputs, or were not intended for generating texts of arbitrary length. The LCM seems to be the first fully generative language model implemented fully in a highly semantic, reconstructable sentence representation space.

6 Limitations

In this section we discuss the possible limitations of the presented Large Concept Modeling approach.

Choice of the embedding space. The choice and design of the embedding space plays a crucial role in the LCM modeling approach.

• The SONAR embedding space was chosen for its good multilingual and multimodal representations, as well as the availability of a massively multilingual decoder, which achieves excellent results in both translation and auto-encoding. However, the SONAR model was trained on very specific training data, namely bitext machine translation data containing rather short sentences. This has several consequences:
  1. SONAR is trained to sustain a local geometry (sentences with very similar meanings are geometrically close) with no special guarantees for sentences that are only loosely related. Yet, predicting the distribution of next sentences requires the space to operate well globally.
  2. SONAR auto-encodes texts containing links, references, or merely numbers or code data surprisingly well. Yet, such texts tend to be fragile, highlighting a distribution mismatch between the SONAR training data and commonly used LLM pre-training text corpora.
Therefore, the accurate prediction of sentences containing such content (non-negligible in LCM pre-training data) will be hard for any SONAR-based LCM. For instance, the factuality of fragile generated sentences may easily be compromised.

• Using a frozen encoder represents some interesting trade-offs. Any frozen encoder which is learned in a different data context, and with no a-priori strong connection to LCM modeling, may be suboptimal compared to encoders that are learned in an end-to-end fashion (with the loss coming from the decoder). At the same time, learning an encoder within end-to-end training can be challenging and the resulting space is not guaranteed to result in good semantic representations shared across languages and modalities. Training the concept representation and the LCM end-to-end would also be less data and compute efficient since all modeling data should be multilingual and multimodal, bearing the risk of modality competition.

Concept granularity

• In this work, the definition of concepts is interpreted at sentence level. However, the manifold of possible next sentences is very wide, and attributing a proper probability to each such sentence is much harder (even when modeling within the latent space) than to each element of a discrete set of tokens.

• In NLP, we encounter sentences of variable length. The combinatorial complexity of possible next sentences grows exponentially with the maximum character length. The choice of granularity for the LCM is not trivial, as long sentences (>120 characters) could reasonably be considered as several concepts. However, any finer splitting of such sentences does not necessarily separate these concepts well. This shows the limitation of a fixed-size embedding representation for one sentence. Text splitting (such as sentence splitting) or one-to-many mapping of a sentence into several embeddings is a major future direction of research.
• Each document in a training corpus typically contains a sequence of unique sentences, with only a small number of repetitions. This data sparsity effect also manifests at the level of large corpora: the large majority of sentences are essentially unique. In principle, this issue can be addressed with higher-level semantic embedding representations. These higher-level representations come with a trade-off between the requirement of lossless data encoding (think of named entities or numbers, critical in many language modeling tasks) and a good level of abstraction to enable reasoning capabilities. Compared to a monolingual auto-encoder which would simply compress input sentences, SONAR offers semantic representations with good auto-encoding quality, but still certainly sacrifices some generalization capacity.

• This generalization issue can be partially mitigated by splitting or encoding input text as new conceptual units which are more commonly shared across source documents. This is in the spirit of stemming or lemmatization techniques studied in NLP for words. That being said, building such conceptual units that are also language- and modality-agnostic is a challenging task. Such shared multilingual and multimodal conceptual units are also key for generalization across languages and across modalities. To maximize cross-lingual and cross-modal transfers, Large Concept Models should be exposed to a richer variety of multilingual and multimodal data.

Continuous versus discrete

• Diffusion modeling has proven to be very efficient in generative modeling of continuous data like images or speech. As previously stated, sentences in the SONAR space, despite being represented as continuous vectors, remain discrete combinatorial objects. This makes diffusion modeling struggle on the text modality (either at word or sentence embedding level).
• The contrastive nature of the cross-entropy loss based on softmax outputs, which is used for next token prediction, plays a critical role for many downstream tasks where higher accuracy is required (e.g. MCQ tasks, code or math generation). In contrast, continuous diffusion modeling does not allow integrating such a contrastive objective.

• The Quant-LCM could be a way to address the discrete nature of text while modeling on coarse-to-fine semantic units shared across languages and modalities. The limited performance of the Quant-LCM approaches presented in this paper may be explained by the fact that the SONAR space was not trained to be efficiently quantizable, yielding a significant number of codebooks and a large number of units per codebook. Therefore, the current SONAR quantization suffers from an exponentially increasing number of RVQ unit combinations, which does not solve the data sparsity/uniqueness issue discussed earlier. This indicates once again the importance of developing a new representation space, either continuous or discrete, for the Large Concept Model.

7 Acknowledgments

We would like to thank Robbie Adkins, Can Balioglu, Joy Chen, Pascale Fung, Jason Holland, Amita Kamath, Justine Kao, Sagar Miglani, Alice Rakotoarison, Abhilasha Sancheti, Arjang Talattof, Ellen Tan, Carleigh Wood, Shireen Yates, Bokai Yu and Luke Zettlemoyer for comments and suggestions on this work, as well as for helping us to improve this paper.

8 Conclusion and Future Work

Current best practice for large scale language modeling is to operate at the token level, i.e. to learn to predict the next tokens given a sequence of preceding tokens. There is a large body of research on improvements of LLMs, but most works concentrate on incremental changes and do not question the main underlying architecture.
In this paper, we have proposed a new architecture, named a Large Concept Model (LCM), which substantially differs from current LLMs in two aspects: 1) all modeling is performed in a high-dimensional embedding space instead of on a discrete token representation; and 2) modeling is not instantiated in a particular language or modality, but at a higher semantic and abstract level. We have named the general form of this representation a "concept".

In this paper, to verify the feasibility of the high-level idea, we have assumed that a concept corresponds to a sentence in the text domain, or an equivalent speech segment, and that the embeddings are obtained by the freely available SONAR sentence encoder (Duquenne et al., 2023b).

With respect to the specific architecture of the LCM, we have first shown that directly minimizing the MSE loss in the embedding space does not yield good results. We then explored several architectures based on a diffusion process: the One-Tower and Two-Tower LCM, as well as a Quant-LCM which uses quantization of SONAR representations and then modeling on these discrete units. These ablation experiments were performed with models with 1.6B parameters and focused on the generative task of continuing a sequence of sentences. We have then scaled our models to a size of 7B parameters and instruction-finetuned them on several summarization and summary expansion tasks. We provide a detailed comparison to other public models of the same size, namely Gemma, Mistral and Llama.

By design, an LCM exhibits strong zero-shot generalization performance. In this paper, we trained models on English texts only, and applied them to text in other languages, without any additional training data, neither aligned nor unlabeled. The LCM outperforms Llama-3.1-8B-IT on English and on the average over foreign languages officially supported by the LLM. The LCM itself could also be trained on multilingual and multimodal data to acquire knowledge from these sources.
We will explore this in future versions of the LCM. In short, all languages and modalities are first-class citizens and handled equally at all stages of an LCM.

We have observed that next sentence prediction is substantially more challenging than next token prediction. First, given that we operate in an embedding space and at a higher semantic level, the number of possible sentences is virtually unlimited, while token vocabularies are usually in the range of 100k. Second, even given a long context, there is unavoidably more ambiguity in choosing the next sentence than the next token. And third, the usual softmax output layer over the fixed-size token vocabulary provides a normalized probability distribution over all possible token continuations. Theoretically, a diffusion process should be able to learn a probability distribution over an output embedding space, but our current experimental evidence indicates that more research is needed to take full advantage of the properties of Large Concept Models. As an example, the ability to sample multiple embeddings and associate a score would enable beam search to find the best sequence of sentences. Finally, small modeling errors could yield predictions in the embedding space which do not correspond to valid sentences, i.e. that cannot be decoded into a syntactically and semantically correct sentence. We will work on alternative concept embeddings to SONAR which would be better suited to the next sentence prediction task, and would improve modeling approaches in that concept embedding space.

We see the models and results discussed in this paper as a step towards increasing scientific diversity and a move away from current best practice in large scale language modeling. We acknowledge that there is still a long path to reach the performance of current flagship LLMs.
This will of course require further improving the core architecture, but also careful data selection and curation, extensive ablations, optimized and diverse instruction fine-tuning, and finally, scaling to models with more than 70B parameters.

We open-source the full training code of all our LCM variants, together with a set of supporting scripts [15], to make it easy for other teams to train LCM models. By these means, we hope to foster research on alternative LLMs and contribute to advancing the field of machine intelligence.

References

A. Aghajanyan, L. Yu, A. Conneau, W.-N. Hsu, K. Hambardzumyan, S. Zhang, S. Roller, N. Goyal, O. Levy, and L. Zettlemoyer. Scaling laws for generative mixed-modal language models. In International Conference on Machine Learning, pages 265–279. PMLR, 2023.
E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, Étienne Goffinet, D. Hesslow, J. Launay, Q. Malartic, D. Mazzotta, B. Noune, B. Pannier, and G. Penedo. The Falcon series of open language models. ArXiv, abs/2311.16867, 2023. URL https://arxiv.org/pdf/2311.16867.
H. An, Y. Chen, Z. Sun, and X. Li. SentenceVAE: Enable next-sentence prediction for large language models with faster speed, higher accuracy and longer context. arXiv preprint arXiv:2408.00655, 2024.
Anthropic. The Claude 3 model family: Opus, sonnet, haiku, 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf.
M. Artetxe and H. Schwenk. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. TACL, pages 597–610, 2019.
V. Aryabumi, J. Dang, D. Talupuru, S. Dash, D. Cairuz, H. Lin, B. Venkitesh, M. Smith, K. Marchisio, S. Ruder, et al. Aya 23: Open weight releases to further multilingual progress. arXiv preprint arXiv:2405.15032, 2024.
M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas. Self-supervised learning from images with a joint-embedding predictive architecture.
ArXiv, abs/2301.08243, 2024. URL https://arxiv.org/pdf/2301.08243.
A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale. In Proc. Interspeech 2022, pages 2278–2282, 2022. doi: 10.21437/Interspeech.2022-143.
A. Bacciu, G. Trappolini, A. Santilli, E. Rodolà, and F. Silvestri. Fauno: The italian large language model that will leave you senza parole! ArXiv, abs/2306.14457, 2024. URL https://arxiv.org/pdf/2306.14457.
A. Baevski, Y. Zhou, A. Mohamed, and M. Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. NeurIPS, 33:12449–12460, 2020.
C. Balioglu. fairseq2, 2023. URL http://github.com/facebookresearch/fairseq2.
A. Bapna, C. Cherry, Y. Zhang, Y. Jia, M. Johnson, Y. Cheng, S. Khanuja, J. Riesa, and A. Conneau. mslam: Massively multilingual joint pre-training for speech and text. arXiv preprint arXiv:2202.01374, 2022.
[15] https://github.com/facebookresearch/large_concept_model
A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y. LeCun, M. Assran, and N. Ballas. Revisiting feature prediction for learning visual representations from video. ArXiv, abs/2404.08471, 2024. URL https://arxiv.org/pdf/2404.08471.
M. S. Bari, Y. Alnumay, N. A. Alzahrani, N. M. Alotaibi, H. A. Alyahya, S. AlRashed, F. A. Mirza, S. Z. Alsubaie, H. A. Alahmed, G. Alabduljabbar, R. Alkhathran, Y. Almushayqih, R. Alnajim, S. Alsubaihi, M. A. Mansour, M. Alrubaian, A. Alammari, Z. Alawami, A. Al-Thubaity, A. Abdelali, J. Kuriakose, A. Abujabal, N. Al-Twairesh, A. Alowisheq, and H. Khan. ALLaM: Large language models for arabic and english. ArXiv, abs/2407.15390, 2024. URL https://arxiv.org/pdf/2407.15390.
L. Ben Allal, A. Lozhkov, G. Penedo, T. Wolf, and L. von Werra. Cosmopedia, 2024. URL https://huggingface.co/datasets/HuggingFaceTB/cosmopedia.
J. Betker, G. Goh, L.
Jing, Tim Brooks, J. Wang, L. Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and A. Ramesh. Improving image generation with better captions, 2023. URL https://api.semanticscholar.org/CorpusID:264403242.
BigScience Workshop. BLOOM: a 176b-parameter open-access multilingual language model. ArXiv, abs/2211.05100, 2023. URL https://arxiv.org/pdf/2211.05100.
Chameleon team. Chameleon: Mixed-modal early-fusion foundation models. ArXiv, abs/2405.09818, 2024. URL https://arxiv.org/pdf/2405.09818.
M. Chen, P.-A. Duquenne, P. Andrews, J. Kao, A. Mourachko, H. Schwenk, and M. R. Costa-jussà. BLASER: A text-free speech-to-speech translation evaluation metric. In ACL, pages 9064–9079, 2023a. URL https://aclanthology.org/2023.acl-long.504.
M. Chen, K. Heffernan, O. Çelebi, A. Mourachko, and H. Schwenk. xSIM++: An improved proxy to bitext mining performance for low-resource languages. In ACL, pages 101–109, 2023b. URL https://aclanthology.org/2023.acl-short.10.
R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
Y.-A. Chung, Y. Zhang, W. Han, C.-C. Chiu, J. Qin, R. Pang, and Y. Wu. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 244–250. IEEE, 2021.
E. Clark, S. Rijhwani, S. Gehrmann, J. Maynez, R. Aharoni, V. Nikolaev, T. Sellam, A. Siddhant, D. Das, and A. Parikh. Seahorse: A multilingual, multifaceted dataset for summarization evaluation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9397–9413, 2023.
A. Conneau, K.
Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. Unsupervised cross-lingual representation learning at scale. In ACL, 2020.
N. Cornille, M.-F. Moens, and F. Mai. Learning to plan for language modeling from unlabeled data. arXiv preprint arXiv:2404.00614, 2024.
M. R. Costa-jussà, P. Andrews, M. C. Megliogli, J. Chen, J. Chuang, D. Dale, C. Ropers, A. Mourachko, E. Sánchez, H. Schwenk, T. Tran, A. Turkatenko, and C. Wood. LCFO: Long context and long form output dataset and benchmarking. ArXiv, 2024.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
V. Dijk. Text and Context: Explorations in the Semantics and Pragmatics of Discourse. Longman, 1977.
M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, and H. Jégou. The Faiss library. ArXiv, abs/2401.08281, 2024. URL https://arxiv.org/pdf/2401.08281.
P.-A. Duquenne, H. Gong, and H. Schwenk. Multimodal and multilingual embeddings for large-scale speech mining. In NeurIPS, volume 34, pages 15748–15761, 2021. URL https://proceedings.neurips.cc/paper/2021/file/8466f9ace6a9acbe71f75762ffc890f1-Paper.pdf.
P.-A. Duquenne, H. Gong, B. Sagot, and H. Schwenk. T-modules: Translation modules for zero-shot cross-modal machine translation. In EMNLP, pages 5794–5806, 2022. URL https://aclanthology.org/2022.emnlp-main.391.pdf.
P.-A. Duquenne, K. Heffernan, A. Mourachko, B. Sagot, and H. Schwenk. Sonar expressive: Zero-shot expressive speech-to-speech translation, 2023a.
P.-A. Duquenne, H. Schwenk, and B. Sagot. SONAR: sentence-level multimodal and language-agnostic representations, 2023b. URL https://arxiv.org/abs/2308.11466.
P.-A. Duquenne, H.
Schwenk, and B. Sagot. Modular speech-to-text translation for zero-shot cross-modal transfer. In Interspeech, 2023c.
F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang. Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852, 2020.
M. Frohmann, I. Sterner, I. Vulić, B. Minixhofer, and M. Schedl. Segment Any Text: A universal approach for robust, efficient and adaptable sentence segmentation. In EMNLP, pages 11908–11941, 2024. URL https://aclanthology.org/2024.emnlp-main.665.
O. Gafni, A. Polyak, O. Ashual, S. Sheynin, D. Parikh, and Y. Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision, pages 89–106. Springer, 2022.
Gemini Team Google. Gemini 1.5 unlocking multimodal understanding across millions of tokens of context. ArXiv, abs/2403.05530, 2024. URL https://arxiv.org/pdf/2403.05530.
M. Golestani, S. Z. Razavi, Z. Borhanifard, F. Tahmasebian, and H. Faili. Using BERT encoding and sentence-level language model for sentence ordering. In International Conference on Text, Speech, and Dialogue, pages 318–330. Springer, 2021.
P. Goyal. Accurate, large minibatch SGD: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
M. Guo, Q. Shen, Y. Yang, H. Ge, D. Cer, G. H. Abrego, K. Stevens, N. Constant, Y.-H. Sung, B. Strope, et al. Effective parallel corpus mining using bilingual sentence embeddings. arXiv preprint arXiv:1807.11906, 2018.
T. Hang, S. Gu, C. Li, J. Bao, D. Chen, H. Hu, X. Geng, and B. Guo. Efficient diffusion training via min-snr weighting strategy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7441–7451, 2023.
T. Hasan, A. Bhattacharjee, M. S. Islam, K. Mubasshir, Y.-F. Li, Y.-B. Kang, M. S. Rahman, and R. Shahriyar. XL-sum: Large-scale multilingual abstractive summarization for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4693–4703, Online, Aug. 2021.
Association for Computational Linguistics. URL https://aclanthology.org/2021.findings-acl.413.
K. Heffernan and S. Teufel. Problem-solving recognition in scientific text. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6045–6058, Marseille, France, June 2022. European Language Resources Association. URL https://aclanthology.org/2022.lrec-1.650.
K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. Advances in neural information processing systems, 28, 2015.
J. Ho and T. Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. CoRR, abs/2006.11239, 2020. URL https://arxiv.org/abs/2006.11239.
J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, and T. Salimans. Imagen video: High definition video generation with diffusion models. ArXiv, abs/2210.02303, 2022. URL https://arxiv.org/abs/2210.02303.
M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd. spaCy: Industrial-strength Natural Language Processing in Python, 2020.
Y. Huang, Y. Zhang, O. Elachqar, and Y. Cheng. INSET: Sentence infilling with inter-sentential transformer. In ACL, pages 2502–2515, 2020. URL https://aclanthology.org/2020.acl-main.226.pdf.
D. Ippolito, D. Grangier, D. Eck, and C. Callison-Burch. Toward better storylines with sentence-level language models. ArXiv, abs/2005.05255, 2020. URL https://arxiv.org/pdf/2005.05255.
J. M. Janeiro, B. Piwowarski, P. Gallinari, and L. Barrault. Mexma: Token-level objectives improve sentence representations, 2024. URL https://arxiv.org/abs/2409.12737.
S. Ji, Z. Li, I. Paul, J. Paavola, P. Lin, P. Chen, D. O’Brien, H. Luo, H. Schütze, J. Tiedemann, et al. Emma-500: Enhancing massively multilingual adaptation of large language models. arXiv preprint arXiv:2409.17892, 2024.
A.
Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mixtral. ArXiv, abs/2401.04088, 2024. URL https://arxiv.org/pdf/2401.04088.
P. Jwalapuram, S. Joty, and X. Lin. Rethinking self-supervision objectives for generalizable coherence modeling. In ACL, pages 6044–6059, 2022. URL https://aclanthology.org/2022.acl-long.418.
T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35:26565–26577, 2022.
S. Khurana, A. Laurent, and J. Glass. Samu-xlsr: Semantically-aligned multimodal utterance-level cross-lingual speech representation. arXiv preprint arXiv:2205.08180, 2022.
D. Kingma and R. Gao. Understanding diffusion objectives as the elbo with simple data augmentation. Advances in Neural Information Processing Systems, 36, 2024.
N. Kitaev, Ł. Kaiser, and A. Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
K. Krishna, J. Wieting, and M. Iyyer. Reformulating unsupervised style transfer as paraphrase generation. In Empirical Methods in Natural Language Processing, 2020.
Y. LeCun. A path towards autonomous machine intelligence, 2022. URL https://openreview.net/pdf?id=BZ5a1r-kVsf.
C. Lee, R. Roy, M. Xu, J. Raiman, M. Shoeybi, B. Catanzaro, and W. Ping. NV-Embed: Improved techniques for training LLMs as generalist embedding models. arXiv preprint arXiv:2405.17428, 2024a.
D. Lee, C. Kim, S. Kim, M. Cho, and W.-S. Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022.
J. Lee, Z. Dai, X. Ren, B. Chen, D. Cer, J. R. Cole, K. Hui, M. Boratko, R.
Kapadia, W. Ding, Y. Luan, S. M. K. Duddu, G. H. Abrego, W. Shi, N. Gupta, A. Kusupati, P. Jain, S. R. Jonnalagadda, M.-W. Chang, and I. Naim. Gecko: Versatile text embeddings distilled from large language models, 2024b. URL https://arxiv.org/abs/2403.20327.
K. Lee and S. Sengupta. Introducing the ai research supercluster — meta’s cutting-edge ai supercomputer for ai research, 2022. URL https://ai.facebook.com/blog/ai-rsc/.
Y. Li, Q. Chen, W. Yan, W. Wang, Q. Zhang, and H. Sundaram. Advancing precise outline-conditioned text generation with task duality and explicit outline control, 2024. URL https://arxiv.org/abs/2305.14459.
Z. Li, S. Huang, Z. Zhang, Z.-H. Deng, Q. Lou, H. Huang, J. Jiao, F. Wei, W. Deng, and Q. Zhang. Dual-alignment pre-training for cross-lingual sentence embedding. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3466–3478, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.191. URL https://aclanthology.org/2023.acl-long.191.
C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
S. Lin, B. Liu, J. Li, and X. Yang. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 5404–5411, 2024.
S. Liu, H. Lu, and J. Shao. Improved residual vector quantization for high-dimensional approximate nearest neighbor search. arXiv preprint arXiv:1509.05195, 2015.
J. Lovelace, V. Kishore, Y. Chen, and K. Q. Weinberger. Diffusion guided language modeling. ArXiv, abs/2408.04220, 2024. URL https://arxiv.org/pdf/2408.04220.
A. Lozhkov, L. Ben Allal, L. von Werra, and T. Wolf. Fineweb-edu, May 2024. URL https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J.
Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
A. Marfurt and J. Henderson. Sentence-level planning for especially abstractive summarization. In ACL, pages 1–14, 2021. URL https://aclanthology.org/2021.newsum-1.1.pdf.
B. Minixhofer, J. Pfeiffer, and I. Vulić. Where’s the point? self-supervised multilingual punctuation-agnostic sentence segmentation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7215–7235, Toronto, Canada, July 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.acl-long.398.
I. Mohr, M. Krimmel, S. Sturua, M. K. Akram, A. Koukounas, M. Günther, G. Mastrapas, V. Ravishankar, J. F. Martínez, F. Wang, Q. Liu, Z. Yu, J. Fu, S. Ognawala, S. Guzman, B. Wang, M. Werk, N. Wang, and H. Xiao. Multi-task contrastive learning for 8192-token bilingual text embeddings, 2024. URL https://arxiv.org/abs/2402.17016.
N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen. A corpus and cloze evaluation for deeper understanding of commonsense stories. In NAACL, pages 839–849, June 2016. URL https://aclanthology.org/N16-1098.
S. Narayan, S. B. Cohen, and M. Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745, 2018.
J. Ni, G. H. Abrego, N. Constant, J. Ma, K. B. Hall, D. Cer, and Y. Yang. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877, 2021.
A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021.
A. Q. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. Mcgrew, I. Sutskever, and M. Chen.
GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 16784–16804. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/nichol22a.html.
M. Ning, M. Li, J. Su, A. A. Salah, and I. O. Ertugrul. Elucidating the exposure bias in diffusion models. arXiv preprint arXiv:2308.15321, 2023.
NLLB Team, M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wenzek, A. Youngblood, B. Akula, L. Barrault, G. Mejia-Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R. Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews, N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, F. Guzmán, P. Koehn, A. Mourachko, C. Ropers, S. Saleem, H. Schwenk, and J. Wang. No language left behind: Scaling human-centered machine translation, 2022. URL https://arxiv.org/abs/2207.04672.
OpenAI. GPT-4 technical report. ArXiv, abs/2303.08774, 2024. URL https://arxiv.org/pdf/2303.08774.
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https://aclanthology.org/P02-1040.
A. Parola, J. M. Lin, A. Simonsen, V. Bliksted, Y. Zhou, H. Wang, L. Inoue, K. Koelkebeck, and R. Fusaroli. Speech disturbances in schizophrenia: Assessing cross-linguistic generalizability of nlp automated measures of coherence. Schizophrenia Research, 259:59–70, 2023.
W. Peebles and S. Xie. Scalable diffusion models with transformers.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, 2018.
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019.
C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese Bert-networks. arXiv preprint arXiv:1908.10084, 2019.
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. CoRR, abs/2112.10752, 2021. URL https://arxiv.org/abs/2112.10752.
P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, A. Bapna, Z. Borsos, F. de Chaumont Quitry, P. Chen, D. E. Badawy, W. Han, E. Kharitonov, H. Muckenhirn, D. Padfield, J. Qin, D. Rozenberg, T. N. Sainath, J. Schalkwyk, M. Sharifi, M. T. Ramanovich, M. Tagliasacchi, A. Tudor, M. Velimirovic, D. Vincent, J. Yu, Y. Wang, V. Zayats, N. Zeghidour, Y. Zhang, Z. Zhang, L. Zilka, and C. H. Frank. Audiopalm: A large language model that can speak and listen. CoRR, abs/2306.12925, 2023. URL https://doi.org/10.48550/arXiv.2306.12925.
T. Salimans and J. Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
Seamless Communication, L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, M.
Duppenthaler, P.-A. Duquenne, B. Ellis, H. Elsahar, J. Haaheim, J. Hoffman, M.-J. Hwang, H. Inaguma, C. Klaiber, I. Kulikov, P. Li, D. Licht, J. Maillard, R. Mavlyutov, A. Rakotoarison, K. R. Sadagopan, A. Ramakrishnan, T. Tran, G. Wenzek, Y. Yang, E. Ye, I. Evtimov, P. Fernandez, C. Gao, P. Hansanti, E. Kalbassi, A. Kallet, A. Kozhevnikov, G. M. Gonzalez, R. S. Roman, C. Touret, C. Wong, C. Wood, B. Yu, P. Andrews, C. Balioglu, P.-J. Chen, M. R. Costa-jussà, M. Elbayad, H. Gong, F. Guzmán, K. Heffernan, S. Jain, J. Kao, A. Lee, X. Ma, A. Mourachko, B. Peloquin, J. Pino, S. Popuri, C. Ropers, S. Saleem, H. Schwenk, A. Sun, P. Tomasello, C. Wang, J. Wang, S. Wang, and M. Williamson. Seamless: Multilingual expressive and streaming speech translation. ArXiv, abs/2312.05187, 2023a. URL https://arxiv.org/abs/2312.05187.
Seamless Communication, L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, P.-A. Duquenne, H. Elsahar, H. Gong, K. Heffernan, J. Hoffman, C. Klaiber, P. Li, D. Licht, J. Maillard, A. Rakotoarison, K. R. Sadagopan, G. Wenzek, E. Ye, B. Akula, P.-J. Chen, N. E. Hachem, B. Ellis, G. M. Gonzalez, J. Haaheim, P. Hansanti, R. Howes, B. Huang, M.-J. Hwang, H. Inaguma, S. Jain, E. Kalbassi, A. Kallet, I. Kulikov, J. Lam, D. Li, X. Ma, R. Mavlyutov, B. Peloquin, M. Ramadan, A. Ramakrishnan, A. Sun, K. Tran, T. Tran, I. Tufanov, V. Vogeti, C. Wood, Y. Yang, B. Yu, P. Andrews, C. Balioglu, M. R. Costa-jussà, O. Celebi, M. Elbayad, C. Gao, F. Guzmán, J. Kao, A. Lee, A. Mourachko, J. Pino, S. Popuri, C. Ropers, S. Saleem, H. Schwenk, P. Tomasello, C. Wang, J. Wang, and S. Wang. SeamlessM4T - massively multilingual & multimodal machine translation, 2023b. URL https://arxiv.org/abs/2308.11596.
A. Shabalin, V. Meshchaninov, E. Chimbulatov, V. Lapikov, R. Kim, G. Bartosh, D. Molchanov, S. Markov, and D. Vetrov. Tencdm: Understanding the properties of diffusion model in the space of language model encodings. ArXiv, abs/2402.19097, 2024.
URL https://arxiv.org/pdf/2402.19097.
N. Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. CoRR, abs/2010.02502, 2020. URL https://arxiv.org/abs/2010.02502.
N. Srivastava, D. Kuchelev, T. M. Ngoli, K. Shetty, M. Röder, D. Moussallem, H. Zahera, and A.-C. N. Ngomo. Lola–an open-source massively multilingual large language model. arXiv preprint arXiv:2409.11272, 2024.
S. Sturua, I. Mohr, M. K. Akram, M. Günther, B. Wang, M. Krimmel, F. Wang, G. Mastrapas, A. Koukounas, N. Wang, and H. Xiao. jina-embeddings-v3: Multilingual embeddings with task lora, 2024. URL https://arxiv.org/abs/2409.10173.
H. Su, W. Shi, J. Kasai, Y. Wang, Y. Hu, M. Ostendorf, W.-t. Yih, N. A. Smith, L. Zettlemoyer, and T. Yu. One embedder, any task: Instruction-finetuned text embeddings. arXiv preprint arXiv:2212.09741, 2022.
J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
X. Sun, Z. Sun, Y. Meng, J. Li, and C. Fan. Summarize, outline, and elaborate: Long-text generation via hierarchical supervision from extractive summaries. In Proceedings of the 29th International Conference on Computational Linguistics, pages 6392–6402, Gyeongju, Republic of Korea, Oct. 2022. International Committee on Computational Linguistics. URL https://aclanthology.org/2022.coling-1.556.
Team GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Zhang, D. Rojas, G. Feng, H. Zhao, H. Lai, H. Yu, H. Wang, J. Sun, J. Zhang, J. Cheng, J. Gui, J. Tang, J. Zhang, J. Sun, J. Li, L. Zhao, L. Wu, L. Zhong, M. Liu, M. Huang, P. Zhang, Q. Zheng, R. Lu, S. Duan, S. Zhang, S. Cao, S. Yang, W. L. Tam, W. Zhao, X. Liu, X. Xia, X. Zhang, X. Gu, X. Lv, X. Liu, X. Liu, X. Yang, X. Song, X. Zhang, Y. An, Y. Xu, Y. Niu, Y. Yang, Y. Li, Y. Bai, Y. Dong, Z. Qi, Z. Wang, Z. Yang, Z. Du, Z. Hou, and Z. Wang.
ChatGLM: A family of large language models from glm-130b to glm-4 all tools. ArXiv, abs/2406.12793, 2024. URL https://arxiv.org/pdf/2406.12793.
The Llama3 team. The Llama 3 herd of models. ArXiv, abs/2407.21783, 2024. URL https://arxiv.org/pdf/2407.21783.
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023.
T. Ubukata, J. Li, and K. Tei. Diffusion model for planning: A systematic literature review. ArXiv, abs/2408.10266, 2024. URL https://arxiv.org/pdf/2408.10266.
A. Üstün, V. Aryabumi, Z.-X. Yong, W.-Y. Ko, D. D’souza, G. Onilude, N. Bhandari, S. Singh, H.-L. Ooi, A. Kayid, et al. Aya model: An instruction finetuned open-access multilingual language model. arXiv preprint arXiv:2402.07827, 2024.
C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, L. He, S. Zhao, and F. Wei. Neural codec language models are zero-shot text to speech synthesizers. CoRR, abs/2301.02111, 2023. URL https://doi.org/10.48550/arXiv.2301.02111.
L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, and F. Wei. Improving text embeddings with large language models. In ACL, pages 11897–11916, 2024a. URL https://aclanthology.org/2024.acl-long.642.
L. Wang, N. Yang, X. Huang, L. Yang, R.
Majumder, and F. Wei. Multilingual e5 text embeddings: A technical report, 2024b. URL https://arxiv.org/abs/2402.05672.
A. Warstadt, A. Singh, and S. R. Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019. doi: 10.1162/tacl_a_00290. URL https://aclanthology.org/Q19-1040.
S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019.
Y. Yang, G. H. Abrego, S. Yuan, M. Guo, Q. Shen, D. Cer, Y.-H. Sung, B. Strope, and R. Kurzweil. Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. arXiv preprint arXiv:1902.08564, 2019.
J. Ye, J. Gao, S. Gong, L. Zheng, X. Jiang, Z. Li, and L. Kong. Beyond autoregression: Discrete diffusion for complex reasoning and planning. ArXiv, abs/2410.14157, 2024. URL https://arxiv.org/pdf/2410.14157.
Y. Yin, J. Ding, K. Song, and Y. Zhang. Semformer: Transformer language models with semantic planning. ArXiv, abs/2409.11143, 2024. URL https://arxiv.org/pdf/2409.11143.
N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi. Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021.
B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
X. Zhang, Y. Zhang, D. Long, W. Xie, Z. Dai, J. Tang, H. Lin, B. Yang, P. Xie, F. Huang, M. Zhang, W. Li, and M. Zhang. mgte: Generalized long-context text representation and reranking models for multilingual text retrieval, 2024. URL https://arxiv.org/abs/2407.19669.
Y. Zhang, J. Gu, Z. Wu, S. Zhai, J. Susskind, and N. Jaitly. Planner: Generating diversified paragraph via latent language diffusion model. In NeurIPS, 2023.
A Technical consideration for data preparation

Since our modeling approach uses a fixed encoder and a fixed document segmentation method, we decided to use pre-computed SONAR embeddings instead of producing them on the fly for each training run. This allows for faster iteration on the same data mix, trading expensive GPU compute against storage capacity. As we are storing sequences of SONAR embeddings, which are fixed-size tensors of 1024 floats, the storage requirements become more demanding than storing the raw text. For one terabyte of raw text data, we need to store between fifteen and twenty terabytes of encoded data. Overall, this space-versus-compute trade-off reduces the GPU memory occupation and the compute load, and lets us iterate faster. Typically, with a single GPU we can produce around 300-400 SONAR sentence embeddings per second, whereas by loading precomputed data (potentially from remote storage) we can load over 20 thousand embeddings per second per GPU (with around 15 CPUs per GPU).

We store sequences of SONAR embeddings with 16-bit precision (FP16) in parquet datasets. Embeddings remain aligned with the segmented texts, and the parquet binary format and library ecosystem are well suited for efficiently storing and loading such complex data structures. Parquet also lets us store extra data (such as quality metrics for each sentence) and enables non-trivial last-mile data filtering and transformation.

For training the LCM, we processed around four billion documents, generating 310 billion sentences with an average of 27 tokens per sentence and an average length of 88 characters, totaling a bit more than 889 terabytes of raw text.

B Open Sourced Code

In the spirit of reproducibility, we release the training, evaluation and data processing code for the LCM under an open source license. It is available at https://github.com/facebookresearch/large_concept_model.
The training code is based on the Fairseq2 framework (Balioglu, 2023), which allowed us to build and iterate over the different model architectures discussed above. While Fairseq2 shares its name with the popular fairseq toolchain, its API architecture is different. It is not a monolithic toolchain but a set of modules that can be composed, which allows us to build different architectures side by side and easily share training components.

We also release our evaluation framework so that the evaluation tasks reported in Section 3.1 and the comparisons between the LCM and other models can easily be reproduced. The evaluation framework provides a clear abstraction between predictors, tasks and data loading, which again makes it modular and lets us describe a set of tasks to be evaluated. The evaluation framework can be run locally or distributed over a SLURM cluster to run evaluations at scale.

Finally, we release an updated version of stopes (https://github.com/facebookresearch/stopes) to simplify large-scale data pre-processing on a SLURM cluster. This was used to run the sentence segmentation and SONAR encoding described in Section 2.2. The stopes data processing framework handles scheduling and monitoring large numbers of jobs on SLURM, or runs everything locally for small-scale jobs. It provides an API compatible with ray.data (https://docs.ray.io/) that makes it easy to process large datasets in blocks and apply transform functions over them. This makes our code reusable outside of a SLURM cluster, as it can also be used with a ray.io cluster.

C System prompt: Generation of Topic Descriptions

You are a topic description generator. Your job is to read an extract of text and then generate a topic description. The extract may be well formed or not. The topic description you will write will be at most one sentence in length, and use as few words as possible. However, it can not be generic and it can not contain any profanity.
Here is an example of an extract, an ideal topic description, and some examples of bad topic descriptions:

Example extract: “One day, one neighborhood of the city was completely devasted. Glass windows were shattered, shops turned upside down, and many civilian killed. Superman instantly recognized the signature of one of his old enemies, Voltar, who he had barely beaten in the past. This was a message to him: "I challenge you! Come find me!""

An example of a good topic description: An old enemy of Superman’s, Voltar, appeared and challenged him.
An example of a bad topic description: Superman
An example of a bad topic description: Voltar
An example of a bad topic description:

D User prompt: LLM As a Judge - Coherence

Below is a text extract. Your task is to analyze the extract and assign a coherence score between 0 and 5 inclusive, where:
0: The text is completely incoherent and lacks any logical connection.
1: The text has some minor connections, but overall it is disjointed and hard to follow.
2: The text has some coherence, but it is still difficult to understand due to unclear relationships between ideas.
3: The text is moderately coherent, with some clear connections between ideas, but may lack depth or clarity.
4: The text is highly coherent, with clear and logical connections between ideas, making it easy to follow.
5: The text is extremely coherent, with a clear and concise structure, making it effortless to understand.
You will provide a score ONLY. Do NOT also provide an explanation.
The extract: <extract>
After examining the extract, the coherence score between 0 and 5 inclusive is:
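
The coherence prompt above asks the judge for a score only, so a caller still has to fill the <extract> slot and turn the reply into an integer between 0 and 5. A minimal sketch of those two steps follows. The helper names fill_prompt and parse_score are hypothetical, the paper does not describe its own parsing code, and no particular LLM API is assumed: the judge's reply is treated as a plain string.

```python
import re
from typing import Optional

# Placeholder used in the Appendix D prompt for the text to be judged.
EXTRACT_SLOT = "<extract>"


def fill_prompt(template: str, extract: str) -> str:
    """Substitute the text to be judged into the template's <extract> slot.

    Hypothetical helper: the template is passed in by the caller, so the
    prompt wording stays exactly as given in the appendix.
    """
    if EXTRACT_SLOT not in template:
        raise ValueError("template has no <extract> slot")
    return template.replace(EXTRACT_SLOT, extract, 1)


def parse_score(reply: str) -> Optional[int]:
    """Return the first number in the reply if it is an integer in 0..5.

    Assumption: a reply whose first number is fractional or out of range is
    rejected (None) rather than rounded or clipped.
    """
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    token = match.group(0)
    if not token.isdigit():
        return None  # fractional value such as "4.5"
    value = int(token)
    return value if 0 <= value <= 5 else None
```

Returning None for malformed replies, instead of guessing, lets the caller decide whether to retry the judge or drop the sample from the average.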