loader
Generating audio...
Extracting PDF content...

arxiv

Paper 2207.09238

Formal Algorithms for Transformers

Authors: Mary Phuong, Marcus Hutter

Published: 2022-07-19

Abstract:

This document aims to be a self-contained, mathematically precise overview of transformer architectures and algorithms (*not* results). It covers what transformers are, how they are trained, what they are used for, their key architectural components, and a preview of the most prominent models. The reader is assumed to be familiar with basic ML terminology and simpler neural network architectures such as MLPs.

Paper Content: on Alphaxiv
Page 1: 19 July 2022 Formal Algorithms for Transformers Mary Phuong1and Marcus Hutter1 1DeepMind This document aims to be a self-contained, mathematically precise overview of transformer architec- turesandalgorithms( notresults). Itcoverswhattransformersare,howtheyaretrained,whattheyare used for, their key architectural components, and a preview of the most prominent models. The reader is assumed to be familiar with basic ML terminology and simpler neural network architectures such as MLPs. Keywords: formalalgorithms,pseudocode,transformers,attention,encoder,decoder,BERT,GPT,Gopher, tokenization, training, inference. Contents 1 Introduction 1 2 Motivation 1 3 Transformers and Typical Tasks 3 4 Tokenization: HowTextisRepresented 4 5 Architectural Components 4 6 Transformer Architectures 7 7 Transformer Training and Inference 8 8 Practical Considerations 9 A References 9 B List of Notation 16 A famous colleague once sent an actually very well-written paper he was quite proud of to a fa- mous complexity theorist. His answer: “I can’t find a theorem in the paper. I have no idea what this paper is about.” 1. Introduction Transformersaredeepfeed-forwardartificialneu- ral networks with a (self)attention mechanism. They have been tremendously successful in natu- ral language processing tasks and other domains. Sincetheirinception5yearsago[ VSP¸17],many variantshavebeensuggested[ LWLQ21]. Descrip- tions are usually graphical, verbal, partial, or in- cremental. Despite their popularity, it seems no pseudocode has ever been published for any vari- ant. Contrast this to other fields of computer science,evento“cousin”disciplinereinforcement learning [MKS¸13, SBB18, EMK¸21]. This report intends to rectify the situation for Transformers. Itaimstobeaself-contained, com-plete, precise and compact overview of trans- former architectures and formal algorithms (but notresults). ItcoverswhatTransformersare(Sec- tion 6), how they are trained (Section 7), what they’re used for (Section 3), their key architec- tural components (Section 5), tokenization (Sec- tion 4), and a preview of practical considerations (Section 8) and the most prominent models. The essentially complete pseudocode is about 50 lines, compared to thousands of lines of ac- tual real source code. We believe these formal algorithms will be useful for theoreticians who require compact, complete, and precise formu- lations, experimental researchers interested in implementing a Transformer from scratch, and encourageauthorstoaugmenttheirpaperortext book with formal Transformer algorithms (Sec- tion 2). The reader is assumed to be familiar with ba- sic ML terminology and simpler neural network architectures such as MLPs. In short, a (formally inclined) reader, upon un- derstanding the contents of this document, will have a solid grasp of transformers: they will be ready to read and contribute to the literature on the topic as well as implement their own Trans- former using the pseudocode as templates. 2. Motivation The true story above the introduction describes quite well the feeling we have when browsing Corresponding author(s): fbuiphuong,mhutter g@deepmind.com ©2022 DeepMind. All rights reservedarXiv:2207.09238v1 [cs.LG] 19 Jul 2022 Page 2: Formal Algorithms for Transformers many Deep Learning (DL) papers; unable to fig- ure out the algorithmic suggestions exactly. For practitioners, the papers may be sufficiently de- tailed, but the precision needed by theoreticians is usually higher. For reasons not entirely clear to us, the DL community seems shy of providing pseudocode for their neural network models. Be- low we describe the SOTA in DL paper writing andargueforthevalueofformalalgorithms. The reader already convinced about their merit can without loss skip this section. ThelackofscientificprecisionanddetailinDL publications. Deep Learning has been tremen- dously successful in the last 5 to 10 years with thousands of papers published every year. Many describe only informally how they change a pre- vious model, Some 100+ page papers contain only a few lines of prose informally describing the model [ RBC¸21]. At best there are some high-level diagrams. No pseudocode. No equa- tions. Noreferencetoapreciseexplanationofthe model. One may argue that most DL models are minor variations of a few core architectures, such as the Transformer [ VSP¸17], so a reference aug- mented by a description of the changes should suffice. Thiswouldbetrueif(a)thechangeswere describedprecisely,(b)thereferencearchitecture hasbeendescribedpreciselyelsewhere, and(c)a reference is given to this description. Some if not all three are lacking in most DL papers. To the best of our knowledge no-one has even provided pseudocode for the famous Transformer and its encoder/decoder-only variations. Interfacingalgorithms. Equallyimportantare proper explanations of how these networks are trained and used, but sometimes it is not even clear what the inputs and outputs and potential side-effectsofthedescribedmodelare. Ofcourse someone experienced would usually be able to correctly guess, but this is not a particularly sci- entific approach. The experimental section in publications often does not explain what is ac- tually fed into the algorithms and how. If there is some explanation in the methods section, it is often disconnected from what is described in the experimental section, possibly due to differentauthors writing the different sections. The core algorithms for the models should be accompa- nied by the wrapper algorithms that call them, e.g. (pre)training, fine-tuning, prompting, infer- ence, deployment. Sometimes this is simple, but sometimes the magic happens there. In any case, if these things are not formally stated they re- main unclear. Again, if the setup is standard and has been formally explained elsewhere, a simple reference will do. Source code vs pseudocode. Providing open source code is very useful, but not a proper sub- stitute for formal algorithms. There is a massive difference between a (partial) Python dump and well-crafted pseudocode. A lot of abstraction and clean-up is necessary: remove boiler plate code, use mostly single-letter variable names, replace code by math expressions wherever possible, e.g. replace loops by sums, remove (some) optimiza- tions, etc. A well-crafted pseudocode is often less than a page and still essentially complete, com- pared to often thousands of lines of real source code. This is hard work no-one seems to be will- ing to do. Of course a forward process of first designing algorithms and write up pseudocode on paper, and then implementing it is fine too, but few DL practitioners seem to work that way. Examples of good neural network pseu- docode and mathematics and explanations. Multi-Layer Perceptrons (MLPs) are usually well- described in many papers, e.g. [ MPCB14,BFT17, JGH18], though also without pseudocode. For a rare text-book example of pseudocode for a non- trivialneuralnetworkarchitecture,seeAlgorithm S2 of [SGBK¸21], which constitutes a complete, i.e.essentially executable , pseudocode of just 25 lines based on a 350-line Python Colab toy im- plementation, which itself is based on a proper 1000+ line implementation. This work aims to do the same for Transform- ers: The whole decoder-only Transformer GPT Algorithm 10 based on attention Algorithms 4 and 5 and normalization Algorithm 6 including training Algorithm 13 and prompting and infer- ence Algorithm 14 all-together is less than 50 lines of pseudocode, compared to e.g. the 2000- 2 Page 3: Formal Algorithms for Transformers line self-contained C-implementation [Bel21]. [Ala19] is a great blog-post explaining Trans- formers and [ EGKZ21] describes the attention mechanism to sufficient mathematical precision to allow proving properties about it, but neither provides pseudocode. [ Elh21] is an attempt to understand Transformers by reverse-engineering the computations they perform and interpreting them as circuits. Motivation. But does anyone actually need pseudocode and what would they be useful for (wesometimesgetasked)? Wefindtheabsenceof pseudocodeinDLandthisquestionquiteperplex- ing, but apparently it requires answering. Pro- viding such pseudocode can be useful for many purposes: •They can be used as templates and adapted to precisely describe future variations, and therewith set a new standard in DL publish- ing. We explicitly encourage the reader to copy and adapt them to their needs and cite the original as “adapted from [PH22]”. •Having all that matters on one page in front of you makes it easier to develop new varia- tions compared to reading prose or scrolling through 1000s of lines of actual code. •They can be used as a basis for new imple- mentations from scratch, e.g. in different programming languages, without having to wade through and reverse-engineer existing idiosyncratic real source code. •They may establish notational convention, which eases communication and reading fu- ture variations. •The process of converting source code into pseudocode can exhibit implementation er- rors (as it e.g. did in [SGBK¸21]). •Theoreticians need compact, complete, and precise representations for reasoning and ul- timatelyprovingpropertiesaboutalgorithms. They are often unwilling or unable to re- verse engineer code, or guess the meaning of words or fancy diagrams. With this motivation in mind, the following five sections formally describe all aspects of trans- former architectures, training, and inference.3. Transformers and Typical Tasks Transformers are neural network models that ex- cel at natural language processing, or more gen- erallyatmodellingsequentialdata. Twocommon types of tasks they are used for are sequence mod- ellingandsequence-to-sequence prediction . Notation. Let𝑉denote a finite set, called a vo- cabulary,oftenidentifiedwith »𝑁V¼:=f1”“““”𝑁 Vg. This could be words or letters, but typically are sub-words, called tokens. Let 𝒙𝑥»1 :¼  𝑥»1¼𝑥»2¼“““𝑥»¼2𝑉be a sequence of tokens, e.g. a sentence or a paragraph or a document. Un- likeinPython, weusearraysstartingfrom 1, and 𝑥»1 :¼includes𝑥»¼. Foramatrix 𝑀2ℝ𝑑𝑑0,we write𝑀»𝑖”:¼2ℝ𝑑0forthe𝑖throwand 𝑀»:” 𝑗¼2ℝ𝑑 for the𝑗-th column. We use matrix column vec- tor convention more common in mathematics, compared to the default row vector matrix in the transformer literature, i.e. our matrices are transposed. See Appendix B for a complete list of notation. Chunking. The predominant paradigm in ma- chinelearningis(still)learningfromindependent and identically distributed (i.i.d.) data. Even for sequence modelling for practical reasons this tra- dition is upheld. The training data may natu- rallybeacollectionof(independent)articles, but eventhen,somemayexceedthemaximalcontext lengthmaxtransformers can handle. In this case, an article is crudely broken into shorter chunks of lengthmax. Sequence modelling (DTransformer). Given a vocabulary 𝑉, let 𝒙𝑛2𝑉for𝑛2»𝑁data¼be a dataset of sequences (imagined to be) sampled i.i.d. from some distribution 𝑃over𝑉. The goal is to learn an estimate ˆ𝑃of the distribution 𝑃¹𝒙º. In practice, the distribution estimate is often de- composed via the chain rule as ˆ𝑃¹𝒙º=ˆ𝑃𝜽¹𝑥»1¼º ˆ𝑃𝜽¹𝑥»2¼j𝑥»1¼º ˆ𝑃𝜽¹𝑥»¼j𝒙»1 :1¼º, where 𝜽 consists of all neural network parameters to be learned. The goal is to learn a distribution over a single token 𝑥»𝑡¼given its preceding tokens 𝑥»1 :𝑡1¼as context. 3 Page 4: Formal Algorithms for Transformers Examples include e.g. language modelling, RL policy distillation, or music generation. Sequence-to-sequence (seq2seq) prediction (EDTransformer). Given a vocabulary 𝑉and an i.i.d. dataset of sequence pairs ¹𝒛𝑛”𝒙𝑛º𝑃, where𝑃is a distribution over 𝑉𝑉, learn an estimate of the conditional distribution 𝑃¹𝒙j𝒛º. In practice, the conditional distribution estimate is often decomposed as ˆ𝑃¹𝒙j𝒛º=ˆ𝑃𝜽¹𝑥»1¼j𝒛º ˆ𝑃𝜽¹𝑥»2¼j𝑥»1¼”𝒛ºˆ𝑃𝜽¹𝑥»¼j𝒙»1 :1¼”𝒛º. Examples include translation ( 𝒛=a sentence in English, 𝒙=the same sentence in German), question answering ( 𝒛=question, 𝒙=the corre- sponding answer), text-to-speech ( 𝒛=a piece of text, 𝒙=a voice recording of someone reading the text). Classification(ETransformer). Givenavocab- ulary𝑉and a set of classes »𝑁C¼, let¹𝒙𝑛”𝑐𝑛º 2 𝑉»𝑁C¼for𝑛2»𝑁data¼be an i.i.d. dataset of sequence-class pairs sampled from 𝑃¹𝒙”𝑐º. The goal in classification is to learn an estimate of the conditional distribution 𝑃¹𝑐j𝒙º. Examples include e.g. sentiment classification, spam filtering, toxicity classification. 4. Tokenization: How Text is Repre- sented In the context of natural language tasks, tok- enization refers to how a piece of text such as “My grandma makes the best apple pie.” is rep- resented as a sequence of vocabulary elements (calledtokens). Character-level tokenization. One possible choice is to let 𝑉be the English alphabet (plus punctuation). In the example above, we’d get a sequenceoflength36: [‘M’,‘y’,‘ ’,...]. Character- level tokenization tends to yield very long se- quences. Word-level tokenization. Another choice would be to let 𝑉consist of all English words(plus punctuation). In the example above, we’d get a sequence of length 7: [‘My ’, ‘grandma ’, ‘makes ’, ...]. Word-level tokenization tends to require a very large vocabulary and cannot deal with new words at test time. Subword tokenization. This is the method used in practice nowadays: 𝑉is a set of com- monly occurring word segments like ‘cious’, ‘ing’, ‘pre’. Commonwordslike‘is’areoftenaseparate token, and single characters are also included in 𝑉to ensure all words are expressible. There are in fact many ways to do subword to- kenization. One of the simplest and most success- ful ones is Byte Pair Encoding [ Gag94,SHB16] used in GPT-2 [RWC¸19]. Final vocabulary and text representation. Given a choice of tokenization / vocabulary, each vocabulary element is assigned a unique index inf1”2”“““”𝑁 V3g. A number of special tokens are then added to the vocabulary. The number of special tokens varies, and here we will con- sider three: mask_token :=𝑁V2, used in masked language modelling (see Algorithm 12); bos_token :=𝑁V1, used for representing the beginning of sequence; and eos_token :=𝑁V, used for representing the end of sequence. The complete vocabulary has 𝑁V=j𝑉jelements. A piece of text is represented as a sequence of indices (called token IDs ) corresponding to its (sub)words, preceded by bos_token and fol- lowed by eos_token . 5. Architectural Components The following are the neural network build- ing blocks (functions with learnable parameters) from which transformers are made. Full archi- tectures featuring these building blocks are pre- sentedinthenextsection. (Byaslightabuseofno- tation, we identify 𝑉with the setf1”2”“““”𝑁 Vg.) Token embedding. The token embedding learns to represent each vocabulary element as a vector in ℝ𝑑e; see Algorithm 1. 4 Page 5: Formal Algorithms for Transformers Algorithm 1: Token embedding. Input:𝑣2𝑉»𝑁V¼, a token ID. Output: 𝒆2ℝ𝑑e, the vector representation of the token. Parameters: 𝑾𝒆2ℝ𝑑e𝑁V, the token embedding matrix. 1return 𝒆=𝑾𝒆»:”𝑣¼ Positional embedding. The positional embed- ding learns to represent a token’s position in a sequenceasavectorin ℝ𝑑e. Forexample,theposi- tionofthefirsttokeninasequenceisrepresented by a (learned) vector 𝑾𝒑»:”1¼, the position of the second token is represented by another (learned) vector 𝑾𝒑»:”2¼,etc. Thepurposeofthepositional embedding is to allow a Transformer to make sense of word ordering; in its absence the rep- resentation would be permutation invariant and the model would perceive sequences as “bags of words” instead. Learned positional embeddings require that in- putsequencelengthisatmostsomefixednumber max(thesizeofthelearnedpositionalembedding matrixmustbefiniteandfixedinadvanceoftrain- ing). An intuitive explanation of how this works can be found at [ Ala18]. For pseudocode, see Algorithm 2. Not all transformers make use of learnedposi- tional embeddings, some use a hard-coded map- ping 𝑾𝒑:ℕ!ℝ𝑑einstead [ Ker21]. Such hard- coded positional embeddings can (theoretically) handle arbitrarily long sequences. The original Transformer [VSP¸17] uses 𝑾𝒑»2𝑖1”𝑡¼=sin¹𝑡2𝑖𝑑e maxº” 𝑾𝒑»2𝑖”𝑡¼=cos¹𝑡2𝑖𝑑e maxº“ for0 𝑖𝑑e2. The positional embedding of a token is usu- ally added to the token embedding to form a token’s initial embedding. For the 𝑡-th token of a sequence 𝒙, the embedding is 𝒆=𝑾𝒆»:”𝑥»𝑡¼¼¸𝑾𝒑»:”𝑡¼“ (1) Attention. Attention is the main architectural component of transformers. It enables a neuralAlgorithm 2: Positional embedding. Input:2»max¼, position of a token in the sequence. Output: 𝒆𝒑2ℝ𝑑e, the vector representation of the position. Parameters: 𝑾𝒑2ℝ𝑑emax, the positional embedding matrix. 1return 𝒆𝒑=𝑾𝒑»:”¼ network to make use of contextual information (e.g. preceding text or the surrounding text) for predicting the current token. Onahighlevel, attentionworksasfollows: the token currently being predicted is mapped to a queryvector 𝒒2ℝ𝑑attn, and the tokens in the context are mapped to keyvectors 𝒌𝑡2ℝ𝑑attnand valuevectors 𝒗𝑡2ℝ𝑑value. The inner products 𝒒|𝒌𝑡 areinterpretedasthedegreetowhichtoken 𝑡2𝑉 is important for predicting the current token 𝑞 – they are used to derive a distribution over the contexttokens,whichisthenusedtocombinethe value vectors. An intuitive explanation how this achievesattentioncanbefoundat[ Ala18,Ala19]. The precise algorithm is given in Algorithm 3. Algorithm 3: Basicsingle-queryattention. Input: 𝒆2ℝ𝑑in, vector representation of the current token Input: 𝒆𝑡2ℝ𝑑in, vector representations of context tokens 𝑡2»𝑇¼. Output: ˜𝒗2ℝ𝑑out, vector representation of the token and context combined. Parameters: 𝑾𝒒”𝑾𝒌2ℝ𝑑attn𝑑in, 𝒃𝒒”𝒃𝒌2ℝ𝑑attn, the query and key linear projections. Parameters: 𝑾𝒗2ℝ𝑑out𝑑in,𝒃𝒗2ℝ𝑑out, the value linear projection. 1𝒒 𝑾𝒒𝒆¸𝒃𝒒 28𝑡:𝒌𝑡 𝑾𝒌𝒆𝑡¸𝒃𝒌 38𝑡:𝒗𝑡 𝑾𝒗𝒆𝑡¸𝒃𝒗 48𝑡:𝛼𝑡=exp¹𝒒|𝒌𝑡p 𝑑attnºÍ 𝑢exp¹𝒒|𝒌𝑢p 𝑑attnº 5return ˜𝒗=Í𝑇 𝑡=1𝛼𝑡𝒗𝑡 There are many ways the basic attention mech- anism is used in transformers. We list some of the most common variants below. 5 Page 6: Formal Algorithms for Transformers It will be useful to define the softmax function for matrix arguments, as well as a Mask matrix: softmax¹𝑨º»𝑡z”𝑡x¼:=exp𝐴»𝑡z”𝑡x¼Í 𝑡exp𝐴»𝑡”𝑡x¼”(2) Mask»𝑡z”𝑡x¼=1for bidirectional attention »»𝑡z𝑡x¼¼for unidirectional att. (3) Algorithm4: ˜𝑽 Attention¹𝑿”𝒁jW𝒒𝒌𝒗”Maskº /* Computes a single (masked) self- or cross- attention head. */ Input: 𝑿2ℝ𝑑xx”𝒁2ℝ𝑑zz, vector representations of primary and context sequence. Output: ˜𝑽2ℝ𝑑outx, updated representations of tokens in 𝑿, folding in information from tokens in 𝒁. Parameters: W𝒒𝒌𝒗consisting of: 𝑾𝒒2ℝ𝑑attn𝑑x,𝒃𝒒2ℝ𝑑attn 𝑾𝒌2ℝ𝑑attn𝑑z,𝒃𝒌2ℝ𝑑attn 𝑾𝒗2ℝ𝑑out𝑑z,𝒃𝒗2ℝ𝑑out. Hyperparameters: Mask2f0”1gzx,"(3) 1𝑸 𝑾𝒒𝑿¸𝒃𝒒1|[[𝑸uery2ℝ𝑑attnx]] 2𝑲 𝑾𝒌𝒁¸𝒃𝒌1|[[𝑲ey2ℝ𝑑attnz]] 3𝑽 𝑾𝒗𝒁¸𝒃𝒗1|[[𝑽alue2ℝ𝑑outz]] 4𝑺 𝑲|𝑸 [[𝑺core2ℝzx]] 58𝑡z”𝑡x”if:Mask»𝑡z”𝑡x¼then𝑆»𝑡z”𝑡x¼ 1 6return ˜𝑽=𝑽softmax 𝑺p 𝑑attn Bidirectional / unmasked self-attention. Given a sequence of token representations, this variant applies attention to each token, treating all tokens in the sequence as the context. See Algorithm 4, called with token sequence 𝒁=𝑿 and no masking (Mask 1). Unidirectional / masked self-attention. Given a sequence of token representations, this variant applies attention to each token, treating all preceding tokens (including itself) as the con- text. Futuretokensaremaskedout,sothiscausal auto-regressive version can be used for online prediction. See Algorithm 4, called with token sequence 𝒁=𝑿and Mask»𝑡z”𝑡x¼:=»»𝑡z𝑡x¼¼. For this Mask, the output ˜𝑽»:”1 :𝑡¼only dependson𝑿»:”1 :𝑡¼, hence can be used to predict 𝑿»:”𝑡¸1¼. Cross-attention. Given two sequences of to- ken representations (often in the context of a sequence-to-sequence task), this variant applies attention to each token of the primary token se- quence 𝑿, treating the second token sequence 𝒁as the context. See Algorithm 4, called with Mask1. Whiletheoutput ˜𝑽andinputsequences 𝑿have the same length x, the context sequence 𝒁can have different length z. Multi-headattention. Theattentionalgorithm presented so far (Algorithm 4) describes the op- eration of a single attention head . In practice, transformers run multiple attention heads (with separate learnable parameters) in parallel and combine their outputs; this is called multi-head attention; see Algorithm 5 Algorithm5: ˜𝑽 MHAttention¹𝑿”𝒁jW”Maskº /* Computes Multi-Head (masked) self- or cross- attention layer. */ Input: 𝑿2ℝ𝑑xx”𝒁2ℝ𝑑zz, vector representations of primary and context sequence. Output: ˜𝑽2ℝ𝑑outx, updated representations of tokens in 𝑿, folding in information from tokens in 𝒁. Hyperparameters: 𝐻, number of attention heads Hyperparameters: Mask2f0”1gzx,"(3) Parameters: Wconsisting of Forℎ2»𝐻¼,Wℎ 𝒒𝒌𝒗consisting of: j𝑾ℎ 𝒒2ℝ𝑑attn𝑑x,𝒃ℎ 𝒒2ℝ𝑑attn, j𝑾ℎ 𝒌2ℝ𝑑attn𝑑z,𝒃ℎ 𝒌2ℝ𝑑attn, j𝑾ℎ 𝒗2ℝ𝑑mid𝑑z,𝒃ℎ 𝒗2ℝ𝑑mid. 𝑾𝒐2ℝ𝑑out𝐻𝑑mid,𝒃𝒐2ℝ𝑑out. 1Forℎ2»𝐻¼: 2 𝒀ℎ Attention¹𝑿”𝒁jWℎ 𝒒𝒌𝒗”Maskº 3𝒀 »𝒀1;𝒀2;“““;𝒀𝐻¼ 4return ˜𝑽=𝑾𝒐𝒀¸𝒃𝒐1| 6 Page 7: Formal Algorithms for Transformers Algorithm 6: ˆ𝒆 layer_norm¹𝒆j𝜸”𝜷º /* Normalizes layer activations 𝒆. */ Input: 𝒆2ℝ𝑑e, neural network activations. Output: b𝒆2ℝ𝑑e, normalized activations. Parameters: 𝜸”𝜷2ℝ𝑑e, element-wise scale and offset. 1𝒎 Í𝑑e 𝑖=1𝒆»𝑖¼𝑑e 2𝑣 Í𝑑e 𝑖=1¹𝒆»𝑖¼𝒎º2𝑑e 3return b𝒆=𝒆𝒎p𝑣 𝜸¸𝜷, where denotes element-wise multiplication. Algorithm 7: Unembedding. Input: 𝒆2ℝ𝑑e, a token encoding. Output: 𝒑2Δ¹𝑉º, a probability distribution over the vocabulary. Parameters: 𝑾𝒖2ℝ𝑁V𝑑e, the unembedding matrix. 1return 𝒑=softmax¹𝑾𝒖𝒆º Layer normalisation. Layer normalisation ex- plicitlycontrolsthemeanandvarianceofindivid- ual neural network activations; the pseudocode is given in Algorithm 6. Some transformers use a simpler and more computationally efficient ver- sion of layer normalization setting 𝒎=𝜷=0, called root mean square layer normalisation, or RMSnorm. Unembedding. The unembedding learns to convert a vector representation of a token and its context into a distribution over the vocabulary elements; see Algorithm 7. The algorithm de- scribes an independently learned unembedding matrix, but note that sometimes the unembed- ding matrix is instead fixed to be the transpose of the embedding matrix. 6. Transformer Architectures Thissectionpresentsafewprominenttransformer architectures, based on attention Algorithms 4 and 5 and using normalization Algorithm 6, in roughly historical order: •EDT [VSP¸17] The original sequence-to- sequence / Encoder-Decoder Transformer,Algorithms 8, 11 and 15. •BERT [DCLT19], which is an instance of an encoder-only transformer (encoder-only means that it is derived from the encoder- decoder architecture by dropping the de- coder part), Algorithms 9 and 12. •GPT [RWC¸19,BMR¸20], which is an in- stance of a decoder-only transformer, Algo- rithms 10, 13 and 14. While the main architectural difference between BERT and GPT is in attention masking, they also differ in a number of less important ways: e.g. they use different activation functions and the layer-norms are positioned differently. We in- cludedthesedifferencesinthepseudocodetostay faithful to the original algorithms, but note that different transformer architectures may adopt these selectively. To simplify notation, we denote by Wthe en- tire set of parameters (query, key, value and out- put linear projections) required by a multi-head attention layer: W:=©­­­ «𝑾ℎ 𝒒2ℝ𝑑attn𝑑x”𝒃ℎ 𝒒2ℝ𝑑attn” ℎ2»𝐻¼ 𝑾ℎ 𝒌2ℝ𝑑attn𝑑z”𝒃ℎ 𝒌2ℝ𝑑attn” ℎ2»𝐻¼ 𝑾ℎ 𝒗2ℝ𝑑mid𝑑z”𝒃ℎ 𝒗2ℝ𝑑mid” ℎ2»𝐻¼ 𝑾𝒐2ℝ𝑑out𝐻𝑑mid”𝒃𝒐2ℝ𝑑outª®®® ¬ (4) Encoder-decoder / sequence-to-sequence transformer [VSP¸17].This is the very first transformer. It was originally used for sequence- to-sequence tasks (machine translation), which is why it is more complicated than its successors. The idea behind the architecture is as follows: First, the context sequence is encoded using bidi- rectional multi-head attention. The output of this ‘encoder’ part of the network is a vector represen- tation of each context token, taking into account the entire context sequence. Second, the primary sequence is encoded. Each token in the primary sequence is allowed to use information from the encoded context sequence, as well as primary se- quence tokens that precede it. See Algorithm 8 for more details. 7 Page 8: Formal Algorithms for Transformers Encoder-only transformer: BERT [DCLT19]. BERT is a bidirectional transformer trained on the task of masked language modelling. Given a piece of text with some tokens masked out, the goalistocorrectlyrecoverthemasked-outtokens. The original use of BERT was to learn generally useful text representations, which could then be adapted for various downstream NLP tasks. The maskingisnotperformedviatheMaskparameter but differently: During training each input token isreplacedwithprobability 𝑝maskbyadummyto- kenmask_token ,andevaluationisbasedonthe reconstruction probability of these knocked-out tokens (see Algorithm 12). The BERT architecture resembles the encoder part of the seq2seq transformer (hence ‘encoder- only’). It is described in detail in Algorithm 9. It uses the GELU nonlinearity instead of ReLU: GELU¹𝑥º=𝑥ℙ𝑋N¹0”1º»𝑋  𝑥¼“(5) (When called with vector or matrix arguments, GELU is applied element-wise.) Decoder-only transformers: GPT-2 [RWC¸19], GPT-3 [BMR¸20], Gopher [RBC¸21].GPT-2 and GPT-3 are large language models developed byOpenAI,andGopherisalargelanguagemodel developed by DeepMind. They all have similar architectures and are trained by autoregressive language modelling: Given an incomplete sen- tenceorparagraph,thegoalistopredictthenext token. The main difference from BERT is that GPT/Gopher use unidirectional attention instead of bidirectional attention; they also apply layer- norms in slightly different order. See Algorithm 10 for the pseudocode of GPT-2. GPT-3 is identical except larger, and replaces dense attention in Line 6 by sparse attention, i.e. each token only uses a subset of the full context. Gopher also deviates only slightly from the GPT-2architecture: itreplaceslayernorminlines 5,7and10byRMSnorm( 𝑚=𝜷=0),andituses different positional embeddings.Multi-domain decoder-only transformer: Gato [RZP¸22].Gato is a multi-modal multi- task transformer built by DeepMind. It is a single neural network that can play Atari, navigate 3D environments, control a robotic arm, caption images, have conversations, and more. Under the hood, each modality is converted into a sequence prediction problem by a sepa- rate tokenization and embedding method; for exampleimagesaredividedintonon-overlapping 1616patches, ordered in raster order (left-to- right, top-to-bottom) and processed by a ResNet block to obtain a vector representation. The actual Gato architecture is then a decoder- onlytransformerliketheoneinAlgorithm10,but where Line 2 is replaced with modality-specific embedding code. 7. Transformer Training and Infer- ence This section lists the pseudocode for various algo- rithms for training and using transformers: •EDTraining() Algorithm 11 shows how to train a sequence-to-sequence transformer (the original Transformer [VSP¸17]). •ETraining() Algorithm 12 shows how to train a transformer on the task of masked language modelling (like BERT [DCLT19]). •DTraining() Algorithm 13 shows how to train a transformer on the task of next to- ken prediction (like CPT-x [ BMR¸20] and Gopher [RBC¸21]). •DInference() Algorithm 14 shows how to prompt a transformer trained on next token prediction (like GPT-x [ BMR¸20]). The tem- perature parameter 𝜏interpolates between most likely continuation ( 𝜏=0), faithful sampling ( 𝜏=1º, and uniform sampling ¹𝜏=1º. •EDInference() Algorithm 15 shows how to use a sequence-to-sequence transformer for prediction. Gradientdescent. ThedescribedtrainingAlgo- rithms 11 to 13 use Stochastic Gradient Descent 8 Page 9: Formal Algorithms for Transformers (SGD) 𝜽 𝜽𝜂rloss¹𝜽ºto minimize the log loss (aka cross entropy) as the update rule. Com- putation of the gradient is done via automatic differentiation tools; see [ BPRS18, Table 5]. In practice, vanilla SGD is usually replaced by some more refined variation such as RMSProp or Ada- Grad or others [ Rud16]. Adam [ KB15] is used most often these days. 8. Practical Considerations Whilethevanillatransformersprovidedheremay work in practice, a variety of “tricks” have been developed over the years to improve the perfor- mance of deep neural networks in general and transformers in particular [LWLQ21]: •Data preprocessing: cleaning, augmen- tation [FGW¸21], adding noise, shuffling [Lem21] (besides tokenization and chunk- ing). •Architecture: sparse layers, weight sharing (besides attention). •Training: improved optimizers, mini- batches, batch normalization, learning rate scheduling, weight initialization, pre- training, ensembling, multi-task, adversarial (besides layer normalization) [Sut15]. •Regularization: weight decay, early stop- ping,cross-validation,dropout,addingnoise [MBM20, TZ22]. •Inference: scratchpad prompting, few-shot prompting,chainofthought,majorityvoting [LAD¸22]. •Others. A. References [Ala18]Jay Alammar. The Illustrated Trans- former. http://jalammar.github.io/illustrated- transformer/, 2018. [Ala19]Jay Alammar. The Illustrated GPT- 2 (Visualizing Transformer Language Mod- els). http://jalammar.github.io/illustrated- gpt2/, 2019. [Bel21]Fabrice Bellard. NNCP v2: Loss- less Data Compression with Transformer. https://bellard.org/libnc/gpt2tc.html, 2021.[BFT17] PeterLBartlett, DylanJFoster, andMa- tus J Telgarsky. Spectrally-normalized margin bounds for neural networks. NeurIPS, 2017. [BMR¸20]Tom Brown, Benjamin Mann, Nick Ryder, et al. Language models are few-shot learners. NeurIPS, 2020. [BPRS18] Atilim Gunes Baydin, Barak A. Pearl- mutter, Alexey Andreyevich Radul, and Jef- frey Mark Siskind. Automatic Differentiation inMachineLearning: ASurvey. JournalofMa- chine Learning Research , 18(153):1–43, 2018. [DCLT19] Jacob Devlin, Ming-Wei Chang, Ken- ton Lee, and Kristina Toutanova. BERT: Pre- training of deep bidirectional transformers for language understanding. ACL, 2019. [EGKZ21] Benjamin L. Edelman, Surbhi Goel, Sham Kakade, and Cyril Zhang. Inductive Bi- ases and Variable Creation in Self-Attention Mechanisms. arXiv:2110.10090 [cs, stat] , Oc- tober 2021. [Elh21]Nelson Elhage. A Math- ematical Framework for Trans- former Circuits. https://transformer- circuits.pub/2021/framework/index.html, 2021. [EMK¸21]Yonathan Efroni, Dipendra Misra, Ak- shayKrishnamurthy,AlekhAgarwal,andJohn Langford.ProvableRLwithExogenousDistrac- tors via Multistep Inverse Dynamics, March 2021. [FGW¸21]Steven Y Feng, Varun Gangal, Ja- son Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 , pages 968–988, 2021. [Gag94] Philip Gage. A new algorithm for data compression. Dr. Dobbs / C Users Journal , 12(2):23–38, 1994. [JGH18] Arthur Jacot, Franck Gabriel, and Clé- ment Hongler. Neural tangent kernel: Conver- gence and generalization in neural networks. NeurIPS, 2018. 9 Page 10: Formal Algorithms for Transformers [KB15] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015. [Ker21] Jonathan Kernes. Mas- ter Positional Encoding: Part I. https://towardsdatascience.com/master- positional-encoding-part-i-63c05d90a0c3, March 2021. [LAD¸22]Aitor Lewkowycz, Anders An- dreassen, David Dohan, Ethan Dyer, Henryk Michalewski,VinayRamasesh,AmbroseSlone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving Quantitative Reasoning Problems with Language Models. arXiv:2206.14858 [cs] , June 2022. [Lem21] Chris Lemke. Data preprocessing in NLP. https://towardsdatascience.com/data- preprocessing-in-nlp-c371d53ba3e0, July 2021. [LWLQ21] Tianyang Lin, Yuxin Wang, Xi- angyang Liu, and Xipeng Qiu. A Survey of Transformers, June 2021. [MBM20] Reza Moradi, Reza Berangi, and Behrouz Minaei. A survey of regularization strategies for deep models. Artificial Intelli- gence Review , 53(6):3947–3986, August 2020. [MKS¸13]Volodymyr Mnih, Koray Kavukcuoglu, DavidSilver, AlexGraves, IoannisAntonoglou, Daan Wierstra, and Martin Riedmiller. Play- ing Atari with Deep Reinforcement Learning, December 2013. [MPCB14] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural net- works.NeurIPS, 2014. [PH22]M. Phuong and M. Hutter. Formal al- gorithms for transformers. Technical report, DeepMind, London, UK, 2022. LaTeX source available at http://arXiv.org [RBC¸21]Jack W. Rae, Sebastian Borgeaud, Trevor Cai, et al. Scaling language models: Methods, analysis & insights from training go- pher.arXiv:2112.11446 , 2021.[Rud16] Sebastian Ruder. An overview of gradient descent optimization algo- rithms. https://ruder.io/optimizing-gradient- descent/, January 2016. [RWC¸19]Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Languagemodelsareunsupervised multitask learners. OpenAI blog , 2019. [RZP¸22]Scott Reed, Konrad Żołna, Emilio Parisotto, et al. A generalist agent. arXiv:2205.06175 , 2022. [SBB18] Richard S. Sutton, Andrew G. Barto, and Francis Bach. Reinforcement Learning: An Introduction . MIT Press, Cambridge, Mas- sachusetts, second edition edition edition, November 2018. [SGBK¸21]Eren Sezener, Agnieszka Grabska- Barwińska,DimitarKostadinov,MaximeBeau, Sanjukta Krishnagopal, David Budden, Mar- cus Hutter, Joel Veness, Matthew Botvinick, ClaudiaClopath,MichaelHäusser,andPeterE. Latham. A rapid and efficient learning rule for biological neural circuits. Technical report, DeepMind, London, UK, 2021. [SHB16] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In 54th Annual Meeting of the Association for Compu- tational Linguistics , pages 1715–1725. Asso- ciation for Computational Linguistics (ACL), 2016. [Sut15]Ilya Sutskever. A Brief Overview of Deep Learning. http://yyue.blogspot.com/2015/01/a- brief-overview-of-deep-learning.html, January 2015. [TZ22]Yingjie Tian and Yuqi Zhang. A compre- hensive survey on regularization strategies in machinelearning. InformationFusion ,80:146– 166, April 2022. [VSP¸17]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017. 10 Page 11: Formal Algorithms for Transformers Algorithm 8: 𝑷 EDTransformer¹𝒛”𝒙j𝜽º /* Encoder-decoder transformer forward pass */ Input: 𝒛”𝒙2𝑉, two sequences of token IDs. Output: 𝑷2¹0”1º𝑁Vlength¹𝒙º, where the 𝑡-th column of 𝑷represents ˆ𝑃𝜽¹𝑥»𝑡¸1¼j𝒙»1 :𝑡¼”𝒛º. Hyperparameters: max”𝐿enc”𝐿dec”𝐻”𝑑 e”𝑑mlp2ℕ Parameters: 𝜽includes all of the following parameters: 𝑾𝒆2ℝ𝑑e𝑁V,𝑾𝒑2ℝ𝑑emax, the token and positional embedding matrices. For𝑙2»𝐿enc¼: jWenc 𝑙, multi-head encoder attention parameters for layer 𝑙, see (4), j𝜸1 𝑙”𝜷1 𝑙”𝜸2 𝑙”𝜷2 𝑙2ℝ𝑑e, two sets of layer-norm parameters, j𝑾𝑙 mlp12ℝ𝑑mlp𝑑e,𝒃𝑙 mlp12ℝ𝑑mlp,𝑾𝑙 mlp22ℝ𝑑e𝑑mlp,𝒃𝑙 mlp22ℝ𝑑e, MLP parameters. For𝑙2»𝐿dec¼: jWdec 𝑙, multi-head decoder attention parameters for layer 𝑙, see (4), jWed 𝑙, multi-head cross-attention parameters for layer 𝑙, see (4), j𝜸3 𝑙”𝜷3 𝑙”𝜸4 𝑙”𝜷4 𝑙”𝜸5 𝑙”𝜷5 𝑙2ℝ𝑑e, three sets of layer-norm parameters, j𝑾𝑙 mlp32ℝ𝑑mlp𝑑e,𝒃𝑙 mlp32ℝ𝑑mlp,𝑾𝑙 mlp42ℝ𝑑e𝑑mlp,𝒃𝑙 mlp42ℝ𝑑e, MLP parameters. 𝑾𝒖2ℝ𝑁V𝑑e, the unembedding matrix. /* Encode the context sequence: */ 1z length¹𝒛º 2for𝑡2»z¼:𝒆𝑡 𝑾𝒆»:”𝑧»𝑡¼¼¸𝑾𝒑»:”𝑡¼ 3𝒁 »𝒆1”𝒆2”“““𝒆z¼ 4for𝑙=1”2”“““”𝐿 encdo 5 𝒁 𝒁¸MHAttention¹𝒁jWenc 𝑙”Mask1º 6for𝑡2»z¼:𝒁»:”𝑡¼ layer_norm¹𝒁»:”𝑡¼j𝜸1 𝑙”𝜷1 𝑙º 7 𝒁 𝒁¸𝑾𝑙 mlp2ReLU¹𝑾𝑙 mlp1𝒁¸𝒃𝑙 mlp11|º¸𝒃𝑙 mlp21| 8for𝑡2»z¼:𝒁»:”𝑡¼ layer_norm¹𝒁»:”𝑡¼j𝜸2 𝑙”𝜷2 𝑙º 9end /* Decode the primary sequence, conditioning on the context: */ 10x length¹𝒙º 11for𝑡2»x¼:𝒆𝑡 𝑾𝒆»:”𝑥»𝑡¼¼¸𝑾𝒑»:”𝑡¼ 12𝑿 »𝒆1”𝒆2”“““𝒆x¼ 13for𝑖=1”2”“““”𝐿 decdo 14 𝑿 𝑿¸MHAttention¹𝑿jWdec 𝑙”Mask»𝑡”𝑡0¼»»𝑡𝑡0¼¼º 15for𝑡2»x¼:𝑿»:”𝑡¼ layer_norm¹𝑿»:”𝑡¼j𝜸3 𝑙”𝜷3 𝑙º 16 𝑿 𝑿¸MHAttention¹𝑿”𝒁jWed 𝑙”Mask1º 17for𝑡2»x¼:𝑿»:”𝑡¼ layer_norm¹𝑿»:”𝑡¼j𝜸4 𝑙”𝜷4 𝑙º 18 𝑿 𝑿¸𝑾𝑙 mlp4ReLU¹𝑾𝑙 mlp3𝑿¸𝒃𝑙 mlp31|º¸𝒃𝑙 mlp41| 19for𝑡2»x¼:𝑿»:”𝑡¼ layer_norm¹𝑿»:”𝑡¼j𝜸5 𝑙”𝜷5 𝑙º 20end /* Derive conditional probabilities and return: */ 21return 𝑷=softmax¹𝑾𝒖𝑿º 11 Page 12: Formal Algorithms for Transformers Algorithm 9: 𝑷 ETransformer (𝒙j𝜽º /* BERT, an encoder-only transformer, forward pass */ Input: 𝒙2𝑉, a sequence of token IDs. Output: 𝑷2¹0”1º𝑁Vlength¹𝒙º, where each column of 𝑷is a distribution over the vocabulary. Hyperparameters: max”𝐿”𝐻”𝑑 e”𝑑mlp”𝑑f2ℕ Parameters: 𝜽includes all of the following parameters: 𝑾𝒆2ℝ𝑑e𝑁V,𝑾𝒑2ℝ𝑑emax, the token and positional embedding matrices. For𝑙2»𝐿¼: jW𝑙, multi-head attention parameters for layer 𝑙, see (4), j𝜸1 𝑙”𝜷1 𝑙”𝜸2 𝑙”𝜷2 𝑙2ℝ𝑑e, two sets of layer-norm parameters, j𝑾𝑙 mlp12ℝ𝑑mlp𝑑e,𝒃𝑙 mlp12ℝ𝑑mlp,𝑾𝑙 mlp22ℝ𝑑e𝑑mlp,𝒃𝑙 mlp22ℝ𝑑e, MLP parameters. 𝑾𝒇2ℝ𝑑f𝑑e”𝒃𝒇2ℝ𝑑f,𝜸”𝜷2ℝ𝑑f, the final linear projection and layer-norm parameters. 𝑾𝒖2ℝ𝑁V𝑑e, the unembedding matrix. 1 length¹𝒙º 2for𝑡2»¼:𝒆𝑡 𝑾𝒆»:”𝑥»𝑡¼¼¸𝑾𝒑»:”𝑡¼ 3𝑿 »𝒆1”𝒆2”“““𝒆¼ 4for𝑙=1”2”“““”𝐿do 5 𝑿 𝑿¸MHAttention¹𝑿jW𝑙”Mask1º 6for𝑡2»¼:𝑿»:”𝑡¼ layer_norm¹𝑿»:”𝑡¼j𝜸1 𝑙”𝜷1 𝑙º 7 𝑿 𝑿¸𝑾𝑙 mlp2GELU¹𝑾𝑙 mlp1𝑿¸𝒃𝑙 mlp11|º¸𝒃𝑙 mlp21| 8for𝑡2»¼:𝑿»:”𝑡¼ layer_norm¹𝑿»:”𝑡¼j𝜸2 𝑙”𝜷2 𝑙º 9end 10𝑿 GELU¹𝑾𝒇𝑿¸𝒃𝒇1|º 11for𝑡2»¼:𝑿»:”𝑡¼ layer_norm¹𝑿»:”𝑡¼j𝜸”𝜷º 12return 𝑷=softmax¹𝑾𝒖𝑿º 12 Page 13: Formal Algorithms for Transformers Algorithm 10: 𝑷 DTransformer (𝒙j𝜽º /* GPT, a decoder-only transformer, forward pass */ Input: 𝒙2𝑉, a sequence of token IDs. Output: 𝑷2¹0”1º𝑁Vlength¹𝒙º, where the 𝑡-th column of 𝑷represents ˆ𝑃𝜽¹𝑥»𝑡¸1¼j𝒙»1 :𝑡¼º. Hyperparameters: max”𝐿”𝐻”𝑑 e”𝑑mlp2ℕ Parameters: 𝜽includes all of the following parameters: 𝑾𝒆2ℝ𝑑e𝑁V,𝑾𝒑2ℝ𝑑emax, the token and positional embedding matrices. For𝑙2»𝐿¼: jW𝑙, multi-head attention parameters for layer 𝑙, see (4), j𝜸1 𝑙”𝜷1 𝑙”𝜸2 𝑙”𝜷2 𝑙2ℝ𝑑e, two sets of layer-norm parameters, j𝑾𝑙 mlp12ℝ𝑑mlp𝑑e,𝒃𝑙 mlp12ℝ𝑑mlp,𝑾𝑙 mlp22ℝ𝑑e𝑑mlp,𝒃𝑙 mlp22ℝ𝑑e, MLP parameters. 𝜸”𝜷2ℝ𝑑e, final layer-norm parameters. 𝑾𝒖2ℝ𝑁V𝑑e, the unembedding matrix. 1 length¹𝒙º 2for𝑡2»¼:𝒆𝑡 𝑾𝒆»:”𝑥»𝑡¼¼¸𝑾𝒑»:”𝑡¼ 3𝑿 »𝒆1”𝒆2”“““𝒆¼ 4for𝑙=1”2”“““”𝐿do 5for𝑡2»¼:˜𝑿»:”𝑡¼ layer_norm¹𝑿»:”𝑡¼j𝜸1 𝑙”𝜷1 𝑙º 6 𝑿 𝑿¸MHAttention¹˜𝑿jW𝑙”Mask»𝑡”𝑡0¼=»»𝑡𝑡0¼¼º 7for𝑡2»¼:˜𝑿»:”𝑡¼ layer_norm¹𝑿»:”𝑡¼j𝜸2 𝑙”𝜷2 𝑙º 8 𝑿 𝑿¸𝑾𝑙 mlp2GELU¹𝑾𝑙 mlp1˜𝑿¸𝒃𝑙 mlp11|º¸𝒃𝑙 mlp21| 9end 10for𝑡2»¼:𝑿»:”𝑡¼ layer_norm¹𝑿»:”𝑡¼j𝜸”𝜷º 11return 𝑷=softmax¹𝑾𝒖𝑿º 13 Page 14: Formal Algorithms for Transformers Algorithm11: ˆ𝜽 EDTraining (𝒛1:𝑁data”𝒙1:𝑁data”𝜽) /* Training a seq2seq model */ Input:f¹𝒛𝑛”𝒙𝑛ºg𝑁data 𝑛=1, a dataset of sequence pairs. Input: 𝜽, initial transformer parameters. Output: ˆ𝜽, the trained parameters. Hyperparameters: 𝑁epochs2ℕ” 𝜂2¹0”1º 1for𝑖=1”2”“““”𝑁 epochsdo 2for𝑛=1”2”“““𝑁 datado 3 length¹𝒙𝑛º 4 𝑷¹𝜽º EDTransformer¹𝒛𝑛”𝒙𝑛j𝜽º 5 loss¹𝜽º=Í1 𝑡=1log𝑃¹𝜽º»𝑥𝑛»𝑡¸1¼”𝑡¼ 6 𝜽 𝜽𝜂rloss¹𝜽º 7end 8end 9return ˆ𝜽=𝜽 Algorithm12: ˆ𝜽 ETraining (𝒙1:𝑁data”𝜽) /* Training by masked language modelling */ Input:f𝒙𝑛g𝑁data 𝑛=1, a dataset of sequences. Input: 𝜽, initial encoder-only transformer parameters. Output: ˆ𝜽, the trained parameters. Hyperparameters: 𝑁epochs2ℕ” 𝜂2 ¹0”1º”𝑝mask2¹0”1º 1for𝑖=1”2”“““”𝑁 epochsdo 2for𝑛=1”2”“““”𝑁 datado 3 length¹𝒙𝑛º 4 for𝑡=1”2”“““”do 5 ˜𝑥𝑛»𝑡¼ mask_token or𝑥𝑛»𝑡¼ randomly with probability 𝑝maskor1𝑝mask 6 end 7 ˜𝑇 f𝑡2»¼: ˜𝑥𝑛»𝑡¼=mask_tokeng 8 𝑷¹𝜽º ETransformer¹˜𝒙𝑛j𝜽º 9 loss¹𝜽º=Í 𝑡2˜𝑇log𝑃¹𝜽º»𝑥𝑛»𝑡¼”𝑡¼ 10 𝜽 𝜽𝜂rloss¹𝜽º 11end 12end 13return ˆ𝜽=𝜽Algorithm13: ˆ𝜽 DTraining (𝒙1:𝑁data”𝜽) /* Training next token prediction */ Input:f𝒙𝑛g𝑁data 𝑛=1, a dataset of sequences. Input: 𝜽, initial decoder-only transformer parameters. Output: ˆ𝜽, the trained parameters. Hyperparameters: 𝑁epochs2ℕ” 𝜂2¹0”1º 1for𝑖=1”2”“““”𝑁 epochsdo 2for𝑛=1”2”“““𝑁 datado 3 length¹𝒙𝑛º 4 𝑷¹𝜽º DTransformer¹𝒙𝑛j𝜽º 5 loss¹𝜽º=Í1 𝑡=1log𝑃¹𝜽º»𝑥𝑛»𝑡¸1¼”𝑡¼ 6 𝜽 𝜽𝜂rloss¹𝜽º 7end 8end 9return ˆ𝜽=𝜽 Algorithm 14: 𝒚 DInference (𝒙”ˆ𝜽) /* Prompting a trained model and using it for prediction. */ Input:Trained transformer parameters ˆ𝜽. Input: 𝒙2𝑉, a prompt. Output: 𝒚2𝑉, the transformer’s continuation of the prompt. Hyperparameters: gen2ℕ” 𝜏2¹0”1º 1 length¹𝒙º 2for𝑖=1”2”“““ gendo 3 𝑷 DTransformer¹𝒙jˆ𝜽º 4 𝒑 𝑷»:”¸𝑖1¼ 5sample a token 𝑦from 𝒒/𝒑1𝜏 6 𝒙 »𝒙”𝑦¼ 7end 8return 𝒚=𝒙»¸1 :¸gen¼ 14 Page 15: Formal Algorithms for Transformers Algorithm 15: ˆ𝒙 EDInference (𝒛”ˆ𝜽) /* Using a trained seq2seq model for prediction. */ Input:A seq2seq transformer and trained parameters ˆ𝜽of the transformer. Input: 𝒛2𝑉, input sequence, e.g. a sentence in English. Output: ˆ𝒙2𝑉, output sequence, e.g. the sentence in German. Hyperparameters: 𝜏2¹0”1º 1ˆ𝒙 »bos_token¼ 2𝑦 0 3while𝑦≠eos_token do 4 𝑷 EDTransformer¹𝒛”ˆ𝒙jˆ𝜽º 5 𝒑 𝑷»:”length¹ˆ𝒙º¼ 6sample a token 𝑦from 𝒒/𝒑1𝜏 7 ˆ𝒙 »ˆ𝒙”𝑦¼ 8end 9return ˆ𝒙 15 Page 16: Formal Algorithms for Transformers B. List of Notation Symbol Type Explanation »𝑁¼ :=f1”“““”𝑁gset of integers 1”2”“““”𝑁1”𝑁 𝑖” 𝑗2ℕ generic integer indices 𝑉 »𝑁V¼vocabulary 𝑁V2ℕ vocabulary size 𝑉=Ð1 =0𝑉set of token sequences; elements include e.g. sentences or documents max2ℕ maximum sequence length 2»max¼length of token sequence 𝑡2»¼index of token in a sequence 𝑑“““2ℕ dimension of various vectors 𝒙𝑥»1 :¼ 𝑥»1¼𝑥»2¼“““𝑥»¼2𝑉primary token sequence 𝒛𝑧»1 :¼ 𝑧»1¼𝑧»2¼“““𝑧»¼2𝑉context token sequence 𝑀»𝑖” 𝑗¼ 2 ℝ entry𝑀𝑖𝑗of matrix𝑀2ℝ𝑑𝑑0 𝑀»𝑖”:¼𝑀»𝑖¼2ℝ𝑑0𝑖-th row of matrix 𝑀2ℝ𝑑𝑑0 𝑀»:” 𝑗¼ 2 ℝ𝑑𝑗-th column of matrix 𝑀2ℝ𝑑𝑑0 𝒆2ℝ𝑑evector representation / embedding of a token 𝑿2ℝ𝑑e𝑥encoded primary token sequence 𝒁2ℝ𝑑e𝑧encoded context token sequence Mask2ℝ𝑧𝑥masking matrix, it determines the attention context for each token 𝐿”𝐿enc”𝐿dec2ℕ number of network (encoder, decoder) layers 𝑙2»𝐿¼index of network layer 𝐻2ℕ number of attention heads ℎ2»𝐻¼index of attention head 𝑁data2ℕ (i.i.d.) sample size 𝑛2»𝑁data¼index of sample sequence 𝜂2¹0”1ºlearning rate 𝜏2¹0”1ºtemperature; it controls the diversity-plausibility trade-off at inference 𝑾𝒆2ℝ𝑑e𝑁Vtoken embedding matrix 𝑾𝒑2ℝ𝑑emaxpositional embedding matrix 𝑾𝒖2ℝ𝑁V𝑑eunembedding matrix 𝑾𝒒2ℝ𝑑attn𝑑xquery weight matrix 𝒃𝒒2ℝ𝑑attnquery bias 𝑾𝒌2ℝ𝑑attn𝑑zkey weight matrix 𝒃𝒌2ℝ𝑑attnkey bias 𝑾𝒗2ℝ𝑑out𝑑zvalue weight matrix 𝒃𝒗2ℝ𝑑outvalue bias W𝒒𝒌𝒗 collection of above parameters of a single-head attention layer 𝑾𝒐2ℝ𝑑out𝐻𝑑midoutput weight matrix 𝒃𝒐2ℝ𝑑outoutput bias W collection of above parameters of a multi-head attention layer, see eq. (4) 𝑾mlp2ℝ𝑑1𝑑2weight matrix corresponding to an MLP layer in a Transformer 𝒃mlp2ℝ𝑑1bias corresponding to an MLP layer in a Transformer 𝜸2ℝ𝑑elayer-norm learnable scale parameter 𝜷2ℝ𝑑elayer-norm learnable offset parameter 𝜽”ˆ𝜽2ℝ𝑑collection of all learnable / learned Transformer parameters 16

---