Page 1:
19 July 2022
Formal Algorithms for Transformers
Mary Phuong1and Marcus Hutter1
1DeepMind
This document aims to be a self-contained, mathematically precise overview of transformer architec-
turesandalgorithms( notresults). Itcoverswhattransformersare,howtheyaretrained,whattheyare
used for, their key architectural components, and a preview of the most prominent models. The reader
is assumed to be familiar with basic ML terminology and simpler neural network architectures such as
MLPs.
Keywords: formalalgorithms,pseudocode,transformers,attention,encoder,decoder,BERT,GPT,Gopher,
tokenization, training, inference.
Contents
1 Introduction 1
2 Motivation 1
3 Transformers and Typical Tasks 3
4 Tokenization: HowTextisRepresented 4
5 Architectural Components 4
6 Transformer Architectures 7
7 Transformer Training and Inference 8
8 Practical Considerations 9
A References 9
B List of Notation 16
A famous colleague once sent an actually very
well-written paper he was quite proud of to a fa-
mous complexity theorist. His answer: “I can’t find
a theorem in the paper. I have no idea what this
paper is about.”
1. Introduction
Transformersaredeepfeed-forwardartificialneu-
ral networks with a (self)attention mechanism.
They have been tremendously successful in natu-
ral language processing tasks and other domains.
Sincetheirinception5yearsago[ VSP¸17],many
variantshavebeensuggested[ LWLQ21]. Descrip-
tions are usually graphical, verbal, partial, or in-
cremental. Despite their popularity, it seems no
pseudocode has ever been published for any vari-
ant. Contrast this to other fields of computer
science,evento“cousin”disciplinereinforcement
learning [MKS¸13, SBB18, EMK¸21].
This report intends to rectify the situation for
Transformers. Itaimstobeaself-contained, com-plete, precise and compact overview of trans-
former architectures and formal algorithms (but
notresults). ItcoverswhatTransformersare(Sec-
tion 6), how they are trained (Section 7), what
they’re used for (Section 3), their key architec-
tural components (Section 5), tokenization (Sec-
tion 4), and a preview of practical considerations
(Section 8) and the most prominent models.
The essentially complete pseudocode is about
50 lines, compared to thousands of lines of ac-
tual real source code. We believe these formal
algorithms will be useful for theoreticians who
require compact, complete, and precise formu-
lations, experimental researchers interested in
implementing a Transformer from scratch, and
encourageauthorstoaugmenttheirpaperortext
book with formal Transformer algorithms (Sec-
tion 2).
The reader is assumed to be familiar with ba-
sic ML terminology and simpler neural network
architectures such as MLPs.
In short, a (formally inclined) reader, upon un-
derstanding the contents of this document, will
have a solid grasp of transformers: they will be
ready to read and contribute to the literature on
the topic as well as implement their own Trans-
former using the pseudocode as templates.
2. Motivation
The true story above the introduction describes
quite well the feeling we have when browsing
Corresponding author(s): fbuiphuong,mhutter g@deepmind.com
©2022 DeepMind. All rights reservedarXiv:2207.09238v1 [cs.LG] 19 Jul 2022
Page 2:
Formal Algorithms for Transformers
many Deep Learning (DL) papers; unable to fig-
ure out the algorithmic suggestions exactly. For
practitioners, the papers may be sufficiently de-
tailed, but the precision needed by theoreticians
is usually higher. For reasons not entirely clear
to us, the DL community seems shy of providing
pseudocode for their neural network models. Be-
low we describe the SOTA in DL paper writing
andargueforthevalueofformalalgorithms. The
reader already convinced about their merit can
without loss skip this section.
ThelackofscientificprecisionanddetailinDL
publications. Deep Learning has been tremen-
dously successful in the last 5 to 10 years with
thousands of papers published every year. Many
describe only informally how they change a pre-
vious model, Some 100+ page papers contain
only a few lines of prose informally describing
the model [ RBC¸21]. At best there are some
high-level diagrams. No pseudocode. No equa-
tions. Noreferencetoapreciseexplanationofthe
model. One may argue that most DL models are
minor variations of a few core architectures, such
as the Transformer [ VSP¸17], so a reference aug-
mented by a description of the changes should
suffice. Thiswouldbetrueif(a)thechangeswere
describedprecisely,(b)thereferencearchitecture
hasbeendescribedpreciselyelsewhere, and(c)a
reference is given to this description. Some if not
all three are lacking in most DL papers. To the
best of our knowledge no-one has even provided
pseudocode for the famous Transformer and its
encoder/decoder-only variations.
Interfacingalgorithms. Equallyimportantare
proper explanations of how these networks are
trained and used, but sometimes it is not even
clear what the inputs and outputs and potential
side-effectsofthedescribedmodelare. Ofcourse
someone experienced would usually be able to
correctly guess, but this is not a particularly sci-
entific approach. The experimental section in
publications often does not explain what is ac-
tually fed into the algorithms and how. If there
is some explanation in the methods section, it is
often disconnected from what is described in the
experimental section, possibly due to differentauthors writing the different sections. The core
algorithms for the models should be accompa-
nied by the wrapper algorithms that call them,
e.g. (pre)training, fine-tuning, prompting, infer-
ence, deployment. Sometimes this is simple, but
sometimes the magic happens there. In any case,
if these things are not formally stated they re-
main unclear. Again, if the setup is standard and
has been formally explained elsewhere, a simple
reference will do.
Source code vs pseudocode. Providing open
source code is very useful, but not a proper sub-
stitute for formal algorithms. There is a massive
difference between a (partial) Python dump and
well-crafted pseudocode. A lot of abstraction and
clean-up is necessary: remove boiler plate code,
use mostly single-letter variable names, replace
code by math expressions wherever possible, e.g.
replace loops by sums, remove (some) optimiza-
tions, etc. A well-crafted pseudocode is often less
than a page and still essentially complete, com-
pared to often thousands of lines of real source
code. This is hard work no-one seems to be will-
ing to do. Of course a forward process of first
designing algorithms and write up pseudocode
on paper, and then implementing it is fine too,
but few DL practitioners seem to work that way.
Examples of good neural network pseu-
docode and mathematics and explanations.
Multi-Layer Perceptrons (MLPs) are usually well-
described in many papers, e.g. [ MPCB14,BFT17,
JGH18], though also without pseudocode. For a
rare text-book example of pseudocode for a non-
trivialneuralnetworkarchitecture,seeAlgorithm
S2 of [SGBK¸21], which constitutes a complete,
i.e.essentially executable , pseudocode of just 25
lines based on a 350-line Python Colab toy im-
plementation, which itself is based on a proper
1000+ line implementation.
This work aims to do the same for Transform-
ers: The whole decoder-only Transformer GPT
Algorithm 10 based on attention Algorithms 4
and 5 and normalization Algorithm 6 including
training Algorithm 13 and prompting and infer-
ence Algorithm 14 all-together is less than 50
lines of pseudocode, compared to e.g. the 2000-
2
Page 3:
Formal Algorithms for Transformers
line self-contained C-implementation [Bel21].
[Ala19] is a great blog-post explaining Trans-
formers and [ EGKZ21] describes the attention
mechanism to sufficient mathematical precision
to allow proving properties about it, but neither
provides pseudocode. [ Elh21] is an attempt to
understand Transformers by reverse-engineering
the computations they perform and interpreting
them as circuits.
Motivation. But does anyone actually need
pseudocode and what would they be useful for
(wesometimesgetasked)? Wefindtheabsenceof
pseudocodeinDLandthisquestionquiteperplex-
ing, but apparently it requires answering. Pro-
viding such pseudocode can be useful for many
purposes:
•They can be used as templates and adapted
to precisely describe future variations, and
therewith set a new standard in DL publish-
ing. We explicitly encourage the reader to
copy and adapt them to their needs and cite
the original as “adapted from [PH22]”.
•Having all that matters on one page in front
of you makes it easier to develop new varia-
tions compared to reading prose or scrolling
through 1000s of lines of actual code.
•They can be used as a basis for new imple-
mentations from scratch, e.g. in different
programming languages, without having to
wade through and reverse-engineer existing
idiosyncratic real source code.
•They may establish notational convention,
which eases communication and reading fu-
ture variations.
•The process of converting source code into
pseudocode can exhibit implementation er-
rors (as it e.g. did in [SGBK¸21]).
•Theoreticians need compact, complete, and
precise representations for reasoning and ul-
timatelyprovingpropertiesaboutalgorithms.
They are often unwilling or unable to re-
verse engineer code, or guess the meaning
of words or fancy diagrams.
With this motivation in mind, the following
five sections formally describe all aspects of trans-
former architectures, training, and inference.3. Transformers and Typical Tasks
Transformers are neural network models that ex-
cel at natural language processing, or more gen-
erallyatmodellingsequentialdata. Twocommon
types of tasks they are used for are sequence mod-
ellingandsequence-to-sequence prediction .
Notation. Let𝑉denote a finite set, called a vo-
cabulary,oftenidentifiedwith »𝑁V¼:=f1𝑁 Vg.
This could be words or letters, but typically are
sub-words, called tokens. Let 𝒙𝑥»1 :¼
𝑥»1¼𝑥»2¼𝑥»¼2𝑉be a sequence of tokens, e.g.
a sentence or a paragraph or a document. Un-
likeinPython, weusearraysstartingfrom 1, and
𝑥»1 :¼includes𝑥»¼. Foramatrix 𝑀2ℝ𝑑𝑑0,we
write𝑀»𝑖:¼2ℝ𝑑0forthe𝑖throwand 𝑀»: 𝑗¼2ℝ𝑑
for the𝑗-th column. We use matrix column vec-
tor convention more common in mathematics,
compared to the default row vector matrix in
the transformer literature, i.e. our matrices are
transposed. See Appendix B for a complete list of
notation.
Chunking. The predominant paradigm in ma-
chinelearningis(still)learningfromindependent
and identically distributed (i.i.d.) data. Even for
sequence modelling for practical reasons this tra-
dition is upheld. The training data may natu-
rallybeacollectionof(independent)articles, but
eventhen,somemayexceedthemaximalcontext
lengthmaxtransformers can handle. In this case,
an article is crudely broken into shorter chunks
of lengthmax.
Sequence modelling (DTransformer). Given
a vocabulary 𝑉, let 𝒙𝑛2𝑉for𝑛2»𝑁data¼be a
dataset of sequences (imagined to be) sampled
i.i.d. from some distribution 𝑃over𝑉. The goal
is to learn an estimate ˆ𝑃of the distribution 𝑃¹𝒙º.
In practice, the distribution estimate is often de-
composed via the chain rule as ˆ𝑃¹𝒙º=ˆ𝑃𝜽¹𝑥»1¼º
ˆ𝑃𝜽¹𝑥»2¼j𝑥»1¼º ˆ𝑃𝜽¹𝑥»¼j𝒙»1 :1¼º, where 𝜽
consists of all neural network parameters to be
learned. The goal is to learn a distribution over
a single token 𝑥»𝑡¼given its preceding tokens
𝑥»1 :𝑡1¼as context.
3
Page 4:
Formal Algorithms for Transformers
Examples include e.g. language modelling, RL
policy distillation, or music generation.
Sequence-to-sequence (seq2seq) prediction
(EDTransformer). Given a vocabulary 𝑉and
an i.i.d. dataset of sequence pairs ¹𝒛𝑛𝒙𝑛º𝑃,
where𝑃is a distribution over 𝑉𝑉, learn an
estimate of the conditional distribution 𝑃¹𝒙j𝒛º.
In practice, the conditional distribution estimate
is often decomposed as ˆ𝑃¹𝒙j𝒛º=ˆ𝑃𝜽¹𝑥»1¼j𝒛º
ˆ𝑃𝜽¹𝑥»2¼j𝑥»1¼𝒛ºˆ𝑃𝜽¹𝑥»¼j𝒙»1 :1¼𝒛º.
Examples include translation ( 𝒛=a sentence
in English, 𝒙=the same sentence in German),
question answering ( 𝒛=question, 𝒙=the corre-
sponding answer), text-to-speech ( 𝒛=a piece of
text, 𝒙=a voice recording of someone reading
the text).
Classification(ETransformer). Givenavocab-
ulary𝑉and a set of classes »𝑁C¼, let¹𝒙𝑛𝑐𝑛º 2
𝑉»𝑁C¼for𝑛2»𝑁data¼be an i.i.d. dataset of
sequence-class pairs sampled from 𝑃¹𝒙𝑐º. The
goal in classification is to learn an estimate of the
conditional distribution 𝑃¹𝑐j𝒙º.
Examples include e.g. sentiment classification,
spam filtering, toxicity classification.
4. Tokenization: How Text is Repre-
sented
In the context of natural language tasks, tok-
enization refers to how a piece of text such as
“My grandma makes the best apple pie.” is rep-
resented as a sequence of vocabulary elements
(calledtokens).
Character-level tokenization. One possible
choice is to let 𝑉be the English alphabet (plus
punctuation). In the example above, we’d get a
sequenceoflength36: [‘M’,‘y’,‘ ’,...]. Character-
level tokenization tends to yield very long se-
quences.
Word-level tokenization. Another choice
would be to let 𝑉consist of all English words(plus punctuation). In the example above, we’d
get a sequence of length 7: [‘My ’, ‘grandma ’,
‘makes ’, ...]. Word-level tokenization tends to
require a very large vocabulary and cannot deal
with new words at test time.
Subword tokenization. This is the method
used in practice nowadays: 𝑉is a set of com-
monly occurring word segments like ‘cious’, ‘ing’,
‘pre’. Commonwordslike‘is’areoftenaseparate
token, and single characters are also included in
𝑉to ensure all words are expressible.
There are in fact many ways to do subword to-
kenization. One of the simplest and most success-
ful ones is Byte Pair Encoding [ Gag94,SHB16]
used in GPT-2 [RWC¸19].
Final vocabulary and text representation.
Given a choice of tokenization / vocabulary, each
vocabulary element is assigned a unique index
inf12𝑁 V3g. A number of special tokens
are then added to the vocabulary. The number
of special tokens varies, and here we will con-
sider three: mask_token :=𝑁V2, used in
masked language modelling (see Algorithm 12);
bos_token :=𝑁V1, used for representing the
beginning of sequence; and eos_token :=𝑁V,
used for representing the end of sequence. The
complete vocabulary has 𝑁V=j𝑉jelements.
A piece of text is represented as a sequence
of indices (called token IDs ) corresponding to
its (sub)words, preceded by bos_token and fol-
lowed by eos_token .
5. Architectural Components
The following are the neural network build-
ing blocks (functions with learnable parameters)
from which transformers are made. Full archi-
tectures featuring these building blocks are pre-
sentedinthenextsection. (Byaslightabuseofno-
tation, we identify 𝑉with the setf12𝑁 Vg.)
Token embedding. The token embedding
learns to represent each vocabulary element as a
vector in ℝ𝑑e; see Algorithm 1.
4
Page 5:
Formal Algorithms for Transformers
Algorithm 1: Token embedding.
Input:𝑣2𝑉»𝑁V¼, a token ID.
Output: 𝒆2ℝ𝑑e, the vector representation
of the token.
Parameters: 𝑾𝒆2ℝ𝑑e𝑁V, the token
embedding matrix.
1return 𝒆=𝑾𝒆»:𝑣¼
Positional embedding. The positional embed-
ding learns to represent a token’s position in a
sequenceasavectorin ℝ𝑑e. Forexample,theposi-
tionofthefirsttokeninasequenceisrepresented
by a (learned) vector 𝑾𝒑»:1¼, the position of the
second token is represented by another (learned)
vector 𝑾𝒑»:2¼,etc. Thepurposeofthepositional
embedding is to allow a Transformer to make
sense of word ordering; in its absence the rep-
resentation would be permutation invariant and
the model would perceive sequences as “bags of
words” instead.
Learned positional embeddings require that in-
putsequencelengthisatmostsomefixednumber
max(thesizeofthelearnedpositionalembedding
matrixmustbefiniteandfixedinadvanceoftrain-
ing). An intuitive explanation of how this works
can be found at [ Ala18]. For pseudocode, see
Algorithm 2.
Not all transformers make use of learnedposi-
tional embeddings, some use a hard-coded map-
ping 𝑾𝒑:ℕ!ℝ𝑑einstead [ Ker21]. Such hard-
coded positional embeddings can (theoretically)
handle arbitrarily long sequences. The original
Transformer [VSP¸17] uses
𝑾𝒑»2𝑖1𝑡¼=sin¹𝑡2𝑖𝑑e
maxº
𝑾𝒑»2𝑖𝑡¼=cos¹𝑡2𝑖𝑑e
maxº
for0 𝑖𝑑e2.
The positional embedding of a token is usu-
ally added to the token embedding to form a
token’s initial embedding. For the 𝑡-th token of a
sequence 𝒙, the embedding is
𝒆=𝑾𝒆»:𝑥»𝑡¼¼¸𝑾𝒑»:𝑡¼ (1)
Attention. Attention is the main architectural
component of transformers. It enables a neuralAlgorithm 2: Positional embedding.
Input:2»max¼, position of a token in
the sequence.
Output: 𝒆𝒑2ℝ𝑑e, the vector
representation of the position.
Parameters: 𝑾𝒑2ℝ𝑑emax, the positional
embedding matrix.
1return 𝒆𝒑=𝑾𝒑»:¼
network to make use of contextual information
(e.g. preceding text or the surrounding text) for
predicting the current token.
Onahighlevel, attentionworksasfollows: the
token currently being predicted is mapped to a
queryvector 𝒒2ℝ𝑑attn, and the tokens in the
context are mapped to keyvectors 𝒌𝑡2ℝ𝑑attnand
valuevectors 𝒗𝑡2ℝ𝑑value. The inner products 𝒒|𝒌𝑡
areinterpretedasthedegreetowhichtoken 𝑡2𝑉
is important for predicting the current token 𝑞
– they are used to derive a distribution over the
contexttokens,whichisthenusedtocombinethe
value vectors. An intuitive explanation how this
achievesattentioncanbefoundat[ Ala18,Ala19].
The precise algorithm is given in Algorithm 3.
Algorithm 3: Basicsingle-queryattention.
Input: 𝒆2ℝ𝑑in, vector representation of
the current token
Input: 𝒆𝑡2ℝ𝑑in, vector representations of
context tokens 𝑡2»𝑇¼.
Output: ˜𝒗2ℝ𝑑out, vector representation of
the token and context combined.
Parameters: 𝑾𝒒𝑾𝒌2ℝ𝑑attn𝑑in,
𝒃𝒒𝒃𝒌2ℝ𝑑attn, the query and
key linear projections.
Parameters: 𝑾𝒗2ℝ𝑑out𝑑in,𝒃𝒗2ℝ𝑑out, the
value linear projection.
1𝒒 𝑾𝒒𝒆¸𝒃𝒒
28𝑡:𝒌𝑡 𝑾𝒌𝒆𝑡¸𝒃𝒌
38𝑡:𝒗𝑡 𝑾𝒗𝒆𝑡¸𝒃𝒗
48𝑡:𝛼𝑡=exp¹𝒒|𝒌𝑡p
𝑑attnºÍ
𝑢exp¹𝒒|𝒌𝑢p
𝑑attnº
5return ˜𝒗=Í𝑇
𝑡=1𝛼𝑡𝒗𝑡
There are many ways the basic attention mech-
anism is used in transformers. We list some of
the most common variants below.
5
Page 6:
Formal Algorithms for Transformers
It will be useful to define the softmax function
for matrix arguments, as well as a Mask matrix:
softmax¹𝑨º»𝑡z𝑡x¼:=exp𝐴»𝑡z𝑡x¼Í
𝑡exp𝐴»𝑡𝑡x¼(2)
Mask»𝑡z𝑡x¼=1for bidirectional attention
»»𝑡z𝑡x¼¼for unidirectional att.
(3)
Algorithm4: ˜𝑽 Attention¹𝑿𝒁jW𝒒𝒌𝒗Maskº
/* Computes a single (masked) self- or
cross- attention head. */
Input: 𝑿2ℝ𝑑xx𝒁2ℝ𝑑zz, vector
representations of primary and
context sequence.
Output: ˜𝑽2ℝ𝑑outx, updated
representations of tokens in 𝑿,
folding in information from
tokens in 𝒁.
Parameters: W𝒒𝒌𝒗consisting of:
𝑾𝒒2ℝ𝑑attn𝑑x,𝒃𝒒2ℝ𝑑attn
𝑾𝒌2ℝ𝑑attn𝑑z,𝒃𝒌2ℝ𝑑attn
𝑾𝒗2ℝ𝑑out𝑑z,𝒃𝒗2ℝ𝑑out.
Hyperparameters: Mask2f01gzx,"(3)
1𝑸 𝑾𝒒𝑿¸𝒃𝒒1|[[𝑸uery2ℝ𝑑attnx]]
2𝑲 𝑾𝒌𝒁¸𝒃𝒌1|[[𝑲ey2ℝ𝑑attnz]]
3𝑽 𝑾𝒗𝒁¸𝒃𝒗1|[[𝑽alue2ℝ𝑑outz]]
4𝑺 𝑲|𝑸 [[𝑺core2ℝzx]]
58𝑡z𝑡xif:Mask»𝑡z𝑡x¼then𝑆»𝑡z𝑡x¼ 1
6return ˜𝑽=𝑽softmax
𝑺p
𝑑attn
Bidirectional / unmasked self-attention.
Given a sequence of token representations, this
variant applies attention to each token, treating
all tokens in the sequence as the context. See
Algorithm 4, called with token sequence 𝒁=𝑿
and no masking (Mask 1).
Unidirectional / masked self-attention.
Given a sequence of token representations, this
variant applies attention to each token, treating
all preceding tokens (including itself) as the con-
text. Futuretokensaremaskedout,sothiscausal
auto-regressive version can be used for online
prediction. See Algorithm 4, called with token
sequence 𝒁=𝑿and Mask»𝑡z𝑡x¼:=»»𝑡z𝑡x¼¼.
For this Mask, the output ˜𝑽»:1 :𝑡¼only dependson𝑿»:1 :𝑡¼, hence can be used to predict
𝑿»:𝑡¸1¼.
Cross-attention. Given two sequences of to-
ken representations (often in the context of a
sequence-to-sequence task), this variant applies
attention to each token of the primary token se-
quence 𝑿, treating the second token sequence
𝒁as the context. See Algorithm 4, called with
Mask1. Whiletheoutput ˜𝑽andinputsequences
𝑿have the same length x, the context sequence
𝒁can have different length z.
Multi-headattention. Theattentionalgorithm
presented so far (Algorithm 4) describes the op-
eration of a single attention head . In practice,
transformers run multiple attention heads (with
separate learnable parameters) in parallel and
combine their outputs; this is called multi-head
attention; see Algorithm 5
Algorithm5: ˜𝑽 MHAttention¹𝑿𝒁jWMaskº
/* Computes Multi-Head (masked) self-
or cross- attention layer. */
Input: 𝑿2ℝ𝑑xx𝒁2ℝ𝑑zz, vector
representations of primary and
context sequence.
Output: ˜𝑽2ℝ𝑑outx, updated
representations of tokens in 𝑿,
folding in information from
tokens in 𝒁.
Hyperparameters: 𝐻, number of
attention heads
Hyperparameters: Mask2f01gzx,"(3)
Parameters: Wconsisting of
Forℎ2»𝐻¼,Wℎ
𝒒𝒌𝒗consisting of:
j𝑾ℎ
𝒒2ℝ𝑑attn𝑑x,𝒃ℎ
𝒒2ℝ𝑑attn,
j𝑾ℎ
𝒌2ℝ𝑑attn𝑑z,𝒃ℎ
𝒌2ℝ𝑑attn,
j𝑾ℎ
𝒗2ℝ𝑑mid𝑑z,𝒃ℎ
𝒗2ℝ𝑑mid.
𝑾𝒐2ℝ𝑑out𝐻𝑑mid,𝒃𝒐2ℝ𝑑out.
1Forℎ2»𝐻¼:
2 𝒀ℎ Attention¹𝑿𝒁jWℎ
𝒒𝒌𝒗Maskº
3𝒀 »𝒀1;𝒀2;;𝒀𝐻¼
4return ˜𝑽=𝑾𝒐𝒀¸𝒃𝒐1|
6
Page 7:
Formal Algorithms for Transformers
Algorithm 6: ˆ𝒆 layer_norm¹𝒆j𝜸𝜷º
/* Normalizes layer activations 𝒆. */
Input: 𝒆2ℝ𝑑e, neural network activations.
Output: b𝒆2ℝ𝑑e, normalized activations.
Parameters: 𝜸𝜷2ℝ𝑑e, element-wise
scale and offset.
1𝒎 Í𝑑e
𝑖=1𝒆»𝑖¼𝑑e
2𝑣 Í𝑑e
𝑖=1¹𝒆»𝑖¼𝒎º2𝑑e
3return b𝒆=𝒆𝒎p𝑣𝜸¸𝜷, wheredenotes
element-wise multiplication.
Algorithm 7: Unembedding.
Input: 𝒆2ℝ𝑑e, a token encoding.
Output: 𝒑2Δ¹𝑉º, a probability
distribution over the vocabulary.
Parameters: 𝑾𝒖2ℝ𝑁V𝑑e, the
unembedding matrix.
1return 𝒑=softmax¹𝑾𝒖𝒆º
Layer normalisation. Layer normalisation ex-
plicitlycontrolsthemeanandvarianceofindivid-
ual neural network activations; the pseudocode
is given in Algorithm 6. Some transformers use a
simpler and more computationally efficient ver-
sion of layer normalization setting 𝒎=𝜷=0,
called root mean square layer normalisation, or
RMSnorm.
Unembedding. The unembedding learns to
convert a vector representation of a token and
its context into a distribution over the vocabulary
elements; see Algorithm 7. The algorithm de-
scribes an independently learned unembedding
matrix, but note that sometimes the unembed-
ding matrix is instead fixed to be the transpose
of the embedding matrix.
6. Transformer Architectures
Thissectionpresentsafewprominenttransformer
architectures, based on attention Algorithms 4
and 5 and using normalization Algorithm 6, in
roughly historical order:
•EDT [VSP¸17] The original sequence-to-
sequence / Encoder-Decoder Transformer,Algorithms 8, 11 and 15.
•BERT [DCLT19], which is an instance of
an encoder-only transformer (encoder-only
means that it is derived from the encoder-
decoder architecture by dropping the de-
coder part), Algorithms 9 and 12.
•GPT [RWC¸19,BMR¸20], which is an in-
stance of a decoder-only transformer, Algo-
rithms 10, 13 and 14.
While the main architectural difference between
BERT and GPT is in attention masking, they also
differ in a number of less important ways: e.g.
they use different activation functions and the
layer-norms are positioned differently. We in-
cludedthesedifferencesinthepseudocodetostay
faithful to the original algorithms, but note that
different transformer architectures may adopt
these selectively.
To simplify notation, we denote by Wthe en-
tire set of parameters (query, key, value and out-
put linear projections) required by a multi-head
attention layer:
W:=©
«𝑾ℎ
𝒒2ℝ𝑑attn𝑑x𝒃ℎ
𝒒2ℝ𝑑attn ℎ2»𝐻¼
𝑾ℎ
𝒌2ℝ𝑑attn𝑑z𝒃ℎ
𝒌2ℝ𝑑attn ℎ2»𝐻¼
𝑾ℎ
𝒗2ℝ𝑑mid𝑑z𝒃ℎ
𝒗2ℝ𝑑mid ℎ2»𝐻¼
𝑾𝒐2ℝ𝑑out𝐻𝑑mid𝒃𝒐2ℝ𝑑outª®®®
¬
(4)
Encoder-decoder / sequence-to-sequence
transformer [VSP¸17].This is the very first
transformer. It was originally used for sequence-
to-sequence tasks (machine translation), which
is why it is more complicated than its successors.
The idea behind the architecture is as follows:
First, the context sequence is encoded using bidi-
rectional multi-head attention. The output of this
‘encoder’ part of the network is a vector represen-
tation of each context token, taking into account
the entire context sequence. Second, the primary
sequence is encoded. Each token in the primary
sequence is allowed to use information from the
encoded context sequence, as well as primary se-
quence tokens that precede it. See Algorithm 8
for more details.
7
Page 8:
Formal Algorithms for Transformers
Encoder-only transformer: BERT [DCLT19].
BERT is a bidirectional transformer trained on
the task of masked language modelling. Given a
piece of text with some tokens masked out, the
goalistocorrectlyrecoverthemasked-outtokens.
The original use of BERT was to learn generally
useful text representations, which could then be
adapted for various downstream NLP tasks. The
maskingisnotperformedviatheMaskparameter
but differently: During training each input token
isreplacedwithprobability 𝑝maskbyadummyto-
kenmask_token ,andevaluationisbasedonthe
reconstruction probability of these knocked-out
tokens (see Algorithm 12).
The BERT architecture resembles the encoder
part of the seq2seq transformer (hence ‘encoder-
only’). It is described in detail in Algorithm 9. It
uses the GELU nonlinearity instead of ReLU:
GELU¹𝑥º=𝑥ℙ𝑋N¹01º»𝑋 𝑥¼(5)
(When called with vector or matrix arguments,
GELU is applied element-wise.)
Decoder-only transformers: GPT-2 [RWC¸19],
GPT-3 [BMR¸20], Gopher [RBC¸21].GPT-2
and GPT-3 are large language models developed
byOpenAI,andGopherisalargelanguagemodel
developed by DeepMind. They all have similar
architectures and are trained by autoregressive
language modelling: Given an incomplete sen-
tenceorparagraph,thegoalistopredictthenext
token.
The main difference from BERT is that
GPT/Gopher use unidirectional attention instead
of bidirectional attention; they also apply layer-
norms in slightly different order.
See Algorithm 10 for the pseudocode of GPT-2.
GPT-3 is identical except larger, and replaces
dense attention in Line 6 by sparse attention, i.e.
each token only uses a subset of the full context.
Gopher also deviates only slightly from the
GPT-2architecture: itreplaceslayernorminlines
5,7and10byRMSnorm( 𝑚=𝜷=0),andituses
different positional embeddings.Multi-domain decoder-only transformer:
Gato [RZP¸22].Gato is a multi-modal multi-
task transformer built by DeepMind. It is a single
neural network that can play Atari, navigate 3D
environments, control a robotic arm, caption
images, have conversations, and more.
Under the hood, each modality is converted
into a sequence prediction problem by a sepa-
rate tokenization and embedding method; for
exampleimagesaredividedintonon-overlapping
1616patches, ordered in raster order (left-to-
right, top-to-bottom) and processed by a ResNet
block to obtain a vector representation.
The actual Gato architecture is then a decoder-
onlytransformerliketheoneinAlgorithm10,but
where Line 2 is replaced with modality-specific
embedding code.
7. Transformer Training and Infer-
ence
This section lists the pseudocode for various algo-
rithms for training and using transformers:
•EDTraining() Algorithm 11 shows how to
train a sequence-to-sequence transformer
(the original Transformer [VSP¸17]).
•ETraining() Algorithm 12 shows how to
train a transformer on the task of masked
language modelling (like BERT [DCLT19]).
•DTraining() Algorithm 13 shows how to
train a transformer on the task of next to-
ken prediction (like CPT-x [ BMR¸20] and
Gopher [RBC¸21]).
•DInference() Algorithm 14 shows how to
prompt a transformer trained on next token
prediction (like GPT-x [ BMR¸20]). The tem-
perature parameter 𝜏interpolates between
most likely continuation ( 𝜏=0), faithful
sampling ( 𝜏=1º, and uniform sampling
¹𝜏=1º.
•EDInference() Algorithm 15 shows how to
use a sequence-to-sequence transformer for
prediction.
Gradientdescent. ThedescribedtrainingAlgo-
rithms 11 to 13 use Stochastic Gradient Descent
8
Page 9:
Formal Algorithms for Transformers
(SGD) 𝜽 𝜽𝜂rloss¹𝜽ºto minimize the log
loss (aka cross entropy) as the update rule. Com-
putation of the gradient is done via automatic
differentiation tools; see [ BPRS18, Table 5]. In
practice, vanilla SGD is usually replaced by some
more refined variation such as RMSProp or Ada-
Grad or others [ Rud16]. Adam [ KB15] is used
most often these days.
8. Practical Considerations
Whilethevanillatransformersprovidedheremay
work in practice, a variety of “tricks” have been
developed over the years to improve the perfor-
mance of deep neural networks in general and
transformers in particular [LWLQ21]:
•Data preprocessing: cleaning, augmen-
tation [FGW¸21], adding noise, shuffling
[Lem21] (besides tokenization and chunk-
ing).
•Architecture: sparse layers, weight sharing
(besides attention).
•Training: improved optimizers, mini-
batches, batch normalization, learning
rate scheduling, weight initialization, pre-
training, ensembling, multi-task, adversarial
(besides layer normalization) [Sut15].
•Regularization: weight decay, early stop-
ping,cross-validation,dropout,addingnoise
[MBM20, TZ22].
•Inference: scratchpad prompting, few-shot
prompting,chainofthought,majorityvoting
[LAD¸22].
•Others.
A. References
[Ala18]Jay Alammar. The Illustrated Trans-
former. http://jalammar.github.io/illustrated-
transformer/, 2018.
[Ala19]Jay Alammar. The Illustrated GPT-
2 (Visualizing Transformer Language Mod-
els). http://jalammar.github.io/illustrated-
gpt2/, 2019.
[Bel21]Fabrice Bellard. NNCP v2: Loss-
less Data Compression with Transformer.
https://bellard.org/libnc/gpt2tc.html, 2021.[BFT17] PeterLBartlett, DylanJFoster, andMa-
tus J Telgarsky. Spectrally-normalized margin
bounds for neural networks. NeurIPS, 2017.
[BMR¸20]Tom Brown, Benjamin Mann, Nick
Ryder, et al. Language models are few-shot
learners. NeurIPS, 2020.
[BPRS18] Atilim Gunes Baydin, Barak A. Pearl-
mutter, Alexey Andreyevich Radul, and Jef-
frey Mark Siskind. Automatic Differentiation
inMachineLearning: ASurvey. JournalofMa-
chine Learning Research , 18(153):1–43, 2018.
[DCLT19] Jacob Devlin, Ming-Wei Chang, Ken-
ton Lee, and Kristina Toutanova. BERT: Pre-
training of deep bidirectional transformers for
language understanding. ACL, 2019.
[EGKZ21] Benjamin L. Edelman, Surbhi Goel,
Sham Kakade, and Cyril Zhang. Inductive Bi-
ases and Variable Creation in Self-Attention
Mechanisms. arXiv:2110.10090 [cs, stat] , Oc-
tober 2021.
[Elh21]Nelson Elhage. A Math-
ematical Framework for Trans-
former Circuits. https://transformer-
circuits.pub/2021/framework/index.html,
2021.
[EMK¸21]Yonathan Efroni, Dipendra Misra, Ak-
shayKrishnamurthy,AlekhAgarwal,andJohn
Langford.ProvableRLwithExogenousDistrac-
tors via Multistep Inverse Dynamics, March
2021.
[FGW¸21]Steven Y Feng, Varun Gangal, Ja-
son Wei, Sarath Chandar, Soroush Vosoughi,
Teruko Mitamura, and Eduard Hovy. A survey
of data augmentation approaches for NLP. In
Findings of the Association for Computational
Linguistics: ACL-IJCNLP 2021 , pages 968–988,
2021.
[Gag94] Philip Gage. A new algorithm for data
compression. Dr. Dobbs / C Users Journal ,
12(2):23–38, 1994.
[JGH18] Arthur Jacot, Franck Gabriel, and Clé-
ment Hongler. Neural tangent kernel: Conver-
gence and generalization in neural networks.
NeurIPS, 2018.
9
Page 10:
Formal Algorithms for Transformers
[KB15] Diederik Kingma and Jimmy Ba. Adam:
A method for stochastic optimization. ICLR,
2015.
[Ker21] Jonathan Kernes. Mas-
ter Positional Encoding: Part I.
https://towardsdatascience.com/master-
positional-encoding-part-i-63c05d90a0c3,
March 2021.
[LAD¸22]Aitor Lewkowycz, Anders An-
dreassen, David Dohan, Ethan Dyer, Henryk
Michalewski,VinayRamasesh,AmbroseSlone,
Cem Anil, Imanol Schlag, Theo Gutman-Solo,
Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari,
and Vedant Misra. Solving Quantitative
Reasoning Problems with Language Models.
arXiv:2206.14858 [cs] , June 2022.
[Lem21] Chris Lemke. Data preprocessing in
NLP. https://towardsdatascience.com/data-
preprocessing-in-nlp-c371d53ba3e0, July
2021.
[LWLQ21] Tianyang Lin, Yuxin Wang, Xi-
angyang Liu, and Xipeng Qiu. A Survey of
Transformers, June 2021.
[MBM20] Reza Moradi, Reza Berangi, and
Behrouz Minaei. A survey of regularization
strategies for deep models. Artificial Intelli-
gence Review , 53(6):3947–3986, August 2020.
[MKS¸13]Volodymyr Mnih, Koray Kavukcuoglu,
DavidSilver, AlexGraves, IoannisAntonoglou,
Daan Wierstra, and Martin Riedmiller. Play-
ing Atari with Deep Reinforcement Learning,
December 2013.
[MPCB14] Guido F Montufar, Razvan Pascanu,
Kyunghyun Cho, and Yoshua Bengio. On the
number of linear regions of deep neural net-
works.NeurIPS, 2014.
[PH22]M. Phuong and M. Hutter. Formal al-
gorithms for transformers. Technical report,
DeepMind, London, UK, 2022. LaTeX source
available at http://arXiv.org
[RBC¸21]Jack W. Rae, Sebastian Borgeaud,
Trevor Cai, et al. Scaling language models:
Methods, analysis & insights from training go-
pher.arXiv:2112.11446 , 2021.[Rud16] Sebastian Ruder. An overview
of gradient descent optimization algo-
rithms. https://ruder.io/optimizing-gradient-
descent/, January 2016.
[RWC¸19]Alec Radford, Jeffrey Wu, Rewon
Child, David Luan, Dario Amodei, and Ilya
Sutskever. Languagemodelsareunsupervised
multitask learners. OpenAI blog , 2019.
[RZP¸22]Scott Reed, Konrad Żołna, Emilio
Parisotto, et al. A generalist agent.
arXiv:2205.06175 , 2022.
[SBB18] Richard S. Sutton, Andrew G. Barto,
and Francis Bach. Reinforcement Learning:
An Introduction . MIT Press, Cambridge, Mas-
sachusetts, second edition edition edition,
November 2018.
[SGBK¸21]Eren Sezener, Agnieszka Grabska-
Barwińska,DimitarKostadinov,MaximeBeau,
Sanjukta Krishnagopal, David Budden, Mar-
cus Hutter, Joel Veness, Matthew Botvinick,
ClaudiaClopath,MichaelHäusser,andPeterE.
Latham. A rapid and efficient learning rule
for biological neural circuits. Technical report,
DeepMind, London, UK, 2021.
[SHB16] Rico Sennrich, Barry Haddow, and
Alexandra Birch. Neural machine translation
of rare words with subword units. In 54th
Annual Meeting of the Association for Compu-
tational Linguistics , pages 1715–1725. Asso-
ciation for Computational Linguistics (ACL),
2016.
[Sut15]Ilya Sutskever. A Brief
Overview of Deep Learning.
http://yyue.blogspot.com/2015/01/a-
brief-overview-of-deep-learning.html, January
2015.
[TZ22]Yingjie Tian and Yuqi Zhang. A compre-
hensive survey on regularization strategies in
machinelearning. InformationFusion ,80:146–
166, April 2022.
[VSP¸17]Ashish Vaswani, Noam Shazeer, Niki
Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin.
Attention is all you need. NeurIPS, 2017.
10
Page 11:
Formal Algorithms for Transformers
Algorithm 8: 𝑷 EDTransformer¹𝒛𝒙j𝜽º
/* Encoder-decoder transformer forward pass */
Input: 𝒛𝒙2𝑉, two sequences of token IDs.
Output: 𝑷2¹01º𝑁Vlength¹𝒙º, where the 𝑡-th column of 𝑷represents ˆ𝑃𝜽¹𝑥»𝑡¸1¼j𝒙»1 :𝑡¼𝒛º.
Hyperparameters: max𝐿enc𝐿dec𝐻𝑑 e𝑑mlp2ℕ
Parameters: 𝜽includes all of the following parameters:
𝑾𝒆2ℝ𝑑e𝑁V,𝑾𝒑2ℝ𝑑emax, the token and positional embedding matrices.
For𝑙2»𝐿enc¼:
jWenc
𝑙, multi-head encoder attention parameters for layer 𝑙, see (4),
j𝜸1
𝑙𝜷1
𝑙𝜸2
𝑙𝜷2
𝑙2ℝ𝑑e, two sets of layer-norm parameters,
j𝑾𝑙
mlp12ℝ𝑑mlp𝑑e,𝒃𝑙
mlp12ℝ𝑑mlp,𝑾𝑙
mlp22ℝ𝑑e𝑑mlp,𝒃𝑙
mlp22ℝ𝑑e, MLP parameters.
For𝑙2»𝐿dec¼:
jWdec
𝑙, multi-head decoder attention parameters for layer 𝑙, see (4),
jWed
𝑙, multi-head cross-attention parameters for layer 𝑙, see (4),
j𝜸3
𝑙𝜷3
𝑙𝜸4
𝑙𝜷4
𝑙𝜸5
𝑙𝜷5
𝑙2ℝ𝑑e, three sets of layer-norm parameters,
j𝑾𝑙
mlp32ℝ𝑑mlp𝑑e,𝒃𝑙
mlp32ℝ𝑑mlp,𝑾𝑙
mlp42ℝ𝑑e𝑑mlp,𝒃𝑙
mlp42ℝ𝑑e, MLP parameters.
𝑾𝒖2ℝ𝑁V𝑑e, the unembedding matrix.
/* Encode the context sequence: */
1z length¹𝒛º
2for𝑡2»z¼:𝒆𝑡 𝑾𝒆»:𝑧»𝑡¼¼¸𝑾𝒑»:𝑡¼
3𝒁 »𝒆1𝒆2𝒆z¼
4for𝑙=12𝐿 encdo
5 𝒁 𝒁¸MHAttention¹𝒁jWenc
𝑙Mask1º
6for𝑡2»z¼:𝒁»:𝑡¼ layer_norm¹𝒁»:𝑡¼j𝜸1
𝑙𝜷1
𝑙º
7 𝒁 𝒁¸𝑾𝑙
mlp2ReLU¹𝑾𝑙
mlp1𝒁¸𝒃𝑙
mlp11|º¸𝒃𝑙
mlp21|
8for𝑡2»z¼:𝒁»:𝑡¼ layer_norm¹𝒁»:𝑡¼j𝜸2
𝑙𝜷2
𝑙º
9end
/* Decode the primary sequence, conditioning on the context: */
10x length¹𝒙º
11for𝑡2»x¼:𝒆𝑡 𝑾𝒆»:𝑥»𝑡¼¼¸𝑾𝒑»:𝑡¼
12𝑿 »𝒆1𝒆2𝒆x¼
13for𝑖=12𝐿 decdo
14 𝑿 𝑿¸MHAttention¹𝑿jWdec
𝑙Mask»𝑡𝑡0¼»»𝑡𝑡0¼¼º
15for𝑡2»x¼:𝑿»:𝑡¼ layer_norm¹𝑿»:𝑡¼j𝜸3
𝑙𝜷3
𝑙º
16 𝑿 𝑿¸MHAttention¹𝑿𝒁jWed
𝑙Mask1º
17for𝑡2»x¼:𝑿»:𝑡¼ layer_norm¹𝑿»:𝑡¼j𝜸4
𝑙𝜷4
𝑙º
18 𝑿 𝑿¸𝑾𝑙
mlp4ReLU¹𝑾𝑙
mlp3𝑿¸𝒃𝑙
mlp31|º¸𝒃𝑙
mlp41|
19for𝑡2»x¼:𝑿»:𝑡¼ layer_norm¹𝑿»:𝑡¼j𝜸5
𝑙𝜷5
𝑙º
20end
/* Derive conditional probabilities and return: */
21return 𝑷=softmax¹𝑾𝒖𝑿º
11
Page 12:
Formal Algorithms for Transformers
Algorithm 9: 𝑷 ETransformer (𝒙j𝜽º
/* BERT, an encoder-only transformer, forward pass */
Input: 𝒙2𝑉, a sequence of token IDs.
Output: 𝑷2¹01º𝑁Vlength¹𝒙º, where each column of 𝑷is a distribution over the vocabulary.
Hyperparameters: max𝐿𝐻𝑑 e𝑑mlp𝑑f2ℕ
Parameters: 𝜽includes all of the following parameters:
𝑾𝒆2ℝ𝑑e𝑁V,𝑾𝒑2ℝ𝑑emax, the token and positional embedding matrices.
For𝑙2»𝐿¼:
jW𝑙, multi-head attention parameters for layer 𝑙, see (4),
j𝜸1
𝑙𝜷1
𝑙𝜸2
𝑙𝜷2
𝑙2ℝ𝑑e, two sets of layer-norm parameters,
j𝑾𝑙
mlp12ℝ𝑑mlp𝑑e,𝒃𝑙
mlp12ℝ𝑑mlp,𝑾𝑙
mlp22ℝ𝑑e𝑑mlp,𝒃𝑙
mlp22ℝ𝑑e, MLP parameters.
𝑾𝒇2ℝ𝑑f𝑑e𝒃𝒇2ℝ𝑑f,𝜸𝜷2ℝ𝑑f, the final linear projection and layer-norm parameters.
𝑾𝒖2ℝ𝑁V𝑑e, the unembedding matrix.
1 length¹𝒙º
2for𝑡2»¼:𝒆𝑡 𝑾𝒆»:𝑥»𝑡¼¼¸𝑾𝒑»:𝑡¼
3𝑿 »𝒆1𝒆2𝒆¼
4for𝑙=12𝐿do
5 𝑿 𝑿¸MHAttention¹𝑿jW𝑙Mask1º
6for𝑡2»¼:𝑿»:𝑡¼ layer_norm¹𝑿»:𝑡¼j𝜸1
𝑙𝜷1
𝑙º
7 𝑿 𝑿¸𝑾𝑙
mlp2GELU¹𝑾𝑙
mlp1𝑿¸𝒃𝑙
mlp11|º¸𝒃𝑙
mlp21|
8for𝑡2»¼:𝑿»:𝑡¼ layer_norm¹𝑿»:𝑡¼j𝜸2
𝑙𝜷2
𝑙º
9end
10𝑿 GELU¹𝑾𝒇𝑿¸𝒃𝒇1|º
11for𝑡2»¼:𝑿»:𝑡¼ layer_norm¹𝑿»:𝑡¼j𝜸𝜷º
12return 𝑷=softmax¹𝑾𝒖𝑿º
12
Page 13:
Formal Algorithms for Transformers
Algorithm 10: 𝑷 DTransformer (𝒙j𝜽º
/* GPT, a decoder-only transformer, forward pass */
Input: 𝒙2𝑉, a sequence of token IDs.
Output: 𝑷2¹01º𝑁Vlength¹𝒙º, where the 𝑡-th column of 𝑷represents ˆ𝑃𝜽¹𝑥»𝑡¸1¼j𝒙»1 :𝑡¼º.
Hyperparameters: max𝐿𝐻𝑑 e𝑑mlp2ℕ
Parameters: 𝜽includes all of the following parameters:
𝑾𝒆2ℝ𝑑e𝑁V,𝑾𝒑2ℝ𝑑emax, the token and positional embedding matrices.
For𝑙2»𝐿¼:
jW𝑙, multi-head attention parameters for layer 𝑙, see (4),
j𝜸1
𝑙𝜷1
𝑙𝜸2
𝑙𝜷2
𝑙2ℝ𝑑e, two sets of layer-norm parameters,
j𝑾𝑙
mlp12ℝ𝑑mlp𝑑e,𝒃𝑙
mlp12ℝ𝑑mlp,𝑾𝑙
mlp22ℝ𝑑e𝑑mlp,𝒃𝑙
mlp22ℝ𝑑e, MLP parameters.
𝜸𝜷2ℝ𝑑e, final layer-norm parameters.
𝑾𝒖2ℝ𝑁V𝑑e, the unembedding matrix.
1 length¹𝒙º
2for𝑡2»¼:𝒆𝑡 𝑾𝒆»:𝑥»𝑡¼¼¸𝑾𝒑»:𝑡¼
3𝑿 »𝒆1𝒆2𝒆¼
4for𝑙=12𝐿do
5for𝑡2»¼:˜𝑿»:𝑡¼ layer_norm¹𝑿»:𝑡¼j𝜸1
𝑙𝜷1
𝑙º
6 𝑿 𝑿¸MHAttention¹˜𝑿jW𝑙Mask»𝑡𝑡0¼=»»𝑡𝑡0¼¼º
7for𝑡2»¼:˜𝑿»:𝑡¼ layer_norm¹𝑿»:𝑡¼j𝜸2
𝑙𝜷2
𝑙º
8 𝑿 𝑿¸𝑾𝑙
mlp2GELU¹𝑾𝑙
mlp1˜𝑿¸𝒃𝑙
mlp11|º¸𝒃𝑙
mlp21|
9end
10for𝑡2»¼:𝑿»:𝑡¼ layer_norm¹𝑿»:𝑡¼j𝜸𝜷º
11return 𝑷=softmax¹𝑾𝒖𝑿º
13
Page 14:
Formal Algorithms for Transformers
Algorithm11: ˆ𝜽 EDTraining (𝒛1:𝑁data𝒙1:𝑁data𝜽)
/* Training a seq2seq model */
Input:f¹𝒛𝑛𝒙𝑛ºg𝑁data
𝑛=1, a dataset of
sequence pairs.
Input: 𝜽, initial transformer parameters.
Output: ˆ𝜽, the trained parameters.
Hyperparameters: 𝑁epochs2ℕ 𝜂2¹01º
1for𝑖=12𝑁 epochsdo
2for𝑛=12𝑁 datado
3 length¹𝒙𝑛º
4 𝑷¹𝜽º EDTransformer¹𝒛𝑛𝒙𝑛j𝜽º
5 loss¹𝜽º=Í1
𝑡=1log𝑃¹𝜽º»𝑥𝑛»𝑡¸1¼𝑡¼
6 𝜽 𝜽𝜂rloss¹𝜽º
7end
8end
9return ˆ𝜽=𝜽
Algorithm12: ˆ𝜽 ETraining (𝒙1:𝑁data𝜽)
/* Training by masked language
modelling */
Input:f𝒙𝑛g𝑁data
𝑛=1, a dataset of sequences.
Input: 𝜽, initial encoder-only transformer
parameters.
Output: ˆ𝜽, the trained parameters.
Hyperparameters: 𝑁epochs2ℕ 𝜂2
¹01º𝑝mask2¹01º
1for𝑖=12𝑁 epochsdo
2for𝑛=12𝑁 datado
3 length¹𝒙𝑛º
4 for𝑡=12do
5 ˜𝑥𝑛»𝑡¼ mask_token or𝑥𝑛»𝑡¼
randomly with probability
𝑝maskor1𝑝mask
6 end
7 ˜𝑇 f𝑡2»¼: ˜𝑥𝑛»𝑡¼=mask_tokeng
8 𝑷¹𝜽º ETransformer¹˜𝒙𝑛j𝜽º
9 loss¹𝜽º=Í
𝑡2˜𝑇log𝑃¹𝜽º»𝑥𝑛»𝑡¼𝑡¼
10 𝜽 𝜽𝜂rloss¹𝜽º
11end
12end
13return ˆ𝜽=𝜽Algorithm13: ˆ𝜽 DTraining (𝒙1:𝑁data𝜽)
/* Training next token prediction */
Input:f𝒙𝑛g𝑁data
𝑛=1, a dataset of sequences.
Input: 𝜽, initial decoder-only transformer
parameters.
Output: ˆ𝜽, the trained parameters.
Hyperparameters: 𝑁epochs2ℕ 𝜂2¹01º
1for𝑖=12𝑁 epochsdo
2for𝑛=12𝑁 datado
3 length¹𝒙𝑛º
4 𝑷¹𝜽º DTransformer¹𝒙𝑛j𝜽º
5 loss¹𝜽º=Í1
𝑡=1log𝑃¹𝜽º»𝑥𝑛»𝑡¸1¼𝑡¼
6 𝜽 𝜽𝜂rloss¹𝜽º
7end
8end
9return ˆ𝜽=𝜽
Algorithm 14: 𝒚 DInference (𝒙ˆ𝜽)
/* Prompting a trained model and using
it for prediction. */
Input:Trained transformer parameters ˆ𝜽.
Input: 𝒙2𝑉, a prompt.
Output: 𝒚2𝑉, the transformer’s
continuation of the prompt.
Hyperparameters: gen2ℕ 𝜏2¹01º
1 length¹𝒙º
2for𝑖=12 gendo
3 𝑷 DTransformer¹𝒙jˆ𝜽º
4 𝒑 𝑷»:¸𝑖1¼
5sample a token 𝑦from 𝒒/𝒑1𝜏
6 𝒙 »𝒙𝑦¼
7end
8return 𝒚=𝒙»¸1 :¸gen¼
14
Page 15:
Formal Algorithms for Transformers
Algorithm 15: ˆ𝒙 EDInference (𝒛ˆ𝜽)
/* Using a trained seq2seq model for
prediction. */
Input:A seq2seq transformer and trained
parameters ˆ𝜽of the transformer.
Input: 𝒛2𝑉, input sequence, e.g. a
sentence in English.
Output: ˆ𝒙2𝑉, output sequence, e.g. the
sentence in German.
Hyperparameters: 𝜏2¹01º
1ˆ𝒙 »bos_token¼
2𝑦 0
3while𝑦≠eos_token do
4 𝑷 EDTransformer¹𝒛ˆ𝒙jˆ𝜽º
5 𝒑 𝑷»:length¹ˆ𝒙º¼
6sample a token 𝑦from 𝒒/𝒑1𝜏
7 ˆ𝒙 »ˆ𝒙𝑦¼
8end
9return ˆ𝒙
15
Page 16:
Formal Algorithms for Transformers
B. List of Notation
Symbol Type Explanation
»𝑁¼ :=f1𝑁gset of integers 12𝑁1𝑁
𝑖 𝑗2ℕ generic integer indices
𝑉 »𝑁V¼vocabulary
𝑁V2ℕ vocabulary size
𝑉=Ð1
=0𝑉set of token sequences; elements include e.g. sentences or documents
max2ℕ maximum sequence length
2»max¼length of token sequence
𝑡2»¼index of token in a sequence
𝑑2ℕ dimension of various vectors
𝒙𝑥»1 :¼ 𝑥»1¼𝑥»2¼𝑥»¼2𝑉primary token sequence
𝒛𝑧»1 :¼ 𝑧»1¼𝑧»2¼𝑧»¼2𝑉context token sequence
𝑀»𝑖 𝑗¼ 2 ℝ entry𝑀𝑖𝑗of matrix𝑀2ℝ𝑑𝑑0
𝑀»𝑖:¼𝑀»𝑖¼2ℝ𝑑0𝑖-th row of matrix 𝑀2ℝ𝑑𝑑0
𝑀»: 𝑗¼ 2 ℝ𝑑𝑗-th column of matrix 𝑀2ℝ𝑑𝑑0
𝒆2ℝ𝑑evector representation / embedding of a token
𝑿2ℝ𝑑e𝑥encoded primary token sequence
𝒁2ℝ𝑑e𝑧encoded context token sequence
Mask2ℝ𝑧𝑥masking matrix, it determines the attention context for each token
𝐿𝐿enc𝐿dec2ℕ number of network (encoder, decoder) layers
𝑙2»𝐿¼index of network layer
𝐻2ℕ number of attention heads
ℎ2»𝐻¼index of attention head
𝑁data2ℕ (i.i.d.) sample size
𝑛2»𝑁data¼index of sample sequence
𝜂2¹01ºlearning rate
𝜏2¹01ºtemperature; it controls the diversity-plausibility trade-off at inference
𝑾𝒆2ℝ𝑑e𝑁Vtoken embedding matrix
𝑾𝒑2ℝ𝑑emaxpositional embedding matrix
𝑾𝒖2ℝ𝑁V𝑑eunembedding matrix
𝑾𝒒2ℝ𝑑attn𝑑xquery weight matrix
𝒃𝒒2ℝ𝑑attnquery bias
𝑾𝒌2ℝ𝑑attn𝑑zkey weight matrix
𝒃𝒌2ℝ𝑑attnkey bias
𝑾𝒗2ℝ𝑑out𝑑zvalue weight matrix
𝒃𝒗2ℝ𝑑outvalue bias
W𝒒𝒌𝒗 collection of above parameters of a single-head attention layer
𝑾𝒐2ℝ𝑑out𝐻𝑑midoutput weight matrix
𝒃𝒐2ℝ𝑑outoutput bias
W collection of above parameters of a multi-head attention layer, see eq. (4)
𝑾mlp2ℝ𝑑1𝑑2weight matrix corresponding to an MLP layer in a Transformer
𝒃mlp2ℝ𝑑1bias corresponding to an MLP layer in a Transformer
𝜸2ℝ𝑑elayer-norm learnable scale parameter
𝜷2ℝ𝑑elayer-norm learnable offset parameter
𝜽ˆ𝜽2ℝ𝑑collection of all learnable / learned Transformer parameters
16