arXiv:2504.01002v1 [cs.CL] 1 Apr 2025
Token Embeddings Violate the Manifold Hypothesis
Michael Robinson1, Sourya Dey2, Tony Chiang3
1Mathematics and Statistics, American University, Washington,
DC, USA, michaelr@american.edu
2Galois, Inc., Arlington, VA, USA, sourya@galois.com
3Department of Mathematics, University of Washington, Seattle,
WA, chiang@math.washington.edu
Abstract
Fully understanding the behavior of a large language model (LLM)
requires understanding its input space. If this input space differs
from our assumptions, our understanding of and conclusions about the
LLM are likely flawed, regardless of its architecture. Here, we
elucidate the structure of the token embeddings, the input domain for
LLMs, both empirically and theoretically. We present a generalized and
statistically testable model in which the neighborhood of each token
splits into well-defined signal and noise dimensions. This model is
based on a generalization of a manifold called a fiber bundle, so we
call our hypothesis test the “fiber bundle null.” Failing to reject
the null is uninformative, but rejecting it at a specific token
indicates that the token has statistically significant local structure,
and so is of interest to us. By running our test over several
open-source LLMs, each with unique token embeddings, we find that the
null is frequently rejected, and so the token subspace is provably not
a fiber bundle and hence also not a manifold. As a consequence of our
findings, when an LLM is presented with two semantically equivalent
prompts, and one prompt contains a token implicated by our test, that
prompt will likely exhibit more output variability, proportional to
the local signal dimension of the token.
1 Introduction
Large language models (LLMs) produce a response to a given query by using a
deep neural network to predict the next token given a window of previous tokens.
How interchangeable are these tokens? From a linguistic perspective, those
tokens that can be exchanged without impacting the meaning of a statement
should be considered synonyms. Some tokens have more synonyms, whereas
others have fewer. Those with fewer synonyms tend to be syntactically essential:
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under
Contract No. HR001124C0319. Any opinions, findings and conclusions or recommendations expressed in this
material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research
Projects Agency (DARPA). Distribution Statement “A” (Approved for Public Release, Distribution Unlimited).
if you swap such a token for another, the resulting sentence is not likely to
occur. Conversely, tokens with many synonyms are likely to be viewed as being
interchangeable.
Logically prior to understanding the syntax learned by an LLM is the
understanding of its token subspace, the internal representation of individual tokens
(not sequences of tokens in context). Numerous papers have pointed to
unexpected behaviors exhibited by LLMs that hinge on subtle changes in wording
and text layout between apparently similar prompts, suggesting that certain
(apparently semantically similar) tokens have dramatically different
neighborhoods in the token subspace (for instance, see [1]). These differences in
neighborhoods correspond to places where the token subspace is not a manifold;
it is singular at such a token. Linguistically, singularities may correspond to
polysemy or homonyms: tokens with multiple distinct meanings [2].
If the token subspace is singular, then these singularities can persist into
the output of the LLM, perhaps unavoidably and regardless of its architecture.
Not accounting for singularities in the token subspace may thereby impede the
understanding of the LLM’s behavior. Suppose the LLM is presented with two
similar prompts, but one prompt has a token that is near the singularity. The
prompt with a token near the singularity will likely exhibit more variability if
both prompts are changed in the same way, depending on how well the
transformer can resolve the singularity.
We present a test that determines whether the neighborhood of a given
token contains a singularity. The test works by identifying changes in subspace
dimension that are inconsistent with the token subspace being a fiber bundle,
which is a strict generalization of a manifold. When our model finds a singularity
at a token, this implies that the token has far fewer synonyms than its neighbors.
In a context where such a token is used, its use in that role is syntactically
essential, indicating that it plays an outsized role in the LLM.
We applied our test to the token subspaces of four open source LLMs (GPT2 [3],
Llemma7B [4], Mistral7B [5], and Pythia6.9B [6]). In each LLM we tested,
we found that the token subspace is not a manifold, because it is also not a
fiber bundle. Moreover, we observe highly statistically significant differences in
the singular tokens between LLMs, even for those with identical sets of tokens
overall, which indicates that their respective training methodologies have a
strong impact on the token subspace. In this situation, none of these LLMs
should be expected to have similar responses to a prompt involving any of these
singular tokens [7].
1.1 Background
At an abstract but precise level, an LLM consists of several interacting
processes, as outlined in Figure 1. An LLM implements a transformation of a
sequence of tokens (the query) into a new sequence of tokens (the response).
Formally, if each input token is an element of a metric space $T$, then the LLM
is a transformation $T^n \to T^m$, where $n$ is the number of tokens in the query
and $m$ is the number of tokens in the response. This transformation is typically
[Figure 1 diagram. Outside the LLM: query token sequence, initial context window, output tokens (query and response). Inside the LLM: token input embedding $e^n$, context windows in $X^n$, transformer block(s) $f$, next token distributions in $Y$, random draws, and the window update $F$.]
Figure 1: Data flow in a typical LLM. A sequence of tokens forming the query
is converted via the token input embedding $e^n$ into the initial context window,
as a point in the latent space $X^n$. Each of these windows in the latent space is
converted, token-by-token, into probability distributions via $f$ into the single
token latent space $X$. From these, each token presented in the output (in the
set $Y$) is obtained via a random draw. These output tokens are then used for
subsequent windows.
not a function because it is stochastic: it involves random draws.
To operate upon tokens using numerical models, such as could be
implemented using neural networks, we must transform the finite set of tokens $T$
into numerical data. This is typically done by way of a pair of latent spaces
$X = \mathbb{R}^d$ and $Y = \mathbb{R}^q$. The dimension $q$ of $Y$ is chosen to be equal to the
number of elements in $T$, so that elements of $Y$ have the interpretation of being
(unnormalized) probability distributions over $T$.
The transformation $T^n \to T^m$ is constructed in several stages.
Input tokenization: Each token is embedded individually via the token input
embedding function $e : T \to X$. As a whole, $X^n$ is called a latent window.
Transformer blocks: The probability distribution for the next token is
constructed by a continuous function $f : X^n \to Y$. This is usually
implemented by one or more transformer blocks.
Output tokenization: Given the output of one of the transformer blocks
$f$, one can obtain an output token in $T$ by a random draw. Specifically,
if $(x_1, x_2, \dots, x_n)$ is the current window in $X^n$, then the next token $t$ is
drawn from the distribution given by $f(x_1, x_2, \dots, x_n)$.
Next window prediction: Given that token $t$ was drawn from the
distribution, the next latent window itself is constructed by a transformation
$F : X^n \to X^n$, which advances the window as follows:
$$F(x_1, x_2, \dots, x_n) := (x_2, \dots, x_n, t).$$
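The four stages above can be sketched as a toy loop. This is only an illustration under stated assumptions, not any model's actual implementation: the sizes `q`, `d`, `n`, the random embedding matrix `E`, the linear stand-in `f` for the transformer blocks, and the softmax normalization in `draw_token` are all hypothetical. The last entry of the advanced window is taken to be the embedding of the drawn token, which is our reading of the formula for $F$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: q tokens, latent dimension d, window length n.
q, d, n = 5, 3, 4

# Token input embedding e: T -> X = R^d, stored as a d-by-q matrix
# (one column per token). Random values, for illustration only.
E = rng.normal(size=(d, q))

# Stand-in for the transformer blocks f: X^n -> Y = R^q.
# A real model is far more complex; this is a fixed random linear map.
W = rng.normal(size=(q, n * d))

def f(window):
    """Map a latent window of shape (n, d) to unnormalized scores over q tokens."""
    return W @ window.reshape(-1)

def draw_token(scores):
    """Random draw of a token index (softmax normalization is an assumption)."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return int(rng.choice(q, p=p))

def advance(window, t):
    """Window update F: drop the oldest entry, append the embedding of token t."""
    return np.vstack([window[1:], E[:, t]])

query = [0, 3, 1, 2]
window = E[:, query].T           # initial context window, shape (n, d)
response = []
for _ in range(6):
    t = draw_token(f(window))    # output tokenization
    response.append(t)
    window = advance(window, t)  # next window prediction
```

Because of the random draw, repeated runs with different seeds generally give different responses to the same query, which is why the overall transformation is not a function.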
The focus of this paper is specifically upon the structure of the token input
embedding $e : T \to X = \mathbb{R}^d$. Since the token set $T$ is finite, $e$ can be stored as a
matrix. In this matrix, each column corresponds to an element of $T$, and thereby
ascribes a vector of numerical coordinates to each token. By replacing the last
layer of the deep neural network $f : X^n \to Y$, a vector of probabilities for the
next token is obtained from the activations of the last layer. One can therefore
interpret the probabilities as specifying a token output embedding. Both the
tokenization and the transformer stages are learned during training, and many
strategies for this learning process are discussed extensively in the literature.
Although their training is usually performed separately, these two stages interact
when they produce the LLM output, so it is important to understand the lineage
of a given tokenization as being from a particular LLM. We emphasize that only
the input tokenization is discussed in this article.
The token input embedding matrix itself is interesting, as it defines “where”
the tokens are located. It is reasonable to consider the tokens as being sampled
from a larger latent subspace within the space of all possible activations. Such
a space is quite unconstrained. There is no a priori reason to suspect it is a
manifold, for instance. It has already been shown that local neighborhoods of
each token have salient topological structure [8]. One of the most basic
parameters is the dimension near any given token in this space. Higher dimension at
a token means that the token has more near-neighbors (more synonyms), while
lower dimensional tokens are less interchangeable [2].
Dimension is a manifestly local property. However, for manifolds, dimension
is locally constant, hence global. It is for this reason that manifold learning
is popular. If one computes PCA locally for a random sampling of a manifold
embedded within Euclidean space, most of the variance in the data is captured
within a few principal directions, namely those tangent to the manifold. In
essence, these represent the signal within the data. The number of these
directions is the dimension of the manifold, and this is a constant over (each
connected component of) the manifold. The remaining directions, which are
not tangent to the manifold, represent noise. The basic assumption is that
transformers act on the entire input space, and that (clearly) is a manifold,
because it is Euclidean space. But the truth is that a transformer in the context
of an LLM really only acts on the token subspace, the image of the token input
embedding $e : T \to X$, which is a subspace of that Euclidean space. That
the token subspace is not a linear subspace is widely acknowledged, but more
problematic is that it is not a manifold [9].
Assuming that word embeddings yield manifolds, some researchers have used
global dimension estimators on token input embeddings and word embeddings
[10, 11]. A priori one should not suspect that a set of tokens (or other samples)
lies on a manifold. Although there are rigorous statistical tests for manifolds
[12], they are arduous to apply in practice.
By using a local (not global) dimension estimator, [9] presented the first
(to our knowledge) direct test of whether the token subspace is a manifold
for the token input embeddings for several LLMs. A strongly negative result
was obtained: the subspace of tokens is apparently never a manifold, so global
[Figure 2 plot. Horizontal axis: dimension (0 to 800). Vertical axis: probability density (0.00 to 0.03). Annotated modes: “Numerics, months, days of week, cardinal directions”; “Single English words”; “Mostly word fragments and a few single words”.]
Figure 2: The distribution of local dimensions estimated near tokens in GPT2,
from [9].
dimension estimators are not reliable. Figure 2 shows the distribution of
dimensions they obtained for GPT2 (March 11, 2024 version) [3]. Recall that
dimension correlates with the number of free parameters by which one can perturb a
point and still stay within the space, and that for manifolds this number is a
constant. The highly multi-modal nature of the distribution is a reflection of
the inherent non-manifold structure of the token subspace.
There are several clusters of low dimensional tokens, which accord with the
low dimensions obtained by others using global estimators [10, 11]. However,
the high dimensional mode indicates that there are many tokens that can be
perturbed more substantially. Intuitively, the token subspace is dimensionally
“thicker” near these tokens with higher dimensional neighborhoods. This yields
a striking interpretation: the high dimensional modes correspond to tokens
with a much higher variance, while the lower dimensional modes have a lower
variance. Therefore, an immediate consequence of Figure 2 is that the noise
near a token is strongly and unavoidably dependent upon that token.
1.2 Contributions
The dependence of variability near a token upon that token is a form of
heteroscedasticity. In order to construct a manifold hypothesis testing framework,
we formalize the notion of heteroscedasticity by making a very general model of
non-heteroscedastic noise: it is a probability distribution supported on a fiber
bundle. Roughly speaking, instead of having one local dimension (as a
manifold does), a fiber bundle has two local dimensions. These two dimensions
correspond to a clean split between “signal” and “noise” dimensions. The fiber
bundle hypothesis asserts that the noise dimension is valid near a given point,
while the signal dimension is valid further away from that point. While this
hypothesis may not be true, if it is true then the noise model is quite benign.
According to Theorem 1 (proven in the supplementary material), which we call
the “fiber bundle hypothesis,” it is easy to test for a fiber bundle using the
volume-versus-radius plots of [9] by finding places where the slope is
discontinuous and increases at this discontinuity. The present paper shows that
the token subspaces for LLMs mostly, but not entirely, look like fiber bundles.
The places where the token subspace has singularities (violates the fiber bundle
hypothesis) are likely to be at interesting tokens.
In Section 2, we explain how to test and interpret the fiber bundle
hypothesis. As a benefit, Theorem 1 yields two new dimension estimators that aid in
performing the test. Rejection of the fiber bundle hypothesis therefore implies
a very strong heteroscedasticity. We rebuilt the dimension estimator in [9] to
automatically find the stratification boundaries.
In Section 3, we exhibit results from our new estimator on GPT2 [3], Llemma7B
[4], Mistral7B [5], and Pythia6.9B [6]. Tokens near violations of the fiber
bundle hypothesis are near places where the noise distribution is guaranteed to
change abruptly. Furthermore, the two dimension estimators from Theorem 1
also identify sets of tokens with interesting structure.
2 Methods
Our method assumes that the set of tokens $T$ is a random sample of a
probability distribution $m$ on a topological space (not necessarily a manifold) $E$ that
represents all possible tokens (including those that have not been seen before).
We can safely assume that the token input embedding $e : T \to X = \mathbb{R}^d$ is a
continuous function.
Supposing that $t$ is a token of interest, our method estimates the probability
distribution $m$ in the neighborhood of $x = e(t)$, and uses this estimate to infer
properties about the structure of $E$. Since we assumed that $T$ was randomly
sampled from $m$, the number of tokens within radius $r$ of $x$,
$$N_x(r) := \#\{ y \in T : \| x - e(y) \|_2 < r \},$$
will converge in expectation to
$$\mathbb{E}(N_x(r)) = m(e^{-1}(B_r(x))) \, \#T, \tag{1}$$
as the number of tokens grows large, provided $e$ is continuous.
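The count $N_x(r)$ can be computed for many radii at once by sorting the distances from $x$ to every embedded token. The sketch below is a minimal illustration: the function name `neighbor_counts` is hypothetical, the embedding is a random stand-in rather than a real model's matrix, and tokens are stored as rows here (the text stores them as columns).

```python
import numpy as np

def neighbor_counts(embeddings, index, radii):
    """Count tokens strictly within each radius of the token at `index`.

    embeddings: array of shape (num_tokens, d), one row per token.
    The token itself is counted, since its distance to itself is 0 < r.
    """
    x = embeddings[index]
    dists = np.sort(np.linalg.norm(embeddings - x, axis=1))
    # With side="left", searchsorted returns the number of distances < r.
    return np.searchsorted(dists, radii, side="left")

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 8))   # hypothetical stand-in for e(T)
radii = np.array([0.5, 1.0, 2.0, 4.0, 100.0])
counts = neighbor_counts(emb, 0, radii)
```

Sorting once and then using a binary search per radius keeps the cost per token at one distance computation plus a sort, regardless of how many radii are examined.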
Theorem 1 provides an asymptotic estimate of Equation (1) under
additional assumptions about $E$. Specifically, Theorem 1 asserts that if $E$ is a
manifold and $e$ is smooth, then $\log \mathbb{E}(N_x(r))$ depends linearly upon $\log r$, and
the slope of this linear relationship is the dimension of $E$. Moreover,
Theorem 1 asserts that if $E$ is a fibered manifold, a generalization of a manifold
that is a type of fiber bundle (as discussed in Section 2.1), then the relationship
is piecewise linear, and the slopes must decrease as the radius increases.
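The manifold case of this statement can be checked numerically on synthetic data. The sketch below assumes a hypothetical sample: points drawn uniformly from a flat 2-dimensional square embedded in $\mathbb{R}^5$. It fits a line to $\log N_x(r)$ against $\log r$; under these assumptions the fitted slope should be close to the intrinsic dimension, 2. It is an illustration of the idea only, not the estimator used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: points uniform on a 2-dimensional square,
# embedded isometrically in R^5 via an orthonormal basis.
n_pts, intrinsic_dim, ambient_dim = 20000, 2, 5
coords = rng.uniform(-1.0, 1.0, size=(n_pts, intrinsic_dim))
basis, _ = np.linalg.qr(rng.normal(size=(ambient_dim, intrinsic_dim)))
points = coords @ basis.T

# Base point at the centre of the square, so that balls of radius < 1
# do not reach the boundary.
x = np.zeros(ambient_dim)
dists = np.sort(np.linalg.norm(points - x, axis=1))

radii = np.geomspace(0.1, 0.8, 20)
counts = np.searchsorted(dists, radii, side="left")

# Least-squares line through (log r, log N_x(r)); the slope estimates dimension.
slope, intercept = np.polyfit(np.log(radii), np.log(counts), 1)
```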
If the conclusion of Theorem 1 does not hold, namely the relationship
between $\log \mathbb{E}(N_x(r))$ and $\log r$ is not piecewise linear or the slopes do not decrease
with increasing $r$, then we reject the fiber bundle hypothesis. In particular,
rejecting implies that the neighborhood of $x$ is inconsistent with a fiber bundle, and
as a consequence, it is also inconsistent with a manifold.
2.1 Interpretation as signal and noise
It is usual to describe measurements as exhibiting the combined effect of signal
and noise. If we were to know both of these quantities, we could express each
measurement as being an ordered pair (signal, noise). Therefore, if the space of
all possible signals is $B$ and the space of all noise values is $V$, we could represent
the space of all possible measurements as the cartesian product $E = B \times V$.
In what follows, we will call $B$ the base space and $V$ the fiber space. The
product $E = B \times V$ describes the situation when the set of possible noise values
does not depend on the signal value, and is called a homoscedastic noise model.
In contrast, in a heteroscedastic noise model, the set of possible noise values
depends on the signal value.
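[Figure 3 diagram. Left: a narrow strip with disks of radius R1 (“disk entirely inside strip”) and R2 (“disk peeks out both sides”). Right: plot of log(area) versus log(r), with slope 2 for small radii, slope 1 as r → ∞, and a “sharp corner” and a “subtle corner” marked.]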
Figure 3: Our method applied to a fiber bundle in $\mathbb{R}^2$. The vertical direction
is the base space (signal), while the horizontal direction represents the fiber
space (noise). Gray points in the right frame show estimates from a random
sampling of points in the strip; the solid line shows the theoretical area versus
radius curve.
Figure 3 shows an example of this situation. It consists of a 1-dimensional
base space (the signal) and 1-dimensional fibers (the noise), which in this case
forms a narrow strip in the plane. Volumes (areas, in this case) of balls of small
radius scale quadratically (slope 2 in a log-log plot), but scale asymptotically
linearly (slope 1 in a log-log plot) for large radii. The transition between these
two behaviors is detectable by way of a corner in the plot. This situation is
easily and robustly estimated from the data; the gray points in Figure 3 (right)
are derived from a random sampling of points drawn from the strip.
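The strip example can be reproduced with a small Monte Carlo experiment. The sketch below assumes hypothetical strip dimensions (fiber half-width 0.1, base half-length 5) and estimates the log-log slope of the count-versus-radius curve separately at small and large radii. Under these assumptions one expects a slope near 2 while the disk lies inside the strip and a slope near 1 once it peeks out both sides.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical strip: narrow in the fiber (horizontal) direction,
# long in the base (vertical) direction.
n_pts = 500000
half_width, half_length = 0.1, 5.0
pts = np.column_stack([
    rng.uniform(-half_width, half_width, n_pts),    # noise / fiber coordinate
    rng.uniform(-half_length, half_length, n_pts),  # signal / base coordinate
])

# Sorted distances from the centre of the strip to every sample point.
dists = np.sort(np.linalg.norm(pts, axis=1))

def loglog_slope(r_lo, r_hi, num=15):
    """Least-squares slope of log(count) versus log(radius) on [r_lo, r_hi]."""
    radii = np.geomspace(r_lo, r_hi, num)
    counts = np.searchsorted(dists, radii, side="left")
    return np.polyfit(np.log(radii), np.log(counts), 1)[0]

small_slope = loglog_slope(0.02, 0.09)  # disk entirely inside the strip
large_slope = loglog_slope(1.0, 4.0)    # disk peeks out both sides
```

The slope decreases with radius here, which is the pattern Theorem 1 permits for a fiber bundle.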
To test for heteroscedastic noise, we propose a substantial nonlinear
generalization of homoscedastic noise. While the strength of noise can depend on the
signal value, the number of dimensions necessary to describe it does not. This
situation is modeled mathematically by a fiber bundle. In a fiber bundle, the
signal is still modeled by a space $B$, but the possible measurements are modeled
by a function $p : E \to B$. The idea is that fibers $p^{-1}(b)$ are still cartesian
products: pairs of signal and noise, and these are all identical up to diffeomorphism.
Our method relies upon a particular geometric property of fiber bundles: we
can identify if the fibers are not all identical according to when the conclusion
of Theorem 1 is violated.
Figure 4 shows a situation that is not a fiber bundle, since there is a change in
the dimension of the fiber. In the upper portion of the figure, the fiber dimension
is 0 while in the lower portion the fiber dimension is 1. This is detectable by
looking at the volume versus radius plots for two samples. While both samples
show corners in their volume versus radius plots, Theorem 1 establishes that
the slopes always decrease with increasing radius for a fiber bundle. This is
violated for the sample marked (a), so we conclude that the space is not a fiber
bundle. On the other hand, the sample marked (b) does not exhibit
this violation. It is important to note that if a sample yields data consistent
with Theorem 1, we cannot conclude that the space is a fiber bundle.
The statement of Theorem 1 is rather technical, but can be summarized in a
simple way. Consider a token $x$, and count, as a function of radius $r$, the number
of tokens within radius $r$ of the token $x$. If we plot this function on a log-log scale,
it will be roughly linear for small radii anywhere the space has the local
structure of a manifold near the token $x$. The manifold hypothesis prohibits
discontinuities in the derivative of this function for small radii, but according to
Theorem 1, fiber bundles permit the slope to decrease through a discontinuity.
Therefore, anywhere the slope increases through a discontinuity will cause us to
conclude that the vicinity of that token cannot be a fiber bundle. Rejecting the
fiber bundle hypothesis implies that the token $x$ has far fewer synonyms than its
neighbors, and might be a token corresponding to multiple distinct meanings.
2.2 Testing framework for the fiber bundle hypothesis
Our method is summarized by Figure 5. The first three blocks of Figure 5
compute $N_x(r)$ directly, while the last three blocks perform the test to see
whether the conclusion of Theorem 1 holds.
Note that the test itself (the final block) is rather straightforward.
Theorem 1 asserts that $\log N_x(r)$ as a function of $\log r$ is a piecewise linear function,
in which the slopes decrease as $r$ increases. If there is a statistically significant
increase in the slope estimates, then we reject.
The most subtle of the blocks in Figure 5 is the fourth block, labeled “detect
slope changes”. This block consists of estimating the slope by using the standard
[Figure 4 diagram. Left: a planar space over a base, with one region labeled “Violates fiber bundle hypothesis” and two sample points (a) and (b), each with radii r1, r2, r3. Right: log-log plots of volume versus radius for (a) and (b), with segments of slope 1 and slope 2; one transition to slope 2 is annotated “Reject!”.]
Figure 4: Rejecting fiber bundle model; another example in $\mathbb{R}^2$. Two samples
are marked as (a) and (b) along with their volume versus radius plots.
[Figure 5 flow chart: Select token → Compute distances to all other tokens → Sort distances → Detect slope changes → Report slopes on either side of detection → Check for slope increase (= reject Thm. 1).]
Figure 5: Flow chart of the proposed method.
three-point centered differences method, and then using a constant false alarm
rate detector to identify changes in these slope estimates as a function of radius.
It is worth noting that the false alarm rate is the significance level for our
test. For the results shown in Section 3, the significance level was set at $10^{-3}$.
Nevertheless, we found that our results in Section 3 were insensitive to the false
alarm rate, which means that the rejections were highly significant.
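The last three blocks of the flow chart can be sketched as follows. This is a simplified illustration under several assumptions, not the detector used for the results in Section 3: the cell-averaging form of the constant false alarm rate detector, its `guard`, `train`, `scale`, and `floor` parameters, the averaging `width`, and the fixed `min_increase` margin (a crude stand-in for a proper significance test) are all hypothetical choices. The input is noise-free synthetic data; real volume-versus-radius curves are noisy.

```python
import numpy as np

def centered_slopes(log_r, log_n):
    """Three-point centered differences: the slope at i uses points i-1 and i+1."""
    return (log_n[2:] - log_n[:-2]) / (log_r[2:] - log_r[:-2])

def cfar_hits(values, guard=2, train=8, scale=5.0, floor=1e-3):
    """Cell-averaging CFAR: flag cells whose magnitude exceeds `scale` times
    the mean magnitude of nearby training cells (guard cells excluded)."""
    mag = np.abs(values)
    hits = []
    for i in range(mag.size):
        left = mag[max(0, i - guard - train):max(0, i - guard)]
        right = mag[i + guard + 1:i + guard + 1 + train]
        noise = np.concatenate([left, right])
        if noise.size and mag[i] > scale * (noise.mean() + floor):
            hits.append(i)
    return hits

def group_runs(hits):
    """Merge consecutive indices into (first, last) runs, one per detection."""
    runs = []
    for h in hits:
        if runs and h == runs[-1][1] + 1:
            runs[-1][1] = h
        else:
            runs.append([h, h])
    return [tuple(r) for r in runs]

def fiber_bundle_test(log_r, log_n, width=5, min_increase=0.5):
    """Return (slope before, slope after, reject?) for each detected change."""
    slopes = centered_slopes(log_r, log_n)
    changes = np.diff(slopes)
    report = []
    for first, last in group_runs(cfar_hits(changes)):
        before = slopes[max(0, first - width + 1):first + 1].mean()
        after = slopes[last + 1:last + 1 + width].mean()
        report.append((before, after, bool(after - before > min_increase)))
    return report

log_r = np.linspace(0.0, 4.0, 81)
# Slope 1 then slope 3: an increase, inconsistent with a fiber bundle.
increasing = np.where(log_r < 2.0, log_r, 2.0 + 3.0 * (log_r - 2.0))
# Slope 2 then slope 1: a decrease, consistent with a fiber bundle.
decreasing = np.where(log_r < 2.0, 2.0 * log_r, 4.0 + (log_r - 2.0))
report_inc = fiber_bundle_test(log_r, increasing)
report_dec = fiber_bundle_test(log_r, decreasing)
```

In this sketch only the sign and size of the slope change at a detection decide the outcome, mirroring the final block of the flow chart: an increase leads to rejection, a decrease does not.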
3 Results
[Figure 6 plots: log-log volume versus radius curves with marked slope changes, annotated “fail to reject” (three times) and “reject!” (once).]
Figure 6: Log-log plots of volume (token count) versus radius for three tokens
in GPT2 with significant slope changes marked.
Figure 6 shows the volume versus radius curves for three tokens used by
GPT2. Of these, most of the slope changes shown are not inconsistent with the
fiber bundle hypothesis posited by Theorem 1. While this does not allow one
to conclude that the vicinity of $ and # are fiber bundles, if this were to be the
case, we could use Theorem 1 to estimate the base and fiber dimension from
the slopes on either side of the marked points.
Notice that the curve for ¢ exhibits two slope changes. One slope change
in Figure 6 represents a violation of the fiber bundle hypothesis posited by
Theorem 1, which implies that the vicinity of ¢ does not split cleanly into signal
versus noise. The rejection for ¢ is interesting: there are some sentences in
which the presence of ¢ is essential. (Note that ¢ was chosen for illustrative
purposes. The p-value for rejecting the fiber bundle hypothesis at ¢ is larger
than $\alpha = 10^{-3}$, so ¢ does not appear in Table 2.)
Given that each token subspace consists of multiple tokens, and we perform
the testing methodology in Section 2 for each token, it is important to distinguish
between two variants of the manifold and fiber bundle hypotheses: “is the token
subspace a manifold (or fiber bundle) overall?” and “is the token subspace a
manifold (or fiber bundle) near a given token?” The methodology in Section
2 addresses the latter directly. Each token constitutes a statistical test, the
collection of which is aggregated over the entire token space. Therefore, we
applied the Holm-Bonferroni multiple test correction to the p-values of each
token’s test. Rejections were reported using a significance level of $\alpha = 10^{-3}$. To
address the former question, the numbers of rejections for the two slope changes
(if they occur) are shown as two separate columns in Table 1.
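For readers unfamiliar with the Holm-Bonferroni correction, the following is a minimal sketch of the standard step-down procedure. The function name `holm_bonferroni` and the example p-values are hypothetical; the sketch does not reproduce how the per-token p-values were computed or grouped in this paper.

```python
def holm_bonferroni(p_values, alpha=1e-3):
    """Holm-Bonferroni step-down procedure.

    Returns a list of booleans, True where the null hypothesis is rejected
    while controlling the family-wise error rate at level `alpha`.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared to alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject

# Hypothetical p-values for five tokens.
p = [1e-5, 0.04, 3e-4, 4e-4, 6e-4]
decisions = holm_bonferroni(p, alpha=1e-3)
```

In this example only the first token is rejected: the second-smallest p-value, 3e-4, exceeds its threshold of 1e-3 / 4 = 2.5e-4, even though it is below the uncorrected level of 1e-3.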
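Table 1: Dimensional data for and number of tokens rejecting the manifold and
fiber bundle hypotheses
Model (token count)      | Manifold rejects              | Base dim. | Base rejects                 | Fiber dim. | Fiber rejects
GPT2 (n = 50257)         | 68 ($p < 3 \times 10^{-8}$)   | 14        | 7 ($p < 3 \times 10^{-8}$)   | 389        | 12 ($p < 9 \times 10^{-6}$)
Llemma7B (n = 32016)     | 33 ($p < 5 \times 10^{-9}$)   | 11        | 1 ($p < 3 \times 10^{-4}$)   | >106       | 0 (N/A)
Mistral7B (n = 32016)    | 40 ($p < 3 \times 10^{-7}$)   | 6         | 2 ($p < 8 \times 10^{-5}$)   | 48         | 1 ($p < 8 \times 10^{-4}$)
Pythia6.9B (n = 50254)   | 54 ($p < 2 \times 10^{-7}$)   | 2         | 0 (N/A)                      | 135        | 0 (N/A)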
Table 1 shows the results for the four models we analyzed. It is clear that
the models have quite different token input embeddings, and all of them exhibit
highly significant rejections of the manifold hypothesis. GPT2, Llemma7B and
Mistral7B also reject the fiber bundle hypothesis. The rejections of the fiber
bundle hypothesis are more frequent in the base space than the fiber space,
which is consistent with the polysemy interpretation of [2]. Table 2 shows each
of the fiber bundle violations for each model that are listed in Table 1.
While most of the tokens are not shared between the LLMs, Llemma7B and
Mistral7B do have identical token sets. The fact that Table 1 shows significant
differences between these two models indicates that the structure of the
singularities for these two models is quite different. This implies that their
responses to the same prompt are expected to be markedly different, even without
considering their respective transformer stages.
There are many more rejections of the manifold hypothesis than can be
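Table 2: Violations of the fiber bundle hypothesis
Model     | Token       | Base/fiber | p-value              | Comment
GPT2      | Xan         | Base       | $3 \times 10^{-8}$   | Must start a word
GPT2      | aunder      | Base       | $2 \times 10^{-4}$   |
GPT2      | Dri         | Base       | $2 \times 10^{-4}$   |
GPT2      | ney         | Base       | $3 \times 10^{-4}$   |
GPT2      | rodu        | Base       | $3 \times 10^{-4}$   |
GPT2      | Insert      | Base       | $4 \times 10^{-4}$   |
GPT2      | Ying        | Base       | $4 \times 10^{-4}$   | Must start a word
GPT2      | laughable   | Fiber      | $9 \times 10^{-6}$   | Must start a word
GPT2      | nuance      | Fiber      | $2 \times 10^{-4}$   | Must start a word
GPT2      | dt          | Fiber      | $2 \times 10^{-4}$   |
GPT2      | Mesh        | Fiber      | $2 \times 10^{-4}$   |
GPT2      | affect      | Fiber      | $3 \times 10^{-4}$   | Must start a word
GPT2      | Thankfully  | Fiber      | $3 \times 10^{-4}$   |
GPT2      | swat        | Fiber      | $6 \times 10^{-4}$   | Must start a word
GPT2      | Malaysian   | Fiber      | $6 \times 10^{-4}$   | Must start a word
GPT2      | Palestinian | Fiber      | $7 \times 10^{-4}$   | Must start a word
GPT2      | wins        | Fiber      | $8 \times 10^{-4}$   | Must start a word
GPT2      | hedon       | Fiber      | $9 \times 10^{-4}$   |
GPT2      | donor       | Fiber      | $9 \times 10^{-4}$   | Must start a word
Llemma7B  | pax         | Base       | $3 \times 10^{-4}$   |
Mistral7B | H0          | Base       | $5 \times 10^{-4}$   |
Mistral7B | monitor     | Base       | $8 \times 10^{-5}$   | Must start a word
Mistral7B | änge        | Fiber      | $8 \times 10^{-4}$   |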
conveniently listed in a table. Therefore, we list some general trends of which
tokens cause the manifold hypothesis to be rejected.
• The GPT2 tokens at singularities are tokens that can only appear at the
beginning of words.
• The Pythia6.9B tokens at singularities are nearly all word fragments or
short sequences of text that are quite meaningless on their own.
• The Llemma7B and Mistral7B tokens at singularities are a combination
of the previous two: either they can only appear at the beginning of words
or they are word fragments.
Figures 7–10 show representations of each of the models we analyzed. The
visualizations were created by first reducing the latent space dimension from
its original value to 50 via principal components analysis, then further reducing
to 2 dimensions via t-SNE. The fiber space is clearly stratified in each of the
models, but the kinds of stratifications are rather different.
In Llemma7B, Pythia6.9B, and GPT2 (Figures 7, 8, and 10 respectively),
there are isolated regions of tokens with very small dimensional neighborhoods.
This suggests that these particular low-dimensional tokens may exhibit semantic
polysemy as anticipated in [2]. In Pythia6.9B, the “pinch point” shown in Figure
8 consists mostly of long strings of non-printing and whitespace characters.
In Llemma7B and Mistral7B (Figures 7 and 9, respectively), there are
stratification boundaries: on one side of the boundary the dimension of tokens is
much higher than on the other side. While the interpretation of this kind of
stratification is unclear, it suggests that there may be variability in the training
data support for the implicated tokens. Given the significant difference in the
structure of the spaces shown in Figures 7 and 9, we can conclude that the token
subspaces for these two models are quite different, even though both of these
LLMs use the same tokens.
The fiber space of GPT2 (Figure 10) also exhibits a feature not seen in the
other models, namely a large cluster of low-dimensional tokens isolated from the
others. This cluster was identified in [9], and investigation of the cluster revealed
that it mostly contains numeric tokens and date-related tokens. The clustering
of numeric tokens likely means that it is hard for GPT2 to distinguish different
numbers. This could cause GPT2 to fail to distinguish prompts involving dates
from those involving mathematical operations.
The base space is not visibly stratified in Llemma7B, Pythia6.9B, and GPT2
(Figures 7, 8, and 10 respectively), but is visibly stratified in Mistral7B (Figure
9).
[Figure 7 image omitted. Panels: Base, Fiber. Axes: PC1, PC2. Color: token
local dimension (z-score). Annotations: stratification boundary (dimension
change); tiny cluster of non-printing single-byte tokens.]
Figure 7: Scatterplot of Llemma7B tokens colored by local base and fiber
dimension, projected to 2d via principal components analysis. Because the
distribution of dimensions is very different for base and fiber, the colors are
normalized via z-scores independently for base and fiber.
[Figure 8 image omitted. Panels: Base, Fiber. Axes: PC1, PC2. Color: token
local dimension (z-score). Annotation: "pinch point" in fiber space.]
Figure 8: Scatterplot of Pythia6.9B tokens colored by local base and fiber
dimension, projected to 2d via principal components analysis. Because the
distribution of dimensions is very different for base and fiber, the colors are
normalized via z-scores independently for base and fiber. The pinch point shown
in the fiber space consists mostly of strings of non-printing and whitespace
characters.
[Figure 9 image omitted. Panels: Base, Fiber. Axes: PC1, PC2. Color: token
local dimension (z-score). Annotation: stratification boundaries.]
Figure 9: Scatterplot of Mistral7B tokens colored by local base and fiber
dimension, projected to 2d via principal components analysis. Because the
distribution of dimensions is very different for base and fiber, the colors are
normalized via z-scores independently for base and fiber.
[Figure 10 image omitted. Panels: Base, Fiber. Axes: PC1, PC2. Color: token
local dimension (z-score). Annotations: tokens with leading spaces; numerics
(and others).]
Figure 10: Scatterplot of GPT2 tokens colored by local base and fiber dimension,
projected to 2d via principal components analysis. Because the distribution of
dimensions is very different for base and fiber, the colors are normalized via
z-scores independently for base and fiber.
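The independent z-score normalization described in the captions of Figures
7–10 can be sketched as follows. This is a minimal illustration, not the
authors' plotting code: the helper `z_scores` and the dimension values are
hypothetical, and the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (v - mean) / std for each value; all zeros if std is 0."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical local dimension estimates for five tokens (not from the paper).
base_dims = [2.0, 4.0, 4.0, 4.0, 6.0]
fiber_dims = [20.0, 40.0, 40.0, 40.0, 60.0]

# Normalize each panel on its own so that both share a comparable color scale,
# even though the raw base and fiber dimensions live on different scales.
base_colors = z_scores(base_dims)
fiber_colors = z_scores(fiber_dims)
```

Because each list is centered and scaled separately, the two hypothetical
panels above receive identical color values despite a tenfold difference in
raw dimension.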
4 Discussion
None of the four LLMs we studied have token subspaces that are manifolds, and
three of the four are also not fiber bundles. Singularities (tokens that cause
rejections of the manifold hypothesis) occur in different ways across all four
LLMs. Additionally, singularities that correspond to violations of the fiber
bundle hypothesis are tokens whose neighborhoods exhibit a dependency between
the large- and small-scale variability.
Singularities may arise either as artifacts of the training process or from
features of the languages being represented. Consistent with the idea that
polysemy may yield singularities [2], several of the tokens in Table 2 are clear
homonyms. For instance, both “affect” and “monitor” can be used as either nouns
or verbs, and their meanings are different in these two roles.
Because tokens are fragments of text, a token may correspond to homonyms
after the addition of a prefix or suffix. A token like “aunder” can be prefixed to
yield the word “launder”, which is a contranym: a word with multiple meanings
of opposite sense. Specifically, one can “launder” clothing (which has a positive
connotation) or “launder” money (which has a negative connotation). Several
other tokens in Table 2 form words with substantially different meanings or
grammatical roles upon adding a prefix or suffix. For instance, “wins” can
appear as a noun or a verb, and is also part of the adjective “winsome”.
The grammatical roles of tokens are likely a root cause of some of the sensitivity
of LLMs to their prompts that has been observed in the literature, and may
explain why “explaining LLM behavior” remains difficult. Most methods for
explaining LLM behavior in terms of dynamical systems, for instance, derive
their inferential power from assuming that the token subspace is a manifold.
Our results show that these theoretical methods simply do not apply to actual
LLMs.
The fact that the token subspaces of these LLMs are not manifolds means that
the geodesic distance between tokens can be very unstable. As a result, while the
distance along geodesics can be defined, it may not correlate with any sense of
semantic distance between tokens. Furthermore, as [9] indicated, in most of the
models, there are tokens with dimension 0 neighborhoods. These tokens are
therefore isolated, which implies that the token subspace is disconnected. The
geodesic distance between an isolated token and any other token is therefore
infinite.
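The effect of an isolated token on geodesic distance can be illustrated with a
small sketch. It approximates geodesics by shortest paths in a neighborhood
graph, which is an assumption made here for illustration rather than a method
taken from the paper; the function `graph_geodesic`, the threshold `eps`, and
the point coordinates are all hypothetical.

```python
import heapq
import math


def graph_geodesic(points, eps, source):
    """Shortest-path distances from `source` in the eps-neighborhood graph."""
    n = len(points)
    dist = [math.inf] * n
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist[i]:
            continue  # stale queue entry
        for j in range(n):
            if j == i:
                continue
            step = math.dist(points[i], points[j])
            # Only points within eps of each other are joined by an edge.
            if step <= eps and d + step < dist[j]:
                dist[j] = d + step
                heapq.heappush(heap, (dist[j], j))
    return dist


# Hypothetical 2-d "embeddings": four points along a line and one far away.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 10.0)]
distances = graph_geodesic(points, eps=1.5, source=0)
```

In this toy configuration the last point has no neighbor within `eps`, so no
path reaches it and its distance stays at `math.inf`, mirroring the infinite
geodesic distance to an isolated token.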
The differences in how the manifold and fiber bundle hypotheses are rejected
across different LLMs suggest that the training methodology for each model
leaves an indelible fingerprint. Making general assertions about LLMs without
consideration of the details of their training is likely fraught. Even between
Llemma7B and Mistral7B, which have identical tokens, prompts likely cannot
be “ported” from one LLM to another without significant change if they contain
tokens near singularities.
A few clear patterns among tokens near singularities are nevertheless
noticeable. Tokens that begin a word or are a word fragment are often located at
a singularity. Additionally, in Llemma7B (but not Mistral7B) and Pythia6.9B
the tokens with unusually low fiber dimension often contain non-printing or
whitespace characters. This suggests that these models are quite sensitive to
text layout, perhaps to the exclusion of more semantically salient features in the
text. Given our findings, future experiments can be run to explore the impact
of singular tokens on the variability of responses produced by different LLMs.
Acknowledgments
The authors would like to thank Anand Sarwate and Andrew Lauziere for helpful
suggestions on a draft of this manuscript. This article is based upon work
partially supported by the Defense Advanced Research Projects Agency (DARPA).
Any opinions, findings and conclusions, or recommendations expressed in this
material are those of the authors and do not necessarily reflect the views of
DARPA.
References
[1] Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying
language models’ sensitivity to spurious features in prompt design or: How I
learned to start worrying about prompt formatting. ArXiv, abs/2310.11324,
2023.
[2] Alexander Jakubowski, Milica Gasic, and Marcus Zibrowius. Topology of
word embeddings: Singularities reflect polysemy. In Iryna Gurevych,
Marianna Apidianaki, and Manaal Faruqui, editors, Proceedings of the Ninth
Joint Conference on Lexical and Computational Semantics, pages 103–113,
Barcelona, Spain (Online), December 2020. Association for Computational
Linguistics.
[3] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya
Sutskever, et al. Language models are unsupervised multitask learners.
OpenAI blog, 1(8):9, 2019.
[4] Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos,
Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean
Welleck. Llemma: An open language model for mathematics, 2024.
[5] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford,
Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna
Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud,
Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas
Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023.
[6] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley,
Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit,
USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika,
and Oskar van der Wal. Pythia: A suite for analyzing large language models
across training and scaling, 2023.
[7] Max Vargas, Reilly Cannon, Andrew Engel, Anand D Sarwate, and Tony
Chiang. Understanding generative AI content with embedding models, 2024.
[8] Archit Rathore, Yichu Zhou, Vivek Srikumar, and Bei Wang. Topobert:
Exploring the topology of fine-tuned word representations. Information
Visualization, 22(3):186–208, 2023.
[9] Michael Robinson, Sourya Dey, and Shauna Sweet. The structure of the
token space for large language models, 2024.
[10] Vasilii A. Gromov, Nikita S. Borodin, and Asel S. Yerbolova. A language
and its dimensions: Intrinsic dimensions of language fractal structures.
Complexity, 2024.
[11] Eduard Tulchinskii, Kristian Kuznetsov, Laida Kushnareva, Daniil
Cherniavskii, Serguei Barannikov, Irina Piontkovskaya, Sergey Nikolenko, and
Evgeny Burnaev. Intrinsic dimension estimation for robust detection of
AI-generated texts, 2023.
[12] Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing
the manifold hypothesis. Journal of the American Mathematical Society,
29(4):983–1049, 2016.
[13] J. Lee. Smooth Manifolds. Springer, 2003.
[14] Alfred Gray. The volume of a small geodesic ball of a Riemannian manifold.
Michigan Mathematical Journal, 20(4):329–344, 1974.
Supplementary
This section contains mathematical justification for the fiber bundle model
proposed earlier in the paper and the proof of Theorem 1. The central idea is the
use of a special kind of fiber bundle, namely a fibered manifold. This is done by
placing a specific structure on a manifold $E$ that describes the data, by relating
it to another, lower dimensional, manifold $B$, called the base space, via a smooth
map $p : E \to B$.
Definition 1. A fibered manifold is a surjective function $p : E \to B$ such that
the Jacobian matrix $d_x p$ at every point $x \in E$ has rank equal to the dimension
of $B$.
By the submersion theorem [13], if the Jacobian matrix of $p$ at every point
has rank equal to the dimension of $B$, then the preimages $p^{-1}(x) \subseteq E$ of each
point $x \in B$ are all diffeomorphic to each other. These preimages form the fibers
discussed in the earlier sections of the paper.
As a consequence, each point $y$ in the base space $B$ has an open neighborhood
$U$ where the preimage $p^{-1}(U)$ is diffeomorphic to the product $U \times p^{-1}(y)$, which
is precisely the base-fiber split discussed in Section 2.1. Specifically, the base
dimension is simply the dimension of $B$, whereas the fiber dimension is
$\dim E - \dim B$.
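As a concrete illustration of the base-fiber split, consider an open cylinder
fibered over the circle. This example is chosen here for illustration and does
not appear in the paper.

```latex
\[
E = S^1 \times (0,1), \qquad B = S^1, \qquad p(\theta, t) = \theta .
\]
In the local coordinates $(\theta, t)$ the Jacobian is
\[
d_{(\theta,t)}\, p = \begin{pmatrix} 1 & 0 \end{pmatrix},
\]
which has rank $1 = \dim B$ at every point, so $p$ is a fibered manifold.
Each fiber is $p^{-1}(\theta) = \{\theta\} \times (0,1)$, hence
\[
\text{base dimension} = \dim B = 1, \qquad
\text{fiber dimension} = \dim E - \dim B = 2 - 1 = 1 .
\]
```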
The notion of a fibered manifold $p : E \to B$ forms the intrinsic model of the
data, which is only implicit in an LLM. The tokens present in a given LLM can
be thought of as a sample from a probability distribution $m$ on $E$, which can be
taken to be the Riemannian volume form on $E$ normalized so that $m(E) = 1$.
Definition 2. If $f : E \to \mathbb{R}^d$ is a smooth map and $m$ is a volume form on $E$,
then the pushforward is defined by
$$
(f_* m)(V) := m\big(f^{-1}(V)\big)
$$
for each measurable set $V$.
It is a standard fact that if $f$ is a fibered manifold or an embedding, then
$f_* m$ is also a volume form.
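The pushforward in Definition 2 can be made concrete with a finite analogue.
The sketch below is an illustration under stated assumptions, not part of the
paper: it replaces the volume form by a discrete measure stored as a
dictionary, and the helpers `pushforward` and `measure_of`, the grid `E`, and
the projection `p` are all hypothetical names.

```python
from fractions import Fraction


def pushforward(measure, f):
    """Push a finite measure {point: mass} forward along the map f."""
    pushed = {}
    for point, mass in measure.items():
        image = f(point)
        # (f_* m)({y}) collects the mass of every point in the preimage of y.
        pushed[image] = pushed.get(image, Fraction(0)) + mass
    return pushed


def measure_of(measure, subset):
    """Total mass that `measure` assigns to the set `subset`."""
    return sum((mass for point, mass in measure.items() if point in subset),
               Fraction(0))


def p(point):
    """Projection onto the first coordinate, standing in for p : E -> B."""
    return point[0]


# Hypothetical uniform probability measure on a 3x2 grid standing in for E.
E = [(x, y) for x in range(3) for y in range(2)]
m = {pt: Fraction(1, len(E)) for pt in E}

p_m = pushforward(m, p)

V = {0, 2}                                  # a subset of the base
preimage = {pt for pt in E if p(pt) in V}   # p^{-1}(V)
```

By construction, `measure_of(p_m, V)` and `measure_of(m, preimage)` agree,
which is the defining identity $(f_* m)(V) = m(f^{-1}(V))$ in this finite
setting.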
The explicit representation of the token subspace arises by embedding the
tokens within a Euclidean space $\mathbb{R}^d$. On the hypothesis that the tokens lie on a
fibered manifold (recall that they may not), the token input embedding consists
of a smooth embedding $e : E \to \mathbb{R}^d$. If this is the correct representation of the
tokens, then the probability distribution $m$ on $E$ will impact the distribution of
tokens within $\mathbb{R}^d$. Theorem 1 characterizes the resulting probability distribution
using parameters (the exponents in Equation (2)) that can be estimated from
the token input embedding, as described by the earlier sections of this paper.
These parameters are bounded by the dimensions of the base and fiber spaces.
Theorem 1. Suppose that $E$ is a compact, finite-dimensional Riemannian
manifold with boundary¹, with a volume form $m$ satisfying $m(E) < \infty$, and let
$p : E \to B$ be a fibered manifold.
If $e : E \to \mathbb{R}^d$ is a smooth embedding with reach $\tau$, then there is a function
$\rho : e(E) \to [0, \tau]$ such that, for $x \in e(E)$,
$$
(e_* m)(B_r(x)) =
\begin{cases}
O\big(r^{\dim E}\big) & \text{if } 0 \le r \le \rho(x), \\
(e_* m)\big(B_{\rho(x)}(x)\big) + O\big((r - \rho(x))^{\dim B}\big) & \text{if } \rho(x) \le r,
\end{cases}
\tag{2}
$$
where the asymptotic limits are valid for small $r$.
As a special case, $m$ may be normalized to yield a probability measure.
Proof. Since $e$ is assumed to be a smooth embedding, the image of $e$ is a manifold
of dimension $\dim E$. The pushforward of a volume form is a contravariant
functor, so this means that $e_* m$ is the volume form for a Riemannian metric on
$e(E)$. Using this Riemannian metric on $e(E)$, [14, Thm 3.1] implies that
for every $x \in e(E)$, if $r \ll \tau$, then
$$
(e_* m)(B_r(x)) = O\big(r^{\dim E}\big). \tag{3}
$$
Since $E$ is compact, $B$ is also compact via the surjectivity of $p$. This implies
that there is a maximum radius $r_1$ for which a ball of this radius centered on a
point $x \in e(E)$ is entirely contained within $e(E)$. Also by compactness of $B$,
there is a minimum radius $r_2$ such that a ball of radius $r_2$ centered on a point
$x \in e(E)$ contains a point outside of $e(E)$.
Since $e$ is assumed to be an embedding, by the tubular neighborhood theorem
[13], it must be that $r_2 < \tau$. Define
$$
\rho(x) := \operatorname{argmax}_r \{ B_r(x) \subseteq e(E) \},
$$
from which it follows that $0 < r_1 \le \rho(x) \le r_2 < \tau$. As a result, Equation (3)
holds for all $r < \rho(x)$, which is also the first case listed in Equation (2).
If $r$ is chosen such that $\rho(x) < r < \tau$, the volume of the ball centered on $x$
of radius $r$ will be less than what is given by Equation (3), namely
$$
(e_* m)(B_r(x)) < O\big(r^{\dim E}\big).
$$
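Since $m$ is a volume form, its pushforward $(p_* m)$ onto $B$ is also a volume
form. Moreover, via the surjectivity of $p$,
$$
\begin{aligned}
(e_* m)(B_r(x)) &= m\big(e^{-1}(B_r(x))\big) \\
&\le m\big(p^{-1}(p(e^{-1}(B_r(x))))\big) \\
&\le (p_* m)\big(p(e^{-1}(B_r(x)))\big) \\
&\le O\big(r^{\dim B}\big).
\end{aligned}
$$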
¹ Every point in a manifold with boundary has a neighborhood that is locally homeomorphic
to a half-space. As a consequence, manifolds are a special case of manifolds with boundary.
From this, the second case of Equation (2) follows by recentering the asymptotic
series on $\rho(x)$.
Notice that the second case in Equation (2) may be precluded: while it holds
for small $r$, $\rho(x)$ itself may not be sufficiently small. As a
consequence, the second case only occurs when both $r$ and $\rho(x)$ are sufficiently
small. In the results shown in Section 3, both cases appear to hold frequently.
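The two regimes in Equation (2) can be seen in a toy example. The sketch below
is an illustration under stated assumptions, not the estimation procedure used
in the paper: it counts lattice points of a thin strip (hypothetical data)
inside balls of growing radius and fits log-log slopes. The long direction of
the strip stands in for the base, the strip itself for $E$, and the half-width
for $\rho$ at the origin.

```python
import math


def ball_count(points, radius):
    """Number of lattice points within `radius` of the origin."""
    r2 = radius * radius
    return sum(1 for (i, j) in points if i * i + j * j <= r2)


def loglog_slope(radii, counts):
    """Least-squares slope of log(count) against log(radius)."""
    xs = [math.log(r) for r in radii]
    ys = [math.log(c) for c in counts]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Hypothetical data: a long thin strip of lattice points, so that
# dim E = 2, dim B = 1, and rho at the origin is the half-width.
HALF_LENGTH, HALF_WIDTH = 500, 20
strip = [(i, j)
         for i in range(-HALF_LENGTH, HALF_LENGTH + 1)
         for j in range(-HALF_WIDTH, HALF_WIDTH + 1)]

# Half-integer radii avoid lattice points landing exactly on the boundary.
small_radii = [5.5, 7.5, 9.5, 11.5, 13.5, 15.5, 17.5, 19.5]      # r < rho
large_radii = [150.5, 200.5, 250.5, 300.5, 350.5, 400.5, 450.5]  # r >> rho

small_counts = [ball_count(strip, r) for r in small_radii]
large_counts = [ball_count(strip, r) for r in large_radii]

small_slope = loglog_slope(small_radii, small_counts)
large_slope = loglog_slope(large_radii, large_counts)
```

For radii below the half-width the ball only sees a two-dimensional patch, so
`small_slope` comes out close to 2; for much larger radii the count grows
roughly linearly with the radius, so `large_slope` comes out close to 1.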