Authors: Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, Mengnan Du
Paper Content:
A Survey on Sparse Autoencoders:
Interpreting the Internal Mechanisms of Large Language Models
Dong Shu1,†, Xuansheng Wu2,†, Haiyan Zhao3,†, Daking Rai4,
Ziyu Yao4, Ninghao Liu2, Mengnan Du3
1 Northwestern University, 2 University of Georgia,
3 New Jersey Institute of Technology, 4 George Mason University
dongshu2024@u.northwestern.edu, {xw54582,ninghao.liu}@uga.edu, {hz54,mengnan.du}@njit.edu, {drai2,ziyuyao}@gmu.edu
Abstract
Large Language Models (LLMs) have revolu-
tionized natural language processing, yet their
internal mechanisms remain largely opaque.
Recently, mechanistic interpretability has at-
tracted significant attention from the research
community as a means to understand the inner
workings of LLMs. Among various mechanis-
tic interpretability approaches, Sparse Autoen-
coders (SAEs) have emerged as a particularly
promising method due to their ability to dis-
entangle the complex, superimposed features
within LLMs into more interpretable compo-
nents. This paper presents a comprehensive
examination of SAEs as a promising approach
to interpreting and understanding LLMs. We
provide a systematic overview of SAE princi-
ples, architectures, and applications specifically
tailored for LLM analysis, covering theoretical
foundations, implementation strategies, and re-
cent developments in sparsity mechanisms. We
also explore how SAEs can be leveraged to
explain the internal workings of LLMs, steer
model behaviors in desired directions, and de-
velop more transparent training methodologies
for future models. Despite the challenges that
remain around SAE implementation and scal-
ing, they continue to provide valuable tools for
understanding the internal mechanisms of large
language models.
1 Introduction
Large Language Models (LLMs), such as GPT-
4 (OpenAI et al., 2024), Claude-3.5 (Anthropic,
2024), DeepSeek-R1 (DeepSeek-AI et al., 2025),
and Grok-3 (xAI, 2025), have emerged as powerful
tools in natural language processing, demonstrat-
ing remarkable capabilities in tasks ranging from
text generation to complex reasoning. However,
their increasing size and complexity have created
significant challenges in understanding their inter-
nal representations and decision-making processes.
(† Equal contribution)
This “black box” nature of LLMs has sparked growing interest in mechanistic interpretability (Bereska
and Gavves, 2024; Zhao et al., 2024a; Rai et al.,
2024; Zhao et al., 2024b), a field that aims to break
down LLMs into understandable components and
systematically analyze how these components inter-
act to produce emergent behaviors. Understanding
these mechanisms is crucial not only for scientific
progress but also for addressing concerns related
to safety, reliability, and alignment of increasingly
powerful LLM systems.
Among the various approaches to interpreting
LLMs, Sparse Autoencoders (SAEs) (Cunningham
et al., 2023; Bricken et al., 2023; Gao et al., 2025;
Rajamanoharan et al., 2024b) have emerged as
a particularly promising direction for addressing
a fundamental challenge in LLM interpretability:
polysemanticity. Many neurons in LLMs are polysemantic, responding to seemingly unrelated concepts or features simultaneously. This phenomenon likely results from superposition (Elhage et al., 2022), where LLMs represent more
independent features than they have neurons by
encoding each feature as a linear combination of
neurons. SAEs address this issue by learning an
overcomplete, sparse representation of neural ac-
tivations, effectively disentangling these superim-
posed features into more interpretable units. By
training a sparse autoencoder to reconstruct the ac-
tivations of a target network layer while enforcing
sparsity constraints, SAEs can extract a larger set
of monosemantic features that offer clearer insights
into what information the LLM is processing. This
approach has shown considerable promise in trans-
forming the often-inscrutable activations of LLMs
into more human-understandable representations,
potentially creating a more effective vocabulary for
mechanistic analysis of these complex systems.
arXiv:2503.05613v1 [cs.LG] 7 Mar 2025

[Figure 1, two panels. (a) SAE Framework, with labels: Token, Transformer Block, Next-word Prediction, Large Language Model, Sparse Autoencoder, Activation Function, Reconstruction, Sparsity. (b) SAE History, a timeline from 09/22 to 10/24 organized by model-size stage, with labels: Neurons (Toy Model, GPT-2 Neuron), Small Size LLMs (Transformer Model, GPT-2, Pythia-70m), Close Source (GPT-4, Claude-3), Open Source (LLaMa-3.1, Gemma-2).]

Figure 1: (a) This figure illustrates the fundamental framework of a Sparse Autoencoder (SAE). SAE is trained to take a model representation $z$ as input and project it to an overcomplete sparse activation $h(z)$ by learning to reconstruct the original input $\hat{z}$. The SAE typically comprises an encoder, a decoder, and a loss function for training. (b) The development of the SAE progresses through multiple stages, with each stage drawing inspiration from and building upon the previous one.

In this paper, we provide a comprehensive overview of SAEs for LLM interpretability, beginning with their foundational motivations and development history. We then explore the technical framework of SAEs, including their basic architecture, various design improvements, and effective training strategies. The survey examines
different approaches to analyzing and explaining
SAE features, categorized broadly into input-based
and output-based explanation methods. We dis-
cuss evaluation methodologies for assessing SAE
performance, covering both structural metrics that
analyze the properties of the learned features and
functional metrics that evaluate their practical util-
ity. The paper further delves into real-world ap-
plications of SAEs in understanding and manipu-
lating LLMs, including model behavior analysis,
intentional steering of model outputs, and insights
for improved model training. Lastly, we highlight
current research challenges and conclude with per-
spectives on promising future directions.
2 Why Sparse Autoencoders?
As LLMs continue to grow rapidly in size, interpretation becomes more challenging, because the complexity of their latent space and internal representations grows with them. SAEs have emerged as a powerful tool to understand how LLMs make decisions. This goal belongs to mechanistic interpretability, which aims to reverse-engineer models by breaking down their internal computations into understandable, interpretable components. An SAE is designed to learn a sparse, linear, and decomposable representation of the internal activations of an LLM. It enforces a sparsity constraint so that
only a few features are active at any given time.
This encourages each active feature to correspond
to a specific, understandable concept. This simplification allows researchers to focus on a few key
features rather than being overwhelmed by the full
complexity of the model. Below, we discuss the
development history of the SAE for LLMs and
present Figure 1b to visually depict this progress.
Due to page limitations, we do not attempt to pro-
vide an exhaustive history of SAEs, but instead
focus on highlighting key milestones in the devel-
opment of SAEs for mainstream LLMs.
Explaining Individual Neurons.
The development of interpretability techniques for
LLMs has progressed in stages rather than as a
single step. From 2022 to 2023, researchers at
OpenAI and Anthropic focused on understanding
LLMs by examining individual neurons. OpenAI,
for instance, leveraged GPT-4 to generate natural language explanations for neurons in models like GPT-2, attempting to map specific neuron activations to concrete linguistic or conceptual features
(Bills et al., 2023). Similarly, Anthropic built small
toy models trained on synthetic data to observe
how features are stored in neurons. These early in-
vestigations showed that analyzing single neurons
can provide initial insights (Elhage et al., 2022). In
addition, it is worth noting that explaining individual neurons by assigning interpretable feature labels to them had been extensively explored outside mechanistic interpretability (MI) (Radford et al., 2017; Donnelly and Roegiest, 2019; Nguyen et al., 2019; Szegedy et al., 2013) before MI was introduced.
However, researchers soon discovered that analyzing
individual neurons had significant limitations, as
neurons in LLMs often exhibit polysemanticity ,
i.e., responding to multiple unrelated inputs within
the same neuron. For instance, a single neuron
might simultaneously activate for academic cita-
tions, English dialogue, HTTP requests, and Ko-
rean text (Bricken et al., 2023). This phenomenon
is largely attributed to superposition , where neu-
ral networks represent more independent features
than available neurons by encoding each feature as
a linear combination of neurons (Ferrando et al.,
2024). While this architectural efficiency allows
models to encode vast amounts of knowledge, it
makes individual neurons difficult to interpret since
their activations represent entangled mixtures of
different concepts. This fundamental challenge
with neuron-level analysis motivated researchers
to explore more sophisticated approaches for dis-
entangling these superimposed features, leading to
the development of sparse autoencoders (SAEs) as
a promising solution for extracting interpretable,
monosemantic features from the model’s complex
internal representations.
SAEs for Small-size Language Models.
In late 2023, Anthropic advanced transformer in-
terpretability by moving beyond raw neuron activa-
tions to decompose model activations into single-
concept, monosemantic features, addressing the
polysemanticity of individual neurons in LLMs
(Cunningham et al., 2023; Bricken et al., 2023).
They trained SAEs on transformer activation data
by optimizing a reconstruction loss with a strong
sparsity constraint. This training forces the au-
toencoder to represent each activation as a sparse
combination of basis vectors, with each basis vec-
tor ideally capturing a distinct, interpretable con-
cept. SAEs transform the overlapping signals of
individual neurons into a set of clean, monose-
mantic features that are much easier to understand.
This approach offers a clear advantage over tradi-
tional neuron-based methods by isolating the key
features that drive model behavior. The promis-
ing experimental results on simpler transformer
models demonstrated that SAEs provide a more
effective and scalable route for interpreting model
internals. Building on this success, later work by Bloom (2024) and Samuel et al. (2024) applied SAE techniques to smaller models such as GPT-2 (Radford et al., 2019) and Pythia-70m (Biderman et al., 2023), thereby paving the way for their eventual extension to full-scale, billion-parameter LLMs.
SAEs for Large Language Models.
After witnessing the success of SAEs on smaller-scale models, the third stage of their development emerged in 2024, when Anthropic (Templeton et al., 2024) and OpenAI (Gao et al., 2025) became the first groups to apply SAEs to their latest
proprietary LLMs, Claude 3 Sonnet and GPT-4,
respectively. This marked a significant step for-
ward in understanding these closed-source, black-
box models, even for the researchers who built
them. However, scaling SAEs from small models
to full-scale LLMs introduced several new chal-
lenges. One major issue was the sheer scale of
activations in models with billions of parameters,
which made training and extracting interpretable
features computationally expensive. Additionally,
ensuring that extracted features remained monose-
mantic became increasingly difficult, as feature
superposition is more prevalent in larger models
(Templeton et al., 2024). Despite these challenges,
researchers found that SAEs could effectively de-
compose polysemantic neurons into monosemantic
features, revealing meaningful and interpretable
latent representations within the models. For instance, Anthropic demonstrated that certain SAE features in Claude 3 Sonnet encode high-level concepts
such as “sycophantic praise”, where phrases like “a
generous and gracious man” strongly activate this
feature. Similarly, OpenAI’s research on GPT-4
identified a “humans have flaws” feature, which
activates on phrases like “My Dad wasn’t perfect
(are any of us?) but he loved us dearly.” These find-
ings not only deepen our understanding of model
behavior but also provide powerful interpretability
tools, allowing practitioners to better analyze,
refine, and steer language model outputs.
As the architecture and mechanisms of SAEs
become clearer, more researchers have begun to
follow this approach, applying SAEs to interpret
open-source models. For example, Google Deep-
Mind (Lieberum et al., 2024) used SAEs to analyze
Gemma 2 (Team et al., 2024), while He et al. (2024)
applied similar techniques to LLaMA 3.1 (Dubey
et al., 2024). This growing adoption highlights
the increasing role of SAEs in mechanistic inter-
pretability, paving the way for broader transparency
in both closed- and open-source LLMs.
3 Technical Framework of SAEs
3.1 Basic SAE Framework
An SAE is a neural network that learns an overcomplete dictionary for representation reconstruction. As shown in Figure 1a, the input of the SAE is the representation of a token extracted from LLMs, which is mapped onto a sparse vector of dictionary activations.

Table 1: Taxonomy of SAE Frameworks: An Overview of Basic and Variant Architectures.

Category                            Example                Activation   Citation
Basic SAE Framework (§3.1)          l2-norm SAE            ReLU         Ferrando et al. (2024)
Improve Architecture (§3.2.1)       Gated SAE              Jump ReLU    Rajamanoharan et al. (2024a)
                                    TopK SAE               TopK         Gao et al. (2025)
                                    Batch TopK SAE         Batch TopK   Bussmann et al. (2024)
                                    ProLU SAE              ProLU        Taggart (2024)
                                    JumpReLU SAE           Jump ReLU    Rajamanoharan et al. (2024b)
                                    Switch SAE             TopK         Mudide et al. (2024)
Improve Training Strategy (§3.2.2)  Layer Group SAE        Jump ReLU    Ghilardi et al. (2024)
                                    Feature Choice SAE     TopK         Ayonrinde (2024)
                                    Mutual Choice SAE      TopK         Ayonrinde (2024)
                                    Feature Aligned SAE    TopK         Marks et al. (2024)
                                    End-to-end SAE         ReLU         Braun et al. (2025)
                                    Formal Languages SAE   ReLU         Menon et al. (2024)
                                    Specialized SAE        ReLU         Muhamed et al. (2024)
Input. Given an LLM denoted as $f$ with a total of $L$ transformer layers, we consider an input sequence $x = (x_0, \dots, x_N)$ with $N$ tokens, where each $x_n \in x$ represents a token in the sequence. As the sequence $x$ is processed by the LLM, each token $x_n$ produces representations at different layers. For a specific layer $l$, we denote the hidden representation corresponding to token $x_n$ as $z_n^{(l)}$, where $z_n^{(l)} \in \mathbb{R}^d$ indicates the embedding vector of dimension $d$. Each representation $z_n^{(l)}$ serves as input to SAEs. In the following, we may omit the superscript $(l)$ of layers to simplify the notation.

After extracting the representation $z_n^{(l)}$, the SAE takes it as input, decomposes it into a sparse representation, and then reconstructs it. The SAE framework is typically composed of three key components: the encoder, which maps the input representation to a higher-dimensional sparse activation; the decoder, which reconstructs the original representation from this sparse activation; and the loss function, which ensures accurate reconstruction while enforcing sparsity constraints.
Encoding Step. Given an input representation $z \in \mathbb{R}^d$, the encoder applies a linear transformation using a weight matrix $W_{\text{enc}} \in \mathbb{R}^{d \times m}$ and a bias term $b_{\text{enc}} \in \mathbb{R}^m$, followed by a ReLU activation function to enforce sparsity. The encoding operation is defined as:

$$h(z) = \mathrm{ReLU}(z \cdot W_{\text{enc}} + b_{\text{enc}}), \tag{1}$$

where $h(z) \in \mathbb{R}^m$ represents the sparse activation vector, which helps disentangle superposition features. The ReLU activation function $\mathrm{ReLU}(x) = \max(0, x)$ ensures that only non-negative values pass through, which encourages sparsity by setting negative values to zero.
Since the SAE constructs an overcomplete dictionary to facilitate sparse activation, the number of learned dictionary elements $m$ is chosen to be larger than the input dimension $d$ (i.e., $m \gg d$). This overcompleteness allows the encoder to learn a richer and more expressive representation of the input, making it possible to reconstruct the original data using only a sparse subset of dictionary elements. The output $h(z)$ from the encoder is then passed to the decoding stage, where it is mapped back to the original input space to reconstruct $z$.
Decoding Step. After the encoding step, the next stage in the SAE framework is the decoding process, where the sparse activation vector $h(z)$ is mapped back to the original input space. This step ensures that the sparse features learned by the encoder contain sufficient information to accurately reconstruct the original representation. The decoding operation is defined as:

$$\hat{z} = \mathrm{SAE}(z) = h(z) \cdot W_{\text{dec}} + b_{\text{dec}}, \tag{2}$$

where $W_{\text{dec}} \in \mathbb{R}^{m \times d}$ is the decoder weight matrix, $b_{\text{dec}} \in \mathbb{R}^d$ is the decoder bias term, and $\hat{z} \in \mathbb{R}^d$ is the reconstructed output, which aims to approximate the original input $z$.
The accuracy of the reconstruction and the interpretability of the learned representation depend heavily on the effectiveness and sparsity of the activation vector $h(z)$. Therefore, the SAE is trained using a loss function that balances minimizing the reconstruction error and enforcing sparsity. This trade-off ensures that the learned dictionary elements provide a compact yet expressive representation of the input data.
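
The encoding and decoding steps of Equations 1 and 2 can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `sae_forward`, the sizes `d = 8` and `m = 32`, and the random initialization are all assumptions made for the example.

```python
import numpy as np

def sae_forward(z, W_enc, b_enc, W_dec, b_dec):
    """Map z (shape [d]) to sparse activations h (shape [m]) and reconstruct z_hat (shape [d])."""
    h = np.maximum(0.0, z @ W_enc + b_enc)  # Eq. 1: ReLU(z . W_enc + b_enc)
    z_hat = h @ W_dec + b_dec               # Eq. 2: h(z) . W_dec + b_dec
    return h, z_hat

# Illustrative (hypothetical) sizes: an overcomplete dictionary with m > d.
rng = np.random.default_rng(0)
d, m = 8, 32
W_enc = rng.normal(scale=0.1, size=(d, m))
b_enc = np.zeros(m)
W_dec = rng.normal(scale=0.1, size=(m, d))
b_dec = np.zeros(d)

z = rng.normal(size=d)  # stand-in for one token's hidden representation
h, z_hat = sae_forward(z, W_enc, b_enc, W_dec, b_dec)
print(h.shape, z_hat.shape, int((h > 0).sum()), "active features")
```

With untrained random weights roughly half of the entries of `h` are nonzero, which illustrates why ReLU alone does not yield a sparse code and why the loss function described next adds an explicit sparsity term.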
Loss Function. The activation vector $h(z)$ is encouraged to be sparse, meaning that most of its values should be zero. While the ReLU activation function after the encoder enforces basic sparsity by setting negative values to zero, it does not necessarily eliminate small positive values, which can still contribute to a dense representation. Therefore, additional sparsity enforcement is required. This is achieved using a sparsity regularization term in the loss function, which further promotes a minimal number of active features. Beyond enforcing sparsity, the SAE must also ensure that the learned sparse activation retains sufficient information to accurately reconstruct the original input $z$. The loss function for training the SAE consists of two key components: reconstruction loss and sparsity regularization:

$$\mathcal{L}(z) = \|z - \hat{z}\|_2^2 + \alpha \|h(z)\|_1, \tag{3}$$
where reconstruction loss ensures that the SAE learns to reconstruct the input data accurately, meaning the features encoded in the sparse representation must also be present in the input activations. On the other hand, sparsity regularization enforces sparsity by penalizing nonzero values in $h(z)$, and $\alpha$ is a hyper-parameter to control the penalty level of the sparsity. Specifically, without the sparsity loss, SAEs could simply memorize the training data, reconstructing the input without disentangling meaningful features. However, once the sparsity loss is introduced, the model is forced to activate only a small subset of neurons for reconstructing the input activation. This constraint encourages the SAE to focus on the most informative and critical features to reconstruct the input activation. A higher value of $\alpha$ enforces stronger sparsity by shrinking more values in $h(z)$ to zero, but this may lead to information loss and degraded reconstruction quality. A lower value of $\alpha$ prioritizes reconstruction accuracy but may result in less sparsity, reducing the interpretability of the learned features. Thus, selecting an optimal $\alpha$ is crucial for achieving a balance between interpretability and accurate data representation.

3.2 Different SAE Variants
As SAEs continue to emerge as a powerful tool for
interpreting the internal representations of LLMs,
researchers have increasingly focused on refining
and extending their capabilities. Various SAE vari-
ants have been proposed to address the limitations
of traditional SAEs, each introducing improve-
ments from different perspectives. In this section,
we categorize these advancements into two main
groups: Improve Architecture, which modifies the structure and design of the traditional SAE, and Improve Training Strategy, which retains the original architecture but introduces novel methods to
enhance training efficiency, feature selection, and
sparsity enforcement.
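
As a baseline for comparing the variants in this section, the traditional objective of Equation 3 (squared L2 reconstruction error plus an L1 penalty weighted by α) can be sketched as follows. The function name `sae_loss` and the toy vectors are assumptions chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

def sae_loss(z, z_hat, h, alpha):
    """Traditional SAE loss (Eq. 3): squared L2 reconstruction error + alpha * L1 of activations."""
    recon = np.sum((z - z_hat) ** 2)  # ||z - z_hat||_2^2
    sparsity = np.sum(np.abs(h))      # ||h(z)||_1
    return recon + alpha * sparsity, recon, sparsity

# Toy values (hypothetical): one coordinate is reconstructed with error 1.0.
z = np.array([1.0, 2.0])
z_hat = np.array([1.0, 1.0])
h = np.array([0.0, 0.5, 0.0, 1.5])

total, recon, sparsity = sae_loss(z, z_hat, h, alpha=0.1)
print(recon, sparsity, total)  # reconstruction 1.0, L1 norm 2.0, total 1.0 + 0.1 * 2.0
```

Because the L1 term grows with the magnitude of every active feature, lowering the loss can be achieved by shrinking activations rather than by switching features off, which is the shrinkage effect that several of the variants below try to avoid.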
3.2.1 Improve Architecture
Gated SAE. The Gated SAE (Rajamanoharan
et al., 2024a) is a modification of the standard SAE
that aims to improve the trade-off between recon-
struction fidelity and sparsity enforcement. Tradi-
tional SAEs suffer from shrinkage bias (Wright and
Sharkey, 2024), where the L1-norm regularization
systematically underestimates feature activations,
leading to reduced reconstruction accuracy. In-
spired by Gated Linear Units (Dauphin et al., 2017;
Shazeer, 2020), Gated SAEs replace the standard
ReLU encoder with a gated ReLU encoder, which
allows the L1 penalty to be applied solely to the
selection mechanism:
$$\tilde{h}(z) = \mathbb{1}[\pi_{\text{gate}}(z) > 0] \odot \mathrm{ReLU}(\pi_{\text{mag}}(z)), \tag{4}$$

where $\pi_{\text{gate}}(z) = W_{\text{gate}}(z - b_{\text{dec}}) + b_{\text{gate}}$ is the gating function that determines which features should be activated, and $W_{\text{gate}}$ is a weight matrix for feature selection. $\pi_{\text{mag}}(z) = W_{\text{mag}}(z - b_{\text{dec}}) + b_{\text{mag}}$ is the magnitude estimation function that determines the strength of the active features, and $W_{\text{mag}}$ is a weight matrix for magnitude estimation. $\mathbb{1}[\cdot]$ is the Heaviside step function that binarizes activations and $\odot$ denotes element-wise multiplication. In this case,
Gated SAEs introduce independent pathways for
determining which features are activated and their
respective strengths, reducing bias and improving
interpretability.
To optimize the Gated SAE, the authors introduce an auxiliary loss $\|z - \hat{z}_{\text{frozen}}(\mathrm{ReLU}(\pi_{\text{gate}}(z)))\|_2^2$ on top of the traditional loss function. This addresses the issue of the non-differentiability of the Heaviside step function used in the gating mechanism, and enables gradient flow during backpropagation while still enforcing sparsity. Here, $\hat{z}_{\text{frozen}}$ is a copy of the decoder with frozen weights.
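
The gated encoder of Equation 4 can be sketched as below. This is an illustrative forward pass only: the function name `gated_encoder`, the column-vector shapes (`W_gate` and `W_mag` of shape `[m, d]`), and the toy weights are assumptions, and the auxiliary loss used during training is not shown.

```python
import numpy as np

def gated_encoder(z, W_gate, b_gate, W_mag, b_mag, b_dec):
    """Gated SAE encoder (Eq. 4): a binary gate selects features, a separate path sets magnitudes."""
    x = z - b_dec
    pi_gate = W_gate @ x + b_gate          # decides WHICH features fire
    pi_mag = W_mag @ x + b_mag             # decides HOW STRONGLY they fire
    gate = (pi_gate > 0).astype(x.dtype)   # Heaviside step, 1[pi_gate > 0]
    return gate * np.maximum(0.0, pi_mag)  # element-wise product with ReLU(pi_mag)

# Toy weights (hypothetical) with d = 2 inputs and m = 3 features.
z = np.array([1.0, -1.0])
W_gate = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
W_mag = np.array([[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
b_gate, b_mag, b_dec = np.zeros(3), np.zeros(3), np.zeros(2)

h = gated_encoder(z, W_gate, b_gate, W_mag, b_mag, b_dec)
print(h)  # only the first feature is active, with magnitude 2.0
```

In this example the third feature has a positive magnitude estimate but a closed gate, so it stays at zero; this separation is what lets the L1 penalty act on the selection path without shrinking the magnitudes of the features that do fire.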
TopK SAE. The TopK SAE (Gao et al., 2025) is
an improvement over the traditional SAE, designed
to directly enforce sparsity without requiring L1-
norm regularization. Instead of penalizing all acti-
vations, which can introduce shrinkage bias, TopK
SAEs enforce sparsity by retaining only the top K
largest activations and setting the rest to zero. This
ensures that only the most important features con-
tribute to the learned representation. The encoder
applies a linear transformation followed by a hard
TopK selection:
$$\tilde{h}(z) = \mathrm{TopK}\big((z - b_{\text{pre}}) W_{\text{enc}}\big), \tag{5}$$

where $W_{\text{enc}} \in \mathbb{R}^{d \times m}$ is the encoder weight matrix, and $b_{\text{pre}} \in \mathbb{R}^d$ is a pre-encoder bias term subtracted from the input before the TopK selection.

Since the sparsity constraint is explicitly enforced through the TopK operation, there is no need for an additional sparsity regularization term in the loss function. The training objective reduces to the reconstruction loss plus an auxiliary term:

$$\mathcal{L}(z) = \|z - \hat{z}\|_2^2 + \alpha \mathcal{L}_{\text{aux}}, \tag{6}$$

where $\mathcal{L}_{\text{aux}}$ is an auxiliary loss scaled by the coefficient $\alpha$, designed to stabilize training and prevent dead latents (Templeton et al., 2024).
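
A minimal sketch of the TopK selection is given below, using the row-vector convention of Equation 1 (`W_enc` of shape `[d, m]`). The helper names `topk_activation` and `topk_encode` and the toy pre-activations are assumptions for illustration; the auxiliary loss is omitted.

```python
import numpy as np

def topk_activation(pre, k):
    """Keep the k largest entries of a 1-D pre-activation vector and set the rest to zero."""
    out = np.zeros_like(pre)
    idx = np.argpartition(pre, -k)[-k:]  # indices of the k largest values (unordered)
    out[idx] = pre[idx]
    return out

def topk_encode(z, W_enc, b_pre, k):
    """TopK encoder: subtract the pre-encoder bias, project, then keep the top k."""
    return topk_activation((z - b_pre) @ W_enc, k)

# Toy pre-activations (hypothetical): exactly K = 2 features survive.
pre = np.array([0.3, -1.0, 2.0, 0.7, 0.1])
h = topk_activation(pre, k=2)
print(h)  # the two largest values, 2.0 and 0.7, are kept in place
```

Because the number of nonzero entries is fixed at `k` by construction, sparsity no longer depends on a penalty coefficient, and the kept activations are not shrunk toward zero.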
BatchTopK SAE. The BatchTopK SAE is a mod-
ification of the TopK SAE, designed to address the
limitations of enforcing a fixed number of active
features per sample. Bussmann et al. (2024) iden-
tified two key inefficiencies in the standard TopK
SAE. First, it forces every token to use exactly $K$ features, even when some tokens may require fewer or more active features. Second, it does not allow flexibility across a batch, leading to inefficient sparsity control. To overcome these issues, BatchTopK SAEs apply the TopK selection globally across the entire batch instead of enforcing it per token. This means that BatchTopK selects the top $K \times B$ activations across the entire batch, where $B$ is the batch size. The encoder is modified to:

$$\tilde{h}(Z) = \mathrm{BatchTopK}\big((Z - b_{\text{pre}}) W_{\text{enc}}\big), \tag{7}$$

where $Z \in \mathbb{R}^{B \times d}$ is the input batch matrix, and $B$ is the batch size. As in the TopK SAE, BatchTopK directly controls sparsity through the selection mechanism, which eliminates the need for explicit sparsity regularization: $\mathcal{L}(Z) = \|Z - \hat{Z}\|_2^2 + \alpha \mathcal{L}_{\text{aux}}$.

ProLU SAE. The ProLU SAE (Taggart, 2024)
introduces a novel activation function called Pro-
portional ReLU, which serves as an alternative to
ReLU in traditional SAEs. ProLU SAE provides a
Pareto improvement over both standard SAEs with
L1-norm regularization, which suffer from shrinkage bias, and SAEs trained with a Sqrt(L1) penalty,
which attempt to mitigate shrinkage but still do not
fully address inconsistencies in activation scaling.
In contrast to ReLU, which applies a fixed thresh-
old at zero, ProLU introduces a learnable threshold
for each activation, allowing the model to deter-
mine the optimal activation boundary dynamically.
The ProLU activation function is defined as:
$$\mathrm{ProLU}(m_i, b_i) = \begin{cases} m_i, & \text{if } m_i + b_i > 0 \text{ and } m_i > 0 \\ 0, & \text{otherwise,} \end{cases} \tag{8}$$

where $m_i$ is the pre-activation output from the encoder, and $b_i$ is a learnable bias term that shifts the activation threshold. The encoding process in ProLU SAE replaces the standard ReLU activation with ProLU, leading to the following encoding function:

$$\tilde{h}(z) = \mathrm{ProLU}\big((z - b_{\text{dec}}) W_{\text{enc}}, b_{\text{enc}}\big). \tag{9}$$

The ProLU SAE training objective consists of the standard reconstruction loss combined with an auxiliary sparsity term:

$$\mathcal{L}(z) = \|z - \hat{z}\|_2^2 + \lambda P(\tilde{h}(z)), \tag{10}$$

where $\lambda$ is the sparsity penalty coefficient, and $P(\tilde{h}(z))$ is a sparsity-inducing function. The authors found that using a Sqrt(L1) penalty, defined as $P(h) = \|h\|_{1/2}$, provided better sparsity control compared to the standard L1-norm.
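
The ProLU activation of Equation 8 can be sketched element-wise as follows. Only the forward computation is shown; the function name `prolu` and the toy values are assumptions, and the gradient treatment needed to train through the threshold is outside the scope of this sketch.

```python
import numpy as np

def prolu(m, b):
    """ProLU (Eq. 8): pass m through unchanged where m + b > 0 and m > 0, otherwise output 0."""
    active = (m + b > 0) & (m > 0)
    return np.where(active, m, 0.0)

# Toy pre-activations m and per-feature biases b (hypothetical values).
m = np.array([1.0, 0.2, -0.5, 0.4])
b = np.array([-0.5, -0.5, 1.0, 0.0])

print(prolu(m, b))               # features 0 and 3 pass through with their original magnitudes
print(np.maximum(0.0, m + b))    # ReLU(m + b) for comparison: the bias also shifts the output values
```

The comparison line highlights the design choice: with ReLU applied to `m + b` the bias changes the magnitude of every active feature, whereas ProLU uses the bias only to decide whether a feature is active and leaves `m` itself as the output.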
JumpReLU SAE. The JumpReLU SAE (Raja-
manoharan et al., 2024b) is a modification of the
traditional SAE that replaces the standard ReLU
activation function with JumpReLU. The ReLU
activation function sets all negative pre-activation
values to zero but allows small positive values, lead-
ing to false positives in feature selection and un-
derestimation of feature magnitudes. JumpReLU
introduces an explicit threshold $\theta$ that zeroes out pre-activations below this threshold, ensuring that weak activations do not contribute to the reconstruction. The JumpReLU activation function is defined as:

$$\mathrm{JumpReLU}_\theta(z) = z \cdot H(z - \theta), \tag{11}$$

where $\theta$ is a learnable threshold and $H(x)$ is the Heaviside step function, which is 1 when $x > 0$ and 0 otherwise. The encoder in JumpReLU SAE follows a standard linear transformation followed by JumpReLU activation:

$$\tilde{h}(z) = \mathrm{JumpReLU}_\theta(W_{\text{enc}} z + b_{\text{enc}}). \tag{12}$$

Unlike traditional SAEs that use L1-norm for sparsity regularization, JumpReLU SAEs directly optimize the L0-norm, which counts the number of nonzero activations: $\mathcal{L}(z) = \|z - \hat{z}\|_2^2 + \alpha \|h(z)\|_0$.
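
A forward-only sketch of JumpReLU (Equation 11) and the L0 term is shown below. The per-feature threshold vector `theta`, the function name `jumprelu`, and the toy values are assumptions; how gradients are passed through the step function during training is not covered here.

```python
import numpy as np

def jumprelu(pre, theta):
    """JumpReLU (Eq. 11): keep a pre-activation only if it exceeds its threshold theta."""
    return np.where(pre > theta, pre, 0.0)

# Toy pre-activations with one threshold per feature (hypothetical values).
pre = np.array([0.05, 0.3, -0.2, 1.0])
theta = np.array([0.1, 0.1, 0.1, 0.1])

h = jumprelu(pre, theta)
l0 = np.count_nonzero(h)  # the ||h||_0 term of the loss counts active features
print(h, l0)              # the weak activation 0.05 is removed; 0.3 and 1.0 keep their values
```

Compared with plain ReLU, the small positive value 0.05 no longer leaks into the code, while the surviving activations are passed through without being reduced by the threshold.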
Switch SAE. Inspired by Mixture of Experts
(MoE) models (Shazeer et al., 2017), Switch SAE
(Mudide et al., 2024) introduces a more compu-
tationally efficient framework for training SAEs.
Instead of training a single large SAE, Switch
SAE leverages multiple smaller “expert SAEs”
$E_1, E_2, \dots, E_N$ and a routing network that dynamically assigns each input to an appropriate expert. This approach enables efficient scaling to a large number of features while avoiding the memory and FLOP bottlenecks of traditional SAEs. Each “expert SAE” follows a standard TopK SAE formulation:

$$E_i(z) = W^{(i)}_{\text{dec}} \cdot \mathrm{TopK}(W^{(i)}_{\text{enc}} z), \tag{13}$$

where $W^{(i)}_{\text{enc}}$ and $W^{(i)}_{\text{dec}}$ are the encoder and decoder weight matrices for expert $i$. The routing network determines which expert is assigned to each input by computing a probability distribution over the experts:

$$p(z) = \mathrm{softmax}(W_{\text{router}}(z - b_{\text{router}})), \tag{14}$$

where $W_{\text{router}}$ is the routing weight matrix, $b_{\text{router}}$ is the bias term, and $p(z)$ represents the probability of selecting each expert. The final reconstruction is computed as:

$$\hat{z} = p_{i^*(z)} E_{i^*(z)}(z - b_{\text{pre}}) + b_{\text{pre}}, \tag{15}$$

where $i^*(z)$ is the selected expert for input $z$.
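
The routing and reconstruction steps of Equations 13 to 15 can be sketched as follows. This is a simplified illustration under stated assumptions: the selected expert is taken to be the arg-max of the router probabilities, each expert is a pair of matrices with shapes `[m_i, d]` and `[d, m_i]`, and all names, sizes, and random weights are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def topk_activation(pre, k):
    """Keep the k largest entries of a 1-D vector and zero the rest."""
    out = np.zeros_like(pre)
    idx = np.argpartition(pre, -k)[-k:]
    out[idx] = pre[idx]
    return out

def switch_sae_forward(z, experts, W_router, b_router, b_pre, k):
    """Route z to one expert TopK SAE and return the weighted reconstruction."""
    p = softmax(W_router @ (z - b_router))          # Eq. 14: distribution over experts
    i_star = int(np.argmax(p))                      # assumption: pick the most probable expert
    W_enc, W_dec = experts[i_star]
    x = z - b_pre
    e_out = W_dec @ topk_activation(W_enc @ x, k)   # Eq. 13: the chosen expert's TopK SAE
    z_hat = p[i_star] * e_out + b_pre               # Eq. 15: scale by router probability
    return z_hat, i_star, p

# Hypothetical sizes: d = 4 inputs, N = 3 experts with m_i = 6 features each, K = 2.
rng = np.random.default_rng(0)
d, n_experts, m_i, k = 4, 3, 6, 2
experts = [(rng.normal(size=(m_i, d)), rng.normal(size=(d, m_i))) for _ in range(n_experts)]
W_router = rng.normal(size=(n_experts, d))
b_router, b_pre = np.zeros(d), np.zeros(d)

z = rng.normal(size=d)
z_hat, i_star, p = switch_sae_forward(z, experts, W_router, b_router, b_pre, k)
print(i_star, p.round(3), z_hat.shape)
```

Only the selected expert's weights are touched for a given input, which is where the savings in memory traffic and FLOPs come from relative to one large SAE with the same total number of features.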
To ensure balanced expert utilization and avoid expert collapse, Switch SAE incorporates an auxiliary loss for load balancing:

$$\mathcal{L}_{\text{aux}} = N \sum_{i=1}^{N} f_i \cdot P_i, \tag{16}$$

where $f_i$ is the fraction of activations assigned to expert $i$, and $P_i$ is the fraction of router probability assigned to expert $i$. This auxiliary loss is then added to the traditional reconstruction loss function to form the final learning objective.

3.2.2 Improve Training Strategy
Layer Group SAE. Traditionally, one SAE is
trained per layer in a transformer-based LLM, re-
sulting in a substantial number of parameters and
high computational costs. To address this ineffi-
ciency, the Layer Group SAE (Ghilardi et al., 2024)
clusters multiple layers into groups based on acti-
vation similarity and trains a single SAE per group.
This significantly reduces training time while pre-
serving reconstruction accuracy and interpretability.
To determine which layers should be grouped together, the method measures the angular distance between layer activations, defined as:

$$d_{\text{angular}}(z^{p}_{\text{post}}, z^{q}_{\text{post}}) = \frac{1}{\pi} \arccos\left( \frac{z^{p}_{\text{post}} \cdot z^{q}_{\text{post}}}{\|z^{p}_{\text{post}}\|_2 \, \|z^{q}_{\text{post}}\|_2} \right), \tag{17}$$

where $z^{p}_{\text{post}}$ and $z^{q}_{\text{post}}$ represent post-MLP residual stream activations for layers $p$ and $q$. Using this metric, layers whose activations are close to each other (small angular distance) are clustered together through a hierarchical clustering strategy. The number of groups $K$ is chosen based on a computational trade-off, balancing efficiency and reconstruction accuracy.
Once the layer groups are formed, a single SAE
is trained per group instead of one per layer. The
SAE architecture and training objective remain
similar to those of traditional SAEs, optimizing for both
reconstruction accuracy and sparsity.
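
The angular distance of Equation 17 and the pairwise matrix that a clustering step would consume can be sketched as below. The function name `angular_distance` and the three toy "layer activation" vectors are assumptions; in practice one vector per layer would be derived from real residual-stream activations, and the hierarchical clustering itself is not shown.

```python
import numpy as np

def angular_distance(zp, zq):
    """Angular distance (Eq. 17): arccos of the cosine similarity, scaled to [0, 1] by 1/pi."""
    cos = np.dot(zp, zq) / (np.linalg.norm(zp) * np.linalg.norm(zq))
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi  # clip guards against rounding outside [-1, 1]

# Toy activations for three layers (hypothetical): layers 0 and 1 point the same way.
layer_acts = [np.array([1.0, 0.0]), np.array([2.0, 0.0]), np.array([0.0, 3.0])]

D = np.array([[angular_distance(a, b) for b in layer_acts] for a in layer_acts])
print(D)  # layers 0 and 1 are at distance 0; both are at distance 0.5 from layer 2
```

In this toy case layers 0 and 1 would fall into the same group and share one SAE, while layer 2 would be assigned to a different group.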
Feature Choice SAE. Traditional SAEs face sev-
eral limitations, including dead features, fixed spar-
sity per token, and lack of adaptive computation.
Feature Choice SAEs (Ayonrinde, 2024) address
these issues by imposing a constraint on the num-
ber of tokens each feature can be active for, rather
than restricting the number of active features per
token. This approach ensures that all features are
utilized efficiently, preventing feature collapse and
improving reconstruction accuracy. This sparsity
allocation constraint is defined as:
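$$\sum_{j} S_{i,j} = m, \quad \forall i, \quad \text{where } M = mF, \tag{18}$$

where $S_{i,j}$ is a binary selection matrix, indicating whether feature $i$ is active for token $j$, $F$ is the number of features, and $M$ is the total number of active feature-token pairs. Each feature must be activated for exactly $m$ tokens, enforcing uniform feature utilization.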
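
One way to satisfy the constraint of Equation 18 is to let each feature pick the $m$ tokens it responds to most strongly. The sketch below assumes this top-$m$ rule and a feature-by-token affinity matrix; the function name `feature_choice_mask` and the random affinities are hypothetical and only illustrate the constraint, not the authors' training procedure.

```python
import numpy as np

def feature_choice_mask(affinity, m):
    """Binary S of shape [F, T]: each feature (row) selects its m highest-affinity tokens."""
    S = np.zeros(affinity.shape, dtype=int)
    top = np.argpartition(affinity, -m, axis=1)[:, -m:]  # per-row indices of the m largest
    np.put_along_axis(S, top, 1, axis=1)
    return S

# Hypothetical affinities for F = 5 features and T = 8 tokens, with m = 2 tokens per feature.
rng = np.random.default_rng(0)
F, T, m = 5, 8, 2
affinity = rng.normal(size=(F, T))

S = feature_choice_mask(affinity, m)
print(S.sum(axis=1))  # every feature is active on exactly m tokens
print(S.sum(axis=0))  # the number of active features per token is free to vary
print(S.sum())        # total budget M = m * F
```

The row sums are fixed while the column sums vary, which is the reverse of TopK, where every token receives the same number of active features.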
Mutual Choice SAE. Mutual Choice SAE (Ayonrinde, 2024) removes all constraints on sparsity allocation, allowing the model to freely distribute its limited total sparsity budget across all tokens and features. Unlike TopK SAEs, which enforce a fixed number of active features per token, or Feature Choice SAEs, which constrain the number of tokens each feature can be assigned to, Mutual Choice SAE introduces global sparsity allocation. This means that instead of enforcing a per-token or per-feature selection, the model selects the top $M$ feature-token matches across the entire dataset, ensuring that sparsity is allocated adaptively based on reconstruction needs. Mathematically, the activation selection process is defined as:

$$S = \mathrm{TopKIndices}(Z', M), \tag{19}$$

where $Z'$ represents the pre-activation affinity matrix between tokens and features, $M$ is the global sparsity budget, denoting the total number of active feature-token pairs allowed, and $\mathrm{TopKIndices}(\cdot)$ selects the top $M$ activations globally, instead of enforcing a fixed $K$ per token.
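
The global selection of Equation 19 can be sketched by flattening the affinity matrix and keeping its $M$ largest entries. The function name `mutual_choice_mask` and the small token-by-feature matrix are assumptions made for illustration.

```python
import numpy as np

def mutual_choice_mask(affinity, M):
    """Binary mask marking the M largest entries of the whole token-by-feature affinity matrix."""
    flat = affinity.ravel()
    top = np.argpartition(flat, -M)[-M:]  # indices of the M largest entries overall
    S = np.zeros(flat.shape, dtype=int)
    S[top] = 1
    return S.reshape(affinity.shape)

# Toy affinities for 2 tokens (rows) and 3 features (columns), hypothetical values.
Z_prime = np.array([[0.9, 0.1, 0.8],
                    [0.2, 0.3, 0.05]])

S = mutual_choice_mask(Z_prime, M=3)
print(S)              # the three largest entries (0.9, 0.8, 0.3) are selected
print(S.sum(axis=1))  # active features per token differ: 2 for the first token, 1 for the second
```

Neither the per-token nor the per-feature count is fixed; only the total number of selected pairs equals the budget `M`.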
Feature Aligned SAE. The Feature Aligned SAE
(Marks et al., 2024) introduces Mutual Feature
Regularization (MFR), a novel training method de-
signed to improve the interpretability and fidelity of
learned features in SAEs. Traditional SAEs often
suffer from feature fragmentation, where meaning-
ful input features get split across multiple decoder
weights, and feature entanglement, where multi-
ple independent input features are merged into a
single decoder weight. These issues reduce the in-
terpretability of SAEs and limit their effectiveness
in analyzing neural activations. The key insight be-
hind MFR is that features learned by multiple SAEs
trained on the same dataset are more likely to align
with the true underlying structure of the input data.
To enforce this, Feature Aligned SAE trains multiple SAEs in parallel and applies an MFR penalty
that encourages them to learn mutually consistent
features:
$$L_{\mathrm{MFR}} = \alpha \left( \frac{1}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left( 1 - \mathrm{MMCS}\left(W^{(i)}, W^{(j)}\right) \right) \right), \tag{20}$$
where $W^{(i)}$ and $W^{(j)}$ are the decoder weight matrices of different SAEs. Mean of Max Cosine
Similarity (MMCS) measures the degree of alignment between the learned features across SAEs. $\alpha$
is a hyperparameter that controls the strength of
the regularization. This mutual feature regularization is then combined with the traditional SAE loss
to form the final training objective of the Feature
Aligned SAE.
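As a rough illustration of Eq. (20), the sketch below computes MMCS between decoder matrices and the resulting penalty. It assumes that each row of a decoder matrix is one feature direction and that MMCS takes, for every feature of the first SAE, its best match in the second; the function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def mmcs(W_a, W_b):
    """Mean of Max Cosine Similarity; rows of each matrix are feature directions."""
    A = W_a / np.linalg.norm(W_a, axis=1, keepdims=True)
    B = W_b / np.linalg.norm(W_b, axis=1, keepdims=True)
    cos = A @ B.T                       # cos[i, j] = cosine(feature i of a, feature j of b)
    return cos.max(axis=1).mean()       # best match in b for every feature of a, then mean

def mfr_penalty(decoders, alpha=1.0):
    """Eq. (20): alpha / (N (N - 1)) times the sum over pairs i < j of (1 - MMCS)."""
    n = len(decoders)
    total = sum(1.0 - mmcs(decoders[i], decoders[j])
                for i, j in combinations(range(n), 2))
    return alpha * total / (n * (n - 1))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))             # toy decoder: 8 features in a 5-dim space
W_permuted = W[::-1].copy()             # same features, different order
W_other = rng.normal(size=(8, 5))       # unrelated dictionary

print(mfr_penalty([W, W_permuted]))     # close to 0: the dictionaries agree
print(mfr_penalty([W, W_other]))        # larger: the dictionaries disagree
```

Because the maximum is taken over features, the penalty is insensitive to the order in which two SAEs happen to learn the same features.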
End-to-end SAE. Traditional SAEs often prioritize minimizing reconstruction error rather than
ensuring that learned features are functionally important to the model’s decision-making. This often
leads to feature splitting, where a single meaningful
feature is divided into multiple redundant compo-
nents. To address this, End-to-end SAE (Braun
et al., 2025) modifies the training objective to en-
sure that the discovered features directly influence
the network’s output. They propose minimizing
the Kullback-Leibler (KL) divergence between the
original network’s output distribution and the out-
put distribution when using SAE activations, for-
mulated as:
$$L_{\mathrm{e2e}} = \mathrm{KL}(\hat{y}, y) + \alpha \lVert h(z) \rVert_1. \tag{21}$$
To further ensure that activations follow similar
computational pathways in later layers, they pro-
pose E2e + Downstream SAE, which introduces an
additional downstream reconstruction loss, leading
to the formulation:
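$$L_{\mathrm{e2e+ds}} = \mathrm{KL}(\hat{y}, y) + \alpha \lVert h(z) \rVert_1 + \beta \sum_{k=l+1}^{L} \left\lVert \hat{a}^{(k)} - a^{(k)} \right\rVert_2^2. \tag{22}$$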
By shifting the training focus from activation re-
construction to output distribution preservation,
this method ensures that learned features are more
aligned with the actual computational processes of
the network while maintaining interpretability.
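The two objectives in Eqs. (21) and (22) can be sketched numerically as follows. This is a minimal sketch under stated assumptions: the output distributions are given as logits, the KL direction follows the argument order of Eq. (21), all terms are averaged over tokens, and the function names are hypothetical rather than taken from the authors' code.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_divergence(logits_p, logits_q):
    """Mean over tokens of KL(p || q), with both distributions given as logits."""
    logp, logq = log_softmax(logits_p), log_softmax(logits_q)
    return (np.exp(logp) * (logp - logq)).sum(axis=-1).mean()

def e2e_loss(logits_sae, logits_orig, latents, alpha):
    """Eq. (21): output-distribution KL plus an L1 sparsity penalty on the latents."""
    sparsity = np.abs(latents).sum(axis=-1).mean()
    return kl_divergence(logits_sae, logits_orig) + alpha * sparsity

def e2e_ds_loss(logits_sae, logits_orig, latents, acts_sae, acts_orig, alpha, beta):
    """Eq. (22): adds squared reconstruction error at every downstream layer k > l."""
    downstream = sum(((a_hat - a) ** 2).sum(axis=-1).mean()
                     for a_hat, a in zip(acts_sae, acts_orig))
    return e2e_loss(logits_sae, logits_orig, latents, alpha) + beta * downstream

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))                         # 4 tokens, vocabulary of 10 (toy)
noisy_logits = logits + rng.normal(size=(4, 10))          # stand-in for SAE-substituted outputs
no_latents = np.zeros((4, 32))
print(e2e_loss(logits, logits, no_latents, alpha=0.1))        # identical outputs, no active latents
print(e2e_loss(noisy_logits, logits, no_latents, alpha=0.1))  # shifted outputs are penalized
```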
Formal Languages SAE. The effectiveness of traditional SAEs remains questionable in language models
due to their reliance on correlations rather than
causal attributions. While SAEs often recover features that correlate with linguistic structures, such
as parts of speech or syntactic depth, interventions
on these features frequently do not influence the
model’s predictions, suggesting that current training objectives fail to ensure causal relevance. To
address this, Formal Languages SAE (Menon et al.,
2024) introduces a causal loss term that explicitly
encourages SAEs to learn features that impact the
model’s computation. The proposed loss function
is given by:
$$L = L_{\mathrm{recon}} + \alpha L_{\mathrm{sparse}} + \beta L_{\mathrm{caus}}, \tag{23}$$
where $L_{\mathrm{recon}}$ is the standard reconstruction loss,
$L_{\mathrm{sparse}}$ enforces sparsity, and $L_{\mathrm{caus}}$ ensures that interventions on learned features result in predictable
changes in model output.
Specialized SAE. Traditional SAEs struggle to
capture rare and low-frequency concepts, which are
critical for understanding model behavior in spe-
cific subdomains. To address this, Specialized SAE
(SSAE) (Muhamed et al., 2024) focuses on learning
rare subdomain-specific features through targeted
data selection and a novel training objective. Instead of training on the full dataset, SSAE uses
high-recall retrieval methods, such as BM25,
Contriever, and TracIn reranking, to identify relevant subdomain data, ensuring that rare features
are well-represented. Additionally, they introduce
Tilted Empirical Risk Minimization (TERM), an
objective that optimizes for worst-case reconstruction loss rather than average loss. This is achieved
by modifying the standard SAE loss function to:
$$L_{\mathrm{TERM}}(t; w) = \frac{1}{t} \log\left( \frac{1}{N} \sum_{i=1}^{N} e^{t \cdot L_w(z_i)} \right), \tag{24}$$
where $L_w(z_i)$ is the standard SAE loss for representation $z_i$, $N$ is the size of a minibatch, and $t$
is the tilt parameter that controls emphasis on rare
concept reconstruction.
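The effect of the tilt parameter in Eq. (24) is easiest to see on a toy minibatch. The sketch below is illustrative only; the per-example loss values are made up, and the function name is hypothetical. It uses the usual max-subtraction trick so that the exponentials do not overflow.

```python
import numpy as np

def term_loss(losses, t):
    """Eq. (24): (1 / t) * log(mean(exp(t * loss_i))), computed stably."""
    scaled = t * np.asarray(losses, dtype=float)
    m = scaled.max()
    return (m + np.log(np.mean(np.exp(scaled - m)))) / t

# Toy minibatch: three well-reconstructed examples and one poorly reconstructed (rare) one.
losses = np.array([0.1, 0.1, 0.1, 2.0])

for t in (1e-6, 1.0, 50.0):
    print(f"t = {t:g}: tilted loss = {term_loss(losses, t):.4f}")
print("average loss:", losses.mean(), " worst-case loss:", losses.max())
```

As $t$ approaches zero the tilted loss approaches the ordinary average, and as $t$ grows it approaches the largest per-example loss, which is why larger tilts put more weight on poorly reconstructed, rare examples.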
3.3 SAE Training
Even though the framework of SAEs is conceptu-
ally straightforward, training SAEs is both compu-
tationally expensive and data-intensive. The com-
plexity arises due to the overcomplete dictionary
representation, large-scale data requirements, and
the layer-wise training paradigm necessary for in-
terpreting LLMs. Each of these factors contributes
to the substantial computational cost associated
with training SAEs at scale.
Overcomplete Dictionary Representation. A
defining characteristic of SAEs is their overcom-
plete dictionary, where the number of learned fea-
tures far exceeds the dimensionality of the LLM’s
latent space. This overcompleteness is what en-
ables SAEs to enforce sparsity, allowing them to
isolate and extract meaningful feature activations
from high-dimensional representations. The en-
forced sparsity is crucial for LLM interpretabil-
ity, as it helps decompose complex neural activa-
tions into more semantically meaningful features.
Empirical studies highlight the scale of overcompleteness; for example, LLaMa-Scope (He et al.,
2024) trained SAEs with 32K and 128K features,
which are 8× and 32× larger than the hidden size
of LLaMa3.1-8B. This extreme overparameterization provides a highly expressive feature space but
significantly increases the computational burden
during training.
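The expansion factors quoted above follow from simple arithmetic, sketched below. The hidden size of 4096 for LLaMa3.1-8B and the reading of "K" as 1024 are assumptions consistent with the 8× and 32× figures; the parameter count assumes a plain one-hidden-layer SAE with an encoder matrix, a decoder matrix, and two bias vectors, which may differ from the architecture actually used.

```python
def sae_parameter_count(hidden_size, n_features):
    """Encoder and decoder matrices plus encoder and decoder biases (assumed layout)."""
    return 2 * hidden_size * n_features + n_features + hidden_size

hidden_size = 4096                                   # assumed hidden size of LLaMa3.1-8B
widths = {"32K": 32 * 1024, "128K": 128 * 1024}      # assumes K = 1024
expansion = {name: n // hidden_size for name, n in widths.items()}

for name, n in widths.items():
    params_m = sae_parameter_count(hidden_size, n) / 1e6
    print(f"{name} features: {expansion[name]}x expansion, about {params_m:.0f}M parameters per SAE")
```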
Large-Scale Data Requirements. Since the input to an SAE consists of representations from
LLMs, an enormous amount of data is required
to ensure that the model learns a diverse and rep-
resentative set of activations. To effectively train
an SAE, it is essential to activate a wide range of
neurons in the LLM, which necessitates process-
ing large-scale datasets covering diverse linguistic
structures. Moreover, because SAEs are overcom-
plete, they require significantly more training data
to converge. Empirical results from Gemma-Scope
(Lieberum et al., 2024) illustrate this requirement:
SAEs with 16.4K features were trained on 4 bil-
lion tokens, while 1M-feature SAEs required 16
billion tokens to reach satisfactory performance.
This highlights the immense data demands necessary for training effective SAEs. Another challenge
that arises when scaling up the training data is
how to efficiently shuffle massive datasets across
distributed systems. Shuffling is crucial to prevent
models from learning spurious, order-dependent
patterns. However, as datasets grow to terabyte or
petabyte scales, performing a distributed shuffle be-
comes a significant engineering hurdle (Anthropic,
2024).
Layer-Wise Training. Interpreting an LLM re-
quires understanding its representations at each
layer, which necessitates training separate SAEs
for different layers of the model. The standard
approach is to train one SAE per layer, meaning
that for deep models, this process must be repeated
across dozens or even hundreds of layers, com-
pounding the computational cost. The necessity of
layer-wise training is further evidenced by ongoing
research efforts attempting to reduce the number
of SAEs required. For example, Layer Group SAE
(Ghilardi et al., 2024), which we discussed previ-
ously, clusters multiple layers into layer groups and
trains a single SAE per group instead of per layer.
The emergence of such strategies demonstrates the
significant computational burden imposed by layer-
wise SAE training and the ongoing efforts to opti-
mize it.
4 Explainability Analysis of SAEs
This section aims to interpret the learned feature
vectors from a trained SAE with natural language.
Specifically, given a pre-defined vocabulary set $\mathcal{V}$,
the goal of the explainability analysis is to extract
a subset of words $I_m \subset \mathcal{V}$ to represent the meaning of $w_m = W_{\mathrm{dec}}[m]$, for $m = 1, \dots, M$. Humans can understand the meaning of $w_m$ by reading its natural language explanation $I_m$. There
[Figure 2 image omitted. Panels: MaxAct (an example passage about modified starch and food preparation, with activating tokens highlighted); VocabProj (token lists under "Negative Logits" and "Positive Logits"); Statistical Analysis (an activation histogram and a logit-density histogram).]
Figure 2: The figure illustrates the interpretation of a learned SAE feature using VocabProj and MaxAct. VocabProj
lists words with the highest logits in the “Positive Logits” column, and the lowest logits in the “Negative Logits” column.
The upper histogram in Statistical Analysis shows the distribution of randomly sampled non-zero activations, with
the y-axis representing the number of sampled activations and the x-axis indicating activation scores. The lower
histogram depicts the logit density, where the y-axis represents the number of tokens and the x-axis corresponds
to logit scores. MaxAct highlights tokens in an input text that strongly activate the learned feature. The figure
references the Neuronpedia website (Lin, 2023).
are two lines of work for this purpose, namely the
input-based and output-based methods. Figure 2
visualizes the explanations generated by different
methods for interpreting a learned feature vector.
4.1 Input-based Explanations
MaxAct. The most straightforward way to collect
natural language explanations is by selecting a set of
texts whose hidden representations maximally
activate the feature vector we are interpreting (Bricken et al., 2023). Formally, given a large
corpus $\mathcal{X}$ where each text span $x \in \mathcal{V}^N$ consists of
$N$ words, the MaxAct strategy finds $K$ text spans
that maximally activate the learned
feature vector of interest $w_m$:
$$I_m = \arg\max_{\mathcal{X}' \subset \mathcal{X},\, |\mathcal{X}'| = K} \sum_{x \in \mathcal{X}'} f_{<l}(x) \cdot w_m^{\top}, \tag{25}$$
where $f_{<l}(x)$ indicates generating the hidden representation of input text $x$ at the $l$-th layer, and $l$ is
the layer our SAE is trained for. This strategy is
reasonable for interpreting weight vectors of SAEs
because of the sparse nature of SAEs, which indicates that a learned feature vector should only be
activated by a certain pattern/concept. Therefore,
summarizing the text spans that maximally
activate a certain weight vector gives us a clue to
understanding the semantic meaning of the learned
feature vector.
PruningMaxAct. While MaxAct collects text
spans that maximally activate a feature vector,
these spans often contain extraneous or redundant
phrases that can obscure the underlying concept.
Building on the Neuron-to-Graph approach (Foote
et al., 2023), researchers (Gao et al., 2025) introduce a pruning operation to remove irrelevant tokens from each text span, thereby retaining only
the minimal context necessary to preserve strong
activation. Formally, let $p(\cdot)$ be a pruning strategy
that maps text $x$ to $p(x)$, and let $p^{-1}(\cdot)$ recover
the original text from its pruned version. The final
pruned spans are then gathered via:
$$I_m = \arg\max_{\mathcal{X}' \subset p(\mathcal{X}),\, |\mathcal{X}'| = K} \sum_{x \in \mathcal{X}'} f_{<l}(x) \cdot w_m^{\top}, \quad \text{s.t. } \forall x \in p(\mathcal{X}),\ \frac{f_{<l}(x) \cdot w_m^{\top}}{f_{<l}(p^{-1}(x)) \cdot w_m^{\top}} \geq 0.5, \tag{26}$$
where the condition enforces that the pruned text
$p(x)$ retains at least half of the original activation.
In practice, $p(\cdot)$ can be instantiated by removing
selected tokens or replacing them with padding.
According to Gao et al. (2025), this PruningMaxAct technique yields higher recall (i.e., finds more
relevant examples) but lower precision compared
to the original MaxAct strategy.
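The two input-based strategies above can be sketched on toy data. Everything here is a simplified stand-in: span representations are given directly as vectors instead of being produced by an LLM, the activation function used for pruning is a hypothetical table of per-token contributions, and the pruning rule (greedily dropping leading tokens) is only one possible instantiation of $p(\cdot)$.

```python
import numpy as np

def max_act(span_reprs, w_m, k):
    """MaxAct: indices of the k spans whose representation scores highest against w_m."""
    scores = span_reprs @ w_m
    top = np.argsort(-scores)[:k]
    return top, scores[top]

def prune_span(tokens, activation_fn, keep_ratio=0.5):
    """Drop leading tokens while the span keeps at least keep_ratio of its activation.

    Assumes the activation of the original span is positive.
    """
    full = activation_fn(tokens)
    pruned = list(tokens)
    while len(pruned) > 1 and activation_fn(pruned[1:]) >= keep_ratio * full:
        pruned = pruned[1:]
    return pruned

# Toy span representations (4 spans, 2 dimensions) and a feature vector.
span_reprs = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.5, 0.5]])
w_m = np.array([1.0, 0.0])
top_idx, top_scores = max_act(span_reprs, w_m, k=2)
print(top_idx, top_scores)

# Hypothetical activation: each token contributes a fixed amount to the feature.
toy_weights = {"starch": 1.0, "food": 0.8}
def toy_activation(tokens):
    return sum(toy_weights.get(tok, 0.0) for tok in tokens)

span = "the most common way to modify starch".split()
print(prune_span(span, toy_activation))
```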
4.2 Output-based Explanations
VocabProj. Output-based explanations project the
learned feature vectors to the output word embeddings of texts to compute the activations. Mathematically, $f_{\mathrm{out}}(w): \mathcal{V} \rightarrow \mathbb{R}^d$ denotes the output
word embedding layer that returns the output embedding of word $w$, and we can collect the natural
language explanations by:
$$I_m = \arg\max_{\mathcal{V}' \subset \mathcal{V},\, |\mathcal{V}'| = K} \sum_{w \in \mathcal{V}'} f_{\mathrm{out}}(w) \cdot w_m^{\top}. \tag{27}$$
This mapping process makes sense for decoder-only LLMs because the layers in such models share
the same residual stream, enabling the representations in the intermediate layers to be linearly correlated with the output word embeddings (Nostalgebraist, 2020). Recent works (Wu et al., 2025b;
Gur-Arieh et al., 2025) find that output-based explanations show stronger promise in interpreting
LLM behaviors (i.e., generated texts) compared to
input-based ones.
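A minimal sketch of the projection in Eq. (27) is given below, using a made-up four-word vocabulary and two-dimensional embeddings. The function name and the toy values are assumptions; in practice the embedding matrix would be the LLM's unembedding matrix and the feature vector a row of the SAE decoder.

```python
import numpy as np

def vocab_proj(E_out, w_m, vocab, k):
    """Rank vocabulary items by the dot product of their output embedding with w_m."""
    logits = E_out @ w_m
    order = np.argsort(-logits)                      # descending by logit
    positive = [vocab[i] for i in order[:k]]         # highest logits
    negative = [vocab[i] for i in order[::-1][:k]]   # lowest logits
    return positive, negative

vocab = ["food", "product", "jurk", "désert"]
E_out = np.array([[1.0, 0.0],
                  [0.8, 0.1],
                  [-0.5, 0.2],
                  [-0.9, 0.0]])    # toy output embeddings, one row per word
w_m = np.array([1.0, 0.0])         # toy learned feature vector

positive, negative = vocab_proj(E_out, w_m, vocab, k=2)
print("positive logits:", positive)
print("negative logits:", negative)
```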
MutInfo. VocabProj assumes that the output
embeddings that maximally activate a feature vector of interest
best describe the meaning of the
learned feature. However, this assumption may fail
for frequent words, whose embeddings often have
a large $\ell_2$ norm (Gao et al., 2019). To address this,
Wu et al. (2025b) propose extracting a vocabulary
subset that maximizes mutual information with the
learned feature. Formally, let $C$ denote the knowledge
encoded by $w_c$; the explanations are extracted by
$$\begin{aligned} I_m &= \arg\max_{\mathcal{V}' \subset \mathcal{V},\, |\mathcal{V}'| = M} \mathrm{MI}(\mathcal{V}'; C) \propto \arg\min_{\mathcal{V}' \subset \mathcal{V},\, |\mathcal{V}'| = M} H(C \mid \mathcal{V}') \\ &\propto \arg\max_{\mathcal{V}' \subset \mathcal{V},\, |\mathcal{V}'| = M} \sum_{w \in \mathcal{V}'} p(w \mid w_m) \log p(w_m \mid w), \end{aligned} \tag{28}$$
where $\mathrm{MI}(\cdot;\cdot)$ indicates mutual information between two variables (Cover, 1999), and $U(C)$ includes all possible vectors that express the knowledge $C$. Practically, the conditional probabilities
can be estimated by:
$$p(w \mid w_m) = \frac{\exp(f_{\mathrm{out}}(w) \cdot w_m^{\top})}{\sum_{w' \in \mathcal{V}} \exp(f_{\mathrm{out}}(w') \cdot w_m^{\top})}, \qquad p(w_c \mid w) = \frac{\exp(f_{\mathrm{out}}(w) \cdot w_c^{\top})}{\sum_{c' \in C} \exp(f_{\mathrm{out}}(w) \cdot W_{c'}^{\top})}. \tag{29}$$
Compared with VocabProj, which considers only
$p(w \mid w_m)$, this mutual-information-driven objective highlights the need to normalize the raw activation with $p(w_m \mid w)$. That is to say, if a word’s
output embedding consistently activates various learned feature vectors, it is not specific enough to
interpret any of them.
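The role of the second conditional probability in Eq. (29) can be illustrated with a toy example. The sketch below only estimates the two softmax-normalized probabilities; it does not implement the subset selection of Eq. (28). The three-word vocabulary, the embeddings, and the identity matrix used as feature vectors are all made up for illustration.

```python
import numpy as np

def softmax(x, axis):
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

vocab = ["the", "food", "car"]
E_out = np.array([[3.0, 3.0, 3.0],   # "the": large norm, aligned with every feature
                  [2.0, 0.0, 0.0],   # "food": aligned with feature 0 only
                  [0.0, 2.0, 0.0]])  # "car": aligned with feature 1 only
W_dec = np.eye(3)                    # three toy feature vectors w_0, w_1, w_2

logits = E_out @ W_dec.T                      # logits[w, m] = f_out(w) . w_m
p_word_given_feat = softmax(logits, axis=0)   # normalized over the vocabulary
p_feat_given_word = softmax(logits, axis=1)   # normalized over the features

m = 0
print("p(w | w_0):", dict(zip(vocab, p_word_given_feat[:, m].round(3))))
print("p(w_0 | w):", dict(zip(vocab, p_feat_given_word[:, m].round(3))))
```

In this toy setting the frequent-word stand-in "the" has the highest $p(w \mid w_0)$, so a raw projection would rank it first, yet its $p(w_0 \mid w)$ is only one third because it activates all three features equally, whereas "food" is far more specific to feature 0.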
5 Evaluation Metrics and Methods
Evaluating SAEs is inherently challenging due to
the absence of ground truth labels. Unlike tradi-
tional machine learning tasks where performance
can be directly measured against labeled data, the
quality of an SAE must be inferred through a di-
verse set of metrics. These metrics assess both
the internal structure of the model and its func-
tional utility. To provide a comprehensive evalu-
ation framework, we categorize SAE evaluation
methods into two main groups: structural metrics
and functional metrics. This categorization ensures
a holistic assessment of SAEs, covering both their
training behavior and real-world applicability.
5.1 Structural Metrics
Structural metrics focus on assessing whether an
SAE behaves as intended during training. SAEs are
designed to optimize both reconstruction fidelity
and sparsity, as these properties are explicitly en-
forced in the training loss. Therefore, natural eval-
uation metrics assess reconstruction accuracy and
sparsity in the model’s learned representations.
Reconstruction Fidelity. The most fundamental
way to evaluate reconstruction fidelity is through
Mean Squared Error (MSE) and Cosine Similar-
ity (Ng et al., 2011), which directly compare the
original activations with SAE-reconstructed acti-
vations. Additional metrics such as Fraction of
Variance Unexplained (FVU) (also known as normalized loss) (Gao et al., 2025) and Explained Variance (Karvonen et al., 2024) measure how much
variance in the original data is retained after SAE
reconstruction. Beyond direct reconstruction comparisons, researchers also evaluate how SAEs affect the probability distribution of model outputs.
Cross-Entropy Loss (Shannon, 1948) and KL Di-
vergence (Kullback and Leibler, 1951) measure
the shift in probability distributions when substitut-
ing original model activations with SAE-generated
activations. If the SAE faithfully reconstructs acti-
vations, the probability distributions should remain
similar. Similarly, Delta LM Loss (Lieberum et al.,
2024) quantifies the difference between the original
language model loss and the loss incurred when
replacing activations with those from the SAE. An-
other important aspect of reconstruction fidelity is
magnitude preservation. The L2 Ratio (Karvonen
et al., 2024) compares the Euclidean norms of the
original and reconstructed activations to ensure that the SAE does not
systematically alter activation magnitudes.
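The activation-level fidelity metrics above can be computed in a few lines. The sketch below follows one common set of conventions (FVU as residual sum of squares over the total variance around the per-dimension mean, explained variance as one minus FVU, and L2 ratio as reconstructed norm over original norm); individual benchmarks may define these quantities slightly differently, and the function name is hypothetical.

```python
import numpy as np

def reconstruction_metrics(x, x_hat):
    """x, x_hat: arrays of shape (n_tokens, d_model) with original and reconstructed activations."""
    err = x - x_hat
    norm_x = np.linalg.norm(x, axis=1)
    norm_x_hat = np.linalg.norm(x_hat, axis=1)
    fvu = np.sum(err ** 2) / np.sum((x - x.mean(axis=0)) ** 2)
    return {
        "mse": np.mean(err ** 2),
        "cosine": np.mean(np.sum(x * x_hat, axis=1) / (norm_x * norm_x_hat)),
        "fvu": fvu,
        "explained_variance": 1.0 - fvu,
        "l2_ratio": np.mean(norm_x_hat / norm_x),
    }

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))                 # toy activations
perfect = reconstruction_metrics(x, x.copy())
shrunk = reconstruction_metrics(x, 0.5 * x)  # right direction, wrong magnitude
print(perfect)
print(shrunk)
```

The second case shows why magnitude preservation is tracked separately: a reconstruction that halves every activation has perfect cosine similarity but an L2 ratio of 0.5.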
Sparsity. A key design objective of SAEs is spar-
sity, which ensures that only a small subset of la-
tent neurons activate for any given input. The most
direct metric for sparsity is L0 Sparsity (Louizos
et al., 2017), which measures the average number
of nonzero activations per input. However, sparsity
is not just about minimizing activations; it also re-
quires ensuring that the active features are meaning-
ful. To assess feature usage patterns, Latent Firing
Frequency (He et al., 2024) and Feature Density
Statistics (Karvonen et al., 2024) track how often
each SAE latent is activated across different inputs,
ensuring that features are neither too frequent nor
inactive. Additionally, the Sparsity-fidelity Trade-
off (Gao et al., 2025) evaluates whether adjusting
sparsity affects reconstruction quality, helping to
determine the optimal balance between sparsity
and fidelity.
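The sparsity statistics described above reduce to simple counts over a matrix of SAE latents. The following sketch is illustrative; the toy latent matrix and the function name are assumptions.

```python
import numpy as np

def sparsity_stats(latents):
    """latents: array of shape (n_tokens, n_features) holding SAE activations."""
    active = latents != 0
    l0 = active.sum(axis=1).mean()            # average number of active latents per token
    firing_frequency = active.mean(axis=0)    # fraction of tokens on which each latent fires
    n_dead = int((firing_frequency == 0).sum())
    return l0, firing_frequency, n_dead

latents = np.array([[0.0, 1.2, 0.0, 0.0],
                    [0.0, 0.3, 0.7, 0.0],
                    [0.0, 0.0, 0.0, 0.0]])
l0, freq, n_dead = sparsity_stats(latents)
print("L0:", l0)
print("firing frequency per latent:", freq)
print("latents that never fire:", n_dead)
```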
5.2 Functional Metrics
While structural metrics ensure that an SAE fol-
lows its design principles, functional metrics assess
whether the SAE is useful for real-world analy-
sis. These include interpretability, which assesses
whether the SAE’s learned features correspond to
meaningful and distinct concepts, and robustness,
which evaluates whether the learned representa-
tions are stable and generalizable.
Interpretability. One of the primary motivations
for SAEs is to enhance interpretability by disen-
tangling LLM activations into meaningful features.
A crucial property for interpretability is monosemanticity, where each feature should encode a single concept. RAVEL and Automated Interpretability (Karvonen et al., 2024) automatically evaluate
monosemanticity by using a language model to gen-
erate and assess feature descriptions. These meth-
ods analyze the most activating contexts for each
feature and assign interpretability scores. Sparse
Probing and Targeted Probe Perturbation (TPP)
(Karvonen et al., 2024) evaluate whether SAE fea-
tures align with specific downstream tasks. In
sparse probing, a linear probe is trained using only
a small subset of SAE activations, while TPP measures how much perturbing individual latents impacts probe accuracy. If a small number of active
features enable strong performance, the SAE has
learned disentangled and meaningful representations. Beyond evaluating feature alignment, it is
also crucial to assess the faithfulness of feature
descriptions. Input-Based Evaluation and Output-
Based Evaluation (Gur-Arieh et al., 2025) provide
a framework for verifying whether feature descrip-
tions accurately reflect what a feature represents.
Input-Based Evaluation tests whether a given fea-
ture description correctly identifies which inputs
activate the feature by generating activating and
neutral examples and measuring activation differ-
ences. Output-Based Evaluation assesses whether a
feature description captures the causal influence of
the feature on model outputs by modifying feature
activations and comparing the resulting generated
texts. Feature Absorption (Karvonen et al., 2024)
assesses whether a feature is capturing multiple in-
dependent concepts instead of a single interpretable
concept. If adding more features does not signifi-
cantly improve representation quality, it suggests
that the extracted features are already sufficient.
Another approach to detecting whether each neu-
ron is monosemantic is checking for redundancy
or overlap with other neurons. Feature Geometry
Analysis (He et al., 2024; Bricken et al., 2023; Tem-
pleton et al., 2024) detects redundancy among SAE
latents by measuring cosine similarity between de-
coder columns. If two features have high cosine
similarity, they may represent redundant concepts
rather than independent units.
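The redundancy check described above can be sketched as a pairwise cosine-similarity scan over decoder directions. The sketch assumes that each row of the matrix is one feature direction (transpose first if features are stored as columns), and the 0.9 threshold and function name are arbitrary illustrative choices.

```python
import numpy as np

def redundant_pairs(W_dec, threshold=0.9):
    """Return (i, j, cosine) for every pair of features whose directions nearly coincide."""
    D = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    cos = D @ D.T
    rows, cols = np.triu_indices(len(D), k=1)     # each unordered pair once
    return [(int(i), int(j), float(cos[i, j]))
            for i, j in zip(rows, cols) if cos[i, j] > threshold]

W_dec = np.array([[1.0, 0.0],
                  [0.99, 0.05],    # nearly parallel to feature 0
                  [0.0, 1.0]])
pairs = redundant_pairs(W_dec)
print(pairs)
```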
Robustness. In addition to being interpretable,
a well-designed SAE should be robust in various
contexts. Robustness ensures that SAEs do not
overfit to a specific dataset or condition but in-
stead generalize effectively. Generalizability (He
et al., 2024) assesses whether SAEs remain effec-
tive when applied to out-of-distribution data. Two
common tests for generalizability include evaluat-
ing whether SAEs trained on shorter text sequences
still perform well on longer sequences and check-
ing whether SAEs trained on base LLM activations
generalize to instruction-finetuned models. Un-
learning (Karvonen et al., 2024) measures whether
an SAE can selectively forget specific features
while preserving useful information. This is crucial
for applications that require privacy-focused mod-
els, where sensitive information needs to be erased.
Spurious Correlation Removal (SCR) (Karvonen
et al., 2024) tests whether an SAE can eliminate
biased correlations in downstream models. If re-
moving certain latents reduces unwanted correla-
tions without harming performance, the SAE has
learned to capture and remove spurious patterns.
[Figure 3 image omitted. Panels: (a) Steering Vector Extraction, using a Sparse Autoencoder; (b) Steering LLM Behavior, with components labeled Input, Embedding, Transformer Block, Unembedding, and Logits; (c) Steered Output Example.
Example Input: Write a brief angry review in 10 words for the smartphone 'Apple 52 Pro Max'.
Original Output: Overpriced junk! Laggy, terrible battery, useless updates, worst purchase ever!
Steered Output (Steer Happiness Feature): Amazing phone! Fast, stunning display, great battery, worth every penny!
Steered Output (Steer Confusion Feature): Great camera, but lags? Expensive, yet feels cheap? I'm lost.
Steered Output (Steer Fact Feature): Apple 52 Pro Max doesn't exist yet, so no review possible.]
Figure 3: The figure illustrates the process of using an SAE to steer the behavior of an LLM, with an example of
the resulting steered output. In part (a), an SAE is typically used to extract a steering vector by comparing two
representations: $z$, which lacks a certain feature, and $z'$, which contains that feature. In part (b), this steering
vector is added to the input representation, modifying the LLM’s behavior to align with the desired feature. Part (c)
demonstrates example results of this process, where the steered output reflects the steered feature, even when the
original input prompt is neutral or contradictory to the feature being introduced.
6 Applications in Large Language Models
The latents learned by SAEs represent a collection
of low-level concepts. Each SAE latent can be
interpreted through gathering its activating exam-
ples (Lin, 2023). This approach enables latents to
be interpreted in a human-understandable manner,
thereby enhancing our comprehension of how mod-
els perform tasks and facilitating more effective
control of model behaviors.
6.1 Model Behavior Analysis
SAEs construct a dictionary of concepts through
their latents, providing a more fine-grained per-
spective for concept interpretation. This capability
enables the analysis of the model’s internal repre-
sentations and learned knowledge in greater detail.
A recent study utilizes SAEs to reveal the mechanism of hallucination, where entity recognition
plays a pivotal role in recalling facts. A direction
distinguishing whether the model knows an entity is identified, which is usually used for hallucination refusal in chat models (Ferrando et al.,
2025). Some studies focus on interpreting how
in-context learning (ICL) is performed within models. One study focuses on general ICL tasks, and
task-related function vectors have been successfully
isolated (Kharlapenko et al., 2024). Another study
focuses on reinforcement learning (RL) tasks. Their
experiment shows that an LLM’s internal representations are capable of capturing temporal difference
errors and Q-values that are essential in RL computations (Demircan et al., 2025). Besides, one
study attempts to examine the working mechanism
of instruction following. Their analysis on translation tasks shows that instructions are composed
of multiple relevant concepts rather than individual
ones (He et al., 2025).
Moreover, SAEs have been employed to study
behaviors related to toxicity, sycophancy, refusal,
and emotions. One recent study shows that features
captured by SAEs can be used to construct probes
to classify cross-lingual toxicity (Gallifant et al.,
2025). By examining SAE latents that activate on
anger-related tokens, researchers have identified
steering vectors that control angry outputs (Nanda
et al., 2024). Additionally, other research demonstrates that SAEs can reconstruct vectors responsible for refusing to answer harmful questions
as well as directions that produce sycophantic responses (neverix et al., 2024).
6.2 Model Steering
Unlike supervised concept vectors, such as probing
classifiers (Belinkov, 2022; Zhao et al., 2025; Jin
et al., 2025), SAEs can simultaneously learn a large
volume of concept vectors. The learned vectors can
then be utilized to steer model behaviors in ways
similar to supervised vectors (See Figure 3).
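A minimal sketch of SAE-based steering, in the spirit of Figure 3, is shown below. It assumes that a steering vector is obtained by decoding the difference between two latent codes (one lacking and one containing the target feature) and that steering adds a scaled copy of this vector to every token's hidden state; the function names, toy shapes, and steering strength are illustrative assumptions rather than a specific published recipe.

```python
import numpy as np

def steering_vector(z_without, z_with, W_dec):
    """Decode the latent difference (z' - z) into the model's activation space."""
    return (z_with - z_without) @ W_dec

def steer(hidden, v, strength=1.0):
    """Add the scaled steering vector to the hidden state of every token."""
    return hidden + strength * v

# Toy decoder: 4 latents, each a direction in a 3-dimensional activation space.
W_dec = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0]])
z_without = np.zeros(4)
z_with = np.array([0.0, 0.0, 5.0, 0.0])     # only the target latent (index 2) is active

v = steering_vector(z_without, z_with, W_dec)
hidden = np.ones((2, 3))                    # hidden states for 2 tokens
steered = steer(hidden, v, strength=2.0)
print(v)
print(steered)
```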
As previously mentioned, SAE latents can be
employed to produce steering vectors that control
model outputs related to toxicity, sycophancy, re-
fusal, and emotions. Other steering applications
have also proven feasible. Recently, a study shows
that SAE latents are able to capture instructions
such as translations. These identified latents are
effective in manipulating models to translate inputs
according to specific instructions (He et al., 2025).
Another investigation focusing on semantic search
demonstrates that SAEs can learn fine-grained
semantic concepts at various levels of abstraction.
These concept vectors can be used to steer models toward related semantics (O’Neill et al., 2024).
Alternatively, SAEs trained on biological datasets
can provide biology-related features that enable
the unlearning of biology-relevant knowledge with
fewer side effects than existing techniques (Far-
rell et al., 2024). Given that steering effects are
generally challenging to control, SAE-TS further
utilizes SAE latents to optimize steering vectors by
measuring the changes in SAE feature activation
caused by steering, thereby helping construct vec-
tors that target relevant features while minimizing
unwanted effects (Chalnev et al., 2024). Moreover,
explanations based on SAE latents risk prioritiz-
ing linguistic features over semantic meaning. Wu
et al. (2025b) propose a novel approach that pro-
motes diverse semantic explanations, which has
been demonstrated to enhance model safety.
6.3 Model Training
SAEs are trained to obtain sparser and more interpretable features. The learned concepts and
sparsity both benefit model transparency
and can be utilized in model training to align
models with human understanding and improve
model performance. Since SAEs include feature-level constraints, Yin et al. (2024) leverage these
constraints to enable sparsity-enforced alignment
in post-training. Their experiments demonstrate
that this approach achieves superior performance
across benchmark datasets with reduced computational costs. Similarly, Tack et al. (2025) combine learned concepts
with next-token prediction training to build more
transparent models. Specifically, they extract concepts
that are influential on outputs from SAEs, then incorporate these concept vectors into hidden states
by modifying token embeddings. Results show
that models trained with this method perform better and exhibit greater robustness on token prediction and knowledge distillation across benchmark
datasets. Moreover, SAEs’ ability to provide large-scale explanations has been
well explored. By examining the diversity of activated features, Yang et al. (2025) developed a
new approach to augment data diversity. Another
work uses task-specific features learned in SAEs
to mitigate unintended features within models, significantly improving model generalization on real-world tasks (Wu et al., 2025a).
7 Research Challenges
In this section, we outline several critical research
challenges with SAEs. Although SAEs have
emerged as promising tools for providing large-
scale, fine-grained interpretable explanations, these
challenges could threaten the faithfulness, effec-
tiveness, and efficiency of their applications.
7.1 Incomplete Concept Dictionary
SAEs are trained on large corpora of data encom-
passing various concepts. However, achieving
comprehensive concept coverage remains challeng-
ing (Muhamed et al., 2024). Additionally, the learn-
ing process of SAEs functions as a black box where
learned concepts cannot be predetermined. Conse-
quently, controlling the completeness of input and
output concepts is nearly impossible. Furthermore,
explanations provided by SAEs may be incomplete
or misleading due to the conceptual gaps. This lim-
itation can result in unreliable interpretations when
applying SAEs to complex reasoning tasks that
require comprehensive knowledge representation.
7.2 Lack of Theoretical Foundations
The development of SAEs is based on the assumptions of superposition and linear concept representation. Empirically, we have found it effective
to construct high-level features through linear com-
binations of low-level features. However, our un-
derstanding of how these concepts are represented
in hidden spaces and their spatial relationships re-
mains limited. This limitation explains why we
must derive combination parameters empirically
rather than mathematically. The validity and ad-
vancement of SAEs may remain unclear until we
can properly demonstrate the correctness of these
fundamental assumptions about concept represen-
tation and superposition in neural networks.
7.3 Reconstruction Errors
SAEs are trained by minimizing the reconstruction
errors between original and reconstructed activa-
tions. However, these errors persist and remain
poorly understood. Recent research by Gao et al.
(2025) demonstrates that reconstruction errors can
produce significant performance degradation com-
parable to using a model with only 10% of the pre-
training compute. This finding raises substantial
concerns about SAE accuracy as interpretability
tools. Furthermore, the impact of these reconstruc-
tion errors on model generations has not been ad-
equately measured. The field lacks output-centric
metrics that could precisely quantify how recon-
structed activations affect a model’s final outputs.
To advance our understanding of SAEs and their re-
liability as interpretability tools, developing metrics
that directly measure the effect of reconstruction
errors on generated content is essential.
7.4 Computational Burden
SAEs operate at the layer level for each model,
mapping original activations to a much higher-
dimensional representation space before recon-
structing them back to the original space. This
architecture necessitates that SAE parameters for a
specific layer significantly outnumber the parame-
ters of that original layer itself. Consequently, the
overall training computation exceeds that of the
original model training, particularly problematic
for LLMs with billions of parameters. The extensive computational resources required create a substantial barrier for researchers interested in investigating these methods. Furthermore, SAEs exhibit
limited transferability across models: they must
be trained specifically for each model and each
layer, exacerbating the computational burden. This
layer-specific and model-specific training require-
ment multiplies the already significant resource
demands, further restricting accessibility for the
broader research community.
7.5 Connection to the Broader Field of
Interpretability
The field of MI has been critiqued for its insuffi-
cient engagement with the broader interpretability
and NLP research literature (Bereska and Gavves,
2024; Saphra and Wiegreffe, 2024). Many of the
research topics within MI, such as polysemanticity,
superposition, and SAEs, have been investigated
in prior and concurrent non-MI fields, often un-
der different terminologies while addressing the
same fundamental challenges (Saphra and Wiegreffe, 2024; Elhage et al., 2022). For instance,
polysemanticity and superposition, which
concern how features are encoded in the
model activations, have been studied in the context of distributed representations (Hinton, 1984;
Mikolov et al., 2013b,a; Arora et al., 2018; Olah,
2023), disentangled representations (Higgins et al.,
2018; Kim and Mnih, 2018; Locatello et al., 2019),
and concept-based interpretability (Nicolson et al.,
2024; Kim et al., 2018). Similarly, SAEs are
closely related to and draw inspiration from earlier
lines of research on sparse coding and dictionary
learning (Olshausen and Field, 1997; Gregor and
LeCun, 2010; Faruqui et al., 2015; Subramanian
et al., 2018). These methods, like SAEs, posit the
feature sparsity hypothesis (Elhage et al., 2022)
and aim to learn an overcomplete representation that
disentangles features held in superposition within
activations. Since these fields pursue similar goals or
study the same research problems, the current disconnect
leads to missed relevant literature, hindered collaboration,
unintentional redefinition of established concepts,
rediscovery of existing techniques, and neglect of
well-known baselines. Therefore, it is imperative for the
MI community to bridge these gaps and more actively
integrate findings from related non-MI research.
8 Conclusions
In this survey, we provided a comprehensive exam-
ination of SAEs as a promising approach to inter-
preting and understanding LLMs. SAEs effectively
address the challenge of polysemanticity through
learning overcomplete, sparse representations that
disentangle superimposed features into more inter-
pretable units. We have systematically explored
the foundational principles, technical frameworks,
evaluation methodologies, and real-world appli-
cations of SAEs in the context of LLM analysis.
While SAEs have demonstrated considerable suc-
cess in revealing the internal mechanisms of LLMs,
several challenges remain, including the incom-
pleteness of concept dictionaries, limited theoret-
ical foundations, persistent reconstruction errors,
and substantial computational requirements. De-
spite these challenges, SAEs continue to evolve
through architectural innovations and improved
training strategies, offering deeper insights into
the inner workings of increasingly complex LLMs.
References
Anthropic. 2024. [link].
Anthropic. 2024. Introducing Claude 3.5 Sonnet. An-
nouncement of Claude 3.5 Sonnet model release, fea-
turing improved intelligence, vision capabilities, and
new Artifacts feature.
Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma,
and Andrej Risteski. 2018. Linear algebraic struc-
ture of word senses, with applications to polysemy.
Transactions of the Association for Computational
Linguistics , 6:483–495.
Kola Ayonrinde. 2024. Adaptive sparse allocation with
mutual choice & feature choice sparse autoencoders.
arXiv preprint arXiv:2411.02124 .
Yonatan Belinkov. 2022. Probing classifiers: Promises,
shortcomings, and advances. Computational Linguis-
tics, 48(1):207–219.
Leonard Bereska and Efstratios Gavves. 2024. Mech-
anistic interpretability for ai safety–a review. arXiv
preprint arXiv:2404.14082 .
Stella Biderman, Hailey Schoelkopf, Quentin Gregory
Anthony, Herbie Bradley, Kyle O’Brien, Eric Hal-
lahan, Mohammad Aflah Khan, Shivanshu Purohit,
USVSN Sai Prashanth, Edward Raff, et al. 2023.
Pythia: A suite for analyzing large language mod-
els across training and scaling. In International
Conference on Machine Learning , pages 2397–2430.
PMLR.
Steven Bills, Nick Cammarata, Dan Moss-
ing, Henk Tillman, Leo Gao, Gabriel Goh,
Ilya Sutskever, Jan Leike, Jeff Wu, and
William Saunders. 2023. Language mod-
els can explain neurons in language models.
https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html .
Joseph Bloom. 2024. Open source sparse autoencoders
for all residual stream layers of gpt2 small.
Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill,
and Lee Sharkey. 2025. Identifying functionally im-
portant features with end-to-end sparse dictionary
learning. Advances in Neural Information Process-
ing Systems , 37:107286–107325.
Trenton Bricken, Adly Templeton, Joshua Batson,
Brian Chen, Adam Jermyn, Tom Conerly, Nick
Turner, Cem Anil, Carson Denison, Amanda Askell,
Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas
Schiefer, Tim Maxwell, Nicholas Joseph, Zac
Hatfield-Dodds, Alex Tamkin, Karina Nguyen,
Brayden McLean, Josiah E Burke, Tristan Hume,
Shan Carter, Tom Henighan, and Christopher
Olah. 2023. Towards monosemanticity: Decom-
posing language models with dictionary learning.
Transformer Circuits Thread . https://transformer-circuits.pub/2023/monosemantic-features/index.html.
Bart Bussmann, Patrick Leask, and Neel Nanda. 2024.
Batchtopk sparse autoencoders. arXiv preprint
arXiv:2412.06410 .
Sviatoslav Chalnev, Matthew Siu, and Arthur Conmy.
2024. Improving steering vectors by target-
ing sparse autoencoder features. arXiv preprint
arXiv:2411.02193 .
Thomas M Cover. 1999. Elements of information theory .
John Wiley & Sons.
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert
Huben, and Lee Sharkey. 2023. Sparse autoencoders
find highly interpretable features in language models.
arXiv preprint arXiv:2309.08600 .
Yann N Dauphin, Angela Fan, Michael Auli, and David
Grangier. 2017. Language modeling with gated con-
volutional networks. In International conference on
machine learning , pages 933–941. PMLR.
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang,
Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu,
Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang,
Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong
Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue,
Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu,
Chenggang Zhao, Chengqi Deng, Chenyu Zhang,
Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji,
Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo,
Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang,
Han Bao, Hanwei Xu, Haocheng Wang, Honghui
Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li,
Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang
Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L.
Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai
Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai
Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong
Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan
Zhang, Minghua Zhang, Minghui Tang, Meng Li,
Miaojun Wang, Mingming Li, Ning Tian, Panpan
Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen,
Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan,
Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen,
Shanghao Lu, Shangyan Zhou, Shanhuang Chen,
Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng
Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing
Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun,
T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu,
Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao
Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan
Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin
Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li,
Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin,
Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxi-
ang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang,
Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang
Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng
Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi,
Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang,
Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo,
Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yu-
jia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You,
Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu,
Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu,
Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan,
Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean
Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao,
Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zi-
jia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song,
Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu
Zhang, and Zhen Zhang. 2025. Deepseek-r1: Incen-
tivizing reasoning capability in llms via reinforce-
ment learning. Preprint , arXiv:2501.12948.
Can Demircan, Tankred Saanum, Akshay Kumar Ja-
gadish, Marcel Binz, and Eric Schulz. 2025. Sparse
autoencoders reveal temporal difference learning in
large language models. In The Thirteenth Interna-
tional Conference on Learning Representations .
Jonathan Donnelly and Adam Roegiest. 2019. On in-
terpretability and feature representations: an analysis
of the sentiment neuron. In Advances in Information
Retrieval: 41st European Conference on IR Research,
ECIR 2019, Cologne, Germany, April 14–18, 2019,
Proceedings, Part I 41 , pages 795–802. Springer.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey,
Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman,
Akhil Mathur, Alan Schelten, Amy Yang, Angela
Fan, et al. 2024. The llama 3 herd of models. arXiv
preprint arXiv:2407.21783 .
Nelson Elhage, Tristan Hume, Catherine Olsson,
Nicholas Schiefer, Tom Henighan, Shauna Kravec,
Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain,
Carol Chen, Roger Grosse, Sam McCandlish, Jared
Kaplan, Dario Amodei, Martin Wattenberg, and
Christopher Olah. 2022. Toy models of superpo-
sition. Transformer Circuits Thread .
Eoin Farrell, Yeu-Tong Lau, and Arthur Conmy. 2024.
Applying sparse autoencoders to unlearn knowledge
in language models. In Neurips Safe Generative AI
Workshop 2024 .
Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris
Dyer, and Noah Smith. 2015. Sparse overcom-
plete word vector representations. arXiv preprint
arXiv:1506.02004 .
Javier Ferrando, Oscar Balcells Obeso, Senthooran Ra-
jamanoharan, and Neel Nanda. 2025. Do i know this
entity? knowledge awareness and hallucinations in
language models. In The Thirteenth International
Conference on Learning Representations .
Javier Ferrando, Gabriele Sarti, Arianna Bisazza, and
Marta R Costa-jussà. 2024. A primer on the in-
ner workings of transformer-based language models.
arXiv preprint arXiv:2405.00208 .
Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas,
Shay Cohen, and Fazl Barez. 2023. Neuron to graph:
Interpreting language model neurons at scale. arXiv
preprint arXiv:2305.19911 .
Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts,
Thomas Hartvigsen, and Danielle S Bitterman. 2025.
Sparse autoencoder features for classifications and
transferability. arXiv preprint arXiv:2502.11367 .
Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-
Yan Liu. 2019. Representation degeneration problem
in training natural language generation models. arXiv
preprint arXiv:1907.12009 .
Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel
Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan
Leike, and Jeffrey Wu. 2025. Scaling and evaluating
sparse autoencoders. In The Thirteenth International
Conference on Learning Representations .
Davide Ghilardi, Federico Belotti, and Marco Moli-
nari. 2024. Efficient training of sparse autoencoders
for large language models via layer groups. arXiv
preprint arXiv:2410.21508 .
Karol Gregor and Yann LeCun. 2010. Learning fast
approximations of sparse coding. In Proceedings of
the 27th international conference on international
conference on machine learning , pages 399–406.
Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus
Geiger, and Mor Geva. 2025. Enhancing automated
interpretability with output-centric feature descrip-
tions. arXiv preprint arXiv:2501.08319 .
Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen,
Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng
Guo, Xuanjing Huang, Zuxuan Wu, et al. 2024.
Llama scope: Extracting millions of features from
llama-3.1-8b with sparse autoencoders. arXiv
preprint arXiv:2410.20526 .
Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali
Payani, Jing Ma, and Mengnan Du. 2025. Saif: A
sparse autoencoder framework for interpreting and
steering instruction following of language models.
arXiv preprint arXiv:2502.11356 .
Irina Higgins, David Amos, David Pfau, Sebastien
Racaniere, Loic Matthey, Danilo Rezende, and
Alexander Lerchner. 2018. Towards a definition
of disentangled representations. arXiv preprint
arXiv:1812.02230 .
Geoffrey E Hinton. 1984. Distributed representations.
Mingyu Jin, Qinkai Yu, Jingyuan Huang, Qingcheng
Zeng, Zhenting Wang, Wenyue Hua, Haiyan Zhao,
Kai Mei, Yanda Meng, Kaize Ding, et al. 2025. Ex-
ploring concept depth: How large language models
acquire knowledge and concept at different layers?
In Proceedings of the 31st International Conference
on Computational Linguistics , pages 558–573.
A Karvonen, C Rager, J Lin, C Tigges, J Bloom,
D Chanin, YT Lau, E Farrell, A Conmy, C Mc-
Dougall, et al. 2024. Saebench: A comprehensive
benchmark for sparse autoencoders, december 2024.
URL https://www.neuronpedia.org/sae-bench/info .
Dmitrii Kharlapenko, neverix, Neel Nanda, and Arthur
Conmy. 2024. Extracting SAE task features for in-
context learning. [Accessed 26-02-2025].
Been Kim, Martin Wattenberg, Justin Gilmer, Carrie
Cai, James Wexler, Fernanda Viegas, et al. 2018. In-
terpretability beyond feature attribution: Quantitative
testing with concept activation vectors (tcav). In In-
ternational conference on machine learning , pages
2668–2677. PMLR.
Hyunjik Kim and Andriy Mnih. 2018. Disentangling by
factorising. In International conference on machine
learning , pages 2649–2658. PMLR.
Solomon Kullback and Richard A Leibler. 1951. On
information and sufficiency. The annals of mathe-
matical statistics , 22(1):79–86.
Tom Lieberum, Senthooran Rajamanoharan, Arthur
Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant
Varma, János Kramár, Anca Dragan, Rohin Shah,
and Neel Nanda. 2024. Gemma scope: Open sparse
autoencoders everywhere all at once on gemma 2.
arXiv preprint arXiv:2408.05147 .
Johnny Lin. 2023. Neuronpedia: Interactive reference
and tooling for analyzing neural networks. Software
available from neuronpedia.org.
Francesco Locatello, Stefan Bauer, Mario Lucic, Gun-
nar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and
Olivier Bachem. 2019. Challenging common as-
sumptions in the unsupervised learning of disentan-
gled representations. In international conference on
machine learning , pages 4114–4124. PMLR.
Christos Louizos, Max Welling, and Diederik P Kingma.
2017. Learning sparse neural networks through l_0
regularization. arXiv preprint arXiv:1712.01312 .
Luke Marks, Alasdair Paren, David Krueger, and Fazl
Barez. 2024. Enhancing neural network interpretabil-
ity with feature-aligned sparse autoencoders. arXiv
preprint arXiv:2411.01220 .
Abhinav Menon, Manish Shrivastava, David Krueger,
and Ekdeep Singh Lubana. 2024. Analyzing (in)
abilities of saes via formal languages. arXiv preprint
arXiv:2410.11767 .
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-
rado, and Jeff Dean. 2013a. Distributed representa-
tions of words and phrases and their compositionality.
Advances in neural information processing systems ,
26.
Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig.
2013b. Linguistic regularities in continuous space
word representations. In Proceedings of the 2013
conference of the north american chapter of the as-
sociation for computational linguistics: Human lan-
guage technologies , pages 746–751.
Anish Mudide, Joshua Engels, Eric J Michaud, Max
Tegmark, and Christian Schroeder de Witt. 2024. Efficient
dictionary learning with switch sparse autoencoders.
arXiv preprint arXiv:2410.08201 .
Aashiq Muhamed, Mona Diab, and Virginia Smith.
2024. Decoding dark matter: Specialized sparse
autoencoders for interpreting rare concepts in foun-
dation models. arXiv preprint arXiv:2411.00743 .
Neel Nanda, Arthur Conmy, Lewis Smith, Senthooran
Rajamanoharan, Tom Lieberum, János Kramár, and
Vikrant Varma. 2024. [Full Post] Progress Update
#1 from the GDM Mech Interp Team. [Accessed
26-02-2025].
neverix, Dmitrii Kharlapenko, Arthur Conmy, and Neel
Nanda. 2024. SAE features for refusal and syco-
phancy steering vectors. [Accessed 26-02-2025].
Andrew Ng et al. 2011. Sparse autoencoder. CS294A
Lecture notes , 72(2011):1–19.
Anh Nguyen, Jason Yosinski, and Jeff Clune. 2019. Un-
derstanding neural networks via feature visualization:
A survey. Explainable AI: interpreting, explaining
and visualizing deep learning , pages 55–76.
Angus Nicolson, Lisa Schut, J Alison Noble, and
Yarin Gal. 2024. Explaining explainability: Under-
standing concept activation vectors. arXiv preprint
arXiv:2404.03713 .
Nostalgebraist. 2020. Interpreting gpt: the logit lens.
Chris Olah. 2023. Distributed representations: Compo-
sition & superposition. Transformer Circuits Thread ,
27.
Bruno A Olshausen and David J Field. 1997. Sparse
coding with an overcomplete basis set: A strategy
employed by v1? Vision research , 37(23):3311–
3325.
Charles O’Neill, Christine Ye, Kartheik Iyer, and
John F Wu. 2024. Disentangling dense embed-
dings with sparse autoencoders. arXiv preprint
arXiv:2408.00657 .
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal,
Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale-
man, Diogo Almeida, Janko Altenschmidt, Sam Alt-
man, Shyamal Anadkat, Red Avila, Igor Babuschkin,
Suchir Balaji, Valerie Balcom, Paul Baltescu, Haim-
ing Bao, Mohammad Bavarian, Jeff Belgum, Ir-
wan Bello, Jake Berdine, Gabriel Bernadett-Shapiro,
Christopher Berner, Lenny Bogdonoff, Oleg Boiko,
Madelaine Boyd, Anna-Luisa Brakman, Greg Brock-
man, Tim Brooks, Miles Brundage, Kevin Button,
Trevor Cai, Rosie Campbell, Andrew Cann, Brittany
Carey, Chelsea Carlson, Rory Carmichael, Brooke
Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully
Chen, Ruby Chen, Jason Chen, Mark Chen, Ben
Chess, Chester Cho, Casey Chu, Hyung Won Chung,
Dave Cummings, Jeremiah Currier, Yunxing Dai,
Cory Decareaux, Thomas Degry, Noah Deutsch,
Damien Deville, Arka Dhar, David Dohan, Steve
Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti,
Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix,
Simón Posada Fishman, Juston Forte, Isabella Ful-
ford, Leo Gao, Elie Georges, Christian Gibson, Vik
Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-
Lopes, Jonathan Gordon, Morgan Grafstein, Scott
Gray, Ryan Greene, Joshua Gross, Shixiang Shane
Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris,
Yuchen He, Mike Heaton, Johannes Heidecke, Chris
Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele,
Brandon Houghton, Kenny Hsu, Shengli Hu, Xin
Hu, Joost Huizinga, Shantanu Jain, Shawn Jain,
Joanne Jang, Angela Jiang, Roger Jiang, Haozhun
Jin, Denny Jin, Shino Jomoto, Billie Jonn, Hee-
woo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Ka-
mali, Ingmar Kanitscheider, Nitish Shirish Keskar,
Tabarak Khan, Logan Kilpatrick, Jong Wook Kim,
Christina Kim, Yongjik Kim, Jan Hendrik Kirch-
ner, Jamie Kiros, Matt Knight, Daniel Kokotajlo,
Łukasz Kondraciuk, Andrew Kondrich, Aris Kon-
stantinidis, Kyle Kosic, Gretchen Krueger, Vishal
Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan
Leike, Jade Leung, Daniel Levy, Chak Ming Li,
Rachel Lim, Molly Lin, Stephanie Lin, Mateusz
Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue,
Anna Makanju, Kim Malfacini, Sam Manning, Todor
Markov, Yaniv Markovski, Bianca Martin, Katie
Mayer, Andrew Mayne, Bob McGrew, Scott Mayer
McKinney, Christine McLeavey, Paul McMillan,
Jake McNeil, David Medina, Aalok Mehta, Jacob
Menick, Luke Metz, Andrey Mishchenko, Pamela
Mishkin, Vinnie Monaco, Evan Morikawa, Daniel
Mossing, Tong Mu, Mira Murati, Oleg Murk, David
Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak,
Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh,
Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex
Paino, Joe Palermo, Ashley Pantuliano, Giambat-
tista Parascandolo, Joel Parish, Emy Parparita, Alex
Passos, Mikhail Pavlov, Andrew Peng, Adam Perel-
man, Filipe de Avila Belbute Peres, Michael Petrov,
Henrique Ponde de Oliveira Pinto, Michael, Poko-
rny, Michelle Pokrass, Vitchyr H. Pong, Tolly Pow-
ell, Alethea Power, Boris Power, Elizabeth Proehl,
Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh,
Cameron Raymond, Francis Real, Kendra Rimbach,
Carl Ross, Bob Rotsted, Henri Roussez, Nick Ry-
der, Mario Saltarelli, Ted Sanders, Shibani Santurkar,
Girish Sastry, Heather Schmidt, David Schnurr, John
Schulman, Daniel Selsam, Kyla Sheppard, Toki
Sherbakov, Jessica Shieh, Sarah Shoker, Pranav
Shyam, Szymon Sidor, Eric Sigler, Maddie Simens,
Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin
Sokolowsky, Yang Song, Natalie Staudacher, Fe-
lipe Petroski Such, Natalie Summers, Ilya Sutskever,
Jie Tang, Nikolas Tezak, Madeleine B. Thompson,
Phil Tillet, Amin Tootoonchian, Elizabeth Tseng,
Preston Tuggle, Nick Turley, Jerry Tworek, Juan Fe-
lipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya,
Chelsea Voss, Carroll Wainwright, Justin Jay Wang,
Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei,
CJ Weinmann, Akila Welihinda, Peter Welinder, Ji-
ayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner,
Clemens Winter, Samuel Wolrich, Hannah Wong,
Lauren Workman, Sherwin Wu, Jeff Wu, Michael
Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qim-
ing Yuan, Wojciech Zaremba, Rowan Zellers, Chong
Zhang, Marvin Zhang, Shengjia Zhao, Tianhao
Zheng, Juntang Zhuang, William Zhuk, and Barret
Zoph. 2024. Gpt-4 technical report. Preprint ,
arXiv:2303.08774.
Alec Radford, Rafal Jozefowicz, and Ilya Sutskever.
2017. Learning to generate reviews and discovering
sentiment. arXiv preprint arXiv:1704.01444 .
Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
Dario Amodei, Ilya Sutskever, et al. 2019. Language
models are unsupervised multitask learners. OpenAI
blog, 1(8):9.
Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov,
and Ziyu Yao. 2024. A practical review of mecha-
nistic interpretability for transformer-based language
models. arXiv preprint arXiv:2407.02646 .
Senthooran Rajamanoharan, Arthur Conmy, Lewis
Smith, Tom Lieberum, Vikrant Varma, János Kramár,
Rohin Shah, and Neel Nanda. 2024a. Improving
dictionary learning with gated sparse autoencoders.
arXiv preprint arXiv:2404.16014 .
Senthooran Rajamanoharan, Tom Lieberum, Nicolas
Sonnerat, Arthur Conmy, Vikrant Varma, János
Kramár, and Neel Nanda. 2024b. Jumping ahead: Im-
proving reconstruction fidelity with jumprelu sparse
autoencoders. arXiv preprint arXiv:2407.14435 .
Samuel Marks, Adam Karvonen, and Aaron Mueller.
2024. dictionary learning. https://github.com/saprmarks/dictionary_learning .
Naomi Saphra and Sarah Wiegreffe. 2024. Mechanistic?
arXiv preprint arXiv:2410.09087 .
Claude E Shannon. 1948. A mathematical theory of
communication. The Bell system technical journal ,
27(3):379–423.
Noam Shazeer. 2020. Glu variants improve transformer.
arXiv preprint arXiv:2002.05202 .
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz,
Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff
Dean. 2017. Outrageously large neural networks:
The sparsely-gated mixture-of-experts layer. arXiv
preprint arXiv:1701.06538 .
Anant Subramanian, Danish Pruthi, Harsh Jhamtani,
Taylor Berg-Kirkpatrick, and Eduard Hovy. 2018.
Spine: Sparse interpretable neural embeddings. In
Proceedings of the AAAI conference on artificial in-
telligence , volume 32.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever,
Joan Bruna, Dumitru Erhan, Ian Goodfellow, and
Rob Fergus. 2013. Intriguing properties of neural
networks. arXiv preprint arXiv:1312.6199 .
Jihoon Tack, Jack Lanchantin, Jane Yu, Andrew Co-
hen, Ilia Kulikov, Janice Lan, Shibo Hao, Yuandong
Tian, Jason Weston, and Xian Li. 2025. Llm pre-
training with continuous concepts. arXiv preprint
arXiv:2502.08524 .
Glen Taggart. 2024. Prolu: A nonlinearity for sparse
autoencoders - ai alignment forum.
Gemma Team, Morgane Riviere, Shreya Pathak,
Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupati-
raju, Léonard Hussenot, Thomas Mesnard, Bobak
Shahriari, Alexandre Ramé, et al. 2024. Gemma 2:
Improving open language models at a practical size.
arXiv preprint arXiv:2408.00118 .
Adly Templeton, Tom Conerly, Jonathan Marcus, Jack
Lindsey, Trenton Bricken, Brian Chen, Adam Pearce,
Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy
Cunningham, Nicholas L Turner, Callum McDougall,
Monte MacDiarmid, C. Daniel Freeman, Theodore R.
Sumers, Edward Rees, Joshua Batson, Adam Jermyn,
Shan Carter, Chris Olah, and Tom Henighan. 2024.
Scaling monosemanticity: Extracting interpretable
features from claude 3 sonnet. Transformer Circuits
Thread .
Benjamin Wright and Lee Sharkey. 2024. Addressing
feature suppression in saes - ai alignment forum.
Xuansheng Wu, Wenhao Yu, Xiaoming Zhai, and Ning-
hao Liu. 2025a. Self-regularization with latent space
explanations for controllable llm-based classification.
arXiv preprint arXiv:2502.14133 .
Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming
Zhai, and Ninghao Liu. 2025b. Interpreting and
steering llms with mutual information-based expla-
nations on sparse autoencoders. arXiv preprint
arXiv:2502.15576 .
xAI. 2025. Grok 3 beta — the age of reasoning
agents. Blog post announcing Grok 3 Beta, describ-
ing improvements in reasoning capabilities and per-
formance benchmarks.
Xianjun Yang, Shaoliang Nie, Lijuan Liu, Suchin Guru-
rangan, Ujjwal Karn, Rui Hou, Madian Khabsa, and
Yuning Mao. 2025. Diversity-driven data selection
for language model tuning through sparse autoen-
coder. arXiv preprint arXiv:2502.14050 .
Qingyu Yin, Chak Tou Leong, Hongbo Zhang, Minjun
Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li,
Jun Wang, Yue Zhang, et al. 2024. Direct preference
optimization using sparse feature-level constraints.
arXiv preprint arXiv:2411.07618 .
Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu,
Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei
Yin, and Mengnan Du. 2024a. Explainability for
large language models: A survey. ACM Transactions
on Intelligent Systems and Technology , 15(2):1–38.
Haiyan Zhao, Fan Yang, Bo Shen, Himabindu
Lakkaraju, and Mengnan Du. 2024b. Towards
uncovering how large language model works:
An explainability perspective. arXiv preprint
arXiv:2402.10688 .
Haiyan Zhao, Heng Zhao, Bo Shen, Ali Payani, Fan
Yang, and Mengnan Du. 2025. Beyond single con-
cept vector: Modeling concept subspace in llms with
gaussian distribution. The Thirteenth International
Conference on Learning Representations .