arxiv

Paper 2503.05613

A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models

Authors: Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, Mengnan Du

Published: 2025-03-07

Abstract:

Large Language Models (LLMs) have revolutionized natural language processing, yet their internal mechanisms remain largely opaque. Recently, mechanistic interpretability has attracted significant attention from the research community as a means to understand the inner workings of LLMs. Among various mechanistic interpretability approaches, Sparse Autoencoders (SAEs) have emerged as a particularly promising method due to their ability to disentangle the complex, superimposed features within LLMs into more interpretable components. This paper presents a comprehensive examination of SAEs as a promising approach to interpreting and understanding LLMs. We provide a systematic overview of SAE principles, architectures, and applications specifically tailored for LLM analysis, covering theoretical foundations, implementation strategies, and recent developments in sparsity mechanisms. We also explore how SAEs can be leveraged to explain the internal workings of LLMs, steer model behaviors in desired directions, and develop more transparent training methodologies for future models. Despite the challenges that remain around SAE implementation and scaling, they continue to provide valuable tools for understanding the internal mechanisms of large language models.

Paper Content:
Dong Shu (1,†), Xuansheng Wu (2,†), Haiyan Zhao (3,†), Daking Rai (4), Ziyu Yao (4), Ninghao Liu (2), Mengnan Du (3)

(1) Northwestern University, (2) University of Georgia, (3) New Jersey Institute of Technology, (4) George Mason University

dongshu2024@u.northwestern.edu, {xw54582, ninghao.liu}@uga.edu, {hz54, mengnan.du}@njit.edu, {drai2, ziyuyao}@gmu.edu
1 Introduction

Large Language Models (LLMs), such as GPT-4 (OpenAI et al., 2024), Claude-3.5 (Anthropic, 2024), DeepSeek-R1 (DeepSeek-AI et al., 2025), and Grok-3 (xAI, 2025), have emerged as powerful tools in natural language processing, demonstrating remarkable capabilities in tasks ranging from text generation to complex reasoning. However, their increasing size and complexity have created significant challenges in understanding their internal representations and decision-making processes. This "black box" nature of LLMs has sparked growing interest in mechanistic interpretability (Bereska and Gavves, 2024; Zhao et al., 2024a; Rai et al., 2024; Zhao et al., 2024b), a field that aims to break down LLMs into understandable components and systematically analyze how these components interact to produce emergent behaviors. Understanding these mechanisms is crucial not only for scientific progress but also for addressing concerns related to the safety, reliability, and alignment of increasingly powerful LLM systems.

† Equal contribution

Among the various approaches to interpreting LLMs, Sparse Autoencoders (SAEs) (Cunningham et al., 2023; Bricken et al., 2023; Gao et al., 2025; Rajamanoharan et al., 2024b) have emerged as a particularly promising direction for addressing a fundamental challenge in LLM interpretability: polysemanticity. Many neurons in LLMs are polysemantic, responding to seemingly unrelated concepts or features simultaneously. This phenomenon likely results from superposition (Elhage et al., 2022), where LLMs represent more independent features than they have neurons by encoding each feature as a linear combination of neurons. SAEs address this issue by learning an overcomplete, sparse representation of neural activations, effectively disentangling these superimposed features into more interpretable units.
By training a sparse autoencoder to reconstruct the activations of a target network layer while enforcing sparsity constraints, SAEs can extract a larger set of monosemantic features that offer clearer insights into what information the LLM is processing. This approach has shown considerable promise in transforming the often-inscrutable activations of LLMs into more human-understandable representations, potentially creating a more effective vocabulary for mechanistic analysis of these complex systems.

arXiv:2503.05613v1 [cs.LG] 7 Mar 2025

[Figure 1: two panels, (a) SAE Framework and (b) SAE History.]
Figure 1: (a) This figure illustrates the fundamental framework of a Sparse Autoencoder (SAE). The SAE is trained to take a model representation $z$ as input and project it to an overcomplete sparse activation $h(z)$ by learning to produce a reconstruction $\hat{z}$ of the original input. The SAE typically comprises an encoder, a decoder, and a loss function for training. (b) The development of the SAE progresses through multiple stages, with each stage drawing inspiration from and building upon the previous one.

In this paper, we provide a comprehensive overview of SAEs for LLM interpretability, beginning with their foundational motivations and development history. We then explore the technical framework of SAEs, including their basic architecture, various design improvements, and effective training strategies. The survey examines different approaches to analyzing and explaining SAE features, categorized broadly into input-based and output-based explanation methods.
We discuss evaluation methodologies for assessing SAE performance, covering both structural metrics that analyze the properties of the learned features and functional metrics that evaluate their practical utility. The paper further delves into real-world applications of SAEs in understanding and manipulating LLMs, including model behavior analysis, intentional steering of model outputs, and insights for improved model training. Lastly, we highlight current research challenges and conclude with perspectives on promising future directions.

2 Why Sparse Autoencoders?

As LLMs continue to grow rapidly in size, interpretation becomes more challenging, because the complexity of their latent spaces and internal representations grows with them. SAEs have emerged as a powerful tool for understanding how LLMs make decisions. This effort belongs to mechanistic interpretability, which aims to reverse-engineer models by breaking down their internal computations into understandable, interpretable components. An SAE is designed to learn a sparse, linear, and decomposable representation of the internal activations of an LLM. It enforces a sparsity constraint so that only a few features are active at any given time. This encourages each active feature to correspond to a specific, understandable concept. This simplification allows researchers to focus on a few key features rather than being overwhelmed by the full complexity of the model. Below, we discuss the development history of SAEs for LLMs and present Figure 1b to visually depict this progress. Due to page limitations, we do not attempt to provide an exhaustive history of SAEs, but instead focus on highlighting key milestones in the development of SAEs for mainstream LLMs.

Explaining Individual Neurons. The development of interpretability techniques for LLMs has progressed in stages rather than as a single step.
From 2022 to 2023, researchers at OpenAI and Anthropic focused on understanding LLMs by examining individual neurons. OpenAI, for instance, leveraged GPT-4 to generate natural language explanations for neurons in models like GPT-2, attempting to map specific neuron activations to concrete linguistic or conceptual features (Bills et al., 2023). Similarly, Anthropic built small toy models trained on synthetic data to observe how features are stored in neurons. These early investigations showed that analyzing single neurons can provide initial insights (Elhage et al., 2022). In addition, it is worth noting that explaining individual neurons by assigning them interpretable feature labels had been extensively explored outside mechanistic interpretability (MI) (Radford et al., 2017; Donnelly and Roegiest, 2019; Nguyen et al., 2019; Szegedy et al., 2013) before MI emerged as a field.

However, researchers soon discovered that analyzing individual neurons had significant limitations, as neurons in LLMs often exhibit polysemanticity, i.e., a single neuron responds to multiple unrelated inputs. For instance, a single neuron might simultaneously activate for academic citations, English dialogue, HTTP requests, and Korean text (Bricken et al., 2023). This phenomenon is largely attributed to superposition, where neural networks represent more independent features than available neurons by encoding each feature as a linear combination of neurons (Ferrando et al., 2024). While this architectural efficiency allows models to encode vast amounts of knowledge, it makes individual neurons difficult to interpret since their activations represent entangled mixtures of different concepts.
This fundamental challenge with neuron-level analysis motivated researchers to explore more sophisticated approaches for disentangling these superimposed features, leading to the development of sparse autoencoders (SAEs) as a promising solution for extracting interpretable, monosemantic features from the model's complex internal representations.

SAEs for Small-size Language Models. In late 2023, Anthropic advanced transformer interpretability by moving beyond raw neuron activations to decompose model activations into single-concept, monosemantic features, addressing the polysemanticity of individual neurons in LLMs (Cunningham et al., 2023; Bricken et al., 2023). They trained SAEs on transformer activation data by optimizing a reconstruction loss with a strong sparsity constraint. This training forces the autoencoder to represent each activation as a sparse combination of basis vectors, with each basis vector ideally capturing a distinct, interpretable concept. SAEs transform the overlapping signals of individual neurons into a set of clean, monosemantic features that are much easier to understand. This approach offers a clear advantage over traditional neuron-based methods by isolating the key features that drive model behavior. The promising experimental results on simpler transformer models demonstrated that SAEs provide a more effective and scalable route for interpreting model internals. Building on this success, later works (Bloom, 2024; Samuel et al., 2024) applied SAE techniques to smaller models such as GPT-2 (Radford et al., 2019) and Pythia-70m (Biderman et al., 2023), thereby paving the way for their eventual extension to full-scale, billion-parameter LLMs.

SAEs for Large Language Models.
After witnessing the success of SAEs on smaller-scale models, the third stage of their development emerged in 2024, when Anthropic (Templeton et al., 2024) and OpenAI (Gao et al., 2025) became the first groups to apply SAEs to their latest proprietary LLMs, Claude 3 Sonnet and GPT-4, respectively. This marked a significant step forward in understanding these closed-source, black-box models, even for the researchers who built them. However, scaling SAEs from small models to full-scale LLMs introduced several new challenges. One major issue was the sheer scale of activations in models with billions of parameters, which made training and extracting interpretable features computationally expensive. Additionally, ensuring that extracted features remained monosemantic became increasingly difficult, as feature superposition is more prevalent in larger models (Templeton et al., 2024). Despite these challenges, researchers found that SAEs could effectively decompose polysemantic neurons into monosemantic features, revealing meaningful and interpretable latent representations within the models. For instance, Anthropic demonstrated that certain SAE features in Claude 3 Sonnet encode high-level concepts such as "sycophantic praise", where phrases like "a generous and gracious man" strongly activate this feature. Similarly, OpenAI's research on GPT-4 identified a "humans have flaws" feature, which activates on phrases like "My Dad wasn't perfect (are any of us?) but he loved us dearly." These findings not only deepen our understanding of model behavior but also provide powerful interpretability tools, allowing practitioners to better analyze, refine, and steer language model outputs.

As the architecture and mechanisms of SAEs become clearer, more researchers have begun to follow this approach, applying SAEs to interpret open-source models. For example, Google DeepMind (Lieberum et al., 2024) used SAEs to analyze Gemma 2 (Team et al., 2024), while He et al.
(2024) applied similar techniques to LLaMA 3.1 (Dubey et al., 2024). This growing adoption highlights the increasing role of SAEs in mechanistic interpretability, paving the way for broader transparency in both closed- and open-source LLMs.

3 Technical Framework of SAEs

3.1 Basic SAE Framework

An SAE is a neural network that learns an overcomplete dictionary for representation reconstruction. As shown in Figure 1a, the input of an SAE is the representation of a token extracted from an LLM, which is mapped onto a sparse vector of dictionary activations.

Table 1: Taxonomy of SAE Frameworks: An Overview of Basic and Variant Architectures.

Category | Example | Activation | Citation
--- | --- | --- | ---
Basic SAE Framework (§3.1) | l2-norm SAE | ReLU | Ferrando et al. (2024)
Improve Architecture (§3.2.1) | Gated SAE | Jump ReLU | Rajamanoharan et al. (2024a)
Improve Architecture (§3.2.1) | TopK SAE | TopK | Gao et al. (2025)
Improve Architecture (§3.2.1) | Batch TopK SAE | Batch TopK | Bussmann et al. (2024)
Improve Architecture (§3.2.1) | ProLU SAE | ProLU | Taggart (2024)
Improve Architecture (§3.2.1) | JumpReLU SAE | Jump ReLU | Rajamanoharan et al. (2024b)
Improve Architecture (§3.2.1) | Switch SAE | TopK | Mudide et al. (2024)
Improve Training Strategy (§3.2.2) | Layer Group SAE | Jump ReLU | Ghilardi et al. (2024)
Improve Training Strategy (§3.2.2) | Feature Choice SAE | TopK | Ayonrinde (2024)
Improve Training Strategy (§3.2.2) | Mutual Choice SAE | TopK | Ayonrinde (2024)
Improve Training Strategy (§3.2.2) | Feature Aligned SAE | TopK | Marks et al. (2024)
Improve Training Strategy (§3.2.2) | End-to-end SAE | ReLU | Braun et al. (2025)
Improve Training Strategy (§3.2.2) | Formal Languages SAE | ReLU | Menon et al. (2024)
Improve Training Strategy (§3.2.2) | Specialized SAE | ReLU | Muhamed et al. (2024)

Input. Given an LLM denoted as $f$ with a total of $L$ transformer layers, we consider an input sequence $x = (x_1, \dots, x_N)$ with $N$ tokens, where each $x_n \in x$ represents a token in the sequence. As the sequence $x$ is processed by the LLM, each token $x_n$ produces representations at different layers. For a specific layer $l$, we denote the hidden representation corresponding to token $x_n$ as $z_n^{(l)}$, where $z_n^{(l)} \in \mathbb{R}^d$ is an embedding vector of dimension $d$. Each representation $z_n^{(l)}$ serves as input to the SAE. In the following, we may omit the layer superscript $(l)$ to simplify the notation.
After extracting the representation $z_n^{(l)}$, the SAE takes it as input, decomposes it into a sparse representation, and then reconstructs it. The SAE framework is typically composed of three key components: the encoder, which maps the input representation to a higher-dimensional sparse activation; the decoder, which reconstructs the original representation from this sparse activation; and the loss function, which ensures accurate reconstruction while enforcing sparsity constraints.

Encoding Step. Given an input representation $z \in \mathbb{R}^d$, the encoder applies a linear transformation using a weight matrix $W_{\mathrm{enc}} \in \mathbb{R}^{d \times m}$ and a bias term $b_{\mathrm{enc}} \in \mathbb{R}^m$, followed by a ReLU activation function to enforce sparsity. The encoding operation is defined as:

$$h(z) = \mathrm{ReLU}(z \cdot W_{\mathrm{enc}} + b_{\mathrm{enc}}), \tag{1}$$

where $h(z) \in \mathbb{R}^m$ represents the sparse activation vector, which helps disentangle superposed features. The ReLU activation function $\mathrm{ReLU}(x) = \max(0, x)$ ensures that only non-negative values pass through, which encourages sparsity by setting negative values to zero.

Since the SAE constructs an overcomplete dictionary to facilitate sparse activation, the number of learned dictionary elements $m$ is chosen to be larger than the input dimension $d$ (i.e., $m \gg d$). This overcompleteness allows the encoder to learn a richer and more expressive representation of the input, making it possible to reconstruct the original data using only a sparse subset of dictionary elements. The output $h(z)$ from the encoder is then passed to the decoding stage, where it is mapped back to the original input space to reconstruct $z$.

Decoding Step. After the encoding step, the next stage in the SAE framework is the decoding process, where the sparse activation vector $h(z)$ is mapped back to the original input space. This step ensures that the sparse features learned by the encoder contain sufficient information to accurately reconstruct the original representation.
The decoding operation is defined as:

$$\hat{z} = \mathrm{SAE}(z) = h(z) \cdot W_{\mathrm{dec}} + b_{\mathrm{dec}}, \tag{2}$$

where $W_{\mathrm{dec}} \in \mathbb{R}^{m \times d}$ is the decoder weight matrix, $b_{\mathrm{dec}} \in \mathbb{R}^d$ is the decoder bias term, and $\hat{z} \in \mathbb{R}^d$ is the reconstructed output, which aims to approximate the original input $z$.

The accuracy of the reconstruction and the interpretability of the learned representation depend heavily on the effectiveness and sparsity of the activation vector $h(z)$. Therefore, the SAE is trained using a loss function that balances minimizing the reconstruction error and enforcing sparsity. This trade-off ensures that the learned dictionary elements provide a compact yet expressive representation of the input data.

Loss Function. The activation vector $h(z)$ is encouraged to be sparse, meaning that most of its values should be zero. While the ReLU activation function after the encoder enforces basic sparsity by setting negative values to zero, it does not necessarily eliminate small positive values, which can still contribute to a dense representation. Therefore, additional sparsity enforcement is required. This is achieved using a sparsity regularization term in the loss function, which further promotes a minimal number of active features. Beyond enforcing sparsity, the SAE must also ensure that the learned sparse activation retains sufficient information to accurately reconstruct the original input $z$. The loss function for training the SAE consists of two key components, reconstruction loss and sparsity regularization:

$$\mathcal{L}(z) = \|z - \hat{z}\|_2^2 + \alpha \|h(z)\|_1, \tag{3}$$

where the reconstruction loss ensures that the SAE learns to reconstruct the input data accurately, meaning the features encoded in the sparse representation must also be present in the input activations. The sparsity regularization, in turn, enforces sparsity by penalizing nonzero values in $h(z)$, and $\alpha$ is a hyperparameter that controls the strength of the sparsity penalty.
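As a minimal sketch of Eqs. (1) to (3), the snippet below runs one forward pass of a basic SAE and evaluates its loss in plain Python. The toy sizes ($d = 4$, $m = 16$), the random initialization, the value of $\alpha$, and the helper names `vec_mat`, `sae_forward`, and `sae_loss` are assumptions made for illustration; they are not taken from any particular SAE implementation.

```python
import random

def relu(v):
    """Element-wise ReLU: negative entries become zero."""
    return [max(0.0, x) for x in v]

def vec_mat(v, M):
    """Row vector v (length r) times matrix M (r x c), giving a list of length c."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def sae_forward(z, W_enc, b_enc, W_dec, b_dec):
    """Eq. (1) then Eq. (2): encode z into h(z), then decode h(z) into z_hat."""
    h = relu([a + b for a, b in zip(vec_mat(z, W_enc), b_enc)])
    z_hat = [a + b for a, b in zip(vec_mat(h, W_dec), b_dec)]
    return h, z_hat

def sae_loss(z, z_hat, h, alpha):
    """Eq. (3): squared L2 reconstruction error plus alpha times the L1 norm of h."""
    recon = sum((a - b) ** 2 for a, b in zip(z, z_hat))
    sparsity = sum(abs(x) for x in h)
    return recon + alpha * sparsity

# Toy sizes (assumptions): d = 4 input dimensions, m = 16 dictionary elements.
d, m = 4, 16
rng = random.Random(0)
W_enc = [[rng.gauss(0, 0.1) for _ in range(m)] for _ in range(d)]  # shape d x m
W_dec = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(m)]  # shape m x d
b_enc, b_dec = [0.0] * m, [0.0] * d

z = [rng.gauss(0, 1) for _ in range(d)]
h, z_hat = sae_forward(z, W_enc, b_enc, W_dec, b_dec)
print(len(h), len(z_hat), sae_loss(z, z_hat, h, alpha=0.01))
```

The sketch uses the row-vector convention of Eq. (1), so `W_enc` has $d$ rows and $m$ columns. A real training loop would additionally compute gradients of this loss with respect to the four parameter tensors, which is omitted here.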
Specifically, without the sparsity loss, SAEs could simply memorize the training data, reconstructing the input without disentangling meaningful features. Once the sparsity loss is introduced, however, the model is forced to activate only a small subset of dictionary elements when reconstructing the input activation. This constraint encourages the SAE to focus on the most informative and critical features. A higher value of $\alpha$ enforces stronger sparsity by shrinking more values in $h(z)$ to zero, but this may lead to information loss and degraded reconstruction quality. A lower value of $\alpha$ prioritizes reconstruction accuracy but may result in less sparsity, reducing the interpretability of the learned features. Thus, selecting an appropriate $\alpha$ is crucial for balancing interpretability against accurate data representation.

3.2 Different SAE Variants

As SAEs continue to emerge as a powerful tool for interpreting the internal representations of LLMs, researchers have increasingly focused on refining and extending their capabilities. Various SAE variants have been proposed to address the limitations of traditional SAEs, each introducing improvements from a different perspective. In this section, we categorize these advancements into two main groups: Improve Architecture, which covers variants that modify the structure and design of the traditional SAE, and Improve Training Strategy, which covers variants that retain the original architecture but introduce novel methods to enhance training efficiency, feature selection, and sparsity enforcement.

3.2.1 Improve Architecture

Gated SAE. The Gated SAE (Rajamanoharan et al., 2024a) is a modification of the standard SAE that aims to improve the trade-off between reconstruction fidelity and sparsity enforcement. Traditional SAEs suffer from shrinkage bias (Wright and Sharkey, 2024), where the L1-norm regularization systematically underestimates feature activations, leading to reduced reconstruction accuracy.
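To see where shrinkage comes from, consider a deliberately simplified one-dimensional case, which is an assumption made purely for illustration: a single latent $h \ge 0$ must reconstruct a scalar target $a$ under the loss $(a - h)^2 + \alpha |h|$. Setting the derivative to zero gives $h^* = \max(0, a - \alpha/2)$, so every active latent is pulled $\alpha/2$ below the value that would reconstruct $a$ exactly. The sketch below checks this closed form against a brute-force grid search; the function names are hypothetical.

```python
def l1_penalized_optimum(a, alpha):
    """Closed-form minimizer of (a - h)^2 + alpha * |h| over h >= 0."""
    return max(0.0, a - alpha / 2.0)

def grid_minimizer(a, alpha, steps=100001, h_max=5.0):
    """Brute-force check: evaluate the same loss on a fine grid of h in [0, h_max]."""
    best_h, best_val = 0.0, float("inf")
    for k in range(steps):
        h = h_max * k / (steps - 1)
        val = (a - h) ** 2 + alpha * abs(h)
        if val < best_val:
            best_h, best_val = h, val
    return best_h

# Illustrative values (assumptions): target activation a and penalty weight alpha.
a, alpha = 2.0, 0.5
print(l1_penalized_optimum(a, alpha), grid_minimizer(a, alpha))
```

In this toy setting the optimum sits below the target whenever $\alpha > 0$, which is the underestimation that the variants below try to avoid by separating the decision of which features fire from the estimate of how strongly they fire.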
Inspired by Gated Linear Units (Dauphin et al., 2017; Shazeer, 2020), Gated SAEs replace the standard ReLU encoder with a gated ReLU encoder, which allows the L1 penalty to be applied solely to the selection mechanism:

$$\tilde{h}(z) = \mathbb{1}\left[\pi_{\mathrm{gate}}(z) > 0\right] \odot \mathrm{ReLU}\left(\pi_{\mathrm{mag}}(z)\right), \tag{4}$$

where $\pi_{\mathrm{gate}}(z) = W_{\mathrm{gate}}(z - b_{\mathrm{dec}}) + b_{\mathrm{gate}}$ is the gating function that determines which features should be activated, and $W_{\mathrm{gate}}$ is a weight matrix for feature selection. Likewise, $\pi_{\mathrm{mag}}(z) = W_{\mathrm{mag}}(z - b_{\mathrm{dec}}) + b_{\mathrm{mag}}$ is the magnitude estimation function that determines the strength of the active features, and $W_{\mathrm{mag}}$ is a weight matrix for magnitude estimation. $\mathbb{1}[\cdot]$ is the Heaviside step function that binarizes activations, and $\odot$ denotes element-wise multiplication. Gated SAEs thus introduce independent pathways for determining which features are activated and how strong they are, reducing bias and improving interpretability. To optimize the Gated SAE, the authors introduce an auxiliary loss $\|z - \hat{z}_{\mathrm{frozen}}(\mathrm{ReLU}(\pi_{\mathrm{gate}}(z)))\|_2^2$ on top of the traditional loss function. This addresses the non-differentiability of the Heaviside step function used in the gating mechanism and enables gradient flow during backpropagation while still enforcing sparsity. Here, $\hat{z}_{\mathrm{frozen}}$ is a copy of the decoder with frozen weights.

TopK SAE. The TopK SAE (Gao et al., 2025) is an improvement over the traditional SAE, designed to directly enforce sparsity without requiring L1-norm regularization. Instead of penalizing all activations, which can introduce shrinkage bias, TopK SAEs enforce sparsity by retaining only the $K$ largest activations and setting the rest to zero. This ensures that only the most important features contribute to the learned representation. The encoder applies a linear transformation followed by a hard TopK selection:

$$\tilde{h}(z) = \mathrm{TopK}\left((z - b_{\mathrm{pre}})\, W_{\mathrm{enc}}\right), \tag{5}$$

where $W_{\mathrm{enc}} \in \mathbb{R}^{d \times m}$ is the encoder weight matrix, and $b_{\mathrm{pre}} \in \mathbb{R}^d$ is a bias term subtracted from the input before the encoder and the TopK selection are applied.
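The TopK selection in Eq. (5) can be sketched in a few lines of plain Python. This is an illustrative reading of the equation rather than the authors' implementation: the helper names `top_k` and `topk_encode` are hypothetical, `W_enc` is assumed to be stored as a $d \times m$ nested list so that the centered input multiplies it as a row vector, and ties are broken in favor of the lower index.

```python
def top_k(pre_acts, k):
    """Keep the k largest pre-activations and set all others to zero."""
    if k <= 0:
        return [0.0] * len(pre_acts)
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(pre_acts)]

def topk_encode(z, W_enc, b_pre, k):
    """Subtract the pre-encoder bias, apply the linear map, then keep the top k."""
    centered = [zi - bi for zi, bi in zip(z, b_pre)]
    m = len(W_enc[0])
    pre = [sum(centered[i] * W_enc[i][j] for i in range(len(centered))) for j in range(m)]
    return top_k(pre, k)

# Exactly k entries survive, regardless of their magnitudes.
print(top_k([0.1, 3.0, -1.0, 2.0], k=2))
```

Because the number of surviving entries is fixed by `k`, sparsity is controlled directly by the selection step, with no penalty weight to tune for that purpose.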
Since the sparsity constraint is explicitly enforced through the TopK operation, there is no need for an additional sparsity regularization term in the loss function. The training objective reduces to the reconstruction loss plus an auxiliary term:

$$\mathcal{L}(z) = \|z - \hat{z}\|_2^2 + \alpha \mathcal{L}_{\mathrm{aux}}, \tag{6}$$

where $\mathcal{L}_{\mathrm{aux}}$ is an auxiliary loss scaled by the coefficient $\alpha$, designed to stabilize training and prevent dead latents (Templeton et al., 2024).

BatchTopK SAE. The BatchTopK SAE is a modification of the TopK SAE, designed to address the limitations of enforcing a fixed number of active features per sample. Bussmann et al. (2024) identified two key inefficiencies in the standard TopK SAE. First, it forces every token to use exactly $K$ features, even when some tokens may require fewer or more active features. Second, it does not allow flexibility across a batch, leading to inefficient sparsity control. To overcome these issues, BatchTopK SAEs apply the TopK selection globally across the entire batch instead of enforcing it per token. This means that BatchTopK selects the top $K \times B$ activations across the entire batch, where $B$ is the batch size. The encoder is modified to:

$$\tilde{h}(Z) = \mathrm{BatchTopK}\left((Z - b_{\mathrm{pre}})\, W_{\mathrm{enc}}\right), \tag{7}$$

where $Z \in \mathbb{R}^{B \times d}$ is the input batch matrix. As with the TopK SAE, BatchTopK directly controls sparsity through the selection mechanism, so it eliminates the need for explicit sparsity regularization:

$$\mathcal{L}(Z) = \|Z - \hat{Z}\|_2^2 + \alpha \mathcal{L}_{\mathrm{aux}}.$$

ProLU SAE. The ProLU SAE (Taggart, 2024) introduces a novel activation function called Proportional ReLU, which serves as an alternative to ReLU in traditional SAEs. ProLU SAE provides a Pareto improvement over both standard SAEs with L1-norm regularization, which suffer from shrinkage bias, and SAEs trained with a Sqrt(L1) penalty, which attempt to mitigate shrinkage but still do not fully address inconsistencies in activation scaling.
In contrast to ReLU, which applies a fixed threshold at zero, ProLU introduces a learnable threshold for each activation, allowing the model to determine the optimal activation boundary dynamically. The ProLU activation function is defined as:

$$\mathrm{ProLU}(m_i, b_i) = \begin{cases} m_i, & \text{if } m_i + b_i > 0 \text{ and } m_i > 0, \\ 0, & \text{otherwise}, \end{cases} \tag{8}$$

where $m_i$ is the pre-activation output from the encoder, and $b_i$ is a learnable bias term that shifts the activation threshold. The encoding process in ProLU SAE replaces the standard ReLU activation with ProLU, leading to the following encoding function:

$$\tilde{h}(z) = \mathrm{ProLU}\left((z - b_{\mathrm{dec}})\, W_{\mathrm{enc}},\; b_{\mathrm{enc}}\right). \tag{9}$$

The ProLU SAE training objective consists of the standard reconstruction loss combined with an auxiliary sparsity term:

$$\mathcal{L}(z) = \|z - \hat{z}\|_2^2 + \lambda P(\tilde{h}(z)), \tag{10}$$

where $\lambda$ is the sparsity penalty coefficient, and $P(\tilde{h}(z))$ is a sparsity-inducing function. The authors found that using a Sqrt(L1) penalty, defined as $P(h) = \|h\|_1^{1/2}$, provided better sparsity control compared to the standard L1-norm.

JumpReLU SAE. The JumpReLU SAE (Rajamanoharan et al., 2024b) is a modification of the traditional SAE that replaces the standard ReLU activation function with JumpReLU. The ReLU activation function sets all negative pre-activation values to zero but allows small positive values, leading to false positives in feature selection and underestimation of feature magnitudes. JumpReLU introduces an explicit threshold $\theta$ that zeroes out pre-activations below this threshold, ensuring that weak activations do not contribute to the reconstruction. The JumpReLU activation function is defined as:

$$\mathrm{JumpReLU}_\theta(z) = z \cdot H(z - \theta), \tag{11}$$

where $\theta$ is a learnable threshold and $H(x)$ is the Heaviside step function, which is 1 when $x > 0$ and 0 otherwise. The encoder in JumpReLU SAE applies a standard linear transformation followed by the JumpReLU activation:

$$\tilde{h}(z) = \mathrm{JumpReLU}_\theta(z \cdot W_{\mathrm{enc}} + b_{\mathrm{enc}}). \tag{12}$$

Unlike traditional SAEs that use the L1-norm for sparsity regularization, JumpReLU SAEs directly penalize the L0-norm, which counts the number of nonzero activations:

$$\mathcal{L}(z) = \|z - \hat{z}\|_2^2 + \alpha \|h(z)\|_0.$$

Switch SAE. Inspired by Mixture of Experts (MoE) models (Shazeer et al., 2017), Switch SAE (Mudide et al., 2024) introduces a more computationally efficient framework for training SAEs. Instead of training a single large SAE, Switch SAE leverages multiple smaller "expert SAEs" $E_1, E_2, \dots, E_N$ and a routing network that dynamically assigns each input to an appropriate expert. This approach enables efficient scaling to a large number of features while avoiding the memory and FLOP bottlenecks of traditional SAEs. Each "expert SAE" follows a standard TopK SAE formulation:

$$E_i(z) = W_{\mathrm{dec}}^{(i)} \cdot \mathrm{TopK}\left(W_{\mathrm{enc}}^{(i)} z\right), \tag{13}$$

where $W_{\mathrm{enc}}^{(i)}$ and $W_{\mathrm{dec}}^{(i)}$ are the encoder and decoder weight matrices for expert $i$. The routing network determines which expert is assigned to each input by computing a probability distribution over the experts:

$$p(z) = \mathrm{softmax}\left(W_{\mathrm{router}}(z - b_{\mathrm{router}})\right), \tag{14}$$

where $W_{\mathrm{router}}$ is the routing weight matrix, $b_{\mathrm{router}}$ is the bias term, and $p(z)$ represents the probability of selecting each expert. The final reconstruction is computed as:

$$\hat{z} = p_{i^*(z)} \cdot E_{i^*(z)}(z - b_{\mathrm{pre}}) + b_{\mathrm{pre}}, \tag{15}$$

where $i^*(z)$ is the selected expert for input $z$.

To ensure balanced expert utilization and avoid expert collapse, Switch SAE incorporates an auxiliary loss for load balancing:

$$\mathcal{L}_{\mathrm{aux}} = N \sum_{i=1}^{N} f_i \cdot P_i, \tag{16}$$

where $f_i$ is the fraction of activations assigned to expert $i$, and $P_i$ is the fraction of router probability assigned to expert $i$. This auxiliary loss is then added to the traditional reconstruction loss function to form the final learning objective.

3.2.2 Improve Training Strategy

Layer Group SAE. Traditionally, one SAE is trained per layer in a transformer-based LLM, resulting in a substantial number of parameters and high computational costs.
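As a rough illustration of that cost, the sketch below counts the parameters of a basic SAE from the shapes in Eqs. (1) and (2), namely $W_{\mathrm{enc}} \in \mathbb{R}^{d \times m}$, $b_{\mathrm{enc}} \in \mathbb{R}^m$, $W_{\mathrm{dec}} \in \mathbb{R}^{m \times d}$, and $b_{\mathrm{dec}} \in \mathbb{R}^d$. The concrete sizes ($d = 768$, $m = 16d$, 12 layers) are hypothetical values chosen only to make the arithmetic concrete, and the function name is likewise an assumption.

```python
def sae_param_count(d, m):
    """Parameters of one basic SAE: W_enc (d*m), b_enc (m), W_dec (m*d), b_dec (d)."""
    return d * m + m + m * d + d

# Hypothetical sizes chosen only for illustration.
d = 768           # width of the LLM representation
m = 16 * d        # dictionary size, with m much larger than d
num_layers = 12   # one SAE per transformer layer

per_sae = sae_param_count(d, m)
total = num_layers * per_sae
print(f"parameters per SAE: {per_sae:,}")
print(f"parameters for {num_layers} per-layer SAEs: {total:,}")
```

Since the count is dominated by the $2dm$ weight terms, the total grows linearly with the number of SAEs trained.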
To address this inefficiency, the Layer Group SAE (Ghilardi et al., 2024) clusters multiple layers into groups based on activation similarity and trains a single SAE per group. This significantly reduces training time while preserving reconstruction accuracy and interpretability. To determine which layers should be grouped together, the method measures the angular distance between layer activations, defined as:

$$d_{\mathrm{angular}}\left(z_{\mathrm{post}}^{p}, z_{\mathrm{post}}^{q}\right) = \frac{1}{\pi} \arccos\left(\frac{z_{\mathrm{post}}^{p} \cdot z_{\mathrm{post}}^{q}}{\|z_{\mathrm{post}}^{p}\|_2 \, \|z_{\mathrm{post}}^{q}\|_2}\right), \tag{17}$$

where $z_{\mathrm{post}}^{p}$ and $z_{\mathrm{post}}^{q}$ represent post-MLP residual stream activations for layers $p$ and $q$. Using this metric, layers whose activations are close to each other are clustered together through a hierarchical clustering strategy. The number of groups $K$ is chosen based on a computational trade-off, balancing efficiency and reconstruction accuracy. Once the layer groups are formed, a single SAE is trained per group instead of one per layer. The SAE architecture and training objective remain similar to those of traditional SAEs, optimizing for both reconstruction accuracy and sparsity.

Feature Choice SAE. Traditional SAEs face several limitations, including dead features, fixed sparsity per token, and lack of adaptive computation. Feature Choice SAEs (Ayonrinde, 2024) address these issues by imposing a constraint on the number of tokens each feature can be active for, rather than restricting the number of active features per token. This approach ensures that all features are utilized efficiently, preventing feature collapse and improving reconstruction accuracy. This sparsity allocation constraint is defined as:

$$\sum_{j} S_{i,j} = m, \quad \forall i, \quad \text{where } M = mF, \tag{18}$$

where $S_{i,j}$ is a binary selection matrix indicating whether feature $i$ is active for token $j$, $F$ is the number of features, and $M$ is the total number of active feature-token pairs. Each feature must be activated for exactly $m$ tokens, enforcing uniform feature utilization.

Mutual Choice SAE.
Mutual Choice SAEs (Ayonrinde, 2024) remove all constraints on sparsity allocation, allowing the model to freely distribute its limited total sparsity budget across all tokens and features. Unlike TopK SAEs, which enforce a fixed number of active features per token, or Feature Choice SAEs, which constrain the number of tokens each feature can be assigned to, Mutual Choice SAEs introduce global sparsity allocation. This means that instead of enforcing a per-token or per-feature selection, the model selects the top $M$ feature-token matches across the entire dataset, ensuring that sparsity is allocated adaptively based on reconstruction needs. Mathematically, the activation selection process is defined as:

$$S = \mathrm{TopKIndices}(Z', M), \tag{19}$$

where $Z'$ represents the pre-activation affinity matrix between tokens and features, and $M$ is the global sparsity budget, denoting the total number of active feature-token pairs allowed. $\mathrm{TopKIndices}(\cdot)$ selects the top $M$ activations globally, instead of enforcing a fixed $K$ per token.

Feature Aligned SAE. The Feature Aligned SAE (Marks et al., 2024) introduces Mutual Feature Regularization (MFR), a novel training method designed to improve the interpretability and fidelity of learned features in SAEs. Traditional SAEs often suffer from feature fragmentation, where meaningful input features get split across multiple decoder weights, and feature entanglement, where multiple independent input features are merged into a single decoder weight. These issues reduce the interpretability of SAEs and limit their effectiveness in analyzing neural activations. The key insight behind MFR is that features learned by multiple SAEs trained on the same dataset are more likely to align with the true underlying structure of the input data.
To enforce this, Feature Aligned SAE trains multiple SAEs in parallel and applies an MFR penalty that encourages them to learn mutually consistent features:

$$\mathcal{L}_{\text{MFR}} = \alpha \, \frac{1}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left(1 - \text{MMCS}\left(W^{(i)}, W^{(j)}\right)\right), \tag{20}$$

where $W^{(i)}$ and $W^{(j)}$ are the decoder weight matrices of different SAEs and $N$ is the number of SAEs trained in parallel. Mean of Max Cosine Similarity (MMCS) measures the degree of alignment between the learned features across SAEs, and $\alpha$ is a hyperparameter that controls the strength of the regularization. This mutual feature regularization is then combined with the traditional SAE loss to form the final training objective of the Feature Aligned SAE.

End-to-end SAE. Traditional SAEs often prioritize minimizing reconstruction error rather than ensuring that learned features are functionally important to the model’s decision-making. This often leads to feature splitting, where a single meaningful feature is divided into multiple redundant components. To address this, End-to-end SAE (Braun et al., 2025) modifies the training objective to ensure that the discovered features directly influence the network’s output. The authors propose minimizing the Kullback-Leibler (KL) divergence between the original network’s output distribution and the output distribution when using SAE activations, formulated as:

$$\mathcal{L}_{\text{e2e}} = \mathrm{KL}(\hat{y}, y) + \alpha \lVert h(z) \rVert_1. \tag{21}$$

To further ensure that activations follow similar computational pathways in later layers, they propose the E2e + Downstream SAE, which introduces an additional downstream reconstruction loss, leading to the formulation:

$$\mathcal{L}_{\text{e2e+ds}} = \mathrm{KL}(\hat{y}, y) + \alpha \lVert h(z) \rVert_1 + \beta \sum_{k=l+1}^{L} \left\lVert \hat{a}^{(k)} - a^{(k)} \right\rVert_2^2. \tag{22}$$

By shifting the training focus from activation reconstruction to output distribution preservation, this method ensures that learned features are more aligned with the actual computational processes of the network while maintaining interpretability.

Formal Languages SAE.
The effectiveness of traditional SAEs remains questionable in language models due to their reliance on correlations rather than causal attributions. While SAEs often recover features that correlate with linguistic structures, such as parts of speech or syntactic depth, interventions on these features frequently do not influence the model’s predictions, suggesting that current training objectives fail to ensure causal relevance. To address this, Formal Languages SAE (Menon et al., 2024) introduces a causal loss term that explicitly encourages SAEs to learn features that impact the model’s computation. The proposed loss function is given by:

$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \alpha \mathcal{L}_{\text{sparse}} + \beta \mathcal{L}_{\text{caus}}, \tag{23}$$

where $\mathcal{L}_{\text{recon}}$ is the standard reconstruction loss, $\mathcal{L}_{\text{sparse}}$ enforces sparsity, and $\mathcal{L}_{\text{caus}}$ ensures that interventions on learned features result in predictable changes in model output.

Specialized SAE. Traditional SAEs struggle to capture rare and low-frequency concepts, which are critical for understanding model behavior in specific subdomains. To address this, Specialized SAE (SSAE) (Muhamed et al., 2024) focuses on learning rare subdomain-specific features through targeted data selection and a novel training objective. Instead of training on the full dataset, SSAE uses high-recall retrieval methods, such as BM25, Contriever, and TracIn reranking, to identify relevant subdomain data, ensuring that rare features are well-represented. Additionally, the authors introduce Tilted Empirical Risk Minimization (TERM), an objective that up-weights high-loss examples and approaches the worst-case reconstruction loss as the tilt grows, rather than optimizing only the average loss. This is achieved by modifying the standard SAE loss function to:

$$\mathcal{L}_{\text{TERM}}(t; w) = \frac{1}{t} \log\left(\frac{1}{N} \sum_{i=1}^{N} e^{t \cdot \mathcal{L}_{w}(z_i)}\right), \tag{24}$$

where $\mathcal{L}_{w}(z_i)$ is the standard SAE loss for representation $z_i$, $N$ is the size of a minibatch, and $t$ is the tilt parameter that controls the emphasis on rare concept reconstruction.
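As a rough illustration of Eq. (24), the following sketch computes the tilted loss over a minibatch of per-example SAE losses. It assumes the per-example losses are already available as plain floats; the function name `term_loss` and the toy numbers are hypothetical and not taken from Muhamed et al. (2024).

```python
import math

def term_loss(losses, t):
    """Tilted empirical risk (Eq. 24) over a minibatch of per-example losses."""
    n = len(losses)
    if t == 0:
        # The limit t -> 0 recovers the ordinary average loss.
        return sum(losses) / n
    # Log-sum-exp trick: subtract the largest exponent for numerical stability.
    shift = max(t * loss for loss in losses)
    total = sum(math.exp(t * loss - shift) for loss in losses)
    return (shift + math.log(total / n)) / t

# Toy minibatch: three well-reconstructed examples and one rare, poorly
# reconstructed example (hypothetical values).
losses = [0.1, 0.1, 0.1, 2.0]
average = term_loss(losses, 0.0)
tilted = term_loss(losses, 5.0)
near_worst_case = term_loss(losses, 1000.0)
```

With a positive tilt the value lies between the average loss and the largest per-example loss, so the poorly reconstructed example contributes more to the objective than it would under a plain mean.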
3.3 SAE Training

Even though the framework of SAEs is conceptually straightforward, training SAEs is both computationally expensive and data-intensive. The complexity arises from the overcomplete dictionary representation, large-scale data requirements, and the layer-wise training paradigm necessary for interpreting LLMs. Each of these factors contributes to the substantial computational cost associated with training SAEs at scale.

Overcomplete Dictionary Representation. A defining characteristic of SAEs is their overcomplete dictionary, where the number of learned features far exceeds the dimensionality of the LLM’s latent space. This overcompleteness is what enables SAEs to enforce sparsity, allowing them to isolate and extract meaningful feature activations from high-dimensional representations. The enforced sparsity is crucial for LLM interpretability, as it helps decompose complex neural activations into more semantically meaningful features. Empirical studies highlight the scale of overcompleteness; for example, LLaMa-Scope (He et al., 2024) trained SAEs with 32K and 128K features, which are 8× and 32× larger than the hidden size of LLaMa3.1-8B. This extreme overparameterization provides a highly expressive feature space but significantly increases the computational burden during training.

Large-Scale Data Requirements. Since the input to an SAE consists of representations from LLMs, an enormous amount of data is required to ensure that the model learns a diverse and representative set of activations. To effectively train an SAE, it is essential to activate a wide range of neurons in the LLM, which necessitates processing large-scale datasets covering diverse linguistic structures. Moreover, because SAEs are overcomplete, they require significantly more training data to converge.
Empirical results from Gemma-Scope (Lieberum et al., 2024) illustrate this requirement: SAEs with 16.4K features were trained on 4 billion tokens, while 1M-feature SAEs required 16 billion tokens to reach satisfactory performance. This highlights the immense data demands necessary for training effective SAEs. Another challenge that arises when scaling up the training data is how to efficiently shuffle massive datasets across distributed systems. Shuffling is crucial to prevent models from learning spurious, order-dependent patterns. However, as datasets grow to terabyte or petabyte scales, performing a distributed shuffle becomes a significant engineering hurdle (Anthropic, 2024).

Layer-Wise Training. Interpreting an LLM requires understanding its representations at each layer, which necessitates training separate SAEs for different layers of the model. The standard approach is to train one SAE per layer, meaning that for deep models, this process must be repeated across dozens or even hundreds of layers, compounding the computational cost. The necessity of layer-wise training is further evidenced by ongoing research efforts attempting to reduce the number of SAEs required. For example, Layer Group SAE (Ghilardi et al., 2024), which we discussed previously, clusters multiple layers into layer groups and trains a single SAE per group instead of per layer. The emergence of such strategies demonstrates the significant computational burden imposed by layer-wise SAE training and the ongoing efforts to optimize it.

4 Explainability Analysis of SAEs

This section aims to interpret the learned feature vectors from a trained SAE with natural language. Specifically, given a pre-defined vocabulary set $\mathcal{V}$, the goal of the explainability analysis is to extract a subset of words $I_m \subset \mathcal{V}$ to represent the meaning of $w_m = W_{\text{dec}}[m]$, for $m = 1, \ldots, M$. Humans can understand the meaning of $w_m$ by reading its natural language explanation $I_m$.
There are two lines of work for this purpose, namely the input-based and output-based methods. Figure 2 visualizes explanations generated by different methods to interpret a learned feature vector.

Figure 2: The figure illustrates the interpretation of a learned SAE feature using VocabProj and MaxAct. VocabProj lists words with the highest logits in the “Positive Logits” column and the lowest logits in the “Negative Logits” column. The upper histogram in Statistical Analysis shows the distribution of randomly sampled non-zero activations, with the y-axis representing the number of sampled activations and the x-axis indicating activation scores. The lower histogram depicts the logit density, where the y-axis represents the number of tokens and the x-axis corresponds to logit scores. MaxAct highlights tokens in an input text that strongly activate the learned feature. The figure references the Neuronpedia website (Lin, 2023).

4.1 Input-based Explanations

MaxAct. The most straightforward way to collect natural language explanations is by selecting a set of texts whose hidden representations maximally activate the feature vector being interpreted (Bricken et al., 2023).
Formally, given a large corpus $\mathcal{X}$ where each text span $x \in \mathcal{V}^{N}$ consists of $N$ words, the MaxAct strategy finds $K$ text spans that maximally activate the learned feature vector of interest $w_m$:

$$I_m = \underset{\mathcal{X}' \subset \mathcal{X},\ |\mathcal{X}'| = K}{\arg\max} \sum_{x \in \mathcal{X}'} f_{<l}(x) \cdot w_m^{\top}, \tag{25}$$

where $f_{<l}(x)$ denotes the hidden representation of input text $x$ at the $l$-th layer, and $l$ is the layer the SAE is trained for. This strategy is reasonable for interpreting weight vectors of SAEs because of the sparse nature of SAEs, which indicates that a learned feature vector should only be activated by a certain pattern or concept. Therefore, summarizing the text spans that maximally activate a certain weight vector gives a clue to the semantic meaning of the learned feature vector.

PruningMaxAct. While MaxAct collects text spans that maximally activate a feature vector, these spans often contain extraneous or redundant phrases that can obscure the underlying concept. Building on the Neuron-to-Graph approach (Foote et al., 2023), Gao et al. (2025) introduce a pruning operation to remove irrelevant tokens from each text span, thereby retaining only the minimal context necessary to preserve strong activation. Formally, let $p(\cdot)$ be a pruning strategy that maps text $x$ to $p(x)$, and let $p^{-1}(\cdot)$ recover the original text from its pruned version. The final pruned spans are then gathered via:

$$I_m = \underset{\mathcal{X}' \subset p(\mathcal{X}),\ |\mathcal{X}'| = K}{\arg\max} \sum_{x \in \mathcal{X}'} f_{<l}(x) \cdot w_m^{\top}, \quad \text{s.t. } \forall x \in p(\mathcal{X}),\ \frac{f_{<l}(x) \cdot w_m^{\top}}{f_{<l}(p^{-1}(x)) \cdot w_m^{\top}} \geq 0.5, \tag{26}$$

where the condition enforces that each pruned span $x$ retains at least half of the activation of its original span $p^{-1}(x)$. In practice, $p(\cdot)$ can be instantiated by removing selected tokens or replacing them with padding. According to Gao et al. (2025), this PruningMaxAct technique yields higher recall (i.e., finds more relevant examples) but lower precision compared to the original MaxAct strategy.

4.2 Output-based Explanations

VocabProj.
Output-based explanations project the learned feature vectors onto the output word embeddings to compute the activations. Mathematically, $f_{\text{out}}(w): \mathcal{V} \rightarrow \mathbb{R}^{d}$ denotes the output word embedding layer that returns the output embedding of word $w$, and the natural language explanations can be collected by:

$$I_m = \underset{\mathcal{V}' \subset \mathcal{V},\ |\mathcal{V}'| = K}{\arg\max} \sum_{w \in \mathcal{V}'} f_{\text{out}}(w) \cdot w_m^{\top}. \tag{27}$$

This mapping makes sense for decoder-only LLMs because the layers in such models share the same residual stream, enabling the representations in the intermediate layers to be linearly correlated with the output word embeddings (Nostalgebraist, 2020). Recent works (Wu et al., 2025b; Gur-Arieh et al., 2025) find that output-based explanations show stronger promise in interpreting LLM behaviors (i.e., generated texts) compared to input-based ones.

MutInfo. VocabProj assumes that the output embeddings that maximally activate a feature vector of interest best describe the meaning of the learned feature. However, this assumption may fail for frequent words, whose embeddings often have a large $\ell_2$ norm (Gao et al., 2019). To address this, Wu et al. (2025b) propose extracting a vocabulary subset that maximizes mutual information with the learned feature. Formally, let $C$ denote the knowledge encoded by $w_m$; the explanations are extracted by

$$I_m = \underset{\mathcal{V}' \subset \mathcal{V},\ |\mathcal{V}'| = K}{\arg\max}\ \mathrm{MI}(\mathcal{V}'; C) \;\propto\; \underset{\mathcal{V}' \subset \mathcal{V},\ |\mathcal{V}'| = K}{\arg\min}\ H(C \mid \mathcal{V}') \;\propto\; \underset{\mathcal{V}' \subset \mathcal{V},\ |\mathcal{V}'| = K}{\arg\max} \sum_{w \in \mathcal{V}'} p(w \mid w_m) \log p(w_m \mid w), \tag{28}$$

where $\mathrm{MI}(\cdot\,;\cdot)$ indicates the mutual information between two variables (Cover, 1999), and $U(C)$ includes all possible vectors that express the knowledge $C$. Practically, the conditional probabilities can be estimated by:

$$p(w \mid w_m) = \frac{\exp\left(f_{\text{out}}(w) \cdot w_m^{\top}\right)}{\sum_{w' \in \mathcal{V}} \exp\left(f_{\text{out}}(w') \cdot w_m^{\top}\right)}, \qquad p(w_m \mid w) = \frac{\exp\left(f_{\text{out}}(w) \cdot w_m^{\top}\right)}{\sum_{m'=1}^{M} \exp\left(f_{\text{out}}(w) \cdot w_{m'}^{\top}\right)}. \tag{29}$$

Compared with VocabProj, which only considers $p(w \mid w_m)$, this mutual-information-driven objective highlights the need to normalize the raw activation with $p(w_m \mid w)$.
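To make the difference between the raw projection of Eq. (27) and the normalization in Eq. (29) concrete, here is a minimal sketch on a toy vocabulary. The two-dimensional embeddings, the two feature vectors, and all function names are hypothetical values chosen for illustration; only the two softmax formulas come from the text, with the second one taken over all learned feature vectors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def vocab_proj(emb, features, m, k):
    """Top-k words by raw projection onto feature m (Eq. 27)."""
    scores = {w: dot(e, features[m]) for w, e in emb.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def p_word_given_feature(emb, features, m):
    """p(w | w_m): softmax over the vocabulary (first formula in Eq. 29)."""
    words = list(emb)
    probs = softmax([dot(emb[w], features[m]) for w in words])
    return dict(zip(words, probs))

def p_feature_given_word(emb, features, m, word):
    """p(w_m | w): softmax over all feature vectors (second formula in Eq. 29)."""
    return softmax([dot(emb[word], f) for f in features])[m]

# Toy setup (hypothetical): "the" has a large-norm embedding that activates
# both features, while "food" is specific to feature 0.
emb = {"the": [3.0, 3.0], "food": [2.0, 0.0], "car": [0.0, 2.0]}
features = [[1.0, 0.0], [0.0, 1.0]]

top_raw = vocab_proj(emb, features, m=0, k=1)
p_the = p_feature_given_word(emb, features, 0, "the")
p_food = p_feature_given_word(emb, features, 0, "food")
```

In this toy case the raw projection ranks the large-norm word "the" first for feature 0, even though it activates both features equally, whereas "food" is far more specific to feature 0 under the second conditional probability.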
That is, if a word’s output embedding consistently activates many learned feature vectors, it is not specific enough to interpret any of them.

5 Evaluation Metrics and Methods

Evaluating SAEs is inherently challenging due to the absence of ground truth labels. Unlike traditional machine learning tasks where performance can be directly measured against labeled data, the quality of an SAE must be inferred through a diverse set of metrics. These metrics assess both the internal structure of the model and its functional utility. To provide a comprehensive evaluation framework, we categorize SAE evaluation methods into two main groups: structural metrics and functional metrics. This categorization ensures a holistic assessment of SAEs, covering both their training behavior and real-world applicability.

5.1 Structural Metrics

Structural metrics focus on assessing whether an SAE behaves as intended during training. SAEs are designed to optimize both reconstruction fidelity and sparsity, as these properties are explicitly enforced in the training loss. Therefore, natural evaluation metrics assess reconstruction accuracy and sparsity in the model’s learned representations.

Reconstruction Fidelity. The most fundamental way to evaluate reconstruction fidelity is through Mean Squared Error (MSE) and Cosine Similarity (Ng et al., 2011), which directly compare the original activations with SAE-reconstructed activations. Additional metrics such as Fraction of Variance Unexplained (FVU) (also known as normalized loss) (Gao et al., 2025) and Explained Variance (Karvonen et al., 2024) measure how much of the variance in the original activations is lost or retained after SAE reconstruction. Beyond direct reconstruction comparisons, researchers also evaluate how SAEs affect the probability distribution of model outputs.
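The direct reconstruction metrics named above can be sketched in a few lines. This is a minimal illustration on toy activations, assuming FVU is the squared reconstruction error divided by the squared deviation of the original activations from their batch mean, and taking explained variance as one minus FVU; individual benchmarks may define these slightly differently, and all names and numbers here are hypothetical.

```python
import math

def mse(X, X_hat):
    """Mean squared error over every element of a batch of activations."""
    sq = [(a - b) ** 2 for x, xh in zip(X, X_hat) for a, b in zip(x, xh)]
    return sum(sq) / len(sq)

def cosine(u, v):
    """Cosine similarity between one original and one reconstructed vector."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def fvu(X, X_hat):
    """Fraction of variance unexplained: error relative to a mean-only baseline."""
    dims = len(X[0])
    mean = [sum(x[d] for x in X) / len(X) for d in range(dims)]
    sse = sum((a - b) ** 2 for x, xh in zip(X, X_hat) for a, b in zip(x, xh))
    total = sum((a - mu) ** 2 for x in X for a, mu in zip(x, mean))
    return sse / total

# Toy batch of two 2-dimensional activations and their reconstructions.
X = [[1.0, 2.0], [3.0, 4.0]]
X_hat = [[1.0, 2.0], [3.0, 3.0]]

batch_mse = mse(X, X_hat)
batch_fvu = fvu(X, X_hat)
explained_variance = 1.0 - batch_fvu
row_cosines = [cosine(x, xh) for x, xh in zip(X, X_hat)]
```

Normalizing by the mean-only baseline is what makes FVU comparable across layers whose activations have very different scales, which raw MSE is not.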
Cross-Entropy Loss (Shannon, 1948) and KL Divergence (Kullback and Leibler, 1951) measure the shift in probability distributions when substituting original model activations with SAE-generated activations. If the SAE faithfully reconstructs activations, the probability distributions should remain similar. Similarly, Delta LM Loss (Lieberum et al., 2024) quantifies the difference between the original language model loss and the loss incurred when replacing activations with those from the SAE. Another important aspect of reconstruction fidelity is magnitude preservation. The L2 Ratio (Karvonen et al., 2024) compares the Euclidean norms of the original and reconstructed activations to ensure that the SAE does not systematically alter activation magnitudes.

Sparsity. A key design objective of SAEs is sparsity, which ensures that only a small subset of latent neurons activate for any given input. The most direct metric for sparsity is L0 Sparsity (Louizos et al., 2017), which measures the average number of nonzero activations per input. However, sparsity is not just about minimizing activations; it also requires ensuring that the active features are meaningful. To assess feature usage patterns, Latent Firing Frequency (He et al., 2024) and Feature Density Statistics (Karvonen et al., 2024) track how often each SAE latent is activated across different inputs, ensuring that features are neither too frequent nor inactive. Additionally, the Sparsity-fidelity Trade-off (Gao et al., 2025) evaluates whether adjusting sparsity affects reconstruction quality, helping to determine the optimal balance between sparsity and fidelity.

5.2 Functional Metrics

While structural metrics ensure that an SAE follows its design principles, functional metrics assess whether the SAE is useful for real-world analysis.
These include interpretability, which assesses whether the SAE’s learned features correspond to meaningful and distinct concepts, and robustness, which evaluates whether the learned representations are stable and generalizable.

Interpretability. One of the primary motivations for SAEs is to enhance interpretability by disentangling LLM activations into meaningful features. A crucial property for interpretability is monosemanticity, where each feature should encode a single concept. RAVEL and Automated Interpretability (Karvonen et al., 2024) automatically evaluate monosemanticity by using a language model to generate and assess feature descriptions. These methods analyze the most activating contexts for each feature and assign interpretability scores. Sparse Probing and Targeted Probe Perturbation (TPP) (Karvonen et al., 2024) evaluate whether SAE features align with specific downstream tasks. In sparse probing, a linear probe is trained using only a small subset of SAE activations, while TPP measures how much perturbing individual latents impacts probe accuracy. If a small number of active features enable strong performance, the SAE has learned disentangled and meaningful representations. Beyond evaluating feature alignment, it is also crucial to assess the faithfulness of feature descriptions. Input-Based Evaluation and Output-Based Evaluation (Gur-Arieh et al., 2025) provide a framework for verifying whether feature descriptions accurately reflect what a feature represents. Input-Based Evaluation tests whether a given feature description correctly identifies which inputs activate the feature by generating activating and neutral examples and measuring activation differences. Output-Based Evaluation assesses whether a feature description captures the causal influence of the feature on model outputs by modifying feature activations and comparing the resulting generated texts.
Feature Absorption (Karvonen et al., 2024) assesses whether a feature is capturing multiple independent concepts instead of a single interpretable concept. If adding more features does not significantly improve representation quality, it suggests that the extracted features are already sufficient. Another approach to detecting whether each neuron is monosemantic is checking for redundancy or overlap with other neurons. Feature Geometry Analysis (He et al., 2024; Bricken et al., 2023; Templeton et al., 2024) detects redundancy among SAE latents by measuring cosine similarity between decoder columns. If two features have high cosine similarity, they may represent redundant concepts rather than independent units.

Robustness. In addition to being interpretable, a well-designed SAE should be robust in various contexts. Robustness ensures that SAEs do not overfit to a specific dataset or condition but instead generalize effectively. Generalizability (He et al., 2024) assesses whether SAEs remain effective when applied to out-of-distribution data. Two common tests for generalizability include evaluating whether SAEs trained on shorter text sequences still perform well on longer sequences and checking whether SAEs trained on base LLM activations generalize to instruction-finetuned models. Unlearning (Karvonen et al., 2024) measures whether an SAE can selectively forget specific features while preserving useful information. This is crucial for applications that require privacy-focused models, where sensitive information needs to be erased. Spurious Correlation Removal (SCR) (Karvonen et al., 2024) tests whether an SAE can eliminate biased correlations in downstream models. If removing certain latents reduces unwanted correlations without harming performance, the SAE has learned to capture and remove spurious patterns.
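The Feature Geometry Analysis mentioned in Section 5.2 can be illustrated with a small sketch that flags pairs of decoder directions whose cosine similarity exceeds a threshold. The threshold of 0.9, the function names, and the toy decoder vectors are assumptions made for illustration and are not values reported in the cited works; each inner list stands for one decoder column.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def redundant_pairs(decoder_features, threshold=0.9):
    """Return (i, j, similarity) for every pair of feature directions
    whose cosine similarity is at or above the threshold."""
    pairs = []
    n = len(decoder_features)
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(decoder_features[i], decoder_features[j])
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs

# Toy decoder with three feature directions; features 0 and 1 nearly coincide.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
flagged = redundant_pairs(features)
```

The pairwise loop is quadratic in the number of latents, so for dictionaries with tens of thousands of features a batched matrix product over normalized decoder columns would be the practical choice.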
[Figure 3 panels: (a) Steering Vector Extraction; (b) Steering LLM Behavior; (c) Steered Output Example. Example input: “Write a brief angry review in 10 words for the smartphone 'Apple 52 Pro Max'.” Original output: “Overpriced junk! Laggy, terrible battery, useless updates, worst purchase ever!” Steered outputs: Happiness feature: “Amazing phone! Fast, stunning display, great battery, worth every penny!”; Confusion feature: “Great camera, but lags? Expensive, yet feels cheap? I’m lost.”; Fact feature: “Apple 52 Pro Max doesn’t exist yet, so no review possible.”]

Figure 3: The figure illustrates the process of using an SAE to steer the behavior of an LLM, with an example of the resulting steered output. In part (a), an SAE is typically used to extract a steering vector by comparing two representations: $z$, which lacks a certain feature, and $z'$, which contains that feature. In part (b), this steering vector is added to the input representation, modifying the LLM’s behavior to align with the desired feature. Part (c) shows example results of this process, where the steered output reflects the steered feature, even when the original input prompt is neutral or contradictory to the feature being introduced.

6 Applications in Large Language Models

The latents learned by SAEs represent a collection of low-level concepts. Each SAE latent can be interpreted by gathering its activating examples (Lin, 2023). This approach enables latents to be interpreted in a human-understandable manner, thereby enhancing our comprehension of how models perform tasks and facilitating more effective control of model behaviors.

6.1 Model Behavior Analysis

SAEs construct a dictionary of concepts through their latents, providing a more fine-grained perspective for concept interpretation. This capability enables the analysis of the model’s internal representations and learned knowledge in greater detail.
A recent study utilizes SAEs to reveal the mechanism of hallucination, where entity recognition plays a pivotal role in recalling facts. The study identifies a direction that distinguishes whether the model knows an entity, which is usually used for hallucination refusal in chat models (Ferrando et al., 2025). Some studies focus on interpreting how in-context learning (ICL) is performed within models. One study focuses on general ICL tasks, in which task-related function vectors have been successfully isolated (Kharlapenko et al., 2024). Another study focuses on reinforcement learning (RL) tasks. Its experiments show that an LLM’s internal representations are capable of capturing temporal difference errors and Q-values that are essential in RL computations (Demircan et al., 2025). In addition, one study examines the working mechanism of instruction following. Its analysis of translation tasks shows that instructions are composed of multiple relevant concepts rather than individual ones (He et al., 2025).

Moreover, SAEs have been employed to study behaviors related to toxicity, sycophancy, refusal, and emotions. One recent study shows that features captured by SAEs can be used to construct probes to classify cross-lingual toxicity (Gallifant et al., 2025). By examining SAE latents that activate on anger-related tokens, researchers have identified steering vectors that control angry outputs (Nanda et al., 2024). Additionally, other research demonstrates that SAEs can reconstruct vectors responsible for refusing to answer harmful questions as well as directions that produce sycophantic responses (neverix et al., 2024).

6.2 Model Steering

Unlike supervised concept vectors, such as probing classifiers (Belinkov, 2022; Zhao et al., 2025; Jin et al., 2025), SAEs can simultaneously learn a large volume of concept vectors. The learned vectors can then be utilized to steer model behaviors in ways similar to supervised vectors (see Figure 3).
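The steering operation described in the Figure 3 caption, where a steering vector is added to a hidden representation, can be sketched as follows. This is a minimal illustration assuming the steering vector is a single learned feature direction scaled by a strength `alpha`; the function names, the strength value, and the toy vectors are hypothetical, and real steering methods may normalize or clamp activations differently.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steer(hidden, direction, alpha):
    """Add alpha times a feature direction to a hidden representation."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy hidden state and a unit-norm feature direction (hypothetical values).
hidden = [0.2, -0.1, 0.4]
w_m = [0.0, 1.0, 0.0]

steered = steer(hidden, w_m, alpha=3.0)

# Projection onto the feature direction before and after steering.
before = dot(hidden, w_m)
after = dot(steered, w_m)
```

Because the direction has unit norm in this toy case, the projection onto it grows by exactly `alpha`, while the components orthogonal to the direction are left unchanged.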
As previously mentioned, SAE latents can be employed to produce steering vectors that control model outputs related to toxicity, sycophancy, refusal, and emotions. Other steering applications have also proven feasible. Recently, a study showed that SAE latents are able to capture instructions such as translations. These identified latents are effective in manipulating models to translate inputs according to specific instructions (He et al., 2025). Another investigation focusing on semantic search demonstrates that SAEs can learn fine-grained semantic concepts at various levels of abstraction. These concept vectors can be used to steer models toward related semantics (O’Neill et al., 2024). Alternatively, SAEs trained on biological datasets can provide biology-related features that enable the unlearning of biology-relevant knowledge with fewer side effects than existing techniques (Farrell et al., 2024). Given that steering effects are generally challenging to control, SAE-TS further utilizes SAE latents to optimize steering vectors by measuring the changes in SAE feature activation caused by steering, thereby helping construct vectors that target relevant features while minimizing unwanted effects (Chalnev et al., 2024). Moreover, explanations based on SAE latents risk prioritizing linguistic features over semantic meaning. Wu et al. (2025b) propose a novel approach that promotes diverse semantic explanations, which has been demonstrated to enhance model safety.

6.3 Model Training

SAEs are trained to obtain sparser and more interpretable features. The learned concepts and sparsity both benefit model transparency, and can be utilized in model training to align models with human understanding and improve model performance. Since SAEs include feature-level constraints, Yin et al. (2024) leverage these constraints to enable sparsity-enforced alignment in post-training.
Their experiments demonstrate that this approach achieves superior performance across benchmark datasets with reduced computational costs. Similarly, Tack et al. (2025) combine learned concepts with next-token prediction training to build more transparent models. Specifically, they extract from SAEs the concepts that are influential on outputs, then incorporate these concept vectors into hidden states by modifying token embeddings. Results show that models trained with this method perform better and exhibit greater robustness on token prediction and knowledge distillation across benchmark datasets (Tack et al., 2025). Moreover, SAEs’ ability to provide large-scale explanations has been well explored. By examining the diversity of activated features, Yang et al. (2025) developed a new approach to augment data diversity. Another work uses task-specific features learned in SAEs to mitigate unintended features within models, significantly improving model generalization on real-world tasks (Wu et al., 2025a).

7 Research Challenges

In this section, we outline several critical research challenges with SAEs. Although SAEs have emerged as promising tools for providing large-scale, fine-grained interpretable explanations, these challenges could threaten the faithfulness, effectiveness, and efficiency of their applications.

7.1 Incomplete Concept Dictionary

SAEs are trained on large corpora of data encompassing various concepts. However, achieving comprehensive concept coverage remains challenging (Muhamed et al., 2024). Additionally, the learning process of SAEs functions as a black box where learned concepts cannot be predetermined. Consequently, controlling the completeness of input and output concepts is nearly impossible. Furthermore, explanations provided by SAEs may be incomplete or misleading due to the conceptual gaps.
This limitation can result in unreliable interpretations when applying SAEs to complex reasoning tasks that require comprehensive knowledge representation.

7.2 Lack of Theoretical Foundations

The development of SAEs is based on assumptions of superposition and linear concept representation. Empirically, we have found it effective to construct high-level features through linear combinations of low-level features. However, our understanding of how these concepts are represented in hidden spaces and their spatial relationships remains limited. This limitation explains why we must derive combination parameters empirically rather than mathematically. The validity and advancement of SAEs may remain unclear until we can properly demonstrate the correctness of these fundamental assumptions about concept representation and superposition in neural networks.

7.3 Reconstruction Errors

SAEs are trained by minimizing the reconstruction errors between original and reconstructed activations. However, these errors persist and remain poorly understood. Recent research by Gao et al. (2025) demonstrates that reconstruction errors can produce significant performance degradation comparable to using a model with only 10% of the pretraining compute. This finding raises substantial concerns about SAE accuracy as interpretability tools. Furthermore, the impact of these reconstruction errors on model generations has not been adequately measured. The field lacks output-centric metrics that could precisely quantify how reconstructed activations affect a model’s final outputs. To advance our understanding of SAEs and their reliability as interpretability tools, developing metrics that directly measure the effect of reconstruction errors on generated content is essential.
7.4 Computational Burden

SAEs operate at the layer level for each model, mapping original activations to a much higher-dimensional representation space before reconstructing them back to the original space. This architecture necessitates that SAE parameters for a specific layer significantly outnumber the parameters of that original layer itself. Consequently, the overall training computation exceeds that of the original model training, which is particularly problematic for LLMs with billions of parameters. The extensive computational resources required create a substantial barrier for researchers interested in investigating these methods. Furthermore, SAEs exhibit limited transferability across models: they must be trained specifically for each model and each layer, exacerbating the computational burden. This layer-specific and model-specific training requirement multiplies the already significant resource demands, further restricting accessibility for the broader research community.

7.5 Connection to the Broader Field of Interpretability

The field of MI has been critiqued for its insufficient engagement with the broader interpretability and NLP research literature (Bereska and Gavves, 2024; Saphra and Wiegreffe, 2024). Many of the research topics within MI, such as polysemanticity, superposition, and SAEs, have been investigated in prior and concurrent non-MI fields, often under different terminologies while addressing the same fundamental challenges (Saphra and Wiegreffe, 2024; Elhage et al., 2022). For instance, the study of polysemanticity and superposition, which aims to understand how features are encoded in the model activations, has been pursued in the context of distributed representations (Hinton, 1984; Mikolov et al., 2013b,a; Arora et al., 2018; Olah, 2023), disentangled representations (Higgins et al., 2018; Kim and Mnih, 2018; Locatello et al., 2019), and concept-based interpretability (Nicolson et al., 2024; Kim et al., 2018).
Similarly, SAEs are closely related to and draw inspiration from earlier lines of research on sparse coding and dictionary learning (Olshausen and Field, 1997; Gregor and LeCun, 2010; Faruqui et al., 2015; Subramanian et al., 2018). These methods, like SAEs, posit the feature sparsity hypothesis (Elhage et al., 2022) and aim to learn an overcomplete representation that disentangles features from activations in superposition. Since these fields pursue similar goals or study the same research problems, the current disconnect causes issues such as missing relevant literature, hindering collaboration, unintentionally redefining established concepts, rediscovering existing techniques, and overlooking well-known baselines. Therefore, it is imperative for the MI community to bridge these gaps and more actively integrate findings from related non-MI research.

8 Conclusions

In this survey, we provided a comprehensive examination of SAEs as a promising approach to interpreting and understanding LLMs. SAEs effectively address the challenge of polysemanticity by learning overcomplete, sparse representations that disentangle superimposed features into more interpretable units. We have systematically explored the foundational principles, technical frameworks, evaluation methodologies, and real-world applications of SAEs in the context of LLM analysis. While SAEs have demonstrated considerable success in revealing the internal mechanisms of LLMs, several challenges remain, including the incompleteness of concept dictionaries, limited theoretical foundations, persistent reconstruction errors, and substantial computational requirements. Despite these challenges, SAEs continue to evolve through architectural innovations and improved training strategies, offering deeper insights into the inner workings of increasingly complex LLMs.

References

Anthropic. 2024. [link].
Anthropic. 2024. Introducing Claude 3.5 Sonnet.
Announcement of Claude 3.5 Sonnet model release, featuring improved intelligence, vision capabilities, and new Artifacts feature.
Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2018. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495.
Kola Ayonrinde. 2024. Adaptive sparse allocation with mutual choice & feature choice sparse autoencoders. arXiv preprint arXiv:2411.02124.
Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219.
Leonard Bereska and Efstratios Gavves. 2024. Mechanistic interpretability for ai safety–a review. arXiv preprint arXiv:2404.14082.
Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR.
Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. 2023. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html.
Joseph Bloom. 2024. Open source sparse autoencoders for all residual stream layers of gpt2 small.
Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, and Lee Sharkey. 2025. Identifying functionally important features with end-to-end sparse dictionary learning. Advances in Neural Information Processing Systems, 37:107286–107325.
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. 2023. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
Bart Bussmann, Patrick Leask, and Neel Nanda. 2024. Batchtopk sparse autoencoders. arXiv preprint arXiv:2412.06410.
Sviatoslav Chalnev, Matthew Siu, and Arthur Conmy. 2024. Improving steering vectors by targeting sparse autoencoder features. arXiv preprint arXiv:2411.02193.
Thomas M Cover. 1999. Elements of information theory. John Wiley & Sons.
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600.
Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In International conference on machine learning, pages 933–941. PMLR.
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L.
Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. Preprint, arXiv:2501.12948.
Can Demircan, Tankred Saanum, Akshay Kumar Jagadish, Marcel Binz, and Eric Schulz. 2025.
Sparse autoencoders reveal temporal difference learning in large language models. In The Thirteenth International Conference on Learning Representations.
Jonathan Donnelly and Adam Roegiest. 2019. On interpretability and feature representations: an analysis of the sentiment neuron. In Advances in Information Retrieval: 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14–18, 2019, Proceedings, Part I 41, pages 795–802. Springer.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. 2022. Toy models of superposition. Transformer Circuits Thread.
Eoin Farrell, Yeu-Tong Lau, and Arthur Conmy. 2024. Applying sparse autoencoders to unlearn knowledge in language models. In Neurips Safe Generative AI Workshop 2024.
Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah Smith. 2015. Sparse overcomplete word vector representations. arXiv preprint arXiv:1506.02004.
Javier Ferrando, Oscar Balcells Obeso, Senthooran Rajamanoharan, and Neel Nanda. 2025. Do i know this entity? knowledge awareness and hallucinations in language models. In The Thirteenth International Conference on Learning Representations.
Javier Ferrando, Gabriele Sarti, Arianna Bisazza, and Marta R Costa-jussà. 2024. A primer on the inner workings of transformer-based language models. arXiv preprint arXiv:2405.00208.
Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, and Fazl Barez. 2023. Neuron to graph: Interpreting language model neurons at scale. arXiv preprint arXiv:2305.19911.
Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, and Danielle S Bitterman. 2025. Sparse autoencoder features for classifications and transferability. arXiv preprint arXiv:2502.11367.
Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2019. Representation degeneration problem in training natural language generation models. arXiv preprint arXiv:1907.12009.
Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. 2025. Scaling and evaluating sparse autoencoders. In The Thirteenth International Conference on Learning Representations.
Davide Ghilardi, Federico Belotti, and Marco Molinari. 2024. Efficient training of sparse autoencoders for large language models via layer groups. arXiv preprint arXiv:2410.21508.
Karol Gregor and Yann LeCun. 2010. Learning fast approximations of sparse coding. In Proceedings of the 27th international conference on machine learning, pages 399–406.
Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus Geiger, and Mor Geva. 2025. Enhancing automated interpretability with output-centric feature descriptions. arXiv preprint arXiv:2501.08319.
Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, et al. 2024. Llama scope: Extracting millions of features from llama-3.1-8b with sparse autoencoders. arXiv preprint arXiv:2410.20526.
Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, and Mengnan Du. 2025. Saif: A sparse autoencoder framework for interpreting and steering instruction following of language models. arXiv preprint arXiv:2502.11356.
Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. 2018. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230.
Geoffrey E Hinton. 1984. Distributed representations.
Mingyu Jin, Qinkai Yu, Jingyuan Huang, Qingcheng Zeng, Zhenting Wang, Wenyue Hua, Haiyan Zhao, Kai Mei, Yanda Meng, Kaize Ding, et al. 2025. Exploring concept depth: How large language models acquire knowledge and concept at different layers? In Proceedings of the 31st International Conference on Computational Linguistics, pages 558–573.
A Karvonen, C Rager, J Lin, C Tigges, J Bloom, D Chanin, YT Lau, E Farrell, A Conmy, C McDougall, et al. 2024. Saebench: A comprehensive benchmark for sparse autoencoders, december 2024. URL https://www.neuronpedia.org/sae-bench/info.
Dmitrii Kharlapenko, neverix, Neel Nanda, and Arthur Conmy. 2024. Extracting SAE task features for in-context learning. [Accessed 26-02-2025].
Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. 2018. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, pages 2668–2677. PMLR.
Hyunjik Kim and Andriy Mnih. 2018. Disentangling by factorising. In International conference on machine learning, pages 2649–2658. PMLR.
Solomon Kullback and Richard A Leibler. 1951. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86.
Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. 2024. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. arXiv preprint arXiv:2408.05147.
Johnny Lin. 2023. Neuronpedia: Interactive reference and tooling for analyzing neural networks. Software available from neuronpedia.org.
Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. 2019. Challenging common assumptions in the unsupervised learning of disentangled representations. In international conference on machine learning, pages 4114–4124. PMLR.
Christos Louizos, Max Welling, and Diederik P Kingma. 2017. Learning sparse neural networks through l_0 regularization. arXiv preprint arXiv:1712.01312.
Luke Marks, Alasdair Paren, David Krueger, and Fazl Barez. 2024. Enhancing neural network interpretability with feature-aligned sparse autoencoders. arXiv preprint arXiv:2411.01220.
Abhinav Menon, Manish Shrivastava, David Krueger, and Ekdeep Singh Lubana. 2024. Analyzing (in)abilities of saes via formal languages. arXiv preprint arXiv:2410.11767.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013a. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26.
Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, pages 746–751.
Anish Mudide, Joshua Engels, Eric J Michaud, Max Tegmark, and Christian Schroeder de Witt. 2024. Efficient dictionary learning with switch sparse autoencoders. arXiv preprint arXiv:2410.08201.
Aashiq Muhamed, Mona Diab, and Virginia Smith. 2024. Decoding dark matter: Specialized sparse autoencoders for interpreting rare concepts in foundation models. arXiv preprint arXiv:2411.00743.
Neel Nanda, Arthur Conmy, Lewis Smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, and Vikrant Varma. 2024. [Full Post] Progress Update #1 from the GDM Mech Interp Team. [Accessed 26-02-2025].
neverix, Dmitrii Kharlapenko, Arthur Conmy, and Neel Nanda. 2024. SAE features for refusal and sycophancy steering vectors. [Accessed 26-02-2025].
Andrew Ng et al. 2011. Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19.
Anh Nguyen, Jason Yosinski, and Jeff Clune. 2019. Understanding neural networks via feature visualization: A survey.
Explainable AI: interpreting, explaining and visualizing deep learning, pages 55–76.
Angus Nicolson, Lisa Schut, J Alison Noble, and Yarin Gal. 2024. Explaining explainability: Understanding concept activation vectors. arXiv preprint arXiv:2404.03713.
Nostalgebraist. 2020. Interpreting gpt: the logit lens.
Chris Olah. 2023. Distributed representations: Composition & superposition. Transformer Circuits Thread, 27.
Bruno A Olshausen and David J Field. 1997. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311–3325.
Charles O’Neill, Christine Ye, Kartheik Iyer, and John F Wu. 2024. Disentangling dense embeddings with sparse autoencoders. arXiv preprint arXiv:2408.00657.
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy,
Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H.
Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. 2024. Gpt-4 technical report. Preprint, arXiv:2303.08774.
Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, and Ziyu Yao. 2024. A practical review of mechanistic interpretability for transformer-based language models. arXiv preprint arXiv:2407.02646.
Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda.
2024a. Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014.
Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. 2024b. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders. arXiv preprint arXiv:2407.14435.
Marks Samuel, Karvonen Adam, and Mueller Aaron. 2024. dictionary learning. https://github.com/saprmarks/dictionary_learning.
Naomi Saphra and Sarah Wiegreffe. 2024. Mechanistic? arXiv preprint arXiv:2410.09087.
Claude E Shannon. 1948. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423.
Noam Shazeer. 2020. Glu variants improve transformer. arXiv preprint arXiv:2002.05202.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
Anant Subramanian, Danish Pruthi, Harsh Jhamtani, Taylor Berg-Kirkpatrick, and Eduard Hovy. 2018. Spine: Sparse interpretable neural embeddings. In Proceedings of the AAAI conference on artificial intelligence, volume 32.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
Jihoon Tack, Jack Lanchantin, Jane Yu, Andrew Cohen, Ilia Kulikov, Janice Lan, Shibo Hao, Yuandong Tian, Jason Weston, and Xian Li. 2025. Llm pretraining with continuous concepts. arXiv preprint arXiv:2502.08524.
Glen Taggart. 2024. Prolu: A nonlinearity for sparse autoencoders - ai alignment forum.
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.
Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. 2024. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Transformer Circuits Thread.
Benjamin Wright and Lee Sharkey. 2024. Addressing feature suppression in saes - ai alignment forum.
Xuansheng Wu, Wenhao Yu, Xiaoming Zhai, and Ninghao Liu. 2025a. Self-regularization with latent space explanations for controllable llm-based classification. arXiv preprint arXiv:2502.14133.
Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, and Ninghao Liu. 2025b. Interpreting and steering llms with mutual information-based explanations on sparse autoencoders. arXiv preprint arXiv:2502.15576.
xAI. 2025. Grok 3 beta - the age of reasoning agents. Blog post announcing Grok 3 Beta, describing improvements in reasoning capabilities and performance benchmarks.
Xianjun Yang, Shaoliang Nie, Lijuan Liu, Suchin Gururangan, Ujjwal Karn, Rui Hou, Madian Khabsa, and Yuning Mao. 2025. Diversity-driven data selection for language model tuning through sparse autoencoder. arXiv preprint arXiv:2502.14050.
Qingyu Yin, Chak Tou Leong, Hongbo Zhang, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, et al. 2024. Direct preference optimization using sparse feature-level constraints. arXiv preprint arXiv:2411.07618.
Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, and Mengnan Du. 2024a. Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology, 15(2):1–38.
Haiyan Zhao, Fan Yang, Bo Shen, Himabindu Lakkaraju, and Mengnan Du. 2024b.
Towards uncovering how large language model works: An explainability perspective. arXiv preprint arXiv:2402.10688.
Haiyan Zhao, Heng Zhao, Bo Shen, Ali Payani, Fan Yang, and Mengnan Du. 2025. Beyond single concept vector: Modeling concept subspace in llms with gaussian distribution. The Thirteenth International Conference on Learning Representations.

---