Large language models (LLMs) like GPT-4 and Claude 3 demonstrate exceptional performance, yet they remain enigmatic black boxes. Naturally, one might ask: how do these models produce such precise responses, or even come up with creative ideas? The interpretability of these models is a critical challenge, and one promising approach to uncovering their inner workings involves Sparse Autoencoders (SAEs). In this article, we explore how SAEs decompose LLMs' activations into interpretable components, providing valuable insights into how these complex AI systems work.
To understand how SAEs contribute to the interpretability of LLMs, we must first understand the concept of an autoencoder. An autoencoder is a neural network that learns to compress and decompress its input. Imagine a neural network that takes a 100-dimensional vector as input, reduces it to 50 dimensions, and then reconstructs it back to 100 dimensions. The goal of the model is to minimize the difference between the input vector and the reconstructed output.
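As a concrete illustration, here is a minimal autoencoder sketch in PyTorch using the same 100→50→100 dimensions as above. The dimensions and code are purely illustrative and not tied to any particular model:

```python
import torch
import torch.nn as nn

# Minimal autoencoder sketch (illustrative dimensions): compress a 100-dimensional
# input to 50 dimensions, then reconstruct it and minimize the reconstruction error.
class Autoencoder(nn.Module):
    def __init__(self, input_dim=100, hidden_dim=50):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        z = torch.relu(self.encoder(x))      # compressed intermediate representation
        return self.decoder(z)               # reconstruction of the input

model = Autoencoder()
x = torch.randn(8, 100)                      # a batch of 8 input vectors
loss = nn.functional.mse_loss(model(x), x)   # difference between input and reconstruction
```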
Sparse Autoencoders distinguish themselves by adding a sparsity penalty to their loss function, encouraging the model to activate only a small fraction of the neurons in its intermediate layer. This penalty pushes the model to produce intermediate vectors (activations) with as many zero values as possible. As we'll see, these sparse representations are what make the intermediate activations of large models like LLMs more interpretable.
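Concretely, the only change relative to the plain autoencoder above is the extra sparsity term in the loss. The sketch below assumes an L1 penalty on the hidden activations and a hidden layer wider than the input, as is common in interpretability work; the specific sizes and penalty weight are illustrative choices, not values from any particular paper:

```python
import torch
import torch.nn as nn

# Sparse autoencoder sketch: same structure as before, but the loss adds an L1 penalty
# on the hidden activations to push most of them toward zero. The wider hidden layer
# and the penalty weight are illustrative choices.
class SparseAutoencoder(nn.Module):
    def __init__(self, input_dim=100, hidden_dim=400):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        z = torch.relu(self.encoder(x))       # sparse feature activations
        return self.decoder(z), z

sae = SparseAutoencoder()
x = torch.randn(8, 100)
x_hat, z = sae(x)
l1_coeff = 1e-3                               # weight of the sparsity penalty (illustrative)
loss = nn.functional.mse_loss(x_hat, x) + l1_coeff * z.abs().mean()
```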
The idea of using SAEs for the interpretability of LLMs revolves around decomposing their intermediate activations to make them more comprehensible to humans. The activations of a language model are often opaque: a single neuron can encode multiple concepts simultaneously, a phenomenon known as superposition. This complexity makes it challenging to discern what each neuron represents.
SAEs can be trained on the intermediate activations of LLMs at various points in their architecture, often between two layers. Each component of the sparse representation the SAE learns from these activations is called a feature. A feature aims to represent an identifiable concept interpretable by humans (e.g., semantic concepts like "human flaws" or "price increases," grammatical categories, or specific themes).
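To illustrate where these features come from, here is a hypothetical sketch that collects the hidden activations of one GPT-2 layer and passes them through an SAE encoder. The model choice, layer index, feature count, and the untrained stand-in encoder are all illustrative assumptions; a real setup would use an SAE trained on activations from a large corpus:

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

# Collect the hidden activations of one GPT-2 layer, then map them to sparse
# feature activations with a stand-in encoder (a real setup would use a trained SAE).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

inputs = tokenizer("Prices rose sharply this quarter.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states   # one tensor per layer

layer_acts = hidden_states[6]                       # shape (1, seq_len, 768); layer chosen arbitrarily

encoder = nn.Linear(768, 16384)                     # stand-in SAE encoder: 768 dims -> 16,384 features
features = torch.relu(encoder(layer_acts))          # sparse feature activations per token
top_features = features[0, -1].topk(5).indices      # most active features on the last token
```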
In summary, the goal of using SAEs is to achieve a sparse representation (with as many zero values as possible) where each feature corresponds to an interpretable concept. These representations break down LLMs into smaller, more explicit elements, making it easier to understand how these models generate their responses.
As previously mentioned, one of the strengths of SAEs lies in their ability to extract features from LLMs' intermediate activations. The process works roughly as follows: the model's activations are first collected at a chosen point in its architecture over a large corpus of text; an SAE is then trained to reconstruct these activations under the sparsity constraint; finally, each learned feature is interpreted by examining the inputs that activate it most strongly and assigning it a human-readable label.
This process, which identifies precise concepts from LLM activations, also enables direct control over the model’s behavior. Using a technique called feature steering, researchers can directly manipulate the activation of a feature to guide the model’s responses toward a specific topic. For instance, researchers at Anthropic showed that by artificially amplifying the activation of the feature associated with the Golden Gate Bridge, they could make Claude systematically reference the bridge in its responses, even when it was irrelevant. When asked to describe its physical form, Claude responded: “I am the Golden Gate Bridge... my physical form is the iconic bridge itself.”
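To make the idea concrete, here is a rough sketch of feature steering on a small open model (GPT-2). The steering direction would in practice come from a trained SAE's decoder; here the random placeholder direction, the layer index, and the amplification scale are illustrative assumptions, not Anthropic's setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Feature steering sketch: add a scaled copy of a feature's decoder direction to one
# layer's output during generation. The direction below is a random placeholder; in
# practice it would be the decoder column of a feature learned by a trained SAE.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

feature_direction = torch.randn(768)
feature_direction /= feature_direction.norm()        # unit-norm direction (placeholder)
scale = 10.0                                         # amplification factor (illustrative)

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    hidden = output[0] + scale * feature_direction
    return (hidden,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(steer)
out = model.generate(**tokenizer("Tell me about yourself.", return_tensors="pt"),
                     max_new_tokens=20)
handle.remove()
print(tokenizer.decode(out[0]))
```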
Recently, significant efforts have been made by companies like OpenAI and Anthropic to develop methods for decomposing language models using Sparse Autoencoders. OpenAI succeeded in breaking down the activations of GPT-4 into 16 million potentially interpretable features. These features include themes such as "rising prices" or "human imperfection," corresponding to concepts activated within the model.
Anthropic focused on extracting features from Claude 3 Sonnet, discovering complex features ranging from "code type signatures" to "personality traits," "cultural biases," and even "abstract behavioral traits related to deception or manipulation." Their work highlights that SAEs can extract high-level features, revealing not only how the model reacts to specific inputs but also how to manipulate it to address biases or emphasize specific themes.
Despite their promise, Sparse Autoencoders have limitations. A key challenge is evaluating interpretability: how can we ensure that a feature is genuinely interpretable or corresponds to a concept comprehensible to humans? Current evaluation methods rely heavily on subjective judgment. Researchers must manually examine feature activations and decide whether they make sense. Furthermore, while SAEs can decompose activations locally, they provide limited insights into how these features are used across other layers of the model.
Nonetheless, the outlook is encouraging. Efforts to make LLMs more explainable build trust in these models, a crucial factor for their safe use in sensitive contexts. Advances in training SAEs, such as scaling up the number of learned features or applying novel regularization methods, could extend interpretability to a broader range of models and use cases in the future.
Anthropic recently introduced Sparse Crosscoders, a new approach that extends the capabilities of Sparse Autoencoders by extracting interpretable features from LLMs across multiple layers simultaneously. Unlike traditional methods that analyze each layer in isolation, Sparse Crosscoders link activations between layers and identify common patterns persisting throughout the model. This novel approach offers a clearer interpretation of LLMs’ internal mechanisms. Moreover, Sparse Crosscoders allow researchers to compare different training versions of the same model, isolating shared and specific features—a valuable tool for studying the effects of fine-tuning. This promising method paves the way for finer analyses of model evolution, enhancing transparency and safety in AI development.
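The core idea can be sketched as follows: a single sparse code is read from the activations of several layers and must reconstruct all of them, so each feature ties together patterns shared across layers. The sketch below is a hypothetical simplification, not Anthropic's implementation; the number of layers, dimensions, and loss weighting are all illustrative:

```python
import torch
import torch.nn as nn

# Rough crosscoder sketch (a simplification, not Anthropic's implementation): one
# shared sparse code is encoded from several layers' activations and must reconstruct
# all of them, so each feature captures patterns that persist across layers.
class Crosscoder(nn.Module):
    def __init__(self, n_layers=3, d_model=100, n_features=400):
        super().__init__()
        self.encoders = nn.ModuleList([nn.Linear(d_model, n_features) for _ in range(n_layers)])
        self.decoders = nn.ModuleList([nn.Linear(n_features, d_model) for _ in range(n_layers)])

    def forward(self, acts):                  # acts: one activation tensor per layer
        z = torch.relu(sum(enc(a) for enc, a in zip(self.encoders, acts)))  # shared sparse code
        return [dec(z) for dec in self.decoders], z

crosscoder = Crosscoder()
acts = [torch.randn(8, 100) for _ in range(3)]               # activations from 3 layers
recons, z = crosscoder(acts)
l1_coeff = 1e-3                                              # illustrative sparsity weight
loss = sum(nn.functional.mse_loss(r, a) for r, a in zip(recons, acts)) + l1_coeff * z.abs().mean()
```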
Sparse Autoencoders represent a promising advance in opening the black box of LLMs by breaking down their activations into interpretable components. This approach is essential for improving understanding and trust in these AI models. Research by OpenAI and Anthropic demonstrates that these methods can be applied to increasingly large models, offering valuable insights into their inner workings. The recent introduction of Sparse Crosscoders by Anthropic opens new avenues for interpretability, enabling the extraction of features across multiple layers simultaneously. With these advancements, SAEs and Sparse Crosscoders play a central role in improving the reliability and transparency of LLMs, paving the way for trustworthy and human-accessible AI analysis.
[1] An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability by Adam Karvonen
[2] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet by Anthropic
[3] Extracting Concepts from GPT-4 by OpenAI
[4] Sparse Crosscoders for Cross-Layer Features and Model Diffing by Anthropic