Large language models (LLMs) like GPT-4 and Claude 3 demonstrate exceptional performance, yet they remain enigmatic black boxes. Naturally, one might ask: how do these models produce such precise responses, or even come up with creative ideas? The interpretability of these models is a critical challenge, and one promising approach to uncovering their inner workings involves Sparse Autoencoders (SAEs). In this article, we explore how SAEs decompose LLMs' activations into interpretable components, providing valuable insights into how these complex AI systems work.
To understand how SAEs contribute to the interpretability of LLMs, we must first understand the concept of an autoencoder. An autoencoder is a neural network that learns to compress and decompress its input. Imagine a neural network that takes a 100-dimensional vector as input, reduces it to 50 dimensions, and then reconstructs it back to 100 dimensions. The goal of the model is to minimize the difference between the input vector and the reconstructed output.
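As a concrete illustration, here is a minimal autoencoder sketch in PyTorch using the same 100→50→100 dimensions as above. The dimensions and code are purely illustrative and not tied to any particular model:

```python
import torch
import torch.nn as nn

# Minimal autoencoder sketch (illustrative dimensions): compress a 100-dimensional
# input to 50 dimensions, then reconstruct it and minimize the reconstruction error.
class Autoencoder(nn.Module):
    def __init__(self, input_dim=100, hidden_dim=50):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        z = torch.relu(self.encoder(x))      # compressed intermediate representation
        return self.decoder(z)               # reconstruction of the input

model = Autoencoder()
x = torch.randn(8, 100)                      # a batch of 8 input vectors
loss = nn.functional.mse_loss(model(x), x)   # difference between input and reconstruction
```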
Sparse Autoencoders distinguish themselves by adding a sparsity penalty to their loss function, encouraging the model to activate only a small fraction of the neurons in its intermediate layer. This penalty pushes the model to produce intermediate vectors (activations) with as many zero values as possible. As we'll see, these sparse representations are what make the intermediate activations of large models like LLMs more interpretable.
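Concretely, the only change relative to the plain autoencoder above is the extra sparsity term in the loss. The sketch below assumes an L1 penalty on the hidden activations and a hidden layer wider than the input, as is common in interpretability work; the specific sizes and penalty weight are illustrative choices, not values from any particular paper:

```python
import torch
import torch.nn as nn

# Sparse autoencoder sketch: same structure as before, but the loss adds an L1 penalty
# on the hidden activations to push most of them toward zero. The wider hidden layer
# and the penalty weight are illustrative choices.
class SparseAutoencoder(nn.Module):
    def __init__(self, input_dim=100, hidden_dim=400):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        z = torch.relu(self.encoder(x))       # sparse feature activations
        return self.decoder(z), z

sae = SparseAutoencoder()
x = torch.randn(8, 100)
x_hat, z = sae(x)
l1_coeff = 1e-3                               # weight of the sparsity penalty (illustrative)
loss = nn.functional.mse_loss(x_hat, x) + l1_coeff * z.abs().mean()
```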
The idea of using SAEs for the interpretability of LLMs revolves around decomposing their intermediate activations to make them more comprehensible to humans. The activations of a language model are often opaque: a single neuron can encode multiple concepts simultaneously, a phenomenon known as superposition. This complexity makes it challenging to discern what each neuron represents.
SAEs can be trained on the intermediate activations of LLMs at various points in their architecture, often between two layers. Each component of the sparse representation the SAE learns from these activations is called a feature. A feature aims to represent an identifiable concept interpretable by humans (e.g., semantic concepts like "human flaws" or "price increases," grammatical categories, or specific themes).
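To illustrate where these features come from, here is a hypothetical sketch that collects the hidden activations of one GPT-2 layer and passes them through an SAE encoder. The model choice, layer index, feature count, and the untrained stand-in encoder are all illustrative assumptions; a real setup would use an SAE trained on activations from a large corpus:

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

# Collect the hidden activations of one GPT-2 layer, then map them to sparse
# feature activations with a stand-in encoder (a real setup would use a trained SAE).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

inputs = tokenizer("Prices rose sharply this quarter.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states   # one tensor per layer

layer_acts = hidden_states[6]                       # shape (1, seq_len, 768); layer chosen arbitrarily

encoder = nn.Linear(768, 16384)                     # stand-in SAE encoder: 768 dims -> 16,384 features
features = torch.relu(encoder(layer_acts))          # sparse feature activations per token
top_features = features[0, -1].topk(5).indices      # most active features on the last token
```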
In summary, the goal of using SAEs is to achieve a sparse representation (with as many zero values as possible) where each feature corresponds to an interpretable concept. These representations break down LLMs into smaller, more explicit elements, making it easier to understand how these models generate their responses.
As previously mentioned, one of the strengths of SAEs lies in their ability to extract features from LLMs' intermediate activations. The process works roughly as follows: the model's activations are first collected at a chosen point in its architecture over a large corpus of text; an SAE is then trained to reconstruct these activations under the sparsity constraint; finally, each learned feature is interpreted by examining the inputs that activate it most strongly and assigning it a human-readable label.
This process, which identifies precise concepts from LLM activations, also enables direct control over the model’s behavior. Using a technique called feature steering, researchers can directly manipulate the activation of a feature to guide the model’s responses toward a specific topic. For instance, researchers at Anthropic showed that by artificially amplifying the activation of the feature associated with the Golden Gate Bridge, they could make Claude systematically reference the bridge in its responses, even when it was irrelevant. When asked to describe its physical form, Claude responded: “I am the Golden Gate Bridge... my physical form is the iconic bridge itself.”
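To make the idea concrete, here is a rough sketch of feature steering on a small open model (GPT-2). The steering direction would in practice come from a trained SAE's decoder; here the random placeholder direction, the layer index, and the amplification scale are illustrative assumptions, not Anthropic's setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Feature steering sketch: add a scaled copy of a feature's decoder direction to one
# layer's output during generation. The direction below is a random placeholder; in
# practice it would be the decoder column of a feature learned by a trained SAE.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

feature_direction = torch.randn(768)
feature_direction /= feature_direction.norm()        # unit-norm direction (placeholder)
scale = 10.0                                         # amplification factor (illustrative)

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    hidden = output[0] + scale * feature_direction
    return (hidden,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(steer)
out = model.generate(**tokenizer("Tell me about yourself.", return_tensors="pt"),
                     max_new_tokens=20)
handle.remove()
print(tokenizer.decode(out[0]))
```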
Recently, significant efforts have been made by companies like OpenAI and Anthropic to develop methods for decomposing language models using Sparse Autoencoders. OpenAI succeeded in breaking down the activations of GPT-4 into 16 million potentially interpretable features. These features include themes such as "rising prices" or "human imperfection," corresponding to concepts activated within the model.
Anthropic focused on extracting features from Claude 3 Sonnet, discovering complex features ranging from "code type signatures" to "personality traits," "cultural biases," and even "abstract behavioral traits related to deception or manipulation." Their work highlights that SAEs can extract high-level features, revealing not only how the model reacts to specific inputs but also how to manipulate it to address biases or emphasize specific themes.
Despite their promise, Sparse Autoencoders have limitations. A key challenge is evaluating interpretability: how can we ensure that a feature is genuinely interpretable or corresponds to a concept comprehensible to humans? Current evaluation methods rely heavily on subjective judgment. Researchers must manually examine feature activations and decide whether they make sense. Furthermore, while SAEs can decompose activations locally, they provide limited insights into how these features are used across other layers of the model.
Nonetheless, the outlook is encouraging. Efforts to make LLMs more explainable build trust in these models, a crucial factor for their safe use in sensitive contexts. Advances in training SAEs, such as scaling up the number of learned features or applying novel regularization methods, could extend interpretability to a broader range of models and use cases in the future.
Anthropic recently introduced Sparse Crosscoders, a new approach that extends the capabilities of Sparse Autoencoders by extracting interpretable features from LLMs across multiple layers simultaneously. Unlike traditional methods that analyze each layer in isolation, Sparse Crosscoders link activations between layers and identify common patterns persisting throughout the model. This novel approach offers a clearer interpretation of LLMs’ internal mechanisms. Moreover, Sparse Crosscoders allow researchers to compare different training versions of the same model, isolating shared and specific features—a valuable tool for studying the effects of fine-tuning. This promising method paves the way for finer analyses of model evolution, enhancing transparency and safety in AI development.
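The core idea can be sketched as follows: a single sparse code is read from the activations of several layers and must reconstruct all of them, so each feature ties together patterns shared across layers. The sketch below is a hypothetical simplification, not Anthropic's implementation; the number of layers, dimensions, and loss weighting are all illustrative:

```python
import torch
import torch.nn as nn

# Rough crosscoder sketch (a simplification, not Anthropic's implementation): one
# shared sparse code is encoded from several layers' activations and must reconstruct
# all of them, so each feature captures patterns that persist across layers.
class Crosscoder(nn.Module):
    def __init__(self, n_layers=3, d_model=100, n_features=400):
        super().__init__()
        self.encoders = nn.ModuleList([nn.Linear(d_model, n_features) for _ in range(n_layers)])
        self.decoders = nn.ModuleList([nn.Linear(n_features, d_model) for _ in range(n_layers)])

    def forward(self, acts):                  # acts: one activation tensor per layer
        z = torch.relu(sum(enc(a) for enc, a in zip(self.encoders, acts)))  # shared sparse code
        return [dec(z) for dec in self.decoders], z

crosscoder = Crosscoder()
acts = [torch.randn(8, 100) for _ in range(3)]               # activations from 3 layers
recons, z = crosscoder(acts)
l1_coeff = 1e-3                                              # illustrative sparsity weight
loss = sum(nn.functional.mse_loss(r, a) for r, a in zip(recons, acts)) + l1_coeff * z.abs().mean()
```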
Sparse Autoencoders represent a promising advance in opening the black box of LLMs by breaking down their activations into interpretable components. This approach is essential for improving understanding and trust in these AI models. Research by OpenAI and Anthropic demonstrates that these methods can be applied to increasingly large models, offering valuable insights into their inner workings. The recent introduction of Sparse Crosscoders by Anthropic opens new avenues for interpretability, enabling the extraction of features across multiple layers simultaneously. With these advancements, SAEs and Sparse Crosscoders play a central role in improving the reliability and transparency of LLMs, paving the way for trustworthy and human-accessible AI analysis.
[1] An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability by Adam Karvonen
[2] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet by Anthropic
[3] Extracting Concepts from GPT-4 by OpenAI
[4] Sparse Crosscoders for Cross-Layer Features and Model Diffing by Anthropic