Extracting Concepts from LLMs: Anthropic’s recent discoveries 📖
LLMs are huge piles of neurons that somehow produce useful outputs, but we're still not sure how that works under the hood: at which point do real concepts emerge from this mathematical mess?
Now that LLM research has gone past the era of "We don't understand why this works, but scaling works, so let's just pile up trillions of parameters", there's a renewed focus on "Let's understand how these algorithms work to improve them". Anthropic's team has worked on interpretability for years, which has given them a good lead, and they have published fascinating discoveries along the way: here I want to share what I've learned by reading some of their landmark publications.
1. Models are actually superposing features in each neuron - Toy Models of Superposition
- When you have 2 neurons, why not encode more than 2 features across them? (cf. figure below) ⇒ this would allow for better performance by encoding finer-grained concepts with the same parameter count!
- The researchers hypothesize that “the neural networks we observe in practice are in some sense noisily simulating larger, highly sparse networks” (see the sketch after this list).
- What would these features be? It would be nice if they were:
- Interpretable, which would allow us to better understand how LLMs work.
- Transcendent, for more generality: multilingual, multimodal,…
- Causally linked to the model's behaviour: ideally, manipulating a feature should have a visible impact on the outputs.
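To make the superposition hypothesis concrete, here is a minimal toy sketch (my own illustration, not the paper's code, and all hyperparameters are made up): a handful of sparse features are squeezed through a 2-neuron bottleneck, yet they can still be approximately reconstructed.

```python
# Toy superposition sketch: squeeze n_features sparse features through a
# bottleneck of n_neurons < n_features and check they can still be recovered.
import torch

n_features, n_neurons, sparsity = 5, 2, 0.9   # features are zero ~90% of the time

W = torch.nn.Parameter(torch.randn(n_neurons, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Random sparse, non-negative feature vectors
    x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) > sparsity).float()
    hidden = x @ W.T                      # project 5 features into only 2 neurons
    x_hat = torch.relu(hidden @ W + b)    # try to reconstruct all 5 features
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# With sparse enough features, reconstruction error stays low even though
# 5 features share 2 neurons: that is superposition.
print(loss.item())
```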
2. Extract features - Towards Monosemanticity
- What could be the benefits of extracting the superposed features of an LLM?
- Using features as an output: they would provide better interpretability.
- Using features as an input: manipulating features would allow us to steer the model!
- In this paper, the researchers simply train a Sparse AutoEncoder (SAE) on the activations of the MLP layer of a model to find features (a minimal sketch follows below). It seems to work, with a few meaningful features extracted, like “Arabic text” or “Genetic code”! But since the experiment was restricted to a very simple, 1-layer model, the extracted concepts are still very basic.
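Here is a minimal sparse-autoencoder sketch in the spirit of the paper, not its actual code: encode MLP activations into an overcomplete latent space with an L1 sparsity penalty, and reconstruct them from those features. The dimensions and `l1_coeff` below are illustrative choices.

```python
import torch
import torch.nn as nn

class SparseAutoEncoder(nn.Module):
    def __init__(self, d_mlp: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_mlp, d_features)
        self.decoder = nn.Linear(d_features, d_mlp)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))   # sparse feature activations
        recon = self.decoder(features)              # reconstructed MLP activations
        return recon, features

sae = SparseAutoEncoder(d_mlp=512, d_features=4096)  # many more features than neurons
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3

def train_step(mlp_acts: torch.Tensor) -> float:
    recon, features = sae(mlp_acts)
    # Reconstruction loss + sparsity penalty on the feature activations
    loss = ((mlp_acts - recon) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# In practice mlp_acts would be cached MLP activations from running the model
# over a large text corpus; here a random batch just shows the shapes.
train_step(torch.randn(64, 512))
```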
3. Scaling that up to a real LLM - Scaling Monosemanticity
- The researchers scale up the previous recipe and apply it to Claude-3-Sonnet, Anthropic’s medium-sized LLM.
- The features they obtain are astounding:
- Interpretability: ✅ You can use another LLM to annotate features (explore features in their interactive map), and they really correspond to complex concepts: cf. the example below: there is a clear “Tourist attraction” feature that turns on for every mention of the Eiffel Tower or the Mona Lisa. And for most concepts, the associated feature correlates with the concept much better than any group of neurons does.
- They also check these boxes:
- Transcendent: ✅ The features are multilingual and multimodal, and the same kinds of features appear across different models.
- Ability to steer model behaviour: ✅ See below how increasing the "Golden Gate Bridge" feature changes the model's outputs! 🤯
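Mechanically, steering could look like the sketch below, which reuses the toy `sae` from the sketch above; the hook-based patching, the placeholder `mlp_layer`, and the feature index are my assumptions for illustration, not Anthropic's published setup.

```python
# Feature steering sketch: nudge a layer's activations along one SAE feature's
# decoder direction via a forward hook.
import torch
import torch.nn as nn

d_mlp = 512
mlp_layer = nn.Linear(d_mlp, d_mlp)         # toy stand-in for the LLM layer the SAE was trained on

feature_idx = 1234                          # hypothetical "Golden Gate Bridge" feature index
steer_strength = 5.0
# One decoder column = that feature's direction in activation space
feature_direction = sae.decoder.weight[:, feature_idx].detach()

def steering_hook(module, inputs, output):
    # Shift the layer's output towards the chosen feature on every forward pass
    return output + steer_strength * feature_direction

handle = mlp_layer.register_forward_hook(steering_hook)
steered = mlp_layer(torch.randn(1, d_mlp))  # a generation run with this hook active is "steered"
handle.remove()
```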
But do the SAE-extracted features really represent particular concepts better than individual neurons do? To measure this, the researchers first ask Claude-3-Opus to give an explanation for each feature (something like a short label: "About the Golden Gate Bridge"), then ask it to predict what the feature's activation should be on a specific sentence (for "I visited the Golden Gate Bridge", the activation should be close to 1), and finally compare that prediction with the real activation, either for a neuron or for an extracted feature. The results clearly show that SAE-extracted features make more sense than neurons, and that their interpretability improves with SAE size, which validates the approach of using these features:
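As a small illustration of that scoring protocol (with made-up numbers, not the paper's data): for a given feature or neuron, correlate the activations a judge LLM predicts from its short label against the activations actually measured on a set of sentences.

```python
import numpy as np

measured = np.array([0.9, 0.0, 0.1, 0.8, 0.0])    # real activations on 5 eval sentences
predicted = np.array([1.0, 0.0, 0.0, 1.0, 0.0])   # activations the judge LLM predicted

score = np.corrcoef(measured, predicted)[0, 1]     # Pearson correlation, in [-1, 1]
print(f"interpretability score: {score:.2f}")      # closer to 1 = more monosemantic
```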
4+. Moving forward
OpenAI also published a paper yesterday: it's more focused on technical optimisation of the concept. They follow Anthropic’s approach of using SAEs to encode the activations from MLP layers, and they point out a very insightful limitation of this approach: when forcing the model's activations to go through the SAE-reconstructed features, performance degrades to that of a model trained with 10x less compute.
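As a rough illustration of that evaluation (reusing the toy `sae` and `mlp_layer` from the sketches above, so this is an assumption about the mechanics, not OpenAI's code): patch a layer so its activations are replaced by their SAE reconstruction, and compare with the unpatched run. In the real evaluation, the gap is measured on the full model's language-modelling loss, not on a single layer.

```python
import torch

def sae_patch_hook(module, inputs, output):
    recon, _ = sae(output)                  # route the activations through the SAE
    return recon

x = torch.randn(8, d_mlp)
clean_out = mlp_layer(x)                    # normal activations
handle = mlp_layer.register_forward_hook(sae_patch_hook)
patched_out = mlp_layer(x)                  # SAE-reconstructed activations
handle.remove()

print(((clean_out - patched_out) ** 2).mean().item())   # reconstruction gap proxy
```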
This limitation means this approach is still far from completely explaining how LLMs work, and there is progress to be made: let’s hope further results keep being shared for the benefit of the community!