Today's pick in Interpretability & Analysis of LMs: Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models by @sammarks, C. Rager, @eircjm, @belinkov, @davidbau, @amueller
This work proposes using features and errors from sparse autoencoders trained to reconstruct LM activations as interpretable units for circuit discovery. The authors then introduce SHIFT, a technique for editing model behavior by ablating interpretable elements from sparse feature circuits. This method is applied alongside unsupervised circuit discovery at scale by means of clustering, showing highly interpretable feature circuits interacting to produce behaviors like predicting sequence increments.
I found the experiment in Section 4 especially convincing and exciting in terms of downstream applications: the authors trained a classifier on a biased dataset and showed that a SHIFT intervention in feature space yields performance matching that of the same model trained on an unbiased data distribution!
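To make the idea of ablating features in SAE space concrete, here is a minimal toy sketch. It is not the authors' code: the tiny weights, the `ablate_features` helper, and the plain-Python linear algebra are all assumptions made for illustration. It encodes an activation into sparse features, keeps the reconstruction error as a separate term, zeroes the chosen features, and decodes back.

```python
# Toy sketch of feature ablation through a sparse autoencoder (SAE).
# All weights and function names are hypothetical, chosen for illustration.

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def encode(x, w_enc, b_enc):
    """SAE encoder: feature activations f = ReLU(W_enc x + b_enc)."""
    return [max(0.0, a + b) for a, b in zip(matvec(w_enc, x), b_enc)]

def decode(f, w_dec):
    """SAE decoder: reconstruction x_hat = W_dec f."""
    return matvec(w_dec, f)

def ablate_features(x, w_enc, b_enc, w_dec, ablate_idx):
    """Zero the selected features while keeping the SAE error term fixed."""
    f = encode(x, w_enc, b_enc)
    x_hat = decode(f, w_dec)
    # Error term: the part of the activation the SAE fails to reconstruct.
    error = [a - b for a, b in zip(x, x_hat)]
    f_edit = [0.0 if i in ablate_idx else v for i, v in enumerate(f)]
    # Edited activation = decoded edited features + unchanged error.
    return [a + e for a, e in zip(decode(f_edit, w_dec), error)]

# A 2-dimensional activation and a 3-feature toy SAE.
X = [1.0, 2.0]
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_ENC = [0.0, 0.0, -1.0]
W_DEC = [[0.5, 0.0, 0.25], [0.0, 0.5, 0.25]]

unchanged = ablate_features(X, W_ENC, B_ENC, W_DEC, set())
edited = ablate_features(X, W_ENC, B_ENC, W_DEC, {2})
```

With nothing ablated, the features plus the error term recover the original activation exactly, which is why the error is carried along as its own unit. Ablating feature 2 removes only that feature's decoded contribution.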
Paper: Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (2403.19647)
All daily picks: https://huggingface.co./collections/gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9