Concept Steerers: Leveraging K-Sparse Autoencoders for Controllable Generations
Abstract
Despite remarkable progress in text-to-image generative models, these models remain prone to adversarial attacks and can inadvertently generate unsafe or unethical content. Existing approaches often rely on fine-tuning models to remove specific concepts, which is computationally expensive, lacks scalability, and/or compromises generation quality. In this work, we propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation in diffusion models. Specifically, we first identify interpretable monosemantic concepts in the latent space of text embeddings and then leverage them to precisely steer the generation away from or towards a given concept (e.g., nudity) or to introduce a new concept (e.g., photographic style). Through extensive experiments, we demonstrate that our approach is simple, requires neither retraining of the base model nor LoRA adapters, does not compromise generation quality, and is robust to adversarial prompt manipulations. Our method yields an improvement of 20.01% in unsafe concept removal, is effective in style manipulation, and is ~5x faster than the current state of the art.
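To make the idea in the abstract concrete, here is a minimal sketch of a k-sparse autoencoder applied to a text-embedding vector, followed by a simple steering step. It is an illustration under stated assumptions, not the paper's implementation: the weights are random rather than trained, the decoder is tied to the encoder, and the additive steering rule, the function names (`top_k`, `encode`, `decode`, `steer`), and the dimensions are all hypothetical choices made for this example.

```python
import random

def top_k(acts, k):
    """Keep the k largest activations and zero out the rest."""
    keep = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]

def encode(x, W_enc, b_pre, k):
    """Project a centered embedding onto n latent features, then sparsify."""
    centered = [xi - bi for xi, bi in zip(x, b_pre)]
    acts = [sum(w * c for w, c in zip(row, centered)) for row in W_enc]
    return top_k(acts, k)

def decode(z, W_dec, b_pre):
    """Reconstruct the embedding; row i of W_dec is the direction of feature i."""
    return [b + sum(zi * row[j] for zi, row in zip(z, W_dec))
            for j, b in enumerate(b_pre)]

def steer(x, W_dec, feature, strength):
    """Assumed steering rule: shift x along one feature's decoder direction.
    Negative strength moves away from the concept, positive moves towards it."""
    return [xj + strength * dj for xj, dj in zip(x, W_dec[feature])]

# Toy setup: random (untrained) weights, tied encoder/decoder. All assumptions.
rng = random.Random(0)
d, n, k = 8, 32, 4                       # embedding dim, latent dim, sparsity
W_enc = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
W_dec = W_enc                            # tied weights, one row per feature
b_pre = [0.0] * d
x = [rng.gauss(0, 1) for _ in range(d)]  # stand-in for a text embedding

z = encode(x, W_enc, b_pre, k)           # sparse code with k active features
x_hat = decode(z, W_dec, b_pre)          # reconstruction from the sparse code
concept = max(range(n), key=lambda i: z[i])       # pretend this is the concept
steered_away = steer(x, W_dec, concept, -2.0)     # push away from the concept
```

In a real pipeline the steered embedding would replace the original text embedding fed to the diffusion model, which is why no retraining of the base model is involved; that step is omitted here.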