---
license: mit
library_name: transformers
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
base_model_relation: adapter
---
# SeerAttention-DeepSeek-R1-Distill-Qwen-14B-AttnGates

This repo contains only the AttnGates' weights for [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B).
SeerAttention introduces learnable AttnGate modules to accelerate the computationally intensive prefill stage of long-context large language models (LLMs) via dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework, where they learn to mimic the 2D max-pooled attention patterns of the original frozen model, preserving its integrity while avoiding costly retraining. During inference, these gates generate block-sparse binary masks by applying a threshold or TopK selection to their learned soft scores, enabling efficient computation through a custom block-sparse FlashAttention kernel.
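For intuition, here is a minimal sketch of the thresholding step described above; it is not the repo's actual implementation, and the tensor layout, softmax normalization, and diagonal-block handling are all illustrative assumptions:

```python
import torch

def block_sparse_mask(gate_scores: torch.Tensor, threshold: float = 5e-4) -> torch.Tensor:
    """Turn soft AttnGate scores into a block-level binary attention mask.

    gate_scores: [num_heads, num_q_blocks, num_kv_blocks] soft scores
                 (an assumed shape; the real kernel layout may differ).
    Returns a boolean mask where True marks the blocks the block-sparse
    FlashAttention kernel actually computes.
    """
    # Normalize scores per query block so the threshold acts on a
    # probability-like distribution over KV blocks (an assumption here).
    probs = torch.softmax(gate_scores, dim=-1)
    mask = probs >= threshold
    # Always keep the diagonal (local) block so each query block attends
    # to itself (also an assumption about how locality is preserved).
    q_idx = torch.arange(gate_scores.shape[-2])
    mask[..., q_idx, q_idx] = True
    return mask

# Dummy usage: 8 heads, a 16x16 grid of blocks.
scores = torch.randn(8, 16, 16)
mask = block_sparse_mask(scores, threshold=5e-4)
```

Read this way, a threshold of 5e-4 (the setting used in the benchmark below) keeps any KV block that receives more than a tiny fraction of a query block's attention mass and skips the rest.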
Original GitHub repo: https://github.com/microsoft/SeerAttention
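As a hedged example, the gate weights can be fetched with the standard `huggingface_hub` API; the repo id below assumes this model card lives under the `SeerAttention` organization on the Hub, and composing the gates with the base model is handled by the code in the GitHub repo above:

```python
from huggingface_hub import snapshot_download

# Download only the AttnGates' weights; the base model
# deepseek-ai/DeepSeek-R1-Distill-Qwen-14B is fetched separately.
gate_dir = snapshot_download(
    "SeerAttention/SeerAttention-DeepSeek-R1-Distill-Qwen-14B-AttnGates"
)
print(gate_dir)  # local path to the downloaded gate checkpoint
```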
## LongBenchV2 CoT Benchmark
All SeerAttention models are run with threshold=5e-4.

For the R1-distilled models, we remove the two-pass generation setup (think + summary) and instead ask the models to output the answer directly after thinking. The maximum generation length is set to 10240 tokens.
| Model | Overall | Easy | Hard | Short | Medium | Long |
|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 30.4 | 31.2 | 29.9 | 37.8 | 24.7 | 29.6 |
| SeerAttention-Llama-3.1-8B | 31.6 | 33.3 | 30.5 | 33.9 | 31.6 | 27.8 |
| Qwen2.5-14B-Instruct | 34.8 | 37.5 | 33.1 | 44.4 | 32.1 | 24.1 |
| SeerAttention-Qwen2.5-14B | 32.8 | 38.0 | 29.6 | 45.0 | 30.2 | 17.6 |
| Qwen2.5-32B-Instruct | 36.4 | 42.2 | 32.8 | 47.8 | 29.8 | 30.6 |
| SeerAttention-Qwen2.5-32B | 36.4 | 41.1 | 33.4 | 49.4 | 29.8 | 27.8 |
| DeepSeek-R1-Distill-Qwen-14B | 34.2 | 43.2 | 28.6 | 45.0 | 27.9 | 28.7 |
| SeerAttention-DeepSeek-R1-Distill-Qwen-14B | 31.6 | 35.9 | 28.9 | 41.7 | 26.0 | 25.9 |
| DeepSeek-R1-Distill-Qwen-32B | 37.2 | 42.7 | 33.8 | 47.2 | 35.8 | 23.1 |
| SeerAttention-DeepSeek-R1-Distill-Qwen-32B | 37.0 | 42.2 | 33.8 | 49.4 | 31.6 | 26.9 |