---
license: mit
library_name: transformers
base_model:
  - deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
base_model_relation: adapter
---

# SeerAttention-DeepSeek-R1-Distill-Qwen-14B-AttnGates

This repo contains only the AttnGates' weights for [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B).

SeerAttention introduces learnable AttnGate modules to accelerate the computationally intensive prefill stage of long-context large language models (LLMs) via dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework, where they learn to mimic the 2D max-pooled attention patterns of the original frozen model, preserving its integrity while avoiding costly retraining. During inference, these gates generate block-sparse binary masks by applying threshold/TopK to their learned soft scores, enabling efficient computation through a custom block-sparse FlashAttention kernel.
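For intuition, here is a minimal sketch of how block-level gate scores could be thresholded into a binary block mask. The tensor shapes, the softmax normalization, and the forced diagonal block are illustrative assumptions; the actual mask generation and the block-sparse FlashAttention kernel are implemented in the SeerAttention code base.

```python
import torch

def block_mask_from_gate_scores(gate_scores: torch.Tensor,
                                threshold: float = 5e-4) -> torch.Tensor:
    """gate_scores: [batch, heads, q_blocks, kv_blocks] soft scores predicted
    by the AttnGates. Returns a boolean mask where True means the
    (query-block, key-block) tile is actually computed."""
    # Normalize scores per query block so the threshold acts on a
    # probability-like distribution over key/value blocks (an assumption
    # made for this sketch, not necessarily the official recipe).
    probs = torch.softmax(gate_scores, dim=-1)
    mask = probs >= threshold
    # Always keep the diagonal (local) block so every query block attends
    # to at least its own block (also an assumption of this sketch).
    q_blocks, kv_blocks = mask.shape[-2:]
    diag = torch.eye(q_blocks, kv_blocks, dtype=torch.bool, device=mask.device)
    return mask | diag

# Toy example: 1 sequence, 8 heads, 32 query blocks x 32 key/value blocks.
scores = torch.randn(1, 8, 32, 32)
mask = block_mask_from_gate_scores(scores, threshold=5e-4)
print("block sparsity:", 1.0 - mask.float().mean().item())
```

The resulting mask is consumed by a block-sparse attention kernel that skips the pruned (query-block, key-block) tiles entirely, which is where the prefill speedup comes from.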

The original code is available in the [SeerAttention GitHub repository](https://github.com/microsoft/SeerAttention).
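As a minimal usage sketch, the AttnGates weights in this repo can be downloaded with `huggingface_hub`; the repo id below is inferred from this model card, and attaching the gates to the base DeepSeek-R1-Distill-Qwen-14B model is handled by the SeerAttention code base rather than by plain `transformers`.

```python
from huggingface_hub import snapshot_download

# Repo id assumed from this model card's title and owner.
gate_weights_dir = snapshot_download(
    repo_id="SeerAttention/SeerAttention-DeepSeek-R1-Distill-Qwen-14B-AttnGates"
)
print("AttnGates weights downloaded to:", gate_weights_dir)
```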

## LongBenchV2 CoT Benchmark

All SeerAttention models are run with `threshold=5e-4`.

For the R1-Distill models, we remove the two-pass generation setup (think + summary) and instead directly ask the models to output the answer after thinking (a rough sketch of this single-pass setting follows the table below). The maximum generation length is set to 10240 tokens.

| Model | Overall | Easy | Hard | Short | Medium | Long |
|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 30.4 | 31.2 | 29.9 | 37.8 | 24.7 | 29.6 |
| SeerAttention-Llama-3.1-8B | 31.6 | 33.3 | 30.5 | 33.9 | 31.6 | 27.8 |
| Qwen2.5-14B-Instruct | 34.8 | 37.5 | 33.1 | 44.4 | 32.1 | 24.1 |
| SeerAttention-Qwen2.5-14B | 32.8 | 38.0 | 29.6 | 45.0 | 30.2 | 17.6 |
| Qwen2.5-32B-Instruct | 36.4 | 42.2 | 32.8 | 47.8 | 29.8 | 30.6 |
| SeerAttention-Qwen2.5-32B | 36.4 | 41.1 | 33.4 | 49.4 | 29.8 | 27.8 |
| DeepSeek-R1-Distill-Qwen-14B | 34.2 | 43.2 | 28.6 | 45.0 | 27.9 | 28.7 |
| SeerAttention-DeepSeek-R1-Distill-Qwen-14B | 31.6 | 35.9 | 28.9 | 41.7 | 26.0 | 25.9 |
| DeepSeek-R1-Distill-Qwen-32B | 37.2 | 42.7 | 33.8 | 47.2 | 35.8 | 23.1 |
| SeerAttention-DeepSeek-R1-Distill-Qwen-32B | 37.0 | 42.2 | 33.8 | 49.4 | 31.6 | 26.9 |
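
For illustration only, the single-pass evaluation setting above roughly corresponds to a plain `transformers` generation call like the sketch below. It only reflects the generation length; applying the AttnGates with `threshold=5e-4` requires the block-sparse attention implementation from the SeerAttention repository, and the prompt placeholder is hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "..."  # a LongBenchV2 CoT prompt would go here
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Single pass: the model thinks and then outputs the answer directly,
# using the benchmark's maximum generation length of 10240 tokens.
outputs = model.generate(**inputs, max_new_tokens=10240)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```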