---
license: mit
library_name: transformers
base_model:
  - deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
base_model_relation: adapter
---

# SeerAttention-DeepSeek-R1-Distill-Qwen-14B-AttnGates

This repo contains only the AttnGates' weights for [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B).

SeerAttention introduces learnable AttnGate modules to accelerate the computationally intensive prefill stage of long-context large language models (LLMs) via dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework, where they learn to mimic the 2D max-pooled attention patterns of the original frozen model, preserving its integrity while avoiding costly retraining. During inference, these gates generate block-sparse binary masks by applying threshold/TopK to their learned soft scores, enabling efficient computation through a custom block-sparse FlashAttention kernel.
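For intuition, here is a minimal sketch of how block-level gate scores could be thresholded into a binary block mask. The tensor shapes, the softmax normalization, and the forced diagonal block are illustrative assumptions; the actual mask generation and the block-sparse FlashAttention kernel are implemented in the SeerAttention code base.

```python
import torch

def block_mask_from_gate_scores(gate_scores: torch.Tensor,
                                threshold: float = 5e-4) -> torch.Tensor:
    """gate_scores: [batch, heads, q_blocks, kv_blocks] soft scores predicted
    by the AttnGates. Returns a boolean mask where True means the
    (query-block, key-block) tile is actually computed."""
    # Normalize scores per query block so the threshold acts on a
    # probability-like distribution over key/value blocks (an assumption
    # made for this sketch, not necessarily the official recipe).
    probs = torch.softmax(gate_scores, dim=-1)
    mask = probs >= threshold
    # Always keep the diagonal (local) block so every query block attends
    # to at least its own block (also an assumption of this sketch).
    q_blocks, kv_blocks = mask.shape[-2:]
    diag = torch.eye(q_blocks, kv_blocks, dtype=torch.bool, device=mask.device)
    return mask | diag

# Toy example: 1 sequence, 8 heads, 32 query blocks x 32 key/value blocks.
scores = torch.randn(1, 8, 32, 32)
mask = block_mask_from_gate_scores(scores, threshold=5e-4)
print("block sparsity:", 1.0 - mask.float().mean().item())
```

The resulting mask is consumed by a block-sparse attention kernel that skips the pruned (query-block, key-block) tiles entirely, which is where the prefill speedup comes from.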

The original code is available in the [SeerAttention GitHub repository](https://github.com/microsoft/SeerAttention).
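As a minimal usage sketch, the AttnGates weights in this repo can be downloaded with `huggingface_hub`; the repo id below is inferred from this model card, and attaching the gates to the base DeepSeek-R1-Distill-Qwen-14B model is handled by the SeerAttention code base rather than by plain `transformers`.

```python
from huggingface_hub import snapshot_download

# Repo id assumed from this model card's title and owner.
gate_weights_dir = snapshot_download(
    repo_id="SeerAttention/SeerAttention-DeepSeek-R1-Distill-Qwen-14B-AttnGates"
)
print("AttnGates weights downloaded to:", gate_weights_dir)
```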

## LongBenchV2 CoT Benchmark

All SeerAttention models are run with `threshold=5e-4`.

For the R1-Distill models, we remove the two-pass generation setup (think + summary) and instead directly ask the models to output the answer after thinking (a rough sketch of this single-pass setting follows the table below). The maximum generation length is set to 10240 tokens.

| Model | Overall | Easy | Hard | Short | Medium | Long |
|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 30.4 | 31.2 | 29.9 | 37.8 | 24.7 | 29.6 |
| SeerAttention-Llama-3.1-8B | 31.6 | 33.3 | 30.5 | 33.9 | 31.6 | 27.8 |
| Qwen2.5-14B-Instruct | 34.8 | 37.5 | 33.1 | 44.4 | 32.1 | 24.1 |
| SeerAttention-Qwen2.5-14B | 32.8 | 38.0 | 29.6 | 45.0 | 30.2 | 17.6 |
| Qwen2.5-32B-Instruct | 36.4 | 42.2 | 32.8 | 47.8 | 29.8 | 30.6 |
| SeerAttention-Qwen2.5-32B | 36.4 | 41.1 | 33.4 | 49.4 | 29.8 | 27.8 |
| DeepSeek-R1-Distill-Qwen-14B | 34.2 | 43.2 | 28.6 | 45.0 | 27.9 | 28.7 |
| SeerAttention-DeepSeek-R1-Distill-Qwen-14B | 31.6 | 35.9 | 28.9 | 41.7 | 26.0 | 25.9 |
| DeepSeek-R1-Distill-Qwen-32B | 37.2 | 42.7 | 33.8 | 47.2 | 35.8 | 23.1 |
| SeerAttention-DeepSeek-R1-Distill-Qwen-32B | 37.0 | 42.2 | 33.8 | 49.4 | 31.6 | 26.9 |
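
For illustration only, the single-pass evaluation setting above roughly corresponds to a plain `transformers` generation call like the sketch below. It only reflects the generation length; applying the AttnGates with `threshold=5e-4` requires the block-sparse attention implementation from the SeerAttention repository, and the prompt placeholder is hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "..."  # a LongBenchV2 CoT prompt would go here
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Single pass: the model thinks and then outputs the answer directly,
# using the benchmark's maximum generation length of 10240 tokens.
outputs = model.generate(**inputs, max_new_tokens=10240)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```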