This repo contains only the AttnGate weights for the Qwen2.5-7B-Instruct model.
SeerAttention introduces learnable AttnGate modules to accelerate the computationally intensive prefill stage of long-context large language models (LLMs) via dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework, where they learn to mimic the 2D max-pooled attention patterns of the original frozen model, preserving its integrity while avoiding costly retraining. During inference, these gates generate block-sparse binary masks by applying threshold/TopK to their learned soft scores, enabling efficient computation through a custom block-sparse FlashAttention kernel.
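The two core operations described above can be sketched in a few lines: 2D max-pooling an attention map into block-level scores (the self-distillation target the AttnGates learn to mimic), and thresholding soft gate scores into a binary block-sparse mask. This is a minimal NumPy illustration, not the repo's actual implementation; the function names, the block layout, and the toy scores are all hypothetical.

```python
import numpy as np

def maxpool2d(attn: np.ndarray, block: int) -> np.ndarray:
    """2D max-pool a [seq, seq] attention map into block-level scores.

    This is the distillation target the AttnGates are trained to mimic
    (illustrative only; block layout is a simplifying assumption).
    """
    n = attn.shape[0] // block
    return attn[: n * block, : n * block].reshape(n, block, n, block).max(axis=(1, 3))

def block_sparse_mask(gate_scores: np.ndarray, threshold: float = 5e-4) -> np.ndarray:
    """Binarize soft block-level gate scores into a block-sparse mask.

    Blocks scoring below `threshold` would be skipped by the sparse
    attention kernel (hypothetical interface, for illustration).
    """
    return gate_scores >= threshold

# Toy example: 4x4 grid of soft block scores.
scores = np.array([
    [0.9,  1e-5, 1e-5, 1e-5],
    [0.5,  0.4,  1e-5, 1e-5],
    [0.3,  1e-5, 0.6,  1e-5],
    [0.25, 0.2,  0.2,  0.3 ],
])
mask = block_sparse_mask(scores, threshold=5e-4)
density = mask.mean()  # fraction of blocks actually computed
```

The `density` value here corresponds to the "Density" column in the PG19 table below: it is the fraction of attention blocks the sparse kernel actually computes.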
## Original GitHub Repo

https://github.com/microsoft/SeerAttention
## Evaluation Results
### PG19 PPL

| Density | 8192 tokens (ppl) | 16384 tokens (ppl) | 32768 tokens (ppl) |
|---|---|---|---|
| 0.10 | 10.19 | 9.73 | 9.59 |
| 0.20 | 9.78 | 9.53 | 9.46 |
| 0.30 | 9.67 | 9.46 | 9.41 |
| 0.40 | 9.63 | 9.43 | 9.39 |
| 0.50 | 9.60 | 9.42 | 9.38 |
| 1.00 | 9.58 | 9.41 | 9.38 |
### LongBench

| Task | 0-4k (Dense / Sparse) | 4-8k (Dense / Sparse) | 8k+ (Dense / Sparse) |
|---|---|---|---|
| hotpotqa | 56.86 / 55.65 | 52.74 / 52.14 | 55.59 / 55.65 |
| trec | 61.00 / 61.00 | 73.00 / 73.00 | 70.00 / 71.00 |
| 2wikimqa | 50.74 / 50.57 | 48.59 / 48.51 | 31.51 / 31.66 |
| multi_news | 23.72 / 25.84 | 21.93 / 22.03 | 20.78 / 22.01 |
| lcc | 60.94 / 62.08 | 64.99 / 66.71 | 58.84 / 62.83 |
| qasper | 44.45 / 46.00 | 33.69 / 33.26 | 29.21 / 29.90 |
| passage_count | 20.00 / 19.00 | 7.00 / 7.00 | 8.00 / 7.00 |
| passage_retrieval_en | 97.00 / 97.00 | 89.00 / 88.00 | 81.14 / 81.83 |
| triviaqa | 88.02 / 86.02 | 87.82 / 87.99 | 88.98 / 88.27 |
| samsum | 41.38 / 41.97 | 39.00 / 39.85 | 45.72 / 45.34 |
| gov_report | 31.44 / 34.43 | 31.34 / 32.60 | 29.68 / 31.54 |
| repobench-p | 65.34 / 65.58 | 61.06 / 62.66 | 57.17 / 57.07 |
| multifieldqa_en | 57.50 / 56.02 | 46.61 / 46.33 | 50.16 / 49.34 |
| average score | 53.72 / 53.94 | 50.52 / 50.78 | 48.21 / 48.73 |
| average density | 0.842 | 0.624 | 0.379 |
### LongBenchV2 CoT Benchmark

All SeerAttention models run with threshold=5e-4.

For the R1-distilled models, we remove the two-pass generation setup (think + summary) and instead directly ask the models to output the answer after thinking. The maximum generation length is set to 10240.
| Model | Overall | Easy | Hard | Short | Medium | Long |
|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 30.4 | 31.2 | 29.9 | 37.8 | 24.7 | 29.6 |
| SeerAttention-Llama-3.1-8B | 31.6 | 33.3 | 30.5 | 33.9 | 31.6 | 27.8 |
| Qwen2.5-14B-Instruct | 34.8 | 37.5 | 33.1 | 44.4 | 32.1 | 24.1 |
| SeerAttention-Qwen2.5-14B | 32.8 | 38.0 | 29.6 | 45.0 | 30.2 | 17.6 |
| Qwen2.5-32B-Instruct | 36.4 | 42.2 | 32.8 | 47.8 | 29.8 | 30.6 |
| SeerAttention-Qwen2.5-32B | 36.4 | 41.1 | 33.4 | 49.4 | 29.8 | 27.8 |
| DeepSeek-R1-Distill-Qwen-14B | 34.2 | 43.2 | 28.6 | 45.0 | 27.9 | 28.7 |
| SeerAttention-DeepSeek-R1-Distill-Qwen-14B | 31.6 | 35.9 | 28.9 | 41.7 | 26.0 | 25.9 |
| DeepSeek-R1-Distill-Qwen-32B | 37.2 | 42.7 | 33.8 | 47.2 | 35.8 | 23.1 |
| SeerAttention-DeepSeek-R1-Distill-Qwen-32B | 37.0 | 42.2 | 33.8 | 49.4 | 31.6 | 26.9 |