---
license: mit
library_name: transformers
base_model:
  - deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
base_model_relation: adapter
---


## SeerAttention-DeepSeek-R1-Distill-Qwen-14B-AttnGates

This repo contains only the AttnGate weights for deepseek-ai/DeepSeek-R1-Distill-Qwen-14B.

[SeerAttention](https://arxiv.org/abs/2410.13276) introduces learnable AttnGate modules to accelerate the computationally intensive prefill stage of long-context large language models (LLMs) via dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework, where they learn to mimic the 2D max-pooled attention patterns of the original frozen model, preserving its integrity while avoiding costly retraining. During inference, these gates generate block-sparse binary masks by applying a threshold or TopK selection to their learned soft scores, enabling efficient computation through a custom block-sparse FlashAttention kernel.

Original GitHub repo: https://github.com/microsoft/SeerAttention
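
The thresholding step can be illustrated with a short, self-contained sketch. The snippet below is not the repo's kernel or API; it only shows, under simplified assumptions (soft gate scores already computed per block, a fixed threshold, diagonal blocks always kept), how soft AttnGate scores are turned into the block-level binary mask that a block-sparse attention kernel would consume.

```python
import torch

def block_sparse_mask(gate_scores: torch.Tensor, threshold: float = 5e-4) -> torch.Tensor:
    """Turn soft AttnGate scores into a block-level binary mask (illustrative only).

    gate_scores: [num_heads, num_query_blocks, num_key_blocks] soft scores,
                 e.g. normalized per query block.
    Returns a boolean mask of the same shape; True blocks are computed by the
    block-sparse attention kernel, False blocks are skipped.
    """
    mask = gate_scores >= threshold
    # Always keep the diagonal (local) blocks so every query block attends to itself.
    num_q, num_k = gate_scores.shape[-2], gate_scores.shape[-1]
    diag = torch.eye(num_q, num_k, dtype=torch.bool, device=gate_scores.device)
    return mask | diag

# Toy example: 4 heads, 8 query blocks, 8 key blocks of soft scores.
scores = torch.softmax(torch.randn(4, 8, 8), dim=-1)
mask = block_sparse_mask(scores, threshold=5e-4)
print(mask.shape, mask.float().mean().item())  # fraction of blocks kept
```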


## LongBenchV2 CoT Benchmark

All SeerAttention models run with a gate threshold of 5e-4.

For the R1-Distill models, we drop the two-pass generation setup (think + summary) and instead ask the models to output the answer directly after thinking. The maximum generation length is set to 10240 tokens.
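
As a rough illustration of this single-pass setup (not the actual evaluation harness, which lives in the GitHub repo), a plain `transformers` generation call with the 10240-token cap could look like the sketch below. The gate threshold itself is applied inside the SeerAttention attention kernel and is not a standard `generate` argument.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base model only; the AttnGate weights from this repo are loaded separately
# via the SeerAttention codebase.
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "..."  # a LongBenchV2 question together with its long context
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Single pass: the model thinks and then emits the answer in the same generation,
# capped at 10240 new tokens as in the setup described above.
outputs = model.generate(**inputs, max_new_tokens=10240, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```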



| Model | Overall | Easy | Hard | Short | Medium | Long |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| [Llama-3.1-8B-Instruct](https://huggingface.co./meta-llama/Llama-3.1-8B-Instruct) | 30.4 | 31.2 | 29.9 | 37.8 | 24.7 | 29.6 |
| [SeerAttention-Llama-3.1-8B](https://huggingface.co./SeerAttention/SeerAttention-Llama-3.1-8B-AttnGates) | 31.6 | 33.3 | 30.5 | 33.9 | 31.6 | 27.8 |
| [Qwen2.5-14B-Instruct](https://huggingface.co./Qwen/Qwen2.5-14B-Instruct) | 34.8 | 37.5 | 33.1 | 44.4 | 32.1 | 24.1 |
| [SeerAttention-Qwen2.5-14B](https://huggingface.co./SeerAttention/SeerAttention-Qwen2.5-14B-AttnGates) | 32.8 | 38.0 | 29.6 | 45.0 | 30.2 | 17.6 |
| [Qwen2.5-32B-Instruct](https://huggingface.co./Qwen/Qwen2.5-32B-Instruct) | 36.4 | 42.2 | 32.8 | 47.8 | 29.8 | 30.6 |
| [SeerAttention-Qwen2.5-32B](https://huggingface.co./SeerAttention/SeerAttention-Qwen2.5-32B-AttnGates) | 36.4 | 41.1 | 33.4 | 49.4 | 29.8 | 27.8 |
| [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co./deepseek-ai/DeepSeek-R1-Distill-Qwen-14B) | 34.2 | 43.2 | 28.6 | 45.0 | 27.9 | 28.7 |
| [SeerAttention-DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co./SeerAttention/SeerAttention-DeepSeek-R1-Distill-Qwen-14B-AttnGates) | 31.6 | 35.9 | 28.9 | 41.7 | 26.0 | 25.9 |
| [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co./deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | 37.2 | 42.7 | 33.8 | 47.2 | 35.8 | 23.1 |
| [SeerAttention-DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co./SeerAttention/SeerAttention-DeepSeek-R1-Distill-Qwen-32B-AttnGates) | 37.0 | 42.2 | 33.8 | 49.4 | 31.6 | 26.9 |