metadata
library_name: transformers
license: mit
pipeline_tag: text-generation

Model Card for TokenSwift-DeepSeek-R1-Distill-Qwen-32B

This model implements TokenSwift, a framework that accelerates text generation for long sequences (up to 100K tokens), as described in From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens.

Model Details

Model Description

This model is a fine-tuned version of Qwen2.5 32B, adapted for efficient long-sequence text generation with the TokenSwift framework. TokenSwift achieves lossless acceleration by using a tree-based attention mechanism to construct candidate tokens and then verifying those candidates against the full model with a KV cache. This approach significantly reduces generation time while preserving output quality.
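
The sketch below is a toy, self-contained illustration of the general draft-then-verify idea behind this kind of lossless acceleration. The "full model" and "drafter" functions are invented for illustration only; they are not the TokenSwift implementation, whose drafting and tree-based verification live in the repository's custom generation code.

# Toy illustration of draft-then-verify decoding (not the TokenSwift code).

def full_model_next(context):
    # Stand-in for one expensive full-model decoding step (arbitrary toy rule).
    return (context[-1] * 7 + 3) % 100

def draft_continuation(context, k=4):
    # Stand-in for the cheap drafter: it only approximates the full model's
    # rule, so some of its guesses will be accepted and some rejected.
    draft = list(context)
    for _ in range(k):
        nxt = (draft[-1] * 7 + 3) % 100 if draft[-1] < 50 else 0
        draft.append(nxt)
    return draft[len(context):]

def generate(context, max_new_tokens):
    context = list(context)
    while max_new_tokens > 0:
        accepted = 0
        for tok in draft_continuation(context):
            # Accept drafted tokens only while they match what the full model
            # would have produced; this is what keeps the speedup lossless.
            # (The real framework verifies all drafted positions in a single
            # batched forward pass over the KV cache; the toy checks one by one.)
            if tok != full_model_next(context):
                break
            context.append(tok)
            accepted += 1
            max_new_tokens -= 1
            if max_new_tokens == 0:
                return context
        if accepted == 0:
            # No draft token was accepted: fall back to one ordinary step.
            context.append(full_model_next(context))
            max_new_tokens -= 1
    return context

print(generate([1], max_new_tokens=12))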

  • Developed by: BigAI NLCO
  • License: MIT
  • Finetuned from model: Qwen2.5 32B

Model Sources

  • Paper: https://arxiv.org/abs/2502.18890

Uses

Direct Use

This model can be used directly for generating long sequences of text. See the code example below for how to get started.

Downstream Use

This model can be further fine-tuned for specific downstream tasks requiring long sequence generation.

Out-of-Scope Use

This model is not intended for short-text generation or for other NLP tasks such as classification or translation. It is also not suitable for generating malicious or harmful content.

Bias, Risks, and Limitations

As a large language model, this model may exhibit biases present in the training data. It is important to be aware of these potential biases and to use the model responsibly. Additionally, the model's performance may degrade on inputs significantly different from the training data.

How to Get Started with the Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TokenSwift/TokenSwift-DeepSeek-R1-Distill-Qwen-32B"

# trust_remote_code=True loads the custom TokenSwift generation code shipped with the checkpoint;
# bf16 keeps the 32B model's memory footprint manageable.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Example usage: generate a long continuation from a short prompt.
prompt = "Generate a long story about a futuristic city."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=10000)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

Training Details

Training Data

The model was trained on a filtered subset of the PG-19 dataset, with sequences longer than 8K tokens removed. Processed training data can be found at qwen2.5-pg19.
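
The snippet below is a rough sketch of that length filter, not the project's actual preprocessing. It assumes the public deepmind/pg19 dataset (with a text field) and the Qwen/Qwen2.5-32B tokenizer; the released qwen2.5-pg19 data was produced by the project's own pipeline.

from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical reconstruction of the 8K-token length filter described above.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B")
pg19 = load_dataset("deepmind/pg19", split="train")

def within_8k(example, max_tokens=8192):
    # Keep only documents whose tokenized length does not exceed 8K tokens.
    return len(tokenizer(example["text"]).input_ids) <= max_tokens

filtered = pg19.filter(within_8k)
print(f"kept {len(filtered)} of {len(pg19)} documents")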

Training Procedure

Details about the training procedure can be found in the associated paper and the GitHub repository.

Citation

@misc{wu2025hoursminuteslosslessacceleration,
      title={From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens}, 
      author={Tong Wu and Junzhe Shen and Zixia Jia and Yuxuan Wang and Zilong Zheng},
      year={2025},
      eprint={2502.18890},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.18890}, 
}