ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
Abstract
As is known, hybrid models combining quadratic and subquadratic attention in multi-head architectures have surpassed both Transformer and linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. To further study expressiveness, we introduce a series of models distilled from Qwen 2.5 and based on pure native RWKV-7 attention, which aims to make RNNs more expressive and demonstrates state-tracking ability beyond Transformers. We also work with QRWK 32B, based on the RWKV-6 architecture, another approach that reduces the entire knowledge-processing time to just 8 hours on 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can use any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens. We explain the detailed process and share our insights on building more powerful foundation models. Please note that this is ongoing work that will be updated continuously. The model checkpoints and source code are available at https://github.com/yynil/RWKVInside and https://huggingface.co./RWKV-Red-Team/ARWKV-7B-Preview-0.1.
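To make the attention-replacement idea concrete, below is a minimal, hypothetical PyTorch sketch of the kind of pipeline the abstract describes: each Transformer layer's self-attention is swapped for a recurrent RWKV-style time-mix block, and the resulting student is distilled against the frozen teacher. The `RWKV7TimeMix` module, the `student.model.layers` layout, and the plain logit-level KL loss are illustrative assumptions, not the released RWKVInside implementation.

```python
# Hypothetical sketch of attention-replacement distillation.
# RWKV7TimeMix is a stand-in for a real RWKV-7 time-mix implementation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class RWKV7TimeMix(nn.Module):
    """Placeholder recurrent token-mixing block with the same I/O shape as
    multi-head self-attention: (batch, seq, hidden) -> (batch, seq, hidden)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.receptance = nn.Linear(hidden_size, hidden_size, bias=False)
        self.key = nn.Linear(hidden_size, hidden_size, bias=False)
        self.value = nn.Linear(hidden_size, hidden_size, bias=False)
        self.output = nn.Linear(hidden_size, hidden_size, bias=False)
        self.decay = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        r = torch.sigmoid(self.receptance(x))
        k, v = self.key(x), self.value(x)
        w = torch.exp(-torch.exp(self.decay))  # per-channel decay in (0, 1)
        state = torch.zeros(b, d, device=x.device, dtype=x.dtype)
        outs = []
        for i in range(t):  # naive recurrence; real kernels are chunked/parallel
            state = w * state + k[:, i] * v[:, i]
            outs.append(r[:, i] * state)
        return self.output(torch.stack(outs, dim=1))


def build_student(teacher: nn.Module, hidden_size: int) -> nn.Module:
    """Copy the teacher and swap every self-attention module for a time-mix block,
    leaving MLPs and embeddings frozen so only the new blocks are trained."""
    student = copy.deepcopy(teacher)
    for layer in student.model.layers:          # assumes a Qwen-style layer list
        layer.self_attn = RWKV7TimeMix(hidden_size)
    for name, p in student.named_parameters():
        p.requires_grad = "self_attn" in name   # train only the new recurrent blocks
    return student


def distill_step(teacher, student, input_ids, optimizer, temperature=2.0):
    """One knowledge-distillation step: KL between teacher and student logits."""
    with torch.no_grad():
        t_logits = teacher(input_ids).logits
    s_logits = student(input_ids).logits
    vocab = s_logits.size(-1)
    loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1).view(-1, vocab),
        F.softmax(t_logits / temperature, dim=-1).view(-1, vocab),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A full pipeline would typically stage the training (e.g., first aligning the new blocks' outputs with the teacher's attention outputs, then distilling logits, then fine-tuning) and would rely on optimized recurrent kernels rather than the naive per-token loop shown here.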
Community
Our ongoing research investigates the potential of RNN-based attention mechanisms. Specifically, we are exploring how recurrent architectures can be integrated into the attention framework to enhance the model's expressive capacity and representational power.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- BabyHGRN: Exploring RNNs for Sample-Efficient Training of Language Models (2024)
- Does Self-Attention Need Separate Weights in Transformers? (2024)
- SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator (2024)
- LLMs are Also Effective Embedding Models: An In-depth Overview (2024)
- Lillama: Large Language Models Compression via Low-Rank Feature Distillation (2024)
- Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model (2024)
- Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference (2024)