HongyuanTao commited on
Commit
5032dc0
·
verified ·
1 Parent(s): 6cf7380

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +15 -3
README.md CHANGED
@@ -1,3 +1,15 @@
1
- ---
2
- license: mit
3
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: mit
3
+ ---
4
+ ## Introduction
5
+ We propose mmMamba, the first decoder-only multimodal state space model achieved through quadratic to linear distillation using moderate academic computing resources. Unlike existing linear-complexity encoder-based multimodal large language models (MLLMs), mmMamba eliminates the need for separate vision encoders and underperforming pre-trained RNN-based LLMs. Through our seeding strategy and three-stage progressive distillation recipe, mmMamba effectively transfers knowledge from quadratic-complexity decoder-only pre-trained MLLMs while preserving multimodal capabilities. Additionally, mmMamba introduces flexible hybrid architectures that strategically combine Transformer and Mamba layers, enabling customizable trade-offs between computational efficiency and model performance.
6
+
7
+ Distilled from the decoder-only HoVLE-2.6B, our pure Mamba-2-based mmMamba-linear achieves performance competitive with existing linear and quadratic-complexity VLMs, including those with 2x larger parameter size like EVE-7B. The hybrid variant, mmMamba-hybrid, further enhances performance across all benchmarks, approaching the capabilities of the teacher model HoVLE. In long-context scenarios with 103K tokens, mmMamba-linear demonstrates remarkable efficiency gains with a 20.6× speedup and 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves a 13.5× speedup and 60.2% memory savings.
8
+
9
+ <div align="center">
10
+ <img src="assets/teaser.png" />
11
+
12
+
13
+ <b>Seeding strategy and three-stage distillation pipeline of mmMamba.</b>
14
+ <img src="assets/pipeline.png" />
15
+ </div>