---
language:
- en
tags:
- speech-to-text
- speech-translation
- conversational-AI
- speech-understanding
- whisper
license: apache-2.0
datasets:
- custom
metrics:
- wer
- bleu
- AIR-Bench
---

# Soundwave: Less is More for Speech-Text Alignment in LLMs
<p align="center">
  <font size="3"><a href="https://github.com/FreedomIntelligence/Soundwave">🐈‍⬛ GitHub</a>&nbsp;|&nbsp;<a href="https://arxiv.org/abs/2502.12900">📃 Paper</a>&nbsp;|&nbsp;<a href="https://huggingface.co./spaces/FreedomIntelligence/SoundwaveDemo">📼 Online Demo</a></font>
</p>

## Model Description
Soundwave is a speech-to-text model that bridges the gap between speech and text in LLMs. Trained on only 10k hours of data, it performs strongly on speech translation and on the AIR-Bench speech tasks.

### Key Features
<div>
  <ul>
    <font size="3"><li>A Speech-to-Text Model Bridging the Gap Between Speech and Text</li></font>
    <font size="3"><li>Utilizes Data-Efficient Strategy and Unique Architecture, Trained on Only 10k Hours of Data</li></font>
    <font size="3"><li>Exceptional Performance in Speech Translation and AIR-Bench Speech Tasks</li></font>
    <font size="3"><li>Retains Intelligence During Conversations, Ideal for Interactive Tasks</li></font>
  </ul>
</div>

## Usage
Load the Soundwave model and run inference with your audio files as shown in the <a href="https://github.com/FreedomIntelligence/Soundwave">GitHub repository</a>.
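
Before running inference, it can help to confirm what format your audio files are in. The sketch below uses only the Python standard library to report the sample rate, channel count, and duration of a PCM WAV file. The 16 kHz mono target is an assumption based on this card's `whisper` tag, not a requirement stated here, and `describe_wav` is a hypothetical helper rather than part of the Soundwave code; the repository's own scripts remain the authority on preprocessing.

```python
import wave

# Assumption: 16 kHz mono is the usual input format for Whisper-style encoders.
# This card does not state Soundwave's required format; check the repository.
ASSUMED_SAMPLE_RATE = 16_000


def describe_wav(path):
    """Return basic properties of a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "duration_seconds": frames / sample_rate,
        # True only when the file matches the assumed 16 kHz mono format.
        "matches_assumed_format": (
            sample_rate == ASSUMED_SAMPLE_RATE and channels == 1
        ),
    }
```

Calling `describe_wav("my_audio.wav")` on one of your own files (the filename is a placeholder) returns a dictionary you can inspect before passing the audio to the inference script from the repository.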

## <span>📖 Citation</span>
```
@article{zhang2025soundwave,
  title={Soundwave: Less is More for Speech-Text Alignment in LLMs},
  author={Zhang, Yuhao and Liu, Zhiheng and Bu, Fan and Zhang, Ruiyu and Wang, Benyou and Li, Haizhou},
  journal={arXiv preprint arXiv:2502.12900},
  year={2025}
}
```