---
language:
- en
tags:
- audio-text-to-text
- speech-translation
- speech-understanding
- audio
- chat
license: apache-2.0
datasets:
- custom
metrics:
- wer
- bleu
- AIR-Bench
---
<div align="center">
<h1>
Soundwave: Less is More for Speech-Text Alignment in LLMs
</h1>
</div>
<p align="center">
<font size="3"><a href="https://github.com/FreedomIntelligence/Soundwave">🐈‍⬛ Github</a> | <a href="https://arxiv.org/abs/2502.12900">📃 Paper</a> | <a href="https://huggingface.co/spaces/FreedomIntelligence/SoundwaveDemo">📼 Online Demo</a></font>
</p>
## Model Description
Soundwave is a speech-to-text model that bridges the gap between speech and text in LLMs. Trained on only 10k hours of data, it delivers strong performance on speech translation and on the AIR-Bench speech tasks.
### Key Features
<div>
<ul>
<font size="3"><li>A Speech-to-Text Model Bridging the Gap Between Speech and Text</li></font>
<font size="3"><li>Utilizes Data-Efficient Strategy and Unique Architecture, Trained on Only 10k Hours of Data</li></font>
<font size="3"><li>Exceptional Performance in Speech Translation and AIR-Bench Speech Tasks</li></font>
<font size="3"><li>Retains Intelligence During Conversations, Ideal for Interactive Tasks</li></font>
</ul>
</div>
## Usage
Load the Soundwave model and run inference with your audio files as shown in the <a href="https://github.com/FreedomIntelligence/Soundwave">GitHub repository</a>.
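
Before passing a recording to the inference script in the repository, it can help to check the file's basic properties. The sketch below uses only Python's standard-library `wave` module. The helper names `describe_wav` and `write_silence` are hypothetical, and the 16 kHz mono default is an assumption for illustration, not a requirement stated by the Soundwave authors; consult the repository for the actual input format.

```python
import wave


def describe_wav(path):
    """Return channel count, sample rate, and duration (seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


def write_silence(path, seconds=1.0, sample_rate=16000):
    """Write a mono 16-bit silent WAV file (hypothetical smoke-test input)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)  # assumed rate; not taken from the model card
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
```

Calling `describe_wav("my_audio.wav")` on your own file reports whether it is mono and which sample rate it uses, which makes it easier to spot a mismatch before running the model.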
## <span>📖 Citation</span>
```
@article{zhang2025soundwave,
title={Soundwave: Less is More for Speech-Text Alignment in LLMs},
author={Zhang, Yuhao and Liu, Zhiheng and Bu, Fan and Zhang, Ruiyu and Wang, Benyou and Li, Haizhou},
journal={arXiv preprint arXiv:2502.12900},
year={2025}
}
```