---
license: llama3.1
language:
- en
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
tags:
- large language models
- speech-language models
- speech interaction
- speech-to-speech
library_name: llama-omni
---

# 🦙🎧 LLaMA-Omni: Seamless Speech Interaction with Large Language Models

> **Authors: [Qingkai Fang](https://fangqingkai.github.io/), [Shoutao Guo](https://scholar.google.com/citations?hl=en&user=XwHtPyAAAAAJ), [Yan Zhou](https://zhouyan19.github.io/zhouyan/), [Zhengrui Ma](https://scholar.google.com.hk/citations?user=dUgq6tEAAAAJ), [Shaolei Zhang](https://zhangshaolei1998.github.io/), [Yang Feng*](https://people.ucas.edu.cn/~yangfeng?language=en)**

[[Paper]](https://arxiv.org/abs/2409.06666) [[Model]](https://huggingface.co/ICTNLP/Llama-3.1-8B-Omni) [[Code]](https://github.com/ictnlp/LLaMA-Omni)

LLaMA-Omni is a speech-language model built upon Llama-3.1-8B-Instruct. It supports low-latency and high-quality speech interactions, simultaneously generating both text and speech responses based on speech instructions.

![](images/model.png)

## 💡 Highlights

- 💪 **Built on Llama-3.1-8B-Instruct, ensuring high-quality responses.**

- 🚀 **Low-latency speech interaction with a latency as low as 226ms.**

- 🎧 **Simultaneous generation of both text and speech responses.**

- ♻️ **Trained in less than 3 days using just 4 GPUs.**


<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65b7573482d384513443875e/dr4XWUxzuVQ52lBuzNBTt.mp4"></video>

## Install

1. Clone this repository.

```shell
git clone https://github.com/ictnlp/LLaMA-Omni
cd LLaMA-Omni
```

2. Install packages.

```shell
conda create -n llama-omni python=3.10
conda activate llama-omni
pip install pip==24.0
pip install -e .
```

3. Install `fairseq`.

```shell
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install -e . --no-build-isolation
```

4. Install `flash-attention`.

```shell
pip install flash-attn --no-build-isolation
```
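
After the four steps above, a quick import check can confirm that the environment is usable. The sketch below is not part of the repository. It assumes the installed packages are importable as `omni_speech`, `fairseq`, `flash_attn`, and `whisper`; the first and last appear elsewhere in this README, while the other two are the usual import names for those projects.

```python
import importlib.util

# Import names assumed for the packages installed in the steps above.
REQUIRED_MODULES = ["omni_speech", "fairseq", "flash_attn", "whisper"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the entries of `names` that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

An empty result only shows that the packages can be found, not that they were built correctly for your GPU.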

## Quick Start

1. Download the `Llama-3.1-8B-Omni` model from 🤗[Huggingface](https://huggingface.co/ICTNLP/Llama-3.1-8B-Omni).

2. Download the `Whisper-large-v3` model.

```python
import whisper
model = whisper.load_model("large-v3", download_root="models/speech_encoder/")
```

3. Download the unit-based HiFi-GAN vocoder.

```shell
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/g_00500000 -P vocoder/
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/config.json -P vocoder/
```
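
Before launching anything, it can help to confirm that all three downloads landed where the later commands expect them. The following is a minimal sketch, not part of the repository. It assumes that the model sits in a `Llama-3.1-8B-Omni` directory (the `--model-path` used in the demo commands), that Whisper stored its checkpoint as `large-v3.pt` under `models/speech_encoder/`, and that the optional `download_model` helper can use `snapshot_download` from the `huggingface_hub` package, which the install steps do not list explicitly.

```python
from pathlib import Path

# Paths assumed from the commands in this README; adjust them if yours differ.
EXPECTED_ASSETS = [
    "Llama-3.1-8B-Omni",                  # model directory (assumed location)
    "models/speech_encoder/large-v3.pt",  # Whisper checkpoint (assumed file name)
    "vocoder/g_00500000",                 # HiFi-GAN vocoder weights
    "vocoder/config.json",                # HiFi-GAN vocoder config
]


def missing_assets(root=".", assets=EXPECTED_ASSETS):
    """Return the relative paths from `assets` that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in assets if not (root / rel).exists()]


def download_model(local_dir="Llama-3.1-8B-Omni"):
    """Optionally fetch the model with huggingface_hub (assumed to be installed)."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    return snapshot_download(repo_id="ICTNLP/Llama-3.1-8B-Omni", local_dir=local_dir)


if __name__ == "__main__":
    for rel in missing_assets():
        print("missing:", rel)
```

Run it from the repository root; no output means every expected file or directory was found.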

## Gradio Demo

1. Launch a controller.
```shell
python -m omni_speech.serve.controller --host 0.0.0.0 --port 10000
```

2. Launch a gradio web server.
```shell
python -m omni_speech.serve.gradio_web_server --controller http://localhost:10000 --port 8000 --model-list-mode reload --vocoder vocoder/g_00500000 --vocoder-cfg vocoder/config.json
```

3. Launch a model worker.
```shell
python -m omni_speech.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path Llama-3.1-8B-Omni --model-name Llama-3.1-8B-Omni --s2s
```

4. Visit [http://localhost:8000/](http://localhost:8000/) and interact with LLaMA-3.1-8B-Omni!

**Note: Due to the instability of streaming audio playback in Gradio, we have only implemented streaming audio synthesis without enabling autoplay. If you have a good solution, feel free to submit a PR. Thanks!**
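
The three launch commands above can also be started from a single script. This is a hedged sketch rather than a tool shipped with the repository: `build_commands` and `launch_all` are hypothetical helper names, and the pause between launches is a guess, since the README does not say how long each service needs to come up. The flags themselves are copied from the commands in this section.

```python
import subprocess
import sys
import time

CONTROLLER_URL = "http://localhost:10000"


def build_commands(model_path="Llama-3.1-8B-Omni", python=sys.executable):
    """Return the three demo commands from this section, in launch order."""
    controller = [python, "-m", "omni_speech.serve.controller",
                  "--host", "0.0.0.0", "--port", "10000"]
    web_server = [python, "-m", "omni_speech.serve.gradio_web_server",
                  "--controller", CONTROLLER_URL, "--port", "8000",
                  "--model-list-mode", "reload",
                  "--vocoder", "vocoder/g_00500000",
                  "--vocoder-cfg", "vocoder/config.json"]
    worker = [python, "-m", "omni_speech.serve.model_worker",
              "--host", "0.0.0.0", "--controller", CONTROLLER_URL,
              "--port", "40000", "--worker", "http://localhost:40000",
              "--model-path", model_path,
              "--model-name", "Llama-3.1-8B-Omni", "--s2s"]
    return [controller, web_server, worker]


def launch_all(delay_seconds=5.0):
    """Start controller, web server, and worker in order; the delay is an assumption."""
    processes = []
    for command in build_commands():
        processes.append(subprocess.Popen(command))
        time.sleep(delay_seconds)  # give each service time to start listening
    return processes
```

Calling `launch_all()` returns the three `subprocess.Popen` handles, so the caller can `wait()` on them or `terminate()` them when finished.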

## Local Inference

To run inference locally, organize the speech instruction files according to the format used in the `omni_speech/infer/examples` directory, then run the following script.
```shell
bash omni_speech/infer/run.sh omni_speech/infer/examples
```

## LICENSE

Our code is released under the Apache-2.0 License. Because our model is built on Llama 3.1, it must comply with the [Llama 3.1 License](https://llama.meta.com/llama3_1/license/).


## Acknowledgements

- [LLaVA](https://github.com/haotian-liu/LLaVA): The codebase we built upon.
- [SLAM-LLM](https://github.com/X-LANCE/SLAM-LLM): We borrowed some code for the speech encoder and speech adaptor.

## Citation

If you have any questions, please feel free to submit an issue or contact `[email protected]`.

If our work is useful for you, please cite as:

```
@article{fang-etal-2024-llama-omni,
  title={LLaMA-Omni: Seamless Speech Interaction with Large Language Models},
  author={Fang, Qingkai and Guo, Shoutao and Zhou, Yan and Ma, Zhengrui and Zhang, Shaolei and Feng, Yang},
  journal={arXiv preprint arXiv:2409.06666},
  year={2024}
}
```