gpt-omni committed
Commit 45cc0ed • 1 Parent(s): 166ead8

Update README.md

Files changed (1):
1. README.md +140 -137

README.md CHANGED

---
license: mit
---

# Mini-Omni2

<!-- <p align="center">
    <img src="./data/figures/title.png" width="100%"/>
</p> -->


<p align="center">
🤗 <a href="https://huggingface.co/gpt-omni/mini-omni2">Hugging Face</a> | 📖 <a href="https://github.com/gpt-omni/mini-omni2">Github</a>
| 📑 <a href="https://arxiv.org/abs/2410.11190">Technical report</a>
</p>

Mini-Omni2 is an **omni-interactive** model. It can **understand image, audio and text inputs and hold end-to-end voice conversations with users**. It features **real-time voice output**, **omni-capable multimodal understanding**, and flexible **interaction with an interruption mechanism while speaking**.

<p align="center">
    <img src="./data/figures/framework.jpeg" width="100%"/>
</p>

## Updates

- **2024.10:** Released the model, technical report, and inference and chat demo code.

## Features

✅ **Multimodal interaction**: understands images, speech, and text, just like GPT-4o.

✅ **Real-time speech-to-speech** conversational capabilities. No extra ASR or TTS models required, just like [Mini-Omni](https://github.com/gpt-omni/mini-omni).

<!-- ✅ **Streaming audio output**: with first-chunk latency of audio stream less than 0.3s. -->

<!-- ✅ **Duplex interaction**: hearing while speaking, it can be interrupted by key words like "stop omni". -->

## Demo

NOTE: you need to unmute the video first.

https://github.com/user-attachments/assets/ad97ca7f-f8b4-40c3-a7e8-fa54b4edf155


## ToDo

- [ ] Update the interruption mechanism

## Install

Create a new conda environment and install the required packages:

```sh
conda create -n omni python=3.10
conda activate omni

git clone https://github.com/gpt-omni/mini-omni2.git
cd mini-omni2
pip install -r requirements.txt
```
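
After installation, an optional sanity check is to confirm that the core deep-learning dependency imports cleanly, e.g. `python -c "import torch; print(torch.__version__)"`. The authoritative dependency list is requirements.txt; `torch` is used here only as a representative example.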

## Quick start

**Interactive demo**

- Start the server

NOTE: you need to start the server before running the Streamlit or Gradio demo, with `API_URL` set to the server address.

```sh
sudo apt-get install ffmpeg
conda activate omni
cd mini-omni2
python3 server.py --ip '0.0.0.0' --port 60808
```
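
Optionally, you can check that the server is reachable before launching a demo, for example with `curl -i http://0.0.0.0:60808/chat`. Any HTTP response (whatever its status code; the `/chat` endpoint's request format is not documented here) confirms that something is listening on port 60808.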

- Run the Streamlit demo

NOTE: you need to run Streamlit **locally** with PyAudio installed.

```sh
pip install PyAudio==0.2.14
API_URL=http://0.0.0.0:60808/chat streamlit run webui/omni_streamlit.py
```
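
If the server runs on a different machine, point `API_URL` at that machine instead of `0.0.0.0`, e.g. `API_URL=http://<server-ip>:60808/chat streamlit run webui/omni_streamlit.py`, where `<server-ip>` is a placeholder for your server's IP address or hostname.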

**Local test**

```sh
conda activate omni
cd mini-omni2
# test run the preset audio samples and questions
python inference_vision.py
```

## Mini-Omni2 Overview

**1. Multimodal Modeling**:
We use multiple sequences as the input and output of the model. On the input side, we concatenate image, audio, and text features to perform a series of comprehensive tasks, as shown in the figures below. On the output side, we use text-guided delayed parallel output to generate real-time speech responses.
<p align="center">
    <img src="./data/figures/inputids.png" width="100%"/>
</p>
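
The following is a minimal, self-contained sketch of these two ideas: concatenating modality features on the input side, and laying out one text stream plus several audio-code streams with a small delay on the output side. All names, shapes, the one-step delay, and the number of audio codebook layers are illustrative assumptions, not the repository's actual implementation.

```python
# Illustrative sketch only; shapes, names, and the delay schedule are assumptions.
import torch

def build_input_sequence(image_feats, audio_feats, text_embeds):
    """Concatenate per-modality features along the sequence dimension.
    image_feats: (n_img, d)  e.g. projected CLIP features
    audio_feats: (n_aud, d)  e.g. projected Whisper encoder states
    text_embeds: (n_txt, d)  embedded text tokens
    """
    return torch.cat([image_feats, audio_feats, text_embeds], dim=0)

def delayed_parallel_layout(text_tokens, audio_code_layers, delay=1, pad=0):
    """Arrange one text stream and several audio-code streams so that audio
    layer k starts (k + 1) * delay steps behind the text that guides it."""
    n_layers = len(audio_code_layers)
    total = len(text_tokens) + n_layers * delay
    rows = [list(text_tokens) + [pad] * (total - len(text_tokens))]
    for k, codes in enumerate(audio_code_layers):
        row = [pad] * ((k + 1) * delay) + list(codes)
        rows.append((row + [pad] * total)[:total])
    return rows  # generated step by step: text first, audio follows with a small lag

# Toy usage: a 78-step multimodal input, and a 3-step text stream guiding 2 audio layers.
img, aud, txt = torch.randn(16, 8), torch.randn(50, 8), torch.randn(12, 8)
print(build_input_sequence(img, aud, txt).shape)          # torch.Size([78, 8])
print(delayed_parallel_layout([101, 102, 103], [[7, 8, 9], [4, 5, 6]]))
# [[101, 102, 103, 0, 0], [0, 7, 8, 9, 0], [0, 0, 4, 5, 6]]
```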

**2. Multi-stage Training**:
We propose an efficient alignment training method with three stages: encoder adaptation, modal alignment, and multimodal fine-tuning.
<p align="center">
    <img src="./data/figures/training.jpeg" width="100%"/>
</p>
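
One way to picture the staged recipe is as progressively enlarging the set of trainable parameter groups. The sketch below is only a schematic under that assumption; the module names (`vision_adapter`, `audio_adapter`, `llm`, `audio_head`) and the exact stage-to-parameter mapping are hypothetical stand-ins, not the project's training code. See the technical report for the actual recipe.

```python
# Schematic of three-stage training; which parameters train in each stage is an assumption.
import torch

class OmniModel(torch.nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.vision_adapter = torch.nn.Linear(d, d)  # stand-in: projects image features
        self.audio_adapter = torch.nn.Linear(d, d)   # stand-in: projects audio features
        self.llm = torch.nn.Linear(d, d)             # stand-in for the language-model backbone
        self.audio_head = torch.nn.Linear(d, d)      # stand-in for the speech-token output head

def set_trainable(model, submodule_names):
    """Freeze all parameters, then unfreeze the listed submodules."""
    for p in model.parameters():
        p.requires_grad = False
    for name in submodule_names:
        for p in getattr(model, name).parameters():
            p.requires_grad = True

model = OmniModel()
stages = {
    "stage 1 (encoder adaptation)":   ["vision_adapter", "audio_adapter"],
    "stage 2 (modal alignment)":      ["vision_adapter", "audio_adapter", "llm"],
    "stage 3 (multimodal fine-tune)": ["vision_adapter", "audio_adapter", "llm", "audio_head"],
}
for stage, groups in stages.items():
    set_trainable(model, groups)
    n = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"{stage}: {n} trainable parameters")
    # ... run this stage's training loop on the corresponding data mix ...
```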

<!-- **3. Cases**:
Here are more cases of Mini-Omni2:
<p align="center">
    <img src="./data/figures/samples.png" width="100%"/>
</p> -->

## FAQ

**1. Does the model support other languages?**

No, the model is trained only on English. However, since we use Whisper as the audio encoder, the model can understand other languages that Whisper supports (such as Chinese), but the output is only in English.

**2. Error: cannot run Streamlit in a local browser with a remote Streamlit server**

You need to start Streamlit **locally** with PyAudio installed.

## Acknowledgements

- [Qwen2](https://github.com/QwenLM/Qwen2/) as the LLM backbone.
- [litGPT](https://github.com/Lightning-AI/litgpt/) for training and inference.
- [whisper](https://github.com/openai/whisper/) for audio encoding.
- [clip](https://github.com/openai/CLIP) for image encoding.
- [snac](https://github.com/hubertsiuzdak/snac/) for audio decoding.
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) for generating synthetic speech.
- [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) and [MOSS](https://github.com/OpenMOSS/MOSS/tree/main) for alignment.

<!-- ## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=gpt-omni/mini-omni2&type=Date)](https://star-history.com/#gpt-omni/mini-omni2&Date)