Spaces: VoiceCloning-be committed "Update README.md"
Commit: 5211111 • Parent: 4efe6b5

README.md CHANGED
@@ -1,320 +1,8 @@

  <img alt="Fork" src="https://img.shields.io/github/forks/iahispano/applio?style=for-the-badge&color=FFFFFF" />
  <img alt="Issues" src="https://img.shields.io/github/issues/iahispano/applio?style=for-the-badge&color=FFFFFF" />
</p>

<p align="center">VITS-based Voice Conversion focused on simplicity, quality, and performance.</p>

<p align="center">
  <a href="https://applio.org" target="_blank">🌐 Website</a>
  •
  <a href="https://docs.applio.org" target="_blank">📚 Documentation</a>
  •
  <a href="https://discord.gg/iahispano" target="_blank">☎️ Discord</a>
</p>

<p align="center">
  <a href="https://github.com/IAHispano/Applio-Plugins" target="_blank">🛒 Plugins</a>
  •
  <a href="https://huggingface.co/IAHispano/Applio/tree/main/Compiled" target="_blank">📦 Compiled</a>
  •
  <a href="https://applio.org/playground" target="_blank">🎮 Playground</a>
  •
  <a href="https://colab.research.google.com/github/iahispano/applio/blob/master/assets/Applio.ipynb" target="_blank">🔎 Google Colab (UI)</a>
  •
  <a href="https://colab.research.google.com/github/iahispano/applio/blob/master/assets/Applio_NoUI.ipynb" target="_blank">🔎 Google Colab (No UI)</a>
</p>

## Table of Contents

- [Installation](#installation)
  - [Windows](#windows)
  - [macOS](#macos)
  - [Linux](#linux)
  - [Makefile](#makefile)
- [Usage](#usage)
  - [Windows](#windows-1)
  - [macOS](#macos-1)
  - [Linux](#linux-1)
  - [Makefile](#makefile-1)
- [Technical Information](#technical-information)
- [Repository Enhancements](#repository-enhancements)
- [Commercial Usage](#commercial-usage)
- [References](#references)
  - [Contributors](#contributors)

## Installation

Download the latest version from [GitHub Releases](https://github.com/IAHispano/Applio-RVC-Fork/releases) or use the [Compiled Versions](https://huggingface.co/IAHispano/Applio/tree/main/Compiled).

### Windows

```bash
./run-install.bat
```

### macOS

On macOS, install the requirements inside a Python 3.9 to 3.11 virtual environment. Here are the steps:

```bash
python3 -m venv .venv
source .venv/bin/activate
chmod +x run-install.sh
./run-install.sh
```

### Linux

Certain Linux-based operating systems may encounter complications with the installer. In such cases, we suggest installing `requirements.txt` within a Python 3.9 to 3.11 environment.

```bash
chmod +x run-install.sh
./run-install.sh
```

### Makefile

For platforms such as [Paperspace](https://www.paperspace.com/):

```bash
make run-install
```

## Usage

Visit [Applio Documentation](https://docs.applio.org/) for a detailed explanation of the UI.

### Windows

```bash
./run-applio.bat
```

### macOS

```bash
chmod +x run-applio.sh
./run-applio.sh
```

### Linux

```bash
chmod +x run-applio.sh
./run-applio.sh
```

### Makefile

For platforms such as [Paperspace](https://www.paperspace.com/):

```bash
make run-applio
```

## Technical Information

Applio uses an enhanced version of the Retrieval-based Voice Conversion (RVC) model, a powerful technique for transforming the voice in an audio signal to sound like another person. This advanced implementation of RVC enables high-quality voice conversion while maintaining simplicity and performance.

### 0. Pre-Learning: Key Concepts in Speech Processing and Voice Conversion

This section introduces fundamental concepts in speech processing and voice conversion, paving the way for a deeper understanding of the RVC pipeline:

#### 1. Speech Representation

- **Phoneme:** The smallest unit of sound in a language that distinguishes one word from another. Examples: /k/, /æ/, /t/.
- **Spectrogram:** A visual representation of the frequency content of a sound over time, showing how the intensity of different frequencies changes over the duration of the audio.
- **Mel-Spectrogram:** A type of spectrogram that mimics human auditory perception, emphasizing the frequencies that matter most to human hearing (a short sketch follows this list).
- **Speaker Embedding:** A vector representation that captures the unique acoustic characteristics of a speaker's voice, encoding information about pitch, tone, timbre, and other vocal qualities.
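
To make the mel-spectrogram concrete, here is a minimal sketch using `librosa` (not part of Applio's tooling; the file name and parameter values are illustrative):

```python
import librosa
import numpy as np

# Load audio; hypothetical file, resampled to 22050 Hz.
y, sr = librosa.load("voice_sample.wav", sr=22050)

# Mel-spectrogram: STFT -> mel filterbank -> power per band.
mel = librosa.feature.melspectrogram(
    y=y,
    sr=sr,
    n_fft=2048,      # STFT window size
    hop_length=512,  # stride between frames
    n_mels=128,      # number of mel bands
)

# Convert power to decibels, the scale usually visualized and used in losses.
mel_db = librosa.power_to_db(mel, ref=np.max)
print(mel_db.shape)  # (n_mels, frames)
```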

#### 2. Text-to-Speech (TTS)

- **TTS Model:** A machine learning model that generates artificial speech from written text.
- **Encoder-Decoder Architecture:** A common architecture in TTS models, where an encoder processes the text and pitch information to create a latent representation, and a decoder uses this representation to synthesize the audio signal.
- **Transformer Architecture:** A powerful neural network architecture, particularly well suited for sequence modeling, that allows the model to handle long sequences of text or audio and capture relationships between elements.

#### 3. Voice Conversion

- **Voice Conversion (VC):** The process of transforming the voice of a speaker in an audio signal to sound like another speaker.
- **Speaker Adaptation:** The process of adapting a TTS model to a specific speaker, often by training on a small dataset of the speaker's voice.
- **Retrieval-Based VC (RVC):** A voice conversion approach where speaker embeddings are retrieved from a database and used to guide the TTS model in synthesizing audio with the target speaker's voice.

#### 4. Additional Concepts

- **ContentVec:** A powerful self-supervised learning model for speech representation, excelling at capturing speaker-specific information.
- **FAISS:** A library for efficient similarity search, used to retrieve speaker embeddings similar to the extracted ContentVec embedding (see the retrieval sketch after this list).
- **Neural Source Filter (NSF):** A module that models audio generation as a filtering process, allowing the model to produce high-quality, realistic audio signals by learning complex relationships between the source signal and the output waveform.
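
As a rough illustration of this kind of retrieval, a minimal FAISS sketch (the embedding dimension and database here are invented; Applio's actual index layout may differ):

```python
import faiss
import numpy as np

dim = 256  # hypothetical embedding dimension
rng = np.random.default_rng(0)

# Pretend database of speaker-frame embeddings collected during training.
database = rng.standard_normal((10_000, dim)).astype("float32")

# Build a flat (exact) L2 index and add the database vectors.
index = faiss.IndexFlatL2(dim)
index.add(database)

# Query: one embedding extracted from the input audio.
query = rng.standard_normal((1, dim)).astype("float32")
distances, ids = index.search(query, k=4)  # 4 nearest neighbors

# The retrieved embeddings can later be blended with the query embedding.
retrieved = database[ids[0]]
print(distances, ids)
```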

#### 5. Why are these concepts important?

Understanding these concepts is essential for appreciating the mechanics and capabilities of the RVC pipeline:

- **Speech Representation:** Different representations capture different aspects of speech, allowing for effective analysis and manipulation.
- **TTS Models:** The TTS model forms the foundation of RVC, providing the ability to synthesize audio from text and pitch.
- **Voice Conversion:** Voice conversion aims to transfer a speaker's identity to a different audio signal.
- **ContentVec and Speaker Embeddings:** ContentVec provides a powerful way to extract speaker-specific information, which is crucial for accurate voice conversion.
- **FAISS:** This library enables efficient speaker-embedding retrieval, facilitating the selection of appropriate target voices.
- **NSF:** The NSF is a critical component of the TTS model, contributing to the generation of realistic, high-quality audio.

### 1. Model Architecture

The RVC model comprises two main components:

#### A. Encoder-Decoder Network

This network synthesizes audio from text and pitch information while incorporating speaker characteristics from the ContentVec embedding.

**Encoder:**

- **Input:** Phoneme sequences (text representation) and pitch information (optional).
- **Embeddings:**
  - Phonemes are mapped to dense vectors through linear layers, creating a continuous representation of the text input.
  - Pitch is usually converted to a one-hot encoding or a continuous value and embedded similarly.
- **Transformer Encoder:** Processes the embedded features in a highly parallel manner. It employs:
  - **Self-Attention:** Allows the encoder to attend to different parts of the input sequence, capturing the relationships between elements and their context.
  - **Feedforward Networks (FFN):** Apply non-linear transformations to further refine the features captured by self-attention.
  - **Layer Normalization:** Stabilizes training and improves performance by normalizing the outputs of each layer.
  - **Dropout:** A regularization technique to prevent overfitting.
- **Output:** A latent representation of the input text and pitch, capturing their relationships and serving as the input for the decoder.
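
A minimal PyTorch sketch of such an encoder (vocabulary sizes and dimensions are illustrative, not Applio's actual hyperparameters):

```python
import torch
import torch.nn as nn

class TextPitchEncoder(nn.Module):
    """Embeds phonemes and pitch, then runs a Transformer encoder."""

    def __init__(self, n_phonemes=100, n_pitch_bins=256, d_model=192):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.pitch_emb = nn.Embedding(n_pitch_bins, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=2, dim_feedforward=768,
            dropout=0.1, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, phonemes, pitch):
        # Sum the two embeddings position-by-position, then encode.
        x = self.phoneme_emb(phonemes) + self.pitch_emb(pitch)
        return self.encoder(x)  # (batch, time, d_model) latent representation

enc = TextPitchEncoder()
latent = enc(torch.randint(0, 100, (1, 50)), torch.randint(0, 256, (1, 50)))
print(latent.shape)  # torch.Size([1, 50, 192])
```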

**Decoder:**

- **Input:** The latent representation from the encoder.
- **Transformer Decoder:** Receives the encoder output and utilizes:
  - **Self-Attention:** Allows the decoder to attend to different parts of the generated sequence, maintaining consistency and coherence in the output audio.
  - **Encoder-Decoder Attention:** Enables the decoder to incorporate information from the input text and pitch into the audio generation process.
- **Neural Source Filter (NSF):** A powerful component for generating audio that models the generation process as a filter applied to a source signal. It uses:
  - **Upsampling:** Increases the resolution of the latent representation to match the desired length of the audio signal.
  - **Residual Blocks:** Learn complex, non-linear relationships between input features and the output audio, contributing to realistic and detailed waveforms.
  - **Source Module:** Generates the excitation signal (often harmonic) that drives the NSF, combining sine waves (for voiced sounds) and noise (for unvoiced sounds) into a natural source signal (see the sketch after this list).
  - **Noise Convolution:** Convolves noise with the harmonic signal to introduce additional variation and realism.
  - **Final Convolutional Layer:** Converts the filtered output to a single-channel audio waveform.
- **Output:** The synthesized audio signal.
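
To illustrate the source module's idea, here is a toy sketch that builds a sine-plus-noise excitation from an F0 contour (greatly simplified; a real NSF source module also models higher harmonics with learned amplitudes):

```python
import numpy as np

def source_excitation(f0, sr=48000, hop=480, noise_std=0.003):
    """Build a sine + noise excitation from a frame-level F0 contour.

    f0: per-frame fundamental frequencies in Hz (0 = unvoiced frame).
    """
    # Upsample frame-level F0 to sample level.
    f0_up = np.repeat(f0, hop)
    voiced = f0_up > 0

    # Integrate instantaneous frequency to get phase, then take the sine.
    phase = 2 * np.pi * np.cumsum(f0_up / sr)
    harmonic = np.sin(phase) * voiced  # silence the sine in unvoiced frames

    # Add Gaussian noise everywhere; unvoiced regions are noise-only.
    noise = np.random.randn(len(f0_up)) * noise_std
    return harmonic * 0.1 + noise

# 100 frames: first half voiced at 220 Hz, second half unvoiced.
f0 = np.concatenate([np.full(50, 220.0), np.zeros(50)])
excitation = source_excitation(f0)
print(excitation.shape)  # (48000,)
```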

#### B. ContentVec Speaker Embedding Extractor

Extracts speaker-specific information from the input audio.

- **Input:** The preprocessed audio signal.
- **Processing:** The ContentVec model, trained on a massive speech dataset, processes the input audio and extracts a speaker embedding vector capturing the unique acoustic properties of the speaker's voice.
- **Output:** A speaker embedding vector representing the speaker's voice.
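
ContentVec itself ships as a fairseq checkpoint, but since it is HuBERT-based, the extraction step can be sketched with a generic HuBERT encoder from `transformers`; the checkpoint name below is a stand-in, not the model Applio actually loads:

```python
import torch
from transformers import HubertModel

# Stand-in checkpoint; Applio loads an actual ContentVec checkpoint instead.
model = HubertModel.from_pretrained("facebook/hubert-base-ls960")
model.eval()

# One second of 16 kHz audio (HuBERT-style models expect 16 kHz input).
waveform = torch.randn(1, 16000)

with torch.no_grad():
    features = model(waveform).last_hidden_state  # (1, frames, 768)

print(features.shape)  # per-frame content/speaker features
```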

### 2. Training Stage

The RVC model is trained using a combination of two key losses:

- **Generative Loss:**
  - **Mel-Spectrogram:** The mel-spectrogram is computed for both the target audio and the generated audio.
  - **L1 Loss:** Measures the absolute difference between the mel-spectrograms of the target and generated audio, encouraging the decoder to produce audio with a similar spectral profile.
- **Discriminative Loss:**
  - **Multi-Period Discriminator:** Tries to distinguish between real and generated audio at different time scales, using convolutional layers to capture long-term dependencies in the audio.
  - **Adversarial Training:** The generator tries to fool the discriminator by producing audio that sounds real, while the discriminator is trained to correctly identify generated audio.
- **Optional KL Divergence Loss:** Measures the difference between the distributions of latent variables produced by the encoder and by a posterior encoder (which infers the latent representation from the target audio), encouraging the model to learn a more efficient and stable latent representation.
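
A compact PyTorch sketch of how these losses typically combine in HiFi-GAN-style training (the mel weight and least-squares formulation are common defaults, not necessarily Applio's exact values):

```python
import torch
import torch.nn.functional as F

def generator_loss(mel_real, mel_fake, disc_scores_fake, mel_weight=45.0):
    """Mel reconstruction (L1) plus adversarial loss for the generator."""
    # L1 between target and generated mel-spectrograms.
    loss_mel = F.l1_loss(mel_fake, mel_real)

    # Least-squares adversarial loss: push fake scores toward 1 ("real").
    loss_adv = sum(torch.mean((1.0 - s) ** 2) for s in disc_scores_fake)

    return loss_adv + mel_weight * loss_mel

def discriminator_loss(disc_scores_real, disc_scores_fake):
    """Real scores toward 1, fake scores toward 0, at every period/scale."""
    loss = 0.0
    for real, fake in zip(disc_scores_real, disc_scores_fake):
        loss += torch.mean((1.0 - real) ** 2) + torch.mean(fake ** 2)
    return loss
```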

### 3. Inference Stage

The inference stage uses the trained model to convert the voice in an audio input to sound like a target speaker. Here's a breakdown:

**Input:**

- Phoneme sequences (text representation).
- Pitch information (optional).
- Target speaker ID (identifies the desired voice).

**Steps:**

- **ContentVec Embedding Extraction:**
  - The ContentVec model processes the input audio and extracts a speaker embedding vector, capturing the voice characteristics of the speaker.
- **Optional Embedding Retrieval:**
  - **FAISS Index:** Used to efficiently search for speaker embeddings similar to the extracted ContentVec embedding, helping guide the conversion toward a specific speaker when multiple speakers are available.
  - **Embedding Retrieval:** The FAISS index is queried with the extracted ContentVec embedding, and similar embeddings are retrieved.
- **Embedding Manipulation:**
  - **Blending:** The extracted ContentVec embedding can be blended with retrieved embeddings using the `index_rate` parameter, controlling how strongly the target speaker's voice influences the conversion (see the sketch after this list).
- **Encoder-Decoder Processing:**
  - **Encoder:** Encodes the phoneme sequences and pitch into a latent representation, capturing the relationships between them.
  - **Decoder:** Synthesizes the audio signal, incorporating the speaker characteristics from the ContentVec embedding (potentially blended with retrieved embeddings).
- **Post-Processing:**
  - **Resampling:** Adjusts the sampling rate of the generated audio if needed.
  - **RMS Adjustment:** Matches the volume (RMS) of the output audio to the input audio.
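
Two of these steps are simple enough to show numerically. The sketch below implements `index_rate` blending and RMS matching as described above; the exact semantics are my reading of this list, not verified against Applio's source:

```python
import numpy as np

def blend_embeddings(extracted, retrieved, index_rate):
    """Interpolate between the input's embedding and retrieved ones.

    index_rate = 0.0 -> use the extracted embedding unchanged;
    index_rate = 1.0 -> fully use the retrieved (target-speaker) average.
    """
    target = retrieved.mean(axis=0)
    return index_rate * target + (1.0 - index_rate) * extracted

def match_rms(output, reference, eps=1e-8):
    """Scale `output` so its RMS matches that of `reference`."""
    rms_out = np.sqrt(np.mean(output**2)) + eps
    rms_ref = np.sqrt(np.mean(reference**2)) + eps
    return output * (rms_ref / rms_out)

emb = blend_embeddings(np.ones(256), np.zeros((4, 256)), index_rate=0.75)
print(emb[:3])  # 0.25 = 75% retrieved (zeros) + 25% extracted (ones)
```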

### 4. Key Techniques

- **Transformer Architecture:** A powerful tool for sequence modeling, enabling the encoder and decoder to efficiently process long sequences and capture complex relationships within the data.
- **Neural Source Filter (NSF):** Models audio generation as a filtering process, allowing the model to produce high-quality, realistic audio signals by learning complex relationships between the source signal and the output waveform.
- **Flow-Based Generative Model:** Enables the model to learn complex probability distributions for the audio signal, leading to more realistic and diverse generated speech.
- **Multi-Period Discriminator:** Improves the quality and realism of the generated audio by evaluating it at different temporal scales and providing feedback to the generator.
- **Relative Positional Encoding:** Helps the model understand the relative positions of elements within the input sequences, improving its ability to handle long sequences and maintain context.

### 5. Future Challenges

Despite the advancements in retrieval-based voice conversion, several challenges and areas for future research remain:

- **Speaker Generalization:** Improving the ability of models to generalize to unseen speakers with minimal data.
- **Real-Time Processing:** Enhancing the efficiency of models to support real-time voice conversion applications.
- **Emotional Expression:** Better capturing and transferring emotional nuances in voice conversion.
- **Noise Robustness:** Improving the robustness of voice conversion models to noisy and low-quality input audio.

## Repository Enhancements

This repository has undergone significant enhancements to improve its functionality and maintainability:

- **Modular Codebase:** Restructured codebase for better organization, readability, and maintenance.
- **Hop Length Implementation:** Improved efficiency and performance, especially on Crepe (formerly Mangio-Crepe), thanks to [@Mangio621](https://github.com/Mangio621/Mangio-RVC-Fork).
- **Translations in 30+ Languages:** Added support for over 30 languages.
- **Cross-Platform Compatibility:** Ensured seamless operation across various platforms.
- **Optimized Requirements:** Fine-tuned project requirements for enhanced performance.
- **Streamlined Installation:** Simplified installation process for a user-friendly setup.
- **Hybrid F0 Estimation:** Introduced a personalized 'hybrid' F0 estimation method utilizing nanmedian (see the sketch after this list).
- **Easy-to-Use UI:** Implemented an intuitive user interface.
- **Plugin System:** Introduced a plugin system for extending functionality.
- **Overtraining Detector:** Implemented a detector to prevent excessive training.
- **Model Search:** Integrated a model search feature for easy discovery.
- **Pretrained Models:** Added support for custom pretrained models.
- **Voice Blender:** Developed a feature to combine two trained models into a new one.
- **Accessibility Improvements:** Enhanced with descriptive tooltips for UI elements.
- **New F0 Extraction Methods:** Introduced methods like FCPE or Hybrid for pitch extraction.
- **Output Format Selection:** Added a feature to choose audio file formats.
- **Hashing System:** Assigned unique IDs to models to prevent unauthorized duplication.
- **Model Download System:** Supported downloads from various platforms.
- **TTS Enhancements:** Improved Text-to-Speech functionality.
- **Split Audio:** Implemented audio splitting for faster processing.
- **Discord Presence:** Displayed usage status on Discord.
- **Flask Integration:** Enabled automatic model downloads via Flask.
- **Support Tab:** Added a tab for screen recording to report issues.
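
The hybrid F0 idea can be sketched as follows: run several pitch estimators, stack their per-frame tracks, and take the `nanmedian` so frames where one method fails do not corrupt the result (a simplified reading of the bullet above, not Applio's exact implementation):

```python
import numpy as np

def hybrid_f0(f0_tracks):
    """Combine per-frame F0 tracks from several estimators.

    f0_tracks: list of equal-length arrays, NaN where an estimator
    found no pitch (e.g. unvoiced or low-confidence frames).
    """
    stacked = np.vstack(f0_tracks)
    # nanmedian ignores NaNs per frame, so one failing estimator
    # doesn't drag the combined track to zero or NaN.
    return np.nanmedian(stacked, axis=0)

crepe = np.array([220.0, 221.0, np.nan, 219.0])
rmvpe = np.array([219.0, np.nan, 218.0, 220.0])
fcpe  = np.array([221.0, 220.0, 217.0, np.nan])
print(hybrid_f0([crepe, rmvpe, fcpe]))  # [220. 220.5 217.5 219.5]
```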

These enhancements contribute to a more robust and scalable codebase, making the repository more accessible for contributors and users alike.

## Commercial Usage

For commercial purposes, please adhere to the guidelines outlined in the [MIT license](./LICENSE) governing this project. Prior to integrating Applio into your application, we kindly request that you contact us at [email protected] to ensure ethical use.

Please note that the use of Applio-generated audio files falls under your own responsibility and must always respect applicable copyrights. We encourage you to consider supporting the continuous development and maintenance of Applio through a donation.

Your cooperation and support are greatly appreciated. Thank you!

## References

Applio is made possible thanks to these projects and those cited in their references.

- [gradio-screen-recorder](https://huggingface.co/spaces/gstaff/gradio-screen-recorder) by gstaff
- [rvc-cli](https://github.com/blaise-tk/rvc-cli) by blaisewf

### Contributors

<a href="https://github.com/IAHispano/Applio/graphs/contributors" target="_blank">
  <img src="https://contrib.rocks/image?repo=IAHispano/Applio" />
</a>

---
license: mit
title: Applio Full GPU
sdk: gradio
emoji: 🗣️
colorFrom: blue
colorTo: blue
---