VoiceCloning-be committed on
Commit 5211111
•
1 parent: 4efe6b5

Update README.md

Files changed (1)
  1. README.md +8 -320
README.md CHANGED
@@ -1,320 +1,8 @@
<h1 align="center">
  <a href="https://applio.org" target="_blank"><img src="https://github.com/IAHispano/Applio/assets/133521603/78e975d8-b07f-47ba-ab23-5a31592f322a" alt="Applio"></a>
</h1>

<p align="center">
  <img alt="Contributors" src="https://img.shields.io/github/contributors/iahispano/applio?style=for-the-badge&color=FFFFFF" />
  <img alt="Release" src="https://img.shields.io/github/release/iahispano/applio?style=for-the-badge&color=FFFFFF" />
  <img alt="Stars" src="https://img.shields.io/github/stars/iahispano/applio?style=for-the-badge&color=FFFFFF" />
  <img alt="Fork" src="https://img.shields.io/github/forks/iahispano/applio?style=for-the-badge&color=FFFFFF" />
  <img alt="Issues" src="https://img.shields.io/github/issues/iahispano/applio?style=for-the-badge&color=FFFFFF" />
</p>

<p align="center">VITS-based Voice Conversion focused on simplicity, quality, and performance.</p>

<p align="center">
  <a href="https://applio.org" target="_blank">🌐 Website</a>
  •
  <a href="https://docs.applio.org" target="_blank">📚 Documentation</a>
  •
  <a href="https://discord.gg/iahispano" target="_blank">☎️ Discord</a>
</p>

<p align="center">
  <a href="https://github.com/IAHispano/Applio-Plugins" target="_blank">🛒 Plugins</a>
  •
  <a href="https://huggingface.co/IAHispano/Applio/tree/main/Compiled" target="_blank">📦 Compiled</a>
  •
  <a href="https://applio.org/playground" target="_blank">🎮 Playground</a>
  •
  <a href="https://colab.research.google.com/github/iahispano/applio/blob/master/assets/Applio.ipynb" target="_blank">🔎 Google Colab (UI)</a>
  •
  <a href="https://colab.research.google.com/github/iahispano/applio/blob/master/assets/Applio_NoUI.ipynb" target="_blank">🔎 Google Colab (No UI)</a>
</p>

## Table of Contents

- [Installation](#installation)
  - [Windows](#windows)
  - [macOS](#macos)
  - [Linux](#linux)
  - [Makefile](#makefile)
- [Usage](#usage)
  - [Windows](#windows-1)
  - [macOS](#macos-1)
  - [Linux](#linux-1)
  - [Makefile](#makefile-1)
- [Technical Information](#technical-information)
- [Repository Enhancements](#repository-enhancements)
- [Commercial Usage](#commercial-usage)
- [References](#references)
- [Contributors](#contributors)

## Installation

Download the latest version from [GitHub Releases](https://github.com/IAHispano/Applio-RVC-Fork/releases) or use the [Compiled Versions](https://huggingface.co/IAHispano/Applio/tree/main/Compiled).

### Windows

```bash
./run-install.bat
```

### macOS

For macOS, install the requirements inside a Python virtual environment (Python 3.9 to 3.11):

```bash
python3 -m venv .venv
source .venv/bin/activate
chmod +x run-install.sh
./run-install.sh
```

### Linux

Certain Linux distributions may run into issues with the installer. In such cases, we suggest installing `requirements.txt` inside a Python 3.9 to 3.11 environment.

```bash
chmod +x run-install.sh
./run-install.sh
```

### Makefile

For platforms such as [Paperspace](https://www.paperspace.com/):

```bash
make run-install
```

## Usage

Visit the [Applio Documentation](https://docs.applio.org/) for a detailed explanation of the UI.

### Windows

```bash
./run-applio.bat
```

### macOS

```bash
chmod +x run-applio.sh
./run-applio.sh
```

### Linux

```bash
chmod +x run-applio.sh
./run-applio.sh
```

### Makefile

For platforms such as [Paperspace](https://www.paperspace.com/):

```bash
make run-applio
```

## Technical Information

Applio uses an enhanced version of the Retrieval-based Voice Conversion (RVC) model, a technique for transforming the voice in an audio signal to sound like another person. This implementation enables high-quality voice conversion while maintaining simplicity and performance.

### 0. Pre-Learning: Key Concepts in Speech Processing and Voice Conversion

This section introduces fundamental concepts in speech processing and voice conversion, paving the way for a deeper understanding of the RVC pipeline:

#### 1. Speech Representation

- **Phoneme:** The smallest unit of sound in a language that distinguishes one word from another. Examples: /k/, /æ/, /t/.
- **Spectrogram:** A visual representation of the frequency content of a sound over time, showing how the intensity of different frequencies changes over the duration of the audio.
- **Mel-Spectrogram:** A spectrogram whose frequency axis mimics human auditory perception, emphasizing the frequencies that matter most to human hearing (see the sketch below).
- **Speaker Embedding:** A vector representation that captures the unique acoustic characteristics of a speaker's voice, encoding information about pitch, tone, timbre, and other vocal qualities.

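As a concrete illustration of the mel-spectrogram concept, here is a minimal sketch using the `librosa` library; the file name and parameter values are arbitrary choices for the example, not values Applio itself uses.

```python
import librosa
import numpy as np

# Load audio (hypothetical file); sr=16000 resamples to 16 kHz.
audio, sr = librosa.load("speech.wav", sr=16000)

# A mel-spectrogram is an STFT whose frequency axis is mapped onto
# the perceptually motivated mel scale.
mel = librosa.feature.melspectrogram(
    y=audio,
    sr=sr,
    n_fft=1024,      # samples per FFT frame
    hop_length=256,  # step between consecutive frames
    n_mels=80,       # number of mel frequency bands
)

# Convert power to decibels, the form usually fed to TTS/VC models.
mel_db = librosa.power_to_db(mel, ref=np.max)
print(mel_db.shape)  # (80, n_frames)
```
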
#### 2. Text-to-Speech (TTS)

- **TTS Model:** A machine learning model that generates artificial speech from written text.
- **Encoder-Decoder Architecture:** A common architecture in TTS models, where an encoder processes the text and pitch information to create a latent representation, and a decoder uses this representation to synthesize the audio signal.
- **Transformer Architecture:** A powerful neural network architecture particularly well suited for sequence modeling, allowing the model to handle long sequences of text or audio and capture relationships between elements.

#### 3. Voice Conversion

- **Voice Conversion (VC):** The process of transforming the voice of a speaker in an audio signal to sound like another speaker.
- **Speaker Adaptation:** The process of adapting a TTS model to a specific speaker, often by training on a small dataset of the speaker's voice.
- **Retrieval-Based VC (RVC):** A voice conversion approach where speaker embeddings are retrieved from a database and used to guide the TTS model in synthesizing audio with the target speaker's voice.

#### 4. Additional Concepts

- **ContentVec:** A powerful self-supervised learning model for speech representation, excelling at capturing speaker-specific information.
- **FAISS:** A library for efficient similarity search, used to retrieve speaker embeddings that are similar to the extracted ContentVec embedding (see the sketch below).
- **Neural Source Filter (NSF):** A module that models audio generation as a filtering process, allowing the model to produce high-quality and realistic audio signals by learning complex relationships between the source signal and the output waveform.

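To make the FAISS retrieval idea concrete, here is a minimal sketch of building an index over embedding vectors and querying it. The dimensionality and the random data are placeholders, not Applio's actual index layout.

```python
import faiss
import numpy as np

dim = 768  # placeholder embedding dimensionality

# Build a flat (exact) L2 index over a set of stored embeddings.
stored = np.random.rand(10_000, dim).astype("float32")
index = faiss.IndexFlatL2(dim)
index.add(stored)

# Query with one extracted embedding; retrieve the 8 nearest neighbors.
query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 8)
neighbors = stored[ids[0]]  # candidate embeddings for later blending
```
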
#### 5. Why are these concepts important?

Understanding these concepts is essential for appreciating the mechanics and capabilities of the RVC pipeline:

- **Speech Representation:** Different representations capture different aspects of speech, allowing for effective analysis and manipulation.
- **TTS Models:** The TTS model forms the foundation of RVC, providing the ability to synthesize audio from text and pitch.
- **Voice Conversion:** Voice conversion aims to transfer a speaker's identity to a different audio signal.
- **ContentVec and Speaker Embeddings:** ContentVec provides a powerful way to extract speaker-specific information, which is crucial for accurate voice conversion.
- **FAISS:** This library enables efficient speaker embedding retrieval, facilitating the selection of appropriate target voices.
- **NSF:** The NSF is a critical component of the TTS model, contributing to the generation of realistic and high-quality audio.

### 1. Model Architecture

The RVC model comprises two main components:

#### A. Encoder-Decoder Network

This network synthesizes audio based on text and pitch information while incorporating speaker characteristics from the ContentVec embedding.

**Encoder:**

- **Input:** Phoneme sequences (text representation) and pitch information (optional).
- **Embeddings:**
  - Phonemes are represented as vectors using linear layers, creating a dense representation of the text input.
  - Pitch is usually converted to a one-hot encoding or a continuous value and embedded similarly.
- **Transformer Encoder:** Processes the embedded features in a highly parallel manner. It employs:
  - **Self-Attention:** Allows the encoder to attend to different parts of the input sequence to understand the relationships between words and their context.
  - **Feedforward Networks (FFN):** Apply non-linear transformations to further refine the features captured by self-attention.
  - **Layer Normalization:** Stabilizes training and improves performance by normalizing the outputs of each layer.
  - **Dropout:** A regularization technique to prevent overfitting.
- **Output:** A latent representation of the input text and pitch, capturing their relationships and serving as the input for the decoder (a minimal encoder sketch follows below).

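The sketch below shows, in PyTorch, the general shape of such an encoder: phoneme and pitch embeddings are summed and refined by a transformer encoder stack. All layer sizes, vocabulary sizes, and names are illustrative assumptions, not Applio's actual hyperparameters.

```python
import torch
import torch.nn as nn

class TextPitchEncoder(nn.Module):
    """Illustrative encoder: embeds phonemes and quantized pitch,
    then refines their sum with a transformer encoder stack."""

    def __init__(self, n_phonemes=100, n_pitch_bins=256, d_model=192):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.pitch_emb = nn.Embedding(n_pitch_bins, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=2,              # self-attention heads
            dim_feedforward=768,  # FFN hidden size
            dropout=0.1,          # regularization
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, phonemes, pitch):
        # (batch, time) -> (batch, time, d_model)
        x = self.phoneme_emb(phonemes) + self.pitch_emb(pitch)
        return self.encoder(x)  # latent representation for the decoder

enc = TextPitchEncoder()
latent = enc(torch.randint(0, 100, (1, 50)), torch.randint(0, 256, (1, 50)))
print(latent.shape)  # torch.Size([1, 50, 192])
```
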
**Decoder:**

- **Input:** The latent representation from the encoder.
- **Transformer Decoder:** Receives the encoder output and utilizes:
  - **Self-Attention:** Allows the decoder to attend to different parts of the generated sequence to maintain consistency and coherence in the output audio.
  - **Encoder-Decoder Attention:** Enables the decoder to incorporate information from the input text and pitch into the audio generation process.
- **Neural Source Filter (NSF):** A powerful component for generating audio, modeling the generation process as a filter applied to a source signal. It uses:
  - **Upsampling:** Increases the resolution of the latent representation to match the desired length of the audio signal.
  - **Residual Blocks:** Learn complex and non-linear relationships between input features and the output audio, contributing to realistic and detailed waveforms.
  - **Source Module:** Generates the excitation signal (often harmonic) that drives the NSF. It combines sine waves (for voiced sounds) and noise (for unvoiced sounds) to create a natural source signal (see the sketch below).
  - **Noise Convolution:** Convolves noise with the harmonic signal to introduce additional variation and realism.
  - **Final Convolutional Layer:** Converts the filtered output to a single-channel audio waveform.
- **Output:** Synthesized audio signal.

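To illustrate the source-module idea, the following sketch builds an excitation signal from a per-frame F0 contour: sine waves where frames are voiced, noise everywhere. The sample rate, hop size, and noise scale are assumptions for the example, not Applio's settings.

```python
import numpy as np

def make_excitation(f0, sr=16000, hop=256, noise_std=0.03):
    """Build a simple NSF-style source signal from a per-frame F0 track.

    f0: per-frame fundamental frequency in Hz (0 = unvoiced frame).
    """
    # Upsample the frame-rate F0 contour to the audio sample rate.
    f0_up = np.repeat(f0, hop)

    # Integrate instantaneous frequency to get phase, then take a sine.
    phase = 2 * np.pi * np.cumsum(f0_up / sr)
    harmonic = np.sin(phase)

    # Voiced samples keep the sine; unvoiced samples contribute nothing.
    voiced = (f0_up > 0).astype(np.float64)
    noise = np.random.randn(len(f0_up)) * noise_std

    # Sine plus noise in voiced regions; noise alone in unvoiced regions.
    return voiced * harmonic + noise
```
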
#### B. ContentVec Speaker Embedding Extractor

Extracts speaker-specific information from the input audio.

- **Input:** The preprocessed audio signal.
- **Processing:** The ContentVec model, trained on a massive speech dataset, processes the input audio and extracts a speaker embedding vector capturing the unique acoustic properties of the speaker's voice (see the sketch below).
- **Output:** A speaker embedding vector representing the voice of the speaker.

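ContentVec is a HuBERT-style model, so feature extraction looks roughly like the sketch below. It substitutes the public `facebook/hubert-base-ls960` checkpoint as a stand-in, since loading an actual ContentVec checkpoint depends on how it was exported.

```python
import torch
from transformers import HubertModel

# Stand-in checkpoint: a public HuBERT base model, not real ContentVec weights.
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

# One second of placeholder 16 kHz audio, shaped (batch, samples).
audio = torch.randn(1, 16000)

with torch.no_grad():
    # Frame-level features, shape (1, n_frames, 768). RVC-style pipelines
    # use these per-frame vectors as the content/speaker representation.
    features = model(audio).last_hidden_state

# A single utterance-level embedding can be obtained by mean pooling.
embedding = features.mean(dim=1)
```
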
### 2. Training Stage

The RVC model is trained using a combination of two key losses, plus an optional third:

- **Generative Loss:**
  - **Mel-Spectrogram:** The mel-spectrogram is computed for both the target audio and the generated audio.
  - **L1 Loss:** Measures the absolute difference between the mel-spectrograms of the target and generated audio, encouraging the decoder to produce audio with a similar spectral profile (see the sketch below).
- **Discriminative Loss:**
  - **Multi-Period Discriminator:** Tries to distinguish between real and generated audio at different time scales, using convolution layers to capture long-term dependencies in the audio.
  - **Adversarial Training:** The generator tries to fool the discriminator by producing audio that sounds real, while the discriminator is trained to correctly identify generated audio.
- **Optional KL Divergence Loss:** Measures the difference between the distributions of latent variables produced by the encoder and by a posterior encoder (which infers the latent representation from the target audio), encouraging the model to learn a more efficient and stable latent representation.

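A minimal sketch of the mel-spectrogram L1 term, written with torchaudio; the transform parameters are illustrative rather than the exact training configuration.

```python
import torch
import torchaudio

# Illustrative mel transform; a real config pins n_fft, hop, and mel count.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80
)

def mel_l1_loss(generated, target):
    """L1 distance between the mel-spectrograms of two waveforms."""
    return torch.nn.functional.l1_loss(
        mel_transform(generated), mel_transform(target)
    )

gen = torch.randn(1, 16000)  # placeholder generated waveform
tgt = torch.randn(1, 16000)  # placeholder target waveform
print(mel_l1_loss(gen, tgt))
```
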
### 3. Inference Stage

The inference stage uses the trained model to convert the voice in an audio input to sound like a target speaker. Here's a breakdown:

**Input:**

- Phoneme sequences (text representation).
- Pitch information (optional).
- Target speaker ID (identifies the desired voice).

**Steps:**

- **ContentVec Embedding Extraction:**
  - The ContentVec model processes the input audio and extracts a speaker embedding vector, capturing the voice characteristics of the speaker.
- **Optional Embedding Retrieval:**
  - **FAISS Index:** Used to efficiently search for speaker embeddings similar to the extracted ContentVec embedding. It helps guide the voice conversion process toward a specific speaker when multiple speakers are available.
  - **Embedding Retrieval:** The FAISS index is queried using the extracted ContentVec embedding, and similar embeddings are retrieved.
- **Embedding Manipulation:**
  - **Blending:** The extracted ContentVec embedding can be blended with retrieved embeddings using the `index_rate` parameter, allowing control over how much the target speaker's voice influences the conversion (see the sketch below).
- **Encoder-Decoder Processing:**
  - **Encoder:** Encodes the phoneme sequences and pitch into a latent representation, capturing the relationships between them.
  - **Decoder:** Synthesizes the audio signal, incorporating the speaker characteristics from the ContentVec embedding (potentially blended with retrieved embeddings).
- **Post-Processing:**
  - **Resampling:** Adjusts the sampling rate of the generated audio if needed.
  - **RMS Adjustment:** Adjusts the volume (RMS) of the output audio to match the input audio.

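The blending step reduces to a linear interpolation between extracted and retrieved features, weighted by `index_rate`. The sketch below mirrors that common RVC formulation, written from scratch for illustration rather than copied from Applio's code.

```python
import numpy as np

def blend_embeddings(extracted, retrieved, index_rate=0.5):
    """Linearly interpolate between extracted and retrieved embeddings.

    index_rate = 0.0 keeps the extracted features untouched;
    index_rate = 1.0 fully replaces them with the retrieved ones.
    """
    return index_rate * retrieved + (1.0 - index_rate) * extracted

extracted = np.random.rand(50, 768).astype("float32")  # placeholder frames
retrieved = np.random.rand(50, 768).astype("float32")
blended = blend_embeddings(extracted, retrieved, index_rate=0.75)
```
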
### 4. Key Techniques

- **Transformer Architecture:** A powerful tool for sequence modeling, enabling the encoder and decoder to efficiently process long sequences and capture complex relationships within the data.
- **Neural Source Filter (NSF):** Models audio generation as a filtering process, allowing the model to produce high-quality and realistic audio signals by learning complex relationships between the source signal and the output waveform.
- **Flow-Based Generative Model:** Enables the model to learn complex probability distributions for the audio signal, leading to more realistic and diverse generated speech.
- **Multi-Period Discriminator:** Helps improve the quality and realism of the generated audio by evaluating the audio at different temporal scales and providing feedback to the generator (see the sketch below).
- **Relative Positional Encoding:** Helps the model understand the relative positions of elements within the input sequences, improving the model's ability to handle long sequences and maintain context.

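The core trick in a multi-period discriminator (introduced by HiFi-GAN, on which RVC's vocoder design builds) is folding the 1-D waveform into a 2-D grid so that convolutions compare samples a fixed period apart. The sketch below shows only that reshaping step, with typical prime-valued periods assumed for illustration.

```python
import torch
import torch.nn.functional as F

def reshape_for_period(wav, period):
    """Fold a waveform (batch, 1, T) into (batch, 1, T // period, period)
    so 2-D convolutions see samples that are `period` steps apart."""
    b, c, t = wav.shape
    if t % period != 0:
        # Right-pad with reflection so the length divides the period.
        pad = period - (t % period)
        wav = F.pad(wav, (0, pad), mode="reflect")
        t = t + pad
    return wav.view(b, c, t // period, period)

wav = torch.randn(1, 1, 16000)
for period in (2, 3, 5, 7, 11):  # typical prime-valued periods
    print(period, reshape_for_period(wav, period).shape)
```
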
### 5. Future Challenges

Despite the advancements in Retrieval-Based Voice Conversion, several challenges and areas for future research remain:

- **Speaker Generalization:** Improving the ability of models to generalize to unseen speakers with minimal data.
- **Real-time Processing:** Enhancing the efficiency of models to support real-time voice conversion applications.
- **Emotional Expression:** Better capturing and transferring emotional nuances in voice conversion.
- **Noise Robustness:** Improving the robustness of voice conversion models to handle noisy and low-quality input audio.

## Repository Enhancements

This repository has undergone significant enhancements to improve its functionality and maintainability:

- **Modular Codebase:** Restructured codebase for better organization, readability, and maintenance.
- **Hop Length Implementation:** Improved efficiency and performance, especially with Crepe (formerly Mangio-Crepe), thanks to [@Mangio621](https://github.com/Mangio621/Mangio-RVC-Fork).
- **Translations in 30+ Languages:** Added support for over 30 languages.
- **Cross-Platform Compatibility:** Ensured seamless operation across various platforms.
- **Optimized Requirements:** Fine-tuned project requirements for enhanced performance.
- **Streamlined Installation:** Simplified installation process for a user-friendly setup.
- **Hybrid F0 Estimation:** Introduced a personalized 'hybrid' F0 estimation method utilizing `nanmedian` (see the sketch after this list).
- **Easy-to-Use UI:** Implemented an intuitive user interface.
- **Plugin System:** Introduced a plugin system for extending functionality.
- **Overtraining Detector:** Implemented a detector to prevent excessive training.
- **Model Search:** Integrated a model search feature for easy discovery.
- **Pretrained Models:** Added support for custom pretrained models.
- **Voice Blender:** Developed a feature to combine two trained models to create a new one.
- **Accessibility Improvements:** Enhanced with descriptive tooltips for UI elements.
- **New F0 Extraction Methods:** Introduced methods such as FCPE and Hybrid for pitch extraction.
- **Output Format Selection:** Added a feature to choose audio file formats.
- **Hashing System:** Assigned unique IDs to models to prevent unauthorized duplication.
- **Model Download System:** Supported downloads from various platforms.
- **TTS Enhancements:** Improved Text-to-Speech functionality.
- **Split Audio:** Implemented audio splitting for faster processing.
- **Discord Presence:** Displayed usage status on Discord.
- **Flask Integration:** Enabled automatic model downloads via Flask.
- **Support Tab:** Added a tab for screen recording to report issues.

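A sketch of the hybrid-F0 idea mentioned above: run several pitch extractors, stack their per-frame outputs with unvoiced frames as NaN, and take `np.nanmedian` across methods so a single extractor's dropout does not silence the frame. The two extractor outputs here are fabricated for the example.

```python
import numpy as np

def hybrid_f0(*f0_tracks):
    """Combine per-frame F0 tracks from several extractors.

    Each track uses NaN for unvoiced frames; nanmedian ignores NaNs,
    so a frame stays voiced as long as any extractor detected pitch.
    """
    stacked = np.vstack(f0_tracks)
    return np.nanmedian(stacked, axis=0)

# Stand-in outputs from two extractors (e.g. crepe and fcpe), in Hz.
f0_a = np.array([110.0, 112.0, np.nan, 220.0])
f0_b = np.array([111.0, np.nan, 115.0, 218.0])
print(hybrid_f0(f0_a, f0_b))  # [110.5 112.  115.  219. ]
```
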
These enhancements contribute to a more robust and scalable codebase, making the repository more accessible for contributors and users alike.

## Commercial Usage

For commercial purposes, please adhere to the guidelines outlined in the [MIT license](./LICENSE) governing this project. Prior to integrating Applio into your application, we kindly request that you contact us at [email protected] to ensure ethical use.

Please note that the use of Applio-generated audio files falls under your own responsibility and must always respect applicable copyrights. We encourage you to consider supporting the continuous development and maintenance of Applio through a donation.

Your cooperation and support are greatly appreciated. Thank you!

## References

Applio is made possible by these projects and those cited in their references.

- [gradio-screen-recorder](https://huggingface.co/spaces/gstaff/gradio-screen-recorder) by gstaff
- [rvc-cli](https://github.com/blaise-tk/rvc-cli) by blaisewf

### Contributors

<a href="https://github.com/IAHispano/Applio/graphs/contributors" target="_blank">
  <img src="https://contrib.rocks/image?repo=IAHispano/Applio" />
</a>
 
---
license: mit
title: Applio Full GPU
sdk: gradio
emoji: 🗣️
colorFrom: blue
colorTo: blue
---