<h1 align="center">
<a href="https://applio.org" target="_blank"><img src="https://github.com/IAHispano/Applio/assets/133521603/78e975d8-b07f-47ba-ab23-5a31592f322a" alt="Applio"></a>
</h1>
<p align="center">
<img alt="Contributors" src="https://img.shields.io/github/contributors/iahispano/applio?style=for-the-badge&color=FFFFFF" />
<img alt="Release" src="https://img.shields.io/github/release/iahispano/applio?style=for-the-badge&color=FFFFFF" />
<img alt="Stars" src="https://img.shields.io/github/stars/iahispano/applio?style=for-the-badge&color=FFFFFF" />
<img alt="Fork" src="https://img.shields.io/github/forks/iahispano/applio?style=for-the-badge&color=FFFFFF" />
<img alt="Issues" src="https://img.shields.io/github/issues/iahispano/applio?style=for-the-badge&color=FFFFFF" />
</p>
<p align="center">VITS-based Voice Conversion focused on simplicity, quality, and performance.</p>
<p align="center">
<a href="https://applio.org" target="_blank">๐ Website</a>
โข
<a href="https://docs.applio.org" target="_blank">๐ Documentation</a>
โข
<a href="https://discord.gg/iahispano" target="_blank">โ๏ธ Discord</a>
</p>
<p align="center">
<a href="https://github.com/IAHispano/Applio-Plugins" target="_blank">๐ Plugins</a>
โข
<a href="https://huggingface.co./IAHispano/Applio/tree/main/Compiled" target="_blank">๐ฆ Compiled</a>
โข
<a href="https://applio.org/playground" target="_blank">๐ฎ Playground</a>
โข
<a href="https://colab.research.google.com/github/iahispano/applio/blob/master/assets/Applio.ipynb" target="_blank">๐ Google Colab (UI)</a>
โข
<a href="https://colab.research.google.com/github/iahispano/applio/blob/master/assets/Applio_NoUI.ipynb" target="_blank">๐ Google Colab (No UI)</a>
</p>
## Table of Contents
- [Installation](#installation)
- [Windows](#windows)
- [macOS](#macos)
- [Linux](#linux)
- [Makefile](#makefile)
- [Usage](#usage)
- [Windows](#windows-1)
- [macOS](#macos-1)
- [Linux](#linux-1)
- [Makefile](#makefile-1)
- [Technical Information](#technical-information)
- [Repository Enhancements](#repository-enhancements)
- [Commercial Usage](#commercial-usage)
- [References](#references)
- [Contributors](#contributors)
## Installation
Download the latest version from [GitHub Releases](https://github.com/IAHispano/Applio-RVC-Fork/releases) or use the [Compiled Versions](https://huggingface.co./IAHispano/Applio/tree/main/Compiled).
### Windows
```bash
./run-install.bat
```
### macOS
For macOS, install the requirements inside a Python environment (version 3.9 to 3.11). Here are the steps:
```bash
python3 -m venv .venv
source .venv/bin/activate
chmod +x run-install.sh
./run-install.sh
```
### Linux
Certain Linux distributions may run into issues with the installer. In those cases, we suggest installing the dependencies from `requirements.txt` within a Python environment (version 3.9 to 3.11).
```bash
chmod +x run-install.sh
./run-install.sh
```
### Makefile
For platforms such as [Paperspace](https://www.paperspace.com/):
```bash
make run-install
```
## Usage
Visit [Applio Documentation](https://docs.applio.org/) for a detailed UI usage explanation.
### Windows
```bash
./run-applio.bat
```
### macOS
```bash
chmod +x run-applio.sh
./run-applio.sh
```
### Linux
```bash
chmod +x run-applio.sh
./run-applio.sh
```
### Makefile
For platforms such as [Paperspace](https://www.paperspace.com/):
```bash
make run-applio
```
## Technical Information
Applio uses an enhanced version of Retrieval-based Voice Conversion (RVC), a powerful technique for transforming the voice in an audio signal so that it sounds like another speaker. This advanced implementation of RVC in Applio enables high-quality voice conversion while maintaining simplicity and performance.
### 0. Pre-Learning: Key Concepts in Speech Processing and Voice Conversion
This section introduces fundamental concepts in speech processing and voice conversion, paving the way for a deeper understanding of the RVC pipeline:
#### 1. Speech Representation
- **Phoneme:** The smallest unit of sound in a language that distinguishes one word from another. Examples: /k/, /æ/, /t/.
- **Spectrogram:** A visual representation of the frequency content of a sound over time, showing how the intensity of different frequencies changes over the duration of the audio.
- **Mel-Spectrogram:** A type of spectrogram that mimics human auditory perception, emphasizing frequencies that are more important to human hearing (see the sketch after this list).
- **Speaker Embedding:** A vector representation that captures the unique acoustic characteristics of a speaker's voice, encoding information about pitch, tone, timbre, and other vocal qualities.
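As a minimal illustration, a mel-spectrogram can be computed in a few lines with `torchaudio`; the parameter values below are illustrative, not Applio's exact configuration:
```python
import torch
import torchaudio

# Illustrative parameters; the real values depend on the model configuration.
sample_rate = 16000
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,       # window size of the short-time Fourier transform
    hop_length=256,   # stride between successive analysis frames
    n_mels=80,        # number of mel frequency bands
)

waveform = torch.randn(1, sample_rate)  # stand-in for one second of audio
mel = mel_transform(waveform)           # shape: (1, 80, frames)
print(mel.shape)
```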
#### 2. Text-to-Speech (TTS)
- **TTS Model:** A machine learning model that generates artificial speech from written text.
- **Encoder-Decoder Architecture:** A common architecture in TTS models, where an encoder processes the text and pitch information to create a latent representation, and a decoder uses this representation to synthesize the audio signal.
- **Transformer Architecture:** A powerful neural network architecture particularly well-suited for sequence modeling, allowing the model to handle long sequences of text or audio and capture relationships between elements.
#### 3. Voice Conversion
- **Voice Conversion (VC):** The process of transforming the voice of a speaker in an audio signal to sound like another speaker.
- **Speaker Adaptation:** The process of adapting a TTS model to a specific speaker, often by training on a small dataset of the speaker's voice.
- **Retrieval-Based VC (RVC):** A voice conversion approach where speaker embeddings are retrieved from a database and used to guide the TTS model in synthesizing audio with the target speaker's voice.
#### 4. Additional Concepts
- **ContentVec:** A powerful self-supervised learning model for speech representation, excelling at capturing speaker-specific information.
- **FAISS:** A library for efficient similarity search, used to retrieve speaker embeddings similar to the extracted ContentVec embedding (see the sketch after this list).
- **Neural Source Filter (NSF):** A module that models audio generation as a filtering process, allowing the model to produce high-quality and realistic audio signals by learning complex relationships between the source signal and the output waveform.
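The retrieval idea can be sketched with `faiss` as follows; the embedding dimension and data here are made up for illustration, while the real index is built from features extracted from the training set:
```python
import numpy as np
import faiss

dim = 768                                                  # illustrative dimension
database = np.random.rand(10_000, dim).astype("float32")   # stored embeddings

index = faiss.IndexFlatL2(dim)   # exact L2 nearest-neighbour index
index.add(database)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 4)   # the 4 closest embeddings
print(ids[0], distances[0])
```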
#### 5. Why are these concepts important?
Understanding these concepts is essential for appreciating the mechanics and capabilities of the RVC pipeline:
- **Speech Representation:** Different representations capture different aspects of speech, allowing for effective analysis and manipulation.
- **TTS Models:** The TTS model forms the foundation of RVC, providing the ability to synthesize audio from text and pitch.
- **Voice Conversion:** Voice conversion aims to transfer a speaker's identity to a different audio signal.
- **ContentVec and Speaker Embeddings:** ContentVec provides a powerful way to extract speaker-specific information, which is crucial for accurate voice conversion.
- **FAISS:** This library enables efficient speaker embedding retrieval, facilitating the selection of appropriate target voices.
- **NSF:** The NSF is a critical component of the TTS model, contributing to the generation of realistic and high-quality audio.
### 1. Model Architecture
The RVC model comprises two main components:
#### A. Encoder-Decoder Network
This network synthesizes audio based on text and pitch information while incorporating speaker characteristics from the ContentVec embedding.
**Encoder:**
- **Input:** Phoneme sequences (text representation) and pitch information (optional).
- **Embeddings:**
- Phonemes are represented as vectors using linear layers, creating a dense representation of the text input.
- Pitch is usually converted to a one-hot encoding or a continuous value and embedded similarly.
- **Transformer Encoder:** Processes the embedded features in a highly parallel manner. It employs:
- **Self-Attention:** Allows the encoder to attend to different parts of the input sequence to understand the relationships between words and their context.
- **Feedforward Networks (FFN):** Apply non-linear transformations to further refine the features captured by self-attention.
- **Layer Normalization:** Stabilizes training and improves performance by normalizing the outputs of each layer.
- **Dropout:** A regularization technique to prevent overfitting.
- **Output:** Produces a latent representation of the input text and pitch, capturing their relationships and serving as the input for the decoder. (A minimal sketch of this encoder follows.)
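A toy PyTorch version of the encoder described above; all sizes are illustrative, and the actual encoder differs in details such as relative positional attention:
```python
import torch
import torch.nn as nn

class ToyTextPitchEncoder(nn.Module):
    """Toy sketch of the encoder described above; all sizes are illustrative."""

    def __init__(self, feat_dim=768, n_pitch_bins=256, d_model=192,
                 n_heads=2, n_layers=6):
        super().__init__()
        self.phone_proj = nn.Linear(feat_dim, d_model)   # linear "embedding" of phone features
        self.pitch_emb = nn.Embedding(n_pitch_bins, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=768,
            dropout=0.1, batch_first=True)               # self-attention + FFN + norm + dropout
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, phone_feats, pitch):
        # Sum the two embeddings so every frame carries text and pitch information.
        x = self.phone_proj(phone_feats) + self.pitch_emb(pitch)
        return self.encoder(x)                           # latent sequence for the decoder

phone_feats = torch.randn(1, 50, 768)     # e.g. ContentVec-style features
pitch = torch.randint(0, 256, (1, 50))    # coarse pitch bin per frame
latent = ToyTextPitchEncoder()(phone_feats, pitch)
print(latent.shape)                       # torch.Size([1, 50, 192])
```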
**Decoder:**
- **Input:** The latent representation from the encoder.
- **Transformer Decoder:** Receives the encoder output and utilizes:
- **Self-Attention:** Allows the decoder to attend to different parts of the generated sequence to maintain consistency and coherence in the output audio.
- **Encoder-Decoder Attention:** Enables the decoder to incorporate information from the input text and pitch into the audio generation process.
- **Neural Source Filter (NSF):** A powerful component for generating audio, modeling the generation process as a filter applied to a source signal. It uses:
- **Upsampling:** Increases the resolution of the latent representation to match the desired length of the audio signal.
- **Residual Blocks:** Learn complex and non-linear relationships between input features and the output audio, contributing to realistic and detailed waveforms.
- **Source Module:** Generates the excitation signal (often harmonic) that drives the NSF. It combines sine waves (for voiced sounds) and noise (for unvoiced sounds) to create a natural source signal (see the sketch after this list).
- **Noise Convolution:** Convolves noise with the harmonic signal to introduce additional variation and realism.
- **Final Convolutional Layer:** Converts the filtered output to a single-channel audio waveform.
- **Output:** Synthesized audio signal.
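A toy sketch of the source-module idea, assuming per-frame F0 values in Hz; a real NSF source module stacks several harmonics and learns how to mix them:
```python
import torch

def toy_source_signal(f0, sr=16000, hop=256):
    """Toy excitation: a sine wave where f0 > 0, plus noise everywhere.

    `f0` holds one fundamental frequency (Hz) per frame, 0 = unvoiced.
    """
    f0_up = f0.repeat_interleave(hop)                       # per-sample f0
    phase = torch.cumsum(2 * torch.pi * f0_up / sr, dim=0)  # running phase
    sine = 0.1 * torch.sin(phase)          # harmonic part (voiced sounds)
    noise = 0.03 * torch.randn_like(sine)  # stochastic part (unvoiced sounds)
    voiced = (f0_up > 0).float()
    return voiced * sine + noise           # noise also adds realism when voiced

f0 = torch.tensor([0.0] * 10 + [220.0] * 40)  # 10 unvoiced + 40 voiced frames
excitation = toy_source_signal(f0)            # signal that drives the NSF filter
print(excitation.shape)                       # torch.Size([12800])
```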
#### B. ContentVec Speaker Embedding Extractor
Extracts speaker-specific information from the input audio.
- **Input:** The preprocessed audio signal.
- **Processing:** The ContentVec model, trained on a large corpus of speech, processes the input audio and extracts a speaker embedding vector capturing the unique acoustic properties of the speaker's voice (see the sketch after this list).
- **Output:** A speaker embedding vector representing the voice of the speaker.
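For illustration, the extraction step looks roughly like the following, using torchaudio's HuBERT-Base bundle as a stand-in for ContentVec (which is HuBERT-based); Applio loads its own ContentVec checkpoint instead:
```python
import torch
import torchaudio

# HuBERT-Base acts as a stand-in for ContentVec here; the real pipeline
# loads a ContentVec checkpoint, but the extraction step looks alike.
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

waveform = torch.randn(1, bundle.sample_rate)  # stand-in for 1 s of speech
with torch.inference_mode():
    features, _ = model.extract_features(waveform)

embedding = features[-1]   # last-layer features, shape (1, frames, 768)
print(embedding.shape)
```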
### 2. Training Stage
The RVC model is trained using a combination of two key losses, plus an optional regularization term (a simplified sketch follows the list):
- **Generative Loss:**
- **Mel-Spectrogram:** The Mel-spectrogram is computed for both the target audio and the generated audio.
- **L1 Loss:** Measures the absolute difference between the Mel-spectrograms of the target and generated audio, encouraging the decoder to produce audio with a similar spectral profile.
- **Discriminative Loss:**
- **Multi-Period Discriminator:** Tries to distinguish between real and generated audio at different time scales, using convolutional layers to capture long-term dependencies in the audio.
- **Adversarial Training:** The generator tries to fool the discriminator by producing audio that sounds real, while the discriminator is trained to correctly identify generated audio.
- **Optional KL Divergence Loss:** Measures the difference between the distributions of latent variables generated by the encoder and a posterior encoder (which infers the latent representation from the target audio). Encourages the model to learn a more efficient and stable latent representation.
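A simplified sketch of these losses in the least-squares GAN formulation used by VITS-style models; the mel-loss weight of 45 follows the VITS paper, the KL term is omitted for brevity, and the tensors are dummies:
```python
import torch
import torch.nn.functional as F

def generator_loss(mel_real, mel_fake, d_fake, c_mel=45.0):
    """Simplified generator objective: spectral match + adversarial term."""
    mel_loss = F.l1_loss(mel_fake, mel_real) * c_mel  # L1 between mel-spectrograms
    adv_loss = torch.mean((1 - d_fake) ** 2)          # least-squares GAN term
    return mel_loss + adv_loss

def discriminator_loss(d_real, d_fake):
    """Push real outputs toward 1 and generated outputs toward 0."""
    return torch.mean((1 - d_real) ** 2) + torch.mean(d_fake ** 2)

mel_real, mel_fake = torch.randn(1, 80, 100), torch.randn(1, 80, 100)
d_real, d_fake = torch.rand(1, 10), torch.rand(1, 10)  # discriminator outputs
print(generator_loss(mel_real, mel_fake, d_fake).item(),
      discriminator_loss(d_real, d_fake).item())
```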
### 3. Inference Stage
The inference stage utilizes the trained model to convert the voice of an audio input to sound like a target speaker. Here's a breakdown:
**Input:**
- Phoneme sequences (text representation).
- Pitch information (optional).
- Target speaker ID (identifies the desired voice).
**Steps:**
- **ContentVec Embedding Extraction:**
- The ContentVec model processes the input audio and extracts a speaker embedding vector, capturing the voice characteristics of the speaker.
- **Optional Embedding Retrieval:**
- **FAISS Index:** Used to efficiently search for speaker embeddings similar to the extracted ContentVec embedding. It helps guide the voice conversion process toward a specific speaker when multiple speakers are available.
- **Embedding Retrieval:** The FAISS index is queried using the extracted ContentVec embedding, and similar embeddings are retrieved.
- **Embedding Manipulation:**
- **Blending:** The extracted ContentVec embedding can be blended with retrieved embeddings using the `index_rate` parameter, controlling how much the retrieved features influence the conversion (see the sketch after these steps).
- **Encoder-Decoder Processing:**
- **Encoder:** Encodes the phoneme sequences and pitch into a latent representation, capturing the relationships between them.
- **Decoder:** Synthesizes the audio signal, incorporating the speaker characteristics from the ContentVec embedding (potentially blended with retrieved embeddings).
- **Post-Processing:**
- **Resampling:** Adjusts the sampling rate of the generated audio if needed.
- **RMS Adjustment:** Adjusts the volume (RMS) of the output audio to match the input audio.
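A simplified NumPy sketch of the blending and RMS-adjustment steps; the helper names are hypothetical, and a real implementation typically matches RMS frame-by-frame rather than globally:
```python
import numpy as np

def blend_features(extracted, retrieved, index_rate=0.5):
    """Hypothetical helper: mix ContentVec features with retrieved ones."""
    return index_rate * retrieved + (1 - index_rate) * extracted

def match_rms(source, output, rate=0.5):
    """Hypothetical helper: move `output` toward the RMS level of `source`.

    rate = 1 keeps the output's own level; rate = 0 fully matches the input.
    """
    rms_src = np.sqrt(np.mean(source**2) + 1e-8)
    rms_out = np.sqrt(np.mean(output**2) + 1e-8)
    target = rms_src ** (1 - rate) * rms_out ** rate
    return output * (target / rms_out)

extracted = np.random.rand(100, 768).astype(np.float32)  # ContentVec frames
retrieved = np.random.rand(100, 768).astype(np.float32)  # FAISS neighbours
mixed = blend_features(extracted, retrieved, index_rate=0.75)

x = np.random.randn(16000).astype(np.float32)        # input audio, 1 second
y = 0.2 * np.random.randn(16000).astype(np.float32)  # generated audio
print(match_rms(x, y, rate=0.3).std())               # closer to x's level
```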
### 4. Key Techniques
- **Transformer Architecture:** The Transformer architecture is a powerful tool for sequence modeling, enabling the encoder and decoder to efficiently process long sequences and capture complex relationships within the data.
- **Neural Source Filter (NSF):** Models audio generation as a filtering process, allowing the model to produce high-quality and realistic audio signals by learning complex relationships between the source signal and the output waveform.
- **Flow-Based Generative Model:** Enables the model to learn complex probability distributions for the audio signal, leading to more realistic and diverse generated speech.
- **Multi-Period Discriminator:** Helps improve the quality and realism of the generated audio by evaluating the audio at different temporal scales and providing feedback to the generator.
- **Relative Positional Encoding:** Helps the model understand the relative positions of elements within the input sequences, improving the model's ability to handle long sequences and maintain context.
### 5. Future Challenges
Despite the advancements in Retrieval-Based Voice Conversion, several challenges and areas for future research remain:
- **Speaker Generalization:** Improving the ability of models to generalize to unseen speakers with minimal data.
- **Real-time Processing:** Enhancing the efficiency of models to support real-time voice conversion applications.
- **Emotional Expression:** Better capturing and transferring emotional nuances in voice conversion.
- **Noise Robustness:** Improving the robustness of voice conversion models to handle noisy and low-quality input audio.
## Repository Enhancements
This repository has undergone significant enhancements to improve its functionality and maintainability:
- **Modular Codebase:** Restructured codebase for better organization, readability, and maintenance.
- **Hop Length Implementation:** Improved efficiency and performance, especially on Crepe (formerly Mangio-Crepe), thanks to [@Mangio621](https://github.com/Mangio621/Mangio-RVC-Fork).
- **Translations in 30+ Languages:** Added support for over 30 languages.
- **Cross-Platform Compatibility:** Ensured seamless operation across various platforms.
- **Optimized Requirements:** Fine-tuned project requirements for enhanced performance.
- **Streamlined Installation:** Simplified installation process for a user-friendly setup.
- **Hybrid F0 Estimation:** Introduced a personalized 'hybrid' F0 estimation method utilizing nanmedian.
- **Easy-to-Use UI:** Implemented an intuitive user interface.
- **Plugin System:** Introduced a plugin system for extending functionality.
- **Overtraining Detector:** Implemented a detector to prevent excessive training.
- **Model Search:** Integrated model search feature for easy discovery.
- **Pretrained Models:** Added support for custom pretrained models.
- **Voice Blender:** Developed a feature to combine two trained models to create a new one.
- **Accessibility Improvements:** Enhanced with descriptive tooltips for UI elements.
- **New F0 Extraction Methods:** Introduced methods like FCPE or Hybrid for pitch extraction.
- **Output Format Selection:** Added feature to choose audio file formats.
- **Hashing System:** Assigned unique IDs to models to prevent unauthorized duplication.
- **Model Download System:** Supported downloads from various platforms.
- **TTS Enhancements:** Improved Text-to-Speech functionality.
- **Split Audio:** Implemented audio splitting for faster processing.
- **Discord Presence:** Displayed usage status on Discord.
- **Flask Integration:** Enabled automatic model downloads via Flask.
- **Support Tab:** Added a tab for screen recording to report issues.
These enhancements contribute to a more robust and scalable codebase, making the repository more accessible for contributors and users alike.
## Commercial Usage
For commercial purposes, please adhere to the guidelines outlined in the [MIT license](./LICENSE) governing this project. Prior to integrating Applio into your application, we kindly request that you contact us at [email protected] to ensure ethical use.
Please note, the use of Applio-generated audio files falls under your own responsibility and must always respect applicable copyrights. We encourage you to consider supporting the continuous development and maintenance of Applio through a donation.
Your cooperation and support are greatly appreciated. Thank you!
## References
Applio is made possible thanks to these projects and those cited in their references.
- [gradio-screen-recorder](https://huggingface.co./spaces/gstaff/gradio-screen-recorder) by gstaff
- [rvc-cli](https://github.com/blaise-tk/rvc-cli) by blaisewf
## Contributors
<a href="https://github.com/IAHispano/Applio/graphs/contributors" target="_blank">
<img src="https://contrib.rocks/image?repo=IAHispano/Applio" />
</a>