<h1 align="center">
  <a href="https://applio.org" target="_blank"><img src="https://github.com/IAHispano/Applio/assets/133521603/78e975d8-b07f-47ba-ab23-5a31592f322a" alt="Applio"></a>
</h1>

<p align="center">
    <img alt="Contributors" src="https://img.shields.io/github/contributors/iahispano/applio?style=for-the-badge&color=FFFFFF" />
    <img alt="Release" src="https://img.shields.io/github/release/iahispano/applio?style=for-the-badge&color=FFFFFF" />
    <img alt="Stars" src="https://img.shields.io/github/stars/iahispano/applio?style=for-the-badge&color=FFFFFF" />
    <img alt="Fork" src="https://img.shields.io/github/forks/iahispano/applio?style=for-the-badge&color=FFFFFF" />
    <img alt="Issues" src="https://img.shields.io/github/issues/iahispano/applio?style=for-the-badge&color=FFFFFF" />
</p>
  
<p align="center">VITS-based Voice Conversion focused on simplicity, quality, and performance.</p>

<p align="center">
  <a href="https://applio.org" target="_blank">๐ŸŒ Website</a>
  โ€ข
  <a href="https://docs.applio.org" target="_blank">๐Ÿ“š Documentation</a>
  โ€ข
  <a href="https://discord.gg/iahispano" target="_blank">โ˜Ž๏ธ Discord</a>
</p>

<p align="center">
  <a href="https://github.com/IAHispano/Applio-Plugins" target="_blank">๐Ÿ›’ Plugins</a>
  โ€ข
  <a href="https://huggingface.co./IAHispano/Applio/tree/main/Compiled" target="_blank">๐Ÿ“ฆ Compiled</a>
  โ€ข
  <a href="https://applio.org/playground" target="_blank">๐ŸŽฎ Playground</a>
  โ€ข
  <a href="https://colab.research.google.com/github/iahispano/applio/blob/master/assets/Applio.ipynb" target="_blank">๐Ÿ”Ž Google Colab (UI)</a>
  โ€ข
  <a href="https://colab.research.google.com/github/iahispano/applio/blob/master/assets/Applio_NoUI.ipynb" target="_blank">๐Ÿ”Ž Google Colab (No UI)</a>
</p>

## Table of Contents

- [Installation](#installation)
  - [Windows](#windows)
  - [macOS](#macos)
  - [Linux](#linux)
  - [Makefile](#makefile)
- [Usage](#usage)
  - [Windows](#windows-1)
  - [macOS](#macos-1)
  - [Linux](#linux-1)
  - [Makefile](#makefile-1)
- [Technical Information](#technical-information)
- [Repository Enhancements](#repository-enhancements)
- [Commercial Usage](#commercial-usage)
- [References](#references)
  - [Contributors](#contributors)

## Installation

Download the latest version from [GitHub Releases](https://github.com/IAHispano/Applio-RVC-Fork/releases) or use the [Compiled Versions](https://huggingface.co./IAHispano/Applio/tree/main/Compiled).

### Windows

```bash
./run-install.bat
```

### macOS

On macOS, install the requirements inside a Python virtual environment (Python 3.9 to 3.11):

```bash
python3 -m venv .venv
source .venv/bin/activate
chmod +x run-install.sh
./run-install.sh
```

### Linux

Some Linux distributions may encounter issues with the installer. In such cases, we suggest installing `requirements.txt` inside a Python virtual environment (Python 3.9 to 3.11).

```bash
chmod +x run-install.sh
./run-install.sh
```

### Makefile

For platforms such as [Paperspace](https://www.paperspace.com/):

```bash
make run-install
```

## Usage

Visit [Applio Documentation](https://docs.applio.org/) for a detailed UI usage explanation.

### Windows

```bash
./run-applio.bat
```

### macOS

```bash
chmod +x run-applio.sh
./run-applio.sh
```

### Linux

```bash
chmod +x run-applio.sh
./run-applio.sh
```

### Makefile

For platforms such as [Paperspace](https://www.paperspace.com/):

```bash
make run-applio
```

## Technical Information

Applio uses an enhanced version of the Retrieval-based Voice Conversion (RVC) model, a powerful technique for transforming the voice of an audio signal to sound like another person. This advanced implementation of RVC in Applio enables high-quality voice conversion while maintaining simplicity and performance.

### 0. Pre-Learning: Key Concepts in Speech Processing and Voice Conversion

This section introduces fundamental concepts in speech processing and voice conversion, paving the way for a deeper understanding of the RVC pipeline:

#### 1. Speech Representation

- **Phoneme:** The smallest unit of sound in a language that distinguishes one word from another. Examples: /k/, /æ/, /t/.
- **Spectrogram:** A visual representation of the frequency content of a sound over time, showing how the intensity of different frequencies changes over the duration of the audio.
- **Mel-Spectrogram:** A type of spectrogram that mimics human auditory perception, emphasizing frequencies that are more important to human hearing (see the sketch after this list).
- **Speaker Embedding:** A vector representation that captures the unique acoustic characteristics of a speaker's voice, encoding information about pitch, tone, timbre, and other vocal qualities.
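
To make the representation concepts concrete, here is a minimal sketch that computes a spectrogram and a mel-spectrogram with `librosa`. The library choice, file name, and parameter values are illustrative assumptions, not Applio's exact preprocessing settings:

```python
# Minimal spectrogram / mel-spectrogram sketch (illustrative parameters,
# not Applio's exact preprocessing settings).
import librosa
import numpy as np

# Load audio as a mono waveform at a fixed sampling rate.
y, sr = librosa.load("voice.wav", sr=16000)

# Linear spectrogram: magnitude of the short-time Fourier transform.
stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Mel-spectrogram: the same energy mapped onto a perceptual mel scale.
mel = librosa.feature.melspectrogram(S=stft**2, sr=sr, n_mels=80)

# Log compression roughly matches perceived loudness.
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)  # (n_mels, n_frames)
```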

#### 2. Text-to-Speech (TTS)

- **TTS Model:** A machine learning model that generates artificial speech from written text.
- **Encoder-Decoder Architecture:** A common architecture in TTS models, where an encoder processes the text and pitch information to create a latent representation, and a decoder uses this representation to synthesize the audio signal.
- **Transformer Architecture:** A powerful neural network architecture particularly well-suited for sequence modeling, allowing the model to handle long sequences of text or audio and capture relationships between elements.

#### 3. Voice Conversion

- **Voice Conversion (VC):** The process of transforming the voice of a speaker in an audio signal to sound like another speaker.
- **Speaker Adaptation:** The process of adapting a TTS model to a specific speaker, often by training on a small dataset of the speaker's voice.
- **Retrieval-Based VC (RVC):** A voice conversion approach where speaker embeddings are retrieved from a database and used to guide the TTS model in synthesizing audio with the target speaker's voice.

#### 4. Additional Concepts

- **ContentVec:** A powerful self-supervised learning model for speech representation, excelling at capturing speaker-specific information.
- **FAISS:** A library for efficient similarity search, used to retrieve speaker embeddings that are similar to the extracted ContentVec embedding (see the sketch after this list).
- **Neural Source Filter (NSF):** A module that models audio generation as a filtering process, allowing the model to produce high-quality and realistic audio signals by learning complex relationships between the source signal and the output waveform.
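
To make the FAISS retrieval idea concrete, the sketch below indexes a set of embeddings and looks up the nearest neighbors of a query vector. The dimensions and random data are placeholders for illustration:

```python
# Nearest-neighbor retrieval over speaker embeddings with FAISS
# (toy data; the embedding dimension is an arbitrary illustrative choice).
import faiss
import numpy as np

dim = 256                                          # embedding dimensionality
db = np.random.rand(1000, dim).astype("float32")   # stored embeddings

index = faiss.IndexFlatL2(dim)                     # exact L2 search
index.add(db)                                      # build the index

query = np.random.rand(1, dim).astype("float32")   # e.g. an extracted embedding
distances, ids = index.search(query, 4)            # 4 most similar embeddings
print(ids[0], distances[0])
```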

#### 5. Why are these concepts important?

Understanding these concepts is essential for appreciating the mechanics and capabilities of the RVC pipeline:

- **Speech Representation:** Different representations capture different aspects of speech, allowing for effective analysis and manipulation.
- **TTS Models:** The TTS model forms the foundation of RVC, providing the ability to synthesize audio from text and pitch.
- **Voice Conversion:** Voice conversion aims to transfer a speaker's identity to a different audio signal.
- **ContentVec and Speaker Embeddings:** ContentVec provides a powerful way to extract speaker-specific information, which is crucial for accurate voice conversion.
- **FAISS:** This library enables efficient speaker embedding retrieval, facilitating the selection of appropriate target voices.
- **NSF:** The NSF is a critical component of the TTS model, contributing to the generation of realistic and high-quality audio.

### 1. Model Architecture

The RVC model comprises two main components:

#### A. Encoder-Decoder Network

This network synthesizes audio based on text and pitch information while incorporating speaker characteristics from the ContentVec embedding.

**Encoder:**

- **Input:** Phoneme sequences (text representation) and pitch information (optional).
- **Embeddings:**
  - Phonemes are represented as vectors using linear layers, creating a dense representation of the text input.
  - Pitch is usually converted to a one-hot encoding or a continuous value and embedded similarly.
- **Transformer Encoder:** Processes the embedded features in a highly parallel manner (a minimal sketch follows this list). It employs:
  - **Self-Attention:** Allows the encoder to attend to different parts of the input sequence to understand the relationships between words and their context.
  - **Feedforward Networks (FFN):** Apply non-linear transformations to further refine the features captured by self-attention.
  - **Layer Normalization:** Stabilizes training and improves performance by normalizing the outputs of each layer.
  - **Dropout:** A regularization technique to prevent overfitting.
- **Output:** Produces a latent representation of the input text and pitch, capturing their relationships and serving as the input for the decoder.
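
The sketch below shows the general shape of such an encoder in PyTorch. The layer sizes, the use of `nn.Embedding` for phonemes, and the pitch-binning scheme are illustrative assumptions, not Applio's actual hyperparameters:

```python
# General shape of a phoneme + pitch transformer encoder
# (sizes and embedding scheme are illustrative assumptions).
import torch
import torch.nn as nn

n_phonemes, n_pitch_bins, d_model = 100, 256, 192

phoneme_emb = nn.Embedding(n_phonemes, d_model)    # dense phoneme vectors
pitch_emb = nn.Embedding(n_pitch_bins, d_model)    # quantized pitch bins

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=d_model, nhead=2, dim_feedforward=768,
        dropout=0.1, batch_first=True,             # self-attention + FFN + dropout
    ),
    num_layers=6,
)

phonemes = torch.randint(0, n_phonemes, (1, 50))   # (batch, sequence)
pitch = torch.randint(0, n_pitch_bins, (1, 50))
latent = encoder(phoneme_emb(phonemes) + pitch_emb(pitch))
print(latent.shape)  # (1, 50, d_model): latent representation for the decoder
```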

**Decoder:**

- **Input:** The latent representation from the encoder.
- **Transformer Decoder:** Receives the encoder output and utilizes:
  - **Self-Attention:** Allows the decoder to attend to different parts of the generated sequence to maintain consistency and coherence in the output audio.
  - **Encoder-Decoder Attention:** Enables the decoder to incorporate information from the input text and pitch into the audio generation process.
- **Neural Source Filter (NSF):** A powerful component for generating audio, modeling the generation process as a filter applied to a source signal. It uses:
  - **Upsampling:** Increases the resolution of the latent representation to match the desired length of the audio signal.
  - **Residual Blocks:** Learn complex and non-linear relationships between input features and the output audio, contributing to realistic and detailed waveforms.
  - **Source Module:** Generates the excitation signal (often harmonic) that drives the NSF. It combines sine waves (for voiced sounds) and noise (for unvoiced sounds) to create a natural source signal, as sketched after this list.
  - **Noise Convolution:** Convolves noise with the harmonic signal to introduce additional variation and realism.
  - **Final Convolutional Layer:** Converts the filtered output to a single-channel audio waveform.
- **Output:** Synthesized audio signal.
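
As an illustration of the source module's harmonic-plus-noise idea, the sketch below builds a one-second excitation signal from a pitch track. The amplitude constants and the toy pitch track are simplifying assumptions:

```python
# Simplified harmonic-plus-noise excitation in the spirit of an NSF
# source module (amplitudes and the toy pitch track are assumptions).
import numpy as np

sr = 16000
f0 = np.full(sr, 220.0)        # one second at a steady 220 Hz...
f0[8000:] = 0.0                # ...with an unvoiced second half
voiced = f0 > 0

phase = 2 * np.pi * np.cumsum(f0 / sr)      # integrate F0 to get phase
harmonic = 0.1 * np.sin(phase)              # sine source for voiced samples
noise = 0.003 * np.random.randn(len(f0))    # noise source, always present

# Voiced samples: sine plus a little noise; unvoiced samples: stronger noise.
excitation = np.where(voiced, harmonic + noise, noise * 30.0)
```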

#### B. ContentVec Speaker Embedding Extractor

Extracts speaker-specific information from the input audio.

- **Input:** The preprocessed audio signal.
- **Processing:** The ContentVec model, trained on a massive dataset of speech, processes the input audio and extracts a speaker embedding vector, capturing the unique acoustic properties of the speaker's voice (see the sketch after this list).
- **Output:** A speaker embedding vector representing the voice of the speaker.
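
ContentVec belongs to the HuBERT family of self-supervised speech models. As a stand-in, the sketch below extracts frame-level features with torchaudio's pretrained HuBERT bundle; this is an illustrative assumption, since Applio loads an actual ContentVec checkpoint rather than this bundle:

```python
# Extracting HuBERT-style features as a stand-in for ContentVec
# (torchaudio's pretrained bundle is used here only for illustration).
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("voice.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model.extract_features(waveform)  # one tensor per layer
print(features[-1].shape)  # (batch, frames, feature_dim)
```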

### 2. Training Stage

The RVC model is trained using a combination of two key losses:

- **Generative Loss:**
  - **Mel-Spectrogram:** The Mel-spectrogram is computed for both the target audio and the generated audio.
  - **L1 Loss:** Measures the absolute difference between the Mel-spectrograms of the target and generated audio, encouraging the decoder to produce audio with a similar spectral profile (see the sketch after this list).
- **Discriminative Loss:**
  - **Multi-Period Discriminator:** Tries to distinguish between real and generated audio at different time scales, using convolution layers to capture long-term dependencies in the audio.
  - **Adversarial Training:** The generator tries to fool the discriminator by producing audio that sounds real, while the discriminator is trained to correctly identify generated audio.
- **Optional KL Divergence Loss:** Measures the difference between the distributions of latent variables generated by the encoder and a posterior encoder (which infers the latent representation from the target audio). Encourages the model to learn a more efficient and stable latent representation.
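
Schematically, the generative and adversarial terms combine as below. The tensors, the stand-in discriminator score, and the mel weight are placeholders for illustration (GAN-based vocoders commonly weight the mel term heavily):

```python
# Schematic of the generative (L1 mel) and adversarial loss terms
# (shapes, discriminator output, and weights are illustrative placeholders).
import torch
import torch.nn.functional as F

mel_target = torch.randn(1, 80, 200)     # mel of the real audio
mel_generated = torch.randn(1, 80, 200)  # mel of the synthesized audio

# Generative loss: L1 distance between mel-spectrograms.
loss_mel = F.l1_loss(mel_generated, mel_target)

# Adversarial loss (least-squares form): the generator is rewarded when
# the discriminator scores its output close to "real" (1.0).
disc_score_on_fake = torch.rand(1, 1)    # stand-in discriminator output
loss_adv = torch.mean((disc_score_on_fake - 1.0) ** 2)

loss_generator = loss_mel * 45.0 + loss_adv  # mel term typically dominates
```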

### 3. Inference Stage

The inference stage utilizes the trained model to convert the voice of an audio input to sound like a target speaker. Here's a breakdown:

**Input:**

- Phoneme sequences (text representation).
- Pitch information (optional).
- Target speaker ID (identifies the desired voice).

**Steps:**

- **ContentVec Embedding Extraction:**
  - The ContentVec model processes the input audio and extracts a speaker embedding vector, capturing the voice characteristics of the speaker.
- **Optional Embedding Retrieval:**
  - **FAISS Index:** Used to efficiently search for speaker embeddings similar to the extracted ContentVec embedding. It helps guide the voice conversion process toward a specific speaker when multiple speakers are available.
  - **Embedding Retrieval:** The FAISS index is queried using the extracted ContentVec embedding, and similar embeddings are retrieved.
- **Embedding Manipulation:**
  - **Blending:** The extracted ContentVec embedding can be blended with retrieved embeddings using the `index_rate` parameter, allowing control over how much the target speaker's voice influences the conversion (see the sketch after this list).
- **Encoder-Decoder Processing:**
  - **Encoder:** Encodes the phoneme sequences and pitch into a latent representation, capturing the relationships between them.
  - **Decoder:** Synthesizes the audio signal, incorporating the speaker characteristics from the ContentVec embedding (potentially blended with retrieved embeddings).
- **Post-Processing:**
  - **Resampling:** Adjusts the sampling rate of the generated audio if needed.
  - **RMS Adjustment:** Adjusts the volume (RMS) of the output audio to match the input audio.
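
The blending and RMS matching steps reduce to a few lines. The arrays below are placeholders, and the blending formula follows the `index_rate` description above:

```python
# Embedding blending and RMS volume matching (arrays are placeholders;
# the blend follows the index_rate description above).
import numpy as np

extracted = np.random.rand(200, 256)   # ContentVec features of the input
retrieved = np.random.rand(200, 256)   # similar features from the FAISS index
index_rate = 0.5                       # 0 = ignore index, 1 = index only

blended = index_rate * retrieved + (1 - index_rate) * extracted

def match_rms(output: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Scale `output` so its RMS level matches `reference`."""
    rms_out = np.sqrt(np.mean(output**2)) + 1e-8
    rms_ref = np.sqrt(np.mean(reference**2))
    return output * (rms_ref / rms_out)
```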

### 4. Key Techniques

- **Transformer Architecture:** The Transformer architecture is a powerful tool for sequence modeling, enabling the encoder and decoder to efficiently process long sequences and capture complex relationships within the data.
- **Neural Source Filter (NSF):** Models audio generation as a filtering process, allowing the model to produce high-quality and realistic audio signals by learning complex relationships between the source signal and the output waveform.
- **Flow-Based Generative Model:** Enables the model to learn complex probability distributions for the audio signal, leading to more realistic and diverse generated speech.
- **Multi-Period Discriminator:** Helps improve the quality and realism of the generated audio by evaluating the audio at different temporal scales and providing feedback to the generator.
- **Relative Positional Encoding:** Helps the model understand the relative positions of elements within the input sequences, improving the model's ability to handle long sequences and maintain context.

### 5. Future Challenges

Despite the advancements in Retrieval-Based Voice Conversion, several challenges and areas for future research remain:

- **Speaker Generalization:** Improving the ability of models to generalize to unseen speakers with minimal data.
- **Real-time Processing:** Enhancing the efficiency of models to support real-time voice conversion applications.
- **Emotional Expression:** Better capturing and transferring emotional nuances in voice conversion.
- **Noise Robustness:** Improving the robustness of voice conversion models to handle noisy and low-quality input audio.

## Repository Enhancements

This repository has undergone significant enhancements to improve its functionality and maintainability:

- **Modular Codebase:** Restructured codebase for better organization, readability, and maintenance.
- **Hop Length Implementation:** Improved efficiency and performance, especially on Crepe (formerly Mangio-Crepe), thanks to [@Mangio621](https://github.com/Mangio621/Mangio-RVC-Fork).
- **Translations in 30+ Languages:** Added support for over 30 languages.
- **Cross-Platform Compatibility:** Ensured seamless operation across various platforms.
- **Optimized Requirements:** Fine-tuned project requirements for enhanced performance.
- **Streamlined Installation:** Simplified installation process for a user-friendly setup.
- **Hybrid F0 Estimation:** Introduced a personalized 'hybrid' F0 estimation method utilizing `nanmedian` (see the sketch after this list).
- **Easy-to-Use UI:** Implemented an intuitive user interface.
- **Plugin System:** Introduced a plugin system for extending functionality.
- **Overtraining Detector:** Implemented a detector to prevent excessive training.
- **Model Search:** Integrated model search feature for easy discovery.
- **Pretrained Models:** Added support for custom pretrained models.
- **Voice Blender:** Developed a feature to combine two trained models to create a new one.
- **Accessibility Improvements:** Enhanced with descriptive tooltips for UI elements.
- **New F0 Extraction Methods:** Introduced methods like FCPE or Hybrid for pitch extraction.
- **Output Format Selection:** Added feature to choose audio file formats.
- **Hashing System:** Assigned unique IDs to models to prevent unauthorized duplication.
- **Model Download System:** Supported downloads from various platforms.
- **TTS Enhancements:** Improved Text-to-Speech functionality.
- **Split Audio:** Implemented audio splitting for faster processing.
- **Discord Presence:** Displayed usage status on Discord.
- **Flask Integration:** Enabled automatic model downloads via Flask.
- **Support Tab:** Added a tab for screen recording to report issues.
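
As a sketch of the 'hybrid' idea: pitch tracks from several estimators are stacked and combined per frame with `nanmedian`, so frames where one method fails (NaN) do not corrupt the result. The method names and values below are placeholders:

```python
# Combining F0 estimates from multiple methods with nanmedian
# (the per-method pitch tracks here are placeholders).
import numpy as np

f0_method_a = np.array([220.0, 221.0, np.nan, 219.0])  # one NaN frame
f0_method_b = np.array([219.5, np.nan, 218.0, 220.5])  # one NaN frame

stacked = np.vstack([f0_method_a, f0_method_b])
f0_hybrid = np.nanmedian(stacked, axis=0)  # per-frame median, ignoring NaNs
print(f0_hybrid)  # [219.75 221.   218.   219.75]
```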

These enhancements contribute to a more robust and scalable codebase, making the repository more accessible for contributors and users alike.

## Commercial Usage

For commercial purposes, please adhere to the guidelines outlined in the [MIT license](./LICENSE) governing this project. Prior to integrating Applio into your application, we kindly request that you contact us at [email protected] to ensure ethical use.

Please note that the use of Applio-generated audio files is your sole responsibility and must always respect applicable copyrights. We encourage you to consider supporting the continuous development and maintenance of Applio through a donation.

Your cooperation and support are greatly appreciated. Thank you!

## References

Applio is made possible thanks to these projects and those cited in their references.

- [gradio-screen-recorder](https://huggingface.co./spaces/gstaff/gradio-screen-recorder) by gstaff
- [rvc-cli](https://github.com/blaise-tk/rvc-cli) by blaisewf

### Contributors

<a href="https://github.com/IAHispano/Applio/graphs/contributors" target="_blank">
  <img src="https://contrib.rocks/image?repo=IAHispano/Applio" />
</a>