What is Retrieval-based Voice Conversion WebUI?
Retrieval-based Voice Conversion WebUI is an open-source framework designed to make voice conversion simple and efficient. Built on the VITS model, it provides an easy-to-use interface for both inference and training, making it accessible even to those with limited experience in machine learning or audio processing. The WebUI supports a range of features, including voice conversion, real-time voice changing, and the ability to train models using small datasets.
UI preview
Training and inference WebUI: go-web.bat / infer-web.py
Real-time voice changing GUI: go-realtime-gui.bat
Key Features:
- Tone Leakage Reduction: Utilizes top-1 retrieval to replace source features with training-set features.
- Easy and Fast Training: Can be done even on low-end graphics cards.
- Model Fusion: Change timbres by merging checkpoints.
- UVR5 Integration: Quickly separates vocals from instruments.
- AMD/Intel Acceleration: Supports GPU acceleration on a wide range of hardware.
- Real-Time Voice Changing: Achieves low latency (down to 90ms) with supported hardware.
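The top-1 retrieval behind the tone-leakage reduction can be sketched in a few lines: for each source feature frame, find its nearest neighbour in the training-set feature bank and blend it into the frame. This is a minimal pure-Python illustration, not RVC's actual implementation (which searches a faiss index over HuBERT features); the `index_rate` blend ratio mirrors the WebUI option of the same name.

```python
# Illustrative sketch of top-1 feature retrieval (not RVC's real code).
# Brute-force Euclidean search over plain Python lists stands in for
# the faiss index RVC builds over HuBERT features.

def nearest(frame, bank):
    """Return the training-set vector closest to `frame`."""
    return min(bank, key=lambda v: sum((a - b) ** 2 for a, b in zip(frame, v)))

def retrieve(source, bank, index_rate=0.75):
    """Blend each source frame with its top-1 match from `bank`.

    index_rate=1.0 fully replaces source features (maximum tone-leakage
    reduction); 0.0 leaves the source features untouched.
    """
    out = []
    for frame in source:
        match = nearest(frame, bank)
        out.append([index_rate * m + (1 - index_rate) * s
                    for s, m in zip(frame, match)])
    return out
```

For example, `retrieve([[0.0, 0.0]], [[1.0, 1.0], [4.0, 4.0]], index_rate=1.0)` returns `[[1.0, 1.0]]`: the source frame is replaced outright by its nearest training-set feature.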
Getting Started with Inference and Training
1. Set Up the Environment
To start using the Retrieval-based Voice Conversion WebUI, you’ll first need to prepare your environment. The framework requires Python 3.8 or higher.
Install Core Dependencies:
For NVIDIA GPUs:
pip install torch torchvision torchaudio
For AMD GPUs on Linux:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2
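After installing one of the PyTorch builds above, it is worth confirming that your GPU is actually visible before launching training. Below is a small hedged helper (not part of the WebUI) that picks a device string; it is wrapped in a try/except so it degrades to "cpu" when torch is missing or no accelerator is found. Note that the ROCm build of PyTorch also reports AMD GPUs through `torch.cuda`.

```python
# Sketch: pick a compute device before launching training/inference.
# Falls back to "cpu" if torch is not installed or no GPU is visible.

def pick_device():
    try:
        import torch
        if torch.cuda.is_available():  # covers CUDA and ROCm builds alike
            return "cuda"
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"  # Apple Silicon
    except ImportError:
        pass
    return "cpu"
```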
Install Other Dependencies:
Install using Poetry:
curl -sSL https://install.python-poetry.org | python3 -
poetry install
Or using pip:
pip install -r requirements.txt
2. Download Pre-trained Models
The WebUI requires several pre-trained models to function properly. You can download these automatically using a provided script:
python tools/download_models.py
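If you download the files manually instead, a quick existence check saves a failed launch later. The sketch below assumes an asset layout like the one in the RVC repository (e.g. `assets/hubert/hubert_base.pt`); the exact file list is an assumption here and may change between releases, so check the repository for the current set.

```python
from pathlib import Path

# Assumed asset layout -- verify against the repository's current docs.
REQUIRED = [
    "assets/hubert/hubert_base.pt",
    "assets/rmvpe/rmvpe.pt",
]

def missing_models(root="."):
    """Return the required model files not present under `root`."""
    base = Path(root)
    return [p for p in REQUIRED if not (base / p).exists()]
```

Running `missing_models()` from the repo root returns an empty list when every expected file is in place.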
Alternatively, download the models manually from Hugging Face.
3. Install FFmpeg
FFmpeg is necessary for handling audio files. Installation steps vary depending on your operating system:
- Ubuntu/Debian:
sudo apt install ffmpeg
- macOS:
brew install ffmpeg
- Windows:
Download ffmpeg.exe and ffprobe.exe from Hugging Face and place them in the root folder.
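Since a missing FFmpeg only surfaces once audio processing starts, a quick check up front is useful. This small helper (not part of the WebUI) uses the standard library's `shutil.which` to confirm both binaries are findable:

```python
import shutil

def ffmpeg_available():
    """Return True when both ffmpeg and ffprobe can be located on PATH."""
    return (shutil.which("ffmpeg") is not None
            and shutil.which("ffprobe") is not None)
```

On Windows, `shutil.which` also searches the current directory, so executables dropped in the repo root are found when you launch from there.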
4. Start the WebUI
Once your environment is set up and the necessary models are downloaded, you can start the WebUI.
For general usage:
python infer-web.py
For Windows users, you can also start the WebUI by double-clicking go-web.bat.
Training a New Model
Training your own voice conversion model with Retrieval-based Voice Conversion WebUI is straightforward and can be done with as little as 10 minutes of low-noise speech data.
- Prepare Your Dataset: Collect and preprocess your audio data.
- Start the Training Interface: Launch the WebUI as described above and navigate to the training section.
- Set Training Parameters: Configure the model parameters and training options based on your dataset.
- Begin Training: Start the training process. The WebUI will guide you through each step, providing feedback on the model's progress.
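Since roughly 10 minutes of low-noise speech is enough, it helps to total up your dataset's duration before starting step 2. A standard-library sketch using the `wave` module is below; it assumes the dataset is plain PCM WAV files (other formats need FFmpeg or a library such as soundfile), and the directory name is illustrative.

```python
import wave
from pathlib import Path

def total_minutes(dataset_dir):
    """Sum the duration of all .wav files under `dataset_dir`, in minutes."""
    seconds = 0.0
    for path in Path(dataset_dir).rglob("*.wav"):
        with wave.open(str(path), "rb") as wf:
            seconds += wf.getnframes() / wf.getframerate()
    return seconds / 60.0
```

For example, `total_minutes("dataset")` below about 10 suggests recording more material before training.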
Links and Resources
- Colab Notebook: Run the WebUI in a Colab environment.
- GitHub Repository: Access the source code and documentation.
- Hugging Face Models: Download pre-trained models and other necessary files.
Conclusion
Retrieval-based Voice Conversion WebUI provides a powerful and flexible tool for voice conversion and real-time voice modification. Whether you're a researcher, developer, or enthusiast, this framework offers a robust set of features to explore voice conversion technology.
For more detailed instructions and updates, visit the official GitHub repository.