ClassicVC
ClassicVC is an any-to-any voice conversion model that lets users design original speaker styles by selecting coordinates in continuous latent spaces. The model components are implemented in PyTorch and are fully compatible with ONNX.
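As a rough illustration of what "selecting coordinates" can mean, the sketch below blends two style coordinates by linear interpolation. This is an assumption made for illustration only: the function name `blend_styles`, the 4-dimensional vectors, and the use of linear interpolation are all hypothetical and are not taken from the ClassicVC code, whose actual latent dimensionality and selection method are not described here.

```python
from typing import List, Sequence


def blend_styles(a: Sequence[float], b: Sequence[float], t: float) -> List[float]:
    """Linearly interpolate between two style coordinates.

    t = 0 returns a copy of `a`, t = 1 returns a copy of `b`.
    Hypothetical helper, not part of ClassicVC.
    """
    if len(a) != len(b):
        raise ValueError("style vectors must have the same dimensionality")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


# Made-up 4-dimensional coordinates; the real latent size may differ.
speaker_a = [0.0, 1.0, -1.0, 0.5]
speaker_b = [1.0, 0.0, 1.0, 0.5]

midpoint = blend_styles(speaker_a, speaker_b, 0.5)
print(midpoint)
```

Because the latent spaces are continuous, any intermediate value of `t` gives another candidate coordinate. Whether such a point yields a natural-sounding voice depends on the trained model.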
MMCXLI provides a dedicated graphical user interface (GUI) for ClassicVC. It runs on wxPython and ONNX Runtime. Users can download the ONNX files and try out voice conversion without installing PyTorch or training a model on their own voice data.
Model Details
Model Description
- Developed by: Lyodos (Lyodos the City of the Museum)
Model Sources
- Repository: GitHub
Uses
Under the MIT License, users can use the model code and checkpoints for research purposes. They are provided with no guarantees.
Direct Use
Out-of-Scope Use
This model was prototyped as a hobbyist's research into any-to-any voice conversion, and we make no guarantees, especially regarding its reliability or real-time operation.
Because the MIT License is the only license we state, we as the developer cannot prohibit use in situations involving an unspecified number of people (such as web broadcasting) or in mission-critical applications (including medical, transportation, infrastructure, and weapon systems). However, we do not encourage such use.
Bias, Risks, and Limitations
To make the speaker latent space, into which the ClassicVC style encoder embeds voices, as inclusive as possible of natural human voices, we used three large-scale speech corpora: LibriSpeech, Samrómur Children 21.09, and VoxCeleb 1 and 2.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
How to Get Started with the Model
Notebook 01 of the ClassicVC repository walks through the procedure for offline (non-real-time) voice conversion.
The MMCXLI repository provides the GUI, which runs in a local Python environment.
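Before feeding audio to a downloaded ONNX file, it can help to check which tensors the file expects. The following is a minimal sketch, not the procedure from Notebook 01 or MMCXLI. `load_session` uses the standard `onnxruntime.InferenceSession` API, while the file name in the comment and the tensor names and shapes in the stand-in object are made up for illustration and do not describe the real ClassicVC files.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Tuple


def load_session(onnx_path: str) -> Any:
    """Open an ONNX file with ONNX Runtime on the CPU."""
    # Imported lazily so that describe_io() can be tried without onnxruntime.
    import onnxruntime as ort

    return ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])


def describe_io(session: Any) -> Dict[str, List[Tuple[str, Any, str]]]:
    """Return (name, shape, type) for every input and output of a session."""

    def fmt(nodes):
        return [(n.name, n.shape, n.type) for n in nodes]

    return {
        "inputs": fmt(session.get_inputs()),
        "outputs": fmt(session.get_outputs()),
    }


# Stand-in object that mimics the two session methods used above, so the
# sketch runs without a model file. Names and shapes here are invented.
stub_session = SimpleNamespace(
    get_inputs=lambda: [
        SimpleNamespace(name="audio", shape=[1, "n_samples"], type="tensor(float)")
    ],
    get_outputs=lambda: [
        SimpleNamespace(name="style", shape=[1, "dim"], type="tensor(float)")
    ],
)

report = describe_io(stub_session)
print(report)

# With a real file (placeholder path), the call would look like:
#   report = describe_io(load_session("path/to/model.onnx"))
```

The reported names and shapes can then be used to build the input dictionary passed to `session.run`.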
Training Details
Training Data
The model checkpoints provided here were trained on the following three datasets.
- LibriSpeech ASR corpus
- V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 2015, pp. 5206-5210, doi: 10.1109/ICASSP.2015.7178964.
- https://ieeexplore.ieee.org/document/7178964
- https://openslr.org/12/
- Samrómur Children 21.09
- Mena, Carlos; et al., 2021, Samromur Children 21.09, CLARIN-IS, http://hdl.handle.net/20.500.12537/185.
- https://repository.clarin.is/repository/xmlui/handle/20.500.12537/185
- https://openslr.org/117/
- VoxCeleb 1 and 2
- A. Nagrani*, J. S. Chung*, A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset", Interspeech 2017
- J. S. Chung*, A. Nagrani*, A. Zisserman, "VoxCeleb2: Deep Speaker Recognition", Interspeech 2018
- A. Nagrani*, J. S. Chung*, W. Xie, A. Zisserman, "VoxCeleb: Large-scale speaker verification in the wild", Computer Speech and Language, 2019
- https://huggingface.co/datasets/ProgramComputer/voxceleb/tree/main/vox2
Training Procedure
Notebook 02 of the ClassicVC repository provides the procedure for data preparation.
Notebook 03 of the ClassicVC repository provides the training code.