VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
VITS is an end-to-end speech synthesis model that predicts a speech waveform conditioned on an input text sequence. It is a conditional variational autoencoder (VAE) comprising a posterior encoder, a decoder, and a conditional prior. This repository contains the weights for iSTFT-based VITS checkpoints trained on the LJ Speech dataset.
VITS iSTFT: the new iSTFT-based decoder synthesizes speech as natural as that of VITS while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than the original model, making it suitable for real-time and edge-device applications.
| Checkpoint | Train Hours | Speakers |
|---|---|---|
| ljspeech_vits_ms_istft | 24 | 1 |
| ljspeech_vits_mb_istft | 24 | 1 |
| ljspeech_vits_istft | 24 | 1 |
Usage
To use this checkpoint, first install the latest versions of the Transformers and Accelerate libraries:
```
pip install --upgrade transformers accelerate
```
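You can optionally confirm which versions were installed; this is just a sanity check, as this model card does not pin a specific minimum version:

```python
import accelerate
import transformers

print(transformers.__version__, accelerate.__version__)
```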
Then, run inference with the following code snippet:
```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load the checkpoint and its tokenizer; the model code lives in the
# repository, so trust_remote_code=True is required
model = AutoModel.from_pretrained("anhnct/ljspeech_vits_ms_istft", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("anhnct/ljspeech_vits_ms_istft")

text = "Hey, it's Hugging Face on the phone"
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    output = model(**inputs).waveform
```
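To get a rough sense of the real-time factor quoted above on your own hardware, you can time a forward pass and divide by the duration of the generated audio. This is a minimal sketch, not the benchmarking protocol behind the 0.066 figure; results will vary with CPU, thread count, and input length. It reuses `model` and `inputs` from the snippet above.

```python
import time

import torch

# Warm-up pass so one-time initialization does not skew the timing
with torch.no_grad():
    model(**inputs)

start = time.perf_counter()
with torch.no_grad():
    waveform = model(**inputs).waveform
elapsed = time.perf_counter() - start

# Real-time factor = synthesis time / duration of the synthesized audio
audio_seconds = waveform.shape[-1] / model.config.sampling_rate
print(f"Real-time factor: {elapsed / audio_seconds:.3f}")
```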
The resulting waveform can be saved as a `.wav` file:
```python
import numpy as np
import scipy.io.wavfile

# Convert the torch tensor to a 1-D numpy array before writing
data_np = output.numpy()
data_np_squeezed = np.squeeze(data_np)
scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=data_np_squeezed)
```
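The same pattern extends to several utterances in a loop. Below is a minimal sketch; the sentence list and output file names are illustrative, and `model` and `tokenizer` are the objects loaded above.

```python
import numpy as np
import scipy.io.wavfile
import torch

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Speech synthesis has come a long way.",
]

for i, sentence in enumerate(sentences):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform
    # Write each utterance to its own file
    scipy.io.wavfile.write(
        f"utterance_{i}.wav",
        rate=model.config.sampling_rate,
        data=np.squeeze(waveform.numpy()),
    )
```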
The waveform can also be played back directly in a Jupyter Notebook or Google Colab:
```python
from IPython.display import Audio

Audio(data_np_squeezed, rate=model.config.sampling_rate)
```
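The other checkpoints in the table above can be loaded with the same code by swapping the repository name. A short sketch, assuming the other checkpoints are published under the same anhnct namespace:

```python
from transformers import AutoModel, AutoTokenizer

# Substitute any checkpoint from the table above
repo_id = "anhnct/ljspeech_vits_mb_istft"  # assumed to live under the same namespace
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
```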
License
The model is licensed under the MIT License.