---
license: mit
tags:
- vits
- vits istft
- istft
pipeline_tag: text-to-speech
---

# VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

VITS is an end-to-end speech synthesis model that predicts a speech waveform conditioned on an input text sequence. It is a conditional variational autoencoder (VAE) composed of a posterior encoder, a decoder, and a conditional prior.

This repository contains the weights for the official VITS checkpoint trained on the [LJ Speech](https://huggingface.co./datasets/lj_speech) dataset.

# VITS ISTFT: New decoder

The ISTFT-based decoder synthesizes speech as natural as that of VITS while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than the original VITS. This makes it suitable for real-time and edge-device applications.

| Checkpoint | Train Hours | Speakers |
|------------|-------------|----------|
| [ljspeech_vits_ms_istft](https://huggingface.co./anhnct/ljspeech_vits_ms_istft) | 24 | 1 |
| [ljspeech_vits_mb_istft](https://huggingface.co./anhnct/ljspeech_vits_mb_istft) | 24 | 1 |
| [ljspeech_vits_istft](https://huggingface.co./anhnct/ljspeech_vits_istft) | 24 | 1 |

## Usage

To use this checkpoint, first install the latest version of the library:

```
pip install --upgrade transformers accelerate
```

Then, run inference with the following code snippet:

```python
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("anhnct/ljspeech_vits_mb_istft", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("anhnct/ljspeech_vits_mb_istft")

text = "Hey, it's Hugging Face on the phone"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    output = model(**inputs).waveform
```

The resulting waveform can be saved as a `.wav` file:

```python
import numpy as np
import scipy.io.wavfile

data_np = output.numpy()
data_np_squeezed = np.squeeze(data_np)
scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=data_np_squeezed)
```

Or
displayed in a Jupyter Notebook / Google Colab:

```python
from IPython.display import Audio

Audio(data_np_squeezed, rate=model.config.sampling_rate)
```

## License

The model is licensed as [**MIT**](https://github.com/jaywalnut310/vits/blob/main/LICENSE).
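To check the real-time factor (RTF) claim on your own hardware, divide the wall-clock synthesis time by the duration of the generated audio. A minimal sketch (the `real_time_factor` helper is illustrative, not part of this repository):

```python
def real_time_factor(synthesis_seconds: float, num_samples: int, sampling_rate: int) -> float:
    """Real-time factor: wall-clock synthesis time divided by audio duration.
    Values below 1.0 mean the model synthesizes faster than real time."""
    audio_seconds = num_samples / sampling_rate
    return synthesis_seconds / audio_seconds

# Example: 0.33 s of compute to synthesize 5 s of audio at 22,050 Hz
rtf = real_time_factor(0.33, 5 * 22050, 22050)
print(f"RTF: {rtf:.3f}")  # RTF: 0.066
```

In practice, you would wrap the `model(**inputs)` call from the usage snippet with `time.perf_counter()` and pass `output.shape[-1]` together with `model.config.sampling_rate` to the helper.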