	---
	license: mit
	tags:
	- vits
	- vits istft
	- istft
	pipeline_tag: text-to-speech
	---

	# VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

	VITS is an end-to-end speech synthesis model that predicts a speech waveform conditioned on an input text sequence. It is a
	conditional variational autoencoder (VAE) composed of a posterior encoder, decoder, and conditional prior. This repository
	contains the weights for the official VITS checkpoint trained on the [LJ Speech](https://huggingface.co./datasets/lj_speech) dataset.

	# VITS ISTFT

	The new decoder synthesizes speech as natural as that synthesized by VITS while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than the original VITS. This makes it suitable for real-time and edge-device applications.

	| Checkpoint | Train Hours | Speakers |
	|------------|-------------|----------|
	| [ljspeech_vits_ms_istft](https://huggingface.co./anhnct/ljspeech_vits_ms_istft) | 24 | 1 |
	| [ljspeech_vits_mb_istft](https://huggingface.co./anhnct/ljspeech_vits_mb_istft) | 24 | 1 |
	| [ljspeech_vits_istft](https://huggingface.co./anhnct/ljspeech_vits_istft) | 24 | 1 |
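
The real-time factor quoted above is the ratio of synthesis time to the duration of the generated audio, so values below 1 mean faster than real time. The sketch below shows that arithmetic only. The helper name `real_time_factor` and the example numbers are illustrative assumptions, not measurements from this checkpoint.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int, sampling_rate: int) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if num_samples <= 0 or sampling_rate <= 0:
        raise ValueError("num_samples and sampling_rate must be positive")
    audio_seconds = num_samples / sampling_rate
    return synthesis_seconds / audio_seconds


# Hypothetical numbers: 5 seconds of 22050 Hz audio generated in 0.33 seconds.
rtf = real_time_factor(0.33, 5 * 22050, 22050)
print(f"{rtf:.3f}")  # prints 0.066
```

To measure it for a real run, one option is to time the model call with `time.perf_counter()` and pass the elapsed time together with the number of samples in the output waveform and `model.config.sampling_rate`.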

	## Usage

	To use this checkpoint, first install the latest versions of the `transformers` and `accelerate` libraries:

	```
	pip install --upgrade transformers accelerate
	```

	Then run inference with the following code snippet:

	```python
	from transformers import AutoModel, AutoTokenizer
	import torch
	import numpy as np

	# trust_remote_code=True lets transformers run the model code stored in the repository
	model = AutoModel.from_pretrained("anhnct/ljspeech_vits_mb_istft", trust_remote_code=True)
	tokenizer = AutoTokenizer.from_pretrained("anhnct/ljspeech_vits_mb_istft")

	text = "Hey, it's Hugging Face on the phone"
	inputs = tokenizer(text, return_tensors="pt")

	# Inference only, so gradient tracking is disabled
	with torch.no_grad():
	    output = model(**inputs).waveform
	```

	The resulting waveform can be saved as a `.wav` file:

	```python
	import scipy.io.wavfile

	# Convert to NumPy and drop singleton dimensions so the array is 1-D
	data_np = output.numpy()
	data_np_squeezed = np.squeeze(data_np)
	scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=data_np_squeezed)
	```

	Or displayed in a Jupyter Notebook / Google Colab:

	```python
	from IPython.display import Audio

	Audio(data_np_squeezed, rate=model.config.sampling_rate)
	```
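
When saving to a `.wav` file, `scipy.io.wavfile.write` chooses the sample format from the array's dtype, so a float32 waveform is written as 32-bit floating-point audio. If 16-bit PCM is preferred for wider player compatibility, the waveform can be converted first. This is a minimal sketch that assumes the model output is a float array in the range [-1.0, 1.0]; `to_int16_pcm` is a hypothetical helper, not part of this repository.

```python
import numpy as np


def to_int16_pcm(waveform: np.ndarray) -> np.ndarray:
    """Scale a float waveform in [-1.0, 1.0] to 16-bit signed PCM."""
    # Clip first so slightly out-of-range samples do not wrap around.
    clipped = np.clip(waveform, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)


# Stand-in for `data_np_squeezed`; the last sample is deliberately out of range.
example = np.array([0.0, 0.5, -1.0, 1.2], dtype=np.float32)
pcm = to_int16_pcm(example)
print(pcm.dtype, pcm.tolist())  # int16 [0, 16383, -32767, 32767]
```

The converted array can then be passed as `data=` to `scipy.io.wavfile.write` in place of `data_np_squeezed`.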

	## License

	The model is licensed under the [MIT License](https://github.com/jaywalnut310/vits/blob/main/LICENSE).