---
license: mit
---

# Vokan

Vokan is a StyleTTS2 fine-tune designed for expressiveness.


Vokan features:

- A diverse dataset for more authentic zero-shot performance
- Training on 6+ days' worth of audio, with 672 diverse and expressive speakers
- Training on 1x H100 for 300 hours and 1x 3090 for an additional 600 hours

### Audio Examples

### Demo Spaces

Coming soon...

## This model was made possible thanks to

- [DagsHub](https://dagshub.com), who sponsored us with their GPU compute (with special thanks to Dean!)
- [camenduru](https://github.com/camenduru), for his assistance with cloud infrastructure and model training
## Citations

```bibtex
@misc{li2023styletts,
      title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
      author={Yinghao Aaron Li and Cong Han and Vinay S. Raghavan and Gavin Mischler and Nima Mesgarani},
      year={2023},
      eprint={2306.07691},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

@misc{zen2019libritts,
      title={LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech},
      author={Heiga Zen and Viet Dang and Rob Clark and Yu Zhang and Ron J. Weiss and Ye Jia and Zhifeng Chen and Yonghui Wu},
      year={2019},
      eprint={1904.02882},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```

Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, "CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit", The Centre for Speech Technology Research (CSTR), University of Edinburgh

## License

```
MIT
```

Stay tuned for Vokan V2!