Overview

This F5-TTS model is fine-tuned on the LJSpeech dataset with an emphasis on stability: the goal is to avoid choppiness, mispronunciations, repetitions, and skipped words.

Differences from the original model: the text input is converted to phonemes rather than used as raw text. Phoneme alignments are used during training, whereas a duration predictor is used during inference.

Source code for phoneme alignment: https://github.com/sinhprous/F5-TTS/blob/main/src/f5_tts/train/datasets/utils_alignment.py

Source code for duration predictor: https://github.com/sinhprous/F5-TTS/blob/main/src/f5_tts/model/duration_predictor.py
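
The split between training and inference described above can be illustrated with a small sketch. It assumes that per-phoneme durations are expressed in frames and consumed by simple repetition (length regulation, as in FastSpeech-style models). This may not match what the linked repository actually does. The function name, the phoneme symbols, and all duration values below are hypothetical.

```python
from typing import List, Sequence


def expand_by_duration(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme once per frame it is expected to occupy.

    Illustrative only: this is generic length regulation, not the
    repository's implementation.
    """
    if len(phonemes) != len(durations):
        raise ValueError("one duration per phoneme is required")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    frames: List[str] = []
    for phoneme, duration in zip(phonemes, durations):
        frames.extend([phoneme] * duration)
    return frames


# Hypothetical phoneme sequence for a short word.
phonemes = ["HH", "AH", "L", "OW"]

# Training: durations come from aligning the phonemes to the recording
# (hypothetical values, in frames).
aligned_durations = [3, 5, 2, 6]

# Inference: there is no recording to align against, so a duration
# predictor supplies estimates instead (hypothetical values, in frames).
predicted_durations = [3, 4, 3, 6]

train_frames = expand_by_duration(phonemes, aligned_durations)
infer_frames = expand_by_duration(phonemes, predicted_durations)
```

In both cases the model receives a frame-level phoneme sequence. Only the source of the durations differs.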

Colab demo: colab

Audio samples

Outputs from the original model were generated using https://huggingface.co./spaces/mrfakename/E2-F5-TTS . The original model usually skips words in these hard texts.

Data - driven AI systems said, "Key data is the key, data is key, data is key, data is the key, and the key to the data is key, the data key is the key to the data that is key to the key". Can you keep up?

Original model:

Finetuned model:

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.

Original model:

Finetuned model:

Call one two three - one two three - one two three four who call one two three - one two three - one two three four who call one two three - one two three - one two three four who call one two three - one two three - one two three four.

Original model:

Finetuned model:

License

This model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license, which allows free non-commercial use, modification, and distribution, provided that credit is given and derivatives are shared under the same license.

Model Information

Base Model: SWivid/F5-TTS
Total Training Duration: 130,000 steps

Training Configuration:

"exp_name": "F5TTS_Base",
"learning_rate": 1e-05,
"batch_size_per_gpu": 2000,
"batch_size_type": "frame",
"max_samples": 64,
"grad_accumulation_steps": 1,
"max_grad_norm": 1,
"epochs": 144,
"num_warmup_updates": 5838,
"save_per_updates": 11676,
"last_per_steps": 2918,
"finetune": true,
"file_checkpoint_train": "",
"tokenizer_type": "char",
"tokenizer_file": "",
"mixed_precision": "fp16",
"logger": "wandb",
"bnb_optimizer": true
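
As a rough aid for reading these values, the sketch below parses a few of the settings and derives two quantities from them. It rests on assumptions that are not stated in this card: a 24 kHz sample rate and a hop length of 256 samples per frame (to convert the frame-based batch size to seconds), and a plain linear warmup to the peak learning rate (any decay afterwards is not modelled). The helper names are hypothetical.

```python
import json

# A subset of the configuration above, wrapped in braces so it parses as JSON.
config_fragment = """
"learning_rate": 1e-05,
"batch_size_per_gpu": 2000,
"batch_size_type": "frame",
"num_warmup_updates": 5838
"""
config = json.loads("{" + config_fragment + "}")

# ASSUMPTIONS (not stated in this card): audio settings used to turn
# mel frames into seconds.
SAMPLE_RATE = 24_000  # samples per second
HOP_LENGTH = 256      # samples per mel frame


def frames_to_seconds(n_frames: int,
                      sample_rate: int = SAMPLE_RATE,
                      hop_length: int = HOP_LENGTH) -> float:
    """Convert a number of mel frames to seconds of audio."""
    return n_frames * hop_length / sample_rate


def warmup_lr(step: int, peak_lr: float, warmup_updates: int) -> float:
    """Linear warmup from 0 to peak_lr, then hold (assumed schedule shape)."""
    return peak_lr * min(1.0, step / warmup_updates)


# With batch_size_type "frame", the batch size counts frames, not utterances.
seconds_per_batch = frames_to_seconds(config["batch_size_per_gpu"])

# Learning rate halfway through the warmup phase.
lr_halfway = warmup_lr(2919, config["learning_rate"], config["num_warmup_updates"])
```

Under these assumptions, a batch of 2000 frames corresponds to roughly 21.3 seconds of audio per GPU, and the learning rate reaches about half of its peak value at update 2919.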

Usage Instructions

Go to base repo

To do

  • Multi-speaker model

Other links
