@Pendrokar on Hugging Face: "TTS: Sorry, I just cannot get the hype behind F5 TTS. It has now gathered a…"

Pendrokar

posted an update Nov 24, 2024

TTS: Sorry, I just cannot get the hype behind F5 TTS. It has now gathered a thousand votes in the TTS Arena fork and **has remained in the #8 spot** against the _mostly_ open TTS adversaries.

The voice sample used is the same as XTTS. F5 has so far been unstable, being unemotional/monotone/depressed and mispronouncing words (_awestruck_).

If you have suggestions please give feedback in the following thread:
mrfakename/E2-F5-TTS#32

YaTharThShaRma999

Nov 24, 2024

@Pendrokar I mean the sample itself is pretty unemotional, so a good voice cloning model would have to be unemotional as well.

The instability is an issue though, and even I don't like it. It can be somewhat solved by the silence trimmer, but not fully.
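For anyone wondering what a silence trimmer does in principle, here is a minimal sketch. It is not the F5 implementation: the function name `trim_silence`, the `threshold` and `pad` parameters, and the toy sample values are all made up for illustration. It only removes quiet samples at the start and end of a clip, which may be one reason trimming helps with instability only partially, since glitches in the middle of the audio are left untouched.

```python
def trim_silence(samples, threshold=0.01, pad=0):
    """Drop leading and trailing samples quieter than `threshold`.

    `samples` is a list of floats in [-1.0, 1.0]; `pad` keeps that many
    extra samples on each side. All names here are hypothetical.
    """
    # Indices of every sample that counts as "loud".
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # the whole clip is silence
    start = max(loud[0] - pad, 0)
    end = min(loud[-1] + 1 + pad, len(samples))
    return samples[start:end]


# Toy clip: silence, speech with a short internal pause, silence.
audio = [0.0, 0.001, 0.2, -0.3, 0.0, 0.25, 0.002, 0.0]
trimmed = trim_silence(audio)
print(trimmed)
```

Note that the internal pause (the `0.0` between `-0.3` and `0.25`) survives, because only the edges are trimmed.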

Pendrokar

Nov 24, 2024

True about the narration-style sample, but that still did not stop XTTS from surpassing F5. Both use the same sample.

hexgrad

Nov 25, 2024

This is conjecture, but it's possible the voice sample for XTTS is in-distribution, i.e. seen during training, and if so you'd expect it to perform better than F5 given the same reference. No knock on XTTS btw, Kokoro is equally guilty of this: the voice used in the Arena is also in-distribution.

It would not be surprising to me if voice cloning is simply "looking up" the most similar speaker, or an interpolation of speakers, seen in training. François Chollet has discussed this phenomenon many times wrt LLMs, and I highly recommend listening to his talks.
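To make the "looking up" idea concrete, here is a toy sketch, purely illustrative and in line with the conjecture above. It assumes each speaker can be summarized as a small embedding vector and picks the training speaker with the highest cosine similarity to the reference. The speaker names, the 3-dimensional vectors, and the helper functions are all hypothetical; no real TTS model is claimed to work this way.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_speaker(reference, training_speakers):
    """Return the name of the training speaker most similar to `reference`."""
    return max(
        training_speakers,
        key=lambda name: cosine(reference, training_speakers[name]),
    )


# Hypothetical speaker embeddings "seen in training".
training_speakers = {
    "speaker_a": [1.0, 0.0, 0.0],
    "speaker_b": [0.0, 1.0, 0.0],
    "speaker_c": [0.6, 0.8, 0.0],
}

# An out-of-distribution reference voice lands on its closest neighbour.
reference = [0.5, 0.9, 0.1]
closest = nearest_speaker(reference, training_speakers)
print(closest)
```

Under this picture, an in-distribution reference matches a training speaker exactly, while an unseen voice only gets its nearest approximation, which would favor a model whose Arena sample was seen in training.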

https://hf.co/spaces/hexgrad/Kokoro-TTS/discussions/3#6744bdea8c689a7071742134

Pendrokar

Nov 27, 2024

True, a sample from the original dataset would probably be the best. My attempt to fetch one from the Emilia dataset was unsuccessful, as the HF dataset viewer can only show the German samples. Emilia's homepage gives an ASMR-y example prompt.
