Pendrokar 
posted an update about 1 month ago
TTS: Sorry, I just cannot get the hype behind F5 TTS. It has now gathered a thousand votes in the TTS Arena fork and **has remained in the #8 spot** against the _mostly_ Open TTS adversaries.

The voice sample used is the same as the one for XTTS. F5 has so far been unstable, sounding unemotional/monotone/depressed and mispronouncing words (_awestruck_).

If you have suggestions, please give feedback in the following thread:
mrfakename/E2-F5-TTS#32

@Pendrokar I mean the sample itself is pretty unemotional, so a good voice cloning model would have to be unemotional as well.

The instability is an issue though, one that even I don't like. It can be somewhat mitigated by the silence trimmer, but not fully.
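For anyone unfamiliar with what a silence trimmer does, here is a minimal sketch of the general idea, not F5's actual implementation. It assumes a mono waveform in a NumPy array, and the function name `trim_silence` as well as the threshold, frame, and padding values are made up for illustration. It only cuts quiet audio at the start and end of a clip, so it would not help with glitches in the middle.

```python
import numpy as np


def trim_silence(audio, sample_rate, threshold_db=-40.0, frame_ms=20.0, pad_ms=50.0):
    """Cut leading/trailing frames quieter than threshold_db relative to the loudest frame.

    Hypothetical helper for illustration; all defaults are arbitrary assumptions.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(audio) // frame_len
    if n_frames == 0:
        return audio  # clip is shorter than one frame, nothing to analyse

    # Split into fixed-size frames and measure the RMS energy of each one.
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))
    peak = rms.max()
    if peak == 0:
        return audio[:0]  # the whole clip is digital silence

    # Frame level in dB relative to the loudest frame.
    level_db = 20 * np.log10(np.maximum(rms, 1e-12) / peak)
    voiced = np.flatnonzero(level_db > threshold_db)

    # Keep everything between the first and last voiced frame, plus a little padding.
    pad = int(sample_rate * pad_ms / 1000)
    start = max(0, voiced[0] * frame_len - pad)
    end = min(len(audio), (voiced[-1] + 1) * frame_len + pad)
    return audio[start:end]


# Usage: one second of tone surrounded by one second of silence on each side.
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
clip = np.concatenate([np.zeros(sr), tone, np.zeros(sr)])
trimmed = trim_silence(clip, sr)
```

A relative threshold is used here so the result does not depend on how loud the clip was recorded; an absolute threshold would be an equally reasonable choice.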

·

True about the narration-style sample, but that still did not stop XTTS from surpassing F5. Both use the same sample.

This is conjecture, but it's possible the voice sample for XTTS is in-distribution, i.e. seen during training, and if so you'd expect it to perform better than F5 given the same reference. No knock on XTTS btw, Kokoro is equally guilty of this: the voice used in the Arena is also in-distribution.

It would not be surprising to me if voice cloning is simply "looking up" the most similar speaker, or an interpolation of speakers, seen in training. François Chollet has discussed this phenomenon many times wrt LLMs, and I highly recommend listening to his talks.

https://hf.co/spaces/hexgrad/Kokoro-TTS/discussions/3#6744bdea8c689a7071742134
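To make the "looking up the most similar speaker" conjecture concrete, here is a toy sketch of what such a lookup would amount to if speakers were represented as embedding vectors. This is purely illustrative: nothing here reflects how XTTS, F5, or Kokoro actually work, and the function name `nearest_speaker` and the random embeddings are assumptions made up for the example.

```python
import numpy as np


def nearest_speaker(reference, train_embeddings):
    """Return (index, cosine similarity) of the training speaker closest to the reference.

    Hypothetical helper: assumes each speaker is a fixed-size embedding vector.
    """
    ref = reference / np.linalg.norm(reference)
    bank = train_embeddings / np.linalg.norm(train_embeddings, axis=1, keepdims=True)
    sims = bank @ ref  # cosine similarity against every training speaker
    best = int(np.argmax(sims))
    return best, float(sims[best])


# Usage: an "in-distribution" reference is a training speaker plus a little noise.
rng = np.random.default_rng(0)
train = rng.normal(size=(5, 16))            # 5 made-up training speakers, 16-dim each
reference = train[3] + 0.01 * rng.normal(size=16)
idx, sim = nearest_speaker(reference, train)
```

Under this picture, an in-distribution reference lands almost exactly on a speaker the model already knows, while an unseen voice only gets its nearest approximation.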

·

True, a sample from the original dataset would probably be the best. My attempt to fetch one from the Emilia dataset was unsuccessful, as the HF dataset viewer can only show the German samples. Emilia's homepage gives an ASMR-y example prompt.