How does TTS Arena work across different models?

#80
by zhihaodu - opened

As far as I know, the models in the arena are not all the same kind: some are fine-tuned (SFT) for a single speaker, like ElevenLabs and Play.HT 2.0, while others are zero-shot, like GPT-SoVITS and CosyVoice 2.0. How is it ensured that these models are compared fairly?

This information is hidden within the private TTS router Space. The dev can reveal some information about a model, but you have to ask directly.

TTS model authors can ask for changes, and they may have access to the parameters by being a member of TTS-AGI. How does one become a member? It has been a year-long mystery.

In the forked TTS Arena, I do not change the default parameters of the TTS Spaces. I don't think it is my task to find the settings that make each TTS perform best; that is the duty of the HF Space authors. The same goes for the F5 TTS Space, which has no default voice.

some are fine-tuned (SFT) for a single speaker, like ElevenLabs and Play.HT 2.0, while others are zero-shot, like GPT-SoVITS and CosyVoice 2.0.

You can also use this table to check which models are zero-shot (see the Insta-clone column):
https://huggingface.co/datasets/Pendrokar/open_tts_tracker

I see. Thanks for your comment. It's really helpful.

TTS AGI org

Hey @zhihaodu,
Sorry about the delay. What @Pendrokar said about the router Space is correct: we use a private endpoint to proxy requests to the models, which protects our API keys and routes the requests to privately hosted instances of the models. We aim to use the default settings for models when possible. For zero-shot models, if the authors do not recommend a voice, we try to find a well-performing reference voice sample for the model and use that.
Please let me know if you have any other questions!
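
For readers who want the voice-selection policy above spelled out, here is a minimal Python sketch of it. This is not the router's code, which is private. The `ModelEntry` class, the `pick_reference_voice` function, and the file names are all hypothetical and exist only to illustrate the three cases: fixed-voice models keep their defaults, zero-shot models use the author-recommended voice when there is one, and otherwise a chosen fallback sample is used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelEntry:
    """Hypothetical description of one arena model (not the real router schema)."""
    name: str
    zero_shot: bool
    author_voice: Optional[str] = None  # reference sample recommended by the authors


def pick_reference_voice(model: ModelEntry, fallback_voice: str) -> Optional[str]:
    """Return the reference voice sample to send with a request, or None."""
    if not model.zero_shot:
        # Fixed-voice models run with their default settings: no reference sample.
        return None
    if model.author_voice is not None:
        # Zero-shot model whose authors recommend a voice: use that one.
        return model.author_voice
    # Zero-shot model without a recommendation: use a sample chosen by the arena.
    return fallback_voice


if __name__ == "__main__":
    # Illustrative entries only; the voice file names are made up.
    models = [
        ModelEntry("fixed-voice-model", zero_shot=False),
        ModelEntry("zero-shot-with-voice", zero_shot=True, author_voice="author_sample.wav"),
        ModelEntry("zero-shot-no-voice", zero_shot=True),
    ]
    for m in models:
        print(m.name, "->", pick_reference_voice(m, "arena_fallback.wav"))
```

How the fallback sample is judged to be "well-performing" is not described in this thread, so the sketch simply takes it as an input.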
