ttsds/benchmark · The .txt file when submitting dataset

Nov 22, 2024

Create a dataset with your TTS model and the evaluation dataset. Use the wav files as speaker reference and the text as the prompt. Create a .tar.gz file with the dataset, and make sure to include .wav files and .txt files.

When submitting a dataset, is the .txt file the prompt or the transcript text?

cdminix

TTS Distribution Score org Nov 22, 2024

The text that is synthesised, i.e. the transcript text. Please use the same .txt files that are already in the evaluation set.
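
For anyone packaging a submission, here is a minimal sketch of the step described above: copy the evaluation set's .txt files unchanged and pair each with the .wav your model generated. The function name `package_submission`, the flat directory layout, and the assumption that synthesised files share the transcript's file stem (e.g. 001.txt and 001.wav) are all hypothetical and not confirmed by the benchmark's documentation.

```python
import tarfile
from pathlib import Path


def package_submission(eval_dir, synth_dir, out_path):
    """Bundle synthesised .wav files with the evaluation set's .txt files.

    Assumed layout (hypothetical): eval_dir holds NNN.txt transcripts,
    synth_dir holds the matching NNN.wav files produced by your TTS model.
    """
    eval_dir, synth_dir = Path(eval_dir), Path(synth_dir)
    with tarfile.open(str(out_path), "w:gz") as tar:
        for txt in sorted(eval_dir.glob("*.txt")):
            wav = synth_dir / (txt.stem + ".wav")
            if not wav.exists():
                raise FileNotFoundError(f"no synthesised audio for {txt.name}")
            # The transcript is added as-is, never rewritten.
            tar.add(str(txt), arcname=txt.name)
            # The audio comes from the TTS model, not from the reference set.
            tar.add(str(wav), arcname=wav.name)
    return out_path
```

Calling `package_submission("eval_set", "my_tts_output", "submission.tar.gz")` would then produce a single archive with one .txt and one .wav per sample, failing early if any sample was not synthesised.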

cdminix changed discussion status to closed Dec 4, 2024

Pendrokar

2 days ago · edited 2 days ago

@cdminix Are these really the prompts for the TTS to synthesize? Some seem cut off.

001.txt,"That is very important," said Holmes.
002.txt,"It's nearly two weeks now.
006.txt,three.
017.txt,"That's what you miss, Marie.
036.txt,You must excuse me."
048.txt,"To morrow.

A 30-second reference audio in 006.wav just to say "three"?

cdminix

TTS Distribution Score org 2 days ago

It's a valid point that this looks a bit "noisy" but that's because LibriTTS is noisy.

The reference dataset was created by randomly matching text and voice samples from the same speaker within LibriTTS.
I assume most TTS systems trained using LibriTTS also encounter long reference audio matched with short inference text or vice versa.
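
To illustrate why such mismatches appear, here is a rough sketch of random within-speaker matching. It is not the actual script used to build the reference dataset; the function name `match_within_speaker`, the tuple layout, and the shuffle-based pairing are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def match_within_speaker(utterances, seed=0):
    """Pair each reference wav with a randomly chosen text of the same speaker.

    utterances: list of (speaker_id, wav_name, text) tuples (assumed layout).
    Returns a list of (speaker_id, reference_wav, text) tuples.
    """
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for speaker, wav, text in utterances:
        by_speaker[speaker].append((wav, text))

    pairs = []
    for speaker, items in by_speaker.items():
        wavs = [wav for wav, _ in items]
        texts = [text for _, text in items]
        # Shuffling texts within one speaker keeps the voice consistent
        # but ignores length, so a long wav can land on a one-word text.
        rng.shuffle(texts)
        pairs.extend((speaker, wav, text) for wav, text in zip(wavs, texts))
    return pairs
```

Because nothing in this kind of pairing looks at duration or sentence boundaries, a 30-second reference ending up with the text "three." is an expected outcome rather than a bug.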

I guess the question is whether anything would be gained by manually curating a more "normalized" reference dataset.
On one hand, it could be argued that it would be closer to what we use for human evaluation; on the other hand, it wouldn't test whether TTS systems can handle edge cases (e.g. just a single word with a long reference).
I'll take it into consideration as I keep working on this, thanks for raising this!

cdminix changed discussion status to open 2 days ago