Generation stopping early

#1
by cdminix - opened

Thanks for releasing yet another model, looks really promising!
I'm working on adding your model to a benchmark, and for some reason, generation often cuts off early:
Example here: https://replicate.com/p/0pwfvbk43srmc0cmjn1s5jws5w
and here: https://replicate.com/p/c7eprj7rwxrme0cmjn2tegvqxm
Is this something you encountered? My inference code can be found here: https://github.com/ttsds/cog_tts/blob/main/amphion_vevo/predict.py

Hi cdminix, thanks for your interest!

Could you share your prompt wav files (and the corresponding texts) with me? I can only download the generated results from the URLs you shared. I'll check them.

By the way, I can't access the inference code link you shared. Maybe there is a permission issue?

Sorry about that, you should have access to the code now.

The English prompt is this one: https://github.com/open-mmlab/Amphion/blob/main/egs/tts/VALLE_V2/example.wav
"and keeping eternity before the eyes though much"

And the Chinese prompt is this one: https://maskgct.github.io/audios/icl_smaples/icl_20.wav
"对,这就是我万人敬仰的太乙真人,虽然有点婴儿肥,但也掩不住我逼人的帅气。" (according to Whisper; roughly, "Yes, this is me, Taiyi Zhenren, revered by all. I may be a bit baby-faced, but that can't hide my striking good looks.")

Thanks for the quick response!

Hi @cdminix , I have checked the two cases, and there are some very interesting observations.

Here is my inference code: https://github.com/RMSnow/Amphion/blob/vevo_hf_debug/models/vc/vevo/infer_vevotts_debug.py, and the generated results: https://github.com/RMSnow/Amphion/tree/vevo_hf_debug/models/vc/vevo/output_debug.

Specifically, I tried different prompts (e.g., the same prompt wav with different punctuation in the prompt text, or entirely different prompt wav files), like this:

```python
# ===== Debug for Case 1 =====
src_text = "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good."

ref_list = [
    {
        "wav_path": "./egs/tts/VALLE_V2/example.wav",
        "text": "and keeping eternity before the eyes though much",
    },
    {
        "wav_path": "./egs/tts/VALLE_V2/example.wav",
        "text": "and keeping eternity before the eyes, though much.",
    },
    {
        "wav_path": "./models/vc/vevo/wav/arabic_male.wav",
        "text": "Flip stood undecided, his ears strained to catch the slightest sound.",
    },
]
```

```python
# ===== Debug for Case 2 =====
src_text = "张飞爱吃包子,李白游览华山,奇珍异兽满山坡。"

ref_list = [
    {
        "wav_path": "./models/vc/vevo/wav/icl_20.wav",
        "text": "对,这就是我万人敬仰的太乙真人,虽然有点婴儿肥,但也掩不住我逼人的帅气。",
    },
    {
        "wav_path": "./models/vc/vevo/wav/icl_20.wav",
        "text": "对,这就是我,万人敬仰的太乙真人。虽然有点婴儿肥,但也掩不住我,逼人的帅气。",
    },
    {
        "wav_path": "./models/vc/vevo/wav/mandarin_female.wav",
        "text": "哇,恭喜你中了大乐透,八百万可真不少呢。有什么特别的计划或想法吗?",
    },
]
```

We can observe that the model's performance varies widely across prompts. For the first English case:

  • In the original prompt text you provided, there are no commas or full stops. Under this setting, it is difficult for zero-shot TTS models to handle a text pattern like "and keeping eternity before the eyes **though much With tenure**, Suzie'd have all the more leisure for yachting, but her publications are no good." (the prompt text and target text are concatenated), which is not a "familiar" pattern for the model.
  • After inserting some "stop" punctuation, the full text "and keeping eternity before the eyes, **though much. With tenure**, Suzie'd have all the more leisure for yachting, but her publications are no good." is a pattern much closer to those seen in pre-training, so the result can be better.
  • I also found that ./egs/tts/VALLE_V2/example.wav has no complete semantic context and no clear ending, which likewise seems uncommon in pre-training data. So I kept the same target text but used another prompt wav. The final result is good, as expected.
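The punctuation fix above can be automated. Below is a minimal sketch (the helper names are my own, not from the Amphion codebase) that ensures a prompt transcript ends with "stop" punctuation before it is concatenated with the target text, so the joined string reads like natural sentences:

```python
# Sentence-final punctuation in both Chinese (full-width) and English.
SENTENCE_END = ("。", "!", "?", ".", "!", "?")


def normalize_prompt_text(text: str) -> str:
    """Append a full stop if the prompt transcript ends mid-sentence."""
    text = text.strip()
    if text and not text.endswith(SENTENCE_END):
        text += "."
    return text


def build_full_text(prompt_text: str, target_text: str) -> str:
    # Zero-shot TTS models typically condition on prompt text followed by
    # target text; this mimics that concatenation for inspection.
    return f"{normalize_prompt_text(prompt_text)} {target_text.strip()}"
```

For example, `build_full_text("and keeping eternity before the eyes though much", src_text)` yields a string with a clear sentence boundary between prompt and target, which is the pattern the second bullet point found to work better.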

For the second Chinese case, the prompt ./models/vc/vevo/wav/icl_20.wav is extremely expressive. I believe such cases are very hard for most zero-shot TTS models. The fundamental problem, I think, is that such data is rare during pre-training, so the flaws of autoregressive models, such as early stopping or hallucination, become more obvious in these cases.

In short, these bad results are mainly caused by the train-inference mismatch in current zero-shot TTS designs. We believe this is a very interesting and important research problem worthy of future exploration.

Thank you for the in-depth response - robustness to different prompts seems like a challenging problem.
I will keep in mind punctuation and use complete utterances for my benchmark.
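A benchmark-side check for this could look like the sketch below (the function names are my own, not from cog_tts): keep only prompts whose transcripts read as complete utterances, i.e. they end with sentence-final punctuation.

```python
# Sentence-final punctuation in both Chinese (full-width) and English.
TERMINALS = ("。", "!", "?", ".", "!", "?")


def is_complete_utterance(text: str) -> bool:
    """True if the transcript ends with sentence-final punctuation."""
    return text.strip().endswith(TERMINALS)


def filter_prompts(prompts):
    """prompts: iterable of (wav_path, transcript) pairs.

    Returns only the pairs whose transcript is a complete utterance,
    which avoids the mid-sentence prompt boundary discussed above.
    """
    return [(wav, txt) for wav, txt in prompts if is_complete_utterance(txt)]
```

This would flag the original English prompt ("and keeping eternity before the eyes though much") while keeping prompts like the Chinese icl_20 transcript, which already ends with "。".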
Thanks again for making your work publicly available!

cdminix changed discussion status to closed
