
Can you use a StyleTTS2 network in Kokoro?

#111
by hugeingface - opened

Posting here as well as on GitHub since this discussion is fuller:

I have a few StyleTTS2 networks I've trained and fine-tuned with some super high quality data I've recorded (I'm an audio engineer). I'm very happy with those, but I'd like to try to bring them over to Kokoro, since it's so much faster!

It's a little unclear what modifications, if any, are needed. Most fields in the config files seem about the same, other than a few sections being cut out.

Do I need to re-train a StyleTTS2 network with the different configuration to get something workable in Kokoro, or is it possible to use my StyleTTS2 work directly in this system?

You can find the inference code at https://github.com/hexgrad/kokoro; there are currently not many files.

The main inference differences are tokenization, the loading of style vectors from voicepacks, and the lack of style diffusion. It is unlikely your StyleTTS2 models will work out of the box with Kokoro, but with some modifications they might (PRs welcome).
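
As a rough sketch of what the voicepack side looks like (the path, tensor shape, and indexing detail below are illustrative assumptions rather than the exact API; check the inference code in the repo for the real thing):

```python
import torch

# Minimal sketch of a voicepack-based style lookup. The file path, tensor
# shape, and length-based indexing are assumptions for illustration only.
voicepack = torch.load("voices/af.pt", weights_only=True)  # e.g. [max_len, 1, 256]

# A precomputed style vector is selected per utterance (here keyed by the
# tokenized phoneme length), so there is no diffusion sampling at inference.
num_tokens = 42                  # stand-in for len(phoneme_tokens)
ref_s = voicepack[num_tokens]    # one fixed style vector for this utterance
print(ref_s.shape)
```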

The inference speed gain can likely be explained by the lack of style diffusion, so the easiest path to speeding up your own inference is probably to skip that step, although it may impact the output.
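
One way to skip it, sketched below, is to extract a fixed style from reference audio once and reuse it in place of the sampled style. This assumes your checkpoint exposes `style_encoder` and `predictor_encoder` modules along the lines of the StyleTTS2 demo code; treat the names as assumptions.

```python
import torch

def fixed_style_from_reference(model, ref_mel):
    """Sketch only: derive a fixed style vector from a reference
    mel-spectrogram instead of sampling one with the diffusion sampler on
    every call. Attribute names (style_encoder, predictor_encoder) loosely
    follow the StyleTTS2 demo code and may not match your checkpoint."""
    with torch.no_grad():
        ref_s = model.style_encoder(ref_mel.unsqueeze(1))      # acoustic style
        ref_p = model.predictor_encoder(ref_mel.unsqueeze(1))  # prosodic style
        return torch.cat([ref_s, ref_p], dim=-1)

# Pass the returned vector wherever the sampled style was consumed downstream,
# bypassing the diffusion sampler entirely.
```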

Additionally, you can see that Kokoro uses an ISTFTNet decoder, which is the second paper cited in the bubbles at the top. ISTFTNet should be faster than HiFiGAN by somewhere between 1.5x and 2x. If you have trained a StyleTTS2 network with a HiFiGAN decoder, I am not aware of any easy way to flip it over to ISTFTNet without retraining.
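
For reference, the decoder section of the config is roughly where this difference shows up. The keys and values below are only meant to illustrate the structural difference and are assumptions from memory of the public StyleTTS2 configs, not authoritative values:

```python
# Illustrative decoder configs (keys/values assumed, not authoritative).
decoder_hifigan = {
    "type": "hifigan",
    "upsample_rates": [10, 5, 3, 2],
    "upsample_kernel_sizes": [20, 10, 6, 4],
}

decoder_istftnet = {
    "type": "istftnet",
    "upsample_rates": [10, 6],         # fewer upsampling stages...
    "upsample_kernel_sizes": [20, 12],
    "gen_istft_n_fft": 20,             # ...because the final iSTFT fills in the rest
    "gen_istft_hop_size": 5,
}
# The decoder weights are shaped differently in the two cases, so a HiFiGAN
# checkpoint cannot simply be reinterpreted as ISTFTNet; that part of the
# network would need retraining.
```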

hexgrad changed discussion status to closed

Thanks! As I was reviewing this code and seeing whether I could load one of my networks, that (hifigan vs istftnet) seemed to me to be the main point of difference in the configs. So some re-training would be in order with different parameters. The tokenization and voicepacks seem broadly similar, and clearly the style diffusion isn't critical for good generation. Interesting work!
