Training a specific voice
Hi! Thanks a lot for this amazing model!
I was wondering what alternatives are out there for having the model respond in a specific voice without needing a prompt wav?
When I use the text prompt method, the output voice is randomly a male or female voice (English inference). I guess this emerges from the dataset and voices that were used to train the model.
Is there a way to force, or pretrain/fine-tune, the model to use a specific voice (I have a prompt wav and its transcript), so that we can use a text prompt instead of a wav prompt, especially for fast inference in a real-time use case?
So far, I've managed to turn the first and second examples from the repo into a Flask endpoint, but inference speed is approx. 50% RTF, so it isn't really usable in a real-time use case.
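For reference, the wrapper looks roughly like this, with a placeholder `synthesize()` standing in for the repo's actual inference call so the snippet stays self-contained:

```python
import io

import numpy as np
import soundfile as sf
from flask import Flask, request, send_file

app = Flask(__name__)


def synthesize(text: str) -> tuple[int, np.ndarray]:
    """Stand-in for the repo's inference call; returns (sample_rate, waveform).

    Replace the body with the actual model call from the example script.
    Here it just emits a short sine tone so the endpoint is testable.
    """
    sample_rate = 24000
    t = np.linspace(0, 1.0, sample_rate, endpoint=False)
    return sample_rate, 0.2 * np.sin(2 * np.pi * 440 * t).astype(np.float32)


@app.route("/tts", methods=["POST"])
def tts():
    text = request.get_json(force=True)["text"]
    sample_rate, audio = synthesize(text)

    # Serialize the waveform to an in-memory WAV file and return it.
    buf = io.BytesIO()
    sf.write(buf, audio, sample_rate, format="WAV")
    buf.seek(0)
    return send_file(buf, mimetype="audio/wav", download_name="tts.wav")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

It can then be called with e.g. `curl -X POST http://localhost:5000/tts -H 'Content-Type: application/json' -d '{"text": "Hello"}' -o out.wav`.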
I've also experimented a bit with streaming, but the quality of the chunked output isn't as good as the synchronous inference, and so far I haven't managed to use a prompt wav and a custom voice in the streamed implementation.
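For completeness, the streaming attempt has roughly this shape, with `synthesize_stream()` again being a placeholder for the model's real streaming interface (the part I haven't figured out is how to feed it the prompt wav and its transcript):

```python
from flask import Flask, Response, request

app = Flask(__name__)


def synthesize_stream(text: str, prompt_wav: str, prompt_text: str):
    """Stand-in generator: the real model would yield audio chunks here.

    `prompt_wav` is a path to the reference recording and `prompt_text`
    its transcript; both argument names are placeholders, not the model's API.
    """
    for _ in range(3):
        yield b"\x00" * 4800  # dummy chunks of 16-bit PCM silence


@app.route("/tts-stream", methods=["POST"])
def tts_stream():
    payload = request.get_json(force=True)
    chunks = synthesize_stream(
        payload["text"],
        payload.get("prompt_wav"),
        payload.get("prompt_text"),
    )
    # Stream raw PCM so the client can start playback before synthesis finishes.
    return Response(chunks, mimetype="application/octet-stream")
```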
Thank you for your help!
Hi, we've updated the fine-tuning instructions—you might find them helpful: GitHub link. Also, here's a demo showcasing the results: Hugging Face link.
Thanks a lot! Will give it a try asap :) Amazing work as always