Training a specific voice
Hi! Thanks a lot for this amazing model!
I was wondering what alternatives are out there for having the model respond in a specific voice without needing a prompt wav?
When I use the text prompt method, the output voice is randomly a male or female voice (English inference). I guess this emerges from the dataset and voices that were used to train the model.
Is there a way to force, or pretrain/fine-tune, the model to use a specific voice (I have a prompt wav and its transcript), so that we can use a text prompt instead of a wav prompt, especially for fast inference in a real-time use case?
So far, I've managed to turn the first and second examples from the repo into a Flask endpoint, but inference speed is approx. 50% RTF, so it isn't really usable in a real-time use case.
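For reference, the wrapper looks roughly like this, with a placeholder `synthesize()` standing in for the repo's actual inference call so the snippet stays self-contained:

```python
import io

import numpy as np
import soundfile as sf
from flask import Flask, request, send_file

app = Flask(__name__)


def synthesize(text: str) -> tuple[int, np.ndarray]:
    """Stand-in for the repo's inference call; returns (sample_rate, waveform).

    Replace the body with the actual model call from the example script.
    Here it just emits a short sine tone so the endpoint is testable.
    """
    sample_rate = 24000
    t = np.linspace(0, 1.0, sample_rate, endpoint=False)
    return sample_rate, 0.2 * np.sin(2 * np.pi * 440 * t).astype(np.float32)


@app.route("/tts", methods=["POST"])
def tts():
    text = request.get_json(force=True)["text"]
    sample_rate, audio = synthesize(text)

    # Serialize the waveform to an in-memory WAV file and return it.
    buf = io.BytesIO()
    sf.write(buf, audio, sample_rate, format="WAV")
    buf.seek(0)
    return send_file(buf, mimetype="audio/wav", download_name="tts.wav")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

It can then be called with e.g. `curl -X POST http://localhost:5000/tts -H 'Content-Type: application/json' -d '{"text": "Hello"}' -o out.wav`.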
I've also experimented a bit with streaming, but the quality of the chunked output isn't as good as the synchronous inference, and so far I haven't managed to use a prompt wav and a custom voice in the streamed implementation.
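For completeness, the streaming attempt has roughly this shape, with `synthesize_stream()` again being a placeholder for the model's real streaming interface (the part I haven't figured out is how to feed it the prompt wav and its transcript):

```python
from flask import Flask, Response, request

app = Flask(__name__)


def synthesize_stream(text: str, prompt_wav: str, prompt_text: str):
    """Stand-in generator: the real model would yield audio chunks here.

    `prompt_wav` is a path to the reference recording and `prompt_text`
    its transcript; both argument names are placeholders, not the model's API.
    """
    for _ in range(3):
        yield b"\x00" * 4800  # dummy chunks of 16-bit PCM silence


@app.route("/tts-stream", methods=["POST"])
def tts_stream():
    payload = request.get_json(force=True)
    chunks = synthesize_stream(
        payload["text"],
        payload.get("prompt_wav"),
        payload.get("prompt_text"),
    )
    # Stream raw PCM so the client can start playback before synthesis finishes.
    return Response(chunks, mimetype="application/octet-stream")
```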
Thank you for your help!
Hi, we've updated the fine-tuning instructions—you might find them helpful: GitHub link. Also, here's a demo showcasing the results: Hugging Face link.
Thanks a lot! Will give it a try asap :) Amazing work as always