Speaker ids - genders and accents

#2
by jordimas - opened

Hello,

Is there any documentation available on the supported speaker IDs?

I guess that every speaker ID corresponds to a combination of gender and language variant, but I have not been able to find a list.

Thanks

Jordi,

Projecte Aina org

Hi Jordi,

As a starting point, could you check whether the info here helps? https://github.com/langtech-bsc/Matcha-TTS/commit/9aebd8be5f4c674a06ec3210a8d8840d96b21d35

In any case, that should be added to the documentation. Thanks for pointing it out.

Albert

Projecte Aina org

Hello Jordi,

We have a couple of matxa versions: one trained only on the central Catalan accent (the info from @albertcanig refers to the speaker IDs of this model), and the multiaccent version, which was trained with a total of 8 speakers (1 female and 1 male per accent). Here is the info for the multiaccent version:

{
    "balear": {
        "quim": 0,
        "olga": 1
    },
    "central": {
        "grau": 2,
        "elia": 3
    },
    "nord-occidental": {
        "pere": 4,
        "emma": 5
    },
    "valencia": {
        "lluc": 6,
        "gina": 7
    }
}
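A lookup over this mapping can be sketched in Python as follows. Only the mapping itself comes from the data above; the helper names `speaker_id_for` and `describe_speaker` are made up for illustration and are not part of the Matcha-TTS code.

```python
# Speaker mapping for the multiaccent model, copied from the JSON above.
SPEAKERS = {
    "balear": {"quim": 0, "olga": 1},
    "central": {"grau": 2, "elia": 3},
    "nord-occidental": {"pere": 4, "emma": 5},
    "valencia": {"lluc": 6, "gina": 7},
}


def speaker_id_for(accent: str, name: str) -> int:
    """Return the numeric speaker ID for an accent/name pair (hypothetical helper)."""
    try:
        return SPEAKERS[accent][name]
    except KeyError:
        raise ValueError(f"Unknown accent/speaker pair: {accent}/{name}") from None


def describe_speaker(speaker_id: int) -> tuple[str, str]:
    """Reverse lookup: return (accent, name) for a speaker ID (hypothetical helper)."""
    for accent, voices in SPEAKERS.items():
        for name, sid in voices.items():
            if sid == speaker_id:
                return accent, name
    raise ValueError(f"Speaker ID {speaker_id} is not in the multiaccent mapping")


print(speaker_id_for("central", "grau"))  # prints 2
print(describe_speaker(7))                # prints ('valencia', 'gina')
```

The reverse lookup makes it easy to check which accent and voice a given `--speaker_id` is expected to produce.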

We will definitely upload this info to the model card. Thanks!

Alex

Projecte Aina org

Hi Jordi,

As Alex says, we need to add this to the documentation clearly. In case you run into more issues: we use our HF Spaces code as the reference for these questions. The code is here; in this case, for example, it points to the JSON here, which is basically the info Alex just gave.

Sorry for the inconvenience. I hope the example code helps until we add the relevant information to the model card.

Best

Thanks for the documentation!

A few things that I have observed. I am following the README.md instructions in this repo:

  1. Using "matcha_vocos_inference.py --speaker_id 2" produces a woman's voice for me but, if I understand the mapping correctly, ID 2 should be grau / central (a male voice).

  2. Another thing that I have observed is that matcha_vocos_inference.py defaults to speaker ID 20:

    parser.add_argument('--speaker_id', type=int, default=20, help='Speaker ID')

which is not in the mapping.

  3. I have also observed that the inference code here https://huggingface.co./spaces/projecte-aina/matxa-alvocat-tts-ca/blob/main/infer_onnx.py for the TTS method seems to be
    different from the one in matcha_vocos_inference.py.

Really, 1) is the one I want to fix; I'm mentioning 2) and 3) in case it helps.
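An out-of-range default like the one in point 2 could be caught early by validating the ID on the command line. A minimal sketch, assuming the eight IDs (0 to 7) of the multiaccent mapping given earlier in this thread; the default of 2 is chosen only for illustration, and this is not the actual code of matcha_vocos_inference.py:

```python
import argparse

# IDs 0-7 are the ones listed for the multiaccent model in this thread.
VALID_SPEAKER_IDS = range(8)


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that rejects speaker IDs outside the known mapping."""
    parser = argparse.ArgumentParser(description="Speaker ID validation sketch")
    parser.add_argument(
        "--speaker_id",
        type=int,
        default=2,  # illustrative default, not the script's real default
        choices=VALID_SPEAKER_IDS,
        help="Speaker ID (0-7 for the multiaccent model)",
    )
    return parser


args = build_parser().parse_args(["--speaker_id", "3"])
print(args.speaker_id)  # prints 3
```

With `choices` set, passing `--speaker_id 20` makes argparse exit with an error instead of silently running with an unmapped speaker.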

Thanks

Jordi

Projecte Aina org
edited Sep 30, 2024

Hi Jordi, thanks for the feedback.

  1. By default, the script downloads the multispeaker version from HF (https://github.com/langtech-bsc/Matcha-TTS/blob/dev-cat/matcha_vocos_inference.py#L128). You can change that to point to the multiaccent model handle "projecte-aina/matxa-tts-cat-multiaccent".

  2. The inference code in the space is different because it uses an ONNX version of the model. It also has an additional denoising step, because the inference uses fewer steps (10) to generate the mel spectrograms.

It's important to match each speaker_id with its corresponding text cleaner. Currently, the script uses a Central Catalan phonemizer, but we plan to update the code with specific cleaners for each accent. We'll provide more information once these changes are implemented.
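The pairing of speaker and cleaner could be sketched as below. The speaker-to-accent table follows the multiaccent mapping posted earlier in this thread; the cleaner names are placeholders invented for this example, and the real identifiers in the repository may differ.

```python
# Accent for each multiaccent speaker ID, from the mapping posted in this thread.
ACCENT_BY_SPEAKER_ID = {
    0: "balear", 1: "balear",
    2: "central", 3: "central",
    4: "nord-occidental", 5: "nord-occidental",
    6: "valencia", 7: "valencia",
}

# Placeholder cleaner names (assumptions, not the repository's identifiers).
CLEANER_BY_ACCENT = {
    "balear": "cleaner_balear",
    "central": "cleaner_central",
    "nord-occidental": "cleaner_nord_occidental",
    "valencia": "cleaner_valencia",
}


def cleaner_for_speaker(speaker_id: int) -> str:
    """Return the (placeholder) cleaner name matching a speaker's accent."""
    accent = ACCENT_BY_SPEAKER_ID.get(speaker_id)
    if accent is None:
        raise ValueError(f"Speaker ID {speaker_id} is not in the multiaccent mapping")
    return CLEANER_BY_ACCENT[accent]


print(cleaner_for_speaker(6))  # prints cleaner_valencia
```

Deriving the cleaner from the speaker ID, rather than passing both separately, avoids combining a voice with a phonemizer for a different accent.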

Thanks, I will wait until you update matcha_vocos_inference.py.

If you intend people to use this as a CLI tool, it would also be good to include it here:
https://github.com/langtech-bsc/Matcha-TTS/blob/dev-cat/setup.py#L40

Then, when you run "pip install", it gets installed and you can call it as a command, like "matcha_vocos_inference".
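A minimal sketch of such an entry point, assuming matcha_vocos_inference.py is importable as a top-level module and exposes a `main()` function (neither is confirmed in this thread):

```python
# Fragment for setup.py. The module path and the main() function are
# assumptions made for this example.
ENTRY_POINTS = {
    "console_scripts": [
        # "<command name>=<module path>:<function>"
        "matcha_vocos_inference=matcha_vocos_inference:main",
    ],
}

# In setup.py this dict would be passed to setuptools as:
#     setup(name=..., entry_points=ENTRY_POINTS)
```

After installation, setuptools generates a `matcha_vocos_inference` executable that calls the referenced function.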

I have this packaged as an OVOS plugin here, in case it is useful as a reference: https://github.com/OpenVoiceOS/ovos-tts-plugin-matxa-multispeaker-cat

Projecte Aina org

Thanks Casimiro, we've added the OVOS plugin instructions to the ONNX section of the model card.

Jordi, we've also updated the repository with the text cleaners. The only differences between the OVOS plugin and PyTorch inference scripts are:

  1. in the OVOS plugin, n_timesteps is fixed at 10 due to ONNX export
  2. the OVOS plugin doesn't require PyTorch, which is an advantage for deployments.

Thanks so much for the fixes and additional documentation

I'm focusing on the inference scripts as described here:

https://huggingface.co./projecte-aina/matxa-tts-cat-multiaccent

When I do:

python3 matcha_vocos_inference.py --output_path=output/ --text_input="Bon dia Manel, avui anem a la muntanya." --speaker_id 2

My expectation was grau's male voice but instead I get a woman's voice.

If I do:

python3 matcha_vocos_inference.py --output_path=output/ --text_input="Bon dia Manel, avui anem a la muntanya." --speaker_id 3

My expectation was elia's female voice but instead I get a man's voice.

Do you know what the problem may be?

Thanks

Projecte Aina org

Hi Jordi,

This was because the script loaded the multispeaker model by default. I have changed the default to the multiaccent model, but left a reference to the multispeaker model handle on HF.

Best,
