
FineTuning for Single Speaker

#6
by skjdhuhsnjd - opened

Hi, I'm new to IndicParler TTS. I'm trying to fine-tune it for a single speaker, but I'm encountering this error: 'TypeError: 'NoneType' object is not subscriptable'.

I suspect the issue might be related to using --feature_extractor_name "parler-tts/dac_44khZ_8kbps" because I couldn't find a feature extractor specifically for IndicParler. I'm a beginner and would appreciate some guidance.

AI4Bharat org

Hi,

We do not train or fine-tune DAC on Indic Parler TTS data; we use the pretrained checkpoint from ylacombe/dac_44khz, so you should be able to use that. That said, AutoProcessor.from_pretrained("ai4bharat/indic-parler-tts", trust_remote_code=True) should also work. I can look into it if you share a code snippet.

Thank you for showing interest in Indic Parler TTS.

First of all, thank you so much for your time. I'm using the following script:

!accelerate launch ./training/run_parler_tts_training.py \
    --model_name_or_path "ai4bharat/indic-parler-tts-pretrained" \
    --feature_extractor_name "ylacombe/dac_44khz" \
    --description_tokenizer_name "ai4bharat/indic-parler-tts-pretrained" \
    --prompt_tokenizer_name "ai4bharat/indic-parler-tts-pretrained" \
    --report_to "wandb" \
    --overwrite_output_dir true \
    --train_dataset_name "mavihsrr/Hindi_TTS_M-2k" \
    --train_metadata_dataset_name "skjdhuhsnjd/h-t-tagged" \
    --train_dataset_config_name "default" \
    --train_split_name "train" \
    --eval_dataset_name "mavihsrr/Hindi_TTS_M-2k" \
    --eval_metadata_dataset_name "skjdhuhsnjd/h-t-tagged" \
    --eval_dataset_config_name "default" \
    --eval_split_name "train" \
    --max_eval_samples 8 \
    --per_device_eval_batch_size 8 \
    --target_audio_column_name "audio" \
    --description_column_name "text_description" \
    --prompt_column_name "text" \
    --max_duration_in_seconds 20 \
    --min_duration_in_seconds 2.0 \
    --max_text_length 400 \
    --preprocessing_num_workers 2 \
    --do_train true \
    --num_train_epochs 2 \
    --gradient_accumulation_steps 18 \
    --gradient_checkpointing true \
    --per_device_train_batch_size 2 \
    --learning_rate 0.00008 \
    --adam_beta1 0.9 \
    --adam_beta2 0.99 \
    --weight_decay 0.01 \
    --lr_scheduler_type "constant_with_warmup" \
    --warmup_steps 50 \
    --logging_steps 2 \
    --freeze_text_encoder true \
    --audio_encoder_per_device_batch_size 4 \
    --dtype "float16" \
    --seed 456 \
    --output_dir "./output_dir_training/" \
    --temporary_save_to_disk "./audio_code_tmp/" \
    --save_to_disk "./tmp_dataset_audio/" \
    --dataloader_num_workers 2 \
    --do_eval \
    --predict_with_generate \
    --include_inputs_for_metrics \
    --group_by_length true

However, I keep getting this error:
subprocess.CalledProcessError: Command '['/usr/bin/python3', './training/run_parler_tts_training.py', ...]' returned non-zero exit status 1

When I use the tokenizer (ylacombe/parler-tts-mini-v1-Jenny-colab) for both description and prompt, the process completes without errors, but the output audio quality is terrible. You can check the audio samples here: (https://wandb.ai/sjahk-/parler-speech/reports/Speech-samples-24-12-20-19-31-39---VmlldzoxMDY3NzI5Mw?accessToken=lmtsm2zj12qoc0nl8os0dgpdgyorvbufbgrqjnzfb1bqmfxmnak35cnxspoo6pgc)

Could you please guide me on the appropriate description and prompt tokenizer to use for fine-tuning in Hindi? Thanks in advance!

Any help would mean a lot! I believe the issue might be with the prompt or description tokenizer.
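
The `subprocess.CalledProcessError` quoted above only reports that the training script exited with status 1; the actual Python traceback is printed by the child process. A minimal sketch of one way to surface it, assuming the script can be run directly as a child process (the helper name `run_and_show_error` and the stand-in command are hypothetical, not part of the Parler-TTS training code):

```python
import subprocess
import sys


def run_and_show_error(cmd, tail_lines=20):
    """Run cmd; return None on success, else the last lines of its stderr."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return None
    # The child's traceback is written to stderr, not to the exit status.
    return "\n".join(result.stderr.strip().splitlines()[-tail_lines:])


# Stand-in for the real training command (assumption for illustration):
# a child process that fails with the same kind of TypeError.
demo_cmd = [sys.executable, "-c", "x = None; x['audio']"]
error_tail = run_and_show_error(demo_cmd)
print(error_tail)
```

Posting the tail of the real traceback, rather than the `CalledProcessError` line, usually makes it much easier for others to see which call failed.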

AI4Bharat org

Hi @skjdhuhsnjd ,

Please use the flan-t5-large tokenizer, since Flan-T5 is also our description encoder. It works well for our use case because the descriptions are still in English, and because Flan-T5 is instruction-tuned, it gives good representations even without further training.

AI4Bharat org

For any clarification on which models were used, please look at the config: https://huggingface.co/ai4bharat/indic-parler-tts/blob/main/config.json
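
As a rough sketch of how to read that config once it has been downloaded: the snippet below pulls the sub-model identifiers out of a config dictionary. It assumes the usual Parler-TTS layout with `text_encoder`, `audio_encoder` and `decoder` sections that each carry a `_name_or_path` field. The `sample` fragment is illustrative only (its two values are the names mentioned in this thread) and is not a copy of the real file.

```python
import json


def sub_model_names(config):
    """Return the `_name_or_path` of each sub-config, or None if absent."""
    names = {}
    for key in ("text_encoder", "audio_encoder", "decoder"):
        sub = config.get(key) or {}
        names[key] = sub.get("_name_or_path")
    return names


# Illustrative fragment, NOT the real config.json (assumed layout).
sample = json.loads("""
{
  "text_encoder": {"_name_or_path": "google/flan-t5-large"},
  "audio_encoder": {"_name_or_path": "ylacombe/dac_44khz"}
}
""")

for part, name in sub_model_names(sample).items():
    print(part, "->", name)
```

With the real file, `json.load(open("config.json"))` could be passed to the same helper to check which tokenizer and feature extractor names match the checkpoint.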

Hi @AshwinSankar

First of all, thank you so much for your time. I’m really sorry to bother you, but as a beginner, your help means a lot to me. I was using this notebook:

https://colab.research.google.com/github/ylacombe/scripts_and_notebooks/blob/main/Finetuning_Parler_TTS_on_a_single_speaker_dataset.ipynb

to fine-tune the Indic Parler pretrained model.

I replaced the model path with "ai4bharat/indic-parler-tts-pretrained", the prompt and description tokenizer with "google/flan-t5-large", and the feature extractor with "ylacombe/dac_44khz".

However, I’m still encountering this error:
TypeError: dacmodel.encode() got an unexpected keyword argument 'bandwidth'

I’d be incredibly grateful if you could take some time from your busy schedule to guide me through this issue. Thank you so much in advance!
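
An "unexpected keyword argument" error like the one above means the caller passes a keyword that the installed `encode` method does not declare. A small diagnostic sketch, using only the standard library, to check whether a callable accepts a given keyword. The two `encode_*` functions are hypothetical stand-ins, not the real DAC implementations. This only confirms the mismatch; it does not fix it.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with `name` as a keyword argument."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return True
    param = params.get(name)
    return param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )


# Hypothetical stand-ins for two differing encode() signatures.
def encode_with_bandwidth(input_values, bandwidth=None):
    return input_values


def encode_without_bandwidth(input_values, n_quantizers=None):
    return input_values


print(accepts_keyword(encode_with_bandwidth, "bandwidth"))
print(accepts_keyword(encode_without_bandwidth, "bandwidth"))
```

Running the same check against the audio encoder's `encode` method in the failing environment would show whether the training script and the installed library agree on the signature.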

AI4Bharat org

Which version of transformers are you using?

I'm using Google Colab with Transformers version 4.46.1.
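
When reporting versions, it can help to list every package involved rather than only Transformers. A minimal sketch using the standard library; the distribution names in the loop are assumptions about how the packages are registered in the environment, and any that are not found are reported as `None`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names below are assumptions; adjust to the environment.
for name in ("transformers", "parler_tts", "accelerate", "torch"):
    print(name, installed_version(name))
```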

Hi @AshwinSankar

I've tried my best, but I haven't been able to resolve the problem. Could you please take a look at it?

Thank you!
