openai/whisper · How to fine tune the model

tahercoolguy

Sep 24, 2022

Can we really fine tune model with our datasets

averoo

Oct 3, 2022

Here is an example for finetuning.

https://colab.research.google.com/drive/1P4ClLkPmfsaKn2tBbRp0nVjGMRKR-EWz

averoo

Nov 9, 2022

Another tutorial on finetuning.

https://huggingface.co./blog/fine-tune-whisper

tahercoolguy

Nov 9, 2022

Thank you

EranML

Jan 1, 2023

Can you please show an example of how to create a new tokenizer and than fine-tune whisper with it?

kundanashish

Jan 3, 2023

Somebody pls guide me.
I have set of thousand of sentences. I want use existing whisper model and just rescore it to fit to my scenario as it will be one of those sentences only.
Is it possible to do in whisper?

sanchit-gandhi

Jan 5, 2023

Hey @EranML

My advice would be to follow this blog-post for fine-tuning the model: https://huggingface.co./blog/fine-tune-whisper

The Whisper model is pre-trained on 96 languages. This means that the pre-trained tokenizer already has a vast vocabulary encompassing many thousands of words! I would recommend that you leverage this pre-trained tokenizer directly rather than training a new one. Why? Because then we can also leverage all of the pre-trained Whisper weights directly! If we build a new tokenizer, we have to randomly initialise some of the Whisper weights to work with our new tokenizer, meaning we lose some of the knowledge from pre-training. If we use the pre-trained one, we can use all of the weights (and so all of the knowledge!). The Whisper model quickly learns which bit of the pre-trained tokenizer to use when fine-tuning.

So I’d recommend you keep the pre-trained tokenizer, and simply set the correct language when you instantiate the processor in this line: https://huggingface.co./blog/fine-tune-whisper#combine-to-create-a-whisperprocessor

Yes there’s a bit of redundancy in the tokenizer, but our overall performance should be better!

What language are you fine-tuning on? It's probably quite likely that all the characters you need are already in the pre-trained Whisper tokenizer!

sanchit-gandhi

Jan 5, 2023

Hey @kundanashish ! To clarify, you want to improve the Whisper model's performance on your set of 1000 sentences, but don't care about how it performs on any others? You can simply fine-tune it on these sentences using this blog-post: https://huggingface.co./blog/fine-tune-whisper

You might first need to convert your audio-text dataset into a HF dataset format: https://huggingface.co./docs/datasets/audio_dataset

86 hidden messages

Expand all

kundangraticare

Jan 7, 2024

Hi @sanchit-gandhi
We need your expert opinion on which model would be best suited for fine tuning in our case, where we are mostly interested in our English data (we have a dataset of a few 10s of GBs only) and the best accuracy possible on that.
Also, do you recommend using any distil-whisper model in spite of the original whisper for the same?

Kindly let me know your views.
Thanks a lot

KevinPPP

Jan 14, 2024

Hello @sanchit-gandhi san,

Thank you so much for providing such a detailed notebook for fine-tuning whisper.

I have a question regarding how to set up the LoraConfig so that the target_modules only targets ["q_proj", "v_proj"] of the decoder stack.
It seems that both encoder and decoder uses the same module names, so setting the target_modules to ["q_proj", "v_proj"] creates lora layers for both encoder and decoder.
How can I target the decoder's attention layers?

Original Whisper
OrderedDict([('model', WhisperModel(
(encoder): WhisperEncoder(
(conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
(embed_positions): Embedding(1500, 768)
(layers): ModuleList(
(0-11): 12 x WhisperEncoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(q_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(activation_fn): GELUActivation()
(fc1): Linear8bitLt(in_features=768, out_features=3072, bias=True)
(fc2): Linear8bitLt(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(decoder): WhisperDecoder(
(embed_tokens): Embedding(51865, 768, padding_idx=50257)
(embed_positions): WhisperPositionalEmbedding(448, 768)
(layers): ModuleList(
(0-11): 12 x WhisperDecoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(q_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(activation_fn): GELUActivation()
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(encoder_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(q_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear8bitLt(in_features=768, out_features=3072, bias=True)
(fc2): Linear8bitLt(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)), ('proj_out', Linear(in_features=768, out_features=51865, bias=False))])

After get_peft_model
OrderedDict([('base_model',
LoraModel(
(model): WhisperForConditionalGeneration(
(model): WhisperModel(
(encoder): WhisperEncoder(
(conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
(embed_positions): Embedding(1500, 768)
(layers): ModuleList(
(0-11): 12 x WhisperEncoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(q_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(activation_fn): GELUActivation()
(fc1): Linear8bitLt(in_features=768, out_features=3072, bias=True)
(fc2): Linear8bitLt(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(decoder): WhisperDecoder(
(embed_tokens): Embedding(51865, 768, padding_idx=50257)
(embed_positions): WhisperPositionalEmbedding(448, 768)
(layers): ModuleList(
(0-11): 12 x WhisperDecoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(q_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(activation_fn): GELUActivation()
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(encoder_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(q_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear8bitLt(in_features=768, out_features=3072, bias=True)
(fc2): Linear8bitLt(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(proj_out): Linear(in_features=768, out_features=51865, bias=False)
)
))])

munikumar

Feb 28, 2024

•

edited Feb 28, 2024

Hi @sanchit-gandhi , I am fine-tuning, whisper small on Hindi data. While fine-tuning validation WER is decreasing but the validation loss is increasing. It seems like it is overfitting. How can I solve this, means what sort of regularization I can use? And also any advice on the warm up steps hyperparameter, means any recommended value? Can anyone please help? Thanks in advance.

jakiya

Mar 1, 2024

@sanchit-gandhi
I used your tutorial to finetune the whisper model on a local dataset. thank you very much, it was really helpful.
My issue is in this step

from transformers import Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
output_dir="./whisper-small-hi", # change to a repo name of your choice
)

actually I am new in the field and I am not able to understand what is the output directory.
could I use a local directory of my system instead of "./whisper-small-hi".
and I also want to know that how and from where I can I use this finetuned model to test it .

Mutturaj

May 12, 2024

•

edited May 12, 2024

Hi @sanchit-gandhi , thank u for posting such a detailed article to fine rune whisper with custom training data.
I have followed the article and could generate the model with my own training dataset .
I have doubt ,when i run, trainer.tain() , i can see the training starts and check points are store in different directory

I have set updating checkpoints to Hugging face as false, as i want save the model locally

After this step i am running save model to local directory, thats also working fine.but checkpoints and save model are done to different directory

My question here is
1.When i load trained model from local directory,,which directory path i have to provide, last checkpoint dir or model saved path???.

2.And the other issue is checkpoint dir do not have files like vocab.json etc..and loding fails, for this ,,workaround i did was to copy files from saved model dir to checkpoint dir

Kindly help me with my queries

Please find the Training steps: ********************************************

training_args = Seq2SeqTrainingArguments(
output_dir="./whisper-muttu-small-vv-trained", # change to a repo name of your choice
per_device_train_batch_size=2,
gradient_accumulation_steps=8, # increase by 2x for every 2x decrease in batch size
learning_rate=1e-5,
warmup_steps=500,
max_steps=500,
gradient_checkpointing=True,
fp16=False,
evaluation_strategy="steps",
per_device_eval_batch_size=8,
predict_with_generate=True,
generation_max_length=225,
save_steps=100,
eval_steps=100,
logging_steps=25,
report_to=["tensorboard"],
load_best_model_at_end=True,
metric_for_best_model="wer",
greater_is_better=False,
push_to_hub=False,
use_cpu=True
)
trainer = Seq2SeqTrainer(
args=training_args,
model=model,
train_dataset=common_voice["train"],
eval_dataset=common_voice["test"],
data_collator=data_collator,
compute_metrics=compute_metrics,
tokenizer=processor.feature_extractor
)
#this is trainng and generating check point
trainer.train()

#save trainer model and processors.
trainer.save_model(training_args.output_dir)
processor.save_pretrained(training_args.output_dir)

Files stored under checkpoint dir ********************************************
3815 May 12 06:21 generation_config.json
2260 May 12 06:21 config.json
966995080 May 12 06:21 model.safetensors
5240 May 12 06:21 training_args.bin
339 May 12 06:21 preprocessor_config.json
1064 May 12 06:21 scheduler.pt
13990 May 12 06:21 rng_state.pth
1925050668 May 12 06:21 optimizer.pt
4436 May 12 06:21 trainer_state.json

dolphin12

Jul 25, 2024

•

edited Jul 25, 2024

@sanchit-gandhi on indic languages I am looking to finetune the whisper model by generating audios which has mixture of numbers and domain specific names plus the public dataset available on the web because it seems like words which are homophones to each other and number especially it's making a lot of mistakes adding three zeros when number "100" is spoken and so on.Nonetheless numbers are spoken differently in different languages.So training with a mixture of public dataset and the synthetic data on different indic languages should solve this issues but my concern is will it be able to retain parameters like initialprompt,repetition penalty,vad filter,hotwords(which faster whisper provides) etc after fine-tuning using Peft or without Peft?

cautroi

Oct 19, 2024

Based on this guide https://huggingface.co./blog/fine-tune-whisper, I tried to fine-tune "small" and "large-v3" models.

The fine-tuned "small" model works normally.
But the fine-tuned "large-v3" model works poorly on non-English audio files such as Chinese audio files, it auto-translates Chinese to English though I specified transcribing in Chinese, not translating.
Have you faced this issue, and can give me advice? Thank you so much.

vizsatiz

Oct 30, 2024

•

edited Oct 30, 2024

Hey @Jiltseb , in this case you will probably get better results fine-tuning on language 1 and language 2 at the same time. For this, you will need to switch the language code in your tokenizer depending on the language for each individual sample, so that the model learns to differentiate between language 1 and 2. We'll take the Whisper fine-tuning blog post as our starting point. Suppose your dataset has a column called "language" that says what the language is for each sample (e.g. "Hindi" or "French"), then we can update our prepare_dataset function as follows:
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

   # get the language of our text
   tokenizer.set_prefix_tokens(language=batch["language"], task="transcribe") 
   # encode target text to label ids 
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch
It's pretty easy to add a new "language" column to your dataset. Suppose you load the Hindi version of common voice as:
from datasets import load_dataset, DatasetDict

common_voice_hi = DatasetDict()

common_voice_hi["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train+validation", use_auth_token=True)
common_voice_hi["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="test", use_auth_token=True)
You can then add the language column using dataset's add_column method:
for split in common_voice_hi:
    language_column = ["Hindi"] * len(common_voice_hi[split])
    common_voice_hi[split] = common_voice_hi[split].add_column("language", language_column)
Supposing we do the same for "French":
common_voice_fr = DatasetDict()

common_voice_fr["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="train+validation", use_auth_token=True)
common_voice_fr["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", use_auth_token=True)

for split in common_voice_fr:
    language_column = ["French"] * len(common_voice_fr[split])
    common_voice_fr[split] = common_voice_fr[split].add_column("language", language_column)
We can now combine our two datasets using concatenate_datasets:
from datasets import concatenate_datasets

common_voice_merged = DatasetDict()

for split in common_voice_hi:
    common_voice_merged[split] = concatenate_datasets([common_voice_hi[split], common_voice_fr[split])
Voila! Now you can use this combined dataset for training on two languages at once.

@sanchit-gandhi First of all, thank you for your amazing blog on fine-tuning
This answer is a great place to start on multi-language finetuning
I was trying this out. I wanted to ask you about how to train to translate and transcribe for 5 languages. Should I go sequentially by doing transcription fine-tuning and translation or is there a better approach?