How to fine-tune the model
Can we really fine-tune the model with our own datasets?
Here is an example for finetuning.
https://colab.research.google.com/drive/1P4ClLkPmfsaKn2tBbRp0nVjGMRKR-EWz
Another tutorial on finetuning.
Thank you
Can you please show an example of how to create a new tokenizer and then fine-tune Whisper with it?
Somebody please guide me.
I have a set of a thousand sentences. I want to use the existing Whisper model and just rescore its output to fit my scenario, since the spoken input will always be one of those sentences.
Is it possible to do this with Whisper?
Hey @EranML
My advice would be to follow this blog-post for fine-tuning the model: https://huggingface.co./blog/fine-tune-whisper
The Whisper model is pre-trained on 96 languages. This means that the pre-trained tokenizer already has a vast vocabulary encompassing many thousands of words! I would recommend that you leverage this pre-trained tokenizer directly rather than training a new one. Why? Because then we can also leverage all of the pre-trained Whisper weights directly! If we build a new tokenizer, we have to randomly initialise some of the Whisper weights to work with our new tokenizer, meaning we lose some of the knowledge from pre-training. If we use the pre-trained one, we can use all of the weights (and so all of the knowledge!). The Whisper model quickly learns which bit of the pre-trained tokenizer to use when fine-tuning.
So I’d recommend you keep the pre-trained tokenizer, and simply set the correct language when you instantiate the processor in this line: https://huggingface.co./blog/fine-tune-whisper#combine-to-create-a-whisperprocessor
Yes there’s a bit of redundancy in the tokenizer, but our overall performance should be better!
What language are you fine-tuning on? It's quite likely that all the characters you need are already in the pre-trained Whisper tokenizer!
Hey @kundanashish ! To clarify, you want to improve the Whisper model's performance on your set of 1000 sentences, but don't care about how it performs on any others? You can simply fine-tune it on these sentences using this blog-post: https://huggingface.co./blog/fine-tune-whisper
You might first need to convert your audio-text dataset into a HF dataset format: https://huggingface.co./docs/datasets/audio_dataset
Hi
@sanchit-gandhi
Thanks for your response.
Yes, your understanding is correct.
Actually, I only have text. I want to keep the existing acoustic model as it is and do fine-tuning only at the language model layer.
Hey @kundanashish ! Sorry for the late reply here. I would strongly advise against fine-tuning only the language model (decoder) of the Whisper model on text-only data. My worry here is that we will completely break the model and lose all of its pre-trained capabilities if we do this.
Whisper is an encoder-decoder architecture. The encoder transforms the audio inputs into a set of hidden state representations, extracting important features from the spoken speech. The decoder auto-regressively predicts text tokens, conditional on previously predicted tokens and the encoder hidden states (see https://huggingface.co./blog/encoder-decoder#encoder-decoder). If we omit the encoder hidden-states, we completely change the functionality of the Whisper model: the decoder now only predicts tokens conditional on the previously predicted tokens, not the encoder hidden states. This will change the weights such that the model only uses the previous tokens and not the encoder hidden representations. Thus, the model goes from being purposed for speech recognition (speech to text) to causal language modelling (text to text). When we use this fine-tuned model at inference time, this time with the audio inputs, the weights will be messed-up for speech recognition and the model will likely fail.
I would recommend either:
- Fine-tuning the model on audio-transcription pairs (i.e. get the audio for your text sentences and train on audio + text) according to the blog post
- Using the zero-shot model (no fine-tuning) to generate Whisper predictions. Take the prediction from the Whisper model, and find the sentence in your corpus of 1000 sentences that is most similar to this prediction. Use this nearest sentence as your output.
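For illustration, a minimal sketch of that second option, using Python's built-in difflib as the similarity measure (the sentences and prediction below are made up; an edit-distance or embedding-based similarity would also work):

import difflib

# hypothetical corpus and zero-shot Whisper output, just for illustration
candidate_sentences = ["turn on the living room lights", "play the next song"]
whisper_prediction = "turn on the living room light"

# pick the candidate sentence with the highest string similarity to the prediction
best_match = max(
    candidate_sentences,
    key=lambda s: difflib.SequenceMatcher(None, whisper_prediction.lower(), s.lower()).ratio(),
)
print(best_match)  # -> "turn on the living room lights"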
@sanchit-gandhi in the blog post you've mentioned, what are some of the parameters you've played around with for getting best results? Further, any tips on how to go about tweaking said params?
Hey @jungledude23 !
In my experience, the most important three are:
- Batch size
- Learning rate
- Dropout
1. Batch size
One thing I've noticed a lot looking at training logs is noisy training loss curves. This generally gives noisy parameter updates, which can throw your model off and delay it reaching a local optimum. A noisy training loss can be combated by increasing your batch size. A larger batch size means more training samples per update, and is thus closer to a 'true' gradient update that you'd get using all the data at once. You can find recommended batch size configurations here https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event#recommended-training-configurations
2. Learning rate
- As a rule of thumb, a learning rate 40x smaller than the pre-training learning rate works well (see page 28 of the Whisper paper)
- Monitor the training loss over the first 1000 steps. If it decays quickly and smoothly, you've got a good learning rate. If it bounces around and is very noisy, the learning rate is too high and you should reduce it by at least a factor of 10
3. Dropout
- Recommended when the amount of training data that you have is small. In this case, it can be used to prevent overfitting.
- Use a max dropout value of 0.1. Going higher than this generally hurts performance:
model.config.dropout = 0.1
- You can also set `attention_dropout` and `activation_dropout`, but these have less of an impact than `dropout` (see https://huggingface.co./docs/transformers/model_doc/whisper#transformers.WhisperConfig.attention_dropout)
More details regarding learning rate:
The learning rate is indeed a very important parameter to get good fine-tuning performance, and one that we have to experiment with to get right. My recommendation would be to monitor the training loss for the first 500-1000 training steps of your fine-tuning run to gauge whether you've set the learning rate appropriately. Each case is different, but I've tried to give a setting that works best for most!
In practice, using a lower learning rate for fine-tuning vs pre-training gives superior results. These are the observations that I made when fine-tuning the Whisper model for the ESB paper (https://arxiv.org/abs/2210.13352) and from my extensive testing for multilingual fine-tuning prior to the event. Generally, I found that a learning rate of 1e-5 worked well for the small and medium checkpoints across most languages. This is the highest learning rate that you can get away with without the gradient updates becoming noisy. Selecting a higher learning rate means that you perform larger parameter updates, and so should be able to push the parameters into a more optimal range faster. But if you go too high, you risk the gradient updates becoming unstable, giving a noisy training loss curve and noisy parameter updates. This is when you'll get worse performance.
I asked the Whisper author Jong Wook Kim about his suggestions for fine-tuning. His recommendation was to select a learning rate about 40x smaller than pre-training, and linearly decay it to 0 over the course of training. For the small checkpoint, this would be 5e-4 / 40 = 1.25e-5, near enough 1e-5! So my empirical observations align with his 🙂
You can use this as a rule of thumb for selecting the learning rate!
Hi
@sanchit-gandhi
Thanks for the response.
Can't I just fine-tune using text data, or is audio mandatory?
Please pardon me if I am sounding silly, I am a newbie in this field.
Hi @kundanashish ,
You must have your X and y values if you wish to fine-tune on your specific task.
Maybe an easier example would be training an image classifier to classify images of cats and dogs.
You have the images (your X values) and their labels like "cat" or "dog" (your y values).
Now imagine you want to train this model without any images.
This is akin to trying to tune whisper without audio and just text.
Love the analogy @Kristopher ! 🙌 Indeed we need (text, audio) pairs for fine-tuning to work.
Have you considered option two from this list @kundanashish ? https://huggingface.co./spaces/openai/whisper/discussions/6#63c142a294b28327f0e6bebd
It could work using the pre-trained Whisper model to generate predictions for the transcriptions, and then picking the sentence in your set of 1000 sentences that is most similar to this prediction? What do you think?
Hi Sanchit, I have my mapping.csv that has audio, sentence -- The audio field is the path to the audio.
When I try to train following your https://huggingface.co./blog/fine-tune-whisper tutorial, I get the following:
The following columns in the training set don't have a corresponding argument in `WhisperForConditionalGeneration.forward` and have been ignored: audio, sentence. If audio, sentence are not expected by `WhisperForConditionalGeneration.forward`, you can safely ignore this message.
In your tutorial, the first element of the dataset has more info:
{'audio': {'path': '/home/sanchit_huggingface_co/.cache/huggingface/datasets/downloads/extracted/607848c7e74a89a3b5225c0fa5ffb9470e39b7f11112db614962076a847f3abf/cv-corpus-11.0-2022-09-21/hi/clips/common_voice_hi_25998259.mp3',
'array': array([0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 9.6724887e-07,
1.5334779e-06, 1.0415988e-06], dtype=float32),
'sampling_rate': 48000},
'sentence': 'खीर की मिठास पर गरमाई बिहार की सियासत, कुशवाहा ने दी सफाई'}
Is there a specific format for my mapping.csv? Thanks for the great work as usual!
Hey @asennoussi !
The warning message suggests to me that something is going wrong in the data pre-processing stage. We shouldn't have features like `audio` and `sentence` forwarded to our data collator.
Data pre-processing
Currently, our data pre-processing function looks as follows:
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch
This assumes that our audio dataset has a column called `audio` which is the loaded audio array.
If we're working with a .csv file containing pairs of (path to audio, text), we first need to load the audio samples before pre-processing them with our `processor` class. We can do this in one of two ways:
- Load your dataset as a HF `dataset`
- Load each audio sample on the fly
For 1, you can follow this guide: https://huggingface.co./docs/datasets/audio_dataset#create-an-audio-dataset. Once you've created your audio dataset and pushed it to the Hub, you can simply load it using the `load_dataset` function and follow the blog post from start to finish!
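As a rough sketch of option 1 for a .csv file (assuming columns named "audio" with the file paths and "sentence" with the transcriptions; adjust the names to match your mapping.csv):

from datasets import load_dataset, Audio

# load the csv, then cast the path column to an Audio feature so that each
# sample is decoded to an array + sampling rate when accessed
dataset = load_dataset("csv", data_files={"train": "mapping.csv"})
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# each sample now looks like {'path': ..., 'array': ..., 'sampling_rate': 16000}
print(dataset["train"][0]["audio"])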
For 2, we can make a few modifications to the `prepare_dataset` function to first load our audio from the path. We can do this with the `librosa` library:
import librosa

def prepare_dataset(batch):
    # load audio sample FROM PATH with the specified sampling rate
    audio_array, sampling_rate = librosa.load(batch["audio_path"], sr=16000, mono=True)

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio_array, sampling_rate=sampling_rate).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch
You’ll just need to double check that the function has the right feature names to match your dataset (I’ve used “audio_path” and “sentence” but you might need to change these)
Let me know if that helps with your question! Feel free to post a code snippet / Colab link to your code if you want to share what you're currently doing (this might make it easier to fully dissect what's going on)
Thanks for the thorough answer Sanchit!
@sanchit-gandhi
I faced something weird.
I do have a large dataset: around 100GB of audio files that I split into snippets for training, so around 200 GBs of space.
Yesterday, the training stopped because of insufficient space.
When I looked around, I found a ~/.cache/huggingface/datasets/csv/default-1c16e2184a1fda8d/0.0.0/Somfile that is +500GB in size.
Why does that happen? I'm just curious.
I'll train my dataset little by little, but does deactivating caching help here? What's the downside in terms of performance?
Hey @asennoussi !
This file is likely the `arrow` file for your dataset, i.e. the cached file for the pre-processed version of your dataset (input features + labels). See https://huggingface.co./docs/datasets/about_cache#the-cache for details!
You can disable caching (see https://huggingface.co./docs/datasets/cache#enable-or-disable-caching). The pros here are that you save disk space, the cons are that you have to repeat any data pre-processing steps if you want to train on the same dataset in a second training run (essentially repeating the pre-processing that you performed before).
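For reference, disabling the cache is a one-liner with the datasets library:

from datasets import disable_caching

disable_caching()  # transforms from subsequent .map calls are no longer cached for reuse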
Alternatively, you can look into streaming mode to bypass any disk space constraints! See https://huggingface.co./blog/audio-datasets#streaming-mode-the-silver-bullet for an explanation on how this works and https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event for streaming mode resources (such as fine-tuning scripts)
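And a minimal sketch of loading a dataset in streaming mode (using Common Voice as in the blog post; substitute your own dataset id):

from datasets import load_dataset

# streaming=True means samples are downloaded and pre-processed on the fly,
# so nothing is written to disk up front
common_voice_train = load_dataset(
    "mozilla-foundation/common_voice_11_0", "hi",
    split="train+validation", streaming=True, use_auth_token=True,
)

first_sample = next(iter(common_voice_train))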
Awesome! thanks a lot!
Please bear with me: now that I have fine-tuned my model, I have a new directory with a bunch of checkpoints.
How do I use the newly fine-tuned model?
Hey @asennoussi !
You can load the model with `pipeline` and transcribe audio samples of arbitrary length. Just specify the path to your model directory (the `output_dir` you specified during training) and provide the path to an audio file:
import torch
from transformers import pipeline

MODEL_PATH = "PATH/TO/MODEL"
AUDIO_PATH = "PATH/TO/AUDIO"

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_PATH,
    chunk_length_s=30,
    device=device,
)

# we override any special forced tokens for auto language detection - not necessary if you use transformers from main!
all_special_ids = pipe.tokenizer.all_special_ids
transcribe_token_id = all_special_ids[-5]
pipe.model.config.forced_decoder_ids = [[2, transcribe_token_id]]

# inference
out = pipe(AUDIO_PATH)["text"]
print(out)
Hi
@sanchit-gandhi
,
I hope all is well.
In the script for the eval_metric:
# evaluate with the 'normalised' WER
do_normalize_eval = True
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    if do_normalize_eval:
        pred_str = [normalizer(pred) for pred in pred_str]
        label_str = [normalizer(label) for label in label_str]
        # filtering step to only evaluate the samples that correspond to non-zero references:
        pred_str = [pred_str[i] for i in range(len(pred_str)) if len(label_str[i]) > 0]
        label_str = [label_str[i] for i in range(len(label_str)) if len(label_str[i]) > 0]

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
Then when the trainer tries to save the model, I get
/usr/local/lib/python3.8/dist-packages/transformers/trainer.py in _save_checkpoint(self, model, trial, metrics)
2236 if not metric_to_check.startswith("eval_"):
2237 metric_to_check = f"eval_{metric_to_check}"
-> 2238 metric_value = metrics[metric_to_check]
2239
2240 operator = np.greater if self.args.greater_is_better else np.less
KeyError: 'eval_wer'
Shouldn't `compute_metrics` return `{"eval_wer": wer}` instead of `{"wer": wer}`?
Hey @asennoussi ,
The function for `compute_metrics` looks good to me! It's likely that the error lies in the training args.
Could you make sure you have `--metric_for_best_model="wer"` in your training args? See https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event#python-script
Here, we set our training args as:
python run_speech_recognition_seq2seq_streaming.py \
--model_name_or_path="openai/whisper-small" \
--dataset_name="mozilla-foundation/common_voice_11_0" \
--dataset_config_name="es" \
--language="spanish" \
--train_split_name="train+validation" \
--eval_split_name="test" \
--model_index_name="Whisper Small Spanish" \
--max_steps="5000" \
--output_dir="./" \
--per_device_train_batch_size="64" \
--per_device_eval_batch_size="32" \
--logging_steps="25" \
--learning_rate="1e-5" \
--warmup_steps="500" \
--evaluation_strategy="steps" \
--eval_steps="1000" \
--save_strategy="steps" \
--save_steps="1000" \
--generation_max_length="225" \
--length_column_name="input_length" \
--max_duration_in_seconds="30" \
--text_column_name="sentence" \
--freeze_feature_encoder="False" \
--report_to="tensorboard" \
--metric_for_best_model="wer" \
--greater_is_better="False" \
--load_best_model_at_end \
--gradient_checkpointing \
--fp16 \
--overwrite_output_dir \
--do_train \
--do_eval \
--predict_with_generate \
--do_normalize_eval \
--streaming \
--use_auth_token \
--push_to_hub
Where we have `--metric_for_best_model="wer"`, which indicates that the "wer" metric is the metric to optimise our eval performance for; see https://huggingface.co./docs/transformers/main_classes/trainer#transformers.TrainingArguments.metric_for_best_model for details.
For a simple/stupid approach, for a language that is already supported, would you say the following is true:
- Take my audio samples for which I have GT transcriptions
- Run Whisper on them, get the generated text
- Compare the generated text with the GT transcripts and find those that mismatch
- Fine-tune Whisper ONLY on the audio samples and the GT transcripts for which a mismatch was found
So in short, fine-tune only on the corrected errors, not on the entire corpus? I imagine that would drastically reduce the fine-tuning time.
Is there any caveat to my approach?
(Note: I personally am interested in domain-specific fine-tuning, so there is a certain number of brand names, person names and domain-specific jargon that the model gets wrong to some extent, and I'm interested in fixing that)
Hey @twardoch - I think this is a cool idea, but if you have the full corpus available I would encourage you to fine-tune on that.
My worry is that if we only fine-tune on a subset of the corpus, we risk Whisper overfitting to these examples. The examples that Whisper initially makes errors on will be from a subset of the full distribution of data, so we risk overfitting Whisper to a small subset of the overall distribution.
- Suppose we have five examples: A, B, C, D, E
- Whisper initially makes errors on A, B, C, so we fine-tune it on these three examples
- After training for several epochs, it should no longer make errors on examples A, B, C - great!
- However, there's nothing stopping it from now making errors on examples D and E!
If we want to encourage Whisper to work on the full distribution of data, we should provide it training data drawn from the full distribution of data (i.e. all five training examples). Keeping examples where Whisper initially works ensures that Whisper continues to get these examples right after fine-tuning
Thanks! That's exactly what I imagined might happen, but wasn't sure if it would. You’re saying this is the risk, which I understand now. OK, fortunately my total corpus is not that huge overall.
What is the best/recommended approach for rapid prototyping? I'm already using the .tiny models to run initial tests, but I have found that the amount of data seems to make no difference. I had naively expected that it would be much faster to fine-tune with 8hrs of data rather than 100hrs, and that this advantage would stack with the smaller base model. But it seems like the amount of fine-tuning data has no impact on expected training time. I've now tried two attempts with .tiny, one with 100hrs of data and one with 10hrs, and they both give the exact same expected completion duration. What am I misunderstanding here?
@sanchit-gandhi
I wonder if you have some expert observations here.
Hey
@None
! Could you share your training configuration (i.e. your training args)? My reckoning is that we're setting `--max_steps=50000`, which means that we'll train for 50k training steps no matter how much data we provide.
If you want to train based on the amount of data you have, you can remove `--max_steps` and set `--num_train_epochs` instead (see docs). If we do this, we'll train for a fixed number of epochs, so we'll scale our training time with the amount of data that we've got.
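In the Python API, a sketch of the same idea would look like this (the values here are illustrative):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-tiny-finetuned",  # hypothetical output directory
    num_train_epochs=3,                     # training length now scales with dataset size
    # max_steps is deliberately left unset
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    warmup_steps=500,
    fp16=True,
)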
Is `feature_extractor` a librosa function or your own function?
Edit: I found the function.
I am currently trying to train a model with a single custom audio file (without the HF Hub). Does anyone have ideas on how to do this?
Thanks in advance
How can we use the Hugging Face Whisper model to fine-tune for language detection?
Hey @johnwick999 ! You can check out this page for converting a custom dataset to HF datasets: https://huggingface.co./docs/datasets/audio_dataset#create-an-audio-dataset
Once you've done so, you'll be able to run the fine-tuning script exactly as is (just update the dataset id from `mozilla-foundation/common_voice_11_0` to your dataset id).
Hey @Sibadatta !
Here's a code snippet for how you can use the pre-trained Whisper model for language detection:
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from transformers.models.whisper.tokenization_whisper import LANGUAGES
from datasets import load_dataset
model_id = "openai/whisper-tiny"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
bos_token_id = processor.tokenizer.all_special_ids[-106]
decoder_input_ids = torch.tensor([bos_token_id])
dataset = load_dataset("facebook/multilingual_librispeech", "dutch", split="validation", streaming=True)
sample = next(iter(dataset))["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
with torch.no_grad():
    logits = model.forward(input_features, decoder_input_ids=decoder_input_ids).logits
pred_ids = torch.argmax(logits, dim=-1)
lang_ids = processor.decode(pred_ids[0])
lang_ids = lang_ids.lstrip("<|").rstrip("|>")
language = LANGUAGES[lang_ids]
I've also created a space here: https://huggingface.co./spaces/sanchit-gandhi/whisper-language-id
To fine-tune for language detection, you can adapt the code snippet to compute a cross-entropy loss between the pred ids and the target ids
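As a rough sketch of that objective, reusing `model`, `processor` and `input_features` from the snippet above, and assuming the target language of the sample is Dutch:

import torch
import torch.nn.functional as F

# ids of the start-of-transcript token and the target language token
sot_id = processor.tokenizer.convert_tokens_to_ids("<|startoftranscript|>")
target_lang_id = processor.tokenizer.convert_tokens_to_ids("<|nl|>")

decoder_input_ids = torch.tensor([[sot_id]])
logits = model(input_features, decoder_input_ids=decoder_input_ids).logits  # (1, 1, vocab_size)

# cross-entropy between the prediction at the language position and the target language id
loss = F.cross_entropy(logits[:, 0, :], torch.tensor([target_lang_id]))
loss.backward()  # then step your optimizer as usual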
Thanks for the info.
I have been getting this error while loading the trained model. It says config.json was not found in the model folder. Have you encountered this issue?
Thanks in advance
Scenario: let's say I have 4-5 languages that I want to keep in my final model, and I want the model to detect those 4-5 languages perfectly, so I fine-tune it with ASR data for those languages and with language tokens for language detection. How can I perform multilingual and multitask fine-tuning along with fine-tuning its language detection decoder head?
Hello! I'm new in the field and I wanted to ask: how to fine-tune Whisper for a low-resource language that is not included in the pre-trained model?
The language in question shares some similarities with Persian/Kurdish. I have several hours of speech data for this language, but don't understand what to do next.
Here is an example for finetuning.
https://colab.research.google.com/drive/1P4ClLkPmfsaKn2tBbRp0nVjGMRKR-EWz
Hi everyone!
I followed the tutorial from the link above.
I see that the fine-tuned model works well for ASR, but I want to use it for TTS (text-to-speech).
How do I fine-tune or adapt the tutorial for TTS? I would really appreciate any shared links or guides.
Many thanks!
Hey
@Sibadatta
! I would modify the `prepare_dataset` function to set the tokeniser's language for each training example. For this, you just need to know the language for each sample of your dataset (which I've assumed is stored under the column `language` in your dataset):
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

+   language = batch["language"]  # assuming you have the language for each sample of your dataset
+   tokenizer.set_prefix_tokens(language=language)  # now switch the tokenizer language to the correct one

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch
You can then fine-tune the model simultaneously on multiple languages and fine-tune all the required params. No other changes to your script are required!
Hey
@NursNurs
!
When you fine-tune it on a new language, Whisper does a pretty good job at leveraging its knowledge of the other 96 languages it’s pre-trained on. So you still probably only need 10s of hours of labelled audio data.
We tried two ways of setting it up for fine-tuning on new languages:
- Remove the language prediction task so that Whisper doesn't get caught up with the fact it's working on a new language and just focuses on transcribing text as accurately as possible (just set `language=None` in the tokenizer and processor)
- Keep the language prediction task and tell Whisper that the new language is the same as one of its current languages (e.g. if fine-tuning on Nepali, tell Whisper it's actually predicting Hindi, since the two are linguistically most similar): our thinking here was that we'd be able to leverage Whisper's knowledge of the most linguistically similar language to the new language that we were showing it (just set `language=Hindi` in the tokenizer and processor)
In the end, 1 & 2 gave very comparable performance, so Whisper figures out how to make use of its existing knowledge itself. You can set `language` to either of the above two options.
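Concretely, both options just change how the processor (and its tokenizer) are instantiated, e.g. for the small checkpoint:

from transformers import WhisperProcessor

# Option 1: drop the language token entirely
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language=None, task="transcribe"
)

# Option 2: reuse the most linguistically similar supported language
# (e.g. treat Nepali audio as Hindi)
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Hindi", task="transcribe"
)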
Hey @tupk ! Whisper is a model for speech-to-text, so we can't use it for text-to-speech unfortunately. I would advise that you check-out SpeechT5 for a model that can do both: https://huggingface.co./blog/speecht5
Do we need to train the encoder as well while fine-tuning, or just the decoder part. @sanchit-gandhi
Super good question @Sibadatta - you can freeze the encoder if your audio domain matches that seen during pre-training. Then you only need to adapt the decoder to the target text format! We did this for the ESB paper and it worked very well: https://arxiv.org/abs/2210.13352 See page 22 for details.
You can freeze the encoder by passing `--freeze_encoder=True`, see https://github.com/huggingface/transformers/blob/01203475c9452af74ef8fe43c64203be0c959191/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py#L100
In Colab, you'll need to do:
model.freeze_encoder()
before you pass the model to the trainer.
After fine-tuning, how will we be sure that the WER is good enough and that we don't need to fine-tune further, specifically for Odia language fine-tuning? @sanchit-gandhi
@sanchit-gandhi How do we make sure that sequential fine-tuning works (fine-tuning on language x followed by fine-tuning on language y)? I have tried fine-tuning on language 1 and then on language 2. On evaluation with test data, it seems that the WER for language 1 increased after fine-tuning its last checkpoint on language 2. I don't think this is expected behaviour, or am I missing something here? How should one go about training multiple languages one after the other, and how does one fine-tune Whisper for unsupported languages?
Hey @Ranjit , here you can use a held-out validation set to measure the performance of your fine-tuned model on unseen data. If your validation WER is less than a pre-defined threshold, you know that your model is 'good enough' and that you can use it without any further fine-tuning. See https://huggingface.co./course/en/chapter5/3?fw=pt#creating-a-validation-set and https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets for details.
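As a small sketch of that check (the strings and the threshold below are placeholders; use your own decoded validation predictions and a threshold that makes sense for your application):

import evaluate

wer_metric = evaluate.load("wer")

# decoded outputs on a held-out validation split the model never saw during fine-tuning
references = ["this is a held out reference transcription"]
predictions = ["this is a held out reference transcription"]

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"validation WER: {wer:.2f}%")

good_enough = wer < 20.0  # hypothetical acceptance threshold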
Hey
@Jiltseb
, in this case you will probably get better results fine-tuning on language 1 and language 2 at the same time. For this, you will need to switch the language code in your tokenizer depending on the language for each individual sample, so that the model learns to differentiate between language 1 and 2. We'll take the Whisper fine-tuning blog post as our starting point. Suppose your dataset has a column called "language" that says what the language is for each sample (e.g. "Hindi" or "French"), then we can update our `prepare_dataset` function as follows:
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # get the language of our text
    tokenizer.set_prefix_tokens(language=batch["language"], task="transcribe")

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch
It's pretty easy to add a new "language" column to your dataset. Suppose you load the Hindi version of Common Voice as:
from datasets import load_dataset, DatasetDict
common_voice_hi = DatasetDict()
common_voice_hi["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train+validation", use_auth_token=True)
common_voice_hi["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="test", use_auth_token=True)
You can then add the language column using the dataset's `add_column` method:
for split in common_voice_hi:
    language_column = ["Hindi"] * len(common_voice_hi[split])
    common_voice_hi[split] = common_voice_hi[split].add_column("language", language_column)
Supposing we do the same for "French":
common_voice_fr = DatasetDict()
common_voice_fr["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="train+validation", use_auth_token=True)
common_voice_fr["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", use_auth_token=True)
for split in common_voice_fr:
    language_column = ["French"] * len(common_voice_fr[split])
    common_voice_fr[split] = common_voice_fr[split].add_column("language", language_column)
We can now combine our two datasets using `concatenate_datasets`:
from datasets import concatenate_datasets
common_voice_merged = DatasetDict()
for split in common_voice_hi:
    common_voice_merged[split] = concatenate_datasets([common_voice_hi[split], common_voice_fr[split]])
Voila! Now you can use this combined dataset for training on two languages at once.
Hi @sanchit-gandhi , Thank you for your input. I was more worried about the catastrophic forgetting and wanted to ensure the model kept the same performance for performant languages even after fine-tuning on others. PeFT with LoRA seems to be a better choice for this. How can we finetune whisper for audio classification tasks? Is there a blog/example notebook for this?
Hey
@Jilt
! PeFT + LoRA is indeed a cheap way of fine-tuning the Whisper model, one that retains 99% of the original pre-trained params. The `run_audio_classification.py` script in transformers now supports the Whisper model. This is the script that I used to fine-tune the base model on the Common Language ID task: https://huggingface.co./sanchit-gandhi/whisper-base-ft-common-language-id/blob/main/run.sh
Hey, I am getting a tensor mismatch error. Is there a way to verify that, or can I skip the offending batches, as it is a large dataset?
Hi @sanchit-gandhi , in DataCollatorSpeechSeq2SeqWithPadding of https://huggingface.co./blog/fine-tune-whisper, there is a step:
#if bos token is appended in previous tokenization step, cut bos token here as it's append later anyways
Can you please point me to the code where bos token will be appended later? I tried to locate that but haven't found yet.
Thank you :)
Hi, what is the range of token_ids that the generate() function can generate? I am trying to fine-tune Whisper to learn the speaker_id just before the start of transcription using token_ids > 50363.
Here is an example for finetuning.
https://colab.research.google.com/drive/1P4ClLkPmfsaKn2tBbRp0nVjGMRKR-EWz
@averoo Thanks for your great script, but I encountered a problem and I wonder whether you have seen the same thing.
In my test, the saved checkpoint does not work well. I mean, before reloading the last checkpoint the validation loss is 1.7, and if I do not restart the whole script but only reload the last checkpoint, the validation loss is still 1.7, as expected. BUT when I restart the script and then load the last checkpoint, the validation loss is 6.4!
This is so weird, and I tried my best to find out where it goes wrong. I checked all the parameter names in the checkpoint and in the model, and they are the same, except for one extra parameter, 'encoder.positional_embedding', in the checkpoint, which is also expected because it is a buffer. So this cannot be the problem.
My question is: why does the model have a much higher loss after the script is restarted?
Hi
@sanchit-gandhi
!
Thank you very much for this valuable demonstration. However, I have been doing some tests and I don't see much difference between the results after finetuning by changing the language in the tokenizer and after finetuning without indicating the language. Both improve the performance of the base model. Does this make sense? Am I doing something wrong? I am finetuning on 6 languages and including one under-represented language (Galician).
On the other hand, I am noticing that the ability of the model to identify the language (LID) worsens noticeably after finetuning (monolingual or multilingual).
I am commenting it in this github post:
https://github.com/openai/whisper/discussions/1454
Is this to be expected? Is there a way to perform finetuning in both tasks?
Thanks!
Hi
@sanchit-gandhi
!
I used your tutorial to finetune the whisper model on a local dataset. thank you very much, it was really helpful.
My issue is that when I map the prepare_dataset function over my data, it takes a really long time and my code crashes. I am training on a 20-hour dataset and my GPU has 16 GB of memory.
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=4)
is there any way to make the prepare_dataset function work more efficiently?
I have already downsampled my data to 16 kHz and all the audio files are less than 30 seconds; the dataset is structured exactly like an HF dataset.
I wonder if, for a local dataset, I should change the prepare_dataset function.
Could you please help me and explain which parts should be changed, and how, for a local dataset?
Sorry if my questions sound silly, I am really new to this field.
Hello mahnaz, I am working on the same model and had the same issue. Once I got that fixed, I ran into a new issue with the shape of my input file.
May I know if you have got it working yet?
I want to fine-tune Whisper for multiple languages (Chinese and Tagalog); this is my code.
The tokenizer is used in many places, but I only changed the prepare_dataset function. Will it work?
from dataclasses import dataclass
from typing import Any, List, Dict, Union
import re
import evaluate
import torch
from datasets import load_dataset, Audio
from transformers import WhisperProcessor, WhisperForConditionalGeneration, \
Seq2SeqTrainingArguments, Seq2SeqTrainer
model_name_or_path = 'openai/whisper-medium'
output_dir = "./whisper-medium-zh-tl"
data_dir = "./dataset"
processor = WhisperProcessor.from_pretrained(model_name_or_path, language=None, task="transcribe")
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    processor.tokenizer.set_prefix_tokens(language=batch["language"], task="transcribe")

    # encode target text to label ids
    batch["labels"] = processor.tokenizer(batch["transcription"]).input_ids
    return batch
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyways
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
if __name__ == "__main__":
    common_voice = load_dataset("audiofolder", data_dir=data_dir)
    common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
    common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=8)
    print(common_voice)

    train_samples = len(common_voice["train"])
    data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
    metric = evaluate.load("wer")

    model = WhisperForConditionalGeneration.from_pretrained(model_name_or_path)
    model.config.forced_decoder_ids = None
    model.config.suppress_tokens = []

    training_args = Seq2SeqTrainingArguments(
        output_dir=output_dir,  # change to a repo name of your choice
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,  # increase by 2x for every 2x decrease in batch size
        num_train_epochs=5,
        learning_rate=1e-5,
        warmup_steps=500,
        gradient_checkpointing=True,
        fp16=True,
        evaluation_strategy="steps",
        per_device_eval_batch_size=8,
        predict_with_generate=True,
        generation_max_length=225,
        save_steps=train_samples*10,
        eval_steps=1000,
        logging_steps=25,
        report_to=["tensorboard"],
        load_best_model_at_end=True,
        metric_for_best_model="wer",
        greater_is_better=False,
        push_to_hub=False,
    )

    trainer = Seq2SeqTrainer(
        args=training_args,
        model=model,
        train_dataset=common_voice["train"],
        eval_dataset=common_voice["test"],
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        tokenizer=processor.feature_extractor,
    )

    trainer.train()
    trainer.save_model(output_dir)
Hi @pedramaa ,
Unfortunately, I have not found anything helpful yet; for me, it works only if I reduce my dataset size (only using 4 hours of data). I run the code on two GPUs (one 3090 and one 1080) in parallel. If you find anything helpful, please share it with me. Thank you so much.
Hi mahnaz,
maybe this can be helpful for you: https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event#streaming-mode
Hey @taohoang - the BOS token id is appended in the Whisper modelling code, at the point when we shift the labels right to get the decoder input ids: https://github.com/huggingface/transformers/blob/fd56f7f0813d412c3e0848cbd6f94a23de2c07b7/src/transformers/models/whisper/modeling_whisper.py#L65
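In essence, the linked code does something like the following simplified sketch, which is why the collator strips any leading BOS from the labels (the label ids below are hypothetical):

import torch

def shift_tokens_right(input_ids, pad_token_id, decoder_start_token_id):
    # shift the labels one position to the right and prepend the decoder start token
    shifted = input_ids.new_zeros(input_ids.shape)
    shifted[:, 1:] = input_ids[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    # replace any -100 loss-masking markers so the embedding lookup doesn't fail
    shifted.masked_fill_(shifted == -100, pad_token_id)
    return shifted

labels = torch.tensor([[50259, 50359, 123, 456, 50257]])  # hypothetical label ids
decoder_input_ids = shift_tokens_right(labels, pad_token_id=50257, decoder_start_token_id=50258)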
Hey
@faycel
- the `.generate` method can output any of the tokens from the model's vocabulary. We first run a forward pass to get the logits over the entire vocabulary, and then sample from this distribution to predict our next token. So if you've expanded the vocabulary to > 50363 by expanding the dimensionality of the final embedding layer and also the tokeniser's vocab, then you can generate with no code changes required. See this thread for details: https://discuss.huggingface.co/t/adding-custom-vocabularies-on-whisper/29311?u=nbroad
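A minimal sketch of that vocabulary expansion (the speaker tokens below are hypothetical placeholders):

from transformers import WhisperForConditionalGeneration, WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# add new speaker-id tokens to the tokenizer, then grow the embedding / output layer to match
tokenizer.add_tokens(["<|speaker_1|>", "<|speaker_2|>"], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))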
Hey @andrespm - this is expected behaviour. When you fine-tune it on a new language, Whisper does a pretty good job at leveraging its knowledge of the other 96 languages it’s pre-trained on.
Pretty much all modern languages will be linguistically similar to at least one of the 96 languages Whisper already knows, so you’ll probably fall under this paradigm of cross-lingual knowledge representations
We tried two ways of setting it up for fine-tuning on new languages:
- Remove the language prediction task so that Whisper doesn’t get caught up with the fact it’s working on a new language and just focuses on transcribing text as accurately as possible
- Keep the language prediction task and tell Whisper that the new language is the same as one of it’s current languages (e.g. if fine-tuning on Nepali, tell Whisper it’s actually predicting Hindi, since the two are linguistically most similar): our thinking here was that we’d be able to leverage Whisper’s knowledge of the most linguistically similar language to the new language that we were showing it
In the end, 1 & 2 gave very comparable performance, so Whisper figures out how to make use of its existing knowledge itself
Regarding maintaining LID performance after fine-tuning, you can try two strategies to reduce catastrophic forgetting (the phenomenon where a model forgets what it learnt during a prior round of training):
- Include data from different languages during fine-tuning, but only count the loss from the language id token towards the overall loss (i.e. discard the transcriptions if you don't want to fine-tune on these other languages, and just train the model to predict the language token)
- Try fine-tuning using PEFT: we've seen that the model is far less likely to catastrophically forget after PEFT fine-tuning, since the base model weights are frozen. See Vaibhavs10/fast-whisper-finetuning for details
Overall, I think option 2 is the easier of the two here
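For reference, a minimal PEFT + LoRA setup looks roughly like this (the hyperparameters are illustrative; see the Vaibhavs10/fast-whisper-finetuning repo for a complete recipe):

from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable; the base weights stay frozen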
Hey
@mahnaz
- if you're running into issues preparing the dataset, you can try tweaking the `datasets` parameters for the `.map` method. I would recommend using `num_proc=1` to start with, since using more than this is probably crashing your system if you don't have the required CPUs
Hey @LukeJacob2023 - indeed that should work! Looks like you've got your data in the right format for multi-language fine-tuning! How did you get on here? Did the fine-tuned model improve compared to the pre-trained one?
Thank you. I have successfully fine-tuned it by setting the language to None, because I was afraid the tokenizer might cause an error. The model works OK.
Hi
@sanchit-gandhi
,
I am working on adding Wolof language datasets to Mozilla Common Voice, where the language is not yet available.
Do you reckon it would be possible to build on top of your common-language-id model and add Wolof to it?
Or would you recommend training from Whisper with all of Common Voice over again?
Hi @sanchit-gandhi ,
I'm reaching out to seek guidance on how to resolve this error. When attempting to use the trainer.push_to_hub(**kwargs) function to push a model to the Hugging Face model hub, I encounter an HTTPError followed by a BadRequestError. Here's a brief overview of the error messages:
Hi @sanchit-gandhi , I have a question about changing the number of steps while training the Whisper model. Would doubling the number of steps give me any significant improvement? Also, is there a way to know how to select the right number of training steps?
Hi @sanchit-gandhi , how do I fine-tune for the 'translate' task rather than 'transcribe'?
Hi @sanchit-gandhi , I want to perform transfer learning only for identifying the language, and then further fine-tune it for its dialects. These dialects, or the language itself, might not be part of the existing model. How can I do this? I'm confused.
Hi
@sanchit-gandhi
We need your expert opinion on which model would be best suited for fine-tuning in our case, where we are mostly interested in our English data (we have a dataset of only a few tens of GBs) and the best accuracy possible on that.
Also, do you recommend using a distil-whisper model instead of the original Whisper for this?
Kindly let me know your views.
Thanks a lot
Hello @sanchit-gandhi san,
Thank you so much for providing such a detailed notebook for fine-tuning whisper.
I have a question regarding how to set up the LoraConfig so that target_modules only targets ["q_proj", "v_proj"] of the decoder stack.
It seems that both the encoder and decoder use the same module names, so setting target_modules to ["q_proj", "v_proj"] creates LoRA layers for both the encoder and decoder.
How can I target only the decoder's attention layers?
Original Whisper
OrderedDict([('model', WhisperModel(
(encoder): WhisperEncoder(
(conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
(embed_positions): Embedding(1500, 768)
(layers): ModuleList(
(0-11): 12 x WhisperEncoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(q_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(activation_fn): GELUActivation()
(fc1): Linear8bitLt(in_features=768, out_features=3072, bias=True)
(fc2): Linear8bitLt(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(decoder): WhisperDecoder(
(embed_tokens): Embedding(51865, 768, padding_idx=50257)
(embed_positions): WhisperPositionalEmbedding(448, 768)
(layers): ModuleList(
(0-11): 12 x WhisperDecoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(q_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(activation_fn): GELUActivation()
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(encoder_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(q_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear8bitLt(in_features=768, out_features=3072, bias=True)
(fc2): Linear8bitLt(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)), ('proj_out', Linear(in_features=768, out_features=51865, bias=False))])
After get_peft_model
OrderedDict([('base_model',
LoraModel(
(model): WhisperForConditionalGeneration(
(model): WhisperModel(
(encoder): WhisperEncoder(
(conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
(embed_positions): Embedding(1500, 768)
(layers): ModuleList(
(0-11): 12 x WhisperEncoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(q_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(activation_fn): GELUActivation()
(fc1): Linear8bitLt(in_features=768, out_features=3072, bias=True)
(fc2): Linear8bitLt(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(decoder): WhisperDecoder(
(embed_tokens): Embedding(51865, 768, padding_idx=50257)
(embed_positions): WhisperPositionalEmbedding(448, 768)
(layers): ModuleList(
(0-11): 12 x WhisperDecoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(q_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(activation_fn): GELUActivation()
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(encoder_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(q_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear8bitLt(in_features=768, out_features=3072, bias=True)
(fc2): Linear8bitLt(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(proj_out): Linear(in_features=768, out_features=51865, bias=False)
)
))])
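For reference, a PEFT LoRA configuration along the following lines would produce the structure printed above (rank-8 adapters with 0.05 dropout on the q_proj and v_proj layers). This is only a sketch: the lora_alpha value is an assumption, since it is not visible in the printed module tree.
from peft import LoraConfig, get_peft_model

# Sketch of a LoRA config consistent with the printout above; lora_alpha is assumed.
lora_config = LoraConfig(
    r=8,                                  # matches the 768 -> 8 -> 768 lora_A / lora_B shapes
    lora_alpha=32,                        # assumption, not visible in the printout
    target_modules=["q_proj", "v_proj"],  # the layers that gained LoRA adapters above
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()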
Hi @sanchit-gandhi, I am fine-tuning Whisper small on Hindi data. While fine-tuning, the validation WER is decreasing but the validation loss is increasing, so it seems to be overfitting. How can I solve this, i.e. what sort of regularization can I use? And is there any advice on the warmup steps hyperparameter, i.e. a recommended value? Can anyone please help? Thanks in advance.
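As a minimal sketch of where regularization-related options live in Seq2SeqTrainingArguments (the specific values below are illustrative assumptions, not tuned recommendations):
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hi",   # placeholder path
    weight_decay=0.01,                 # illustrative: adds L2-style regularization
    warmup_steps=500,                  # the value used in the fine-tuning blog post
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    predict_with_generate=True,
    load_best_model_at_end=True,       # keep the checkpoint with the best validation WER
    metric_for_best_model="wer",
    greater_is_better=False,
)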
@sanchit-gandhi I used your tutorial to fine-tune the Whisper model on a local dataset. Thank you very much, it was really helpful.
My issue is with this step:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hi",  # change to a repo name of your choice
)
Actually, I am new to the field and I am not able to understand what the output directory is.
Could I use a local directory on my system instead of "./whisper-small-hi"?
I also want to know how and from where I can use this fine-tuned model to test it.
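A rough sketch of how one might load the fine-tuned model back from the local output directory and test it, assuming the processor/tokenizer were also saved to the same path (the directory and audio file names below are placeholders):
from transformers import pipeline

# load the fine-tuned model from the same local path that was passed as output_dir
asr = pipeline(
    "automatic-speech-recognition",
    model="./whisper-small-hi",
)
print(asr("test_sample.wav")["text"])  # placeholder audio file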
Hi @sanchit-gandhi, thank you for posting such a detailed article on fine-tuning Whisper with custom training data.
I have followed the article and could generate the model with my own training dataset.
I have a doubt: when I run trainer.train(), I can see that training starts and checkpoints are stored in a separate directory.
I have set pushing checkpoints to the Hugging Face Hub to false, as I want to save the model locally.
After this step I run save model to a local directory; that also works fine, but the checkpoints and the saved model end up in different directories.
My questions are:
1. When I load the trained model from a local directory, which directory path do I have to provide, the last checkpoint dir or the saved model path?
2. The other issue is that the checkpoint dir does not have files like vocab.json etc., so loading fails. As a workaround I copied the files from the saved model dir into the checkpoint dir.
Kindly help me with my queries.
Please find the training steps below:
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-muttu-small-vv-trained",  # change to a repo name of your choice
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=500,
    gradient_checkpointing=True,
    fp16=False,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=100,
    eval_steps=100,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
    use_cpu=True
)
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor
)
# this trains the model and generates checkpoints
trainer.train()
# save the trained model and processor
trainer.save_model(training_args.output_dir)
processor.save_pretrained(training_args.output_dir)
Files stored under the checkpoint dir:
3815 May 12 06:21 generation_config.json
2260 May 12 06:21 config.json
966995080 May 12 06:21 model.safetensors
5240 May 12 06:21 training_args.bin
339 May 12 06:21 preprocessor_config.json
1064 May 12 06:21 scheduler.pt
13990 May 12 06:21 rng_state.pth
1925050668 May 12 06:21 optimizer.pt
4436 May 12 06:21 trainer_state.json
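A possible sketch for loading the fine-tuned model, assuming the workflow above: load from the directory written by trainer.save_model() and processor.save_pretrained() (i.e. training_args.output_dir), since the checkpoint-NNN folders hold the weights and optimizer state for resuming training but, as observed above, not the tokenizer files such as vocab.json.
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# load from the saved-model path, not from a checkpoint-NNN folder
output_dir = "./whisper-muttu-small-vv-trained"  # the output_dir used above
model = WhisperForConditionalGeneration.from_pretrained(output_dir)
processor = WhisperProcessor.from_pretrained(output_dir)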
@sanchit-gandhi For Indic languages, I am looking to fine-tune the Whisper model on generated audios containing a mixture of numbers and domain-specific names, plus the public datasets available on the web, because the model makes a lot of mistakes on words that are homophones of each other and especially on numbers, e.g. adding three zeros when "100" is spoken. Numbers are also spoken differently in different languages, so training on a mixture of public and synthetic data across the different Indic languages should solve these issues. My concern is: after fine-tuning with or without PEFT, will the model still be able to use parameters like initial prompt, repetition penalty, VAD filter, hotwords (which faster-whisper provides), etc.?
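As a hedged illustration (not a confirmed answer from this thread): options such as initial_prompt, repetition_penalty and vad_filter are arguments to faster-whisper's transcribe() call rather than model weights, so they remain usable with a fine-tuned model; exact parameter availability (e.g. hotwords) depends on the installed faster-whisper version. The paths below are placeholders.
from faster_whisper import WhisperModel

model = WhisperModel("path/to/converted-finetuned-model")  # placeholder path
segments, info = model.transcribe(
    "audio.wav",                                           # placeholder audio file
    initial_prompt="domain-specific names and numbers",
    repetition_penalty=1.1,
    vad_filter=True,
)
for segment in segments:
    print(segment.text)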
Based on this guide https://huggingface.co./blog/fine-tune-whisper, I tried to fine-tune "small" and "large-v3" models.
- The fine-tuned "small" model works normally.
- But the fine-tuned "large-v3" model works poorly on non-English audio files such as Chinese ones: it auto-translates Chinese to English even though I specified transcribing in Chinese, not translating.
Have you faced this issue, and can you give me any advice? Thank you so much.
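One thing worth checking (a minimal sketch, assuming a recent transformers version; the paths and file names are placeholders): pin the language and task at generation time so the fine-tuned large-v3 model transcribes rather than translates.
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("./whisper-large-v3-finetuned")  # placeholder path
model = WhisperForConditionalGeneration.from_pretrained("./whisper-large-v3-finetuned")

# load a 16 kHz mono waveform (placeholder audio file)
audio_array, _ = librosa.load("chinese_sample.wav", sr=16000)
input_features = processor(audio_array, sampling_rate=16000, return_tensors="pt").input_features

# explicitly request Chinese transcription (not translation) at generation time
predicted_ids = model.generate(input_features, language="zh", task="transcribe")
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])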
Hey @Jiltseb, in this case you will probably get better results fine-tuning on language 1 and language 2 at the same time. For this, you will need to switch the language code in your tokenizer depending on the language of each individual sample, so that the model learns to differentiate between language 1 and language 2. We'll take the Whisper fine-tuning blog post as our starting point. Suppose your dataset has a column called "language" that says what the language is for each sample (e.g. "Hindi" or "French"); then we can update our prepare_dataset function as follows:
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]
    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # get the language of our text
    tokenizer.set_prefix_tokens(language=batch["language"], task="transcribe")
    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch
It's pretty easy to add a new "language" column to your dataset. Suppose you load the Hindi version of Common Voice as:
from datasets import load_dataset, DatasetDict

common_voice_hi = DatasetDict()
common_voice_hi["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train+validation", use_auth_token=True)
common_voice_hi["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="test", use_auth_token=True)
You can then add the language column using the dataset's add_column method:
for split in common_voice_hi:
    language_column = ["Hindi"] * len(common_voice_hi[split])
    common_voice_hi[split] = common_voice_hi[split].add_column("language", language_column)
Supposing we do the same for "French":
common_voice_fr = DatasetDict()
common_voice_fr["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="train+validation", use_auth_token=True)
common_voice_fr["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", use_auth_token=True)

for split in common_voice_fr:
    language_column = ["French"] * len(common_voice_fr[split])
    common_voice_fr[split] = common_voice_fr[split].add_column("language", language_column)
We can now combine our two datasets using concatenate_datasets:
from datasets import concatenate_datasets

common_voice_merged = DatasetDict()
for split in common_voice_hi:
    common_voice_merged[split] = concatenate_datasets([common_voice_hi[split], common_voice_fr[split]])
Voila! Now you can use this combined dataset for training on two languages at once.
@sanchit-gandhi First of all, thank you for your amazing blog on fine-tuning.
This answer is a great place to start for multi-language fine-tuning.
I was trying this out and wanted to ask how to train the model to both translate and transcribe for 5 languages. Should I go sequentially, fine-tuning for transcription and then for translation, or is there a better approach?
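Building on the prepare_dataset example above, one possible sketch is to also store the task per sample in a hypothetical "task" column ("transcribe" or "translate") alongside the "language" column, and set both prefix tokens accordingly:
def prepare_dataset(batch):
    # load and resample audio data
    audio = batch["audio"]
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # set both the language and the task for this particular sample
    # (assumes a hypothetical "task" column with values "transcribe" or "translate")
    tokenizer.set_prefix_tokens(language=batch["language"], task=batch["task"])
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch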