Further Finetuning

#1
by Rajeev1908 - opened

Dear Team,

Thanks for sharing the model; it looks good. Could you help us with a fine-tuning script for this model, or share some pointers? We need to fine-tune it for our industry domain, real estate.

We have created the following script. Kindly advise whether it will train the model well.

import os
import torch
import datasets  # needed for datasets.Audio further down
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments
from datasets import load_dataset

# Load the model and processor

model_name = "Oriserve/Whisper-Hindi2Hinglish-Prime"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

def preprocess_function(batch):
    # Convert one example: raw audio -> log-Mel input features, transcript -> label token ids
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # as_target_processor() is deprecated; call the tokenizer directly for the labels
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

# Load dataset

data_files = {"train": "path_to_train.jsonl", "validation": "path_to_validation.jsonl"}
dataset = load_dataset("json", data_files=data_files)

dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
dataset = dataset.map(preprocess_function, remove_columns=dataset["train"].column_names, num_proc=4)

# Training arguments

output_dir = "./whisper_finetuned"
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    num_train_epochs=5,
    weight_decay=0.01,
    save_total_limit=2,
    save_strategy="epoch",
    predict_with_generate=True,
    fp16=True,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=100,
    report_to="tensorboard",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    ddp_find_unused_parameters=False,
)

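# Define trainer
# Note: the processor has no `data_collator` attribute. A collator that batches
# the input features and pads the labels must be defined separately and passed in here.

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=processor.feature_extractor,
    data_collator=data_collator,
)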

# Fine-tune the model

trainer.train()

# Save final model

trainer.save_model(output_dir)

# Save processor

processor.save_pretrained(output_dir)

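# Optional: log training stats

import numpy as np
import pandas as pd
# TensorBoard event files are binary, so read the Trainer's in-memory log history instead
logs = pd.DataFrame(trainer.state.log_history)
print("Training and evaluation losses per epoch:")
# Training losses are logged at fractional epochs; round up to assign them to an epoch
print(logs.groupby(np.ceil(logs["epoch"]))[["loss", "eval_loss"]].mean())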
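
A note on the collator for the script above: the trainer needs a function that turns a list of preprocessed examples into one batch. Below is a minimal sketch of one way to do that, not the model authors' method. The names `make_collator`, `pad_labels` and `strip_leading_token` are hypothetical. It assumes every example holds `input_features` of identical shape (the Whisper feature extractor pads to a fixed length by default, as far as I know) and a `labels` list of token ids. Labels are padded with -100 so the loss skips the padding.

```python
from typing import Any, Callable, Dict, List, Optional

IGNORE_INDEX = -100  # PyTorch's cross-entropy loss ignores this label value by default


def strip_leading_token(seq: List[int], token_id: Optional[int]) -> List[int]:
    """Drop the first token if it equals token_id.

    Assumption: the model prepends the decoder start token itself when it
    builds decoder inputs from labels, so keeping it would duplicate it.
    """
    if token_id is not None and len(seq) > 0 and seq[0] == token_id:
        return list(seq[1:])
    return list(seq)


def pad_labels(label_seqs: List[List[int]], pad_value: int = IGNORE_INDEX) -> List[List[int]]:
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(seq) for seq in label_seqs)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in label_seqs]


def make_collator(decoder_start_token_id: Optional[int] = None) -> Callable:
    """Build a collate function that can be passed as data_collator."""

    def collate(features: List[Dict[str, Any]]) -> Dict[str, Any]:
        import torch  # imported lazily so the helpers above work without PyTorch

        labels = [strip_leading_token(f["labels"], decoder_start_token_id) for f in features]
        return {
            # Assumes all feature arrays share one shape, so they stack directly
            "input_features": torch.tensor(
                [f["input_features"] for f in features], dtype=torch.float32
            ),
            "labels": torch.tensor(pad_labels(labels), dtype=torch.long),
        }

    return collate
```

With the script's variables, this could be wired in as `data_collator = make_collator(model.config.decoder_start_token_id)` before the trainer is created.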

Oriserve org

Hi @Rajeev1908, thanks for reaching out. Your code looks good and should work for fine-tuning the model. You can also follow the steps below for better results:

  1. Ensure that the training data represents your use case, i.e. it should resemble the audio your model will encounter at inference time.
  2. On each evaluation run, calculate metrics such as WER (word error rate) to get a better understanding of the model's performance.
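
To make step 2 concrete, here is a small sketch of how WER can be computed: the word-level edit distance between reference and prediction, divided by the number of reference words. The function names are hypothetical and the sample sentences are made up for illustration. In practice, libraries such as `jiwer` or `evaluate` are commonly used for this; the version below only shows the arithmetic.

```python
from typing import List


def word_edit_distance(ref_words: List[str], hyp_words: List[str]) -> int:
    """Minimum number of word substitutions, insertions and deletions."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """Total word edits divided by total reference words over all pairs."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        total_edits += word_edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


# Made-up example: one substituted word out of four reference words
print(corpus_wer(["flat ka rent kitna hai"[:0] + "ghar ka rent kitna"], ["ghar ka rate kitna"]))
```

Inside a `compute_metrics` function, the decoded predictions and decoded labels would be passed as `hypotheses` and `references`.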

Additionally, we provide custom-curated, robust ASR model APIs, which are much cheaper than those of other players in the market such as Deepgram and Azure. To know more, you can reach out to us at [email protected]

Hi Team,

Thanks for the response. We will reach out to the team over email for more details about the API.

Thanks
Kunal

Hi Team,

What should the length of the audio chunks be? We created chunks of 10 to 15 s, and the script gives an error with them. Do we need to concatenate these chunks into longer ones? Would that work?

Kindly advise on this.

Thanks
Rajeev

Oriserve org

@Rajeev1908 The Whisper model works with 30 s audio clips. If your clips are shorter or longer than 30 s, try padding or trimming them accordingly.
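
For reference, padding or trimming a clip to 30 s could look like the sketch below. The function name `pad_or_trim` is hypothetical, and it works on a plain list of samples for simplicity; with NumPy arrays, slicing plus `np.pad` would do the same job. As far as I know, the Whisper feature extractor already pads and truncates to 30 s by default, so a manual step like this may not be needed when audio goes through the processor.

```python
from typing import List, Sequence


def pad_or_trim(samples: Sequence[float],
                sampling_rate: int = 16_000,
                target_seconds: float = 30.0) -> List[float]:
    """Return the clip cut or zero-padded (silence) to exactly target_seconds."""
    target_len = int(sampling_rate * target_seconds)
    clipped = list(samples[:target_len])                    # trim clips that are too long
    return clipped + [0.0] * (target_len - len(clipped))    # pad clips that are too short


# A 12 s clip at 16 kHz becomes 30 s (480000 samples) with trailing silence
clip = [0.1] * (12 * 16_000)
print(len(pad_or_trim(clip)))
```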

ai-team-ori changed discussion status to closed
