Model Card for Fine-tuned Whisper Large V3 (Moroccan Darija)

Model Overview

Model Name: Whisper Large V3 (Fine-tuned for Moroccan Darija)
Author: Ayoub Laachir
License: apache-2.0
Repository: Ayoub-Laachir/MaghrebVoice
Dataset: Ayoub-Laachir/Darija_Dataset

Description

This model is a fine-tuned version of OpenAI’s Whisper Large V3, adapted to recognize and transcribe Moroccan Darija, the Arabic dialect spoken in Morocco, which carries Berber (Amazigh), French, and Spanish influences. The project aims to make speech technology more accessible to millions of Moroccans and to serve as a blueprint for similar work on other underrepresented languages.

Technologies Used

Whisper Large V3: OpenAI’s state-of-the-art speech recognition model
PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation): An efficient fine-tuning technique
Google Colab: Cloud environment for training the model
Hugging Face: Hosting the dataset and final model
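
To make the LoRA idea concrete, here is a minimal sketch in plain Python using toy matrices. The shapes, rank, and scaling values are illustrative assumptions, not this model's actual adapter settings, and the helper names are hypothetical. It shows that folding the scaled low-rank product into a frozen weight matrix gives the same output as running the adapter alongside it, and how few parameters the adapter adds.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_scaled(a, b, scale=1.0):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def lora_param_count(d_out, d_in, rank):
    """Trainable parameters added by one LoRA adapter: B is d_out x r, A is r x d_in."""
    return rank * (d_out + d_in)

# Frozen base weight W (2 x 3) and toy rank-1 LoRA factors (assumed values)
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
B = [[0.5], [-1.0]]          # d_out x r
A = [[1.0, 2.0, 0.0]]        # r x d_in
alpha, rank = 2, 1           # hypothetical LoRA hyperparameters
scale = alpha / rank

x = [[1.0], [1.0], [1.0]]    # input as a column vector

# Adapter kept separate: y = W x + scale * B (A x)
y_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Adapter merged into the weight: W' = W + scale * B A, then y = W' x
W_merged = add_scaled(W, matmul(B, A), scale)
y_merged = matmul(W_merged, x)

print(y_adapter == y_merged)             # both paths give the same output
print(lora_param_count(1000, 1000, 8))   # 16000, versus 1,000,000 for the full matrix
```

Because the merged weight has the same shape as the original, a merged model runs at the same inference cost as the base model.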

Dataset Preparation

The dataset preparation involved several steps:

Cleaning: Correcting bad transcriptions and standardizing word spellings.
Audio Processing: Converting all samples to a 16 kHz sample rate.
Dataset Split: Creating a training set of 3,312 samples and a test set of 150 samples.
Format Conversion: Transforming the dataset into the parquet file format.
Uploading: Publishing the prepared dataset to the Hugging Face Hub.
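
The split step above could be sketched as follows. This is only an illustration in plain Python: it assumes a seeded random hold-out over 3,462 cleaned samples (3,312 + 150). The card does not say how the split was actually drawn, and the function name and seed are hypothetical.

```python
import random

def split_indices(n_samples, n_test, seed=42):
    """Shuffle sample indices reproducibly and hold out n_test of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # seeded so the split can be regenerated
    test = sorted(indices[:n_test])
    train = sorted(indices[n_test:])
    return train, test

# Sizes taken from the split described above
train_idx, test_idx = split_indices(3312 + 150, 150)
print(len(train_idx), len(test_idx))  # 3312 150
```

Fixing the seed keeps the test set stable across runs, so evaluation numbers stay comparable.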

Training Process

The model was fine-tuned using the following parameters:

Per device train batch size: 8
Gradient accumulation steps: 1
Learning rate: 1e-4 (0.0001)
Warmup steps: 200
Number of train epochs: 2
Logging and evaluation: every 50 steps
Weight decay: 0.01

Training progress showed a steady decrease in both training and validation loss over 8000 steps.
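
As a rough sketch, the hyperparameters listed above can be collected in a dictionary whose keys follow the argument names of the Hugging Face Transformers training arguments. Whether the author used exactly these argument names is an assumption. So are the single-device setup and the linear warmup shape used in the helpers, which are hypothetical.

```python
# Hyperparameters from this card, keyed by Transformers TrainingArguments names
hparams = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-4,
    "warmup_steps": 200,
    "num_train_epochs": 2,
    "logging_steps": 50,
    "eval_steps": 50,
    "weight_decay": 0.01,
}

def effective_batch_size(hp, n_devices=1):
    """Samples consumed per optimizer step (n_devices=1 is an assumption)."""
    return hp["per_device_train_batch_size"] * hp["gradient_accumulation_steps"] * n_devices

def warmup_lr(step, hp):
    """Learning rate during an assumed linear warmup; decay after warmup is not modeled."""
    if step >= hp["warmup_steps"]:
        return hp["learning_rate"]
    return hp["learning_rate"] * step / hp["warmup_steps"]

print(effective_batch_size(hparams))  # 8
print(warmup_lr(100, hparams))        # halfway through warmup: half the peak rate
```

A dictionary like this could be unpacked into a training-arguments object together with an output directory and a step-based evaluation strategy.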

Testing and Evaluation

Evaluation produced the following error rates:

Word Error Rate (WER): 3.1467%
Character Error Rate (CER): 2.3893%

These metrics demonstrate the model's ability to accurately transcribe Moroccan Darija speech.

The fine-tuned model shows improved handling of Darija-specific words, sentence structure, and overall accuracy.
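
For readers unfamiliar with these metrics, the sketch below shows how WER and CER are commonly computed from an edit (Levenshtein) distance. It is a plain-Python illustration, not the evaluation code used for this model. The function names are hypothetical and the example sentences are English placeholders.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"CER: {cer(reference, hypothesis):.2%}")
```

Over a whole test set, the usual practice is to sum the edits and the reference lengths across all utterances before dividing, rather than averaging per-utterance rates.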

Audio Transcription Script with PEFT Layers

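This script demonstrates how to transcribe audio files with the fine-tuned Whisper Large V3 model for Moroccan Darija. It loads the LoRA adapter weights on top of the base model using PEFT (Parameter-Efficient Fine-Tuning), merges them into the base weights, and runs the merged model through a speech recognition pipeline.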

Required Libraries

Install the following libraries before running the script. The leading ! is notebook syntax (Colab or Jupyter); omit it when running the commands in a regular shell:

!pip install --upgrade pip
!pip install --upgrade transformers accelerate librosa soundfile pydub
!pip install peft==0.3.0  # Install PEFT library

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import librosa
import soundfile as sf
from pydub import AudioSegment
from peft import PeftModel, PeftConfig  # Import PEFT classes

# Set the device to GPU if available, else use CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Configuration for the base Whisper model
base_model_name = "openai/whisper-large-v3"  # Base model for Whisper
processor = AutoProcessor.from_pretrained(base_model_name)  # Load the processor

# Load your fine-tuned model configuration
model_name = "Ayoub-Laachir/MaghrebVoice_OnlyLoRaLayers"  # Fine-tuned model with LoRA layers
peft_config = PeftConfig.from_pretrained(model_name)  # Load PEFT configuration

# Load the base model in the same dtype that is passed to the pipeline below (float16 on GPU)
base_model = AutoModelForSpeechSeq2Seq.from_pretrained(base_model_name, torch_dtype=torch_dtype).to(device)

# Load the PEFT model
model = PeftModel.from_pretrained(base_model, model_name).to(device)  # Load the PEFT model

# Merge the LoRA weights with the base model
model = model.merge_and_unload()  # Combine the LoRA weights into the base model

# Configuration for transcription
config = {
    "language": "arabic",  # Language for transcription
    "task": "transcribe",  # Task type
    "chunk_length_s": 30,  # Length of each audio chunk in seconds
    "stride_length_s": 5,   # Overlap between chunks in seconds
}

# Initialize the automatic speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,  # Use the merged model
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=config["chunk_length_s"],
    stride_length_s=config["stride_length_s"],
)

# Convert audio to 16kHz sampling rate
def convert_audio_to_16khz(input_path, output_path):
    audio, sr = librosa.load(input_path, sr=None)  # Load the audio file
    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)  # Resample to 16kHz
    sf.write(output_path, audio_16k, 16000)  # Save the converted audio

# Format time in HH:MM:SS.milliseconds
def format_time(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = seconds % 60
    return f"{hours:02d}:{minutes:02d}:{seconds:06.3f}"

# Transcribe audio file
def transcribe_audio(audio_path):
    try:
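        result = pipe(
            audio_path,
            return_timestamps=True,  # Return per-chunk timestamps
            generate_kwargs={"language": config["language"], "task": config["task"]},  # Apply the language and task from config
        )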
        return result["chunks"]  # Return transcription chunks
    except Exception as e:
        print(f"Error transcribing audio: {e}")
        return None

# Main function to execute the transcription process
def main():
    # Specify input and output audio paths (update paths as needed)
    input_audio_path = "/path/to/your/input/audio.mp3"  # Replace with your input audio path
    output_audio_path = "/path/to/your/output/audio_16khz.wav"  # Replace with your output audio path

    # Convert audio to 16kHz
    convert_audio_to_16khz(input_audio_path, output_audio_path)

    # Transcribe the converted audio
    transcription_chunks = transcribe_audio(output_audio_path)

    if transcription_chunks:
        print("WEBVTT\n")  # Print header for WEBVTT format
        for chunk in transcription_chunks:
            start_time = format_time(chunk["timestamp"][0])  # Format start time
            end_time = format_time(chunk["timestamp"][1])    # Format end time
            text = chunk["text"]                              # Get the transcribed text
            print(f"{start_time} --> {end_time}")           # Print time range
            print(f"{text}\n")                               # Print transcribed text
    else:
        print("Transcription failed.")

if __name__ == "__main__":
    main()
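
The script above prints cues directly. As a hedged alternative, the sketch below builds the WebVTT text as a string so it can be written to a file. It assumes chunks shaped like the pipeline output (a dict with "timestamp" and "text") and that a chunk's end time may occasionally be missing (None). The fallback duration, the function names, and the sample chunks are hypothetical.

```python
def vtt_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm, rounding to whole milliseconds first."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"

def chunks_to_webvtt(chunks, fallback_duration=2.0):
    """Render timestamped chunks as a WebVTT document."""
    lines = ["WEBVTT", ""]
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if start is None:
            continue  # cannot place a cue without a start time
        if end is None:
            end = start + fallback_duration  # assumed fallback for a missing end time
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(chunk["text"].strip())
        lines.append("")
    return "\n".join(lines)

# Placeholder chunks standing in for real pipeline output
sample_chunks = [
    {"timestamp": (0.0, 2.5), "text": " salam"},
    {"timestamp": (2.5, None), "text": " labas"},
]
vtt = chunks_to_webvtt(sample_chunks)
print(vtt)
```

Rounding to integer milliseconds before splitting into fields avoids cues such as 00:00:60.000 when a value sits just below a minute boundary.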

Challenges and Future Improvements

Challenges Encountered

Diverse spellings of words in Moroccan Darija
Cleaning and standardizing the dataset

Future Improvements

Expand the dataset to include more Darija accents and expressions
Further fine-tune the model for specific Moroccan regional dialects
Explore integration into practical applications like voice assistants and transcription services

Conclusion

This project marks a significant step towards making AI more accessible for Moroccan Arabic speakers. The success of this fine-tuned model highlights the potential for adapting advanced AI technologies to underrepresented languages, serving as a model for similar initiatives in North Africa.
