---
license: apache-2.0
datasets:
- Ayoub-Laachir/Darija_Dataset
language:
- ary
metrics:
- wer
- cer
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
---

# Model Card for Fine-tuned Whisper Large V3 (Moroccan Darija)

## Model Overview

- **Model Name:** Whisper Large V3 (Fine-tuned for Moroccan Darija)
- **Author:** Ayoub Laachir
- **License:** apache-2.0
- **Repository:** [Ayoub-Laachir/MaghrebVoice](https://huggingface.co./Ayoub-Laachir/MaghrebVoice)
- **Dataset:** [Ayoub-Laachir/Darija_Dataset](https://huggingface.co./datasets/Ayoub-Laachir/Darija_Dataset)

## Description

This model is a fine-tuned version of OpenAI's Whisper Large V3, specifically adapted for recognizing and transcribing Moroccan Darija, an Arabic dialect influenced by Berber, French, and Spanish. The project aims to improve technological accessibility for millions of Moroccans and to serve as a blueprint for similar work on underrepresented languages.

## Technologies Used

- **Whisper Large V3:** OpenAI's state-of-the-art speech recognition model
- **PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation):** an efficient fine-tuning technique that trains a small set of adapter weights instead of the full model
- **Google Colab:** cloud environment used for training the model
- **Hugging Face:** hosting for the dataset and the final model

## Dataset Preparation

The dataset preparation involved several steps:

1. **Cleaning:** correcting bad transcriptions and standardizing word spellings.
2. **Audio Processing:** converting all samples to a 16 kHz sample rate (see the resampling sketch below).
3. **Dataset Split:** creating a training set of 3,312 samples and a test set of 150 samples.
4. **Format Conversion:** transforming the dataset into the Parquet file format.
5. **Uploading:** pushing the prepared dataset to the Hugging Face Hub.

## Training Process

The model was fine-tuned using the following parameters (see the configuration sketch below):

- **Per device train batch size:** 8
- **Gradient accumulation steps:** 1
- **Learning rate:** 1e-4 (0.0001)
- **Warmup steps:** 200
- **Number of train epochs:** 2
- **Logging and evaluation:** every 50 steps
- **Weight decay:** 0.01

Training progress showed a steady decrease in both training and validation loss over 8000 steps.

## Testing and Evaluation

The model was evaluated on the test set using:

- **Word Error Rate (WER):** 3.1467%
- **Character Error Rate (CER):** 2.3893%

These metrics demonstrate the model's ability to accurately transcribe Moroccan Darija speech. The fine-tuned model shows improved handling of Darija-specific words, sentence structure, and overall accuracy.
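For illustration, the 16 kHz resampling step described in the Dataset Preparation section can be done directly with the 🤗 `datasets` library. This is a minimal sketch, not the exact preprocessing code used for this card; the `"audio"` column name and the target repository name are assumptions.

```python
from datasets import load_dataset, Audio

# Load the Darija dataset from the Hugging Face Hub
dataset = load_dataset("Ayoub-Laachir/Darija_Dataset")

# Resample every sample to 16 kHz on the fly
# (assumes the audio column is named "audio")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# Push the prepared dataset back to the Hub (repository name is illustrative)
# dataset.push_to_hub("your-username/Darija_Dataset_16khz")
```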
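The exact LoRA configuration is not listed on this card. The sketch below shows how the training setup described in the Training Process section could be assembled with `peft` and `transformers`; the LoRA hyperparameters (`r`, `lora_alpha`, `target_modules`) and the `fp16` flag are illustrative assumptions, while the training arguments mirror the values listed above.

```python
from transformers import (
    AutoModelForSpeechSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)
from peft import LoraConfig, get_peft_model

# Wrap the base Whisper model with LoRA adapters
base_model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3")
lora_config = LoraConfig(
    r=32,                                 # illustrative rank, not a published value
    lora_alpha=64,                        # illustrative scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections commonly adapted for Whisper
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

# Training arguments matching the hyperparameters listed above
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-darija-lora",  # illustrative output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-4,
    warmup_steps=200,
    num_train_epochs=2,
    logging_steps=50,
    eval_strategy="steps",   # "evaluation_strategy" in older transformers versions
    eval_steps=50,
    weight_decay=0.01,
    fp16=True,               # assumed mixed-precision training on GPU
    remove_unused_columns=False,  # keep forward kwargs intact for the PEFT model
)

# trainer = Seq2SeqTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=...,  # prepared training split (log-Mel features + label ids)
#     eval_dataset=...,   # prepared test split
#     data_collator=...,  # padding collator for speech seq2seq
# )
# trainer.train()
```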
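WER and CER figures of the kind reported above can be computed with the 🤗 `evaluate` library (which needs `pip install evaluate jiwer`). This is a minimal sketch, not the exact evaluation script used for this model; the prediction and reference lists are placeholders.

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# predictions: transcriptions produced by the fine-tuned model on the test set
# references:  ground-truth transcriptions from the test set
predictions = ["..."]  # illustrative placeholder
references = ["..."]   # illustrative placeholder

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
cer = 100 * cer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.4f}% | CER: {cer:.4f}%")
```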
## Audio Transcription Script with PEFT Layers

This script demonstrates how to transcribe audio files using the fine-tuned Whisper Large V3 model for Moroccan Darija, loading the PEFT (Parameter-Efficient Fine-Tuning) LoRA layers on top of the base model.

### Required Libraries

Before running the script, ensure you have the following libraries installed. You can install them using:

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate librosa soundfile pydub
pip install peft==0.3.0  # Install PEFT library
```

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import librosa
import soundfile as sf
from pydub import AudioSegment
from peft import PeftModel, PeftConfig  # Import PEFT classes

# Set the device to GPU if available, else use CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Configuration for the base Whisper model
base_model_name = "openai/whisper-large-v3"  # Base Whisper model
processor = AutoProcessor.from_pretrained(base_model_name)  # Load the processor (tokenizer + feature extractor)

# Load the fine-tuned model configuration
model_name = "Ayoub-Laachir/MaghrebVoice_OnlyLoRaLayers"  # Fine-tuned LoRA layers
peft_config = PeftConfig.from_pretrained(model_name)  # Load the PEFT configuration

# Load the base model in the same dtype used by the pipeline
base_model = AutoModelForSpeechSeq2Seq.from_pretrained(base_model_name, torch_dtype=torch_dtype).to(device)

# Load the PEFT model (base model + LoRA adapters)
model = PeftModel.from_pretrained(base_model, model_name).to(device)

# Merge the LoRA weights into the base model
model = model.merge_and_unload()

# Configuration for transcription
config = {
    "language": "arabic",   # Language for transcription
    "task": "transcribe",   # Task type
    "chunk_length_s": 30,   # Length of each audio chunk in seconds
    "stride_length_s": 5,   # Overlap between chunks in seconds
}

# Initialize the automatic speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,  # Use the merged model
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=config["chunk_length_s"],
    stride_length_s=config["stride_length_s"],
    generate_kwargs={"language": config["language"], "task": config["task"]},  # Force Arabic transcription
)

# Convert audio to a 16 kHz sampling rate
def convert_audio_to_16khz(input_path, output_path):
    audio, sr = librosa.load(input_path, sr=None)  # Load the audio file at its original rate
    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)  # Resample to 16 kHz
    sf.write(output_path, audio_16k, 16000)  # Save the converted audio

# Format time as HH:MM:SS.milliseconds
def format_time(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = seconds % 60
    return f"{hours:02d}:{minutes:02d}:{seconds:06.3f}"

# Transcribe an audio file
def transcribe_audio(audio_path):
    try:
        result = pipe(audio_path, return_timestamps=True)  # Transcribe audio and get timestamps
        return result["chunks"]  # Return transcription chunks
    except Exception as e:
        print(f"Error transcribing audio: {e}")
        return None

# Main function to execute the transcription process
def main():
    # Specify input and output audio paths (update paths as needed)
    input_audio_path = "/path/to/your/input/audio.mp3"  # Replace with your input audio path
    output_audio_path = "/path/to/your/output/audio_16khz.wav"  # Replace with your output audio path

    # Convert the audio to 16 kHz
    convert_audio_to_16khz(input_audio_path, output_audio_path)

    # Transcribe the converted audio
    transcription_chunks = transcribe_audio(output_audio_path)

    if transcription_chunks:
        print("WEBVTT\n")  # Print header for WEBVTT format
        for chunk in transcription_chunks:
            start_time = format_time(chunk["timestamp"][0])  # Format start time
            end_time = format_time(chunk["timestamp"][1])  # Format end time
            text = chunk["text"]  # Get the transcribed text
print(f"{start_time} --> {end_time}") # Print time range print(f"{text}\n") # Print transcribed text else: print("Transcription failed.") if __name__ == "__main__": main() ``` ## Challenges and Future Improvements ### Challenges Encountered - Diverse spellings of words in Moroccan Darija - Cleaning and standardizing the dataset ### Future Improvements - Expand the dataset to include more Darija accents and expressions - Further fine-tune the model for specific Moroccan regional dialects - Explore integration into practical applications like voice assistants and transcription services ## Conclusion This project marks a significant step towards making AI more accessible for Moroccan Arabic speakers. The success of this fine-tuned model highlights the potential for adapting advanced AI technologies to underrepresented languages, serving as a model for similar initiatives in North Africa.