---
language:
- en
- hi
tags:
- audio
- automatic-speech-recognition
- whisper-event
- pytorch
inference: true
model-index:
- name: Whisper-Hindi2Hinglish-Swift
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: google/fleurs
      type: google/fleurs
      config: hi_in
      split: test
    metrics:
    - type: wer
      value: 35.0888
      name: WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: mozilla-foundation/common_voice_20_0
      type: mozilla-foundation/common_voice_20_0
      config: hi
      split: test
    metrics:
    - type: wer
      value: 38.6549
      name: WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Indic-Voices
      type: Indic-Voices
      config: hi
      split: test
    metrics:
    - type: wer
      value: 65.2147
      name: WER
widget:
- src: audios/f89b6428-c58a-4355-ad63-0752b69f2d30.wav
  output:
    text: vah bas din mein kitni baar chalti hai?
- src: audios/09cf2547-9d09-4914-926a-cf2043549c15.wav
  output:
    text: >-
      Salmaan ki image se prabhaavit hote hain is company ke share bhaav jaane
      kaise?
- src: audios/6f7df89f-91a7-4cbd-be43-af7bce71a34b.wav
  output:
    text: vah roya aur aur roya.
- src: audios/969bede5-d816-461b-9bf2-bd115e098439.wav
  output:
    text: helmet na pahnne se bhaarat mein har gante hoti hai chaar logon ki maut.
- src: audios/cef43941-72c9-4d28-88dd-cb62808dc056.wav
  output:
    text: usne mujhe chithi ka javaab na dene ke lie daanta.
- src: audios/b27d49fe-fced-4a17-9887-7bfbc5d4a899.wav
  output:
    text: puraana shahar divaaron se ghera hua hai.
- src: audios/common_voice_hi_23796065.mp3
  example_title: Speech Example 1
- src: audios/common_voice_hi_41666099.mp3
  example_title: Speech Example 2
- src: audios/common_voice_hi_41429198.mp3
  example_title: Speech Example 3
- src: audios/common_voice_hi_41429259.mp3
  example_title: Speech Example 4
- src: audios/common_voice_hi_40904697.mp3
  example_title: Speech Example 5
pipeline_tag: automatic-speech-recognition
license: apache-2.0
metrics:
- wer
base_model:
- openai/whisper-base
library_name: transformers
---

## Whisper-Hindi2Hinglish-Swift:

### Table of Contents:

- [Key Features](#key-features)
- [Training](#training)
    - [Data](#data)
    - [Finetuning](#finetuning)
- [Performance Overview](#performance-overview)
    - [Qualitative Performance Overview](#qualitative-performance-overview)
    - [Quantitative Performance Overview](#quantitative-performance-overview)
- [Usage](#usage)
- [Miscellaneous](#miscellaneous)

### Key Features:

1. **Hinglish as a language**: Adds the ability to transcribe audio into spoken Hinglish, reducing the chance of grammatical errors.
2. **Whisper Architecture**: Built on the Whisper architecture, so it is easy to use with the `transformers` package.
3. **Hallucination Mitigation**: Minimizes transcription hallucinations to enhance accuracy.
4. **Performance Increase**: ~57% average performance improvement over the pretrained model across benchmarking datasets.

### Training:

#### Data:

- **Duration**: A total of ~550 hours of noisy Indian-accented Hindi data was used to finetune the model.
- **Collection**: Because no ASR-ready Hinglish datasets were available, a specially curated proprietary dataset was used.
- **Labelling**: The data was labelled using a SOTA model, and the transcriptions were then improved through human review.
- **Quality**: Emphasis was placed on collecting noisy data, as the intended use case of the model is Indian environments where background noise is abundant.
- **Processing**: All audio was chunked into segments shorter than 30 s, with at most 2 speakers per clip. No further processing was applied, so as not to alter the quality of the source data. A minimal sketch of this chunking step follows the list.
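The actual preprocessing tooling is proprietary and not released; the snippet below is only an illustrative sketch of the <30 s chunking described above, assuming the `soundfile` package and a hypothetical input file (speaker-count filtering is out of scope here):

```python
# Illustrative sketch only: the real preprocessing pipeline is proprietary.
# Splits a long recording into consecutive segments of at most `max_seconds`.
import soundfile as sf

def chunk_audio(path, max_seconds=30):
    """Split an audio file into consecutive chunks of at most `max_seconds`."""
    audio, sr = sf.read(path)                  # audio: numpy array, sr: sample rate in Hz
    samples_per_chunk = int(max_seconds * sr)  # number of samples per segment
    chunks = [
        audio[start:start + samples_per_chunk]
        for start in range(0, len(audio), samples_per_chunk)
    ]
    return chunks, sr

chunks, sr = chunk_audio("long_recording.wav")  # hypothetical input file
for i, chunk in enumerate(chunks):
    sf.write(f"chunk_{i:04d}.wav", chunk, sr)   # write each segment to disk
```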
#### Finetuning:

- **Novel Trainer Architecture**: A custom trainer was written to ensure efficient supervised finetuning, with custom callbacks for higher observability during the training process.
- **Custom Dynamic Layer Freezing**: The most active layers in the model were identified by running inference on a subset of the training data with the pretrained model. These layers were kept unfrozen during training while all other layers were frozen, enabling faster convergence and more efficient finetuning (see the sketch after this list).
- **DeepSpeed Integration**: DeepSpeed was used to speed up and optimize the training process.
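The custom trainer itself is not released. The following is a minimal sketch of the freezing step only, under the assumption that a list of "most active" parameter-name prefixes (the `ACTIVE_LAYER_PREFIXES` values below are hypothetical) has already been identified:

```python
# Illustrative sketch of the dynamic layer-freezing idea described above.
# The actual trainer and layer-selection procedure are custom and unreleased;
# ACTIVE_LAYER_PREFIXES holds hypothetical parameter-name prefixes assumed to
# have been identified as "most active" via inference on a data subset.
from transformers import AutoModelForSpeechSeq2Seq

model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-base")

ACTIVE_LAYER_PREFIXES = [
    "model.encoder.layers.4.",  # hypothetical: most active encoder layer
    "model.decoder.layers.2.",  # hypothetical: most active decoder layer
]

for name, param in model.named_parameters():
    # Keep only the identified layers trainable; freeze everything else
    param.requires_grad = any(name.startswith(p) for p in ACTIVE_LAYER_PREFIXES)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```

A standard optimizer will then only update the unfrozen layers, which is what makes convergence faster and finetuning cheaper.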
### Performance Overview

#### Qualitative Performance Overview

The examples below (audio files from this repository) show how the pretrained model drifts into Urdu script with frequent hallucinations, while Whisper-Hindi2Hinglish-Swift produces romanized Hinglish:

| Audio | Whisper Base | Whisper-Hindi2Hinglish-Swift |
|-------|--------------|------------------------------|
| `audios/f89b6428-c58a-4355-ad63-0752b69f2d30.wav` | وہاں بس دن میں کتنی بار چلتی ہے | vah bas din mein kitni baar chalti hai? |
| `audios/09cf2547-9d09-4914-926a-cf2043549c15.wav` | سلمان کی ایمیت سے پراوہویت ہوتے ہیں اس کمپنی کے سیر بھاؤ جانے کیسے | salmaan ki image se prabhaavit hote hain is company ke share bhaav jaane kaise? |
| `audios/6f7df89f-91a7-4cbd-be43-af7bce71a34b.wav` | تو لویا تو لویا | vah roya aur aur roya. |
| `audios/969bede5-d816-461b-9bf2-bd115e098439.wav` | حلمت نہ پیننے سے بھارت میں ہر گنٹے ہوتی ہے چار لوگوں کی موت | helmet na pahnne se bhaarat mein har gante hoti hai chaar logon ki maut. |
| `audios/cef43941-72c9-4d28-88dd-cb62808dc056.wav` | اوستہ مجھے چٹھیکہ جواب نہ دینے کے لیٹانٹہ | usne mujhe chithi ka javaab na dene ke lie daanta. |
| `audios/b27d49fe-fced-4a17-9887-7bfbc5d4a899.wav` | پرانا شاہ دیواروں سے گیرا ہوا ہے | puraana shahar divaaron se ghera hua hai. |

#### Quantitative Performance Overview

***Note***:

- *The WER scores below are for Hinglish text generated by our model and the original Whisper model.*
- *To check our model's real-world performance against other SOTA models, please head to our [Speech-To-Text Arena](https://huggingface.co./spaces/Oriserve/ASR_arena) space.*

| Dataset | Whisper Base (WER) | Whisper-Hindi2Hinglish-Swift (WER) |
|---------|--------------------|------------------------------------|
| [Common-Voice](https://commonvoice.mozilla.org/en) | 106.7936 | 38.6549 |
| [FLEURS](https://huggingface.co./datasets/google/fleurs) | 104.2783 | 35.0888 |
| [Indic-Voices](https://ai4bharat.iitm.ac.in/datasets/indicvoices) | 110.8399 | 65.2147 |

### Usage:

#### Using Transformers

- To run the model, first install the Transformers library:

```bash
pip install --upgrade transformers
```

- The model can be used with the [`pipeline`](https://huggingface.co./docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class to transcribe audios of arbitrary length:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Set device (GPU if available, otherwise CPU) and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Specify the pre-trained model ID
model_id = "Oriserve/Whisper-Hindi2Hinglish-Swift"

# Load the speech-to-text model with the specified configuration
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,  # Use appropriate precision (float16 on GPU, float32 on CPU)
    low_cpu_mem_usage=True,   # Optimize memory usage during loading
    use_safetensors=True      # Use the safetensors format for better security
)
model.to(device)  # Move the model to the selected device

# Load the processor for audio preprocessing and tokenization
processor = AutoProcessor.from_pretrained(model_id)

# Create the speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "task": "transcribe",  # Set the task to transcription
        "language": "en"       # The model emits romanized Hinglish via the English token set
    }
)

# Process an audio file and print the transcription
sample = "sample.wav"  # Input audio file path
result = pipe(sample)  # Run inference
print(result["text"])  # Print the transcribed text
```

#### Using the OpenAI Whisper module

- First, install the `openai-whisper` library:

```bash
pip install -U openai-whisper tqdm
```

- Convert the Hugging Face checkpoint to a PyTorch model:

```python
import json
import re
from collections import OrderedDict

import torch
from tqdm import tqdm
from transformers import AutoModelForSpeechSeq2Seq

# Load the parameter-name mapping from HF to OpenAI format
with open('convert_hf2openai.json', 'r') as f:
    reverse_translation = json.load(f)

reverse_translation = OrderedDict(reverse_translation)

def save_model(model, save_path):
    def reverse_translate(current_param):
        # Convert parameter names using regex patterns
        for pattern, repl in reverse_translation.items():
            if re.match(pattern, current_param):
                return re.sub(pattern, repl, current_param)

    # Extract model dimensions from the config
    config = model.config
    model_dims = {
        "n_mels": config.num_mel_bins,                   # Number of mel spectrogram bins
        "n_vocab": config.vocab_size,                    # Vocabulary size
        "n_audio_ctx": config.max_source_positions,      # Max audio context length
        "n_audio_state": config.d_model,                 # Audio encoder state dimension
        "n_audio_head": config.encoder_attention_heads,  # Audio encoder attention heads
        "n_audio_layer": config.encoder_layers,          # Number of audio encoder layers
        "n_text_ctx": config.max_target_positions,       # Max text context length
        "n_text_state": config.d_model,                  # Text decoder state dimension
        "n_text_head": config.decoder_attention_heads,   # Text decoder attention heads
        "n_text_layer": config.decoder_layers,           # Number of text decoder layers
    }

    # Convert the model state dict to Whisper format
    original_model_state_dict = model.state_dict()
    new_state_dict = {}
    for key, value in tqdm(original_model_state_dict.items()):
        key = key.replace("model.", "")   # Remove the 'model.' prefix
        new_key = reverse_translate(key)  # Convert the parameter name
        if new_key is not None:
            new_state_dict[new_key] = value

    # Create the final model dictionary
    pytorch_model = {"dims": model_dims, "model_state_dict": new_state_dict}

    # Save the converted model
    torch.save(pytorch_model, save_path)

# Load the Hugging Face model
model_id = "Oriserve/Whisper-Hindi2Hinglish-Swift"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,  # Optimize memory usage
    use_safetensors=True     # Use the safetensors format
)

# Convert and save the model
model_save_path = "Whisper-Hindi2Hinglish-Swift.pt"
save_model(model, model_save_path)
```

- Transcribe:

```python
import whisper

# Load the converted model with Whisper and transcribe
model = whisper.load_model("Whisper-Hindi2Hinglish-Swift.pt")
result = model.transcribe("sample.wav")
print(result["text"])
```
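To sanity-check transcription quality against your own labelled audio, a WER score like those in the table above can be computed with the open-source `jiwer` package. The snippet below is a minimal sketch, not the setup used to produce the reported benchmark numbers; it assumes the `pipe` object from the Transformers example above and uses a placeholder reference transcript:

```python
# Minimal sketch: computing word error rate (WER) with the `jiwer` package.
# Assumes `pipe` from the Transformers example above; the reference text and
# audio path are placeholders, and the benchmark numbers in this card were
# not necessarily produced with this exact setup.
from jiwer import wer

reference = "vah bas din mein kitni baar chalti hai?"  # ground-truth Hinglish transcript
hypothesis = pipe("sample.wav")["text"]                # model transcription

print(f"WER: {wer(reference, hypothesis):.4f}")        # lower is better
```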
### Miscellaneous

This model is from a family of transformers-based ASR models trained by Oriserve. To compare this model against other models from the same family, or against other SOTA models, please head to our [Speech-To-Text Arena](https://huggingface.co./spaces/Oriserve/ASR_arena). To learn more about our other models, or for any other queries regarding AI voice agents, reach out to us at [ai-team@oriserve.com](mailto:ai-team@oriserve.com).