---
language:
- en
- hi
tags:
- audio
- automatic-speech-recognition
- whisper-event
- pytorch
inference: true
model-index:
- name: Whisper-Hindi2Hinglish-Swift
results:
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: google/fleurs
type: google/fleurs
config: hi_in
split: test
metrics:
- type: wer
value: 35.0888
name: WER
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: mozilla-foundation/common_voice_20_0
type: mozilla-foundation/common_voice_20_0
config: hi
split: test
metrics:
- type: wer
value: 38.6549
name: WER
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: Indic-Voices
type: Indic-Voices
config: hi
split: test
metrics:
- type: wer
value: 65.2147
name: WER
widget:
- src: audios/f89b6428-c58a-4355-ad63-0752b69f2d30.wav
output:
text: vah bas din mein kitni baar chalti hai?
- src: audios/09cf2547-9d09-4914-926a-cf2043549c15.wav
output:
text: >-
Salmaan ki image se prabhaavit hote hain is company ke share bhaav jaane
kaise?
- src: audios/6f7df89f-91a7-4cbd-be43-af7bce71a34b.wav
output:
text: vah roya aur aur roya.
- src: audios/969bede5-d816-461b-9bf2-bd115e098439.wav
output:
text: helmet na pahnne se bhaarat mein har gante hoti hai chaar logon ki maut.
- src: audios/cef43941-72c9-4d28-88dd-cb62808dc056.wav
output:
text: usne mujhe chithi ka javaab na dene ke lie daanta.
- src: audios/b27d49fe-fced-4a17-9887-7bfbc5d4a899.wav
output:
text: puraana shahar divaaron se ghera hua hai.
- src: audios/common_voice_hi_23796065.mp3
example_title: Speech Example 1
- src: audios/common_voice_hi_41666099.mp3
example_title: Speech Example 2
- src: audios/common_voice_hi_41429198.mp3
example_title: Speech Example 3
- src: audios/common_voice_hi_41429259.mp3
example_title: Speech Example 4
- src: audios/common_voice_hi_40904697.mp3
example_title: Speech Example 5
pipeline_tag: automatic-speech-recognition
license: apache-2.0
metrics:
- wer
base_model:
- openai/whisper-base
library_name: transformers
---
## Whisper-Hindi2Hinglish-Swift:
### Table of Contents:
- [Key Features](#key-features)
- [Training](#training)
- [Data](#data)
- [Finetuning](#finetuning)
- [Usage](#usage)
- [Performance Overview](#performance-overview)
- [Qualitative Performance Overview](#qualitative-performance-overview)
- [Quantitative Performance Overview](#quantitative-performance-overview)
- [Miscellaneous](#miscellaneous)
### Key Features:
1. **Hinglish as a language**: Adds the ability to transcribe audio into spoken Hinglish, reducing the chance of grammatical errors.
2. **Whisper Architecture**: Based on the Whisper architecture, making it easy to use with the transformers package.
3. **Hallucination Mitigation**: Minimizes transcription hallucinations to enhance accuracy.
4. **Performance Increase**: ~57% average performance improvement over the pretrained model across the benchmarking datasets.
### Training:
#### Data:
- **Duration**: A total of ~550 hours of noisy Indian-accented Hindi data was used to finetune the model.
- **Collection**: Due to the lack of ASR-ready Hinglish datasets, a specially curated proprietary dataset was used.
- **Labelling**: The data was labelled using a SOTA model, and the transcriptions were then improved through human intervention.
- **Quality**: Emphasis was placed on collecting noisy data, as the intended use case of the model is Indian environments where background noise is abundant.
- **Processing**: All audios were chunked into segments shorter than 30 s, each containing at most 2 speakers (see the sketch after this list). No further processing was done, so as not to alter the quality of the source data.
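A minimal sketch of the chunking step described above, assuming the `pydub` package is available; it covers only fixed-length splitting, not the speaker-count filter, and the file name is hypothetical:
```python
from pydub import AudioSegment

MAX_CHUNK_MS = 30_000  # keep every chunk under 30 seconds

# Load a source recording (hypothetical file name) and split it into
# fixed-length chunks; each chunk is exported as its own wav file
audio = AudioSegment.from_file("raw_recording.wav")
for idx, start in enumerate(range(0, len(audio), MAX_CHUNK_MS)):
    chunk = audio[start:start + MAX_CHUNK_MS]
    chunk.export(f"raw_recording_chunk{idx}.wav", format="wav")
```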
#### Finetuning:
- **Novel Trainer Architecture**: A custom trainer was written to ensure efficient supervised finetuning, with custom callbacks to enable higher observability during the training process.
- **Custom Dynamic Layer Freezing**: The most active layers in the model were identified by running inference on a subset of the training data with the pretrained model. These layers were then kept unfrozen during training while all other layers were frozen, enabling faster convergence and more efficient finetuning (see the sketch after this list).
- **Deepspeed Integration**: Deepspeed was also utilized to speed up and optimize the training process.
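The layer-freezing step could look roughly like the sketch below; the layer prefixes listed are hypothetical placeholders, not the layers actually selected during training:
```python
from transformers import AutoModelForSpeechSeq2Seq

model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-base")

# Hypothetical set of "most active" layers found via inference probing
ACTIVE_LAYER_PREFIXES = ("model.decoder.layers.4.", "model.decoder.layers.5.")

# Train only the identified layers; freeze everything else
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(ACTIVE_LAYER_PREFIXES)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```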
### Performance Overview
#### Qualitative Performance Overview
| Audio | Whisper Base | Whisper-Hindi2Hinglish-Swift |
|-------|--------------|------------------------------|
| <audio controls><source src="https://huggingface.co./Oriserve/Whisper-Hindi2Hinglish-Swift/resolve/main/audios/f89b6428-c58a-4355-ad63-0752b69f2d30.wav" type="audio/wav"></audio> | وہاں بس دن میں کتنی بار چلتی ہے | vah bas din mein kitni baar chalti hai? |
| <audio controls><source src="https://huggingface.co./Oriserve/Whisper-Hindi2Hinglish-Swift/resolve/main/audios/09cf2547-9d09-4914-926a-cf2043549c15.wav" type="audio/wav"></audio> | سلمان کی ایمیت سے پراوہویت ہوتے ہیں اس کمپنی کے سیر بھاؤ جانے کیسے | salmaan ki image se prabhaavit hote hain is company ke share bhaav jaane kaise? |
| <audio controls><source src="https://huggingface.co./Oriserve/Whisper-Hindi2Hinglish-Swift/resolve/main/audios/6f7df89f-91a7-4cbd-be43-af7bce71a34b.wav" type="audio/wav"></audio> | تو لویا تو لویا | vah roya aur aur roya. |
| <audio controls><source src="https://huggingface.co./Oriserve/Whisper-Hindi2Hinglish-Swift/resolve/main/audios/969bede5-d816-461b-9bf2-bd115e098439.wav" type="audio/wav"></audio> | حلمت نہ پیننے سے بھارت میں ہر گنٹے ہوتی ہے چار لوگوں کی موت | helmet na pahnne se bhaarat mein har gante hoti hai chaar logon ki maut. |
| <audio controls><source src="https://huggingface.co./Oriserve/Whisper-Hindi2Hinglish-Swift/resolve/main/audios/cef43941-72c9-4d28-88dd-cb62808dc056.wav" type="audio/wav"></audio> | اوستہ مجھے چٹھیکہ جواب نہ دینے کے لیٹانٹہ | usne mujhe chithi ka javaab na dene ke lie daanta. |
| <audio controls><source src="https://huggingface.co./Oriserve/Whisper-Hindi2Hinglish-Swift/resolve/main/audios/b27d49fe-fced-4a17-9887-7bfbc5d4a899.wav" type="audio/wav"></audio> | پرانا شاہ دیواروں سے گیرا ہوا ہے | puraana shahar divaaron se ghera hua hai. |
#### Quantitative Performance Overview
***Note***:
- *The WER scores below are for Hinglish text generated by our model and the original Whisper model.*
- *To check our model's real-world performance against other SOTA models, please head to our [Speech-To-Text Arena](https://huggingface.co./spaces/Oriserve/ASR_arena) space.*
| Dataset | Whisper Base (WER) | Whisper-Hindi2Hinglish-Swift (WER) |
|---------|--------------------|------------------------------------|
| [Common-Voice](https://commonvoice.mozilla.org/en) | 106.7936 | 38.6549 |
| [FLEURS](https://huggingface.co./datasets/google/fleurs) | 104.2783 | 35.0888 |
| [Indic-Voices](https://ai4bharat.iitm.ac.in/datasets/indicvoices)| 110.8399 | 65.2147 |
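For reference, WER scores like those above can be computed with the `jiwer` package (`pip install jiwer`); the reference/hypothesis strings below are illustrative only:
```python
from jiwer import wer

# One reference transcription and one model hypothesis (illustrative)
references = ["vah bas din mein kitni baar chalti hai?"]
hypotheses = ["vah bas din mein kitni bar chalti hai"]

# jiwer returns a ratio; multiply by 100 to match the table above
print(f"WER: {wer(references, hypotheses) * 100:.4f}")
```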
### Usage:
#### Using Transformers
- To run the model, first install the Transformers library:
```bash
pip install --upgrade transformers
```
- The model can be used with the [`pipeline`](https://huggingface.co./docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe audios of arbitrary length:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
# Set device (GPU if available, otherwise CPU) and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
# Specify the pre-trained model ID
model_id = "Oriserve/Whisper-Hindi2Hinglish-Swift"
# Load the speech-to-text model with specified configurations
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id,
torch_dtype=torch_dtype, # Use appropriate precision (float16 for GPU, float32 for CPU)
low_cpu_mem_usage=True, # Optimize memory usage during loading
use_safetensors=True # Use safetensors format for better security
)
model.to(device) # Move model to specified device
# Load the processor for audio preprocessing and tokenization
processor = AutoProcessor.from_pretrained(model_id)
# Create speech recognition pipeline
pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
torch_dtype=torch_dtype,
device=device,
generate_kwargs={
"task": "transcribe", # Set task to transcription
"language": "en" # Specify English language
}
)
# Process audio file and print transcription
sample = "sample.wav" # Input audio file path
result = pipe(sample) # Run inference
print(result["text"]) # Print transcribed text
```
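For very long recordings, the pipeline also supports chunked inference via its standard `chunk_length_s` argument (e.g. `chunk_length_s=30`); this is a generic `transformers` pipeline option rather than something specific to this model.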
#### Using the OpenAI Whisper module
- First, install the openai-whisper library:
```bash
pip install -U openai-whisper tqdm
```
- Convert the Hugging Face checkpoint to a PyTorch checkpoint in OpenAI Whisper format:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq
import re
from tqdm import tqdm
from collections import OrderedDict
import json
# Load parameter name mapping from HF to OpenAI format
with open('convert_hf2openai.json', 'r') as f:
reverse_translation = json.load(f)
reverse_translation = OrderedDict(reverse_translation)
def save_model(model, save_path):
def reverse_translate(current_param):
# Convert parameter names using regex patterns
for pattern, repl in reverse_translation.items():
if re.match(pattern, current_param):
return re.sub(pattern, repl, current_param)
# Extract model dimensions from config
config = model.config
model_dims = {
"n_mels": config.num_mel_bins, # Number of mel spectrogram bins
"n_vocab": config.vocab_size, # Vocabulary size
"n_audio_ctx": config.max_source_positions, # Max audio context length
"n_audio_state": config.d_model, # Audio encoder state dimension
"n_audio_head": config.encoder_attention_heads, # Audio encoder attention heads
"n_audio_layer": config.encoder_layers, # Number of audio encoder layers
"n_text_ctx": config.max_target_positions, # Max text context length
"n_text_state": config.d_model, # Text decoder state dimension
"n_text_head": config.decoder_attention_heads, # Text decoder attention heads
"n_text_layer": config.decoder_layers, # Number of text decoder layers
}
# Convert model state dict to Whisper format
original_model_state_dict = model.state_dict()
new_state_dict = {}
for key, value in tqdm(original_model_state_dict.items()):
key = key.replace("model.", "") # Remove 'model.' prefix
new_key = reverse_translate(key) # Convert parameter names
if new_key is not None:
new_state_dict[new_key] = value
# Create final model dictionary
pytorch_model = {"dims": model_dims, "model_state_dict": new_state_dict}
# Save converted model
torch.save(pytorch_model, save_path)
# Load Hugging Face model
model_id = "Oriserve/Whisper-Hindi2Hinglish-Swift"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id,
low_cpu_mem_usage=True, # Optimize memory usage
use_safetensors=True # Use safetensors format
)
# Convert and save model
model_save_path = "Whisper-Hindi2Hinglish-Swift.pt"
save_model(model, model_save_path)
```
- Transcribe:
```python
import whisper
# Load converted model with Whisper and transcribe
model = whisper.load_model("Whisper-Hindi2Hinglish-Swift.pt")
result = model.transcribe("sample.wav")
print(result["text"])
```
### Miscellaneous
This model is from a family of transformers-based ASR models trained by Oriserve. To compare this model against other models from the same family, or against other SOTA models, please head to our [Speech-To-Text Arena](https://huggingface.co./spaces/Oriserve/ASR_arena). To learn more about our other models, or for any other queries regarding AI voice agents, reach out to us at [[email protected]](mailto:[email protected]).