---
library_name: transformers
base_model: jq/whisper-large-v3-salt-plus-xog-myx-kin-swa
tags:
- generated_from_trainer
datasets:
- Sunbird/salt
language:
- lg
- en
- nyn
- ach
- teo
- lgg
model-index:
- name: whisper-large-v3-salt-plus-xog-myx-kin-swa-continued
  results: []
---

# Whisper large for Ugandan languages

This model is an adaptation of whisper-large-v3 for the following languages widely spoken in Uganda:
Luganda, Acholi, Lugbara, Ateso, Runyankole, Rutooro, Lumasaba, Swahili, Lusoga, Kinyarwanda and English (Ugandan accent).

## Training

The model was trained on the SALT dataset, Common Voice (Luganda, Swahili, Kinyarwanda),
and the Google FLEURS and Makerere Yogera datasets.
To help the model generalise to practical settings, the training audio was augmented with
random noise and randomly downsampled to 8 kHz to simulate phone speech.
Street noise sampled from urban locations in Uganda was also added to improve robustness.

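The augmentation described above can be illustrated with a minimal sketch. This is not the training pipeline used for this model, which is not specified in this card. The function names, the probabilities `p_noise` and `p_phone`, and the SNR range are hypothetical. White Gaussian noise stands in for the recorded street noise, and the 8 kHz round trip is approximated by removing all content above 4 kHz while keeping the original sample rate.

```python
import numpy as np


def add_noise(audio, snr_db, rng):
    """Mix white Gaussian noise into `audio` at the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(audio ** 2) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise


def simulate_phone_bandwidth(audio, sample_rate=16000, phone_rate=8000):
    """Remove content above phone_rate / 2, as a round trip through 8 kHz sampling would."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
    spectrum[freqs > phone_rate / 2] = 0.0
    return np.fft.irfft(spectrum, n=len(audio))


def augment(audio, rng, p_noise=0.5, p_phone=0.5, snr_range=(5.0, 30.0)):
    """Randomly apply noise and phone-bandwidth simulation (hypothetical settings)."""
    out = audio.astype(np.float64)
    if rng.random() < p_noise:
        out = add_noise(out, rng.uniform(*snr_range), rng)
    if rng.random() < p_phone:
        out = simulate_phone_bandwidth(out)
    return out.astype(np.float32)


# Example: augment one second of a synthetic 440 Hz tone sampled at 16 kHz.
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0).astype(np.float32)
augmented = augment(tone, rng)
```

Applying each transform with some probability, rather than always, keeps clean full-bandwidth examples in the training mix alongside the degraded ones.
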
## Usage

The model is used in much the same way as the base Whisper model.
By default it attempts to auto-detect the language and then provides a transcription.
Language detection is not always accurate, however, so results may improve if the
language is specified explicitly. Most of the languages in this model are not supported
by the base Whisper model, so the language is selected by token ID rather than by name:

```python
import transformers
import datasets
import torch

processor = transformers.WhisperProcessor.from_pretrained(
    "Sunbird/asr-whisper-large-v3-salt")
model = transformers.WhisperForConditionalGeneration.from_pretrained(
    "Sunbird/asr-whisper-large-v3-salt")

SALT_LANGUAGE_TOKENS_WHISPER = {
    'eng': 50259,  # English (Ugandan)
    'swa': 50318,  # Swahili
    'ach': 50357,  # Acholi
    'lgg': 50356,  # Lugbara
    'lug': 50355,  # Luganda
    'nyn': 50354,  # Runyankole
    'teo': 50353,  # Ateso
    'xog': 50352,  # Lusoga
    'ttj': 50351,  # Rutooro
    'kin': 50350,  # Kinyarwanda
    'myx': 50349,  # Lumasaba
}

# Get some test audio
ds = datasets.load_dataset('Sunbird/salt', 'multispeaker-lug', split='test')
audio = ds[0]['audio']
sample_rate = ds[0]['sample_rate']

# Specify a language from one of the above.
lang = 'lug'

# Apply the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input_features = processor(
    audio, sampling_rate=sample_rate, return_tensors="pt").input_features
input_features = input_features.to(device)
predicted_ids = model.to(device).generate(
    input_features,
    # Optionally set language=None here instead to auto-detect.
    language=processor.tokenizer.decode(SALT_LANGUAGE_TOKENS_WHISPER[lang]),
    forced_decoder_ids=None)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print(transcription)
# Ekikoola kya kasooli kya kyenvu wabula langi yaakyo etera okuba eya kitaka wansi.
```

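The language choice in the example above can be wrapped in a small helper, so that an unsupported code fails early with a clear message instead of a `KeyError`. This is a sketch only: `language_kwargs` is a hypothetical name, not part of this model or of `transformers`, and the token IDs are copied from the usage example. The `decode` argument is expected to be `processor.tokenizer.decode`.

```python
# Token IDs copied from the usage example in this card.
SALT_LANGUAGE_TOKENS_WHISPER = {
    'eng': 50259, 'swa': 50318, 'ach': 50357, 'lgg': 50356,
    'lug': 50355, 'nyn': 50354, 'teo': 50353, 'xog': 50352,
    'ttj': 50351, 'kin': 50350, 'myx': 50349,
}


def language_kwargs(lang, decode):
    """Build the language-related keyword arguments for `model.generate`.

    `lang` is one of the codes above, or None to let the model auto-detect.
    `decode` maps a token ID to its token string, e.g. `processor.tokenizer.decode`.
    """
    if lang is None:
        return {"language": None, "forced_decoder_ids": None}
    if lang not in SALT_LANGUAGE_TOKENS_WHISPER:
        supported = ", ".join(sorted(SALT_LANGUAGE_TOKENS_WHISPER))
        raise ValueError(
            f"Unsupported language {lang!r}; expected one of: {supported}")
    token_id = SALT_LANGUAGE_TOKENS_WHISPER[lang]
    return {"language": decode(token_id), "forced_decoder_ids": None}


# Intended use with the objects from the usage example:
# predicted_ids = model.generate(
#     input_features, **language_kwargs('lug', processor.tokenizer.decode))
```
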
## Performance Metrics

Evaluated on the SALT test split and held-out splits from Common Voice (swa, kin) and Yogera (ttj, xog).

| Language | eval_WER | eval_CER |
|----------|----------|----------|
| eng      | 0.018    | 0.009    |
| lug      | 0.142    | 0.029    |
| ach      | 0.195    | 0.045    |
| lgg      | 0.189    | 0.045    |
| teo      | 0.202    | 0.051    |
| nyn      | 0.234    | 0.043    |
| myx      | 0.461    | 0.092    |
| xog      | 0.453    | 0.081    |
| swa      | 0.069    | 0.015    |
| kin      | 0.111    | 0.031    |
| mean     | 0.207    | 0.044    |
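
For readers unfamiliar with the two metrics above, word error rate (WER) and character error rate (CER) are both edit distances divided by the reference length, counted over words and characters respectively. The sketch below shows that definition in plain Python. It is not the evaluation code used for this model, and it assumes no text normalisation (casing, punctuation), which a real evaluation may have applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Example with one substituted word out of four.
print(wer("a b c d", "a x c d"))
print(cer("abcd", "abxd"))
```

Because one wrong character makes a whole word count as an error, WER is normally higher than CER on the same output, as in the table above.
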