metadata

library_name: transformers
base_model: jq/whisper-large-v3-salt-plus-xog-myx-kin-swa
tags:
  - generated_from_trainer
datasets:
  - Sunbird/salt
language:
  - lg
  - en
  - nyn
  - ach
  - teo
  - lgg
model-index:
  - name: whisper-large-v3-salt-plus-xog-myx-kin-swa-continued
    results: []

Whisper large for Ugandan languages

This model is an adaptation of whisper-large-v3 for the following languages widely spoken in Uganda: Luganda, Acholi, Lugbara, Ateso, Runyankole, Rutooro, Lumasaba, Swahili, Lusoga, Kinyarwanda and English (Ugandan accent).

Training

The model was trained with the SALT dataset, Common Voice (Luganda, Swahili, Kinyarwanda), Google FLEURS and Makerere Yogera datasets. To help with generalisation in practical settings, training used addition of random noise and random downsampling to 8kHz to simulate phone speech. Street noise sampled from urban locations in Uganda was added to improve robustness.

Usage

The model is used in a similar way to the base Whisper model. The model will attempt to auto-detect the language and provide a transcription. However, note that language detection is not always accurate and results may be improved by specifying it instead. The languages in this model are not supported by the base Whisper model, so the format is slightly different:

import transformers
import datasets
import torch

processor = transformers.WhisperProcessor.from_pretrained(
    "Sunbird/asr-whisper-large-v3-salt")
model = transformers.WhisperForConditionalGeneration.from_pretrained(
    "Sunbird/asr-whisper-large-v3-salt")

SALT_LANGUAGE_TOKENS_WHISPER = {
    'eng': 50259,  # English (Ugandan)
    'swa': 50318,  # Swahili
    'ach': 50357,  # Acholi
    'lgg': 50356,  # Lugbara
    'lug': 50355,  # Luganda
    'nyn': 50354,  # Runyankole
    'teo': 50353,  # Ateso
    'xog': 50352,  # Lusoga
    'ttj': 50351,  # Rutooro
    'kin': 50350,  # Kinyarwanda
    'myx': 50349,  # Lumasaba
}

# Get some test audio
ds = datasets.load_dataset('Sunbird/salt', 'multispeaker-lug', split='test')
audio = ds[0]['audio']
sample_rate = ds[0]['sample_rate']

# Specify a language from one of the above.
lang = 'lug'

# Apply the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input_features = processor(
    audio, sampling_rate=sample_rate, return_tensors="pt").input_features
input_features = input_features.to(device)
predicted_ids = model.to(device).generate(
    input_features,
    # Optionally set language=None here instead to auto-detect.
    language=processor.tokenizer.decode(SALT_LANGUAGE_TOKENS_WHISPER[lang]),
    forced_decoder_ids=None)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print(transcription)
# Ekikoola kya kasooli kya kyenvu wabula langi yaakyo etera okuba eya kitaka wansi.

Performance Metrics

Evaluated on SALT text and held-out split from Common Voice (swa, kin) and Yogera (ttj, xog).

eval_WER_eng: 0.018
eval_WER_lug: 0.142
eval_WER_ach: 0.195
eval_WER_lgg: 0.189
eval_WER_teo: 0.202
eval_WER_nyn: 0.234
eval_WER_myx: 0.461
eval_WER_xog: 0.453
eval_WER_swa: 0.069
eval_WER_kin: 0.111
eval_WER_mean: 0.207
eval_CER_eng: 0.009
eval_CER_lug: 0.029
eval_CER_ach: 0.045
eval_CER_lgg: 0.045
eval_CER_teo: 0.051
eval_CER_nyn: 0.043
eval_CER_myx: 0.092
eval_CER_xog: 0.081
eval_CER_swa: 0.015
eval_CER_kin: 0.031
eval_CER_mean: 0.044