Sunbird/asr-whisper-large-v3-salt

	---
	library_name: transformers
	base_model: jq/whisper-large-v3-salt-plus-xog-myx-kin-swa
	tags:
	- generated_from_trainer
	datasets:
	- Sunbird/salt
	language:
	- lg
	- en
	- nyn
	- ach
	- teo
	- lgg
	model-index:
	- name: whisper-large-v3-salt-plus-xog-myx-kin-swa-continued
	  results: []
	---


	# Whisper large for Ugandan languages

	This model is an adaptation of whisper-large-v3 for the following languages widely spoken in Uganda:
	Luganda, Acholi, Lugbara, Ateso, Runyankole, Rutooro, Lumasaba, Swahili, Lusoga, Kinyarwanda and English (Ugandan accent).

	## Training

	The model was trained on the SALT dataset, Common Voice (Luganda, Swahili, Kinyarwanda), Google FLEURS and the Makerere Yogera dataset.
	To help the model generalise to practical settings, training audio was augmented with random noise
	and randomly downsampled to 8 kHz to simulate phone speech.
	Street noise sampled from urban locations in Uganda was also added to improve robustness.
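
The augmentations described above can be sketched roughly as follows. This is a minimal illustration, not the training pipeline used for this model: the function names, the SNR range, the 0.5 probabilities and the crude pair-averaging resampler are all assumptions. The audio is brought back to 16 kHz here only because Whisper expects 16 kHz input.

```python
import math
import random


def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise at the given signal-to-noise ratio (dB)."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def simulate_phone_band(samples):
    """Crudely mimic 8 kHz phone audio for a 16 kHz input.

    Averages adjacent sample pairs (a rough low-pass filter plus 2x
    decimation), then repeats each value to return to 16 kHz.
    A trailing unpaired sample is dropped.
    """
    halved = [(samples[i] + samples[i + 1]) / 2
              for i in range(0, len(samples) - 1, 2)]
    restored = []
    for value in halved:
        restored.extend([value, value])
    return restored


def augment(samples, rng, p_noise=0.5, p_phone=0.5):
    """Randomly apply each augmentation; the probabilities are illustrative."""
    if rng.random() < p_noise:
        samples = add_noise(samples, snr_db=rng.uniform(5, 30), rng=rng)
    if rng.random() < p_phone:
        samples = simulate_phone_band(samples)
    return samples


rng = random.Random(0)
# One second of a 440 Hz tone at 16 kHz as stand-in audio.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
augmented = augment(tone, rng)
print(len(augmented))
```

A real pipeline would more likely use a proper resampling filter and mix in recorded street noise rather than white noise.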

	## Usage

	The model is used in much the same way as the base Whisper model.
	By default it attempts to auto-detect the language before transcribing.
	Language detection is not always accurate, however, so results may improve if the
	language is specified explicitly. Most of the languages in this model are not supported
	by the base Whisper model, so the language is passed as a token from the table below rather than by name:


	```python
	import transformers
	import datasets
	import torch

	processor = transformers.WhisperProcessor.from_pretrained(
	    "Sunbird/asr-whisper-large-v3-salt")
	model = transformers.WhisperForConditionalGeneration.from_pretrained(
	    "Sunbird/asr-whisper-large-v3-salt")

	SALT_LANGUAGE_TOKENS_WHISPER = {
	    'eng': 50259,  # English (Ugandan)
	    'swa': 50318,  # Swahili
	    'ach': 50357,  # Acholi
	    'lgg': 50356,  # Lugbara
	    'lug': 50355,  # Luganda
	    'nyn': 50354,  # Runyankole
	    'teo': 50353,  # Ateso
	    'xog': 50352,  # Lusoga
	    'ttj': 50351,  # Rutooro
	    'kin': 50350,  # Kinyarwanda
	    'myx': 50349,  # Lumasaba
	}

	# Get some test audio
	ds = datasets.load_dataset('Sunbird/salt', 'multispeaker-lug', split='test')
	audio = ds[0]['audio']
	sample_rate = ds[0]['sample_rate']

	# Specify a language from one of the above.
	lang = 'lug'

	# Apply the model
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	input_features = processor(
	    audio, sampling_rate=sample_rate, return_tensors="pt").input_features
	input_features = input_features.to(device)
	predicted_ids = model.to(device).generate(
	    input_features,
	    # Optionally set language=None here instead to auto-detect.
	    language=processor.tokenizer.decode(SALT_LANGUAGE_TOKENS_WHISPER[lang]),
	    forced_decoder_ids=None)
	transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

	print(transcription)
	# Ekikoola kya kasooli kya kyenvu wabula langi yaakyo etera okuba eya kitaka wansi.
	```
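
When wrapping the example above in an application, it can help to validate the language code up front and, when auto-detecting, to map a language token back to its code. The helpers below are a hypothetical sketch (`language_token_id` and `detected_language` are not part of this model or of `transformers`). Whether the generated IDs actually contain the language token depends on the `transformers` version and generation settings, so treat the reverse lookup as an assumption to verify.

```python
# Token table copied from the usage example above.
SALT_LANGUAGE_TOKENS_WHISPER = {
    'eng': 50259, 'swa': 50318, 'ach': 50357, 'lgg': 50356,
    'lug': 50355, 'nyn': 50354, 'teo': 50353, 'xog': 50352,
    'ttj': 50351, 'kin': 50350, 'myx': 50349,
}

# Reverse lookup: token ID -> language code.
SALT_TOKEN_TO_LANGUAGE = {
    token_id: code for code, token_id in SALT_LANGUAGE_TOKENS_WHISPER.items()
}


def language_token_id(lang):
    """Return the token ID for a language code, with a readable error."""
    try:
        return SALT_LANGUAGE_TOKENS_WHISPER[lang]
    except KeyError:
        supported = ", ".join(sorted(SALT_LANGUAGE_TOKENS_WHISPER))
        raise ValueError(
            f"Unsupported language {lang!r}; expected one of: {supported}"
        ) from None


def detected_language(token_ids):
    """Return the first known language code found in token_ids, else None."""
    for token_id in token_ids:
        code = SALT_TOKEN_TO_LANGUAGE.get(int(token_id))
        if code is not None:
            return code
    return None


print(language_token_id("lug"))  # 50355
# Illustrative ID sequence; only 50355 is taken from the table above.
print(detected_language([50258, 50355, 50360, 1234]))  # lug
```

With the earlier example, this could be called as `detected_language(predicted_ids[0].tolist())`, assuming the output sequence still includes the language token.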


	## Performance metrics

	Evaluated on the SALT test split and held-out splits from Common Voice (swa, kin) and Yogera (ttj, xog).

	- eval_WER_eng: 0.018
	- eval_WER_lug: 0.142
	- eval_WER_ach: 0.195
	- eval_WER_lgg: 0.189
	- eval_WER_teo: 0.202
	- eval_WER_nyn: 0.234
	- eval_WER_myx: 0.461
	- eval_WER_xog: 0.453
	- eval_WER_swa: 0.069
	- eval_WER_kin: 0.111
	- eval_WER_mean: 0.207
	- eval_CER_eng: 0.009
	- eval_CER_lug: 0.029
	- eval_CER_ach: 0.045
	- eval_CER_lgg: 0.045
	- eval_CER_teo: 0.051
	- eval_CER_nyn: 0.043
	- eval_CER_myx: 0.092
	- eval_CER_xog: 0.081
	- eval_CER_swa: 0.015
	- eval_CER_kin: 0.031
	- eval_CER_mean: 0.044
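
For reference, WER and CER are both edit-distance ratios, computed over words and characters respectively. The sketch below is a minimal per-utterance version and is not the evaluation script used for the numbers above; text normalisation and corpus-level aggregation are not described in this card and are left out. The last lines check that the reported `eval_WER_mean` is consistent with an unweighted average of the per-language values.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("the cat sat on the mat", "the cat sat on mat"))
print(cer("kitten", "sitting"))  # 0.5

# Per-language WER values copied from the list above (eng ... kin).
per_language_wer = [0.018, 0.142, 0.195, 0.189, 0.202,
                    0.234, 0.461, 0.453, 0.069, 0.111]
print(round(sum(per_language_wer) / len(per_language_wer), 3))  # 0.207
```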