---
license: apache-2.0
datasets:
- KBLab/rixvox
language:
- sv
---
|
# Whisper Large RixVox Swedish |
|
|
|
This is a [Whisper large](https://huggingface.co./openai/whisper-large-v2) model finetuned for Swedish on the [RixVox](https://huggingface.co./datasets/KBLab/rixvox) dataset.
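
As a minimal usage sketch, the model can be loaded with the Hugging Face Transformers ASR pipeline (the model id below is a placeholder; substitute this repository's id):

```python
# Minimal usage sketch with the transformers ASR pipeline.
# Replace the model id below with this repository's id.
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="<this-repo-id>",
    chunk_length_s=30,  # chunked decoding for audio longer than 30 s
)

print(transcriber("audio.wav")["text"])
```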
|
|
|
Please note that this model, like every other encoder-decoder speech-to-text model, is prone to
hallucinating on unexpected inputs and treats the task as translation rather than transcription.
In other words, your mileage may vary depending on how the data is filtered and what type of data it is.
|
|
|
In this release the entire encoder was frozen. Subsequent releases will unfreeze the encoder,
**provided** that generalization to other types of data (i.e. not parliamentary speeches) is
preserved when doing so.
|
|
|
## Evaluation (test) |
|
|
|
* RixVox WER: `22.59` |
|
* RixVox WER (normalized*): `19.33` |
|
* Common Voice 11 WER: `18.03` |
|
* Common Voice 11 WER (normalized*): `13.23` |
|
* Fleurs WER: `14.26` |
|
* Fleurs WER (normalized*): `8.99` |
|
|
|
\*) Normalization is done by applying the following function to both the reference and the generated texts:
|
|
|
```python
from re import sub

def normalize(s):
    # Lowercase, map 'é' to 'e', replace anything that is not a digit,
    # a Swedish/Latin letter or a space with a space, then collapse whitespace.
    return ' '.join(sub('[^0-9a-zåäöA-ZÅÄÖ ]', ' ', s.lower().replace('é', 'e')).split())
```
|
|
|
For comparison, the original Whisper large scores `30.56`/`25.58`, `18.76`/`15.00`, and `14.53`/`9.19` (WER/normalized WER) on the same test sets, respectively.
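
As a sketch of how the normalized scores can be computed, the `evaluate` library's WER metric can be combined with `normalize` above (assuming `predictions` and `references` are lists of generated and reference transcripts):

```python
# Sketch: normalized WER with the evaluate library.
# `predictions` and `references` are assumed to be lists of strings.
import evaluate

wer_metric = evaluate.load("wer")
wer = wer_metric.compute(
    predictions=[normalize(p) for p in predictions],
    references=[normalize(r) for r in references],
)
print(f"Normalized WER: {100 * wer:.2f}")  # reported values are percentages
```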
|
|
|
## Training |
|
|
|
Training was done using Hugging Face Transformers and DeepSpeed with ZeRO stage 2; a sketch of a matching configuration follows the hyperparameter list below.
|
|
|
* learning rate: 1e-5 |
|
* optimizer: CPUAdamW (DeepSpeed)
|
* lr scheduler: linear |
|
* warmup steps: 500 |
|
* per device batch size: 20 |
|
* GPUs: 8 x NVIDIA A100 40GB |
|
* total batch size: 160 |
|
* steps: 20000 |
|
* lowercase: no |
|
* precision: fp16

* encoder: entirely frozen
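
As a rough sketch (not the exact training script), the frozen encoder and the hyperparameters above could be expressed with Transformers' `Seq2SeqTrainingArguments`; the DeepSpeed config file name and output directory below are assumptions:

```python
# Sketch mirroring the hyperparameters above; the ZeRO-2 config file name
# ("ds_config_zero2.json") and output_dir are assumptions.
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainingArguments

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Freeze every encoder parameter so only the decoder is updated.
for param in model.model.encoder.parameters():
    param.requires_grad = False

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-rixvox-swedish",
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=20_000,
    per_device_train_batch_size=20,  # 8 x A100 40GB => total batch size 160
    fp16=True,
    deepspeed="ds_config_zero2.json",  # ZeRO stage 2 with CPUAdamW
    predict_with_generate=True,
)
```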