---
license: apache-2.0
datasets:
- KBLab/rixvox
language:
- sv
---
|
# Whisper Large RixVox Swedish |
|
|
|
This is a [Whisper large](https://huggingface.co./openai/whisper-large-v2) model finetuned for Swedish on the [RixVox](https://huggingface.co./datasets/KBLab/rixvox) dataset.
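
As a minimal usage sketch, the model can be loaded with the Hugging Face Transformers ASR pipeline (the model id below is a placeholder; substitute this repository's id):

```python
# Minimal usage sketch with the transformers ASR pipeline.
# Replace the model id below with this repository's id.
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="<this-repo-id>",
    chunk_length_s=30,  # chunked decoding for audio longer than 30 s
)

print(transcriber("audio.wav")["text"])
```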
|
|
|
Please note that this model, like every other encoder-decoder speech-to-text model, is prone to
hallucinating on unexpected inputs and treats the task as translation rather than transcription.
In other words, your mileage may vary depending on how the data is filtered and what type of data it is.
|
|
|
In this release the entire encoder was frozen. Subsequent releases will unfreeze the encoder,
**provided** that generalization to other types of data (i.e. not parliamentary speeches) is
preserved when doing so.
|
|
|
## Evaluation (test) |
|
|
|
* RixVox WER: `22.59` |
|
* RixVox WER (normalized*): `19.33` |
|
* Common Voice 11 WER: `18.03` |
|
* Common Voice 11 WER (normalized*): `13.23` |
|
* Fleurs WER: `14.26` |
|
* Fleurs WER (normalized*): `8.99` |
|
|
|
\*) Normalization is done by applying the following function to both the reference and the generated texts:
|
|
|
```python
from re import sub

def normalize(s):
    # Lowercase, map 'é' to 'e', replace anything that is not a digit,
    # a Swedish/Latin letter or a space with a space, then collapse whitespace.
    return ' '.join(sub('[^0-9a-zåäöA-ZÅÄÖ ]', ' ', s.lower().replace('é', 'e')).split())
```
|
|
|
For comparison, the original Whisper large scores `30.56`/`25.58`, `18.76`/`15.00`, and `14.53`/`9.19` (WER/normalized WER) on the same test sets, respectively.
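
As a sketch of how the normalized scores can be computed, the `evaluate` library's WER metric can be combined with `normalize` above (assuming `predictions` and `references` are lists of generated and reference transcripts):

```python
# Sketch: normalized WER with the evaluate library.
# `predictions` and `references` are assumed to be lists of strings.
import evaluate

wer_metric = evaluate.load("wer")
wer = wer_metric.compute(
    predictions=[normalize(p) for p in predictions],
    references=[normalize(r) for r in references],
)
print(f"Normalized WER: {100 * wer:.2f}")  # reported values are percentages
```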
|
|
|
## Training |
|
|
|
Training was done using Hugging Face Transformers and DeepSpeed with ZeRO stage 2; a sketch of a matching configuration follows the hyperparameter list below.
|
|
|
* learning rate: 1e-5 |
|
* optimizer: CPUAdamW (DeepSpeed)
|
* lr scheduler: linear |
|
* warmup steps: 500 |
|
* per device batch size: 20 |
|
* GPUs: 8 x NVIDIA A100 40GB |
|
* total batch size: 160 |
|
* steps: 20000 |
|
* lowercase: no |
|
* precision: fp16

* encoder: entirely frozen
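
As a rough sketch (not the exact training script), the frozen encoder and the hyperparameters above could be expressed with Transformers' `Seq2SeqTrainingArguments`; the DeepSpeed config file name and output directory below are assumptions:

```python
# Sketch mirroring the hyperparameters above; the ZeRO-2 config file name
# ("ds_config_zero2.json") and output_dir are assumptions.
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainingArguments

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Freeze every encoder parameter so only the decoder is updated.
for param in model.model.encoder.parameters():
    param.requires_grad = False

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-rixvox-swedish",
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=20_000,
    per_device_train_batch_size=20,  # 8 x A100 40GB => total batch size 160
    fp16=True,
    deepspeed="ds_config_zero2.json",  # ZeRO stage 2 with CPUAdamW
    predict_with_generate=True,
)
```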