---
library_name: transformers
base_model: openai/whisper-large
language:
- sv
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- KBLab/rixvox-v2
---
## KB-Whisper Large

The National Library of Sweden releases a new suite of Whisper models trained on over 50,000 hours of Swedish speech. In evaluations across [FLEURS](https://huggingface.co./datasets/google/fleurs), [CommonVoice](https://huggingface.co./datasets/mozilla-foundation/common_voice_16_1) and [NST](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/), our best performing model reduces the Word Error Rate (WER) by an average of 47% compared to OpenAI's `whisper-large-v3`. The performance of smaller Whisper model sizes on Swedish speech has also substantially improved, with `kb-whisper-small` outperforming `openai/whisper-large-v3` (a model six times its size).

| Model size  |   | FLEURS | CommonVoice | NST  |
|------------|---------|--------|-------------|------|
| [tiny](https://huggingface.co./KBLab/kb-whisper-tiny)       | **KBLab**   | **13.2**  | **12.9**  | **11.2**  |
|            | OpenAI  | 59.2   | 67.8   | 85.2   |
| [base](https://huggingface.co./KBLab/kb-whisper-base)       | **KBLab**   | **9.1**   | **8.7**   | **7.8**   |
|            | OpenAI  | 39.6   | 52.1   | 53.4   |
| [small](https://huggingface.co./KBLab/kb-whisper-small)      | **KBLab**   | **7.3**   | **6.4**   | **6.6**   |
|            | OpenAI  | 20.6   | 26.4   | 26.4   |
| [medium](https://huggingface.co./KBLab/kb-whisper-medium)     | **KBLab**   | **6.6**   | **5.4**   | **5.8**   |
|            | OpenAI  | 12.1   | 15.8   | 17.1   |
| [large-v3](https://huggingface.co./KBLab/kb-whisper-large)   | **KBLab**   | **5.4**   | **4.1**   | **5.2**   |
|            | OpenAI  | 7.8    | 9.5    | 11.3    |

Table: **Word Error Rate (WER)** comparison between KBLab's Whisper models and the corresponding OpenAI versions. 

### Usage

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-large"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

generate_kwargs = {"task": "transcribe", "language": "sv"}
# Add return_timestamps=True for output with timestamps
res = pipe("audio.mp3",
           chunk_length_s=30,
           generate_kwargs=generate_kwargs)
```
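As the comment in the snippet above notes, the pipeline can also return timestamps. A minimal sketch, reusing the `pipe` object and the placeholder `audio.mp3` file from above:

```python
# Segment-level timestamps; use return_timestamps="word" for word-level ones.
res = pipe(
    "audio.mp3",
    chunk_length_s=30,
    return_timestamps=True,
    generate_kwargs={"task": "transcribe", "language": "sv"},
)
# Each chunk is a dict like {"timestamp": (start, end), "text": "..."}
print(res["chunks"])
```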

### Training data

Our models have been trained on over 50,000 hours of Swedish audio with text transcriptions. Training was carried out in two stages, each with its own quality filters and filter thresholds.

Stage 1 employed low threshold values (0.15 to 0.30 BLEU), whereas Stage 2 used stricter thresholds (`BLEU >= 0.7`, weighted ROUGE-N `>= 0.7`, CER of first and last 10 characters `<= 0.2`).
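As a toy illustration of how the Stage 2 thresholds could be applied, assuming agreement scores for each (audio, transcript) pair have already been computed. The field names and data below are hypothetical and not taken from the actual training pipeline:

```python
# Hypothetical precomputed agreement scores for two (audio, transcript) pairs.
examples = [
    {"bleu": 0.82, "weighted_rouge_n": 0.78, "cer_first_10": 0.10, "cer_last_10": 0.05},
    {"bleu": 0.45, "weighted_rouge_n": 0.60, "cer_first_10": 0.30, "cer_last_10": 0.10},
]

def passes_stage2(ex):
    """Keep a pair only if every score clears the stricter Stage 2 thresholds."""
    return (
        ex["bleu"] >= 0.7
        and ex["weighted_rouge_n"] >= 0.7
        and ex["cer_first_10"] <= 0.2
        and ex["cer_last_10"] <= 0.2
    )

stage2_subset = [ex for ex in examples if passes_stage2(ex)]  # keeps only the first pair
```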

| Dataset      | Continued pretraining (h) -- Stage 1 | Finetuning (h) -- Stage 2 |
|-------------|--------------------------|--------------|
| Subtitles   | 34,261                   | 3,110        |
| Riksdag     | 21,949                   | 5,119        |
| ISOF        | 54                       | 54           |
| NST         | 250                      | 250          |
| **Total**   | **56,514**               | **8,533**    |

The default when loading our models through Hugging Face is **Stage 2**. We have, however, also uploaded and tagged the checkpoints from our continued pretraining. You can access these other checkpoints by specifying the `revision`, for example: [`pretrained-checkpoint`](https://huggingface.co./KBLab/kb-whisper-large/tree/pretrained-checkpoint). The tag for the default Stage 2 model is `standard`.
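A minimal sketch of loading the Stage 1 (continued pretraining) weights via the `revision` argument of `from_pretrained`; omit `revision` (or pass `"standard"`) to get the default Stage 2 model:

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "KBLab/kb-whisper-large"

# Load the Stage 1 checkpoint instead of the default Stage 2 model.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, revision="pretrained-checkpoint", cache_dir="cache"
)
processor = AutoProcessor.from_pretrained(model_id, revision="pretrained-checkpoint")
```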

### Evaluation


#### WER
| Model size  |  | FLEURS | CommonVoice | NST  |
|------------|---------|--------|-------------|------|
| [tiny](https://huggingface.co./KBLab/kb-whisper-tiny)       | **KBLab**   | **13.2**  | **12.9**  | **11.2**  |
|            | OpenAI  | 59.2   | 67.8   | 85.2   |
| [base](https://huggingface.co./KBLab/kb-whisper-base)       | **KBLab**   | **9.1**   | **8.7**   | **7.8**   |
|            | OpenAI  | 39.6   | 52.1   | 53.4   |
| [small](https://huggingface.co./KBLab/kb-whisper-small)      | **KBLab**   | **7.3**   | **6.4**   | **6.6**   |
|            | OpenAI  | 20.6   | 26.4   | 26.4   |
| [medium](https://huggingface.co./KBLab/kb-whisper-medium)     | **KBLab**   | **6.6**   | **5.4**   | **5.8**   |
|            | OpenAI  | 12.1   | 15.8   | 17.1   |
| [large-v3](https://huggingface.co./KBLab/kb-whisper-large)   | **KBLab**   | **5.4**   | **4.1**   | **5.2**   |
|            | OpenAI  | 7.8    | 9.5    | 11.3    |


#### BLEU Score
| Model size  |   | FLEURS | CommonVoice | NST  |
|------------|---------|--------|-------------|------|
| tiny       | KBLab   | **76.6**  | **73.7**  | **74.3**  |
|            | OpenAI  | 26.9   | 21.1   | 24.0   |
| base       | KBLab   | **83.2**   | **79.9**   | **78.3**   |
|            | OpenAI  | 41.1   | 32.5   | 36.9   |
| small      | KBLab   | **86.6**   | **83.5**   | **79.6**   |
|            | OpenAI  | 64.0   | 56.5   | 58.2   |
| medium     | KBLab   | **87.6**   | **85.0**   | **80.2**   |
|            | OpenAI  | 77.1   | 70.1   | 68.9   |
| large-v3   | KBLab   | **89.8**   | **87.2**   | **81.1**   |
|            | OpenAI  | 84.9    | 79.1    | 75.1    |