---
language: sv
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- hf-asr-leaderboard
- sv
license: cc0-1.0
datasets:
- common_voice
- NST_Swedish_ASR_Database
- P4
- The_Swedish_Culturomics_Gigaword_Corpus
model-index:
- name: Wav2vec 2.0 large VoxRex Swedish (C) with 4-gram
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 6.1
      type: common_voice
      args: sv-SE
    metrics:
    - name: Test WER
      type: wer
      value: 6.4723
---

# KBLab's wav2vec 2.0 large VoxRex Swedish (C) with 4-gram model
The acoustic model was trained by KBLab; see [VoxRex-C](https://huggingface.co/KBLab/wav2vec2-large-voxrex-swedish) for more details. This repo extends the acoustic model with a 4-gram language model estimated from social media text, which lowers the word error rate.

## Model description
VoxRex-C is extended with a 4-gram language model estimated from a subset of [The Swedish Culturomics Gigaword Corpus](https://spraakbanken.gu.se/resurser/gigaword) from Språkbanken. The subset contains 40M words from the social media genre, written between 2010 and 2015.

## How to use
#### Simple usage example with pipeline
```python
import torch
from transformers import pipeline

# Load the model, using the GPU if available.
# The pipeline takes the device as an argument: a GPU index, or -1 for CPU.
model_name = 'viktor-enzell/wav2vec2-large-voxrex-swedish-4gram'
device = 0 if torch.cuda.is_available() else -1
pipe = pipeline(model=model_name, device=device)

# Run inference on an audio file
output = pipe('path/to/audio.mp3')['text']
```

#### More verbose usage example with audio pre-processing
This example transcribes 1% of the Common Voice test split. The model expects 16 kHz audio, so audio with a different sampling rate is resampled to 16 kHz.

```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
from datasets import load_dataset
import torch
import torchaudio.functional as F

# Import model and processor, using the GPU if available
model_name = 'viktor-enzell/wav2vec2-large-voxrex-swedish-4gram'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_name)

# Import and process speech data
common_voice = load_dataset('common_voice', 'sv-SE', split='test[:1%]')

def speech_file_to_array(sample):
    # Convert speech file to array and resample to 16 kHz
    sampling_rate = sample['audio']['sampling_rate']
    sample['speech'] = F.resample(torch.tensor(sample['audio']['array']), sampling_rate, 16_000)
    return sample

common_voice = common_voice.map(speech_file_to_array)

# Run inference
inputs = processor(common_voice['speech'], sampling_rate=16_000, return_tensors='pt', padding=True).to(device)

with torch.no_grad():
    logits = model(**inputs).logits

# Decode the logits with the 4-gram language model
transcripts = processor.batch_decode(logits.cpu().numpy()).text
```

## Training procedure
Text data for the n-gram model is pre-processed by removing characters that are not part of the wav2vec 2.0 vocabulary and uppercasing all characters. Each pre-processed text sample is then stored on its own line in a text file, from which a [KenLM](https://github.com/kpu/kenlm) model is estimated. See [this tutorial](https://huggingface.co/blog/wav2vec2-with-ngram) for more details.
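
The pre-processing described above can be sketched roughly as follows. This is an illustration, not the script used to build this model: the character set `VOCAB_CHARS` and the helper names `preprocess` and `write_corpus` are hypothetical, and the real character set should be taken from the acoustic model's vocabulary. The sketch uppercases before filtering, on the assumption that the vocabulary is uppercase, and collapses the whitespace left behind by removed characters.

```python
import re

# Hypothetical character set standing in for the wav2vec 2.0 vocabulary.
# The real set should be taken from the acoustic model's tokenizer.
VOCAB_CHARS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ ")


def preprocess(text, vocab_chars=VOCAB_CHARS):
    """Uppercase the text and drop characters outside the vocabulary."""
    # Uppercase first so lowercase letters survive the vocabulary filter,
    # and turn any whitespace (including newlines) into single spaces
    text = re.sub(r"\s+", " ", text.upper())
    text = "".join(ch for ch in text if ch in vocab_chars)
    # Collapse runs of spaces left behind by removed characters
    return re.sub(r" +", " ", text).strip()


def write_corpus(samples, path):
    """Write one pre-processed sample per line, skipping empty results."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            line = preprocess(sample)
            if line:
                f.write(line + "\n")


print(preprocess("Hej, världen! 123"))  # HEJ VÄRLDEN
```

A file written this way has the one-sample-per-line layout that KenLM expects as input; estimating a 4-gram model from it would typically be done with KenLM's `lmplz` tool and an order of 4, as described in the linked tutorial.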

## Evaluation results
The model was evaluated on the full test set of Common Voice version 6.1. VoxRex-C achieved a WER of 9.03% without the language model and 6.47% with it.
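
For readers unfamiliar with the metric, the following is a minimal sketch of how a corpus-level word error rate can be computed: the total number of word substitutions, insertions, and deletions divided by the total number of reference words. It is not the evaluation script behind the numbers above; the function names are illustrative, and any text normalization applied before scoring is left out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        total_edits += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    return total_edits / total_words


# One substituted word out of four reference words
print(word_error_rate(["DET ÄR EN KATT"], ["DET VAR EN KATT"]))  # 0.25

# Relative WER reduction implied by the reported figures
relative_reduction = (9.03 - 6.47) / 9.03  # roughly 0.28
```

Going from 9.03% to 6.47% is an absolute reduction of 2.56 percentage points, or roughly 28% relative to the WER without the language model.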