---
language: ru
datasets:
- SberDevices/Golos
- bond005/sova_rudevices
- bond005/rulibrispeech
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
widget:
- example_title: test sound with Russian speech "нейросети это хорошо"
  src: https://huggingface.co./bond005/wav2vec2-large-ru-golos/resolve/main/test_sound_ru.flac
model-index:
- name: XLSR Wav2Vec2 Russian by Ivan Bondarenko
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Sberdevices Golos (crowd)
      type: SberDevices/Golos
      args: ru
    metrics:
    - name: Test WER
      type: wer
      value: 10.144
    - name: Test CER
      type: cer
      value: 2.168
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Sberdevices Golos (farfield)
      type: SberDevices/Golos
      args: ru
    metrics:
    - name: Test WER
      type: wer
      value: 20.353
    - name: Test CER
      type: cer
      value: 6.030
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice ru
      type: common_voice
      args: ru
    metrics:
    - name: Test WER
      type: wer
      value: 18.548
    - name: Test CER
      type: cer
      value: 4.000
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Sova RuDevices
      type: bond005/sova_rudevices
      args: ru
    metrics:
    - name: Test WER
      type: wer
      value: 25.410
    - name: Test CER
      type: cer
      value: 7.965
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Russian Librispeech
      type: bond005/rulibrispeech
      args: ru
    metrics:
    - name: Test WER
      type: wer
      value: 21.872
    - name: Test CER
      type: cer
      value: 4.469
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Voxforge Ru
      type: dangrebenkin/voxforge-ru-dataset
      args: ru
    metrics:
    - name: Test WER
      type: wer
      value: 27.084
    - name: Test CER
      type: cer
      value: 6.986
---
# Wav2Vec2-Large-Ru-Golos
This Wav2Vec2 model is based on [facebook/wav2vec2-large-xlsr-53](https://huggingface.co./facebook/wav2vec2-large-xlsr-53) and fine-tuned for Russian on [Sberdevices Golos](https://huggingface.co./datasets/SberDevices/Golos) with audio augmentations such as pitch shifting, speeding up or slowing down the audio, and reverberation.
When using this model, make sure that your speech input is sampled at 16 kHz.
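
If your recordings use a different sampling rate, they need to be resampled first. The following is a minimal sketch of one way to do that, assuming a mono waveform already loaded as a NumPy array. The helper `to_16khz` and the use of `scipy.signal.resample_poly` are illustrative choices, not part of this model's own tooling.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_RATE = 16_000  # sampling rate the model expects


def to_16khz(waveform: np.ndarray, source_rate: int) -> np.ndarray:
    """Resample a mono waveform to 16 kHz with a polyphase filter (hypothetical helper)."""
    if source_rate == TARGET_RATE:
        return waveform.astype(np.float32)
    # Reduce the rate ratio to the smallest integer up/down factors.
    divisor = gcd(TARGET_RATE, source_rate)
    up, down = TARGET_RATE // divisor, source_rate // divisor
    return resample_poly(waveform, up, down).astype(np.float32)


# One second of a 440 Hz tone at 44.1 kHz, a synthetic stand-in for real speech.
source_rate = 44_100
tone = np.sin(2 * np.pi * 440 * np.arange(source_rate) / source_rate)
resampled = to_16khz(tone, source_rate)
print(resampled.shape)  # (16000,)
```

The resulting array can be passed to the processor together with `sampling_rate=16000`.
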
## Usage
To transcribe audio files, the model can be used as a standalone acoustic model as follows:
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch
# load the model and its processor (feature extractor + tokenizer)
processor = Wav2Vec2Processor.from_pretrained("bond005/wav2vec2-large-ru-golos")
model = Wav2Vec2ForCTC.from_pretrained("bond005/wav2vec2-large-ru-golos")
# load the test part of the Golos dataset and read the first sound file
ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="test")
audio = ds[0]["audio"]
# convert the waveform into model inputs (batch size 1)
processed = processor(
    audio["array"], sampling_rate=audio["sampling_rate"],
    return_tensors="pt", padding="longest"
)
# retrieve logits without tracking gradients
with torch.no_grad():
    logits = model(processed.input_values, attention_mask=processed.attention_mask).logits
# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```
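
The "take argmax and decode" step above is greedy CTC decoding: consecutive repeated frame labels are collapsed, blank labels are dropped, and the remaining characters are joined. The sketch below illustrates the idea on a toy example. The vocabulary, the indices, and the helper `ctc_greedy_decode` are hypothetical and are not the model's real vocabulary or the library's implementation.

```python
from itertools import groupby

# Toy vocabulary (hypothetical): index 0 plays the role of the CTC blank,
# and "|" marks word boundaries.
VOCAB = ["<pad>", "|", "д", "а", "н", "е", "т"]
BLANK_ID = 0
WORD_DELIMITER = "|"


def ctc_greedy_decode(frame_ids):
    """Collapse repeated frame labels, drop blanks, and join into text."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    chars = [VOCAB[i] for i in collapsed if i != BLANK_ID]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax output for a short utterance.
frame_ids = [2, 2, 0, 3, 3, 1, 1, 4, 0, 5, 5, 6, 0]
print(ctc_greedy_decode(frame_ids))  # да нет
```

A blank between two identical labels keeps them apart, so `[3, 0, 3]` decodes to two letters while `[3, 3]` decodes to one.
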
## Evaluation
This code snippet shows how to evaluate **bond005/wav2vec2-large-ru-golos** on the "crowd" and "farfield" test sets of the Golos dataset. It prints WER and CER as fractions, while the tables below report them as percentages.
```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer, cer # we need word error rate (WER) and character error rate (CER)
# load the test part of Golos Crowd and remove samples with empty "true" transcriptions
golos_crowd_test = load_dataset("bond005/sberdevices_golos_10h_crowd", split="test")
golos_crowd_test = golos_crowd_test.filter(
    lambda it1: (it1["transcription"] is not None) and (len(it1["transcription"].strip()) > 0)
)
# load the test part of Golos Farfield and remove samples with empty "true" transcriptions
golos_farfield_test = load_dataset("bond005/sberdevices_golos_100h_farfield", split="test")
golos_farfield_test = golos_farfield_test.filter(
    lambda it2: (it2["transcription"] is not None) and (len(it2["transcription"].strip()) > 0)
)
# load the model and its processor
model = Wav2Vec2ForCTC.from_pretrained("bond005/wav2vec2-large-ru-golos").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("bond005/wav2vec2-large-ru-golos")
# recognize one sound
def map_to_pred(batch):
    # tokenize and vectorize
    processed = processor(
        batch["audio"]["array"], sampling_rate=batch["audio"]["sampling_rate"],
        return_tensors="pt", padding="longest"
    )
    input_values = processed.input_values.to("cuda")
    attention_mask = processed.attention_mask.to("cuda")
    # recognize
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    # decode
    transcription = processor.batch_decode(predicted_ids)
    batch["text"] = transcription[0]
    return batch
# calculate WER and CER on the crowd domain
crowd_result = golos_crowd_test.map(map_to_pred, remove_columns=["audio"])
crowd_wer = wer(crowd_result["transcription"], crowd_result["text"])
crowd_cer = cer(crowd_result["transcription"], crowd_result["text"])
print("Word error rate on the Crowd domain:", crowd_wer)
print("Character error rate on the Crowd domain:", crowd_cer)
# calculate WER and CER on the farfield domain
farfield_result = golos_farfield_test.map(map_to_pred, remove_columns=["audio"])
farfield_wer = wer(farfield_result["transcription"], farfield_result["text"])
farfield_cer = cer(farfield_result["transcription"], farfield_result["text"])
print("Word error rate on the Farfield domain:", farfield_wer)
print("Character error rate on the Farfield domain:", farfield_cer)
```
*Result (WER, %)*:
| "crowd" | "farfield" |
|---------|------------|
| 10.144 | 20.353 |
*Result (CER, %)*:
| "crowd" | "farfield" |
|---------|------------|
| 2.168 | 6.030 |
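
For readers unfamiliar with these metrics: both are edit-distance rates, where WER counts word-level substitutions, insertions, and deletions against the number of reference words, and CER does the same over characters. The following is a minimal sketch of that definition in plain Python. The function names are hypothetical, and this is an illustration rather than the `jiwer` implementation used above.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def error_rate_percent(references, hypotheses, unit):
    """Total edits divided by total reference length, as a percentage."""
    # Words are split on whitespace; characters include spaces in this sketch.
    split = str.split if unit == "word" else list
    errors = sum(edit_distance(split(r), split(h)) for r, h in zip(references, hypotheses))
    total = sum(len(split(r)) for r in references)
    return 100.0 * errors / total


refs = ["нейросети это хорошо"]
hyps = ["нейросети эта хорошо"]  # one wrong letter in the second word
print(f"WER: {error_rate_percent(refs, hyps, 'word'):.2f}%")  # WER: 33.33%
print(f"CER: {error_rate_percent(refs, hyps, 'char'):.2f}%")  # CER: 5.00%
```

A single wrong letter costs one full word in WER but only one character in CER, which is why the CER values in the tables are much lower than the WER values.
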
The evaluation script for other datasets, including Russian Librispeech and SOVA RuDevices, is available on my Kaggle page: https://www.kaggle.com/code/bond005/wav2vec2-ru-eval
## Citation
If you want to cite this model, you can use this:
```bibtex
@misc{bondarenko2022wav2vec2-large-ru-golos,
  title={XLSR Wav2Vec2 Russian by Ivan Bondarenko},
  author={Bondarenko, Ivan},
  publisher={Hugging Face},
  journal={Hugging Face Hub},
  howpublished={\url{https://huggingface.co./bond005/wav2vec2-large-ru-golos}},
  year={2022}
}
```