---
language: hr
datasets:
- parlaspeech-hr
tags:
- audio
- automatic-speech-recognition
- parlaspeech
widget:
- example_title: example 1
  src: https://huggingface.co./classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/1800.m4a
- example_title: example 2
  src: https://huggingface.co./classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/00020578b.flac.wav
- example_title: example 3
  src: https://huggingface.co./classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/00020570a.flac.wav
---
# wav2vec2-large-slavic-parlaspeech-hr-lm
This model for Croatian ASR is based on the [facebook/wav2vec2-large-slavic-voxpopuli-v2 model](facebook/wav2vec2-large-slavic-voxpopuli-v2) and was fine-tuned with 300 hours of recordings and transcripts from the ASR Croatian parliament dataset [ParlaSpeech-HR v1.0](http://hdl.handle.net/11356/1494) and enhanced with a language model.
The work behind this model was coordinated by Nikola Ljubešić. The rough manual data alignment was performed by Ivo-Pavao Jazbec, the method for fine automatic data alignment from [Plüss et al.](https://arxiv.org/abs/2010.02810) was applied by Vuk Batanović and Lenka Bajčetić, the transcripts were normalised by Danijel Koržinek, and the final modelling was performed by Peter Rupnik.
If you use this model, please cite the following paper:
Nikola Ljubešić, Danijel Koržinek, Peter Rupnik, Ivo-Pavao Jazbec. ParlaSpeech-HR -- a freely available ASR dataset for Croatian bootstrapped from the ParlaMint corpus. Submitted to ParlaCLARIN@LREC.
## Metrics
|split|CER|WER|
|---|---|---|
|dev|0.0253|0.0556|
|test|0.0188|0.0430|
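
The card does not say how these scores were computed. As a rough illustration only, the sketch below implements the standard definitions of WER and CER (edit distance divided by reference length, expressed as a fraction rather than a percentage). The function names and the example sentence pair are made up for this illustration and are not taken from the evaluation code behind the table.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical reference/hypothesis pair, for illustration only.
reference = "velik broj poslovnih subjekata"
hypothesis = "velik broj poslovni subjekata"
print(wer(reference, hypothesis))  # 0.25 (one wrong word out of four)
print(cer(reference, hypothesis))  # one character edit over 30 reference characters
```

Read this way, a test WER of 0.0430 would correspond to roughly 4 word errors per 100 reference words.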
## Usage in `transformers`
Tested with `transformers==4.18.0`, `torch==1.11.0`, and `SoundFile==0.10.3.post1`.
```python
from transformers import Wav2Vec2ProcessorWithLM, Wav2Vec2ForCTC
import soundfile as sf
import torch
import os

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# load the model and the processor (feature extractor, tokenizer and LM decoder)
processor = Wav2Vec2ProcessorWithLM.from_pretrained(
    "5roop/wav2vec2-large-slavic-parlaspeech-hr-lm")
model = Wav2Vec2ForCTC.from_pretrained(
    "5roop/wav2vec2-large-slavic-parlaspeech-hr-lm").to(device)

# download the example wav file
os.system("wget https://huggingface.co./classla/wav2vec2-large-slavic-parlaspeech-hr/raw/main/00020570a.flac.wav")

# read the wav file and move the model inputs to the same device as the model
speech, sample_rate = sf.read("00020570a.flac.wav")
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**inputs).logits

# the LM decoder works on NumPy arrays, so bring the logits back to the CPU first
transcription = processor.batch_decode(logits.cpu().numpy()).text[0]

# remove the raw wav file
os.system("rm 00020570a.flac.wav")

transcription  # 'velik broj poslovnih subjekata poslao je sa minusom velik dio'
```
## Training hyperparameters
In fine-tuning, the following arguments were used:
| arg | value |
|-------------------------------|-------|
| `per_device_train_batch_size` | 16 |
| `gradient_accumulation_steps` | 4 |
| `num_train_epochs` | 8 |
| `learning_rate` | 3e-4 |
| `warmup_steps` | 500 |
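
To make the interaction between the batch size and gradient accumulation concrete, here is a small sketch that collects the arguments above in a dictionary and computes the effective batch size per optimizer step. The number of devices is not stated in this card, so `num_devices=1` is an assumption, and the helper name `effective_batch_size` is hypothetical.

```python
# Values copied from the table above.
hyperparameters = {
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 8,
    "learning_rate": 3e-4,
    "warmup_steps": 500,
}


def effective_batch_size(params, num_devices=1):
    """Examples contributing to one optimizer step.

    num_devices is an assumption; the card does not say how many
    devices were used for fine-tuning.
    """
    return (params["per_device_train_batch_size"]
            * params["gradient_accumulation_steps"]
            * num_devices)


print(effective_batch_size(hyperparameters))  # 64
```

The keys match parameter names of `transformers.TrainingArguments`, so under that assumption the dictionary could be unpacked into it together with an output directory of your choice.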