---
language: hr
datasets:
- parlaspeech-hr
tags:
- audio
- automatic-speech-recognition
- parlaspeech
widget:
- example_title: example 1
  src: https://huggingface.co/5roop/wav2vec2-xls-r-parlaspeech-hr-lm/raw/main/1800.m4a
- example_title: example 2
  src: https://huggingface.co/5roop/wav2vec2-xls-r-parlaspeech-hr-lm/raw/main/00020578b.flac.wav
- example_title: example 3
  src: https://huggingface.co/5roop/wav2vec2-xls-r-parlaspeech-hr-lm/raw/main/00020570a.flac.wav
---

# wav2vec2-xls-r-parlaspeech-hr-lm

This model for Croatian ASR is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) and was fine-tuned on 300 hours of recordings and transcripts from the Croatian parliamentary ASR dataset [ParlaSpeech-HR v1.0](http://hdl.handle.net/11356/1494).

The efforts resulting in this model were coordinated by Nikola Ljubešić. The rough manual data alignment was performed by Ivo-Pavao Jazbec, the method for fine automatic data alignment from [Plüss et al.](https://arxiv.org/abs/2010.02810) was applied by Vuk Batanović and Lenka Bajčetić, the transcripts were normalised by Danijel Koržinek, and the final modelling was performed by Peter Rupnik.

If you use this model, please cite the following paper:

Nikola Ljubešić, Danijel Koržinek, Peter Rupnik, Ivo-Pavao Jazbec. ParlaSpeech-HR -- a freely available ASR dataset for Croatian bootstrapped from the ParlaMint corpus. Submitted to ParlaCLARIN@LREC.

## Metrics

| split | CER    | WER    |
|-------|--------|--------|
| dev   | 0.0335 | 0.1046 |
| test  | 0.0234 | 0.0761 |
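
The CER and WER values above are normalised edit distances between reference and hypothesis transcripts. As a minimal illustrative sketch (the helper names below are ours, not the evaluation code actually used for this table), they can be computed like this:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference length."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

print(wer("veliki broj poslovnih subjekata", "velik broj poslovnih subjekata"))  # 0.25
```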

## Usage in `transformers`

The following approach has worked with earlier checkpoints, but has not yet been re-tested with this model:
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch
import os

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# load the processor and model, and move the model to the target device
processor = Wav2Vec2Processor.from_pretrained("classla/wav2vec2-xls-r-parlaspeech-hr")
model = Wav2Vec2ForCTC.from_pretrained("classla/wav2vec2-xls-r-parlaspeech-hr").to(device)

# download the example wav file
os.system("wget https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/00020570a.flac.wav")

# read the wav file and prepare the model input
speech, sample_rate = sf.read("00020570a.flac.wav")
input_values = processor(speech, sampling_rate=sample_rate, return_tensors="pt").input_values.to(device)

# remove the downloaded wav file
os.remove("00020570a.flac.wav")

# retrieve logits without tracking gradients
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0]).lower()

# transcription: 'veliki broj poslovnih subjekata posluje sa minusom velik dio'
```
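
Since this checkpoint carries the `-lm` suffix, it presumably ships an n-gram language model for CTC beam-search decoding. Assuming the repository contains the files expected by `Wav2Vec2ProcessorWithLM` (an untested assumption on our part), LM-based decoding would replace the argmax step roughly as follows:

```python
# Sketch only: assumes this repo provides the pyctcdecode/KenLM files that
# Wav2Vec2ProcessorWithLM expects, and that `logits` was obtained as in the
# example above. Requires `pip install pyctcdecode`.
from transformers import Wav2Vec2ProcessorWithLM

processor = Wav2Vec2ProcessorWithLM.from_pretrained("5roop/wav2vec2-xls-r-parlaspeech-hr-lm")

# beam-search decode the logits with the language model instead of plain argmax
transcription = processor.batch_decode(logits.cpu().numpy()).text[0].lower()
```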
## Training hyperparameters

In fine-tuning, the following arguments were used:

| arg                           | value |
|-------------------------------|-------|
| `per_device_train_batch_size` | 16    |
| `gradient_accumulation_steps` | 4     |
| `num_train_epochs`            | 8     |
| `learning_rate`               | 3e-4  |
| `warmup_steps`                | 500   |
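
As a rough sketch, these values map onto `transformers.TrainingArguments` as below; `output_dir` and every argument not listed in the table are placeholders, not the original training configuration:

```python
from transformers import TrainingArguments

# Illustrative mapping of the table above onto TrainingArguments;
# "out" and all omitted arguments are placeholders, not the original setup.
training_args = TrainingArguments(
    output_dir="out",                  # placeholder
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,     # effective batch size: 16 * 4 = 64 per device
    num_train_epochs=8,
    learning_rate=3e-4,
    warmup_steps=500,
)
```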