---
language: hr
datasets:
- parlaspeech-hr
tags:
- audio
- automatic-speech-recognition
- parlaspeech
widget:
- example_title: example 1
  src: https://huggingface.co/5roop/wav2vec2-xls-r-parlaspeech-hr-lm/raw/main/1800.m4a
- example_title: example 2
  src: https://huggingface.co/5roop/wav2vec2-xls-r-parlaspeech-hr-lm/raw/main/00020578b.flac.wav
- example_title: example 3
  src: https://huggingface.co/5roop/wav2vec2-xls-r-parlaspeech-hr-lm/raw/main/00020570a.flac.wav
---

# wav2vec2-xls-r-parlaspeech-hr-lm

This model for Croatian ASR is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) and was fine-tuned on 300 hours of recordings and transcripts from the Croatian parliamentary ASR dataset [ParlaSpeech-HR v1.0](http://hdl.handle.net/11356/1494).

The efforts resulting in this model were coordinated by Nikola Ljubešić. The rough manual data alignment was performed by Ivo-Pavao Jazbec, the method for fine automatic data alignment from [Plüss et al.](https://arxiv.org/abs/2010.02810) was applied by Vuk Batanović and Lenka Bajčetić, the transcripts were normalised by Danijel Koržinek, and the final modelling was performed by Peter Rupnik.

If you use this model, please cite the following paper:

Nikola Ljubešić, Danijel Koržinek, Peter Rupnik, Ivo-Pavao Jazbec. ParlaSpeech-HR -- a freely available ASR dataset for Croatian bootstrapped from the ParlaMint corpus. Submitted to ParlaCLARIN@LREC.

## Metrics

| split | CER    | WER    |
|-------|--------|--------|
| dev   | 0.0335 | 0.1046 |
| test  | 0.0234 | 0.0761 |
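
The CER and WER values above are normalised edit distances between reference and hypothesis transcripts. As a minimal illustrative sketch (the helper names below are ours, not the evaluation code actually used for this table), they can be computed like this:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference length."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

print(wer("veliki broj poslovnih subjekata", "velik broj poslovnih subjekata"))  # 0.25
```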

## Usage in `transformers`

The following approach has worked with earlier checkpoints, but has not yet been re-tested with this model:
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch
import os

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# load the processor and model, and move the model to the target device
processor = Wav2Vec2Processor.from_pretrained("classla/wav2vec2-xls-r-parlaspeech-hr")
model = Wav2Vec2ForCTC.from_pretrained("classla/wav2vec2-xls-r-parlaspeech-hr").to(device)

# download the example wav file
os.system("wget https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/00020570a.flac.wav")

# read the wav file and prepare the model input
speech, sample_rate = sf.read("00020570a.flac.wav")
input_values = processor(speech, sampling_rate=sample_rate, return_tensors="pt").input_values.to(device)

# remove the downloaded wav file
os.remove("00020570a.flac.wav")

# retrieve logits without tracking gradients
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0]).lower()

# transcription: 'veliki broj poslovnih subjekata posluje sa minusom velik dio'
```
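
Since this checkpoint carries the `-lm` suffix, it presumably ships an n-gram language model for CTC beam-search decoding. Assuming the repository contains the files expected by `Wav2Vec2ProcessorWithLM` (an untested assumption on our part), LM-based decoding would replace the argmax step roughly as follows:

```python
# Sketch only: assumes this repo provides the pyctcdecode/KenLM files that
# Wav2Vec2ProcessorWithLM expects, and that `logits` was obtained as in the
# example above. Requires `pip install pyctcdecode`.
from transformers import Wav2Vec2ProcessorWithLM

processor = Wav2Vec2ProcessorWithLM.from_pretrained("5roop/wav2vec2-xls-r-parlaspeech-hr-lm")

# beam-search decode the logits with the language model instead of plain argmax
transcription = processor.batch_decode(logits.cpu().numpy()).text[0].lower()
```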
## Training hyperparameters

In fine-tuning, the following arguments were used:

| arg                           | value |
|-------------------------------|-------|
| `per_device_train_batch_size` | 16    |
| `gradient_accumulation_steps` | 4     |
| `num_train_epochs`            | 8     |
| `learning_rate`               | 3e-4  |
| `warmup_steps`                | 500   |
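
As a rough sketch, these values map onto `transformers.TrainingArguments` as below; `output_dir` and every argument not listed in the table are placeholders, not the original training configuration:

```python
from transformers import TrainingArguments

# Illustrative mapping of the table above onto TrainingArguments;
# "out" and all omitted arguments are placeholders, not the original setup.
training_args = TrainingArguments(
    output_dir="out",                  # placeholder
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,     # effective batch size: 16 * 4 = 64 per device
    num_train_epochs=8,
    learning_rate=3e-4,
    warmup_steps=500,
)
```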