Commit e6715a8 by jlehecka
1 Parent(s): 0e4ea41

Create README.md

Files changed (1)
  1. README.md +75 -0
README.md ADDED
@@ -0,0 +1,75 @@
+ ---
+ language:
+ - en
+ - de
+ tags:
+ - English
+ - German
+ - bilingual
+ - KKY
+ - FAV
+ license: cc-by-nc-sa-4.0
+ ---
+
+ # wav2vec2-base-en-de-100k
+ This is a bilingual Wav2Vec 2.0 base model pre-trained on 100 thousand hours of speech (50 thousand hours of English and 50 thousand hours of German).
+ It was released along with the paper **A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for
+ Automatic Speech Recognition in Multilingual Oral History Archives**, accepted to the INTERSPEECH 2024 conference.
+
+ ## Paper
+ The pre-print of our paper is available at http://arxiv.org/abs/2407.17160.
+
+ ### All pre-trained models released along with the paper
+ - [fav-kky/wav2vec2-base-cs-50k](https://huggingface.co/fav-kky/wav2vec2-base-cs-50k) (monolingual Czech)
+ - [fav-kky/wav2vec2-base-de-50k](https://huggingface.co/fav-kky/wav2vec2-base-de-50k) (monolingual German)
+ - [fav-kky/wav2vec2-base-cs-en-100k](https://huggingface.co/fav-kky/wav2vec2-base-cs-en-100k) (bilingual Czech+English)
+ - [fav-kky/wav2vec2-base-cs-de-100k](https://huggingface.co/fav-kky/wav2vec2-base-cs-de-100k) (bilingual Czech+German)
+ - [fav-kky/wav2vec2-base-en-de-100k](https://huggingface.co/fav-kky/wav2vec2-base-en-de-100k) (bilingual English+German)
+ - [fav-kky/wav2vec2-base-cs-en-de-150k](https://huggingface.co/fav-kky/wav2vec2-base-cs-en-de-150k) (trilingual Czech+English+German)
+
+ ## Citation
+ If you find this model useful, please cite our paper:
+ ```
+ @inproceedings{lehecka2024bitrilingual,
+   title = {{A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives}},
+   author = {
+     Jan Lehe\v{c}ka and
+     Josef V. Psutka and
+     Lubo\v{s} \v{S}m\'{i}dl and
+     Pavel Ircing and
+     Josef Psutka
+   },
+   booktitle={Proc. Interspeech 2024},
+   note={In Press},
+   year={2024},
+   url={https://arxiv.org/abs/2407.17160},
+ }
+ ```
+
+ ## Usage
+ This model does not have a tokenizer, as it was pre-trained on audio alone.
+ To use this model for speech recognition, a tokenizer should be created
+ and the model should be [fine-tuned](https://huggingface.co/blog/fine-tune-wav2vec2-english) on labeled ASR data.
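
The tokenizer step mentioned above usually starts from a character-level vocabulary built from the fine-tuning transcripts. The following is a minimal sketch of that idea, not the authors' recipe: the function name `build_char_vocab`, the example transcripts, and the special token names `[UNK]`, `[PAD]`, and `|` are assumptions chosen for illustration.

```python
import json


def build_char_vocab(transcripts):
    """Map every character seen in the transcripts to an integer id."""
    chars = sorted(set("".join(transcripts)))
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    # Replace the space with a visible word delimiter (an assumed convention).
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")
    # Assumed special tokens: unknown symbol and padding
    # (in CTC fine-tuning the padding id typically doubles as the blank).
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


# Hypothetical German and English transcripts, for illustration only.
transcripts = ["guten morgen", "good morning"]
vocab = build_char_vocab(transcripts)

# Serialized form; in practice this would be written to a vocab.json file.
vocab_json = json.dumps(vocab, ensure_ascii=False)
```

A vocabulary file like this can then be passed to `Wav2Vec2CTCTokenizer` from `transformers` before fine-tuning; the linked blog post covers the remaining steps.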
+
+ Inputs must be 16 kHz mono audio files.
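
One way to check the input requirement before running the model is to inspect the file header. This sketch uses only Python's standard `wave` module, so it assumes uncompressed PCM WAV input; the helper names `is_16khz_mono` and `make_silent_wav` are hypothetical, and the second one exists only to exercise the check without a real recording.

```python
import io
import wave


def is_16khz_mono(wav_file):
    """Return True if the PCM WAV file has one channel sampled at 16 kHz.

    `wav_file` may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() == 16_000


def make_silent_wav(channels, sample_rate, num_frames=160):
    """Build a small in-memory 16-bit PCM WAV file filled with silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * channels * num_frames)
    buffer.seek(0)
    return buffer


print(is_16khz_mono(make_silent_wav(1, 16_000)))  # mono, 16 kHz
print(is_16khz_mono(make_silent_wav(2, 44_100)))  # stereo, 44.1 kHz
```

Files that fail the check need to be downmixed and resampled with an audio library before they are passed to the feature extractor.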
+
+ This model can be used, e.g., to extract per-frame contextual embeddings from audio:
+ ```python
+ from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
+ import torchaudio
+
+ feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("fav-kky/wav2vec2-base-en-de-100k")
+ model = Wav2Vec2Model.from_pretrained("fav-kky/wav2vec2-base-en-de-100k")
+
+ speech_array, sampling_rate = torchaudio.load("/path/to/audio/file.wav")  # tensor of shape (channels, samples)
+ inputs = feature_extractor(
+     speech_array,
+     sampling_rate=sampling_rate,  # the feature extractor raises an error if this is not 16 kHz
+     return_tensors="pt"
+ )["input_values"][0]
+
+ output = model(inputs)
+ embeddings = output.last_hidden_state.detach().numpy()[0]  # array of shape (frames, hidden_size)
+ ```
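
To make "per-frame" concrete, the number of embedding frames can be estimated from the number of audio samples. The sketch below assumes this checkpoint uses the standard Wav2Vec 2.0 base feature encoder (seven unpadded convolutional layers with the kernel sizes and strides listed in the code); that configuration is not stated in this README, and the function name `num_output_frames` is hypothetical.

```python
# Assumed feature-encoder layout of the standard Wav2Vec 2.0 base architecture.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples):
    """Number of frames produced for `num_samples` input samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of a 1-D convolution without padding.
        length = (length - kernel) // stride + 1
    return length


print(num_output_frames(16_000))  # one second of 16 kHz audio -> 49 frames
print(num_output_frames(32_000))  # two seconds -> 99 frames
```

Under this assumption, each frame covers roughly 20 ms of audio. The actual values can be read from `model.config.conv_kernel` and `model.config.conv_stride` to confirm them for this checkpoint.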
+
+ ## Related works