fav-kky/wav2vec2-base-en-de-100k

jlehecka committed on Jul 25

Commit e6715a8 • 1 Parent(s): 0e4ea41

Create README.md

Files changed (1)

README.md +75 -0

README.md ADDED

	@@ -0,0 +1,75 @@

+---
+language:
+- en
+- de
+tags:
+- English
+- German
+- bilingual
+- KKY
+- FAV
+license: cc-by-nc-sa-4.0
+---
+# wav2vec2-base-en-de-100k
+This is a bilingual Wav2Vec 2.0 base model pre-trained on 100 thousand hours of speech (50 thousand hours of English and 50 thousand hours of German).
+It was released along with the paper **A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for
+Automatic Speech Recognition in Multilingual Oral History Archives**, accepted to the INTERSPEECH 2024 conference.
+## Paper
+The pre-print of our paper is available at http://arxiv.org/abs/2407.17160.
+### All pre-trained models released along with the paper
+- [fav-kky/wav2vec2-base-cs-50k](https://huggingface.co/fav-kky/wav2vec2-base-cs-50k) (monolingual Czech)
+- [fav-kky/wav2vec2-base-de-50k](https://huggingface.co/fav-kky/wav2vec2-base-de-50k) (monolingual German)
+- [fav-kky/wav2vec2-base-cs-en-100k](https://huggingface.co/fav-kky/wav2vec2-base-cs-en-100k) (bilingual Czech+English)
+- [fav-kky/wav2vec2-base-cs-de-100k](https://huggingface.co/fav-kky/wav2vec2-base-cs-de-100k) (bilingual Czech+German)
+- [fav-kky/wav2vec2-base-en-de-100k](https://huggingface.co/fav-kky/wav2vec2-base-en-de-100k) (bilingual English+German)
+- [fav-kky/wav2vec2-base-cs-en-de-150k](https://huggingface.co/fav-kky/wav2vec2-base-cs-en-de-150k) (trilingual Czech+English+German)
+## Citation
+If you find this model useful, please cite our paper:
+```
+@inproceedings{lehecka2024bitrilingual,
+  title = {{A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives}},
+  author = {
+    Jan Lehe\v{c}ka and
+    Josef V. Psutka and
+    Lubo\v{s} \v{S}m\'{i}dl and
+    Pavel Ircing and
+    Josef Psutka
+  },
+  booktitle={Proc. Interspeech 2024},
+  note={In Press},
+  year={2024},
+  url={https://arxiv.org/abs/2407.17160},
+}
+```
+## Usage
+This model does not have a tokenizer as it was pretrained on audio alone.
+In order to use this model for speech recognition, a tokenizer should be created
+and the model should be [fine-tuned](https://huggingface.co/blog/fine-tune-wav2vec2-english) on labeled ASR data.
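A minimal sketch of the tokenizer-creation step, assuming character-level CTC targets as in the linked fine-tuning guide. The example transcripts, the `build_ctc_vocab` helper, and the choice of `|`, `[UNK]`, and `[PAD]` as special tokens are illustrative assumptions, not something shipped with this model:

```python
import json

# Hypothetical transcripts standing in for a labeled ASR training set.
transcripts = ["hello world", "guten tag"]


def build_ctc_vocab(texts):
    """Map every character seen in the transcripts to an integer id."""
    chars = sorted(set("".join(texts)))
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    # Assumption: "|" marks word boundaries instead of a literal space.
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")
    # Assumption: extra tokens for unknown characters and CTC padding/blank.
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


vocab = build_ctc_vocab(transcripts)
# Serialized form that could be written to a vocabulary file.
vocab_json = json.dumps(vocab, ensure_ascii=False)
print(vocab_json)
```

The resulting JSON could be saved to a file and loaded with `Wav2Vec2CTCTokenizer` (with matching `unk_token`, `pad_token`, and `word_delimiter_token`) before fine-tuning.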
+Inputs must be 16 kHz mono audio files.
+This model can be used, e.g., to extract per-frame contextual embeddings from audio:
+```python
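+import torch
+import torchaudio
+from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
+# Load the pre-trained feature extractor and model
+feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("fav-kky/wav2vec2-base-en-de-100k")
+model = Wav2Vec2Model.from_pretrained("fav-kky/wav2vec2-base-en-de-100k")
+# speech_array has shape (channels, samples); the file must be 16 kHz mono
+speech_array, sampling_rate = torchaudio.load("/path/to/audio/file.wav")
+# Pass the file's actual sampling rate so a non-16 kHz file is rejected instead of silently misread
+inputs = feature_extractor(
+    speech_array[0].numpy(),
+    sampling_rate=sampling_rate,
+    return_tensors="pt"
+)["input_values"]
+# Inference only, so skip gradient tracking
+with torch.no_grad():
+    output = model(inputs)
+# Per-frame contextual embeddings, shape (frames, hidden_size)
+embeddings = output.last_hidden_state.numpy()[0]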
+```
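To relate the length of `embeddings` to the audio duration, the sketch below computes the expected number of frames. It assumes the default Wav2Vec 2.0 base convolutional feature encoder (seven unpadded layers with the kernel sizes and strides listed); this model's actual values should be checked in its `config.json`. The `num_frames` helper is hypothetical:

```python
# Assumption: default Wav2Vec 2.0 base feature-encoder layout
# (kernel sizes and strides of the seven 1-D convolution layers).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Number of output frames for a waveform of `num_samples` samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output-length formula for a convolution without padding.
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio
print(num_frames(16_000))  # 49
```

Under these assumptions, consecutive frames are 320 samples (20 ms) apart, so a 10-second recording yields 499 frames.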
+## Related works