---
license: apache-2.0
tags:
- automatic-speech-recognition
- fi
- finnish
library_name: transformers
language: fi
base_model:
- GetmanY1/wav2vec2-large-fi-150k
model-index:
- name: wav2vec2-large-fi-150k-finetuned
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Lahjoita puhetta (Donate Speech)
      type: lahjoita-puhetta
      args: fi
    metrics:
    - name: Dev WER
      type: wer
      value: 15.34
    - name: Dev CER
      type: cer
      value: 4.14
    - name: Test WER
      type: wer
      value: 16.86
    - name: Test CER
      type: cer
      value: 5.07
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Finnish Parliament
      type: FinParl
      args: fi
    metrics:
    - name: Dev16 WER
      type: wer
      value: 11.3
    - name: Dev16 CER
      type: cer
      value: 4.75
    - name: Test16 WER
      type: wer
      value: 8.29
    - name: Test16 CER
      type: cer
      value: 3.34
    - name: Test20 WER
      type: wer
      value: 6.94
    - name: Test20 CER
      type: cer
      value: 2.15
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 16.1
      type: mozilla-foundation/common_voice_16_1
      args: fi
    metrics:
    - name: Dev WER
      type: wer
      value: 7.17
    - name: Dev CER
      type: cer
      value: 1.11
    - name: Test WER
      type: wer
      value: 5.86
    - name: Test CER
      type: cer
      value: 0.91
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: FLEURS
      type: google/fleurs
      args: fi_fi
    metrics:
    - name: Dev WER
      type: wer
      value: 9.2
    - name: Dev CER
      type: cer
      value: 5.23
    - name: Test WER
      type: wer
      value: 10.69
    - name: Test CER
      type: cer
      value: 5.79
---

# Finnish Wav2vec2-Large ASR

[GetmanY1/wav2vec2-large-fi-150k](https://huggingface.co./GetmanY1/wav2vec2-large-fi-150k) fine-tuned on 4600 hours of Finnish speech sampled at 16 kHz:

* 1500 hours of [Lahjoita puhetta (Donate Speech)](https://link.springer.com/article/10.1007/s10579-022-09606-3) (colloquial Finnish)
* 3100 hours of the [Finnish Parliament dataset](https://link.springer.com/article/10.1007/s10579-023-09650-7)

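
Because the fine-tuning audio is sampled at 16 kHz, it can be useful to check a recording's sample rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module. The helper name `check_sample_rate` and the constant `TARGET_SAMPLE_RATE` are hypothetical, and the check only covers uncompressed WAV files; other formats would need a different reader.

```python
import wave

# Sample rate of the fine-tuning audio, as stated in this model card.
TARGET_SAMPLE_RATE = 16_000


def check_sample_rate(path, expected=TARGET_SAMPLE_RATE):
    """Return True if the WAV file at `path` is sampled at `expected` Hz.

    Hypothetical helper for illustration; it reads only the WAV header
    and does not resample anything.
    """
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == expected
```

A call such as `check_sample_rate("recording.wav")` (with a file name of your own) returns `False` for audio that would need resampling first.
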
When using the model, make sure that your speech input is also sampled at 16 kHz.

## Model description

The Finnish Wav2Vec2 Large has the same architecture and uses the same training objective as the English and multilingual models described in the [wav2vec 2.0 paper](https://arxiv.org/abs/2006.11477).

[GetmanY1/wav2vec2-large-fi-150k](https://huggingface.co./GetmanY1/wav2vec2-large-fi-150k) is a large-scale, 317-million-parameter monolingual model pre-trained on 158k hours of unlabeled Finnish speech, including [KAVI radio and television archive materials](https://kavi.fi/en/radio-ja-televisioarkistointia-vuodesta-2008/), Lahjoita puhetta (Donate Speech), Finnish Parliament, and Finnish VoxPopuli.

You can read more about the pre-trained model in [this paper](TODO). The training scripts are available on [GitHub](https://github.com/aalto-speech/large-scale-monolingual-speech-foundation-models).

## Intended uses

You can use this model for Finnish ASR (speech-to-text).

### How to use

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset, Audio
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("GetmanY1/wav2vec2-large-fi-150k-finetuned")
model = Wav2Vec2ForCTC.from_pretrained("GetmanY1/wav2vec2-large-fi-150k-finetuned")

# load the Finnish Common Voice test split and resample the audio to 16 kHz on access
ds = load_dataset("mozilla-foundation/common_voice_16_1", "fi", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# extract input features (batch size 1)
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values

# retrieve logits (no gradients are needed for inference)
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```

## Team Members

- Yaroslav Getman, [Hugging Face profile](https://huggingface.co./GetmanY1), [LinkedIn profile](https://www.linkedin.com/in/yaroslav-getman/)
- Tamas Grosz, [Hugging Face profile](https://huggingface.co./Grosy), [LinkedIn profile](https://www.linkedin.com/in/tam%C3%A1s-gr%C3%B3sz-950a049a/)

Feel free to contact us for more details 🤗
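
## Evaluation metrics

The metadata of this card reports word error rate (WER) and character error rate (CER). Both are edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. The sketch below illustrates the general definition in plain Python. It is not the evaluation script behind the reported numbers, and details such as text normalization or whether spaces count as characters are assumptions here. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items yields an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent; assumes a non-empty reference."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent; spaces are counted as characters here."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)
```

For example, `wer("hyvää huomenta kaikille", "hyvää huomenta kaikile")` counts one wrong word out of three, while `cer` on the same pair counts one missing character out of 23.
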