
	---
	language: hr
	datasets:
	- parlaspeech-hr
	tags:
	- audio
	- automatic-speech-recognition
	- parlaspeech
	widget:
	- example_title: example 1
	  src: https://huggingface.co./5roop/wav2vec2-xls-r-parlaspeech-hr-lm/raw/main/1800.m4a
	- example_title: example 2
	  src: https://huggingface.co./5roop/wav2vec2-xls-r-parlaspeech-hr-lm/raw/main/00020578b.flac.wav
	- example_title: example 3
	  src: https://huggingface.co./5roop/wav2vec2-xls-r-parlaspeech-hr-lm/raw/main/00020570a.flac.wav
	---

	# wav2vec2-xls-r-parlaspeech-hr-lm

	This Croatian ASR model is based on the [facebook/wav2vec2-xls-r-300m model](https://huggingface.co./facebook/wav2vec2-xls-r-300m) and was fine-tuned on 300 hours of recordings and transcripts from the Croatian parliamentary ASR dataset [ParlaSpeech-HR v1.0](http://hdl.handle.net/11356/1494).

	The efforts resulting in this model were coordinated by Nikola Ljubešić. The rough manual data alignment was performed by Ivo-Pavao Jazbec, and the method for fine automatic data alignment from [Plüss et al.](https://arxiv.org/abs/2010.02810) was applied by Vuk Batanović and Lenka Bajčetić. The transcripts were normalised by Danijel Koržinek, and the final modelling was performed by Peter Rupnik.

	If you use this model, please cite the following paper:

	Nikola Ljubešić, Danijel Koržinek, Peter Rupnik, Ivo-Pavao Jazbec. ParlaSpeech-HR -- a freely available ASR dataset for Croatian bootstrapped from the ParlaMint corpus. Submitted to ParlaCLARIN@LREC.

	## Metrics

	| split | CER | WER |
	|---|---|---|
	| dev | 0.0335 | 0.1046 |
	| test | 0.0234 | 0.0761 |
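
For readers unfamiliar with these metrics, the sketch below shows how character error rate (CER) and word error rate (WER) are conventionally defined: the Levenshtein edit distance between reference and hypothesis, divided by the reference length in characters or words. This card does not state which tool or text normalisation produced the scores above, so the functions `edit_distance`, `wer`, and `cer`, and the example sentence pair, are illustrative assumptions and not the evaluation code that was used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical reference/hypothesis pair, for illustration only.
reference = "veliki broj poslovnih subjekata"
hypothesis = "veliki broj poslovni subjekata"

print(wer(reference, hypothesis))  # one wrong word out of four
print(cer(reference, hypothesis))  # one missing character
```

On this reading, a test WER of 0.0761 corresponds to roughly 7.6 word errors per 100 reference words.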


	## Usage in `transformers`

	The following approach worked previously but has not yet been re-tested:

	```python
	from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
	import soundfile as sf
	import torch
	import os

	device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

	# load the model and processor
	processor = Wav2Vec2Processor.from_pretrained(
	    "classla/wav2vec2-xls-r-parlaspeech-hr")
	model = Wav2Vec2ForCTC.from_pretrained(
	    "classla/wav2vec2-xls-r-parlaspeech-hr")


	# download the example wav files:
	os.system("wget https://huggingface.co./classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/00020570a.flac.wav")

	# read the wav file
	speech, sample_rate = sf.read("00020570a.flac.wav")
	input_values = processor(speech, sampling_rate=sample_rate, return_tensors="pt").input_values.to(device)

	# remove the raw wav file
	os.system("rm 00020570a.flac.wav")

	# retrieve logits (gradients are not needed for inference)
	with torch.no_grad():
	    logits = model.to(device)(input_values).logits

	# take argmax and decode
	predicted_ids = torch.argmax(logits, dim=-1)
	transcription = processor.decode(predicted_ids[0]).lower()

	# transcription: 'veliki broj poslovnih subjekata posluje sa minusom velik dio'
	```
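
The "take argmax and decode" step above is greedy CTC decoding: the most likely token is picked for every audio frame, consecutive repeats are merged, and the blank token is dropped. The sketch below illustrates that collapse rule on a toy example. The vocabulary, the blank id of 0, and the helper name `ctc_greedy_collapse` are assumptions made for illustration; in the snippet above this work is done by `processor.decode` with the model's own vocabulary.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeated ids, then drop blanks (greedy CTC rule)."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Toy vocabulary (hypothetical, not the model's real one); id 0 is the blank.
vocab = {0: "<pad>", 1: "d", 2: "a", 3: "n"}

# One id per audio frame, as torch.argmax(logits, dim=-1) would produce.
frame_ids = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]

text = "".join(vocab[i] for i in ctc_greedy_collapse(frame_ids))
print(text)  # dan
```

A blank between two identical ids keeps them as separate characters, which is how CTC can emit doubled letters.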



	## Training hyperparameters

	The following arguments were used for fine-tuning:

	| arg | value |
	|-------------------------------|-------|
	| `per_device_train_batch_size` | 16 |
	| `gradient_accumulation_steps` | 4 |
	| `num_train_epochs` | 8 |
	| `learning_rate` | 3e-4 |
	| `warmup_steps` | 500 |
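
As a rough aid to reading the table, the sketch below derives two quantities from these arguments: the effective batch size per optimizer step and the learning rate during warmup. The card does not state the number of devices or the scheduler type, so `num_devices=1` and a linear warmup are assumptions, and the helper names are hypothetical. Any decay after warmup is not modelled.

```python
# Values copied from the table above.
training_args = {
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 8,
    "learning_rate": 3e-4,
    "warmup_steps": 500,
}


def effective_batch_size(args, num_devices=1):
    """Examples contributing to each optimizer step (num_devices is assumed)."""
    return (args["per_device_train_batch_size"]
            * args["gradient_accumulation_steps"]
            * num_devices)


def warmup_learning_rate(step, args):
    """Learning rate under an assumed linear warmup; constant afterwards."""
    fraction = min(1.0, step / args["warmup_steps"])
    return args["learning_rate"] * fraction


print(effective_batch_size(training_args))       # 64 on a single device
print(warmup_learning_rate(250, training_args))  # halfway through warmup
```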