	---
	language: sv
	metrics:
	- wer
	tags:
	- audio
	- automatic-speech-recognition
	- speech
	- hf-asr-leaderboard
	- sv
	license: cc0-1.0
	datasets:
	- common_voice
	- NST_Swedish_ASR_Database
	- P4
	- The_Swedish_Culturomics_Gigaword_Corpus
	model-index:
	- name: Wav2vec 2.0 large VoxRex Swedish (C) with 4-gram
	  results:
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Common Voice 6.1
	      type: common_voice
	      args: sv-SE
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 6.4723
	---

	# KBLab's wav2vec 2.0 large VoxRex Swedish (C) with 4-gram model
	The acoustic model was trained by KBLab. See [VoxRex-C](https://huggingface.co/KBLab/wav2vec2-large-voxrex-swedish) for more details. This repo extends the acoustic model with a 4-gram language model estimated from social media text to improve transcription accuracy.

	## Model description
	VoxRex-C is extended with a 4-gram language model estimated from a subset extracted from [The Swedish Culturomics Gigaword Corpus](https://spraakbanken.gu.se/resurser/gigaword) from Språkbanken. The subset contains 40M words from the social media genre between 2010 and 2015.
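
To make the idea of a 4-gram model concrete, here is a minimal sketch that counts 4-grams in a toy corpus and estimates the probability of a word given the three words before it. The corpus and the function names are hypothetical and only for illustration. The estimate is a plain, unsmoothed relative frequency, whereas KenLM applies modified Kneser-Ney smoothing, so this shows the principle rather than the way this model was built.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token windows of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mle_probability(corpus, context, word, n=4):
    """Unsmoothed estimate of P(word | context) from raw n-gram counts."""
    full_counts = Counter()
    context_counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for gram in ngrams(tokens, n):
            full_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    context = tuple(context)
    if context_counts[context] == 0:
        # Unseen context: a real language model would back off to shorter n-grams
        return 0.0
    return full_counts[context + (word,)] / context_counts[context]


# Hypothetical toy corpus, uppercased like the language model's training text
toy_corpus = [
    'JAG GILLAR ATT LÄSA',
    'JAG GILLAR ATT SPELA',
    'JAG GILLAR ATT LÄSA BÖCKER',
]
p = mle_probability(toy_corpus, ['JAG', 'GILLAR', 'ATT'], 'LÄSA')
```

During decoding, probabilities of this kind are combined with the acoustic model's scores so that word sequences common in the text corpus are preferred over acoustically similar but unlikely ones.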

	## How to use
	#### Simple usage example with pipeline
	```python
	import torch
	from transformers import pipeline

	# Load the model. Using GPU if available
	model_name = 'viktor-enzell/wav2vec2-large-voxrex-swedish-4gram'
	device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
	pipe = pipeline(model=model_name, device=device)

	# Run inference on an audio file
	output = pipe('path/to/audio.mp3')['text']
	```

	#### More verbose usage example with audio pre-processing
	Example of transcribing 1% of the Common Voice test split. The model expects audio sampled at 16 kHz, so audio with any other sampling rate is resampled to 16 kHz first.

	```python
	from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
	from datasets import load_dataset
	import torch
	import torchaudio.functional as F

	# Import model and processor. Using GPU if available
	model_name = 'viktor-enzell/wav2vec2-large-voxrex-swedish-4gram'
	device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
	model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
	processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_name)

	# Import and process speech data
	common_voice = load_dataset('common_voice', 'sv-SE', split='test[:1%]')

	def speech_file_to_array(sample):
	    # Convert speech file to array and downsample to 16 kHz
	    sampling_rate = sample['audio']['sampling_rate']
	    sample['speech'] = F.resample(torch.tensor(sample['audio']['array']), sampling_rate, 16_000)
	    return sample

	common_voice = common_voice.map(speech_file_to_array)

	# Run inference
	inputs = processor(common_voice['speech'], sampling_rate=16_000, return_tensors='pt', padding=True).to(device)

	with torch.no_grad():
	    logits = model(**inputs).logits

	transcripts = processor.batch_decode(logits.cpu().numpy()).text
	```

	## Training procedure
	Text data for the n-gram model is pre-processed by removing characters that are not part of the wav2vec 2.0 vocabulary and uppercasing all remaining characters. Each pre-processed text sample is stored on its own line in a text file, from which a [KenLM](https://github.com/kpu/kenlm) model is estimated. See [this tutorial](https://huggingface.co/blog/wav2vec2-with-ngram) for more details.
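
The pre-processing described above could look roughly like the following sketch. The character set `EXAMPLE_VOCAB` and both function names are assumptions made for illustration. The actual vocabulary should be read from the acoustic model's tokenizer, and the original pre-processing script may differ in details such as whitespace handling.

```python
def normalize_for_lm(text, vocabulary):
    """Uppercase the text and drop characters outside the given vocabulary."""
    upper = text.upper()
    kept = ''.join(ch for ch in upper if ch in vocabulary or ch == ' ')
    # Collapse whitespace left behind by removed characters
    return ' '.join(kept.split())


# Hypothetical character set; the real one comes from the acoustic model's vocabulary
EXAMPLE_VOCAB = set('ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ')


def write_lm_corpus(samples, path, vocabulary=EXAMPLE_VOCAB):
    """Write one normalized sample per line, skipping samples that end up empty."""
    with open(path, 'w', encoding='utf-8') as f:
        for sample in samples:
            line = normalize_for_lm(sample, vocabulary)
            if line:
                f.write(line + '\n')
```

A file written this way can then be passed to KenLM's `lmplz` tool with `-o 4` to estimate a 4-gram model in ARPA format, as the linked tutorial does for its own corpus.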

	## Evaluation results
	The model was evaluated on the full Common Voice test set version 6.1. VoxRex-C achieved a WER of 9.03% without the language model and 6.47% with the language model.
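
For reference, word error rate is the number of word-level substitutions, deletions and insertions needed to turn the transcript into the reference, divided by the number of reference words. The evaluation script used for the figures above is not included here, so the following is only a minimal sketch of the metric under that standard definition, with a hypothetical function name. It also works out the relative improvement implied by the two reported numbers.

```python
def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, computed one row at a time
        previous = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            current = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                substitution = previous[j - 1] + (ref_word != hyp_word)
                current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
            previous = current
        total_edits += previous[-1]
        total_words += len(ref_words)
    return total_edits / total_words


# Relative improvement implied by the reported figures (in percent WER)
wer_without_lm, wer_with_lm = 9.03, 6.47
relative_reduction = (wer_without_lm - wer_with_lm) / wer_without_lm
```

With the reported values, the language model removes roughly 28% of the word errors made by the acoustic model alone.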