Fine-tuned Wav2Vec2 XLS-R 1B model for ASR in French
This model is a fine-tuned version of facebook/wav2vec2-xls-r-1b on the MOZILLA-FOUNDATION/COMMON_VOICE_9_0 - FR dataset.
Usage
- To use on a local audio file without the language model
import torch
import torchaudio
from transformers import AutoModelForCTC, Wav2Vec2Processor
processor = Wav2Vec2Processor.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr")
model = AutoModelForCTC.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr").cuda()
# path to your audio file
wav_path = "example.wav"
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.squeeze(axis=0) # mono
# resample
if sample_rate != 16_000:
resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
waveform = resampler(waveform)
# normalize
input_dict = processor(waveform, sampling_rate=16_000, return_tensors="pt")
with torch.inference_mode():
logits = model(input_dict.input_values.to("cuda")).logits
# decode
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentence = processor.batch_decode(predicted_ids)[0]
- To use on a local audio file with the language model
import torch
import torchaudio
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM
processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr")
model = AutoModelForCTC.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr").cuda()
model_sampling_rate = processor_with_lm.feature_extractor.sampling_rate
# path to your audio file
wav_path = "example.wav"
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.squeeze(axis=0) # mono
# resample
if sample_rate != 16_000:
resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
waveform = resampler(waveform)
# normalize
input_dict = processor_with_lm(waveform, sampling_rate=16_000, return_tensors="pt")
with torch.inference_mode():
logits = model(input_dict.input_values.to("cuda")).logits
predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
Evaluation
- To evaluate on
mozilla-foundation/common_voice_9_0
python eval.py \
--model_id "bhuang/wav2vec2-xls-r-1b-cv9-fr" \
--dataset "mozilla-foundation/common_voice_9_0" \
--config "fr" \
--split "test" \
--log_outputs \
--outdir "outputs/results_mozilla-foundatio_common_voice_9_0_with_lm"
- To evaluate on
speech-recognition-community-v2/dev_data
python eval.py \
--model_id "bhuang/wav2vec2-xls-r-1b-cv9-fr" \
--dataset "speech-recognition-community-v2/dev_data" \
--config "fr" \
--split "validation" \
--chunk_length_s 5.0 \
--stride_length_s 1.0 \
--log_outputs \
--outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
- Downloads last month
- 12
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social
visibility and check back later, or deploy to Inference Endpoints (dedicated)
instead.
Datasets used to train bofenghuang/wav2vec2-xls-r-1b-cv9-fr
Evaluation results
- Test WER on Common Voice 9self-reported12.720
- Test WER (+LM) on Common Voice 9self-reported10.600
- Test WER on Robust Speech Event - Dev Dataself-reported24.280
- Test WER (+LM) on Robust Speech Event - Dev Dataself-reported20.850