---
language: es
datasets:
- common_voice
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
---

# Wav2Vec2-Large-XLSR-53-Spanish-With-LM

This is a copy of [Wav2Vec2-Large-XLSR-53-Spanish](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-spanish) with added language model support.

This model card can be seen as a demo for the [pyctcdecode](https://github.com/kensho-technologies/pyctcdecode) integration with Transformers introduced in [this PR](https://github.com/huggingface/transformers/pull/14339). The PR explains in detail how the integration works.

In a nutshell: the PR adds a new `Wav2Vec2ProcessorWithLM` class as a drop-in replacement for `Wav2Vec2Processor`.

The only change from the existing ASR pipeline will be:

```diff
import torch
from datasets import load_dataset
- from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

ds = load_dataset("common_voice", "es", split="test", streaming=True)
sample = next(iter(ds))

model = Wav2Vec2ForCTC.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm")
- processor = Wav2Vec2Processor.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm")
+ processor = Wav2Vec2ProcessorWithLM.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm")

# The model expects 16 kHz audio; resample the sample first if its sampling rate differs.
input_values = processor(sample["audio"]["array"], return_tensors="pt").input_values

logits = model(input_values).logits

- prediction_ids = torch.argmax(logits, dim=-1)
- transcription = processor.batch_decode(prediction_ids)
+ transcription = processor.batch_decode(logits.detach().numpy()).text

print(transcription)
```

| Model | WER | CER |
| ------------- | ------------- | ------------- |
| jonatasgrosman/wav2vec2-large-xlsr-53-spanish | **8.81%** | **2.70%** |
| pcuenq/wav2vec2-large-xlsr-53-es | 10.55% | 3.20% |
| facebook/wav2vec2-large-xlsr-53-spanish | 16.99% | 5.40% |
| mrm8488/wav2vec2-large-xlsr-53-spanish | 19.20% | 5.96% |
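
For readers unfamiliar with the WER and CER columns, the following is a minimal sketch of how both metrics are commonly defined: the edit distance between reference and hypothesis, counted over words (WER) or characters (CER), divided by the reference length. The function names and the example sentences are illustrative assumptions, not the evaluation script used to produce the numbers in the table, and real evaluations usually normalize the text (casing, punctuation) and aggregate errors over the whole test set.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp is i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example pair, chosen only for illustration.
reference = "hola como estas"
hypothesis = "hola como esta"

print(f"WER: {wer(reference, hypothesis):.2%}")  # → WER: 33.33%
print(f"CER: {cer(reference, hypothesis):.2%}")  # → CER: 6.67%
```

One wrong word out of three gives a WER of 33.33%, while the same mistake is a single missing character out of fifteen, so the CER is much lower. This is why CER values in the table are consistently smaller than the corresponding WER values.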