patrickvonplaten committed on
Commit 43b9e58
1 Parent(s): eaa092f

Update README.md

Files changed (1)
  1. README.md +19 -147

README.md CHANGED
@@ -11,170 +11,42 @@ tags:
  - speech
  - xlsr-fine-tuning-week
  license: apache-2.0
- model-index:
- - name: XLSR Wav2Vec2 Spanish by Jonatas Grosman
-   results:
-   - task:
-       name: Speech Recognition
-       type: automatic-speech-recognition
-     dataset:
-       name: Common Voice es
-       type: common_voice
-       args: es
-     metrics:
-     - name: Test WER
-       type: wer
-       value: 8.81
-     - name: Test CER
-       type: cer
-       value: 2.70
  ---

- # Wav2Vec2-Large-XLSR-53-Spanish
-
- Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Spanish using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.
- When using this model, make sure that your speech input is sampled at 16 kHz.
-
- This model was fine-tuned thanks to GPU credits generously provided by [OVHcloud](https://www.ovhcloud.com/en/public-cloud/ai-training/) :)
-
- The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint
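
As noted above, inputs must be sampled at 16 kHz. If your recordings use a different rate, you can resample them on load; a minimal sketch using librosa (the file path is a placeholder):

```python
import librosa

# librosa resamples to the requested rate on load;
# XLSR-53 models expect 16 kHz mono input
speech_array, sampling_rate = librosa.load("/path/to/file.mp3", sr=16_000)
```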
- ## Usage
-
- The model can be used directly (without a language model) as follows...
-
- Using the [ASRecognition](https://github.com/jonatasgrosman/asrecognition) library:
-
- ```python
- from asrecognition import ASREngine
-
- asr = ASREngine("es", model_path="jonatasgrosman/wav2vec2-large-xlsr-53-spanish")
-
- audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
- transcriptions = asr.transcribe(audio_paths)
- ```
-
- Writing your own inference script:
-
- ```python
- import torch
- import librosa
- from datasets import load_dataset
  from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

- LANG_ID = "es"
- MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-spanish"
- SAMPLES = 10
-
- test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
-
- processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
- model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
-
- # Preprocessing the datasets.
- # We need to read the audio files as arrays
- def speech_file_to_array_fn(batch):
-     speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
-     batch["speech"] = speech_array
-     batch["sentence"] = batch["sentence"].upper()
-     return batch
-
- test_dataset = test_dataset.map(speech_file_to_array_fn)
- inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
-
- with torch.no_grad():
-     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
-
- predicted_ids = torch.argmax(logits, dim=-1)
- predicted_sentences = processor.batch_decode(predicted_ids)
-
- for i, predicted_sentence in enumerate(predicted_sentences):
-     print("-" * 100)
-     print("Reference:", test_dataset[i]["sentence"])
-     print("Prediction:", predicted_sentence)
- ```
-
- | Reference | Prediction |
- | ------------- | ------------- |
- | HABITA EN AGUAS POCO PROFUNDAS Y ROCOSAS. | HABITAN AGUAS POCO PROFUNDAS Y ROCOSAS |
- | OPERA PRINCIPALMENTE VUELOS DE CABOTAJE Y REGIONALES DE CARGA. | OPERA PRINCIPALMENTE VUELO DE CARBOTAJES Y REGIONALES DE CARGAN |
- | PARA VISITAR CONTACTAR PRIMERO CON LA DIRECCIÓN. | PARA VISITAR CONTACTAR PRIMERO CON LA DIRECCIÓN |
- | TRES | TRES |
- | REALIZÓ LOS ESTUDIOS PRIMARIOS EN FRANCIA, PARA CONTINUAR LUEGO EN ESPAÑA. | REALIZÓ LOS ESTUDIOS PRIMARIOS EN FRANCIA PARA CONTINUAR LUEGO EN ESPAÑA |
- | EN LOS AÑOS QUE SIGUIERON, ESTE TRABAJO ESPARTA PRODUJO DOCENAS DE BUENOS JUGADORES. | EN LOS AÑOS QUE SIGUIERON ESTE TRABAJO ESPARTA PRODUJO DOCENA DE BUENOS JUGADORES |
- | SE ESTÁ TRATANDO DE RECUPERAR SU CULTIVO EN LAS ISLAS CANARIAS. | SE ESTÓ TRATANDO DE RECUPERAR SU CULTIVO EN LAS ISLAS CANARIAS |
- | SÍ | SÍ |
- | "FUE ""SACADA"" DE LA SERIE EN EL EPISODIO ""LEAD"", EN QUE ALEXANDRA CABOT REGRESÓ." | FUE SACADA DE LA SERIE EN EL EPISODIO LEED EN QUE ALEXANDRA KAOT REGRESÓ |
- | SE UBICAN ESPECÍFICAMENTE EN EL VALLE DE MOKA, EN LA PROVINCIA DE BIOKO SUR. | SE UBICAN ESPECÍFICAMENTE EN EL VALLE DE MOCA EN LA PROVINCIA DE PÍOCOSUR |
-
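The argmax in the script above performs greedy CTC decoding: batch_decode collapses consecutive repeated ids and strips the blank/padding token before mapping ids back to characters. A toy illustration of that collapse step (not the tokenizer's actual implementation):

```python
import itertools

def greedy_ctc_collapse(ids, blank_id=0):
    # collapse consecutive repeats, then drop CTC blank tokens
    return [i for i, _ in itertools.groupby(ids) if i != blank_id]

print(greedy_ctc_collapse([0, 5, 5, 0, 3, 3, 3, 0]))  # -> [5, 3]
```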
- ## Evaluation
-
- The model can be evaluated as follows on the Spanish test data of Common Voice.
-
- ```python
- import torch
- import re
- import warnings
- import librosa
- from datasets import load_dataset, load_metric
- from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
-
- LANG_ID = "es"
- MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-spanish"
- DEVICE = "cuda"
-
- CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
-                    "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
-                    "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
-                    "、", "﹂", "﹁", "‧", "~", "﹏", ",", "{", "}", "(", ")", "[", "]", "【", "】", "‥", "〽",
-                    "『", "』", "〝", "〟", "⟨", "⟩", "〜", ":", "!", "?", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]
-
- test_dataset = load_dataset("common_voice", LANG_ID, split="test")
-
- wer = load_metric("wer.py")  # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
- cer = load_metric("cer.py")  # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py
-
- chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"
-
- processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
- model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
- model.to(DEVICE)
-
- # Preprocessing the datasets.
- # We need to read the audio files as arrays and normalize the transcripts.
- def speech_file_to_array_fn(batch):
-     with warnings.catch_warnings():
-         warnings.simplefilter("ignore")
-         speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
-     batch["speech"] = speech_array
-     batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
-     return batch
-
- test_dataset = test_dataset.map(speech_file_to_array_fn)

- # Run batched inference and collect the predicted strings.
- def evaluate(batch):
-     inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

-     with torch.no_grad():
-         logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

-     pred_ids = torch.argmax(logits, dim=-1)
-     batch["pred_strings"] = processor.batch_decode(pred_ids)
-     return batch

- result = test_dataset.map(evaluate, batched=True, batch_size=8)

- predictions = [x.upper() for x in result["pred_strings"]]
- references = [x.upper() for x in result["sentence"]]

- print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
- print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
  ```
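
As a sanity check on the metric itself: WER counts word-level substitutions (S), deletions (D), and insertions (I) over the number of reference words (N), i.e. WER = (S + D + I) / N. A quick illustration with the third-party jiwer package (not the wer.py script used above):

```python
import jiwer  # pip install jiwer

reference = "HABITA EN AGUAS POCO PROFUNDAS Y ROCOSAS"
hypothesis = "HABITAN AGUAS POCO PROFUNDAS Y ROCOSAS"

# one substitution (HABITA -> HABITAN) and one deletion (EN)
# over 7 reference words: WER = (1 + 1 + 0) / 7 ≈ 0.286
print(jiwer.wer(reference, hypothesis))
```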

- **Test Result**:
-
- In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the evaluation script described above on other models as well (on 2021-04-22). Note that the table below may show results different from those reported elsewhere, likely due to specifics of the other evaluation scripts used.

  | Model | WER | CER |
  | ------------- | ------------- | ------------- |
 
  - speech
  - xlsr-fine-tuning-week
  license: apache-2.0
  ---

+ # Wav2Vec2-Large-XLSR-53-Spanish-With-LM

+ This is a model copy of [Wav2Vec2-Large-XLSR-53-Spanish](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-spanish)
+ that has language model support.

+ This model card can be seen as a demo for the [pyctcdecode](https://github.com/kensho-technologies/pyctcdecode) integration
+ with Transformers led by [this PR](https://github.com/huggingface/transformers/pull/14339). The PR explains in detail how the
+ integration works.

+ In a nutshell: This PR adds a new Wav2Vec2ProcessorWithLM class as a drop-in replacement for Wav2Vec2Processor.

+ The only change from the existing ASR pipeline will be:
+ ```diff
+ import torch
  from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+ +from transformers import Wav2Vec2ProcessorWithLM
+ from datasets import load_dataset

+ ds = load_dataset("common_voice", "es", split="test", streaming=True)

+ sample = next(iter(ds))

+ model = Wav2Vec2ForCTC.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm")
+ -processor = Wav2Vec2Processor.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm")
+ +processor = Wav2Vec2ProcessorWithLM.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm")

+ input_values = processor(sample["audio"]["array"], return_tensors="pt").input_values

+ logits = model(input_values).logits

+ -prediction_ids = torch.argmax(logits, dim=-1)
+ -transcription = processor.batch_decode(prediction_ids)
+ +transcription = processor.batch_decode(logits.numpy()).text

+ print(transcription)
  ```
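
Under the hood, Wav2Vec2ProcessorWithLM hands the CTC logits to pyctcdecode, which runs a beam search scored by a KenLM n-gram model. A minimal sketch of that lower-level API (the label set and the lm.arpa path are placeholders; real logits would come from the model above):

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# placeholder CTC vocabulary; must match the acoustic model's tokenizer
labels = ["", "a", "b", "c", " "]

# hypothetical path to a KenLM n-gram model trained on Spanish text
decoder = build_ctcdecoder(labels, kenlm_model_path="lm.arpa")

# stand-in for model(input_values).logits[0] as a (time, vocab) array
logits = np.log(np.random.dirichlet(np.ones(len(labels)), size=50))
print(decoder.decode(logits))
```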

  | Model | WER | CER |
  | ------------- | ------------- | ------------- |