---
language:
  - eo
license: apache-2.0
tags:
  - automatic-speech-recognition
  - mozilla-foundation/common_voice_13_0
  - generated_from_trainer
metrics:
  - wer
model-index:
  - name: wav2vec2-common_voice_13_0-eo-3
    results: []
---

wav2vec2-common_voice_13_0-eo-3, an Esperanto speech recognizer

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on the mozilla-foundation/common_voice_13_0 Esperanto dataset. It achieves the following results on the evaluation set:

  • Loss: 0.2191
  • Cer: 0.0208
  • Wer: 0.0687
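For context, WER and CER are word- and character-level error rates. Below is a minimal sketch of how such numbers are computed with the evaluate library; the transcripts in it are made up for illustration:

import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Made-up example: one reference transcript and one hypothesis with a single wrong word.
references = ["ĉu vi ŝatas la muzikon"]
predictions = ["ĉu vi satas la muzikon"]

print(wer_metric.compute(predictions=predictions, references=references))  # 1 of 5 words wrong -> 0.2
print(cer_metric.compute(predictions=predictions, references=references))  # 1 character wrong -> roughly 0.05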

Model description

See facebook/wav2vec2-large-xlsr-53.

Intended uses & limitations

Speech recognition for Esperanto. The base model was pretrained and fine-tuned on 16 kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16 kHz.
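A minimal inference sketch (the Hub id xekri/wav2vec2-common_voice_13_0-eo-3 and the audio file path are assumptions for illustration):

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "xekri/wav2vec2-common_voice_13_0-eo-3"  # assumed Hub id
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

speech, sample_rate = torchaudio.load("example.wav")  # illustrative path
if sample_rate != 16_000:
    # The model expects 16 kHz input, so resample if necessary.
    speech = torchaudio.functional.resample(speech, sample_rate, 16_000)

inputs = processor(speech.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])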

Training and evaluation data

The training split was set to train[:15000] while the eval split was set to validation[:1500].
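For reference, the same slices can be loaded directly with the datasets library. A sketch; Common Voice 13.0 is gated on the Hub, so you may need to accept its terms and log in first:

from datasets import load_dataset

# Same slices as used for training and evaluation.
train_ds = load_dataset("mozilla-foundation/common_voice_13_0", "eo", split="train[:15000]")
eval_ds = load_dataset("mozilla-foundation/common_voice_13_0", "eo", split="validation[:1500]")
print(len(train_ds), len(eval_ds))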

Training procedure

I used run_speech_recognition_ctc.py, passing it the following train.json file as its single command-line argument:

{
  "dataset_name": "mozilla-foundation/common_voice_13_0",
  "model_name_or_path": "facebook/wav2vec2-large-xlsr-53",
  "dataset_config_name": "eo",
  "output_dir": "./wav2vec2-common_voice_13_0-eo-3",
  "train_split_name": "train[:15000]",
  "eval_split_name": "validation[:1500]",
  "eval_metrics": ["cer", "wer"],
  "overwrite_output_dir": true,
  "preprocessing_num_workers": 8,
  "num_train_epochs": 100,
  "per_device_train_batch_size": 8,
  "gradient_accumulation_steps": 4,
  "gradient_checkpointing": true,
  "learning_rate": 3e-5,
  "warmup_steps": 500,
  "evaluation_strategy": "steps",
  "text_column_name": "sentence",
  "length_column_name": "input_length",
  "save_steps": 1000,
  "eval_steps": 1000,
  "layerdrop": 0.1,
  "save_total_limit": 3,
  "freeze_feature_encoder": true,
  "chars_to_ignore": "-!\"'(),.:;=?_`¨«¸»ʼ‑–—‘’“”„…‹›♫?",
  "chars_to_substitute": {
    "przy": "pŝe",
    "byn": "bin",
    "cx": "ĉ",
    "sx": "ŝ",
    "fi": "fi",
    "fl": "fl",
    "ǔ": "ŭ",
    "ñ": "nj",
    "á": "a",
    "é": "e",
    "ü": "ŭ",
    "y": "j",
    "qu": "ku"
  },
  "fp16": true,
  "group_by_length": true,
  "push_to_hub": true,
  "do_train": true,
  "do_eval": true
}

I went through the dataset to find non-speech characters, and these were placed in chars_to_ignore. In addition, there were character sequences that could be transcribed to Esperanto phonemes, and these were placed as a dictionary in chars_to_substitute. Since chars_to_substitute is not a standard argument of the script, I added it:

from dataclasses import dataclass, field
from typing import Dict, Optional

def dict_field(default=None, metadata=None):
    # Helper so a dict can be used as a dataclass field default.
    return field(default_factory=lambda: default, metadata=metadata)

@dataclass
class DataTrainingArguments:
    ...
    chars_to_substitute: Optional[Dict[str, str]] = dict_field(
        default=None,
        metadata={"help": "A dict of characters to replace."},
    )

Then I added a substitute_characters function, modeled on the existing remove_special_characters, to do the actual substitution:

    def remove_special_characters(batch):
        text = batch[text_column_name]
        if chars_to_ignore_regex is not None:
            text = re.sub(chars_to_ignore_regex, "", batch[text_column_name])
        batch["target_text"] = text.lower() + " "
        return batch

    def substitute_characters(batch):
        text: str = batch["target_text"]
        if data_args.chars_to_substitute is not None:
            for k, v in data_args.chars_to_substitute.items():
                # str.replace returns a new string, so reassign it.
                text = text.replace(k, v)
        batch["target_text"] = text.lower()
        return batch

    with training_args.main_process_first(desc="dataset map special characters removal"):
        raw_datasets = raw_datasets.map(
            remove_special_characters,
            remove_columns=[text_column_name],
            desc="remove special characters from datasets",
        )

    with training_args.main_process_first(desc="dataset map special characters substitute"):
        raw_datasets = raw_datasets.map(
            substitute_characters,
            desc="substitute special characters in datasets",
        )
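As a quick illustration of what the substitution step does, here are a couple of the mappings from chars_to_substitute applied the same way substitute_characters applies them (the sample sentence is made up):

# Illustrative only: x-system digraphs rewritten to Esperanto letters.
subs = {"cx": "ĉ", "sx": "ŝ"}
text = "cxu vi sxatas la kanton"
for k, v in subs.items():
    text = text.replace(k, v)
print(text)  # -> "ĉu vi ŝatas la kanton"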

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • layerdrop: 0.1
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • num_epochs: 100
  • mixed_precision_training: Native AMP

Training results

| Training Loss | Epoch | Step | Cer | Validation Loss | Wer |
|:---|:---|:---|:---|:---|:---|
| 2.6416 | 2.13 | 1000 | 0.1541 | 0.8599 | 0.6449 |
| 0.2633 | 4.27 | 2000 | 0.0335 | 0.1897 | 0.1431 |
| 0.1739 | 6.4 | 3000 | 0.0289 | 0.1732 | 0.1145 |
| 0.1378 | 8.53 | 4000 | 0.0276 | 0.1729 | 0.1066 |
| 0.1172 | 10.67 | 5000 | 0.0268 | 0.1773 | 0.1019 |
| 0.1049 | 12.8 | 6000 | 0.0255 | 0.1701 | 0.0937 |
| 0.0951 | 14.93 | 7000 | 0.0253 | 0.1718 | 0.0933 |
| 0.0851 | 17.07 | 8000 | 0.0239 | 0.1787 | 0.0834 |
| 0.0809 | 19.2 | 9000 | 0.0235 | 0.1802 | 0.0835 |
| 0.0756 | 21.33 | 10000 | 0.0239 | 0.1784 | 0.0855 |
| 0.0708 | 23.47 | 11000 | 0.0235 | 0.1748 | 0.0824 |
| 0.0657 | 25.6 | 12000 | 0.0228 | 0.1830 | 0.0796 |
| 0.0605 | 27.73 | 13000 | 0.0230 | 0.1896 | 0.0798 |
| 0.0583 | 29.87 | 14000 | 0.0224 | 0.1889 | 0.0778 |
| 0.0608 | 32.0 | 15000 | 0.0223 | 0.1849 | 0.0757 |
| 0.0556 | 34.13 | 16000 | 0.0223 | 0.1872 | 0.0767 |
| 0.0534 | 36.27 | 17000 | 0.0221 | 0.1893 | 0.0751 |
| 0.0523 | 38.4 | 18000 | 0.0218 | 0.1925 | 0.0729 |
| 0.0494 | 40.53 | 19000 | 0.0221 | 0.1957 | 0.0745 |
| 0.0475 | 42.67 | 20000 | 0.0217 | 0.1961 | 0.0740 |
| 0.048 | 44.8 | 21000 | 0.0214 | 0.1957 | 0.0714 |
| 0.0459 | 46.93 | 22000 | 0.0215 | 0.1968 | 0.0717 |
| 0.0435 | 49.07 | 23000 | 0.0217 | 0.2008 | 0.0717 |
| 0.0428 | 51.2 | 24000 | 0.0212 | 0.1991 | 0.0696 |
| 0.0418 | 53.33 | 25000 | 0.0215 | 0.2034 | 0.0714 |
| 0.0404 | 55.47 | 26000 | 0.0210 | 0.2014 | 0.0684 |
| 0.0394 | 57.6 | 27000 | 0.0210 | 0.2050 | 0.0681 |
| 0.0399 | 59.73 | 28000 | 0.0211 | 0.2039 | 0.0700 |
| 0.0389 | 61.87 | 29000 | 0.0214 | 0.2091 | 0.0694 |
| 0.038 | 64.0 | 30000 | 0.0210 | 0.2100 | 0.0702 |
| 0.0361 | 66.13 | 31000 | 0.0215 | 0.2119 | 0.0703 |
| 0.0359 | 68.27 | 32000 | 0.0213 | 0.2108 | 0.0714 |
| 0.0354 | 70.4 | 33000 | 0.0211 | 0.2120 | 0.0699 |
| 0.0364 | 72.53 | 34000 | 0.0211 | 0.2128 | 0.0688 |
| 0.0361 | 74.67 | 35000 | 0.0212 | 0.2134 | 0.0694 |
| 0.0332 | 76.8 | 36000 | 0.0210 | 0.2176 | 0.0698 |
| 0.0341 | 78.93 | 37000 | 0.0208 | 0.2170 | 0.0688 |
| 0.032 | 81.07 | 38000 | 0.0209 | 0.2157 | 0.0686 |
| 0.0318 | 83.33 | 39000 | 0.0209 | 0.2166 | 0.0685 |
| 0.0325 | 85.47 | 40000 | 0.0209 | 0.2172 | 0.0687 |
| 0.0316 | 87.6 | 41000 | 0.0208 | 0.2181 | 0.0678 |
| 0.0302 | 89.73 | 42000 | 0.0208 | 0.2171 | 0.0679 |
| 0.0318 | 91.87 | 43000 | 0.0211 | 0.2179 | 0.0702 |
| 0.0314 | 94.0 | 44000 | 0.0208 | 0.2186 | 0.0690 |
| 0.0309 | 96.13 | 45000 | 0.0210 | 0.2193 | 0.0696 |
| 0.031 | 98.27 | 46000 | 0.0208 | 0.2191 | 0.0686 |

Framework versions

  • Transformers 4.29.1
  • Pytorch 2.0.1+cu118
  • Datasets 2.12.0
  • Tokenizers 0.13.3