---
language:
  - eo
license: apache-2.0
tags:
  - automatic-speech-recognition
  - mozilla-foundation/common_voice_13_0
  - generated_from_trainer
metrics:
  - wer
model-index:
  - name: wav2vec2-common_voice_13_0-eo-3
    results: []
---

wav2vec2-common_voice_13_0-eo-3, an Esperanto speech recognizer

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on the mozilla-foundation/common_voice_13_0 Esperanto dataset. It achieves the following results on the evaluation set:

  • Loss: 0.2191
  • Cer: 0.0208
  • Wer: 0.0687
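For context, WER and CER are word- and character-level error rates. Below is a minimal sketch of how such numbers are computed with the evaluate library; the transcripts in it are made up for illustration:

import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Made-up example: one reference transcript and one hypothesis with a single wrong word.
references = ["ĉu vi ŝatas la muzikon"]
predictions = ["ĉu vi satas la muzikon"]

print(wer_metric.compute(predictions=predictions, references=references))  # 1 of 5 words wrong -> 0.2
print(cer_metric.compute(predictions=predictions, references=references))  # 1 character wrong -> roughly 0.05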

Model description

See facebook/wav2vec2-large-xlsr-53.

Intended uses & limitations

Speech recognition for Esperanto. The base model was pretrained and fine-tuned on 16 kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16 kHz.
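A minimal inference sketch (the Hub id xekri/wav2vec2-common_voice_13_0-eo-3 and the audio file path are assumptions for illustration):

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "xekri/wav2vec2-common_voice_13_0-eo-3"  # assumed Hub id
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

speech, sample_rate = torchaudio.load("example.wav")  # illustrative path
if sample_rate != 16_000:
    # The model expects 16 kHz input, so resample if necessary.
    speech = torchaudio.functional.resample(speech, sample_rate, 16_000)

inputs = processor(speech.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])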

Training and evaluation data

The training split was set to train[:15000] while the eval split was set to validation[:1500].
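For reference, the same slices can be loaded directly with the datasets library. A sketch; Common Voice 13.0 is gated on the Hub, so you may need to accept its terms and log in first:

from datasets import load_dataset

# Same slices as used for training and evaluation.
train_ds = load_dataset("mozilla-foundation/common_voice_13_0", "eo", split="train[:15000]")
eval_ds = load_dataset("mozilla-foundation/common_voice_13_0", "eo", split="validation[:1500]")
print(len(train_ds), len(eval_ds))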

Training procedure

I used run_speech_recognition_ctc.py, passing it the following train.json file as its single command-line argument:

{
  "dataset_name": "mozilla-foundation/common_voice_13_0",
  "model_name_or_path": "facebook/wav2vec2-large-xlsr-53",
  "dataset_config_name": "eo",
  "output_dir": "./wav2vec2-common_voice_13_0-eo-3",
  "train_split_name": "train[:15000]",
  "eval_split_name": "validation[:1500]",
  "eval_metrics": ["cer", "wer"],
  "overwrite_output_dir": true,
  "preprocessing_num_workers": 8,
  "num_train_epochs": 100,
  "per_device_train_batch_size": 8,
  "gradient_accumulation_steps": 4,
  "gradient_checkpointing": true,
  "learning_rate": 3e-5,
  "warmup_steps": 500,
  "evaluation_strategy": "steps",
  "text_column_name": "sentence",
  "length_column_name": "input_length",
  "save_steps": 1000,
  "eval_steps": 1000,
  "layerdrop": 0.1,
  "save_total_limit": 3,
  "freeze_feature_encoder": true,
  "chars_to_ignore": "-!\"'(),.:;=?_`¨«¸»ʼ‑–—‘’“”„…‹›♫?",
  "chars_to_substitute": {
    "przy": "pŝe",
    "byn": "bin",
    "cx": "ĉ",
    "sx": "ŝ",
    "fi": "fi",
    "fl": "fl",
    "ǔ": "ŭ",
    "ñ": "nj",
    "á": "a",
    "é": "e",
    "ü": "ŭ",
    "y": "j",
    "qu": "ku"
  },
  "fp16": true,
  "group_by_length": true,
  "push_to_hub": true,
  "do_train": true,
  "do_eval": true
}

I went through the dataset to find non-speech characters, and these were placed in chars_to_ignore. In addition, there were character sequences that could be transcribed to Esperanto phonemes, and these were placed as a dictionary in chars_to_substitute. Since chars_to_substitute is not a standard argument of the script, I added it:

from dataclasses import dataclass, field
from typing import Dict, Optional

def dict_field(default=None, metadata=None):
    # Helper so a dict can be used as a dataclass field default.
    return field(default_factory=lambda: default, metadata=metadata)

@dataclass
class DataTrainingArguments:
    ...
    chars_to_substitute: Optional[Dict[str, str]] = dict_field(
        default=None,
        metadata={"help": "A dict of characters to replace."},
    )

Then I added a substitute_characters function, modeled on the existing remove_special_characters, to do the actual substitution:

    def remove_special_characters(batch):
        text = batch[text_column_name]
        if chars_to_ignore_regex is not None:
            text = re.sub(chars_to_ignore_regex, "", batch[text_column_name])
        batch["target_text"] = text.lower() + " "
        return batch

    def substitute_characters(batch):
        text: str = batch["target_text"]
        if data_args.chars_to_substitute is not None:
            for k, v in data_args.chars_to_substitute.items():
                # str.replace returns a new string, so reassign it.
                text = text.replace(k, v)
        batch["target_text"] = text.lower()
        return batch

    with training_args.main_process_first(desc="dataset map special characters removal"):
        raw_datasets = raw_datasets.map(
            remove_special_characters,
            remove_columns=[text_column_name],
            desc="remove special characters from datasets",
        )

    with training_args.main_process_first(desc="dataset map special characters substitute"):
        raw_datasets = raw_datasets.map(
            substitute_characters,
            desc="substitute special characters in datasets",
        )
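As a quick illustration of what the substitution step does, here are a couple of the mappings from chars_to_substitute applied the same way substitute_characters applies them (the sample sentence is made up):

# Illustrative only: x-system digraphs rewritten to Esperanto letters.
subs = {"cx": "ĉ", "sx": "ŝ"}
text = "cxu vi sxatas la kanton"
for k, v in subs.items():
    text = text.replace(k, v)
print(text)  # -> "ĉu vi ŝatas la kanton"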

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • layerdrop: 0.1
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • num_epochs: 100
  • mixed_precision_training: Native AMP

Training results

| Training Loss | Epoch | Step | Cer | Validation Loss | Wer |
|:---|:---|:---|:---|:---|:---|
| 2.6416 | 2.13 | 1000 | 0.1541 | 0.8599 | 0.6449 |
| 0.2633 | 4.27 | 2000 | 0.0335 | 0.1897 | 0.1431 |
| 0.1739 | 6.4 | 3000 | 0.0289 | 0.1732 | 0.1145 |
| 0.1378 | 8.53 | 4000 | 0.0276 | 0.1729 | 0.1066 |
| 0.1172 | 10.67 | 5000 | 0.0268 | 0.1773 | 0.1019 |
| 0.1049 | 12.8 | 6000 | 0.0255 | 0.1701 | 0.0937 |
| 0.0951 | 14.93 | 7000 | 0.0253 | 0.1718 | 0.0933 |
| 0.0851 | 17.07 | 8000 | 0.0239 | 0.1787 | 0.0834 |
| 0.0809 | 19.2 | 9000 | 0.0235 | 0.1802 | 0.0835 |
| 0.0756 | 21.33 | 10000 | 0.0239 | 0.1784 | 0.0855 |
| 0.0708 | 23.47 | 11000 | 0.0235 | 0.1748 | 0.0824 |
| 0.0657 | 25.6 | 12000 | 0.0228 | 0.1830 | 0.0796 |
| 0.0605 | 27.73 | 13000 | 0.0230 | 0.1896 | 0.0798 |
| 0.0583 | 29.87 | 14000 | 0.0224 | 0.1889 | 0.0778 |
| 0.0608 | 32.0 | 15000 | 0.0223 | 0.1849 | 0.0757 |
| 0.0556 | 34.13 | 16000 | 0.0223 | 0.1872 | 0.0767 |
| 0.0534 | 36.27 | 17000 | 0.0221 | 0.1893 | 0.0751 |
| 0.0523 | 38.4 | 18000 | 0.0218 | 0.1925 | 0.0729 |
| 0.0494 | 40.53 | 19000 | 0.0221 | 0.1957 | 0.0745 |
| 0.0475 | 42.67 | 20000 | 0.0217 | 0.1961 | 0.0740 |
| 0.048 | 44.8 | 21000 | 0.0214 | 0.1957 | 0.0714 |
| 0.0459 | 46.93 | 22000 | 0.0215 | 0.1968 | 0.0717 |
| 0.0435 | 49.07 | 23000 | 0.0217 | 0.2008 | 0.0717 |
| 0.0428 | 51.2 | 24000 | 0.0212 | 0.1991 | 0.0696 |
| 0.0418 | 53.33 | 25000 | 0.0215 | 0.2034 | 0.0714 |
| 0.0404 | 55.47 | 26000 | 0.0210 | 0.2014 | 0.0684 |
| 0.0394 | 57.6 | 27000 | 0.0210 | 0.2050 | 0.0681 |
| 0.0399 | 59.73 | 28000 | 0.0211 | 0.2039 | 0.0700 |
| 0.0389 | 61.87 | 29000 | 0.0214 | 0.2091 | 0.0694 |
| 0.038 | 64.0 | 30000 | 0.0210 | 0.2100 | 0.0702 |
| 0.0361 | 66.13 | 31000 | 0.0215 | 0.2119 | 0.0703 |
| 0.0359 | 68.27 | 32000 | 0.0213 | 0.2108 | 0.0714 |
| 0.0354 | 70.4 | 33000 | 0.0211 | 0.2120 | 0.0699 |
| 0.0364 | 72.53 | 34000 | 0.0211 | 0.2128 | 0.0688 |
| 0.0361 | 74.67 | 35000 | 0.0212 | 0.2134 | 0.0694 |
| 0.0332 | 76.8 | 36000 | 0.0210 | 0.2176 | 0.0698 |
| 0.0341 | 78.93 | 37000 | 0.0208 | 0.2170 | 0.0688 |
| 0.032 | 81.07 | 38000 | 0.0209 | 0.2157 | 0.0686 |
| 0.0318 | 83.33 | 39000 | 0.0209 | 0.2166 | 0.0685 |
| 0.0325 | 85.47 | 40000 | 0.0209 | 0.2172 | 0.0687 |
| 0.0316 | 87.6 | 41000 | 0.0208 | 0.2181 | 0.0678 |
| 0.0302 | 89.73 | 42000 | 0.0208 | 0.2171 | 0.0679 |
| 0.0318 | 91.87 | 43000 | 0.0211 | 0.2179 | 0.0702 |
| 0.0314 | 94.0 | 44000 | 0.0208 | 0.2186 | 0.0690 |
| 0.0309 | 96.13 | 45000 | 0.0210 | 0.2193 | 0.0696 |
| 0.031 | 98.27 | 46000 | 0.0208 | 0.2191 | 0.0686 |

Framework versions

  • Transformers 4.29.1
  • Pytorch 2.0.1+cu118
  • Datasets 2.12.0
  • Tokenizers 0.13.3