Wav2Vec2-Large-Robust Phoneme Recognition Model for Korean Learners' English (Revised ETRI Data)

This repository contains a Wav2Vec2-Large-Robust model fine-tuned for phoneme recognition. The model was trained and evaluated on our in-house dataset of English speech produced by Korean learners, which was created with ETRI and revised by the SNU SLP lab.

Data Information

  • Dataset Name: English Pronunciation of Korean Learners (made with ETRI) revised by SNU SLP lab

  • Data Type: Speech recordings of Korean learners speaking English, annotated with phoneme sequences.

  • Annotation: Each utterance is transcribed at the phoneme level, with mispronounced phonemes marked by an _err suffix. These marks flag phoneme substitutions, insertions, and deletions that arise from the influence of Korean on English pronunciation.

  • Train Set: 14,305 samples

  • Valid Set: 1,590 samples

  • Test Set: 3,974 samples
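
As an illustration of the annotation format described above, the following minimal sketch splits a phoneme string into (phoneme, is_error) pairs. The function name parse_annotation is hypothetical and not part of this repository; the only assumption taken from the data is that errors are marked with an _err suffix on space-separated labels.

```python
ERR_SUFFIX = "_err"  # error marker used in the annotations


def parse_annotation(annotation: str) -> list[tuple[str, bool]]:
    """Split a space-separated phoneme string into (phoneme, is_error) pairs."""
    pairs = []
    for token in annotation.split():
        is_error = token.endswith(ERR_SUFFIX)
        # Strip the marker so the bare phoneme label is easy to compare.
        phoneme = token[: -len(ERR_SUFFIX)] if is_error else token
        pairs.append((phoneme, is_error))
    return pairs


# First phonemes of the sample utterance shown later in this card.
labels = parse_annotation("ay m l uh k ih_err ng")
mispronounced = [phoneme for phoneme, is_error in labels if is_error]
```

Keeping the error flag separate from the phoneme label makes it straightforward to score phoneme identity and error detection independently, if that is needed.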

Training Procedure

The model was fine-tuned for phoneme recognition using the Hugging Face transformers library. Below are the training steps:

  1. Data preprocessing to align audio with phoneme labels.
  2. Wav2Vec2-Large-Robust model fine-tuning with CTC loss.
  3. Evaluation on validation and test sets.
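
A model trained with CTC loss emits one label per audio frame, including a special blank label, so its output has to be collapsed into a phoneme sequence. The sketch below shows standard greedy CTC decoding in plain Python. The vocabulary and blank index are hypothetical placeholders, and this card does not state which decoding method was actually used for evaluation.

```python
from itertools import groupby

# Hypothetical vocabulary for illustration; the real model's vocabulary differs.
VOCAB = {0: "<blank>", 1: "ay", 2: "m", 3: "l", 4: "uh", 5: "k"}
BLANK_ID = 0  # assumed index of the CTC blank label


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse consecutive repeated frame labels, then drop blanks."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return [vocab[token_id] for token_id in collapsed if token_id != blank_id]


# Per-frame argmax ids, as they might come out of the model's logits.
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 4, 4, 0, 5]
decoded = ctc_greedy_decode(frames)
```

Note that a blank between two identical labels keeps them as two separate phonemes, which is how CTC represents genuine repetitions.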

Training Hyperparameters

  • Epochs: 50
  • Learning Rate: 0.0001
  • Warmup Ratio: 0.1
  • Scheduler: Linear
  • Batch Size: 8
  • Loss Reduction: Mean
  • Feature Extractor Freeze: Enabled
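
To make the warmup ratio and linear scheduler concrete, the sketch below computes the learning rate at a given optimizer step. It assumes one optimizer step per batch of 8 (single device, no gradient accumulation), which this card does not state, so the step counts are illustrative only.

```python
import math

TRAIN_SAMPLES = 14_305
BATCH_SIZE = 8
EPOCHS = 50
PEAK_LR = 1e-4
WARMUP_RATIO = 0.1

# Assumption: one optimizer step per batch, no gradient accumulation.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(total_steps * WARMUP_RATIO)


def linear_lr(step: int) -> float:
    """Linear warmup from 0 to PEAK_LR, then linear decay back to 0."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - warmup_steps)


for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, linear_lr(step))
```

Under these assumptions the learning rate peaks at 0.0001 after the first 10% of steps and reaches zero at the end of training.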

Training Results

The following metrics were achieved during training:

  • Final Training Loss: 0.2415
  • Word Error Rate (WER) on Training Set: 0.0508
  • Validation Loss: 0.3999
  • Word Error Rate (WER) on Validation Set: 0.1622

Test Results

The model was evaluated on the test dataset with the following performance:

  • Word Error Rate (WER): 0.0905

Phoneme Data Example

Below is an example of how the dataset is structured for phoneme recognition tasks:

Sample:

  • Provided Sentence: I'M LOOKING FOR MY PUPPY LUKE HE RAN AWAY THIS MORNING
  • True Phonemes (learner's actual pronunciation): ay m l uh k ih_err ng f er m ay p ah p iy l uw k hh iy l ah n ah w ey dh ih s m ao r n ih_err n
  • Predicted Phonemes: ay m l uh k ih ng f er m ay p ah p iy l uh k hh iy l ah n ah w ey dh ih s m ao r n ih_err ng
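
The sample above can be scored with a token-level edit distance. This sketch assumes that the reported WER treats each space-separated phoneme label as a "word" and counts ih and ih_err as different labels; the card does not spell out the exact scoring procedure, so this is an illustration and not the evaluation script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


reference = ("ay m l uh k ih_err ng f er m ay p ah p iy l uw k hh iy "
             "l ah n ah w ey dh ih s m ao r n ih_err n").split()
hypothesis = ("ay m l uh k ih ng f er m ay p ah p iy l uh k hh iy "
              "l ah n ah w ey dh ih s m ao r n ih_err ng").split()

errors = edit_distance(reference, hypothesis)
error_rate = errors / len(reference)
print(f"{errors} errors / {len(reference)} phonemes = {error_rate:.4f}")
```

For this utterance the prediction differs from the reference in three of 35 labels (ih for ih_err, uh for uw, and ng for the final n), an error rate of about 0.086 under the stated assumptions.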

Training Logs

TensorBoard logs are available for detailed training analysis:

  • events.out.tfevents.1737043507.oem-WS-C621E-SAGE-Series.1534005.0
  • events.out.tfevents.1737088179.oem-WS-C621E-SAGE-Series.1534005.1

Use the following command to visualize logs:

tensorboard --logdir=./logs/