language:
  - fr
license: apache-2.0
base_model: openai/whisper-large-v3
tags:
  - generated_from_trainer
metrics:
  - wer
model-index:
  - name: Whisper large v3 FR D&D - Joey Martig
    results: []

Whisper large v3 FR D&D - Joey Martig

This model is a fine-tuned version of openai/whisper-large-v3 on a custom dataset of audio samples containing D&D-specific terms. It achieves the following results on the evaluation set:

  • Loss: 0.0117
  • WER: 33.4454
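
For readers unfamiliar with the metric, the sketch below shows one common way to compute word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. It assumes the value reported above is a percentage, which the card does not state explicitly. The function name and the example sentences are illustrative and are not taken from the evaluation set.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(
                min(
                    previous_row[j] + 1,  # deletion
                    current_row[j - 1] + 1,  # insertion
                    previous_row[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous_row = current_row

    return 100.0 * previous_row[-1] / len(ref_words)


# Illustrative sentences (hypothetical, not from the evaluation set).
reference = "le beholder attaque le groupe"
hypothesis = "le bi older attaque le groupe"
print(word_error_rate(reference, hypothesis))
```

In this example a single mis-heard D&D term costs two word errors (one substitution and one insertion), which illustrates why specialized vocabulary can weigh heavily on WER for short sentences.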

Model description

This model is a fine-tuned version of OpenAI's Whisper large v3, trained to recognize and transcribe specialized vocabulary from the Dungeons & Dragons (D&D) universe. The base model was retrained on a custom dataset of audio samples containing D&D-specific terms that the original model did not recognize adequately. The goal is to transcribe D&D terminology more accurately, including the unique names of monsters, characters, and places, so that the model is a more effective tool for users working with D&D-related content.

Intended uses & limitations

Intended Uses:

  • The model is intended for scenarios where accurate transcription of specialized D&D vocabulary is crucial, such as automatic transcription of game sessions, subtitling of D&D-related content, or documentation of in-game narratives.
  • It is particularly useful for users who regularly work with D&D-specific language that standard transcription models struggle to transcribe accurately.
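
The uses listed above could be served by a short transcription script along the following lines. This is a minimal sketch, assuming the checkpoint is loaded through the Transformers `pipeline` API. The card does not state the fine-tuned model's repository id, so `MODEL_ID` below is a placeholder set to the base model, and the audio file name is hypothetical.

```python
# Placeholder: the card does not give the fine-tuned checkpoint's repository id.
# Substitute it here; the base model id is used only so the sketch is complete.
MODEL_ID = "openai/whisper-large-v3"


def load_asr(model_id: str = MODEL_ID):
    """Build a speech-recognition pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # split long recordings, such as game sessions, into chunks
    )


def transcribe(asr, audio_path: str) -> str:
    """Transcribe one audio file, forcing French transcription."""
    result = asr(
        audio_path,
        generate_kwargs={"language": "french", "task": "transcribe"},
    )
    return result["text"].strip()


# Example call (hypothetical file name):
# asr = load_asr()
# print(transcribe(asr, "session.wav"))
```

Forcing the language is a design choice for this sketch: the card's metadata lists French as the only language, so skipping automatic language detection avoids one source of error on short clips.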

Limitations:

  • The model's performance is constrained by the size and diversity of the training dataset. Because the dataset was small and narrowly focused, the model may not perform well on accents, voice types, or D&D-specific terms that were absent from the training set.
  • Training and fine-tuning the model requires significant computational resources. Although it improves on the base model, these gains come at the cost of extended processing times and the need for powerful hardware, such as the GPUs available on HPC clusters.
  • Because of the limited data, the model may still produce errors or inconsistent results, especially on terms or phrases outside the scope of the fine-tuning dataset.

Training and evaluation data

The training data consisted of 136 initial audio samples built from a vocabulary of 34 D&D-specific words, with each word used in two different sentences. To expand this limited dataset, audio filters were applied to the samples to artificially increase their variety, raising the number of training examples fivefold to a total of 680.
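
The dataset sizes quoted above can be checked with a few lines of arithmetic. The sketch uses only figures from this card. The number of recordings per sentence is an inference from those figures, not something the card states.

```python
# Figures stated in the model card.
VOCABULARY_SIZE = 34       # D&D-specific words
SENTENCES_PER_WORD = 2     # each word is used in two different sentences
INITIAL_SAMPLES = 136      # audio samples before augmentation
AUGMENTATION_FACTOR = 5    # "fivefold" increase from audio filters

distinct_sentences = VOCABULARY_SIZE * SENTENCES_PER_WORD

# Inferred, not stated in the card: how many recordings exist per sentence.
recordings_per_sentence = INITIAL_SAMPLES / distinct_sentences

total_samples = INITIAL_SAMPLES * AUGMENTATION_FACTOR

print(distinct_sentences)       # 68
print(recordings_per_sentence)  # 2.0
print(total_samples)            # 680
```

The 680 total matches the card. The 136 initial samples are twice the 68 distinct sentences, which would be consistent with two recordings of each sentence, although the card does not say how the recordings were made.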

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 50
  • num_epochs: 10
  • mixed_precision_training: Native AMP
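
To make the interaction between `learning_rate`, `lr_scheduler_type`, and `lr_scheduler_warmup_steps` concrete, the sketch below computes the learning rate at a given optimizer step for a linear schedule with warmup. It assumes the usual shape of such a schedule (a linear ramp from zero to the peak rate, then a linear decay to zero) and takes the total of 70 steps from the training results table. The function name is illustrative, and this is not the training code used for the model.

```python
PEAK_LEARNING_RATE = 1e-05  # learning_rate
WARMUP_STEPS = 50           # lr_scheduler_warmup_steps
TOTAL_STEPS = 70            # final step in the training results (10 epochs x 7 steps)


def linear_schedule_lr(step: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < WARMUP_STEPS:
        # Ramp up from 0 to the peak rate over the warmup steps.
        scale = step / max(1, WARMUP_STEPS)
    else:
        # Decay from the peak rate down to 0 at the final step.
        remaining = TOTAL_STEPS - step
        scale = max(0.0, remaining / max(1, TOTAL_STEPS - WARMUP_STEPS))
    return PEAK_LEARNING_RATE * scale


for step in (0, 25, 50, 60, 70):
    print(step, linear_schedule_lr(step))
```

Under these assumptions, warmup covers 50 of the 70 steps, so the model trains at the full learning rate only around step 50 and spends the remaining 20 steps decaying toward zero.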

Training results

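| Epoch | Step | Validation Loss | WER     |
|-------|------|-----------------|---------|
| 1.0   | 7    | 0.9825          | 38.1513 |
| 2.0   | 14   | 0.7112          | 35.7143 |
| 3.0   | 21   | 0.4668          | 68.2353 |
| 4.0   | 28   | 0.2396          | 33.6134 |
| 5.0   | 35   | 0.1178          | 33.4454 |
| 6.0   | 42   | 0.0526          | 33.4454 |
| 7.0   | 49   | 0.0317          | 33.4454 |
| 8.0   | 56   | 0.0165          | 33.4454 |
| 9.0   | 63   | 0.0133          | 33.4454 |
| 10.0  | 70   | 0.0117          | 33.4454 |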

Framework versions

  • Transformers 4.43.0.dev0
  • PyTorch 2.3.0+cu121
  • Datasets 2.19.1
  • Tokenizers 0.19.1