Whisper large v3 FR D&D - Joey Martig

This model is a fine-tuned version of openai/whisper-large-v3 on a custom dataset of D&D-specific audio samples (described under "Training and evaluation data" below). It achieves the following results on the evaluation set:

  • Loss: 0.0117
  • WER: 33.4454

Model description

The model is a fine-tuned version of OpenAI's Whisper, trained to recognize and transcribe specialized vocabulary from the Dungeons & Dragons (D&D) universe. Fine-tuning retrained the base whisper-large-v3 model on a custom dataset of audio samples containing D&D-specific terms that the original model did not adequately recognize. The goal is to improve transcription accuracy on D&D terminology, including unique names of monsters, characters, and places, making the model a more effective tool for users working with D&D-related content.

Intended uses & limitations

Intended Uses:

  • The model is intended for use in scenarios where accurate transcription of specialized D&D vocabulary is crucial. This includes applications such as automatic transcription of game sessions, creation of subtitles for D&D-related content, or assisting in the documentation of in-game narratives.
  • The model is particularly useful for users who frequently encounter or work with D&D-specific language that standard transcription models might struggle to accurately transcribe.

Limitations:

  • The model's performance is constrained by the size and diversity of the training dataset. Since the dataset used was relatively small and focused, the model might not perform well on a broader range of accents, voice types, or D&D-specific terms that were not included in the training set.
  • The model requires significant computational resources for training and fine-tuning. While it shows improvements over the base model, these gains are achieved at the cost of extended processing times and the need for powerful hardware, such as GPUs available on HPC clusters.
  • Due to the limitations in data, the model may still produce errors or inconsistent results, especially when encountering terms or phrases outside the scope of the fine-tuning dataset.

Training and evaluation data

The training data consisted of 136 initial audio samples derived from a vocabulary of 34 D&D-specific words, with each word incorporated into two different sentences. To expand this limited dataset, audio filters were applied to the samples to artificially increase their variety, resulting in a fivefold increase in the number of training examples, reaching a total of 680 samples.
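The specific audio filters used are not documented here. As an illustration only, a fivefold expansion of this kind can be sketched with simple NumPy-based perturbations; the filter choices below (white noise, gain change, speed perturbation) are assumptions, not the actual pipeline:

```python
import numpy as np

def speed_perturb(audio: np.ndarray, factor: float) -> np.ndarray:
    """Change playback speed by resampling the waveform with linear interpolation."""
    n_out = int(round(len(audio) / factor))
    old_idx = np.linspace(0, len(audio) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(audio)), audio)

def augment_fivefold(audio: np.ndarray, rng: np.random.Generator) -> list[np.ndarray]:
    """Return the original clip plus four filtered variants (5x total)."""
    noisy = audio + rng.normal(0.0, 0.005, size=audio.shape)  # light white noise
    quiet = 0.6 * audio                                       # gain change
    fast = speed_perturb(audio, 1.1)                          # ~10% faster
    slow = speed_perturb(audio, 0.9)                          # ~10% slower
    return [audio, noisy, quiet, fast, slow]

# Applied to each clip, this takes 136 original samples to 5 * 136 = 680,
# matching the dataset size described above.
```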

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 50
  • num_epochs: 10
  • mixed_precision_training: Native AMP
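With lr_scheduler_type: linear and 50 warmup steps, the learning rate ramps up linearly from zero and then decays linearly back to zero over the remaining steps (70 total here, 7 per epoch). A minimal sketch of that schedule, mirroring the behavior of the `linear` scheduler in Hugging Face Transformers:

```python
def linear_schedule_lr(step: int, base_lr: float = 1e-5,
                       warmup_steps: int = 50, total_steps: int = 70) -> float:
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr at `warmup_steps` to 0 at `total_steps`.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

Note that with only 70 total steps, most of training here happens during warmup.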

Training results

Epoch   Step   Validation Loss   WER
 1.0       7            0.9825   38.1513
 2.0      14            0.7112   35.7143
 3.0      21            0.4668   68.2353
 4.0      28            0.2396   33.6134
 5.0      35            0.1178   33.4454
 6.0      42            0.0526   33.4454
 7.0      49            0.0317   33.4454
 8.0      56            0.0165   33.4454
 9.0      63            0.0133   33.4454
10.0      70            0.0117   33.4454
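The WER values above are percentages: word-level edit distance divided by the number of reference words, times 100. A self-contained sketch of the metric, using a standard word-level Levenshtein distance rather than the exact evaluation code used here:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words, as a percentage."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return 100.0 * prev[-1] / len(ref)

# e.g. wer("le dragon rouge attaque", "le dragon bouge attaque") -> 25.0
```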

Framework versions

  • Transformers 4.43.0.dev0
  • Pytorch 2.3.0+cu121
  • Datasets 2.19.1
  • Tokenizers 0.19.1
Model details

  • Model size: 1.54B params
  • Tensor type: F32 (Safetensors)
  • Model id: joeyMartig/whisper-large-v3-dnd-fr (fine-tuned from openai/whisper-large-v3)