---
language:
- fr
license: apache-2.0
base_model: openai/whisper-large-v3
tags:
- generated_from_trainer
metrics:
- wer
model-index:
- name: Whisper large v3 FR D&D - Joey Martig
  results: []
---

# Whisper large v3 FR D&D - Joey Martig

This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on a custom dataset of audio samples containing D&D-specific terms.
It achieves the following results on the evaluation set:
- Loss: 0.0117
- Wer: 33.4454

## Model description

The model is a fine-tuned version of OpenAI's Whisper, trained to recognize and transcribe specialized vocabulary from the Dungeons & Dragons (D&D) universe. Fine-tuning retrains an existing Whisper model on a custom dataset of audio samples containing D&D-specific terms that the original model did not recognize adequately. The goal is to improve transcription of D&D terminology, including the unique names of monsters, characters, and places, so that the model is more useful for D&D-related content.

## Intended uses & limitations

### Intended Uses:
- The model is intended for scenarios where accurate transcription of specialized D&D vocabulary is crucial. Examples include automatic transcription of game sessions, subtitles for D&D-related content, and documentation of in-game narratives.
- The model is particularly useful for users who frequently work with D&D-specific language that standard transcription models struggle to transcribe accurately.

### Limitations:
- The model's performance is constrained by the size and diversity of the training dataset. Because the dataset was relatively small and focused, the model might not perform well on a broader range of accents, voice types, or D&D-specific terms that were not included in the training set.
- The model requires significant computational resources for training and fine-tuning. While it shows improvements over the base model, these gains are achieved at the cost of extended processing times and the need for powerful hardware, such as GPUs available on HPC clusters.
- Due to the limitations in data, the model may still produce errors or inconsistent results, especially when encountering terms or phrases outside the scope of the fine-tuning dataset.

## Training and evaluation data

The training data consisted of 136 initial audio samples derived from a vocabulary of 34 D&D-specific words, with each word incorporated into two different sentences. To expand this limited dataset, audio filters were applied to the samples to artificially increase their variety, resulting in a fivefold increase in the number of training examples, reaching a total of 680 samples.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 50
- num_epochs: 10
- mixed_precision_training: Native AMP

### Training results

| Epoch | Step | Validation Loss | Wer     |
|:-----:|:----:|:---------------:|:-------:|
| 1.0   | 7    | 0.9825          | 38.1513 |
| 2.0   | 14   | 0.7112          | 35.7143 |
| 3.0   | 21   | 0.4668          | 68.2353 |
| 4.0   | 28   | 0.2396          | 33.6134 |
| 5.0   | 35   | 0.1178          | 33.4454 |
| 6.0   | 42   | 0.0526          | 33.4454 |
| 7.0   | 49   | 0.0317          | 33.4454 |
| 8.0   | 56   | 0.0165          | 33.4454 |
| 9.0   | 63   | 0.0133          | 33.4454 |
| 10.0  | 70   | 0.0117          | 33.4454 |

### Framework versions

- Transformers 4.43.0.dev0
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
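
To help read the Wer values reported above, here is a minimal sketch of how a word error rate can be computed. It assumes the values are percentages, which the magnitudes suggest but the card does not state. The function `word_error_rate` and the example sentences are illustrative only. The evaluation script actually used for this model, including any text normalization it applied, is not part of this card.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate between two transcripts, as a percentage.

    Illustrative helper (not the metric implementation used for this card):
    counts word-level substitutions, deletions and insertions with a
    standard edit-distance recurrence, then divides by the reference length.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return 100.0 * previous[-1] / len(ref_words)


# Made-up example: one misspelled word out of five.
reference = "le tarrasque attaque le village"
hypothesis = "le tarasque attaque le village"
print(word_error_rate(reference, hypothesis))  # 20.0
```

Because the count of errors is divided by the reference length, a single mis-transcribed proper noun in a short sentence moves the score substantially, and the score can exceed 100 when the hypothesis contains many inserted words.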
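
The interaction between `lr_scheduler_type: linear`, `lr_scheduler_warmup_steps: 50`, and the 70 steps shown in the training results can be sketched in plain Python. This is an approximation under stated assumptions: it mirrors the usual linear warmup then linear decay rule, and it assumes the scheduler counted the same 70 steps that appear in the Step column. The name `linear_schedule_lr` is hypothetical and not taken from the training code.

```python
BASE_LR = 1e-5      # learning_rate from the hyperparameter list
WARMUP_STEPS = 50   # lr_scheduler_warmup_steps from the hyperparameter list
TOTAL_STEPS = 70    # assumption: final value of the Step column (10 epochs x 7 steps)


def linear_schedule_lr(step: int,
                       base_lr: float = BASE_LR,
                       warmup_steps: int = WARMUP_STEPS,
                       total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate at a given optimizer step (hypothetical helper).

    Rises linearly from 0 to base_lr over the warmup steps, then
    decays linearly back to 0 at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Print the learning rate at the end of each epoch (7 steps per epoch in the table).
for epoch in range(1, 11):
    step = epoch * 7
    print(f"epoch {epoch:2d}  step {step:2d}  lr {linear_schedule_lr(step):.2e}")
```

Under these assumptions the warmup spans the first 50 of 70 steps, so the peak learning rate is only reached during epoch 8 and the decay phase covers the final 20 steps.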