Whisper Small Basque
This model is a fine-tuned version of openai/whisper-small for Automatic Speech Recognition (ASR) in Basque (eu). It was trained on asierhv/composite_corpus_eu_v2.1, a composite corpus designed to improve Basque ASR performance.
Key improvements and results compared to the base model:
- Significant WER reduction: The fine-tuned model achieves a Word Error Rate (WER) of 9.5479 on the validation set of the asierhv/composite_corpus_eu_v2.1 dataset, demonstrating improved accuracy compared to the base whisper-small model for Basque.
- Performance on Common Voice: When evaluated on the Mozilla Common Voice 18.0 dataset, the model achieved a WER of 7.63. This demonstrates the model's ability to generalize to other Basque speech datasets, and highlights the improved accuracy due to the larger model size.
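For readers unfamiliar with the metric, the WER figures above are percentages of word-level errors (substitutions, deletions, insertions) relative to the number of reference words. The following is a minimal sketch of that calculation for a single reference/hypothesis pair. It assumes plain whitespace tokenization and no text normalization; the normalization actually used for the reported scores is not stated in this card, and the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage: (S + D + I) / N * 100 over whitespace tokens."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("kaixo zer moduz zaude", "kaixo zer modu zaude"))  # 25.0
```

Corpus-level WER is normally computed by summing errors and reference words over all utterances, not by averaging per-utterance scores.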
Model description
This model leverages the whisper-small architecture, which offers a balance between accuracy and computational efficiency. By fine-tuning it on a dedicated Basque speech corpus, the model specializes in accurately transcribing Basque speech. This model has a larger capacity than whisper-base, improving accuracy at the cost of increased computational resources.
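A fine-tuned Whisper checkpoint like this one can typically be run through the Transformers ASR pipeline. The sketch below shows one way to do that. The repository ID in `MODEL_ID` is a hypothetical placeholder (this card does not state the model's hub ID), and the helper names are illustrative. It assumes `transformers`, `torch`, and ffmpeg are installed.

```python
# Hypothetical placeholder: replace with the actual hub ID of this model.
MODEL_ID = "your-namespace/whisper-small-eu"


def whisper_generate_kwargs(language: str = "basque") -> dict:
    """Generation options that pin Whisper to transcription in one language."""
    return {"language": language, "task": "transcribe"}


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file and return the recognized text."""
    # Imported lazily so the helper above can be used without transformers.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # Whisper operates on 30-second audio windows
    )
    result = asr(audio_path, generate_kwargs=whisper_generate_kwargs())
    return result["text"]
```

Calling `transcribe("sample_eu.wav")` (the file name is an assumption) downloads the checkpoint on first use and returns the transcription as a string. Forcing the language avoids Whisper's automatic language detection, which can misidentify short Basque clips.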
Intended uses & limitations
Intended uses:
- High-accuracy automatic transcription of Basque speech for professional applications.
- Development of advanced Basque speech-based applications that require high precision.
- Research in Basque speech processing where the highest possible accuracy is needed.
- Professional transcription services and applications requiring very high accuracy.
- Use in scenarios where a higher computational cost is justified by the significant improvement in accuracy.
Limitations:
- Performance is still influenced by audio quality, with challenges arising from background noise and poor recording conditions.
- Accuracy may be affected by highly dialectal or informal Basque speech.
- Despite improved performance, the model may still produce errors, particularly with complex linguistic structures or rare words.
- The small model is larger than both the base and tiny models, so inference will be slower and require more resources.
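To give a rough sense of the resource cost mentioned in the last limitation, the sketch below estimates the memory needed for the weights alone. The parameter counts are the approximate figures published in the OpenAI Whisper README, not measurements of this checkpoint, and the estimate ignores activations, the decoding cache, and framework overhead.

```python
# Approximate parameter counts in millions, as listed in the OpenAI Whisper README.
WHISPER_PARAMS_M = {"tiny": 39, "base": 74, "small": 244}


def weight_memory_mb(size: str, bytes_per_param: int = 4) -> float:
    """Rough memory for the weights only, in MB (1 MB = 10**6 bytes)."""
    # millions of parameters * bytes per parameter = megabytes
    return float(WHISPER_PARAMS_M[size] * bytes_per_param)


for name in WHISPER_PARAMS_M:
    fp32 = weight_memory_mb(name)      # 4 bytes per parameter
    fp16 = weight_memory_mb(name, 2)   # 2 bytes per parameter
    print(f"{name:>5}: ~{fp32:.0f} MB in float32, ~{fp16:.0f} MB in float16")
```

Under these assumptions the small model needs roughly 976 MB in float32, a little over three times the base model's 296 MB.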
Training and evaluation data
- Training dataset: asierhv/composite_corpus_eu_v2.1. This dataset is a comprehensive collection of Basque speech data, tailored to enhance the performance of Basque ASR systems.
- Evaluation Dataset: The test split of asierhv/composite_corpus_eu_v2.1.
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1.25e-05
- train_batch_size: 32
- eval_batch_size: 16
- seed: 42
- optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 10000
- mixed_precision_training: Native AMP
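The learning-rate settings above can be visualized with a small sketch. It assumes the linear scheduler has the usual shape used by the Transformers library: a linear ramp from 0 to the peak rate over the warmup steps, then a linear decay to 0 at the final training step. The function name `learning_rate` is illustrative.

```python
PEAK_LR = 1.25e-05       # learning_rate
WARMUP_STEPS = 500       # lr_scheduler_warmup_steps
TRAINING_STEPS = 10_000  # training_steps


def learning_rate(step: int) -> float:
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < WARMUP_STEPS:
        # Warmup: ramp linearly from 0 up to the peak rate.
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    # Decay: fall linearly from the peak rate to 0 at the last step.
    remaining = max(0, TRAINING_STEPS - step)
    return PEAK_LR * remaining / max(1, TRAINING_STEPS - WARMUP_STEPS)


for step in (0, 250, 500, 5250, 10_000):
    print(f"step {step:>6}: lr = {learning_rate(step):.3e}")
```

With these values the rate peaks at step 500 and is back to half the peak at step 5250, midway through the decay phase.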
Training results
Training Loss | Epoch | Step | Validation Loss | WER |
---|---|---|---|---|
0.3863 | 0.1 | 1000 | 0.4090 | 21.2189 |
0.1897 | 0.2 | 2000 | 0.3457 | 15.4490 |
0.1379 | 0.3 | 3000 | 0.3283 | 13.5756 |
0.1825 | 0.4 | 4000 | 0.3024 | 12.3954 |
0.0775 | 0.5 | 5000 | 0.3198 | 11.8771 |
0.0975 | 0.6 | 6000 | 0.2924 | 11.2589 |
0.1132 | 0.7 | 7000 | 0.2969 | 10.8468 |
0.0852 | 0.8 | 8000 | 0.2237 | 9.7727 |
0.0585 | 0.9 | 9000 | 0.2317 | 9.6291 |
0.0654 | 1.0 | 10000 | 0.2353 | 9.5479 |
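A short script can summarize the table above. The values are copied from the table; the variable names are illustrative. It reports which checkpoint has the lowest WER, which has the lowest validation loss, and the relative WER reduction between the first and best evaluations.

```python
# (step, validation_loss, wer) rows copied from the training results table.
RESULTS = [
    (1000, 0.4090, 21.2189),
    (2000, 0.3457, 15.4490),
    (3000, 0.3283, 13.5756),
    (4000, 0.3024, 12.3954),
    (5000, 0.3198, 11.8771),
    (6000, 0.2924, 11.2589),
    (7000, 0.2969, 10.8468),
    (8000, 0.2237, 9.7727),
    (9000, 0.2317, 9.6291),
    (10000, 0.2353, 9.5479),
]

best_wer_step, _, best_wer = min(RESULTS, key=lambda row: row[2])
best_loss_step, best_loss, _ = min(RESULTS, key=lambda row: row[1])

first_wer = RESULTS[0][2]
relative_reduction = (first_wer - best_wer) / first_wer * 100

print(f"Lowest WER: {best_wer} at step {best_wer_step}")
print(f"Lowest validation loss: {best_loss} at step {best_loss_step}")
print(f"Relative WER reduction since step 1000: {relative_reduction:.1f}%")
```

The lowest validation loss occurs at step 8000 while WER keeps improving through step 10000, so the two criteria would select different checkpoints.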
Framework versions
- Transformers 4.49.0.dev0
- Pytorch 2.6.0+cu124
- Datasets 3.3.1.dev0
- Tokenizers 0.21.0