Whisper Small Basque

This model is a fine-tuned version of openai/whisper-small for Basque (eu) Automatic Speech Recognition (ASR). It was trained on asierhv/composite_corpus_eu_v2.1, a composite corpus assembled to improve Basque ASR performance.

Key improvements and results compared to the base model:

  • Significant WER reduction: The fine-tuned model achieves a Word Error Rate (WER) of 9.5479 on the validation set of asierhv/composite_corpus_eu_v2.1, a clear improvement over the base whisper-small model for Basque.
  • Performance on Common Voice: Evaluated on Mozilla Common Voice 18.0, the model reaches a WER of 7.63, showing that it generalizes to other Basque speech datasets and reflecting the accuracy gains of the larger model capacity compared with whisper-base.

Model description

This model leverages the whisper-small architecture, which balances accuracy and computational efficiency. Fine-tuned on a dedicated Basque speech corpus, it specializes in accurately transcribing Basque speech. Compared with whisper-base, it has a larger capacity, improving accuracy at the cost of increased computational resources.
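
For quick experimentation, here is a minimal transcription sketch using the transformers pipeline API. The audio file name and decoding options are illustrative assumptions, not part of the original card:

```python
from transformers import pipeline

# Load the fine-tuned Basque model through the ASR pipeline.
asr = pipeline(
    "automatic-speech-recognition",
    model="xezpeleta/whisper-small-eu",
)

# Force Basque transcription instead of relying on language auto-detection.
result = asr(
    "audio.wav",  # hypothetical example file (16 kHz mono works best)
    generate_kwargs={"language": "basque", "task": "transcribe"},
)
print(result["text"])
```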

Intended uses & limitations

Intended uses:

  • High-accuracy automatic transcription of Basque speech, including professional transcription services and applications.
  • Development of advanced Basque speech-based applications that require high precision.
  • Research in Basque speech processing where the highest possible accuracy is needed.
  • Scenarios where the higher computational cost is justified by the improvement in accuracy.

Limitations:

  • Performance is still influenced by audio quality, with challenges arising from background noise and poor recording conditions.
  • Accuracy may be affected by highly dialectal or informal Basque speech.
  • Despite improved performance, the model may still produce errors, particularly with complex linguistic structures or rare words.
  • Because the small model is larger than both the base and tiny variants, inference is slower and requires more resources; a sketch of common mitigations follows this list.
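
If the extra cost matters for your deployment, a common mitigation (an assumption about typical setups, not something from the original card) is to run inference in float16 on a GPU and chunk long-form audio:

```python
import torch
from transformers import pipeline

# Sketch: reduce the speed/memory cost noted above by running in float16
# on a GPU and splitting long audio into 30-second windows.
asr = pipeline(
    "automatic-speech-recognition",
    model="xezpeleta/whisper-small-eu",
    torch_dtype=torch.float16,  # requires a CUDA-capable GPU
    device="cuda:0",
    chunk_length_s=30,
)

# "long_audio.wav" is a hypothetical example file.
print(asr("long_audio.wav", generate_kwargs={"language": "basque"})["text"])
```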

Training and evaluation data

  • Training dataset: asierhv/composite_corpus_eu_v2.1. This dataset is a comprehensive collection of Basque speech data, tailored to enhance the performance of Basque ASR systems.
  • Evaluation dataset: the test split of asierhv/composite_corpus_eu_v2.1 (a hedged evaluation sketch follows below).
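
The sketch below shows one way to reproduce a WER measurement on the test split with the datasets and evaluate libraries. The column names ("audio", "text") and the 100-sample subset are assumptions for illustration, not details from the original run:

```python
import torch
import evaluate
from datasets import load_dataset
from transformers import pipeline

# Stream the test split to avoid downloading the full corpus.
ds = load_dataset("asierhv/composite_corpus_eu_v2.1", split="test", streaming=True)

asr = pipeline(
    "automatic-speech-recognition",
    model="xezpeleta/whisper-small-eu",
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)
wer_metric = evaluate.load("wer")

predictions, references = [], []
for sample in ds.take(100):  # small subset for a quick sanity check
    pred = asr(
        sample["audio"],
        generate_kwargs={"language": "basque", "task": "transcribe"},
    )["text"]
    predictions.append(pred)
    references.append(sample["text"])  # assumes the transcript column is "text"

# WER is reported as a percentage in this card, so scale by 100.
wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}")
```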

Training procedure

Training hyperparameters

The following hyperparameters were used during training; a sketch mapping them onto Seq2SeqTrainingArguments follows the list:

  • learning_rate: 1.25e-05
  • train_batch_size: 32
  • eval_batch_size: 16
  • seed: 42
  • optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • training_steps: 10000
  • mixed_precision_training: Native AMP
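
For reference, this is a minimal sketch of how the hyperparameters above map onto transformers' Seq2SeqTrainingArguments. The output directory, evaluation cadence, and predict_with_generate are illustrative assumptions (the 1000-step cadence is inferred from the results table below), not confirmed values from the original run:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-eu",  # hypothetical output path
    learning_rate=1.25e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    seed=42,
    optim="adamw_torch",              # AdamW; betas=(0.9, 0.999), eps=1e-8 are defaults
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=10000,
    fp16=True,                        # native AMP mixed-precision training
    eval_strategy="steps",            # assumption: evaluate every 1000 steps
    eval_steps=1000,
    predict_with_generate=True,       # assumption: generate text for WER evaluation
)
```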

Training results

Training Loss | Epoch | Step  | Validation Loss | WER
0.3863        | 0.1   | 1000  | 0.4090          | 21.2189
0.1897        | 0.2   | 2000  | 0.3457          | 15.4490
0.1379        | 0.3   | 3000  | 0.3283          | 13.5756
0.1825        | 0.4   | 4000  | 0.3024          | 12.3954
0.0775        | 0.5   | 5000  | 0.3198          | 11.8771
0.0975        | 0.6   | 6000  | 0.2924          | 11.2589
0.1132        | 0.7   | 7000  | 0.2969          | 10.8468
0.0852        | 0.8   | 8000  | 0.2237          | 9.7727
0.0585        | 0.9   | 9000  | 0.2317          | 9.6291
0.0654        | 1.0   | 10000 | 0.2353          | 9.5479

Framework versions

  • Transformers 4.49.0.dev0
  • Pytorch 2.6.0+cu124
  • Datasets 3.3.1.dev0
  • Tokenizers 0.21.0