Whisper Tiny Basque

This model is a fine-tuned version of openai/whisper-tiny for Basque (eu) Automatic Speech Recognition (ASR). It was trained on asierhv/composite_corpus_eu_v2.1, a composite corpus designed to improve Basque ASR performance.

Key results:

WER on the training corpus: The fine-tuned model achieves a Word Error Rate (WER) of 14.8495 on the validation set of the asierhv/composite_corpus_eu_v2.1 dataset at the final checkpoint (step 10000), down from 34.1639 at step 1000. This is reported as an improvement over the base whisper-tiny model for Basque, although no baseline figure is given.
Performance on Common Voice: When evaluated on the Mozilla Common Voice 18.0 dataset, the model achieved a WER of 13.56, which suggests that it generalizes to other Basque speech datasets.
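For readers unfamiliar with the metric, the sketch below shows one common way to compute WER: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It assumes the figures above are percentages and applies no text normalization beyond whitespace splitting; the card does not say which normalization or scoring tool was used, so this is an illustration and not the evaluation script behind the reported numbers. The Basque example sentence is made up for the demonstration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("kaixo zer moduz zaude", "kaixo zer modu zaude"))  # 25.0
```

Corpus-level WER is normally computed by summing errors and reference words over all utterances before dividing, not by averaging per-utterance scores.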

Model description

This model adapts the Whisper architecture, originally developed by OpenAI, to the Basque language. Fine-tuning whisper-tiny on a composite Basque speech corpus teaches it to transcribe spoken Basque. whisper-tiny is the smallest of the Whisper models, so it favors speed and a small footprint over the accuracy of the larger variants.

Intended uses & limitations

Intended uses:

Automatic transcription of Basque speech.
Development of Basque speech-based applications.
Research on Basque speech processing.
Accessibility tools for Basque speakers.
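A minimal transcription sketch for the first use case, assuming the `transformers` library (with PyTorch) is installed and that ffmpeg is available for decoding audio files. The model ID is the one this card belongs to; the helper name `transcribe` and the file name in the usage comment are hypothetical. The import sits inside the function only so that the module can be loaded without the heavy dependency.

```python
MODEL_ID = "xezpeleta/whisper-tiny-eu"


def transcribe(audio_path: str) -> str:
    """Transcribe one audio file with the fine-tuned model (hypothetical helper)."""
    # Imported lazily: loading transformers and the model weights is slow.
    from transformers import pipeline

    # Builds a speech recognition pipeline; the weights are downloaded
    # from the Hugging Face Hub on first use.
    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    result = asr(audio_path)
    return result["text"]


# Usage (the file name is a placeholder):
#   print(transcribe("sample_eu.wav"))
```

For repeated calls, building the pipeline once and reusing it avoids reloading the weights for every file.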

Limitations:

Performance may vary depending on the quality of the audio input (e.g., background noise, recording quality).
The model might struggle with highly dialectal or informal speech.
While the model shows improved performance, it may still produce errors, especially with complex sentences or uncommon words.
The model is based on the tiny (smallest) version of Whisper, so larger Whisper variants are likely to reach higher accuracy.

Training and evaluation data

Training dataset: asierhv/composite_corpus_eu_v2.1. This dataset is a composite corpus of Basque speech data, designed to improve the performance of Basque ASR systems.
Evaluation dataset: The test portion of asierhv/composite_corpus_eu_v2.1.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 3.75e-05
train_batch_size: 32
eval_batch_size: 16
seed: 42
optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 1000
training_steps: 10000
mixed_precision_training: Native AMP
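To make the scheduler settings concrete, the sketch below computes the learning rate at a given step, assuming the usual meaning of a linear schedule with warmup: a linear ramp from 0 to the peak rate over the warmup steps, followed by a linear decay to 0 at the final step. The function name `linear_lr` is hypothetical; the three constants come from the list above.

```python
PEAK_LR = 3.75e-05     # learning_rate
WARMUP_STEPS = 1000    # lr_scheduler_warmup_steps
TOTAL_STEPS = 10000    # training_steps


def linear_lr(step: int,
              peak_lr: float = PEAK_LR,
              warmup_steps: int = WARMUP_STEPS,
              total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate at `step` under linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup phase: ramp up proportionally to the step count.
        return peak_lr * step / warmup_steps
    # Decay phase: shrink proportionally to the steps that remain.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 500, 1000, 5500, 10000):
    print(step, linear_lr(step))
```

Under these assumptions the rate peaks at step 1000 and is back to half the peak by step 5500, so the later checkpoints in the results table were trained with progressively smaller updates.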

Training results

Training Loss	Epoch	Step	Validation Loss	WER
0.586	0.1	1000	0.6249	34.1639
0.3145	0.2	2000	0.5048	25.2591
0.225	0.3	3000	0.4839	22.0557
0.3003	0.4	4000	0.4540	20.3072
0.132	0.5	5000	0.4574	19.0146
0.1588	0.6	6000	0.4380	17.8219
0.1841	0.7	7000	0.4395	16.6667
0.143	0.8	8000	0.3719	15.4490
0.0967	0.9	9000	0.3685	15.1368
0.1059	1.0	10000	0.3719	14.8495
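The table can be summarized with a few lines of Python. The sketch below copies the step, validation loss, and WER columns from the rows above, then finds the checkpoints with the lowest WER and the lowest validation loss and computes the relative WER reduction between the first and last logged checkpoints. The variable names are illustrative only.

```python
# (step, validation_loss, wer) copied from the training results table.
RESULTS = [
    (1000, 0.6249, 34.1639),
    (2000, 0.5048, 25.2591),
    (3000, 0.4839, 22.0557),
    (4000, 0.4540, 20.3072),
    (5000, 0.4574, 19.0146),
    (6000, 0.4380, 17.8219),
    (7000, 0.4395, 16.6667),
    (8000, 0.3719, 15.4490),
    (9000, 0.3685, 15.1368),
    (10000, 0.3719, 14.8495),
]

best_wer_step = min(RESULTS, key=lambda row: row[2])[0]
best_loss_step = min(RESULTS, key=lambda row: row[1])[0]

first_wer = RESULTS[0][2]
last_wer = RESULTS[-1][2]
relative_reduction = (first_wer - last_wer) / first_wer

print(f"lowest WER at step {best_wer_step}")
print(f"lowest validation loss at step {best_loss_step}")
print(f"relative WER reduction, step 1000 to 10000: {relative_reduction:.1%}")
```

The lowest WER falls at step 10000 while the lowest validation loss falls at step 9000, so the two criteria would select different checkpoints. The relative reduction of about 56.5% compares logged checkpoints of this run only; it is not a comparison against the base whisper-tiny model.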

Framework versions

Transformers 4.49.0.dev0
PyTorch 2.6.0+cu124
Datasets 3.3.1.dev0
Tokenizers 0.21.0
