
Whisper Tiny Fine-tuned on Kalaallisut

I may redo the fine-tuning and start over from scratch, depending on how it goes. Please do not rely on its transcriptions for now.

This model is a fine-tuned version of the openai/whisper-tiny model on a small dataset of the Kalaallisut (Greenlandic) language. Whisper is a general-purpose speech recognition model trained on a large-scale dataset. However, this fine-tuned version on Kalaallisut may still produce unreliable transcriptions due to the small amount of available training data.

Model Details

  • Model Name: Whisper Tiny Fine-tuned on Kalaallisut
  • Base Model: openai/whisper-tiny
  • Fine-tuned on: Kalaallisut language dataset
  • Dataset: Audio-transcription pairs (limited in size and variety)
  • Purpose: Speech-to-text for the Kalaallisut language
  • License: MIT License

Fine-Tuning Process

The model has been fine-tuned incrementally with newly added data in Kalaallisut. Each fine-tuning session adds more data, which helps the model improve its understanding of the language. However, due to the small dataset, it is still prone to overfitting and may produce inaccurate or gibberish transcriptions in some cases.

  • Fine-tuning strategy: New data was added, and the model was fine-tuned using a low learning rate to avoid overwriting previously learned weights.
  • Learning Rate: 1e-5 (reduced in later fine-tuning stages to 5e-6).
  • Batch Size: 16 for training, 8 for evaluation
  • Evaluation Metric: Word Error Rate (WER) is used to evaluate the model’s performance.
  • Checkpoints: Frequent checkpointing and validation during training to prevent overfitting.
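Since WER is the evaluation metric used here, a minimal self-contained sketch of how it is typically computed (word-level edit distance divided by the number of reference words) may help; the `wer` helper below is illustrative and not part of this repository, and the Kalaallisut strings are made-up examples.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four reference words -> WER = 0.25
print(wer("aluu qanoq ippit immaqa", "aluu qanoq ippit aap"))
```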

Recent Fine-tuning Updates:

  • New data: Additional Kalaallisut audio data was added, expanding the model’s vocabulary and helping it better understand different speech patterns.
  • Improved performance: The model has shown some improvements, but it still struggles with more complex or lengthy speech inputs.
  • Overfitting reduction: Adjustments such as lower learning rates and early stopping have been introduced to mitigate overfitting.
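The early stopping mentioned above can be sketched as a simple patience rule: stop training once the validation WER has not improved for a fixed number of evaluations. This is an illustrative sketch of the criterion, not the exact training code used for this model.

```python
def should_stop(wer_history, patience=3, min_delta=0.0):
    """Return True once the last `patience` evaluations show no WER improvement."""
    if len(wer_history) <= patience:
        return False
    best_before = min(wer_history[:-patience])
    recent_best = min(wer_history[-patience:])
    # Stop if none of the recent evaluations beat the earlier best by min_delta.
    return recent_best >= best_before - min_delta

# Validation WER per evaluation step: improvement stalls after 0.80.
history = [0.92, 0.85, 0.80, 0.81, 0.82, 0.83]
print(should_stop(history, patience=3))  # True: no improvement in last 3 evals
```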

Training Data

The training data consists of a small set of audio-transcription pairs in Kalaallisut. Due to the limited size of the dataset, the model’s performance is not fully reliable for general use and may produce inaccurate or gibberish transcriptions for complex or diverse audio inputs.

  • Hours of Audio: The dataset is still limited to a few hours of speech data.
  • Dataset Type: Spoken words and phrases in Kalaallisut, with some conversational audio added in later updates.
  • Limitations: The model’s performance is limited by the small dataset, and it may not generalize well to more complex audio inputs, especially those containing less common phrases or dialects.

Known Issues

  • Transcription Quality: The model may produce gibberish or incorrect transcriptions, especially for longer or more complex audio inputs.
  • Small dataset limitations: The model's vocabulary is limited, and it struggles with less common words or more phonetically complex speech.
  • Noisy or fast speech: The model may produce unintelligible transcriptions for audio that is noisy or spoken too quickly.

Limitations

  • Generalization issues: Due to the small dataset, the model may not generalize well to new or more complex audio inputs.
  • Inaccurate transcriptions: The model may produce gibberish or incorrect transcriptions in certain scenarios.
  • Unsuitable for production use: The model is currently not suitable for large-scale production applications and is intended for experimentation, research, or small projects.

Intended Use

This model can be used for:

  • Experimentation and research: Useful for testing speech-to-text in the Kalaallisut language, but note that the output may not always be reliable.
  • Small projects: It can be useful for transcription tasks in small-scale projects that require Kalaallisut language support.
  • Further fine-tuning: The model is suitable for further fine-tuning with additional data to improve its performance.

Usage Example

To use the model for transcription, you can load it with the Hugging Face transformers library:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

# Load the model and processor
processor = WhisperProcessor.from_pretrained("VoiceLessQ/whisper-tiny-kalaallisut")
model = WhisperForConditionalGeneration.from_pretrained("VoiceLessQ/whisper-tiny-kalaallisut")

# Load and process an audio file (replace this with your audio loading method)
audio_array = ...  # Load your audio file as a numpy array or waveform

# Prepare the input features for the model
input_features = processor(audio_array, sampling_rate=16000, return_tensors="pt").input_features

# Generate transcription from the audio input
generated_ids = model.generate(input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(f"Transcription: {transcription}")
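Whisper expects mono audio sampled at 16 kHz. As a dependency-light sketch of one way to obtain the `audio_array` used above (assuming a 16-bit PCM mono WAV file already at 16 kHz; the file name `tone.wav` and the generated test tone are purely illustrative), the standard-library `wave` module plus NumPy suffice:

```python
import wave
import numpy as np

def load_wav_as_float_array(path):
    """Read a 16-bit PCM mono WAV file and return float32 samples in [-1, 1]."""
    with wave.open(path, "rb") as wf:
        assert wf.getnchannels() == 1, "expected mono audio"
        assert wf.getsampwidth() == 2, "expected 16-bit PCM"
        frames = wf.readframes(wf.getnframes())
    samples = np.frombuffer(frames, dtype=np.int16)
    return samples.astype(np.float32) / 32768.0

# Write a one-second 440 Hz test tone at 16 kHz, then load it back.
with wave.open("tone.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    tone = (np.sin(2 * np.pi * 440 * np.arange(16000) / 16000) * 32767).astype(np.int16)
    wf.writeframes(tone.tobytes())

audio_array = load_wav_as_float_array("tone.wav")
print(len(audio_array))  # 16000 samples = one second at 16 kHz
```

For real recordings at other sample rates, resample to 16 kHz first (e.g. with an audio library of your choice) before passing the array to the processor.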
Model size: 37.8M parameters (F32, Safetensors)