---
language:
- kal # Kalaallisut language (Greenlandic)
license: mit # Model license
metrics:
- wer # Word Error Rate (WER) used as evaluation metric
model_name: Whisper Tiny Fine-tuned on Kalaallisut
tags:
- whisper
- automatic-speech-recognition
- speech-to-text
- kalaallisut
- greenlandic
pipeline_tag: automatic-speech-recognition
widget:
- src: https://huggingface.co./datasets/your-dataset/sample_audio.mp3 # Replace with actual path to audio file
---

# Whisper Tiny Fine-tuned on Kalaallisut

There is a chance that I will redo this model and start over from the beginning, depending on how training goes. Please do not rely on its transcriptions yet.

This model is a fine-tuned version of `openai/whisper-tiny` on a **small dataset** of Kalaallisut (Greenlandic) speech. Whisper is a general-purpose speech recognition model trained on a large-scale dataset; however, this Kalaallisut fine-tune **may still produce unreliable transcriptions** due to the small amount of available training data.

## Model Details

- **Model Name**: Whisper Tiny Fine-tuned on Kalaallisut
- **Base Model**: [openai/whisper-tiny](https://huggingface.co./openai/whisper-tiny)
- **Fine-tuned on**: Kalaallisut language dataset
- **Dataset**: Audio-transcription pairs (limited in size and variety)
- **Purpose**: Speech-to-text for the Kalaallisut language
- **License**: MIT License

## Fine-Tuning Process

The model has been fine-tuned **incrementally** with newly added Kalaallisut data. Each fine-tuning session adds more data, which helps the model improve its understanding of the language. However, due to the small dataset, it is still prone to **overfitting** and may produce **inaccurate or gibberish transcriptions** in some cases.

- **Fine-tuning strategy**: New data was added, and the model was fine-tuned with a low learning rate to avoid overwriting previously learned weights (see the training-arguments sketch after the update notes below).
- **Learning Rate**: 1e-5 (reduced to 5e-6 in later fine-tuning stages).
- **Batch Size**: 16 for training, 8 for evaluation.
- **Evaluation Metric**: Word Error Rate (WER) is used to evaluate the model's performance.
- **Checkpoints**: Frequent checkpointing and validation during training to prevent overfitting.

### Recent Fine-tuning Updates

- **New data**: Additional Kalaallisut audio was added, expanding the model's vocabulary and helping it handle a wider range of speech patterns.
- **Improved performance**: The model has shown some improvement, but it still struggles with more complex or lengthy speech inputs.
- **Overfitting reduction**: Adjustments such as lower learning rates and early stopping have been introduced to mitigate overfitting.
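For reference, the hyperparameters above map roughly onto a `Seq2SeqTrainingArguments` configuration like the one below. This is a minimal sketch, not the exact training script: the output directory, checkpointing step counts, and early-stopping patience are assumptions, and the dataset, data collator, and `Seq2SeqTrainer` wiring are omitted.

```python
import evaluate
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
wer_metric = evaluate.load("wer")  # WER, the metric reported on this card

def compute_metrics(pred):
    """Decode predicted and reference token ids, then compute WER."""
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id  # undo label padding
    pred_str = processor.batch_decode(pred.predictions, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)
    return {"wer": wer_metric.compute(predictions=pred_str, references=label_str)}

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-tiny-kalaallisut",  # hypothetical output path
    learning_rate=1e-5,                       # lowered to 5e-6 in later stages
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    eval_strategy="steps",                    # `evaluation_strategy` in older transformers versions
    save_steps=500,                           # assumed value, not documented on this card
    eval_steps=500,                           # assumed value, not documented on this card
    predict_with_generate=True,               # decode with generate() so WER can be computed
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,                  # lower WER is better
)

# Early stopping (mentioned above) would be passed to the trainer as a callback, e.g.
# Seq2SeqTrainer(..., callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])
```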
## Training Data

The training data consists of a small set of **audio-transcription pairs** in Kalaallisut. Due to the **limited size of the dataset**, the model's performance is not fully reliable for general use, and it may produce **inaccurate or gibberish transcriptions** for complex or diverse audio inputs.

- **Hours of Audio**: The dataset is still limited to a few hours of speech.
- **Dataset Type**: Spoken words and phrases in Kalaallisut, with some conversational audio added in later updates.
- **Limitations**: The small dataset limits performance, and the model may not generalize well to more complex audio, especially audio containing less common phrases or dialects.

### Known Issues

- **Transcription Quality**: The model may produce **gibberish or incorrect transcriptions**, especially for longer or more complex audio inputs.
- **Small dataset limitations**: The model's vocabulary is limited, and it struggles with **less common words** or more phonetically complex speech.
- **Noisy or fast speech**: The model may produce unintelligible transcriptions for audio that is noisy or spoken too quickly.

### Limitations

- **Generalization issues**: Due to the small dataset, the model may not generalize well to new or more complex audio inputs.
- **Inaccurate transcriptions**: The model may produce **gibberish or incorrect transcriptions** in certain scenarios.
- **Unsuitable for production use**: The model is not currently suitable for large-scale production applications and is intended for experimentation, research, or small projects.

### Intended Use

This model can be used for:

- **Experimentation and research**: Useful for testing speech-to-text in Kalaallisut, but note that the output may not always be reliable.
- **Small projects**: Useful for transcription tasks in small-scale projects that need Kalaallisut language support.
- **Further fine-tuning**: Suitable as a starting point for further fine-tuning with additional data to improve performance.

### Usage Example

To use the model for transcription, load it with the Hugging Face `transformers` library:

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

# Load the model and processor
processor = WhisperProcessor.from_pretrained("VoiceLessQ/whisper-tiny-kalaallisut")
model = WhisperForConditionalGeneration.from_pretrained("VoiceLessQ/whisper-tiny-kalaallisut")

# Load an audio file as a 16 kHz waveform (replace this with your own audio loading)
audio_array = ...  # e.g. a 1-D numpy array of samples at 16 kHz

# Prepare the input features for the model
input_features = processor(audio_array, sampling_rate=16000, return_tensors="pt").input_features

# Generate the transcription from the audio input
with torch.no_grad():
    generated_ids = model.generate(input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(f"Transcription: {transcription}")
```
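If you have a reference transcript for your audio, the WER metric reported on this card can be computed with the `evaluate` library. The small sketch below reuses the `transcription` string from the usage example above; the reference string is a placeholder to be replaced with a human-made transcript of the same audio.

```python
import evaluate

# Word Error Rate (WER), the metric reported on this model card
wer_metric = evaluate.load("wer")

# `transcription` comes from the usage example above; replace the placeholder
# reference with a human transcript of the same audio.
reference = "..."  # placeholder reference transcript
wer = wer_metric.compute(predictions=[transcription], references=[reference])
print(f"WER: {wer:.2%}")
```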