Moroccan Darija Embedding Models
This repository contains word embedding models trained for Moroccan Darija, a widely spoken Arabic dialect in Morocco. Currently, it includes FastText-based embeddings trained on the curated Al Atlas dataset composed of Moroccan Darija text.
Features
- FastText embeddings: Pre-trained word vectors using FastText, which supports subword information and works well with dialectal and morphologically rich languages.
- Efficient training pipeline: Code for training FastText embeddings on Moroccan Darija datasets.
- Pre-trained models: Ready-to-use embeddings for downstream NLP tasks are available on the Hugging Face Hub.
Installation
Clone the GitHub repository and install the required dependencies:
git clone https://github.com/BounharAbdelaziz/Moroccan-Darija-Embedding.git
cd Moroccan-Darija-Embedding
pip install -r requirements.txt
Usage
Loading Pre-trained Embeddings
You can load the trained FastText model with the fasttext Python library:
import fasttext
# Download fasttext_cbow_v0.bin from the Hub first:
# https://huggingface.co/atlasia/Moroccan-Darija-Embedding
model = fasttext.load_model("fasttext_cbow_v0.bin")
# Look up the embedding vector for a single word
word_vector = model.get_word_vector("كلمة")
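Once word vectors are available, a common next step is comparing two words by cosine similarity. The following is a minimal sketch rather than part of this repository: the helper names `cosine_similarity` and `word_similarity` are hypothetical, and the only assumption about the model is that it exposes `get_word_vector()`, as used above.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: all-zero vectors score 0.0.
        return 0.0
    return dot / (norm_u * norm_v)


def word_similarity(model, word_a, word_b):
    """Compare two words with any model object exposing get_word_vector().

    Hypothetical helper for illustration; not part of fasttext itself.
    """
    return cosine_similarity(
        model.get_word_vector(word_a),
        model.get_word_vector(word_b),
    )
```

With the model loaded as shown earlier, `word_similarity(model, "كلمة", "كلمة")` should return a value of approximately 1.0, since a word is compared with itself. Because FastText builds vectors from subword information, the same call should also work for words that did not appear in the training data.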
Roadmap
- ✅ FastText embeddings
- ⏳ Word2Vec and GloVe embeddings
- ⏳ Transformer-based contextual embeddings (e.g., BERT, RoBERTa)
- ⏳ Sentence embeddings: Continue training the MoRdern-Bert model.
Contributing
Contributions are welcome! Feel free to open issues or submit pull requests to improve the models and codebase.