# Model Card for atlasia/XLM-RoBERTa-Morocco

## Model Description
XLM-RoBERTa-Morocco is a masked language model fine-tuned specifically for Moroccan Darija (Moroccan Arabic dialect). This model is based on FacebookAI/xlm-roberta-large and has been further trained on the comprehensive Atlaset dataset, a curated collection of Moroccan Darija text.
## Intended Uses
This model is designed for:
- Text classification tasks in Moroccan Darija
- Named entity recognition in Moroccan Darija
- Sentiment analysis of Moroccan text
- Question answering systems for Moroccan users
- Building embeddings for Moroccan Darija text (see the sketch after this list)
- Serving as a foundation for downstream NLP tasks specific to Moroccan dialect
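For the embedding use case above, here is a minimal sketch that mean-pools the encoder's final hidden states into sentence vectors. The pooling strategy is an illustrative choice, not something this card prescribes, and the example sentences are only for demonstration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("atlasia/XLM-RoBERTa-Morocco")
encoder = AutoModel.from_pretrained("atlasia/XLM-RoBERTa-Morocco")  # encoder only, no MLM head

def embed(texts):
    # Tokenize a batch of Darija sentences with padding and truncation
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state  # (batch, seq_len, hidden_dim)
    # Mean-pool over real (non-padding) tokens using the attention mask
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# "Are you Moroccan?", "I speak Darija"
embeddings = embed(["واش نتا مغربي؟", "أنا كنهضر الدارجة"])
print(embeddings.shape)  # torch.Size([2, 1024]) for the large architecture
```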
## Training Details
- Base Model: FacebookAI/xlm-roberta-large
- Training Data: Atlaset dataset (1.17M examples, 155M tokens)
- Training Procedure: Fine-tuning with masked language modeling objective
- Hyperparameters:
  - Batch size: 128
  - Learning rate: 1e-4, selected after testing values in {1e-4, 5e-5, 1e-5}
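A minimal sketch of this continued-pretraining setup with the Hugging Face Trainer is shown below. The dataset identifier atlasia/Atlaset, the text column name, and the 15% masking rate are assumptions for illustration, not details confirmed by this card:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Assumed dataset id and column name; adjust to the actual Atlaset release
dataset = load_dataset("atlasia/Atlaset", split="train")
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Standard RoBERTa-style dynamic masking at 15% (masking rate not stated in this card)
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="xlm-roberta-morocco",
    per_device_train_batch_size=128,  # card reports batch size 128 (device layout unknown)
    learning_rate=1e-4,               # best of {1e-4, 5e-5, 1e-5}
)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```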
## Performance
In human evaluations conducted through the Atlaset-Arena, this model demonstrated significant improvements over baseline models:
| Model | Wins | Total Comparisons | Win Rate (%) |
|---|---|---|---|
| atlasia/XLM-RoBERTa-Morocco | 72 | 120 | 60.00 |
| aubmindlab/bert-base-arabertv02 | 63 | 114 | 55.26 |
| SI2M-Lab/DarijaBERT | 55 | 119 | 46.22 |
| FacebookAI/xlm-roberta-large | 51 | 120 | 42.50 |
| google-bert/bert-base-multilingual-cased | 29 | 120 | 24.17 |
This corresponds to a 17.5 percentage-point gain in win rate over the base FacebookAI/xlm-roberta-large model (60.00% vs. 42.50%).
## Limitations
- While the model performs well on Moroccan Darija, performance may vary across different regional variations within Morocco
- The model may not handle code-switching between Darija and other languages optimally
- Performance on highly technical or specialized domains may be limited by the training data composition
## Ethical Considerations
- This model is intended to improve accessibility of NLP technologies for Moroccan Darija speakers
- Users should be aware that the model may reflect biases present in the training data
- The model should be further evaluated before deployment in high-stakes applications
## How to Use

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("atlasia/XLM-RoBERTa-Morocco")
model = AutoModelForMaskedLM.from_pretrained("atlasia/XLM-RoBERTa-Morocco")

# Example usage for masked language modeling
# (XLM-RoBERTa uses <mask> as its mask token, not [MASK])
# "I speak Moroccan Darija <mask> well."
text = "أنا كنتكلم الدارجة المغربية <mask> مزيان."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Decode the most likely token at the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(outputs.logits[0, mask_index].argmax(dim=-1)))
```
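For quick experimentation, the same checkpoint also works with the fill-mask pipeline, which returns the top candidate completions with scores:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="atlasia/XLM-RoBERTa-Morocco")
# "I speak Moroccan Darija <mask> well."
for prediction in fill_mask("أنا كنتكلم الدارجة المغربية <mask> مزيان."):
    print(prediction["token_str"], round(prediction["score"], 4))
```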
## Citation

```bibtex
@misc{atlasia2025xlm-roberta-morocco,
  title={XLM-RoBERTa-Morocco: A Masked Language Model for Moroccan Darija},
  author={Abdelaziz Bounhar and Abdeljalil El Majjodi},
  year={2025},
  howpublished={\url{https://huggingface.co./atlasia/XLM-RoBERTa-Morocco}},
  organization={AtlasIA}
}
```
## Acknowledgements
We thank the Hugging Face team for their support and the vibrant research community behind Moroccan Darija NLP. Special thanks to all contributors to the Atlaset dataset, which made this model possible.