# Model Card for atlasia/XLM-RoBERTa-Morocco

## Model Description
XLM-RoBERTa-Morocco is a masked language model fine-tuned specifically for Moroccan Darija (Moroccan Arabic dialect). This model is based on FacebookAI/xlm-roberta-large and has been further trained on the comprehensive Atlaset dataset, a curated collection of Moroccan Darija text.
## Intended Uses
This model is designed for:
- Text classification tasks in Moroccan Darija
- Named entity recognition in Moroccan Darija
- Sentiment analysis of Moroccan text
- Question answering systems for Moroccan users
- Building embeddings for Moroccan Darija text (see the sketch after this list)
- Serving as a foundation for downstream NLP tasks specific to Moroccan dialect
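For the embedding use case above, here is a minimal sketch that mean-pools the encoder's final hidden states into sentence vectors. The pooling strategy is an illustrative choice, not something this card prescribes, and the example sentences are only for demonstration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("atlasia/XLM-RoBERTa-Morocco")
encoder = AutoModel.from_pretrained("atlasia/XLM-RoBERTa-Morocco")  # encoder only, no MLM head

def embed(texts):
    # Tokenize a batch of Darija sentences with padding and truncation
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state  # (batch, seq_len, hidden_dim)
    # Mean-pool over real (non-padding) tokens using the attention mask
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# "Are you Moroccan?", "I speak Darija"
embeddings = embed(["واش نتا مغربي؟", "أنا كنهضر الدارجة"])
print(embeddings.shape)  # torch.Size([2, 1024]) for the large architecture
```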
## Training Details
- Base Model: FacebookAI/xlm-roberta-large
- Training Data: Atlaset dataset (1.17M examples, 155M tokens)
- Training Procedure: Fine-tuning with masked language modeling objective
- Hyperparameters:
  - Batch size: 128
  - Learning rate: 1e-4, selected after testing values in {1e-4, 5e-5, 1e-5}
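A minimal sketch of this continued-pretraining setup with the Hugging Face Trainer is shown below. The dataset identifier atlasia/Atlaset, the text column name, and the 15% masking rate are assumptions for illustration, not details confirmed by this card:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Assumed dataset id and column name; adjust to the actual Atlaset release
dataset = load_dataset("atlasia/Atlaset", split="train")
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Standard RoBERTa-style dynamic masking at 15% (masking rate not stated in this card)
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="xlm-roberta-morocco",
    per_device_train_batch_size=128,  # card reports batch size 128 (device layout unknown)
    learning_rate=1e-4,               # best of {1e-4, 5e-5, 1e-5}
)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```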
## Performance
In human evaluations conducted through the Atlaset-Arena, this model demonstrated significant improvements over baseline models:
| Model | Wins | Total Comparisons | Win Rate (%) |
|---|---|---|---|
| atlasia/XLM-RoBERTa-Morocco | 72 | 120 | 60.00 |
| aubmindlab/bert-base-arabertv02 | 63 | 114 | 55.26 |
| SI2M-Lab/DarijaBERT | 55 | 119 | 46.22 |
| FacebookAI/xlm-roberta-large | 51 | 120 | 42.50 |
| google-bert/bert-base-multilingual-cased | 29 | 120 | 24.17 |
This corresponds to a 17.5 percentage-point gain in win rate over the base FacebookAI/xlm-roberta-large model (60.00% vs. 42.50%).
## Limitations
- While the model performs well on Moroccan Darija, performance may vary across different regional variations within Morocco
- The model may not handle code-switching between Darija and other languages optimally
- Performance on highly technical or specialized domains may be limited by the training data composition
## Ethical Considerations
- This model is intended to improve accessibility of NLP technologies for Moroccan Darija speakers
- Users should be aware that the model may reflect biases present in the training data
- The model should be further evaluated before deployment in high-stakes applications
## How to Use

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("atlasia/XLM-RoBERTa-Morocco")
model = AutoModelForMaskedLM.from_pretrained("atlasia/XLM-RoBERTa-Morocco")

# Example usage for masked language modeling
# (XLM-RoBERTa uses <mask> as its mask token, not [MASK])
# "I speak Moroccan Darija <mask> well."
text = "أنا كنتكلم الدارجة المغربية <mask> مزيان."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Decode the most likely token at the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(outputs.logits[0, mask_index].argmax(dim=-1)))
```
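For quick experimentation, the same checkpoint also works with the fill-mask pipeline, which returns the top candidate completions with scores:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="atlasia/XLM-RoBERTa-Morocco")
# "I speak Moroccan Darija <mask> well."
for prediction in fill_mask("أنا كنتكلم الدارجة المغربية <mask> مزيان."):
    print(prediction["token_str"], round(prediction["score"], 4))
```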
## Citation

```bibtex
@misc{atlasia2025xlm-roberta-morocco,
  title={XLM-RoBERTa-Morocco: A Masked Language Model for Moroccan Darija},
  author={Abdelaziz Bounhar and Abdeljalil El Majjodi},
  year={2025},
  howpublished={\url{https://huggingface.co./atlasia/XLM-RoBERTa-Morocco}},
  organization={AtlasIA}
}
```
## Acknowledgements
We thank the Hugging Face team for their support and the vibrant research community behind Moroccan Darija NLP. Special thanks to all contributors to the Atlaset dataset, which made this model possible.