Model Card for atlasia/XLM-RoBERTa-Morocco

Model Description

XLM-RoBERTa-Morocco is a masked language model fine-tuned specifically for Moroccan Darija (Moroccan Arabic dialect). This model is based on FacebookAI/xlm-roberta-large and has been further trained on the comprehensive Atlaset dataset, a curated collection of Moroccan Darija text.

Intended Uses

This model is designed for:

  • Text classification tasks in Moroccan Darija
  • Named entity recognition in Moroccan Darija
  • Sentiment analysis of Moroccan text
  • Question answering systems for Moroccan users
  • Building embeddings for Moroccan Darija text (see the sketch after this list)
  • Serving as a foundation for downstream NLP tasks specific to Moroccan dialect
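
As a concrete illustration of the embeddings use case above, the sketch below mean-pools the final hidden states over non-padding tokens. The mean-pooling strategy is an assumption for illustration, not a method prescribed by this card.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("atlasia/XLM-RoBERTa-Morocco")
model = AutoModel.from_pretrained("atlasia/XLM-RoBERTa-Morocco")

def embed(sentences):
    # Tokenize a batch of Darija sentences with padding
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, seq_len, 1024)
    # Mean-pool over real tokens only, ignoring padding positions
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# "Does this model work well?" in Darija
embeddings = embed(["واش هاد الموديل خدام مزيان؟"])
print(embeddings.shape)  # torch.Size([1, 1024])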

Training Details

  • Base Model: FacebookAI/xlm-roberta-large (560M parameters)
  • Training Data: Atlaset dataset (1.17M examples, 155M tokens)
  • Training Procedure: Fine-tuning with masked language modeling objective
  • Hyperparameters:
    • Batch size: 128
    • Learning rate: 1e-4, selected from a sweep over {1e-4, 5e-5, 1e-5} (a fine-tuning sketch follows this list)
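
The following is a minimal sketch of the fine-tuning setup described above, using the Hugging Face Trainer. The dataset ID atlasia/Atlaset, the "text" column name, the 15% masking probability, the epoch count, and the per-device/accumulation split of the batch size of 128 are all assumptions not stated in this card.

from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumed dataset ID and text column; adjust to the actual Atlaset release
dataset = load_dataset("atlasia/Atlaset", split="train")

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# 15% masking probability is the common MLM default, assumed here
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Effective batch size 128 = 16 per device x 8 accumulation steps (assumed split)
args = TrainingArguments(
    output_dir="xlm-roberta-morocco",
    learning_rate=1e-4,              # selected from {1e-4, 5e-5, 1e-5}
    per_device_train_batch_size=16,
    gradient_accumulation_steps=8,
    num_train_epochs=1,              # assumed; not stated in the card
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()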

Performance

In human evaluations conducted through the Atlaset-Arena, this model achieved the highest win rate among the evaluated models:

Model                                       Wins   Total Comparisons   Win Rate (%)
atlasia/XLM-RoBERTa-Morocco                   72          120              60.00
aubmindlab/bert-base-arabertv02               63          114              55.26
SI2M-Lab/DarijaBERT                           55          119              46.22
FacebookAI/xlm-roberta-large                  51          120              42.50
google-bert/bert-base-multilingual-cased      29          120              24.17

The model achieves a 17.5 percentage-point higher win rate than the base XLM-RoBERTa-large model (60.00% vs. 42.50%).

Limitations

  • While the model performs well on Moroccan Darija, performance may vary across the regional varieties of Darija spoken within Morocco
  • The model may not handle code-switching between Darija and other languages optimally
  • Performance on highly technical or specialized domains may be limited by the training data composition

Ethical Considerations

  • This model is intended to improve accessibility of NLP technologies for Moroccan Darija speakers
  • Users should be aware that the model may reflect biases present in the training data
  • The model should be further evaluated before deployment in high-stakes applications

How to Use

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("atlasia/XLM-RoBERTa-Morocco")
model = AutoModelForMaskedLM.from_pretrained("atlasia/XLM-RoBERTa-Morocco")

# Example usage for masked language modeling.
# XLM-RoBERTa uses <mask> (not [MASK]) as its mask token.
# The sentence means: "I speak Moroccan Darija <mask> well."
text = f"أنا كنتكلم الدارجة المغربية {tokenizer.mask_token} مزيان."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decode the top prediction for the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
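
For quick experiments, the Transformers fill-mask pipeline wraps the same steps (tokenization, mask handling, decoding) in one call:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="atlasia/XLM-RoBERTa-Morocco")
# "I speak Moroccan Darija <mask> well." — prints candidate tokens with scores
for candidate in fill_mask("أنا كنتكلم الدارجة المغربية <mask> مزيان."):
    print(candidate["token_str"], candidate["score"])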

Citation

@misc{atlasia2025xlm-roberta-morocco,
  title={XLM-RoBERTa-Morocco: A Masked Language Model for Moroccan Darija},
  author={Abdelaziz Bounhar and Abdeljalil El Majjodi},
  year={2025},
  howpublished={\url{https://huggingface.co./atlasia/XLM-RoBERTa-Morocco}},
  organization={AtlasIA}
}

Acknowledgements

We thank the Hugging Face team for their support and the vibrant research community behind Moroccan Darija NLP. Special thanks to all contributors to the Atlaset dataset, which made this model possible.
