metadata

language:
  - en
  - ha
  - yo
  - ig
  - pcm
pipeline_tag: fill-mask

NaijaXLM-T-base

This is an XLM-RoBERTa-base model further pretrained on 2.2 billion Nigerian tweets, described and evaluated in the reference paper. This model was developed together with @pvcastro.
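Since the model is tagged for fill-mask, a minimal usage sketch with the Hugging Face `transformers` pipeline might look like the following. The value of `MODEL_ID` is an assumption: this card does not state the model's Hub identifier, so the sketch defaults to the base checkpoint `xlm-roberta-base` and should be pointed at this model's actual identifier. The helper names `mask_word`, `top_fillers`, and `demo`, and the example sentence, are illustrative and not part of any library.

```python
from typing import Dict, List

# Assumption: replace with this model's identifier on the Hugging Face Hub.
# "xlm-roberta-base" is the base checkpoint named in this card; the card states
# that the vocabulary was kept unchanged, so the same mask token applies.
MODEL_ID = "xlm-roberta-base"
MASK_TOKEN = "<mask>"  # mask token used by XLM-R tokenizers


def mask_word(text: str, target: str, mask_token: str = MASK_TOKEN) -> str:
    """Replace the first occurrence of `target` in `text` with the mask token."""
    if target not in text:
        raise ValueError(f"{target!r} not found in text")
    return text.replace(target, mask_token, 1)


def top_fillers(text: str, model_id: str = MODEL_ID, k: int = 5) -> List[Dict]:
    """Return the k most likely fillers for the single masked position in `text`."""
    # Imported lazily: requires `transformers` plus a backend such as PyTorch.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    return fill(text, top_k=k)


def demo() -> None:
    """Mask one word in an illustrative sentence and print candidate fillers."""
    masked = mask_word("Wetin dey happen for Lagos today?", "Lagos")
    for pred in top_fillers(masked):
        # Each prediction is a dict with keys such as "token_str" and "score".
        print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```

Calling `demo()` downloads the checkpoint on first use and prints one candidate token per line with its probability.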

Model Details

Model Description

Model type: xlm-roberta
Language(s) (NLP): (Nigerian) English, Nigerian Pidgin, Hausa, Yoruba, Igbo
Finetuned from model: xlm-roberta-base

Model Sources

Repository: https://github.com/manueltonneau/hate_speech_nigeria
Paper: https://arxiv.org/abs/2403.19260

Training Details

Training Data

The model was further pretrained on 2.2 billion tweets posted between March 2007 and July 2023. These tweets make up the timelines of 2.8 million Twitter users with a profile location in Nigeria.

Training Procedure

We performed adaptive fine-tuning of XLM-R on the Nigerian Twitter dataset. We kept the same vocabulary as XLM-R and trained the model until convergence for a total of one epoch, using 1% of the dataset as a validation set. Training ran in a distributed environment for approximately 10 days on 4 nodes with 4 RTX 8000 GPUs each (16 GPUs in total), with a total batch size of 576.
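The figures in this setup can be related with some back-of-the-envelope arithmetic. The sketch below makes several assumptions that the card does not confirm: that each tweet is one training example, that no gradient accumulation was used, and that the dataset holds exactly 2.2 billion tweets (the card gives a rounded figure). The function names are illustrative.

```python
NUM_NODES = 4
GPUS_PER_NODE = 4
TOTAL_BATCH_SIZE = 576
NUM_TWEETS = 2_200_000_000   # rounded figure from the card
VALIDATION_PERCENT = 1


def per_gpu_batch_size(total_batch: int, nodes: int, gpus_per_node: int,
                       grad_accum_steps: int = 1) -> int:
    """Split the total batch evenly across all GPUs (and accumulation steps)."""
    world_size = nodes * gpus_per_node
    per_gpu, remainder = divmod(total_batch, world_size * grad_accum_steps)
    if remainder:
        raise ValueError("total batch size is not evenly divisible")
    return per_gpu


def split_sizes(n_examples: int, val_percent: int) -> tuple:
    """Return (train, validation) sizes, using integer math to avoid float error."""
    n_val = n_examples * val_percent // 100
    return n_examples - n_val, n_val


def steps_per_epoch(n_train: int, total_batch: int) -> int:
    """Number of optimizer steps in one pass over the data (ceiling division)."""
    return -(-n_train // total_batch)


if __name__ == "__main__":
    n_train, n_val = split_sizes(NUM_TWEETS, VALIDATION_PERCENT)
    print(per_gpu_batch_size(TOTAL_BATCH_SIZE, NUM_NODES, GPUS_PER_NODE))
    print(n_train, n_val)
    print(steps_per_epoch(n_train, TOTAL_BATCH_SIZE))
```

Under these assumptions, the per-GPU batch size is 36, the validation set holds about 22 million tweets, and one epoch corresponds to roughly 3.78 million optimizer steps. These are estimates derived from the rounded numbers above, not values reported by the authors.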

Evaluation

BibTeX entry and citation information

Please cite the reference paper if you use this model.

@article{tonneau2024naijahate,
  title={NaijaHate: Evaluating Hate Speech Detection on Nigerian Twitter Using Representative Data},
  author={Tonneau, Manuel and de Castro, Pedro Vitor Quinta and Lasri, Karim and Farouq, Ibrahim and Subramanian, Lakshminarayanan and Orozco-Olvera, Victor and Fraiberger, Samuel},
  journal={arXiv preprint arXiv:2403.19260},
  year={2024}
}