---
language:
- en
- ha
- yo
- ig
- pcm
pipeline_tag: fill-mask
---

# NaijaXLM-T-base

This is an XLM-RoBERTa-base model further pretrained on 2.2 billion Nigerian tweets, described and evaluated in the [reference paper](https://arxiv.org/abs/2403.19260). This model was developed together with [@pvcastro](https://huggingface.co/pvcastro).

## Model Details

### Model Description

- **Model type:** xlm-roberta
- **Language(s) (NLP):** (Nigerian) English, Nigerian Pidgin, Hausa, Yoruba, Igbo
- **Finetuned from model:** xlm-roberta-base
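
Since the card declares the `fill-mask` pipeline tag, the model can presumably be queried for masked-token predictions through the `transformers` pipeline API. The following is a minimal sketch rather than an official usage recipe: `MODEL_ID` is a hypothetical placeholder for this model's actual Hub repository name, and the helper names `make_masked_input` and `top_predictions` are introduced here for illustration only.

```python
# Hypothetical Hub ID (assumption): replace with this model's actual repository name.
MODEL_ID = "your-namespace/NaijaXLM-T-base"

# XLM-RoBERTa tokenizers use "<mask>" as the mask token.
MASK_TOKEN = "<mask>"


def make_masked_input(template: str, mask_token: str = MASK_TOKEN) -> str:
    """Replace the single '{mask}' placeholder in `template` with the mask token."""
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one '{mask}' placeholder")
    return template.replace("{mask}", mask_token)


def top_predictions(template: str, model_id: str = MODEL_ID, top_k: int = 5):
    """Return (token, score) pairs for the masked position.

    Requires `transformers` plus a backend such as PyTorch, and downloads
    the model weights on first use.
    """
    # Imported lazily so that make_masked_input works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    results = fill_mask(make_masked_input(template), top_k=top_k)
    return [(r["token_str"], r["score"]) for r in results]
```

With a valid `MODEL_ID`, a call such as `top_predictions("Lagos is the largest city in {mask}.")` would return the five highest-scoring candidate tokens. The example sentence is illustrative and is not taken from the paper.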

### Model Sources

- **Repository:** https://github.com/manueltonneau/hate_speech_nigeria
- **Paper:** https://arxiv.org/abs/2403.19260

## Training Details

### Training Data

The model was further pretrained on 2.2 billion tweets posted between March 2007 and July 2023, which form the timelines of 2.8 million Twitter users with a profile location in Nigeria.

### Training Procedure

We performed adaptive fine-tuning of XLM-R on the Nigerian Twitter dataset. We kept the same vocabulary as XLM-R and trained the model until convergence, for a total of one epoch, using 1% of the dataset as a validation set. Training ran in a distributed environment for approximately 10 days, on 4 nodes with 4 RTX 8000 GPUs each, with a total batch size of 576.
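
To make the scale of this setup concrete, the figures above can be combined with some back-of-the-envelope arithmetic. The sketch below relies on assumptions that the card does not state: that the total batch size is split evenly across GPUs with no gradient accumulation, that each training example is a single tweet, and that the rounded figure of 2.2 billion tweets is exact. The derived numbers are therefore rough estimates, not reported values.

```python
import math

# Figures stated in this card.
NUM_NODES = 4
GPUS_PER_NODE = 4
TOTAL_BATCH_SIZE = 576
NUM_TWEETS = 2_200_000_000      # rounded figure, treated as exact here (assumption)
VALIDATION_PERCENT = 1


def training_setup_summary() -> dict:
    """Derive rough per-GPU and per-epoch quantities from the stated figures."""
    world_size = NUM_NODES * GPUS_PER_NODE

    # Assumption: even split across GPUs and no gradient accumulation.
    per_gpu_batch_size = TOTAL_BATCH_SIZE // world_size

    # 1% of the data is held out for validation.
    validation_tweets = NUM_TWEETS * VALIDATION_PERCENT // 100
    training_tweets = NUM_TWEETS - validation_tweets

    # Assumption: one tweet per training example, so one epoch takes this many steps.
    steps_per_epoch = math.ceil(training_tweets / TOTAL_BATCH_SIZE)

    return {
        "world_size": world_size,
        "per_gpu_batch_size": per_gpu_batch_size,
        "validation_tweets": validation_tweets,
        "training_tweets": training_tweets,
        "steps_per_epoch": steps_per_epoch,
    }


summary = training_setup_summary()
for name, value in summary.items():
    print(f"{name}: {value:,}")
```

Under these assumptions, the 16 GPUs each process 36 examples per step, and about 22 million tweets are held out for validation.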

## Evaluation

The model is evaluated in the [reference paper](https://arxiv.org/abs/2403.19260).

## BibTeX entry and citation information

Please cite the [reference paper](https://arxiv.org/abs/2403.19260) if you use this model.

```bibtex
@article{tonneau2024naijahate,
  title={NaijaHate: Evaluating Hate Speech Detection on Nigerian Twitter Using Representative Data},
  author={Tonneau, Manuel and de Castro, Pedro Vitor Quinta and Lasri, Karim and Farouq, Ibrahim and Subramanian, Lakshminarayanan and Orozco-Olvera, Victor and Fraiberger, Samuel},
  journal={arXiv preprint arXiv:2403.19260},
  year={2024}
}
```