---
language:
- en
- ha
- yo
- ig
- pcm
pipeline_tag: fill-mask
---
# NaijaXLM-T-base
This is an XLM-RoBERTa-base model further pretrained on 2.2 billion Nigerian tweets, described and evaluated in the [reference paper](https://arxiv.org/abs/2403.19260). This model was developed by [@pvcastro](https://huggingface.co./pvcastro) and [@manueltonneau](https://huggingface.co./manueltonneau).
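Since this is a fill-mask model, it can be loaded with the standard `transformers` pipeline. A minimal sketch, assuming a hypothetical repository id (substitute the actual id of this model card):

```python
from transformers import pipeline

# Hypothetical model id for illustration; replace with this model card's actual id.
fill_mask = pipeline("fill-mask", model="manueltonneau/NaijaXLM-T-base")

# XLM-RoBERTa tokenizers use "<mask>" as the mask token.
for prediction in fill_mask("How you dey? I dey <mask>."):
    print(f"{prediction['token_str']}\t{prediction['score']:.3f}")
```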
## Model Details
### Model Description
- **Model type:** xlm-roberta
- **Language(s) (NLP):** (Nigerian) English, Nigerian Pidgin, Hausa, Yoruba, Igbo
- **Finetuned from model:** xlm-roberta-base
### Model Sources
- **Repository:** https://github.com/manueltonneau/hate_speech_nigeria
- **Paper:** https://arxiv.org/abs/2403.19260
## Training Details
### Training Data
The model was further pretrained on 2.2 billion tweets posted between March 2007 and July 2023, drawn from the timelines of 2.8 million Twitter users with a profile location in Nigeria.
### Training Procedure
We performed adaptive fine-tuning of XLM-R on the Nigerian Twitter dataset.
We kept the same vocabulary as XLM-R and trained the model until convergence for a total of one epoch, using 1% of the dataset as a validation set. Training ran in a distributed environment for approximately 10 days, on 4 nodes with 4 RTX 8000 GPUs each and a total batch size of 576.
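This continued-pretraining setup can be reproduced in outline with the masked-language-modeling utilities in `transformers`. A minimal sketch under stated assumptions: the toy corpus stands in for the full tweet dataset, and the 15% masking rate is the RoBERTa default rather than a value reported in this card; the exact training script and optimizer settings are not specified here.

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the public XLM-R base checkpoint; the vocabulary is kept unchanged.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Toy stand-in for the 2.2B-tweet corpus.
corpus = Dataset.from_dict({"text": [
    "How you dey?",
    "Wetin dey happen for Lagos?",
    "E go better.",
    "No wahala at all.",
]})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)
# The paper holds out 1% for validation; a larger fraction is used here so the toy split is non-empty.
split = tokenized.train_test_split(test_size=0.25)

# Standard dynamic masking; the 15% rate is an assumption (RoBERTa default).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="naijaxlm-t-base",
    num_train_epochs=1,              # one epoch, as reported
    per_device_train_batch_size=36,  # 36 x 16 GPUs = 576 total, matching the reported batch size
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=split["train"],
    eval_dataset=split["test"],
)
trainer.train()
```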
## Evaluation
The model is evaluated on hate speech detection on Nigerian Twitter; see the [reference paper](https://arxiv.org/abs/2403.19260) for the evaluation protocols and results.
## BibTeX entry and citation information
Please cite the [reference paper](https://arxiv.org/abs/2403.19260) if you use this model.
```bibtex
@article{tonneau2024naijahate,
  title={NaijaHate: Evaluating Hate Speech Detection on Nigerian Twitter Using Representative Data},
  author={Tonneau, Manuel and de Castro, Pedro Vitor Quinta and Lasri, Karim and Farouq, Ibrahim and Subramanian, Lakshminarayanan and Orozco-Olvera, Victor and Fraiberger, Samuel},
  journal={arXiv preprint arXiv:2403.19260},
  year={2024}
}
```