---
language:
- en
- ha
- yo
- ig
- pcm
pipeline_tag: fill-mask
---


# NaijaXLM-T-base
This is an XLM-RoBERTa-base model further pretrained on 2.2 billion Nigerian tweets, described and evaluated in the reference paper (TODO). This model was developed together with [@pvcastro](https://huggingface.co/pvcastro).

## Model Details

### Model Description

- **Model type:** xlm-roberta
- **Language(s) (NLP):** (Nigerian) English, Nigerian Pidgin, Hausa, Yoruba, Igbo
- **Finetuned from model:** xlm-roberta-base
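
Since the card declares the `fill-mask` pipeline tag, the model can presumably be queried through the `transformers` fill-mask pipeline. The following is a minimal sketch, not an official usage recipe: `MODEL_ID` is a hypothetical placeholder (the Hub repository id is not stated in this card), and the helper names `mask_word` and `top_predictions` are introduced here for illustration only.

```python
# Minimal fill-mask sketch. Assumes the `transformers` library is installed
# and that MODEL_ID is replaced with the real Hub repository id of this model.

MASK_TOKEN = "<mask>"  # mask token used by XLM-R tokenizers

# Hypothetical placeholder: the actual repository id is not given in this card.
MODEL_ID = "your-namespace/NaijaXLM-T-base"


def mask_word(text: str, word: str) -> str:
    """Replace the first occurrence of `word` in `text` with the mask token."""
    if word not in text:
        raise ValueError(f"{word!r} not found in text")
    return text.replace(word, MASK_TOKEN, 1)


def top_predictions(text: str, model_id: str = MODEL_ID, k: int = 5):
    """Return the k most likely fillers for a single-mask text as (token, score) pairs."""
    # Imported lazily so that mask_word works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    # For an input with exactly one mask, the pipeline returns a list of dicts.
    return [(p["token_str"], p["score"]) for p in fill_mask(text, top_k=k)]
```

With a valid repository id, `top_predictions(mask_word("I wan chop rice", "rice"))` would return the five highest-scoring candidate tokens for the masked position.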

### Model Sources

- **Repository:** https://github.com/manueltonneau/hate_speech_nigeria
- **Paper:** TODO


## Training Details

### Training Data

The model was further pre-trained on 2.2 billion tweets posted between March 2007 and July 2023. These tweets make up the timelines of 2.8 million Twitter users with a profile location in Nigeria.

### Training Procedure 

We performed adaptive fine-tuning of XLM-R on the Nigerian Twitter dataset.
We kept the same vocabulary as XLM-R and trained the model for one epoch, holding out 1% of the dataset as a validation set. Training ran in a distributed environment for approximately 10 days, on 4 nodes with 4 RTX 8000 GPUs each, with a total batch size of 576.
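
The figures above imply a few derived quantities, worked out in the sketch below. It rests on assumptions not stated in this card: that the total batch size is split evenly across GPUs with no gradient accumulation, and that each tweet counts as one training example. The resulting numbers are therefore rough estimates, not reported values.

```python
# Back-of-the-envelope arithmetic from the figures in this section.
# Assumptions (not stated in the card): no gradient accumulation,
# and one tweet corresponds to one training example.

NUM_NODES = 4
GPUS_PER_NODE = 4
TOTAL_BATCH_SIZE = 576
NUM_TWEETS = 2_200_000_000
VALIDATION_PERCENT = 1

# Total number of GPUs taking part in distributed training.
world_size = NUM_NODES * GPUS_PER_NODE

# Examples per GPU per step if the batch is split evenly.
per_gpu_batch, remainder = divmod(TOTAL_BATCH_SIZE, world_size)

# Integer arithmetic avoids floating-point rounding on large counts.
validation_examples = NUM_TWEETS * VALIDATION_PERCENT // 100
train_examples = NUM_TWEETS - validation_examples

# Ceiling division: optimizer steps needed to see every training example once.
steps_per_epoch = -(-train_examples // TOTAL_BATCH_SIZE)

print(f"GPUs: {world_size}")
print(f"Per-GPU batch size: {per_gpu_batch} (remainder {remainder})")
print(f"Training examples: {train_examples:,}")
print(f"Estimated steps for one epoch: {steps_per_epoch:,}")
```

Under these assumptions, the setup corresponds to 16 GPUs processing 36 examples each per step, and roughly 3.8 million optimizer steps for the single epoch.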


## Evaluation


## BibTeX entry and citation information

TODO

Please cite the reference paper (TODO) if you use this model.

```bibtex
@inproceedings{XXX}
```