Zabantu - Exploring Multilingual Language Model training for South African Bantu Languages

Zabantu ("Za" for South Africa, "bantu" for Bantu languages) is a collection of masked language models trained from scratch on a compact dataset comprising various subsets of the Bantu languages spoken in South Africa. The models are inspired by AfriBERTa, which demonstrated that the XLM-R architecture can be trained effectively on a smaller dataset. The focus of this work was to use LLMs to advance NLP applications in Tshivenda and to serve as a benchmark for future work covering Bantu languages.

Model Details

Model Name: Zabantu-XLM-Roberta
Model Version: 0.0.1
Model Architecture: XLM-RoBERTa
Model Size: 80 - 250 million parameters
Language Support: Tshivenda, Nguni languages (Zulu, Xhosa, Swati), Sotho languages (Northern Sotho, Southern Sotho, Setswana), and Xitsonga.

Usage example(s)

from transformers import pipeline

# Initialize the fill-mask pipeline for the masked language model
# Note: you might need to log in and request access to dsfsi while the model is in private beta
unmasker = pipeline('fill-mask', model='dsfsi/zabantu-bantu-250m')


sample_sentences = {
    'zulu': "Le ndoda ithi izo____ ukudla.",  # Masked word for Zulu
    'tshivenda': "Mufana uyo____ vhukuma.",  # Masked word for Tshivenda
    'sepedi': "Mosadi o ____ pheka.",  # Masked word for Sepedi
    'tswana': "Monna o ____ tsamaya.",  # Masked word for Tswana
    'tsonga': "N'wana wa xisati u ____ ku tsaka."  # Masked word for Tsonga
}


for language, sentence in sample_sentences.items():
    masked_sentence = sentence.replace('____', unmasker.tokenizer.mask_token)
    # Get the model predictions
    results = unmasker(masked_sentence)
    print(f"Original sentence ({language}): {sentence}")
    print(f"Top prediction for the masked token: {results[0]['sequence']}\n")
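
The loop above prints only the single best completion. When several candidates are of interest, the list returned by the fill-mask pipeline can be ranked directly. The sketch below assumes the usual output shape of that pipeline (a list of dicts with `score`, `token_str` and `sequence` keys) and the XLM-RoBERTa mask token `<mask>`. The names `insert_mask` and `top_candidates` are hypothetical helpers, and `fake_results` holds made-up placeholder values, not real model predictions.

```python
from typing import Dict, List, Tuple

PLACEHOLDER = "____"  # same placeholder the sample sentences use


def insert_mask(sentence: str, mask_token: str, placeholder: str = PLACEHOLDER) -> str:
    """Replace the placeholder with the mask token, requiring exactly one placeholder."""
    if sentence.count(placeholder) != 1:
        raise ValueError(f"expected exactly one {placeholder!r} in: {sentence!r}")
    return sentence.replace(placeholder, mask_token)


def top_candidates(results: List[Dict], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), round(r["score"], 4)) for r in ranked[:k]]


# Stand-in for pipeline output: tokens and scores are invented placeholders.
fake_results = [
    {"score": 0.3, "token_str": "tok_a", "sequence": "Mosadi o tok_a pheka."},
    {"score": 0.6, "token_str": "tok_b", "sequence": "Mosadi o tok_b pheka."},
    {"score": 0.1, "token_str": "tok_c", "sequence": "Mosadi o tok_c pheka."},
]

masked = insert_mask("Mosadi o ____ pheka.", "<mask>")
best_two = top_candidates(fake_results, k=2)
```

With a loaded pipeline, the same helper would be called as `top_candidates(unmasker(masked), k=3)`.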

For fine-tuning tasks, check out these examples:

Model Variants

This model card provides an overview of the multilingual language models developed for South African languages, with a specific focus on advancing Tshivenda natural language processing (NLP) coverage. Zabantu-XLMR refers to a fleet of models trained on different combinations of South African Bantu languages. These include:

Zabantu-VEN: A monolingual language model trained on 73k raw sentences in Tshivenda
Zabantu-NSO: A monolingual language model trained on 179k raw sentences in Sepedi
Zabantu-NSO+VEN: A bilingual language model trained on 179k raw sentences in Sepedi and 73k sentences in Tshivenda
Zabantu-SOT+VEN: A multilingual language model trained on 479k raw sentences from Sesotho, Sepedi, Setswana, and Tshivenda
Zabantu-BANTU: A multilingual language model trained on 1.4M raw sentences from 9 South African Bantu languages

Intended Use

Like any Masked Language Model (MLM), Zabantu models can be adapted to a variety of semantic tasks such as:

Text Classification/Categorization: Assigning categories or labels to a whole document, or sections of a document, based on its content.
Sentiment Analysis: Determining the sentiment of a text, such as whether the opinion is positive, negative, or neutral.
Named Entity Recognition (NER): Identifying and classifying key information (entities) in text into predefined categories such as the names of people, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Part-of-Speech Tagging (POS): Assigning word types to each word (like noun, verb, adjective, etc.), based on both its definition and its context.
Semantic Text Similarity: Measuring how similar two pieces of texts are, which is useful in various applications such as information retrieval, document clustering, and duplicate detection.
etc.
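
As a rough illustration of adapting a checkpoint to one of the tasks above, the sketch below sets up text classification. It assumes the `dsfsi/zabantu-bantu-250m` checkpoint from the usage example is accessible and loads through the standard `transformers` auto classes. The helper names and the label names used in the usage note are hypothetical. The classification head is newly initialised, so the model still has to be fine-tuned on labelled data before its predictions mean anything.

```python
from typing import Dict, List, Tuple


def build_label_maps(labels: List[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Map each distinct label to a stable integer id, and back."""
    unique = sorted(set(labels))
    label2id = {label: i for i, label in enumerate(unique)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def load_classifier(labels: List[str], checkpoint: str = "dsfsi/zabantu-bantu-250m"):
    """Load the pretrained encoder with a freshly initialised classification head."""
    # Imported lazily so build_label_maps works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    label2id, id2label = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(label2id),
        label2id=label2id,
        id2label=id2label,
    )
    return tokenizer, model
```

For example, `load_classifier(["politics", "health", "sports"])` would return a tokenizer and a three-class model ready to be passed to a training loop.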

Performance and Limitations

Performance: The Zabantu models show promising performance on NLP tasks. On news topic classification, their results are competitive with similar pre-trained cross-lingual models such as AfriBERTa and AfroXLMR.

Monolingual test F1 scores on News Topic Classification

Weighted F1 [%]	Afriberta-large	Afroxlmr	zabantu-nsoven	zabantu-sotven	zabantu-bantu
nso	71.4	71.6	74.3	69	70.6
ven	74.3	74.1	77	76	75.6
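
The source does not state how the weighted F1 scores were computed. Assuming the usual definition (per-class F1 averaged with weights proportional to each class's number of true examples, as in scikit-learn's `average="weighted"`), the metric can be sketched in plain Python. The function name and the toy labels in the usage note are illustrative only.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Support-weighted mean of per-class F1, as a fraction in [0, 1]."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = 0.0
    for label, n in support.items():
        tp = correct[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / n
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += n * f1            # weight each class by its support
    return total / len(y_true)
```

For the toy labels `y_true = ["a", "a", "b", "b"]` and `y_pred = ["a", "b", "b", "b"]`, class `a` has F1 = 2/3 and class `b` has F1 = 4/5, so the weighted score is 11/15. Multiplying by 100 gives the percentage scale used in the table.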

Few-shot (50 shots) test F1 scores on News Topic Classification

Weighted F1 [%]	Afriberta	Afroxlmr	zabantu-nsoven	zabantu-sotven	zabantu-bantu
ven	60	62	66	69	55
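
The source does not say whether "50 shots" means 50 examples per class or 50 in total, nor how they were drawn. Purely as an illustration, the sketch below assumes 50 examples per class, sampled without replacement using a fixed seed. The function name and the `(text, label)` tuple layout are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (text, label); an assumed layout for illustration


def sample_few_shot(
    examples: Sequence[Example], shots: int = 50, seed: int = 0
) -> List[Example]:
    """Draw up to `shots` examples per label, reproducibly."""
    by_label: Dict[str, List[Example]] = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    rng = random.Random(seed)  # local generator, so global random state is untouched
    subset: List[Example] = []
    for label in sorted(by_label):  # sorted for a deterministic class order
        pool = by_label[label]
        subset.extend(rng.sample(pool, min(shots, len(pool))))
    return subset
```

Fixing the seed makes the few-shot subset reproducible, which matters when comparing models on so little training data.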

Limitations:
- Although efforts have been made to include a wide range of South African languages, the model's coverage may still be limited for certain dialects. We note that the training set was largely dominated by Setswana and isiXhosa.
- We also acknowledge the potential to further improve the model by training it on more data, including additional domains and topics.
- As with any language model, the generated output should be carefully reviewed and post-processed to ensure accuracy and cultural sensitivity.

Training Data

The models were trained on a corpus of text collected from various sources, including SADiLaR, Leipnets, Flores, CC-100, OPUS and various South African government websites. The training data covers a wide range of topics and domains, notably religion, politics, academia and health (mostly Covid-19).

Closing Remarks

The Zabantu models provide a valuable resource for advancing Tshivenda NLP coverage and promoting cross-lingual learning techniques for South African languages. They have the potential to enhance various NLP applications, foster linguistic diversity, and contribute to the development of language technologies in the South African context.
