Fine-tuned RoBERTa on Malay Language
This model is a fine-tuned version of mesolitica/roberta-base-bahasa-cased, trained for Masked Language Modeling (MLM) on a custom dataset of normalized Malay sentences.
Model Description
This model is based on the RoBERTa architecture, a robustly optimized variant of BERT. The base model was pre-trained on a large corpus of Malay text and then fine-tuned on a specialized dataset of normalized Malay sentences, using the standard masked language modeling objective of predicting masked tokens.
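As a quick illustration of the MLM objective, the snippet below runs the model through the transformers fill-mask pipeline. It is a minimal sketch: the repo ID is taken from this model page, the example sentence is illustrative, and it assumes the tokenizer uses the standard RoBERTa `<mask>` token.

```python
from transformers import pipeline

# Repo ID taken from this model page; replace with a local path if needed.
fill_mask = pipeline("fill-mask", model="matchaoneshot/RoBERTa-MalayMLMFineTuned")

# RoBERTa tokenizers use "<mask>" as the mask token; the sentence is illustrative.
for prediction in fill_mask("Saya suka makan nasi <mask> pada waktu pagi."):
    print(prediction["token_str"], round(prediction["score"], 4))
```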
Training Details
- Pre-trained Model: mesolitica/roberta-base-bahasa-cased
- Task: Masked Language Modeling (MLM)
- Training Dataset: Custom dataset of Malay sentences
- Training Duration: 3 epochs
- Batch Size: 16 per device
- Learning Rate: 1e-6
- Optimizer: AdamW
- Evaluation: Evaluated every 200 steps (see the training sketch after this list)
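The hyperparameters above map onto a standard transformers Trainer setup. The sketch below is not the authors' exact training script: the file names (train.txt, valid.txt), the sequence length, and the 15% masking probability are assumptions (the latter is the library default), while the epochs, batch size, learning rate, optimizer, and evaluation interval follow the list above.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "mesolitica/roberta-base-bahasa-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Hypothetical files with one normalized Malay sentence per line.
raw = load_dataset("text", data_files={"train": "train.txt", "validation": "valid.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking; 0.15 is the library default, not stated in the card.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="roberta-malay-mlm",
    num_train_epochs=3,              # Training Duration: 3 epochs
    per_device_train_batch_size=16,  # Batch Size: 16 per device
    learning_rate=1e-6,              # Learning Rate: 1e-6
    eval_strategy="steps",           # "evaluation_strategy" in transformers < 4.41
    eval_steps=200,                  # Evaluation: every 200 steps
    logging_steps=200,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,  # AdamW is the Trainer's default optimizer
)
trainer.train()
```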
Training and Validation Loss
The following table shows the training and validation loss at each evaluation step during the fine-tuning process:
Step | Training Loss | Validation Loss |
---|---|---|
200 | 0.069000 | 0.069317 |
800 | 0.070100 | 0.067430 |
1400 | 0.069000 | 0.066185 |
2000 | 0.037900 | 0.066657 |
2600 | 0.040200 | 0.066858 |
3200 | 0.041800 | 0.066634 |
3800 | 0.023700 | 0.067717 |
4400 | 0.024500 | 0.068275 |
5000 | 0.024500 | 0.068108 |
Observations
- The training loss decreased over the course of fine-tuning, with the sharpest drops around steps 2,000 and 3,800.
- The validation loss fluctuated only slightly, staying in the 0.066–0.068 range after the first few thousand steps.
- The training loss plateaued by step 4,400 while the validation loss rose only marginally after step 3,200, suggesting the model had largely converged.
Intended Use
This model is intended for tasks such as:
- Masked Language Modeling (MLM): Fill in the blanks for masked tokens in a Malay sentence.
- Text Generation: Generate plausible text given a context.
- Text Understanding: Extract contextual meaning from Malay sentences (see the example after this list).
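For the text-understanding use case, one simple approach is to take the encoder's hidden states as contextual sentence features. The snippet below is a sketch, not a prescribed recipe: the repo ID is taken from this model page, the sentence is illustrative, and mean pooling is just one common pooling choice.

```python
import torch
from transformers import AutoModel, AutoTokenizer

repo = "matchaoneshot/RoBERTa-MalayMLMFineTuned"  # repo ID from this model page
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)

inputs = tokenizer("Kerajaan mengumumkan bantuan baharu untuk rakyat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token states into a single sentence vector (one common pooling choice).
sentence_vector = outputs.last_hidden_state.mean(dim=1)
print(sentence_vector.shape)  # torch.Size([1, hidden_size])
```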
Updated News
- This model was used in the research paper "Mitigating Linguistic Bias between Malay and Indonesian Languages using Masked Language Models", which has been accepted as a short paper (poster presentation) in the Research Track at DASFAA 2025.
- Authors: Ferdinand Lenchau Bit, Iman Khaleda binti Zamri, Amzine Toushik Wasi, Taki Hasan Rafi, and Dong-Kyu Chae (Department of Computer Science, Hanyang University, Seoul, South Korea)