Fine-tuned RoBERTa on Malay Language
This model is a fine-tuned version of mesolitica/roberta-base-bahasa-cased, trained for Masked Language Modeling (MLM) on a custom dataset of normalized Malay sentences.
Model Description
This model is based on the RoBERTa architecture, a robustly optimized variant of BERT. The base model was pre-trained on a large corpus of Malay text and then fine-tuned on a specialized dataset of normalized Malay sentences, using the standard masked language modeling objective of predicting masked tokens.
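As a quick illustration of the MLM objective, the snippet below runs the model through the transformers fill-mask pipeline. It is a minimal sketch: the repo ID is taken from this model page, the example sentence is illustrative, and it assumes the tokenizer uses the standard RoBERTa `<mask>` token.

```python
from transformers import pipeline

# Repo ID taken from this model page; replace with a local path if needed.
fill_mask = pipeline("fill-mask", model="matchaoneshot/RoBERTa-MalayMLMFineTuned")

# RoBERTa tokenizers use "<mask>" as the mask token; the sentence is illustrative.
for prediction in fill_mask("Saya suka makan nasi <mask> pada waktu pagi."):
    print(prediction["token_str"], round(prediction["score"], 4))
```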
Training Details
- Pre-trained Model: mesolitica/roberta-base-bahasa-cased
- Task: Masked Language Modeling (MLM)
- Training Dataset: Custom dataset of Malay sentences
- Training Duration: 3 epochs
- Batch Size: 16 per device
- Learning Rate: 1e-6
- Optimizer: AdamW
- Evaluation: Evaluated every 200 steps (see the training sketch after this list)
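The hyperparameters above map onto a standard transformers Trainer setup. The sketch below is not the authors' exact training script: the file names (train.txt, valid.txt), the sequence length, and the 15% masking probability are assumptions (the latter is the library default), while the epochs, batch size, learning rate, optimizer, and evaluation interval follow the list above.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "mesolitica/roberta-base-bahasa-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Hypothetical files with one normalized Malay sentence per line.
raw = load_dataset("text", data_files={"train": "train.txt", "validation": "valid.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking; 0.15 is the library default, not stated in the card.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="roberta-malay-mlm",
    num_train_epochs=3,              # Training Duration: 3 epochs
    per_device_train_batch_size=16,  # Batch Size: 16 per device
    learning_rate=1e-6,              # Learning Rate: 1e-6
    eval_strategy="steps",           # "evaluation_strategy" in transformers < 4.41
    eval_steps=200,                  # Evaluation: every 200 steps
    logging_steps=200,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,  # AdamW is the Trainer's default optimizer
)
trainer.train()
```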
Training and Validation Loss
The following table shows the training and validation loss at each evaluation step during the fine-tuning process:
Step | Training Loss | Validation Loss |
---|---|---|
200 | 0.069000 | 0.069317 |
800 | 0.070100 | 0.067430 |
1400 | 0.069000 | 0.066185 |
2000 | 0.037900 | 0.066657 |
2600 | 0.040200 | 0.066858 |
3200 | 0.041800 | 0.066634 |
3800 | 0.023700 | 0.067717 |
4400 | 0.024500 | 0.068275 |
5000 | 0.024500 | 0.068108 |
Observations
- The training loss decreased over the course of fine-tuning, with the sharpest drops around steps 2,000 and 3,800.
- The validation loss fluctuated only slightly, staying in the 0.066–0.068 range after the first few thousand steps.
- The training loss plateaued by step 4,400 while the validation loss rose only marginally after step 3,200, suggesting the model had largely converged.
Intended Use
This model is intended for tasks such as:
- Masked Language Modeling (MLM): Fill in the blanks for masked tokens in a Malay sentence.
- Text Generation: Generate plausible text given a context.
- Text Understanding: Extract contextual meaning from Malay sentences (see the example after this list).
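For the text-understanding use case, one simple approach is to take the encoder's hidden states as contextual sentence features. The snippet below is a sketch, not a prescribed recipe: the repo ID is taken from this model page, the sentence is illustrative, and mean pooling is just one common pooling choice.

```python
import torch
from transformers import AutoModel, AutoTokenizer

repo = "matchaoneshot/RoBERTa-MalayMLMFineTuned"  # repo ID from this model page
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)

inputs = tokenizer("Kerajaan mengumumkan bantuan baharu untuk rakyat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token states into a single sentence vector (one common pooling choice).
sentence_vector = outputs.last_hidden_state.mean(dim=1)
print(sentence_vector.shape)  # torch.Size([1, hidden_size])
```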
Updated News
- This model was used in the research paper "Mitigating Linguistic Bias between Malay and Indonesian Languages using Masked Language Models", which has been accepted as a short paper (poster presentation) in the Research Track at DASFAA 2025.
- Authors: Ferdinand Lenchau Bit, Iman Khaleda binti Zamri, Amzine Toushik Wasi, Taki Hasan Rafi, and Dong-Kyu Chae (Department of Computer Science, Hanyang University, Seoul, South Korea)