Fine-tuned RoBERTa on Malay Language

This model is a fine-tuned version of mesolitica/roberta-base-bahasa-cased, trained on a custom dataset of normalized Malay sentences with a Masked Language Modeling (MLM) objective.

Model Description

This model is based on the RoBERTa architecture, a robustly optimized variant of BERT. The base checkpoint was pre-trained on a large corpus of Malay text and then fine-tuned on a specialized dataset of normalized Malay sentences. The fine-tuning objective was standard masked language modeling: predicting tokens that have been masked out of a sentence.
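As a rough sketch of how such a setup is typically assembled (the exact preprocessing pipeline is not documented in this card, so the 15% masking probability below is the library default and an assumption here, not a reported value), the base checkpoint can be loaded for MLM and paired with a dynamic-masking collator:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

# Base checkpoint named in this card
tokenizer = AutoTokenizer.from_pretrained("mesolitica/roberta-base-bahasa-cased")
model = AutoModelForMaskedLM.from_pretrained("mesolitica/roberta-base-bahasa-cased")

# Dynamic masking applied at batch-collation time; 15% is the transformers default (assumed)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```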

Training Details

  • Pre-trained Model: mesolitica/roberta-base-bahasa-cased
  • Task: Masked Language Modeling (MLM)
  • Training Dataset: Custom dataset of Malay sentences
  • Training Duration: 3 epochs
  • Batch Size: 16 per device
  • Learning Rate: 1e-6
  • Optimizer: AdamW
  • Evaluation: every 200 steps (see the configuration sketch after this list)
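Continuing the loading sketch above, a plausible Trainer configuration matching the listed hyperparameters might look like the following. Here train_dataset and eval_dataset stand in for the unpublished custom dataset, the output directory name is hypothetical, and AdamW is simply the Trainer default optimizer:

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="roberta-malay-mlm",      # hypothetical output path
    num_train_epochs=3,                  # 3 epochs
    per_device_train_batch_size=16,      # batch size 16 per device
    learning_rate=1e-6,                  # learning rate from this card
    evaluation_strategy="steps",
    eval_steps=200,                      # evaluate every 200 steps
    logging_steps=200,
)

trainer = Trainer(
    model=model,                         # from the loading sketch above
    args=training_args,
    train_dataset=train_dataset,         # tokenized, normalized Malay sentences (placeholder)
    eval_dataset=eval_dataset,           # held-out validation split (placeholder)
    data_collator=data_collator,         # dynamic MLM masking
)
trainer.train()
```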

Training and Validation Loss

The following table shows the training and validation loss at selected evaluation steps during fine-tuning:

Step  | Training Loss | Validation Loss
200   | 0.069000      | 0.069317
800   | 0.070100      | 0.067430
1400  | 0.069000      | 0.066185
2000  | 0.037900      | 0.066657
2600  | 0.040200      | 0.066858
3200  | 0.041800      | 0.066634
3800  | 0.023700      | 0.067717
4400  | 0.024500      | 0.068275
5000  | 0.024500      | 0.068108

Observations

  • The training loss decreased in stages, from roughly 0.069 at step 200 to roughly 0.024 by step 5000, with the largest drops occurring around steps 2000 and 3800.
  • The validation loss reached its lowest value (about 0.0662 at step 1400) and remained relatively stable thereafter, rising only slightly by step 5000.
  • Training loss continued to fall while validation loss plateaued, suggesting the model converged on the training data with at most mild overfitting by the end of fine-tuning.

Intended Use

This model is intended for tasks such as:

  • Masked Language Modeling (MLM): Fill in masked tokens in a Malay sentence (see the example after this list).
  • Text Generation: Generate plausible text given a context.
  • Text Understanding: Extract contextual meaning from Malay sentences.
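A minimal inference sketch for the fill-mask use case, assuming the checkpoint id shown on this page and the standard RoBERTa <mask> token; the Malay sentence is only an illustration (roughly "I like ___ nasi lemak"):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="matchaoneshot/RoBERTa-MalayMLMFineTuned")

# Predict the masked word; a candidate such as "makan" ("to eat") would be plausible
for prediction in fill_mask("Saya suka <mask> nasi lemak."):
    print(prediction["token_str"], round(prediction["score"], 3))
```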

News

  • This model was used in the research paper "Mitigating Linguistic Bias between Malay and Indonesian Languages using Masked Language Models", which has been accepted as a short paper (poster presentation) in the Research Track at DASFAA 2025.
  • Authors: Ferdinand Lenchau Bit, Iman Khaleda binti Zamri, Amzine Toushik Wasi, Taki Hasan Rafi, and Dong-Kyu Chae (Department of Computer Science, Hanyang University, Seoul, South Korea)