roberta-base-bahasa-cased
Pretrained RoBERTa base language model for Malay.
Pretraining Corpus
The roberta-base-bahasa-cased model was pretrained on ~400 million words. Below is the list of datasets we trained on:
- IIUM confession, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
- local Instagram, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
- local news, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
- local parliament hansards, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
- local research papers related to kebudayaan (culture), keagamaan (religion) and etnik (ethnicity), https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
- local Twitter, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
- Malay Wattpad, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
- Malay Wikipedia, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
Pretraining details
- All steps can be reproduced from https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/roberta
Example using AutoModelForMaskedLM
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the pretrained masked-language model and its tokenizer from the Hub.
model = AutoModelForMaskedLM.from_pretrained('mesolitica/roberta-base-bahasa-cased')
tokenizer = AutoTokenizer.from_pretrained(
    'mesolitica/roberta-base-bahasa-cased',
    do_lower_case=False,
)

# Build a fill-mask pipeline and predict the token hidden behind <mask>.
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
fill_mask('Permohonan Najib, anak untuk dengar isu perlembagaan <mask> .')
Output is,
[{'score': 0.3368818759918213,
'token': 746,
'token_str': ' negara',
'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan negara.'},
{'score': 0.09646568447351456,
'token': 598,
'token_str': ' Malaysia',
'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan Malaysia.'},
{'score': 0.029483484104275703,
'token': 3265,
'token_str': ' UMNO',
'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan UMNO.'},
{'score': 0.026470622047781944,
'token': 2562,
'token_str': ' parti',
'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan parti.'},
{'score': 0.023237623274326324,
'token': 391,
'token_str': ' ini',
'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan ini.'}]
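The pipeline returns a list of dictionaries, and each `token_str` carries a leading space. As a minimal sketch of post-processing that output, the helper below keeps only candidates above a score threshold and strips the whitespace. The name `best_predictions`, the `min_score` parameter, and the 0.05 cutoff are hypothetical choices for illustration and are not part of the model or of `transformers`. The sample data is copied from the output above with scores rounded, so the snippet runs without downloading the model.

```python
from typing import Dict, List


def best_predictions(results: List[Dict], min_score: float = 0.05) -> List[str]:
    """Return stripped token strings scoring at least min_score, best first.

    Hypothetical helper; the 0.05 default is an arbitrary illustrative cutoff.
    """
    kept = [r for r in results if r["score"] >= min_score]
    kept.sort(key=lambda r: r["score"], reverse=True)
    return [r["token_str"].strip() for r in kept]


# First three entries copied from the output above (scores rounded).
sample = [
    {"score": 0.3369, "token": 746, "token_str": " negara"},
    {"score": 0.0965, "token": 598, "token_str": " Malaysia"},
    {"score": 0.0295, "token": 3265, "token_str": " UMNO"},
]

print(best_predictions(sample))
```

In practice the list returned by `fill_mask(...)` could be passed in place of `sample`.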