Indonesian RoBERTa Base PRDECT-ID
Indonesian RoBERTa Base PRDECT-ID is an emotion text-classification model based on the RoBERTa model. It starts from the pre-trained Indonesian RoBERTa Base model, which was then fine-tuned on the PRDECT-ID dataset of Indonesian product reviews (Sutoyo et al., 2022).
This model was trained using HuggingFace Transformers with its PyTorch backend. All training was done on an NVIDIA T4 GPU, provided by Google Colaboratory. Training metrics were logged via TensorBoard.
Model
Model | #params | Arch. | Training/Validation data (text) |
---|---|---|---|
indonesian-roberta-base-prdect-id | 124M | RoBERTa Base | PRDECT-ID |
Evaluation Results
The model achieves the following results on evaluation:
Dataset | Accuracy | F1 | Precision | Recall |
---|---|---|---|---|
PRDECT-ID | 0.685185 | 0.644750 | 0.646400 | 0.643710 |
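The card does not say how F1, precision, and recall were averaged across the emotion classes. As a rough illustration only, the sketch below computes accuracy and macro-averaged scores (an assumption) in plain Python. The helper `macro_scores` and the toy integer labels are hypothetical and are not taken from PRDECT-ID.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy data with three hypothetical classes (0, 1, 2), for illustration only
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
scores = macro_scores(toy_true, toy_pred)
print(scores)
```

Under macro averaging, F1 is the mean of the per-class F1 scores, so it need not equal the harmonic mean of the averaged precision and recall.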
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
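To illustrate what a linear scheduler implies for the learning rate, here is a minimal sketch in plain Python. It assumes no warmup steps (the card does not state any) and takes the total of 760 optimizer steps from the training results table. `linear_lr` is a hypothetical helper, not part of any library.

```python
def linear_lr(step, base_lr=2e-5, total_steps=760, warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup phase (unused here, since warmup_steps is assumed to be 0)
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start of each epoch (152 steps per epoch) and at the end
for step in range(0, 761, 152):
    print(step, linear_lr(step))
```

With these assumptions the rate starts at 2e-05 and falls by one fifth of that value per epoch until it reaches zero at step 760.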
Training results
Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 | Precision | Recall |
---|---|---|---|---|---|---|---|
1.0358 | 1.0 | 152 | 0.8293 | 0.6519 | 0.5814 | 0.6399 | 0.5746 |
0.7012 | 2.0 | 304 | 0.7444 | 0.6741 | 0.6269 | 0.6360 | 0.6220 |
0.5599 | 3.0 | 456 | 0.7635 | 0.6852 | 0.6440 | 0.6433 | 0.6453 |
0.4628 | 4.0 | 608 | 0.8031 | 0.6852 | 0.6421 | 0.6471 | 0.6396 |
0.4027 | 5.0 | 760 | 0.8133 | 0.6852 | 0.6447 | 0.6464 | 0.6437 |
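The table can be read differently depending on the selection criterion. The sketch below copies the validation loss and F1 columns into a hypothetical `rows` list and finds the epoch that is best under each criterion. It is illustrative only; the card does not say which criterion, if any, was used to pick the released checkpoint.

```python
# (epoch, validation loss, F1), copied from the training results table
rows = [
    (1, 0.8293, 0.5814),
    (2, 0.7444, 0.6269),
    (3, 0.7635, 0.6440),
    (4, 0.8031, 0.6421),
    (5, 0.8133, 0.6447),
]

# Lowest validation loss versus highest F1
best_by_loss = min(rows, key=lambda row: row[1])[0]
best_by_f1 = max(rows, key=lambda row: row[2])[0]

print("best epoch by validation loss:", best_by_loss)
print("best epoch by F1:", best_by_f1)
```

Validation loss bottoms out at epoch 2 while training loss keeps falling, which hints at mild overfitting, whereas F1 is essentially flat from epoch 3 onward.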
How to Use
As Text Classifier
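from transformers import pipeline

pretrained_name = "w11wo/indonesian-roberta-base-prdect-id"

# "sentiment-analysis" is the pipeline alias for text classification
nlp = pipeline(
    "sentiment-analysis",
    model=pretrained_name,
    tokenizer=pretrained_name,
)

# Input means: "Wow, the quality of this product is very good!"
# Returns a list with one {"label": ..., "score": ...} dict per input
nlp("Wah, kualitas produk ini sangat bagus!")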
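For a single input string, the text-classification pipeline returns a list containing one dict with `label` and `score` keys. Given the reported accuracy of roughly 0.69, it may be worth discarding low-confidence predictions. The sketch below shows one way to do that; the helper `confident_label`, the 0.6 threshold, and the sample outputs with their label names are all hypothetical, so check the model's own configuration for the actual labels.

```python
def confident_label(results, threshold=0.6):
    """Return the predicted label if its score meets the threshold, else None."""
    if not results:
        return None
    top = max(results, key=lambda item: item["score"])
    return top["label"] if top["score"] >= threshold else None


# Hypothetical outputs shaped like the pipeline's return value
sample_high = [{"label": "Happy", "score": 0.93}]
sample_low = [{"label": "Happy", "score": 0.41}]

print(confident_label(sample_high))
print(confident_label(sample_low))
```

In practice, `results` would be the value returned by `nlp(...)` in the snippet above.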
Disclaimer
Consider the biases inherited from both the pre-trained RoBERTa model and the PRDECT-ID dataset, which may carry over into this model's results.
Author
Indonesian RoBERTa Base PRDECT-ID was trained and evaluated by Wilson Wongso. All computation and development were done on Google Colaboratory using its free GPU access.
Framework versions
- Transformers 4.24.0
- PyTorch 1.12.1+cu113
- Datasets 2.7.1
- Tokenizers 0.13.2
References
@article{SUTOYO2022108554,
title = {PRDECT-ID: Indonesian product reviews dataset for emotions classification tasks},
journal = {Data in Brief},
volume = {44},
pages = {108554},
year = {2022},
issn = {2352-3409},
doi = {https://doi.org/10.1016/j.dib.2022.108554},
url = {https://www.sciencedirect.com/science/article/pii/S2352340922007612},
author = {Rhio Sutoyo and Said Achmad and Andry Chowanda and Esther Widhi Andangsari and Sani M. Isa},
keywords = {Natural language processing, Text processing, Text mining, Emotions classification, Sentiment analysis},
abstract = {Recognizing emotions is vital in communication. Emotions convey additional meanings to the communication process. Nowadays, people can communicate their emotions on many platforms; one is the product review. Product reviews in the online platform are an important element that affects customers’ buying decisions. Hence, it is essential to recognize emotions from the product reviews. Emotions recognition from the product reviews can be done automatically using a machine or deep learning algorithm. Dataset can be considered as the fuel to model the recognizer. However, only a limited dataset exists in recognizing emotions from the product reviews, particularly in a local language. This research contributes to the dataset collection of 5400 product reviews in Indonesian. It was carefully curated from various (29) product categories, annotated with five emotions, and verified by an expert in clinical psychology. The dataset supports an innovative process to build automatic emotion classification on product reviews.}
}