RoBERTa
Spam messages frequently carry malicious links or phishing attempts posing significant threats to both organizations and their users. By choosing our RoBERTa-based spam message detection system, organizations can greatly enhance their security infrastructure. Our system effectively detects and filters out spam messages, adding an extra layer of security that safeguards organizations against potential financial losses, legal consequences, and reputational harm.
Dataset
The dataset is composed of messages labeled by ham or spam, merged from three data sources: 1. SMS Spam Collection https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset 2. Telegram Spam Ham https://huggingface.co./datasets/thehamkercat/telegram-spam-ham/tree/main 3. Enron Spam: https://huggingface.co./datasets/SetFit/enron_spam/tree/main (only used message column and labels)
The prepare script for enron is available at https://github.com/mshenoda/roberta-spam/tree/main/data/enron. The data is split 80% train 10% validation, and 10% test sets; the scripts used to split and merge of the three data sources are available at: https://github.com/mshenoda/roberta-spam/tree/main/data/utils.
Dataset Class Distribution
Model Architecture
The model is fine tuned RoBERTa base
roberta-base: https://huggingface.co./roberta-base
paper: https://arxiv.org/abs/1907.11692
Metrics
Loss | Accuracy | Precision / Recall | Confusion Matrix |
---|---|---|---|
![]() |
![]() |
![]() |
![]() |
Required Packages
- numpy
- torch
- transformers
- pandas
- tqdm
- matplotlib
Install
pip3 install -r requirements.txt
Directory Structure
Place all the files in same directory as the following:
ββββ data/ contains csv data files
ββββ plots/ contains metrics results and plots
ββββ roberta-spam trained model weights
ββββ utils/ contains helper functions
ββββ demo.ipynb jupyter notebook run the demo
ββββ detector.py SpamMessageDetector with methods train, evaluate, detect
ββββ dataset.py custom dataset class for spam messages
Running Demo
To run the demo, please run the following Jupyter Notebook: demo.ipynb
** Recommend using VSCode https://code.visualstudio.com for running the demo notebook