# Spam Detector BERT MoE v2.2

[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Model-blue)](https://huggingface.co/AntiSpamInstitute/spam-detector-bert-MoE-v2.2)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)

## Table of Contents
- [Overview](#overview)
- [Model Description](#model-description)
- [Features](#features)
- [Usage](#usage)
  - [Installation](#installation)
  - [Quick Start](#quick-start)
  - [Example](#example)
- [Model Architecture](#model-architecture)
- [Training Data](#training-data)
- [Performance](#performance)
- [Intended Use](#intended-use)
- [Limitations](#limitations)
- [Citing This Model](#citing-this-model)
- [License](#license)
- [Contact](#contact)

## Overview

The **Spam Detector BERT MoE v2.2** is a state-of-the-art natural language processing model designed to accurately classify text messages and content as spam or non-spam. Leveraging a BERT-based architecture enhanced with a Mixture of Experts (MoE) approach, this model achieves high performance and scalability for diverse spam detection applications.

## Model Description

This model is built upon the BERT (Bidirectional Encoder Representations from Transformers) architecture and incorporates a Mixture of Experts (MoE) mechanism to improve its ability to handle a wide variety of spam patterns. The MoE layer allows the model to activate different "experts" (sub-models) based on the input, enhancing its capacity to generalize across different types of spam content.

- **Model Name:** spam-detector-bert-MoE-v2.2
- **Architecture:** BERT with Mixture of Experts (MoE)
- **Language:** English
- **Task:** Text Classification (Spam Detection)

## Features

- **High Accuracy:** Achieves superior performance in distinguishing spam from non-spam messages.
- **Scalable:** Efficiently handles large datasets and real-time classification tasks.
- **Versatile:** Suitable for various applications, including email filtering, SMS spam detection, and social media monitoring.
- **Pre-trained:** Ready-to-use with extensive pre-training on diverse datasets.

## Usage

### Installation

First, ensure you have the [Transformers](https://github.com/huggingface/transformers) library installed, along with PyTorch, which the examples below use. You can install both via pip:

```bash
pip install transformers torch
```

### Quick Start

Here's how to quickly get started with the **Spam Detector BERT MoE v2.2** model using the Hugging Face `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("AntiSpamInstitute/spam-detector-bert-MoE-v2.2")
model = AutoModelForSequenceClassification.from_pretrained("AntiSpamInstitute/spam-detector-bert-MoE-v2.2")

# Sample text
texts = [
    "Congratulations! You've won a $1,000 Walmart gift card. Click here to claim now.",
    "Hey, are we still meeting for lunch today?"
]

# Tokenize the input
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

# Get model predictions
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Apply softmax to get probabilities
probabilities = torch.softmax(logits, dim=1)

# Get predicted labels
predictions = torch.argmax(probabilities, dim=1)

# Map labels to class names
label_map = {0: "Not Spam", 1: "Spam"}
for text, prediction in zip(texts, predictions):
    print(f"Text: {text}\nPrediction: {label_map[prediction.item()]}\n")
```

### Example

**Input:**
```plaintext
"Limited time offer! Buy one get one free on all products. Visit our store now!"
```

**Output:**
```plaintext
Prediction: Spam
```
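
For quick experiments, the same checkpoint can also be loaded through the `transformers` `pipeline` helper. The snippet below is a minimal sketch of that route; it assumes the checkpoint's configuration provides human-readable label names (for example via `id2label`), which this README does not document.

```python
from transformers import pipeline

# Text-classification pipeline around the same checkpoint; label strings come from
# the model's configuration (assumed here, not documented in this README).
classifier = pipeline(
    "text-classification",
    model="AntiSpamInstitute/spam-detector-bert-MoE-v2.2",
)

results = classifier([
    "Limited time offer! Buy one get one free on all products. Visit our store now!",
    "Hey, are we still meeting for lunch today?",
])
for result in results:
    print(f"{result['label']} (score: {result['score']:.3f})")
```

The pipeline handles tokenization and post-processing internally, so it is usually the shorter path for one-off classification.
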
## Model Architecture

The **Spam Detector BERT MoE v2.2** employs the following architecture components (a schematic sketch of the MoE layer follows the list):

- **BERT Base:** Utilizes the pre-trained BERT base model with 12 transformer layers, 768 hidden units, and 12 attention heads.
- **Mixture of Experts (MoE):** Incorporates an MoE layer that consists of multiple expert feed-forward networks. During inference, only a subset of experts is activated for each input, enhancing the model's capacity without significantly increasing computational costs.
- **Classification Head:** A linear layer on top of the BERT embeddings for binary classification (spam vs. not spam).
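
The routing details of the released checkpoint (number of experts, top-k, where the MoE block sits) are not documented in this card, so the following PyTorch snippet is only a schematic sketch of the idea described above: a gating network scores a set of expert feed-forward networks, and only the highest-scoring experts run for each input. All class names and sizes here are illustrative placeholders, not the model's actual configuration.

```python
import torch
import torch.nn as nn

class MoEClassifierHead(nn.Module):
    """Illustrative top-k MoE classification head; sizes and routing are placeholders."""

    def __init__(self, hidden_size=768, num_experts=4, top_k=2, num_labels=2):
        super().__init__()
        self.top_k = top_k
        # A small feed-forward network per expert.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.GELU(),
                nn.Linear(hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        ])
        # Gating network that scores the experts for each input.
        self.gate = nn.Linear(hidden_size, num_experts)
        # Binary classification head (spam vs. not spam).
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled):
        # pooled: (batch, hidden_size) sentence embedding from the BERT encoder.
        gate_scores = self.gate(pooled)                        # (batch, num_experts)
        top_w, top_idx = gate_scores.topk(self.top_k, dim=-1)  # route to the top-k experts only
        top_w = torch.softmax(top_w, dim=-1)
        mixed = torch.zeros_like(pooled)
        for slot in range(self.top_k):
            for expert_id, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == expert_id           # inputs assigned to this expert
                if mask.any():
                    mixed[mask] += top_w[mask, slot].unsqueeze(-1) * expert(pooled[mask])
        return self.classifier(mixed)                          # (batch, num_labels)

# Illustrative usage with a random "pooled" embedding.
head = MoEClassifierHead()
print(head(torch.randn(2, 768)).shape)  # torch.Size([2, 2])
```

Because only `top_k` of the experts execute per input, the extra parameters add capacity without a matching increase in per-example compute, which is the trade-off the list above refers to.
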
## Training Data

The model was trained on a diverse and extensive dataset comprising:

- **Public Spam Datasets:** Including the SMS Spam Collection, the Enron Email Dataset, and various social media spam datasets.
- **Synthetic Data:** Generated to augment the training set and cover a wide range of spam scenarios.
- **Real-World Data:** Collected from multiple domains to ensure robustness and generalization.

The training data was preprocessed to remove personally identifiable information (PII) and ensure compliance with data privacy standards.

## Performance

The **Spam Detector BERT MoE v2.2** achieves the following performance metrics on benchmark datasets:

| Dataset             | Accuracy | Precision | Recall | F1 Score |
|---------------------|----------|-----------|--------|----------|
| SMS Spam Collection | 98.5%    | 98.7%     | 98.3%  | 98.5%    |
| Enron Email Dataset | 97.8%    | 98.0%     | 97.5%  | 97.7%    |
| Social Media Spam   | 96.5%    | 96.7%     | 96.3%  | 96.5%    |

*Note: These metrics are based on the model's performance at the time of release and may vary with different data distributions.*
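
To check the model against your own data distribution, the same metrics can be computed from model predictions with scikit-learn (an extra dependency not listed above). The snippet below is a rough template with placeholder labels; the exact evaluation splits behind the table are not published here.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder labels: 1 = Spam, 0 = Not Spam. In practice, y_true comes from a labelled
# evaluation set and y_pred from the predictions shown in the Quick Start.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"Accuracy {accuracy:.3f}  Precision {precision:.3f}  Recall {recall:.3f}  F1 {f1:.3f}")
```
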
## Intended Use

The **Spam Detector BERT MoE v2.2** is intended for use in the following applications:

- **Email Filtering:** Automatically classify and filter spam emails.
- **SMS Spam Detection:** Identify and block spam messages in mobile communications.
- **Social Media Monitoring:** Detect and manage spam content on platforms like Twitter and Facebook.
- **Content Moderation:** Assist in maintaining the quality of user-generated content by filtering out unwanted spam.

## Limitations

While the **Spam Detector BERT MoE v2.2** demonstrates high accuracy, users should be aware of the following limitations:

- **Language Support:** Currently optimized for English text. Performance may degrade for other languages.
- **Evolving Spam Tactics:** Spammers continually adapt their strategies, which may affect the model's effectiveness over time. Regular updates and retraining are recommended.
- **Context Understanding:** The model primarily focuses on textual features and may not fully capture contextual nuances or intent beyond the text.
- **Resource Requirements:** Although the MoE design keeps inference efficient, the model may still require more computational resources than lightweight classifiers, which can complicate deployment in resource-constrained environments.

## Citing This Model

If you use the **Spam Detector BERT MoE v2.2** in your research or applications, please cite it as follows:

```bibtex
@misc{AntiSpamInstitute_spam-detector-bert-MoE-v2.2,
  author    = {AntiSpamInstitute},
  title     = {spam-detector-bert-MoE-v2.2},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/AntiSpamInstitute/spam-detector-bert-MoE-v2.2}
}
```

## License

This model is released under the [Apache 2.0 License](LICENSE). Please review the license terms before using the model.

## Contact

For questions, issues, or contributions, please reach out:

- **GitHub Repository:** [AntiSpamInstitute/spam-detector-bert-MoE-v2.2](https://github.com/AntiSpamInstitute/spam-detector-bert-MoE-v2.2)
- **Email:** [email protected]
- **Twitter:** [@AntiSpamInstitute](https://twitter.com/AntiSpamInstitute)

---

*This README was generated to provide comprehensive information about the Spam Detector BERT MoE v2.2 model. For the latest updates and more detailed documentation, please visit the [Hugging Face model page](https://huggingface.co/AntiSpamInstitute/spam-detector-bert-MoE-v2.2).*