SmolLM Fine-Tuned for Plagiarism Detection
This repository hosts a fine-tuned version of SmolLM2 (135M parameters) that detects plagiarism by classifying sentence pairs as plagiarized or non-plagiarized. It was fine-tuned on the MIT Plagiarism Detection Dataset to identify textual similarity between sentences.
Model Information
- Base Model: HuggingFaceTB/SmolLM2-135M-Instruct
- Fine-tuned Model Name: jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection
- License: MIT
- Language: English
- Task: Text Classification
- Metrics: Accuracy, F1 Score, Recall
Dataset
The model was fine-tuned on the MIT Plagiarism Detection Dataset, which provides pairs of sentences labeled to indicate whether one is a rephrased version of the other (i.e., plagiarized). The dataset is suited to sentence-level similarity detection, and its labels (1 for plagiarized, 0 for non-plagiarized) map directly onto binary classification.
Training Procedure
Fine-tuning was done with the transformers library from Hugging Face. Key details include:
- Model Architecture: The model was modified for sequence classification with two output labels.
- Optimizer: AdamW was used to handle optimization, with a learning rate of 2e-5.
- Loss Function: Cross-Entropy Loss was used as the objective function.
- Batch Size: Set to 16 for memory and performance balance.
- Epochs: Trained for 3 epochs.
- Padding: A custom padding token was added to align with SmolLM’s requirements, ensuring smooth tokenization.
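For reference, the cross-entropy objective named in the list above can be written out for the two-label case. This is the standard textbook form, not taken from the training script; the symbols $z_0, z_1$ (the two output logits), $y_i$ (the true label of pair $i$), and $B$ (the batch size, 16 here) are notation introduced only for this sketch.

```latex
p_k = \frac{e^{z_k}}{e^{z_0} + e^{z_1}}, \quad k \in \{0, 1\}
\qquad\qquad
\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \log p_{y_i}^{(i)}
```

Here $p_{y_i}^{(i)}$ is the softmax probability that the model assigns to the correct label of the $i$-th sentence pair in the batch.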
Training used a DataLoader that fed sentence pairs into the model, tokenized with attention masking, truncation, and padding. After training, the model reached about 99.66% accuracy on the training dataset.
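To make the truncation, padding, and attention-masking step concrete, here is a minimal pure-Python sketch of what a tokenizer does to a batch of already-tokenized sequences. It is an illustration only, not the code used for training: the function name `pad_and_mask`, the token ids, and `pad_id=0` are hypothetical placeholders.

```python
from typing import List, Tuple


def pad_and_mask(
    batch: List[List[int]], max_length: int, pad_id: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Truncate or right-pad each sequence to max_length and build attention masks."""
    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for ids in batch:
        ids = ids[:max_length]                      # truncation: drop tokens past max_length
        n_pad = max_length - len(ids)               # how many pad tokens are needed
        input_ids.append(ids + [pad_id] * n_pad)    # padding: fill up to max_length
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Hypothetical token ids; pad_id=0 is an arbitrary placeholder value.
ids, mask = pad_and_mask([[5, 6, 7], [8, 9, 10, 11, 12, 13]], max_length=4, pad_id=0)
print(ids)
print(mask)
```

In practice the Hugging Face tokenizer performs these steps itself when called with padding and truncation enabled; the sketch only shows why a padding token must exist and what the attention mask encodes.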
Usage
This model can be used directly with the Hugging Face Transformers library to classify sentence pairs as plagiarized or non-plagiarized. Load the model and tokenizer from the jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection repository and pass sentence pairs as input. The predicted class is the index of the larger of the two output logits (1 for plagiarized, 0 for non-plagiarized).
- Example:
from transformers import GPT2Tokenizer, LlamaForSequenceClassification
tokenizer = GPT2Tokenizer.from_pretrained("jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection")
model = LlamaForSequenceClassification.from_pretrained("jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection", num_labels=2)
model.eval()
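Once the model returns two logits for a sentence pair, they still need to be turned into a decision. The sketch below shows one way to do that in plain Python, using softmax followed by argmax. The label mapping follows the dataset description (1 = plagiarized, 0 = non-plagiarized); the function name `interpret_logits` and the example logits are made up for illustration, and the index order should be checked against the model's own config before relying on it.

```python
import math
from typing import List, Tuple

# Assumed mapping, taken from the dataset description in this card.
LABELS = {0: "non-plagiarized", 1: "plagiarized"}


def interpret_logits(logits: List[float]) -> Tuple[str, float]:
    """Return the predicted label and its softmax probability."""
    m = max(logits)                                # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]              # softmax over the two classes
    pred = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return LABELS[pred], probs[pred]


# Hypothetical logits, e.g. one row of model(**inputs).logits converted to a list.
label, prob = interpret_logits([-1.2, 2.3])
print(label, round(prob, 3))
```

With a real model, the logits for a single pair could be obtained by calling the model on the tokenized pair and converting the first row of the output logits to a Python list.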
Evaluation
The model was evaluated on held-out validation and test sets, with the following results:
Accuracy on Validation set: 96%
Classification Report On Test Set
Accuracy: 96.20%
Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
0 | 0.96 | 0.97 | 0.96 | 36,586 |
1 | 0.97 | 0.96 | 0.96 | 36,888 |
Overall Metrics:
- Accuracy: 0.96
- Macro Average:
  - Precision: 0.96
  - Recall: 0.96
  - F1-Score: 0.96
- Weighted Average:
  - Precision: 0.96
  - Recall: 0.96
  - F1-Score: 0.96
- Total Support: 73,474
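For readers unfamiliar with how the figures in a classification report relate to each other, the following sketch computes per-class precision, recall, and F1, plus the macro and weighted averages, from lists of true and predicted labels. The function name `classification_metrics` and the toy labels are hypothetical and exist only to show the arithmetic; they are not the model's predictions and do not reproduce the numbers above.

```python
from typing import Dict, List


def classification_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute accuracy plus macro- and support-weighted F1 from label lists."""
    pairs = list(zip(y_true, y_pred))
    f1_scores: Dict[int, float] = {}
    supports: Dict[int, int] = {}
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives for class c
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores[c] = 2 * precision * recall / denom if denom else 0.0
        supports[c] = tp + fn                               # number of true samples of class c
    total = len(y_true)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / total,
        # Macro average: every class counts equally.
        "macro_f1": sum(f1_scores.values()) / len(f1_scores),
        # Weighted average: each class counts in proportion to its support.
        "weighted_f1": sum(f1_scores[c] * supports[c] for c in f1_scores) / total,
    }


# Toy labels for illustration only (1 = plagiarized, 0 = non-plagiarized).
metrics = classification_metrics([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0])
print(metrics)
```

Because the two classes in the test set have nearly equal support (36,586 and 36,888), the macro and weighted averages are expected to be close to each other, which matches the report.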
Model and Tokenizer Saving
Upon completion of fine-tuning, the model and tokenizer were saved for deployment and ease of loading in future projects. They can be loaded from Hugging Face or saved locally for custom applications.
License
This model and associated code are released under the MIT License, allowing for both personal and commercial use.
Connect with Me
I appreciate your support and am happy to connect!
GitHub | Email | LinkedIn | Portfolio