Update README.md
README.md
CHANGED
@@ -33,20 +33,57 @@ It achieves the following results on the evaluation set:

- Log Loss: 0.9232
- Loss: 0.3017

## Model description

DistilBERT is a smaller, faster, cheaper version of BERT obtained through knowledge distillation: it retains about 97% of BERT's language understanding while being roughly 40% smaller and 60% faster. This fine-tuned version of DistilBERT is trained to detect AI-generated text in paragraphs from the STEM domain.

Key characteristics:
- **Architecture**: Transformer-based model
- **Pre-training objective**: Masked Language Modeling (MLM)
- **Fine-tuning objective**: Binary classification (human-written vs. AI-generated)
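
Because the fine-tuned model is a standard binary sequence classifier, it can be loaded through the 🤗 Transformers auto classes. The snippet below is a minimal inference sketch, not an official usage example: the repo ID and the assumption that label index 1 means "AI-generated" are placeholders to adjust for the actual repository.

```python
# Minimal inference sketch. The repo ID and the label mapping
# (index 1 = "AI-generated") are placeholders, not taken from this card.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "your-username/distilbert-stem-ai-detector"  # hypothetical repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "The mitochondrion is the primary site of ATP synthesis in eukaryotic cells."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]

print(f"P(AI-generated) = {probs[1].item():.3f}")  # assumes label 1 = AI-generated
```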

## Intended uses & limitations

### Intended uses
- **AI Text Detection**: Identifying paragraphs in the STEM domain that are generated by AI versus those written by humans.
- **Educational Tools**: Assisting educators in detecting AI-generated content in academic submissions.
- **Research**: Analyzing the effectiveness of AI-generated content detection in STEM-related texts.

### Limitations
- **Domain Specificity**: The model is fine-tuned specifically on STEM paragraphs and may not perform as well on texts from other domains.
- **Generalization**: While the model is effective at detecting AI-generated text in STEM, it may not generalize well to other types of AI-generated content outside of its training data.
- **Biases**: The model may inherit biases present in the training data, which could affect its performance and fairness.

## Training and evaluation data

The model was fine-tuned on the "16K-trueparagraph-STEM" dataset, which consists of 16,000 paragraphs from various STEM domains. The dataset includes both human-written and AI-generated paragraphs to provide a balanced training set for the model.

### Dataset Details
- **Size**: 16,000 paragraphs
- **Sources**: Academic papers, research articles, and other STEM-related documents.
- **Balance**: Approximately 50% human-written paragraphs and 50% AI-generated paragraphs.
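
As a rough illustration of how such a dataset can be loaded and its balance verified with the 🤗 `datasets` library (the file name, the `text`/`label` column names, and the split ratio below are assumptions, since the card does not specify where or how the dataset is stored):

```python
# Sketch: loading the fine-tuning data and checking the ~50/50 balance.
# The file name and the "text"/"label" column names are assumptions about
# how 16K-trueparagraph-STEM is stored; adjust to the actual source.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("csv", data_files="16K-trueparagraph-STEM.csv")["train"]
print(len(ds))               # expected: ~16,000 paragraphs
print(Counter(ds["label"]))  # expected: roughly 8,000 human (0) vs 8,000 AI (1)

# Illustrative 90/10 train/eval split (the card does not specify the split).
splits = ds.train_test_split(test_size=0.1, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```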

## Training procedure

### Preprocessing
- **Tokenization**: Texts were tokenized using the DistilBERT tokenizer.
- **Truncation/Padding**: All inputs were truncated or padded to a maximum length of 512 tokens.
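
A minimal sketch of that preprocessing step, assuming `distilbert-base-uncased` as the base checkpoint (the card does not name the exact one):

```python
# Preprocessing sketch: tokenize with the DistilBERT tokenizer and
# truncate/pad to 512 tokens ("distilbert-base-uncased" as the base
# checkpoint is an assumption, not stated in the card).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

paragraphs = [
    "Newton's second law states that force equals mass times acceleration.",
    "The Krebs cycle oxidizes acetyl-CoA, producing NADH and FADH2.",
]
encoded = tokenizer(
    paragraphs, truncation=True, padding="max_length", max_length=512, return_tensors="pt"
)
print(encoded["input_ids"].shape)  # torch.Size([2, 512])
```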

### Hyperparameters
- **Optimizer**: AdamW
- **Learning Rate**: 5e-5
- **Batch Size**: 16
- **Number of Epochs**: 3
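
Expressed through the 🤗 `Trainer` API, these settings would look roughly like the sketch below (an illustration, not the actual training script; the output directory is a placeholder and all other options are left at their defaults):

```python
# Illustrative TrainingArguments mirroring the listed values; the output
# directory is a placeholder and other settings are left at their defaults.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distilbert-stem-ai-detector",  # hypothetical output path
    learning_rate=5e-5,                        # Trainer pairs this with AdamW by default
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
)
```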

### Training
- **Loss Function**: Binary Cross-Entropy Loss
- **Evaluation Metrics**: Accuracy, Precision, Recall, F1-Score, ROC-AUC
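
One way to compute the listed metrics inside a `Trainer` evaluation loop is with scikit-learn; the function below is a sketch under the assumption that label 1 is the AI-generated class, and the actual evaluation code may differ.

```python
# Sketch of a compute_metrics callback covering the listed metrics,
# using scikit-learn; not the exact evaluation code from training.
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Numerically stable softmax to get the probability of the positive (AI) class.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    preds = probs.argmax(axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds),
        "f1": f1_score(labels, preds),
        "roc_auc": roc_auc_score(labels, probs[:, 1]),
    }
```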

### Hardware
- **Environment**: Training was conducted on a single NVIDIA Tesla V100 GPU.
- **Training Time**: Approximately 4 hours.

### Training hyperparameters

The following hyperparameters were used during training: