pffaundez committed on
Commit e305f53
1 Parent(s): b0a8eb5

Update README.md

Files changed (1)
  1. README.md +40 -3
README.md CHANGED
@@ -33,20 +33,57 @@ It achieves the following results on the evaluation set:
 - Log Loss: 0.9232
 - Loss: 0.3017
 
+
 ## Model description
 
-More information needed
+DistilBERT is a smaller, faster, cheaper version of BERT obtained through knowledge distillation: it retains about 97% of BERT's language-understanding performance while running roughly 60% faster with 40% fewer parameters. This fine-tuned version of DistilBERT is trained to detect AI-generated text in paragraphs from the STEM domain.
+
+Key characteristics:
+- **Architecture**: Transformer-based model
+- **Pre-training objective**: Masked Language Modeling (MLM)
+- **Fine-tuning objective**: Binary classification (human-written vs. AI-generated)
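A minimal inference sketch for this kind of checkpoint, using the `transformers` pipeline API. The repo id shown is a placeholder (the card does not state the published model id), and how the labels map to human-written vs. AI-generated depends on the checkpoint's `id2label` config:

```python
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="pffaundez/detect-ai-text-stem",  # placeholder repo id, not confirmed by this card
)

paragraph = "Mitochondria generate most of the cell's ATP through oxidative phosphorylation."
print(detector(paragraph))
# e.g. [{'label': 'LABEL_1', 'score': 0.97}] -- the LABEL_0/LABEL_1 mapping to
# human-written vs. AI-generated comes from the checkpoint's id2label config.
```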
 
 ## Intended uses & limitations
 
-More information needed
+### Intended uses
+- **AI Text Detection**: Identifying whether paragraphs in the STEM domain were generated by AI or written by a human.
+- **Educational Tools**: Assisting educators in detecting AI-generated content in academic submissions.
+- **Research**: Analyzing the effectiveness of AI-generated content detection in STEM-related texts.
+
+### Limitations
+- **Domain Specificity**: The model is fine-tuned specifically on STEM paragraphs and may not perform as well on texts from other domains.
+- **Generalization**: Although effective on STEM text, the model may not generalize well to AI-generated content that differs from its training data.
+- **Biases**: The model may inherit biases present in the training data, which could affect its performance and fairness.
 
 ## Training and evaluation data
 
-More information needed
+The model was fine-tuned on the "16K-trueparagraph-STEM" dataset, which consists of 16,000 paragraphs from various STEM domains. The dataset includes both human-written and AI-generated paragraphs to provide a balanced training set for the model.
+
+### Dataset Details
+- **Size**: 16,000 paragraphs
+- **Sources**: Academic papers, research articles, and other STEM-related documents.
+- **Balance**: Approximately 50% human-written paragraphs and 50% AI-generated paragraphs.
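A small sketch of how the stated roughly 50/50 balance could be spot-checked with the `datasets` library. The hub id and the `label` column name are assumptions, since the card only gives the dataset's name:

```python
from collections import Counter
from datasets import load_dataset

# Hypothetical hub id and column names -- adjust to the actual dataset repo.
ds = load_dataset("pffaundez/16K-trueparagraph-STEM")

counts = Counter(ds["train"]["label"])  # assumed label column: 0 = human-written, 1 = AI-generated
print(counts)  # should be roughly balanced if the ~50/50 split above holds
```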
 
 ## Training procedure
 
+### Preprocessing
+- **Tokenization**: Texts were tokenized using the DistilBERT tokenizer.
+- **Truncation/Padding**: All inputs were truncated or padded to a maximum length of 512 tokens.
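The preprocessing described above corresponds to a tokenizer call along the following lines; the exact DistilBERT variant (`distilbert-base-uncased` here) is an assumption, as the card does not name it:

```python
from transformers import AutoTokenizer

# Assumed checkpoint; the card only says "the DistilBERT tokenizer".
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

encoded = tokenizer(
    ["A sample STEM paragraph.", "Another paragraph to classify."],
    truncation=True,          # truncate anything longer than max_length
    padding="max_length",     # pad shorter inputs up to max_length
    max_length=512,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # -> torch.Size([2, 512])
```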
+
+### Hyperparameters
+- **Optimizer**: AdamW
+- **Learning Rate**: 5e-5
+- **Batch Size**: 16
+- **Number of Epochs**: 3
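For orientation, these values map onto a `transformers` `TrainingArguments` object roughly as follows. This is an illustrative sketch rather than the actual training script, and the "Training hyperparameters" section below lists the values recorded during training:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distilbert-ai-text-detector",  # placeholder output path
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)
# The Trainer's default optimizer is AdamW, matching the optimizer listed above.
```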
+
+### Training
+- **Loss Function**: Binary Cross-Entropy Loss
+- **Evaluation Metrics**: Accuracy, Precision, Recall, F1-Score, ROC-AUC
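One way to produce the metrics listed above during evaluation is a `compute_metrics` function passed to the `Trainer`. The metric set is taken from the card; the implementation and the class-index convention below are assumptions:

```python
import numpy as np
from scipy.special import softmax
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    roc_auc_score,
)

def compute_metrics(eval_pred):
    """Metric function in the shape expected by the transformers Trainer."""
    logits, labels = eval_pred
    probs = softmax(logits, axis=-1)[:, 1]   # assumed: index 1 = AI-generated class
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "roc_auc": roc_auc_score(labels, probs),
    }
```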
+
+### Hardware
+- **Environment**: Training was conducted on a single NVIDIA Tesla V100 GPU.
+- **Training Time**: Approximately 4 hours.
+
 ### Training hyperparameters
 
 The following hyperparameters were used during training: