Update README.md
README.md
CHANGED
@@ -33,20 +33,57 @@ It achieves the following results on the evaluation set:

- Log Loss: 0.9232
- Loss: 0.3017

## Model description

DistilBERT is a smaller, faster, cheaper version of BERT obtained through knowledge distillation: it retains about 97% of BERT's language understanding while being roughly 40% smaller and 60% faster. This fine-tuned version of DistilBERT is trained to detect AI-generated text in paragraphs from the STEM domain.

Key characteristics:
- **Architecture**: Transformer-based model
- **Pre-training objective**: Masked Language Modeling (MLM)
- **Fine-tuning objective**: Binary classification (human-written vs. AI-generated)
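
Because the fine-tuned model is a standard binary sequence classifier, it can be loaded through the 🤗 Transformers auto classes. The snippet below is a minimal inference sketch, not an official usage example: the repo ID and the assumption that label index 1 means "AI-generated" are placeholders to adjust for the actual repository.

```python
# Minimal inference sketch. The repo ID and the label mapping
# (index 1 = "AI-generated") are placeholders, not taken from this card.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "your-username/distilbert-stem-ai-detector"  # hypothetical repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "The mitochondrion is the primary site of ATP synthesis in eukaryotic cells."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]

print(f"P(AI-generated) = {probs[1].item():.3f}")  # assumes label 1 = AI-generated
```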

## Intended uses & limitations

### Intended uses
- **AI Text Detection**: Identifying paragraphs in the STEM domain that are generated by AI versus those written by humans.
- **Educational Tools**: Assisting educators in detecting AI-generated content in academic submissions.
- **Research**: Analyzing the effectiveness of AI-generated content detection in STEM-related texts.

### Limitations
- **Domain Specificity**: The model is fine-tuned specifically on STEM paragraphs and may not perform as well on texts from other domains.
- **Generalization**: While the model is effective at detecting AI-generated text in STEM, it may not generalize well to other types of AI-generated content outside of its training data.
- **Biases**: The model may inherit biases present in the training data, which could affect its performance and fairness.

## Training and evaluation data

The model was fine-tuned on the "16K-trueparagraph-STEM" dataset, which consists of 16,000 paragraphs from various STEM domains. The dataset includes both human-written and AI-generated paragraphs to provide a balanced training set for the model.

### Dataset Details
- **Size**: 16,000 paragraphs
- **Sources**: Academic papers, research articles, and other STEM-related documents.
- **Balance**: Approximately 50% human-written paragraphs and 50% AI-generated paragraphs.
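
As a rough illustration of how such a dataset can be loaded and its balance verified with the 🤗 `datasets` library (the file name, the `text`/`label` column names, and the split ratio below are assumptions, since the card does not specify where or how the dataset is stored):

```python
# Sketch: loading the fine-tuning data and checking the ~50/50 balance.
# The file name and the "text"/"label" column names are assumptions about
# how 16K-trueparagraph-STEM is stored; adjust to the actual source.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("csv", data_files="16K-trueparagraph-STEM.csv")["train"]
print(len(ds))               # expected: ~16,000 paragraphs
print(Counter(ds["label"]))  # expected: roughly 8,000 human (0) vs 8,000 AI (1)

# Illustrative 90/10 train/eval split (the card does not specify the split).
splits = ds.train_test_split(test_size=0.1, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```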

## Training procedure

### Preprocessing
- **Tokenization**: Texts were tokenized using the DistilBERT tokenizer.
- **Truncation/Padding**: All inputs were truncated or padded to a maximum length of 512 tokens.
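
A minimal sketch of that preprocessing step, assuming `distilbert-base-uncased` as the base checkpoint (the card does not name the exact one):

```python
# Preprocessing sketch: tokenize with the DistilBERT tokenizer and
# truncate/pad to 512 tokens ("distilbert-base-uncased" as the base
# checkpoint is an assumption, not stated in the card).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

paragraphs = [
    "Newton's second law states that force equals mass times acceleration.",
    "The Krebs cycle oxidizes acetyl-CoA, producing NADH and FADH2.",
]
encoded = tokenizer(
    paragraphs, truncation=True, padding="max_length", max_length=512, return_tensors="pt"
)
print(encoded["input_ids"].shape)  # torch.Size([2, 512])
```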

### Hyperparameters
- **Optimizer**: AdamW
- **Learning Rate**: 5e-5
- **Batch Size**: 16
- **Number of Epochs**: 3
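
Expressed through the 🤗 `Trainer` API, these settings would look roughly like the sketch below (an illustration, not the actual training script; the output directory is a placeholder and all other options are left at their defaults):

```python
# Illustrative TrainingArguments mirroring the listed values; the output
# directory is a placeholder and other settings are left at their defaults.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distilbert-stem-ai-detector",  # hypothetical output path
    learning_rate=5e-5,                        # Trainer pairs this with AdamW by default
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
)
```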

### Training
- **Loss Function**: Binary Cross-Entropy Loss
- **Evaluation Metrics**: Accuracy, Precision, Recall, F1-Score, ROC-AUC
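
One way to compute the listed metrics inside a `Trainer` evaluation loop is with scikit-learn; the function below is a sketch under the assumption that label 1 is the AI-generated class, and the actual evaluation code may differ.

```python
# Sketch of a compute_metrics callback covering the listed metrics,
# using scikit-learn; not the exact evaluation code from training.
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Numerically stable softmax to get the probability of the positive (AI) class.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    preds = probs.argmax(axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds),
        "f1": f1_score(labels, preds),
        "roc_auc": roc_auc_score(labels, probs[:, 1]),
    }
```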

### Hardware
- **Environment**: Training was conducted on a single NVIDIA Tesla V100 GPU.
- **Training Time**: Approximately 4 hours.

### Training hyperparameters

The following hyperparameters were used during training: