readme updates.
README.md CHANGED
@@ -5,4 +5,108 @@ language:
metrics:
- accuracy
pipeline_tag: text-generation
---

# Fine-Tuning GPT-2 on a Custom Dataset

This repository demonstrates how to fine-tune the GPT-2 language model on a custom dataset using PyTorch and Hugging Face's Transformers library. The code provides an end-to-end example, from loading the dataset to training the model and evaluating its performance.
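
Since the model is published on the Hugging Face Hub as `XsoraS/NeyabAI` with the `text-generation` pipeline tag, the quickest way to try it is through the `pipeline` API. A minimal sketch (the prompt and generation settings are only examples):

```python
from transformers import pipeline

# Load the published checkpoint as a text-generation pipeline.
generator = pipeline("text-generation", model="XsoraS/NeyabAI")

# Example prompt; adjust max_new_tokens and sampling options to taste.
print(generator("Once upon a time", max_new_tokens=50)[0]["generated_text"])
```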

## Requirements

- Python 3.6+
- PyTorch
- Transformers (Hugging Face)
- NumPy

You can install the required packages using pip:

```bash
pip install torch transformers numpy
```

## Fine-Tuning Script

The following script outlines the steps for fine-tuning GPT-2 on a custom dataset:

```python
import torch
import numpy as np
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch implementation
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load pre-trained model and tokenizer
model_name = "XsoraS/NeyabAI"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Example dataset
dataset = ["Your custom dataset goes here."]  # Replace with your actual dataset

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples, padding='max_length', truncation=True, max_length=400)

# Tokenize the dataset
tokenized_inputs = [tokenize_function(text) for text in dataset]
input_ids = [enc['input_ids'] for enc in tokenized_inputs]
attention_masks = [enc['attention_mask'] for enc in tokenized_inputs]

# Convert to torch tensors
input_ids = torch.tensor(input_ids)
attention_masks = torch.tensor(attention_masks)
labels = input_ids.clone()  # pad positions are included in the loss; set them to -100 to ignore them

# Create DataLoader
batch_size = 8
dataset = TensorDataset(input_ids, attention_masks, labels)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Configure device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Avoid casting the whole model to fp16 with model.half(): pure fp16 training with a
# plain AdamW is unstable (gradients underflow); use torch.cuda.amp for mixed precision.

# Set up optimizer
optimizer = AdamW(model.parameters(), lr=3e-5)

# Define accuracy calculation
# Note: this is a rough token-level accuracy; it compares the argmax at each position
# with the label at the same position and also counts padding tokens.
def calculate_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=-1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Training loop (simplified)
model.train()
for epoch in range(3):  # Adjust the number of epochs as needed
    for batch in dataloader:
        batch = tuple(t.to(device) for t in batch)
        input_ids, attention_masks, labels = batch

        outputs = model(input_ids, attention_mask=attention_masks, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        preds = logits.detach().cpu().numpy()
        label_ids = labels.to('cpu').numpy()
        acc = calculate_accuracy(preds, label_ids)

        print(f"Loss: {loss.item()}, Accuracy: {acc}")

print("Training complete!")
```
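
After training, you will typically want to persist the fine-tuned weights so they can be reloaded later or pushed to the Hub. A minimal sketch; the output directory name is only an example:

```python
# Save the fine-tuned model and tokenizer ("./gpt2-finetuned" is an example path).
output_dir = "./gpt2-finetuned"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Reload later with:
# model = GPT2LMHeadModel.from_pretrained(output_dir)
# tokenizer = GPT2TokenizerFast.from_pretrained(output_dir)
```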

## Notes

- **Dataset:** Replace the `dataset` variable with your actual dataset; a sketch for loading one from a plain-text file follows this list.
- **Max Length:** Adjust the `max_length` parameter in `tokenize_function` based on the length of your input texts.
- **Batch Size and Learning Rate:** You may need to tune `batch_size` and the learning rate (`lr`) according to your dataset and hardware capabilities.
- **Epochs:** Adjust the number of epochs based on your convergence criteria.
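
A minimal sketch for the **Dataset** note, assuming a plain-text corpus with one training example per line (`data.txt` is a placeholder path):

```python
# Build `dataset` from your own corpus instead of the placeholder list.
# "data.txt" is a hypothetical file name; point it at your actual data.
with open("data.txt", encoding="utf-8") as f:
    dataset = [line.strip() for line in f if line.strip()]
```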

## Acknowledgments

- This project uses the [Transformers](https://huggingface.co/transformers/) library by Hugging Face.
- Inspired by various fine-tuning examples and tutorials from the Hugging Face community.