metadata
license: apache-2.0
language:
- en
metrics:
- accuracy
pipeline_tag: text-generation
Using NeyabAI:
Direct Use:
Fine-Tuning:
This repository demonstrates how to fine-tune the NeyabAI (GPT-2) language model on a custom dataset using PyTorch and Hugging Face's Transformers library. The code provides an end-to-end example, from loading the dataset to training the model and reporting its loss and token-level accuracy on the training batches.
Requirements
- Python 3.6+
- PyTorch
- Transformers (Hugging Face)
- NumPy
You can install the required packages using pip:
pip install torch transformers numpy
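To confirm the installation before running the script, a quick import check can help. The sketch below is only an illustration: the helper name missing_packages is hypothetical and not part of this repository, and it checks only that the packages can be found, not their versions.

```python
import importlib.util

# Packages listed in the Requirements section above.
REQUIRED = ("torch", "transformers", "numpy")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```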
Fine-Tuning Script
The following script outlines the steps for fine-tuning GPT-2 on a custom dataset:
import numpy as np
import torch
from torch.optim import AdamW  # the AdamW exported by transformers is deprecated; use PyTorch's
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
# Load pre-trained model and tokenizer
model_name = "XsoraS/NeyabAI"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Example dataset
dataset = ["Your custom dataset goes here."] # Replace with your actual dataset
# Tokenization function: pad/truncate to a fixed length and return PyTorch tensors
def tokenize_function(examples):
    return tokenizer(examples, padding='max_length', truncation=True, max_length=512, return_tensors='pt')

# Tokenize the whole dataset in one call
tokenized_inputs = tokenize_function(dataset)
input_ids = tokenized_inputs['input_ids']
attention_masks = tokenized_inputs['attention_mask']

# Use the inputs as labels, but mark padded positions with -100 so the loss ignores them
labels = input_ids.clone()
labels[attention_masks == 0] = -100
# Create DataLoader
batch_size = 8
dataset = TensorDataset(input_ids, attention_masks, labels)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
# Configure device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Keep the weights in float32: training pure float16 weights with AdamW is numerically unstable
# Set up optimizer
optimizer = AdamW(model.parameters(), lr=3e-5)
# Define accuracy calculation
def calculate_accuracy(pred_ids, labels, attention_mask):
    # The prediction at position i is for the token at position i + 1,
    # so shift by one before comparing and skip padded positions.
    preds = pred_ids[:, :-1]
    targets = labels[:, 1:]
    mask = attention_mask[:, 1:].astype(bool)
    return np.sum((preds == targets) & mask) / max(np.sum(mask), 1)

# Training loop (simplified)
model.train()
for epoch in range(3):  # Adjust the number of epochs as needed
    for batch in dataloader:
        input_ids, attention_masks, labels = (t.to(device) for t in batch)
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_masks, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        # Take the argmax on the device so only token ids are copied to the CPU
        pred_ids = outputs.logits.detach().argmax(dim=-1).cpu().numpy()
        acc = calculate_accuracy(pred_ids, labels.cpu().numpy(), attention_masks.cpu().numpy())
        print(f"Epoch {epoch + 1}, Loss: {loss.item():.4f}, Accuracy: {acc:.4f}")
print("Training complete!")
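The accuracy figure is easier to interpret with a small worked case. A causal language model's output at position i is its guess for the token at position i + 1, so predictions have to be compared with the labels shifted by one, and padded positions should not count. The dependency-free sketch below illustrates that idea on made-up token ids. It assumes padded labels are marked with -100, the value the Hugging Face causal-LM loss ignores. The function name next_token_accuracy is illustrative and not part of the script.

```python
IGNORE_INDEX = -100  # label value skipped by the loss (assumed to mark padding here)


def next_token_accuracy(pred_ids, label_ids):
    """Fraction of non-ignored targets that were predicted correctly.

    pred_ids[i] is the model's guess for the token at position i + 1,
    so predictions are compared with the labels shifted left by one.
    """
    correct = total = 0
    for pred_row, label_row in zip(pred_ids, label_ids):
        for pred, target in zip(pred_row[:-1], label_row[1:]):
            if target == IGNORE_INDEX:
                continue  # padded position: not a real target
            total += 1
            correct += int(pred == target)
    return correct / total if total else 0.0


# Toy batch: one sequence of four real tokens followed by two padded positions.
labels = [[5, 7, 9, 2, IGNORE_INDEX, IGNORE_INDEX]]
# Argmax of the logits at each position (made-up values for illustration).
preds = [[7, 9, 4, 0, 0, 0]]
print(next_token_accuracy(preds, labels))
```

Here three targets are real (7, 9, 2) and the first two are predicted correctly, so the function returns two thirds. Note that accuracy measured on the training batches says little about generalization; a held-out split would be needed for that.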
Notes
- Dataset: Replace the dataset variable with your actual dataset.
- Max Length: Adjust the max_length parameter in the tokenize_function as needed based on the length of your input texts.
- Batch Size and Learning Rate: You may need to tune the batch_size and learning rate (lr) according to your dataset and hardware capabilities.
- Epochs: Adjust the number of epochs based on your convergence criteria.
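Two of these notes can be made concrete with small helpers. The sketch below is one possible approach, not something this repository ships: suggest_max_length picks a length that covers a chosen share of your examples, and EarlyStopping is a simple patience-based convergence criterion. Both names, the coverage and patience defaults, and the model_limit of 1024 (GPT-2's usual context size; check the model config) are assumptions made for illustration.

```python
import math


def suggest_max_length(token_lengths, coverage=0.95, model_limit=1024):
    """Smallest length that fits `coverage` of the examples, capped at model_limit."""
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    ordered = sorted(token_lengths)
    index = max(math.ceil(coverage * len(ordered)) - 1, 0)
    return min(ordered[index], model_limit)


class EarlyStopping:
    """Signal a stop after `patience` epochs without a loss improvement."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, loss):
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Token counts per example (made-up values; in practice, len(tokenizer(text)["input_ids"])).
lengths = [12, 40, 38, 25, 300, 31, 29, 44, 18, 36]
print(suggest_max_length(lengths, coverage=0.9))

# Per-epoch losses (made-up values) fed to the stopping rule.
stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate([2.1, 1.7, 1.6, 1.65, 1.62], start=1):
    if stopper.should_stop(loss):
        print(f"Stopping after epoch {epoch}")
        break
```

Choosing a length from the data rather than always padding to 512 can cut memory use when most texts are short, at the cost of truncating the longest few. For the stopping rule, a validation loss is a better signal than the training loss if a held-out split is available.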
Acknowledgments
- This project uses the Transformers library by Hugging Face.
- Inspired by various fine-tuning examples and tutorials from the Hugging Face community.