---
license: apache-2.0
language:
- en
metrics:
- accuracy
pipeline_tag: text-generation
---

# Using NeyabAI

# Direct Use

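NeyabAI uses the GPT-2 architecture, so it can be loaded and queried with the standard Transformers text-generation workflow. The snippet below is a minimal sketch; the prompt text and the generation settings (`max_new_tokens`, `top_p`) are only illustrative.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load NeyabAI and its tokenizer from the Hugging Face Hub
model_name = "XsoraS/NeyabAI"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Run on GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Generate a continuation for a prompt (settings are illustrative)
prompt = "Your prompt goes here."
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
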
# Fine-Tuning

This repository demonstrates how to fine-tune the NeyabAI (GPT-2) language model on a custom dataset using PyTorch and Hugging Face's Transformers library. The code provides an end-to-end example, from loading the dataset to training the model and evaluating its performance.

## Requirements

- Python 3.6+
- PyTorch
- Transformers (Hugging Face)
- NumPy

You can install the required packages using pip:

```bash
pip install torch transformers numpy
```
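Optionally, you can confirm that PyTorch detects a GPU before training; the fine-tuning script below falls back to CPU automatically if none is found:

```bash
python -c "import torch; print(torch.cuda.is_available())"
```
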
## Fine-Tuning Script

The following script outlines the steps for fine-tuning GPT-2 on a custom dataset:

```python
import numpy as np
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load pre-trained model and tokenizer
model_name = "XsoraS/NeyabAI"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Example dataset
dataset = ["Your custom dataset goes here."]  # Replace with your actual dataset

# Tokenization function
def tokenize_function(example):
    return tokenizer(example, padding='max_length', truncation=True, max_length=512)

# Tokenize the dataset
tokenized_inputs = [tokenize_function(text) for text in dataset]
input_ids = [enc['input_ids'] for enc in tokenized_inputs]
attention_masks = [enc['attention_mask'] for enc in tokenized_inputs]

# Convert to torch tensors
input_ids = torch.tensor(input_ids)
attention_masks = torch.tensor(attention_masks)

# Use the input ids as labels, ignoring padding positions in the loss
labels = input_ids.clone()
labels[attention_masks == 0] = -100

# Create DataLoader
batch_size = 8
tensor_dataset = TensorDataset(input_ids, attention_masks, labels)
dataloader = DataLoader(tensor_dataset, batch_size=batch_size, shuffle=True)

# Configure device (train in full precision; calling model.half() without a
# gradient scaler tends to produce NaN losses during training)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Set up optimizer
optimizer = AdamW(model.parameters(), lr=3e-5)

# Define accuracy calculation: next-token accuracy over non-padding positions
def calculate_accuracy(preds, labels):
    # Logits at position i predict the token at position i + 1, so shift by one
    pred_flat = np.argmax(preds[:, :-1, :], axis=-1).flatten()
    labels_flat = labels[:, 1:].flatten()
    mask = labels_flat != -100
    return np.sum(pred_flat[mask] == labels_flat[mask]) / max(mask.sum(), 1)

# Training loop (simplified)
model.train()
for epoch in range(3):  # Adjust the number of epochs as needed
    for batch in dataloader:
        batch = tuple(t.to(device) for t in batch)
        input_ids, attention_masks, labels = batch

        outputs = model(input_ids, attention_mask=attention_masks, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        preds = logits.detach().cpu().numpy()
        label_ids = labels.to('cpu').numpy()
        acc = calculate_accuracy(preds, label_ids)

        print(f"Loss: {loss.item()}, Accuracy: {acc}")

print("Training complete!")
```
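After training, you will probably want to persist the fine-tuned weights so they can be reloaded later. The following is a minimal sketch that continues from the script above; the output directory name is just a placeholder.

```python
# Save the fine-tuned model and tokenizer ("neyabai-finetuned" is a placeholder path)
output_dir = "neyabai-finetuned"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Quick sanity check: reload the saved weights and generate a short continuation
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

reloaded_model = GPT2LMHeadModel.from_pretrained(output_dir).to(device)
reloaded_tokenizer = GPT2TokenizerFast.from_pretrained(output_dir)

inputs = reloaded_tokenizer("Your prompt goes here.", return_tensors="pt").to(device)
output_ids = reloaded_model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(reloaded_tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
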
## Notes

- **Dataset:** Replace the `dataset` variable with your actual dataset (see the loading sketch after this list).
- **Max Length:** Adjust the `max_length` parameter in `tokenize_function` as needed, based on the length of your input texts.
- **Batch Size and Learning Rate:** You may need to tune `batch_size` and the learning rate (`lr`) according to your dataset and hardware capabilities.
- **Epochs:** Adjust the number of epochs based on your convergence criteria.
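For example, if your training texts live in a plain-text file with one sample per line, the `dataset` list could be built like this ("my_dataset.txt" is just a placeholder):

```python
# Read one training example per line ("my_dataset.txt" is a placeholder path)
with open("my_dataset.txt", "r", encoding="utf-8") as f:
    dataset = [line.strip() for line in f if line.strip()]

print(f"Loaded {len(dataset)} examples")
```
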
## Acknowledgments

- This project uses the [Transformers](https://huggingface.co./transformers/) library by Hugging Face.
- Inspired by various fine-tuning examples and tutorials from the Hugging Face community.