---
license: apache-2.0
language:
- en
metrics:
- accuracy
pipeline_tag: text-generation
---
# Using NeyabAI

## Direct Use
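
For direct text generation, a minimal sketch using the `transformers` pipeline (assuming the `XsoraS/NeyabAI` checkpoint referenced in the fine-tuning script below) is:

```python
from transformers import pipeline

# Load the checkpoint as a text-generation pipeline
generator = pipeline("text-generation", model="XsoraS/NeyabAI")

# Generate a short continuation for a prompt
result = generator("Artificial intelligence is", max_new_tokens=50, do_sample=True)
print(result[0]["generated_text"])
```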




## Fine-Tuning

This repository demonstrates how to fine-tune the NeyabAI (GPT-2) language model on a custom dataset using PyTorch and Hugging Face's Transformers library. The code provides an end-to-end example, from loading the dataset to training the model and evaluating its performance.

## Requirements

- Python 3.6+
- PyTorch
- Transformers (Hugging Face)
- NumPy

You can install the required packages using pip:
```bash
pip install torch transformers numpy
```

## Fine-Tuning Script
The following script outlines the steps for fine-tuning NeyabAI (GPT-2) on a custom dataset:
```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import torch
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch implementation
from torch.utils.data import DataLoader, TensorDataset
import numpy as np

# Load pre-trained model and tokenizer
model_name = "XsoraS/NeyabAI"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Example dataset
dataset = ["Your custom dataset goes here."]  # Replace with your actual dataset

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples, padding='max_length', truncation=True, max_length=512)

# Tokenize the dataset
tokenized_inputs = [tokenize_function(text) for text in dataset]
input_ids = [enc['input_ids'] for enc in tokenized_inputs]
attention_masks = [enc['attention_mask'] for enc in tokenized_inputs]

# Convert to torch tensors
input_ids = torch.tensor(input_ids)
attention_masks = torch.tensor(attention_masks)
labels = input_ids.clone()
labels[attention_masks == 0] = -100  # Mask padding positions so they are ignored by the loss

# Create DataLoader
batch_size = 8
dataset = TensorDataset(input_ids, attention_masks, labels)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Configure device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.train()  # Keep full precision here; for mixed precision prefer torch.cuda.amp over model.half()

# Set up optimizer
optimizer = AdamW(model.parameters(), lr=3e-5)

# Define accuracy calculation
def calculate_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=-1).flatten()
    labels_flat = labels.flatten()
    mask = labels_flat != -100  # Skip positions that are ignored by the loss (padding)
    return np.sum(pred_flat[mask] == labels_flat[mask]) / max(mask.sum(), 1)

# Training loop (simplified)
for epoch in range(3):  # Adjust the number of epochs as needed
    for batch in dataloader:
        batch = tuple(t.to(device) for t in batch)
        input_ids, attention_masks, labels = batch

        outputs = model(input_ids, attention_mask=attention_masks, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Shift so that position i is scored against the token at position i + 1 (causal LM)
        preds = logits[:, :-1, :].detach().cpu().numpy()
        label_ids = labels[:, 1:].cpu().numpy()
        acc = calculate_accuracy(preds, label_ids)

        print(f"Loss: {loss.item()}, Accuracy: {acc}")

print("Training complete!")
```
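
The loop above does not persist the fine-tuned weights. As a minimal follow-up sketch (the `./neyabai-finetuned` directory name is just a placeholder), you could save and later reload them with:

```python
# Save the fine-tuned weights and tokenizer (directory name is a placeholder)
output_dir = "./neyabai-finetuned"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Reload later for inference
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
model = GPT2LMHeadModel.from_pretrained(output_dir)
tokenizer = GPT2TokenizerFast.from_pretrained(output_dir)
```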

## Notes

- **Dataset:** Replace the `dataset` variable with your actual dataset; a sketch of loading one from a plain-text file follows this list.
- **Max Length:** Adjust the `max_length` parameter in the `tokenize_function` as needed based on the length of your input texts.
- **Batch Size and Learning Rate:** You may need to tune the `batch_size` and learning rate (`lr`) according to your dataset and hardware capabilities.
- **Epochs:** Adjust the number of epochs based on your convergence criteria.
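
For reference, a minimal sketch of loading a dataset from a plain-text file, assuming one training example per line (`data.txt` is a placeholder name):

```python
# Read one training example per line ("data.txt" is a placeholder path)
with open("data.txt", "r", encoding="utf-8") as f:
    dataset = [line.strip() for line in f if line.strip()]
```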

## Acknowledgments

- This project uses the [Transformers](https://huggingface.co/docs/transformers) library by Hugging Face.
- Inspired by various fine-tuning examples and tutorials from the Hugging Face community.