readme updates.
README.md CHANGED
@@ -5,4 +5,108 @@ language:
metrics:
- accuracy
pipeline_tag: text-generation
---

# Fine-Tuning GPT-2 on a Custom Dataset

This repository demonstrates how to fine-tune the GPT-2 language model on a custom dataset using PyTorch and Hugging Face's Transformers library. The code provides an end-to-end example, from loading the dataset to training the model and evaluating its performance.
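
Since the model is published on the Hugging Face Hub as `XsoraS/NeyabAI` with the `text-generation` pipeline tag, the quickest way to try it is through the `pipeline` API. A minimal sketch (the prompt and generation settings are only examples):

```python
from transformers import pipeline

# Load the published checkpoint as a text-generation pipeline.
generator = pipeline("text-generation", model="XsoraS/NeyabAI")

# Example prompt; adjust max_new_tokens and sampling options to taste.
print(generator("Once upon a time", max_new_tokens=50)[0]["generated_text"])
```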

## Requirements

- Python 3.6+
- PyTorch
- Transformers (Hugging Face)
- NumPy

You can install the required packages using pip:

```bash
pip install torch transformers numpy
```

## Fine-Tuning Script

The following script outlines the steps for fine-tuning GPT-2 on a custom dataset:

```python
import torch
import numpy as np
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch implementation
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load pre-trained model and tokenizer
model_name = "XsoraS/NeyabAI"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Example dataset
dataset = ["Your custom dataset goes here."]  # Replace with your actual dataset

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples, padding='max_length', truncation=True, max_length=400)

# Tokenize the dataset
tokenized_inputs = [tokenize_function(text) for text in dataset]
input_ids = [enc['input_ids'] for enc in tokenized_inputs]
attention_masks = [enc['attention_mask'] for enc in tokenized_inputs]

# Convert to torch tensors
input_ids = torch.tensor(input_ids)
attention_masks = torch.tensor(attention_masks)
labels = input_ids.clone()  # pad positions are included in the loss; set them to -100 to ignore them

# Create DataLoader
batch_size = 8
dataset = TensorDataset(input_ids, attention_masks, labels)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Configure device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Avoid casting the whole model to fp16 with model.half(): pure fp16 training with a
# plain AdamW is unstable (gradients underflow); use torch.cuda.amp for mixed precision.

# Set up optimizer
optimizer = AdamW(model.parameters(), lr=3e-5)

# Define accuracy calculation
# Note: this is a rough token-level accuracy; it compares the argmax at each position
# with the label at the same position and also counts padding tokens.
def calculate_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=-1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Training loop (simplified)
model.train()
for epoch in range(3):  # Adjust the number of epochs as needed
    for batch in dataloader:
        batch = tuple(t.to(device) for t in batch)
        input_ids, attention_masks, labels = batch

        outputs = model(input_ids, attention_mask=attention_masks, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        preds = logits.detach().cpu().numpy()
        label_ids = labels.to('cpu').numpy()
        acc = calculate_accuracy(preds, label_ids)

        print(f"Loss: {loss.item()}, Accuracy: {acc}")

print("Training complete!")
```
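
After training, you will typically want to persist the fine-tuned weights so they can be reloaded later or pushed to the Hub. A minimal sketch; the output directory name is only an example:

```python
# Save the fine-tuned model and tokenizer ("./gpt2-finetuned" is an example path).
output_dir = "./gpt2-finetuned"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Reload later with:
# model = GPT2LMHeadModel.from_pretrained(output_dir)
# tokenizer = GPT2TokenizerFast.from_pretrained(output_dir)
```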

## Notes

- **Dataset:** Replace the `dataset` variable with your actual dataset; a sketch for loading one from a plain-text file follows this list.
- **Max Length:** Adjust the `max_length` parameter in `tokenize_function` based on the length of your input texts.
- **Batch Size and Learning Rate:** You may need to tune `batch_size` and the learning rate (`lr`) according to your dataset and hardware capabilities.
- **Epochs:** Adjust the number of epochs based on your convergence criteria.
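
A minimal sketch for the **Dataset** note, assuming a plain-text corpus with one training example per line (`data.txt` is a placeholder path):

```python
# Build `dataset` from your own corpus instead of the placeholder list.
# "data.txt" is a hypothetical file name; point it at your actual data.
with open("data.txt", encoding="utf-8") as f:
    dataset = [line.strip() for line in f if line.strip()]
```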

## Acknowledgments

- This project uses the [Transformers](https://huggingface.co/transformers/) library by Hugging Face.
- Inspired by various fine-tuning examples and tutorials from the Hugging Face community.