XsoraS committed on
Commit
557dda3
1 Parent(s): d93134c

readme updates.

Files changed (1)
  1. README.md +105 -1
README.md CHANGED
@@ -5,4 +5,108 @@ language:
  metrics:
  - accuracy
  pipeline_tag: text-generation
- ---
+ ---
+
+ # Fine-Tuning GPT-2 on a Custom Dataset
+
+ This repository demonstrates how to fine-tune the GPT-2 language model on a custom dataset using PyTorch and Hugging Face's Transformers library. The code provides an end-to-end example, from loading the dataset to training the model and evaluating its performance.
+
+ ## Requirements
+
+ - Python 3.6+
+ - PyTorch
+ - Transformers (Hugging Face)
+ - NumPy
+
+ You can install the required packages using pip:
+
+ ```bash
+ pip install torch transformers numpy
+ ```
+
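+ Once the dependencies are installed, you can sanity-check your setup by sampling from the checkpoint directly. This is a minimal sketch, assuming the `XsoraS/NeyabAI` checkpoint referenced in the fine-tuning script below is available on the Hugging Face Hub; the prompt and generation settings are only examples:
+
+ ```python
+ from transformers import pipeline
+
+ # Load the model and tokenizer into a text-generation pipeline
+ generator = pipeline("text-generation", model="XsoraS/NeyabAI")
+
+ # Generate a short continuation for an example prompt
+ output = generator("Once upon a time", max_new_tokens=50, do_sample=True)
+ print(output[0]["generated_text"])
+ ```
+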
+ ## Fine-Tuning Script
+
+ The following script outlines the steps for fine-tuning GPT-2 on a custom dataset:
+
+ ```python
+ from transformers import GPT2LMHeadModel, GPT2TokenizerFast
+ import torch
+ from torch.optim import AdamW
+ from torch.utils.data import DataLoader, TensorDataset
+ import numpy as np
+
+ # Load pre-trained model and tokenizer
+ model_name = "XsoraS/NeyabAI"
+ model = GPT2LMHeadModel.from_pretrained(model_name)
+ tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
+ tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
+
+ # Example dataset
+ dataset = ["Your custom dataset goes here."]  # Replace with your actual dataset
+
+ # Tokenization function
+ def tokenize_function(examples):
+     return tokenizer(examples, padding='max_length', truncation=True, max_length=400)
+
+ # Tokenize the dataset
+ tokenized_inputs = [tokenize_function(text) for text in dataset]
+ input_ids = [enc['input_ids'] for enc in tokenized_inputs]
+ attention_masks = [enc['attention_mask'] for enc in tokenized_inputs]
+
+ # Convert to torch tensors; for causal LM training the labels are the inputs themselves
+ input_ids = torch.tensor(input_ids)
+ attention_masks = torch.tensor(attention_masks)
+ labels = input_ids.clone()
+
+ # Create DataLoader
+ batch_size = 8
+ dataset = TensorDataset(input_ids, attention_masks, labels)
+ dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
+
+ # Configure device
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model.to(device)
+
+ # Optional: half precision saves GPU memory, but pure fp16 training can be unstable;
+ # consider mixed precision (torch.cuda.amp) for longer runs, or remove this step.
+ if device.type == "cuda":
+     model = model.half()
+
+ # Set up optimizer
+ optimizer = AdamW(model.parameters(), lr=3e-5)
+
+ # Rough token-level accuracy (ignores the causal shift and padding tokens)
+ def calculate_accuracy(preds, labels):
+     pred_flat = np.argmax(preds, axis=-1).flatten()
+     labels_flat = labels.flatten()
+     return np.sum(pred_flat == labels_flat) / len(labels_flat)
+
+ # Training loop (simplified)
+ model.train()
+ for epoch in range(3):  # Adjust the number of epochs as needed
+     for batch in dataloader:
+         batch = tuple(t.to(device) for t in batch)
+         input_ids, attention_masks, labels = batch
+
+         outputs = model(input_ids, attention_mask=attention_masks, labels=labels)
+         loss = outputs.loss
+         logits = outputs.logits
+
+         loss.backward()
+         optimizer.step()
+         optimizer.zero_grad()
+
+         preds = logits.detach().cpu().numpy()
+         label_ids = labels.to('cpu').numpy()
+         acc = calculate_accuracy(preds, label_ids)
+
+         print(f"Loss: {loss.item()}, Accuracy: {acc}")
+
+ print("Training complete!")
+ ```
+
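+ After training, you will usually want to persist the fine-tuned weights and sample from them. The snippet below is a minimal sketch that continues from the script above (it reuses `model`, `tokenizer`, and `device`); the output directory `./neyabai-finetuned` and the prompt are only examples:
+
+ ```python
+ # Save the fine-tuned model and tokenizer
+ output_dir = "./neyabai-finetuned"  # example path, choose your own
+ model.save_pretrained(output_dir)
+ tokenizer.save_pretrained(output_dir)
+
+ # Reload the fine-tuned checkpoint and generate from it
+ model = GPT2LMHeadModel.from_pretrained(output_dir).to(device)
+ model.eval()
+
+ prompt = "Once upon a time"
+ inputs = tokenizer(prompt, return_tensors="pt").to(device)
+ with torch.no_grad():
+     generated = model.generate(**inputs, max_new_tokens=50, do_sample=True, pad_token_id=tokenizer.eos_token_id)
+ print(tokenizer.decode(generated[0], skip_special_tokens=True))
+ ```
+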
+ ## Notes
+
+ - **Dataset:** Replace the `dataset` variable with your actual dataset (see the sketch after this list for one way to load it from a plain-text file).
+ - **Max Length:** Adjust the `max_length` parameter in `tokenize_function` as needed based on the length of your input texts.
+ - **Batch Size and Learning Rate:** You may need to tune the `batch_size` and learning rate (`lr`) according to your dataset and hardware capabilities.
+ - **Epochs:** Adjust the number of epochs based on your convergence criteria.
+
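+ A minimal sketch of loading a real dataset, assuming one training example per line in a UTF-8 text file; the filename `train.txt` is hypothetical and only for illustration:
+
+ ```python
+ # Read one training example per non-empty line
+ with open("train.txt", encoding="utf-8") as f:
+     dataset = [line.strip() for line in f if line.strip()]
+ ```
+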
+ ## Acknowledgments
+
+ - This project uses the [Transformers](https://huggingface.co/transformers/) library by Hugging Face.
+ - Inspired by various fine-tuning examples and tutorials from the Hugging Face community.