|
--- |
|
license: mit |
|
datasets: |
|
- Skylion007/openwebtext |
|
language: |
|
- en |
|
metrics: |
|
- perplexity |
|
pipeline_tag: text-generation |
|
--- |
|
# GPT-2 Mini |
|
|
|
A smaller GPT-2 model with only 39M parameters. It was pretrained on a subset of OpenWebText, the open-source reproduction of the dataset OpenAI used to pretrain the original GPT-2 models.
|
|
|
## Uses |
|
|
|
This model is intended mainly for research and education. Its small size allows for fast experiments in resource-limited settings, while still being able to generate complex and coherent text.
|
|
|
## Getting Started |
|
|
|
Use the code below to get started with the model: |
|
```py
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained("erwanf/gpt2-mini")
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("erwanf/gpt2-mini")

# Generate text
prompt = "Hello, I'm a language model,"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = model.generate(input_ids, do_sample=True, max_length=50, num_return_sequences=5)
output_text = tokenizer.batch_decode(output, skip_special_tokens=True)
print(output_text)
```
|
|
|
Output: |
|
```
["Hello, I'm a language model, I can't be more efficient in words.\n\nYou can use this as a point to find out the next bit in your system, and learn more about me.\n\nI think a lot of the",
"Hello, I'm a language model, my teacher is a good teacher - a good school teacher – and one thing you have to remember:\n\nIt's not perfect. A school is not perfect; it isn't perfect at all!\n\n",
'Hello, I\'m a language model, but if I can do something for you then go for it (for a word). Here is my blog, the language:\n\nI\'ve not used "normal" in English words, but I\'ve always',
'Hello, I\'m a language model, I\'m talking to you the very first time I used a dictionary and it can be much better than one word in my dictionary. What would an "abnormal" English dictionary have to do with a dictionary and',
'Hello, I\'m a language model, the most powerful representation of words and phrases in the language I\'m using."\n\nThe new rules change that makes it much harder for people to understand a language that does not have a native grammar (even with']
```
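
Since perplexity is listed as the evaluation metric for this model, here is a minimal sketch of how it could be computed with this checkpoint, reusing the `model` and `tokenizer` loaded above. The sample text is purely illustrative and is not part of the original evaluation.

```py
import torch

# Illustrative sample text (not from the evaluation set)
text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt")

# With labels, the model returns the mean next-token cross-entropy loss;
# perplexity is its exponential.
with torch.no_grad():
    out = model(**enc, labels=enc["input_ids"])

print(f"Perplexity: {torch.exp(out.loss).item():.2f}")
```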
|
|
|
## Training Details |
|
|
|
The architecture follows GPT-2, with smaller dimensions and fewer layers, and uses the same tokenizer as GPT-2. We used the first 2M rows of the OpenWebText dataset, holding out 1k rows for the validation and test sets.
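
For reference, the split described above could be reproduced along the lines below with the `datasets` library. The seed and the exact held-out sizes (1k rows each for validation and test) are assumptions, and the preprocessing that turns raw rows into the tokenized training samples counted in the table below is not shown.

```py
from datasets import load_dataset

# First 2M rows of OpenWebText, as described above
ds = load_dataset("Skylion007/openwebtext", split="train[:2000000]")

# Hold out 1k rows each for validation and test (sizes and seed are assumptions)
held_out = ds.train_test_split(test_size=2_000, seed=42)
eval_splits = held_out["test"].train_test_split(test_size=0.5, seed=42)

train_ds = held_out["train"]
val_ds = eval_splits["train"]
test_ds = eval_splits["test"]
```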
|
|
|
### Hyperparameters |
|
|
|
| **Hyperparameter**        | **Value**   |
|---------------------------|-------------|
| **Model Parameters**      |             |
| Vocabulary Size           | 50,257      |
| Context Length            | 512         |
| Number of Layers          | 4           |
| Hidden Size               | 512         |
| Number of Attention Heads | 8           |
| Intermediate Size         | 2048        |
| Activation Function       | GELU        |
| Dropout                   | No          |
| **Training Parameters**   |             |
| Learning Rate             | 5e-4        |
| Batch Size                | 256         |
| Optimizer                 | AdamW       |
| beta1                     | 0.9         |
| beta2                     | 0.98        |
| Weight Decay              | 0.1         |
| Training Steps            | 100,000     |
| Warmup Steps              | 4,000       |
| Learning Rate Scheduler   | Cosine      |
| Training Dataset Size     | 1M samples  |
| Validation Dataset Size   | 1k samples  |
| Float Type                | bf16        |
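
To illustrate the model parameters above, here is a sketch of how an equivalent architecture could be instantiated with the `GPT2Config` class from `transformers`. Details such as the exact GELU variant are assumptions; to load the actual trained weights, use `AutoModelForCausalLM.from_pretrained("erwanf/gpt2-mini")` as shown in Getting Started.

```py
from transformers import GPT2Config, GPT2LMHeadModel

# Architecture from the table above; the released checkpoint's exact config
# may differ in details such as the GELU variant.
config = GPT2Config(
    vocab_size=50257,
    n_positions=512,              # context length
    n_embd=512,                   # hidden size
    n_layer=4,
    n_head=8,
    n_inner=2048,                 # intermediate (MLP) size
    activation_function="gelu_new",
    resid_pdrop=0.0,              # dropout disabled
    embd_pdrop=0.0,
    attn_pdrop=0.0,
)

model = GPT2LMHeadModel(config)

# Should print roughly 39M parameters
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
```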
|
|
|
|