Model description
This repository aims to re-create the GPT-1 architecture using HuggingFace's transformers library.
The model's original paper can be found here. The accompanying blog post is here. The original code and weights can be found here.
As noted in OpenAI's blog post, the original model was trained for 1 month on 8 GPUs (P600s) on the original BookCorpus dataset (around 7,000 books).
This model is instead trained on the BookCorpusOpen dataset, which contains 17,000 books (around 6 GB). The tokenized dataset (9 GB) can be found in data/ in this repository. The tokenizer is a BPE tokenizer with 40,000 vocabulary merges, as in the original paper. It is re-implemented using HuggingFace's tokenizers library and trained on the BookCorpusOpen dataset.
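As a rough illustration, training such a BPE tokenizer with the tokenizers library could look like the sketch below; the file paths and the `<unk>` special token are assumptions for illustration, not necessarily what this repository uses.

```python
# Minimal sketch: train a GPT-1-style BPE tokenizer with HuggingFace `tokenizers`.
# Paths and special tokens are assumptions, not the repo's actual configuration.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

# The paper specifies 40,000 merges; with this library the target is expressed
# as a vocabulary size (merges plus base characters and special tokens).
trainer = BpeTrainer(vocab_size=40_000, special_tokens=["<unk>"])

# Assumed location of raw BookCorpusOpen text files.
files = ["books/example.txt"]
tokenizer.train(files, trainer)
tokenizer.save("bpe-bookcorpusopen.json")
```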
How to use
See preprocessing.py for how the data was preprocessed and tokenized.
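For orientation, a minimal preprocessing sketch using the datasets library is shown below; the dataset identifier, tokenizer file, and output path are assumptions and may differ from what preprocessing.py actually does.

```python
# Minimal sketch: load BookCorpusOpen and tokenize it with the trained BPE tokenizer.
# Dataset name, tokenizer file, and output directory are illustrative assumptions.
from datasets import load_dataset
from tokenizers import Tokenizer

dataset = load_dataset("bookcorpusopen", split="train")
tokenizer = Tokenizer.from_file("bpe-bookcorpusopen.json")

def tokenize(batch):
    # Encode a batch of book texts into token ids.
    return {"input_ids": [enc.ids for enc in tokenizer.encode_batch(batch["text"])]}

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
tokenized.save_to_disk("data/")
```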
See pre_training.py for how the model was pre-trained.
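As a rough sketch, a causal language-modelling pre-training run with transformers' built-in GPT-1 classes and Trainer could look like the following; pre_training.py may implement the model and training loop differently, and the hyperparameters below are illustrative assumptions only.

```python
# Minimal sketch: pre-train a GPT-1-sized causal LM with the transformers Trainer.
# Not the repo's actual training code; hyperparameters and paths are assumptions.
from datasets import load_from_disk
from transformers import (
    OpenAIGPTConfig,
    OpenAIGPTLMHeadModel,
    Trainer,
    TrainingArguments,
)

# GPT-1-sized model; vocab_size should match the trained tokenizer
# (40,000 merges plus base characters and special tokens).
config = OpenAIGPTConfig(vocab_size=40_000, n_positions=512,
                         n_embd=768, n_layer=12, n_head=12)
model = OpenAIGPTLMHeadModel(config)

# Assumes the tokenized dataset was already chunked into fixed-length blocks
# of at most 512 token ids (see the preprocessing sketch above).
dataset = load_from_disk("data/")
dataset = dataset.map(lambda ex: {"labels": ex["input_ids"]})

args = TrainingArguments(output_dir="checkpoints",
                         per_device_train_batch_size=8,
                         num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=dataset).train()
```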
See inference.py for an inference example.
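For a flavour of what inference looks like, here is a minimal generation sketch; the tokenizer file and checkpoint path are assumptions, and inference.py may load the model and generate text differently.

```python
# Minimal sketch: generate text from a trained checkpoint.
# The checkpoint directory and tokenizer file are assumed names.
import torch
from tokenizers import Tokenizer
from transformers import OpenAIGPTLMHeadModel

tokenizer = Tokenizer.from_file("bpe-bookcorpusopen.json")
model = OpenAIGPTLMHeadModel.from_pretrained("checkpoints")  # assumed saved checkpoint
model.eval()

prompt = "The house on the hill"
input_ids = torch.tensor([tokenizer.encode(prompt).ids])

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=40, do_sample=True, top_k=40)

print(tokenizer.decode(output[0].tolist()))
```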
Converted model
Inside gpt1-converted-weights/ are the original OpenAI weights converted to safetensors format, which can be used directly with the code in this repo. The conversion script and the original weights can also be found there.
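As an illustration, the converted weights could be inspected with the safetensors library as sketched below; the exact file name inside gpt1-converted-weights/ is an assumption.

```python
# Minimal sketch: load and inspect the converted weights with `safetensors`.
# The file name is assumed; check gpt1-converted-weights/ for the actual one.
from safetensors.torch import load_file

state_dict = load_file("gpt1-converted-weights/model.safetensors")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))

# The resulting state_dict can then be loaded into a compatible model, e.g.
# model.load_state_dict(state_dict)
```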