Model description
This repository aims to re-create the GPT-1 architecture using HuggingFace's transformers library.
The model's original paper can be found here. The accompanying blog post is here. The original code and weights can be found here.
As noted in OpenAI's blog post, the original model was trained for 1 month on 8 GPUs (P600s) on the original BookCorpus dataset (around 7,000 books).
This model is instead trained on the BookCorpusOpen dataset, which contains 17,000 books (around 6 GB). The tokenized dataset (9 GB) can be found in data/ in this repository. The tokenizer is a BPE tokenizer with 40,000 vocabulary merges, as in the original paper. It is re-implemented using HuggingFace's tokenizers library and trained on the BookCorpusOpen dataset.
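As a rough illustration, training such a BPE tokenizer with the tokenizers library could look like the sketch below; the file paths and the `<unk>` special token are assumptions for illustration, not necessarily what this repository uses.

```python
# Minimal sketch: train a GPT-1-style BPE tokenizer with HuggingFace `tokenizers`.
# Paths and special tokens are assumptions, not the repo's actual configuration.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

# The paper specifies 40,000 merges; with this library the target is expressed
# as a vocabulary size (merges plus base characters and special tokens).
trainer = BpeTrainer(vocab_size=40_000, special_tokens=["<unk>"])

# Assumed location of raw BookCorpusOpen text files.
files = ["books/example.txt"]
tokenizer.train(files, trainer)
tokenizer.save("bpe-bookcorpusopen.json")
```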
How to use
See preprocessing.py for how the data was preprocessed and tokenized.
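For orientation, a minimal preprocessing sketch using the datasets library is shown below; the dataset identifier, tokenizer file, and output path are assumptions and may differ from what preprocessing.py actually does.

```python
# Minimal sketch: load BookCorpusOpen and tokenize it with the trained BPE tokenizer.
# Dataset name, tokenizer file, and output directory are illustrative assumptions.
from datasets import load_dataset
from tokenizers import Tokenizer

dataset = load_dataset("bookcorpusopen", split="train")
tokenizer = Tokenizer.from_file("bpe-bookcorpusopen.json")

def tokenize(batch):
    # Encode a batch of book texts into token ids.
    return {"input_ids": [enc.ids for enc in tokenizer.encode_batch(batch["text"])]}

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
tokenized.save_to_disk("data/")
```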
See pre_training.py for how the model was pre-trained.
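As a rough sketch, a causal language-modelling pre-training run with transformers' built-in GPT-1 classes and Trainer could look like the following; pre_training.py may implement the model and training loop differently, and the hyperparameters below are illustrative assumptions only.

```python
# Minimal sketch: pre-train a GPT-1-sized causal LM with the transformers Trainer.
# Not the repo's actual training code; hyperparameters and paths are assumptions.
from datasets import load_from_disk
from transformers import (
    OpenAIGPTConfig,
    OpenAIGPTLMHeadModel,
    Trainer,
    TrainingArguments,
)

# GPT-1-sized model; vocab_size should match the trained tokenizer
# (40,000 merges plus base characters and special tokens).
config = OpenAIGPTConfig(vocab_size=40_000, n_positions=512,
                         n_embd=768, n_layer=12, n_head=12)
model = OpenAIGPTLMHeadModel(config)

# Assumes the tokenized dataset was already chunked into fixed-length blocks
# of at most 512 token ids (see the preprocessing sketch above).
dataset = load_from_disk("data/")
dataset = dataset.map(lambda ex: {"labels": ex["input_ids"]})

args = TrainingArguments(output_dir="checkpoints",
                         per_device_train_batch_size=8,
                         num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=dataset).train()
```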
See inference.py for an inference example.
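For a flavour of what inference looks like, here is a minimal generation sketch; the tokenizer file and checkpoint path are assumptions, and inference.py may load the model and generate text differently.

```python
# Minimal sketch: generate text from a trained checkpoint.
# The checkpoint directory and tokenizer file are assumed names.
import torch
from tokenizers import Tokenizer
from transformers import OpenAIGPTLMHeadModel

tokenizer = Tokenizer.from_file("bpe-bookcorpusopen.json")
model = OpenAIGPTLMHeadModel.from_pretrained("checkpoints")  # assumed saved checkpoint
model.eval()

prompt = "The house on the hill"
input_ids = torch.tensor([tokenizer.encode(prompt).ids])

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=40, do_sample=True, top_k=40)

print(tokenizer.decode(output[0].tolist()))
```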
Converted model
Inside gpt1-converted-weights/ are the original OpenAI weights converted to safetensors format, which can be used directly with the code in this repo. The conversion script and the original weights can also be found there.
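As an illustration, the converted weights could be inspected with the safetensors library as sketched below; the exact file name inside gpt1-converted-weights/ is an assumption.

```python
# Minimal sketch: load and inspect the converted weights with `safetensors`.
# The file name is assumed; check gpt1-converted-weights/ for the actual one.
from safetensors.torch import load_file

state_dict = load_file("gpt1-converted-weights/model.safetensors")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))

# The resulting state_dict can then be loaded into a compatible model, e.g.
# model.load_state_dict(state_dict)
```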