From zero to GPT-hero
A reading list to fully understand GPT (and GPT-2) and to be able to implement them from scratch.
Neural Machine Translation of Rare Words with Subword Units
Paper • 1508.07909
Note: Useful to get more insight into the tokenizer trained and used for GPT-2, which is a modified version of the BPE defined in this paper. Additionally, it's implemented in `tiktoken` at https://github.com/openai/tiktoken
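A minimal sketch of using the GPT-2 byte-level BPE through `tiktoken` (assuming `tiktoken` is installed; `"gpt2"` is the encoding name the library exposes for this tokenizer):

```python
import tiktoken

# GPT-2's byte-level BPE, as shipped by tiktoken.
enc = tiktoken.get_encoding("gpt2")

tokens = enc.encode("From zero to GPT-hero")
print(tokens)              # token ids (a short list of integers)
print(enc.decode(tokens))  # round-trips back to the original string
print(enc.n_vocab)         # 50257 for GPT-2
```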
Attention Is All You Need
Paper • 1706.03762
Note: Attention itself was introduced earlier, at https://arxiv.org/abs/1409.0473; this paper builds on it and proposes the Transformer, an encoder-decoder architecture. The whole architecture is interesting, but we'll transition to decoder-only architectures for GPT.
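To make the core mechanism concrete, here is a minimal scaled dot-product attention sketch in PyTorch; the tensor shapes and names are illustrative choices, not the paper's notation:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim); mask: bool, True = "may attend"
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Toy tensors just to check shapes: batch=1, heads=2, seq_len=4, head_dim=8.
q = k = v = torch.randn(1, 2, 4, 8)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([1, 2, 4, 8])
```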
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Paper • 1810.04805
Note: Interesting even though not directly comparable to GPT-2: BERT is built from encoder blocks and is not auto-regressive in nature; instead it uses context on both sides of a word to achieve better results. Still nice to read before GPT-2 to understand the differences and why it's been such a relevant architecture.
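A rough way to feel that difference in practice, assuming the `transformers` library and the public `bert-base-uncased` and `openai-community/gpt2` checkpoints: BERT fills in a masked token using context from both sides, while GPT-2 only continues text left to right:

```python
from transformers import pipeline

# BERT is trained with masked language modelling: it predicts a hidden token
# from context on BOTH sides of the mask.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The Transformer is an [MASK] architecture.")[0]["token_str"])

# GPT-2 is auto-regressive: it only looks LEFT and predicts the next token.
generate = pipeline("text-generation", model="openai-community/gpt2")
print(generate("The Transformer is an", max_new_tokens=5)[0]["generated_text"])
```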
Generating Wikipedia by Summarizing Long Sequences
Paper • 1801.10198
Note: Introduces the concept of decoder-only architectures, which was later adopted by GPT.
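A small sketch (in PyTorch, as an assumption about tooling) of the causal mask that decoder-only models apply so each position can only attend to itself and earlier positions:

```python
import torch

T = 5  # toy sequence length
# Lower-triangular mask: row i has True in columns 0..i only.
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
print(causal_mask.int())

# Applied to attention scores, disallowed positions are set to -inf
# so they get zero weight after the softmax.
scores = torch.randn(T, T)
scores = scores.masked_fill(~causal_mask, float("-inf"))
print(torch.softmax(scores, dim=-1))
```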
openai-community/gpt2
Text Generation
Note:
* GPT: "Improving Language Understanding by Generative Pre-Training" at https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
* GPT-2: "Language Models are Unsupervised Multitask Learners" at https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
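A minimal sketch of generating text with this checkpoint via the `transformers` pipeline API; the prompt, seed, and generation length are arbitrary choices for illustration:

```python
from transformers import pipeline, set_seed

set_seed(42)  # make the sampled continuation reproducible
generator = pipeline("text-generation", model="openai-community/gpt2")
out = generator("Attention is all you", max_new_tokens=20, do_sample=True)
print(out[0]["generated_text"])
```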
openai-community/gpt2-medium
Text Generation
openai-community/gpt2-large
Text Generation
openai-community/gpt2-xl
Text Generation
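A small sketch, assuming the `transformers` library, to compare the four GPT-2 checkpoints listed above by depth, number of heads, and hidden size, straight from their published configs:

```python
from transformers import AutoConfig

checkpoints = [
    "openai-community/gpt2",
    "openai-community/gpt2-medium",
    "openai-community/gpt2-large",
    "openai-community/gpt2-xl",
]
for name in checkpoints:
    cfg = AutoConfig.from_pretrained(name)
    print(f"{name}: n_layer={cfg.n_layer}, n_head={cfg.n_head}, n_embd={cfg.n_embd}")
```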