license: apache-2.0
TokenFormer is a fully attention-based architecture that unifies the computation of token-token and token-parameter interactions by relying entirely on the attention mechanism, which maximizes the flexibility of the neural network (see paper). The suite contains four models with 150M, 450M, 900M, and 1.5B parameters. Each model is trained with the gpt-neox code base on 300B tokens from the Pile. All four model sizes are trained on the exact same data, in the exact same order.
TokenFormer-450M
Model Details
- Developed by: Haiyang Wang
- Model type: TokenFormer-based Language Model
- Language: English
- Learn more: See TokenFormer's GitHub repository for the training procedure, config files, and usage details. See the paper for more evaluations and implementation details.
- Library: GPT-NeoX
- License: Apache 2.0
- Contact: to ask questions about this model, please email Haiyang Wang.
TokenFormer model | Layers | #QKV Param Tokens | #Output Param Tokens | #FFN Param Tokens | Model Dim | Heads | Batch Size | Learning Rate | Training Iterations |
---|---|---|---|---|---|---|---|---|---|
150M | 12 | 768 | 768 | 3072 | 768 | 12 | 2M | 6.0 × 10⁻⁴ | 143000 |
450M | 24 | 1024 | 1024 | 4096 | 1024 | 16 | 2M | 6.0 × 10⁻⁴ | 143000 |
900M | 32 | 1280 | 1280 | 5120 | 1280 | 16 | 2M | 6.0 × 10⁻⁴ | 143000 |
1.5B | 40 | 1536 | 1536 | 6144 | 1536 | 16 | 2M | 6.0 × 10⁻⁴ | 143000 |
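To make the "Param Tokens" columns more concrete, here is a minimal sketch of token-parameter attention in plain Python. It assumes that each layer holds a set of learnable key and value parameter tokens and that input tokens attend to them as they would to other tokens. The function names, the toy sizes, and the use of a standard scaled softmax are all assumptions for illustration; the normalization TokenFormer actually uses is defined in the paper and may differ.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def token_parameter_attention(inputs, key_params, value_params):
    """Attend from input tokens to learnable parameter tokens.

    inputs:       T x d_in   (input tokens act as queries)
    key_params:   n x d_in   (hypothetical key parameter tokens)
    value_params: n x d_out  (hypothetical value parameter tokens)
    Returns a T x d_out list of lists.
    """
    scale = 1.0 / math.sqrt(len(key_params[0]))
    d_out = len(value_params[0])
    outputs = []
    for x in inputs:
        # Similarity between this input token and every key parameter token.
        scores = [scale * sum(a * b for a, b in zip(x, k)) for k in key_params]
        # Plain softmax is a stand-in here, not the paper's normalization.
        weights = softmax(scores)
        # Weighted sum of the value parameter tokens.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, value_params)) for j in range(d_out)]
        )
    return outputs


# Toy sizes; the 450M row in the table lists 1024 parameter tokens at model dim 1024.
random.seed(0)
T, d_in, d_out, n = 3, 8, 8, 16
inputs = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(T)]
key_params = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(n)]
value_params = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(n)]

out = token_parameter_attention(inputs, key_params, value_params)
print(len(out), len(out[0]))  # 3 8
```

In this sketch the number of parameter tokens `n` is independent of the input and output widths, so the parameter-token set can in principle be resized without changing the token dimensions.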
Training
Training data
The Pile is an 825 GiB general-purpose dataset in
English. It was created by EleutherAI specifically for training large language
models. It contains texts from 22 diverse sources, roughly broken down into
five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
miscellaneous (e.g. GitHub, Enron Emails). See the Pile
paper for a breakdown of all data sources,
methodology, and a discussion of ethical implications. Consult the
datasheet for more detailed documentation
about the Pile and its component datasets. The Pile can be downloaded from
the official website, or from a community
mirror.
Training procedure
We follow the default Pythia training strategy in gpt-neox, including the dataset processing, hyperparameters, and code base. All models were trained on the exact same data, in the exact same order. Each model saw 299,892,736,000 tokens during training.
All TokenFormer models were trained for 143,000 steps at a batch size
of 2M (2,097,152 tokens).
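The total token count follows directly from the step count and batch size quoted above. A quick check, using only those two numbers (the variable names are illustrative):

```python
# Values taken from the training description above.
TRAINING_STEPS = 143_000
TOKENS_PER_BATCH = 2_097_152  # the "2M" batch size, which equals 2**21 tokens

total_tokens = TRAINING_STEPS * TOKENS_PER_BATCH

print(f"{total_tokens:,}")  # 299,892,736,000
```

This is where the "300B tokens" figure comes from: it is the exact product rounded to the nearest hundred billion.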
See GitHub for more details on training
procedure.
TokenFormer uses the same tokenizer as GPT-NeoX-20B.
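As a rough sketch, the GPT-NeoX-20B tokenizer can be loaded through the Hugging Face `transformers` library. This assumes `transformers` is installed and that you fetch the tokenizer from the `EleutherAI/gpt-neox-20b` Hub repository; the TokenFormer code base itself runs on GPT-NeoX and may load the tokenizer differently. The helper names `load_tokenizer` and `count_tokens` are hypothetical.

```python
# Hugging Face Hub ID of the GPT-NeoX-20B tokenizer. Loading from the Hub is an
# assumption made for this sketch, not the TokenFormer repository's own method.
TOKENIZER_ID = "EleutherAI/gpt-neox-20b"


def load_tokenizer():
    """Load the GPT-NeoX-20B tokenizer (downloads files on first use)."""
    # Imported lazily so the rest of the module works without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(TOKENIZER_ID)


def count_tokens(text, tokenizer):
    """Return the number of token IDs the tokenizer produces for `text`."""
    return len(tokenizer(text)["input_ids"])


# Example usage (requires network access the first time):
# tokenizer = load_tokenizer()
# print(count_tokens("Hello, TokenFormer!", tokenizer))
```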
Evaluations
All four TokenFormer models were evaluated using the LM Evaluation
Harness.
You can run the evaluation by following our instructions.
The table below compares the evaluation results of all TokenFormer
models with open-source Transformer-based LLMs.
Model | #Param | LAMBADA | HellaSwag | PIQA | Arc-E | Arc-C | WinoGrande | Average |
---|---|---|---|---|---|---|---|---|
Pythia | 150M | 35.4 | 30.3 | 62.3 | 43.6 | 23.6 | 51.3 | 40.1 |
TokenFormer | 150M | 45.0 | 35.5 | 64.9 | 47.3 | 24.9 | 50.4 | 44.7 |
Pythia | 410M | 51.4 | 40.6 | 66.9 | 52.1 | 24.6 | 53.8 | 48.2 |
TokenFormer | 450M | 57.3 | 47.5 | 69.5 | 56.2 | 26.7 | 54.6 | 52.0 |
Pythia | 1B | 56.1 | 47.2 | 70.7 | 57.0 | 27.1 | 53.5 | 51.9 |
TokenFormer | 900M | 64.0 | 55.3 | 72.4 | 59.9 | 30.6 | 56.4 | 56.4 |
GPT-Neo | 1.3B | 57.2 | 48.9 | 71.1 | 56.2 | 25.9 | 54.9 | 52.4 |
OPT | 1.3B | 58.0 | 53.7 | 72.4 | 56.7 | 29.6 | 59.5 | 55.0 |
Pythia | 1.3B | 61.7 | 52.1 | 71.0 | 60.5 | 28.5 | 57.2 | 55.2 |
GPT-Neo | 2.7B | 62.2 | 55.8 | 71.1 | 61.1 | 30.2 | 57.6 | 56.5 |
OPT | 2.7B | 63.6 | 60.6 | 74.8 | 60.8 | 31.3 | 61.0 | 58.7 |
Pythia | 2.8B | 64.7 | 59.3 | 74.0 | 64.1 | 32.9 | 59.7 | 59.1 |
TokenFormer | 1.5B | 64.7 | 60.0 | 74.8 | 64.8 | 32.0 | 59.7 | 59.3 |
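The card does not say how the Average column is computed. Assuming it is the unweighted mean of the six task scores rounded to one decimal place, the short check below reproduces the reported averages for the four TokenFormer rows. The scores are copied from the table; the names are illustrative.

```python
# Task scores for the TokenFormer rows, copied from the table above, in the
# order LAMBADA, HellaSwag, PIQA, Arc-E, Arc-C, WinoGrande.
TASK_SCORES = {
    "150M": [45.0, 35.5, 64.9, 47.3, 24.9, 50.4],
    "450M": [57.3, 47.5, 69.5, 56.2, 26.7, 54.6],
    "900M": [64.0, 55.3, 72.4, 59.9, 30.6, 56.4],
    "1.5B": [64.7, 60.0, 74.8, 64.8, 32.0, 59.7],
}

# Values from the Average column of the same rows.
REPORTED_AVERAGE = {"150M": 44.7, "450M": 52.0, "900M": 56.4, "1.5B": 59.3}


def mean_score(scores):
    """Unweighted mean of the task scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


for size, scores in TASK_SCORES.items():
    print(size, mean_score(scores), REPORTED_AVERAGE[size])
```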