Haiyang-W committed on
Commit 4cbb72d
1 Parent(s): b0db186

Update README.md

Files changed (1)
1. README.md +70 -1
README.md CHANGED
@@ -15,7 +15,7 @@ same data, in the exact same order.
  ## Model Details
 
  - Developed by: [Haiyang Wang](https://haiyang-w.github.io/)
- - Model type: ToeknFormer-based Language Model
+ - Model type: TokenFormer-based Language Model
  - Language: English
  - Learn more: [TokenFormer's GitHub repository](https://github.com/Haiyang-W/TokenFormer)
  for training procedure, config files, and details on how to use.
@@ -24,3 +24,72 @@ same data, in the exact same order.
  - Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
  - License: Apache 2.0
  - Contact: to ask questions about this model, please email Haiyang Wang.
+
+ <figure>
+
+ | TokenFormer model | Layers | #QKV Param Tokens | #Output Param Tokens | #FFN Param Tokens | Model Dim | Heads | Batch Size (tokens) | Learning Rate | Training Iterations |
+ | ----------------: | -----: | :---------------: | :------------------: | :---------------: | :-------: | :---: | :-----------------: | :-------------------: | :-----------------: |
+ | 150M | 12 | 768 | 768 | 3072 | 768 | 12 | 2M | 6.0 x 10<sup>-4</sup> | 143,000 |
+ | 450M | 24 | 1024 | 1024 | 4096 | 1024 | 16 | 2M | 6.0 x 10<sup>-4</sup> | 143,000 |
+ | 900M | 32 | 1280 | 1280 | 5120 | 1280 | 16 | 2M | 6.0 x 10<sup>-4</sup> | 143,000 |
+ | 1.5B | 40 | 1536 | 1536 | 6144 | 1536 | 16 | 2M | 6.0 x 10<sup>-4</sup> | 143,000 |
+ <figcaption>Engineering details for the <i>TokenFormer</i> models.</figcaption>
+ </figure>
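For readers who want these settings in code, the table can be restated as a small Python structure. This is purely illustrative: the field names below are assumptions, not the names used in the repository's GPT-NeoX-style config files.

```python
# Architecture settings from the table above, keyed by model size.
# Field names are illustrative; see the TokenFormer repository for the
# actual GPT-NeoX-style configuration files.
TOKENFORMER_CONFIGS = {
    "150M": dict(layers=12, qkv_param_tokens=768, output_param_tokens=768,
                 ffn_param_tokens=3072, model_dim=768, heads=12),
    "450M": dict(layers=24, qkv_param_tokens=1024, output_param_tokens=1024,
                 ffn_param_tokens=4096, model_dim=1024, heads=16),
    "900M": dict(layers=32, qkv_param_tokens=1280, output_param_tokens=1280,
                 ffn_param_tokens=5120, model_dim=1280, heads=16),
    "1.5B": dict(layers=40, qkv_param_tokens=1536, output_param_tokens=1536,
                 ffn_param_tokens=6144, model_dim=1536, heads=16),
}

# The optimization schedule is shared across sizes: 2M-token batches,
# a 6.0e-4 learning rate, and 143,000 training iterations.
TRAINING_SCHEDULE = dict(batch_size_tokens=2_097_152,
                         learning_rate=6.0e-4,
                         iterations=143_000)
```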
+
+ ## Training
+
+ ### Training data
+
+ [The Pile](https://pile.eleuther.ai/) is an 825GiB general-purpose dataset in
+ English. It was created by EleutherAI specifically for training large language
+ models. It contains texts from 22 diverse sources, roughly broken down into
+ five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
+ prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
+ miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
+ paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
+ methodology, and a discussion of ethical implications. Consult [the
+ datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
+ about the Pile and its component datasets. The Pile can be downloaded from
+ the [official website](https://pile.eleuther.ai/), or from a [community
+ mirror](https://the-eye.eu/public/AI/pile/).<br>
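If you download the Pile from the mirror, the shards are zstandard-compressed JSON Lines files. The sketch below shows one way to stream documents from a local shard; the file name and the `"text"`/`"meta"` field layout are assumptions based on the public release, not something this model card specifies.

```python
import json

import zstandard as zstd  # pip install zstandard


def iter_pile_documents(path="00.jsonl.zst"):
    """Yield (text, meta) pairs from a locally downloaded Pile shard."""
    with open(path, "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        buffer = b""
        while True:
            chunk = reader.read(1 << 20)  # decompress roughly 1 MiB at a time
            if not chunk:
                break
            buffer += chunk
            # Keep any trailing partial line in the buffer for the next chunk.
            *lines, buffer = buffer.split(b"\n")
            for line in lines:
                if line.strip():
                    doc = json.loads(line)
                    yield doc["text"], doc.get("meta", {})


if __name__ == "__main__":
    # Peek at the first document of a local shard.
    text, meta = next(iter_pile_documents())
    print(meta, text[:200])
```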
+
+ ### Training procedure
+ We follow the default training strategy of [Pythia](https://arxiv.org/abs/2304.01373) in [gpt-neox](https://github.com/EleutherAI/gpt-neox),
+ including the dataset processing, hyper-parameters, and code base.
+ All models were trained on the exact same data, in the exact same order. Each
+ model saw 299,892,736,000 tokens during training.
+
+ All *TokenFormer* models were trained for 143,000 steps at a batch size
+ of 2M (2,097,152 tokens).<br>
+ See [GitHub](https://github.com/Haiyang-W/TokenFormer) for more details on the training
+ procedure.<br>
+ TokenFormer uses the same tokenizer as
+ [GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
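As a sanity check on the numbers above, the total token count is just the batch size times the step count, and the shared tokenizer can be pulled from the Hugging Face Hub with `transformers`. This sketch only verifies the arithmetic and loads the tokenizer; it does not load a TokenFormer checkpoint.

```python
from transformers import AutoTokenizer

# 2M-token batches for 143,000 steps reproduce the reported training-token total.
tokens_per_step = 2_097_152
steps = 143_000
assert tokens_per_step * steps == 299_892_736_000

# TokenFormer reuses the GPT-NeoX-20B BPE tokenizer.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
print(tokenizer("TokenFormer treats model parameters as tokens.").input_ids)
```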
+
+ ## Evaluations
+
+ All *TokenFormer* models were evaluated using the [LM Evaluation
+ Harness](https://github.com/EleutherAI/lm-evaluation-harness).
+ You can reproduce the evaluation by following our [instructions](https://github.com/Haiyang-W/TokenFormer?tab=readme-ov-file#evaluations).<br>
+ The table below compares zero-shot results for TokenFormer with open-source
+ Transformer-based LLMs; a reproduction sketch follows the table.
+
+ <figure>
+
+ | Model | #Param | LAMBADA | HellaSwag | PIQA | Arc-E | Arc-C | WinoGrande | Average |
+ | :----: | :------: | :------: | :-------: | :--: | :---: | :---: | :--------: | :------: |
+ | Pythia | 150M | 35.4 | 30.3 | 62.3 | 43.6 | 23.6 | 51.3 | 40.1 |
+ | TokenFormer | 150M | 45.0 | 35.5 | 64.9 | 47.3 | 24.9 | 50.4 | 44.7 |
+ | Pythia | 410M | 51.4 | 40.6 | 66.9 | 52.1 | 24.6 | 53.8 | 48.2 |
+ | TokenFormer | 450M | 57.3 | 47.5 | 69.5 | 56.2 | 26.7 | 54.6 | 52.0 |
+ | Pythia | 1B | 56.1 | 47.2 | 70.7 | 57.0 | 27.1 | 53.5 | 51.9 |
+ | TokenFormer | 900M | 64.0 | 55.3 | 72.4 | 59.9 | 30.6 | 56.4 | 56.4 |
+ | GPT-Neo | 1.3B | 57.2 | 48.9 | 71.1 | 56.2 | 25.9 | 54.9 | 52.4 |
+ | OPT | 1.3B | 58.0 | 53.7 | 72.4 | 56.7 | 29.6 | 59.5 | 55.0 |
+ | Pythia | 1.3B | 61.7 | 52.1 | 71.0 | 60.5 | 28.5 | 57.2 | 55.2 |
+ | GPT-Neo | 2.7B | 62.2 | 55.8 | 71.1 | 61.1 | 30.2 | 57.6 | 56.5 |
+ | OPT | 2.7B | 63.6 | 60.6 | 74.8 | 60.8 | 31.3 | 61.0 | 58.7 |
+ | Pythia | 2.8B | 64.7 | 59.3 | 74.0 | 64.1 | 32.9 | 59.7 | 59.1 |
+ | TokenFormer | 1.5B | 64.7 | 60.0 | 74.8 | 64.8 | 32.0 | 59.7 | 59.3 |
+ <figcaption>Zero-shot evaluation of Language Modeling.</figcaption>
+ </figure>
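A minimal sketch of re-scoring one checkpoint on these benchmarks with the harness's Python API. The checkpoint id, the `trust_remote_code` flag, and the task names here are assumptions; the linked instructions describe the authors' actual evaluation setup.

```python
import lm_eval  # pip install lm-eval

# Hypothetical checkpoint id; substitute the TokenFormer model you want to score.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Haiyang-W/TokenFormer-150M,trust_remote_code=True",
    tasks=["lambada_openai", "hellaswag", "piqa",
           "arc_easy", "arc_challenge", "winogrande"],
    num_fewshot=0,  # zero-shot, matching the table above
)

# Print the per-task metrics returned by the harness.
for task, metrics in results["results"].items():
    print(task, metrics)
```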