Haiyang-W
/

TokenFormer-450M

PyTorch

Model card Files Files and versions Community

Haiyang-W commited on 7 days ago

Commit

4cbb72d

•

1 Parent(s): b0db186

Update README.md

Browse files

Files changed (1) hide show

README.md +70 -1

README.md CHANGED Viewed

@@ -15,7 +15,7 @@ same data, in the exact same order.
 ## Model Details
 - Developed by: [Haiyang Wang](https://haiyang-w.github.io/)
-- Model type: ToeknFormer-based Language Model
 - Language: English
 - Learn more: [TokenFormer's GitHub repository](https://github.com/Haiyang-W/TokenFormer)
  for training procedure, config files, and details on how to use.
@@ -24,3 +24,72 @@ same data, in the exact same order.
 - Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
 - License: Apache 2.0
 - Contact: to ask questions about this model, please email Haiyang Wang.

 ## Model Details
 - Developed by: [Haiyang Wang](https://haiyang-w.github.io/)
+- Model type: TokenFormer-based Language Model
 - Language: English
 - Learn more: [TokenFormer's GitHub repository](https://github.com/Haiyang-W/TokenFormer)
  for training procedure, config files, and details on how to use.
 - Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
 - License: Apache 2.0
 - Contact: to ask questions about this model, please email Haiyang Wang.
+<figure>
+| TokenFormer model | Layers | #QKV Param Tokens | #Output Param Tokens | #FFN Param Tokens | Model Dim | Heads | Batch Size | Learning Rate         | Training Iterations         |
+| ----------------: | -----: | :---------------: | :------------------: | :---------------: | :-------: | :---: | :--------: | :-------------------: | :-------------------------: |
+| 150M              | 12     | 768               | 768                  | 3072              | 768       | 12    | 2M         | 6.0 x 10<sup>-4</sup> | 143000                      |
+| 450M              | 24     | 1024              | 1024                 | 4096              | 1024      | 16    | 2M         | 6.0 x 10<sup>-4</sup> | 143000                      |
+| 900M              | 32     | 1280              | 1280                 | 5120              | 1280      | 16    | 2M         | 6.0 x 10<sup>-4</sup> | 143000                      |
+| 1.5B              | 40     | 1536              | 1536                 | 6144              | 1536      | 16    | 2M         | 6.0 x 10<sup>-4</sup> | 143000                      |
+<figcaption>Engineering details for the <i>TokenFormer</i>. </figcaption>
+</figure>
+## Training
+### Training data
+[The Pile](https://pile.eleuther.ai/) is a 825GiB general-purpose dataset in
+English. It was created by EleutherAI specifically for training large language
+models. It contains texts from 22 diverse sources, roughly broken down into
+five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
+prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
+miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
+paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
+methodology, and a discussion of ethical implications. Consult [the
+datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
+about the Pile and its component datasets. The Pile can be downloaded from
+the [official website](https://pile.eleuther.ai/), or from a [community
+mirror](https://the-eye.eu/public/AI/pile/).<br>
+### Training procedure
+We follow the default training strategy of [Pythia](https://arxiv.org/abs/2304.01373) in [gpt-neox](https://github.com/EleutherAI/gpt-neox),
+including the dataset processing, hyper-parameter and code base.
+All models were trained on the exact same data, in the exact same order. Each
+model saw 299,892,736,000 tokens during training.
+All *TokenFormer* models trained for 143000 steps at a batch size
+of 2M (2,097,152 tokens).<br>
+See [GitHub](https://github.com/Haiyang-W/TokenFormer) for more details on training
+ procedure.<br>
+TokenFormer uses the same tokenizer as [GPT-NeoX-
+20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
+## Evaluations
+All 16 *TokenFormer* models were evaluated using the [LM Evaluation
+Harness](https://github.com/EleutherAI/lm-evaluation-harness).
+You can run the evaluation with our [instruction](https://github.com/Haiyang-W/TokenFormer?tab=readme-ov-file#evaluations).<br>
+Expand the sections below to see plots of evaluation results for all
+TokenFormer compared with Opensource Transformer-based LLMs.
+<figure>
+| Model        | #Param   |  LAMBADA | HellaSwag | PIQA | Arc-E  | Arc-C | WinoGrande | Average  |
+| :----:       | :------: | :------: | :-------: | :--: | :---:  | :---: | :--------: | :------: |
+| Pythia       |    150M  | 35.4     |     30.3  | 62.3 |  43.6  | 23.6  |  51.3      |   40.1   |
+| TokenFormer  |    150M  | 45.0     |     35.5  | 64.9 |  47.3  | 24.9  |  50.4      |   44.7   |
+| Pythia       |    410M  | 51.4     |     40.6  | 66.9 |  52.1  | 24.6  |  53.8      |   48.2   |
+| TokenFormer  |    450M  | 57.3     |     47.5  | 69.5 |  56.2  | 26.7  |  54.6      |   52.0   |
+| Pythia       |    1B    | 56.1     |     47.2  | 70.7 |  57.0  | 27.1  |  53.5      |   51.9   |
+| TokenFormer  |    900M  | 64.0     |     55.3  | 72.4 |  59.9  | 30.6  |  56.4      |   56.4   |
+| GPT-Neo      |    1.3B  | 57.2     |     48.9  | 71.1 |  56.2  | 25.9  |  54.9      |   52.4   |
+| OPT          |    1.3B  | 58.0     |     53.7  | 72.4 |  56.7  | 29.6  |  59.5      |   55.0   |
+| Pythia       |    1.3B  | 61.7     |     52.1  | 71.0 |  60.5  | 28.5  |  57.2      |   55.2   |
+| GPT-Neo      |    2.7B  | 62.2     |     55.8  | 71.1 |  61.1  | 30.2  |  57.6      |   56.5   |
+| OPT          |    2.7B  | 63.6     |     60.6  | 74.8 |  60.8  | 31.3  |  61.0      |   58.7   |
+| Pythia       |    2.8B  | 64.7     |     59.3  | 74.0 |  64.1  | 32.9  |  59.7      |   59.1   |
+| TokenFormer  |    1.5B  | 64.7     |     60.0  | 74.8 |  64.8  | 32.0  |  59.7      |   59.3   |
+<figcaption>Zero-shot evaluation of Language Modeling. </figcaption>
+</figure>