avi-skowron
committed on
Commit
•
591f386
1
Parent(s):
17d09d9
Create README.md
Please note this is v1, it does not contain evaluations yet.
README.md
ADDED
@@ -0,0 +1,209 @@
---
language:
- en
tags:
- pytorch
- causal-lm
- pythia
license: apache-2.0
datasets:
- the_pile
---
12 |
+
|
13 |
+
The *Pythia Scaling Suite* is a collection of models developed to facilitate
|
14 |
+
interpretability research. It contains two sets of eight models of sizes
|
15 |
+
70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
|
16 |
+
models: one trained on the Pile, and one trained on the Pile after the dataset
|
17 |
+
has been globally deduplicated. All 8 model sizes are trained on the exact
|
18 |
+
same data, in the exact same order. All Pythia models are available
|
19 |
+
[on Hugging Face](https://huggingface.co/EleutherAI).
|
20 |
+
|
21 |
+
Some design choices were made for the sake of interpretability research and
|
22 |
+
to ensure consistency across all models. However, the Pythia models are
|
23 |
+
competitive with, or mildly outperform, other similar and same-sized models,
|
24 |
+
such as OPT and the GPT-Neo suite.
|
25 |
+
|
26 |
+
Please note that all models in the *Pythia* suite were re-named in January
|
27 |
+
2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
|
28 |
+
comparing the old and new names</a> is provided in this model card, together
|
29 |
+
with exact model parameter counts.
|
30 |
+
|
31 |
+
## Pythia-70M
|
32 |
+
|
33 |
+
### Model Details
|
34 |
+
|
35 |
+
- Developed by: [EleutherAI](http://eleuther.ai)
|
36 |
+
- Model type: Transformer-based Language Model
|
37 |
+
- Language: English
|
38 |
+
- Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
|
39 |
+
for training procedure, config files, and details on how to use.
|
40 |
+
- Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
|
41 |
+
- License: Apache 2.0
|
42 |
+
- Contact: to ask questions about this model, join the [EleutherAI
|
43 |
+
Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
|
44 |
+
Please read the existing *Pythia* documentation before asking about it in the
|
45 |
+
EleutherAI Discord. For general correspondence:
|
46 |
+
[[email protected]](mailto:[email protected]).
|
47 |
+
|
48 |
+
<figure>
|
49 |
+
|
| Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate         | Equivalent Models      |
| -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
| 70M          | 18,915,328           | 6      | 512       | 8     | 2M         | 1.0 x 10<sup>-3</sup> | —                      |
| 160M         | 85,056,000           | 12     | 768       | 12    | 4M         | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
| 410M         | 302,311,424          | 24     | 1024      | 16    | 4M         | 3.0 x 10<sup>-4</sup> | OPT-350M               |
| 1.0B         | 805,736,448          | 16     | 2048      | 8     | 2M         | 3.0 x 10<sup>-4</sup> | —                      |
| 1.4B         | 1,208,602,624        | 24     | 2048      | 16    | 4M         | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
| 2.8B         | 2,517,652,480        | 32     | 2560      | 32    | 2M         | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
| 6.9B         | 6,444,163,072        | 32     | 4096      | 32    | 2M         | 1.2 x 10<sup>-4</sup> | OPT-6.7B               |
| 12B          | 11,327,027,200       | 36     | 5120      | 40    | 2M         | 1.2 x 10<sup>-4</sup> | —                      |
60 |
+
<figcaption>Engineering details for the <i>Pythia Suite</i>. Deduped and
|
61 |
+
non-deduped models of a given size have the same hyperparameters. “Equivalent”
|
62 |
+
models have <b>exactly</b> the same architecture, and the same number of
|
63 |
+
non-embedding parameters.</figcaption>
|
64 |
+
</figure>
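
As a rough sanity check on the table, the sketch below compares the reported non-embedding parameter counts with the common rule of thumb of about 12 × layers × (model dim)² parameters for a Transformer. This rule ignores biases and layer norms. It is an assumption used here for illustration only, not the method used to produce the table, and the name `approx_non_embedding_params` is hypothetical.

```python
# (layers, model dim, reported non-embedding params), copied from the table.
PYTHIA_CONFIGS = {
    "70M": (6, 512, 18_915_328),
    "160M": (12, 768, 85_056_000),
    "410M": (24, 1024, 302_311_424),
    "1.0B": (16, 2048, 805_736_448),
    "1.4B": (24, 2048, 1_208_602_624),
    "2.8B": (32, 2560, 2_517_652_480),
    "6.9B": (32, 4096, 6_444_163_072),
    "12B": (36, 5120, 11_327_027_200),
}


def approx_non_embedding_params(layers: int, d_model: int) -> int:
    """Rule-of-thumb estimate: 12 * layers * d_model^2 (assumption, see above)."""
    return 12 * layers * d_model ** 2


# Relative gap between the estimate and the reported count, per model.
relative_errors = {}
for name, (layers, d_model, reported) in PYTHIA_CONFIGS.items():
    approx = approx_non_embedding_params(layers, d_model)
    relative_errors[name] = abs(approx - reported) / reported
    print(f"{name:>5}: approx {approx:,} vs reported {reported:,}")
```

The estimate lands within 1% of every reported count, which suggests that the width and depth columns account for nearly all non-embedding parameters.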
|
65 |
+
|
66 |
+
### Uses and Limitations
|
67 |
+
|
68 |
+
#### Intended Use
|
69 |
+
|
70 |
+
All Pythia models were developed specifically for research purposes. This
|
71 |
+
suite is intended to provide a controlled setting for performing scientific
|
72 |
+
experiments. To enable the study of how language models change over the course
|
73 |
+
of training, we provide 143 evenly spaced intermediate checkpoints per model.
|
74 |
+
These checkpoints are hosted on Hugging Face as branches. Note that branch
|
75 |
+
`step143000` corresponds exactly to the model checkpoint on the `main` branch
|
76 |
+
of each model.
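
As a minimal sketch of this checkpoint layout, the snippet below enumerates the branch names. It assumes the branches follow the `step<N>` pattern used in the Quickstart (`step3000`, `step143000`), with one checkpoint every 1000 steps. The helper `checkpoint_revisions` is a hypothetical name, not part of any library.

```python
def checkpoint_revisions(final_step: int = 143_000, interval: int = 1_000) -> list[str]:
    """Return the assumed branch names for evenly spaced checkpoints."""
    return [f"step{step}" for step in range(interval, final_step + 1, interval)]


revisions = checkpoint_revisions()
# Number of checkpoints, then the first, third, and last branch names.
print(len(revisions), revisions[0], revisions[2], revisions[-1])
```

Any of these names can then be passed as the `revision` argument when loading a model, as in the Quickstart.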
|
77 |
+
|
78 |
+
#### Out-of-scope use
|
79 |
+
|
80 |
+
Performance on NLP benchmarks is not a priority for *Pythia* models, although
|
81 |
+
their evaluation results are competitive with similarly sized language models,
|
82 |
+
such as those from the OPT and BLOOM suites.
|
83 |
+
|
84 |
+
Pythia-70M has not been fine-tuned for downstream tasks for which
|
85 |
+
language models are commonly deployed, such as writing genre prose,
|
86 |
+
or commercial chatbots. This means Pythia-70M will likely **not**
|
87 |
+
respond to a given prompt the way e.g. ChatGPT does. This is because, unlike
|
88 |
+
this model, ChatGPT was fine-tuned using Reinforcement Learning from Human
|
89 |
+
Feedback (RLHF) to better “understand” human instructions.
|
90 |
+
|
91 |
+
#### Limitations and biases
|
92 |
+
|
93 |
+
The core functionality of a large language model is to take a string of text
|
94 |
+
and predict the next token. The token deemed statistically most likely by the
|
95 |
+
model need not produce the most “accurate” text. Never rely on
|
96 |
+
Pythia-70M to produce factually accurate output.
|
97 |
+
|
98 |
+
This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
|
99 |
+
known to contain profanity and texts that are lewd or otherwise offensive.
|
100 |
+
See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
|
101 |
+
discussion of documented biases with regard to gender, religion, and race.
|
102 |
+
Pythia-70M may produce socially unacceptable or undesirable text,
|
103 |
+
*even if* the prompt itself does not include anything explicitly offensive.
|
104 |
+
|
105 |
+
If you plan on using text generated through, for example, the Hosted Inference
|
106 |
+
API, we recommend having a human curate the outputs of this language model
|
107 |
+
before presenting them to other people. Please inform your audience that the
|
108 |
+
text was generated by Pythia-70M.
|
109 |
+
|
110 |
+
### Quickstart
|
111 |
+
|
112 |
+
Pythia models can be loaded and used via the following code, demonstrated here
|
113 |
+
for the third `pythia-70m-deduped` checkpoint:
|
114 |
+
|
```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the weights saved at training step 3000 (the `step3000` branch).
model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
    cache_dir="./pythia-70m-deduped/step3000",
)

# Load the tokenizer from the same revision.
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
    cache_dir="./pythia-70m-deduped/step3000",
)

# Tokenize a prompt, generate a continuation, and decode it back to text.
inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
print(tokenizer.decode(tokens[0]))
```
134 |
+
|
135 |
+
Revision/branch `step143000` corresponds exactly to the model checkpoint on
|
136 |
+
the `main` branch of each model.
|
137 |
+
|
138 |
+
For more information on how to use all Pythia models, see [documentation on
|
139 |
+
GitHub](https://github.com/EleutherAI/pythia).
|
140 |
+
|
141 |
+
### Training
|
142 |
+
|
143 |
+
#### Training data
|
144 |
+
|
145 |
+
[The Pile](https://pile.eleuther.ai/) is an 825 GiB general-purpose dataset in
|
146 |
+
English. It was created by EleutherAI specifically for training large language
|
147 |
+
models. It contains texts from 22 diverse sources, roughly broken down into
|
148 |
+
five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
|
149 |
+
prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
|
150 |
+
miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
|
151 |
+
paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
|
152 |
+
methodology, and a discussion of ethical implications. Consult [the
|
153 |
+
datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
|
154 |
+
about the Pile and its component datasets. The Pile can be downloaded from
|
155 |
+
the [official website](https://pile.eleuther.ai/), or from a [community
|
156 |
+
mirror](https://the-eye.eu/public/AI/pile/).
|
157 |
+
|
158 |
+
The Pile was **not** deduplicated before being used to train Pythia-70M.
|
159 |
+
|
160 |
+
#### Training procedure
|
161 |
+
|
162 |
+
All models were trained on the exact same data, in the exact same order. Each
|
163 |
+
model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
|
164 |
+
model were saved every 2,097,152,000 tokens, spaced evenly throughout training.
|
165 |
+
This corresponds to training for just under 1 epoch on the Pile for
|
166 |
+
non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.
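
The token counts above are consistent with each other, as this short check shows. The function name `tokens_seen_at_checkpoint` is hypothetical and only serves the illustration.

```python
TOKENS_PER_CHECKPOINT = 2_097_152_000  # tokens between consecutive checkpoints
NUM_CHECKPOINTS = 143                  # checkpoints saved per model


def tokens_seen_at_checkpoint(index: int) -> int:
    """Tokens seen once the index-th checkpoint (1-based) has been saved."""
    if not 1 <= index <= NUM_CHECKPOINTS:
        raise ValueError("checkpoint index out of range")
    return index * TOKENS_PER_CHECKPOINT


total_tokens = tokens_seen_at_checkpoint(NUM_CHECKPOINTS)
print(f"{total_tokens:,}")  # 299,892,736,000
```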
|
167 |
+
|
168 |
+
All Pythia models trained for the equivalent of 143000 steps at a batch size
|
169 |
+
of 2,097,152 tokens. Two batch sizes were used: 2M and 4M. Models with a batch
|
170 |
+
size of 4M tokens were originally trained for 71500 steps instead, with
|
171 |
+
checkpoints every 500 steps. The checkpoints on Hugging Face are renamed for
|
172 |
+
consistency with all 2M batch models, so `step1000` is the first checkpoint
|
173 |
+
for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
|
174 |
+
`step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
|
175 |
+
(corresponding to 1000 “actual” steps).
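
The renaming can be sketched as a small conversion from a checkpoint's step number on Hugging Face to the step actually taken during training. The sketch assumes that "4M" means exactly twice the 2,097,152-token batch, which is consistent with 71500 steps covering the same number of tokens as 143000. The function `actual_training_step` is hypothetical.

```python
# Tokens per batch; the 4M value is an assumption (2 x 2,097,152).
BATCH_TOKENS = {"2M": 2_097_152, "4M": 4_194_304}


def actual_training_step(hub_step: int, batch_size: str) -> int:
    """Map a Hugging Face checkpoint step to the optimizer step in training."""
    scale = BATCH_TOKENS[batch_size] // BATCH_TOKENS["2M"]
    if hub_step % scale != 0:
        raise ValueError("hub_step does not correspond to a whole training step")
    return hub_step // scale


print(actual_training_step(1000, "4M"))  # e.g. pythia-1.4b: 500
print(actual_training_step(1000, "2M"))  # e.g. pythia-6.9b: 1000
```

Under this assumption both batch sizes reach the same token budget: 71500 × 4,194,304 = 143000 × 2,097,152 = 299,892,736,000.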
|
176 |
+
|
177 |
+
See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
|
178 |
+
procedure, including [how to reproduce
|
179 |
+
it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).
|
180 |
+
|
181 |
+
### Evaluations
|
182 |
+
|
183 |
+
All 16 *Pythia* models were evaluated using the [LM Evaluation
|
184 |
+
Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
|
185 |
+
the results by model and step at `results/json/*` in the [GitHub
|
186 |
+
repository](https://github.com/EleutherAI/pythia/tree/main/results/json).
|
187 |
+
|
188 |
+
February 2023 note: select evaluations and comparison with OPT and BLOOM
|
189 |
+
models will be added here at a later date.
|
190 |
+
|
191 |
+
### Naming convention and parameter count
|
192 |
+
|
193 |
+
Pythia models were re-named in January 2023. It is possible that the old
|
194 |
+
naming convention still persists in some documentation by accident. The
|
195 |
+
current naming convention (70M, 160M, etc.) is based on total parameter count.
|
196 |
+
|
197 |
+
<figure style="width:32em">
|
198 |
+
|
| current Pythia suffix | old suffix | total params   | non-embedding params |
| --------------------: | ---------: | -------------: | -------------------: |
| 70M                   | 19M        | 70,426,624     | 18,915,328           |
| 160M                  | 125M       | 162,322,944    | 85,056,000           |
| 410M                  | 350M       | 405,334,016    | 302,311,424          |
| 1B                    | 800M       | 1,011,781,632  | 805,736,448          |
| 1.4B                  | 1.3B       | 1,414,647,808  | 1,208,602,624        |
| 2.8B                  | 2.7B       | 2,775,208,960  | 2,517,652,480        |
| 6.9B                  | 6.7B       | 6,857,302,016  | 6,444,163,072        |
| 12B                   | 13B        | 11,846,072,320 | 11,327,027,200       |
209 |
+
</figure>
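
To illustrate why the two naming conventions differ so much for small models, the sketch below derives the embedding parameter count as total minus non-embedding parameters, using the figures from the table. The helper `embedding_params` is a hypothetical name introduced for this example.

```python
# (total params, non-embedding params), copied from the table above.
PARAMS = {
    "70M": (70_426_624, 18_915_328),
    "160M": (162_322_944, 85_056_000),
    "410M": (405_334_016, 302_311_424),
    "1B": (1_011_781_632, 805_736_448),
    "1.4B": (1_414_647_808, 1_208_602_624),
    "2.8B": (2_775_208_960, 2_517_652_480),
    "6.9B": (6_857_302_016, 6_444_163_072),
    "12B": (11_846_072_320, 11_327_027_200),
}


def embedding_params(suffix: str) -> int:
    """Embedding parameters, taken as total minus non-embedding parameters."""
    total, non_embedding = PARAMS[suffix]
    return total - non_embedding


for suffix, (total, _) in PARAMS.items():
    emb = embedding_params(suffix)
    print(f"{suffix:>5}: {emb:,} embedding params ({emb / total:.1%} of total)")
```

Embeddings make up most of the 70M model but only a small share of the 12B model, which is why the old suffix (19M) and the current one (70M) diverge most at the small end.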
|