Update README.md
README.md
@@ -24,7 +24,7 @@ We follow the work of [Sanh et al. (2019)](https://arxiv.org/abs/1910.01108) and

To maximize VRAM utilization, we reached a combined batch size of 4096 samples by pairing a device batch size of 2 with 2048 gradient accumulation steps, at a context length of 2048 tokens and with both the teacher and student models in bf16 precision. This allowed us to use around 98.94% of the RTX 3060's 12 GB of VRAM during training.
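
As a rough illustration, the sketch below shows what one optimizer step with this gradient-accumulation pattern and bf16 autocast could look like in PyTorch. It is a minimal sketch, not the repository's actual training code: `ToyLM`, its sizes, the random batches, and all variable names are hypothetical stand-ins, and the distillation loss is simplified to a plain KL term between teacher and student token distributions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1024           # tiny stand-in vocabulary; the real tokenizer's is larger
CTX = 2048             # context length from the README
DEVICE_BATCH = 2       # device batch size from the README
ACCUM_STEPS = 2048     # gradient accumulation steps: 2 x 2048 = 4096 samples/step

class ToyLM(nn.Module):
    """Placeholder language model mapping token ids to vocabulary logits."""
    def __init__(self, dim):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, dim)
        self.head = nn.Linear(dim, VOCAB)
    def forward(self, ids):
        return self.head(self.emb(ids))

device = "cuda" if torch.cuda.is_available() else "cpu"
teacher = ToyLM(64).to(device).eval()  # frozen teacher
student = ToyLM(32).to(device)
# Constant 1e-4 learning rate, no scheduler (see the note below).
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

optimizer.zero_grad()
for micro_step in range(ACCUM_STEPS):  # one accumulation window = 2048 micro-batches
    ids = torch.randint(VOCAB, (DEVICE_BATCH, CTX), device=device)  # stand-in batch
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        with torch.no_grad():
            teacher_logits = teacher(ids)
        student_logits = student(ids)
        # Simplified distillation objective: match the teacher's token distributions.
        loss = F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )
    # Scale each micro-batch loss so gradients average over the accumulation window.
    (loss / ACCUM_STEPS).backward()
optimizer.step()
optimizer.zero_grad()
```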

Since the model trained for 64 optimizer steps, the training set totals approximately 537 million training tokens. All training samples were taken from [The Pile](https://arxiv.org/abs/2101.00027).
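
The arithmetic behind that figure can be checked directly from the numbers quoted above:

```python
# Sanity check of the token count: all constants come from the text above.
samples_per_step = 2 * 2048                # device batch x accumulation steps = 4096
tokens_per_step = samples_per_step * 2048  # x context length of 2048 tokens
total_tokens = tokens_per_step * 64        # x 64 optimizer steps
print(f"{total_tokens:,}")                 # 536,870,912, i.e. ~537M tokens
```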

-
+A learning rate of 1e-4 was used in this study, with no learning rate schedule.
### Evaluation