crumb committed
Commit 5072135 · 1 Parent(s): 9196f12

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -24,7 +24,7 @@ We follow the work of [Sanh et al. (2019)](https://arxiv.org/abs/1910.01108) and
  In an effort to maximize VRAM utilization, to reach a combined batch size of 4096 samples we use a device batch size of 2 with 2048 gradient accumulation steps and a context length of 2048 tokens with both the teacher and student model in bf16 precision. This allowed us to utilize around 98.94% of the 12 gigabytes of VRAM that the RTX3060 card has during training.
  It also means our training set totals to approximately 537 million training tokens, as our model trained for 64 steps. All training samples were taken from [The Pile](https://arxiv.org/abs/2101.00027).
 
- In this study, a linearly decaying learning rate schedule was employed to improve the training process of the deep learning model. Specifically, the learning rate was initialized at 0, with a linear warmup to 1e-4 over 10% of the training data, and linearly decayed back to 0 over the course of the rest of the training process. A decaying learning rate schedule is commonly used in deep learning to help the model converge to a better solution. It starts with a higher learning rate to enable the model to make more significant coarse updates to the weights in the initial stages of training. As training progresses, the learning rate is gradually reduced to allow the model to make finer, more precise changes to it's weights.
+ A learning rate of 1e-4 was used in this study, with no learning rate schedule.
 
  ### Evaluation
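
For reference, the hyperparameters described in the diff above (device batch size 2, 2048 gradient accumulation steps, bf16 precision, 64 optimizer steps, and the constant 1e-4 learning rate introduced by this commit) could be expressed as follows in a Hugging Face Trainer-style configuration. This is a minimal sketch, not the authors' actual training script, and the output directory name is a placeholder:

```python
# Sketch of the training hyperparameters described in the README diff above.
# Assumption: a Hugging Face Trainer-style setup; not the authors' actual script.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="distil-checkpoint",     # placeholder path
    per_device_train_batch_size=2,      # device batch size of 2
    gradient_accumulation_steps=2048,   # 2 * 2048 = combined batch of 4096 samples
    max_steps=64,                       # model trained for 64 optimizer steps
    learning_rate=1e-4,                 # constant LR after this commit
    lr_scheduler_type="constant",       # no warmup or decay schedule
    warmup_steps=0,
    bf16=True,                          # teacher and student in bf16 precision
)

# Token budget implied by these settings:
# 4096 samples/step * 2048 tokens/sample * 64 steps = 536,870,912 ≈ 537M training tokens
```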