minor updates - still needs work
README.md CHANGED
@@ -93,23 +93,23 @@ This code outputs the following:
### Training Data / Preprocessing

-The data used comes from
+The data used comes from Google DeepMind and the 🤗 hub. The dataset card can be found [here](https://huggingface.co/datasets/deepmind/mathematics). This dataset is preprocessed in the
following way: The train and test splits are tokenized, concatenated, and split into chunks of 256 tokens. We subsequently load the training data into a `DataCollator` that
applies a custom random masking function when batching. We mask 15% of the tokens in each chunk. The evaluation data is masked in its entirety, to remove randomness when evaluating,
and passed to a `DataCollator` with the default collating function.
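
A minimal sketch of the preprocessing flow described above might look as follows. It assumes the data is loaded with 🤗 Datasets and exposes `question`/`answer` text fields; the `group_texts` helper and `masking_collator` below are illustrative stand-ins for the custom collator described above, not the code actually used for this model.

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

CHUNK_SIZE = 256   # chunk length described above
MASK_PROB = 0.15   # fraction of tokens masked in each chunk

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
raw = load_dataset("math_dataset", "algebra__linear_1d")  # assumed loading path

def tokenize(batch):
    # Tokenize the raw text; chunking happens in group_texts, so no truncation here.
    return tokenizer([q + " " + a for q, a in zip(batch["question"], batch["answer"])])

def group_texts(batch):
    # Concatenate all token ids in the batch, then split them into 256-token chunks.
    ids = sum(batch["input_ids"], [])
    total = (len(ids) // CHUNK_SIZE) * CHUNK_SIZE
    return {"input_ids": [ids[i : i + CHUNK_SIZE] for i in range(0, total, CHUNK_SIZE)]}

tokenized = raw.map(tokenize, batched=True, remove_columns=raw["train"].column_names)
chunked = tokenized.map(group_texts, batched=True, remove_columns=["attention_mask"])

def masking_collator(features):
    # Stand-in for the custom collator: randomly replace MASK_PROB of the tokens in
    # each chunk with a T5 sentinel token and keep the original ids as labels.
    input_ids = torch.tensor([f["input_ids"] for f in features])
    labels = input_ids.clone()
    sentinel = tokenizer.convert_tokens_to_ids("<extra_id_0>")
    input_ids[torch.rand(input_ids.shape) < MASK_PROB] = sentinel
    return {"input_ids": input_ids,
            "attention_mask": torch.ones_like(input_ids),
            "labels": labels}
```

The evaluation chunks, as noted above, would instead be masked once ahead of time in a deterministic pass over the dataset and then batched with the default collator.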

### Training Procedure

-The model was trained locally on a single-node with
+The model was trained locally on a single node with multiple Nvidia A100 GPUs using 🤗 Transformers, 🤗 Tokenizers, and a custom PyTorch training loop that made use of 🤗 Accelerate.

#### Training Hyperparameters

-- **Precision:** We use FP32 precision,
+- **Precision:** We use FP32 precision, the same precision as the base "google/flan-t5-large" model.
- **Optimizer:** `apex.optimizers.FusedAdam`, a fused kernel version of the AdamW optimizer from Nvidia `apex`
- **Learning Rate:** We use a linear learning rate scheduler with an initial learning rate of 5e-5
- **Batch Size:** 32
-- **Number of Training Steps**: 2877 steps over the course of 3 epochs
+- **Number of Training Steps**: 2877 steps over the course of 3 epochs, followed by
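
To make the list above concrete, here is a rough sketch of how a single-node 🤗 Accelerate loop with these hyperparameters could be wired together. `chunked` and `masking_collator` refer to the preprocessing sketch earlier; the loop structure is an assumption rather than the project's actual training script, and `FusedAdam` requires Nvidia `apex` to be installed.

```python
from accelerate import Accelerator
from apex.optimizers import FusedAdam
from torch.utils.data import DataLoader
from transformers import AutoModelForSeq2SeqLM, get_linear_schedule_with_warmup

accelerator = Accelerator()  # default settings keep the model in FP32, as noted above
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

train_loader = DataLoader(chunked["train"], batch_size=32, shuffle=True,
                          collate_fn=masking_collator)

num_epochs = 3
num_training_steps = num_epochs * len(train_loader)  # roughly the 2877 steps reported above
optimizer = FusedAdam(model.parameters(), lr=5e-5)   # fused AdamW-style optimizer from apex
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0,
                                            num_training_steps=num_training_steps)

model, optimizer, train_loader, scheduler = accelerator.prepare(
    model, optimizer, train_loader, scheduler)

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        loss = model(**batch).loss       # labels from the collator give the LM loss
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```

In practice a script like this would be started with `accelerate launch` so the loop runs data-parallel across the A100 GPUs mentioned above.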
## Evaluation / Metrics

@@ -127,7 +127,7 @@ Perplexity: [https://en.wikipedia.org/wiki/Perplexity](https://en.wikipedia.org/wiki/Perplexity)
#### Testing Data

-The
+The 1D Linear Algebra split of the Google DeepMind Mathematics dataset comes pre-split into training and evaluation data of 2M and 10k records, respectively. Our preprocessing, which included the chunking of concatenated, tokenized inputs
into chunks of 256 tokens, increased these respective splits by approximately 5k records each. We apply a single masking function to the evaluation dataset before testing, as mentioned above.
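
Since perplexity is the metric referenced above, a minimal sketch of how it could be computed from the pre-masked evaluation chunks is shown below; `eval_loader` is assumed to wrap those chunks with the default collator, and this is an illustration rather than the evaluation code actually used.

```python
import math
import torch

def evaluate_perplexity(model, eval_loader, accelerator):
    # Average the LM loss over the (already masked) evaluation batches and
    # exponentiate it: perplexity = exp(mean cross-entropy), lower is better.
    model.eval()
    losses = []
    for batch in eval_loader:
        with torch.no_grad():
            loss = model(**batch).loss
        # Repeat the scalar loss per example so gathering across GPUs stays weighted.
        losses.append(accelerator.gather(loss.repeat(batch["input_ids"].shape[0])))
    return math.exp(torch.cat(losses).mean().item())
```

Calling this after training gives the perplexity on the pre-masked evaluation set.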
### Results