MarioBarbeque committed on
Commit 4425b35 · verified · 1 Parent(s): 86dd4bc

minor updates - still needs work

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -93,23 +93,23 @@ This code outputs the following:

### Training Data / Preprocessing

- The data used comes from the Stanford NLP 🤗 hub. The model card can be found [here](https://huggingface.co/datasets/stanfordnlp/imdb). This dataset is preprocessed in the
following way: The train and test splits are tokenized, concatenated, and chunked into chunks of 256 tokens. We subsequently load the training data into a `DataCollator` that
applies a custom random masking function when batching. We mask 15% of tokens in each chunk. The evaluation data is masked in its entirety, to remove randomness when evaluating,
and passed to a `DataCollator` with the default collating function.

### Training Procedure

- The model was trained locally on a single node with one 16GB Nvidia T4 using 🤗 Transformers, 🤗 Tokenizers, and a custom PyTorch training loop that made use of 🤗 Accelerate.

  #### Training Hyperparameters

- - **Precision:** We use FP32 precision, as follows immediately from the precision inherited from the original "DistilBERT/distilbert-base-uncased" model.
- **Optimizer:** `apex.optimizers.FusedAdam`, a fused-kernel version of the AdamW optimizer from Nvidia `apex`
- **Learning Rate:** We use a linear learning rate scheduler with an initial learning rate of 5e-5
- **Batch Size:** 32
- - **Number of Training Steps**: 2877 steps over the course of 3 epochs

  ## Evaluation / Metrics
@@ -127,7 +127,7 @@ Perplexity: [https://en.wikipedia.org/wiki/Perplexity](https://en.wikipedia.org/

#### Testing Data

- The IMDB dataset from Stanford NLP comes pre-split into training and testing data of 25k reviews each. Our preprocessing, which included the chunking of concatenated, tokenized inputs
into chunks of 256 tokens, increased these respective splits by approximately 5k records each. We apply a single masking function to the evaluation dataset before testing, as mentioned above.

  ### Results
 
### Training Data / Preprocessing

+ The data used comes from Google DeepMind and is hosted on the 🤗 hub. The dataset card can be found [here](https://huggingface.co/datasets/deepmind/mathematics). This dataset is preprocessed in the
following way: The train and test splits are tokenized, concatenated, and chunked into chunks of 256 tokens. We subsequently load the training data into a `DataCollator` that
applies a custom random masking function when batching. We mask 15% of tokens in each chunk. The evaluation data is masked in its entirety, to remove randomness when evaluating,
and passed to a `DataCollator` with the default collating function.
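
Below is a minimal sketch, under stated assumptions, of the concatenate-and-chunk step and of a 15% random-masking helper like the one described above; the names `group_texts` and `random_mask`, the batched `map` call, and the hand-rolled masking logic are illustrative and not taken from this repository.

```python
from itertools import chain

import torch

CHUNK_SIZE = 256  # chunk length used above


def group_texts(examples):
    # Concatenate every tokenized column, then split the result into
    # fixed-size chunks of CHUNK_SIZE tokens, dropping the remainder.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // CHUNK_SIZE) * CHUNK_SIZE
    result = {
        k: [v[i : i + CHUNK_SIZE] for i in range(0, total_length, CHUNK_SIZE)]
        for k, v in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Typically applied after tokenization with `tokenized_dataset.map(group_texts, batched=True)`.


def random_mask(input_ids: torch.Tensor, mask_token_id: int, prob: float = 0.15):
    # Randomly replace `prob` of the tokens with the mask token; only the masked
    # positions keep their original ids in `labels`, the rest become -100 so the
    # loss ignores them.
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < prob
    labels[~mask] = -100
    input_ids[mask] = mask_token_id
    return input_ids, labels
```

Applying a function like `random_mask` once to the whole evaluation split, rather than inside the collator, is what removes the batching-time randomness mentioned above.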
 
### Training Procedure

+ The model was trained locally on a single node with multiple Nvidia A100 GPUs using 🤗 Transformers, 🤗 Tokenizers, and a custom PyTorch training loop that made use of 🤗 Accelerate.
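
As a rough sketch of what such an Accelerate-driven loop can look like (the `model`, `optimizer`, `lr_scheduler`, and `train_dataloader` objects are assumed to already exist; this is not the exact loop used to train this model):

```python
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the single-node, multi-GPU setup from `accelerate launch`
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)

num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)  # replaces loss.backward() so gradients sync across GPUs
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
```

The script is then started with `accelerate launch`, which maps the same loop onto every GPU on the node.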
 
#### Training Hyperparameters

+ - **Precision:** We use FP32 precision, the same precision as the base "google/flan-t5-large" model.
- **Optimizer:** `apex.optimizers.FusedAdam`, a fused-kernel version of the AdamW optimizer from Nvidia `apex` (see the sketch after this list)
- **Learning Rate:** We use a linear learning rate scheduler with an initial learning rate of 5e-5
- **Batch Size:** 32
+ - **Number of Training Steps**: 2877 steps over the course of 3 epochs, followed by
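
A sketch of how an optimizer and scheduler with these settings could be built (it assumes `model` and a `train_dataloader` with batch size 32 already exist; `num_warmup_steps=0` is an assumption rather than a documented choice):

```python
from apex.optimizers import FusedAdam
from transformers import get_scheduler

optimizer = FusedAdam(model.parameters(), lr=5e-5)  # fused AdamW-style optimizer from Nvidia apex

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)  # ~2877 steps for this run
lr_scheduler = get_scheduler(
    "linear",                      # linear decay from the initial learning rate
    optimizer=optimizer,
    num_warmup_steps=0,            # assumed; no warmup is documented above
    num_training_steps=num_training_steps,
)
```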
 
  ## Evaluation / Metrics
 
#### Testing Data

+ The 1D Linear Algebra split of the Google DeepMind Mathematics dataset comes pre-split into training and evaluation data of 2M and 10k records, respectively. Our preprocessing, which included the chunking of concatenated, tokenized inputs
into chunks of 256 tokens, increased these respective splits by approximately 5k records each. We apply a single masking function to the evaluation dataset before testing, as mentioned above.
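
For reference, a sketch of how perplexity could be computed over this evaluation split, reusing the Accelerate objects from the training-loop sketch above (the variable names `model`, `eval_dataloader`, and `accelerator` are assumptions):

```python
import math

import torch

model.eval()
losses = []
for batch in eval_dataloader:
    with torch.no_grad():
        outputs = model(**batch)
    # Gather the per-batch loss from every GPU before averaging.
    losses.append(accelerator.gather(outputs.loss.repeat(batch["input_ids"].shape[0])))

mean_loss = torch.cat(losses).mean()
perplexity = math.exp(mean_loss)  # perplexity = exp(average cross-entropy loss)
```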
 
  ### Results