MarioBarbeque committed on
Commit 4425b35 · verified · 1 Parent(s): 86dd4bc

minor updates - still needs work

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -93,23 +93,23 @@ This code outputs the following:

### Training Data / Preprocessing

- The data used comes from the Stanford NLP 🤗 hub. The model card can be found [here](https://huggingface.co/datasets/stanfordnlp/imdb). This dataset is preprocessed in the
following way: The train and test splits are tokenized, concatenated, and chunked into chunks of 256 tokens. We subsequently load the training data into a `DataCollator` that
applies a custom random masking function when batching. We mask 15% of tokens in each chunk. The evaluation data is masked in its entirety, to remove randomness when evaluating,
and passed to a `DataCollator` with the default collating function.

### Training Procedure

- The model was trained locally on a single node with one 16GB Nvidia T4 using 🤗 Transformers, 🤗 Tokenizers, and a custom PyTorch training loop that made use of 🤗 Accelerate.

  #### Training Hyperparameters

- - **Precision:** We use FP32 precision, as follows immediately from the precision inherited from the original "DistilBERT/distilbert-base-uncased" model.
- **Optimizer:** `apex.optimizers.FusedAdam`, a fused-kernel version of the AdamW optimizer from Nvidia `apex`
- **Learning Rate:** We use a linear learning rate scheduler with an initial learning rate of 5e-5
- **Batch Size:** 32
- - **Number of Training Steps**: 2877 steps over the course of 3 epochs

  ## Evaluation / Metrics
@@ -127,7 +127,7 @@ Perplexity: [https://en.wikipedia.org/wiki/Perplexity](https://en.wikipedia.org/

#### Testing Data

- The IMDB dataset from Stanford NLP comes pre-split into training and testing data of 25k reviews each. Our preprocessing, which included the chunking of concatenated, tokenized inputs
into chunks of 256 tokens, increased these respective splits by approximately 5k records each. We apply a single masking function to the evaluation dataset before testing, as mentioned above.

  ### Results
 
### Training Data / Preprocessing

+ The data used comes from Google DeepMind and is hosted on the 🤗 hub. The dataset card can be found [here](https://huggingface.co/datasets/deepmind/mathematics). This dataset is preprocessed in the
following way: The train and test splits are tokenized, concatenated, and chunked into chunks of 256 tokens. We subsequently load the training data into a `DataCollator` that
applies a custom random masking function when batching. We mask 15% of tokens in each chunk. The evaluation data is masked in its entirety, to remove randomness when evaluating,
and passed to a `DataCollator` with the default collating function.
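
Below is a minimal sketch, under stated assumptions, of the concatenate-and-chunk step and of a 15% random-masking helper like the one described above; the names `group_texts` and `random_mask`, the batched `map` call, and the hand-rolled masking logic are illustrative and not taken from this repository.

```python
from itertools import chain

import torch

CHUNK_SIZE = 256  # chunk length used above


def group_texts(examples):
    # Concatenate every tokenized column, then split the result into
    # fixed-size chunks of CHUNK_SIZE tokens, dropping the remainder.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // CHUNK_SIZE) * CHUNK_SIZE
    result = {
        k: [v[i : i + CHUNK_SIZE] for i in range(0, total_length, CHUNK_SIZE)]
        for k, v in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Typically applied after tokenization with `tokenized_dataset.map(group_texts, batched=True)`.


def random_mask(input_ids: torch.Tensor, mask_token_id: int, prob: float = 0.15):
    # Randomly replace `prob` of the tokens with the mask token; only the masked
    # positions keep their original ids in `labels`, the rest become -100 so the
    # loss ignores them.
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < prob
    labels[~mask] = -100
    input_ids[mask] = mask_token_id
    return input_ids, labels
```

Applying a function like `random_mask` once to the whole evaluation split, rather than inside the collator, is what removes the batching-time randomness mentioned above.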
 
### Training Procedure

+ The model was trained locally on a single node with multiple Nvidia A100 GPUs using 🤗 Transformers, 🤗 Tokenizers, and a custom PyTorch training loop that made use of 🤗 Accelerate.
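
As a rough sketch of what such an Accelerate-driven loop can look like (the `model`, `optimizer`, `lr_scheduler`, and `train_dataloader` objects are assumed to already exist; this is not the exact loop used to train this model):

```python
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the single-node, multi-GPU setup from `accelerate launch`
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)

num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)  # replaces loss.backward() so gradients sync across GPUs
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
```

The script is then started with `accelerate launch`, which maps the same loop onto every GPU on the node.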
 
#### Training Hyperparameters

+ - **Precision:** We use FP32 precision, the same precision as the base "google/flan-t5-large" model.
- **Optimizer:** `apex.optimizers.FusedAdam`, a fused-kernel version of the AdamW optimizer from Nvidia `apex` (see the sketch after this list)
- **Learning Rate:** We use a linear learning rate scheduler with an initial learning rate of 5e-5
- **Batch Size:** 32
+ - **Number of Training Steps**: 2877 steps over the course of 3 epochs, followed by
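
A sketch of how an optimizer and scheduler with these settings could be built (it assumes `model` and a `train_dataloader` with batch size 32 already exist; `num_warmup_steps=0` is an assumption rather than a documented choice):

```python
from apex.optimizers import FusedAdam
from transformers import get_scheduler

optimizer = FusedAdam(model.parameters(), lr=5e-5)  # fused AdamW-style optimizer from Nvidia apex

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)  # ~2877 steps for this run
lr_scheduler = get_scheduler(
    "linear",                      # linear decay from the initial learning rate
    optimizer=optimizer,
    num_warmup_steps=0,            # assumed; no warmup is documented above
    num_training_steps=num_training_steps,
)
```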
 
  ## Evaluation / Metrics
 
#### Testing Data

+ The 1D Linear Algebra split of the Google DeepMind Mathematics dataset comes pre-split into training and evaluation data of 2M and 10k records, respectively. Our preprocessing, which included the chunking of concatenated, tokenized inputs
into chunks of 256 tokens, increased these respective splits by approximately 5k records each. We apply a single masking function to the evaluation dataset before testing, as mentioned above.
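
For reference, a sketch of how perplexity could be computed over this evaluation split, reusing the Accelerate objects from the training-loop sketch above (the variable names `model`, `eval_dataloader`, and `accelerator` are assumptions):

```python
import math

import torch

model.eval()
losses = []
for batch in eval_dataloader:
    with torch.no_grad():
        outputs = model(**batch)
    # Gather the per-batch loss from every GPU before averaging.
    losses.append(accelerator.gather(outputs.loss.repeat(batch["input_ids"].shape[0])))

mean_loss = torch.cat(losses).mean()
perplexity = math.exp(mean_loss)  # perplexity = exp(average cross-entropy loss)
```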
 
  ### Results