---
base_model:
- google/flan-t5-large
datasets:
- deepmind/math_dataset
language:
- en
library_name: transformers
metrics:
- exact_match
---

# Model Card for CyberSolve LinAlg 1.2

Welcome to the 🤖🧮 CyberSolve LinAlg 1.2 🧠📐 model card!

We introduce **CyberSolve LinAlg 1.2**, a text-to-text large language model trained to solve linear equations. Specifically, *CyberSolve LinAlg 1.2* is a downstream version of the *FLAN-T5 large* model, [google/flan-t5-large](https://huggingface.co./google/flan-t5-large), fine-tuned on the one-dimensional linear algebra split of the Google DeepMind mathematics dataset. The model weights of *CyberSolve LinAlg 1.2* are a further downstream checkpoint of the original *CyberSolve LinAlg 1.1* checkpoint, trained for additional epochs to improve model capability.

**Note**: This is currently the most capable version of CyberSolve LinAlg. See this model demoed in the [CyberSolve LinAlg 1.2 🤖 Space](https://huggingface.co./spaces/MarioBarbeque/CyberSolveLinAlg1.2).

## Model Details

### Model Description and Overview

To construct **CyberSolve LinAlg 1.2**, the *FLAN-T5 large* model is fine-tuned using a custom PyTorch training loop optimized for multiple Nvidia A100 GPUs. We perform supervised training of *FLAN-T5 large* on the *algebra__linear_1d* split of the Google DeepMind mathematics dataset, an open-source dataset from Google DeepMind available through the 🤗 hub at [deepmind/math_dataset](https://huggingface.co./datasets/deepmind/math_dataset). This large, programmatically generated dataset consists of mathematical problems and their solutions for a variety of tasks across unique mathematical disciplines.

In this preliminary family of CyberSolve models, we are specifically interested in understanding the ability of neural models to solve non-trivial mathematical tasks. As such, the CyberSolve **LinAlg 1.x** family of models is trained on a set of 2M simpler, one-dimensional linear equations. We preprocessed the data and simulated the training on a smaller, downsampled subset of the dataset before training for multiple epochs over the dataset's entirety. This model in particular has been trained for 2 additional epochs (limited only by funds) beyond the original *CyberSolve LinAlg 1.1* checkpoint. Version 1.2 is the most capable version of CyberSolve LinAlg, scoring a **90.75** exact match score on the evaluation set of 10k linear equations from the DeepMind *algebra__linear_1d* split. This is a non-trivial improvement over the exact match score of **86.56** attained by *CyberSolve LinAlg 1.1*.

- **Developed by:** John Graham Reynolds
- **Funded by:** Vanderbilt University
- **Model type:** Text-to-Text Generation
- **Language(s) (NLP):** English
- **Finetuned from model:** [google/flan-t5-large](https://huggingface.co./google/flan-t5-large)

### Model Source

- **Repository:** TODO

## Uses

### Direct Use

To effectively query the model's ability to solve linear equations, a string of the form `"Solve <equation> for <variable>."` should be tokenized and passed to the model's `generate` method. An example input string is `input_text = "Solve 24 = 1601*c - 1605*c for c."`. The model will attempt to solve the equation, outputting its prediction in a simple numeric format. See the example below.

## How to Use and Query the Model

Use the code below to get started with the model. Users pass an `input_text` string (again, of the form `input_text = "Solve 24 = 1601*c - 1605*c for c."`) which prompts the model to solve a one-dimensional linear equation.

Model prediction is significantly faster on a GPU, so it is best practice to use the `.to("cuda")` calls shown below to ensure that both the model and all input ids reside on the GPU. Furthermore, the FLAN-T5 architecture makes use of many normalization layers, as is common in transformer architectures. By default, CyberSolve uses the T5 model's `T5LayerNorm` Python class; we highly recommend that users install the Nvidia `Apex` package (or the ROCm build of `Apex` for AMD GPUs). Once Apex is installed, the model defaults to the `apex.normalization.FusedRMSNorm` class when computing its normalization layers. `FusedRMSNorm` uses an optimized fused kernel that is much faster than the standard `T5LayerNorm`, thereby significantly speeding up both inference and training.
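To confirm which normalization path will be used, a check like the following can help. This is a minimal sketch: the swap happens automatically inside the Transformers T5 implementation when it discovers Apex at import time, so the snippet only verifies that Apex is importable.

```python
# A minimal sketch: check whether Nvidia Apex is available before loading the model.
# When Apex is present, Transformers swaps T5's layer norms for the fused
# apex.normalization.FusedRMSNorm; otherwise it falls back to the standard T5LayerNorm.
try:
    from apex.normalization import FusedRMSNorm  # noqa: F401
    print("Apex found: T5 normalization layers will use FusedRMSNorm.")
except ImportError:
    print("Apex not found: falling back to the standard T5LayerNorm.")
```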
The base FLAN-T5 model is capable of answering a variety of prompts, but the domain-adapted CyberSolve LinAlg model is designed specifically for solving linear equations. As such, users must be considerate in their prompt engineering and issue coherent, relevant queries of the form outlined above and below.

```python
# import apex  # optional: installing Apex enables the fused normalization kernels described above
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("MarioBarbeque/CyberSolve-LinAlg-1.2").to("cuda")
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")  # CyberSolve uses the same tokenizer as the base FLAN-T5 model

# Pass the model an instruction to solve a linear equation in the following simple format
input_text = "Solve 24 = 1601*c - 1605*c for c."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This code outputs the following:

```
-6
```

## Training Details

### Training Data / Preprocessing

The data used comes from Google DeepMind and the 🤗 hub. The dataset card can be found [here](https://huggingface.co./datasets/deepmind/math_dataset). The DeepMind Mathematics `DatasetDict` object is composed of a vast variety of underlying mathematics datasets, each containing a specific class of mathematical problems and their solutions. For the CyberSolve LinAlg *1.x* family of models, we are interested specifically in solving one-dimensional linear equations, so we use the *algebra__linear_1d* split.

The training and evaluation splits of the 1D linear algebra dataset are preprocessed in the following way: we reformat the raw problems and their solutions, which arrive in the form `"b'Solve 65*l - 361 + 881 = 0 for l.\\n'"` and `"b'-8\\n'"`, into the much cleaner `"Solve 65*l - 361 + 881 = 0 for l."` and `"-8"`. All inputs and labels are then tokenized. We subsequently evaluate the length of each *input_ids* vector and each *labels* vector to ensure there are no outliers and no inputs need to be truncated. For later ease of loading, we upload these preprocessed and tokenized training and evaluation datasets to the 🤗 hub at the following locations: [MarioBarbeque/DeepMind-LinAlg-1D-train](https://huggingface.co./datasets/MarioBarbeque/DeepMind-LinAlg-1D-train) and [MarioBarbeque/DeepMind-LinAlg-1D-eval](https://huggingface.co./datasets/MarioBarbeque/DeepMind-LinAlg-1D-eval).
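The sketch below illustrates this preprocessing under a few assumptions: the `question`/`answer` field names of the DeepMind dataset, the stringified byte-literal format shown above, and a 🤗 Datasets version that still supports this script-based dataset (recent releases may require `trust_remote_code=True` or not support loading scripts at all). It is an illustration of the described pipeline, not our exact preprocessing code.

```python
from datasets import load_dataset
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")

def clean(text: str) -> str:
    # Strip the stringified byte-literal wrapper and the literal trailing "\n", e.g.
    # "b'Solve 65*l - 361 + 881 = 0 for l.\n'" -> "Solve 65*l - 361 + 881 = 0 for l."
    return text.removeprefix("b'").removesuffix("'").replace("\\n", "").strip()

def preprocess(batch):
    inputs = [clean(q) for q in batch["question"]]  # field names assumed
    targets = [clean(a) for a in batch["answer"]]
    model_inputs = tokenizer(inputs)
    model_inputs["labels"] = tokenizer(targets)["input_ids"]
    return model_inputs

ds = load_dataset("deepmind/math_dataset", "algebra__linear_1d", trust_remote_code=True)
tokenized = ds.map(preprocess, batched=True, remove_columns=ds["train"].column_names)

# Sanity-check sequence lengths so no input needs truncation
print(max(len(ids) for ids in tokenized["train"]["input_ids"]))
```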
### Training Procedure

The model was trained locally on a single node with multiple Nvidia A100 GPUs using 🤗 Transformers, 🤗 Tokenizers, and a custom PyTorch training loop that made use of both Nvidia Apex and 🤗 Accelerate.

#### Training Hyperparameters

- **Precision:** FP32, the same precision as the base "google/flan-t5-large" model
- **Optimizer:** `apex.optimizers.FusedAdam`, a fused-kernel version of the AdamW optimizer from Nvidia Apex
- **Learning Rate:** a linear learning rate scheduler with an initial learning rate of 1e-4, used to further adjust the CyberSolve LinAlg **1.1** weights
- **Batch Size:** 64
- **Number of Training Steps:** 1918 training steps over 2 additional epochs (CyberSolve LinAlg **1.2**), beyond the original 2877 total steps over 3 epochs (CyberSolve LinAlg **1.1**)
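A minimal sketch of such a training loop with these hyperparameters follows. It reuses the `tokenized` datasets from the preprocessing sketch above; the starting checkpoint identifier `MarioBarbeque/CyberSolve-LinAlg-1.1` is hypothetical, and the loop is an illustration of the described setup (🤗 Accelerate for multi-GPU orchestration, Apex `FusedAdam`, a linear schedule) rather than our exact training script.

```python
from accelerate import Accelerator
from apex.optimizers import FusedAdam  # requires Nvidia Apex
from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq, T5ForConditionalGeneration, T5Tokenizer, get_scheduler

accelerator = Accelerator()  # handles device placement and multi-GPU gradient sync

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("MarioBarbeque/CyberSolve-LinAlg-1.1")  # hypothetical 1.1 checkpoint id

collator = DataCollatorForSeq2Seq(tokenizer, model=model)  # pads input_ids and labels per batch
train_loader = DataLoader(tokenized["train"], batch_size=64, shuffle=True, collate_fn=collator)

num_epochs = 2
optimizer = FusedAdam(model.parameters(), lr=1e-4)  # fused AdamW-style kernel
scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0,
    num_training_steps=num_epochs * len(train_loader),
)

model, optimizer, train_loader, scheduler = accelerator.prepare(model, optimizer, train_loader, scheduler)

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        loss = model(**batch).loss  # labels in the batch yield a cross-entropy loss
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```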
## Evaluation / Metrics

We evaluate our text-to-text linear equation solver by using the `exact_match` metric to compare the model's decoded predicted tokens with their numeric labels. *CyberSolve LinAlg 1.2* scores a **90.75** exact match score on the evaluation set of 10k linear equations from the DeepMind *algebra__linear_1d* split. This is a non-trivial improvement over the exact match score of **86.56** attained by *CyberSolve LinAlg 1.1*.

Additionally, we construct a partial-correctness dataset available at the following dataset card: [MarioBarbeque/CyberSolve-LinAlg-1.2-correctness-benchmark](https://huggingface.co./datasets/MarioBarbeque/CyberSolve-LinAlg-1.2-correctness-benchmark). This dataset was created to analyze both the token-to-token and decoded-sequence-to-decoded-sequence partial correctness of CyberSolve's predictions in detail, beyond whether its answers are flat-out right or wrong. Similar partial-correctness benchmark datasets were created for the initial [FLAN-T5 model](https://huggingface.co./datasets/MarioBarbeque/FLAN-T5-DeepMind-LinAlg-1D-benchmark), the [zeroth-generation downsampled training](https://huggingface.co./datasets/MarioBarbeque/CyberSolve-DeepMind-LinAlg-1D-downsample-benchmark-v2) of CyberSolve, and the [1.1 version](https://huggingface.co./datasets/MarioBarbeque/CyberSolve-LinAlg-1.1-correctness-benchmark) of the model. *We have yet to complete the partial-correctness analysis of the various model versions and their predictions, but we look forward to better understanding the mathematical reasoning capabilities of these models and publishing our results when complete!*

### Testing Data, Factors & Metrics

#### Testing Data

The 1D linear algebra split of the Google DeepMind Mathematics dataset comes pre-split into training and evaluation datasets of 2M and 10k records, respectively. Before training *CyberSolve LinAlg 1.1*, we trained a zeroth-generation, downsampled version of CyberSolve by splitting the set of 2M training records with scikit-learn's `train_test_split` into much smaller training and evaluation datasets. We used this smaller set to evaluate the less interesting zeroth-generation model, while we used the standard set of 10k evaluation records to evaluate both *CyberSolve LinAlg 1.1* and *CyberSolve LinAlg 1.2*.

### Results

We find the following benchmark scores for each of our neural models after the corresponding epoch of training.

| model | epoch | exact_match score |
|----------------------------------|-------|-------------------|
| CyberSolve LinAlg **1.2** | 1 | 90.75 |
| CyberSolve LinAlg **1.2** | 0 | 83.12 |
| CyberSolve LinAlg **1.1** | 2 | 86.56 |
| CyberSolve LinAlg **1.1** | 1 | 73.80 |
| CyberSolve LinAlg **1.1** | 0 | 55.35 |
| CyberSolve LinAlg **Downsample** | 2 | 44.99 |
| CyberSolve LinAlg **Downsample** | 1 | 39.69 |
| CyberSolve LinAlg **Downsample** | 0 | 32.21 |

#### Summary

We train this model to research the mathematical reasoning abilities of transformer-based neural models, covering both the full-correctness and partial-correctness mathematical reasoning abilities of such models. Our efforts made use of the 🤗 ecosystem, a system of parallelized Nvidia A100 GPUs in an Azure Databricks environment, custom PyTorch training and evaluation code, high-performance computing and deep learning libraries like Nvidia Apex, and more. We learned a great deal and look forward to finalizing our research on the partial-correctness reasoning abilities of these preliminary models. We also eagerly plan to further improve the CyberSolve family of models to tackle more difficult mathematical tasks. Looking forward, CyberSolve LinAlg *2.x* will likely incorporate knowledge of systems of composed one-dimensional linear equations and more general multi-variable linear equations. Finally, methods related to reinforcement learning are equally enticing for improving neural reasoning abilities; the future is bright for teaching mathematics to AI! We look forward to taking part in this great and worthy endeavor.

## Environmental Impact

- **Hardware Type:** Nvidia Ampere A100 80GB
- **Hours used:** 21.5
- **Cloud Provider:** Microsoft Azure
- **Compute Region:** EastUS
- **Carbon Emitted:** 3.18 kgCO₂eq

Experiments were conducted using Azure in region eastus, which has a carbon efficiency of 0.37 kgCO₂eq/kWh. A cumulative 21.5 hours of computation was performed on hardware of type A100 SXM4 80GB (TDP of 400W), corresponding to roughly 21.5 h × 0.4 kW = 8.6 kWh of energy. Total emissions are estimated to be 8.6 kWh × 0.37 kgCO₂eq/kWh ≈ 3.18 kgCO₂eq, 100 percent of which was directly offset by the cloud provider. Estimations were conducted using the Machine Learning Impact calculator presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

#### Hardware

The model was trained locally in an Azure Databricks workspace using a single-node cloud compute instance with 2 Nvidia A100 80GB GPUs for 21.5 GPU hours.

#### Software

Training utilized PyTorch, Nvidia Apex, 🤗 Transformers, 🤗 Tokenizers, 🤗 Datasets, 🤗 Accelerate, and more in an Azure Databricks execution environment.

#### Citations

```bibtex
@article{lacoste2019quantifying,
  title={Quantifying the Carbon Emissions of Machine Learning},
  author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
  journal={arXiv preprint arXiv:1910.09700},
  year={2019}
}
```