Model Card for Model ID
Welcome to the CyberSolve LinAlg 1.2 model card!
We introduce CyberSolve LinAlg 1.2, a text-to-text large language model trained to solve linear equations. Specifically, CyberSolve LinAlg 1.2 is a downstream version of the FLAN-T5 large model, google/flan-t5-large, fine-tuned on the one-dimensional linear algebra split of the Google DeepMind mathematics dataset. The model weights of CyberSolve LinAlg 1.2 are a further downstream checkpoint from the original CyberSolve LinAlg 1.1 checkpoint, trained for additional epochs to improve model capability.
Note: This is currently the most capable version of CyberSolve LinAlg. See this model demoed in the CyberSolve LinAlg 1.2 🤗 Space.
Model Details
Model Description and Overview
To construct CyberSolve LinAlg 1.2, the FLAN-T5 large model is fine-tuned using a custom PyTorch training loop optimized for multiple GPUs. We supervise the training of FLAN-T5 large on the algebra__linear_1d split of the Google DeepMind mathematics dataset, an open source dataset from Google DeepMind available through the 🤗 hub at deepmind/math_dataset. This large dataset consists of code that generates mathematical problems and their solutions for a variety of tasks across distinct mathematical disciplines.
In this preliminary family of CyberSolve models, we are specifically interested in understanding the ability of neural models to solve non-trivial mathematical tasks. As such, the CyberSolve LinAlg 1.x family of models is trained on a set of 2M simpler, one-dimensional linear equations. We preprocessed the data and simulated the training process on a smaller, downsampled subset of the dataset before training for multiple epochs over the dataset's entirety. This model in particular has been trained for 2 additional epochs, limited only by funds, beyond the original CyberSolve LinAlg 1.1 checkpoint.
Version 1.2 is the most capable version of CyberSolve LinAlg, achieving an exact match score of 90.75 on the evaluation set of 10k linear equations from the DeepMind algebra__linear_1d split. This is a non-trivial improvement over the exact match score of 86.56 attained by CyberSolve LinAlg 1.1.
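For readers unfamiliar with the metric, the following is a minimal sketch of how an exact match score can be computed as a percentage. The function name, the whitespace stripping, and the sample predictions are assumptions made for illustration; they are not taken from the evaluation code used for this model.

```python
def exact_match(predictions, references):
    """Return the percentage of predictions that equal their reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty evaluation set")
    # Assumption: only surrounding whitespace is normalized before comparing.
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Hypothetical decoded model outputs and gold answers (three of four match).
preds = ["-6", "3", "12", "-1"]
golds = ["-6", "3", "11", "-1"]
score = exact_match(preds, golds)
print(score)
```

Under this definition, a score of 90.75 on 10k equations corresponds to 9,075 answers reproduced character for character.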
- Developed by: John Graham Reynolds
- Funded by: Vanderbilt University
- Model type: Text-to-Text Generation
- Language(s) (NLP): English
- Finetuned from model: "Google/FLAN-T5-large"
Model Source
- Repository: TODO https://github.com/johngrahamreynolds/DistilBERT-DeNiro TODO
Uses
Direct Use
To query the model's ability to solve linear equations effectively, a string of the format "Solve <any one-dimensional linear equation>." should be tokenized and passed to the model's generate method. An example input string is input_text = "Solve 24 = 1601*c - 1605*c for c."
The model will attempt to solve the equation, outputting its prediction in a simple numeric format. See the example below.
How to Use and Query the Model
Use the code below to get started with the model. Reference the Nvidia apex package for optimized inference. Users pass a text string detailing a sentence with a [MASK] token. The model will provide options to fill the mask based on the sentence context and its background of knowledge. Note that the DistilBERT base model was trained on a very large general corpus of text.
In our training, we have fine-tuned the model on the large IMDB movie review dataset. That is, the model is now accustomed to filling [MASK] tokens with words related to the domain of movies/tv/films. To see the model's affinity for cinematic lingo, it is best to be considerate in one's prompt engineering. Specifically, to most likely generate movie related text, one should ideally pass a masked text string that could reasonably be found in someone's review of a movie. See the example below:
# import apex
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
model = T5ForConditionalGeneration.from_pretrained("MarioBarbeque/CyberSolve-LinAlg-1.2").to("cuda")
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large") # CyberSolve uses the same tokenizer as the base FLAN-T5 model
# Pass the model instruction to solve a linear equation in the following simple format
input_text = "Solve 24 = 1601*c - 1605*c for c."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
This code outputs the following:
-6
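The predicted value can be checked independently of the model. The sketch below is a small verification helper, assuming the equation is linear in the variable and is written with Python-compatible arithmetic, as in the example prompt. The helper name solve_linear is hypothetical and is not part of the model or of any library.

```python
from fractions import Fraction


def solve_linear(equation, var):
    """Solve 'lhs = rhs' for var, assuming both sides are linear in var."""
    lhs, _, rhs = equation.partition("=")

    def residual(value):
        # eval is used only because the input is a trusted string made of
        # digits, arithmetic operators, and the variable name.
        scope = {var: Fraction(value)}
        left = eval(lhs.strip(), {"__builtins__": {}}, scope)
        right = eval(rhs.strip(), {"__builtins__": {}}, scope)
        return left - right

    intercept = residual(0)          # f(0) = b for f(x) = a*x + b
    slope = residual(1) - intercept  # f(1) - f(0) = a
    if slope == 0:
        raise ValueError("equation has no unique solution")
    return -intercept / slope        # root of a*x + b = 0


answer = solve_linear("24 = 1601*c - 1605*c", "c")
print(answer)
```

Here the right-hand side simplifies to -4*c, so the helper returns -6, which agrees with the model output shown above.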
Training Details
Training Data / Preprocessing
The data used comes from Google DeepMind and the 🤗 hub. The model card can be found here. This dataset is preprocessed in the following way: the train and test splits are tokenized, concatenated, and chunked into chunks of 256 tokens. We subsequently load the training data into a DataCollator that applies a custom random masking function when batching. We mask 15% of tokens in each chunk. The evaluation data is masked in its entirety, to remove randomness when evaluating, and passed to a DataCollator with the default collating function.
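The concatenate-and-chunk step described above can be sketched in plain Python as follows. This is an illustration only: the function name is hypothetical, the token ids are stand-in integers, and dropping the final incomplete chunk is an assumption, since the text does not say how the remainder is handled.

```python
def concat_and_chunk(tokenized_examples, chunk_size=256):
    """Concatenate token id lists and split them into fixed-size chunks."""
    flat = [tok for example in tokenized_examples for tok in example]
    # Assumption: tokens that do not fill a whole chunk are discarded.
    usable = (len(flat) // chunk_size) * chunk_size
    return [flat[i:i + chunk_size] for i in range(0, usable, chunk_size)]


# Two stand-in "tokenized" examples of 300 and 400 tokens (700 in total).
examples = [list(range(300)), list(range(300, 700))]
chunks = concat_and_chunk(examples)
print(len(chunks), len(chunks[0]))
```

With 700 tokens and a chunk size of 256, this yields two full chunks and discards the remaining 188 tokens.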
Training Procedure
The model was trained locally on a single node with multiple Nvidia A100 GPUs using 🤗 Transformers, 🤗 Tokenizers, and a custom PyTorch training loop that made use of 🤗 Accelerate.
Training Hyperparameters
- Precision: We use FP32 precision, the same precision as the base "google/flan-t5-large" model.
- Optimizer: apex.optimizers.FusedAdam, a fused kernel version of the AdamW optimizer from Nvidia apex
- Learning Rate: We use a linear learning rate scheduler with an initial learning rate of 5e-5
- Batch Size: 32
- Number of Training Steps: 2877 steps over the course of 3 epochs, followed by
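To make the learning rate setting concrete, here is a minimal sketch of a linear schedule. It assumes no warmup and a decay to zero over the 2877 steps listed above; neither assumption is confirmed by the text, and the function name is hypothetical.

```python
def linear_lr(step, total_steps=2877, initial_lr=5e-5):
    """Learning rate at a given step under a linear decay to zero."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie between 0 and total_steps")
    # Assumption: no warmup phase; the rate falls linearly from initial_lr to 0.
    return initial_lr * (1 - step / total_steps)


for step in (0, 959, 1918, 2877):
    print(step, linear_lr(step))
```

Under these assumptions the rate is 5e-5 at step 0, about two thirds of that at step 959, and zero at the final step.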
Evaluation / Metrics
We evaluate our masked language model's performance using the perplexity metric, which has a few mathematical definitions. We define the perplexity as the exponential of the cross-entropy.
To remove randomness in our metrics, we premask our evaluation dataset with a single masking function. This ensures we are evaluating with respect to the same set of labels each epoch.
See the Wikipedia links for perplexity and cross-entropy below for a more detailed discussion and various other definitions.
Cross-entropy: https://en.wikipedia.org/wiki/Cross-entropy
Perplexity: https://en.wikipedia.org/wiki/Perplexity
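The definition used here, perplexity as the exponential of the cross-entropy, can be illustrated with a short sketch. The function name and the input probabilities are made up for the example; in practice the cross-entropy would be the mean loss over the evaluation set, measured with the natural logarithm.

```python
import math


def perplexity(token_probs):
    """Exponential of the mean negative log-likelihood of the correct tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)


# Hypothetical case: the model gives each correct token a probability of 0.25.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(ppl)
```

A model that assigns probability 0.25 to every correct token has a perplexity of approximately 4, which matches the intuition of choosing uniformly among four options.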
Testing Data, Factors & Metrics
Testing Data
The 1D Linear Algebra split of the Google DeepMind Mathematics dataset comes pre-split into training and evaluation data of 2M and 10k records, respectively. Our preprocessing, which included the chunking of concatenated, tokenized inputs into chunks of 256 tokens, increased each of these splits by approximately 5k records. We apply a single masking function to the evaluation dataset before testing, as mentioned above.
Results
We find the following perplexity metrics over 3 training epochs:
| epoch | perplexity |
|---|---|
| 0 | 17.38 |
| 1 | 16.28 |
| 2 | 15.78 |
Summary
We train this model for the purpose of attempting a local training of a masked language model using both the 🤗 ecosystem and a custom PyTorch training and evaluation loop. We look forward to further fine-tuning this model on more film/actor/cinema related data in order to further improve the model's knowledge and ability in this domain - indeed cinema is one of the author's favorite things.
Environmental Impact
- Hardware Type: Nvidia Tesla T4 16GB
- Hours used: 1.2
- Cloud Provider: Microsoft Azure
- Compute Region: EastUS
- Carbon Emitted: 0.03 kgCO2
Experiments were conducted using Azure in region eastus, which has a carbon efficiency of 0.37 kgCO$_2$eq/kWh. A cumulative 1.2 hours of computation was performed on hardware of type T4 (TDP of 70W).
Total emissions are estimated to be 0.03 kgCO$_2$eq, of which 100 percent was directly offset by the cloud provider.
Estimations were conducted using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
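The reported figure can be reproduced with simple arithmetic from the numbers above. The sketch below assumes the GPU draws its full TDP for the entire run and ignores datacenter overhead; the calculator itself may apply a different method.

```python
hours = 1.2              # cumulative compute time reported above
tdp_watts = 70           # TDP of the T4 GPU
carbon_intensity = 0.37  # kgCO2eq per kWh for the eastus region

# Assumption: the GPU runs at full TDP for the whole duration.
energy_kwh = hours * tdp_watts / 1000
emissions = energy_kwh * carbon_intensity
print(round(emissions, 2))
```

This gives 0.084 kWh of energy and roughly 0.031 kgCO2eq, which rounds to the 0.03 stated above.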
Hardware
The model was trained locally in an Azure Databricks workspace using a single node with one 16GB Nvidia T4 GPU for 1.2 GPU hours.
Software
Training utilized PyTorch, 🤗 Transformers, 🤗 Tokenizers, 🤗 Datasets, 🤗 Accelerate, and more in an Azure Databricks execution environment.
Citations
@article{lacoste2019quantifying, title={Quantifying the Carbon Emissions of Machine Learning}, author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas}, journal={arXiv preprint arXiv:1910.09700}, year={2019} }