datasets:
- sagawa/ZINC-canonicalized
library_name: transformers
tags:
- safe
- datamol-io
- molecule-design
- smiles
- generated_from_trainer
model-index:
- name: SAFE_100M
results: []
SAFE_100M
SAFE_100M is a transformer-based model for molecular generation. It was trained from scratch on the ZINC dataset converted to the SAFE (Sequential Attachment-based Fragment Embedding) format and reaches a loss of 0.3887 on its evaluation set. The model is intended to generate valid and diverse molecular structures.
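To put the evaluation loss in context, the short sketch below converts it to a perplexity. It assumes the reported 0.3887 is a mean per-token cross-entropy in nats, which is the usual convention for language-model training loops but is not stated in this card.

```python
import math

# Assumption: the reported evaluation loss is a mean per-token
# cross-entropy measured in nats (the card does not say so explicitly).
EVAL_LOSS = 0.3887


def perplexity(mean_cross_entropy: float) -> float:
    """Return exp(loss), the per-token perplexity for a loss given in nats."""
    return math.exp(mean_cross_entropy)


ppl = perplexity(EVAL_LOSS)
print(f"perplexity ~ {ppl:.3f}")
```

Under that assumption the perplexity is roughly 1.48. A value close to 1 means the model assigns high probability to the next token on held-out data; it does not by itself measure chemical validity or diversity.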
Table of Contents
- Model Description
- Intended Uses & Limitations
- Training and Evaluation Data
- Training Procedure
- Acknowledgements
- References
Model Description
SAFE_100M uses the SAFE line notation, which represents a molecule as a sequence of connected fragments while remaining a valid SMILES string. Trained on the ZINC dataset, the model is aimed at exploring chemical space for applications such as:
- Drug Discovery
- Materials Science
- Chemical Engineering
The model is a transformer trained to generate SAFE strings. The goal is output that is both chemically valid and structurally diverse, which can support work across these fields.
Intended Uses & Limitations
Intended Uses
SAFE_100M is designed to support:
- Molecular Structure Generation: Creating novel molecules with desired properties.
- Chemical Space Exploration: Identifying potential candidates for drug development.
- Material Design Assistance: Innovating new materials with specific characteristics.
Limitations
Users should be aware of the following limitations:
- Validation Requirement: Outputs should be reviewed by domain experts before practical application.
- Synthetic Feasibility: Generated molecules may not always be synthesizable in a laboratory setting.
- Dataset Boundaries: The model's knowledge is confined to the chemical space represented in the ZINC dataset.
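One way to act on these limitations before expert review is to screen a batch of generated strings for validity, uniqueness, and novelty. The sketch below is a minimal illustration, not part of the SAFE tooling: `generation_metrics` and `toy_is_valid` are hypothetical names, and the toy validity check only balances parentheses. In practice a cheminformatics parser would take its place (for example, RDKit's `Chem.MolFromSmiles` returns `None` for strings it cannot parse), and strings would be canonicalized before comparison.

```python
from typing import Callable, Dict, Iterable


def toy_is_valid(molecule: str) -> bool:
    """Stand-in validity check (assumption): non-empty, balanced parentheses.

    A real pipeline would parse the string with a chemistry toolkit instead.
    """
    depth = 0
    for ch in molecule:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(molecule) and depth == 0


def generation_metrics(
    generated: Iterable[str],
    is_valid: Callable[[str], bool],
    training_set: Iterable[str] = (),
) -> Dict[str, float]:
    """Compute validity, uniqueness, and novelty ratios for generated strings.

    Strings are compared verbatim, so this assumes they are already canonical.
    """
    generated = list(generated)
    valid = [m for m in generated if is_valid(m)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        # fraction of all generated strings that pass the validity check
        "validity": len(valid) / len(generated) if generated else 0.0,
        # fraction of valid strings that are distinct
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        # fraction of distinct valid strings not seen in training
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


sample = ["CCO", "CCO", "c1ccccc1", "C(C", ""]
metrics = generation_metrics(sample, toy_is_valid, training_set={"CCO"})
print(metrics)
```

These ratios say nothing about synthetic feasibility, which still needs a separate assessment.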
Training and Evaluation Data
The model was trained on the ZINC dataset (listed in the metadata as sagawa/ZINC-canonicalized), a large repository of commercially available chemical compounds curated for virtual screening. The molecules were converted to the SAFE format before training.
Training Procedure
Training Hyperparameters
SAFE_100M was trained with the following hyperparameters:
- Learning Rate: 0.0001
- Training Batch Size: 100
- Evaluation Batch Size: 100
- Random Seed: 42
- Gradient Accumulation Steps: 2
- Total Training Batch Size: 200
- Optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08)
- Learning Rate Scheduler: Linear with 10,000 warmup steps
- Total Training Steps: 250,000
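The relationship between these values can be sketched in a few lines of plain Python. The example assumes a single device (so the total batch size is the per-step batch size times the gradient accumulation steps) and assumes the linear scheduler ramps from zero during warmup and then decays linearly to zero at the final step, as common linear-with-warmup schedulers do. The function names are illustrative and not taken from the training code.

```python
BASE_LR = 1e-4          # Learning Rate
WARMUP_STEPS = 10_000   # warmup steps of the linear scheduler
TOTAL_STEPS = 250_000   # Total Training Steps


def effective_batch_size(per_step: int, grad_accum: int, num_devices: int = 1) -> int:
    """Sequences contributing to one optimizer update (single device assumed)."""
    return per_step * grad_accum * num_devices


def linear_warmup_decay_lr(
    step: int,
    base_lr: float = BASE_LR,
    warmup_steps: int = WARMUP_STEPS,
    total_steps: int = TOTAL_STEPS,
) -> float:
    """Learning rate at a given optimizer step.

    Assumption: linear ramp from 0 to base_lr over the warmup steps,
    then linear decay from base_lr to 0 at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(effective_batch_size(100, 2))
for s in (0, 5_000, 10_000, 130_000, 250_000):
    print(s, linear_warmup_decay_lr(s))
```

Under these assumptions the learning rate peaks at 0.0001 at step 10,000 and falls to half that value at step 130,000, the midpoint of the decay phase.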
Framework Versions
The training process utilized the following software frameworks:
- Transformers: 4.44.2
- PyTorch: 2.4.0+cu121
- Datasets: 2.20.0
- Tokenizers: 0.19.1
Acknowledgements
We extend our gratitude to the authors of the SAFE framework for their significant contributions to the field of molecular design.
References
@article{noutahi2024gotta,
title={Gotta be SAFE: a new framework for molecular design},
author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
journal={Digital Discovery},
volume={3},
number={4},
pages={796--804},
year={2024},
publisher={Royal Society of Chemistry}
}