---
library_name: transformers
tags:
- safe
- datamol-io
- molecule-design
- smiles
- generated_from_trainer
datasets:
- sagawa/ZINC-canonicalized
model-index:
- name: SAFE_100M
  results: []
---
# SAFE_100M
SAFE_100M is a transformer-based model for molecular generation. It was trained from scratch on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized) converted to the SAFE (Sequential Attachment-based Fragment Embedding) format, and it reaches a loss of **0.3887** on its evaluation set.
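For intuition, the evaluation loss can be converted to a perplexity. The sketch below assumes the reported 0.3887 is a mean per-token cross-entropy in nats, the usual convention for language models trained with cross-entropy. This card does not state the convention explicitly, so treat the result as an interpretation rather than a reported metric.

```python
import math

# Evaluation loss reported in this model card.
eval_loss = 0.3887

# Assumption: the loss is a mean per-token cross-entropy in nats,
# so perplexity is exp(loss).
perplexity = math.exp(eval_loss)

print(f"perplexity = {perplexity:.3f}")
```

Under that assumption the perplexity is a little under 1.5 per token.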
## Model Description
SAFE_100M uses the [SAFE framework](#references), which represents each molecule as a sequence of interconnected fragment blocks while remaining a valid SMILES string. Trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), the model targets applications such as:
- **Drug Discovery**
- **Materials Science**
- **Chemical Engineering**
The model is intended to generate valid, structurally diverse molecules for these applications.
## Intended Uses & Limitations
### Intended Uses
SAFE_100M is designed to support:
- **Molecular Structure Generation**: Creating novel molecules with desired properties.
- **Chemical Space Exploration**: Identifying potential candidates for drug development.
- **Material Design Assistance**: Innovating new materials with specific characteristics.
### Limitations
While SAFE_100M is a powerful tool, users should be aware of the following limitations:
- **Validation Requirement**: Outputs should be reviewed by domain experts before practical application.
- **Synthetic Feasibility**: Generated molecules may not always be synthesizable in a laboratory setting.
- **Dataset Boundaries**: The model's knowledge is confined to the chemical space represented in the ZINC dataset.
## Training and Evaluation Data
The model was trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), a large repository of commercially available chemical compounds curated for virtual screening. The SMILES strings in this dataset were converted to the SAFE format before training.
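The conversion step can be sketched as a simple map over SMILES strings that records any molecule the encoder rejects. This is an illustrative sketch, not the preprocessing script used for this model: `convert_to_safe` is a hypothetical helper, and the encoder is passed in as a callable so that any SMILES-to-SAFE function (for example `safe.encode` from the `safe-mol` package) can be plugged in.

```python
from typing import Callable, Iterable, List, Tuple


def convert_to_safe(
    smiles_list: Iterable[str],
    encode: Callable[[str], str],
) -> Tuple[List[str], List[str]]:
    """Encode each SMILES string and collect the inputs that fail.

    `encode` is any SMILES -> SAFE callable. Hypothetical helper,
    not part of the safe-mol API.
    """
    converted: List[str] = []
    failed: List[str] = []
    for smiles in smiles_list:
        try:
            converted.append(encode(smiles))
        except Exception:
            # The encoder rejected this molecule; record it and move on.
            failed.append(smiles)
    return converted, failed


# Example usage, assuming safe-mol is installed (`pip install safe-mol`):
#   import safe
#   safe_strings, failed = convert_to_safe(["c1ccccc1CCO"], safe.encode)
```

Keeping the failed inputs makes it easy to check how much of the dataset was dropped during conversion.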
## Training Procedure
### Training Hyperparameters
SAFE_100M was trained with the following hyperparameters:
- **Learning Rate**: `0.0001`
- **Training Batch Size**: `100`
- **Evaluation Batch Size**: `100`
- **Random Seed**: `42`
- **Gradient Accumulation Steps**: `2`
- **Total Training Batch Size**: `200`
- **Optimizer**: Adam (`betas=(0.9, 0.999)`, `epsilon=1e-08`)
- **Learning Rate Scheduler**: Linear with `10,000` warmup steps
- **Total Training Steps**: `250,000`
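The relationship between these values can be checked with a few lines of plain Python. The sketch below makes three assumptions that are not stated in the list: training ran on a single device (which is what 100 × 2 = 200 implies), the linear schedule decays to zero at the final step, and "Total Training Steps" counts optimizer updates. The function `lr_at` is a hypothetical helper written for illustration, not the scheduler code used in training.

```python
LEARNING_RATE = 1e-4
PER_DEVICE_BATCH = 100
GRAD_ACCUM_STEPS = 2
WARMUP_STEPS = 10_000
TOTAL_STEPS = 250_000

# Assumption: a single device, which is what 100 * 2 = 200 implies.
NUM_DEVICES = 1
EFFECTIVE_BATCH = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES


def lr_at(step: int) -> float:
    """Linear warmup from 0 to the peak rate, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * remaining / (TOTAL_STEPS - WARMUP_STEPS)


print("effective batch size:", EFFECTIVE_BATCH)
print("sequences processed:", EFFECTIVE_BATCH * TOTAL_STEPS)
for step in (0, 5_000, 10_000, 130_000, 250_000):
    print(step, lr_at(step))
```

Under these assumptions the learning rate peaks at `0.0001` at step 10,000, and the run processes 200 × 250,000 = 50,000,000 sequences in total.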
### Framework Versions
The training process utilized the following software frameworks:
- **Transformers**: `4.44.2`
- **PyTorch**: `2.4.0+cu121`
- **Datasets**: `2.20.0`
- **Tokenizers**: `0.19.1`
## Acknowledgements
We extend our gratitude to the authors of the SAFE framework for their significant contributions to the field of molecular design.
## References
```bibtex
@article{noutahi2024gotta,
title={Gotta be SAFE: a new framework for molecular design},
author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
journal={Digital Discovery},
volume={3},
number={4},
pages={796--804},
year={2024},
publisher={Royal Society of Chemistry}
}
```