---
library_name: transformers
tags:
- safe
- datamol-io
- molecule-design
- smiles
- generated_from_trainer
datasets:
- sagawa/ZINC-canonicalized
model-index:
- name: SAFE_100M
  results: []
---

# SAFE_100M

SAFE_100M is a transformer-based model for molecular generation. It was trained from scratch on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized) converted to the SAFE (Sequential Attachment-based Fragment Embedding) format, and it reaches a loss of **0.3887** on its evaluation set. The model is intended to generate valid and diverse molecular structures.

## Model Description

SAFE_100M builds on the [SAFE framework](#references), which represents molecules as SAFE (Sequential Attachment-based Fragment Embedding) strings. Trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), the model is aimed at exploring chemical space in applications such as:

- **Drug Discovery**
- **Materials Science**
- **Chemical Engineering**

The model is designed to generate molecules that are both valid and structurally diverse.

## Intended Uses & Limitations

### Intended Uses

SAFE_100M is designed to support:

- **Molecular Structure Generation**: Creating novel molecules with desired properties.
- **Chemical Space Exploration**: Identifying potential candidates for drug development.
- **Material Design Assistance**: Innovating new materials with specific characteristics.

### Limitations

While SAFE_100M is a powerful tool, users should be aware of the following limitations:

- **Validation Requirement**: Outputs should be reviewed by domain experts before practical application.
- **Synthetic Feasibility**: Generated molecules may not always be synthesizable in a laboratory setting.
- **Dataset Boundaries**: The model's knowledge is confined to the chemical space represented in the ZINC dataset.

## Training and Evaluation Data

The model was trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), a large collection of commercially available chemical compounds curated for virtual screening. The dataset was converted to the SAFE format before training.

## Training Procedure

### Training Hyperparameters

SAFE_100M was trained with the following hyperparameters:

- **Learning Rate**: `0.0001`
- **Training Batch Size**: `100`
- **Evaluation Batch Size**: `100`
- **Random Seed**: `42`
- **Gradient Accumulation Steps**: `2`
- **Total Training Batch Size**: `200`
- **Optimizer**: Adam (`betas=(0.9, 0.999)`, `epsilon=1e-08`)
- **Learning Rate Scheduler**: Linear with `10,000` warmup steps
- **Total Training Steps**: `250,000`
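
The relationship between these values can be sketched in plain Python. This is an illustration only, not the training code: it assumes a single device (which is what makes 100 × 2 equal the listed total batch size of 200), assumes that "Total Training Steps" counts optimizer updates, and assumes the linear scheduler ramps from zero to the peak rate during warmup and then decays linearly to zero. The function and constant names are hypothetical.

```python
# Values copied from the hyperparameter list above.
LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 100
GRAD_ACCUM_STEPS = 2
WARMUP_STEPS = 10_000
TOTAL_STEPS = 250_000

# Assumption: one device, so no extra multiplier on the batch size.
NUM_DEVICES = 1


def effective_batch_size() -> int:
    """Sequences contributing to each optimizer update."""
    return TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_DEVICES


def lr_at_step(step: int) -> float:
    """Linear warmup from 0 to the peak rate, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * (step / WARMUP_STEPS)
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * (remaining / (TOTAL_STEPS - WARMUP_STEPS))


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size())
    # Upper bound, assuming every update used a full batch.
    print("sequences processed:", effective_batch_size() * TOTAL_STEPS)
    for step in (0, 5_000, 10_000, 130_000, 250_000):
        print(f"step {step:>7}: lr = {lr_at_step(step):.2e}")
```

Under these assumptions the learning rate peaks at `0.0001` at step 10,000 and reaches zero at step 250,000, and the run processes at most 200 × 250,000 = 50 million sequences.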

### Framework Versions

The training process utilized the following software frameworks:

- **Transformers**: `4.44.2`
- **PyTorch**: `2.4.0+cu121`
- **Datasets**: `2.20.0`
- **Tokenizers**: `0.19.1`

## Acknowledgements

We extend our gratitude to the authors of the SAFE framework for their significant contributions to the field of molecular design.

## References

```bibtex
@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}
```