---
datasets:
- sagawa/ZINC-canonicalized
library_name: transformers
tags:
- safe
- datamol-io
- molecule-design
- smiles
- generated_from_trainer
model-index:
- name: SAFE_100M
  results: []
---
# SAFE_100M
SAFE_100M is a transformer-based model for molecular generation. It was trained from scratch on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized) after conversion to the SAFE (Sequential Attachment-based Fragment Embedding) format, and it reaches a loss of **0.3887** on its evaluation set.
## Table of Contents
- [Model Description](#model-description)
- [Intended Uses & Limitations](#intended-uses--limitations)
- [Training and Evaluation Data](#training-and-evaluation-data)
- [Training Procedure](#training-procedure)
- [Training Hyperparameters](#training-hyperparameters)
- [Framework Versions](#framework-versions)
- [Acknowledgements](#acknowledgements)
- [References](#references)
## Model Description
SAFE_100M uses the [SAFE framework](#references), which represents each molecule as a sequence of fragment blocks in a SMILES-compatible line notation. Trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), the model is aimed at exploring chemical space for applications such as:
- **Drug Discovery**
- **Materials Science**
- **Chemical Engineering**
The transformer architecture is intended to produce molecules that are both valid and structurally diverse, supporting work across these disciplines.
## Intended Uses & Limitations
### Intended Uses
SAFE_100M is designed to support:
- **Molecular Structure Generation**: Creating novel molecules with desired properties.
- **Chemical Space Exploration**: Identifying potential candidates for drug development.
- **Material Design Assistance**: Innovating new materials with specific characteristics.
### Limitations
While SAFE_100M is a powerful tool, users should be aware of the following limitations:
- **Validation Requirement**: Outputs should be reviewed by domain experts before practical application.
- **Synthetic Feasibility**: Generated molecules may not always be synthesizable in a laboratory setting.
- **Dataset Boundaries**: The model's knowledge is confined to the chemical space represented in the ZINC dataset.
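As a first automated check before expert review, the fraction of generated strings that parse (validity) and the fraction of distinct molecules among those (uniqueness) can be measured. The sketch below is illustrative and not part of the original training or evaluation code. The function names are hypothetical, and it assumes the model outputs have already been decoded to SMILES. The RDKit helper is imported lazily, so the metric function itself has no dependencies.

```python
def generation_metrics(smiles_list, canonicalize):
    """Return validity and uniqueness for a batch of generated strings.

    `canonicalize` maps a string to a canonical form, or returns None
    when the string cannot be parsed as a molecule.
    """
    canonical = [canonicalize(s) for s in smiles_list]
    valid = [c for c in canonical if c is not None]
    n_total = len(smiles_list)
    # Validity: share of all samples that parse.
    validity = len(valid) / n_total if n_total else 0.0
    # Uniqueness: share of distinct molecules among the valid samples.
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    return {"validity": validity, "uniqueness": uniqueness}


def rdkit_canonicalize(smiles):
    """Canonicalize with RDKit; returns None for unparsable input."""
    from rdkit import Chem  # lazy import: only needed when this helper is used

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None
```

With RDKit installed, `generation_metrics(samples, rdkit_canonicalize)` reports both numbers for a list of generated strings. Neither metric says anything about synthesizability, so it complements expert review rather than replacing it.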
## Training and Evaluation Data
The model was trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), a large collection of commercially available chemical compounds curated for virtual screening. The dataset was converted to the SAFE format before training.
## Training Procedure
### Training Hyperparameters
SAFE_100M was trained with the following hyperparameters:
- **Learning Rate**: `0.0001`
- **Training Batch Size**: `100`
- **Evaluation Batch Size**: `100`
- **Random Seed**: `42`
- **Gradient Accumulation Steps**: `2`
- **Total Training Batch Size**: `200`
- **Optimizer**: Adam (`betas=(0.9, 0.999)`, `epsilon=1e-08`)
- **Learning Rate Scheduler**: Linear with `10,000` warmup steps
- **Total Training Steps**: `250,000`
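The relationship between these values can be sketched in plain Python. The dictionary keys below follow the parameter names of `transformers.TrainingArguments`, but the actual training script is not shown in this card, so the mapping is an assumption, as is the single-device setup implied by 100 × 2 = 200. The helper functions are hypothetical names; the schedule function reproduces the usual linear warmup followed by linear decay to zero.

```python
# Values copied from the hyperparameter list above. The key names mirror
# transformers.TrainingArguments; the original script is not shown (assumption).
TRAINING_KWARGS = {
    "learning_rate": 1e-4,
    "per_device_train_batch_size": 100,
    "per_device_eval_batch_size": 100,
    "seed": 42,
    "gradient_accumulation_steps": 2,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "warmup_steps": 10_000,
    "max_steps": 250_000,
}


def effective_batch_size(per_device, accum_steps, num_devices=1):
    """Number of sequences contributing to one optimizer update."""
    return per_device * accum_steps * num_devices


def linear_schedule_lr(step, base_lr=1e-4, warmup=10_000, total=250_000):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at `total`."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))


if __name__ == "__main__":
    # Assumes one device, which matches the listed total batch size of 200.
    print(effective_batch_size(100, 2))
    for step in (0, 5_000, 10_000, 130_000, 250_000):
        print(step, linear_schedule_lr(step))
```

Under this schedule the learning rate peaks at `0.0001` at step 10,000 and has fallen to half that value by step 130,000, the midpoint of the decay phase.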
### Framework Versions
Training used the following library versions:
- **Transformers**: `4.44.2`
- **PyTorch**: `2.4.0+cu121`
- **Datasets**: `2.20.0`
- **Tokenizers**: `0.19.1`
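To compare a local environment against these versions, a small standard-library check such as the following may help. It is only a sketch: the function names are hypothetical, and it assumes the libraries are installed under the distribution names `transformers`, `torch`, `datasets`, and `tokenizers`. A PyTorch build for a different CUDA version reports a different local suffix than `+cu121`, which the exact comparison below will flag.

```python
from importlib import metadata

# Versions copied from the list above.
EXPECTED = {
    "transformers": "4.44.2",
    "torch": "2.4.0+cu121",
    "datasets": "2.20.0",
    "tokenizers": "0.19.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(expected, lookup=installed_version):
    """Map package -> (expected, found) for every package whose version differs."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in version_mismatches(EXPECTED).items():
        print(f"{package}: card lists {wanted}, found {found or 'not installed'}")
```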
## Acknowledgements
We extend our gratitude to the authors of the SAFE framework for their significant contributions to the field of molecular design.
## References
```bibtex
@article{noutahi2024gotta,
title={Gotta be SAFE: a new framework for molecular design},
author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
journal={Digital Discovery},
volume={3},
number={4},
pages={796--804},
year={2024},
publisher={Royal Society of Chemistry}
}
```