datasets:
  - sagawa/ZINC-canonicalized
library_name: transformers
tags:
  - safe
  - datamol-io
  - molecule-design
  - smiles
  - generated_from_trainer
model-index:
  - name: SAFE_100M
    results: []

SAFE_100M

SAFE_100M is a transformer-based model for molecular generation. Trained from scratch on the ZINC dataset converted to the SAFE (Sequential Attachment-based Fragment Embedding) format, it reaches a loss of 0.3887 on its evaluation set. It is intended to generate valid and diverse molecular structures.

Model Description
Intended Uses & Limitations
Training and Evaluation Data
Training Procedure
- Training Hyperparameters
- Framework Versions
Acknowledgements
References

Model Description

SAFE_100M builds on the SAFE framework, which represents molecules in the Sequential Attachment-based Fragment Embedding format, a fragment-based line notation that remains compatible with SMILES. Trained on the ZINC dataset, the model is meant for exploring chemical space in applications such as:

Drug Discovery
Materials Science
Chemical Engineering

The transformer architecture, combined with the SAFE representation, is intended to produce molecules that are both valid and structurally diverse. Neither property is guaranteed, so generated outputs should be checked before use.

Intended Uses & Limitations

Intended Uses

SAFE_100M is designed to support:

Molecular Structure Generation: Creating novel molecules with desired properties.
Chemical Space Exploration: Identifying potential candidates for drug development.
Material Design Assistance: Innovating new materials with specific characteristics.

Limitations

Users should be aware of the following limitations:

Validation Requirement: Outputs should be reviewed by domain experts before practical application.
Synthetic Feasibility: Generated molecules may not always be synthesizable in a laboratory setting.
Dataset Boundaries: The model's knowledge is confined to the chemical space represented in the ZINC dataset.

Training and Evaluation Data

The model was trained on the ZINC dataset (listed in the card metadata as sagawa/ZINC-canonicalized), a large collection of commercially available chemical compounds curated for virtual screening. The SMILES strings in this dataset were converted to the SAFE format before training.
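
To give a feel for what a SAFE-formatted string looks like, here is a minimal sketch using only the Python standard library. It relies on two properties of the format: fragments are separated by dots, and fragments are joined through shared ring-closure labels. The toluene string is a hand-written illustration and an assumption of this example, not output from the SAFE library, and the helper names are hypothetical. The actual dataset conversion would have used the SAFE authors' tooling.

```python
import re

# Hand-written illustrative SAFE-style string for toluene (an assumption,
# not produced by the SAFE library): a benzene fragment and a methyl
# fragment joined through the shared ring-closure label 3.
safe_string = "c13ccccc1.C3"


def split_fragments(safe_str):
    """Split a SAFE-style string into its dot-separated fragments."""
    return safe_str.split(".")


def closure_labels(fragment):
    """Return the ring-closure labels (single digits or %nn) in a fragment."""
    # Remove bracket atoms first so isotopes, charges, and hydrogen counts
    # inside [...] are not mistaken for ring-closure labels.
    stripped = re.sub(r"\[[^\]]*\]", "", fragment)
    return re.findall(r"%\d{2}|\d", stripped)


fragments = split_fragments(safe_string)
labels = [closure_labels(f) for f in fragments]

# A label that appears in both fragments marks the attachment point
# between them; labels confined to one fragment are ordinary ring bonds.
shared = set(labels[0]) & set(labels[1])

print(fragments)
print(shared)
```

Because the whole string is still valid SMILES, a standard cheminformatics toolkit can parse it back into a single connected molecule.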

Training Procedure

Training Hyperparameters

SAFE_100M was trained with the following hyperparameters:

Learning Rate: 0.0001
Training Batch Size: 100
Evaluation Batch Size: 100
Random Seed: 42
Gradient Accumulation Steps: 2
Total Training Batch Size: 200
Optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08)
Learning Rate Scheduler: Linear with 10,000 warmup steps
Total Training Steps: 250,000
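
The relationship between these values can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the training script itself: it assumes a single device (consistent with 100 × 2 = 200) and assumes the linear scheduler decays the learning rate to zero at the final step, as the linear schedule in Transformers does. The constant and function names are hypothetical.

```python
# Values copied from the hyperparameter list above.
PEAK_LR = 1e-4
WARMUP_STEPS = 10_000
TOTAL_STEPS = 250_000
PER_DEVICE_BATCH = 100
GRAD_ACCUM_STEPS = 2
NUM_DEVICES = 1  # assumption: single device, consistent with 100 * 2 = 200


def effective_batch_size():
    """Sequences contributing to each optimizer update."""
    return PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES


def linear_lr(step):
    """Learning rate at an optimizer step: linear warmup, then linear decay.

    Assumes decay to zero at TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size())
    for step in (0, 5_000, 10_000, 130_000, 250_000):
        print(step, linear_lr(step))
```

Under these assumptions the learning rate peaks at 0.0001 at step 10,000 and falls back to half that value at step 130,000, the midpoint of the decay phase.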

Framework Versions

Training used the following library versions:

Transformers: 4.44.2
PyTorch: 2.4.0+cu121
Datasets: 2.20.0
Tokenizers: 0.19.1

Acknowledgements

We extend our gratitude to the authors of the SAFE framework for their significant contributions to the field of molecular design.

References

@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}
