SAFE_100M

SAFE_100M is a transformer-based model for molecular generation. It was trained from scratch on the ZINC dataset converted to the SAFE (Sequential Attachment-based Fragment Embedding) format and reaches a loss of 0.3887 on its evaluation set. It is intended to generate valid and diverse molecular structures.
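
For a rough sense of scale, the evaluation loss can be converted to a perplexity. The sketch below assumes the reported 0.3887 is a mean cross-entropy in nats per token, which is common for language-model training but is not stated in this card.

```python
import math

# Reported evaluation loss; ASSUMED to be mean cross-entropy in nats per token.
EVAL_LOSS = 0.3887


def perplexity(loss_nats: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(loss_nats)


print(f"perplexity ~ {perplexity(EVAL_LOSS):.3f}")
```

Under that assumption the perplexity comes out just under 1.5 per token. This describes next-token prediction on held-out SAFE sequences only; it says nothing directly about the validity or diversity of sampled molecules.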

Model Description

SAFE_100M builds on the SAFE framework, which represents a molecule as a sequence of interconnected fragment blocks while remaining a valid SMILES string. Trained on the ZINC dataset, the model is aimed at exploring chemical space in applications such as:

  • Drug Discovery
  • Materials Science
  • Chemical Engineering

The model is trained to generate molecules that are both valid and structurally diverse, which supports exploratory work across these disciplines.

Intended Uses & Limitations

Intended Uses

SAFE_100M is designed to support:

  • Molecular Structure Generation: Creating novel molecules with desired properties.
  • Chemical Space Exploration: Identifying potential candidates for drug development.
  • Material Design Assistance: Innovating new materials with specific characteristics.

Limitations

Users should be aware of the following limitations:

  • Validation Requirement: Outputs should be reviewed by domain experts before practical application.
  • Synthetic Feasibility: Generated molecules may not always be synthesizable in a laboratory setting.
  • Dataset Boundaries: The model's knowledge is confined to the chemical space represented in the ZINC dataset.
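
Part of the review called for in the first limitation can be automated before molecules reach a domain expert. The following is a minimal sketch, assuming RDKit is installed and relying on SAFE strings being parseable as ordinary SMILES. The helper names `rdkit_canonical` and `screen_samples` are hypothetical and not part of any library. The sketch checks only parseability and uniqueness; it does not assess synthesizability.

```python
from typing import Callable, Iterable, Optional


def rdkit_canonical(smiles: str) -> Optional[str]:
    """Return canonical SMILES, or None if the input is empty or unparseable."""
    from rdkit import Chem  # imported lazily so screen_samples works without RDKit

    if not smiles:
        return None
    mol = Chem.MolFromSmiles(smiles)  # returns None when parsing fails
    return Chem.MolToSmiles(mol) if mol is not None else None


def screen_samples(
    samples: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
) -> dict:
    """Compute validity and uniqueness for a batch of generated strings."""
    samples = list(samples)
    # Keep only strings the canonicalizer accepted.
    canonical = [c for c in map(canonicalize, samples) if c is not None]
    unique = sorted(set(canonical))
    return {
        "n_samples": len(samples),
        "validity": len(canonical) / len(samples) if samples else 0.0,
        "uniqueness": len(unique) / len(canonical) if canonical else 0.0,
        "unique_molecules": unique,
    }
```

Calling `screen_samples(generated, rdkit_canonical)` on a list of model outputs returns the fraction that parse, the fraction of parsed molecules that are distinct, and the deduplicated canonical SMILES for expert review. Passing the canonicalizer as an argument keeps the counting logic independent of the cheminformatics toolkit.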

Training and Evaluation Data

The model was trained on the ZINC dataset, a large collection of commercially available compounds curated for virtual screening. The dataset was converted to the SAFE format before training.

Training Procedure

Training Hyperparameters

SAFE_100M was trained with the following hyperparameters:

  • Learning Rate: 0.0001
  • Training Batch Size: 100
  • Evaluation Batch Size: 100
  • Random Seed: 42
  • Gradient Accumulation Steps: 2
  • Total Training Batch Size: 200
  • Optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08)
  • Learning Rate Scheduler: Linear with 10,000 warmup steps
  • Total Training Steps: 250,000
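
The batch-size and scheduler settings above can be checked with a little arithmetic. The sketch below assumes a single training device (consistent with 100 x 2 = 200) and the usual linear schedule shape: a linear ramp from zero during warmup, then linear decay to zero at the final step. The function name `lr_at` is illustrative, and the sequence count assumes the 250,000 steps are optimizer steps.

```python
BASE_LR = 1e-4
BATCH_SIZE = 100
GRAD_ACCUM_STEPS = 2
WARMUP_STEPS = 10_000
TOTAL_STEPS = 250_000

# Sequences contributing to each optimizer update.
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS  # 200

# Upper bound on sequences processed, ASSUMING steps count optimizer updates.
sequences_seen = effective_batch * TOTAL_STEPS  # 50,000,000


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 5_000, 10_000, 130_000, 250_000):
    print(f"step {step:>7}: lr = {lr_at(step):.2e}")
```

The learning rate peaks at 1e-4 at step 10,000, is back to half that value at step 130,000 (the midpoint of the decay phase), and reaches zero at step 250,000.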

Framework Versions

Training used the following library versions:

  • Transformers: 4.44.2
  • PyTorch: 2.4.0+cu121
  • Datasets: 2.20.0
  • Tokenizers: 0.19.1

Acknowledgements

We extend our gratitude to the authors of the SAFE framework for their significant contributions to the field of molecular design.

References

@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}

Model size: 87.3M parameters (Safetensors, tensor type F32)