---
library_name: transformers
tags:
  - safe
  - datamol-io
  - molecule-design
  - smiles
  - generated_from_trainer
datasets:
  - sagawa/ZINC-canonicalized
model-index:
  - name: SAFE_100M
    results: []
---

# SAFE_100M

SAFE_100M is a transformer-based model for molecular generation. It was trained from scratch on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized) converted to the SAFE (Sequential Attachment-based Fragment Embedding) format, and it reaches a loss of **0.3887** on its evaluation set.
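For readers who prefer perplexity, the reported loss can be converted with a one-line calculation. This is a minimal sketch that assumes the evaluation loss is a mean per-token cross-entropy measured in nats, which is the usual convention for language-model training but is not stated in this card. The function name `perplexity` is illustrative only.

```python
import math

# Evaluation loss reported in this model card.
EVAL_LOSS = 0.3887


def perplexity(mean_cross_entropy_nats: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity.

    Assumption: the loss is a plain token-level cross-entropy. If the
    training objective included additional terms, this value is only
    a rough indication.
    """
    return math.exp(mean_cross_entropy_nats)


if __name__ == "__main__":
    print(f"perplexity = {perplexity(EVAL_LOSS):.3f}")
```

Under that assumption the loss corresponds to a perplexity of roughly 1.48, meaning the model is, on average, choosing among fewer than two equally likely tokens at each step.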

## Model Description

SAFE_100M uses the [SAFE framework](#references), which represents a molecule as a sequence of interconnected fragment blocks while remaining a valid SMILES string. Trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), the model is intended for exploring chemical space in applications such as:

- **Drug Discovery**
- **Materials Science**
- **Chemical Engineering**

The model aims to generate molecules that are both valid and structurally diverse, although this card reports only the evaluation loss and no validity or diversity metrics.

## Intended Uses & Limitations

### Intended Uses

SAFE_100M is designed to support:

- **Molecular Structure Generation**: Creating novel molecules with desired properties.
- **Chemical Space Exploration**: Identifying potential candidates for drug development.
- **Material Design Assistance**: Innovating new materials with specific characteristics.

### Limitations

Users should be aware of the following limitations:

- **Validation Requirement**: Outputs should be reviewed by domain experts before practical application.
- **Synthetic Feasibility**: Generated molecules may not always be synthesizable in a laboratory setting.
- **Dataset Boundaries**: The model's knowledge is confined to the chemical space represented in the ZINC dataset.

## Training and Evaluation Data

The model was trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), a large collection of commercially available chemical compounds prepared for virtual screening. The dataset was converted to the SAFE format before training.

## Training Procedure

### Training Hyperparameters

SAFE_100M was trained with the following hyperparameters:

- **Learning Rate**: `0.0001`
- **Training Batch Size**: `100`
- **Evaluation Batch Size**: `100`
- **Random Seed**: `42`
- **Gradient Accumulation Steps**: `2`
- **Total Training Batch Size**: `200`
- **Optimizer**: Adam (`betas=(0.9, 0.999)`, `epsilon=1e-08`)
- **Learning Rate Scheduler**: Linear with `10,000` warmup steps
- **Total Training Steps**: `250,000`
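The hyperparameters above can be related to one another with a short calculation. The sketch below assumes that "Linear" refers to the conventional schedule of linear warmup followed by linear decay to zero (as implemented by `get_linear_schedule_with_warmup` in Transformers), and that training ran on a single device, which is inferred from 100 × 2 = 200 rather than stated in this card. The function names are illustrative.

```python
# Values copied from the hyperparameter list in this card.
LEARNING_RATE = 1e-4
WARMUP_STEPS = 10_000
TOTAL_STEPS = 250_000
PER_DEVICE_BATCH = 100
GRAD_ACCUM_STEPS = 2
NUM_DEVICES = 1  # assumption: inferred from 100 * 2 = 200, not stated


def effective_batch_size(per_device: int, accum: int, devices: int) -> int:
    """Number of sequences contributing to each optimizer step."""
    return per_device * accum * devices


def linear_lr(step: int,
              base_lr: float = LEARNING_RATE,
              warmup: int = WARMUP_STEPS,
              total: int = TOTAL_STEPS) -> float:
    """Learning rate at a given optimizer step.

    Rises linearly from 0 to base_lr over the warmup steps, then
    decays linearly to 0 at the final step (assumed schedule shape).
    """
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))


if __name__ == "__main__":
    batch = effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS, NUM_DEVICES)
    print("effective batch size:", batch)
    print("sequences processed:", batch * TOTAL_STEPS)
    for step in (0, 5_000, 10_000, 130_000, 250_000):
        print(step, linear_lr(step))
```

Under these assumptions the learning rate peaks at `0.0001` at step 10,000, and the full run processes 250,000 × 200 = 50,000,000 training sequences (counting repeats if the data was seen more than once).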

### Framework Versions

Training used the following library versions:

- **Transformers**: `4.44.2`
- **PyTorch**: `2.4.0+cu121`
- **Datasets**: `2.20.0`
- **Tokenizers**: `0.19.1`

## Acknowledgements

We extend our gratitude to the authors of the SAFE framework for their significant contributions to the field of molecular design.

## References

```bibtex
@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}
```