---
library_name: transformers
tags:
- safe
- datamol-io
- molecule-design
- smiles
- generated_from_trainer
datasets:
- sagawa/ZINC-canonicalized
model-index:
- name: SAFE_100M
  results: []
---
# SAFE_100M

This model was trained from scratch on the ZINC dataset, converted to the SAFE format, for molecule generation tasks. It achieves the following results on the evaluation set:
- Loss: 0.3887
## Model description

SAFE_100M is a transformer-based model designed for molecular generation tasks. It was trained on the ZINC dataset (https://huggingface.co./datasets/sagawa/ZINC-canonicalized) after conversion to the SAFE (Sequential Attachment-based Fragment Embedding) format, a fragment-based line notation designed to improve molecular representation in machine learning tasks.
The model is intended to generate valid and diverse molecular structures, which can be useful in various applications such as drug discovery, materials science, and chemical engineering.
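For readers unfamiliar with the notation, the sketch below illustrates what a SAFE string looks like. It assumes the `safe-mol` package from datamol-io (`pip install safe-mol`) and its top-level `encode`/`decode` helpers; treat it as an illustrative round trip, not part of this repository's documented usage.

```python
# Minimal SAFE round-trip sketch, assuming the datamol-io `safe-mol` package.
import safe

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used purely as an example input

safe_string = safe.encode(smiles)    # SMILES -> SAFE (fragment-based) string
print(safe_string)

recovered = safe.decode(safe_string)  # SAFE -> SMILES
print(recovered)
```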
This model utilizes the SAFE framework, which was introduced in the following paper:
```bibtex
@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}
```
We acknowledge and thank the authors for their valuable contribution to the field of molecular design.
## Intended uses & limitations
This model is primarily intended for:
- Generating molecular structures (a minimal sampling sketch follows the limitations list below)
- Exploring chemical space for drug discovery
- Assisting in the design of new materials
Limitations:
- The model's output should be validated by domain experts before practical application
- Generated molecules may not always be synthetically feasible
- The model's knowledge is limited to the chemical space represented in the ZINC dataset
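As a quick illustration of the generation use case, the following is a hedged sampling sketch built on the standard transformers API. The checkpoint path and the single-atom seed fragment are placeholders chosen for the example, not documented usage of this repository, and the generated sequences are SAFE strings that still need to be decoded back to SMILES and validated by domain experts.

```python
# Hypothetical sampling sketch; the model path below is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/SAFE_100M"  # placeholder, point this at the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

prompt = "C"  # minimal seed fragment; an arbitrary choice for this example
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=50,
    max_length=128,
    num_return_sequences=5,
)

for seq in outputs:
    # Each decoded sequence is a SAFE string, not a SMILES string.
    print(tokenizer.decode(seq, skip_special_tokens=True))
```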
## Training and evaluation data
The model was trained on the ZINC dataset (https://huggingface.co./datasets/sagawa/ZINC-canonicalized), which was converted to the SAFE format. The ZINC dataset is a large collection of commercially available chemical compounds for virtual screening.
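For reference, the corpus can be inspected with the datasets library. The streaming sketch below only assumes the public dataset id given above and a `train` split; the printed field names depend on the dataset's actual schema.

```python
# Peek at a few records of the training corpus without downloading it fully.
from datasets import load_dataset

ds = load_dataset("sagawa/ZINC-canonicalized", split="train", streaming=True)
for i, record in enumerate(ds):
    print(record)  # column names depend on the dataset's schema
    if i >= 2:
        break
```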
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training (see the `TrainingArguments` sketch after this list):
- learning_rate: 0.0001
- train_batch_size: 100
- eval_batch_size: 100
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 200
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 10000
- training_steps: 250000
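As referenced above, the sketch below shows one way these reported values map onto transformers' `TrainingArguments`. The output directory is a placeholder, and the actual training script (model config, data pipeline, collator) is not reproduced here.

```python
# Hedged mapping of the reported hyperparameters onto TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="safe_100m",            # placeholder
    learning_rate=1e-4,
    per_device_train_batch_size=100,
    per_device_eval_batch_size=100,
    seed=42,
    gradient_accumulation_steps=2,     # effective train batch size of 200
    lr_scheduler_type="linear",
    warmup_steps=10_000,
    max_steps=250_000,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```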
### Framework versions

- Transformers 4.44.2
- PyTorch 2.4.0+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1