---
library_name: transformers
tags:
- safe
- datamol-io
- molecule-design
- smiles
- generated_from_trainer
datasets:
- sagawa/ZINC-canonicalized
model-index:
- name: SAFE_100M
  results: []
---
# SAFE_100M

This model was trained from scratch on the ZINC dataset, converted to the SAFE format, for molecule generation tasks. It achieves the following results on the evaluation set:
- Loss: 0.3887
## Model description

SAFE_100M is a transformer-based model designed for molecular generation tasks. It was trained on the ZINC dataset (https://huggingface.co./datasets/sagawa/ZINC-canonicalized) after conversion to the SAFE (Sequential Attachment-based Fragment Embedding) format, a fragment-based line notation designed to improve molecular representation in machine learning tasks.
The model is intended to generate valid and diverse molecular structures, which can be useful in various applications such as drug discovery, materials science, and chemical engineering.
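For readers unfamiliar with the notation, the sketch below illustrates what a SAFE string looks like. It assumes the `safe-mol` package from datamol-io (`pip install safe-mol`) and its top-level `encode`/`decode` helpers; treat it as an illustrative round trip, not part of this repository's documented usage.

```python
# Minimal SAFE round-trip sketch, assuming the datamol-io `safe-mol` package.
import safe

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used purely as an example input

safe_string = safe.encode(smiles)    # SMILES -> SAFE (fragment-based) string
print(safe_string)

recovered = safe.decode(safe_string)  # SAFE -> SMILES
print(recovered)
```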
This model utilizes the SAFE framework, which was introduced in the following paper:
```bibtex
@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}
```
We acknowledge and thank the authors for their valuable contribution to the field of molecular design.
## Intended uses & limitations
This model is primarily intended for:
- Generating molecular structures (a minimal sampling sketch follows the limitations list below)
- Exploring chemical space for drug discovery
- Assisting in the design of new materials
Limitations:
- The model's output should be validated by domain experts before practical application
- Generated molecules may not always be synthetically feasible
- The model's knowledge is limited to the chemical space represented in the ZINC dataset
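As a quick illustration of the generation use case, the following is a hedged sampling sketch built on the standard transformers API. The checkpoint path and the single-atom seed fragment are placeholders chosen for the example, not documented usage of this repository, and the generated sequences are SAFE strings that still need to be decoded back to SMILES and validated by domain experts.

```python
# Hypothetical sampling sketch; the model path below is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/SAFE_100M"  # placeholder, point this at the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

prompt = "C"  # minimal seed fragment; an arbitrary choice for this example
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=50,
    max_length=128,
    num_return_sequences=5,
)

for seq in outputs:
    # Each decoded sequence is a SAFE string, not a SMILES string.
    print(tokenizer.decode(seq, skip_special_tokens=True))
```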
## Training and evaluation data
The model was trained on the ZINC dataset (https://huggingface.co./datasets/sagawa/ZINC-canonicalized), which was converted to the SAFE format. The ZINC dataset is a large collection of commercially available chemical compounds for virtual screening.
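For reference, the corpus can be inspected with the datasets library. The streaming sketch below only assumes the public dataset id given above and a `train` split; the printed field names depend on the dataset's actual schema.

```python
# Peek at a few records of the training corpus without downloading it fully.
from datasets import load_dataset

ds = load_dataset("sagawa/ZINC-canonicalized", split="train", streaming=True)
for i, record in enumerate(ds):
    print(record)  # column names depend on the dataset's schema
    if i >= 2:
        break
```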
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training (see the `TrainingArguments` sketch after this list):
- learning_rate: 0.0001
- train_batch_size: 100
- eval_batch_size: 100
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 200
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 10000
- training_steps: 250000
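As referenced above, the sketch below shows one way these reported values map onto transformers' `TrainingArguments`. The output directory is a placeholder, and the actual training script (model config, data pipeline, collator) is not reproduced here.

```python
# Hedged mapping of the reported hyperparameters onto TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="safe_100m",            # placeholder
    learning_rate=1e-4,
    per_device_train_batch_size=100,
    per_device_eval_batch_size=100,
    seed=42,
    gradient_accumulation_steps=2,     # effective train batch size of 200
    lr_scheduler_type="linear",
    warmup_steps=10_000,
    max_steps=250_000,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```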
### Framework versions

- Transformers 4.44.2
- PyTorch 2.4.0+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1