---
library_name: transformers
tags:
  - safe
  - datamol-io
  - molecule-design
  - smiles
  - generated_from_trainer
datasets:
  - sagawa/ZINC-canonicalized
model-index:
  - name: SAFE_100M
    results: []
---

# SAFE_100M

This model was trained from scratch on the ZINC dataset, converted to the SAFE format, for molecule generation tasks. It achieves the following results on the evaluation set:

- Loss: 0.3887

## Model description

SAFE_100M is a transformer-based model designed for molecular generation tasks. It was trained on the ZINC dataset (https://huggingface.co./datasets/sagawa/ZINC-canonicalized) converted to the SAFE (Sequential Attachment-based Fragment Embedding) format, a fragment-based molecular representation tailored for improved molecular generation in machine learning tasks.

The model is intended to generate valid and diverse molecular structures, which can be useful in various applications such as drug discovery, materials science, and chemical engineering.
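
Assuming the checkpoint is published as a standard causal language model, generation might look like the sketch below. The repository id, prompt handling, and sampling settings are illustrative placeholders, not the model's documented interface:

```python
# Hypothetical sketch: sampling SAFE strings from a causal-LM checkpoint via
# the transformers API. "your-org/SAFE_100M" is a placeholder repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer


def sample_safe_strings(repo_id: str, n: int = 4, max_new_tokens: int = 64) -> list[str]:
    """Draw n candidate SAFE strings from a causal-LM checkpoint."""
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    inputs = tokenizer(tokenizer.bos_token or "", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_k=50,
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]


def dedupe(candidates: list[str]) -> list[str]:
    """Drop duplicate generations while preserving order."""
    seen: set[str] = set()
    return [c for c in candidates if not (c in seen or seen.add(c))]
```

Since sampling is stochastic, deduplicating the returned strings (as `dedupe` does) is a cheap first step before validity filtering.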

This model utilizes the SAFE framework, which was introduced in the following paper:

```bibtex
@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}
```

We acknowledge and thank the authors for their valuable contribution to the field of molecular design.

## Intended uses & limitations

This model is primarily intended for:

- Generating molecular structures
- Exploring chemical space for drug discovery
- Assisting in the design of new materials

Limitations:

- The model's output should be validated by domain experts before practical application
- Generated molecules may not always be synthetically feasible
- The model's knowledge is limited to the chemical space represented in the ZINC dataset
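
As a concrete illustration of the first limitation, generated output can be pre-screened for basic chemical validity before expert review. This sketch assumes RDKit is available; parseability is a necessary but far from sufficient check (it says nothing about synthetic feasibility):

```python
# Hedged sketch: screening generated SMILES with RDKit before downstream use.
# Chem.MolFromSmiles returns None for strings it cannot parse into a molecule.
from rdkit import Chem


def keep_parseable(smiles_list: list[str]) -> list[str]:
    """Return only the SMILES strings RDKit can parse."""
    return [smi for smi in smiles_list if Chem.MolFromSmiles(smi) is not None]
```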

## Training and evaluation data

The model was trained on the ZINC dataset (https://huggingface.co./datasets/sagawa/ZINC-canonicalized), which was converted to the SAFE format. The ZINC dataset is a large collection of commercially available chemical compounds for virtual screening.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 0.0001
- train_batch_size: 100
- eval_batch_size: 100
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 200
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 10000
- training_steps: 250000
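
The schedule and batch arithmetic above can be sketched in plain Python. This is a minimal illustration of how `linear` warmup/decay and `total_train_batch_size` relate, not the trainer's actual implementation:

```python
# Linear schedule implied by the hyperparameters: 10,000 warmup steps up to
# the base learning rate, then linear decay to zero at step 250,000.
BASE_LR = 1e-4
WARMUP_STEPS = 10_000
TOTAL_STEPS = 250_000


def linear_lr(step: int) -> float:
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR * max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))


# Effective batch size: per-device batch x gradient accumulation steps.
effective_batch = 100 * 2  # = 200, matching total_train_batch_size
```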

## Framework versions

- Transformers 4.44.2
- Pytorch 2.4.0+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1