---
library_name: transformers
tags:
- safe
- datamol-io
- molecule-design
- smiles
- generated_from_trainer
datasets:
- sagawa/ZINC-canonicalized
model-index:
- name: SAFE_100M
  results: []
---

# SAFE_100M

This model was trained from scratch on the ZINC dataset, converted to the SAFE format, for molecule generation tasks.
It achieves the following results on the evaluation set:
- Loss: 0.3887

## Model description

SAFE_100M is a transformer-based model for molecular generation. It was trained on the ZINC dataset (https://huggingface.co./datasets/sagawa/ZINC-canonicalized) converted to the SAFE (Sequential Attachment-based Fragment Embedding) format, a representation designed to make molecular structures easier to model with sequence-based machine learning methods. The model is intended to generate valid and diverse molecular structures for applications such as drug discovery, materials science, and chemical engineering. A minimal generation sketch is provided at the end of this card.

This model builds on the SAFE framework introduced in the following paper:

```bibtex
@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}
```

We acknowledge and thank the authors for their valuable contribution to the field of molecular design.

## Intended uses & limitations

This model is primarily intended for:
- Generating molecular structures
- Exploring chemical space for drug discovery
- Assisting in the design of new materials

Limitations:
- The model's output should be validated by domain experts before practical application
- Generated molecules may not always be synthetically feasible
- The model's knowledge is limited to the chemical space represented in the ZINC dataset

## Training and evaluation data

The model was trained on the ZINC dataset (https://huggingface.co./datasets/sagawa/ZINC-canonicalized), converted to the SAFE format. ZINC is a large collection of commercially available chemical compounds curated for virtual screening.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 100
- eval_batch_size: 100
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 200
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 10000
- training_steps: 250000

### Framework versions

- Transformers 4.44.2
- Pytorch 2.4.0+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1
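
## How to use

The sketch below shows one way to sample SAFE strings from the model with the Hugging Face `transformers` API. The repository id (`sagawa/SAFE_100M`), the fallback prompt, and the sampling settings are assumptions made for illustration; adjust them to your setup.

```python
# Minimal, illustrative generation sketch. Assumptions (not confirmed by this
# card): the checkpoint is published as "sagawa/SAFE_100M" and is a standard
# causal language model whose tokenizer is stored in the same repository.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sagawa/SAFE_100M"  # assumed Hub id; replace if hosted elsewhere
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Seed the decoder with the beginning-of-sequence token, falling back to a
# short carbon fragment if the tokenizer defines no BOS token.
prompt = tokenizer.bos_token if tokenizer.bos_token is not None else "C"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_k=50,
        max_length=128,
        num_return_sequences=5,
        pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    )

# The decoded sequences are SAFE strings, not plain SMILES.
for sequence in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(sequence)
```

Generated sequences are SAFE strings; converting them back to SMILES (for example with the datamol-io `safe` package) and validating them with a cheminformatics toolkit is recommended before downstream use.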
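
## Converting SMILES to SAFE

The training data were converted from canonical SMILES to SAFE strings. The sketch below shows one possible way to perform such a conversion with the datamol-io `safe` package (`pip install safe-mol`); the `safe.encode`/`safe.decode` helpers are used as an assumed interface, and the exact conversion script used for this model is not part of this card.

```python
# Possible SMILES -> SAFE conversion sketch using the datamol-io `safe`
# package. This is an illustration, not the pipeline used to build the
# training data for this model.
import safe

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used here only as an example
safe_string = safe.encode(smiles)
print(safe_string)

# Round-trip back to SMILES to sanity-check the conversion.
print(safe.decode(safe_string, canonical=True))
```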
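
## Training configuration sketch

For reference, the hyperparameters listed above map onto `transformers.TrainingArguments` roughly as follows. This is an illustrative reconstruction, not the original training script; the output directory is a placeholder.

```python
# Illustrative mapping of the reported hyperparameters onto
# transformers.TrainingArguments (reconstruction, not the original script).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="safe_100m",            # placeholder output directory
    learning_rate=1e-4,
    per_device_train_batch_size=100,
    per_device_eval_batch_size=100,
    gradient_accumulation_steps=2,     # effective train batch size of 200
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=10_000,
    max_steps=250_000,
    adam_beta1=0.9,                    # reported optimizer: Adam
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```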