anrilombard/safe-100m

	---
	datasets:
	- sagawa/ZINC-canonicalized
	library_name: transformers
	tags:
	- safe
	- datamol-io
	- molecule-design
	- smiles
	- generated_from_trainer
	model-index:
	- name: SAFE_100M
	  results: []
	---

	# SAFE_100M

	This model was trained from scratch on the ZINC dataset converted to SAFE format for molecule generation tasks.
	It achieves the following results on the evaluation set:

	- Loss: 0.3887
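For intuition, the evaluation loss can be converted to a perplexity. The sketch below assumes the reported value is a mean per-token cross-entropy in nats; this card does not state the loss definition, so treat the result as indicative only.

```python
import math

# Evaluation loss reported in this card.
eval_loss = 0.3887

# Assumption: the loss is a mean per-token cross-entropy in nats,
# in which case perplexity is exp(loss).
perplexity = math.exp(eval_loss)

print(f"perplexity = {perplexity:.3f}")
```

Under that assumption, a perplexity close to 1 means the model is rarely uncertain about the next token on the evaluation set.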

	## Model description

	SAFE_100M is a transformer-based model for molecular generation. It was trained on the ZINC dataset (https://huggingface.co/datasets/sagawa/ZINC-canonicalized) after conversion to the SAFE (Sequential Attachment-based Fragment Embedding) format. SAFE rewrites a SMILES string as an unordered sequence of interconnected fragment blocks while remaining readable by existing SMILES parsers, which makes fragment-level design tasks easier to express as sequence generation.

	The model is intended to generate valid and diverse molecular structures, which can be useful in various applications such as drug discovery, materials science, and chemical engineering.

	This model utilizes the SAFE framework, which was introduced in the following paper:

	```bibtex
	@article{noutahi2024gotta,
	title={Gotta be SAFE: a new framework for molecular design},
	author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
	journal={Digital Discovery},
	volume={3},
	number={4},
	pages={796--804},
	year={2024},
	publisher={Royal Society of Chemistry}
	}
	```

	We acknowledge and thank the authors for their valuable contribution to the field of molecular design.

	## Intended uses & limitations

	This model is primarily intended for:

	- Generating molecular structures
	- Exploring chemical space for drug discovery
	- Assisting in the design of new materials

	Limitations:

	- The model's output should be validated by domain experts before practical application
	- Generated molecules may not always be synthetically feasible
	- The model's knowledge is limited to the chemical space represented in the ZINC dataset
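Before expert review, a batch of generated strings can be screened for validity and uniqueness. The following is a minimal sketch, not part of this model or of the SAFE library: `summarize` and `brackets_balanced` are hypothetical helpers, and the bracket check is only a placeholder for syntax, not a chemistry validator. In practice one would pass a real parser-based check instead, for example a function that returns `True` when RDKit's `Chem.MolFromSmiles` does not return `None`.

```python
from typing import Callable, Iterable


def brackets_balanced(s: str) -> bool:
    """Placeholder check (hypothetical): balanced () and [] only, NOT chemical validity."""
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in s:
        if ch in "([":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack


def summarize(samples: Iterable[str], is_valid: Callable[[str], bool]) -> dict:
    """Return the fraction of valid samples and the fraction of valid samples that are unique."""
    samples = list(samples)
    valid = [s for s in samples if is_valid(s)]
    n = len(samples)
    return {
        "n": n,
        "validity": len(valid) / n if n else 0.0,
        "uniqueness": len(set(valid)) / len(valid) if valid else 0.0,
    }


# Illustrative strings only, not actual model output.
example = ["CCO", "CCO", "c1ccccc1", "CC(=O", "C[N+](C)(C)C"]
report = summarize(example, brackets_balanced)
print(report)
```

Uniqueness here compares raw strings; two different strings can encode the same molecule, so canonicalizing with a chemistry toolkit before counting gives a stricter measure.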

	## Training and evaluation data

	The model was trained on the ZINC dataset (https://huggingface.co/datasets/sagawa/ZINC-canonicalized), converted to the SAFE format. ZINC is a large collection of commercially available chemical compounds curated for virtual screening.

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:

	- learning_rate: 0.0001
	- train_batch_size: 100
	- eval_batch_size: 100
	- seed: 42
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 200
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_steps: 10000
	- training_steps: 250000
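The hyperparameters above imply an effective batch size and a learning-rate curve that can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: training ran on a single device (consistent with 100 × 2 = 200), `training_steps` counts optimizer updates, and the linear scheduler ramps from 0 to the peak rate during warmup and then decays linearly to 0 at the final step. `linear_lr` is a hypothetical helper written for illustration, not the Trainer's own code.

```python
LEARNING_RATE = 1e-4
WARMUP_STEPS = 10_000
TRAINING_STEPS = 250_000
TRAIN_BATCH_SIZE = 100
GRAD_ACCUM_STEPS = 2


def linear_lr(step: int) -> float:
    """Learning rate at an optimizer step for linear warmup followed by linear decay."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, TRAINING_STEPS - step)
    return LEARNING_RATE * remaining / (TRAINING_STEPS - WARMUP_STEPS)


# Sequences consumed per optimizer update (single-device assumption).
effective_batch = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS

# Upper bound on sequences seen if all training steps were completed.
sequences_seen = effective_batch * TRAINING_STEPS

for step in (0, 5_000, 10_000, 130_000, 250_000):
    print(step, linear_lr(step))
print("effective batch:", effective_batch, "sequences seen:", sequences_seen)
```

The peak rate of 0.0001 is reached only at step 10,000; halfway through the decay phase (step 130,000) the rate has dropped to half of the peak.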

	### Framework versions

	- Transformers 4.44.2
	- PyTorch 2.4.0+cu121
	- Datasets 2.20.0
	- Tokenizers 0.19.1
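To check a local environment against the versions listed above, a small standard-library script is enough. This is a sketch; `compare_versions` is a hypothetical helper, and the keys are the PyPI distribution names assumed to correspond to the listed frameworks (`torch` for PyTorch). Other versions may work equally well; a mismatch is a hint, not an error.

```python
from importlib import metadata

# Versions listed in this card, keyed by assumed PyPI distribution name.
EXPECTED = {
    "transformers": "4.44.2",
    "torch": "2.4.0+cu121",
    "datasets": "2.20.0",
    "tokenizers": "0.19.1",
}


def compare_versions(expected: dict) -> dict:
    """Map each package to (expected, installed or None, exact match)."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed, installed == wanted)
    return report


for name, (wanted, installed, ok) in compare_versions(EXPECTED).items():
    status = "match" if ok else "differs"
    print(f"{name}: expected {wanted}, installed {installed} ({status})")
```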