	---
	library_name: transformers
	tags:
	- safe
	- datamol-io
	- molecule-design
	- smiles
	- generated_from_trainer
	datasets:
	- sagawa/ZINC-canonicalized
	model-index:
	- name: SAFE_100M
	  results: []
	---

	# SAFE_100M

	SAFE_100M is a transformer-based model for molecular generation. It was trained from scratch on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized) converted to the SAFE (Sequential Attachment-based Fragment Embedding) format and reaches a loss of 0.3887 on its evaluation set.

	## Model Description

	SAFE_100M builds on the [SAFE framework](#references), which writes each molecule as a sequence of interconnected fragment blocks that remains a valid SMILES string. Trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), the model targets applications such as:

	- Drug Discovery
	- Materials Science
	- Chemical Engineering

	The model is intended to generate valid and structurally diverse molecules for use across these disciplines. This card reports only the evaluation loss; it lists no validity or diversity metrics.

	## Intended Uses & Limitations

	### Intended Uses

	SAFE_100M is designed to support:

	- Molecular Structure Generation: Creating novel molecules with desired properties.
	- Chemical Space Exploration: Identifying potential candidates for drug development.
	- Material Design Assistance: Innovating new materials with specific characteristics.

	### Limitations

	Users should be aware of the following limitations:

	- Validation Requirement: Outputs should be reviewed by domain experts before practical application.
	- Synthetic Feasibility: Generated molecules may not always be synthesizable in a laboratory setting.
	- Dataset Boundaries: The model's knowledge is confined to the chemical space represented in the ZINC dataset.

	## Training and Evaluation Data

	The model was trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), a large collection of commercially available chemical compounds prepared for virtual screening. The dataset was converted to the SAFE format before training.

	## Training Procedure

	### Training Hyperparameters

	SAFE_100M was trained with the following hyperparameters:

	- Learning Rate: `0.0001`
	- Training Batch Size: `100`
	- Evaluation Batch Size: `100`
	- Random Seed: `42`
	- Gradient Accumulation Steps: `2`
	- Total Training Batch Size: `200`
	- Optimizer: Adam (`betas=(0.9, 0.999)`, `epsilon=1e-08`)
	- Learning Rate Scheduler: Linear with `10,000` warmup steps
	- Total Training Steps: `250,000`
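
To make the hyperparameters above concrete, the following minimal sketch reproduces the total batch size and a learning rate curve in plain Python. It rests on two assumptions that the card does not state: training ran on a single device (which is what `100 × 2 = 200` implies), and "Linear" means a linear ramp from zero during warmup followed by a linear decay to zero at the final step, the behavior of `get_linear_schedule_with_warmup` in Transformers. The function names are illustrative and are not part of the training code.

```python
# Values copied from the hyperparameter list above.
LEARNING_RATE = 1e-4
PER_DEVICE_BATCH_SIZE = 100
GRAD_ACCUM_STEPS = 2
WARMUP_STEPS = 10_000
TOTAL_STEPS = 250_000


def effective_batch_size(per_device: int, accum_steps: int, num_devices: int = 1) -> int:
    """Samples contributing to one optimizer update.

    num_devices=1 is an assumption; the card does not list the hardware.
    """
    return per_device * accum_steps * num_devices


def linear_warmup_lr(step: int) -> float:
    """Learning rate at a given optimizer step.

    Assumed shape: linear ramp 0 -> peak over the warmup steps,
    then linear decay peak -> 0 at TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return LEARNING_RATE * (step / max(1, WARMUP_STEPS))
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * (remaining / max(1, TOTAL_STEPS - WARMUP_STEPS))
```

Under these assumptions, `effective_batch_size(100, 2)` returns the listed total of 200, and the learning rate peaks at `0.0001` at step 10,000 before falling to half that value at step 130,000.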

	### Framework Versions

	The training process utilized the following software frameworks:

	- Transformers: `4.44.2`
	- PyTorch: `2.4.0+cu121`
	- Datasets: `2.20.0`
	- Tokenizers: `0.19.1`

	## Acknowledgements

	We extend our gratitude to the authors of the SAFE framework for their significant contributions to the field of molecular design.

	## References

	```bibtex
	@article{noutahi2024gotta,
	title={Gotta be SAFE: a new framework for molecular design},
	author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
	journal={Digital Discovery},
	volume={3},
	number={4},
	pages={796--804},
	year={2024},
	publisher={Royal Society of Chemistry}
	}
	```