anrilombard/safe-100m

@@ -1,86 +1,105 @@
 ---
-datasets:
-- sagawa/ZINC-canonicalized
 library_name: transformers
 tags:
-- safe
-- datamol-io
-- molecule-design
-- smiles
-- generated_from_trainer
 model-index:
-- name: SAFE_100M
-  results: []
 ---
 # SAFE_100M
-This model was trained from scratch on the ZINC dataset converted to SAFE format for molecule generation tasks.
-It achieves the following results on the evaluation set:
-- Loss: 0.3887
-## Model description
-SAFE_100M is a transformer-based model designed for molecular generation tasks. It was trained on the ZINC dataset (https://huggingface.co/datasets/sagawa/ZINC-canonicalized), which has been converted to the SAFE (Sequential Attachment-based Fragment Embedding) format. This format is specifically tailored for improved molecular representation in machine learning tasks.
-The model is intended to generate valid and diverse molecular structures, which can be useful in various applications such as drug discovery, materials science, and chemical engineering.
-This model utilizes the SAFE framework, which was introduced in the following paper:
-```bibtex
-@article{noutahi2024gotta,
-  title={Gotta be SAFE: a new framework for molecular design},
-  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
-  journal={Digital Discovery},
-  volume={3},
-  number={4},
-  pages={796--804},
-  year={2024},
-  publisher={Royal Society of Chemistry}
-}
-```
-We acknowledge and thank the authors for their valuable contribution to the field of molecular design.
-## Intended uses & limitations
-This model is primarily intended for:
-- Generating molecular structures
-- Exploring chemical space for drug discovery
-- Assisting in the design of new materials
-Limitations:
-- The model's output should be validated by domain experts before practical application
-- Generated molecules may not always be synthetically feasible
-- The model's knowledge is limited to the chemical space represented in the ZINC dataset
-## Training and evaluation data
-The model was trained on the ZINC dataset (https://huggingface.co/datasets/sagawa/ZINC-canonicalized), which was converted to the SAFE format. The ZINC dataset is a large collection of commercially available chemical compounds for virtual screening.
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 0.0001
-- train_batch_size: 100
-- eval_batch_size: 100
-- seed: 42
-- gradient_accumulation_steps: 2
-- total_train_batch_size: 200
-- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
-- lr_scheduler_type: linear
-- lr_scheduler_warmup_steps: 10000
-- training_steps: 250000
-### Framework versions
-- Transformers 4.44.2
-- Pytorch 2.4.0+cu121
-- Datasets 2.20.0
-- Tokenizers 0.19.1

 ---
 library_name: transformers
 tags:
+  - safe
+  - datamol-io
+  - molecule-design
+  - smiles
+  - generated_from_trainer
+datasets:
+  - sagawa/ZINC-canonicalized
 model-index:
+  - name: SAFE_100M
+    results: []
 ---
 # SAFE_100M
+SAFE_100M is a transformer-based model for molecular generation. It was trained from scratch on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized) converted to the SAFE (Sequential Attachment-based Fragment Embedding) format, and it reaches a loss of **0.3887** on its evaluation set.
+## Table of Contents
+- [Model Description](#model-description)
+- [Intended Uses & Limitations](#intended-uses--limitations)
+- [Training and Evaluation Data](#training-and-evaluation-data)
+- [Training Procedure](#training-procedure)
+  - [Training Hyperparameters](#training-hyperparameters)
+  - [Framework Versions](#framework-versions)
+- [Acknowledgements](#acknowledgements)
+- [References](#references)
+## Model Description
+SAFE_100M builds on the [SAFE framework](#references), which represents molecules in the SAFE (Sequential Attachment-based Fragment Embedding) format. It was trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized) and is intended for applications such as:
+- **Drug Discovery**
+- **Materials Science**
+- **Chemical Engineering**
+The model is intended to generate valid and structurally diverse molecules for use in these fields.
+## Intended Uses & Limitations
+### Intended Uses
+SAFE_100M is designed to support:
+- **Molecular Structure Generation**: Generating novel molecular structures.
+- **Chemical Space Exploration**: Identifying potential candidates for drug development.
+- **Material Design Assistance**: Innovating new materials with specific characteristics.
+### Limitations
+Users should be aware of the following limitations:
+- **Validation Requirement**: Outputs should be reviewed by domain experts before practical application.
+- **Synthetic Feasibility**: Generated molecules may not always be synthesizable in a laboratory setting.
+- **Dataset Boundaries**: The model's knowledge is confined to the chemical space represented in the ZINC dataset.
+## Training and Evaluation Data
+The model was trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), a large collection of commercially available chemical compounds for virtual screening. The dataset was converted to the SAFE format before training.
+## Training Procedure
+### Training Hyperparameters
+SAFE_100M was trained with the following hyperparameters:
+- **Learning Rate**: `0.0001`
+- **Training Batch Size**: `100`
+- **Evaluation Batch Size**: `100`
+- **Random Seed**: `42`
+- **Gradient Accumulation Steps**: `2`
+- **Total Training Batch Size**: `200`
+- **Optimizer**: Adam (`betas=(0.9, 0.999)`, `epsilon=1e-08`)
+- **Learning Rate Scheduler**: Linear with `10,000` warmup steps
+- **Total Training Steps**: `250,000`
+### Framework Versions
+Training used the following library versions:
+- **Transformers**: `4.44.2`
+- **PyTorch**: `2.4.0+cu121`
+- **Datasets**: `2.20.0`
+- **Tokenizers**: `0.19.1`
+## Acknowledgements
+We extend our gratitude to the authors of the SAFE framework for their significant contributions to the field of molecular design.
+## References
+```bibtex
+@article{noutahi2024gotta,
+  title={Gotta be SAFE: a new framework for molecular design},
+  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
+  journal={Digital Discovery},
+  volume={3},
+  number={4},
+  pages={796--804},
+  year={2024},
+  publisher={Royal Society of Chemistry}
+}
+```
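
As a rough illustration of how the hyperparameters listed in the card relate to one another, the sketch below recomputes the total training batch size and the learning rate at a given step. It is not the author's training script. The names `TrainConfig` and `linear_lr` are hypothetical, and two assumptions are made: training ran on a single device (consistent with 100 × 2 = 200), and the `linear` scheduler warms up linearly from zero and then decays linearly to zero at the final step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameter values copied from the model card (hypothetical container)."""
    learning_rate: float = 1e-4
    train_batch_size: int = 100
    gradient_accumulation_steps: int = 2
    warmup_steps: int = 10_000
    training_steps: int = 250_000

    @property
    def total_train_batch_size(self) -> int:
        # Assumption: one device, so the effective batch size is the
        # per-device batch size times the number of accumulation steps.
        return self.train_batch_size * self.gradient_accumulation_steps


def linear_lr(step: int, cfg: TrainConfig) -> float:
    """Learning rate at `step` under linear warmup followed by linear decay.

    Assumption: the schedule starts at 0, peaks at cfg.learning_rate when
    warmup ends, and returns to 0 at cfg.training_steps.
    """
    if step < cfg.warmup_steps:
        return cfg.learning_rate * step / cfg.warmup_steps
    remaining = cfg.training_steps - step
    decay_span = cfg.training_steps - cfg.warmup_steps
    return cfg.learning_rate * max(0.0, remaining / decay_span)


if __name__ == "__main__":
    cfg = TrainConfig()
    print("total_train_batch_size:", cfg.total_train_batch_size)
    for step in (0, 5_000, 10_000, 130_000, 250_000):
        print(step, linear_lr(step, cfg))
```

Under these assumptions the learning rate peaks at `0.0001` at step 10,000 and falls to half that value at step 130,000, the midpoint of the decay phase.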