anrilombard committed on
Commit b584f5f · verified · 1 Parent(s): b3ae2c1

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +77 -58
README.md CHANGED
@@ -1,86 +1,105 @@
  ---
- datasets:
- - sagawa/ZINC-canonicalized
  library_name: transformers
  tags:
- - safe
- - datamol-io
- - molecule-design
- - smiles
- - generated_from_trainer
  model-index:
- - name: SAFE_100M
-   results: []
  ---

  # SAFE_100M

- This model was trained from scratch on the ZINC dataset converted to SAFE format for molecule generation tasks.
- It achieves the following results on the evaluation set:

- - Loss: 0.3887

- ## Model description

- SAFE_100M is a transformer-based model designed for molecular generation tasks. It was trained on the ZINC dataset (https://huggingface.co/datasets/sagawa/ZINC-canonicalized), which has been converted to the SAFE (SMILES Augmented For Encoding) format. This format is specifically tailored for improved molecular representation in machine learning tasks.

- The model is intended to generate valid and diverse molecular structures, which can be useful in various applications such as drug discovery, materials science, and chemical engineering.

- This model utilizes the SAFE framework, which was introduced in the following paper:

- ```bibtex
- @article{noutahi2024gotta,
- title={Gotta be SAFE: a new framework for molecular design},
- author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
- journal={Digital Discovery},
- volume={3},
- number={4},
- pages={796--804},
- year={2024},
- publisher={Royal Society of Chemistry}
- }
- ```

- We acknowledge and thank the authors for their valuable contribution to the field of molecular design.

- ## Intended uses & limitations

- This model is primarily intended for:

- - Generating molecular structures
- - Exploring chemical space for drug discovery
- - Assisting in the design of new materials

- Limitations:

- - The model's output should be validated by domain experts before practical application
- - Generated molecules may not always be synthetically feasible
- - The model's knowledge is limited to the chemical space represented in the ZINC dataset

- ## Training and evaluation data

- The model was trained on the ZINC dataset (https://huggingface.co/datasets/sagawa/ZINC-canonicalized), which was converted to the SAFE format. The ZINC dataset is a large collection of commercially available chemical compounds for virtual screening.

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:

- - learning_rate: 0.0001
- - train_batch_size: 100
- - eval_batch_size: 100
- - seed: 42
- - gradient_accumulation_steps: 2
- - total_train_batch_size: 200
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 10000
- - training_steps: 250000

- ### Framework versions

- - Transformers 4.44.2
- - Pytorch 2.4.0+cu121
- - Datasets 2.20.0
- - Tokenizers 0.19.1
  ---
  library_name: transformers
  tags:
+ - safe
+ - datamol-io
+ - molecule-design
+ - smiles
+ - generated_from_trainer
+ datasets:
+ - sagawa/ZINC-canonicalized
  model-index:
+ - name: SAFE_100M
+   results: []
  ---

  # SAFE_100M

+ SAFE_100M is a cutting-edge transformer-based model developed for molecular generation tasks. Trained from scratch on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized) converted to the SAFE (SMILES Augmented For Encoding) format, SAFE_100M achieves a loss of **0.3887** on its evaluation set, demonstrating robust performance in generating valid and diverse molecular structures.

+ ## Table of Contents

+ - [Model Description](#model-description)
+ - [Intended Uses & Limitations](#intended-uses--limitations)
+ - [Training and Evaluation Data](#training-and-evaluation-data)
+ - [Training Procedure](#training-procedure)
+ - [Training Hyperparameters](#training-hyperparameters)
+ - [Framework Versions](#framework-versions)
+ - [Acknowledgements](#acknowledgements)
+ - [References](#references)

+ ## Model Description

+ SAFE_100M leverages the [SAFE framework](#references) to enhance molecular representation through the SMILES Augmented For Encoding format. By utilizing the comprehensive [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), the model excels in navigating chemical space, making it highly effective for applications such as:

+ - **Drug Discovery**
+ - **Materials Science**
+ - **Chemical Engineering**

+ The transformer architecture ensures the generation of both valid and structurally diverse molecules, facilitating innovative solutions across various scientific disciplines.
+
+ ## Intended Uses & Limitations
+
+ ### Intended Uses

+ SAFE_100M is designed to support:

+ - **Molecular Structure Generation**: Creating novel molecules with desired properties.
+ - **Chemical Space Exploration**: Identifying potential candidates for drug development.
+ - **Material Design Assistance**: Innovating new materials with specific characteristics.

+ ### Limitations

+ While SAFE_100M is a powerful tool, users should be aware of the following limitations:

+ - **Validation Requirement**: Outputs should be reviewed by domain experts before practical application.
+ - **Synthetic Feasibility**: Generated molecules may not always be synthesizable in a laboratory setting.
+ - **Dataset Boundaries**: The model's knowledge is confined to the chemical space represented in the ZINC dataset.

+ ## Training and Evaluation Data

+ The model was trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), a large repository of commercially available chemical compounds optimized for virtual screening. This dataset was transformed into the SAFE format to enhance molecular encoding for machine learning applications.

+ ## Training Procedure

+ ### Training Hyperparameters

+ SAFE_100M was trained with the following hyperparameters:

+ - **Learning Rate**: `0.0001`
+ - **Training Batch Size**: `100`
+ - **Evaluation Batch Size**: `100`
+ - **Random Seed**: `42`
+ - **Gradient Accumulation Steps**: `2`
+ - **Total Training Batch Size**: `200`
+ - **Optimizer**: Adam (`betas=(0.9, 0.999)`, `epsilon=1e-08`)
+ - **Learning Rate Scheduler**: Linear with `10,000` warmup steps
+ - **Total Training Steps**: `250,000`

+ ### Framework Versions

+ The training process utilized the following software frameworks:

+ - **Transformers**: `4.44.2`
+ - **PyTorch**: `2.4.0+cu121`
+ - **Datasets**: `2.20.0`
+ - **Tokenizers**: `0.19.1`
+
+ ## Acknowledgements
+
+ We extend our gratitude to the authors of the SAFE framework for their significant contributions to the field of molecular design.
+
+ ## References
+
+ ```bibtex
+ @article{noutahi2024gotta,
+ title={Gotta be SAFE: a new framework for molecular design},
+ author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
+ journal={Digital Discovery},
+ volume={3},
+ number={4},
+ pages={796--804},
+ year={2024},
+ publisher={Royal Society of Chemistry}
+ }
+ ```
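
The hyperparameters listed in the README can be sanity-checked with a little arithmetic. The sketch below is not part of the commit and is not the author's training code. It assumes a single training device (the README does not state the device count, but 100 × 2 × 1 matches the reported total of 200), a linear warmup followed by a linear decay to zero at the final step, and an evaluation loss that is a mean per-token cross-entropy in nats. The function names are hypothetical and chosen only for illustration.

```python
import math

# Values copied from the "Training Hyperparameters" list in the README.
LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 100
GRAD_ACCUM_STEPS = 2
WARMUP_STEPS = 10_000
TOTAL_STEPS = 250_000
EVAL_LOSS = 0.3887

# Assumption: the README does not state the number of devices.
NUM_DEVICES = 1


def effective_batch_size(per_device, accum_steps, devices=NUM_DEVICES):
    """Number of examples that contribute to one optimizer step."""
    return per_device * accum_steps * devices


def linear_lr(step, base_lr=LEARNING_RATE, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at `total`.

    Assumption: the run used the common warmup-then-decay shape of a
    "linear" scheduler; the README does not spell out the decay target.
    """
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))


def perplexity(loss):
    """Perplexity implied by a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS))
    for step in (0, 5_000, 10_000, 130_000, 250_000):
        print(f"lr at step {step}: {linear_lr(step):.2e}")
    print(f"perplexity implied by eval loss: {perplexity(EVAL_LOSS):.3f}")
```

Under these assumptions the learning rate peaks at `1e-4` once warmup ends at step 10,000, and the reported evaluation loss corresponds to a perplexity of roughly 1.48. A low loss of this kind measures next-token prediction only and does not by itself establish the validity or diversity of generated molecules.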