---
datasets:
- sagawa/ZINC-canonicalized
library_name: transformers
tags:
- safe
- datamol-io
- molecule-design
- smiles
- generated_from_trainer
model-index:
- name: SAFE_100M
  results: []
---

# SAFE_100M

SAFE_100M is a transformer-based model for molecular generation. It was trained from scratch on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized) converted to the SAFE (Sequential Attachment-based Fragment Embedding) format, and reaches a loss of **0.3887** on its evaluation set.

## Table of Contents

- [Model Description](#model-description)
- [Intended Uses & Limitations](#intended-uses--limitations)
- [Training and Evaluation Data](#training-and-evaluation-data)
- [Training Procedure](#training-procedure)
  - [Training Hyperparameters](#training-hyperparameters)
  - [Framework Versions](#framework-versions)
- [Acknowledgements](#acknowledgements)
- [References](#references)

## Model Description

SAFE_100M builds on the [SAFE framework](#references) (Sequential Attachment-based Fragment Embedding), which represents a molecule as a sequence of interconnected fragment blocks while remaining a valid SMILES string. Trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), the model is intended for exploring chemical space in applications such as:

- **Drug Discovery**
- **Materials Science**
- **Chemical Engineering**

The transformer is trained to generate SAFE strings that decode to valid, structurally diverse molecules for use across these disciplines.

## Intended Uses & Limitations

### Intended Uses

SAFE_100M is designed to support:

- **Molecular Structure Generation**: Creating novel molecules with desired properties.
- **Chemical Space Exploration**: Identifying potential candidates for drug development.
- **Material Design Assistance**: Innovating new materials with specific characteristics.

### Limitations

Users should be aware of the following limitations:

- **Validation Requirement**: Outputs should be reviewed by domain experts before practical application.
- **Synthetic Feasibility**: Generated molecules may not always be synthesizable in a laboratory setting.
- **Dataset Boundaries**: The model's knowledge is confined to the chemical space represented in the ZINC dataset.

## Training and Evaluation Data

The model was trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), a large collection of commercially available compounds curated for virtual screening. The dataset was converted to the SAFE format before training.
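
To give a rough feel for what a fragment-based SMILES string looks like, the sketch below splits one into its fragments. The example string is hand-written for illustration (toluene, conventionally `Cc1ccccc1`) and is not output from the SAFE library; the helper name `split_fragments` is hypothetical. It relies only on standard SMILES rules: `.` separates fragments, and a shared ring-closure digit (here `2`) marks where two fragments are bonded.

```python
from typing import List


def split_fragments(fragmented_smiles: str) -> List[str]:
    """Split a dot-separated SMILES string into its fragment substrings."""
    # In SMILES, "." only ever appears as the fragment separator,
    # so a plain split is sufficient for this illustration.
    return [frag for frag in fragmented_smiles.split(".") if frag]


# Hand-written illustration, NOT produced by the SAFE encoder:
# the aromatic ring carries attachment digit 2, and the methyl
# carbon "C2" bonds back to it through the same digit.
TOLUENE_FRAGMENTED = "c12ccccc1.C2"

fragments = split_fragments(TOLUENE_FRAGMENTED)
print(fragments)
```

The actual conversion used for training (fragmentation scheme, fragment ordering, digit assignment) is defined by the SAFE framework cited in the references and may differ from this hand-built string.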

## Training Procedure

### Training Hyperparameters

SAFE_100M was trained with the following hyperparameters:

- **Learning Rate**: `0.0001`
- **Training Batch Size**: `100`
- **Evaluation Batch Size**: `100`
- **Random Seed**: `42`
- **Gradient Accumulation Steps**: `2`
- **Total Training Batch Size**: `200`
- **Optimizer**: Adam (`betas=(0.9, 0.999)`, `epsilon=1e-08`)
- **Learning Rate Scheduler**: Linear with `10,000` warmup steps
- **Total Training Steps**: `250,000`
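
As a quick sanity check on how these values fit together, the sketch below recomputes the total batch size and evaluates the learning rate at a few steps. It makes two assumptions that the list does not state explicitly: training ran on a single device, and "Linear" means warmup from zero to the peak rate followed by linear decay to zero at the final step. The function names are illustrative only.

```python
# Values copied from the hyperparameter list above.
LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 100
GRAD_ACCUM_STEPS = 2
WARMUP_STEPS = 10_000
TOTAL_STEPS = 250_000


def effective_batch_size(per_step: int, accum_steps: int, num_devices: int = 1) -> int:
    """Number of sequences contributing to one optimizer update.

    num_devices=1 is an assumption; it is the only value consistent
    with the reported total of 200.
    """
    return per_step * accum_steps * num_devices


def linear_warmup_decay_lr(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * remaining / (TOTAL_STEPS - WARMUP_STEPS)


total_batch = effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS)
sequences_seen = total_batch * TOTAL_STEPS

print("total batch size:", total_batch)
print("sequences processed over training:", sequences_seen)
for step in (0, 5_000, 10_000, 130_000, 250_000):
    print(step, linear_warmup_decay_lr(step))
```

Under these assumptions the rate peaks at `0.0001` at step 10,000 and falls back to half of that by step 130,000.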

### Framework Versions

Training used the following library versions:

- **Transformers**: `4.44.2`
- **PyTorch**: `2.4.0+cu121`
- **Datasets**: `2.20.0`
- **Tokenizers**: `0.19.1`

## Acknowledgements

We extend our gratitude to the authors of the SAFE framework for their significant contributions to the field of molecular design.

## References

```bibtex
@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}
```