---
library_name: mamba-ssm
tags:
  - safe
  - mamba
  - state-space-model
  - molecular-generation
  - smiles
  - generated_from_trainer
datasets:
  - sagawa/ZINC-canonicalized
model-index:
  - name: SSM_100M
    results: []
---

# SSM_100M

SSM_100M is a state space model (SSM) for molecular generation, built with the Mamba framework. **The model was trained using the code from [https://github.com/Anri-Lombard/Mamba-SAFE](https://github.com/Anri-Lombard/Mamba-SAFE).** It was trained from scratch on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), converted from SMILES to the SAFE (Sequential Attachment-based Fragment Embedding) format. SSM_100M relies on the linear-time scaling of state space models to match the performance of transformer-based models such as [SAFE_100M](https://huggingface.co/anrilombard/safe-100m) while using fewer computational resources.

## Evaluation Results

SSM_100M performs similarly to the transformer-based SAFE_100M model in molecular generation, maintaining high validity and diversity of generated molecules. It achieves these results with lower computational overhead, making it a more resource-efficient option for large-scale applications.

## Model Description

SSM_100M uses the Mamba framework's state space modeling to generate valid and diverse molecular structures efficiently. Training on ZINC converted from SMILES to SAFE format gives the model a fragment-based molecular encoding, which supports applications such as:

- **Drug Discovery:** Identifying potential drug candidates with optimal properties.
- **Materials Science:** Designing novel materials with targeted characteristics.
- **Chemical Engineering:** Developing new chemical processes and compounds more efficiently.

### Mamba Framework

The Mamba framework underpins SSM_100M, offering a robust architecture for linear-time sequence modeling with selective state spaces. It was introduced in the following paper:

```bibtex
@article{gu2023mamba,
  title={Mamba: Linear-time sequence modeling with selective state spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}
```

We thank the authors for their contributions to sequence modeling.
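
To make "linear-time sequence modeling with selective state spaces" concrete, here is a minimal sketch of the recurrence that a selective SSM evaluates. It is not the `mamba-ssm` API and not this model's code: the function name `selective_scan`, the scalar-input setup, and the simplified discretization (exponential for `A`, first-order for `B`) are assumptions chosen for illustration. The actual framework uses learned projections and a hardware-aware parallel scan.

```python
import math


def selective_scan(xs, deltas, A, Bs, Cs):
    """Run a diagonal state space model over a scalar sequence.

    Hypothetical illustration only. Cost is O(L * N) for sequence
    length L and state size N, i.e. linear in the sequence length.

    xs:     L input scalars
    deltas: L positive step sizes (input-dependent in a selective SSM)
    A:      N diagonal state-matrix entries (negative keeps the state stable)
    Bs, Cs: L vectors of length N (input-dependent in a selective SSM)
    """
    n = len(A)
    h = [0.0] * n  # hidden state, carried from one step to the next
    ys = []
    for x, dt, B, C in zip(xs, deltas, Bs, Cs):
        for i in range(n):
            a_bar = math.exp(dt * A[i])  # discretized state transition
            b_bar = dt * B[i]            # simplified discretized input map
            h[i] = a_bar * h[i] + b_bar * x
        # Read the output out of the state with the current C vector.
        ys.append(sum(C[i] * h[i] for i in range(n)))
    return ys
```

Because each step only updates a fixed-size state, the work per token stays constant as the sequence grows, in contrast to attention, where the cost per token grows with the context length.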

### SAFE Framework

SSM_100M represents molecules in the SAFE (Sequential Attachment-based Fragment Embedding) format, which writes each molecule as a sequence of connected fragments while remaining a valid SMILES string. The SAFE framework is detailed in the following publication:

```bibtex
@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}
```

We appreciate the authors' invaluable work in molecular design.

## Intended Uses & Limitations

### Intended Uses

SSM_100M is suitable for:

- **Molecular Structure Generation:** Creating new molecules with specific properties.
- **Chemical Space Exploration:** Navigating the vast landscape of possible chemical compounds for research and development.
- **Material Design:** Assisting in the creation of new materials with desired functionalities.

### Limitations

Users should be aware of the following limitations:

- **Validation Required:** Outputs should be validated by domain experts before use.
- **Synthetic Feasibility:** Generated molecules may not always be synthesizable in the lab.
- **Dataset Boundaries:** The model is limited to the chemical space of the ZINC dataset, which may restrict its applicability to novel or rare compounds outside this space.

## Training and Evaluation Data

SSM_100M was trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), a comprehensive collection of commercially available chemical compounds curated for virtual screening. The dataset was converted from SMILES to SAFE format before training, so the model learns from fragment-level molecular encodings.

## Training Procedure

### Training Hyperparameters

SSM_100M was trained with the following hyperparameters:

- **Learning Rate:** `0.0003`
- **Training Batch Size:** `64`
- **Evaluation Batch Size:** `64`
- **Random Seed:** `42`
- **Gradient Accumulation Steps:** `4`
- **Total Training Batch Size:** `256`
- **Optimizer:** Adam (`betas=(0.9, 0.98)`, `epsilon=1e-09`)
- **Learning Rate Scheduler:** Cosine with `50,000` warmup steps
- **Total Training Steps:** `300,000`
- **Model Parameters:** 100M
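
The batch-size arithmetic and the learning-rate schedule above can be sketched in a few lines of plain Python. This is an illustration, not the training code: it assumes a single device (consistent with 64 × 4 = 256), a linear warmup from zero, and a cosine decay that reaches zero at the final step. The helper names `effective_batch_size` and `lr_at` are hypothetical.

```python
import math

# Values taken from the hyperparameter list above.
PEAK_LR = 3e-4
PER_DEVICE_BATCH = 64
GRAD_ACCUM_STEPS = 4
WARMUP_STEPS = 50_000
TOTAL_STEPS = 300_000


def effective_batch_size(per_device, accum_steps, num_devices=1):
    """Sequences contributing to each optimizer update (single device assumed)."""
    return per_device * accum_steps * num_devices


def lr_at(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS))
    for step in (0, 25_000, 50_000, 175_000, 300_000):
        print(step, lr_at(step))
```

Under these assumptions the learning rate peaks at `0.0003` at step 50,000 and falls to half of that at step 175,000, the midpoint of the decay phase.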

### Framework Versions

The training utilized the following software frameworks:

- **Mamba:** `1.2.3`
- **PyTorch:** `2.0.1`
- **Datasets:** `2.20.0`
- **Tokenizers:** `0.19.1`
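
When reproducing results, it can help to compare an environment against the versions listed above. The following is a small sketch using only the standard library. The distribution names (`mamba-ssm`, `torch`, `datasets`, `tokenizers`) are assumptions about how these packages are published, and `version_mismatches` is a hypothetical helper, not part of any of these libraries.

```python
from importlib import metadata

# Versions from the list above; distribution names are assumed.
EXPECTED = {
    "mamba-ssm": "1.2.3",
    "torch": "2.0.1",
    "datasets": "2.20.0",
    "tokenizers": "0.19.1",
}


def version_mismatches(expected, lookup=metadata.version):
    """Return {name: (wanted, found)} for packages that differ or are missing."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore local build suffixes such as "+cu118" when comparing.
        base = found.split("+")[0] if found is not None else None
        if base != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in version_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

An empty result means every listed package is installed at the listed version.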

## Acknowledgements

We thank the authors and contributors of the following frameworks and datasets:

- **Mamba Framework:** For providing a solid foundation for state space modeling.
- **SAFE Framework:** For improving molecular representation with innovative encoding techniques.
- **ZINC Dataset Authors:** For curating a comprehensive dataset essential for training effective molecular generation models.

For more information and updates, visit the [Mamba-SAFE repository](https://github.com/Anri-Lombard/Mamba-SAFE).