---
license: apache-2.0
metrics:
- accuracy
- bleu
pipeline_tag: text2text-generation
tags:
- chemistry
- biology
- medical
- smiles
- iupac
- text-generation-inference
widget:
- text: ethanol
  example_title: CCO
---

# IUPAC2SMILES-canonical-small

IUPAC2SMILES-canonical-small is designed to translate IUPAC chemical names into SMILES accurately.

## Model Details

### Model Description

IUPAC2SMILES-canonical-small is based on the mT5 model, optimized by using different tokenizers for the encoder and the decoder.

- **Developed by:** Knowledgator Engineering
- **Model type:** Encoder-decoder with attention mechanism
- **Language(s) (NLP):** SMILES, IUPAC (English)
- **License:** Apache License 2.0

### Model Sources

- **Paper:** coming soon
- **Demo:** [ChemicalConverters](https://huggingface.co/spaces/knowledgator/ChemicalConverters)

## Quickstart

First, install the library:

```commandline
pip install chemical-converters
```

### IUPAC to SMILES

#### To perform simple translation, follow the example:

```python
from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-small")

# A single name is returned as a one-element list.
print(converter.iupac_to_smiles('ethanol'))
# A list of names yields one SMILES string per name.
print(converter.iupac_to_smiles(['ethanol', 'ethanol', 'ethanol']))
```

```text
['CCO']
['CCO', 'CCO', 'CCO']
```

#### Processing in batches:

```python
from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-small")

# Translate ten copies of the same name in one batched call.
names = ["buta-1,3-diene" for _ in range(10)]
print(converter.iupac_to_smiles(names, num_beams=1,
                                process_in_batch=True, batch_size=1000))
```

```text
['<SYST>C=CC=C', '<SYST>C=CC=C'...]
```

Our models also predict the IUPAC naming style, reported as one of the following tokens:

| Style Token | Description |
|-------------|-------------|
| `<BASE>` | The most common name of the substance; sometimes a mixture of the traditional and systematic styles |
| `<SYST>` | The fully systematic style, without trivial names |
| `<TRAD>` | A style based on trivial names for parts of the substance |

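The batch output above shows predictions prefixed with a style token such as `<SYST>`. A minimal sketch of separating that token from the SMILES string follows. The helper `split_style` is hypothetical and not part of `chemical-converters`; it only assumes the token, when present, leads the string exactly as in that output.

```python
import re

# Style tokens listed in the table above.
STYLE_TOKENS = ("BASE", "SYST", "TRAD")
_STYLE_PREFIX = re.compile(r"^<(%s)>" % "|".join(STYLE_TOKENS))


def split_style(prediction):
    """Return (style, smiles); style is None when no known token leads the string.

    Hypothetical helper, not an API of chemical-converters.
    """
    match = _STYLE_PREFIX.match(prediction)
    if match is None:
        return None, prediction
    return match.group(1), prediction[match.end():]


print(split_style("<SYST>C=CC=C"))  # ('SYST', 'C=CC=C')
print(split_style("CCO"))           # (None, 'CCO')
```

Returning `None` for the style keeps the helper usable on outputs that carry no token, as in the simple translation example.
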
## Bias, Risks, and Limitations

This model has limited accuracy on large molecules and currently does not support isomeric or isotopic SMILES.

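When comparing model output against reference data, it may help to set aside reference SMILES that fall outside this scope. The sketch below is one possible screen, not something the library provides: `is_plain_smiles` is a hypothetical helper that relies only on SMILES syntax, where `@` marks tetrahedral chirality, `/` and `\` mark double-bond geometry, and an isotope appears as a mass number right after an opening bracket.

```python
import re

# '@' marks tetrahedral chirality; '/' and '\' mark double-bond geometry.
_STEREO = re.compile(r"[@/\\]")
# An isotope is a mass number right after the opening bracket, e.g. [13C] or [2H].
_ISOTOPE = re.compile(r"\[\d+")


def is_plain_smiles(smiles):
    """True if the string has no stereo or isotope markers (hypothetical helper)."""
    return not (_STEREO.search(smiles) or _ISOTOPE.search(smiles))


print(is_plain_smiles("CCO"))              # True
print(is_plain_smiles("C[C@H](N)C(=O)O"))  # False
print(is_plain_smiles("[13CH4]"))          # False
```

This is a purely textual check; it does not validate the SMILES string itself.
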
### Training Procedure

The model was trained on 100M SMILES-IUPAC pairs for 2 epochs, with a learning rate of 0.0003 and a batch size of 1024.

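For a rough sense of scale, the figures above can be turned into an optimizer step count. This is a back-of-the-envelope sketch under assumptions the card does not state: "100M" is read as exactly 100,000,000 pairs, 1024 is taken as the global batch size, and the final partial batch of each epoch is kept.

```python
import math

num_examples = 100_000_000  # assumption: "100M" read as exactly 1e8 pairs
batch_size = 1024           # assumption: global batch size
epochs = 2

# Keep the last partial batch, so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 97657 195314
```
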
## Evaluation

| Model | Accuracy | BLEU-4 score | Size (MB) |
|-------|----------|--------------|-----------|
| IUPAC2SMILES-canonical-small | 88.9% | 0.966 | 23 |
| IUPAC2SMILES-canonical-base | 93.7% | 0.974 | 180 |
| STOUT V2.0\* | 68.47% | 0.92 | 128 |

\*According to the original paper: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00512-4

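The card does not define how accuracy is computed. As an illustration only, the sketch below assumes it means the fraction of predictions that exactly match the reference SMILES string; `exact_match_accuracy` is a hypothetical helper, not the authors' evaluation code.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference string.

    Assumption: accuracy is exact string match; the card does not confirm this.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(pred == ref for pred, ref in zip(predictions, references))
    return hits / len(references)


print(exact_match_accuracy(["CCO", "C=CC=C"], ["CCO", "CC"]))  # 0.5
```

Exact string comparison is only meaningful when both sides use the same canonical form, and any style token prefixed to a prediction would need to be removed first.
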
## Citation

Coming soon.

## Model Card Authors

[Mykhailo Shtopko](https://huggingface.co/BioMike)

## Model Card Contact

[email protected]