WarmMolGenOne

A target specific molecule generator model which is warm started (i.e. initialized) from pretrained biochemical language models and trained on interacting protein-compound pairs, viewing targeted molecular generation as a translation task between protein and molecular languages. It was introduced in the paper, "Exploiting pretrained biochemical language models for targeted drug design", which has been accepted for publication in Bioinformatics Published by Oxford University Press and first released in this repository.

WarmMolGenOne is a Transformer-based encoder-decoder model initialized with Protein RoBERTa and ChemBERTa checkpoints and trained on interacting protein-compound pairs filtered from BindingDB. The model takes a protein sequence as an input and outputs a SMILES sequence.

How to use

from transformers import EncoderDecoderModel, RobertaTokenizer, pipeline
protein_tokenizer = RobertaTokenizer.from_pretrained("gokceuludogan/WarmMolGenOne")
mol_tokenizer = RobertaTokenizer.from_pretrained("seyonec/PubChem10M_SMILES_BPE_450k")
model = EncoderDecoderModel.from_pretrained("gokceuludogan/WarmMolGenOne")
inputs = protein_tokenizer("MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTG", >>> return_tensors="pt")
outputs = model.generate(**inputs, decoder_start_token_id=mol_tokenizer.bos_token_id, 
                          eos_token_id=mol_tokenizer.eos_token_id, pad_token_id=mol_tokenizer.eos_token_id, 
                          max_length=128, num_return_sequences=5, do_sample=True, top_p=0.95)
mol_tokenizer.batch_decode(outputs, skip_special_tokens=True)
# Sample output
# ['Cn1cc(nn1)-c1ccccc1NS(=O)(=O)c1ccc2[nH]ccc2c1',
# 'CC(C)(C)c1[se]nc2sc(cc12)C(O)=O',
# '[O-][N+](=O)c1ccc(CN2CCC(CC2)NC(=O)c2cccc3ccccc23)cc1',
# 'OC(=O)CNC(=O)CCC\\C=C\\CN1[C@@H](Cc2cn(nn2)-c2ccccc2)C(=O)N[C@@H](CCCN2C(S)=NC(C)(C2=O)c2ccc(F)cc2)C1=O',
# 'OCC1(CCC1)C(=O)NCC1CCN(CC1)c1nc(c(s1)-c1ccc2OCOc2c1)C(O)=O']

Citation

@article{10.1093/bioinformatics/btac482,
    author = {Uludoğan, Gökçe and Ozkirimli, Elif and Ulgen, Kutlu O. and Karalı, Nilgün Lütfiye and Özgür, Arzucan},
    title = "{Exploiting Pretrained Biochemical Language Models for Targeted Drug Design}",
    journal = {Bioinformatics},
    year = {2022},
    doi = {10.1093/bioinformatics/btac482},
    url = {https://doi.org/10.1093/bioinformatics/btac482}
}
Downloads last month
11
Inference Examples
Inference API (serverless) has been turned off for this model.

Spaces using gokceuludogan/WarmMolGenOne 2