PDB requested to start inverse folding

#1
by sblimay - opened

I have a question about the inverse folding task (3.3 in the Colab SaprotHub). Should I upload a known PDB file? What is the point of giving the PDB file to obtain the exact AA sequence if I already know the AA sequence? Should the masking with '#' be applied in the PDB file or in the resulting sequence after running the 3.3.1 cell?
saprot-doubt.png
I don't really get the difference between giving the backbone structure and giving the PDB file.
Thank you

westlake-repl org
edited Jul 25, 2024

Hi, nice questions!

Should I upload a known pdb file?

Not necessarily. In a protein design pipeline, researchers may have a designed backbone structure without knowing its amino acids. In this case, all amino acids in the PDB file can be set to a meaningless placeholder such as "A", and the PDB file can then be uploaded to predict its primary sequence. On the other hand, even if you already know the AA sequence, you can still predict more diverse AA sequences that share the same structure, which is useful in protein engineering or protein design tasks.

Should the masking with '#' be applied in the PDB file or in the resulting sequence after running the 3.3.1 cell?

You only have to mask the amino acids in the input box after running the 3.3.1 cell. Note that our model only predicts the amino acids that are masked. So if you want to keep some positions unchanged, just keep the amino acids unmasked.
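As an illustration of the masking described above, here is a minimal sketch that builds a masked string to paste into the input box. The helper `mask_positions` and the example sequence are hypothetical and not part of the notebook; the only thing taken from this thread is that '#' marks the positions the model should predict.

```python
MASK = "#"  # mask character used in the 3.3 input box, per this thread


def mask_positions(sequence: str, positions) -> str:
    """Return a copy of `sequence` with the given 1-based positions set to '#'."""
    chars = list(sequence)
    for pos in positions:
        if not 1 <= pos <= len(chars):
            raise ValueError(f"position {pos} is outside 1..{len(chars)}")
        chars[pos - 1] = MASK
    return "".join(chars)


seq = "MKTAYIAK"  # made-up example sequence, not a real protein

# Redesign residues 3 to 5 only; all other positions stay fixed.
partially_masked = mask_positions(seq, range(3, 6))

# Redesign every position (for example, a backbone with unknown residues).
fully_masked = MASK * len(seq)

print(partially_masked)  # MK###IAK
print(fully_masked)      # ########
```

Positions left as letters are kept unchanged by the model, so the partially masked string corresponds to redesigning one region while preserving the rest.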

Hope this resolves your problem :)

Hi Saprot team, I read your papers but I still have doubts about what I understood, so here are my questions:

If I wanted to make a new protein from a known protein from ColabSaprotHub, should I go directly to the 3.3 reverse fold prediction inference section? When providing a PDB file, does the system tokenize both the amino acids and the 3D coordinates of the PDB, or just the amino acids? Then it is masked, and in 3.3.2 the new sequences are generated from the backbone, which would be the 3D coordinates of each atom in the PDB file?

On the other hand, it is not clear to me if I can mask amino acids within the PDB file, should I put a # in the amino acids directly? and then upload it to Colab in 3.3.1? Because I do that and it gives me an error.

Finally, if I knew the 3D structure that I wanted my protein to have, could I mask all the amino acids within the PDB or in the cell provided by Colab and a totally new protein would be generated, but the model would be inspired by the structure provided by the 3D coordinates?

westlake-repl org

If I wanted to make a new protein from a known protein from ColabSaprotHub, should I go directly to the 3.3 reverse fold prediction inference section?

Yes. Module 3.3 is for protein inverse folding inference. When you upload a PDB file, the system uses Foldseek to encode the 3D coordinates into structural tokens and also identifies the amino acids of the protein. So you will see two lines of sequences (one for amino acids and the other for Foldseek tokens).
image.png
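To make the two-line output more concrete, the sketch below pairs an amino-acid line with a structural-token line, position by position. It assumes the convention described in the SaProt paper, where each residue letter is followed by a lowercase Foldseek token; the function name `to_structure_aware` and both example strings are made up for illustration, and the notebook's internal representation may differ.

```python
def to_structure_aware(aa_seq: str, struct_seq: str) -> str:
    """Interleave residues with structural tokens, e.g. 'ME' + 'dv' -> 'MdEv'."""
    if len(aa_seq) != len(struct_seq):
        raise ValueError("amino-acid and structural sequences must have equal length")
    # Each position becomes a pair: residue letter (or '#') + lowercase token.
    return "".join(aa + st.lower() for aa, st in zip(aa_seq, struct_seq))


aa_seq = "MEVQ"      # made-up amino-acid line
struct_seq = "dvpp"  # made-up Foldseek structural-token line

print(to_structure_aware(aa_seq, struct_seq))             # MdEvVpQp
# Masking every residue keeps the structure but hides the amino acids.
print(to_structure_aware("#" * len(aa_seq), struct_seq))  # #d#v#p#p
```

Under this assumption, inverse folding amounts to keeping the structural half of every pair and asking the model to fill in the masked residue half.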

On the other hand, it is not clear to me if I can mask amino acids within the PDB file, should I put a # in the amino acids directly? and then upload it to Colab in 3.3.1?

You do not have to make any change to your PDB file. You only have to mask the amino acids in the input box after running the 3.3.1 cell. Note that our model only predicts the amino acids that are masked. So if you want to keep some positions unchanged, just keep the amino acids unmasked.

Hello,

I was trying out the 3.3 inverse folding module of Colab Saprot. When I load the PDB file of my known protein from the Protein Data Bank, the sequence is reported as 1227 amino acids, but when I tokenize it in 3.3.1 it reports 1996 amino acids. This forces me to cut the sequence from 1227 to 1996 in order to use 3.3.2 and run inference on the region that I masked. Why is this? How can I solve it?

Finally, when I generate the PDB file in 3.3.3 from a sequence generated in 3.3.2 and structurally align it with the initial known protein from 3.3.1, TM-align reports that the generated PDB now has 1024 amino acids. I don't understand why the sequence is cut again. Does it have to do with Saprot deciding to make an inference by ignoring amino acids rather than replacing them?

westlake-repl org

Hi,

For the first problem, could you give the PDB ID that you used for inverse folding?

For the second problem, where the generated PDB has 1024 amino acids, we found that it's because the system ignored all residues beyond the first 1024. We have removed this limit, so it should work well now.

The protein ID is 6NME. Its PDB and FASTA entries say it has 1227 amino acids, but 1996 are tokenized in 3.3.1.

Thanks for replying and solving the other problem.

westlake-repl org

Hi,

I downloaded the PDB file and it was tokenized into 1196 amino acids:

image.png

This is most likely a Foldseek issue, as we use Foldseek to parse the PDB file and get both the amino acid sequence and the structural sequence. I believe such cases are uncommon, and in most cases you will get normal results.
