---
language: en
tags:
- transformers
- protein
- peptide-receptor
license: apache-2.0
datasets:
- custom
---

## Model Description

This model predicts receptor classes, identified by their PDB IDs, from peptide sequences using the [ESM2](https://huggingface.co/docs/transformers/model_doc/esm) (Evolutionary Scale Modeling) protein language model with esm2_t6_8M_UR50D pre-trained weights. The model is fine-tuned for receptor prediction on datasets from [PROPEDIA](http://bioinfo.dcc.ufmg.br/propedia2/) and [PepNN](https://www.nature.com/articles/s42003-022-03445-2), as well as on novel peptides experimentally validated to bind their target proteins, with binding conformations determined using ClusPro, a protein-protein docking tool. The name `pep2rec_cppp` reflects the model's ability to predict peptide-to-receptor relationships, leveraging training data from ClusPro, PROPEDIA, and PepNN.

It is particularly useful for researchers and practitioners in bioinformatics, drug discovery, and related fields who aim to understand or predict peptide-receptor interactions.
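
The fine-tuning code itself is not published in this repository. As a rough, hypothetical sketch only (the base checkpoint, toy peptide/label pairs, and hyperparameters below are illustrative assumptions, not the actual training setup), receptor-class fine-tuning of an ESM2 checkpoint with the Hugging Face `Trainer` could look like this:

```python
# Hypothetical fine-tuning sketch, NOT the script used to train this model.
# Sequences, labels, checkpoint name, and hyperparameters are placeholders.
import torch
from torch.utils.data import Dataset
from sklearn.preprocessing import LabelEncoder
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

BASE_MODEL = "facebook/esm2_t6_8M_UR50D"  # an ESM2 base checkpoint

# Toy peptide sequences paired with receptor PDB-ID labels
pairs = [("GNLIVVGRVIMS", "1JXP"), ("ACDEFGHIKLMN", "3KEE"), ("MKTAYIAKQRQI", "5EAY")]
sequences, labels = zip(*pairs)

label_encoder = LabelEncoder()
targets = label_encoder.fit_transform(labels)  # PDB IDs -> integer class indices

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

class PeptideDataset(Dataset):
    """Wraps tokenized peptide sequences and integer receptor labels."""
    def __init__(self, seqs, ys):
        self.enc = tokenizer(list(seqs), truncation=True, padding=True, return_tensors="pt")
        self.ys = torch.tensor(ys)
    def __len__(self):
        return len(self.ys)
    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.enc.items()}
        item["labels"] = self.ys[idx]
        return item

# Sequence-classification head with one class per receptor PDB ID
model = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL, num_labels=len(label_encoder.classes_)
)

args = TrainingArguments(output_dir="pep2rec_finetune", num_train_epochs=10,
                         per_device_train_batch_size=8, learning_rate=2e-5,
                         report_to="none")
trainer = Trainer(model=model, args=args, train_dataset=PeptideDataset(sequences, targets))
trainer.train()
```

The actual ClusPro/PROPEDIA/PepNN-derived training pairs and optimization details are not included in this card; only the final run metrics in the Evaluation Results section below are available.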

## How to Use

Here is how to predict the receptor class for a peptide sequence using this model:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from joblib import load

# Load the fine-tuned model and tokenizer from the Hub
MODEL_PATH = "littleworth/esm2_t12_35M_UR50D_pep2rec_cppp"
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# Load the label encoder that maps class indices back to receptor PDB IDs
# (this path assumes the repository files are available locally)
LABEL_ENCODER_PATH = f"{MODEL_PATH}/label_encoder.joblib"
label_encoder = load(LABEL_ENCODER_PATH)

input_sequence = "GNLIVVGRVIMS"

inputs = tokenizer(input_sequence, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to class probabilities and pick the most likely receptor class
probabilities = torch.softmax(outputs.logits, dim=1)
predicted_class_idx = probabilities.argmax(dim=1).item()
predicted_class = label_encoder.inverse_transform([predicted_class_idx])[0]

# Rank all receptor classes by predicted probability
class_probabilities = probabilities.squeeze().tolist()
class_labels = label_encoder.inverse_transform(range(len(class_probabilities)))

sorted_indices = torch.argsort(probabilities, descending=True).squeeze()
sorted_class_labels = [class_labels[i] for i in sorted_indices.tolist()]
sorted_class_probabilities = probabilities.squeeze()[sorted_indices].tolist()

print(f"Predicted Receptor Class: {predicted_class}")
print("Top 10 Class Probabilities:")
for label, prob in zip(sorted_class_labels[:10], sorted_class_probabilities[:10]):
    print(f"{label}: {prob:.4f}")
```

This gives the following output:

```
Predicted Receptor Class: 1JXP
Top 10 Class Probabilities:
1JXP: 0.9839
3KEE: 0.0001
5EAY: 0.0001
1Z9O: 0.0001
2KBM: 0.0001
2FES: 0.0001
1MWN: 0.0001
5CFC: 0.0001
6O09: 0.0001
1DKD: 0.0001
```
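
Note that the `joblib.load` call in the usage snippet expects `label_encoder.joblib` to be reachable on the local filesystem (for example, in a cloned copy of this repository). A minimal sketch for fetching the file directly from the Hub first, assuming it is stored at the repository root, is:

```python
# Sketch: download label_encoder.joblib from the Hub instead of relying on a local clone.
# Assumes the file is stored at the root of this model repository.
from huggingface_hub import hf_hub_download
from joblib import load

encoder_path = hf_hub_download(
    repo_id="littleworth/esm2_t12_35M_UR50D_pep2rec_cppp",
    filename="label_encoder.joblib",
)
label_encoder = load(encoder_path)
```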

## Evaluation Results

The model was evaluated on a held-out test set. The final Weights & Biases run summary below reports both training and evaluation metrics; the key figure is an evaluation accuracy of roughly 0.78 (`eval/accuracy`):

```
{
  "train/loss": 0.727,
  "train/grad_norm": 4.4672017097473145,
  "train/learning_rate": 2.3235385792411667e-8,
  "train/epoch": 10,
  "train/global_step": 352910,
  "_timestamp": 1712189024.5060718,
  "_runtime": 503183.0418128967,
  "_step": 716,
  "eval/loss": 0.7138708829879761,
  "eval/accuracy": 0.7794731752930051,
  "eval/runtime": 5914.5446,
  "eval/samples_per_second": 15.912,
  "eval/steps_per_second": 15.912,
  "train/train_runtime": 497231.6027,
  "train/train_samples_per_second": 5.678,
  "train/train_steps_per_second": 0.71,
  "train/total_flos": 600463318555361300,
  "train/train_loss": 0.9245198557043193,
  "_wandb": {
    "runtime": 503182
  }
}
```
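
These figures come from the training run summary. If you want to sanity-check the reported accuracy on your own labeled peptide-receptor pairs, a minimal sketch (the test pairs below are placeholders, not the actual held-out set) could look like this:

```python
# Sketch: measure top-1 accuracy of the released model on a small labeled set.
# The (peptide, PDB ID) pairs below are placeholders for real evaluation data.
import torch
from joblib import load
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_PATH = "littleworth/esm2_t12_35M_UR50D_pep2rec_cppp"
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
label_encoder = load("label_encoder.joblib")  # fetched beforehand, see above

test_pairs = [("GNLIVVGRVIMS", "1JXP")]  # placeholder examples

correct = 0
for sequence, true_pdb_id in test_pairs:
    inputs = tokenizer(sequence, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted = label_encoder.inverse_transform([logits.argmax(dim=1).item()])[0]
    correct += int(predicted == true_pdb_id)

print(f"Top-1 accuracy: {correct / len(test_pairs):.4f}")
```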