Commit ca6376e (parent: 1055666): Update README.md

README.md (CHANGED)

# ESM-2 for Binding Site Prediction

**This model may be overfit to some extent (see below).**
Try running [this notebook](https://huggingface.co/AmelieSchreiber/esm2_t12_35M_lora_binding_sites_v2_cp3/blob/main/testing_esmb.ipynb)
on the datasets linked to in the notebook, and see if you can figure out why the metrics differ so much across the datasets. Is it due to something
like sequence similarity in the train/test split? Is there something fundamentally flawed with the method? Splitting the sequences based on family
in UniProt seemed to help, but perhaps a more rigorous approach is necessary.
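One cheap way to start probing the split (this is not from the original repo; it assumes you have already loaded the two splits as Python lists of amino-acid strings, `train_seqs` and `test_seqs`) is to look for exact duplicates and shared subsequences across the splits:

```python
# Hypothetical leakage probe: `train_seqs` and `test_seqs` are assumed to be
# lists of amino-acid strings -- adapt the loading to the dataset's actual format.

def kmers(seq: str, k: int = 10) -> set:
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def leakage_report(train_seqs, test_seqs, k: int = 10):
    # Exact duplicates across the split are the most blatant form of leakage.
    exact = set(train_seqs) & set(test_seqs)
    print(f"Exact duplicate sequences across splits: {len(exact)}")

    # Shared k-mers are a crude proxy for high sequence similarity; tools like
    # MMseqs2 or CD-HIT give a more rigorous, clustering-based split.
    train_kmers = set().union(*(kmers(s, k) for s in train_seqs))
    hits = sum(1 for s in test_seqs if kmers(s, k) & train_kmers)
    print(f"Test sequences sharing at least one {k}-mer with train: {hits}/{len(test_seqs)}")
```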
This model *may be* close to SOTA compared to [these SOTA structural models](https://www.biorxiv.org/content/10.1101/2023.08.11.553028v1).
Note the especially high recall below.

One of the primary goals in training this model is to prove the viability of using simple, single-sequence-only protein language models
for binary token classification tasks like predicting binding and active sites of protein sequences based on sequence alone. This project
is also an attempt to make deep learning techniques like LoRA more accessible and to showcase the competitive or even superior performance
of simple models and techniques. This, however, may not be as viable as other methods: the model seems to show good performance, but
testing based on the notebook linked above seems to indicate otherwise.
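Concretely, binary token classification here means one 0/1 label per residue. A toy illustration of the framing (the labels below are invented for illustration, not real annotations):

```python
# Toy example of the task framing: one binary label per residue.
sequence = "MKTAYIAKQRQISFVK"                                # 16-residue toy sequence
labels   = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]  # 1 = binding/active site
assert len(sequence) == len(labels)                           # one label per residue
```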

Since most proteins still do not have a predicted 3D fold or backbone structure, it is useful to
have a model that can predict binding residues from sequence alone. We also hope that this project will be helpful in this regard.
It has been shown that pLMs like ESM-2 contain structural information in the attention maps that recapitulate the contact maps of proteins,
and that single-sequence masked language models like ESMFold can be used for atomically accurate fold prediction, in some cases even outperforming
AlphaFold2. In our approach, we show that scaling the model size and data in a 1-to-1 fashion provides what appears to be performance
comparable to SOTA (see [this report](https://wandb.ai/amelie-schreiber-math/huggingface/reports/ESM-2-Binding-Sites-Predictor-Part-3-Scaling-Results--Vmlldzo1NDA3Nzcy?accessToken=npsm0tatgumcidfwxubzjyuhal512xu8sjmpnf11sebktjm9mheg69ja397q57ok) for scaling results),
although our comparison to the SOTA models is neither fair nor comprehensive.
Using the notebook linked above should help further evaluate the model, but initial findings seem pretty poor.

This model is a finetuned version of the 35M parameter `esm2_t12_35M_UR50D` ([see here](https://huggingface.co/facebook/esm2_t12_35M_UR50D)
and [here](https://huggingface.co/docs/transformers/model_doc/esm) for more details). The model was finetuned with LoRA for …
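As a rough sketch of what such a LoRA setup looks like with the `peft` library (the rank, alpha, dropout, and target modules below are illustrative assumptions, not necessarily the values used for this checkpoint):

```python
# Sketch of a LoRA setup for ESM-2 token classification with Hugging Face peft.
# Hyperparameters here are assumed for illustration, not copied from this checkpoint.
from transformers import AutoModelForTokenClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "facebook/esm2_t12_35M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForTokenClassification.from_pretrained(base, num_labels=2)

lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=16,                                       # assumed rank
    lora_alpha=16,                              # assumed scaling
    lora_dropout=0.1,                           # assumed dropout
    target_modules=["query", "key", "value"],   # ESM attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trained
```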

Let's analyze the train and test metrics one by one:

### **1. Accuracy**
- **Train**: 99.09%
- **Test**: 94.86%

The accuracy is notably high in both the training and test datasets, indicating that the model makes correct predictions a significant
majority of the time. The high accuracy on the test dataset signifies good generalization, though with rare positive residues,
accuracy is dominated by the negative class (compare the precision below).

### **2. Precision**
- **Train**: 77.49%
- **Test**: 41.00%

While the precision is quite good on the training dataset, it drops sharply on the test dataset. This suggests that a substantial
proportion of the instances that the model predicts as positive are actually negative, i.e. a high
false-positive rate on unseen data.

### **3. Recall**
- **Train**: 98.62%
- **Test**: 82.70%

The recall is impressive in both the training and test datasets, indicating that the model is able to identify a large proportion of
actual positive instances correctly. A high recall in the test dataset suggests that the model maintains its sensitivity in identifying
positive cases when generalized to unseen data.

### **4. F1-Score**
- **Train**: 86.79%
- **Test**: 54.80%

The F1-score, the harmonic mean of precision and recall, is good on the training dataset but drops markedly on the test dataset.
The decrease from training to testing reflects a worsened balance between precision and recall on unseen data,
driven largely by the drop in precision.
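As a quick check, the reported test F1 follows directly from the reported test precision and recall:

```python
# F1 = 2PR / (P + R), using the test precision and recall reported above.
precision, recall = 0.4100, 0.8270
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.5482, matching the reported 54.80% up to rounding
```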

### **5. AUC (Area Under the ROC Curve)**
- **Train**: 98.86%
- **Test**: 89.02%

The AUC is quite high in both the training and test datasets, indicating that the model has a good capability to distinguish
between the positive and negative classes. A high AUC in the test dataset further suggests that the model generalizes well to unseen data.

### **6. MCC (Matthews Correlation Coefficient)**
- **Train**: 86.99%
- **Test**: 56.06%

The MCC, a balanced metric which takes into account true and false positives and negatives, is good on the training set but decreases
on the test set. This suggests a diminished quality of binary classifications on the test dataset compared to the training dataset.
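For reference, all six metrics can be computed from flattened per-residue predictions with scikit-learn. A generic sketch, assuming `y_true` (0/1 labels), `y_pred` (0/1 predictions), and `y_score` (predicted probability of the binding class) as flat arrays with padding and special-token positions already removed (the helper itself is hypothetical):

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

def report(y_true, y_pred, y_score):
    """Print the six metrics discussed above for flat per-residue arrays."""
    print(f"Accuracy:  {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
    print(f"F1:        {f1_score(y_true, y_pred):.4f}")
    print(f"AUC:       {roc_auc_score(y_true, y_score):.4f}")  # needs scores, not hard labels
    print(f"MCC:       {matthews_corrcoef(y_true, y_pred):.4f}")
```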

### **Overall Analysis**

- **Complexity Reduction**: Consider reducing the model's complexity by training a LoRA for different weight matrices to prevent potential overfitting and improve generalization.
- **Class Imbalance**: If the dataset has a class imbalance, techniques such as resampling or utilizing class weights might be beneficial (a class-weighting sketch follows below).
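For the class-weights option, a minimal sketch (the label counts below are placeholders; compute the real non-binding/binding frequencies from the training set):

```python
import torch

# Placeholder counts -- replace with the actual residue label counts.
num_neg, num_pos = 9_000_000, 1_000_000
class_weights = torch.tensor([1.0, num_neg / num_pos])  # upweight the rare positive class

# ignore_index=-100 skips padding/special-token positions in the loss.
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)
# loss = loss_fn(logits.view(-1, 2), labels.view(-1))
```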

So, the model performs well on the training dataset and maintains reasonably good performance on the test dataset,
demonstrating good generalization capability. However, the decrease in metrics like precision and F1-score on the test
dataset compared to the training dataset indicates room for improvement in optimizing the model for unseen data. It would be
advantageous to enhance precision without significantly compromising recall, achieving a better balance between the two.

## Running Inference

You can download and run [this notebook](https://huggingface.co/AmelieSchreiber/esm2_t12_35M_lora_binding_sites_v2_cp3/blob/main/testing_and_inference.ipynb)
to test out any of the ESMB models. Be sure to download the datasets linked to in the notebook.
Note: if you would like to run the models on the train/test split to get the metrics, you may need to do so
locally or in a Colab Pro instance, as the datasets are quite large and will not run in a standard Colab
(you can still run inference on your own protein sequences, though).
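For inference on a single sequence, something like the following should work, assuming this repo hosts a `peft` LoRA adapter over the ESM-2 base (the linked notebook remains the authoritative loading code):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForTokenClassification, AutoTokenizer

base_id = "facebook/esm2_t12_35M_UR50D"
adapter_id = "AmelieSchreiber/esm2_t12_35M_lora_binding_sites_v2_cp3"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForTokenClassification.from_pretrained(base_id, num_labels=2)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the LoRA adapter
model.eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # replace with your protein sequence
inputs = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, num_tokens, 2)

preds = logits.argmax(dim=-1)[0, 1:-1]       # drop the CLS/EOS special tokens
print([i for i, p in enumerate(preds.tolist()) if p == 1])  # predicted binding positions
```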

This model was finetuned with LoRA on ~549K protein sequences from the UniProt database. The dataset can be found
[here](https://huggingface.co/datasets/AmelieSchreiber/binding_sites_random_split_by_family_550K). The model obtains
the following test metrics, also shown above:

```python
Epoch: 3
```
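To fetch the dataset files for local inspection, one option is `huggingface_hub` (a sketch; list the repo contents first, since the internal file layout is not described here):

```python
from huggingface_hub import list_repo_files, snapshot_download

repo_id = "AmelieSchreiber/binding_sites_random_split_by_family_550K"
print(list_repo_files(repo_id, repo_type="dataset"))                 # see what files exist
local_dir = snapshot_download(repo_id=repo_id, repo_type="dataset")  # download everything
print("Downloaded to:", local_dir)
```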
|