AmelieSchreiber committed
Commit ca6376e · 1 Parent(s): 1055666

Update README.md

Files changed (1): README.md (+32, -14)
README.md CHANGED

@@ -22,20 +22,26 @@ tags:
  # ESM-2 for Binding Site Prediction
  
  **This model may be overfit to some extent (see below).**
+ Try running [this notebook](https://huggingface.co/AmelieSchreiber/esm2_t12_35M_lora_binding_sites_v2_cp3/blob/main/testing_esmb.ipynb)
+ on the datasets linked to in the notebook. See if you can figure out why the metrics differ so much between the datasets. Is it due to something
+ like sequence similarity in the train/test split? Is there something fundamentally flawed with the method? Splitting the sequences based on family
+ in UniProt seemed to help, but perhaps a more rigorous approach is necessary?
  This model *may be* close to SOTA compared to [these SOTA structural models](https://www.biorxiv.org/content/10.1101/2023.08.11.553028v1).
  Note the especially high recall below.
+ 
  One of the primary goals in training this model is to prove the viability of using simple, single-sequence protein language models
  for binary token classification tasks like predicting binding and active sites of protein sequences based on sequence alone. This project
  is also an attempt to make deep learning techniques like LoRA more accessible and to showcase the competitive or even superior performance
- of simple models and techniques. Moreover, since most proteins still do not have a predicted 3D fold or backbone structure, it is useful to
+ of simple models and techniques. This, however, may not be as viable as other methods. The model seems to show good performance, but
+ testing based on [this notebook]() seems to indicate otherwise.
+ 
+ Since most proteins still do not have a predicted 3D fold or backbone structure, it is useful to
  have a model that can predict binding residues from sequence alone. We also hope that this project will be helpful in this regard.
  It has been shown that pLMs like ESM-2 contain structural information in the attention maps that recapitulate the contact maps of proteins,
  and that single-sequence masked language models like ESMFold can be used in atomically accurate predictions of folds, even outperforming
  AlphaFold2. In our approach we show that scaling the model size and data
- in a 1-to-1 fashion provides competitive and possibly even SOTA-comparable performance, although our comparison to the SOTA models is not as fair and
- comprehensive as it could be (see [this report for more details](https://api.wandb.ai/links/amelie-schreiber-math/0asqd3hs) and also
- [this report](https://wandb.ai/amelie-schreiber-math/huggingface/reports/ESM-2-Binding-Sites-Predictor-Part-3-Scaling-Results--Vmlldzo1NDA3Nzcy?accessToken=npsm0tatgumcidfwxubzjyuhal512xu8sjmpnf11sebktjm9mheg69ja397q57ok)).
-
+ in a 1-to-1 fashion provides what appears to be SOTA-comparable performance, although our comparison to the SOTA models is neither fair nor
+ comprehensive. Using the notebook linked above should help further evaluate the model, but initial findings seem pretty poor.
  
  This model is a finetuned version of the 35M parameter `esm2_t12_35M_UR50D` ([see here](https://huggingface.co/facebook/esm2_t12_35M_UR50D)
  and [here](https://huggingface.co/docs/transformers/model_doc/esm) for more details). The model was finetuned with LoRA for
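
As a concrete illustration of the setup described in the hunk above, here is a minimal sketch of attaching a LoRA adapter to `esm2_t12_35M_UR50D` for per-residue binary classification with the `transformers` and `peft` libraries. The rank, dropout, and target modules are illustrative assumptions, not the exact configuration used for this checkpoint.

```python
# Hedged sketch: LoRA on ESM-2 for binary token classification
# (binding site = 1, non-site = 0). Hyperparameters are illustrative.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForTokenClassification, AutoTokenizer

base_checkpoint = "facebook/esm2_t12_35M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    base_checkpoint,
    num_labels=2,  # binding site vs. non-binding site
)

lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=16,                  # assumed rank; the model card does not pin this down
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "key", "value"],  # ESM attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter is trainable
```

Training then proceeds as ordinary token classification over per-residue binding/non-binding labels, e.g. with the `transformers` `Trainer`.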
@@ -68,37 +74,46 @@ Let's analyze the train and test metrics one by one:
  - **Train**: 99.09%
  - **Test**: 94.86%
  
- The accuracy is notably high in both training and test datasets, indicating that the model makes correct predictions a significant majority of the time. The high accuracy on the test dataset signifies good generalization capabilities.
+ The accuracy is notably high in both training and test datasets, indicating that the model makes correct predictions a significant
+ majority of the time. The high accuracy on the test dataset signifies good generalization capabilities.
  
  ### **2. Precision**
  - **Train**: 77.49%
  - **Test**: 41.00%
  
- While the precision is quite good in the training dataset, it sees a decrease in the test dataset. This suggests that a substantial proportion of the instances that the model predicts as positive are actually negative, which could potentially lead to a higher false-positive rate.
+ While the precision is quite good in the training dataset, it sees a decrease in the test dataset. This suggests that a substantial
+ proportion of the instances that the model predicts as positive are actually negative, which could potentially lead to a higher
+ false-positive rate.
  
  ### **3. Recall**
  - **Train**: 98.62%
  - **Test**: 82.70%
  
- The recall is impressive in both the training and test datasets, indicating that the model is able to identify a large proportion of actual positive instances correctly. A high recall in the test dataset suggests that the model maintains its sensitivity in identifying positive cases when generalized to unseen data.
+ The recall is impressive in both the training and test datasets, indicating that the model is able to identify a large proportion of
+ actual positive instances correctly. A high recall in the test dataset suggests that the model maintains its sensitivity in identifying
+ positive cases when generalized to unseen data.
  
  ### **4. F1-Score**
  - **Train**: 86.79%
  - **Test**: 54.80%
  
- The F1-score, which is the harmonic mean of precision and recall, is good in the training dataset but sees a decrease in the test dataset. The decrease in the F1-score from training to testing suggests a worsened balance between precision and recall in the unseen data, largely due to a decrease in precision.
+ The F1-score, which is the harmonic mean of precision and recall, is good in the training dataset but sees a decrease in the test dataset.
+ The decrease in the F1-score from training to testing suggests a worsened balance between precision and recall in the unseen data,
+ largely due to a decrease in precision.
  
  ### **5. AUC (Area Under the ROC Curve)**
  - **Train**: 98.86%
  - **Test**: 89.02%
  
- The AUC is quite high in both the training and test datasets, indicating that the model has a good capability to distinguish between the positive and negative classes. A high AUC in the test dataset further suggests that the model generalizes well to unseen data.
+ The AUC is quite high in both the training and test datasets, indicating that the model has a good capability to distinguish
+ between the positive and negative classes. A high AUC in the test dataset further suggests that the model generalizes well to unseen data.
  
  ### **6. MCC (Matthews Correlation Coefficient)**
  - **Train**: 86.99%
  - **Test**: 56.06%
  
- The MCC, a balanced metric which takes into account true and false positives and negatives, is good in the training set but decreases in the test set. This suggests a diminished quality of binary classifications on the test dataset compared to the training dataset.
+ The MCC, a balanced metric which takes into account true and false positives and negatives, is good in the training set but decreases
+ in the test set. This suggests a diminished quality of binary classifications on the test dataset compared to the training dataset.
  
  ### **Overall Analysis**
  
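
To make the six metrics above concrete, the sketch below recomputes them with scikit-learn from flattened per-residue arrays. This is an assumed reconstruction of the evaluation, not the exact code behind the reported numbers; padding and special-token positions must be filtered out before flattening.

```python
# Hedged sketch: the six reported metrics from flattened per-residue
# arrays. y_true/y_pred are 0/1 labels; y_score is the predicted
# probability of the positive (binding-site) class, which AUC requires.
from sklearn.metrics import (
    accuracy_score,
    matthews_corrcoef,
    precision_recall_fscore_support,
    roc_auc_score,
)

def binding_site_metrics(y_true, y_pred, y_score):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary"  # score the positive class only
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "auc": roc_auc_score(y_true, y_score),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }
```

Note the pattern in the numbers: accuracy, recall, and AUC hold up on the test set while precision, F1, and MCC drop sharply, which is the kind of gap the overfitting caveat at the top of the card refers to.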
@@ -112,13 +127,16 @@ The MCC, a balanced metric which takes into account true and false positives and
  - **Complexity Reduction**: Consider reducing the model's complexity by training a LoRA for different weight matrices to prevent potential overfitting and improve generalization.
  - **Class Imbalance**: If the dataset has a class imbalance, techniques such as resampling or utilizing class weights might be beneficial.
  
- In conclusion, the model performs well on the training dataset and maintains a reasonably good performance on the test dataset, demonstrating a solid generalization capability. However, the decrease in certain metrics like precision and F1-score in the test dataset compared to the training dataset indicates room for improvement to optimize the model further for unseen data. It would be advantageous to enhance precision without significantly compromising recall to achieve a more harmonious balance between the two.
+ So, the model performs well on the training dataset and maintains a reasonably good performance on the test dataset,
+ demonstrating a good generalization capability. However, the decrease in certain metrics like precision and F1-score in the test
+ dataset compared to the training dataset indicates room for improvement to optimize the model further for unseen data. It would be
+ advantageous to enhance precision without significantly compromising recall to achieve a more harmonious balance between the two.
  
  ## Running Inference
  
  You can download and run [this notebook](https://huggingface.co/AmelieSchreiber/esm2_t12_35M_lora_binding_sites_v2_cp3/blob/main/testing_and_inference.ipynb)
  to test out any of the ESMB models. Be sure to download the datasets linked to in the notebook.
- Note, if you would like to run the models on the train/test split to get the metrics, you may need to do
+ Note, if you would like to run the models on the train/test split to get the metrics, you may need to do so
  locally or in a Colab Pro instance as the datasets are quite large and will not run in a standard Colab
  (you can still run inference on your own protein sequences though).
  
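
For orientation, here is a minimal single-sequence inference sketch, assuming the checkpoint is published as a PEFT adapter on top of the base ESM-2 model; the linked notebook remains the authoritative reference, and the example sequence is arbitrary.

```python
# Hedged sketch: predict binding residues for one sequence on CPU.
import torch
from peft import PeftModel
from transformers import AutoModelForTokenClassification, AutoTokenizer

base_checkpoint = "facebook/esm2_t12_35M_UR50D"
lora_checkpoint = "AmelieSchreiber/esm2_t12_35M_lora_binding_sites_v2_cp3"

tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
base_model = AutoModelForTokenClassification.from_pretrained(
    base_checkpoint, num_labels=2
)
# Assumes the repo stores a PEFT-format adapter rather than merged weights.
model = PeftModel.from_pretrained(base_model, lora_checkpoint)
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len + 2 special tokens, 2)

# Drop the <cls>/<eos> tokens ESM-2 adds at both ends, then take argmax.
labels = logits.argmax(dim=-1)[0, 1:-1].tolist()
binding_residues = [i for i, lab in enumerate(labels) if lab == 1]
print(binding_residues)  # 0-indexed positions predicted as binding sites
```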
@@ -127,7 +145,7 @@ locally or in a Colab Pro instance as the datasets are quite large and will not
  
  This model was finetuned with LoRA on ~549K protein sequences from the UniProt database. The dataset can be found
  [here](https://huggingface.co/datasets/AmelieSchreiber/binding_sites_random_split_by_family_550K). The model obtains
- the following test metrics:
+ the following test metrics, also shown above:
  
  ```python
  Epoch: 3
 