ZhiyuanChen
committed on
Update README.md
README.md
CHANGED
@@ -10,6 +10,19 @@ library_name: multimolecule
|
|
10 |
pipeline_tag: fill-mask
|
11 |
mask_token: "<mask>"
|
12 |
widget:
13 |
- example_title: "microRNA-21"
|
14 |
text: "UAGC<mask>UAUCAGACUGAUGUUGA"
|
15 |
output:
|
@@ -63,8 +76,8 @@ UTR-LM is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-style m
|
|
63 |
|
64 |
### Variations
|
65 |
|
66 |
-
- **[`multimolecule/utrlm
|
67 |
-
- **[`multimolecule/utrlm
|
68 |
|
69 |
### Model Specification
|
70 |
|
@@ -110,7 +123,7 @@ UTR-LM is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-style m
|
|
110 |
- **Paper**: [A 5’ UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions](http://doi.org/10.1038/s41467-021-24436-7)
|
111 |
- **Developed by**: Yanyi Chu, Dan Yu, Yupeng Li, Kaixuan Huang, Yue Shen, Le Cong, Jason Zhang, Mengdi Wang
|
112 |
- **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [ESM](https://huggingface.co/facebook/esm2_t48_15B_UR50D)
|
113 |
-
- **Original Repository**: [
|
114 |
|
115 |
## Usage
|
116 |
|
@@ -127,29 +140,29 @@ You can use this model directly with a pipeline for masked language modeling:
|
|
127 |
```python
|
128 |
>>> import multimolecule # you must import multimolecule to register models
|
129 |
>>> from transformers import pipeline
|
130 |
-
>>> unmasker = pipeline(
|
131 |
-
>>> unmasker("
|
132 |
|
133 |
-
[{'score': 0.
|
134 |
'token': 23,
|
135 |
'token_str': '*',
|
136 |
-
'sequence': '
|
137 |
-
{'score': 0.
|
138 |
'token': 5,
|
139 |
'token_str': '<null>',
|
140 |
-
'sequence': 'U
|
141 |
-
{'score': 0.
|
142 |
-
'token':
|
143 |
-
'token_str': '
|
144 |
-
'sequence': 'U
|
145 |
-
{'score': 0.
|
146 |
'token': 10,
|
147 |
'token_str': 'N',
|
148 |
-
'sequence': '
|
149 |
-
{'score': 0.
|
150 |
-
'token':
|
151 |
-
'token_str': '
|
152 |
-
'sequence': 'U
|
153 |
```
|
154 |
|
155 |
### Downstream Use
|
@@ -162,11 +175,11 @@ Here is how to use this model to get the features of a given sequence in PyTorch
|
|
162 |
from multimolecule import RnaTokenizer, UtrLmModel
|
163 |
|
164 |
|
165 |
-
tokenizer = RnaTokenizer.from_pretrained(
|
166 |
-
model = UtrLmModel.from_pretrained(
|
167 |
|
168 |
text = "UAGCUUAUCAGACUGAUGUUGA"
|
169 |
-
input = tokenizer(text, return_tensors=
|
170 |
|
171 |
output = model(**input)
|
172 |
```
|
@@ -182,17 +195,17 @@ import torch
|
|
182 |
from multimolecule import RnaTokenizer, UtrLmForSequencePrediction
|
183 |
|
184 |
|
185 |
-
tokenizer = RnaTokenizer.from_pretrained(
|
186 |
-
model = UtrLmForSequencePrediction.from_pretrained(
|
187 |
|
188 |
text = "UAGCUUAUCAGACUGAUGUUGA"
|
189 |
-
input = tokenizer(text, return_tensors=
|
190 |
label = torch.tensor([1])
|
191 |
|
192 |
output = model(**input, labels=label)
|
193 |
```
|
194 |
|
195 |
-
####
|
196 |
|
197 |
**Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
|
198 |
|
@@ -200,14 +213,14 @@ Here is how to use this model as backbone to fine-tune for a nucleotide-level ta
|
|
200 |
|
201 |
```python
|
202 |
import torch
|
203 |
-
from multimolecule import RnaTokenizer,
|
204 |
|
205 |
|
206 |
-
tokenizer = RnaTokenizer.from_pretrained(
|
207 |
-
model =
|
208 |
|
209 |
text = "UAGCUUAUCAGACUGAUGUUGA"
|
210 |
-
input = tokenizer(text, return_tensors=
|
211 |
label = torch.randint(2, (len(text), ))
|
212 |
|
213 |
output = model(**input, labels=label)
|
@@ -224,11 +237,11 @@ import torch
|
|
224 |
from multimolecule import RnaTokenizer, UtrLmForContactPrediction
|
225 |
|
226 |
|
227 |
-
tokenizer = RnaTokenizer.from_pretrained(
|
228 |
-
model = UtrLmForContactPrediction.from_pretrained(
|
229 |
|
230 |
text = "UAGCUUAUCAGACUGAUGUUGA"
|
231 |
-
input = tokenizer(text, return_tensors=
|
232 |
label = torch.randint(2, (len(text), len(text)))
|
233 |
|
234 |
output = model(**input, labels=label)
|
|
|
10 |
pipeline_tag: fill-mask
|
11 |
mask_token: "<mask>"
|
12 |
widget:
|
13 |
+
- example_title: "HIV-1"
|
14 |
+
text: "GGUC<mask>CUCUGGUUAGACCAGAUCUGAGCCU"
|
15 |
+
output:
|
16 |
+
- label: "*"
|
17 |
+
score: 0.07707168161869049
|
18 |
+
- label: "<null>"
|
19 |
+
score: 0.07588472962379456
|
20 |
+
- label: "U"
|
21 |
+
score: 0.07178673148155212
|
22 |
+
- label: "N"
|
23 |
+
score: 0.06414645165205002
|
24 |
+
- label: "Y"
|
25 |
+
score: 0.06385370343923569
|
26 |
- example_title: "microRNA-21"
|
27 |
text: "UAGC<mask>UAUCAGACUGAUGUUGA"
|
28 |
output:
|
|
|
76 |
|
77 |
### Variations
|
78 |
|
79 |
+
- **[`multimolecule/utrlm-te_el`](https://huggingface.co/multimolecule/utrlm-te_el)**: The UTR-LM model for Translation Efficiency of transcripts and mRNA Expression Level.
|
80 |
+
- **[`multimolecule/utrlm-mrl`](https://huggingface.co/multimolecule/utrlm-mrl)**: The UTR-LM model for Mean Ribosome Loading.
|
81 |
|
82 |
### Model Specification
|
83 |
|
|
|
123 |
- **Paper**: [A 5’ UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions](http://doi.org/10.1038/s41467-021-24436-7)
|
124 |
- **Developed by**: Yanyi Chu, Dan Yu, Yupeng Li, Kaixuan Huang, Yue Shen, Le Cong, Jason Zhang, Mengdi Wang
|
125 |
- **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [ESM](https://huggingface.co/facebook/esm2_t48_15B_UR50D)
|
126 |
+
- **Original Repository**: [a96123155/UTR-LM](https://github.com/a96123155/UTR-LM)
|
127 |
|
128 |
## Usage
|
129 |
|
|
|
140 |
```python
|
141 |
>>> import multimolecule # you must import multimolecule to register models
|
142 |
>>> from transformers import pipeline
|
143 |
+
>>> unmasker = pipeline("fill-mask", model="multimolecule/utrlm-te_el")
|
144 |
+
>>> unmasker("gguc<mask>cucugguuagaccagaucugagccu")
|
145 |
|
146 |
+
[{'score': 0.07707168161869049,
|
147 |
'token': 23,
|
148 |
'token_str': '*',
|
149 |
+
'sequence': 'G G U C * C U C U G G U U A G A C C A G A U C U G A G C C U'},
|
150 |
+
{'score': 0.07588472962379456,
|
151 |
'token': 5,
|
152 |
'token_str': '<null>',
|
153 |
+
'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'},
|
154 |
+
{'score': 0.07178673148155212,
|
155 |
+
'token': 9,
|
156 |
+
'token_str': 'U',
|
157 |
+
'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},
|
158 |
+
{'score': 0.06414645165205002,
|
159 |
'token': 10,
|
160 |
'token_str': 'N',
|
161 |
+
'sequence': 'G G U C N C U C U G G U U A G A C C A G A U C U G A G C C U'},
|
162 |
+
{'score': 0.06385370343923569,
|
163 |
+
'token': 12,
|
164 |
+
'token_str': 'Y',
|
165 |
+
'sequence': 'G G U C Y C U C U G G U U A G A C C A G A U C U G A G C C U'}]
|
166 |
```
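
The five scores in the output above are close together, so it can help to post-process the list before reading much into the top hit. The following is a minimal sketch that works only on the values copied from that output; the `CANONICAL` filter and the variable names are illustrative assumptions, not part of `multimolecule` or `transformers`.

```python
# Top-5 predictions copied from the fill-mask output shown above.
predictions = [
    {"token_str": "*", "score": 0.07707168161869049},
    {"token_str": "<null>", "score": 0.07588472962379456},
    {"token_str": "U", "score": 0.07178673148155212},
    {"token_str": "N", "score": 0.06414645165205002},
    {"token_str": "Y", "score": 0.06385370343923569},
]

# Hypothetical filter: keep only the four canonical RNA nucleotides.
CANONICAL = {"A", "C", "G", "U"}
canonical_hits = [p for p in predictions if p["token_str"] in CANONICAL]

# Total probability covered by the five listed candidates.
top5_mass = sum(p["score"] for p in predictions)

# Highest-scoring candidate that is a canonical nucleotide.
best = max(canonical_hits, key=lambda p: p["score"])

print(f"top-5 probability mass: {top5_mass:.4f}")
print(f"best canonical nucleotide: {best['token_str']}")
```

The five candidates together cover only about 0.35 of the probability mass, and `U` is the only canonical nucleotide among them.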
|
167 |
|
168 |
### Downstream Use
|
|
|
175 |
from multimolecule import RnaTokenizer, UtrLmModel
|
176 |
|
177 |
|
178 |
+
tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrlm-te_el")
|
179 |
+
model = UtrLmModel.from_pretrained("multimolecule/utrlm-te_el")
|
180 |
|
181 |
text = "UAGCUUAUCAGACUGAUGUUGA"
|
182 |
+
input = tokenizer(text, return_tensors="pt")
|
183 |
|
184 |
output = model(**input)
|
185 |
```
|
|
|
195 |
from multimolecule import RnaTokenizer, UtrLmForSequencePrediction
|
196 |
|
197 |
|
198 |
+
tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrlm-te_el")
|
199 |
+
model = UtrLmForSequencePrediction.from_pretrained("multimolecule/utrlm-te_el")
|
200 |
|
201 |
text = "UAGCUUAUCAGACUGAUGUUGA"
|
202 |
+
input = tokenizer(text, return_tensors="pt")
|
203 |
label = torch.tensor([1])
|
204 |
|
205 |
output = model(**input, labels=label)
|
206 |
```
|
207 |
|
208 |
+
#### Token Classification / Regression
|
209 |
|
210 |
**Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
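
The snippets in this README build labels at three granularities: one per sequence, one per nucleotide, and one per nucleotide pair. The sketch below reproduces only those label shapes with plain PyTorch, using the same example sequence. It does not load a model, and it makes no claim about how the prediction heads treat special tokens added by the tokenizer.

```python
import torch

text = "UAGCUUAUCAGACUGAUGUUGA"
n = len(text)  # number of nucleotides in the example sequence

# One label for the whole sequence (sequence-level prediction).
sequence_label = torch.tensor([1])

# One label per nucleotide (token-level prediction).
token_label = torch.randint(2, (n,))

# One label per pair of nucleotides (contact prediction).
contact_label = torch.randint(2, (n, n))

print(sequence_label.shape, token_label.shape, contact_label.shape)
```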
|
211 |
|
|
|
213 |
|
214 |
```python
|
215 |
import torch
|
216 |
+
from multimolecule import RnaTokenizer, UtrLmForTokenPrediction
|
217 |
|
218 |
|
219 |
+
tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrlm-te_el")
|
220 |
+
model = UtrLmForTokenPrediction.from_pretrained("multimolecule/utrlm-te_el")
|
221 |
|
222 |
text = "UAGCUUAUCAGACUGAUGUUGA"
|
223 |
+
input = tokenizer(text, return_tensors="pt")
|
224 |
label = torch.randint(2, (len(text), ))
|
225 |
|
226 |
output = model(**input, labels=label)
|
|
|
237 |
from multimolecule import RnaTokenizer, UtrLmForContactPrediction
|
238 |
|
239 |
|
240 |
+
tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrlm-te_el")
|
241 |
+
model = UtrLmForContactPrediction.from_pretrained("multimolecule/utrlm-te_el")
|
242 |
|
243 |
text = "UAGCUUAUCAGACUGAUGUUGA"
|
244 |
+
input = tokenizer(text, return_tensors="pt")
|
245 |
label = torch.randint(2, (len(text), len(text)))
|
246 |
|
247 |
output = model(**input, labels=label)
|