ZhiyuanChen committed: Update README.md

README.md
```yaml
library_name: multimolecule
pipeline_tag: fill-mask
mask_token: "<mask>"
widget:
- example_title: "HIV-1"
  text: "GGUC<mask>CUCUGGUUAGACCAGAUCUGAGCCU"
  output:
  - label: "G"
    score: 0.09252794831991196
  - label: "R"
    score: 0.09062391519546509
  - label: "A"
    score: 0.08875908702611923
  - label: "V"
    score: 0.07809742540121078
  - label: "S"
    score: 0.07325706630945206
- example_title: "microRNA-21"
  text: "UAGC<mask>UAUCAGACUGAUGUUGA"
  output:
```
- **Paper**: Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning
- **Developed by**: Ning Wang, Jiang Bian, Yuchen Li, Xuhong Li, Shahid Mumtaz, Linghe Kong, Haoyi Xiong
- **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [ERNIE](https://huggingface.co/nghuyong/ernie-3.0-base-zh)
- **Original Repository**: [CatIIIIIIII/RNAErnie](https://github.com/CatIIIIIIII/RNAErnie)

## Usage
You can use this model directly with a pipeline for masked language modeling:

```python
>>> import multimolecule  # you must import multimolecule to register models
>>> from transformers import pipeline
>>> unmasker = pipeline("fill-mask", model="multimolecule/rnaernie")
>>> unmasker("gguc<mask>cucugguuagaccagaucugagccu")

[{'score': 0.09252794831991196,
  'token': 8,
  'token_str': 'G',
  'sequence': 'G G U C G C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.09062391519546509,
  'token': 11,
  'token_str': 'R',
  'sequence': 'G G U C R C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.08875908702611923,
  'token': 6,
  'token_str': 'A',
  'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.07809742540121078,
  'token': 20,
  'token_str': 'V',
  'sequence': 'G G U C V C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.07325706630945206,
  'token': 13,
  'token_str': 'S',
  'sequence': 'G G U C S C U C U G G U U A G A C C A G A U C U G A G C C U'}]
```
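By default the fill-mask pipeline returns the five highest-scoring candidates; the standard `transformers` pipeline accepts a `top_k` argument to change that. A minimal sketch (the argument comes from `transformers`, not from this model card):

```python
>>> unmasker("gguc<mask>cucugguuagaccagaucugagccu", top_k=2)  # keep only the two best candidates
```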
### Downstream Use

Here is how to use this model to get the features of a given sequence in PyTorch:

```python
from multimolecule import RnaTokenizer, RnaErnieModel


tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnaernie")
model = RnaErnieModel.from_pretrained("multimolecule/rnaernie")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")

output = model(**input)
```
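`output` follows the standard Transformers output convention, so `output.last_hidden_state` holds one embedding per token. If you need a single fixed-size vector per sequence, one common recipe is mean pooling; this sketch is an illustration, not part of the original card:

```python
# Mean-pool token embeddings into one vector per sequence.
# Note: the average includes the special <cls>/<eos> tokens added by the tokenizer.
embedding = output.last_hidden_state.mean(dim=1)  # shape: (batch_size, hidden_size)
```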
#### Sequence Classification / Regression

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

```python
import torch
from multimolecule import RnaTokenizer, RnaErnieForSequencePrediction


tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnaernie")
model = RnaErnieForSequencePrediction.from_pretrained("multimolecule/rnaernie")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")
label = torch.tensor([1])

output = model(**input, labels=label)
```
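Passing `labels` makes the model return a loss, so the block above drops straight into a training loop. A minimal sketch of one optimization step, continuing from the block above (the optimizer and learning rate are illustrative choices, not from the original card):

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative hyperparameters

optimizer.zero_grad()
output = model(**input, labels=label)
output.loss.backward()  # backpropagate through the prediction head and the backbone
optimizer.step()
```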
#### Token Classification / Regression

**Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:

```python
import torch
from multimolecule import RnaTokenizer, RnaErnieForTokenPrediction


tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnaernie")
model = RnaErnieForTokenPrediction.from_pretrained("multimolecule/rnaernie")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), ))

output = model(**input, labels=label)
```
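Once fine-tuned, the same model runs inference without labels; `output.logits` then carries the raw per-nucleotide scores. A minimal sketch, continuing from the block above:

```python
import torch

with torch.no_grad():  # inference only, no gradients needed
    output = model(**input)
logits = output.logits  # raw per-nucleotide scores
```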
#### Contact Classification / Regression

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

```python
import torch
from multimolecule import RnaTokenizer, RnaErnieForContactPrediction


tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnaernie")
model = RnaErnieForContactPrediction.from_pretrained("multimolecule/rnaernie")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), len(text)))

output = model(**input, labels=label)
```
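The random `label` above is only a placeholder: a real contact map is a symmetric 0/1 matrix over nucleotide pairs. A slightly more realistic placeholder, continuing from the block above (illustrative only):

```python
import torch

L = len(text)
upper = torch.triu(torch.randint(2, (L, L)))        # random upper triangle (incl. diagonal)
contacts = upper + torch.triu(upper, diagonal=1).T  # mirror it: i pairs with j iff j pairs with i

output = model(**input, labels=contacts)
```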