ZhiyuanChen committed
Commit 81064c2 · verified · 1 Parent(s): 34e5967

Update README.md

Files changed (1):
  1. README.md +42 -29

README.md CHANGED
@@ -10,6 +10,19 @@ library_name: multimolecule
 pipeline_tag: fill-mask
 mask_token: "<mask>"
 widget:
 - example_title: "microRNA-21"
   text: "UAGC<mask>UAUCAGACUGAUGUUGA"
   output:
@@ -78,7 +91,7 @@ Note that during the conversion process, additional tokens such as `[IND]` and n
 - **Paper**: Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning
 - **Developed by**: Ning Wang, Jiang Bian, Yuchen Li, Xuhong Li, Shahid Mumtaz, Linghe Kong, Haoyi Xiong.
 - **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [ERNIE](https://huggingface.co/nghuyong/ernie-3.0-base-zh)
- - **Original Repository**: [https://github.com/CatIIIIIIII/RNAErnie](https://github.com/CatIIIIIIII/RNAErnie)

 ## Usage

@@ -95,29 +108,29 @@ You can use this model directly with a pipeline for masked language modeling:
 ```python
 >>> import multimolecule  # you must import multimolecule to register models
 >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='multimolecule/rnaernie')
- >>> unmasker("uagc<mask>uaucagacugauguuga")

- [{'score': 0.09372635930776596,
   'token': 8,
   'token_str': 'G',
-  'sequence': 'U A G C G U A U C A G A C U G A U G U U G A'},
- {'score': 0.08816102892160416,
   'token': 11,
   'token_str': 'R',
-  'sequence': 'U A G C R U A U C A G A C U G A U G U U G A'},
- {'score': 0.08292599022388458,
   'token': 6,
   'token_str': 'A',
-  'sequence': 'U A G C A U A U C A G A C U G A U G U U G A'},
- {'score': 0.07841548323631287,
-  'token': 2,
-  'token_str': '<eos>',
-  'sequence': 'U A G C U A U C A G A C U G A U G U U G A'},
- {'score': 0.073448047041893,
   'token': 20,
   'token_str': 'V',
-  'sequence': 'U A G C V U A U C A G A C U G A U G U U G A'}]
 ```

 ### Downstream Use
@@ -130,11 +143,11 @@ Here is how to use this model to get the features of a given sequence in PyTorch
 from multimolecule import RnaTokenizer, RnaErnieModel


- tokenizer = RnaTokenizer.from_pretrained('multimolecule/rnaernie')
- model = RnaErnieModel.from_pretrained('multimolecule/rnaernie')

 text = "UAGCUUAUCAGACUGAUGUUGA"
- input = tokenizer(text, return_tensors='pt')

 output = model(**input)
 ```
@@ -150,17 +163,17 @@ import torch
 from multimolecule import RnaTokenizer, RnaErnieForSequencePrediction


- tokenizer = RnaTokenizer.from_pretrained('multimolecule/rnaernie')
- model = RnaErnieForSequencePrediction.from_pretrained('multimolecule/rnaernie')

 text = "UAGCUUAUCAGACUGAUGUUGA"
- input = tokenizer(text, return_tensors='pt')
 label = torch.tensor([1])

 output = model(**input, labels=label)
 ```

- #### Nucleotide Classification / Regression

 **Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

@@ -168,14 +181,14 @@ Here is how to use this model as backbone to fine-tune for a nucleotide-level ta

 ```python
 import torch
- from multimolecule import RnaTokenizer, RnaErnieForNucleotidePrediction


- tokenizer = RnaTokenizer.from_pretrained('multimolecule/rnaernie')
- model = RnaErnieForNucleotidePrediction.from_pretrained('multimolecule/rnaernie')

 text = "UAGCUUAUCAGACUGAUGUUGA"
- input = tokenizer(text, return_tensors='pt')
 label = torch.randint(2, (len(text), ))

 output = model(**input, labels=label)
@@ -192,11 +205,11 @@ import torch
 from multimolecule import RnaTokenizer, RnaErnieForContactPrediction


- tokenizer = RnaTokenizer.from_pretrained('multimolecule/rnaernie')
- model = RnaErnieForContactPrediction.from_pretrained('multimolecule/rnaernie')

 text = "UAGCUUAUCAGACUGAUGUUGA"
- input = tokenizer(text, return_tensors='pt')
 label = torch.randint(2, (len(text), len(text)))

 output = model(**input, labels=label)
 
 pipeline_tag: fill-mask
 mask_token: "<mask>"
 widget:
+ - example_title: "HIV-1"
+   text: "GGUC<mask>CUCUGGUUAGACCAGAUCUGAGCCU"
+   output:
+   - label: "G"
+     score: 0.09252794831991196
+   - label: "R"
+     score: 0.09062391519546509
+   - label: "A"
+     score: 0.08875908702611923
+   - label: "V"
+     score: 0.07809742540121078
+   - label: "S"
+     score: 0.07325706630945206
 - example_title: "microRNA-21"
   text: "UAGC<mask>UAUCAGACUGAUGUUGA"
   output:
 
 - **Paper**: Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning
 - **Developed by**: Ning Wang, Jiang Bian, Yuchen Li, Xuhong Li, Shahid Mumtaz, Linghe Kong, Haoyi Xiong.
 - **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [ERNIE](https://huggingface.co/nghuyong/ernie-3.0-base-zh)
+ - **Original Repository**: [CatIIIIIIII/RNAErnie](https://github.com/CatIIIIIIII/RNAErnie)

 ## Usage

 
 ```python
 >>> import multimolecule  # you must import multimolecule to register models
 >>> from transformers import pipeline
+ >>> unmasker = pipeline("fill-mask", model="multimolecule/rnaernie")
+ >>> unmasker("gguc<mask>cucugguuagaccagaucugagccu")

+ [{'score': 0.09252794831991196,
   'token': 8,
   'token_str': 'G',
+  'sequence': 'G G U C G C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.09062391519546509,
   'token': 11,
   'token_str': 'R',
+  'sequence': 'G G U C R C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.08875908702611923,
   'token': 6,
   'token_str': 'A',
+  'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.07809742540121078,
   'token': 20,
   'token_str': 'V',
+  'sequence': 'G G U C V C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.07325706630945206,
+  'token': 13,
+  'token_str': 'S',
+  'sequence': 'G G U C S C U C U G G U U A G A C C A G A U C U G A G C C U'}]
 ```

 ### Downstream Use
 
 from multimolecule import RnaTokenizer, RnaErnieModel


+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnaernie")
+ model = RnaErnieModel.from_pretrained("multimolecule/rnaernie")

 text = "UAGCUUAUCAGACUGAUGUUGA"
+ input = tokenizer(text, return_tensors="pt")

 output = model(**input)
 ```
 
 from multimolecule import RnaTokenizer, RnaErnieForSequencePrediction


+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnaernie")
+ model = RnaErnieForSequencePrediction.from_pretrained("multimolecule/rnaernie")

 text = "UAGCUUAUCAGACUGAUGUUGA"
+ input = tokenizer(text, return_tensors="pt")
 label = torch.tensor([1])

 output = model(**input, labels=label)
 ```

+ #### Token Classification / Regression

 **Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
 
 

 ```python
 import torch
+ from multimolecule import RnaTokenizer, RnaErnieForTokenPrediction


+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnaernie")
+ model = RnaErnieForTokenPrediction.from_pretrained("multimolecule/rnaernie")

 text = "UAGCUUAUCAGACUGAUGUUGA"
+ input = tokenizer(text, return_tensors="pt")
 label = torch.randint(2, (len(text), ))

 output = model(**input, labels=label)
 
 from multimolecule import RnaTokenizer, RnaErnieForContactPrediction


+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnaernie")
+ model = RnaErnieForContactPrediction.from_pretrained("multimolecule/rnaernie")

 text = "UAGCUUAUCAGACUGAUGUUGA"
+ input = tokenizer(text, return_tensors="pt")
 label = torch.randint(2, (len(text), len(text)))

 output = model(**input, labels=label)
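The three fine-tuning examples in this diff differ mainly in the shape of the `labels` they expect. As a quick sanity check, here is a minimal plain-Python sketch (shapes only, no torch required; the tuples mirror the `torch.tensor` / `torch.randint` calls in the README): sequence prediction takes one label per sequence, token prediction one label per nucleotide, and contact prediction one label per nucleotide pair.

```python
# Label shapes expected by the three fine-tuning heads shown above,
# illustrated without torch. L is the number of nucleotides in the input.
text = "UAGCUUAUCAGACUGAUGUUGA"
L = len(text)  # 22 nucleotides

sequence_label_shape = (1,)     # torch.tensor([1]): one label per sequence
token_label_shape = (L,)        # torch.randint(2, (len(text),)): one per nucleotide
contact_label_shape = (L, L)    # torch.randint(2, (len(text), len(text))): per pair

print(sequence_label_shape, token_label_shape, contact_label_shape)
```

Note that the tokenizer also adds special tokens (such as `<cls>` and `<eos>`) around the sequence, so the model's per-token outputs can be longer than `L`; the heads handle this alignment internally.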