shahrukhx01 committed
Commit · 3dd0c57
Parent(s): 9b766f8
Create README.md
README.md ADDED
---
language: "en"
tags:
- chemical-domain
- safety-datasheets
widget:
- text: "The removal of mercaptans, and for drying of gases and [MASK]."
---
# BERT for Chemical Industry
A BERT-based language model further pre-trained from the checkpoint of [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased). We used a corpus of over 40,000 technical documents from the **Chemical Industrial domain** (ranging from Safety Data Sheets to Product Information Documents) combined with 13,000 Wikipedia Chemistry articles, containing 250,000+ tokens from the Chemical domain, and pre-trained the model using MLM on over 9.2 million paragraphs.
- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
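The card does not include the pre-training script itself; the following is only a minimal sketch of how such continued MLM pre-training from the SciBERT checkpoint could be set up with the Transformers `Trainer`. The corpus file name and hyperparameters are illustrative assumptions, not the settings actually used for this model.

```python
# Minimal sketch only: continued (domain-adaptive) MLM pre-training starting
# from the SciBERT checkpoint. The corpus file and hyperparameters below are
# illustrative assumptions, not the actual settings used for this model.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical text file with one chemical-domain paragraph per line.
dataset = load_dataset("text", data_files={"train": "chemical_paragraphs.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Randomly masks 15% of the tokens in each batch, as described above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chemical-bert-uncased", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

The resulting checkpoint can then be queried with the fill-mask pipeline, as in the example below.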
```python
from transformers import pipeline

# Load the fill-mask pipeline with the chemical-domain checkpoint.
fill_mask = pipeline(
    "fill-mask",
    model="recobo/chemical-bert-uncased",
    tokenizer="recobo/chemical-bert-uncased"
)
fill_mask("we create [MASK]")
```
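The pipeline returns the top candidate tokens for the masked position together with their scores. For a more domain-specific prompt, the widget example from this card can be passed to the same pipeline:

```python
# Same `fill_mask` pipeline as above, called with the widget example sentence.
fill_mask("The removal of mercaptans, and for drying of gases and [MASK].")
# Each returned dict contains the filled-in sequence, the predicted token,
# and its score.
```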