shahrukhx01 committed
Commit · 3dd0c57
Parent(s): 9b766f8
Create README.md
README.md ADDED
---
language: "en"
tags:
- chemical-domain
- safety-datasheets
widget:
- text: "The removal of mercaptans, and for drying of gases and [MASK]."
---
# BERT for Chemical Industry
A BERT-based language model further pre-trained from the checkpoint of [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased). We used a corpus of over 40,000 technical documents from the **Chemical Industrial domain** (ranging from Safety Data Sheets to Product Information Documents) combined with 13,000 Wikipedia Chemistry articles, containing 250,000+ tokens from the Chemical domain, and pre-trained the model using MLM on over 9.2 million paragraphs.
- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
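The card does not include the pre-training script itself; the following is only a minimal sketch of how such continued MLM pre-training from the SciBERT checkpoint could be set up with the Transformers `Trainer`. The corpus file name and hyperparameters are illustrative assumptions, not the settings actually used for this model.

```python
# Minimal sketch only: continued (domain-adaptive) MLM pre-training starting
# from the SciBERT checkpoint. The corpus file and hyperparameters below are
# illustrative assumptions, not the actual settings used for this model.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical text file with one chemical-domain paragraph per line.
dataset = load_dataset("text", data_files={"train": "chemical_paragraphs.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Randomly masks 15% of the tokens in each batch, as described above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chemical-bert-uncased", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

The resulting checkpoint can then be queried with the fill-mask pipeline, as in the example below.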
```python
from transformers import pipeline

# Load the fill-mask pipeline with the chemical-domain checkpoint.
fill_mask = pipeline(
    "fill-mask",
    model="recobo/chemical-bert-uncased",
    tokenizer="recobo/chemical-bert-uncased"
)
fill_mask("we create [MASK]")
```
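The pipeline returns the top candidate tokens for the masked position together with their scores. For a more domain-specific prompt, the widget example from this card can be passed to the same pipeline:

```python
# Same `fill_mask` pipeline as above, called with the widget example sentence.
fill_mask("The removal of mercaptans, and for drying of gases and [MASK].")
# Each returned dict contains the filled-in sequence, the predicted token,
# and its score.
```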