
# Material SciBERT (TPU): Improving language understanding in materials science

Work in progress

## Introduction

A SciBERT-based model pre-trained on full-text scientific articles in materials science.

## Authors

Luca Foppiano, Pedro Ortiz Suarez

## TL;DR

- Collected full text from ~700,000 articles provided by the National Institute for Materials Science (NIMS) TDM platform (https://dice.nims.go.jp/services/TDM-PF/en/), a dataset we call the Science Corpus (SciCorpus)
- Added to the SciBERT vocabulary (32k tokens) 100 domain-specific out-of-vocabulary words extracted from SciCorpus with a keyword model (KeyBERT); see the sketch after this list
- Starting conditions: the original SciBERT weights
- Pre-trained MatTPUSciBERT on Google Cloud with TPUs (Tensor Processing Units) as follows (an illustrative training sketch also follows this list):
  - 800,000 steps with `batch_size: 256`, `max_seq_length: 512`
  - 100,000 steps with `batch_size: 2048`, `max_seq_length: 128`
- Fine-tuned and tested on NER for superconductors (https://github.com/lfoppiano/grobid-superconductors) and physical quantities (https://github.com/kermitt2/grobid-quantities)
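
The exact vocabulary-extension code is not part of this README; the following is a minimal sketch of the step described above, assuming KeyBERT for keyword extraction, the Hugging Face `transformers` API, and the cased SciBERT checkpoint. The corpus path and the `top_n` value are illustrative.

```python
from keybert import KeyBERT
from transformers import BertForMaskedLM, BertTokenizerFast

# A sample of SciCorpus full text (path is hypothetical).
text = open("scicorpus_sample.txt").read()

# Extract candidate domain keywords with KeyBERT.
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 1), top_n=500)

# Start from the original SciBERT vocabulary and weights.
tokenizer = BertTokenizerFast.from_pretrained("allenai/scibert_scivocab_cased")
model = BertForMaskedLM.from_pretrained("allenai/scibert_scivocab_cased")

# Keep only words the tokenizer would split into several sub-word pieces,
# i.e. words effectively unknown to the vocabulary, and add the first 100.
oov = [word for word, _score in keywords if len(tokenizer.tokenize(word)) > 1]
tokenizer.add_tokens(oov[:100])

# Resize the embedding matrix so the new tokens get trainable vectors.
model.resize_token_embeddings(len(tokenizer))
```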
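
The actual pre-training ran on Google Cloud TPUs. Purely as an illustration of the two phases listed above, and continuing from the sketch, the same hyperparameters could be expressed with the `transformers` `Trainer` roughly as follows; the corpus is a placeholder and the learning rate is an assumption, not a value stated in this README.

```python
from datasets import Dataset
from transformers import (DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Placeholder corpus: in practice, the full SciCorpus text.
texts = ["Placeholder sentence from SciCorpus."]

def make_dataset(max_len):
    # Tokenize with the (extended) tokenizer from the sketch above.
    enc = tokenizer(texts, truncation=True, max_length=max_len,
                    return_special_tokens_mask=True)
    return Dataset.from_dict(dict(enc))

# Standard BERT masked-language-modelling objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

# Phase 1: 800,000 steps, batch_size 256, max_seq_length 512.
Trainer(
    model=model,
    args=TrainingArguments(output_dir="pretrain-512", max_steps=800_000,
                           per_device_train_batch_size=256,
                           learning_rate=1e-4),  # assumed, not stated here
    data_collator=collator,
    train_dataset=make_dataset(512),
).train()

# Phase 2: 100,000 steps, batch_size 2048, max_seq_length 128.
Trainer(
    model=model,
    args=TrainingArguments(output_dir="pretrain-128", max_steps=100_000,
                           per_device_train_batch_size=2048),
    data_collator=collator,
    train_dataset=make_dataset(128),
).train()
```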

## Related work

### BERT implementations

### Relevant models

## Results

Results were obtained via 10-fold cross-validation, using DeLFT (https://github.com/kermitt2/delft).

### NER Superconductors

| Model | Precision | Recall | F1 |
|---|---|---|---|
| SciBERT (baseline) | 81.62% | 84.23% | 82.90% |
| MatSciBERT (Gupta) | 81.45% | 84.36% | 82.88% |
| MatTPUSciBERT | 82.13% | 85.15% | 83.61% |
| MatBERT (Ceder) | 81.25% | 83.99% | 82.60% |
| BatteryScibert-cased | 81.09% | 84.14% | 82.59% |

### NER Quantities

| Model | Precision | Recall | F1 |
|---|---|---|---|
| SciBERT (baseline) | 88.73% | 86.76% | 87.73% |
| MatSciBERT (Gupta) | 84.98% | 90.12% | 87.47% |
| MatTPUSciBERT | 88.62% | 86.33% | 87.46% |
| MatBERT (Ceder) | 85.08% | 89.93% | 87.44% |
| BatteryScibert-cased | 85.02% | 89.30% | 87.11% |
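The released checkpoint is a masked-language model; the NER figures above come from fine-tuning with DeLFT. A minimal sketch for loading it with Hugging Face `transformers`, assuming the model is published under the `lfoppiano/MatTPUSciBERT` repository id:

```python
from transformers import pipeline

# Repository id assumed from this model page; adjust if it differs.
fill_mask = pipeline("fill-mask", model="lfoppiano/MatTPUSciBERT")

# BERT-style models predict the [MASK] token.
for prediction in fill_mask("YBa2Cu3O7 is a high-temperature [MASK]."):
    print(prediction["token_str"], prediction["score"])
```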

## References

TBA

## Acknowledgements

This work was supported by Google through its researchers program (https://cloud.google.com/edu/researchers).