EstBERT

What's this?

EstBERT is a pretrained BERT-Base model trained exclusively on a cased Estonian corpus, with sequence lengths of both 128 and 512.

How to use it?

You can use the model with the transformers library, in both TensorFlow and PyTorch.

from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("tartuNLP/EstBERT")
model = AutoModelForMaskedLM.from_pretrained("tartuNLP/EstBERT")
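
Once the model loads, a quick sanity check is to let it fill in a masked token. The following is a minimal sketch, not part of the original card: it assumes the `fill-mask` pipeline from transformers, and the helper names `summarize_predictions` and `fill_mask`, as well as the `{mask}` placeholder convention, are hypothetical and introduced only for illustration.

```python
def summarize_predictions(predictions):
    """Reduce fill-mask pipeline output to (token, probability) pairs."""
    return [
        (p["token_str"].strip(), round(float(p["score"]), 4))
        for p in predictions
    ]


def fill_mask(template, model_name="tartuNLP/EstBERT", top_k=3):
    """Fill the single `{mask}` placeholder in `template` with the model's top guesses.

    The `{mask}` placeholder is an assumption of this sketch; it is replaced
    with the tokenizer's own mask token before prediction.
    """
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_name)
    sentence = template.format(mask=unmasker.tokenizer.mask_token)
    return summarize_predictions(unmasker(sentence, top_k=top_k))
```

For example, `fill_mask("Tallinn on Eesti {mask}.")` (roughly "Tallinn is Estonia's [MASK].") should return the three tokens the model considers most likely, each with its probability. The first call downloads the model weights.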

You can also download the pretrained models directly: EstBERT_128 and EstBERT_512.

Dataset used to train the model

EstBERT was trained on data with sequence lengths of both 128 and 512. For training, we used the Estonian National Corpus 2017, which was the largest Estonian-language corpus available at the time. It consists of four sub-corpora: Estonian Reference Corpus 1990-2008, Estonian Web Corpus 2013, Estonian Web Corpus 2017, and Estonian Wikipedia Corpus 2017.

Reference to cite

Tanvir et al. 2021

Why would I use it?

Overall, EstBERT outperforms mBERT and XLM-RoBERTa in most of the settings reported for part-of-speech (POS) tagging, named entity recognition (NER), rubric classification, and sentiment classification. The comparative results are given below.

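| Model | UPOS128 | XPOS128 | Morph128 | UPOS512 | XPOS512 | Morph512 |
|---|---|---|---|---|---|---|
| EstBERT | 97.89 | 98.40 | 96.93 | 97.84 | 98.43 | 96.80 |
| mBERT | 97.42 | 98.06 | 96.24 | 97.43 | 98.13 | 96.13 |
| XLM-RoBERTa | 97.78 | 98.36 | 96.53 | 97.80 | 98.40 | 96.69 |

| Model | Rubric128 | Sentiment128 | Rubric512 | Sentiment512 |
|---|---|---|---|---|
| EstBERT | 81.70 | 74.36 | 80.96 | 74.50 |
| mBERT | 75.67 | 70.23 | 74.94 | 69.52 |
| XLM-RoBERTa | 80.34 | 74.50 | 78.62 | 76.07 |

| Model | Precision128 | Recall128 | F1-Score128 | Precision512 | Recall512 | F1-Score512 |
|---|---|---|---|---|---|---|
| EstBERT | 88.42 | 90.38 | 89.39 | 88.35 | 89.74 | 89.04 |
| mBERT | 85.88 | 87.09 | 86.51 | 88.47 | 88.28 | 88.37 |
| XLM-RoBERTa | 87.55 | 91.19 | 89.34 | 87.50 | 90.76 | 89.10 |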
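
To read the precision, recall, and F1 columns together, it can help to recompute F1 from the other two. The sketch below assumes the F1 column is the usual harmonic mean of precision and recall; the card does not state how it was computed, so small differences (for example from rounding or averaging over runs) are to be expected. The numbers are copied from the table above; the names `f1_score` and `reported` are introduced here for illustration only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), keyed by (model, sequence length).
reported = {
    ("EstBERT", 128): (88.42, 90.38, 89.39),
    ("mBERT", 128): (85.88, 87.09, 86.51),
    ("XLM-RoBERTa", 128): (87.55, 91.19, 89.34),
    ("EstBERT", 512): (88.35, 89.74, 89.04),
    ("mBERT", 512): (88.47, 88.28, 88.37),
    ("XLM-RoBERTa", 512): (87.50, 90.76, 89.10),
}

for (model, seq_len), (p, r, f1) in reported.items():
    computed = f1_score(p, r)
    print(f"{model:12s} {seq_len}: computed {computed:.2f}, reported {f1:.2f}")
```

Under this assumption, the recomputed values agree with the reported ones to within a few hundredths of a point.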

BibTeX entry and citation info

@misc{tanvir2020estbert,
      title={EstBERT: A Pretrained Language-Specific BERT for Estonian}, 
      author={Hasan Tanvir and Claudia Kittask and Kairit Sirts},
      year={2020},
      eprint={2011.04784},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}