Update README.md
This Electra model was trained on more than 8 billion tokens of Bosnian, Croatian, Montenegrin and Serbian text.
***new*** We have published a version of this model fine-tuned on the named entity recognition task ([bcms-bertic-ner](https://huggingface.co/CLASSLA/bcms-bertic-ner)).

If you use the model, please cite the following paper:

```
@inproceedings{ljubesic-lauc-2021-bertic,
    title = "{BERTić} - The Transformer Language Model for {B}osnian, {C}roatian, {M}ontenegrin and {S}erbian",
    author = "Ljube{\v{s}}i{\'c}, Nikola and
      Lauc, Davor",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    year = "2021",
    address = "Kiev, Ukraine",
    publisher = "Association for Computational Linguistics"
}
```

## Benchmarking

Comparing this model with [multilingual BERT](https://huggingface.co/bert-base-multilingual-cased) and [CroSloEngual BERT](https://huggingface.co/EMBEDDIA/crosloengual-bert) on (1) part-of-speech tagging, (2) named entity recognition, (3) geolocation prediction, and (4) commonsense causal reasoning shows BERTić to be superior to the other two models.

### Part-of-speech tagging

Evaluation metric is (seqeval) microF1. Reported are means of five runs. Best results are presented in bold. Statistical significance is calculated between the two best-performing systems via a two-tailed t-test (* p<=0.05, ** p<=0.01, *** p<=0.001, **** p<=0.0001).

Dataset | Language | Variety | CLASSLA | mBERT | cseBERT | BERTić
---|---|---|---|---|---|---
reldi-hr | Croatian | internet non-standard | - | 88.87 | 91.63 | **92.28***
SETimes.SR | Serbian | standard | 95.00 | 95.50 | **96.41** | 96.31
reldi-sr | Serbian | internet non-standard | - | 91.26 | 93.54 | **93.90*****

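The significance test described in the legend above can be sketched as a plain two-sample Student t-test over the five per-run scores of the two best systems. The run scores below are invented for illustration (the tables report only the means, not the individual runs):

```python
# Two-tailed two-sample t-test over five runs of the two best systems.
# The run scores are hypothetical illustrations, not the paper's numbers.
from statistics import mean, stdev

runs_a = [93.7, 93.9, 94.0, 93.8, 94.1]  # hypothetical system A runs
runs_b = [93.3, 93.6, 93.5, 93.4, 93.7]  # hypothetical system B runs

n = len(runs_a)
# Standard error of the difference of means (equal sample sizes).
se = (stdev(runs_a) ** 2 / n + stdev(runs_b) ** 2 / n) ** 0.5
t = (mean(runs_a) - mean(runs_b)) / se
print(f"t = {t:.2f} with {2 * n - 2} degrees of freedom")
# For df = 8, |t| >= 2.306 corresponds to p <= 0.05 (two-tailed).
```

With SciPy available, `scipy.stats.ttest_ind(runs_a, runs_b)` returns the t statistic and the two-tailed p-value directly.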
### Named entity recognition

Evaluation metric is (seqeval) microF1. Reported are means of five runs. Best results are presented in bold. Statistical significance is calculated between the two best-performing systems via a two-tailed t-test (* p<=0.05, ** p<=0.01, *** p<=0.001, **** p<=0.0001).

Dataset | Language | Variety | CLASSLA | mBERT | cseBERT | BERTić
---|---|---|---|---|---|---
SETimes.SR | Serbian | standard | 84.64 | **92.41** | 92.28 | 92.02
reldi-sr | Serbian | internet non-standard | - | 81.29 | 82.76 | **87.92******

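As a rough sketch of what the entity-level micro-F1 measures: an entity counts as correct only when both its type and its exact boundaries match. The helper below is a simplified stand-in for seqeval (not its actual implementation), and the tag sequences are toy examples:

```python
# Simplified entity-level micro-F1 over BIO tag sequences: an entity is
# a true positive only when its type and exact span both match.
def spans(tags):
    """Extract (type, start, end) entity spans from a BIO tag sequence."""
    out, start = [], None
    for i, t in enumerate(tags + ["O"]):  # sentinel closes a final entity
        if start is not None and not t.startswith("I-"):
            out.append((tags[start][2:], start, i))
            start = None
        if t.startswith("B-"):
            start = i
    return set(out)

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG"]  # wrong type on the second entity

tp = len(spans(gold) & spans(pred))
precision = tp / len(spans(pred))
recall = tp / len(spans(gold))
f1 = 2 * precision * recall / (precision + recall)
print(f"micro-F1 = {f1:.2f}")  # only one of the two entities matches
```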
### Geolocation prediction
The dataset comes from the VarDial 2020 evaluation campaign's shared task on [Social Media variety Geolocation prediction](https://sites.google.com/view/vardial2020/evaluation-campaign). The task is to predict the latitude and longitude of a tweet given its text.

Evaluation metrics are the median and mean distance between the predicted and the true location, in kilometers (lower is better).

System | Median (km) | Mean (km)
---|---|---
mBERT | 42.25 | 82.05
cseBERT | 40.76 | 81.88
BERTić | **37.96** | **79.30**

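Since the task predicts coordinates, the natural error measure is the great-circle distance between predicted and true points. A minimal sketch using the standard haversine formula and a 6371 km Earth radius; the sample coordinates are invented, not taken from the dataset:

```python
# Median/mean great-circle (haversine) distance, in kilometers, between
# predicted and gold (latitude, longitude) pairs.
from math import asin, cos, radians, sin, sqrt
from statistics import mean, median

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

gold = [(45.81, 15.98), (44.79, 20.45)]  # e.g. Zagreb, Belgrade
pred = [(45.55, 18.70), (44.79, 20.45)]  # hypothetical model outputs

dists = [haversine_km(*g, *p) for g, p in zip(gold, pred)]
print(f"median = {median(dists):.1f} km, mean = {mean(dists):.1f} km")
```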
### Choice of plausible alternatives

The dataset is a translation of the [COPA dataset](https://people.ict.usc.edu/~gordon/copa.html) into Croatian (to be released).