---
license: cc-by-sa-4.0
language:
- hr
- bs
- sr
datasets:
- classla/xlm-r-bertic-data
---

# XLM-R-BERTić

This model was produced by additionally pre-training [XLM-Roberta-large](https://huggingface.co/xlm-roberta-large) for 48k steps on South Slavic languages, using the [XLM-R-BERTić dataset](https://huggingface.co/datasets/classla/xlm-r-bertic-data).

# Benchmarking

Three tasks were chosen for model evaluation:

* Named Entity Recognition (NER)
* Sentiment regression
* Choice of Plausible Alternatives (COPA)

In all cases, this model was fine-tuned for the specific downstream task.

## NER

Mean macro-F1 scores were used to evaluate performance. Datasets used: [hr500k](https://huggingface.co/datasets/classla/hr500k), [ReLDI-sr](https://huggingface.co/datasets/classla/reldi_sr), [ReLDI-hr](https://huggingface.co/datasets/classla/reldi_hr), and [SETimes.SR](https://huggingface.co/datasets/classla/setimes_sr).

| system                                                                 | dataset | F1 score |
|:-----------------------------------------------------------------------|:--------|---------:|
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic)        | hr500k  |    0.927 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | hr500k  |    0.925 |
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic)      | hr500k  |    0.923 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | hr500k  |    0.919 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | hr500k  |    0.918 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | hr500k  |    0.903 |

| system                                                                 | dataset  | F1 score |
|:-----------------------------------------------------------------------|:---------|---------:|
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic)      | ReLDI-hr |    0.812 |
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic)        | ReLDI-hr |    0.809 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ReLDI-hr |    0.794 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | ReLDI-hr |    0.792 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | ReLDI-hr |    0.791 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | ReLDI-hr |    0.763 |

| system                                                                 | dataset    | F1 score |
|:-----------------------------------------------------------------------|:-----------|---------:|
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic)      | SETimes.SR |    0.949 |
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic)        | SETimes.SR |    0.940 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | SETimes.SR |    0.936 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | SETimes.SR |    0.933 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | SETimes.SR |    0.922 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | SETimes.SR |    0.914 |

| system                                                                 | dataset  | F1 score |
|:-----------------------------------------------------------------------|:---------|---------:|
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic)        | ReLDI-sr |    0.841 |
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic)      | ReLDI-sr |    0.824 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | ReLDI-sr |    0.798 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | ReLDI-sr |    0.774 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ReLDI-sr |    0.751 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | ReLDI-sr |    0.734 |

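
To make the NER metric concrete, the sketch below computes macro-F1 (the unweighted mean of per-class F1) for each of several fine-tuning runs and then averages across runs. It is only an illustration: the toy labels, the number of runs, and the choice of token-level rather than entity-level scoring are assumptions, and the helper `macro_f1` is hypothetical rather than taken from the benchmarking code.

```python
from collections import Counter
from statistics import mean


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return mean(scores)


# Toy token-level labels (assumed for illustration, not benchmark data).
gold = ["B-PER", "O", "O", "B-LOC", "O", "B-PER"]
run_predictions = [
    ["B-PER", "O", "O", "B-LOC", "O", "O"],          # run 1
    ["B-PER", "O", "B-LOC", "B-LOC", "O", "B-PER"],  # run 2
]

per_run = [macro_f1(gold, pred) for pred in run_predictions]
mean_macro_f1 = mean(per_run)
print(f"per-run macro-F1: {[round(s, 3) for s in per_run]}")
print(f"mean macro-F1: {mean_macro_f1:.3f}")
```

Because every class contributes equally to the average, rare entity types weigh as much as the frequent `O` label, so macro-F1 is less forgiving than plain accuracy on imbalanced tag sets.
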
## Sentiment regression

The [ParlaSent dataset](https://huggingface.co/datasets/classla/ParlaSent) was used to evaluate sentiment regression for Bosnian, Croatian, and Serbian.
The procedure is explained in greater detail in the dedicated [benchmarking repository](https://github.com/clarinsi/benchich/tree/main/sentiment).

| system                                                                 | train               | test                     |    R² |
|:-----------------------------------------------------------------------|:--------------------|:-------------------------|------:|
| [xlm-r-parlasent](https://huggingface.co/classla/xlm-r-parlasent)      | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.615 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.612 |
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic)      | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.607 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.605 |
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic)        | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.601 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.537 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.500 |
| dummy (mean)                                                           | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | -0.12 |

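
The negative score of the dummy baseline may look surprising, so here is a minimal sketch of the coefficient of determination, R² = 1 - SS_res / SS_tot. It assumes the dummy system always predicts the mean of the training labels, which the table does not state explicitly; the toy values are invented and are not ParlaSent data.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy sentiment scores (assumed for illustration only).
train_labels = [2.0, 3.0, 3.0, 4.0]  # training mean is 3.0
test_labels = [2.0, 3.0, 4.0, 5.0]   # test mean is 3.5

# Assumed dummy baseline: always predict the training-set mean.
dummy_pred = [mean(train_labels)] * len(test_labels)
dummy_r2 = r_squared(test_labels, dummy_pred)

# A hypothetical fine-tuned model that tracks the labels more closely.
model_pred = [2.5, 3.0, 4.0, 4.5]
model_r2 = r_squared(test_labels, model_pred)

print(f"dummy R^2: {dummy_r2:.2f}")
print(f"model R^2: {model_r2:.2f}")
```

R² is exactly 0 only when the constant prediction equals the mean of the test labels. Under the assumption above, a training mean that differs from the test mean pushes the baseline below zero.
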
## COPA

| system                                                                 | dataset | Accuracy score |
|:-----------------------------------------------------------------------|:--------|---------------:|
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | Copa-SR |          0.689 |
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic)      | Copa-SR |          0.665 |
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic)        | Copa-SR |          0.637 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | Copa-SR |          0.607 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | Copa-SR |          0.573 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | Copa-SR |          0.570 |

| system                                                                 | dataset | Accuracy score |
|:-----------------------------------------------------------------------|:--------|---------------:|
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | Copa-HR |          0.669 |
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic)      | Copa-HR |          0.628 |
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic)        | Copa-HR |          0.635 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | Copa-HR |          0.669 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | Copa-HR |          0.585 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | Copa-HR |          0.571 |

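
For context on the COPA accuracy scores, the sketch below shows one way a two-alternative task could be scored: the model assigns a plausibility score to each alternative, the higher-scoring one is selected, and accuracy is the share of correct selections. The scores, gold labels, and helper names are hypothetical and do not come from the benchmarking setup.

```python
def pick_alternative(score_1, score_2):
    """Return 0 if the first alternative scores at least as high, else 1."""
    return 0 if score_1 >= score_2 else 1


def accuracy(gold, predicted):
    """Fraction of examples where the selected alternative matches the gold one."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical plausibility scores for (alternative 1, alternative 2).
scores = [(0.9, 0.1), (0.3, 0.7), (0.6, 0.4), (0.2, 0.8), (0.55, 0.45)]
gold = [0, 1, 1, 1, 0]  # index of the more plausible alternative

predicted = [pick_alternative(a, b) for a, b in scores]
copa_acc = accuracy(gold, predicted)
print(f"accuracy: {copa_acc:.3f}")
```

With two alternatives per example, random guessing is expected to score around 0.5, which is a useful reference point when reading the accuracy tables.
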
# Citation

Please cite the following paper:

```
@inproceedings{ljubesic-etal-2024-language,
    title = "Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining",
    author = "Ljube{\v{s}}i{\'c}, Nikola and
      Suchomel, V{\'\i}t and
      Rupnik, Peter and
      Kuzman, Taja and
      van Noord, Rik",
    editor = "Melero, Maite and
      Sakti, Sakriani and
      Soria, Claudia",
    booktitle = "Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.sigul-1.23",
    pages = "189--203",
}
```