---
license: cc-by-sa-4.0
language:
- hr
- bs
- sr
datasets:
- classla/xlm-r-bertic-data
---
# XLM-R-BERTić
This model was produced by pre-training [XLM-Roberta-large](https://huggingface.co./xlm-roberta-large) for 48k steps on South Slavic languages using the [XLM-R-BERTić dataset](https://huggingface.co./datasets/classla/xlm-r-bertic-data).
# Benchmarking
Three tasks were chosen for model evaluation:

* Named Entity Recognition (NER)
* Sentiment regression
* COPA (Choice of Plausible Alternatives)

In all cases, the model was fine-tuned for the specific downstream task before evaluation.
## NER
Mean macro-F1 scores were used to evaluate performance. Datasets used: [hr500k](https://huggingface.co./datasets/classla/hr500k), [ReLDI-sr](https://huggingface.co./datasets/classla/reldi_sr), [ReLDI-hr](https://huggingface.co./datasets/classla/reldi_hr), and [SETimes.SR](https://huggingface.co./datasets/classla/setimes_sr).

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:--------|---------:|
| [**XLM-R-BERTić**](https://huggingface.co./classla/xlm-r-bertic) | hr500k | 0.927 |
| [BERTić](https://huggingface.co./classla/bcms-bertic) | hr500k | 0.925 |
| [XLM-R-SloBERTić](https://huggingface.co./classla/xlm-r-slobertic) | hr500k | 0.923 |
| [XLM-Roberta-Large](https://huggingface.co./xlm-roberta-large) | hr500k | 0.919 |
| [crosloengual-bert](https://huggingface.co./EMBEDDIA/crosloengual-bert) | hr500k | 0.918 |
| [XLM-Roberta-Base](https://huggingface.co./xlm-roberta-base) | hr500k | 0.903 |

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:---------|---------:|
| [XLM-R-SloBERTić](https://huggingface.co./classla/xlm-r-slobertic) | ReLDI-hr | 0.812 |
| [**XLM-R-BERTić**](https://huggingface.co./classla/xlm-r-bertic) | ReLDI-hr | 0.809 |
| [crosloengual-bert](https://huggingface.co./EMBEDDIA/crosloengual-bert) | ReLDI-hr | 0.794 |
| [BERTić](https://huggingface.co./classla/bcms-bertic) | ReLDI-hr | 0.792 |
| [XLM-Roberta-Large](https://huggingface.co./xlm-roberta-large) | ReLDI-hr | 0.791 |
| [XLM-Roberta-Base](https://huggingface.co./xlm-roberta-base) | ReLDI-hr | 0.763 |

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:-----------|---------:|
| [XLM-R-SloBERTić](https://huggingface.co./classla/xlm-r-slobertic) | SETimes.SR | 0.949 |
| [**XLM-R-BERTić**](https://huggingface.co./classla/xlm-r-bertic) | SETimes.SR | 0.940 |
| [BERTić](https://huggingface.co./classla/bcms-bertic) | SETimes.SR | 0.936 |
| [XLM-Roberta-Large](https://huggingface.co./xlm-roberta-large) | SETimes.SR | 0.933 |
| [crosloengual-bert](https://huggingface.co./EMBEDDIA/crosloengual-bert) | SETimes.SR | 0.922 |
| [XLM-Roberta-Base](https://huggingface.co./xlm-roberta-base) | SETimes.SR | 0.914 |

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:---------|---------:|
| [**XLM-R-BERTić**](https://huggingface.co./classla/xlm-r-bertic) | ReLDI-sr | 0.841 |
| [XLM-R-SloBERTić](https://huggingface.co./classla/xlm-r-slobertic) | ReLDI-sr | 0.824 |
| [BERTić](https://huggingface.co./classla/bcms-bertic) | ReLDI-sr | 0.798 |
| [XLM-Roberta-Large](https://huggingface.co./xlm-roberta-large) | ReLDI-sr | 0.774 |
| [crosloengual-bert](https://huggingface.co./EMBEDDIA/crosloengual-bert) | ReLDI-sr | 0.751 |
| [XLM-Roberta-Base](https://huggingface.co./xlm-roberta-base) | ReLDI-sr | 0.734 |
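
Macro-F1, the metric reported above, averages per-class F1 scores with equal weight, so rare entity types count as much as frequent ones. A minimal, dependency-free sketch of the computation on synthetic labels (for illustration only; not the evaluation code used for these benchmarks):

```python
def macro_f1(gold, pred):
    """Macro-F1: unweighted mean of per-class F1 scores."""
    classes = set(gold) | set(pred)
    f1s = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Synthetic example: the missed LOC drags the macro average down
# even though most tokens are tagged correctly.
gold = ["PER", "O", "LOC", "O", "PER"]
pred = ["PER", "O", "O",   "O", "PER"]
print(round(macro_f1(gold, pred), 3))  # → 0.6
```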
## Sentiment regression
The [ParlaSent dataset](https://huggingface.co./datasets/classla/ParlaSent) was used to evaluate sentiment regression in Bosnian, Croatian, and Serbian.
The procedure is explained in greater detail in the dedicated [benchmarking repository](https://github.com/clarinsi/benchich/tree/main/sentiment).

| system | train | test | R² |
|:-----------------------------------------------------------------------|:--------------------|:-------------------------|------:|
| [xlm-r-parlasent](https://huggingface.co./classla/xlm-r-parlasent) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.615 |
| [BERTić](https://huggingface.co./classla/bcms-bertic) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.612 |
| [XLM-R-SloBERTić](https://huggingface.co./classla/xlm-r-slobertic) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.607 |
| [XLM-Roberta-Large](https://huggingface.co./xlm-roberta-large) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.605 |
| [**XLM-R-BERTić**](https://huggingface.co./classla/xlm-r-bertic) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.601 |
| [crosloengual-bert](https://huggingface.co./EMBEDDIA/crosloengual-bert) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.537 |
| [XLM-Roberta-Base](https://huggingface.co./xlm-roberta-base) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.500 |
| dummy (mean) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | -0.12 |
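
The reported metric is the coefficient of determination (R² = 1 − SS_res / SS_tot). It can be negative, as the dummy baseline shows, because a model can fit the test targets worse than a constant prediction of their mean. A minimal sketch of the metric (synthetic values, for illustration only):

```python
def r2_score(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot; zero for a constant prediction of the
    # test mean, negative for anything that fits worse than that.
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

print(r2_score([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))  # close fit, ~0.98
```

A constant taken from a *different* distribution (e.g. a training-set mean applied to the test set) can dip below zero, which is how the dummy baseline lands at −0.12.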
## COPA
Accuracy was used to evaluate performance on the Serbian (Copa-SR) and Croatian (Copa-HR) COPA datasets.

| system | dataset | Accuracy score |
|:-----------------------------------------------------------------------|:--------|---------------:|
| [BERTić](https://huggingface.co./classla/bcms-bertic) | Copa-SR | 0.689 |
| [XLM-R-SloBERTić](https://huggingface.co./classla/xlm-r-slobertic) | Copa-SR | 0.665 |
| [**XLM-R-BERTić**](https://huggingface.co./classla/xlm-r-bertic) | Copa-SR | 0.637 |
| [crosloengual-bert](https://huggingface.co./EMBEDDIA/crosloengual-bert) | Copa-SR | 0.607 |
| [XLM-Roberta-Base](https://huggingface.co./xlm-roberta-base) | Copa-SR | 0.573 |
| [XLM-Roberta-Large](https://huggingface.co./xlm-roberta-large) | Copa-SR | 0.570 |

| system | dataset | Accuracy score |
|:-----------------------------------------------------------------------|:--------|---------------:|
| [BERTić](https://huggingface.co./classla/bcms-bertic) | Copa-HR | 0.669 |
| [crosloengual-bert](https://huggingface.co./EMBEDDIA/crosloengual-bert) | Copa-HR | 0.669 |
| [**XLM-R-BERTić**](https://huggingface.co./classla/xlm-r-bertic) | Copa-HR | 0.635 |
| [XLM-R-SloBERTić](https://huggingface.co./classla/xlm-r-slobertic) | Copa-HR | 0.628 |
| [XLM-Roberta-Base](https://huggingface.co./xlm-roberta-base) | Copa-HR | 0.585 |
| [XLM-Roberta-Large](https://huggingface.co./xlm-roberta-large) | Copa-HR | 0.571 |
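
COPA is scored by accuracy: for each premise the system picks the more plausible of two alternatives, and the fraction of correct picks is reported. A minimal sketch of the scoring loop (the word-overlap scorer below is a toy stand-in for a fine-tuned model's plausibility score, for illustration only):

```python
def copa_accuracy(examples, score):
    # Each example: (premise, alternative_1, alternative_2, correct_index).
    # The higher-scoring alternative is taken as the prediction.
    correct = 0
    for premise, alt1, alt2, label in examples:
        pred = 0 if score(premise, alt1) >= score(premise, alt2) else 1
        correct += int(pred == label)
    return correct / len(examples)

# Toy scorer: count shared words between premise and alternative.
overlap = lambda p, h: len(set(p.split()) & set(h.split()))
examples = [("a b c", "a b", "x y", 0), ("a b c", "x", "b c", 1)]
print(copa_accuracy(examples, overlap))  # → 1.0
```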
# Citation
Please cite the following paper:
```
@inproceedings{ljubesic-etal-2024-language,
title = "Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining",
author = "Ljube{\v{s}}i{\'c}, Nikola and
Suchomel, V{\'\i}t and
Rupnik, Peter and
Kuzman, Taja and
van Noord, Rik",
editor = "Melero, Maite and
Sakti, Sakriani and
Soria, Claudia",
booktitle = "Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.sigul-1.23",
pages = "189--203",
}
```