	---
	license: cc-by-sa-4.0
	language:
	- hr
	- bs
	- sr
	datasets:
	- classla/xlm-r-bertic-data
	---
	# XLM-R-BERTić

	This model was produced by further pre-training [XLM-Roberta-large](https://huggingface.co./xlm-roberta-large) for 48k steps on South Slavic languages, using the [XLM-R-BERTić dataset](https://huggingface.co./datasets/classla/xlm-r-bertic-data).
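
A minimal loading sketch, assuming the checkpoint is published under the `classla/xlm-r-bertic` repository linked in the tables of this card and that the `transformers` library (with a PyTorch backend) is installed. The helper name `load_encoder` is hypothetical and only serves the illustration.

```python
# Repository name taken from the links in this model card.
MODEL_ID = "classla/xlm-r-bertic"


def load_encoder(model_id: str = MODEL_ID):
    """Fetch the tokenizer and the bare encoder (no task head).

    Assumption: the repository can be loaded through the generic Auto classes.
    """
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model
```

Calling `tokenizer, model = load_encoder()` downloads the weights on first use. For fine-tuning, `AutoModel` would typically be swapped for a class with a task head, such as `AutoModelForTokenClassification` for NER.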

	# Benchmarking
	Three tasks were chosen for model evaluation:
	* Named Entity Recognition (NER)
	* Sentiment regression
	* COPA (Choice of Plausible Alternatives)


	In all cases, this model was fine-tuned on the specific downstream task.

	## NER

	Mean macro-F1 scores were used to evaluate performance. Datasets used: [hr500k](https://huggingface.co./datasets/classla/hr500k), [ReLDI-sr](https://huggingface.co./datasets/classla/reldi_sr), [ReLDI-hr](https://huggingface.co./datasets/classla/reldi_hr), and [SETimes.SR](https://huggingface.co./datasets/classla/setimes_sr).
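
As an illustration of the metric, the sketch below computes macro-F1 for one run and averages it over several runs. It assumes token-level label comparison and that "mean" refers to an average over repeated fine-tuning runs; neither detail is stated in this card, and the function names are hypothetical.

```python
from statistics import mean


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def mean_macro_f1(runs):
    """Average macro-F1 over (gold, pred) pairs, one pair per run."""
    return mean(macro_f1(gold, pred) for gold, pred in runs)


# Toy labels, not taken from any of the datasets above.
gold = ["O", "B-PER", "O", "B-LOC"]
pred = ["O", "B-PER", "B-LOC", "B-LOC"]
score = macro_f1(gold, pred)  # per-label F1: B-LOC 2/3, B-PER 1, O 2/3
```

Because every label contributes equally, rare entity classes weigh as much as the frequent `O` label.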

	| system | dataset | F1 score |
	|:-----------------------------------------------------------------------|:--------|---------:|
	| [XLM-R-BERTić](https://huggingface.co./classla/xlm-r-bertic) | hr500k | 0.927 |
	| [BERTić](https://huggingface.co./classla/bcms-bertic) | hr500k | 0.925 |
	| [XLM-R-SloBERTić](https://huggingface.co./classla/xlm-r-slobertic) | hr500k | 0.923 |
	| [XLM-Roberta-Large](https://huggingface.co./xlm-roberta-large) | hr500k | 0.919 |
	| [crosloengual-bert](https://huggingface.co./EMBEDDIA/crosloengual-bert) | hr500k | 0.918 |
	| [XLM-Roberta-Base](https://huggingface.co./xlm-roberta-base) | hr500k | 0.903 |

	| system | dataset | F1 score |
	|:-----------------------------------------------------------------------|:---------|---------:|
	| [XLM-R-SloBERTić](https://huggingface.co./classla/xlm-r-slobertic) | ReLDI-hr | 0.812 |
	| [XLM-R-BERTić](https://huggingface.co./classla/xlm-r-bertic) | ReLDI-hr | 0.809 |
	| [crosloengual-bert](https://huggingface.co./EMBEDDIA/crosloengual-bert) | ReLDI-hr | 0.794 |
	| [BERTić](https://huggingface.co./classla/bcms-bertic) | ReLDI-hr | 0.792 |
	| [XLM-Roberta-Large](https://huggingface.co./xlm-roberta-large) | ReLDI-hr | 0.791 |
	| [XLM-Roberta-Base](https://huggingface.co./xlm-roberta-base) | ReLDI-hr | 0.763 |

	| system | dataset | F1 score |
	|:-----------------------------------------------------------------------|:-----------|---------:|
	| [XLM-R-SloBERTić](https://huggingface.co./classla/xlm-r-slobertic) | SETimes.SR | 0.949 |
	| [XLM-R-BERTić](https://huggingface.co./classla/xlm-r-bertic) | SETimes.SR | 0.940 |
	| [BERTić](https://huggingface.co./classla/bcms-bertic) | SETimes.SR | 0.936 |
	| [XLM-Roberta-Large](https://huggingface.co./xlm-roberta-large) | SETimes.SR | 0.933 |
	| [crosloengual-bert](https://huggingface.co./EMBEDDIA/crosloengual-bert) | SETimes.SR | 0.922 |
	| [XLM-Roberta-Base](https://huggingface.co./xlm-roberta-base) | SETimes.SR | 0.914 |

	| system | dataset | F1 score |
	|:-----------------------------------------------------------------------|:---------|---------:|
	| [XLM-R-BERTić](https://huggingface.co./classla/xlm-r-bertic) | ReLDI-sr | 0.841 |
	| [XLM-R-SloBERTić](https://huggingface.co./classla/xlm-r-slobertic) | ReLDI-sr | 0.824 |
	| [BERTić](https://huggingface.co./classla/bcms-bertic) | ReLDI-sr | 0.798 |
	| [XLM-Roberta-Large](https://huggingface.co./xlm-roberta-large) | ReLDI-sr | 0.774 |
	| [crosloengual-bert](https://huggingface.co./EMBEDDIA/crosloengual-bert) | ReLDI-sr | 0.751 |
	| [XLM-Roberta-Base](https://huggingface.co./xlm-roberta-base) | ReLDI-sr | 0.734 |

	## Sentiment regression

	The [ParlaSent dataset](https://huggingface.co./datasets/classla/ParlaSent) was used to evaluate sentiment regression for Bosnian, Croatian, and Serbian.
	The procedure is explained in greater detail in the dedicated [benchmarking repository](https://github.com/clarinsi/benchich/tree/main/sentiment).

	| system | train | test | R² |
	|:-----------------------------------------------------------------------|:--------------------|:-------------------------|------:|
	| [xlm-r-parlasent](https://huggingface.co./classla/xlm-r-parlasent) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.615 |
	| [BERTić](https://huggingface.co./classla/bcms-bertic) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.612 |
	| [XLM-R-SloBERTić](https://huggingface.co./classla/xlm-r-slobertic) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.607 |
	| [XLM-Roberta-Large](https://huggingface.co./xlm-roberta-large) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.605 |
	| [XLM-R-BERTić](https://huggingface.co./classla/xlm-r-bertic) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.601 |
	| [crosloengual-bert](https://huggingface.co./EMBEDDIA/crosloengual-bert) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.537 |
	| [XLM-Roberta-Base](https://huggingface.co./xlm-roberta-base) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.500 |
	| dummy (mean) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | -0.12 |
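
The negative score of the `dummy (mean)` row can look surprising. The sketch below computes the coefficient of determination in plain Python and shows that a constant prediction scores below zero whenever the constant differs from the mean of the test labels. It assumes, without confirmation from the benchmarking repository, that the dummy baseline predicts a mean estimated on the training split. The function name and the toy numbers are illustrative only.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy labels, not taken from ParlaSent.
train_labels = [1.0, 2.0, 3.0]
test_labels = [2.0, 3.0, 4.0]

# Assumed dummy baseline: always predict the mean of the training labels.
train_mean = sum(train_labels) / len(train_labels)
dummy_predictions = [train_mean] * len(test_labels)

dummy_r2 = r2_score(test_labels, dummy_predictions)  # -1.5 for these toy numbers
```

Predicting the mean of the test labels themselves would give exactly 0, so any shift between the train and test label distributions pushes a constant baseline below zero.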


	## COPA


	| system | dataset | Accuracy score |
	|:-----------------------------------------------------------------------|:--------|---------------:|
	| [BERTić](https://huggingface.co./classla/bcms-bertic) | Copa-SR | 0.689 |
	| [XLM-R-SloBERTić](https://huggingface.co./classla/xlm-r-slobertic) | Copa-SR | 0.665 |
	| [XLM-R-BERTić](https://huggingface.co./classla/xlm-r-bertic) | Copa-SR | 0.637 |
	| [crosloengual-bert](https://huggingface.co./EMBEDDIA/crosloengual-bert) | Copa-SR | 0.607 |
	| [XLM-Roberta-Base](https://huggingface.co./xlm-roberta-base) | Copa-SR | 0.573 |
	| [XLM-Roberta-Large](https://huggingface.co./xlm-roberta-large) | Copa-SR | 0.570 |


	| system | dataset | Accuracy score |
	|:-----------------------------------------------------------------------|:--------|---------------:|
	| [BERTić](https://huggingface.co./classla/bcms-bertic) | Copa-HR | 0.669 |
	| [XLM-R-SloBERTić](https://huggingface.co./classla/xlm-r-slobertic) | Copa-HR | 0.628 |
	| [XLM-R-BERTić](https://huggingface.co./classla/xlm-r-bertic) | Copa-HR | 0.635 |
	| [crosloengual-bert](https://huggingface.co./EMBEDDIA/crosloengual-bert) | Copa-HR | 0.669 |
	| [XLM-Roberta-Base](https://huggingface.co./xlm-roberta-base) | Copa-HR | 0.585 |
	| [XLM-Roberta-Large](https://huggingface.co./xlm-roberta-large) | Copa-HR | 0.571 |



	# Citation

	Please cite the following paper:
	```
	@inproceedings{ljubesic-etal-2024-language,
	title = "Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining",
	author = "Ljube{\v{s}}i{\'c}, Nikola and
	Suchomel, V{\'\i}t and
	Rupnik, Peter and
	Kuzman, Taja and
	van Noord, Rik",
	editor = "Melero, Maite and
	Sakti, Sakriani and
	Soria, Claudia",
	booktitle = "Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024",
	month = may,
	year = "2024",
	address = "Torino, Italia",
	publisher = "ELRA and ICCL",
	url = "https://aclanthology.org/2024.sigul-1.23",
	pages = "189--203",
	}

	```
