polish-roberta-base-v2-cposes-tagging

This model is a fine-tuned version of sdadas/polish-roberta-base-v2 on the nkjp1m dataset. It achieves the following results on the evaluation set:

Loss: 0.0458
Precision: 0.9913
Recall: 0.9912
F1: 0.9913
Accuracy: 0.9889

You can find the training notebook here: https://github.com/WikKam/roberta-pos-finetuning

Usage

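from transformers import pipeline

# Load the tagger from the Hugging Face Hub as a token-classification pipeline.
nlp = pipeline("token-classification", model="wkaminski/polish-roberta-base-v2-cposes-tagging")

# Tag a sample sentence ("But today it's pouring"); the result holds one prediction per token.
nlp("Ale dzisiaj leje")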
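The pipeline predicts a tag for every sub-word token, so a single word may come back in several pieces. Below is a minimal sketch of one way to merge them, assuming each prediction carries `entity`, `start`, and `end` keys (character offsets are only filled in when a fast tokenizer is used). `merge_subwords` is a hypothetical helper, and the `sample` list is hand-written for illustration, not real model output.

```python
def merge_subwords(text, tokens):
    """Merge neighbouring predictions that touch in the text and share a tag."""
    spans = []
    for tok in tokens:
        touches_previous = bool(spans) and spans[-1]["end"] == tok["start"]
        if touches_previous and spans[-1]["tag"] == tok["entity"]:
            # Same tag and no gap in the text: extend the current span.
            spans[-1]["end"] = tok["end"]
        else:
            spans.append({"start": tok["start"], "end": tok["end"], "tag": tok["entity"]})
    # Recover the surface form of each span from the original text.
    return [(text[s["start"]:s["end"]], s["tag"]) for s in spans]


text = "Ale dzisiaj leje."
# Hand-written stand-in for pipeline output (assumption, not actual predictions).
sample = [
    {"entity": "Conj", "start": 0, "end": 3},
    {"entity": "Adv", "start": 4, "end": 7},
    {"entity": "Adv", "start": 7, "end": 11},
    {"entity": "V", "start": 12, "end": 16},
    {"entity": "Punct", "start": 16, "end": 17},
]

tagged = merge_subwords(text, sample)
print(tagged)
# [('Ale', 'Conj'), ('dzisiaj', 'Adv'), ('leje', 'V'), ('.', 'Punct')]
```

Merging only when the tags agree keeps punctuation attached to a word (such as the final period) as a separate item, at the cost of leaving a word split if its pieces receive different tags.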

Model description

This model is a coarse part-of-speech tagger for the Polish language based on sdadas/polish-roberta-base-v2. It supports 13 classes, each representing a coarse part of speech:

{
 0: 'A',
 1: 'Adv',
 2: 'Comp',
 3: 'Conj',
 4: 'Dig',
 5: 'Interj',
 6: 'N',
 7: 'Num',
 8: 'Part',
 9: 'Prep',
 10: 'Punct',
 11: 'V',
 12: 'X'
}

The tags have the same meaning as in the nkjp1m dataset:

Tag	Description in English	Description in Polish	Example in Polish
A	Adjective	przymiotnik	szybki
Adv	Adverb	przysłówek	szybko
Comp	Comparative / Complementizer	stopień porównawczy / spójnik podrzędny	lepszy / że
Conj	Conjunction	spójnik	i
Dig	Digit	cyfra	5, 3
Interj	Interjection	wykrzyknik	och!
N	Noun	rzeczownik	dom
Num	Numeral	liczebnik	jeden
Part	Particle	partykuła	by
Prep	Preposition	przyimek	w
Punct	Punctuation	interpunkcja	., !, ?
V	Verb	czasownik	biegać
X	Unknown / Other	niesklasyfikowane	xxx
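When working with raw model outputs instead of the pipeline, predicted class ids have to be mapped back to tags. A small sketch using the id-to-label mapping listed above; `ID2LABEL`, `LABEL2ID`, and `decode` are illustrative names, and the ids passed in are made up for the example.

```python
# Id-to-label mapping copied from the model description.
ID2LABEL = {
    0: "A", 1: "Adv", 2: "Comp", 3: "Conj", 4: "Dig", 5: "Interj", 6: "N",
    7: "Num", 8: "Part", 9: "Prep", 10: "Punct", 11: "V", 12: "X",
}

# Inverse mapping, handy when preparing labels for further fine-tuning.
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode(class_ids):
    """Translate a sequence of predicted class ids into tag names."""
    return [ID2LABEL[i] for i in class_ids]


print(decode([6, 11, 10]))  # ['N', 'V', 'Punct']
```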

Intended uses & limitations

Although good tools for POS tagging in Polish already exist (e.g. http://morfeusz.sgjp.pl/), I needed a Polish POS tagger that could be loaded easily in the browser. Hugging Face supports this, which is why I created this model.

Training and evaluation data

The model was trained on half of the test data of the nkjp1m dataset (~0.5 million tokens).

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-05
train_batch_size: 16
eval_batch_size: 16
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 3
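To make the linear scheduler concrete: it decays the learning rate from its initial value toward zero over the course of training. A minimal sketch, assuming no warmup (none is listed above) and 6465 total optimizer steps (the final step count in the training results). `linear_lr` is a hypothetical helper written for illustration, not a Transformers function.

```python
def linear_lr(step, base_lr=2e-5, total_steps=6465, warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Linear ramp-up from 0 to base_lr (unused when warmup_steps is 0).
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at total_steps.
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


# Learning rate at the start and at the end of each of the three epochs.
for step in (0, 2155, 4310, 6465):
    print(step, linear_lr(step))
```

Under these assumptions the rate drops by one third of its initial value per epoch and reaches zero at the last step.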

Training results

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy
0.0471	1.0	2155	0.0491	0.9896	0.9900	0.9898	0.9873
0.0291	2.0	4310	0.0467	0.9901	0.9905	0.9903	0.9884
0.0191	3.0	6465	0.0458	0.9913	0.9912	0.9913	0.9889
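As a quick sanity check, the F1 column can be recomputed from the precision and recall columns as their harmonic mean. The sketch below does this for the three rows above; because the reported precision and recall are already rounded to four decimals, agreement is only expected to within about 1e-4. `f1_score` is a local helper defined here, not an import from any metrics library.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for epochs 1 to 3, copied from the table.
rows = [
    (0.9896, 0.9900, 0.9898),
    (0.9901, 0.9905, 0.9903),
    (0.9913, 0.9912, 0.9913),
]

for precision, recall, reported in rows:
    recomputed = f1_score(precision, recall)
    print(f"reported={reported:.4f} recomputed={recomputed:.5f}")
```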

Framework versions

Transformers 4.35.2
Pytorch 2.1.0+cu118
Datasets 2.15.0
Tokenizers 0.15.0
