
Czech PDT-C 1.0 Model

The PDT-C 1.0 model is distributed under the CC BY-NC-SA licence. It is trained on the PDT-C 1.0 treebank using the RobeCzech model, and performs morphological analysis with the MorfFlex CZ 2.0 morphological dictionary via MorphoDiTa.

The model requires UDPipe 2.1, together with the Python packages ufal.udpipe (version at least 1.3.1.1) and ufal.morphodita (version at least 1.11.2.1).

Download

The latest version 231116 of the Czech PDT-C 1.0 model can be downloaded from the LINDAT/CLARIN repository.

The model is also available in the REST service.
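
For a quick try-out without a local installation, the service can be called over HTTP. The sketch below uses the documented UDPipe REST process endpoint; the model handle czech-pdtc1.0 is an assumption, so consult the service's model listing for the exact name of the 231116 release.

```python
# Minimal sketch of calling the UDPipe REST service. The endpoint and the
# data/model/tokenizer/tagger/parser parameters follow the documented UDPipe
# REST API; the model handle "czech-pdtc1.0" is an assumed placeholder.
import requests

response = requests.post(
    "https://lindat.mff.cuni.cz/services/udpipe/api/process",
    data={
        "model": "czech-pdtc1.0",  # assumed handle of the PDT-C 1.0 model
        "tokenizer": "",           # run the rule-based tokenizer
        "tagger": "",              # fill XPOS tags and lemmas
        "parser": "",              # fill HEAD and DEPREL
        "data": "Praha je hlavní město České republiky.",
    },
)
response.raise_for_status()
print(response.json()["result"])   # CoNLL-U output
```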

PDT-C 1.0 Morphological System

PDT-C 1.0 uses the PDT-C tag set from MorfFlex CZ 2.0, which is an evolution of the original PDT tag set devised by Jan Hajič (Hajič, 2004). The tags are positional, with 15 positions corresponding to part of speech, detailed part of speech, gender, number, case, etc. (e.g., NNFS1-----A----). Different meanings of the same lemma are distinguished, and additional comments can be provided for every lemma meaning. The complete reference can be found in the Manual for Morphological Annotation, Revision for the Prague Dependency Treebank - Consolidated 2020 release, and a quick reference is available in the PDT-C positional morphological tags overview.
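
As an illustration, a positional tag can be decomposed mechanically. The sketch below labels the 15 positions with the names of the original PDT scheme; PDT-C refines the value inventories at some positions, so the labels are indicative only.

```python
# Decompose a 15-position PDT-style tag into named positions. The position
# names follow the original PDT scheme; PDT-C refines some value inventories.
PDT_POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Variant",
]

def decode_tag(tag: str) -> dict:
    assert len(tag) == 15, "PDT-C positional tags have exactly 15 positions"
    # "-" marks a position that is not applicable for the given word
    return {name: value for name, value in zip(PDT_POSITIONS, tag) if value != "-"}

print(decode_tag("NNFS1-----A----"))
# {'POS': 'N', 'SubPOS': 'N', 'Gender': 'F', 'Number': 'S', 'Case': '1', 'Negation': 'A'}
```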

PDT-C 1.0 employs dependency relations from the PDT analytical level, with a quick reference available in the PDT-C analytical functions and clause segmentation overview.

In the CoNLL-U format,

  • the tags are filled in the XPOS column, and
  • the dependency relations are filled in the DEPREL column, even though they differ from the Universal Dependencies relations (see the sketch below).
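
For illustration, the following sketch pulls these two columns (XPOS is column 5 and DEPREL column 8 of the ten-column CoNLL-U format) out of a parsed document:

```python
# Extract the positional tag (XPOS, column 5) and the analytical function
# (DEPREL, column 8) from each token line of a CoNLL-U document.
def xpos_and_deprel(conllu: str):
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue                      # skip sentence breaks and comments
        columns = line.split("\t")
        if len(columns) != 10:
            continue                      # not a token line
        token_id, form, xpos, deprel = columns[0], columns[1], columns[4], columns[7]
        if "-" in token_id or "." in token_id:
            continue                      # skip multiword-token and empty-node lines
        yield form, xpos, deprel
```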

PDT-C 1.0 Train/Dev/Test Splits

The PDT-C corpus consists of four datasets, but some of them do not have an official train/dev/test split. We therefore used the following split:

  • The PDT dataset is already split into train, dev (dtest), and test (etest).
  • The PCEDT dataset is a translated version of the Wall Street Journal, so we used the usual split into train (sections 0-18), dev (sections 19-21), and test (sections 22-24).
  • The PDTSC and FAUST datasets have no official split, so we split them into dev (documents with identifiers ending in 6), test (documents with identifiers ending in 7), and train (all the remaining documents); see the sketch after this list.
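
A minimal sketch of the PDTSC/FAUST assignment rule (split_of is a hypothetical helper, not part of the released tooling):

```python
# Hypothetical helper illustrating the PDTSC/FAUST split rule: documents whose
# identifier ends in 6 go to dev, those ending in 7 to test, the rest to train.
def split_of(document_id: str) -> str:
    if document_id.endswith("6"):
        return "dev"
    if document_id.endswith("7"):
        return "test"
    return "train"
```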

Acknowledgements

This work has been supported by the LINDAT/CLARIAH-CZ project funded by the Ministry of Education, Youth and Sports of the Czech Republic (project LM2023062).

Publications

Model Performance

Tagging and Lemmatization

We evaluate tagging and lemmatization on the four datasets of PDT-C 1.0, and we also compute a macro-average. For lemmatization, we use the following metrics:

  • Lemmas: the primary metric comparing the lemma proper, i.e., the lemma with an optional lemma number (additional lemma comments like “this is a given name” are ignored);
  • LemmasEM: an exact match also comparing the lemma comments, so this metric is less than or equal to Lemmas. Our model directly predicts only the lemma proper (no additional comments) and relies on the morphological dictionary to supply the comments, so it fails to generate comments for unknown words (like an unknown given name). A sketch of the distinction follows this list.
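
The distinction between the two metrics can be sketched as follows, over aligned (gold-tokenized) tokens and assuming the MorfFlex CZ convention that lemma comments are appended after the first underscore; both the convention as stated and the example lemma are simplifications, not the exact specification.

```python
# Sketch of the two lemma metrics over aligned tokens, assuming lemma comments
# follow the lemma proper after the first underscore, e.g. "Jan-2_;Y" (lemma
# proper "Jan-2" with its number, comment ";Y" marking a given name).
def lemma_proper(lemma: str) -> str:
    return lemma.split("_", 1)[0]  # lemma with optional number, comments dropped

def lemma_accuracies(gold: list[str], system: list[str]) -> tuple[float, float]:
    lemmas = sum(lemma_proper(g) == lemma_proper(s) for g, s in zip(gold, system))
    lemmas_em = sum(g == s for g, s in zip(gold, system))  # exact match incl. comments
    return lemmas / len(gold), lemmas_em / len(gold)
```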

We perform the evaluation using udpipe2_eval.py, which is a minor extension of the CoNLL 2018 shared task evaluation script.
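
A hypothetical invocation, assuming the script keeps the positional gold-file/system-file interface of its CoNLL 2018 ancestor (the .conllu file names are placeholders):

```python
# Hypothetical invocation of udpipe2_eval.py, assuming it keeps the positional
# gold-file / system-file interface of the CoNLL 2018 evaluation script.
import subprocess

subprocess.run(
    ["python3", "udpipe2_eval.py", "gold.conllu", "system.conllu"],
    check=True,
)
```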

Because the model also includes a rule-based tokenizer and sentence splitter, we evaluate in two modes:

  • using raw input text, which must first be tokenized and split into sentences; the resulting scores are in fact F1-scores (the span F1 computation is sketched after the table below). Note that the FAUST dataset does not contain any discernible sentence boundaries.
  • using gold tokenization.

Treebank   Mode                Tokens   Sents   XPOS    Lemmas   LemmasEM
PDT        Raw text            99.91    88.00   98.69   99.10    98.86
PDT        Gold tokenization   –        –       98.78   99.19    98.96
PCEDT      Raw text            99.97    94.06   98.77   99.36    98.75
PCEDT      Gold tokenization   –        –       98.80   99.40    98.78
PDTSC      Raw text            100.0    98.31   98.77   99.23    99.16
PDTSC      Gold tokenization   –        –       98.77   99.23    99.16
FAUST      Raw text            100.0    10.98   97.05   98.88    98.43
FAUST      Gold tokenization   –        –       97.42   98.78    98.30
MacroAvg   Gold tokenization   –        –       98.44   99.15    98.80
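
For reference, the token and sentence F1-scores above are computed over exactly matching spans, as in the CoNLL 2018 evaluation; a minimal sketch of that computation:

```python
# Span F1 as used for tokenization and sentence segmentation: precision and
# recall over exactly matching (start, end) spans, combined by harmonic mean.
def span_f1(gold_spans: set, system_spans: set) -> float:
    correct = len(gold_spans & system_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(system_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)
```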

Dependency Parsing

In PDT-C 1.0, the only manually annotated dependency parsing dataset is a subset of the PDT dataset. We perform the evaluation as in the previous section.

Treebank     Mode                Tokens   Sents   XPOS    Lemmas   LemmasEM   UAS     LAS
PDT subset   Raw text            99.94    88.49   98.74   99.16    98.97      93.45   90.32
PDT subset   Gold tokenization   –        –       98.81   99.23    99.03      94.41   91.48