license: cc-by-sa-4.0
datasets:
  - cjvt/cc_gigafida
language:
  - sl
tags:
  - word case classification

sloberta-word-case-classification-multilabel

SloBERTa model finetuned on the Gigafida dataset for word case classification.

The model expects fully lowercased text as input. For each word, it predicts whether the word should stay lowercased, have its first letter uppercased, or be written in all uppercase. It also gives a constrained explanation for each decision, chosen from a fixed set of reasons. See the usage example below for more details.

Usage example

Imagine we have the following Slovenian text. Asterisked words are incorrectly cased.

Linus *torvalds* je *Finski* programer, *Poznan* kot izumitelj operacijskega sistema Linux.
(EN: Linus Torvalds is a Finnish programmer, known as the inventor of the Linux operating system.)

The model expects an all-lowercased input, so we pass it the following text:

linus torvalds je finski programer, poznan kot izumitelj operacijskega sistema linux.

The model might return the following predictions (note: predictions chosen for demonstration/explanation, not reproducibility!):

linus -> UPPER_ENTITY, UPPER_BEGIN
torvalds -> UPPER_ENTITY
je -> LOWER_OTHER
finski -> LOWER_ADJ_SKI
programer -> LOWER_OTHER
, -> LOWER_OTHER
poznan -> LOWER_HYPERCORRECTION
kot -> LOWER_OTHER
izumitelj -> LOWER_OTHER
operacijskega -> LOWER_OTHER
sistema -> LOWER_OTHER
linux -> UPPER_ENTITY

We then compare the coarse predictions (i.e., LOWER/UPPER/UPPER_ALLUC) with the initial casing and observe the following:

torvalds is originally lowercased, but the model corrects it to uppercase (because it is an entity);
finski is originally uppercased, but the model corrects it to lowercase (because it is an adjective with the suffix -ski);
poznan is originally uppercased, but the model corrects it to lowercase (the model assumes that the user hypercorrected, meaning they naïvely uppercased a word after a punctuation character).

The other predictions agree with the word case in the initial text, so they are assumed to be correct.
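
The comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the helper names (`observed_case`, `coarse_case`, `recase`, `find_corrections`) are hypothetical, the text is assumed to be already split into words with one list of predicted labels per word, and preferring the "most uppercase" class when a word's labels disagree is an assumption made only for this sketch.

```python
# Hypothetical helpers for comparing predictions with the original casing.
# Names and the tie-breaking rule are assumptions, not part of the model.
COARSE_ORDER = ["LOWER", "UPPER", "UPPER_ALLUC"]


def observed_case(word):
    """Coarse casing of a word as it was written."""
    if len(word) > 1 and word.isupper():
        return "UPPER_ALLUC"
    if word[:1].isupper():
        return "UPPER"
    return "LOWER"


def coarse_case(fine_labels):
    """Collapse fine-grained labels to LOWER / UPPER / UPPER_ALLUC."""
    coarse = {
        "UPPER_ALLUC" if label.startswith("UPPER_ALLUC") else label.split("_")[0]
        for label in fine_labels
    }
    # Assumption: if labels disagree, keep the "most uppercase" class.
    return max(coarse, key=COARSE_ORDER.index)


def recase(word, coarse):
    """Rewrite a word according to a coarse casing decision."""
    if coarse == "UPPER_ALLUC":
        return word.upper()
    if coarse == "UPPER":
        return word[:1].upper() + word[1:].lower()
    return word.lower()


def find_corrections(words, predictions):
    """Return (original, corrected, labels) for words whose casing changes."""
    corrections = []
    for word, labels in zip(words, predictions):
        if not labels:  # no label passed its threshold: leave the word alone
            continue
        predicted = coarse_case(labels)
        if predicted != observed_case(word):
            corrections.append((word, recase(word, predicted), labels))
    return corrections


# Original (cased) words and the demonstration predictions from the example.
words = ["Linus", "torvalds", "je", "Finski", "programer", ",", "Poznan",
         "kot", "izumitelj", "operacijskega", "sistema", "Linux"]
predictions = [
    ["UPPER_ENTITY", "UPPER_BEGIN"], ["UPPER_ENTITY"], ["LOWER_OTHER"],
    ["LOWER_ADJ_SKI"], ["LOWER_OTHER"], ["LOWER_OTHER"],
    ["LOWER_HYPERCORRECTION"], ["LOWER_OTHER"], ["LOWER_OTHER"],
    ["LOWER_OTHER"], ["LOWER_OTHER"], ["UPPER_ENTITY"],
]

for original, corrected, labels in find_corrections(words, predictions):
    print(f"{original} -> {corrected} ({', '.join(labels)})")
```

The fine-grained labels are kept alongside each correction so that the explanation (entity, -ski adjective, hypercorrection) can be shown to the user together with the fix.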

More details

More concretely, the model is a 12-class multi-label classifier with the following class indices and interpretations:

0: "LOWER_OTHER",  # lowercased for an uncaptured reason
1: "LOWER_HYPERCORRECTION",  # lowercased due to hypercorrection (e.g., the user automatically uppercased a word after a "." even though the "." does not end a sentence, so the word should be lowercased)
2: "LOWER_ADJ_SKI",  # lowercased because the word is an adjective ending in suffix -ski
3: "LOWER_ENTITY_PART",  # lowercased word that is part of an entity (e.g., "Novo **mesto**")
4: "UPPER_OTHER",  # uppercased for an uncaptured reason
5: "UPPER_BEGIN",  # uppercased because the word begins a sentence
6: "UPPER_ENTITY",  # uppercased word that is part of an entity
7: "UPPER_DIRECT_SPEECH",  # uppercased word due to direct speech
8: "UPPER_ADJ_OTHER",  # uppercased adjective for an uncaptured reason (usually this is a possessive adjective)
9: "UPPER_ALLUC_OTHER",  # all-uppercased for an uncaptured reason
10: "UPPER_ALLUC_BEGIN",  # all-uppercased because the word begins a sentence
11: "UPPER_ALLUC_ENTITY"  # all-uppercased because the word is part of an entity

As the model is trained for multi-label classification, a word can be assigned several labels: every label whose probability exceeds a threshold T. Naïvely, T=0.5 can be used for all labels, but it is slightly better to use per-label thresholds optimized on a small validation set. They are stored in the file label_thresholds.json and listed below, along with the validation-set F1 achieved with the best threshold.

LOWER_OTHER: T=0.4500 -> F1 = 0.9965
LOWER_HYPERCORRECTION: T=0.5800 -> F1 = 0.8555
LOWER_ADJ_SKI: T=0.4810 -> F1 = 0.9863
LOWER_ENTITY_PART: T=0.4330 -> F1 = 0.8024
UPPER_OTHER: T=0.4460 -> F1 = 0.7538
UPPER_BEGIN: T=0.4690 -> F1 = 0.9905
UPPER_ENTITY: T=0.5030 -> F1 = 0.9670
UPPER_DIRECT_SPEECH: T=0.4170 -> F1 = 0.9852
UPPER_ADJ_OTHER: T=0.5080 -> F1 = 0.9431
UPPER_ALLUC_OTHER: T=0.4850 -> F1 = 0.8463
UPPER_ALLUC_BEGIN: T=0.5170 -> F1 = 0.9798
UPPER_ALLUC_ENTITY: T=0.4490 -> F1 = 0.9391
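
As a rough sketch of how the per-label thresholds might be applied, the snippet below turns one word's 12 raw scores into label names. It rests on several assumptions: the model is taken to output one logit per class in the index order listed above, a sigmoid is assumed to convert logits to probabilities (the usual choice for multi-label classifiers), the thresholds are hard-coded from the list above because the layout of label_thresholds.json is not described here, and the function names and example logits are hypothetical.

```python
import math

# Class names in index order 0..11, as listed in this model card.
LABELS = [
    "LOWER_OTHER", "LOWER_HYPERCORRECTION", "LOWER_ADJ_SKI",
    "LOWER_ENTITY_PART", "UPPER_OTHER", "UPPER_BEGIN", "UPPER_ENTITY",
    "UPPER_DIRECT_SPEECH", "UPPER_ADJ_OTHER", "UPPER_ALLUC_OTHER",
    "UPPER_ALLUC_BEGIN", "UPPER_ALLUC_ENTITY",
]

# Tuned thresholds copied from the list above (same order as LABELS).
THRESHOLDS = dict(zip(LABELS, [
    0.4500, 0.5800, 0.4810, 0.4330, 0.4460, 0.4690,
    0.5030, 0.4170, 0.5080, 0.4850, 0.5170, 0.4490,
]))

# Naive baseline: the same threshold for every label.
UNIFORM = {label: 0.5 for label in LABELS}


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode(logits, thresholds=THRESHOLDS):
    """Return every label whose probability exceeds its threshold.

    Assumption: `logits` holds one raw score per class for a single word.
    """
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    probs = [sigmoid(x) for x in logits]
    return [label for label, p in zip(LABELS, probs) if p > thresholds[label]]


# Made-up logits for a sentence-initial entity such as "linus".
logits = [-4.0] * 12
logits[5] = 3.0   # UPPER_BEGIN
logits[6] = 2.0   # UPPER_ENTITY
print(decode(logits))

# A borderline case: probability of about 0.55 for LOWER_HYPERCORRECTION
# passes the naive 0.5 threshold but not the tuned 0.58 threshold.
borderline = [-4.0] * 12
borderline[1] = 0.2
print(decode(borderline, UNIFORM), decode(borderline))
```

With the tuned thresholds a word may end up with no label at all, as in the borderline case; how to handle such words (for example, falling back to the highest-probability label) is a design choice left open here.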