---
license: cc-by-sa-4.0
datasets:
- cjvt/cc_gigafida
language:
- sl
tags:
- word case classification
---
# sloberta-word-case-classification-multilabel

SloBERTa model fine-tuned on the Gigafida dataset for word case classification.

The input to the model is expected to be **fully lowercased text**.
The model classifies whether each input word should stay lowercased, be uppercased (first letter capitalized), or be all-uppercased. In addition, it provides a constrained explanation for its case classification, chosen from a fixed set of labels.
See the usage example below for more details.
## Usage example
Imagine we have the following Slovenian text. Asterisked words have incorrect casing.
```
Linus *torvalds* je *Finski* programer, *Poznan* kot izumitelj operacijskega sistema Linux.
(EN: Linus Torvalds is a Finnish programmer, known as the inventor of the Linux operating system.)
```
The model expects all-lowercased input, so we pass it the following text:
```
linus torvalds je finski programer, poznan kot izumitelj operacijskega sistema linux.
```
The model might return the following predictions (note: predictions chosen for demonstration/explanation, not reproducibility!):
```
linus -> UPPER_ENTITY, UPPER_BEGIN
torvalds -> UPPER_ENTITY
je -> LOWER_OTHER
finski -> LOWER_ADJ_SKI
programer -> LOWER_OTHER
, -> LOWER_OTHER
poznan -> LOWER_HYPERCORRECTION
kot -> LOWER_OTHER
izumitelj -> LOWER_OTHER
operacijskega -> LOWER_OTHER
sistema -> LOWER_OTHER
linux -> UPPER_ENTITY
```
We would then compare the coarse predictions (i.e., LOWER/UPPER/UPPER_ALLUC, the label prefixes) with the casing in the original text and observe the following:
- `torvalds` is lowercased in the original text, but the model corrects it to uppercase (because it is part of an entity),
- `Finski` is uppercased in the original text, but the model corrects it to lowercase (because it is an adjective with the suffix -ski),
- `Poznan` is uppercased in the original text, but the model corrects it to lowercase (the model assumes that the user made the mistake due to hypercorrection, meaning they naïvely uppercased a word after a character that could be punctuation).

The other predictions agree with the word case in the original text, so the original casing of those words is assumed to be correct.
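The comparison described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code: `coarse_label`, `observed_case`, and `find_corrections` are hypothetical helper names, the coarse class is taken to be the label prefix, and the predictions are hard-coded from the demonstration output instead of being produced by the model.

```python
def coarse_label(fine_label: str) -> str:
    """Collapse a fine-grained label to LOWER, UPPER, or UPPER_ALLUC."""
    # Check the longer prefix first: UPPER_ALLUC_* also starts with UPPER.
    for prefix in ("UPPER_ALLUC", "UPPER", "LOWER"):
        if fine_label.startswith(prefix):
            return prefix
    raise ValueError(f"Unknown label: {fine_label}")


def observed_case(word: str) -> str:
    """Classify the casing found in the original text."""
    # Assumption: a single uppercase letter counts as UPPER, not UPPER_ALLUC.
    if len(word) > 1 and word.isupper():
        return "UPPER_ALLUC"
    if word[:1].isupper():
        return "UPPER"
    return "LOWER"


def find_corrections(original_words, predicted_labels):
    """Return (word, observed, predicted) for words whose casing disagrees."""
    corrections = []
    for word, labels in zip(original_words, predicted_labels):
        predicted = {coarse_label(label) for label in labels}
        observed = observed_case(word)
        if observed not in predicted:
            corrections.append((word, observed, sorted(predicted)))
    return corrections


# Original (incorrectly cased) words and the demonstration predictions.
words = ["Linus", "torvalds", "je", "Finski", "programer", ",", "Poznan",
         "kot", "izumitelj", "operacijskega", "sistema", "Linux"]
predictions = [
    ["UPPER_ENTITY", "UPPER_BEGIN"], ["UPPER_ENTITY"], ["LOWER_OTHER"],
    ["LOWER_ADJ_SKI"], ["LOWER_OTHER"], ["LOWER_OTHER"],
    ["LOWER_HYPERCORRECTION"], ["LOWER_OTHER"], ["LOWER_OTHER"],
    ["LOWER_OTHER"], ["LOWER_OTHER"], ["UPPER_ENTITY"],
]

corrections = find_corrections(words, predictions)
for word, observed, predicted in corrections:
    print(f"{word}: {observed} -> {predicted}")
```

With the demonstration predictions, only `torvalds`, `Finski`, and `Poznan` are reported. The fine-grained labels of the flagged words can then serve as the explanation shown to the user.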
## More details
More concretely, the model is a 12-class multi-label classifier with the following class indices and interpretations:
```
0: "LOWER_OTHER",  # lowercased for an uncaptured reason
1: "LOWER_HYPERCORRECTION",  # lowercased due to hypercorrection (e.g., the user automatically uppercased a word after a "." that is not acting as a punctuation mark; the word should instead be lowercased)
2: "LOWER_ADJ_SKI",  # lowercased because the word is an adjective ending in the suffix -ski
3: "LOWER_ENTITY_PART",  # lowercased word that is part of an entity (e.g., "Novo **mesto**")
4: "UPPER_OTHER",  # uppercased for an uncaptured reason
5: "UPPER_BEGIN",  # uppercased because the word begins a sentence
6: "UPPER_ENTITY",  # uppercased word that is part of an entity
7: "UPPER_DIRECT_SPEECH",  # uppercased word due to direct speech
8: "UPPER_ADJ_OTHER",  # uppercased adjective for an uncaptured reason (usually a possessive adjective)
9: "UPPER_ALLUC_OTHER",  # all-uppercased for an uncaptured reason
10: "UPPER_ALLUC_BEGIN",  # all-uppercased because the word begins a sentence
11: "UPPER_ALLUC_ENTITY"  # all-uppercased because the word is part of an entity
```
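For convenience, the class list can be written as a Python dictionary and used to turn a multi-hot prediction vector into label names. This is a minimal sketch: `ID2LABEL` mirrors the class indices listed in this section, while `decode_multi_hot` is a hypothetical helper and not part of the model's API.

```python
ID2LABEL = {
    0: "LOWER_OTHER",
    1: "LOWER_HYPERCORRECTION",
    2: "LOWER_ADJ_SKI",
    3: "LOWER_ENTITY_PART",
    4: "UPPER_OTHER",
    5: "UPPER_BEGIN",
    6: "UPPER_ENTITY",
    7: "UPPER_DIRECT_SPEECH",
    8: "UPPER_ADJ_OTHER",
    9: "UPPER_ALLUC_OTHER",
    10: "UPPER_ALLUC_BEGIN",
    11: "UPPER_ALLUC_ENTITY",
}


def decode_multi_hot(multi_hot):
    """Map a 12-element 0/1 vector to the names of the active classes."""
    if len(multi_hot) != len(ID2LABEL):
        raise ValueError(f"Expected {len(ID2LABEL)} values, got {len(multi_hot)}")
    return [ID2LABEL[i] for i, active in enumerate(multi_hot) if active]


# Classes 5 and 6 are active, as for the first word of the usage example.
example = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
print(decode_multi_hot(example))  # ['UPPER_BEGIN', 'UPPER_ENTITY']
```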
As the model is trained for multi-label classification, a word is assigned every label whose predicted probability exceeds a threshold T, so it can receive multiple labels. Naïvely, T=0.5 can be used, but it is slightly better to use per-label thresholds optimized on a small validation set. These are stored in the file `label_thresholds.json` and listed below (along with the validation set F1 achieved with the best threshold).
```
LOWER_OTHER:           T=0.4500 -> F1 = 0.9965
LOWER_HYPERCORRECTION: T=0.5800 -> F1 = 0.8555
LOWER_ADJ_SKI:         T=0.4810 -> F1 = 0.9863
LOWER_ENTITY_PART:     T=0.4330 -> F1 = 0.8024
UPPER_OTHER:           T=0.4460 -> F1 = 0.7538
UPPER_BEGIN:           T=0.4690 -> F1 = 0.9905
UPPER_ENTITY:          T=0.5030 -> F1 = 0.9670
UPPER_DIRECT_SPEECH:   T=0.4170 -> F1 = 0.9852
UPPER_ADJ_OTHER:       T=0.5080 -> F1 = 0.9431
UPPER_ALLUC_OTHER:     T=0.4850 -> F1 = 0.8463
UPPER_ALLUC_BEGIN:     T=0.5170 -> F1 = 0.9798
UPPER_ALLUC_ENTITY:    T=0.4490 -> F1 = 0.9391
```
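Applying the per-label thresholds amounts to a sigmoid over each logit followed by a per-class comparison. The following is a minimal sketch in plain Python, assuming the model emits one raw logit per class for each word, in the class order listed in this section. The thresholds are copied from the listing above instead of being read from `label_thresholds.json`, whose exact layout is not described here. `predict_labels` is a hypothetical helper, and the logits are made up for illustration.

```python
import math

# Per-label thresholds copied from the listing above (insertion order = class index).
THRESHOLDS = {
    "LOWER_OTHER": 0.4500,
    "LOWER_HYPERCORRECTION": 0.5800,
    "LOWER_ADJ_SKI": 0.4810,
    "LOWER_ENTITY_PART": 0.4330,
    "UPPER_OTHER": 0.4460,
    "UPPER_BEGIN": 0.4690,
    "UPPER_ENTITY": 0.5030,
    "UPPER_DIRECT_SPEECH": 0.4170,
    "UPPER_ADJ_OTHER": 0.5080,
    "UPPER_ALLUC_OTHER": 0.4850,
    "UPPER_ALLUC_BEGIN": 0.5170,
    "UPPER_ALLUC_ENTITY": 0.4490,
}
LABELS = list(THRESHOLDS)


def sigmoid(x: float) -> float:
    """Convert a raw logit to an independent per-class probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, thresholds=THRESHOLDS):
    """Return every label whose probability exceeds its own threshold."""
    if len(logits) != len(LABELS):
        raise ValueError(f"Expected {len(LABELS)} logits, got {len(logits)}")
    return [
        label
        for label, logit in zip(LABELS, logits)
        if sigmoid(logit) > thresholds[label]
    ]


# Made-up logits for a single word: only classes 5 and 6 score high.
made_up_logits = [-6.0, -5.0, -7.0, -4.0, -3.0, 2.0, 3.0, -6.0, -5.0, -8.0, -7.0, -6.0]
print(predict_labels(made_up_logits))  # ['UPPER_BEGIN', 'UPPER_ENTITY']
```

The per-label thresholds matter for borderline scores: a probability of about 0.45 is accepted for `UPPER_DIRECT_SPEECH` (T=0.4170) but would be rejected under a uniform T=0.5.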