---
license: cc-by-sa-4.0
datasets:
- cjvt/cc_gigafida
language:
- sl
tags:
- word case classification
---

# sloberta-word-case-classification-multilabel

SloBERTa model fine-tuned on the Gigafida dataset for word case classification.

The input to the model is expected to be **fully lowercased text**.
The model classifies whether each input word should stay lowercased, be uppercased, or be all-uppercased. In addition, it provides a constrained explanation for its case classification.
See the usage example below for more details.

## Usage example
Imagine we have the following Slovenian text. The asterisked words have incorrect word casing.
```
Linus *torvalds* je *Finski* programer, *Poznan* kot izumitelj operacijskega sistema Linux.
```

The model expects an all-lowercased input, so we pass it the following text:
```
linus torvalds je finski programer, poznan kot izumitelj operacijskega sistema linux.
(EN: Linus Torvalds is a Finnish programmer, known as the inventor of the Linux operating system.)
```

The model might return the following predictions (note: the predictions are chosen for demonstration and explanation, and are not guaranteed to match the model's actual output):
```
linus -> UPPER_ENTITY, UPPER_BEGIN
torvalds -> UPPER_ENTITY
je -> LOWER_OTHER
finski -> LOWER_ADJ_SKI
programer -> LOWER_OTHER
, -> LOWER_OTHER
poznan -> LOWER_HYPERCORRECTION
kot -> LOWER_OTHER
izumitelj -> LOWER_OTHER
operacijskega -> LOWER_OTHER
sistema -> LOWER_OTHER
linux -> UPPER_ENTITY
```

We would then compare the coarse predictions (i.e., LOWER/UPPER/UPPER_ALLUC, the prefix of each predicted label) with the casing in the initial text and observe the following:
- `torvalds` is originally lowercased, but the model corrects it to uppercase (because it is part of an entity),
- `finski` is originally uppercased, but the model corrects it to lowercase (because it is an adjective with the suffix -ski),
- `poznan` is originally uppercased, but the model corrects it to lowercase (the model assumes the user made this mistake due to hypercorrection, i.e., they naïvely uppercased a word after a character that could be a punctuation mark).

The other predictions agree with the word casing in the initial text, so they are assumed to be correct.

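Predictions along these lines can be obtained with the `transformers` library. The snippet below is a minimal sketch rather than an official usage example: it assumes the checkpoint loads with `AutoModelForTokenClassification`, that word-level labels are read off the first subword of each word, and that a flat decision threshold of 0.5 is used (see the tuned per-label thresholds under "More details"). The model id is inferred from this card's title and may need adjusting.

```python
# Minimal inference sketch (assumptions noted above) - not an official usage snippet.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "cjvt/sloberta-word-case-classification-multilabel"  # assumed repository id

# Label names as listed under "More details"; assumed to match the model's output order.
ID2LABEL = [
    "LOWER_OTHER", "LOWER_HYPERCORRECTION", "LOWER_ADJ_SKI", "LOWER_ENTITY_PART",
    "UPPER_OTHER", "UPPER_BEGIN", "UPPER_ENTITY", "UPPER_DIRECT_SPEECH",
    "UPPER_ADJ_OTHER", "UPPER_ALLUC_OTHER", "UPPER_ALLUC_BEGIN", "UPPER_ALLUC_ENTITY",
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)
model.eval()

# Fully lowercased input, naively pre-split into words (punctuation separated by hand).
words = "linus torvalds je finski programer , poznan kot izumitelj operacijskega sistema linux .".split()

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits[0]   # shape: (num_subwords, 12)
probas = torch.sigmoid(logits)        # independent per-label probabilities (multi-label)

# Aggregate subword predictions to words: read the prediction of each word's first subword.
seen = set()
for pos, word_id in enumerate(enc.word_ids(0)):
    if word_id is None or word_id in seen:
        continue
    seen.add(word_id)
    labels = [ID2LABEL[i] for i, p in enumerate(probas[pos].tolist()) if p > 0.5]
    print(f"{words[word_id]} -> {', '.join(labels) if labels else '(none above threshold)'}")
```

Comparing each word's coarse prediction with its casing in the original (non-lowercased) text then flags the words to correct, as in the walkthrough above.
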
## More details
More concretely, the model is a 12-class multi-label classifier with the following class indices and interpretations:
```
0: "LOWER_OTHER", # lowercased for an uncaptured reason
1: "LOWER_HYPERCORRECTION", # lowercased due to hypercorrection (e.g., the user automatically uppercased a word after a punctuation mark such as "." even though it does not end a sentence - the word should instead be lowercased)
2: "LOWER_ADJ_SKI", # lowercased because the word is an adjective ending in the suffix -ski
3: "LOWER_ENTITY_PART", # lowercased word that is part of an entity (e.g., "Novo **mesto**")
4: "UPPER_OTHER", # uppercased for an uncaptured reason
5: "UPPER_BEGIN", # uppercased because the word begins a sentence
6: "UPPER_ENTITY", # uppercased word that is part of an entity
7: "UPPER_DIRECT_SPEECH", # uppercased word due to direct speech
8: "UPPER_ADJ_OTHER", # uppercased adjective for an uncaptured reason (usually a possessive adjective)
9: "UPPER_ALLUC_OTHER", # all-uppercased for an uncaptured reason
10: "UPPER_ALLUC_BEGIN", # all-uppercased because the word begins a sentence
11: "UPPER_ALLUC_ENTITY" # all-uppercased because the word is part of an entity
```

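As the class names suggest, the coarse casing decision can be read directly from the label prefix. A small illustrative helper (hypothetical, not shipped with the model) that maps a predicted label to a corrected surface form could look like this:

```python
def coarse_action(label: str) -> str:
    """Map a fine-grained label (e.g. UPPER_ALLUC_ENTITY) to its coarse casing decision."""
    if label.startswith("UPPER_ALLUC"):
        return "UPPER_ALLUC"
    if label.startswith("UPPER"):
        return "UPPER"
    return "LOWER"


def apply_casing(word: str, label: str) -> str:
    """Rewrite a lowercased input word according to one predicted label."""
    action = coarse_action(label)
    if action == "UPPER_ALLUC":
        return word.upper()        # e.g. "nasa" -> "NASA"
    if action == "UPPER":
        return word.capitalize()   # e.g. "torvalds" -> "Torvalds"
    return word                    # already lowercased


# apply_casing("torvalds", "UPPER_ENTITY") -> "Torvalds"
# apply_casing("finski", "LOWER_ADJ_SKI")  -> "finski"
```

If a word receives several labels with conflicting coarse actions, one could simply keep the label with the highest probability.
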
As the model is trained for multi-label classification, a word is assigned every label whose predicted probability exceeds a threshold T. Naïvely, T = 0.5 can be used for all labels, but it is slightly better to use per-label thresholds optimized on a small validation set.
These are stored in the file `label_thresholds.json` and listed below, along with the validation-set F1 achieved at the best threshold.

```
LOWER_OTHER:           T=0.4500 -> F1 = 0.9965
LOWER_HYPERCORRECTION: T=0.5800 -> F1 = 0.8555
LOWER_ADJ_SKI:         T=0.4810 -> F1 = 0.9863
LOWER_ENTITY_PART:     T=0.4330 -> F1 = 0.8024
UPPER_OTHER:           T=0.4460 -> F1 = 0.7538
UPPER_BEGIN:           T=0.4690 -> F1 = 0.9905
UPPER_ENTITY:          T=0.5030 -> F1 = 0.9670
UPPER_DIRECT_SPEECH:   T=0.4170 -> F1 = 0.9852
UPPER_ADJ_OTHER:       T=0.5080 -> F1 = 0.9431
UPPER_ALLUC_OTHER:     T=0.4850 -> F1 = 0.8463
UPPER_ALLUC_BEGIN:     T=0.5170 -> F1 = 0.9798
UPPER_ALLUC_ENTITY:    T=0.4490 -> F1 = 0.9391
```

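For completeness, here is a sketch of how the tuned thresholds could replace the flat 0.5 used in the inference sketch above. It assumes `label_thresholds.json` is a plain `{label_name: threshold}` mapping stored in the model repository; adjust the loading code if the actual file layout differs.

```python
import json

from huggingface_hub import hf_hub_download

# Assumed repository id and file layout (see note above).
path = hf_hub_download("cjvt/sloberta-word-case-classification-multilabel", "label_thresholds.json")
with open(path, encoding="utf-8") as f:
    label_thresholds = json.load(f)


def decode_labels(word_probas, id2label):
    """Keep every label whose probability exceeds its tuned threshold (falling back to 0.5)."""
    return [
        label for i, label in enumerate(id2label)
        if word_probas[i] > label_thresholds.get(label, 0.5)
    ]


# e.g., replace the `p > 0.5` check in the inference sketch with:
#   labels = decode_labels(probas[pos].tolist(), ID2LABEL)
```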