matejklemen committed 2c4441c (parent: 40603e9): Update README.md
---
license: cc-by-sa-4.0
datasets:
- cjvt/cc_gigafida
language:
- sl
tags:
- word case classification
---

# sloberta-word-case-classification-multilabel

SloBERTa model fine-tuned on the Gigafida dataset for word case classification.

The input to the model is expected to be **fully lowercased text**.
The model classifies whether each input word should stay lowercased, be uppercased, or be all-uppercased. In addition, it provides a constrained explanation for its case classification.
See the usage example below for more details.

## Usage example
Imagine we have the following Slovenian text. Asterisked words have incorrect word casing.
```
Linus *torvalds* je *Finski* programer, *Poznan* kot izumitelj operacijskega sistema Linux.
```

The model expects all-lowercased input, so we pass it the following text:
```
linus torvalds je finski programer, poznan kot izumitelj operacijskega sistema linux.
(EN: Linus Torvalds is a Finnish programmer, known as the inventor of the Linux operating system)
```

The model might return the following predictions, shown here with the corrected word casing (note: predictions chosen for demonstration/explanation, not reproducibility!):
```
Linus -> UPPER_ENTITY, UPPER_BEGIN
Torvalds -> UPPER_ENTITY
je -> LOWER_OTHER
finski -> LOWER_ADJ_SKI
programer -> LOWER_OTHER
, -> LOWER_OTHER
poznan -> LOWER_HYPERCORRECTION
kot -> LOWER_OTHER
izumitelj -> LOWER_OTHER
operacijskega -> LOWER_OTHER
sistema -> LOWER_OTHER
Linux -> UPPER_ENTITY
```

Then we would compare the (coarse) predictions (i.e., LOWER/UPPER/UPPER_ALLUC) with the initial casing and observe the following:
- `Torvalds` is originally lowercased, but the model corrects it to uppercase (because it is an entity);
- `finski` is originally uppercased, but the model corrects it to lowercase (because it is an adjective with the suffix -ski);
- `poznan` is originally uppercased, but the model corrects it to lowercase (the model assumes the user made the mistake due to hypercorrection, i.e., they naïvely uppercased a word after a character that could be a punctuation mark).

The other predictions agree with the word casing in the initial text, so it is assumed to be correct.
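
The comparison step can be sketched in plain Python (an illustration only, using the example tokens and label names from this card; this is not the model's actual inference code):

```python
# Minimal sketch: reduce fine-grained labels to a coarse casing decision,
# apply it to the lowercased tokens, and report the corrections.
# Tokens and labels mirror the example above; purely illustrative.

def coarse_case(labels):
    """Reduce a set of fine-grained labels to LOWER / UPPER / UPPER_ALLUC."""
    if any(l.startswith("UPPER_ALLUC") for l in labels):
        return "UPPER_ALLUC"
    if any(l.startswith("UPPER") for l in labels):
        return "UPPER"
    return "LOWER"

def apply_case(word, coarse):
    """Rewrite a (lowercased) word according to the coarse decision."""
    if coarse == "UPPER_ALLUC":
        return word.upper()
    if coarse == "UPPER":
        return word.capitalize()
    return word

predictions = {
    "linus": ["UPPER_ENTITY", "UPPER_BEGIN"],
    "torvalds": ["UPPER_ENTITY"],
    "finski": ["LOWER_ADJ_SKI"],
    "poznan": ["LOWER_HYPERCORRECTION"],
    "linux": ["UPPER_ENTITY"],
}

# Casing as it appeared in the original (possibly erroneous) text.
original = {"linus": "Linus", "torvalds": "torvalds", "finski": "Finski",
            "poznan": "Poznan", "linux": "Linux"}

for word, labels in predictions.items():
    corrected = apply_case(word, coarse_case(labels))
    if corrected != original[word]:
        print(f"{original[word]} -> {corrected} ({', '.join(labels)})")
```

Running this prints the three corrections discussed above (`torvalds`, `Finski`, `Poznan`); the remaining words are left unchanged.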
60
+
61
+
62
+ ## More details
63
+ More concretely, the model is a 12-class multi-label classifier with the following class indices and interpretations:
64
+ ```
65
+ 0: "LOWER_OTHER", # lowercased for an uncaptured reason
66
+ 1: "LOWER_HYPERCORRECTION", # lowercase due to hypercorrection (e.g., user automatically uppercased a word after a "." despite it not being a punctuation mark - the word should instead be lowercased)
67
+ 2: "LOWER_ADJ_SKI", # lowercased because the word is an adjective ending in suffix -ski
68
+ 3: "LOWER_ENTITY_PART", # lowercased word that is part of an entity (e.g., "Novo **mesto**")
69
+ 4: "UPPER_OTHER", # upercased for an uncaptured reason
70
+ 5: "UPPER_BEGIN", # upercased because the word begins a sentence
71
+ 6: "UPPER_ENTITY", # uppercased word that is part of an entity
72
+ 7: "UPPER_DIRECT_SPEECH", # upercased word due to direct speech
73
+ 8: "UPPER_ADJ_OTHER", # upercased adjective for an uncaptured reason (usually this is a possesive adjective)
74
+ 9: "UPPER_ALLUC_OTHER", # all-uppercased for an uncaptured reason
75
+ 10: "UPPER_ALLUC_BEGIN", # all-uppercased because the word begins a sentence
76
+ 11: "UPPER_ALLUC_ENTITY" # all-uppercased because the word is part of an entity
77
+ ```
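
The list above can be written down as a plain `id2label` mapping (a sketch based on this card; the authoritative mapping ships with the model's config). Each label name encodes a coarse casing decision plus an explanation, which can be recovered by prefix matching:

```python
# The 12 classes as an id -> label mapping (copied from the list above).
ID2LABEL = {
    0: "LOWER_OTHER",
    1: "LOWER_HYPERCORRECTION",
    2: "LOWER_ADJ_SKI",
    3: "LOWER_ENTITY_PART",
    4: "UPPER_OTHER",
    5: "UPPER_BEGIN",
    6: "UPPER_ENTITY",
    7: "UPPER_DIRECT_SPEECH",
    8: "UPPER_ADJ_OTHER",
    9: "UPPER_ALLUC_OTHER",
    10: "UPPER_ALLUC_BEGIN",
    11: "UPPER_ALLUC_ENTITY",
}

def split_label(label):
    """Split a fine-grained label into (coarse decision, explanation).

    The longest coarse prefix must be tried first, since every
    UPPER_ALLUC_* label also starts with "UPPER".
    """
    for coarse in ("UPPER_ALLUC", "UPPER", "LOWER"):
        if label.startswith(coarse):
            return coarse, label[len(coarse) + 1:]
    raise ValueError(f"unknown label: {label}")

print(split_label(ID2LABEL[1]))   # ('LOWER', 'HYPERCORRECTION')
print(split_label(ID2LABEL[10]))  # ('UPPER_ALLUC', 'BEGIN')
```

Note that the coarse/explanation split is an interpretation of the naming scheme, not an official API of the model.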

As the model is trained for multi-label classification, a word can be assigned every label whose probability is > T. Naïvely, T=0.5 can be used for all labels, but it is slightly better to use per-label thresholds optimized on a small validation set.
They are stored in the file `label_thresholds.json` and listed below, along with the validation-set F1 achieved with the best threshold.

```
LOWER_OTHER:            T=0.4500 -> F1 = 0.9965
LOWER_HYPERCORRECTION:  T=0.5800 -> F1 = 0.8555
LOWER_ADJ_SKI:          T=0.4810 -> F1 = 0.9863
LOWER_ENTITY_PART:      T=0.4330 -> F1 = 0.8024
UPPER_OTHER:            T=0.4460 -> F1 = 0.7538
UPPER_BEGIN:            T=0.4690 -> F1 = 0.9905
UPPER_ENTITY:           T=0.5030 -> F1 = 0.9670
UPPER_DIRECT_SPEECH:    T=0.4170 -> F1 = 0.9852
UPPER_ADJ_OTHER:        T=0.5080 -> F1 = 0.9431
UPPER_ALLUC_OTHER:      T=0.4850 -> F1 = 0.8463
UPPER_ALLUC_BEGIN:      T=0.5170 -> F1 = 0.9798
UPPER_ALLUC_ENTITY:     T=0.4490 -> F1 = 0.9391
```
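
Applying these per-label thresholds can be sketched as follows (a minimal illustration: the threshold values are copied from the table above, but the probability vector is made up; in practice it would come from a sigmoid over the model's per-class logits):

```python
# Per-label decision thresholds, copied from the table above.
THRESHOLDS = {
    "LOWER_OTHER": 0.4500,
    "LOWER_HYPERCORRECTION": 0.5800,
    "LOWER_ADJ_SKI": 0.4810,
    "LOWER_ENTITY_PART": 0.4330,
    "UPPER_OTHER": 0.4460,
    "UPPER_BEGIN": 0.4690,
    "UPPER_ENTITY": 0.5030,
    "UPPER_DIRECT_SPEECH": 0.4170,
    "UPPER_ADJ_OTHER": 0.5080,
    "UPPER_ALLUC_OTHER": 0.4850,
    "UPPER_ALLUC_BEGIN": 0.5170,
    "UPPER_ALLUC_ENTITY": 0.4490,
}

def predict_labels(probs, thresholds=THRESHOLDS):
    """Multi-label decision: keep every label whose probability exceeds its threshold."""
    return [label for label, p in probs.items() if p > thresholds[label]]

# Made-up probabilities for one word (e.g., a sentence-initial entity).
word_probs = {label: 0.01 for label in THRESHOLDS}
word_probs["UPPER_ENTITY"] = 0.91
word_probs["UPPER_BEGIN"] = 0.77

print(predict_labels(word_probs))  # ['UPPER_BEGIN', 'UPPER_ENTITY']
```

With the naïve alternative, the same effect is obtained by replacing `thresholds[label]` with a constant 0.5.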