	---
	license: cc-by-sa-4.0
	datasets:
	- cjvt/cc_gigafida
	language:
	- sl
	tags:
	- word case classification
	---

	# sloberta-word-case-classification-multilabel

	SloBERTa model fine-tuned on the Gigafida dataset for word case classification.

	The input to the model is expected to be fully lowercased text.
	The model classifies whether each input word should stay lowercased, be uppercased (first letter capitalized), or be all-uppercased. In addition, it provides a constrained explanation for its case classification.
	See usage example below for more details.

	## Usage example
	Imagine we have the following Slovenian text, in which some words are incorrectly cased (`torvalds`, `Finski`, and `Poznan`).
	```
	Linus torvalds je Finski programer, Poznan kot izumitelj operacijskega sistema Linux.
	(EN: Linus Torvalds is a Finnish programmer, known as the inventor of the Linux operating system.)
	```

	The model expects an all-lowercased input, so we pass it the following text:
	```
	linus torvalds je finski programer, poznan kot izumitelj operacijskega sistema linux.
	```

	The model might return the following predictions (note: predictions chosen for demonstration/explanation, not reproducibility!):
	```
	linus -> UPPER_ENTITY, UPPER_BEGIN
	torvalds -> UPPER_ENTITY
	je -> LOWER_OTHER
	finski -> LOWER_ADJ_SKI
	programer -> LOWER_OTHER
	, -> LOWER_OTHER
	poznan -> LOWER_HYPERCORRECTION
	kot -> LOWER_OTHER
	izumitelj -> LOWER_OTHER
	operacijskega -> LOWER_OTHER
	sistema -> LOWER_OTHER
	linux -> UPPER_ENTITY
	```

	Then we would compare the (coarse) predictions (i.e., LOWER/UPPER/UPPER_ALLUC) with the initial casing and observe the following:
	- `torvalds` is originally lowercased, but the model corrects it to uppercase (because it is an entity),
	- `finski` is originally uppercased, but the model corrects it to lowercase (because it is an adjective with the suffix -ski),
	- `poznan` is originally uppercased, but the model corrects it to lowercase (the model assumes that the user made the mistake due to hypercorrection, meaning they naïvely uppercased a word after a character that could be punctuation).

	The other predictions agree with the word case in the initial text, so they are assumed to be correct.
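The comparison described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `coarse_class` and `apply_case` are hypothetical and not part of the model's code, the predictions are the demonstration labels from the example (the final `.` is omitted because the example lists no prediction for it), and the rule for resolving conflicting coarse classes is an assumption.

```python
def coarse_class(labels):
    """Collapse fine-grained labels into LOWER, UPPER, or UPPER_ALLUC."""
    coarse = set()
    for label in labels:
        if label.startswith("UPPER_ALLUC"):
            coarse.add("UPPER_ALLUC")
        elif label.startswith("UPPER"):
            coarse.add("UPPER")
        else:
            coarse.add("LOWER")
    # Assumption: if the labels disagree, prefer the "strongest" casing.
    for name in ("UPPER_ALLUC", "UPPER", "LOWER"):
        if name in coarse:
            return name
    return "LOWER"  # no label selected: leave the word lowercased


def apply_case(word, coarse):
    """Apply a coarse casing decision to a lowercased word."""
    if coarse == "UPPER_ALLUC":
        return word.upper()
    if coarse == "UPPER":
        return word[:1].upper() + word[1:]
    return word


# Original (possibly miscased) words and the demonstration predictions.
original = ["Linus", "torvalds", "je", "Finski", "programer", ",", "Poznan",
            "kot", "izumitelj", "operacijskega", "sistema", "Linux"]
predictions = [
    ["UPPER_ENTITY", "UPPER_BEGIN"], ["UPPER_ENTITY"], ["LOWER_OTHER"],
    ["LOWER_ADJ_SKI"], ["LOWER_OTHER"], ["LOWER_OTHER"],
    ["LOWER_HYPERCORRECTION"], ["LOWER_OTHER"], ["LOWER_OTHER"],
    ["LOWER_OTHER"], ["LOWER_OTHER"], ["UPPER_ENTITY"],
]

# Keep only the words whose predicted casing differs from the original.
corrections = []
for word, labels in zip(original, predictions):
    fixed = apply_case(word.lower(), coarse_class(labels))
    if fixed != word:
        corrections.append((word, fixed, labels))

for word, fixed, labels in corrections:
    print(f"{word} -> {fixed} ({', '.join(labels)})")
```

The fine-grained labels are carried along with each correction so that they can be shown to the user as the explanation for the change.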


	## More details
	More concretely, the model is a 12-class multi-label classifier with the following class indices and interpretations:
	```
	0: "LOWER_OTHER", # lowercased for an uncaptured reason
	1: "LOWER_HYPERCORRECTION", # lowercased due to hypercorrection (e.g., user automatically uppercased a word after a "." despite it not being a punctuation mark - the word should instead be lowercased)
	2: "LOWER_ADJ_SKI", # lowercased because the word is an adjective ending in suffix -ski
	3: "LOWER_ENTITY_PART", # lowercased word that is part of an entity (e.g., "Novo mesto")
	4: "UPPER_OTHER", # uppercased for an uncaptured reason
	5: "UPPER_BEGIN", # uppercased because the word begins a sentence
	6: "UPPER_ENTITY", # uppercased word that is part of an entity
	7: "UPPER_DIRECT_SPEECH", # uppercased word due to direct speech
	8: "UPPER_ADJ_OTHER", # uppercased adjective for an uncaptured reason (usually this is a possessive adjective)
	9: "UPPER_ALLUC_OTHER", # all-uppercased for an uncaptured reason
	10: "UPPER_ALLUC_BEGIN", # all-uppercased because the word begins a sentence
	11: "UPPER_ALLUC_ENTITY" # all-uppercased because the word is part of an entity
	```
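As a minimal sketch of how the class indices above could be used, the snippet below turns a 12-dimensional multi-hot vector into label names and groups them into the coarse classes. The function names `decode_labels` and `coarse_group` are hypothetical, and the input vector is made up for illustration; how the vector is obtained from the model is not shown here.

```python
ID2LABEL = {
    0: "LOWER_OTHER",
    1: "LOWER_HYPERCORRECTION",
    2: "LOWER_ADJ_SKI",
    3: "LOWER_ENTITY_PART",
    4: "UPPER_OTHER",
    5: "UPPER_BEGIN",
    6: "UPPER_ENTITY",
    7: "UPPER_DIRECT_SPEECH",
    8: "UPPER_ADJ_OTHER",
    9: "UPPER_ALLUC_OTHER",
    10: "UPPER_ALLUC_BEGIN",
    11: "UPPER_ALLUC_ENTITY",
}


def decode_labels(multi_hot):
    """Return the names of all labels that are switched on in the vector."""
    if len(multi_hot) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} entries, got {len(multi_hot)}")
    return [ID2LABEL[i] for i, flag in enumerate(multi_hot) if flag]


def coarse_group(label):
    """Map a fine-grained label to LOWER, UPPER, or UPPER_ALLUC."""
    if label.startswith("UPPER_ALLUC"):
        return "UPPER_ALLUC"
    return label.split("_", 1)[0]


# Illustrative vector: a sentence-initial word that is also an entity.
vector = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
labels = decode_labels(vector)
groups = {coarse_group(label) for label in labels}
print(labels, groups)
```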

	As the model is trained for multi-label classification, a word is assigned every label whose probability is > T. A naïve choice is T=0.5, but it is slightly better to use per-label thresholds optimized on a small validation set.
	These thresholds are stored in the file `label_thresholds.json` and listed below (along with the validation set F1 achieved with the best threshold).

	```
	LOWER_OTHER: T=0.4500 -> F1 = 0.9965
	LOWER_HYPERCORRECTION: T=0.5800 -> F1 = 0.8555
	LOWER_ADJ_SKI: T=0.4810 -> F1 = 0.9863
	LOWER_ENTITY_PART: T=0.4330 -> F1 = 0.8024
	UPPER_OTHER: T=0.4460 -> F1 = 0.7538
	UPPER_BEGIN: T=0.4690 -> F1 = 0.9905
	UPPER_ENTITY: T=0.5030 -> F1 = 0.9670
	UPPER_DIRECT_SPEECH: T=0.4170 -> F1 = 0.9852
	UPPER_ADJ_OTHER: T=0.5080 -> F1 = 0.9431
	UPPER_ALLUC_OTHER: T=0.4850 -> F1 = 0.8463
	UPPER_ALLUC_BEGIN: T=0.5170 -> F1 = 0.9798
	UPPER_ALLUC_ENTITY: T=0.4490 -> F1 = 0.9391
	```
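The thresholding step can be sketched as follows. The thresholds are copied from the table above rather than read from `label_thresholds.json`, since the layout of that file is not described here. The function name `select_labels` and the example probabilities are hypothetical.

```python
# Per-label thresholds from the table above, in class-index order.
THRESHOLDS = {
    "LOWER_OTHER": 0.4500,
    "LOWER_HYPERCORRECTION": 0.5800,
    "LOWER_ADJ_SKI": 0.4810,
    "LOWER_ENTITY_PART": 0.4330,
    "UPPER_OTHER": 0.4460,
    "UPPER_BEGIN": 0.4690,
    "UPPER_ENTITY": 0.5030,
    "UPPER_DIRECT_SPEECH": 0.4170,
    "UPPER_ADJ_OTHER": 0.5080,
    "UPPER_ALLUC_OTHER": 0.4850,
    "UPPER_ALLUC_BEGIN": 0.5170,
    "UPPER_ALLUC_ENTITY": 0.4490,
}
LABELS = list(THRESHOLDS)  # dicts keep insertion order, so index i -> label i


def select_labels(probs, thresholds=None):
    """Return every label whose probability exceeds its threshold.

    With thresholds=None, the naive T=0.5 is used for all labels.
    """
    if len(probs) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} probabilities, got {len(probs)}")
    if thresholds is None:
        thresholds = {label: 0.5 for label in LABELS}
    return [label for label, p in zip(LABELS, probs) if p > thresholds[label]]


# Made-up probabilities for one word, chosen to sit near the thresholds.
probs = [0.46, 0.01, 0.02, 0.01, 0.03, 0.02, 0.502, 0.01, 0.01, 0.01, 0.01, 0.01]

print(select_labels(probs))              # naive T=0.5 for every label
print(select_labels(probs, THRESHOLDS))  # optimized per-label thresholds
```

In this made-up case the two settings disagree: the naive threshold selects only `UPPER_ENTITY`, while the optimized thresholds select only `LOWER_OTHER`.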