language:

  • sl

license: cc-by-sa-4.0

SloBERTa-Incorrect-Spelling-Annotator

This SloBERTa model is designed to annotate incorrectly spelled words in text. It utilizes the following labels:

  • 1: Indicates incorrectly spelled words.
  • 2: Marks words that should be written together as one word.
  • 3: Marks a word that should be split into separate words.

Model Output Example

Imagine we have the following Slovenian text:

Model vbesedilu o znači besede, v katerih se najajajo napake.

We first convert the input into the format expected by the SloBERTa model, placing a <mask> token after every word and punctuation mark:

Model <mask> vbesedilu <mask> o <mask> znači <mask> besede <mask> , <mask> v <mask> katerih <mask> se <mask> najajajo <mask> napake <mask> . <mask>
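The conversion above can be sketched in a few lines of Python. This is only an illustration, not the authors' preprocessing code: the function name `to_masked_input` and the regex-based tokenizer (words and single punctuation marks as separate tokens) are assumptions made for this example.

```python
import re

MASK = "<mask>"


def to_masked_input(text: str) -> str:
    """Insert a <mask> token after every word and punctuation mark.

    Assumption: a token is either a run of word characters or a single
    non-space, non-word character. The real pipeline may tokenize differently.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(f"{token} {MASK}" for token in tokens)


sentence = "Model vbesedilu o znači besede, v katerih se najajajo napake."
print(to_masked_input(sentence))
```

Each <mask> position is where the model is expected to predict the label for the token immediately before it.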

The model might return the following predictions (note: these predictions were chosen for illustration and are not guaranteed to be reproducible):

Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 katerih 0 se 0 najajajo 1 napake 0 . 0

We can observe the following:

  1. In the input sentence, the word najajajo is spelled incorrectly, so the model marks it with the token (1).
  2. The word vbesedilu should be written as two words v and besedilu, so the model marks it with the token (3).
  3. The words o and znači should be written as one word označi, so the model marks them with the tokens (2).

More details

Testing the model with generated test sets provides the following result:

  • 1 token prediction -> Precision: 0.911; Recall: 0.975; F1: 0.942

Testing the model with test sets constructed using the Šolar Eval dataset provides the following results:

  • 1 token prediction -> Precision: 0.900; Recall: 0.860; F1: 0.880
  • 2 token prediction -> Precision: 0.826; Recall: 0.853; F1: 0.839
  • 3 token prediction -> Precision: 0.518; Recall: 0.671; F1: 0.585
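The F1 scores listed here follow from the standard definition, the harmonic mean of precision and recall. The short Python check below recomputes them from the reported precision and recall values; the function name `f1_score` and the dictionary layout are just conventions of this sketch.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in this model card
reported = {
    "generated, token 1": (0.911, 0.975),
    "Šolar Eval, token 1": (0.900, 0.860),
    "Šolar Eval, token 2": (0.826, 0.853),
    "Šolar Eval, token 3": (0.518, 0.671),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
```

Rounded to three decimals, the recomputed values agree with the F1 figures reported for each label.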

Acknowledgement

The authors acknowledge the financial support from the Slovenian Research and Innovation Agency - research core funding No. P6-0411: Language Resources and Technologies for Slovene and research project No. J7-3159: Empirical foundations for digitally-supported development of writing skills.

Authors

Thanks to Martin Božič, Marko Robnik-Šikonja and Špela Arhar Holdt for developing these models.
