README.md · lgessler/microbert-coptic-m at main

language: cop
widget:
  - text: ⲁⲗⲗⲁ ⲁⲛⲟⲕ ⲁⲓⲥⲉⲧⲡⲧⲏⲩⲧⲛ ·

This is a MicroBERT model for Coptic.

Its suffix is -m, which means that it was pretrained using supervision from masked language modeling.
The unlabeled Coptic data was taken from version 4.2.0 of the Coptic SCRIPTORIUM corpus, totaling 970,642 tokens.
The labeled data was taken from v2.9 of the UD treebank UD_Coptic_Scriptorium, totaling 48,632 tokens.
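
As a minimal sketch of how the model might be used, the snippet below loads it with the Hugging Face `transformers` library and extracts contextual embeddings for the widget sentence. It assumes the checkpoint is published under the ID `lgessler/microbert-coptic-m` and loads through the generic `AutoTokenizer` and `AutoModel` classes. The helper name `embed` is hypothetical and not part of the MicroBERT codebase.

```python
# Assumption: the checkpoint lives at this Hub ID and works with the Auto* classes.
MODEL_ID = "lgessler/microbert-coptic-m"

# Example sentence taken from the widget entry in the model card metadata.
EXAMPLE = "ⲁⲗⲗⲁ ⲁⲛⲟⲕ ⲁⲓⲥⲉⲧⲡⲧⲏⲩⲧⲛ ·"


def embed(text, model_id=MODEL_ID):
    """Return the last hidden states for `text`, shape (1, seq_len, hidden_size).

    `embed` is a hypothetical helper written for illustration only.
    """
    # Imported lazily so this module can be imported without torch/transformers.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # disable dropout for deterministic inference

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state


if __name__ == "__main__":
    hidden = embed(EXAMPLE)
    print(hidden.shape)
```

Running the script downloads the weights on first use. The embeddings could then feed a downstream tagger or parser, although the appropriate task head depends on your application.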

Please see the repository and the paper for more details.