---
license: mit
widget:
- text: Universis presentes [MASK] inspecturis
- text: eandem [MASK] per omnia parati observare
- text: yo [MASK] rey de Galicia, de las Indias
- text: en avant contre les choses [MASK] contenues
datasets:
- cc100
- bigscience-historical-texts/Open_Medieval_French
- latinwikipedia
language:
- la
- fr
- es
---

## Model Details

This is a RoBERTa model trained from scratch on medieval texts. The model is intended to serve as a foundation for other machine-learning tasks in NLP and HTR (handwritten text recognition) environments.

The training dataset comprises 650M tokens drawn from Classical and Medieval Latin, Old French, and Old Spanish texts, spanning a period from the 5th century BC to the 16th century. Several large corpora were cleaned and transformed for use during training:

| Dataset | Size | Language | Period (centuries) |
| ------------- |:-------------:| -----:|-----:|
| CC100 [1] | 3.2 GB | la | 5th BC - 18th |
| Corpus Corporum [2] | 3.0 GB | la | 5th BC - 16th |
| CEMA [3] | 320 MB | la+fro | 9th - 15th |
| HOME-Alcar [4] | 38 MB | la+fro | 12th - 15th |
| BFM [5] | 34 MB | fro | 13th - 15th |
| AND [6] | 19 MB | fro | 13th - 15th |
| CODEA [7] | 13 MB | spa | 12th - 16th |
| **Total** | ~6.5 GB | | |
| **After cleaning\*** | 650M tokens (4.5 GB) | | |

\* A significant amount of overlapping text was detected across the corpora, especially in the medieval collections. In addition, filler text ("Lorem ipsum dolorem...") was iteratively removed.

[1] CC-NET repository: https://huggingface.co./datasets/cc100

[2] Corpus Corporum, repository of Latin works at the University of Zurich: https://mlat.uzh.ch/

[3] Cartae Europae Medii Aevi (5th-15th c.): https://cema.lamop.fr/

[4] HOME-Alcar (History of Medieval Europe): https://doi.org/10.5281/zenodo.5600884

[5] Base de Français Médiéval: https://txm-bfm.huma-num.fr/txm/

[6] Anglo-Norman Dictionary: https://anglo-norman.net/

[7] Corpus de Documentos Españoles Anteriores a 1900 (CODEA): https://www.corpuscodea.es/
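
## How to Use

Since the model was trained with a masked-language-modelling objective, it can be queried directly for masked-token prediction, as in the widget examples above. Below is a minimal sketch using the `transformers` fill-mask pipeline; the repository id `user/medieval-roberta` is a placeholder for this model's actual Hub id.

```python
from transformers import pipeline

# Placeholder repo id; substitute this model's actual Hub id.
fill_mask = pipeline("fill-mask", model="user/medieval-roberta")

# One of the widget examples above: a common Medieval Latin charter formula.
predictions = fill_mask("Universis presentes [MASK] inspecturis")
for pred in predictions:
    # Each prediction carries the proposed token and its probability.
    print(f"{pred['token_str']:>15}  {pred['score']:.3f}")
```

To use the model as a foundation for downstream NLP or HTR tasks, load it as a backbone with `AutoModel.from_pretrained` and fine-tune it as usual.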