Model Details
This is a RoBERTa model trained from scratch on medieval texts. It is intended as a foundation for downstream machine learning tasks in NLP and HTR (handwritten text recognition) settings.
The training dataset comprises 650M tokens from texts in Classical and Medieval Latin, Old French, and Old Spanish, spanning the 5th century BC to the 16th century.
Several large corpora were cleaned and transformed for use in training:
Dataset | Size | Language | Dates (century) |
---|---|---|---|
CC100 [1] | 3.2 GB | la | 5th BC - 18th |
Corpus Corporum [2] | 3.0 GB | la | 5th BC - 16th |
CEMA [3] | 320 MB | la+fro | 9th - 15th |
HOME-Alcar [4] | 38 MB | la+fro | 12th - 15th |
BFM [5] | 34 MB | fro | 13th - 15th |
AND [6] | 19 MB | fro | 13th - 15th |
CODEA [7] | 13 MB | spa | 12th - 16th |
Total | ~6.5 GB | | |
Training set | 650M tokens (4.5 GB)* | | |
\* A significant amount of overlapping text was detected across the corpora, especially in the medieval collections. In addition, synthetic text ("Lorem ipsum dolorem...") was iteratively deleted.
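The card does not say how overlapping and placeholder text was detected. As a minimal sketch only, assuming exact-match deduplication at line level after case and whitespace normalization, and a simple pattern match for "Lorem ipsum", the cleaning step could look like this. The names `normalize`, `clean_corpus`, and `LOREM_RE` are hypothetical and are not taken from the authors' pipeline.

```python
import hashlib
import re

# Assumption: any line containing "lorem ipsum" is treated as placeholder text.
LOREM_RE = re.compile(r"lorem\s+ipsum", re.IGNORECASE)


def normalize(line: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(line.lower().split())


def clean_corpus(lines):
    """Yield lines that are non-empty, not placeholder text, and not seen before."""
    seen = set()
    for line in lines:
        key = normalize(line)
        if not key or LOREM_RE.search(key):
            continue  # skip blank lines and synthetic filler
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # skip text already contributed by another corpus
        seen.add(digest)
        yield line.strip()


# Illustrative sample lines, not taken from the actual corpora.
sample = [
    "In nomine sancte et individue Trinitatis.",
    "in nomine  sancte et individue trinitatis.",
    "Lorem ipsum dolor sit amet",
    "",
    "Sachent tuit cil qui sunt et qui a venir sunt.",
]
cleaned = list(clean_corpus(sample))
```

Hashing the normalized line keeps memory use bounded on multi-gigabyte inputs. Near-duplicate detection (for example, differing editions of the same charter) would need a fuzzier method than this exact-match sketch.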
[1] CC-NET Repository: https://huggingface.co/datasets/cc100
[2] Repositorium operum Latinorum apud universitatem Turicensem: https://mlat.uzh.ch/
[3] Cartae Europae Medii Aevi (5th-15th c.): https://cema.lamop.fr/
[4] History of Medieval Europe: https://doi.org/10.5281/zenodo.5600884
[5] Base de Français Médiéval: https://txm-bfm.huma-num.fr/txm/
[6] Anglo-Norman Dictionary: https://anglo-norman.net/
[7] Corpus de Documentos Españoles anteriores a 1900: https://www.corpuscodea.es/