---
license: mit
widget:
- text: Universis presentes [MASK] inspecturis
- text: eandem [MASK] per omnia parati observare
- text: yo [MASK] rey de Galicia, de las Indias
- text: en avant contre les choses [MASK] contenues
datasets:
- cc100
- bigscience-historical-texts/Open_Medieval_French
- latinwikipedia
language:
- la
- fr
- es
---

## Model Details

This is a RoBERTa model trained from scratch on medieval texts. The model is intended to serve as a foundation for other machine-learning tasks in NLP and HTR (handwritten text recognition) environments.

The training dataset comprises 650M tokens drawn from Classical and Medieval Latin, Old French, and Old Spanish texts, spanning a period from the 5th century BC to the 16th century. Several large corpora were cleaned and transformed for use during training:

| Dataset | Size | Language | Period (centuries) |
| ------------- |:-------------:| -----:|-----:|
| CC100 [1] | 3.2 GB | la | 5th BC - 18th |
| Corpus Corporum [2] | 3.0 GB | la | 5th BC - 16th |
| CEMA [3] | 320 MB | la+fro | 9th - 15th |
| HOME-Alcar [4] | 38 MB | la+fro | 12th - 15th |
| BFM [5] | 34 MB | fro | 13th - 15th |
| AND [6] | 19 MB | fro | 13th - 15th |
| CODEA [7] | 13 MB | spa | 12th - 16th |
| **Total** | ~6.5 GB | | |
| **After cleaning\*** | 650M tokens (4.5 GB) | | |

\* A significant amount of overlapping text was detected across the corpora, especially in the medieval collections. In addition, filler text ("Lorem ipsum dolorem...") was iteratively removed.

[1] CC-NET repository: https://huggingface.co./datasets/cc100

[2] Corpus Corporum, repository of Latin works at the University of Zurich: https://mlat.uzh.ch/

[3] Cartae Europae Medii Aevi (5th-15th c.): https://cema.lamop.fr/

[4] HOME-Alcar (History of Medieval Europe): https://doi.org/10.5281/zenodo.5600884

[5] Base de Français Médiéval: https://txm-bfm.huma-num.fr/txm/

[6] Anglo-Norman Dictionary: https://anglo-norman.net/

[7] Corpus de Documentos Españoles Anteriores a 1900 (CODEA): https://www.corpuscodea.es/
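
## How to Use

Since the model was trained with a masked-language-modelling objective, it can be queried directly for masked-token prediction, as in the widget examples above. Below is a minimal sketch using the `transformers` fill-mask pipeline; the repository id `user/medieval-roberta` is a placeholder for this model's actual Hub id.

```python
from transformers import pipeline

# Placeholder repo id; substitute this model's actual Hub id.
fill_mask = pipeline("fill-mask", model="user/medieval-roberta")

# One of the widget examples above: a common Medieval Latin charter formula.
predictions = fill_mask("Universis presentes [MASK] inspecturis")
for pred in predictions:
    # Each prediction carries the proposed token and its probability.
    print(f"{pred['token_str']:>15}  {pred['score']:.3f}")
```

To use the model as a foundation for downstream NLP or HTR tasks, load it as a backbone with `AutoModel.from_pretrained` and fine-tune it as usual.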