Commit 44e2ce9 by magistermilitum
1 Parent(s): e0ae2f1

Update README.md

README.md CHANGED
@@ -1,3 +1,53 @@
----
-license: mit
-
+---
+license: mit
+widget:
+- text: Universis presentes [MASK] inspecturis
+- text: eandem [MASK] per omnia parati observare
+- text: yo [MASK] rey de Galicia, de las Indias
+- text: en avant contre les choses [MASK] contenues
+datasets:
+- cc100
+- bigscience-historical-texts/Open_Medieval_French
+- latinwikipedia
+language:
+- la
+- fr
+- es
+---
+
+## Model Details
+
+This is a RoBERTa model trained from scratch on medieval texts. The model is intended to be used as a foundation for downstream tasks in NLP and handwritten text recognition (HTR).
+
+The training dataset comprises 650M tokens from texts in classical and medieval Latin, Old French, and Old Spanish, dating from the 5th century BC to the 16th century.
+
+Several large corpora were cleaned and transformed for use in training:
+
+| Dataset | Size | Language | Dates (century) |
+| ------------- |:-------------:| -----:|-----:|
+| CC100 [1] | 3.2 GB | la | 5th BC - 18th |
+| Corpus Corporum [2] | 3.0 GB | la | 5th BC - 16th |
+| CEMA [3] | 320 MB | la+fro | 9th - 15th |
+| HOME-Alcar [4] | 38 MB | la+fro | 12th - 15th |
+| BFM [5] | 34 MB | fro | 13th - 15th |
+| AND [6] | 19 MB | fro | 13th - 15th |
+| CODEA [7] | 13 MB | spa | 12th - 16th |
+| Total | ~6.5 GB | | |
+| After cleaning | 650M tokens (4.5 GB)* | | |
+
+
+\* A significant amount of overlapping text was detected across the corpora, especially in the medieval collections. In addition, synthetic text ("Lorem ipsum dolorem...") was iteratively deleted.
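
The footnote above does not say how the overlap was detected or how the placeholder text was removed. The following is only a minimal sketch of one possible approach, assuming exact line-level duplicates (after lowercasing and whitespace normalization) and a simple "lorem ipsum" pattern match. The names `normalize`, `clean_corpus`, and `LOREM_RE` are hypothetical, and the sketch makes a single pass, whereas the README describes an iterative cleanup.

```python
import hashlib
import re

# Assumption: placeholder text is recognised by the phrase "lorem ipsum".
LOREM_RE = re.compile(r"lorem\s+ipsum", re.IGNORECASE)


def normalize(line: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(line.lower().split())


def clean_corpus(lines):
    """Yield lines that are non-empty, not placeholder text, and not seen before."""
    seen = set()
    for line in lines:
        key = normalize(line)
        if not key or LOREM_RE.search(key):
            continue  # drop blank lines and placeholder text
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop a repeat of an earlier line
        seen.add(digest)
        yield line.strip()


sample = [
    "In nomine Domini amen.",
    "in  nomine domini AMEN.",     # duplicate after normalization
    "Lorem ipsum dolor sit amet",  # placeholder text
    "Sachent tuit cil qui sunt",
]
print(list(clean_corpus(sample)))
```

Hashing the normalized line keeps memory use bounded by the number of distinct lines rather than their total length. Near-duplicate detection (for example, overlapping editions of the same charter) would need a fuzzier technique than this exact-match sketch.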
+
+[1] CC-NET Repository: https://huggingface.co/datasets/cc100
+
+[2] Repositorium operum Latinorum apud universitatem Turicensem: https://mlat.uzh.ch/
+
+[3] Cartae Europae Medii Aevi (5th-15th c.): https://cema.lamop.fr/
+
+[4] History of Medieval Europe: https://doi.org/10.5281/zenodo.5600884
+
+[5] Base de Français Médiéval: https://txm-bfm.huma-num.fr/txm/
+
+[6] Anglo-Norman Dictionary: https://anglo-norman.net/
+
+[7] Corpus de Documentos Españoles anteriores a 1900: https://www.corpuscodea.es/
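
The `widget` entries in the front matter are masked-language-model prompts, so the model can presumably be queried the same way with the `transformers` fill-mask pipeline. The sketch below rests on assumptions: the repository id is not stated in this commit, so `MODEL_ID` is a hypothetical placeholder, and `to_model_mask` and `top_predictions` are illustrative helper names. Because RoBERTa tokenizers often use `<mask>` rather than `[MASK]`, the prompt is rewritten to use whatever mask token the loaded tokenizer reports.

```python
# Hypothetical placeholder: replace with the model's actual repository id.
MODEL_ID = "your-namespace/your-medieval-roberta"


def to_model_mask(text: str, mask_token: str, placeholder: str = "[MASK]") -> str:
    """Replace the widget-style placeholder with the tokenizer's own mask token."""
    return text.replace(placeholder, mask_token)


def top_predictions(text: str, model_id: str = MODEL_ID, k: int = 5):
    """Return the k most likely fillers for the masked position as (token, score)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    prompt = to_model_mask(text, fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(prompt, top_k=k)]
```

With `MODEL_ID` set to the real repository, `top_predictions("Universis presentes [MASK] inspecturis")` would download the weights and return the highest-scoring candidates for the gap.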