Tags: PyTorch, Latin, French, Spanish, roberta
magistermilitum committed
Commit
44e2ce9
1 Parent(s): e0ae2f1

Update README.md

  1. README.md +53 -3
README.md CHANGED
@@ -1,3 +1,53 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ widget:
+ - text: Universis presentes [MASK] inspecturis
+ - text: eandem [MASK] per omnia parati observare
+ - text: yo [MASK] rey de Galicia, de las Indias
+ - text: en avant contre les choses [MASK] contenues
+ datasets:
+ - cc100
+ - bigscience-historical-texts/Open_Medieval_French
+ - latinwikipedia
+ language:
+ - la
+ - fr
+ - es
+ ---
+
+ ## Model Details
+
+ This is a RoBERTa model trained from scratch on medieval texts. It is intended as a foundation for other machine-learning tasks in natural language processing (NLP) and handwritten text recognition (HTR).
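
As a rough illustration of using the model as a masked-language foundation, the sketch below runs the widget sentences through the `transformers` fill-mask pipeline. The Hub id in `MODEL_ID` is a hypothetical placeholder (the actual repository name is not given here), and the helper names `to_model_mask` and `predict_masked` are made up for this example. Because it is not stated which mask token the tokenizer uses, the sketch reads it from the tokenizer instead of assuming `[MASK]`.

```python
# Hypothetical placeholder: replace with the actual Hub id of this model.
MODEL_ID = "your-namespace/your-medieval-roberta"


def to_model_mask(text: str, mask_token: str, placeholder: str = "[MASK]") -> str:
    """Swap the README-style placeholder for the tokenizer's own mask token."""
    return text.replace(placeholder, mask_token)


def predict_masked(text: str, model_id: str = MODEL_ID, top_k: int = 5):
    """Return (token, score) pairs for the masked position in `text`."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_id)
    prepared = to_model_mask(text, unmasker.tokenizer.mask_token)
    # For a single masked position the pipeline returns a list of dicts.
    return [(r["token_str"], r["score"]) for r in unmasker(prepared, top_k=top_k)]
```

Calling `predict_masked("Universis presentes [MASK] inspecturis")` would download the model weights, so it needs network access plus `transformers` and PyTorch installed.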
+
+ The training dataset comprises 650M tokens from texts in classical and medieval Latin, Old French and Old Spanish, covering a period from the 5th century BC to the 16th century.
+
+ Several large corpora were cleaned and transformed for use in training:
+
+ | Dataset | Size | Lang | Dates |
+ | ------------- |:-------------:| -----:|-----:|
+ | CC100 [1] | 3.2 GB | la | 5th c. BC - 18th c. |
+ | Corpus Corporum [2] | 3.0 GB | la | 5th c. BC - 16th c. |
+ | CEMA [3] | 320 MB | la+fro | 9th - 15th c. |
+ | HOME-Alcar [4] | 38 MB | la+fro | 12th - 15th c. |
+ | BFM [5] | 34 MB | fro | 13th - 15th c. |
+ | AND [6] | 19 MB | fro | 13th - 15th c. |
+ | CODEA [7] | 13 MB | spa | 12th - 16th c. |
+ | Total | ~6.5 GB | | |
+ | | 650M tokens (4.5 GB)* | | |
+
+
+ \* A significant amount of overlapping text was detected across the corpora, especially in the medieval collections. In addition, synthetic text ("Lorem ipsum dolorem...") was iteratively deleted.
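
The cleaning described in the note above can be sketched roughly as follows. This is only an illustration under stated assumptions: the README does not say how overlap was detected, so the sketch swaps in exact-match deduplication on lowercased, whitespace-normalized lines, and it treats any line containing "lorem ipsum" as synthetic. The names `normalize`, `clean_corpus` and `LOREM_RE`, and the sample lines, are invented for the example.

```python
import hashlib
import re

# Assumption: a simple case-insensitive marker for placeholder text.
LOREM_RE = re.compile(r"lorem\s+ipsum", re.IGNORECASE)


def normalize(line: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(line.lower().split())


def clean_corpus(lines):
    """Yield lines with placeholder text and normalized duplicates removed."""
    seen = set()
    for line in lines:
        key = normalize(line)
        if not key or LOREM_RE.search(key):
            continue  # skip empty lines and synthetic filler
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        if digest in seen:
            continue  # same text already kept from another corpus
        seen.add(digest)
        yield line.strip()


# Made-up sample lines standing in for text from several corpora.
sample = [
    "carta prima",
    "Carta   prima",
    "Lorem ipsum dolor sit amet",
    "alia carta",
]
cleaned = list(clean_corpus(sample))
```

Storing digests instead of full lines keeps memory bounded on multi-gigabyte corpora. Detecting partial overlap (for example, the same charter edited in two collections) would need a fuzzier method such as shingling, which this sketch does not attempt.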
+
+ [1] CC-NET Repository: https://huggingface.co/datasets/cc100
+
+ [2] Repositorium operum Latinorum apud universitatem Turicensem: https://mlat.uzh.ch/
+
+ [3] Cartae Europae Medii Aevi (5th-15th c.): https://cema.lamop.fr/
+
+ [4] History of Medieval Europe: https://doi.org/10.5281/zenodo.5600884
+
+ [5] Base de Français Médiéval: https://txm-bfm.huma-num.fr/txm/
+
+ [6] Anglo-Norman Dictionary: https://anglo-norman.net/
+
+ [7] Corpus de Documentos Españoles anteriores a 1900: https://www.corpuscodea.es/