|
--- |
|
license: mit |
|
widget: |
|
- text: Universis presentes [MASK] inspecturis |
|
- text: eandem [MASK] per omnia parati observare |
|
- text: yo [MASK] rey de Galicia, de las Indias |
|
- text: en avant contre les choses [MASK] contenues |
|
datasets: |
|
- cc100 |
|
- bigscience-historical-texts/Open_Medieval_French |
|
- latinwikipedia |
|
language: |
|
- la |
|
- fr |
|
- es |
|
tags: |
|
- handwritten-text-recognition |
|
--- |
|
|
|
|
|
## TrOCR model adapted to Handwritten Text Recognition on medieval manuscripts (12th-16th centuries) |
|
|
|
**TRIDIS** (*Tria Digita Scribunt*) is a Handwritten Text Recognition (HTR) model trained on semi-diplomatic transcriptions |
|
of medieval and early modern manuscripts. It is suited to documentary manuscripts, that is, manuscripts arising |
|
from legal, administrative, and memorial practices, most commonly from the Late Middle Ages (13th century onwards). |
|
It can also perform well on documents from other domains, such as literary books, scholarly treatises and cartularies, |
|
giving historians and philologists a versatile tool for transcribing and analyzing historical texts. |
|
|
|
A paper presenting the first version of the model is available here: |
|
Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. Journal of Data Mining and Digital Humanities. 2023. https://hal.science/hal-03892163 |
|
|
|
|
|
#### Rules of transcription |
|
|
|
The main feature of the semi-diplomatic edition is that abbreviations have been resolved. The following conventions apply: |
|
- Abbreviations by suspension (<mark>facimꝰ</mark> --> <mark>facimus</mark>) and by contraction (<mark>dñi</mark> --> <mark>domini</mark>) have been resolved. |
|
- Likewise, those using conventional signs (<mark>⁊</mark> --> <mark>et</mark> ; <mark>ꝓ</mark> --> <mark>pro</mark>) have been resolved. |
|
- The named entities (names of persons, places and institutions) have been capitalized. |
|
- The beginning of a block of text as well as the original capitals used by the scribe are also capitalized. |
|
- The consonantal <mark>i</mark> and <mark>u</mark> characters have been transcribed as <mark>j</mark> and <mark>v</mark> in both French and Latin. |
|
- Punctuation marks used in the manuscript, such as <mark>.</mark>, <mark>/</mark> or <mark>|</mark>, have not been systematically transcribed, because the transcription has been standardized with modern punctuation. |
|
- Corrections and words that appear cancelled in the manuscript have been transcribed enclosed between two <mark>$</mark> signs, one at the beginning and one at the end. |
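
As a rough illustration of two of these conventions, the sketch below expands the two conventional signs quoted above and separates out text marked as cancelled with `$`. It is not part of the TRIDIS pipeline: the function names and the two-entry mapping are assumptions made for this example, and a real expansion table would be far larger and context-dependent.

```python
import re

# Illustrative subset only: the two conventional signs quoted in the rules above.
# A real table would be much larger (assumption made for this sketch).
SIGN_EXPANSIONS = {
    "⁊": "et",
    "ꝓ": "pro",
}

# Cancelled words or corrections are enclosed between two "$" signs.
CANCELLED = re.compile(r"\$([^$]*)\$")


def expand_signs(text: str) -> str:
    """Replace each conventional sign with its resolved form."""
    for sign, expansion in SIGN_EXPANSIONS.items():
        text = text.replace(sign, expansion)
    return text


def split_cancelled(text: str) -> tuple[str, list[str]]:
    """Return the text without cancelled spans, plus the list of cancelled spans."""
    cancelled = CANCELLED.findall(text)
    kept = CANCELLED.sub("", text)
    kept = re.sub(r"\s+", " ", kept).strip()  # collapse the gaps left behind
    return kept, cancelled


if __name__ == "__main__":
    print(expand_signs("⁊ ꝓ nobis"))
    print(split_cancelled("ego $Petrus$ Johannes dedi"))
```

A helper like `split_cancelled` can be useful when model output is compared against editions that omit cancelled text.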
|
|
|
|
|
#### Corpora |
|
The model was trained on charters, registers, feudal books and legal proceedings from the Late Medieval period (11th-16th centuries). |
|
|
|
Training and evaluation involved 2950 pages, 245k lines of text, and almost 2.3M tokens, drawn from the following freely available ground-truth corpora: |
|
|
|
- The Alcar-HOME database: https://zenodo.org/record/5600884 |
|
- The e-NDP corpus: https://zenodo.org/record/7575693 |
|
- The Himanis project: https://zenodo.org/record/5535306 |
|
- Königsfelden Abbey corpus: https://zenodo.org/record/5179361 |
|
- Monumenta Luxemburgensia. |
|
|
|
|
|
#### Accuracy |
|
TRIDIS was trained using an encoder-decoder architecture based on a fine-tuned version of TrOCR-large handwritten ([microsoft/trocr-large-handwritten](https://huggingface.co/microsoft/trocr-large-handwritten)) and a RoBERTa model trained on medieval texts ([magistermilitum/RoBERTa_medieval](https://huggingface.co/magistermilitum/RoBERTa_medieval)). |
|
|
|
The final model is multilingual (Latin, Old French, and Old Spanish) and can recognize several Latin script families (mostly Textualis and Cursiva) in documents produced between roughly the 11th and 16th centuries. |
|
|
|
During evaluation, the model reached an accuracy of 94.3% on the validation set. On three external unseen datasets it obtained a CER (Character Error Rate) of about 0.06 to 0.12 |
|
and a WER (Word Error Rate) of about 0.14 to 0.26, which is about 30% lower than CRNN+CTC solutions trained on the same corpora. |
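
For readers unfamiliar with these metrics, the sketch below shows how CER and WER are commonly computed: the Levenshtein edit distance between the reference transcription and the model output, divided by the reference length in characters or in words. This is a generic illustration, not the evaluation script used for TRIDIS. The function names are assumptions made here, and details such as text normalization or whitespace handling may differ from the authors' setup.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    ref = "Universis presentes litteras inspecturis"
    hyp = "Universis presentes literas inspecturis"
    print(f"CER = {cer(ref, hyp):.3f}, WER = {wer(ref, hyp):.3f}")
```

Because a single wrong character makes the whole word count as an error, WER is typically higher than CER on the same output, which matches the ranges reported above.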