	---
	license: mit
	widget:
	- text: Universis presentes [MASK] inspecturis
	- text: eandem [MASK] per omnia parati observare
	- text: yo [MASK] rey de Galicia, de las Indias
	- text: en avant contre les choses [MASK] contenues
	datasets:
	- cc100
	- bigscience-historical-texts/Open_Medieval_French
	- latinwikipedia
	language:
	- la
	- fr
	- es
	tags:
	- handwritten-text-recognition
	---


	## TrOCR model adapted to Handwritten Text Recognition on medieval manuscripts (12th-16th centuries)

	TRIDIS (Tria Digita Scribunt) is a Handwritten Text Recognition model trained on semi-diplomatic transcriptions
	of medieval and early modern manuscripts. It is suited to documentary manuscripts, that is, manuscripts arising
	from legal, administrative, and memorial practices, most commonly from the Late Middle Ages (13th century onwards).
	It can also perform well on documents from other domains, such as literary books, scholarly treatises and cartularies,
	giving historians and philologists a versatile tool for transcribing and analyzing historical texts.

	A paper presenting the first version of the model is available here:
	Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. Journal of Data Mining and Digital Humanities. 2023. https://hal.science/hal-03892163


	#### Rules of transcription

	The transcriptions follow a semi-diplomatic edition, whose main feature is the resolution of abbreviations. The rules are:
	- Abbreviations by suspension (<mark>facimꝰ</mark> --> <mark>facimus</mark>) and by contraction (<mark>dñi</mark> --> <mark>domini</mark>) have been resolved.
	- Likewise, those using conventional signs (<mark>⁊</mark> --> <mark>et</mark> ; <mark>ꝓ</mark> --> <mark>pro</mark>) have been resolved.
	- The named entities (names of persons, places and institutions) have been capitalized.
	- The beginning of a block of text as well as the original capitals used by the scribe are also capitalized.
	- The consonantal <mark>i</mark> and <mark>u</mark> characters have been transcribed as <mark>j</mark> and <mark>v</mark> in both French and Latin.
	- The punctuation marks used in the manuscript, such as <mark>.</mark>, <mark>/</mark> or <mark>\|</mark>, have not been systematically transcribed, as the transcription has been standardized with modern punctuation.
	- Corrections and words that appear cancelled in the manuscript have been transcribed surrounded by the sign <mark>$</mark> at the beginning and at the end.
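When comparing the model output with an external diplomatic transcription, it can help to bring that transcription closer to the conventions listed above first. The following is a minimal sketch, not the project's own tooling: the expansion tables contain only the four examples quoted in the rules, and the helper names `expand_abbreviations` and `drop_cancelled` are hypothetical.

```python
import re
import unicodedata

# Illustrative subset only: the four expansions quoted in the rules above.
# A real table would be much larger and corpus-specific (assumption).
WORD_EXPANSIONS = {
    unicodedata.normalize("NFC", "facimꝰ"): "facimus",  # suspension
    unicodedata.normalize("NFC", "dñi"): "domini",      # contraction
}
SIGN_EXPANSIONS = {
    "⁊": "et",   # Tironian et
    "ꝓ": "pro",  # p with flourish
}

# Cancelled words and corrections are wrapped in "$" on both sides.
CANCELLED = re.compile(r"\$[^$]*\$")


def expand_abbreviations(text):
    """Resolve the abbreviations listed in the tables above."""
    # NFC makes precomposed and combining forms of "ñ" compare equal.
    text = unicodedata.normalize("NFC", text)
    words = [WORD_EXPANSIONS.get(word, word) for word in text.split()]
    text = " ".join(words)
    for sign, expansion in SIGN_EXPANSIONS.items():
        text = text.replace(sign, expansion)
    return text


def drop_cancelled(text):
    """Remove spans marked as cancelled ($...$) and collapse whitespace."""
    return " ".join(CANCELLED.sub(" ", text).split())


if __name__ == "__main__":
    print(expand_abbreviations("facimꝰ ⁊ dñi"))
    print(drop_cancelled("in nomine $domni$ domini amen"))
```

Whether cancelled spans should be kept or dropped depends on the edition being prepared; `drop_cancelled` simply shows how the `$` markers can be located.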


	#### Corpora
	The model was trained on charters, registers, feudal books and legal proceedings from the Late Medieval period (11th-16th centuries).

	The training and evaluation involved 2950 pages, 245k lines of text, and almost 2.3M tokens, drawn from the following freely available ground-truth corpora:

	- The Alcar-HOME database: https://zenodo.org/record/5600884
	- The e-NDP corpus: https://zenodo.org/record/7575693
	- The Himanis project: https://zenodo.org/record/5535306
	- Königsfelden Abbey corpus: https://zenodo.org/record/5179361
	- Monumenta Luxemburgensia.


	#### Accuracy
	TRIDIS was trained using an encoder-decoder architecture based on a fine-tuned version of TrOCR-large handwritten ([microsoft/trocr-large-handwritten](https://huggingface.co/microsoft/trocr-large-handwritten)) and a RoBERTa model trained on medieval texts ([magistermilitum/RoBERTa_medieval](https://huggingface.co/magistermilitum/RoBERTa_medieval)).
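Since the model builds on TrOCR, inference on a single cropped text line might look like the sketch below. This is an assumption-laden example rather than the authors' documented usage: it relies on the standard `TrOCRProcessor` and `VisionEncoderDecoderModel` classes from the `transformers` library, assumes that the TRIDIS checkpoint loads through them, and defaults to the base checkpoint named above because the exact TRIDIS repository id is not stated here. The function name `transcribe_line` and the image path are hypothetical.

```python
def transcribe_line(image_path, model_id="microsoft/trocr-large-handwritten"):
    """Transcribe one cropped text-line image with a TrOCR-style checkpoint.

    Pass the TRIDIS repository id as `model_id` to use this model
    (assumption: it is compatible with the classes used below).
    """
    # Imported lazily so the module can be loaded without these packages.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(model_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)

    # TrOCR expects RGB input; manuscripts are often scanned in grayscale.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


if __name__ == "__main__":
    # "line.png" is a placeholder for a pre-segmented line image.
    print(transcribe_line("line.png"))
```

TrOCR-style models work on single lines, so a full page would first need line segmentation with a separate tool.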

	The final model is multilingual (Latin, Old French, and Old Spanish) and can recognize several Latin script families (mostly Textualis and Cursiva) in documents produced between roughly the 11th and 16th centuries.

	During evaluation, the model reached an accuracy of 94.3% on the validation set. On three external unseen datasets it obtained a CER (Character Error Rate) of about 0.06 to 0.12
	and a WER (Word Error Rate) of about 0.14 to 0.26, error rates about 30% lower than those of CRNN+CTC solutions trained on the same corpora.
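For readers who want to score their own transcriptions, CER and WER are conventionally defined as the Levenshtein edit distance between reference and prediction, divided by the reference length in characters or words. The sketch below implements that standard definition in plain Python; it is not the evaluation script used for TRIDIS, and details such as whitespace or case normalization before scoring are assumptions left to the user.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    truth = "in nomine domini amen"
    predicted = "in nomine domni amen"
    print(f"CER = {cer(truth, predicted):.3f}")
    print(f"WER = {wer(truth, predicted):.3f}")
```

A single missing character counts as one character error but a whole word error, which is why WER is typically higher than CER on the same output.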