tridis_HTR / README.md
metadata
license: mit
widget:
  - src: https://gitlab.com/magistermilitum/e-NDP/-/raw/main/images/linea.png
    example_title: random 14th line
datasets:
  - cc100
  - bigscience-historical-texts/Open_Medieval_French
  - latinwikipedia
language:
  - la
  - fr
  - es
tags:
  - handwritten-text-recognition
  - Image-to-text
pipeline_tag: image-to-text
base_model:
  - microsoft/trocr-large-handwritten
  - magistermilitum/Roberta_Historical

TrOCR model adapted to Handwriting Text Recognition on medieval manuscripts (12th-16th centuries)

TRIDIS (Tria Digita Scribunt) is a Handwriting Text Recognition model trained on semi-diplomatic transcriptions of medieval and Early Modern manuscripts. It is suited to documentary manuscripts, that is, manuscripts arising from legal, administrative, and memorial practices, most commonly from the Late Middle Ages (13th century onwards). It can also perform well on documents from other domains, such as literary books, scholarly treatises, and cartularies, providing a versatile tool for historians and philologists who transform and analyze historical texts.

A paper presenting the first version of the model is available here: Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. Journal of Data Mining and Digital Humanities. 2023. https://hal.science/hal-03892163

Rules of transcription:

The main feature of the semi-diplomatic edition is that abbreviations have been resolved:

  • Both those by suspension (facimꝰ --> facimus) and those by contraction (dñi --> domini).
  • Likewise, those using conventional signs (⁊ --> et ; ꝓ --> pro) have been resolved.
  • The named entities (names of persons, places and institutions) have been capitalized.
  • The beginning of a block of text as well as the original capitals used by the scribe are also capitalized.
  • The consonantal i and u characters have been transcribed as j and v in both French and Latin.
  • The punctuation marks used in the manuscript, such as . or / or |, have not been systematically transcribed, since the transcription has been standardized with modern punctuation.
  • Corrections and words that appear cancelled in the manuscript have been transcribed surrounded by the sign $ at the beginning and at the end.
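
Since cancelled words are wrapped in `$`, a consumer of the transcriptions may want to isolate or drop them. The following is a minimal sketch, assuming each cancelled span opens and closes with a single `$` on the same line; the function names and the sample line are hypothetical and are not part of the model or its tooling.

```python
import re

# Assumption: a cancelled span is "$...$" with no "$" inside it.
CANCELLED = re.compile(r"\$([^$]*)\$")


def extract_cancelled(line: str) -> list[str]:
    """Return the text of every $...$ span, without the delimiters."""
    return [span.strip() for span in CANCELLED.findall(line)]


def strip_cancelled(line: str) -> str:
    """Remove $...$ spans and collapse the whitespace they leave behind."""
    without = CANCELLED.sub(" ", line)
    return re.sub(r"\s+", " ", without).strip()


if __name__ == "__main__":
    sample = "Ego Johannes $dedi$ concessi terram"  # illustrative line only
    print(extract_cancelled(sample))
    print(strip_cancelled(sample))
```

Keeping extraction and removal separate lets an editor review the cancelled readings before discarding them.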

Corpora

The model was trained on charters, registers, feudal books and legal proceedings from the Late Medieval period (11th-16th centuries).

The training and evaluation ground-truth datasets comprise 2950 pages, 245k lines of text, and almost 2.3M tokens, drawn from several freely available ground-truth corpora.

(Additionally, the model was pre-trained on a synthetic dataset (300k lines) generated using a GAN architecture.)

Accuracy

TRIDIS was trained using an encoder-decoder architecture based on a fine-tuned version of TrOCR-large handwritten (microsoft/trocr-large-handwritten) and a RoBERTa model trained on medieval texts (magistermilitum/RoBERTa_medieval).

This final model operates in a multilingual environment (Latin, Old French, and Old Spanish) and is capable of recognizing several Latin script families (mostly Textualis and Cursiva) in documents produced circa the 11th-16th centuries.
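
Because the model builds on TrOCR, it can probably be loaded through the standard TrOCR classes of the `transformers` library. The sketch below rests on several assumptions that this card does not confirm: the repository id `magistermilitum/tridis_HTR`, the presence of a compatible processor configuration in that repository, the input being a single cropped text line, and the `max_new_tokens` value. The function name `transcribe_line` is hypothetical.

```python
def transcribe_line(image_path: str,
                    model_id: str = "magistermilitum/tridis_HTR") -> str:
    """Transcribe one cropped text-line image (assumed usage, see lead-in)."""
    # Local imports: the heavy dependencies are only needed when called.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    # Assumption: the repository ships both processor and model weights.
    processor = TrOCRProcessor.from_pretrained(model_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)

    # TrOCR expects RGB input; the processor resizes and normalizes it.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressive decoding; 128 new tokens is an arbitrary upper bound.
    generated_ids = model.generate(pixel_values, max_new_tokens=128)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

A call such as `transcribe_line("line.png")`, where `line.png` is a placeholder path, would download the weights on first use. Full pages need to be segmented into lines beforehand.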

During evaluation, the model reached an accuracy of 94.3% on the validation set. On three external unseen datasets it obtained a CER (Character Error Rate) of about 0.06 to 0.12 and a WER (Word Error Rate) of about 0.14 to 0.26, which is about 30% lower than CRNN+CTC solutions trained on the same corpora.
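
For readers who want to compute comparable figures on their own transcriptions, CER and WER are commonly defined as the Levenshtein distance between hypothesis and reference, divided by the reference length in characters or words. The following is a minimal sketch of that common definition, not the evaluation script used for the figures above; whitespace tokenization and the absence of any text normalization are assumptions.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits / reference words (whitespace tokens)."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)
```

With this definition, a single wrong character in a word counts once for CER but invalidates the whole word for WER, which is why WER values are typically higher than CER values.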

Other formats

A CRNN+CTC version of this model, trained with Kraken 4.0 (https://github.com/mittagessen/kraken) using the same gold-standard annotation, is available on Zenodo:

Torres Aguilar, S., & Jolivet, V. (2024). TRIDIS: HTR model for Multilingual Medieval and Early Modern Documentary Manuscripts (11th-16th) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10800223

Paper

A journal paper presenting the scientific basis of this model is also available:

Torres Aguilar, Sergio, Jolivet, Vincent. La reconnaissance de l'écriture pour les manuscrits documentaires du Moyen Âge, Journal of Data Mining & Digital Humanities, 22 December 2023 - https://hal.science/hal-03892163/document