OCRerrcr is a small language model specialized for the detection of OCR errors.

OCRerrcr was trained by Eliot Jones for PleIAs on a sample of 1000 documents with labelled OCR errors from open data documents (Finance Commons) and cultural heritage sources (Common Corpus).

To date, OCRerrcr provides the most accurate agnostic OCR error rate estimate. PleIAs has also developed an alternative pipeline for this task, OCRoscope, which scales significantly better but is also significantly less accurate, especially for documents with fewer mistakes.

The name OCRerrcr (instead of OCRerror) is a playful allusion to a common OCR misreading.

Example

The following is a low-error example sentence taken from Common Corpus:

They did not approach cer, but turned away and passed irom her presence, filled with sorrow and moved with sympathy, which her intense emotions seemed to communicate to even these thoughtless young men of the tho plains.

And the OCRerrcr detection (with formatting for clarity):

They did not approach <er>cer,</er> but turned away and passed <er>irom</er> her presence, filled with sorrow and moved with sympathy, which her intense emotions seemed to communicate to even these thoughtless young men of the <er>tho</er> plains.
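To connect the detection above to an error rate estimate, here is a minimal sketch that counts the share of whitespace-separated tokens falling inside flagged spans. It assumes the detections are available as text with `<er>...</er>` markup, which is only the display formatting used on this card and not necessarily the model's raw output; the function names `count_tokens` and `error_rate` are hypothetical, and the token-level definition of the rate is an illustrative choice rather than the metric used by PleIAs.

```python
import re

# Matches one flagged span. The <er>...</er> markup is the card's display
# format, assumed here (not confirmed) to be available as input.
ER_SPAN = re.compile(r"<er>(.*?)</er>", re.DOTALL)


def count_tokens(tagged: str) -> tuple[int, int]:
    """Return (flagged, total) counts of whitespace-separated tokens."""
    # Tokens inside flagged spans.
    flagged = sum(len(span.split()) for span in ER_SPAN.findall(tagged))
    # All tokens, counted after stripping the markup but keeping its content.
    total = len(ER_SPAN.sub(r"\1", tagged).split())
    return flagged, total


def error_rate(tagged: str) -> float:
    """Share of tokens flagged as OCR errors; 0.0 for empty input."""
    flagged, total = count_tokens(tagged)
    return flagged / total if total else 0.0


tagged = (
    "They did not approach <er>cer,</er> but turned away and passed "
    "<er>irom</er> her presence, filled with sorrow and moved with sympathy, "
    "which her intense emotions seemed to communicate to even these "
    "thoughtless young men of the <er>tho</er> plains."
)

flagged, total = count_tokens(tagged)
print(f"{flagged} of {total} tokens flagged, rate = {error_rate(tagged):.3f}")
```

On the example sentence this flags 3 of 37 tokens. A character-level rate would be an equally reasonable definition and would give a different number.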

Model size: 434M params (Safetensors format, tensor type F32)