michalbubula committed on
Commit
09277d2
·
verified ·
1 Parent(s): 7330ece

Update README.md

  1. README.md +205 -3
README.md CHANGED
@@ -1,3 +1,205 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ tags:
4
+ - image-to-image
5
+ - keras
6
+ - TensorFlow
7
+ - pixelwise-segmentation
8
+ ---
9
+
10
+ # Model Card for Eynollah Image Extraction
11
+
12
+ This model is designed for image segmentation, specifically focusing on extracting illustrations from historical document scans. It is integrated into the Eynollah pipeline to either save the cropped images to a specified directory or output the coordinates of the extracted images in a PAGE-XML file.
13
+
14
+ Questions and comments about the models can be directed to Vahid Rezanezhad at [[email protected]](mailto:[email protected]).
15
+
16
+
17
+
18
+ # Table of Contents
19
+
20
+ - [Model Card for Eynollah Image Extraction](#model-card-for-eynollah-image-extraction)
21
+ - [Table of Contents](#table-of-contents)
22
+ - [Model Details](#model-details)
23
+ - [Model Description](#model-description)
24
+ - [Uses](#uses)
25
+ - [Direct Use](#direct-use)
26
+ - [Downstream Use](#downstream-use)
27
+ - [Out-of-Scope Use](#out-of-scope-use)
28
+ - [Bias, Risks, and Limitations](#bias-risks-and-limitations)
29
+ - [Recommendations](#recommendations)
30
+ - [Training Details](#training-details)
31
+ - [Training Data](#training-data)
32
+ - [Training Procedure](#training-procedure)
33
+ - [Preprocessing](#preprocessing)
34
+ - [Speeds, Sizes, Times](#speeds-sizes-times)
35
+ - [Training hyperparameters](#training-hyperparameters)
36
+ - [Evaluation](#evaluation)
37
+ - [Testing Data and Metrics](#testing-data-and-metrics)
38
+ - [Metrics](#metrics)
39
+ - [Environmental Impact](#environmental-impact)
40
+ - [Technical Specifications](#technical-specifications)
41
+ - [Model Architecture and Objective](#model-architecture-and-objective)
42
+ - [Compute Infrastructure](#compute-infrastructure)
43
+ - [Hardware](#hardware)
44
+ - [Software](#software)
45
+ - [More Information](#more-information)
46
+ - [Model Card Authors](#model-card-authors)
47
+ - [Model Card Contact](#model-card-contact)
48
+ - [How to Get Started with the Model](#how-to-get-started-with-the-model)
49
+
50
+
51
+ # Model Details
52
+
53
+ ## Model Description
54
+
55
+ The Eynollah suite consists of 14 models and presents a document layout analysis (DLA) system for historical documents implemented by pixel-wise segmentation using convolutional neural networks. In addition, heuristic methods are applied to detect marginals and to determine the reading order of text regions. Within this suite, the image extraction model is designed for image segmentation, specifically focusing on extracting illustrations from historical document scans.
56
+
57
+ The detection and classification of multiple classes of layout elements such as headings, images, tables etc. as part of DLA is required in order to extract and process them in subsequent steps. Altogether, the combination of image detection, classification and segmentation on the wide variety that can be found in over 400 years of printed cultural heritage makes this a very challenging task. Deep learning models are complemented with heuristics for the detection of text lines, marginals, and reading order. Furthermore, an optional image enhancement step was added for documents that have insufficient pixel density and/or require scaling. A column classifier for the analysis of multi-column documents was also added. With these additions, DLA performance was improved and a high accuracy in the prediction of the reading order was achieved.
58
+
59
+ Additionally, there are cases where the user only wants to extract specific elements from a document scan, with one such element being illustrations within the document. To address this, we have trained a pixel-wise segmentation model focused primarily on segmenting images. This model is integrated into the Eynollah pipeline and can either crop the images and save them to a specified directory or write the coordinates of all images into a PAGE-XML file.
60
+
61
+ Two Arabic/Persian terms form the name of the model suite: عين الله, which can be transcribed as "ain'allah" or "eynollah"; it translates into English as "God's Eye" – it sees (nearly) everything on the document image.
62
+
63
+
64
+ - **Developed by:** [Vahid Rezanezhad]([email protected])
65
+ - **Shared by:** [Staatsbibliothek zu Berlin / Berlin State Library](https://huggingface.co/SBB)
66
+ - **Model type:** Neural Network
67
+ - **Language(s) (NLP):** Irrelevant; works on all languages
68
+ - **License:** apache-2.0
69
+ - **Resources for more information:** see the [GitHub Repo](https://github.com/qurator-spk/eynollah)
70
+
71
+ # Uses
72
+
73
+ This model is designed for image segmentation, specifically focusing on extracting illustrations from historical document scans.
74
+
75
+
76
+
77
+ ## Direct Use
78
+
79
+ The system performs document layout analysis in a series of steps. First, the image is cropped and the number of columns is determined. Then the pixels-per-inch (ppi) value is detected; if it is below 300, the image is re-scaled and enhanced. Next, the main region types (text regions, images, separators, background) are detected as the early layout. Marginals are detected with a heuristic method, followed optionally by headers (or headings or floatings) and drop capitals. Textline segmentation is then performed for text regions, and for each text region the slope for deskewing is calculated. Heuristics are used to determine bounding boxes (or contours in the case of curved lines) of textlines in each region after deskewing. After that, the reading order of text regions is detected based on separators, headers (or headings or floatings) and the coordinates of columns. Finally, all results are written to a PAGE-XML file.
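The 300 ppi threshold in the step list above can be illustrated with a small sketch. It assumes, purely for illustration, that a low-resolution scan is enlarged linearly so that it reaches 300 ppi; the card does not state how Eynollah chooses the target size, and the function names `rescale_factor` and `rescaled_size` are hypothetical.

```python
def rescale_factor(ppi: float, target_ppi: float = 300.0) -> float:
    """Return the enlargement factor for a scan, or 1.0 if no rescaling is needed.

    Assumption: linear scaling up to target_ppi. This is an illustrative rule,
    not the documented behaviour of the Eynollah pipeline.
    """
    if ppi <= 0:
        raise ValueError("ppi must be positive")
    # Scans at or above the threshold keep their original size.
    if ppi >= target_ppi:
        return 1.0
    return target_ppi / ppi


def rescaled_size(width: int, height: int, ppi: float) -> tuple[int, int]:
    """Return the (width, height) in pixels after applying the rescale factor."""
    factor = rescale_factor(ppi)
    return round(width * factor), round(height * factor)
```

Under this assumption, a 1000 x 1500 pixel scan at 150 ppi would be enlarged by a factor of 2, while a 400 ppi scan would be left unchanged.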
80
+
81
+ Four models of the suite perform early document layout detection, extracting the essential classes background, text regions, images, and separators from document images. To make the models robust and accurate across a wide range of real-world documents, they were trained with various augmentation procedures, and different augmentation techniques were compared systematically to find the best-performing configuration.
82
+
83
+ Within the suite, the model ``eynollah-main-regions_20231127_672_org_ens_11_13_16_17_18`` is used for the task of illustration detection only.
84
+
85
+ The robustness of this model is attained through an ensemble technique that combines the weights from various epochs. This leverages the strengths of multiple epochs to improve overall performance and to deliver consistent and reliable results.
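One common way to combine the weights of several epochs is to average them element-wise. The sketch below shows that technique with NumPy; the card does not say which combination rule was actually used, so plain averaging is an assumption, and `average_checkpoints` is a hypothetical helper name.

```python
import numpy as np


def average_checkpoints(checkpoints):
    """Average corresponding weight arrays across several checkpoints.

    checkpoints: a list with one entry per epoch, each entry being a list of
    np.ndarray weight tensors in the same order and with the same shapes.
    Assumption: an unweighted element-wise mean; the source only states that
    weights from various epochs are combined.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n_tensors = len(checkpoints[0])
    if any(len(ckpt) != n_tensors for ckpt in checkpoints):
        raise ValueError("all checkpoints must contain the same number of tensors")
    # zip(*checkpoints) groups the i-th tensor of every checkpoint together.
    return [np.mean(np.stack(tensors), axis=0) for tensors in zip(*checkpoints)]
```

With Keras, the per-epoch lists could be obtained from `model.get_weights()` after loading each checkpoint, and the averaged list applied with `model.set_weights()`.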
86
+
87
+
88
+
89
+ ## Downstream Use
90
+
91
+ The intended use of this suite of 14 models is conceived of as a system. Comparable systems are [dhSegment](https://doi.org/10.1109/ICFHR-2018.2018.00011) and [P2PaLA](https://github.com/lquirosd/P2PaLA). However, as always with a system, every component can potentially be used on its own. Each model might therefore be used or trained for other downstream purposes.
92
+
93
+
94
+
95
+
96
+ ## Out-of-Scope Use
97
+
98
+ This model suite does NOT perform any Optical Character Recognition (OCR); it is an image-to-PAGE-XML system only.
99
+
100
+
101
+
102
+ # Bias, Risks, and Limitations
103
+
104
+ The pre-processing of digitised historical and contemporary texts contributes to knowledge creation, both by developing models that facilitate the necessary pre-processing steps of document layout analysis and ultimately by enabling better discoverability of information in the processed works. Since the content of the document images is not altered, no ethical challenges were identified. The model was developed to improve document layout analysis, not for profit. Though a commercial product based on this model may be developed in the future, the model itself will always remain openly accessible without any commercial interest. As a technical limitation, performance on historical text could be improved considerably by adding more historical Ground Truth data.
105
+
106
+
107
+
108
+ ## Recommendations
109
+
110
+ The application of machine learning models to convert a document image into PAGE-XML is a process which can still be improved. The suite of 14 models performs pixel-wise segmentation in patches; it therefore lacks a global understanding of documents, which leaves the document layout analysis system unable to detect some document subcategories such as page numbers. Because the end-to-end system passes the output of each stage to the next, a poor prediction in an early step may lead to poor final document information extraction. Patch-wise segmentation can also make it difficult to segment pixels between text blocks, large-scale drop capitals, headers and tables, since each patch shows the model only part of a document element. There is therefore a lot to gain by improving the existing model suite.
111
+
112
+
113
+
114
+ # Training Details
115
+
116
+ ## Training Data
117
+
118
+ For model training, Ground Truth in PAGE-XML format was sourced mainly from three datasets. The [IMPACT dataset of historical document images](https://doi.org/10.1145/2501115.2501130) contains a representative sample of historical books and newspapers from Europe’s major libraries. Due to restrictions, only the open data from the National Library of the Netherlands (KB) and the Poznan Supercomputing and Networking Center (PSNC) were used. The [Europeana Newspapers Project (ENP) image and ground truth dataset of historical newspapers](https://doi.org/10.1109/ICDAR.2015.7333898) contains images representative of the newspaper digitisation projects of 12 national and major European libraries. The [OCR-D dataset](https://doi.org/10.1145/3322905.3322916) of German prints between 1500 and 1900 is based on a selection from the holdings of the “German Text Archive” (DTA), the Digitized Collections of the Berlin State Library and the Wolfenbüttel Digital Library.
119
+
120
+
121
+ ## Training Procedure
122
+
123
+ All models, except for the column classifier, are designed for pixelwise segmentation. The training of these models follows a patchwise approach, wherein the documents are divided into smaller patches and fed into the models during training. To train these segmentation models, annotated labels are required. Each sub-element in the document is assigned a unique value for identification.
124
+ To examine the training process, see the repository [on GitHub](https://github.com/qurator-spk/sbb_pixelwise_segmentation), which contains the source code for training an encoder model for document image segmentation.
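The patchwise approach described above can be sketched as follows: an image array is cut into fixed-size tiles, and tiles at the right and bottom edges are zero-padded to the full patch size. This is a minimal illustration, not the code of the training repository; the function name, the non-overlapping tiling, and the zero padding are all assumptions.

```python
import numpy as np


def split_into_patches(image, patch_h, patch_w):
    """Yield (top, left, patch) tuples that tile the image without overlap.

    Assumption: edge patches are zero-padded to (patch_h, patch_w); the actual
    training code may handle borders differently.
    """
    h, w = image.shape[:2]
    for top in range(0, h, patch_h):
        for left in range(0, w, patch_w):
            patch = image[top:top + patch_h, left:left + patch_w]
            pad_h = patch_h - patch.shape[0]
            pad_w = patch_w - patch.shape[1]
            if pad_h or pad_w:
                # Pad only the two spatial axes, never the channel axis.
                pad = [(0, pad_h), (0, pad_w)] + [(0, 0)] * (image.ndim - 2)
                patch = np.pad(patch, pad, mode="constant")
            yield top, left, patch
```

The returned offsets make it possible to place each patch prediction back at its position in a full-size label map.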
125
+
126
+
127
+ ### Preprocessing
128
+
129
+ In order to use this suite of models for historical documents, no preprocessing is needed for the input images.
130
+
131
+ ### Speeds, Sizes, Times
132
+
133
+ The duration of training per epoch varies, typically between 2 and 5 hours, depending on the specific use case, the datasets used, and the extent of applied data augmentation. Our ResNet-50-U-Net model has 38.15M parameters, of which only the decoder parameters (14.71M) are fully trained. For the column classifier, we used a ResNet-50 with two densely connected layers on top. The column classifier model has 25.6M parameters, of which only the dense-layer parameters (2.16M) are fully trained.
134
+
135
+ ### Training hyperparameters
136
+
137
+ With the model architecture held constant, our hyperparameters consist of various augmentations, each with its own set of parameters. In addition, model training involves further hyperparameters, including the choice of the loss function, the number of batches used, the patch size of input images, and the number of epochs.
138
+
139
+
140
+
141
+ # Evaluation
142
+
143
+ Because the prevailing segmentation metric does not adequately reflect how well regions are isolated in document segmentation, we evaluated our model on smaller batches of the three datasets named above that were used for training. To improve results, we employed an ensemble learning approach that aggregates the best epoch weights.
144
+
145
+
146
+ ## Testing Data and Metrics
147
+
148
+ Any historical document scans can be used as testing data and the metrics noted below can be used for evaluation.
149
+
150
+ ### Metrics
151
+
152
+ In addition to utilising conventional performance metrics such as mean Intersection over Union (mIoU), precision, recall, and F1-score, we have incorporated the [Prima layout evaluation](https://www.primaresearch.org/tools/PerformanceEvaluation) metrics, namely Merge, Split, Miss, and the overall success rate.
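The conventional metrics named above can be computed from a predicted and a ground-truth mask as in the sketch below. It covers a single binary class (for example, image versus non-image pixels); mIoU would be the mean of the per-class IoU values over all classes. The function name and the convention of returning 0.0 for an empty denominator are assumptions, and the Prima layout evaluation metrics are not covered here.

```python
import numpy as np


def binary_segmentation_metrics(pred, truth):
    """Return IoU, precision, recall and F1 for one class of a segmentation.

    pred, truth: array-likes of the same shape; non-zero means "class present".
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))    # predicted and present
    fp = int(np.sum(pred & ~truth))   # predicted but absent
    fn = int(np.sum(~pred & truth))   # present but missed

    def ratio(num, den):
        # Convention chosen for this sketch: empty denominators give 0.0.
        return num / den if den else 0.0

    return {
        "iou": ratio(tp, tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }
```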
153
+
154
+
155
+
156
+
157
+ # Environmental Impact
158
+
159
+
160
+ - **Hardware Type:** GA100.
161
+ - **Hours used:** 2 to 5 hours per epoch.
162
+ - **Cloud Provider:** No cloud.
163
+ - **Compute Region:** Germany.
164
+
165
+ # Technical Specifications
166
+
167
+ ## Model Architecture and Objective
168
+
169
+ The model architecture is a ResNet50-UNet encoder-decoder, designed specifically to segment illustrations within a document image.
170
+
171
+ ## Compute Infrastructure
172
+
173
+
174
+
175
+ ### Hardware
176
+
177
+ Training and pre-training have been performed on a single GA100.
178
+
179
+ ### Software
180
+
181
+ See the code published on [GitHub](https://github.com/qurator-spk/eynollah).
182
+
183
+
184
+ # More Information
185
+
186
+ SHA256 and MD5 hashes for the model ``eynollah-main-regions_20231127_672_org_ens_11_13_16_17_18/saved_model.pb``:
187
+
188
+ **SHA256:** d8046b0b7751a1d1741bd0b61111671472ea49c6a347e2f10f0a8ba54b612c05
189
+
190
+ **MD5:** 95db5b99e66ba4eba1b31a7dd7688095
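A downloaded `saved_model.pb` can be checked against the hashes above with Python's standard `hashlib` module. The following is a minimal sketch; the helper names `file_digest` and `verify_model_file` are hypothetical, and the path passed in is whatever location the file was saved to.

```python
import hashlib

# Expected digests, copied from the values listed above.
EXPECTED_SHA256 = "d8046b0b7751a1d1741bd0b61111671472ea49c6a347e2f10f0a8ba54b612c05"
EXPECTED_MD5 = "95db5b99e66ba4eba1b31a7dd7688095"


def file_digest(path, algorithm, chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(path):
    """Return True only if both the SHA256 and the MD5 digest match."""
    return (
        file_digest(path, "sha256") == EXPECTED_SHA256
        and file_digest(path, "md5") == EXPECTED_MD5
    )
```

For example, `verify_model_file("saved_model.pb")` returns `True` only for a file whose contents match both published hashes.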
191
+
192
+
193
+ # Model Card Authors
194
+
195
+ [Vahid Rezanezhad]([email protected]) and [Jörg Lehmann]([email protected])
196
+
197
+ # Model Card Contact
198
+
199
+ Questions and comments about the model can be directed to Clemens Neudecker at [email protected]; questions and comments about the model card can be directed to Jörg Lehmann at [email protected].
200
+
201
+ # How to Get Started with the Model
202
+
203
+ Getting started with this model is explained in the README file of the [GitHub repository](https://github.com/qurator-spk/eynollah/blob/main/README.md).
204
+
205
+ Model Card as of September 11th, 2024