Update README.md
README.md
CHANGED
@@ -8,6 +8,56 @@ tags:
- document-analysis
---

- yolo-doclaynet

-
**For more details, see the [GitHub repository](https://github.com/ppaanngggg/yolo-doclaynet).**

## Introduction

Retrieval-augmented generation (RAG) is very popular these days, and many applications support talking to documents.
However, answer quality drops sharply on complex documents because of their complex structure. Extracting content from
such documents and organizing it into a parsable form is therefore a challenge. This repo aims to solve that challenge
with a method that is both fast and accurate.

## Detection Sample

data:image/s3,"s3://crabby-images/22281/222817119a2b9ff4e21dbcdb042fc7e11d0b7a1f" alt="image"

## Method

1. `YOLO` is a state-of-the-art object detection model developed by [Ultralytics](https://github.com/ultralytics/ultralytics).
   It comes in 5 base model sizes and ships with a powerful framework for training and deployment, so I chose YOLO to
   solve this challenge.
2. `DocLayNet` is a human-annotated document layout segmentation dataset containing 80863 pages from a broad variety of
   document sources. As far as I know, it is the highest-quality dataset for document layout analysis.

## Usage

```python
from ultralytics import YOLO

# Load the trained layout detection weights.
model = YOLO("{path to model file}")
# Run detection; the call returns a list with one result per input image.
pred = model("{path to test image}")
print(pred)
```

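For document extraction, the raw results usually need to be turned into labelled regions. The sketch below shows one possible way to do that and is not part of this repo: `Region`, `regions_from_result`, `reading_order` and the `0.5` threshold are hypothetical names and values chosen for illustration. It assumes the Ultralytics result attributes `boxes.xyxy`, `boxes.conf`, `boxes.cls` and `names`, and it uses a naive top-to-bottom sort that will not order multi-column pages correctly.

```python
from dataclasses import dataclass


@dataclass
class Region:
    label: str         # class name, e.g. "Text" or "Table"
    confidence: float  # detection score in [0, 1]
    box: tuple         # (x1, y1, x2, y2) in pixels


def regions_from_result(result):
    """Convert one Ultralytics result object into a list of Region objects."""
    boxes = result.boxes
    rows = zip(boxes.xyxy.tolist(), boxes.conf.tolist(), boxes.cls.tolist())
    return [
        Region(result.names[int(cls)], float(conf), tuple(xyxy))
        for xyxy, conf, cls in rows
    ]


def reading_order(regions, min_confidence=0.5):
    """Drop weak detections, then sort top-to-bottom and left-to-right.

    The 0.5 default is an arbitrary illustrative threshold, not a tuned value.
    """
    kept = [r for r in regions if r.confidence >= min_confidence]
    return sorted(kept, key=lambda r: (r.box[1], r.box[0]))
```

With the usage snippet from this section, `reading_order(regions_from_result(pred[0]))` would return the kept regions for the first input image.
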
## Dataset

More details about DocLayNet, including download instructions, are available at this [link](https://github.com/DS4SD/DocLayNet). It has 11 labels:

- **Text**: Regular paragraphs.
- **Picture**: A graphic or photograph.
- **Caption**: Special text outside a picture or table that introduces this picture or table.
- **Section-header**: Any kind of heading in the text, except overall document title.
- **Footnote**: Typically small text at the bottom of a page, with a number or symbol that is referred to in the text above.
- **Formula**: Mathematical equation on its own line.
- **Table**: Material arranged in a grid alignment with rows and columns, often with separator lines.
- **List-item**: One element of a list, in a hanging shape, i.e., from the second line onwards the paragraph is indented more than the first line.
- **Page-header**: Repeating elements like page number at the top, outside of the normal text flow.
- **Page-footer**: Repeating elements like page number at the bottom, outside of the normal text flow.
- **Title**: Overall title of a document, (almost) exclusively on the first page and typically appearing in large font.
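
When the detected labels feed a document-extraction pipeline, it can help to validate them against the 11 classes listed above and to drop repeating page elements. The following is a minimal sketch of that idea: `content_labels` and `label_histogram` are hypothetical helpers, and treating `Page-header` and `Page-footer` as skippable is an assumption, not something this repo prescribes. The tuple order here simply follows the list above; the class index order used by a trained model should be read from `model.names` instead.

```python
from collections import Counter

# The 11 DocLayNet labels, in the order they are listed above
# (not necessarily the class index order of a trained model).
DOCLAYNET_LABELS = (
    "Text", "Picture", "Caption", "Section-header", "Footnote", "Formula",
    "Table", "List-item", "Page-header", "Page-footer", "Title",
)

# Assumption: repeating page elements are not wanted in the extracted content.
PAGE_FURNITURE = frozenset({"Page-header", "Page-footer"})


def content_labels(detected_labels):
    """Return the detected labels that are not page furniture, in input order."""
    unknown = set(detected_labels) - set(DOCLAYNET_LABELS)
    if unknown:
        raise ValueError(f"not DocLayNet labels: {sorted(unknown)}")
    return [label for label in detected_labels if label not in PAGE_FURNITURE]


def label_histogram(detected_labels):
    """Count how often each label was detected on a page."""
    return Counter(detected_labels)
```
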