Pclanglais committed on
Commit
52f05fa
·
verified ·
1 Parent(s): 780eb1a

Update README.md

Files changed (1)
  1. README.md +0 -27
README.md CHANGED
@@ -1,27 +0,0 @@
- **Estienne** is a text-segmentation model built on DeBERTa.
-
- In contrast with most text-segmentation approaches, Estienne is based on token classification: editorial structures are identified in much the same way as entities in named-entity recognition.
-
- Estienne was trained on 2,000 examples of manually annotated text, excerpted at random from three very large datasets collected by Pleias: Common Corpus (cultural heritage texts in the public domain), Marianne-OpenData (French/English administrative documents) and OpenScientificPile (scientific publications under free licenses, indexed on OpenAlex).
-
- Given the diversity of the corpus, Estienne should work on diverse document formats in European languages.
-
- The model is named after the humanist Henri Estienne, who introduced many text-segmentation practices still in use in scholarly editions today.
-
- ## Use
- As DeBERTa removes newlines by default and its tokenizer has no support for them, newlines in the input text should be replaced by pilcrows (¶).
-
- Estienne supports the following segmentations:
- * **Text**
- * **Separator** - a marker of the boundary between segments. Separators are generally based on newlines (in practice ¶), with some variation depending on how the model interprets the text structure.
- * **Title**
- * **Table**
- * **Dialog** - any kind of speaker-attributed utterance.
- * **Bibliography** - a specific bibliographic reference, either in a bibliography section or in a footnote.
- * **Contact** - personal information; can be especially useful in the context of PII removal.
- * **Paratext** - any non-meaningful text included in standard documents, such as headers, page numbers, repeated section titles, etc.
- * **Author** - author names and signatures.
- * **Date** - statements of date and time, common in letters and newspaper articles.
- * **Keyword** - lists of keywords, especially common in scientific publications.
-
- ## Example
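
The removed README left this section empty. As a rough illustration of the two steps described under "Use", the sketch below replaces newlines with pilcrows before tokenization and then merges consecutive tokens that share a label into segments, in the same way NER output is usually grouped. It does not load the model: the helper names `newlines_to_pilcrows` and `group_segments`, the whitespace tokens, and the exact label strings are assumptions made for illustration, not part of Estienne's documented interface.

```python
from itertools import groupby

PILCROW = "¶"


def newlines_to_pilcrows(text: str) -> str:
    """Replace every line break with a pilcrow so it survives tokenization."""
    # Normalise Windows (\r\n) and bare \r line endings to \n first.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalised.replace("\n", PILCROW)


def group_segments(tokens, labels):
    """Merge runs of consecutive tokens with the same label into segments.

    Returns a list of (label, text) tuples in document order.
    Tokens and labels here are hypothetical; a real pipeline would
    take them from the token-classification output.
    """
    pairs = zip(tokens, labels)
    return [
        (label, " ".join(token for token, _ in run))
        for label, run in groupby(pairs, key=lambda pair: pair[1])
    ]


if __name__ == "__main__":
    raw = "Chapter 1\nIt was late."
    print(newlines_to_pilcrows(raw))

    # Hand-written labels standing in for model predictions (assumed values).
    tokens = ["Chapter", "1", "¶", "It", "was", "late."]
    labels = ["Title", "Title", "Separator", "Text", "Text", "Text"]
    for label, segment in group_segments(tokens, labels):
        print(f"{label}: {segment}")
```

Grouping on runs of identical labels keeps the sketch short; it would merge two adjacent segments of the same type, which is one reason the **Separator** label matters in practice.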