Upload 3 files
- JCLS_model_card.md +48 -48
- README.md +50 -50
- final_model.pkl +2 -2
JCLS_model_card.md
CHANGED
@@ -6,7 +6,7 @@ tags:
|
|
6 |
- camembert
|
7 |
- literary-texts
|
8 |
- nested-entities
|
9 |
-
-
|
10 |
license: apache-2.0
|
11 |
metrics:
|
12 |
- f1
|
@@ -18,7 +18,7 @@ pipeline_tag: token-classification
|
|
18 |
---
|
19 |
|
20 |
## INTRODUCTION:
|
21 |
-
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a NER model built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings, trained to predict nested entities in French, specifically for literary texts.
|
22 |
|
23 |
The predicted entities are:
|
24 |
- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
|
@@ -31,14 +31,14 @@ The predicted entities are:
|
|
31 |
## MODEL PERFORMANCES (LOOCV):
|
32 |
| NER_tag | precision | recall | f1_score | support | support % |
|
33 |
|-----------|-------------|----------|------------|-----------|-------------|
|
34 |
-
| PER |
|
35 |
-
| FAC |
|
36 |
-
| TIME |
|
37 |
-
| LOC |
|
38 |
-
| GPE |
|
39 |
-
| VEH |
|
40 |
-
| micro_avg |
|
41 |
-
| macro_avg |
|
42 |
|
43 |
## TRAINING PARAMETERS:
|
44 |
- Entities types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
|
@@ -75,48 +75,48 @@ Model Output: BIOES labels sequence
|
|
75 |
*** IN CONSTRUCTION ***
|
76 |
|
77 |
## TRAINING CORPUS:
|
78 |
-
| | Document
|
79 |
-
|
80 |
-
| 0 |
|
81 |
-
| 1 |
|
82 |
-
| 2 |
|
83 |
-
| 3 |
|
84 |
-
| 4 |
|
85 |
-
| 5 | 1856_Cousin-Victor_Madame-de-Hautefort
|
86 |
-
| 6 | 1863_Gautier-
|
87 |
-
| 7 | 1873_Zola-
|
88 |
-
| 8 | 1881_Flaubert-Gustave_Bouvard-et-
|
89 |
-
| 9 |
|
90 |
-
| 10 |
|
91 |
-
| 11 |
|
92 |
-
| 12 |
|
93 |
-
| 13 |
|
94 |
-
| 14 |
|
95 |
-
| 15 |
|
96 |
-
| 16 |
|
97 |
-
| 17 |
|
98 |
-
| 18 |
|
99 |
-
| 19 |
|
100 |
-
| 20 |
|
101 |
-
| 21 | 1903_Conan-
|
102 |
-
| 22 |
|
103 |
-
| 23 |
|
104 |
-
| 24 |
|
105 |
-
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps
|
106 |
-
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin
|
107 |
-
| 27 | 1937_Audoux-Marguerite_Douce-
|
108 |
-
| 28 | TOTAL
|
109 |
|
110 |
## PREDICTIONS CONFUSION MATRIX:
|
111 |
| Gold Labels | PER | FAC | TIME | LOC | GPE | VEH | O | support |
|
112 |
|---------------|-------|-------|--------|-------|-------|-------|-----|-----------|
|
113 |
-
| PER |
|
114 |
-
| FAC |
|
115 |
-
| TIME | 1 | 0 |
|
116 |
-
| LOC | 0 |
|
117 |
-
| GPE |
|
118 |
-
| VEH |
|
119 |
-
| O |
|
120 |
|
121 |
## CONTACT:
|
122 |
mail: antoine [dot] bourgois [at] protonmail [dot] com
|
|
|
6 |
- camembert
|
7 |
- literary-texts
|
8 |
- nested-entities
|
9 |
+
- BookNLP-fr
|
10 |
license: apache-2.0
|
11 |
metrics:
|
12 |
- f1
|
|
|
18 |
---
|
19 |
|
20 |
## INTRODUCTION:
|
21 |
+
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **NER model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings, trained to predict nested entities in French, specifically for literary texts.
|
22 |
|
23 |
The predicted entities are:
|
24 |
- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
|
|
|
31 |
## MODEL PERFORMANCES (LOOCV):
|
| NER_tag   | precision | recall | f1_score | support | support % |
|-----------|-----------|--------|----------|---------|-----------|
| PER       | 92.97%    | 96.25% | 94.58%   | 4,162   | 86.12%    |
| FAC       | 76.58%    | 75.89% | 76.23%   | 224     | 4.63%     |
| TIME      | 66.97%    | 69.48% | 68.20%   | 213     | 4.41%     |
| LOC       | 70.00%    | 57.27% | 63.00%   | 110     | 2.28%     |
| GPE       | 80.65%    | 78.12% | 79.37%   | 64      | 1.32%     |
| VEH       | 57.75%    | 68.33% | 62.60%   | 60      | 1.24%     |
| micro_avg | 89.94%    | 92.65% | 91.25%   | 4,833   | 100.00%   |
| macro_avg | 74.15%    | 74.23% | 74.00%   | 4,833   | 100.00%   |
|
42 |
|
43 |
## TRAINING PARAMETERS:
|
44 |
- Entities types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
|
|
|
75 |
*** IN CONSTRUCTION ***
|
76 |
|
77 |
## TRAINING CORPUS:
|
|    | Document | Tokens Count | Is included in model eval |
|----|----------|--------------|---------------------------|
| 0  | 1830_Balzac-Honoré-de_La-maison-du-chat-qui-pelote | 24,776 tokens | False |
| 1  | 1830_Balzac-Honoré-de_Sarrasine | 15,408 tokens | False |
| 2  | 1836_Gautier-Théophile_La-morte-amoureuse | 14,293 tokens | False |
| 3  | 1837_Balzac-Honoré-de_La-maison-Nucingen | 30,034 tokens | False |
| 4  | 1841_Sand-George_Pauline | 12,398 tokens | False |
| 5  | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | False |
| 6  | 1863_Gautier-Théophile_Le-capitaine-Fracasse | 11,848 tokens | False |
| 7  | 1873_Zola-Émile_Le-ventre-de-Paris | 12,613 tokens | False |
| 8  | 1881_Flaubert-Gustave_Bouvard-et-Pécuchet | 12,308 tokens | False |
| 9  | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-buche | 2,267 tokens | False |
| 10 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-relique | 2,041 tokens | False |
| 11 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-rouille | 2,949 tokens | True |
| 12 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Madame-Baptiste | 2,578 tokens | True |
| 13 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Marocca | 4,078 tokens | False |
| 14 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-A-cheval | 2,878 tokens | False |
| 15 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Fou | 1,905 tokens | False |
| 16 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Mademoiselle-Fifi | 5,439 tokens | True |
| 17 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Reveil | 2,159 tokens | False |
| 18 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Un-reveillon | 2,364 tokens | False |
| 19 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Une-ruse | 2,469 tokens | False |
| 20 | 1901_Achard-Lucie_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,775 tokens | False |
| 21 | 1903_Conan-Laure_Élisabeth-Seton | 13,046 tokens | False |
| 22 | 1904-1912_Rolland-Romain_Jean-Christophe(1) | 10,982 tokens | True |
| 23 | 1904-1912_Rolland-Romain_Jean-Christophe(2) | 10,305 tokens | False |
| 24 | 1917_Bourgeois-Adèle_Némoville | 12,468 tokens | False |
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,850 tokens | False |
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 12,144 tokens | True |
| 27 | 1937_Audoux-Marguerite_Douce-Lumière | 12,346 tokens | False |
| 28 | TOTAL | 275,489 tokens | 5 files used for cross-validation |
|
109 |
|
110 |
## PREDICTIONS CONFUSION MATRIX:
|
| Gold Labels | PER   | FAC | TIME | LOC | GPE | VEH | O   | support |
|-------------|-------|-----|------|-----|-----|-----|-----|---------|
| PER         | 4,006 | 0   | 2    | 1   | 1   | 3   | 149 | 4,162   |
| FAC         | 8     | 170 | 0    | 2   | 0   | 1   | 43  | 224     |
| TIME        | 1     | 0   | 148  | 0   | 0   | 0   | 64  | 213     |
| LOC         | 0     | 2   | 0    | 63  | 6   | 0   | 39  | 110     |
| GPE         | 2     | 1   | 0    | 3   | 50  | 0   | 8   | 64      |
| VEH         | 3     | 0   | 0    | 0   | 0   | 41  | 16  | 60      |
| O           | 287   | 49  | 70   | 21  | 5   | 26  | 0   | 458     |
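
The recall column of the performance table can be cross-checked against this confusion matrix. The sketch below is not part of the model's code: it assumes that rows are gold labels, columns are predicted labels, and the `O` column counts gold mentions with no matching prediction. Under that reading, per-label recall is the diagonal cell divided by the row support. The names `LABELS`, `MATRIX`, and `recall_per_label` are illustrative. Precision recomputed from the columns in the same way is close to the reported figures but not identical for every tag, so only recall is shown.

```python
# Column order of the confusion matrix (predicted labels).
LABELS = ["PER", "FAC", "TIME", "LOC", "GPE", "VEH", "O"]

# Gold-label rows copied from the JCLS model card matrix.
# The gold "O" row is left out: it holds spurious predictions, not gold entities.
MATRIX = {
    "PER":  [4006, 0, 2, 1, 1, 3, 149],
    "FAC":  [8, 170, 0, 2, 0, 1, 43],
    "TIME": [1, 0, 148, 0, 0, 0, 64],
    "LOC":  [0, 2, 0, 63, 6, 0, 39],
    "GPE":  [2, 1, 0, 3, 50, 0, 8],
    "VEH":  [3, 0, 0, 0, 0, 41, 16],
}


def recall_per_label(matrix, labels):
    """Recall = correctly predicted mentions / gold mentions, per label."""
    recalls = {}
    for gold, row in matrix.items():
        correct = row[labels.index(gold)]  # diagonal cell
        recalls[gold] = correct / sum(row)  # row sum is the support
    return recalls


recall = recall_per_label(MATRIX, LABELS)

# Micro average pools all mentions; macro average weights each label equally.
total_correct = sum(MATRIX[g][LABELS.index(g)] for g in MATRIX)
total_gold = sum(sum(row) for row in MATRIX.values())
micro_recall = total_correct / total_gold
macro_recall = sum(recall.values()) / len(recall)

for label, value in recall.items():
    print(f"{label:<5} {value:.2%}")
print(f"micro {micro_recall:.2%}")
print(f"macro {macro_recall:.2%}")
```

The gap between the micro and macro averages reflects the label imbalance: PER accounts for most of the support, so it dominates the micro average.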
|
120 |
|
121 |
## CONTACT:
|
122 |
mail: antoine [dot] bourgois [at] protonmail [dot] com
|
README.md
CHANGED
@@ -6,7 +6,7 @@ tags:
|
|
6 |
- camembert
|
7 |
- literary-texts
|
8 |
- nested-entities
|
9 |
-
-
|
10 |
license: apache-2.0
|
11 |
metrics:
|
12 |
- f1
|
@@ -18,7 +18,7 @@ pipeline_tag: token-classification
|
|
18 |
---
|
19 |
|
20 |
## INTRODUCTION:
|
21 |
-
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a NER model built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings, trained to predict nested entities in French, specifically for literary texts.
|
22 |
|
23 |
The predicted entities are:
|
24 |
- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
|
@@ -31,14 +31,14 @@ The predicted entities are:
|
|
31 |
## MODEL PERFORMANCES (LOOCV):
|
32 |
| NER_tag | precision | recall | f1_score | support | support % |
|
33 |
|-----------|-------------|----------|------------|-----------|-------------|
|
34 |
-
| PER |
|
35 |
-
| FAC |
|
36 |
-
| TIME |
|
37 |
-
|
|
38 |
-
|
|
39 |
-
| VEH |
|
40 |
-
| micro_avg |
|
41 |
-
| macro_avg |
|
42 |
|
43 |
## TRAINING PARAMETERS:
|
44 |
- Entities types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
|
@@ -75,48 +75,48 @@ Model Output: BIOES labels sequence
|
|
75 |
*** IN CONSTRUCTION ***
|
76 |
|
77 |
## TRAINING CORPUS:
|
78 |
-
| | Document
|
79 |
-
|
80 |
-
| 0 |
|
81 |
-
| 1 |
|
82 |
-
| 2 |
|
83 |
-
| 3 |
|
84 |
-
| 4 |
|
85 |
-
| 5 | 1856_Cousin-Victor_Madame-de-Hautefort
|
86 |
-
| 6 | 1863_Gautier-
|
87 |
-
| 7 | 1873_Zola-
|
88 |
-
| 8 | 1881_Flaubert-Gustave_Bouvard-et-
|
89 |
-
| 9 |
|
90 |
-
| 10 |
|
91 |
-
| 11 |
|
92 |
-
| 12 |
|
93 |
-
| 13 |
|
94 |
-
| 14 |
|
95 |
-
| 15 |
|
96 |
-
| 16 |
|
97 |
-
| 17 |
|
98 |
-
| 18 |
|
99 |
-
| 19 |
|
100 |
-
| 20 |
|
101 |
-
| 21 | 1903_Conan-
|
102 |
-
| 22 |
|
103 |
-
| 23 |
|
104 |
-
| 24 |
|
105 |
-
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps
|
106 |
-
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin
|
107 |
-
| 27 | 1937_Audoux-Marguerite_Douce-
|
108 |
-
| 28 | TOTAL
|
109 |
|
110 |
## PREDICTIONS CONFUSION MATRIX:
|
111 |
-
| Gold Labels | PER
|
112 |
-
|
113 |
-
| PER |
|
114 |
-
| FAC |
|
115 |
-
| TIME |
|
116 |
-
|
|
117 |
-
|
|
118 |
-
| VEH |
|
119 |
-
| O |
|
120 |
|
121 |
## CONTACT:
|
122 |
mail: antoine [dot] bourgois [at] protonmail [dot] com
|
|
|
6 |
- camembert
|
7 |
- literary-texts
|
8 |
- nested-entities
|
9 |
+
- BookNLP-fr
|
10 |
license: apache-2.0
|
11 |
metrics:
|
12 |
- f1
|
|
|
18 |
---
|
19 |
|
20 |
## INTRODUCTION:
|
21 |
+
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **NER model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings, trained to predict nested entities in French, specifically for literary texts.
|
22 |
|
23 |
The predicted entities are:
|
24 |
- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
|
|
|
31 |
## MODEL PERFORMANCES (LOOCV):
|
| NER_tag   | precision | recall | f1_score | support | support % |
|-----------|-----------|--------|----------|---------|-----------|
| PER       | 92.78%    | 94.29% | 93.53%   | 10,354  | 86.92%    |
| FAC       | 69.81%    | 69.92% | 69.87%   | 635     | 5.33%     |
| TIME      | 64.21%    | 62.12% | 63.15%   | 462     | 3.88%     |
| LOC       | 63.50%    | 46.28% | 53.54%   | 188     | 1.58%     |
| GPE       | 79.86%    | 74.68% | 77.18%   | 154     | 1.29%     |
| VEH       | 61.82%    | 57.14% | 59.39%   | 119     | 1.00%     |
| micro_avg | 89.51%    | 90.36% | 89.91%   | 11,912  | 100.00%   |
| macro_avg | 72.00%    | 67.40% | 69.44%   | 11,912  | 100.00%   |
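
The `f1_score` column is related to the precision and recall columns through the harmonic mean. The following sketch is not taken from the project's evaluation code: it recomputes F1 for PER from the rounded table values, and checks that the reported macro F1 is consistent with a plain average of the per-tag F1 scores. The names `f1_score` and `TABLE_F1` are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# PER row of the README table: precision 92.78%, recall 94.29%.
per_f1 = f1_score(92.78, 94.29)
print(f"PER f1: {per_f1:.2f}%")

# Per-tag F1 values as reported in the README table.
TABLE_F1 = {
    "PER": 93.53,
    "FAC": 69.87,
    "TIME": 63.15,
    "LOC": 53.54,
    "GPE": 77.18,
    "VEH": 59.39,
}

# Macro F1 taken as the unweighted mean of per-tag F1 scores.
macro_f1 = sum(TABLE_F1.values()) / len(TABLE_F1)
print(f"macro f1: {macro_f1:.2f}%")
```

Because the inputs are already rounded to two decimals, recomputed values for other tags can differ from the table in the last digit.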
|
42 |
|
43 |
## TRAINING PARAMETERS:
|
44 |
- Entities types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
|
|
|
75 |
*** IN CONSTRUCTION ***
|
76 |
|
77 |
## TRAINING CORPUS:
|
|    | Document | Tokens Count | Is included in model eval |
|----|----------|--------------|---------------------------|
| 0  | 1830_Balzac-Honoré-de_La-maison-du-chat-qui-pelote | 24,776 tokens | True |
| 1  | 1830_Balzac-Honoré-de_Sarrasine | 15,408 tokens | True |
| 2  | 1836_Gautier-Théophile_La-morte-amoureuse | 14,293 tokens | True |
| 3  | 1837_Balzac-Honoré-de_La-maison-Nucingen | 30,034 tokens | False |
| 4  | 1841_Sand-George_Pauline | 12,398 tokens | False |
| 5  | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | False |
| 6  | 1863_Gautier-Théophile_Le-capitaine-Fracasse | 11,848 tokens | False |
| 7  | 1873_Zola-Émile_Le-ventre-de-Paris | 12,613 tokens | False |
| 8  | 1881_Flaubert-Gustave_Bouvard-et-Pécuchet | 12,308 tokens | False |
| 9  | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-buche | 2,267 tokens | False |
| 10 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-relique | 2,041 tokens | False |
| 11 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-rouille | 2,949 tokens | True |
| 12 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Madame-Baptiste | 2,578 tokens | True |
| 13 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Marocca | 4,078 tokens | False |
| 14 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-A-cheval | 2,878 tokens | False |
| 15 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Fou | 1,905 tokens | False |
| 16 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Mademoiselle-Fifi | 5,439 tokens | True |
| 17 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Reveil | 2,159 tokens | False |
| 18 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Un-reveillon | 2,364 tokens | False |
| 19 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Une-ruse | 2,469 tokens | False |
| 20 | 1901_Achard-Lucie_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,775 tokens | False |
| 21 | 1903_Conan-Laure_Élisabeth-Seton | 13,046 tokens | False |
| 22 | 1904-1912_Rolland-Romain_Jean-Christophe(1) | 10,982 tokens | True |
| 23 | 1904-1912_Rolland-Romain_Jean-Christophe(2) | 10,305 tokens | False |
| 24 | 1917_Bourgeois-Adèle_Némoville | 12,468 tokens | False |
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,850 tokens | False |
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 12,144 tokens | True |
| 27 | 1937_Audoux-Marguerite_Douce-Lumière | 12,346 tokens | False |
| 28 | TOTAL | 275,489 tokens | 8 files used for cross-validation |
|
109 |
|
110 |
## PREDICTIONS CONFUSION MATRIX:
|
| Gold Labels | PER   | FAC | TIME | LOC | GPE | VEH | O   | support |
|-------------|-------|-----|------|-----|-----|-----|-----|---------|
| PER         | 9,763 | 3   | 6    | 1   | 1   | 6   | 574 | 10,354  |
| FAC         | 27    | 444 | 1    | 4   | 4   | 1   | 154 | 635     |
| TIME        | 1     | 0   | 287  | 0   | 0   | 0   | 174 | 462     |
| LOC         | 1     | 13  | 0    | 87  | 11  | 0   | 76  | 188     |
| GPE         | 3     | 2   | 1    | 8   | 115 | 0   | 25  | 154     |
| VEH         | 12    | 1   | 0    | 0   | 0   | 68  | 38  | 119     |
| O           | 709   | 168 | 151  | 37  | 13  | 35  | 0   | 1,113   |
|
120 |
|
121 |
## CONTACT:
|
122 |
mail: antoine [dot] bourgois [at] protonmail [dot] com
|
final_model.pkl
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
-
size
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:33aee46fef5eb366deed3c1407205a9f0b3ee115590473f11da0d4f3d2f29c02
|
3 |
+
size 386304699
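
The three lines above are a Git LFS pointer, not the model weights themselves: the pointer records the spec version, the SHA-256 of the real file, and its size in bytes. As a minimal sketch (the helper name `parse_lfs_pointer` is illustrative and not part of any library), the pointer can be parsed like this to see what a download of `final_model.pkl` should match:

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:33aee46fef5eb366deed3c1407205a9f0b3ee115590473f11da0d4f3d2f29c02
size 386304699
"""


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    # Split the oid into hash algorithm and hex digest.
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
size_mib = pointer["size"] / (1024 * 1024)
print(pointer["algorithm"], pointer["digest"])
print(f"{size_mib:.1f} MiB")
```

A downloaded file can then be checked by comparing its byte length and its `hashlib.sha256` hex digest with the parsed values.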
|