language: fr
tags:
- coreference-resolution
- anaphora-resolution
- mentions-linking
- literary-texts
- camembert
- nested-entities
- BookNLP-fr
license: apache-2.0
metrics:
- MUC
- B3
- CEAF
- CoNLL-F1
base_model:
- almanach/camembert-large
INTRODUCTION:
This model, developed as part of the BookNLP-fr project, is a coreference resolution model built on top of camembert-large embeddings. It is trained to link mentions of the same entity across a text, focusing on literary works in French.
This specific model links mentions of a single entity type: PER (persons).
MODEL PERFORMANCE (LOOCV):
Overall coreference resolution performance on non-overlapping windows of different lengths:
Index | Window width (tokens) | Document count | Sample count | MUC F1 | B3 F1 | CEAFe F1 | CoNLL F1 |
---|---|---|---|---|---|---|---|
0 | 500 | 4 | 41 | 90.91% | 78.47% | 64.73% | 78.03% |
2 | 2,000 | 1 | 5 | 94.56% | 70.95% | 48.18% | 71.23% |
4 | 10,000 | 1 | 1 | 94.50% | 57.67% | 35.50% | 62.56% |
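The windowed evaluation can be illustrated with a minimal sketch. The helper name `split_into_windows` is hypothetical, and dropping a trailing remainder shorter than the window width is an assumption, not a documented behaviour of BookNLP-fr. The document lengths are the token counts of the four fully annotated samples reported in this card.

```python
def split_into_windows(tokens, width):
    """Split a token list into consecutive, non-overlapping windows.

    Assumption: a trailing remainder shorter than `width` is dropped.
    """
    full_windows = len(tokens) // width
    return [tokens[i * width:(i + 1) * width] for i in range(full_windows)]


# Token counts of the four fully annotated samples reported in this card.
doc_lengths = [2578, 2949, 5439, 10982]
docs = [list(range(n)) for n in doc_lengths]  # placeholder token ids

windows_500 = [w for doc in docs for w in split_into_windows(doc, 500)]
print(len(windows_500))
```

Under this assumption, a 500-token width yields 41 windows over the four documents, which matches the sample count in the first row of the table. How documents are selected for the wider windows is not specified here.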
Coreference resolution performance on the fully annotated sample of each document:
Index | Token count | Mention count | MUC F1 | B3 F1 | CEAFe F1 | CoNLL F1 |
---|---|---|---|---|---|---|
0 | 2,578 | 330 | 89.66% | 69.56% | 68.68% | 75.97% |
1 | 2,949 | 386 | 95.93% | 71.31% | 69.86% | 79.03% |
2 | 5,439 | 558 | 90.20% | 59.67% | 58.89% | 69.58% |
3 | 10,982 | 1,095 | 94.40% | 58.46% | 32.62% | 61.83% |
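The CoNLL F1 is the unweighted mean of the MUC, B3 and CEAFe F1 scores. As a quick check, the short sketch below recomputes it from the rounded values of the table above. The function name `conll_f1` is illustrative, and differences of about 0.01 point are expected because the inputs are already rounded.

```python
def conll_f1(muc_f1, b3_f1, ceafe_f1):
    """Return the CoNLL F1: the unweighted mean of MUC, B3 and CEAFe F1."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0


# (MUC F1, B3 F1, CEAFe F1, reported CoNLL F1), copied from the table above.
rows = [
    (89.66, 69.56, 68.68, 75.97),
    (95.93, 71.31, 69.86, 79.03),
    (90.20, 59.67, 58.89, 69.58),
    (94.40, 58.46, 32.62, 61.83),
]

for muc, b3, ceafe, reported in rows:
    computed = conll_f1(muc, b3, ceafe)
    print(f"computed {computed:.2f} | reported {reported:.2f}")
```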
TRAINING PARAMETERS:
- Entity types: PER
- Split strategy: Leave-one-out cross-validation (31 files)
- Train/Validation split: 0.85 / 0.15
- Batch size: 16,000
- Initial learning rate: 0.0004
- Focal loss gamma: 1
- Focal loss alpha: 0.25
- Pronoun lookup antecedents: 30
- Common and Proper nouns lookup antecedents: 300
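To make the two focal loss parameters concrete, here is a minimal sketch of a binary focal loss for a single mention pair, using the gamma and alpha values listed above as defaults. It is not the project's training code: the function name is hypothetical, and applying alpha to the positive class and (1 - alpha) to the negative class is an assumption borrowed from the usual formulation.

```python
import math


def binary_focal_loss(p, target, gamma=1.0, alpha=0.25, eps=1e-12):
    """Focal loss for one predicted probability `p` and a 0/1 `target`.

    Assumption: `alpha` weights the positive class and `1 - alpha`
    the negative class.
    """
    p_t = p if target == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)                # avoid log(0)
    # (1 - p_t) ** gamma down-weights pairs that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.9, 1)  # confident, correct prediction
hard = binary_focal_loss(0.1, 1)  # confident, wrong prediction
print(easy < hard)
```

With gamma set to 0 the modulating factor disappears and the loss reduces to an alpha-weighted binary cross-entropy.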
MODEL ARCHITECTURE:
Model Input: 2,165-dimensional vector
Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
Additional mention features (106 dimensions):
- Length of mentions
- Position of the mention's start token within the sentence
- Grammatical category of the mentions (pronoun, common noun, proper noun)
- Dependency relation of the mention's head (one-hot encoded)
- Gender of the mentions (one-hot encoded)
- Number (singular/plural) of the mentions (one-hot encoded)
- Grammatical person of the mentions (one-hot encoded)
Additional mention-pair features (11 dimensions):
- Distance between mention IDs
- Distance between start tokens of mentions
- Distance between end tokens of mentions
- Distance between sentences containing mentions
- Distance between paragraphs containing mentions
- Difference in nesting levels of mentions
- Ratio of shared tokens between mentions
- Exact text match between mentions (binary)
- Exact match of mention heads (binary)
- Match of syntactic heads between mentions (binary)
- Match of entity types between mentions (binary)
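The eleven pair features listed above could be computed along the following lines. This is only a sketch: the `Mention` record and its field names are hypothetical, distances are taken as absolute differences, the shared-token ratio is computed as a Jaccard ratio, and the distinction between "mention head" and "syntactic head" is an assumed reading of the list, none of which is confirmed by the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mention:
    """Hypothetical mention record; field names are illustrative only."""
    mention_id: int
    start_token: int
    end_token: int
    sentence_id: int
    paragraph_id: int
    nesting_level: int
    tokens: List[str]
    head: str            # head word of the mention
    syntactic_head: str  # word the mention head depends on (assumed meaning)
    entity_type: str


def pair_features(a: Mention, b: Mention) -> List[float]:
    """Return 11 pair features, in the order of the list above."""
    tokens_a = [t.lower() for t in a.tokens]
    tokens_b = [t.lower() for t in b.tokens]
    union = set(tokens_a) | set(tokens_b)
    # Assumption: "ratio of shared tokens" is a Jaccard ratio over token sets.
    shared = len(set(tokens_a) & set(tokens_b)) / len(union) if union else 0.0
    return [
        float(abs(b.mention_id - a.mention_id)),
        float(abs(b.start_token - a.start_token)),
        float(abs(b.end_token - a.end_token)),
        float(abs(b.sentence_id - a.sentence_id)),
        float(abs(b.paragraph_id - a.paragraph_id)),
        float(abs(b.nesting_level - a.nesting_level)),
        shared,
        float(tokens_a == tokens_b),
        float(a.head.lower() == b.head.lower()),
        float(a.syntactic_head.lower() == b.syntactic_head.lower()),
        float(a.entity_type == b.entity_type),
    ]


antecedent = Mention(0, 3, 4, 0, 0, 0, ["la", "marquise"], "marquise", "entra", "PER")
anaphor = Mention(5, 40, 40, 2, 1, 0, ["elle"], "elle", "sourit", "PER")
features = pair_features(antecedent, anaphor)
print(len(features), features)
```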
Hidden Layers:
- Number of layers: 3
- Units per layer: 1,900
- Activation function: ReLU
- Dropout rate: 0.6
Final Layer:
- Type: Linear
- Input: 1,900 dimensions
- Output: 1 dimension (mention pair coreference score)
Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
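The dimensions above can be cross-checked with a few lines of arithmetic. The sketch below derives the input size and the layer shapes from the figures in this section and counts the trainable parameters. It assumes fully connected layers with bias terms, and it assumes that a sigmoid maps the final linear score to the 0-1 range; neither detail is stated in this card.

```python
import math

EMBEDDING_DIM = 2 * 1024        # two concatenated camembert-large embeddings
MENTION_FEATURES_DIM = 106
PAIR_FEATURES_DIM = 11
INPUT_DIM = EMBEDDING_DIM + MENTION_FEATURES_DIM + PAIR_FEATURES_DIM
HIDDEN_DIM = 1900
N_HIDDEN_LAYERS = 3

# (input size, output size) of each layer, hidden layers first.
layer_shapes = [(INPUT_DIM, HIDDEN_DIM)]
layer_shapes += [(HIDDEN_DIM, HIDDEN_DIM)] * (N_HIDDEN_LAYERS - 1)
layer_shapes.append((HIDDEN_DIM, 1))  # final linear layer

# Assumption: every layer has a weight matrix plus one bias per output unit.
n_params = sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)


def sigmoid(x):
    """Numerically stable logistic function (assumed output squashing)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


print(INPUT_DIM, layer_shapes, n_params)
```

Under these assumptions the input size is 2,165 and the network has 11,341,101 trainable parameters.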
HOW TO USE:
*** UNDER CONSTRUCTION ***
TRAINING CORPUS:
Index | Document | Token count | Included in model evaluation |
---|---|---|---|
0 | 1731_Prévost-Antoine-François_Manon-Lescaut_PER-ONLY | 71,219 tokens | False |
1 | 1832_Sand-George_Indiana_PER-ONLY | 112,221 tokens | False |
2 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,293 tokens | False |
3 | 1840_Sand-George_Pauline | 12,398 tokens | False |
4 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | False |
5 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,034 tokens | False |
6 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | False |
7 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | False |
8 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,848 tokens | False |
9 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,613 tokens | False |
10 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,308 tokens | False |
11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,439 tokens | True |
12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,578 tokens | True |
13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,949 tokens | True |
14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,078 tokens | False |
15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,267 tokens | False |
16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,041 tokens | False |
17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,905 tokens | False |
18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,159 tokens | False |
19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,469 tokens | False |
20 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,878 tokens | False |
21 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,358 tokens | False |
22 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,776 tokens | False |
23 | 1903_Conan-Laure_Elisabeth_Seton | 13,046 tokens | False |
24 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | True |
25 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | False |
26 | 1917_Adèle-Bourgeois_Némoville | 12,468 tokens | False |
27 | 1923_Delly_Dans-les-ruines | 95,617 tokens | False |
28 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,850 tokens | False |
29 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 12,083 tokens | True |
30 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,346 tokens | False |
31 | TOTAL | 554,480 tokens | 5 files used for cross-validation |
CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com