---
language: fr
tags:
- coreference-resolution
- anaphora-resolution
- mentions-linking
- literary-texts
- camembert
- nested-entities
- BookNLP-fr
license: apache-2.0
metrics:
- MUC
- B3
- CEAF
- CoNLL-F1
base_model:
- almanach/camembert-large
---
## INTRODUCTION:
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **coreference resolution model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings. It is trained to link mentions of the same entity across a text, with a focus on French literary works.
This specific model links mentions of a single entity type: PER (persons).
## MODEL PERFORMANCE (LOOCV):
Overall coreference resolution performance on non-overlapping windows of different lengths:
| | Window width (tokens) | Document count | Sample count | MUC F1 | B3 F1 | CEAFe F1 | CoNLL F1 |
|----|-------------------------|------------------|----------------|----------|---------|------------|------------|
| 0 | 500 | 4 | 41 | 90.91% | 78.47% | 64.73% | 78.03% |
| 2 | 2,000 | 1 | 5 | 94.56% | 70.95% | 48.18% | 71.23% |
| 4 | 10,000 | 1 | 1 | 94.50% | 57.67% | 35.50% | 62.56% |
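
The window-based evaluation above can be illustrated with a minimal sketch of how a document might be cut into non-overlapping windows. The function name `split_into_windows` is hypothetical, and dropping a trailing remainder shorter than the window width is an assumption, not a confirmed detail of the BookNLP-fr evaluation code.

```python
def split_into_windows(tokens, width):
    """Split a token list into consecutive, non-overlapping windows of `width` tokens.

    Assumption: a trailing remainder shorter than `width` is discarded.
    """
    n_full = len(tokens) // width
    return [tokens[i * width:(i + 1) * width] for i in range(n_full)]


# Dummy tokens standing in for a 10,982-token document.
tokens = [f"tok{i}" for i in range(10_982)]
for width in (500, 2_000, 10_000):
    print(width, len(split_into_windows(tokens, width)))
```

Under this assumption a 10,982-token document gives 21, 5 and 1 windows respectively, which is consistent with the sample counts reported for the 2,000 and 10,000 token widths.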
Coreference resolution performance on the fully annotated sample of each document:
| | Token count | Mention count | MUC F1 | B3 F1 | CEAFe F1 | CoNLL F1 |
|----|---------------|-----------------|----------|---------|------------|------------|
| 0 | 2,578 | 330 | 89.66% | 69.56% | 68.68% | 75.97% |
| 1 | 2,949 | 386 | 95.93% | 71.31% | 69.86% | 79.03% |
| 2 | 5,439 | 558 | 90.20% | 59.67% | 58.89% | 69.58% |
| 3 | 10,982 | 1,095 | 94.40% | 58.46% | 32.62% | 61.83% |
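
As a reading aid for the tables above: the CoNLL F1 score is conventionally the unweighted mean of the MUC, B3 and CEAFe F1 scores. The helper name `conll_f1` below is only illustrative.

```python
def conll_f1(muc_f1, b3_f1, ceafe_f1):
    """Unweighted mean of the three F1 scores (all in the same unit, e.g. percent)."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3


# First row of the per-document table above.
score = conll_f1(89.66, 69.56, 68.68)
print(f"{score:.2f}%")  # 75.97%
```

Recomputing from the rounded columns can differ from a published value by 0.01, presumably because the reported averages were computed from unrounded scores.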
## TRAINING PARAMETERS:
- Entity types: PER
- Split strategy: Leave-one-out cross-validation (31 files)
- Train/Validation split: 0.85 / 0.15
- Batch size: 16,000
- Initial learning rate: 0.0004
- Focal loss gamma: 1
- Focal loss alpha: 0.25
- Pronoun lookup antecedents: 30
- Common and Proper nouns lookup antecedents: 300
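
Two of the parameters above can be made concrete with a short sketch. It assumes the standard binary focal loss formulation (alpha-balanced, with gamma as the focusing parameter) and assumes that the "lookup antecedents" values cap how many preceding mentions are paired with each mention. Both readings, the function names, and the dictionary layout of `mentions` are assumptions for illustration, not taken from the BookNLP-fr code.

```python
import math


def binary_focal_loss(p, y, gamma=1.0, alpha=0.25, eps=1e-12):
    """Focal loss for one mention pair.

    p: predicted probability that the pair is coreferent (0..1)
    y: gold label, 1 (coreferent) or 0 (not coreferent)
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the gold class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


def candidate_antecedents(mentions, index, pronoun_window=30, noun_window=300):
    """Return the preceding mentions considered as antecedents of mentions[index].

    `mentions` is a list of dicts in document order with a "category" key
    ("pronoun", "common_noun" or "proper_noun"); this layout is hypothetical.
    """
    is_pronoun = mentions[index]["category"] == "pronoun"
    window = pronoun_window if is_pronoun else noun_window
    return mentions[max(0, index - window):index]


# Easy vs. hard positive pair: the confident prediction contributes far less loss.
print(binary_focal_loss(0.9, 1), binary_focal_loss(0.1, 1))

# Toy document with 400 mentions; mention 350 is treated as a pronoun.
mentions = [{"category": "common_noun"} for _ in range(400)]
mentions[350]["category"] = "pronoun"
print(len(candidate_antecedents(mentions, 350)))  # 30
print(len(candidate_antecedents(mentions, 349)))  # 300
```

With gamma = 1 the loss of a well-classified pair is scaled down linearly by its error, which keeps the many easy negative pairs from dominating training.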
## MODEL ARCHITECTURE:
Model Input: 2,165-dimensional vector
- Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
- Additional mention features (106 dimensions):
  - Length of mentions
  - Position of the mention's start token within the sentence
  - Grammatical category of the mentions (pronoun, common noun, proper noun)
  - Dependency relation of the mention's head (one-hot encoded)
  - Gender of the mentions (one-hot encoded)
  - Number (singular/plural) of the mentions (one-hot encoded)
  - Grammatical person of the mentions (one-hot encoded)
- Additional mention pair features (11 dimensions):
  - Distance between mention IDs
  - Distance between start tokens of mentions
  - Distance between end tokens of mentions
  - Distance between sentences containing mentions
  - Distance between paragraphs containing mentions
  - Difference in nesting levels of mentions
  - Ratio of shared tokens between mentions
  - Exact text match between mentions (binary)
  - Exact match of mention heads (binary)
  - Match of syntactic heads between mentions (binary)
  - Match of entity types between mentions (binary)
- Hidden Layers:
  - Number of layers: 3
  - Units per layer: 1,900 nodes
  - Activation function: ReLU
  - Dropout rate: 0.6
- Final Layer:
  - Type: Linear
  - Input: 1,900 dimensions
  - Output: 1 dimension (mention pair coreference score)

Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
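
To make the input layout more tangible, here is a rough sketch of how the 11 mention pair features listed above could be computed and how the input size adds up. The `Mention` dataclass, its field names, the feature order, the example mentions, and the use of a Jaccard overlap for the "ratio of shared tokens" are all assumptions made for illustration. The actual BookNLP-fr feature extraction may differ.

```python
from dataclasses import dataclass


@dataclass
class Mention:  # hypothetical container, not the project's data structure
    mention_id: int
    start: int            # index of the first token
    end: int              # index of the last token
    sentence_id: int
    paragraph_id: int
    nesting_level: int
    text: str
    head: str             # head word of the mention
    syntactic_head: str   # governor of the head in the dependency tree
    entity_type: str


def pair_features(a: Mention, b: Mention) -> list[float]:
    """Return 11 pair features for antecedent `a` and later mention `b`."""
    tokens_a = set(a.text.lower().split())
    tokens_b = set(b.text.lower().split())
    union = tokens_a | tokens_b
    # Assumption: shared-token ratio computed as a Jaccard overlap.
    shared_ratio = len(tokens_a & tokens_b) / len(union) if union else 0.0
    return [
        float(b.mention_id - a.mention_id),
        float(b.start - a.start),
        float(b.end - a.end),
        float(b.sentence_id - a.sentence_id),
        float(b.paragraph_id - a.paragraph_id),
        float(b.nesting_level - a.nesting_level),
        shared_ratio,
        float(a.text.lower() == b.text.lower()),
        float(a.head.lower() == b.head.lower()),
        float(a.syntactic_head.lower() == b.syntactic_head.lower()),
        float(a.entity_type == b.entity_type),
    ]


EMBEDDING_DIM = 1_024        # camembert-large hidden size
MENTION_FEATURE_DIM = 106    # from the list above
PAIR_FEATURE_DIM = 11        # from the list above
INPUT_DIM = 2 * EMBEDDING_DIM + MENTION_FEATURE_DIM + PAIR_FEATURE_DIM

# Made-up mentions, for illustration only.
a = Mention(0, 0, 1, 0, 0, 0, "Manon Lescaut", "Manon", "aimait", "PER")
b = Mention(3, 12, 12, 1, 0, 0, "Manon", "Manon", "partit", "PER")
features = pair_features(a, b)
print(len(features), INPUT_DIM)  # 11 2165
```

In the model, a vector of this kind would be concatenated with the two mention embeddings and the mention features before being passed to the hidden layers.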
## HOW TO USE:
*** UNDER CONSTRUCTION ***
## TRAINING CORPUS:
| | Document | Token count | Included in model evaluation |
|----|----------------------------------------------------------------|----------------|-----------------------------------|
| 0 | 1731_Prévost-Antoine-François_Manon-Lescaut_PER-ONLY | 71,219 tokens | False |
| 1 | 1832_Sand-George_Indiana_PER-ONLY | 112,221 tokens | False |
| 2 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,293 tokens | False |
| 3 | 1840_Sand-George_Pauline | 12,398 tokens | False |
| 4 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | False |
| 5 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,034 tokens | False |
| 6 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | False |
| 7 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | False |
| 8 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,848 tokens | False |
| 9 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,613 tokens | False |
| 10 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,308 tokens | False |
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,439 tokens | **True** |
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,578 tokens | **True** |
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,949 tokens | **True** |
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,078 tokens | False |
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,267 tokens | False |
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,041 tokens | False |
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,905 tokens | False |
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,159 tokens | False |
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,469 tokens | False |
| 20 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,878 tokens | False |
| 21 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,358 tokens | False |
| 22 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,776 tokens | False |
| 23 | 1903_Conan-Laure_Elisabeth_Seton | 13,046 tokens | False |
| 24 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | **True** |
| 25 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | False |
| 26 | 1917_Adèle-Bourgeois_Némoville | 12,468 tokens | False |
| 27 | 1923_Delly_Dans-les-ruines | 95,617 tokens | False |
| 28 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,850 tokens | False |
| 29 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 12,083 tokens | **True** |
| 30 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,346 tokens | False |
| 31 | TOTAL | 554,480 tokens | 5 files used for cross-validation |
## CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com