---
language: fr
tags:
- coreference-resolution
- anaphora-resolution
- mentions-linking
- literary-texts
- camembert
- nested-entities
- BookNLP-fr
license: apache-2.0
metrics:
- MUC
- B3
- CEAF
- CoNLL-F1
base_model:
- almanach/camembert-large
---

## INTRODUCTION:
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **coreference resolution model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings. It is trained to link mentions of the same entity across a text, with a focus on French literary works.

This specific model has been trained to link entities of the following types: PER.

## MODEL PERFORMANCE (LOOCV):
Overall coreference resolution performance on non-overlapping windows of different lengths:

|    | Window width (tokens) | Document count | Sample count | MUC F1 | B3 F1 | CEAFe F1 | CoNLL F1 |
|----|-----------------------|----------------|--------------|--------|--------|----------|----------|
| 0 | 500 | 29 | 677 | 92.18% | 83.86% | 76.86% | 84.30% |
| 1 | 1,000 | 29 | 332 | 92.65% | 79.79% | 71.77% | 81.40% |
| 2 | 2,000 | 28 | 162 | 93.29% | 75.85% | 67.34% | 78.83% |
| 3 | 5,000 | 19 | 56 | 93.76% | 69.60% | 61.16% | 74.84% |
| 4 | 10,000 | 18 | 27 | 94.28% | 65.73% | 58.59% | 72.86% |
| 5 | 25,000 | 2 | 3 | 94.76% | 62.48% | 53.33% | 70.19% |
| 6 | 50,000 | 1 | 1 | 97.39% | 56.43% | 47.40% | 67.07% |

Coreference resolution performance on the fully annotated sample of each document:

|    | Token count | Mention count | MUC F1 | B3 F1 | CEAFe F1 | CoNLL F1 |
|----|-------------|---------------|--------|--------|----------|----------|
| 0 | 1,864 | 253 | 98.16% | 95.39% | 60.34% | 84.63% |
| 1 | 2,034 | 321 | 97.47% | 92.79% | 80.04% | 90.10% |
| 2 | 2,141 | 297 | 95.06% | 77.99% | 65.08% | 79.38% |
| 3 | 2,251 | 235 | 91.95% | 80.47% | 46.56% | 73.00% |
| 4 | 2,343 | 239 | 83.87% | 61.95% | 43.58% | 63.13% |
| 5 | 2,441 | 314 | 91.85% | 55.70% | 60.82% | 69.46% |
| 6 | 2,554 | 330 | 90.24% | 65.27% | 72.36% | 75.96% |
| 7 | 2,860 | 369 | 93.65% | 84.89% | 74.93% | 84.49% |
| 8 | 2,929 | 386 | 95.65% | 78.21% | 64.23% | 79.37% |
| 9 | 4,067 | 429 | 97.46% | 85.20% | 62.52% | 81.73% |
| 10 | 5,425 | 558 | 90.46% | 53.03% | 59.52% | 67.67% |
| 11 | 10,305 | 1,436 | 96.37% | 74.83% | 59.91% | 77.04% |
| 12 | 10,982 | 1,095 | 97.18% | 65.30% | 60.49% | 74.32% |
| 13 | 11,768 | 1,734 | 93.30% | 64.14% | 64.12% | 73.85% |
| 14 | 11,834 | 600 | 92.21% | 67.51% | 60.74% | 73.49% |
| 15 | 11,902 | 1,692 | 95.03% | 58.83% | 45.59% | 66.49% |
| 16 | 12,281 | 1,089 | 95.06% | 62.05% | 72.55% | 76.55% |
| 17 | 12,285 | 1,489 | 95.28% | 77.84% | 57.43% | 76.85% |
| 18 | 12,315 | 1,501 | 95.36% | 57.07% | 64.26% | 72.23% |
| 19 | 12,389 | 1,654 | 93.19% | 54.21% | 51.84% | 66.41% |
| 20 | 12,557 | 1,085 | 92.30% | 66.97% | 46.65% | 68.64% |
| 21 | 12,703 | 1,731 | 90.40% | 53.70% | 61.37% | 68.49% |
| 22 | 13,023 | 1,559 | 93.86% | 61.71% | 62.41% | 72.66% |
| 23 | 14,299 | 1,582 | 97.23% | 69.25% | 67.04% | 77.84% |
| 24 | 14,637 | 2,127 | 95.78% | 71.34% | 63.28% | 76.80% |
| 25 | 15,408 | 1,769 | 92.85% | 54.11% | 56.12% | 67.69% |
| 26 | 24,776 | 2,716 | 94.31% | 63.51% | 54.12% | 70.65% |
| 27 | 30,987 | 2,980 | 89.55% | 54.25% | 59.68% | 67.83% |
| 28 | 71,219 | 11,857 | 97.38% | 50.85% | 45.93% | 64.72% |

## TRAINING PARAMETERS:
- Entity types: PER
- Split strategy: Leave-one-out cross-validation (29 files)
- Train/Validation split: 0.85 / 0.15
- Batch size: 16,000
- Initial learning rate: 0.0004
- Focal loss gamma: 1
- Focal loss alpha: 0.25
- Pronoun lookup antecedents: 30
- Common and Proper nouns lookup antecedents: 300

## MODEL ARCHITECTURE:
Model Input: 2,165-dimensional vector
- Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
- Additional mention features (106 dimensions):
  - Length of mentions
  - Position of the mention's start token within the sentence
  - Grammatical category of the mentions (pronoun, common noun, proper noun)
  - Dependency relation of the mention's head (one-hot encoded)
  - Gender of the mentions (one-hot encoded)
  - Number (singular/plural) of the mentions (one-hot encoded)
  - Grammatical person of the mentions (one-hot encoded)
- Additional mention pair features (11 dimensions):
  - Distance between mention IDs
  - Distance between start tokens of mentions
  - Distance between end tokens of mentions
  - Distance between sentences containing mentions
  - Distance between paragraphs containing mentions
  - Difference in nesting levels of mentions
  - Ratio of shared tokens between mentions
  - Exact text match between mentions (binary)
  - Exact match of mention heads (binary)
  - Match of syntactic heads between mentions (binary)
  - Match of entity types between mentions (binary)
- Hidden Layers:
  - Number of layers: 3
  - Units per layer: 1,900 nodes
  - Activation function: relu
  - Dropout rate: 0.6
- Final Layer:
  - Type: Linear
  - Input: 1,900 dimensions
  - Output: 1 dimension (mention pair coreference score)

Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.

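
The input and layer sizes listed under MODEL ARCHITECTURE, together with the focal loss settings under TRAINING PARAMETERS, can be sanity-checked with a few lines of plain Python. This is a minimal sketch, not the BookNLP-fr implementation: all function names are hypothetical, the bias terms counted in `parameter_count` are an assumption, and `focal_loss` uses the standard binary focal loss formulation, which this card does not spell out.

```python
import math
from typing import List, Tuple

# Sizes stated in this model card.
EMBEDDING_DIM = 1024        # camembert-large embedding size, one per mention
MENTION_FEATURE_DIM = 106   # additional mention features
PAIR_FEATURE_DIM = 11       # additional mention pair features
HIDDEN_UNITS = 1900
HIDDEN_LAYERS = 3


def input_dim() -> int:
    """Two mention embeddings plus the mention and pair features."""
    return 2 * EMBEDDING_DIM + MENTION_FEATURE_DIM + PAIR_FEATURE_DIM


def layer_shapes() -> List[Tuple[int, int]]:
    """(input, output) size of each dense layer, hidden layers first, scorer last."""
    dims = [input_dim()] + [HIDDEN_UNITS] * HIDDEN_LAYERS + [1]
    return list(zip(dims[:-1], dims[1:]))


def parameter_count(with_bias: bool = True) -> int:
    """Weights of the dense layers; the bias terms are an assumption."""
    return sum(i * o + (o if with_bias else 0) for i, o in layer_shapes())


def focal_loss(p: float, label: int, gamma: float = 1.0, alpha: float = 0.25) -> float:
    """Standard binary focal loss for one mention pair (assumed formulation).

    p is the predicted probability that the pair is coreferent,
    label is 1 for coreferent pairs and 0 otherwise.
    """
    p_t = p if label == 1 else 1.0 - p
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    print("input dimensions:", input_dim())
    print("layer shapes:", layer_shapes())
    print("parameters (with bias):", parameter_count())
    print("loss, confident and correct:", focal_loss(0.9, 1))
    print("loss, confident and wrong:", focal_loss(0.9, 0))
```

With the values from this card, `input_dim()` returns 2,165, which matches the stated input size (2,048 + 106 + 11). The focal term `(1 - p_t) ** gamma` shrinks the loss of pairs the model already classifies correctly, so training concentrates on the harder pairs.
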
## HOW TO USE:
*** UNDER CONSTRUCTION ***

## TRAINING CORPUS:
|    | Document | Token count | Is included in model eval |
|----|----------|-------------|---------------------------|
| 0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens | **True** |
| 1 | 1840_Sand-George_Pauline | 12,315 tokens | **True** |
| 2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | **True** |
| 3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens | **True** |
| 4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | **True** |
| 5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | **True** |
| 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | **True** |
| 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | **True** |
| 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | **True** |
| 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | **True** |
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | **True** |
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | **True** |
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | **True** |
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | **True** |
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | **True** |
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens | **True** |
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens | **True** |
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens | **True** |
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens | **True** |
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | **True** |
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | **True** |
| 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | **True** |
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | **True** |
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | **True** |
| 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | **True** |
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | **True** |
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | **True** |
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | **True** |
| 28 | Manon_Lescaut_PEDRO | 71,219 tokens | **True** |
| 29 | TOTAL | 346,579 tokens | 29 files used for cross-validation |

## CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com