metadata
language: fr
tags:
  - coreference-resolution
  - anaphora-resolution
  - mentions-linking
  - literary-texts
  - camembert
  - nested-entities
  - BookNLP-fr
license: apache-2.0
metrics:
  - MUC
  - B3
  - CEAF
  - CoNLL-F1
base_model:
  - almanach/camembert-large

INTRODUCTION:

This model, developed as part of the BookNLP-fr project, is a coreference resolution model built on top of camembert-large embeddings. It is trained to link mentions of the same entity across a text, focusing on literary works in French.

This specific model is trained to link mentions of a single entity type: PER (persons).

MODEL PERFORMANCE (LOOCV, leave-one-out cross-validation):

Overall coreference resolution performance on non-overlapping windows of different lengths:

| Window width (tokens) | Document count | Sample count | MUC F1 | B3 F1 | CEAFe F1 | CoNLL F1 |
|---|---|---|---|---|---|---|
| 500 | 4 | 41 | 90.91% | 78.47% | 64.73% | 78.03% |
| 2,000 | 1 | 5 | 94.56% | 70.95% | 48.18% | 71.23% |
| 10,000 | 1 | 1 | 94.50% | 57.67% | 35.50% | 62.56% |

Coreference resolution performance on the fully annotated sample of each document:

| Token count | Mention count | MUC F1 | B3 F1 | CEAFe F1 | CoNLL F1 |
|---|---|---|---|---|---|
| 2,578 | 330 | 89.66% | 69.56% | 68.68% | 75.97% |
| 2,949 | 386 | 95.93% | 71.31% | 69.86% | 79.03% |
| 5,439 | 558 | 90.20% | 59.67% | 58.89% | 69.58% |
| 10,982 | 1,095 | 94.40% | 58.46% | 32.62% | 61.83% |
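
The CoNLL F1 column is conventionally the unweighted mean of the MUC, B3 and CEAFe F1 scores. The sketch below reproduces that arithmetic for one reported row, assuming this card follows the usual convention; the function name `conll_f1` is illustrative and not part of BookNLP-fr. Recomputing from the rounded columns can differ from a reported value by 0.01.

```python
def conll_f1(muc_f1: float, b3_f1: float, ceafe_f1: float) -> float:
    """Unweighted mean of the three coreference F1 scores (in percent)."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0


# Scores reported above for the 2,000-token window.
score = conll_f1(94.56, 70.95, 48.18)
print(f"{score:.2f}%")  # prints 71.23%
```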

TRAINING PARAMETERS:

  • Entity types: PER
  • Split strategy: Leave-one-out cross-validation (31 files)
  • Train/Validation split: 0.85 / 0.15
  • Batch size: 16,000
  • Initial learning rate: 0.0004
  • Focal loss gamma: 1
  • Focal loss alpha: 0.25
  • Pronoun lookup antecedents: 30
  • Common and Proper nouns lookup antecedents: 300
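
The two lookup values presumably cap how many preceding mentions are considered as candidate antecedents for each mention, with a much shorter window for pronouns than for nouns. The sketch below illustrates pair generation under that reading; the `Mention` class, the category labels and `candidate_pairs` are hypothetical names, not the BookNLP-fr API.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

# Window sizes taken from the training parameters above.
PRONOUN_LOOKUP = 30
NOUN_LOOKUP = 300


@dataclass
class Mention:
    mention_id: int  # position of the mention in document order
    category: str    # "pronoun", "common_noun" or "proper_noun" (assumed labels)


def candidate_pairs(mentions: List[Mention]) -> Iterator[Tuple[int, int]]:
    """Yield (antecedent_id, mention_id) pairs that fall inside the lookup window."""
    for i, mention in enumerate(mentions):
        window = PRONOUN_LOOKUP if mention.category == "pronoun" else NOUN_LOOKUP
        start = max(0, i - window)
        for j in range(start, i):
            yield (mentions[j].mention_id, mention.mention_id)


# Toy document: 50 common nouns followed by one pronoun.
mentions = [Mention(k, "common_noun") for k in range(50)] + [Mention(50, "pronoun")]
pairs = list(candidate_pairs(mentions))
pronoun_pairs = [p for p in pairs if p[1] == 50]
print(len(pronoun_pairs))  # 30: only the 30 closest preceding mentions are kept
```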

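The gamma and alpha values above are the two parameters of focal loss, which down-weights pairs the model already classifies confidently. The card does not give the exact implementation, so the following is a minimal pure-Python sketch of the usual alpha-balanced binary form, with `focal_loss` as an illustrative name.

```python
import math

GAMMA = 1.0   # "Focal loss gamma" from the training parameters
ALPHA = 0.25  # "Focal loss alpha" from the training parameters


def focal_loss(p: float, y: int, gamma: float = GAMMA, alpha: float = ALPHA) -> float:
    """Alpha-balanced binary focal loss for one mention pair.

    p is the predicted probability that the pair is coreferent,
    y is the gold label (1 = coreferent, 0 = not coreferent).
    """
    eps = 1e-12                     # avoid log(0)
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct negative pair contributes far less than a missed positive.
easy_negative = focal_loss(0.05, 0)
hard_positive = focal_loss(0.05, 1)
print(easy_negative < hard_positive)  # True
```

With gamma set to 0 and the alpha weighting removed, this expression reduces to ordinary binary cross-entropy.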
MODEL ARCHITECTURE:

Model Input: 2,165-dimensional vector

  • Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)

  • Additional mention features (106 dimensions):

    • Length of mentions
    • Position of the mention's start token within the sentence
    • Grammatical category of the mentions (pronoun, common noun, proper noun)
    • Dependency relation of the mention's head (one-hot encoded)
    • Gender of the mentions (one-hot encoded)
    • Number (singular/plural) of the mentions (one-hot encoded)
    • Grammatical person of the mentions (one-hot encoded)
  • Additional mention-pair features (11 dimensions):

    • Distance between mention IDs
    • Distance between start tokens of mentions
    • Distance between end tokens of mentions
    • Distance between sentences containing mentions
    • Distance between paragraphs containing mentions
    • Difference in nesting levels of mentions
    • Ratio of shared tokens between mentions
    • Exact text match between mentions (binary)
    • Exact match of mention heads (binary)
    • Match of syntactic heads between mentions (binary)
    • Match of entity types between mentions (binary)
  • Hidden Layers:

    • Number of layers: 3
    • Units per layer: 1,900
    • Activation function: ReLU
    • Dropout rate: 0.6
  • Final Layer:

    • Type: Linear
    • Input: 1,900 dimensions
    • Output: 1 dimension (mention pair coreference score)
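
As a sanity check on the figures above, the input width and layer shapes can be derived from the listed dimensions. The sketch below does only that arithmetic; the parameter count assumes every linear layer has a bias term and that nothing else is trainable, which the card does not state.

```python
EMBEDDING_DIM = 1_024   # camembert-large hidden size
MENTION_FEATURES = 106  # additional mention features (total, as listed)
PAIR_FEATURES = 11      # additional mention-pair features
HIDDEN_UNITS = 1_900
HIDDEN_LAYERS = 3

# Two concatenated mention embeddings plus the hand-crafted features.
input_dim = 2 * EMBEDDING_DIM + MENTION_FEATURES + PAIR_FEATURES

# (in_features, out_features) for each linear layer, ending with the scorer.
layer_shapes = [(input_dim, HIDDEN_UNITS)]
layer_shapes += [(HIDDEN_UNITS, HIDDEN_UNITS)] * (HIDDEN_LAYERS - 1)
layer_shapes.append((HIDDEN_UNITS, 1))

# Assumption: weights plus one bias per output unit, no other parameters.
n_params = sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)

print(input_dim)     # 2165
print(layer_shapes)
print(n_params)      # 11341101 under the bias assumption
```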

Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
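
The card does not describe how pairwise scores are turned into coreference chains. One common option for mention-pair models is best-first clustering: each mention is linked to its highest-scoring antecedent if that score passes a threshold, and the links are then merged into chains. The sketch below illustrates this under stated assumptions; the 0.5 threshold and the name `best_first_clusters` are hypothetical, and this is not necessarily the procedure BookNLP-fr uses.

```python
from typing import Dict, List, Tuple


def best_first_clusters(
    n_mentions: int,
    pair_scores: Dict[Tuple[int, int], float],
    threshold: float = 0.5,  # hypothetical cut-off, not stated in the card
) -> List[List[int]]:
    """Group mention indices into chains from (antecedent, mention) scores."""
    parent = list(range(n_mentions))  # union-find forest, one root per chain

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Keep, for each mention, only its best-scoring antecedent above the threshold.
    best: Dict[int, Tuple[int, float]] = {}
    for (antecedent, mention), score in pair_scores.items():
        if score >= threshold and score > best.get(mention, (-1, -1.0))[1]:
            best[mention] = (antecedent, score)

    # Merge each mention's chain with the chain of its chosen antecedent.
    for mention, (antecedent, _) in best.items():
        parent[find(mention)] = find(antecedent)

    chains: Dict[int, List[int]] = {}
    for m in range(n_mentions):
        chains.setdefault(find(m), []).append(m)
    return sorted(chains.values())


scores = {(0, 1): 0.92, (0, 2): 0.10, (1, 2): 0.30, (2, 3): 0.81, (0, 3): 0.40}
print(best_first_clusters(4, scores))  # [[0, 1], [2, 3]]
```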

HOW TO USE:

*** UNDER CONSTRUCTION ***

TRAINING CORPUS:

| Document | Token count | Included in model evaluation |
|---|---|---|
| 1731_Prévost-Antoine-François_Manon-Lescaut_PER-ONLY | 71,219 | False |
| 1832_Sand-George_Indiana_PER-ONLY | 112,221 | False |
| 1836_Gautier-Theophile_La-morte-amoureuse | 14,293 | False |
| 1840_Sand-George_Pauline | 12,398 | False |
| 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 | False |
| 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,034 | False |
| 1844_Balzac-Honore-de_Sarrasine | 15,408 | False |
| 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 | False |
| 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,848 | False |
| 1873_Zola-Emile_Le-ventre-de-Paris | 12,613 | False |
| 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,308 | False |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,439 | True |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,578 | True |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,949 | True |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,078 | False |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,267 | False |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,041 | False |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,905 | False |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,159 | False |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,469 | False |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,878 | False |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,358 | False |
| 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,776 | False |
| 1903_Conan-Laure_Elisabeth_Seton | 13,046 | False |
| 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 | True |
| 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 | False |
| 1917_Adèle-Bourgeois_Némoville | 12,468 | False |
| 1923_Delly_Dans-les-ruines | 95,617 | False |
| 1923_Radiguet-Raymond_Le-diable-au-corps | 14,850 | False |
| 1926_Audoux-Marguerite_De-la-ville-au-moulin | 12,083 | True |
| 1937_Audoux-Marguerite_Douce-Lumiere | 12,346 | False |
| TOTAL | 554,480 | 5 files used for cross-validation |

CONTACT:

mail: antoine [dot] bourgois [at] protonmail [dot] com