---
language: fr
tags:
- coreference-resolution
- anaphora-resolution
- mentions-linking
- literary-texts
- camembert
- nested-entities
- BookNLP-fr
license: apache-2.0
metrics:
- MUC
- B3
- CEAF
- CoNLL-F1
base_model:
- almanach/camembert-large
---

## INTRODUCTION:
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **coreference resolution model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings. It is trained to link mentions of the same entity across a text, with a focus on French literary works.

This specific model has been trained to link entities of the following types: PER.

## MODEL PERFORMANCE (LOOCV):
Overall coreference resolution performance on non-overlapping windows of different lengths:
|    | Window width (tokens)   |   Document count |   Sample count | MUC F1   | B3 F1   | CEAFe F1   | CoNLL F1   |
|----|-------------------------|------------------|----------------|----------|---------|------------|------------|
|  0 | 500                     |                4 |             41 | 90.91%   | 78.47%  | 64.73%     | 78.03%     |
|  2 | 2,000                   |                1 |              5 | 94.56%   | 70.95%  | 48.18%     | 71.23%     |
|  4 | 10,000                  |                1 |              1 | 94.50%   | 57.67%  | 35.50%     | 62.56%     |
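
The window-based evaluation above can be illustrated with a minimal sketch of how a tokenized document might be cut into non-overlapping windows. The function name `split_into_windows` is hypothetical, and dropping the trailing partial window is an assumption, not a documented behaviour of the evaluation script.

```python
def split_into_windows(tokens, width):
    """Cut a token list into consecutive, non-overlapping windows of `width` tokens.

    Assumption: a trailing window shorter than `width` is discarded.
    """
    if width <= 0:
        raise ValueError("width must be a positive integer")
    n_full = len(tokens) // width  # number of complete windows
    return [tokens[i * width:(i + 1) * width] for i in range(n_full)]


if __name__ == "__main__":
    # Toy document of 1,250 placeholder tokens.
    document = [f"tok{i}" for i in range(1250)]
    windows = split_into_windows(document, 500)
    print(len(windows), [len(w) for w in windows])
```

Under that assumption, documents of 2,578, 2,949, 5,439 and 10,982 tokens (the token counts listed in the per-document results of this section) yield 5 + 5 + 10 + 21 = 41 windows of 500 tokens, which is consistent with the sample count reported for the 500-token row.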

Coreference resolution performance on the fully annotated sample of each document:
|    | Token count   | Mention count   | MUC F1   | B3 F1   | CEAFe F1   | CoNLL F1   |
|----|---------------|-----------------|----------|---------|------------|------------|
|  0 | 2,578         | 330             | 89.66%   | 69.56%  | 68.68%     | 75.97%     |
|  1 | 2,949         | 386             | 95.93%   | 71.31%  | 69.86%     | 79.03%     |
|  2 | 5,439         | 558             | 90.20%   | 59.67%  | 58.89%     | 69.58%     |
|  3 | 10,982        | 1,095           | 94.40%   | 58.46%  | 32.62%     | 61.83%     |
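
The CoNLL F1 column is conventionally the unweighted mean of the MUC, B3 and CEAFe F1 scores, and the values in the tables above agree with that definition up to rounding. A minimal sketch (the helper name `conll_f1` is illustrative and not part of the project's code):

```python
def conll_f1(muc_f1, b3_f1, ceafe_f1):
    """Return the CoNLL F1 score: the unweighted mean of MUC, B3 and CEAFe F1."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0


if __name__ == "__main__":
    # Scores (in %) taken from the 2,000-token window row.
    print(round(conll_f1(94.56, 70.95, 48.18), 2))
```

Because the published scores are rounded to two decimals, recomputing the mean from them can differ from the reported CoNLL F1 by about 0.01.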

## TRAINING PARAMETERS:
- Entity types: PER
- Split strategy: Leave-one-out cross-validation (31 files)
- Train/Validation split: 0.85 / 0.15
- Batch size: 16,000
- Initial learning rate: 0.0004
- Focal loss gamma: 1
- Focal loss alpha: 0.25
- Pronoun lookup antecedents: 30
- Common and proper noun lookup antecedents: 300
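
For readers unfamiliar with focal loss, the sketch below shows the standard binary formulation with the gamma and alpha values listed above. It is an illustration only: the function name `binary_focal_loss` is hypothetical, and the convention that `alpha` weights the positive (coreferent) class and `1 - alpha` the negative class is an assumption about the training code.

```python
import math


def binary_focal_loss(p, y, gamma=1.0, alpha=0.25, eps=1e-7):
    """Focal loss for a single mention pair.

    p: predicted probability that the pair is coreferent (0..1)
    y: gold label, 1 if coreferent, 0 otherwise
    Assumption: alpha weights positives and (1 - alpha) weights negatives.
    """
    p = min(max(p, eps), 1.0 - eps)            # avoid log(0)
    p_t = p if y == 1 else 1.0 - p             # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma down-weights pairs that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    print(binary_focal_loss(0.9, 1))  # confident and correct: small loss
    print(binary_focal_loss(0.1, 1))  # confident and wrong: large loss
```

With `gamma = 0` the modulating factor disappears and the expression reduces to a class-weighted binary cross-entropy.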

## MODEL ARCHITECTURE:
Model Input: 2,165-dimensional vector
- Concatenated maximum-context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
- Additional mention features (106 dimensions):
  - Length of mentions
  - Position of the mention's start token within the sentence
  - Grammatical category of the mentions (pronoun, common noun, proper noun)
  - Dependency relation of the mention's head (one-hot encoded)
  - Gender of the mentions (one-hot encoded)
  - Number (singular/plural) of the mentions (one-hot encoded)
  - Grammatical person of the mentions (one-hot encoded)
- Additional mention-pair features (11 dimensions):
  - Distance between mention IDs
  - Distance between start tokens of mentions
  - Distance between end tokens of mentions
  - Distance between sentences containing mentions
  - Distance between paragraphs containing mentions
  - Difference in nesting levels of mentions
  - Ratio of shared tokens between mentions
  - Exact text match between mentions (binary)
  - Exact match of mention heads (binary)
  - Match of syntactic heads between mentions (binary)
  - Match of entity types between mentions (binary)
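
The input layout described above (2,048 + 106 + 11 = 2,165 dimensions) can be sketched as a simple concatenation. The function name `build_pair_input`, the assumption of one 1,024-dimensional embedding per mention of the pair, and the order of the blocks are illustrative choices, not the project's actual implementation.

```python
EMBEDDING_DIM = 1024        # camembert-large hidden size
MENTION_FEATURE_DIM = 106   # additional mention features (both mentions combined)
PAIR_FEATURE_DIM = 11       # additional mention-pair features


def build_pair_input(emb_a, emb_b, mention_features, pair_features):
    """Concatenate the feature blocks of one mention pair into a flat vector.

    Assumption: block order is embedding A, embedding B, mention features,
    pair features.
    """
    expected = [
        ("emb_a", emb_a, EMBEDDING_DIM),
        ("emb_b", emb_b, EMBEDDING_DIM),
        ("mention_features", mention_features, MENTION_FEATURE_DIM),
        ("pair_features", pair_features, PAIR_FEATURE_DIM),
    ]
    for name, block, size in expected:
        if len(block) != size:
            raise ValueError(f"{name} has {len(block)} values, expected {size}")
    return list(emb_a) + list(emb_b) + list(mention_features) + list(pair_features)


if __name__ == "__main__":
    vector = build_pair_input([0.0] * 1024, [0.0] * 1024, [0.0] * 106, [0.0] * 11)
    print(len(vector))  # 2165
```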

- Hidden Layers:
  - Number of layers: 3
  - Units per layer: 1,900 nodes
  - Activation function: relu
  - Dropout rate: 0.6

- Final Layer:
  - Type: Linear
  - Input: 1,900 dimensions
  - Output: 1 dimension (mention pair coreference score)

Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
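
The architecture above can be approximated in PyTorch as follows. This is a sketch under stated assumptions, not the released implementation: the class name `PairScorer` is hypothetical, dropout is assumed to follow each ReLU, and a sigmoid is assumed to map the final linear output to the 0..1 range.

```python
import torch
from torch import nn


class PairScorer(nn.Module):
    """Feed-forward scorer for mention pairs (illustrative sketch)."""

    def __init__(self, input_dim=2165, hidden_dim=1900, n_hidden=3, dropout=0.6):
        super().__init__()
        layers = []
        in_dim = input_dim
        for _ in range(n_hidden):
            # Assumption: Linear -> ReLU -> Dropout for each hidden layer.
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, 1))  # final linear layer, 1 score per pair
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Assumption: a sigmoid squashes the linear output into [0, 1].
        return torch.sigmoid(self.net(x)).squeeze(-1)


if __name__ == "__main__":
    model = PairScorer().eval()            # eval() disables dropout
    batch = torch.randn(4, 2165)           # 4 random mention-pair vectors
    with torch.no_grad():
        scores = model(batch)
    print(scores.shape)                    # torch.Size([4])
```

With bias terms on every linear layer, this layout has about 11.3 million trainable parameters.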

## HOW TO USE:
*** UNDER CONSTRUCTION ***

## TRAINING CORPUS:
|    | Document                                                       | Token count    | Included in model evaluation      |
|----|----------------------------------------------------------------|----------------|-----------------------------------|
|  0 | 1731_Prévost-Antoine-François_Manon-Lescaut_PER-ONLY           | 71,219 tokens  | False                             |
|  1 | 1832_Sand-George_Indiana_PER-ONLY                              | 112,221 tokens | False                             |
|  2 | 1836_Gautier-Theophile_La-morte-amoureuse                      | 14,293 tokens  | False                             |
|  3 | 1840_Sand-George_Pauline                                       | 12,398 tokens  | False                             |
|  4 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote             | 24,776 tokens  | False                             |
|  5 | 1844_Balzac-Honore-de_La-Maison-Nucingen                       | 30,034 tokens  | False                             |
|  6 | 1844_Balzac-Honore-de_Sarrasine                                | 15,408 tokens  | False                             |
|  7 | 1856_Cousin-Victor_Madame-de-Hautefort                         | 11,768 tokens  | False                             |
|  8 | 1863_Gautier-Theophile_Le-capitaine-Fracasse                   | 11,848 tokens  | False                             |
|  9 | 1873_Zola-Emile_Le-ventre-de-Paris                             | 12,613 tokens  | False                             |
| 10 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet                      | 12,308 tokens  | False                             |
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,439 tokens   | **True**                          |
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE   | 2,578 tokens   | **True**                          |
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE        | 2,949 tokens   | **True**                          |
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA           | 4,078 tokens   | False                             |
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE          | 2,267 tokens   | False                             |
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE        | 2,041 tokens   | False                             |
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU               | 1,905 tokens   | False                             |
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL            | 2,159 tokens   | False                             |
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE          | 2,469 tokens   | False                             |
| 20 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL          | 2,878 tokens   | False                             |
| 21 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON      | 2,358 tokens   | False                             |
| 22 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis   | 12,776 tokens  | False                             |
| 23 | 1903_Conan-Laure_Elisabeth_Seton                               | 13,046 tokens  | False                             |
| 24 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube              | 10,982 tokens  | **True**                          |
| 25 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin           | 10,305 tokens  | False                             |
| 26 | 1917_Adèle-Bourgeois_Némoville                                 | 12,468 tokens  | False                             |
| 27 | 1923_Delly_Dans-les-ruines                                     | 95,617 tokens  | False                             |
| 28 | 1923_Radiguet-Raymond_Le-diable-au-corps                       | 14,850 tokens  | False                             |
| 29 | 1926_Audoux-Marguerite_De-la-ville-au-moulin                   | 12,083 tokens  | **True**                          |
| 30 | 1937_Audoux-Marguerite_Douce-Lumiere                           | 12,346 tokens  | False                             |
| 31 | TOTAL                                                          | 554,480 tokens | 5 files used for cross-validation |

## CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com