---
language: fr
tags:
- coreference-resolution
- anaphora-resolution
- mentions-linking
- literary-texts
- camembert
- nested-entities
- BookNLP-fr
license: apache-2.0
metrics:
- MUC
- B3
- CEAF
- CoNLL-F1
base_model:
- almanach/camembert-large
---

## INTRODUCTION:
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **coreference resolution model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings. It is trained to link mentions of the same entity across a text, with a focus on French literary works.

This specific model has been trained to link mentions of a single entity type: PER (persons).

## MODEL PERFORMANCE (LOOCV):
Overall coreference resolution performance on non-overlapping windows of different lengths:
|    | Window width (tokens)   |   Document count |   Sample count | MUC F1   | B3 F1   | CEAFe F1   | CoNLL F1   |
|----|-------------------------|------------------|----------------|----------|---------|------------|------------|
|  0 | 500                     |                4 |             41 | 90.91%   | 78.47%  | 64.73%     | 78.03%     |
|  2 | 2,000                   |                1 |              5 | 94.56%   | 70.95%  | 48.18%     | 71.23%     |
|  4 | 10,000                  |                1 |              1 | 94.50%   | 57.67%  | 35.50%     | 62.56%     |

Coreference resolution performance on the fully annotated sample of each document:
|    | Token count   | Mention count   | MUC F1   | B3 F1   | CEAFe F1   | CoNLL F1   |
|----|---------------|-----------------|----------|---------|------------|------------|
|  0 | 2,578         | 330             | 89.66%   | 69.56%  | 68.68%     | 75.97%     |
|  1 | 2,949         | 386             | 95.93%   | 71.31%  | 69.86%     | 79.03%     |
|  2 | 5,439         | 558             | 90.20%   | 59.67%  | 58.89%     | 69.58%     |
|  3 | 10,982        | 1,095           | 94.40%   | 58.46%  | 32.62%     | 61.83%     |
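
The last column of both tables is consistent with the usual definition of the CoNLL score, the unweighted mean of the MUC, B3 and CEAFe F1 scores. A minimal sketch of that computation (the function name `conll_f1` is illustrative and not taken from the project's code):

```python
def conll_f1(muc_f1: float, b3_f1: float, ceafe_f1: float) -> float:
    """Unweighted mean of the three coreference F1 scores (in percent)."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3


# Scores from the first row of the per-document table (2,578 tokens).
score = conll_f1(89.66, 69.56, 68.68)
print(f"{score:.2f}%")  # prints 75.97%
```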

## TRAINING PARAMETERS:
- Entity types: PER
- Split strategy: Leave-one-out cross-validation (31 files)
- Train/Validation split: 0.85 / 0.15
- Batch size: 16,000
- Initial learning rate: 0.0004
- Focal loss gamma: 1
- Focal loss alpha: 0.25
- Antecedent candidates looked up for pronouns: 30
- Antecedent candidates looked up for common and proper nouns: 300
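
To illustrate what the two focal loss parameters control, here is a minimal sketch of the standard binary focal loss for a single mention pair. It assumes the common alpha-balanced formulation; the card does not state which exact variant was used in training, and the function name is hypothetical.

```python
import math


def binary_focal_loss(p: float, y: int, gamma: float = 1.0, alpha: float = 0.25) -> float:
    """Alpha-balanced focal loss for one mention pair (assumed formulation).

    p: predicted probability that the pair is coreferent (0 < p < 1).
    y: gold label, 1 for coreferent, 0 otherwise.
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the gold class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction contributes far less than a confident error.
easy = binary_focal_loss(0.95, 1)
hard = binary_focal_loss(0.05, 1)
print(easy < hard)  # prints True
```

With `gamma = 0` the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy; larger values of `gamma` shift more of the training signal toward hard pairs.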

## MODEL ARCHITECTURE:
Model Input: 2,165-dimensional vector
- Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
- Additional mention features (106 dimensions):
  - Length of mentions
  - Position of the mention's start token within the sentence
  - Grammatical category of the mentions (pronoun, common noun, proper noun)
  - Dependency relation of the mention's head (one-hot encoded)
  - Gender of the mentions (one-hot encoded)
  - Number (singular/plural) of the mentions (one-hot encoded)
  - Grammatical person of the mentions (one-hot encoded)
- Additional mention-pair features (11 dimensions):
  - Distance between mention IDs
  - Distance between start tokens of mentions
  - Distance between end tokens of mentions
  - Distance between sentences containing mentions
  - Distance between paragraphs containing mentions
  - Difference in nesting levels of mentions
  - Ratio of shared tokens between mentions
  - Exact text match between mentions (binary)
  - Exact match of mention heads (binary)
  - Match of syntactic heads between mentions (binary)
  - Match of entity types between mentions (binary)
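
The input size follows from the three feature groups above: 2 * 1,024 + 106 + 11 = 2,165. The sketch below only checks that arithmetic by concatenating placeholder vectors. It assumes the two 1,024-dimensional embeddings belong to the two mentions of a pair, and `build_pair_input` is a hypothetical helper, not a function from the project.

```python
EMBEDDING_DIM = 1024       # camembert-large hidden size
MENTION_FEATURE_DIM = 106  # additional mention features
PAIR_FEATURE_DIM = 11      # additional mention-pair features


def build_pair_input(emb_a, emb_b, mention_features, pair_features):
    """Concatenate all feature groups into a single flat input vector."""
    assert len(emb_a) == len(emb_b) == EMBEDDING_DIM
    assert len(mention_features) == MENTION_FEATURE_DIM
    assert len(pair_features) == PAIR_FEATURE_DIM
    return list(emb_a) + list(emb_b) + list(mention_features) + list(pair_features)


# Placeholder zero vectors, used only to check the resulting size.
vector = build_pair_input([0.0] * 1024, [0.0] * 1024, [0.0] * 106, [0.0] * 11)
print(len(vector))  # prints 2165
```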

- Hidden Layers:
  - Number of layers: 3
  - Units per layer: 1,900 nodes
  - Activation function: ReLU
  - Dropout rate: 0.6

- Final Layer:
  - Type: Linear
  - Input: 1,900 dimensions
  - Output: 1 dimension (mention pair coreference score)
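
As a rough size check, the layer dimensions listed above can be enumerated and the trainable parameters counted. This sketch assumes plain fully connected layers with bias terms and no normalization layers, which the card does not confirm.

```python
INPUT_DIM = 2165
HIDDEN_DIM = 1900
NUM_HIDDEN_LAYERS = 3

# (input size, output size) of each linear layer, hidden layers first.
layer_shapes = [(INPUT_DIM, HIDDEN_DIM)]
layer_shapes += [(HIDDEN_DIM, HIDDEN_DIM)] * (NUM_HIDDEN_LAYERS - 1)
layer_shapes.append((HIDDEN_DIM, 1))  # final scoring layer

# Weights plus one bias per output unit (the biases are an assumption).
param_count = sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)
print(layer_shapes)
print(f"{param_count:,}")  # prints 11,341,101
```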

Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
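
The card does not describe how pairwise scores are turned into coreference chains. One common approach is best-first antecedent linking: each mention is attached to its highest-scoring preceding candidate if that score exceeds a threshold. The sketch below assumes this strategy, a threshold of 0.5, and candidate windows of the last 30 mentions for pronouns and the last 300 for nouns. `score_pair` is a hypothetical stand-in for the model, and the toy scorer is for illustration only.

```python
def cluster_mentions(mentions, score_pair, threshold=0.5,
                     pronoun_window=30, noun_window=300):
    """Best-first antecedent linking (assumed decoding strategy).

    mentions: list of dicts in document order, each with a "category" key
              ("pronoun", "common noun" or "proper noun").
    score_pair: hypothetical callable (antecedent, mention) -> score in [0, 1].
    Returns one cluster ID per mention.
    """
    cluster_ids = list(range(len(mentions)))  # every mention starts as a singleton
    for i, mention in enumerate(mentions):
        window = pronoun_window if mention["category"] == "pronoun" else noun_window
        best_j, best_score = None, threshold
        for j in range(max(0, i - window), i):
            score = score_pair(mentions[j], mention)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            cluster_ids[i] = cluster_ids[best_j]  # join the antecedent's cluster
    return cluster_ids


def toy_scorer(antecedent, mention):
    """Toy scorer: high score when grammatical gender matches."""
    return 0.9 if antecedent["gender"] == mention["gender"] else 0.1


toy_mentions = [
    {"text": "Manon", "category": "proper noun", "gender": "f"},
    {"text": "le chevalier", "category": "common noun", "gender": "m"},
    {"text": "elle", "category": "pronoun", "gender": "f"},
    {"text": "il", "category": "pronoun", "gender": "m"},
]
print(cluster_mentions(toy_mentions, toy_scorer))  # prints [0, 1, 0, 1]
```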

## HOW TO USE:
*** UNDER CONSTRUCTION ***

## TRAINING CORPUS:
|    | Document                                                       | Token count    | Included in model evaluation      |
|----|----------------------------------------------------------------|----------------|-----------------------------------|
|  0 | 1731_Prévost-Antoine-François_Manon-Lescaut_PER-ONLY           | 71,219 tokens  | False                             |
|  1 | 1832_Sand-George_Indiana_PER-ONLY                              | 112,221 tokens | False                             |
|  2 | 1836_Gautier-Theophile_La-morte-amoureuse                      | 14,293 tokens  | False                             |
|  3 | 1840_Sand-George_Pauline                                       | 12,398 tokens  | False                             |
|  4 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote             | 24,776 tokens  | False                             |
|  5 | 1844_Balzac-Honore-de_La-Maison-Nucingen                       | 30,034 tokens  | False                             |
|  6 | 1844_Balzac-Honore-de_Sarrasine                                | 15,408 tokens  | False                             |
|  7 | 1856_Cousin-Victor_Madame-de-Hautefort                         | 11,768 tokens  | False                             |
|  8 | 1863_Gautier-Theophile_Le-capitaine-Fracasse                   | 11,848 tokens  | False                             |
|  9 | 1873_Zola-Emile_Le-ventre-de-Paris                             | 12,613 tokens  | False                             |
| 10 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet                      | 12,308 tokens  | False                             |
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,439 tokens   | **True**                          |
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE   | 2,578 tokens   | **True**                          |
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE        | 2,949 tokens   | **True**                          |
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA           | 4,078 tokens   | False                             |
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE          | 2,267 tokens   | False                             |
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE        | 2,041 tokens   | False                             |
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU               | 1,905 tokens   | False                             |
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL            | 2,159 tokens   | False                             |
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE          | 2,469 tokens   | False                             |
| 20 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL          | 2,878 tokens   | False                             |
| 21 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON      | 2,358 tokens   | False                             |
| 22 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis   | 12,776 tokens  | False                             |
| 23 | 1903_Conan-Laure_Elisabeth_Seton                               | 13,046 tokens  | False                             |
| 24 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube              | 10,982 tokens  | **True**                          |
| 25 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin           | 10,305 tokens  | False                             |
| 26 | 1917_Adèle-Bourgeois_Némoville                                 | 12,468 tokens  | False                             |
| 27 | 1923_Delly_Dans-les-ruines                                     | 95,617 tokens  | False                             |
| 28 | 1923_Radiguet-Raymond_Le-diable-au-corps                       | 14,850 tokens  | False                             |
| 29 | 1926_Audoux-Marguerite_De-la-ville-au-moulin                   | 12,083 tokens  | **True**                          |
| 30 | 1937_Audoux-Marguerite_Douce-Lumiere                           | 12,346 tokens  | False                             |
| 31 | TOTAL                                                          | 554,480 tokens | 5 files used for cross-validation |

## CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com