---
language: fr
tags:
- coreference-resolution
- anaphora-resolution
- mentions-linking
- literary-texts
- camembert
- nested-entities
- BookNLP-fr
license: apache-2.0
metrics:
- MUC
- B3
- CEAF
- CoNLL-F1
base_model:
- almanach/camembert-large
---

## INTRODUCTION:
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **coreference resolution model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings. It is trained to link mentions of the same entity across a text, with a focus on French literary works.

This specific model is trained to link mentions of the following entity type: PER (persons).

## MODEL PERFORMANCE (LOOCV):
Overall coreference resolution performance, under leave-one-out cross-validation (LOOCV), for non-overlapping windows of different lengths:
|    | Window width (tokens)   |   Document count |   Sample count | MUC F1   | B3 F1   | CEAFe F1   | CoNLL F1   |
|----|-------------------------|------------------|----------------|----------|---------|------------|------------|
|  0 | 500                     |               29 |            677 | 92.18%   | 83.86%  | 76.86%     | 84.30%     |
|  1 | 1,000                   |               29 |            332 | 92.65%   | 79.79%  | 71.77%     | 81.40%     |
|  2 | 2,000                   |               28 |            162 | 93.29%   | 75.85%  | 67.34%     | 78.83%     |
|  3 | 5,000                   |               19 |             56 | 93.76%   | 69.60%  | 61.16%     | 74.84%     |
|  4 | 10,000                  |               18 |             27 | 94.28%   | 65.73%  | 58.59%     | 72.86%     |
|  5 | 25,000                  |                2 |              3 | 94.76%   | 62.48%  | 53.33%     | 70.19%     |
|  6 | 50,000                  |                1 |              1 | 97.39%   | 56.43%  | 47.40%     | 67.07%     |
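
The sample counts in this table are consistent with cutting each document into consecutive full-width windows and dropping the trailing remainder: for instance, summing floor(tokens / 500) over the 29 documents listed in the training corpus gives 677. The sketch below illustrates that reading. `split_into_windows` is a hypothetical helper, not part of the BookNLP-fr code, and the actual evaluation script may differ.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_windows(tokens: Sequence[T], width: int) -> List[Sequence[T]]:
    """Cut a token sequence into consecutive, non-overlapping windows.

    Assumption: only full windows are kept, so a trailing remainder
    shorter than `width` is dropped.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    n_full = len(tokens) // width
    return [tokens[i * width:(i + 1) * width] for i in range(n_full)]


# A 1,864-token document yields three 500-token windows
# and no 2,000-token window.
doc = list(range(1864))
print(len(split_into_windows(doc, 500)))   # 3
print(len(split_into_windows(doc, 2000)))  # 0
```

Under this assumption, a document shorter than the window width contributes no sample, which would explain why the document count drops as the window widens.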

Coreference resolution performance on the fully annotated sample of each document:
|    | Token count   | Mention count   | MUC F1   | B3 F1   | CEAFe F1   | CoNLL F1   |
|----|---------------|-----------------|----------|---------|------------|------------|
|  0 | 1,864         | 253             | 98.16%   | 95.39%  | 60.34%     | 84.63%     |
|  1 | 2,034         | 321             | 97.47%   | 92.79%  | 80.04%     | 90.10%     |
|  2 | 2,141         | 297             | 95.06%   | 77.99%  | 65.08%     | 79.38%     |
|  3 | 2,251         | 235             | 91.95%   | 80.47%  | 46.56%     | 73.00%     |
|  4 | 2,343         | 239             | 83.87%   | 61.95%  | 43.58%     | 63.13%     |
|  5 | 2,441         | 314             | 91.85%   | 55.70%  | 60.82%     | 69.46%     |
|  6 | 2,554         | 330             | 90.24%   | 65.27%  | 72.36%     | 75.96%     |
|  7 | 2,860         | 369             | 93.65%   | 84.89%  | 74.93%     | 84.49%     |
|  8 | 2,929         | 386             | 95.65%   | 78.21%  | 64.23%     | 79.37%     |
|  9 | 4,067         | 429             | 97.46%   | 85.20%  | 62.52%     | 81.73%     |
| 10 | 5,425         | 558             | 90.46%   | 53.03%  | 59.52%     | 67.67%     |
| 11 | 10,305        | 1,436           | 96.37%   | 74.83%  | 59.91%     | 77.04%     |
| 12 | 10,982        | 1,095           | 97.18%   | 65.30%  | 60.49%     | 74.32%     |
| 13 | 11,768        | 1,734           | 93.30%   | 64.14%  | 64.12%     | 73.85%     |
| 14 | 11,834        | 600             | 92.21%   | 67.51%  | 60.74%     | 73.49%     |
| 15 | 11,902        | 1,692           | 95.03%   | 58.83%  | 45.59%     | 66.49%     |
| 16 | 12,281        | 1,089           | 95.06%   | 62.05%  | 72.55%     | 76.55%     |
| 17 | 12,285        | 1,489           | 95.28%   | 77.84%  | 57.43%     | 76.85%     |
| 18 | 12,315        | 1,501           | 95.36%   | 57.07%  | 64.26%     | 72.23%     |
| 19 | 12,389        | 1,654           | 93.19%   | 54.21%  | 51.84%     | 66.41%     |
| 20 | 12,557        | 1,085           | 92.30%   | 66.97%  | 46.65%     | 68.64%     |
| 21 | 12,703        | 1,731           | 90.40%   | 53.70%  | 61.37%     | 68.49%     |
| 22 | 13,023        | 1,559           | 93.86%   | 61.71%  | 62.41%     | 72.66%     |
| 23 | 14,299        | 1,582           | 97.23%   | 69.25%  | 67.04%     | 77.84%     |
| 24 | 14,637        | 2,127           | 95.78%   | 71.34%  | 63.28%     | 76.80%     |
| 25 | 15,408        | 1,769           | 92.85%   | 54.11%  | 56.12%     | 67.69%     |
| 26 | 24,776        | 2,716           | 94.31%   | 63.51%  | 54.12%     | 70.65%     |
| 27 | 30,987        | 2,980           | 89.55%   | 54.25%  | 59.68%     | 67.83%     |
| 28 | 71,219        | 11,857          | 97.38%   | 50.85%  | 45.93%     | 64.72%     |
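
The CoNLL F1 column is the unweighted mean of the MUC, B3 and CEAFe F1 scores. A minimal check, using the values from row 0 of the table above (`conll_f1` is an illustrative helper name):

```python
def conll_f1(muc_f1: float, b3_f1: float, ceafe_f1: float) -> float:
    """Return the CoNLL score: the unweighted mean of the three F1 values."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0


# Row 0 of the per-document table: MUC 98.16, B3 95.39, CEAFe 60.34.
score = conll_f1(98.16, 95.39, 60.34)
print(f"{score:.2f}%")  # 84.63%
```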

## TRAINING PARAMETERS:
- Entity types: PER
- Split strategy: Leave-one-out cross-validation (29 files)
- Train/Validation split: 0.85 / 0.15
- Batch size: 16,000
- Initial learning rate: 0.0004
- Focal loss gamma: 1
- Focal loss alpha: 0.25
- Pronoun lookup antecedents: 30
- Common and Proper nouns lookup antecedents: 300
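
The focal loss parameters above down-weight easy pairs so that training concentrates on the hard ones. The sketch below follows the standard binary focal loss of Lin et al. (2017), FL = -alpha_t * (1 - p_t)^gamma * log(p_t), with the gamma and alpha values listed above. The exact formulation used in training (for example, which class alpha is applied to) is not specified in this card, so treat it as an assumption; `binary_focal_loss` is an illustrative name.

```python
import math


def binary_focal_loss(p: float, y: int, gamma: float = 1.0, alpha: float = 0.25) -> float:
    """Focal loss for one mention pair.

    p: predicted probability that the pair is coreferent (0 < p < 1).
    y: gold label, 1 for coreferent, 0 otherwise.
    Assumption: alpha weights the positive class, (1 - alpha) the negative one.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)          # keep log() finite
    p_t = p if y == 1 else 1.0 - p           # probability given to the gold class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction costs far less than a confident mistake.
easy = binary_focal_loss(0.95, 1)
hard = binary_focal_loss(0.05, 1)
print(easy < hard)  # True
```

With gamma set to 0 the modulating factor disappears and the function reduces to an alpha-weighted cross-entropy.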

## MODEL ARCHITECTURE:
Model Input: 2,165-dimensional vector
- Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
- Additional mention features (106 dimensions):
  - Length of mentions
  - Position of the mention's start token within the sentence
  - Grammatical category of the mentions (pronoun, common noun, proper noun)
  - Dependency relation of the mention's head (one-hot encoded)
  - Gender of the mentions (one-hot encoded)
  - Number (singular/plural) of the mentions (one-hot encoded)
  - Grammatical person of the mentions (one-hot encoded)
- Additional mention-pair features (11 dimensions):
  - Distance between mention IDs
  - Distance between start tokens of mentions
  - Distance between end tokens of mentions
  - Distance between sentences containing mentions
  - Distance between paragraphs containing mentions
  - Difference in nesting levels of mentions
  - Ratio of shared tokens between mentions
  - Exact text match between mentions (binary)
  - Exact match of mention heads (binary)
  - Match of syntactic heads between mentions (binary)
  - Match of entity types between mentions (binary)

- Hidden Layers:
  - Number of layers: 3
  - Units per layer: 1,900
  - Activation function: ReLU
  - Dropout rate: 0.6

- Final Layer:
  - Type: Linear
  - Input: 1,900 dimensions
  - Output: 1 dimension (mention pair coreference score)
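
The dimensions listed above can be cross-checked with a few lines of bookkeeping. The sketch below only tallies layer shapes and trainable parameters. It assumes fully connected layers with bias terms, and that the factor 2 in the embedding size corresponds to the two mentions of a pair; neither point is stated explicitly in this card.

```python
EMBEDDING_DIM = 1024        # camembert-large hidden size
MENTION_FEATURE_DIM = 106   # additional mention features
PAIR_FEATURE_DIM = 11       # additional mention-pair features
HIDDEN_UNITS = 1900
HIDDEN_LAYERS = 3

# Assumption: one embedding per mention in the pair, plus the extra features.
input_dim = 2 * EMBEDDING_DIM + MENTION_FEATURE_DIM + PAIR_FEATURE_DIM

# (in_features, out_features) for every linear layer, ending with the scorer.
layer_shapes = [(input_dim, HIDDEN_UNITS)]
layer_shapes += [(HIDDEN_UNITS, HIDDEN_UNITS)] * (HIDDEN_LAYERS - 1)
layer_shapes.append((HIDDEN_UNITS, 1))

# Assumption: each layer has a weight matrix and a bias vector.
n_params = sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)

print(input_dim)  # 2165
print(n_params)   # 11341101
```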

Model Output: A continuous score between 0 (not coreferent) and 1 (coreferent), indicating how confident the model is that the two mentions corefer.
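
Pairwise scores still have to be turned into coreference chains. The sketch below shows one common way to do this, best-first antecedent linking with a 0.5 threshold, combined with the lookup limits from the training parameters (30 for pronouns, 300 for common and proper nouns). The linking strategy, the threshold, the reading of the lookup values as counts of preceding mentions, and the names `link_mentions`, `score_pair` and `LOOKUP` are all assumptions made for illustration, not the project's documented procedure.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: maximum number of preceding mentions considered as antecedents.
LOOKUP = {"pronoun": 30, "common": 300, "proper": 300}


def link_mentions(
    categories: List[str],
    score_pair: Callable[[int, int], float],
    threshold: float = 0.5,
) -> List[int]:
    """Assign a cluster id to each mention, in document order.

    categories[i] is the grammatical category of mention i.
    score_pair(antecedent, mention) returns a score in [0, 1].
    """
    cluster_of: List[int] = []
    next_id = 0
    for i, category in enumerate(categories):
        first = max(0, i - LOOKUP[category])
        best_j, best_score = -1, threshold
        for j in range(first, i):               # candidate antecedents only
            s = score_pair(j, i)
            if s > best_score:
                best_j, best_score = j, s
        if best_j >= 0:
            cluster_of.append(cluster_of[best_j])  # join the antecedent's chain
        else:
            cluster_of.append(next_id)             # start a new chain
            next_id += 1
    return cluster_of


# Toy example: mentions 0 and 2 corefer, mention 1 stands alone.
toy_scores: Dict[Tuple[int, int], float] = {(0, 2): 0.9, (1, 2): 0.2, (0, 1): 0.1}
clusters = link_mentions(
    ["proper", "common", "pronoun"],
    lambda j, i: toy_scores.get((j, i), 0.0),
)
print(clusters)  # [0, 1, 0]
```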

## HOW TO USE:
*** UNDER CONSTRUCTION ***

## TRAINING CORPUS:
|    | Document                                                       | Token count    | Included in model evaluation       |
|----|----------------------------------------------------------------|----------------|------------------------------------|
|  0 | 1836_Gautier-Theophile_La-morte-amoureuse                      | 14,299 tokens  | **True**                           |
|  1 | 1840_Sand-George_Pauline                                       | 12,315 tokens  | **True**                           |
|  2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote             | 24,776 tokens  | **True**                           |
|  3 | 1844_Balzac-Honore-de_La-Maison-Nucingen                       | 30,987 tokens  | **True**                           |
|  4 | 1844_Balzac-Honore-de_Sarrasine                                | 15,408 tokens  | **True**                           |
|  5 | 1856_Cousin-Victor_Madame-de-Hautefort                         | 11,768 tokens  | **True**                           |
|  6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse                   | 11,834 tokens  | **True**                           |
|  7 | 1873_Zola-Emile_Le-ventre-de-Paris                             | 12,557 tokens  | **True**                           |
|  8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet                      | 12,281 tokens  | **True**                           |
|  9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens   | **True**                           |
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE   | 2,554 tokens   | **True**                           |
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE        | 2,929 tokens   | **True**                           |
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA           | 4,067 tokens   | **True**                           |
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE          | 2,251 tokens   | **True**                           |
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE        | 2,034 tokens   | **True**                           |
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU               | 1,864 tokens   | **True**                           |
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL            | 2,141 tokens   | **True**                           |
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE          | 2,441 tokens   | **True**                           |
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL          | 2,860 tokens   | **True**                           |
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON      | 2,343 tokens   | **True**                           |
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis   | 12,703 tokens  | **True**                           |
| 21 | 1903_Conan-Laure_Elisabeth_Seton                               | 13,023 tokens  | **True**                           |
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube              | 10,982 tokens  | **True**                           |
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin           | 10,305 tokens  | **True**                           |
| 24 | 1917_Adèle-Bourgeois_Némoville                                 | 12,389 tokens  | **True**                           |
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps                       | 14,637 tokens  | **True**                           |
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin                   | 11,902 tokens  | **True**                           |
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere                           | 12,285 tokens  | **True**                           |
| 28 | Manon_Lescaut_PEDRO                                            | 71,219 tokens  | **True**                           |
| 29 | TOTAL                                                          | 346,579 tokens | 29 files used for cross-validation |
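
With leave-one-out cross-validation, each of the 29 documents is held out once for evaluation while a model is trained on the other 28. The sketch below enumerates such folds. How the 0.85 / 0.15 train/validation split is applied within each fold is not detailed in this card, so it is left out; `leave_one_out_folds` is a hypothetical helper and the document names are placeholders.

```python
from typing import Iterator, List, Tuple


def leave_one_out_folds(documents: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training documents, held-out document) for every document."""
    for i, held_out in enumerate(documents):
        yield documents[:i] + documents[i + 1:], held_out


# Placeholder names standing in for the 29 corpus files.
corpus = [f"doc_{k:02d}" for k in range(29)]
folds = list(leave_one_out_folds(corpus))
print(len(folds))        # 29
print(len(folds[0][0]))  # 28
```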

## CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com