INTRODUCTION:

This model, developed as part of the BookNLP-fr project, is a named entity recognition (NER) model built on top of camembert-large embeddings. It is trained to predict nested entities in French, specifically in literary texts.

The predicted entities are:

  • mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
  • facilities (FAC): château, sentier, chambre, couloir, ...
  • time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
  • geo-political entities (GPE): Montrouge, France, le petit hameau, ...
  • locations (LOC): le sud, Mars, l'océan, le bois, ...
  • vehicles (VEH): avion, voitures, calèche, vélos, ...

MODEL PERFORMANCE (LOOCV):

NER_tag     precision   recall    f1_score   support   support %
PER         92.46%      93.71%    93.08%     32,204    84.13%
FAC         70.63%      70.94%    70.78%     2,295     6.00%
TIME        58.66%      57.75%    58.20%     1,671     4.37%
GPE         77.64%      77.37%    77.50%     866       2.26%
LOC         62.96%      45.71%    52.97%     781       2.04%
VEH         63.43%      47.95%    54.61%     463       1.21%
micro_avg   88.39%      88.87%    88.58%     38,280    100.00%
macro_avg   70.96%      65.57%    67.86%     38,280    100.00%
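The macro_avg row and the support % column can be re-derived from the per-class rows. The sketch below is only an illustration of that arithmetic: the `scores` dictionary and the `macro_average` helper are names introduced here and are not part of the BookNLP-fr code. The micro_avg row is not recomputed, since it depends on raw counts that the table does not list.

```python
# Per-class scores copied from the performance table:
# (precision %, recall %, f1 %, support).
scores = {
    "PER": (92.46, 93.71, 93.08, 32204),
    "FAC": (70.63, 70.94, 70.78, 2295),
    "TIME": (58.66, 57.75, 58.20, 1671),
    "GPE": (77.64, 77.37, 77.50, 866),
    "LOC": (62.96, 45.71, 52.97, 781),
    "VEH": (63.43, 47.95, 54.61, 463),
}


def macro_average(rows, column):
    """Unweighted mean of one score column over all classes."""
    values = [row[column] for row in rows.values()]
    return sum(values) / len(values)


macro_precision = macro_average(scores, 0)
macro_recall = macro_average(scores, 1)
macro_f1 = macro_average(scores, 2)

# Share of each class in the total number of gold mentions.
total_support = sum(row[3] for row in scores.values())
support_share = {tag: 100 * row[3] / total_support for tag, row in scores.items()}

print(f"macro precision: {macro_precision:.2f}%")
print(f"macro recall:    {macro_recall:.2f}%")
print(f"macro f1:        {macro_f1:.2f}%")
print(f"PER share:       {support_share['PER']:.2f}%")
```

Because the macro average gives every class the same weight, the weaker LOC and VEH scores pull it well below the micro average, which is dominated by PER.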

TRAINING PARAMETERS:

  • Entity types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
  • Tagging scheme: BIOES
  • Nested entity levels: [0, 1]
  • Split strategy: Leave-one-out cross-validation (28 files)
  • Train/Validation split: 0.85 / 0.15
  • Batch size: 16
  • Initial learning rate: 0.00014
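The split strategy can be pictured as follows: each of the 28 files is held out once for evaluation, and the remaining data is divided 0.85 / 0.15 into training and validation sets. The sketch below is a minimal illustration under assumptions. It splits at the file level, while the actual project may split at the sample level, and the function name `loocv_folds`, the seed, and the placeholder file names are hypothetical.

```python
import random


def loocv_folds(files, val_ratio=0.15, seed=0):
    """Yield (train, validation, test) lists, one fold per held-out file.

    Assumption: the train/validation split is done on whole files.
    """
    rng = random.Random(seed)
    for held_out in files:
        remaining = [f for f in files if f != held_out]
        rng.shuffle(remaining)
        n_val = round(len(remaining) * val_ratio)
        yield remaining[n_val:], remaining[:n_val], [held_out]


# Placeholder names standing in for the 28 annotated documents.
files = [f"doc_{i:02d}" for i in range(28)]
folds = list(loocv_folds(files))

for train, val, test in folds[:2]:
    print(len(train), len(val), test)
```

With 28 files this produces 28 folds, so every document is scored by a model that never saw it during training.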

MODEL ARCHITECTURE:

Model Input: Maximum context camembert-large embeddings (1024 dimensions)

  • Locked Dropout: 0.5

  • Projection layer:

    • layer type: highway layer
    • input: 1024 dimensions
    • output: 2048 dimensions
  • BiLSTM layer:

    • input: 2048 dimensions
    • output: 256 dimensions (hidden state)
  • Linear layer:

    • input: 256 dimensions
    • output: 25 dimensions (predicted labels with BIOES tagging scheme)
  • CRF layer

Model Output: BIOES labels sequence
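The 25 output dimensions of the linear layer follow from the tagging scheme: four positional prefixes (B, I, E, S) for each of the six entity types, plus the O label. The sketch below builds that label set and decodes one BIOES sequence into entity spans. The `B-PER` label format, the label order, and the helper names are assumptions made for illustration. The decoder handles a single nesting level and is not the project's own implementation.

```python
ENTITY_TYPES = ["PER", "LOC", "FAC", "TIME", "VEH", "GPE"]


def build_labels(entity_types):
    """Return the BIOES label set: O plus B/I/E/S for each entity type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.extend(f"{prefix}-{entity_type}" for prefix in "BIES")
    return labels


def decode_bioes(tags):
    """Convert a BIOES tag sequence into (start, end, type) spans, end inclusive.

    Malformed sequences (for example an E tag without a matching B) are dropped.
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            start, current = None, None
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "S":
            spans.append((i, i, entity_type))
            start, current = None, None
        elif prefix == "B":
            start, current = i, entity_type
        elif prefix == "I":
            if current != entity_type:
                start, current = None, None
        elif prefix == "E":
            if current == entity_type and start is not None:
                spans.append((start, i, entity_type))
            start, current = None, None
    return spans


LABELS = build_labels(ENTITY_TYPES)

# Hypothetical tagging of: "le capitaine monte dans la calèche"
tokens = ["le", "capitaine", "monte", "dans", "la", "calèche"]
tags = ["B-PER", "E-PER", "O", "O", "O", "S-VEH"]
spans = decode_bioes(tags)

print(len(LABELS))
for start, end, entity_type in spans:
    print(entity_type, " ".join(tokens[start:end + 1]))
```

Since the model predicts nested entities on levels 0 and 1, a full decoder would presumably run this step once per level and merge the resulting spans.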

HOW TO USE:

*** UNDER CONSTRUCTION ***

TRAINING CORPUS:

Index Document Token count Included in model evaluation
0 1830_Balzac-Honoré-de_La-maison-du-chat-qui-pelote 24,776 tokens True
1 1830_Balzac-Honoré-de_Sarrasine 15,408 tokens True
2 1836_Gautier-Théophile_La-morte-amoureuse 14,293 tokens True
3 1837_Balzac-Honoré-de_La-maison-Nucingen 30,034 tokens True
4 1841_Sand-George_Pauline 12,398 tokens True
5 1856_Cousin-Victor_Madame-de-Hautefort 11,768 tokens True
6 1863_Gautier-Théophile_Le-capitaine-Fracasse 11,848 tokens True
7 1873_Zola-Émile_Le-ventre-de-Paris 12,613 tokens True
8 1881_Flaubert-Gustave_Bouvard-et-Pécuchet 12,308 tokens True
9 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-buche 2,267 tokens True
10 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-relique 2,041 tokens True
11 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-rouille 2,949 tokens True
12 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Madame-Baptiste 2,578 tokens True
13 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Marocca 4,078 tokens True
14 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-A-cheval 2,878 tokens True
15 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Fou 1,905 tokens True
16 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Mademoiselle-Fifi 5,439 tokens True
17 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Reveil 2,159 tokens True
18 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Un-reveillon 2,364 tokens True
19 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Une-ruse 2,469 tokens True
20 1901_Achard-Lucie_Rosalie-de-Constant-sa-famille-et-ses-amis 12,775 tokens True
21 1903_Conan-Laure_Élisabeth-Seton 13,046 tokens True
22 1904-1912_Rolland-Romain_Jean-Christophe(1) 10,982 tokens True
23 1904-1912_Rolland-Romain_Jean-Christophe(2) 10,305 tokens True
24 1917_Bourgeois-Adèle_Némoville 12,468 tokens True
25 1923_Radiguet-Raymond_Le-diable-au-corps 14,850 tokens True
26 1926_Audoux-Marguerite_De-la-ville-au-moulin 12,144 tokens True
27 1937_Audoux-Marguerite_Douce-Lumière 12,346 tokens True
TOTAL 275,489 tokens 28 files used for cross-validation

PREDICTIONS CONFUSION MATRIX:

Gold \ Predicted   PER      FAC     TIME   GPE   LOC   VEH   O       support
PER                30,177   28      14     7     7     31    1,940   32,204
FAC                42       1,628   1      22    17    1     584     2,295
TIME               8        1       965    1     1     0     695     1,671
GPE                13       31      2      670   31    0     119     866
LOC                8        64      1      56    357   0     295     781
VEH                54       8       0      0     0     222   179     463
O                  2,285    524     661    100   150   96    0       3,816
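Each row of the matrix is a gold label and each column a predicted label, so per-class recall is the diagonal cell divided by the row total. The sketch below recomputes recall from the counts above. The names `LABELS`, `MATRIX`, and `recall` are introduced here for illustration. Precision is left out because the column sums need not correspond to how predictions were scored in the performance table.

```python
# Column order of the confusion matrix (predicted labels).
LABELS = ["PER", "FAC", "TIME", "GPE", "LOC", "VEH", "O"]

# Rows are gold labels; values are counts per predicted label.
MATRIX = {
    "PER": [30177, 28, 14, 7, 7, 31, 1940],
    "FAC": [42, 1628, 1, 22, 17, 1, 584],
    "TIME": [8, 1, 965, 1, 1, 0, 695],
    "GPE": [13, 31, 2, 670, 31, 0, 119],
    "LOC": [8, 64, 1, 56, 357, 0, 295],
    "VEH": [54, 8, 0, 0, 0, 222, 179],
}


def recall(tag):
    """Recall in percent: correct predictions divided by gold mentions."""
    row = MATRIX[tag]
    return 100 * row[LABELS.index(tag)] / sum(row)


# Micro-averaged recall pools the counts of all entity classes.
correct = sum(MATRIX[tag][LABELS.index(tag)] for tag in MATRIX)
gold_total = sum(sum(row) for row in MATRIX.values())
micro_recall = 100 * correct / gold_total

for tag in MATRIX:
    print(f"{tag}: {recall(tag):.2f}%")
print(f"micro: {micro_recall:.2f}%")
```

The O column shows where most recall is lost: for LOC and VEH, more than a third of gold mentions receive no entity label at all.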

CONTACT:

mail: antoine [dot] bourgois [at] protonmail [dot] com
