INTRODUCTION:

This model, developed as part of the BookNLP-fr project, is a named entity recognition (NER) model built on top of camembert-large embeddings. It is trained to predict nested entities in French, specifically in literary texts.

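The entity types targeted by the project are listed below; this particular model is trained to predict PER only (see TRAINING PARAMETERS):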

  • mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
  • facilities (FAC): château, sentier, chambre, couloir, ...
  • time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
  • geo-political entities (GPE): Montrouge, France, le petit hameau, ...
  • locations (LOC): le sud, Mars, l'océan, le bois, ...
  • vehicles (VEH): avion, voitures, calèche, vélos, ...

MODEL PERFORMANCE (LOOCV):

NER_tag      precision   recall    f1_score   support   support %
PER          94.58%      95.16%    94.87%     71,738    100.00%
micro_avg    94.58%      95.16%    94.87%     71,738    100.00%
macro_avg    94.58%      95.16%    94.87%     71,738    100.00%

TRAINING PARAMETERS:

  • Entity types: ['PER']
  • Tagging scheme: BIOES
  • Nested entity levels: [0, 1]
  • Split strategy: Leave-one-out cross-validation (31 files)
  • Train/Validation split: 0.85 / 0.15
  • Batch size: 16
  • Initial learning rate: 0.00014
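
To illustrate the BIOES tagging scheme named above, here is a minimal sketch that encodes token spans as tags (B = begin, I = inside, E = end, S = single-token mention, O = outside). The helper name spans_to_bioes and the example sentence are hypothetical and are not taken from the BookNLP-fr code. The sketch handles a single flat layer only; how the two nested levels are encoded is not covered here.

```python
def spans_to_bioes(n_tokens, spans, label="PER"):
    """Encode non-overlapping (start, end) token spans, end inclusive, as BIOES tags."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        if start == end:
            # A one-token mention gets the S (single) tag.
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"
            tags[end] = f"E-{label}"
    return tags


# Hypothetical sentence with two character mentions: "Il" and "Honoré de Pardaillan".
tokens = ["Il", "salue", "Honoré", "de", "Pardaillan", "."]
tags = spans_to_bioes(len(tokens), [(0, 0), (2, 4)])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

With a single entity type, this scheme yields five labels (O, B-PER, I-PER, E-PER, S-PER), which is consistent with the 5-dimensional output of the linear layer described under MODEL ARCHITECTURE.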

MODEL ARCHITECTURE:

Model Input: Maximum context camembert-large embeddings (1024 dimensions)

  • Locked Dropout: 0.5
  • Projection layer:
    • layer type: highway layer
    • input: 1024 dimensions
    • output: 2048 dimensions
  • BiLSTM layer:
    • input: 2048 dimensions
    • output: 256 dimensions (hidden state)
  • Linear layer:
    • input: 256 dimensions
    • output: 5 dimensions (predicted labels with BIOES tagging scheme)
  • CRF layer
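
This card does not say how the CRF layer decodes or whether it constrains transitions. As an illustration only, the sketch below assumes Viterbi decoding over the five labels, with hypothetical emission scores standing in for the linear layer output and a hypothetical helper, bioes_transitions, that penalizes transitions that are invalid under BIOES. The label order is also an assumption.

```python
LABELS = ["O", "B-PER", "I-PER", "E-PER", "S-PER"]  # assumed label order


def bioes_transitions(forbidden=-1e4):
    """Score 0 for valid BIOES transitions and a large penalty for invalid ones."""
    inside = {"B-PER", "I-PER"}      # a mention is still open after these labels
    continuing = {"I-PER", "E-PER"}  # these labels may only follow an open mention
    return [[0.0 if (a in inside) == (b in continuing) else forbidden
             for b in LABELS] for a in LABELS]


def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions[t][j]: score of label j at token t.
    transitions[i][j]: score of moving from label i to label j.
    """
    n_labels = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: scores[i] + transitions[i][j])
            step_scores.append(scores[best_i] + transitions[best_i][j] + emission[j])
            step_back.append(best_i)
        scores = step_scores
        backpointers.append(step_back)
    # Follow the backpointers from the best final label to recover the path.
    path = [max(range(n_labels), key=lambda j: scores[j])]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return [LABELS[j] for j in path]


# Hypothetical scores for three tokens; columns follow LABELS.
emissions = [
    [0.1, 2.0, 0.0, 0.0, 1.5],
    [1.0, 0.0, 0.0, 0.9, 0.0],
    [2.0, 0.0, 0.0, 0.0, 0.0],
]
greedy = [LABELS[max(range(len(LABELS)), key=row.__getitem__)] for row in emissions]
decoded = viterbi(emissions, bioes_transitions())
print("greedy :", greedy)
print("viterbi:", decoded)
```

Picking the best label per token independently gives B-PER followed by O, which leaves a mention unclosed. Decoding the sequence jointly avoids this. The sketch does not constrain the first or last label, and a trained CRF would learn its transition scores rather than use fixed ones.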

Model Output: BIOES labels sequence
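
A BIOES label sequence is usually converted back to mention spans before use. The sketch below shows one way to do that; the function name bioes_to_spans and its handling of malformed sequences (incomplete mentions are silently dropped) are assumptions, not the project's implementation.

```python
def bioes_to_spans(tags):
    """Return (start, end, label) spans, end inclusive, from a BIOES tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.append((i, i, label))
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            spans.append((start, i, label))
            start = None
        elif prefix == "O":
            # An O tag closes any mention that was opened but never ended.
            start = None
    return spans


spans = bioes_to_spans(["S-PER", "O", "B-PER", "I-PER", "E-PER", "O"])
print(spans)
```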

HOW TO USE:

*** UNDER CONSTRUCTION ***

TRAINING CORPUS:

Index Document Token count Included in model evaluation
0 1731_Prévost-Antoine-François_Manon-Lescaut_PER-ONLY 71,219 tokens True
1 1830_Balzac-Honoré-de_La-maison-du-chat-qui-pelote 24,776 tokens True
2 1830_Balzac-Honoré-de_Sarrasine 15,408 tokens True
3 1832_Sand-George_Indiana_PER-ONLY 112,221 tokens True
4 1836_Gautier-Théophile_La-morte-amoureuse 14,293 tokens True
5 1837_Balzac-Honoré-de_La-maison-Nucingen 30,030 tokens True
6 1841_Sand-George_Pauline 12,398 tokens True
7 1856_Cousin-Victor_Madame-de-Hautefort 11,768 tokens True
8 1863_Gautier-Théophile_Le-capitaine-Fracasse 11,848 tokens True
9 1873_Zola-Émile_Le-ventre-de-Paris 12,613 tokens True
10 1881_Flaubert-Gustave_Bouvard-et-Pécuchet 12,308 tokens True
11 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-buche 2,267 tokens True
12 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-relique 2,041 tokens True
13 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-rouille 2,949 tokens True
14 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Madame-Baptiste 2,578 tokens True
15 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Marocca 4,078 tokens True
16 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-A-cheval 2,878 tokens True
17 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Fou 1,905 tokens True
18 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Mademoiselle-Fifi 5,439 tokens True
19 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Reveil 2,159 tokens True
20 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Un-reveillon 2,364 tokens True
21 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Une-ruse 2,469 tokens True
22 1901_Achard-Lucie_Rosalie-de-Constant-sa-famille-et-ses-amis 12,775 tokens True
23 1903_Conan-Laure_Élisabeth-Seton 13,046 tokens True
24 1904-1912_Rolland-Romain_Jean-Christophe(1) 10,982 tokens True
25 1904-1912_Rolland-Romain_Jean-Christophe(2) 10,305 tokens True
26 1917_Bourgeois-Adèle_Némoville 12,468 tokens True
27 1923_Delly_Dans-les-ruines 95,617 tokens True
28 1923_Radiguet-Raymond_Le-diable-au-corps 14,850 tokens True
29 1926_Audoux-Marguerite_De-la-ville-au-moulin 12,144 tokens True
30 1937_Audoux-Marguerite_Douce-Lumière 12,346 tokens True
31 TOTAL 554,542 tokens 31 files used for cross-validation
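
The leave-one-out strategy over these 31 files can be sketched as follows: each file is held out once for evaluation, and the remaining data is divided into training and validation parts with the 0.85 / 0.15 ratio. The card does not say whether that split is made by file or by sample, so the file-level split below is an assumption, as are the function names and the placeholder document names.

```python
import random


def leave_one_out_folds(documents):
    """Yield (held_out, remaining) pairs, one per document."""
    for i, held_out in enumerate(documents):
        yield held_out, documents[:i] + documents[i + 1:]


def split_train_validation(items, validation_fraction=0.15, seed=0):
    """Shuffle items and split them into (train, validation) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


documents = [f"doc_{i}" for i in range(31)]  # placeholders for the 31 corpus files
folds = list(leave_one_out_folds(documents))
held_out, remaining = folds[0]
train, validation = split_train_validation(remaining)
print(len(folds), "folds;", len(train), "train /", len(validation), "validation in fold 0")
```

Under this scheme each reported score pools predictions made on files the corresponding fold's model never saw during training.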

PREDICTIONS CONFUSION MATRIX:

Gold label   Predicted PER   Predicted O   Support
PER          68,267          3,471         71,738
O            3,910           0             3,910
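
The precision, recall, and F1 figures under MODEL PERFORMANCE can be recomputed from these counts. The sketch below assumes rows are gold labels and columns are predictions, so the O row counts predicted PER mentions that have no gold counterpart; the variable names are illustrative.

```python
true_positives = 68_267   # gold PER predicted as PER
false_negatives = 3_471   # gold PER predicted as O (missed mentions)
false_positives = 3_910   # gold O predicted as PER (spurious mentions)

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2%} recall={recall:.2%} f1={f1:.2%}")
# prints: precision=94.58% recall=95.16% f1=94.87%
```

Because PER is the only entity type, the micro and macro averages coincide with the PER row.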

CONTACT:

mail: antoine [dot] bourgois [at] protonmail [dot] com
