AntoineBourgois/BookNLP-fr_NER_camembert-large_FAC_GPE_LOC_PER_TIME_VEH

INTRODUCTION:

This model, developed as part of the BookNLP-fr project, is a named entity recognition (NER) model built on top of camembert-large embeddings. It is trained to predict nested entities in French, specifically in literary texts.

The predicted entities are:

mentions of characters (PER): pronouns (je, tu, il, ...), possessive determiners (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
facilities (FAC): château, sentier, chambre, couloir, ...
time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
geo-political entities (GPE): Montrouge, France, le petit hameau, ...
locations (LOC): le sud, Mars, l'océan, le bois, ...
vehicles (VEH): avion, voitures, calèche, vélos, ...

MODEL PERFORMANCE (LOOCV):

NER_tag	precision	recall	f1_score	support	support %
PER	92.46%	93.71%	93.08%	32,204	84.13%
FAC	70.63%	70.94%	70.78%	2,295	6.00%
TIME	58.66%	57.75%	58.20%	1,671	4.37%
GPE	77.64%	77.37%	77.50%	866	2.26%
LOC	62.96%	45.71%	52.97%	781	2.04%
VEH	63.43%	47.95%	54.61%	463	1.21%
micro_avg	88.39%	88.87%	88.58%	38,280	100.00%
macro_avg	70.96%	65.57%	67.86%	38,280	100.00%
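
As a quick sanity check on the table above, the f1_score column can be recomputed as the harmonic mean of precision and recall, and the macro_avg row as the unweighted mean over the six entity types. The snippet below is a minimal sketch using the rounded percentages from the table, so results can differ from the reported values in the last decimal place. The names `scores`, `f1`, `per_class_f1` and `macro_f1` are illustrative and not part of the BookNLP-fr code.

```python
# Per-class (precision, recall) in percent, copied from the table above.
scores = {
    "PER": (92.46, 93.71),
    "FAC": (70.63, 70.94),
    "TIME": (58.66, 57.75),
    "GPE": (77.64, 77.37),
    "LOC": (62.96, 45.71),
    "VEH": (63.43, 47.95),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# F1 for each entity type, then the unweighted (macro) mean over types.
per_class_f1 = {tag: f1(p, r) for tag, (p, r) in scores.items()}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

for tag, value in per_class_f1.items():
    print(f"{tag}\t{value:.2f}%")
print(f"macro_avg\t{macro_f1:.2f}%")
```

Because the macro average gives every entity type the same weight, the rarer LOC and VEH classes pull it well below the scores of the dominant PER class.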

TRAINING PARAMETERS:

Entity types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
Tagging scheme: BIOES
Nested entity levels: [0, 1]
Split strategy: Leave-one-out cross-validation (28 files)
Train/Validation split: 0.85 / 0.15
Batch size: 16
Initial learning rate: 0.00014
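
The split strategy above can be pictured with the following sketch: each of the 28 files is held out once for evaluation, and the remaining files are divided 0.85 / 0.15 into training and validation sets. This is an illustration only. It assumes the train/validation split is made at the file level with a random shuffle, which this card does not state (the split may instead be made on samples), and `loocv_splits`, `val_fraction`, `seed` and the placeholder file names are hypothetical.

```python
import random


def loocv_splits(files, val_fraction=0.15, seed=0):
    """Yield (train, validation, held_out) file lists, one fold per file.

    Assumption: the 0.85 / 0.15 split is applied to files, not to samples.
    """
    rng = random.Random(seed)
    for held_out in files:
        remaining = [f for f in files if f != held_out]
        rng.shuffle(remaining)
        n_val = max(1, round(len(remaining) * val_fraction))
        yield remaining[n_val:], remaining[:n_val], [held_out]


# Placeholder names standing in for the 28 corpus documents.
files = [f"doc_{i:02d}" for i in range(28)]
folds = list(loocv_splits(files))

train, validation, held_out = folds[0]
print(len(folds), len(train), len(validation), held_out)
```

With 28 files this yields 28 folds, each evaluated on a document the model never saw during training or validation.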

MODEL ARCHITECTURE:

Model Input: Maximum context camembert-large embeddings (1024 dimensions)

Locked Dropout: 0.5
Projection layer:
- layer type: highway layer
- input: 1024 dimensions
- output: 2048 dimensions
BiLSTM layer:
- input: 2048 dimensions
- output: 256 dimensions (hidden state)
Linear layer:
- input: 256 dimensions
- output: 25 dimensions (predicted labels with BIOES tagging scheme)
CRF layer

Model Output: BIOES labels sequence
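
The 25 output dimensions of the linear layer match what the BIOES scheme gives for six entity types: four positional prefixes (B, I, E, S) per type, plus a single O label. The sketch below builds that label set and shows one way to turn a BIOES label sequence into entity spans. It is a minimal illustration, not the BookNLP-fr decoder: the label string format ("B-PER", "S-FAC", ...), the label ordering, the function `decode_bioes` and the example sentence are all assumptions, and ill-formed sequences are simply dropped.

```python
ENTITY_TYPES = ["PER", "LOC", "FAC", "TIME", "VEH", "GPE"]

# One "O" label plus B-, I-, E- and S- variants for each entity type.
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in "BIES"]


def decode_bioes(labels):
    """Convert a BIOES label sequence into (start, end, type) spans, end inclusive."""
    spans = []
    start, current = None, None  # state of the currently open multi-token span
    for i, label in enumerate(labels):
        if label == "O":
            start, current = None, None
            continue
        prefix, etype = label.split("-", 1)
        if prefix == "S":  # single-token entity
            spans.append((i, i, etype))
            start, current = None, None
        elif prefix == "B":  # open a new span
            start, current = i, etype
        elif prefix == "I":  # continue only if the type matches the open span
            if current != etype:
                start, current = None, None
        elif prefix == "E":  # close the span if one of the same type is open
            if start is not None and current == etype:
                spans.append((start, i, etype))
            start, current = None, None
    return spans


# Hand-written illustration, not actual model output.
tokens = ["le", "capitaine", "entra", "dans", "la", "chambre"]
labels = ["B-PER", "E-PER", "O", "O", "O", "S-FAC"]
spans = decode_bioes(labels)
print(len(LABELS), [(" ".join(tokens[s:e + 1]), t) for s, e, t in spans])
```

If the two nested entity levels are predicted as separate label sequences, which is an assumption here, each sequence could be decoded independently in this way and the resulting spans merged.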

HOW TO USE:

*** UNDER CONSTRUCTION ***

TRAINING CORPUS:

	Document	Tokens Count	Is included in model eval
0	1830_Balzac-Honoré-de_La-maison-du-chat-qui-pelote	24,776 tokens	True
1	1830_Balzac-Honoré-de_Sarrasine	15,408 tokens	True
2	1836_Gautier-Théophile_La-morte-amoureuse	14,293 tokens	True
3	1837_Balzac-Honoré-de_La-maison-Nucingen	30,034 tokens	True
4	1841_Sand-George_Pauline	12,398 tokens	True
5	1856_Cousin-Victor_Madame-de-Hautefort	11,768 tokens	True
6	1863_Gautier-Théophile_Le-capitaine-Fracasse	11,848 tokens	True
7	1873_Zola-Émile_Le-ventre-de-Paris	12,613 tokens	True
8	1881_Flaubert-Gustave_Bouvard-et-Pécuchet	12,308 tokens	True
9	1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-buche	2,267 tokens	True
10	1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-relique	2,041 tokens	True
11	1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-rouille	2,949 tokens	True
12	1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Madame-Baptiste	2,578 tokens	True
13	1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Marocca	4,078 tokens	True
14	1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-A-cheval	2,878 tokens	True
15	1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Fou	1,905 tokens	True
16	1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Mademoiselle-Fifi	5,439 tokens	True
17	1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Reveil	2,159 tokens	True
18	1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Un-reveillon	2,364 tokens	True
19	1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Une-ruse	2,469 tokens	True
20	1901_Achard-Lucie_Rosalie-de-Constant-sa-famille-et-ses-amis	12,775 tokens	True
21	1903_Conan-Laure_Élisabeth-Seton	13,046 tokens	True
22	1904-1912_Rolland-Romain_Jean-Christophe(1)	10,982 tokens	True
23	1904-1912_Rolland-Romain_Jean-Christophe(2)	10,305 tokens	True
24	1917_Bourgeois-Adèle_Némoville	12,468 tokens	True
25	1923_Radiguet-Raymond_Le-diable-au-corps	14,850 tokens	True
26	1926_Audoux-Marguerite_De-la-ville-au-moulin	12,144 tokens	True
27	1937_Audoux-Marguerite_Douce-Lumière	12,346 tokens	True
28	TOTAL	275,489 tokens	28 files used for cross-validation

PREDICTIONS CONFUSION MATRIX:

Gold label (rows) / Predicted label (columns)	PER	FAC	TIME	GPE	LOC	VEH	O	support
PER	30,177	28	14	7	7	31	1,940	32,204
FAC	42	1,628	1	22	17	1	584	2,295
TIME	8	1	965	1	1	0	695	1,671
GPE	13	31	2	670	31	0	119	866
LOC	8	64	1	56	357	0	295	781
VEH	54	8	0	0	0	222	179	463
O	2,285	524	661	100	150	96	0	3,816
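
Reading each row of the matrix as a gold label and each column as a predicted label, per-class recall is the diagonal count divided by the row support. The sketch below recomputes it from the counts above. The variable names are illustrative, and the gold O row is left out because recall is only reported for entity types.

```python
COLUMNS = ["PER", "FAC", "TIME", "GPE", "LOC", "VEH", "O"]

# Rows are gold labels, columns are predicted labels (counts from the matrix above).
matrix = {
    "PER": [30177, 28, 14, 7, 7, 31, 1940],
    "FAC": [42, 1628, 1, 22, 17, 1, 584],
    "TIME": [8, 1, 965, 1, 1, 0, 695],
    "GPE": [13, 31, 2, 670, 31, 0, 119],
    "LOC": [8, 64, 1, 56, 357, 0, 295],
    "VEH": [54, 8, 0, 0, 0, 222, 179],
}


def recall(tag):
    """Share of gold mentions of `tag` that were predicted as `tag`, in percent."""
    row = matrix[tag]
    return 100 * row[COLUMNS.index(tag)] / sum(row)


recalls = {tag: recall(tag) for tag in matrix}

# Overall recall: all correctly labelled mentions over all gold mentions.
correct = sum(matrix[tag][COLUMNS.index(tag)] for tag in matrix)
total = sum(sum(row) for row in matrix.values())
overall_recall = 100 * correct / total

for tag, value in recalls.items():
    print(f"{tag}\t{value:.2f}%")
print(f"overall\t{overall_recall:.2f}%")
```

The O column shows where most recall is lost: for every entity type, mentions that are missed entirely (predicted as O) far outnumber those confused with another type.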

CONTACT:

mail: antoine [dot] bourgois [at] protonmail [dot] com
