jeanpoll committed on
Commit
fd3c994
1 Parent(s): 8a797a2

First version with dates

README.md ADDED
@@ -0,0 +1,123 @@
---
language: fr
datasets:
- Jean-Baptiste/wikiner_fr
widget:
- text: "Je m'appelle jean-baptiste et j'habite à montréal depuis fevr 2012"
---

# camembert-ner-with-dates: model fine-tuned from camemBERT for the NER task

## Introduction

camembert-ner-with-dates is an extension of the French camembert-ner model with an additional tag for dates.
The model was trained on an enriched version of the wikiner-fr dataset (~170,634 sentences).

On my test data (a mix of chat and email), this model achieved an F1 score of ~83% (by comparison, dateparser was at ~70%).
The dateparser library can still be used on the output of this model to convert the extracted text into a Python datetime object:
https://dateparser.readthedocs.io/en/latest/

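The model only tags the date span as text; turning that span into a datetime is dateparser's job. As a rough illustration of the idea, here is a minimal stdlib sketch (the `parse_fr_date` helper and its single supported "day month year" pattern are invented for this example; dateparser handles far more formats and spellings):

```python
import re
from datetime import datetime
from typing import Optional

# Minimal illustration only -- dateparser covers many more formats.
FR_MONTHS = {
    "janvier": 1, "fevrier": 2, "février": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "aout": 8, "août": 8,
    "septembre": 9, "octobre": 10, "novembre": 11, "decembre": 12, "décembre": 12,
}

def parse_fr_date(span: str) -> Optional[datetime]:
    """Parse a 'le 1er avril 1976'-style span extracted by the DATE tag."""
    m = re.search(r"(\d{1,2})(?:er)?\s+(\w+)\s+(\d{4})", span.lower())
    if m and m.group(2) in FR_MONTHS:
        return datetime(int(m.group(3)), FR_MONTHS[m.group(2)], int(m.group(1)))
    return None

parse_fr_date("le 1er avril 1976 dans le")  # datetime(1976, 4, 1)
```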

## How to use camembert-ner-with-dates with HuggingFace

##### Load camembert-ner-with-dates and its sub-word tokenizer:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/camembert-ner-with-dates")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert-ner-with-dates")


##### Process text sample (from wikipedia)

from transformers import pipeline

nlp = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)
nlp("Apple est créée le 1er avril 1976 dans le garage de la maison d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak et Ronald Wayne14, puis constituée sous forme de société le 3 janvier 1977 à l'origine sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification de ses produits, le mot « computer » est retiré le 9 janvier 2015.")


[{'entity_group': 'ORG',
  'score': 0.9776379466056824,
  'word': 'Apple',
  'start': 0,
  'end': 5},
 {'entity_group': 'DATE',
  'score': 0.9793774570737567,
  'word': 'le 1er avril 1976 dans le',
  'start': 15,
  'end': 41},
 {'entity_group': 'PER',
  'score': 0.9958226680755615,
  'word': 'Steve Jobs',
  'start': 74,
  'end': 85},
 {'entity_group': 'LOC',
  'score': 0.995087186495463,
  'word': 'Los Altos',
  'start': 87,
  'end': 97},
 {'entity_group': 'LOC',
  'score': 0.9953305125236511,
  'word': 'Californie',
  'start': 100,
  'end': 111},
 {'entity_group': 'PER',
  'score': 0.9961076378822327,
  'word': 'Steve Jobs',
  'start': 115,
  'end': 126},
 {'entity_group': 'PER',
  'score': 0.9960325956344604,
  'word': 'Steve Wozniak',
  'start': 127,
  'end': 141},
 {'entity_group': 'PER',
  'score': 0.9957776467005411,
  'word': 'Ronald Wayne',
  'start': 144,
  'end': 157},
 {'entity_group': 'DATE',
  'score': 0.994030773639679,
  'word': 'le 3 janvier 1977 à',
  'start': 198,
  'end': 218},
 {'entity_group': 'ORG',
  'score': 0.9720810294151306,
  'word': "d'Apple Computer",
  'start': 240,
  'end': 257},
 {'entity_group': 'DATE',
  'score': 0.9924157659212748,
  'word': '30 ans et',
  'start': 272,
  'end': 282},
 {'entity_group': 'DATE',
  'score': 0.9934852868318558,
  'word': 'le 9 janvier 2015.',
  'start': 363,
  'end': 382}]
```

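Since `grouped_entities=True` returns one dict per entity with its label and character offsets, filtering the output downstream is straightforward; a small sketch over an abridged copy of the output above:

```python
# Abridged copy of the pipeline output shown above.
entities = [
    {"entity_group": "ORG", "word": "Apple", "start": 0, "end": 5},
    {"entity_group": "DATE", "word": "le 1er avril 1976 dans le", "start": 15, "end": 41},
    {"entity_group": "PER", "word": "Steve Jobs", "start": 74, "end": 85},
    {"entity_group": "DATE", "word": "le 3 janvier 1977 à", "start": 198, "end": 218},
]

# Keep only the date spans for further processing (e.g. with dateparser).
dates = [e["word"] for e in entities if e["entity_group"] == "DATE"]
print(dates)  # ['le 1er avril 1976 dans le', 'le 3 janvier 1977 à']
```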

## Model performance (metric: seqeval)

Global
```
'precision': 0.928
'recall': 0.928
'f1': 0.928
```

By entity
```
Label LOC:  (precision: 0.929, recall: 0.932, f1: 0.931, support: 9510)
Label PER:  (precision: 0.952, recall: 0.965, f1: 0.959, support: 9399)
Label MISC: (precision: 0.878, recall: 0.844, f1: 0.860, support: 5364)
Label ORG:  (precision: 0.848, recall: 0.883, f1: 0.865, support: 2299)
Label DATE: not relevant because of the method used to add the DATE tag to the wikiner dataset (estimated f1 ~90%)
```

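seqeval scores at the entity level: a predicted entity counts as correct only when both its span and its label match the gold annotation exactly. A minimal sketch of that matching criterion, assuming entities are represented as (start, end, label) triples (a representation chosen here for illustration; seqeval itself works on BIO tag sequences):

```python
def entity_prf(gold, pred):
    """Entity-level precision/recall/F1 on exact (start, end, label) matches,
    mirroring seqeval's exact-match criterion."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # exact span+label matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 5, "ORG"), (15, 41, "DATE"), (74, 85, "PER")]
pred = [(0, 5, "ORG"), (15, 41, "DATE"), (74, 85, "LOC")]  # one wrong label
entity_prf(gold, pred)  # (0.666..., 0.666..., 0.666...)
```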
config.json ADDED
@@ -0,0 +1,43 @@
{
  "_name_or_path": "models_data/models/camembert-base_forTokenCLassification",
  "architectures": [
    "CamembertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 5,
  "eos_token_id": 6,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "LOC",
    "2": "PER",
    "3": "MISC",
    "4": "ORG",
    "5": "DATE"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "DATE": 5,
    "LOC": 1,
    "MISC": 3,
    "O": 0,
    "ORG": 4,
    "PER": 2
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "camembert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.2.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 32005
}
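The `id2label` map in this config is what converts the model's per-token class indices back into tag names at inference time; a tiny sketch of that lookup (the token class ids below are made up for illustration):

```python
# id2label as declared in config.json above.
id2label = {0: "O", 1: "LOC", 2: "PER", 3: "MISC", 4: "ORG", 5: "DATE"}

# Hypothetical per-token argmax class ids for "Steve Jobs à Los Altos".
token_class_ids = [2, 2, 0, 1, 1]
tags = [id2label[i] for i in token_class_ids]
print(tags)  # ['PER', 'PER', 'O', 'LOC', 'LOC']
```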
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:82f02fbb095b057e0f13a782b612d2bdfaf6edd31d193b49af1dc1fc9f94c761
size 440233263
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:988bc5a00281c6d210a5d34bd143d0363741a432fefe741bf71e61b1869d4314
size 810912
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": "<mask>", "additional_special_tokens": ["<s>NOTUSED", "</s>NOTUSED"]}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "<s>", "unk_token": "<unk>", "pad_token": "<pad>", "mask_token": "<mask>", "additional_special_tokens": ["<s>NOTUSED", "</s>NOTUSED"], "model_max_length": 512, "name_or_path": "models_data/models/camembert-base_forTokenCLassification", "special_tokens_map_file": "models_data/models/camembert-base_forTokenCLassification\\special_tokens_map.json"}