DidulaThavisha committed
Commit e9653c6 · verified · 1 parent: 16e41dd

Upload 8 files

Files changed (8)
  1. README.md +155 -0
  2. config.json +49 -0
  3. gitattributes +16 -0
  4. merges.txt +0 -0
  5. special_tokens_map.json +1 -0
  6. tokenizer.json +0 -0
  7. tokenizer_config.json +1 -0
  8. vocab.json +0 -0
README.md ADDED
@@ -0,0 +1,155 @@
+ ---
+ license: apache-2.0
+ tags:
+ - token-classification
+ datasets:
+ - conll2003
+ metrics:
+ - precision
+ - recall
+ - f1
+ - accuracy
+ model-index:
+ - name: distilroberta-base-ner-conll2003
+   results:
+   - task:
+       type: token-classification
+       name: Token Classification
+     dataset:
+       name: conll2003
+       type: conll2003
+     metrics:
+     - type: precision
+       value: 0.9492923423001218
+       name: Precision
+     - type: recall
+       value: 0.9565545901020023
+       name: Recall
+     - type: f1
+       value: 0.9529096297690173
+       name: F1
+     - type: accuracy
+       value: 0.9883096560400111
+       name: Accuracy
+   - task:
+       type: token-classification
+       name: Token Classification
+     dataset:
+       name: conll2003
+       type: conll2003
+       config: conll2003
+       split: validation
+     metrics:
+     - type: accuracy
+       value: 0.9883249976987512
+       name: Accuracy
+       verified: true
+       verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiZTEwNzFlMjk0ZDY4NTg2MGQxMDZkM2IyZjdjNDEwYmNiMWY1MWZiNzg1ZjMyZTlkYzQ0MmVmNTZkMjEyMGQ1YiIsInZlcnNpb24iOjF9.zxapWje7kbauQ5-VDNbY487JB5wkN4XqgaLwoX1cSmNfgpp-MPCjqrocxayb1kImbN8CvzOpU1aSfvRfyd5fAw
+     - type: precision
+       value: 0.9906910190038265
+       name: Precision
+       verified: true
+       verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiMWRjMjYyOGQ2MGMwOGE1ODQyNDU1MzZiNWU4MGUzYWVlNjQ3NDhjZDRlZTE0NDlmMGJjZjliZjU2ZmFiZmZiYyIsInZlcnNpb24iOjF9.G_QY9mDkIkllmWPsgmUoVgs-R9XjfYkdJMS8hcyGM-7NXsbigUgZZnhfD0TjDak62UoEplqwSX5r0S4xKPdxBQ
+     - type: recall
+       value: 0.9916635820847483
+       name: Recall
+       verified: true
+       verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiODE0MDE5ZWMzNTM5MTA1NTI4YzNhNzI2NzVjODIzZWY0OWE2ODJiN2FiNmVkNGVkMTI2ODZiOGEwNTEzNzk2MCIsInZlcnNpb24iOjF9.zenVqRfs8TrKoiIu_QXQJtHyj3dEH97ZDLxUn_UJ2tdW36hpBflgKCJNBvFFkra7bS4cNRfIkwxxCUMWH1ptBg
+     - type: f1
+       value: 0.9911770619696786
+       name: F1
+       verified: true
+       verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiZWZjY2NiNjZlNDFiODQ3M2JkOWJjNzRlY2FmNjMwNGFkNzFmNTBkOGQ5YTcyZjUzNjAwNDAxMThiNTE5ZThiNiIsInZlcnNpb24iOjF9.c9aD9hycCS-WBaLUb8NKzIpd2LE6xfJrhg3fL9_832RiMq5gcMs9qtarP3Jbo6WbPs_WThr_v4gn7K4Ti-0-CA
+     - type: loss
+       value: 0.05638007074594498
+       name: loss
+       verified: true
+       verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNGM3NTQ5ODBhMDcyNjBjMGUxMDgzYjI2NjEwNjM0MjU0MjEzMTRmODA2MjMwZWU1YTQ3OWU2YjUzNTliZTkwMSIsInZlcnNpb24iOjF9.03OwbxrdKm-vg6ia5CBYdEaSCuRbT0pLoEvwpd4NtjydVzo5wzS-pWgY6vH4PlI0ZCTBY0Po0IZSsJulWJttDg
+ ---
+ 
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+ 
+ # distilroberta-base-ner-conll2003
+ 
+ This model is a fine-tuned version of [distilroberta-base](https://huggingface.co/distilroberta-base) on the conll2003 dataset.
+ 
+ eval F1-Score: **95.29** (CoNLL-03)
+ test F1-Score: **90.74** (CoNLL-03)
+ 
+ eval F1-Score: **95.29** (CoNLL++ / CoNLL-03 corrected)
+ test F1-Score: **92.23** (CoNLL++ / CoNLL-03 corrected)
+ 
+ ## Model Usage
+ 
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
+ 
+ # Load the fine-tuned NER model and its tokenizer from the Hugging Face Hub
+ tokenizer = AutoTokenizer.from_pretrained("philschmid/distilroberta-base-ner-conll2003")
+ model = AutoModelForTokenClassification.from_pretrained("philschmid/distilroberta-base-ner-conll2003")
+ 
+ # grouped_entities=True merges sub-word pieces into whole entity spans
+ nlp = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True)
+ example = "My name is Philipp and I live in Germany"
+ 
+ nlp(example)
+ ```
+ 
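+ For finer-grained control you can skip the pipeline and decode the logits yourself. A minimal sketch (the label names come from the `id2label` map in the model's `config.json`):
+ 
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+ 
+ tokenizer = AutoTokenizer.from_pretrained("philschmid/distilroberta-base-ner-conll2003")
+ model = AutoModelForTokenClassification.from_pretrained("philschmid/distilroberta-base-ner-conll2003")
+ 
+ inputs = tokenizer("My name is Philipp and I live in Germany", return_tensors="pt")
+ with torch.no_grad():
+     logits = model(**inputs).logits
+ 
+ # Map every sub-token to its highest-scoring BIO tag via config.id2label
+ predictions = logits.argmax(dim=-1)[0]
+ tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
+ for token, pred in zip(tokens, predictions):
+     print(token, model.config.id2label[pred.item()])
+ ```
+ 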
+ ## Training procedure
+ 
+ ### Training hyperparameters
+ 
+ The following hyperparameters were used during training (a `TrainingArguments` sketch follows this list):
+ - learning_rate: 4.9902376275441704e-05
+ - train_batch_size: 32
+ - eval_batch_size: 16
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - num_epochs: 6.0
+ - mixed_precision_training: Native AMP
+ 
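+ These settings correspond roughly to the following `TrainingArguments`. This is a minimal sketch under the framework versions listed at the bottom of this card, not the exact training script; `output_dir` is a placeholder, and the dataset/`Trainer` wiring is omitted:
+ 
+ ```python
+ from transformers import TrainingArguments
+ 
+ training_args = TrainingArguments(
+     output_dir="distilroberta-base-ner-conll2003",  # placeholder
+     learning_rate=4.9902376275441704e-05,
+     per_device_train_batch_size=32,
+     per_device_eval_batch_size=16,
+     seed=42,
+     adam_beta1=0.9,            # Adam betas=(0.9, 0.999)
+     adam_beta2=0.999,
+     adam_epsilon=1e-08,
+     lr_scheduler_type="linear",
+     num_train_epochs=6.0,
+     fp16=True,                 # "Native AMP" mixed-precision training
+ )
+ ```
+ 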
+ ### Training results
+ 
+ #### CoNLL-2003
+ 
+ It achieves the following results on the evaluation set:
+ - Loss: 0.0583
+ - Precision: 0.9493
+ - Recall: 0.9566
+ - F1: 0.9529
+ - Accuracy: 0.9883
+ 
+ It achieves the following results on the test set:
+ - Loss: 0.2025
+ - Precision: 0.8999
+ - Recall: 0.915
+ - F1: 0.9074
+ - Accuracy: 0.9741
+ 
+ #### CoNLL++ / CoNLL-2003 corrected
+ 
+ It achieves the following results on the evaluation set:
+ - Loss: 0.0567
+ - Precision: 0.9493
+ - Recall: 0.9566
+ - F1: 0.9529
+ - Accuracy: 0.9883
+ 
+ It achieves the following results on the test set (entity-level scoring; see the seqeval sketch after these results):
+ - Loss: 0.1359
+ - Precision: 0.92
+ - Recall: 0.9245
+ - F1: 0.9223
+ - Accuracy: 0.9785
+ 
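+ The precision/recall/F1 figures above are entity-level scores of the kind produced by `seqeval`. A minimal sketch of that computation (the tag sequences here are toy examples, not actual model output):
+ 
+ ```python
+ from seqeval.metrics import f1_score, precision_score, recall_score
+ 
+ # Toy gold vs. predicted BIO tag sequences, purely illustrative
+ y_true = [["B-PER", "O", "O", "B-LOC", "I-LOC"]]
+ y_pred = [["B-PER", "O", "O", "B-LOC", "O"]]
+ 
+ print(precision_score(y_true, y_pred))  # entity-level precision
+ print(recall_score(y_true, y_pred))     # entity-level recall
+ print(f1_score(y_true, y_pred))         # entity-level F1
+ ```
+ 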
+ ### Framework versions
+ 
+ - Transformers 4.6.1
+ - Pytorch 1.8.1+cu101
+ - Datasets 1.6.2
+ - Tokenizers 0.10.2
config.json ADDED
@@ -0,0 +1,49 @@
+ {
+   "_name_or_path": "distilroberta-base",
+   "architectures": [
+     "RobertaForTokenClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "eos_token_id": 2,
+   "finetuning_task": "ner",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "O",
+     "1": "B-PER",
+     "2": "I-PER",
+     "3": "B-ORG",
+     "4": "I-ORG",
+     "5": "B-LOC",
+     "6": "I-LOC",
+     "7": "B-MISC",
+     "8": "I-MISC"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "O": 0,
+     "B-PER": 1,
+     "I-PER": 2,
+     "B-ORG": 3,
+     "I-ORG": 4,
+     "B-LOC": 5,
+     "I-LOC": 6,
+     "B-MISC": 7,
+     "I-MISC": 8
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 6,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.6.1",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 50265
+ }
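The label space defined in this config can be inspected programmatically; a minimal sketch:

```python
from transformers import AutoConfig

# num_labels is derived from the id2label map above (9 BIO tags)
config = AutoConfig.from_pretrained("philschmid/distilroberta-base-ner-conll2003")
print(config.num_labels)  # 9
print(config.id2label)    # {0: 'O', 1: 'B-PER', ..., 8: 'I-MISC'}
```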
gitattributes ADDED
@@ -0,0 +1,16 @@
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tar.gz filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>", "add_prefix_space": true, "errors": "replace", "sep_token": "</s>", "cls_token": "<s>", "pad_token": "<pad>", "mask_token": "<mask>", "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "distilroberta-base"}
vocab.json ADDED
The diff for this file is too large to render. See raw diff