---
license: mit
tags:
- generated_from_trainer
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: bert-portuguese-ner-archive
  results:
  - task:
      name: Token Classification
      type: token-classification
    metric:
      name: Accuracy
      type: accuracy
      value: 0.9700325118974698
---

# bert-portuguese-ner

This model is a fine-tuned version of [neuralmind/bert-base-portuguese-cased](https://huggingface.co/neuralmind/bert-base-portuguese-cased) on Portuguese archival documents.
It achieves the following results on the evaluation set:
- Loss: 0.1140
- Precision: 0.9147
- Recall: 0.9483
- F1: 0.9312
- Accuracy: 0.9700

## Model description

This model was fine-tuned for token classification (NER) on Portuguese archival documents. The annotated labels are: Date, Profession, Person, Place, Organization.
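
A minimal inference sketch using the `transformers` pipeline API (the model id below is a placeholder; substitute this repository's actual id on the Hub):

```python
from transformers import pipeline

# Placeholder model id; replace with this repository's id on the Hub.
ner = pipeline(
    "token-classification",
    model="user/bert-portuguese-ner",
    aggregation_strategy="simple",  # merge word pieces into whole entities
)

text = "Manuel Joaquim Rebelo, tanoeiro, natural do Porto, 1825."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```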

### Datasets

All the training and evaluation data is available at: http://ner.epl.di.uminho.pt/


### Training hyperparameters

The following hyperparameters were used during training (a `TrainingArguments` sketch follows the list):
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 4
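
A sketch of how these settings map onto the `transformers` `TrainingArguments` API (the output path is a placeholder; the Adam betas and epsilon listed above are the Trainer defaults, so they need no explicit arguments):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-portuguese-ner",  # placeholder output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=4,
)
```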

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log        | 1.0   | 192  | 0.1438          | 0.8917    | 0.9392 | 0.9148 | 0.9633   |
| 0.2454        | 2.0   | 384  | 0.1222          | 0.8985    | 0.9417 | 0.9196 | 0.9671   |
| 0.0526        | 3.0   | 576  | 0.1098          | 0.9150    | 0.9481 | 0.9312 | 0.9698   |
| 0.0372        | 4.0   | 768  | 0.1140          | 0.9147    | 0.9483 | 0.9312 | 0.9700   |
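
The entity-level precision, recall, and F1 above are typically computed with a `seqeval`-based `compute_metrics` hook passed to the Trainer. A sketch under that assumption (`label_list`, mapping label ids to tag strings, is assumed; the short list below is illustrative only):

```python
import numpy as np
from datasets import load_metric  # seqeval metric; available in datasets 1.10.2

metric = load_metric("seqeval")
label_list = ["O", "B-Person", "I-Person"]  # assumed; use the model's full tag set

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Special tokens carry the ignore index -100 and are dropped.
    true_predictions = [
        [label_list[p] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```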


### Framework versions

- Transformers 4.10.0.dev0
- Pytorch 1.9.0+cu111
- Datasets 1.10.2
- Tokenizers 0.10.3

### Citation

```bibtex
@Article{make4010003,
  AUTHOR = {Cunha, Luís Filipe and Ramalho, José Carlos},
  TITLE = {NER in Archival Finding Aids: Extended},
  JOURNAL = {Machine Learning and Knowledge Extraction},
  VOLUME = {4},
  YEAR = {2022},
  NUMBER = {1},
  PAGES = {42--65},
  URL = {https://www.mdpi.com/2504-4990/4/1/3},
  ISSN = {2504-4990},
  ABSTRACT = {The amount of information preserved in Portuguese archives has increased over the years. These documents represent a national heritage of high importance, as they portray the country's history. Currently, most Portuguese archives have made their finding aids available to the public in digital format, however, these data do not have any annotation, so it is not always easy to analyze their content. In this work, Named Entity Recognition solutions were created that allow the identification and classification of several named entities from the archival finding aids. These named entities translate into crucial information about their context and, with high confidence results, they can be used for several purposes, for example, the creation of smart browsing tools by using entity linking and record linking techniques. In order to achieve high result scores, we annotated several corpora to train our own Machine Learning algorithms in this context domain. We also used different architectures, such as CNNs, LSTMs, and Maximum Entropy models. Finally, all the created datasets and ML models were made available to the public with a developed web platform, NER@DI.},
  DOI = {10.3390/make4010003}
}
```