license: mit
base_model: camembert-base
metrics:
  - precision
  - recall
  - f1
  - accuracy
model-index:
  - name: NERmembert-base-4entities
    results: []
datasets:
  - CATIE-AQ/frenchNER_4entities
language:
  - fr
widget:
  - text: >-
      Assurés de disputer l'Euro 2024 en Allemagne l'été prochain (du 14 juin au
      14 juillet) depuis leur victoire aux Pays-Bas, les Bleus ont fait le
      nécessaire pour avoir des certitudes. Avec six victoires en six matchs
      officiels et un seul but encaissé, Didier Deschamps a consolidé les acquis
      de la dernière Coupe du monde. Les joueurs clés sont connus : Kylian
      Mbappé, Aurélien Tchouameni, Antoine Griezmann, Ibrahima Konaté ou encore
      Mike Maignan.
library_name: transformers
pipeline_tag: token-classification
co2_eq_emissions: 20

NERmembert-base-4entities

Model Description

We present NERmembert-base-4entities, a CamemBERT base model fine-tuned for Named Entity Recognition in French on four French NER datasets covering 4 entities (LOC, PER, ORG, MISC).
All these datasets were concatenated and cleaned into a single dataset that we called frenchNER_4entities.
It contains 384,773 rows in total: 328,757 for training, 24,131 for validation and 31,885 for testing.
Our methodology is described in a blog post available in English or French.

Dataset

The dataset used is frenchNER_4entities, which represents ~385k sentences labeled in 4 categories:

| Label | Examples |
|---|---|
| PER | "La Bruyère", "Gaspard de Coligny", "Wittgenstein" |
| ORG | "UTBM", "American Airlines", "id Software" |
| LOC | "République du Cap-Vert", "Créteil", "Bordeaux" |
| MISC | "Wolfenstein 3D", "Révolution française", "Coupe du monde" |

The distribution of the entities is as follows:


| Splits | O | PER | LOC | ORG | MISC |
|---|---|---|---|---|---|
| train | 7,539,692 | 307,144 | 286,746 | 127,089 | 799,494 |
| validation | 544,580 | 24,034 | 21,585 | 5,927 | 18,221 |
| test | 720,623 | 32,870 | 29,683 | 7,911 | 21,760 |
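To get a feel for how imbalanced the labels are, the counts in the table can be turned into per-split shares. This is only a small sketch: the numbers are copied from the table, it assumes they are label counts over the same unit in every column, and the helper name `label_shares` is made up for illustration.

```python
# Label counts per split, copied from the distribution table above.
counts = {
    "train":      {"O": 7_539_692, "PER": 307_144, "LOC": 286_746, "ORG": 127_089, "MISC": 799_494},
    "validation": {"O": 544_580,   "PER": 24_034,  "LOC": 21_585,  "ORG": 5_927,   "MISC": 18_221},
    "test":       {"O": 720_623,   "PER": 32_870,  "LOC": 29_683,  "ORG": 7_911,   "MISC": 21_760},
}


def label_shares(split_counts):
    """Return each label's fraction of all counted items in one split."""
    total = sum(split_counts.values())
    return {label: n / total for label, n in split_counts.items()}


shares = {split: label_shares(c) for split, c in counts.items()}

# Print one line per split with the share of every label.
for split, by_label in shares.items():
    line = ", ".join(f"{label}: {share:.1%}" for label, share in by_label.items())
    print(f"{split:<10} {line}")
```

The `O` label dominates every split, which is worth keeping in mind when reading the `O` and `Overall` columns of the result tables.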

Evaluation results

The evaluation was carried out using the evaluate Python package.
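The exact metric configuration is not given here, so the following is only an illustrative sketch of how per-label precision, recall and F1 can be computed at the token level. It is not the authors' evaluation script and does not call the `evaluate` package; the function name `per_label_scores` and the toy labels are assumptions made for the example.

```python
from collections import Counter


def per_label_scores(references, predictions):
    """Token-level precision, recall and F1 for each label.

    `references` and `predictions` are equal-length lists of label strings.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(references, predictions):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # the predicted label was wrong
            fn[gold] += 1  # the gold label was missed
    scores = {}
    for label in sorted(set(references) | set(predictions)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy example with made-up labels (not model output).
gold = ["O", "PER", "PER", "O", "LOC", "O", "MISC", "MISC"]
pred = ["O", "PER", "O", "O", "LOC", "O", "MISC", "O"]
scores = per_label_scores(gold, pred)
overall_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(scores["PER"], overall_accuracy)
```

When every token receives exactly one label, micro-averaged precision, recall and F1 all equal accuracy. That would be consistent with the identical values in the `Overall` columns below, although the card does not say so explicitly.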

frenchNER_4entities

For space reasons, we show only the F1 of the different models. You can see the full results below the table.


| Model | PER | LOC | ORG | MISC |
|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | 0.971 | 0.947 | 0.902 | 0.663 |
| cmarkea/distilcamembert-base-ner | 0.974 | 0.948 | 0.892 | 0.658 |
| NERmembert-base-3entities | A | B | C | 0 |
| NERmembert-base-4entities (this model) | 0.978 | 0.958 | 0.903 | 0.814 |
| NERmembert-large-4entities | A | B | C | D |

Full results

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|---|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | Precision | 0.952 | 0.924 | 0.870 | 0.845 | 0.986 | 0.976 |
| | Recall | 0.990 | 0.972 | 0.938 | 0.546 | 0.992 | 0.976 |
| | F1 | 0.971 | 0.947 | 0.902 | 0.663 | 0.989 | 0.976 |
| cmarkea/distilcamembert-base-ner | Precision | 0.962 | 0.933 | 0.857 | 0.830 | 0.985 | 0.976 |
| | Recall | 0.987 | 0.963 | 0.930 | 0.545 | 0.993 | 0.976 |
| | F1 | 0.974 | 0.948 | 0.892 | 0.658 | 0.989 | 0.976 |
| NERmembert-base-3entities | Precision | A | B | C | 0 | X | X |
| | Recall | A | B | C | 0 | X | X |
| | F1 | A | B | C | 0 | X | X |
| NERmembert-base-4entities (this model) | Precision | 0.973 | 0.951 | 0.888 | 0.850 | 0.993 | 0.984 |
| | Recall | 0.983 | 0.964 | 0.918 | 0.781 | 0.993 | 0.984 |
| | F1 | 0.978 | 0.958 | 0.903 | 0.814 | 0.993 | 0.984 |
| NERmembert-large-4entities | Precision | A | B | C | D | E | F |
| | Recall | A | B | C | D | E | F |
| | F1 | A | B | C | D | E | F |

In detail:

multiconer

For space reasons, we show only the F1 of the different models. You can see the full results below the table.


| Model | PER | LOC | ORG | MISC |
|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | 0.940 | 0.761 | 0.723 | 0.560 |
| cmarkea/distilcamembert-base-ner | 0.921 | 0.748 | 0.694 | 0.530 |
| NERmembert-base-3entities | A | B | C | 0 |
| NERmembert-base-4entities (this model) | 0.960 | 0.890 | 0.867 | 0.852 |
| NERmembert-large-4entities | A | B | C | D |

Full results

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|---|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | Precision | 0.908 | 0.717 | 0.753 | 0.620 | 0.936 | 0.889 |
| | Recall | 0.975 | 0.811 | 0.696 | 0.511 | 0.938 | 0.889 |
| | F1 | 0.940 | 0.761 | 0.723 | 0.560 | 0.937 | 0.889 |
| cmarkea/distilcamembert-base-ner | Precision | 0.885 | 0.738 | 0.737 | 0.589 | 0.928 | 0.881 |
| | Recall | 0.960 | 0.759 | 0.655 | 0.482 | 0.939 | 0.881 |
| | F1 | 0.921 | 0.748 | 0.694 | 0.530 | 0.934 | 0.881 |
| NERmembert-base-3entities | Precision | A | B | C | 0 | X | X |
| | Recall | A | B | C | 0 | X | X |
| | F1 | A | B | C | 0 | X | X |
| NERmembert-base-4entities (this model) | Precision | 0.954 | 0.893 | 0.851 | 0.849 | 0.979 | 0.954 |
| | Recall | 0.967 | 0.887 | 0.883 | 0.855 | 0.974 | 0.954 |
| | F1 | 0.960 | 0.890 | 0.867 | 0.852 | 0.977 | 0.954 |
| NERmembert-large-4entities | Precision | A | B | C | D | E | F |
| | Recall | A | B | C | D | E | F |
| | F1 | A | B | C | D | E | F |

multinerd

For space reasons, we show only the F1 of the different models. You can see the full results below the table.


| Model | PER | LOC | ORG | MISC |
|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | 0.962 | 0.934 | 0.888 | 0.419 |
| cmarkea/distilcamembert-base-ner | 0.972 | 0.938 | 0.884 | 0.430 |
| NERmembert-base-3entities | A | B | C | 0 |
| NERmembert-base-4entities (this model) | 0.985 | 0.973 | 0.938 | 0.770 |
| NERmembert-large-4entities | A | B | C | D |

Full results

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|---|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | Precision | 0.931 | 0.893 | 0.827 | 0.725 | 0.979 | 0.966 |
| | Recall | 0.994 | 0.980 | 0.959 | 0.295 | 0.990 | 0.966 |
| | F1 | 0.962 | 0.934 | 0.888 | 0.419 | 0.984 | 0.966 |
| cmarkea/distilcamembert-base-ner | Precision | 0.954 | 0.908 | 0.817 | 0.705 | 0.977 | 0.967 |
| | Recall | 0.991 | 0.969 | 0.963 | 0.310 | 0.990 | 0.967 |
| | F1 | 0.972 | 0.938 | 0.884 | 0.430 | 0.984 | 0.967 |
| NERmembert-base-3entities | Precision | A | B | C | 0 | X | X |
| | Recall | A | B | C | 0 | X | X |
| | F1 | A | B | C | 0 | X | X |
| NERmembert-base-4entities (this model) | Precision | 0.976 | 0.961 | 0.91 | 0.829 | 0.991 | 0.983 |
| | Recall | 0.994 | 0.985 | 0.967 | 0.719 | 0.993 | 0.983 |
| | F1 | 0.985 | 0.973 | 0.938 | 0.770 | 0.992 | 0.983 |
| NERmembert-large-4entities | Precision | A | B | C | D | E | F |
| | Recall | A | B | C | D | E | F |
| | F1 | A | B | C | D | E | F |

wikiner

For space reasons, we show only the F1 of the different models. You can see the full results below the table.


| Model | PER | LOC | ORG | MISC |
|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | 0.986 | 0.966 | 0.938 | 0.938 |
| cmarkea/distilcamembert-base-ner | 0.983 | 0.964 | 0.925 | 0.926 |
| NERmembert-base-3entities | A | B | C | 0 |
| NERmembert-base-4entities (this model) | 0.970 | 0.945 | 0.876 | 0.872 |
| NERmembert-large-4entities | A | B | C | D |

Full results

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|---|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | Precision | 0.986 | 0.962 | 0.925 | 0.943 | 0.998 | 0.992 |
| | Recall | 0.987 | 0.969 | 0.951 | 0.933 | 0.997 | 0.992 |
| | F1 | 0.986 | 0.966 | 0.938 | 0.938 | 0.998 | 0.992 |
| cmarkea/distilcamembert-base-ner | Precision | 0.982 | 0.964 | 0.910 | 0.942 | 0.997 | 0.991 |
| | Recall | 0.985 | 0.963 | 0.940 | 0.910 | 0.998 | 0.991 |
| | F1 | 0.983 | 0.964 | 0.925 | 0.926 | 0.997 | 0.991 |
| NERmembert-base-3entities | Precision | A | B | C | 0 | X | X |
| | Recall | A | B | C | 0 | X | X |
| | F1 | A | B | C | 0 | X | X |
| NERmembert-base-4entities (this model) | Precision | 0.970 | 0.944 | 0.872 | 0.878 | 0.996 | 0.986 |
| | Recall | 0.969 | 0.947 | 0.880 | 0.866 | 0.996 | 0.986 |
| | F1 | 0.970 | 0.945 | 0.876 | 0.872 | 0.996 | 0.986 |
| NERmembert-large-4entities | Precision | A | B | C | D | E | F |
| | Recall | A | B | C | D | E | F |
| | F1 | A | B | C | D | E | F |

Usage

Code

```python
from transformers import pipeline

# NER is a token-classification task; aggregation_strategy="simple"
# merges sub-word tokens into whole entity spans.
ner = pipeline(
    "token-classification",
    model="CATIE-AQ/NERmembert-base-4entities",
    tokenizer="CATIE-AQ/NERmembert-base-4entities",
    aggregation_strategy="simple",
)

results = ner(
    "Assurés de disputer l'Euro 2024 en Allemagne l'été prochain (du 14 juin au 14 juillet) depuis leur victoire aux Pays-Bas, les Bleus ont fait le nécessaire pour avoir des certitudes. Avec six victoires en six matchs officiels et un seul but encaissé, Didier Deschamps a consolidé les acquis de la dernière Coupe du monde. Les joueurs clés sont connus : Kylian Mbappé, Aurélien Tchouameni, Antoine Griezmann, Ibrahima Konaté ou encore Mike Maignan."
)

print(results)
```
```python
[{'entity_group': 'MISC',
  'score': 0.9404951632022858,
  'word': 'Euro 2024',
  'start': 22,
  'end': 31},
 {'entity_group': 'LOC',
  'score': 0.96980727,
  'word': 'Allemagne',
  'start': 35,
  'end': 44},
 {'entity_group': 'LOC',
  'score': 0.8612850904464722,
  'word': 'Pays-Bas',
  'start': 112,
  'end': 120},
 {'entity_group': 'ORG',
  'score': 0.8148028254508972,
  'word': 'les Bleus',
  'start': 122,
  'end': 131},
 {'entity_group': 'PER',
  'score': 0.9994482398033142,
  'word': 'Didier Deschamps',
  'start': 250,
  'end': 266},
 {'entity_group': 'MISC',
  'score': 0.84807388484478,
  'word': 'dernière Coupe du monde',
  'start': 296,
  'end': 319},
 {'entity_group': 'PER',
  'score': 0.9996860176324844,
  'word': 'Kylian Mbappé',
  'start': 352,
  'end': 365},
 {'entity_group': 'PER',
  'score': 0.9996881932020187,
  'word': 'Aurélien Tchouameni',
  'start': 367,
  'end': 386},
 {'entity_group': 'PER',
  'score': 0.9996924996376038,
  'word': 'Antoine Griezmann',
  'start': 388,
  'end': 405},
 {'entity_group': 'PER',
  'score': 0.9996860027313232,
  'word': 'Ibrahima Konaté',
  'start': 407,
  'end': 422},
 {'entity_group': 'PER',
  'score': 0.9996623992919922,
  'word': 'Mike Maignan',
  'start': 433,
  'end': 445}]
```
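The pipeline returns a flat list of dictionaries, so a common next step is to group the detected words by entity type and drop low-confidence hits. The sketch below shows one possible way to do that; `group_by_label` and its `min_score` threshold are hypothetical helpers introduced for illustration, not part of `transformers`, and the sample is a subset of the output above so that the example runs offline.

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.0):
    """Map each entity group to the words found, keeping document order.

    `min_score` is an illustrative threshold, not a pipeline option.
    """
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# A few items copied from the output above.
sample = [
    {"entity_group": "MISC", "score": 0.9404951632022858, "word": "Euro 2024", "start": 22, "end": 31},
    {"entity_group": "LOC", "score": 0.8612850904464722, "word": "Pays-Bas", "start": 112, "end": 120},
    {"entity_group": "ORG", "score": 0.8148028254508972, "word": "les Bleus", "start": 122, "end": 131},
    {"entity_group": "PER", "score": 0.9994482398033142, "word": "Didier Deschamps", "start": 250, "end": 266},
]

grouped = group_by_label(sample, min_score=0.85)
print(grouped)
```

With a threshold of 0.85, the `ORG` hit for "les Bleus" is filtered out while the other three entities are kept. The right threshold depends on the application.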

Try it through Space

A Space has been created to test the model. It is available here.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3

Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| 0.0407 | 1.0 | 41095 | 0.0547 | 0.9816 | 0.9816 | 0.9816 | 0.9816 |
| 0.0242 | 2.0 | 82190 | 0.0488 | 0.9843 | 0.9843 | 0.9843 | 0.9843 |
| 0.018 | 3.0 | 123285 | 0.0542 | 0.9844 | 0.9844 | 0.9844 | 0.9844 |
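The step counts in the table follow from the training set size and the batch size, and the linear scheduler determines the learning rate at each step. The sketch below reproduces that arithmetic under some assumptions that are not stated in this card: a single device, no gradient accumulation, the last partial batch kept, and no warmup. The `linear_lr` function is an illustrative re-implementation of linear decay to zero, not a call into the training library.

```python
import math

train_rows = 328_757  # training split size from the model description
batch_size = 8        # train_batch_size
num_epochs = 3
base_lr = 2e-05       # learning_rate

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step, warmup_steps=0):
    """Learning rate at `step` under linear warmup then linear decay to zero.

    warmup_steps=0 is an assumption; the card does not mention warmup.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(steps_per_epoch, total_steps)
for epoch_end in range(1, num_epochs + 1):
    step = epoch_end * steps_per_epoch
    print(f"end of epoch {epoch_end}: step {step}, lr {linear_lr(step):.2e}")
```

Under these assumptions the computed step counts match the `Step` column of the table (41095, 82190 and 123285).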

Framework versions

  • Transformers 4.36.2
  • Pytorch 2.1.2
  • Datasets 2.16.1
  • Tokenizers 0.15.0

Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.

  • Hardware Type: A100 PCIe 40/80GB
  • Hours used: 1h45min
  • Cloud Provider: Private Infrastructure
  • Carbon Efficiency (kg/kWh): 0.046 (estimated from electricitymaps for the day of January 4, 2024.)
  • Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid): 0.02 kg eq. CO2
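The formula in the last bullet can be checked with a few lines of arithmetic. The power draw is not reported here, so the sketch below works backwards from the published figures to the energy and average power they imply. The implied values are derived quantities, not measurements, and `emissions_kg` is a hypothetical helper name.

```python
hours = 1.75              # 1h45min
carbon_intensity = 0.046  # kg CO2 eq. per kWh
reported_kg = 0.02        # carbon emitted, as reported above


def emissions_kg(power_w, hours, intensity):
    """Power consumption x time x carbon intensity, in kg CO2 eq."""
    return power_w / 1000 * hours * intensity


# Work backwards: what energy and average power do the reported numbers imply?
implied_energy_kwh = reported_kg / carbon_intensity
implied_avg_power_w = implied_energy_kwh / hours * 1000

print(f"implied energy: {implied_energy_kwh:.3f} kWh")
print(f"implied average power: {implied_avg_power_w:.0f} W")
print(f"round trip: {emissions_kg(implied_avg_power_w, hours, carbon_intensity):.3f} kg")
```

The reported 0.02 kg also agrees with the `co2_eq_emissions: 20` entry in the metadata if that field is read as grams.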

Citations

Camembert-frenchNER_4entities

TODO

multiconer

@inproceedings{multiconer2-report,
title={{SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)}},
author={Fetahu, Besnik and Kar, Sudipta and Chen, Zhiyu and Rokhlenko, Oleg and Malmasi, Shervin},
booktitle={Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)},
year={2023},
publisher={Association for Computational Linguistics}}

@article{multiconer2-data,
title={{MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition}},
author={Fetahu, Besnik and Chen, Zhiyu and Kar, Sudipta and Rokhlenko, Oleg and Malmasi, Shervin},
year={2023}}

multinerd

@inproceedings{tedeschi-navigli-2022-multinerd,
title = "{M}ulti{NERD}: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)",
author = "Tedeschi, Simone and Navigli, Roberto",
booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
month = jul,
year = "2022",
address = "Seattle, United States",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-naacl.60",
doi = "10.18653/v1/2022.findings-naacl.60",
pages = "801--812"}

pii-masking-200k

@misc {ai4privacy_2023,
author = { {ai4Privacy} },
title = { pii-masking-200k (Revision 1d4c0a1) },
year = 2023,
url = { https://huggingface.co/datasets/ai4privacy/pii-masking-200k },
doi = { 10.57967/hf/1532 },
publisher = { Hugging Face }}

wikiner

@article{NOTHMAN2013151,
title = {Learning multilingual named entity recognition from Wikipedia},
journal = {Artificial Intelligence},
volume = {194},
pages = {151-175},
year = {2013},
note = {Artificial Intelligence, Wikipedia and Semi-Structured Resources},
issn = {0004-3702},
doi = {https://doi.org/10.1016/j.artint.2012.03.006},
url = {https://www.sciencedirect.com/science/article/pii/S0004370212000276},
author = {Joel Nothman and Nicky Ringland and Will Radford and Tara Murphy and James R. Curran}}

frenchNER_4entities

TODO

CamemBERT

@inproceedings{martin2020camembert,
title={CamemBERT: a Tasty French Language Model},
author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year={2020}}

License

cc-by-4.0