---
license: mit
base_model: camembert-base
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: Camembert-base-frenchNER_4entities
  results: []
datasets:
- CATIE-AQ/frenchNER_4entities
language:
- fr
widget:
- text: "Assurés de disputer l'Euro 2024 en Allemagne l'été prochain (du 14 juin au 14 juillet) depuis leur victoire aux Pays-Bas, les Bleus ont fait le nécessaire pour avoir des certitudes. Avec six victoires en six matchs officiels et un seul but encaissé, Didier Deschamps a consolidé les acquis de la dernière Coupe du monde. Les joueurs clés sont connus : Kylian Mbappé, Aurélien Tchouameni, Antoine Griezmann, Ibrahima Konaté ou encore Mike Maignan."
library_name: transformers
pipeline_tag: token-classification
co2_eq_emissions: 35
---

# Camembert-base-frenchNER_4entities

## Model Description

We present **Camembert-base-frenchNER_4entities**, a [CamemBERT base](https://huggingface.co./camembert-base) model fine-tuned for Named Entity Recognition in French on four French NER datasets covering 4 entity types (LOC, PER, ORG, MISC).
All these datasets were concatenated and cleaned into a single dataset that we called [frenchNER_4entities](https://huggingface.co./datasets/CATIE-AQ/frenchNER_4entities).
There are a total of **384,773** rows, of which **328,757** are for training, **24,131** for validation and **31,885** for testing.
Our methodology is described in a blog post available in [English](https://blog.vaniila.ai/en/NER_en/) or [French](https://blog.vaniila.ai/NER/).

## Dataset

The dataset used is [frenchNER_4entities](https://huggingface.co./datasets/CATIE-AQ/frenchNER_4entities), which contains ~385k sentences labeled with 4 entity categories plus the background label:
* PER: personality
* LOC: location
* ORG: organization
* MISC: miscellaneous
* O: background (outside any entity)

The distribution of the entities is as follows:

| Splits | O | PER | LOC | ORG | MISC |
|:------:|:-:|:---:|:---:|:---:|:----:|
| train | A | B | C | D | E |
| validation | A | B | C | D | E |
| test | A | B | C | D | E |
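For use in a token-classification head, the label inventory above can be expressed as the usual id/label mappings. A minimal sketch; note that the exact integer ordering is an assumption and should be verified against the dataset's features:

```python
# Assumed label inventory for frenchNER_4entities -- check the dataset's
# `features` before relying on this exact id ordering.
labels = ["O", "PER", "LOC", "ORG", "MISC"]

# Mappings of the kind typically passed to a model config as
# `id2label` / `label2id`
id2label = {i: lab for i, lab in enumerate(labels)}
label2id = {lab: i for i, lab in enumerate(labels)}

print(id2label)
```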
## Evaluation results

The evaluation was carried out using the [**evaluate**](https://pypi.org/project/evaluate/) python package.

### frenchNER_4entities

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|:-----:|:-------:|:---:|:---:|:---:|:----:|:-:|:-------:|
| Camembert-base-frenchNER_4entities | Precision | A | B | C | D | E | F |
| | Recall | A | B | C | D | E | F |
| | F1 | A | B | C | D | E | F |
| | Number | A | B | C | D | E | F |
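As a rough illustration of what the precision, recall and F1 columns measure, here is a minimal token-level sketch in plain Python. This is a simplification for intuition only: the seqeval metric used by the **evaluate** package scores whole entity spans, not individual tokens, and the tag sequences below are made up:

```python
def token_prf(predictions, references, entity):
    """Token-level precision/recall/F1 for one entity class.
    Simplification: seqeval evaluates full entity spans instead."""
    tp = fp = fn = 0
    for pred_seq, ref_seq in zip(predictions, references):
        for p, r in zip(pred_seq, ref_seq):
            if p == entity and r == entity:
                tp += 1          # correctly predicted entity token
            elif p == entity:
                fp += 1          # predicted entity, reference says otherwise
            elif r == entity:
                fn += 1          # missed entity token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy tag sequences (hypothetical, not from the dataset)
preds = [["O", "PER", "PER", "O", "LOC"]]
refs  = [["O", "PER", "O",   "O", "LOC"]]
p, r, f = token_prf(preds, refs, "PER")
print(p, r, f)  # 0.5 1.0 0.666...
```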
In detail:

### multiconer

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|:-----:|:-------:|:---:|:---:|:---:|:----:|:-:|:-------:|
| Camembert-base-frenchNER_4entities | Precision | A | B | C | D | E | F |
| | Recall | A | B | C | D | E | F |
| | F1 | A | B | C | D | E | F |
| | Number | A | B | C | D | E | F |
### multinerd

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|:-----:|:-------:|:---:|:---:|:---:|:----:|:-:|:-------:|
| Camembert-base-frenchNER_4entities | Precision | A | B | C | D | E | F |
| | Recall | A | B | C | D | E | F |
| | F1 | A | B | C | D | E | F |
| | Number | A | B | C | D | E | F |
### wikiner

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|:-----:|:-------:|:---:|:---:|:---:|:----:|:-:|:-------:|
| Camembert-base-frenchNER_4entities | Precision | A | B | C | D | E | F |
| | Recall | A | B | C | D | E | F |
| | F1 | A | B | C | D | E | F |
| | Number | A | B | C | D | E | F |
## Usage

### Code

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="CATIE-AQ/Camembert-base-frenchNER_4entities",
    tokenizer="CATIE-AQ/Camembert-base-frenchNER_4entities",
    grouped_entities=True,
)

result = ner(
    "Assurés de disputer l'Euro 2024 en Allemagne l'été prochain (du 14 juin au 14 juillet) depuis leur victoire aux Pays-Bas, les Bleus ont fait le nécessaire pour avoir des certitudes. Avec six victoires en six matchs officiels et un seul but encaissé, Didier Deschamps a consolidé les acquis de la dernière Coupe du monde. Les joueurs clés sont connus : Kylian Mbappé, Aurélien Tchouameni, Antoine Griezmann, Ibrahima Konaté ou encore Mike Maignan."
)

print(result)
```

### Try it through Space

A Space has been created to test the model. It is available [here](https://huggingface.co./spaces/CATIE-AQ/Camembert-NER).

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|:-------------:|:-----:|:------:|:---------------:|:---------:|:------:|:------:|:--------:|
| 0.0407 | 1.0 | 41095 | 0.0547 | 0.9816 | 0.9816 | 0.9816 | 0.9816 |
| 0.0242 | 2.0 | 82190 | 0.0488 | 0.9843 | 0.9843 | 0.9843 | 0.9843 |
| 0.018 | 3.0 | 123285 | 0.0542 | 0.9844 | 0.9844 | 0.9844 | 0.9844 |

### Framework versions

- Transformers 4.36.2
- Pytorch 2.1.2
- Datasets 2.16.1
- Tokenizers 0.15.0

## Environmental Impact

*Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact.*

- **Hardware Type:** A100 PCIe 40/80GB
- **Hours used:** 1h45min
- **Cloud Provider:** Private Infrastructure
- **Carbon Efficiency (kg/kWh):** 0.046 (estimated from [electricitymaps](https://app.electricitymaps.com/zone/FR) for the day of January 4, 2024)
- **Carbon Emitted** *(Power consumption x Time x Carbon produced based on location of power grid)*: 0.02 kg eq. CO2

## Citations

### Camembert-frenchNER_4entities

```
TODO
```

### multiconer

```
@inproceedings{multiconer2-report,
  title={{SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)}},
  author={Fetahu, Besnik and Kar, Sudipta and Chen, Zhiyu and Rokhlenko, Oleg and Malmasi, Shervin},
  booktitle={Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)},
  year={2023},
  publisher={Association for Computational Linguistics}}

@article{multiconer2-data,
  title={{MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition}},
  author={Fetahu, Besnik and Chen, Zhiyu and Kar, Sudipta and Rokhlenko, Oleg and Malmasi, Shervin},
  year={2023}}
```

### multinerd

```
@inproceedings{tedeschi-navigli-2022-multinerd,
  title = "{M}ulti{NERD}: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)",
  author = "Tedeschi, Simone and Navigli, Roberto",
  booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
  month = jul,
  year = "2022",
  address = "Seattle, United States",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.findings-naacl.60",
  doi = "10.18653/v1/2022.findings-naacl.60",
  pages = "801--812"}
```

### pii-masking-200k

```
@misc{ai4privacy_2023,
  author = {{ai4Privacy}},
  title = {pii-masking-200k (Revision 1d4c0a1)},
  year = 2023,
  url = {https://huggingface.co./datasets/ai4privacy/pii-masking-200k},
  doi = {10.57967/hf/1532},
  publisher = {Hugging Face}}
```

### wikiner

```
@article{NOTHMAN2013151,
  title = {Learning multilingual named entity recognition from Wikipedia},
  journal = {Artificial Intelligence},
  volume = {194},
  pages = {151-175},
  year = {2013},
  note = {Artificial Intelligence, Wikipedia and Semi-Structured Resources},
  issn = {0004-3702},
  doi = {https://doi.org/10.1016/j.artint.2012.03.006},
  url = {https://www.sciencedirect.com/science/article/pii/S0004370212000276},
  author = {Joel Nothman and Nicky Ringland and Will Radford and Tara Murphy and James R. Curran}}
```

### frenchNER_4entities

```
TODO
```

### CamemBERT

```
@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}}
```

## License

[cc-by-4.0](https://creativecommons.org/licenses/by/4.0/deed.en)