gonzalez-agirre committed 046e7ce (parent: a7be7f3): Update README.md
# Catalan BERTa-v2 (roberta-base-ca-v2) finetuned for Named Entity Recognition

## Table of Contents
- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Use](#how-to-use)
- [Training](#training)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
  - [Variable and Metrics](#variable-and-metrics)
  - [Evaluation Results](#evaluation-results)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Funding](#funding)
- [Contributions](#contributions)

## Model description

The **roberta-base-ca-v2-cased-ner** is a Named Entity Recognition (NER) model for the Catalan language, fine-tuned from the [roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) model, a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-sized corpus collected from publicly available corpora and crawlers (see the roberta-base-ca-v2 model card for more details).
## Intended Uses and Limitations

The **roberta-base-ca-v2-cased-ner** model can be used to recognize named entities in a given text. The model is limited by its training dataset and may not generalize well to all use cases.

## How to Use

Here is how to use this model in PyTorch:

```python
from pprint import pprint

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("projecte-aina/roberta-base-ca-v2-cased-ner")
model = AutoModelForTokenClassification.from_pretrained("projecte-aina/roberta-base-ca-v2-cased-ner")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Em dic Lluïsa i visc a Santa Maria del Camí."

ner_results = nlp(example)
pprint(ner_results)
```
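The pipeline above returns one prediction per (sub)token. As a hedged illustration of how such token-level BIO tags can be merged into whole entities, here is a minimal sketch; the `sample` list is invented for demonstration and is not actual model output. In practice the `transformers` pipeline can do this grouping itself via its `aggregation_strategy` parameter.

```python
# Minimal sketch: merge token-level BIO tags into whole entities.
# The `sample` list below is invented for illustration; it is NOT real
# output of roberta-base-ca-v2-cased-ner.

def merge_entities(predictions):
    """Group consecutive B-/I- tagged tokens into (type, text) entities."""
    entities, current = [], None
    for p in predictions:
        tag = p["entity"]
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            if current is not None:
                entities.append(current)
            current = {"type": tag[2:], "words": [p["word"]]}
        elif tag.startswith("I-") and tag[2:] == current["type"]:
            current["words"].append(p["word"])
        else:  # an "O" tag (or an entity-type switch) closes any open entity
            if current is not None:
                entities.append(current)
            current = None
    if current is not None:
        entities.append(current)
    return [{"type": e["type"], "text": " ".join(e["words"])} for e in entities]

sample = [
    {"word": "Lluïsa", "entity": "B-PER"},
    {"word": "Santa", "entity": "B-LOC"},
    {"word": "Maria", "entity": "I-LOC"},
    {"word": "del", "entity": "I-LOC"},
    {"word": "Camí", "entity": "I-LOC"},
]
print(merge_entities(sample))
# [{'type': 'PER', 'text': 'Lluïsa'}, {'type': 'LOC', 'text': 'Santa Maria del Camí'}]
```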
## Training

### Training data

We used the Catalan NER dataset [Ancora-ca-NER](https://huggingface.co/datasets/projecte-aina/ancora-ca-ner) for training and evaluation.

### Training Procedure

The model was trained with a batch size of 16 and a learning rate of 5e-5 for 5 epochs. We then selected the best checkpoint using the downstream task metric on the corresponding development set, and finally evaluated it on the test set.
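The hyperparameters above could be expressed as Hugging Face `TrainingArguments`. This is a hedged sketch under that assumption, not the project's published training script (see the [GitHub repository](https://github.com/projecte-aina/club) for the actual scripts):

```python
# Hedged sketch: the card's stated hyperparameters as TrainingArguments.
# Not the project's actual training configuration.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="roberta-base-ca-v2-cased-ner",
    per_device_train_batch_size=16,  # batch size 16
    learning_rate=5e-5,              # learning rate 5e-5
    num_train_epochs=5,              # 5 epochs
    evaluation_strategy="epoch",     # `eval_strategy` in newer transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,     # keep the best development-set checkpoint
    metric_for_best_model="f1",      # selection by the downstream task metric
)
```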
## Evaluation

### Variable and Metrics

This model was fine-tuned to maximize the F1 score.
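For reference, F1 in NER is computed at the entity level: a prediction counts only on an exact match of entity type and span. A minimal sketch (simplified relative to standard tooling such as the `seqeval` package, which is typically used in practice):

```python
# Minimal sketch of entity-level F1 (simplified; real evaluations
# typically use the seqeval package).

def entity_f1(gold, pred):
    """gold/pred: sets of (entity_type, start, end) tuples."""
    tp = len(gold & pred)                       # exact-match true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("PER", 2, 3), ("LOC", 7, 11)}
pred = {("PER", 2, 3), ("ORG", 7, 11)}          # one of two entities mistyped
print(entity_f1(gold, pred))                    # P=0.5, R=0.5 -> F1=0.5
```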
### Evaluation results

We evaluated the _roberta-base-ca-v2-cased-ner_ on the Ancora-ca-ner test set against standard multilingual and monolingual baselines:

| Model | Ancora-ca-ner (F1) |
 
For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/club).

## Licensing Information

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Citation Information

If you use any of these resources (datasets or models) in your work, please cite our latest paper:
```bibtex
@inproceedings{armengol-estape-etal-2021-multilingual,
```
 
### Funding

This work was funded by the [Catalan Government](https://politiquesdigitals.gencat.cat/en/inici/index.html) within the framework of the [AINA project](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).

## Contributions

[N/A]