disham993 committed
Commit 5ce2ad2 · verified · 1 Parent(s): 85e41f8

Update README.md

Files changed (1)
  1. README.md +129 -31

README.md CHANGED
@@ -8,58 +8,156 @@ tags:
  datasets:
  - disham993/ElectricalNER
  metrics:
- - epoch: 1.0
- - eval_precision: 0.8935291782453354
- - eval_recall: 0.9075806451612904
- - eval_f1: 0.9005001000200039
- - eval_accuracy: 0.9586046624222324
- - eval_runtime: 2.509
- - eval_samples_per_second: 601.44
- - eval_steps_per_second: 9.566
  ---

- # disham993/electrical-ner-modernbert-base

- ## Model description
-
- This model is fine-tuned from [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) for token-classification tasks.

  ## Training Data

- The model was trained on the disham993/ElectricalNER dataset.

  ## Model Details
- - **Base Model:** answerdotai/ModernBERT-base
- - **Task:** token-classification
- - **Language:** en
- - **Dataset:** disham993/ElectricalNER

- ## Training procedure

- ### Training hyperparameters
- [Please add your training hyperparameters here]

- ## Evaluation results

- ### Metrics\n- epoch: 1.0\n- eval_precision: 0.8935291782453354\n- eval_recall: 0.9075806451612904\n- eval_f1: 0.9005001000200039\n- eval_accuracy: 0.9586046624222324\n- eval_runtime: 2.509\n- eval_samples_per_second: 601.44\n- eval_steps_per_second: 9.566

  ## Usage

- ```python
- from transformers import AutoTokenizer, AutoModel
-
- tokenizer = AutoTokenizer.from_pretrained("disham993/electrical-ner-modernbert-base")
- model = AutoModel.from_pretrained("disham993/electrical-ner-modernbert-base")
  ```

- ## Limitations and bias

- [Add any known limitations or biases of the model]

  ## Training Infrastructure

- [Add details about training infrastructure used]

- ## Last update

- 2024-12-30

  datasets:
  - disham993/ElectricalNER
  metrics:
+ - epoch: 5.0
+ - eval_precision: 0.9108
+ - eval_recall: 0.9248
+ - eval_f1: 0.9177
+ - eval_accuracy: 0.9664
+ - eval_runtime: 2.121
+ - eval_samples_per_second: 711.447
+ - eval_steps_per_second: 11.315
  ---
+ # electrical-ner-ModernBERT-base

+ ## Model Description

+ This model is fine-tuned from [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) for token classification, specifically Named Entity Recognition (NER) in the electrical engineering domain. It is optimized to extract entities such as components, materials, standards, and design parameters from technical text with high precision and recall.

  ## Training Data

+ The model was trained on the [disham993/ElectricalNER](https://huggingface.co/datasets/disham993/ElectricalNER) dataset, a GPT-4o-mini-generated dataset curated for the electrical engineering domain. It covers diverse technical contexts such as circuit design, testing, maintenance, installation, troubleshooting, and research.

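+ For a quick look at the data, the dataset can be loaded directly from the Hub with the `datasets` library. The column names used below (`tokens`, `ner_tags`) are assumptions based on common NER dataset layouts; check the dataset card for the exact schema.
+
+ ```python
+ from datasets import load_dataset
+
+ # Download the electrical-engineering NER dataset from the Hugging Face Hub
+ dataset = load_dataset("disham993/ElectricalNER")
+ print(dataset)  # available splits and their sizes
+
+ # Inspect one training example (column names assumed: tokens / ner_tags)
+ example = dataset["train"][0]
+ print(example["tokens"])
+ print(example["ner_tags"])
+ ```
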
  ## Model Details

+ - **Base Model:** [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
+ - **Task:** Token Classification (NER)
+ - **Language:** English (en)
+ - **Dataset:** [disham993/ElectricalNER](https://huggingface.co/datasets/disham993/ElectricalNER)
+
+ ## Training Procedure
+
+ ### Training Hyperparameters
+
+ The model was fine-tuned using the following hyperparameters:
+
+ - **Evaluation Strategy:** epoch
+ - **Learning Rate:** 1e-5
+ - **Batch Size:** 64 (for both training and evaluation)
+ - **Number of Epochs:** 5
+ - **Weight Decay:** 0.01

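+ As a rough sketch, these settings map onto the Hugging Face `TrainingArguments` as shown below; the output directory name is illustrative, and the full training script (dataset tokenization, label alignment, and the `Trainer` call) is in the GitHub repository linked under "Training Infrastructure".
+
+ ```python
+ from transformers import TrainingArguments
+
+ # The reported hyperparameters expressed as TrainingArguments
+ training_args = TrainingArguments(
+     output_dir="electrical-ner-modernbert-base",  # illustrative name
+     eval_strategy="epoch",            # "evaluation_strategy" in older transformers releases
+     learning_rate=1e-5,
+     per_device_train_batch_size=64,
+     per_device_eval_batch_size=64,
+     num_train_epochs=5,
+     weight_decay=0.01,
+ )
+ ```
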
+ ## Evaluation Results

+ The following metrics were achieved during evaluation:

+ - **Precision:** 0.9108
+ - **Recall:** 0.9248
+ - **F1 Score:** 0.9177
+ - **Accuracy:** 0.9664
+ - **Evaluation Runtime:** 2.121 seconds
+ - **Samples Per Second:** 711.447
+ - **Steps Per Second:** 11.315

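+ Precision, recall, and F1 for NER are commonly computed with the `seqeval` metric via the `evaluate` library. The snippet below is a minimal illustration with toy tag sequences rather than the actual evaluation script, and the entity label is made up for the example.
+
+ ```python
+ import evaluate
+
+ seqeval = evaluate.load("seqeval")  # requires the seqeval package
+
+ # Toy IOB-tagged sequences; the reported numbers come from model predictions
+ # on the ElectricalNER evaluation split.
+ predictions = [["O", "B-COMPONENT", "I-COMPONENT", "O"]]
+ references = [["O", "B-COMPONENT", "O", "O"]]
+
+ results = seqeval.compute(predictions=predictions, references=references)
+ print(results["overall_precision"], results["overall_recall"], results["overall_f1"])
+ ```
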
  ## Usage

+ You can use this model for Named Entity Recognition tasks as follows:

+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
+
+ model_name = "disham993/electrical-ner-modernbert-base"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForTokenClassification.from_pretrained(model_name)
+
+ nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
+
+ text = "The Xilinx Vivado development suite was used to program the Artix-7 FPGA."
+
+ ner_results = nlp(text)
+
+ def clean_and_group_entities(ner_results, min_score=0.40):
+     """
+     Cleans and groups named entity recognition (NER) results based on a minimum score threshold.
+
+     Args:
+         ner_results (list of dict): A list of dictionaries containing NER results. Each dictionary should have the keys:
+             - "word" (str): The recognized word or token.
+             - "entity_group" (str): The entity group or label.
+             - "start" (int): The start position of the entity in the text.
+             - "end" (int): The end position of the entity in the text.
+             - "score" (float): The confidence score of the entity recognition.
+         min_score (float, optional): The minimum score threshold for considering an entity. Defaults to 0.40.
+
+     Returns:
+         list of dict: A list of grouped entities that meet the minimum score threshold. Each dictionary contains:
+             - "entity_group" (str): The entity group or label.
+             - "word" (str): The concatenated word or token.
+             - "start" (int): The start position of the entity in the text.
+             - "end" (int): The end position of the entity in the text.
+             - "score" (float): The minimum confidence score of the grouped entity.
+     """
+     grouped_entities = []
+     current_entity = None
+
+     for result in ner_results:
+         # Skip entities with score below threshold
+         if result["score"] < min_score:
+             if current_entity:
+                 # Add current entity if it meets threshold
+                 if current_entity["score"] >= min_score:
+                     grouped_entities.append(current_entity)
+                 current_entity = None
+             continue
+
+         word = result["word"].replace("##", "")  # Remove subword token markers
+
+         if current_entity and result["entity_group"] == current_entity["entity_group"] and result["start"] == current_entity["end"]:
+             # Continue the current entity
+             current_entity["word"] += word
+             current_entity["end"] = result["end"]
+             current_entity["score"] = min(current_entity["score"], result["score"])
+
+             # If combined score drops below threshold, discard the entity
+             if current_entity["score"] < min_score:
+                 current_entity = None
+         else:
+             # Finalize the current entity if it meets threshold
+             if current_entity and current_entity["score"] >= min_score:
+                 grouped_entities.append(current_entity)
+
+             # Start a new entity
+             current_entity = {
+                 "entity_group": result["entity_group"],
+                 "word": word,
+                 "start": result["start"],
+                 "end": result["end"],
+                 "score": result["score"]
+             }
+
+     # Add the last entity if it meets threshold
+     if current_entity and current_entity["score"] >= min_score:
+         grouped_entities.append(current_entity)
+
+     return grouped_entities
+
+ cleaned_results = clean_and_group_entities(ner_results)
  ```
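+
+ The grouped entities can then be printed for quick inspection (the label names that appear depend on the ElectricalNER tag set):
+
+ ```python
+ # Print each grouped entity with its label and confidence score
+ for entity in cleaned_results:
+     print(f"{entity['entity_group']:<15} {entity['word']} (score: {entity['score']:.3f})")
+ ```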

+ ## Limitations and Bias
+
+ While this model performs well in the electrical engineering domain, it is not designed for use in other domains. Additionally, it may:
+
+ - Misclassify entities due to potential inaccuracies in the GPT-4o-mini-generated dataset.
+ - Struggle with ambiguous contexts or low-confidence predictions; this is mitigated with the help of the `clean_and_group_entities` function.
+
+ This model is intended for research and educational purposes only, and users are encouraged to validate results before applying them to critical applications.

  ## Training Infrastructure

+ For a complete guide covering the entire process, from data tokenization to pushing the model to the Hugging Face Hub, please refer to the [GitHub repository](https://github.com/di37/ner-electrical-finetuning).

+ ## Last Update

+ 2024-12-30