ParisNeo committed (verified)
Commit c0763b7 · 1 Parent(s): 499f216

Upload 11 files

README.md CHANGED
@@ -1,210 +1,142 @@
1
-
2
- ---
3
- license: apache-2.0
4
- datasets:
5
- - QuotaClimat/frugalaichallenge-text-train
6
- language:
7
- - en
8
- metrics:
9
- - accuracy
10
- - f1
11
- base_model:
12
- - huawei-noah/TinyBERT_General_4L_312D
13
- library_name: transformers
14
- tags:
15
- - climate-change
16
- - text-classification
17
- - efficient-training
18
- - data-augmentation
19
- ---
20
-
21
- # Model Card: climate-skepticism-classifier
22
-
23
- ## Model Overview
24
- This model implements a novel approach to classifying climate change skepticism arguments
25
- by utilizing Large Language Models (LLMs) for data rebalancing. The base architecture uses TinyBERT with
26
- custom modifications for handling imbalanced datasets across 8 distinct categories of climate skepticism.
27
- The model achieves an accuracy of 83% with a lower carbon footprint than the BERT version.
28
-
29
- The model categorizes text into the following skepticism types:
30
- - Fossil fuel necessity arguments
31
- - Non-relevance claims
32
- - Climate change denial
33
- - Anthropogenic cause denial
34
- - Impact minimization
35
- - Bias allegations
36
- - Scientific reliability questions
37
- - Solution opposition
38
-
39
- The unique feature of this model is its use of LLM-based data rebalancing to address the inherent class
40
- imbalance in climate skepticism detection, ensuring robust performance across all argument categories.
41
-
42
- ## Dataset and Augmentation Strategy
43
- - **Dataset structure & labels**:
44
- The dataset contains text data with associated labels representing different types of climate disinformation claims.
45
-
46
- quote: The actual quote or claim about climate change
47
- label: Following categories:
48
- - 0_not_relevant: No relevant claim detected or claims that don't fit other categories
49
- - 1_not_happening: Claims denying the occurrence of global warming and its effects - Global warming is not happening. Climate change is NOT leading to melting ice (such as glaciers, sea ice, and permafrost), increased extreme weather, or rising sea levels. Cold weather also shows that climate change is not happening
50
- 2_not_human: Claims denying human responsibility in climate change - Greenhouse gases from humans are not causing climate change.
51
- - 3_not_bad: Claims minimizing or denying negative impacts of climate change - The impacts of climate change will not be bad and might even be beneficial.
52
- - 4_solutions_harmful_unnecessary: Claims against climate solutions - Climate solutions are harmful or unnecessary
53
- - 5_science_is_unreliable: Claims questioning climate science validity - Climate science is uncertain, unsound, unreliable, or biased.
54
- - 6_proponents_biased: Claims attacking climate scientists and activists - Climate scientists and proponents of climate action are alarmist, biased, wrong, hypocritical, corrupt, and/or politically motivated.
55
- - 7_fossil_fuels_needed: Claims promoting fossil fuel necessity - We need fossil fuels for economic growth, prosperity, and to maintain our standard of living.
56
-
57
- - **Source**: Frugal AI Challenge Text Task Dataset
58
- - **Data Split**:
59
- - Training: 70%
60
- - Validation: 10%
61
- - Test: 20%
62
- - **Data Augmentation**:
63
- - Utilized LoLLMs (Lord of Large Language Models) for intelligent data augmentation
64
- - Augmentation factor: 1.1x on training data only
65
- - Balanced class distribution through LLM-guided generation
66
- - Original training samples preserved to maintain data authenticity
67
- - **Preprocessing**: Tokenization using `BertTokenizer` with padding and truncation to a maximum sequence length of 128
68
-
69
- ## Model Selection Rationale
70
- - **Base Model**: `huawei-noah/TinyBERT_General_4L_312D`
71
- - Chosen for computational efficiency (4 layers instead of 12)
72
- - 312D hidden dimensions for reduced parameter count
73
- - Maintains strong performance while reducing carbon footprint
74
- - Aligned with Frugal AI Challenge objectives
75
- - **Environmental Considerations**:
76
- - 7.5x reduction in compute requirements compared to base BERT
77
- - Smaller memory footprint for deployment
78
- - Optimized for edge device compatibility
79
-
80
- ## Model Architecture
81
- - **Base Model**: `huawei-noah/TinyBERT_General_4L_312D`
82
- - **Classification Head**: cross-entropy loss with class weights
83
- - **Number of Labels**: 7
84
-
85
- ## Training Details
86
- - **Optimizer**: AdamW
87
- - **Learning Rate**: 2e-5
88
- - **Batch Size**: 16 (for both training and evaluation)
89
- - **Epochs**: 3
90
- - **Weight Decay**: 0.01
91
- - **Evaluation Strategy**: Performed at the end of each epoch
92
- - **Hardware**: Trained on GPUs for efficient computation
93
-
94
- ## Performance Metrics (Validation Set)
95
- The following metrics were computed on the validation set:
96
-
97
- | Class | Precision | Recall | F1-Score | Support |
98
- |-------|-----------|--------|----------|---------|
99
- | not_relevant | 0.88 | 0.82 | 0.85 | 130.0 |
100
- | not_happening | 0.82 | 0.93 | 0.87 | 59.0 |
101
- | not_human | 0.80 | 0.86 | 0.83 | 56.0 |
102
- | not_bad | 0.87 | 0.84 | 0.85 | 31.0 |
103
- | fossil_fuels_needed | 0.87 | 0.84 | 0.85 | 62.0 |
104
- | science_unreliable | 0.78 | 0.77 | 0.77 | 64.0 |
105
- | proponents_biased | 0.73 | 0.75 | 0.74 | 63.0 |
106
-
107
- - **Overall Accuracy**: 0.83
108
- - **Macro Average**: Precision: 0.82, Recall: 0.83, F1-Score: 0.83
109
- - **Weighted Average**: Precision: 0.83, Recall: 0.83, F1-Score: 0.83
110
-
111
- ## Training Evolution
112
- ### Training and Validation Loss
113
- ![Training Loss](./training_loss_plot.png)
114
-
115
- ### Validation Accuracy
116
- ![Validation Accuracy](./validation_accuracy_plot.png)
117
-
118
- ## Confusion Matrix
119
- ![Confusion Matrix](./confusion_matrix.png)
120
-
121
- ## Data Processing Pipeline
122
- ```python
123
- from pathlib import Path
124
- from typing import Dict, List, Tuple
125
-
126
- def process_dataset(
127
- data_path: Path,
128
- augmentation_factor: float = 1.1,
129
- train_split: float = 0.7,
130
- val_split: float = 0.1
131
- ) -> Tuple[Dict, Dict, Dict]:
132
- """
133
- Process and augment the dataset using LoLLMs.
134
-
135
- Args:
136
- data_path: Path to the raw dataset
137
- augmentation_factor: Factor for data augmentation
138
- train_split: Proportion of data for training
139
- val_split: Proportion of data for validation
140
-
141
- Returns:
142
- Tuple of train, validation and test datasets
143
- """
144
- # Implementation details...
145
- ```
146
-
147
- ## Environmental Impact
148
- - Training energy consumption: (not available yet) kWh
149
- - Estimated CO2 emissions: (not available yet) kg
150
- - Comparison to baseline BERT: (not available yet) estimated to be around ~87% reduction in environmental impact
151
-
152
- ## Class Mapping
153
- The mapping between model output indices and class names is as follows:
154
- 0: not_relevant, 1: not_happening, 2: not_human, 3: not_bad, 4: fossil_fuels_needed, 5: science_unreliable, 6: proponents_biased
155
-
156
- ## Usage
157
- ```python
158
- from transformers import AutoModelForSequenceClassification, AutoTokenizer
159
-
160
- # Load the fine-tuned model and tokenizer
161
- model = AutoModelForSequenceClassification.from_pretrained("climate-skepticism-classifier")
162
- tokenizer = AutoTokenizer.from_pretrained("climate-skepticism-classifier")
163
-
164
- # Tokenize input text
165
- text = "Your input text here"
166
- inputs = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=128)
167
-
168
- # Perform inference
169
- outputs = model(**inputs)
170
- predicted_class = outputs.logits.argmax(-1).item()
171
-
172
- print(f"Predicted Class: {predicted_class}")
173
- ```
174
-
175
- ## Key Features
176
- - **Class Weighting**: Addressed dataset imbalance by incorporating class weights during training
177
- - **Custom Loss Function**: Used weighted cross-entropy loss for better handling of underrepresented classes
178
- - **Evaluation Metrics**: Comprehensive metrics computed for model assessment
179
- - **Data Augmentation**: LLM-based augmentation for balanced training
180
- - **Environmental Consciousness**: Optimized architecture for reduced carbon footprint
181
-
182
- ## Limitations
183
- - Performance may vary on extremely imbalanced datasets
184
- - Requires significant computational resources for training
185
- - Model performance is dependent on the quality of LLM-generated balanced data
186
- - May not perform optimally on very long text sequences (>128 tokens)
187
- - May struggle with novel or evolving climate skepticism arguments
188
- - Could be sensitive to subtle variations in argument framing
189
- - May require periodic updates to capture emerging skepticism patterns
190
-
191
- ## Version History
192
- - v1.0.0: Initial release with LoLLMs augmentation
193
- - v1.0.1: Performance metrics update
194
- - v1.1.0: Added environmental impact assessment
195
-
196
- ## Acknowledgments
197
- Special thanks to the Frugal AI Challenge organizers for providing the dataset and fostering innovation in AI research.
198
-
199
- ## Citations
200
- ```bibtex
201
- @misc{climate-skepticism-classifier,
202
- author = {ParisNeo},
203
- title = {Climate Skepticism Classifier with LoLLMs Augmentation},
204
- year = {2025},
205
- publisher = {HuggingFace},
206
- }
207
- ```
208
-
209
-
210
-
 
1
+
2
+ ---
3
+ license: apache-2.0
4
+ datasets:
5
+ - QuotaClimat/frugalaichallenge-text-train
6
+ language:
7
+ - en
8
+ metrics:
9
+ - accuracy
10
+ - f1
11
+ base_model:
12
+ - huawei-noah/TinyBERT_General_4L_312D
13
+ library_name: transformers
14
+ ---
15
+
16
+ # Model Card: climate-skepticism-classifier
17
+
18
+ ## Model Overview
19
+ This model implements a novel approach to classifying climate change skepticism arguments
20
+ by using Large Language Models (LLMs) for data rebalancing. The base architecture is TinyBERT with
21
+ custom modifications for handling imbalanced datasets across 8 distinct categories of climate skepticism.
22
+ The model reaches an overall accuracy of 0.83 on the held-out validation set (see Performance Metrics below).
23
+
24
+ The model categorizes text into the following skepticism types:
25
+ - Fossil fuel necessity arguments
26
+ - Non-relevance claims
27
+ - Climate change denial
28
+ - Anthropogenic cause denial
29
+ - Impact minimization
30
+ - Bias allegations
31
+ - Scientific reliability questions
32
+ - Solution opposition
33
+
34
+ The unique feature of this model is its use of LLM-based data rebalancing to address the inherent class
35
+ imbalance in climate skepticism detection, ensuring robust performance across all argument categories.
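+
+ The rebalancing step itself is not shipped with this card. As a rough sketch only, minority classes could be upsampled by asking an LLM to paraphrase existing quotes; the `paraphrase` callable below is a stand-in for the actual LLM-backed generation and is purely illustrative:
+
+ ```python
+ from collections import Counter
+ from typing import Callable, List, Tuple
+
+ def rebalance_with_llm(samples: List[Tuple[str, int]],
+                        paraphrase: Callable[[str], str]) -> List[Tuple[str, int]]:
+     """Upsample minority classes by paraphrasing existing quotes with an LLM."""
+     counts = Counter(label for _, label in samples)
+     target = max(counts.values())
+     augmented = list(samples)
+     for label, count in counts.items():
+         originals = [text for text, lab in samples if lab == label]
+         # Add (target - count) paraphrased samples, cycling over the originals.
+         for i in range(target - count):
+             augmented.append((paraphrase(originals[i % len(originals)]), label))
+     return augmented
+ ```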
36
+
37
+ ## Dataset
38
+ - **Source**: Frugal AI Challenge Text Task Dataset
39
+ - **Classes**: 7 labels, one per climate-skepticism category (see the Class Mapping section below)
40
+ - **Preprocessing**: Tokenization using `BertTokenizer` with padding and truncation to a maximum sequence length of 128.
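+
+ A minimal sketch of this preprocessing (the example sentence is illustrative only):
+
+ ```python
+ from transformers import BertTokenizer
+
+ tokenizer = BertTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")
+
+ # Padding and truncation to the maximum sequence length of 128, as above.
+ encoded = tokenizer(
+     "Global warming stopped in 1998.",
+     padding="max_length",
+     truncation=True,
+     max_length=128,
+     return_tensors="pt",
+ )
+ print(encoded["input_ids"].shape)  # torch.Size([1, 128])
+ ```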
41
+
42
+ ## Model Architecture
43
+ - **Base Model**: `huawei-noah/TinyBERT_General_4L_312D`
44
+ - **Classification Head**: a single linear layer trained with class-weighted cross-entropy loss (see Key Features below)
45
+ - **Number of Labels**: 7
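+
+ For reference, this architecture corresponds roughly to the following (a sketch; the 7-way classification head is randomly initialised and then learned during fine-tuning):
+
+ ```python
+ from transformers import BertForSequenceClassification
+
+ # TinyBERT is a BERT-architecture checkpoint, so the standard BERT
+ # sequence-classification class applies.
+ model = BertForSequenceClassification.from_pretrained(
+     "huawei-noah/TinyBERT_General_4L_312D",
+     num_labels=7,
+ )
+ ```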
46
+
47
+ ## Training Details
48
+ - **Optimizer**: AdamW
49
+ - **Learning Rate**: 2e-5
50
+ - **Batch Size**: 16 (for both training and evaluation)
51
+ - **Epochs**: 3
52
+ - **Weight Decay**: 0.01
53
+ - **Evaluation Strategy**: Performed at the end of each epoch
54
+ - **Hardware**: Trained on GPUs for efficient computation
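+
+ A sketch of how these settings map onto the Hugging Face `Trainer` API; the exact training script is not included in this card, and `model`, `train_dataset` and `val_dataset` are assumed to be the classifier from the previous section and the tokenized train/validation splits:
+
+ ```python
+ from transformers import Trainer, TrainingArguments
+
+ args = TrainingArguments(
+     output_dir="./climate-skepticism-classifier",
+     learning_rate=2e-5,
+     per_device_train_batch_size=16,
+     per_device_eval_batch_size=16,
+     num_train_epochs=3,
+     weight_decay=0.01,
+     evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
+ )
+
+ # Trainer uses AdamW by default, matching the optimizer listed above.
+ trainer = Trainer(
+     model=model,
+     args=args,
+     train_dataset=train_dataset,
+     eval_dataset=val_dataset,
+ )
+ trainer.train()
+ ```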
55
+
56
+ ## Performance Metrics (Validation Set)
57
+ The following metrics were computed on the validation set (not the test set, which remains private for the competition):
58
+
59
+ | Class | Precision | Recall | F1-Score | Support |
60
+ |-------|-----------|--------|----------|---------|
61
+ | not_relevant | 0.88 | 0.82 | 0.85 | 130 |
62
+ | not_happening | 0.82 | 0.93 | 0.87 | 59 |
63
+ | not_human | 0.80 | 0.86 | 0.83 | 56 |
64
+ | not_bad | 0.87 | 0.84 | 0.85 | 31 |
65
+ | fossil_fuels_needed | 0.87 | 0.84 | 0.85 | 62 |
66
+ | science_unreliable | 0.78 | 0.77 | 0.77 | 64 |
67
+ | proponents_biased | 0.73 | 0.75 | 0.74 | 63 |
68
+
69
+
70
+ - **Overall Accuracy**: 0.83
71
+ - **Macro Average**: Precision: 0.82, Recall: 0.83, F1-Score: 0.83
72
+ - **Weighted Average**: Precision: 0.83, Recall: 0.83, F1-Score: 0.83
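+
+ A per-class table like the one above can be regenerated from raw predictions with scikit-learn, assuming `y_true` and `y_pred` hold the validation labels and the predicted class indices:
+
+ ```python
+ from sklearn.metrics import classification_report
+
+ # Class names in index order 0-6 (see the class mapping below).
+ label_names = [
+     "not_relevant", "not_happening", "not_human", "not_bad",
+     "fossil_fuels_needed", "science_unreliable", "proponents_biased",
+ ]
+ print(classification_report(y_true, y_pred, target_names=label_names))
+ ```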
73
+
74
+
75
+ ## Training Evolution
76
+ ### Training and Validation Loss
77
+ The training and validation loss evolution over epochs is shown below:
78
+
79
+ ![Training Loss](./training_loss_plot.png)
80
+
81
+ ### Validation Accuracy
82
+ The validation accuracy evolution over epochs is shown below:
83
+
84
+ ![Validation Accuracy](./validation_accuracy_plot.png)
85
+
86
+ ## Confusion Matrix
87
+ The confusion matrix below illustrates the model's performance on the validation set, highlighting areas of strength and potential misclassifications:
88
+
89
+ ![Confusion Matrix](./confusion_matrix.png)
90
+
91
+ ## Key Features
92
+ - **Class Weighting**: Addressed dataset imbalance by incorporating class weights during training.
93
+ - **Custom Loss Function**: Used weighted cross-entropy loss for better handling of underrepresented classes.
94
+ - **Evaluation Metrics**: Accuracy, precision, recall, and F1-score were computed to provide a comprehensive understanding of the model's performance.
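+
+ The exact weighting scheme is not documented here; a common choice, shown below as a sketch, is inverse class frequency fed into a weighted cross-entropy loss (`train_labels` is assumed to be the list of integer labels for the training split):
+
+ ```python
+ import torch
+ from torch import nn
+
+ # Inverse-frequency ("balanced") class weights over the 7 labels.
+ counts = torch.bincount(torch.tensor(train_labels), minlength=7).float()
+ class_weights = counts.sum() / (len(counts) * counts)
+
+ # Weighted cross-entropy applied to the model logits during training.
+ loss_fn = nn.CrossEntropyLoss(weight=class_weights)
+ ```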
95
+
96
+ ## Class Mapping
97
+ The mapping between model output indices and class names is as follows:
98
+ 0: not_relevant, 1: not_happening, 2: not_human, 3: not_bad, 4: fossil_fuels_needed, 5: science_unreliable, 6: proponents_biased
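+
+ The same mapping, expressed as dictionaries that can be attached to the model config (`id2label` / `label2id`):
+
+ ```python
+ id2label = {
+     0: "not_relevant",
+     1: "not_happening",
+     2: "not_human",
+     3: "not_bad",
+     4: "fossil_fuels_needed",
+     5: "science_unreliable",
+     6: "proponents_biased",
+ }
+ label2id = {name: idx for idx, name in id2label.items()}
+ ```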
99
+
100
+ ## Usage
101
+ This model performs multi-class text classification, assigning input text to one of the predefined climate-skepticism classes listed in the Class Mapping section above. Thanks to its weighted loss function, it is particularly suited to datasets with class imbalance.
102
+
103
+ ### Example Usage
104
+ ```python
105
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
106
+
107
+ # Load the fine-tuned model and tokenizer
108
+ model = AutoModelForSequenceClassification.from_pretrained("climate-skepticism-classifier")
109
+ tokenizer = AutoTokenizer.from_pretrained("climate-skepticism-classifier")
110
+
111
+ # Tokenize input text
112
+ text = "Your input text here"
113
+ inputs = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=128)
114
+
115
+ # Perform inference
116
+ outputs = model(**inputs)
117
+ predicted_class = outputs.logits.argmax(-1).item()
118
+
119
+ print(f"Predicted Class: {predicted_class}")
120
+ ```
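+
+ To report a human-readable label and a confidence score instead of the raw index, the example above can be extended, for instance, with the `id2label` dictionary from the Class Mapping section:
+
+ ```python
+ import torch
+
+ # Convert logits to probabilities and look up the class name.
+ probabilities = torch.softmax(outputs.logits, dim=-1)
+ confidence = probabilities[0, predicted_class].item()
+ print(f"{id2label[predicted_class]} ({confidence:.2%})")
+ ```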
121
+
122
+ ## Limitations
123
+ - Performance may vary on extremely imbalanced datasets
124
+ - Requires significant computational resources for training
125
+ - Model performance is dependent on the quality of LLM-generated balanced data
126
+ - May not perform optimally on very long text sequences (>128 tokens)
127
+ - May struggle with novel or evolving climate skepticism arguments
128
+ - Could be sensitive to subtle variations in argument framing
129
+ - May require periodic updates to capture emerging skepticism patterns
130
+
131
+ ## Citation
132
+ If you use this model, please cite:
133
+ @article{your_name2024climateskepticism,
134
+ title={LLM-Rebalanced Transformer for Climate Change Skepticism Classification},
135
+ author={Your Name},
136
+ year={2024},
137
+ journal={Preprint}
138
+ }
139
+
140
+ ## Acknowledgments
141
+ Special thanks to the Frugal AI Challenge organizers for providing the dataset and fostering innovation in AI research.
142
+
classification_report.txt ADDED
@@ -0,0 +1,14 @@
1
+ precision recall f1-score support
2
+
3
+ not_relevant 0.85 0.78 0.82 324
4
+ not_happening 0.90 0.87 0.88 148
5
+ not_human 0.81 0.80 0.81 141
6
+ not_bad 0.85 0.90 0.87 77
7
+ solutions_harmful_unnecessary 0.80 0.74 0.77 155
8
+ science_unreliable 0.70 0.83 0.76 160
9
+ proponents_biased 0.70 0.74 0.72 157
10
+ fossil_fuels_needed 0.74 0.75 0.75 57
11
+
12
+ accuracy 0.80 1219
13
+ macro avg 0.79 0.80 0.80 1219
14
+ weighted avg 0.80 0.80 0.80 1219
config.json CHANGED
@@ -1,5 +1,5 @@
1
  {
2
- "_name_or_path": "./TinyBert-fine-tuned",
3
  "architectures": [
4
  "BertForSequenceClassification"
5
  ],
 
1
  {
2
+ "_name_or_path": "huawei-noah/TinyBERT_General_4L_312D",
3
  "architectures": [
4
  "BertForSequenceClassification"
5
  ],
special_tokens_map.json CHANGED
@@ -1,37 +1,7 @@
1
  {
2
- "cls_token": {
3
- "content": "[CLS]",
4
- "lstrip": false,
5
- "normalized": false,
6
- "rstrip": false,
7
- "single_word": false
8
- },
9
- "mask_token": {
10
- "content": "[MASK]",
11
- "lstrip": false,
12
- "normalized": false,
13
- "rstrip": false,
14
- "single_word": false
15
- },
16
- "pad_token": {
17
- "content": "[PAD]",
18
- "lstrip": false,
19
- "normalized": false,
20
- "rstrip": false,
21
- "single_word": false
22
- },
23
- "sep_token": {
24
- "content": "[SEP]",
25
- "lstrip": false,
26
- "normalized": false,
27
- "rstrip": false,
28
- "single_word": false
29
- },
30
- "unk_token": {
31
- "content": "[UNK]",
32
- "lstrip": false,
33
- "normalized": false,
34
- "rstrip": false,
35
- "single_word": false
36
- }
37
  }
 
1
  {
2
+ "cls_token": "[CLS]",
3
+ "mask_token": "[MASK]",
4
+ "pad_token": "[PAD]",
5
+ "sep_token": "[SEP]",
6
+ "unk_token": "[UNK]"
7
  }
tokenizer_config.json CHANGED
@@ -46,19 +46,12 @@
46
  "do_basic_tokenize": true,
47
  "do_lower_case": true,
48
  "mask_token": "[MASK]",
49
- "max_length": 128,
50
  "model_max_length": 1000000000000000019884624838656,
51
  "never_split": null,
52
- "pad_to_multiple_of": null,
53
  "pad_token": "[PAD]",
54
- "pad_token_type_id": 0,
55
- "padding_side": "right",
56
  "sep_token": "[SEP]",
57
- "stride": 0,
58
  "strip_accents": null,
59
  "tokenize_chinese_chars": true,
60
  "tokenizer_class": "BertTokenizer",
61
- "truncation_side": "right",
62
- "truncation_strategy": "longest_first",
63
  "unk_token": "[UNK]"
64
  }
 
46
  "do_basic_tokenize": true,
47
  "do_lower_case": true,
48
  "mask_token": "[MASK]",
 
49
  "model_max_length": 1000000000000000019884624838656,
50
  "never_split": null,
 
51
  "pad_token": "[PAD]",
 
 
52
  "sep_token": "[SEP]",
 
53
  "strip_accents": null,
54
  "tokenize_chinese_chars": true,
55
  "tokenizer_class": "BertTokenizer",
 
 
56
  "unk_token": "[UNK]"
57
  }