metadata

license: mit
base_model: BAAI/bge-base-en-v1.5
tags:
  - generated_from_trainer
model-index:
  - name: SECTOR-multilabel-bge
    results: []
datasets:
  - GIZ/policy_classification

SECTOR-multilabel-bge

This model is a fine-tuned version of BAAI/bge-base-en-v1.5 on the Policy-Classification dataset (GIZ/policy_classification).

The loss function, BCEWithLogitsLoss, is modified with pos_weight to focus on recall. Because of this weighting, the evaluation metrics rather than the loss are used to assess model performance during training. The model achieves the following results on the evaluation set:

Loss: 0.6114
Precision-micro: 0.6428
Precision-samples: 0.7488
Precision-weighted: 0.6519
Recall-micro: 0.7855
Recall-samples: 0.8627
Recall-weighted: 0.7855
F1-micro: 0.7071
F1-samples: 0.7638
F1-weighted: 0.7109
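
To illustrate how pos_weight shifts BCEWithLogitsLoss toward recall, here is a minimal pure-Python sketch of the per-element loss as defined in the PyTorch documentation, -[p * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]. The helper names and the example weight of 3.0 are hypothetical; the pos_weight values actually used for this model are not stated in this card.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(logit: float, target: float, pos_weight: float = 1.0) -> float:
    """Per-element BCE-with-logits loss with a positive-class weight.

    Uses log(sigmoid(x)) = -softplus(-x) and log(1 - sigmoid(x)) = -softplus(x).
    """
    positive_term = pos_weight * target * softplus(-logit)
    negative_term = (1.0 - target) * softplus(logit)
    return positive_term + negative_term


# A missed positive (target 1, negative logit) costs pos_weight times more,
# while the cost of a false positive (target 0) is unchanged.
missed_plain = weighted_bce_with_logits(-2.0, 1.0, pos_weight=1.0)
missed_weighted = weighted_bce_with_logits(-2.0, 1.0, pos_weight=3.0)  # 3.0 is illustrative
false_pos_plain = weighted_bce_with_logits(2.0, 0.0, pos_weight=1.0)
false_pos_weighted = weighted_bce_with_logits(2.0, 0.0, pos_weight=3.0)
```

Since only the positive term is scaled, a weight above 1 pushes the model to avoid false negatives at the expense of precision, which is consistent with recall exceeding precision in the results above.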

Model description

The purpose of this model is to predict multiple labels simultaneously for a given input. Specifically, the model predicts the following Sector labels: Agriculture, Buildings, Coastal Zone, Cross-Cutting Area, Disaster Risk Management (DRM), Economy-wide, Education, Energy, Environment, Health, Industries, LULUCF/Forestry, Social Development, Tourism, Transport, Urban, Waste, Water.
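
As a rough sketch of what multi-label prediction means in practice, the snippet below turns one vector of 18 raw logits into a list of Sector labels by applying an independent sigmoid per label. Several points are assumptions rather than facts from this card: the label order (taken from the order listed above), the 0.5 threshold, and the helper name decode_sectors. Check the model's own id2label configuration before relying on the order.

```python
import math

# Assumed label order: the order in which the card lists the sectors.
SECTOR_LABELS = [
    "Agriculture", "Buildings", "Coastal Zone", "Cross-Cutting Area",
    "Disaster Risk Management (DRM)", "Economy-wide", "Education", "Energy",
    "Environment", "Health", "Industries", "LULUCF/Forestry",
    "Social Development", "Tourism", "Transport", "Urban", "Waste", "Water",
]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_sectors(logits, threshold: float = 0.5):
    """Return every label whose sigmoid probability reaches the threshold.

    Each label is decided independently, so zero, one, or many labels
    can be returned for the same input.
    """
    if len(logits) != len(SECTOR_LABELS):
        raise ValueError("expected one logit per sector label")
    return [
        label
        for label, logit in zip(SECTOR_LABELS, logits)
        if sigmoid(logit) >= threshold
    ]


# Hypothetical logits for one paragraph: only two labels are above zero.
example_logits = [-3.0] * len(SECTOR_LABELS)
example_logits[7] = 2.1   # Energy
example_logits[14] = 0.4  # Transport
predicted = decode_sectors(example_logits)
```

Lowering the threshold trades precision for recall, so it can be tuned per use case.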

Intended uses & limitations

More information needed

Training and evaluation data

Training Dataset: 10123 examples

Class	Positive Count of Class
Agriculture	2235
Buildings	169
Coastal Zone	698
Cross-Cutting Area	1853
Disaster Risk Management (DRM)	814
Economy-wide	873
Education	180
Energy	2847
Environment	905
Health	662
Industries	419
LULUCF/Forestry	1861
Social Development	507
Tourism	192
Transport	1173
Urban	558
Waste	714
Water	1207
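
The class counts above are strongly imbalanced, which is the usual motivation for pos_weight. One common heuristic, shown here only as an assumption since this card does not say how the weights were chosen, is the ratio of negative to positive examples per class. The function name pos_weight_from_counts is hypothetical.

```python
TRAIN_SIZE = 10123  # training set size from this card

# Positive counts per class, copied from the training table above.
POSITIVE_COUNTS = {
    "Agriculture": 2235, "Buildings": 169, "Coastal Zone": 698,
    "Cross-Cutting Area": 1853, "Disaster Risk Management (DRM)": 814,
    "Economy-wide": 873, "Education": 180, "Energy": 2847,
    "Environment": 905, "Health": 662, "Industries": 419,
    "LULUCF/Forestry": 1861, "Social Development": 507, "Tourism": 192,
    "Transport": 1173, "Urban": 558, "Waste": 714, "Water": 1207,
}


def pos_weight_from_counts(counts, total):
    """Negatives-to-positives ratio per class (an assumed heuristic)."""
    return {label: (total - pos) / pos for label, pos in counts.items()}


weights = pos_weight_from_counts(POSITIVE_COUNTS, TRAIN_SIZE)
rarest = max(weights, key=weights.get)    # class with the largest weight
commonest = min(weights, key=weights.get)  # class with the smallest weight
```

With this heuristic the rarest class (Buildings) gets the largest weight and the most frequent class (Energy) the smallest. The resulting values, in label order, are what would be passed as the pos_weight tensor of torch.nn.BCEWithLogitsLoss.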

Validation Dataset: 936 examples

Class	Positive Count of Class
Agriculture	200
Buildings	18
Coastal Zone	71
Cross-Cutting Area	180
Disaster Risk Management (DRM)	85
Economy-wide	85
Education	23
Energy	254
Environment	91
Health	68
Industries	41
LULUCF/Forestry	193
Social Development	56
Tourism	28
Transport	107
Urban	51
Waste	59
Water	106

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 7.04e-05
train_batch_size: 16
eval_batch_size: 16
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 300
num_epochs: 7
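
For readers who want to see what the cosine scheduler with 300 warmup steps does to the learning rate, here is a small sketch of linear warmup followed by half-cosine decay to zero, which is how the Transformers cosine schedule behaves with default settings. The total of 4431 steps is taken from the last row of the training results in this card; the function name lr_at is hypothetical.

```python
import math

PEAK_LR = 7.04e-05   # learning_rate from the hyperparameters
WARMUP_STEPS = 300   # lr_scheduler_warmup_steps
TOTAL_STEPS = 4431   # final step in the training results (7 epochs x 633 steps)


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step.

    Linear ramp from 0 to PEAK_LR during warmup, then a half-cosine
    decay from PEAK_LR down to 0 at TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Learning rate at the end of each epoch (633 steps per epoch).
epoch_end_lrs = [lr_at(633 * epoch) for epoch in range(1, 8)]
```

The rate peaks at step 300, early in the first epoch, and decays monotonically afterwards.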

Training results

Training Loss	Epoch	Step	Validation Loss	Precision-micro	Precision-samples	Precision-weighted	Recall-micro	Recall-samples	Recall-weighted	F1-micro	F1-samples	F1-weighted
0.7077	1.0	633	0.5490	0.4226	0.5465	0.4954	0.8211	0.8908	0.8211	0.5580	0.6243	0.5977
0.4546	2.0	1266	0.5009	0.4899	0.6127	0.5202	0.8438	0.9023	0.8438	0.6199	0.6822	0.6366
0.3105	3.0	1899	0.4947	0.5005	0.6593	0.5317	0.8508	0.8970	0.8508	0.6303	0.7125	0.6474
0.2044	4.0	2532	0.5430	0.5757	0.7044	0.5970	0.8106	0.8801	0.8106	0.6733	0.7379	0.6834
0.1314	5.0	3165	0.5633	0.6132	0.7385	0.6271	0.8065	0.8772	0.8065	0.6967	0.7606	0.7032
0.0892	6.0	3798	0.6073	0.6425	0.7499	0.6545	0.7844	0.8610	0.7844	0.7064	0.7634	0.7113
0.0721	7.0	4431	0.6114	0.6428	0.7488	0.6519	0.7855	0.8627	0.7855	0.7071	0.7638	0.7109
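
The Recall-micro and Recall-weighted columns are identical in every row above. That is expected rather than a coincidence: support-weighted recall reduces to total true positives divided by total positives, which is the definition of micro recall. The sketch below shows this on a tiny made-up example, assuming the standard definitions (as used by scikit-learn); the data and function names are hypothetical.

```python
def per_label_counts(y_true, y_pred):
    """Return (tp, fp, fn) for each label column of two 0/1 matrices."""
    counts = []
    for j in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    return counts


def micro_scores(counts):
    """Micro precision, recall, and F1: pool counts over labels first."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def weighted_recall(counts):
    """Per-label recall averaged with support (tp + fn) as the weight."""
    supports = [tp + fn for tp, _, fn in counts]
    recalls = [tp / s if s else 0.0 for (tp, _, _), s in zip(counts, supports)]
    total = sum(supports)
    return sum(s * r for s, r in zip(supports, recalls)) / total if total else 0.0


# Three samples, three labels (made-up data).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 1, 1], [0, 1, 0], [0, 1, 1]]

counts = per_label_counts(y_true, y_pred)
micro_p, micro_r, micro_f1 = micro_scores(counts)
weighted_r = weighted_recall(counts)
```

Precision and F1 do not share this property, which is why their micro and weighted columns differ.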

Per-label results on the validation set (support equals the validation class counts):

label	precision	recall	f1-score	support
Agriculture	0.720	0.850	0.780	200
Buildings	0.636	0.777	0.700	18
Coastal Zone	0.562	0.760	0.646	71
Cross-Cutting Area	0.569	0.777	0.657	180
Disaster Risk Management (DRM)	0.567	0.694	0.624	85
Economy-wide	0.461	0.635	0.534	85
Education	0.608	0.608	0.608	23
Energy	0.816	0.838	0.827	254
Environment	0.561	0.703	0.624	91
Health	0.708	0.750	0.728	68
Industries	0.660	0.902	0.762	41
LULUCF/Forestry	0.676	0.844	0.751	193
Social Development	0.593	0.678	0.633	56
Tourism	0.551	0.571	0.561	28
Transport	0.700	0.766	0.732	107
Urban	0.414	0.568	0.479	51
Waste	0.658	0.881	0.753	59
Water	0.602	0.773	0.677	106

Environmental Impact

Carbon emissions were measured using CodeCarbon.

Carbon Emitted: 0.02867 kg of CO2
Hours Used: 0.706 hours

Training Hardware

On Cloud: yes
GPU Model: 1 x Tesla T4
CPU Model: Intel(R) Xeon(R) CPU @ 2.00GHz
RAM Size: 12.67 GB

Framework versions

Transformers 4.38.1
PyTorch 2.1.0+cu121
Datasets 2.18.0
Tokenizers 0.15.2