---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- Hadith
- Islam
- Arabic
license: apache-2.0
datasets:
- FDSRashid/hadith_info
language:
- ar
library_name: sentence-transformers
---
# QulBERT
This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

This model originates from the CAMeLBERT Classical Arabic model (CAMeLBERT-CA). It was then trained on the Jawami' al-Kalim dataset, specifically on 440,000 matns and their corresponding taraf labels. A shared taraf label indicates that two hadiths concern the same report, and their matns are therefore more semantically similar.
## Usage (Sentence-Transformers)
Using this model is easy once you have sentence-transformers installed:

```bash
pip install -U sentence-transformers
```
Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

# Example sentences: "I love reading and writing." / "Birds fly in the sky."
sentences = ["أنا أحب القراءة والكتابة.", "الطيور تحلق في السماء."]

model = SentenceTransformer('FDSRashid/QulBERT')
embeddings = model.encode(sentences)
print(embeddings)
```
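To compare two matns directly, score their embeddings with cosine similarity. A minimal sketch using `sentence_transformers.util`, reusing the same illustrative sentences as above:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('FDSRashid/QulBERT')
embeddings = model.encode(
    ["أنا أحب القراءة والكتابة.", "الطيور تحلق في السماء."],
    convert_to_tensor=True,
)

# Cosine similarity between the two embeddings; higher means more semantically similar
score = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {score.item():.4f}")
```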
## Usage (HuggingFace Transformers)
Without sentence-transformers, you can use the model like this: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["أنا أحب القراءة والكتابة.", "الطيور تحلق في السماء."]

# Load model from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('FDSRashid/QulBERT')
model = AutoModel.from_pretrained('FDSRashid/QulBERT')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
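If you compare these embeddings by dot product, it can help to L2-normalize them first so that dot products coincide with cosine similarities. A minimal continuation of the snippet above:

```python
import torch.nn.functional as F

# L2-normalize each embedding so that dot products equal cosine similarities
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

# Cosine similarity between the two example sentences
print((sentence_embeddings[0] @ sentence_embeddings[1]).item())
```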
## Evaluation Results
The dataset was split into 75% training, 15% evaluation, and 10% test.
Validation results during training:

### Binary Classification Evaluation
epoch | steps | cossim_accuracy | cossim_accuracy_threshold | cossim_f1 | cossim_precision | cossim_recall | cossim_f1_threshold | cossim_ap | manhattan_accuracy | manhattan_accuracy_threshold | manhattan_f1 | manhattan_precision | manhattan_recall | manhattan_f1_threshold | manhattan_ap | euclidean_accuracy | euclidean_accuracy_threshold | euclidean_f1 | euclidean_precision | euclidean_recall | euclidean_f1_threshold | euclidean_ap | dot_accuracy | dot_accuracy_threshold | dot_f1 | dot_precision | dot_recall | dot_f1_threshold | dot_ap |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10000 | 0.87335 | 0.5980355739593506 | 0.866067203028869 | 0.9132749251413996 | 0.8235 | 0.5871663689613342 | 0.9466943574346693 | 0.87115 | 415.270751953125 | 0.8638917195787933 | 0.9047671172896941 | 0.82655 | 422.6612548828125 | 0.945152683467575 | 0.871875 | 18.699993133544922 | 0.8645460950343135 | 0.9041095890410958 | 0.8283 | 19.169795989990234 | 0.945247114112153 | 0.8731 | 262.36114501953125 | 0.8656574463026075 | 0.9177637107164864 | 0.81915 | 260.2767333984375 | 0.9463618096371682 |
0 | 20000 | 0.8655 | 0.5025224685668945 | 0.859078237761509 | 0.8968612304846869 | 0.82435 | 0.4888851046562195 | 0.943873652860419 | 0.866775 | 477.4580078125 | 0.860756186146168 | 0.9014832247824421 | 0.82355 | 485.6708984375 | 0.9442439376416185 | 0.8676 | 21.476741790771484 | 0.8606938065955735 | 0.8945580065800118 | 0.8293 | 22.168407440185547 | 0.9444315640627436 | 0.863225 | 241.51820373535156 | 0.8566324020610548 | 0.88835186080232 | 0.8271 | 230.02301025390625 | 0.9423405098129569 |
0 | -1 | 0.8866 | 0.7285321950912476 | 0.8816885280033313 | 0.919398610508033 | 0.84695 | 0.7145423889160156 | 0.9558629287413469 | 0.885275 | 355.03125 | 0.8803685772294236 | 0.918177869475513 | 0.84555 | 357.0611572265625 | 0.9550033563717418 | 0.8856 | 16.121074676513672 | 0.8809697221933201 | 0.918130557362828 | 0.8467 | 16.198532104492188 | 0.9552434220598536 | 0.8866 | 333.26568603515625 | 0.8812167536022311 | 0.9111929936986009 | 0.85315 | 325.6474304199219 | 0.9551592673018441 |
1 | 10000 | 0.88225 | 0.5847429037094116 | 0.8791732103956634 | 0.8909538967039737 | 0.8677 | 0.5608978271484375 | 0.9553668396978772 | 0.879975 | 404.1671447753906 | 0.8754545454545455 | 0.8843877551020408 | 0.8667 | 420.20391845703125 | 0.9539648051031446 | 0.879775 | 18.318096160888672 | 0.8759632369883004 | 0.8975394785163423 | 0.8554 | 18.77162742614746 | 0.9541900283694951 | 0.878325 | 242.8575897216797 | 0.8763834841057261 | 0.8859200980893022 | 0.86705 | 229.83326721191406 | 0.9521114062744855 |
1 | 20000 | 0.865425 | 0.483412504196167 | 0.8604195660017525 | 0.8878310817998085 | 0.83465 | 0.47202983498573303 | 0.9437698616032332 | 0.867725 | 490.8877868652344 | 0.8626237623762377 | 0.8905451448040886 | 0.8364 | 498.3052062988281 | 0.945935000502437 | 0.867725 | 21.84794044494629 | 0.8626810749177227 | 0.8954220237775028 | 0.83225 | 22.427053451538086 | 0.9460338001929801 | 0.862825 | 234.37701416015625 | 0.857083710699961 | 0.8882274068114776 | 0.82805 | 229.8949432373047 | 0.9405896665434951 |
1 | -1 | 0.866575 | 0.6635169982910156 | 0.8608573256557902 | 0.88173001310616 | 0.84095 | 0.6324930191040039 | 0.9452499579769719 | 0.866875 | 412.3456726074219 | 0.8617781992464822 | 0.8840511121628017 | 0.8406 | 428.6363525390625 | 0.9456397883265427 | 0.867275 | 18.474044799804688 | 0.8617669654289373 | 0.883254593175853 | 0.8413 | 19.42306900024414 | 0.9458234307667238 | 0.8645 | 340.140380859375 | 0.8589694801735291 | 0.8718648606890869 | 0.84645 | 320.98138427734375 | 0.9439794500521119 |
2 | 10000 | 0.85825 | 0.521987795829773 | 0.8545418167266907 | 0.8548839071257006 | 0.8542 | 0.4656229019165039 | 0.9388312444848291 | 0.85815 | 477.94244384765625 | 0.8541671894998369 | 0.8574595656774323 | 0.8509 | 508.4425048828125 | 0.9391298371675241 | 0.858625 | 21.995594024658203 | 0.854181558255897 | 0.8726267473398707 | 0.8365 | 22.506919860839844 | 0.9392082300175095 | 0.856875 | 262.9927673339844 | 0.8540649892527501 | 0.8640061396776669 | 0.84435 | 240.31259155273438 | 0.936665567408799 |
2 | 20000 | 0.861025 | 0.4792778789997101 | 0.8557936427338275 | 0.8693005983082319 | 0.8427 | 0.4365364611148834 | 0.9417166077380268 | 0.861325 | 490.29339599609375 | 0.8565778465126891 | 0.8688474000925783 | 0.84465 | 521.2939453125 | 0.9421024298390495 | 0.861225 | 22.693565368652344 | 0.8568891594997083 | 0.8697533089560694 | 0.8444 | 23.553585052490234 | 0.9422682260686701 | 0.859775 | 237.65704345703125 | 0.8545056078380817 | 0.8821400053233963 | 0.82855 | 224.57196044921875 | 0.9406093768234505 |
2 | -1 | 0.84645 | 0.7098060250282288 | 0.8385932801673421 | 0.8789257330775555 | 0.8018 | 0.702235221862793 | 0.932382298001216 | 0.849825 | 371.1478271484375 | 0.8419526841642077 | 0.8728131372759472 | 0.8132 | 385.735107421875 | 0.9344418607926894 | 0.8498 | 17.05820083618164 | 0.8418963040355231 | 0.8813781788351107 | 0.8058 | 17.261516571044922 | 0.9345154644039888 | 0.83745 | 359.3741455078125 | 0.8301335348954395 | 0.8366683595733875 | 0.8237 | 335.80609130859375 | 0.9256669298415723 |
3 | 10000 | 0.8692 | 0.6066867113113403 | 0.8639819190466407 | 0.8882551753274187 | 0.841 | 0.5866260528564453 | 0.9479885087178834 | 0.870575 | 437.38861083984375 | 0.8650388914644825 | 0.8920110485498778 | 0.83965 | 447.34051513671875 | 0.9484228602702792 | 0.870575 | 19.797679901123047 | 0.8655175071287281 | 0.8900512495376974 | 0.8423 | 20.318492889404297 | 0.9487290465239262 | 0.866525 | 297.63665771484375 | 0.862023653088042 | 0.8892669182924884 | 0.8364 | 295.425048828125 | 0.9460553171567032 |
3 | 20000 | 0.8723 | 0.5461836457252502 | 0.8661531678726109 | 0.8997790829247265 | 0.83495 | 0.5138773322105408 | 0.9483721005411583 | 0.872775 | 465.31109619140625 | 0.8667593021460553 | 0.8929063726009967 | 0.8421 | 492.2287292480469 | 0.9486788228598396 | 0.87305 | 21.46672821044922 | 0.8673375089844954 | 0.891221776746149 | 0.8447 | 22.356992721557617 | 0.9489411054456987 | 0.87085 | 268.9063720703125 | 0.8649067921503737 | 0.8955399689457622 | 0.8363 | 255.71820068359375 | 0.9471728845921085 |
3 | -1 | 0.8801 | 0.5941712260246277 | 0.8756740022187249 | 0.9045893076062044 | 0.84855 | 0.5840033292770386 | 0.9545450783524295 | 0.87755 | 432.21533203125 | 0.8737334773440313 | 0.8995022768188076 | 0.8494 | 439.35577392578125 | 0.9532505174511154 | 0.87805 | 19.783367156982422 | 0.873815256929146 | 0.8958924256749659 | 0.8528 | 20.03304100036621 | 0.953443356122637 | 0.880975 | 282.0526123046875 | 0.8761084893429446 | 0.9099429063880211 | 0.8447 | 279.80755615234375 | 0.9545987838548831 |
4 | 10000 | 0.850325 | 0.5770859718322754 | 0.8458372263326683 | 0.8544462017244018 | 0.8374 | 0.5550715923309326 | 0.9367499212412196 | 0.85215 | 446.5159606933594 | 0.8481793290514087 | 0.8686513968237329 | 0.82865 | 461.0674133300781 | 0.9378884193257083 | 0.85235 | 20.780521392822266 | 0.8487315362363361 | 0.8695903058280439 | 0.82885 | 20.85832977294922 | 0.9379865284776105 | 0.846775 | 297.2707214355469 | 0.8422069666920926 | 0.8568028970512157 | 0.8281 | 282.2066955566406 | 0.9342792490823187 |
4 | 20000 | 0.885725 | 0.5763461589813232 | 0.8810150085099798 | 0.9096815422302694 | 0.8541 | 0.5624827742576599 | 0.9567680001721202 | 0.8861 | 449.548095703125 | 0.881673031087419 | 0.9044113780955886 | 0.86005 | 462.72589111328125 | 0.9571293388400879 | 0.88635 | 20.378496170043945 | 0.8821848696234137 | 0.9091198472067483 | 0.8568 | 20.860164642333984 | 0.9573813976283176 | 0.883225 | 285.4012451171875 | 0.8786437246963561 | 0.8894467213114754 | 0.8681 | 268.5011291503906 | 0.9549886227962548 |
4 | -1 | 0.883425 | 0.5326807498931885 | 0.878749968085378 | 0.8978452548651328 | 0.86045 | 0.4872320890426636 | 0.956368376823993 | 0.88455 | 480.01300048828125 | 0.8794581927741869 | 0.9067063133860777 | 0.8538 | 497.9632568359375 | 0.9566012690704293 | 0.8845 | 21.905109405517578 | 0.8798647229125566 | 0.9022647259734118 | 0.85855 | 22.690349578857422 | 0.9567001435137067 | 0.881875 | 258.05084228515625 | 0.8778676433185817 | 0.8933637022466093 | 0.8629 | 243.83050537109375 | 0.9554938129957324 |
5 | 10000 | 0.893375 | 0.46282997727394104 | 0.8898161026116519 | 0.9175608201423563 | 0.8637 | 0.4469180405139923 | 0.9617751510273491 | 0.89385 | 512.1046752929688 | 0.8904936907301277 | 0.9161334672941674 | 0.86625 | 515.0869750976562 | 0.9619645895583173 | 0.894275 | 23.09744644165039 | 0.890798553215504 | 0.9146604856977295 | 0.86815 | 23.39638900756836 | 0.9622504494079881 | 0.892075 | 230.25645446777344 | 0.8889115628905951 | 0.9073582252773004 | 0.8712 | 213.14920043945312 | 0.9608017350146727 |
5 | 20000 | 0.905125 | 0.4999743402004242 | 0.9022725529793706 | 0.923060829541294 | 0.8824 | 0.4821454584598541 | 0.9677318333926658 | 0.905375 | 477.36669921875 | 0.9025055438024112 | 0.9205012218582644 | 0.8852 | 493.6698913574219 | 0.9681063663719243 | 0.90565 | 21.852725982666016 | 0.9027959303964531 | 0.9260291257031702 | 0.8807 | 22.224273681640625 | 0.9681537834478611 | 0.9035 | 237.94554138183594 | 0.9007219292406943 | 0.9228832231665093 | 0.8796 | 233.02957153320312 | 0.9664151462381492 |
5 | -1 | 0.908825 | 0.4167391061782837 | 0.9067954713895064 | 0.9274400125463955 | 0.88705 | 0.4167391061782837 | 0.9692830626530475 | 0.908575 | 511.60858154296875 | 0.9058531974144758 | 0.9261794054647092 | 0.8864 | 521.9729614257812 | 0.9691215287508383 | 0.9088 | 23.511920928955078 | 0.9063943343939237 | 0.9309508749736454 | 0.8831 | 23.511920928955078 | 0.9692657736763628 | 0.907875 | 195.81820678710938 | 0.9062697749765865 | 0.9177218434408161 | 0.8951 | 192.03176879882812 | 0.9688303836479663 |
6 | 10000 | 0.9117 | 0.43377184867858887 | 0.9091142688285324 | 0.9351377068245493 | 0.8845 | 0.42168402671813965 | 0.9705081136434329 | 0.911125 | 503.63323974609375 | 0.9086830163666956 | 0.9269738895246021 | 0.8911 | 515.934814453125 | 0.9705936795264274 | 0.911625 | 23.21420669555664 | 0.90933805237106 | 0.9334948133326313 | 0.8864 | 23.21420669555664 | 0.970644014417841 | 0.910625 | 190.39918518066406 | 0.9088312549409635 | 0.9274449591422474 | 0.89095 | 190.39918518066406 | 0.9698163020951304 |
6 | 20000 | 0.912575 | 0.4052755534648895 | 0.9098029112456524 | 0.938453361679511 | 0.88285 | 0.4011077880859375 | 0.9710678702761814 | 0.9119 | 513.6328125 | 0.9090537815555045 | 0.9285602544715024 | 0.89035 | 525.4749755859375 | 0.970994069548643 | 0.91225 | 23.212299346923828 | 0.9088935972301172 | 0.9404341781627633 | 0.8794 | 23.397891998291016 | 0.9711426496517335 | 0.911375 | 189.27462768554688 | 0.9083260657671984 | 0.9325819024544401 | 0.8853 | 182.22271728515625 | 0.9704334542723605 |
6 | -1 | 0.91105 | 0.38402271270751953 | 0.9082011127137852 | 0.9365703357416064 | 0.8815 | 0.3781573176383972 | 0.9708056816629487 | 0.9107 | 519.240966796875 | 0.9077167452346792 | 0.9308007566204287 | 0.88575 | 528.8313598632812 | 0.9708053595341734 | 0.910775 | 23.534488677978516 | 0.9077579997942176 | 0.9347849120576394 | 0.88225 | 23.839462280273438 | 0.9709448175722556 | 0.90935 | 175.73391723632812 | 0.9069991873222268 | 0.9216040462427746 | 0.89285 | 162.47686767578125 | 0.9701583328129889 |
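The thresholds above can be used to turn the model into a binary same-taraf classifier. A minimal sketch; the 0.384 default cutoff is the final cosine-similarity accuracy threshold from the last row above, and you should tune it for your own data:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('FDSRashid/QulBERT')

def same_taraf(matn_a: str, matn_b: str, threshold: float = 0.384) -> bool:
    """Predict whether two matns belong to the same taraf via cosine similarity.

    The default threshold comes from the final validation row; tune as needed.
    """
    emb = model.encode([matn_a, matn_b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold
```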
### Triplet Evaluation
epoch | steps | accuracy_cosinus | accuracy_manhattan | accuracy_euclidean |
---|---|---|---|---|
0 | 10000 | 0.9344 | 0.9323 | 0.9322 |
0 | 20000 | 0.9279 | 0.9271 | 0.9271 |
0 | -1 | 0.9481 | 0.9466 | 0.9468 |
1 | 10000 | 0.9403 | 0.9378 | 0.9385 |
1 | 20000 | 0.9307 | 0.9306 | 0.9312 |
1 | -1 | 0.9364 | 0.9373 | 0.9369 |
2 | 10000 | 0.9235 | 0.9239 | 0.9242 |
2 | 20000 | 0.929 | 0.9287 | 0.928 |
2 | -1 | 0.9267 | 0.927 | 0.928 |
3 | 10000 | 0.9431 | 0.9422 | 0.9434 |
3 | 20000 | 0.9356 | 0.9376 | 0.9367 |
3 | -1 | 0.9484 | 0.9481 | 0.9473 |
4 | 10000 | 0.9347 | 0.935 | 0.9351 |
4 | 20000 | 0.9517 | 0.9511 | 0.9516 |
4 | -1 | 0.9465 | 0.9473 | 0.9469 |
5 | 10000 | 0.9521 | 0.9517 | 0.9521 |
5 | 20000 | 0.9615 | 0.9618 | 0.9615 |
5 | -1 | 0.9638 | 0.9639 | 0.9635 |
6 | 10000 | 0.9629 | 0.9644 | 0.9641 |
6 | 20000 | 0.9673 | 0.967 | 0.9665 |
6 | -1 | 0.9666 | 0.9658 | 0.9666 |
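Each triplet pairs an anchor matn with a positive from the same taraf and a negative from a different taraf; accuracy is the fraction of triplets where the anchor sits closer to the positive than to the negative. A minimal sketch of how such an evaluator is built in sentence-transformers (the triplets below are placeholders, not data from the actual eval split):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import TripletEvaluator

model = SentenceTransformer('FDSRashid/QulBERT')

# Placeholder triplets: anchors[i] and positives[i] share a taraf,
# negatives[i] comes from a different taraf
anchors = ["متن الحديث الأول"]
positives = ["رواية أخرى للحديث الأول"]
negatives = ["متن حديث مختلف"]

evaluator = TripletEvaluator(anchors, positives, negatives, name="taraf-triplets")
print(evaluator(model))  # fraction of triplets ranked correctly
```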
## Training
The model was trained with the following parameters:
**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 27693 with parameters:

```
{'batch_size': 12, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
**Loss**:

`sentence_transformers.losses.BatchHardTripletLoss.BatchHardTripletLoss`
Parameters of the fit()-method:

```
{
    "epochs": 7,
    "evaluation_steps": 10000,
    "evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 10000,
    "weight_decay": 0.01
}
```
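Put together, the training call likely looked roughly like the following. This is a sketch reconstructed from the parameters above; the base checkpoint (assumed here to be CAMeL-Lab/bert-base-arabic-camelbert-ca) and the data loading are illustrative:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Assumed base checkpoint: the Classical Arabic CAMeLBERT variant
model = SentenceTransformer('CAMeL-Lab/bert-base-arabic-camelbert-ca')

# Placeholder data: each InputExample carries one matn and its integer taraf ID;
# BatchHardTripletLoss mines the hardest triplets within each batch from these labels
train_examples = [
    InputExample(texts=["متن ١"], label=0),
    InputExample(texts=["متن ٢"], label=0),
    InputExample(texts=["متن ٣"], label=1),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=12)
train_loss = losses.BatchHardTripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=7,
    warmup_steps=10000,
    evaluation_steps=10000,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)
```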
## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```