---
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - transformers
  - Hadith
  - Islam
  - Arabic
license: apache-2.0
datasets:
  - FDSRashid/hadith_info
language:
  - ar
library_name: sentence-transformers
---

# QulBERT

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

This model is based on the CAMeLBERT Classical Arabic (CAMeLBERT-CA) model. It was then fine-tuned on the Jawami' Kalim dataset, specifically 440,000 matns (hadith texts) and their corresponding taraf labels. A shared taraf label indicates that two hadiths concern the same underlying report and are therefore more semantically similar.

## Usage (Sentence-Transformers)

Using this model is straightforward once you have sentence-transformers installed:

```bash
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["أنا أحب القراءة والكتابة.", "الطيور تحلق في السماء."]

model = SentenceTransformer('FDSRashid/QulBERT')
embeddings = model.encode(sentences)
print(embeddings)
```
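Because matns from the same taraf were trained to lie close together in the embedding space, the resulting embeddings can be compared directly. As a follow-up to the snippet above, a minimal sketch using the `cos_sim` helper that ships with sentence-transformers:

```python
from sentence_transformers import util

# Pairwise cosine similarities between the embeddings computed above;
# higher scores indicate more semantically similar texts.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)  # 2x2 tensor; the diagonal is 1.0
```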

## Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["أنا أحب القراءة والكتابة.", "الطيور تحلق في السماء."]

# Load model from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('FDSRashid/QulBERT')
model = AutoModel.from_pretrained('FDSRashid/QulBERT')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
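The pooled vectors can then be scored the same way as the output of `model.encode`. A minimal sketch, continuing from `sentence_embeddings` above, that L2-normalizes the rows so their dot products are cosine similarities:

```python
import torch.nn.functional as F

# L2-normalize each sentence embedding, then a matrix product yields
# the full cosine-similarity matrix between all sentence pairs.
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
cosine_scores = normalized @ normalized.T
print(cosine_scores)
```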

## Evaluation Results

The dataset was split into 75% training, 15% evaluation, and 10% test.

Validation results during training are shown below. A `steps` value of -1 denotes the evaluation run at the end of an epoch.

### Binary Classification Evaluation

| epoch | steps | cossim_accuracy | cossim_accuracy_threshold | cossim_f1 | cossim_precision | cossim_recall | cossim_f1_threshold | cossim_ap | manhattan_accuracy | manhattan_accuracy_threshold | manhattan_f1 | manhattan_precision | manhattan_recall | manhattan_f1_threshold | manhattan_ap | euclidean_accuracy | euclidean_accuracy_threshold | euclidean_f1 | euclidean_precision | euclidean_recall | euclidean_f1_threshold | euclidean_ap | dot_accuracy | dot_accuracy_threshold | dot_f1 | dot_precision | dot_recall | dot_f1_threshold | dot_ap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000 | 0.87335 | 0.5980355739593506 | 0.866067203028869 | 0.9132749251413996 | 0.8235 | 0.5871663689613342 | 0.9466943574346693 | 0.87115 | 415.270751953125 | 0.8638917195787933 | 0.9047671172896941 | 0.82655 | 422.6612548828125 | 0.945152683467575 | 0.871875 | 18.699993133544922 | 0.8645460950343135 | 0.9041095890410958 | 0.8283 | 19.169795989990234 | 0.945247114112153 | 0.8731 | 262.36114501953125 | 0.8656574463026075 | 0.9177637107164864 | 0.81915 | 260.2767333984375 | 0.9463618096371682 |
| 0 | 20000 | 0.8655 | 0.5025224685668945 | 0.859078237761509 | 0.8968612304846869 | 0.82435 | 0.4888851046562195 | 0.943873652860419 | 0.866775 | 477.4580078125 | 0.860756186146168 | 0.9014832247824421 | 0.82355 | 485.6708984375 | 0.9442439376416185 | 0.8676 | 21.476741790771484 | 0.8606938065955735 | 0.8945580065800118 | 0.8293 | 22.168407440185547 | 0.9444315640627436 | 0.863225 | 241.51820373535156 | 0.8566324020610548 | 0.88835186080232 | 0.8271 | 230.02301025390625 | 0.9423405098129569 |
| 0 | -1 | 0.8866 | 0.7285321950912476 | 0.8816885280033313 | 0.919398610508033 | 0.84695 | 0.7145423889160156 | 0.9558629287413469 | 0.885275 | 355.03125 | 0.8803685772294236 | 0.918177869475513 | 0.84555 | 357.0611572265625 | 0.9550033563717418 | 0.8856 | 16.121074676513672 | 0.8809697221933201 | 0.918130557362828 | 0.8467 | 16.198532104492188 | 0.9552434220598536 | 0.8866 | 333.26568603515625 | 0.8812167536022311 | 0.9111929936986009 | 0.85315 | 325.6474304199219 | 0.9551592673018441 |
| 1 | 10000 | 0.88225 | 0.5847429037094116 | 0.8791732103956634 | 0.8909538967039737 | 0.8677 | 0.5608978271484375 | 0.9553668396978772 | 0.879975 | 404.1671447753906 | 0.8754545454545455 | 0.8843877551020408 | 0.8667 | 420.20391845703125 | 0.9539648051031446 | 0.879775 | 18.318096160888672 | 0.8759632369883004 | 0.8975394785163423 | 0.8554 | 18.77162742614746 | 0.9541900283694951 | 0.878325 | 242.8575897216797 | 0.8763834841057261 | 0.8859200980893022 | 0.86705 | 229.83326721191406 | 0.9521114062744855 |
| 1 | 20000 | 0.865425 | 0.483412504196167 | 0.8604195660017525 | 0.8878310817998085 | 0.83465 | 0.47202983498573303 | 0.9437698616032332 | 0.867725 | 490.8877868652344 | 0.8626237623762377 | 0.8905451448040886 | 0.8364 | 498.3052062988281 | 0.945935000502437 | 0.867725 | 21.84794044494629 | 0.8626810749177227 | 0.8954220237775028 | 0.83225 | 22.427053451538086 | 0.9460338001929801 | 0.862825 | 234.37701416015625 | 0.857083710699961 | 0.8882274068114776 | 0.82805 | 229.8949432373047 | 0.9405896665434951 |
| 1 | -1 | 0.866575 | 0.6635169982910156 | 0.8608573256557902 | 0.88173001310616 | 0.84095 | 0.6324930191040039 | 0.9452499579769719 | 0.866875 | 412.3456726074219 | 0.8617781992464822 | 0.8840511121628017 | 0.8406 | 428.6363525390625 | 0.9456397883265427 | 0.867275 | 18.474044799804688 | 0.8617669654289373 | 0.883254593175853 | 0.8413 | 19.42306900024414 | 0.9458234307667238 | 0.8645 | 340.140380859375 | 0.8589694801735291 | 0.8718648606890869 | 0.84645 | 320.98138427734375 | 0.9439794500521119 |
| 2 | 10000 | 0.85825 | 0.521987795829773 | 0.8545418167266907 | 0.8548839071257006 | 0.8542 | 0.4656229019165039 | 0.9388312444848291 | 0.85815 | 477.94244384765625 | 0.8541671894998369 | 0.8574595656774323 | 0.8509 | 508.4425048828125 | 0.9391298371675241 | 0.858625 | 21.995594024658203 | 0.854181558255897 | 0.8726267473398707 | 0.8365 | 22.506919860839844 | 0.9392082300175095 | 0.856875 | 262.9927673339844 | 0.8540649892527501 | 0.8640061396776669 | 0.84435 | 240.31259155273438 | 0.936665567408799 |
| 2 | 20000 | 0.861025 | 0.4792778789997101 | 0.8557936427338275 | 0.8693005983082319 | 0.8427 | 0.4365364611148834 | 0.9417166077380268 | 0.861325 | 490.29339599609375 | 0.8565778465126891 | 0.8688474000925783 | 0.84465 | 521.2939453125 | 0.9421024298390495 | 0.861225 | 22.693565368652344 | 0.8568891594997083 | 0.8697533089560694 | 0.8444 | 23.553585052490234 | 0.9422682260686701 | 0.859775 | 237.65704345703125 | 0.8545056078380817 | 0.8821400053233963 | 0.82855 | 224.57196044921875 | 0.9406093768234505 |
| 2 | -1 | 0.84645 | 0.7098060250282288 | 0.8385932801673421 | 0.8789257330775555 | 0.8018 | 0.702235221862793 | 0.932382298001216 | 0.849825 | 371.1478271484375 | 0.8419526841642077 | 0.8728131372759472 | 0.8132 | 385.735107421875 | 0.9344418607926894 | 0.8498 | 17.05820083618164 | 0.8418963040355231 | 0.8813781788351107 | 0.8058 | 17.261516571044922 | 0.9345154644039888 | 0.83745 | 359.3741455078125 | 0.8301335348954395 | 0.8366683595733875 | 0.8237 | 335.80609130859375 | 0.9256669298415723 |
| 3 | 10000 | 0.8692 | 0.6066867113113403 | 0.8639819190466407 | 0.8882551753274187 | 0.841 | 0.5866260528564453 | 0.9479885087178834 | 0.870575 | 437.38861083984375 | 0.8650388914644825 | 0.8920110485498778 | 0.83965 | 447.34051513671875 | 0.9484228602702792 | 0.870575 | 19.797679901123047 | 0.8655175071287281 | 0.8900512495376974 | 0.8423 | 20.318492889404297 | 0.9487290465239262 | 0.866525 | 297.63665771484375 | 0.862023653088042 | 0.8892669182924884 | 0.8364 | 295.425048828125 | 0.9460553171567032 |
| 3 | 20000 | 0.8723 | 0.5461836457252502 | 0.8661531678726109 | 0.8997790829247265 | 0.83495 | 0.5138773322105408 | 0.9483721005411583 | 0.872775 | 465.31109619140625 | 0.8667593021460553 | 0.8929063726009967 | 0.8421 | 492.2287292480469 | 0.9486788228598396 | 0.87305 | 21.46672821044922 | 0.8673375089844954 | 0.891221776746149 | 0.8447 | 22.356992721557617 | 0.9489411054456987 | 0.87085 | 268.9063720703125 | 0.8649067921503737 | 0.8955399689457622 | 0.8363 | 255.71820068359375 | 0.9471728845921085 |
| 3 | -1 | 0.8801 | 0.5941712260246277 | 0.8756740022187249 | 0.9045893076062044 | 0.84855 | 0.5840033292770386 | 0.9545450783524295 | 0.87755 | 432.21533203125 | 0.8737334773440313 | 0.8995022768188076 | 0.8494 | 439.35577392578125 | 0.9532505174511154 | 0.87805 | 19.783367156982422 | 0.873815256929146 | 0.8958924256749659 | 0.8528 | 20.03304100036621 | 0.953443356122637 | 0.880975 | 282.0526123046875 | 0.8761084893429446 | 0.9099429063880211 | 0.8447 | 279.80755615234375 | 0.9545987838548831 |
| 4 | 10000 | 0.850325 | 0.5770859718322754 | 0.8458372263326683 | 0.8544462017244018 | 0.8374 | 0.5550715923309326 | 0.9367499212412196 | 0.85215 | 446.5159606933594 | 0.8481793290514087 | 0.8686513968237329 | 0.82865 | 461.0674133300781 | 0.9378884193257083 | 0.85235 | 20.780521392822266 | 0.8487315362363361 | 0.8695903058280439 | 0.82885 | 20.85832977294922 | 0.9379865284776105 | 0.846775 | 297.2707214355469 | 0.8422069666920926 | 0.8568028970512157 | 0.8281 | 282.2066955566406 | 0.9342792490823187 |
| 4 | 20000 | 0.885725 | 0.5763461589813232 | 0.8810150085099798 | 0.9096815422302694 | 0.8541 | 0.5624827742576599 | 0.9567680001721202 | 0.8861 | 449.548095703125 | 0.881673031087419 | 0.9044113780955886 | 0.86005 | 462.72589111328125 | 0.9571293388400879 | 0.88635 | 20.378496170043945 | 0.8821848696234137 | 0.9091198472067483 | 0.8568 | 20.860164642333984 | 0.9573813976283176 | 0.883225 | 285.4012451171875 | 0.8786437246963561 | 0.8894467213114754 | 0.8681 | 268.5011291503906 | 0.9549886227962548 |
| 4 | -1 | 0.883425 | 0.5326807498931885 | 0.878749968085378 | 0.8978452548651328 | 0.86045 | 0.4872320890426636 | 0.956368376823993 | 0.88455 | 480.01300048828125 | 0.8794581927741869 | 0.9067063133860777 | 0.8538 | 497.9632568359375 | 0.9566012690704293 | 0.8845 | 21.905109405517578 | 0.8798647229125566 | 0.9022647259734118 | 0.85855 | 22.690349578857422 | 0.9567001435137067 | 0.881875 | 258.05084228515625 | 0.8778676433185817 | 0.8933637022466093 | 0.8629 | 243.83050537109375 | 0.9554938129957324 |
| 5 | 10000 | 0.893375 | 0.46282997727394104 | 0.8898161026116519 | 0.9175608201423563 | 0.8637 | 0.4469180405139923 | 0.9617751510273491 | 0.89385 | 512.1046752929688 | 0.8904936907301277 | 0.9161334672941674 | 0.86625 | 515.0869750976562 | 0.9619645895583173 | 0.894275 | 23.09744644165039 | 0.890798553215504 | 0.9146604856977295 | 0.86815 | 23.39638900756836 | 0.9622504494079881 | 0.892075 | 230.25645446777344 | 0.8889115628905951 | 0.9073582252773004 | 0.8712 | 213.14920043945312 | 0.9608017350146727 |
| 5 | 20000 | 0.905125 | 0.4999743402004242 | 0.9022725529793706 | 0.923060829541294 | 0.8824 | 0.4821454584598541 | 0.9677318333926658 | 0.905375 | 477.36669921875 | 0.9025055438024112 | 0.9205012218582644 | 0.8852 | 493.6698913574219 | 0.9681063663719243 | 0.90565 | 21.852725982666016 | 0.9027959303964531 | 0.9260291257031702 | 0.8807 | 22.224273681640625 | 0.9681537834478611 | 0.9035 | 237.94554138183594 | 0.9007219292406943 | 0.9228832231665093 | 0.8796 | 233.02957153320312 | 0.9664151462381492 |
| 5 | -1 | 0.908825 | 0.4167391061782837 | 0.9067954713895064 | 0.9274400125463955 | 0.88705 | 0.4167391061782837 | 0.9692830626530475 | 0.908575 | 511.60858154296875 | 0.9058531974144758 | 0.9261794054647092 | 0.8864 | 521.9729614257812 | 0.9691215287508383 | 0.9088 | 23.511920928955078 | 0.9063943343939237 | 0.9309508749736454 | 0.8831 | 23.511920928955078 | 0.9692657736763628 | 0.907875 | 195.81820678710938 | 0.9062697749765865 | 0.9177218434408161 | 0.8951 | 192.03176879882812 | 0.9688303836479663 |
| 6 | 10000 | 0.9117 | 0.43377184867858887 | 0.9091142688285324 | 0.9351377068245493 | 0.8845 | 0.42168402671813965 | 0.9705081136434329 | 0.911125 | 503.63323974609375 | 0.9086830163666956 | 0.9269738895246021 | 0.8911 | 515.934814453125 | 0.9705936795264274 | 0.911625 | 23.21420669555664 | 0.90933805237106 | 0.9334948133326313 | 0.8864 | 23.21420669555664 | 0.970644014417841 | 0.910625 | 190.39918518066406 | 0.9088312549409635 | 0.9274449591422474 | 0.89095 | 190.39918518066406 | 0.9698163020951304 |
| 6 | 20000 | 0.912575 | 0.4052755534648895 | 0.9098029112456524 | 0.938453361679511 | 0.88285 | 0.4011077880859375 | 0.9710678702761814 | 0.9119 | 513.6328125 | 0.9090537815555045 | 0.9285602544715024 | 0.89035 | 525.4749755859375 | 0.970994069548643 | 0.91225 | 23.212299346923828 | 0.9088935972301172 | 0.9404341781627633 | 0.8794 | 23.397891998291016 | 0.9711426496517335 | 0.911375 | 189.27462768554688 | 0.9083260657671984 | 0.9325819024544401 | 0.8853 | 182.22271728515625 | 0.9704334542723605 |
| 6 | -1 | 0.91105 | 0.38402271270751953 | 0.9082011127137852 | 0.9365703357416064 | 0.8815 | 0.3781573176383972 | 0.9708056816629487 | 0.9107 | 519.240966796875 | 0.9077167452346792 | 0.9308007566204287 | 0.88575 | 528.8313598632812 | 0.9708053595341734 | 0.910775 | 23.534488677978516 | 0.9077579997942176 | 0.9347849120576394 | 0.88225 | 23.839462280273438 | 0.9709448175722556 | 0.90935 | 175.73391723632812 | 0.9069991873222268 | 0.9216040462427746 | 0.89285 | 162.47686767578125 | 0.9701583328129889 |
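The `*_accuracy_threshold` columns report, for each distance measure, the decision cutoff that maximized pair-classification accuracy. As an illustration only, a hedged sketch of applying such a cosine cutoff to guess whether two matns share a taraf; the `same_taraf` helper is hypothetical, and the 0.384 default is simply the final-row `cossim_accuracy_threshold` above, which you should re-tune on your own data:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('FDSRashid/QulBERT')

def same_taraf(matn_a: str, matn_b: str, threshold: float = 0.384) -> bool:
    """Hypothetical helper: classify two matns as same-taraf when their
    cosine similarity clears the threshold (0.384 ~ the final epoch's
    cossim_accuracy_threshold in the table above)."""
    emb_a, emb_b = model.encode([matn_a, matn_b])
    return util.cos_sim(emb_a, emb_b).item() >= threshold
```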

### Triplet Evaluation

| epoch | steps | accuracy_cosinus | accuracy_manhattan | accuracy_euclidean |
|---|---|---|---|---|
| 0 | 10000 | 0.9344 | 0.9323 | 0.9322 |
| 0 | 20000 | 0.9279 | 0.9271 | 0.9271 |
| 0 | -1 | 0.9481 | 0.9466 | 0.9468 |
| 1 | 10000 | 0.9403 | 0.9378 | 0.9385 |
| 1 | 20000 | 0.9307 | 0.9306 | 0.9312 |
| 1 | -1 | 0.9364 | 0.9373 | 0.9369 |
| 2 | 10000 | 0.9235 | 0.9239 | 0.9242 |
| 2 | 20000 | 0.929 | 0.9287 | 0.928 |
| 2 | -1 | 0.9267 | 0.927 | 0.928 |
| 3 | 10000 | 0.9431 | 0.9422 | 0.9434 |
| 3 | 20000 | 0.9356 | 0.9376 | 0.9367 |
| 3 | -1 | 0.9484 | 0.9481 | 0.9473 |
| 4 | 10000 | 0.9347 | 0.935 | 0.9351 |
| 4 | 20000 | 0.9517 | 0.9511 | 0.9516 |
| 4 | -1 | 0.9465 | 0.9473 | 0.9469 |
| 5 | 10000 | 0.9521 | 0.9517 | 0.9521 |
| 5 | 20000 | 0.9615 | 0.9618 | 0.9615 |
| 5 | -1 | 0.9638 | 0.9639 | 0.9635 |
| 6 | 10000 | 0.9629 | 0.9644 | 0.9641 |
| 6 | 20000 | 0.9673 | 0.967 | 0.9665 |
| 6 | -1 | 0.9666 | 0.9658 | 0.9666 |
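For reference, triplet accuracy here is the fraction of (anchor, positive, negative) triplets in which the anchor embedding lies closer to the positive (a matn from the same taraf) than to the negative, under the given distance. A minimal sketch of the metric under Euclidean distance; the names are illustrative, not taken from the evaluation code:

```python
import torch

def triplet_accuracy(anchors, positives, negatives):
    """Fraction of triplets where the anchor is nearer to the positive
    than to the negative embedding (Euclidean distance)."""
    d_pos = torch.nn.functional.pairwise_distance(anchors, positives)
    d_neg = torch.nn.functional.pairwise_distance(anchors, negatives)
    return (d_pos < d_neg).float().mean().item()
```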

## Training

The model was trained with the following parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 27693 with parameters:

```
{'batch_size': 12, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.BatchHardTripletLoss.BatchHardTripletLoss`

Parameters of the `fit()` method:

```json
{
    "epochs": 7,
    "evaluation_steps": 10000,
    "evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 10000,
    "weight_decay": 0.01
}
```
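For orientation, a hedged sketch of how such a configuration is typically wired up with the sentence-transformers `fit()` API. The toy data, the label layout, and the base checkpoint name (`CAMeL-Lab/bert-base-arabic-camelbert-ca`) are assumptions for illustration, not the exact training script:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed toy data: each matn is labeled with an integer taraf id.
# The real run used ~440,000 matns with their taraf labels.
train_examples = [
    InputExample(texts=["matn one"], label=0),
    InputExample(texts=["matn two"], label=0),
    InputExample(texts=["matn three"], label=1),
]

model = SentenceTransformer('CAMeL-Lab/bert-base-arabic-camelbert-ca')  # assumed base checkpoint
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=12)

# BatchHardTripletLoss mines the hardest positive/negative per anchor
# within each batch, using the integer labels to define positives.
train_loss = losses.BatchHardTripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=7,
    warmup_steps=10000,
    evaluation_steps=10000,
    weight_decay=0.01,
    max_grad_norm=1,
    optimizer_params={'lr': 2e-05},
)
```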

## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
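Equivalently, the same two-module stack can be assembled by hand with the `models` API from sentence-transformers; a minimal sketch mirroring the architecture printed above:

```python
from sentence_transformers import SentenceTransformer, models

# Transformer module: BERT encoder, 512-token sequences, no lowercasing
word_embedding_model = models.Transformer('FDSRashid/QulBERT', max_seq_length=512, do_lower_case=False)

# Pooling module: mean pooling over the 768-dimensional token embeddings
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```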

## Citing & Authors