base_model: BAAI/bge-base-en-v1.5
datasets: []
language:
  - en
library_name: sentence-transformers
license: apache-2.0
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@3
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@3
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@3
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@10
  - cosine_mrr@10
  - cosine_map@100
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:1496
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
widget:
  - source_sentence: >-
      We are currently involved in, and may in the future be involved in, legal
      proceedings, claims, and government investigations in the ordinary course
      of business. These include proceedings, claims, and investigations
      relating to, among other things, regulatory matters, commercial matters,
      intellectual property, competition, tax, employment, pricing,
      discrimination, consumer rights, personal injury, and property rights.
    sentences:
      - >-
        What factors does the regulatory authority consider when ensuring data
        protection in cross border transfers in Zimbabwe?
      - >-
        How does Securiti enable enterprises to safely use data and the cloud
        while managing security, privacy, and compliance risks?
      - What types of legal issues is the company currently involved in?
  - source_sentence: >-
      The Company’s minority market share in the global smartphone, personal
      computer and tablet markets can make developers less inclined to develop
      or upgrade software for the Company’s products and more inclined to devote
      their resources to developing and upgrading software for competitors’
      products with larger market share. When developers focus their efforts on
      these competing platforms, the availability and quality of applications
      for the Company’s devices can suffer.
    sentences:
      - What is the role of obtaining consent in Thailand's PDPA?
      - >-
        Why might developers be less inclined to develop or upgrade software for
        the Company's products?
      - >-
        What caused the increase in energy generation and storage segment
        revenue in 2023?
  - source_sentence: >-
      ** : EMEA (Europe, the Middle East and Africa) The Irish DPA implements
      the GDPR into the national law by incorporating most of the provisions of
      the GDPR with limited additions and deletions. It contains several
      provisions restricting data subjects’ rights that they generally have
      under the GDPR, for example, where restrictions are necessary for the
      enforcement of civil law claims. Resources* : Irish DPA Overview Irish
      Cookie Guidance ### Japan #### Japan’s Act on the Protection of Personal
      Information (APPI) **Effective Date (Amended APPI)** : April 01, 2022
      **Region** : APAC (Asia-Pacific) Japan’s APPI regulates personal related
      information and applies to any Personal Information Controller (the
      “PIC''), that is a person or entity providing personal related information
      for use in business in Japan. The APPI also applies to the foreign
    sentences:
      - >-
        What are the requirements for CIIOs and personal information processors
        in the state cybersecurity department regarding cross-border data
        transfers and certifications?
      - How does the Irish DPA implement the GDPR into national law?
      - >-
        What is the current status of the Personal Data Protection Act in El
        Salvador compared to Monaco and Venezuela?
  - source_sentence: >-
      View Salesforce View Workday View GCP View Azure View Oracle View US
      California CCPA View US California CPRA View European Union GDPR View
      Thailand’s PDPA View China PIPL View Canada PIPEDA View Brazil's LGPD View
      \+ More View Privacy View Security View Governance View Marketing View
      Resources Blog View Collateral View Knowledge Center View Securiti
      Education View Company About Us View Partner Program View Contact Us View
      News Coverage
    sentences:
      - >-
        What is the role of ANPD in ensuring LGPD compliance and protecting data
        subject rights, including those related to health professionals?
      - >-
        According to the Spanish data protection law, who is required to hire a
        DPO if they possess certain information in the event of a data breach?
      - >-
        What is GCP and how does it relate to privacy, security, governance,
        marketing, and resources?
  - source_sentence: >-
      vital interests of the data subject; Complying with an obligation
      prescribed in PDPL, not being a contractual obligation, or complying with
      an order from a competent court, the Public Prosecution, the investigation
      Judge, or the Military Prosecution; or Preparing or pursuing a legal claim
      or defense. vs Articles: 44 50, Recitals: 101, 112 GDPR states that
      personal data shall be transferred to a third country or international
      organization with an adequate protection level as determined by the EU
      Commission. Suppose there is no decision on an adequate protection level.
      In that case, a transfer is only permitted when the data controller or
      data processor provides appropriate safeguards that ensure data subject
      rights. Appropriate safeguards include: BCRs with specific requirements
      (e.g., a legal basis for processing, a retention period, and complaint
      procedures) Standard data protection clauses adopted by the EU
      Commission,  level of protection. If there is no adequate level of
      protection, then data controllers in Turkey and abroad shall commit, in
      writing, to provide an adequate level of protection abroad, as well as
      agree on the fact that the transfer is permitted by the Board of KVKK. vs
      Articles 44 50 Recitals 101, 112 GDPR states that personal data shall be
      transferred to a third country or international organization with an
      adequate protection level as determined by the EU Commission. Suppose
      there is no decision on an adequate protection level. In that case, a
      transfer is only permitted when the data controller or data processor
      provides appropriate safeguards that ensure data subject' rights.
      Appropriate safeguards include: BCRs with specific requirements (e.g., a
      legal basis for processing, a retention period, and complaint procedures);
      standard data protection clauses adopted by the EU Commission or by a
      supervisory authority; an approved code
    sentences:
      - What is the right to be informed in relation to personal data?
      - >-
        In what situations can a controller process personal data to protect
        vital interests?
      - >-
        What obligations in PDPL must data controllers or processors meet to
        protect personal data transferred to a third country or international
        organization?
model-index:
  - name: SentenceTransformer based on BAAI/bge-base-en-v1.5
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 768
          type: dim_768
        metrics:
          - type: cosine_accuracy@1
            value: 0.4020618556701031
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.5773195876288659
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.6804123711340206
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.7938144329896907
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.4020618556701031
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.1924398625429553
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.1360824742268041
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.07938144329896907
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.4020618556701031
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.5773195876288659
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.6804123711340206
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.7938144329896907
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.5821623921468868
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.5161471117656685
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.5239473985229559
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 512
          type: dim_512
        metrics:
          - type: cosine_accuracy@1
            value: 0.41237113402061853
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.5670103092783505
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.6597938144329897
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.7835051546391752
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.41237113402061853
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.18900343642611683
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.1319587628865979
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.07835051546391752
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.41237113402061853
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.5670103092783505
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.6597938144329897
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.7835051546391752
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.5830365443881826
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.5208312878415973
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.5295727941555394
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 256
          type: dim_256
        metrics:
          - type: cosine_accuracy@1
            value: 0.4020618556701031
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.6185567010309279
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.6494845360824743
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.7628865979381443
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.4020618556701031
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.20618556701030924
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.12989690721649483
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.07628865979381441
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.4020618556701031
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.6185567010309279
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.6494845360824743
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.7628865979381443
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.576352896876016
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.5177957781050565
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.527827441661229
            name: Cosine Map@100

SentenceTransformer based on BAAI/bge-base-en-v1.5

This is a sentence-transformers model fine-tuned from BAAI/bge-base-en-v1.5. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. It was trained with MatryoshkaLoss at 768, 512, and 256 dimensions, and the evaluation reports results at each of those sizes.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-base-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
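
The architecture above takes the [CLS] token vector as the sentence embedding (`pooling_mode_cls_token: True`) and then L2-normalizes it. A minimal pure-Python sketch of those two steps follows; the token vectors are made-up toy values, and the helper names `cls_pool` and `l2_normalize` are illustrative, not part of the library.

```python
import math

def cls_pool(token_embeddings):
    """Return the first token's vector ([CLS]) as the sentence embedding."""
    return list(token_embeddings[0])

def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

# Toy token embeddings for a 3-token input with 4 dimensions (hypothetical values).
tokens = [
    [3.0, 4.0, 0.0, 0.0],  # [CLS]
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.0, 0.5, 0.0],
]

sentence_embedding = l2_normalize(cls_pool(tokens))
print(sentence_embedding)
```

Because the output is unit length, the dot product of two such embeddings equals their cosine similarity.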

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("MugheesAwan11/bge-base-securiti-dataset-1-v22")
# Run inference
sentences = [
    "vital interests of the data subject; Complying with an obligation prescribed in PDPL, not being a contractual obligation, or complying with an order from a competent court, the Public Prosecution, the investigation Judge, or the Military Prosecution; or Preparing or pursuing a legal claim or defense. vs Articles: 44 50, Recitals: 101, 112 GDPR states that personal data shall be transferred to a third country or international organization with an adequate protection level as determined by the EU Commission. Suppose there is no decision on an adequate protection level. In that case, a transfer is only permitted when the data controller or data processor provides appropriate safeguards that ensure data subject rights. Appropriate safeguards include: BCRs with specific requirements (e.g., a legal basis for processing, a retention period, and complaint procedures) Standard data protection clauses adopted by the EU Commission,  level of protection. If there is no adequate level of protection, then data controllers in Turkey and abroad shall commit, in writing, to provide an adequate level of protection abroad, as well as agree on the fact that the transfer is permitted by the Board of KVKK. vs Articles 44 50 Recitals 101, 112 GDPR states that personal data shall be transferred to a third country or international organization with an adequate protection level as determined by the EU Commission. Suppose there is no decision on an adequate protection level. In that case, a transfer is only permitted when the data controller or data processor provides appropriate safeguards that ensure data subject' rights. Appropriate safeguards include: BCRs with specific requirements (e.g., a legal basis for processing, a retention period, and complaint procedures); standard data protection clauses adopted by the EU Commission or by a supervisory authority; an approved code",
    'What obligations in PDPL must data controllers or processors meet to protect personal data transferred to a third country or international organization?',
    'In what situations can a controller process personal data to protect vital interests?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
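
Since the model was trained with Matryoshka dimensions 768, 512, and 256, embeddings can be shortened by keeping only the leading values. The sketch below illustrates the idea in pure Python with made-up 6-dimensional vectors standing in for real embeddings; `truncate_and_normalize` is a hypothetical helper, not a library function.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return head if norm == 0.0 else [x / norm for x in head]

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))

# Made-up vectors standing in for 768-dimensional embeddings.
query = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0]
passage = [0.5, 0.5, 0.5, -0.5, 0.0, 0.0]

full_score = dot(truncate_and_normalize(query, 6), truncate_and_normalize(passage, 6))
short_score = dot(truncate_and_normalize(query, 3), truncate_and_normalize(passage, 3))
print(full_score, short_score)
```

Truncation changes the similarity scores, which is why the evaluation reports each dimension separately. Recent Sentence Transformers releases also accept a `truncate_dim` argument on the `SentenceTransformer` constructor for this purpose; check the documentation of the installed version before relying on it.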

Evaluation

Metrics

Information Retrieval (dataset: dim_768)

Metric Value
cosine_accuracy@1 0.4021
cosine_accuracy@3 0.5773
cosine_accuracy@5 0.6804
cosine_accuracy@10 0.7938
cosine_precision@1 0.4021
cosine_precision@3 0.1924
cosine_precision@5 0.1361
cosine_precision@10 0.0794
cosine_recall@1 0.4021
cosine_recall@3 0.5773
cosine_recall@5 0.6804
cosine_recall@10 0.7938
cosine_ndcg@10 0.5822
cosine_mrr@10 0.5161
cosine_map@100 0.5239

Information Retrieval (dataset: dim_512)

Metric Value
cosine_accuracy@1 0.4124
cosine_accuracy@3 0.567
cosine_accuracy@5 0.6598
cosine_accuracy@10 0.7835
cosine_precision@1 0.4124
cosine_precision@3 0.189
cosine_precision@5 0.132
cosine_precision@10 0.0784
cosine_recall@1 0.4124
cosine_recall@3 0.567
cosine_recall@5 0.6598
cosine_recall@10 0.7835
cosine_ndcg@10 0.583
cosine_mrr@10 0.5208
cosine_map@100 0.5296

Information Retrieval (dataset: dim_256)

Metric Value
cosine_accuracy@1 0.4021
cosine_accuracy@3 0.6186
cosine_accuracy@5 0.6495
cosine_accuracy@10 0.7629
cosine_precision@1 0.4021
cosine_precision@3 0.2062
cosine_precision@5 0.1299
cosine_precision@10 0.0763
cosine_recall@1 0.4021
cosine_recall@3 0.6186
cosine_recall@5 0.6495
cosine_recall@10 0.7629
cosine_ndcg@10 0.5764
cosine_mrr@10 0.5178
cosine_map@100 0.5278
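
In the tables above, accuracy@k equals recall@k and precision@k equals accuracy@k divided by k, which is what you would expect if every evaluation query has exactly one relevant passage. The sketch below computes the metrics under that assumption; the `ranks` list is made-up illustration data, not the model's evaluation output, and `ir_metrics` is a hypothetical helper.

```python
def ir_metrics(ranks, k):
    """Compute retrieval metrics at cutoff k.

    ranks: for each query, the 1-based rank of its single relevant
    passage, or None if the passage was not retrieved at all.
    """
    n = len(ranks)
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return {
        "accuracy": hits / n,         # fraction of queries with a hit in the top k
        "recall": hits / n,           # same value when there is one relevant passage
        "precision": hits / (n * k),  # at most one of the k results can be relevant
        "mrr": sum(1.0 / r for r in ranks if r is not None and r <= k) / n,
    }

# Hypothetical ranks for five queries.
ranks = [1, 2, None, 4, 1]
metrics_at_3 = ir_metrics(ranks, 3)
print(metrics_at_3)
```

The same relationship can be checked against the reported dim_768 numbers: 0.5773 / 3 ≈ 0.1924 and 0.7938 / 10 ≈ 0.0794.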

Training Details

Training Dataset

Unnamed Dataset

  • Size: 1,496 training samples
  • Columns: positive and anchor
  • Approximate statistics based on the first 1000 samples:
    column: positive | type: string | min: 67 tokens, mean: 216.99 tokens, max: 512 tokens
    column: anchor | type: string | min: 10 tokens, mean: 21.6 tokens, max: 102 tokens
  • Samples:
    positive: Leader in Data Privacy View Events Spotlight Talks Education Contact Us Schedule a Demo Products By Use Cases By Roles Data Command Center View Learn more Asset and Data Discovery Discover dark and native data assets Learn more Data Access Intelligence & Governance Identify which users have access to sensitive data and prevent unauthorized access Learn more Data Privacy Automation PrivacyCenter.Cloud Data Mapping
    positive: data subject must be notified of any such extension within one month of receiving the request, along with the reasons for the delay and the possibility of complaining to the supervisory authority. The right to restrict processing applies when the data subject contests data accuracy, the processing is unlawful, and the data subject opposes erasure and requests restriction. The controller must inform data subjects before any such restriction is lifted. Under GDPR, the data subject also has the right to obtain from the controller the rectification of inaccurate personal data and to have incomplete personal data completed. Article: 22 Under PDPL, if a decision is based solely on automated processing of personal data intended to assess the data subject regarding his/her performance at work, financial standing, credit-worthiness, reliability, or conduct, then the data subject has the right to request processing in a manner that is not solely automated. This right shall not apply where the decision is taken in the course of entering into
    anchor: What is the requirement for notifying the data subject of any extension under GDPR and PDPL?
    Automation PrivacyCenter.Cloud Data Mapping
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256
        ],
        "matryoshka_weights": [
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
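
The configuration above wraps MultipleNegativesRankingLoss in MatryoshkaLoss: the ranking loss is evaluated on embeddings truncated to each listed dimension and the weighted results are summed. The pure-Python sketch below illustrates that idea on toy vectors. It is not the library's implementation; the `scale=20.0` default and the use of cosine similarity are assumptions, and the dimensions are shrunk to 4 and 2 for readability.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch negatives: anchor i should score highest against positive i."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the ranking loss over truncated embeddings."""
    return sum(
        w * mnr_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d, w in zip(dims, weights)
    )

# Toy batch of two (anchor, positive) pairs with 4-dimensional embeddings.
anchors = [[1.0, 0.0, 0.2, 0.0], [0.0, 1.0, 0.0, 0.2]]
positives = [[0.9, 0.1, 0.1, 0.0], [0.1, 0.9, 0.0, 0.1]]

aligned = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
swapped = matryoshka_loss(anchors, positives[::-1], dims=[4, 2], weights=[1, 1])
print(aligned, swapped)
```

Well-matched pairs give a much smaller loss than mismatched ones, and every other positive in the batch acts as a negative, which is why the `no_duplicates` batch sampler matters for this loss.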
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • learning_rate: 2e-05
  • num_train_epochs: 1
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: True
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
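
With 1,496 training samples, a per-device batch size of 32, and one epoch, training lasts 47 optimizer steps, matching the final step in the training logs. The sketch below shows how a cosine schedule with `warmup_ratio: 0.1` could shape the learning rate over those steps. It assumes a single device, linear warmup, warmup steps rounded up, and a half-cosine decay to zero; the trainer's exact behavior may differ.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of optimizer steps when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)

def lr_at(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup followed by half-cosine decay (an assumed schedule shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = steps_per_epoch(1496, 32)  # 47
for step in (0, 5, 10, 20, 30, 40, 47):
    print(step, lr_at(step, total))
```

Under these assumptions the learning rate peaks at 2e-05 after 5 warmup steps and decays to zero by step 47.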

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: True
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss dim_256_cosine_map@100 dim_512_cosine_map@100 dim_768_cosine_map@100
0.2128 10 3.8486 - - -
0.4255 20 2.3611 - - -
0.6383 30 2.3209 - - -
0.8511 40 1.3248 - - -
1.0 47 - 0.5278 0.5296 0.5239
  • The final row (epoch 1.0, step 47) is the saved checkpoint.

Framework Versions

  • Python: 3.10.14
  • Sentence Transformers: 3.0.1
  • Transformers: 4.41.2
  • PyTorch: 2.1.2+cu121
  • Accelerate: 0.31.0
  • Datasets: 2.19.1
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning}, 
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}