SentenceTransformer based on BAAI/bge-base-en-v1.5
This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: BAAI/bge-base-en-v1.5
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Language: en
- License: apache-2.0
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
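The three modules above form a pipeline: the Transformer produces one 768-dimensional vector per token, the Pooling layer keeps only the first ([CLS]) token's vector, and Normalize rescales it to unit length. Below is a minimal pure-Python sketch of the last two steps. It uses made-up 4-dimensional token vectors rather than real BERT outputs, and the function names are illustrative, not part of the library.

```python
import math

def cls_pool(token_vectors):
    """Return the first token's vector, mirroring pooling_mode_cls_token=True."""
    return list(token_vectors[0])

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length, as a Normalize() layer does."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy "token embeddings" for two inputs (hypothetical values, 4 dims instead of 768).
tokens_a = [[3.0, 4.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
tokens_b = [[0.0, 4.0, 3.0, 0.0], [1.0, 2.0, 3.0, 4.0]]

emb_a = l2_normalize(cls_pool(tokens_a))
emb_b = l2_normalize(cls_pool(tokens_b))

# For unit-length vectors the dot product equals the cosine similarity.
similarity = dot(emb_a, emb_b)
print(emb_a)
print(similarity)
```

Because the final module normalizes every embedding, cosine similarity and dot product give the same ranking for this model's outputs.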
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
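from sentence_transformers import SentenceTransformer

# Download the model from the Hugging Face Hub
model = SentenceTransformer("MugheesAwan11/bge-base-securiti-dataset-1-v14")
# Example inputs: one passage followed by two queries
sentences = [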
"office of the \u200b\u200bFederal Commissioner for Data Protection and Freedom of Information, with its headquarters in the city of Bonn. It is led by a Federal Commissioner, elected via a vote by the German Bundestag. Eligibility criteria include being at least 35 years old, appropriate qualifications in the field of data protection law gained through relevant professional experience. The Commissioner's term is for five years, which can be extended once. The Commissioner has the responsibility to act as the primary office responsible for enforcing the Federal Data Protection Act within Germany. Some of the office's key responsibilities include: Advising the Bundestag, the Bundesrat, and the Federal Government on administrative and legislative measures related to data protection within the country; To oversee and implement both the GDPR and Federal Data Protection Act within Germany; To promote awareness within the public related to the risks, rules, safeguards, and rights concerning the processing of personal data; To handle all, within Germany. It supplements and aligns with the requirements of the EU GDPR. Yes, Germany is covered by GDPR (General Data Protection Regulation). GDPR is a regulation that applies uniformly across all EU member states, including Germany. The Federal Data Protection Act established the office of the \u200b\u200bFederal Commissioner for Data Protection and Freedom of Information, with its headquarters in the city of Bonn. It is led by a Federal Commissioner, elected via a vote by the German Bundestag. Germany's interpretation is the Bundesdatenschutzgesetz (BDSG), the German Federal Data Protection Act. It mirrors the GDPR in all key areas while giving local German regulatory authorities the power to enforce it more efficiently nationally. ## Join Our Newsletter Get all the latest information, law updates and more delivered to your inbox ### Share Copy 14 ### More Stories that May Interest You View More",
'What are the main responsibilities of the Federal Commissioner for Data Protection and Freedom of Information in enforcing data protection laws in Germany, including the GDPR and the Federal Data Protection Act?',
'What is the collection and use of personal information by businesses?',
]
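# Encode the inputs into normalized 768-dimensional embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Compute the pairwise cosine similarity scores
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])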
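Because the model was trained with MatryoshkaLoss at 768, 512, 256, 128 and 64 dimensions (see Training Details), an embedding can be shortened by keeping only its leading components and renormalizing, trading some retrieval quality for smaller vectors. The sketch below shows that step on plain Python lists. `truncate_embedding` and `rank_by_cosine` are hypothetical helper names, and the toy vectors stand in for real `model.encode` output.

```python
import math

def truncate_embedding(embedding, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def rank_by_cosine(query, documents):
    """Return document indices sorted by descending cosine similarity.

    Inputs are assumed to be unit-length, so cosine equals the dot product.
    """
    scores = [sum(q * d for q, d in zip(query, doc)) for doc in documents]
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)

# Toy 4-dimensional vectors standing in for 768-dimensional embeddings.
query = [0.5, 0.5, 0.5, 0.5]
docs = [
    [0.1, 0.9, 0.3, 0.3],
    [0.6, 0.6, 0.4, 0.35],
    [0.9, -0.1, 0.3, 0.3],
]

dim = 2  # for the real model this would be one of 512, 256, 128 or 64
short_query = truncate_embedding(query, dim)
short_docs = [truncate_embedding(d, dim) for d in docs]
ranking = rank_by_cosine(short_query, short_docs)
print(ranking)
```

The Evaluation section reports how retrieval quality changes at each of the trained dimensions.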
Evaluation
Metrics
Information Retrieval (dim_768)
| Metric | Value |
| --- | --- |
| cosine_accuracy@1 | 0.6804 |
| cosine_accuracy@3 | 0.9072 |
| cosine_accuracy@5 | 0.9485 |
| cosine_accuracy@10 | 0.9691 |
| cosine_precision@1 | 0.6804 |
| cosine_precision@3 | 0.3024 |
| cosine_precision@5 | 0.1897 |
| cosine_precision@10 | 0.0969 |
| cosine_recall@1 | 0.6804 |
| cosine_recall@3 | 0.9072 |
| cosine_recall@5 | 0.9485 |
| cosine_recall@10 | 0.9691 |
| cosine_ndcg@10 | 0.8366 |
| cosine_mrr@10 | 0.7925 |
| cosine_map@100 | 0.7937 |
Information Retrieval (dim_512)
| Metric | Value |
| --- | --- |
| cosine_accuracy@1 | 0.6907 |
| cosine_accuracy@3 | 0.8763 |
| cosine_accuracy@5 | 0.9278 |
| cosine_accuracy@10 | 0.9691 |
| cosine_precision@1 | 0.6907 |
| cosine_precision@3 | 0.2921 |
| cosine_precision@5 | 0.1856 |
| cosine_precision@10 | 0.0969 |
| cosine_recall@1 | 0.6907 |
| cosine_recall@3 | 0.8763 |
| cosine_recall@5 | 0.9278 |
| cosine_recall@10 | 0.9691 |
| cosine_ndcg@10 | 0.833 |
| cosine_mrr@10 | 0.7889 |
| cosine_map@100 | 0.7896 |
Information Retrieval (dim_256)
| Metric | Value |
| --- | --- |
| cosine_accuracy@1 | 0.6907 |
| cosine_accuracy@3 | 0.8557 |
| cosine_accuracy@5 | 0.8969 |
| cosine_accuracy@10 | 0.9278 |
| cosine_precision@1 | 0.6907 |
| cosine_precision@3 | 0.2852 |
| cosine_precision@5 | 0.1794 |
| cosine_precision@10 | 0.0928 |
| cosine_recall@1 | 0.6907 |
| cosine_recall@3 | 0.8557 |
| cosine_recall@5 | 0.8969 |
| cosine_recall@10 | 0.9278 |
| cosine_ndcg@10 | 0.8132 |
| cosine_mrr@10 | 0.7759 |
| cosine_map@100 | 0.7795 |
Information Retrieval (dim_128)
| Metric | Value |
| --- | --- |
| cosine_accuracy@1 | 0.5979 |
| cosine_accuracy@3 | 0.7732 |
| cosine_accuracy@5 | 0.8247 |
| cosine_accuracy@10 | 0.8866 |
| cosine_precision@1 | 0.5979 |
| cosine_precision@3 | 0.2577 |
| cosine_precision@5 | 0.1649 |
| cosine_precision@10 | 0.0887 |
| cosine_recall@1 | 0.5979 |
| cosine_recall@3 | 0.7732 |
| cosine_recall@5 | 0.8247 |
| cosine_recall@10 | 0.8866 |
| cosine_ndcg@10 | 0.7462 |
| cosine_mrr@10 | 0.701 |
| cosine_map@100 | 0.7047 |
Information Retrieval (dim_64)
| Metric | Value |
| --- | --- |
| cosine_accuracy@1 | 0.5155 |
| cosine_accuracy@3 | 0.6907 |
| cosine_accuracy@5 | 0.7113 |
| cosine_accuracy@10 | 0.7732 |
| cosine_precision@1 | 0.5155 |
| cosine_precision@3 | 0.2302 |
| cosine_precision@5 | 0.1423 |
| cosine_precision@10 | 0.0773 |
| cosine_recall@1 | 0.5155 |
| cosine_recall@3 | 0.6907 |
| cosine_recall@5 | 0.7113 |
| cosine_recall@10 | 0.7732 |
| cosine_ndcg@10 | 0.6471 |
| cosine_mrr@10 | 0.6064 |
| cosine_map@100 | 0.6137 |
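In every table above, accuracy@k equals recall@k and precision@k equals accuracy@k divided by k. That pattern is what one would expect if each query has exactly one relevant passage. Under that assumption (inferred from the numbers, not stated in the card), the per-query metrics can be sketched as follows. The function names and the toy rankings are hypothetical.

```python
import math

def ir_metrics(ranked_ids, relevant_id, k):
    """Retrieval metrics at cutoff k for one query with a single relevant document."""
    top_k = ranked_ids[:k]
    hit = relevant_id in top_k
    rank = top_k.index(relevant_id) + 1 if hit else None
    return {
        "accuracy": 1.0 if hit else 0.0,
        "precision": (1.0 / k) if hit else 0.0,
        "recall": 1.0 if hit else 0.0,
        "mrr": (1.0 / rank) if hit else 0.0,
        # With one relevant document the ideal DCG is 1, so NDCG equals DCG.
        "ndcg": (1.0 / math.log2(rank + 1)) if hit else 0.0,
    }

def mean_metrics(queries, k):
    """Average the per-query metrics over (ranked_ids, relevant_id) pairs."""
    totals = {}
    for ranked_ids, relevant_id in queries:
        for name, value in ir_metrics(ranked_ids, relevant_id, k).items():
            totals[name] = totals.get(name, 0.0) + value
    return {name: total / len(queries) for name, total in totals.items()}

# Hypothetical rankings: each tuple is (retrieved ids in order, the relevant id).
queries = [
    (["d1", "d2", "d3"], "d1"),  # relevant document at rank 1
    (["d4", "d5", "d6"], "d6"),  # relevant document at rank 3
    (["d7", "d8", "d9"], "d0"),  # relevant document not retrieved
]
scores = mean_metrics(queries, k=3)
print(scores)
```

As a check against the reported numbers, cosine_precision@3 for dim_768 is 0.3024, which is cosine_accuracy@3 divided by 3 (0.9072 / 3).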
Training Details
Training Dataset
Unnamed Dataset
- Size: 7,872 training samples
- Columns: positive and anchor
- Approximate statistics based on the first 1000 samples:

| | positive | anchor |
| --- | --- | --- |
| type | string | string |
| details | min: 18 tokens, mean: 206.12 tokens, max: 414 tokens | min: 9 tokens, mean: 21.62 tokens, max: 102 tokens |
- Samples:
positive |
anchor |
Automation PrivacyCenter.Cloud |
Data Mapping |
on both in terms of material and territorial scope. ### 1.1 Material Scope The Spanish data protection law affords blanket protection for all data that may have been collected on a data subject. There are only a handful of exceptions that include: Information subject to a pending legal case Information collected concerning the investigation of terrorism or organised crime Information classified as "Confidential" for matters related to Spain's national security ### 1.2 Territorial Scope The Spanish data protection law applies to all data handlers that are: Carrying out data collection activities in Spain Not established in Spain but carrying out data collection activities on Spanish territory Not established within the European Union but carrying out data collection activities on Spanish residents unless for data transit purposes only ## 2. Obligations for Organizations Under Spanish Data Protection Law The Spanish data protection law and GDPR lay out specific obligations for all data handlers. These obligations ensure, . ### 2.3 Privacy Policy Requirements Spain's data protection law requires all data handlers to inform the data subject of the following in their privacy policy: The purpose of collecting the data and the recipients of the information The obligatory or voluntary nature of the reply to the questions put to them The consequences of obtaining the data or of refusing to provide them The possibility of exercising rights of access, rectification, erasure, portability, and objection The identity and address of the controller or their local Spanish representative ### 2.4 Security Requirements Article 9 of Spain's Data Protection Law is direct and explicit in stating the responsibility of the data handler is to take adequate measures to ensure the protection of any data collected. 
It mandates all data handlers to adopt technical and organisational measures necessary to ensure the security of the personal data and prevent their alteration, loss, and unauthorised processing or access. Additionally, collection of any |
What are the requirements for organizations under the Spanish data protection law regarding privacy policies and security measures? |
before the point of collection of their personal information. ## Right to Erasure The right to erasure gives consumers the right to request deleting all their data stored by the organization. Organizations are supposed to comply within 45 days and must deliver a report to the consumer confirming the deletion of their information. ## Right to Opt-in for Minors Personal information containing minors' personal information cannot be sold by a business unless the minor (age of 13 to 16 years) or the Parent/Guardian (if the minor is aged below 13 years) opt-ins to allow this sale. Businesses can be held liable for the sale of minors' personal information if they either knew or wilfully disregarded the consumer's status as a minor and the minor or Parent/Guardian had not willingly opted in. ## Right to Continued Protection Even when consumers choose to allow a business to collect and sell their personal information, businesses' must sign written |
What are the conditions under which businesses can sell minors' personal information? |
- Loss: MatryoshkaLoss with these parameters:
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [768, 512, 256, 128, 64],
    "matryoshka_weights": [1, 1, 1, 1, 1],
    "n_dims_per_step": -1
}
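The configuration above wraps an in-batch-negatives ranking loss in a Matryoshka objective. The same loss is computed on embeddings truncated to each listed dimension, and the results are combined with the listed weights. The following is a simplified pure-Python sketch of that idea, not the library's implementation. The helper names are hypothetical, the 4-dimensional vectors are toy data, and the similarity scale of 20 is an assumption, since the card does not list it.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine scores; positives[i] is the target for anchors[i].

    Every other positive in the batch serves as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the ranking loss computed on truncated embeddings."""
    loss = 0.0
    for dim, weight in zip(dims, weights):
        short_anchors = [a[:dim] for a in anchors]
        short_positives = [p[:dim] for p in positives]
        loss += weight * in_batch_ranking_loss(short_anchors, short_positives)
    return loss

# Toy batch of two (anchor, positive) pairs with 4-dimensional embeddings.
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.0, 0.2], [0.1, 0.8, 0.3, 0.0]]

loss = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
print(loss)
```

Summing over several truncation lengths is what encourages the leading components of each embedding to remain useful on their own.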
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: epoch
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 16
- learning_rate: 2e-05
- num_train_epochs: 2
- lr_scheduler_type: cosine
- warmup_ratio: 0.1
- bf16: True
- tf32: True
- load_best_model_at_end: True
- optim: adamw_torch_fused
- batch_sampler: no_duplicates
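With 7,872 training samples and a per-device batch size of 32, one epoch is nominally 246 optimizer steps, so two epochs give the 492 steps seen in the training logs. The sketch below traces a linear-warmup, cosine-decay schedule under those assumptions (a single device, no gradient accumulation, warmup steps rounded up). It illustrates the shape of the schedule and is not a reproduction of the trainer's code. `cosine_lr` is a hypothetical helper name.

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_ratio):
    """Learning rate at `step`: linear warmup, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

samples, batch_size, epochs = 7872, 32, 2
steps_per_epoch = math.ceil(samples / batch_size)  # 246
total_steps = steps_per_epoch * epochs             # 492

# Print the rate at the start, mid-warmup, peak, mid-decay and end of training.
for step in (0, 25, 50, 271, 492):
    print(step, cosine_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1))
```

Under these assumptions the rate peaks at 2e-05 after 50 warmup steps and falls to half of that around step 271.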
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: epoch
- prediction_loss_only: True
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- learning_rate: 2e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 2
- max_steps: -1
- lr_scheduler_type: cosine
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: True
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: True
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: True
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch_fused
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
Training Logs
| Epoch | Step | Training Loss | dim_128_cosine_map@100 | dim_256_cosine_map@100 | dim_512_cosine_map@100 | dim_64_cosine_map@100 | dim_768_cosine_map@100 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0407 | 10 | 7.3941 | - | - | - | - | - |
| 0.0813 | 20 | 6.0968 | - | - | - | - | - |
| 0.1220 | 30 | 4.9439 | - | - | - | - | - |
| 0.1626 | 40 | 3.8622 | - | - | - | - | - |
| 0.2033 | 50 | 3.0938 | - | - | - | - | - |
| 0.2439 | 60 | 1.8775 | - | - | - | - | - |
| 0.2846 | 70 | 2.3808 | - | - | - | - | - |
| 0.3252 | 80 | 4.0718 | - | - | - | - | - |
| 0.3659 | 90 | 2.2182 | - | - | - | - | - |
| 0.4065 | 100 | 1.914 | - | - | - | - | - |
| 0.4472 | 110 | 1.5123 | - | - | - | - | - |
| 0.4878 | 120 | 1.7047 | - | - | - | - | - |
| 0.5285 | 130 | 2.9509 | - | - | - | - | - |
| 0.5691 | 140 | 1.0605 | - | - | - | - | - |
| 0.6098 | 150 | 1.762 | - | - | - | - | - |
| 0.6504 | 160 | 1.6545 | - | - | - | - | - |
| 0.6911 | 170 | 3.0971 | - | - | - | - | - |
| 0.7317 | 180 | 1.3791 | - | - | - | - | - |
| 0.7724 | 190 | 1.9717 | - | - | - | - | - |
| 0.8130 | 200 | 5.1309 | - | - | - | - | - |
| 0.8537 | 210 | 1.4047 | - | - | - | - | - |
| 0.8943 | 220 | 1.4391 | - | - | - | - | - |
| 0.9350 | 230 | 3.6443 | - | - | - | - | - |
| 0.9756 | 240 | 3.721 | - | - | - | - | - |
| 1.0122 | 249 | - | 0.6625 | 0.7330 | 0.7497 | 0.5784 | 0.7568 |
| 1.0041 | 250 | 1.3171 | - | - | - | - | - |
| 1.0447 | 260 | 5.2603 | - | - | - | - | - |
| 1.0854 | 270 | 4.0513 | - | - | - | - | - |
| 1.1260 | 280 | 2.5508 | - | - | - | - | - |
| 1.1667 | 290 | 1.7385 | - | - | - | - | - |
| 1.2073 | 300 | 1.1692 | - | - | - | - | - |
| 1.2480 | 310 | 0.788 | - | - | - | - | - |
| 1.2886 | 320 | 1.2322 | - | - | - | - | - |
| 1.3293 | 330 | 3.3735 | - | - | - | - | - |
| 1.3699 | 340 | 1.2204 | - | - | - | - | - |
| 1.4106 | 350 | 0.8458 | - | - | - | - | - |
| 1.4512 | 360 | 0.7586 | - | - | - | - | - |
| 1.4919 | 370 | 0.8964 | - | - | - | - | - |
| 1.5325 | 380 | 1.9721 | - | - | - | - | - |
| 1.5732 | 390 | 0.5605 | - | - | - | - | - |
| 1.6138 | 400 | 0.9648 | - | - | - | - | - |
| 1.6545 | 410 | 1.0002 | - | - | - | - | - |
| 1.6951 | 420 | 2.138 | - | - | - | - | - |
| 1.7358 | 430 | 0.8221 | - | - | - | - | - |
| 1.7764 | 440 | 2.124 | - | - | - | - | - |
| 1.8171 | 450 | 2.7892 | - | - | - | - | - |
| 1.8577 | 460 | 0.9088 | - | - | - | - | - |
| 1.8984 | 470 | 0.9254 | - | - | - | - | - |
| 1.9390 | 480 | 3.1205 | - | - | - | - | - |
| 1.9797 | 490 | 3.014 | - | - | - | - | - |
| 1.9878 | 492 | - | 0.7047 | 0.7795 | 0.7896 | 0.6137 | 0.7937 |
- The saved checkpoint corresponds to the final row (step 492), whose scores match the Evaluation tables above.
Framework Versions
- Python: 3.10.14
- Sentence Transformers: 3.0.1
- Transformers: 4.41.2
- PyTorch: 2.1.2+cu121
- Accelerate: 0.31.0
- Datasets: 2.19.1
- Tokenizers: 0.19.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}