SentenceTransformer based on Alibaba-NLP/gte-large-en-v1.5

This is a sentence-transformers model finetuned from Alibaba-NLP/gte-large-en-v1.5. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Alibaba-NLP/gte-large-en-v1.5
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
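
The Pooling module above enables only pooling_mode_cls_token, so the sentence embedding is the hidden state of the first ([CLS]) token rather than an average over all tokens. A minimal pure-Python sketch of that selection follows; the helper name cls_pool and the toy hidden size of 4 (the model itself uses 1024) are assumptions made for illustration, not part of the library.

```python
def cls_pool(token_embeddings):
    """Pick the first-token ([CLS]) vector of every sequence.

    token_embeddings: list of sequences, each a list of per-token vectors.
    """
    return [sequence[0] for sequence in token_embeddings]


# Toy batch: 2 sequences, 3 tokens each, hidden size 4 (illustrative only).
hidden_states = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

sentence_embeddings = cls_pool(hidden_states)
print(sentence_embeddings)
```

Each output row is one sentence embedding; the remaining token vectors are discarded by this pooling mode.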

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

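# Download from the 🤗 Hub (the gte-large-en-v1.5 base model ships custom modeling code,
# so trust_remote_code=True is needed to load it)
model = SentenceTransformer("lw2134/policy_gte_large_5", trust_remote_code=True)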
# Run inference
sentences = [
    'What are the key principles and frameworks mentioned in the white paper that govern the implementation of AI in national security and defense activities?',
    'This white paper recognizes that national security (which includes certain law enforcement and \nhomeland security activities) and defense activities are of increased sensitivity and interest to our nation’s \nadversaries and are often subject to special requirements, such as those governing classified information and \nother protected data. Such activities require alternative, compatible safeguards through existing policies that \ngovern automated systems and AI, such as the Department of Defense (DOD) AI Ethical Principles and \nResponsible AI Implementation Pathway and the Intelligence Community (IC) AI Ethics Principles and \nFramework. The implementation of these policies to national security and defense activities can be informed by \nthe Blueprint for an AI Bill of Rights where feasible. \nThe Blueprint for an AI Bill of Rights is not intended to, and does not, create any legal right, benefit, or \ndefense, substantive or procedural, enforceable at law or in equity by any party against the United States, its \ndepartments, agencies, or entities, its officers, employees, or agents, or any other person, nor does it constitute a \nwaiver of sovereign immunity. \nCopyright Information \nThis document is a work of the United States Government and is in the public domain (see 17 U.S.C. §105). \n2',
    "APPENDIX\n• OSTP conducted meetings with a variety of stakeholders in the private sector and civil society. Some of these\nmeetings were specifically focused on providing ideas related to the development of the Blueprint for an AI\nBill of Rights while others provided useful general context on the positive use cases, potential harms, and/or\noversight possibilities for these technologies. Participants in these conversations from the private sector and\ncivil society included:\nAdobe \nAmerican Civil Liberties Union (ACLU) The Aspen Commission on Information Disorder The Awood Center The Australian Human Rights Commission Biometrics Institute The Brookings Institute BSA | The Software Alliance Cantellus Group Center for American Progress Center for Democracy and Technology Center on Privacy and Technology at Georgetown Law Christiana Care Color of Change Coworker Data Robot Data Trust Alliance Data and Society Research Institute Deepmind EdSAFE AI Alliance Electronic Privacy Information Center (EPIC) Encode Justice Equal AI Google Hitachi's AI Policy Committee The Innocence Project Institute of Electrical and Electronics Engineers (IEEE) Intuit Lawyers Committee for Civil Rights Under Law Legal Aid Society The Leadership Conference on Civil and Human Rights Meta Microsoft The MIT AI Policy Forum Movement Alliance Project The National Association of Criminal Defense Lawyers O’Neil Risk Consulting & Algorithmic Auditing The Partnership on AI Pinterest The Plaintext Group pymetrics SAP The Security Industry Association Software and Information Industry Association (SIIA) Special Competitive Studies Project Thorn United for Respect University of California at Berkeley Citris Policy Lab University of California at Berkeley Labor Center Unfinished/Project Liberty Upturn US Chamber of Commerce US Chamber of Commerce Technology Engagement Center \nA.I. Working Group\nVibrent HealthWarehouse Worker ResourceCenterWaymap\n62",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
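
For semantic search, one row of the similarity matrix can be sorted to rank passages against a query. The sketch below shows that ranking step in pure Python. The small hand-made vectors stand in for the output of model.encode, and the helper names cosine_similarity and rank_passages are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return (passage_index, score) pairs, best match first."""
    scores = [cosine_similarity(query_vec, p) for p in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Stand-ins for embeddings; real vectors would have 1024 components.
query = [1.0, 0.0, 1.0]
passages = [[0.9, 0.1, 1.1], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

ranking = rank_passages(query, passages)
for index, score in ranking:
    print(index, round(score, 4))
```

With real embeddings, query and passages would be replaced by rows of the array returned by model.encode.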

Evaluation

Metrics

Information Retrieval

Metric Value
cosine_accuracy@1 0.7222
cosine_accuracy@3 0.9815
cosine_accuracy@5 1.0
cosine_accuracy@10 1.0
cosine_precision@1 0.7222
cosine_precision@3 0.3272
cosine_precision@5 0.2
cosine_precision@10 0.1
cosine_recall@1 0.7222
cosine_recall@3 0.9815
cosine_recall@5 1.0
cosine_recall@10 1.0
cosine_ndcg@10 0.8816
cosine_mrr@10 0.841
cosine_map@100 0.841
dot_accuracy@1 0.7037
dot_accuracy@3 0.9815
dot_accuracy@5 1.0
dot_accuracy@10 1.0
dot_precision@1 0.7037
dot_precision@3 0.3272
dot_precision@5 0.2
dot_precision@10 0.1
dot_recall@1 0.7037
dot_recall@3 0.9815
dot_recall@5 1.0
dot_recall@10 1.0
dot_ndcg@10 0.8748
dot_mrr@10 0.8318
dot_map@100 0.8318
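
The figures above are consistent with each query having exactly one relevant passage: accuracy@k equals recall@k, and precision@k equals recall@k divided by k (for example, 0.9815 / 3 ≈ 0.3272). Under that assumption, the metrics can be computed from the rank of the relevant passage for each query. The sketch below uses made-up ranks, not the evaluation data behind this table, and the helper name metrics_at_k is hypothetical.

```python
def metrics_at_k(ranks, k):
    """Compute retrieval metrics assuming one relevant passage per query.

    ranks: 1-based rank of the relevant passage for each query.
    """
    n = len(ranks)
    hits = sum(1 for r in ranks if r <= k)
    accuracy = hits / n               # same as recall@k with one relevant passage
    precision = hits / (n * k)        # at most one of the k results is relevant
    mrr = sum(1.0 / r for r in ranks if r <= k) / n
    return {"accuracy": accuracy, "recall": accuracy, "precision": precision, "mrr": mrr}


# Illustrative ranks for six queries (not taken from the model's evaluation set).
ranks = [1, 1, 2, 3, 1, 2]

at1 = metrics_at_k(ranks, 1)
at3 = metrics_at_k(ranks, 3)
print(at1)
print(at3)
```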

Training Details

Training Dataset

Unnamed Dataset

  • Size: 224 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 224 samples:
    • sentence_0: type string; min 23 tokens, mean 36.01 tokens, max 55 tokens
    • sentence_1: type string; min 22 tokens, mean 569.67 tokens, max 1018 tokens
  • Samples:
    Sample 1
    sentence_0: What are the primary objectives outlined in the "Blueprint for an AI Bill of Rights" as it pertains to the American people?
    sentence_1: BLUEPRINT FOR AN
    AI B ILL OF
    RIGHTS
    MAKING AUTOMATED
    SYSTEMS WORK FOR
    THE AMERICAN PEOPLE
    OCTOBER 2022
    Sample 2
    sentence_0: In what ways does the document propose to ensure that automated systems are designed to work effectively for the benefit of society?
    sentence_1: BLUEPRINT FOR AN
    AI B ILL OF
    RIGHTS
    MAKING AUTOMATED
    SYSTEMS WORK FOR
    THE AMERICAN PEOPLE
    OCTOBER 2022
    Sample 3
    sentence_0: What is the primary purpose of the Blueprint for an AI Bill of Rights as outlined by the White House Office of Science and Technology Policy?
    sentence_1: About this Document
    The Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People was
    published by the White House Office of Science and Technology Policy in October 2022. This framework was
    released one year after OSTP announced the launch of a process to develop “a bill of rights for an AI-powered
    world.” Its release follows a year of public engagement to inform this initiative. The framework is available
    online at: https://www.whitehouse.gov/ostp/ai-bill-of-rights
    About the Office of Science and Technology Policy
    The Office of Science and Technology Policy (OSTP) was established by the National Science and Technology
    Policy, Organization, and Priorities Act of 1976 to provide the President and others within the Executive Office
    of the President with advice on the scientific, engineering, and technological aspects of the economy, national
    security, health, foreign relations, the environment, and the technological recovery and use of resources, among
    other topics. OSTP leads interagency science and technology policy coordination efforts, assists the Office of
    Management and Budget (OMB) with an annual review and analysis of Federal research and development in
    budgets, and serves as a source of scientific and technological analysis and judgment for the President with
    respect to major policies, plans, and programs of the Federal Government.
    Legal Disclaimer
    The Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People is a white paper
    published by the White House Office of Science and Technology Policy. It is intended to support the
    development of policies and practices that protect civil rights and promote democratic values in the building,
    deployment, and governance of automated systems.
    The Blueprint for an AI Bill of Rights is non-binding and does not constitute U.S. government policy. It
    does not supersede, modify, or direct an interpretation of any existing statute, regulation, policy, or
    international instrument. It does not constitute binding guidance for the public or Federal agencies and
    therefore does not require compliance with the principles described herein. It also is not determinative of what
    the U.S. government’s position will be in any international negotiation. Adoption of these principles may not
    meet the requirements of existing statutes, regulations, policies, or international instruments, or the
    requirements of the Federal agencies that enforce them. These principles are not intended to, and do not,
    prohibit or limit any lawful activity of a government agency, including law enforcement, national security, or
    intelligence activities.
    The appropriate application of the principles set forth in this white paper depends significantly on the
    context in which automated systems are being utilized. In some circumstances, application of these principles
    in whole or in part may not be appropriate given the intended use of automated systems to achieve government
    agency missions. Future sector-specific guidance will likely be necessary and important for guiding the use of
    automated systems in certain settings such as AI systems used as part of school building security or automated
    health diagnostic systems.
    The Blueprint for an AI Bill of Rights recognizes that law enforcement activities require a balancing of
    equities, for example, between the protection of sensitive law enforcement information and the principle of
    notice; as such, notice may not be appropriate, or may need to be adjusted to protect sources, methods, and
    other law enforcement equities. Even in contexts where these principles may not apply in whole or in part,
    federal departments and agencies remain subject to judicial, privacy, and civil liberties oversight as well as
    existing policies and safeguards that govern automated systems, including, for example, Executive Order 13960,
    Promoting the Use of Trustworthy Artificial Intelligence in the Federal Government (December 2020).
    This white paper recognizes that national security (which includes certain law enforcement and
    homeland security activities) and defense activities are of increased sensitivity and interest to our nation’s
    adversaries and are often subject to special requirements, such as those governing classified information and
    other protected data. Such activities require alternative, compatible safeguards through existing policies that
    govern automated systems and AI, such as the Department of Defense (DOD) AI Ethical Principles and
    Responsible AI Implementation Pathway and the Intelligence Community (IC) AI Ethics Principles and
    Framework. The implementation of these policies to national security and defense activities can be informed by
    the Blueprint for an AI Bill of Rights where feasible.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
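
MultipleNegativesRankingLoss treats each (sentence_0, sentence_1) pair in a batch as a positive and uses the other sentence_1 entries in the same batch as negatives. Cosine similarities are multiplied by the scale (20.0 here) and passed to a cross-entropy loss whose target is the matching pair. The pure-Python sketch below illustrates that computation on toy 2-dimensional vectors. It is not the library's implementation, and the helper names cos_sim and mnr_loss are hypothetical.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over in-batch candidates; positives[i] is the target for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(value - top) for value in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]

# Each anchor points the same way as its own positive: loss is close to 0.
aligned = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
# Each anchor matches the other pair's positive instead: loss is close to the scale.
swapped = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

Because the negatives come from the batch itself, the batch size (5 in this run) also sets how many negatives each pair is contrasted against.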
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 5
  • per_device_eval_batch_size: 5
  • num_train_epochs: 5
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 5
  • per_device_eval_batch_size: 5
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

Epoch Step cosine_map@100
1.0 45 0.8410
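
The step count in this log follows from the dataset size and batch size reported earlier: 224 samples at a batch size of 5 give 45 batches per epoch. The short calculation below assumes a single device, gradient_accumulation_steps of 1, and dataloader_drop_last set to False, as listed in the hyperparameters.

```python
import math

train_samples = 224   # size of the training dataset
batch_size = 5        # per_device_train_batch_size
num_epochs = 5        # num_train_epochs

# The last, partial batch still counts as a step because drop_last is False.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 45 225
```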

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.1.1
  • Transformers: 4.44.2
  • PyTorch: 2.4.1+cu121
  • Accelerate: 0.34.2
  • Datasets: 3.0.1
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Model size: 434M params (Safetensors, tensor type F32)

Model tree for lw2134/policy_gte_large_5
