metadata
base_model: Snowflake/snowflake-arctic-embed-m
datasets: []
language: []
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:55744
  - loss:MultipleNegativesRankingLoss
widget:
  - source_sentence: >-
      Represent this sentence for searching relevant passages: 2014 Summer can i
      cash a check if my account is frozen?
    sentences:
      - >
        Jun 18 1927 Check Gift Card Balance. With your 16-digit card number and
        PIN, you can check the balance in a Walmart store, call 1-888-537-5503,
        or check your gift card balance online.
      - >
        13/07/2014 Frozen Account If your checking account has been frozen,
        which can happen if a levy has been placed on the account, you might
        still be able to cash a check. ... This means a check can be deposited
        into the account without being frozen, allowing you to access the cash.
      - >
        Guatemalan law allows firearm possession on shall-issue basis as a
        constitutional right. With approximately 12 civilian firearms per 100
        people, Guatemala is the 70th most armed country in the world.
        Constitution Guatemalan constitution protects right to own guns for
        home-defense: Law Current law regarding firearm possession was passed in
        2009. Permitted types of firearms Law allows civilians to own following
        types of firearms: Semi automatic pistols and revolvers of any calibre;
        Shotguns with barrel of length up to 24 inches; Mechanical and
        semi-automatic rifles. Firearm registration Simple possession requires
        registration of gun. Application for register must include:
        Certification proving ownership and legal acquisition of the firearm;
        Certification of lack of a criminal and police record in force (6 months
        of validity); Identity document; 4x4 photography on matte paper; Receipt
        of payment of all necessary fees; Presentation of firearm. Guatemalans
        are allowed possess any number of firearms. Carrying firearms Rules
        regarding carrying firearms are more strict with additional permit
        required and minimum age being 25 years. Only about 10% of legal guns
        can be carried in public places. Firearm possession Currently there are
        547,000 registered firearms in Guatemala (or 3 per 100 people). 60,658
        people have license to carry them. See also  Overview of gun laws by
        nation References  Guatemala Law of Guatemala
  - source_sentence: >-
      Represent this sentence for searching relevant passages: Be Great at
      Oblivion Elder Scrolls IV
    sentences:
      - >
        The Elder Scrolls IV: Oblivion is an intricate and very fun game. If you
        want to know how to completely just be the greatest at oblivion in the
        easiest way possible, this is the best guide for you.
      - >
        "08/03/75 Chronic elevation of potassium levels (also known as
        hyperkalemia) is usually a sign of reduced kidney function. However, it
        can also be caused by certain medications, acute injuries, or a severe
        diabetic crisis (called ""diabetic ketoacidosis"") among other things."
      - >
        12/01/2031 The major downfall of the Articles of Confederation was
        simply weakness. The federal government, under the Articles, was too
        weak to enforce their laws and therefore had no power. The Continental
        Congress had borrowed money to fight the Revolutionary War and could not
        repay their debts.
  - source_sentence: >-
      Represent this sentence for searching relevant passages: Renew Your
      Passport 11/19/71
    sentences:
      - >
        2025/02/18 The altitude affects the time an orbit takes, called the
        orbit period. The period of the space shuttle's orbit, at say 200
        kilometers, used to be about 90 minutes. Vanguard-1, by the way, has an
        orbital period of 134.2 minutes, with its periapsis altitude of 654 km,
        and apoapsis altitude of 3,969 km.
      - >
        The following article is for those who need to renew a United States of
        America Passport. You can usually renew your passport by mail, but under
        certain circumstances, you may need to renew your passport in person,
        instead. Nov 19 2071
      - >
        "09/06 You can say goodbye in German in nearly any circumstance if you
        know two phrases: ""Auf Wiedersehen"" and ""Tschüs."" If you really want
        to impress native German speakers, though, there are a few other phrases
        you can also use when parting ways."
  - source_sentence: >-
      Represent this sentence for searching relevant passages: today:2026-04-07
      last monday what is fx vs dx nikon?
    sentences:
      - >
        "spring 2026 Nikon makes a DX-format sensor and an FX-format sensor. The
        DX-format is the smaller sensor at 24x16mm; the larger FX-format sensor
        measures 36x24mm which is approximately the same size as 35mm film. ...
        The FX sensor, with more ""light gathering"" area, offers higher
        sensitivity and, generally, lower noise."
      - >
        10/21 A lifelong lack of calcium plays a role in the development of
        osteoporosis. Low calcium intake contributes to diminished bone density,
        early bone loss and an increased risk of fractures. Eating disorders.
        Severely restricting food intake and being underweight weakens bone in
        both men and women.
      - >
        2040 June Mahoe is a common name for several plants and may refer to:
        Alectryon macrococcus, or ʻalaʻalahua, a species of tree in the
        soapberry family endemic to Hawaii Melicytus ramiflorus, a tree endemic
        to New Zealand Other Melicytus trees in New Zealand Talipariti elatum,
        or blue mahoe, a species of tree in the mallow family native to the
        Caribbean
  - source_sentence: >-
      Represent this sentence for searching relevant passages: Witki,
      Warmian-Masurian Voivodeship 2040 Oct 12
    sentences:
      - >
        09/10 Honey roasted nuts make an excellent snack for special occasions,
        such as during the festive season or a party. 
      - >
        12-21-2046 This is a list of electoral results for the Electoral
        district of Irwin in Western Australian state elections. Members for
        Irwin Election results Elections in the 1940s  Preferences were not
        distributed.  Preferences were not distributed. Elections in the 1930s 
        Preferences were not distributed. Elections in the 1920s Elections in
        the 1910s Elections in the 1900s Elections in the 1890s References
        Western Australian state electoral results by district
      - >
        Witki () is a village in the administrative district of Gmina
        Bartoszyce, within Bartoszyce County, Warmian-Masurian Voivodeship, in
        northern Poland, close to the border with the Kaliningrad Oblast of
        Russia. It lies approximately east of Bartoszyce and north-east of the
        regional capital Olsztyn. References Witki 12/10/2040

Technical Report and Model Pipeline

To access our technical report and model pipeline scripts, visit our GitHub.

SentenceTransformer based on Snowflake/snowflake-arctic-embed-m

This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Snowflake/snowflake-arctic-embed-m
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
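
The three modules above can be read as: take the token embeddings from the Transformer, keep only the first ([CLS]) token vector, then scale it to unit length. The following is a minimal pure-Python sketch of the last two steps, not the library's implementation. The toy 4-dimensional vectors and the helper names `cls_pool` and `l2_normalize` are illustrative assumptions; the real model produces 768-dimensional vectors.

```python
import math

def cls_pool(token_embeddings):
    """Return the first ([CLS]) token vector, as pooling_mode_cls_token=True implies."""
    return token_embeddings[0]

def l2_normalize(vector):
    """Scale a vector to unit length, mirroring what a Normalize() step does."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

# Toy token embeddings for one sentence: 3 tokens, 4 dimensions (hypothetical values).
token_embeddings = [
    [3.0, 0.0, 4.0, 0.0],  # [CLS] token
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 2.0, 0.0, 1.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)  # [0.6, 0.0, 0.8, 0.0]
```

Because the output has unit length, the dot product of two such embeddings equals their cosine similarity.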

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'Represent this sentence for searching relevant passages: Witki, Warmian-Masurian Voivodeship 2040 Oct 12',
    'Witki () is a village in the administrative district of Gmina Bartoszyce, within Bartoszyce County, Warmian-Masurian Voivodeship, in northern Poland, close to the border with the Kaliningrad Oblast of Russia. It lies approximately east of Bartoszyce and north-east of the regional capital Olsztyn. References Witki 12/10/2040\n',
    '12-21-2046 This is a list of electoral results for the Electoral district of Irwin in Western Australian state elections. Members for Irwin Election results Elections in the 1940s  Preferences were not distributed.  Preferences were not distributed. Elections in the 1930s  Preferences were not distributed. Elections in the 1920s Elections in the 1910s Elections in the 1900s Elections in the 1890s References Western Australian state electoral results by district\n',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
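
In the examples in this card, queries carry the prefix "Represent this sentence for searching relevant passages: " while passages do not. A small sketch of how one might apply that prefix and rank passages by score follows. The helper names `format_query` and `rank_passages` are hypothetical, and the passage labels and scores are made-up placeholders standing in for one row of the similarity matrix, not real model output.

```python
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def format_query(text):
    """Prepend the retrieval prefix unless the text already carries it."""
    return text if text.startswith(QUERY_PREFIX) else QUERY_PREFIX + text

def rank_passages(scores, passages):
    """Return (passage, score) pairs sorted from most to least similar."""
    return sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)

query = format_query("Witki, Warmian-Masurian Voivodeship 2040 Oct 12")

# Placeholder labels and made-up scores, standing in for similarities[0, 1:].
passages = ["Irwin electoral results passage", "Witki village passage"]
scores = [0.12, 0.71]

ranked = rank_passages(scores, passages)
print(query)
print(ranked[0][0])  # Witki village passage
```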

Training Details

Training Dataset

Unnamed Dataset

  • Size: 55,744 training samples
  • Columns: anchors and positive
  • Approximate statistics based on the first 1000 samples:
    • anchors (string): min 14 tokens, mean 20.33 tokens, max 33 tokens
    • positive (string): min 17 tokens, mean 48.58 tokens, max 75 tokens
  • Samples:
    • anchors: Represent this sentence for searching relevant passages: are bugs attracted to citronella November 10?
      positive: Citronella is naturally occurring oil that repels insects. ... “Citronella oil is repellent to mosquitoes to a degree, but the amount being put out by a candle isn't going to be very effective,” Eric Hoffer, president of Hoffer Pest, told TODAY Home. 11/10
    • anchors: Represent this sentence for searching relevant passages: are bugs attracted to citronella 11/10/09?
      positive: Citronella is naturally occurring oil that repels insects. ... “Citronella oil is repellent to mosquitoes to a degree, but the amount being put out by a candle isn't going to be very effective,” Eric Hoffer, president of Hoffer Pest, told TODAY Home. 11/10/09
    • anchors: Represent this sentence for searching relevant passages: are bugs attracted to citronella Jan 15?
      positive: Citronella is naturally occurring oil that repels insects. ... “Citronella oil is repellent to mosquitoes to a degree, but the amount being put out by a candle isn't going to be very effective,” Eric Hoffer, president of Hoffer Pest, told TODAY Home. 01/15
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
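
To illustrate what these parameters control, here is a minimal pure-Python sketch of a multiple-negatives ranking objective: each anchor is scored against every positive in the batch with cosine similarity, the scores are multiplied by `scale`, and cross-entropy is taken with the anchor's own positive as the target. This is an illustrative reading of the loss, not the sentence-transformers implementation; the function names and the 2-dimensional toy vectors are assumptions.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch serves as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each anchor matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each anchor matches the other row's positive

print(mnr_loss(anchors, aligned))  # very close to 0
print(mnr_loss(anchors, swapped))  # very close to 20
```

A larger `scale` sharpens the softmax, so mismatched pairs are penalized more heavily.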
    

Evaluation Dataset

Unnamed Dataset

  • Size: 1,000 evaluation samples
  • Columns: anchors and positive
  • Approximate statistics based on the first 1000 samples:
    • anchors (string): min 12 tokens, mean 21.57 tokens, max 43 tokens
    • positive (string): min 8 tokens, mean 66.44 tokens, max 512 tokens
  • Samples:
    • anchors: Represent this sentence for searching relevant passages: Identify a Psychopath 3/28
      positive: Psychopathy is a personality construct consisting of a cluster of characteristics used by mental health professionals to describe someone who is charming, manipulative, emotionally ruthless and potentially criminal. 03/28
    • anchors: Represent this sentence for searching relevant passages: what is dangerous high blood pressure in pregnancy?
      positive: A blood pressure that is greater than 130/90 mm Hg or that is 15 degrees higher on the top number from where you started before pregnancy may be cause for concern. High blood pressure during pregnancy is defined as 140 mm Hg or higher systolic, with diastolic 90 mm Hg or higher.
    • anchors: Represent this sentence for searching relevant passages: Be a Better Cheerleader June 22
      positive: What do you think when you think of a good cheerleader? Tight with motions? Can hold a stunt? Well, it's not just that. You need to be fit in 3 categories: mental/emotional health, social health, and physical health. 06/22
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 64
  • learning_rate: 1.5e-05
  • weight_decay: 0.01
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • warmup_steps: 400
  • bf16: True
  • batch_sampler: no_duplicates
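
The list above sets both warmup_ratio and warmup_steps. In Hugging Face Transformers, a positive warmup_steps is documented as overriding warmup_ratio, so the sketch below assumes 400 warmup steps. It estimates the number of optimizer steps in the single epoch and the learning rate under a linear schedule. It assumes one device and that the no_duplicates sampler does not change the batch count; the function name `linear_schedule_lr` is hypothetical.

```python
import math

TRAIN_SAMPLES = 55_744  # size of the training dataset
BATCH_SIZE = 128        # per_device_train_batch_size
PEAK_LR = 1.5e-5        # learning_rate
WARMUP_STEPS = 400      # warmup_steps
WARMUP_RATIO = 0.1      # warmup_ratio (assumed to be ignored when warmup_steps > 0)

# Assumes a single device; gradient_accumulation_steps is 1 and the last batch is kept.
total_steps = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)

def linear_schedule_lr(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, total=total_steps):
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at the final step."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    return peak_lr * max(0.0, (total - step) / max(1, total - warmup))

print(total_steps)                            # 436
print(math.ceil(total_steps * WARMUP_RATIO))  # 44 steps if the ratio were used instead
for step in (200, 400, total_steps):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate is still ramping up for most of the run and peaks at step 400, leaving only the last 36 steps for decay.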

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 1.5e-05
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 400
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch | Step | Training Loss | Evaluation Loss
0.0023 1 2.4713 -
0.0229 10 2.4907 -
0.0459 20 2.4574 -
0.0688 30 2.4861 -
0.0917 40 2.4612 -
0.1147 50 2.4353 -
0.1376 60 2.3967 -
0.1606 70 2.3609 -
0.1835 80 2.3079 -
0.2064 90 2.1928 -
0.2294 100 2.1581 -
0.2523 110 2.0822 -
0.2752 120 1.9739 -
0.2982 130 1.8393 -
0.3211 140 1.7397 -
0.3440 150 1.5249 -
0.3670 160 1.4281 -
0.3899 170 1.3197 -
0.4128 180 1.211 -
0.4358 190 1.1086 -
0.4587 200 0.9598 0.2301
0.4817 210 1.0904 -
0.5046 220 0.9813 -
0.5275 230 1.1148 -
0.5505 240 1.2813 -
0.5734 250 1.2259 -
0.5963 260 1.221 -
0.6193 270 1.1547 -
0.6422 280 1.1286 -
0.6651 290 0.9932 -
0.6881 300 0.978 -
0.7110 310 0.9505 -
0.7339 320 0.8731 -
0.7569 330 0.824 -
0.7798 340 0.8979 -
0.8028 350 1.756 -
0.8257 360 1.6785 -
0.8486 370 1.5944 -
0.8716 380 1.5417 -
0.8945 390 1.4788 -
0.9174 400 0.9873 0.0695
0.9404 410 0.1664 -
0.9633 420 0.1336 -
0.9862 430 0.1193 -
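
For readers who want to work with these numbers, here is a small sketch that parses rows of the log into typed records. It treats "-" as "no evaluation at this step" and reads the last column as the evaluation loss, which is an interpretation of the table rather than something the log states. The function name `parse_log_row` is hypothetical; the four rows are copied from the table above.

```python
def parse_log_row(row):
    """Split one whitespace-separated log row into typed fields."""
    epoch, step, train_loss, eval_loss = row.split()
    return {
        "epoch": float(epoch),
        "step": int(step),
        "training_loss": float(train_loss),
        "eval_loss": None if eval_loss == "-" else float(eval_loss),
    }

rows = [
    "0.0023 1 2.4713 -",
    "0.4587 200 0.9598 0.2301",
    "0.9174 400 0.9873 0.0695",
    "0.9862 430 0.1193 -",
]
records = [parse_log_row(r) for r in rows]

# Steps at which an evaluation loss was recorded.
evaluated = [(r["step"], r["eval_loss"]) for r in records if r["eval_loss"] is not None]
print(evaluated)  # [(200, 0.2301), (400, 0.0695)]

# Rough steps-per-epoch estimate from one row: step divided by epoch fraction.
print(round(records[2]["step"] / records[2]["epoch"]))  # 436
```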

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.0.1
  • Transformers: 4.43.4
  • PyTorch: 2.4.0+cu121
  • Accelerate: 0.33.0
  • Datasets: 2.20.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}