---
base_model: Snowflake/snowflake-arctic-embed-m
datasets: []
language: []
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:55744
- loss:MultipleNegativesRankingLoss
widget:
- source_sentence: 'Represent this sentence for searching relevant passages: 2014
Summer can i cash a check if my account is frozen?'
sentences:
- 'Jun 18 1927 Check Gift Card Balance. With your 16-digit card number and PIN,
you can check the balance in a Walmart store, call 1-888-537-5503, or check your
gift card balance online.
'
- '13/07/2014 Frozen Account If your checking account has been frozen, which can
happen if a levy has been placed on the account, you might still be able to cash
a check. ... This means a check can be deposited into the account without being
frozen, allowing you to access the cash.
'
- 'Guatemalan law allows firearm possession on shall-issue basis as a constitutional
right. With approximately 12 civilian firearms per 100 people, Guatemala is the
70th most armed country in the world. Constitution Guatemalan constitution protects
right to own guns for home-defense: Law Current law regarding firearm possession
was passed in 2009. Permitted types of firearms Law allows civilians to own following
types of firearms: Semi automatic pistols and revolvers of any calibre; Shotguns
with barrel of length up to 24 inches; Mechanical and semi-automatic rifles. Firearm
registration Simple possession requires registration of gun. Application for register
must include: Certification proving ownership and legal acquisition of the firearm;
Certification of lack of a criminal and police record in force (6 months of validity);
Identity document; 4x4 photography on matte paper; Receipt of payment of all necessary
fees; Presentation of firearm. Guatemalans are allowed possess any number of firearms.
Carrying firearms Rules regarding carrying firearms are more strict with additional
permit required and minimum age being 25 years. Only about 10% of legal guns can
be carried in public places. Firearm possession Currently there are 547,000 registered
firearms in Guatemala (or 3 per 100 people). 60,658 people have license to carry
them. See also Overview of gun laws by nation References Guatemala Law of Guatemala
'
- source_sentence: 'Represent this sentence for searching relevant passages: Be Great
at Oblivion Elder Scrolls IV'
sentences:
- 'The Elder Scrolls IV: Oblivion is an intricate and very fun game. If you want
to know how to completely just be the greatest at oblivion in the easiest way
possible, this is the best guide for you.
'
- '"08/03/75 Chronic elevation of potassium levels (also known as hyperkalemia)
is usually a sign of reduced kidney function. However, it can also be caused by
certain medications, acute injuries, or a severe diabetic crisis (called ""diabetic
ketoacidosis"") among other things."
'
- '12/01/2031 The major downfall of the Articles of Confederation was simply weakness.
The federal government, under the Articles, was too weak to enforce their laws
and therefore had no power. The Continental Congress had borrowed money to fight
the Revolutionary War and could not repay their debts.
'
- source_sentence: 'Represent this sentence for searching relevant passages: Renew
Your Passport 11/19/71'
sentences:
- '2025/02/18 The altitude affects the time an orbit takes, called the orbit period.
The period of the space shuttle''s orbit, at say 200 kilometers, used to be about
90 minutes. Vanguard-1, by the way, has an orbital period of 134.2 minutes, with
its periapsis altitude of 654 km, and apoapsis altitude of 3,969 km.
'
- 'The following article is for those who need to renew a United States of America
Passport. You can usually renew your passport by mail, but under certain circumstances,
you may need to renew your passport in person, instead. Nov 19 2071
'
- '"09/06 You can say goodbye in German in nearly any circumstance if you know two
phrases: ""Auf Wiedersehen"" and ""Tschüs."" If you really want to impress native
German speakers, though, there are a few other phrases you can also use when parting
ways."
'
- source_sentence: 'Represent this sentence for searching relevant passages: today:2026-04-07
last monday what is fx vs dx nikon?'
sentences:
- '"spring 2026 Nikon makes a DX-format sensor and an FX-format sensor. The DX-format
is the smaller sensor at 24x16mm; the larger FX-format sensor measures 36x24mm
which is approximately the same size as 35mm film. ... The FX sensor, with more
""light gathering"" area, offers higher sensitivity and, generally, lower noise."
'
- '10/21 A lifelong lack of calcium plays a role in the development of osteoporosis.
Low calcium intake contributes to diminished bone density, early bone loss and
an increased risk of fractures. Eating disorders. Severely restricting food intake
and being underweight weakens bone in both men and women.
'
- '2040 June Mahoe is a common name for several plants and may refer to: Alectryon
macrococcus, or ʻalaʻalahua, a species of tree in the soapberry family endemic
to Hawaii Melicytus ramiflorus, a tree endemic to New Zealand Other Melicytus
trees in New Zealand Talipariti elatum, or blue mahoe, a species of tree in the
mallow family native to the Caribbean
'
- source_sentence: 'Represent this sentence for searching relevant passages: Witki,
Warmian-Masurian Voivodeship 2040 Oct 12'
sentences:
- "09/10 Honey roasted nuts make an excellent snack for special occasions, such\
\ as during the festive season or a party. \n"
- '12-21-2046 This is a list of electoral results for the Electoral district of
Irwin in Western Australian state elections. Members for Irwin Election results
Elections in the 1940s Preferences were not distributed. Preferences were not
distributed. Elections in the 1930s Preferences were not distributed. Elections
in the 1920s Elections in the 1910s Elections in the 1900s Elections in the 1890s
References Western Australian state electoral results by district
'
- 'Witki () is a village in the administrative district of Gmina Bartoszyce, within
Bartoszyce County, Warmian-Masurian Voivodeship, in northern Poland, close to
the border with the Kaliningrad Oblast of Russia. It lies approximately east of
Bartoszyce and north-east of the regional capital Olsztyn. References Witki 12/10/2040
'
---
# Technical Report and Model Pipeline
To access our technical report and model pipeline scripts, visit our [GitHub repository](https://github.com/khoj-ai/timely/tree/main).
# SentenceTransformer based on Snowflake/snowflake-arctic-embed-m
This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [Snowflake/snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Snowflake/snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m) <!-- at revision 71bc94c8f9ea1e54fba11167004205a65e5da2cc -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->
### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
### Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
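The last two modules in the architecture above can be illustrated with a small sketch. It assumes toy 4-dimensional token vectors in place of the real 768-dimensional BERT outputs, and the helper names `cls_pool` and `l2_normalize` are hypothetical, not part of the library. It only shows the idea of taking the first (`[CLS]`) token vector and scaling it to unit length.

```python
import math


def cls_pool(token_embeddings):
    """Return the first ([CLS]) token vector, mirroring pooling_mode_cls_token=True."""
    return token_embeddings[0]


def l2_normalize(vector):
    """Scale a vector to unit length, mirroring what Normalize() is meant to do."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Toy 3-token sequence with 4-dimensional vectors (illustrative values only).
token_embeddings = [
    [3.0, 4.0, 0.0, 0.0],  # [CLS] token
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 2.0, 0.0, 2.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)
```

Because the output is unit length, the dot product of two sentence embeddings equals their cosine similarity, which matches the similarity function listed in the model description.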
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub ("sentence_transformers_model_id" is a placeholder; replace it with this model's Hub ID)
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'Represent this sentence for searching relevant passages: Witki, Warmian-Masurian Voivodeship 2040 Oct 12',
    'Witki () is a village in the administrative district of Gmina Bartoszyce, within Bartoszyce County, Warmian-Masurian Voivodeship, in northern Poland, close to the border with the Kaliningrad Oblast of Russia. It lies approximately east of Bartoszyce and north-east of the regional capital Olsztyn. References Witki 12/10/2040\n',
    '12-21-2046 This is a list of electoral results for the Electoral district of Irwin in Western Australian state elections. Members for Irwin Election results Elections in the 1940s Preferences were not distributed. Preferences were not distributed. Elections in the 1930s Preferences were not distributed. Elections in the 1920s Elections in the 1910s Elections in the 1900s Elections in the 1890s References Western Australian state electoral results by district\n',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
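In the example above, only the first sentence (the query) carries the `Represent this sentence for searching relevant passages: ` prefix, while the passages are encoded as-is. A minimal retrieval sketch along those lines follows. The helpers `format_query`, `cosine`, and `rank_passages` are hypothetical names, and the short toy vectors stand in for the output of `model.encode`, so this is an illustration of the ranking step rather than the model's actual scores.

```python
import math

# Prefix used on the query side in this card's examples and training anchors.
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def format_query(query):
    """Prepend the query prefix; passages are left unprefixed."""
    return QUERY_PREFIX + query


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 2-dimensional vectors standing in for real embeddings.
query_vec = [1.0, 0.0]
passage_vecs = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]

print(format_query("Renew Your Passport 11/19/71"))
print(rank_passages(query_vec, passage_vecs))
```

With real embeddings, `query_vec` would come from encoding `format_query(...)` and `passage_vecs` from encoding the raw passages.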
<!--
### Direct Usage (Transformers)
<details><summary>Click to see the direct usage in Transformers</summary>
</details>
-->
<!--
### Downstream Usage (Sentence Transformers)
You can finetune this model on your own dataset.
<details><summary>Click to expand</summary>
</details>
-->
<!--
### Out-of-Scope Use
*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->
<!--
## Bias, Risks and Limitations
*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->
<!--
### Recommendations
*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->
## Training Details
### Training Dataset
#### Unnamed Dataset
* Size: 55,744 training samples
* Columns: <code>anchors</code> and <code>positive</code>
* Approximate statistics based on the first 1000 samples:
| | anchors | positive |
|:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
| type | string | string |
| details | <ul><li>min: 14 tokens</li><li>mean: 20.33 tokens</li><li>max: 33 tokens</li></ul> | <ul><li>min: 17 tokens</li><li>mean: 48.58 tokens</li><li>max: 75 tokens</li></ul> |
* Samples:
| anchors | positive |
|:--------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <code>Represent this sentence for searching relevant passages: are bugs attracted to citronella November 10?</code> | <code>Citronella is naturally occurring oil that repels insects. ... “Citronella oil is repellent to mosquitoes to a degree, but the amount being put out by a candle isn't going to be very effective,” Eric Hoffer, president of Hoffer Pest, told TODAY Home. 11/10<br></code> |
| <code>Represent this sentence for searching relevant passages: are bugs attracted to citronella 11/10/09?</code> | <code>Citronella is naturally occurring oil that repels insects. ... “Citronella oil is repellent to mosquitoes to a degree, but the amount being put out by a candle isn't going to be very effective,” Eric Hoffer, president of Hoffer Pest, told TODAY Home. 11/10/09<br></code> |
| <code>Represent this sentence for searching relevant passages: are bugs attracted to citronella Jan 15?</code> | <code>Citronella is naturally occurring oil that repels insects. ... “Citronella oil is repellent to mosquitoes to a degree, but the amount being put out by a candle isn't going to be very effective,” Eric Hoffer, president of Hoffer Pest, told TODAY Home. 01/15<br></code> |
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
```json
{
"scale": 20.0,
"similarity_fct": "cos_sim"
}
```
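To make the two loss parameters concrete, here is a pure-Python sketch of the idea behind MultipleNegativesRankingLoss: each anchor is scored against every positive in the batch with cosine similarity, the scores are multiplied by `scale`, and cross-entropy is taken with the matching positive as the target. The function names `cos_sim` and `mnr_loss` and the toy vectors are assumptions for illustration; this is not the library's implementation.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and all other positives in the batch act as negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy batch of two pairs: aligned pairs give a low loss, swapped pairs a high one.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

print(mnr_loss(anchors, aligned))
print(mnr_loss(anchors, swapped))
```

When every anchor scores all positives equally, this quantity equals the natural log of the batch size, which is a useful reference point when reading the training loss.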
### Evaluation Dataset
#### Unnamed Dataset
* Size: 1,000 evaluation samples
* Columns: <code>anchors</code> and <code>positive</code>
* Approximate statistics based on the first 1000 samples:
| | anchors | positive |
|:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
| type | string | string |
| details | <ul><li>min: 12 tokens</li><li>mean: 21.57 tokens</li><li>max: 43 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 66.44 tokens</li><li>max: 512 tokens</li></ul> |
* Samples:
| anchors | positive |
|:--------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <code>Represent this sentence for searching relevant passages: Identify a Psychopath 3/28</code> | <code>Psychopathy is a personality construct consisting of a cluster of characteristics used by mental health professionals to describe someone who is charming, manipulative, emotionally ruthless and potentially criminal. 03/28<br></code> |
| <code>Represent this sentence for searching relevant passages: what is dangerous high blood pressure in pregnancy?</code> | <code>A blood pressure that is greater than 130/90 mm Hg or that is 15 degrees higher on the top number from where you started before pregnancy may be cause for concern. High blood pressure during pregnancy is defined as 140 mm Hg or higher systolic, with diastolic 90 mm Hg or higher.<br></code> |
| <code>Represent this sentence for searching relevant passages: Be a Better Cheerleader June 22</code> | <code>What do you think when you think of a good cheerleader? Tight with motions? Can hold a stunt? Well, it's not just that. You need to be fit in 3 categories: mental/emotional health, social health, and physical health. 06/22<br></code> |
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
```json
{
"scale": 20.0,
"similarity_fct": "cos_sim"
}
```
### Training Hyperparameters
#### Non-Default Hyperparameters
- `eval_strategy`: steps
- `per_device_train_batch_size`: 128
- `per_device_eval_batch_size`: 64
- `learning_rate`: 1.5e-05
- `weight_decay`: 0.01
- `num_train_epochs`: 1
- `warmup_ratio`: 0.1
- `warmup_steps`: 400
- `bf16`: True
- `batch_sampler`: no_duplicates
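The hyperparameters above imply a particular learning-rate curve, sketched below under stated assumptions: a single device, no gradient accumulation, the last partial batch kept, and a linear schedule (`lr_scheduler_type` is `linear` in the full list). It also assumes, as the Transformers documentation describes, that a positive `warmup_steps` overrides `warmup_ratio`. The helper names `total_steps` and `linear_lr` are hypothetical.

```python
import math


def total_steps(num_samples, batch_size, epochs=1):
    """Optimizer steps when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size) * epochs


def linear_lr(step, base_lr, warmup_steps, num_training_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - warmup_steps))


steps = total_steps(num_samples=55744, batch_size=128)
print(steps)

# Learning rate midway through warmup, at the peak, and at the final step.
for step in (200, 400, steps):
    print(step, linear_lr(step, base_lr=1.5e-05, warmup_steps=400, num_training_steps=steps))
```

Under these assumptions, warmup covers 400 of the 436 steps, so the learning rate peaks late in the single epoch and decays over only the last 36 steps.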
#### All Hyperparameters
<details><summary>Click to expand</summary>
- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 128
- `per_device_eval_batch_size`: 64
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 1.5e-05
- `weight_decay`: 0.01
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 1
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.1
- `warmup_steps`: 400
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: True
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: False
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `eval_use_gather_object`: False
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: proportional
</details>
### Training Logs
| Epoch | Step | Training Loss | Validation Loss |
|:------:|:----:|:-------------:|:---------------:|
| 0.0023 | 1 | 2.4713 | - |
| 0.0229 | 10 | 2.4907 | - |
| 0.0459 | 20 | 2.4574 | - |
| 0.0688 | 30 | 2.4861 | - |
| 0.0917 | 40 | 2.4612 | - |
| 0.1147 | 50 | 2.4353 | - |
| 0.1376 | 60 | 2.3967 | - |
| 0.1606 | 70 | 2.3609 | - |
| 0.1835 | 80 | 2.3079 | - |
| 0.2064 | 90 | 2.1928 | - |
| 0.2294 | 100 | 2.1581 | - |
| 0.2523 | 110 | 2.0822 | - |
| 0.2752 | 120 | 1.9739 | - |
| 0.2982 | 130 | 1.8393 | - |
| 0.3211 | 140 | 1.7397 | - |
| 0.3440 | 150 | 1.5249 | - |
| 0.3670 | 160 | 1.4281 | - |
| 0.3899 | 170 | 1.3197 | - |
| 0.4128 | 180 | 1.211 | - |
| 0.4358 | 190 | 1.1086 | - |
| 0.4587 | 200 | 0.9598 | 0.2301 |
| 0.4817 | 210 | 1.0904 | - |
| 0.5046 | 220 | 0.9813 | - |
| 0.5275 | 230 | 1.1148 | - |
| 0.5505 | 240 | 1.2813 | - |
| 0.5734 | 250 | 1.2259 | - |
| 0.5963 | 260 | 1.221 | - |
| 0.6193 | 270 | 1.1547 | - |
| 0.6422 | 280 | 1.1286 | - |
| 0.6651 | 290 | 0.9932 | - |
| 0.6881 | 300 | 0.978 | - |
| 0.7110 | 310 | 0.9505 | - |
| 0.7339 | 320 | 0.8731 | - |
| 0.7569 | 330 | 0.824 | - |
| 0.7798 | 340 | 0.8979 | - |
| 0.8028 | 350 | 1.756 | - |
| 0.8257 | 360 | 1.6785 | - |
| 0.8486 | 370 | 1.5944 | - |
| 0.8716 | 380 | 1.5417 | - |
| 0.8945 | 390 | 1.4788 | - |
| 0.9174 | 400 | 0.9873 | 0.0695 |
| 0.9404 | 410 | 0.1664 | - |
| 0.9633 | 420 | 0.1336 | - |
| 0.9862 | 430 | 0.1193 | - |
### Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.0.1
- Transformers: 4.43.4
- PyTorch: 2.4.0+cu121
- Accelerate: 0.33.0
- Datasets: 2.20.0
- Tokenizers: 0.19.1
## Citation
### BibTeX
#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
```
#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
<!--
## Glossary
*Clearly define terms in order to be accessible across audiences.*
-->
<!--
## Model Card Authors
*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->
<!--
## Model Card Contact
*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->