metadata
base_model: BAAI/bge-base-en-v1.5
datasets: []
language: []
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:1340
- loss:MultipleNegativesRankingLoss
widget:
- source_sentence: Who popularized the term 'Dalit'?
sentences:
- >-
Fakhruddin Ali Ahmed was the fifth President of India from 1974 to 1977
and also the 2nd President of India to die in office.
- >-
Arunachal Pradesh or South Tibet is a state between India and China. The
country that owns this region is disputed. China says that they own it
and call it South Tibet (Zangnan 藏南). In 2017, China started renaming
places in this territory. In 2019 China destroyed 30,000 "incorrect"
world maps that showed South Tibet as part of India.
- >-
"Dalit" refers to socially, economically and historically marginalized
communities predominantly in India . It also means "broken/scattered" in
Sanskrit and Hindi . The term "dalits" was in use as a translation for
the British Raj census classification of "Depressed Classes" prior to
1935. It was popularised by the economist and reformer B. R. Ambedkar
(1891–1956), who included all depressed people irrespective of their
caste into the definition of dalits. Hence the first group he made was
called the "Labour Party" and included as its members all people of the
society who were kept depressed, including women, small scale farmers
and people from backward castes.
- source_sentence: What is India's contribution to the Olympic Movement?
sentences:
- >-
Prem Pal Singh Rawat (in India called Maharaji and in the past called
Guru Maharaj Ji and Balyogeshwar) was born in India on December 10,
1957. He teaches inner peace by the use of what he calls "Knowledge".
Groups that have helped him are the Divine Light Mission, Elan Vital
(1983), and The Prem Rawat Foundation (2001).
- >-
Boota Singh (Gurmukhi: ਬੂਟਾ ਸਿੰਘ; Shahmukhi: بوٹا سنگھ), sometimes
spelled as Buta Singh, was a Sikh soldier in the British Army. He served
in Burma during World War II, under the command of Lord Mountbatten. He
is very well known in India and Pakistan. He is famous for his tragic
love story with Zainab, a Muslim girl who he rescued from the riots
during the partition of India in 1947.
- >-
India at the Olympics is a history which includes 32 games in 19
countries and 800+ athletes. Since 1900, India has contributed to the
growth of the "Olympic Movement".
- source_sentence: What is significant about the fort in Jhansi?
sentences:
- >-
Western India is a region of the Republic of India, it includes Gujarat,
Madhya Pradesh and Maharashtra.
- >-
The Government of India Act 1858 was an Act of the Parliament of the
United Kingdom (21 & 22 Vict. c. 106) passed on August 2, 1858. Its
provisions called for the liquidation of the British East India Company
(who had up to this point been ruling British India under the auspices
of Parliament) and the transference of its functions to the British
Crown.
- >-
Jhansi is a historic city of India between the rivers Pahunj and Betwa
in the northern state of Uttar Pradesh, close to the border with Madhya
Pradesh. Jhansi is the administrative headquarters of Jhansi District
and Jhansi Division. The original walled city grew up around its stone
fort, which was built in 1613. The city is well connected to all other
major towns in Uttar Pradesh by road and railway networks. It is called
"gateway to Bundelkhand". Jhansi was besieged and taken by British
forces in 1858 during the Indian Rebellion of 1857.
- source_sentence: How is Dhanteras celebrated in Nepal?
sentences:
- >-
The National Stock Exchange of India Limited (NSE), is a Mumbai-based
stock exchange. It is the biggest stock exchange in India and the third
biggest in the world in terms of amounts of transactions. NSE is
mutually-owned by a set of leading financial institutions, banks,
insurance companies and other financial intermediaries in India but its
ownership and management operate as separate groups. As of 2006, the NSE
VSAT terminals, 2799 in total, cover more than 1500 cities across India.
In July 2007, the NSE had a total market capitalization of 42,74,509
crore INR making it the second-largest stock market in South Asia in
terms of market-capitalization.
- >-
Dhanteras (Sanskrit: धनतेरस), also known as Dhanatrayodashi () or
Dhanvantari Trayodashi, is the first day of the festival of Diwali in
India and the festival of Tihar in Nepal.
- >-
Perur taluk is a taluk in Coimbatore district, Tamil Nadu, India
associated with the neighbourhood of Perur. It was created by Government
of Tamil Nadu in 2013.
- source_sentence: What political roles did Rao hold in Andhra Pradesh?
sentences:
- >-
The 2023 ICC Cricket World Cup is scheduled to be hosted by India and
India was selected as the host at an International Cricket Council (ICC)
meeting in London in June 2013. This will be the 13th Cricket World Cup
competition. It will be the fourth time that India will be the host.
This will be the first time that India has hosted the tournament on its
own. India hosted previous World Cup tournaments in 1987 (with
Pakistan), 1996 (with Pakistan and Sri Lanka) and 2011 (with Sri Lanka
and Bangladesh). The semi final will be played at Wankhede Stadium. And
final will be played at Eden Gardens, Kolkata.
- >-
Ayyavazhi (, "path of the father"), is a religion with one god that
started in South India in the middle of the 19th century. The 'zhi' ()
in the word, 'Ayyavazhi', is a retroflex, ri.
- >-
Balli Durga Prasad Rao (15 June 1956 – 16 September 2020) was an Indian
politician. He was elected to the Lok Sabha, lower house of the
Parliament of India in the 2019 Indian general election. He was a member
of the YSR Congress Party. Rao was also a member of the Andhra Pradesh
MLA from 1985 to 1989, 1994 to 1999, and 2009 to 2014.
SentenceTransformer based on BAAI/bge-base-en-v1.5
This is a sentence-transformers model fine-tuned from BAAI/bge-base-en-v1.5. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: BAAI/bge-base-en-v1.5
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
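The last two modules are simple enough to illustrate without the model. The sketch below uses NumPy and random numbers in place of real BERT token outputs; `cls_pool_and_normalize` is a hypothetical helper, not part of the library. It shows roughly what CLS pooling (`pooling_mode_cls_token: True`) followed by `Normalize()` does to a batch of token embeddings.

```python
import numpy as np

def cls_pool_and_normalize(token_embeddings: np.ndarray) -> np.ndarray:
    """Map (batch, seq_len, hidden) token embeddings to (batch, hidden) unit vectors."""
    # CLS pooling: keep only the first token's vector for each sentence.
    cls_vectors = token_embeddings[:, 0, :]
    # Normalize(): scale each vector to unit L2 length.
    norms = np.linalg.norm(cls_vectors, axis=1, keepdims=True)
    return cls_vectors / np.clip(norms, 1e-12, None)

# Random stand-in for the Transformer module's output: 2 sentences, 5 tokens, 768 dims.
rng = np.random.default_rng(0)
fake_token_embeddings = rng.normal(size=(2, 5, 768))

sentence_embeddings = cls_pool_and_normalize(fake_token_embeddings)
print(sentence_embeddings.shape)
```

Because the output vectors have unit length, their dot product equals their cosine similarity.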
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("dipanjanS/bge-base-en-v1.5-fte")
# Run inference
sentences = [
'What political roles did Rao hold in Andhra Pradesh?',
'Balli Durga Prasad Rao (15 June 1956 – 16 September 2020) was an Indian politician. He was elected to the Lok Sabha, lower house of the Parliament of India in the 2019 Indian general election. He was a member of the YSR Congress Party. Rao was also a member of the Andhra Pradesh MLA from 1985 to 1989, 1994 to 1999, and 2009 to 2014.',
'Ayyavazhi (, "path of the father"), is a religion with one god that started in South India in the middle of the 19th century. The \'zhi\' () in the word, \'Ayyavazhi\', is a retroflex, ri.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
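For retrieval, a common pattern is to treat one embedding as the question and rank the others as candidate contexts. The following is a minimal sketch of that ranking step, using toy 3-dimensional vectors in place of the output of `model.encode`; `rank_by_cosine` is a hypothetical helper written for illustration only.

```python
import numpy as np

def rank_by_cosine(query_vec: np.ndarray, candidate_vecs: np.ndarray):
    """Return candidate indices sorted from most to least similar, with their scores."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = c @ q                 # cosine similarity of each candidate with the query
    order = np.argsort(-scores)    # highest score first
    return order, scores[order]

# Toy vectors standing in for a question embedding and two context embeddings.
query = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [0.9, 0.1, 0.0],   # points in nearly the same direction as the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
])

order, scores = rank_by_cosine(query, candidates)
print(order, scores)
```

With real embeddings, `query` would be the first row of `embeddings` and `candidates` the remaining rows.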
Training Details
Training Dataset
Unnamed Dataset
- Size: 1,340 training samples
- Columns: question and context
- Approximate statistics based on the first 1000 samples:

| | question | context |
|---|---|---|
| type | string | string |
| details | min: 6 tokens, mean: 12.39 tokens, max: 24 tokens | min: 9 tokens, mean: 83.99 tokens, max: 510 tokens |
- Samples:

| question | context |
|---|---|
| What is Basil commonly known as? | Basil ("Ocimum basilicum") ( or ) is a plant of the Family Lamiaceae. It is also known as Sweet Basil or Tulsi. It is a tender low-growing herb that is grown as a perennial in warm, tropical climates. Basil is originally native to India and other tropical regions of Asia. It has been cultivated there for more than 5,000 years. It is prominently featured in many cuisines throughout the world. Some of them are Italian, Thai, Vietnamese and Laotian cuisines. It grows to between 30–60 cm tall. It has light green, silky leaves 3–5 cm long and 1–3 cm broad. The leaves are opposite each other. The flowers are quite big. They are white in color and arranged as a spike. |
| Where is Basil originally native to? | Basil ("Ocimum basilicum") ( or ) is a plant of the Family Lamiaceae. It is also known as Sweet Basil or Tulsi. It is a tender low-growing herb that is grown as a perennial in warm, tropical climates. Basil is originally native to India and other tropical regions of Asia. It has been cultivated there for more than 5,000 years. It is prominently featured in many cuisines throughout the world. Some of them are Italian, Thai, Vietnamese and Laotian cuisines. It grows to between 30–60 cm tall. It has light green, silky leaves 3–5 cm long and 1–3 cm broad. The leaves are opposite each other. The flowers are quite big. They are white in color and arranged as a spike. |
| What is the significance of the Roerich Pact? | The Roerich Pact is a treaty on Protection of Artistic and Scientific Institutions and Historic Monuments, signed by the representatives of 21 states in the Oval Office of the White House on 15 April 1935. As of January 1, 1990, the Roerich Pact had been ratified by ten nations: Brazil, Chile, Colombia, Cuba, the Dominican Republic, El Salvador, Guatemala, Mexico, the United States, and Venezuela. It went into effect on 26 August 1935. The Government of India approved the Treaty in 1948, but did not take any further formal action. The Roerich Pact is also known as "Pax Cultura" ("Cultural Peace" or "Peace through Culture"). The most important part of the Roerich Pact is the legal recognition that the protection of culture is always more important than any military necessity. |
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20.0, "similarity_fct": "cos_sim" }
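To make the loss parameters concrete, here is a minimal NumPy sketch of the idea behind MultipleNegativesRankingLoss with `scale` 20.0 and cosine similarity: each question's own context is the positive, and every other context in the batch acts as a negative. `mnrl_loss` is a hypothetical re-implementation for illustration, not the library's code, and the one-hot vectors are toy inputs.

```python
import numpy as np

def mnrl_loss(question_emb: np.ndarray, context_emb: np.ndarray, scale: float = 20.0) -> float:
    """Cross-entropy over scaled cosine similarities with in-batch negatives."""
    q = question_emb / np.linalg.norm(question_emb, axis=1, keepdims=True)
    c = context_emb / np.linalg.norm(context_emb, axis=1, keepdims=True)
    logits = scale * (q @ c.T)                       # (batch, batch) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct context for question i sits at column i (the diagonal).
    return float(-np.mean(np.diag(log_probs)))

# Toy batch of 4 pairs: one-hot vectors, so matching pairs have cosine similarity 1.
aligned = np.eye(4)
loss_aligned = mnrl_loss(aligned, aligned)         # every question matches its context
loss_shuffled = mnrl_loss(aligned, aligned[::-1])  # every question is paired with a wrong context

print(loss_aligned, loss_shuffled)
```

The loss is near zero when each question is closest to its own context and grows toward the value of `scale` when the positives score lowest. Since other rows in the batch are treated as negatives, repeated contexts within a batch would be penalized incorrectly, which is presumably why the `no_duplicates` batch sampler is used.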
Evaluation Dataset
Unnamed Dataset
- Size: 100 evaluation samples
- Columns: question and context
- Approximate statistics based on the first 1000 samples:

| | question | context |
|---|---|---|
| type | string | string |
| details | min: 7 tokens, mean: 12.36 tokens, max: 19 tokens | min: 12 tokens, mean: 84.15 tokens, max: 235 tokens |
- Samples:

| question | context |
|---|---|
| What is the demographic composition of Kolathur? | Kolathur () is a town in Salem district in the Indian state of Tamil Nadu. As of the 2001 India census, Kolathur had a population of 10,319. Males make up 53% of the population and females 47%. A total of 9% of the population is under 6 years of age. |
| What is notable about India's democracy? | India is a country in Asia. It has an area of . It is at the center of South Asia. India has more than 1.2 billion (1,210,000,000) people, which is the second largest population in the world. It is the seventh largest country in the world by area and the largest country in South Asia. It is also the most populous democracy in the world. |
| Who was the Chief Justice of India before Dipak Misra? | Justice Dipak Misra (born 3 October 1953) was the Judge of the Supreme Court and the Chief Justice of India. He took over as the 45th Chief Justice of India (CJI), succeeding the 44th CJI, Justice J. S. Khehar. |
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20.0, "similarity_fct": "cos_sim" }
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- learning_rate: 3e-06
- max_steps: 332
- warmup_ratio: 0.1
- fp16: True
- batch_sampler: no_duplicates
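These settings could be expressed with the Sentence Transformers v3 training API roughly as follows. This is a configuration sketch under assumptions, not the author's actual training script: the `output_dir` path is hypothetical, and `fp16=True` assumes a CUDA GPU is available.

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="output/bge-base-en-v1.5-finetuned",  # hypothetical path, not from the card
    eval_strategy="steps",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=3e-6,
    max_steps=332,            # takes precedence over num_train_epochs
    warmup_ratio=0.1,
    fp16=True,                # assumes a CUDA GPU
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid duplicate texts within a batch
)
```

All other arguments are left at their defaults.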
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- learning_rate: 3e-06
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 3.0
- max_steps: 332
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
Training Logs
Epoch | Step | Training Loss | Validation Loss |
---|---|---|---|
0.2381 | 20 | 0.1832 | 0.0491 |
0.4762 | 40 | 0.1118 | 0.0246 |
0.7143 | 60 | 0.0991 | 0.0152 |
0.9524 | 80 | 0.0518 | 0.0106 |
1.1905 | 100 | 0.0665 | 0.0073 |
1.4286 | 120 | 0.0539 | 0.0058 |
1.6667 | 140 | 0.0548 | 0.0048 |
1.9048 | 160 | 0.0354 | 0.0041 |
2.1429 | 180 | 0.038 | 0.0034 |
2.3810 | 200 | 0.0592 | 0.0030 |
2.6190 | 220 | 0.0203 | 0.0027 |
2.8571 | 240 | 0.0441 | 0.0025 |
3.0952 | 260 | 0.023 | 0.0024 |
3.3333 | 280 | 0.0452 | 0.0023 |
3.5714 | 300 | 0.0128 | 0.0022 |
3.8095 | 320 | 0.0495 | 0.0022 |
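The Epoch column runs past the configured `num_train_epochs` of 3.0 because `max_steps` (332) takes precedence when both are set. The arithmetic can be checked with a short sketch, assuming a single device and using the card's values for dataset size, batch size, and gradient accumulation (1):

```python
import math

train_samples = 1340   # training set size from the card
batch_size = 16        # per_device_train_batch_size
max_steps = 332

# dataloader_drop_last is False, so the final partial batch counts as a step.
steps_per_epoch = math.ceil(train_samples / batch_size)
epochs_run = max_steps / steps_per_epoch
epoch_at_step_20 = 20 / steps_per_epoch

print(steps_per_epoch, round(epochs_run, 2), round(epoch_at_step_20, 4))
# 84 3.95 0.2381
```

The value 0.2381 matches the first row of the log, and 332 steps correspond to just under four passes over the training data.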
Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.0.1
- Transformers: 4.42.4
- PyTorch: 2.3.1+cu121
- Accelerate: 0.32.1
- Datasets: 2.20.0
- Tokenizers: 0.19.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}