SentenceTransformer based on sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2 on the allstats-semantic-search-synthetic-dataset-v1 dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Number of Parameters: 278M
  • Training Dataset: allstats-semantic-search-synthetic-dataset-v1

Model Sources

  • Documentation: https://www.sbert.net
  • Repository: https://github.com/UKPLab/sentence-transformers
  • Hugging Face: https://huggingface.co/models?library=sentence-transformers

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
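The Pooling module averages token embeddings over the attention mask (mean pooling). For illustration, here is a minimal sketch of the equivalent computation with the plain transformers library; the SentenceTransformer API shown under Usage below remains the recommended path:

import torch
from transformers import AutoModel, AutoTokenizer

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions
    token_embeddings = model_output[0]  # last_hidden_state: [batch, seq_len, 768]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("yahyaabd/allstats-semantic-search-model-v1-3")
model = AutoModel.from_pretrained("yahyaabd/allstats-semantic-search-model-v1-3")

encoded = tokenizer(
    ["perubahan nilai tukar petani bulan mei 2017"],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)
with torch.no_grad():
    output = model(**encoded)
embeddings = mean_pooling(output, encoded["attention_mask"])  # shape: [1, 768]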

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-semantic-search-model-v1-3")
# Run inference
sentences = [
    'perubahan nilai tukar petani bulan mei 2017',
    'Perkembangan Nilai Tukar Petani Mei 2017',
    'Statistik Restoran/Rumah Makan Tahun 2014',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
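The same embeddings support retrieval over a document collection. A minimal semantic-search sketch (the corpus below is illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("yahyaabd/allstats-semantic-search-model-v1-3")

# Illustrative corpus of publication titles
corpus = [
    "Perkembangan Nilai Tukar Petani Mei 2017",
    "Statistik Restoran/Rumah Makan Tahun 2014",
    "Statistik Industri Besar dan Sedang Indonesia 2008",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("perubahan nilai tukar petani bulan mei 2017", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:  # hits[0] holds the results for the first (only) query
    print(corpus[hit["corpus_id"]], round(hit["score"], 4))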

Evaluation

Metrics

Semantic Similarity

Metric          | allstats-semantic-search-v1-3-dev | allstat-semantic-search-v1-3-test
----------------|-----------------------------------|----------------------------------
pearson_cosine  | 0.9959                            | 0.9961
spearman_cosine | 0.9641                            | 0.9648
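These scores are Pearson and Spearman correlations between the model's cosine similarities and the gold labels. A minimal sketch of reproducing them with EmbeddingSimilarityEvaluator (the dataset ID and split name are assumptions):

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("yahyaabd/allstats-semantic-search-model-v1-3")

# Dataset ID and split name are assumptions for illustration
data = load_dataset("yahyaabd/allstats-semantic-search-synthetic-dataset-v1", split="test")

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=data["query"],
    sentences2=data["doc"],
    scores=data["label"],
    name="allstat-semantic-search-v1-3-test",
)
print(evaluator(model))  # dict including pearson_cosine and spearman_cosine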

Training Details

Training Dataset

allstats-semantic-search-synthetic-dataset-v1

  • Dataset: allstats-semantic-search-synthetic-dataset-v1 at b13c0a7
  • Size: 212,940 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:

            | query                                             | doc                                               | label
    type    | string                                            | string                                            | float
    details | min: 5 tokens, mean: 11.46 tokens, max: 34 tokens | min: 5 tokens, mean: 14.47 tokens, max: 54 tokens | min: 0.0, mean: 0.5, max: 1.05

  • Samples:

    query                                             | doc                                                       | label
    aDta industri besar dan sedang Indonesia 2008     | Statistik Industri Besar dan Sedang Indonesia 2008        | 0.9
    profil bisnis konstruksi individu jawa barat 2022 | Statistik Industri Manufaktur Indonesia 2015 - Bahan Baku | 0.15
    data statistik ekonomi indonesia                  | Nilai Tukar Valuta Asing di Indonesia 2014                | 0.08
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
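The labels are continuous relevance scores, so CosineSimilarityLoss regresses the cosine similarity of each (query, doc) embedding pair onto its label via MSE. A minimal sketch of what one training sample contributes, using the first sample above:

import torch
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
loss = losses.CosineSimilarityLoss(model)  # wraps torch.nn.MSELoss by default

# For the (query, doc) pair labeled 0.9, the objective is (cos_sim(u, v) - 0.9) ** 2
u = model.encode("aDta industri besar dan sedang Indonesia 2008", convert_to_tensor=True)
v = model.encode("Statistik Industri Besar dan Sedang Indonesia 2008", convert_to_tensor=True)
predicted = torch.nn.functional.cosine_similarity(u, v, dim=0)
sample_loss = (predicted - 0.9) ** 2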
    

Evaluation Dataset

allstats-semantic-search-synthetic-dataset-v1

  • Dataset: allstats-semantic-search-synthetic-dataset-v1 at b13c0a7
  • Size: 26,618 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:

            | query                                             | doc                                               | label
    type    | string                                            | string                                            | float
    details | min: 5 tokens, mean: 11.38 tokens, max: 34 tokens | min: 4 tokens, mean: 14.63 tokens, max: 55 tokens | min: 0.0, mean: 0.51, max: 1.0

  • Samples:

    query                                                 | doc                                                           | label
    tahun berapa ekspor naik 2,37% dan impor naik 30,30%? | Bulan November 2006 Ekspor Naik 2,37 % dan Impor Naik 30,30 % | 1.0
    Berapa produksi padi pada tahun 2023?                 | Produksi padi tahun lainnya                                   | 0.0
    data statistik solus per aqua 2015                    | Statistik Solus Per Aqua (SPA) 2015                           | 0.97
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • num_train_epochs: 16
  • warmup_ratio: 0.1
  • fp16: True
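A minimal sketch wiring these hyperparameters into the Sentence Transformers v3 trainer (the dataset ID and split names are assumptions; the full argument list follows below):

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
# Dataset ID and split names are assumptions for illustration
dataset = load_dataset("yahyaabd/allstats-semantic-search-synthetic-dataset-v1")
loss = losses.CosineSimilarityLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="allstats-semantic-search-model-v1-3",
    eval_strategy="steps",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=16,
    warmup_ratio=0.1,
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    loss=loss,
)
trainer.train()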

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 16
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss allstats-semantic-search-v1-3-dev_spearman_cosine allstat-semantic-search-v1-3-test_spearman_cosine
0.1502 500 0.0579 0.0351 0.7132 -
0.3005 1000 0.03 0.0225 0.7589 -
0.4507 1500 0.0219 0.0185 0.7834 -
0.6010 2000 0.0181 0.0163 0.7946 -
0.7512 2500 0.0162 0.0147 0.7941 -
0.9014 3000 0.015 0.0147 0.8050 -
1.0517 3500 0.014 0.0131 0.7946 -
1.2019 4000 0.0119 0.0126 0.8038 -
1.3522 4500 0.0121 0.0128 0.8213 -
1.5024 5000 0.0117 0.0116 0.8268 -
1.6526 5500 0.0124 0.0117 0.8269 -
1.8029 6000 0.0111 0.0109 0.8421 -
1.9531 6500 0.0105 0.0108 0.8278 -
2.1034 7000 0.0091 0.0093 0.8460 -
2.2536 7500 0.0085 0.0091 0.8469 -
2.4038 8000 0.0079 0.0083 0.8595 -
2.5541 8500 0.0075 0.0085 0.8495 -
2.7043 9000 0.0073 0.0082 0.8614 -
2.8546 9500 0.0068 0.0077 0.8696 -
3.0048 10000 0.0066 0.0076 0.8669 -
3.1550 10500 0.0058 0.0072 0.8678 -
3.3053 11000 0.0056 0.0067 0.8703 -
3.4555 11500 0.0054 0.0067 0.8766 -
3.6058 12000 0.0054 0.0063 0.8678 -
3.7560 12500 0.0051 0.0061 0.8786 -
3.9062 13000 0.0052 0.0077 0.8699 -
4.0565 13500 0.005 0.0055 0.8859 -
4.2067 14000 0.0041 0.0054 0.8900 -
4.3570 14500 0.0038 0.0052 0.8892 -
4.5072 15000 0.0039 0.0050 0.8895 -
4.6575 15500 0.004 0.0052 0.8972 -
4.8077 16000 0.0042 0.0051 0.8927 -
4.9579 16500 0.0041 0.0052 0.8930 -
5.1082 17000 0.0034 0.0053 0.8998 -
5.2584 17500 0.003 0.0047 0.9023 -
5.4087 18000 0.0032 0.0045 0.9039 -
5.5589 18500 0.0032 0.0044 0.8996 -
5.7091 19000 0.0032 0.0041 0.9085 -
5.8594 19500 0.0032 0.0047 0.9072 -
6.0096 20000 0.0029 0.0037 0.9104 -
6.1599 20500 0.0024 0.0037 0.9112 -
6.3101 21000 0.0026 0.0039 0.9112 -
6.4603 21500 0.0024 0.0037 0.9157 -
6.6106 22000 0.0022 0.0038 0.9122 -
6.7608 22500 0.0025 0.0034 0.9170 -
6.9111 23000 0.0023 0.0034 0.9179 -
7.0613 23500 0.002 0.0031 0.9244 -
7.2115 24000 0.0019 0.0030 0.9250 -
7.3618 24500 0.0018 0.0032 0.9249 -
7.5120 25000 0.0022 0.0031 0.9162 -
7.6623 25500 0.0019 0.0030 0.9266 -
7.8125 26000 0.0019 0.0028 0.9297 -
7.9627 26500 0.0018 0.0028 0.9282 -
8.1130 27000 0.0015 0.0025 0.9324 -
8.2632 27500 0.0014 0.0027 0.9337 -
8.4135 28000 0.0015 0.0027 0.9327 -
8.5637 28500 0.0016 0.0027 0.9313 -
8.7139 29000 0.0016 0.0027 0.9333 -
8.8642 29500 0.0015 0.0025 0.9382 -
9.0144 30000 0.0014 0.0025 0.9375 -
9.1647 30500 0.0011 0.0024 0.9398 -
9.3149 31000 0.0012 0.0025 0.9384 -
9.4651 31500 0.0014 0.0025 0.9383 -
9.6154 32000 0.0013 0.0023 0.9410 -
9.7656 32500 0.0011 0.0023 0.9409 -
9.9159 33000 0.0012 0.0021 0.9432 -
10.0661 33500 0.0011 0.0021 0.9432 -
10.2163 34000 0.001 0.0021 0.9442 -
10.3666 34500 0.0009 0.0022 0.9436 -
10.5168 35000 0.001 0.0021 0.9468 -
10.6671 35500 0.001 0.0020 0.9471 -
10.8173 36000 0.001 0.0021 0.9467 -
10.9675 36500 0.0011 0.0021 0.9478 -
11.1178 37000 0.0008 0.0020 0.9493 -
11.2680 37500 0.0008 0.0019 0.9509 -
11.4183 38000 0.0008 0.0019 0.9504 -
11.5685 38500 0.0008 0.0019 0.9512 -
11.7188 39000 0.0008 0.0019 0.9516 -
11.8690 39500 0.0007 0.0019 0.9534 -
12.0192 40000 0.0007 0.0018 0.9539 -
12.1695 40500 0.0006 0.0018 0.9555 -
12.3197 41000 0.0006 0.0019 0.9551 -
12.4700 41500 0.0007 0.0019 0.9550 -
12.6202 42000 0.0008 0.0018 0.9552 -
12.7704 42500 0.0006 0.0017 0.9559 -
12.9207 43000 0.0006 0.0017 0.9568 -
13.0709 43500 0.0006 0.0017 0.9577 -
13.2212 44000 0.0005 0.0017 0.9581 -
13.3714 44500 0.0006 0.0017 0.9586 -
13.5216 45000 0.0005 0.0017 0.9587 -
13.6719 45500 0.0005 0.0017 0.9591 -
13.8221 46000 0.0006 0.0016 0.9600 -
13.9724 46500 0.0005 0.0016 0.9603 -
14.1226 47000 0.0005 0.0016 0.9609 -
14.2728 47500 0.0005 0.0016 0.9612 -
14.4231 48000 0.0005 0.0016 0.9611 -
14.5733 48500 0.0005 0.0016 0.9616 -
14.7236 49000 0.0004 0.0015 0.9625 -
14.8738 49500 0.0004 0.0016 0.9628 -
15.0240 50000 0.0004 0.0016 0.9631 -
15.1743 50500 0.0004 0.0016 0.9632 -
15.3245 51000 0.0004 0.0016 0.9633 -
15.4748 51500 0.0004 0.0016 0.9635 -
15.6250 52000 0.0004 0.0015 0.9638 -
15.7752 52500 0.0004 0.0015 0.9640 -
15.9255 53000 0.0004 0.0015 0.9641 -
16.0 53248 - - - 0.9648

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.1
  • PyTorch: 2.2.2+cu121
  • Accelerate: 1.2.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0
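To reproduce this environment, the versions above can be pinned directly (a sketch; match the torch build to your CUDA setup):

pip install sentence-transformers==3.3.1 transformers==4.47.1 torch==2.2.2 accelerate==1.2.1 datasets==3.2.0 tokenizers==0.21.0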

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
