Multilingual MPNet finetuned for cross-lingual similarity

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
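The Pooling module above has only pooling_mode_mean_tokens enabled, so the sentence vector is the average of the token vectors. The following is a minimal pure-Python sketch of that idea, not the library's implementation; the function name mean_pool and the toy inputs are hypothetical, and it assumes padding tokens are excluded through an attention mask.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, skipping padding."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: three tokens of dimension 2; the last one is padding (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

In the real model each token vector has 768 dimensions and at most 128 tokens are kept per input (max_seq_length).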

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("aryasuneesh/paraphrase-multilingual-mpnet-base-v2-7")
# Run inference
sentences = [
    "So Let's - Circle Back - to how YOU got your JOB - Jen Psaki",
    "Jen Psaki said, 'If you don’t buy anything, you won’t experience inflation’",
    'NAIA reverts to MIA, its old name',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
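To make the similarity scores easier to interpret, here is a small pure-Python sketch of a pairwise cosine similarity matrix. It assumes the model's similarity function is cosine similarity (consistent with the cosine metrics reported under Evaluation); the helper names and toy vectors are hypothetical and stand in for real 768-dimensional embeddings.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Score every vector against every other vector, giving an n x n matrix."""
    return [[cosine_similarity(a, b) for b in vectors] for a in vectors]


# Toy 2-dimensional "embeddings" used only for illustration.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = similarity_matrix(toy)
for row in scores:
    print([round(s, 3) for s in row])
```

The matrix is symmetric with values close to 1 on the diagonal, which is why three input sentences yield a 3 x 3 result.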

Evaluation

Metrics

Semantic Similarity

Metric Value
pearson_cosine 0.9494
spearman_cosine 0.8549
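The two metrics measure different things: Pearson correlation rewards a linear relationship between predicted cosine scores and gold labels, while Spearman correlation only compares their rank order. The sketch below shows both in pure Python under the assumption that ties receive average ranks; it is illustrative only, and the function names and toy data are hypothetical rather than taken from the evaluator.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# A monotone but non-linear relationship: perfect rank agreement, imperfect linear fit.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 4.0, 9.0, 16.0]
print(round(pearson(x, y), 4), round(spearman(x, y), 4))
```

Because the two coefficients respond differently to ties and to non-linear score distributions, they can diverge on the same predictions, as they do in the table above.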

Training Details

Training Dataset

Unnamed Dataset

  • Size: 178,008 training samples
  • Columns: text1, text2, and label
  • Approximate statistics based on the first 1000 samples:
    column  type    min       mean          max
    text1   string  5 tokens  65.05 tokens  128 tokens
    text2   string  4 tokens  21.88 tokens  128 tokens
    label   float   0.0       0.46          1.0
  • Samples:
    text1 text2 label
    CONFIRM THAT THE UNITED STATES CARRIED CARRIED OUT A MILITARY ATTACK ON KABUL صورة لانفجار عبوة ناسفة استهدفت سيارة عسكرية جنوب غربي مدينة الرقة السوريّة. 0.0
    Lisboa grita Fora Bolsonaro durante show de Gustavo Lima De arrepiarl [USER] LISBOA, PORTUGAL Lisbon screams Fora Bolsonaro during concert by Gustavo Lima 0.0
    Singapore stops the vaccination after 48 people died The Telegraph Singapore halts use of flu vaccines after 48 die in South Korea [USER].06flatearth Singapore halts the rollout of influenza vaccination due to deaths in South Korea 1.0
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
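With loss_fct set to MSELoss, the training objective compares the cosine similarity of the two sentence embeddings with the float label and penalizes the squared difference. A minimal pure-Python sketch of that computation follows; it is not the library code, and the function names and toy 2-dimensional embeddings are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_similarity_loss(pairs: List[Tuple[Vector, Vector]], labels: List[float]) -> float:
    """Mean squared error between cos(u, v) and the gold label over a batch."""
    squared_errors = [(cosine(u, v) - label) ** 2 for (u, v), label in zip(pairs, labels)]
    return sum(squared_errors) / len(squared_errors)


# Toy batch: one identical pair and one orthogonal pair.
batch = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
print(cosine_similarity_loss(batch, [1.0, 0.0]))  # labels agree with the geometry
print(cosine_similarity_loss(batch, [0.0, 1.0]))  # labels contradict the geometry
```

The first call yields a loss of 0 and the second a loss of 1, which shows how labels in the range 0.0 to 1.0 pull matching pairs together and push non-matching pairs toward orthogonality.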
    

Evaluation Dataset

Unnamed Dataset

  • Size: 44,503 evaluation samples
  • Columns: text1, text2, and label
  • Approximate statistics based on the first 1000 samples:
    column  type    min       mean          max
    text1   string  7 tokens  66.12 tokens  128 tokens
    text2   string  4 tokens  22.01 tokens  128 tokens
    label   float   0.0       0.48          1.0
  • Samples:
    text1 text2 label
    141 UN PUEBLO QUE ELIGE A CORRUPTOS, LADRONES Y TRAIDORES NO ES VÍCTIMA, ES COMPLICE. GEORGE ORWELL or [USER] periodismo • poder para la gente “A people who elect corrupts, imposters, thieves and traitors, are not victims. You are an accomplice!” 0.0
    Watch Full Video [URL] Nasir Chenyoti, the one who spread smiles on people's faces, is fighting a life and death battle today. Pakistani comic Nasir Chinyoti burned in an accident 1.0
    at des Bezirkec Potsdam Abt. Veterinarsenen 1500 Heinrich-enn-Allee 107 III-15-01-Br 25. Juli 1985 04.07.1985 Information zum Infektionszeitpunkt und zur Übertragung der Coronavirueinfektion in Krein Brandenburg Ier 03.07.1985 gibt es in Kreis 7 staatliche ban. genossenschaftliche und 24 individuelle Coronavirus infektions-Bestunde (siehe Anlage). - Fia Fratinfektion hat vermutlich in der FA wollin stattgefunden (Blutentnahme v. 22.5.85, Feststellung 30.5.85). Von Galten der Betriebsleitung wird eine Einschleppung tiber 1KVE-Fahrzeuge der TVB Conthin vermutet. Dieses Dokument beweist, dass das Corona-Virus schon in der DDR existierte 1.0
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • learning_rate: 2e-05
  • weight_decay: 0.01
  • num_train_epochs: 5
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • fp16: True
  • fp16_full_eval: True
  • dataloader_num_workers: 4
  • load_best_model_at_end: True
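These settings imply a concrete step count and learning-rate curve. The sketch below works through the arithmetic and a linear-warmup, cosine-decay schedule in pure Python. It assumes a single device and no gradient accumulation (which agrees with the step counts in the training logs), and it approximates the shape of the scheduler rather than reproducing the trainer's exact code; all constant and function names are hypothetical.

```python
import math

TRAIN_SAMPLES = 178_008   # size of the training dataset
BATCH_SIZE = 64           # per_device_train_batch_size, assuming one device
EPOCHS = 5                # num_train_epochs
BASE_LR = 2e-5            # learning_rate
WARMUP_RATIO = 0.1        # warmup_ratio

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)   # last partial batch is kept
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def learning_rate(step: int) -> float:
    """Linear warmup from 0 to BASE_LR, then half-cosine decay back to 0."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


print(steps_per_epoch, total_steps, warmup_steps)  # 2782 13910 1391
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, learning_rate(step))
```

Under these assumptions one epoch is 2,782 steps, so step 13900 corresponds to epoch 4.9964 as reported in the training logs.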

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: True
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 4
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss eval-similarity_spearman_cosine
0.1247 347 0.1578 - -
0.2495 694 0.1356 - -
0.2498 695 - 0.1248 0.7041
0.3742 1041 0.1206 - -
0.4989 1388 0.1121 - -
0.4996 1390 - 0.1026 0.7569
0.6237 1735 0.1028 - -
0.7484 2082 0.093 - -
0.7495 2085 - 0.0862 0.7896
0.8731 2429 0.0889 - -
0.9978 2776 0.083 - -
0.9993 2780 - 0.0739 0.8097
1.1226 3123 0.0648 - -
1.2473 3470 0.062 - -
1.2491 3475 - 0.0662 0.8174
1.3720 3817 0.0595 - -
1.4968 4164 0.0567 - -
1.4989 4170 - 0.0585 0.8277
1.6215 4511 0.0553 - -
1.7462 4858 0.0513 - -
1.7487 4865 - 0.0518 0.8355
1.8710 5205 0.0497 - -
1.9957 5552 0.0465 - -
1.9986 5560 - 0.0462 0.8409
2.1204 5899 0.0336 - -
2.2451 6246 0.0319 - -
2.2484 6255 - 0.0433 0.8438
2.3699 6593 0.0311 - -
2.4946 6940 0.0304 - -
2.4982 6950 - 0.0401 0.8457
2.6193 7287 0.0306 - -
2.7441 7634 0.0302 - -
2.7480 7645 - 0.0356 0.8492
2.8688 7981 0.0275 - -
2.9935 8328 0.0281 - -
2.9978 8340 - 0.0330 0.8509
3.1183 8675 0.0198 - -
3.2430 9022 0.0198 - -
3.2477 9035 - 0.0315 0.8520
3.3677 9369 0.0183 - -
3.4925 9716 0.0182 - -
3.4975 9730 - 0.0303 0.8526
3.6172 10063 0.0189 - -
3.7419 10410 0.018 - -
3.7473 10425 - 0.0289 0.8539
3.8666 10757 0.0171 - -
3.9914 11104 0.0178 - -
3.9971 11120 - 0.0274 0.8546
4.1161 11451 0.014 - -
4.2408 11798 0.0142 - -
4.2469 11815 - 0.0269 0.8547
4.3656 12145 0.0137 - -
4.4903 12492 0.0135 - -
4.4968 12510 - 0.0266 0.8548
4.6150 12839 0.0136 - -
4.7398 13186 0.0138 - -
4.7466 13205 - 0.0265 0.8549
4.8645 13533 0.0135 - -
4.9892 13880 0.0136 - -
4.9964 13900 - 0.0265 0.8549
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.10.14
  • Sentence Transformers: 3.3.1
  • Transformers: 4.44.2
  • PyTorch: 2.4.1+cu121
  • Accelerate: 0.34.2
  • Datasets: 3.2.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}