Description:

This Sentence-CamemBERT-Large model is a French embedding model developed by La Javaness. It represents the content and semantics of a French sentence as a numerical vector, capturing the meaning of queries and documents beyond their individual words, which makes it well suited to semantic search.

Pre-trained sentence embedding models are the state of the art for French sentence embeddings.

The Lajavaness/sentence-camembert-large model is an improvement over dangvantuan/sentence-camembert-base, offering greater robustness and better performance on all STS benchmark datasets. It was fine-tuned from the pre-trained facebook/camembert-large as a Siamese BERT network, using the sentence-transformers library on the stsb dataset. It was additionally trained with Augmented SBERT on stsb, with pair sampling strategies that rely on two models: CrossEncoder-camembert-large and dangvantuan/sentence-camembert-large.
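
The Augmented SBERT step mentioned above can be pictured as a data flow: a bi-encoder proposes sentence pairs that look similar, a cross-encoder scores each pair, and the resulting "silver" pairs are added to the gold stsb data before the bi-encoder is fine-tuned. The sketch below shows only that flow and assumes semantic-search sampling, since this card does not say which pair sampling strategy was used. `build_silver_pairs`, `embed` and `cross_score` are hypothetical names, and the toy bag-of-words stand-ins exist only so the example runs without downloading a model.

```python
import math


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_silver_pairs(sentences, embed, cross_score, top_k=1):
    """For each sentence, take its top_k nearest neighbours under the
    bi-encoder (embed) and label each new pair with the cross-encoder."""
    vectors = [embed(s) for s in sentences]
    seen, silver = set(), []
    for i, vec in enumerate(vectors):
        neighbours = sorted(
            (j for j in range(len(sentences)) if j != i),
            key=lambda j: cosine(vec, vectors[j]),
            reverse=True,
        )[:top_k]
        for j in neighbours:
            pair = (min(i, j), max(i, j))
            if pair not in seen:  # skip pairs that were already sampled
                seen.add(pair)
                a, b = sentences[pair[0]], sentences[pair[1]]
                silver.append((a, b, cross_score(a, b)))
    return silver


# Toy stand-ins (assumptions): real use would wrap a bi-encoder and a
# cross-encoder instead of these bag-of-words functions.
VOCAB = ["avion", "décolle", "homme", "joue", "flûte", "guitare"]


def toy_embed(sentence):
    words = sentence.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def toy_cross_score(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)  # Jaccard overlap in [0, 1]


corpus = ["Un avion décolle", "Un homme joue flûte", "Un homme joue guitare"]
silver_pairs = build_silver_pairs(corpus, toy_embed, toy_cross_score)
```

With sentence-transformers, `SentenceTransformer.encode` and `CrossEncoder.predict` would play the roles of `embed` and `cross_score`.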

Usage

The model can be used directly (without a language model) as follows:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Lajavaness/sentence-camembert-large")

sentences = [
    "Un avion est en train de décoller.",
    "Un homme joue d'une grande flûte.",
    "Un homme étale du fromage râpé sur une pizza.",
    "Une personne jette un chat au plafond.",
    "Une personne est en train de plier un morceau de papier.",
]

# One embedding vector per input sentence
embeddings = model.encode(sentences)
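
Once sentences are encoded, semantic search reduces to comparing vectors. The sketch below ranks candidates against a query by cosine similarity in plain Python. The three-dimensional vectors are made-up placeholders standing in for rows of `model.encode(...)` output, and `cosine_similarity` and `rank_by_similarity` are illustrative helper names, not part of sentence-transformers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidate_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, vec))
              for i, vec in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Placeholder vectors (assumption): in practice, use rows of
# model.encode(sentences), for example embeddings[0] as the query.
query = [1.0, 0.0, 1.0]
candidates = [
    [0.9, 0.1, 1.1],   # nearly the same direction as the query
    [-1.0, 0.5, 0.0],  # points away from the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
]

ranking = rank_by_similarity(query, candidates)
```

On real embeddings, `sentence_transformers.util.cos_sim` computes the same quantity for whole matrices at once.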

Evaluation

The model can be evaluated as follows on the French dev and test splits of stsb.

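from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample

model = SentenceTransformer("Lajavaness/sentence-camembert-large")


def convert_dataset(dataset):
    """Turn stsb rows into InputExample pairs with labels in [0, 1]."""
    dataset_samples = []
    for row in dataset:
        score = float(row["similarity_score"]) / 5.0  # Normalize score to range 0 ... 1
        inp_example = InputExample(texts=[row["sentence1"], row["sentence2"]], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples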

# Loading the dataset for evaluation
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

# Convert the dataset for evaluation

# For Dev set:
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# For Test set:
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")

Test Result: The performance is measured using Pearson and Spearman correlation:

  • On dev

| Model | Pearson correlation | Spearman correlation | #params |
|---|---|---|---|
| Lajavaness/sentence-camembert-large | 88.63 | 88.46 | 336M |
| dangvantuan/sentence-camembert-large | 88.2 | 88.02 | 336M |
| Sahajtomar/french_semantic | 87.44 | 87.30 | 336M |
| Lajavaness/sentence-flaubert-base | 87.14 | 87.10 | 137M |
| GPT-3 (text-davinci-003) | 85 | NaN | 175B |
| GPT-3 (text-embedding-ada-002) | 79.75 | 80.44 | NaN |

  • On test, Pearson and Spearman correlation are evaluated on many different benchmark datasets:

Pearson score

| Model | STS-B | STS12-fr | STS13-fr | STS14-fr | STS15-fr | STS16-fr | SICK-fr | #params |
|---|---|---|---|---|---|---|---|---|
| Lajavaness/sentence-camembert-large | 86.26 | 87.42 | 89.34 | 88.05 | 88.91 | 77.15 | 83.13 | 336M |
| dangvantuan/sentence-camembert-large | 85.88 | 87.28 | 89.25 | 87.91 | 88.54 | 76.90 | 83.26 | 336M |
| Sahajtomar/french_semantic | 85.80 | 86.05 | 88.50 | 86.57 | 87.49 | 77.85 | 83.27 | 336M |
| Lajavaness/sentence-flaubert-base | 85.39 | 86.64 | 87.24 | 85.68 | 87.99 | 75.78 | 82.84 | 137M |
| GPT-3 (text-embedding-ada-002) | 79.03 | 66.16 | 75.48 | 70.69 | 77.88 | 65.18 | - | - |

Spearman score

| Model | STS-B | STS12-fr | STS13-fr | STS14-fr | STS15-fr | STS16-fr | SICK-fr | #params |
|---|---|---|---|---|---|---|---|---|
| Lajavaness/sentence-camembert-large | 86.14 | 81.22 | 88.61 | 86.28 | 89.01 | 78.65 | 77.71 | 336M |
| dangvantuan/sentence-camembert-large | 85.78 | 81.09 | 88.68 | 85.81 | 88.56 | 78.49 | 77.70 | 336M |
| Sahajtomar/french_semantic | 85.55 | 77.92 | 87.85 | 83.96 | 87.63 | 79.07 | 77.14 | 336M |
| Lajavaness/sentence-flaubert-base | 85.67 | 79.97 | 86.91 | 84.57 | 88.10 | 77.84 | 77.55 | 137M |
| GPT-3 (text-embedding-ada-002) | 77.53 | 64.27 | 76.41 | 69.63 | 78.65 | 75.30 | - | - |
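
For reference, the scores in the tables above compare the model's cosine similarity for each sentence pair with the human-annotated similarity score. The sketch below computes both correlations in plain Python on made-up numbers. `pearson`, `ranks` and `spearman` are illustrative helpers written for this example (in practice `scipy.stats.pearsonr` and `scipy.stats.spearmanr` do the same job), and the reported figures are assumed here to be these correlations multiplied by 100.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up numbers for illustration only (not taken from any benchmark).
gold = [0.0, 0.2, 0.5, 0.8, 1.0]            # human scores divided by 5
predicted = [0.10, 0.05, 0.40, 0.70, 0.95]  # cosine similarities

pearson_pct = 100 * pearson(predicted, gold)
spearman_pct = 100 * spearman(predicted, gold)
```

Pearson rewards a linear relationship between predicted and gold scores, while Spearman only checks that pairs are ordered the same way, which is why the two columns can rank models differently.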

Citation

@article{reimers2019sentence,
   title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
   author={Reimers, Nils and Gurevych, Iryna},
   journal={arXiv preprint arXiv:1908.10084},
   url={https://arxiv.org/abs/1908.10084},
   year={2019}
}


@article{martin2020camembert,
   title={CamemBERT: a Tasty French Language Model},
   author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
   journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
   year={2020}
}