---
pipeline_tag: sentence-similarity
language: fr
datasets:
- stsb_multi_mt
tags:
- Text
- Sentence Similarity
- Sentence-Embedding
- camembert-large
license: apache-2.0
model-index:
- name: sentence-camembert-large by Van Tuan DANG
  results:
  - task:
      name: Sentence-Embedding
      type: Text Similarity
    dataset:
      name: Text Similarity fr
      type: stsb_multi_mt
      args: fr
    metrics:
      - name: Test Pearson correlation coefficient
        type: Pearson_correlation_coefficient
        value: 88.63
library_name: sentence-transformers
---

## Description:

[**Sentence-CamemBERT-Large**](https://huggingface.co./Lajavaness/sentence-camembert-large) is a sentence embedding model for French developed by [La Javaness](https://www.lajavaness.com/). It represents the content and semantics of a French sentence as a mathematical vector, capturing the meaning of queries and documents beyond their individual words. This makes it well suited to semantic search.

## Pre-trained sentence embedding models are the state of the art for French sentence embeddings.

The [Lajavaness/sentence-camembert-large](https://huggingface.co./Lajavaness/sentence-camembert-large) model is an improvement over [dangvantuan/sentence-camembert-base](https://huggingface.co./dangvantuan/sentence-camembert-large), offering greater robustness and better performance on most of the STS benchmark datasets reported below. It has been fine-tuned from the pre-trained [facebook/camembert-large](https://huggingface.co./camembert/camembert-large) using [Siamese BERT-Networks with 'sentence-transformers'](https://www.sbert.net/) on the [stsb](https://huggingface.co./datasets/stsb_multi_mt/viewer/fr/train) dataset. Additionally, it has been combined with [Augmented SBERT](https://aclanthology.org/2021.naacl-main.28.pdf) on the same [stsb](https://huggingface.co./datasets/stsb_multi_mt/viewer/fr/train) dataset.
The model benefits from Pair Sampling Strategies using two models: [CrossEncoder-camembert-large](https://huggingface.co./dangvantuan/CrossEncoder-camembert-large) and [dangvantuan/sentence-camembert-large](https://huggingface.co./dangvantuan/sentence-camembert-large).

## Usage

The model can be used directly (without a language model) as follows:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Lajavaness/sentence-camembert-large")

sentences = [
    "Un avion est en train de décoller.",
    "Un homme joue d'une grande flûte.",
    "Un homme étale du fromage râpé sur une pizza.",
    "Une personne jette un chat au plafond.",
    "Une personne est en train de plier un morceau de papier.",
]

# One embedding vector per input sentence
embeddings = model.encode(sentences)
```

## Evaluation

The model can be evaluated as follows on the French dev and test data of stsb.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample
from datasets import load_dataset

model = SentenceTransformer("Lajavaness/sentence-camembert-large")

def convert_dataset(dataset):
    """Turn each row of the dataset into an InputExample with a score in [0, 1]."""
    dataset_samples = []
    for df in dataset:
        score = float(df['similarity_score']) / 5.0  # Normalize score to range 0 ... 1
        inp_example = InputExample(texts=[df['sentence1'], df['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples

# Loading the dataset for evaluation
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

# Convert the dataset for evaluation

# For Dev set:
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# For Test set:
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")
```

**Test Result**:
The performance is measured using Pearson and Spearman correlation:

- On dev

| Model | Pearson correlation | Spearman correlation | #params |
| ------------- | ------------- | ------------- |------------- |
| [Lajavaness/sentence-camembert-large](https://huggingface.co./dangvantuan/sentence-camembert-large)| **88.63** |**88.46** | 336M|
| [dangvantuan/sentence-camembert-large](https://huggingface.co./dangvantuan/sentence-camembert-large)| 88.2 |88.02 | 336M|
| [Sahajtomar/french_semantic](https://huggingface.co./Sahajtomar/french_semantic)| 87.44 |87.30 | 336M|
| [Lajavaness/sentence-flaubert-base](https://huggingface.co./Lajavaness/sentence-flaubert-base)| 87.14 |87.10 | 137M |
| [GPT-3 (text-davinci-003)](https://platform.openai.com/docs/models) | 85 | NaN|175B |
| [GPT-3 (text-embedding-ada-002)](https://platform.openai.com/docs/models) | 79.75 | 80.44|NaN |

- On test, Pearson and Spearman correlations are evaluated on several benchmark datasets:

**Pearson score**

| Model | [STS-B](https://huggingface.co./datasets/stsb_multi_mt/viewer/fr/train) | [STS12-fr](https://huggingface.co./datasets/Lajavaness/STS12-fr) | [STS13-fr](https://huggingface.co./datasets/Lajavaness/STS13-fr) | [STS14-fr](https://huggingface.co./datasets/Lajavaness/STS14-fr) | [STS15-fr](https://huggingface.co./datasets/Lajavaness/STS15-fr) | [STS16-fr](https://huggingface.co./datasets/Lajavaness/STS16-fr) | [SICK-fr](https://huggingface.co./datasets/Lajavaness/SICK-fr) | params |
|------------------------------------------|-------|----------|----------|----------|----------|----------|---------|--------|
| [Lajavaness/sentence-camembert-large](https://huggingface.co./dangvantuan/sentence-camembert-large) | **86.26** | **87.42** | **89.34** | **88.05** | **88.91** | 77.15 | 83.13 | 336M |
| [dangvantuan/sentence-camembert-large](https://huggingface.co./dangvantuan/sentence-camembert-large) | 85.88 | 87.28 | 89.25 | 87.91 | 88.54 | 76.90 | 83.26 | 336M |
| [Sahajtomar/french_semantic](https://huggingface.co./Sahajtomar/french_semantic) | 85.80 | 86.05 | 88.50 | 86.57 | 87.49 | 77.85 | 83.27 | 336M |
| [Lajavaness/sentence-flaubert-base](https://huggingface.co./Lajavaness/sentence-flaubert-base) | 85.39 | 86.64 | 87.24 | 85.68 | 87.99 | 75.78 | 82.84 | 137M |
| [GPT-3 (text-embedding-ada-002)](https://platform.openai.com/docs/models) | 79.03 | 66.16 | 75.48 | 70.69 | 77.88 | 65.18 | - | - |

**Spearman score**

| Model | [STS-B](https://huggingface.co./datasets/stsb_multi_mt/viewer/fr/train) | [STS12-fr](https://huggingface.co./datasets/Lajavaness/STS12-fr) | [STS13-fr](https://huggingface.co./datasets/Lajavaness/STS13-fr) | [STS14-fr](https://huggingface.co./datasets/Lajavaness/STS14-fr) | [STS15-fr](https://huggingface.co./datasets/Lajavaness/STS15-fr) | [STS16-fr](https://huggingface.co./datasets/Lajavaness/STS16-fr) | [SICK-fr](https://huggingface.co./datasets/Lajavaness/SICK-fr) | params |
|:-------------------------------------|-------:|---------:|---------:|---------:|---------:|---------:|--------:|:-------|
| [Lajavaness/sentence-camembert-large](https://huggingface.co./dangvantuan/sentence-camembert-large) | **86.14** | **81.22** | 88.61 | **86.28** | **89.01** | 78.65 | **77.71** | 336M |
| [dangvantuan/sentence-camembert-large](https://huggingface.co./dangvantuan/sentence-camembert-large) | 85.78 | 81.09 | 88.68 | 85.81 | 88.56 | 78.49 | 77.70 | 336M |
| [Sahajtomar/french_semantic](https://huggingface.co./Sahajtomar/french_semantic) | 85.55 | 77.92 | 87.85 | 83.96 | 87.63 | 79.07 | 77.14 | 336M |
| [Lajavaness/sentence-flaubert-base](https://huggingface.co./Lajavaness/sentence-flaubert-base) | 85.67 | 79.97 | 86.91 | 84.57 | 88.10 | 77.84 | 77.55 | 137M |
| [GPT-3 (text-embedding-ada-002)](https://platform.openai.com/docs/models) | 77.53 | 64.27 | 76.41 | 69.63 | 78.65 | 75.30 | - | - |

## Citation

    @article{reimers2019sentence,
       title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
       author={Reimers, Nils and Gurevych, Iryna},
       journal={https://arxiv.org/abs/1908.10084},
       year={2019}
    }

    @article{martin2020camembert,
       title={CamemBERT: a Tasty French Language Model},
       author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
       journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
       year={2020}
    }
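
## Example: ranking sentences by similarity

The description mentions semantic search, and the Usage section stops at `model.encode(sentences)`. As a minimal sketch of what a next step could look like, the snippet below ranks candidate vectors against a query vector by cosine similarity. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical (they are not part of this model or of sentence-transformers), and the two-dimensional toy vectors only stand in for real embeddings so that the example runs without downloading the model.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return float(dot / (norm_u * norm_v))


def rank_by_similarity(
    query: Sequence[float], corpus: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (corpus index, score) pairs, most similar first."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy vectors standing in for real sentence embeddings (assumption).
    toy_corpus = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    toy_query = [1.0, 0.1]
    for index, score in rank_by_similarity(toy_query, toy_corpus):
        print(index, round(score, 3))
```

With real embeddings from the Usage section, a call such as `rank_by_similarity(embeddings[0], embeddings[1:])` should work the same way, since the helpers only iterate over the vectors. For larger corpora, the vectorized `sentence_transformers.util.cos_sim` is likely a better fit than this pure-Python loop.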