iliemihai/romanian-sentence-bert-base-uncased-v1

@@ -6,9 +6,9 @@ tags:
 license: mit
 ---
-# bert-base-romanian-uncased-v1
-The BERT **base**, **uncased** model for Romanian, trained on a 15GB corpus, version ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)
 ### How to use
@@ -28,6 +28,41 @@ outputs = model(input_ids)
 last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 ```
 Remember to always sanitize your text! Replace the cedilla forms of ``s`` and ``t`` with their comma-below forms:
 ```
 text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
@@ -44,7 +79,7 @@ because the model was **NOT** trained on cedilla ``s`` and ``t``s. If you don't,
 | Warmup steps     | 500  |
 | Uncased      | True  |
 | Max. Seq. Length | 512   |
 ### Evaluation
@@ -71,35 +106,13 @@ The model is trained on the following corpora (stats in the table below are afte
 #### Finetuning
-The model is finetune on the  RO_MNLI dataset (translated entire MNLI dataset from English to Romanian).
 ### Citation
-If you use this model in a research paper, I'd kindly ask you to cite the following paper:
-```
-Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.
-```
-or, in bibtex:
-```
-@inproceedings{dumitrescu-etal-2020-birth,
-    title = "The birth of {R}omanian {BERT}",
-    author = "Dumitrescu, Stefan  and
-      Avram, Andrei-Marius  and
-      Pyysalo, Sampo",
-    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
-    month = nov,
-    year = "2020",
-    address = "Online",
-    publisher = "Association for Computational Linguistics",
-    url = "https://aclanthology.org/2020.findings-emnlp.387",
-    doi = "10.18653/v1/2020.findings-emnlp.387",
-    pages = "4324--4328",
-}
-```
 #### Acknowledgements
-- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!

 license: mit
 ---
+# sentence-bert-base-romanian-uncased-v1
+The BERT **base**, **uncased** model for Romanian, finetuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian) ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)
 ### How to use
 last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 ```
+Alternative use, with the `sentence-transformers` library:
+```python
+from sentence_transformers import SentenceTransformer
+import numpy as np
+# Initialize the model
+model = SentenceTransformer("iliemihai/sentence-bert-base-romanian-uncased-v1")
+# Define the sentences
+sentences = [
+    "Un tren își începe călătoria către destinație.",
+    "O locomotivă pornește zgomotos spre o stație îndepărtată.",
+    "Un muzician cântă la un saxofon impresionant.",
+    "Un saxofonist evocă melodii suave sub lumina lunii.",
+    "O bucătăreasă presară condimente pe un platou cu legume.",
+    "Un chef adaugă un strop de mirodenii peste o salată colorată.",
+    "Un jongler își aruncă mingile colorate în aer.",
+    "Un artist de circ jonglează cu măiestrie sub reflectoare.",
+    "Un artist pictează un peisaj minunat pe o pânză albă.",
+    "Un pictor redă frumusețea naturii pe pânza sa strălucitoare."
+]
+# Compute an embedding for each sentence
+embeddings = model.encode(sentences)
+# Compute semantic similarity as the cosine similarity between embeddings
+similarities = np.dot(embeddings, embeddings.T) / (np.linalg.norm(embeddings, axis=1)[:, np.newaxis] * np.linalg.norm(embeddings, axis=1)[np.newaxis, :])
+# Print the similarity between every pair of sentences
+for i in range(len(sentences)):
+    for j in range(len(sentences)):
+        print(f"Similarity between '{sentences[i]}' and '{sentences[j]}': {similarities[i, j]:.4f}")
+```
 Remember to always sanitize your text! Replace the cedilla forms of ``s`` and ``t`` with their comma-below forms:
 ```
 text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
 | Warmup steps     | 500  |
 | Uncased      | True  |
 | Max. Seq. Length | 512   |
+| Loss function | Contrastive Loss   |
 ### Evaluation
 #### Finetuning
+The model is finetuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian), keeping only the contradiction and entailment pairs (~256k sentence pairs).
 ### Citation
+Paper coming soon
 #### Acknowledgements
+- We'd like to thank [Stefan Dumitrescu](https://github.com/dumitrescustefan) and [Andrei Marius Avram](https://github.com/avramandrei) for pretraining the v1.0 BERT models!
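
The card asks readers to replace cedilla ``s`` and ``t`` with their comma-below forms before encoding. A minimal sketch of that step as a reusable helper follows; the names `sanitize_romanian` and `sanitize_batch` are hypothetical and not part of this commit, and the mapping covers only the four letters the card mentions.

```python
# Hypothetical helper, not part of the model card: it applies the same four
# replacements as the card's chained .replace() calls, in a single pass.
_CEDILLA_TO_COMMA = str.maketrans({
    "\u0163": "\u021b",  # t with cedilla     -> t with comma below
    "\u015f": "\u0219",  # s with cedilla     -> s with comma below
    "\u0162": "\u021a",  # T with cedilla     -> T with comma below
    "\u015e": "\u0218",  # S with cedilla     -> S with comma below
})


def sanitize_romanian(text: str) -> str:
    """Return text with cedilla s/t replaced by comma-below s/t."""
    return text.translate(_CEDILLA_TO_COMMA)


def sanitize_batch(sentences):
    """Sanitize every sentence in a list before passing it to the model."""
    return [sanitize_romanian(s) for s in sentences]
```

Under these assumptions, a list of sentences could be passed through `sanitize_batch` before being handed to `model.encode`, so that the sentence-embedding path receives the same normalized text as the tokenizer example.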