Update README.md
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- loss:MatryoshkaLoss
base_model: aari1995/gbert-large-2
metrics:
- spearman_cosine
widget:
- source_sentence: Bundeskanzler.
  sentences:
  - Angela Merkel.
  - Olaf Scholz.
  - Tino Chrupalla.
- source_sentence: Corona.
  sentences:
  - Virus.
  - Krone.
  - Bier.
- source_sentence: Ein Mann übt Boxen
  sentences:
  - Ein Affe praktiziert Kampfsportarten.
  - Eine Person faltet ein Blatt Papier.
  - Eine Frau geht mit ihrem Hund spazieren.
- source_sentence: Zwei Frauen laufen.
  sentences:
  - Frauen laufen.
  - Die Frau prüft die Augen des Mannes.
  - Ein Mann ist auf einem Dach
pipeline_tag: sentence-similarity
---

# German Semantic V3

The successor of [German_Semantic_STS_V2](https://huggingface.co/aari1995/German_Semantic_STS_V2) is here and comes with loads of cool new features! Feel free to provide feedback on the model and what you would like to see next.

**Note:** To run this model properly, see "Usage".

## Major updates and USPs:

- **Flexibility:** Trained with flexible sequence length and embedding truncation, flexibility is a core feature of the model. Note that smaller dimensions bring a minor trade-off in quality.
- **Sequence length:** Embed up to 8192 tokens (16 times more than V2 and other models).
- **Matryoshka Embeddings:** The model is trained for embedding sizes from 1024 down to 64, allowing you to store much smaller embeddings with little quality loss.
- **German only:** This model is German-only. It has rich cultural knowledge about Germany and German topics, which also allows the model to learn more efficiently thanks to its tokenizer, to deal better with shorter queries, and to generally be more nuanced in many scenarios.
- **Updated knowledge and quality data:** The backbone of this model is gbert-large by deepset. With Stage-2 pretraining on 1 billion tokens of German fineweb by occiglot, up-to-date knowledge is ensured.
- **Typo and Casing:** This model was trained to be robust against minor typos and casing, leading to slightly weaker benchmark performance and slower learning during training, but higher robustness of the embeddings.
- **Pooling Function:** The model moves away from mean pooling towards using the CLS token, which generally seems to learn better after the Stage-2 pretraining and allows for more flexibility.
- **License:** Apache 2.0

(If you are looking for even better performance on tasks, but with a German knowledge cutoff around 2020, check out [German_Semantic_V3b](https://huggingface.co/aari1995/German_Semantic_V3b).)
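Matryoshka embeddings can be shortened by simply keeping the leading dimensions and re-normalizing; no re-encoding is needed. A minimal NumPy sketch of the idea (the random matrix stands in for real model outputs, and `truncate_and_normalize` is an illustrative helper, not part of any library):

```python
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and L2-normalize again."""
    truncated = embeddings[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

# Stand-in for model.encode(...) output: 3 embeddings of size 1024
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1024)).astype(np.float32)

small = truncate_and_normalize(full, 64)
print(small.shape)  # (3, 64)
```

Recent sentence-transformers releases also expose this behaviour directly via the `truncate_dim` argument when loading a `SentenceTransformer`.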
## Usage:

This model has some built-in functionality that is rather hidden. To profit from it, use this code:

```python
from sentence_transformers import SentenceTransformer

# trust_remote_code is needed for the custom JinaBert/ALiBi architecture
model = SentenceTransformer("aari1995/German_Semantic_V3", trust_remote_code=True)

# Run inference
sentences = [
    'Eine Flagge weht.',
    'Die Flagge bewegte sich in der Luft.',
    'Zwei Personen beobachten das Wasser.',
]

# For FP16 embeddings (half space, no quality loss)
embeddings = model.encode(sentences, convert_to_tensor=True).half()

# For FP32 embeddings (takes more space)
# embeddings = model.encode(sentences)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
```
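The `.half()` call above stores every embedding value in 2 bytes instead of 4, halving memory, while cosine similarities barely move. A NumPy sketch of the storage math (random vectors stand in for real embeddings):

```python
import numpy as np

rng = np.random.default_rng(42)
emb_fp32 = rng.normal(size=(1000, 1024)).astype(np.float32)  # 1000 embeddings
emb_fp16 = emb_fp32.astype(np.float16)

print(emb_fp32.nbytes)  # 4096000 bytes
print(emb_fp16.nbytes)  # 2048000 bytes, half the space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity drift caused by the precision drop is tiny
drift = abs(cosine(emb_fp32[0], emb_fp32[1])
            - cosine(emb_fp16[0].astype(np.float32), emb_fp16[1].astype(np.float32)))
print(drift < 1e-3)  # True
```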

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: JinaBertModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
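With `pooling_mode_cls_token: True`, the sentence embedding is just the transformer's hidden state at the first ([CLS]) token, rather than the mean over all token vectors that V2 used. A toy NumPy sketch of the difference (shapes are illustrative, and real mean pooling would also respect the attention mask, which this sketch ignores):

```python
import numpy as np

# Toy transformer output: 2 sentences, 5 tokens each, hidden size 1024
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 5, 1024)).astype(np.float32)

# CLS pooling: keep only the first token's vector per sentence
cls_pooled = token_embeddings[:, 0, :]

# Mean pooling (the V2 approach): average over the token axis
mean_pooled = token_embeddings.mean(axis=1)

print(cls_pooled.shape, mean_pooled.shape)  # (2, 1024) (2, 1024)
```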

## Evaluation

Evaluation to come.

## FAQ

**Q: Is this model better than V2?**

**A:** In terms of flexibility, this model is better. Performance-wise, this model is also better in most of the experiments.

**Q: What is the difference between V3 and V3b?**

**A:** V3 is slightly worse on benchmarks, while V3b has a knowledge cutoff by 2020, so which model to use really depends on your use-case.
If you want peak performance and do not worry too much about recent developments, take [German_Semantic_V3b](https://huggingface.co/aari1995/German_Semantic_V3b).
If you are fine with sacrificing a few points on benchmarks and want the model to know what happened from 2020 on (elections, Covid, other cultural events etc.), use this model, [German_Semantic_V3](https://huggingface.co/aari1995/German_Semantic_V3).

**Q: How does the model perform vs. multilingual models?**

**A:** There are really great multilingual models that will be very useful for many use-cases. This model shines with its cultural knowledge and its knowledge about German people and behaviour.

**Q: What is the trade-off when reducing the embedding size?**

**A:** Broadly speaking, when going from 1024 to 512 dimensions, there is very little trade-off (around 1 percent). When going down to 64 dimensions, you may face a decrease of up to 3 percent.
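One way to sanity-check this trade-off on your own data is to compare similarity scores computed from truncated embeddings against the full-size ones. A sketch of the mechanics (random vectors stand in for real embeddings here, so the printed numbers only illustrate the method, not the model's actual quality):

```python
import numpy as np

rng = np.random.default_rng(7)
full = rng.normal(size=(100, 1024)).astype(np.float32)  # stand-in embeddings

def cosine_matrix(x):
    """Pairwise cosine similarities of the rows of x."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

reference = cosine_matrix(full)
for dim in (1024, 512, 256, 64):
    sims = cosine_matrix(full[:, :dim])
    # Agreement with full-size similarities (1.0 = identical signal)
    corr = np.corrcoef(sims.ravel(), reference.ravel())[0, 1]
    print(dim, round(float(corr), 3))
```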
## Up next:

German_Semantic_V3_Instruct: Guiding your embeddings towards self-selected aspects.

## Thank You and Credits

- To [Jina AI](https://huggingface.co/jinaai) for their BERT implementation that is used, especially ALiBi
- To [deepset](https://huggingface.co/deepset) for gbert-large, which is a really great model
- To [occiglot](https://huggingface.co/occiglot) and OSCAR for their data used to pre-train the model
- To [Tom](https://huggingface.co/tomaarsen), especially for sentence-transformers, and to [Björn and Jan from ellamind](https://ellamind.com/de/) for the consultation
- To [Meta](https://huggingface.co/facebook) for XNLI, which is used in variations

Idea, Training and Implementation by Aaron Chibb