---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- loss:MatryoshkaLoss
base_model: aari1995/gbert-large-2
metrics:
- spearman_cosine
widget:
- source_sentence: Bundeskanzler.
  sentences:
  - Angela Merkel.
  - Olaf Scholz.
  - Tino Chrupalla.
- source_sentence: Corona.
  sentences:
  - Virus.
  - Krone.
  - Bier.
- source_sentence: Ein Mann übt Boxen
  sentences:
  - Ein Affe praktiziert Kampfsportarten.
  - Eine Person faltet ein Blatt Papier.
  - Eine Frau geht mit ihrem Hund spazieren.
- source_sentence: Zwei Frauen laufen.
  sentences:
  - Frauen laufen.
  - Die Frau prüft die Augen des Mannes.
  - Ein Mann ist auf einem Dach
pipeline_tag: sentence-similarity
---

# German Semantic V3

The successor of [German_Semantic_STS_V2](https://huggingface.co/aari1995/German_Semantic_STS_V2) is here and comes with loads of cool new features! Feel free to provide feedback on the model and what you would like to see next.

**Note:** To run this model properly, see "Usage".

## Major updates and USPs:

- **Flexibility:** Trained with flexible sequence length and embedding truncation, flexibility is a core feature of the model. Note that smaller dimensions bring a minor trade-off in quality.
- **Sequence length:** Embed up to 8192 tokens (16 times more than V2 and other models).
- **Matryoshka Embeddings:** The model is trained for embedding sizes from 1024 down to 64, allowing you to store much smaller embeddings with little quality loss.
- **German only:** This model is German-only. It has rich cultural knowledge about Germany and German topics, which also lets it learn more efficiently thanks to its tokenizer, deal better with shorter queries, and generally be more nuanced in many scenarios.
- **Updated knowledge and quality data:** The backbone of this model is gbert-large by deepset. Stage-2 pretraining on 1 billion tokens of German fineweb by occiglot ensures up-to-date knowledge.
- **Typo and casing robustness:** This model was trained to be robust against minor typos and casing variations, which costs a little benchmark performance during training but yields more robust embeddings.
- **Pooling function:** Moving away from mean pooling towards using the CLS token. This generally seems to learn better after the stage-2 pretraining and allows for more flexibility.
- **License:** Apache 2.0

(If you are looking for even better performance on tasks, but with a German knowledge cutoff around 2020, check out [German_Semantic_V3b](https://huggingface.co/aari1995/German_Semantic_V3b).)

## Usage:

This model has some built-in functionality that is rather hidden. To profit from it, use this code:

```python
from sentence_transformers import SentenceTransformer

# trust_remote_code is needed for the custom JinaBERT / ALiBi implementation
model = SentenceTransformer("aari1995/German_Semantic_V3", trust_remote_code=True)

sentences = [
    'Eine Flagge weht.',
    'Die Flagge bewegte sich in der Luft.',
    'Zwei Personen beobachten das Wasser.',
]

# For FP16 embeddings (half the storage, no quality loss)
embeddings = model.encode(sentences, convert_to_tensor=True).half()

# For FP32 embeddings (takes more space)
# embeddings = model.encode(sentences)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
```
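Matryoshka truncation itself is just slicing plus re-normalization. A minimal numpy sketch (the random `full` array is a stand-in for real `model.encode` output):

```python
import numpy as np

# Stand-in for real model output: 3 sentences x 1024 Matryoshka dimensions
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1024)).astype(np.float32)

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and rescale to unit length."""
    truncated = embeddings[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

small = truncate_and_normalize(full, 64)
print(small.shape)  # (3, 64)
```

Recent sentence-transformers versions also expose a `truncate_dim` argument on `SentenceTransformer` that does this for you at encode time.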
### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: JinaBertModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
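The `pooling_mode_cls_token: True` setting means the sentence embedding is simply the first ([CLS]) token's vector rather than an average over all tokens. A toy numpy illustration (the shapes here are made up for demonstration):

```python
import numpy as np

# Toy token embeddings: batch of 2 sequences, 5 tokens each, 8 dimensions
token_embeddings = np.arange(2 * 5 * 8, dtype=np.float32).reshape(2, 5, 8)

# CLS pooling: take only the first token's vector per sequence
cls_pooled = token_embeddings[:, 0, :]

# Mean pooling (what V2 used): average over the token axis
mean_pooled = token_embeddings.mean(axis=1)

print(cls_pooled.shape, mean_pooled.shape)  # (2, 8) (2, 8)
```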

## Evaluation

Evaluation to come.

## FAQ

**Q: Is this model better than V2?**

**A:** In terms of flexibility, this model is clearly better. Performance-wise, it is also better in most of the experiments.

**Q: What is the difference between V3 and V3b?**

**A:** V3 is slightly worse on benchmarks, while V3b has a knowledge cutoff around 2020, so which model to use really depends on your use case. If you want peak performance and do not worry too much about recent developments, take [German_Semantic_V3b](https://huggingface.co/aari1995/German_Semantic_V3b). If you are fine with sacrificing a few points on benchmarks and want the model to know what happened from 2020 on (elections, Covid, other cultural events, etc.), use this model (V3).

**Q: How does the model perform vs. multilingual models?**

**A:** There are really great multilingual models that will be very useful for many use cases. This model shines with its cultural knowledge about Germany, German people, and their behaviour.

**Q: What is the trade-off when reducing the embedding size?**

**A:** Broadly speaking, when going from 1024 down to 512 dimensions, there is very little trade-off (about 1 percent). When going down to 64 dimensions, you may face a decrease of up to 3 percent.
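The dimension and dtype savings compound. A back-of-the-envelope estimate for a hypothetical corpus of one million embeddings (the corpus size is an assumption for illustration):

```python
import numpy as np

n_docs = 1_000_000  # hypothetical corpus size
sizes = {}
for dim in (1024, 512, 64):
    for dtype in (np.float32, np.float16):
        # bytes per corpus = docs * dims * bytes-per-value, shown in GiB
        gib = n_docs * dim * np.dtype(dtype).itemsize / 1024**3
        sizes[(dim, np.dtype(dtype).name)] = gib
        print(f"{dim:>4} dims, {np.dtype(dtype).name}: {gib:.2f} GiB")

# Going from 1024-dim FP32 to 64-dim FP16 shrinks storage 32x
print(sizes[(1024, 'float32')] / sizes[(64, 'float16')])  # 32.0
```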

## Up next:

German_Semantic_V3_Instruct: guiding your embeddings towards self-selected aspects.

## Thank You and Credits

- To [Jina AI](https://huggingface.co/jinaai) for their BERT implementation that is used, especially ALiBi
- To [deepset](https://huggingface.co/deepset) for gbert-large, which is a really great model
- To [occiglot](https://huggingface.co/occiglot) and OSCAR for the data used to pre-train the model
- To [Tom](https://huggingface.co/tomaarsen), especially for sentence-transformers, and to [Björn and Jan from ellamind](https://ellamind.com/de/) for the consultation
- To [Meta](https://huggingface.co/facebook) for XNLI, which is used in variations

Idea, training, and implementation by Aaron Chibb.