kr-manish committed on
Commit
70b81cc
1 Parent(s): d9fd64b

Add new SentenceTransformer model.

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": false,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
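The pooling config above enables only `pooling_mode_cls_token`, so the sentence embedding is taken from the first token's vector rather than from an average over tokens. The sketch below illustrates that selection in plain Python and shows mean pooling for contrast. The function names and the toy 3-dimensional vectors are assumptions made for illustration, not the library's implementation.

```python
# Illustrative sketch only: toy token vectors, not real model output.

def cls_pool(token_embeddings):
    """Return the vector at the first position (the [CLS] token)."""
    return list(token_embeddings[0])

def mean_pool(token_embeddings, attention_mask):
    """Average the vectors of non-padding tokens (shown for contrast)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

tokens = [
    [1.0, 0.0, 2.0],  # [CLS]
    [3.0, 4.0, 0.0],  # a word piece
    [0.0, 0.0, 0.0],  # padding
]
mask = [1, 1, 0]

cls_vec = cls_pool(tokens)
mean_vec = mean_pool(tokens, mask)
print(cls_vec)
print(mean_vec)
```

With CLS pooling the padding mask is not needed, because only position 0 is read.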
README.md ADDED
@@ -0,0 +1,805 @@
+ ---
+ base_model: BAAI/bge-base-en-v1.5
+ datasets: []
+ language: []
+ library_name: sentence-transformers
+ metrics:
+ - cosine_accuracy@1
+ - cosine_accuracy@3
+ - cosine_accuracy@5
+ - cosine_accuracy@10
+ - cosine_precision@1
+ - cosine_precision@3
+ - cosine_precision@5
+ - cosine_precision@10
+ - cosine_recall@1
+ - cosine_recall@3
+ - cosine_recall@5
+ - cosine_recall@10
+ - cosine_ndcg@10
+ - cosine_mrr@10
+ - cosine_map@100
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:111
+ - loss:MatryoshkaLoss
+ - loss:MultipleNegativesRankingLoss
+ widget:
+ - source_sentence: Template la - Spy cepA s3062 F30 Sequence ( 5' /3') Oligo [ l AGACTCCATATGGAGTCTAGCCAAACAG500
+     nM GAACA (SEQ ID NO, 1) In addition to containing the reagents necessary for driv­
+     ing the GAS NEAR assay, the lyophilized material also contains the lytic agent
+     for GAS, the protein plyC; therefore, 65 GAS lysis does not occur until the lyophilized
+     material is re-suspended. In some cases, the lyophilized material does not contain
+     a lytic agent for GAS, for example, in some
+   sentences:
+   - (45) Date of Patent
+   - http
+   - ID
+ - source_sentence: :-"<-------t 40000 -1-----/-f-~~-----I 35000 -----+-IN----------
+     § 30000 ----t+t---=~--- ~ 25000 ----~---++------t ~ 20000 -1----ff-r-ff.,.__----->t''n-\--------l
+   sentences:
+   - 45000 -------,-----=.....
+   - -~' ~-- -~<
+   - comprises
+ - source_sentence: 55 1. A composition comprising i) a forward template comprising
+     a nucleic acid sequence comprising a recognition region at the 3' end that is
+     complementary to the 3' end of the Streptococcus pyogenes (S. pyogenes) cell envelope
+     proteinase A 60 (cepA) gene antisense strand; a nicking enzyme bind­ ing site
+     and a nicking site upstream of said recognition region; and a stabilizing region
+     upstream of said nick­ ing site, the forward template comprising a nucleotide
+     sequence having at least 80, 85, or 95% identity to SEQ 65
+   sentences:
+   - ''' -- ,'' ,.,,,..,,,. _..,,,,.,,, .... ~-__ .... , , _,. ........-----.'
+   - What is claimed is
+   - annotated as follows
+ - source_sentence: 0 1 2 3 4 5 6 7 8 9 10 Time (minutes) FIG. 1 (Cont.)
+   sentences:
+   - ',-;.-'
+   - I I I I I I I I I
+   - (21) Appl. No.
+ - source_sentence: '~ " ''"-''-en 25000 1 ,.,,µ,· ,, · .,-,.. •~h • 1 (1) ,\ II J
+     } 7; . \ \(9,i, .,u, 4\:'
+   sentences:
+   - 80, 85, or 95% identity to SEQ ID NO
+   - u
+   - en 25000 I ' 'lJVL' • -. • . .,.. ""~" '' ' I Q) l!J "667 7 ..._7 ... -,
+ model-index:
+ - name: SentenceTransformer based on BAAI/bge-base-en-v1.5
+   results:
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: dim 768
+       type: dim_768
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.0
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.07692307692307693
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.07692307692307693
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.23076923076923078
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.0
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.02564102564102564
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.015384615384615385
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.02307692307692308
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.0
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.07692307692307693
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.07692307692307693
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.23076923076923078
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.10157463646252407
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.06227106227106227
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.08137504276350917
+       name: Cosine Map@100
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: dim 512
+       type: dim_512
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.0
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.07692307692307693
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.07692307692307693
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.23076923076923078
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.0
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.02564102564102564
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.015384615384615385
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.02307692307692308
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.0
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.07692307692307693
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.07692307692307693
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.23076923076923078
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.09595574046316672
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.05662393162393163
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.0744997471979569
+       name: Cosine Map@100
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: dim 256
+       type: dim_256
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.0
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.07692307692307693
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.07692307692307693
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.23076923076923078
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.0
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.02564102564102564
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.015384615384615385
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.02307692307692308
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.0
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.07692307692307693
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.07692307692307693
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.23076923076923078
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.0981693666921052
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.05897435897435897
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.08277736107354086
+       name: Cosine Map@100
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: dim 128
+       type: dim_128
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.07692307692307693
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.23076923076923078
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.23076923076923078
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.38461538461538464
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.07692307692307693
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.07692307692307693
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.04615384615384616
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.038461538461538464
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.07692307692307693
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.23076923076923078
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.23076923076923078
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.38461538461538464
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.21938110224036803
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.1700854700854701
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.1860790779646314
+       name: Cosine Map@100
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: dim 64
+       type: dim_64
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.0
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.07692307692307693
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.15384615384615385
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.3076923076923077
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.0
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.02564102564102564
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.03076923076923077
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.03076923076923077
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.0
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.07692307692307693
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.15384615384615385
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.3076923076923077
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.1299580480538269
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.07628205128205127
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.10015432076692518
+       name: Cosine Map@100
+ ---
+
+ # SentenceTransformer based on BAAI/bge-base-en-v1.5
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) <!-- at revision a5beb1e3e68b9ab74eb54cfd186867f64f240e1a -->
+ - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 768 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
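The final `Normalize()` module matters for the cosine similarity listed above: once every embedding has unit length, a plain dot product gives the same value as cosine similarity. The following is a small sketch of that equivalence with toy 2-dimensional vectors; the helper names are assumptions for illustration only.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

print(cosine(a, b))         # cosine similarity of the raw vectors
print(dot(a_unit, b_unit))  # dot product of the normalized vectors
```

Both printed values agree up to floating-point rounding, which is why normalized embeddings can be compared with a dot product in a vector index.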
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("kr-manish/bge-base-raw_pdf_finetuned_vf1")
+ # Run inference
+ sentences = [
+     '~ " \'"-\'-en 25000 1 ,.,,µ,· ,, · .,-,.. •~h • 1 (1) ,\\ II J } 7; . \\ \\(9,i, .,u, 4\\:',
+     'en 25000 I \' \'lJVL\' • -. • . .,.. ""~" \'\' \' I Q) l!J "667 7 ..._7 ... -,',
+     '80, 85, or 95% identity to SEQ ID NO',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 768]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
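Because the model card reports training with `MatryoshkaLoss` over the dimensions 768, 512, 256, 128, and 64, a prefix of each embedding can also serve as a smaller embedding. The sketch below shows the usual post-processing under that assumption: keep the first `dim` values, then rescale to unit length so that dot products remain cosine similarities. The toy 4-dimensional vectors stand in for real `model.encode(...)` output, and the helper names are hypothetical.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def truncate_and_renormalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    prefix = embedding[:dim]
    norm = math.sqrt(dot(prefix, prefix))
    return [x / norm for x in prefix]

# Toy unit-length vectors standing in for real embeddings.
emb_a = [0.5, 0.5, 0.5, 0.5]
emb_b = [0.5, 0.5, -0.5, -0.5]

small_a = truncate_and_renormalize(emb_a, 2)
small_b = truncate_and_renormalize(emb_b, 2)

print(dot(emb_a, emb_b))      # similarity using all 4 dimensions
print(dot(small_a, small_b))  # similarity using only the first 2 dimensions
```

The two similarities differ in this toy case, which is the trade-off to measure on real data before choosing a smaller dimension. Recent Sentence Transformers releases also accept a `truncate_dim` argument when loading a model; check the documentation for the installed version before relying on it.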
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ ## Evaluation
+
+ ### Metrics
+
+ #### Information Retrieval
+ * Dataset: `dim_768`
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.0        |
+ | cosine_accuracy@3   | 0.0769     |
+ | cosine_accuracy@5   | 0.0769     |
+ | cosine_accuracy@10  | 0.2308     |
+ | cosine_precision@1  | 0.0        |
+ | cosine_precision@3  | 0.0256     |
+ | cosine_precision@5  | 0.0154     |
+ | cosine_precision@10 | 0.0231     |
+ | cosine_recall@1     | 0.0        |
+ | cosine_recall@3     | 0.0769     |
+ | cosine_recall@5     | 0.0769     |
+ | cosine_recall@10    | 0.2308     |
+ | cosine_ndcg@10      | 0.1016     |
+ | cosine_mrr@10       | 0.0623     |
+ | **cosine_map@100**  | **0.0814** |
+
+ #### Information Retrieval
+ * Dataset: `dim_512`
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.0        |
+ | cosine_accuracy@3   | 0.0769     |
+ | cosine_accuracy@5   | 0.0769     |
+ | cosine_accuracy@10  | 0.2308     |
+ | cosine_precision@1  | 0.0        |
+ | cosine_precision@3  | 0.0256     |
+ | cosine_precision@5  | 0.0154     |
+ | cosine_precision@10 | 0.0231     |
+ | cosine_recall@1     | 0.0        |
+ | cosine_recall@3     | 0.0769     |
+ | cosine_recall@5     | 0.0769     |
+ | cosine_recall@10    | 0.2308     |
+ | cosine_ndcg@10      | 0.096      |
+ | cosine_mrr@10       | 0.0566     |
+ | **cosine_map@100**  | **0.0745** |
+
+ #### Information Retrieval
+ * Dataset: `dim_256`
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.0        |
+ | cosine_accuracy@3   | 0.0769     |
+ | cosine_accuracy@5   | 0.0769     |
+ | cosine_accuracy@10  | 0.2308     |
+ | cosine_precision@1  | 0.0        |
+ | cosine_precision@3  | 0.0256     |
+ | cosine_precision@5  | 0.0154     |
+ | cosine_precision@10 | 0.0231     |
+ | cosine_recall@1     | 0.0        |
+ | cosine_recall@3     | 0.0769     |
+ | cosine_recall@5     | 0.0769     |
+ | cosine_recall@10    | 0.2308     |
+ | cosine_ndcg@10      | 0.0982     |
+ | cosine_mrr@10       | 0.059      |
+ | **cosine_map@100**  | **0.0828** |
+
+ #### Information Retrieval
+ * Dataset: `dim_128`
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.0769     |
+ | cosine_accuracy@3   | 0.2308     |
+ | cosine_accuracy@5   | 0.2308     |
+ | cosine_accuracy@10  | 0.3846     |
+ | cosine_precision@1  | 0.0769     |
+ | cosine_precision@3  | 0.0769     |
+ | cosine_precision@5  | 0.0462     |
+ | cosine_precision@10 | 0.0385     |
+ | cosine_recall@1     | 0.0769     |
+ | cosine_recall@3     | 0.2308     |
+ | cosine_recall@5     | 0.2308     |
+ | cosine_recall@10    | 0.3846     |
+ | cosine_ndcg@10      | 0.2194     |
+ | cosine_mrr@10       | 0.1701     |
+ | **cosine_map@100**  | **0.1861** |
+
+ #### Information Retrieval
+ * Dataset: `dim_64`
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.0        |
+ | cosine_accuracy@3   | 0.0769     |
+ | cosine_accuracy@5   | 0.1538     |
+ | cosine_accuracy@10  | 0.3077     |
+ | cosine_precision@1  | 0.0        |
+ | cosine_precision@3  | 0.0256     |
+ | cosine_precision@5  | 0.0308     |
+ | cosine_precision@10 | 0.0308     |
+ | cosine_recall@1     | 0.0        |
+ | cosine_recall@3     | 0.0769     |
+ | cosine_recall@5     | 0.1538     |
+ | cosine_recall@10    | 0.3077     |
+ | cosine_ndcg@10      | 0.13       |
+ | cosine_mrr@10       | 0.0763     |
+ | **cosine_map@100**  | **0.1002** |
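The accuracy and recall values in these tables are all multiples of 1/13 (for example 0.0769 ≈ 1/13 and 0.2308 ≈ 3/13), which would be consistent with an evaluation set of 13 queries that each have a single relevant document. That set size is an inference, not something the model card states. Under that assumption, the sketch below computes the rank-based metrics from the position of the relevant document for each query. The `ranks` list is hypothetical: it is one set of ranks that happens to reproduce the `dim_128` accuracy, MRR, and NDCG figures, and the actual per-query ranks are not given in the source.

```python
import math

def accuracy_at_k(ranks, k):
    """Fraction of queries whose relevant document appears in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def precision_at_k(ranks, k):
    """With one relevant document per query, a hit contributes 1/k."""
    return accuracy_at_k(ranks, k) / k

def mrr_at_k(ranks, k):
    """Mean reciprocal rank, counting only hits within the top k."""
    return sum(1.0 / r for r in ranks if r is not None and r <= k) / len(ranks)

def ndcg_at_k(ranks, k):
    """With one relevant document the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    return sum(1.0 / math.log2(r + 1) for r in ranks if r is not None and r <= k) / len(ranks)

# Hypothetical 1-based ranks for 13 queries; None means "not retrieved".
ranks = [1, 2, 2, 9, 10] + [None] * 8

for k in (1, 3, 5, 10):
    print(k, round(accuracy_at_k(ranks, k), 4), round(precision_at_k(ranks, k), 4))
print(round(mrr_at_k(ranks, 10), 4), round(ndcg_at_k(ranks, 10), 4))
```

With a single relevant document per query, recall@k equals accuracy@k, which matches the identical accuracy and recall rows in every table above.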
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### Unnamed Dataset
+
+
+ * Size: 111 training samples
+ * Columns: <code>positive</code> and <code>anchor</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | positive                                                                            | anchor                                                                            |
+   |:--------|:------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
+   | type    | string                                                                              | string                                                                            |
+   | details | <ul><li>min: 2 tokens</li><li>mean: 124.53 tokens</li><li>max: 512 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 11.15 tokens</li><li>max: 60 tokens</li></ul> |
+ * Samples:
+   | positive | anchor |
+   |:---------|:------------------------------------------|
+   | <code>ply C Tris pH8.0 Dextran Trehalose dNTPS Na2SO4 Triton X-100 DTT TABLE 3 GAS Lyophilization Mix -Reagent Composition vl.0 v2.0 Strep A (Target) Lyo Conditions 500 nM F30 500 nM F30b.5om 100 nM R41m 100 nM R41m.lb.5om 200 nM MB4 FAM 200 nM MB4_ Fam 3.0. ug 5.0 ug 30U 0.7 ug 1 ug 1 ug 50mM 50 mM Dextran 150 Dextran 500 5% in 2x Iyo 5% in 2x Iyo 100 mM in 2x Iyo 100 mM in 2x Iyo 0.3 mM 0.3 mM 15 mM 22.5 mM 0.10% 0.10% 2mM 2mM Strep A (IC) Lyo Conditions</code> | <code>NE</code> |
+   | <code>CTGTTTG (SEQ ID NO, 5) To confirm that the targeted sequence was conserved among all GAS cepA sequences found in the public domain as well as unique to GAS, multiple sequence alignments and BLAST analyses were performed. Multiple alignment analysis of these sequences showed complete homology for the region of the gene targeted by the 3062 assay. Further, there are currently 24 complete GAS genomes (including whole genome shotgun sequence) available for sequence analysis in NCBI Genome. The cepA gene is present in all 24 genomes, and the 3062 target region within cepA is conserved among all 24 genomes. Upon BLAST analysis, it was confirmed that no other species contain significant homology to the 3062 target sequence. Assay Development As a reference, the reagent mixtures discussed below are</code> | <code>GCAATCTGAGGAGAGGCCATACTTGTTC</code> |
+   | <code>AGATTGC (SEQ ID NO, 4)</code> | <code>CAAACAGGAACAAGTATGGCCTCTCCTC</code> |
+ * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
+   ```json
+   {
+       "loss": "MultipleNegativesRankingLoss",
+       "matryoshka_dims": [
+           768,
+           512,
+           256,
+           128,
+           64
+       ],
+       "matryoshka_weights": [
+           1,
+           1,
+           1,
+           1,
+           1
+       ],
+       "n_dims_per_step": -1
+   }
+   ```
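The parameters above describe a weighted sum: the inner `MultipleNegativesRankingLoss` is evaluated once per entry in `matryoshka_dims` on embeddings truncated to that size, and the results are combined using `matryoshka_weights`. The sketch below shows that structure in plain Python. It is a simplified illustration and not the library code: the in-batch cross-entropy over scaled cosine similarities and the scale of 20 are assumptions about the inner loss, and the toy 4-dimensional vectors with dims `[4, 2]` stand in for the real 768 to 64 range.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch ranking loss: each anchor should score its own positive highest.

    The other positives in the batch act as negatives (assumed formulation).
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        log_sum_exp = math.log(sum(math.exp(s) for s in scores))
        total += log_sum_exp - scores[i]  # cross-entropy with target index i
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the inner loss over truncated embeddings."""
    loss = 0.0
    for dim, weight in zip(dims, weights):
        truncated_anchors = [vec[:dim] for vec in anchors]
        truncated_positives = [vec[:dim] for vec in positives]
        loss += weight * mnr_loss(truncated_anchors, truncated_positives)
    return loss

# Toy batch of two (anchor, positive) pairs.
anchors = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]
positives = [[0.9, 0.1, 0.4, 0.0], [0.1, 0.9, 0.0, 0.6]]

full_only = matryoshka_loss(anchors, positives, dims=[4], weights=[1])
nested = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
print(full_only, nested)
```

Because every truncated size contributes to the total, the leading dimensions are pushed to be useful on their own, which is what makes the shorter embeddings in the evaluation tables meaningful.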
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: epoch
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `gradient_accumulation_steps`: 32
+ - `num_train_epochs`: 15
+ - `lr_scheduler_type`: cosine
+ - `warmup_ratio`: 0.1
+ - `fp16`: True
+ - `load_best_model_at_end`: True
+ - `optim`: adamw_torch_fused
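These settings interact with the small dataset. With 111 samples and a per-device batch size of 16, one epoch has 7 batches, which is fewer than `gradient_accumulation_steps` (32). The arithmetic below is a rough sketch that assumes a single device and assumes that such an epoch still produces exactly one optimizer update; under those assumptions the run has 15 updates in total, which is in line with the last step number (15) in the training logs. The schedule function is a simplified reading of linear warmup followed by cosine decay, using the listed `learning_rate` of 5e-05, and is not taken from the trainer's source.

```python
import math

num_samples = 111
per_device_batch = 16
grad_accum = 32
epochs = 15
base_lr = 5e-5
warmup_ratio = 0.1

# dataloader_drop_last is False, so the last partial batch counts.
batches_per_epoch = math.ceil(num_samples / per_device_batch)
# Assumption: at least one optimizer update happens per epoch.
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_steps = updates_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

def lr_at(step):
    """Linear warmup to base_lr, then cosine decay to zero (simplified)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(batches_per_epoch, updates_per_epoch, total_steps, warmup_steps)
for step in range(total_steps + 1):
    print(step, lr_at(step))
```

One practical consequence is that each update averages gradients over the whole dataset, and the in-batch negatives for the ranking loss still come from batches of at most 16 pairs.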
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: epoch
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 32
+ - `eval_accumulation_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 15
+ - `max_steps`: -1
+ - `lr_scheduler_type`: cosine
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: True
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: True
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch_fused
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: False
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: proportional
+
+ </details>
+
+ ### Training Logs
+ | Epoch   | Step  | Training Loss | dim_128_cosine_map@100 | dim_256_cosine_map@100 | dim_512_cosine_map@100 | dim_64_cosine_map@100 | dim_768_cosine_map@100 |
+ |:-------:|:-----:|:-------------:|:----------------------:|:----------------------:|:----------------------:|:---------------------:|:----------------------:|
+ | 0       | 0     | -             | 0.0747                 | 0.0694                 | 0.0681                 | 0.1224                | 0.0705                 |
+ | 1.0     | 1     | -             | 0.0750                 | 0.0694                 | 0.0681                 | 0.1224                | 0.0705                 |
+ | 2.0     | 2     | -             | 0.1008                 | 0.0724                 | 0.0696                 | 0.0719                | 0.0710                 |
+ | **3.0** | **3** | **-**         | **0.1861**             | **0.0828**             | **0.0745**             | **0.1002**            | **0.0814**             |
+ | 4.0     | 4     | -             | 0.1711                 | 0.0968                 | 0.0825                 | 0.0861                | 0.1001                 |
+ | 5.0     | 6     | -             | 0.1505                 | 0.1140                 | 0.1094                 | 0.1534                | 0.1502                 |
+ | 6.0     | 7     | -             | 0.1222                 | 0.1143                 | 0.1108                 | 0.1528                | 0.1520                 |
+ | 7.0     | 8     | -             | 0.1589                 | 0.1536                 | 0.1512                 | 0.1513                | 0.1516                 |
+ | 8.0     | 9     | -             | 0.1561                 | 0.1550                 | 0.1531                 | 0.1495                | 0.1520                 |
+ | 9.0     | 10    | 1.8482        | 0.1565                 | 0.1558                 | 0.1544                 | 0.1483                | 0.1522                 |
+ | 10.0    | 12    | -             | 0.1562                 | 0.1551                 | 0.1557                 | 0.1416                | 0.1531                 |
+ | 11.0    | 13    | -             | 0.1561                 | 0.1558                 | 0.1562                 | 0.1401                | 0.1533                 |
+ | 12.0    | 14    | -             | 0.1559                 | 0.1559                 | 0.1562                 | 0.1402                | 0.1533                 |
+ | 13.0    | 15    | -             | 0.1861                 | 0.0828                 | 0.0745                 | 0.1002                | 0.0814                 |
+
+ * The bold row denotes the saved checkpoint.
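The model card does not say which metric `load_best_model_at_end` used to pick the saved checkpoint. One reading that is consistent with the bold row is that the `dim_128_cosine_map@100` column drove the choice, since epoch 3.0 holds the highest value in that column while the other columns peak later. The sketch below checks that reading against the logged values; treating `dim_128` as the selection metric is an assumption, not a documented setting.

```python
# (epoch, dim_128_cosine_map@100) pairs copied from the training log table.
rows = [
    (0.0, 0.0747), (1.0, 0.0750), (2.0, 0.1008), (3.0, 0.1861),
    (4.0, 0.1711), (5.0, 0.1505), (6.0, 0.1222), (7.0, 0.1589),
    (8.0, 0.1561), (9.0, 0.1565), (10.0, 0.1562), (11.0, 0.1561),
    (12.0, 0.1559), (13.0, 0.1861),
]

# max() returns the first row that reaches the highest value.
best_epoch, best_score = max(rows, key=lambda row: row[1])
print(best_epoch, best_score)
```

The final row repeats the epoch 3.0 values in every column, which would fit an evaluation run after the best checkpoint was reloaded, though the source does not confirm this.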
+
+ ### Framework Versions
+ - Python: 3.10.12
+ - Sentence Transformers: 3.0.1
+ - Transformers: 4.41.2
+ - PyTorch: 2.3.0+cu121
+ - Accelerate: 0.32.1
+ - Datasets: 2.20.0
+ - Tokenizers: 0.19.1
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MatryoshkaLoss
+ ```bibtex
+ @misc{kusupati2024matryoshka,
+     title={Matryoshka Representation Learning},
+     author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
+     year={2024},
+     eprint={2205.13147},
+     archivePrefix={arXiv},
+     primaryClass={cs.LG}
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "_name_or_path": "BAAI/bge-base-en-v1.5",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.41.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.0.1",
+     "transformers": "4.41.2",
+     "pytorch": "2.3.0+cu121"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": null
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4bc536359a75765632574b90995b4b8c4cdf8ee24181c503bd9933cabcce73b5
+ size 437951328
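
As a rough sanity check, the LFS pointer size can be compared with the parameter count implied by `config.json`. The sketch below is not taken from the commit: it assumes a standard BERT layout (embeddings, 12 encoder layers, and a pooler head), `float32` storage as stated by `torch_dtype`, and that the checkpoint includes the pooler weights. The helper name `bert_param_count` is hypothetical.

```python
# Values copied from config.json in this commit.
config = {
    "vocab_size": 30522,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "num_hidden_layers": 12,
}


def bert_param_count(cfg, with_pooler=True):
    """Count weights and biases of a standard BERT encoder (assumed layout)."""
    h = cfg["hidden_size"]
    i = cfg["intermediate_size"]
    layer_norm = 2 * h  # scale and bias

    # Word, position, and token-type embeddings, plus one LayerNorm.
    embeddings = (
        cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    ) * h + layer_norm

    # Query, key, value, and output projections, plus one LayerNorm.
    attention = 4 * (h * h + h) + layer_norm
    # Two feed-forward projections, plus one LayerNorm.
    feed_forward = (h * i + i) + (i * h + h) + layer_norm
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)

    # Dense pooler head; assumed to be present in the checkpoint.
    pooler = (h * h + h) if with_pooler else 0
    return embeddings + encoder + pooler


n_params = bert_param_count(config)
payload_bytes = n_params * 4          # float32 uses 4 bytes per parameter
lfs_size = 437951328                  # "size" field of the LFS pointer
remainder = lfs_size - payload_bytes  # bytes not explained by the weights

print(n_params, payload_bytes, remainder)
```

Under these assumptions the weights account for all but a few kilobytes of the file. That remainder is consistent with the length prefix and JSON header that a safetensors file carries, although the commit itself does not confirm the breakdown.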
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
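
`modules.json` chains three stages: a Transformer, a Pooling layer (configured for CLS-token pooling in `1_Pooling/config.json`), and a Normalize layer. The following is a minimal pure-Python sketch of the last two stages only, using made-up two-dimensional token vectors in place of real 768-dimensional transformer outputs. The function names are illustrative and are not the sentence-transformers API.

```python
import math


def cls_pool(token_embeddings):
    """CLS pooling: keep only the first token's vector ([CLS] comes first)."""
    return token_embeddings[0]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def encode(token_embeddings):
    """Pooling followed by Normalize, mirroring stages 1 and 2 of modules.json."""
    return l2_normalize(cls_pool(token_embeddings))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy "transformer outputs": one row per token, the first row plays [CLS].
toy_a = [[3.0, 4.0], [9.0, 9.0]]
toy_b = [[4.0, 3.0], [-1.0, 5.0]]

emb_a = encode(toy_a)
emb_b = encode(toy_b)

# Because both embeddings have unit length, the dot product is the cosine similarity.
similarity = dot(emb_a, emb_b)
print(emb_a, emb_b, similarity)
```

Only the first row of each input affects the result, which is the practical meaning of `pooling_mode_cls_token: true`. The Normalize stage is why a plain dot product and cosine similarity give the same score for this model's outputs.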
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": true
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
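
The special-token IDs and `model_max_length` in `tokenizer_config.json` determine how a tokenized sentence is framed before it reaches the model. The sketch below illustrates only that framing step: it takes already-tokenized IDs (the values used here are arbitrary placeholders, not real vocabulary lookups), truncates them, wraps them in `[CLS]` and `[SEP]`, and pads with `[PAD]`. The real `BertTokenizer` also lowercases and applies WordPiece splitting, which is not reproduced here. `build_inputs` is a hypothetical helper name.

```python
# IDs taken from "added_tokens_decoder" in tokenizer_config.json.
PAD_ID = 0
CLS_ID = 101
SEP_ID = 102
MODEL_MAX_LENGTH = 512  # "model_max_length"


def build_inputs(token_ids, pad_to=None, max_length=MODEL_MAX_LENGTH):
    """Frame token IDs as [CLS] ... [SEP], truncating and optionally padding."""
    # Reserve two positions for [CLS] and [SEP].
    body = list(token_ids)[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)

    # Pad on the right; padded positions get attention mask 0.
    if pad_to is not None and pad_to > len(input_ids):
        padding = pad_to - len(input_ids)
        input_ids += [PAD_ID] * padding
        attention_mask += [0] * padding
    return input_ids, attention_mask


# Placeholder IDs standing in for three WordPiece tokens.
short_ids, short_mask = build_inputs([2000, 2001, 2002], pad_to=8)
print(short_ids)   # [101, 2000, 2001, 2002, 102, 0, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]

# An over-long input is cut so that the framed sequence fits in 512 positions.
long_ids, long_mask = build_inputs(range(1000, 1700))
print(len(long_ids), long_ids[0], long_ids[-1])  # 512 101 102
```

The 512-position limit matches both `max_seq_length` in `sentence_bert_config.json` and `max_position_embeddings` in `config.json`, so text beyond that length does not contribute to the embedding.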
vocab.txt ADDED
The diff for this file is too large to render. See raw diff